[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

WARNING: This project no longer works and is deprecated. Reason: Google has completely deprecated Ajax support. See also #42 (comment).

Download all messages from Google Group archive

google-group-crawler is a Bash 4 script that downloads all (original) messages from a Google group archive. Private groups require a cookie string or file. Groups with adult content are not supported yet.

Installation

The script requires bash-4, sort, curl, sed, awk.

Make the script executable with chmod 755 and put it in your PATH (e.g., /usr/local/bin/).

The script may not work in a Windows environment, as reported in #26.
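
A quick way to confirm the requirements above before running anything is a small check like the following. It is only a sketch: missing_deps and is_bash4 are hypothetical helper names and are not part of crawler.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of crawler.sh).

# Print the name of every listed command that is not found on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
missing_deps() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}

# Return 0 when the running shell is Bash 4 or newer.
is_bash4() {
  [ -n "${BASH_VERSINFO:-}" ] && [ "${BASH_VERSINFO[0]}" -ge 4 ]
}
```

For example, `missing_deps sort curl sed awk || echo "install the commands listed above" >&2` reports anything that still has to be installed.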

Usage

The first run

For a private group, prepare your cookies file first.

# export _CURL_OPTIONS="-v"       # extra curl options, e.g. to provide cookies
# export _HOOK_FILE="/some/path"  # provide a hook file, see "The hook"

# export _ORG="your.company"      # required, if you are using G Suite
export _GROUP="mygroup"           # specify your group
./crawler.sh -sh                  # first run for testing
./crawler.sh -sh > curl.sh        # save your script
bash curl.sh                      # downloading mbox files

You can execute the curl.sh script multiple times, because curl quickly skips any fully downloaded files.

Update your local archive from the RSS feed

After you have an archive from the first run, you only need to add the latest messages shown in the feed. You can do that with the -rss option and the additional _RSS_NUM environment variable:

export _RSS_NUM=50                # (optional. See Tips & Tricks.)
./crawler.sh -rss > update.sh     # using rss feed for updating
bash update.sh                    # download the latest posts

Run this regularly to keep your local archive up to date.
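
For regular updates (from cron, for instance) the two commands can be wrapped in one function. This is a minimal sketch, assuming crawler.sh behaves as described above; update_archive is a hypothetical name, and _GROUP is expected to be exported already.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of crawler.sh).
# $1: path to crawler.sh (default: ./crawler.sh)
# $2: number of feed items (default: $_RSS_NUM, or 50)
update_archive() {
  local crawler="${1:-./crawler.sh}"
  local num="${2:-${_RSS_NUM:-50}}"
  local script rc
  script="$(mktemp)" || return 1
  # Generate the update script first; stop if the crawler itself fails.
  if ! _RSS_NUM="$num" bash "$crawler" -rss > "$script"; then
    echo >&2 ":: crawler failed, nothing downloaded"
    rm -f "$script"
    return 1
  fi
  # Run the generated script, then clean up the temporary file.
  bash "$script"
  rc=$?
  rm -f "$script"
  return "$rc"
}
```

A call such as `update_archive ./crawler.sh 50` then replaces the two manual steps.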

Private group or group hosted by an organization

To download messages from a private group or a group hosted by your organization, you need to provide some cookie information to the script. In the past, the script used wget and the Netscape cookie file format; now it uses curl with a cookie string and a configuration file.

  1. Open Firefox, press F12 to open the developer tools, and select the Network tab. (Your favorite browser probably has a similar feature.)

  2. Log in to your testing Google account and open your group. For example https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public (replace google-group-crawler-public with your group name). Make sure you can read some content at your own group URI.

  3. In the Network tab, select the request for that address and choose Copy -> Copy Request Headers. The result contains a lot of things; paste it into your text editor and keep only the Cookie part.

  4. Now prepare a file curl-options.txt as below

     user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
     header = "Cookie: <snip>"
    

    Of course, replace the <snip> part with your own cookie strings. See man curl for more details of the file format.

  5. Point _CURL_OPTIONS at that file:

     export _CURL_OPTIONS="-K /path/to/curl-options.txt"
    

    Now every hidden group can be downloaded :)
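
Steps 4 and 5 can also be scripted. The sketch below writes the options file from a cookie string that you have already copied from the browser. write_curl_options is a hypothetical helper, and the quote escaping assumes the backslash-escape rules that man curl documents for quoted config values.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of crawler.sh).
# $1: cookie string (the text after "Cookie: " in the request headers)
# $2: output path, e.g. curl-options.txt
# $3: optional user-agent string
write_curl_options() {
  local cookie="$1" out="$2"
  local agent="${3:-Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0}"
  # The file holds session cookies, so restrict it to the current user.
  : > "$out" || return 1
  chmod 600 "$out" || return 1
  # Escape backslashes and double quotes for a quoted curl config value.
  cookie="$(printf '%s' "$cookie" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf 'user-agent = "%s"\n' "$agent" >> "$out"
  printf 'header = "Cookie: %s"\n' "$cookie" >> "$out"
}
```

After `write_curl_options "$cookie" /path/to/curl-options.txt`, the export from step 5 works unchanged.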

The hook

If you want to execute a hook command after an mbox file is downloaded, proceed as follows.

  1. Prepare a Bash script file that defines a __curl_hook function. Its first argument is the output filename and its second argument is the URL. For example, here is a simple hook

     # $1: output file
     # $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
     __curl_hook() {
       if [[ "$(stat -c %b "$1")" == 0 ]]; then
         echo >&2 ":: Warning: empty output '$1'"
       fi
     }
    

    In this example, the hook will check if the output file is empty, and send a warning to the standard error device.

  2. Set the environment variable _HOOK_FILE to the path of your file. For example,

     export _GROUP=archlinuxvn
     export _HOOK_FILE=$HOME/bin/curl.hook.sh
    

    The hook file will now be loaded in any script generated afterwards by crawler.sh -sh or crawler.sh -rss.
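
As a variation on the hook above, the following sketch also records the URL of every empty download so it can be retried later. It uses the shell's own -s test instead of stat. _FAILED_LIST is a hypothetical variable introduced here for illustration; crawler.sh knows nothing about it.

```shell
#!/usr/bin/env bash
# $1: output file
# $2: url
# _FAILED_LIST (hypothetical): file that collects the URLs of empty downloads
__curl_hook() {
  local failed="${_FAILED_LIST:-failed-urls.txt}"
  # -s is true only when the file exists and is not empty.
  if [[ ! -s "$1" ]]; then
    echo >&2 ":: Warning: empty output '$1'"
    echo "$2" >> "$failed"
  fi
}
```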

What to do with your local archive

The downloaded messages are found under $_GROUP/mbox/*.

They are in RFC 822 format (possibly with obfuscated email addresses) and can easily be converted to mbox format before being imported into your email client (Thunderbird, claws-mail, etc.).

You can also use the mhonarc utility to convert the downloaded messages to HTML files.
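
One possible conversion to mbox is sketched below. It assumes each downloaded file holds exactly one message, writes a placeholder separator line instead of parsing the real sender and date, and quotes lines that start with "From " so they are not mistaken for separators. messages_to_mbox is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of crawler.sh).
# $1: output mbox file; remaining arguments: RFC 822 message files
messages_to_mbox() {
  local out="$1" f
  shift
  : > "$out" || return 1
  for f in "$@"; do
    {
      # Placeholder separator; the real sender and date are not parsed here.
      echo "From MAILER-DAEMON Thu Jan  1 00:00:00 1970"
      # Quote lines starting with "From " so they do not look like separators.
      sed -e 's/^From />From /' "$f"
      # A blank line ends each message.
      echo
    } >> "$out"
  done
}
```

For example: `messages_to_mbox archive.mbox "$_GROUP"/mbox/m.*`.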

Rescan the whole local archive

Sometimes you may need to rescan / redownload all messages. This can be done by removing all temporary files:

rm -fv $_GROUP/threads/t.*    # this is a must
rm -fv $_GROUP/msgs/m.*       # see also Tips & Tricks

Alternatively, use the _FORCE option:

_FORCE="true" ./crawler.sh -sh

Another option is to delete all files under the $_GROUP/ directory. As usual, remember to back up before you delete anything.
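
The backup and the cleanup can be combined so that one never happens without the other. This is a sketch only: backup_and_reset is a hypothetical helper, and the backup file name is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of crawler.sh).
# $1: group directory (the value of _GROUP)
# Prints the name of the backup archive on success.
backup_and_reset() {
  local group="$1" backup
  [ -d "$group" ] || { echo >&2 ":: no such directory: $group"; return 1; }
  backup="${group}-backup-$(date +%Y%m%d%H%M%S).tar.gz"
  # Refuse to delete anything if the backup could not be written.
  tar -czf "$backup" "$group" || return 1
  # Remove only the temporary files; downloaded mbox files stay in place.
  rm -f "$group"/threads/t.* "$group"/msgs/m.*
  echo "$backup"
}
```

Running `backup_and_reset "$_GROUP"` before `./crawler.sh -sh` forces a full rescan while keeping a copy of the previous state.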

Known problems

  1. Fails on groups with adult content (#14).
  2. This script may not recover emails from public groups. When you use valid cookies, you may see the original emails if you are a manager of the group. See also #16.
  3. When cookies are used, the original emails may be recovered, and you must filter them before making your archive public.
  4. The script can't fetch from a group whose name contains some special characters (e.g., +). See also #30.
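
For problem 3, one possible filter is to mask the local part of every address before publishing. The sketch below is deliberately crude: redact_emails is a hypothetical name, the pattern is an approximation of an email address, and it also rewrites Message-ID style headers, which may affect threading in some tools.

```shell
#!/usr/bin/env bash
# Hypothetical filter (not part of crawler.sh).
# Reads a message on stdin and writes it to stdout with the local part
# of anything that looks like an email address replaced by "...".
redact_emails() {
  sed -E 's/[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+)/...@\1/g'
}
```

For example, `redact_emails < "$file" > "$file.public"` produces a copy that is safer to share.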

Contributions

  1. parallel support: @Pikrass has a script to download messages in parallel. It's discussed in ticket #32. The script: https://gist.github.com/Pikrass/f8462ff8a9af18f97f08d2a90533af31
  2. raw access denied: @alexivkin mentioned he could use the print function to work around the issue. See #29 (comment).

License

This work is released under the terms of the MIT license.

Author

This script is written by Anh K. Huynh.

He wrote this script because he couldn't resolve the problem by using nodejs, phantomjs, Watir.

New web technology just makes life harder, doesn't it?

For script hackers

Please skip this section unless you really know how to work with Bash and shells.

  1. If you clean your files (as below), you may notice that re-downloading all files is very slow. Consider using the -rss option instead. This option fetches data from an RSS link.

    It's recommended to use the -rss option for daily updates. By default, the number of items is 50. You can change it with the _RSS_NUM variable. However, don't use a very big number, because Google will ignore it.

  2. Because Topics is a FIFO list, you only need to remove the last file. The script will re-download the last item, and if there is a new page, that page will be fetched.

     # Print the highest-numbered file of each m.* series
     ls "$_GROUP"/msgs/m.* \
     | sed -e 's#\.[0-9]\+$##g' \
     | sort -u \
     | while read -r f; do
         last_item="$f.$( \
           ls "$f".* \
           | sed -e 's#^.*\.\([0-9]\+\)#\1#g' \
           | sort -n \
           | tail -1 \
         )";
         echo "$last_item";
       done
    
  3. The list of threads is a LIFO list. If you want to rescan your list, you will need to delete all files under $_D_OUTPUT/threads/.

  4. You can set the time for mbox output files, as below

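     ls "$_GROUP"/mbox/m.* \
     | while read -r FILE; do
         # Take the first Date: header of the message
         date="$( \
           grep ^Date: "$FILE" \
           | head -1 \
           | sed -e 's#^Date: ##g' \
         )";
         # touch -d with a free-form date string is a GNU coreutils feature
         touch -d "$date" "$FILE";
       done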
    

    This will be very useful, for example, when you want to use the mbox files with mhonarc.
