Wayback Google Analytics

A lightweight tool to gather current and historic Google analytics codes for OSINT investigations.

· Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Installation
  3. Usage
  4. Contributing
  5. Development
  6. License
  7. Contact
  8. Acknowledgments

About The Project

Wayback Google Analytics is a lightweight tool that gathers current and historic Google Analytics data (UA, GA and GTM codes) from a collection of website URLs.

Read Bellingcat's article about using this tool to uncover disinformation networks online here.

Why do I need GA codes?

Google Analytics codes are a useful data point when examining relationships between websites. If two seemingly disparate websites share the same UA, GA or GTM code, then there is a good chance that they are managed by the same individual or group. Researchers and journalists have used this breadcrumb in OSINT investigations regularly over the last decade, but a recent change in how Google handles its analytics codes threatens to limit its effectiveness. Google began phasing out UA codes as part of its Google Analytics 4 upgrade in July 2023, making it significantly more challenging to use this breadcrumb during investigations.

How does this tool help me?

Luckily, the Internet Archive's Wayback Machine contains useful snapshots of websites containing their historic GA IDs. While you could feasibly check each snapshot manually, this tool automates that process with the Wayback Machine's CDX API. Enter a list of URLs and a time frame (along with extra, optional parameters) to collect current and historic GA, UA and GTM codes and return them in a format of your choice (json, txt, xlsx or csv).
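
You can also query the CDX API directly. Below is a minimal sketch of the kind of query this tool automates, written with the requests library for brevity (the tool itself is asynchronous); the list_snapshots helper and its default values are illustrative, not part of this package.

    # Minimal sketch of a Wayback Machine CDX query, assuming the `requests`
    # library is installed. Illustrative only; not this tool's internals.
    import requests

    def list_snapshots(url, from_date="20121001", to_date="20131001", limit=10):
        """Return (timestamp, snapshot_url) pairs for archived copies of url."""
        resp = requests.get(
            "https://web.archive.org/cdx/search/cdx",
            params={
                "url": url,
                "output": "json",   # rows of [urlkey, timestamp, original, ...]
                "from": from_date,  # CDX dates use YYYYMMDD[HHMMSS]
                "to": to_date,
                "limit": limit,
            },
            timeout=30,
        )
        resp.raise_for_status()
        rows = resp.json()
        if not rows:
            return []
        header, data = rows[0], rows[1:]
        ts, orig = header.index("timestamp"), header.index("original")
        return [
            (row[ts], f"https://web.archive.org/web/{row[ts]}/{row[orig]}")
            for row in data
        ]

    print(list_snapshots("someurl.com"))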

The raw JSON output for each provided URL looks something like this:

        "someurl.com": {
            "current_UA_code": "UA-12345678-1",
            "current_GA_code": "G-1234567890",
            "current_GTM_code": "GTM-12345678",
            "archived_UA_codes": {
                "UA-12345678-1": {
                    "first_seen": "01/01/2019(12:30)",
                    "last_seen": "03/10/2020(00:00)",
                },
            },
            "archived_GA_codes": {
                "G-1234567890": {
                    "first_seen": "01/01/2019(12:30)",
                    "last_seen": "01/01/2019(12:30)",
                }
            },
            "archived_GTM_codes": {
                "GTM-12345678": {
                    "first_seen": "01/01/2019(12:30)",
                    "last_seen": "01/01/2019(12:30)",
                },
        },
    }
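
Because codes shared across sites are the investigative signal, a useful next step is to invert that mapping. Here is a minimal sketch, assuming the raw output above has been saved to a file named results.json (the filename is hypothetical):

    # Invert the output (url -> codes) into (code -> urls) so that codes
    # shared across sites stand out. Assumes results.json holds the raw
    # output in the shape shown above.
    import json
    from collections import defaultdict

    with open("results.json") as f:
        results = json.load(f)

    sites_by_code = defaultdict(set)
    for url, data in results.items():
        for key in ("current_UA_code", "current_GA_code", "current_GTM_code"):
            if data.get(key):
                sites_by_code[data[key]].add(url)
        for key in ("archived_UA_codes", "archived_GA_codes", "archived_GTM_codes"):
            for code in data.get(key, {}):
                sites_by_code[code].add(url)

    # Codes seen on more than one site suggest common ownership or management.
    for code, urls in sites_by_code.items():
        if len(urls) > 1:
            print(code, "->", sorted(urls))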

Further reading

(back to top)

Built With

Python, Pandas

Additional libraries/tools: BeautifulSoup4, asyncio, aiohttp

(back to top)

Installation

Install from PyPI (with pip)

The easiest way to install Wayback Google Analytics is from the command line with pip.

  1. Open a terminal window and navigate to your chosen directory.
  2. Create a virtual environment and activate it (optional, but recommended; if you use Poetry or pipenv, those package managers handle this for you)
    python3 -m venv venv
    source venv/bin/activate
    
  3. Install the project with pip
    pip install wayback-google-analytics
    
  4. Get a high-level overview
    wayback-google-analytics -h
    

Download from source

You can also clone the repo from GitHub and use the tool locally.

  1. Clone repo:

    git clone git@github.com:bellingcat/wayback-google-analytics.git
    
  2. Navigate to the project root, create a venv and install requirements.txt:

    cd wayback-google-analytics
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  3. Get a high-level overview:

    python -m wayback_google_analytics.main -h
    

(back to top)

Usage

Getting started

  1. Enter a list of urls manually through the command line using --urls (-u) or from a given file using --input_file (-i).

  2. Specify your output format (.csv, .txt, .xlsx or .json) using --output (-o).

  3. Add any of the following options:

Options list (run wayback-google-analytics -h to see in terminal):

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        Enter a file path to a list of urls in a readable file type
                        (e.g. .txt, .csv, .md)
  -u URLS [URLS ...], --urls URLS [URLS ...]
                        Enter a list of urls separated by spaces to get their UA/GA
                        codes (e.g. --urls https://www.google.com
                        https://www.facebook.com)
  -o {csv,txt,json,xlsx}, --output {csv,txt,json,xlsx}
                        Enter an output type to write results to file. Defaults to
                        json.
  -s START_DATE, --start_date START_DATE
                        Start date for time range (dd/mm/YYYY:HH:MM) Defaults to
                        01/10/2012:00:00, when UA codes were adopted.
  -e END_DATE, --end_date END_DATE
                        End date for time range (dd/mm/YYYY:HH:MM). Defaults to None.
  -f {yearly,monthly,daily,hourly}, --frequency {yearly,monthly,daily,hourly}
                        Can limit snapshots to remove duplicates (1 per hr, day, month,
                        etc). Defaults to None.
  -l LIMIT, --limit LIMIT
                        Limits number of snapshots returned. Defaults to -100 (most
                        recent 100 snapshots).
  -sc, --skip_current   Add this flag to skip current UA/GA codes when getting archived
                        codes.

Examples:

To get current codes for two websites and archived codes between Oct 1, 2012 and Oct 25, 2012, checking snapshots hourly:

    wayback-google-analytics --urls https://someurl.com https://otherurl.org --output json --start_date 01/10/2012 --end_date 25/10/2012 --frequency hourly

To get current codes for a list of websites (from a file) from January 1, 2012 to the present day, checking for snapshots monthly and returning the results as an Excel spreadsheet:

    wayback-google-analytics --input_file path/to/file.txt --output xlsx --start_date 01/01/2012 --frequency monthly

To check a single website for its current codes plus codes from the last 2,000 archive.org snapshots:

    wayback-google-analytics --urls https://someurl.com --limit -2000
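
If you use --input_file, a plain .txt with one URL per line is the simplest layout (this layout is an assumption for illustration; the help text above also lists .csv and .md as readable):

    https://someurl.com
    https://otherurl.org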

Output files & spreadsheets

Wayback Google Analytics allows you to export your findings to either .csv or .xlsx spreadsheets. When you save your findings as a spreadsheet, the tool generates two tables: one where each URL is the primary index and another where each identified code is the primary index. In an .xlsx file this is one workbook with two sheets, while the .csv option generates one file sorted by codes and another sorted by websites. All output files can be found in /output, which is created in the directory from which the code is executed.
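
For downstream analysis, the generated workbook can be loaded with pandas (which the tool is built with). A minimal sketch; the file name below is a hypothetical example, so check what actually lands in /output:

    # Load every sheet of the generated workbook into DataFrames with pandas.
    # The path is an assumed example; use the actual file written to ./output.
    import pandas as pd

    sheets = pd.read_excel("output/results.xlsx", sheet_name=None)  # dict of DataFrames
    for name, df in sheets.items():
        print(name, df.shape)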

Example spreadsheet

Let's say we're looking into data from 4 websites from 2015 until the present and we want to save what we find in an Excel spreadsheet. Our start command looks something like this:

wayback-google-analytics -u https://yapatriot.ru https://zanogu.com https://whoswho.com.ua https://adamants.ru -s 01/01/2015 -f yearly -o xlsx

The result is a single .xlsx file with two sheets.

Ordered by website:

Ordered by code:

(back to top)

Limitations & Rate Limits

We recommend that you limit your list of URLs to ~10 and your max snapshot limit to <500 per query. While Wayback Google Analytics doesn't hardcode a limit on how many URLs or snapshots you can request, large queries can cause 443 errors (rate limiting). Being rate limited can result in a temporary 5-10 minute ban from web.archive.org and the CDX API.

The app currently uses asyncio.Semaphore() along with delays between requests, but large queries or long-running operations can still result in a 443. Use your judgment and break large queries into smaller, more manageable pieces if you find yourself getting rate limited.
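
For reference, the throttling pattern described above looks roughly like this; the semaphore size and delay are assumed values for illustration, not the tool's actual settings:

    # Cap concurrency with asyncio.Semaphore and space requests out with a
    # short sleep. Illustrative values; not the tool's actual configuration.
    import asyncio
    import aiohttp

    SEM = asyncio.Semaphore(5)  # at most 5 requests in flight (assumed value)

    async def fetch(session, url):
        async with SEM:
            await asyncio.sleep(1)  # polite delay between requests (assumed)
            async with session.get(url) as resp:
                return await resp.text()

    async def main(urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    # asyncio.run(main(["https://someurl.com"]))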

(back to top)

Contributing

Bugs and feature requests

Please feel free to open an issue should you encounter any bugs or have suggestions for new features or improvements. You can also reach out to me directly with suggestions or thoughts.

(back to top)

Development

Testing

  • Run tests with python -m unittest discover
  • Check coverage with coverage run -m unittest

Using Poetry for Development

Wayback Google Analytics uses Poetry, a Python dependency management and packaging tool. A GitHub workflow runs the tests on PRs and pushes to main (see our workflow here). Be sure to update the semantic version number in pyproject.toml when opening a PR.

If you have push access, follow these steps to trigger the GitHub workflow that will build and release a new version to PyPI:

  1. Change the version number in pyproject.toml
  2. Create a new tag for that version: git tag "vX.0.0"
  3. Push the tag: git push --tags

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

You can contact me through email or social media.

Project Link: https://github.com/bellingcat/wayback-google-analytics

(back to top)

Acknowledgments

  • Bellingcat for hosting this project
  • Miguel Ramalho for constant support, thoughtful code reviews and suggesting the original idea for this project

(back to top)
