
Auto Archiver


Read the article about Auto Archiver on bellingcat.com.

Python tool to automatically archive social media posts, videos, and images from Google Sheets, the console, and more. It uses different archivers depending on the platform, and can save content to local storage, an S3 bucket (Digital Ocean Spaces, AWS, ...), or Google Drive. If Google Sheets is used as the source for links, the sheet will be updated with information about the archived content. It can be run manually or on an automated basis.

There are 3 ways to use the auto-archiver:

  1. (easiest installation) via docker
  2. (local python install) pip install auto-archiver
  3. (legacy/development) clone and manually install from repo (see legacy tutorial video)

But you always need a configuration/orchestration file, which is where you configure where/what/how to archive. Make sure you read the Orchestration section below.

How to install and run the auto-archiver

Option 1 - docker


Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, when you need to pass it your orchestration file or get downloaded media out of docker, you will need to connect folders on your machine to folders inside docker with the -v volume flag.

  1. install docker
  2. pull the auto-archiver docker image with docker pull bellingcat/auto-archiver
  3. run the docker image locally in a container: docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml. Breaking this command down (the full command is written out again after this list):
    1. docker run tells docker to start a new container (an instance of the image)
    2. --rm makes sure this container is removed after execution (less garbage locally)
    3. -v $PWD/secrets:/app/secrets - your secrets folder
      1. -v is a volume flag which means a folder that you have on your computer will be connected to a folder inside the docker container
      2. $PWD/secrets points to a secrets/ folder in your current working directory (where your console points to); as a best practice, we use this folder to hold all the secrets/tokens/passwords/... you use
      3. /app/secrets is the path inside the docker container where that folder will be mounted
    4. -v $PWD/local_archive:/app/local_archive - (optional) if you use local_storage
      1. -v same as above, this is a volume instruction
      2. $PWD/local_archive is a local_archive/ folder in your current working directory, in case you want to archive locally and have the files accessible outside docker
      3. /app/local_archive is the folder inside docker that you can reference in your orchestration.yaml file
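
Putting it together, the full command from step 3 can be written across multiple lines for readability (the same command, just reformatted):

docker run --rm \
  -v $PWD/secrets:/app/secrets \
  -v $PWD/local_archive:/app/local_archive \
  bellingcat/auto-archiver --config secrets/orchestration.yaml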

Option 2 - python package

Python package instructions
  1. make sure you have python 3.8 or higher installed
  2. install the package with pip (or pipenv/conda): pip install auto-archiver
  3. test it's installed with auto-archiver --help
  4. run it with your orchestration file, passing any flags you want on the command line: auto-archiver --config secrets/orchestration.yaml (assuming your orchestration file is inside a secrets/ folder, which we advise)

You will also need ffmpeg, firefox and geckodriver, and optionally fonts-noto, just as with the local installation below.

Option 3 - local installation

This can also be used for development.

Legacy instructions; only use these if docker/package installation is not an option.

Install the following locally:

  1. ffmpeg must be installed locally for this tool to work.
  2. firefox and geckodriver on a path folder like /usr/local/bin.
  3. (optional) fonts-noto to deal with multiple unicode characters during selenium/geckodriver's screenshots: sudo apt install fonts-noto -y.
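
On a Debian/Ubuntu-like system, installing these prerequisites might look like the sketch below; the package names are assumptions for that platform, and geckodriver is usually downloaded as a release binary from Mozilla rather than installed via apt:

sudo apt install -y ffmpeg firefox fonts-noto
# download a geckodriver release, extract it, and put it on a PATH folder, e.g.:
mv geckodriver /usr/local/bin/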

Clone and run:

  1. git clone https://github.com/bellingcat/auto-archiver
  2. pipenv install
  3. pipenv run python -m src.auto_archiver --config secrets/orchestration.yaml

Orchestration

The archiver's work is orchestrated by the following workflow (we call each part a step):

  1. Feeder gets the links (from a spreadsheet, from the console, ...)
  2. Archiver tries to archive the link (twitter, youtube, ...)
  3. Enricher adds more info to the content (hashes, thumbnails, ...)
  4. Formatter creates a report from all the archived content (HTML, PDF, ...)
  5. Database knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)

To set up an auto-archiver instance, create an orchestration.yaml containing the workflow you would like. We advise you to put this file in a secrets/ folder and not share it with others, because it will contain passwords and other secrets.

The orchestration file is split into 2 parts: steps (what steps to use) and configurations (how those steps should behave). Here's a simplified example:

# orchestration.yaml content
steps:
  feeder: gsheet_feeder
  archivers: # order matters
    - youtubedl_archiver
  enrichers:
    - thumbnail_enricher
  formatter: html_formatter
  storages:
    - local_storage
  databases:
    - gsheet_db

configurations:
  gsheet_feeder:
    sheet: "your google sheet name"
    header: 2 # row with header for your sheet
  # ... configurations for the other steps here ...

To see all the available steps (archivers, storages, databases, ...), check the example.orchestration.yaml.

All the configurations in the orchestration.yaml file (you can name it differently, but you then need to pass it via the --config FILENAME argument) can be seen in the console by using the --help flag. They can also be overwritten on the command line; for example, if you are using the cli_feeder to archive from the command line and want to provide the URLs, you would do:

auto-archiver --config secrets/orchestration.yaml --cli_feeder.urls="url1,url2,url3"
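
To see which options can be overridden like this, you can print the help for your configured steps (assuming, as described above, that --help can be combined with --config):

auto-archiver --config secrets/orchestration.yaml --help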

Here's the complete workflow that the auto-archiver goes through, in Mermaid notation:

graph TD
    s((start)) --> F(fa:fa-table Feeder)
    F -->|get and clean URL| D1{fa:fa-database Database}
    D1 -->|is already archived| e((end))
    D1 -->|not yet archived| a(fa:fa-download Archivers)
    a -->|got media| E(fa:fa-chart-line Enrichers)
    E --> S[fa:fa-box-archive Storages]
    E --> Fo(fa:fa-code Formatter)
    Fo --> S
    Fo -->|update database| D2(fa:fa-database Database)
    D2 --> e

Orchestration checklist

Use this checklist to make sure you have completed all the required steps:

  • you have a /secrets folder with all your configuration files, including:
    • an orchestration file, e.g. orchestration.yaml, pointing to the correct location of other files
    • (optional, if you use Google Sheets) a service_account.json (see how-to)
    • (optional, for telegram) an anon.session file, which appears after the 1st run, when you log in to telegram
      • if you use private channels you need to add channel_invites and set join_channels=true at least once
    • (optional, for VK) a vk_config.v2.json
    • (optional, for using Google Drive storage) a gd-token.json (see the help script)
    • (optional, for instagram) an instaloader.session file, which appears after the 1st run, when you log in to instagram
    • (optional, for browsertrix) a profile.tar.gz file
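
For reference, assuming you use all the optional integrations, the secrets/ folder from the checklist above would contain (filenames taken directly from the checklist):

secrets/
├── orchestration.yaml
├── service_account.json
├── anon.session
├── vk_config.v2.json
├── gd-token.json
├── instaloader.session
└── profile.tar.gz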

Example invocations

The recommended way to run the auto-archiver is through Docker. The invocations below run the auto-archiver Docker image using a configuration file that you have specified:

# all the configurations come from ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml
# uses the same configurations but for another google docs sheet 
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1

The auto-archiver can also be run locally, if the prerequisites are correctly configured. Equivalent invocations are below:

# all the configurations come from ./secrets/orchestration.yaml
auto-archiver --config secrets/orchestration.yaml
# uses the same configurations but for another google docs sheet 
# with a header on row 2 and with some different column names
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1

Extra notes on configuration

Google Drive

To use Google Drive storage you need the ID of the shared folder in the config.yaml file; the folder must be shared with the service account, eg [email protected], and then you can use --storage=gd.

Telethon + Instagram with telegram bot

The first time you run it, you will be prompted to authenticate with the associated phone number; alternatively, you can put your anon.session file in the root.
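
For the private-channels case mentioned in the orchestration checklist, the configuration might be sketched as below; the step name telethon_archiver and the exact key layout are assumptions, while channel_invites and join_channels=true come from the checklist:

configurations:
  telethon_archiver: # assumption: the name of the telegram archiver step
    join_channels: true # per the checklist, needed at least once for private channels
    channel_invites: # assumption: a list of invite links to your private channels
      - "https://t.me/+your-invite-code"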

Running on Google Sheets Feeder (gsheet_feeder)

The --gsheet_feeder.sheet property is the name of the Google Sheet to check for URLs. This sheet must have been shared with the Google Service account used by gspread. It must also have specific columns (case-insensitive) in the header, as specified in Gsheet.configs. The default names of these columns and their purposes are:

Inputs:

  • Link (required): the URL of the post to archive
  • Destination folder: custom folder for archived file (regardless of storage)

Outputs:

  • Archive status (required): Status of archive operation
  • Archive location: URL of archived post
  • Archive date: Date archived
  • Thumbnail: Embeds a thumbnail for the post in the spreadsheet
  • Timestamp: Timestamp of original post
  • Title: Post title
  • Text: Post text
  • Screenshot: Link to screenshot of post
  • Hash: Hash of archived HTML file (which contains hashes of post media) - for checksums/verification
  • Perceptual Hash: Perceptual hashes of found images - these can be used for de-duplication of content
  • WACZ: Link to a WACZ web archive of post
  • ReplayWebpage: Link to a ReplayWebpage viewer of the WACZ archive
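
If your sheet uses different column names, you can remap them in the configuration instead of renaming the sheet. A minimal sketch using the internal url key from the CLI example above ('{"url": "link"}'); only the keys you provide are overridden, and any other internal key names would be assumptions:

configurations:
  gsheet_feeder:
    sheet: "your google sheet name"
    header: 1
    columns:
      url: "link" # map the internal url key to a sheet column named "link"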

For example, this is a spreadsheet configured with all of the columns for the auto archiver and a few URLs to archive. (Note that the column names are not case sensitive.)

[Screenshot: a Google Spreadsheet with column headers defined as above, and several YouTube and Twitter URLs in the "Link" column]

Now the auto archiver can be invoked, in this example with: docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --config secrets/orchestration-global.yaml --gsheet_feeder.sheet "Auto archive test 2023-2". Note that the sheet name has been overridden/specified in the command line invocation.

When the auto archiver starts running, it updates the "Archive status" column.

[Screenshot: the same spreadsheet; the auto archiver has added "archive in progress" to one of the status columns]

The links are downloaded and archived, and the spreadsheet is updated to the following:

[Screenshot: the spreadsheet with videos archived and metadata added per the description of the columns above]

Note that the first row is skipped, as it is assumed to be a header row (--gsheet_feeder.header=1; change this if you have more rows above the header). Rows with an empty URL column, or a non-empty archive status column, are also skipped. All sheets in the document will be checked.

The "archive location" link contains the path of the archived file, in local storage, S3, or in Google Drive.

[Screenshot: the archive result for a link in the demo sheet]


Development

Use python -m src.auto_archiver --config secrets/orchestration.yaml to run from the local development environment.

Docker development

Working with docker locally:

  • docker build . -t auto-archiver to build a local image
  • docker run --rm -v $PWD/secrets:/app/secrets auto-archiver --config secrets/orchestration.yaml
    • to use local archive, also create a volume -v for it by adding -v $PWD/local_archive:/app/local_archive
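
Putting the two steps together (with the optional local_archive volume):

docker build . -t auto-archiver
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive auto-archiver --config secrets/orchestration.yaml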

release to docker hub

  • docker image tag auto-archiver bellingcat/auto-archiver:latest
  • docker push bellingcat/auto-archiver

RELEASE

  • update version in version.py
  • run bash ./scripts/release.sh and confirm
  • package is automatically updated on PyPI
  • docker image is automatically pushed to Docker Hub
