
Use yt-dlp to download video and upload to the Internet Archive with metadata.

Tubeup - a multi-VOD service to Archive.org uploader

tubeup uses yt-dlp to download a YouTube video (or a video from any other provider supported by yt-dlp), and then uploads it with all metadata to the Internet Archive using the Python module internetarchive.

It was designed by the Bibliotheca Anonoma to archive single videos, playlists (see the warning below about uploading more than single videos), or accounts to the Internet Archive.

Prerequisites

We strongly recommend running this script on Linux or another POSIX system (such as macOS), preferably on a rented VPS rather than your personal machine or phone.

Recommended system specifications:

  • Linux VPS with Python 3.8 or higher and pip installed
  • 2GB of RAM and 100GB of storage, or much more for anything beyond mirroring single short videos. If your OS drive is too small, symlink it to something larger.
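
As a rough pre-flight check, the recommendations above can be sketched in Python. This is only an illustration: the function name meets_recommendation is hypothetical, "100GB" is read as 100 × 10^9 bytes (an assumption), and RAM is not checked because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)            # recommended Python version from the list above
MIN_FREE_BYTES = 100 * 10**9   # "100GB", read here as 10**9 bytes per GB (assumption)


def meets_recommendation(python_version, free_bytes):
    """Return True when the Python version and the free disk space are both sufficient."""
    return tuple(python_version[:2]) >= MIN_PYTHON and free_bytes >= MIN_FREE_BYTES


if __name__ == "__main__":
    # Free bytes on the drive that holds the current working directory.
    free = shutil.disk_usage(".").free
    ok = meets_recommendation(sys.version_info, free)
    print("meets recommendation" if ok else "below recommendation")
```

Run it from the directory where downloads will be stored, since the free-space figure is taken from the drive holding the current directory.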

Setup and Installation

  1. Install ffmpeg, pip3 (typically python3-pip), and git.
     To install ffmpeg on Ubuntu, enable the Universe repository.

     For Debian/Ubuntu:

       sudo apt install ffmpeg python3-pip git

  2. Use pip3 to install the required python3 packages. Python 3.7.13 or later is required (the latest Python is preferred).

       python3 -m pip install -U pip tubeup

  3. If you don't already have an Internet Archive account, register for one to give the script upload privileges.

  4. Configure internetarchive with your Internet Archive account.

       ia configure

     You will be prompted for the login credentials of the Internet Archive account you use.

     Once configured to upload, you're ready to go.

  5. Start archiving a video by running the script on a URL (or multiple URLs) supported by yt-dlp. For YouTube, this includes account URLs and playlist URLs.

       tubeup <url>

  6. Each archived video gets its own Archive.org item. Check out what you've uploaded at

       http://archive.org/details/@yourusername

Periodically, before running, upgrade tubeup and its dependencies:

   python3 -m pip install -U tubeup pip
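
The setup above relies on several command-line tools being on PATH: ffmpeg and git from the system packages, plus the tubeup and ia commands used in the steps above. A minimal sketch for checking them follows; the helper name missing_tools is hypothetical and the check only looks for the executables, it does not verify versions or credentials.

```python
import shutil

# Commands named in the setup steps above.
REQUIRED_TOOLS = ("ffmpeg", "git", "tubeup", "ia")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that `which` cannot find on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal use calls missing_tools() with no arguments.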

Docker

A Dockerized tubeup is available from etnguyen03/docker-tubeup, which provides its own instructions.

Windows Setup

  1. Install WSL2 and pick a distribution of your choice. Ubuntu is popular and well-supported.
  2. Use Windows Terminal by Microsoft to interact with the WSL2 instance.
  3. Fully update the Linux installation with your package manager of choice, for example: sudo apt update ; sudo apt upgrade
  4. Install pip (python3-pip) and ffmpeg.
  5. Install Tubeup and configure internetarchive for your Archive.org account by following the pip install and ia configure steps in the Linux guide above.
  6. Periodically update your Linux packages and pip packages.

Usage

Usage:
  tubeup <url>... [--username <user>] [--password <pass>]
                  [--metadata=<key:value>...]
                  [--cookies=<filename>]
                  [--proxy <prox>]
                  [--quiet] [--debug]
                  [--use-download-archive]
                  [--output <output>]
                  [--ignore-existing-item]
  tubeup -h | --help
  tubeup --version
Arguments:
  <url>                         yt-dlp compatible URL to download.
                                Check yt-dlp documentation for a list
                                of compatible websites.
  --metadata=<key:value>        Custom metadata to add to the archive.org
                                item.
Options:
  -h --help                    Show this screen.
  -p --proxy <prox>            Use a proxy while uploading.
  -u --username <user>         Provide a username, for sites like Nico Nico Douga.
  -p --password <pass>         Provide a password, for sites like Nico Nico Douga.
  -a --use-download-archive    Record the video url to the download archive.
                               This will download only videos not listed in
                               the archive file. Record the IDs of all
                               downloaded videos in it.
  -q --quiet                   Just print errors.
  -d --debug                   Print all logs to stdout.
  -o --output <output>         yt-dlp output template.
  -i --ignore-existing-item    Don't check if an item already exists on archive.org
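
When tubeup is driven from another script, the usage text above can be turned into an argument list. The following is a minimal sketch, not part of tubeup itself: build_tubeup_command and run_tubeup are hypothetical helper names, only flags listed in the usage text are emitted, and --username/--password are left out for brevity.

```python
import subprocess


def build_tubeup_command(urls, metadata=None, cookies=None, proxy=None,
                         output=None, quiet=False, debug=False,
                         use_download_archive=False,
                         ignore_existing_item=False):
    """Build an argv list for tubeup using only flags from the usage text."""
    if not urls:
        raise ValueError("tubeup needs at least one URL")
    command = ["tubeup", *urls]
    # Each metadata entry becomes its own --metadata=key:value argument.
    for key, value in (metadata or {}).items():
        command.append(f"--metadata={key}:{value}")
    if cookies:
        command.append(f"--cookies={cookies}")
    if proxy:
        command += ["--proxy", proxy]
    if output:
        command += ["--output", output]
    # Boolean switches are appended only when requested.
    switches = [
        (quiet, "--quiet"),
        (debug, "--debug"),
        (use_download_archive, "--use-download-archive"),
        (ignore_existing_item, "--ignore-existing-item"),
    ]
    command += [flag for enabled, flag in switches if enabled]
    return command


def run_tubeup(urls, **options):
    """Run tubeup; raises CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_tubeup_command(urls, **options), check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with URLs that contain characters such as & or ?.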

Metadata

You can specify custom metadata with the --metadata flag. For example, this script will upload your video to the Community Video collection by default. You can specify a different collection with the --metadata flag:

   tubeup --metadata=collection:opensource_audio <url>

Any arbitrary metadata can be added to the item, with a few exceptions. You can learn more in the archive.org metadata documentation.
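
To illustrate the key:value format, here is a minimal sketch of how such strings could be split into a dictionary. It is an assumption about the format, not Tubeup's actual parser; parse_metadata is a hypothetical name and the "note" key is made up for the example. Splitting on the first colon only keeps values that themselves contain colons, such as URLs, intact.

```python
def parse_metadata(pairs):
    """Turn ["key:value", ...] strings into a dict, splitting on the first colon only."""
    metadata = {}
    for pair in pairs:
        key, separator, value = pair.partition(":")
        if not separator or not key:
            raise ValueError(f"expected key:value, got {pair!r}")
        metadata[key] = value
    return metadata


if __name__ == "__main__":
    print(parse_metadata(["collection:opensource_audio",
                          "note:https://example.com/source"]))
```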

Collections

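Archive.org users can upload to four open collections:

  • Community Audio, where the identifier is opensource_audio.
  • Community Software, where the identifier is open_source_software.
  • Community Texts, where the identifier is opensource.
  • Community Video, where the identifier is opensource_movies.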

Take care when uploading entire channels. Read the Internet Archive's guide on creating collections, and contact the collections staff if you're uploading a channel or multiple channels on one subject (gaming or horticulture, for example). Internet Archive collections staff will either create a collection for you or merge any already-uploaded items, matched by YouTube uploader name, into a new collection.

Dumping entire channels into Community Video is abusive and may get your account locked. Talk to the Internet Archive admins before doing large uploads; it's better to ask for guidance or help first than to run afoul of the rules.

If you do not own a collection, you will need to be added as an admin for that collection before you can upload to it. Talk to the collection owner or staff if you need assistance with this.

Troubleshooting

  • Some videos are copyright blocked in certain countries. Use the --proxy option or a torrenting/privacy VPN to bypass this. Sweden and Germany are good countries to route through when bypassing geo-restrictions.
  • Upload taking forever? Getting S3 throttling on upload? Tubeup has been specifically tailored to wait the longest possible time before failing, and we've never seen an S3 outage that outlasted the insane wait times set in Tubeup.
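
The general idea of waiting out throttling instead of failing fast can be sketched as a retry loop with growing waits. This is a generic illustration, not Tubeup's code: retry_with_backoff is a hypothetical helper, and the attempt count and wait times are made-up values, not the ones Tubeup uses.

```python
import time


def retry_with_backoff(action, attempts=5, first_wait=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call `action` until it succeeds, waiting longer after each failure."""
    wait = first_wait
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(wait)
            wait *= factor  # each wait is `factor` times longer than the previous one
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting.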

A note on live videos

Do not use Tubeup to archive live YouTube (or any other site's) video. We will not and cannot fix it; it's not even our problem, and any solutions are unpalatable since they involve more code complexity to maintain, on top of having to disable livechat for one extractor only for live video.

Major Credits (in no particular order)

  • emijrp who wrote the original youtube2internetarchive.py in 2012
  • Matt Hazinski who forked emijrp's work in 2015 with numerous improvements of his own.
  • Antonizoon for switching the script to library calls rather than functioning as an external script, and many small improvements.
  • Small PRs from various people, both in and out of BibAnon.
  • vxbinaca for stabilizing downloads/uploads in yt-dlp/internetarchive library calls, cleansing item output, subtitles collection, and numerous small improvements over time.
  • mrpapersonic for adding logic to check if an item already exists in the Internet Archive and, if so, skip ingestion.
  • Jake Johnson of the Internet Archive for adding variable collections ability as a flag, switching Tubeup from a script to a PyPI package, ISO-compliant item dates, fixing what others couldn't, and many improvements.
  • Refeed for re-basing the code to OOP, turning Tubeup itself into a library, adding download and upload bar graphs, and squashing bugs.

License (GPLv3)

Copyright (C) 2020 Bibliotheca Anonoma

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
