Discover internetarchive/iaux Open Source project

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

5,180

heritrix3

The Internet Archive BookReader

2,821

bookreader

A web browser extension for Chrome, Firefox, Edge, and Safari 14.

975

wayback-machine-webextension

brozzler - distributed browser-based web crawler

657

brozzler

WARC writing MITM HTTP/S proxy

657

warcprox

Python Client Library for the Archive.org OpenLibrary API

377

openlibrary-client

Python library for reading and writing warc files

377

warc

Offline Internet Archive project

237

dweb-mirror

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

232

warctools

internetarchivebot

bookserver

Archive.org OPDS Bookserver - A standard for digital book distribution

Perpetual Access To The Scholarly Record

119

fatcat

Fast PDF generation and compression. Deals with millions of pages daily.

114

archive-pdf-tools

search interface for scholarly works

fatcat-scholar

State-of-the-art web crawler 🔱

Zeno

A repository of cleanup bots implementing the openlibrary-client

openlibrary-bots

A queue-controlled browser automation tool for improving web crawl quality

umbra

dweb-archive

Hashistack-IN-Docker (single container with nomad + consul + caddy)

hind

Reduce annoying 404 pages by automatically checking for an archived copy in the Wayback Machine. Learn more about this Test Pilot experiment at https://testpilot.firefox.com/

wayback-machine-firefox

Summarize web archive capture index (CDX) files.

cdx-summary

Voice Apps (Actions on Google, Alexa Skill) of Internet Archive. Just say: "Ok Google, Ask Internet Archive to Play Jazz" or "Alexa, Ask Internet Archive to play Instrumental Music"

internet-archive-voice-apps

Liveweb proxy of the Wayback Machine project

liveweb

For code related to making ePub files

epub

Sort-friendly URI Reordering Transform (SURT) python module

surt

archive-hocr-tools

Efficient hOCR tooling

Trough: Big data, small databases.

trough

Internet Archive Decentralized Web Common API

dweb-transport

wayback-diff

React components to render differences between captures at the Wayback Machine

dweb-transports

Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

sandcrawler

The official Internet Archive IIIF service

iiif

crawling-for-nomore404

Pure python HDFS client: python3.x version

snakebite-py3

Daily TV News Summary using GPT

newsum

ia-hadoop-tools

ARK minter, binder, resolver

arklet

Decentralized web Gateway for Internet Archive

dweb-gateway

Cache stampede test harness. Code accompanies the presentation made at RedisConf 2017, 30 May to 1 June, 2017, in San Francisco.

xfetch

PHP

openlibrary-librarians

Coordination between the OpenLibrary.org Librarian community

arch

Web application for distributed compute analysis of Archive-It web archive collections.

Scala

cicd

build & test using github registry; deploy to nomad clusters

scrapy-warcio

Support for writing WARC files with Scrapy

Summarize and ask questions about items in the Internet Archive

iacopilot

Import workflows for the Wikipedia Citations Database

iari

doublethink

rethinkdb python library

Watch for local files to appear and move them into S3

s3_loader

Internet Archive's Sparkling Data Processing Library

Sparkling

Scala

wayback-machine-android

Kotlin

archive-commons

a tool for continuously ingesting w/arc files into the archive

draintasker

Internet Archive S3-like connector

ias3

wayback-radial-tree

journal-level metadata munging; part of the fatcat project

chocula

Demo code for the Open Library Read API

read_api_extras

wikibase-patcher

Python library for interacting with the Wikibase REST API

dweb-archivecontroller

An API wrapper to the Elasticsearch index of web archival collections and a web UI to explore those indexes.

web_collection_search

IAUX Typescript WebComponent Template

epub-labs

iaux-typescript-wc-template

A JS interface to archive.org

archive-ocr-tools

Tool to build solr index offline

offlinesolr

Internet Archive Command-line Utilities

ia-bin-tools

dweb-objects

An interactive IARI JSON viewer

iare

iaux-collection-browser

wayback-machine-safari

collections-cleaners

A mathematical model to calculate a normalized score to quantify the temporal resilience of a web page as a time-series data based on the historical observations of the page in web archives.

trendmachine

acs4_py

Python interface to ACS4

minify JS/TS files using `esbuild` and `swc` down to ES5 (uses `deno`)

esbuild_es5

iaux-search-service

map-of-the-web

Eventer is a simple event dispatching library in Python

eventer

The Internet Archive Donation Form

iaux-donation-form

Internet Archive Open Source Blog

internetarchive.github.com

CSS

isodos

Go module to interact with Internet Archive's Isodos API

strainer

Heritrix frontier files manipulation tool.

internet-archive-alexa-skill

Command line retrieval of torrents using transmission-daemon (via transmission-remote)

btget

A MediaWiki extension that supports importing of Archive.org palm leaf items

mediawiki-extension-archive-leaf

hashitalksdemo

API documentation for https://github.com/internetarchive/openlibrary

openlibrary-api

Fast and easy-to-use web server, using the Deno native http server (hyper in rust). It serves static files & dirs, with arbitrary handling using an optional `handler` argument.

httpd

Google Summer of Code (GSoC) 2024 Wayback Machine GenAI Knowledge Graph project

wbm_ai_kg

`deno` static file webserver, clone of `file_server.ts`, PLUS an additional final "404 handler" to run arbitrary JS/TS

file_server_plus

dyno

archiveorg-e2e-playwright

A Streamlit application to visualize Wikipedia IABot statistics

tarb_insights

Python client package for the playback rules engine

rulesengine-client

deploy saved changes to website unique hostnames instantly -- can skip commits, pushes & full CI/CD

coderunr

Redis promises & futures library for Predis / PHP

deferred

PHP

hello-js

an example of full CI/CD from GitHub to a nomad cluster

Data models and scripts to build a database of references (broadly defined) appearing on Wikipedia and other wikis

wiki-references-db

Project Gutenberg collection importation via IAS3 interface

maisy

Presentation for KohaCon 2011

kohacon2011-presentation

model and front-end for rules for managing wayback playback

rulesengine