

Make a ZIM file from any Web site and surf offline!

Zimit

Zimit is a scraper that creates a ZIM file from any website.


⚠️ Important: this tool uses warc2zim to create ZIM files and thus requires the ZIM reader to support Service Workers. At the time of zimit:1.0, that's mostly kiwix-android and kiwix-serve. Note that service workers have protocol restrictions as well, so you'll need to run the reader either from localhost or over HTTPS.
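
The protocol restriction can be checked before pointing a reader at a served ZIM. The following is a minimal sketch, not part of zimit: the function name is hypothetical, and the rule is a simplified approximation of the browser's secure-context check (HTTPS, or plain HTTP on a loopback host).

```python
from urllib.parse import urlsplit

# Hosts treated as "localhost" in this simplified check (an assumption;
# browsers apply a more detailed secure-context rule).
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def can_register_service_worker(url: str) -> bool:
    """Return True if the URL is served over HTTPS or from a loopback host."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    return parts.scheme == "http" and parts.hostname in LOOPBACK_HOSTS


# Hypothetical addresses, for illustration only.
for candidate in (
    "https://example.org/myzimfile",
    "http://localhost:8080/myzimfile",
    "http://192.168.1.10:8080/myzimfile",
):
    print(candidate, can_register_service_worker(candidate))
```

A reader served over plain HTTP on a LAN address fails this check, which matches the restriction described above.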

Technical background

This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.

The system extends the crawling system in Browsertrix Crawler and converts the crawled WARC files to ZIM using warc2zim.

zimit.py is the entrypoint for the system.

After the crawl is done, warc2zim writes a ZIM file to the /output directory, which can be mounted as a volume.

With the --keep flag, the crawled WARCs are also kept in a temp directory inside /output.
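
To inspect what a run left behind in the mounted output directory, a small script along these lines may help. It is a sketch under assumptions: the helper name is hypothetical, the ZIM is assumed to sit at the top level of the output directory, and WARC files are assumed to end in .warc or .warc.gz somewhere below it. The temp directory name used in the demo is made up.

```python
import tempfile
from pathlib import Path


def summarize_output(output_dir):
    """Return (zim_files, warc_files) found under output_dir."""
    root = Path(output_dir)
    # ZIM files are assumed to be written directly into the output directory.
    zims = sorted(root.glob("*.zim"))
    # Kept WARCs live in a temp directory below it, so search recursively.
    warcs = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith((".warc", ".warc.gz"))
    )
    return zims, warcs


# Demo against a throwaway directory that mimics a run with --keep.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "myzimfile.zim").touch()
    kept_dir = root / "tmp-example"  # hypothetical temp directory name
    kept_dir.mkdir()
    (kept_dir / "crawl.warc.gz").touch()

    zims, warcs = summarize_output(root)
    zim_names = [p.name for p in zims]
    warc_names = [p.name for p in warcs]

print(zim_names, warc_names)
```

On a real run, pass the host directory that was mounted to /output instead of the temporary directory.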

Usage

zimit is intended to be run in Docker.

To build locally, run:

docker build -t ghcr.io/openzim/zimit .

The image accepts the following parameters, as well as any of the warc2zim ones (useful for setting metadata, for instance):

  • --url URL - the url to be crawled (required)
  • --workers N - number of crawl workers to be run in parallel
  • --wait-until - Puppeteer setting for how long to wait for page load. See page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (to avoid waiting for ads to load for example).
  • --name - Name of ZIM file (defaults to the hostname of the URL)
  • --output - output directory (defaults to /output)
  • --limit U - Limit capture to at most U URLs
  • --exclude <regex> - skip URLs that match the regex from crawling. Can be specified multiple times. An example is --exclude="(\?q=|signup-landing\?|\?cid=)", where URLs that contain either ?q= or signup-landing? or ?cid= will be excluded.
  • --scroll [N] - if set, will activate a simple auto-scroll behavior on each page to scroll for up to N seconds
  • --keep - if set, keep the WARC files in a temp directory inside the output directory
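
An --exclude pattern can be tried out on a few URLs before starting a long crawl. The sketch below uses Python's re module purely for illustration; the crawler evaluates the regex itself, so this only approximates its behavior, although a simple alternation like the one in the example reads the same way in both. The URLs and the helper name are hypothetical.

```python
import re

# Pattern copied from the --exclude example in the parameter list.
EXCLUDE = re.compile(r"(\?q=|signup-landing\?|\?cid=)")


def is_excluded(url: str) -> bool:
    """Return True if the URL contains any of the excluded fragments."""
    return EXCLUDE.search(url) is not None


# Hypothetical URLs, used only to illustrate the matching.
urls = [
    "https://example.com/search?q=zim",
    "https://example.com/signup-landing?ref=home",
    "https://example.com/page?cid=42",
    "https://example.com/about",
]

kept = [u for u in urls if not is_excluded(u)]
print(kept)  # ['https://example.com/about']
```

Note the escaped question marks: an unescaped ? would be read as a regex quantifier rather than a literal character in the URL.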

The following is an example usage. The --shm-size flag is needed to run Chrome in Docker.

Example command:

docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output \
       --shm-size=1gb ghcr.io/openzim/zimit zimit --url URL --name myzimfile --workers 2 --waitUntil domcontentloaded
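
When the same crawl is launched repeatedly, assembling the command in a script avoids quoting mistakes. The following is a rough sketch, not a zimit API: build_zimit_command and its keyword arguments are hypothetical, and only flags listed in this README are emitted. It only builds and prints the command; running it still requires Docker.

```python
import shlex

IMAGE = "ghcr.io/openzim/zimit"


def build_zimit_command(url, name=None, workers=None, limit=None,
                        excludes=(), keep=False,
                        host_output="/output", shm_size="1gb"):
    """Assemble a `docker run` argument list for the zimit image."""
    cmd = [
        "docker", "run",
        "-v", f"{host_output}:/output",   # mount the output directory
        f"--shm-size={shm_size}",         # needed to run Chrome in Docker
        IMAGE, "zimit",
        "--url", url,
    ]
    if name is not None:
        cmd += ["--name", name]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    for pattern in excludes:
        # --exclude may be given several times, once per regex.
        cmd.append(f"--exclude={pattern}")
    if keep:
        cmd.append("--keep")
    return cmd


cmd = build_zimit_command(
    "https://example.com",               # hypothetical site
    name="myzimfile",
    workers=2,
    excludes=[r"(\?q=|\?cid=)"],
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would launch the crawl without going through a shell, so the regex needs no extra escaping.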

puppeteer-cluster provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.

Note: The image automatically filters out a large number of ads using the 3 blocklists from anudeepND. If you don't want this filtering, disable the image's entrypoint in your container (docker run --entrypoint="" ghcr.io/openzim/zimit ...).

Nota bene

A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.
