Repository Details

Scraper for downloading the entire ebook repository of Project Gutenberg

Gutenberg Offline

This scraper downloads the whole Project Gutenberg library and puts it in a ZIM file, a clean and user-friendly format for storing content for offline use.

Coding guidelines

The main coding guidelines come from the openZIM Wiki.

Setting up the environment

We recommend using virtualenv with Python 3.6 or later.

Install the dependencies

GNU/Linux

sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip zim-tools
sudo pip install virtualenv

macOS

sudo easy_install pip
sudo pip install virtualenv
brew install advancecomp jpegoptim pngquant p7zip gifsicle
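
Before going further, it can help to confirm that the external tools installed above are actually on your PATH. The following is a minimal sketch, not part of the project: the executable names (for example `advdef` for advancecomp and `7z` for p7zip) are assumptions about what those packages install, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executable names are assumptions: each is the command typically provided
# by the similarly named package in the install lists above.
REQUIRED_TOOLS = ["advdef", "jpegoptim", "pngquant", "7z", "gifsicle", "curl", "zip"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

The scraper's own `--check` option, listed under Arguments, is the authoritative dependency check; this sketch only illustrates the idea.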

Set up the project

git clone git@github.com:kiwix/gutenberg.git
cd gutenberg
virtualenv gut-env  # or any name you want
./gut-env/bin/pip install -r requirements.pip

Working in the environment

  • Activate the environment: source gut-env/bin/activate
  • Quit the environment: deactivate

Getting started

After setting up the environment, you can run the main script, gutenberg2zim. It will download, process, and export the content.

./gutenberg2zim
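
The argument list in this README also includes one flag per stage (`--prepare`, `--parse`, `--download`, `--zim`). If you want to drive those stages one at a time from Python, for example to retry a single stage, a sketch along these lines might work. The stage order is an assumption taken from the order of the argument list, and `step_commands` and `run_steps` are hypothetical helpers, not part of the project.

```python
import subprocess

# Step flags come from the argument list in this README; running them in
# this order is an assumption based on how that list presents them.
STEPS = ["--prepare", "--parse", "--download", "--zim"]


def step_commands(extra_args=(), script="./gutenberg2zim"):
    """Return one command (as an argument list) per pipeline step."""
    return [[script, step, *extra_args] for step in STEPS]


def run_steps(extra_args=()):
    """Run each step in turn, stopping at the first one that fails."""
    for cmd in step_commands(extra_args):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_steps(["-l", "en"])` from the project directory would run each stage with the same language filter.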

Arguments

You can also pass parameters to customize the content. Only want books with IDs 100 to 200? Only books in French? In English? Or in both? No problem! You can also include or exclude book formats, and you can add bookshelves and a search-by-title option to enrich the user experience.

./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search

This downloads English and French books with IDs 100 to 200, in HTML (the default) and PDF formats.
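
If you launch the scraper from another Python script, the same command line can be assembled programmatically. This is a minimal sketch that uses only flags documented in this README; `build_command` and its parameter names are hypothetical.

```python
def build_command(languages=None, formats=None, books=None,
                  bookshelves=False, title_search=False,
                  script="./gutenberg2zim"):
    """Assemble an argument list using only flags documented in this README."""
    cmd = [script]
    if languages:
        cmd += ["-l", ",".join(languages)]
    if formats:
        cmd += ["-f", ",".join(formats)]
    if books:
        cmd += ["--books", books]
    if bookshelves:
        cmd.append("--bookshelves")
    if title_search:
        cmd.append("--title-search")
    return cmd


cmd = build_command(["en", "fr"], ["pdf"], "100-200",
                    bookshelves=True, title_search=True)
print(" ".join(cmd))
# To actually run it from the project directory:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when a value contains spaces, such as a ZIM title.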

You can find the full arguments list below:

-h --help                       Display this help message
-y --wipe-db                    Empty cached book metadata
-F --force                      Redo step even if target already exists

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-e --static-folder=<folder>     Folder to use for, or write, the static HTML
-z --zim-file=<file>            Write ZIM into this file path
-t --zim-title=<title>          Set ZIM title
-n --zim-desc=<description>     Set ZIM description
-d --dl-folder=<folder>         Folder to use for, or write, downloaded ebooks
-u --rdf-url=<url>              Alternative rdf-files.tar.bz2 URL
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb>           Number of concurrent processes for processing tasks
--dlc=<nb>                      Number of concurrent *download* processes (overrides --concurrency); useful if the server blocks high-rate requests
-m --one-language-one-zim=<folder> When more than one language is selected, create one ZIM per language (and one with all)
--no-index                      Do NOT create full-text index within ZIM file
--check                         Check dependencies
--prepare                       Download rdf-files.tar.bz2
--parse                         Parse all RDF files and fill-up the DB
--download                      Download ebooks based on filters
--zim                           Create a ZIM file
--title-search                  Add field to search a book by title and directly jump to it
--bookshelves                   Add bookshelves
--optimization-cache=<url>      URL with credentials to S3 bucket for using as optimization cache
--use-any-optimized-version     Try to use any optimized version found on optimization cache
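
The `--books` option accepts IDs separated by commas, with dashes for intervals. To preview which IDs a given value covers, a sketch like the following could be used. It is not the scraper's own parser: `parse_book_ids` is a hypothetical helper, and treating intervals as inclusive at both ends is an assumption.

```python
def parse_book_ids(spec):
    """Expand a spec such as "1,5-7,42" into a sorted list of unique IDs.

    Assumption: intervals are inclusive, so "5-7" means 5, 6 and 7.
    """
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty items such as a trailing comma
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                start, end = end, start  # accept reversed intervals
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


print(parse_book_ids("1,5-7,42"))
```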

License

GPLv3 or later, see LICENSE for more details.

More Repositories

1. zimit (Python, 335 stars): Make a ZIM file from any Web site and surf offline!
2. mwoffliner (TypeScript, 285 stars): MediaWiki scraper: all your wiki articles in one highly compressed ZIM file
3. sotoki (Python, 217 stars): StackExchange websites to ZIM scraper
4. libzim (C++, 166 stars): Reference implementation of the ZIM specification
5. zim-tools (C++, 127 stars): Various ZIM command line tools
6. zimfarm (Python, 83 stars): Farm operated by bots to grow and harvest new ZIM files
7. python-libzim (Python, 63 stars): Libzim binding for Python: read/write ZIM files in Python
8. youtube (Python, 48 stars): Create a ZIM file from a YouTube channel/username/playlist
9. warc2zim (Python, 44 stars): Command line tool to convert a file in the WARC format to a file in the ZIM format
10. zim-requests (37 stars): Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
11. zimwriterfs (C++, 36 stars): [ARCHIVED] Create ZIM files from a directory on your local filesystem
12. node-libzim (C++, 27 stars): Libzim binding for Node.js: read/write ZIM files in JavaScript
13. ifixit (Python, 25 stars): iFixit to ZIM scraper
14. wp1 (Python, 24 stars): Wikipedia 1.0 engine & selection tools
15. nautilus (Python, 21 stars): Turns a collection of documents into a browsable ZIM file
16. python-scraperlib (Python, 19 stars): Collection of Python code to re-use across Python-based scrapers
17. wikihow (Python, 16 stars): WikiHow scraper
18. ted (Python, 13 stars): Provide the best of TED.com for offline usage!
19. zimit-frontend (Vue, 9 stars): Zimit Public Web UI
20. kolibri (Python, 8 stars): Convert a Kolibri channel into ZIM file(s)
21. openedx (Python, 8 stars): Open edX (to ZIM) scraper
22. phet (JavaScript, 7 stars): Scraper for PhET Science & Math Interactive Simulations
23. zip2zim (JavaScript, 6 stars): [ARCHIVED] Convert Zip files to ZIM files
24. wp1_selection_tools (Perl, 6 stars): Create selections with the best articles of a WM project
25. zimreader-java (Java, 5 stars): [ARCHIVED] ZIM file reader in Java
26. freecodecamp (Python, 4 stars): FreeCodeCamp.org scraper (to ZIM)
27. cms (Python, 4 stars): ZIM file Publishing Platform
28. docker-publish-action (Python, 4 stars): Docker Publish Action for OpenZIM projects
29. education-numerique (Python, 3 stars): Éducation & Numérique scraper
30. zim-testing-suite (PHP, 3 stars): This repository contains testing ZIM files for libzim and other openZIM repositories.
31. overview (2 stars): 🎈 Start here for current projects, how to get involved, and joining community calls. A resource for new and veteran members of the offline community
32. zimfarm-client (Python, 2 stars): Command line tool to deal with the Zimfarm
33. nautilus-webui (Python, 1 star): SaaS Web UI for nautilus
34. python-storagelib (Python, 1 star): S3 Cache wrapper to use within Kiwix/OpenZIM/Offspot projects
35. zimreader-tntnet (CSS, 1 star): [ARCHIVED] ZIM file reader using tntnet HTTP server
36. devdocs (Python, 1 star): devdocs.io to ZIM scraper
37. _python-bootstrap (Python, 1 star): Sample openZIM Python project bootstrap
38. xapian-meson (C++, 1 star): Xapian (1.4.23) source code with meson build system
39. lilote (JavaScript, 1 star): Generate a Lilote ZIM file from a Lilote export JSON