
Mediawiki scraper: all your wiki articles in one highly compressed ZIM file

MWoffliner

MWoffliner is a tool for making a local offline HTML snapshot of any online MediaWiki instance. It goes through all online articles (or a selection if specified) and creates the corresponding ZIM file. It has mainly been tested against Wikimedia projects like Wikipedia and Wiktionary, but it should also work with any recent MediaWiki.

Read CONTRIBUTING.md to learn more about MWoffliner development.

Features

  • Scrape with or without image thumbnails
  • Scrape with or without audio/video multimedia content
  • S3 cache (optional)
  • Image size optimiser / WebP converter
  • Scrape all articles in the selected namespaces, or only those in a title list
  • Specify additional/non-main namespaces to scrape

Run mwoffliner --help to get all the possible options.

Prerequisites

  • *NIX Operating System (GNU/Linux, macOS, ...)
  • Redis
  • NodeJS version 16 or greater
  • Libzim (On GNU/Linux & macOS we automatically download it)
  • Various build tools which are probably already installed on your machine (packages libjpeg-dev, libglu1, autoconf, automake, gcc on Debian/Ubuntu)

... and an online MediaWiki with its API available.
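
The Node.js requirement above can be checked before installing. The following is a minimal sketch, not part of MWoffliner; the helper name meetsMinimumNode is hypothetical, and only the minimum version (16) comes from the list above.

```javascript
// Minimal sketch: compare the running Node.js version with the documented minimum.
// `meetsMinimumNode` is a hypothetical helper, not an MWoffliner function.
const MIN_NODE_MAJOR = 16;

function meetsMinimumNode(version, minMajor = MIN_NODE_MAJOR) {
  // process.versions.node looks like "18.17.1" (no leading "v");
  // the major version is the first dot-separated component.
  const major = Number.parseInt(String(version).split('.')[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

if (meetsMinimumNode(process.versions.node)) {
  console.log(`Node.js ${process.versions.node} satisfies the minimum (${MIN_NODE_MAJOR}).`);
} else {
  console.warn(`Node.js ${process.versions.node} is older than the minimum (${MIN_NODE_MAJOR}).`);
}
```

The check only looks at the major version, which is enough for a "version 16 or greater" requirement.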

Usage

To install MWoffliner globally:

npm i -g mwoffliner

You might need to run this command with sudo, depending on how your npm is configured.

npm permission checking can be a bit annoying for a newcomer. Please read the documentation carefully if you hit problems: https://docs.npmjs.com/cli/v7/using-npm/scripts#user

Then to run it:

mwoffliner --help

To install and run it locally:

npm i
npm run mwoffliner -- --help

To use MWoffliner with an S3 cache, you should provide an S3 URL like this:

--optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
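
Credentials often contain characters such as "+" or "/" that need escaping in a URL. As a rough sketch, the value above could be assembled with Node's built-in URL class. The helper name buildOptimisationCacheUrl is hypothetical, and the sketch assumes MWoffliner decodes the query parameters in the standard way; only the parameter names (bucketName, keyId, secretAccessKey) come from the example above.

```javascript
// Minimal sketch: build the --optimisationCacheUrl value from its parts.
// `buildOptimisationCacheUrl` is a hypothetical helper, not an MWoffliner function.
function buildOptimisationCacheUrl(endpoint, { bucketName, keyId, secretAccessKey }) {
  const url = new URL(endpoint);
  // searchParams percent-encodes special characters in each value.
  url.searchParams.set('bucketName', bucketName);
  url.searchParams.set('keyId', keyId);
  url.searchParams.set('secretAccessKey', secretAccessKey);
  return url.toString();
}

// Placeholder values taken from the example above.
const cacheUrl = buildOptimisationCacheUrl('https://wasabisys.com/', {
  bucketName: 'my-bucket',
  keyId: 'my-key-id',
  secretAccessKey: 'my-sac',
});

console.log(`--optimisationCacheUrl="${cacheUrl}"`);
// prints --optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
```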

API

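MWoffliner also provides an API and can therefore be used as a Node.js library. Here is a stub example:

const mwoffliner = require('mwoffliner');

const parameters = {
    mwUrl: "https://es.wikipedia.org",   // base URL of the MediaWiki to scrape
    adminEmail: "[email protected]",     // placeholder: the address is obscured in this copy, use your own
    verbose: true,
    format: "nopic",
    articleList: "./articleList"         // path to the list of articles to scrape
};

// execute() returns a Promise, so handle both completion and failure.
mwoffliner.execute(parameters)
    .then(() => console.log("Scrape finished"))
    .catch((err) => console.error("Scrape failed:", err));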
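
When switching between the library and the command line, it can help to see the same options in both forms. The sketch below assumes that the library parameters share their names with the command-line flags; that is an assumption, so check it against mwoffliner --help. The helper toCliArgs is hypothetical and the email address is a placeholder.

```javascript
// Minimal sketch: turn an options object into command-line arguments.
// `toCliArgs` is a hypothetical helper; it assumes option names equal flag names.
function toCliArgs(parameters) {
  const args = [];
  for (const [name, value] of Object.entries(parameters)) {
    // Skip options that are switched off or not set.
    if (value === false || value === undefined || value === null) continue;
    // Boolean true becomes a bare flag; anything else becomes --name=value.
    args.push(value === true ? `--${name}` : `--${name}=${value}`);
  }
  return args;
}

const cliParameters = {
  mwUrl: 'https://es.wikipedia.org',
  adminEmail: 'admin@example.org', // placeholder address
  verbose: true,
  format: 'nopic',
  articleList: './articleList',
};

console.log(['mwoffliner', ...toCliArgs(cliParameters)].join(' '));
// prints mwoffliner --mwUrl=https://es.wikipedia.org --adminEmail=admin@example.org --verbose --format=nopic --articleList=./articleList
```

Values containing spaces or shell metacharacters would still need quoting before being pasted into a shell.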

Background

Complementary information about MWoffliner:

  • The MediaWiki software is used by thousands of wikis, the best known being the Wikimedia projects, including Wikipedia.
  • MediaWiki is a wiki engine written in PHP.
  • Wikitext is the name of the markup language that MediaWiki uses.
  • MediaWiki includes a parser that converts wikitext into HTML; this parser creates the HTML pages displayed in your browser.
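
To make the last point concrete: MediaWiki's Action API exposes the parser through action=parse, which returns the rendered HTML of a page. The sketch below only builds such a request URL. The helper buildParseUrl is hypothetical, the "/w/api.php" path is an assumption (it is the path Wikimedia wikis use, other wikis may differ), and this is not a description of how MWoffliner itself retrieves articles.

```javascript
// Minimal sketch: build an Action API URL asking MediaWiki to render a page to HTML.
// `buildParseUrl` is a hypothetical helper; "/w/api.php" is an assumed API path.
function buildParseUrl(mwUrl, title, apiPath = '/w/api.php') {
  const url = new URL(apiPath, mwUrl);
  url.searchParams.set('action', 'parse'); // run the wikitext parser
  url.searchParams.set('page', title);     // page title to render
  url.searchParams.set('prop', 'text');    // return only the HTML text
  url.searchParams.set('format', 'json');
  return url.toString();
}

console.log(buildParseUrl('https://es.wikipedia.org', 'Madrid'));
// prints https://es.wikipedia.org/w/api.php?action=parse&page=Madrid&prop=text&format=json
```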

GNU/Linux - Debian-based distributions

Install NodeJS: Read https://nodejs.org/en/download/current/

Install Redis:

sudo apt-get install redis-server

Troubleshooting

Older GNU/Linux distributions and/or versions of Node.js might ship with a deprecated version of npm. Older versions of npm have incompatibilities with certain versions of Node.js and might simply fail to install the mwoffliner package.

We recommend using a recent version of npm. Recent versions work well even with an older Node.js 10. First install the packaged version of npm, then use it to install a newer version:

sudo npm install --unsafe-perm -g npm

Don't forget to remove the packaged version of npm afterward.

License

GPLv3 or later, see LICENSE for more details.
