  • Language: Python
  • License: MIT License
  • Created about 1 year ago
  • Updated 12 months ago

Repository Details

Convert all of libgen to high quality markdown

Libgen to txt

This repo will convert books from libgen to plain txt or markdown format. This repo does not contain any books, only the scripts to download and convert them.

The scripts use a seedbox to download the libgen torrents, copy them to your machine or cloud instance, convert them, and enrich them with metadata. Processing happens chunk by chunk, with configurable parallelization.

It currently only works for the libgen rs nonfiction section, but PRs are welcome for additional compatibility. Converting all of libgen rs nonfiction on a cloud instance costs about $300 and takes about 1 week (bandwidth-bound). You'll need 3TB of disk space.

Community

Discord is where we discuss future development.

Install

This was only tested on Ubuntu 23.04 and Python 3.11. It should work with Python 3.8+.

Setup dependencies

  • apt-get update
  • xargs apt-get install -y < apt-requirements.txt
  • pip install -r requirements.txt

Import libgen rs metadata

  • Download the metadata DB (look for "metadata" and the nonfiction one)
  • bsdtar -xf libgen.rar
  • Start mariadb
    • systemctl start mariadb.service
  • Setup DB user
    • mariadb
    • GRANT ALL ON *.* TO 'libgen'@'localhost' IDENTIFIED BY 'password' WITH GRANT OPTION; # Replace with your password
    • FLUSH PRIVILEGES;
    • create database libgen;
    • exit
  • Import metadata
    • git clone https://annas-software.org/AnnaArchivist/annas-archive.git
    • pv libgen.sql | PYTHONIOENCODING=UTF8:ignore python3 annas-archive/data-imports/scripts/helpers/sanitize_unicode.py | mariadb -h localhost --default-character-set=utf8mb4 -u libgen -ppassword libgen
      • You may need to add the --binary-mode -o flags to the mariadb command above
      • And the --force flag if you get errors

Setup seedbox

Configuration

  • Get a putio oauth token following these instructions
  • Either set the env var PUTIO_TOKEN, or create a local.env file with PUTIO_TOKEN=yourtoken
  • Inspect libgen_to_txt/settings.py. You can edit settings directly to override them, set an env var, or add the key to a local.env file.
    • You may particularly want to look at CONVERSION_WORKERS and DOWNLOAD_WORKERS to control parallelization. The download step is the limiting factor, and too many download workers will saturate your bandwidth.
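The override order described above can be sketched as follows. This is only an illustration using the standard library, not the code in libgen_to_txt/settings.py: the helper names parse_env_file and resolve_setting are hypothetical, and the precedence shown (env var first, then local.env, then the default) is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(
    name: str,
    default: Optional[str],
    env_file_text: str = "",
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the env var if set, else the local.env value, else the default.

    The precedence is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    if name in environ:
        return environ[name]
    return parse_env_file(env_file_text).get(name, default)


# Example: the token comes from local.env contents when no env var is set.
token = resolve_setting("PUTIO_TOKEN", None, "PUTIO_TOKEN=yourtoken", environ={})
```

Values come back as strings here, so a numeric setting such as DOWNLOAD_WORKERS would still need converting with int().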

Usage

  • python download_and_clean.py to download and clean the data
    • --workers to control number of download workers (how many parallel downloads happen at once)
    • --no_download to only process libgen chunks that already exist on the seedbox
    • --max controls how many chunks at most to process (for testing)
    • --no_local_delete to avoid deleting chunks locally after they're downloaded. Mainly useful for debugging.
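If you launch runs from another script, the flags above can be assembled programmatically. The sketch below is a hypothetical wrapper (build_command is not part of this repo), and it assumes --workers and --max each take an integer value as the next argument.

```python
import sys
from typing import List, Optional


def build_command(
    workers: Optional[int] = None,
    max_chunks: Optional[int] = None,
    no_download: bool = False,
    no_local_delete: bool = False,
) -> List[str]:
    """Build the argument list for download_and_clean.py from the documented flags."""
    cmd = [sys.executable, "download_and_clean.py"]
    if workers is not None:
        cmd += ["--workers", str(workers)]  # assumed to take an integer value
    if max_chunks is not None:
        cmd += ["--max", str(max_chunks)]  # assumed to take an integer value
    if no_download:
        cmd.append("--no_download")
    if no_local_delete:
        cmd.append("--no_local_delete")
    return cmd


# A small test run: 4 download workers, at most 1 chunk, keep local files.
test_run = build_command(workers=4, max_chunks=1, no_local_delete=True)
```

Pass the resulting list to subprocess.run(test_run, check=True) from the repo root to start the run.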

You should see progress information printed as the script runs. A full run can take from about a week to several weeks, depending on bandwidth and conversion method (see below). Check the txt and processed folders to monitor progress.

Markdown conversion

This can optionally be integrated with marker to do high-accuracy PDF to markdown conversion. To use marker, first install it, then set:

  • CONVERSION_METHOD to marker
  • MARKER_FOLDER to the path to the marker folder

CONVERSION_WORKERS will control how many marker processes per GPU are run in parallel. Marker takes about 2.5GB of VRAM per process, so set this accordingly.
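As a rough way to pick CONVERSION_WORKERS, you can divide the VRAM on one GPU by the 2.5GB per process figure above. The function below is a sketch of that arithmetic; the name max_conversion_workers and the reserve_gb headroom parameter are hypothetical and not part of this repo.

```python
import math

# Approximate VRAM used by one marker process, taken from the text above.
MARKER_VRAM_GB_PER_PROCESS = 2.5


def max_conversion_workers(gpu_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many marker processes fit on one GPU.

    reserve_gb is headroom left unused (an assumption for this sketch,
    not a value from the repo).
    """
    usable_gb = gpu_vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    return math.floor(usable_gb / MARKER_VRAM_GB_PER_PROCESS)


# A 24GB GPU with 1GB held back leaves 23GB, which fits 9 processes.
workers_for_24gb = max_conversion_workers(24)
```

Treat the result as an upper bound and lower it if you see out-of-memory errors.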

You can adjust additional settings around how marker is integrated using the MARKER_* settings. In particular, pay attention to the timeouts. These ensure that conversion doesn't get stuck on a chunk. Marker can run on CPU or GPU, but is much faster on GPU. With 4x GPUs, a single libgen chunk should take about 1 hour to process.

Cloud storage

You can store the converted txt/markdown files in an S3-compatible storage backend as they're processed using s3fs. Here's how:

  • sudo apt install s3fs
  • echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ${HOME}/.passwd-s3fs
  • chmod 600 ${HOME}/.passwd-s3fs
  • s3fs BUCKET_NAME LOCAL_DIR -o url=STORAGE_URL -o use_cache=/tmp -o allow_other -o use_path_request_style -o uid=1000 -o gid=1000 -o passwd_file=${HOME}/.passwd-s3fs
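Before starting a long run, it may be worth checking that LOCAL_DIR is actually mounted, since files written to an unmounted directory would land on local disk instead of the bucket. Here is a minimal sketch using only the standard library; check_output_dir is a hypothetical helper, not part of this repo.

```python
import os
from typing import List


def check_output_dir(local_dir: str) -> List[str]:
    """Return a list of problems found with the s3fs mount directory."""
    problems: List[str] = []
    if not os.path.isdir(local_dir):
        problems.append(f"{local_dir} does not exist or is not a directory")
        return problems
    if not os.path.ismount(local_dir):
        problems.append(f"{local_dir} is not a mount point (is s3fs running?)")
    if not os.access(local_dir, os.W_OK):
        problems.append(f"{local_dir} is not writable by the current user")
    return problems
```

An empty list means the directory is mounted and writable. It does not confirm that the mount is backed by s3fs or points at the right bucket.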
