  • Stars: 467
  • Rank: 93,430 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created about 1 year ago
  • Updated 11 months ago

Repository Details

Generate textbook-quality synthetic LLM pretraining data

Textbook Quality

This project generates very long, textbook-quality pretraining data. Here's a 70M token example. It can run generations in parallel against OpenAI or your own API. It can generate the topics from scratch or use a set of seeds you provide.

The generator uses retrieval to improve quality. By default, it will use Serply to do the retrieval, but you can also use SerpAPI, or disable retrieval.

The core is extensible, so you can add your own adaptors to connect to new APIs and retrieval backends.

Installing

Prerequisites

  • Python 3.9+ (ideally 3.11)
  • You will need postgres installed. You can install it with brew install postgres on a Mac.

Setup

  • psql postgres -c "create database textbook;"
  • git clone https://github.com/VikParuchuri/textbook_quality.git
  • cd textbook_quality
  • poetry install
  • invoke migrate-dev

Configuration

First, create a local.env file in the root directory of the repo to store your secret keys. Alternatively, you can set any key below as an env var.

You can see all the available configuration values in app/settings.py.

With OpenAI and retrieval (highest quality)

  • Add your OpenAI key, like OPENAI_KEY=sk-xxxxxx
  • Add your serply key (SERPLY_KEY="...") or serpapi key (SERPAPI_KEY="...").
  • Add SEARCH_BACKEND=serply or SEARCH_BACKEND=serpapi to use the appropriate backend.

By default, this will use gpt-3.5. To use gpt-4, set the env vars LLM_TYPE and LLM_INSTRUCT_TYPE to gpt-4. You may be able to set LLM_EXTENDED_TYPE to gpt-4 as well, but generation may need more than 8k of context.
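
As a rough illustration of the settings above, the sketch below renders them as KEY=value lines of the kind local.env holds. The key names come from this section; the values are placeholders, and render_env is a hypothetical helper, not part of the repo.

```python
def render_env(settings: dict[str, str]) -> str:
    """Render one KEY=value line per setting, in insertion order."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Placeholder values (assumptions); substitute your real keys.
settings = {
    "OPENAI_KEY": "sk-xxxxxx",
    "SERPLY_KEY": "your-serply-key",
    "SEARCH_BACKEND": "serply",
    "LLM_TYPE": "gpt-4",
    "LLM_INSTRUCT_TYPE": "gpt-4",
}

env_text = render_env(settings)
print(env_text)
```

Saving env_text as local.env in the repo root would give the generator these settings. Any of them can instead be exported as environment variables.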

With vllm or other openai-compatible API and retrieval

  • Set OPENAI_KEY to the value of your API key, or a dummy value.
  • Set OPENAI_BASE_URL to the url of your API (like https://vllm-api.com/v1)
  • Set the LLM_TYPE, LLM_INSTRUCT_TYPE, and LLM_EXTENDED_TYPE settings to your model name (like llama)
  • Set the model name and max tokens in the LLM_TYPES setting.
  • Follow the instructions above for the retrieval setup.

The generator ideally needs a context length of up to 16k, but you can get away with 12k if you need to. If you've finetuned your own model for textbook gen (based on the prompts cached in this repo), you can use the FINETUNED and INCLUDE_EXAMPLES settings to reduce token usage.

Without retrieval

  • Set SEARCH_BACKEND=none

Usage

There are three main scripts in the repo. You can run each script on the output of the previous one. All outputs will appear by default in app/data, which is the specified DATA_DIR in settings.

Generate topics from scratch

You enter a subject, a file you want to save the topics to, and the number of iterations. The topics will be deduplicated.

Usage example:

python topic_generator.py "computer science with python" python_cs_titles.json --iterations 50
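
To give a feel for what deduplication means here, the following is a minimal sketch of exact-match deduplication after normalizing case and whitespace. This is an assumption for illustration only; the README does not say how topic_generator.py deduplicates, and dedupe_topics is a hypothetical helper.

```python
def dedupe_topics(topics: list[str]) -> list[str]:
    """Drop topics that match an earlier one after lowercasing and collapsing whitespace."""
    seen = set()
    unique = []
    for topic in topics:
        key = " ".join(topic.lower().split())  # normalized form used only for comparison
        if key and key not in seen:
            seen.add(key)
            unique.append(topic.strip())
    return unique


topics = ["Intro to Python", "intro to  python ", "Data Structures in Python"]
print(dedupe_topics(topics))
```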

Augment topics from seeds

Take a file with existing seeds (in a flat json list), and augment them. You can pass in the output file from the topic generator as the seed file, or use your own seeds. Domain is an optional flag to constrain the topics within a domain.

This will also deduplicate the topics semantically.

Usage example:

python topic_augmentor.py python_titles.json python_topics.json --domain python
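
If you are building your own seed file, a sketch like the one below can write and sanity-check it. It assumes each seed is a plain string, which the README implies ("flat json list") but does not state outright; save_seeds and load_seeds are hypothetical helpers, and the example writes to a temporary directory rather than the repo.

```python
import json
import tempfile
from pathlib import Path


def save_seeds(path: Path, seeds: list[str]) -> None:
    """Write the seeds as a flat JSON list."""
    path.write_text(json.dumps(seeds, indent=2), encoding="utf-8")


def load_seeds(path: Path) -> list[str]:
    """Load the seeds and reject anything that is not a flat list of strings."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(s, str) for s in data):
        raise ValueError("seed file must be a flat JSON list of strings")
    return data


with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "python_titles.json"
    save_seeds(seed_path, ["Python decorators", "Asyncio basics"])
    seeds = load_seeds(seed_path)

print(seeds)
```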

Generate textbooks

From titles

This will take a file with a flat json list of topics, and generate one textbook per topic. The workers flag controls the number of parallel generations. Lower it if you hit rate limits.

Usage example:

python book_generator.py topics.json books.jsonl --workers 5

You can also override settings with environment variables (instead of using local.env). This example will use a vllm api instead of openai:

LLM_TYPE=llama LLM_INSTRUCT_TYPE=llama LLM_EXTENDED_TYPE=llama OPENAI_KEY="llama" OPENAI_BASE_URL="https://vllm-api.com/v1" python book_generator.py topics.json books.jsonl --workers 10
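
The same override can be scripted from Python. This is a minimal sketch that layers the settings from the command above on top of the current environment; build_command_env is a hypothetical helper, and the actual subprocess call is left commented out because it needs to run from the repo root.

```python
import os


def build_command_env(overrides: dict[str, str]) -> dict[str, str]:
    """Copy the current environment and apply the overrides on top."""
    env = dict(os.environ)
    env.update(overrides)
    return env


# Values taken from the shell example above.
overrides = {
    "LLM_TYPE": "llama",
    "LLM_INSTRUCT_TYPE": "llama",
    "LLM_EXTENDED_TYPE": "llama",
    "OPENAI_KEY": "llama",
    "OPENAI_BASE_URL": "https://vllm-api.com/v1",
}
command = ["python", "book_generator.py", "topics.json", "books.jsonl", "--workers", "10"]
env = build_command_env(overrides)

# From the repo root:
# import subprocess
# subprocess.run(command, env=env, check=True)
print({key: env[key] for key in overrides})
```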

You can see all options by running python book_generator.py --help.

Note that courses are cached by default, so regenerating a course with the same name twice will not hit the API again. The cache is specific to each model and each topic. You can skip the cache by using the --revision option to specify a revision number for the courses.
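
The caching behavior can be pictured with a toy in-memory cache keyed by model, topic, and revision. This is only a conceptual sketch: the real cache's storage, key layout, and default revision are not described here, and CourseCache is a hypothetical class.

```python
class CourseCache:
    """Toy cache: one entry per (model, topic, revision)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str, int], str] = {}
        self.api_calls = 0  # counts simulated API hits

    def get_or_generate(self, model: str, topic: str, revision: int = 1) -> str:
        key = (model, topic, revision)
        if key not in self._store:
            self.api_calls += 1  # stand-in for a real generation request
            self._store[key] = f"course on {topic} from {model} (revision {revision})"
        return self._store[key]


cache = CourseCache()
cache.get_or_generate("gpt-4", "Python decorators")
cache.get_or_generate("gpt-4", "Python decorators")              # same key: served from cache
cache.get_or_generate("gpt-4", "Python decorators", revision=2)  # new revision: regenerated
print(cache.api_calls)  # 2
```

Changing the model or the topic produces a new key in the same way, which matches the note that the cache is specific to each model and each topic.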

From outlines

You can also generate a book from an existing outline by creating a jsonl file with the following fields:

  • topic - The topic/title of the book
  • outline - The outline of the book, as a flat json list. This needs to be in a specific format; see "Clean tables of contents" below.
  • queries - Up to 2 search queries to use for retrieval. If you don't want to use retrieval, set this to an empty list.
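
A record with these fields could be assembled as in the sketch below. The field names and the two-query limit come from the list above; make_outline_record is a hypothetical helper, and the outline entries are placeholders, since the required outline format is the one produced by the table-of-contents cleaner.

```python
import json


def make_outline_record(topic: str, outline: list[str], queries: list[str]) -> dict:
    """Build one record for the outline jsonl file, enforcing the two-query limit."""
    if len(queries) > 2:
        raise ValueError("at most 2 search queries are allowed")
    return {"topic": topic, "outline": outline, "queries": queries}


record = make_outline_record(
    topic="Intro to Python",
    outline=["placeholder entry 1", "placeholder entry 2"],  # use cleaned TOC entries here
    queries=[],  # empty list: no retrieval
)
line = json.dumps(record)  # one record per line in the jsonl file
print(line)
```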

Clean tables of contents

This will take in a jsonl file with an existing table of contents and title, and process it into the correct format for book generation.

Usage example:

python toc_cleaner.py toc.jsonl clean_toc.jsonl

toc.jsonl should have the following fields in each line:

  • title - The title of the book
  • toc - a string containing the table of contents. This can be poorly formatted
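
For reference, an input file with those two fields can be written as in this sketch. write_toc_jsonl is a hypothetical helper, the sample book is made up, and the file goes to a temporary directory so the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path


def write_toc_jsonl(path: Path, books: list[dict]) -> None:
    """Write one JSON object per line with the title and raw toc string."""
    with path.open("w", encoding="utf-8") as f:
        for book in books:
            f.write(json.dumps({"title": book["title"], "toc": book["toc"]}) + "\n")


# The toc string may be messy; newlines are escaped by json.dumps, so each record stays on one line.
books = [
    {"title": "Intro to Python", "toc": "Chapter 1 Getting Started ..... 3\nChapter 2   Variables.. 10"},
]

with tempfile.TemporaryDirectory() as tmp:
    toc_path = Path(tmp) / "toc.jsonl"
    write_toc_jsonl(toc_path, books)
    lines = toc_path.read_text(encoding="utf-8").splitlines()

print(len(lines))
```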

Extending

You can extend this to add in new LLM adaptors, retrieval methods, or tasks. PRs are very welcome.

  • LLM adaptors are in app/llm/adaptors
  • Retrieval methods are in app/services/adaptors. You may also need to adjust settings in services/generators/pdf.py
  • Tasks are in app/llm/generators

Debugging

By default, a lot of exceptions will be hidden to avoid console noise. Use DEBUG=true to display them, like this:

DEBUG=true python book_generator.py python_topics.json books.jsonl --max 5 --workers 5
