  • Stars: 195
  • Rank: 199,374 (Top 4%)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 1 year ago
  • Updated: 12 months ago


Repository Details

Generate question/answer training pairs out of raw text.

Question Extractor 🧐

Large language models can be instruction-tuned with a set of questions and answers. However, to further fine-tune a model on your own data, you need a large number of questions and answers about that data, and producing them by hand is a lot of manual work.

This repository lets you use an off-the-shelf, non-fine-tuned language model (ChatGPT) to extract question/answer pairs automatically from existing textual data, eliminating that manual work.

Installation

To run this code, you will need to clone this repository and then install the following Python packages:

  • tiktoken, the OpenAI tokeniser,
  • openai, the official OpenAI API client,
  • langchain, glue code used to combine models and utilities.
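These can be installed with pip, for example:

pip install tiktoken openai langchain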

Usage

This script is designed to turn a folder of markdown (.md) documents into a .json file containing a list of questions, answers and paths to the source documents that were used to produce them.

To run the code, set the relevant file paths in the question_extractor.py file (both the input folder and the output path) and ensure that your OpenAI API key is available in the environment. Then run the script with Python:

python3 question_extractor.py

Once it is done, all questions/answers will be written as a .json file in the output path.
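For illustration, each entry in the output can be expected to look roughly like the following; the field names and the file path shown here are assumptions based on the description above, not the script's guaranteed schema:

[
  {
    "source": "docs/example_page.md",
    "question": "<a question generated from the page>",
    "answer": "<the model's answer to that question>"
  }
]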

Inner-workings

The code loops over all files; for each file, it extracts a list of questions by sending the following prompt followed by a chunk of text:

You are an expert user extracting information to quiz people on documentation. You will be passed a page extracted from the documentation, write a numbered list of questions that can be answered based *solely* on the given text.

It then loops over the questions, producing an answer by passing the following prompt followed by a chunk of text and a question:

You are an expert user answering questions. You will be passed a page extracted from a documentation and a question. Generate a comprehensive and informative answer to the question based *solely* on the given text.
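As a rough sketch of that two-step loop (not the repository's actual code, which goes through langchain; the client calls, model name, and helper names below are illustrative assumptions):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXTRACTION_PROMPT = "You are an expert user extracting information to quiz people on documentation. ..."
ANSWERING_PROMPT = "You are an expert user answering questions. ..."

def ask_model(system_prompt, user_content, model="gpt-3.5-turbo"):
    # one chat completion call: the instruction goes in as the system message,
    # the chunk of text (plus, for answers, the question) as the user message
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content

def extract_questions(chunk):
    # the model returns a numbered list; keep one question per non-empty line
    raw = ask_model(EXTRACTION_PROMPT, chunk)
    return [line.strip() for line in raw.splitlines() if line.strip()]

def answer_question(chunk, question):
    # the answer is generated from the same chunk that produced the question
    return ask_model(ANSWERING_PROMPT, f"{chunk}\n\nQuestion: {question}")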

Most of the actual logic of the code is dedicated to processing the files concurrently (for speed) and ensuring that the text chunks passed to the model are small enough to leave enough tokens for the answer.
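The token-budget check can be done with tiktoken along these lines (the budget value and model name here are assumptions, not the script's actual settings):

import tiktoken

# tokeniser matching the target chat model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_in_budget(text, max_input_tokens=2000):
    # keep the chunk small enough that the rest of the context window
    # stays available for the prompt and the generated answer
    return len(encoding.encode(text)) <= max_input_tokens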

If a text is too long to be sent to the model, it is split along its highest markdown heading level (the process can be repeated recursively if needed until we get down to single paragraphs).
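A minimal sketch of that recursive splitting (illustrative only; the repository's implementation may differ) could look like:

import re

def split_markdown(text, is_too_long, level=1):
    # recursively split a markdown text on its highest heading level,
    # moving to deeper heading levels and finally to single paragraphs
    if not is_too_long(text):
        return [text]
    if level > 6:
        # no heading levels left: fall back to splitting on paragraphs
        return [p for p in text.split("\n\n") if p.strip()]
    heading = re.compile(rf"^{'#' * level} ", flags=re.MULTILINE)
    pieces = [p for p in heading.split(text) if p.strip()]
    if len(pieces) <= 1:
        # no heading at this level in the text: try the next level down
        return split_markdown(text, is_too_long, level + 1)
    chunks = []
    for piece in pieces:
        chunks.extend(split_markdown(piece, is_too_long, level + 1))
    return chunks

# for example, reusing the budget check from the previous sketch:
# chunks = split_markdown(document_text, lambda t: not fits_in_budget(t))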

Performance-wise, this script can process the full NERSC documentation in 6 minutes¹, turning 318 markdown files into 8005 questions for $29.

Potential improvements

  • make it possible to use GPT-4 for the question answering, improving the quality of the answers at the cost of a slower runtime and significantly increased costs
  • save intermediate results to be able to restart an interrupted job
  • use the OpenAI client directly, instead of Langchain, to reduce dependencies

Footnotes

  1. Running at about 93% of the model's rate limit.

More Repositories

  1. fastai-extensions-repository: A list of extensions for the fastai library. (160 stars)
  2. friedrich: A Rust implementation of Gaussian Process regression. (Rust, 56 stars)
  3. Simplers: Rust implementation of the Simple(x) Global Optimization algorithm. (Rust, 31 stars)
  4. flaxOptimizers: A collection of optimizers, some arcane others well known, for Flax. (Python, 29 stars)
  5. AdaHessianJax: Jax implementation of the AdaHessian optimizer. (Python, 19 stars)
  6. impersonator: Chat with an AI simulation of anyone as easily as copy-pasting text into a folder! (Python, 19 stars)
  7. ManifoldMixupV2: Manifold-Mixup implementation for fastai V2. (Jupyter Notebook, 17 stars)
  8. GPTranslate: Translate any text using GPT. (Python, 14 stars)
  9. jochastic: A JAX implementation of stochastic addition. (Python, 12 stars)
  10. shaman: Evaluate the numerical accuracy of an application (mirror of the GitLab main repo). (C++, 11 stars)
  11. text_rewritter: Rewrite large texts using deep learning. (Python, 7 stars)
  12. tabularGP: Gaussian process applied to tabular models (implemented with fastai). (Python, 7 stars)
  13. stop_word: Huggingface transformers stopping criteria that halts generation when a given stop word is encountered. (Python, 7 stars)
  14. gambit: Symbolic regression using Monte-Carlo tree search. (Rust, 7 stars)
  15. stochastorch: A PyTorch implementation of stochastic addition. (Python, 5 stars)
  16. permutationImportance: Feature importance by the permutation method (for fastai V1). (Python, 5 stars)
  17. chromosomes: Classify pictures of chromosomes. (Python, 4 stars)
  18. xmap: Alternative xmap implementation for JAX. (Python, 4 stars)
  19. pandas2numpy: Dataframe to tensor converter for deep learning. (Python, 3 stars)
  20. pairArithmetic: A very simple, header-only implementation of pair arithmetic, meant for when a C++ program needs a local boost in accuracy but cannot easily be refactored to be more numerically stable. (C++, 3 stars)
  21. jax_nersc_distributed_demo: Distributed JAX at NERSC (demo). (Python, 3 stars)
  22. flaxOptimizersBenchmark: Benchmarking code to evaluate Flax optimizers. (Python, 2 stars)
  23. artificialartist: A set of scripts I use with Stable Diffusion. (Python, 1 star)
  24. nestordemeure.github.io: My blog. (JavaScript, 1 star)
  25. GoogleHashCode2017: Google Hashcode 2017 (team HTAG). (F#, 1 star)
  26. kd_optimization: A blackbox optimization algorithm using kd-tree space partitioning and Monte-Carlo tree search. (Rust, 1 star)
  27. letMeNERSCthatForYou: A custom-made documentation bot for NERSC. (Python, 1 star)
  28. fastshap: A wrapper for using SHAP in fastai2. (Jupyter Notebook, 1 star)
  29. awesome-rust-floats: Curated list of Rust floating point crates and utilities. (1 star)
  30. paretoFront: Rust library to build a Pareto front incrementally. (Rust, 1 star)