• This repository has been archived on 23/Feb/2024

Repository Details

A proof of concept tool for using ChatGPT to transform messy text documents into structured JSON

GPT Document Extraction

This is a proof-of-concept for using ChatGPT to extract structured data from messy text documents like scanned/OCR'd PDFs and difficult forms.

It works by asking ChatGPT to turn text documents (found in an input JSON file or a text file) into a JSON record that matches a given JSON Schema specification.

If your input data is a text file where each line is a document, you can use the script like this:

./gpt-extract.py --input-type text infile.txt schema.json output.json

This extracts each line of infile.txt using schema.json and writes the extracted data to output.json. You can find an example JSON schema below in the "JSON schema file" section.
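To illustrate the one-document-per-line convention, here is a minimal sketch that reads such a file into a list of documents. The helper name `read_text_documents` is hypothetical, and skipping blank lines is an assumption made for this example rather than confirmed behavior of gpt-extract.py.

```python
from pathlib import Path


def read_text_documents(path):
    """Return one document per non-empty line of a text file.

    Hypothetical helper: skipping blank lines is an assumption,
    not something gpt-extract.py is known to do.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `read_text_documents("infile.txt")` would give a list with one string per document, in file order.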

If your input data is JSON, you'll need to tell the script how to find the documents (and, optionally, how to find a unique ID for each record). The only supported JSON layout is a list of JSON objects. Your JSON input data should look something like this:

[{
  "id": 1,
  "doc": "My text here..."
}, {
  "id": 2,
  "doc": "Another record..."
}]

You can run the script like this:

./gpt-extract.py --input-type json --keydoc doc --keyid id infile.json schema.json output.json
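The role of the two key options can be sketched as follows. This is an illustration only: the function `iter_documents` is hypothetical, and falling back to the list position when no ID key is given is an assumption, not documented behavior of the script.

```python
import json


def iter_documents(records, keydoc, keyid=None):
    """Yield (record_id, text) pairs from a list of JSON objects.

    Hypothetical helper: using the list position as the ID when
    keyid is None is an assumption made for this sketch.
    """
    if not isinstance(records, list):
        raise ValueError("input JSON must be a list of objects")
    for position, record in enumerate(records):
        record_id = record[keyid] if keyid is not None else position
        yield record_id, record[keydoc]


# The sample input shown above, parsed from a string.
sample = json.loads(
    '[{"id": 1, "doc": "My text here..."},'
    ' {"id": 2, "doc": "Another record..."}]'
)
pairs = list(iter_documents(sample, keydoc="doc", keyid="id"))
```

Here `keydoc="doc"` and `keyid="id"` play the same role as `--keydoc doc` and `--keyid id` on the command line.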

Note that the output file (output.json), if it exists, needs to be valid JSON (not a blank file) as the script will attempt to load it and continue where the extraction left off.
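Before a long run, it can be worth checking that an existing output file will load. The sketch below uses only the standard library; the function name `output_is_resumable` is hypothetical, and it checks only that the file parses as JSON, since the structure the script expects is not described here.

```python
import json
from pathlib import Path


def output_is_resumable(path):
    """Return True if the path is absent or holds parseable JSON.

    A blank file is not valid JSON, so it returns False.
    """
    p = Path(path)
    if not p.exists():
        return True
    try:
        json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return True
```

If this returns False for output.json, deleting or repairing the file before running the script avoids a load error at startup.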

Setup

This repo depends on ChatGPT-wrapper, which is included as a submodule of this repo. Clone this repo like:

git clone --recurse-submodules https://github.com/brandonrobertz/chatgpt-document-extraction
cd chatgpt-document-extraction

If you've already cloned the repo you can get and/or update the submodule with this:

git submodule update --init --recursive

Then install ChatGPT-wrapper and set up Playwright:

cd chatgpt-wrapper/
pip install .
cd ..
playwright install

You need to log in, so run the following command and log into ChatGPT:

chatgpt install

Extraction

Once you're set up, you can extract structured data:

./gpt-extract.py --headless --input-type text infile.txt schema.json output.json

Input data spec

You can provide one of two options:

  1. a text file with one record per line (--input-type txt)
  2. a JSON file with an array of objects (--input-type json). You can specify which keys to use with the --keydoc and --keyid options, which tell the script how to find the document text and the record ID.

JSON schema file

You need to provide a JSON Schema file that will instruct ChatGPT how to transform the input text. Here's an example that I used:

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "name of person this document is from": {
      "type": "string"
    },
    "name of person this document is written to": {
      "type": "string"
    },
    "name of person this document is about": {
      "type": "string"
    },
    "violation": {
      "type": "string"
    },
    "outcome": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "summary": {
      "type": "string"
    }
  }
}

It can be helpful to name the fields in descriptive ways that ChatGPT can use to figure out what to extract.
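Since ChatGPT's output is not guaranteed to follow the schema, a quick check of each extracted record can help. The sketch below is a simplified check written for this example, not a full JSON Schema validator and not part of the script: the function `check_record` is hypothetical and only handles string-typed properties and "additionalProperties": false, the two features the example schema uses. A dedicated library such as jsonschema would be more thorough.

```python
def check_record(record, schema):
    """Return a list of problems found in an extracted record.

    Covers only the features used in the example schema:
    string-typed properties and "additionalProperties": false.
    """
    problems = []
    properties = schema.get("properties", {})
    if schema.get("additionalProperties") is False:
        for key in record:
            if key not in properties:
                problems.append(f"unexpected field: {key}")
    for key, spec in properties.items():
        is_string_field = spec.get("type") == "string"
        if key in record and is_string_field and not isinstance(record[key], str):
            problems.append(f"field is not a string: {key}")
    return problems


# A cut-down version of the example schema, for illustration.
schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "violation": {"type": "string"},
        "date": {"type": "string"},
    },
}
good = check_record({"violation": "speeding", "date": "2021-03-01"}, schema)
bad = check_record({"violation": 7, "officer": "unknown"}, schema)
```

An empty list means the record passed these two checks; anything else names the field that needs a second look.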
