  • Stars: 199
  • Rank 194,915 (Top 4 %)
  • Language: Python
  • Created over 2 years ago
  • Updated over 2 years ago

Repository Details

Custom recipe and utilities for document processing

πŸͺ spaCy Project: Prodigy recipes for document processing and layout understanding

This repository contains recipes that show how to use Prodigy and Hugging Face for annotating, training, and reviewing document layout datasets. We'll be fine-tuning a LayoutLMv3 model on FUNSD, a dataset of noisy scanned documents.

This also serves as an illustration of how to design document processing solutions. I attempted to generalize this approach into a framework, which you can read more about on my blog.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| install | Install dependencies |
| hydrate-db | Hydrate the Prodigy database with annotated data from FUNSD |
| review | Review hydrated annotations |
| train | Train FUNSD model |
| qa | Perform QA for the test dataset using a trained model |
| clean-db | Drop all generated Prodigy datasets |
| clean-files | Clean all intermediary files |
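
To make "only re-run if their inputs have changed" concrete, here is a minimal sketch of a checksum-based check. It is a conceptual illustration only, not spaCy's actual implementation; the helper names (`file_digest`, `needs_rerun`, `record`) and the `train.jsonl` file are hypothetical.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def needs_rerun(inputs: list[Path], recorded: dict[str, str]) -> bool:
    """Return True if any input is new or differs from its recorded digest."""
    return any(recorded.get(str(p)) != file_digest(p) for p in inputs)


def record(inputs: list[Path]) -> dict[str, str]:
    """Snapshot the current digests, as one would after a successful run."""
    return {str(p): file_digest(p) for p in inputs}


with tempfile.TemporaryDirectory() as tmp:
    train_data = Path(tmp) / "train.jsonl"  # hypothetical input file
    train_data.write_text('{"text": "first version"}\n')

    state: dict[str, str] = {}
    first_check = needs_rerun([train_data], state)   # nothing recorded yet
    state = record([train_data])
    second_check = needs_rerun([train_data], state)  # input unchanged
    train_data.write_text('{"text": "edited"}\n')
    third_check = needs_rerun([train_data], state)   # contents changed

print(first_check, second_check, third_check)  # True False True
```

The same idea explains why editing a command's input file causes that command to run again on the next invocation, while an untouched input lets it be skipped.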

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | install → hydrate-db → train |
| clean-all | clean-db → clean-files |
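
As a rough sketch of what "run the specified commands in order" amounts to, the snippet below expands a workflow name into one `spacy project run` invocation per step. The workflow names and steps come from the table above; the `WORKFLOWS` mapping and the `expand` helper are hypothetical and exist only for illustration.

```python
from __future__ import annotations

# Workflow names and their steps, as listed in this README.
WORKFLOWS = {
    "all": ["install", "hydrate-db", "train"],
    "clean-all": ["clean-db", "clean-files"],
}


def expand(workflow: str) -> list[list[str]]:
    """Return one `spacy project run` argument list per step, in order."""
    return [["spacy", "project", "run", step] for step in WORKFLOWS[workflow]]


for argv in expand("all"):
    print(" ".join(argv))
```

In practice you would simply call spacy project run all. If you wanted to drive the steps from a script instead, each argument list could be passed to `subprocess.run(argv, check=True)` from the project directory.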

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/funsd.zip | URL | FUNSD dataset - noisy scanned documents for layout understanding |
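
Once the asset has been fetched, it can help to check what the archive contains before running the other commands. The sketch below uses only the standard library. The internal layout of funsd.zip is not described here, so the demo builds a stand-in archive with made-up member names; `list_archive` is a hypothetical helper.

```python
from __future__ import annotations

import tempfile
import zipfile
from pathlib import Path


def list_archive(path: Path) -> list[str]:
    """Return the sorted member names of a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return sorted(zf.namelist())


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive; these member names are placeholders, not FUNSD's layout.
    demo = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo, "w") as zf:
        zf.writestr("images/page_0.png", b"")
        zf.writestr("annotations/page_0.json", "{}")
    members = list_archive(demo)

print(members)
```

For the real asset, you would call `list_archive(Path("assets/funsd.zip"))` from the project directory after running spacy project assets.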
