  • Stars: 388
  • Rank: 110,734 (Top 3 %)
  • Language: Python
  • License: Apache License 2.0
  • Created over 6 years ago
  • Updated 6 months ago

Repository Details

a Deep Learning Framework for Text https://delft.readthedocs.io/

DeLFT

DeLFT (Deep Learning Framework for Text) is a Keras and TensorFlow framework for text processing, focusing on sequence labeling (e.g. named entity tagging, information extraction) and text classification (e.g. comment classification). This library re-implements standard state-of-the-art Deep Learning architectures relevant to text processing tasks.

DeLFT has three main purposes:

  1. Covering text and rich texts: most existing Deep Learning work in NLP only considers simple text as input. In addition to simple text, we also target rich text, where tokens are associated with layout information (font, style, etc.), positions in structured documents, and possibly other lexical or symbolic contextual information. Text usually comes from large documents like PDF or HTML, not just from segments like sentences or paragraphs, and contextual features appear very useful. Rich text is the most common textual content used by humans to communicate and work.

  2. Reproducibility and benchmarking: by implementing several reference/state-of-the-art models for both sequence labeling and text classification tasks, we want to make it easy to validate reported results and to benchmark several methods under the same conditions and criteria.

  3. Production level: by offering optimized performance, robustness and integration possibilities, we aim at supporting better engineering decisions/trade-offs and successful production-level applications.

Some contributions include:

  • A variety of modern NLP architectures and tasks to be used following the same API and input formats, including RNN, ELMo and transformers.

  • Reduction of the size of RNN models, in particular by removing word embeddings from them. For instance, the model for the toxic comment classifier went down from 230 MB with embeddings to 1.8 MB. In practice, all DeLFT models are smaller than 2 MB, except for the OntoNotes 5.0 NER model, which is 4.7 MB.

  • Implementation of generic support for categorical features, available in various architectures.

  • Use of a dynamic data generator, so that the training data do not need to fit entirely in memory.

  • Efficient loading and management of an unlimited volume of static pre-trained embeddings.

  • A comprehensive evaluation framework with the standard metrics for sequence labeling and classification tasks, including n-fold cross validation.

  • Integration of HuggingFace transformers as Keras layers.
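The dynamic data generator mentioned in the list above can be illustrated with a minimal sketch. This is not DeLFT's actual generator: the function names and the tab-separated `text<TAB>label` file format are assumptions made for illustration only. It shows the general idea of reading examples lazily and grouping them into batches, so that only one batch is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# (text, label) pair; a hypothetical example format, not DeLFT's own.
Example = Tuple[str, str]


def read_examples(path: str) -> Iterator[Example]:
    """Lazily read 'text<TAB>label' lines, one example at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            text, label = line.split("\t", 1)
            yield text, label


def iter_batches(examples: Iterable[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield successive batches without materializing the whole dataset."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

A training loop would then iterate with something like `for batch in iter_batches(read_examples("train.tsv"), 32): ...`, where `train.tsv` is a placeholder file name. In a Keras setting, the same idea is usually wrapped in a batch-serving data class rather than a plain generator.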

A native Java integration of the library has been realized in GROBID via JEP.

The latest DeLFT release has been tested successfully with Python 3.8 and TensorFlow 2.9.3. As always, GPU(s) are required for decent training time. A GeForce GTX 1050 Ti (4GB), for instance, is fine for running RNN models and BERT or RoBERTa base models. Using a BERT large model is possible from a GeForce GTX 1080 Ti (11GB) upward, with a modest batch size.
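Since training time depends on GPU availability, it can be useful to check what TensorFlow sees before starting a long run. The following is a small sketch, not part of DeLFT; the helper name `visible_gpus` is hypothetical, and it assumes TensorFlow 2.x, where `tf.config.list_physical_devices` is available. If TensorFlow is not installed, it simply reports no GPU.

```python
from typing import List


def visible_gpus() -> List[str]:
    """Return the names of GPUs visible to TensorFlow, or [] if there are none
    or TensorFlow cannot be imported."""
    try:
        import tensorflow as tf  # assumed to be installed alongside DeLFT
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print(f"{len(gpus)} GPU(s) visible: {gpus}")
    else:
        print("No GPU visible; training is likely to be slow.")
```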

DeLFT Documentation

Visit the DeLFT documentation for detailed information on installation, usage and models.

Using DeLFT

PyPI packages are available for stable versions. The latest stable version is 0.3.3:

pip install delft==0.3.3
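After installing, one way to confirm which version ended up in the active environment is to query the package metadata from Python. This sketch uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical, and the distribution name `delft` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("delft")
    print(f"delft version: {found or 'not installed'}")
```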

DeLFT Installation

To install DeLFT and use the current master version, clone the GitHub repository:

git clone https://github.com/kermitt2/delft
cd delft

It is advised to first set up a virtual environment to avoid falling into one of these gloomy Python dependency marshlands:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install the dependencies:

pip3 install -r requirements.txt

Finally, install the project, preferably in editable mode:

pip3 install -e .

See the DeLFT documentation for usage.

License and contact

Distributed under the Apache 2.0 license. The dependencies used in the project are either themselves distributed under the Apache 2.0 license or under a compatible license.

If you contribute to DeLFT, you agree to share your contribution following these licenses.

Contact: Patrice Lopez ([email protected]) and Luca Foppiano (@lfoppiano).

How to cite

If you want to cite this work, please refer to the present GitHub project, together with the Software Heritage project-level permanent identifier. For example, with BibTeX:

@misc{DeLFT,
    title = {DeLFT},
    howpublished = {\url{https://github.com/kermitt2/delft}},
    publisher = {GitHub},
    year = {2018--2023},
    archivePrefix = {swh},
    eprint = {1:dir:54eb292e1c0af764e27dd179596f64679e44d06e}
}
