  • Stars: 427
  • Rank: 101,680 (Top 3 %)
  • Language: Python
  • License: GNU General Publi...
  • Created over 4 years ago
  • Updated over 1 year ago

Repository Details

Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.

TransformerSum

TransformerSum is a library that aims to make it easy to train, evaluate, and use machine learning transformer models that perform automatic summarization. It features tight integration with huggingface/transformers which enables the easy usage of a wide variety of architectures and pre-trained models. There is a heavy emphasis on code readability and interpretability so that both beginners and experts can build new components. Both the extractive and abstractive model classes are written using pytorch_lightning, which handles the PyTorch training loop logic, enabling easy usage of advanced features such as 16-bit precision, multi-GPU training, and much more. TransformerSum supports both the extractive and abstractive summarization of long sequences (4,096 to 16,384 tokens) using the longformer (extractive) and LongformerEncoderDecoder (abstractive), which is a combination of BART and the longformer. TransformerSum also contains models that can run on resource-limited devices while still maintaining high levels of accuracy. Models are automatically evaluated with the ROUGE metric but human tests can be conducted by the user.

Check out the documentation for usage details.

Features

  • For extractive summarization, compatible with every huggingface/transformers transformer encoder model.

  • For abstractive summarization, compatible with every huggingface/transformers EncoderDecoder and Seq2Seq model.

  • Currently, 10+ pre-trained extractive models are available, trained on 3 datasets (CNN-DM, WikiHow, and ArXiv-PubMed).

  • Contains pre-trained models that excel at summarization on resource-limited devices: On CNN-DM, mobilebert-uncased-ext-sum achieves about 97% of the performance of BertSum while containing 4.45 times fewer parameters. It achieves about 94% of the performance of MatchSum (Zhong et al., 2020), the current extractive state-of-the-art.

  • Contains code to train models that excel at summarizing long sequences: The longformer (extractive) and LongformerEncoderDecoder (abstractive) can summarize sequences of lengths up to 4,096 tokens by default, but can be trained to summarize sequences of more than 16k tokens.

  • Integration with huggingface/nlp means any summarization dataset in the nlp library can be used for both abstractive and extractive training.

  • "Smart batching" (extractive) and trimming (abstractive) support to not perform unnecessary calculations (speeds up training).

  • Use of pytorch_lightning for code readability.

  • Extensive documentation.

  • Three pooling modes (convert word vectors to sentence embeddings): mean or max of word embeddings in addition to the CLS token.
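
The three pooling modes in the last feature can be illustrated with a minimal sketch. This is not TransformerSum's code: the function name `pool_sentence` and the mode strings "cls", "mean", and "max" are hypothetical, the first vector of each sentence is assumed to be its CLS token, and plain Python lists stand in for the tensors a real model would use.

```python
from typing import List, Sequence

Vector = List[float]


def pool_sentence(token_vectors: Sequence[Vector], mode: str = "cls") -> Vector:
    """Collapse one sentence's token vectors into a single sentence embedding.

    Hypothetical helper for illustration; mode names are not the library's.
    """
    if not token_vectors:
        raise ValueError("a sentence needs at least one token vector")
    if mode == "cls":
        # Assumption: the first vector belongs to the sentence's CLS token.
        return list(token_vectors[0])
    # Transpose so that each tuple holds one embedding dimension across tokens.
    columns = list(zip(*token_vectors))
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Three tokens with 2-dimensional embeddings.
sentence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool_sentence(sentence, "cls"))   # [1.0, 4.0]
print(pool_sentence(sentence, "mean"))  # [2.0, 2.0]
print(pool_sentence(sentence, "max"))   # [3.0, 4.0]
```

In a real model the same reductions would typically be tensor operations over the encoder's output, applied per sentence.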

Pre-trained Models

All pre-trained models (including larger models and other architectures) are located in the documentation. The models below are a fraction of those available.

Extractive

| Name | Dataset | Comments | R1/R2/RL/RL-Sum | Model Download | Data Download |
|---|---|---|---|---|---|
| mobilebert-uncased-ext-sum | CNN/DM | None | 42.01/19.31/26.89/38.53 | Model | CNN/DM Bert Uncased |
| distilroberta-base-ext-sum | CNN/DM | None | 42.87/20.02/27.46/39.31 | Model | CNN/DM Roberta |
| roberta-base-ext-sum | CNN/DM | None | 43.24/20.36/27.64/39.65 | Model | CNN/DM Roberta |
| mobilebert-uncased-ext-sum | WikiHow | None | 30.72/8.78/19.18/28.59 | Model | WikiHow Bert Uncased |
| distilroberta-base-ext-sum | WikiHow | None | 31.07/8.96/19.34/28.95 | Model | WikiHow Roberta |
| roberta-base-ext-sum | WikiHow | None | 31.26/9.09/19.47/29.14 | Model | WikiHow Roberta |
| mobilebert-uncased-ext-sum | arXiv-PubMed | None | 33.97/11.74/19.63/30.19 | Model | arXiv-PubMed Bert Uncased |
| distilroberta-base-ext-sum | arXiv-PubMed | None | 34.70/12.16/19.52/30.82 | Model | arXiv-PubMed Roberta |
| roberta-base-ext-sum | arXiv-PubMed | None | 34.81/12.26/19.65/30.91 | Model | arXiv-PubMed Roberta |
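
The R1 and R2 columns above report ROUGE-1 and ROUGE-2 scores. As a rough illustration of what those numbers measure, here is a simplified ROUGE-N F1 sketch. It is not the evaluation code used by TransformerSum: it assumes lowercased whitespace tokenization with no stemming, returns a value between 0 and 1 (the table values appear to be on a 0 to 100 scale), and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: clipped n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n_f1(candidate, reference, n=1), 3))  # 0.833
print(round(rouge_n_f1(candidate, reference, n=2), 3))  # 0.6
```

Published ROUGE implementations add details such as stemming and the longest-common-subsequence variants behind RL and RL-Sum, so scores from this sketch are not comparable to the table.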

Abstractive

| Name | Dataset | Comments | Model Download |
|---|---|---|---|
| longformer-encdec-8192-bart-large-abs-sum | arXiv-PubMed | None | Not yet... |

Install

Installation is made easy by conda environments. Simply run this command from the root project directory: conda env create --file environment.yml and conda will create an environment called transformersum with all the required packages from environment.yml. The spacy en_core_web_sm model is required for the convert_to_extractive.py script to detect sentence boundaries.

Step-by-Step Instructions

  1. Clone this repository: git clone https://github.com/HHousen/transformersum.git.
  2. Change to project directory: cd transformersum.
  3. Run installation command: conda env create --file environment.yml.
  4. (Optional) If using the convert_to_extractive.py script then download the en_core_web_sm spacy model: python -m spacy download en_core_web_sm.

Meta

Hayden Housen – haydenhousen.com

Distributed under the GNU General Public License v3.0. See the LICENSE for more information.

https://github.com/HHousen

Attributions

Contributing

All Pull Requests are greatly welcomed.

Questions? Comments? Issues? Don't hesitate to open an issue and briefly describe what you are experiencing (with any error logs if necessary). Thanks.

  1. Fork it (https://github.com/HHousen/TransformerSum/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request

More Repositories

  1. DocSum (Python, 66 stars): A tool to automatically summarize documents abstractively using the BART or PreSumm Machine Learning Model.
  2. PicoCTF-2021 (C, 51 stars): Hayden Housen's solutions to the 2021 PicoCTF Competition
  3. speaker-change-detection (Jupyter Notebook, 43 stars): Speaker change detection using SincNet and an LSTM/Transformer
  4. lecture2notes (Jupyter Notebook, 38 stars): Convert lecture videos to notes using AI & machine learning. Code for the research titled "Lecture2Notes: Summarizing Lecture Videos by Classifying Slides and Analyzing Text using Machine Learning."
  5. dotfiles (Shell, 19 stars): HHousen's dotfiles: Zsh, Chezmoi, Antigen, Oh My Zsh, Powerlevel10k, Oh My Tmux, GEF, and the ultimate vimrc
  6. HTB-CyberSanta-2021 (Python, 19 stars): Hayden Housen's solutions to the 2021 HackTheBox "Cyber Santa is Coming to Town" Competition
  7. hack-the-box (JavaScript, 15 stars): HHousen's writeups to various HackTheBox machines and challenges from https://hackthebox.com.
  8. object-discovery-pytorch (Python, 13 stars): An implementation of several unsupervised object discovery models (Slot Attention, SLATE, GNM) in PyTorch with pre-trained models.
  9. PicoCTF-2019 (Python, 10 stars): Hayden Housen's solutions to the 2019 PicoCTF Competition
  10. ArXiv-PubMed-Sum (Python, 9 stars): A script to process the ArXiv-PubMed dataset.
  11. PicoCTF-2022 (Python, 9 stars): Hayden Housen's solutions to the 2022 PicoCTF Competition
  12. ai-respiratory-doctor-electron (CSS, 7 stars): A desktop app built with Electron that mimics the AI Respiratory Doctor web app.
  13. NCS-Competition (Python, 5 stars): Hayden Housen's solutions to the 2021 National Cyber Scholarship and Cyber FastTrack Competitions
  14. try-hack-me (4 stars): HHousen's writeups to various TryHackMe machines and challenges from https://tryhackme.com.
  15. advent-of-code-2021 (Python, 4 stars): HHousen's solutions to the 2021 Advent Of Code puzzles at https://adventofcode.com/2021.
  16. GPT-Impostor (Python, 4 stars): Impersonate your friends on Discord using the latest research in AI and machine learning.
  17. fruit-classifier-app-flask (Python, 4 stars): A simple fruit classifier built with fastai and flask
  18. willihaveasnowday (SCSS, 3 stars): A website that predicts the chance of a snow day automatically by using AI and machine learning.
  19. advent-of-code-2022 (Python, 3 stars): HHousen's solutions to the 2022 Advent Of Code puzzles at https://adventofcode.com/2022.
  20. fruit-classifier-app-node (JavaScript, 3 stars): A simple web app to classify fruits using the fastai library. Built with NodeJS with a Python layer for the model. Includes an extremely basic front-end built with Pug.
  21. HHousen (3 stars): HHousen Personal Repository
  22. advent-of-code-2020 (Python, 3 stars): HHousen's solutions to the 2020 Advent Of Code puzzles at https://adventofcode.com/2020.
  23. ai-respiratory-doctor (Jupyter Notebook, 2 stars): A flask web app template for use with machine learning projects. Follows best practices such as application factories and Pipfiles.
  24. resume (HTML, 1 star)
  25. hh-personal (JavaScript, 1 star): The hugo configuration for the Hayden Housen personal website.
  26. hhousen.github.io (1 star): GitHub Hayden Housen Website - redirects to haydenhousen.com
  27. learning-seq2seq (Python, 1 star): My work through the bentrevett/pytorch-seq2seq tutorial.