Language: Python
License: Apache License 2.0

Models for automatic abstractive summarization

summarus

Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

You can also check out the MBART-based Russian summarization model on Hugging Face: mbart_ru_sum_gazeta

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

Argument  Required  Description
-c        true      path to file with configuration
-s        true      path to directory where model will be saved
-t        true      path to train dataset
-v        true      path to val dataset
-r        false     recover from checkpoint
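
As an illustration, the flags listed for train.sh can be assembled into a call from Python. This is a minimal sketch: the helper name build_train_command and all paths are hypothetical, and -r is assumed to be a bare switch that takes no value.

```python
import shlex


def build_train_command(config_path, model_dir, train_path, val_path, recover=False):
    """Assemble a train.sh invocation from the documented flags."""
    cmd = [
        "./train.sh",
        "-c", config_path,  # configuration file
        "-s", model_dir,    # directory where the model will be saved
        "-t", train_path,   # train dataset
        "-v", val_path,     # val dataset
    ]
    if recover:
        cmd.append("-r")  # assumption: -r is a switch without a value
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_train_command(
    "configs/pgn.json", "models/pgn", "data/train.jsonl", "data/val.jsonl"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root to start training.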

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

Argument  Required  Default  Description
-t        true               path to test dataset
-m        true               path to tar.gz archive with model
-p        true               name of Predictor
-c        false     0        CUDA device
-L        true               language ("ru" or "en")
-b        false     32       size of a batch with test examples to run simultaneously
-M        false              path to meteor.jar for Meteor metric
-T        false              tokenize gold and predicted summaries before metrics calculation
-D        false              save temporary files with gold and predicted summaries
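
To make the effect of -T concrete, here is a small sketch of an n-gram F-measure of the kind reported as R-1-f and R-2-f in the results sections. It is not the metric implementation that predict.sh uses. The function names are illustrative, simple_tokenize is a hypothetical stand-in tokenizer that also lowercases, and scores are on a 0 to 1 scale.

```python
import re
from collections import Counter


def simple_tokenize(text):
    # Hypothetical stand-in tokenizer: lowercase, split words from punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def ngram_counts(tokens, n):
    # Multiset of all n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f(gold_tokens, pred_tokens, n=1):
    """F-measure over clipped n-gram matches between gold and prediction."""
    gold = ngram_counts(gold_tokens, n)
    pred = ngram_counts(pred_tokens, n)
    overlap = sum((gold & pred).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


gold = "Cat sat on the mat."
pred = "cat sat on the mat"

# Whitespace splitting keeps "Cat" and "mat." as distinct tokens.
raw_score = rouge_n_f(gold.split(), pred.split())
# Tokenizing first lets "cat" and "mat" match.
tok_score = rouge_n_f(simple_tokenize(gold), simple_tokenize(pred))
print(f"whitespace split: {raw_score:.3f}, tokenized: {tok_score:.3f}")
```

In this toy case, casing and attached punctuation hide two of the five word matches until the texts are tokenized, which is why the same tokenization setting should be used when comparing models.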

summarus.util.train_subword_model

Script for subword model training.

Argument       Default  Description
--train-path            path to train dataset
--model-path            path to directory where generated subword model will be saved
--model-type   bpe      type of subword model, see sentencepiece
--vocab-size   50000    size of the resulting subword model vocabulary
--config-path           path to file with configuration for DatasetReader (with parse_set)
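
Since the model type refers to sentencepiece, the arguments above could map onto sentencepiece's trainer roughly as follows. This is a sketch under assumptions, not the module's actual code: it assumes the training texts have already been written to a plain-text file with one example per line, and the function names and the model_prefix layout are hypothetical.

```python
import os


def build_sentencepiece_kwargs(text_path, model_path, model_type="bpe", vocab_size=50000):
    """Translate the documented arguments into SentencePieceTrainer keyword arguments."""
    return {
        "input": text_path,
        # Assumption: model files are written inside model_path, named after the model type.
        "model_prefix": os.path.join(model_path, model_type),
        "model_type": model_type,
        "vocab_size": vocab_size,
    }


def train_subword_model(text_path, model_path, **options):
    # sentencepiece is a third-party package; it is imported lazily so that
    # the argument mapping above can be inspected without installing it.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **build_sentencepiece_kwargs(text_path, model_path, **options)
    )


# Hypothetical paths, for illustration only.
kwargs = build_sentencepiece_kwargs("data/train_texts.txt", "models/subword")
print(kwargs)
```

Calling train_subword_model with the same paths would write the .model and .vocab files under the chosen prefix.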

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results

Train dataset: RIA, test dataset: RIA

Model             R-1-f  R-2-f  R-L-f  BLEU
ria_copynet_10kk  40.0   23.3   37.5   -
ria_pgn_24kk      42.3   25.1   39.6   -
ria_mbart         42.8   25.5   39.9   -
First Sentence    24.1   10.6   16.7   -

Train dataset: RIA, eval dataset: Lenta

Model             R-1-f  R-2-f  R-L-f  BLEU
ria_copynet_10kk  25.6   12.3   23.0   -
ria_pgn_24kk      26.4   12.3   24.0   -
ria_mbart         30.3   14.5   27.1   -
First Sentence    25.5   11.2   19.2   -
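
The First Sentence rows are a baseline that, as the name suggests, uses the first sentence of the article as the headline. The README does not say how sentences are split, so the following is only a naive sketch: the function name and the regular expression are assumptions, and real news text with abbreviations or initials would need a proper sentence splitter.

```python
import re


def first_sentence(text):
    """Return the text up to the first '.', '!' or '?' that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]


article = "Prices rose on Monday. Analysts expect more growth."
print(first_sentence(article))
```

A baseline like this needs no training, which makes it a useful floor for judging how much the trained models add.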

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

Model           R-1-f  R-2-f  R-L-f  METEOR  BLEU
cnndm_pgn_25kk  38.5   16.5   33.4   17.6    -

Summarization - Gazeta, Russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

Model                   R-1-f  R-2-f  R-L-f  METEOR  BLEU
gazeta_pgn_7kk          29.4   12.7   24.6   21.2    9.0
gazeta_pgn_7kk_cov      29.8   12.8   25.4   22.1    10.1
gazeta_pgn_25kk         29.6   12.8   24.6   21.5    9.3
gazeta_pgn_words_13kk   29.4   12.6   24.4   20.9    8.9
gazeta_summarunner_3kk  31.6   13.7   27.1   26.0    11.5
gazeta_mbart            32.6   14.6   28.2   25.7    12.4
gazeta_mbart_lower      32.7   14.7   28.3   25.8    12.5

Demo

python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>

Citations

Headline generation (PGN):

@article{Gusev2019headlines,
    author={Gusev, I.O.},
    title={Importance of copying mechanism for news headline generation},
    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
    year={2019},
    volume={2019-May},
    number={18},
    pages={229--236}
}

Headline generation (transformers):

@InProceedings{Bukhtiyarov2020headlines,
    author={Bukhtiyarov, Alexey and Gusev, Ilya},
    title="Advances of Transformer-Based Models for News Headline Generation",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages={54--61},
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_4}
}

Summarization:

@InProceedings{Gusev2020gazeta,
    author="Gusev, Ilya",
    title="Dataset for Automatic Summarization of Russian News",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages="122--134",
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_9}
}
