• Stars: 168
• Rank: 225,507 (top 5%)
• Language: Jupyter Notebook
• License: MIT License
• Created: over 6 years ago
• Updated: about 5 years ago


Repository Details

Python code for various NLP metrics

About

Natural Language Processing Performance Metrics [ppt]

NLP Metrics Timeline

Contents

Requirements • How to Use • Notebooks • Quick Notes • How to Cite

Requirements

Tested on Python 2.7

pip install -r requirements.txt

How to Use

  • Run: python test/test_mt_text_score.py
  • Currently only supporting MT metrics

Notebooks

Metric | Application | Notebook
BLEU | Machine Translation | Jupyter, Colab
GLEU (Google-BLEU) | Machine Translation | Jupyter, Colab
WER (Word Error Rate) | Transcription Accuracy, Machine Translation | Jupyter, Colab
  • TODO:
    • Generalized BLEU (?), METEOR, ROUGE, CIDEr

Evaluation Metrics: Quick Notes

Average precision

  • Macro: average of sentence scores
  • Micro: corpus-level (sums the numerators and denominators over all hypothesis-reference(s) pairs before dividing); see the sketch below

Machine Translation

  1. BLEU (BiLingual Evaluation Understudy)
    • Papineni 2002
    • 'Measures how many words overlap in a given translation when compared to a reference translation, giving higher scores to sequential words.' (recall)
    • A usage sketch with NLTK follows this list.
    • Limitation:
      • Doesn't consider different types of errors (insertions, substitutions, synonyms, paraphrase, stems)
      • Designed to be a corpus measure, so it has undesirable properties when used for single sentences.
  2. GLEU (Google-BLEU)
    • Wu et al. 2016
    • Minimum of BLEU recall and precision applied to 1-, 2-, 3- and 4-grams (see the GLEU sketch after this list)
      • Recall: (number of matching n-grams) / (number of total n-grams in the target)
      • Precision: (number of matching n-grams) / (number of total n-grams in generated sequence)
    • Correlates well with the BLEU metric at the corpus level, but does not share its drawbacks when used as a per-sentence reward objective.
    • Not to be confused with Generalized Language Evaluation Understanding or Generalized BLEU, also known as GLEU
      • Napoles et al. 2015's ACL paper: Ground Truth for Grammatical Error Correction Metrics
      • Napoles et al. 2016: GLEU Without Tuning
        • Minor adjustment required as the number of references increases.
      • A simple variant of BLEU, it hews much more closely to human judgements.
      • "In MT, an untranslated word or phrase is almost always an error, but in GEC, this is not the case."
        • GLEU: "computes n-gram precisions over the reference but assigns more weight to n-grams that have been correctly changed from the source."
      • Python code
  3. WER (Word Error Rate)
    • Levenshtein distance (edit distance) for words: minimum number of edits (insertions, deletions or substitutions) required to change the hypothesis sentence into the reference (a minimal implementation sketch follows this list).
    • Range: minimum of 0 (when hyp = ref), with no upper bound, as an Automatic Speech Recognizer (ASR) can insert an arbitrary number of words
    • $ WER = \frac{S+D+I}{N} = \frac{S+D+I}{S+D+C} $
      • S: number of substitutions, D: number of deletions, I: number of insertions, C: number of correct words, N: number of words in the reference ($N=S+D+C$)
    • WAcc (Word Accuracy) or Word Recognition Rate (WRR): $1 - WER$
    • Limitation: provides no details on the nature of translation errors
      • Different errors are treated equally, even though they might influence the outcome differently (being more disruptive or more difficult/easier to correct).
      • If you look at the formula, there's no distinction between a substitution error and a deletion followed by an insertion error.
    • Possible solution proposed by Hunt (1990):
      • Use of a weighted measure
      • $ WER = \frac{S+0.5D+0.5I}{N} $
      • Problem:
        • The metric is used to compare systems, so it's unclear whether Hunt's formula could be used to assess the performance of a single system
        • It's unclear how effective this measure is in helping a user with error correction
    • See more info
  4. METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  5. TER (Translation Edit Rate)
    • Snover et al. 2006's paper: A study of translation edit rate with targeted human annotation
    • Number of edits (word deletions, additions and substitutions) required to make a machine translation match the closest reference translation exactly in fluency and semantics
    • $TER = \frac{E}{R}$ = (minimum number of edits) / (average length of the reference text)
    • It is generally preferred to BLEU for estimating sentence post-editing effort. Source.
    • PyTER (see the usage sketch after this list)
    • char-TER: character level TER

Summarization

  1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Image Caption Quality

  1. CIDEr (Consensus-based Image Description Evaluation)

Acknowledgement

Please star or fork if this code was useful to you. If you use it in a paper, please cite as:

@software{cunha_sergio2019nlp_metrics,
    author       = {Gwenaelle Cunha Sergio},
    title        = {{gcunhase/NLPMetrics: The Natural Language Processing Metrics Python Repository}},
    month        = oct,
    year         = 2019,
    doi          = {10.5281/zenodo.3496559},
    version      = {v1.0},
    publisher    = {Zenodo},
    url          = {https://github.com/gcunhase/NLPMetrics}
}

More Repositories

1. AMICorpusXML (Python, 52 stars): Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
2. StackedDeBERT (Python, 31 stars): Stacked Denoising BERT for Noisy Text Classification (Neural Networks 2020)
3. Emotional-Video-to-Audio-with-ANFIS-DeepRNN (MATLAB, 25 stars): Emotional Video to Audio Transformation with ANFIS-DeepRNN (Vanilla RNN and LSTM-DeepRNN) [MPE 2020]
4. PaperNotes (Python, 19 stars): Important notes on scientific papers
5. GeneticAlgorithm-SolarCells (MATLAB, 11 stars): Single and Multi-layer Solar Cell Thickness Optimization With Genetic Algorithm (Energies 2020)
6. EmbraceBERT (Shell, 8 stars): Attentively Embracing Noise for Robust Latent Representation in BERT (COLING 2020)
7. AnnotatedMV-PreProcessing (Python, 5 stars): Pre-Processing of Annotated Music Video Corpora (COGNIMUSE and DEAP)
8. ArXivAbsTitleDataset (Python, 5 stars): Extract Abstract and Title Dataset from arXiv articles
9. PhdThesis-LatexTemplate (TeX, 4 stars): LaTeX Template for KNU's PhD Thesis Document
10. ML-Notebook (Jupyter Notebook, 3 stars): Machine Learning Notebook
11. Scene2Wav (Python, 3 stars): Emotional Scene Musicalization with Deep Neural Networks (MTAP 2020)
12. BEP-2014 (MATLAB, 3 stars): Repository referent to the Basic Education Program (2014)
13. FCECorpusXML (Python, 2 stars): Converts FCE Corpus from XML to TXT format
14. SplicerSpectrogram (Python, 1 star): Splices audio files and obtains their spectrogram
15. PreSumm-AMICorpus-DialSum (Python, 1 star): BertSum model fine-tuned with AMI DialSum Corpus (Baseline)
16. GeneticAlgorithm-1D (MATLAB, 1 star): Simple Genetic Algorithm: 1D functions
17. flickrlabs (Objective-C, 1 star): simple project using the public flickr api