language-evaluation (Experimental)

Collection of evaluation code for natural language generation.

Note: the API may change frequently without notice

Metrics

  • CocoEvaluator: coco-caption (BLEU1-4, METEOR, ROUGE, CIDEr, SPICE)
  • RougeEvaluator: sentence-level rouge (ROUGE-1, ROUGE-2, ROUGE-L with f-measure)
  • Rouge155Evaluator: summary-level rouge (ROUGE-1, ROUGE-2, ROUGE-L with f-measure)

Requirements

  • Java 1.8.0+ (used by coco-caption evaluator)
  • Python 3.6+
  • libxml-parser-perl (used by ROUGE.1.5.5.pl)
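
These external dependencies can be sanity-checked before installing. The snippet below is a minimal sketch, not part of the package; it only probes for the java binary and the Perl XML::Parser module:

import shutil
import subprocess

def check_external_deps():
    # Java 1.8.0+ is used by the coco-caption backend (CocoEvaluator).
    print("java found:", shutil.which("java") is not None)

    # libxml-parser-perl provides XML::Parser, used by the ROUGE 1.5.5 Perl script.
    try:
        perl_ok = subprocess.call(
            ["perl", "-MXML::Parser", "-e", "1"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0
    except FileNotFoundError:
        perl_ok = False
    print("XML::Parser available:", perl_ok)

check_external_deps()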

Installation and Usage

Install external dependencies (e.g. Java 1.8.0+, libxml-parser-perl):

# Oracle Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt-get install oracle-java8-installer

# libxml-parser-perl
sudo apt install libxml-parser-perl

Then run:

pip install git+https://github.com/bckim92/language-evaluation.git
python -c "import language_evaluation; language_evaluation.download('coco')"
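
To confirm the installation succeeded, the documented evaluator classes should be importable (a quick check, nothing more):

import language_evaluation

# If the import and attribute lookups succeed, the package is installed.
print(language_evaluation.CocoEvaluator)
print(language_evaluation.RougeEvaluator)
print(language_evaluation.Rouge155Evaluator)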

Python API (or see language_evaluation_test.py):

import language_evaluation
from pprint import PrettyPrinter
pprint = PrettyPrinter().pprint

predicts = ['i am a boy', 'she is a girl']
answers = ['am i a boy ?', 'is she a girl ?']

evaluator = language_evaluation.CocoEvaluator()
results = evaluator.run_evaluation(predicts, answers)
pprint(results)
# {'Bleu_1': 0.9999999997500004,
#  'Bleu_2': 0.5773502690332603,
#  'Bleu_3': 4.3679023223468616e-06,
#  'Bleu_4': 1.4287202142987477e-08,
#  'CIDEr': 3.333333333333333,
#  'METEOR': 0.43354749322305886,
#  'ROUGE_L': 0.75,
#  'SPICE': 0.6666666666666666}

evaluator = language_evaluation.RougeEvaluator(num_parallel_calls=5)
results = evaluator.run_evaluation(predicts, answers)
pprint(results)
# {'rouge1': 1.0,
#  'rouge2': 0.3333333333333333,
#  'rougeL': 0.75}

evaluator = language_evaluation.Rouge155Evaluator(num_parallel_calls=5)
results = evaluator.run_evaluation(predicts, answers)
pprint(results)
# {'rouge1': 1.0,
#  'rouge2': 0.3333333333333333,
#  'rougeL': 0.75}
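
Since every evaluator returns a plain dict of metric name to float, results from several systems can be collected and compared with ordinary Python. The sketch below is illustrative only; the system names and their outputs are made up:

import language_evaluation

answers = ['am i a boy ?', 'is she a girl ?']
systems = {  # hypothetical model outputs to compare
    'baseline': ['i am a boy', 'she is a girl'],
    'model_v2': ['am i a boy ?', 'is she a girl ?'],
}

evaluator = language_evaluation.RougeEvaluator(num_parallel_calls=5)
scores = {name: evaluator.run_evaluation(preds, answers)
          for name, preds in systems.items()}

# Simple comparison table: one row per system, one column per metric.
metrics = sorted(next(iter(scores.values())))
print('{:<10}'.format('system') + ' '.join('{:>8}'.format(m) for m in metrics))
for name, result in scores.items():
    print('{:<10}'.format(name) +
          ' '.join('{:>8.4f}'.format(result[m]) for m in metrics))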

Notes

  • TODOs
    • Support more metrics (e.g. embedding-based)
    • Support command-line interface (a hypothetical sketch follows this list)
    • Support full functionality and configuration for rouge
    • Implement summary-level rouge scorer in pure python
    • Add tests & CI
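
The command-line interface listed in the TODOs could be prototyped as a thin argparse wrapper around the existing Python API. The sketch below is hypothetical: the flag names, the one-sentence-per-line file format, and the script itself are not part of the package:

# cli_sketch.py -- hypothetical, not part of language-evaluation
import argparse
import language_evaluation

def main():
    parser = argparse.ArgumentParser(
        description='Score predictions against references (hypothetical CLI).')
    parser.add_argument('--predicts', required=True,
                        help='file with one predicted sentence per line (assumed format)')
    parser.add_argument('--answers', required=True,
                        help='file with one reference sentence per line (assumed format)')
    parser.add_argument('--evaluator', default='rouge',
                        choices=['coco', 'rouge', 'rouge155'])
    args = parser.parse_args()

    with open(args.predicts) as f:
        predicts = [line.strip() for line in f if line.strip()]
    with open(args.answers) as f:
        answers = [line.strip() for line in f if line.strip()]

    if args.evaluator == 'coco':
        evaluator = language_evaluation.CocoEvaluator()
    elif args.evaluator == 'rouge':
        evaluator = language_evaluation.RougeEvaluator(num_parallel_calls=5)
    else:
        evaluator = language_evaluation.Rouge155Evaluator(num_parallel_calls=5)

    for metric, score in sorted(evaluator.run_evaluation(predicts, answers).items()):
        print('{}\t{:.4f}'.format(metric, score))

if __name__ == '__main__':
    main()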

Related Projects

License

See LICENSE.md.