Interpretable Evaluation for (Almost) All NLP Tasks

by Pengfei Liu, Jinlan Fu, Yang Xiao, Graham Neubig and other contributors.

This project is supported by the following two works:


Final Product: ExplainaBoard (Updating)

Updates:


1. Motivating Questions

  • Performance of many NLP tasks has reached a plateau. What works, and what's next?

  • Is XX a solved task? What's left?

  • A good evaluation metric should not only rank different systems but also reveal their relative strengths and weaknesses.

  • Next-generation leaderboards: can we equip them with powerful analysis abilities?


2. Interpretable Evaluation Methodology

The evaluation methodology generally consists of the following steps.

2.1 Attribute Definition

Taking the NER and CWS tasks as examples, we define 8 attributes for the NER task and 7 attributes for the CWS task.

Id  NER                          CWS
1   Entity Length                Word Length
2   Sentence Length              Sentence Length
3   OOV Density                  OOV Density
4   Token Frequency              Character Frequency
5   Entity Frequency             Word Frequency
6   Label Consistency of Token   Label Consistency of Character
7   Label Consistency of Entity  Label Consistency of Word
8   Entity Density               -
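
To make a few of these attributes concrete, the following Python sketch computes sentence length, entity length, OOV density, and entity density for one BIO-tagged sentence. The definitions used here (OOV density as the fraction of tokens absent from the training vocabulary, entity density as the fraction of tokens inside entities) are assumptions for illustration, and the function names are hypothetical rather than taken from this repository.

```python
from typing import Dict, List, Set, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, type) entity spans from BIO tags."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A "B-" tag always opens a span; a stray "I-" tag opens one too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def sentence_attributes(
    tokens: List[str], tags: List[str], train_vocab: Set[str]
) -> Dict[str, object]:
    """Compute a few illustrative attributes (assumed definitions)."""
    spans = extract_spans(tags)
    n = len(tokens)
    entity_tokens = sum(end - start for start, end, _ in spans)
    oov_tokens = sum(1 for tok in tokens if tok not in train_vocab)
    return {
        "sentence_length": n,
        "entity_lengths": [end - start for start, end, _ in spans],
        "oov_density": oov_tokens / n if n else 0.0,
        "entity_density": entity_tokens / n if n else 0.0,
    }


attrs = sentence_attributes(
    tokens=["John", "Smith", "visited", "Paris"],
    tags=["B-PER", "I-PER", "O", "B-LOC"],
    train_vocab={"visited", "Paris"},
)
```

Frequency and label-consistency attributes would additionally need statistics collected over the training set, which this sketch leaves out.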

2.2 Bucketing

Bucketing breaks down the holistic performance into different categories. It divides the test set into subsets of test entities (for span- and sentence-level attributes) or test tokens (for token-level attributes).
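
As a minimal sketch of bucketing, the Python snippet below groups test entities by the interval their attribute value falls into. The interval boundaries, the `bucketize` helper, and the sample entities are all hypothetical; the repository may choose bucket boundaries differently (for example, so that buckets hold similar numbers of entities).

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def bucketize(
    items: Iterable[T],
    attribute: Callable[[T], float],
    boundaries: Sequence[float],
) -> Dict[int, List[T]]:
    """Group items by the interval of `boundaries` their attribute value falls in.

    `boundaries` must be sorted. Bucket 0 holds values below boundaries[0];
    bucket k holds values v with boundaries[k-1] <= v, and v < boundaries[k]
    when a further boundary exists.
    """
    buckets: Dict[int, List[T]] = defaultdict(list)
    for item in items:
        buckets[bisect_right(boundaries, attribute(item))].append(item)
    return dict(buckets)


# Hypothetical test entities as (text, length_in_tokens) pairs.
entities = [
    ("Paris", 1),
    ("New York", 2),
    ("Bank of England", 3),
    ("United Nations Security Council", 4),
]
by_length = bucketize(entities, attribute=lambda e: e[1], boundaries=[2, 4])
```

Here bucket 0 holds entities of length 1, bucket 1 holds lengths 2 and 3, and bucket 2 holds lengths of 4 or more.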

2.3 Breakdown

Calculate the performance of each bucket.
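
One way to picture the breakdown step is a per-bucket F1 score over entity spans. The sketch below assumes spans are represented as hypothetical `(sentence_id, start, end, type)` tuples and that gold and predicted spans are both assigned to buckets by the same attribute function; the repository's own scoring may differ in these details.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence_id, start, end_exclusive, type)


def f1_score(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match F1 between two sets of spans."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def breakdown(
    gold: Set[Span], pred: Set[Span], bucket_of: Callable[[Span], Hashable]
) -> Dict[Hashable, float]:
    """Compute F1 separately for every bucket that contains gold spans."""
    gold_buckets = defaultdict(set)
    pred_buckets = defaultdict(set)
    for span in gold:
        gold_buckets[bucket_of(span)].add(span)
    for span in pred:
        pred_buckets[bucket_of(span)].add(span)
    # Buckets that contain only predicted spans are omitted in this sketch.
    return {
        bucket: f1_score(spans, pred_buckets.get(bucket, set()))
        for bucket, spans in gold_buckets.items()
    }


gold = {(0, 0, 2, "PER"), (0, 3, 4, "LOC"), (1, 0, 1, "ORG")}
pred = {(0, 0, 2, "PER"), (0, 3, 4, "ORG"), (1, 0, 1, "ORG")}
by_entity_length = breakdown(
    gold, pred, bucket_of=lambda s: "short" if s[2] - s[1] == 1 else "long"
)
```

In this toy example the "long" bucket scores 1.0 and the "short" bucket scores 0.5, which a single holistic F1 would hide.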

2.4 Summary Measures

Summarize quantifiable results using statistical measures.
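
The text does not say which statistical measures are used, so the following Python sketch only illustrates two plausible choices: the standard deviation of per-bucket scores (how much performance varies across buckets) and the Spearman rank correlation between bucket order and score (whether performance rises or falls with the attribute). All function names and the sample scores are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    dev_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if dev_x == 0 or dev_y == 0:
        return 0.0  # correlation is undefined here; 0.0 is this sketch's choice
    return cov / (dev_x * dev_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical F1 per bucket, ordered by increasing entity length.
bucket_index = [0, 1, 2, 3]
bucket_f1 = [0.92, 0.88, 0.80, 0.65]

spread = pstdev(bucket_f1)                # variation of performance across buckets
trend = spearman(bucket_index, bucket_f1)  # close to -1: F1 falls as length grows
```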


3. Application

3.1 System Diagnosis

  • Self-diagnosis
  • Aided-diagnosis

3.2 Dataset Bias Analysis

3.3 Structural Bias Analysis


4. Interpreting Your Results?

4.1 Method 1: Upload your files to the ExplainaBoard website

4.2 Method 2: Run it Locally

Take the Named Entity Recognition task as an example. Run the shell script: ./run_task_ner.sh.

The shell script performs the following three steps:

  • tensorEvaluation-ner.py -> Computes the results that the fine-grained analysis depends on.

  • genFig.py -> Draws figures that show the results of the fine-grained analysis.

  • genHtml.py -> Puts the figures drawn in the previous step into the web page.

After running the above command, a web page named tEval-ner.html will be generated for displaying the analysis and diagnosis results of the models.

The running process of the Chinese Word Segmentation task is similar.

4.2.1 Requirements:

  • python3
  • texlive
  • poppler
  • pip3 install -r requirements.txt

4.2.2 Analyze and diagnose your own model

Take the CoNLL-2003 dataset as an example.

  • Put your model's result file in data/ner/conll03/results/. The file contains three columns separated by spaces: token, true tag, and predicted tag. Model diagnosis requires two or more model result files; you can also choose one of the result files we provide as the reference model.

  • Name the training and test sets (the dataset related to your result file) 'train.txt' and 'test.txt', and put them in data/ner/conll03/data/.

  • Set path_data (path of the training set), datasets[-] (dataset name), model1 (the first model's name), model2 (the second model's name), and resfiles[-] (the paths of the results) in run_task_ner.sh according to your data.

  • Run: ./run_task_ner.sh. The analysis results will be generated in output_tensorEval/ner/your_model_name/.

   Notably, our system so far supports only a limited set of tasks and datasets;
   we are currently extending them!

4.2.3 Generate the HTML code

As introduced in section 4.2.2, the analysis results are generated in output_tensorEval/ner/your_model_name/. Next, we generate the HTML code based on these analysis results. In ./run_task_ner.sh, the code after the comment #run pdflatex .tex generates the HTML code. Before running ./run_task_ner.sh, make sure that texlive and poppler are installed.

Other steps in ./run_task_ner.sh are as follows:

  • genFig.py generates the LaTeX code for the analysis charts (e.g. bar chart, heatmap).

  • pdflatex $file.tex generates a figure in .pdf format from the LaTeX code.

  • pdftoppm -png $file.pdf converts the .pdf figure into .png format.

  • genHtml.py generates the HTML code that arranges the analysis figures and tables.

4.2.4 Note:

  • At least two result files are required. Comparative diagnosis compares the strengths and weaknesses of the model architectures and pre-trained knowledge of two or more models, so at least two model results must be provided.

  • The result file must include three columns (words, true tags, and predicted tags) separated by spaces. If your result file is not in the required format, you can modify the function read_data() in tensorEvaluation-ner.py to adapt it to your format.
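
To check a result file against the three-column format before running the analysis, a small standalone parser such as the one below may help. It is a sketch, not the repository's read_data() function: the name `parse_result_lines` is hypothetical, and it assumes sentences are separated by blank lines as in CoNLL-style files.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (token, true_tag, predicted_tag)


def parse_result_lines(lines: Iterable[str]) -> List[List[Row]]:
    """Parse 'token true-tag predicted-tag' lines into a list of sentences.

    Blank lines are treated as sentence separators (an assumption).
    Raises ValueError on any non-blank line without exactly three columns.
    """
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, found {len(parts)}"
            )
        current.append((parts[0], parts[1], parts[2]))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


sample = [
    "John B-PER B-PER",
    "lives O O",
    "",
    "Paris B-LOC B-ORG",
]
parsed = parse_result_lines(sample)
```

To validate a file on disk, pass an open file handle instead of the list, for example `parse_result_lines(open(path, encoding="utf-8"))`, where `path` points to your own result file.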

Here are some generated results of preliminary evaluation systems: Named Entity Recognition (NER), Chinese Word Segmentation (CWS), Part-of-Speech (POS), and Chunking.
