
An assignment for CMU CS11-711 Advanced NLP, building NLP systems from scratch

CMU Advanced NLP Assignment 2: End-to-end NLP System Building

by Emmy Liu, Zora Wang, Kenneth Zheng, Lucio Dery, Abhishek Srivastava, Kundan Krishna, Graham Neubig

So far in your machine learning classes, you may have experimented with standardized tasks and datasets that were provided and easily accessible. However, in the real world, NLP practitioners often have to solve a problem from scratch, which includes gathering and cleaning data, annotating the data, choosing a model, iterating on the model, and possibly going back to change the data. For this assignment, you'll get to experience this full process.

Please note that you'll be building your own system end-to-end for this project, and there is no starter code. You must collect your own data and train a model of your choice on the data. We will be releasing an unlabeled test dataset a few days before the assignment deadline, and you will run your already-constructed system over this data and submit the results. We also ask you to follow several experimental best practices and describe the results in your report.

The full process will include:

  1. Understand the task specification
  2. Collect raw data
  3. Annotate test and training data for development
  4. Train and test models using this data
  5. "Deploy" your System
  6. Write your report

Task Specification

For this assignment, you'll be working on the task of scientific entity recognition in the domain of papers from recent NLP conferences (e.g. ACL, EMNLP, and NAACL). Specifically, we will ask you to identify entities such as task names, model names, hyperparameter names and their values, and metric names and their values in these papers.

Input: The input to the model will be a text file with one paragraph per line. The text will already be tokenized using the spaCy tokenizer, and you should not change the tokenization. An example of the input looks like this:

Recent evidence reveals that Neural Machine Translation ( NMT ) models with deeper neural networks can be more effective but are difficult to train .

Output: The output of your model should be a file in CoNLL format, with one token per line, a tab, and then a corresponding tag.
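
For example, under a BIO-style tagging scheme, the beginning of the output for the input sentence above might look like the snippet below. The specific labels are illustrative guesses; consult the linked example files and the annotation standard for the exact tag set and labeling decisions.

    Recent	O
    evidence	O
    reveals	O
    that	O
    Neural	B-MethodName
    Machine	I-MethodName
    Translation	I-MethodName
    (	O
    NMT	B-MethodName
    )	O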

Please refer to these input and output files for more specific examples.

There are seven varieties of entity: MethodName, HyperparameterName, HyperparameterValue, MetricName, MetricValue, TaskName, DatasetName. Details of these entities are included in the annotation standard, which you should read and understand carefully.

Collecting Raw Data

You will next need to collect raw text data that can be used as inputs to your models. This will consist of three steps.

Obtaining PDFs of Scientific Papers

First, you will need to obtain PDFs of NLP papers that can serve as your raw data source. The best source for recent NLP papers is the ACL Anthology. We recommend that you write a web-scraping script to find and download PDF links from the Anthology. Other good sources of data include arXiv and Semantic Scholar, both of which have web APIs that you can query to get the IDs and corresponding paper PDFs for various scientific papers.
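
As a rough illustration, a minimal scraping script might look like the sketch below. The event URL, output directory, and download cap are illustrative assumptions; check the Anthology's usage policies and rate-limit your requests.

    import os
    import re
    import time
    import requests

    # Illustrative event page; other ACL Anthology event listings work the same way.
    EVENT_URL = "https://aclanthology.org/events/acl-2022/"
    OUT_DIR = "raw_pdfs"
    os.makedirs(OUT_DIR, exist_ok=True)

    html = requests.get(EVENT_URL).text
    # Anthology paper PDFs are linked with URLs ending in ".pdf".
    pdf_urls = sorted(set(re.findall(r"https://aclanthology\.org/[\w.\-]+\.pdf", html)))

    for url in pdf_urls[:50]:  # cap the number of downloads while experimenting
        path = os.path.join(OUT_DIR, url.rsplit("/", 1)[-1])
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(requests.get(url).content)
            time.sleep(1)  # be polite to the server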

Extracting Sentences Line-by-line

In order to process the text from the PDF files, you will first need to convert it into plaintext. This is a common problem, and there are multiple libraries designed for the task, including:

  • PyPDF2
  • SciPDF Parser
  • AllenAI Science Parse
  • AllenAI Science Parse v2

You do not need to extract text/numbers from tables and figures.
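
For example, a minimal sketch using PyPDF2 (the first library above) might look like the following. PDF text extraction is noisy, so expect to add your own cleanup (de-hyphenation, dropping headers/footers and reference sections); the file path here is illustrative.

    from PyPDF2 import PdfReader  # PyPDF2 >= 2.x API

    def pdf_to_paragraphs(path):
        """Extract raw text from a PDF and make a very rough split into paragraphs."""
        reader = PdfReader(path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        # Very rough paragraph splitting; real cleanup is left to you.
        paragraphs = [p.replace("\n", " ").strip() for p in text.split("\n\n")]
        return [p for p in paragraphs if p]

    paragraphs = pdf_to_paragraphs("raw_pdfs/example.pdf")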

Tokenizing the Data

As noted above, the inputs to your model will be tokenized using the spaCy tokenizer, so you should tokenize your collected data with the same tokenizer. Once you have done this, you should have a significant amount of raw data in the same input format as described above.
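
A minimal sketch of this step is shown below; the input and output file names are assumptions, and only spaCy's tokenizer is needed, so a blank English pipeline is sufficient.

    import spacy

    # Only the tokenizer is needed, so a blank English pipeline is enough.
    nlp = spacy.blank("en")

    def tokenize_paragraph(paragraph):
        """Return the paragraph as a single line of space-separated tokens."""
        return " ".join(tok.text for tok in nlp(paragraph))

    # File names are illustrative: one raw paragraph per input line.
    with open("paragraphs.txt") as fin, open("tokenized.txt", "w") as fout:
        for line in fin:
            line = line.strip()
            if line:
                fout.write(tokenize_paragraph(line) + "\n")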

Annotating Data

Next, you will want to annotate data for two purposes: testing/analysis and training.

The testing/analysis data will be the data that you use to make sure that your system is working properly. In order to do so, you will want to annotate enough data that you can get an accurate idea of how your system is doing, and whether any improvements to your system are having a positive impact. Some guidelines:

  • Domain Relevance: Your test data should be similar to the data that you will finally be tested on, so we recommend that you create it from NLP papers from recent NLP conferences (e.g. ACL, EMNLP, and NAACL).
  • Size: Your test data should be large enough to distinguish between good and bad models. If you want some guidelines about this, please take a look at this paper.

For annotation, please see the separate doc that details annotation interfaces that you can use.

The training data is a bit more flexible; you could possibly:

  • Annotate it yourself manually through the same method as the test set.
  • Do some sort of automatic annotation/data augmentation (one possible approach is sketched after this list).
  • Use other existing datasets for multi-task learning.
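
As one example of automatic annotation, a simple option is dictionary-based weak labeling: match a list of known entity names against the tokenized text and emit BIO tags. The seed dictionary below is hypothetical and the resulting labels will be noisy, so treat this only as a sketch of the idea.

    # Hypothetical seed dictionary; in practice you would curate a much larger one.
    SEED_ENTITIES = {
        ("SQuAD",): "DatasetName",
        ("CoNLL", "-", "2003"): "DatasetName",
        ("BLEU",): "MetricName",
    }

    def weak_label(tokens):
        """Greedy longest-match dictionary labeling that produces BIO tags."""
        tags = ["O"] * len(tokens)
        i = 0
        while i < len(tokens):
            matched = False
            # Try longer patterns first so multi-token names win.
            for pattern, label in sorted(SEED_ENTITIES.items(), key=lambda kv: -len(kv[0])):
                n = len(pattern)
                if tuple(tokens[i:i + n]) == pattern:
                    tags[i] = "B-" + label
                    for j in range(i + 1, i + n):
                        tags[j] = "I-" + label
                    i += n
                    matched = True
                    break
            if not matched:
                i += 1
        return tags

    print(weak_label("We report BLEU on the CoNLL - 2003 data .".split()))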

Training and Testing Your Model

In order to train your model, we highly suggest using pre-existing toolkits such as HuggingFace Transformers. You can read the tutorial on token classification, which is a good way to get started.
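
As a rough sketch of the Transformers setup for token classification (the base model name and label list are assumptions, and the full training loop with Trainer is covered in the tutorial):

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Illustrative label set: "O" plus B-/I- tags for the seven entity types.
    entity_types = ["MethodName", "HyperparameterName", "HyperparameterValue",
                    "MetricName", "MetricValue", "TaskName", "DatasetName"]
    labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=len(labels))

    # The inputs are already tokenized, so pass the word list with
    # is_split_into_words=True and use word_ids() to align word-level
    # tags to subword pieces.
    words = ("Recent evidence reveals that Neural Machine Translation ( NMT ) "
             "models with deeper neural networks can be more effective "
             "but are difficult to train .").split()
    encoding = tokenizer(words, is_split_into_words=True, truncation=True,
                         return_tensors="pt")
    word_ids = encoding.word_ids()  # maps each subword to its source word (or None)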

Because you will probably not be able to create a large dataset specifically for this task in the amount of time allocated, we strongly suggest that you use the knowledge that you have learned in this class to efficiently build a system. For example, you may think about ideas such as:

  1. Pre-training on a different task and fine-tuning
  2. Multi-task learning, training on different tasks at once
  3. Using prompting techniques

In order to test your model, you will want to use an evaluation script, as detailed in the evaluation and submission page. This page also explains how you can do the analysis that is a component of your report.
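
If you want a quick sanity check before running the official evaluation script, entity-level precision, recall, and F1 over BIO tag sequences can be computed with the seqeval library; the toy gold/predicted sequences below are illustrative, and the official scorer remains the reference.

    from seqeval.metrics import classification_report, f1_score

    # One list of BIO tags per sentence (illustrative toy example).
    gold = [["O", "B-MetricName", "O", "B-DatasetName", "I-DatasetName"]]
    pred = [["O", "B-MetricName", "O", "B-DatasetName", "O"]]

    print("entity-level F1:", f1_score(gold, pred))
    print(classification_report(gold, pred))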

System Deployment

The final "deployment" of your model will consist of running your model over a private test set (text only) and submitting your results to us. You should try to finish building your system before this set is released, and basically not rely on it for model training or testing. The test set will be released shortly (2-3 days) before the final submission deadline.

When you are done running your system over this data, you will:

  1. Submit the results to ExplainaBoard through a submission script. See the evaluation and submission page.
  2. Submit any testing or training data that you created, as well as your code via Canvas.

Both of these will be due by October 26. See details in grading below.

Data Release

UPDATE (Oct. 25, 2022): The test set is now released in the data/ directory. It contains 3 files:

  1. anlp-sciner-test.txt: The data that should be input to your system, in text format with one paragraph per line.
  2. anlp-sciner-test-withdocstart.txt: The same data as above, but with some lines starting with -DOCSTART- to indicate the start of a paper.
  3. anlp-sciner-test-empty.conll: An example of the format that should be uploaded to ExplainaBoard, but with all the tags set to "O".

Please run your system over these files and upload the results. Because the goal of this assignment is not to perform hyperparameter optimization on this test set, we ask you not to upload too many times before the submission deadline. Try to limit yourself to 5 submissions; going slightly over this is not an issue, but teams that make more than 10 submissions may be penalized.

Writing Report

We will ask you to write a report detailing your system creation process, covering the points listed in the grading criteria below.

There will be a 7 page limit for the report, and there is no required template. However, we encourage you to use the ACL template.

This will be due October 31st for submission via Canvas.

Grading

The following points are derived from the "deployment" of the system:

  • Your group submits testing/training data of your creation (20 points)
  • Your group submits code for training the system in the form of a GitHub repo. We will not necessarily run your code, but we may look at it, so please ensure that it contains up-to-date code with a README file outlining the steps to run it. (20 points)
  • Points based on the performance of your system on the private SciNER test set (10 points for better-than-chance performance, plus up to 10 additional points based on the level of performance)

The exact number of points assigned for a certain level of performance will be determined based on how well the class's models perform.

The following points are derived from the report:

  • You report how the data was created. Please include the following details (10 points)
    • How did you obtain the raw PDFs, and how did you decide which ones to obtain?
    • How did you extract text from the PDFs?
    • How did you tokenize the inputs?
    • What data was annotated for testing and training (what kind and how much)?
    • How did you decide what kind and how much data to annotate?
    • What sort of annotation interface did you use?
    • For training data that you did not annotate, did you use any extra data and in what way?
  • You report model details (10 points)
    • What kind of methods (including baselines) did you try? Explain at least two variations (more is welcome). This can include which model you used, which data it was trained on, training strategy, etc.
    • What was the justification for trying these methods?
  • You report raw numbers from experiments (10 points)
    • What was the result of each model that you tried on the testing data that you created?
    • Are the results statistically significant?
  • Comparative quantitative/qualitative analysis (10 points)
    • Perform a comparison of the outputs on a more fine-grained level than just holistic accuracy numbers, and report the results. For instance, you may measure various models' abilities to perform recognition of various entities.
    • Show examples of outputs from at least two of the systems you created. Ideally, these examples could be representative of the quantitative differences that you found above.
