CMU Advanced NLP Assignment 2: End-to-end NLP System Building
by Emmy Liu, Zora Wang, Kenneth Zheng, Lucio Dery, Abhishek Srivastava, Kundan Krishna, Graham Neubig
So far in your machine learning classes, you may have experimented with standardized tasks and datasets that were provided and easily accessible. However, in the real world, NLP practitioners often have to solve a problem from scratch, which includes gathering and cleaning data, annotating the data, choosing a model, iterating on the model, and possibly going back to change the data. For this assignment, you'll get to experience this full process.
Please note that you'll be building your own system end-to-end for this project, and there is no starter code. You must collect your own data and train a model of your choice on the data. We will be releasing an unlabeled test dataset a few days before the assignment deadline, and you will run your already-constructed system over this data and submit the results. We also ask you to follow several experimental best practices, and describe the result in your report.
The full process will include:
- Understand the task specification
- Collect raw data
- Annotate test and training data for development
- Train and test models using this data
- "Deploy" your System
- Write your report
Task Specification
For this assignment, you'll be working on the task of scientific entity recognition, specifically in the domain of NLP papers from recent NLP conferences (e.g. ACL, EMNLP, and NAACL). We will ask you to identify entities such as task names, model names, hyperparameter names and their values, and metric names and their values in these papers.
Input: The input to the model will be a text file with one paragraph per line. The text will already be tokenized using the spacy tokenizer, and you should not change the tokenization. An example of the input looks like this:
Recent evidence reveals that Neural Machine Translation ( NMT ) models with deeper neural networks can be more effective but are difficult to train .
Output: The output of your model should be a file in CoNLL format, with one token per line, a tab, and then a corresponding tag.
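For illustration, a few lines of CoNLL-format output might look as follows (the tags shown here are illustrative, and the B-/I- prefix scheme is an assumption; the actual tag inventory is defined by the annotation standard):

```
Neural	B-MethodName
Machine	I-MethodName
Translation	I-MethodName
(	O
NMT	B-MethodName
)	O
models	O
```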
Please refer to these input and output files for more specific examples.
There are seven varieties of entity: MethodName, HyperparameterName, HyperparameterValue, MetricName, MetricValue, TaskName, and DatasetName.
Details of these entities are included in the annotation standard, which you should read and understand carefully.
Collecting Raw Data
You will next need to collect raw text data that can be used as inputs to your models. This will consist of three steps.
Obtaining PDFs of Scientific Papers
First, you will need to obtain PDFs of NLP papers that can serve as your raw data source. The best source for recent NLP papers is the ACL Anthology. We recommend that you write a web-scraping script to find and download PDF links from here. Other good sources for data include ArXiv and Semantic Scholar, both of which have web APIs which you can query to get the IDs and corresponding paper PDFs for various scientific papers.
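As a sketch, one way to pull PDF links out of a downloaded Anthology page (the helper name and the regex-over-HTML approach here are illustrative assumptions, not part of the assignment):

```python
import re

def extract_pdf_links(html):
    """Collect the .pdf links that appear in a saved ACL Anthology page.

    Minimal sketch: for robustness on real pages you may prefer an HTML
    parser such as BeautifulSoup.
    """
    return re.findall(r'https?://[^\s"\'<>]+\.pdf', html)

# Downloading could then be a simple loop (requires the `requests` package):
# for url in extract_pdf_links(html):
#     with open(url.rsplit("/", 1)[-1], "wb") as f:
#         f.write(requests.get(url).content)
```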
Extracting Sentences Line-by-line
In order to process the text from the PDF files, you will first need to convert them into plaintext. This is a common problem, and there are multiple libraries designed for the task, including:
- PyPDF2
- SciPDF Parser
- AllenAI Science Parse
- AllenAI Science Parse v2
You do not need to extract text/numbers from tables and figures.
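As a rough sketch using PyPDF2 (one of the libraries above; the paragraph-splitting heuristic is an assumption and will need tuning for your extractor's output):

```python
def pdf_to_text(path):
    """Concatenate the extracted text of every page in a PDF."""
    from PyPDF2 import PdfReader  # pip install PyPDF2
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def to_paragraph_lines(text):
    """Heuristic: treat blank-line-separated blocks as paragraphs and
    flatten each onto a single line, matching the required input format."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return [" ".join(b.split()) for b in blocks]
```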
Tokenizing the Data
As noted above, the inputs to your model will be tokenized using the spacy tokenizer, so you should tokenize your raw data with the same tokenizer. Once you have done this, you should have a significant amount of raw data in the same input format as described above.
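A minimal tokenization pass with spacy might look like this (using the blank English pipeline, which applies spacy's rule-based tokenizer without downloading a trained model):

```python
import spacy  # pip install spacy

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline needed

def tokenize_line(line):
    """Return the line with tokens separated by single spaces."""
    return " ".join(tok.text for tok in nlp(line))
```

For example, `tokenize_line("(NMT) models.")` splits off the parentheses and the final period, matching the sample input shown earlier.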
Annotating Data
Next, you will want to annotate data for two purposes: testing/analysis and training.
The testing/analysis data will be the data that you use to make sure that your system is working properly. In order to do so, you will want to annotate enough data so that you can get an accurate idea of how your system is doing, and if any improvements to your system are having a positive impact. Some guidelines:
- Domain Relevance: Your test data should be similar to the data that you will finally be tested on, so we recommend that you create it from NLP papers from recent NLP conferences (e.g. ACL, EMNLP, and NAACL).
- Size: Your test data should be large enough to distinguish between good and bad models. If you want some guidelines about this, please take a look at this paper.
For annotation, please see the separate doc that details annotation interfaces that you can use.
The training data is more flexible; you could, for example:
- Annotate it yourself manually through the same method as the test set.
- Do some sort of automatic annotation/data augmentation.
- Use other existing datasets for multi-task learning.
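As one concrete example of automatic annotation, a gazetteer-based tagger can produce "silver" training data via distant supervision (the gazetteer entries and the B- prefix scheme here are illustrative assumptions):

```python
# Hypothetical gazetteer mapping trusted surface forms to entity types.
GAZETTEER = {
    "BLEU": "MetricName",
    "WMT14": "DatasetName",
    "Transformer": "MethodName",
}

def auto_tag(tokens):
    """Distant-supervision sketch: tag known single-token mentions, else 'O'.

    Real silver data would also need multi-token matching and I- tags.
    """
    return [(t, "B-" + GAZETTEER[t]) if t in GAZETTEER else (t, "O")
            for t in tokens]
```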
Training and Testing Your Model
In order to train your model, we highly suggest using pre-existing toolkits such as HuggingFace Transformers. You can read the tutorial on token classification, which is a good way to get started.
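To get a feel for the setup, here is a sketch of the label space and the model-loading call (the BIO scheme and the "bert-base-cased" checkpoint are illustrative choices, not requirements):

```python
ENTITY_TYPES = ["MethodName", "HyperparameterName", "HyperparameterValue",
                "MetricName", "MetricValue", "TaskName", "DatasetName"]
LABELS = ["O"] + [p + t for t in ENTITY_TYPES for p in ("B-", "I-")]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

# The training loop itself follows the HuggingFace token-classification
# tutorial (commented out here since it downloads a model):
# from transformers import AutoTokenizer, AutoModelForTokenClassification
# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# model = AutoModelForTokenClassification.from_pretrained(
#     "bert-base-cased", num_labels=len(LABELS),
#     id2label=id2label, label2id=label2id)
# ...tokenize with is_split_into_words=True, align labels to subwords,
# then train with transformers.Trainer.
```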
Because you will probably not be able to create a large dataset specifically for this task in the amount of time allocated, we strongly suggest that you use the knowledge that you have learned in this class to efficiently build a system. For example, you may think about ideas such as:
- Pre-training on a different task and fine-tuning
- Multi-task learning, training on different tasks at once
- Using prompting techniques
In order to test your model, you will want to use an evaluation script, as detailed in the evaluation and submission page. This page also explains how you can do the analysis that is a component of your report.
System Deployment
The final "deployment" of your model will consist of running your model over a private test set (text only) and submitting your results to us. You should try to finish building your system before this set is released, and basically not rely on it for model training or testing. The test set will be released shortly (2-3 days) before the final submission deadline.
When you are done running your system over this data, you will:
- Submit the results to ExplainaBoard through a submission script. See the evaluation and submission page.
- Submit any testing or training data that you created, as well as your code via Canvas.
Both of these will be due by October 26. See details in grading below.
Data Release
UPDATE (Oct. 25, 2022): The test set is now released in the data/ directory. It contains 3 files:
- anlp-sciner-test.txt: The data that should be input to your system, in textual format with one paragraph per line.
- anlp-sciner-test-withdocstart.txt: The same data as above, but with some lines starting with -DOCSTART- to indicate the start of a paper.
- anlp-sciner-test-empty.conll: An example of the format that should be uploaded to ExplainaBoard, but with all the tags set to "O".
Please run your system over these files and upload the results. Because the goal of this assignment is not to perform hyperparameter optimization on the test set, we ask that you not upload too many times before the submission deadline. Try to limit yourself to 5 submissions; going slightly over is not an issue, but teams that make more than 10 submissions may be penalized.
Writing Report
We will ask you to write a report detailing some things about your system creation process (in the grading criteria below).
There will be a 7 page limit for the report, and there is no required template. However, we encourage you to use the ACL template.
This will be due October 31st for submission via Canvas.
Grading
The following points are derived from the "deployment" of the system:
- Your group submits testing/training data of your creation (20 points)
- Your group submits code for training the system in the form of a github repo. We will not necessarily run your code, but we may look at it, so please ensure that it contains up-to-date code with a README file outlining the steps to run it. (20 points)
- Points based on your system's performance on the private SciNER test set (10 points for better-than-chance performance, plus 0 to 10 additional points based on level of performance)
The exact number of points assigned for a certain level of performance will be determined based on how well the class's models perform.
The following points are derived from the report:
- You report how the data was created. Please include the following details (10 points)
- How did you obtain the raw PDFs, and how did you decide which ones to obtain?
- How did you extract text from the PDFs?
- How did you tokenize the inputs?
- What data was annotated for testing and training (what kind and how much)?
- How did you decide what kind and how much data to annotate?
- What sort of annotation interface did you use?
- For training data that you did not annotate, did you use any extra data and in what way?
- You report model details (10 points)
- What kind of methods (including baselines) did you try? Explain at least two variations (more is welcome). This can include which model you used, which data it was trained on, training strategy, etc.
- What was the justification for trying these methods?
- You report raw numbers from experiments (10 points)
- What was the result of each model that you tried on the testing data that you created?
- Are the results statistically significant?
- Comparative quantitative/qualitative analysis (10 points)
- Perform a comparison of the outputs on a more fine-grained level than just holistic accuracy numbers, and report the results. For instance, you may measure various models' abilities to perform recognition of various entities.
- Show examples of outputs from at least two of the systems you created. Ideally, these examples could be representative of the quantitative differences that you found above.
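For the statistical significance question above, one common approach is paired bootstrap resampling over your test examples. A minimal stdlib sketch (scoring at the per-paragraph granularity is an assumption; any paired per-example score works):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A's total score
    beats system B's, over paired per-example scores.

    Values near 1.0 suggest A's advantage is unlikely to be sampling noise.
    """
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```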