

TALEN: Tool for Annotation of Low-resource ENtities

A lightweight web-based tool for annotating word sequences.

Screenshot of web interface

Installation

Requires Java 8 and Maven. Run:

$ ./scripts/run.sh

This will start the server on port 8009. Point a browser to localhost:8009. The port number is specified in application.properties.

This reads from config/users.txt, which contains a username and password pair on each line. You log in with one of those pairs, and that username is then tied to your activities in that session. All annotations you make are written to a folder called <orig>-annotation-<username>, where <orig> is the original path specified in the config file and <username> is the username you logged in with.
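For example, config/users.txt might look like the following (these credentials are made up, and whitespace separation is an assumption based on the description above; check the shipped file for the exact format):

```
alice hunter2
bob s3cret
```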

Suppose you do some annotations, leave the session, and come back later. If you log in with the same username as before, the tool reloads all of your annotations right where you left off, so no work is lost.
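The naming convention above can be sketched in a few lines (a minimal illustration, not TALEN's actual Java code):

```python
def annotation_folder(orig_path: str, username: str) -> str:
    """Follow the <orig>-annotation-<username> convention described above."""
    return f"{orig_path.rstrip('/')}-annotation-{username}"

# e.g. annotations for user "alice" on data/txt/hindi land in:
print(annotation_folder("data/txt/hindi", "alice"))  # data/txt/hindi-annotation-alice
```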

Usage

You make annotations by clicking on words and selecting a label. To remove a label, right-click on the word.

To annotate a phrase, highlight the phrase, ending with the mouse in the middle of the last word. The standard box will show up, and you can select the correct label. To dismiss the annotation box, click on the word it points to.

A document is saved by pressing the Save button. If you navigate away using the links on the top of the page, the document is not saved.

Configuration

There are two kinds of config files, corresponding to the two annotation methods (see below). The document-based method looks for config files that start with 'doc-' and the sentence-based method looks for config files that start with 'sent-'.

See the example config files for the minimally required set of options.
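For illustration only, a sentence-based config might look like the sketch below. The indexdir variable is described under "How to build an index"; the seed-name key shown here is a guess, so check the shipped config/sent-Hindi.txt for the real option names.

```
# hypothetical sketch of a sentence-based config; see config/sent-Hindi.txt
# for the actual option names (the seed-name key below is a guess)
indexdir data/index_hindi
seednames Pete Sampras;Andre Agassi
```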

Annotation Methods

There are two main annotation methods supported: document-based, and sentence-based.

Document-based

The document-based method follows the most common annotation paradigm: you point the software at a folder of documents, each is displayed in turn, and you annotate them.

Sentence-based

The sentence-based method is intended to allow a rapid annotation process. First, you need to build an index using TextFileIndexer.java, then you supply some seed names in the config file. The system searches for these seed names in the index, and returns a small number of sentences containing them. The annotator is encouraged to annotate these correctly, and also annotate any other names which may appear. These new names then join the list of seed names, and annotation continues.

For example, if the seed name is 'Pete Sampras', then we might hope that 'Andre Agassi' will show up in the same sentence. If the annotator chooses to annotate 'Andre Agassi' also, then the system will retrieve new sentences containing 'Andre Agassi'. Presumably these sentences will contain entities such as 'Wimbledon' and 'New York City'. In principle, this will continue until some cap on the number of entities has been reached.
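The expansion loop described above can be sketched as follows. This is a toy illustration, not TALEN's actual code: an in-memory dictionary stands in for the real index built by TextFileIndexer.java, and a callback stands in for the human annotator.

```python
def bootstrap(index, seeds, annotate, max_entities=10):
    """Sketch of the sentence-based annotation loop.

    index:    maps a name to sentences containing it (stand-in for the real index)
    seeds:    initial seed names from the config file
    annotate: returns the set of names marked in a sentence
              (simulated here; in TALEN a human annotator does this)
    """
    known = set(seeds)
    queue = list(seeds)
    while queue and len(known) < max_entities:
        name = queue.pop(0)
        for sentence in index.get(name, []):
            for found in annotate(sentence):
                if found not in known:
                    known.add(found)
                    queue.append(found)  # new names join the seed list
    return known

# Toy example mirroring the Pete Sampras illustration above:
index = {
    "Pete Sampras": ["Pete Sampras beat Andre Agassi at Wimbledon."],
    "Andre Agassi": ["Andre Agassi won in New York City."],
}
gold = {"Pete Sampras", "Andre Agassi", "Wimbledon", "New York City"}
annotate = lambda sent: {n for n in gold if n in sent}
print(sorted(bootstrap(index, ["Pete Sampras"], annotate)))
```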

Using the sentence-based method

First, you need to download a corpus. We have used Hindi for this. Run:

$ sudo pip install -U nltk    # skip if you already have nltk
$ python -m nltk.downloader indian

Now convert the data:

$ cd data
$ python getindian.py
$ cd ..

You'll notice that this created files in data/txt/hindi and in data/tajson/hindi. Now build the index:

$ mvn dependency:copy-dependencies
$ ./scripts/buildindex.sh data/tajson/hindi/ data/index_hindi 

That's it! There is already a config file called config/sent-Hindi.txt that should get you started.

Non-speaker Helps

One major focus of the software is to allow non-speakers of a language to annotate text. Some features are: inline dictionary replacement, morphological awareness and coloring, entity propagation, entity suggestions, and hints based on frequency and mutual information.

How to build an index

Use buildindex.sh to build a local index for the sentence-based mode; the script, in turn, calls TextFileIndexer.java. The resulting index directory is the value you put in the indexdir variable of the sentence-based config file.

Command line tool

We also ship a lightweight command line tool for TALEN. This tool reads a folder of JSON TextAnnotations (more formats coming soon) and spins up a Java-only server, serving static HTML versions of each document. It is intended only for examination and exploration.

Install it as follows:

$ ./scripts/install-cli.sh
$ export PATH=$PATH:$HOME/software/talen/

(You can change the INSTALLDIR in install-cli.sh if you want it installed somewhere else.) Now that it is installed, you can run it from any folder in your terminal:

$ talen-cli FolderOfTAFiles

This will serve static HTML documents at localhost:PORT (default PORT is 8008). You can run with additional options:

$ talen-cli FolderOfTAFiles -roman -port 8888

Here, the -roman option uses the ROMANIZATION view in the TextAnnotation for the text (if available), and the -port option serves on the specified port.

Mechanical Turk

Although the main function of this software is a server-based system, there is also a lightweight version that runs entirely in JavaScript, for the express purpose of creating Mechanical Turk jobs.

The important files are mturkTemplate.html and annotate-local.js. The latter is a version of annotate.js, but the code to handle adding and removing spans is included in the JavaScript instead of being sent to a Java controller. This is less powerful (because our NLP libraries are written in Java, not JavaScript), but it can run with no server.

All the scripts needed to create this file are included in this repository. It was created as follows:

$ python scripts/preparedata.py preparedata data/txt tmp.csv
$ python scripts/preparedata.py testfile tmp.csv docs/index.html

mturkTemplate.html includes a lot of extra material (instructions, an annotator test, etc.), all of which can be removed if desired; I found it useful for mturk tasks. When you create the mturk task, there will be a submit button, and the answer will be written into the #finalsubmission field. The output string is a JavaScript list of token spans along with their labels.
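The exact serialization written to #finalsubmission is defined by annotate-local.js. Purely as a hypothetical illustration of consuming such a list of labeled token spans (the [start, end, label] layout here is an assumption, not taken from the source), one could do:

```python
import json

# Hypothetical payload: a JSON list of [start, end, label] token spans.
# The real format is whatever annotate-local.js emits.
submission = '[[0, 2, "PER"], [5, 6, "LOC"]]'

spans = json.loads(submission)
for start, end, label in spans:
    print(f"tokens {start}..{end} -> {label}")
```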

Citation

If you use this in your research paper, please cite us!

@inproceedings{talen2018,
    author = {Stephen Mayhew and Dan Roth},
    title = {TALEN: Tool for Annotation of Low-resource ENtities},
    booktitle = {ACL System Demonstrations},
    year = {2018},
}

Read the paper here: http://cogcomp.org/papers/MayhewRo18.pdf

More Repositories

1. cogcomp-nlp (Java, 469 stars): CogComp's Natural Language Processing Libraries and Demos. Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
2. cogcomp-nlpy (Python, 116 stars): CogComp's light-weight Python NLP annotators.
3. saul (Scala, 64 stars): Saul: Declarative Learning-Based Programming.
4. zoe (Python, 43 stars): Zero-Shot Open Entity Typing as Type-Compatible Grounding, EMNLP'18.
5. arithmetic (Java, 42 stars): Arithmetic word problem solver.
6. MCTACO (Python, 40 stars): Dataset and code for "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding, EMNLP 2019.
7. multirc (Perl, 30 stars): Reasoning over Multiple Sentences (Multi-RC).
8. perspectrum (Jupyter Notebook, 30 stars): Perspectrum: a dataset of claims, perspectives and evidence documents.
9. JointConstrainedLearning (Jupyter Notebook, 25 stars): Joint Constrained Learning for Event-Event Relation Extraction.
10. illinois-sl (Java, 22 stars): A general-purpose Java library for performing structured learning.
11. TacoLM (Python, 20 stars): Temporal Common Sense Acquisition with Minimal Supervision, ACL'20.
12. Benchmarking-Zero-shot-Text-Classification (Python, 16 stars): Code for the EMNLP 2019 paper "Benchmarking zero-shot text classification: datasets, evaluation and entailment approach".
13. SRL-English (Python, 15 stars): BERT-based nominal Semantic Role Labeling (SRL), using both the NomBank and OntoNotes datasets.
14. faithful_summarization (Python, 15 stars)
15. TAWT (Python, 15 stars): Weighted Training for Cross-Task Learning.
16. lbjava (Java, 13 stars): Learning Based Java (LBJava).
17. APSI (Python, 11 stars): Code for the EMNLP 2020 paper "Analogous Process Structure Induction for Sub-event Sequence Prediction".
18. Event_Process_Typing (Python, 10 stars): Resources for the CoNLL 2020 paper "What Are You Trying To Do? Semantic Typing of Event Processes".
19. MultiOpEd (Python, 10 stars): MultiOpEd: A Corpus of Multi-Perspective News Editorials.
20. Salient-Event-Detection (Python, 10 stars): Repository for the paper "Is Killed More Significant than Fled? A Contextual Model for Salient Event Detection".
21. CogCompTime (Java, 9 stars): CogCompTime.
22. TCR (9 stars): Temporal and Causal Reasoning (dataset).
23. QuASE (Python, 8 stars): Code repository for the ACL paper "QuASE: Question-Answer Driven Sentence Encoding".
24. perspectroscope (JavaScript, 7 stars): A Window to the World of Differing Perspectives.
25. open-eval (Java, 7 stars): An open source evaluation framework for developing NLP systems.
26. event-linking (Python, 7 stars)
27. Subevent_EventSeg (Python, 7 stars)
28. qaeval-experiments (Python, 7 stars): Code to reproduce the experiments from "Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary".
29. learning-to-decompose (Python, 7 stars): Code and data for Zhou et al. 2022, "Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts".
30. summary-cloze (Python, 6 stars): Summary Cloze: A New Task for Content Selection in Topic-Focused Summarization.
31. MATRES (6 stars): Multi-Axis Temporal Relations for Start-points (dataset).
32. nmn-drop (Python, 5 stars): Code for NMN over DROP: Neural Module Networks for Reasoning over Text.
33. wikidump-preprocessing (Python, 4 stars): Wikipedia Dump Processing.
34. decomp-el-retriever (4 stars)
35. CIKQA (Python, 4 stars)
36. time (4 stars): Understanding time in text.
37. Logic2ILP (Java, 3 stars)
38. transformer-lm-demo (Python, 3 stars): A simple demo of transformer language models, mostly for internal use: http://dickens.seas.upenn.edu:4001
39. KAIROS-Event-Extraction (Python, 3 stars): Event Extraction.
40. ZeroShotWiki (Python, 3 stars): Towards Open Domain Topic Classification.
41. TemProb-NAACL18 (Java, 3 stars): Improving Temporal Relation Extraction with a Globally Acquired Statistical Resource.
42. Complex_Event_Identification (Python, 3 stars): Code for the paper "Capturing the Content of a Document through Complex Event Identification".
43. illinois-sl-examples (Java, 3 stars): Implementing simple learning problems using Illinois-SL.
44. NLP-Event-Extraction-Demo (Python, 2 stars): NLP Event Extraction Demo.
45. content-analysis-experiments (Python, 2 stars)
46. Yes-No-or-IDK (Python, 2 stars): Yes, No or IDK: The Challenge of Unanswerable Yes/No Questions.
47. ZEC (Python, 2 stars): Zero-shot event trigger and argument classification.
48. RESIN-11 (Shell, 2 stars): RESIN-11: Schema-guided Event Prediction for 11 Newsworthy Scenarios.
49. today (Jupyter Notebook, 2 stars)
50. IDK-beyond-SQuAD2.0 (2 stars): Do we Know What We Don't Know? Studying Unanswerable Questions beyond SQuAD 2.0.
51. JEANS (Python, 2 stars): Code for "Cross-lingual Entity Alignment for Knowledge Graphs with Incidental Supervision from Free Text".
52. Essential-Step-Detection (HTML, 2 stars)
53. multilingual-ner (Python, 2 stars): Multilingual NER and XEL demo.
54. ccr_rock (Jupyter Notebook, 2 stars): ROCK: Causal Inference Principles for Reasoning about Commonsense Causality.
55. apelles (HTML, 2 stars): CogComp-nlp demo.
56. ner-with-partial-annotations (Python, 2 stars)
57. mcqa-expectations (Python, 2 stars)
58. re-examining-correlations (Python, 2 stars): Code for the NAACL 2022 paper "Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics".
59. stat-analysis-experiments (Python, 2 stars)
60. jwnl-prime (Java, 2 stars): Modified version of JWNL (http://sourceforge.net/projects/jwordnet/).
61. Event_Process_Typing_Demo (Python, 1 star)
62. ccg-bibfiles (TeX, 1 star): Repository to store CogComp's bib and cited bib files.
63. NER-Multilanguage-Demo (Python, 1 star)
64. reference-free-limitations (Python, 1 star)
65. Zero_Shot_Schema_Induction (Python, 1 star)
66. NLP-Multipackage-Demo (Python, 1 star)
67. PairedRL (1 star): PairedRL-coref.
68. cogcomp-datastore (Java, 1 star): A convenient wrapper for interfacing with the Minio Java API.
69. PABI (Python, 1 star)
70. KAIROS2020 (Python, 1 star): KAIROS 2020 backend services.
71. zeroshot-classification-demo (Python, 1 star)
72. NER-English-Demo (Python, 1 star): Frontend of the NER demo for English.
73. KAIROS-Temporal-Extraction (Smalltalk, 1 star): Extracts temporal information and normalizes it.
74. mbert-study (Python, 1 star): Cross-Lingual Ability of Multilingual BERT: An Empirical Study.
75. NeuralTemporalRelation-EMNLP19 (Python, 1 star): CogCompTime 2.0.
76. multi-persp-search-engine (Python, 1 star): Code for the prototype multi-perspective search engine from the Findings of NAACL'21 paper "Design Challenges for a Multi-Perspective Search Engine".
77. NLP-SRL-English-Demo (Python, 1 star)
78. QA-idk-demo (Python, 1 star)
79. Event_Semantic_Classification (1 star): Repository for "Event Semantic Classification in Context", EACL 2024 Findings.
80. Zeroshot-Event-Extraction (Python, 1 star): Code repository for the ACL 2021 paper "Zero-shot Event Extraction via Transfer Learning: Challenges and Insights".