  • Language
    Python
  • License
    Other
  • Created almost 8 years ago
  • Updated almost 6 years ago

Repository Details

CogComp's light-weight Python NLP annotators

CogComp-NLPy

Run NLP tools such as Part-of-Speech tagging, Chunking, Named Entity Recognition, etc. on your documents in Python with ease and breeze!

Installation

  1. Make sure you have "pip" on your system.
  2. Make sure you have installed Cython:
pip install cython
  3. Install:
pip install ccg_nlpy
  4. Enjoy!

Here is the project page at PyPI website.

Support

The package is compatible with Python 2.6+ and Python 3.3+. We highly recommend using Python 3.3+.

This package uses utf-8 encoding. In Python 2.6+, all strings are stored as unicode objects. In Python 3.3+, all strings are stored as str objects.
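Because the package works with UTF-8 text, it can help to normalize inputs before passing them to a pipeline. The helper below is a minimal sketch and is not part of ccg_nlpy; the name `to_text` is hypothetical. It decodes byte strings as UTF-8 and leaves text strings untouched.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text string, decoding bytes as UTF-8.

    Hypothetical helper for illustration; it is not part of ccg_nlpy.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


decoded = to_text(b"caf\xc3\xa9")  # five UTF-8 bytes
print(len(decoded))  # four characters after decoding
print(to_text("Hello, how are you."))  # text passes through unchanged
```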

Getting Started

Here is a sample usage showing how easily you can run our system:

from ccg_nlpy import remote_pipeline

pipeline = remote_pipeline.RemotePipeline()
doc = pipeline.doc("Hello, how are you. I am doing fine")
print(doc.get_lemma) # will produce (hello Hello) (, ,) (how how) (be are) (you you) (. .) (i I) (be am) (do doing) (fine fine)
print(doc.get_pos) # will produce (UH Hello) (, ,) (WRB how) (VBP are) (PRP you) (. .) (PRP I) (VBP am) (VBG doing) (JJ fine)

The default/easy usage has some restrictions, which we delineate in the next section.

API Docs: Here are the API docs of our Pipeliner module.

Structure

This tool enables you to access the CogComp pipeline in different forms. The figure below summarizes these approaches:

The figure above gives a summary of possible usages, as well as their pros and cons. Next we will go through each item and elaborate:

Remote Pipeline

In this setting, you can send annotation requests to a remote machine. Hence there is not much memory burden on your local machine. Instead all the heavy-lifting is on the remote server.

Default remote server: This is the default setting. Requests are sent to our remote server, so a network connection is required. This option is here to demonstrate how things work, but it is not a viable solution for big experiments, since we limit the number of queries to our server (the current limit is 100 queries a day). If you are a busy NLP user, you should use one of the other options.
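Given the daily query limit on the default server, one option is to count requests locally before they are sent. The wrapper below is only a sketch under assumptions: `QueryBudget` and `EchoPipeline` are hypothetical names, not part of ccg_nlpy, and the counter lives in the current process, so it knows nothing about what the server has already counted. It works with any object that exposes a `doc(text)` method.

```python
class QueryBudget(object):
    """Wrap a pipeline-like object and stop after `limit` calls to `doc`.

    Hypothetical helper: `pipeline` is any object with a `doc(text)` method.
    The count is kept locally and is not synchronized with the server.
    """

    def __init__(self, pipeline, limit=100):
        self.pipeline = pipeline
        self.limit = limit
        self.used = 0

    def doc(self, text):
        if self.used >= self.limit:
            raise RuntimeError("local query budget of %d exhausted" % self.limit)
        self.used += 1
        return self.pipeline.doc(text)


class EchoPipeline(object):
    """Stand-in for a real pipeline so the sketch runs without a server."""

    def doc(self, text):
        return text.upper()


budgeted = QueryBudget(EchoPipeline(), limit=2)
print(budgeted.doc("first"))
print(budgeted.doc("second"))
print(budgeted.used)
```

In real use the stand-in would be replaced by a pipeline object, and a third call in this example would raise `RuntimeError` instead of contacting the server.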

Starting your own (remote) server: If you have a big (remote) machine, this is probably a good option for you. You'll have to read the instructions on how to install the pipeline server in the pipeline project documentation. In summary:

  1. Clone our CogComp-NLP java project.
  2. Run pipeline/scripts/runWebserver.sh to start the server.
  3. When you see Server:xxx - Started @xxxxxms, the server is up and running.

After making sure that the server is running, we can make a Python call to it:

from ccg_nlpy import remote_pipeline
pipeline = remote_pipeline.RemotePipeline(server_api='http://www.fancyUrlName.com:8080') 
# constructor declaration: RemotePipeline(server_api = None, file_name = None)
# "server_api" is the address of the server as string. An example: http://www.fancyUrlName.com:8080
# "file_name" is the config file used to set up the pipeline (optional); please refer to the later section for more details

Note: This tool is based on CogComp's pipeline project. Essentially, any annotator included in the pipeline should be accessible here.

Local Pipeline

In this setting, the system will download the trained models and files required to run the pipeline locally. Since everything is run on your machine, it will probably require a lot of memory (the amount depends on which annotations you use). If you have a single big machine (i.e. memory > 15GB) for your experiments, this is probably a good option for you. The local pipeline also gives you the ability to work with pre-tokenized text.

To download the models, run the following command:

python -m ccg_nlpy download

This will download model files into your home directory under ~/.ccg_nlpy/.

Note: Downloading the models requires you to have Maven installed on your machine. If you don't, here are some guidelines on how to install it.

In the local pipeline annotators are loaded lazily; i.e. they are not loaded until you call them for the first time.

from ccg_nlpy import local_pipeline
pipeline = local_pipeline.LocalPipeline() 
# constructor declaration: LocalPipeline()
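To make the idea of lazy loading concrete, here is a generic sketch of the pattern. It is not the implementation used by ccg_nlpy; `LazyAnnotators`, `load_tagger`, and the annotator names are hypothetical and only illustrate "load on first use, then reuse".

```python
class LazyAnnotators(object):
    """Load each annotator on first use and cache it afterwards."""

    def __init__(self, loaders):
        self._loaders = loaders  # maps a name to a zero-argument loader
        self._loaded = {}

    def get(self, name):
        if name not in self._loaded:
            # First request for this name: pay the loading cost once.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_names(self):
        return sorted(self._loaded)


def load_tagger():
    # Placeholder for an expensive model load.
    return lambda tokens: [(token, "TAG") for token in tokens]


annotators = LazyAnnotators({"POS": load_tagger, "NER": load_tagger})
print(annotators.loaded_names())  # nothing has been loaded yet
tagger = annotators.get("POS")
print(annotators.loaded_names())  # only POS has been loaded
```

The practical consequence is that the first call to an annotator is slower than later calls, and memory grows only with the annotators that are actually used.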

To run on pre-tokenized text, represent the document as a list of sentences, where each sentence is a list of tokens. The argument pretokenized=True needs to be passed to the pipeline.doc function.

from ccg_nlpy import local_pipeline
pipeline = local_pipeline.LocalPipeline()

document = [ ["Hi", "!"], ["How", "are", "you", "?"] ]
doc = pipeline.doc(document, pretokenized=True)
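Before passing pre-tokenized input to the pipeline, it can be useful to check its shape. The function below is a sketch written for Python 3; `is_pretokenized` is a hypothetical name, and rejecting empty documents and empty sentences is a choice made here, not a documented requirement of ccg_nlpy.

```python
def is_pretokenized(document):
    """Return True if `document` is a non-empty list of non-empty token lists."""
    if not isinstance(document, list) or not document:
        return False
    for sentence in document:
        if not isinstance(sentence, list) or not sentence:
            return False
        if not all(isinstance(token, str) for token in sentence):
            return False
    return True


print(is_pretokenized([["Hi", "!"], ["How", "are", "you", "?"]]))
print(is_pretokenized("Hi ! How are you ?"))
```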

Frequent Issues:

  • To use the pipeline locally you have to make sure you have set the JAVA_HOME variable. On macOS, you can verify it with echo "$JAVA_HOME". If it is not set, you can export JAVA_HOME=$(/usr/libexec/java_home).
  • If you are using Java version > 8, you are likely to receive an error that looks like the following: ERROR:ccg_nlpy.local_pipeline:Error calling dlopen(b'/Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home/jre/lib/server/libjvm.dylib': b'dlopen(/Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home/jre/lib/server/libjvm.dylib, 10): image not found' To solve this, you have to install Java 8 on your machine and point your command line to it: export JAVA_HOME=`/usr/libexec/java_home -v 1.8`.
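The JAVA_HOME check can also be done from Python before creating a local pipeline. This is a minimal sketch; `java_home_problem` is a hypothetical helper, and it only checks that the variable is set and points to a directory. It does not verify the Java version.

```python
import os


def java_home_problem(environ=None):
    """Return a short message if JAVA_HOME looks unusable, else None."""
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return "JAVA_HOME does not point to a directory: %s" % java_home
    return None


# An empty mapping simulates an environment without JAVA_HOME.
print(java_home_problem({}))
```

Calling `java_home_problem()` with no argument inspects the real environment of the current process.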

Setting from Configuration file

You can also configure how CogComp-NLPy runs through a local configuration file, rather than setting options programmatically. Here is how:

from ccg_nlpy import remote_pipeline
pipeline = remote_pipeline.RemotePipeline(file_name = 'path_to_custom_config_file')

The default keys and values are specified below. If you want to use a custom config file, please provide a file in a similar format.

[remote_pipeline_setting]
api = ADDRESS_OF_THE_SERVER # example: http://fancyUrlName.com:8080
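A file in this format can be produced with Python's standard `configparser` module. The sketch below uses Python 3 and writes the section and key shown above, then reads the value back. The helper names and the file name `pipeline.cfg` are hypothetical, and how ccg_nlpy itself parses the file is not shown here.

```python
import configparser
import os
import tempfile


def write_remote_config(path, api):
    """Write a config file with the remote_pipeline_setting section."""
    config = configparser.ConfigParser()
    config["remote_pipeline_setting"] = {"api": api}
    with open(path, "w") as handle:
        config.write(handle)


def read_remote_api(path):
    """Read the `api` value back from the config file."""
    config = configparser.ConfigParser()
    config.read(path)
    return config["remote_pipeline_setting"]["api"]


config_path = os.path.join(tempfile.mkdtemp(), "pipeline.cfg")
write_remote_config(config_path, "http://www.fancyUrlName.com:8080")
print(read_remote_api(config_path))
```

The resulting path is what you would pass as the `file_name` argument of `RemotePipeline`.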

System failures

System failures are part of any software system. On certain failures (e.g. receiving error 500 from the remote pipeline), the call returns None. When processing big documents, it makes sense to handle this case explicitly:

d = ... # document
p = ... # pipeline
doc = p.doc(d)
if doc is not None:
    # do sth with it
    ner_view = doc.get_ner_conll
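For a batch of documents, the same None check can be wrapped in a loop that retries and records what still failed. This is a sketch under assumptions: `annotate_all`, its `retries` parameter, and the `FlakyPipeline` stand-in are hypothetical and not part of ccg_nlpy. The only behavior relied on is that `doc` returns None on failure.

```python
def annotate_all(pipeline, texts, retries=1):
    """Annotate each text; collect the ones that still return None.

    `pipeline` is any object with a `doc(text)` method that returns None on
    failure. `retries` is a hypothetical parameter of this sketch.
    """
    annotated, failed = [], []
    for text in texts:
        doc = None
        for _ in range(retries + 1):
            doc = pipeline.doc(text)
            if doc is not None:
                break
        if doc is None:
            failed.append(text)
        else:
            annotated.append(doc)
    return annotated, failed


class FlakyPipeline(object):
    """Stand-in that fails on the first call for each text, then succeeds."""

    def __init__(self):
        self.seen = set()

    def doc(self, text):
        if text == "always fails":
            return None
        if text not in self.seen:
            self.seen.add(text)
            return None
        return "annotated: " + text


docs, failures = annotate_all(FlakyPipeline(), ["Hello", "always fails"])
print(docs)
print(failures)
```

Keeping the failed inputs in a separate list makes it easy to rerun only those later, which matters when the remote server limits the number of queries.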

Running Tests (For Contributors)

  1. Make sure you have downloaded the models using python -m ccg_nlpy download so that local_pipeline tests can run smoothly.
  2. Create a pristine python2 environment (say, using conda create -n py27 python=2.7 anaconda).
  3. You may need to install cython for pyjnius in the new python2 environment (pip2 install cython).
  4. Run python setup.py test in the new environment.

All tests should run smoothly before you submit a pull request.

Questions/Suggestions/Comments

Use comments or pull requests.
