UPDATE MARCH 2018: this code is obsolete, beyond version compatibility issues, because CoreNLP now has an easy-to-use and well-documented server mode (and, I believe, has had one for a while). Example of using it: https://gist.github.com/brendano/29d9dc619bd7e087b459e6027a52af89
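
For reference, here is a minimal sketch of talking to that server from Python with the requests library. This is not part of this package; the port, annotator list, and server start command shown are assumptions you would adapt to your setup.

# Assumes a CoreNLP server is already running locally, e.g. started with:
#   java -mx4g -cp "stanford-corenlp-full-2018-02-27/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
import json
import requests

props = {"annotators": "tokenize,ssplit,pos", "outputFormat": "json"}
resp = requests.post("http://localhost:9000/",
                     params={"properties": json.dumps(props)},
                     data="hello world. how are you?".encode("utf-8"))
for sentence in resp.json()["sentences"]:
    print([(tok["word"], tok["pos"]) for tok in sentence["tokens"]])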

The only possible advantage of this wrapper code is that it does process management for you under the Python process, which might be slightly convenient since you don't have to run a separate server. But this architecture is much worse with regard to parallelization (the external server can load resources only once and use threads to serve multiple clients) and certain types of development convenience (with an external server, you don't have to re-load the models during development). This code could still be useful if you have to use an older CoreNLP version (for example, if you want to replicate older research results that depend on older formats).

=========================================================

This is a Python wrapper for the Stanford CoreNLP library for Unix (Mac, Linux), allowing sentence splitting, POS/NER tagging, temporal expression tagging, constituent and dependency parsing, and coreference annotation. It runs the Java software as a subprocess and communicates with it over named pipes or sockets. It can also be used to get a JSON-formatted version of the NLP annotations.

Alternatives you may want to consider:

Obsolete notice?: This was written around 2015 or so. But at some point afterward, CoreNLP added its own server mode, which is better to use than the server included in this package: it presumably stays up to date with their system, and it now has native JSON output support. This package should probably be replaced with a Python client for that server, possibly keeping the process management support.

Install

You need to have CoreNLP already downloaded. You can install this software with something like:

git clone https://github.com/brendano/stanford_corenlp_pywrapper
cd stanford_corenlp_pywrapper
pip install .

Or you can just put the stanford_corenlp_pywrapper subdirectory into your project (or use virtualenv, etc.). For example:

git clone https://github.com/brendano/stanford_corenlp_pywrapper scp_repo
ln -s scp_repo/stanford_corenlp_pywrapper .

Java needs to be a version that CoreNLP is happy with; perhaps version 8.

Commandline usage

See proc_text_files.py for an example of processing text files. Note that you'll have to edit it to specify the jar paths as described below.
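
For reference, a hypothetical sketch of what such a script might look like (this is not the actual proc_text_files.py; the mode and jar path are placeholders you would edit):

import json
import sys

from stanford_corenlp_pywrapper import CoreNLP

# Edit the jar path for your system (see "Usage from Python" below).
proc = CoreNLP("pos", corenlp_jars=["/path/to/stanford-corenlp-full-2015-04-20/*"])

# One input text file per command-line argument, one JSON document per output line.
for path in sys.argv[1:]:
    with open(path) as f:
        doc = proc.parse_doc(f.read())
    print(json.dumps(doc))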

Usage from Python

The basic arguments to open a server are (1) the pipeline mode (or, alternatively, the annotator pipeline), and (2) the path to the CoreNLP jar files (passed on to the Java classpath).

The pipeline modes are just quick shortcuts for some pipeline configurations we commonly use. They are defined near the top of stanford_corenlp_pywrapper/sockwrap.py and include:

  • ssplit: tokenization and sentence splitting (included in all subsequent ones)
  • pos: POS (and lemmas)
  • ner: POS and NER (and lemmas)
  • parse: fairly basic parsing with POS, lemmas, trees, dependencies
  • nerparse: parsing with NER, POS, lemmas, dependencies
  • coref: coreference, including constituent parsing

Here we assume the program has been installed using pip install. You will have to change corenlp_jars to where you have them on your system. Here's how to initialize the pipeline with the pos mode:

>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP("pos", corenlp_jars=["/home/sw/corenlp/stanford-corenlp-full-2015-04-20/*"])

If things are working, there will be lots of messages looking something like this:

INFO:CoreNLP_PyWrapper:mode given as 'pos' so setting annotators: tokenize, ssplit, pos, lemma
INFO:CoreNLP_PyWrapper:Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/Users/brendano/sw/nlp/stanford_corenlp_pywrapper/stanford_corenlp_pywrapper/lib/*:/home/sw/corenlp/stanford-corenlp-full-2015-04-20/*:/home/sw/stanford-srparser-2014-10-23-models.jar'      corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=140_time=1435943221.14  --configdict '{"annotators":"tokenize, ssplit, pos, lemma"}'
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.7 sec].
Adding annotator lemma
INFO:CoreNLP_JavaServer: CoreNLP pipeline initialized.
INFO:CoreNLP_JavaServer: Waiting for commands on stdin
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.

Now it's ready to parse documents. You give it a string and it returns JSON-safe data structures:

>>> proc.parse_doc("hello world. how are you?")
{u'sentences': 
    [
        {u'tokens': [u'hello', u'world', u'.'],
         u'lemmas': [u'hello', u'world', u'.'],
         u'pos': [u'UH', u'NN', u'.'],
         u'char_offsets': [[0, 5], [6, 11], [11, 12]]
        },
        {u'tokens': [u'how', u'are', u'you', u'?'],
         u'lemmas': [u'how', u'be', u'you', u'?'],
         u'pos': [u'WRB', u'VBP', u'PRP', u'.'],
         u'char_offsets': [[13, 16], [17, 20], [21, 24], [24, 25]]
        }
    ]
}
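
The returned structure is plain lists and dicts, so it can be walked directly. A small sketch (using the proc object created above) that pairs each token with its POS tag and the substring its character offsets point at:

text = "hello world. how are you?"
result = proc.parse_doc(text)
for sent in result['sentences']:
    for tok, tag, (start, end) in zip(sent['tokens'], sent['pos'], sent['char_offsets']):
        # char_offsets are 0-indexed, inclusive-exclusive spans into the input string
        print("%s\t%s\t%r" % (tok, tag, text[start:end]))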

You can also specify the annotators directly. For example, say we want to parse but don't want lemmas. This can be done with the configdict option:

>>> p = CoreNLP(configdict={'annotators':'tokenize, ssplit, pos, parse'}, output_types=['pos','parse'])

Or use an external configuration file (of the same sort the original CoreNLP commandline uses):

>>> p = CoreNLP(configfile='sample.ini')
>>> p.parse_doc("hello world. how are you?")
...

The annotators configuration option is explained more on the CoreNLP webpage.

Another example: using the shift-reduce constituent parser. The jar paths will have to be changed for your system.

>>> p = CoreNLP(configdict={
    'annotators': "tokenize,ssplit,pos,lemma,parse",
    'parse.model': 'edu/stanford/nlp/models/srparser/englishSR.ser.gz'},  
    corenlp_jars=["/path/to/stanford-corenlp-full-2015-04-20/*", "/path/to/stanford-srparser-2014-10-23-models.jar"])

Another example: coreference. Coreference is not reported in the same way as the other linguistic annotations. Where the other kinds of annotation (for instance, part-of-speech tags) are collected in the top-level sentences attribute of the JSON output, coreference annotations are collected in the top-level entities attribute.

In the example below, there are 3 entities in the two sentences. The first is "Fred", the second is "her", and the third is the telescope. The telescope is mentioned twice ("telescope" and "It"), so its entity has two mention objects in the JSON; "It" and "telescope" are said to co-refer.

>>> proc = CoreNLP("coref")
>>> proc.parse_doc("Fred saw her through a telescope. It was broken.")['entities']
[{u'entityid': 1,
  u'mentions': [{u'animacy': u'ANIMATE',
                 u'gender': u'MALE',
                 u'head': 0,
                 u'mentionid': 1,
                 u'mentiontype': u'PROPER',
                 u'number': u'SINGULAR',
                 u'representative': True,
                 u'sentence': 0,
                 u'tokspan_in_sentence': [0, 1]}]},
 {u'entityid': 2,
  u'mentions': [{u'animacy': u'ANIMATE',
                 u'gender': u'FEMALE',
                 u'head': 2,
                 u'mentionid': 2,
                 u'mentiontype': u'PRONOMINAL',
                 u'number': u'SINGULAR',
                 u'representative': True,
                 u'sentence': 0,
                 u'tokspan_in_sentence': [2, 3]}]},
 {u'entityid': 3,
  u'mentions': [{u'animacy': u'INANIMATE',
                 u'gender': u'NEUTRAL',
                 u'head': 5,
                 u'mentionid': 3,
                 u'mentiontype': u'NOMINAL',
                 u'number': u'SINGULAR',
                 u'representative': True,
                 u'sentence': 0,
                 u'tokspan_in_sentence': [4, 6]},
                {u'animacy': u'INANIMATE',
                 u'gender': u'NEUTRAL',
                 u'head': 0,
                 u'mentionid': 4,
                 u'mentiontype': u'PRONOMINAL',
                 u'number': u'SINGULAR',
                 u'sentence': 1,
                 u'tokspan_in_sentence': [0, 1]}]}]
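
Since spans are 0-indexed and inclusive-exclusive (see the Notes below), the mention spans can be used directly as Python slices. A small sketch that recovers the surface text of each mention from the full parse_doc output:

result = proc.parse_doc("Fred saw her through a telescope. It was broken.")
for entity in result['entities']:
    for mention in entity['mentions']:
        # Look up the tokens of the sentence this mention belongs to,
        # then slice out the mention's token span.
        sent_tokens = result['sentences'][mention['sentence']]['tokens']
        start, end = mention['tokspan_in_sentence']
        print("entity %d: %s" % (entity['entityid'], " ".join(sent_tokens[start:end])))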

Notes

  • We always use 0-indexed numbering conventions for token, sentence, and character indexes. Spans are always inclusive-exclusive pairs, just like Python slicing.

  • You can get the raw unserialized JSON with the option raw=True: e.g., parse_doc("Hello world.", raw=True). The python<->java communication is based on JSON, and this just hands it back without deserializing it. In fact, you can run the Java code as a standalone commandline program to produce just the JSON format. This can be helpful for storing parses from large corpora. (Even though this format is pretty repetitive, it is much more compact than CoreNLP's XML format. Though of course protobuf or something similar would be better.)

  • To use a different CoreNLP version, just update corenlp_jars to what you want. If a future CoreNLP breaks binary (Java API) compatibility, you'll have to edit the Java server code and re-compile with ./build.sh.

  • To change the Java settings, see the java_command and java_options arguments.

  • Output messages (on standard error) that start with INFO:CoreNLP_PyWrapper, INFO:CoreNLP_JavaServer, or INFO:CoreNLP_RWrapper are from our code. Other output is probably from CoreNLP.

  • Only works on Unix (Linux and Mac). Does not currently work on Windows.

  • If you want to know the latest version of CoreNLP this has been tested with, look at the paths in the default options in the Python source code.

  • SOCKET mode: By default, the inter-process communication is through named pipes, established with Unix calls. As an alternative, there is also a socket server mode (comm_mode='SOCKET') which is sometimes more robust, but requires using a port number, which you have to ensure does not conflict with any other processes running at the same time. (It's not much of a server, since the Python code assumes it's the only process communicating with it.) One advantage of 'SOCKET' mode is that it has a timeout, in case CoreNLP is taking a very long time to return an answer. See the sketch after these notes.

  • Question: do JPype or Py4J work well? They seemed complex, which is why we wrote our own IPC mechanism. But if there's a better alternative, this one may not be needed.
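
A minimal sketch combining two of the options above (the jar path is a placeholder; only the arguments named in these notes, comm_mode and raw, are used):

from stanford_corenlp_pywrapper import CoreNLP

# Socket-based IPC instead of named pipes (see the SOCKET note above).
proc = CoreNLP("pos",
               corenlp_jars=["/path/to/stanford-corenlp-full-2015-04-20/*"],
               comm_mode='SOCKET')

# Hand back the JSON string without deserializing it (see the raw=True note above).
raw_json = proc.parse_doc("Hello world.", raw=True)
print(raw_json)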

Testing

There are a few pytest-style tests:

py.test -v sockwrap.py

Changelog

Major changes include:

  • 2015-07-03: add pipe mode and make it default (the namedpipe branch), plus an R wrapper.
  • 2015-05-15: no longer need to specify output_types (the outputs to include are inferred from the annotators setting).

For details see the commit log.

License etc.

Copyright Brendan O'Connor (http://brenocon.com).
License GPL version 2 or later.
