  • Stars: 138
  • Rank: 264,508 (top 6%)
  • Language: Python
  • License: Apache License 2.0
  • Created: about 9 years ago
  • Updated: over 6 years ago

Repository Details

Stanford NLP group's shared Python tools.

Stanza

Stanza is the Stanford NLP group’s shared repository for Python infrastructure. The goal of Stanza is not to replace your modeling tools of choice, but to offer implementations for common patterns useful for machine learning experiments.

Usage

You can install the package as follows:

git clone git@github.com:stanfordnlp/stanza.git
cd stanza
pip install -e .

To use the package, import it in your Python code. For example:

from stanza.text.vocab import Vocab
v = Vocab('UNK')

To use the Python client for the CoreNLP server, first launch your CoreNLP Java server (an example launch command is shown after the snippet below). Then, in your Python program:

from stanza.nlp.corenlp import CoreNLPClient
client = CoreNLPClient(server='http://localhost:9000', default_annotators=['ssplit', 'tokenize', 'lemma', 'pos', 'ner'])
annotated = client.annotate('This is an example document. Here is a second sentence')
for sentence in annotated.sentences:
    print('sentence', sentence)
    for token in sentence:
        print(token.word, token.lemma, token.pos, token.ner)
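
If you do not already have a server running, the CoreNLP documentation describes starting one from the directory containing the CoreNLP jars. A typical launch command looks like the following (the memory and timeout settings are only an example; the port matches the client above):

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000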

Please see the documentation for more use cases.

Documentation

Documentation is hosted on Read the Docs at http://stanza.readthedocs.org/en/latest/. Stanza is still in early development. Interfaces and code organization will probably change substantially over the next few months.

Development Guide

To request or discuss additional functionality, please open a GitHub issue. We greatly appreciate pull requests!

Tests

Stanza has unit tests, doctests, and longer integration tests. We ask that all contributors run the unit tests and doctests before submitting pull requests:

python setup.py test

Doctests are the easiest way to write a test for new functionality, and they serve as helpful examples of how to use your code. See progress.py for a simple example of an easily testable module, or summary.py for a more involved setup involving a mocked filesystem.
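
As a rough illustration (the function below is a made-up example, not part of Stanza), a doctest is just an interactive example embedded in a docstring that the test runner can execute and check:

def count_tokens(tokens):
    """Count occurrences of each token in a list of strings.

    >>> sorted(count_tokens(['a', 'b', 'a']).items())
    [('a', 2), ('b', 1)]
    """
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts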

Adding a new module

If you are adding a new module, please remember to add it to setup.py as well as a corresponding .rst file in the docs directory.
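
For instance, a minimal .rst stub for a hypothetical module stanza.foo (sphinx-apidoc, described below, generates files in roughly this shape) could look like:

stanza.foo module
=================

.. automodule:: stanza.foo
   :members:
   :undoc-members:
   :show-inheritance: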

Documentation

Documentation is generated via Sphinx from inline comments. This means that Python docstrings double as both interactive documentation and standalone documentation, and that you must format your docstrings in RST. RST is very similar to Markdown; there are many tutorials on the exact syntax, and essentially you only need to know the function parameter syntax (see the sketch below). You can, of course, look at the documentation for existing modules for guidance as well. A good place to start is the text.dataset package.
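
As a quick sketch of that parameter syntax (a generic example, not an actual Stanza function):

def truncate(text, max_len):
    """Truncate a string to at most ``max_len`` characters.

    :param str text: the string to truncate
    :param int max_len: maximum number of characters to keep
    :returns: the possibly shortened string
    :rtype: str
    """
    return text[:max_len]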

To set up your environment such that you can generate docs locally:

pip install sphinx sphinx-autobuild

If you introduced a new module, please auto-generate the docs:

sphinx-apidoc -F -o docs stanza
cd docs && make html
open _build/html/index.html

You will most likely need to manually edit the .rst file corresponding to your new module.

Our docs are hosted on Read the Docs. If you'd like admin access to the Read the Docs project, please contact Victor or Will.

Road Map

  • common objects used in NLP
    • [x] a Vocabulary object mapping from strings to integers/vectors
  • tools for running experiments on the NLP cluster
    • [ ] a function for querying GPU device stats (to aid in selecting a GPU on the cluster)
    • [ ] a tool for plotting training curves from multiple jobs
    • [ ] a tool for interacting with an already running job via edits to a text file
  • [x] an API for calling CoreNLP

For Stanford NLP members

Stanza is not meant to include every research project the group undertakes. If you have a standalone project that you would like to share with other people in the group, you have a few options.

Using git subtree

That said, it can be useful to add functionality to Stanza while you work in a separate repo on a project that depends on Stanza. Since Stanza is under active development, you will want to version-control the Stanza code that your code uses. Probably the most effective way of accomplishing this is by using git subtree.

git subtree includes the source tree of another repo (in this case, Stanza) as a directory within your repo (your cutting-edge research), and keeps track of some metadata that allows you to keep that directory in sync with the original Stanza code. The main advantage of git subtree is that you can modify the Stanza code locally, merge in updates, and push your changes back to the Stanza repo to share them with the group. (git submodule doesn't allow this.)

It has some downsides to be aware of:

  • You have a copy of all of Stanza as part of your repo. For small projects, this could increase your repo size dramatically. (Note: you can keep the history of your repo from growing at the same rate as Stanza's by using squashed commits; it's only the size of the source tree that unavoidably bloats your project.)
  • Your repo's history will contain a merge commit every time you update Stanza from upstream. This can look ugly, especially in graphical viewers.

Still, subtree can be configured to be fairly easy to use, and the consensus seems to be that it is superior to submodule (https://codingkilledthecat.wordpress.com/2012/04/28/why-your-company-shouldnt-use-git-submodules/).

Here's one way to configure subtree so that you can include Stanza in your repo and contribute your changes back to the master repo:

# Add Stanza as a remote repo
git remote add stanza http://<your github username>@github.com/stanfordnlp/stanza.git
# Import the contents of the repo as a subtree
git subtree add --prefix third-party/stanza stanza develop --squash
# Put a symlink to the actual module somewhere where your code needs it
ln -s third-party/stanza/stanza stanza
# Add aliases for the two things you'll need to do with the subtree
git config alias.stanza-update 'subtree pull --prefix third-party/stanza stanza develop --squash'
git config alias.stanza-push 'subtree push --prefix third-party/stanza stanza develop'

After this, you can use the aliases to push and pull Stanza like so:

git stanza-update
git stanza-push

I [@futurulus] highly recommend a topic branch/rebase workflow, which will keep your history fairly clean besides those pesky subtree merge commits:

# Create a topic branch
git checkout -b fix-stanza
# <hack hack hack, make some commits>

git checkout master
# Update Stanza on master, should go smoothly because master doesn't
# have any of your changes yet
git stanza-update

# Go back and replay your fixes on top of master changes
git checkout fix-stanza
git rebase master
# You might need to resolve merge conflicts here

# Add your rebased changes to master and push
git checkout master
git merge --ff-only fix-stanza
git stanza-push
# Done!
git branch -d fix-stanza

More Repositories

1. dspy (Python, 18,220 stars) - DSPy: The framework for programming—not prompting—foundation models
2. CoreNLP (Java, 9,678 stars) - CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
3. stanza (Python, 7,278 stars) - Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
4. GloVe (C, 6,867 stars) - Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
5. cs224n-winter17-notes (TeX, 1,587 stars) - Course notes for CS224N Winter17
6. pyreft (Python, 1,137 stars) - ReFT: Representation Finetuning for Language Models
7. treelstm (Lua, 875 stars) - Tree-structured Long Short-Term Memory networks (http://arxiv.org/abs/1503.00075)
8. pyvene (Python, 625 stars) - Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
9. string2string (Jupyter Notebook, 533 stars) - String-to-String Algorithms for Natural Language Processing
10. python-stanford-corenlp (Python, 516 stars) - Python interface to CoreNLP using a bidirectional server-client interface.
11. mac-network (Python, 494 stars) - Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
12. phrasal (Java, 208 stars) - A large-scale statistical machine translation system written in Java.
13. spinn (Python, 205 stars) - SPINN (Stack-augmented Parser-Interpreter Neural Network): fast, batchable, context-aware TreeRNNs
14. coqa-baselines (Python, 176 stars) - The baselines used in the CoQA paper
15. cocoa (Python, 158 stars) - Framework for learning dialogue agents in a two-player game setting.
16. chirpycardinal (Python, 131 stars) - Stanford's Alexa Prize socialbot
17. stanfordnlp (Python, 114 stars) - [Deprecated] This library has been renamed to "Stanza". Latest development at: https://github.com/stanfordnlp/stanza
18. wge (Python, 109 stars) - Workflow-Guided Exploration: sample-efficient RL agent for web tasks
19. pdf-struct (Python, 81 stars) - Logical structure analysis for visually structured documents
20. edu-convokit (Jupyter Notebook, 75 stars) - Edu-ConvoKit: An Open-Source Framework for Education Conversation Data
21. cs224n-web (HTML, 60 stars) - http://cs224n.stanford.edu
22. ColBERT-QA (40 stars) - Code for Relevance-guided Supervision for OpenQA with ColBERT (TACL'21)
23. stanza-train (Python, 37 stars) - Model training tutorials for the Stanza Python NLP Library
24. phrasenode (Python, 37 stars) - Mapping natural language commands to web elements
25. contract-nli-bert (Python, 29 stars) - A baseline system for ContractNLI (https://stanfordnlp.github.io/contract-nli/)
26. color-describer (OpenEdge ABL, 26 stars) - Code for Learning to Generate Compositional Color Descriptions
27. stanza-resources (23 stars)
28. python-corenlp-protobuf (Python, 20 stars) - Python bindings for Stanford CoreNLP's protobufs.
29. miniwob-plusplus-demos (17 stars) - Demos for the MiniWoB++ benchmark
30. multi-distribution-retrieval (Python, 14 stars) - Code for our paper Resources and Evaluations for Multi-Distribution Dense Information Retrieval
31. huggingface-models (Python, 11 stars) - Scripts for pushing models to huggingface repos
32. nlp-meetup-demo (Java, 8 stars)
33. sentiment-treebank (Python, 8 stars) - Updated version of SST
34. en-worldwide-newswire (Python, 7 stars) - An English NER dataset built from foreign newswire
35. plot-data (Jupyter Notebook, 6 stars) - datasets for plotting
36. contract-nli (HTML, 4 stars) - ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
37. plot-interface (JavaScript, 3 stars) - Web interface for the plotting project
38. handparsed-treebank (Perl, 2 stars) - Extra hand parsed data for training models
39. coqa (Shell, 2 stars) - CoQA -- A Conversational Question Answering Challenge
40. pdf-struct-models (HTML, 2 stars) - A repository for hosting models for https://github.com/stanfordnlp/pdf-struct
41. chirpy-parlai-blenderbot-fork (Python, 2 stars) - A fork of ParlAI supporting Chirpy Cardinal's custom neural generator
42. wob-data (Python, 2 stars) - Data for QAWoB and FlightWoB web interaction benchmarks from the World of Bits paper (Shi et al., 2017).
43. pdf-struct-dataset (HTML, 1 star) - Dataset for pdf-struct (https://github.com/stanfordnlp/pdf-struct)
44. nn-depparser (Python, 1 star) - A re-implementation of nndep using PyTorch.