
TreebankPreprocessing

Python scripts for preprocessing the Penn Treebank (PTB) and Chinese Treebank 5.1 (CTB). They can convert the treebanks into:

Corpus                        | Format  | Description
constituency parse tree       | .txt    | one line for one sentence
dependency parse tree         | .conllx | Basic Stanford Dependencies (SD)
word segmentation corpus      | .tsv    | first column for characters, second column for BMES tags, sentences separated by a blank line
part-of-speech tagging corpus | .tsv    | first column for words, second column for tags, sentences separated by a blank line
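
For instance, the word-segmentation .tsv stores one character per line together with a BMES tag (Begin/Middle/End/Single). A minimal sketch of how such tags can be derived from a segmented sentence (the helper to_bmes is illustrative, not part of these scripts):

def to_bmes(words):
    """Turn a list of segmented words into (character, BMES-tag) pairs."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, 'S'))        # single-character word
        else:
            pairs.append((word[0], 'B'))     # begin of a word
            for char in word[1:-1]:
                pairs.append((char, 'M'))    # middle of a word
            pairs.append((word[-1], 'E'))    # end of a word
    return pairs

# One sentence per block, columns separated by a tab, blocks by a blank line.
for char, tag in to_bmes(['我', '爱', '自然', '语言', '处理']):
    print(f'{char}\t{tag}')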

When designing a tagger or parser, preprocessing the treebanks is a troublesome chore. We need to:

  • Split the dataset into train/dev/test sets, following the conventional splits.
  • Remove the XML tags inside CTB files.
  • Combine the multi-line bracketed files into one file, one line per sentence (see the sketch below).

I wondered why there were no open-source tools handling this tedious work, so I finally decided to write one myself. Hopefully it will save you some time.
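
As an example of the third chore above, NLTK can collapse a multi-line bracketed tree onto a single line. A minimal sketch, assuming NLTK is installed (illustrative only, not the scripts' own code):

from nltk.tree import Tree

# A bracketed parse that spans several lines in the original file.
multiline = """
( (S
    (NP (NNP John))
    (VP (VBZ loves)
        (NP (NNP Mary)))
    (. .)) )
"""

tree = Tree.fromstring(multiline)
# Re-serialize and squeeze all whitespace so the tree fits on one line.
one_line = ' '.join(str(tree).split())
print(one_line)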

Required software

  • Python3
  • NLTK
  • stanford-parser (optional, for converting to dependency parse trees)

Overview

What kinds of tasks can we perform on treebanks?

Chinese Word Segmentation

For CTB, the segmentation corpus is split as per Jiang et al. (2009):

  • CTB Training: 001–270, 400–1151. Development: 301–325. Test: 271–300.
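
A sketch of how a CTB filename such as chtb_0271.fid might be routed to this split (the helper and the filename pattern are my assumptions, not lifted from ctb.py):

import re

def ctb_seg_split(filename):
    """Assign a CTB 5.1 file to the word-segmentation split of Jiang et al. (2009)."""
    fid = int(re.search(r'chtb_0*(\d+)', filename).group(1))
    if 1 <= fid <= 270 or 400 <= fid <= 1151:
        return 'train'
    if 301 <= fid <= 325:
        return 'dev'
    if 271 <= fid <= 300:
        return 'test'
    return 'unused'  # ids outside the listed ranges

print(ctb_seg_split('chtb_0285.fid'))  # -> test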

Part-of-Speech Tagging

  • PTB Training: sections 0–18. Development: sections 19–21. Test: sections 22–24, as per Collins (2002) and Choi (2016).
  • CTB The same split as for Chinese word segmentation.

Phrase Structure Parsing

These scripts can also convert treebanks into the conventional data setup of Chen and Manning (2014) and Dyer et al. (2015). The detailed splits are:

  • PTB Training: sections 02–21. Development: section 22. Test: section 23.
  • CTB Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
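
The PTB split is by WSJ section number, which can be read off a file path such as WSJ/23/WSJ_2300.MRG. A small sketch of that mapping (the function and path format are illustrative, not the scripts' own code):

import re

def ptb_parse_split(fileid):
    """Assign a WSJ .MRG file to the Chen and Manning (2014) parsing split."""
    section = int(re.search(r'WSJ_(\d\d)\d\d', fileid).group(1))
    if 2 <= section <= 21:
        return 'train'
    if section == 22:
        return 'dev'
    if section == 23:
        return 'test'
    return 'unused'  # sections 00, 01 and 24 are left out of this setup

print(ptb_parse_split('WSJ/23/WSJ_2300.MRG'))  # -> test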

Dependency Parsing

You will need the Stanford Parser to convert phrase structure trees to dependency parse trees. Please download Stanford Parser version 3.3.0 and place the two jar files in this folder:

TreebankPreprocessing
├── ...
├── stanford-parser-3.3.0-models.jar
└── stanford-parser.jar
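
Under the hood, conversion to basic Stanford Dependencies is typically done by the converter class shipped with the parser. A rough sketch of such a call (the flags and file names reflect a standard invocation and are my assumptions, not necessarily what tb_to_stanford.py does internally):

import subprocess

# Classpath with the two jars above (':' on Linux/macOS, ';' on Windows).
classpath = 'stanford-parser.jar:stanford-parser-3.3.0-models.jar'

# Convert a file of one-line bracketed trees to basic SD in CoNLL-X format.
with open('train.conllx', 'w') as out:
    subprocess.run([
        'java', '-cp', classpath,
        'edu.stanford.nlp.trees.EnglishGrammaticalStructure',
        '-treeFile', 'train.txt',
        '-basic', '-conllx',
    ], stdout=out, check=True)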

OK, let's walk through it.

PTB

1. Import PTB into NLTK

Parsing the bracketed files relies on NLTK. Please follow the NLTK instructions and put BROWN and WSJ into nltk_data/corpora/ptb, e.g.:

ptb
├── BROWN
└── WSJ
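
To sanity-check the layout, the corpus can be read directly with NLTK's bracketed-parse reader. A small sketch, where the path is a placeholder for your own nltk_data location:

from nltk.corpus.reader import BracketParseCorpusReader

# Point the reader at the ptb folder prepared above (adjust the path).
reader = BracketParseCorpusReader('nltk_data/corpora/ptb', r'WSJ/\d\d/WSJ_\d{4}\.MRG')

print(len(reader.fileids()))      # number of WSJ .MRG files found
print(reader.parsed_sents()[0])   # first parse tree in the corpus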

2. Run ptb.py

This script does all the work for you; it only requires a path in which to store the output.

$ python3 ptb.py --help 
usage: ptb.py [-h] --output OUTPUT [--task TASK]

Combine Penn Treebank WSJ MRG files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output train/dev/test files
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging
  • You will get 3 .txt files corresponding to the train/dev/test sets.
  • If you want part-of-speech tagging corpora, simply append --task pos. This time, you get 3 .tsv files.
  • The .txt files can be converted to .conllx files by tb_to_stanford.py:
$ python3 tb_to_stanford.py --help
usage: tb_to_stanford.py [-h] --input INPUT --lang LANG --output OUTPUT

Convert combined Penn Treebank files (.txt) to Stanford Dependency format
(.conllx)

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    The folder containing train.txt/dev.txt/test.txt in
                   bracketed format
  --lang LANG      Which language? Use en for English, cn for Chinese
  --output OUTPUT  The folder where to store the output
                   train.conllx/dev.conllx/test.conllx in Stanford Dependency
                   format
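
The resulting .conllx files follow the 10-column CoNLL-X layout: one token per line, tab-separated columns, sentences separated by a blank line. A small sketch for loading them (the reader function is mine, for illustration only):

def read_conllx(path):
    """Yield sentences as lists of (form, head, deprel) triples from a CoNLL-X file."""
    sentence = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:                 # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split('\t')
            # Columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL
            sentence.append((cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence

for sent in read_conllx('train.conllx'):
    print(sent[:3])
    break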

CTB

CTB is a little messy: it contains extra XML tags in every gold tree and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing index.html).
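
The extra tags look like SGML markup wrapped around each tree (<S ID=...>, </S>, and so on). A minimal sketch of filtering them out of a .fid file, assuming the tags sit on lines of their own (illustrative only, not the exact logic of ctb.py):

def strip_ctb_tags(path):
    """Keep only the bracketed-tree lines of a CTB .fid file, dropping tag lines."""
    # Adjust the encoding if your copy of CTB 5.1 is GB-encoded rather than UTF-8.
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or (line.startswith('<') and line.endswith('>')):
                continue                 # skip blank lines and tag lines such as <S ID=...>
            yield line

tree_lines = list(strip_ctb_tags('chtb_0001.fid'))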

$ python3 ctb.py --help           
usage: ctb.py [-h] --ctb CTB --output OUTPUT [--task TASK]

Combine Chinese Treebank 5.1 fid files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (seg, pos, par)? Use seg for word segmentation,
                   pos for part-of-speech tagging, par for phrase structure
                   parsing
  • Tagging and dependency parsing corpora can be obtained in the same way as for PTB.

Then you can start your research. Enjoy!
