

UD Tools


This repository contains various scripts in Perl and Python that can be used as tools for Universal Dependencies.

validate.py

Reads a CoNLL-U file and verifies that it complies with the UD specification. It must be run with a language code, and corresponding lists of treebank-specific features and dependency relations must exist so that their validity can be checked as well.

The script runs under Python 3 and needs the third-party module regex. If you do not have the regex module, install it using pip install --user regex.

NOTE: Depending on the configuration of your system, it is possible that both Python 2 and 3 are installed; then you may have to run python3 instead of python, and pip3 instead of pip.

cat la_proiel-ud-train.conllu | python validate.py --lang la --max-err=0

You can run python validate.py --help for a list of available options.

eval.py

Evaluates the accuracy of a UD tokenizer / lemmatizer / tagger / parser against gold-standard data. The script was originally developed for the CoNLL 2017 and 2018 shared tasks in UD parsing, and later extended to handle the enhanced dependency representation in the IWPT 2020 and 2021 shared tasks.

python eval.py -v goldstandard.conllu systemoutput.conllu

For more details on usage, see the comments in the script. For more details on the metrics reported, see the overview papers of the shared tasks mentioned above.

check_sentence_ids.pl

Reads CoNLL-U files from STDIN and verifies that every sentence has a unique id in the sent_id comment. All files of one treebank (repository) must be supplied at once in order to test treebank-wide id uniqueness.

cat *.conllu | perl check_sentence_ids.pl
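
For illustration, the core of the check could be sketched in Python roughly as follows (this is not the actual script, only the idea):

import collections
import sys

# Count how many times each sent_id occurs in the CoNLL-U data on STDIN.
counter = collections.Counter()
for line in sys.stdin:
    if line.startswith('# sent_id') and '=' in line:
        counter[line.split('=', 1)[1].strip()] += 1

# Report every id that is not unique.
for sid, n in counter.items():
    if n > 1:
        print(f'duplicate sent_id: {sid} ({n} occurrences)')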

normalize_unicode.pl

Converts Unicode to the NFC normalized form. Can be applied to any UTF-8-encoded text file, including CoNLL-U. As a result, character combinations that by definition must look the same are represented by the same sequence of bytes, which improves the accuracy of models (as long as the models are also applied to normalized data).

Beware: The output may slightly differ depending on your version of Perl because the Unicode standard evolves and newer Perl versions incorporate newer versions of Unicode data.

perl normalize_unicode.pl < input.conllu > normalized_output.conllu
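
What NFC normalization does can be illustrated in a few lines of Python (the script itself is written in Perl; this is only a sketch of the concept):

import unicodedata

decomposed = 'e\u0301'                  # 'e' followed by a combining acute accent
composed = unicodedata.normalize('NFC', decomposed)
print(decomposed == composed)           # False: different code point sequences
print(len(decomposed), len(composed))   # 2 1 -- yet both render as 'é'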

conllu-stats.pl

Reads a CoNLL-U file, collects various statistics and prints them. This Perl script should not be confused with conllu-stats.py, an old Python 2 program that collects just a few very basic statistics. The Perl script (conllu-stats.pl) is used to generate the stats.xml files in each data repository.

The script depends on Perl libraries YAML and JSON::Parse that may not be installed automatically with Perl. If they are not installed on your system, you should be able to install them with the cpan command: cpan YAML and cpan JSON::Parse.
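
That is, before running the script for the first time:

cpan YAML
cpan JSON::Parse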

perl conllu-stats.pl *.conllu > stats.xml

mwtoken-stats.pl

Reads a CoNLL-U file, collects statistics of multi-word tokens and prints them.

cat *.conllu | perl mwtoken-stats.pl > mwtoken-stats.txt

enhanced_graph_properties.pl

Reads a CoNLL-U file, collects statistics about the enhanced graphs in the DEPS column and prints them. This script uses the modules Graph.pm and Node.pm that lie in the same folder. On UNIX-like systems, the script should be able to tell Perl where to find the modules even if it is invoked from another folder; if that does not work, invoke it as perl -I libfolder script. Also note that it needs other third-party modules that are not automatically included in a Perl installation: Moose, MooseX::SemiAffordanceAccessor and List::MoreUtils. You may need to install them with the cpan tool (go to the command line and type sudo cpan Moose).

cat *.conllu | perl enhanced_graph_properties.pl > eud-stats.txt
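
If Perl cannot find the modules, an explicit invocation could look like this (the path is only an example; adjust it to wherever the tools repository is checked out):

cat *.conllu | perl -I /path/to/tools /path/to/tools/enhanced_graph_properties.pl > eud-stats.txt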

enhanced_collapse_empty_nodes.pl

Reads a CoNLL-U file, removes empty nodes and adjusts the enhanced graphs so that a path traversing one or more empty nodes is contracted into a single edge: if there was a conj edge from node 27 to node 33.1, and an nsubj edge from node 33.1 to node 33, the resulting graph will have an edge from 27 to 33, labeled conj>nsubj.

This script uses the modules Graph.pm and Node.pm that lie in the same folder. On UNIX-like systems, the script should be able to tell Perl where to find the modules even if it is invoked from another folder; if that does not work, invoke it as perl -I libfolder script. Also note that it needs other third-party modules that are not automatically included in a Perl installation: Moose, MooseX::SemiAffordanceAccessor and List::MoreUtils. You may need to install them with the cpan tool (go to the command line and type sudo cpan Moose).

perl enhanced_collapse_empty_nodes.pl enhanced.conllu > collapsed.conllu
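
The contraction itself can be sketched in Python as follows (an illustration of the idea only, not the actual Perl implementation; the edge representation and the empty-node test are simplifying assumptions):

# Enhanced edges are modeled as (head, dependent, label) triples;
# ids of empty nodes contain a dot, e.g. '33.1'.
# (Assumes the enhanced graph has no cycles that pass through empty nodes.)

def is_empty(node_id):
    return '.' in node_id

def contract(edges):
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for head, dep, label in sorted(edges):
            if is_empty(head):
                # Splice every edge that ends at the empty node with this one.
                for h, d, l in [e for e in edges if e[1] == head]:
                    edges.add((h, dep, l + '>' + label))
                edges.discard((head, dep, label))
                changed = True
    # Drop any leftover edges that still touch an empty node.
    return {e for e in edges if not is_empty(e[0]) and not is_empty(e[1])}

# The example from the description above:
print(contract({('27', '33.1', 'conj'), ('33.1', '33', 'nsubj')}))
# -> {('27', '33', 'conj>nsubj')}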

overlap.py

Compares two CoNLL-U files and searches for sentences that occur in both (verbatim duplicates of token sequences). Some treebanks, especially those whose original text was acquired from the web, contained duplicate documents that were found at different addresses and downloaded twice. This tool helps to find out whether one of the duplicates ended up in the training data and the other in the development or test data. The output has to be verified manually, as some "duplicates" are repetitions that occur naturally in the language (in particular short sentences such as "Thank you.").

The script can also help to figure out whether the training-dev-test split changed between two releases so that a sentence that used to be in the training data is now in the test data, or vice versa; that is something we want to avoid.
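
The core idea can be sketched in a few lines of Python (an illustration only, not the actual script; the handling of comment lines, multi-word tokens and empty nodes is simplified):

import sys

def sentences(path):
    """Yield each sentence of a CoNLL-U file as a tuple of its word forms."""
    with open(path, encoding='utf-8') as f:
        sent = []
        for line in f:
            line = line.rstrip('\n')
            if not line:
                if sent:
                    yield tuple(sent)
                sent = []
            elif not line.startswith('#'):
                cols = line.split('\t')
                if cols[0].isdigit():        # skip multi-word token and empty node lines
                    sent.append(cols[1])     # FORM column
        if sent:
            yield tuple(sent)

common = set(sentences(sys.argv[1])) & set(sentences(sys.argv[2]))
for s in sorted(common, key=len):
    print(' '.join(s))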

find_duplicate_sentences.pl & remove_duplicate_sentences.pl

Similar to overlap.py, but these scripts work with the sentence-level text attribute. They remember all sentences from STDIN or from input files whose names are given as arguments. The find script prints the duplicate sentences (ordered by length and number of occurrences) to STDOUT. The remove script works as a filter: it prints the CoNLL-U data from the input, omitting the second and any subsequent occurrence of each duplicate sentence.
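
Typical invocations might look like this (the output file names are just examples):

perl find_duplicate_sentences.pl *.conllu > duplicate-sentences.txt
cat *.conllu | perl remove_duplicate_sentences.pl > no-duplicates.conllu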

conllu_to_conllx.pl

Converts a file in the CoNLL-U format to the old CoNLL-X format. Useful with old tools (e.g. parsers) that require CoNLL-X as their input. Usage:

perl conllu_to_conllx.pl < file.conllu > file.conll

restore_conllu_lines.pl

Merges a CoNLL-X and a CoNLL-U file, taking only the CoNLL-U-specific lines from CoNLL-U. Can be used to merge the output of an old parser that only works with CoNLL-X with the original annotation that the parser could not read.

restore_conllu_lines.pl file-parsed.conll file.conllu

conllu_to_text.pl

Converts a file in the CoNLL-U format to plain text, word-wrapped to lines of 80 characters (but the output line will be longer if there is a word that is longer than the limit). The script can use either the sentence-level text attribute, or the word forms plus the SpaceAfter=No MISC attribute to output detokenized text. It also observes the sentence-level newdoc and newpar attributes, and the NewPar=Yes MISC attribute, if they are present, and prints an empty line between paragraphs or documents.

Optionally, the script takes the language code as a parameter. Codes 'zh' and 'ja' will trigger a different word-wrapping algorithm that is more suitable for Chinese and Japanese.

Usage:

perl conllu_to_text.pl --lang zh < file.conllu > file.txt

conll_convert_tags_to_uposf.pl

This script takes the CoNLL columns CPOS, POS and FEAT and converts their combined values to the universal POS tag and features.

You need Perl. On Linux, you probably already have it; on Windows, you may have to download and install Strawberry Perl. You also need the Interset libraries. Once you have Perl, it is easy to get them via the following (call cpan instead of cpanm if you do not have cpanm).

cpanm Lingua::Interset

Then use the script like this:

perl conll_convert_tags_to_uposf.pl -f source_tagset < input.conll > output.conll

The source tagset is the identifier of the tagset used in your data and known to Interset. Typically it is the language code followed by two colons and conll, e.g. sl::conll for the Slovenian data of CoNLL 2006. See the tagset conversion tables for more tagset codes.
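
For instance, for the Slovenian CoNLL 2006 data mentioned above (the file names are just examples):

perl conll_convert_tags_to_uposf.pl -f sl::conll < sl-train.conll > sl-train-upos.conll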

IMPORTANT:

The script assumes the CoNLL-X (2006 and 2007) file format. If your data is in another format (most notably CoNLL-U, but also e.g. CoNLL 2008/2009, which is not identical to 2006/2007), you have to modify the data or the script.

Furthermore, you have to know something about the tagset driver (-f source_tagset above) you are going to use. Some drivers do not expect to receive three values joined by TAB characters. Some expect two values and many expect just a single tag, perhaps the one you have in your POS column. These factors may also require you to adapt the script to your needs. You may want to consult the documentation. Go to Browse / Interset / Tagset, look up your language code and tagset name, then locate the list() function in the source code. That will give you an idea of what the input tags should look like (usually the driver is able to decode even some tags that are not on the list but have the same structure and feature values).

check_files.pl

This script checks the contents of one data repository for missing or extra files, invalid metadata in the README, etc. Together with validate.py, which checks the contents of the individual CoNLL-U files, this script assesses whether a treebank is valid and ready to be released.

check_release.pl

This script must be run in a folder where all the data repositories (UD_*) are stored as subfolders. It checks the contents of the data repositories for various issues that we want to solve before a new release of UD is published.

conllu_align_tokens.pl

Compares the tokenization and word segmentation of two CoNLL-U files. It assumes that no normalization was performed, that is, the sequence of non-whitespace characters is identical on both sides. Use case: we want to merge a gold-standard file, which has no lemmas, with lemmatization predicted by an external tool; however, the tool also performed its own tokenization, and there is no guarantee that it matches the gold-standard tokenization. Despite its name, the script now does exactly that, i.e., it copies the system lemma to the gold-standard annotation wherever the tokens match, and prints the merged file to STDOUT. If something other than the lemma is to be copied, the source code must be adjusted.

perl conllu_align_tokens.pl UD_Turkish-PUD/tr_pud-ud-test.conllu media/conll17-ud-test-2017-05-09/UFAL-UDPipe-1-2/2017-05-15-02-00-38/output/tr_pud.conllu
