  • Stars
    478
  • Rank 91,331 (Top 2 %)
  • Language
    Python
  • License
    GNU General Publi...
  • Created about 14 years ago
  • Updated about 1 year ago


Repository Details


PyNLPl - Python Natural Language Processing Library


PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

The library is divided into several packages and modules. It works on Python 2.7, as well as Python 3.
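
To make the basic tasks mentioned above concrete, here is a minimal sketch of n-gram extraction and frequency counting. It deliberately uses only the Python standard library and is not PyNLPl's own API; the helper name ngrams and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    # Hypothetical helper for illustration; not a PyNLPl function.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


text = "to be or not to be"
tokens = text.split()  # naive whitespace tokenisation, for illustration only

# A frequency list is simply a mapping from item to occurrence count.
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

PyNLPl's pynlpl.textprocessors and pynlpl.statistics modules cover this ground in the library itself; consult the API documentation for the actual class and function names.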

The following modules are available:

  • pynlpl.datatypes - Extra datatypes (priority queues, patterns, tries)
  • pynlpl.evaluation - Evaluation & experiment classes (parameter search, wrapped progressive sampling, class evaluation (precision/recall/f-score/auc), sampler, confusion matrix, multithreaded experiment pool)
  • pynlpl.formats.cgn - Module for parsing CGN (Corpus Gesproken Nederlands) part-of-speech tags
  • pynlpl.formats.folia - Extensive library for reading and manipulating the documents in FoLiA format (Format for Linguistic Annotation).
  • pynlpl.formats.fql - Extensive library for the FoLiA Query Language (FQL), built on top of pynlpl.formats.folia. FQL is currently documented here.
  • pynlpl.formats.cql - Parser for the Corpus Query Language (CQL), as also used by Corpus Workbench and Sketch Engine. Contains a converter to FQL.
  • pynlpl.formats.giza - Module for reading GIZA++ word alignment data
  • pynlpl.formats.moses - Module for reading Moses phrase-translation tables.
  • pynlpl.formats.sonar - Largely obsolete module for pre-releases of the SoNaR corpus; use pynlpl.formats.folia instead.
  • pynlpl.formats.timbl - Module for reading Timbl output (consider using python-timbl instead though)
  • pynlpl.lm.lm - Module for simple language models, plus a reader for ARPA language model data (as used by SRILM).
  • pynlpl.search - Various search algorithms (Breadth-first, depth-first, beam-search, hill climbing, A star, various variants of each)
  • pynlpl.statistics - Frequency lists, Levenshtein, common statistics and information theory functions
  • pynlpl.textprocessors - Simple tokeniser, n-gram extraction
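
As an illustration of one of the algorithms named in the list above (Levenshtein distance, listed under pynlpl.statistics), the following is a plain-Python sketch of the standard dynamic-programming formulation. It is not taken from PyNLPl's source, and the function name and signature here are assumptions for the example only.

```python
def levenshtein(a, b):
    """Return the edit distance between sequences a and b.

    Counts the minimum number of insertions, deletions and substitutions.
    Illustrative sketch only; not PyNLPl's implementation.
    """
    # Keep the shorter sequence in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the distance table are kept at any time, so memory use grows with the length of the shorter input rather than with the product of both lengths.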

Installation

Download and install the latest stable version directly from the Python Package Index with pip install pynlpl (or pip3 for Python 3 on most systems). For global installations prepend sudo.

Alternatively, clone this repository and run python setup.py install (or python3 setup.py install for Python 3 on most systems). Prepend sudo for global installations.

This software may also be found in certain Linux distributions, such as the latest versions of Debian/Ubuntu, as python-pynlpl and python3-pynlpl. PyNLPl is also included in our LaMachine distribution.

Documentation

API Documentation can be found here.

More Repositories

1

vocage

A minimalistic spaced-repetition vocabulary trainer (flashcards) for the terminal
Rust
142
star
2

clam

Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
Python
129
star
3

colibri-core

Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller`` which allows you to build, view, manipulate and query pattern models.
C++
123
star
4

flat

FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Flat allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm.
JavaScript
103
star
5

LaMachine

LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilation/installation script
Shell
68
star
6

folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
Python
60
star
7

python-frog

Python bindings to the Dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)
Cython
47
star
8

analiticcl

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
Rust
30
star
9

python-ucto

This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
Cython
29
star
10

codemetapy

A Python package for generating and working with codemeta
Python
24
star
11

gecco

Generic Environment for Context-Aware Correction of Orthography
Python
21
star
12

homeassistant-config

My elaborate home automation configuration + scripts
Python
21
star
13

dotfiles

My dotfiles
Shell
20
star
14

deepfrog

An NLP-suite powered by deep learning
Rust
19
star
15

hanzigrid

Hanzi grids for studying Mandarin Chinese (tool & output data)
HTML
18
star
16

foliapy

An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation finding application in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
Python
18
star
17

procmapgen

A small toy project written in Rust: procedural generation of various kinds of grid-based maps.
Rust
16
star
18

python-timbl

python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. With this module, all functionality exposed through the C++ interface is also available to Python scripts. Being able to access the API from Python greatly facilitates prototyping TiMBL-based applications.
Python
16
star
19

spacy2folia

Use spaCy for NLP and output to the FoLiA XML format.
Python
12
star
20

foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
Python
10
star
21

pbmbmt

Phrase-based Memory-based Machine Translation
Python
10
star
22

unilangforum

UniLang Language Community - Forum
PHP
8
star
23

colibri

THIS PROJECT IS BEING RENDERED OBSOLETE BY NEWER VERSIONS colibri-core and colibri-mt !!
C++
7
star
24

valkuil-gecco

Nederlandse Spellingscontrole / Dutch spelling correction system - powered by Gecco
Python
7
star
25

nederlab-pipeline

Linguistic enrichment pipeline for historical Dutch, as used in the Nederlab project
Groovy
7
star
26

anavec

Proof-of-concept spelling correction/normalisation system based on anagram vectors
Python
6
star
27

codemeta-harvester

Harvest and aggregate codemeta/schema.org software metadata from source repositories and service endpoints, automatically converting from known metadata schemes in the process
Shell
6
star
28

semeval2014task5

This is the official repository for SemEval 2014 Task 5: L2 Translation Assistant. It contains the gold standard learner corpus, evaluation results and the Python program library needed for the task. It does not contain a full translation assistance system.
HTML
5
star
29

foliadocserve

FoLiA Document Server - HTTP webservice backend for serving and annotating FoLiA documents using the FoLiA Query Language (FQL). Used by FLAT.
Python
5
star
30

piereling

Piereling is a webservice and web-application to convert between a variety of document formats, mostly from and to FoLiA XML. It is intended for NLP pipelines.
Python
5
star
31

lingua-cli

Very small simple command-line interface for language detection using lingua-rs
Rust
5
star
32

colibri-mt

A Machine Translation framework that wraps around the Moses Decoder and enables k-NN classifier techniques to be used for modelling source-side-context
C++
5
star
33

babelente

BabelEnte: Entity Extractor and Translator using BabelFy and Babelnet.org
Python
4
star
34

labirinto

A web front-end portal for a virtual laboratory of NLP tools
Vue
4
star
35

clamservices

A collection of CLAM webservices for several of our Natural Language Processing tools
Python
4
star
36

folia-rust

FoLiA library for Rust (alpha)
Rust
4
star
37

codemeta-server

Server for codemeta, in memory triple store, SPARQL endpoint and simple web-based visualisation for end-user
Python
4
star
38

sesdiff

Generates a shortest edit script (Myers' diff algorithm) to indicate how to get from the strings in column A to the strings in column B. Also provides the edit distance (Levenshtein).
Rust
4
star
39

alpino_clam_webservice

A CLAM-powered webservice for Alpino, a dependency parser for Dutch
Python
3
star
40

vocadata

Data for vocabulary learning
3
star
41

parseme-support

FoLiA & FLAT support for PARSEME
Python
3
star
42

spreek2schrijf

Scripts for Spreek2Schrijf, a project with the Tweede Kamer (the Dutch House of Representatives)
Python
3
star
43

svkbd

my fork of suckless' simple virtual keyboard: https://tools.suckless.org/x/svkbd/
C
3
star
44

sxmo-docs

my fork of https://git.sr.ht/~mil/sxmo-docs
Shell
2
star
45

aNtiLoPe

A collection of NLP pipelines powered by Nextflow
Groovy
2
star
46

sxmo-utils

my fork of https://git.sr.ht/~mil/sxmo-utils/
Shell
2
star
47

wrexp

Experiment Wrapper - A framework for launching and keeping track of experiments. Wrexp takes care of storing all stdout/stderr logs and mails you when experiments are completed.
JavaScript
2
star
48

wikiente

A named entity recogniser and linker based on DBPedia Spotlight, with support for the FoLiA format
Python
2
star
49

colibri-apps

Contains NLP applications using Colibri Core, suited for end-users. The applications are generally web-based.
OpenEdge ABL
2
star
50

wsd2

Python
2
star
51

colloquery

Web application for searching for phrases/collocations/synonyms in phrase translation tables
Python
2
star
52

lexmatch

Simple lexicon matcher against a text
Rust
2
star
53

colibri-utils

NLP utilities that rely on Colibri Core: currently only language identification
TeX
2
star
54

nlpsandbox

Natural Language Processing Sandbox - An experimental playground for all kinds of NLP tasks
Python
2
star
55

ssam

split sampler: split your data into multiple sets (e.g. train/test/development)
Rust
2
star
56

LaMachine-docker-test

Meta repository for docker testing of LaMachine on Travis-CI
1
star
57

dwm

my patched fork of dwm
C
1
star
58

unilang_ulr

Collection of open language resources from UniLang; containing mostly phrasebooks and stories
1
star
59

oersetter-models

Models for Oersetter, a Frisian<->Dutch Machine Translation system
1
star
60

chira

Chinese Reading Assistant, pop-up translations for Linux
Python
1
star
61

valkuil

Valkuil.net is an automatic spelling corrector for Dutch that detects ordinary typos as well as grammatical errors and confusions between existing words.
Lex
1
star
62

sxmo-svkbd

My fork of https://git.sr.ht/~mil/sxmo-svkbd
C
1
star
63

aur-packages

Arch User Repository packages I maintain
Shell
1
star
64

cwrap

Small C wrapper to turn a C function into a very simple webservice
C
1
star
65

campyon

Campyon is both a command-line tool and a Python library for viewing and manipulating columned data files. It supports various filters, statistics, visualisations, and plotting.
Python
1
star
66

vocavue

A vocabulary trainer with a view
JavaScript
1
star
67

lst-chat

JavaScript
1
star
68

homepage

My website
TeX
1
star
69

hyphertool

Command-line tool for syllabification and hyphenisation for multiple languages
Rust
1
star
70

lamastats

Generates statistical reports on the usage of our software and webservices
Python
1
star
71

charfreq

Very simple command-line tool that counts (unicode) character frequency from standard input
Rust
1
star
72

colibrita

Colibrita is a proof-of-concept translation assistance system, translating L1 fragments in an L2 context, using machine learning and statistical machine translation techniques
Python
1
star