• Stars: 65
• Rank: 473,702 (Top 10%)
• Language: C++
• License: GNU General Publi...
• Created over 11 years ago
• Updated 2 months ago

Repository Details

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenising Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
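
To make the description concrete, here is a minimal Python sketch of the general idea: separating words from punctuation, splitting sentences, and optionally changing case. It is an illustration only and does not use Ucto's rule files, its command-line interface, or its API. The names `tokenize`, `TOKEN_RE`, and `SENTENCE_END` are hypothetical, and the assumption that a sentence ends at `.`, `!`, or `?` is far cruder than a real language-specific rule set.

```python
import re

# Hypothetical illustration only: this is NOT Ucto's rule set or API.
# A token is either a run of word characters or one punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Assumption: a sentence ends at one of these punctuation tokens.
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Return a list of sentences, each a list of word/punctuation tokens."""
    sentences = []
    current = []
    for token in TOKEN_RE.findall(text):
        current.append(token.lower() if lowercase else token)
        if token in SENTENCE_END:
            # Close the current sentence at sentence-final punctuation.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that lack final punctuation.
        sentences.append(current)
    return sentences
```

Under these assumptions, `tokenize("Hello, world! How are you?")` returns `[["Hello", ",", "world", "!"], ["How", "are", "you", "?"]]`. A sketch like this mishandles abbreviations, decimal numbers, and URLs, which is why a tokeniser with per-language rules is useful.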

More Repositories

1. frog (C++, 73 stars)
   Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.

2. PICCL (Python, 48 stars)
   A set of workflows for corpus building through OCR, post-correction and normalisation

3. timbl (C++, 46 stars)
   TiMBL implements several memory-based learning algorithms.

4. LuigiNLP (Python, 21 stars)
   A workflow system for Natural Language Processing.

5. libfolia (C++, 15 stars)
   FoLiA library for C++

6. ticcltools (C++, 14 stars)
   Tools for TICCL

7. CLIN28_ST_spelling_correction (Python, 10 stars)
   Scripts that were used for preparing and converting the Wikipedia documents that are part of the CLIN28 shared task on spelling correction

8. LamaEvents (HTML, 10 stars)
   Lama Events is a calendar application listing events in the near future. The events are detected and selected by a fully automatic procedure in the Dutch Twitter stream.

9. uctodata (Shell, 9 stars)
   Datafiles for the tokenizer ucto.

10. mbt (C++, 9 stars)
    MBT: Memory-based tagger generation and tagging. MBT is a memory-based tagger-generator and tagger in one.

11. ticcutils (C++, 7 stars)
    Ticcutils, a generic utility library shared by our software.

12. wopr (C++, 5 stars)
    Memory Based Word Predictor/Language Model http://ilk.uvt.nl/wopr/

13. foliautils (C++, 4 stars)
    Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)

14. quoll (Python, 3 stars)

15. timblserver (C++, 3 stars)
    TiMBL implements several memory-based learning algorithms. This is the server part.

16. ICDAR2017-PostOCR-Ticcl (Python, 2 stars)
    Wrapper scripts for processing ICDAR2017 PostOCR data given a TICCL ranked input list

17. dimbl (C++, 2 stars)
    Distributed Tilburg Memory Based Learner

18. mbtserver (C++, 1 star)

19. dialect2keywords (Python, 1 star)
    Web interface designed to convert words in Dutch dialects ("dialectopgaven") into standard Dutch keywords ("vernederlandste trefwoorden").

20. releasereport (Python, 1 star)

21. paramsearch (C, 1 star)
    Automated parameter optimisation for Timbl

22. frogdata (Lex, 1 star)
    Data for Frog, mandatory

23. toad (C++, 1 star)
    Toad: Trainer Of All Data, the Frog training collection

24. bp-som (C++, 1 star)
    BP-SOM: A hybrid of back-propagation learning in multi-layered perceptrons and self-organizing maps