  • Stars: 170
  • Rank: 223,357 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created: about 10 years ago
  • Updated: about 3 years ago

Repository Details

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.

segtok

Build status badge (Travis CI): https://travis-ci.org/fnl/segtok.svg?branch=master

NB: segtok v2, code-named syntok, is available and fixes some tricky issues with segtok, in particular splitting sentences whose terminals are not followed by spaces. Like this :-).

Sentence segmentation and word tokenization

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for splitting (Indo-European) text into sentences. The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command line. While other Indo-European languages might work, the package has only been designed with languages such as Spanish, English, and German in mind. For a more informed introduction to this tool, please read the article on my blog.
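
As a minimal sketch of how the two modules fit together (the sample text is made up; split_single is assumed to be one of the split_... functions and word_tokenizer one of the ..._tokenizer functions described in the Usage section below):

from segtok.segmenter import split_single
from segtok.tokenizer import word_tokenizer

text = "Mr. Smith arrived at 3 p.m. sharp. He left one hour later."

# Segment the text into sentences, then split each sentence into tokens.
for sentence in split_single(text):
    print(word_tokenizer(sentence))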

Install

To use this package, you should have Python 2.7 or any 3.5+ version installed. The package is expected to work with both, and is tested against the latest 2.7 and 3.x branches as well as Python 3.5. The easiest way to install segtok is using pip or any other package manager that works with PyPI:

pip3 install segtok

Important: If you are on a Linux machine and have problems installing the regex dependency of segtok, make sure you have the python-dev and/or python3-dev packages installed to get the necessary headers to compile the package.

Then try the command line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer

Test Suite

The testing environment works with pytest, tox and pyenv. You first need to install pyenv (on OSX with Homebrew: brew install pyenv), and tox with pytest (pip3 install tox pytest). Configuring pyenv depends on the Python versions you have installed. Here, we assume you have the latest 2.7 and 3 versions installed and only need to provide an environment for testing segtok against the 3.8 branch:

pyenv install 3.8.2
pyenv global system 3.8.2

The second command is essential: it tells pyenv to prefer the system Python binary first and the 3.8.2 branch second. If you forget it, you will see errors like ERROR: InvocationError: Failed to get version_info for python3.8: pyenv: python3.8: command not found when running tox. If you only have one Python version installed (say, 2.7), you must also install and globally configure the other version (e.g., the latest 3.x) with pyenv to fully run the tests.

Finally, to run all of segtok's unit-test suite, just run tox:

tox

Usage

For details, please refer to the respective documentation; this README only provides an overview of the provided functionality.

Command-line

After installing the package, two command-line tools are available, segmenter and tokenizer. Each takes UTF-8 encoded plain-text and transforms it into newline-separated sentences or tokens, respectively. You can use other encodings in Python 3 simply by reconfiguring your environment encoding, or in any version of Python by forcing a particular encoding with the --encoding parameter. The tokenizer assumes that each line contains (at most) one single sentence, which is the output format of the segmenter. To learn more about each tool, please invoke it with its help option (-h or --help).
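
For example, to segment and tokenize a Latin-1 encoded file (a hypothetical invocation; input.txt and the encoding value are made up, while the --encoding parameter and the two tools come from the text above):

segmenter --encoding iso-8859-1 input.txt | tokenizer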

segtok.segmenter

This module provides several split_... functions to segment texts into lists of sentences. In addition, to_unix_linebreaks normalizes linebreaks (including the Unicode linebreak) to newline control characters (\n). The function rewrite_line_separators can be used to move (rewrite) the newline separators in the input text so that they are placed at the sentence segmentation locations.
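
A minimal sketch of using this module (split_multi is assumed to be one of the split_... functions; to_unix_linebreaks is described above):

from segtok.segmenter import split_multi, to_unix_linebreaks

# Normalize the Unicode line separator (U+2028) to \n before segmenting.
text = to_unix_linebreaks("This is one sentence.\u2028Is this a second? Yes!")
for sentence in split_multi(text):
    print(sentence)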

segtok.tokenizer

This module provides several ..._tokenizer functions to tokenize input sentences into words and symbols. To get the full functionality, use the web_tokenizer, which will split everything "semantically correctly" except for URLs and e-mail addresses. In addition, the module provides convenience functionality for English texts: two compiled patterns (IS_...) can be used to detect whether a word token contains a possessive-s marker ("Frank's") or is an apostrophe-based contraction ("didn't"). Tokens that match these patterns can then be split using the split_possessive_markers and split_contractions functions, respectively.
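
A minimal sketch combining these functions (the sample sentence is made up; web_tokenizer, split_contractions, and split_possessive_markers are the functions described above, assumed here to accept and return token lists):

from segtok.tokenizer import (split_contractions, split_possessive_markers,
                              web_tokenizer)

# Tokenize a sentence, then split contractions ("didn't") and
# possessive-s markers ("Frank's") into separate tokens.
tokens = web_tokenizer("Frank's dog didn't chase the cat.")
tokens = split_possessive_markers(split_contractions(tokens))
print(tokens)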

Legal

License: MIT

Copyright (c) 2014-2021, Florian Leitner. All rights reserved.

Contributors (kudos):

  • Mikhail Korobov (@kmike; port to Python2.7 and Travis CI integration)
  • Georg Kucsko (@gkucsko; splitting sentences at terminals followed by noise)
  • Karthikeyan Singaravelan (@tirkarthi; removing deprecation warnings, #23)
  • Primož Godec (@PrimozGodec; fixed LICENSE file in setup.py)

History

  • 1.5.11 setup.py: renamed data_files with the LICENSE.txt file reference to license_files
  • 1.5.10 removed deprecation warning (#23) as well as support for Python 3.3 from tox
  • 1.5.9 added the license as a LICENSE.txt file to this repository
  • 1.5.7 enhancement: split sentences even if the terminal is followed by invalid characters (contributed by @gkucsko)
  • 1.5.6 fixed a bug that would lead to joining lines in single-line mode (#11, reported by @yucongo)
  • 1.5.5 support for middle name initials ("Lester P. Pearson")
  • 1.5.4 also support for European-style number-dates with numeric months (24. 12. 2016)
  • 1.5.3 added support for European-style number-dates and for months (24. Dez. 2016)
  • 1.5.2 fixed a tokenizer bug when parsing URLs ending with root paths (/), prevented sentence splitting after U.K., U.S. and E.U. if followed by upper-case ("U.S. Air Force"), added missing Unicode hyphens and apostrophes, and added test suite setup instructions
  • 1.5.1 removed the count_continuations.py discussion from the README (it was only confusing); the segmenter can now preserve tab-separated text IDs before the text itself when reading from STDIN, inserting a (tab-separated) sentence ID column for each sentence printed to STDOUT: see the segmenter option --with-ids
  • 1.5.0 continuation words have been statistically evaluated and some poor choices removed, leading to more precise sentence splitting (see issue #9 by @Klim314 on GitHub)
  • 1.4.0 the word_tokenizer no longer splits on colons between digits (time, references, ...)
  • 1.3.1 fixed multiple dangling commas and colons (reported by Jim Geovedi)
  • 1.3.0 added Python2.7 support and Travis CI test integration (BIG thanks to Mikhail!)
  • 1.2.2 made segtok.tokenizer.match protected (renamed to "_match") and fixed UNIX linebreak normalization
  • 1.2.1 the length of sentences inside brackets is now parametrized
  • 1.2.0 wrote blog "documentation" and added chemical formula sub/super-script functionality
  • 1.1.2 fixed Unicode list of valid sentence terminals (was missing U+2048)
  • 1.1.1 fixed PyPI setup (missing MANIFEST.in for README.rst and "packages" in setup.py)
  • 1.1.0 added possessive-s marker and apostrophe contraction splitting of tokens
  • 1.0.0 initial release

More Repositories

  • syntok (Python, 201 stars): Text tokenization and sentence segmentation (segtok v2)
  • pymonad (Python, 31 stars): "fork" of PyMonad on BitBucket to change the ``*`` functor/composition operator to ``<<``
  • patricia-trie (Python, 31 stars): a pure-Python PATRICIA trie implementation
  • medic (Python, 25 stars): a Python 3 command-line tool to maintain a DB mirror of MEDLINE (https://pypi.python.org/pypi/medic) - ALERT: as I have moved out of science and am working as a consultant now, this project might need a new maintainer once PubMed changes its XML format. Heroes?
  • progress_bar (Python, 13 stars): an informative progress bar for Python 2+3 command-line tools
  • asdm-tm-class (Jupyter Notebook, 12 stars): course material for the Madrid ASDM class on text mining (C09)
  • libfnl (Python, 12 stars): Python 3 tools for data mining in molecular biology
  • classipy (Python, 9 stars): a command-line tool to develop advanced text classifiers using SciKit-Learn
  • sentence_splitter (Python, 8 stars): check out my new splitter, segtok
  • tokenizer (Go, 4 stars): a concurrent, deterministic finite-state tokenizer (for letter-based scripts)
  • txtfnnl (Java, 3 stars): a UIMA-based text mining pipeline
  • SPECIES (C++, 2 stars): a modified version of the SPECIES tagger
  • otplc (Python, 2 stars): a tool to convert corpus annotations between the brat annotation and OTPL formats
  • vimrc (Vim Script, 2 stars): my (Vim-centric) POSIX environment
  • cpp-project-template (C++, 2 stars): a very basic C++ project structure using CMake, Catch2, and cxxopts
  • go (Go, 2 stars): Golang source code collection
  • bceval (Python, 2 stars): BioCreative evaluation scripts and library
  • bootstrap (C, 2 stars): jump-start a simple GNU C project
  • lexikos (Scala, 1 star): a minimal acyclic deterministic finite state automaton (MADFA)
  • gnamed (Python, 1 star): a tool to manage a unified repository of gene and protein names, symbols, keywords, literature references, and species associations
  • word2numpy (Python, 1 star): a Python 3.0 port of word2vec.py, itself a Python 2.7 port of word2vec
  • segmenter (Perl, 1 star): scripts to pre-process plain-text: sentence segmentation, tokenization, and stemming
  • OnlineTaggerFramework (Java, 1 star): an online tagger wrapper for GATE that only spawns one global sub-process per processing resource
  • fnl.github.io (HTML, 1 star): my blog (http://fnl.es)
  • ibecs-to-omtd-transformer (Python, 1 star): a transformer that converts an IBECS XML file into an OMTD-SHARE corpus
  • chemcheck (C, 1 star): a syntax checker for BioCreative IV CHEMDNER task annotations
  • couchpy (Python, 1 star): a Python 3 library to programmatically access CouchDB (written when there was none, "long ago"...)
  • libfsmg (Java, 1 star): a finite state machine library for pattern matching on generic types in Java sequence containers