Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

Text to Sentence Splitter

This module allows splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the Europarl corpus.

The module is a port of the Lingua::Sentence Perl module with some additions: improved non-breaking prefix lists for some languages, and added support for Danish, Finnish, Lithuanian, Norwegian (Bokmål), Romanian, and Turkish.

Usage

The module uses punctuation and capitalization clues to split plain text into a list of sentences:

from sentence_splitter import SentenceSplitter, split_text_into_sentences

#
# Object interface
#
splitter = SentenceSplitter(language='en')
print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))
# ['This is a paragraph.', 'It contains several sentences.', '"But why," you ask?']

#
# Functional interface
#
print(split_text_into_sentences(
    text='This is a paragraph. It contains several sentences. "But why," you ask?',
    language='en'
))
# ['This is a paragraph.', 'It contains several sentences.', '"But why," you ask?']
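
To make the "punctuation and capitalization clues" idea concrete, here is a minimal, self-contained sketch of that kind of heuristic. It is not the module's actual implementation: the `naive_split` function, the `_BOUNDARY` pattern, and the tiny `NON_BREAKING_PREFIXES` set are all hypothetical names made up for illustration, and the sketch only recognizes ASCII capitals.

```python
import re

# Hypothetical, tiny prefix set for illustration only; the real module
# ships much larger per-language lists.
NON_BREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"}

# Candidate boundary: sentence-final punctuation, optional closing quotes or
# brackets, whitespace, then (optionally quoted) an uppercase letter or digit.
_BOUNDARY = re.compile(r'([.?!]+["\')\]]*)\s+(?=["\'(\[]*[A-Z0-9])')


def naive_split(text, prefixes=NON_BREAKING_PREFIXES):
    """Split text into sentences using punctuation and capitalization clues."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        # Word immediately before the punctuation, e.g. "Dr" in "Dr. Smith".
        words = text[start:match.start(1)].split()
        last_word = words[-1].lstrip("\"'([") if words else ""
        # A period after a known prefix is not treated as a sentence end.
        if match.group(1).startswith(".") and last_word in prefixes:
            continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(naive_split('This is a paragraph. It contains several sentences. "But why," you ask?'))
print(naive_split("Dr. Smith arrived. He was late."))
```

The second call shows why the prefix list matters: without it, the period after "Dr" would be mistaken for a sentence boundary.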

You can provide your own non-breaking prefix file to add support for new Latin languages or improve sentence tokenization of the currently supported ones:

from sentence_splitter import SentenceSplitter, split_text_into_sentences

# Object interface
splitter = SentenceSplitter(language='en', non_breaking_prefix_file='custom_english_non_breaking_prefixes.txt')
print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))

# Functional interface
print(split_text_into_sentences(
    text='This is a paragraph. It contains several sentences. "But why," you ask?',
    language='en',
    non_breaking_prefix_file='custom_english_non_breaking_prefixes.txt'
))
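
The README does not describe the layout of a non-breaking prefix file. The sketch below assumes the convention used by Moses-style prefix lists (one prefix per line, `#` for comments, and a trailing `#NUMERIC_ONLY#` marker for prefixes that are only non-breaking before a number); please check this against the prefix files bundled with the module before relying on it. The `load_prefixes` helper is hypothetical and only shows how such a file could be read.

```python
import os
import tempfile

NUMERIC_ONLY_MARKER = "#NUMERIC_ONLY#"

# Assumed file layout (Moses-style); not confirmed by the README.
EXAMPLE_FILE = """\
# Titles that should never end a sentence
Dr
Prof

# Only non-breaking when followed by a number, as in "No. 5"
No #NUMERIC_ONLY#
"""


def load_prefixes(path):
    """Hypothetical helper: parse a prefix file into two sets."""
    always, numeric_only = set(), set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.endswith(NUMERIC_ONLY_MARKER):
                numeric_only.add(line[: -len(NUMERIC_ONLY_MARKER)].strip())
            elif not line.startswith("#"):
                always.add(line)
    return always, numeric_only


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "custom_english_non_breaking_prefixes.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(EXAMPLE_FILE)
    always, numeric_only = load_prefixes(path)

print(sorted(always))        # ['Dr', 'Prof']
print(sorted(numeric_only))  # ['No']
```

A file written this way would then be passed by path through the `non_breaking_prefix_file` argument shown above.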

Languages

Currently supported languages are:

  • Catalan (ca)
  • Czech (cs)
  • Danish (da)
  • Dutch (nl)
  • English (en)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hungarian (hu)
  • Icelandic (is)
  • Italian (it)
  • Latvian (lv)
  • Lithuanian (lt)
  • Norwegian (Bokmål) (no)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Slovak (sk)
  • Slovene (sl)
  • Spanish (es)
  • Swedish (sv)
  • Turkish (tr)
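
If language codes come from user input or configuration, it may help to validate them before constructing a splitter. The following is a small sketch built only from the list above; the `SUPPORTED_LANGUAGES` mapping and the `check_language` helper are hypothetical and not part of the module.

```python
# Codes and names copied from the list above.
SUPPORTED_LANGUAGES = {
    "ca": "Catalan", "cs": "Czech", "da": "Danish", "nl": "Dutch",
    "en": "English", "fi": "Finnish", "fr": "French", "de": "German",
    "el": "Greek", "hu": "Hungarian", "is": "Icelandic", "it": "Italian",
    "lv": "Latvian", "lt": "Lithuanian", "no": "Norwegian (Bokmål)",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "sk": "Slovak", "sl": "Slovene", "es": "Spanish", "sv": "Swedish",
    "tr": "Turkish",
}


def check_language(code):
    """Hypothetical helper: normalize a code and reject unsupported ones."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized


print(check_language("EN"))  # en
```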

License

Copyright (C) 2010 by Digital Silk Road, 2017 Linas Valiukas.

Portions Copyright (C) 2005 by Philipp Koehn and Josh Schroeder (used with permission).

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
