German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German.

This list primarily includes resources and tools that are currently maintained and can be used off-the-shelf or with minor adjustments. It is deliberately biased toward usability and user-friendliness.

Pull requests and suggestions are welcome! See contributing guidelines.

Table of Contents

Text corpora

General-purpose

Historical

Specialized

Swiss German

Learner and Error Corpora

Word lists

Data acquisition

Lists of corpora

Generic resources

Frameworks

Treebanks

Deep learning models and transformers

Annotation

Standards

Linguistic processing

Preprocessing

Tokenization / Sentence boundary detection

Stemming

Lemmatization

Morphological analysis

Normalization

Phonology

POS-tagging

Syntactical parsing

Named Entity Recognition

Misc

Text generation

Industry/Applications

Evaluation

Semantic analysis

Datasets

Word embeddings and senses

Sentiment analysis datasets / polarity clues

Sentiment detection

GermEval

(category to improve)

Discourse

Summarization

Psycholinguistics

Speech NLP

Machine Translation

(category to improve)

Parallel corpora

Teaching resources and tutorials

More lists

German

General

Comparable lists

Larger institutional GitHub groups

Contributors

See the list of contributors.

License

CC-BY

More Repositories

1. trafilatura (Python, 3,298 stars)
   Python and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

2. simplemma (Python, 136 stars)
   Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

3. htmldate (Python, 117 stars)
   Fast and robust date extraction from web pages, with Python or on the command line

4. courlan (Python, 109 stars)
   Clean, filter and sample URLs to optimize data collection, in Python and on the command line: deduplication, spam, content and language filters

5. geokelone (Python, 5 stars)
   Integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization

6. german-reddit (Python, 3 stars)
   Extraction of a German Reddit Corpus

7. tweets-tools (Python, 2 stars)
   Diverse tools used with Twitter data

8. flux-toolchain (Perl, 2 stars)
   Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain

9. jlcl-style (TeX, 1 star)
   Experiments to modernize the LaTeX class of the JLCL

10. trafilatura_gui (Python, 1 star)

11. toponyms (1 star)
   Old prototype for toponym extraction in historical texts written in German

12. url-compressor (Pascal, 1 star)
   A fast pattern-based URL compression for lists of links

13. zeitcrawler (Java, 1 star)
   Automatically exported from code.google.com/p/zeitcrawler

14. vardial-experiments (Python, 1 star)
   Experiments conducted on the occasion of the VarDial shared tasks

15. microblog-explorer (Python, 1 star)
   Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language

16. coronakorpus (1 star)
   Material for building a German-language COVID-19 web corpus