• Stars
    3
  • Rank 3,866,896 (Top 79 %)
  • Language
    Python
  • License
    MIT License
  • Created over 8 years ago
  • Updated over 7 years ago

Repository Details

Extraction of a German Reddit Corpus

More Repositories

1

trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Python
2,979
star
2

German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
409
star
3

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python
129
star
4

htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
Python
111
star
5

courlan

Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters
Python
70
star
6

geokelone

Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation, and visualization
Python
5
star
7

tweets-tools

Diverse tools used with Twitter data
Python
2
star
8

flux-toolchain

Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
Perl
2
star
9

jlcl-style

Experiments to modernize the LaTeX class of the JLCL
TeX
1
star
10

trafilatura_gui

Python
1
star
11

toponyms

Old prototype for toponym extraction in historical texts written in German
1
star
12

zeitcrawler

Automatically exported from code.google.com/p/zeitcrawler
Java
1
star
13

url-compressor

A fast pattern-based URL compression for lists of links
Pascal
1
star
14

coronakorpus

Material for building a German-language web corpus dedicated to COVID-19
1
star
15

vardial-experiments

Experiments conducted on the occasion of the VarDial shared tasks
Python
1
star
16

microblog-explorer

Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
Python
1
star