• Stars: 70
• Rank: 435,244 (top 9%)
• Language: Python
• License: Apache License 2.0
• Created almost 9 years ago
• Updated about 1 month ago

Repository Details

Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters
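
The clean, filter and sample steps named in the description can be illustrated with a minimal sketch that uses only the Python standard library. It is not this repository's API: the function names (`clean_url`, `is_wanted`, `sample_per_host`, `prepare`), the tracking-parameter list and the skipped file extensions are all assumptions chosen for illustration, and the spam and language filters mentioned above are left out.

```python
import random
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rules for illustration only; not taken from the repository.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip", ".mp4", ".css", ".js")


def clean_url(url):
    """Lowercase the host, drop the fragment and remove tracking parameters."""
    parts = urlsplit(url.strip())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


def is_wanted(url):
    """Keep HTTP(S) URLs whose path does not end in a skipped file extension."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return not parts.path.lower().endswith(SKIPPED_EXTENSIONS)


def sample_per_host(urls, max_per_host, seed=0):
    """Pick at most max_per_host URLs per host, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    sampled = []
    for host in sorted(by_host):
        group = sorted(by_host[host])
        sampled.extend(rng.sample(group, min(max_per_host, len(group))))
    return sampled


def prepare(urls, max_per_host=2):
    """Clean, deduplicate, filter and sample a list of URLs."""
    cleaned = {clean_url(url) for url in urls}
    wanted = [url for url in cleaned if is_wanted(url)]
    return sample_per_host(wanted, max_per_host)
```

Calling `prepare(urls, max_per_host=2)` on a list of crawled links would return a deduplicated list with no more than two URLs for any single host. Sampling per host is one possible way to keep large sites from dominating a crawl frontier; the repository itself may use different criteria.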

More Repositories

1. trafilatura (Python, 2,979 stars)
   Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

2. German-NLP (409 stars)
   Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

3. simplemma (Python, 129 stars)
   Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

4. htmldate (Python, 111 stars)
   Fast and robust date extraction from web pages, with Python or on the command-line

5. geokelone (Python, 5 stars)
   Integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization

6. german-reddit (Python, 3 stars)
   Extraction of a German Reddit Corpus

7. tweets-tools (Python, 2 stars)
   Diverse tools used with Twitter data

8. flux-toolchain (Perl, 2 stars)
   Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain

9. jlcl-style (TeX, 1 star)
   Experiments to modernize the LaTeX class of the JLCL

10. trafilatura_gui (Python, 1 star)

11. toponyms (1 star)
    Old prototype for toponym extraction in historical texts written in German

12. zeitcrawler (Java, 1 star)
    Automatically exported from code.google.com/p/zeitcrawler

13. url-compressor (Pascal, 1 star)
    A fast pattern-based URL compression for lists of links

14. coronakorpus (1 star)
    Material for building a German-language COVID-19 web corpus

15. vardial-experiments (Python, 1 star)
    Experiments conducted on the occasion of the VarDial shared tasks

16. microblog-explorer (Python, 1 star)
    Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language