• Stars
    star
    109
  • Rank 319,077 (Top 7 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 9 years ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

More Repositories

1

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Python
3,298
star
2

German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
449
star
3

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python
136
star
4

htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
Python
117
star
5

geokelone

integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization
Python
5
star
6

german-reddit

Extraction of a German Reddit Corpus
Python
3
star
7

tweets-tools

Diverse tools used with Twitter data
Python
2
star
8

flux-toolchain

Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
Perl
2
star
9

jlcl-style

Experiments to modernize the LaTeX class of the JLCL
TeX
1
star
10

trafilatura_gui

Python
1
star
11

toponyms

Old prototype for toponym extraction in historical texts written in German
1
star
12

url-compressor

A fast pattern-based URL compression for lists of links
Pascal
1
star
13

zeitcrawler

Automatically exported from code.google.com/p/zeitcrawler
Java
1
star
14

vardial-experiments

Experiments conducted on the occasion of the VarDial shared tasks
Python
1
star
15

microblog-explorer

Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
Python
1
star
16

coronakorpus

Material zum Aufbau eines deutschsprachigen COVID-19-Webkorpus / Building a corpus in German dedicated to coronavirus
1
star