• Stars: 1
• Language: Python
• Created over 12 years ago
• Updated over 11 years ago

Repository Details

Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language

More Repositories

1. trafilatura
   Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
   Python, 3,298 stars

2. German-NLP
   Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
   449 stars

3. simplemma
   Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
   Python, 136 stars

4. htmldate
   Fast and robust date extraction from web pages, with Python or on the command-line
   Python, 117 stars

5. courlan
   Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
   Python, 109 stars

6. geokelone
   Integrates spatial and textual data processing tools into a modular software package featuring preprocessing, geocoding, disambiguation and visualization
   Python, 5 stars

7. german-reddit
   Extraction of a German Reddit Corpus
   Python, 3 stars

8. tweets-tools
   Diverse tools used with Twitter data
   Python, 2 stars

9. flux-toolchain
   Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
   Perl, 2 stars

10. jlcl-style
    Experiments to modernize the LaTeX class of the JLCL
    TeX, 1 star

11. trafilatura_gui
    Python, 1 star

12. toponyms
    Old prototype for toponym extraction in historical texts written in German
    1 star

13. url-compressor
    A fast pattern-based URL compression for lists of links
    Pascal, 1 star

14. zeitcrawler
    Automatically exported from code.google.com/p/zeitcrawler
    Java, 1 star

15. vardial-experiments
    Experiments conducted on the occasion of the VarDial shared tasks
    Python, 1 star

16. coronakorpus
    Material for building a German-language web corpus dedicated to COVID-19
    1 star