trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
geokelone
Integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization
german-reddit
Extraction of a German Reddit Corpus
tweets-tools
Diverse tools used with Twitter data
jlcl-style
Experiments to modernize the LaTeX class of the JLCL
trafilatura_gui
toponyms
Old prototype for toponym extraction in historical texts written in German
url-compressor
A fast pattern-based URL compression for lists of links
zeitcrawler
Automatically exported from code.google.com/p/zeitcrawler
vardial-experiments
Experiments conducted on the occasion of the VarDial shared tasks
microblog-explorer
Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
coronakorpus
Material for building a German-language COVID-19 web corpus