trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
courlan
Clean, filter and sample URLs to optimize data collection (Python & command-line): deduplication, spam, content and language filters
geokelone
Integrates spatial and textual data processing tools into a modular software package which features preprocessing, geocoding, disambiguation and visualization
german-reddit
Extraction of a German Reddit Corpus
tweets-tools
Diverse tools used with Twitter data
flux-toolchain
Filtering and Language-identification for URL Crawling Seeds (FLUCS) a.k.a. FLUX-Toolchain
jlcl-style
Experiments to modernize the LaTeX class of the JLCL
trafilatura_gui
toponyms
Old prototype for toponym extraction in historical texts written in German
zeitcrawler
Automatically exported from code.google.com/p/zeitcrawler
vardial-experiments
Experiments conducted on the occasion of the VarDial shared tasks
microblog-explorer
Perform crawls of social networks (identi.ca, reddit, friendfeed) to gather internal and external links and identify their language
coronakorpus
Material for building a German-language web corpus dedicated to COVID-19