Awesome Linguistics Resources for Spanish
A curated list of linguistic resources for Spanish NLP and computational linguistics (CL).
Clustering
Speech
- Mexican Spanish Speech Recognition DB - 150 Speakers
- Mexican Spanish Speech Recognition DB - 299 Speakers
- Phonetic Transcriptions of Spanish Pronunciation Lexicon
- Sphinx Speech Recognition Models
Part of Speech Taggers (POS Taggers)
- TreeTagger - POSTagger
- Stanford - POSTagger
- Freeling
- ixa-pipe-pos
- Ruby Snowball Implementation
- Spaguetti POSTagger (based on NLTK + CESS corpus)
Multiword Expressions Extractors (MWE)
Named Entity Recognition (NER)
- OpenNLP - Person/Place/Organization models
- DBPedia Spotlight
- CitiusTagger - Spanish NER and POSTagger
Corpora
Shared tasks
- Exploiting Parallel Texts for Statistical Machine Translation - NAACL 2006 in New York City
- CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages
- Quality Estimation (Spanish - English) WMT13
- ACL 2010 in Uppsala - Shared Task: Machine Translation for European Languages
- TASS - 2014 (Sentiment Analysis focused on Spanish)
- SemEval-2 2010 Coreference Resolution in Multiple Languages
- SAB Corpus (Spanish Corpus for Sentiment Analysis towards Brands)
Corpora
- Multilingual Aligned Annotated Corpus (CRATER)
- UAM Treebank - 1,500 syntactically annotated sentences extracted from newspapers (El País Digital and Compra Maestra)
- POSTagged/syntactic dependencies - European Corpus Initiative Multilingual Corpus I
- The Corpus of Contemporary Spanish (POS tags, lemmas)
- Lemmas Dictionary
- esTenTen Spanish (POSTagged)
- Europarl Corpus (Parallel Corpus English-Spanish)
- Colombian Political Speeches
- South American Slang Expressions/MWE
- Syntax and Semantic Annotations (Subset Ancora Corpus)
- Plurilingual Specific Corpus on Economics, Medicine, Computer Science
- Copenhagen Treebank (Dependency Parsing)
- Reuters Corpora RCV2 - News Corpora
- MolinoLabs Corpus - News Corpora from Spain, Argentina and Mexico
- PANACEA - Legislation Corpus
- PANACEA - Legislation Ngram Corpus
- PANACEA - Dependency Parsed Corpus
- PANACEA - Monolingual Lexica (MWE, Frames, Semantic Classes)
- Opinion Mining - User reviews on Cars, Hotels, Washing machines, Books, Cell phones, Music, etc.
- Cross Lingual Textual Entailment (CLTE) Corpus (English-Spanish)
- Ngram Frequencies out of Colombia News Corpora
- Sagan Textual Entailment Test Suite
- Garcia, Marcos and Pablo Gamallo, 2013 - Portuguese and Spanish biographical relation extraction corpora (Garcia, Marcos and Pablo Gamallo, 2013. Exploring the Effectiveness of Linguistic Knowledge for Biographical Relation Extraction. Natural Language Engineering, CJO2013. doi:10.1017/S1351324913000314.)
- Garcia, Marcos and Pablo Gamallo, 2014 - Portuguese, Spanish and Galician coreference corpora (Garcia, Marcos and Pablo Gamallo, 2014. Multilingual corpora with coreferential annotation of person entities. In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik: 3229-3233.)
- COW (Corpora from the Web) Ngram/Annotated People's Names Corpora
- Wikicorpus - Portion of the 2006 Wikipedia annotated with WordNet synsets and POS tags
- Spanish Billion Words Corpus with word2vec Embeddings
Misc
- Word2Vec vectors for Wikipedia Spanish Articles
- DBpedia Spanish Entities Titles
- DBpedia Spanish Abstracts
- Conshuga - Galician Verb conjugator
Contribute
Contributions welcome! Read the contribution guidelines first.
License
To the extent possible under law, David Przybilla has waived all copyright and related or neighboring rights to this work.