Spanish Word Embeddings

Below you will find links to Spanish word embeddings computed with different methods and from different corpora. Whenever possible, a description of the parameters used to compute the embeddings is included, together with simple statistics of the vectors and vocabulary and a description of the corpus from which the embeddings were computed. Direct links to the embeddings are provided, so please refer to the original sources for proper citation (also see References). An example of the use of some of these embeddings can be found here or in this tutorial (both in Spanish).

Summary (and links) for the embeddings in this page:

| # | Corpus | Size | Algorithm | #vectors | vec-dim | Credits |
|---|--------|------|-----------|----------|---------|---------|
| 1 | Spanish Unannotated Corpora | 2.6B | FastText | 1,313,423 | 300 | José Cañete |
| 2 | Spanish Billion Word Corpus | 1.4B | FastText | 855,380 | 300 | Jorge Pérez |
| 3 | Spanish Billion Word Corpus | 1.4B | GloVe | 855,380 | 300 | Jorge Pérez |
| 4 | Spanish Billion Word Corpus | 1.4B | Word2Vec | 1,000,653 | 300 | Cristian Cardellino |
| 5 | Spanish Wikipedia | ??? | FastText | 985,667 | 300 | FastText team |

FastText embeddings from SUC

Embeddings

Links to the embeddings (#dimensions=300, #vectors=1,313,423):

More vectors with different dimensions (10, 30, 100, and 300) can be found here.
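Once downloaded, the vectors can be read without any extra library. The following is a minimal sketch that assumes the file is in the word2vec-style text format (a header line with the number of vectors and the dimension, then one word and its values per line); the function name `load_vec` and the sample data are illustrative and not part of this repository.

```python
import io


def load_vec(handle, limit=None):
    """Read text-format vectors: a 'count dim' header, then 'word v1 ... vdim' lines."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break  # stop early to keep memory use low
    return vectors, dim


# Tiny in-memory sample standing in for a real embeddings file (made-up values).
sample = io.StringIO("2 3\nhola 0.1 0.2 0.3\nmundo 0.4 0.5 0.6\n")
vectors, dim = load_vec(sample)
print(dim, sorted(vectors))
```

With a real file, pass `open(path, encoding="utf-8")` as the handle and consider a `limit`, since the full file holds more than a million 300-dimensional vectors.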

Algorithm

  • Implementation: FastText with Skipgram
  • Parameters:
    • min subword-ngram = 3
    • max subword-ngram = 6
    • minCount = 5
    • epochs = 20
    • dim = 300
    • all other parameters set as default
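As a rough illustration, the parameters above map onto the fastText command-line flags `-minn`, `-maxn`, `-minCount`, `-epoch`, and `-dim`. The sketch below only builds the command; the file names `corpus.txt` and `embeddings` are placeholders, and this is not necessarily the exact invocation used to produce the published vectors.

```python
import shlex


def fasttext_skipgram_command(corpus_path, output_prefix):
    """Build a fastText skipgram command using the parameters listed above."""
    return [
        "fasttext", "skipgram",
        "-input", corpus_path,      # placeholder path to the training corpus
        "-output", output_prefix,   # fastText writes <prefix>.bin and <prefix>.vec
        "-minn", "3",               # min subword-ngram
        "-maxn", "6",               # max subword-ngram
        "-minCount", "5",
        "-epoch", "20",
        "-dim", "300",
    ]


command = fasttext_skipgram_command("corpus.txt", "embeddings")
print(shlex.join(command))
```

The list form can be passed to `subprocess.run(command, check=True)` if the `fasttext` binary is on the `PATH`.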

Corpus

FastText embeddings from SBWC

Embeddings

Links to the embeddings (#dimensions=300, #vectors=855,380):

Algorithm

  • Implementation: FastText with Skipgram
  • Parameters:
    • min subword-ngram = 3
    • max subword-ngram = 6
    • minCount = 5
    • epochs = 20
    • dim = 300
    • all other parameters set as default
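The min and max subword-ngram settings control which character n-grams FastText uses to represent each word. The sketch below is a simplified illustration of that idea: it wraps the word in the boundary markers `<` and `>` and enumerates every n-gram of length 3 to 6. The function `char_ngrams` is illustrative and does not reproduce the library's hashing or bucketing.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of a word, with '<' and '>' as boundary markers."""
    token = f"<{word}>"
    return [
        token[start:start + n]
        for n in range(minn, maxn + 1)
        for start in range(len(token) - n + 1)
    ]


ngrams = char_ngrams("sol")
print(ngrams)
```

Because vectors are built from these subword units, a model trained this way can also produce a vector for a word that never appeared in the corpus, as long as the binary model (not only the text vectors) is used.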

Corpus

  • Spanish Billion Word Corpus
  • Corpus Size: 1.4 billion words
  • Post processing: Besides the post processing of the raw corpus explained in the SBWCE page that included deletion of punctuation, numbers, etc., the following processing was applied:
    • Words were converted to lower case letters
    • Every sequence of the 'DIGITO' keyword was replaced by (a single) '0'
    • All words of more than 3 characters plus a '0' were omitted (example: 'padre0')
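The three steps above can be sketched as follows. This is one possible reading, not the original script: it assumes that a "sequence" of `DIGITO` means consecutive repetitions inside a token, and that the last rule drops tokens made of more than 3 characters followed by a single '0'. The function `postprocess` and both regular expressions are illustrative.

```python
import re

# One or more consecutive 'DIGITO' keywords collapse into a single '0'.
DIGIT_RUN = re.compile(r"(?:DIGITO)+")
# Assumed reading of the last rule: more than 3 non-'0' characters, then one '0'.
LONG_WORD_PLUS_ZERO = re.compile(r"[^0]{4,}0")


def postprocess(line):
    """Apply the digit replacement, lowercasing, and filtering to one line."""
    kept = []
    for token in line.split():
        # Replace the keyword before lowercasing, since 'DIGITO' is upper case.
        token = DIGIT_RUN.sub("0", token).lower()
        if LONG_WORD_PLUS_ZERO.fullmatch(token):
            continue  # e.g. 'padre0' is dropped
        kept.append(token)
    return " ".join(kept)


cleaned = postprocess("Mi Padre DIGITODIGITO tiene padreDIGITO libros")
print(cleaned)
```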

GloVe embeddings from SBWC

Embeddings

Links to the embeddings (#dimensions=300, #vectors=855,380):

Algorithm

  • Implementation: GloVe
  • Parameters:
    • vector-size = 300
    • iter = 25
    • min-count = 5
    • all other parameters set as default
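In the reference GloVe tools, `min-count` is an option of the `vocab_count` program, while `vector-size` and `iter` are options of the `glove` program. The sketch below builds both commands with the values listed above; the file names are placeholders, and the intermediate `cooccur` and `shuffle` steps (left at their defaults) are omitted. It is an assumed reconstruction, not the exact commands used for these embeddings.

```python
import shlex


def glove_commands(vocab_file, cooccurrence_file, save_prefix):
    """Build the vocab_count and glove commands with the parameters listed above."""
    # vocab_count reads the corpus from stdin and writes the vocabulary to stdout.
    vocab_count = ["vocab_count", "-min-count", "5"]
    glove = [
        "glove",
        "-input-file", cooccurrence_file,  # shuffled co-occurrence file (placeholder)
        "-vocab-file", vocab_file,
        "-save-file", save_prefix,
        "-vector-size", "300",
        "-iter", "25",
    ]
    return vocab_count, glove


vocab_cmd, glove_cmd = glove_commands("vocab.txt", "cooccurrence.shuf.bin", "vectors")
print(shlex.join(vocab_cmd))
print(shlex.join(glove_cmd))
```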

Corpus

Word2Vec embeddings from SBWC

Embeddings

Links to the embeddings (#dimensions=300, #vectors=1,000,653):

Algorithm

Corpus

FastText embeddings from Spanish Wikipedia

Embeddings

Links to the embeddings (#dimensions=300, #vectors=985,667):

Algorithm

  • Implementation: FastText with Skipgram
  • Parameters: FastText default parameters

Corpus

References
