  • Stars: 479
  • Rank: 88,688 (Top 2%)
  • License: Creative Commons ...
  • Created: about 5 years ago
  • Updated: 7 months ago

Repository Details

BETO - Spanish version of the BERT model

BETO: Spanish BERT

BETO is a BERT model trained on a large Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique. Below you will find TensorFlow and PyTorch checkpoints for the uncased and cased versions, as well as results on Spanish benchmarks comparing BETO with Multilingual BERT and with other (non-BERT-based) models.

Download

BETO uncased: tensorflow_weights, pytorch_weights, vocab, config
BETO cased:   tensorflow_weights, pytorch_weights, vocab, config

All models use a vocabulary of about 31k BPE subwords constructed using SentencePiece and were trained for 2M steps.

Benchmarks

The following table shows some BETO results on the Spanish version of each task. We compare BETO (cased and uncased) with the best Multilingual BERT results we found in the literature (as of October 2019). The table also shows some alternative methods for the same tasks (not necessarily BERT-based). References for all methods can be found here.

Task     BETO-cased  BETO-uncased  Best Multilingual BERT  Other results
POS      98.97       98.44         97.10 [2]               98.91 [6], 96.71 [3]
NER-C    88.43       82.67         87.38 [2]               87.18 [3]
MLDoc    95.60       96.12         95.70 [2]               88.75 [4]
PAWS-X   89.05       89.55         90.70 [8]               -
XNLI     82.01       80.15         78.50 [2]               80.80 [5], 77.80 [1], 73.15 [4]

Example of use

For further details on how to use BETO you can visit the 🤗 Hugging Face Transformers library, starting with the Quickstart section. BETO models can be accessed simply as 'dccuchile/bert-base-spanish-wwm-cased' and 'dccuchile/bert-base-spanish-wwm-uncased' through the Transformers library. An example of how to download and use the models on this page can be found in this Colab notebook. (We will soon add a more detailed step-by-step tutorial in Spanish for newcomers 😉)
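As a minimal sketch of the above, loading BETO through the Transformers library might look like this (assumes `transformers` and PyTorch are installed; the fill-mask sentence is our own illustrative example, not from the benchmarks):

```python
# Sketch: loading BETO with the Hugging Face Transformers library.
# Weights are downloaded from the Hugging Face Hub on first use.
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # or ...-wwm-uncased
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative fill-mask example.
text = "Me gusta mucho la comida [MASK]."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Pick the highest-scoring token at the [MASK] position.
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
predicted_id = logits[0, mask_idx].argmax(-1)
print(tokenizer.decode(predicted_id))
```

The same checkpoints also work with `AutoModel` if you only need contextual embeddings rather than masked-token predictions.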

Acknowledgments

We thank Adereso for kindly providing support for training BETO-uncased, and the Millennium Institute for Foundational Research on Data for supporting the training of BETO-cased. Thanks also to Google for helping us through the TensorFlow Research Cloud program.

Citation

Spanish Pre-Trained BERT Model and Evaluation Data

To cite this resource in a publication please use the following:

@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}

License Disclaimer

The license CC BY 4.0 best describes our intentions for our work. However, we are not sure that all of the datasets used to train BETO have licenses compatible with CC BY 4.0 (especially for commercial use). Please use at your own discretion and verify that the licenses of the original text resources match your needs.

References

More Repositories

1. spanish-word-embeddings - Spanish word embeddings computed with different methods and from different corpora (344 stars)
2. CC6205 - Natural Language Processing (TeX, 217 stars)
3. CC5205 - Introduction to Data Mining (Shell, 190 stars)
4. CC6204 - Material for the Deep Learning course at the Universidad de Chile (Jupyter Notebook, 185 stars)
5. wefe - WEFE: The Word Embeddings Fairness Evaluation Framework, which standardizes bias measurement and mitigation in word embedding models (Python, 169 stars)
6. CC6104 - Teaching material for the course "Statistical Thinking" of the Department of Computer Science at the University of Chile (TeX, 87 stars)
7. lightweight-spanish-language-models - ALBETO and DistilBETO, versions of ALBERT and DistilBERT pre-trained exclusively on Spanish corpora (Python, 25 stars)
8. rivertext - RiverText, a framework that standardizes state-of-the-art incremental word embedding methods (Python, 18 stars)
9. GLUES - Resources for the GLUE benchmark in Spanish (14 stars)
10. PracticaProfesional - Everything related to the práctica profesional (professional internship) (11 stars)
11. speedy-gonzales - Code for "Speedy Gonzales: A Collection of Fast Task-Specific Models for Spanish" (HTML, 10 stars)
12. relela - Representations for Learning and Language (HTML, 8 stars)
13. SNEC - Special Needs Education Corpus project (Jupyter Notebook, 2 stars)
14. RiverText - Machine Learning for Text Streams (2 stars)
15. word-embeddings-benchmarks (Python, 1 star)