
Spanish Language Models 💃🏻

The official repository of the MarIA project: Spanish language models and resources developed at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Corpora 📃

| Corpora | Number of documents | Number of tokens | Size (GB) |
|---------|---------------------|------------------|-----------|
| BNE     | 201,080,084         | 135,733,450,668  | 570       |

Models 🤖

Fine-tuned models 🧗🏼‍♀️🏇🏼🤽🏼‍♀️🏌🏼‍♂️🏄🏼‍♀️
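
The fine-tuned checkpoints are published under the PlanTL-GOB-ES organization on the Hugging Face Hub and can be loaded with the generic `pipeline` API from transformers. A minimal sketch follows; the checkpoint name (a NER model fine-tuned on CAPITEL) is an assumption based on the organization's naming scheme, so substitute the fine-tuned model you actually want.

```python
from transformers import pipeline

# Assumed checkpoint name (PlanTL-GOB-ES naming scheme on the Hugging Face Hub);
# replace it with the fine-tuned model you want to use.
ner = pipeline(
    "ner",
    model="PlanTL-GOB-ES/roberta-base-bne-capitel-ner",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

print(ner("Marta vive en Barcelona y trabaja en el BSC."))
```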

Word embeddings 🔤

300-dimensional word embeddings trained with FastText:
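
As a minimal sketch, assuming the embeddings are distributed in FastText's native `.bin` format, they can be queried with gensim; the local file name below is hypothetical.

```python
from gensim.models.fasttext import load_facebook_vectors

# Hypothetical path to the downloaded 300-dimensional FastText binary
wv = load_facebook_vectors("embeddings-es-300d.bin")

print(wv["hola"].shape)         # (300,): one 300-dimensional vector per word
print(wv.most_similar("hola"))  # nearest neighbours in the embedding space
```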

Datasets 🗂️

Evaluation

| Dataset      | Metric   | RoBERTa-b | RoBERTa-l | BETO*    | mBERT  | BERTIN** | Electricidad*** |
|--------------|----------|-----------|-----------|----------|--------|----------|-----------------|
| MLDoc        | F1       | 0.9664    | 0.9702    | 0.9714🔥 | 0.9617 | 0.9668   | 0.9565          |
| CoNLL-NERC   | F1       | 0.8851🔥  | 0.8823    | 0.8759   | 0.8691 | 0.8835   | 0.7954          |
| CAPITEL-NERC | F1       | 0.8960    | 0.9051🔥  | 0.8772   | 0.8810 | 0.8856   | 0.8035          |
| PAWS-X       | F1       | 0.9020    | 0.9150🔥  | 0.8930   | 0.9000 | 0.8965   | 0.9045          |
| UD-POS       | F1       | 0.9907🔥  | 0.9904    | 0.9900   | 0.9886 | 0.9898   | 0.9818          |
| CAPITEL-POS  | F1       | 0.9846    | 0.9856🔥  | 0.9836   | 0.9839 | 0.9847   | 0.9816          |
| SQAC         | F1       | 0.7923    | 0.8202🔥  | 0.7923   | 0.7562 | 0.7678   | 0.7383          |
| STS          | Combined | 0.8533🔥  | 0.8411    | 0.8159   | 0.8164 | 0.7945   | 0.8063          |
| XNLI         | Accuracy | 0.8016    | 0.8263🔥  | 0.8130   | 0.7876 | 0.7890   | 0.7878          |

* A model based on BERT architecture.

** A model based on RoBERTa architecture.

*** A model based on Electra architecture.

Usage example ⚗️

For RoBERTa-base:

```python
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-base-bne')
model.eval()

# Build a fill-mask pipeline and predict the most likely tokens for <mask>
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
```

For RoBERTa-large:

```python
from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Same pipeline as above, pointed at the large checkpoint
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/roberta-large-bne')
model.eval()

pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
```
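
Both snippets print the five most likely completions for the masked position, the pipeline's default; `FillMaskPipeline` accepts a `top_k` argument (e.g. `pipeline(text, top_k=10)`) if more candidates are needed.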

Other Spanish Language Models 👩‍👧‍👦

We are developing the following domain-specific language models:

- lm-legal-es: language models for the legal domain in Spanish.
- lm-biomedical-clinical-es: pretrained biomedical and clinical language models for Spanish.

Cite 📣

```bibtex
@article{gutierrezfandino2022,
  author = {Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquin Silveira-Ocampo and Casimiro Pio Carrino and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Aitor Gonzalez-Agirre and Marta Villegas},
  title = {MarIA: Spanish Language Models},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {68},
  number = {0},
  year = {2022},
  issn = {1989-7553},
  url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405},
  pages = {39--60}
}
```

Contact 📧

📋 We are interested in (1) extending our corpora to train larger models and (2) training and evaluating the models on other tasks.

For questions regarding this work, contact [email protected]

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may contain biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence.

In no event shall the owner of the models (SEDIA – State Secretariat for Digitalization and Artificial Intelligence) or the creator (BSC – Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties of these models.
