• Stars: 11
• Rank: 1,694,829 (Top 34 %)
• Language: Jupyter Notebook
• License: Creative Commons ...
• Created: 7 months ago
• Updated: 6 months ago

Repository Details

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages -- under review

More Repositories

1

simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
Python
345
star
2

Glot500

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023
Python
96
star
3

GlotLID

GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
Python
83
star
4

semi-markov-crf

Code for paper "Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging"
Python
17
star
5

GlotScript

GlotScript: A Resource and Tool for Low Resource Writing System Identification -- LREC 2024
Python
13
star
6

parcoure

ParCourE - Parallel Corpus Explorer
Python
12
star
7

ofa

A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
Python
11
star
8

bias-in-nlp

Literature overview: gender bias in natural language processing
Python
10
star
9

mPLM-Sim

mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Python
10
star
10

graph-align

Code for the EMNLP graph-align paper
Python
9
star
11

Taxi1500

Python
7
star
12

GlotWeb

GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
Python
5
star
13

TransMI

TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Python
4
star
14

TransliCo

TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models
Python
4
star
15

GlotStoryBook

Children's storybooks for 180 languages.
Jupyter Notebook
3
star
16

ColexificationNet

Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs
Jupyter Notebook
3
star
17

cisnlp.github.io

Homepage of cisnlp
SCSS
3
star
18

MaskLID

MaskLID: Code-Switching Language Identification through Iterative Masking -- ACL 2024
Python
3
star
19

Transliteration-PPA

Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Python
2
star
20

lohoravens-webpage

JavaScript
2
star
21

XAMPLER

XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples
Python
2
star
22

Spatial_Schemas

JavaScript
1
star