  • Stars
    131
  • Rank 275,867 (Top 6 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 6 years ago
  • Updated over 1 year ago

Repository Details

Indonesian Language Models and Their Usage

Indonesian Language Models

A language model is a probability distribution over word sequences, used to predict the next word from the words that precede it. This ability makes the language model a core component of modern natural language processing. It is used in many tasks, such as speech recognition, conversational AI, information retrieval, sentiment analysis, and text summarization.
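
To make the idea of "predicting the next word" concrete, here is a minimal sketch of a bigram language model, which estimates the probability of the next word from counts of adjacent word pairs. It is an illustration only: the toy corpus, the function names `train_bigram` and `next_word_probs`, and the whitespace tokenization are assumptions made for this example, and the models in this repository are neural models, not count-based ones.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Return P(next | prev) as a dict, estimated by relative frequency."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        return {}  # the word was never seen as a context
    return {word: c / total for word, c in following.items()}


# Toy corpus, illustrative only (not the Wikipedia data used by the author).
corpus = [
    "saya suka makan nasi",
    "saya suka minum kopi",
    "saya suka makan roti",
]

model = train_bigram(corpus)
probs = next_word_probs(model, "suka")
best = max(probs, key=probs.get)
print(best, probs)
```

In this toy corpus, "makan" follows "suka" in two of three sentences, so it receives the highest probability. Neural language models replace the count table with a learned function, but the output has the same shape: a probability for each candidate next word.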

For this reason, many big companies are competing to build ever larger language models with massive numbers of parameters, such as Google's BERT, Facebook's RoBERTa, and OpenAI's GPT-3. Most of these models are built only for English and a few other European languages. Countries with low-resource languages face big challenges in catching up in this technology race.

Therefore, the author has been building language models for Indonesian, starting with ULMFiT in 2018. The first language model was trained only on Indonesian Wikipedia, which is very small compared to the datasets used to train English language models.

Universal Language Model Fine-tuning (ULMFiT)

Jeremy Howard and Sebastian Ruder proposed ULMFiT in early 2018 as a novel method for fine-tuning language models for inductive transfer learning. The ULMFiT language model for Indonesian was trained as part of the author's project while learning FastAI. It achieved a perplexity of 27.67 on Indonesian Wikipedia.
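
As a rough guide to reading the perplexity figure quoted above, the sketch below computes perplexity in the common way, as the exponential of the average negative log-probability the model assigns to each token. The function name `perplexity` and the example probabilities are assumptions for illustration; the source does not state the exact tokenization or evaluation setup behind the 27.67 figure.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is about 4.
uniform = perplexity([0.25] * 10)

# Hypothetical per-token probabilities for a short sequence.
mixed = perplexity([0.5, 0.125])

print(uniform, mixed)
```

Under this definition, a perplexity of about 28 can be read as the model being, on average, as uncertain as a uniform choice among roughly 28 candidate tokens at each step. Lower values mean the model predicts the text better.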

Transformers

Ashish Vaswani et al. proposed the Transformer in the paper Attention Is All You Need. It is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.

At the time of writing (March 2021), there are already more than 50 different types of transformer-based language models (according to the model list at Hugging Face), such as BERT, GPT-2, Longformer, and mT5, built by companies and individual contributors. The author has also built several Indonesian transformer-based language models using the Hugging Face Transformers library and hosts them on the Hugging Face model hub.

More Repositories

1. opentc (Python, 27 stars): OpenTC is a text classification engine using several algorithms in machine learning
2. ML-Collection (Jupyter Notebook, 19 stars): Collection of Machine Learning related scripts
3. deepracer-tools (Shell, 13 stars): Some tools for AWS Deepracer
4. indonesian-speech-recognition (Jupyter Notebook, 12 stars): Automatic Speech Recognition for Indonesian
5. artificial-commonvoice (Python, 10 stars): Common Voice Generator using Speech Synthesizer
6. stm32f4-musicplayer (C, 10 stars): Music player (read mp3 and wav files) running on STM32F4 Discovery
7. stm32f4-3dsound (C, 8 stars): 3D Sound Effect on STM32F4
8. indonesian-whisperer (Python, 8 stars): Experiment with OpenAI Whisper on Indonesian Languages
9. Udacity-Course (Python, 6 stars)
10. ConvAI.id (Jupyter Notebook, 6 stars): Indonesian Conversational AI
11. opentc-icap (Python, 4 stars): ICAP Server for the OpenTC
12. AskMeAnything-Chatbot (4 stars): Ask Me Anything ChatBot
13. Bloom (Jupyter Notebook, 3 stars)
14. luganda-asr (Python, 3 stars)
15. google-spell-checker (Python, 2 stars)
16. text_processor (Python, 2 stars): Utility for Text Normalisation or Inverse Normalisation
17. phase-classification (Jupyter Notebook, 2 stars): Seismic Phase Classification
18. ai-research.id (SCSS, 1 star)
19. usbprog-jtag (C, 1 star): Automatically exported from code.google.com/p/usbprog-jtag
20. opentc-base (Python, 1 star)
21. opentc-web (Python, 1 star): Web user interface for the OpenTC
22. grafana-table-panel (JavaScript, 1 star): Extended Grafana Table Panel
23. image-search (Python, 1 star): Sentence-based Image Search using WIT Dataset
24. Fuzzing (C, 1 star): Fuzzing activity
25. multilingual-asr-demo (Python, 1 star)
26. lora (Jupyter Notebook, 1 star)
27. opentc-util (Python, 1 star): The utility/base module for the opentc, opentc-icap and opentc-web
28. FastAI-Course (Jupyter Notebook, 1 star): Collection of notes and notebooks for FastAI Course V3