indonesian-NLP-resources
NLP data for Bahasa Indonesia (last updated 20 Sep 2020)
Sentences Dataset
- Leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
- wn-msa.sourceforge.net Wordnet Bahasa
- Quran Indonesian Quran translations (id.muntakhab, id.jalalayn, id.indonesian)
- Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
- Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
- corpus-frog-storytelling spoken-text storytelling
- TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
- Opus Opus NLPL
- Sealang Sealang dataset
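Word reference (kemdikbud)
- Base entries (Entri Dasar) : 50,668 (45.02%)
- Derived words (Kata Turunan) : 26,835 (23.85%)
- Compound words (Gabungan Kata) : 31,492 (27.98%)
- Proverbs (Peribahasa) : 2,054 (1.83%)
- Figurative senses (Kiasan) : 269 (0.24%)
- Expressions (Ungkapan) : 1,131 (1.00%)
- Variants (Varian) : 89 (0.08%)
- Total entries (Entri Total) : 112,538 (100.00%)
- Total senses (Makna Total) : 131,533
- Total examples (Contoh Total) : 30,010
- Total categories (Kategori Total) : 234
- Senses per entry (Makna Per Entri) : 1.169
- Examples per sense (Contoh Per Makna) : 0.228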
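The percentages and ratios in the kemdikbud statistics can be rechecked from the raw counts. The following is a minimal sketch, not part of the original resources: the variable names are illustrative, and the counts are copied by hand from the list above (keys keep the original Indonesian labels).

```python
# Entry counts copied from the kemdikbud word-reference statistics above.
entry_counts = {
    "Entri Dasar": 50668,     # base entries
    "Kata Turunan": 26835,    # derived words
    "Gabungan Kata": 31492,   # compound words
    "Peribahasa": 2054,       # proverbs
    "Kiasan": 269,            # figurative senses
    "Ungkapan": 1131,         # expressions
    "Varian": 89,             # variants
}
total_senses = 131533    # Makna Total
total_examples = 30010   # Contoh Total

# The per-type counts should add up to the reported total number of entries.
total_entries = sum(entry_counts.values())

# Share of each entry type, as a percentage of all entries.
shares = {
    name: round(100 * count / total_entries, 2)
    for name, count in entry_counts.items()
}

# Average number of senses per entry and examples per sense.
senses_per_entry = round(total_senses / total_entries, 3)
examples_per_sense = round(total_examples / total_senses, 3)

for name, count in entry_counts.items():
    print(f"{name}: {count} ({shares[name]:.2f}%)")
print(f"Entri Total: {total_entries}")
print(f"Makna Per Entri: {senses_per_entry}")
print(f"Contoh Per Makna: {examples_per_sense}")
```

Rerunning this after a dictionary update is a quick way to notice when the listed counts and the derived ratios have drifted apart.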
PUEBI (word type)
Words dataset
- Word class => nouns (18647), verbs (39070) = 57717 words
- Word type => root words (41409), derived words (24913), compound words, figures of speech, proverbs, expressions = 66322 words
- Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
- Word spaCy : id
- word : serangkai
- Word name : random-name
- Word Indo name : genderprediction
- Word Wiktionary : word id
- Word compound =>
- Word Acronyms =>
- Word Negative =>
- source#1.1 : 3829 words ; source#1.2 : 3523 words ; source#1.3 : 154 words ;
- source#2 : ID-OpinionWords 2402 words
- source#3 : 3523 words
- source#4 : 126 words
- Word Positive =>
- source#1.1 : 1678 words ; source#1.2 : 40 words ; source#1.3 : 1293 words ;
- source#2 : 1182 words
- source#3 : 1293 words
- Word Slang =>
- Stopwords =>
- Emoticon =>
- Name Entity =>
- source#1 : [Place] country
- source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
- source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
- source#3 : [Place] indonesian-region
- source#3 : [Person] gender prediction
- source#4 : [Person] random name
- source#5 : [Person] title of name
- source#6 : [Person] degree
- source#7 : [Org] institution
Tagged dataset
- NER =>
- POS-TAG
- POS-TAG : famrashel/idn-tagged-corpus
- POS-TAG : pebbie/pebahasa ~600 sentences
- POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
- Sentiment =>
- panl10n Pan Localization
- Acronyms : ramaprakoso/analisis-sentimen 4085 words
Parallel corpus Eng-Ind
Sentence Analyzer
- MALINDO_Morph
- morphind
- INDRA
- pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
- id-multi-label-hate-speech-and-abusive-language-detection : A dataset for multi-label hate speech and abusive language detection on Indonesian Twitter.
- kawat : A Word Analogy Task Dataset for Indonesian
Crawler Data
- Crawler for Indonesian news portals