  • Stars: 369
  • Rank: 111,990 (Top 3 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created: about 6 years ago
  • Updated: 11 months ago

Repository Details

Natural Language Toolkit for bahasa Malaysia, https://malaya.readthedocs.io/

Malaya is a Natural-Language-Toolkit library for bahasa Malaysia, powered by Tensorflow and PyTorch.

Documentation

Proper documentation is available at https://malaya.readthedocs.io/

Installing from the PyPI

$ pip install malaya

This installs all dependencies except Tensorflow and PyTorch, so you can choose your own CPU or GPU build of each.

Only Python >= 3.6.0, Tensorflow >= 1.15.0, and PyTorch >= 1.10 are supported.
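
Because Tensorflow and PyTorch are installed separately, it can help to check the environment against the minimum versions above before importing Malaya. The following is a minimal sketch, not part of Malaya itself: the helper names `parse_version`, `meets_minimum` and `check_environment` are hypothetical, it assumes the packages are installed under the distribution names `tensorflow` and `torch`, and it relies on `importlib.metadata`, which needs Python 3.8 or newer.

```python
import re
import sys
from importlib import metadata  # standard library from Python 3.8 onwards

# Minimum versions taken from the README; distribution names are assumptions.
MINIMUMS = {"tensorflow": "1.15.0", "torch": "1.10"}


def parse_version(version):
    """Turn '2.1.0+cu118' or '1.10' into a comparable (major, minor, patch) tuple."""
    release = version.split("+")[0]  # drop local build tags such as '+cu118'
    parts = [int(p) for p in re.findall(r"\d+", release)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))  # pad '1.10' to (1, 10, 0)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Map each requirement to True, False, or None when it is not installed."""
    report = {"python": sys.version_info[:3] >= (3, 6, 0)}
    for package, minimum in MINIMUMS.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(name, "ok" if ok else "missing" if ok is None else "too old")
```

The comparison is numeric per component, so 1.9.1 is correctly treated as older than 1.10. Pre-release suffixes such as `rc1` are ignored, which is a simplification.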

If you are a Windows user, make sure to read https://malaya.readthedocs.io/en/latest/running-on-windows.html

Development Release

Install from the master branch:

$ pip install git+https://github.com/huseinzol05/malaya.git

We recommend using virtualenv for development.

Documentation at https://malaya.readthedocs.io/en/latest/

Features

  • Alignment, translation word alignment using Eflomal and pretrained Transformer models.
  • Abstractive text augmentation, augment any text into social media text structure using T5-Bahasa.
  • Encoder text augmentation, augment any text using Wordvector or Transformer-Bahasa word replacement techniques.
  • Rules based text augmentation, augment any text using a dictionary of synonyms and rules.
  • Isi Penting Generator, generate text from a list of isi penting (key points) using T5-Bahasa.
  • Prefix Generator, generate text from prefix using GPT2-Bahasa.
  • Abstractive Keyword, provide abstractive keyword using T5-Bahasa.
  • Extractive Keyword, provide RAKE, TextRank and Attention Mechanism hybrid with Transformer-Bahasa.
  • Abstractive Normalizer, normalize any Malay text using T5-Bahasa.
  • Rules based Normalizer, using local Malaysian NLP research combined with Transformer-Bahasa to normalize any Malay text.
  • Extractive QA, reading comprehension using T5-Bahasa and Flan-T5.
  • Doc2Vec Similarity, provide Word2Vec and Encoder interface for text similarity.
  • Semantic Similarity, provide semantic similarity using T5-Bahasa.
  • Spelling Correction, using local Malaysian NLP research combined with Transformer-Bahasa to auto-correct any Malay words, and NeuSpell using T5-Bahasa.
  • Abstractive Summarization, provide abstractive summarization using T5-Bahasa.
  • Extractive Summarization, Extractive interface using Transformer-Bahasa and Doc2Vec.
  • Text to Knowledge Graph, Generate knowledge graph from human sentences.
  • Topic Modeling, provide Transformer-Bahasa, LDA2Vec, LDA, NMF, LSA interface and easy BERTopic integration.
  • EN-MS Translation, provide English to standard Malay using T5-Bahasa.
  • IND-MS Translation, provide Indonesian to standard Malay using T5-Bahasa.
  • JAV-MS Translation, provide Javanese to standard Malay using T5-Bahasa.
  • MS-EN Translation, provide standard Malay to English using T5-Bahasa.
  • MS-IND Translation, provide standard Malay to Indonesian using T5-Bahasa.
  • MS-JAV Translation, provide standard Malay to Javanese using T5-Bahasa.
  • Zero-shot classification, provide Zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.
  • Zero-shot Entity Recognition, provide Zero-shot entity tagging interface using Transformer-Bahasa to extract entities.
  • Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.
  • Coreference Resolution, finding all expressions that refer to the same entity in a text using Dependency Parsing models.
  • Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa and T5-Bahasa.
  • Emotion Analysis, detect and recognize 6 different emotions of texts using finetuned Transformer-Bahasa.
  • Entity Recognition, seeks to locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.
  • Jawi-to-Rumi, convert from Jawi to Rumi using Transformer.
  • Knowledge Graph to Text, Generate human sentences from a knowledge graph.
  • Language Detection, using fastText and a sparse deep learning model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
  • Language Model, text scoring using KenLM, masked language models (BERT, ALBERT and RoBERTa) and GPT2.
  • NSFW Detection, detect NSFW text using rule-based and subword Naive Bayes models.
  • Num2Word, convert from numbers to cardinal or ordinal representation.
  • Paraphrase, provide Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.
  • Grapheme-to-Phoneme, convert from grapheme to phoneme (DBP or IPA) using a state-of-the-art LSTM Seq2Seq with attention.
  • Part-of-Speech Recognition, grammatical tagging of each word in a text using finetuned Transformer-Bahasa.
  • Relevancy Analysis, detect and recognize relevancy of texts using finetuned Transformer-Bahasa.
  • Rumi-to-Jawi, convert from Rumi to Jawi using Transformer.
  • Text Segmentation, dividing written text into meaningful words using T5-Bahasa.
  • Sentiment Analysis, detect and recognize polarity of texts using finetuned Transformer-Bahasa.
  • Text Similarity, provide an interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.
  • Stemmer, using a state-of-the-art BPE LSTM Seq2Seq with attention to do Bahasa stemming, including local language structure.
  • Subjectivity Analysis, detect and recognize self-opinion polarity of texts using finetuned Transformer-Bahasa.
  • Kesalahan Tatabahasa, fix kesalahan tatabahasa (grammatical errors) using TransformerTag-Bahasa.
  • Tokenizer, provide word, sentence and syllable tokenizers.
  • Toxicity Analysis, detect and recognize 27 different toxicity patterns of texts using finetuned Transformer-Bahasa.
  • Transformer, provide easy interface to load Pretrained Language Malaya models.
  • True Case, provide true casing utility using T5-Bahasa.
  • Word2Num, convert from cardinal or ordinal representation to numbers.
  • Word2Vec, provide Word2Vec pretrained on Malay Wikipedia and Malay news, with an easy interface and visualization.
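
To make one of the simpler features above concrete, here is a toy illustration of what a Num2Word-style conversion to Malay cardinals involves. This is a hedged sketch only: it is not Malaya's implementation or API, the function name `to_cardinal_sketch` is hypothetical, and it only covers 0 to 999.

```python
# Toy Malay cardinal converter for 0..999. NOT Malaya's implementation.
UNITS = ["kosong", "satu", "dua", "tiga", "empat",
         "lima", "enam", "tujuh", "lapan", "sembilan"]


def to_cardinal_sketch(number):
    """Spell out an integer between 0 and 999 as Malay cardinal words."""
    if not 0 <= number <= 999:
        raise ValueError("this sketch only handles 0..999")
    if number < 10:
        return UNITS[number]

    words = []
    hundreds, rest = divmod(number, 100)
    if hundreds == 1:
        words.append("seratus")  # 100 uses the prefix 'se-' instead of 'satu'
    elif hundreds > 1:
        words.append(UNITS[hundreds] + " ratus")

    if rest == 10:
        words.append("sepuluh")
    elif rest == 11:
        words.append("sebelas")
    elif 12 <= rest <= 19:
        words.append(UNITS[rest - 10] + " belas")
    else:
        tens, units = divmod(rest, 10)
        if tens:
            words.append(UNITS[tens] + " puluh")
        if units:
            words.append(UNITS[units])
    return " ".join(words)


if __name__ == "__main__":
    for n in (0, 12, 21, 100, 999):
        print(n, to_cardinal_sketch(n))
```

The special cases (sepuluh, sebelas, seratus) are why a rule table is needed rather than plain digit-by-digit lookup. For real use, prefer the library's own Num2Word module as described in the documentation.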

Pretrained Models

Malaya also releases Bahasa pretrained models; see Malaya/pretrained-model.

References

If you use our software for research, please cite:

@misc{Malaya,
  note = {Natural-Language-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow},
  author = {Husein, Zolkepli},
  title = {Malaya},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya}}
}

Acknowledgement

Thanks to KeyReply for the private V100 cloud and Mesolitica for the private RTX cloud used to train Malaya models.

Also, thanks to Tensorflow Research Cloud for free TPU access.

Contributing

Thank you for contributing to this library; it really helps a lot. Feel free to contact me with suggestions or to contribute in other ways. We accept everything, not just code!
