• Stars
    star
    200
  • Rank 194,112 (Top 4 %)
  • Language
    C
  • Created over 8 years ago
  • Updated over 8 years ago

Repository Details

Chinsese_word_vectors

Chinese word vectors

This project trains Chinese word vectors with the word2vec and GloVe tools, using text from a Chinese Wikipedia dump.

Steps

  1. Download the Wikipedia dump from: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

  2. Extract the Chinese text from the downloaded XML dump using process_wiki.py. On my computer this step took about 11.5 minutes (from 14:54:17 to 15:05:53) to iterate over a Wikipedia corpus of 253,646 documents (articles) with 56,758,668 positions (2,724,783 articles and 68,096,009 positions in total before pruning articles shorter than 50 words).
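
process_wiki.py itself is not reproduced here. As a rough sketch of what such an extraction step can look like, the code below assumes that gensim's WikiCorpus reads the dump and that each article is written as one line of space-separated tokens. Both points are assumptions, and the names write_articles and extract_wiki are hypothetical.

```python
import io


def write_articles(texts, out):
    """Write one article per line, tokens joined by spaces; return the article count."""
    count = 0
    for tokens in texts:
        out.write(" ".join(tokens) + "\n")
        count += 1
    return count


def extract_wiki(dump_path, out_path):
    """Hypothetical wrapper: stream articles from a Wikipedia dump into a text file."""
    # gensim is a third-party package; it is imported lazily so that
    # write_articles can be used (and tested) without it.
    from gensim.corpora import WikiCorpus

    # dictionary={} skips building a vocabulary, which is not needed for extraction.
    wiki = WikiCorpus(dump_path, dictionary={})
    with open(out_path, "w", encoding="utf-8") as out:
        return write_articles(wiki.get_texts(), out)


# In-memory demo with two made-up "articles" (no dump required).
demo = io.StringIO()
demo_count = write_articles([["数学", "几何"], ["希腊"]], demo)
```

With the real dump, a call such as extract_wiki("zhwiki-latest-pages-articles.xml.bz2", "wiki.zh.text") would produce the file used in the next step, assuming the output name matches the one in the opencc commands.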

  3. We want both a Traditional and a Simplified version of the text. First, convert the Traditional Chinese text to Simplified Chinese with the opencc package:

    opencc -i wiki.zh.text -o wiki.zh.text.simplified -c zht2zhs.ini
    

    Then convert the Simplified Chinese text to Traditional Chinese:

    opencc -i wiki.zh.text -o wiki.zh.text.traditional -c zhs2zht.ini
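
The two conversions above can also be driven from Python. The sketch below only wraps the same opencc command line with subprocess; the flags, file names, and config names are copied from the commands above, while the helper names opencc_command, convert, and convert_all are hypothetical. Config file names may differ between OpenCC versions, so treat them as placeholders.

```python
import subprocess


def opencc_command(src, dst, config):
    """Build the argument list for the opencc call shown above."""
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def convert(src, dst, config):
    """Run one conversion; raises CalledProcessError if opencc fails."""
    subprocess.run(opencc_command(src, dst, config), check=True)


# (input, output, config) triples taken from the two commands in this step.
JOBS = [
    ("wiki.zh.text", "wiki.zh.text.simplified", "zht2zhs.ini"),
    ("wiki.zh.text", "wiki.zh.text.traditional", "zhs2zht.ini"),
]


def convert_all(jobs=JOBS):
    """Run every conversion in order (requires the opencc CLI on PATH)."""
    for src, dst, config in jobs:
        convert(src, dst, config)
```

Calling convert_all() runs both conversions in order and stops at the first failure.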
    
  4. Tokenization: split each sentence into tokens. Two tokenization schemes are used:

    • Character level: every character is treated as a token, because a single Chinese character often carries a good deal of information on its own. That is, each sentence is segmented into characters as follows:
    欧几里得 西元前三世纪的希腊数学家 现在被认为是几何之父
    欧 几 里 得 西 元 前 三 世 纪 的 希 腊 数 学 家 现 在 被 认 为 是 几 何 之 父
    
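    • Word level: sentences are segmented with the Jieba package by running tokenization.py. A widely used word segmentation tool is applied here because words can provide more information than single characters.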
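
Both schemes can be sketched in a few lines. This is an illustration, not the content of tokenization.py: the function names char_tokenize and word_tokenize are hypothetical, and the word-level variant assumes jieba.cut with its default settings.

```python
def char_tokenize(sentence):
    """Character level: every non-whitespace character becomes a token."""
    return [ch for ch in sentence if not ch.isspace()]


def word_tokenize(sentence):
    """Word level: segment with Jieba, dropping whitespace-only tokens."""
    # jieba is a third-party package; it is imported lazily so that
    # char_tokenize works without it.
    import jieba

    return [word for word in jieba.cut(sentence) if word.strip()]


# The example sentence from the character-level bullet above.
line = "欧几里得 西元前三世纪的希腊数学家 现在被认为是几何之父"
char_line = " ".join(char_tokenize(line))
```

Here char_line reproduces the character-segmented line shown in the example above. The output of word_tokenize depends on the Jieba dictionary in use, so no expected output is given for it.
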
  5. Train word vectors with word2vec (using train_word2vec_model.py) and with the GloVe model.

    Hyper-parameters:

    • word vector dimension: 300
    • window: 5
    • min_count: 5
    • epochs: 3
    • Training algorithm: skip-gram (sg = 1)
    • Model training: hierarchical softmax (hs = 1)
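
As an illustration of how these settings map onto gensim's Word2Vec, here is a minimal sketch. It assumes gensim 4.x parameter names and a corpus file with one tokenized sentence per line; train_word2vec_model.py may differ, and the function name train is hypothetical. gensim's default negative sampling setting is left untouched because the source does not say whether it was changed.

```python
# Hyper-parameters from the list above, using gensim 4.x keyword names.
HYPERPARAMS = {
    "vector_size": 300,  # word vector dimension (`size` before gensim 4.0)
    "window": 5,
    "min_count": 5,
    "epochs": 3,         # `iter` before gensim 4.0
    "sg": 1,             # skip-gram
    "hs": 1,             # hierarchical softmax
}


def train(corpus_path, model_path):
    """Hypothetical helper: train on a whitespace-tokenized corpus and save the model."""
    # gensim is a third-party package, imported lazily.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # LineSentence streams the file, treating each line as one sentence.
    model = Word2Vec(LineSentence(corpus_path), **HYPERPARAMS)
    model.save(model_path)
    return model
```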

    Word vectors statistical information:

    Model     Corpus               Words    Sentences  Dimensions  Time (s)
    word2vec  Traditional Chinese  860,835  253,646    300         3,086.8
    word2vec  Simplified Chinese   686,601  253,646    300         3,186.9
    GloVe     Traditional Chinese  860,835  253,646    300         2,160.0
    GloVe     Simplified Chinese   686,601  253,646    300         2,280.0

    Character vectors statistical information (hyper-parameters are the same as for the word-level vectors):

    Model     Corpus       Characters  Dimensions  Time (s)
    GloVe     Traditional  157,660     300         420
    GloVe     Simplified   157,379     300         410
    word2vec  Traditional  157,660     300         3,991.1
    word2vec  Simplified   157,379     300         4,415.6

Download

  1. Word-level and character-level vectors trained with word2vec:

    Level      Language     syn0      syn1      Text
    word       simplified   Download  Download  Download
    word       traditional  Download  Download  Download
    character  simplified   Download  Download  Download
    character  traditional  Download  Download  Download

    Note: the syn0 and syn1 files are produced by saving the model with model.save(), which allows training to continue after the model is loaded. The Text file is produced by model.save_word2vec_format() and is plain text, so the vectors can be viewed in a text editor such as Sublime Text.
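
To make the Text format concrete, the sketch below parses it with the standard library only. It assumes the usual word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line) and tokens without embedded spaces; the function name read_word2vec_text and the sample data are made up. In practice gensim's Word2Vec.load (for files written by model.save()) and KeyedVectors.load_word2vec_format (for the Text file) are the more likely choices.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format into a dict mapping word -> list of floats."""
    # Header line: "<vocabulary size> <dimension>".
    count, dim = (int(field) for field in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header announced {count} words, found {len(vectors)}")
    return vectors


# Tiny made-up sample: 2 words, 3 dimensions.
sample = io.StringIO("2 3\n数学 0.1 0.2 0.3\n几何 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
```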

  2. Word-level and character-level vectors trained with GloVe:

    Level      Language     vocab.txt  vectors.txt  cooccurrence.bin
    word       simplified   Download   Download     Download
    word       traditional  Download   Download     Download
    character  simplified   Download   Download     Download
    character  traditional  Download   Download     Download

    Note: vectors.txt contains the vectors produced by the GloVe model, and vocab.txt is the vocabulary-index mapping file.
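
GloVe's vectors.txt normally has no header line, whereas word2vec text format starts with one. The sketch below adds that header so the same tools can read both kinds of file. It assumes the default GloVe output layout (one word followed by its values per line); the function name glove_to_word2vec_text and the sample data are made up.

```python
import io


def glove_to_word2vec_text(src, dst):
    """Copy GloVe vectors from src to dst, prepending a 'count dim' header.

    Both arguments are seekable text handles. Returns (count, dim).
    """
    count, dim = 0, 0
    # First pass: count rows and read the dimension from the first row.
    for line in src:
        if line.strip():
            if count == 0:
                dim = len(line.split()) - 1
            count += 1
    # Second pass: write the header, then copy the rows unchanged.
    src.seek(0)
    dst.write(f"{count} {dim}\n")
    for line in src:
        if line.strip():
            dst.write(line if line.endswith("\n") else line + "\n")
    return count, dim


# Tiny made-up sample: 2 tokens, 2 dimensions.
glove_sample = io.StringIO("的 0.1 0.2\n是 0.3 0.4\n")
converted = io.StringIO()
shape = glove_to_word2vec_text(glove_sample, converted)
```

Reading the file twice keeps memory use flat, which matters for vocabularies of several hundred thousand entries.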

  3. Original Chinese Wikipedia corpus: Download

  4. Simplified Chinese Wikipedia corpus. Before segmentation: Download; after segmentation: Download; after character-level segmentation: Download.

  5. Traditional Chinese Wikipedia corpus. Before segmentation: Download; after segmentation: Download; after character-level segmentation: Download.
