  • Stars: 505
  • Rank 84,603 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created almost 8 years ago
  • Updated over 1 year ago

Repository Details

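Chinese Word Vector Training Tutorial

Training Chinese word vectors with gensim

Tutorial document

Package requirements

  • jieba
    pip3 install jieba
  • gensim
    pip3 install -U gensim
  • OpenCC (can be replaced with any Traditional/Simplified Chinese conversion package)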

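Training procedure

1. Obtain the Chinese Wikipedia dump. This experiment uses the dump from 2016/8/20.

The August 20 dump has since been retired. Go to Wikipedia:Database download and pick a newer dump by date as training data. (Choose the file whose name ends with pages-articles.xml.bz2.)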

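2. Place the downloaded Wikipedia dump in the same directory as the project, then use wiki_to_txt.py to extract the Wikipedia articles from the XML.

python3 wiki_to_txt.py zhwiki-20160820-pages-articles.xml.bz2

If you are not using the August 20 dump, replace zhwiki-20160820-pages-articles.xml.bz2 with the file name of the dump you downloaded.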
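
The contents of wiki_to_txt.py are not shown here. As a rough sketch of what an extraction step like this could look like, the following assumes gensim's WikiCorpus is used to stream articles out of the dump. The function names write_articles and extract_wiki are hypothetical, and the default output name wiki_texts.txt is simply the input file name used in step 3.

```python
def write_articles(articles, out_path):
    """Write one article per line, tokens separated by spaces; return the article count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in articles:
            # Older gensim releases yield bytes tokens, newer ones yield str.
            words = [t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens]
            out.write(" ".join(words) + "\n")
            count += 1
    return count


def extract_wiki(dump_path, out_path="wiki_texts.txt"):
    """Stream articles from a pages-articles.xml.bz2 dump (assumes gensim is installed)."""
    # Imported lazily so write_articles can be used without gensim.
    from gensim.corpora import WikiCorpus

    # dictionary={} skips vocabulary building, which plain text extraction does not need.
    wiki = WikiCorpus(dump_path, dictionary={})
    return write_articles(wiki.get_texts(), out_path)
```

Calling extract_wiki("zhwiki-20160820-pages-articles.xml.bz2") would then write one article per line. Processing a full dump takes a while, so it may be worth trying write_articles on a few hand-made token lists first.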

3. Use OpenCC to convert all of the Wikipedia articles to Traditional Chinese.

opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
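
Since any Traditional/Simplified conversion package can be substituted, the same step can also be done from Python. The sketch below is one possible shape, not the project's own code: convert_file is a hypothetical helper that accepts any str-to-str function, and make_opencc_converter assumes a Python OpenCC binding is installed (depending on the binding, the config name may be "s2tw" or "s2tw.json").

```python
def convert_file(src_path, dst_path, convert):
    """Apply convert (any str -> str function) to each line; return the line count."""
    lines = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip the newline before converting and add it back afterwards.
            dst.write(convert(line.rstrip("\n")) + "\n")
            lines += 1
    return lines


def make_opencc_converter(config="s2tw"):
    """Assumption: a Python OpenCC binding is available under the module name opencc."""
    from opencc import OpenCC  # lazy import, only needed when this helper is called

    return OpenCC(config).convert
```

With such a binding installed, convert_file("wiki_texts.txt", "wiki_zh_tw.txt", make_opencc_converter()) would mirror the command above.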

4. Use jieba to segment the text into words and remove stop words.

python3 segment.py
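
segment.py itself is not reproduced here. A minimal sketch of segmentation plus stop-word filtering might look like the following, assuming jieba.cut is the tokenizer and the stop words live in a one-word-per-line text file. The names load_stopwords and segment_line are illustrative, and the cut parameter exists only so that another tokenizer can be swapped in.

```python
def load_stopwords(path):
    """Read a stop-word list with one word per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def segment_line(line, stopwords, cut=None):
    """Split one line into words and drop stop words and whitespace-only tokens."""
    if cut is None:
        import jieba  # lazy import; assumes jieba is installed

        cut = jieba.cut
    return [w for w in cut(line.strip()) if w.strip() and w not in stopwords]
```

Joining the result of segment_line with spaces and writing one line per article produces the whitespace-separated format that word2vec training tools typically expect.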

5. Train with gensim's word2vec model.

python3 train.py
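
As a hedged illustration of this step (train.py is not shown, so this is not necessarily what it does), the sketch below trains on a segmented corpus with one space-separated sentence per line. The helper corpus_stats, the file path arguments, and the vector_size value of 250 are all assumptions chosen for the example.

```python
def corpus_stats(path):
    """Count non-empty lines and whitespace-separated tokens as a pre-training sanity check."""
    lines = tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if words:
                lines += 1
                tokens += len(words)
    return lines, tokens


def train(corpus_path, model_path, vector_size=250):
    """Train and save a word2vec model (assumes gensim is installed)."""
    from gensim.models import word2vec  # lazy import

    # LineSentence streams the corpus line by line instead of loading it into memory.
    sentences = word2vec.LineSentence(corpus_path)
    # gensim 4.x names this parameter vector_size; gensim 3.x called it size.
    model = word2vec.Word2Vec(sentences, vector_size=vector_size)
    model.save(model_path)
    return model
```

Running corpus_stats first is a cheap way to confirm that the segmentation step actually produced tokens before committing to a long training run.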

6. Test the trained model.

python3 demo.py
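
What demo.py does is not shown here. A common way to test word vectors is to look up a word's nearest neighbours by cosine similarity, and the following standard-library sketch illustrates that idea on made-up two-dimensional vectors. The toy dictionary and both function names are purely illustrative and are not output from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Rank every other word by cosine similarity to word, highest first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors for illustration only.
toy = {"貓": [1.0, 0.1], "狗": [0.9, 0.2], "車": [0.0, 1.0]}
nearest = most_similar(toy, "貓", topn=1)
```

With a model trained by gensim, the equivalent query would be model.wv.most_similar(word, topn=...), which ranks the vocabulary by cosine similarity in the same spirit.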
