• Stars
    star
    195
  • Rank 199,374 (Top 4 %)
  • Language
    Java
  • Created almost 10 years ago
  • Updated over 7 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A Java implemention of LDA(Latent Dirichlet Allocation)

LDA4j

A Java implemention of LDA(Latent Dirichlet Allocation). Inference topics from a set of documents with few lines of Java code.

How To Use

  • code
public static void main(String[] args)
{
    // 1. Load corpus from disk
    Corpus corpus = Corpus.load("data/mini");
    // 2. Create a LDA sampler
    LdaGibbsSampler ldaGibbsSampler = new LdaGibbsSampler(corpus.getDocument(), corpus.getVocabularySize());
    // 3. Train it
    ldaGibbsSampler.gibbs(10);
    // 4. The phi matrix is a LDA model, you can use LdaUtil to explain it.
    double[][] phi = ldaGibbsSampler.getPhi();
    Map<String, Double>[] topicMap = LdaUtil.translate(phi, corpus.getVocabulary(), 10);
    LdaUtil.explain(topicMap);
}
  • output
topic 0 :
公司=0.009538408630174017
市场=0.008848009751698062
中国=0.008756489189917975
企业=0.0068280510303913395
发展=0.005991900977658479
目前=0.004408401842957633
产品=0.0041981128106208625
服务=0.003756081561227181
已经=0.003410105744626914
记者=0.003289155629929911

topic 1 :
专业=0.00872496522205349
工作=0.008108171408190876
学生=0.00793944661866665
学校=0.006307480899983371
考生=0.005295205701518912
大学=0.0052671267600129445
教育=0.0051547106121291805
考试=0.00507254577329609
人才=0.004037747449851247
招聘=0.003913811857165103

topic 2 :
医院=0.006197066939127888
治疗=0.0048149451145789455
患者=0.0032264139617756145
健康=0.0026521203697810374
手术=0.0025525793863978826
女性=0.0023724111474892357
专家=0.0021711200905248276
发现=0.0021645199996586885
病人=0.0021567877663232846
医生=0.002155356316589454

topic 3 :
没有=0.008818728535385055
问题=0.00476170232225101
中国=0.00476161560515722
工作=0.004610303190509696
生活=0.004283310385880329
文化=0.0036558079614339278
孩子=0.003327977201447208
不能=0.0032901108349775716
知道=0.003127437274214269
已经=0.0030419673256694545

topic 4 :
公司=0.018241005428669386
股东=0.009281048036676322
股份=0.0078638937643388
搜狐=0.0065617441267974705
有限公司=0.006139808167975946
直播员=0.005439495997416965
股权=0.005353954615162839
项目=0.004984451830097043
发行=0.004511099443364358
改革=0.004489038403046334

topic 5 :
旅游=0.013331508385667979
游客=0.004296589238778804
城市=0.0032312276892446116
文化=0.0026831367778820704
旅行社=0.002242817493567529
世界=0.0021001546909288965
成都=0.001991337289815279
活动=0.001894687770595843
北京=0.0017106388886854072
公园=0.0016134766410937638

topic 6 :
美国=0.007679518424242107
日本=0.004777746687572576
训练=0.003947682941243526
系统=0.003926562149803556
飞机=0.0038757503504304267
部队=0.00365041154980242
进行=0.003644226666909795
军事=0.003637873811678725
作战=0.003407296869780034
装备=0.003319112427162246

topic 7 :
比赛=0.0092171879571152
队员=0.0036851386114063237
联赛=0.0032845199043377146
球队=0.0029432131822116707
冠军=0.0024090127058022104
俱乐部=0.002348957542679953
球员=0.0022159606741087795
决赛=0.002192739194333911
赛季=0.0020352324832133267
对手=0.001974829226645783

topic 8 :
The=0.002190604616155811
意思=0.001186435720799536
It=0.0011515962078723501
理解=0.0010433831740419728
What=9.560997173453189E-4
They=9.345962358267594E-4
听力=8.362275772461826E-4
In=8.166984660263638E-4
阅读=7.775969918239417E-4
译文=7.568900132152651E-4

topic 9 :
毛泽东=0.002633793448326645
曹操=0.0018832599387516155
曹丕=0.0016567353952110328
皇帝=0.001629990508040292
甄洛=0.0012930147890964736
中央=0.0012783947883529055
蒋介石=0.0010732052837016102
曹睿=8.511476483731437E-4
女王=8.125680914406854E-4
皇后=8.013815303127338E-4
  • corpus The data/mini is some documents included in this project, which use space to segment words. Feel free to replace it with yours.
  • algorithm Mainly depend on Gregor Heinrich's great work. Read more about this implementation on《LDA入门与Java实现》

More Repositories

1

HanLP

中文分词 词性标注 命名实体识别 依存句法分析 成分句法分析 语义依存分析 语义角色标注 指代消解 风格转换 语义相似度 新词发现 关键词短语提取 自动摘要 文本分类聚类 拼音简繁转换 自然语言处理
Python
33,681
star
2

pyhanlp

中文分词
Python
3,122
star
3

AhoCorasickDoubleArrayTrie

An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie.
Java
946
star
4

CS224n

CS224n: Natural Language Processing with Deep Learning Assignments Winter, 2017
Python
673
star
5

Viterbi

An implementation of HMM-Viterbi Algorithm 通用的维特比算法实现
Java
369
star
6

multi-criteria-cws

Simple Solution for Multi-Criteria Chinese Word Segmentation
Python
300
star
7

hanlp-lucene-plugin

HanLP中文分词Lucene插件,支持包括Solr在内的基于Lucene的系统
Java
296
star
8

TextRank

TextRank算法提取关键词的Java实现
Java
201
star
9

TreebankPreprocessing

Python scripts preprocessing Penn Treebank and Chinese Treebank
Python
162
star
10

ID-CNN-CWS

Source codes and corpora of paper "Iterated Dilated Convolutions for Chinese Word Segmentation"
Python
136
star
11

MainPartExtractor

主谓宾提取器的Java实现(对斯坦福的代码失去兴趣,不再维护)
Java
136
star
12

neural_net

反向传播神经网络及应用
Python
82
star
13

udacity-deep-learning

Assignments for Udacity Deep Learning class with TensorFlow in PURE Python, not IPython Notebook
Python
66
star
14

AveragedPerceptronPython

Clone of "A Good Part-of-Speech Tagger in about 200 Lines of Python" by Matthew Honnibal
Python
49
star
15

MaxEnt

这是一个最大熵的简明Java实现,提供提供训练与预测接口。训练算法采用GIS训练算法,附带示例训练集和一个天气预测的Demo。
Java
46
star
16

text-classification-svm

The missing SVM-based text classification module implementing HanLP's interface
Java
46
star
17

IceNAT

IceNAT
Java
32
star
18

BERT-token-level-embedding

Generate BERT token level embedding without pain
Python
28
star
19

sub-character-cws

Sub-Character Representation Learning
Python
26
star
20

HanLPAndroidDemo

HanLP Android Demo
Java
22
star
21

maxent_iis

最大熵-IIS(Improved Iterative Scaling)训练算法的Java实现
Java
18
star
22

gohanlp

Golang RESTful Client for HanLP
Go
13
star
23

iparser

Yet another dependency parser, integrated with tokenizer, tagger and visualization tool.
Python
11
star
24

DeepBiaffineParserMXNet

An experimental implementation of biaffine parser using MXNet
Python
10
star
25

OpenCC-to-HanLP

无损转换OpenCC词典为HanLP格式
Python
9
star
26

tmsvm

Python
1
star
27

bolt_splits

Split Broad Operational Language Translation corpus into train/dev/test set
Python
1
star