LDA4j

A Java implementation of LDA (Latent Dirichlet Allocation). Infer topics from a set of documents with a few lines of Java code.

How To Use

code

public static void main(String[] args)
{
    // 1. Load corpus from disk
    Corpus corpus = Corpus.load("data/mini");
    // 2. Create an LDA sampler
    LdaGibbsSampler ldaGibbsSampler = new LdaGibbsSampler(corpus.getDocument(), corpus.getVocabularySize());
    // 3. Train it
    ldaGibbsSampler.gibbs(10);
    // 4. The phi matrix is the LDA model; you can use LdaUtil to explain it.
    double[][] phi = ldaGibbsSampler.getPhi();
    Map<String, Double>[] topicMap = LdaUtil.translate(phi, corpus.getVocabulary(), 10);
    LdaUtil.explain(topicMap);
}

output

topic 0 :
公司=0.009538408630174017
市场=0.008848009751698062
中国=0.008756489189917975
企业=0.0068280510303913395
发展=0.005991900977658479
目前=0.004408401842957633
产品=0.0041981128106208625
服务=0.003756081561227181
已经=0.003410105744626914
记者=0.003289155629929911

topic 1 :
专业=0.00872496522205349
工作=0.008108171408190876
学生=0.00793944661866665
学校=0.006307480899983371
考生=0.005295205701518912
大学=0.0052671267600129445
教育=0.0051547106121291805
考试=0.00507254577329609
人才=0.004037747449851247
招聘=0.003913811857165103

topic 2 :
医院=0.006197066939127888
治疗=0.0048149451145789455
患者=0.0032264139617756145
健康=0.0026521203697810374
手术=0.0025525793863978826
女性=0.0023724111474892357
专家=0.0021711200905248276
发现=0.0021645199996586885
病人=0.0021567877663232846
医生=0.002155356316589454

topic 3 :
没有=0.008818728535385055
问题=0.00476170232225101
中国=0.00476161560515722
工作=0.004610303190509696
生活=0.004283310385880329
文化=0.0036558079614339278
孩子=0.003327977201447208
不能=0.0032901108349775716
知道=0.003127437274214269
已经=0.0030419673256694545

topic 4 :
公司=0.018241005428669386
股东=0.009281048036676322
股份=0.0078638937643388
搜狐=0.0065617441267974705
有限公司=0.006139808167975946
直播员=0.005439495997416965
股权=0.005353954615162839
项目=0.004984451830097043
发行=0.004511099443364358
改革=0.004489038403046334

topic 5 :
旅游=0.013331508385667979
游客=0.004296589238778804
城市=0.0032312276892446116
文化=0.0026831367778820704
旅行社=0.002242817493567529
世界=0.0021001546909288965
成都=0.001991337289815279
活动=0.001894687770595843
北京=0.0017106388886854072
公园=0.0016134766410937638

topic 6 :
美国=0.007679518424242107
日本=0.004777746687572576
训练=0.003947682941243526
系统=0.003926562149803556
飞机=0.0038757503504304267
部队=0.00365041154980242
进行=0.003644226666909795
军事=0.003637873811678725
作战=0.003407296869780034
装备=0.003319112427162246

topic 7 :
比赛=0.0092171879571152
队员=0.0036851386114063237
联赛=0.0032845199043377146
球队=0.0029432131822116707
冠军=0.0024090127058022104
俱乐部=0.002348957542679953
球员=0.0022159606741087795
决赛=0.002192739194333911
赛季=0.0020352324832133267
对手=0.001974829226645783

topic 8 :
The=0.002190604616155811
意思=0.001186435720799536
It=0.0011515962078723501
理解=0.0010433831740419728
What=9.560997173453189E-4
They=9.345962358267594E-4
听力=8.362275772461826E-4
In=8.166984660263638E-4
阅读=7.775969918239417E-4
译文=7.568900132152651E-4

topic 9 :
毛泽东=0.002633793448326645
曹操=0.0018832599387516155
曹丕=0.0016567353952110328
皇帝=0.001629990508040292
甄洛=0.0012930147890964736
中央=0.0012783947883529055
蒋介石=0.0010732052837016102
曹睿=8.511476483731437E-4
女王=8.125680914406854E-4
皇后=8.013815303127338E-4
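
Each topic block above lists the most probable words of one row of the phi matrix. The following is a minimal sketch of that ranking step, not the code behind LdaUtil.translate. It assumes phi[k][w] is the probability of word w under topic k; the class name TopWordsSketch, the method topWords, and the toy vocabulary are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopWordsSketch {
    /** Returns the `limit` most probable words of one topic, most probable first. */
    static Map<String, Double> topWords(double[] topicRow, String[] vocabulary, int limit) {
        List<Integer> wordIds = new ArrayList<>();
        for (int w = 0; w < topicRow.length; w++) {
            wordIds.add(w);
        }
        // Sort word ids by descending probability within this topic.
        wordIds.sort((a, b) -> Double.compare(topicRow[b], topicRow[a]));

        // LinkedHashMap keeps the ranked insertion order when iterated.
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(limit, wordIds.size()); i++) {
            int w = wordIds.get(i);
            result.put(vocabulary[w], topicRow[w]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy data for illustration only; a real phi comes from the trained sampler.
        String[] vocabulary = {"market", "school", "hospital", "company"};
        double[][] phi = {
            {0.40, 0.05, 0.05, 0.50},
            {0.10, 0.70, 0.15, 0.05},
        };
        for (int k = 0; k < phi.length; k++) {
            System.out.println("topic " + k + " :");
            topWords(phi[k], vocabulary, 2)
                .forEach((word, p) -> System.out.println(word + "=" + p));
        }
    }
}
```

The small probabilities in the real output are expected: each row of phi spreads a total mass of 1 over the whole vocabulary, so even the top words of a topic carry only a fraction of a percent each.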

corpus

data/mini is a small set of documents included in this project, in which words are separated by spaces. Feel free to replace it with your own corpus.
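
To make the space-segmented format concrete, here is a rough sketch of how such text can be turned into the word-id arrays a sampler works on. It is an illustration only and does not reproduce what Corpus.load does internally; the class CorpusFormatSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CorpusFormatSketch {
    /** Word-to-id table, filled in first-seen order. */
    final Map<String, Integer> vocabulary = new LinkedHashMap<>();
    /** One int[] of word ids per document. */
    final List<int[]> documents = new ArrayList<>();

    /** Adds one document whose words are already separated by whitespace. */
    void addDocument(String segmentedText) {
        String trimmed = segmentedText.trim();
        if (trimmed.isEmpty()) {
            return; // skip empty documents
        }
        String[] words = trimmed.split("\\s+");
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = vocabulary.get(words[i]);
            if (id == null) {
                // First time this word is seen: give it the next free id.
                id = vocabulary.size();
                vocabulary.put(words[i], id);
            }
            ids[i] = id;
        }
        documents.add(ids);
    }

    public static void main(String[] args) {
        CorpusFormatSketch corpus = new CorpusFormatSketch();
        corpus.addDocument("stock market company shares");
        corpus.addDocument("hospital doctor patient hospital");
        System.out.println("documents: " + corpus.documents.size());
        System.out.println("vocabulary size: " + corpus.vocabulary.size());
    }
}
```

Text in languages without spaces between words, such as Chinese, has to be run through a word segmenter first so that each document arrives in this space-separated form.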

algorithm

The implementation is mainly based on Gregor Heinrich's great work. Read more about it in 《LDA入门与Java实现》 ("Introduction to LDA and Its Java Implementation").
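
For readers who want a feel for the algorithm, the following is a compact sketch of textbook collapsed Gibbs sampling for LDA. It is not the LDA4j source. The class GibbsSketch, its field names, the toy corpus, and the alpha and beta values are all assumptions chosen for illustration. Each token's topic is resampled with probability proportional to (n_wk + beta) / (n_k + V * beta) * (n_dk + alpha), with the token's own assignment removed from the counts first.

```java
import java.util.Random;

public class GibbsSketch {
    final int[][] docs;   // docs[d][i] is the word id of token i in document d
    final int V;          // vocabulary size
    final int K;          // number of topics
    final double alpha;   // document-topic prior
    final double beta;    // topic-word prior
    final int[][] z;      // z[d][i]: topic currently assigned to token i of document d
    final int[][] nwk;    // nwk[w][k]: how often word w is assigned to topic k
    final int[][] ndk;    // ndk[d][k]: tokens of document d assigned to topic k
    final int[] nk;       // nk[k]: total tokens assigned to topic k
    final Random random;

    GibbsSketch(int[][] docs, int V, int K, double alpha, double beta, long seed) {
        this.docs = docs; this.V = V; this.K = K; this.alpha = alpha; this.beta = beta;
        this.random = new Random(seed);
        z = new int[docs.length][];
        nwk = new int[V][K];
        ndk = new int[docs.length][K];
        nk = new int[K];
        // Start from a random topic for every token.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = random.nextInt(K);
                addToCounts(d, i, +1);
            }
        }
    }

    /** Adds delta to the three counters touched by token i of document d. */
    private void addToCounts(int d, int i, int delta) {
        int w = docs[d][i], k = z[d][i];
        nwk[w][k] += delta; ndk[d][k] += delta; nk[k] += delta;
    }

    /** One pass over all tokens, resampling each topic assignment. */
    void sweep() {
        double[] p = new double[K];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                addToCounts(d, i, -1); // exclude the current token
                int w = docs[d][i];
                double total = 0;
                for (int k = 0; k < K; k++) {
                    p[k] = (nwk[w][k] + beta) / (nk[k] + V * beta) * (ndk[d][k] + alpha);
                    total += p[k];
                }
                // Draw a topic from the unnormalised distribution p.
                double u = random.nextDouble() * total;
                int k = 0;
                double acc = p[0];
                while (acc < u && k < K - 1) { k++; acc += p[k]; }
                z[d][i] = k;
                addToCounts(d, i, +1);
            }
        }
    }

    /** Point estimate of the topic-word distributions from the current counts. */
    double[][] phi() {
        double[][] phi = new double[K][V];
        for (int k = 0; k < K; k++)
            for (int w = 0; w < V; w++)
                phi[k][w] = (nwk[w][k] + beta) / (nk[k] + V * beta);
        return phi;
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 1, 0, 1, 0}, {2, 3, 2, 3, 3}, {0, 1, 1, 0}, {3, 2, 2, 3}};
        GibbsSketch sampler = new GibbsSketch(docs, 4, 2, 0.5, 0.1, 1L);
        for (int iter = 0; iter < 200; iter++) sampler.sweep();
        for (double[] row : sampler.phi()) System.out.println(java.util.Arrays.toString(row));
    }
}
```

This sketch reads phi from the final state only. A fuller sampler would normally discard early burn-in sweeps and may average estimates over several later samples.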

hankcs/LDA4j
