  • Language
    Python
  • License
    BSD 3-Clause "New...

Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

KATE: K-Competitive Autoencoder for Text

Prerequisites

This code is written in Python. To use it you will need:

Getting started

To preprocess the corpus, e.g., 20 Newsgroups, just run the following:

    python construct_20news.py -train [train_dir] -test [test_dir] -o [out_dir] -threshold [word_freq_threshold] -topn [top_n_words]

This writes four JSON files to the [out_dir] directory: train_data, train_label, test_data, and test_label. You can download the preprocessed data we used in our experiments here.

To train the KATE model, just run the following:

    python train.py -i [train_data] -nd [num_topics] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -ctype kcomp -ck [top_k] -sm [model_file]
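
The `-ctype kcomp -ck [top_k]` options select the k-competitive layer and its number of winners. As a rough illustration of what that layer does to a hidden vector, here is a minimal pure-Python sketch based on the description in the paper. It is not the code in this repository; the function name `k_competitive` and the argument `alpha` (the energy amplification factor) are chosen here for illustration only.

```python
import math


def k_competitive(z, k, alpha):
    """Sketch of k-competitive activation on one hidden vector (assumed form).

    z     : list of hidden activations (e.g. tanh outputs), positive or negative
    k     : total number of winner neurons to keep
    alpha : amplification factor applied to the energy taken from the losers
    """
    n_pos = math.ceil(k / 2)   # winners among the positive neurons
    n_neg = k // 2             # winners among the negative neurons
    out = [0.0] * len(z)       # losers stay at zero

    # Indices of positive neurons, largest first; negative neurons, most negative first.
    pos = sorted((i for i, v in enumerate(z) if v > 0), key=lambda i: z[i], reverse=True)
    neg = sorted((i for i, v in enumerate(z) if v < 0), key=lambda i: z[i])

    for order, n_win in ((pos, n_pos), (neg, n_neg)):
        winners, losers = order[:n_win], order[n_win:]
        # Energy of the losing neurons on this side (negative for the negative side).
        energy = sum(z[i] for i in losers)
        for i in winners:
            out[i] = z[i] + alpha * energy
    return out


if __name__ == "__main__":
    hidden = [0.8, 0.1, -0.5, 0.3, -0.2, 0.05]
    print(k_competitive(hidden, k=2, alpha=1.0))
```

With `k=2`, only the strongest positive neuron and the strongest negative neuron stay non-zero, and each absorbs the summed activation of the losers on its side, scaled by `alpha`.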

To predict on the test set, just run the following:

    python pred.py -i [test_data] -lm [model_file] -o [output_doc_vec_file] -st [output_topics] -sw [output_sample_words] -wc [output_word_clouds]

To train a simple classifier, just run the following:

    python run_classifier.py [train_doc_codes] [train_doc_labels] [test_doc_codes] [test_doc_labels] -nv [num_validation] -ne [num_epochs] -bs [batch_size]

To train baseline methods, e.g., VAE, just run the following:

    python train_vae.py -i [train_data] -nd [num of dimensions] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -sm [model_file]

Notes

  1. To apply the KATE model to your own dataset, you will need to preprocess the dataset yourself: build the vocabulary and a Bag-of-Words representation of each document.

  2. The KATE model learns vector representations of words (those in the vocabulary) as well as documents in an unsupervised manner. It can also extract topics from a corpus. Document labels are needed only if you want to, for example, train a document classifier on the learned document vectors.
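
The preprocessing described in note 1 can be sketched as follows. This is only an illustration under assumptions: the helper names `build_vocab` and `to_bow`, the `min_freq` and `top_n` parameters (loosely mirroring `-threshold` and `-topn`), and the dictionary layout are hypothetical. Check `construct_20news.py` for the exact JSON format the training script expects.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_freq=1, top_n=None):
    """Map each kept word to an integer id (hypothetical helper).

    Words are ranked by corpus frequency; only the top_n most frequent
    words with at least min_freq occurrences are kept.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [w for w, c in counts.most_common(top_n) if c >= min_freq]
    return {w: i for i, w in enumerate(kept)}


def to_bow(doc, vocab):
    """Bag-of-Words for one document: {word_id: count}, ignoring unknown words."""
    return dict(Counter(vocab[tok] for tok in doc if tok in vocab))


if __name__ == "__main__":
    docs = [
        ["graph", "neural", "graph"],
        ["neural", "text"],
        ["graph", "text", "rare"],
    ]
    vocab = build_vocab(docs, min_freq=2)
    bows = [to_bow(d, vocab) for d in docs]
    print(vocab)
    print(bows)
```

Words that fall below the frequency threshold (here `rare`) are dropped from the vocabulary and therefore never appear in any document's Bag-of-Words.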

FAQ

  1. KeyError when plotting word clouds

Make sure the words belong to the vocabulary. See here.
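
One way to avoid the error is to filter your word list against the vocabulary before plotting. The sketch below assumes the vocabulary is available as a dict keyed by word; `split_by_vocab` is a hypothetical helper, not part of this repository.

```python
def split_by_vocab(words, vocab):
    """Separate words into those present in the vocabulary and those missing."""
    known = [w for w in words if w in vocab]
    unknown = [w for w in words if w not in vocab]
    return known, unknown


if __name__ == "__main__":
    vocab = {"space": 0, "hockey": 1, "windows": 2}
    known, unknown = split_by_vocab(["space", "nasa", "hockey"], vocab)
    print("plot these:", known)
    print("not in vocabulary:", unknown)
```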

Architecture

Experiment results on 20 Newsgroups

PCA on the 20-D document vectors

[Figure: 20news_doc_vec_pca]

t-SNE on the 20-D document vectors

[Figure: 20news_doc_vec_tsne]

Five nearest neighbors in the word representation space

[Figure: 20news_word_vec]

Extracted topics

Text classification results on 20 Newsgroups

Visualization of the normalized topic-word weight matrices of KATE & LDA (KATE learns distinctive patterns)

Reference

If you find this code useful, please cite the following paper:

Yu Chen and Mohammed J. Zaki. "KATE: K-Competitive Autoencoder for Text." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2017.

@inproceedings{chen2017kate,
  author = {Yu Chen and Mohammed J. Zaki},
  title = {KATE: K-Competitive Autoencoder for Text},
  booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  doi = {10.1145/3097983.3098017},
  year = {2017},
  month = {Aug}
}

Other research papers that applied the KATE model:

Chen, Yu, Rhaad M. Rabbani, Aparna Gupta, and Mohammed J. Zaki. "Comparative text analytics via topic modeling in banking." In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8. IEEE, 2017.

@inproceedings{chen2017comparative,
  title={Comparative text analytics via topic modeling in banking},
  author={Chen, Yu and Rabbani, Rhaad M and Gupta, Aparna and Zaki, Mohammed J},
  booktitle={2017 IEEE Symposium Series on Computational Intelligence (SSCI)},
  pages={1--8},
  year={2017},
  organization={IEEE}
}
