  • Stars: 257
  • Rank: 158,728 (Top 4%)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created: about 8 years ago
  • Updated: over 2 years ago


Repository Details

emoji2vec: Learning Emoji Representations from their Description

emoji2vec

This is the accompanying repository for emoji2vec: Learning Emoji Representations from their Description, the paper recently released by Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel.

In this repository, we present the code that we used to train our representations of emoji, the training data we used to do so, and several tools for analyzing the performance of the vectors trained.

NOTE: The emoji_joined.txt dataset was generated by combining emoji.txt (a dataset scraped from unicode.org) with a similar dataset scraped from iemoji.com. The data from iemoji.com contains short keyword descriptions for some of the lesser-used emoji, which improved coverage over the full emoji set. This source was mistakenly omitted from the original manuscript, where all 6088 training examples were attributed to unicode.org. In actuality, 2684 were scraped from unicode.org and 3404 were scraped from iemoji.com.

Pre-trained model

If you are interested in using the emoji vectors from our paper, they can be found in Gensim text/binary format in ./pre-trained/. The pre-trained vectors are meant to be used in conjunction with word2vec, and are therefore 300-dimensional. Other dimensions can be trained manually, as explained below. These vectors correspond to the following hyperparameters:

params = {
    "out_dim": 300,
    "pos_ex": 4,
    "max_epochs": 40,
    "ratio": 1,
    "dropout": 0.0,
    "learning": 0.001
}

Basic Usage

Once you've downloaded the pre-trained model, you can easily integrate emoji embeddings into your projects like so:

import gensim.models as gsm

e2v = gsm.Word2Vec.load_word2vec_format('emoji2vec.bin', binary=True)
happy_vector = e2v['😂']    # Produces an embedding vector of length 300
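
Note that Word2Vec.load_word2vec_format has been deprecated in more recent releases of Gensim. If you are on Gensim 1.0 or later, the equivalent loader lives on KeyedVectors; a minimal sketch, assuming a recent Gensim install:

import gensim.models as gsm

# KeyedVectors replaces the deprecated Word2Vec.load_word2vec_format loader
e2v = gsm.KeyedVectors.load_word2vec_format('emoji2vec.bin', binary=True)
happy_vector = e2v['😂']    # 300-dimensional vector for the emoji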

Prerequisites

There are several prerequisites to using the code:

  • You must supply your own pretrained word vectors that are compatible with the Gensim tool, for instance the Google News word2vec dataset. These must be in the binary format, rather than the .txt format.
  • To download tweets using Tweepy, you must create a Twitter application at https://apps.twitter.com/ and place the four generated keys in secret.txt in the directory where you run the Python script (see the sketch below). However, you may not have to download the tweets at all, since they are stored raw in a pickle file in the repository.
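
As a rough illustration of how those four keys are typically used with Tweepy (how secret.txt is actually parsed in this codebase may differ; the variable names below are placeholders):

import tweepy

# Hypothetical layout: the four application keys, one per line in secret.txt
consumer_key, consumer_secret, access_token, access_token_secret = \
    open('secret.txt').read().split()

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)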

CLI Arguments

Much of this code shares a common command line interface, which allows you to supply hyperparameters for training and model generation/retrieval as well as file locations. The following can be supplied:

  • -d: directory for training data (default is ./data/training)
  • -w: path to the word embeddings (i.e. Google News word2vec)
  • -m: file where we store mapping between index and emoji, for convenient caching between runs
  • -em: file where we cache the vectorized phrases so they don't have to be recomputed each time; this only needs to change when the train, test, and dev files change
  • -k: output dimension of the emoji vectors we are training
  • -b: number of positive examples in a training batch
  • -e: number of training epochs
  • -r: ratio between positive and negative training examples in a batch
  • -l: learning rate
  • -dr: dropout rate
  • -t: threshold for classification, used in accuracy calculations
  • -ds: name of the dataset we are training on, mainly for output folder

These are defined in parameter_parser.py.
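
For orientation, a minimal sketch of how such a parser could be declared with argparse follows; the exact flag definitions, defaults, and help strings in parameter_parser.py may differ:

import argparse

# Illustrative subset of the flags listed above; defaults here are assumptions.
parser = argparse.ArgumentParser(description='emoji2vec training options')
parser.add_argument('-d', default='./data/training', help='training data directory')
parser.add_argument('-w', help='path to word embeddings (e.g. Google News word2vec)')
parser.add_argument('-k', type=int, default=300, help='output dimension of emoji vectors')
parser.add_argument('-b', type=int, default=4, help='positive examples per training batch')
parser.add_argument('-e', type=int, default=40, help='number of training epochs')
parser.add_argument('-r', type=int, default=1, help='ratio of negative to positive examples')
parser.add_argument('-l', type=float, default=0.001, help='learning rate')
parser.add_argument('-dr', type=float, default=0.0, help='dropout rate')
parser.add_argument('-t', type=float, default=0.5, help='classification threshold')
parser.add_argument('-ds', default='unicode', help='dataset name, used for the output folder')
args = parser.parse_args()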

Model

The Emoji2Vec model, as well as a class for passing in hyperparameters, can be found in model.py. The Emoji2Vec class is a TensorFlow implementation of our model.

Important to note is that one can evaluate the correlation between a phrase and an emoji in two ways: either by passing in a raw vector and an emoji index (for general queries), or by passing in the index of a training phrase and the index of an emoji (indices being positions in the Knowledge Base). Typically, unless you are training the model on a totally different set of training examples, you'll want to set use_embeddings to False in the constructor of the model. Otherwise, you'll have to pass in embeddings generated by the generate_embeddings function in utils.py.

In this initial release, the internals are a bit convoluted, so it would probably behoove anyone using the codebase to use train.py instead of using the Emoji2Vec class directly.

Phrase2Vec

The Phrase2Vec class is a convenience wrapper that computes vector sums for phrases. The class can be constructed from two vector sets simultaneously: a word2vec Gensim object and an emoji vector Gensim object. Alternatively, you can construct it from two filenames. Query it like so:

vec = phrase2Vec['I am really happy right now! 😄']
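
For intuition, a rough sketch of what such a wrapper does is shown below; this is not the repository's exact Phrase2Vec API, just an illustration of summing word and emoji vectors:

import numpy as np
import gensim.models as gsm

class PhraseVectors:
    # Illustrative wrapper: sums word2vec vectors for words and emoji2vec
    # vectors for emoji occurring in a phrase.
    def __init__(self, word_vec_path, emoji_vec_path, dim=300):
        self.dim = dim
        self.words = gsm.KeyedVectors.load_word2vec_format(word_vec_path, binary=True)
        self.emoji = gsm.KeyedVectors.load_word2vec_format(emoji_vec_path, binary=True)

    def __getitem__(self, phrase):
        vec = np.zeros(self.dim)
        for token in phrase.split():
            if token in self.emoji:
                vec += self.emoji[token]
            elif token in self.words:
                vec += self.words[token]
        return vec

p2v = PhraseVectors('GoogleNews-vectors-negative300.bin', 'emoji2vec.bin')
vec = p2v['I am really happy right now! 😄']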

Train

To train a single model, run train.py with any combination of the hyperparameters above. For instance,

python3 train.py -k=300 -b=4 -r=1 -l=0.001 -ds=unicode -d=./data/training -t=0.5

will generate emoji vectors with dimension 300, and will train in batches of 8 (4 positive, 4 negative examples) at a learning rate of 0.001. ./data/training/ must contain train.txt, dev.txt, and test.txt, each of which is a newline-delimited list of tab-separated examples of the form:

beating heart	🐮	False
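
That is, each line pairs a phrase with an emoji and a truth label. A minimal sketch of reading such a file, assuming the three-column layout shown above:

# Read a tab-separated training file of (phrase, emoji, label) triples.
examples = []
with open('./data/training/train.txt', encoding='utf-8') as f:
    for line in f:
        phrase, emoji, label = line.rstrip('\n').split('\t')
        examples.append((phrase, emoji, label == 'True'))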

The program will output various metrics, including accuracy (at the threshold provided), F1 score, and area under the ROC curve. Additionally, the program will generate a Gensim representation of the model, a TensorFlow representation of the model, a TensorBoard folder, and a cache of the model's predictions on the train and dev datasets.

These results can be found in the following folder:

./results/unicode/k-300_pos-4_rat-1_ep-40_dr=0/

Grid Search

You can perform a grid search over a hyperparameter space in one of two ways: either modify the search_params variable in grid_search.py directly and run grid_search.py, or call grid_search from a separate file with a supplied parameter set. In essence, the grid search generates results and embeddings in the same way as train.py for each parameter combination. The searchable parameters are represented as follows:

search_params = {
    "out_dim": [300],
    "pos_ex": [4, 16, 64],
    "max_epochs": [10, 20],
    "ratio": [0, 1, 2],
    "dropout": [0.0, 0.1]
}

NOTE: The epochs parameter will not be explored exactly as input. Since larger batches take more epochs to converge, we scale the number of epochs by the batch size.
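
Conceptually, the search enumerates the Cartesian product of these lists and trains one model per combination. A sketch of that loop, not the repository's actual grid_search implementation:

import itertools

search_params = {
    "out_dim": [300],
    "pos_ex": [4, 16, 64],
    "max_epochs": [10, 20],
    "ratio": [0, 1, 2],
    "dropout": [0.0, 0.1]
}

keys = sorted(search_params)
for values in itertools.product(*(search_params[k] for k in keys)):
    params = dict(zip(keys, values))
    # grid_search.py also rescales max_epochs with the batch size, per the note above.
    train_and_evaluate(params)   # hypothetical training entry point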

Twitter Sentiment Dataset

twitter_sentiment_dataset.py contains a collection of helper functions for downloading, processing, and reasoning about tweets. In general, since the tweets have already been downloaded, parsed, and cached in ./data/tweets/examples.p, a client shouldn't need to access these functions unless running them on a new set of tweets.

TODO(beneisner): Clean up this library so that it's easier to run with new Tweets.
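
If you only need the cached tweets, loading the pickle file directly is enough. The structure of the unpickled object is not documented here, so inspect it before relying on any particular layout:

import pickle

# Load the cached, pre-parsed tweets shipped with the repository.
with open('./data/tweets/examples.p', 'rb') as f:
    tweet_examples = pickle.load(f)

print(type(tweet_examples))   # inspect the structure before use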

Visualize

To generate a 2D visualization of the emoji embeddings, run:

python3 visualize.py {arguments}

This uses t-SNE to project the N-dimensional emoji embeddings into 2 dimensions.
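
If you would rather do the projection by hand with the pre-trained vectors, a rough equivalent with scikit-learn looks like this (assuming gensim 4.x, scikit-learn, and matplotlib are installed, and that the binary vectors sit in ./pre-trained/):

import gensim.models as gsm
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Load the pre-trained emoji vectors and project them to 2D with t-SNE.
e2v = gsm.KeyedVectors.load_word2vec_format('./pre-trained/emoji2vec.bin', binary=True)
emoji = list(e2v.index_to_key)   # gensim 4.x; older versions use index2word
points = TSNE(n_components=2, random_state=0).fit_transform(e2v[emoji])

plt.scatter(points[:, 0], points[:, 1], s=5)
for (x, y), label in zip(points, emoji):
    plt.annotate(label, (x, y), fontsize=8)
plt.show()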

Utils

utils.py contains several utility functions used across the codebase; these generally need not be used externally.

Jupyter Notebooks

We characterize the generated emoji embeddings in two files:

Results.ipynb

Results.ipynb displays quantitative and qualitative metrics for a given model. Change the hyperparameters near the top of the file to evaluate a different model.

TwitterClassification.ipynb

TwitterClassification.ipynb contains an evaluation scheme for the Twitter sentiment classification task outlined in the paper. It implements two rudimentary classifiers.

Contact

Contact me at ben [dot] a [dot] eisner [at] gmail [dot] com with questions about implementation or requests.

More Repositories

1. stat-nlp-book: Interactive Lecture Notes, Slides and Exercises for Statistical NLP (Jupyter Notebook, 269 stars)
2. egal: easy drawing in jupyter (JavaScript, 257 stars)
3. jack: Jack the Reader (Python, 257 stars)
4. torch-imle: Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions (Python, 257 stars)
5. fakenewschallenge: UCL Machine Reading - FNC-1 Submission (Python, 166 stars)
6. pycodesuggest: Learning to Auto-Complete using RNN Language Models (Python, 156 stars)
7. cqd: Continuous Query Decomposition for Complex Query Answering in Incomplete Knowledge Graphs (Python, 95 stars)
8. ntp: End-to-End Differentiable Proving (NewLisp, 88 stars)
9. d4: Differentiable Forth Interpreter (Python, 66 stars)
10. low-rank-logic: Code for Injecting Logical Background Knowledge into Embeddings for Relation Extraction (Scala, 65 stars)
11. inferbeddings: Injecting Background Knowledge in Neural Models via Adversarial Set Regularisation (Python, 59 stars)
12. gntp (Python, 57 stars)
13. ctp: Conditional Theorem Proving (Python, 51 stars)
14. EMAT: Efficient Memory-Augmented Transformers (Python, 34 stars)
15. stat-nlp-book-scala: Interactive book on Statistical NLP (Scala, 32 stars)
16. simpleNumericalFactChecker: Fact checker for simple claims about statistical properties (Python, 26 stars)
17. adversarial-nli: Code and data for the CoNLL 2018 paper "Adversarially Regularising Neural NLI Models to Integrate Logical Background Knowledge." (Python, 25 stars)
18. acl2015tutorial: Moro files for the ACL 2015 Tutorial on Matrix and Tensor Factorization Methods for Natural Language Processing (Scala, 20 stars)
19. numerate-language-models (Python, 19 stars)
20. fever: FEVER Workshop Shared-Task (Python, 16 stars)
21. APE: Adaptive Passage Encoder for Open-domain Question Answering (Python, 15 stars)
22. stat-nlp-course: Code for the UCL Statistical NLP course (Scala, 11 stars)
23. newshack: BBC Newshack code (Scala, 1 star)
24. eqa-tools: Tools for Exam Question Answering (Python, 1 star)
25. softconf-start-sync: Softconf START sync, tool for Google Sheets (JavaScript, 1 star)
26. bibtex: BibTeX files (TeX, 1 star)