Deep Learning models to detect hate speech in tweets

Hate Speech Detection on Twitter

Implementation of our paper "Deep Learning for Hate Speech Detection" (to appear in the WWW'17 proceedings).

Dataset

The dataset can be downloaded from https://github.com/zeerakw/hatespeech. It contains tweet IDs and their corresponding annotations.

Each tweet is labelled as Racist, Sexist, or Neither Racist nor Sexist.

Use your favourite tweet crawler to download the tweets for these IDs, and place them in the folder 'tweet_data'.
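
Because the dataset ships only IDs and labels, the crawled text has to be joined back to the annotations, and tweets that are no longer available drop out. The sketch below shows one way to do that join. The two-column CSV layout, the label strings, and the helper names `load_annotations` and `join_tweets` are assumptions made for illustration, not the format or code used by this repository.

```python
import csv
import io

# Assumed label strings; check the downloaded annotation file for the real ones.
LABELS = ("racism", "sexism", "none")


def load_annotations(handle):
    """Read (tweet_id, label) rows from an open CSV file into a dict."""
    annotations = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        tweet_id, label = row[0].strip(), row[1].strip().lower()
        if label in LABELS:
            annotations[tweet_id] = label
    return annotations


def join_tweets(annotations, crawled):
    """Pair crawled tweet text with its label, dropping IDs that were not fetched."""
    return [(crawled[tid], label)
            for tid, label in annotations.items() if tid in crawled]


# Stand-in data: three annotated IDs, one of which could not be crawled.
sample_csv = io.StringIO("101,racism\n102,sexism\n103,none\n")
annotations = load_annotations(sample_csv)
crawled = {"101": "placeholder text a", "103": "placeholder text c"}
dataset = join_tweets(annotations, crawled)
```

Counting `len(annotations) - len(dataset)` after the join gives a quick estimate of how many tweets were lost to deletion.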

Requirements

  • Keras
  • TensorFlow or Theano (we experimented with Theano)
  • Gensim
  • xgboost
  • NLTK
  • Sklearn
  • Numpy

Instructions to run

Before running a model, make sure you have set up the input dataset in a folder named tweet_data.
To train a model, follow the instructions below. Adjust the parameter settings to test the different model variations.

The nn_classifier.py script contains the code for running NN_model + GBDT.

Steps to run NN_model + GBDT

  • Run NN_model first (CNN/LSTM/Fast_text). It will create a model file.
  • At line 50, change the file name so that it points to the model file.
  • Run the nn_classifier file as per the instructions below.

python nn_classifier.py <GradientBoosting(xgboost) or Random Forest>
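
The two-stage idea can be sketched as follows: take the embedding weights learned by the neural model, turn each tweet into a fixed-size feature vector, and train a tree ensemble on those vectors. This is a minimal sketch, assuming the features are the average of a tweet's word embeddings. The random `embedding_matrix`, the toy token IDs, and the helper `tweet_features` are stand-ins for illustration, and scikit-learn's `GradientBoostingClassifier` is used here in place of xgboost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def tweet_features(token_ids, embedding_matrix):
    """Average the embedding rows of a tweet's tokens into one vector."""
    if len(token_ids) == 0:
        return np.zeros(embedding_matrix.shape[1])
    return embedding_matrix[token_ids].mean(axis=0)


rng = np.random.default_rng(42)

# Stand-in for the weights saved in the model file (vocabulary of 20, dimension 8).
embedding_matrix = rng.normal(size=(20, 8))

# Toy tweets as arrays of token IDs, with three balanced classes.
tweets = [rng.integers(0, 20, size=rng.integers(3, 10)) for _ in range(30)]
labels = np.array([i % 3 for i in range(30)])

# One averaged feature vector per tweet.
X = np.vstack([tweet_features(t, embedding_matrix) for t in tweets])

clf = GradientBoostingClassifier(n_estimators=10, random_state=42)
clf.fit(X, labels)
predictions = clf.predict(X)
```

Averaging keeps the feature size independent of tweet length, which is what lets a classifier that expects fixed-width input consume the output of a sequence model.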

  • BagOfWords models - BoWV.py [does not support XGBoost; supports sklearn's GBDT]
usage: BoWV.py [-h] -m [Deprecated]
               {logistic,gradient_boosting,random_forest,svm,svm_linear} -f
               EMBEDDINGFILE -d DIMENSION --tokenizer {glove,nltk} [-s SEED]
               [--folds FOLDS] [--estimators ESTIMATORS] [--loss LOSS]
               [--kernel KERNEL] [--class_weight CLASS_WEIGHT]

BagOfWords model for twitter Hate speech detection

optional arguments:
  -h, --help            show this help message and exit
  -m {logistic,gradient_boosting,random_forest,svm,svm_linear}, --model {logistic,gradient_boosting,random_forest,svm,svm_linear}
  -f EMBEDDINGFILE, --embeddingfile EMBEDDINGFILE
  -d DIMENSION, --dimension DIMENSION
  --tokenizer {glove,nltk}
  -s SEED, --seed SEED
  --folds FOLDS
  --estimators ESTIMATORS
  --loss LOSS
  --kernel KERNEL
  --class_weight CLASS_WEIGHT
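
To make the `-f` and `-d` arguments concrete, here is a minimal sketch of a bag-of-word-vectors feature, assuming each tweet is represented by the average of the pre-trained vectors of its tokens. The helper names `load_glove` and `bag_of_word_vectors` are hypothetical, and the parser assumes a plain text embedding file with one `word v1 v2 ...` entry per line; it is not the loader used by BoWV.py.

```python
import io


def load_glove(handle, dimension):
    """Parse 'word v1 ... vd' lines into a dict, skipping lines of the wrong width."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dimension + 1:
            continue  # header line or malformed entry
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def bag_of_word_vectors(tokens, vectors, dimension):
    """Average the vectors of the known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dimension
    return [sum(column) / len(known) for column in zip(*known)]


# Tiny in-memory stand-in for an embedding file with dimension 3.
# The first line mimics a word2vec-style header and is skipped.
sample = io.StringIO("2 3\nhello 1.0 2.0 3.0\nworld 3.0 4.0 5.0\n")
vectors = load_glove(sample, dimension=3)
feature = bag_of_word_vectors(["hello", "world", "unseen"], vectors, 3)
```

The dimension check is the reason `-d` must match the embedding file: a mismatch would silently skip every line and leave all tweets with zero vectors.
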
  • TF-IDF based models - tfidf.py
usage: tfidf.py [-h] -m
                {tfidf_svm,tfidf_svm_linear,tfidf_logistic,tfidf_gradient_boosting,tfidf_random_forest}
                --max_ngram MAX_NGRAM --tokenizer {glove,nltk} [-s SEED]
                [--folds FOLDS] [--estimators ESTIMATORS] [--loss LOSS]
                [--kernel KERNEL] [--class_weight CLASS_WEIGHT]
                [--use-inverse-doc-freq]

TF-IDF model for twitter Hate speech detection

optional arguments:
  -h, --help            show this help message and exit
  -m {tfidf_svm,tfidf_svm_linear,tfidf_logistic,tfidf_gradient_boosting,tfidf_random_forest}, --model {tfidf_svm,tfidf_svm_linear,tfidf_logistic,tfidf_gradient_boosting,tfidf_random_forest}
  --max_ngram MAX_NGRAM
  --tokenizer {glove,nltk}
  -s SEED, --seed SEED
  --folds FOLDS
  --estimators ESTIMATORS
  --loss LOSS
  --kernel KERNEL
  --class_weight CLASS_WEIGHT
  --use-inverse-doc-freq
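
As a rough illustration of what a `tfidf_svm_linear` run involves, the sketch below builds a TF-IDF representation over word n-grams and fits a linear SVM with scikit-learn. It assumes that `--max_ngram 3` corresponds to `ngram_range=(1, 3)`, that `--use-inverse-doc-freq` corresponds to `use_idf=True`, and that `--loss` is passed to the SVM; those mappings, the toy texts, and the labels are illustrative assumptions rather than the script's actual internals.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus with two placeholder classes.
texts = [
    "the weather is lovely today",
    "lovely sunny weather again",
    "the match was lost in the last minute",
    "they lost the match again",
]
labels = ["class_a", "class_a", "class_b", "class_b"]

model = make_pipeline(
    # Unigrams up to trigrams, weighted by inverse document frequency.
    TfidfVectorizer(ngram_range=(1, 3), use_idf=True),
    # Linear SVM trained with the squared hinge loss.
    LinearSVC(loss="squared_hinge", random_state=42),
)
model.fit(texts, labels)
predicted = model.predict(texts)
```

In a real run the toy lists would be replaced by the tokenized tweets and their annotations, with evaluation done over cross-validation folds rather than on the training data.
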
  • LSTM(RNN) based methods - lstm.py
usage: lstm.py [-h] -f EMBEDDINGFILE -d DIMENSION --tokenizer {glove,nltk}
               --loss LOSS --optimizer OPTIMIZER --epochs EPOCHS --batch-size
               BATCH_SIZE [-s SEED] [--folds FOLDS] [--kernel KERNEL]
               [--class_weight CLASS_WEIGHT] --initialize-weights
               {random,glove} [--learn-embeddings] [--scale-loss-function]

LSTM based models for twitter Hate speech detection

optional arguments:
  -h, --help            show this help message and exit
  -f EMBEDDINGFILE, --embeddingfile EMBEDDINGFILE
  -d DIMENSION, --dimension DIMENSION
  --tokenizer {glove,nltk}
  --loss LOSS
  --optimizer OPTIMIZER
  --epochs EPOCHS
  --batch-size BATCH_SIZE
  -s SEED, --seed SEED
  --folds FOLDS
  --kernel KERNEL
  --class_weight CLASS_WEIGHT
  --initialize-weights {random,glove}
  --learn-embeddings
  --scale-loss-function
  • CNN based models - cnn.py
usage: cnn.py [-h] -f EMBEDDINGFILE -d DIMENSION --tokenizer {glove,nltk}
              --loss LOSS --optimizer OPTIMIZER --epochs EPOCHS --batch-size
              BATCH_SIZE [-s SEED] [--folds FOLDS]
              [--class_weight CLASS_WEIGHT] --initialize-weights
              {random,glove} [--learn-embeddings] [--scale-loss-function]

CNN based models for twitter Hate speech detection

optional arguments:
  -h, --help            show this help message and exit
  -f EMBEDDINGFILE, --embeddingfile EMBEDDINGFILE
  -d DIMENSION, --dimension DIMENSION
  --tokenizer {glove,nltk}
  --loss LOSS
  --optimizer OPTIMIZER
  --epochs EPOCHS
  --batch-size BATCH_SIZE
  -s SEED, --seed SEED
  --folds FOLDS
  --class_weight CLASS_WEIGHT
  --initialize-weights {random,glove}
  --learn-embeddings
  --scale-loss-function

Examples:

python BoWV.py --model logistic --seed 42 -f glove.twitter.27b.25d.txt -d 25 --folds 10 --tokenizer glove
python tfidf.py -m tfidf_svm_linear --max_ngram 3 --tokenizer glove --loss squared_hinge
python lstm.py -f ~/DATASETS/glove-twitter/GENSIM.glove.twitter.27B.25d.txt -d 25 --tokenizer glove --loss categorical_crossentropy --optimizer adam --initialize-weights random --learn-embeddings --epochs 10 --batch-size 512
python cnn.py -f ~/DATASETS/glove-twitter/GENSIM.glove.twitter.27B.25d.txt -d 25 --tokenizer nltk --loss categorical_crossentropy --optimizer adam --epochs 10 --batch-size 128 --initialize-weights random --scale-loss-function
