  • Stars
    106
  • Rank 325,871 (Top 7 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created almost 5 years ago
  • Updated over 1 year ago

Repository Details

DeEpLearning models for MultIlingual haTespeech (DELIMIT): Benchmarking multilingual models across 9 languages and 16 datasets.

Deep Learning Models for Multilingual Hate Speech Detection

🇵🇹 🇸🇦 🇵🇱 🇮🇩 🇮🇹 Solving the problem of hate speech detection in 9 languages across 16 datasets. 🇫🇷 🇺🇸 🇪🇸 🇩🇪

New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.

Demo

Please look here to check model loading and inference.
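
As a rough illustration of what model loading and inference might look like with the Hugging Face Transformers library, here is a minimal sketch. It is not the authors' demo notebook. The `model_id` argument is a placeholder that the caller must supply (the released model names are not listed in this README), and `torch` and `transformers` are assumed to be installed. Label names are read from the model's own configuration rather than assumed.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(texts, model_id):
    """Classify a list of texts with a sequence-classification checkpoint.

    model_id is a placeholder: pass the name or local path of the
    checkpoint you want to use.
    """
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits

    results = []
    for row in logits.tolist():
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        # id2label comes from the checkpoint's config, not from this sketch.
        results.append((model.config.id2label[best], probs[best]))
    return results
```

Calling `predict(["some text"], model_id)` returns one `(label, probability)` pair per input text.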

Please cite our paper in any published work that uses any of these resources.

@inproceedings{aluru2021deep,
  title={A Deep Dive into Multilingual Hate Speech Classification},
  author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
  booktitle={Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14--18, 2020, Proceedings, Part V},
  pages={423--439},
  year={2021},
  organization={Springer International Publishing}
}

Folder Description 👈


./Dataset             --> Contains the dataset-related files.
./BERT_Classifier     --> Contains the code for the BERT classifiers performing binary classification on the dataset
./CNN_GRU             --> Contains the code for the CNN-GRU model
./LASER+LR            --> Contains the code for the logistic regression classifier used on top of LASER embeddings

Requirements

Make sure to use Python3 when running the scripts. The package requirements can be obtained by running pip install -r requirements.txt.


Dataset

Check out the Dataset folder to learn more about how we curated the dataset for different languages. ⚠️ A few datasets have to be crawled, so we cannot guarantee the retrieval of all the data points, as tweets may get deleted. ⚠️
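
Because some tweets may no longer be retrievable, it can help to record how much of a dataset was actually recovered after crawling. The helper below is a small sketch of that bookkeeping. The function name and the use of string IDs are illustrative assumptions and are not part of this repository.

```python
from typing import Iterable, List, Tuple


def retrieval_report(
    requested_ids: Iterable[str], retrieved_ids: Iterable[str]
) -> Tuple[float, List[str]]:
    """Return (coverage, missing_ids) for a crawl.

    coverage is the fraction of distinct requested IDs that were retrieved.
    """
    requested = list(dict.fromkeys(requested_ids))  # de-duplicate, keep order
    retrieved = set(retrieved_ids)
    missing = [tweet_id for tweet_id in requested if tweet_id not in retrieved]
    if not requested:
        return 1.0, []
    coverage = (len(requested) - len(missing)) / len(requested)
    return coverage, missing
```

Reporting this coverage alongside any results makes it clear when a reproduction ran on fewer data points than the original dataset.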


Models used for this task

We release the code for training/finetuning the following models along with their hyperparameters.

🥇 best for high resource language , 🏅 best for low resource language

✈️ fastest to train , 🛩️ slowest to train

  1. mBERT Baseline: This setting uses the multilingual BERT model, with training and testing on the dataset of the same language. Refer to the BERT_Classifier folder for the code and usage instructions.

  2. mBERT All_but_one 🥇 🛩️: This setting uses the multilingual BERT model with training data from multiple languages, and validation and test data from a single target language. Refer to the BERT_Classifier folder for the code and usage instructions.

  3. Translation + BERT Baseline: This setting translates the datasets of the other languages to English and finetunes the bert-base model on these translated datasets. Refer to the BERT_Classifier folder for the code and usage instructions.

  4. CNN+GRU Baseline: This setting uses MUSE word embeddings along with a CNN-GRU based model, with training and testing on the same language. Refer to the CNN_GRU folder for the code and usage instructions.

  5. LASER+LR baseline ✈️: This setting trains a logistic regression model on the LASER embeddings of the dataset. The training and testing datasets are from the same language. Refer to the LASER+LR folder for the code and usage instructions.

  6. LASER+LR all_but_one 🏅: This setting trains a logistic regression model on the LASER embeddings of the dataset. The datasets from other languages are also used to train the LR model. Refer to the LASER+LR folder for the code and usage instructions.
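
The all_but_one settings above differ from the baselines only in how the training pool is assembled. The sketch below shows one way to build such a split. It is an illustration under stated assumptions, not the repository's code: the `all_but_one_split` name, the nested dictionary layout, and the `include_target_train` switch (whether the target language's own training data joins the pool) are all hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, int]          # (text, label), with 1 = hate and 0 = non-hate
Splits = Dict[str, List[Example]]  # keys: "train", "val", "test"


def all_but_one_split(
    data: Dict[str, Splits], target: str, include_target_train: bool = True
) -> Splits:
    """Pool training data across languages; validate and test on the target only."""
    if target not in data:
        raise KeyError(f"unknown target language: {target}")

    train: List[Example] = []
    for lang in sorted(data):  # sorted for a deterministic ordering
        if lang != target:
            train.extend(data[lang]["train"])
    if include_target_train:
        # Assumption: the target language's training examples are also used.
        train.extend(data[target]["train"])

    return {
        "train": train,
        "val": list(data[target]["val"]),
        "test": list(data[target]["test"]),
    }
```

The same split can feed either classifier: the pooled `train` list goes to mBERT finetuning or, after embedding, to the logistic regression model, while `val` and `test` stay in the target language.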

Blogs and GitHub repos which we used for reference 👼

  1. MUSE embeddings were downloaded and extracted using the code from the MUSE GitHub repository
  2. For finetuning BERT, we used this blog by Chris McCormick and also referred to the Transformers GitHub repo
  3. For the CNN-GRU model, we used the original repo for reference
  4. For generating the LASER embeddings of the dataset, we used the code from the LASER GitHub repository

For more details about our paper

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. 2020. "Deep Learning Models for Multilingual Hate Speech Detection". ECML-PKDD

Todos

  • Upload our models to transformers community to make them public
  • Add arxiv paper link and description
  • Create an interface for social scientists where they can use our models easily with their data
  • Create a pull request to add the models to official transformers repo

👍 The repo is still in active development. Feel free to create an issue! 👍

More Repositories

  1. HateXplain (Python, 186 stars): Can we use explanations to improve hate speech models? Our paper accepted at AAAI 2021 tries to explore that question.
  2. Hate-Speech-Reading-List (43 stars): This repository contains papers and resources pertaining to hate speech research.
  3. Tutorial-Resources (Python, 36 stars): Resources and tools for the tutorial "Hate speech detection, mitigation and beyond" presented at ICWSM 2021.
  4. Countering_Hate_Speech_ICWSM2019 (Jupyter Notebook, 30 stars): Repository for the paper "Thou shalt not hate: Countering Online Hate Speech" accepted at ICWSM 2019.
  5. Fear-speech-analysis (Jupyter Notebook, 24 stars): Can fear be used for polarisation and spreading negativity? Our paper accepted at The Web Conference 2021 tries to explore this question in light of public WhatsApp groups.
  6. HateALERT-EVALITA (Jupyter Notebook, 13 stars): Code for replicating the results of team 'hateminers' at EVALITA-2018 for the AMI task.
  7. HateMM (Python, 10 stars)
  8. CounterGEDI (Jupyter Notebook, 9 stars): CounterGeDi is a pipeline that aims at controlling the counter speech generated to make it emotional, polite and detoxified. Paper accepted at IJCAI 2022.
  9. HateALERT-HASOC (Jupyter Notebook, 8 stars): Code for replicating the results of "HateMonitors" at HASOC 2019.
  10. Hate-Alert-DravidianLangTech (Jupyter Notebook, 7 stars): Team hate-alert's winning submission to the Workshop on Speech and Language Technologies for Dravidian Languages, EACL-2021.
  11. IndicAbusive (Python, 7 stars)
  12. Hateful-users-detection (Python, 5 stars)
  13. HateCheckHIn (5 stars)
  14. UrduAbuseAndThreat (Jupyter Notebook, 3 stars): This repository contains the winning solution for the abusive and threatening language detection task in Urdu.
  15. HateBegetsHate_CSCW2020 (1 star)
  16. Counterspeech_Twitter (Jupyter Notebook, 1 star)
  17. Spread_Hate_Speech_WebSci19 (1 star): Repository for the paper "Spread of hate speech in online social media" accepted at WebSci 2019.
  18. Bengali_Hate (1 star)