Hate speech dataset from Stormfront forum manually labelled at sentence level.

Hate speech dataset from a white supremacist forum

Disclaimer: The number of files available in this repository may differ slightly from the numbers reported in the paper due to some last-minute changes and additions, but the overall content distribution and conclusions should remain unchanged.

These files contain text extracted from Stormfront, a white supremacist forum. A random set of forum posts has been sampled from several subforums and split into sentences. Those sentences have been manually labelled as containing hate speech or not, according to certain annotation guidelines.

More information about the dataset and the guidelines can be found in the following article [pdf]:

O. de Gibert, N. Perez, A. García-Pablos, M. Cuadros. Hate Speech Dataset from a White Supremacy Forum. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pp. 11-20, 2018.

If you use any of the provided material in your work, please cite us as follows:

@inproceedings{gibert2018hate,
    title = "{Hate Speech Dataset from a White Supremacy Forum}",
    author = "de Gibert, Ona  and
      Perez, Naiara  and
      Garc{\'\i}a-Pablos, Aitor  and
      Cuadros, Montse",
    booktitle = "Proceedings of the 2nd Workshop on Abusive Language Online ({ALW}2)",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W18-5102",
    doi = "10.18653/v1/W18-5102",
    pages = "11--20",
}

Repository structure

  • all_files: the folder that contains all the forum posts. Each file contains a sentence. The file name is formatted as commentID_sentenceNumber.txt, so the files that share the same number before the underscore pertain to the same comment.
  • sampled_train: a balanced set of files (for "hate" and "noHate" classes) sampled from all_files, used for experiments.
  • sampled_test: a balanced set of files (for "hate" and "noHate" classes) sampled from all_files, used for experiments.
  • annotations_metadata.csv: this file contains the actual label for each file in the previous folders; additionally, it reports how much additional context the annotator required to make a decision about each sentence, the user id, and the subforum id (ids are just numbers that do not further identify people).
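
The file-naming convention and the metadata file described above can be combined as in the following minimal Python sketch. It is not part of the repository. The helper names (parse_file_name, group_by_comment, read_labels) are hypothetical, and the CSV column names "file_id" and "label" are assumptions that should be checked against the header of annotations_metadata.csv, as is whether the identifier column includes the .txt extension.

```python
import csv
from collections import defaultdict
from pathlib import Path


def parse_file_name(name: str) -> tuple[str, int]:
    """Split 'commentID_sentenceNumber.txt' into (comment_id, sentence_number)."""
    stem = Path(name).stem  # drops a trailing '.txt' if present
    comment_id, _, sentence_number = stem.rpartition("_")
    return comment_id, int(sentence_number)


def group_by_comment(names):
    """Map each comment ID to its file names, ordered by sentence number."""
    groups = defaultdict(list)
    for name in names:
        comment_id, sentence_number = parse_file_name(name)
        groups[comment_id].append((sentence_number, name))
    return {cid: [n for _, n in sorted(items)] for cid, items in groups.items()}


def read_labels(lines, id_column="file_id", label_column="label"):
    """Build {file identifier: label} from the CSV lines.

    The default column names are assumptions, not taken from this README.
    """
    return {row[id_column]: row[label_column] for row in csv.DictReader(lines)}
```

For example, group_by_comment(["12834217_2.txt", "12834217_1.txt"]) returns both names under the key "12834217", ordered by sentence number. read_labels accepts any iterable of CSV lines, so an open file handle for annotations_metadata.csv can be passed directly.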

License

The resources in this repository are licensed under the Creative Commons Attribution-ShareAlike 3.0 Spain License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/es/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Contact

If you have any questions or suggestions, do not hesitate to contact us at the following addresses:

Learn more about us at http://www.speechandlanguagesolutions.com/.
