thu-coai/COLDataset

License: Apache License 2.0

The official repository of the paper: COLD: A Benchmark for Chinese Offensive Language Detection

COLDataset

Chinese Offensive Language Detection Dataset

Paper link: https://arxiv.org/abs/2201.06025

Detector: We have released the roberta-base-cold detector on Hugging Face.
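A minimal sketch of how the released detector might be called is shown here. It assumes the checkpoint is published under the model ID `thu-coai/roberta-base-cold`, that it loads through the `transformers` Auto classes, and that its output classes follow the dataset convention (0 = safe, 1 = offensive). None of these are confirmed in this README, so check the model card before relying on them. The helper names `logits_to_labels` and `classify` are hypothetical.

```python
from typing import List

# Assumed label convention, mirroring the dataset: 0 = safe, 1 = offensive.
ID2LABEL = {0: "safe", 1: "offensive"}


def logits_to_labels(logits: List[List[float]]) -> List[str]:
    """Pick the highest-scoring class for each row of logits."""
    return [
        ID2LABEL[max(range(len(row)), key=row.__getitem__)]
        for row in logits
    ]


def classify(texts: List[str],
             model_id: str = "thu-coai/roberta-base-cold") -> List[str]:
    """Classify texts with the detector (model ID is an assumption)."""
    # Imported lazily so logits_to_labels works without torch/transformers.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits_to_labels(logits.tolist())
```

Calling `classify(["你好"])` downloads the checkpoint on first use and returns one label string per input text.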

News

Our paper has been accepted by EMNLP 2022!

Info

COLDataset contains 37,480 comments with binary offensive labels and covers diverse topics of race, gender, and region. To gain further insights into the data types and characteristics, we annotate the test set at a fine-grained level with four categories: attacking individuals, attacking groups, anti-bias and other non-offensive.

The labels in train.csv and dev.csv:

0: safe
1: offensive

fine-grained-label in test.csv:

0: safe (other non-offensive)
1: attack individual
2: attack group
3: safe (anti-bias)
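The two label schemes can be related with a short sketch. It assumes that fine-grained labels 1 and 2 correspond to the binary offensive label and that 0 and 3 correspond to safe, which is what the names above suggest. The column name `fine-grained-label` comes from the description above; the `TEXT` column and the sample rows are made-up placeholders, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter

# Fine-grained label names as listed for test.csv.
FINE_GRAINED_NAMES = {
    0: "safe (other non-offensive)",
    1: "attack individual",
    2: "attack group",
    3: "safe (anti-bias)",
}


def to_binary(fine_label: int) -> int:
    """Collapse a fine-grained label to binary (assumed: 1, 2 -> 1; 0, 3 -> 0)."""
    if fine_label not in FINE_GRAINED_NAMES:
        raise ValueError(f"unknown fine-grained label: {fine_label}")
    return 1 if fine_label in (1, 2) else 0


def count_column(csv_file, column: str) -> Counter:
    """Count the integer values found in one column of a CSV file object."""
    return Counter(int(row[column]) for row in csv.DictReader(csv_file))


# Tiny in-memory stand-in for test.csv; the rows are placeholders, not real data.
sample = io.StringIO(
    "TEXT,fine-grained-label\n"
    "placeholder a,0\n"
    "placeholder b,2\n"
    "placeholder c,3\n"
    "placeholder d,1\n"
)

fine_counts = count_column(sample, "fine-grained-label")
binary_counts = Counter()
for fine_label, n in fine_counts.items():
    binary_counts[to_binary(fine_label)] += n

for fine_label in sorted(fine_counts):
    print(FINE_GRAINED_NAMES[fine_label], fine_counts[fine_label])
print("safe:", binary_counts[0], "offensive:", binary_counts[1])
```

To run the same count on the real file, pass an open file handle instead of the in-memory sample, for example `open("test.csv", newline="", encoding="utf-8")`, after confirming the actual column names in the CSV header.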

Citing

Please cite our paper if you find the paper or the dataset helpful.

@inproceedings{deng2022cold,
  title = "COLD: A Benchmark for Chinese Offensive Language Detection",
  author = "Deng, Jiawen and Zhou, Jingyan and Sun, Hao and Mi, Fei and Huang, Minlie",
  booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
  month = dec,
  year = "2022",
  address = "Abu Dhabi, United Arab Emirates",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.emnlp-main.796",
  pages = "11580--11599"
}

CDial-GPT

A Large-scale Chinese Short-Text Conversation Dataset and Chinese pre-training dialog models

Safety-Prompts

Chinese safety prompts for evaluating and improving the safety of LLMs.

CrossWOZ

A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset

KdConv

KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation

ConvLab-2

ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems

CharacterGLM-6B

CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models

EVA

EVA: Large-scale Pre-trained Chit-Chat Models

BPO

Emotional-Support-Conversation

Data and codes for ACL 2021 paper: Towards Emotional Support Dialog Systems

ccm

This project is a TensorFlow implementation of our work, CCM (Commonsense Conversational Model).

ecm

This project is a TensorFlow implementation of our work, ECM (Emotional Chatting Machine).

NLG_book

Introduction to the book 《现代自然语言生成》 (Modern Natural Language Generation)

PaperForONLG

Paper list for open-ended language generation

PsyQA

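A Chinese question-answering dataset for mental health support, with rich annotations of support strategies. It can be used to generate long counseling texts that make use of those strategies.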

SafetyBench

Official github repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety.

ShieldLM

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

cotk

Conversational Toolkit. An Open-Source Toolkit for Fast Development and Fair Evaluation of Text Generation

DA-Transformer

Official Implementation for the ICML2022 paper "Directed Acyclic Transformer for Non-Autoregressive Machine Translation"

PPT

Official Code for "PPT: Pre-trained Prompt Tuning for Few-shot Learning". ACL 2022

CommonsenseStoryGen

Implementation for paper "A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation"

PICL

Code for ACL2023 paper: Pre-Training to Learn in Context

CritiqueLLM

tatk

Task-oriented dialog system toolkits

SentiLARE

Codes for our paper "SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge" (EMNLP 2020)

THUOOP

Course materials and Q&A for the Object-Oriented Programming course at Tsinghua University

OPD

OPD: Chinese Open-Domain Pre-trained Dialogue Model

LOT-LongLM

JointGT

Codes for our paper "JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs" (ACL 2021 Findings)

UNION

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

OpenMEVA

Benchmark for evaluating open-ended generation

HINT

CTRLEval

Codes for our paper "CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation" (ACL 2022)

CPT4DST

Official code for "Continual Prompt Tuning for Dialog State Tracking" (ACL 2022).

seq2seq-pytorch-bert

DiaSafety

This repo is for the paper: On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark

Targeted-Data-Extraction

Official Code for ACL 2023 paper: "Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation"

TaiLr

ICLR2023 - Tailoring Language Generation Models under Total Variation Distance

SafeUnlearning

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

LAUG

Language Understanding Augmentation Toolkit for Robustness Testing

MoralStory

ConPer

Official Code for NAACL 2022 paper: "Persona-Guided Planning for Controlling the Protagonist's Persona in Story Generation"

AugESC

Official repository for the Findings of ACL 2023 paper "AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation"

NAST

Codes for "NAST: A Non-Autoregressive Generator with Word Alignment for Unsupervised Text Style Transfer" (ACL 2021 findings)

CDConv

Data and codes for EMNLP 2022 paper "CDConv: A Benchmark for Contradiction Detection in Chinese Conversations"

JailbreakDefense_GoalPriority

[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

AutoCAD

Official Code for EMNLP 2022 findings paper: "AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning"

Implicit-Toxicity

Official Code for EMNLP 2023 paper: "Unveiling the Implicit Toxicity in Large Language Models"

grounded-minimal-edit

Code for EMNLP 2021 paper "Transferable Persona-Grounded Dialogues via Grounded Minimal Edits"

hred-tensorflow

EssayCommentGen

UDIT

Official Code for EMNLP2022 Paper: "Learning Instructions with Unlabeled Data for Zero-Shot Cross-Task Generalization"

Reverse_Generation

earl

This project is a TensorFlow implementation of our work, EARL.

MoralDial

The official implementation of the paper: MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions

seqGAN-tensorflow

LaMemo

NAACL2022 - LaMemo: Language Modeling with Look-Ahead Memory

Re3Dial

Official Code for EMNLP 2023 paper: "Re3Dial: Retrieve, Reorganize and Rescale Conversations for Long-Turn Open-Domain Dialogue Pre-training"

ERIC

Code for the AAAI 2023 paper "Generating Coherent Narratives by Learning Dynamic and Discrete Entity States with a Contrastive Framework"

DAG-Search

The beam search algorithm for DA-Transformer

cotk_docs

Documentation for the cotk package. Refer to: https://github.com/thu-coai/cotk

lightseq-nat

A Modified Version of LightSeq for Non-Autoregressive Transformer

seq2seq-pytorch

SelfCont

Code for the paper "Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation"

CodePlan

transformerLM-pytorch

cotk_dashboard

Dashboard for cotk

GPT2LM-pytorch

ConvLab-2_docs

CVAE-tensorflow

GRULM-pytorch

LM-tensorflow

cotk-test-CVAE

tatk_docs

Documentation for the TaTK platform.

seq2seq-tensorflow

VAE-tensorflow

ComplexBench

cotk_data

SST-pytorch