• Stars
    star
    233
  • Rank 172,230 (Top 4 %)
  • Language
    Python
  • License
    MIT License
  • Created over 1 year ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Official datasets and pytorch implementation repository of SQuARe and KoSBi (ACL 2023)

Korean Safety Benchmarks

Overview

This repository provides the codes and datasets of the following two papers:

  1. SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created through Human-Machine Collaboration
  2. KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Applications

SQuARe

Dataset

SQuARe Dataset

Our SQuARe dataset can be found in data/SQuARe/. Please refer to SQuARe paper for the detail of the dataset.

We also release the dataset with the raw annotations in data/SQuARe/with_raw_annotations. Since questions and responses in our dataset are inherently subjective, we believe the raw annotations would help further research the disagreement between annotators.

Data Generation Pipeline

SQuARe Overview

Note: Though we've made our dataset include English-translated, cautions are needed when directly using it since the sensitive topics we used reflect the idiosyncrasies of Korean society. We recommend that researchers build their own dataset.

The pipeline for dataset generation can be found in pipeline/square.


KoSBi

Dataset

SQuARe Dataset

Our KoSBi dataset can be found in data/KosBi/. Please refer to KoSBi paper for the detail of the dataset.

Update: We collected more data by running an additional iteration. You can find them in the files named data/KoSBi/kosbi_v2_{train,valid,test}.json, which include the original KoSBi datasets. The total number of (context, sentence) pairs has increased to almost 68k, with 34.2k safe sentences and 33.8k unsafe sentences.

Data Generation Pipeline

SQuARe Overview

Similar to SQuARe, the pipeline for dataset generation can be found in pipeline/kosbi.


License

Korean-Safety-Benchmarks

MIT License 

Copyright 2023-present NAVER Cloud Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

How to cite

@inproceedings{lee2023square,
                title={SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created Through Human-Machine Collaboration}, 
                author={Hwaran Lee and Seokhee Hong and Joonsuk Park and Takyoung Kim and Meeyoung Cha and Yejin Choi and Byoung Pil Kim and Gunhee Kim and Eun-Ju Lee and Yong Lim and Alice Oh and Sangchul Park and Jung-Woo Ha},
                booktitle={Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics},
                year={2023}
}
@inproceedings{lee2023kosbi,
                title={KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large Language Model Application}, 
                author={Hwaran Lee and Seokhee Hong and Joonsuk Park and Takyoung Kim and Gunhee Kim and Jung-Woo Ha},
                booktitle={Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics: Industry Track},
                year={2023}
}

Contact

If you have any questions about our dataset or codes, feel free to ask us: Seokhee Hong ([email protected]) or Hwaran Lee ([email protected])

More Repositories

1

DenseDiffusion

Official Pytorch Implementation of DenseDiffusion (ICCV 2023)
Jupyter Notebook
466
star
2

StyleMapGAN

Official pytorch implementation of StyleMapGAN (CVPR 2021)
Python
458
star
3

Visual-Style-Prompting

Official Pytorch implementation of "Visual Style Prompting with Swapping Self-Attention"
Python
415
star
4

relabel_imagenet

Python
395
star
5

vidt

Python
305
star
6

pit

Python
240
star
7

BlendNeRF

Official pytorch implementation of BlendNeRF (ICCV 2023)
Python
149
star
8

c3-gan

Official Pytorch implementation of C3-GAN (Spotlight at ICLR 2022)
Python
125
star
9

rope-vit

[ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer"
Python
124
star
10

pcme

Official Pytorch implementation of "Probabilistic Cross-Modal Embedding" (CVPR 2021)
Python
121
star
11

GGDR

Official Pytorch implementation of GGDR (ECCV 2022)
Python
102
star
12

cl-vs-mim

(ICLR 2023) Official PyTorch implementation of "What Do Self-Supervised Vision Transformers Learn?"
Jupyter Notebook
97
star
13

calm

Python
91
star
14

PfLayer

Learning Features with Parameter-Free Layers, ICLR 2022
Python
85
star
15

rdnet

[ECCV2024] Official implementation of paper, "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs".
Python
84
star
16

w-ood

Python
80
star
17

model-stock

Model Stock: All we need is just a few fine-tuned models
72
star
18

egtr

[CVPR 2024 Best paper award candidate] EGTR: Extracting Graph from Transformer for Scene Graph Generation
Python
65
star
19

hypermix

Code for text augmentation method leveraging large-scale language models
Python
60
star
20

carecall-corpus

CareCall for Seniors: Role Specified Open-Domain Dialogue dataset generated by leveraging LLMs (NAACL 2022).
59
star
21

eccv-caption

Extended COCO Validation (ECCV) Caption dataset (ECCV 2022)
Python
52
star
22

i-Blurry

Official Pytorch implementation of Online Continual Learning on Class Incremental Blurry Task Configuration with Anytime Inference (ICLR 2022)
Python
52
star
23

seit

[ECCV2024][ICCV2023] Official PyTorch implementation of SeiT++ and SeiT
Python
51
star
24

FSMR

Official Tensorflow implementation of "Feature Statistics Mixing Regularization for Generative Adversarial Networks" (CVPR 2022)
Python
49
star
25

pcmepp

Official Pytorch implementation of "Improved Probabilistic Image-Text Representations" (ICLR 2024)
Python
48
star
26

cmo

Python
45
star
27

facetts

Python
44
star
28

cream

Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models, EMNLP 2023
Python
42
star
29

dap-cl

Official code of "Generating Instance-level Prompts for Rehearsal-free Continual Learning (ICCV 2023)"
Python
39
star
30

NeglectedFreeLunch

Jupyter Notebook
36
star
31

neuralwoz

NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation (ACL-IJCNLP 2021)
Python
36
star
32

dual-teacher

Official code for the NeurIPS 2023 paper "Switching Temporary Teachers for Semi-Supervised Semantic Segmentation"
Python
35
star
33

augsub

Official PyTorch implementation of MaskSub "Masking Augmentation for Supervised Learning"
Python
32
star
34

chacha-chatbot

Python
31
star
35

tablevqabench

Jupyter Notebook
30
star
36

carecall-memory

Keep Me Updated! Memory Management in Long-term Conversations (Findings of EMNLP 2022)
28
star
37

mid.metric

Python
27
star
38

MetricMT

The official code repository for MetricMT - a reward optimization method for NMT with learned metrics
25
star
39

scob

Official Implementation of SCOB [ICCV 2023]
Python
22
star
40

ALMoST

Python
22
star
41

coco-annotation-tool

TypeScript
21
star
42

hmix-gmix

Jupyter Notebook
21
star
43

imagenet-annotation-tool

TypeScript
17
star
44

informer

17
star
45

cs-shortcut

Saving Dense Retriever from Shortcut Dependency in Conversational Search (EMNLP 2022)
Python
16
star
46

talebrush

The official source code for TaleBrush (CHI 2022)
Python
14
star
47

cgl_fairness

Python
14
star
48

KoBBQ

Official code and dataset repository of KoBBQ (TACL 2024)
Python
14
star
49

trace

TRACE: Table Reconstruction Aligned to Corner and Edges (ICDAR 2023)
Python
12
star
50

simseek

Generating Information-Seeking Conversations from Unlabeled Documents (EMNLP 2022).
Python
11
star
51

tc-clip

[ECCV 2024] Official PyTorch implementation of TC-CLIP "Leveraging Temporal Contextualization for Video Action Recognition"
Python
10
star
52

burn

Official Pytorch Implementation of Unsupervised Representation Learning for Binary Networks by Joint Classifier Training (CVPR 2022)
Python
10
star
53

tokenadapt

Python
8
star
54

llm-chatbot

The LLM chatbot demo website
HTML
7
star
55

lut

[ECCV 2024] Official PyTorch implementation of LUT "Learning with Unmasked Tokens Drives Stronger Vision Learners"
5
star
56

elva

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning
5
star
57

rewas

5
star
58

densediffusion

5
star
59

rite

Python
5
star
60

demystifying-ntk

Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? (CVPR 2022)
Python
2
star
61

carte

CARTE: Cell Adjacency Relation for Table Evaluation
Python
2
star
62

chacha

TypeScript
1
star