
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Overview

This repo contains the source code of DecodingTrust, a research effort designed to help researchers better understand the capabilities, limitations, and potential risks of deploying state-of-the-art Large Language Models (LLMs). See our paper for details.

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li.

This project is organized around the following eight primary areas of trustworthiness:

  1. Toxicity
  2. Stereotype and bias
  3. Adversarial robustness
  4. Out-of-Distribution Robustness
  5. Privacy
  6. Robustness to Adversarial Demonstrations
  7. Machine Ethics
  8. Fairness

Project Structure

This project is structured around subdirectories dedicated to each area of trustworthiness. Each subdir includes scripts, data, and a dedicated README for easy comprehension.

.
├── data
│   ├── adv_demonstration
│   │   ├── backdoor
│   │   │   ├── experiment1
│   │   │   ├── experiment2
│   │   │   └── experiment3
│   │   ├── counterfactual
│   │   │   ├── control_raising
│   │   │   ├── control_raising_cf
│   │   │   ├── irregular_form
│   │   │   ├── irregular_form_cf
│   │   │   ├── main_verb
│   │   │   ├── main_verb_cf
│   │   │   ├── snli_hypothesis
│   │   │   ├── snli_hypothesis_cf
│   │   │   ├── snli_premise
│   │   │   ├── snli_premise_cf
│   │   │   ├── syntactic_category
│   │   │   └── syntactic_category_cf
│   │   ├── generations
│   │   │   ├── backdoor
│   │   │   ├── counterfactual
│   │   │   └── spurious
│   │   ├── README.md
│   │   └── spurious
│   │       ├── adverb
│   │       ├── embedded_under_verb
│   │       ├── l_relative_clause
│   │       ├── passive
│   │       ├── PP
│   │       └── s_relative_clause
│   ├── adv-glue-plus-plus
│   │   ├── data
│   │   │   ├── advglue_plus_plus.json
│   │   │   ├── alpaca.json
│   │   │   ├── stable-vicuna.json
│   │   │   └── vicuna.json
│   │   └── README.md
│   ├── fairness
│   │   ├── data
│   │   │   └── adult
│   │   ├── data_generation.py
│   │   ├── datasets
│   │   │   ├── abstract_dataset.py
│   │   │   ├── adult.py
│   │   │   └── __init__.py
│   │   ├── fairness_data
│   │   │   ├── adult_0_200_test_base_rate_0.0.jsonl
│   │   │   ├── adult_0_200_test_base_rate_0.5.jsonl
│   │   │   ├── adult_0_200_test_base_rate_1.0.jsonl
│   │   │   ├── adult_0_200_train_br_0.0_test_br_0.0.jsonl
│   │   │   ├── adult_0_200_train_br_0.0_test_br_0.5.jsonl
│   │   │   ├── adult_0_200_train_br_0.0_test_br_1.0.jsonl
│   │   │   ├── adult_0_8.jsonl
│   │   │   ├── adult_16_200_train_br_0.0_test_br_0.5.jsonl
│   │   │   ├── adult_32_200_train_base_rate_0.0.jsonl
│   │   │   ├── adult_32_200_train_base_rate_0.5.jsonl
│   │   │   ├── adult_32_200_train_base_rate_1.0.jsonl
│   │   │   ├── adult_32_200_train_br_0.0_test_br_0.0.jsonl
│   │   │   ├── adult_32_200_train_br_0.0_test_br_0.5.jsonl
│   │   │   ├── adult_32_200_train_br_0.5_test_br_0.0.jsonl
│   │   │   ├── adult_32_200_train_br_1.0_test_br_0.0.jsonl
│   │   │   ├── gt_labels_adult_0_200_test_base_rate_0.0.npy
│   │   │   ├── gt_labels_adult_0_200_test_base_rate_0.5.npy
│   │   │   ├── gt_labels_adult_0_200_test_base_rate_1.0.npy
│   │   │   ├── gt_labels_adult_0_8.npy
│   │   │   ├── gt_labels_adult_16_200_train_br_0.0_test_br_0.5.npy
│   │   │   ├── gt_labels_adult_32_200_train_base_rate_0.0.npy
│   │   │   ├── gt_labels_adult_32_200_train_base_rate_0.5.npy
│   │   │   ├── gt_labels_adult_32_200_train_base_rate_1.0.npy
│   │   │   ├── gt_labels_adult_32_200_train_br_0.0_test_br_0.5.npy
│   │   │   ├── sensitive_attr_adult_0_200_test_base_rate_0.0.npy
│   │   │   ├── sensitive_attr_adult_0_200_test_base_rate_0.5.npy
│   │   │   ├── sensitive_attr_adult_0_200_test_base_rate_1.0.npy
│   │   │   ├── sensitive_attr_adult_0_8.npy
│   │   │   ├── sensitive_attr_adult_16_200_train_br_0.0_test_br_0.5.npy
│   │   │   ├── sensitive_attr_adult_32_200_train_base_rate_0.0.npy
│   │   │   ├── sensitive_attr_adult_32_200_train_base_rate_0.5.npy
│   │   │   ├── sensitive_attr_adult_32_200_train_base_rate_1.0..npy
│   │   │   └── sensitive_attr_adult_32_200_train_br_0.0_test_br_0.5.npy
│   │   ├── generations
│   │   │   ├── gpt-3.5-turbo-0301_adult_0_200_8.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_0_200.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_0.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_15.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_16_200.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_30.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_32_200.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_32_60.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_32.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_34.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_4_200.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult_8_200.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_adult.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_compas_0_60.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_compas_4_8.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_german_0_60.jsonl
│   │   │   ├── gpt-3.5-turbo-0301_german_0_80.jsonl
│   │   │   ├── gpt-4-0314_adult_0_200_8.jsonl
│   │   │   ├── gpt-4-0314_adult_0_200.jsonl
│   │   │   ├── gpt-4-0314_adult_0_40_8.jsonl
│   │   │   ├── gpt-4-0314_adult_0.jsonl
│   │   │   ├── gpt-4-0314_adult_15.jsonl
│   │   │   ├── gpt-4-0314_adult_16_200.jsonl
│   │   │   ├── gpt-4-0314_adult_30.jsonl
│   │   │   ├── gpt-4-0314_adult_32_200.jsonl
│   │   │   ├── gpt-4-0314_adult_32_60.jsonl
│   │   │   ├── gpt-4-0314_adult_32.jsonl
│   │   │   ├── gpt-4-0314_adult_34.jsonl
│   │   │   ├── gpt-4-0314_adult_4_200.jsonl
│   │   │   ├── gpt-4-0314_adult_8_200.jsonl
│   │   │   ├── gpt-4-0314_adult.jsonl
│   │   │   ├── gpt-4-0314_compas_0_60.jsonl
│   │   │   ├── gpt-4-0314_german_0_60.jsonl
│   │   │   ├── gpt-4-0314_german_0_80.jsonl
│   │   │   ├── gpt-4-0314_german_16_60.jsonl
│   │   │   └── gpt-4-0314_german_32_60.jsonl
│   │   └── README.md
│   ├── machine_ethics
│   │   ├── ethics
│   │   │   ├── commonsense
│   │   │   ├── deontology
│   │   │   ├── justice
│   │   │   ├── README.txt
│   │   │   ├── utilitarianism
│   │   │   └── virtue
│   │   ├── jiminy_subset.csv
│   │   ├── jiminy_test.json
│   │   ├── jiminy_train.json
│   │   └── README.md
│   ├── ood
│   │   ├── generations
│   │   │   ├── knowledge
│   │   │   └── styles
│   │   ├── knowledge
│   │   │   └── knowledge.json
│   │   ├── knowledge.json
│   │   ├── README.md
│   │   ├── style.json
│   │   └── styles
│   │       └── style.json
│   ├── privacy
│   │   ├── enron_data
│   │   │   ├── context.pkl
│   │   │   ├── email2name.pkl
│   │   │   ├── email_freq.pkl
│   │   │   ├── five_shot_non_domain.pkl
│   │   │   ├── five_shot.pkl
│   │   │   ├── name2email.pkl
│   │   │   ├── one_shot_non_domain.pkl
│   │   │   ├── one_shot.pkl
│   │   │   ├── two_shot_non_domain.pkl
│   │   │   └── two_shot.pkl
│   │   ├── generations
│   │   │   ├── enron
│   │   │   ├── pii
│   │   │   └── understanding
│   │   └── README.md
│   ├── stereotype
│   │   ├── dataset
│   │   │   ├── stereotype_bias_data.jsonl
│   │   │   ├── system_prompts.csv
│   │   │   └── user_prompts.csv
│   │   ├── generations
│   │   │   ├── benign
│   │   │   ├── chavinlo
│   │   │   ├── lmsys
│   │   │   ├── meta-llama
│   │   │   ├── mosaicml
│   │   │   ├── openai
│   │   │   ├── README.md
│   │   │   ├── targeted
│   │   │   ├── tiiuae
│   │   │   ├── together
│   │   │   └── untargeted
│   │   └── README.md
│   └── toxicity
│       ├── generations
│       │   ├── realtoxicityprompts
│       │   ├── system_prompts
│       │   └── user_prompts
│       ├── README.md
│       ├── toxicity_prompts.py
│       └── user_prompts
│           ├── nontoxic.jsonl
│           ├── toxic.jsonl
│           ├── toxic.jsonl.gpt3.n=25.out1.split.annotated.jsonl.all.jsonl
│           ├── toxic.jsonl.gpt3.n=25.out1.split.annotated.jsonl.filtered.jsonl
│           ├── toxic.jsonl.gpt4.n=25.out1.split.annotated.jsonl.all.jsonl
│           └── toxic.jsonl.gpt4.n=25.out1.split.annotated.jsonl.filtered.jsonl
├── Dockerfile
├── dry_run.sh
├── LICENSE
├── pyproject.toml
├── README.md
├── requirements.txt
├── setup.cfg
├── src
│   └── dt
│       ├── chat.py
│       ├── configs
│       │   ├── adv_demonstration
│       │   ├── advglue
│       │   ├── configs.py
│       │   ├── config.yaml
│       │   ├── fairness
│       │   ├── __init__.py
│       │   ├── machine_ethics
│       │   ├── ood
│       │   ├── privacy
│       │   ├── stereotype
│       │   └── toxicity
│       ├── conversation.py
│       ├── __init__.py
│       ├── main.py
│       ├── perspectives
│       │   ├── adv_demonstration
│       │   ├── advglue
│       │   ├── fairness
│       │   ├── __init__.py
│       │   ├── machine_ethics
│       │   ├── ood
│       │   ├── privacy
│       │   ├── stereotype
│       │   └── toxicity
│       ├── response.py
│       └── utils.py
└── Tutorial.md

The data subdir includes our proposed/generated datasets essential for the evaluation of trustworthiness. (⚠️ WARNING: This paper contains model outputs that may be considered offensive.)
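Since each per-area directory is expected to carry its own README, a quick sanity check after cloning can confirm that layout. The following is a minimal sketch, not part of DecodingTrust: the helper name `missing_readmes` is hypothetical, and it only assumes the `data/<area>/README.md` layout shown in the tree above.

```python
from pathlib import Path


def missing_readmes(data_root):
    """Return the names of immediate subdirectories of data_root
    that do not contain a README.md file (sorted alphabetically)."""
    root = Path(data_root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and not (child / "README.md").is_file()
    )
```

Calling `missing_readmes("data")` from the repository root should return an empty list if the checkout matches the tree above.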

Getting Started

(Conda +) Pip

For now, we suggest installing DecodingTrust by cloning our repository and installing it in editable mode. This keeps the data, code, and configurations in the same place.

git clone https://github.com/AI-secure/DecodingTrust.git && cd DecodingTrust
pip install -e .

Please note that this will install PyTorch with pip. If your system does not have a CUDA version compatible with the PyTorch pip wheel, install PyTorch with Conda first, as shown below.

conda create --name dt-test python=3.9 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda activate dt-test
pip install "decoding-trust @ git+https://github.com/AI-secure/DecodingTrust.git"
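After either install route, it can help to confirm that PyTorch is importable and whether it sees a CUDA device. The snippet below is a small sketch under that assumption; the function name `describe_torch_install` is hypothetical, and it relies only on the standard `torch.__version__` attribute and `torch.cuda.is_available()` call.

```python
import importlib.util


def describe_torch_install() -> str:
    """Summarize whether PyTorch is installed and whether CUDA is usable."""
    # find_spec avoids an ImportError when torch is absent.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    return (
        f"torch {torch.__version__}, "
        f"CUDA available: {torch.cuda.is_available()}"
    )
```

If `print(describe_torch_install())` reports that CUDA is not available on a GPU machine, the pip wheel and the system CUDA version are likely mismatched, which is the case the Conda route above is meant to address.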

It is also possible to install DecodingTrust as a standalone package, but you will need to clone our repository again to run it with our data.

conda create --name dt-test python=3.9
conda activate dt-test
pip install "decoding-trust @ git+https://github.com/AI-secure/DecodingTrust.git"

Support for the ppc64le Architecture

We also support the ppc64le architecture of IBM Power-9 platforms. To install on this platform, please first make sure you have the following conda channels so that we can utilize pre-built packages.

conda config --add channels 'defaults'   # lowest priority
conda config --add channels 'https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/'
conda config --add channels 'https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/'
conda config --add channels 'https://opence.mit.edu'
conda config --add channels 'https://ftp.osuosl.org/pub/open-ce/current/'
conda config --add channels 'conda-forge'   # highest priority

Then, install the following pre-built packages.

mamba create --name dt-test python==3.9 pytorch=2.0.1 torchvision=0.15.2 spacy=3.5.3 scipy=1.10.1 fairlearn~=0.9.0 scikit-learn~=1.1.2 pandas~=2.0.3 pyarrow~=11.0.0 rust -c conda-forge

Finally, install DecodingTrust with pip as usual.

Docker / Singularity

To use DecodingTrust with Docker, pull the following image and run it.

sudo docker pull danielz01/decoding-trust
docker run -it \
    -v /path/on/host:/path/in/container \
    --gpus all \
    danielz01/decoding-trust [arg1 arg2 ...]

To use it through Singularity or Apptainer container environments on HPC systems, simply run the following.

module load singularity  # Change it to whatever module name your singularity / apptainer environment was given
singularity pull decoding-trust-v1.0.sif docker://danielz01/decoding-trust
singularity exec --nv --bind /path/on/host:/path/in/container decoding-trust-v1.0.sif [arg1 arg2]

We will also have a container build for ppc64le platforms soon. Stay tuned!

Notes

  • Each of the eight areas has its own subdirectory containing the respective code and README.

  • Follow the specific README: Every subdirectory has its own README. Refer to these documents for information on how to run the scripts and interpret the results.

[Important] Candidate models

In our benchmark, to keep conclusions and results consistent, we currently focus mainly on evaluating the following two OpenAI models:

  • gpt-3.5-turbo-0301
  • gpt-4-0314

Note that we use gpt-3.5-turbo-0301, the dated snapshot released in March, rather than gpt-3.5-turbo: the undated name can point to newer model versions over time, so pinning the snapshot keeps results reproducible.
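As a minimal sketch of guarding against unpinned model names, the check below treats a name as a pinned snapshot when it ends in a four-digit date suffix, as the two names above do. The function name `is_pinned_snapshot` and the suffix rule are illustrative assumptions, not part of DecodingTrust, and other providers may use different naming schemes.

```python
import re

# Assumption: a pinned snapshot ends in "-MMDD", e.g. "-0301" or "-0314".
_SNAPSHOT_SUFFIX = re.compile(r"-\d{4}$")


def is_pinned_snapshot(model_name: str) -> bool:
    """Return True if model_name ends in a four-digit date suffix."""
    return bool(_SNAPSHOT_SUFFIX.search(model_name))
```

A check like this could run before an experiment starts, so that a run configured with an undated alias fails early instead of silently evaluating a different model version.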

We currently support evaluating all causal LLMs hosted on Hugging Face or locally. Specifically, we have tested the following open LLMs:

  • Llama-v2-7B-Chat
  • Vicuna-7B
  • MPT-7B
  • Falcon-7B
  • Alpaca-7B
  • RedPajama-INCITE-7B-Instruct

Tutorial

We have provided a Tutorial to walk you through using the API to evaluate different trustworthiness perspectives and LLMs.

Useful tips

  • Please first run your experiments with the ++dry_run=True flag to check the input / output format, and use gpt-3.5-turbo-0301 to check the generation, since it costs less.
  • We suggest saving the responses from OpenAI.
  • You can check https://arxiv.org/pdf/2302.06476.pdf to know if your performance of ChatGPT is reasonable for standard tasks. Also you may find their task descriptions useful.
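One way to follow the tip about saving responses is a small on-disk cache keyed by model and prompt, so that repeated runs do not pay for the same request twice. This is a hedged sketch, not how DecodingTrust stores its generations: the names `cache_key`, `load_cache`, and `cached_query`, the JSONL record layout, and the `query_fn(model, prompt)` callback are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def cache_key(model: str, prompt: str) -> str:
    """Derive a stable key from the model name and the prompt text."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


def load_cache(path) -> dict:
    """Read previously saved responses from a JSONL file, if it exists."""
    cache = {}
    file = Path(path)
    if file.is_file():
        with file.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    record = json.loads(line)
                    cache[record["key"]] = record["response"]
    return cache


def cached_query(model: str, prompt: str, query_fn, path):
    """Return the saved response if present; otherwise call query_fn
    (a hypothetical callable that performs the real request) and append
    the result to the JSONL file before returning it."""
    key = cache_key(model, prompt)
    cache = load_cache(path)
    if key in cache:
        return cache[key]
    response = query_fn(model, prompt)
    record = {"key": key, "model": model, "prompt": prompt, "response": response}
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return response
```

Re-reading the file on every call keeps the sketch simple; for large experiments, loading the cache once and keeping it in memory would be the more efficient choice.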

File usage

  • main.py provides a unified entry point to evaluate all the perspectives and different LLMs with proper configuration
  • chat.py provides robust APIs for creating requests to OpenAI Chat Completion models and Hugging Face autoregressive LLMs. We recommend implementing experiments based on this file. If you think chat.py is not good enough and want to make modifications, please let @acphile and @boxinw know.
  • utils.py provides auxiliary functions

For other files, please refer to each subdirectory for more information.

License

This project is licensed under the CC BY-SA 4.0 - see the LICENSE file for details.

Citation

Please cite the paper as follows if you use the data or code from DecodingTrust:

@inproceedings{wang2023decodingtrust,
  title={DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models},
  author={Wang, Boxin and Chen, Weixin and Pei, Hengzhi and Xie, Chulin and Kang, Mintong and Zhang, Chenhui and Xu, Chejian and Xiong, Zidi and Dutta, Ritik and Schaeffer, Rylan and others},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}

Contact

Please reach out to us if you have any questions or suggestions. You can submit an issue or pull request, or send an email to [email protected].

Thank you for your interest in DecodingTrust. We hope our work will contribute to a more trustworthy, fair, and robust AI future.
