• Stars
    star
    227
  • Rank 175,900 (Top 4 %)
  • Language
  • License
    Creative Commons ...
  • Created almost 2 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Model-Written Evaluation Datasets

This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations."

We intend the datasets to be useful to:

  1. Those who are interested in understanding the quality and properties of model-generated data
  2. Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks, and gender bias)

The evaluations were generated to be asked to dialogue agents (e.g., a model finetuned explicitly respond to a user's utterances, or a pretrained language model prompted to behave like a dialogue agent). However, it is possible to adapt the data to test other kinds of models as well.

We describe each of our collections of datasets below:

  1. persona/: Datasets testing models for various aspects of their behavior related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals (e.g., self-preservation or power-seeking).
  2. sycophancy/: Datasets testing models for whether or not they repeat back a user's view to various questions (in philosophy, NLP research, and politics)
  3. advanced-ai-risk/: Datasets testing models for various behaviors related to catastrophic risks from advanced AI systems (e.g., ). These datasets were generated in a few-shot manner. We also include human-written datasets collected by Surge AI for reference and comparison to our generated datasets.
  4. winogender/: Our larger, model-generated version of the Winogender Dataset (Rudinger et al., 2018). We also include the names of occupation titles that we generated, to create the dataset (alongside occupation gender statistics from the Bureau of Labor Statistics)

Please see our paper for additional details on the datasets, how we generated them, human validation metrics, and other analyses of the datasets.

Disclaimer: As discussed in our paper, some data contains content that includes social biases and stereotypes. The data may also contain other forms of harmful or offensive content. The views expressed in the data do not reflect the views of Anthropic or any of its employees.

Contact

For questions, please email ethan at anthropic dot com

Bibtex Citation

If you would like to cite our work or data, you may use the following bibtex citation:

@misc{perez2022discovering,
  doi = {10.48550/ARXIV.2212.09251},
  url = {https://arxiv.org/abs/2212.09251},
  author = {Perez, Ethan and Ringer, Sam and Lukošiūtė, Kamilė and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Dario and Drain, Dawn and Li, Dustin and Tran-Johnson, Eli and Khundadze, Guro and Kernion, Jackson and Landis, James and Kerr, Jamie and Mueller, Jared and Hyun, Jeeyoon and Landau, Joshua and Ndousse, Kamal and Goldberg, Landon and Lovitt, Liane and Lucas, Martin and Sellitto, Michael and Zhang, Miranda and Kingsland, Neerav and Elhage, Nelson and Joseph, Nicholas and Mercado, Noemí and DasSarma, Nova and Rausch, Oliver and Larson, Robin and McCandlish, Sam and Johnston, Scott and Kravec, Shauna and {El Showk}, Sheer and Lanham, Tamera and Telleen-Lawton, Timothy and Brown, Tom and Henighan, Tom and Hume, Tristan and Bai, Yuntao and Hatfield-Dodds, Zac and Clark, Jack and Bowman, Samuel R. and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Discovering Language Model Behaviors with Model-Written Evaluations},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

More Repositories

1

anthropic-cookbook

A collection of notebooks/recipes showcasing some fun and effective ways of using Claude.
Jupyter Notebook
4,407
star
2

anthropic-quickstarts

A collection of projects designed to help developers quickly get started with building deployable applications using the Anthropic API
TypeScript
4,213
star
3

courses

Anthropic's educational courses
Jupyter Notebook
3,703
star
4

hh-rlhf

Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
1,544
star
5

anthropic-sdk-python

Python
1,358
star
6

anthropic-sdk-typescript

Access to Anthropic's safety-first language model APIs
TypeScript
654
star
7

anthropic-tools

Python
261
star
8

PySvelte

A library for bridging Python and HTML/Javascript (via Svelte) for creating interactive visualizations
Python
171
star
9

ConstitutionalHarmlessnessPaper

133
star
10

anthropic-retrieval-demo

Lightweight demo using the Anthropic Python SDK to experiment with Claude's Search and Retrieval capabilities over a variety of knowledge bases (Elasticsearch, vector databases, web search, and Wikipedia).
Python
87
star
11

sleeper-agents-paper

Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
81
star
12

toy-models-of-superposition

Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
Jupyter Notebook
80
star
13

anthropic-tokenizer-typescript

TypeScript
49
star
14

anthropic-sdk-go

Access to Anthropic's safety-first language model APIs via Go
Go
47
star
15

DecompositionFaithfulnessPaper

Python
1
star