  • Stars: 302
  • Rank: 138,030 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created about 1 year ago
  • Updated 10 months ago

Repository Details

A large-scale, fine-grained, diverse preference dataset (and models).

News

  • [2023/09/26]: UltraRM unleashes the power of UltraLM-13B-v2.0 and UltraLM-13B! Simple best-of-16 sampling achieves 92.30% (UltraLM2, 🥇 in 13B results) and 91.54% (UltraLM, 🥇 in LLaMA-1 results) win rates against text-davinci-003 on the AlpacaEval benchmark!
  • [2023/09/26]: We release the UltraFeedback dataset, along with the UltraFeedback-powered reward model UltraRM and critique model UltraCM! Both set new state-of-the-art results among open-source models!

Introduction

UltraFeedback is a large-scale, fine-grained, diverse preference dataset for training powerful reward models and critique models. We collect about 64k prompts from diverse sources (UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN; see Instruction Sampling for dataset statistics). We then use these prompts to query multiple LLMs (see Model Sampling for the model list) and generate 4 different responses for each prompt, resulting in a total of 256k samples.

To collect high-quality preference and textual feedback, we design a fine-grained annotation instruction covering 4 aspects: instruction-following, truthfulness, honesty, and helpfulness. We then ask GPT-4 to annotate the collected samples according to this instruction.

Features

  • Scale: UltraFeedback consists of 64k prompts, 256k responses and high-quality feedback. RLHF researchers could further construct around 340k comparison pairs to train their reward models.
  • Diversity: As a preference dataset, diversity is the core requirement for UltraFeedback. We collect prompts from various sources and query a diverse set of state-of-the-art open-source and prestigious models. To further increase diversity, we intentionally select different base models, i.e., LLaMA, Falcon, StarChat, MPT, GPT and Bard. We also apply various principles to prompt models to complete instructions in different ways.
  • High-density: UltraFeedback provides both numerical and textual feedback. Moreover, we wrote fine-grained annotation documents to help rate responses in all dimensions.
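
As a rough illustration of how comparison pairs could be derived from four rated responses per prompt, the sketch below enumerates all C(4,2) = 6 combinations and keeps only those with a strict preference. This is not the authors' procedure: the function name `build_pairs`, the use of a single numeric score per completion, and the decision to skip ties are all assumptions made for the example.

```python
from itertools import combinations


def build_pairs(scores):
    """Return (chosen, rejected) index pairs from per-completion scores.

    `scores` holds one numeric score per completion. Ties are skipped
    because they express no preference (an assumption of this sketch).
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] > scores[j]:
            pairs.append((i, j))
        elif scores[j] > scores[i]:
            pairs.append((j, i))
    return pairs


# Four completions for one prompt, two of them tied.
print(build_pairs([2.5, 4.75, 2.5, 3.0]))
```

With no ties, each prompt contributes 6 pairs; every tie removes one, so the number of usable pairs depends on how often scores coincide.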

Dataset Construction

Instruction Sampling

We sample 63,967 instructions from 6 publicly available, high-quality datasets. We include all instructions from TruthfulQA and FalseQA, and randomly sample 10k instructions from Evol-Instruct, 10k from UltraChat, and 20k from ShareGPT. For FLAN, we adopt a stratified sampling strategy: we randomly sample 3k instructions from the "CoT" subset and 10 instructions per task from the other three subsets, excluding overly long instructions.

{
    "evol_instruct": 10000, 
    "false_qa": 2339,
    "flan": 20939, 
    "sharegpt": 19949, 
    "truthful_qa": 811,
    "ultrachat": 9929 
}
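
The per-source counts can be checked against the stated total with a few lines of Python. The numbers are copied from the listing above; the variable names are illustrative only.

```python
# Instruction counts per source, as listed above.
counts = {
    "evol_instruct": 10000,
    "false_qa": 2339,
    "flan": 20939,
    "sharegpt": 19949,
    "truthful_qa": 811,
    "ultrachat": 9929,
}

total = sum(counts.values())

# Print each source with its share of the total, largest first.
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {n:6d} {n / total:6.1%}")

print("total instructions:", total)
print("responses at 4 per instruction:", total * 4)
```

The sum matches the 63,967 instructions quoted in the text, and four responses per instruction gives 255,868 responses, consistent with the rounded figure of 256k.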

Model Sampling

To prevent the reward model from overfitting to a particular text style or capturing spurious correlations between text style and reward, we select base models of all levels, with varying sizes, architectures, and training data, to complete the instructions. We set up a pool of 17 models:

  • Commercial Models: GPT-4, GPT-3.5 Turbo, Bard
  • LLaMA family:
    1. LLaMA-2-7B-chat, LLaMA-2-13B-chat, LLaMA-2-70B-chat
    2. UltraLM-13B, UltraLM-65B
    3. WizardLM-7B-v1.2, WizardLM-13B-v1.2, WizardLM-70B-v1.0
    4. Vicuna-33B-v1.3
    5. Alpaca-7B
  • Non-LLaMA series:
    1. Falcon-40B-instruct
    2. MPT-30B-chat
    3. StarChat-Beta
    4. Pythia-12B
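
A minimal sketch of drawing four distinct models from this pool for one instruction is shown below. The model names are copied from the list above; the function name `sample_models` and the choice of uniform sampling without replacement are assumptions, since the text only says the models are randomly sampled.

```python
import random

# The 17 models listed above.
MODEL_POOL = [
    "GPT-4", "GPT-3.5 Turbo", "Bard",
    "LLaMA-2-7B-chat", "LLaMA-2-13B-chat", "LLaMA-2-70B-chat",
    "UltraLM-13B", "UltraLM-65B",
    "WizardLM-7B-v1.2", "WizardLM-13B-v1.2", "WizardLM-70B-v1.0",
    "Vicuna-33B-v1.3", "Alpaca-7B",
    "Falcon-40B-instruct", "MPT-30B-chat", "StarChat-Beta", "Pythia-12B",
]


def sample_models(rng, k=4):
    """Pick k distinct models uniformly at random (assumed scheme)."""
    return rng.sample(MODEL_POOL, k)


rng = random.Random(0)  # fixed seed so the example is repeatable
print(sample_models(rng))
```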

Principle Sampling

Following [1] and [2], we define a set of principles to explicitly align model behaviors from different aspects. We set up a pool of 4 principles: Helpfulness, Truthfulness, Honesty and Verbalized Calibration. For each instruction, we randomly sample 4 models to complete the instruction, and for each completion, we sample a principle and add it to the system prompt to align the model behavior. Because different datasets have different characteristics, not every dataset is suitable for every principle. The following table shows the principle distribution for each dataset.

Dataset        Principle
Evol-Instruct  100% Helpful
FalseQA        100% TruthfulQA
FLAN           60% Helpful, 20% Truthful, 20% Verbalized Calibration
ShareGPT       60% Helpful, 20% Truthful, 18% Honesty, 2% Verbalized Calibration
TruthfulQA     100% Truthful
UltraChat      60% Helpful, 20% Truthful, 18% Honesty, 2% Verbalized Calibration

[1] Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Sun et al.

[2] Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Mukherjee et al.
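
The per-dataset distribution in the table can be read as weights for a random draw. The sketch below is one way to express that, not the authors' sampling code: the dictionary keys, the principle identifiers other than `helpfulness`, and the function name `sample_principle` are hypothetical. FalseQA is left out because its entry in the table does not name one of the four principles.

```python
import random

# Percentages from the table above, keyed by dataset.
PRINCIPLE_WEIGHTS = {
    "evol_instruct": {"helpfulness": 100},
    "flan": {"helpfulness": 60, "truthfulness": 20,
             "verbalized_calibration": 20},
    "sharegpt": {"helpfulness": 60, "truthfulness": 20, "honesty": 18,
                 "verbalized_calibration": 2},
    "truthful_qa": {"truthfulness": 100},
    "ultrachat": {"helpfulness": 60, "truthfulness": 20, "honesty": 18,
                  "verbalized_calibration": 2},
}


def sample_principle(source, rng):
    """Draw one principle for a completion according to the table weights."""
    weights = PRINCIPLE_WEIGHTS[source]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
# One principle per completion, four completions per instruction.
print([sample_principle("sharegpt", rng) for _ in range(4)])
```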

Comparison with Previous Preference Datasets

UltraRM

We train and release a reward model, UltraRM, based on UltraFeedback to further facilitate alignment research. UltraRM is initialized from LLaMA2-13B.

Specifically, we train two versions of the reward model: UltraRM-UF is fine-tuned only on UltraFeedback, while UltraRM is fine-tuned on a mixture of UltraFeedback and an equal-size sample from three open-source datasets: Anthropic HH-RLHF, Stanford SHP, and Summarization.

On four public preference test sets, our UltraRM achieves SOTA over other open-source reward models.
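
The best-of-16 sampling mentioned in the News section uses a reward model to pick one response out of several candidates. The sketch below shows the general idea only: `generate` and `score` are hypothetical callables standing in for a language model and a reward model such as UltraRM, and none of this reflects the actual UltraRM interface.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 16,
) -> str:
    """Sample n candidate responses and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))


# Toy stand-ins: a canned "generator" and a length-based "reward".
canned = iter(["ok", "a longer answer", "mid"])
print(best_of_n(
    "hi",
    generate=lambda p: next(canned),
    score=lambda p, r: float(len(r)),
    n=3,
))
```

In practice the cost grows linearly with n, since every candidate must be both generated and scored.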

UltraCM

We train and release a critique model, UltraCM, based on UltraFeedback for automatic critique generation. Like UltraRM, UltraCM is initialized from LLaMA2-13B.

On nine benchmark datasets, UltraCM outperforms all open-source baselines and approaches ChatGPT.

Dataset Format

{
    "source": "sharegpt", // the dataset where the instruction comes from
    "instruction": "I am going to cairo in June of this year, thinking of 4-5 days, what are the best things I could do? Going with family of 5, granddad, grandmom, sister, mom and me. No days can have too much walking but sister, me and mom like adventure sports",
    "models": [ // four models that are randomly sampled from the model pool to complete the instruction
        "falcon-40b-instruct",
        "gpt-4",
        "starchat",
        "wizardlm-7b"
    ],
    "correct_answers": ["None"], // reference correct answers, set to None except TruthfulQA and FLAN
    "incorrect_answers": ["None"], // reference incorrect answers, set to None except TruthfulQA and FLAN
    "completions": [ // four completions corresponding to the four models
        { // completion 1
            "model": "falcon-40b-instruct",
            "principle": "helpfulness", // principle used to align model behavior
            "custom_system_prompt": "As an AI assistant, ensure your response offers the perfect blend of accuracy, positivity, and intrigue. Strive to be educational while keeping the user engaged.", // the system prompt corresponding to the principle, sampled from a pool of GPT-4 generated system prompts
            "response": "Cairo is a city that has something for everyone. Some of the best things you can do are visiting the Pyramids of Giza and the museums housing the ancient Egyptian artifacts, taking a camel ride in the desert, visiting the medieval neighborhoods such as Islamic Cairo or Khan El-Khalili, trying local food in traditional restaurants like Ful medames, or venturing into one of the city's many souks to shop for authentic goods. For adventure sports, you can go whitewater rafting on the Nile, hot air ballooning over the desert, or diving in the Red Sea. Just be mindful of the scorching heat and ensure everyone wears appropriate clothing and drinks enough water. Above all, have fun and enjoy your trip!",
            "annotations": {
                "instruction_following": 
                    {
                        "Rating": "2",
                        "Rationale": "The text only partially addresses the task goal by providing a general list of activities in Cairo without considering the specific needs of the family members, such as limited walking ability and adventure preferences."
                    },
                "honesty": 
                    {
                        "Rating": "3",
                        "Rationale": "The response is confident but contains subtle mistakes. It suggests whitewater rafting on the Nile and diving in the Red Sea, which are not feasible for a 4-5 day trip to Cairo. The rest of the suggestions are accurate and relevant to the family's preferences."
                    },
                "truthfulness": 
                    {
                        "Type": [
                            "1",
                            "2"
                        ],
                        "Rationale": "The text suggests whitewater rafting on the Nile and diving in the Red Sea, which are not available in Cairo. These activities are factually incorrect and not aligned with the instruction.",
                        "Rating": "3",
                        "Rationale For Rating": "The text provides some truthful information, but the inclusion of whitewater rafting and diving in the Red Sea causes partial hallucination and misunderstanding."
                    },
                "helpfulness": 
                    {
                        "Type": [
                            "1",
                            "2"
                        ],
                        "Rationale": "The response is clear and relevant, providing useful information, but it does not address the limited walking ability of the grandparents or the adventure sports preferences of the sister, mom, and the user.",
                        "Rating": "3",
                        "Rationale For Rating": "The text is correct and provides useful information, but it does not fully address the specific needs and preferences of the family."
                    }
            }
        },
        { // completion 2
            ...
        },
        { // completion 3
            ...
        },
        { // completion 4
            ...
        }
    ]
}
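
To show how the annotation fields in this format might be consumed, the sketch below reduces the four aspect ratings of one completion to a single number. Averaging the aspects is an assumption made for illustration, not a scoring rule stated in this document, and the code assumes every `Rating` is a digit string as in the example above. The name `overall_score` is hypothetical.

```python
from statistics import mean

# The four annotated aspects, named as in the format above.
ASPECTS = ("instruction_following", "honesty", "truthfulness", "helpfulness")


def overall_score(completion):
    """Average the four aspect ratings of one completion (assumed scheme)."""
    ratings = [int(completion["annotations"][a]["Rating"]) for a in ASPECTS]
    return mean(ratings)


# Ratings copied from completion 1 in the example; rationales omitted.
completion = {
    "model": "falcon-40b-instruct",
    "annotations": {
        "instruction_following": {"Rating": "2"},
        "honesty": {"Rating": "3"},
        "truthfulness": {"Rating": "3"},
        "helpfulness": {"Rating": "3"},
    },
}

print(completion["model"], overall_score(completion))
```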

Dataset Example

Here we present an example from UltraFeedback.

User: I am going to cairo in June of this year, thinking of 4-5 days, what are the best things I could do? Going with family of 5, granddad, grandmom, sister, mom and me. No days can have too much walking but sister, me and mom like adventure sports

Assistant 1 (falcon-40b-instruct): Cairo is a city that has something for everyone. Some of the best things you can do are visiting the Pyramids of Giza and the museums housing the ancient Egyptian artifacts, taking a camel ride in the desert, visiting the medieval neighborhoods such as Islamic Cairo or Khan El-Khalili, trying local food in traditional restaurants like Ful medames, or venturing into one of the city's many souks to shop for authentic goods. For adventure sports, you can go whitewater rafting on the Nile, hot air ballooning over the desert, or diving in the Red Sea. Just be mindful of the scorching heat and ensure everyone wears appropriate clothing and drinks enough water. Above all, have fun and enjoy your trip!

Assistant 2 (gpt-4): In Cairo for 4-5 days, the following activities cater to your family's preferences, with consideration for the limited walking ability of your grandparents:\n\n1. Visit the Pyramids of Giza and the Great Sphinx: A must-see in Cairo, it's an incredible experience for all ages. You can explore the premises using a hired camel, horse, or a horse-drawn carriage which accommodate the needs of your grandparents while keeping the experience enjoyable.\n\n2. Egyptian Museum: This museum in Tahrir Square contains over 120,000 artifacts of Egypt's history. A wheelchair can be arranged for the grandparents to comfortably explore the museum at their own pace.\n\n3. Khan el-Khalili Bazaar: A famous market in Cairo, suitable for souvenir shopping and experiencing the local culture. There are many cafes and restaurants your family can rest at during the visit.\n\n4. Nile River Cruise: Enjoy a dinner cruise with your family on the famous Nile River. It includes entertainment and a beautiful view of the city by night. Boats often have ramps and accessible facilities for limited mobility passangers.\n\n5. Felucca Ride: During the day, go for a more traditional sailboat ride called a Felucca. This can be done on an off-day or as a relaxation activity after an active day.\n\nFor adventure activities, you and your mom and sister can explore these options:\n\n1. Sandboarding: Head to the desert for a thrilling sandboarding experience on the dunes just outside Cairo.\n\n2. Hot Air Balloon Ride: Experience Egypt from above with a hot air balloon ride in Luxor. It would require a day trip from Cairo but is an incredible experience.\n\n3. Quad Biking: Go for a guided quad biking adventure in the desert, enjoying panoramic views of the iconic Pyramids.\n\nSpace the activities according to your family's preferences and ensure rest periods in the afternoon. Consider hiring a private guide or joining a wheelchair-accessible tour since you're concerned about your grandparents' walking ability.

Assistant 3 (starchat): Here are some of the top things to do in Cairo, Egypt, with a family of 5:\n\n- Giza Pyramids: The Giza Pyramids, located just outside Cairo, are a must-see for anyone visiting the city. The Great Pyramid of Khufu, the Great Pyramid of Khafre, and the Great Pyramid of Menkaure make up one of the Seven Wonders of the Ancient World and are an impressive sight to behold.\n\n- Egyptian Museum: The Egyptian Museum, located in the heart of Cairo, is home to one of the world's largest and most impressive collections of ancient Egyptian art and artifacts. Visitors can explore the museum's extensive collection, which includes items such as mummies, sarcophagi, statues, paintings, and more.\n\n- Citadel of Salah El-Din: Located on a hill overlooking Cairo, the Citadel of Salah El-Din offers stunning views of the city and is an excellent place to get a sense of Cairo's rich history. Visitors can explore the various museums and mosques within the citadel, and also take a stroll through the beautiful gardens.\n\n- Khan El-Khalili Bazaar: Khan El-Khalili Bazaar is one of Cairo's most popular attractions, a bustling market where visitors can find everything from spices and pottery to clothes and jewelry. It's a great place to do some shopping and explore the local culture.\n\n- Alexandria: Just an hour's drive from Cairo, Alexandria is a historic city with a rich legacy. Visitors can explore the ancient Roman Amphitheater, the Qaitbay Citadel, and the Catacombs of Kom el-Shoqafa. They can also take a boat trip to the island of Montazah, where they can see the famous Graffiti Wall and other street art.\n\nAll of these attractions are family-friendly, with plenty of things to see and do. The Giza Pyramids in particular are very accessible and well-maintained, making them suitable for all ages.

Assistant 4 (wizardlm-7b): If you are looking for adventure sports, you can explore the pyramids of Giza and go hot air ballooning over the desert. Another option could be to take a camel ride in the desert or go kayaking in the Nile River. For family-friendly activities, you can visit the Egyptian Museum, Cairo Tower, and Khan el-Khalili Bazaar. Additionally, you can take a day trip to the historical city of Alexandria and visit the Pompey's Pillar and the Roman Amphitheater. Another option could be to go on a guided tour of the ancient city of Petra in Jordan, which is a short flight from Cairo.

To Do

  • Train a reward model and a critique model using UltraFeedback (Coming soon!).
  • Enhance open-source LLMs with RLHF.
  • Extend UltraFeedback to multi-round dialogues.

Limitations

  • Although GPT-4 can provide well-aligned annotation and textual feedback for most samples, we must note that GPT-4 also makes mistakes.

Citation

@misc{cui2023ultrafeedback,
      title={UltraFeedback: Boosting Language Models with High-quality Feedback}, 
      author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Wei Zhu and Yuan Ni and Guotong Xie and Zhiyuan Liu and Maosong Sun},
      year={2023},
      eprint={2310.01377},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
