
InfiniteBench: 100k+ Long-Context Benchmark for Large Language Models

Chinese • English • (Paper Upcoming)

Introduction

Welcome to InfiniteBench, a benchmark tailored for evaluating the capability of language models to process, understand, and reason over super-long contexts (100k+ tokens). Long contexts are crucial for enhancing applications with LLMs and achieving high-level interaction. InfiniteBench is designed to push the boundaries of language models by testing them against context lengths of 100k+ tokens, roughly 10 times longer than traditional benchmark datasets.

Features

  • Loooong Context: InfiniteBench is a pioneer in testing language models with a context length of 100k+, offering an unparalleled challenge in the field.
  • Diverse Domains: The benchmark comprises 12 unique tasks, each crafted to assess a different aspect of language processing and comprehension in extended contexts.
  • Specialized Tests: InfiniteBench consists of tasks that state-of-the-art LLMs are known to perform well on with shorter contexts. This ensures that any performance degradation is caused only by the length of the context.
  • Real-World and Synthetic Scenarios: The tasks are a mix of real-world scenarios and synthetic constructs, ensuring a comprehensive evaluation of models. Real-world scenarios make the test pragmatic, while synthetic ones leave room to extend the context length further with ease.

Task Composition

| Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description |
| --- | --- | --- | --- | --- | --- |
| En.Sum | Fake Book | 103 | 171.5k | 1.1k | Summarization of a fake book created with core entity substitution. |
| En.QA | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. |
| En.MC | Fake Book | 229 | 184.4k | 5.3 | Multiple-choice questions derived from the fake book. |
| En.Dia | Script | 200 | 103.6k | 3.4 | Identification of speakers in partially anonymized scripts. |
| Zh.QA | New Book | 175 | 2068.6k | 6.3 | Question answering on a set of newly collected books. |
| Code.Debug | Code Document | 394 | 114.7k | 4.8 | Finding which function in a code repo contains a crashing error (in multiple-choice form). |
| Code.Run | Synthetic | 400 | 75.2k | 1.3 | Simulating execution of multiple simple, synthetic functions. |
| Math.Calc | Synthetic | 50 | 43.9k | 43.9k | Calculations involving super-long arithmetic equations. |
| Math.Find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. |
| Retrieve.PassKey[^1] | Synthetic | 590 | 122.4k | 2.0 | Retrieving hidden keys in a noisy long context. |
| Retrieve.Number | Synthetic | 590 | 122.4k | 4.0 | Locating repeated hidden numbers in a noisy long context. |
| Retrieve.KV[^2] | Synthetic | 500 | 89.9k | 22.7 | Finding the value corresponding to a given key in a large dictionary. |

How to Download Data

Download the data directly from 🤗 Hugging Face: https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench

Using 🤗 Datasets

Alternatively, you can download the data with the 🤗 Datasets library:

```python
from datasets import load_dataset

dataset = load_dataset("xinrongzhang2022/InfiniteBench")
```

Using Scripts

```shell
cd InfiniteBench
bash scripts/download_dataset.sh
```

This will dump the data directly into the data folder.
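Each downloaded task is a JSON Lines file (one example per line). A minimal loader sketch, assuming standard JSONL with no trailing commas (the exact field names vary by task):

```python
import json

def load_jsonl(path):
    """Load one InfiniteBench task file: one JSON object per line."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(json.loads(line))
    return examples

# e.g. examples = load_jsonl("data/passkey.jsonl")
```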

Evaluation Result

We evaluate SOTA proprietary and open-source LLMs; the results are as follows.

| Task Name | GPT-4 | YaRN-Mistral-7B | Kimi-Chat | Claude 2 |
| --- | --- | --- | --- | --- |
| Retrieve.PassKey | 100% | 92.71% | 98.14% | 97.80% |
| Retrieve.Number | 100% | 56.61% | 95.42% | 98.14% |
| Retrieve.KV | 89.00% | < 5% | 53.60% | 65.40% |
| En.Sum | 14.73% | 9.09% | 17.93% | 14.45% |
| En.QA | 22.22% | 9.55% | 16.52% | 11.97% |
| En.MC | 67.25% | 27.95% | 72.49% | 62.88% |
| En.Dia | 8.50% | 7.50% | 11.50% | 46.50% |
| Zh.QA | 25.96% | 14.43% | 17.93% | 9.64% |
| Code.Debug | 39.59% | < 5% | 18.02% | < 5% |
| Code.Run | 23.25% | < 5% | < 5% | < 5% |
| Math.Calc | < 5% | < 5% | < 5% | < 5% |
| Math.Find | 60.00% | 17.14% | 12.57% | 32.29% |

Note:

  1. The evaluation code for YaRN-Mistral-7B was implemented by us; please contact us or submit an issue if there are any problems.

  2. Kimi-Chat, Claude 2, and GPT-4 are evaluated through the official APIs with default configurations.

  3. For Math.Calc, the values in parentheses are in units of 0.01%, because it is easy to get a very low score on this task.

  4. The metric for Math.Find, Math.Calc, Code.Run, Code.Debug, En.Dia, En.MC, Retrieve.KV, Retrieve.Number, and Retrieve.PassKey is accuracy; the metric for Zh.QA and En.QA is the ROUGE F1 score; the metric for En.Sum is the rougeLsum score from the 🤗 Evaluate library.
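As a concrete illustration, the accuracy metric used for the retrieval, code, and math tasks can be sketched as an exact-match rate. This is a simplified sketch; the repository's scoring code may normalize answers (casing, whitespace, option letters) differently:

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# accuracy(["12345", "apple"], ["12345", "orange"]) -> 0.5
```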

Installation

```shell
pip install -r requirements.txt
```

How to Run

Download the dataset to the data folder (or set the --data_dir argument to the location of the dataset). The data folder structure should be as follows.

```text
InfiniteBench
├── data
│   ├── code_debug.jsonl
│   ├── code_run.jsonl
│   ├── kv_retrieval.jsonl
│   ├── longbook_choice_eng.jsonl
│   ├── longbook_qa_chn.jsonl
│   ├── longbook_qa_eng.jsonl
│   ├── longbook_sum_eng.jsonl
│   ├── longdialogue_qa_eng.jsonl
│   ├── math_calc.jsonl
│   ├── math_find.jsonl
│   ├── number_string.jsonl
│   ├── passkey.jsonl
│   └── construct_synthetic_dataset.py
...
```

Then, in the src folder, execute:

```shell
python eval_yarn_mistral.py --task kv_retrieval
python eval_gpt4.py --task longbook_sum_qa
python eval_rwkv.py --task passkey
```

The available tasks are:

| Task Name | Argument to specify in --task |
| --- | --- |
| En.Sum | longbook_sum_qa |
| En.QA | longbook_qa_eng |
| En.MC | longbook_choice_eng |
| En.Dia | longdialogue_qa_eng |
| Zh.QA | longbook_qa_chn |
| Code.Debug | code_debug |
| Code.Run | code_run |
| Math.Calc | math_calc |
| Math.Find | math_find |
| Retrieve.PassKey | passkey |
| Retrieve.Number | number_string |
| Retrieve.KV | kv_retrieval |
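The mapping above can be captured in a small dict when scripting runs over several tasks. The task names and arguments come from the table; the `eval_command` helper is purely illustrative:

```python
# Task display name -> value for the --task argument (from the table above).
TASK_ARGS = {
    "En.Sum": "longbook_sum_qa",
    "En.QA": "longbook_qa_eng",
    "En.MC": "longbook_choice_eng",
    "En.Dia": "longdialogue_qa_eng",
    "Zh.QA": "longbook_qa_chn",
    "Code.Debug": "code_debug",
    "Code.Run": "code_run",
    "Math.Calc": "math_calc",
    "Math.Find": "math_find",
    "Retrieve.PassKey": "passkey",
    "Retrieve.Number": "number_string",
    "Retrieve.KV": "kv_retrieval",
}

def eval_command(task_name, script="eval_gpt4.py"):
    """Build the command line for evaluating one task (hypothetical helper)."""
    return ["python", script, "--task", TASK_ARGS[task_name]]
```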

Citation

This will be updated when our preprint paper is released.

```bibtex
@misc{zhang2023infinitebench,
  title  = {InfiniteBench: 128k Long-Context Benchmark for Language Models},
  author = {Zhang, Xinrong and Chen, Yingfa and Hu, Shengding and Wu, Qihao and Chen, Junhao and Xu, Zihang and Dai, Zhenning and Han, Xu and Wang, Shuo and Liu, Zhiyuan and Sun, Maosong},
  year   = {2023}
}
```

Acknowledgement

Thanks to Cong Feng, Zhongwu Zhai, Guoyang Zeng, Chenyang Song, Renjie Luo, Chaoqun He, Yuge Tu, Bowen Ping, Yujie Huang, Yudong Mei, Kaihuo Zhang, Weilin Zhao, Ao Sun, Yulin Chen, Ganqu Cui.

References

[^1]: Mohtashami, Amirkeivan and Martin Jaggi. "Landmark Attention: Random-Access Infinite Context Length for Transformers." arXiv:2305.16300 (2023).

[^2]: Liu, Nelson F. et al. "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172 (2023).
