
Evaluation-of-ChatGPT-on-Information-Extraction

An evaluation of ChatGPT on information extraction tasks, including Named Entity Recognition (NER), Relation Extraction (RE), Event Extraction (EE), and Aspect-Based Sentiment Analysis (ABSA).

Abstract

ChatGPT has stimulated a research boom in the field of large language models. In this paper, we assess the capabilities of ChatGPT from four perspectives: performance, evaluation criteria, robustness, and error types. Specifically, we first evaluate ChatGPT's performance on 17 datasets covering 14 IE sub-tasks under zero-shot, few-shot, and chain-of-thought scenarios, and find a large performance gap between ChatGPT and SOTA results. Next, we rethink this gap and propose a soft-matching strategy for evaluation that reflects ChatGPT's performance more accurately. Then, we analyze the robustness of ChatGPT on the 14 IE sub-tasks and find that: 1) ChatGPT rarely outputs invalid responses; 2) irrelevant context and long-tail target types greatly degrade ChatGPT's performance; and 3) ChatGPT struggles to understand subject-object relationships in the RE task. Finally, we analyze ChatGPT's errors and find that "unannotated spans" is the dominant error type. This raises concerns about the quality of annotated data and suggests the possibility of annotating data with ChatGPT. The data and code are released on GitHub.
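The soft-matching idea mentioned above could be sketched as follows. This is a minimal illustration only: the similarity measure (`difflib.SequenceMatcher`) and the 0.8 threshold are assumptions, not the paper's exact criterion.

```python
from difflib import SequenceMatcher

def soft_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    """Count a predicted span as correct if it is sufficiently similar to
    the gold span, instead of requiring an exact string match.
    The similarity ratio and threshold here are illustrative."""
    pred_n, gold_n = pred.strip().lower(), gold.strip().lower()
    if pred_n == gold_n:
        return True  # exact match after normalization
    return SequenceMatcher(None, pred_n, gold_n).ratio() >= threshold

# Exact matching rejects "the United States" vs. "United States";
# soft matching accepts it.
print(soft_match("the United States", "United States"))  # True
```

Under exact matching, a boundary difference like a leading article counts as a full error; soft matching credits such near-misses.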

Datasets, processed data, output result files

All datasets, processed data, and output result files are available on Google Drive, except the ACE04, ACE05, and TACRED raw datasets (for copyright reasons).

Download all the files, unzip them, and place them in the corresponding directories.

Test with API

bash ./scripts/absa/eval.sh
bash ./scripts/ner/eval.sh
bash ./scripts/re/eval_rc.sh
bash ./scripts/re/eval_triplet.sh
bash ./scripts/ee/eval_trigger.sh
bash ./scripts/ee/eval_argument.sh
bash ./scripts/ee/eval_joint.sh

Before testing, update the --api_key and --result_file arguments in every *.sh script.
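For example, one way to set the API key across every script is a quick sed pass. The placeholder value being replaced and the script layout below are assumptions about how the scripts look, so adapt the pattern to the actual files:

```shell
# Assumption: each eval script passes an --api_key argument whose current
# value should be replaced with your own key.
export OPENAI_API_KEY="sk-..."   # put your real key here
for f in ./scripts/*/*.sh; do
  [ -f "$f" ] || continue                     # skip if the glob matched nothing
  sed -i "s|--api_key [^ ]*|--api_key $OPENAI_API_KEY|" "$f"
done
```

The --result_file paths can be updated the same way, or edited by hand per script.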

Get Evaluation Metrics

bash ./scripts/absa/report.sh
bash ./scripts/ner/report.sh
bash ./scripts/re/report_rc.sh
bash ./scripts/re/report_triplet.sh
bash ./scripts/ee/report_trigger.sh
bash ./scripts/ee/report_argument.sh
bash ./scripts/ee/report_joint.sh

By default, the metrics are calculated from our output result files on Google Drive.
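The report scripts compute precision, recall, and F1 over the result files. As a rough illustration, span-level micro-F1 can be computed like this; it is a generic scorer sketch, not the repository's exact implementation:

```python
from collections import Counter

def micro_f1(pred_spans, gold_spans):
    """Micro-averaged precision/recall/F1 over a corpus.

    pred_spans / gold_spans: one list of (type, span) tuples per example.
    Counters are intersected so duplicate spans are matched at most once.
    """
    tp = fp = fn = 0
    for preds, golds in zip(pred_spans, gold_spans):
        pred_c, gold_c = Counter(preds), Counter(golds)
        overlap = sum((pred_c & gold_c).values())  # multiset intersection
        tp += overlap
        fp += sum(pred_c.values()) - overlap
        fn += sum(gold_c.values()) - overlap
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Micro-averaging pools counts across all examples before dividing, so frequent types dominate the score.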

Main results

[Figure: main results]

Examples of prompts

[Figures: Zero-shot, Few-shot ICL, Few-shot COT prompt examples]
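As an illustration of what a zero-shot IE prompt could look like, here is a hypothetical NER template; the wording, entity types, and output format are assumptions, not the paper's exact prompts:

```python
# Hypothetical zero-shot NER prompt; the template is an illustration only.
entity_types = ["person", "organization", "location"]
sentence = "Barack Obama visited Paris last week."

prompt = (
    f"Given the entity types {entity_types}, extract all named entities "
    f"from the sentence below and return (entity, type) pairs.\n"
    f'Sentence: "{sentence}"'
)
print(prompt)
```

Few-shot ICL prompts prepend labeled examples to this template, and COT prompts additionally ask for step-by-step reasoning before the answer.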

Future Work

We will add the results and analysis of GPT-4.

Citation

@article{han2023-chatgpt-IE-evaluation,
  author       = {Ridong Han and
                  Tao Peng and
                  Chaohao Yang and
                  Benyou Wang and
                  Lu Liu and
                  Xiang Wan},
  title        = {Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors},
  journal      = {CoRR},
  volume       = {abs/2305.14450},
  year         = {2023},
  eprinttype   = {arXiv},
  eprint       = {2305.14450},
  url          = {https://doi.org/10.48550/arXiv.2305.14450},
  doi          = {10.48550/ARXIV.2305.14450},
}
