
Repository Details

A repository that contains models, datasets, and fine-tuning techniques for DB-GPT, with the purpose of enhancing model performance, especially in Text-to-SQL.

DB-GPT-Hub: Text-to-SQL parsing with LLMs

Simplified Chinese | Discord | WeChat

1. What is DB-GPT-Hub

DB-GPT-Hub is an experimental project that implements Text-to-SQL parsing with LLMs, covering dataset collection, data pre-processing, model selection and construction, fine-tuning of model weights, and more. This pipeline reduces model training cost while improving Text-to-SQL capability, and lowers the barrier for developers to contribute to improving Text-to-SQL accuracy. The ultimate goal is automatic question answering over databases, allowing users to complete complex database query operations through natural language descriptions.

2. Fine-tuning Text-to-SQL

Large Language Models (LLMs) have achieved impressive results on existing Text-to-SQL benchmarks. However, they still struggle with large databases and noisy content, and making sense of the vast values hidden in a database requires external knowledge and reasoning. We enhance Text-to-SQL capability through continued supervised fine-tuning (SFT) of large language models.

2.1. Dataset

The following publicly available text-to-sql datasets are used for this project:

  • WikiSQL: A large semantic parsing dataset consisting of 80,654 natural language expressions and SQL annotations over 24,241 tables. Each query in WikiSQL is limited to a single table and does not include complex operations such as sorting, grouping, or subqueries.
  • SPIDER: A complex cross-domain text2sql dataset containing 10,181 natural language queries and 5,693 unique SQL queries distributed across 200 separate databases, covering 138 different domains.
  • CHASE: A cross-domain, multi-round interactive Chinese text2sql dataset containing 5,459 multi-round question sequences comprising 17,940 <query, SQL> pairs across 280 databases from different domains.
  • BIRD-SQL: A large-scale cross-domain text-to-SQL benchmark in English, with a particular focus on large database content. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 occupational domains. The BIRD-SQL dataset bridges the gap between text-to-SQL research and real-world applications by exploring three additional challenges, namely dealing with large and messy database values, external knowledge inference and optimising SQL execution efficiency.
  • CoSQL: A corpus for building cross-domain conversational text-to-SQL systems, created as a conversational version of the Spider and SParC tasks. CoSQL consists of 30k+ turns and 10k+ annotated SQL queries, collected in a Wizard-of-Oz setting from 3k conversations querying 200 complex databases across 138 domains. Each conversation simulates a realistic DB query scenario in which a staff member explores the database as a user and a SQL expert uses SQL to retrieve answers, clarify ambiguous questions, or otherwise provide information.

2.2. Model

DB-GPT-HUB currently supports the following base models:

  • LLaMa/LLaMa2

    • alpaca
    • vicuna
    • guanaco
  • Falcon

  • BLOOM

  • ChatGLM

  • WizardLM

The approximate hardware resources required to quantize and fine-tune the model are as follows:

Model Parameters | GPU RAM        | CPU RAM | DISK
7b               | 4.8GB (14.7GB) | 3.6GB   | 36.4GB
13b              | 8.4GB (28.7GB) | 5.9GB   | 60.2GB
33b              | 18.3GB (OOM)   | 8.4GB   | 122GB
65b              | 38.7GB (OOM)   | 13.1GB  | 434GB
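
For reference only, a minimal sketch of loading a base model with 4-bit quantization (the setting QLoRA builds on) via Hugging Face transformers and bitsandbytes; the model path is a placeholder and the exact configuration used in this project's scripts may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization keeps a 7b model within the GPU RAM budget listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_path = "model/llama-7b"  # placeholder: weights placed under the model folder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)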

2.3. Fine-tuning methods

Spider + QLoRA/LoRA + LLM (Falcon/Vicuna/Guanaco/LLaMa)

This experimental project builds a dataset by adding table structure information, adjusts the parameters of the language model, and then fine-tunes the LLM with QLoRA/LoRA, aiming to reduce the cost of fine-tuning while increasing the accuracy and speed of SQL generation. Fine-tuning can be launched with the following commands:

sh scripts/qlora/qlora.sh
sh scripts/lora/lora.sh
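
For illustration, a minimal sketch of the kind of LoRA adapter configuration such scripts apply with the peft library; the rank, alpha, dropout, and target modules below are placeholder values, not necessarily those used in scripts/qlora/qlora.sh or scripts/lora/lora.sh:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder hyperparameters; the project's scripts may use different values.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMa-style models
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("model/llama-7b")  # placeholder path
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable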

3. Usage

3.1. Environment preparation

git clone https://github.com/csunny/DB-GPT-Hub.git
cd DB-GPT-Hub
conda create -n dbgpt_hub python=3.10 
conda activate dbgpt_hub
pip install -r requirements.txt 
mkdir model 

Put the model weight files under the newly created model folder.
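
If, for example, the base weights are hosted on the Hugging Face Hub, they can be downloaded into the model folder with a sketch like the following (the repo id and target directory are placeholders):

from huggingface_hub import snapshot_download

# Placeholder repo id; substitute the base model you intend to fine-tune.
snapshot_download(
    repo_id="huggyllama/llama-7b",
    local_dir="model/llama-7b",
)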

3.2. Data preparation

DB-GPT-HUB prepares its data with an information-matching generation approach, i.e. SQL generation combined with repository (table) information. By combining information about the data tables, the model can better understand their structure and relationships, making it well suited to generating SQL statements that satisfy the request.

Before running, you need to download the SQL dataset and place it in this directory. Here we take the Spider dataset as an example; it consists of three main parts:

  • train_spider.json: stores each text-to-SQL QA pair together with the related database metadata as a JSON file (an abridged example record is shown after this list)
    • db_id: the name of the database
    • question: the question posed to the database in natural language
    • query: the SQL code that answers the natural language question exactly
  • train_gold.sql: the gold (ground-truth) SQL for each question
  • database: the database source files
    • schema.sql: the table creation statements.
    • sqlite: the concrete contents of the database.
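
For reference, an abridged illustration of what a single entry in train_spider.json looks like (tokenized and parsed-SQL fields are omitted; the values come from the concert_singer example used later in this document):

{
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer"
}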

First, we extract the relevant information from the data above, such as the QA pairs, table structure, and database content, into the following format:

{
    "query": sample["query"],
    "question": sample["question"],
    "db_id": db_id,
    "db_path": db_path,
    "db_table_names": schema["table_names_original"],
    "db_column_names": [
        {"table_id": table_id, "column_name": column_name}
        for table_id, column_name in schema["column_names_original"]
    ],
    "db_column_types": schema["column_types"],
    "db_primary_keys": [{"column_id": column_id} for column_id in schema["primary_keys"]],
    "db_foreign_keys": [
        {"column_id": column_id, "other_column_id": other_column_id}
        for column_id, other_column_id in schema["foreign_keys"]
    ],
}

This data is then expressed in natural language, e.g.:

{ "instruction": "concert_singer contains tables such as stadium, singer, concert, singer_in_concert. Table stadium has columns such as stadium_id. location, name, capacity, highest, lowest, average. table stadium has columns such as stadium_id, location, name, capacity, highest, lowest, average. stadium_id is the primary key. table singer has columns such as singer_id, name, country, song_name, song_release_year name, song_release_year, age, is_male. singer_id is the primary key. table concert has columns such as concert_id, concert_name, theme, stadium_id. year. Table singer_in_concert has columns such as concert_id, singer_id. concert_id is the primary key. The year of concert is the foreign key of location of stadium. The stadium_id of singer_in_concert is the foreign key of name of singer. concert is the foreign key of concert_name of concert.". 
"input": "How many singers do we have?". 
"response": "concert_singer | select count(*) from singer"}

The code implementation of the above data pre-processing section is as follows:

python dbgpt_hub/utils/sql_data_process.py
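
The actual implementation is in dbgpt_hub/utils/sql_data_process.py. As a rough illustration only, the schema serialization it performs amounts to something like the sketch below (field names follow the record format shown above; the wording of the generated text and the handling of foreign keys are approximations, not the project's code):

def schema_to_instruction(sample):
    # Illustrative sketch only; the real implementation may differ in wording and detail.
    columns = sample["db_column_names"]
    parts = [sample["db_id"] + " contains tables such as "
             + ", ".join(sample["db_table_names"]) + "."]
    for table_id, table_name in enumerate(sample["db_table_names"]):
        cols = [c["column_name"] for c in columns if c["table_id"] == table_id]
        parts.append("Table " + table_name + " has columns such as " + ", ".join(cols) + ".")
    for pk in sample["db_primary_keys"]:
        parts.append(columns[pk["column_id"]]["column_name"] + " is the primary key.")
    # Foreign keys would be rendered analogously from sample["db_foreign_keys"].
    return " ".join(parts)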

When fine-tuning the model, we also customize the prompt dict to optimize the input:

SQL_PROMPT_DICT = {
    "prompt_input": (
        "I want you to act as a SQL terminal in front of an example database. "
        "Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\n"
        "##Instruction:\n{instruction}\n\n###Input:\n{input}\n\n###Response: "
    ),
    "prompt_no_input": (
        "I want you to act as a SQL terminal in front of an example database. "
        "Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\n"
        "####Instruction:\n{instruction}\n\n### Response: "
    ),
}
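
As an illustration of how this prompt dict is typically applied when building a training input, a minimal sketch (the helper name is ours, not necessarily the project's):

def format_example(example):
    # Pick the template based on whether an "input" field is present, then fill it in.
    if example.get("input"):
        prompt = SQL_PROMPT_DICT["prompt_input"].format(
            instruction=example["instruction"], input=example["input"])
    else:
        prompt = SQL_PROMPT_DICT["prompt_no_input"].format(
            instruction=example["instruction"])
    return prompt, example["response"]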

3.3. Model fine-tuning

Model fine-tuning uses the QLoRA/LoRA method. To fine-tune with QLoRA, run:

python train_qlora.py --model_name_or_path <path_or_name>

The fine-tuned model weights are saved under the adapter folder by default. The full training script is in scripts/qlora/qlora.sh. For multi-GPU runs, scripts/spider_qlora_finetune.sh is based on QLoRA by default, so it is recommended to specify the GPU IDs at the start of the command, e.g. change python src/train/train_qlora.py to CUDA_VISIBLE_DEVICES=0,1,2,3 python src/train/train_qlora.py. To fine-tune with LoRA instead, run:

python train_lora.py --model_name_or_path <path_or_name>

The full training script is in scripts/lora/.

3.4. Merge weights

Run the following command to generate the final merged model:

python dbgpt_hub/utils/merge_peft_adapters.py --base_model_name_or_path <path_or_name>
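
Under the hood, merging a PEFT adapter back into its base model generally looks like the following sketch using the peft library; the paths are placeholders and the script's exact arguments may differ:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_path = "model/llama-7b"   # placeholder: base model weights
adapter_path = "adapter"             # placeholder: fine-tuned LoRA/QLoRA adapter
output_path = "model/llama-7b-sql"   # placeholder: where to save the merged model

base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()     # fold the adapter weights into the base weights

model.save_pretrained(output_path)
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(output_path)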

4. RoadMap

We will divide the whole process into three stages:

  • Stage 1:
    • LLaMa/LLaMa2
      • LoRA
      • QLoRA
    • Falcon
      • LoRA
      • QLoRA
    • ChatGLM
    • BLOOM
    • WizardLM
  • Stage 2:
    • Optimize model performance and release the optimized DB-GPT-SFT model
  • Stage 3:
    • Evaluate the dataset and methodological criteria

5. Contributions

We welcome your active participation and feedback on the datasets, model fine-tuning, and performance evaluation.

6. Acknowledgements

Thanks to the following open-source projects.
