WikiSQL
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is the dataset released along with our work Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.
Citation
If you use WikiSQL, please cite the following work:
Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning.
@article{zhongSeq2SQL2017,
author = {Victor Zhong and
Caiming Xiong and
Richard Socher},
title = {Seq2SQL: Generating Structured Queries from Natural Language using
Reinforcement Learning},
journal = {CoRR},
volume = {abs/1709.00103},
year = {2017}
}
Notes
Regarding tokenization and Stanza --- when WikiSQL was written 3-years ago, it relied on Stanza, a CoreNLP python wrapper that has since been deprecated. If you'd still like to use the tokenizer, please use the docker image. We do not anticipate switching to the current Stanza as changes to the tokenizer would render the previous results not reproducible.
Leaderboard
If you submit papers on WikiSQL, please consider sending a pull request to merge your results onto the leaderboard. By submitting, you acknowledge that your results are obtained purely by training on the training split and tuned on the dev split (e.g. you only evaluted on the test set once). Moreover, you acknowledge that your models only use the table schema and question during inference. That is they do not use the table content. Update (May 12, 2019): We now have a separate leaderboard for weakly supervised models that do not use logical forms during training.
Weakly supervised without logical forms
Model | Dev execution accuracy | Test execution accuracy |
---|---|---|
TAPEX (Liu 2022) | 89.2 | 89.5 |
HardEM (Min 2019) | 84.4 | 83.9 |
LatentAlignment (Wang 2019) | 79.4 | 79.3 |
MeRL (Agarwal 2019) | 74.9 +/- 0.1 | 74.8 +/- 0.2 |
MAPO (Liang 2018) | 72.2 +/- 0.2 | 72.1 +/- 0.3 |
Rule-SQL (Guo 2019) | 61.1 +/- 0.2 | 61.0 +/- 0.3 |
Supervised via logical forms
Model | Dev logical form accuracy |
Dev execution accuracy |
Test logical form accuracy |
Test execution accuracy |
Uses execution |
---|---|---|---|---|---|
SeaD +Execution-Guided Decoding (Xu 2021) (Ant Group, Ada & ZhiXiaoBao) |
87.6 | 92.9 | 87.5 | 93.0 | Inference |
SDSQL +Execution-Guided Decoding (Hui 2020) (Alibaba Group) |
87.1 | 92.6 | 87.0 | 92.7 | Inference |
IE-SQL +Execution-Guided Decoding (Ma 2020) (Ping An Life, AI Team) |
87.9 | 92.6 | 87.8 | 92.5 | Inference |
HydraNet +Execution-Guided Decoding (Lyu 2020) (Microsoft Dynamics 365 AI) (code) |
86.6 | 92.4 | 86.5 | 92.2 | Inference |
BRIDGE^ +Execution-Guided Decoding (Lin 2020) (Salesforce Research) |
86.8 | 92.6 | 86.3 | 91.9 | Inference |
X-SQL +Execution-Guided Decoding (He 2019) |
86.2 | 92.3 | 86.0 | 91.8 | Inference |
SDSQL (Hui 2020) (Alibaba Group) |
86.0 | 91.8 | 85.6 | 91.4 | |
BRIDGE^ (Lin 2020) (Salesforce Research) |
86.2 | 91.7 | 85.7 | 91.1 | |
Text2SQLGen + EG (Mellah 2021) (Novelis.io Research) |
91.2 | 91.0 | Inference | ||
SeqGenSQL+EG (Li 2020) | 90.8 | 90.5 | Inference | ||
SeqGenSQL (Li 2020) | 90.6 | 90.3 | Inference | ||
SeaD (Xu 2021) (Ant Group, Ada & ZhiXiaoBao) |
84.9 | 90.2 | 84.7 | 90.1 | Inference |
(Guo 2019) +Execution-Guided Decoding with BERT-Base-Uncased^ |
85.4 | 91.1 | 84.5 | 90.1 | Inference |
SQLova +Execution-Guided Decoding (Hwang 2019) |
84.2 | 90.2 | 83.6 | 89.6 | Inference |
IncSQL +Execution-Guided Decoding (Shi 2018) |
51.3 | 87.2 | 51.1 | 87.1 | Inference |
HydraNet (Lyu 2020) (Microsoft Dynamics 365 AI) (code) |
83.6 | 89.1 | 83.8 | 89.2 | |
(Guo 2019) with BERT-Base-Uncased^ |
84.3 | 90.3 | 83.7 | 89.2 | |
IE-SQL (Ma 2020) (Ping An Life, AI Team) |
84.6 | 88.7 | 84.6 | 88.8 | |
X-SQL (He 2019) |
83.8 | 89.5 | 83.3 | 88.7 | |
SQLova (Hwang 2019) |
81.6 | 87.2 | 80.7 | 86.2 | |
Execution-Guided Decoding (Wang 2018) |
76.0 | 84.0 | 75.4 | 83.8 | Inference |
IncSQL (Shi 2018) |
49.9 | 84.0 | 49.9 | 83.7 | |
Auxiliary Mapping Task (Chang 2019) |
76.0 | 82.3 | 75.0 | 81.7 | |
MQAN (unordered) (McCann 2018) |
76.1 | 82.0 | 75.4 | 81.4 | |
MQAN (ordered) (McCann 2018) |
73.5 | 82.0 | 73.2 | 81.4 | |
Coarse2Fine (Dong 2018) |
72.5 | 79.0 | 71.7 | 78.5 | |
TypeSQL (Yu 2018) |
- | 74.5 | - | 73.5 | |
PT-MAML (Huang 2018) |
63.1 | 68.3 | 62.8 | 68.0 | |
(Guo 2018) | 64.1 | 71.1 | 62.5 | 69.0 | |
SQLNet (Xu 2017) |
- | 69.8 | - | 68.0 | |
Wang 2017^ | 62.0 | 67.1 | 61.5 | 66.8 | |
Seq2SQL (Zhong 2017) |
49.5 | 60.8 | 48.3 | 59.4 | Training |
Baseline (Zhong 2017) |
23.3 | 37.0 | 23.4 | 35.9 |
^
indicates that table content is used directly by the model during training.
*
indicates that the order in where conditions is ignored.
Installation
Both the evaluation script as well as the dataset are stored within the repo. Only Python 3 is supported at the moment - I would very much welcome a pull request that ports the code to work with Python 2. The installation steps are as follows:
git clone https://github.com/salesforce/WikiSQL
cd WikiSQL
pip install -r requirements.txt
tar xvjf data.tar.bz2
This will unpack the data files into a directory called data
.
Content and format
Inside the data folder you will find the files in jsonl
and db
format.
The former can be read line by line, where each line is a serialized JSON object.
The latter is a SQLite3 database.
Question, query and table ID
These files are contained in the *.jsonl
files. A line looks like the following:
{
"phase":1,
"question":"who is the manufacturer for the order year 1998?",
"sql":{
"conds":[
[
0,
0,
"1998"
]
],
"sel":1,
"agg":0
},
"table_id":"1-10007452-3"
}
The fields represent the following:
phase
: the phase in which the dataset was collected. We collected WikiSQL in two phases.question
: the natural language question written by the worker.table_id
: the ID of the table to which this question is addressed.sql
: the SQL query corresponding to the question. This has the following subfields:sel
: the numerical index of the column that is being selected. You can find the actual column from the table.agg
: the numerical index of the aggregation operator that is being used. You can find the actual operator fromQuery.agg_ops
inlib/query.py
.conds
: a list of triplets(column_index, operator_index, condition)
where:column_index
: the numerical index of the condition column that is being used. You can find the actual column from the table.operator_index
: the numerical index of the condition operator that is being used. You can find the actual operator fromQuery.cond_ops
inlib/query.py
.condition
: the comparison value for the condition, in eitherstring
orfloat
type.
Tables
These files are contained in the *.tables.jsonl
files. A line looks like the following:
{
"id":"1-1000181-1",
"header":[
"State/territory",
"Text/background colour",
"Format",
"Current slogan",
"Current series",
"Notes"
],
"types":[
"text",
"text",
"text",
"text",
"text",
"text"
],
"rows":[
[
"Australian Capital Territory",
"blue/white",
"Yaa\u00b7nna",
"ACT \u00b7 CELEBRATION OF A CENTURY 2013",
"YIL\u00b700A",
"Slogan screenprinted on plate"
],
[
"New South Wales",
"black/yellow",
"aa\u00b7nn\u00b7aa",
"NEW SOUTH WALES",
"BX\u00b799\u00b7HI",
"No slogan on current series"
],
[
"New South Wales",
"black/white",
"aaa\u00b7nna",
"NSW",
"CPX\u00b712A",
"Optional white slimline series"
],
[
"Northern Territory",
"ochre/white",
"Ca\u00b7nn\u00b7aa",
"NT \u00b7 OUTBACK AUSTRALIA",
"CB\u00b706\u00b7ZZ",
"New series began in June 2011"
],
[
"Queensland",
"maroon/white",
"nnn\u00b7aaa",
"QUEENSLAND \u00b7 SUNSHINE STATE",
"999\u00b7TLG",
"Slogan embossed on plate"
],
[
"South Australia",
"black/white",
"Snnn\u00b7aaa",
"SOUTH AUSTRALIA",
"S000\u00b7AZD",
"No slogan on current series"
],
[
"Victoria",
"blue/white",
"aaa\u00b7nnn",
"VICTORIA - THE PLACE TO BE",
"ZZZ\u00b7562",
"Current series will be exhausted this year"
]
]
}
The fields represent the following:
id
: the table ID.header
: a list of column names in the table.rows
: a list of rows. Each row is a list of row entries.
Tables are also contained in a corresponding *.db
file.
This is a SQL database with the same information.
Note that due to the flexible format of HTML tables, the column names of tables in the database has been symbolized.
For example, for a table with the columns ['foo', 'bar']
, the columns in the database are actually col0
and col1
.
Scripts
evaluate.py
contains the evaluation script, whose options are:
usage: evaluate.py [-h] source_file db_file pred_file
positional arguments:
source_file source file for the prediction
db_file source database for the prediction
pred_file predictions by the model
optional arguments:
-h, --help show this help message and exit
The pred_file
, which is supplied by the user, should contain lines of serialized JSON objects.
Each JSON object should contain a query
field which corresponds to the query predicted for a line in the input *.jsonl
file and should be similar to the sql
field of the input.
In particular, it should contain:
sel
: the numerical index of the column that is being selected. You can find the actual column from the table.agg
: the numerical index of the aggregation operator that is being used. You can find the actual operator fromQuery.agg_ops
inlib/query.py
.conds
: a list of triplets(column_index, operator_index, condition)
where:column_index
: the numerical index of the condition column that is being used. You can find the actual column from the table.operator_index
: the numerical index of the condition operator that is being used. You can find the actual operator fromQuery.cond_ops
inlib/query.py
.condition
: the comparison value for the condition, in eitherstring
orfloat
type.
An example predictions file can be found in test/example.pred.dev.jsonl
.
The lib
directory contains dependencies of evaluate.py
.
Integration Test
We supply a sample predictions file for the dev set in test/example.pred.dev.jsonl.bz2
.
You can unzip this file using bunzip2 test/example.pred.dev.jsonl.bz2 -k
to look at what a real predictions file should look like.
We distribute a docker file which installs the necessary dependencies of this library and runs the evaluation script on this file.
The docker file also serves as an example of how to use the evaluation script.
To run the test, first build the image from the root directory:
docker build -t wikisqltest -f test/Dockerfile .
Next, run the image
docker run --rm --name wikisqltest wikisqltest
If everything works correctly, the output should be:
{
"ex_accuracy": 0.5380596128725804,
"lf_accuracy": 0.35375846099038116
}
Annotation
In addition to the raw data dump, we also release an optional annotation script that annotates WikiSQL using Stanford CoreNLP.
The annotate.py
script will annotate the query, question, and SQL table, as well as a sequence to sequence construction of the input and output for convenience of using Seq2Seq models.
To use annotate.py
, you must set up the CoreNLP python client using Stanford Stanza.
One docker image of the CoreNLP server that this works with is here:
docker run --name corenlp -d -p 9000:9000 vzhong/corenlp-server
Note that the sequence output contain symbols to delineate the boundaries of fields.
In lib/query.py
you will also find accompanying functions to reconstruct a query given a sequence output in the annotated format.
FAQ
I will update this list with frequently asked questions.
How do you convert HTML table columns to SQL table columns?
Web tables are noisy and are not directly transferrable into a database. One problem is that SQL column names need to be symbolic whereas web table columns usually have unicode characters, whitespaces etc. To handle this problem, we convert table columns to symbols (e.g.
Player Name
tocol1
) just before executing the query. For the implementation details, please seeevaluate.py
.
Changelog
- 1.1: Removed examples from each split that have gloss mismatch between the logical form conditions and the annotated question utterance.