  • Language: Python
  • License: Apache License 2.0

Code and data for Sato https://arxiv.org/abs/1911.06311.

Sato: Contextual Semantic Type Detection in Tables

This repository includes source code, scripts, and data for training the Sato model. The repo also includes a pretrained model to help replicate the results in our VLDB 2020 paper. Sato is a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction.

Above: Sato architecture. Sato's hybrid architecture consists of two basic modules: a topic-aware single-column prediction module and a structured output prediction module. The topic-aware module extends Sherlock's single-column prediction model (a deep neural network) with additional topic subnetworks, incorporating table intent into the model. The structured output prediction module then combines the topic-aware results for all m columns, providing the final semantic type prediction for the columns in the table.
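To make the role of the structured output prediction module more concrete, here is a minimal sketch of Viterbi decoding over a linear chain of columns. It assumes per-column type scores (standing in for the topic-aware module's outputs) and a table of pairwise scores between the types of adjacent columns. The function name `viterbi_decode`, the toy scores, and the three type names are illustrative assumptions, not code or values from this repository.

```python
from typing import List, Sequence


def viterbi_decode(unary: Sequence[Sequence[float]],
                   pairwise: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring type index for each column.

    unary[i][t]    : score of type t for column i (hypothetical stand-in
                     for a single-column model's output).
    pairwise[s][t] : score for type s being followed by type t in the
                     next column (hypothetical values).
    """
    if not unary:
        return []
    num_types = len(unary[0])
    best = list(unary[0])  # best[t] = best score of any path ending in type t
    backptr: List[List[int]] = []

    for col in range(1, len(unary)):
        new_best, pointers = [], []
        for t in range(num_types):
            # Pick the previous type that maximises the path score into t.
            prev = max(range(num_types), key=lambda s: best[s] + pairwise[s][t])
            pointers.append(prev)
            new_best.append(best[prev] + pairwise[prev][t] + unary[col][t])
        best = new_best
        backptr.append(pointers)

    # Trace back from the best final type.
    path = [max(range(num_types), key=lambda t: best[t])]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example with three hypothetical types.
TYPES = ["city", "country", "name"]
unary = [
    [2.0, 0.5, 0.1],   # column 0 clearly looks like "city"
    [0.9, 1.0, 0.8],   # column 1 is ambiguous on its own
]
pairwise = [
    [-1.0, 1.5, 0.0],  # "city" followed by "country" is rewarded
    [0.0, -1.0, 0.0],
    [0.0, 0.0, -1.0],
]
print([TYPES[t] for t in viterbi_decode(unary, pairwise)])  # ['city', 'country']
```

In this toy case the second column is nearly undecided by its own scores, and the pairwise term tips it towards "country" because the first column is confidently "city".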

What is Sato useful for?

Myriad data preparation and information retrieval tasks, including data cleaning, integration, discovery and search, rely on the ability to accurately detect data column types. Schema matching for data integration leverages data types to find correspondences between data columns across tables. Similarly, data discovery benefits from detecting types of data columns in order to return semantically relevant results for user queries. Recognizing the semantics of table values helps aggregate information from multiple tabular data sources. Search engines also rely on the detection of semantically relevant column names to extend support to tables. Natural language based query interfaces for structured data can also benefit from semantic type detection.

Demo

We set up a simple online demo where you can upload small tables and get semantic predictions for column types.

Environment setup

We recommend using a Python virtual environment:

mkdir -p ~/virtualenvs
virtualenv --python=python3 ~/virtualenvs/col2type

Fill in and set paths:

export BASEPATH=[path to the repo]
# RAW_DIR can be empty if using extracted feature files.
export RAW_DIR=[path to the raw data]
export SHERLOCKPATH=$BASEPATH/sherlock
export EXTRACTPATH=$BASEPATH/extract
export PYTHONPATH=$PYTHONPATH:$SHERLOCKPATH
export PYTHONPATH=$PYTHONPATH:$BASEPATH
export TYPENAME='type78' 

source ~/virtualenvs/col2type/bin/activate
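
Before running any script, it can help to confirm that the variables exported above are actually set in the current shell. The following is a small sketch of such a check; the helper name `missing_vars` is hypothetical and not part of this repository. RAW_DIR is deliberately left out because it may be empty when using the extracted feature files.

```python
import os
from typing import Iterable, List, Mapping

# Variables exported in the setup step. RAW_DIR is omitted because it
# may be empty when using the extracted feature files.
REQUIRED_VARS = ("BASEPATH", "SHERLOCKPATH", "EXTRACTPATH", "TYPENAME")


def missing_vars(environ: Mapping[str, str],
                 required: Iterable[str] = REQUIRED_VARS) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Unset variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```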

Install required packages

cd $BASEPATH
pip install -r requirements.txt

To select a GPU, set CUDA_VISIBLE_DEVICES to its ID. Set CUDA_VISIBLE_DEVICES="" to run on CPU.
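
When the scripts are launched from Python rather than from the shell, the same variable can be set on the child process's environment. The sketch below shows one way to build that environment; `env_for_device` is a hypothetical helper name, not something this repository provides.

```python
import os
from typing import Dict, Mapping, Optional


def env_for_device(gpu_id: Optional[int],
                   base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and pin the process to one GPU, or to CPU.

    gpu_id=None sets CUDA_VISIBLE_DEVICES="" so that no GPU is visible
    to the launched process.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = "" if gpu_id is None else str(gpu_id)
    return env


# The returned mapping can be passed as env= to subprocess.run when
# launching a script.
print(env_for_device(0, base={})["CUDA_VISIBLE_DEVICES"])            # 0
print(repr(env_for_device(None, base={})["CUDA_VISIBLE_DEVICES"]))   # ''
```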

Replicating results

Results in the paper can be replicated with the pre-trained models and the features we extracted.

  1. Download data: ./download_data.sh
  2. Run experiments: cd $BASEPATH/scripts; ./exp.sh
  3. Generate plots from notebooks/FinalPlotsPaper

Additional

This repo also allows training new Sato models with other hyper-parameters or extracting features from additional data.

Download the VIZNET data and set RAW_DIR to the location of the VIZNET raw data.

Column feature extraction

cd $BASEPATH/extract
python extract_features.py [corpus_chunk] --f sherlock --num_processes [N]

corpus_chunk: corpus name with an optional partition suffix, e.g. webtables0-p1, plotly-p1
N: number of processes used to extract features

Table topic feature extraction

Download nltk data

import nltk
nltk.download('stopwords')
nltk.download('punkt')
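
The two NLTK resources are a stop-word list (`stopwords`) and a tokenizer model (`punkt`). As a rough illustration of how a whole table can be turned into a bag-of-words document for a topic model, the sketch below uses a regular-expression tokenizer and a tiny hard-coded stop-word set in place of NLTK, so that it runs without any downloads. The function name `table_to_tokens` and the stop-word set are assumptions made for illustration; the repository's own preprocessing may differ.

```python
import re
from collections import Counter
from typing import Iterable, List

# Tiny stand-in for NLTK's English stop-word list (assumption).
STOPWORDS = {"the", "of", "and", "a", "in", "to"}
TOKEN_RE = re.compile(r"[a-z]+")


def table_to_tokens(columns: Iterable[Iterable[str]]) -> List[str]:
    """Flatten all cell values of a table into one list of lowercase tokens."""
    tokens: List[str] = []
    for column in columns:
        for cell in column:
            for tok in TOKEN_RE.findall(str(cell).lower()):
                if tok not in STOPWORDS:
                    tokens.append(tok)
    return tokens


# A toy two-column table, given as a list of columns.
table = [
    ["Paris", "Rome", "City of London"],
    ["France", "Italy", "United Kingdom"],
]
bag = Counter(table_to_tokens(table))
print(sorted(bag))
```

The resulting token counts are the kind of input a topic model such as LDA consumes, with one "document" per table.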

[Optional] To train a new LDA model

cd topic_model
python train_LDA.py 

Extract topic features

cd $BASEPATH/extract
python extract_features.py [corpus_chunk] --f topic --LDA [LDA_name] --num_processes [N]

corpus_chunk: corpus name with an optional partition suffix, e.g. webtables0-p1, plotly-p1
LDA_name: name of the LDA model used to extract topic features. Models are located in topic_model/LDA_cache
N: number of processes used to extract features

The extracted feature files go to extract/out/features/[TYPENAME].
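
When several corpus chunks need both feature groups, the two extraction commands above can be generated in a loop. The sketch below only builds the argument lists, using the flags exactly as they appear in those commands; the helper name `extract_command` and the LDA model name `my_lda` are placeholders, not names from this repository.

```python
from typing import List, Optional


def extract_command(corpus_chunk: str, feature: str, num_processes: int,
                    lda_name: Optional[str] = None) -> List[str]:
    """Build the argv for extract_features.py for one corpus chunk."""
    if feature not in ("sherlock", "topic"):
        raise ValueError(f"unknown feature group: {feature}")
    if feature == "topic" and lda_name is None:
        raise ValueError("topic features need an LDA model name")
    cmd = ["python", "extract_features.py", corpus_chunk, "--f", feature]
    if feature == "topic":
        cmd += ["--LDA", lda_name]
    cmd += ["--num_processes", str(num_processes)]
    return cmd


# Chunk names follow the examples in the text; "my_lda" is a placeholder.
for chunk in ["webtables0-p1", "plotly-p1"]:
    print(" ".join(extract_command(chunk, "sherlock", 4)))
    print(" ".join(extract_command(chunk, "topic", 4, lda_name="my_lda")))
```

Each list can then be run from $BASEPATH/extract, for example with `subprocess.run(cmd, cwd=..., check=True)`.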

Split train/test sets

Split the dataset into training and test sets (80/20).

cd $BASEPATH/extract
python split_train_test.py --multi_col_only [m_col] --corpus_list [c_list]

m_col: if --multi_col_only is set, filter the result and remove tables with only one column
c_list: corpus list

Output is a dictionary with entries ['train','test']. Dictionary values are lists of table_id.
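
As an illustration of the output format, the following sketch performs an 80/20 split over table IDs and returns a dictionary with 'train' and 'test' entries, optionally dropping single-column tables first. It is not the logic of split_train_test.py; the function name `split_tables`, the input mapping from table ID to column count, and the fixed seed are all assumptions.

```python
import random
from typing import Dict, List


def split_tables(table_columns: Dict[str, int], multi_col_only: bool = False,
                 train_frac: float = 0.8, seed: int = 0) -> Dict[str, List[str]]:
    """Split table IDs into train/test lists.

    table_columns maps table_id -> number of columns (hypothetical input).
    """
    ids = sorted(tid for tid, n_cols in table_columns.items()
                 if not multi_col_only or n_cols > 1)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    cut = int(len(ids) * train_frac)
    return {"train": ids[:cut], "test": ids[cut:]}


# Ten toy tables; t0 and t5 have a single column and are filtered out.
tables = {f"t{i}": (1 if i % 5 == 0 else 3) for i in range(10)}
split = split_tables(tables, multi_col_only=True)
print(len(split["train"]), len(split["test"]))  # 6 2
```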

Train Sato

cd $BASEPATH/model
python train_CRF_LC.py -c [config_file]

Check out train_CRF_LC.py for supported configurations.

Original tables

Please see table_data for the original tables used for the experiments.

Citing Sato

Please cite our VLDB 2020 paper:

@article{zhang2020sato,
    title={Sato: Contextual Semantic Type Detection in Tables},
    author={Dan Zhang and 
            Yoshihiko Suhara and 
            Jinfeng Li and 
            Madelon Hulsebos and 
            {\c{C}}a{\u{g}}atay Demiralp and 
            Wang-Chiew Tan},
    year = {2020},
    volume = {13},
    number = {12},
    journal = {Proc. VLDB Endow.},
    pages = {1835--1848},
    numpages = {14},
    url = {https://doi.org/10.14778/3407790.3407793}
}

Contact

To get help with problems using Sato or replicating our results, please submit a GitHub issue.
