• Stars
    star
    191
  • Rank 201,724 (Top 4 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 2 years ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

(NeurIPS 2021) Revisiting Deep Learning Models for Tabular Data

Revisiting Deep Learning Models for Tabular Data (NeurIPS 2021)

This is the official implementation of the paper "Revisiting Deep Learning Models for Tabular Data".

📜 arXiv
📚 Other projects on tabular deep learning


Table of Contents:


1. The main results

The tables from the main text (with extra details) can be found in this notebook.

2. Overview

The code is organized as follows:

  • bin:
    • training code for all the models
    • ensemble.py performs ensembling
    • tune.py tunes models
    • report.ipynb summarizes all the results
    • code for the section "When FT-Transformer is better than ResNet?" of the paper:
      • analysis_gbdt_vs_nn.py runs the experiments
      • create_synthetic_data_plots.py builds plots
  • lib contains common tools used by programs in bin
  • output contains configuration files (inputs for programs in bin) and results (metrics, tuned configurations, etc.)

The results are represented with numerous JSON files that are scatterd all over the output directory. Check bin/report.ipynb to see how the results can be summarized.

3. Setup the environment

3.1. PyTorch environment

Install conda

export PROJECT_DIR=<ABSOLUTE path to the repository root>
# example: export PROJECT_DIR=/home/myusername/repositories/revisiting-models
git clone https://github.com/yandex-research/tabular-dl-revisiting-models $PROJECT_DIR
cd $PROJECT_DIR

conda create -n revisiting-models python=3.8.8
conda activate revisiting-models

conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1.243 numpy=1.19.2 -c pytorch -y
conda install cudnn=7.6.5 -c anaconda -y
pip install -r requirements.txt
conda install nodejs -y
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# if the following commands do not succeed, update conda
conda env config vars set PYTHONPATH=${PYTHONPATH}:${PROJECT_DIR}
conda env config vars set PROJECT_DIR=${PROJECT_DIR}
conda env config vars set LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
conda env config vars set CUDA_HOME=${CONDA_PREFIX}
conda env config vars set CUDA_ROOT=${CONDA_PREFIX}

conda deactivate
conda activate revisiting-models

3.2. TensorFlow environment

This environment is needed only for experimenting with TabNet. For all other cases use the PyTorch environment.

The instructions are the same as for the PyTorch environment (including installation of PyTorch!), but:

  • python=3.7.10
  • cudatoolkit=10.0
  • right before pip install -r requirements.txt do the following:
    • pip install tensorflow-gpu==1.14
    • comment out tensorboard in requirements.txt

3.3. Data

LICENSE: by downloading our dataset you accept licenses of all its components. We do not impose any new restrictions in addition to those licenses. You can find the list of sources in the section "References" of our paper.

  1. Download the data: wget https://www.dropbox.com/s/o53umyg6mn3zhxy/data.tar.gz?dl=1 -O revisiting_models_data.tar.gz
  2. Move the archive to the root of the repository: mv revisiting_models_data.tar.gz $PROJECT_DIR
  3. Go to the root of the repository: cd $PROJECT_DIR
  4. Unpack the archive: tar -xvf revisiting_models_data.tar.gz

4. Tutorial (how to reproduce results)

This section only provides specific commands with few comments. After completing the tutorial, we recommend checking the next section for better understanding of how to work with the repository. It will also help to better understand the tutorial.

In this tutorial, we will reproduce the results for MLP on the California Housing dataset. We will cover:

  • tuning
  • evaluation
  • ensembling
  • comparing models with each other

Note that the chances to get exactly the same results are rather low, however, they should not differ much from ours. Before running anything, go to the root of the repository and explicitly set CUDA_VISIBLE_DEVICES (if you plan to use GPU):

cd $PROJECT_DIR
export CUDA_VISIBLE_DEVICES=0

4.1. Check the environment

Before we start, let's check that the environment is configured successfully. The following commands should train one MLP on the California Housing dataset:

mkdir draft
cp output/california_housing/mlp/tuned/0.toml draft/check_environment.toml
python bin/mlp.py draft/check_environment.toml

The result should be in the directory draft/check_environment. For now, the content of the result is not important.

4.2. Tuning

Our config for tuning MLP on the California Housing dataset is located at output/california_housing/mlp/tuning/0.toml. In order to reproduce the tuning, copy our config and run your tuning:

# you can choose any other name instead of "reproduced.toml"; it is better to keep this
# name while completing the tutorial
cp output/california_housing/mlp/tuning/0.toml output/california_housing/mlp/tuning/reproduced.toml
# let's reduce the number of tuning iterations to make tuning fast (and ineffective)
python -c "
from pathlib import Path
p = Path('output/california_housing/mlp/tuning/reproduced.toml')
p.write_text(p.read_text().replace('n_trials = 100', 'n_trials = 5'))
"
python bin/tune.py output/california_housing/mlp/tuning/reproduced.toml

The result of your tuning will be located at output/california_housing/mlp/tuning/reproduced, you can compare it with ours: output/california_housing/mlp/tuning/0. The file best.toml contains the best configuration that we will evaluate in the next section.

4.3. Evaluation

Now we have to evaluate the tuned configuration with 15 different random seeds.

# create a directory for evaluation
mkdir -p output/california_housing/mlp/tuned_reproduced

# clone the best config from the tuning stage with 15 different random seeds
python -c "
for seed in range(15):
    open(f'output/california_housing/mlp/tuned_reproduced/{seed}.toml', 'w').write(
        open('output/california_housing/mlp/tuning/reproduced/best.toml').read().replace('seed = 0', f'seed = {seed}')
    )
"

# train MLP with all 15 configs
for seed in {0..14}
do
    python bin/mlp.py output/california_housing/mlp/tuned_reproduced/${seed}.toml
done

Our directory with evaluation results is located right next to yours, namely, at output/california_housing/mlp/tuned.

4.4. Ensembling

# just run this single command
python bin/ensemble.py mlp output/california_housing/mlp/tuned_reproduced

Your results will be located at output/california_housing/mlp/tuned_reproduced_ensemble, you can compare it with ours: output/california_housing/mlp/tuned_ensemble.

4.5. "Visualize" results

Use bin/report.ipynb:

  • find the cell "All Neural Networks"; the next cell contains many lines of this kind: ('algorithm/experiment', 'PrettyAlgorithmName', datasets)
  • uncomment the line relevant to the tutorial; it should look like this: ('mlp/tuned_reproduced', 'MLP | reproduced', [CALIFORNIA]),
  • run the updated cell
  • in order to do the same for the ensembles, take inspiration from other cells, where ensembles are used

4.6. What about other models and datasets?

Similar steps can be performed for all models and datasets. The tuning process is slightly different in the case of grid search: you have to run all desired configurations and manually choose the best one based on the validation performance. For example, see output/epsilon/ft_transformer.

5. How to work with the repository

5.1. How to run scripts

You should run Python scripts from the root of the repository. Most programs expect a configuration file as the only argument. The output will be a directory with the same name as the config, but without the extention. Configs are written in TOML. The lists of possible arguments for the programs are not provided and should be inferred from scripts (usually, the config is represented with the args variable in scripts). If you want to use CUDA, you must explicitly set the CUDA_VISIBLE_DEVICES environment variable. For example:

# The result will be at "path/to/my_experiment"
CUDA_VISIBLE_DEVICES=0 python bin/mlp.py path/to/my_experiment.toml

# The following example will run WITHOUT CUDA
python bin/mlp.py path/to/my_experiment.toml

If you are going to use CUDA all the time, you can save the environment variable in the Conda environment:

conda env config vars set CUDA_VISIBLE_DEVICES="0"

The -f (--force) option will remove the existing results and run the script from scratch:

python bin/whatever.py path/to/config.toml -f  # rewrites path/to/config

bin/tune.py supports continuation:

python bin/tune.py path/to/config.toml --continue

5.2. stats.json and other results

For all scripts, stats.json is the most important part of output. The content varies from program to program. It can contain:

  • metrics
  • config that was passed to the program
  • hardware info
  • execution time
  • and other information

Predictions for train, validation and test sets are usually also saved.

5.3. Conclusion

Now, you know everything you need to reproduce all the results and extend this repository for your needs. The tutorial also should be more clear now. Feel free to open issues and ask questions.

6. How to cite

@inproceedings{gorishniy2021revisiting,
    title={Revisiting Deep Learning Models for Tabular Data},
    author={Yury Gorishniy and Ivan Rubachev and Valentin Khrulkov and Artem Babenko},
    booktitle={{NeurIPS}},
    year={2021},
}

More Repositories

1

rtdl

Research on Tabular Deep Learning: Papers & Packages
Python
864
star
2

ddpm-segmentation

Label-Efficient Semantic Segmentation with Diffusion Models (ICLR'2022)
Python
655
star
3

tab-ddpm

[ICML 2023] The official implementation of the paper "TabDDPM: Modelling Tabular Data with Diffusion Models"
Python
346
star
4

navigan

Navigating the GAN Parameter Space for Semantic Image Editing
Jupyter Notebook
296
star
5

rtdl-num-embeddings

(NeurIPS 2022) On Embeddings for Numerical Features in Tabular Deep Learning
Python
295
star
6

tabular-dl-tabr

The implementation of "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning"
Python
231
star
7

swarm

Official code for "SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient"
Python
115
star
8

DeDLOC

Official code for "Distributed Deep Learning in Open Collaborations" (NeurIPS 2021)
Jupyter Notebook
114
star
9

RuLeanALBERT

RuLeanALBERT is a pretrained masked language model for the Russian language that uses a memory-efficient architecture.
Python
89
star
10

heterophilous-graphs

A Critical Look at the Evaluation of GNNs under Heterophily: Are We Really Making Progress?
Python
86
star
11

invertible-cd

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
Python
63
star
12

GBDT-uncertainty

Jupyter Notebook
50
star
13

graph-glove

PyTorch code for the EMNLP 2020 paper "Embedding Words in Non-Vector Space with Unsupervised Graph Learning"
Python
40
star
14

DVAR

Official implementation of "Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics" (NeurIPS 2023)
Python
35
star
15

sparqling-queries

This repo in the implementation of EMNLP'21 paper "SPARQLing Database Queries from Intermediate Question Decompositions" by Irina Saparina, Anton Osokin
Python
34
star
16

moshpit-sgd

"Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices", official implementation
Jupyter Notebook
28
star
17

adaptive-diffusion

[CVPR2024] Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
Python
26
star
18

gan-transfer

Supplementary code for "When, Why, and Which Pretrained GANs Are Useful?" (ICLR'22)
Jupyter Notebook
24
star
19

specexec

Python
15
star
20

btard

Code for the paper "Secure Distributed Training at Scale" (ICML 2022)
Python
14
star
21

structural-graph-shifts

Evaluating Robustness and Uncertainty of Graph Models Under Structural Distributional Shifts (NeurIPS'23)
Python
10
star
22

crosslingual_winograd

"It's All in the Heads" (Findings of ACL 2021), official implementation and data
Python
10
star
23

distill-nf

Code for the paper: Distilling the Knowledge from Conditional Normalizing Flows
Jupyter Notebook
9
star
24

gan_vs_diff_sr

Does Diffusion Beat GAN in Image Super Resolution?
8
star
25

classification-measures

Official implementation and data for 'Good Classification Measures and How to Find Them' (NeurIPS 2021)
Python
7
star
26

tabred

A Benchmark of Tabular Machine Learning in-the-Wild with real-world industry-grade tabular datasets
Python
7
star
27

text-to-img-hypernymy

Official code for "Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy"
Jupyter Notebook
6
star
28

learnable-init

Code for the paper: Discovering Weight Initializers with Meta-Learning
Jupyter Notebook
5
star
29

mind-your-format

Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Jupyter Notebook
5
star
30

dnar

The implementation of "Discrete Neural Algorithmic Reasoning"
Python
5
star
31

proxy-dirichlet-distillation

Implementation of "Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets" (NeurIPS 2021) and "Uncertainty Estimation in Autoregressive Structured Prediction" (ICLR 2021)
Python
3
star
32

msr

An official repository of "Multi-Sentence Resampling: A Simple Approach to Alleviate Dataset Length Bias and Beam-Search Degradation"
Python
2
star
33

vqdm

Official repository for VQDM:Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization paper
1
star