  • Language: Python
  • License: Apache License 2.0
  • Created almost 3 years ago
  • Updated 12 months ago

AutoXGB

XGBoost + Optuna: no brainer

  • auto train XGBoost directly from CSV files
  • auto tune XGBoost using Optuna
  • auto serve the best XGBoost model using FastAPI

NOTE: PRs are currently not accepted. If there are issues/problems, please create an issue.

Installation

Install using pip

pip install autoxgb

Usage

Training a model using AutoXGB is a piece of cake. All you need is some tabular data.

Parameters

###############################################################################
### required parameters
###############################################################################

# path to training data
train_filename = "data_samples/binary_classification.csv"

# path to output folder to store artifacts
output = "output"

###############################################################################
### optional parameters
###############################################################################

# path to test data. if specified, the model will be evaluated on the test data
# and test_predictions.csv will be saved to the output folder
# if not specified, only OOF predictions will be saved
# test_filename = "test.csv"
test_filename = None

# task: classification or regression
# if not specified, the task will be inferred automatically
# task = "classification"
# task = "regression"
task = None

# an id column
# if not specified, the id column will be generated automatically with the name `id`
# idx = "id"
idx = None

# targets is a list of strings
# if not specified, the target column will be assumed to be named `target`
# and the problem will be treated as one of: binary classification, multiclass classification,
# or single column regression
# targets = ["target"]
# targets = ["target1", "target2"]
targets = ["income"]

# features is a list of strings
# if not specified, all columns except `id`, `targets` & `kfold` columns will be used
# features = ["col1", "col2"]
features = None

# categorical_features is a list of strings
# if not specified, categorical columns will be inferred automatically
# categorical_features = ["col1", "col2"]
categorical_features = None

# use_gpu is boolean
# if not specified, GPU is not used
# use_gpu = True
# use_gpu = False
use_gpu = True

# number of folds to use for cross-validation
# default is 5
num_folds = 5

# random seed for reproducibility
# default is 42
seed = 42

# number of optuna trials to run
# default is 1000
# num_trials = 1000
num_trials = 100

# time_limit for optuna trials in seconds
# if not specified, timeout is not set and all trials are run
# time_limit = None
time_limit = 360

# if fast is set to True, the hyperparameter tuning will use only one fold
# however, the model will be trained on all folds in the end
# to generate OOF predictions and test predictions
# default is False
# fast = False
fast = False
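
The num_folds, seed and fast comments above describe out-of-fold (OOF) prediction: each row is predicted by a model that never saw it during training. The sketch below illustrates that idea with the standard library only. It is not AutoXGB's actual splitting logic (which may, for example, stratify the folds); the helper names, the mean-of-targets stand-in "model", and the choice of fold 0 for fast mode are all assumptions made for illustration.

```python
import random
from statistics import mean


def assign_folds(num_rows, num_folds=5, seed=42):
    """Return a fold index (0..num_folds-1) for every row, shuffled reproducibly."""
    folds = [i % num_folds for i in range(num_rows)]
    random.Random(seed).shuffle(folds)
    return folds


def folds_for_tuning(num_folds, fast=False):
    """Folds scored during tuning; fast mode is assumed here to use fold 0 only."""
    return [0] if fast else list(range(num_folds))


def oof_predictions(targets, folds):
    """Predict every row with a stand-in model fitted on the other folds.

    The "model" is simply the mean of the training targets; in AutoXGB this
    role would be played by an XGBoost model with the tuned hyperparameters.
    """
    oof = [None] * len(targets)
    for fold in sorted(set(folds)):
        train_targets = [t for t, f in zip(targets, folds) if f != fold]
        prediction = mean(train_targets)
        for i, f in enumerate(folds):
            if f == fold:
                oof[i] = prediction
    return oof


targets = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
folds = assign_folds(len(targets), num_folds=5, seed=42)
oof = oof_predictions(targets, folds)
```

Even when tuning looks at a single fold, the final loop still visits every fold, which is why OOF predictions exist for all rows regardless of the fast setting.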

Python API

To train a new model, you can run:

from autoxgb import AutoXGB


# required parameters:
train_filename = "data_samples/binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now it's time to train the model!
axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()
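
Since every argument is passed by keyword, the same call can be driven from a dictionary, which is convenient when parameters come from a config file. The following is a minimal sketch, assuming the AutoXGB constructor accepts exactly the keyword arguments shown above. The build_params and train helpers are hypothetical, and the DEFAULTS values are copied from the comments in the Parameters section rather than from the library source.

```python
# Defaults as described in the Parameters comments (an assumption, not read
# from the autoxgb source code).
DEFAULTS = {
    "test_filename": None,
    "task": None,
    "idx": None,
    "targets": None,
    "features": None,
    "categorical_features": None,
    "use_gpu": False,
    "num_folds": 5,
    "seed": 42,
    "num_trials": 1000,
    "time_limit": None,
    "fast": False,
}


def build_params(train_filename, output, **overrides):
    """Merge the two required parameters with optional overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    params = {"train_filename": train_filename, "output": output, **DEFAULTS}
    params.update(overrides)
    if params["task"] not in (None, "classification", "regression"):
        raise ValueError("task must be 'classification', 'regression' or None")
    return params


def train(params):
    """Train with AutoXGB; requires `pip install autoxgb`."""
    # Imported lazily so build_params can be used without autoxgb installed.
    from autoxgb import AutoXGB

    AutoXGB(**params).train()


params = build_params(
    "data_samples/binary_classification.csv",
    "output",
    targets=["income"],
    num_trials=100,
    time_limit=360,
)
# train(params)  # uncomment to start training
```

Rejecting unknown keys early catches typos such as num_trial before any training time is spent.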

CLI

Train the model using the autoxgb train command. The parameters are the same as above.

autoxgb train \
 --train_filename datasets/30train.csv \
 --output outputs/30days \
 --test_filename datasets/30test.csv \
 --use_gpu

You can also serve the trained model using the autoxgb serve command.

autoxgb serve --model_path outputs/mll --host 0.0.0.0 --debug

To learn more about a command, run `autoxgb <command> --help`. For example:

autoxgb train --help


usage: autoxgb <command> [<args>] train [-h] --train_filename TRAIN_FILENAME [--test_filename TEST_FILENAME] --output
                                        OUTPUT [--task {classification,regression}] [--idx IDX] [--targets TARGETS]
                                        [--num_folds NUM_FOLDS] [--features FEATURES] [--use_gpu] [--fast]
                                        [--seed SEED] [--time_limit TIME_LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  --train_filename TRAIN_FILENAME
                        Path to training file
  --test_filename TEST_FILENAME
                        Path to test file
  --output OUTPUT       Path to output directory
  --task {classification,regression}
                        User defined task type
  --idx IDX             ID column
  --targets TARGETS     Target column(s). If there are multiple targets, separate by ';'
  --num_folds NUM_FOLDS
                        Number of folds to use
  --features FEATURES   Features to use, separated by ';'
  --use_gpu             Whether to use GPU for training
  --fast                Whether to use fast mode for tuning params. Only one fold will be used if fast mode is set
  --seed SEED           Random seed
  --time_limit TIME_LIMIT
                        Time limit for optimization
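
When the CLI is launched from a script, the command can be assembled as an argument list instead of a hand-written string. The sketch below uses only the flags listed in the help output above and joins multiple targets or features with ';' as that output describes. The build_train_command helper is hypothetical and not part of autoxgb.

```python
import shlex


def build_train_command(train_filename, output, test_filename=None, task=None,
                        idx=None, targets=None, features=None, num_folds=None,
                        seed=None, time_limit=None, use_gpu=False, fast=False):
    """Return an `autoxgb train` command as a list of arguments."""
    cmd = ["autoxgb", "train",
           "--train_filename", train_filename,
           "--output", output]
    # Flags that take a value; None means "leave it to the CLI default".
    valued = {
        "--test_filename": test_filename,
        "--task": task,
        "--idx": idx,
        "--targets": ";".join(targets) if targets else None,
        "--features": ";".join(features) if features else None,
        "--num_folds": num_folds,
        "--seed": seed,
        "--time_limit": time_limit,
    }
    for flag, value in valued.items():
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switches are present or absent and take no value.
    if use_gpu:
        cmd.append("--use_gpu")
    if fast:
        cmd.append("--fast")
    return cmd


cmd = build_train_command(
    "datasets/30train.csv",
    "outputs/30days",
    test_filename="datasets/30test.csv",
    targets=["target1", "target2"],
    use_gpu=True,
)
# shlex.join quotes the ';' so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

The list can be passed directly to subprocess.run(cmd, check=True). Because no shell is involved in that case, the ';' separator needs no quoting there; it only needs quoting when the command is typed or pasted into a shell.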
