• This repository has been archived on 14/Nov/2023
  • Stars
    star
    465
  • Rank 94,287 (Top 2 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created almost 5 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.

tune-sklearn

pytest

Tune-sklearn is a drop-in replacement for Scikit-Learn’s model selection module (GridSearchCV, RandomizedSearchCV) with cutting edge hyperparameter tuning techniques.

Features

Here’s what tune-sklearn has to offer:

  • Consistency with Scikit-Learn API: Change less than 5 lines in a standard Scikit-Learn script to use the API [example].
  • Modern tuning techniques: tune-sklearn allows you to easily leverage Bayesian Optimization, HyperBand, BOHB, and other optimization techniques by simply toggling a few parameters.
  • Framework support: tune-sklearn is used primarily for tuning Scikit-Learn models, but it also supports and provides examples for many other frameworks with Scikit-Learn wrappers such as Skorch (Pytorch) [example], KerasClassifier (Keras) [example], and XGBoostClassifier (XGBoost) [example].
  • Scale up: Tune-sklearn leverages Ray Tune, a library for distributed hyperparameter tuning, to parallelize cross validation on multiple cores and even multiple machines without changing your code.

Check out our API Documentation and Walkthrough (for master branch).

Installation

Dependencies

  • numpy (>=1.16)
  • ray
  • scikit-learn (>=0.23)

User Installation

pip install tune-sklearn ray[tune]

or

pip install -U git+https://github.com/ray-project/tune-sklearn.git && pip install 'ray[tune]'

Tune-sklearn Early Stopping

For certain estimators, tune-sklearn can also immediately enable incremental training and early stopping. Such estimators include:

  • Estimators that implement 'warm_start' (except for ensemble classifiers and decision trees)
  • Estimators that implement partial fit
  • XGBoost, LightGBM and CatBoost models (via incremental learning)

To read more about compatible scikit-learn models, see scikit-learn's documentation at section 8.1.1.3.

Early stopping algorithms that can be enabled include HyperBand and Median Stopping (see below for examples).

If the estimator does not support partial_fit, a warning will be shown saying early stopping cannot be done and it will simply run the cross-validation on Ray's parallel back-end.

Apart from early stopping scheduling algorithms, tune-sklearn also supports passing custom stoppers to Ray Tune. These can be passed via the stopper argument when instantiating TuneSearchCV or TuneGridSearchCV. See the Ray documentation for an overview of available stoppers.

Examples

TuneGridSearchCV

To start out, it’s as easy as changing our import statement to get Tune’s grid search cross validation interface, and the rest is almost identical!

TuneGridSearchCV accepts dictionaries in the format { param_name: str : distribution: list } or a list of such dictionaries, just like scikit-learn's GridSearchCV. The distribution can also be the output of Ray Tune's tune.grid_search.

# from sklearn.model_selection import GridSearchCV
from tune_sklearn import TuneGridSearchCV

# Other imports
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameters to tune from SGDClassifier
parameters = {
    'alpha': [1e-4, 1e-1, 1],
    'epsilon':[0.01, 0.1]
}

tune_search = TuneGridSearchCV(
    SGDClassifier(),
    parameters,
    early_stopping="MedianStoppingRule",
    max_iters=10
)

import time # Just to compare fit times
start = time.time()
tune_search.fit(X_train, y_train)
end = time.time()
print("Tune Fit Time:", end - start)
pred = tune_search.predict(X_test)
accuracy = np.count_nonzero(np.array(pred) == np.array(y_test)) / len(pred)
print("Tune Accuracy:", accuracy)

If you'd like to compare fit times with sklearn's GridSearchCV, run the following block of code:

from sklearn.model_selection import GridSearchCV
# n_jobs=-1 enables use of all cores like Tune does
sklearn_search = GridSearchCV(
    SGDClassifier(),
    parameters,
    n_jobs=-1
)

start = time.time()
sklearn_search.fit(X_train, y_train)
end = time.time()
print("Sklearn Fit Time:", end - start)
pred = sklearn_search.predict(X_test)
accuracy = np.count_nonzero(np.array(pred) == np.array(y_test)) / len(pred)
print("Sklearn Accuracy:", accuracy)

TuneSearchCV

TuneSearchCV is an upgraded version of scikit-learn's RandomizedSearchCV.

It also provides a wrapper for several search optimization algorithms from Ray Tune's tune.suggest, which in turn are wrappers for other libraries. The selection of the search algorithm is controlled by the search_optimization parameter. In order to use other algorithms, you need to install the libraries they depend on (pip install column). The search algorithms are as follows:

Algorithm search_optimization value Summary Website pip install
(Random Search) "random" Randomized Search built-in
SkoptSearch "bayesian" Bayesian Optimization [Scikit-Optimize] scikit-optimize
HyperOptSearch "hyperopt" Tree-Parzen Estimators [HyperOpt] hyperopt
TuneBOHB "bohb" Bayesian Opt/HyperBand [BOHB] hpbandster ConfigSpace
Optuna "optuna" Tree-Parzen Estimators [Optuna] optuna

All algorithms other than RandomListSearcher accept parameter distributions in the form of dictionaries in the format { param_name: str : distribution: tuple or list }.

Tuples represent real distributions and should be two-element or three-element, in the format (lower_bound: float, upper_bound: float, Optional: "uniform" (default) or "log-uniform"). Lists represent categorical distributions. Ray Tune Search Spaces are also supported and provide a rich set of potential distributions. Search spaces allow for users to specify complex, potentially nested search spaces and parameter distributions. Furthermore, each algorithm also accepts parameters in their own specific format. More information in Tune documentation.

Random Search (default) accepts dictionaries in the format { param_name: str : distribution: list } or a list of such dictionaries, just like scikit-learn's RandomizedSearchCV.

from tune_sklearn import TuneSearchCV

# Other imports
import scipy
from ray import tune
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Set training and validation sets
X, y = make_classification(n_samples=11000, n_features=1000, n_informative=50, n_redundant=0, n_classes=10, class_sep=2.5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Example parameter distributions to tune from SGDClassifier
# Note the use of tuples instead if non-random optimization is desired
param_dists = {
    'loss': ['squared_hinge', 'hinge'], 
    'alpha': (1e-4, 1e-1, 'log-uniform'),
    'epsilon': (1e-2, 1e-1)
}

bohb_tune_search = TuneSearchCV(SGDClassifier(),
    param_distributions=param_dists,
    n_trials=2,
    max_iters=10,
    search_optimization="bohb"
)

bohb_tune_search.fit(X_train, y_train)

# Define the `param_dists using the SearchSpace API
# This allows the specification of sampling from discrete and 
# categorical distributions (below for the `learning_rate` scheduler parameter)
param_dists = {
    'loss': tune.choice(['squared_hinge', 'hinge']),
    'alpha': tune.loguniform(1e-4, 1e-1),
    'epsilon': tune.uniform(1e-2, 1e-1),
}


hyperopt_tune_search = TuneSearchCV(SGDClassifier(),
    param_distributions=param_dists,
    n_trials=2,
    early_stopping=True, # uses Async HyperBand if set to True
    max_iters=10,
    search_optimization="hyperopt"
)

hyperopt_tune_search.fit(X_train, y_train)

Other Machine Learning Libraries and Examples

Tune-sklearn also supports the use of other machine learning libraries such as Pytorch (using Skorch) and Keras. You can find these examples here:

More information

Ray Tune

More Repositories

1

ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Python
33,272
star
2

llm-numbers

Numbers every LLM developer should know
4,053
star
3

kuberay

A toolkit to run Ray applications on Kubernetes
Go
1,213
star
4

ray-llm

RayLLM - LLMs on Ray
Python
1,213
star
5

tutorial

Jupyter Notebook
777
star
6

llmperf

LLMPerf is a library for validating and benchmarking LLMs
Python
583
star
7

llmperf-leaderboard

417
star
8

ray-educational-materials

This is suite of the hands-on training materials that shows how to scale CV, NLP, time-series forecasting workloads with Ray.
Jupyter Notebook
272
star
9

ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Python
204
star
10

langchain-ray

Examples on how to use LangChain and Ray
Python
202
star
11

deltacat

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Python
155
star
12

rl-experiments

Keeping track of RL experiments
148
star
13

xgboost_ray

Distributed XGBoost on Ray
Python
137
star
14

rayfed

A multiple parties joint, distributed execution engine based on Ray, to help build your own federated learning frameworks in minutes.
Python
92
star
15

mobius

Mobius is an AI infrastructure platform for distributed online learning, including online sample processing, training and serving.
Java
88
star
16

plasma

A minimal shared memory object store design
C
46
star
17

enhancements

Tracking Ray Enhancement Proposals
44
star
18

lightgbm_ray

LightGBM on Ray
Python
44
star
19

ray_beam_runner

Ray-based Apache Beam runner
Python
40
star
20

mlflow-ray-serve

MLFlow Deployment Plugin for Ray Serve
Python
35
star
21

distml

Distributed ML Optimizer
Python
31
star
22

llms-in-prod-workshop-2023

Deploy and Scale LLM-based applications
Jupyter Notebook
23
star
23

ray-legacy

An experimental distributed execution engine
Python
21
star
24

ray_shuffling_data_loader

A Ray-based data loader with per-epoch shuffling and configurable pipelining, for shuffling and loading training data for distributed training of machine learning models.
Python
18
star
25

pygloo

Pygloo provides Python bindings for Gloo.
C++
15
star
26

contrib-workflow-dag

Python
11
star
27

anyscale-berkeley-ai-hackathon

Ray and Anyscale for UC Berkeley AI Hackathon!
Jupyter Notebook
11
star
28

credis

C++
9
star
29

ray-acm-workshop-2023

Scalable/Distributed Computer Vision with Ray
Jupyter Notebook
9
star
30

spark-ray-example

A simple demonstration of embedding Ray in a Spark UDF. For Spark + AI Summit 2020.
Jupyter Notebook
8
star
31

community

Artifacts intended to support the Ray Developer Community: SIGs, RFC overviews, and governance. We're very glad you're here! ✨
8
star
32

llm-application

Jupyter Notebook
6
star
33

releaser

Python
5
star
34

scalable-learning

Scaling multi-node multi-GPU workloads
5
star
35

air-reference-arch

Jupyter Notebook
5
star
36

serve-movie-rec-demo

Python
5
star
37

raynomics

Experimental genomics algorithms in Ray
Python
5
star
38

maze-raylit

Hackathon 2020! Max Archit Zhe
Python
5
star
39

ray-serve-arize-observe

Building Real-Time Inference Pipelines with Ray Serve
Jupyter Notebook
5
star
40

sandbox

Ray repository sandbox
Python
5
star
41

anyscale-workshop-nyc-2023

Scalable NLP model fine-tuning and batch inference with Ray and Anyscale
Jupyter Notebook
5
star
42

kuberay-helm

Helm charts for the KubeRay project
Mustache
4
star
43

ray-saturday-dec-2022

Ray Saturday Dec 2022 edition
Jupyter Notebook
4
star
44

RFC

Community Documents
4
star
45

ray-demos

Collection of demos build with Ray
Jupyter Notebook
4
star
46

prototype_gpu_buffer

Python
3
star
47

arrow-build

Queue for building arrow
3
star
48

numbuf

Serializing primitive Python types in Arrow
C++
3
star
49

odsc-west-workshop-2023

Jupyter Notebook
3
star
50

scipy-ray-scalable-ml-tutorial-2023

Jupyter Notebook
2
star
51

2022_04_13_ray_serve_meetup_demo

Code samples for Ray Serve Meetup on 04/13/2022
Python
2
star
52

q4-2021-docs-hackathon

HTML
2
star
53

ray-scripts

Experimental scripts for deploying and using Ray
Shell
2
star
54

raytracer

Polymer WebUI for Ray
HTML
2
star
55

travis-tracker-v2

Python
2
star
56

rllib-contrib

Python
2
star
57

serve_workloads

Python
2
star
58

qcon-workshop-2023

Jupyter Notebook
2
star
59

travis-tracker

Dashboard for Tracking Travis Python Test Result.
TypeScript
1
star
60

common

Code that is shared between Ray projects
C
1
star
61

photon

A local scheduler and node manager for Ray
C
1
star
62

spmd_grid

Grid-style gang-scheduling and collective communication for Ray
Python
1
star
63

checkstyle_java

Python
1
star
64

raylibs

Libraries for Ray
1
star
65

issues-to-airtable

JavaScript
1
star
66

ray-docs-zh

Chinese translation of Ray documentation. This may not be update to date.
1
star
67

streaming

Streaming processing engine based on ray platform.
1
star
68

ray-project.github.io

The Ray project website
HTML
1
star
69

train-serve-primer

Jupyter Notebook
1
star
70

serve_config_examples

Python
1
star
71

llmval-legacy

Jupyter Notebook
1
star
72

Ray-Forward

Some resources about Ray Forward Meetup
1
star
73

ray-summit-2022

Website for Ray Summit 2022
HTML
1
star