• Stars: 224
• Rank: 177,792 (Top 4%)
• Language: Python
• License: MIT License
• Created: over 10 years ago
• Updated: 6 months ago

Repository Details

Compiled Decision Trees for scikit-learn

Scikit-Learn Compiled Trees

Released under the MIT License.

Installation

pip install sklearn-compiledtrees

Or to get the latest development version:

pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git

sklearn-compiledtrees has been tested to work on OS X, Linux and Windows.

Installing on Windows requires the GCC compiler and dlfcn-win32, setting the CXX environment variable (set "CXX=gcc -pthread" in CMD), and a manual installation from the source directory. Using the msys2 distribution available in conda is strongly recommended.
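For example, a hypothetical CMD sequence (assuming a working msys2 GCC toolchain and dlfcn-win32 are already installed) might look like:

set "CXX=gcc -pthread"
git clone https://github.com/ajtulloch/sklearn-compiledtrees.git
cd sklearn-compiledtrees
pip install .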

Rationale

In some use cases, prediction with a fitted model is on the hot path, so speeding up decision tree evaluation is very useful.

An effective way to speed up decision tree evaluation is to generate code representing the evaluation of the tree, compile that code to optimized object code, and dynamically load the resulting object file via dlopen/dlsym or an equivalent mechanism.

See https://courses.cs.washington.edu/courses/cse501/10au/compile-machlearn.pdf for a detailed discussion, and http://tullo.ch/articles/decision-tree-evaluation/ for a more pedagogical explanation and more benchmarks in C++.
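As a minimal illustration of the technique (a sketch, not this package's actual implementation), the following emits C for a hard-coded toy tree, compiles it to a shared object, and loads it with ctypes, Python's equivalent of dlopen/dlsym:

import ctypes
import os
import subprocess
import tempfile

# C source for a toy depth-2 regression tree; in practice the feature
# indices, thresholds, and leaf values would be generated from a fitted model.
C_SOURCE = """
double evaluate(const double *f) {
    if (f[0] <= 0.5) {
        return (f[1] <= 0.25) ? 1.0 : 2.0;
    }
    return (f[2] <= 0.75) ? 3.0 : 4.0;
}
"""

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "tree.c")
lib = os.path.join(workdir, "tree.so")
with open(src, "w") as fh:
    fh.write(C_SOURCE)

# Compile the generated code to optimized, position-independent object code.
subprocess.check_call(["cc", "-O3", "-shared", "-fPIC", "-o", lib, src])

# Dynamically load the object file and resolve the evaluation symbol.
tree = ctypes.CDLL(lib)
tree.evaluate.restype = ctypes.c_double
tree.evaluate.argtypes = [ctypes.POINTER(ctypes.c_double)]

features = (ctypes.c_double * 3)(0.4, 0.3, 0.0)
print(tree.evaluate(features))  # prints 2.0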

This package implements compiled decision tree evaluation for the simple case of a single-output regression tree or ensemble.

Usage

import compiledtrees
from sklearn import ensemble

X_train, y_train, X_test, y_test = ...

clf = ensemble.GradientBoostingRegressor()
clf.fit(X_train, y_train)

# Compile the fitted ensemble to native code, then predict as usual.
compiled_predictor = compiledtrees.CompiledRegressionPredictor(clf)
predictions = compiled_predictor.predict(X_test)
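To gauge the gain on your own data, a quick micro-benchmark along these lines compares the two predictors (a sketch reusing the names from the example above; the repetition count is arbitrary):

import timeit

# Time the stock scikit-learn predictor against the compiled one.
sklearn_time = timeit.timeit(lambda: clf.predict(X_test), number=100)
compiled_time = timeit.timeit(lambda: compiled_predictor.predict(X_test), number=100)
print("speedup: %.1fx" % (sklearn_time / compiled_time))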

Benchmarks

For random forests, we see a 5x to 8x speedup in evaluation; for gradient boosted ensembles, the speedup is between 1.5x and 3x. The smaller gain for gradient boosted ensembles is because their trees already have a comparatively optimized prediction implementation.

A benchmark script is included that examines evaluation performance across a range of ensemble configurations and datasets.

In the graphs below, GB is Gradient Boosted, RF is Random Forest, D1, D2, etc. correspond to setting max_depth=1, max_depth=2, and so on, and B10 corresponds to setting max_leaf_nodes=10.

Graphs

for dataset in friedman1 friedman2 friedman3 uniform hastie; do
    python ../benchmarks/bench_compiled_tree.py \
        --iterations=10 \
        --num_examples=1000 \
        --num_features=50 \
        --dataset=$dataset \
        --max_estimators=300 \
        --num_estimator_values=6
done

[Timing graphs, one per dataset, omitted]

More Repositories

1. dnngraph (Haskell, 697 stars): A DSL for deep neural networks, supporting Caffe and Torch
2. Elements-of-Statistical-Learning (R, 291 stars): Contains LaTeX, SciPy and R code providing solutions to exercises in Elements of Statistical Learning (Hastie, Tibshirani & Friedman)
3. svmpy (Python, 244 stars): Basic soft-margin kernel SVM implementation in Python
4. quantcup-orderbook (C, 129 stars): Fast C++ adaptation of the QuantCup (http://www.quantcup.org/) limit order book
5. adpredictor (Python, 90 stars): A simple implementation of Microsoft's AdPredictor (http://bit.ly/SFgcq8) in Python
6. decisiontrees (JavaScript, 60 stars): High performance implementations of gradient boosting, random forests, etc. in Go
7. haskell-ml (Haskell, 56 stars): Haskell implementations of various ML algorithms
8. deeplearning-hs (Haskell, 53 stars)
9. LaTeX2Markdown (Python, 39 stars): An AMS-LaTeX compatible converter that maps a subset of LaTeX to Markdown/MathJaX
10. SydneyUniversityMathematicsNotes (Ruby, 34 stars): Contains lecture notes for several Sydney University advanced mathematics courses. Contributions welcomed!
11. admmlrspark (Scala, 32 stars): ADMM Logistic Regression implemented in Spark
12. caffe.rs (Rust, 23 stars)
13. IntensityCreditModels (Python, 21 stars): Code used to implement various stochastic intensity models for univariate and multivariate credit risk models
14. dots (Vim Script, 17 stars)
15. hopfield-networks (Haskell, 16 stars): Hopfield Networks for unsupervised learning in Haskell
16. phabricator.el (Emacs Lisp, 14 stars)
17. freelearning (Haskell, 11 stars)
18. Isotonic.jl (Julia, 10 stars)
19. sparse-ads-baselines (Python, 8 stars)
20. decisiontree-performance (C++, 7 stars): C++ code examining decision tree evaluation strategies, accompanying tullo.ch/articles/decision-tree-evaluation/
21. boggle (Haskell, 5 stars)
22. DeepLearning.jl (Julia, 5 stars)
23. mkdown.el (CSS, 4 stars)
24. lrucache.cpp (C++, 3 stars)
25. NaiveBayesSpamFilter (Python, 3 stars): An implementation of the Naive Bayes machine learning algorithm, applied to spam filtering
26. tullo-ch (TeX, 2 stars): tullo.ch static website content
27. bgess (C++, 2 stars)
28. spacemacs-tulloch (Emacs Lisp, 2 stars)
29. ajtulloch.github.com (JavaScript, 2 stars): Contains the Jekyll code for the old tullo.ch site
30. UniTables (JavaScript, 1 star): Collaborative timetabling management for students
31. phash-hs (Haskell, 1 star)
32. Dantzig.jl (Julia, 1 star)
33. DecisionTreePerformance.jl (Julia, 1 star)
34. tensorflow-rs (Rust, 1 star)
35. cascalog-tfidf (Clojure, 1 star): Cascalog implementation of TF-IDF document processing
36. julia-thrift (Julia, 1 star)
37. sparse (1 star)
38. mla (CSS, 1 star)
39. lpcnet_benchmark (C, 1 star)
40. dotfiles (Vim Script, 1 star): tmux, vim, bash
41. c-decl-parsec (Haskell, 1 star)
42. Utilities (Python, 1 star): Contains Python code for various simple and not-so-simple algorithms
43. isotonic.cpp (C++, 1 star)
44. DropboxChallenges (Python, 1 star): Python solutions for the Dropbox challenges