DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Created and developed by Ben Sadeghi (@bensadeghi). Now maintained by the JuliaAI organization.

Available via:

  • AutoMLPipeline.jl - create complex ML pipeline structures using simple expressions
  • CombineML.jl - a heterogeneous ensemble learning package
  • MLJ.jl - a machine learning framework for Julia
  • ScikitLearn.jl - Julia implementation of the scikit-learn API

Classification

  • pre-pruning (max depth, min leaf size)
  • post-pruning (pessimistic pruning)
  • multi-threaded bagging (random forests)
  • adaptive boosting (decision stumps), using SAMME
  • cross validation (n-fold)
  • support for ordered features (encoded as Reals or Strings)

Regression

  • pre-pruning (max depth, min leaf size)
  • multi-threaded bagging (random forests)
  • cross validation (n-fold)
  • support for numerical features

Note that regression is implied if the labels/targets are of a floating-point type, e.g. Array{Float64}.
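
For example, the same build_tree call from the native API (described below) dispatches to regression when the labels are floats. A minimal sketch:

using DecisionTree
features = randn(100, 3)
labels   = features * [1.0, 2.0, 3.0]    # Float64 labels, so regression is implied
model    = build_tree(labels, features)  # builds a regression tree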

Installation

You can install DecisionTree.jl using Julia's package manager:

using Pkg
Pkg.add("DecisionTree")

ScikitLearn.jl API

DecisionTree.jl supports the ScikitLearn.jl interface and algorithms (cross-validation, hyperparameter tuning, pipelines, etc.).

Available models: DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor, AdaBoostStumpClassifier. See each model's help (e.g., ?DecisionTreeRegressor at the REPL) for more information.

Classification Example

Load DecisionTree package

using DecisionTree

Separate Fisher's Iris dataset features and labels

features, labels = load_data("iris")    # also see "adult" and "digits" datasets

# the data loaded are of type Array{Any}
# cast them to concrete types for better performance
features = float.(features)
labels   = string.(labels)

Pruned Tree Classifier

# train depth-truncated classifier
model = DecisionTreeClassifier(max_depth=2)
fit!(model, features, labels)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
# apply learned model
predict(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
predict_proba(model, [5.9,3.0,5.1,1.9])
println(get_classes(model)) # returns the ordering of the columns in predict_proba's output
# run n-fold cross validation over 3 CV folds
# See ScikitLearn.jl for installation instructions
using ScikitLearn.CrossValidation: cross_val_score
accuracy = cross_val_score(model, features, labels, cv=3)

Also, have a look at these classification and regression notebooks.

Native API

Classification Example

Decision Tree Classifier

# train full-tree classifier
model = build_tree(labels, features)
# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])
# apply the model to all the samples
preds = apply_tree(model, features)
# generate confusion matrix, along with accuracy and kappa scores
DecisionTree.confusion_matrix(labels, preds)
# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# run 3-fold cross validation of the pruned tree
n_folds=3
accuracy = nfoldCV_tree(labels, features, n_folds)

# set of classification parameters and respective default values
# pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
# max_depth: maximum depth of the decision tree (default: -1, no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# n_subfeatures: number of features to select at random (default: 0, keep all)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
n_subfeatures=0; max_depth=-1; min_samples_leaf=1; min_samples_split=2
min_purity_increase=0.0; pruning_purity = 1.0; seed=3

model    =   build_tree(labels, features,
                        n_subfeatures,
                        max_depth,
                        min_samples_leaf,
                        min_samples_split,
                        min_purity_increase;
                        rng = seed)

accuracy = nfoldCV_tree(labels, features,
                        n_folds,
                        pruning_purity,
                        max_depth,
                        min_samples_leaf,
                        min_samples_split,
                        min_purity_increase;
                        verbose = true,
                        rng = seed)

Random Forest Classifier

# train random forest classifier
# using 2 random features, 10 trees, 0.5 portion of samples per tree, and a maximum tree depth of 6
model = build_forest(labels, features, 2, 10, 0.5, 6)
# apply learned model
apply_forest(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_forest_proba(model, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# train a larger forest, e.g. with 17 trees
model = build_forest(labels, features, 2, 17, 0.5, 6)
# run 3-fold cross validation for forests, using 2 random features per split
n_folds=3; n_subfeatures=2
accuracy = nfoldCV_forest(labels, features, n_folds, n_subfeatures)

# set of classification parameters and respective default values
# n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
# n_trees: number of trees to train (default: 10)
# partial_sampling: fraction of samples to train each tree on (default: 0.7)
# max_depth: maximum depth of the decision trees (default: no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
#              multi-threaded forests must be seeded with an `Int`
n_subfeatures=-1; n_trees=10; partial_sampling=0.7; max_depth=-1
min_samples_leaf=5; min_samples_split=2; min_purity_increase=0.0; seed=3

model    =   build_forest(labels, features,
                          n_subfeatures,
                          n_trees,
                          partial_sampling,
                          max_depth,
                          min_samples_leaf,
                          min_samples_split,
                          min_purity_increase;
                          rng = seed)

accuracy = nfoldCV_forest(labels, features,
                          n_folds,
                          n_subfeatures,
                          n_trees,
                          partial_sampling,
                          max_depth,
                          min_samples_leaf,
                          min_samples_split,
                          min_purity_increase;
                          verbose = true,
                          rng = seed)

Adaptive-Boosted Decision Stumps Classifier

# train adaptive-boosted stumps, using 7 iterations
model, coeffs = build_adaboost_stumps(labels, features, 7);
# apply learned model
apply_adaboost_stumps(model, coeffs, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_adaboost_stumps_proba(model, coeffs, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# run 3-fold cross validation for boosted stumps, using 7 iterations
n_iterations=7; n_folds=3
accuracy = nfoldCV_stumps(labels, features,
                          n_folds,
                          n_iterations;
                          verbose = true)

Regression Example

n, m = 10^3, 5
features = randn(n, m)
weights = rand(-2:2, m)
labels = features * weights

Regression Tree

# train regression tree
model = build_tree(labels, features)
# apply learned model
apply_tree(model, [-0.9,3.0,5.1,1.9,0.0])
# run 3-fold cross validation, returns array of coefficients of determination (R^2)
n_folds = 3
r2 = nfoldCV_tree(labels, features, n_folds)

# set of regression parameters and respective default values
# pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
# max_depth: maximum depth of the decision tree (default: -1, no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# n_subfeatures: number of features to select at random (default: 0, keep all)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
n_subfeatures = 0; max_depth = -1; min_samples_leaf = 5
min_samples_split = 2; min_purity_increase = 0.0; pruning_purity = 1.0 ; seed=3

model = build_tree(labels, features,
                   n_subfeatures,
                   max_depth,
                   min_samples_leaf,
                   min_samples_split,
                   min_purity_increase;
                   rng = seed)

r2 =  nfoldCV_tree(labels, features,
                   n_folds,
                   pruning_purity,
                   max_depth,
                   min_samples_leaf,
                   min_samples_split,
                   min_purity_increase;
                   verbose = true,
                   rng = seed)

Regression Random Forest

# train regression forest, using 2 random features, 10 trees,
# a 0.7 portion of samples per tree, and a maximum tree depth of 5
model = build_forest(labels, features, 2, 10, 0.7, 5)
# apply learned model
apply_forest(model, [-0.9,3.0,5.1,1.9,0.0])
# run 3-fold cross validation on regression forest, using 2 random features per split
n_subfeatures=2; n_folds=3
r2 = nfoldCV_forest(labels, features, n_folds, n_subfeatures)

# set of regression build_forest() parameters and respective default values
# n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
# n_trees: number of trees to train (default: 10)
# partial_sampling: fraction of samples to train each tree on (default: 0.7)
# max_depth: maximum depth of the decision trees (default: no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# keyword rng: the random number generator or seed to use (default Random.GLOBAL_RNG)
#              multi-threaded forests must be seeded with an `Int`
n_subfeatures=-1; n_trees=10; partial_sampling=0.7; max_depth=-1
min_samples_leaf=5; min_samples_split=2; min_purity_increase=0.0; seed=3

model = build_forest(labels, features,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase;
                     rng = seed)

r2 =  nfoldCV_forest(labels, features,
                     n_folds,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase;
                     verbose = true,
                     rng = seed)

Saving Models

Models can be saved to disk and loaded back with the use of the JLD2.jl package.

using JLD2
@save "model_file.jld2" model

Note that, even though features and labels of type Array{Any} are supported, it is highly recommended that data be cast to explicit types (i.e., with float.(), string.(), etc.). This significantly improves model training and prediction execution times, and also drastically reduces the size of saved models.

MLJ.jl API

To use DecisionTree.jl models in MLJ, first ensure MLJ.jl and MLJDecisionTreeInterface.jl are both in your Julia environment. For example, to install them in a fresh environment:

using Pkg
Pkg.activate("my_fresh_mlj_environment", shared=true)
Pkg.add("MLJ")
Pkg.add("MLJDecisionTreeInterface")

Detailed usage instructions are available for each model using the doc method. For example:

using MLJ
doc("DecisionTreeClassifier", pkg="DecisionTree")

Available models are: AdaBoostStumpClassifier, DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor.
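
A minimal sketch of fitting one of these models through MLJ, using MLJ's bundled iris data (the hyperparameter value is illustrative):

using MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=2)
X, y = @load_iris                 # features as a table, labels as a categorical vector
mach = machine(tree, X, y)
fit!(mach)
yhat = predict(mach, X)           # probabilistic predictions
predict_mode(mach, X)             # point predictions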

Feature Importances

The following methods provide measures of feature importance for all models: impurity_importance, split_importance, permutation_importance. Query the document strings for details.
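
For instance, a minimal sketch with a forest trained via the native API, assuming labels and features as in the examples above:

model = build_forest(labels, features)
fi = impurity_importance(model)   # one importance score per feature column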

Visualization

A DecisionTree model can be visualized using the print_tree function of its native interface (for an example, see the 'Classification Example' section above).

In addition, an abstraction layer using AbstractTrees.jl has been implemented to facilitate visualizations that don't rely on any implementation details of DecisionTree. For more information, have a look at the docs in src/abstract_trees.jl and the wrap function, which creates this layer for a DecisionTree model.

Apart from this, AbstractTrees.jl brings its own implementation of print_tree.
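
A sketch of that workflow, assuming model is a tree trained on the iris data via the native API (the feature and class names passed to wrap are illustrative):

using AbstractTrees
wrapped = DecisionTree.wrap(model, (featurenames = ["sepal length", "sepal width", "petal length", "petal width"],
                                    classlabels  = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]))
AbstractTrees.print_tree(wrapped)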

Citing the package in publications

DOI: 10.5281/zenodo.7359268

BibTeX entry:

@software{ben_sadeghi_2022_7359268,
  author       = {Ben Sadeghi and
                  Poom Chiarawongse and
                  Kevin Squire and
                  Daniel C. Jones and
                  Andreas Noack and
                  Cédric St-Jean and
                  Rik Huijzer and
                  Roland Schätzle and
                  Ian Butterworth and
                  Yu-Fong Peng and
                  Anthony Blaom},
  title        = {{DecisionTree.jl - A Julia implementation of the 
                   CART Decision Tree and Random Forest algorithms}},
  month        = nov,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {0.11.3},
  doi          = {10.5281/zenodo.7359268},
  url          = {https://doi.org/10.5281/zenodo.7359268}
}
