• Stars: 526
• Rank: 84,247 (top 2%)
• Language: TeX
• License: Other
• Created: almost 11 years ago
• Updated: over 8 years ago

Repository Details

Repository of my thesis "Understanding Random Forests"

Understanding Random Forests

PhD dissertation, Gilles Louppe, July 2014. Defended on October 9, 2014.

arXiv: http://arxiv.org/abs/1407.7502

License: BSD 3-clause

Contact: Gilles Louppe (@glouppe, [email protected])

Please cite using the following BibTeX entry:

@phdthesis{louppe2014understanding,
  title={Understanding Random Forests: From Theory to Practice},
  author={Louppe, Gilles},
  school={University of Liege, Belgium},
  year=2014,
  month=10,
  note={arXiv:1407.7502}
}

Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, care should be taken not to use machine learning as a black-box tool, but rather to consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of these algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results.

Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. We then contribute an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn.
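
As a quick illustration of this construction (not code taken from the dissertation), the snippet below fits an ensemble of randomized trees with scikit-learn; the synthetic dataset and all hyperparameter values are arbitrary choices for the example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem, for illustration only
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 randomized trees; at each node, the best split is searched among
# max_features randomly drawn candidate features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))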

In the second part of this work, we analyze and discuss the interpretability of random forests through the lens of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, for which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. As a consequence of this work, our analysis demonstrates that variable importances as computed from non-totally randomized trees (e.g., standard Random Forest) suffer from a combination of defects, due to masking effects, misestimations of node impurity or the binary structure of decision trees.
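
To get a feel for this contrast, the sketch below compares Mean Decrease of Impurity importances (exposed as feature_importances_ in scikit-learn) from approximately totally randomized trees against those from a standard Random Forest. Note the approximation: scikit-learn grows binary trees, whereas the theoretical results concern multiway totally randomized trees, and the data here is synthetic:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data with a few informative and several irrelevant features
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=3, random_state=0)

# max_features=1: a single feature (and a random cut point) is drawn
# at random per node, approximating totally randomized trees
trt = ExtraTreesClassifier(n_estimators=500, max_features=1,
                           random_state=0).fit(X, y)
# Standard Random Forest: splits optimized over several candidate features
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

print("Totally randomized trees:", trt.feature_importances_)
print("Random Forest:", rf.feature_importances_)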

Finally, the last part of this dissertation addresses limitations of random forests in the context of large datasets. Through extensive experiments, we show that subsampling samples and features simultaneously yields performance on par with standard random forests while lowering memory requirements. Overall, this paradigm highlights an intriguing practical fact: there is often no need to build single models over immensely large datasets. Good performance can often be achieved by building models on (very) small random parts of the data and then combining them all in an ensemble, thereby avoiding all the practical burdens of making large data fit into memory.
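
A minimal sketch of this "many small models" paradigm, using scikit-learn's BaggingClassifier rather than the exact experimental setup of the thesis; the subsampling rates below are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Illustrative dataset; in practice this would be a dataset too large
# to fit comfortably in memory
X, y = make_classification(n_samples=10000, n_features=50,
                           n_informative=10, random_state=0)

# Each of the 100 base trees is fit on a random 10% of the samples and
# a random 50% of the features (drawn without replacement), so no
# single model ever sees the full dataset
ensemble = BaggingClassifier(n_estimators=100,
                             max_samples=0.1, max_features=0.5,
                             bootstrap=False, random_state=0)
ensemble.fit(X, y)
print(ensemble.score(X, y))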

More Repositories

1. info8010-deep-learning: Lectures for INFO8010 Deep Learning, ULiège (Jupyter Notebook, 1,222 stars)
2. info8006-introduction-to-ai: Lectures for INFO8006 Introduction to Artificial Intelligence, ULiège (Jupyter Notebook, 374 stars)
3. tutorials-scikit-learn: Scikit-Learn tutorials (Jupyter Notebook, 128 stars)
4. info8004-advanced-machine-learning: Lectures for INFO8004 Advanced Machine Learning, ULiège (CSS, 102 stars)
5. info8002-large-scale-data-systems: Lectures for INFO8002 Large-scale Data Systems, ULiège (CSS, 64 stars)
6. talk-pydata2015: Talk on "Tree models with Scikit-Learn: Great learners with little assumptions" presented at PyData Paris 2015 (TeX, 50 stars)
7. recnn: Repository for the code of "QCD-Aware Recursive Neural Networks for Jet Physics" (Jupyter Notebook, 45 stars)
8. dats0001-foundations-of-data-science: Materials for DATS0001 Foundations of Data Science, ULiège (Jupyter Notebook, 37 stars)
9. paper-learning-to-pivot: Repository for the paper "Learning to Pivot with Adversarial Networks" (Jupyter Notebook, 34 stars)
10. talk-bayesian-optimisation: Talk on "Bayesian optimisation", beginner level (Jupyter Notebook, 25 stars)
11. talk-template: Template for talks in remark+KaTeX (CSS, 24 stars)
12. paper-author-disambiguation: Repository for the paper "Ethnicity sensitive author disambiguation using semi-supervised learning" (TeX, 23 stars)
13. notebooks: Random fiddling stored in notebooks (Jupyter Notebook, 22 stars)
14. ssi2023 (CSS, 20 stars)
15. kaggle-marinexplore: Code for the Kaggle Marinexplore challenge (C, 17 stars)
16. flowing-with-jax (Jupyter Notebook, 15 stars)
17. paper-avo: Repository for the paper "Adversarial Variational Optimization of Non-Differentiable Simulators" (TeX, 15 stars)
18. tutorial-sklearn-lhcb: Tutorial "An introduction to Machine Learning with Scikit-Learn", presented at CERN (12 stars)
19. proj0016-big-data-project: Materials for PROJ0016 Big data project (Jupyter Notebook, 10 stars)
20. tutorials-iml2017 (Jupyter Notebook, 8 stars)
21. baby-copilot: An experimental AI system that can autonomously fix and improve code (Python, 8 stars)
22. talk-learning-to-pivot: Talk on "Learning to Pivot with Adversarial Networks" (TeX, 6 stars)
23. lectures-iccub-2016: Machine learning lectures given as part of ICCUB 2016, http://icc.ub.edu/congress/ICCUB_DM_SCHOOL (HTML, 6 stars)
24. talk-lfi-effectively (CSS, 6 stars)
25. kaggle-solar-energy: Code for the Kaggle Solar Energy Prediction challenge (Python, 6 stars)
26. iaifi-summer-school-2024 (Jupyter Notebook, 6 stars)
27. covid19be (Jupyter Notebook, 5 stars)
28. ggi-deep-learning (Jupyter Notebook, 5 stars)
29. paper-variable-importances-nips2013: Repository for the paper "Understanding variable importances in forests of randomized trees" (TeX, 5 stars)
30. lecture-dlvm (Jupyter Notebook, 4 stars)
31. cv: Curriculum vitae (TeX, 3 stars)
32. talk-disambiguation-inspire: Talk on "Machine Learning for Author Disambiguation" presented at the Inspire weekly meeting (TeX, 3 stars)
33. kaggle-higgs: Code for the Kaggle Higgs Boson challenge (C++, 3 stars)
34. glouppe.github.io: Source of http://glouppe.github.io (HTML, 2 stars)
35. talk-teaching-machines-to-discover-particles: Talk on "Teaching machines to discover particles" (TeX, 2 stars)
36. talk-physai (CSS, 1 star)
37. talk-eccv2024 (CSS, 1 star)
38. talk-qcd-rnn: Talk on "QCD-aware recursive neural networks for jet physics" (TeX, 1 star)
39. talk-cds2014: Talk on "Scikit-Learn in Particle Physics" presented at Telecom ParisTech (TeX, 1 star)
40. talk-classification-control-channel: Talk on "Classification with a control channel" presented at CERN, October 2015 (TeX, 1 star)
41. talk-aleph-workshop2015: Talk on "Pitfalls of evaluating a classifier's performance in high energy physics applications" presented at the ALEPH workshop, NIPS, December 2015 (Jupyter Notebook, 1 star)
42. talk-ariac-wp3 (CSS, 1 star)
43. talk-popular-science-ai (CSS, 1 star)
44. talk-gap2024 (CSS, 1 star)