• Stars: 178
• Rank: 214,989 (Top 5%)
• Language: Python
• License: MIT License
• Created over 6 years ago
• Updated almost 6 years ago


Repository Details

Experimental Gradient Boosting Machines in Python with numba.

pygbm

Experimental Gradient Boosting Machines in Python.

The goal of this project is to evaluate whether it is possible to implement an efficient histogram-binning version of Gradient Boosting Trees (possibly with all the LightGBM optimizations) while staying in pure Python 3.6+ by using the numba JIT compiler.
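
To make the idea of histogram binning concrete, here is a minimal pure-Python sketch. It is not pygbm's implementation: the function names (find_bin_thresholds, bin_values, build_histogram) are hypothetical, and the binning strategy (midpoints for few distinct values, approximate quantiles otherwise) is an assumption chosen for illustration. Each feature value is mapped to a small integer bin index, and gradients are then accumulated per bin so that candidate splits can be evaluated on at most max_bins buckets instead of on every sample.

```python
from bisect import bisect_right


def find_bin_thresholds(values, max_bins=256):
    """Return sorted thresholds that split `values` into at most `max_bins` bins."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: place a threshold halfway between neighbours.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    # Many distinct values: aim for roughly equally populated bins.
    ordered = sorted(values)
    n = len(ordered)
    thresholds = [ordered[(i * n) // max_bins] for i in range(1, max_bins)]
    return sorted(set(thresholds))


def bin_values(values, thresholds):
    """Map each value to its bin index (number of thresholds <= value)."""
    return [bisect_right(thresholds, v) for v in values]


def build_histogram(binned, gradients, n_bins):
    """Accumulate the gradient sum and the sample count of every bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for b, g in zip(binned, gradients):
        sums[b] += g
        counts[b] += 1
    return sums, counts


# Toy data: one feature column and one gradient per sample.
feature = [0.1, 0.4, 0.4, 2.5, 7.0, 7.0]
gradients = [1.0, -1.0, 0.5, 0.5, 2.0, -0.5]

thresholds = find_bin_thresholds(feature, max_bins=4)
binned = bin_values(feature, thresholds)
sums, counts = build_histogram(binned, gradients, n_bins=len(thresholds) + 1)
print(binned, sums, counts)
```

With max_bins at or below 256, every bin index fits in an unsigned 8-bit integer, which is why binned feature values can be stored compactly.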

pygbm provides a set of scikit-learn compatible estimator classes that should play well with the scikit-learn Pipeline and model selection tools (grid search and randomized hyperparameter search).

Longer term plans include integration with dask and dask-ml for out-of-core and distributed fitting on a cluster.

Installation

The project is available on PyPI and can be installed with pip:

pip install pygbm

Python 3.6 or later is required.

Documentation

The API documentation is available at:

https://pygbm.readthedocs.io/

You might also want to have a look at the examples/ folder of this repo.

Status

The project is experimental. The API is subject to change without deprecation notice. Use at your own risk.

We welcome any feedback in the GitHub issue tracker:

https://github.com/ogrisel/pygbm/issues

Running the development version

Use pip to install in "editable" mode:

git clone https://github.com/ogrisel/pygbm.git
cd pygbm
pip install -r requirements.txt
pip install --editable .

Run the tests with pytest:

pip install -r requirements.txt
pytest

Benchmarking

The benchmarks folder contains some scripts to evaluate the computational performance of various parts of pygbm. Keep in mind that numba's JIT compilation takes time!
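
Because compilation happens on the first call of a jitted function, it can help to time that first call separately from later ones. The helper below is a minimal sketch using only the standard library; time_with_warmup and sum_of_squares are hypothetical names, and sum_of_squares is a plain Python stand-in for whatever numba-compiled function you want to measure.

```python
import time


def time_with_warmup(func, *args, repeats=3):
    """Return (first_call_seconds, best_later_call_seconds) for func(*args)."""
    start = time.perf_counter()
    func(*args)  # for a numba-jitted function this call includes compilation
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        later.append(time.perf_counter() - start)
    return first, min(later)


def sum_of_squares(n):
    # Stand-in workload; replace with the jitted function under test.
    return sum(i * i for i in range(n))


first, best = time_with_warmup(sum_of_squares, 100_000)
print(f"first call: {first:.4f}s, best of later calls: {best:.4f}s")
```

For a jitted function, a large gap between the two numbers is mostly compilation cost rather than steady-state run time.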

Profiling

To profile the benchmarks, you can use snakeviz to get an interactive HTML report:

pip install snakeviz
python -m cProfile -o bench_higgs_boson.prof benchmarks/bench_higgs_boson.py
snakeviz bench_higgs_boson.prof
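
If a plain-text summary is enough, the same .prof file can also be read with the standard library's pstats module. This is a small sketch rather than part of the project; top_functions is a hypothetical helper name.

```python
import io
import pstats


def top_functions(prof_path, limit=10):
    """Return a text report of the `limit` entries with the highest cumulative time."""
    stream = io.StringIO()
    stats = pstats.Stats(prof_path, stream=stream)
    # Shorten file paths, sort by cumulative time, and keep the top entries.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()
```

Calling print(top_functions("bench_higgs_boson.prof")) after the cProfile command above should list the functions that dominate the run, which are good candidates for a closer look.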

Debugging numba type inference

To introspect the results of type inference steps in the numba sections called by a given benchmarking script:

numba --annotate-html bench_higgs_boson.html benchmarks/bench_higgs_boson.py

In particular it is interesting to check that the numerical variables in the hot loops highlighted by the snakeviz profiling report have the expected precision level (e.g. float32 for loss computation, uint8 for binned feature values, ...).

Impact of thread-based parallelism

Some benchmarks call numba functions that leverage the built-in thread-based parallelism with @njit(parallel=True) and prange loops. On a multicore machine you can evaluate how the thread-based parallelism scales by explicitly setting the NUMBA_NUM_THREADS environment variable. For instance try:

NUMBA_NUM_THREADS=1 python benchmarks/bench_binning.py

vs:

NUMBA_NUM_THREADS=4 python benchmarks/bench_binning.py
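
The comparison can also be scripted. The sketch below runs a command once per thread count with NUMBA_NUM_THREADS set in the child process environment and records the wall-clock time; run_with_threads and compare_thread_counts are hypothetical helper names, not part of pygbm.

```python
import os
import subprocess
import sys
import time


def run_with_threads(command, num_threads):
    """Run `command` with NUMBA_NUM_THREADS set; return (seconds, stdout)."""
    env = dict(os.environ, NUMBA_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    completed = subprocess.run(
        command,
        env=env,
        check=True,  # raise if the benchmark exits with an error
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return time.perf_counter() - start, completed.stdout


def compare_thread_counts(command, thread_counts=(1, 2, 4)):
    """Return {num_threads: elapsed_seconds} for each requested thread count."""
    return {n: run_with_threads(command, n)[0] for n in thread_counts}
```

For example, compare_thread_counts([sys.executable, "benchmarks/bench_binning.py"]) times the binning benchmark with 1, 2 and 4 threads. The measured time includes interpreter startup and JIT compilation, so it is only a coarse indicator of how the parallel sections scale.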

Acknowledgements

The work from Nicolas Hug is supported by the National Science Foundation under Grant No. 1740305 and by DARPA under Grant No. DARPA-BAA-16-51.

The work from Olivier Grisel is supported by the scikit-learn initiative and its partners at Inria Fondation.

More Repositories

1

parallel_ml_tutorial

Tutorial on scikit-learn and IPython for parallel machine learning
Jupyter Notebook
1,589
star
2

notebooks

Some sample IPython notebooks for scikit-learn
Jupyter Notebook
556
star
3

pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
Java
159
star
4

python-appveyor-demo

Demo project for building Python wheels with appveyor.com
PowerShell
153
star
5

docker-distributed

Experimental docker-compose setup to bootstrap distributed on a docker-swarm cluster.
Shell
92
star
6

spylearn

Repo for experiments on pyspark and sklearn
Python
79
star
7

paper2ebook

Utility to re-structure research papers published in US Letter or A4 format PDF files to typically remove the 2 columns layout.
Java
53
star
8

text-mining-class

Introduction to web scraping and text mining
Python
43
star
9

dbpediakit

Python utilities to do work with the DBpedia dumps for analytics.
Python
38
star
10

euroscipy-2022-time-series

Tutorial on time-series forecasting with scikit-learn
Jupyter Notebook
32
star
11

wheelhouse-uploader

Script to help maintain a wheelhouse folder on a cloud storage.
Python
31
star
12

my-linux-devbox

Vagrant / Salt configuration with Ubuntu to work on projects related to the scipy stack under Python 3 and Python 2
Scheme
26
star
13

oglearn

ogrisel's utility extensions for scikit-learn
Python
24
star
14

eegssl

Experiments on Self-Supervised Learning on EEG data
Python
16
star
15

mahout

Personal development repository to prepare contributions and patches for Apache Mahout
Java
15
star
16

euroscipy_2017_sklearn

Notebooks for the EuroScipy 2017 tutorial (based on Adult Census income data)
Jupyter Notebook
15
star
17

corpusmaker

clojure utilities to build training corpora for machine learning / NLP out of public wikimedia dumps: status - partially stalled - will probably be reworked as cascalog scripts -- this project is in stalled mode right now: the pignlproc project is likely to replace it due to licensing constraints for future integration in Apache projects
Clojure
14
star
18

python-winbuilder

Tools to script a build environment on Windows for Python project
Python
9
star
19

codemaker

Neural nets-based utility to build low dimensional codes or/and sparse codes
Python
9
star
20

pycon-pydata-sprint

Experimental work for using IPython.parallel with scikit-learn
Python
8
star
21

salt-ipcluster

Salt states and modules to setup an IPython cluster
Scheme
7
star
22

docker-openblas

Docker container with an automated build for OpenBLAS stable branch:
Shell
5
star
23

stanbol-isbn

Demo stanbol extension for detecting and linking ISBN in text document
Java
5
star
24

silva

Leaf recognition prototype
4
star
25

bbuzz-semantic-hackathon

Sandbox for the Berlin Buzzwords semantic hackathon
Java
3
star
26

research

Draft research notes, code and todos
Jupyter Notebook
3
star
27

scikit-learn-github-actions

Test repo for github actions workflows
Python
2
star
28

ipython-azure

Utilities to deploy an IPython parallel cluster on Windows Azure
Python
2
star
29

lsh_glove

Script to build various LSH / ANN indices on glove word embeddings
Python
2
star
30

cardice

Cloud compute cluster setup with SaltStack
Python
2
star
31

energy_charts

Jupyter Notebook
2
star
32

decks

Slide decks for conferences
CSS
2
star
33

brain2vec

Brain embedding by contextual predictions (draft)
Python
2
star
34

instrumentalist

Python scripts to read XBee sensor data and push it to a couchdb database
Python
2
star
35

mnist-sbi

Simulation Based Inference for the important problem of drawing digits
Python
2
star
36

scikit-learn.org

Source repository to build the HTML website for the scikit-learn project.
Python
1
star
37

camera-html5

Test repo for HTML5 camera access on mobile phones
1
star
38

sandbox

1
star
39

cpython-nightly

Automated build of the master branch of CPython for Continuous Integration purposes
1
star
40

docker-sklearn-openblas

Shell
1
star
41

us-housing-prices-v2-parquet

Exploratory Data Analysis on a parquet dump of https://www.dolthub.com/repositories/dolthub/us-housing-prices-v2 using duckdb and Ibis
Jupyter Notebook
1
star