

Machine Learning Benchmarks


Machine Learning Benchmarks contains implementations of machine learning algorithms across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks and algorithms. It currently supports the scikit-learn, DAAL4PY, cuML, and XGBoost frameworks for commonly used machine learning algorithms.

Follow us on Medium

We publish blogs on Medium, so follow us to learn tips and tricks for more efficient data analysis.


How to create a conda environment for benchmarking

Create a suitable conda environment for each framework you want to test. The commands below set up an environment for each supported framework:

# scikit-learn
pip install -r sklearn_bench/requirements.txt
# or
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm

# daal4py
conda install -c conda-forge scikit-learn daal4py pandas tqdm

# cuml
conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm

# xgboost
pip install -r xgboost_bench/requirements.txt
# or
conda install -c conda-forge xgboost scikit-learn pandas tqdm
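After setting up an environment, you can quickly check which of the supported frameworks are importable from it. This is a minimal helper, not part of the repository; the module names are the usual import names for each framework:

```python
import importlib.util

def available_frameworks():
    """Return a dict mapping each supported framework to whether it is importable."""
    # Import names assumed from each project's documentation.
    modules = {
        "scikit-learn": "sklearn",
        "daal4py": "daal4py",
        "cuml": "cuml",
        "xgboost": "xgboost",
    }
    return {name: importlib.util.find_spec(mod) is not None
            for name, mod in modules.items()}

print(available_frameworks())
```

Activate the environment you created for a given framework before running the check, since each environment will only report its own frameworks as available.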

Running Python benchmarks with runner script

Run python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report] to launch benchmarks.

Options:

  • --configs: specify the path to a configuration file or a folder that contains configuration files.
  • --no-intel-optimized: run scikit-learn without Intel(R) Extension for Scikit-learn*. Currently available for scikit-learn benchmarks only. By default, the runner uses Intel(R) Extension for Scikit-learn.
  • --output-file: specify the name of the output file for the benchmark results. The default name is result.json.
  • --report: create an Excel report based on benchmark results. The openpyxl library is required.
  • --dummy-run: run configuration parser and dataset generation without benchmarks running.
  • --verbose: logging level to print while the benchmarks are running: WARNING, INFO, or DEBUG. The default is INFO.

Level   | Description
DEBUG   | Detailed information, typically of interest only when diagnosing problems. At this level the logging output is usually so low-level that it is not useful to users unfamiliar with the software's internals.
INFO    | Confirmation that things are working as expected.
WARNING | An indication that something unexpected happened, or of some problem in the near future (e.g. 'disk space low'). The software is still working as expected.
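These values map onto Python's standard logging levels. A minimal sketch of how the three levels behave (illustrative only, not the runner's actual code):

```python
import logging

# Configure the root logger at INFO, mirroring the runner's default verbosity.
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

logging.debug("internal state for diagnosing problems")  # suppressed at INFO
logging.info("benchmark finished as expected")           # printed
logging.warning("disk space low")                        # printed
```

Raising the level to WARNING silences the INFO line as well, which is why the default of INFO is a reasonable middle ground.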

Benchmarks currently support the following frameworks:

  • scikit-learn
  • daal4py
  • cuml
  • xgboost

The benchmark configuration lets you select the frameworks to run, choose datasets for measurement, and configure the parameters of the algorithms.

You can configure benchmarks by editing a config file. Check config.json schema for more details.
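To give a sense of the shape of such a file, here is a hypothetical sketch of a configuration that runs one KMeans case on a synthetic dataset. The key names below are illustrative assumptions; consult the config.json schema in the repository for the real field names and allowed values.

```json
{
  "common": {
    "lib": "sklearn",
    "data-format": "pandas",
    "dtype": "float64"
  },
  "cases": [
    {
      "algorithm": "kmeans",
      "dataset": [
        {
          "source": "synthetic",
          "type": "blobs",
          "n_clusters": 10,
          "training": { "n_samples": 100000, "n_features": 50 }
        }
      ]
    }
  ]
}
```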

Supported algorithms

algorithm                  | benchmark name   | sklearn (CPU) | sklearn (GPU) | daal4py | cuml | xgboost
DBSCAN                     | dbscan           | ✅ | ✅ | ✅ | ✅ | ❌
RandomForestClassifier     | df_clfs          | ✅ | ❌ | ✅ | ✅ | ❌
RandomForestRegressor      | df_regr          | ✅ | ❌ | ✅ | ✅ | ❌
pairwise_distances         | distances        | ✅ | ❌ | ✅ | ❌ | ❌
KMeans                     | kmeans           | ✅ | ✅ | ✅ | ✅ | ❌
KNeighborsClassifier       | knn_clsf         | ✅ | ❌ | ❌ | ✅ | ❌
LinearRegression           | linear           | ✅ | ✅ | ✅ | ✅ | ❌
LogisticRegression         | log_reg          | ✅ | ✅ | ✅ | ✅ | ❌
PCA                        | pca              | ✅ | ❌ | ✅ | ✅ | ❌
Ridge                      | ridge            | ✅ | ❌ | ✅ | ✅ | ❌
SVM                        | svm              | ✅ | ❌ | ✅ | ✅ | ❌
TSNE                       | tsne             | ✅ | ❌ | ❌ | ✅ | ❌
train_test_split           | train_test_split | ✅ | ❌ | ❌ | ✅ | ❌
GradientBoostingClassifier | gbt              | ❌ | ❌ | ❌ | ❌ | ✅
GradientBoostingRegressor  | gbt              | ❌ | ❌ | ❌ | ❌ | ✅
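Each benchmark ultimately measures the wall-clock time of calls such as an estimator's fit or predict on a chosen dataset. A minimal timing loop in that spirit (an illustrative sketch, not the repository's actual measurement code):

```python
import time

def time_best(func, *args, repeats=5, **kwargs):
    """Call func repeatedly; return (best_seconds, last_result).

    Taking the best of several runs reduces noise from other processes,
    a common practice in micro-benchmarking.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

# Usage with any callable, e.g. an estimator's bound fit method:
elapsed, _ = time_best(sorted, list(range(100000, 0, -1)))
print(f"best of 5 runs: {elapsed:.4f}s")
```

In the real benchmarks, the timed callable would be something like `KMeans(...).fit` with data loaded according to the configuration file.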

Scikit-learn benchmarks

When you run scikit-learn benchmarks on CPU, Intel(R) Extension for Scikit-learn is used by default. Use the --no-intel-optimized option to run the benchmarks without the extension.

For the algorithms with both CPU and GPU support, you may use the same configuration file to run the scikit-learn benchmarks on CPU and GPU.

Algorithm parameters

You can launch benchmarks for each algorithm separately. To do this, go to the directory with the benchmark:

cd <framework>

Run the following command:

python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>

You can find the list of supported parameters for each algorithm here:
