Machine Learning Benchmarks

Machine Learning Benchmarks contains implementations of machine learning algorithms across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks and algorithms. It currently supports the scikit-learn, DAAL4PY, cuML, and XGBoost frameworks for commonly used machine learning algorithms.

Follow us on Medium

We publish blogs on Medium, so follow us to learn tips and tricks for more efficient data analysis.

Table of contents

How to create conda environment for benchmarking
Running Python benchmarks with runner script
Benchmark supported algorithms
- Scikit-learn benchmarks
Algorithm parameters

How to create conda environment for benchmarking

Create a suitable conda environment for each framework you want to test. The installation commands for each framework are listed below.

scikit-learn

pip install -r sklearn_bench/requirements.txt
# or
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm

daal4py

conda install -c conda-forge scikit-learn daal4py pandas tqdm

cuml

conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm

xgboost

pip install -r xgboost_bench/requirements.txt
# or
conda install -c conda-forge xgboost scikit-learn pandas tqdm

Running Python benchmarks with runner script

Run python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report] to launch benchmarks.
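If you launch the runner from another script, the same invocation can be assembled as an argument list. The following is a minimal sketch that uses only the flags shown in the command above; the helper name build_runner_command is hypothetical and not part of the repository.

```python
import sys


def build_runner_command(configs, output_file=None, verbose=None, report=False):
    """Assemble the argument list for runner.py (hypothetical helper)."""
    # --configs is the only argument that is not optional in the command above.
    cmd = [sys.executable, "runner.py", "--configs", configs]
    if output_file is not None:
        cmd += ["--output-file", output_file]
    if verbose is not None:
        cmd += ["--verbose", verbose]
    if report:
        cmd.append("--report")
    return cmd


command = build_runner_command(
    "configs/config_example.json",
    output_file="result.json",
    verbose="INFO",
    report=True,
)
# Show the command without the interpreter path.
print(" ".join(command[1:]))

# To actually launch the benchmarks, run this from the repository root:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a config path contains spaces.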

Options:

--configs: specify the path to a configuration file or a folder that contains configuration files.
--no-intel-optimized: use scikit-learn without Intel(R) Extension for Scikit-learn. This option is currently available for scikit-learn benchmarks. By default, the runner uses Intel(R) Extension for Scikit-learn.
--output-file: specify the name of the output file for the benchmark result. The default name is result.json.
--report: create an Excel report based on benchmark results. The openpyxl library is required.
--dummy-run: run the configuration parser and dataset generation without running the benchmarks.
--verbose: set the logging level to WARNING, INFO, or DEBUG to control how much additional information is printed while the benchmarks are running. The default is INFO.

Level	Description
DEBUG	Detailed information, typically of interest only when diagnosing problems. Usually at this level the logging output is so low level that it’s not useful to users who are not familiar with the software’s internals.
INFO	Confirmation that things are working as expected.
WARNING	An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected.
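The three level names match constants in Python's standard logging module. The sketch below shows how such a name could be turned into a logger threshold. It assumes the runner relies on the standard logging module, which this document does not confirm, and the names configure_logger and bench_example are hypothetical.

```python
import logging

# Level names accepted by --verbose, as listed in the table above.
ALLOWED_LEVELS = ("WARNING", "INFO", "DEBUG")


def configure_logger(verbose="INFO", name="bench_example"):
    """Return a logger whose threshold matches the given level name."""
    if verbose not in ALLOWED_LEVELS:
        raise ValueError(f"verbose must be one of {ALLOWED_LEVELS}, got {verbose!r}")
    logger = logging.getLogger(name)
    # logging.WARNING, logging.INFO and logging.DEBUG are integer constants.
    logger.setLevel(getattr(logging, verbose))
    return logger


logger = configure_logger("INFO")
# With INFO as the threshold, DEBUG messages are filtered out.
print(logger.isEnabledFor(logging.INFO), logger.isEnabledFor(logging.DEBUG))
```

Lower thresholds include everything above them, so DEBUG also shows INFO and WARNING messages.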

Benchmarks currently support the following frameworks:

scikit-learn
daal4py
cuml
xgboost

The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.

You can configure benchmarks by editing a config file. See the config.json schema for more details.
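Because --configs accepts either a single file or a folder, it can help to check which files a given path would cover before starting a long run. The sketch below is one possible way to do that. It assumes config files use the .json extension and that a folder is not searched recursively; neither detail is confirmed here, and collect_configs and load_config are hypothetical helpers, not the runner's own code.

```python
import json
from pathlib import Path


def collect_configs(path):
    """Return the config files covered by a file path or a folder path."""
    p = Path(path)
    if p.is_dir():
        # Assumption: only top-level *.json files in the folder count.
        return sorted(p.glob("*.json"))
    return [p]


def load_config(path):
    """Parse one config file; raises json.JSONDecodeError if it is malformed."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    for config_path in collect_configs("configs"):
        config = load_config(config_path)
        print(config_path, "parsed OK,", type(config).__name__)
```

Parsing each file up front catches JSON syntax errors early, which is similar in spirit to what --dummy-run offers for the parser and dataset generation.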

Benchmark supported algorithms

algorithm	benchmark name	sklearn (CPU)	sklearn (GPU)	daal4py	cuml	xgboost
DBSCAN	dbscan	✅	✅	✅	✅	❌
RandomForestClassifier	df_clfs	✅	❌	✅	✅	❌
RandomForestRegressor	df_regr	✅	❌	✅	✅	❌
pairwise_distances	distances	✅	❌	✅	❌	❌
KMeans	kmeans	✅	✅	✅	✅	❌
KNeighborsClassifier	knn_clsf	✅	❌	❌	✅	❌
LinearRegression	linear	✅	✅	✅	✅	❌
LogisticRegression	log_reg	✅	✅	✅	✅	❌
PCA	pca	✅	❌	✅	✅	❌
Ridge	ridge	✅	❌	✅	✅	❌
SVM	svm	✅	❌	✅	✅	❌
TSNE	tsne	✅	❌	❌	✅	❌
train_test_split	train_test_split	✅	❌	❌	✅	❌
GradientBoostingClassifier	gbt	❌	❌	❌	❌	✅
GradientBoostingRegressor	gbt	❌	❌	❌	❌	✅
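When you choose which benchmarks to include for a framework, the table above can be queried as a small lookup. The sketch below transcribes the table into a dictionary keyed by benchmark name; the SUPPORT mapping and the benchmarks_for helper are illustrative only and are not part of the repository.

```python
# Support matrix transcribed from the table above (benchmark name -> frameworks).
CPU, GPU = "sklearn (CPU)", "sklearn (GPU)"
SUPPORT = {
    "dbscan": {CPU, GPU, "daal4py", "cuml"},
    "df_clfs": {CPU, "daal4py", "cuml"},
    "df_regr": {CPU, "daal4py", "cuml"},
    "distances": {CPU, "daal4py"},
    "kmeans": {CPU, GPU, "daal4py", "cuml"},
    "knn_clsf": {CPU, "cuml"},
    "linear": {CPU, GPU, "daal4py", "cuml"},
    "log_reg": {CPU, GPU, "daal4py", "cuml"},
    "pca": {CPU, "daal4py", "cuml"},
    "ridge": {CPU, "daal4py", "cuml"},
    "svm": {CPU, "daal4py", "cuml"},
    "tsne": {CPU, "cuml"},
    "train_test_split": {CPU, "cuml"},
    "gbt": {"xgboost"},
}


def benchmarks_for(framework):
    """List the benchmark names available for one framework column."""
    return sorted(name for name, frameworks in SUPPORT.items() if framework in frameworks)


print(benchmarks_for(GPU))        # ['dbscan', 'kmeans', 'linear', 'log_reg']
print(benchmarks_for("xgboost"))  # ['gbt']
```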

Scikit-learn benchmarks

When you run scikit-learn benchmarks on CPU, Intel(R) Extension for Scikit-learn is used by default. Use the --no-intel-optimized option to run the benchmarks without the extension.

For the algorithms with both CPU and GPU support, you may use the same configuration file to run the scikit-learn benchmarks on CPU and GPU.

Algorithm parameters

You can launch benchmarks for each algorithm separately. To do this, go to the directory with the benchmark:

cd <framework>

Run the following command:

python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>

You can find the list of supported parameters for each algorithm here:

IntelPython/scikit-learn_bench
