JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features

Overview

  • Functionality:
    • Three times more supported Python packages, transformers and estimators than all the competitors combined!
    • Thorough collection, analysis and encoding of feature information:
      • Names.
      • Data and operational types.
      • Valid, invalid and missing value spaces.
      • Descriptive statistics.
    • Pipeline extensions:
      • Pruning.
      • Decision engineering (prediction post-processing).
      • Model verification.
    • Conversion options.
  • Extensibility:
    • Rich Java APIs for developing custom converters.
    • Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
    • Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.
  • Production quality:
    • Complete test coverage.
    • Fully compliant with the JPMML-Evaluator library.

Supported packages

Scikit-Learn

Examples: main.py

Category Encoders

Examples: extensions/category_encoders.py and extensions/category_encoders-xgboost.py

H2O.ai

Examples: main-h2o.py

Imbalanced-Learn

Examples: extensions/imblearn.py

LightGBM

Examples: main-lightgbm.py

Mlxtend

Examples: N/A

OptBinning

Examples: extensions/optbinning.py

PyCaret

Examples: extensions/pycaret.py

  • pycaret.internal.pipeline.Pipeline
  • pycaret.internal.preprocess.transformers.CleanColumnNames
  • pycaret.internal.preprocess.transformers.FixImbalancer
  • pycaret.internal.preprocess.transformers.RareCategoryGrouping
  • pycaret.internal.preprocess.transformers.RemoveMulticollinearity
  • pycaret.internal.preprocess.transformers.TransformerWrapper
  • pycaret.internal.preprocess.transformers.TransformerWrapperWithInverse

Scikit-Lego

Examples: extensions/sklego.py

  • sklego.meta.EstimatorTransformer
    • Predict functions apply, decision_function, predict and predict_proba.
  • sklego.pipeline.DebugPipeline
  • sklego.preprocessing.IdentityTransformer

SkLearn2PMML

Examples: main.py and extensions/sklearn2pmml.py

  • Helpers:
    • sklearn2pmml.EstimatorProxy
    • sklearn2pmml.SelectorProxy
    • sklearn2pmml.h2o.H2OEstimatorProxy
  • Feature specification and decoration:
    • sklearn2pmml.decoration.Alias
    • sklearn2pmml.decoration.CategoricalDomain
    • sklearn2pmml.decoration.ContinuousDomain
    • sklearn2pmml.decoration.ContinuousDomainEraser
    • sklearn2pmml.decoration.DateDomain
    • sklearn2pmml.decoration.DateTimeDomain
    • sklearn2pmml.decoration.DiscreteDomainEraser
    • sklearn2pmml.decoration.MultiAlias
    • sklearn2pmml.decoration.MultiDomain
    • sklearn2pmml.decoration.OrdinalDomain
  • Ensemble methods:
    • sklearn2pmml.ensemble.EstimatorChain
    • sklearn2pmml.ensemble.GBDTLMRegressor
      • The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
      • The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
    • sklearn2pmml.ensemble.GBDTLRClassifier
      • The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
      • The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
    • sklearn2pmml.ensemble.SelectFirstClassifier
    • sklearn2pmml.ensemble.SelectFirstRegressor
  • UDF models:
    • sklearn2pmml.expression.ExpressionClassifier
    • sklearn2pmml.expression.ExpressionRegressor
  • Feature selection:
    • sklearn2pmml.feature_selection.SelectUnique
  • Linear models:
    • sklearn2pmml.statsmodels.StatsModelsClassifier
    • sklearn2pmml.statsmodels.StatsModelsRegressor
  • Neural networks:
    • sklearn2pmml.neural_network.MLPTransformer
  • Pipeline:
    • sklearn2pmml.pipeline.PMMLPipeline
  • Postprocessing:
    • sklearn2pmml.postprocessing.BusinessDecisionTransformer
  • Preprocessing:
    • sklearn2pmml.preprocessing.Aggregator
    • sklearn2pmml.preprocessing.BSplineTransformer
    • sklearn2pmml.preprocessing.CastTransformer
    • sklearn2pmml.preprocessing.ConcatTransformer
    • sklearn2pmml.preprocessing.CutTransformer
    • sklearn2pmml.preprocessing.DataFrameConstructor
    • sklearn2pmml.preprocessing.DateTimeFormatter
    • sklearn2pmml.preprocessing.DaysSinceYearTransformer
    • sklearn2pmml.preprocessing.ExpressionTransformer
      • Ternary conditional expression <expression_true> if <condition> else <expression_false>.
      • Array indexing expressions X[<column index>] and X[<column name>].
      • String concatenation expressions.
      • String slicing expressions <str>[<start>:<stop>].
      • Arithmetic operators +, -, *, / and %.
      • Identity comparison operators is None and is not None.
      • Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
      • Logical operators and, or and not.
      • Math constants math.e, math.nan and math.pi.
      • Math functions (too numerous to list).
      • Numpy constants numpy.e, numpy.NaN, numpy.NZERO, numpy.pi and numpy.PZERO.
      • Numpy function numpy.where.
      • Numpy universal functions (too numerous to list).
      • Pandas constants pandas.NA and pandas.NaT.
      • Pandas functions pandas.isna, pandas.isnull, pandas.notna and pandas.notnull.
      • Scipy functions scipy.special.expit and scipy.special.logit.
      • String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
      • String length function len(<str>).
      • User-defined functions.
    • sklearn2pmml.preprocessing.FilterLookupTransformer
    • sklearn2pmml.preprocessing.LookupTransformer
    • sklearn2pmml.preprocessing.MatchesTransformer
    • sklearn2pmml.preprocessing.MultiLookupTransformer
    • sklearn2pmml.preprocessing.NumberFormatter
    • sklearn2pmml.preprocessing.PMMLLabelBinarizer
    • sklearn2pmml.preprocessing.PMMLLabelEncoder
    • sklearn2pmml.preprocessing.PowerFunctionTransformer
    • sklearn2pmml.preprocessing.ReplaceTransformer
    • sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
    • sklearn2pmml.preprocessing.SecondsSinceYearTransformer
    • sklearn2pmml.preprocessing.StringNormalizer
    • sklearn2pmml.preprocessing.SubstringTransformer
    • sklearn2pmml.preprocessing.WordCountTransformer
    • sklearn2pmml.preprocessing.h2o.H2OFrameConstructor
    • sklearn2pmml.util.Reshaper
    • sklearn2pmml.util.Slicer
  • Rule sets:
    • sklearn2pmml.ruleset.RuleSetClassifier
  • Decision trees:
    • sklearn2pmml.tree.chaid.CHAIDClassifier
    • sklearn2pmml.tree.chaid.CHAIDRegressor
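
The expression forms listed under sklearn2pmml.preprocessing.ExpressionTransformer above are ordinary Python syntax, so they can be tried out in plain Python before being handed to a transformer. The following is a minimal sketch under that assumption. The helper evaluate_expression and the sample row are hypothetical; they are not part of SkLearn2PMML and do not reproduce how the library evaluates or translates expressions.

```python
import math

def evaluate_expression(expr, row):
    """Evaluate an expression string against one row, exposed as ``X``.

    Hypothetical helper for experimentation only. It relies on eval(),
    so it should only ever be given trusted expression strings.
    """
    # Expose only the names that the sample expressions below use.
    allowed_globals = {"__builtins__": {}, "math": math, "len": len}
    return eval(expr, allowed_globals, {"X": row})

# A made-up row, keyed by column name.
row = {"Sepal.Length": 5.1, "Sepal.Width": 3.5, "Species": "setosa"}

# Array indexing by column name, combined with an arithmetic operator.
ratio = evaluate_expression('X["Sepal.Length"] / X["Sepal.Width"]', row)

# Ternary conditional expression with a comparison operator.
size = evaluate_expression('"long" if X["Sepal.Length"] > 5.0 else "short"', row)

# String slicing followed by a string function.
code = evaluate_expression('X["Species"][0:3].upper()', row)

# Identity comparison, a logical operator and the string length function.
has_label = evaluate_expression(
    'X["Species"] is not None and len(X["Species"]) > 0', row
)

print(ratio, size, code, has_label)
```

Checking an expression string this way only confirms that it is valid Python; whether the converter accepts it still depends on the feature list above.
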

Sklearn-Pandas

Examples: main.py

  • sklearn_pandas.CategoricalImputer
  • sklearn_pandas.DataFrameMapper

StatsModels

Examples: main-statsmodels.py

TPOT

Examples: extensions/tpot.py

  • tpot.builtins.stacking_estimator.StackingEstimator

XGBoost

Examples: main-xgboost.py, extensions/category_encoders-xgboost.py and extensions/categorical.py

Prerequisites

The Python side of operations

Validating Python installation:

import joblib, sklearn, sklearn_pandas, sklearn2pmml

print(joblib.__version__)
print(sklearn.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
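
The check above stops at the first package that fails to import. As a possible alternative, the sketch below reports every package in one pass, using only the standard library (Python 3.8 or newer). The PyPI distribution names in DISTRIBUTIONS are an assumption about how the packages were installed, and installed_versions is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names; these can differ from the import names
# (for example, the "sklearn" module ships in "scikit-learn").
DISTRIBUTIONS = ["joblib", "scikit-learn", "sklearn-pandas", "sklearn2pmml"]

def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(DISTRIBUTIONS)
for name, found in report.items():
    print(f"{name}: {found if found is not None else 'not installed'}")
```
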

The JPMML-SkLearn side of operations

  • Java 1.8 or newer.
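
The Java requirement can also be checked from Python before attempting a conversion. This is a rough sketch: it assumes a java executable on the PATH that prints a line such as `openjdk version "1.8.0_292"` or `java version "17.0.2"`, which is common but not guaranteed for every JVM distribution. The function names are hypothetical.

```python
import re
import subprocess

def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8; newer scheme: "17.0.2" means Java 17.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first

def detect_java_major(executable="java"):
    """Run `<executable> -version` and parse its output; None if unavailable."""
    try:
        result = subprocess.run(
            [executable, "-version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # The JVM conventionally prints version information to stderr.
    return parse_java_major(result.stderr + result.stdout)

major = detect_java_major()
if major is None:
    print("No usable Java installation detected")
else:
    print(f"Java {major} detected, meets requirement: {major >= 8}")
```
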

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.7-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

  1. Use Python to train a model.
  2. Serialize the model in pickle data format to a file on the local filesystem.
  3. Use the JPMML-SkLearn command-line converter application to convert the pickle file to a PMML file.

The Python side of operations

Loading data to a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Recording feature importance information as a plain attribute, so that it is included in the pickled state:

classifier.pmml_feature_importances_ = classifier.feature_importances_
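
A likely reason for this extra assignment is that feature_importances_ on Scikit-Learn tree models is a computed property rather than a stored attribute, so it is not part of the instance state that gets pickled. The toy class below illustrates the general mechanism with the standard library only. ToyTree is a hypothetical stand-in, not Scikit-Learn code, and it assumes the default pickling behaviour of saving the instance `__dict__`.

```python
class ToyTree:
    """Hypothetical stand-in for an estimator; not Scikit-Learn code."""

    def __init__(self, raw_scores):
        self.raw_scores = raw_scores

    @property
    def feature_importances_(self):
        # Computed on every access, never stored on the instance.
        total = sum(self.raw_scores)
        return [score / total for score in self.raw_scores]

tree = ToyTree([3.0, 1.0])

# By default, pickle saves the instance __dict__, which vars() returns.
# The property is readable, but it is not part of that state.
state_before = set(vars(tree))

# Copying the computed value to a plain attribute adds it to the state.
tree.pmml_feature_importances_ = tree.feature_importances_
state_after = set(vars(tree))

print(sorted(state_before))
print(sorted(state_after))
```
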

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))

Storing the fitted PMMLPipeline object in pickle data format:

import joblib

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
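
When the conversion is part of a larger Python script, the same command can be launched with the standard subprocess module. This is a minimal sketch: the JAR path and the --pkl-input and --pmml-output options are taken from the command above, while the function names build_convert_command and convert are hypothetical.

```python
import subprocess

# JAR path taken from the build step; adjust it to match the local build.
CONVERTER_JAR = (
    "pmml-sklearn-example/target/"
    "pmml-sklearn-example-executable-1.7-SNAPSHOT.jar"
)

def build_convert_command(pkl_path, pmml_path, jar_path=CONVERTER_JAR, java="java"):
    """Assemble the converter command line as an argument list."""
    return [
        java, "-jar", jar_path,
        "--pkl-input", pkl_path,
        "--pmml-output", pmml_path,
    ]

def convert(pkl_path, pmml_path, **kwargs):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(pkl_path, pmml_path, **kwargs), check=True)

# Only build and show the command here; call convert(...) to actually run it.
command = build_convert_command("pipeline.pkl.z", "pipeline.pmml")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.
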

Getting help:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --help

Documentation

Integrations:

AutoML and other kinds of workflow automations:

Extensions:

Miscellaneous:

Archived:

License

JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact [email protected]
