  • Stars: 1,136
  • Rank: 41,003 (Top 0.9 %)
  • Language: Python
  • License: MIT License
  • Created almost 6 years ago
  • Updated 9 months ago


Repository Details

Extra blocks for scikit-learn pipelines.


scikit-lego

We love scikit-learn, but very often we find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a package that offers code quality and testing. This project started as a collaboration between multiple companies in the Netherlands but has since received contributions from around the globe. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.

Note that we're not formally affiliated with the scikit-learn project at all, but we aim to strictly adhere to their standards.

The same holds for LEGO. LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this project.

Installation

Install scikit-lego via pip with

python -m pip install scikit-lego

Via conda with

conda install -c conda-forge scikit-lego

Alternatively, to edit and contribute you can fork/clone and run:

python -m pip install -e ".[dev]"
python setup.py develop

Documentation

The documentation can be found here.

Usage

We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.

# the scikit-learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...

Features

Here's a list of features that this library currently offers:

  • sklego.datasets.load_abalone loads in the abalone dataset
  • sklego.datasets.load_arrests loads in a dataset with fairness concerns
  • sklego.datasets.load_chicken loads in the joyful chickweight dataset
  • sklego.datasets.load_heroes loads a Heroes of the Storm dataset
  • sklego.datasets.load_hearts loads a dataset about hearts
  • sklego.datasets.load_penguins loads a lovely dataset about penguins
  • sklego.datasets.fetch_creditcard fetches a fraud dataset from openml
  • sklego.datasets.make_simpleseries makes a simulated timeseries
  • sklego.pandas_utils.add_lags adds lag values in a pandas dataframe
  • sklego.pandas_utils.log_step a useful decorator to log your pipeline steps
  • sklego.dummy.RandomRegressor dummy benchmark that predicts random values
  • sklego.linear_model.DeadZoneRegressor experimental feature that has a deadzone in the cost function
  • sklego.linear_model.DemographicParityClassifier logistic classifier constrained on demographic parity
  • sklego.linear_model.EqualOpportunityClassifier logistic classifier constrained on equal opportunity
  • sklego.linear_model.ProbWeightRegression linear model that treats coefficients as probabilistic weights
  • sklego.linear_model.LowessRegression locally weighted linear regression
  • sklego.linear_model.LADRegression least absolute deviation regression
  • sklego.linear_model.QuantileRegression linear quantile regression, generalizes LADRegression
  • sklego.linear_model.ImbalancedLinearRegression punish over/under-estimation of a model directly
  • sklego.naive_bayes.GaussianMixtureNB classifies by training a 1D GMM per column per class
  • sklego.naive_bayes.BayesianGaussianMixtureNB classifies by training a bayesian 1D GMM per class
  • sklego.mixture.BayesianGMMClassifier classifies by training a bayesian GMM per class
  • sklego.mixture.BayesianGMMOutlierDetector detects outliers based on a trained bayesian GMM
  • sklego.mixture.GMMClassifier classifies by training a GMM per class
  • sklego.mixture.GMMOutlierDetector detects outliers based on a trained GMM
  • sklego.meta.ConfusionBalancer experimental feature that allows you to balance the confusion matrix
  • sklego.meta.DecayEstimator adds decay to the sample_weight that the model accepts
  • sklego.meta.EstimatorTransformer adds a model output as a feature
  • sklego.meta.OutlierClassifier turns outlier models into classifiers for gridsearch
  • sklego.meta.GroupedPredictor can split the data into groups and run a model on each
  • sklego.meta.GroupedTransformer can split the data into groups and run a transformer on each
  • sklego.meta.SubjectiveClassifier experimental feature to add a prior to your classifier
  • sklego.meta.Thresholder meta model that allows you to gridsearch over the threshold
  • sklego.meta.RegressionOutlierDetector meta model that finds outliers by adding a threshold to regression
  • sklego.meta.ZeroInflatedRegressor predicts zero or applies a regression based on a classifier
  • sklego.preprocessing.ColumnCapper limits extreme values of the model features
  • sklego.preprocessing.ColumnDropper drops a column from pandas
  • sklego.preprocessing.ColumnSelector selects columns based on column name
  • sklego.preprocessing.InformationFilter transformer that can de-correlate features
  • sklego.preprocessing.IdentityTransformer returns the same data, allows for concatenating pipelines
  • sklego.preprocessing.OrthogonalTransformer makes all features linearly independent
  • sklego.preprocessing.PandasTypeSelector selects columns based on pandas type
  • sklego.preprocessing.PatsyTransformer applies a patsy formula
  • sklego.preprocessing.RandomAdder adds randomness in training
  • sklego.preprocessing.RepeatingBasisFunction repeating feature engineering, useful for timeseries
  • sklego.preprocessing.DictMapper assign numeric values on categorical columns
  • sklego.preprocessing.OutlierRemover experimental method to remove outliers during training
  • sklego.model_selection.GroupTimeSeriesSplit timeseries Kfold for groups with different numbers of observations per group
  • sklego.model_selection.KlusterFoldValidation experimental feature that does K folds based on clustering
  • sklego.model_selection.TimeGapSplit timeseries Kfold with a gap between train/test
  • sklego.pipeline.DebugPipeline adds debug information to make debugging easier
  • sklego.pipeline.make_debug_pipeline shorthand function to create a debuggable pipeline
  • sklego.metrics.correlation_score calculates correlation between model output and feature
  • sklego.metrics.equal_opportunity_score calculates equal opportunity metric
  • sklego.metrics.p_percent_score proxy for model fairness with regards to sensitive attribute
  • sklego.metrics.subset_score calculates a score on a subset of your data (meant for fairness tracking)
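As an illustration of one idea from the list, the sketch below shows what a timeseries K-fold "with a gap between train/test" means in terms of row indices. It is a plain-Python approximation of the concept only, not the implementation or API of `sklego.model_selection.TimeGapSplit`; the function name `time_gap_splits` and its parameters are hypothetical.

```python
from typing import Iterator, List, Tuple


def time_gap_splits(
    n_samples: int, train_size: int, gap_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists over time-ordered rows.

    Hypothetical helper for illustration: each fold trains on a window,
    skips `gap_size` rows, then tests on the next `test_size` rows.
    """
    window = train_size + gap_size + test_size
    start = 0
    while start + window <= n_samples:
        train_end = start + train_size
        test_start = train_end + gap_size
        train = list(range(start, train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test
        start += test_size  # slide forward by one test window


splits = list(time_gap_splits(n_samples=10, train_size=4, gap_size=1, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

The gap keeps rows that sit right next to the training window out of the test set, which can reduce leakage when neighbouring observations are strongly correlated.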

New Features

We want to be rather open in what we accept, but we do require three things before a feature is added to the project:

  1. any new feature contributes towards a demonstrable real-world use case
  2. any new feature passes standard unit tests (we use the ones from scikit-learn)
  3. the feature has been discussed in the issue list beforehand

We automate all of our testing and use pre-commit hooks to keep the code working.
