• Stars: 108
• Rank: 321,259 (Top 7%)
• Language: Python
• Created over 5 years ago
• Updated about 4 years ago

Repository Details

Mini module with syntax sugar for pandas/sklearn

Chainlearn

Mini module with some syntax sugar utilities for pandas and sklearn. It lets you turn this:

import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE 
from sklearn.cluster import KMeans
 
iris = sns.load_dataset('iris').drop('species', axis=1)
 
pca = PCA(n_components=3)
tsne = TSNE(n_components=2)

kmeans = KMeans(n_clusters=2)

transformed = tsne.fit_transform(pca.fit_transform(iris))

cluster_labels = kmeans.fit_predict(transformed)

plt.scatter(transformed[:, 0], transformed[:, 1], c=cluster_labels)

into a chainlearn pipeline that aims to read like a "tidyverse"-style chain:

import seaborn as sns
import chainlearn
from matplotlib import pyplot as plt

iris = sns.load_dataset('iris')

(iris
 .drop('species', axis=1)
 .learn.PCA(n_components=3)
 .learn.TSNE(n_components=2)
 .assign(
     cluster=lambda df: df.learn.KMeans(n_clusters=2)
 )
 .plot
 .scatter(
     x=0,
     y=1,
     c='cluster',
     cmap=plt.get_cmap('viridis')
 )
);

This works by attaching a selection of sklearn model and preprocessing classes to the pandas DataFrame and Series classes (reachable through the learn attribute) and guessing which method should be called on each one.
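The "guessing" step can be pictured as a duck-typing check on the model object. The sketch below is not chainlearn's actual code: `fit_and_apply` and the two toy classes are hypothetical names made up for illustration, and the preference order (fit_predict, then fit plus predict, then fit_transform) is an assumption. It uses plain lists so it runs without pandas or sklearn.

```python
def fit_and_apply(model, X, y=None):
    """Fit `model` on X and return its most natural output.

    Assumed preference order (hypothetical, for illustration only):
    fit_predict -> fit + predict -> fit_transform.
    """
    if y is None and hasattr(model, "fit_predict"):
        return model.fit_predict(X)
    if hasattr(model, "predict"):
        # sklearn-style fit() returns the fitted model itself.
        return model.fit(X, y).predict(X)
    if hasattr(model, "fit_transform"):
        return model.fit_transform(X)
    raise TypeError(f"{type(model).__name__} has no usable sklearn-style method")


class ToyScaler:
    """Toy transformer: divides every value by the largest value seen."""

    def fit_transform(self, X):
        top = max(max(row) for row in X)
        return [[value / top for value in row] for row in X]


class ToyThresholdClusterer:
    """Toy clusterer: label 1 if the row sum exceeds the mean row sum."""

    def fit_predict(self, X):
        sums = [sum(row) for row in X]
        mean = sum(sums) / len(sums)
        return [int(s > mean) for s in sums]


data = [[1.0, 2.0], [3.0, 4.0], [9.0, 10.0]]
scaled = fit_and_apply(ToyScaler(), data)                # transformer path
labels = fit_and_apply(ToyThresholdClusterer(), scaled)  # clusterer path
```

Checking fit_predict before fit_transform matters for models that offer both: sklearn's KMeans, for instance, has a fit_transform that returns distances to the cluster centres rather than labels.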

Supervised tasks such as classification and regression work too; pass the name of the target column as target:

(iris
 .assign(
     species=lambda df: df['species'].learn.LabelEncoder()
 )
 .learn.RandomForestClassifier(
     n_estimators=100,
     target='species'
 )
 .rename(columns={0: 'label'})
 .plot
 .hist()
)

Check out the examples notebook...

Other stuff you can do

Additionally, there are a couple of methods you can call to shorten some tasks.

Explain

Calling explain at the end of a chainlearn pipeline returns whatever the fitted model offers to explain itself. For linear models this is the coefficients; for ensemble models it is the feature importances (which sklearn computes as mean decrease in impurity for most models).

(iris
 .assign(
     species=lambda df: df['species'].learn.LabelEncoder()
 )
 .learn.Lasso(alpha=0.01, target='species')
 .learn.explain()
 .plot
 .bar()
);

I may add some SHAP value calculations in the near future.
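As a rough picture of what such a lookup involves: fitted sklearn linear models expose a `coef_` attribute and tree ensembles expose `feature_importances_`. The sketch below is a minimal illustration under that assumption, not chainlearn's implementation. `explain_model` is a hypothetical helper, and the "models" are stand-in objects with made-up numbers so the example runs without training anything.

```python
from types import SimpleNamespace


def explain_model(model, feature_names):
    """Map each feature name to the model's own importance measure.

    Looks for sklearn's `coef_` first, then `feature_importances_`.
    """
    for attribute in ("coef_", "feature_importances_"):
        if hasattr(model, attribute):
            values = getattr(model, attribute)
            return dict(zip(feature_names, values))
    raise AttributeError(f"{type(model).__name__} exposes no known explanation")


# Stand-ins for fitted models; the numbers are made up for the example.
linear_like = SimpleNamespace(coef_=[0.5, -0.25])
forest_like = SimpleNamespace(feature_importances_=[0.8, 0.2])

features = ["petal_length", "petal_width"]
linear_explanation = explain_model(linear_like, features)
forest_explanation = explain_model(forest_like, features)
```

This sketch only handles one-dimensional attributes. sklearn classifiers such as LogisticRegression store a two-dimensional `coef_` (one row per class), which would need extra handling.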

Cross-validate

There is also a cross_validate method that runs cross-validation on the model and returns the scores.

(iris
 .assign(
     species=lambda df: df['species'].learn.LabelEncoder()
 )
 .learn.RandomForestClassifier(
     n_estimators=100,
     target='species'
 )
 .learn.cross_validate(folds=5, scoring='f1_macro')
 .plot
 .hist()
);
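For comparison, a similar evaluation written against plain sklearn might look like the sketch below. It assumes that chainlearn's folds argument corresponds to sklearn's cv, which this README does not state explicitly. It also uses sklearn's bundled load_iris instead of the seaborn dataset (so nothing is downloaded) and fixes random_state only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# load_iris ships with sklearn; y already holds integer-encoded species.
X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# cv=5 is assumed to play the role of chainlearn's folds=5.
results = cross_validate(model, X, y, cv=5, scoring="f1_macro")

# One macro-averaged F1 score per fold.
fold_scores = results["test_score"]
```

sklearn's cross_validate returns a dictionary that also includes fit and score timings; only the per-fold test scores are kept here.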

Attaching your own models

If you have your own module with models that follow the sklearn API (i.e. they have fit and/or fit_predict, fit_transform, transform, or predict methods), you can attach them to DataFrames and Series:

import mymodels # Contains a MyModel class with a fit_transform method
from chainlearn import attach
attach(mymodels)

(iris
 .learn.MyModel(params=params)
 .plot
 .scatter(x=0, y=1)
);
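One way to picture the discovery step behind attach is a scan of the module for classes that define at least one sklearn-style method. The sketch below is an assumption about how such a scan could be written, not chainlearn's code. `find_model_classes` is a hypothetical helper, and the `mymodels` module is built in memory so the example is self-contained.

```python
import inspect
import types

SKLEARN_STYLE_METHODS = ("fit", "fit_predict", "fit_transform", "transform", "predict")


def find_model_classes(module):
    """Return {name: class} for classes in `module` with an sklearn-style method."""
    return {
        name: cls
        for name, cls in inspect.getmembers(module, inspect.isclass)
        if any(callable(getattr(cls, method, None)) for method in SKLEARN_STYLE_METHODS)
    }


# Build a throwaway module in memory instead of importing a real file.
mymodels = types.ModuleType("mymodels")


class MyModel:
    def fit_transform(self, X):
        return X


class NotAModel:
    def describe(self):
        return "no sklearn-style methods here"


mymodels.MyModel = MyModel
mymodels.NotAModel = NotAModel

# Only MyModel qualifies; NotAModel is skipped.
registry = find_model_classes(mymodels)
```

A registry like this could then be consulted by name when an attribute such as MyModel is requested on a DataFrame.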

Install

Run pip install chainlearn, or install locally by cloning the repository, changing into its directory, and running pip install -e .
