scikit-learn-ts
Intro
This project enables Node.js devs to use Python's powerful scikit-learn machine learning library – without having to know any Python.
See the full docs for more info.
Note: This project is new and experimental. It works great for local development, but I wouldn't recommend using it in production just yet. You can follow the progress on Twitter @transitive_bs.
Features
- All TS classes are auto-generated from the official Python scikit-learn docs!
- All 257 classes are supported along with proper TS types and docs
  - KMeans
  - TSNE
  - PCA
  - LinearRegression
  - LogisticRegression
  - DecisionTreeClassifier
  - RandomForestClassifier
  - XGBClassifier
  - DBSCAN
  - StandardScaler
  - MinMaxScaler
  - ... all of them 💯
- Generally much faster and more robust than JS-based alternatives
  - (benchmarks & comparisons coming soon)
Prerequisites
This project is meant for Node.js users, so don't worry if you're not familiar with Python. This is the only step where you'll need to touch Python, and it should be pretty straightforward.
Make sure you have Node.js and Python 3 installed and in your `PATH`.
node >= 14
python >= 3.7
In Python land, install `numpy` and `scikit-learn` either globally via `pip` or via your favorite virtualenv manager. The shell running your Node.js program will need access to these Python modules, so if you're using a virtualenv, make sure it's activated.
If you're not sure what this means, it's okay. First install Python, which will also install `pip`, Python's package manager. Then run:
pip install numpy scikit-learn
Congratulations! You've safely navigated Python land, and from here on out, we'll be using Node.js / JS / TS. The `sklearn` NPM package will use your Python installation under the hood.
Install
npm install sklearn
Usage
See the full docs for more info.
import * as sklearn from 'sklearn'
const data = [
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
]
const py = await sklearn.createPythonBridge()
const model = new sklearn.TSNE({
n_components: 2,
perplexity: 2
})
await model.init(py)
const x = await model.fit_transform({ X: data })
console.log(x)
await model.dispose()
await py.disconnect()
Since the TS classes are auto-generated from the Python docs, the code will look almost identical to the Python version, so use their excellent API docs as a reference.
All class names, method names, attribute (accessor) names and types are the same as the official Python version.
The main differences are:
- You need to call `createPythonBridge()` before using any `sklearn` classes
  - This spawns a Python child process and validates all of the Python dependencies
  - You can pass a custom `python` path via `createPythonBridge({ python: '/path/to/your/python3' })`
- You need to pass this bridge to a class's async `init` method before using it
  - This creates an underlying Python variable representing your class instance
- Instead of using `numpy` or `pandas`, we're just using plain JavaScript arrays
  - Anywhere the Python version would input or output a `numpy` array, we instead just use `number[]`, `number[][]`, etc.
  - We take care of converting to and from `numpy` arrays automatically where necessary
- Whenever you're done using an instance, call `dispose()` to free the underlying Python resources
- Whenever you're done using your Python bridge, call `disconnect()` on the bridge to cleanly exit the Python child process
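Since `dispose()` and `disconnect()` should run even when a call throws, it can help to wrap the lifecycle above in `try`/`finally`. Here's a minimal sketch of that pattern. `withModel`, `BridgeLike`, and `ModelLike` are hypothetical names that aren't part of the `sklearn` package, and the interfaces only assume the `init`, `dispose`, and `disconnect` methods described above.

```typescript
// Structural types covering only the methods this README describes.
interface BridgeLike {
  disconnect(): Promise<void>
}

interface ModelLike<B> {
  init(py: B): Promise<void>
  dispose(): Promise<void>
}

// Hypothetical helper (not provided by the sklearn package):
// initializes a model, runs a callback, and always cleans up afterwards.
async function withModel<B extends BridgeLike, M extends ModelLike<B>, T>(
  py: B,
  model: M,
  run: (model: M) => Promise<T>
): Promise<T> {
  try {
    await model.init(py)
    try {
      return await run(model)
    } finally {
      // Free the underlying Python resources, even if `run` threw
      await model.dispose()
    }
  } finally {
    // Exit the Python child process, even if `init` or `dispose` threw
    await py.disconnect()
  }
}
```

With the real package, a call would look something like `withModel(await sklearn.createPythonBridge(), new sklearn.KMeans({ n_clusters: 2 }), (m) => m.fit_predict({ X: data }))`.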
Restrictions
- We don't currently support positional arguments; only keyword-based arguments:
// this works (keyword args)
const x = await model.fit_transform({ X: data })
// this doesn't work yet (positional args)
const y = await model.fit_transform(data)
- We don't currently generate TS code for `scikit-learn`'s built-in datasets
- We don't currently generate TS code for `scikit-learn`'s top-level function exports (only classes right now)
- There are basic unit tests for a handful of the auto-generated TS classes, and they work well, but there are probably edge cases and bugs in other auto-generated classes
- Please create an issue on GitHub if you run into any weird behavior and include as much detail as possible, including code snippets
Examples
Here are some side-by-side examples using the official Python `scikit-learn` package on the left and the TS `sklearn` package on the right.
StandardScaler
Python | TypeScript |
---|---|
import numpy as np
from sklearn.preprocessing import StandardScaler
data = np.array([
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
])
s = StandardScaler()
x = s.fit_transform(data) |
import * as sklearn from 'sklearn'
const data = [
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
]
const py = await sklearn.createPythonBridge()
const s = new sklearn.StandardScaler()
await s.init(py)
const x = await s.fit_transform({ X: data }) |
KMeans
Python | TypeScript |
---|---|
import numpy as np
from sklearn.cluster import KMeans
data = np.array([
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
])
model = KMeans(
n_clusters=2,
random_state=42,
n_init='auto'
)
x = model.fit_predict(data) |
import * as sklearn from 'sklearn'
const data = [
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
]
const py = await sklearn.createPythonBridge()
const model = new sklearn.KMeans({
n_clusters: 2,
random_state: 42,
n_init: 'auto'
})
await model.init(py)
const x = await model.fit_predict({ X: data }) |
TSNE
Python | TypeScript |
---|---|
import numpy as np
from sklearn.manifold import TSNE
data = np.array([
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
])
model = TSNE(
n_components=2,
perplexity=2,
learning_rate='auto',
init='random'
)
x = model.fit_transform(data) |
import * as sklearn from 'sklearn'
const data = [
[0, 0, 0],
[0, 1, 1],
[1, 0, 1],
[1, 1, 1]
]
const py = await sklearn.createPythonBridge()
const model = new sklearn.TSNE({
n_components: 2,
perplexity: 2,
learning_rate: 'auto',
init: 'random'
})
await model.init(py)
const x = await model.fit_transform({ X: data }) |
See the full docs for more examples.
Why?
The Python ML ecosystem is generally a lot more mature than the Node.js ML ecosystem. Most ML research happens in Python, and many common ML tasks that Python devs take for granted are much more difficult to accomplish in Node.js.
For example, I was recently working on a data viz project using full-stack TypeScript, and I needed to use k-means and t-SNE on some text embeddings. I tested 6 different t-SNE JS packages and several k-means packages. None of the t-SNE packages worked for medium-sized inputs, they were 1000x slower in many cases, and I kept running into `NaN` city with the JS-based versions.
Case in point: it's incredibly difficult to compete with the robustness, speed, and maturity of proven Python ML libraries like `scikit-learn` in JS/TS land.
So instead of trying to build a Rust-based version from scratch or using ad hoc NPM packages like above, I decided to create an experiment to see how practical it would be to just use `scikit-learn` from Node.js.
And that's how `scikit-learn-ts` was born.
How it works
This project uses a fork of python-bridge to spawn a Python interpreter as a subprocess and communicates back and forth via standard Unix pipes. The IPC pipes don't interfere with `stdout`/`stderr`/`stdin`, so your Node.js code and the underlying Python code can print things normally.
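The exact wire format is an implementation detail of python-bridge and isn't covered here. Purely as an illustration, one common way to frame messages on a pipe is newline-delimited JSON, where each message is one line. The sketch below assumes that framing; `encodeMessage` and `JsonLineDecoder` are hypothetical names and aren't part of this project or python-bridge.

```typescript
// Assumed framing for illustration: one JSON document per line.
function encodeMessage(message: unknown): string {
  return JSON.stringify(message) + '\n'
}

// Pipes deliver data in arbitrary chunks, so a decoder has to buffer
// partial lines until the terminating newline arrives.
class JsonLineDecoder {
  private buffer = ''

  // Feed one chunk of text; returns every message completed by it.
  push(chunk: string): unknown[] {
    this.buffer += chunk
    const lines = this.buffer.split('\n')
    // The last piece is an unfinished line ('' if the chunk ended on '\n')
    this.buffer = lines.pop() ?? ''
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as unknown)
  }
}
```

Keeping a dedicated channel for these messages is what leaves `stdout` and `stderr` free for regular printing on both sides.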
The TS library is auto-generated from the Python `scikit-learn` API docs. By using the official Python docs as a source of truth, we can guarantee a certain level of compatibility and upgradeability.
For each `scikit-learn` HTML page that belongs to an exported Python `class` or `function`, we first parse its metadata, params, methods, attributes, etc. using `cheerio`, then we convert the Python types into equivalent TypeScript types. We then generate a corresponding `TypeScript` file which wraps an instance of that Python declaration via a `PythonBridge`.
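To give a rough idea of what the type conversion step involves, here's a small sketch that maps a few type strings in the style of the `scikit-learn` docs to TypeScript types. `pyTypeToTs` is a hypothetical function written for this example, not the project's actual generator, and it only covers a handful of cases.

```typescript
// Hypothetical helper: maps a docs-style Python type string to a TS type.
// A real generator would need to handle many more cases (unions, dicts, etc).
function pyTypeToTs(pyType: string): string {
  const t = pyType.trim().toLowerCase()

  if (t === 'int' || t === 'float') return 'number'
  if (t === 'bool') return 'boolean'
  if (t === 'str') return 'string'

  // e.g. "ndarray of shape (n_samples, n_features)" has two dimensions
  const match = /^(?:ndarray|array-like) of shape \(([^)]*)\)/.exec(t)
  if (match) {
    const dims = (match[1] ?? '')
      .split(',')
      .map((d) => d.trim())
      .filter((d) => d.length > 0).length
    return 'number' + '[]'.repeat(Math.max(dims, 1))
  }

  // Fall back to a permissive type for anything unrecognized
  return 'any'
}
```

Counting the entries in the documented shape is what lets a 2D input end up as `number[][]` and a 1D output as `number[]`.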
For each `TypeScript` wrapper `class` or `function`, we take special care to handle serializing values back and forth between Node.js and Python as JSON, including converting between primitive arrays and `numpy` arrays where necessary. All `numpy` array conversions should be handled automatically for you since we only support serializing primitive JSON types over the `PythonBridge`. There may be some edge cases where the automatic `numpy` inference fails, but we have a regression test suite for parsing these cases, so as long as the official Python docs are correct for a given type, then our implicit `numpy` conversion logic should "just work".
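One thing to keep in mind when passing plain arrays: a regular `numpy` array is rectangular, so every row of a `number[][]` needs the same length. Here's a small sketch for checking that on the Node.js side before sending data over. `shapeOf` and `NestedNumbers` are hypothetical names for this example and aren't part of the `sklearn` package or its conversion logic.

```typescript
// A number, or arbitrarily nested arrays of numbers.
type NestedNumbers = number | NestedNumbers[]

// Hypothetical helper: returns the shape a rectangular nested array
// would have as a numpy array, e.g. [4, 3] for 4 rows of 3 values.
// Throws if the nesting is ragged.
function shapeOf(value: NestedNumbers): number[] {
  // A scalar has an empty shape
  if (!Array.isArray(value)) return []

  const shapes = value.map((item) => shapeOf(item))
  const first: number[] = shapes[0] ?? []

  // Every element must have exactly the same shape as the first one
  for (const s of shapes) {
    if (s.length !== first.length || s.some((n, i) => n !== first[i])) {
      throw new Error('Ragged array: all rows must have the same shape')
    }
  }

  return [value.length, ...first]
}
```

For the `data` used throughout this README, `shapeOf(data)` gives `[4, 3]`, which corresponds to 4 samples with 3 features each.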
Credit
This project is not affiliated with the official Python scikit-learn project. Hopefully it will be one day.
All of the difficult machine learning work happens under the hood via the official Python scikit-learn project, with full credit given to their absolutely amazing team. This project is just a small open source experiment to try and leverage the existing `scikit-learn` ecosystem for the Node.js community.
See the full docs for more info.
License
The official Python `scikit-learn` project is licensed under the BSD 3-Clause.
This project is licensed under MIT © Travis Fischer.
If you found this project helpful, please consider following me on Twitter.