The open-source Continuous Machine Learning Platform
Build ML pipelines with only Python; run them on your laptop or in the cloud.
Sematic is an open-source ML development platform. It lets ML Engineers and Data Scientists write arbitrarily complex end-to-end pipelines with simple Python and execute them on their local machine, in a cloud VM, or on a Kubernetes cluster to leverage cloud resources.
Sematic is based on learnings gathered at top self-driving car companies. It lets you chain data processing jobs (e.g. Apache Spark) with model training (e.g. PyTorch, TensorFlow) or any other Python business logic into type-safe, traceable, reproducible end-to-end pipelines that can be monitored and visualized in a modern web dashboard.
Read our documentation and join our Discord channel.
Why Sematic
- Easy onboarding: no deployment or infrastructure needed to get started; simply install Sematic locally and start exploring
- Local-to-cloud parity: run the same code on your laptop and on your Kubernetes cluster
- End-to-end traceability: all pipeline artifacts are persisted, tracked, and visualizable in a web dashboard
- Access heterogeneous compute: customize the resources required by each pipeline step (CPUs, memory, GPUs, Spark clusters, etc.) to optimize performance and cloud footprint
- Reproducibility: rerun your pipelines from the UI with guaranteed reproducibility of results
Getting Started
To get started locally, simply install Sematic in your Python environment:
$ pip install sematic
Start the local web dashboard:
$ sematic start
Run an example pipeline:
$ sematic run examples/mnist/pytorch
Create a new boilerplate project:
$ sematic new my_new_project
Or from an existing example:
$ sematic new my_new_project --from examples/mnist/pytorch
Then run it with:
$ python3 -m my_new_project
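A project built on Sematic is a set of ordinary Python functions composed into a pipeline. The sketch below is illustrative only: the step names and logic are invented, and Sematic's real `@sematic.func` decorator (which also tracks inputs, outputs, and types for the dashboard) is stubbed out so the snippet runs standalone.

```python
def func(f):
    # Stand-in for sematic.func; the real decorator adds tracking and
    # type-checking around the wrapped function.
    return f

@func
def train(learning_rate: float) -> float:
    # Toy "training" step: pretend the final loss scales with the LR.
    return learning_rate * 0.5

@func
def evaluate(loss: float) -> str:
    return f"final loss: {loss:.2f}"

@func
def pipeline(learning_rate: float) -> str:
    # Steps compose like ordinary function calls; Sematic turns this
    # call graph into a tracked, visualizable pipeline.
    return evaluate(train(learning_rate))

print(pipeline(0.1))  # prints "final loss: 0.05"
```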
To deploy Sematic to Kubernetes and leverage cloud resources, see our documentation.
Features
- Lightweight Python SDK: define arbitrarily complex end-to-end pipelines
- Pipeline nesting: arbitrarily nest pipelines into larger pipelines
- Dynamic graphs: Python-defined graphs allow iteration, conditional branching, etc.
- Lineage tracking: all inputs and outputs of all steps are persisted and tracked
- Runtime type-checking: fail early with run-time type checks
- Web dashboard: monitor, track, and visualize pipelines in a modern web UI
- Artifact visualization: visualize all inputs and outputs of all steps in the web dashboard
- Local execution: run pipelines on your local machine without any deployment
- Cloud orchestration: run pipelines on Kubernetes to access GPUs and other cloud resources
- Heterogeneous compute resources: run different steps on different machines (CPUs, memory, GPUs, Spark, etc.)
- Helm chart deployment: install Sematic on your Kubernetes cluster
- Pipeline reruns: rerun pipelines from the UI from an arbitrary point in the graph
- Step caching: cache expensive pipeline steps for faster iteration
- Step retries: recover from transient failures with automatic retries
- Metadata and collaboration: tags, source code visualization, docstrings, notes, etc.
- Numerous integrations: see below
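Runtime type-checking means a step's inputs are validated against its Python type hints before the step executes, so a mismatched argument fails the pipeline early rather than deep into a long run. The stdlib-only sketch below illustrates the general idea; it is not Sematic's actual implementation.

```python
import inspect
import typing

def type_checked(f):
    """Check a function's arguments against its type hints at call time."""
    sig = inspect.signature(f)
    hints = typing.get_type_hints(f)

    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError(
                    f"{name}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return f(*args, **kwargs)

    return wrapper

@type_checked
def train(epochs: int) -> float:
    return epochs * 0.5

print(train(4))    # runs normally
# train("4")       # raises TypeError before any work is done
```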
Integrations
- Apache Spark: on-demand in-cluster Spark clusters
- Ray: on-demand in-cluster Ray resources
- Snowflake: easily query your data warehouse (other warehouses supported too)
- Plotly, Matplotlib: visualize plot artifacts in the web dashboard
- Pandas: visualize dataframe artifacts in the dashboard
- Grafana: embed Grafana panels in the web dashboard
- Bazel: integrate with your Bazel build system
- Helm chart: deploy to Kubernetes with our Helm chart
- Git: track Git information in the web dashboard
Community and resources
Learn more about Sematic and get in touch through our documentation and Discord channel.
Contribute!
To contribute to Sematic, check out open issues tagged "good first issue", and get in touch with us on Discord.