Sparsify [Alpha]
ML model optimization product to accelerate inference
Sparsify enables you to accelerate inference without sacrificing accuracy by applying state-of-the-art pruning, quantization, and distillation algorithms to neural networks with a simple web application and one-command API calls.
Sparsify empowers you to compress models through two components:
- Sparsify Cloud - a web application that allows you to create and manage Sparsify Experiments, explore hyperparameters, predict performance, and compare results across both Experiments and deployment scenarios.
- Sparsify CLI/API - a Python package and GitHub repository that allows you to run Sparsify Experiments locally, sync with the Sparsify Cloud, and integrate them into your workflows.
Interested in test-driving our alpha? Get a sneak peek and influence the product's development process. Thank you in advance for your feedback and interest!
Quickstart Guide
This quickstart details several pathways you can work through. We encourage you to explore one for Sparsify's full benefits. When you finish the quickstart, sparsifying your models is as easy as:
sparsify.run sparse-transfer --use-case image-classification --data imagenette --optim-level 0.5
1. Install and Setup
1.1 Verify Prerequisites
First, verify that you have the correct software and hardware to run the Sparsify Alpha.
Software
Sparsify is tested on Python 3.8 and 3.10, ONNX 1.5.0-1.12.0, ONNX opset version 11+, and manylinux-compliant systems. Sparsify is not natively supported on Windows or macOS.
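The software requirements above can be checked before installing. The following is a minimal sketch, not part of Sparsify: `check_software` is a hypothetical helper, and treating only Python 3.8 and 3.10 as tested and only Linux as natively supported is our reading of the requirements listed here.

```python
import platform
import sys

# Python versions the quickstart lists as tested, as (major, minor) pairs.
TESTED_PYTHON = {(3, 8), (3, 10)}


def check_software(python_version=None, system=None):
    """Return a list of warnings for the current (or given) environment.

    check_software is a hypothetical helper, not a Sparsify API.
    """
    if python_version is None:
        python_version = sys.version_info[:2]
    if system is None:
        system = platform.system()  # "Linux", "Darwin", or "Windows"

    warnings = []
    if tuple(python_version) not in TESTED_PYTHON:
        warnings.append(
            f"Python {python_version[0]}.{python_version[1]} is not a tested version"
        )
    if system != "Linux":
        warnings.append(f"{system} is not natively supported")
    return warnings


if __name__ == "__main__":
    for warning in check_software():
        print("WARNING:", warning)
```

An empty list means the environment matches the tested configuration; a warning does not necessarily mean Sparsify will fail to run.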
Hardware
Sparsify requires a GPU with CUDA and cuDNN in order to sparsify neural networks. We recommend a Linux system with a CUDA-enabled GPU that has at least 16GB of GPU memory, along with 128GB of RAM and 4 CPU cores. If you are sparsifying a very large model, you may need more RAM than the recommended 128GB. If you encounter issues setting up your training environment, file a GitHub issue.
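As a rough illustration of the recommended minimums above, the sketch below compares measured resources against them. `shortfalls` and `nvidia_driver_visible` are hypothetical helpers, not Sparsify APIs, and treating the presence of the `nvidia-smi` tool on the PATH as a hint that an NVIDIA driver is installed is an assumption, not a full CUDA and cuDNN check.

```python
import os
import shutil

# Recommended minimums quoted in the hardware requirements.
RECOMMENDED = {"gpu_memory_gb": 16, "ram_gb": 128, "cpu_cores": 4}


def shortfalls(gpu_memory_gb, ram_gb, cpu_cores):
    """Map each resource below its recommended minimum to (measured, minimum)."""
    measured = {
        "gpu_memory_gb": gpu_memory_gb,
        "ram_gb": ram_gb,
        "cpu_cores": cpu_cores,
    }
    return {
        name: (measured[name], minimum)
        for name, minimum in RECOMMENDED.items()
        if measured[name] < minimum
    }


def nvidia_driver_visible():
    """Rough hint only: True if the nvidia-smi tool is on the PATH."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    # GPU memory and RAM are entered by hand here; only the core count is detected.
    print(shortfalls(gpu_memory_gb=16, ram_gb=64, cpu_cores=os.cpu_count() or 1))
    print("nvidia-smi found:", nvidia_driver_visible())
```

The values passed in the example call are placeholders; substitute the figures for your own machine.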
1.2 Create an Account
Creating an account is a simple, free, one-time step. An account is required to manage your Experiments and API keys. Visit Neural Magic's Web App Platform and create an account by entering your email, name, and a unique password. If you already have a Neural Magic Account, sign in with your email.
1.3 Install Sparsify
pip is the preferred method for installing Sparsify.
It is advised to create a fresh virtual environment to avoid dependency issues.
Install with pip using:
pip install sparsify-nightly
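To confirm that the install landed in the active environment, you can query the installed distribution metadata. This is a minimal sketch using only the Python standard library; `installed_version` and `find_sparsify` are hypothetical helper names, and the two distribution names are the stable and nightly builds listed under Release History.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def find_sparsify():
    """Return (distribution name, version) for the first Sparsify build found."""
    for name in ("sparsify-nightly", "sparsify"):
        version = installed_version(name)
        if version is not None:
            return name, version
    return None


if __name__ == "__main__":
    found = find_sparsify()
    print("Sparsify not installed" if found is None else f"{found[0]} {found[1]}")
```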
1.4 Login via CLI
Next, with Sparsify installed on your training hardware:
- Authorize the local CLI to access your account by running the sparsify.login command and providing your API key.
- Locate your API key on the homepage of the Sparsify Cloud under the 'Get set up' modal, and copy the command or the API key itself.
- Run the following command:
sparsify.login API_KEY
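If you prefer not to paste the key into your shell history, the login command can be driven from a script. The sketch below is an illustration only: the `SPARSIFY_API_KEY` environment variable name and both helper functions are hypothetical, and the only Sparsify behavior it relies on is the `sparsify.login API_KEY` command shown above.

```python
import os
import subprocess


def login_command(api_key):
    """Build the argument list for `sparsify.login API_KEY`."""
    if not api_key or not api_key.strip():
        raise ValueError("API key is empty")
    return ["sparsify.login", api_key.strip()]


def login_from_env(env_var="SPARSIFY_API_KEY", environ=None):
    """Run sparsify.login with a key read from an environment variable.

    The variable name is an assumption made for this example.
    """
    environ = os.environ if environ is None else environ
    command = login_command(environ.get(env_var, ""))
    # check=True raises CalledProcessError if the CLI exits with an error.
    return subprocess.run(command, check=True)
```

Calling `login_from_env()` requires Sparsify to be installed so that `sparsify.login` is on the PATH.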
2. Run an Experiment
Experiments are the core of sparsifying a model. They allow you to apply sparsification algorithms to a dataset and model through the three Experiment types detailed below: One-Shot, Sparse-Transfer, and Training-Aware.
All Experiments are run locally on your training hardware and can be synced with the cloud for further analysis and comparison, using Sparsify's two components:
- Sparsify Cloud - explore hyperparameters, predict performance, and generate the desired CLI/API command.
- Sparsify CLI/API - run an experiment.
2.1 One-Shot
Sparsity | Sparsification Speed | Accuracy |
---|---|---|
++ | +++++ | +++ |
One-Shot Experiments quickly sparsify your model post-training, providing a 3-5x speedup with minimal accuracy loss, ideal for quick model optimization without retraining your model.
To run a One-Shot Experiment for your model, dataset, and use case, use the following command:
sparsify.run one-shot --use-case USE_CASE --model MODEL --data DATASET --optim-level OPTIM_LEVEL
For example, to sparsify a ResNet50 model on the ImageNet dataset for image classification, run the following commands:
wget https://public.neuralmagic.com/datasets/cv/classification/imagenet_calibration.tar.gz
tar -xzf imagenet_calibration.tar.gz
sparsify.run one-shot --use-case image_classification --model "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none" --data ./imagenet_calibration --optim-level 0.5
Or, to sparsify a BERT model on the SST-2 dataset for sentiment analysis, run the following commands:
wget https://public.neuralmagic.com/datasets/nlp/text_classification/sst2_calibration.tar.gz
tar -xzf sst2_calibration.tar.gz
sparsify.run one-shot --use-case text_classification --model "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/base-none" --data ./sst2_calibration --optim-level 0.5
To dive deeper into One-Shot Experiments, read through the One-Shot Experiment Guide.
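The One-Shot command template above can also be assembled programmatically, which helps when integrating Experiments into a larger workflow. This is a minimal sketch: `one_shot_command` is a hypothetical helper, it only builds the argument list without running anything, and the check that the optimization level lies between 0 and 1 is an assumption rather than documented behavior.

```python
import shlex


def one_shot_command(use_case, model, data, optim_level):
    """Build the argument list for a One-Shot Experiment."""
    # Assumption: optimization levels are fractions between 0 and 1.
    if not 0.0 <= optim_level <= 1.0:
        raise ValueError(f"optim_level out of range: {optim_level}")
    return [
        "sparsify.run", "one-shot",
        "--use-case", use_case,
        "--model", model,
        "--data", data,
        "--optim-level", str(optim_level),
    ]


if __name__ == "__main__":
    command = one_shot_command(
        use_case="image_classification",
        model="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none",
        data="./imagenet_calibration",
        optim_level=0.5,
    )
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the Experiment on a machine where Sparsify is installed.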
Note, One-Shot Experiments currently require the model to be in an ONNX format and the dataset to be in a NumPy format. More details are provided in the One-Shot Experiment Guide.
2.2 Sparse-Transfer
Sparsity | Sparsification Speed | Accuracy |
---|---|---|
++++ | ++++ | +++++ |
Sparse-Transfer Experiments quickly create a smaller and faster model for your dataset by transferring from a SparseZoo pre-sparsified foundational model, providing a 5-10x speedup with minimal accuracy loss, ideal for quick model optimization without retraining your model.
To run a Sparse-Transfer Experiment for your model (optional), dataset, and use case, run the following command:
sparsify.run sparse-transfer --use-case USE_CASE --model OPTIONAL_MODEL --data DATASET --optim-level OPTIM_LEVEL
For example, to sparse transfer a SparseZoo model to the ImageNette dataset for image classification, run the following command:
sparsify.run sparse-transfer --use-case image_classification --data imagenette --optim-level 0.5
Or, to sparse transfer a SparseZoo model to the SST-2 dataset for sentiment analysis, run the following command:
sparsify.run sparse-transfer --use-case text_classification --data sst2 --optim-level 0.5
To dive deeper into Sparse-Transfer Experiments, read through the Sparse-Transfer Experiment Guide.
Note, Sparse-Transfer Experiments require the model to be saved in a PyTorch format corresponding to the underlying integration such as Ultralytics YOLOv5 or HuggingFace Transformers. Datasets must additionally match the expected format of the underlying integration. More details and exact formats are provided in the Sparse-Transfer Experiment Guide.
2.3 Training-Aware
Sparsity | Sparsification Speed | Accuracy |
---|---|---|
+++++ | ++ | +++++ |
Training-Aware Experiments sparsify your model during training, providing a 6-12x speedup with minimal accuracy loss, ideal for thorough model optimization when the best performance and accuracy are required.
To run a Training-Aware Experiment for your model, dataset, and use case, run the following command:
sparsify.run training-aware --use-case USE_CASE --model OPTIONAL_MODEL --data DATASET --optim-level OPTIM_LEVEL
For example, to sparsify a ResNet50 model on the ImageNette dataset for image classification, run the following command:
sparsify.run training-aware --use-case image_classification --model "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenette/base-none" --data imagenette --optim-level 0.5
Or, to sparsify a BERT model on the SST-2 dataset for sentiment analysis, run the following command:
sparsify.run training-aware --use-case text_classification --model "zoo:nlp/sentiment_analysis/bert-base/pytorch/huggingface/sst2/base-none" --data sst2 --optim-level 0.5
To dive deeper into Training-Aware Experiments, read through the Training-Aware Experiment Guide.
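The ratings tables and speedup ranges for the three Experiment types can be collected into one structure to help choose between them. The sketch below is illustrative only: the names `ExperimentProfile`, `PROFILES`, and `rank_by` are hypothetical, each rating is simply the count of `+` marks in the corresponding table, and the `name` values are the `sparsify.run` sub-commands used in this quickstart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentProfile:
    name: str        # sparsify.run sub-command
    sparsity: int    # number of "+" marks in the Sparsity column
    speed: int       # number of "+" marks in the Sparsification Speed column
    accuracy: int    # number of "+" marks in the Accuracy column
    speedup: tuple   # (low, high) inference speedup range quoted in the text


PROFILES = (
    ExperimentProfile("one-shot", 2, 5, 3, (3, 5)),
    ExperimentProfile("sparse-transfer", 4, 4, 5, (5, 10)),
    ExperimentProfile("training-aware", 5, 2, 5, (6, 12)),
)


def rank_by(*criteria):
    """Sort profiles best-first by the given rating names, in priority order."""
    return sorted(
        PROFILES,
        key=lambda profile: tuple(getattr(profile, c) for c in criteria),
        reverse=True,
    )


if __name__ == "__main__":
    for profile in rank_by("accuracy", "speed"):
        print(profile.name, profile.speedup)
```

Ranking by `"accuracy"` and then `"speed"` favors Sparse-Transfer, while ranking by `"sparsity"` alone favors Training-Aware, which matches the trade-offs described in each section.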
Note, Training-Aware Experiments require the model to be saved in a PyTorch format corresponding to the underlying integration such as Ultralytics YOLOv5 or HuggingFace Transformers. Datasets must additionally match the expected format of the underlying integration. More details and exact formats are provided in the Training-Aware Experiment Guide.
3. Compare Results
Once you have run your Experiment, the results, logs, and deployment files will be saved under the current working directory in the following format:
[EXPERIMENT_TYPE]_[USE_CASE]_{DATE_TIME}
├── deployment
│   ├── model.onnx
│   └── *supporting files*
├── logs
│   └── *logs*
├── training_artifacts
│   └── *training artifacts*
└── *metrics and results*
You can compare the accuracy by looking through the metrics printed out to the console and the metrics saved in the experiment directory. Additionally, you can use DeepSparse to compare the inference performance on your CPU deployment hardware.
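To locate the outputs of your most recent run in a script, you can walk the directory layout shown above. This is a minimal sketch: `latest_experiment_dir` and `deployment_model` are hypothetical helpers, and because the exact date-time format in the directory name is not specified here, the sketch orders directories by modification time instead of parsing the name.

```python
from pathlib import Path


def latest_experiment_dir(root=".", prefix=""):
    """Return the most recently modified experiment directory under root, or None.

    A directory counts as an experiment if it contains a deployment folder.
    """
    candidates = [
        path
        for path in Path(root).iterdir()
        if path.is_dir()
        and path.name.startswith(prefix)
        and (path / "deployment").is_dir()
    ]
    return max(candidates, key=lambda path: path.stat().st_mtime, default=None)


def deployment_model(experiment_dir):
    """Return the path to deployment/model.onnx inside an experiment directory."""
    model = Path(experiment_dir) / "deployment" / "model.onnx"
    if not model.is_file():
        raise FileNotFoundError(model)
    return model
```

For example, `latest_experiment_dir(".", prefix="one-shot")` restricts the search to One-Shot runs, assuming the directory name begins with the Experiment type as in the layout above.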
Note: In the near future, you will be able to visualize the results in the Cloud, simulate other scenarios and hyperparameters, compare the results to other Experiments, and package for your deployment scenario.
To run a benchmark on your deployment hardware, use the deepsparse.benchmark command with your original model and the new optimized model.
This will run a number of inferences to simulate a real-world scenario and print out the results.
It's as simple as running the following command:
deepsparse.benchmark --model MODEL --scenario SCENARIO
For example, to benchmark a dense ResNet-50 model, run the following command:
deepsparse.benchmark --model "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenette/base-none" --scenario sync
This can then be compared to the sparsified ResNet-50 model with the following command:
deepsparse.benchmark --model "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" --scenario sync
The output will look similar to the following:
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230629 COMMUNITY | (fc8b788a) (release) (optimized) (system=avx512, binary=avx512)
deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
onnx_file_path: ./model.onnx
batch_size: 1
num_cores: 1
num_streams: 1
scheduler: Scheduler.default
fraction_of_supported_ops: 0.9981
cpu_avx_type: avx512
cpu_vnni: False
Original Model Path: ./model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 134.5611
Latency Mean (ms/batch): 7.4217
Latency Median (ms/batch): 7.4245
Latency Std (ms/batch): 0.0264
Iterations: 1346
See the DeepSparse Benchmarking User Guide for more information on benchmarking.
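Comparing the dense and sparsified runs by hand means reading two blocks of output like the one above. The sketch below pulls the numeric fields out of captured output and computes a throughput ratio. It assumes the `Name: value` line layout shown in the sample output; `parse_benchmark` and `speedup` are hypothetical helpers, not part of DeepSparse.

```python
def parse_benchmark(output):
    """Collect every `Name: number` line of benchmark output into a dict."""
    metrics = {}
    for line in output.splitlines():
        name, separator, value = line.partition(":")
        if not separator:
            continue
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            # Non-numeric fields such as paths or the scenario name are skipped.
            continue
    return metrics


def speedup(dense_output, sparse_output, key="Throughput (items/sec)"):
    """Ratio of the sparsified model's metric to the dense model's metric."""
    return parse_benchmark(sparse_output)[key] / parse_benchmark(dense_output)[key]
```

To use it, capture the text printed by each `deepsparse.benchmark` run (for example by redirecting it to a file) and pass the two strings to `speedup`. For latency metrics, a lower value is better, so invert the ratio when comparing those.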
4. Deploy a Model
As an optional step to this quickstart, now that you have your optimized model, you are ready for inferencing. To get the most inference performance out of your optimized model, we recommend you deploy on Neural Magic's DeepSparse. DeepSparse is built to get the best performance out of optimized models on CPUs.
DeepSparse Server takes in a task and a model path and enables you to serve models and Pipelines over HTTP.
You can deploy any ONNX model using DeepSparse Server with the following command:
deepsparse.server --task USE_CASE --model_path MODEL_PATH
Where USE_CASE is the use case of your Experiment and MODEL_PATH is the path to the deployment folder from the Experiment.
For example, to deploy a sparsified ResNet-50 model, run the following command:
deepsparse.server --task image_classification --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none"
If you're not ready for deploying, congratulations on completing the quickstart!
Companion Guides
Resources
Now that you have explored Sparsify [Alpha], here are other related resources.
Feedback and Support
Report UI issues and CLI errors, submit bug reports, and provide general feedback about the product to the team via the nm-sparsify Slack Channel, or via GitHub Issues. Alpha support is provided through those channels.
Terms and Conditions
Sparsify Alpha is a pre-release version of Sparsify that is still in active development. The product is not yet ready for production use; APIs and UIs are subject to change. There may be bugs in the Alpha version, which we hope to have fixed before Beta and then a general Q3 2023 release. The feedback you provide on quality and usability helps us identify issues, fix them, and make Sparsify even better. This information is used internally by Neural Magic solely for that purpose. It is not shared or used in any other way.
That being said, we are excited to share this release and hear what you think. Thank you in advance for your feedback and interest!
Learning More
- Documentation: SparseML, SparseZoo, Sparsify (1st Generation), DeepSparse
- Neural Magic: Blog, Resources
Release History
Official builds are hosted on PyPI
- stable: sparsify
- nightly (dev): sparsify-nightly
Additionally, more information can be found via GitHub Releases.
License
The project is licensed under the Apache License Version 2.0.
Community
Contribute
We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.
Join
For user help or questions about Sparsify, sign up or log in to our Neural Magic Community Slack. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our GitHub Issue Queue.
You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by subscribing to the Neural Magic community.
For more general questions about Neural Magic, please fill out this form.
Cite
Find this project useful in your research or other communications? Please consider citing:
@InProceedings{
pmlr-v119-kurtz20a,
title = {Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks},
author = {Kurtz, Mark and Kopinsky, Justin and Gelashvili, Rati and Matveev, Alexander and Carr, John and Goin, Michael and Leiserson, William and Moore, Sage and Nell, Bill and Shavit, Nir and Alistarh, Dan},
booktitle = {Proceedings of the 37th International Conference on Machine Learning},
pages = {5533--5543},
year = {2020},
editor = {Hal Daumé III and Aarti Singh},
volume = {119},
series = {Proceedings of Machine Learning Research},
address = {Virtual},
month = {13--18 Jul},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf},
url = {http://proceedings.mlr.press/v119/kurtz20a.html},
abstract = {Optimizing convolutional neural networks for fast inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to exploiting sparsity in the activation maps, which tend to be naturally sparse in many settings thanks to the structure of rectified linear (ReLU) activation functions. In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains. To induce highly sparse activation maps without accuracy loss, we introduce a new regularization technique, coupled with a new threshold-based sparsification method based on a parameterized activation function called Forced-Activation-Threshold Rectified Linear Unit (FATReLU). We examine the impact of our methods on popular image classification models, showing that most architectures can adapt to significantly sparser activation maps without any accuracy loss. Our second contribution is showing that these compression gains can be translated into inference speedups: we provide a new algorithm to enable fast convolution operations over networks with sparse activations, and show that it can enable significant speedups for end-to-end inference on a range of popular models on the large-scale ImageNet image classification task on modern Intel CPUs, with little or no retraining cost.}
}
@misc{
singh2020woodfisher,
title={WoodFisher: Efficient Second-Order Approximation for Neural Network Compression},
author={Sidak Pal Singh and Dan Alistarh},
year={2020},
eprint={2004.14340},
archivePrefix={arXiv},
primaryClass={cs.LG}
}