• Stars
    star
    717
  • Rank 63,167 (Top 2 %)
  • Language
    C
  • License
    MIT License
  • Created over 2 years ago
  • Updated 9 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory

TinyEngine

This is the official implementation of TinyEngine, a memory-efficient and high-performance neural network library for Microcontrollers. TinyEngine is a part of MCUNet, which also consists of TinyNAS. MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. TinyEngine and TinyNAS are co-designed to fit the tight memory budgets.

The MCUNet and TinyNAS repo is here.

MCUNetV1 | MCUNetV2 | MCUNetV3

Demo (Inference)

demo

Demo (Training)

demo_v3

News

If you are interested in getting updates, please sign up here to get notified!

Overview

Microcontrollers are low-cost, low-power hardware. They are widely deployed and have wide applications, but the tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.

MCUNet is a system-algorithm co-design framework for tiny deep learning on microcontrollers. It consists of TinyNAS and TinyEngine. They are co-designed to fit the tight memory budgets. With system-algorithm co-design, we can significantly improve the deep learning performance on the same tiny memory budget.

overview

Specifically, TinyEngine is a memory-efficient inference library. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing memory usage and accelerating the inference. It outperforms existing inference libraries such as TF-Lite Micro from Google, CMSIS-NN from Arm, and X-CUBE-AI from STMicroelectronics.

TinyEngine adopts the following optimization techniques to accelerate inference speed and minimize memory footprint.

  • In-place depth-wise convolution: A unique data placement technique for depth-wise convolution that overwrites input data by intermediate/output data to reduce peak SRAM memory.
  • Patch-based inference: A generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory.
  • Operator fusion: A method that improves performance by merging one operator into a different operator so that they are executed together without requiring a roundtrip to memory.
  • SIMD (Single instruction, multiple data) programming: A computing method that performs the same operation on multiple data points simultaneously.
  • HWC to CHW weight format transformation: A weight format transformation technique that increases cache hit ratio for in-place depth-wise convolution.
  • Image to Column (Im2col) convolution: An implementation technique of computing convolution operation using general matrix multiplication (GEMM) operations.
  • Loop reordering: A loop transformation technique that attempts to optimize a program's execution speed by reordering/interchanging the sequence of loops.
  • Loop unrolling: A loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space-time tradeoff.
  • Loop tiling: A loop transformation technique that attempts to reduce memory access latency by partitioning a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused.

inplace_depthwise

By adopting the abovementioned optimization techniques, TinyEngine can not only enhance inference speed but also reduce peak memory, as shown in the figures below.

MAC/s improvement breakdown: mac_result

Peak memory reduction: peakmem_result

To sum up, our TinyEngine inference engine could be a useful infrastructure for MCU-based AI applications. It significantly improves the inference speed and reduces the memory usage compared to existing libraries like TF-Lite Micro, CMSIS-NN, X-CUBE-AI, etc. It improves the inference speed by 1.1-18.6x, and reduces the peak memory by 1.3-3.6x.

measured_result

Save Memory with Patch-based Inference: We can dramastically reduce the inference peak memory by using patch-based inference for the memory-intensive stage of CNNs. measured_result

For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x. measured_result

With patch-based infernece, tinyengine achieves higher accuracy at the same memory budgets. measured_result

Code Structure

code_generator contains a python library that is used to compile neural networks into low-level source code (C/C++).

TinyEngine contains a C/C++ library that implements operators and performs inference on Microcontrollers.

examples contains the examples of transforming TFLite models into our TinyEngine models.

tutorial contains the demo tutorial (of inference and training) of deploying a visual wake words (VWW) model onto microcontrollers.

assets contains misc assets.

Requirement

  • Python 3.6+
  • STM32CubeIDE 1.5+

Setup for Users

First, clone this repository:

git clone --recursive https://github.com/mit-han-lab/tinyengine.git

(Optional) Using a virtual environment with conda is recommended.

conda create -n tinyengine python=3.6 pip
conda activate tinyengine

Install dependencies:

pip install -r requirements.txt

Setup for Developers

Install pre-commit hooks to automatically format changes in your code.

pre-commit install

Deployment Example

Please see tutorial to learn how to deploy a visual wake words (VWW) model onto microcontrollers by using TinyEngine. We include both the inference demo and the training demo in the tutorial, please take a look!

Measured Results

  • All the tflite models are from Model Zoo in MCUNet repo. Please see MCUNet repo to know how to build the pre-trained int8 quantized models in TF-Lite format.
  • All the latency, peak memory (SRAM) and Flash memory usage results are profiled on STM32H743 with the limitations of 512 KB peak memory and 2 MB storage.
  • Note that we measure the newer versions of libraries in this repo, so that the results in this repo might be different from the ones in the MCUNet papers.
  • For each inference library, we use the git commit ID to indicate the version.
  • All the tflite models are compiled by -Ofast optimization level in STM32CubeIDE.
  • OOM denotes Out Of Memory.
  • Measurement for X-Cube-AI v7.3.0 was conducted with the default compilation setting of balanced mode.

The latency results:

net_id TF-Lite Micro
@ 713b6ed
CMSIS-NN
@ 011bf32
X-CUBE-AI
v7.3.0
TinyEngine
@ 0363956
# mcunet models (VWW)
mcunet-vww0 587ms 53ms 32ms 27ms
mcunet-vww1 1120ms 97ms 57ms 51ms
mcunet-vww2 5310ms 478ms 269ms 234ms
# mcunet models (ImageNet)
mcunet-in0 586ms 51ms 35ms 25ms
mcunet-in1 1227ms 103ms 63ms 56ms
mcunet-in2 6463ms 642ms 351ms 280ms
mcunet-in3 7821ms 770ms 414ms 336ms
mcunet-in4 OOM OOM 516ms 463ms
# baseline models
proxyless-w0.3-r64 512ms 54kB 35kB 23kB
proxyless-w0.3-r176 3801ms 380ms 205ms 176ms
mbv2-w0.3-r64 467ms 43ms 29ms 23ms

The peak memory (SRAM) results:

net_id TF-Lite Micro
@ 713b6ed
CMSIS-NN
@ 011bf32
X-CUBE-AI
v7.3.0
TinyEngine
@ 0363956
# mcunet models (VWW)
mcunet-vww0 163kB 163kB 88kB 59kB
mcunet-vww1 220kB 220kB 113kB 92kB
mcunet-vww2 385kB 390kB 201kB 174kB
# mcunet models (ImageNet)
mcunet-in0 161kB 161kB 69kB 49kB
mcunet-in1 219kB 219kB 106kB 96kB
mcunet-in2 460kB 469kB 238kB 215kB
mcunet-in3 493kB 493kB 243kB 260kB
mcunet-in4 OOM OOM 342kB 416kB
# baseline models
proxyless-w0.3-r64 128kB 136kB 97kB 35kB
proxyless-w0.3-r176 453kB 453kB 221kB 259kB
mbv2-w0.3-r64 173kB 173kB 88kB 61kB

The Flash memory usage results:

net_id TF-Lite Micro
@ 713b6ed
CMSIS-NN
@ 011bf32
X-CUBE-AI
v7.3.0
TinyEngine
@ 0363956
# mcunet models (VWW)
mcunet-vww0 627kB 646kB 463kB 453kB
mcunet-vww1 718kB 736kB 534kB 521kB
mcunet-vww2 1016kB 1034kB 774kB 741kB
# mcunet models (ImageNet)
mcunet-in0 1072kB 1090kB 856kB 842kB
mcunet-in1 937kB 956kB 737kB 727kB
mcunet-in2 1084kB 1102kB 849kB 830kB
mcunet-in3 1091kB 1106kB 867kB 835kB
mcunet-in4 OOM OOM 1843kB 1825kB
# baseline models
proxyless-w0.3-r64 1065kB 1084kB 865kB 777kB
proxyless-w0.3-r176 1065kB 1084kB 865kB 779kB
mbv2-w0.3-r64 940kB 959kB 768kB 690kB

Citation

If you find the project helpful, please consider citing our paper:

@article{
  lin2020mcunet,
  title={Mcunet: Tiny deep learning on iot devices},
  author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Gan, Chuang and Han, Song},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

@inproceedings{
  lin2021mcunetv2,
  title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
  author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

@article{
  lin2022ondevice,
  title = {On-Device Training Under 256KB Memory},
  author = {Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year = {2022}
}

Related Projects

MCUNet: Tiny Deep Learning on IoT Devices (NeurIPS'20)

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning (NeurIPS'21)

MCUNetV3: On-Device Training Under 256KB Memory (NeurIPS'22)

More Repositories

1

streaming-llm

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
Python
6,530
star
2

bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
Python
2,286
star
3

temporal-shift-module

[ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
Python
2,060
star
4

once-for-all

[ICLR 2020] Once for All: Train One Network and Specialize it for Efficient Deployment
Python
1,866
star
5

llm-awq

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Python
1,687
star
6

proxylessnas

[ICLR 2019] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
C++
1,420
star
7

torchquantum

A PyTorch-based framework for Quantum Classical Simulation, Quantum Machine Learning, Quantum Neural Networks, Parameterized Quantum Circuits with support for easy deployments on real quantum computers.
Jupyter Notebook
1,304
star
8

data-efficient-gans

[NeurIPS 2020] Differentiable Augmentation for Data-Efficient GAN Training
Python
1,277
star
9

efficientvit

EfficientViT is a new family of vision models for efficient high-resolution vision.
Python
1,218
star
10

torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Cuda
1,181
star
11

smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Python
1,175
star
12

gan-compression

[CVPR 2020] GAN Compression: Efficient Architectures for Interactive Conditional GANs
Python
1,104
star
13

anycost-gan

[CVPR 2021] Anycost GANs for Interactive Image Synthesis and Editing
Python
778
star
14

tinyml

Python
755
star
15

TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library
C++
730
star
16

fastcomposer

[IJCV] FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
Python
644
star
17

pvcnn

[NeurIPS 2019, Spotlight] Point-Voxel CNN for Efficient 3D Deep Learning
Python
639
star
18

lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention
Python
589
star
19

spvnas

[ECCV 2020] Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
Python
577
star
20

distrifuser

[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Python
538
star
21

mcunet

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning
Python
460
star
22

tiny-training

On-Device Training Under 256KB Memory [NeurIPS'22]
Python
432
star
23

amc

[ECCV 2018] AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Python
428
star
24

dlg

[NeurIPS 2019] Deep Leakage From Gradients
Python
400
star
25

haq

[CVPR 2019, Oral] HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Python
368
star
26

offsite-tuning

Offsite-Tuning: Transfer Learning without Full Model
Python
365
star
27

hardware-aware-transformers

[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Python
321
star
28

litepose

[CVPR'22] Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation
Python
304
star
29

inter-operator-scheduler

[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
C++
191
star
30

amc-models

[ECCV 2018] AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Python
166
star
31

apq

[CVPR 2020] APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
Python
156
star
32

parallel-computing-tutorial

C++
134
star
33

flatformer

[CVPR'23] FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
Python
119
star
34

patch_conv

Patch convolution to avoid large GPU memory usage of Conv2D
Python
74
star
35

6s965-fall2022

Jupyter Notebook
64
star
36

sparsevit

[CVPR'23] SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
Python
48
star
37

bnn-icestick

Binary Neural Network on IceStick FPGA.
Jupyter Notebook
47
star
38

e3d

Efficient 3D Deep Learning
46
star
39

neurips-micronet

[JMLR'20] NeurIPS 2019 MicroNet Challenge Efficient Language Modeling, Champion
Jupyter Notebook
40
star
40

spatten-llm

[HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Scala
32
star
41

tinychat-tutorial

C++
28
star
42

pruning-sparsity-publications

14
star
43

iccad-tinyml-open

[ICCAD'22 TinyML Contest] Efficient Heart Stroke Detection on Low-cost Microcontrollers
C
14
star
44

calo-cluster

Jupyter Notebook
5
star
45

ml-blood-pressure

Python
5
star
46

gan-compression-dynamic

Python
3
star
47

data-efficient-gans-dynamic

Python
3
star