TinyChatEngine: On-Device LLM Inference Library


Running large language models (LLMs) on the edge is useful for copilot services (coding, office, smart reply) on laptops, cars, robots, and more. Users get instant responses with better privacy, since the data stays local.

This is enabled by LLM model compression techniques, SmoothQuant and AWQ (Activation-aware Weight Quantization), co-designed with TinyChatEngine, which implements the compressed low-precision models.

Feel free to check out our slides for more details!

Code LLaMA Demo on an NVIDIA GeForce RTX 4070 laptop:

[Demo GIF: coding_demo_gpu]

LLaMA Chat Demo on an Apple MacBook Pro (M1, 2021):

[Demo GIF: chat_demo_m1]

Overview

LLM Compression: SmoothQuant and AWQ

SmoothQuant: Smooth the activation outliers by migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation (e.g., 100 × 1 = 10 × 10).

[Figure: SmoothQuant intuition]
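
In symbols (following the SmoothQuant formulation; X is the activation matrix, W the weight matrix, s a per-input-channel scaling vector, and α the migration-strength hyperparameter, typically 0.5):

    Y = X W = (X \,\mathrm{diag}(s)^{-1}) (\mathrm{diag}(s)\, W) = \hat{X} \hat{W},
    \qquad s_j = \max(|X_j|)^{\alpha} \,/\, \max(|W_j|)^{1-\alpha}

Scaling each channel j of X down by s_j (and the matching row of W up by s_j) leaves the output Y unchanged while shrinking activation outliers, so both factors become easy to quantize to INT8.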

AWQ (Activation-aware Weight Quantization): Protect salient weight channels by analyzing activation magnitudes rather than weight magnitudes.
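
As a toy illustration of the idea (not the actual AWQ implementation; the function and variable names below are hypothetical), salient channels can be picked by ranking input channels by their average absolute activation over a small calibration set:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Toy sketch: rank input channels by mean |activation| over calibration data
    // and mark the top `salient_ratio` fraction (e.g., ~1%) as salient so they can
    // be protected during 4-bit weight quantization. Illustrative only.
    std::vector<bool> find_salient_channels(const std::vector<std::vector<float>>& acts,  // [token][channel]
                                            float salient_ratio = 0.01f) {
        const std::size_t channels = acts.empty() ? 0 : acts[0].size();
        std::vector<float> mean_abs(channels, 0.0f);
        for (const auto& token : acts)
            for (std::size_t c = 0; c < channels; ++c)
                mean_abs[c] += std::fabs(token[c]) / acts.size();

        std::vector<std::size_t> order(channels);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) { return mean_abs[a] > mean_abs[b]; });

        std::vector<bool> salient(channels, false);
        const std::size_t keep = static_cast<std::size_t>(channels * salient_ratio);
        for (std::size_t i = 0; i < keep; ++i) salient[order[i]] = true;
        return salient;
    }

In AWQ itself, these salient channels are then protected by per-channel scaling before quantization rather than by keeping them in higher precision.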

LLM Inference Engine: TinyChatEngine

  • Universal: x86 (Intel/AMD), ARM (Apple M1/M2, Raspberry Pi), CUDA (Nvidia GPU).
  • No library dependency: From-scratch C/C++ implementation.
  • High performance: Real-time inference on a MacBook and a GeForce laptop.
  • Easy to use: Download and compile, then ready to go!

[Figure: TinyChatEngine overview]

News

  • (2023/10) We extended support for the coding assistant Code Llama. Feel free to check it out!
  • (2023/10) ⚡We released the new CUDA backend to support Nvidia GPUs with compute capability >= 6.1 for both server and edge GPUs. Performance is also sped up by ~40% compared to the previous version. Feel free to check it out!
  • (2023/09) 🔥We released TinyVoiceChat, a voice chatbot that can be deployed on your edge devices, such as MacBook and Jetson Orin Nano. Check out our demo video and step-by-step guide to deploy it on your device!

Prerequisites

macOS

For macOS, install boost and llvm:

brew install boost
brew install llvm

For M1/M2 users, install Xcode from the App Store to enable the Metal compiler for GPU support.

Windows with CPU

For Windows, download and install the GCC compiler with MSYS2. Follow this tutorial: https://code.visualstudio.com/docs/cpp/config-mingw for installation.

  • Install required dependencies with MSYS2
pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git
  • Add the binary directories (e.g., C:\msys64\mingw64\bin and C:\msys64\usr\bin) to the PATH environment variable

Windows with Nvidia GPU (Experimental)

  • Install the CUDA toolkit for Windows (link). When installing CUDA on your PC, please choose an installation path that does not contain spaces.

  • Install Visual Studio with C and C++ support: follow the instructions.

  • Follow the instructions below and use the x64 Native Tools Command Prompt from Visual Studio to compile TinyChatEngine.

Step-by-step to Deploy LLaMA2-7B-chat with TinyChatEngine

Here, we provide step-by-step instructions to deploy LLaMA2-7B-chat with TinyChatEngine from scratch.

  • Download the repo.

    git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
    cd TinyChatEngine
  • Install Python Packages

    • The primary codebase of TinyChatEngine is written in pure C/C++. The Python packages are only used for downloading (and converting) models from our model zoo.
      conda create -n TinyChatEngine python=3.10 pip -y
      conda activate TinyChatEngine
      pip install -r requirements.txt
  • Download the quantized LLaMA2-7B-chat model from our model zoo.

    cd llm
    • On an x86 device (e.g., Intel/AMD laptop)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_x86
    • On an ARM device (e.g., M1/M2 MacBook, Raspberry Pi)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM
    • On a CUDA device (e.g., Jetson AGX Orin, PC/Server)
      python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA
    • Check this table for the detailed list of supported models
  • (CUDA only) Based on the platform you are using and the compute capability of your GPU, modify the Makefile accordingly. If using Windows with an Nvidia GPU, please modify -arch=sm_xx on line 54. If using other platforms with an Nvidia GPU, please modify -gencode arch=compute_xx,code=sm_xx on line 60. An illustrative snippet is given after this list.

  • Compile and start the chat locally.

    make chat -j
    ./chat
    
    TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
    Using model: LLaMA2_7B_chat
    Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
    Loading model... Finished!
    USER: Write a syllabus for Operating Systems.
    ASSISTANT:
    Of course! Here is a sample syllabus for a college-level course on operating systems:
    Course Title: Introduction to Operating Systems
    Course Description: This course provides an overview of the fundamental concepts and techniques used in modern operating systems, including process management, memory management, file systems, security, and I/O devices. Students will learn how these components work together to provide a platform for running applications and programs on a computer.
    Course Objectives:
    * Understand the basic architecture of an operating system
    * Learn about processes, threads, and process scheduling algorithms
    * Study memory management techniques such as paging and segmentation
    * Explore file systems including file organization, storage devices, and file access methods
    * Investigate security mechanisms to protect against malicious software attacks
    * Analyze input/output (I/O) operations and their handling by the operating system
    ...
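
For the CUDA-only Makefile step above, the change is a single compiler flag. The variable name and the value sm_86 below are placeholders only (use your GPU's compute capability and whatever flag variable the Makefile actually defines on those lines):

    # Windows with an Nvidia GPU (around line 54): replace sm_xx with your compute capability
    NVCCFLAGS += -arch=sm_86
    # Other platforms with an Nvidia GPU (around line 60)
    NVCCFLAGS += -gencode arch=compute_86,code=sm_86

For reference, sm_86 corresponds to RTX 30-series GPUs, sm_87 to Jetson AGX Orin, and sm_89 to RTX 40-series GPUs.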

Backend Support

Precision   x86 (Intel/AMD CPU)   ARM (Apple M1/M2 & RPi)   Nvidia GPU   Apple GPU
FP32        ✅                    ✅
W4A16                                                       ✅           ✅
W4A32       ✅                    ✅                                     ✅
W4A8        ✅                    ✅
W8A8        ✅                    ✅
  • For Raspberry Pi, we recommend using a board with 8 GB of RAM. Our testing was primarily conducted on a Raspberry Pi 4 Model B Rev 1.4 with aarch64. For other versions, please feel free to try it out and let us know if you encounter any issues.
  • For Nvidia GPUs, our CUDA backend supports compute capability >= 6.1. For GPUs with compute capability < 6.1, feel free to try it out, but we have not tested them and thus cannot guarantee the results.

Quantization and Model Support

The goal of TinyChatEngine is to support various quantization methods on various devices. At present, for example, it supports quantized weights for int8 OPT models that originate from SmoothQuant, using the provided conversion script opt_smooth_exporter.py. For LLaMA models, scripts are available for converting Hugging Face format checkpoints to our int4 weight format and for quantizing them with the method appropriate for your device. Before converting and quantizing your models, we recommend applying the fake quantization from AWQ to achieve better accuracy. We are currently working on supporting more models; please stay tuned!

Device-specific int4 Weight Reordering

To mitigate the runtime overheads associated with weight reordering, TinyChatEngine conducts this process offline during model conversion. In this section, we will explore the weight layouts of QM_ARM and QM_x86. These layouts are tailored for ARM and x86 CPUs, supporting 128-bit SIMD and 256-bit SIMD operations, respectively. We also support QM_CUDA for Nvidia GPUs, including server and edge GPUs.

Platforms                        ISA      Quantization method
Intel & AMD                      x86-64   QM_x86
Apple M1/M2 Mac & Raspberry Pi   ARM      QM_ARM
Nvidia GPU                       CUDA     QM_CUDA
  • Example layout of QM_ARM: For QM_ARM, consider the initial configuration of a 128-bit weight vector, [w0, w1, ... , w30, w31], where each wi is a 4-bit quantized weight. TinyChatEngine rearranges these weights in the sequence [w0, w16, w1, w17, ..., w15, w31] by interleaving the lower half and upper half of the weights. This new arrangement facilitates the decoding of both the lower and upper halves using 128-bit AND and shift operations, as depicted in the subsequent figure. This will eliminate runtime reordering overheads and improve performance.
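
A minimal scalar sketch of the decode this layout enables (assuming byte i of a 16-byte block packs w_i in its low nibble and w_(i+16) in its high nibble, with a symmetric zero-point of 8; the real kernels apply the same AND/shift to whole 128-bit NEON registers instead of looping over bytes):

    #include <cstdint>

    // Decode 32 interleaved 4-bit weights from one 16-byte QM_ARM-style block.
    // Because the halves were interleaved offline, the lower half comes out with a
    // single AND and the upper half with a single SHIFT -- no runtime reordering.
    void decode_int4_block(const uint8_t packed[16], int8_t w[32]) {
        for (int i = 0; i < 16; ++i) {
            w[i]      = static_cast<int8_t>(packed[i] & 0x0F) - 8;  // lower half: w0..w15
            w[i + 16] = static_cast<int8_t>(packed[i] >> 4) - 8;    // upper half: w16..w31
        }
    }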

Download and Deploy Models from our Model Zoo

We offer a selection of models that have been tested with TinyChatEngine. These models can be readily downloaded and deployed on your device. To download a model, locate the target model's ID in the table below and use the associated script.

Models                   Precision   ID                                x86 backend   ARM backend   CUDA backend
LLaMA2_13B_chat          fp32        LLaMA2_13B_chat_fp32              ✅            ✅
                         int4        LLaMA2_13B_chat_awq_int4          ✅            ✅            ✅
LLaMA2_7B_chat           fp32        LLaMA2_7B_chat_fp32               ✅            ✅
                         int4        LLaMA2_7B_chat_awq_int4           ✅            ✅            ✅
LLaMA_7B                 fp32        LLaMA_7B_fp32                     ✅            ✅
                         int4        LLaMA_7B_awq_int4                 ✅            ✅            ✅
CodeLLaMA_13B_Instruct   fp32        CodeLLaMA_13B_Instruct_fp32       ✅            ✅
                         int4        CodeLLaMA_13B_Instruct_awq_int4   ✅            ✅            ✅
CodeLLaMA_7B_Instruct    fp32        CodeLLaMA_7B_Instruct_fp32        ✅            ✅
                         int4        CodeLLaMA_7B_Instruct_awq_int4    ✅            ✅            ✅
opt-6.7B                 fp32        opt_6.7B_fp32                     ✅            ✅
                         int8        opt_6.7B_smooth_int8              ✅            ✅
                         int4        opt_6.7B_awq_int4                 ✅            ✅
opt-1.3B                 fp32        opt_1.3B_fp32                     ✅            ✅
                         int8        opt_1.3B_smooth_int8              ✅            ✅
                         int4        opt_1.3B_awq_int4                 ✅            ✅
opt-125m                 fp32        opt_125m_fp32                     ✅            ✅
                         int8        opt_125m_smooth_int8              ✅            ✅
                         int4        opt_125m_awq_int4                 ✅            ✅

For instance, to download the quantized LLaMA2-7B-chat model (for int4 models, use --QM to choose the quantized model for your device):

  • On an Intel/AMD laptop:
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_x86
  • On an M1/M2 Macbook:
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM
  • On an Nvidia GPU:
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_CUDA

To deploy a quantized model with TinyChatEngine, compile and run the chat program.

  • On CPU platforms
make chat -j
# ./chat <model_name> <precision> <num_threads>
./chat LLaMA2_7B_chat INT4 8
  • On GPU platforms
make chat -j
# ./chat <model_name> <precision>
./chat LLaMA2_7B_chat INT4

Experimental Features

Voice Chatbot [Demo]

TinyChatEngine offers versatile capabilities suitable for various applications. Additionally, we introduce a sophisticated voice chatbot. Explore our step-by-step guide here to seamlessly deploy a speech-to-speech chatbot locally on your device!

Related Projects

TinyEngine: Memory-efficient and High-performance Neural Network Library for Microcontrollers

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Acknowledgement

llama.cpp

whisper.cpp

transformers
