3X speedup over Apple’s TensorFlow plugin by using Apache TVM on M1

Apple-M1-BERT Inference

Setup Environment:

  1. Install Miniforge using the installer from the official release page (https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh)

Once downloaded:

chmod +x ~/Downloads/Miniforge3-MacOSX-arm64.sh
sh ~/Downloads/Miniforge3-MacOSX-arm64.sh
source ~/miniforge3/bin/activate
  2. Create an environment for TVM and all dependencies to run this repo:
# Create a conda environment
conda create --name tvm-m1 python=3.8
conda activate tvm-m1

# Install TensorFlow and dependencies
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal

# Install PyTorch and Transformers
conda install -c pytorch pytorch torchvision
conda install -c huggingface transformers -y

# Install TVM dependencies
conda install numpy decorator attrs cython
conda install llvmdev
conda install cmake
conda install tornado psutil xgboost cloudpickle pytest
  3. Clone and Build TVM

Clone TVM

git clone --recursive https://github.com/apache/tvm tvm
cd tvm && mkdir build
cp cmake/config.cmake build

Edit the config.cmake file in the build directory, setting the following options:

set(USE_METAL ON)
set(USE_LLVM ON)
set(USE_OPENMP gnu)
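
The same three settings can be applied with a script instead of a manual edit. This is a minimal sketch, assuming the copied build/config.cmake declares each option as a `set(NAME value)` line; `apply_cmake_options` and `M1_OPTIONS` are hypothetical names and are not part of TVM or this repo.

```python
import re

# Option values taken from the list above.
M1_OPTIONS = {"USE_METAL": "ON", "USE_LLVM": "ON", "USE_OPENMP": "gnu"}


def apply_cmake_options(config_text, options):
    """Return config_text with each set(NAME ...) line rewritten to the given value.

    Options that have no existing set(...) line are appended at the end.
    """
    for name, value in options.items():
        pattern = re.compile(
            r"^set\(\s*" + re.escape(name) + r"\s+[^)]*\)", re.MULTILINE
        )
        replacement = f"set({name} {value})"
        # A callable replacement avoids any backslash interpretation in values.
        config_text, count = pattern.subn(lambda _m: replacement, config_text)
        if count == 0:
            config_text = config_text.rstrip("\n") + "\n" + replacement + "\n"
    return config_text


if __name__ == "__main__":
    sample = "set(USE_METAL OFF)\nset(USE_LLVM OFF)\nset(USE_OPENMP none)\n"
    print(apply_cmake_options(sample, M1_OPTIONS))
```

To use it on the real file, read build/config.cmake, pass its text through `apply_cmake_options`, and write the result back before building.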

Build TVM

cd build && cmake -DCMAKE_OSX_ARCHITECTURES=arm64 ..
make -j4

Set the following environment variables, or add them to ~/.zshrc to make them persistent:

export TVM_HOME=/Users/tvmuser/git/tvm
export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}
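
Before running the benchmarks, it can help to confirm that the variables took effect. The sketch below only inspects the environment and asks Python whether it can locate a `tvm` package; `check_tvm_environment` and `tvm_importable` are hypothetical helpers, not part of TVM or this repo.

```python
import importlib.util
import os


def check_tvm_environment(environ):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    tvm_home = environ.get("TVM_HOME")
    if not tvm_home:
        problems.append("TVM_HOME is not set")
        return problems
    expected = os.path.join(tvm_home, "python")
    entries = environ.get("PYTHONPATH", "").split(os.pathsep)
    if expected not in entries:
        problems.append(f"{expected} is not on PYTHONPATH")
    return problems


def tvm_importable():
    """True if the interpreter can locate the tvm package (without importing it)."""
    return importlib.util.find_spec("tvm") is not None


if __name__ == "__main__":
    for problem in check_tvm_environment(os.environ):
        print("warning:", problem)
    print("tvm importable:", tvm_importable())
```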

For more details on setting up TensorFlow on macOS, click here.

Run the TF and Keras benchmarks:

  1. Dump the bert-base-uncased model into a graph by running python dump_tf_graph.py.
  2. Get the Keras CPU benchmark by running python run_keras.py --device cpu. Sample output: [Keras] Mean Inference time (std dev) on cpu: 579.0056343078613 ms (20.846548561801576 ms)
  3. Get the Keras GPU benchmark by running python run_keras.py --device gpu. Sample output: [Keras] Mean Inference time (std dev) on gpu: 1767.4337482452393 ms (27.00876036973127 ms)
  4. Get the GraphDef CPU benchmark by running python run_graphdef.py --device cpu --graph-path ./models/bert-base-uncased.pb. Sample output: [Graphdef] Mean Inference time (std dev) on cpu: 512.3187007904053 ms (6.115432898641167 ms)
  5. Get the GraphDef GPU benchmark by running python run_graphdef.py --device gpu --graph-path ./models/bert-base-uncased.pb. Sample output: [Graphdef] Mean Inference time (std dev) on gpu: 543.5237417221069 ms (4.210226676450006 ms)
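
For readers who want to reproduce a "mean (std dev)" figure for their own callable, here is a minimal timing sketch. It is not taken from run_keras.py or run_graphdef.py; the warm-up count, the repeat count, and the use of the population standard deviation are all assumptions, and `benchmark` and `format_result` are hypothetical names.

```python
import statistics
import time


def benchmark(fn, repeats=10, warmup=2):
    """Time fn() and return (mean_ms, std_ms) over `repeats` runs.

    `warmup` untimed calls run first so one-time setup cost is excluded.
    """
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.pstdev(samples_ms)


def format_result(label, device, mean_ms, std_ms):
    """Format a result line in the same shape as the sample outputs above."""
    return (
        f"[{label}] Mean Inference time (std dev) on {device}: "
        f"{mean_ms} ms ({std_ms} ms)"
    )


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real model call.
    mean_ms, std_ms = benchmark(lambda: sum(range(100_000)))
    print(format_result("Example", "cpu", mean_ms, std_ms))
```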

Running TVM AutoScheduler Search

We provide search_dense_cpu.py and search_dense_gpu.py for searching on the M1 CPU and the M1 GPU, respectively. Both scripts use RPC. Run each of the commands below in a separate window, or use a session manager such as screen or tmux.

The scripts require that HuggingFace's bert-base-uncased model has already been converted to Relay, which can be done with the dump_pt.py script.

  1. Start RPC Tracker: python -m tvm.exec.rpc_tracker --host 0.0.0.0 --port 9190
  2. Start RPC Server: python -m tvm.exec.rpc_server --tracker 127.0.0.1:9190 --port 9090 --key m1 --no-fork
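
If separate terminal windows are inconvenient, the two commands can also be assembled from Python. The sketch below only builds the argument lists, using the flags shown above, and adds a small port probe; `tracker_command`, `server_command` and `port_is_open` are hypothetical helpers, and actually launching the processes (for example with `subprocess.Popen`) is left to the reader.

```python
import socket
import sys


def tracker_command(host="0.0.0.0", port=9190):
    """Argument list for the RPC tracker, mirroring step 1."""
    return [sys.executable, "-m", "tvm.exec.rpc_tracker",
            "--host", host, "--port", str(port)]


def server_command(tracker_host="127.0.0.1", tracker_port=9190, port=9090, key="m1"):
    """Argument list for the RPC server, mirroring step 2."""
    return [sys.executable, "-m", "tvm.exec.rpc_server",
            "--tracker", f"{tracker_host}:{tracker_port}",
            "--port", str(port), "--key", key, "--no-fork"]


def port_is_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(" ".join(tracker_command()))
    print(" ".join(server_command()))
    print("tracker reachable:", port_is_open("127.0.0.1", 9190))
```

Starting the server only after `port_is_open` reports the tracker as reachable avoids a race between the two processes.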

Before continuing, make sure all existing logs have been removed from the ./assets folder.

Once the search scripts have completed, you should see output like this:

Extract tasks...
Compile...
Upload
run
Evaluate inference time cost...
Mean inference time (std dev): 35.54 ms (1.39 ms)
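
To compare runs, the result line can be parsed instead of read by eye. This is a small sketch that assumes the line formats shown in this README (with or without the "on cpu"/"on gpu" suffix); `parse_result` is a hypothetical helper.

```python
import re

# Matches both "Mean inference time (std dev): ..." and
# "[Keras] Mean Inference time (std dev) on cpu: ...".
RESULT_RE = re.compile(
    r"Mean [Ii]nference time \(std dev\)(?: on (?P<device>\w+))?: "
    r"(?P<mean>[0-9.]+) ms \((?P<std>[0-9.]+) ms\)"
)


def parse_result(line):
    """Return (device, mean_ms, std_ms), or None if the line is not a result line.

    device is None when the line does not name one.
    """
    match = RESULT_RE.search(line)
    if match is None:
        return None
    return match.group("device"), float(match.group("mean")), float(match.group("std"))


if __name__ == "__main__":
    print(parse_result("Mean inference time (std dev): 35.54 ms (1.39 ms)"))
```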

Running TVM Inference

If you decide not to run a search and instead use the pre-searched AutoScheduler logs provided in the assets folder for the M1 Pro CPU and GPU, comment out the following:

    if not os.path.exists(log_file):
        tasks, task_weights = auto_scheduler.extract_tasks(
            mod["main"], params, target=target_host, target_host=target_host)
        for idx, task in enumerate(tasks):
            print("========== Task %d  (workload key: %s) ==========" %
                  (idx, task.workload_key))
            print(task.compute_dag)

        run_tuning(tasks, task_weights, log_file)
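
As an alternative to commenting the block out, the decision can be made explicit with a flag. This is a sketch of that idea, not code from the repo: `needs_tuning` and its `skip_search` parameter are hypothetical, and the example file name is made up.

```python
import os
import tempfile


def needs_tuning(log_file, skip_search=False):
    """Return True if the tuning block should run for this log file.

    With skip_search=True an existing, non-empty log is required and
    tuning is never started.
    """
    if skip_search:
        if not os.path.exists(log_file) or os.path.getsize(log_file) == 0:
            raise FileNotFoundError(f"no usable tuning log at {log_file}")
        return False
    # Same condition as the guard in the snippet above.
    return not os.path.exists(log_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_file = os.path.join(tmp, "example_log.json")  # hypothetical file name
        print(needs_tuning(log_file))  # no log yet, so tuning would run
        with open(log_file, "w") as handle:
            handle.write("{}\n")
        print(needs_tuning(log_file, skip_search=True))  # log present, skip tuning
```

Failing loudly when `skip_search` is set but no log exists avoids silently starting a long search by accident.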

Why is TVM much faster than Apple TensorFlow with MLCompute?

  • TVM AutoScheduler uses machine learning to search for CPU/GPU code optimizations; human experts cannot cover every optimization.
  • TVM can fuse any subgraph whose computation pattern qualifies and directly generate code for the target; human experts can only add fusion patterns and optimize particular subgraphs by hand.
  • We visualized the bert-base-uncased graph in Apple TensorFlow and looked at a sample block in BERT. MLCompute tries to rewrite the TF graph, replacing some operators with ones it supports. In practice, complete coverage is always hard: in the BERT case, the MatMul operator is swapped for MLCMatMul and the LayerNorm operator for MLCLayerNorm, while all other operators are not covered by MLCompute. In the GPU case, data is therefore copied between CPU and GPU at almost every step. TVM, on the other hand, directly generates ALL operators on the GPU, so it can maximize GPU utilization.
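
The copy overhead described in the last point can be illustrated with a toy model. The sketch below counts how often execution crosses the CPU/GPU boundary when only some operators have a GPU implementation. The operator sequence is made up for illustration and is not the real BERT graph, and `count_device_transfers` is a hypothetical helper.

```python
def count_device_transfers(ops, gpu_supported):
    """Count CPU<->GPU boundary crossings for a linear sequence of ops.

    Ops in `gpu_supported` run on the GPU; all others fall back to the CPU.
    """
    transfers = 0
    previous = None
    for op in ops:
        device = "gpu" if op in gpu_supported else "cpu"
        if previous is not None and device != previous:
            transfers += 1
        previous = device
    return transfers


# Illustrative op sequence only, not extracted from bert-base-uncased.
BLOCK = ["MatMul", "BiasAdd", "Reshape", "MatMul", "Softmax", "MatMul", "Add", "LayerNorm"]

# Only MatMul and LayerNorm are offloaded, as in the MLCompute case above.
partial = count_device_transfers(BLOCK, {"MatMul", "LayerNorm"})
# Every op is generated for the GPU, as in the TVM case above.
full = count_device_transfers(BLOCK, set(BLOCK))

if __name__ == "__main__":
    print("partial coverage transfers:", partial)
    print("full coverage transfers:", full)
```

In this toy sequence, partial coverage forces a device switch at almost every operator, while full coverage needs none.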

If you'd like to learn more about TVM, please visit the Apache TVM project site, the OctoML site, or the OctoML blog.
