tf-metal-experiments

TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

Setup

These experiments have been tested only on M1-series Apple Silicon SoCs.

TensorFlow 2.x

  1. Follow the official instructions from Apple (https://developer.apple.com/metal/tensorflow-plugin/).
  2. Test that your Metal GPU is working by running tf.config.list_physical_devices("GPU"); you should see 1 GPU present (it is not named). Later, when you actually use the GPU, there will be a more informative printout that says Metal device set to: Apple M1 Max or similar. See the snippet after this list.
  3. Now you should be ready to run any TF code that doesn't require external libraries.
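
As a quick sanity check, something along these lines should list the GPU and trigger the Metal printout (a minimal sketch):

import tensorflow as tf

# should print one (unnamed) GPU device
print(tf.config.list_physical_devices("GPU"))

# running any op on the GPU triggers the "Metal device set to: ..." printout
with tf.device("/GPU:0"):
    x = tf.random.normal((4, 4))
    print(tf.matmul(x, tf.transpose(x)))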

HuggingFace Transformers library

If you want to play around with Transformer models (with the TF Metal backend, of course), you will need to install the HuggingFace Transformers library.

  1. Install the regex library (I don't know why it has to be like this, but yeah): python3 -m pip install --upgrade regex --no-use-pep517. You might need to run xcode-select --install if the above command doesn't work.
  2. pip install transformers ipywidgets
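
To verify that Transformers runs on the TF backend, a minimal check (using the distilbert-base-uncased checkpoint purely for illustration) might look like:

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModel.from_pretrained("distilbert-base-uncased")

# any forward pass should trigger the "Metal device set to: ..." printout
inputs = tokenizer("Hello Metal!", return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)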

Experiments and Benchmarks

After some trial and error, here are some initial benchmarks for what should be roughly the best capability of the M1 Max.

  • For all the cases here, increasing the batch size does not seem to increase throughput.
  • High Power Mode was enabled and the machine was plugged into the charger (this does not seem to affect the benchmarks anyway).

Power draw also doesn't seem to be able to go much higher than ~40W:

  • Power draw from the GPU (averaged over 1 second) can be measured with sudo powermetrics --samplers gpu_power -i1000 -n1.
  • I decided to report peak power as observed via asitop (see: tlkh/asitop).
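
If you want to sample power from a script, a small hypothetical helper (not part of this repo) could wrap powermetrics; note that the exact output format may vary across macOS versions:

import re
import subprocess

def gpu_power_watts(interval_ms: int = 1000) -> float:
    """Sample average GPU power once over interval_ms (requires sudo)."""
    out = subprocess.run(
        ["sudo", "powermetrics", "--samplers", "gpu_power",
         f"-i{interval_ms}", "-n1"],
        capture_output=True, text=True, check=True,
    ).stdout
    # powermetrics reports a line like "GPU Power: 38000 mW" (format may vary)
    match = re.search(r"GPU Power:\s*(\d+)\s*mW", out)
    return int(match.group(1)) / 1000.0 if match else float("nan")

print(f"GPU power: {gpu_power_watts():.1f} W")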

Model        GPU         BatchSize  Throughput   Peak Power  Memory
ResNet50     M1 Max 32c  128        140 img/sec  42W         21 GB
MobileNetV2  M1 Max 32c  128        352 img/sec  37W         13 GB
DistilBERT   M1 Max 32c  64         120 seq/sec  35W         9 GB
BERTLarge    M1 Max 32c  16         19 seq/sec   36W         14 GB

The benchmark scripts used are included in this repo.

python train_benchmark.py --type cnn --model resnet50
python train_benchmark.py --type cnn --model mobilenetv2
python train_benchmark.py --type transformer --model distilbert-base-uncased
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16
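
For reference, the core timing loop of such a throughput benchmark can be sketched as follows (a minimal sketch using synthetic data; the actual train_benchmark.py may be structured differently):

import time
import tensorflow as tf

BATCH_SIZE, STEPS = 128, 50

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# synthetic data, so we measure compute rather than the input pipeline
x = tf.random.normal((BATCH_SIZE, 224, 224, 3))
y = tf.random.uniform((BATCH_SIZE,), maxval=1000, dtype=tf.int32)
ds = tf.data.Dataset.from_tensors((x, y)).repeat()

model.fit(ds, steps_per_epoch=5, epochs=1)  # warmup
start = time.time()
model.fit(ds, steps_per_epoch=STEPS, epochs=1)
print(f"{BATCH_SIZE * STEPS / (time.time() - start):.1f} img/sec")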

Reference Benchmarks from RTX 3090

Same Batch Size as M1:

Model        GPU   BatchSize  Throughput    Power
ResNet50     3090  128        1100 img/sec  360W
MobileNetV2  3090  128        2001 img/sec  340W
DistilBERT   3090  64         1065 seq/sec  360W
BERTLarge    3090  16         131 seq/sec   335W

Larger Batch Size:

Model        GPU   BatchSize  Throughput    Power
ResNet50     3090  256        1185 img/sec  370W
MobileNetV2  3090  256        2197 img/sec  350W
DistilBERT   3090  256        1340 seq/sec  380W
BERTLarge    3090  64         193 seq/sec   365W

For the 3090, the same script is used, but with additional optimizations that leverage hardware (Tensor Cores) and software (the XLA compiler) not present or not working on the M1. The epoch length is also increased, since the 3090 is sometimes so fast that a run finishing in seconds gives poorer measurements due to the overhead of starting and ending the training.

Note: the 3090 is running at a 400W power limit. The CPU is a 5600X.

# config for NVIDIA Tensor Core GPU
# run with more steps, XLA and FP16 (enable tensor core aka mixed precision)
python train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100
python train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30
# If no Tensor Core, remove --fp16 flag
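
For reference, these two flags correspond to standard TF settings along these lines (a sketch of the usual APIs; the script may wire them up differently):

import tensorflow as tf

# enable XLA JIT compilation for TF ops
tf.config.optimizer.set_jit(True)

# compute in FP16 (using Tensor Cores) while keeping variables in FP32
tf.keras.mixed_precision.set_global_policy("mixed_float16")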

Measuring Achievable TFLOPS

We can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get around 8+ TFLOPS for large enough problem sizes.
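
A single point of such a benchmark can be sketched as follows (a minimal FP32 version; the repo's tflops_sweep.py may differ):

import time
import tensorflow as tf

N, ITERS = 4096, 50

with tf.device("/GPU:0"):
    a = tf.random.normal((N, N))
    b = tf.random.normal((N, N))
    _ = tf.matmul(a, b).numpy()  # warmup

    start = time.time()
    for _ in range(ITERS):
        c = tf.matmul(a, b)
    _ = c.numpy()  # block until the queued work completes
    elapsed = time.time() - start

flops = 2 * N ** 3 * ITERS  # one NxN matmul is ~2*N^3 FLOPs
print(f"~{flops / elapsed / 1e12:.2f} TFLOPS (FP32)")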

The plot can be generated using tflops_sweep.py.

Note that FP64 and FP16 performance appears to be non-existent: the code automatically falls back to the CPU if FP64 or FP16 is specified as the data type.
