  • Stars: 120
  • Rank: 295,983 (Top 6 %)
  • Language: Jupyter Notebook
  • License: Apache License 2.0
  • Created over 2 years ago
  • Updated over 2 years ago

Repository Details

This project shows how to serve a TF-based image classification model as a web service with TFServing, Docker, and Kubernetes (GKE).

Deploying ML models with CPU based TFServing, Docker, and Kubernetes

By: Chansung Park and Sayak Paul


Figure developed by Chansung Park

This project shows how to serve a TensorFlow image classification model as RESTful and gRPC-based services with TFServing, Docker, and Kubernetes. The idea is to first create a custom TFServing Docker image with a TensorFlow model, and then deploy it on a k8s cluster running on Google Kubernetes Engine (GKE). We are particularly interested in deploying the model as a gRPC endpoint with TF Serving on a k8s cluster using GKE, and in using GitHub Actions to automate the whole procedure whenever a new TensorFlow model is released.

👋 NOTE

  • Even though this project uses an image classification model, its structure and techniques can be used to serve other models as well.
  • There is a counterpart of this project that uses FastAPI instead of TFServing. It shows how to convert a TensorFlow model to an ONNX-optimized model and deploy it on a k8s cluster. Check out this repo.

Update July 29, 2022: We published a blog post on load-testing the REST endpoint. Check it out on the TensorFlow blog here.

Deploying the model as a service with k8s

  • Prerequisites: Before doing anything else, you have to create a GKE cluster and service accounts with appropriate roles. You also need to obtain GCP credentials so that GitHub Actions can access GCP resources. Please check out the more detailed information here.
flowchart LR
    A[First: Environmental Setup]-->B;
    B[Second: Build TFServing Image]-->C[Third: Deploy on GKE];
  • To deploy a custom TFServing Docker image, we define a deployment.yml workflow file which is only triggered when there is a new release of the current repository. It is subdivided into three parts that perform the following tasks:
    • First subtask handles the environmental setup.
      • GCP authentication (the GCP credential has to be provided as a GitHub Secret)
      • Install the gcloud CLI toolkit
      • Authenticate Docker to push images to GCR (Google Container Registry)
      • Connect to the designated GKE cluster
    • Second subtask handles building a custom TFServing image.
      • Download and extract the latest released model from the current repository
      • Run the CPU-optimized TFServing image, which is compiled from the source code (FYI, the image is gcr.io/gcp-ml-172005/tfs-resnet-cpu-opt, and it is publicly available)
      • Copy the extracted model into the running container
      • Commit the changes of the running container and give it a new image name
      • Push the committed image
    • Third subtask handles deploying the custom TFServing image to the GKE cluster.
      • Pick one of the scenarios from the various experiments
      • Download the Kustomize toolkit to handle overlay configurations
      • Use Kustomize to update the image tag to the newly built one
      • By provisioning Deployment, Service, and ConfigMap, the custom TFServing image gets deployed.
        • NOTE: ConfigMap is only used for batching enabled scenarios to inject batching configurations dynamically into the Deployment.
    • In order to use this repo for your own purpose, please read this document to know what environment variables have to be set.
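The second and third subtasks above can be sketched as plain command lists. This is a minimal illustration, not the contents of deployment.yml: the container name `serving_base`, the `/models/<model_name>` path inside the image, and the example image names are assumptions, and `kubectl apply -k` (kubectl's built-in Kustomize) stands in for a separately downloaded Kustomize binary.

```python
import shlex


def build_image_commands(base_image, model_dir, model_name, new_image,
                         container="serving_base"):
    """Return the docker commands for the second subtask as argv lists.

    The container name and the /models/<model_name> target path are
    assumptions for illustration, not values taken from the workflow file.
    """
    return [
        # Start the CPU-optimized TFServing base image in the background.
        ["docker", "run", "-d", "--name", container, base_image],
        # Copy the extracted model into the running container.
        ["docker", "cp", model_dir, f"{container}:/models/{model_name}"],
        # Commit the container as a new image that serves this model by default.
        ["docker", "commit", "--change", f"ENV MODEL_NAME {model_name}",
         container, new_image],
        # Push the committed image to the registry.
        ["docker", "push", new_image],
    ]


def deploy_commands(image_name, new_image):
    """Return the commands for the third subtask, run inside an overlay directory."""
    return [
        # Point the chosen overlay at the freshly built image.
        ["kustomize", "edit", "set", "image", f"{image_name}={new_image}"],
        # Render the overlay and apply the Deployment, Service, and ConfigMap.
        ["kubectl", "apply", "-k", "."],
    ]


def as_shell(commands):
    """Render argv lists as copy-pasteable shell lines."""
    return [shlex.join(argv) for argv in commands]
```

Calling `as_shell(build_image_commands(...))` prints the commands for review before running them, for example with `subprocess.run(argv, check=True)` for each entry.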

If the entire workflow runs without any errors, you will see something similar to the text below. As you can see, two external interfaces (8500 for gRPC, 8501 for RESTful) are exposed. You can check out the complete logs in the past runs.

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                          AGE
tfs-server       LoadBalancer   xxxxxxxxxx     xxxxxxxxxx      8500:30869/TCP,8501:31469/TCP    23m
kubernetes       ClusterIP      xxxxxxxxxx     <none>          443/TCP                         160m

How to perform gRPC inference

If you wonder how to perform gRPC inference, grpc_client.py provides code that performs inference with a gRPC client. grpc_client.py contains an $ENDPOINT placeholder. To replace it with your own endpoint, define the ENDPOINT environment variable and run envsubst on the file, writing the result to a different file (for example, envsubst < grpc_client.py > grpc_client_filled.py), because redirecting the output back into grpc_client.py truncates the file before envsubst reads it. The TFServing API provides handy features for constructing the protobuf request message via predict_pb2.PredictRequest(), and tf.make_tensor_proto(image) creates protobuf-compatible values from a Tensor.
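As a rough sketch of what such a client can look like (this is not the repository's grpc_client.py), the request could be built and sent as follows. The model name `resnet`, the input tensor name `image_input`, and the `serving_default` signature are assumptions that depend on how the model was exported; the code needs the `grpcio`, `tensorflow`, and `tensorflow-serving-api` packages.

```python
def grpc_target(host: str, port: int = 8500) -> str:
    """Build the 'host:port' string expected by grpc.insecure_channel.

    8500 is TFServing's default gRPC port.
    """
    return f"{host}:{port}"


def predict(image, target: str, model_name: str = "resnet",
            input_name: str = "image_input", timeout_s: float = 10.0):
    """Send one batch of images to TFServing over gRPC and return the response.

    model_name and input_name are hypothetical defaults; use the names
    from your own SavedModel signature.
    """
    # Heavy dependencies are imported here so that grpc_target stays usable
    # on machines without TensorFlow installed.
    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    # Describe which model and signature the request targets.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    # Convert the image batch into a TensorProto and attach it as the input.
    request.inputs[input_name].CopyFrom(tf.make_tensor_proto(image))

    # Open a plaintext channel, call Predict, and close the channel afterwards.
    with grpc.insecure_channel(target) as channel:
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
        return stub.Predict(request, timeout_s)
```

With the service from the listing above, `predict(batch, grpc_target(external_ip))` would return a `PredictResponse` whose `outputs` map holds the model's output tensors.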

Load testing

We used Locust to conduct load tests for both TFServing and FastAPI. Below are the results for TFServing (gRPC) on various setups, and you can find the results for FastAPI (RESTful) in a separate repo. For specific instructions on how to install Locust and run a load test, follow this separate document.

Hypothesis

  • This is a follow-up project to the ONNX-optimized FastAPI deployment, so we wanted to know how the CPU-optimized TensorFlow runtime compares to the ONNX-based one.
  • TFServing's objective is to maximize throughput while keeping tail latency below certain bounds. We wanted to see whether this holds, how reliably it provides good throughput, and how much throughput is sacrificed to keep that reliability.
  • According to TFServing's official documentation, TFServing achieves the best performance when it is deployed on fewer, larger (in terms of CPU and RAM) machines. We wanted to estimate how large a machine and how many nodes are enough. For this, we prepared a set of setups that combine different numbers of nodes, numbers of CPU cores, and RAM capacities.
  • TFServing has a number of configurable options for tuning performance. In particular, we wanted to find out how different values of the --tensorflow_inter_op_parallelism, --tensorflow_intra_op_parallelism, and --enable_batching options give different results.
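To make the tuning options above concrete, here is a minimal sketch of how a `tensorflow_model_server` command line could be assembled for one setup. The flags are TFServing's own, but the function, the `/models/<model_name>` base path, and the example batching file path are illustrative assumptions rather than the repository's actual Deployment arguments.

```python
def model_server_args(model_name, inter_op, intra_op,
                      batching_parameters_file=None):
    """Assemble tensorflow_model_server arguments for one experiment setup.

    The /models/<model_name> base path is an assumption for illustration.
    """
    args = [
        "tensorflow_model_server",
        f"--model_name={model_name}",
        f"--model_base_path=/models/{model_name}",
        # Threads used to run independent ops concurrently.
        f"--tensorflow_inter_op_parallelism={inter_op}",
        # Threads used to parallelize the execution of a single op.
        f"--tensorflow_intra_op_parallelism={intra_op}",
    ]
    if batching_parameters_file is not None:
        # Server-side batching reads its settings from a separate text file.
        args += [
            "--enable_batching=true",
            f"--batching_parameters_file={batching_parameters_file}",
        ]
    return args
```

For example, `model_server_args("resnet", inter_op=4, intra_op=8)` describes a non-batching setup on an 8-core machine, while passing a path such as `/etc/tfs-config/batching_parameters.txt` (hypothetical) adds the two batching flags.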

Conclusion

From the results above,

  • TFServing focuses more on reliability than on performance (in terms of throughput). In all cases, no failures were observed, and the response time is consistent.
  • Req/s is lower than with the ONNX-optimized FastAPI deployment, so TFServing sacrifices some performance to achieve reliability. However, note that TFServing comes with many built-in features that are required in most ML serving scenarios, such as multi-model serving, dynamic batching, and model versioning. Those features possibly make TFServing heavier than a simple FastAPI server.
    • NOTE: We spawned requests every second to clearly see how TFServing behaves with an increasing number of clients. So you can assume that the Req/s doesn't reflect the real-world situation where clients send requests at arbitrary times.
  • 8 vCPUs + 16 GB RAM seems to be a large enough machine. At least, a bigger RAM size doesn't help much. We might achieve better performance with more than 8 CPU cores, but going beyond 8 cores is somewhat costly.
  • In all cases, the optimal value of --tensorflow_inter_op_parallelism seems to be 4. The value of --tensorflow_intra_op_parallelism is fixed to the number of CPU cores since it specifies the number of threads used to parallelize the execution of an individual op.
  • --enable_batching could give you better performance. However, since TFServing then doesn't respond to each request immediately, there is a trade-off.
  • Considering the cost trade-off, our recommendation from the experiment is the 2n-8c-16r-interop4 configuration (2 nodes, each with 8 vCPUs and 16 GB RAM), that is, 2 replicas of TFServing with --tensorflow_inter_op_parallelism=4, unless you care about dynamic batching capabilities. Alternatively, you can write a similar setup, also for smaller machines, by referencing 2n-8c-16r-interop2-batch.
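Setup names such as 2n-8c-16r-interop4 pack several parameters into one string. A small parser like the following (a hypothetical helper, not part of the repository) can unpack them when comparing results; it assumes the name layout seen in the two examples above, with the RAM figure in GB.

```python
import re

# Layout assumed from the names in the text: <nodes>n-<cores>c-<ram>r,
# optionally followed by -interop<k> and -batch.
_SETUP_PATTERN = re.compile(
    r"^(?P<nodes>\d+)n-(?P<cpus>\d+)c-(?P<ram_gb>\d+)r"
    r"(?:-interop(?P<interop>\d+))?(?P<batch>-batch)?$"
)


def parse_setup(name):
    """Split a setup name into its numeric parts and a batching flag."""
    match = _SETUP_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized setup name: {name!r}")
    return {
        "nodes": int(match["nodes"]),
        "cpus": int(match["cpus"]),
        "ram_gb": int(match["ram_gb"]),
        # None when the name carries no interop suffix.
        "interop": int(match["interop"]) if match["interop"] else None,
        "batch": match["batch"] is not None,
    }
```

For instance, `parse_setup("2n-8c-16r-interop2-batch")` yields 2 nodes, 8 cores, 16 GB of RAM, an inter-op value of 2, and batching enabled.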

👋 NOTE

  • Locust doesn't have built-in support for writing a gRPC-based client, so we wrote one ourselves. If you are curious about the implementation, check out this locustfile.py.
  • The plot is generated by matplotlib from the CSV files that Locust produces.
  • In the plot legend, n means the number of nodes (pods), c means the number of CPU cores, r means the RAM capacity, interop means the value of --tensorflow_inter_op_parallelism, and batch means that batching is enabled for that configuration.
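The core of a hand-written gRPC client for Locust is a wrapper that times each call and reports it through Locust's request event. The sketch below shows that idea only and is not the repository's locustfile.py; `timed_grpc_call` is a hypothetical helper, and inside a Locust `User` the `fire` argument would be `self.environment.events.request.fire`.

```python
import time


def timed_grpc_call(fire, name, func, *args, **kwargs):
    """Call func, time it, and report the outcome through a Locust-style event.

    `fire` is any callable that accepts the keyword arguments of Locust's
    request event (request_type, name, response_time, response_length,
    exception, context).
    """
    start = time.perf_counter()
    response, exception = None, None
    try:
        response = func(*args, **kwargs)
    except Exception as exc:
        # Report the failure to Locust instead of crashing the simulated user.
        exception = exc
    fire(
        request_type="grpc",
        name=name,
        response_time=(time.perf_counter() - start) * 1000,  # milliseconds
        response_length=0,
        exception=exception,
        context={},
    )
    return response
```

Because the event callable is passed in, the wrapper can be exercised with a plain function that records its keyword arguments before it is wired into an actual load test.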

Future works

  • More load test comparisons with more ML inference frameworks such as NVIDIA's Triton Inference Server, KServe, and RedisAI.

  • Advancing this repo by providing semi-automatic model deployment. To be more specific, when new code implementing a new ML model is submitted as a pull request, maintainers could trigger a model performance evaluation on GCP's Vertex Training via comments. The experiment results could be exposed through TensorBoard.dev or W&B. If it is approved, the code will be merged, the trained model will be released, and it will be deployed on GKE.

Acknowledgements
