Various ways of serving Stable Diffusion
This repository shows various ways to deploy Stable Diffusion. Currently, we are interested in the Stable Diffusion implementation from keras-cv, and the target platforms/frameworks we aim for include TF Serving, Hugging Face Endpoint, and FastAPI.
As of the version 0.4.0 release of keras-cv, StableDiffusionV2 is included, and this repository supports both version 1 and version 2 of Stable Diffusion.
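As a minimal sketch of how the two versions might be selected with keras-cv, the helper below picks between `keras_cv.models.StableDiffusion` and `keras_cv.models.StableDiffusionV2`. The helper name `build_stable_diffusion` and its defaults are assumptions made for illustration, not code from this repository.

```python
def build_stable_diffusion(version: int = 1, img_height: int = 512, img_width: int = 512):
    """Return a keras-cv Stable Diffusion pipeline for the requested version.

    Hypothetical helper: only versions 1 and 2 are accepted.
    """
    if version not in (1, 2):
        raise ValueError(f"Unsupported Stable Diffusion version: {version}")

    # Imported lazily so this module can be loaded where keras-cv is not installed.
    import keras_cv

    model_cls = (
        keras_cv.models.StableDiffusion
        if version == 1
        else keras_cv.models.StableDiffusionV2
    )
    return model_cls(img_height=img_height, img_width=img_width)
```

With keras-cv 0.4.0 or later installed, `build_stable_diffusion(version=2).text_to_image("a prompt", batch_size=1)` should return a batch of generated images.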
1. All in One Endpoint
This method shows how to deploy Stable Diffusion as a whole in a single endpoint. Stable Diffusion consists of three models (encoder, diffusion model, decoder) and some glue code to handle the inputs and outputs of each model. In this scenario, everything is packaged into a single endpoint.
- Hugging Face 🤗 Endpoint: To deploy something on Hugging Face Endpoint, we need to create a custom handler. Hugging Face Endpoint lets us easily deploy any machine learning model with pre/post-processing logic in a custom handler [Colab | Standalone Codebase]
- FastAPI Endpoint: [Colab | Standalone]
  - Docker Image: gcr.io/gcp-ml-172005/sd-fastapi-allinone:latest
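To give a rough idea of what a custom handler looks like, here is a minimal sketch following the Hugging Face Inference Endpoints convention of a `handler.py` file that defines an `EndpointHandler` class with `__init__(path)` and `__call__(data)`. The `batch_size` request key, the response fields, and the injectable `generate` argument are assumptions made for illustration; the handler in this repository may differ.

```python
import base64
from typing import Any, Callable, Dict, Optional


class EndpointHandler:
    """All-in-one handler sketch: prompt in, base64-encoded image bytes out."""

    def __init__(self, path: str = "", generate: Optional[Callable[[str, int], Any]] = None):
        if generate is None:
            # Imported lazily so the module loads without keras-cv installed.
            import keras_cv

            model = keras_cv.models.StableDiffusion(img_height=512, img_width=512)

            def generate(prompt: str, batch_size: int) -> Any:
                return model.text_to_image(prompt, batch_size=batch_size)

        self.generate = generate

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        prompt = data["inputs"]
        # "batch_size" is a hypothetical request field.
        batch_size = int(data.get("batch_size", 1))
        images = self.generate(prompt, batch_size)  # array-like with .shape and .tobytes()
        return {
            "shape": list(images.shape),
            "images_b64": base64.b64encode(images.tobytes()).decode("ascii"),
        }
```

Passing `generate` explicitly makes it possible to exercise the request/response handling without loading the model.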
2. Three Endpoints
This method shows how to deploy Stable Diffusion in three separate endpoints. As preliminary work, this notebook was written to demonstrate how to split the three parts of Stable Diffusion into three separate modules. In this example, you will see how to interact with three different endpoints to generate images from a given text prompt.
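The interaction with three endpoints can be sketched as a simple chain in which each response becomes the next request. The URLs, the `{"inputs": ...}` payload shape, and the function names below are assumptions for illustration; the actual request and response formats depend on how each endpoint is implemented.

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def post_json(url: str, payload: Dict[str, Any], token: str = "") -> Any:
    """POST a JSON payload and return the decoded JSON response."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def generate_image(prompt: str, urls: Dict[str, str], post: Callable[..., Any] = post_json) -> Any:
    """Call text encoder, diffusion model, and decoder endpoints in order."""
    encoded = post(urls["text_encoder"], {"inputs": prompt})
    latent = post(urls["diffusion_model"], {"inputs": encoded})
    return post(urls["decoder"], {"inputs": latent})
```

The `post` argument is injectable so the chain can be tried out with a stub before pointing it at real endpoints.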
- Hugging Face Endpoint: [Colab | Text Encoder | Diffusion Model | Decoder]
- FastAPI Endpoint: [Central | Text Encoder | Diffusion Model | Decoder]
  - Docker Image (text-encoder): gcr.io/gcp-ml-172005/sd-fastapi-text-encoder:latest
  - Docker Image (diffusion-model): gcr.io/gcp-ml-172005/sd-fastapi-diffusion-model:latest
  - Docker Image (decoder): gcr.io/gcp-ml-172005/sd-fastapi-decoder:latest
- TF Serving Endpoint: [Colab | Dockerfiles + k8s Resources]
  - SavedModel: [Colab | Text Encoder | Diffusion Model | Decoder]
    - We wrap the encoder, diffusion model, and decoder, along with some glue code, in separate SavedModels. With them, we can not only deploy each model on the cloud with TF Serving but also embed them in web and mobile applications with TFJS and TFLite. We will explore the embedded use cases in a later phase of this project.
  - Docker Images
    - text-encoder: gcr.io/gcp-ml-172005/tfs-sd-text-encoder:latest
    - text-encoder w/ base64: gcr.io/gcp-ml-172005/tfs-sd-text-encoder-base64:latest
    - text-encoder-v2: gcr.io/gcp-ml-172005/tfs-sd-text-encoder-v2:latest
    - text-encoder-v2 w/ base64: gcr.io/gcp-ml-172005/tfs-sd-text-encoder-v2-base64:latest
    - diffusion-model: gcr.io/gcp-ml-172005/tfs-sd-diffusion-model:latest
    - diffusion-model w/ base64: gcr.io/gcp-ml-172005/tfs-sd-diffusion-model-base64:latest
    - diffusion-model-v2: gcr.io/gcp-ml-172005/tfs-sd-diffusion-model-v2:latest
    - diffusion-model-v2 w/ base64: gcr.io/gcp-ml-172005/tfs-sd-diffusion-model-v2-base64:latest
    - decoder: gcr.io/gcp-ml-172005/tfs-sd-decoder:latest
    - decoder w/ base64: gcr.io/gcp-ml-172005/tfs-sd-decoder-base64:latest
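For the TF Serving option, a request to one of the served models might be built as in the sketch below. It relies on TF Serving's REST predict route (`/v1/models/<name>:predict`, default REST port 8501) and the columnar `"inputs"` request format. The model name and the example input are hypothetical; the real tensor names and shapes are defined by each SavedModel's signature.

```python
import json
import urllib.request
from typing import Any


def build_predict_request(
    host: str,
    model_name: str,
    inputs: Any,
    port: int = 8501,
    signature_name: str = "serving_default",
) -> urllib.request.Request:
    """Build (but do not send) a TF Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = {"signature_name": signature_name, "inputs": inputs}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model name and input; send with urllib.request.urlopen(request).
request = build_predict_request("localhost", "text-encoder", [[1, 2, 3]])
```

The JSON response from TF Serving carries the results under an `"outputs"` key when the request uses `"inputs"`.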
NOTE: Passing intermediate values between models over the network can be costly, and some platforms limit the payload size. For instance, Vertex AI limits the request size to 1.5MB. To this end, we provide separate TF Serving Docker images that handle inputs and produce outputs in base64 format.
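To illustrate why a base64 representation helps with payload size, the sketch below compares a JSON list of floats against the same values packed as float32 bytes and base64-encoded. The tensor size (4 * 64 * 64 * 4 values) and the reading of 1.5MB as 1,500,000 bytes are assumptions for illustration, and the helper names are hypothetical; the base64 images in this repository may use a different layout.

```python
import array
import base64
import json
from typing import List, Sequence

# Assumption: the 1.5MB limit is interpreted as 1.5 * 10**6 bytes.
REQUEST_LIMIT_BYTES = 1_500_000


def encode_float32(values: Sequence[float]) -> str:
    """Pack values as float32 (native byte order) and base64-encode them."""
    raw = array.array("f", values).tobytes()
    return base64.b64encode(raw).decode("ascii")


def decode_float32(encoded: str) -> List[float]:
    """Inverse of encode_float32."""
    values = array.array("f")
    values.frombytes(base64.b64decode(encoded))
    return values.tolist()


def fits_in_request(body: str, limit: int = REQUEST_LIMIT_BYTES) -> bool:
    """Check whether a request body stays within the size limit."""
    return len(body.encode("utf-8")) <= limit


# Illustrative tensor with an assumed size of 4 * 64 * 64 * 4 values.
values = [i / 7 for i in range(4 * 64 * 64 * 4)]
as_json_list = json.dumps({"inputs": values})
as_base64 = json.dumps({"inputs": encode_float32(values)})
print(len(as_json_list), len(as_base64))
```

Base64 adds roughly one third on top of the raw bytes, which is still far smaller than spelling out every float as decimal text.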
3. One Endpoint with Two local APIs (w/ 🤗 Endpoint)
With Stable Diffusion separated into parts, we can place each part in any environment. This is especially powerful if we want to deploy specialized diffusion models such as inpainting or fine-tuned diffusion models. In this case, we only need to replace the currently deployed diffusion model, or deploy a new diffusion model alongside it, while keeping the other two (text encoder and decoder) as is.
It is also worth noting that we could run the text encoder and decoder parts locally (Python clients or web/mobile with TF Serving) while keeping the diffusion model on the cloud. In this repository, we currently show an example using Hugging Face 🤗 Endpoint. However, you could easily expand the possibilities.
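One way to picture this arrangement is a small pipeline object that keeps the text encoder and decoder fixed and lets the diffusion step be swapped by name. The class and method names below are hypothetical, and each stage is just a callable, so a stage can be a local function or a wrapper around a remote endpoint call.

```python
from typing import Any, Callable, Dict

Stage = Callable[[Any], Any]


class SplitPipeline:
    """Fixed text encoder and decoder with interchangeable diffusion models."""

    def __init__(self, text_encoder: Stage, decoder: Stage):
        self.text_encoder = text_encoder
        self.decoder = decoder
        self.diffusion_models: Dict[str, Stage] = {}

    def register(self, name: str, diffusion_model: Stage) -> None:
        """Add or replace a diffusion model without touching the other stages."""
        self.diffusion_models[name] = diffusion_model

    def generate(self, prompt: str, model: str = "base") -> Any:
        if model not in self.diffusion_models:
            raise KeyError(f"No diffusion model registered as {model!r}")
        context = self.text_encoder(prompt)             # e.g. runs locally
        latent = self.diffusion_models[model](context)  # e.g. calls a cloud endpoint
        return self.decoder(latent)                     # e.g. runs locally
```

Registering a fine-tuned or inpainting model under a new name then only changes which diffusion callable is selected at request time.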
NOTE: Along with this project, we have developed one more project for fine-tuning Keras-based Stable Diffusion: Fine-tuning Stable Diffusion using Keras. We currently provide a model fine-tuned on a Pokemon dataset.
4. On-Device Deployment (w/ TFLite) - WIP
We have managed to convert the SavedModels into TFLite models, and we are hosting them as below (thanks to @farmaker47):
- Text Encoder TFLite Model - 127MB
- Diffusion Model TFLite Model - 864MB
- Decoder TFLite Model - 99MB
These TFLite models have the same signatures as the SavedModels, and all the pre/post-processing operations are included inside. All of them were converted with float16 quantization. You can find more about how to convert SavedModels to TFLite models in this repository.
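For reference, float16 post-training quantization with the TensorFlow Lite converter generally looks like the sketch below. The function name and paths are hypothetical, and the conversion used for the hosted models may need additional converter settings that are not shown here.

```python
def convert_to_float16_tflite(saved_model_dir: str, output_path: str) -> int:
    """Convert a SavedModel to a float16-quantized TFLite file.

    Returns the size of the written model in bytes.
    """
    # Imported lazily so this module loads where TensorFlow is not installed.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Enable default optimizations and restrict weights to float16.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]

    tflite_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_model)
    return len(tflite_model)
```

Float16 quantization stores weights in half precision, which roughly halves the file size compared with float32 weights.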
TODO
- Implement SimpleTokenizer in Java and JavaScript
- Run TFLite models on Android and in a web browser
Timing Tests
Sequential
The figure below shows how long each scenario took from text encoding to diffusion to decoding. It assumes each request (batch_size=4) is handled sequentially with a single server running on Hugging Face Endpoint for each endpoint. The all-in-one endpoint deployed Stable Diffusion on an A10-equipped server, while the separate endpoints deployed the text encoder on 2 vCPU + 4GB RAM, the diffusion model on an A10-equipped server, and the decoder on a T4-equipped server. Finally, the one-endpoint, two-local scenario deployed only the diffusion model on an A10-equipped server while keeping the other two in a Colab environment (w/ T4). Please see this notebook for how these were measured.
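A sequential measurement of this kind could be approximated with a small timing helper like the one below. This is a sketch under assumed stage names, not the measurement code from the notebook; each stage is any callable, such as a function that calls the corresponding endpoint.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def time_stages(
    stages: List[Tuple[str, Callable[[Any], Any]]],
    payload: Any,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, Dict[str, float]]:
    """Run stages one after another and record the elapsed seconds of each."""
    timings: Dict[str, float] = {}
    value = payload
    for name, stage in stages:
        start = clock()
        value = stage(value)
        timings[name] = clock() - start
    # Sum of the per-stage durations recorded above.
    timings["total"] = sum(timings.values())
    return value, timings
```

Repeating the call several times and discarding the first run helps separate warm-up cost from steady-state latency.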
💨 XLA support
In this notebook, we show how we can XLA-compile the SavedModels to achieve a speed-up of about 52% over the non-XLA variant.
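In TensorFlow, XLA compilation of a function is typically requested with `tf.function(..., jit_compile=True)`. The helper below is a minimal sketch of that mechanism only; the helper name is hypothetical, and how the notebook applies it to the SavedModels is not reproduced here.

```python
from typing import Any, Callable


def xla_compile(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a TensorFlow callable so it is compiled with XLA when called."""
    # Imported lazily so this module loads where TensorFlow is not installed.
    import tensorflow as tf

    return tf.function(fn, jit_compile=True)
```

The first call with a new input shape triggers compilation, so it is worth timing that call separately from subsequent ones when comparing against a non-XLA variant.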
Acknowledgements
Thanks to the ML Developer Programs' team at Google for providing GCP credits.