Simple TensorFlow Serving
Introduction
Simple TensorFlow Serving is the generic and easy-to-use serving service for machine learning models. Read more in https://stfs.readthedocs.io.
- Support distributed TensorFlow models
- Support the general RESTful/HTTP APIs
- Support inference with accelerated GPU
- Support
curl
and other command-line tools - Support clients in any programming language
- Support code-gen client by models without coding
- Support inference with raw file for image models
- Support statistical metrics for verbose requests
- Support serving multiple models at the same time
- Support dynamic online and offline for model versions
- Support loading new custom op for TensorFlow models
- Support secure authentication with configurable basic auth
- Support multiple models of TensorFlow/MXNet/PyTorch/Caffe2/CNTK/ONNX/H2o/Scikit-learn/XGBoost/PMML/Spark MLlib
Installation
Install the server with pip.
pip install simple_tensorflow_serving
Or install from source code.
python ./setup.py install
python ./setup.py develop
bazel build simple_tensorflow_serving:server
Or use the docker image.
docker run -d -p 8500:8500 tobegit3hub/simple_tensorflow_serving
docker run -d -p 8500:8500 tobegit3hub/simple_tensorflow_serving:latest-gpu
docker run -d -p 8500:8500 tobegit3hub/simple_tensorflow_serving:latest-hdfs
docker run -d -p 8500:8500 tobegit3hub/simple_tensorflow_serving:latest-py34
docker-compose up -d
Or deploy in Kubernetes.
kubectl create -f ./simple_tensorflow_serving.yaml
Quick Start
Start the server with the TensorFlow SavedModel.
simple_tensorflow_serving --model_base_path="./models/tensorflow_template_application_model"
Check out the dashboard in http://127.0.0.1:8500 in web browser.
Generate Python client and access the model with test data without coding.
curl http://localhost:8500/v1/models/default/gen_client?language=python > client.py
python ./client.py
Advanced Usage
Multiple Models
It supports serve multiple models and multiple versions of these models. You can run the server with this configuration.
{
"model_config_list": [
{
"name": "tensorflow_template_application_model",
"base_path": "./models/tensorflow_template_application_model/",
"platform": "tensorflow"
}, {
"name": "deep_image_model",
"base_path": "./models/deep_image_model/",
"platform": "tensorflow"
}, {
"name": "mxnet_mlp_model",
"base_path": "./models/mxnet_mlp/mx_mlp",
"platform": "mxnet"
}
]
}
simple_tensorflow_serving --model_config_file="./examples/model_config_file.json"
Adding or removing model versions will be detected automatically and re-load latest files in memory. You can easily choose the specified model and version for inference.
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"keys": [[11.0], [2.0]],
"features": [[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]]
}
}
result = requests.post(endpoint, json=input_data)
GPU Acceleration
If you want to use GPU, try with the docker image with GPU tag and put cuda files in /usr/cuda_files/
.
export CUDA_SO="-v /usr/cuda_files/:/usr/cuda_files/"
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
export LIBRARY_ENV="-e LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/cuda_files"
docker run -it -p 8500:8500 $CUDA_SO $DEVICES $LIBRARY_ENV tobegit3hub/simple_tensorflow_serving:latest-gpu
You can set session config and gpu options in command-line parameter or the model config file.
simple_tensorflow_serving --model_base_path="./models/tensorflow_template_application_model" --session_config='{"log_device_placement": true, "allow_soft_placement": true, "allow_growth": true, "per_process_gpu_memory_fraction": 0.5}'
{
"model_config_list": [
{
"name": "default",
"base_path": "./models/tensorflow_template_application_model/",
"platform": "tensorflow",
"session_config": {
"log_device_placement": true,
"allow_soft_placement": true,
"allow_growth": true,
"per_process_gpu_memory_fraction": 0.5
}
}
]
}
Generated Client
You can generate the test json data for the online models.
curl http://localhost:8500/v1/models/default/gen_json
Or generate clients in different languages(Bash, Python, Golang, JavaScript etc.) for your model without writing any code.
curl http://localhost:8500/v1/models/default/gen_client?language=python > client.py
curl http://localhost:8500/v1/models/default/gen_client?language=bash > client.sh
curl http://localhost:8500/v1/models/default/gen_client?language=golang > client.go
curl http://localhost:8500/v1/models/default/gen_client?language=javascript > client.js
The generated code should look like these which can be test immediately.
#!/usr/bin/env python
import requests
def main():
endpoint = "http://127.0.0.1:8500"
json_data = {"model_name": "default", "data": {"keys": [[1], [1]], "features": [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]} }
result = requests.post(endpoint, json=json_data)
print(result.text)
if __name__ == "__main__":
main()
#!/usr/bin/env python
import requests
def main():
endpoint = "http://127.0.0.1:8500"
input_data = {"keys": [[1.0], [1.0]], "features": [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]}
result = requests.post(endpoint, json=input_data)
print(result.text)
if __name__ == "__main__":
main()
Image Model
For image models, we can request with the raw image files instead of constructing array data.
Now start serving the image model like deep_image_model.
simple_tensorflow_serving --model_base_path="./models/deep_image_model/"
Then request with the raw image file which has the same shape of your model.
curl -X POST -F 'image=@./images/mew.jpg' -F "model_version=1" 127.0.0.1:8500
TensorFlow Estimator Model
If we use the TensorFlow Estimator API to export the model, the model signature should look like this.
inputs {
key: "inputs"
value {
name: "input_example_tensor:0"
dtype: DT_STRING
tensor_shape {
dim {
size: -1
}
}
}
}
outputs {
key: "classes"
value {
name: "linear/binary_logistic_head/_classification_output_alternatives/classes_tensor:0"
dtype: DT_STRING
tensor_shape {
dim {
size: -1
}
dim {
size: -1
}
}
}
}
outputs {
key: "scores"
value {
name: "linear/binary_logistic_head/predictions/probabilities:0"
dtype: DT_FLOAT
tensor_shape {
dim {
size: -1
}
dim {
size: 2
}
}
}
}
method_name: "tensorflow/serving/classify"
We need to construct the string tensor for inference and use base64 to encode the string for HTTP. Here is the example Python code.
def _float_feature(value):
return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def main():
# Raw input data
feature_dict = {"a": _bytes_feature("10"), "b": _float_feature(10)}
# Create Example as base64 string
example_proto = tf.train.Example(features=tf.train.Features(feature=feature_dict))
tensor_proto = tf.contrib.util.make_tensor_proto(example_proto.SerializeToString(), dtype=tf.string)
tensor_string = tensor_proto.string_val.pop()
base64_tensor_string = base64.urlsafe_b64encode(tensor_string)
# Request server
endpoint = "http://127.0.0.1:8500"
json_data = {"model_name": "default", "base64_decode": True, "data": {"inputs": [base64_tensor_string]}}
result = requests.post(endpoint, json=json_data)
print(result.json())
Custom Op
If your models rely on new TensorFlow custom op, you can run the server while loading the so files.
simple_tensorflow_serving --model_base_path="./model/" --custom_op_paths="./foo_op/"
Please check out the complete example in ./examples/custom_op/.
Authentication
For enterprises, we can enable basic auth for all the APIs and any anonymous request is denied.
Now start the server with the configured username and password.
./server.py --model_base_path="./models/tensorflow_template_application_model/" --enable_auth=True --auth_username="admin" --auth_password="admin"
If you are using the Web dashboard, just type your certification. If you are using clients, give the username and password within the request.
curl -u admin:admin -H "Content-Type: application/json" -X POST -d '{"data": {"keys": [[11.0], [2.0]], "features": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}' http://127.0.0.1:8500
endpoint = "http://127.0.0.1:8500"
input_data = {
"data": {
"keys": [[11.0], [2.0]],
"features": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]
}
}
auth = requests.auth.HTTPBasicAuth("admin", "admin")
result = requests.post(endpoint, json=input_data, auth=auth)
TSL/SSL
It supports TSL/SSL and you can generate the self-signed secret files for testing.
openssl req -x509 -newkey rsa:4096 -nodes -out /tmp/secret.pem -keyout /tmp/secret.key -days 365
Then run the server with certification files.
simple_tensorflow_serving --enable_ssl=True --secret_pem=/tmp/secret.pem --secret_key=/tmp/secret.key --model_base_path="./models/tensorflow_template_application_model"
Supported Models
For MXNet models, you can load with commands and configuration like these.
simple_tensorflow_serving --model_base_path="./models/mxnet_mlp/mx_mlp" --model_platform="mxnet"
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"data": [[12.0, 2.0]]
}
}
result = requests.post(endpoint, json=input_data)
print(result.text)
For ONNX models, you can load with commands and configuration like these.
simple_tensorflow_serving --model_base_path="./models/onnx_mnist_model/onnx_model.proto" --model_platform="onnx"
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"data": [[...]]
}
}
result = requests.post(endpoint, json=input_data)
print(result.text)
For H2o models, you can load with commands and configuration like these.
# Start H2o server with "java -jar h2o.jar"
simple_tensorflow_serving --model_base_path="./models/h2o_prostate_model/GLM_model_python_1525255083960_17" --model_platform="h2o"
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"data": [[...]]
}
}
result = requests.post(endpoint, json=input_data)
print(result.text)
For Scikit-learn models, you can load with commands and configuration like these.
simple_tensorflow_serving --model_base_path="./models/scikitlearn_iris/model.joblib" --model_platform="scikitlearn"
simple_tensorflow_serving --model_base_path="./models/scikitlearn_iris/model.pkl" --model_platform="scikitlearn"
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"data": [[...]]
}
}
result = requests.post(endpoint, json=input_data)
print(result.text)
For XGBoost models, you can load with commands and configuration like these.
simple_tensorflow_serving --model_base_path="./models/xgboost_iris/model.bst" --model_platform="xgboost"
simple_tensorflow_serving --model_base_path="./models/xgboost_iris/model.joblib" --model_platform="xgboost"
simple_tensorflow_serving --model_base_path="./models/xgboost_iris/model.pkl" --model_platform="xgboost"
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"data": [[...]]
}
}
result = requests.post(endpoint, json=input_data)
print(result.text)
For PMML models, you can load with commands and configuration like these. This relies on Openscoring and Openscoring-Python to load the models.
java -jar ./third_party/openscoring/openscoring-server-executable-1.4-SNAPSHOT.jar
simple_tensorflow_serving --model_base_path="./models/pmml_iris/DecisionTreeIris.pmml" --model_platform="pmml"
endpoint = "http://127.0.0.1:8500"
input_data = {
"model_name": "default",
"model_version": 1,
"data": {
"data": [[...]]
}
}
result = requests.post(endpoint, json=input_data)
print(result.text)
Supported Client
Here is the example client in Bash.
curl -H "Content-Type: application/json" -X POST -d '{"data": {"keys": [[1.0], [2.0]], "features": [[10, 10, 10, 8, 6, 1, 8, 9, 1], [6, 2, 1, 1, 1, 1, 7, 1, 1]]}}' http://127.0.0.1:8500
Here is the example client in Python.
endpoint = "http://127.0.0.1:8500"
payload = {"data": {"keys": [[11.0], [2.0]], "features": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}
result = requests.post(endpoint, json=payload)
Here is the example client in C++.
Here is the example client in Java.
Here is the example client in Scala.
Here is the example client in Go.
endpoint := "http://127.0.0.1:8500"
dataByte := []byte(`{"data": {"keys": [[11.0], [2.0]], "features": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}`)
var dataInterface map[string]interface{}
json.Unmarshal(dataByte, &dataInterface)
dataJson, _ := json.Marshal(dataInterface)
resp, err := http.Post(endpoint, "application/json", bytes.NewBuffer(dataJson))
Here is the example client in Ruby.
endpoint = "http://127.0.0.1:8500"
uri = URI.parse(endpoint)
header = {"Content-Type" => "application/json"}
input_data = {"data" => {"keys"=> [[11.0], [2.0]], "features"=> [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new(uri.request_uri, header)
request.body = input_data.to_json
response = http.request(request)
Here is the example client in JavaScript.
var options = {
uri: "http://127.0.0.1:8500",
method: "POST",
json: {"data": {"keys": [[11.0], [2.0]], "features": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}
};
request(options, function (error, response, body) {});
Here is the example client in PHP.
$endpoint = "127.0.0.1:8500";
$inputData = array(
"keys" => [[11.0], [2.0]],
"features" => [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]],
);
$jsonData = array(
"data" => $inputData,
);
$ch = curl_init($endpoint);
curl_setopt_array($ch, array(
CURLOPT_POST => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_HTTPHEADER => array(
"Content-Type: application/json"
),
CURLOPT_POSTFIELDS => json_encode($jsonData)
));
$response = curl_exec($ch);
Here is the example client in Erlang.
ssl:start(),
application:start(inets),
httpc:request(post,
{"http://127.0.0.1:8500", [],
"application/json",
"{\"data\": {\"keys\": [[11.0], [2.0]], \"features\": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}"
}, [], []).
Here is the example client in Lua.
local endpoint = "http://127.0.0.1:8500"
keys_array = {}
keys_array[1] = {1.0}
keys_array[2] = {2.0}
features_array = {}
features_array[1] = {1, 1, 1, 1, 1, 1, 1, 1, 1}
features_array[2] = {1, 1, 1, 1, 1, 1, 1, 1, 1}
local input_data = {
["keys"] = keys_array,
["features"] = features_array,
}
local json_data = {
["data"] = input_data
}
request_body = json:encode (json_data)
local response_body = {}
local res, code, response_headers = http.request{
url = endpoint,
method = "POST",
headers =
{
["Content-Type"] = "application/json";
["Content-Length"] = #request_body;
},
source = ltn12.source.string(request_body),
sink = ltn12.sink.table(response_body),
}
Here is the example client in Rust.
Here is the example client in Swift.
Here is the example client in Perl.
my $endpoint = "http://127.0.0.1:8500";
my $json = '{"data": {"keys": [[11.0], [2.0]], "features": [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}}';
my $req = HTTP::Request->new( 'POST', $endpoint );
$req->header( 'Content-Type' => 'application/json' );
$req->content( $json );
$ua = LWP::UserAgent->new;
$response = $ua->request($req);
Here is the example client in Lisp.
Here is the example client in Haskell.
Here is the example client in Clojure.
Here is the example client in R.
endpoint <- "http://127.0.0.1:8500"
body <- list(data = list(a = 1), keys = 1)
json_data <- list(
data = list(
keys = list(list(1.0), list(2.0)), features = list(list(1, 1, 1, 1, 1, 1, 1, 1, 1), list(1, 1, 1, 1, 1, 1, 1, 1, 1))
)
)
r <- POST(endpoint, body = json_data, encode = "json")
stop_for_status(r)
content(r, "parsed", "text/html")
Here is the example with Postman.
Performance
You can run SimpleTensorFlowServing with any WSGI server for better performance. We have benchmarked and compare with TensorFlow Serving
. Find more details in benchmark.
STFS(Simple TensorFlow Serving) and TFS(TensorFlow Serving) have similar performances for different models. Vertical coordinate is inference latency(microsecond) and the less is better.
Then we test with ab
with concurrent clients in CPU and GPU. TensorFlow Serving
works better especially with GPUs.
For simplest model, each request only costs ~1.9 microseconds and one instance of Simple TensorFlow Serving can achieve 5000+ QPS. With larger batch size, it can inference more than 1M instances per second.
How It Works
simple_tensorflow_serving
starts the HTTP server withflask
application.- Load the TensorFlow models with
tf.saved_model.loader
Python API. - Construct the feed_dict data from the JSON body of the request.
// Method: POST, Content-Type: application/json { "model_version": 1, // Optional "data": { "keys": [[1], [2]], "features": [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]] } }
- Use the TensorFlow Python API to
sess.run()
with feed_dict data. - For multiple versions supported, it starts independent thread to load models.
- For generated clients, it reads user's model and render code with Jinja templates.
Contribution
Feel free to open an issue or send pull request for this project. It is warmly welcome to add more clients in your languages to access TensorFlow models.