  • Stars: 661
  • Rank: 65,589 (Top 2%)
  • Language: Python
  • License: Apache License 2.0
  • Created over 1 year ago
  • Updated 20 days ago

PyTriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. The library allows serving Machine Learning models directly from Python through NVIDIA's Triton Inference Server.

Documentation

The documentation explains how to customize the Triton Inference Server, load models, and deploy on clusters, and it includes the API reference. The sections below give a brief overview of the product and a quick start guide.

How it works

In PyTriton, as in Flask or FastAPI, you define a Python function that executes a Machine Learning model prediction and expose it through an HTTP/gRPC API. PyTriton installs Triton Inference Server in your environment and uses it to handle HTTP/gRPC requests and responses. The library provides a Python API for attaching a Python function to Triton, plus a communication layer that sends and receives data between Triton and the function. This lets you use the performance features of Triton Inference Server, such as dynamic batching or response cache, without changing your model environment, which can improve the performance of GPU inference for models implemented in Python. The solution is framework-agnostic and can be used with frameworks like PyTorch, TensorFlow, or JAX.
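To make the idea of a response cache concrete, here is a minimal conceptual sketch in plain Python. It is not how Triton implements its cache; `CachedModel` and `slow_model` are hypothetical names used only for illustration, and the assumption is simply that identical requests to the same model can reuse an earlier response.

```python
import hashlib
import json


class CachedModel:
    """Toy response cache: identical requests skip the model call."""

    def __init__(self, model_name, infer_fn):
        self.model_name = model_name
        self.infer_fn = infer_fn
        self.cache = {}
        self.model_calls = 0  # counts how often the model really ran

    def _key(self, inputs):
        # Hash the model name together with the request inputs.
        payload = json.dumps([self.model_name, inputs], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def infer(self, inputs):
        key = self._key(inputs)
        if key not in self.cache:
            self.model_calls += 1
            self.cache[key] = self.infer_fn(inputs)
        return self.cache[key]


def slow_model(inputs):
    # Stand-in for an expensive model: sums each input row.
    return [sum(row) for row in inputs]


cached = CachedModel("Linear", slow_model)
first = cached.infer([[1.0, 2.0]])
second = cached.infer([[1.0, 2.0]])  # same input, served from the cache
third = cached.infer([[3.0, 4.0]])   # new input, the model is called again
```

In this sketch the model runs twice for three requests, which is the kind of saving a response cache aims for when queries repeat.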

Installation

Before installing the library, ensure that you meet the following requirements:

  • An operating system with glibc >= 2.31. Triton Inference Server and PyTriton have only been rigorously tested on Ubuntu 20.04. Other supported operating systems include Ubuntu 20.04+, Debian 11+, Rocky Linux 9+, and Red Hat Universal Base Image 9+.
  • Python version >= 3.8. If you are using Python 3.9+, see the section "Installation on Python 3.9+" for additional steps.
  • pip >= 20.3
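The requirements above can be checked from Python before installing. The following is a rough sketch using only the standard library; `parse_version` and `check_requirements` are hypothetical helper names, and the glibc detection relies on `platform.libc_ver()`, which may report nothing on non-glibc systems.

```python
import platform
import sys
from importlib import metadata


def parse_version(text):
    """Turn '2.31' or '20.3.4' into a tuple of ints; stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements():
    """Return a dict mapping each requirement to True, False, or None (unknown)."""
    results = {"python>=3.8": sys.version_info >= (3, 8)}

    libc_name, libc_version = platform.libc_ver()
    if libc_name == "glibc" and libc_version:
        results["glibc>=2.31"] = parse_version(libc_version) >= (2, 31)
    else:
        results["glibc>=2.31"] = None  # not detected (e.g. non-glibc system)

    try:
        results["pip>=20.3"] = parse_version(metadata.version("pip")) >= (20, 3)
    except metadata.PackageNotFoundError:
        results["pip>=20.3"] = None  # pip is not installed in this environment

    return results


if __name__ == "__main__":
    for requirement, ok in check_requirements().items():
        print(requirement, "->", ok)
```

A value of `None` means the check could not be performed, not that the requirement failed.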

We assume that you are comfortable with the Python programming language and familiar with Machine Learning models. Using Docker is an option, but not mandatory.

The library can be installed in:

  • system environment
  • virtualenv
  • Docker image

NVIDIA optimized Docker images for Python frameworks can be obtained from the NVIDIA NGC Catalog.

If you want to use the Docker runtime, we recommend that you install NVIDIA Container Toolkit to enable running model inference on NVIDIA GPU.

Installing using pip

You can install the package from pypi.org by running the following command:

pip install -U nvidia-pytriton

Important: The Triton Inference Server binary is installed as part of the PyTriton package.

Building binaries from source

The binary package can be built from source, which gives access to unreleased hotfixes, lets you modify the PyTriton code, and provides compatibility with various Triton Inference Server versions, including custom server builds. For further information on building the PyTriton binary, refer to the Building page of the documentation.

Quick Start

The quick start shows how to run a Python model in Triton Inference Server without changing the current working environment. The example uses a simple Linear PyTorch model.

The example requires PyTorch to be installed in your environment. You can install it by running:

pip install torch

Integrating the model requires the following elements:

  • The model - a framework or Python model or function that handles inference requests
  • Inference callback - a lambda or function that handles the input data coming from Triton and returns the result
  • Python function connection with Triton Inference Server - a binding for communication between Triton and the Python callback

In the first step, define the Linear model:

import torch

model = torch.nn.Linear(2, 3).to("cuda").eval()

In the second step, create an inference callback as a function. The function receives the HTTP/gRPC request data as an argument in the form of a numpy array. The expected return object is also a numpy array.

Example implementation:

import numpy as np
from pytriton.decorators import batch

@batch
def infer_fn(**inputs: np.ndarray):
    (input1_batch,) = inputs.values()
    input1_batch_tensor = torch.from_numpy(input1_batch).to("cuda")
    output1_batch_tensor = model(input1_batch_tensor)  # Calling the Python model inference
    output1_batch = output1_batch_tensor.cpu().detach().numpy()
    return [output1_batch]

In the next step, create the connection between the model and Triton Inference Server using the bind method:

from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Connecting inference callback with Triton Inference Server
with Triton() as triton:
    # Load model into Triton Inference Server
    triton.bind(
        model_name="Linear",
        infer_func=infer_fn,
        inputs=[
            Tensor(dtype=np.float32, shape=(-1,)),
        ],
        outputs=[
            Tensor(dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128)
    )
    ...

Finally, serve the model with the Triton Inference Server:

from pytriton.triton import Triton

with Triton() as triton:
    ...  # Load models here
    triton.serve()

The bind method creates a connection between the Triton Inference Server and the infer_fn, which handles the inference queries. The inputs and outputs describe the model inputs and outputs that are exposed in Triton. The config argument accepts additional parameters for model deployment.

The serve method is blocking: once it is called, the application waits for incoming HTTP/gRPC requests. From that moment, the model is available under the name Linear in the Triton server. Inference queries can be sent to localhost:8000/v2/models/Linear/infer and are passed to the infer_fn function.

To run Triton in background mode, use the run method instead of serve. More details can be found on the Deploying Models page.
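As a rough sketch of the background mode, the helper below binds a callback and starts the server with run instead of serve. It assumes that `Triton.run()` returns after the server has started and that `Triton.stop()` shuts it down, as described on the Deploying Models page; `run_in_background` is a hypothetical helper name, and the bind arguments simply mirror the quick start above.

```python
def run_in_background(infer_fn):
    """Bind `infer_fn` as the "Linear" model and start Triton without blocking."""
    # Imported here so this helper can be defined even where the packages are missing.
    import numpy as np
    from pytriton.model_config import ModelConfig, Tensor
    from pytriton.triton import Triton

    triton = Triton()
    triton.bind(
        model_name="Linear",
        infer_func=infer_fn,
        inputs=[Tensor(dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=128),
    )
    triton.run()  # assumed to return once the server is up, unlike serve()
    return triton  # the caller is responsible for calling triton.stop()
```

A caller would keep the returned object, do other work in the same process (for example, send test queries), and call `stop()` on it when finished.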

Once the serve or run method is called on the Triton object, the server status can be obtained using:

curl -v localhost:8000/v2/health/live

The model is loaded right after the server starts, and its status can be queried using:

curl -v localhost:8000/v2/models/Linear/ready
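The same checks can be scripted, for example to wait for the model before sending traffic. Below is a minimal sketch using only the standard library; `http_status` and `wait_until_ready` are hypothetical helpers, and the sketch assumes that an HTTP 200 response from the endpoint means "ready".

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, and similar failures


def wait_until_ready(url, attempts=30, delay=1.0, get_status=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; return True on success, False on giving up."""
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause between attempts, but not after the last one
    return False
```

For the quick start model, the call would look like `wait_until_ready("http://localhost:8000/v2/models/Linear/ready")`. The `get_status` and `sleep` parameters exist only so the polling logic can be exercised without a running server.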

Finally, you can send an inference query to the model:

curl -X POST \
  -H "Content-Type: application/json"  \
  -d @input.json \
  localhost:8000/v2/models/Linear/infer

A sample query in input.json:

{
  "id": "0",
  "inputs": [
    {
      "name": "INPUT_1",
      "shape": [1, 2],
      "datatype": "FP32",
      "parameters": {},
      "data": [[-0.04281254857778549, 0.6738349795341492]]
    }
  ]
}

Read more about the HTTP/gRPC interface in the Triton Inference Server documentation.
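The curl request above can also be issued from Python without extra dependencies. The following sketch builds a body with the same fields as input.json and posts it with `urllib`; `build_infer_request` and `post_infer` are hypothetical helpers, and the input name `INPUT_1` and the `FP32` datatype are taken from the sample query rather than discovered from the server.

```python
import json
import urllib.request


def build_infer_request(rows, input_name="INPUT_1", request_id="0"):
    """Build a request body shaped like input.json from a list of float rows."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")
    return {
        "id": request_id,
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), width],  # [batch size, features]
                "datatype": "FP32",
                "parameters": {},
                "data": rows,
            }
        ],
    }


def post_infer(body, url="http://localhost:8000/v2/models/Linear/infer"):
    """POST the request body as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = build_infer_request([[-0.04281254857778549, 0.6738349795341492]])
```

With the server from the quick start running, `post_infer(body)` sends the query; building the body separately keeps the shape bookkeeping in one place when the batch size changes.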

You can also validate the deployed model using a simple client that can perform inference requests:

import torch
from pytriton.client import ModelClient

input1_data = torch.randn(128, 2).cpu().detach().numpy()

with ModelClient("localhost:8000", "Linear") as client:
    result_dict = client.infer_batch(input1_data)

print(result_dict)

The full example code can be found in examples/linear_random_pytorch.

You can learn more about client usage in the Clients document.

More information about running the server and models can be found on the Deploying Models page of the documentation.

Architecture

The diagram below presents the schema of how the Python models are served through Triton Inference Server using PyTriton. The solution consists of two main components:

  • Triton Inference Server: for exposing the HTTP/gRPC API and benefiting from performance features like dynamic batching or response cache.
  • Python Model Environment: your environment where the Python model is executed.

The Triton Inference Server binaries are provided as part of the PyTriton installation. The Triton Server is installed in your current environment (system or container). PyTriton controls the Triton Server process through the Triton Controller.

Exposing the model through PyTriton requires the definition of an Inference Callable - a Python function that is connected to Triton Inference Server and executes the model or ensemble for predictions. The integration layer binds the Inference Callable to Triton Server and exposes it through the Triton HTTP/gRPC API under a provided <model name>. Once the integration is done, the defined Inference Callable receives data sent to the HTTP/gRPC API endpoint v2/models/<model name>/infer. Read more about the HTTP/gRPC interface in the Triton Inference Server documentation.

The HTTP/gRPC requests sent to v2/models/<model name>/infer are handled by Triton Inference Server. The server batches requests and passes them to the Proxy Backend, which sends the batched requests to the appropriate Inference Callable. The data is sent as a numpy array. Once the Inference Callable finishes execution of the model prediction, the result is returned to the Proxy Backend, and a response is created by Triton Server.
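The request flow described above (batch the requests, call the Inference Callable once, split the result into per-request responses) can be illustrated with a small plain-Python sketch. This is a conceptual model only, not Triton's scheduler or the Proxy Backend: real dynamic batching also depends on timing and queue settings, and `serve_batched` and `infer_callable` are hypothetical names.

```python
def infer_callable(batch):
    """Stand-in for an Inference Callable: doubles every value in the batch."""
    return [[2 * value for value in row] for row in batch]


def serve_batched(requests, callable_fn, max_batch_size=128):
    """Merge requests into batches, call the model once per batch, split results."""
    responses = []
    pending, sizes = [], []

    def flush():
        if not pending:
            return
        outputs = callable_fn(list(pending))  # one call for the whole batch
        start = 0
        for size in sizes:  # hand each request its own slice of rows
            responses.append(outputs[start:start + size])
            start += size
        pending.clear()
        sizes.clear()

    for rows in requests:
        # Start a new batch if adding this request would exceed the limit.
        if pending and len(pending) + len(rows) > max_batch_size:
            flush()
        pending.extend(rows)
        sizes.append(len(rows))
    flush()
    return responses
```

Each element of `requests` is a list of rows from one client, and the returned list holds one response per request in the same order, so clients never see that their rows shared a batch with others.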

[Figure: High Level Design]

Examples

The examples page presents various cases of serving models using PyTriton. You can find simple examples of running PyTorch, TensorFlow2, JAX, and simple Python models. Additionally, we have prepared more advanced scenarios like online learning, multi-node models, or deployment on Kubernetes using PyTriton. Each example contains instructions describing how to build and run the example. Learn more about how to use PyTriton by reviewing our examples.

Profiling model

The Perf Analyzer can be used to profile models served through PyTriton. We have prepared an example of using the Perf Analyzer to profile the BART PyTorch model. The example code can be found in examples/perf_analyzer.

Version management

PyTriton follows the Semantic Versioning scheme for versioning. Official releases can be found on PyPI and GitHub releases. The most up-to-date development version is available on the main branch, which may include hotfixes that have not yet been released through the standard channels. To install the latest development version, refer to the instructions in the building binaries from source section.
