Getting started with the Deep Learning Accelerator on NVIDIA Jetson Orin

In this tutorial, we'll develop a neural network that utilizes the Deep Learning Accelerator (DLA) on Jetson Orin. In case you're unfamiliar, the DLA is an application-specific integrated circuit on Jetson Xavier and Orin that is capable of running common deep learning inference operations, such as convolutions. This dedicated hardware is power-efficient and allows you to offload work from the GPU, freeing it for other tasks. If you're interested in better utilizing the full capabilities of your Jetson platform, this tutorial is for you.

To demonstrate how to use the DLA, we'll run through the entire modeling process: from designing and training an architecture in PyTorch, to calibrating the model for INT8 precision, to profiling and validating the model. We hope this will give you a better understanding of how to use the DLA, as well as its capabilities and limitations. Understanding this information early on is important, as it will help you make design decisions to ensure your model can be easily accelerated by the DLA, enabling you to further unlock the capabilities of Jetson Orin.

Let’s get started!

Tip: You can easily recreate the models covered in this tutorial yourself! Check out QUICKSTART.md for instructions on how to use the code contained in this repository.

Overview

Step 1 - Define a model in PyTorch

First, let’s define our PyTorch model. We’ll create a simple CNN image classification model targeting the CIFAR10 dataset. To demonstrate the process of identifying and replacing unsupported layers, we will intentionally use the GroupNorm normalization layer, which is not currently supported by the DLA.

import torch
import torch.nn as nn


class ModelGN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(8, 64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(8, 128),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(8, 256),
            nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(8, 512),
            nn.ReLU()
        )
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.linear = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.cnn(x)
        x = self.pool(x)
        x = x.view(x.shape[0], -1)
        x = self.linear(x)
        return x

model_gn = ModelGN(num_classes=10).cuda().eval()

That's it, our initial PyTorch model is now defined. We could proceed with training this model, but training is often a time-consuming task. Instead, let's try running our model on the DLA using TensorRT with dummy weights, just to see how it performs.

Step 2 - Export the model to ONNX

To run our model on the DLA, we need to use the TensorRT neural network inference library by NVIDIA. TensorRT is a framework that ingests a neural network graph description, and performs a variety of platform specific optimizations, including but not limited to running layers on the DLA. The optimizations result in a TensorRT engine, which may be executed using the TensorRT runtime.

There are a few different utilities for converting and building TensorRT engines from PyTorch models, each with their own benefits. Here we will use the PyTorch -> ONNX -> TensorRT workflow. To do this, we first export our model to ONNX as follows

data = torch.zeros(1, 3, 32, 32).cuda()

torch.onnx.export(model_gn, data, 'model_gn.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

Note that we specify dynamic axes for the input and output batch dimensions. By enabling dynamic batch axes, we can then generate a TensorRT engine which is capable of using batch sizes larger than the size of the example data used when exporting to ONNX.
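
Optionally, as a quick sanity check (a minimal sketch, assuming the onnx Python package is installed), we can load the exported file and confirm that the batch dimension was exported as a symbolic axis:

import onnx

# Load and validate the exported model
model_onnx = onnx.load('model_gn.onnx')
onnx.checker.check_model(model_onnx)

# The first input dimension should show the symbolic name 'batch_size'
print(model_onnx.graph.input[0].type.tensor_type.shape)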

Step 3 - Optimize the untrained model using TensorRT

To build our TensorRT engine, we can use the trtexec tool, which is installed on Jetson at the location /usr/src/tensorrt/bin/trtexec. Let’s create an alias for this

alias trtexec=/usr/src/tensorrt/bin/trtexec

Now, we can call trtexec and specify the parameters we wish to use for optimization. The command we use is as follows

trtexec --onnx=model_gn.onnx --shapes=input:32x3x32x32 --saveEngine=model_gn.engine --exportProfile=model_gn.json --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun > model_gn.log
  • --onnx - The input ONNX file path.
  • --shapes - The shapes for input bindings; here we specify a batch size of 32.
  • --saveEngine - The path to save the optimized TensorRT engine.
  • --exportProfile - The path to output a JSON file containing layer-granularity timings (see the sketch after this list for one way to inspect it).
  • --int8 - Enable INT8 precision. This is required for best performance on Orin DLA.
  • --useDLACore=0 - The DLA core to use for all compatible layers.
  • --allowGPUFallback - Allow TensorRT to run layers on GPU that aren't supported on DLA.
  • --useSpinWait - Synchronize GPU events, for improved profiling stability.
  • --separateProfileRun - Perform a separate run for layer profiling.
  • > model_gn.log - Capture the output into a file named model_gn.log
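
As mentioned above, the file written by --exportProfile can be inspected with a short script. The sketch below is not part of the original tutorial, and the exact JSON keys can vary between TensorRT versions; here we assume each per-layer entry contains 'name' and 'averageMs' fields.

import json

with open('model_gn.json', 'r') as f:
    profile = json.load(f)

# Keep only entries that look like per-layer records (skip e.g. a leading count entry)
layers = [entry for entry in profile if 'name' in entry and 'averageMs' in entry]

# Print the ten slowest layers by average execution time
for layer in sorted(layers, key=lambda l: l['averageMs'], reverse=True)[:10]:
    print(f"{layer['averageMs']:.3f} ms  {layer['name']}")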

The trtexec program will log information related to the optimization and profiling processes. One notable output is the collection of layers running on the DLA. After calling trtexec to build and profile our model, we see the following output

[03/31/2022-14:34:54] [I] [TRT] ---------- Layers Running on DLA ----------
[03/31/2022-14:34:54] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0]}
[03/31/2022-14:34:54] [I] [TRT] [DlaLayer] {ForeignNode[Relu_10...Conv_11]}
[03/31/2022-14:34:54] [I] [TRT] [DlaLayer] {ForeignNode[Relu_21...Conv_22]}
[03/31/2022-14:34:54] [I] [TRT] [DlaLayer] {ForeignNode[Relu_32...Conv_33]}
[03/31/2022-14:34:54] [I] [TRT] [DlaLayer] {ForeignNode[Relu_43]}
[03/31/2022-14:34:54] [I] [TRT] ---------- Layers Running on GPU ----------
[03/31/2022-14:34:54] [I] [TRT] [GpuLayer] SHUFFLE: Reshape_2 + (Unnamed Layer* 7) [Shuffle]
[03/31/2022-14:34:54] [I] [TRT] [GpuLayer] PLUGIN_V2: InstanceNormalization_5
[03/31/2022-14:34:54] [I] [TRT] [GpuLayer] SHUFFLE: (Unnamed Layer* 12) [Shuffle] + Reshape_7
[03/31/2022-14:34:54] [I] [TRT] [GpuLayer] SCALE: 71 + (Unnamed Layer* 16) [Shuffle] + Mul_8
[03/31/2022-14:34:54] [I] [TRT] [GpuLayer] SCALE: 72 + (Unnamed Layer* 19) [Shuffle] + Add_9
[03/31/2022-14:34:54] [I] [TRT] [GpuLayer] SHUFFLE: Reshape_13 + (Unnamed Layer* 29) [Shuffle]

As we can see, several layers are running on the DLA as well as the GPU. This indicates that many of our layers were not compatible with execution on the DLA and fell back to execution on the GPU. Contiguous subgraphs that execute on the DLA will appear as one "ForeignNode" block. Ideally, our entire model would appear as a single "ForeignNode" entry running on the DLA.

In addition to this information, trtexec will report the number of batches executed per second.

[03/31/2022-15:12:38] [I] === Performance summary ===
[03/31/2022-15:12:38] [I] Throughput: 305.926 qps

Here we see that the engine executes 305.926 batches per second. Multiplying this number by the batch size gives the number of images per second; with our batch size of 32, that is roughly 9,790 images per second.

This coarse information is useful for determining how fast a model will run in practice, but provides little insight into what we can do to improve the model performance.

Let's take a look at how we can use the NVIDIA Visual Profiler to view a timeline trace of what's happening during our model execution.

Step 4 - Profile with the NVIDIA Visual Profiler

To collect data for visualization, we first need to profile our model. To do this, we will use the nsys command line tool installed on NVIDIA Jetson Orin. This tool collects system profiling information during the execution of a provided program.

We will call nsys profile using /usr/src/tensorrt/bin/trtexec as the executable to be profiled. We will provide the engine we've already created in the previous step to trtexec to avoid profiling the entire optimization process.

nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --output=model_gn.nvvp /usr/src/tensorrt/bin/trtexec --loadEngine=model_gn.engine --iterations=10 --idleTime=500 --duration=0 --useSpinWait

The notable nsys profile parameters are

  • --trace - Determines which events we want to capture during our profiling session. Notably, nvmedia will capture calls to the DLA. nvtx will capture application specific calls, which in this instance includes TensorRT layer executions.
  • --output=model_gn.nvvp - Sets the output file that profiling information will be stored in. This can be loaded directly with the NVIDIA Visual Profiler.

The trtexec parameters are

  • --loadEngine=model_gn.engine - The path of the TensorRT engine we previously generated by calling trtexec with --saveEngine.
  • --iterations=10 - Limit the number of iterations to 10, since we are interested in the fine-grained execution timeline, not larger statistical information.
  • --duration=0 - Removes the minimum profiling duration; we only wish to profile 10 executions.
  • --idleTime=500 - Adds a 500ms wait time between executions, so we can more easily distinguish separate execution calls in the timeline view.
  • --useSpinWait - Adds explicit synchronization to improve profiling stability.

After calling the profiling command, we will have the profiling data, which we can load into the NVIDIA Visual Profiler.

Focusing on a single model execution call, we see the following.

[Profiler timeline: a single TensorRT engine execution, with DLA events spread apart and other work interleaved between them]

The orange block, which is recorded when specifying --trace=nvtx, shows the entire TensorRT engine execution call. The yellow blocks, which are also captured by specifying --trace=nvtx, show the individual TensorRT layer executions. The blocks in the row titled Other Accelerators API show the DLA events. As we can see, the DLA events are rather spaced out, with a variety of other events happening in between. These other events take up a significant portion of our model execution time.

In this instance, the events in-between our DLA events are caused by the GroupNorm layer, which isn't supported by the DLA. The data is formatted for execution on the GPU, and then formatted again for execution on the DLA between each DLA block. Since this overhead recurs several times in our model, it increases our overall runtime significantly.

Determining which layer isn't supported by the DLA can be done by checking the warnings output by trtexec, or by inspecting the kernel names in the profiling view for hints. Sometimes the layer names don't map exactly to the original PyTorch layers, which can make this process more challenging. In these cases, substituting or removing layers and re-testing the conversion can provide a definitive indication of what is supported or problematic.

Tip: Though not covered here, the TensorRT Python API allows you to inspect which TensorRT layers in the network definition are able to run on the DLA. We'll cover how to parse the ONNX model into a TensorRT network later in this tutorial. You could use this representation to programmatically check which layers are supported by the DLA in Python. However, associating this information with the original PyTorch model may still present some challenges, because layers may be mapped from PyTorch to ONNX / TensorRT under a different representation. In these instances, sometimes iteratively testing the conversion of submodules in your model is the easiest way to hone in on an incompatible layer.
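
As a rough illustration of that idea, the sketch below (not part of the original tutorial; it assumes TensorRT is installed and that model_gn.onnx from Step 2 is available) parses the ONNX model into a TensorRT network and queries per-layer DLA support. Note that the result can depend on the builder configuration, for example which precisions are enabled.

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open('model_gn.onnx', 'rb') as f:
    parser.parse(f.read())

# Configure the builder for DLA so the support query reflects DLA placement
config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA
config.dla_core = 0
config.set_flag(trt.BuilderFlag.INT8)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    print(layer.name, '-> DLA supported:', config.can_run_on_DLA(layer))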

From this analysis, we see that the GroupNorm layer, which has relatively low computational complexity, is introducing significant runtime due to the formatting overhead between GPU and DLA.

Let's try a different layer that is supported by the DLA, and see how it improves our model performance.

Step 5 - Modify model for better DLA utilization

First, we modify our model definition, replacing all GroupNorm layers with BatchNorm2d layers.

class ModelBN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU()
        )
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.linear = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.cnn(x)
        x = self.pool(x)
        x = x.view(x.shape[0], -1)
        x = self.linear(x)
        return x

model_bn = ModelBN(num_classes=10).cuda().eval()

Second, we export our model to ONNX as before.

data = torch.zeros(1, 3, 32, 32).cuda()

torch.onnx.export(model_bn, data, 'model_bn.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

Third, we build and profile our TensorRT engine as before

trtexec --onnx=model_bn.onnx --shapes=input:32x3x32x32 --saveEngine=model_bn.engine --exportProfile=model_bn.json --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun > model_bn.log

Which outputs

[03/31/2022-15:12:48] [I] [TRT] ---------- Layers Running on DLA ----------
[03/31/2022-15:12:48] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Relu_7]}
[03/31/2022-15:12:48] [I] [TRT] [DlaLayer] {ForeignNode[Gemm_15 + linear.bias + (Unnamed Layer* 19) [Shuffle] + unsqueeze_node_after_linear.bias + (Unnamed Layer* 19) [Shuffle]_(Unnamed Layer* 19) [Shuffle]_output + (Unnamed Layer* 20) [ElementWise]]}
[03/31/2022-15:12:48] [I] [TRT] ---------- Layers Running on GPU ----------
[03/31/2022-15:12:48] [I] [TRT] [GpuLayer] REDUCE: GlobalAveragePool_8
[03/31/2022-15:12:48] [I] [TRT] [GpuLayer] SHUFFLE: copied_squeeze_after_(Unnamed Layer* 20) [ElementWise]

As we can see, there are significantly fewer layers running on the GPU as well as the DLA. This is an indication that more layers (Conv_0...Relu_7) were merged into a single "ForeignNode" block for execution on the DLA.

This is a good sign! And as seen below, the performance of our model has approximately doubled!

[03/31/2022-15:12:55] [I] === Performance summary ===
[03/31/2022-15:12:55] [I] Throughput: 649.723 qps

We profile this new engine as we did before with the following command

nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --output=model_bn.nvvp /usr/src/tensorrt/bin/trtexec --loadEngine=model_bn.engine --iterations=10 --idleTime=500 --duration=0 --useSpinWait

As we see below, a majority of the model execution now occurs in one large block under the "Other accelerators API" row. This indicates our model is utilizing the DLA much more effectively!

[Profiler timeline: most of the execution now appears as one large block in the Other Accelerators API row]

Until this point, we've been working with dummy model weights. However, to obtain the best performance on the DLA, we need to use INT8 precision for our model. Reducing the precision of a neural network can impact the accuracy of the model. To mitigate this effect, it is important to perform calibration.

Let's go over how to train the model, and perform calibration with real data.

Step 6 - Train the model

First, we train the model on the CIFAR10 dataset. This is a toy image classification dataset, which has examples in PyTorch available from a variety of sources online.

Since this step isn't unique to execution on the DLA, we won't paste the code inline. However, we have provided a training script train.py that may be used to generate the trained neural network weights. The usage of train.py is covered in QUICKSTART.md. To train our new batch norm model, we simply call

python3 train.py model_bn --checkpoint_path=data/model_bn.pth

The trained model weights are stored at model_bn.pth. This model reaches an accuracy of roughly 85% on the CIFAR10 test dataset. While other architectures can obtain better accuracy on this dataset, we've opted for a simple model since our focus is only on validating the DLA workflow and verifying that the calibrated DLA engine obtains similar accuracy to its PyTorch counterpart.

Once we have the trained weights, model_bn.pth, we can export the model to ONNX as before.

model = ModelBN(num_classes=10).cuda().eval()
model.load_state_dict(torch.load('model_bn.pth'))

data = torch.zeros(1, 3, 32, 32).cuda()

torch.onnx.export(
    model,
    data,
    'model_bn.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

Now, our trained model, in ONNX representation, is stored at model_bn.onnx.

Let's proceed with optimizing our trained model.

Step 7 - Optimize the model (using real weights and calibration data)

To optimize our trained model, we'll use the TensorRT Python API, rather than trtexec. The reason for this decision is that we'll need to provide a Python interface for generating calibration data.

To start, we import the TensorRT Python package, and make the required API calls to parse our ONNX model into a TensorRT network.

import tensorrt as trt

# create logger
logger = trt.Logger(trt.Logger.INFO)

# create builder
builder = trt.Builder(logger)

# create network, enabling explicit batch
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# parse the ONNX model to generate the TensorRT network
parser = trt.OnnxParser(network, logger)

with open(args.onnx, 'rb') as f:
    parser.parse(f.read())

Now, our network object holds a full representation of our trained model. This representation, however, is not executable. TensorRT still needs to optimize the model to produce the TensorRT engine, which stores the output of our optimization process. The engine describes which tactics are used to execute the layers in our network, including which sub-graphs are offloaded to the DLA for execution.

There are a variety of configuration options we may provide to control the optimization process. These parameters are largely controlled through a configuration class, which may be created as follows.

config = builder.create_builder_config()

Now that our config instance is defined, we'll first set the optimization profile.

batch_size = 32

# define the optimization configuration
profile = builder.create_optimization_profile()
profile.set_shape(
    'input',
    (batch_size, 3, 32, 32),   # min shape
    (batch_size, 3, 32, 32),   # optimal shape
    (batch_size, 3, 32, 32)    # max shape
)

config.add_optimization_profile(profile)

This profile controls the shapes of input tensors that our final engine will be capable of handling. These shapes include a minimum, optimal, and maximum shape. The minimum and maximum shape control which shapes are allowed by the model, while the optimal shape controls which shape is used during the profiling and tactic selection process. In order for the DLA to be selected as a tactic, we must set all these shapes to be the same.

Next, we specify which DLA core we want our model to run on by default, and allow GPU fallback in case DLA layers are not supported.

config.dla_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

Finally, since DLA is supported only for INT8 and FP16 precisions, we enable INT8 precision.

config.set_flag(trt.BuilderFlag.INT8)
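
Optionally (this is not part of the build configuration used in this tutorial), you could also enable FP16 so that any layers that fall back to the GPU or cannot run in INT8 are still allowed to use reduced precision:

config.set_flag(trt.BuilderFlag.FP16)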

Next, we create a class DatasetCalibrator that will be used to provide calibration data to the TensorRT builder. This class wraps the CIFAR10 dataset we used for training, and implements the IInt8Calibrator interface to provide batches of data for calibration. By implementing the get_batch, get_algorithm, and get_batch_size methods of the IInt8Calibrator interface, our class may now be used by the TensorRT builder.

import torch
import tensorrt as trt
import torchvision.transforms as transforms


class DatasetCalibrator(trt.IInt8Calibrator):
    
    def __init__(self, 
            input, dataset, 
            algorithm=trt.CalibrationAlgoType.ENTROPY_CALIBRATION_2):
        super(DatasetCalibrator, self).__init__()
        self.algorithm = algorithm
        self.dataset = dataset
        self.buffer = torch.zeros_like(input).contiguous()
        self.count = 0
        
    def get_batch(self, *args, **kwargs):
        if self.count < len(self.dataset):
            for buffer_idx in range(self.get_batch_size()):

                # get image from dataset
                dataset_idx = self.count % len(self.dataset)
                image, _ = self.dataset[dataset_idx]
                image = image.to(self.buffer.device)

                # copy image to buffer
                self.buffer[buffer_idx].copy_(image)

                # increment total number of images used for calibration
                self.count += 1

            return [int(self.buffer.data_ptr())]
        else:
            return []  # returning None or [] signals to TensorRT that calibration is finished
        
    def get_algorithm(self):
        return self.algorithm
    
    def get_batch_size(self):
        return int(self.buffer.shape[0]) 
    
    def read_calibration_cache(self, *args, **kwargs):
        return None
    
    def write_calibration_cache(self, cache, *args, **kwargs):
        pass

Next, we instantiate our DatasetCalibrator using the same CIFAR10 training dataset we used for training, and assign it to config.int8_calibrator.

import os
import torchvision

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

train_dataset = torchvision.datasets.CIFAR10(
    root=os.path.join(args.data_dir, 'cifar10'), 
    train=True,
    download=True, 
    transform=transform
)

batch_size = 32

data = torch.zeros(batch_size, 3, 32, 32).cuda()

config.int8_calibrator = DatasetCalibrator(data, train_dataset)

Finally, we build our TensorRT engine using the network we generated from our ONNX file and the configuration we defined above.

engine_bytes = builder.build_serialized_network(network, config)

After the optimization process is finished, we save our engine to disk.

with open('model_bn.engine', 'wb') as f:
    f.write(engine_bytes)

Our optimized, calibrated, INT8 / DLA compatible TensorRT engine is now stored at model_bn.engine.

You could now use this optimized engine with a variety of inference APIs, such as DeepStream or the Triton Inference Server.

Here we'll use the TensorRT Python API to run inference using our model, and evaluate the accuracy on the CIFAR10 test dataset.

Step 8 - Evaluate the accuracy of the optimized model

First, let's define the CIFAR10 test dataset and data loader.

test_dataset = torchvision.datasets.CIFAR10(
    root=os.path.join(args.data_dir, 'cifar10'), 
    train=False,
    download=True, 
    transform=transform
)

test_loader = torch.utils.data.DataLoader(
    test_dataset, 
    batch_size=batch_size,
    shuffle=False
)

Next, let's instantiate the TensorRT runtime, load our saved TensorRT engine, and create the context we'll use for execution.

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)

with open('model_bn.engine', 'rb') as f:
    engine_bytes = f.read()
    engine = runtime.deserialize_cuda_engine(engine_bytes)

context = engine.create_execution_context()

Next, we will create buffers to hold our neural network inputs and outputs.
TensorRT engine execution requires that we provide data pointers to GPU memory for the input and output tensors. We will define these tensors in PyTorch, and then use the data_ptr() method that PyTorch Tensors provide to get access to the underlying GPU memory pointer.

input_binding_idx = engine.get_binding_index('input')
output_binding_idx = engine.get_binding_index('output')

input_shape = (args.batch_size, 3, 32, 32)
output_shape = (args.batch_size, 10)

context.set_binding_shape(
    input_binding_idx,
    input_shape
)

input_buffer = torch.zeros(input_shape, dtype=torch.float32, device=torch.device('cuda'))
output_buffer = torch.zeros(output_shape, dtype=torch.float32, device=torch.device('cuda'))

bindings = [None, None]
bindings[input_binding_idx] = input_buffer.data_ptr()
bindings[output_binding_idx] = output_buffer.data_ptr()

Finally, we iterate through the data and evaluate the accuracy of our model.

test_accuracy = 0

# run through test dataset
for image, label in iter(test_loader):

    actual_batch_size = int(image.shape[0])

    input_buffer[0:actual_batch_size].copy_(image)

    context.execute_async_v2(
        bindings,
        torch.cuda.current_stream().cuda_stream
    )

    torch.cuda.current_stream().synchronize()

    output = output_buffer[0:actual_batch_size]
    label = label.cuda()

    test_accuracy += int(torch.sum(output.argmax(dim=-1) == label))

test_accuracy /= len(test_dataset)

print(f'TEST ACCURACY: {test_accuracy}')

As before, the accuracy should still be approximately 85% on the CIFAR10 test dataset!
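
If you want to compare against the original PyTorch model, a minimal sketch (assuming the trained weights model_bn.pth and the test_loader defined above are available) might look like the following.

# Evaluate the original PyTorch model on the same test set for comparison
model = ModelBN(num_classes=10).cuda().eval()
model.load_state_dict(torch.load('model_bn.pth'))

correct = 0
with torch.no_grad():
    for image, label in test_loader:
        output = model(image.cuda())
        correct += int(torch.sum(output.argmax(dim=-1) == label.cuda()))

print(f'PYTORCH TEST ACCURACY: {correct / len(test_dataset)}')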

Step 9 - Modify trained model with ONNX graph surgeon

If you've followed the tutorial to this point, you now have a trained ONNX model, model_bn.onnx, that is largely DLA compatible. However, one of the final layers, AdaptiveAvgPool2d, which maps to the ONNX GlobalAveragePool operator, is still not supported by the DLA. To demonstrate another method for modifying models for DLA compatibility, we will modify the ONNX graph directly.

Here, we can replace the GlobalAveragePool layer, which isn't supported on the DLA, with an AveragePool layer which is supported on DLA for kernel sizes up to 8.

We do so with the following code.

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load('model_bn.onnx'))

# Replace the unsupported global pooling op with a fixed-size average pool.
# After four stride-2 convolutions, the 32x32 input is reduced to a 2x2 feature map.
for node in graph.nodes:
    if node.op == 'GlobalAveragePool':
        node.op = 'AveragePool'
        node.attrs['kernel_shape'] = [2, 2]

onnx.save(gs.export_onnx(graph), 'model_bn_modified.onnx')
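
As before, we can rebuild and profile the modified model with trtexec. A command mirroring the one from Step 5 (the output file names, such as model_bn_modified.engine, are just illustrative choices) might look like:

trtexec --onnx=model_bn_modified.onnx --shapes=input:32x3x32x32 --saveEngine=model_bn_modified.engine --exportProfile=model_bn_modified.json --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun > model_bn_modified.log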

After optimizing our modified ONNX model with trtexec, we see the following output.

[04/28/2022-21:30:53] [I] [TRT] ---------- Layers Running on DLA ----------
[04/28/2022-21:30:53] [I] [TRT] [DlaLayer] {ForeignNode[Conv_0...Gemm_15]}
[04/28/2022-21:30:53] [I] [TRT] ---------- Layers Running on GPU ----------
[04/28/2022-21:30:53] [I] [TRT] [GpuLayer] SHUFFLE: reshape_after_Gemm_15

Now, one larger subgraph is running on the DLA, rather than two subgraphs split apart by the unsupported pooling layer!

There are often many ways you can creatively replace unsupported layers with other layers that are supported. It largely depends on the exact model and usage. We hope this helps give you ideas for how to better optimize your models for DLA.

Next steps

That concludes this tutorial. We hope this helps you get started using the DLA on Jetson.

For more information related to the topics discussed in this tutorial, check out the NVIDIA TensorRT documentation, which includes a section on working with the DLA.
