  • Language: C++
  • License: Apache License 2.0
  • Created over 3 years ago
  • Updated 10 months ago

Repository Details

Transformer related optimization, including BERT, GPT

FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder component, and it is tested and maintained by NVIDIA.

Model overview

In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and the decoder for inference. On Volta, Turing and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.

FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide at least one API for each of the following frameworks: TensorFlow, PyTorch and Triton backend. Users can integrate FasterTransformer into these frameworks directly. For the supported frameworks, we also provide example code that demonstrates usage and shows the performance on these frameworks.

Support matrix

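| Models | Framework | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
|---|---|---|---|---|---|---|---|
| BERT | TensorFlow | Yes | Yes | - | - | - | - |
| BERT | PyTorch | Yes | Yes | Yes | Yes | Yes | - |
| BERT | Triton backend | Yes | - | - | Yes | Yes | - |
| BERT | C++ | Yes | Yes | - | - | - | Yes |
| XLNet | C++ | Yes | - | - | - | - | - |
| Encoder | TensorFlow | Yes | Yes | - | - | - | - |
| Encoder | PyTorch | Yes | Yes | Yes | - | - | - |
| Decoder | TensorFlow | Yes | - | - | - | - | - |
| Decoder | PyTorch | Yes | - | - | - | - | - |
| Decoding | TensorFlow | Yes | - | - | - | - | - |
| Decoding | PyTorch | Yes | - | - | - | - | - |
| GPT | TensorFlow | Yes | - | - | - | - | - |
| GPT/OPT | PyTorch | Yes | - | - | Yes | Yes | Yes |
| GPT/OPT | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-J | Triton backend | Yes | - | - | Yes | Yes | - |
| Longformer | PyTorch | Yes | - | - | - | - | - |
| T5/UL2 | PyTorch | Yes | - | - | Yes | Yes | - |
| T5 | TensorFlow 2 | Yes | - | - | - | - | - |
| T5/UL2 | Triton backend | Yes | - | - | Yes | Yes | - |
| T5 | TensorRT | Yes | - | - | Yes | Yes | - |
| T5-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| Swin Transformer | PyTorch | Yes | Yes | - | - | - | - |
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
| DeBERTa | TensorFlow 2 | Yes | - | - | On-going | On-going | - |
| DeBERTa | PyTorch | Yes | - | - | On-going | On-going | - |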
  • Note that FasterTransformer supports the models above in C++ because all source code is written in C++.

More details of specific models are in xxx_guide.md under docs/, where xxx is the model name. Some common questions and their answers are in docs/QAList.md. Note that the Encoder and BERT models are similar, so both are explained together in bert_guide.md.

Advanced

The following code lists the directory structure of FasterTransformer:

/src/fastertransformer: source code of FasterTransformer
    |--/cutlass_extensions: Implementation of cutlass gemm/kernels.
    |--/kernels: CUDA kernels for different models/layers and operations, like addBiasResidual.
    |--/layers: Implementation of layer modules, like attention layer, ffn layer.
    |--/models: Implementation of different models, like BERT, GPT.
    |--/tensorrt_plugin: encapsulate FasterTransformer into TensorRT plugin.
    |--/tf_op: custom TensorFlow OP implementation
    |--/th_op: custom PyTorch OP implementation
    |--/triton_backend: custom Triton backend implementation
    |--/utils: Contains common CUDA utils, like cublasMMWrapper, memory_utils
/examples: C++, TensorFlow and PyTorch interface examples
    |--/cpp: C++ interface examples
    |--/pytorch: PyTorch OP examples
    |--/tensorflow: TensorFlow OP examples
    |--/tensorrt: TensorRT examples
/docs: Documents to explain the details of implementation of different models, and show the benchmark
/benchmark: Contains the scripts to run the benchmarks of different models
/tests: Unit tests
/templates: Documents to explain how to add a new model/example into FasterTransformer repo

Note that many folders contain sub-folders that separate the different models. Quantization tools have been moved to examples, like examples/tensorflow/bert/bert-quantization/ and examples/pytorch/bert/bert-quantization-sparsity/.

Global Environment

FasterTransformer provides some convenient environment variables for debugging and testing.

  1. FT_LOG_LEVEL: This environment variable controls the log level of debug messages. More details are in src/fastertransformer/utils/logger.h. Note that the program prints a lot of messages when the level is lower than DEBUG, and it becomes very slow.
  2. FT_NVTX: If it is set to ON, as in FT_NVTX=ON ./bin/gpt_example, the program inserts NVTX tags to help profile the program.
  3. FT_DEBUG_LEVEL: If it is set to DEBUG, the program runs cudaDeviceSynchronize() after every kernel. Otherwise, kernels are executed asynchronously by default. This helps locate the point of failure during debugging, but it slows the program down significantly, so it should be used only for debugging.
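
How a program might consult these variables can be sketched in C++. The helper names below (envEquals, syncAfterEachKernel, nvtxEnabled) are hypothetical and are not taken from the FasterTransformer sources. The sketch only reads the environment and makes no CUDA or NVTX calls, so it illustrates the decision being made, not the actual implementation.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: true when environment variable `name` is set
// and its value equals `expected` exactly.
bool envEquals(const char* name, const std::string& expected) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value != nullptr && expected == value;
}

// Assumption: FT_DEBUG_LEVEL=DEBUG requests a cudaDeviceSynchronize()
// after every kernel launch; any other value keeps launches asynchronous.
bool syncAfterEachKernel() {
    return envEquals("FT_DEBUG_LEVEL", "DEBUG");
}

// Assumption: FT_NVTX=ON enables NVTX tags for profiling.
bool nvtxEnabled() {
    return envEquals("FT_NVTX", "ON");
}
```

A program launched as FT_DEBUG_LEVEL=DEBUG ./bin/gpt_example would see syncAfterEachKernel() return true. Reading the variable once at start-up and caching the result keeps the check out of the hot path.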

Performance

Hardware settings:

  • 8xA100-80GBs (with mclk 1593MHz, pclk 1410MHz) with AMD EPYC 7742 64-Core Processor
  • T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

In order to run the following benchmarks, install the Unix calculator tool "bc" with:

apt-get install bc

BERT base performance

The FP16 results of TensorFlow were obtained by running the benchmarks/bert/tf_benchmark.sh.

The INT8 results of TensorFlow were obtained by running the benchmarks/bert/tf_int8_benchmark.sh.

The FP16 results of PyTorch were obtained by running the benchmarks/bert/pyt_benchmark.sh.

The INT8 results of PyTorch were obtained by running the benchmarks/bert/pyt_int8_benchmark.sh.

More benchmarks are put in docs/bert_guide.md.

BERT base performances of FasterTransformer new features

The following figure compares the performance of different FasterTransformer features against FasterTransformer under FP16 on T4.

For large batch sizes and sequence lengths, both EFF-FT and FT-INT8-v2 bring about a 2x speedup. Using Effective FasterTransformer and int8v2 at the same time can bring about a 3.5x speedup compared to FasterTransformer FP16 for large cases.

BERT base performance on TensorFlow

The following figure compares the performances of different features of FasterTransformer and TensorFlow XLA under FP16 on T4.

For small batch size and sequence length, using FasterTransformer can bring about 3x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

BERT base performance on PyTorch

The following figure compares the performances of different features of FasterTransformer and PyTorch TorchScript under FP16 on T4.

For small batch size and sequence length, using FasterTransformer CustomExt can bring about 4x ~ 6x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

Decoding and Decoder performance

The results of TensorFlow were obtained by running the benchmarks/decoding/tf_decoding_beamsearch_benchmark.sh and benchmarks/decoding/tf_decoding_sampling_benchmark.sh.

The results of PyTorch were obtained by running the benchmarks/decoding/pyt_decoding_beamsearch_benchmark.sh.

In the experiments of decoding, we updated the following parameters:

  • head_num = 8
  • size_per_head = 64
  • num_layers = 6 for both encoder and decoder
  • vocabulary_size = 32001 for TensorFlow sample codes, 31538 for PyTorch sample codes
  • memory_hidden_dim = 512
  • max sequence length = 128

More benchmarks are put in docs/decoder_guide.md.

Decoder and Decoding end-to-end translation performance on TensorFlow

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to TensorFlow under FP16 on T4. Because the total number of generated tokens may differ between methods, we compare the throughput of translating a test set. Compared to TensorFlow, FT-Decoder provides a 1.5x ~ 3x speedup, while FT-Decoding provides a 4x ~ 18x speedup.

Decoder and Decoding end-to-end translation performance on PyTorch

The following figure shows the speedup of the FT-Decoder op and the FT-Decoding op compared to PyTorch under FP16 on T4. Because the total number of generated tokens may differ between methods, we compare the throughput of translating a test set. Compared to PyTorch, FT-Decoder provides a 1.2x ~ 3x speedup, while FT-Decoding provides a 3.8x ~ 13x speedup.
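
The throughput-based comparison used in both translation benchmarks can be sketched as follows. The function names and the sample numbers in the note below are illustrative assumptions, not taken from the benchmark scripts.

```cpp
#include <cassert>

// Tokens generated per second for one run over the test set.
double throughput(long long tokens, double seconds) {
    assert(tokens >= 0 && seconds > 0.0);
    return static_cast<double>(tokens) / seconds;
}

// Speedup of a candidate run over a baseline run, computed from
// throughput so that runs that emit different numbers of tokens
// remain comparable.
double speedup(long long baseTokens, double baseSeconds,
               long long candTokens, double candSeconds) {
    return throughput(candTokens, candSeconds) /
           throughput(baseTokens, baseSeconds);
}
```

With made-up numbers, a baseline that emits 1000 tokens in 10 s and a candidate that emits 1200 tokens in 3 s give speedup(1000, 10.0, 1200, 3.0) = 4.0, whereas comparing elapsed time alone (10 s / 3 s) would suggest only about 3.3x.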

GPT performance

The following figure compares the performances of Megatron and FasterTransformer under FP16 on A100.

In the experiments of decoding, we updated the following parameters:

  • head_num = 96
  • size_per_head = 128
  • num_layers = 48 for GPT-89B model, 96 for GPT-175B model
  • data_type = FP16
  • vocab_size = 51200
  • top_p = 0.9
  • tensor parallel size = 8
  • input sequence length = 512
  • output sequence length = 32
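
As a rough sanity check on these settings, the model size and the per-GPU weight footprint can be estimated from the listed parameters. The sketch below uses the common approximation of 12·h² weights per transformer layer, which assumes an FFN inner size of 4·h; that inner size is not stated above. It also assumes the weights are split evenly across the tensor-parallel group. All function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hidden size implied by the attention configuration.
std::int64_t hiddenSize(std::int64_t headNum, std::int64_t sizePerHead) {
    return headNum * sizePerHead;
}

// Rough parameter count: 12*h^2 per layer (attention 4*h^2 plus FFN 8*h^2,
// assuming an FFN inner size of 4*h) plus a vocab*h embedding table.
// Biases, layer norms and position embeddings are ignored.
std::int64_t approxParams(std::int64_t layers, std::int64_t hidden,
                          std::int64_t vocab) {
    assert(layers > 0 && hidden > 0 && vocab > 0);
    return 12 * layers * hidden * hidden + vocab * hidden;
}

// Approximate FP16 weight bytes per GPU (2 bytes per parameter),
// assuming an even split across the tensor-parallel group.
std::int64_t fp16BytesPerGpu(std::int64_t params, std::int64_t tensorParallel) {
    assert(tensorParallel > 0);
    return params * 2 / tensorParallel;
}
```

For the 96-layer setting, hiddenSize(96, 128) is 12288 and approxParams(96, 12288, 51200) is about 174.6 billion, consistent with the GPT-175B name. With tensor parallel size 8, fp16BytesPerGpu gives about 43.6 GB of weights per GPU, before activations and the k/v cache. The same approximation gives about 87.6 billion for the 48-layer setting, so it is only a coarse estimate for the GPT-89B model.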

Release notes

Changelog

May 2023

  • Fix bugs of generation early stopping

January 2023

  • Support GPT MoE
  • Support FP8 for Bert and GPT (Experimental)
  • Support DeBERTa on TensorFlow 2 and PyTorch

Dec 2022

  • Release the FasterTransformer 5.2
  • Support min length penalty

Nov 2022

  • Support T5 TensorFlow 2 custom op.
  • Support T5 MoE
  • Support WeNet
  • Support BART & mBART
  • Support SwinV2
  • Initial support for w8a8 int8 mode with GPT (preview)
  • Support fused mha in GPT

Oct 2022

  • Support BLOOM

Sep 2022

  • Support factual sampling in GPT
  • Support for IA3 adapting scheme in T5

Aug 2022

  • Support returning context tokens embeddings in GPT
  • Release the FasterTransformer 5.1
  • Support for interactive generation
  • Support for attention time-limited memory
  • Support mt5 and t5-v1.1

July 2022

  • Support UL2 huggingface ckpt.
    • Fix bug of T5 under bfloat16.
  • Add ViT INT8 TensorRT Plugin
  • Support batch sampling
  • Support shared context optimization in GPT model

June 2022

  • Support streaming generation for triton backend.
  • Support OPT.
  • Support multi-node multi-GPU BERT under FP32, FP16 and BF16.

May 2022

  • Support bfloat16 on most models.
  • Support prefix-prompt for GPT-J.
  • Support GPT-NeoX.
    • epsilon value used in layernorm is now a parameter
    • rotary embedding GPT-NeoX style (only GPT-J was implemented)
    • load per-GPU layernorm and bias parameters
    • weight conversion from EleutherAI checkpoint

April 2022

  • Release the FasterTransformer 5.0
    • Change the default accumulation type of all gemm to FP32.
    • Support bfloat16 inference in GPT model.
    • Support Nemo Megatron T5 and Megatron-LM T5 model.
    • Support ViT.

March 2022

  • Support stop_ids and ban_bad_ids in GPT-J.
  • Support dynamic start_id and end_id in GPT-J, GPT, T5 and Decoding.

February 2022

  • Support Swin Transformer.
  • Optimize the k/v cache update of beam search with an indirection buffer.
  • Support runtime input for GPT-J, T5 and GPT.
  • Support soft prompt in GPT and GPT-J.
  • Support custom all reduce kernel.
    • Limitation:
      1. Only support tensor parallel size = 8 on DGX-A100.
      2. Only support CUDA with cudaMallocAsync.

December 2021

  • Add TensorRT plugin of T5 model.
  • Change some hyper-parameters of GPT model to runtime query.
  • Optimize the memory allocator under C++ code.
  • Fix the bug with including CUB when using CUDA 11.5 or newer.

November 2021

  • Update the FasterTransformer 5.0 beta
  • Add GPT-3 INT8 weight-only quantization for batch size <= 2.
  • Add multi-node multi-GPU support for T5.
  • Enhance the multi-node multi-GPU support in GPT-3.

August 2021

  • Release the FasterTransformer 5.0 beta
    • Refactor the repo and codes
    • And special thanks to NAVER Corp. for contributing a lot to this version, as listed below.
      • Bug fixes
        • Fix error that occurs when batch_size is less than max_batch_size for gpt pytorch wrapper.
        • Fix memory leak that occurs every forward because of reused allocator.
        • Fix race condition that occurs in repetition penalty kernel.
      • Enhancement
        • Add random seed setting.
        • Fix GEMM buffer overflow on FP16 of GPT.
        • Change to invalidate finished buffer for every completion.
        • Introduce stop_before for early stop.
    • Support Longformer.
    • Rename layer_para to pipeline_para.
    • Optimize the sorting of top p sampling.
    • Support sparsity for Ampere GPUs on BERT.
    • Support size_per_head 96, 160, 192, 224, 256 for GPT model.
    • Support multi-node inference for GPT Triton backend.

June 2021

  • Support XLNet

April 2021

  • Release the FasterTransformer 4.0
    • Support multi-gpus and multi-nodes inference for GPT model on C++ and PyTorch.
    • Support single node, multi-gpus inference for GPT model on triton.
    • Add the int8 fused multi-head attention kernel for bert.
    • Add the FP16 fused multi-head attention kernel of V100 for bert.
    • Optimize the kernel of decoder.
    • Move to independent repo.
    • Eager mode PyTorch extension is deprecated.

Dec 2020

  • Release the FasterTransformer 3.1
    • Optimize the decoding by adding the finished mask to prevent useless computing.
    • Support opennmt encoder.
    • Remove TensorRT plugin support.
    • TorchScript custom op is deprecated.

Nov 2020

  • Optimize the INT8 inference.
  • Support PyTorch INT8 inference.
  • Provide PyTorch INT8 quantization tools.
  • Integrate the fused multi-head attention kernel of TensorRT into FasterTransformer.
  • Add unit test of SQuAD.
  • Update the missed NGC checkpoints.

Sep 2020

  • Support GPT2
  • Release the FasterTransformer 3.0
    • Support INT8 quantization of encoder of cpp and TensorFlow op.
    • Add bert-tf-quantization tool.
    • Fix the issue that CMake 15 or CMake 16 fails to build this project.

Aug 2020

  • Fix the bug of trt plugin.

June 2020

  • Release the FasterTransformer 2.1
    • Add Effective FasterTransformer, based on the idea of Effective Transformer.
    • Optimize the beam search kernels.
    • Add PyTorch op support.

May 2020

  • Fix the bug that seq_len of encoder must be larger than 3.
  • Add the position_encoding of decoding as an input of FasterTransformer decoding. This makes it convenient to use different types of position encoding. FasterTransformer does not compute the position encoding values; it only looks them up in the table.
  • Modify the method of loading the model in translate_sample.py.

April 2020

  • Rename decoding_opennmt.h to decoding_beamsearch.h
  • Add DiverseSiblingsSearch for decoding.
  • Add sampling into Decoding
    • The implementation is in the decoding_sampling.h
    • Add top_k sampling, top_p sampling for decoding.
  • Refactor the tensorflow custom op codes.
    • Merge bert_transformer_op.h, bert_transformer_op.cu.cc into bert_transformer_op.cc
    • Merge decoder.h, decoder.cu.cc into decoder.cc
    • Merge decoding_beamsearch.h, decoding_beamsearch.cu.cc into decoding_beamsearch.cc
  • Fix the bugs of the finalize function in decoding.py.
  • Fix the bug of tf DiverseSiblingSearch.
  • Add BLEU scorer bleu_score.py into utils. Note that the BLEU score requires python3.
  • Fuse QKV Gemm of encoder and masked_multi_head_attention of decoder.
  • Add dynamic batch size and dynamic sequence length features into all ops.

March 2020

  • Add features to FasterTransformer 2.0
    • Add translate_sample.py to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
  • Fix bugs in FasterTransformer 2.0
    • Fix the bug that the maximum sequence length of the decoder cannot be larger than 128.
    • Fix the bug that decoding does not check whether it has finished after each step.
    • Fix the bug of decoder about max_seq_len.
    • Modify the decoding model structure to fit the OpenNMT-tf decoding model.
      • Add a layer normalization layer after decoder.
      • Add a normalization for inputs of decoder

February 2020

  • Release the FasterTransformer 2.0
    • Provide a highly optimized OpenNMT-tf based decoder and decoding, including C++ API and TensorFlow op.
    • Refine the sample codes of encoder.
    • Add dynamic batch size feature into encoder op.

July 2019

  • Release the FasterTransformer 1.0
    • Provide a highly optimized bert equivalent transformer layer, including C++ API, TensorFlow op and TensorRT plugin.

Known issues

  • Cannot compile on tensorflow 2.10 due to undefined symbol issue.
  • Undefined symbol errors when importing the extension
    • Please import torch first. If this has been done, the error is due to an incompatible C++ ABI. Check that the PyTorch used during compilation and the one used at execution are the same, and check how your PyTorch was compiled, the version of your GCC, etc.
  • Results of TensorFlow and the FasterTransformer op may differ in decoding. This problem is caused by the accumulated log probability, and we do not avoid this problem.
  • If you encounter problems in a custom environment, try using gcc/g++ 4.8 to build the TensorFlow op, especially for TensorFlow 1.14.
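
One plausible reading of the decoding issue above is that a running score, such as an accumulated log probability, is sensitive to floating-point rounding, so two implementations that add the same terms in a different order or precision can end up with slightly different totals. The sketch below only illustrates that general floating-point effect with exaggerated values. It is not a reproduction of FasterTransformer or TensorFlow behavior, and runningSum is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Sums values left to right in single precision, the way a running
// (accumulated) score would be updated step by step.
float runningSum(const std::vector<float>& values) {
    assert(!values.empty());
    float acc = 0.0f;
    for (float v : values) {
        acc += v;  // each addition rounds to the nearest representable float
    }
    return acc;
}
```

With standard IEEE single-precision arithmetic, runningSum({1e8f, 1.0f, -1e8f}) returns 0 because 1e8f + 1.0f rounds back to 1e8f, while runningSum({1e8f, -1e8f, 1.0f}) returns 1. Real log-probability differences are far smaller, but in beam search even a tiny difference can change which candidate ranks first.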
