CUDA C++ Core Libraries (CCCL)

Welcome to the CUDA C++ Core Libraries (CCCL) where our mission is to make CUDA C++ more delightful.

This repository unifies three essential CUDA C++ libraries into a single, convenient repository: Thrust, CUB, and libcudacxx.

The goal of CCCL is to provide CUDA C++ developers with building blocks that make it easier to write safe and efficient code. Bringing these libraries together streamlines your development process and broadens your ability to leverage the power of CUDA C++. For more information about the decision to unify these projects, see the announcement here.

Overview

The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers. Naturally, there was a lot of overlap among the three projects, and it became clear the community would be better served by unifying them into a single repository.

  • Thrust is the C++ parallel algorithms library which inspired the introduction of parallel algorithms to the C++ Standard Library. Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs via configurable backends that allow using multiple parallel programming frameworks (such as CUDA, TBB, and OpenMP).

  • CUB is a lower-level, CUDA-specific library designed for speed-of-light parallel algorithms across all GPU architectures. In addition to device-wide algorithms, it provides cooperative algorithms like block-wide reduction and warp-wide scan, providing CUDA kernel developers with building blocks to create speed-of-light, custom kernels.

  • libcudacxx is the CUDA C++ Standard Library. It provides an implementation of the C++ Standard Library that works in both host and device code. Additionally, it provides abstractions for CUDA-specific hardware features like synchronization primitives, cache control, atomics, and more.
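
The following is a minimal sketch (not taken from the CCCL documentation) of how a libcudacxx type can be used from both host and device code; it assumes compilation with nvcc:

#include <cuda/std/array>
#include <cstdio>

// cuda::std::array mirrors std::array and works in host and device code alike.
__host__ __device__ int sum3(cuda::std::array<int, 3> const& a) {
  return a[0] + a[1] + a[2];
}

__global__ void kernel() {
  cuda::std::array<int, 3> a{1, 2, 3};
  printf("device sum: %d\n", sum3(a)); // device-side printf
}

int main() {
  cuda::std::array<int, 3> a{4, 5, 6};
  std::printf("host sum: %d\n", sum3(a));
  kernel<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}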

The main goal of CCCL is to fill a similar role that the Standard C++ Library fills for Standard C++: provide general-purpose, speed-of-light tools to CUDA C++ developers, allowing them to focus on solving the problems that matter. Unifying these projects is the first step towards realizing that goal.

Example

This is a simple example demonstrating the use of CCCL functionality from Thrust, CUB, and libcudacxx.

It shows how to use Thrust/CUB/libcudacxx to implement a simple parallel reduction kernel. Each thread block computes the sum of a subset of the array using cub::BlockReduce. The sum of each block is then reduced to a single value using an atomic add via cuda::atomic_ref from libcudacxx.

It then shows how the same reduction can be done using Thrust's reduce algorithm and compares the results.

Try it live on Godbolt!

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h> // thrust::device
#include <thrust/fill.h>
#include <thrust/reduce.h>
#include <cub/block/block_reduce.cuh>
#include <cuda/atomic>
#include <cassert>
#include <cstdio>

constexpr int block_size = 256;

__global__ void reduce(int const* data, int* result, int N) {
  using BlockReduce = cub::BlockReduce<int, block_size>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int const index = threadIdx.x + blockIdx.x * blockDim.x;
  int sum = 0;
  if (index < N) {
    sum += data[index];
  }
  sum = BlockReduce(temp_storage).Sum(sum);

  if (threadIdx.x == 0) {
    cuda::atomic_ref<int, cuda::thread_scope_device> atomic_result(*result);
    atomic_result.fetch_add(sum, cuda::memory_order_relaxed);
  }
}

int main() {

  // Allocate and initialize input data
  int const N = 1000;
  thrust::device_vector<int> data(N);
  thrust::fill(data.begin(), data.end(), 1);

  // Allocate output data
  thrust::device_vector<int> kernel_result(1);

  // Compute the sum reduction of `data` using a custom kernel
  int const num_blocks = (N + block_size - 1) / block_size;
  reduce<<<num_blocks, block_size>>>(thrust::raw_pointer_cast(data.data()),
                                     thrust::raw_pointer_cast(kernel_result.data()),
                                     N);

  auto const err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::printf("Error: %s\n", cudaGetErrorString(err));
    return -1;
  }

  // Compute the same sum reduction using Thrust
  int const thrust_result = thrust::reduce(thrust::device, data.begin(), data.end(), 0);

  // Ensure the two solutions are identical
  int const custom_result = kernel_result[0]; // copy the device result back to the host
  std::printf("Custom kernel sum: %d\n", custom_result);
  std::printf("Thrust reduce sum: %d\n", thrust_result);
  assert(custom_result == thrust_result);
  return 0;
}

Getting Started

Users

Everything in CCCL is header-only. Therefore, users need only concern themselves with how they get the header files and how they incorporate them into their build system.

CUDA Toolkit

The easiest way to get started using CCCL is via the CUDA Toolkit which includes the CCCL headers. When you compile with nvcc, it automatically adds CCCL headers to your include path so you can simply #include any CCCL header in your code with no additional configuration required.

If compiling with another compiler, you will need to update your build system's include search path to point to the CCCL headers in your CTK install (e.g., /usr/local/cuda/include).

#include <thrust/device_vector.h>
#include <cub/cub.cuh>
#include <cuda/std/atomic>
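
For example, the hypothetical commands below illustrate both cases: nvcc needs no extra flags, while a plain host compiler (shown compiling a host-only file that uses only cuda/std headers) needs the CTK include directory on its search path. The path is an assumption for a default Linux install:

# nvcc finds the CCCL headers shipped with the CUDA Toolkit automatically
nvcc -std=c++17 main.cu -o main

# another host compiler needs the CTK include directory added explicitly
g++ -std=c++17 -I/usr/local/cuda/include host_only.cpp -o host_only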

GitHub

Users that want to stay on the cutting edge of CCCL development are encouraged to use CCCL from GitHub. Using a newer version of CCCL with an older version of the CUDA Toolkit is supported, but not the other way around. For complete information on compatibility between CCCL and the CUDA Toolkit, see our platform support.

Everything in CCCL is header-only, so cloning and including it in a simple project is as easy as the following:

git clone https://github.com/NVIDIA/cccl.git
# Note: use -I (not -isystem) so the cloned headers are found before the CTK's copies
nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub main.cu -o main

Note
Use -I rather than -isystem to ensure the cloned headers are found before those included in the CUDA Toolkit.

CMake Integration

CCCL uses CMake for all build and installation infrastructure, including tests as well as targets to link against in other CMake projects. Therefore, CMake is the recommended way to integrate CCCL into another project.

For a complete example of how to do this using CMake Package Manager see our example project.
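
As an illustration, a minimal CMakeLists.txt along the lines of the example project might look like the following; it assumes CPM.cmake has been downloaded into cmake/ and uses the CCCL::CCCL target exported by CCCL's CMake package:

cmake_minimum_required(VERSION 3.21)
project(cccl_example LANGUAGES CXX CUDA)

# Fetch CCCL with CPM (CMake Package Manager)
include(cmake/CPM.cmake)
CPMAddPackage("gh:NVIDIA/cccl#main")

add_executable(example main.cu)
# The CCCL::CCCL interface target carries the include paths for Thrust, CUB, and libcudacxx
target_link_libraries(example PRIVATE CCCL::CCCL)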

Other build systems should work, but only CMake is tested. Contributions to simplify integrating CCCL into other build systems are welcome.

Contributors

Interested in contributing to making CCCL better? Check out our Contributing Guide for a comprehensive overview of everything you need to know to set up your development environment, make changes, run tests, and submit a PR.

Platform Support

Objective: This section describes where users can expect CCCL to compile and run successfully.

In general, CCCL should work everywhere the CUDA Toolkit is supported; however, the devil is in the details. The sections below describe the details of support and testing for different versions of the CUDA Toolkit, host compilers, and C++ dialects.

CUDA Toolkit (CTK) Compatibility

Summary:

  • The latest version of CCCL is backward compatible with the current and preceding CTK major version series
  • CCCL is never forward compatible with any version of the CTK. Always use a CCCL version that is the same as or newer than the one included with your CTK.
  • Minor version CCCL upgrades won't break existing code, but new features may not support all CTK versions

CCCL users are encouraged to capitalize on the latest enhancements and "live at head" by always using the newest version of CCCL. For a seamless experience, you can upgrade CCCL independently of the entire CUDA Toolkit. This is possible because CCCL maintains backward compatibility with the latest patch release of every minor CTK release from both the current and previous major version series. In some exceptional cases, the minimum supported minor version of the CUDA Toolkit release may need to be newer than the oldest release within its major version series. For instance, CCCL requires a minimum supported version of 11.1 from the 11.x series due to an unavoidable compiler issue present in CTK 11.0.

When a new major CTK is released, we drop support for the oldest supported major version.

CCCL Version | Supports CUDA Toolkit Version
2.x          | 11.1 - 11.8, 12.x (only latest patch releases)
3.x (Future) | 12.x, 13.x (only latest patch releases)

Well-behaved code using the latest CCCL should compile and run successfully with any supported CTK version. Exceptions may occur for new features that depend on new CTK features, so those features would not work on older versions of the CTK. For example, C++20 support was not added to nvcc until CUDA 12.0, so CCCL features that depend on C++20 would not work with CTK 11.x.

Users can integrate a newer version of CCCL into an older CTK, but not the other way around. This means an older version of CCCL is not compatible with a newer CTK. In other words, CCCL is never forward compatible with the CUDA Toolkit.

The table below summarizes compatibility of the CTK and CCCL:

CTK Version | Included CCCL Version | Desired CCCL       | Supported? | Notes
CTK X.Y     | CCCL MAJOR.MINOR      | CCCL MAJOR.MINOR+n | Yes        | Some new features might not work
CTK X.Y     | CCCL MAJOR.MINOR      | CCCL MAJOR+1.MINOR | Yes        | Possible breaks; some new features might not be available
CTK X.Y     | CCCL MAJOR.MINOR      | CCCL MAJOR+2.MINOR | No         | CCCL supports only two CTK major versions
CTK X.Y     | CCCL MAJOR.MINOR      | CCCL MAJOR.MINOR-n | No         | CCCL isn't forward compatible
CTK X.Y     | CCCL MAJOR.MINOR      | CCCL MAJOR-n.MINOR | No         | CCCL isn't forward compatible

For more information on CCCL versioning, API/ABI compatibility, and breaking changes see the Versioning section below.

Operating Systems

Unless otherwise specified, CCCL supports all the same operating systems as the CUDA Toolkit, which are documented here:

Host Compilers

Unless otherwise specified, CCCL supports all the same host compilers as the CUDA Toolkit, which are documented here:

C++ Dialects

  • C++11 (Deprecated in Thrust/CUB, to be removed in next major version)
  • C++14 (Deprecated in Thrust/CUB, to be removed in next major version)
  • C++17
  • C++20

GPU Architectures

Unless otherwise specified, CCCL supports all the same GPU architectures/Compute Capabilities as the CUDA Toolkit, which are documented here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability

Note that some features may only support certain architectures/Compute Capabilities.

Testing Strategy

CCCL's testing strategy strikes a balance between testing as many configurations as possible and maintaining reasonable CI times.

For CUDA Toolkit versions, testing is done against both the oldest and the newest supported versions. For instance, if the latest version of the CUDA Toolkit is 12.3, tests are conducted against 11.1 and 12.3. For each CUDA version, builds are completed against all supported host compilers with all supported C++ dialects.

The testing strategy and matrix are constantly evolving. The matrix defined in the ci/matrix.yaml file is the definitive source of truth. For more information about our CI pipeline, see here.

Versioning

Objective: This section describes how CCCL is versioned, its API/ABI stability guarantees, and compatibility guidelines to minimize upgrade headaches.

Summary

  • The entirety of CCCL's API shares a common semantic version across all components
  • Only the most recently released version is supported and fixes are not backported to prior releases
  • API breaking changes and incrementing CCCL's major version will only coincide with a new major version release of the CUDA Toolkit
  • Not all source breaking changes are considered breaking changes of the public API that warrant bumping the major version number
  • Do not rely on ABI stability of entities in the cub:: or thrust:: namespaces
  • ABI breaking changes for symbols in the cuda:: namespace may happen at any time, but will be reflected by incrementing the ABI version which is embedded in an inline namespace for all cuda:: symbols. Multiple ABI versions may be supported concurrently.

Note: Prior to merging Thrust, CUB, and libcudacxx into this repository, each library was independently versioned according to semantic versioning. Starting with the 2.1 release, all three libraries synchronized their release versions in their separate repositories. Moving forward, CCCL will continue to be released under a single semantic version, with 2.2.0 being the first release from the nvidia/cccl repository.

Breaking Change

A Breaking Change is a change to explicitly supported functionality between released versions that would require a user to do work in order to upgrade to the newer version.

In the limit, any change has the potential to break someone somewhere. As a result, not all possible source breaking changes are considered Breaking Changes to the public API that warrant bumping the major semantic version.

The sections below describe the details of breaking changes to CCCL's API and ABI.

Application Programming Interface (API)

CCCL's public API is the entirety of the functionality intentionally exposed to provide the utility of the library.

In other words, CCCL's public API goes beyond just function signatures and includes (but is not limited to):

  • The location and names of headers intended for direct inclusion in user code
  • The namespaces intended for direct use in user code
  • The declarations and/or definitions of functions, classes, and variables located in headers and intended for direct use in user code
  • The semantics of functions, classes, and variables intended for direct use in user code

Moreover, CCCL's public API does not include any of the following:

  • Any symbol prefixed with _ or __
  • Any symbol whose name contains detail, including the detail:: namespace or a macro
  • Any header file contained in a detail/ directory or sub-directory thereof
  • The header files implicitly included by any header part of the public API

In general, the goal is to avoid breaking anything in the public API. Such changes are made only if they offer users better performance, easier-to-understand APIs, and/or more consistent APIs.

Any breaking change to the public API will require bumping CCCL's major version number. In keeping with CUDA Minor Version Compatibility, API breaking changes and CCCL major version bumps will only occur coinciding with a new major version release of the CUDA Toolkit.

Anything not part of the public API may change at any time without warning.

API Versioning

The entirety of CCCL's public API across all components shares a common semantic version of MAJOR.MINOR.PATCH.

Only the most recently released version is supported. As a rule, features and bug fixes are not backported to previously released versions or branches.

For historical reasons, the library versions are encoded separately in each of Thrust/CUB/libcudacxx as follows:

Version component      | libcudacxx                               | Thrust                    | CUB                    | Incremented when?
Header                 | <cuda/std/version>                       | <thrust/version.h>        | <cub/version.h>        | -
Major Version          | _LIBCUDACXX_CUDA_API_VERSION_MAJOR       | THRUST_MAJOR_VERSION      | CUB_MAJOR_VERSION      | Public API breaking changes (only at new CTK major release)
Minor Version          | _LIBCUDACXX_CUDA_API_VERSION_MINOR       | THRUST_MINOR_VERSION      | CUB_MINOR_VERSION      | Non-breaking feature additions
Patch/Subminor Version | _LIBCUDACXX_CUDA_API_VERSION_PATCH       | THRUST_SUBMINOR_VERSION   | CUB_SUBMINOR_VERSION   | Minor changes not covered by major/minor versions
Concatenated Version   | _LIBCUDACXX_CUDA_API_VERSION (MMMmmmppp) | THRUST_VERSION (MMMmmmpp) | CUB_VERSION (MMMmmmpp) | -
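
For illustration, a small sketch that prints the version macros from the table above (assuming the listed headers are on the include path):

#include <thrust/version.h>
#include <cub/version.h>
#include <cuda/std/version>
#include <cstdio>

int main() {
  std::printf("Thrust     %d.%d.%d\n", THRUST_MAJOR_VERSION, THRUST_MINOR_VERSION, THRUST_SUBMINOR_VERSION);
  std::printf("CUB        %d.%d.%d\n", CUB_MAJOR_VERSION, CUB_MINOR_VERSION, CUB_SUBMINOR_VERSION);
  std::printf("libcudacxx %d (MMMmmmppp)\n", _LIBCUDACXX_CUDA_API_VERSION);
  return 0;
}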

Application Binary Interface (ABI)

The Application Binary Interface (ABI) is a set of rules for:

  • How a library's components are represented in machine code
  • How those components interact across different translation units

A library's ABI includes, but is not limited to:

  • The mangled names of functions and types
  • The size and alignment of objects and types
  • The semantics of the bytes in the binary representation of an object

An ABI Breaking Change is any change that results in a change to the ABI of a function or type in the public API. For example, adding a new data member to a struct is an ABI Breaking Change as it changes the size of the type.
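
To make this concrete, here is an illustrative sketch with hypothetical types (not CCCL code):

// Adding a member changes the size and layout of a type, so any binary that
// passes it across a library boundary must be rebuilt.
struct WidgetV1 { int id; };                // old layout
struct WidgetV2 { int id; double weight; }; // new layout: larger, different offsets

static_assert(sizeof(WidgetV1) < sizeof(WidgetV2), "the new member changed the object size");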

In CCCL, the guarantees about ABI are as follows:

  • Symbols in the thrust:: and cub:: namespaces may break ABI at any time without warning.
  • The ABI of cub:: symbols includes the CUDA architectures used for compilation. Therefore, a single cub:: symbol may have a different ABI if compiled with different architectures.
  • Symbols in the cuda:: namespace may also break ABI at any time. However, cuda:: symbols embed an ABI version number that is incremented whenever an ABI break occurs. Multiple ABI versions may be supported concurrently, and therefore users have the option to revert to a prior ABI version. For more information, see here.

Who should care about ABI?

In general, CCCL users only need to worry about ABI issues when building or using a binary artifact (like a shared library) whose API directly or indirectly includes types provided by CCCL.

For example, consider if libA.so was built using CCCL version X and its public API includes a function like:

void foo(cuda::std::optional<int>);

If another library, libB.so, is compiled using CCCL version Y and uses foo from libA.so, then this can fail if there was an ABI break between version X and Y. Unlike with API breaking changes, ABI breaks usually do not require code changes and only require recompiling everything to use the same ABI version.

To learn more about ABI and why it is important, see What is ABI, and What Should C++ Do About It?.

Compatibility Guidelines

As mentioned above, not all possible source breaking changes constitute a Breaking Change that would require incrementing CCCL's API major version number.

Users are encouraged to adhere to the following guidelines in order to minimize the risk of disruptions from accidentally depending on parts of CCCL that are not part of the public API:

  • Do not add any declarations to the thrust::, cub::, nv::, or cuda:: namespaces unless an exception is noted for a specific symbol, e.g., specializing a type trait.
    • Rationale: This would cause symbol conflicts if a symbol is added with the same name.
  • Do not take the address of any API in the thrust::, cub::, cuda::, or nv:: namespaces.
    • Rationale: This would prevent adding overloads of these APIs.
  • Do not forward declare any API in the thrust::, cub::, cuda::, or nv:: namespaces.
    • Rationale: This would prevent adding overloads of these APIs.
  • Do not directly reference any symbol prefixed with _, __, or with detail anywhere in its name, including a detail:: namespace or macro
    • Rationale: These symbols are for internal use only and may change at any time without warning.
  • Include what you use. For every CCCL symbol that you use, directly #include the header file that declares that symbol. In other words, do not rely on headers implicitly included by other headers.
    • Rationale: Internal includes may change at any time.
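
For instance (an illustrative sketch, not from the CCCL docs), code that uses cuda::std::optional should include its header directly rather than relying on another header pulling it in transitively:

// Good: include what you use.
#include <cuda/std/optional> // declares cuda::std::optional and cuda::std::nullopt

__host__ __device__ cuda::std::optional<int> first_positive(int a, int b) {
  if (a > 0) { return a; }
  if (b > 0) { return b; }
  return cuda::std::nullopt;
}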

Portions of this section were inspired by Abseil's Compatibility Guidelines.

Deprecation Policy

We will do our best to notify users prior to making any breaking changes to the public API, ABI, or modifying the supported platforms and compilers.

As appropriate, deprecations will come in the form of programmatic warnings which can be disabled.

The deprecation period will depend on the impact of the change, but will usually last at least 2 minor version releases.


CI Pipeline Overview

For a detailed overview of the CI pipeline, see ci-overview.md.

Related Projects

Projects related to CCCL's mission to make CUDA C++ more delightful:

  • cuCollections - GPU accelerated data structures like hash tables
  • NVBench - Benchmarking library tailored for CUDA applications
  • stdexec - Reference implementation for Senders asynchronous programming model

Projects Using CCCL

Does your project use CCCL? Open a PR to add your project to this list!

  • AmgX - Multi-grid linear solver library
  • ColossalAI - Tools for writing distributed deep learning models
  • cuDF - Algorithms and file readers for ETL data analytics
  • cuGraph - Algorithms for graph analytics
  • cuML - Machine learning algorithms and primitives
  • CuPy - NumPy & SciPy for GPU
  • cuSOLVER - Dense and sparse linear solvers
  • cuSpatial - Algorithms for geospatial operations
  • GooFit - Library for maximum-likelihood fits
  • HeavyDB - SQL database engine
  • HOOMD - Monte Carlo and molecular dynamics simulations
  • HugeCTR - GPU-accelerated recommender framework
  • Hydra - High-energy Physics Data Analysis
  • Hypre - Multigrid linear solvers
  • LightSeq - Training and inference for sequence processing and generation
  • PyTorch - Tensor and neural network computations
  • Qiskit - High performance simulator for quantum circuits
  • QUDA - Lattice quantum chromodynamics (QCD) computations
  • RAFT - Algorithms and primitives for machine learning
  • TensorFlow - End-to-end platform for machine learning
  • TensorRT - Deep learning inference
  • tsne-cuda - Stochastic Neighborhood Embedding library
  • Visualization Toolkit (VTK) - Rendering and visualization library
  • XGBoost - Gradient boosting machine learning algorithms
