  • Language
    C++
  • License
    Apache License 2.0
  • Created almost 10 years ago
  • Updated 10 months ago


Repository Details

OpenGL sample for NV_command_list

gl commandlist basic

In this sample the NV_command_list extension is used to render a basic scene and texturing is performed via ARB_bindless_texture.

Note: The NV_command_list extension has officially shipped since driver 347.88. The functions used in this sample can also be found in some older drivers (for example 347.09 and higher), but performance there may not be representative for all driver/hardware combinations. The spec is available, and feedback is welcome and should be sent to Christoph Kubisch, Tristan Lorach, or Pierre Boudier. Additional information can be found in this slide deck from SIGGRAPH Asia 2014, as well as the latest GTC 2015 presentation.

This new extension is built around bindless GPU pointers/handles and three more technologies, which allow rendering scenes with many state changes and hundreds of thousands of drawcalls with extremely low CPU time:

  • Tokenized Rendering:
    • Evolution of the "MultiDrawIndirect" mechanism in OpenGL
    • Commands are encoded into binary data (tokens), instead of issuing classic gl calls. This allows the driver or the GPU to efficiently iterate over a stream of many commands in one or multiple sequences: glDrawCommands( ...tokenbuffer, offsets[], sizes[], numSequences)
    • The tokens are stored in regular OpenGL buffers and can be re-used across frames, or manipulated by the GPU itself. Latency-free occlusion culling can be implemented this way (a special terminate sequence token exists).
    • Next to draw calls, the tokens cover the most frequent state changes (vertex, index, uniform-buffers) and a few basic scalar changes (blend color, polygonoffset, stencil ref...).
    • As tokens only reference data (for example uniform buffers), their content is still free to change - you can change vertex positions or matrices freely (which is different from classic display lists).
    • To get an idea of what is currently possible check the nvtoken.cpp/hpp files, which also showcase how the tokenstream could be decoded into classic OpenGL calls.
// The tokens are tightly packed structs, and the most common tokens are 16 bytes.
// Below is the token definition to update a UBO binding. Unlike standard
// UBO bindings, tokens update the binding per shader stage.


  struct UniformAddressCommandNV
  {
    GLuint     header;  // glGetCommandHeaderNV(GL_UNIFORM_ADDRESS_COMMAND_NV)
    GLushort   index;   // in glsl: layout(binding=INDEX,commandBindableNV) uniform ...
    GLushort   stage;   // glGetStageIndexNV(GL_VERTEX_SHADER)
    GLuint64   address; // glGetNamedBufferParameterui64vNV(buffer,
                        //   GL_BUFFER_GPU_ADDRESS_NV, &address);
  } cmd;


// The glGet* queries mentioned above should not be issued at encode time;
// query the values once up front and reuse them.
  • StateObjects:

    • Costly validation in the driver can often happen late at draw-call time or at other unexpected times, potentially causing unstable framerates. Monolithic state objects, as they are common in other new graphics APIs, allow pre-validation and reuse of the core rendering state (FBO, program, blending...).
    • Full control over when validation happens via glStateCaptureNV(stateobj, primitiveBaseMode). The call captures the current GL state, so no other special API is needed, which eases integration.
    • Very efficient state switching between different stateobjects: glDrawCommandsStates(..., stateobjects[], fbos[], numSequences)
    • A stateobject can be reused with compatible fbos (same internal formats, but different textures/sizes).
    • To get an idea what the stateobject captures (or how to emulate it) check statesystem.cpp/hpp.
  • Pre-compiled Command List Object:

    • StateObjects and client-side tokens can be pre-compiled into a special object.
    • Allows further driver optimization (faster stateobject transitions) at the cost of flexibility (rendering from a token buffer allows the buffer, as well as the stateobjects/FBOs, to change).
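
The token-stream idea from the list above can be sketched on the CPU side without any OpenGL calls. This is a minimal illustration, not the sample's code: the header constants are placeholder values (real headers come from glGetCommandHeaderNV), and TokenStream, appendToken, endSequence, and encodeObjects are names invented for this sketch. The token layouts are simplified versions modeled on the UniformAddressCommandNV struct shown earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Simplified token layouts; the authoritative definitions are in the spec.
struct UniformAddressToken {
  uint32_t header;   // real code: value queried once via glGetCommandHeaderNV
  uint16_t index;    // UBO binding index
  uint16_t stage;    // real code: value queried once via glGetStageIndexNV
  uint64_t address;  // GPU address of the per-object buffer range
};
static_assert(sizeof(UniformAddressToken) == 16, "token must be tightly packed");

struct DrawElementsToken {
  uint32_t header;
  uint32_t count;
  uint32_t firstIndex;
  uint32_t baseVertex;
};
static_assert(sizeof(DrawElementsToken) == 16, "token must be tightly packed");

// Placeholder header values, assumed for this sketch only.
constexpr uint32_t kHeaderUniformAddress = 0x1001;
constexpr uint32_t kHeaderDrawElements   = 0x1002;

// CPU-side token bytes plus the per-sequence offsets and sizes that a
// call like glDrawCommands(..., offsets[], sizes[], numSequences) consumes.
struct TokenStream {
  std::vector<uint8_t>  bytes;
  std::vector<intptr_t> offsets;
  std::vector<int32_t>  sizes;
  size_t sequenceStart = 0;
};

// Appends the raw bytes of one token to the stream.
template <typename Token>
void appendToken(TokenStream& s, const Token& t) {
  static_assert(std::is_trivially_copyable<Token>::value, "tokens are raw data");
  const size_t old = s.bytes.size();
  s.bytes.resize(old + sizeof(Token));
  std::memcpy(s.bytes.data() + old, &t, sizeof(Token));
}

// Closes the current sequence and records its offset and size.
void endSequence(TokenStream& s) {
  assert(s.bytes.size() > s.sequenceStart && "sequence must not be empty");
  s.offsets.push_back(static_cast<intptr_t>(s.sequenceStart));
  s.sizes.push_back(static_cast<int32_t>(s.bytes.size() - s.sequenceStart));
  s.sequenceStart = s.bytes.size();
}

// Encodes one sequence: per object, point the UBO binding at the object's
// range, then draw. Expects objectCount > 0.
TokenStream encodeObjects(uint64_t uboBaseAddress, uint32_t objectStride,
                          uint32_t objectCount, uint32_t indexCount) {
  TokenStream s;
  for (uint32_t i = 0; i < objectCount; ++i) {
    UniformAddressToken ubo{kHeaderUniformAddress, 0, 0,
                            uboBaseAddress + uint64_t(i) * objectStride};
    DrawElementsToken draw{kHeaderDrawElements, indexCount, 0, 0};
    appendToken(s, ubo);
    appendToken(s, draw);
  }
  endSequence(s);
  return s;
}
```

In a real application the bytes would be uploaded into a regular OpenGL buffer once and reused across frames; only the data the tokens point at (for example the UBO contents) needs to change.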

Performance

The sample renders 1024 objects, each using a sphere or box IBO/VBO pairing. Each object is drawn with one of two shaders, one of which additionally uses a geometry shader; this merely serves as an example of state switching and results in around 500 toggles between the two per frame. Each object references a range within a big UBO that stores per-object data such as matrix, color and texture. The console output window shows the CPU and GPU performance in detail (be aware that CPU timings may be skewed if the driver runs in dual-core mode).
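
The "range within a big UBO" layout can be sketched as follows. This is an illustration under assumptions rather than the sample's actual data layout: the ObjectData fields are hypothetical stand-ins for "matrix, color and texture", and the sketch assumes each object's range is padded to the implementation's uniform buffer offset alignment (the value reported for GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-object payload (96 bytes).
struct ObjectData {
  float    worldMatrix[16];
  float    color[4];
  uint64_t textureHandle;  // bindless texture handle
  uint64_t padding;
};

// Rounds value up to the next multiple of a power-of-two alignment.
inline size_t alignUp(size_t value, size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (value + alignment - 1) & ~(alignment - 1);
}

struct UboLayout {
  size_t stride;     // bytes between consecutive objects
  size_t totalSize;  // bytes to allocate for the whole buffer
};

// uboOffsetAlignment is assumed to be queried from the GL implementation.
inline UboLayout computeLayout(size_t objectCount, size_t uboOffsetAlignment) {
  const size_t stride = alignUp(sizeof(ObjectData), uboOffsetAlignment);
  return {stride, stride * objectCount};
}

// GPU address of object i, as it would be written into an address token.
inline uint64_t objectAddress(uint64_t baseGpuAddress, const UboLayout& layout,
                              size_t i) {
  return baseGpuAddress + static_cast<uint64_t>(layout.stride) * i;
}
```

With 1024 objects and an assumed alignment of 256 bytes, computeLayout yields a 256-byte stride and a 256 KiB buffer; each object's token then only stores the address returned by objectAddress.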

The output should look something like this:

  Timer Frame;   GL   1333; CPU   2408; (microseconds, avg 758)
  Timer Setup;   GL     21; CPU     42; (microseconds, avg 758)
  Timer Draw;    GL    857; CPU   1752; (microseconds, avg 758)
  Timer Blit;    GL     59; CPU     54; (microseconds, avg 758)
  Timer TwDraw;  GL    389; CPU    551; (microseconds, avg 758)

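Here are some preliminary example results for Timer Draw on a Win7-64, i7-860, Quadro K5000 system:

| draw mode          | GPU time (microseconds) | CPU time (microseconds) |
|--------------------|-------------------------|-------------------------|
| standard           | 850                     | 1750                    |
| nvcmdlist emulated | 830                     | 1500                    |
| nvcmdlist buffer   | 775                     | 30                      |
| nvcmdlist list     | 775                     | <1                      |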

One can see that with classic API usage the scene is CPU bound: more time is spent on the CPU than on the graphics card (measured using ARB_timer_query functionality), despite the already very well optimized Quadro drivers. Only through native use of NV_command_list do we more or less eliminate the CPU constraint and become GPU bound. One could argue that better state sorting (which is still worthwhile and improves GPU time) and batching techniques could improve CPU performance, but they may add complexity to the application. Here each object can have its own resource set and be modified independently.
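
The CPU-bound versus GPU-bound reasoning can be made explicit with a little arithmetic. The sketch below uses the Timer Draw numbers from the table and assumes, as a simplification, that CPU and GPU work overlap perfectly, so the draw cost is roughly the larger of the two times; DrawTiming, isCpuBound, drawCost, and speedup are names invented here.

```cpp
#include <algorithm>
#include <cassert>

struct DrawTiming {
  const char* mode;
  double gpuMicroseconds;
  double cpuMicroseconds;
};

// Values copied from the results table; "<1" is entered as 1.0 (upper bound).
constexpr DrawTiming kStandard{"standard", 850.0, 1750.0};
constexpr DrawTiming kEmulated{"nvcmdlist emulated", 830.0, 1500.0};
constexpr DrawTiming kTokenBuffer{"nvcmdlist buffer", 775.0, 30.0};
constexpr DrawTiming kCompiledList{"nvcmdlist list", 775.0, 1.0};

// The slower side determines the cost under the perfect-overlap assumption.
inline bool isCpuBound(const DrawTiming& t) {
  return t.cpuMicroseconds > t.gpuMicroseconds;
}

inline double drawCost(const DrawTiming& t) {
  return std::max(t.gpuMicroseconds, t.cpuMicroseconds);
}

// Ratio of draw costs when moving from one mode to another.
inline double speedup(const DrawTiming& from, const DrawTiming& to) {
  assert(drawCost(to) > 0.0);
  return drawCost(from) / drawCost(to);
}
```

Under this model the standard path costs about 1750 microseconds (CPU bound) and the token-buffer path about 775 microseconds (GPU bound), a factor of roughly 2.26, while the CPU time alone drops from 1750 to 30 microseconds.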

The performance gained in emulation comes from the use of bindless UBO and VBO. The token-buffer technique is slightly slower on the CPU than the pre-compiled list, because the 500 stateobject transitions still need to be checked every time. The nvcmdlist techniques essentially make only a single dispatch. The closest alternative would be multi-draw-indirect with vertex divisor indexing, but that makes shaders more complex by adding parameter indirections and would not allow simple shader or other state changes.
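
How state transitions translate into sequences can be sketched as follows. This is an assumption-laden illustration, not the sample's code: it supposes every object occupies the same number of token bytes and that each run of consecutive objects sharing a state becomes one sequence (one offset, size, and stateobject entry for a glDrawCommandsStates-style call). Sequence, buildSequences, and sortedByState are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Sequence {
  size_t offset;  // byte offset into the token buffer
  size_t size;    // byte size of the token run
  int    state;   // index into the application's stateobject array
};

// Groups consecutive objects with the same state into one sequence.
std::vector<Sequence> buildSequences(const std::vector<int>& objectStates,
                                     size_t bytesPerObject) {
  assert(bytesPerObject > 0);
  std::vector<Sequence> sequences;
  for (size_t i = 0; i < objectStates.size(); ++i) {
    if (sequences.empty() || sequences.back().state != objectStates[i]) {
      sequences.push_back({i * bytesPerObject, 0, objectStates[i]});
    }
    sequences.back().size += bytesPerObject;
  }
  return sequences;
}

// State sorting: the same objects, reordered so equal states are adjacent.
std::vector<int> sortedByState(std::vector<int> objectStates) {
  std::stable_sort(objectStates.begin(), objectStates.end());
  return objectStates;
}
```

For an object order with states {0, 0, 1, 0, 1, 1} this produces four sequences; after sorting by state only two remain, which is the kind of reduction in transitions that state sorting buys. Note that reordering objects also requires re-encoding their tokens in the new order.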

New level of AZDO: An entire scene with state changes (shaders, buffers...) can be dispatched in a few microseconds CPU time, independent of the scene's complexity. Even if the tokens or stateobjects are more dynamic or have to be streamed per-frame the CPU time savings compared to standard API usage will be huge.

Why can't display lists be as fast? Because they are too unbounded and inherit too much state from the OpenGL context at execution time (unless very specific subsets of commands are used, such as only geometry specification).

Explicit Control: The extension continues a trend in modern API design that gives the developer more explicit control over when certain costs arise, and how to manage data across frames. This also helps the driver pick very efficient paths, and it leverages GPU capabilities such as virtual memory addresses as already provided by other shipping bindless extensions (NV_vertex_buffer_unified_memory, NV_uniform_buffer_unified_memory, NV_shader_buffer_load/store, ARB/NV_bindless_texture) for very fast drawing.

GPU bound? While the extension primarily targets CPU bottlenecks, advanced GPU work creation through GPU-written token buffers may allow in-frame alterations to what geometry is drawn and how, without costly CPU synchronization. The CPU time saved may also be used to optimize the scene further, or be invested elsewhere.

Sample Highlights

Depending on the availability of the extension, the sample allows switching between a standard OpenGL approach for rendering the scene and the new extension in either token-buffer or commandlist-object mode. Inside basic-nvcommandlist.cpp you will find:

  • Sample::drawStandard()
  • Sample::drawTokenBuffer()
  • Sample::drawTokenList()
  • Sample::drawTokenEmulation()

As well as initialization and state update functions:

  • Sample::initCommandListMinimal()

  • Sample::updateCommandListStateMinimal()

  • Sample::initCommandList()

  • Sample::updateCommandListState()

The "Minimal" functions are used if the emulation layer is disabled via #define ALLOW_EMULATION_LAYER 0 at the top of basic-nvcommandlist.cpp. They represent the bare minimum of work required and don't make use of the nvtoken helper classes.

The emulation layer gives you a rough idea of how glDrawCommands* and glStateCapture work internally, and it also aids debugging, as the tokens are never error-checked. Customizing this emulation may also be useful as a permanent compatibility layer for driver/hardware combinations that do not run the extension natively.
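
The core of such an emulation layer is a loop that walks the token bytes and turns each token back into the classic call it stands for. The sketch below is a simplified illustration, not the code in nvtoken.cpp: the header constants are placeholder values, only two token types are handled, and instead of issuing GL calls the decoder records a textual description so it can run without a GL context. decodeSequence is a name invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Placeholder header values, assumed for this sketch only.
constexpr uint32_t kHeaderUniformAddress = 0x1001;
constexpr uint32_t kHeaderDrawElements   = 0x1002;

struct UniformAddressToken {
  uint32_t header;
  uint16_t index;
  uint16_t stage;
  uint64_t address;
};

struct DrawElementsToken {
  uint32_t header;
  uint32_t count;
  uint32_t firstIndex;
  uint32_t baseVertex;
};

// Walks one sequence and records what each token would do.
// Returns false on an unknown header or a truncated token, which is the
// kind of error checking the native path does not perform.
bool decodeSequence(const uint8_t* data, size_t size,
                    std::vector<std::string>& calls) {
  assert(data != nullptr || size == 0);
  size_t cursor = 0;
  while (cursor < size) {
    if (size - cursor < sizeof(uint32_t)) return false;
    uint32_t header;
    std::memcpy(&header, data + cursor, sizeof header);
    switch (header) {
      case kHeaderUniformAddress: {
        UniformAddressToken t;
        if (size - cursor < sizeof t) return false;
        std::memcpy(&t, data + cursor, sizeof t);
        // A real emulation would bind the matching buffer range here.
        calls.push_back("uniform address: index " + std::to_string(t.index) +
                        ", stage " + std::to_string(t.stage));
        cursor += sizeof t;
        break;
      }
      case kHeaderDrawElements: {
        DrawElementsToken t;
        if (size - cursor < sizeof t) return false;
        std::memcpy(&t, data + cursor, sizeof t);
        // A real emulation would issue an indexed draw call here.
        calls.push_back("draw elements: count " + std::to_string(t.count));
        cursor += sizeof t;
        break;
      }
      default:
        return false;  // unknown token: stop instead of reading garbage
    }
  }
  return true;
}
```

Because the decoder validates headers and sizes, running a suspicious token buffer through it first is a cheap way to catch encoding mistakes before handing the buffer to the driver.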

sample screenshot

Building

Ideally, clone this and other interesting nvpro-samples repositories into a common subdirectory. You will always need nvpro_core. The nvpro_core is searched either as a subdirectory of the sample, or one directory up.

If you are interested in multiple samples, you can use the build_all CMake project as an entry point. It also gives you options to enable or disable individual samples when creating the solutions.

Related Samples

The extension is also used in the gl commandlist bk3d models, gl occlusion culling, and gl cadscene rendertechniques samples. The latter two samples include token-buffer-based occlusion culling and the last also includes token-streaming techniques on real-world scenes.
