  • Language
    C++
  • License
    Apache License 2.0
  • Created almost 10 years ago
  • Updated 5 months ago


Repository Details

OpenGL sample on various rendering approaches for typical CAD scenes

gl cadscene render techniques

This sample implements several scene rendering techniques that target mostly static data, such as often found in CAD or DCC applications. In this context, 'static' means that the vertex and index buffers for the scene's objects rarely change. This can include editing the geometry of a few scene objects, but the matrix and material values are the properties that are modified the most across frames. Imagine making edits to the wheel topology of a car, or positioning an engine; the rest of the assembly remains the same.

The principal OpenGL mechanisms that are used here are described in the SIGGRAPH 2014 presentation slides. It is highly recommended to go through the slides first.

The sample makes use of multiple OpenGL 4 core features, such as ARB_multi_draw_indirect, but also showcases OpenGL 3 style rendering techniques.

There are also several techniques built around the NV_command_list extension. Please refer to gl commandlist basic for an introduction to NV_command_list.

Note: This is just a sample to illustrate several techniques and possibilities for how to approach rendering. Its purpose is not to provide production-level, highly optimized implementations.

Scene Setup

The sample loads a cadscene file (csf). This file format is inspired by CAD applications' data organization, but (for simplicity) everything is stored in a single RAW file.

The scene is organized into:

  • Matrices: object transforms as well as concatenated world matrices

  • TreeNodes: a tree encoding hierarchical information, mapping to Matrix indices

  • Materials: just classic two-sided OpenGL Blinn-Phong material parameters

  • Geometries: storing vertex and index information, organized into

  • GeometryParts, which reference a sub-range within the index buffer, for either "wireframe" or "solid" surfaces

  • Objects, which reference a Geometry and have corresponding

  • ObjectParts, that encode part-level Material and Matrix assignment. Typically, an object uses just one Matrix for all its parts.
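
As a rough illustration of the organization listed above, the scene can be pictured as a handful of flat arrays that reference each other by index. This is only a sketch: every struct and field name below is hypothetical and does not claim to match the actual csf loader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified mirror of the scene organization (names are assumptions).
struct MaterialSide { float ambient[4]; float diffuse[4]; float specular[4]; };
struct Material     { MaterialSide sides[2]; };  // two-sided: front and back
struct MatrixNode   { float objectMatrix[16]; float worldMatrix[16]; };
struct TreeNode     { int parent; int matrixIndex; };
struct IndexRange   { std::uint32_t offset; std::uint32_t count; };
struct GeometryPart { IndexRange solid; IndexRange wire; };
struct Geometry {
  std::vector<float>         vertices;
  std::vector<std::uint32_t> indices;
  std::vector<GeometryPart>  parts;
};
struct ObjectPart { int materialIndex; int matrixIndex; };
struct Object     { int geometryIndex; std::vector<ObjectPart> parts; };

struct Scene {
  std::vector<Material>   materials;
  std::vector<MatrixNode> matrices;
  std::vector<TreeNode>   nodes;
  std::vector<Geometry>   geometries;
  std::vector<Object>     objects;
};

inline bool inRange(int index, std::size_t size) {
  return index >= 0 && static_cast<std::size_t>(index) < size;
}

// An object is consistent when it has one ObjectPart per GeometryPart
// and every index it stores points into the matching scene array.
inline bool isConsistent(const Scene& s, const Object& o) {
  if (!inRange(o.geometryIndex, s.geometries.size())) return false;
  if (o.parts.size() != s.geometries[o.geometryIndex].parts.size()) return false;
  for (const ObjectPart& p : o.parts) {
    if (!inRange(p.materialIndex, s.materials.size())) return false;
    if (!inRange(p.matrixIndex, s.matrices.size())) return false;
  }
  return true;
}

// Index range a renderer would draw for the solid surface of one object part.
inline IndexRange solidRange(const Scene& s, const Object& o, std::size_t part) {
  assert(isConsistent(s, o) && part < o.parts.size());
  return s.geometries[o.geometryIndex].parts[part].solid;
}
```

Keeping everything in index-addressed arrays is what lets the renderers track state with a few integers instead of chasing pointers.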

Shademodes

sample screenshot

  • solid: only triangles are drawn
  • solid with edges: triangles and edge outlines on top (using PolygonOffset to push triangles back). When no global sorting (see later) is performed, this means we toggle between the two modes for every object.
  • solid with edges (split test, only in sorted): an artificial mode that also separates triangles and edges into different FBOs, and is available in "sorted" and "token" renderers. The implementation has no real use-case character and is more or less for internal benchmarking of FBO toggles.

Strategies

These influence the number of drawcalls we generate for the hardware and software. Using OpenGL's MultiDraw* functions, we can issue fewer software calls than hardware drawcalls, which helps trigger faster paths in the driver as there is less validation overhead. A strategy is applied on a per-object level.

Imagine an object whose parts use two materials, red and blue:

material: r b b r
parts:    A B C D
  • materialgroups Here we create a per-object cache of drawcall ranges for MultiDraw* based on the object's material and matrix assignments. We also "grow" drawcalls if subsequent ranges in the index buffer have the same assignments. Our sample object would be drawn using 2 states with one glMultiDrawElements call each, which together create 3 hardware drawcalls: red covers ranges A and D, and blue is B+C joined together, as they are next to each other in the index buffer.
  • drawcall join As we traverse, we combine drawcalls under the same state. This means 3 drawcalls for the hardware and 3 for the software, as well as 3 states: red A, blue B+C, red D.
  • drawcall individual We render each piece individually: red A, blue B, C, red D.
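
The three strategies can be sketched on the CPU for the red/blue example above. This is a minimal illustration and not the sample's code: the `Part`, `Range` and `DrawGroup` types and the function names are assumptions, and only the material is used as "state" (the sample also considers the matrix).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Part      { int material; std::size_t first; std::size_t count; };  // range in the index buffer
struct Range     { std::size_t first; std::size_t count; };                // one hardware drawcall
struct DrawGroup { int material; std::vector<Range> ranges; };             // ranges drawn under one state

// Appends a part's range; when grow is set, a range that directly follows
// the previous one in the index buffer is merged into it.
static void appendRange(std::vector<Range>& ranges, const Part& p, bool grow) {
  assert(p.count > 0);
  if (grow && !ranges.empty() && ranges.back().first + ranges.back().count == p.first)
    ranges.back().count += p.count;
  else
    ranges.push_back(Range{p.first, p.count});
}

// "materialgroups": one group per material, regardless of the order of the parts.
std::vector<DrawGroup> materialGroups(const std::vector<Part>& parts) {
  std::map<int, std::size_t> groupOfMaterial;
  std::vector<DrawGroup> groups;
  for (const Part& p : parts) {
    auto it = groupOfMaterial.find(p.material);
    if (it == groupOfMaterial.end()) {
      it = groupOfMaterial.emplace(p.material, groups.size()).first;
      groups.push_back(DrawGroup{p.material, {}});
    }
    appendRange(groups[it->second].ranges, p, true);
  }
  return groups;
}

// In-order traversal with state filtering.
// join == true  : "drawcall join", neighbours under the same state are merged.
// join == false : "drawcall individual", every part stays its own drawcall.
std::vector<DrawGroup> traverseInOrder(const std::vector<Part>& parts, bool join) {
  std::vector<DrawGroup> groups;
  for (const Part& p : parts) {
    if (groups.empty() || groups.back().material != p.material)
      groups.push_back(DrawGroup{p.material, {}});
    appendRange(groups.back().ranges, p, join);
  }
  return groups;
}

// Total number of hardware drawcalls over all groups.
std::size_t totalRanges(const std::vector<DrawGroup>& groups) {
  std::size_t n = 0;
  for (const DrawGroup& g : groups) n += g.ranges.size();
  return n;
}
```

For parts A..D with materials r b b r, `materialGroups` yields 2 groups with 3 ranges, `traverseInOrder(parts, true)` yields 3 groups with 3 ranges, and `traverseInOrder(parts, false)` yields 3 groups with 4 ranges, matching the counts in the list above.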

Typically we do all rendering with basic state redundancy filtering, so we don't set up a matrix/material change if the same one is still active. To keep things simple for state redundancy filtering, you should not go too fine-grained; otherwise all the tracking causes too much memory hopping. In our case we track 3 indices: geometry (handles vertex / index buffer setup), material, and matrix.
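
A minimal sketch of such a filter, assuming a hypothetical `DrawItem` that carries the three tracked indices. A real renderer would issue a buffer bind wherever this sketch increments a counter.

```cpp
#include <cassert>
#include <vector>

struct DrawItem     { int geometry; int material; int matrix; };
struct StateChanges { int geometry; int material; int matrix; };

// Walks the items in order and counts the binds that survive the filter.
StateChanges countFilteredChanges(const std::vector<DrawItem>& items) {
  StateChanges changes = {0, 0, 0};
  int lastGeometry = -1, lastMaterial = -1, lastMatrix = -1;  // -1: nothing bound yet
  for (const DrawItem& item : items) {
    assert(item.geometry >= 0 && item.material >= 0 && item.matrix >= 0);
    if (item.geometry != lastGeometry) { ++changes.geometry; lastGeometry = item.geometry; }
    if (item.material != lastMaterial) { ++changes.material; lastMaterial = item.material; }
    if (item.matrix   != lastMatrix)   { ++changes.matrix;   lastMatrix   = item.matrix; }
  }
  return changes;
}
```

Three integer comparisons per drawcall are cheap; tracking more, finer-grained state would touch more memory per item, which is the trade-off described above.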

Renderers

Most renderers traverse the scene data every frame. The data is organized to be cache-friendly first and foremost: everything is stored in arrays, without too much memory hopping. Some renderers may implement additional caching for rendering.

Variants:

  • bindless: these variants make use of NVIDIA's bindless extensions NV_vertex_buffer_unified_memory and NV_uniform_buffer_unified_memory, which allows a lower-overhead path in the driver for faster drawcall submission. Classic glBindVertexBuffer or glBindBufferRange are replaced with glBufferAddressRangeNV.
  • sorted: indicates we do a global scene sort once, to minimize state changes in subsequent frames.
  • cullsorted: in addition to global sorting by state, we also apply occlusion culling as presented at the end of the slides or in the gl occlusion culling sample.
  • emulated: several of the NV_command_list techniques can be run in emulated mode.
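
The idea behind the "sorted" variants can be sketched as a one-time sort of draw items by their state key. The key order used here (geometry, then material, then matrix) and the `DrawItem` type are assumptions of this sketch, not necessarily what the sample does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

struct DrawItem { int geometry; int material; int matrix; };

// Lexicographic order on the three tracked state indices.
inline bool stateLess(const DrawItem& a, const DrawItem& b) {
  return std::tie(a.geometry, a.material, a.matrix) <
         std::tie(b.geometry, b.material, b.matrix);
}

// Sorts once so that items with equal state become neighbours.
void sortByState(std::vector<DrawItem>& items) {
  std::stable_sort(items.begin(), items.end(), stateLess);
  assert(std::is_sorted(items.begin(), items.end(), stateLess));
}

// Counts how often the state differs between two consecutive items.
int countStateSwitches(const std::vector<DrawItem>& items) {
  int switches = 0;
  for (std::size_t i = 1; i < items.size(); ++i)
    if (stateLess(items[i - 1], items[i]) || stateLess(items[i], items[i - 1]))
      ++switches;
  return switches;
}
```

For an interleaved sequence such as X, Y, X, Y the unsorted order has 3 state switches, while the sorted order has 1.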

Techniques:

We are mostly looking into accelerating our matrix and material parameter switching performance.

  • uborange All matrices and materials are stored in big buffer objects, which allows us to efficiently bind the required sub-range for a drawcall via glBindBufferRange(GL_UNIFORM_BUFFER, usageSlot, buffer, index * itemSize, itemSize). NVIDIA provides optimized paths if you keep the buffer and itemSize for a usageSlot constant for many glBindBufferRange calls. Be aware of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which is 256 bytes for most current NVIDIA hardware (Fermi, Kepler, Maxwell).

  • ubosub Not as efficient as the above, but maybe appropriate if you cannot afford to cache parameter data. We make use of one streaming buffer per usage slot and continuously update it via glBufferSubData. NVIDIA's drivers do particularly well if you never bind this buffer as anything but a GL_UNIFORM_BUFFER and keep sizes and offsets a multiple of 4.

  • indexedmdi Similar to uborange, all data is stored in bigger buffers in advance. This doesn't make the data "static"; you can always update the portions you need, but there is a high chance a lot of data is the same frame to frame. This time, we do not bind memory ranges through the OpenGL API, but let the shader do an indirection and only pass the required matrix and material indices. For the matrix data we use GL_TEXTURE_BUFFER as it's particularly performant for high frequency / potentially divergent access. We typically have far more matrices than materials in our scene. For material data, it's a bit "ugly" to use lots of texelFetch instructions decoding all our parameters; it's much easier to write them as structs and store the array either as GL_UNIFORM_BUFFER or GL_SHADER_STORAGE_BUFFER. The latter is only recommended if you have divergent shader access or exceed the 64 KB limit of UBOs. To pass the indices per-drawcall we make use of GL_ARB_multi_draw_indirect and "instanced" vertex attributes as described at GTC 2013 on slide 27. Therefore this renderer requires two additional buffers: one encoding our object's matrix and material index assignments, and one encoding the scene's drawcalls as GL_DRAW_INDIRECT_BUFFER.
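
For the uborange technique, the alignment requirement means the itemSize used in the offset expression must itself be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. The arithmetic can be sketched as follows; the helper names are hypothetical, and the alignment value would be queried from the driver at runtime rather than hard-coded.

```cpp
#include <cassert>
#include <cstddef>

// Rounds size up to the next multiple of alignment.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
  assert(alignment > 0);
  return ((size + alignment - 1) / alignment) * alignment;
}

// Byte stride between consecutive items so that every item starts at an
// offset that is legal for glBindBufferRange(GL_UNIFORM_BUFFER, ...).
inline std::size_t itemStride(std::size_t rawItemSize, std::size_t offsetAlignment) {
  return alignUp(rawItemSize, offsetAlignment);
}

// Offset of item "index" inside the big buffer.
inline std::size_t bindOffset(std::size_t index, std::size_t rawItemSize,
                              std::size_t offsetAlignment) {
  return index * itemStride(rawItemSize, offsetAlignment);
}
```

With a 256-byte alignment, a single 4x4 float matrix (64 bytes) still occupies a 256-byte slot, so it can be worth packing related data (for example additional per-matrix values) into the same item.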

A hybrid approach, where the parameter index like "indexedmdi" is used for matrices and uborange bind is used for materials, is not yet implemented, but would be a good compromise.
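
The two extra buffers used by indexedmdi can be sketched on the CPU side as below. `DrawElementsIndirectCommand` follows the layout defined by OpenGL for indirect indexed draws; the `Assignment` and `Part` types and the function name are assumptions for illustration. The sketch relies on the fact that an instanced vertex attribute (divisor 1) is fetched at `baseInstance` plus the instance index, so `baseInstance` can select the per-draw assignment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One entry of a GL_DRAW_INDIRECT_BUFFER for glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
  std::uint32_t count;
  std::uint32_t instanceCount;
  std::uint32_t firstIndex;
  std::int32_t  baseVertex;
  std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "must be tightly packed");

// Hypothetical per-draw payload, read by the shader as an "instanced" attribute.
struct Assignment { std::int32_t matrixIndex; std::int32_t materialIndex; };

// Hypothetical CPU-side description of one drawable part.
struct Part {
  std::uint32_t firstIndex;
  std::uint32_t indexCount;
  std::int32_t  matrixIndex;
  std::int32_t  materialIndex;
};

struct IndirectScene {
  std::vector<Assignment>                  assignments;  // contents of the attribute buffer
  std::vector<DrawElementsIndirectCommand> commands;     // contents of the indirect buffer
};

IndirectScene buildIndirectScene(const std::vector<Part>& parts) {
  IndirectScene out;
  for (const Part& p : parts) {
    assert(p.indexCount > 0);
    const std::uint32_t slot = static_cast<std::uint32_t>(out.assignments.size());
    out.assignments.push_back(Assignment{p.matrixIndex, p.materialIndex});
    // instanceCount is 1; baseInstance points the instanced attribute at "slot".
    out.commands.push_back(DrawElementsIndirectCommand{p.indexCount, 1u, p.firstIndex, 0, slot});
  }
  return out;
}
```

This sketch stores one assignment per draw; identical assignments could be shared to shrink the attribute buffer.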

The following renderers make use of the NV_command_list extension. In principle they behave as "uborange", however all buffer bindings and drawcalls are encoded into binary tokens that are submitted in bulk. In preparation for drawing, the appropriate stateobjects are created and reused when rendering (one for lines and one for triangles). While stateobject capturing is not extremely expensive, it is still best to cache the stateobjects across frames.

  • tokenbuffer Similar to indexedmdi we create a buffer that describes our scene by storing all the relevant token commands. This buffer is filled only once and then later reused.
  • tokenlist Instead of storing the tokens inside a buffer, we make use of the commandlist object, and create and compile one for each shademode for later reuse. Every time our state changes (for instance, when resizing FBOs), we have to recreate these lists, which makes this approach less flexible than tokenbuffer, but faster when there are lots of state changes within the list.
  • tokenstream This approach does not reuse the tokens across frames, but instead dynamically creates the tokenstream every frame. By default, the demo fills and submits tokens in chunks of 256 KB; better values may exist depending on the scene.
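
The chunked submission used by tokenstream can be sketched independently of OpenGL. The class below is a hypothetical helper, not the sample's implementation: tokens are treated as opaque byte blobs, and the `submit` callback stands in for whatever uploads and draws a chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Collects raw token bytes and hands them out in chunks of at most chunkSize bytes.
class ChunkedTokenStream {
public:
  using Submit = std::function<void(const std::uint8_t* data, std::size_t size)>;

  ChunkedTokenStream(std::size_t chunkSize, Submit submit)
      : m_chunkSize(chunkSize), m_submit(std::move(submit)) {
    m_bytes.reserve(chunkSize);
  }

  // Appends one token; a token is never split across two chunks.
  void append(const void* token, std::size_t size) {
    assert(size <= m_chunkSize);
    if (m_bytes.size() + size > m_chunkSize) flush();
    const std::uint8_t* src = static_cast<const std::uint8_t*>(token);
    m_bytes.insert(m_bytes.end(), src, src + size);
  }

  // Submits whatever is pending; call once more at the end of the frame.
  void flush() {
    if (m_bytes.empty()) return;
    m_submit(m_bytes.data(), m_bytes.size());
    m_bytes.clear();
  }

private:
  std::size_t               m_chunkSize;
  Submit                    m_submit;
  std::vector<std::uint8_t> m_bytes;
};
```

A stream created with `ChunkedTokenStream stream(256 * 1024, submitFn);` mirrors the default chunk size mentioned above; smaller chunks let the GPU start earlier, larger chunks reduce the number of submissions.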

Performance

All timings are preliminary results for Timer Draw on a win7-64, i7-860, Quadro K5000 system.

Important Note About Timer Query Results: The GPU time reported below is measured via timer queries. Those values can, however, be skewed by CPU bottlenecks: the "begin" timestamp may be part of a different command submission to the GPU than the "end" timestamp, so a long delay on the CPU side between those submissions also increases the reported GPU time. That is why, in CPU-bottlenecked scenarios with tons of OpenGL commands, the GPU times below are close to the CPU times.

scene statistics:
geometries:    110
materials:      66
nodes:        5004
objects:      2497

tokenbuffer/glstream complexities:
type: solid              materialgroups | drawcall individual
commandsize:                     347292 | 1301692
statetoggles:                         1 | 1
tokens:                 
GL_DRAW_ELEMENTS_COMMAND_NV:      11103 |   68452
GL_ELEMENT_ADDRESS_COMMAND_NV:      807 |     807
GL_ATTRIBUTE_ADDRESS_COMMAND_NV:    807 |     807
GL_UNIFORM_ADDRESS_COMMAND_NV:     8988 |   11289
GL_POLYGON_OFFSET_COMMAND_NV:         1 |       1

type: solid w edges
commandsize:                     629644 | 2534412
statetoggles:                      4994 |    4994
tokens:
GL_DRAW_ELEMENTS_COMMAND_NV:      22281 |  136750
GL_ELEMENT_ADDRESS_COMMAND_NV:      807 |     807
GL_ATTRIBUTE_ADDRESS_COMMAND_NV:    807 |     807
GL_UNIFORM_ADDRESS_COMMAND_NV:    15457 |   20036
GL_POLYGON_OFFSET_COMMAND_NV:         1 |       1

As one can see from the statistics the key difference is the number of drawcalls for the hardware:

  • materialgroups: ~ 10 000 drawcalls (inner two columns)
  • drawcall individual: ~ 70 000 drawcalls (rightmost two columns)

shademode: solid

renderer                  materialgroups        drawcall individual
(times in microseconds)   GPU time   CPU time   GPU time   CPU time
ubosub                        1550       1870       6000       7420
uborange                      1010       1890       3720       7660
uborange_bindless             1010       1200       2560       4900
indexedmdi                    1120       1200       2080       1100
tokenstream                    860        300       1520       1400
tokenbuffer                    780        <10       1230        <10
tokenlist                      780        <10        880        <10
tokenbuffer_cullsorted         540        120        760        120

The results are of course very scene dependent; this model was specifically chosen as it is made of many parts with very few triangles. If the complexity per drawcall were higher (say more triangles or complex shading), then the CPU impact would be lower and we would be GPU-bound. However the CPU time recovered by faster submission mechanisms can always be used elsewhere. So even if we are GPU-bound, time should not be wasted.

We can see that the "token" techniques do very well and are never CPU-bound, and the "indexedmdi" technique is also quite good. This technique is especially useful for very high-frequency parameters, for example when rendering "id-buffers" for selection, but also for matrix indices. For general use-cases, working with uborange binds is recommended.

shademode: solid with edges

Unless "sorted", around 5000 toggles are done between triangles/line rendering. The shader is manipulated through an immediate vertex attribute to toggle between lit/unlit rendering respectively.

renderer                  materialgroups        drawcall individual
(times in microseconds)   GPU time   CPU time   GPU time   CPU time
ubosub                        2890       3350      13000      15000
uborange                      2150       3700      12500      15200
uborange_bindless             2150       2640       8300      10000
indexedmdi                    2340       2200       4050       2050
tokenstream                   1860       1250       3360       3200
tokenbuffer                   1750        450       2650        350
tokenlist                     1650        <10       1890        <10
tokenbuffer_cullsorted         770        120       1250        120

Compared to the "solid" results, the tokenbuffer and tokenlist techniques show a greater difference in CPU time.

Model Explosion View

The simple viewer allows you to add animation to the scene and artificially increase scene complexity via "clones".

xplodeclones

To "emulate" typical interaction where users might move objects around or have animated scenes, the sample also implements the matrix transform system sketched on slide 30.

The effect works by first moving all object matrices a bit (xplode-animation.comp.glsl), and afterwards the transform hierarchy is updated via a system that is implemented in the transformsystem.cpp / hpp files.

The code is not particularly tuned but naively assumes that upper levels of the hierarchy contain fewer nodes than lower levels (pyramid). Therefore it uses leaf-processing (which redundantly calculates matrices) instead of level-wise processing for the first 10 levels, to avoid dependencies (one small compute task waiting for the previous). Later levels are always processed level-wise. A better strategy would be to switch between the two approaches based on the actual number of nodes per level. The shaders for this are transform-leaves.comp.glsl and transform-level.comp.glsl.

The hierarchy is managed by nodetree.cpp/hpp, which stores the tree as an array of 32-bit values. Each value represents a node and encodes its "level" in the hierarchy in 8 bits and its parent index in the remaining bits. This means you can traverse from a node up to the root:

// sample traversal of "idx" node to root
self = array[idx];
while( self.level != 0) {
  self = array[self.parent];
}
// self is now the top root for the idx node
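
A compilable version of this traversal might look like the following sketch. The bit placement is an assumption: the text only states that 8 bits hold the level and the rest hold the parent index, not which end of the word is used, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed layout: level in the low 8 bits, parent index in the upper 24 bits.
const std::uint32_t kLevelBits = 8;
const std::uint32_t kLevelMask = (1u << kLevelBits) - 1u;

inline std::uint32_t packNode(std::uint32_t level, std::uint32_t parent) {
  assert(level <= kLevelMask && parent < (1u << (32 - kLevelBits)));
  return (parent << kLevelBits) | level;
}
inline std::uint32_t nodeLevel(std::uint32_t node)  { return node & kLevelMask; }
inline std::uint32_t nodeParent(std::uint32_t node) { return node >> kLevelBits; }

// Walks from idx up to its root (level 0) and returns the root's index.
inline std::uint32_t findRoot(const std::vector<std::uint32_t>& nodes, std::uint32_t idx) {
  assert(idx < nodes.size());
  while (nodeLevel(nodes[idx]) != 0) {
    idx = nodeParent(nodes[idx]);
    assert(idx < nodes.size());
  }
  return idx;
}
```

With 8 bits for the level, the hierarchy can be at most 256 levels deep, and 24 bits allow roughly 16 million nodes.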

The nodetree also stores two node index lists for each level: one storing all nodes of a level, and one for all leaves in this level. We feed these two index lists to the appropriate shader. When leaf processing is used we append the leaves level-wise, which should minimize divergence within a warp (ideally most threads have the same number of levels to ascend in the hierarchy).
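
The difference between level-wise and leaf processing can be sketched on the CPU using these two kinds of index lists. To keep the sketch short, a scalar offset stands in for a matrix and addition stands in for matrix concatenation; the `Node` type and function names are hypothetical, and the actual work happens in the compute shaders named earlier.

```cpp
#include <cassert>
#include <vector>

struct Node { int parent; float local; };  // parent == -1 marks a root

// Level-wise: each level can only start once the previous level is finished
// (one dispatch per level on the GPU), but every node is computed exactly once.
std::vector<float> updateLevelWise(const std::vector<Node>& nodes,
                                   const std::vector<std::vector<int> >& nodesPerLevel) {
  std::vector<float> world(nodes.size(), 0.0f);
  for (const std::vector<int>& level : nodesPerLevel) {
    for (int n : level) {
      const Node& node = nodes[n];
      assert(node.parent < static_cast<int>(nodes.size()));
      world[n] = (node.parent < 0 ? 0.0f : world[node.parent]) + node.local;
    }
  }
  return world;
}

// Leaf processing: every leaf ascends to the root on its own (one thread per
// leaf on the GPU), so there are no dependencies between leaves, but shared
// ancestors are recomputed redundantly.
std::vector<float> updateFromLeaves(const std::vector<Node>& nodes,
                                    const std::vector<int>& leaves) {
  std::vector<float> world(nodes.size(), 0.0f);
  std::vector<int> path;
  for (int leaf : leaves) {
    path.clear();
    for (int n = leaf; n >= 0; n = nodes[n].parent) path.push_back(n);
    float accumulated = 0.0f;
    for (std::vector<int>::reverse_iterator it = path.rbegin(); it != path.rend(); ++it) {
      accumulated += nodes[*it].local;
      world[*it] = accumulated;  // ancestors get rewritten with the same value
    }
  }
  return world;
}
```

Both functions produce the same result; which one is faster depends on how many nodes each level holds, which is why switching based on the node count per level is suggested above.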

Many CAD applications tend to use double-precision matrices, and the system could be adjusted for this. For rendering, however, float matrices should be used. To account for large translation values, one could run a concatenation of view-projection (double) and object-world-matrix (double) per-frame and generate the matrices (float) for actual vertex transforms. To improve memory performance, it might be beneficial to use double only for storing translations within the matrices.
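
The benefit of concatenating in double precision before converting to float can be sketched with two translation matrices whose large values cancel. The matrix type and helper names below are assumptions for illustration, and the view-projection is reduced to a plain translation.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Column-major 4x4 matrix: element (row, col) is stored at m[col * 4 + row].
template <typename T>
struct Mat4 { std::array<T, 16> m; };

template <typename T>
Mat4<T> translation(T x, T y, T z) {
  Mat4<T> r = Mat4<T>();  // value-initialized, all zeros
  r.m[0] = r.m[5] = r.m[10] = r.m[15] = T(1);
  r.m[12] = x; r.m[13] = y; r.m[14] = z;
  return r;
}

template <typename T>
Mat4<T> multiply(const Mat4<T>& a, const Mat4<T>& b) {
  Mat4<T> r = Mat4<T>();
  for (int col = 0; col < 4; ++col)
    for (int row = 0; row < 4; ++row)
      for (int k = 0; k < 4; ++k)
        r.m[col * 4 + row] += a.m[k * 4 + row] * b.m[col * 4 + k];
  return r;
}

inline Mat4<float> toFloat(const Mat4<double>& d) {
  Mat4<float> f = Mat4<float>();
  for (int i = 0; i < 16; ++i) {
    assert(std::isfinite(d.m[i]));
    f.m[i] = static_cast<float>(d.m[i]);
  }
  return f;
}

// Per-frame step: concatenate in double, then narrow once for the vertex transform.
inline Mat4<float> renderMatrix(const Mat4<double>& viewProj, const Mat4<double>& world) {
  return toFloat(multiply(viewProj, world));
}
```

With a view translation of -1.0e7 and an object translation of 1.0e7 + 0.25, `renderMatrix` keeps the 0.25 offset, whereas converting both matrices to float first and multiplying afterwards loses it, because floats near 1.0e7 are spaced 1.0 apart.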

Note: Only the GPU matrices are updated. CPU techniques such as "ubosub" will not show animations.

Sample Highlights

This sample is a bit more complex than most others as it contains several subsystems. Don't hesitate to contact the author if something is unclear (commenting was not a priority ;) ).

csfviewer.cpp

The principal setup of the sample is in this main file. However, most of the interesting bits happen in the renderers.

  • Sample::think - prepares the frame and calls the renderer's draw function

renderer... and tokenbase...

Each renderer has its own file and is derived from the Renderer class in renderer.hpp

  • Renderer::init - some renderers may allocate extra buffers or create their own data structures for the scene.
  • Renderer::deinit
  • Renderer::draw

The renderers may have additional functions. The "token" renderers using NV_command_list or "indexedmdi", for instance, must create their own scene representation.

cadscene...

The "csf" (cadscene file) format is a simple binary format that encodes a scene as is typical for CAD. It closely matches the description at the beginning of the readme. It is not very sophisticated, and is meant for demo purposes.

Note: The geforce.csf.gz assembly binary file that ships with this sample may NOT be redistributed.

nodetree... and transform...

Implement the matrix hierarchy updates as described in the "model explosion view" section.

cull... and scan...

For files related to culling, it is best to refer to the gl occlusion culling sample, as it leverages the same system and focuses on just that topic.

renderertokensortcull.cpp implements RendererCullSortToken::CullJobToken::resultFromBits, which contains the details of how the occlusion results are handled in this sample. The implementation uses the "raster" "temporal" approach.

statesystem... nvtoken... and nvcommandlist...

These files contain helpers when using the NV_command_list extension. Please see gl commandlist basic for a smaller sample.

Building

Ideally, clone this and other interesting nvpro-samples repositories into a common subdirectory. You will always need nvpro_core. The nvpro_core is searched either as a subdirectory of the sample, or one directory up.

If you are interested in multiple samples, you can use the build_all CMake project as the entry point. It also gives you options to enable or disable individual samples when creating the solutions.

Related Samples

gl commandlist basic illustrates the core principle of the NV_command_list extension. gl occlusion culling also uses the occlusion system of this sample, but in a simpler usage scenario.

When using classic scenegraphs, there is typically a lot of overhead in traversing the scene. For this reason, it is highly recommended to use simpler representations for actual rendering. Consider using flattened hierarchies, arrays, memory-friendly data structures, data-oriented design patterns, and similar techniques. If you are still working with a classic scenegraph, then nvpro-pipeline may provide some acceleration strategies to avoid full scenegraph traversal. Some of these strategies are also described in this GTC 2013 presentation.
