vk_order_independent_transparency

Demonstrates seven different techniques for order-independent transparency (OIT) in Vulkan.

[Screenshot: a thousand semitransparent spheres on a gray background, with a user interface in the top-left corner.]

About

This sample demonstrates seven different algorithms for rendering transparent objects without requiring them to be sorted in advance. Six of these algorithms produce ground-truth images if given enough memory, while a seventh produces fast and memory-efficient but approximate results. (Note that sorting alone isn't enough to blend transparent objects correctly, since the painter's algorithm can fail, while these six approaches can blend objects correctly.)

This is useful whether you're rendering skyscraper facades, automobile exteriors, or rows of glasses on a table. This sample shows these techniques applied to hundreds of overlapping transparent and opaque spheres. It also shows how they can be implemented in Vulkan, such as by using subpass inputs for Weighted, Blended Order-Independent Transparency.

These techniques were presented in Christoph Kubisch's GTC 2014 talk, "Order Independent Transparency In OpenGL 4.x", which you can find at http://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf.

You can also hover over any of the elements in the UI inside the sample to find out more about what they do.

Algorithm Descriptions

Overview

This sample implements seven OIT algorithms: Simple, Linked List, Loop32, Loop64, Spinlock, Interlock, and Weighted, Blended Order-Independent Transparency (WBOIT). These operate per sample or per pixel, depending on the antialiasing mode.

Six of these (all but WBOIT) sort each fragment's color information by depth, so long as they have space to store all of the separate pieces of information. The amount of space used to store fragment information can be configured using the GUI. When they run out of space, they tail blend the remaining fragments using normal, order-dependent transparency directly onto the color buffer (using premultiplied alpha). Then they blend the sorted fragments on top. However, while Linked List, Loop32, Loop64, Spinlock, and Interlock always sort the frontmost few fragments per pixel/sample (tail blending the backmost fragments), Simple sorts the first fragments it processes per pixel/sample.
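As a minimal sketch of the premultiplied-alpha blend that tail blending relies on, the "over" operator can be written as follows. The `Color` struct and the `blendOver` name are illustrative assumptions and are not taken from the sample's source.

```cpp
#include <cassert>

// RGBA color with premultiplied alpha: r, g, and b are already scaled by a.
struct Color {
  float r, g, b, a;
};

// "Over" operator for premultiplied colors: draws src on top of dst.
// The result depends on which argument is src, which is why unsorted
// tail blending is order-dependent.
Color blendOver(Color src, Color dst) {
  assert(src.a >= 0.0f && src.a <= 1.0f);
  const float k = 1.0f - src.a;  // how much of dst remains visible
  return {src.r + k * dst.r, src.g + k * dst.g, src.b + k * dst.b, src.a + k * dst.a};
}
```

Swapping the two arguments generally gives a different color, which is the reason the sorted fragments are blended separately on top.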

WBOIT, on the other hand, uses a constant amount of space per pixel/sample, but weights fragments by depth instead of sorting them before blending.

Here's a quick overview of the properties of each algorithm. See the algorithm descriptions below for more details:

| Name | Correctness Bound By | MSAA Support | Bytes per Pixel or Sample | Stability Between Frames | Guaranteed Sorts Front | Number of Transparent Draws | Additional Extensions Required |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Simple | OIT_LAYERS, draw order | Yes | 8*OIT_LAYERS+4, or 16*OIT_LAYERS+4 (with antialiasing masks) | No | No | 1 | No |
| Linked List | OIT_LAYERS and A-buffer size | Yes | 16 (per element) + 4 | No | Yes | 1 | No |
| Loop32 | OIT_LAYERS | No | 8*OIT_LAYERS+4 | Yes | Yes | 2 | No |
| Loop64 | OIT_LAYERS | No | 8*OIT_LAYERS+4 | Without Tail Blend | Yes | 1 | Yes |
| Spinlock | OIT_LAYERS | Yes | 8*OIT_LAYERS+12, or 16*OIT_LAYERS+12 (with antialiasing masks) | Without Tail Blend | Yes | 1 | No |
| Interlock | OIT_LAYERS | Yes | 16*OIT_LAYERS+8, or 32*OIT_LAYERS+8 (with antialiasing masks) | With "Interlock Is Ordered" Checked | Yes | 1 | Yes |
| WBOIT | Approximation | Yes | 20 | Yes | Yes | 1 | No |
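To get a feel for the memory figures in the table, here is a small worked calculation for the Simple algorithm. It assumes that the total storage is simply the per-pixel figure multiplied by the pixel count; the function names are hypothetical and do not come from the sample.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per pixel (or sample) for Simple, following the table:
// 8*OIT_LAYERS+4, or 16*OIT_LAYERS+4 with antialiasing masks.
std::uint64_t simpleBytesPerPixel(std::uint64_t oitLayers, bool antialiasingMasks) {
  assert(oitLayers > 0);
  const std::uint64_t bytesPerLayer = antialiasingMasks ? 16 : 8;
  return bytesPerLayer * oitLayers + 4;
}

// Assumed total: per-pixel bytes times the number of pixels.
std::uint64_t simpleABufferBytes(std::uint64_t width, std::uint64_t height,
                                 std::uint64_t oitLayers, bool antialiasingMasks) {
  return width * height * simpleBytesPerPixel(oitLayers, antialiasingMasks);
}
```

Under these assumptions, a 1920x1080 image with OIT_LAYERS = 8 and no antialiasing masks needs 68 bytes per pixel, or 141,004,800 bytes in total.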

This sample stores the vertex and index data for all of its spheres in a single mesh. It draws the faces corresponding to the last (100 - percentTransparent)% of spheres using an opaque shader, then draws the first percentTransparent% of spheres using the algorithm's drawTransparent method.

Simple

For each pixel or sample, the color shader stores the colors and depths of the first OIT_LAYERS fragments it receives, and tail blends any subsequent fragments. The composite pass then sorts these fragments.

For instance, suppose OIT_LAYERS = 2 with no antialiasing, and a thread processes four (RGBA color, depth) pairs, (c4, 0.4), (c2, 0.2), (c1, 0.1), (c3, 0.3) (where c1...c4 are RGBA colors). The color shader would store (c4, 0.4), (c2, 0.2) in the A-buffer, and tail blend (c1, 0.1) followed by (c3, 0.3) onto the color buffer (out of order, which is generally the case in tail blending). The composite shader would then sort the A-buffer values to get (c2, 0.2), (c4, 0.4), then blend the colors from back to front. Note that in this case the frontmost fragment, (c1, 0.1), was drawn behind everything else, because there were more overlapping objects than the A-buffer had space for! This would usually result in visible artifacts, but this algorithm can also work well enough if there's minimal overlap, or if objects are sorted in advance.
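The example above can be replayed on the CPU with a short sketch. This is only an illustration of the store-then-sort idea for a single pixel; the `Fragment`, `SimpleResult`, and `simpleOit` names are hypothetical, and colors are represented by integer ids (c1 is 1, c2 is 2, and so on) rather than RGBA values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Fragment {
  int   colorId;  // stand-in for an RGBA color (c1 -> 1, c2 -> 2, ...)
  float depth;
};

struct SimpleResult {
  std::vector<Fragment> sorted;       // stored fragments after the composite sort, front to back
  std::vector<int>      tailBlended;  // ids of tail-blended fragments, in arrival order
};

SimpleResult simpleOit(const std::vector<Fragment>& arrivals, std::size_t oitLayers) {
  assert(oitLayers > 0);
  SimpleResult result;
  for (const Fragment& f : arrivals) {
    if (result.sorted.size() < oitLayers) {
      result.sorted.push_back(f);  // color pass: take the next free slot
    } else {
      result.tailBlended.push_back(f.colorId);  // color pass: no room left, tail blend
    }
  }
  // Composite pass: sort whatever was stored by depth.
  std::sort(result.sorted.begin(), result.sorted.end(),
            [](const Fragment& a, const Fragment& b) { return a.depth < b.depth; });
  return result;
}
```

Feeding it the four fragments in arrival order with `oitLayers = 2` stores c2 and c4 and tail blends c1 and c3, matching the walkthrough.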

Linked List

This algorithm builds a linked list of fragments for each pixel. To do this, it uses a single contiguous block of memory, an image storing the index of the head of each list, and a 1x1 image acting as an atomic counter containing the index of the first empty element in the block of memory. Index 0 serves as the list terminator.

Each thread running the color shader atomically increments the counter to get the index of the element. If there's space left in the buffer, it writes its data and a pointer to the previous head of its linked list into that location; otherwise, it tail blends the fragment. Each thread running the composite shader then iterates down the linked list, gathering and sorting the first OIT_LAYERS elements and tail-blending the rest.

For instance, suppose OIT_LAYERS=2 with no antialiasing, the A-buffer is 4 elements long (including one space for 0, the list terminator), and (color, depth) fragments corresponding to two pixels are processed as follows:

  • Pixel 1: (c4, 0.4)
  • Pixel 2: (c1, 0.1)
  • Pixel 1: (c2, 0.2)
  • Pixel 1: (c3, 0.3)

At the end of the color shader, the A-buffer will contain the following (color, depth, old offset) values:

| Index | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| Contents | Empty (list terminator) | (c4, 0.4, 0) | (c1, 0.1, 0) | (c2, 0.2, 1) |

(c3, 0.3) was tail blended because the A-buffer had no free elements left. When compositing, Pixel 1's thread will start at element 3 and gather and sort (c2, 0.2) and (c4, 0.4), while Pixel 2's thread will start at element 2 and gather (c1, 0.1).
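The construction above can be sketched as a single-threaded CPU emulation. All names here (`Node`, `LinkedListABuffer`, `insert`, `gather`) are hypothetical; on the GPU the counter increment and the head update would be atomic image operations, which this sketch replaces with plain integer operations. Pixels are 0-indexed, so Pixel 1 and Pixel 2 from the example become pixels 0 and 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
  int           colorId;  // stand-in for an RGBA color
  float         depth;
  std::uint32_t next;     // index of the previous head; 0 terminates the list
};

struct LinkedListABuffer {
  std::vector<Node>          nodes;        // element 0 is reserved as the terminator
  std::vector<std::uint32_t> heads;        // per-pixel index of the list head (0 = empty)
  std::uint32_t              counter = 1;  // index of the first empty element

  LinkedListABuffer(std::size_t numNodes, std::size_t numPixels)
      : nodes(numNodes), heads(numPixels, 0) {
    assert(numNodes >= 1);
  }

  // Returns false when the buffer is full and the fragment must be tail blended.
  bool insert(std::size_t pixel, int colorId, float depth) {
    const std::uint32_t index = counter++;  // an atomic increment on the GPU
    if (index >= nodes.size()) return false;
    nodes[index] = Node{colorId, depth, heads[pixel]};
    heads[pixel] = index;  // an atomic exchange on the GPU
    return true;
  }

  // Walks one pixel's list from its head, returning color ids in list order.
  std::vector<int> gather(std::size_t pixel) const {
    std::vector<int> ids;
    for (std::uint32_t i = heads[pixel]; i != 0; i = nodes[i].next) ids.push_back(nodes[i].colorId);
    return ids;
  }
};
```

The composite step would then sort the gathered fragments by depth and keep the first OIT_LAYERS of them; that part is omitted here.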

Loop32

This algorithm uses 32-bit atomic operations to first sort the depths of each pixel's frontmost OIT_LAYERS fragments. It then matches colors to depths, and blends the fragments in order.

For instance, given the four fragments in the Simple example with OIT_LAYERS = 2 without antialiasing, the depth shader would compute that the frontmost sorted depths are (0.1, 0.2). This step would also tail blend (c4, 0.4) and (c3, 0.3). It would then match the colors to the depths to get (c1, c2), and then blend these together.

The final decision on whether each fragment should be included in the list is made deterministically for all fragments, based on their depths, before the color pass, and the remaining fragments are tail blended in primitive order. As a result, the output is guaranteed to be stable between frames, as long as multiple fragments for a single pixel don't share the same depth.
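One common way to sort depths with 32-bit atomics is an atomic-min insertion loop, sketched below as a CPU emulation. This is an assumption about the general technique rather than a copy of the sample's shader: `depthToBits`, `atomicMin`, `insertDepth`, and `kEmpty` are hypothetical names, and `atomicMin` is emulated with a compare-exchange loop because `std::atomic` has no built-in minimum operation in the C++ standards this sketch targets.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;  // marks an unused depth slot

// For non-negative floats, the raw bit pattern orders the same way as the float.
std::uint32_t depthToBits(float depth) {
  assert(depth >= 0.0f);
  std::uint32_t bits;
  std::memcpy(&bits, &depth, sizeof(bits));
  return bits;
}

// Stores min(slot, value) and returns the value the slot held before.
std::uint32_t atomicMin(std::atomic<std::uint32_t>& slot, std::uint32_t value) {
  std::uint32_t old = slot.load();
  while (value < old && !slot.compare_exchange_weak(old, value)) {
  }
  return old;
}

// Inserts z into a front-to-back sorted list of slots, dropping the farthest depth.
void insertDepth(std::vector<std::atomic<std::uint32_t>>& slots, std::uint32_t z) {
  for (auto& slot : slots) {
    const std::uint32_t old = atomicMin(slot, z);
    if (old == kEmpty || old == z) break;  // filled an empty slot, or depth already present
    z = std::max(old, z);                  // carry the farther depth to the next slot
  }
}
```

With two slots initialized to `kEmpty` and depths arriving as 0.4, 0.2, 0.1, 0.3, the slots end up holding 0.1 and 0.2, as in the example.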

Loop64

Loop32 uses three shaders, and requires drawing transparent objects twice. If the device supports the VK_KHR_shader_atomic_int64 extension, then we can pack colors and depths together into a 64-bit integer, and sort colors and depths together by sorting the 64-bit integers. This lets us draw transparent objects only once.

For instance, given the four fragments in the Simple example with OIT_LAYERS = 2 without antialiasing, the first shader could compute that the frontmost sorted depths and colors are ((c1, 0.1), (c2, 0.2)), tail blending (c4, 0.4) and (c3, 0.3). It would then blend the sorted colors together.
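The packing idea can be sketched as follows. The bit layout (depth in the high 32 bits, an 8-bit-per-channel color in the low 32 bits) is an assumption made for illustration, as are the function names; the text above does not specify the layout the sample uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Packs depth into the high 32 bits and a packed RGBA8 color into the low 32 bits,
// so that comparing the 64-bit values compares depths first.
std::uint64_t packFragment(float depth, std::uint32_t rgba8) {
  assert(depth >= 0.0f);  // bit patterns only order correctly for non-negative floats
  std::uint32_t depthBits;
  std::memcpy(&depthBits, &depth, sizeof(depthBits));
  return (static_cast<std::uint64_t>(depthBits) << 32) | rgba8;
}

std::uint32_t unpackColor(std::uint64_t packed) {
  return static_cast<std::uint32_t>(packed & 0xFFFFFFFFu);
}

float unpackDepth(std::uint64_t packed) {
  const std::uint32_t depthBits = static_cast<std::uint32_t>(packed >> 32);
  float depth;
  std::memcpy(&depth, &depthBits, sizeof(depth));
  return depth;
}
```

Because the depth occupies the most significant bits, a 64-bit atomic minimum on the packed value keeps the nearer fragment regardless of its color.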

Spinlock

This algorithm maintains a sorted list of the frontmost OIT_LAYERS fragments per pixel or sample using insertion sort. However, inserting elements into a list (and pushing all of the other elements back) is not thread-safe. This algorithm solves this problem by implementing a spinlock per pixel using atomic operations, which permits only one thread per pixel to insert elements at a time.

For instance, imagine the following scenario with OIT_LAYERS=2 without antialiasing, for a single pixel. Each of the threads is being run by a different warp.

  • Thread 1 starts processing the fragment (c3, 0.3). It enters the critical section. The A-buffer area for this pixel is still empty, ((0,0,0,0), 1), ((0,0,0,0), 1).
  • Thread 2 starts processing the fragment (c1, 0.1). It sees that the critical section is occupied and starts spin waiting.
  • Thread 3 starts processing the fragment (c2, 0.2). It sees that the critical section is occupied and starts spin waiting.
  • Thread 1 inserts (c3, 0.3) and leaves the critical section. The A-buffer area for this pixel is now (c3, 0.3), ((0,0,0,0), 1).
  • Thread 3 sees that the critical section is unoccupied and enters the critical section.
  • Thread 2 sees that the critical section is occupied and keeps spin waiting.
  • Thread 3 inserts (c2, 0.2) into the first position and leaves the critical section. The A-buffer area for this pixel is now (c2, 0.2), (c3, 0.3).
  • Thread 4 starts processing the fragment (c4, 0.4). It sees that it would be behind the last fragment in the A-buffer and tail blends (c4, 0.4), then exits.
  • Thread 2 sees that the critical section is unoccupied and enters the critical section. It inserts (c1, 0.1) into the first position, removing and tail blending (c3, 0.3). It then leaves the critical section, finishing execution. The A-buffer area for this pixel is now (c1, 0.1), (c2, 0.2).
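The scenario above can be sketched on the CPU with a compare-exchange spinlock guarding a sorted insertion. This is a simplified illustration under stated assumptions: `PixelList` and `insert` are hypothetical names, empty slots are represented by color id 0 at depth 1, and the early-out taken by Thread 4 happens inside the lock here rather than before it. Real GPU spinlocks also need care with how threads in the same warp diverge, which this sketch does not model.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>

struct Fragment {
  int   colorId;  // stand-in for an RGBA color; 0 means "empty"
  float depth;
};

template <std::size_t Layers>
struct PixelList {
  std::atomic<unsigned>        lock{0};  // 0 = free, 1 = held
  std::array<Fragment, Layers> slots;    // sorted front to back

  PixelList() { slots.fill(Fragment{0, 1.0f}); }

  // Inserts frag in sorted order and returns the fragment pushed out the back,
  // which the caller tail blends (colorId 0 means nothing needs blending).
  Fragment insert(Fragment frag) {
    assert(frag.depth >= 0.0f && frag.depth <= 1.0f);
    unsigned expected = 0;
    while (!lock.compare_exchange_weak(expected, 1u, std::memory_order_acquire)) {
      expected = 0;  // lock was held; reset and spin
    }
    for (Fragment& slot : slots) {
      if (frag.depth < slot.depth) std::swap(frag, slot);  // carry the farther one onward
    }
    lock.store(0, std::memory_order_release);
    return frag;
  }
};
```

Inserting the fragments in the order the threads acquired the lock above (c3, c2, c4, c1) leaves (c1, 0.1), (c2, 0.2) in the list, with c4 and c3 returned for tail blending.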

Interlock

If the device supports the VK_EXT_fragment_shader_interlock extension, then we can use invocation interlocking to prevent multiple invocations from entering a critical section, without having to implement a spin lock (and without requiring the threads to spin while they wait for the critical section to be unoccupied). This is somewhat similar to rasterizer order views in Direct3D 11.3.

To do this, we call beginInvocationInterlockARB or beginInvocationInterlockNV before entering the critical section (depending on whether the GLSL code supports the GL_ARB_fragment_shader_interlock or GL_NV_fragment_shader_interlock extension), then call endInvocationInterlockARB or endInvocationInterlockNV to end the critical section.

Additionally, the critical section is entered in primitive order, so the selection of the fragment to tail blend in each invocation is guaranteed to be stable between frames.

The shader code is similar to the example for Spinlock, except with spin locks replaced by invocation interlocks.

Weighted, Blended Order-Independent Transparency

Weighted, Blended Order-Independent Transparency (McGuire and Bavoil 2013) assigns a weight to each fragment, then commutatively blends their colors together. By assigning higher weights to more important fragments, it can emulate some of the effects of layered opacity - such as how closer fragments usually affect the final color more than further fragments - without having to sort the fragments. However, it can also diverge from the ground truth in scenarios where order strongly affects the result, such as when opacity is high.

Here, we compute a weight from each fragment's depth and RGBA color, as described in oitWeighted.frag.glsl. For each pixel, we then compute the following quantities, where color_0, color_1, ... are premultiplied RGBA colors, and weight_0, weight_1, ... are the floating-point weights of each fragment:

outColor = (weight_0 * color_0) + (weight_1 * color_1) + ...

i.e. the weighted premultiplied sum, and

outReveal = (1 - color_0.a) * (1 - color_1.a) * ...

i.e. one minus the opacity of the result. This can be done using blending modes. In the resolve pass, we then get the average weighted RGB color, outColor.rgb/outColor.a, and blend it onto the image with the opacity of the result, 1 - outReveal, using a variant of premultiplied alpha to use outReveal directly.
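The two accumulated quantities and the resolve step can be sketched on the CPU as follows. The weights are passed in as plain numbers because the sample's weight function lives in oitWeighted.frag.glsl and is not reproduced here; the `Rgba`, `WboitAccum`, `accumulateWboit`, and `resolveWboit` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Rgba {
  float r, g, b, a;  // premultiplied alpha
};

struct WboitAccum {
  Rgba  color{0.0f, 0.0f, 0.0f, 0.0f};  // outColor: weighted premultiplied sum
  float reveal = 1.0f;                  // outReveal: product of (1 - alpha)
};

// Both accumulations are commutative, so fragment order does not matter.
WboitAccum accumulateWboit(const std::vector<Rgba>& colors, const std::vector<float>& weights) {
  assert(colors.size() == weights.size());
  WboitAccum acc;
  for (std::size_t i = 0; i < colors.size(); ++i) {
    acc.color.r += weights[i] * colors[i].r;
    acc.color.g += weights[i] * colors[i].g;
    acc.color.b += weights[i] * colors[i].b;
    acc.color.a += weights[i] * colors[i].a;
    acc.reveal *= 1.0f - colors[i].a;
  }
  return acc;
}

// Resolve: average color outColor.rgb / outColor.a, blended with opacity 1 - outReveal.
Rgba resolveWboit(const WboitAccum& acc, Rgba background) {
  if (acc.color.a <= 0.0f) return background;  // nothing transparent covered this pixel
  const float opacity = 1.0f - acc.reveal;
  const float inv     = 1.0f / acc.color.a;
  return {acc.color.r * inv * opacity + background.r * acc.reveal,
          acc.color.g * inv * opacity + background.g * acc.reveal,
          acc.color.b * inv * opacity + background.b * acc.reveal,
          opacity + background.a * acc.reveal};
}
```

With a single fragment, the weight cancels out and the result matches ordinary premultiplied blending; with several fragments, the weights decide how much each one contributes to the average color.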

Code Layout

This sample's main class is declared in oit.h, which includes descriptions for most of its functions. Its function definitions are split into four files:

  • oitRender.cpp contains the most important drawing code.
  • oit.cpp shows the parts of Vulkan object creation that are important for OIT.
  • oitGui.cpp implements the GUI.
  • main.cpp contains the rest of the functions, most of which are not as important for OIT (such as framebuffer and generic graphics pipeline generation).

utilities_vk.h contains some Vulkan helper objects which are specific to this sample, but make object management a bit easier.

common.h contains defines shared between C++ and GLSL code.

The shader files are laid out as follows:

  • oitInterlock.frag.glsl, oitLinkedList.frag.glsl, oitLoop.frag.glsl, oitLoop64.frag.glsl, oitSimple.frag.glsl, oitSpinlock.frag.glsl, and oitWeighted.frag.glsl contain the main shader code for each of the seven algorithms. They all use the same structure, so you can diff them to see the variations in each implementation.
  • fullScreenTriangle.vert.glsl generates a full-screen triangle, used for screen-space passes.
  • object.vert.glsl is the vertex shader for rendering objects.
  • opaque.frag.glsl is the fragment shader for opaque objects, applying basic Gooch shading.
  • oitColorDepthDefines.glsl, oitCompositeDefines.glsl, and shaderCommon.glsl contain common defines and functions used across GLSL files.

Building

To build this sample, first install a recent Vulkan SDK. Then do one of the following:

You can then use CMake to generate and subsequently build the project.

Additional Notes

  • Since the GTC 2014 talk, at least two new OIT techniques have been presented that are also worth considering:
    • Moment-Based Order-Independent Transparency (Münstermann et al. 2018) is a family of algorithms that operate somewhat like WBOIT, but use higher-order moments to produce a more accurate image.
    • It's also possible to create an image with correctly rendered semitransparent objects directly without sorting using ray tracing, whether by computing attenuation after each intersection, or by using stochastic transparency. For more information and for a tutorial of how to implement stochastic transparency, please see the NVIDIA Vulkan Ray Tracing Tutorials.
  • The six A-buffer-based OIT algorithms implement antialiasing through manually blending MSAA sample masks, combining that with A-buffer storage per sample instead of per pixel, or through supersampling. However, there are also many other ways to implement antialiasing with order-independent transparency techniques, and both accuracy and performance should be considered in the context of application implementations.
  • Other layouts for the A-buffer, such as using image arrays, could be more performant in terms of clearing and cache efficiency.

For further reading, please see:

Multi-Layer Alpha Blending by Marco Salvi and Karthik Vaidyanathan: https://software.intel.com/content/www/us/en/develop/articles/multi-layer-alpha-blending.html

Efficient Layered Fragment Buffer Techniques by Pyarelal Knowles, Geoff Leach, and Fabio Zambetta: http://openglinsights.com/bendingthepipeline.html

Freepipe: programmable parallel rendering architecture for efficient multi-fragment effects by Fang Liu, Mengcheng Huang, Xuehui Liu, and Enhua Wu: https://sites.google.com/site/hmcen0921/cudarasterizer

k+-buffer: Fragment Synchronized k-buffer by Andreas A. Vasilakis and Ioannis Fudos: www.cgrg.cs.uoi.gr/wp-content/uploads/bezier/publications/abasilak-ifudos-i3d2014/k-buffer.pdf

Real-time concurrent linked list construction on the GPU by Jason C. Yang, Justin Hensley, Holger Grün, and Nicolas Thibieroz: https://dl.acm.org/doi/10.1111/j.1467-8659.2010.01725.x

Stochastic Transparency by Eric Enderton, Erik Sintorn, Peter Shirley and David Luebke: http://enderton.org/eric/pub/stochtransp-tvcg.pdf

Interactive Order-Independent Transparency by Cass Everitt: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.18.9286&rep=rep1&type=pdf
