• Stars
    star
    589
  • Rank 75,909 (Top 2 %)
  • Language
    C++
  • License
    MIT License
  • Created over 3 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An implementation of [Jimenez et al., 2016] Ground Truth Ambient Occlusion, MIT license

XeGTAO

Introduction

XeGTAO is an open source, MIT licensed, DirectX/HLSL implementation of the Practical Realtime Strategies for Accurate Indirect Occlusion, GTAO [Jimenez et al., 2016], a screen space effect suitable for use on a wide range of modern PC integrated and discrete GPUs. The main benefit of GTAO over other screen space algorithms is that it uses a radiometrically-correct ambient occlusion equation, providing more physically correct AO term.

We have implemented and tested the core algorithm that computes and spatially filters the ambient occlusion integral and, optionally, directional GTAO component (bent normals / cones).

Our implementation relies on an integrated spatial denoising filter and will leverage TAA for temporal accumulation when available. When used in conjunction with TAA it is both faster, provides higher detail effect on fine geometry features and is more radiometrically correct than other common public SSAO implementations such as closed source [HBAO+], and open source [ASSAO].

High quality preset, computed at full resolution, costs roughly 1.4ms at 3840x2160 on RTX 3070, 0.56ms at 1920x1080 on RTX 2060 and 2.39ms at 1920x1080 on 11th Gen Intel(R) Core(TM) i7-1195G7 integrated graphics. Faster but lower quality preset is also available. Computing and filtering the directional component (bent normals) adds roughly 25% to the cost.

This sample project (Vanilla.sln) was tested with Visual Studio 2019 16.10.3, DirectX 12 GPU (Shader Model 6_3), Windows version 10.0.19041.

XeGTAO OFF/ON/ON+BentNormals comparison in Amazon Lumberyard Bistro; click on image to embiggen:
thumb1 thumb2 thumb3

AO term only, left: ASSAO Medium (~0.72ms), right: XeGTAO High (~0.56ms), as measured on RTX 2060 at 1920x1080: ASSAO vs GTAO

Implementation and integration overview

We focus on simplicity and ease of integration with all relevant code provided in a 2-file, header only-like format:

  • XeGTAO.h provides the glue between the user codebase and the effect; this is where macro definitions, settings, constant buffer updates and optional ImGui debug settings are handled.
  • XeGTAO.hlsli provides the core shader code for the effect.

These two files contain the minimum required to integrate the effect, with the amount of work depending on the specific platform and engine details. In an ideal case, the user codebase can include these two files above with little or no modifications, and provide additional resources including working textures, a constant buffer, compute shaders, as well as codebase-specific shader code used to load screen space normals and etc, as shown in the usage example (see vaGTAO.h, vaGTAO.cpp, vaGTAO.hlsl). https://github.com/GameTechDev/XeGTAO/blob/master/Source/Rendering/Shaders/vaGTAO.hlsl

The effect is usually computed just after the depth data becomes available (after the depth pre-pass or a g-buffer draw). It takes depth buffer and (optional) screen space normals as inputs, produces a single channel AO buffer as the output and consists of three separate compute shader passes:

  • PrefilterDepths pass: inputs depth buffer; performs input depth conversion to viewspace and generation of depth MIP chain; outputs intermediary viewspace depth buffer with a MIP chain
  • MainPass: inputs intermediary depth buffer and (optional) screen space normals; performs the core GTAO algorithm; outputs unfiltered AO term and intermediary edge information by the denoiser
  • Denoise: inputs unfiltered AO term and intermediary edge information; performs the spatial denoise filter; outputs the final AO term

Implementation details

Following is a list of implementation details and differences from the original GTAO paper [Jimenez et al., 2016]:

Automatic heuristic tuning based on ray-traced reference

In order to best reproduce the work from the original paper and tune the heuristics, we built a simple ray tracer within the development codebase that can render AO ground truth. The ray tracer uses cosine-weighted hemisphere Monte Carlo sampling to approximate Lambertian reflectance for a hemisphere defined by a given point and geometry normal, using near-field bounding long visibility rays.

reference raytracer
left: Reference diffuse-only raytracer, 512spp; right: XeGTAO High preset

Using the ray-traced output as a ground truth, we then tune the XeGTAO heuristics for a best match across several scenes and locations and different near-field bound radii settings. We rely in big part on an automatic system informally called auto-tune, where selected settings (such as thickness heuristic, radius multiplier, falloff range, etc.) can be automatically tuned together. Given min/max ranges for each setting, the auto-tune will run through all permutations across pre-defined scene locations, searching for the lowest overall average MSE between the XeGTAO and raytraced ground outputs. For practical reasons we employ a multi-pass search based on narrowing the setting ranges.

Denoising

Original GTAO implementation is described as using spatio-temporal sampling with temporal reprojection; our current approach uses only a 5x5 depth-aware spatial denoising filter and relies on TAA for the temporal component, when available. This is a compromise that allows for the effect to still be used by codebases that do not employ TAA. When TAA is available, we indirectly leverage it by enabling temporal noise. The downside is that we must keep temporal variance low enough to avoid having TAA mischaracterizing this noise as features, which limits the amount of temporal supersampling that we can leverage. Depending on user feedback and future experimentation we are likely to go with the combined spatio-temporal approach in the future.

denoising
left: raw 3 slices 6 samples per pixel (18spp) XeGTAO output; middle: +5x5 spatial denoiser; right: +TAA and temporal noise

Resolution and sampling

Original GTAO implementation runs at half-resolution with one slice per pixel, 12 samples per hemisphere slice (6 per side; the hemisphere slice term is defined in the original paper). We default to running at full resolution, 3 slices per pixel with 6 samples (3 per side) each for a total of 18spp, and also have a lower quality preset with 2 slices per pixel and 4 samples (2 per side) for a total of 8spp. The reasoning behind reduction in samples per slice is described in 'Thickness Heuristic' section. This balance is likely to change if we move to a combined spatio-temporal approach in the future.

Sample distribution

In order to better capture thin crevices and similar small features, we use x = pow( x, 2 ) distribution for samples along the slice direction, where x is a normalized screen space distance from the evaluated pixel's center to the maximum distance (representing the worldspace 'Effect radius'). This is another setting where we used auto-tune to find the most optimal value, which was around 2.1. We decided to round it down to 2 for simplicity and performance reasons.

denoising
different sample power distribution settings; left: setting of 1.0; right: setting of 2.0, clumping more samples around the center gives more detail to small feature shadows

Near-field bounding

Like the [Jimenez et al., 2016], we attenuate the effect from distant samples based on near-field occlusion radius setting ('Effect radius') over a a range ('Falloff range'). This provides stable and predictable results and is easier to use in conjunction with longer-range, lower-frequency GI. Unlike the original, we do not linearly interpolate the sample horizon angle cosine towards -1 but towards the hemisphere horizon, computed as cos(normal_angle+PI/2) in one direction and cos(normal_angle-PI/2) in the other.

reference raytracer
out of bounds sample interpolation, left: towards -1; middle: ray traced ground truth; right: ours, towards sample horizon, resulting in less detail loss, noticeable around the window and curtain areas

This makes the attenuation function independent from the projected normal vector, avoid haloing or loss of detail under certain view angles, providing results that are on average closer to the ground truth.

Thin occluder conundrum

The main difficulty of approximating AO from the depth buffer is that the depth buffer is effectively a viewspace heightmap and does not correctly represent the actual scene geometry. This leads to visual artifacts such as thin features at depth discontinuities casting too much occlusion (please see 'Height-field assumption considerations' from the [Jimenez et al., 2016]) or haloing effects. The larger the near-field bounding radius setting, the worse the mismatch usually is. Conversely, with a radius that is small in proportion to geometry features, and using various heuristics to minimize the side-effects, a reasonably good approximation can be achieved. There are other solutions to improving the quality of the source geometry representation (such as Multi-view AO, [Vardis et al. 2013], or the more recent Stochastic-Depth Ambient Occlusion [Vermeer et al, 2021] ) which we did not consider due to complexity as they require changes to the rendering pipeline, but which could certainly be adopted for use with XeGTAO.

The original paper describes a conservative thickness heuristic that is derived from the assumption that the thickness of an object is similar to its size in screen space; the end result of it is that "a single sample that is behind the horizon will not significantly decrease the computed horizon, but many of them (in e.g. a thin feature) will considerably attenuate it". In our experimentation we found that increasing the number of slices while undersampling the horizon search (using lower number of samples per slice) achieves very similar result with the same overall number of samples. This removes the need for the somewhat computationally expensive heuristic.

We also experimented with a different heuristic that biases the near-field bounding falloff along the view vector, and in effect reducing the impact of samples that are in front of the evaluated pixel's depth (closer to the camera plane). This provided results closer to the ground truth compared to the heuristic from the original paper and this is now exposed through the "Thin occluder compensation" setting. With 6 (3+3) samples per slice, the (auto-tuned) optimum setting value yields a relatively small improvement, so we disabled it by default for performance reasons. It can be easily enabled if higher quality is required.

reference raytracer
left: default 'Thin occluder compensation' of 0; middle: ray traced ground truth; right: 'Thin occluder compensation' of 0.7

Above image demonstrates two opposing scenarios: in the top row, even the default settings (left column) over-compensate the thin occluder issue due to shelves being very deep, and increasing thin occluder compensation setting (right column) serves only to further deviate from the ground truth (middle column). This is in contrast to the bottom row where pipe and chair legs are very thin, and a high occluder compensation setting (right column) matches ground truth more closely.

Sampling noise

As with any technique based on Monte Carlo integration, a a good sampling method can significantly reduce the number of samples needed for the same quality. The original GTAO paper describes a tileable spatial noise of 4x4 with 6 different temporal rotations.

For stratified sampling we map screen coordinates to a Hilbert curve index, using it to drive Martin Robert's R2 quasi-random sequence. This was inspired by an excellent shadertoy example by user 'paniq' with the only difference in our code being that we use a R2 sequence instead of R1 since we need two low-discrepancy samples for chosing slice angle and step offset. We use a 6 level Hilbert curve, providing a 64x64 repeating tile, and for the temporal component we add an offset of 288*(frameIndex%64), found empirically.

Before we settled on the Hilbert curve + R2 sequence, we used a 2 channel 64x64 tileable blue noise from Christoph Peters's blog to drive slice rotations and individual sample noise offsets. This worked well for spatial-only noise but adding temporal offsets/rotations caused overlaps which would often show as temporal artifacts. We then switched to a 3D noise (from the sequel blogpost which worked well with TAA but was fairly big in size and did not work well when using spatial-only filtering (to quote the blog, "Good 3D noise is a bad 2D blue noise").

Since computing Hilbert curve index in the compute shader adds measurable cost (~7%), we optionally precompute it into a lookup texture which reduces this overhead. C++/HLSL code to compute the Hilbert Index is available in XeGTAO.h and the user can choose between the (simpler) GPU arithmetic or (usually faster) LUT-based codepaths.

reference raytracer
5x5 spatial with 8 frame temporal filter, left: using hash-based pseudo-random noise; right: using Hilbert curve index driving R2 sequence

Memory bandwidth bottleneck

Most screen space effects are performance-limited by the available memory bandwidth and texture caching efficiency, and XeGTAO is no different.

One common approach, which we rely on, relies on a technique presented in Scalable Ambient Obscurance [McGuire et al, 2012] and involves pre-filtering depth buffer into MIP hierarchy, allowing the more distant (from the evaluated pixel's center) locations to be sampled using lower detail MIP level. We follow the same approach as in the paper, with the exception of the choice of the depth MIP filter, for which we use a weighted average of samples, with the weight determined by whether the depth difference from the most distant sample is within a predefined threshold (please refer to DepthMIPFilter in the code for details). Using the most distant sample introduces a natural thin occluder bias and is more stable under motion compared to rotated grid subsampling (from the SAO paper), while averaging provides least precision errors on most slopes.

reference raytracer
left: color-coded sample MIP levels; middle: example of detail loss with a too low 'Depth MIP sampling offset' of 2.0; right: depth MIP mapping disabled

'Depth MIP sampling offset' user setting controls the MIP level selection (mipLevel = max( 0, log2( sampleOffsetLength ) - DepthMIPSamplingOffset )). The lower the value, the lower detailed MIPs are used, reducing memory bandwidth but also reducing quality. It defaults to the value of 3.15 which is the point below which there is no measurable performance increase on any of the tested hardware.

Another popular solution to this problem is presented in Deinterleaved Texturing for Cache-Efficient Interleaved Sampling [Bavoil, 2014] and involves a divide and conquer technique where the working dataset is subdivided into smaller parts that are processed in sequence, ensuring a much better utilization of memory cache structures. The downside is that by the definition, the processing of one dataset part can only rely on sampling data from that part, which constrains the sampling pattern. This was a significant issue for GTAO with its specific sampling pattern (samples lie on a straight line, etc.) where constraining them to a subset of data significantly limited flexibility, consistency and amplified precision issues, so we could not use this approach in XeGTAO.

Bent normals

Support for directional component (bent normals) was added in XeGTAO version 1.30, and is directly based on the formulation from the original [Jimenez et al., 2016], page 15, 'Algorithm 2'. Since this feature increases overal effect computation time by roughly 25%, it is an optional feature. When enabled, bent normal vector is computed in the MainPass and denoised together with the occlusion term. The output is packed into first 3 channels of a R8G8B8A8 texture, with the occlusion term stored in the alpha channel.

Watch the video debug visualization of bent cone (green) and the shading normal (red); click for the video

In the Bistro scene lighting we use the AO term to attenuate diffuse and specular irradiance from light probes using the multi-bounce diffuse and GTSO approaches detailed in the original GTAO work. We also attenuate unshadowed direct lighting using the micro-shadowing approximation from Material Advances in Call of Duty: WWII [Chan 2018] and SIGGRAPH 2016: Technical Art of Uncharted. It should be noted that the sample's current AO term usage is somewhat ad-hoc and has itself not been matched to ground truth, and is not meant as a reference.

thumb1 thumb2 thumb3
XeGTAO OFF/ON/ON+BentNormals: directional component significantly improves contact shadow correctness for lights that lack explicit shadowing; click to enlarge

Misc

  • We added a global 'Final power' heuristic that modifies the visibility with a power function. While this has no basis in physical light transfer, we found that auto-tune can use it to achieve better ground truth match in combination with all other settings.
  • In order to minimize bandwidth use we rely on 16-bit floating point buffer to store viewspace depth. This does cause some minor precision quality issues but yields better performance on most hardware. It is however not compatible with built-in screen space normal map generator.
  • It is always advisable to provide screen space normals to the effect, but in case that is not possible we provide a built-in depth to normal map generator.
  • We have enabled fp16 (half float) precision shader math on most places where the loss in precision was acceptable; this provides 5-20% performance boost on various hardware that we have tested on but is entirely optional.

ASSAO vs GTAO
left: ASSAO Medium (~0.72ms*), right: XeGTAO High (~0.56ms*) (*as measured on RTX 2060 at 1920x1080)

FAQ

  • Q: It is still too slow for our target platform, what are our options?
  • A: The "Medium" quality preset is roughly 2/3 of the cost of the "High" preset (for ex., 1.5ms vs 2.2ms at 1920x1080, GTX 1050), while the "Low" quality preset is roughly 2/3 of the cost of the "Medium" preset. For anything faster we advise further reducing sliceCount (in the call XeGTAO_MainPass) at the expense of more noise, or using lower resolution rendering (half by half or checkerboard) and upgrading the denoiser pass with a bilateral upsample.
  • Q: Why is there support for both half (fp16) and single (fp32) precision shader paths?
  • A: While the quality loss on the fp16 path is minimal, we found that some GPUs can suffer from unexpected performance regression on it, sometimes depending on the driver version. For that reason, while enabled by default, we leave it as an optional switch.
  • Q: Any plans for a Vulkan port?
  • A: Upgrades to other platforms/APIs will be added based on interest. Please feel free to submit an issue with a request.

Version log

See XeGTAO.h

Authors

XeGTAO was created by Filip Strugar and Steve Mccalla, feel free to send any feedback directly to [email protected] and [email protected].

Credits

Many thanks to Jorge Jimenez, Xian-Chun Wu, Angelo Pesce and Adrian Jarabo, authors of the original paper. This implementation would not be possible without their seminal work.

Thanks to Trapper McFerron for implementing the DoF effect and other things, Lukasz Migas for his excellent TAA implementation, Andrew Helmer (https://andrewhelmer.com/) for help with the Owen-Scrambled Sobol noise sequences, Adam Lake and David Bookout for reviews, bug reports and valuable suggestions!

Many thanks to: Amazon and Nvidia for providing the Amazon Lumberyard Bistro dataset through the Open Research Content Archive (ORCA): https://developer.nvidia.com/orca/amazon-lumberyard-bistro; author of the spaceship model available on Sketchfab; Khronos Group for providing the Flight Helmet model and other reference GLTF models.

Many thanks to the developers of the following open-source libraries or projects that make the Vanilla sample framework possible:

References

License

Sample and its code provided under MIT license, please see LICENSE. All third-party source code provided under their own respective and MIT-compatible Open Source licenses.

Copyright (C) 2021, Intel Corporation

More Repositories

1

PresentMon

Capture and analyze the high-level performance characteristics of graphics applications on Windows.
C++
1,576
star
2

IntroductionToVulkan

Source code examples for "API without Secrets: Introduction to Vulkan" tutorial
C++
1,273
star
3

MaskedOcclusionCulling

Example code for the research paper "Masked Software Occlusion Culling"; implements an efficient alternative to the hierarchical depth buffer algorithm.
C++
592
star
4

GTS-GamesTaskScheduler

A task scheduling framework designed for the needs of game developers.
C++
437
star
5

ISPCTextureCompressor

ISPC Texture Compressor
C++
426
star
6

OcclusionCulling

Demonstrates a software (CPU) based approach to occllusion culling using multi-threading and SIMD instructions to improve performance.
C++
383
star
7

Intel-Texture-Works-Plugin

Intel has extended Photoshop* to take advantage of the latest image compression methods (BCn/DXT) via plugin. The purpose of this plugin is to provide a tool for artists to access superior compression results at optimized compression speeds within Photoshop*.
C++
247
star
8

ASSAO

Adaptive Screen Space Ambient Occlusion
C++
245
star
9

MetricsGui

Library of ImGui controls for displaying performance metrics.
C++
237
star
10

OutdoorLightScattering

Outdoor Light Scattering Sample
C++
236
star
11

DynamicCheckerboardRendering

Checkerboard Rendering and Dynamic Resolution Rendering in the DX12 MiniEngine
C++
174
star
12

OpenGL-ES-3.0-Deferred-Rendering

OpenGL ES 3.0 Deferred Renderer
C
142
star
13

CMAA2

Conservative Morphological Anti-Aliasing 2.0
C++
129
star
14

asteroids_d3d12

Intel Asteroids DirectX 12 Sample
C++
128
star
15

PracticalVulkan

Repository with code samples for "API without Secrets: The Practical Approach to Vulkan" series of articles.
C++
119
star
16

IntelShaderAnalyzer

Command line tool for offline shader ISA inspection.
C++
118
star
17

stardust_vulkan

The Stardust sample application uses the Vulkan graphics API to efficiently render a cloud of animated particles.
C
116
star
18

gpudetect

An example application that demonstrates how to detect which Intel GPU is present, as well as architecture-specific information such as how much memory is available.
C++
115
star
19

SamplerFeedbackStreaming

This sample uses D3D12 Sampler Feedback and DirectStorage as part of an asynchronous texture streaming solution.
C++
105
star
20

TAA

Intel® Graphics Optimized Temporal Anti-Aliasing (TAA)
C++
100
star
21

LightScattering

Source code for the light scattering sample
C++
96
star
22

VRS-DoF

Variable Rate Shading and Depth of Field
C++
95
star
23

CloudsGPUPro6

C++
91
star
24

DX12-Multi-Adapter

DirectX 12 Explicit Heterogeneous Multi-adapter Sample implementing Split Frame Rendering
C++
82
star
25

CloudySky

Cloud Rendering Sample
C++
79
star
26

XeSSUnrealPlugin

Intel® XeSS Plugin for Unreal* Engine
78
star
27

FlipModelD3D12

Interactive visualization for understanding swap chains in D3D12
C++
76
star
28

FaceMapping2

This is an improvement on the original FaceMapping code sample. This sample uses Intel RealSense to scan the user's face, and map it onto 3d head mesh.
C++
73
star
29

ClusteredShadingAndroid

Clustered shading on Android sample
C
68
star
30

HybridDetect

Heterogeneous & Homogeneous CPU Detect for Intel Processors
C++
65
star
31

DeferredCoarsePixelShading

Deferred Coarse Pixel Shading Source Code (For the article in GPU Pro 7)
C++
62
star
32

UnrealCapabilityDetect

A plugin for system capability detect in Unreal Engine 4
C++
55
star
33

FaceMapping

The face mapping sample uses the 3D Scan module to scan the user's face and then map it onto an existing 3D head model. This technique does a "stone face" mapping that is not rigged or currently capable of animating.
C++
44
star
34

FaceTracking

Intel® RealSense™ SDK-Based Real-Time Face Tracking and Animation
Logos
42
star
35

AOIT-Update

Adaptive Order Independant Transparency Sample
C++
39
star
36

UE4_GPA_Plugin

Intel® Graphics Performance Analyzer plugin for Unreal Engine* 4
C++
39
star
37

DynamicResolutionRendering

DynamicResolutionRendering_source_V3_update
C++
32
star
38

UE4RealSensePlugin

This UE4 plugin provides support for the Intel RealSense SDK to Unreal Engine 4 developers by exposing features of the SDK to the Blueprints Visual Scripting System.
C++
28
star
39

Multi-Adapter-Particles

Demonstration of Integrated + Discrete Multi-Adapter modified from Microsoft's D3D12nBodyGravity
C++
26
star
40

D3D12VariableRateShading

A simple D3D12 example demonstrating how to use Intel Tier 1 Variable Rate Shading
C++
21
star
41

MetricsDiscoveryHelper

A wrapper for Intel(R) MetricsDiscovery API that simplifies some common tasks and provides a more unified interface across different graphics APIs.
C++
20
star
42

UnityPerformanceSandbox

Project that can be used to learn Graphics Performance Analyzer toolkit by following along Unity* Optimization Guide for Intel x86 Platforms article. https://software.intel.com/en-us/android/articles/unity-optimization-guide-for-x86-android-part-1
C#
17
star
43

ChatHeads

Chat Heads is a native sample that uses RealSense to overlay background segmented (BGS) player images on a 3D scene or video playback in a multiplayer scenario.
C++
16
star
44

VALAR

Velocity And Luminance Adaptive Rasterization
C++
13
star
45

GrassInstancing

Grass rendering using geometry instancing in Direct3D 10.
C++
13
star
46

OpenGLESTessellation

Sample demonstrating the use of tessellation shaders with OpenGLES
C
13
star
47

Windows-Desktop-Sensors

Sample demonstrating how to use sensors for Windows Desktop
C++
8
star
48

EZSIMD

C++
8
star
49

64-bit-Typed-Atomics-Extension

C
8
star
50

XeSS-VALAR-Demo

Mini-Engine Demonstration of Combining XeSS with VRS Tier 2.
C++
8
star
51

CPU_Capability_Tester

C#
7
star
52

RCRaceland

Unreal Engine 4 sample showing how to take advantage of the CPU for more realistic scenes
C++
5
star
53

CmdThrottlePolicy

sample showing how to use the DX12 CmdThrottlePolicy Extension
C
4
star
54

VALAR-API

C
3
star
55

InstantAccess_Tiling

C++
3
star
56

AdaptiveSync

Demo and Library for Adaptive Sync
C++
3
star
57

gametechdev.github.io

HTML
1
star