Simple C++ reflection (to generate HLSL constant buffer declaration)

Recently I was working a bit more on my Direct3D 11 shadow experiments code. I had to modify constant buffer layouts many times, which meant making many matching updates in C++ and HLSL. I was thinking about a simple solution that defines only one side and generates the other automatically. Removing the redundancy also prevents some bugs. I thought of many solutions and decided on using macros and some C++ code that generates the HLSL code through a simple form of reflection (getting type and name). This is how I define my constant buffer on the C++ side:

  ENTRY(XMMATRIX, c_WorldViewProj)
  ENTRY(XMMATRIX, c_InvViewProj)
  ENTRY(XMMATRIX, c_World2Shadow)
  ENTRY(float,    c_ShadowBias)
  ENTRY(XMFLOAT3, c_SunDirection) // normalized
  ENTRY(float,    c_FracTime)
  ENTRY(XMFLOAT4, c_BackbufferExtent) // w, h, 1/w, 1/h
  ENTRY(uint32,   c_iBackbufferExtentX)
  ENTRY(uint32,   c_iBackbufferExtentY)
  ENTRY(uint32,   c_iFrameId)
  ENTRY(uint32,   c_iDummy9123)
  ENTRY(XMFLOAT4, c_ShadowmapExtent) // w, h, 1/w, 1/h

This is the code needed for the reflection (WordPress ate all HTML-looking parts in the templates, so I had to post it as an image and a Word document, as it doesn’t support txt and cpp files either):


reflection (as word document)

And this is how I generate the HLSL code:

struct SStructReflection : public IStructReflection
{
  FILE* out;

  SStructReflection(FILE* InOut) : out(InOut) {}

  virtual void Start(const char* StructName)
  {
    fprintf(out, "\r\ncbuffer %s\r\n{\r\n", StructName);
  }
  virtual void Add(const char* Type, const char* Name)
  {
    fprintf(out, "\t%s %s;\r\n", Type, Name);
  }
  virtual void End()
  {
    fprintf(out, "};\r\n");
  }
};

void GenerateAutoCommon()
{
  FILE* out = 0;

  if(_wfopen_s(&out, L"Shaders\\AutoCommon.hlsl", L"wb") == 0)
  {
    SStructReflection Refl(out);

    // ... other constant buffers

    fclose(out);
  }
}

This could be improved further e.g.

  • Checking for proper alignment (simply count the size and assert)
  • Make reflection defines independent from HLSL (simply return the type as string)
  • Avoid the virtual (create a data structure that can be processed later on)
  • Hide dummy functions from auto completion (could read/parse C++ as text and avoid macros)

However, in its current form it fulfills its purpose, and it’s easy to get the idea (look at the defines: they create functions that produce a chain of calls to the reflection interface methods).
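The defines themselves are only in the linked Word document, but the underlying trick can be sketched as an X-macro list. Everything below is a hypothetical reconstruction: names like CONSTANT_BUFFER_ENTRIES, SConstants and SStringReflection are illustrative, only the IStructReflection methods match the interface above.

```cpp
#include <string>

struct IStructReflection
{
    virtual ~IStructReflection() {}
    virtual void Start(const char* StructName) = 0;
    virtual void Add(const char* Type, const char* Name) = 0;
    virtual void End() = 0;
};

// The buffer layout is written down exactly once, as an X-macro list.
#define CONSTANT_BUFFER_ENTRIES \
    ENTRY(float, c_ShadowBias) \
    ENTRY(float, c_FracTime)

// C++ side: the list expands to struct members.
#define ENTRY(Type, Name) Type Name;
struct SConstants { CONSTANT_BUFFER_ENTRIES };
#undef ENTRY

// Reflection side: the same list expands to a chain of interface calls.
#define ENTRY(Type, Name) Refl.Add(#Type, #Name);
inline void ReflectSConstants(IStructReflection& Refl)
{
    Refl.Start("SConstants");
    CONSTANT_BUFFER_ENTRIES
    Refl.End();
}
#undef ENTRY

// Minimal implementation that emits HLSL into a string (for illustration;
// the demo writes to a FILE* instead).
struct SStringReflection : public IStructReflection
{
    std::string out;
    virtual void Start(const char* StructName)
    { out += std::string("cbuffer ") + StructName + "\n{\n"; }
    virtual void Add(const char* Type, const char* Name)
    { out += std::string("\t") + Type + " " + Name + ";\n"; }
    virtual void End()
    { out += "};\n"; }
};
```

Any change to the ENTRY list now updates the C++ struct and the generated HLSL together.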


Shadows demo with source download


Short description:

  • GPU hard triangle shadows based on ray casting
  • Acceleration:  AABB tree with 2D light space grid
  • Direct3D 11, Visual Studio,  C++ and HLSL Source
  • Multiple 3d models, benchmark and profiling functionality
  • Purpose: Real time shadow research and collaboration

All the code and assets here that do not fall under other copyrights (see Copyright.txt in the archive) are released under the WTFPL.

Download the .zip and unpack it locally. Run the executable in the root folder.

Download (14.2 MB):

Different ways to implement a stack in the shader (Update)

As the graph I made for my last post was a bit hard to read I invested some time to visualize the data a bit better. Here is the result:


The numbers at the bottom are (top to bottom): StackSize, ThreadMode, StackMode

This is explained in more detail in my former blog post.

This next image shows the graphs overlaid to make it clearer how a 1D graph can represent many values. I simply vary some parameters over the benchmark duration and profile each frame. CPU and GPU stats are offset to counter the effect that the GPU data arrives some frames later.


The bars are easier to read, but the graph is the more honest way to represent the data: it shows the measuring noise and spikes. I did make another graph with many more data points to get better results, but I got large spikes. I was considering removing outliers, but the following frame seems to be too fast – fixing the former spike. Not very scientific, but it’s hard / impossible on PC / Windows to get good profiling. Here is the output:



Comparing the two runs (300 frames vs 5000 frames) results in very similar bars.

Different ways to implement a stack in the shader

I am still trying to resolve some copyright issues before I can release the demo. It shouldn’t be more than a few days.

After some discussion with David Luebke from NVIDIA about how to implement a stack I decided to test a few things with a benchmark. This graph is the result:


This code shows what the different stack implementations look like:


Some of the methods require a specific threading mode. I have the classic dispatch method (ThreadMode=0) and a simple persistent threads method (ThreadMode=1) where the dispatch call spawns just enough threads to fill the GPU (at least it seems to fill my GTX 580) and the shader function doesn’t exit but iterates through all pixels with the same position within the tile.

The persistent thread code was meant to restrict the memory requirements for the UAV method; it’s not very sophisticated. I believe with some inter-thread communication this could be improved a lot. D3D11 doesn’t have ballot, but that functionality could be emulated with a few group shared memory instructions. The functionality would be the same, but as it’s meant to be an optimization it might not be worth testing the emulation – it would not prove anything if it’s slower.

Note that (for the local stack method) the FrameTime goes slightly down with a larger stack (tested sizes: 12, 16, 20). You might expect it to get slower as it uses more memory per stack and therefore gives less latency hiding to the memory requests. Here the demo actually makes use of the larger stack and gets more efficient as the alternative (using the grid with a linked list of nodes AND a transformation matrix) adds overhead. It might be interesting to just increase the stack size without using the extra memory.

A lot more experiments could be done. I would be curious how a pixel shader is doing. Here we would have to stick to StackMode=0. The pixel shader hardware can auto-tune the thread layout and it should have a better 2D layout for coherency. The coherency could also be improved with some extra shader code (Morton code / space filling curve). I also could implement a short stack with restart or a stackless method.

I was expecting ThreadMode 1 and 2 (both using group shared memory) to be more different as one has to fight with more bank conflicts but I guess that isn’t the main cost here.

The UAV layout could be changed to interleave the stack data differently but I always expected this method to be the slowest so I wasn’t giving it much attention. Looking at the graph I probably should reconsider (for a larger stack it beats the group shared memory).

Still – the local stack with a large size wins. For me that was the least expected one. It cuts into the latency-hiding memory and allows fewer threads to be in flight. Depending on what the driver and the compiler are doing, it might even generate a large switch statement to index into the registers. I have a GTX 580 and it has Compute 2.0 capability, which would not allow indexing into register memory (see here). The driver and the compiler could also move the array into group shared memory, but then I shouldn’t see a performance difference to the code where I do that explicitly. The switch statement could result in large binary conditional jumps, conditional register assignments (no branch) or a mix. I wish I could see the actual shader assembly. I don’t think the byte code would give me enough insight.

Special thanks to David Luebke and other NVIDIA folks for the recent discussions on this topic.

Benchmark and Statistics

In order to find performance bottlenecks in the ray cast shadow algorithm I built myself a simple GPU and CPU profiler outputting data to .csv files (comma-separated values). With a few clicks I can open the file in Microsoft Excel and create a graph out of it. I could have spent much more time polishing that, but it serves the purpose well. I still want to explain it, as it could be useful information if you want to play with the demo or look at the code.

You can choose a benchmark from the menu, run it with a key press, and after a few seconds of watching the rendering you get two .csv files in the Stats folder. The system always advances a running number so files don’t get overwritten, and even if you have a file open in Excel you can write a new one. One file has all the data that has been recorded and the other file has only a few selected columns that are relevant for the benchmark.

CPU time is measured using QueryPerformanceCounter(). As the shadow ray casting algorithm is mostly GPU I only use it for load time (BVH creation is only done on first load) and the FrameTime.

The system measures GPU time with D3D11_QUERY_TIMESTAMP queries: to measure a section, two absolute time queries are used. There is one query pair for the whole GPU workload of a frame (FrameTimeGPU) and a pair around the passes I want to measure. To measure the time a section takes, the C++ code can be annotated like this:

SCOPED_GPU_TIMER(ClearReceiverGrid);

const UINT Value[] = { -1, -1, -1, -1 };
devcon->ClearUnorderedAccessViewUint(receiverGrid.UAVView->UAVView, Value);

SCOPED_GPU_TIMER is a define which creates an object on the stack that gets a timestamp in the constructor and the second timestamp in the destructor. All timestamp objects come from a manager which has many objects preallocated.
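As a rough sketch of that mechanism, with a mock IssueTimestampQuery standing in for the real D3D11 timestamp queries and illustrative names throughout:

```cpp
#include <string>
#include <vector>

// Mock: records issued timestamps so the mechanism is visible; the demo
// issues real D3D11_QUERY_TIMESTAMP queries here instead.
std::vector<std::string> g_TimestampLog;

void IssueTimestampQuery(const std::string& Label)
{
    g_TimestampLog.push_back(Label);
}

struct SScopedGpuTimer
{
    std::string Name;

    SScopedGpuTimer(const char* InName) : Name(InName)
    { IssueTimestampQuery(Name + ":begin"); }  // first timestamp in the constructor

    ~SScopedGpuTimer()
    { IssueTimestampQuery(Name + ":end"); }    // second timestamp in the destructor
};

// The define just creates the stack object; leaving the scope ends the section.
#define SCOPED_GPU_TIMER(Name) SScopedGpuTimer ScopedTimer_##Name(#Name)
```

The measured GPU work simply goes inside the scope between the two timestamps.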

GPUs don’t immediately process the commands from the CPU. The commands get preprocessed (often by another CPU) and buffered up until the GPU can process them (I measured a 2 frame latency). As I wanted to see the timing in real time on screen without causing stalls, I had to triple buffer the queries. To make the on-screen data more readable (avoiding flickering numbers), multiple data points are collected in an array and updated at a more human-friendly rate (every 250 ms, or earlier if the array size is exceeded). Most of the time the minimum is the most useful number, but to see spikes the maximum is also printed.
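The rotation of the triple buffered query sets can be illustrated like this (a sketch; the helper functions are not from the demo, and the 2 frame read-back latency is the one measured above):

```cpp
// Three query sets are rotated: the set written in frame N is safe to read
// back in frame N+2, which avoids stalling on the GPU.
const int NUM_QUERY_SETS = 3;

// the query set written this frame
int WriteSet(int FrameId) { return FrameId % NUM_QUERY_SETS; }

// the query set read back this frame, holding the results issued 2 frames ago
int ReadSet(int FrameId) { return (FrameId + 1) % NUM_QUERY_SETS; }
```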


In D3D you are supposed to have one D3D11_QUERY_TIMESTAMP_DISJOINT query around your timestamp queries to test if the collected data is valid. Only in rare cases do I see a disjoint, which means the numbers cannot be trusted (the data is not printed and you can read “Disjoint” instead). Switching applications and other things can cause this. I assume the GPU has to work for some other application and all timer queries would be distorted by that.

The .csv file has data from multiple sources (GPU time, CPU time, variables) and I want each line to show the collected data for the same frame. To achieve that, all data is collected in objects storing the frame number, the data and the column identifier. This way data can be dumped to memory without bothering about the order. Before writing, the elements get sorted by the frame number (named Progress). To test the system I made a benchmark causing heavy workload at frame 10 / 20 / 30 using an expensive compute shader / pixel shader / CPU code. The image shows what the data looks like when viewed in Excel (-1 means no data available):
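A minimal sketch of that collection scheme (the names besides Progress are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One recorded data point; entries can be dumped in any order.
struct SStatEntry
{
    int Progress;        // frame number
    std::string Column;  // column identifier, e.g. "FrameTimeGPU"
    float Value;
};

std::vector<SStatEntry> g_Stats;

void RecordStat(int Progress, const char* Column, float Value)
{
    SStatEntry Entry = { Progress, Column, Value };
    g_Stats.push_back(Entry);
}

// Before writing the .csv, sort all entries by frame number so each line
// can show the data collected for the same frame.
void SortStats()
{
    std::stable_sort(g_Stats.begin(), g_Stats.end(),
        [](const SStatEntry& A, const SStatEntry& B)
        { return A.Progress < B.Progress; });
}
```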


Notice the compute shader time was measured (not -1) but it seems to be a very fast operation. This is wrongly measured and related to using compute shaders (and NVIDIA GPUs?). The frame timer however includes the workload as expected.

Another interesting point can be seen on the heavy CPU workload. The GPU frame time is invalid for two frames. It looks like the GPU drained as no work was coming from the CPU and 2 frames later it started picking up work again. I guess this is a D3D or D3D driver property.

In order to find the right thread group size I decided to make one benchmark which varies the numbers over time. I can look for the best frame time and pick the right settings.


In the diagram above you can see ComputeExtentX and ComputeExtentY change over time. The actual group size is computed from those variables (2 << ComputeExtent). In this case you can see the best performance is achieved with

ComputeExtentX=2, ComputeExtentY=3 and ComputeExtentX=2, ComputeExtentY=4

which is a thread group size of 8×16 and 8×32. Note that I picked the number ranges carefully to get all data in one chart (ms).
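To make the mapping explicit, here is the group size computation as a tiny helper (illustrative, not from the demo):

```cpp
// Each benchmark variable maps to a power-of-two group dimension.
int GroupSize(int ComputeExtent)
{
    return 2 << ComputeExtent;  // 2 * 2^ComputeExtent
}
// GroupSize(2) = 8, GroupSize(3) = 16, GroupSize(4) = 32
```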

As both the shader code and the C++ code need those numbers, the system has to compile new shaders for each combination. I tried to make this fairly automatic to save programming time at the cost of some wasted CPU performance (sprintf, map search, memory allocations). Shaders are cached in an STL map based on a string that is also added to the front of the HLSL code. This simplified C++ code shows how a shader is set:

char Defines[256];
sprintf(Defines,
	"#define COMPUTE_EXTENTX %d\n"
	"#define COMPUTE_EXTENTY %d\n",
	g_ComputeExtentX, g_ComputeExtentY);
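The caching itself might look roughly like this. This is a sketch: ShaderHandle, CompileShader and GetShader are stand-ins for the demo's real types and functions, with the compile call mocked out to make the behavior visible.

```cpp
#include <map>
#include <string>

typedef int ShaderHandle;

static int g_CompileCount = 0;

// stand-in for the real D3D compile call; counts invocations
ShaderHandle CompileShader(const std::string& /*Source*/)
{
    return ++g_CompileCount;
}

// shaders cached in an STL map, keyed by the define string
std::map<std::string, ShaderHandle> g_ShaderCache;

ShaderHandle GetShader(const std::string& Defines, const std::string& Body)
{
    std::map<std::string, ShaderHandle>::iterator It = g_ShaderCache.find(Defines);
    if (It != g_ShaderCache.end())
        return It->second;                 // cache hit: reuse the compiled shader

    // cache miss: the define block is prepended to the HLSL source
    ShaderHandle Shader = CompileShader(Defines + Body);
    g_ShaderCache[Defines] = Shader;
    return Shader;
}
```

Requesting the same define combination twice only compiles once; a new combination triggers a fresh compile.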

The last benchmark is interesting because it varies the local stack size that is compiled into the shader:


A smaller stack requires fewer temporaries, which allows the GPU to hide more latency when doing memory requests (less waiting can lead to better throughput). On the other hand a small stack requires more instances (see a former post) to be rendered. All instances are stored in a fixed size buffer. For small stack sizes this buffer overflows and artifacts are rendered, so valid rendering only begins at frame 93. You can see the BuildInstanceGrid time get smaller with a larger stack size (fewer instances, less overlap). Around 16 seems to be a good stack size and it doesn’t get much better after that. This obviously depends on the scene, implementation, GPU and driver (here NVIDIA GTX 580, driver 344.11). I haven’t done a lot of testing on other hardware but I’ve seen a comparable AMD card running with similar performance.

To get a graph in Excel:

  • Open the csv file (see spread sheet above)
  • Select all (Ctrl+Shift+End)
  • Insert Scatter / Scatter with Straight Lines

Using the light space 2D grid to accelerate the ray cast


A single Chinese house results in a few rectangles in the grid. Many houses need more rectangles. (lighter brown shows more overdraw in the 2D grid)

The algorithm makes use of a 2D grid that is projected from the light source perspective. The resolution (currently 1024×1024) is only important for memory consumption and performance; it does not affect the quality. To avoid artifacts the grid needs to cover all receiver points. At the moment this is ensured by manual tweaking, but that can be improved by adapting to the receiver points (which are given if we do deferred shadows). In its simplest form we can just create a 2D bounding box from all receiver points, but other methods are possible. I experimented with two 1D histograms to warp the receiver grid (see Rectilinear Texture Warping for Fast Adaptive Shadow Mapping) to give more focus to the important areas. It worked very well but the histogram had a cost. With a reduced depth buffer and some approximations this can be done faster. I might come back to that.

Each frame the light space 2D grid is cleared and filled with all shadow casters. A grid cell stores a linked list of all objects covering that cell (similar to OIT). Each shadow caster is a static mesh that has a transformation matrix. A linked list entry stores the next pointer and two integers: one is the AABB node id defining the start node for ray casting, and the other is the instance id, which is needed to rotate the ray into the object space of that mesh.
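A possible layout for such an entry (field names are illustrative):

```cpp
#include <cstdint>

// One linked list entry in a grid cell, as described above.
struct SGridListEntry
{
    uint32_t Next;        // index of the next entry in this cell's list
    uint32_t AABBNodeId;  // start node for the ray cast in the AABB tree
    uint32_t InstanceId;  // selects the matrix to rotate the ray into object space
};
```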

Not using a single bounding volume hierarchy also allows for a simple way to rotate/scale/translate static meshes. Animated meshes (skinned meshes) are not yet handled with that.

In its first iteration the algorithm rendered each shadow caster with a single draw call. Each draw call was a 2D grid aligned rectangle covering the AABB conservatively (we need every grid cell that is touched by the object). This can be done by projecting each vertex into light space, scaling the position to the grid resolution and computing a 2D bounding box from all corners. The 2D bounding box corners are snapped to the grid (round down for min and up for max) to ensure conservative rasterization. The actual math there can be optimized, but another method might do better:

We could render 6 triangles with the actual AABB orientation. This would cover the actual bounding box more tightly (6 triangles as we don’t need back faces). Special vertex shader code would be needed to flip the 6 triangles (to account for the missing faces) and move the vertices to ensure conservative coverage. Covering that and other primitives (spheres, cones) could make for a nice GPUGem/Pro article (can be useful for tiled based lighting).
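The conservative snapping for the rectangle method described above can be sketched like this (a sketch assuming the positions are already projected into light space and scaled to grid units; the light space projection itself is omitted):

```cpp
#include <algorithm>
#include <cmath>

struct SGridRect { int MinX, MinY, MaxX, MaxY; };

// Compute a conservative grid rectangle from projected corner positions:
// 2D bounding box of all corners, rounded down for min and up for max.
SGridRect ComputeConservativeRect(const float* X, const float* Y, int Count)
{
    float MinX = X[0], MinY = Y[0], MaxX = X[0], MaxY = Y[0];
    for (int i = 1; i < Count; ++i)
    {
        MinX = std::min(MinX, X[i]); MaxX = std::max(MaxX, X[i]);
        MinY = std::min(MinY, Y[i]); MaxY = std::max(MaxY, Y[i]);
    }

    SGridRect R;
    R.MinX = (int)std::floor(MinX); R.MinY = (int)std::floor(MinY);
    R.MaxX = (int)std::ceil(MaxX);  R.MaxY = (int)std::ceil(MaxY);
    return R;
}
```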

For each caster we have a object space AABB (axis aligned bounding box) tree precomputed and stored in GPU memory (in a buffer together with other AABB trees). Each caster can be rendered to the 2d grid using the root node bounding box but we also can traverse the tree and render multiple AABB of some lower level nodes.

The algorithm can be adjusted to scatter less (left) or more (right) allowing for a smaller stack during the ray cast.


With a few more boxes scattered in pre pass we can remove some traversal during ray casting and reduce the needed stack size (more temporary registers can cause the GPU to run slower as it has less memory to hide memory fetch latency).

A newer iteration of the algorithm has all the information needed to scatter all the bounding boxes on the GPU prepared to be rendered with a single draw call using instancing. At the moment the vertex count per instance is low (4) but if that becomes a bottleneck this can be improved (32 or 64 would be better).

Side note: I love that rectangle rasterization became a GPU primitive (GL_NV_fill_rectangle). When Direct3D 10 decided to remove point sprites I was just about to use them. By sticking to GPU friendly sizes (e.g. 16×16) we can scatter to tiles using the frame buffer blend operations while maintaining order (classic UAV scatter cannot do that). It was always possible to rasterize the same with 2 triangles, but the diagonal edge would waste performance (some quads are not covered, warps might not be filled up, and dynamic branching cannot be as coherent when traversing tile data).


  • For a low resolution grid or small objects the rectangle should be more efficient than the 6 vertex mesh.
  • The mentioned 1D histogram method only works well with 2D rectangles, combining both seems possible but quite a bit more complex.
  • I found the free Chinese house 3D model useful during development (not closed, small and large details, not too large, some non clean areas, some extreme vertex normal usage).

Accurate hard shadows in any distance – if you stick to flat shading

The image shows a Japanese wooden bench by Ben Tate that was released for free.

This picture series demonstrates the shadow quality of the algorithm. When getting closer to one of the 200 bench models you can see the tiny metal pin casting a perfect hard shadow. Note the flat shaded look. With vertex shading enabled the accurate shadows suffer a bit: Lambert shading using a pixel interpolated normal and the triangle accurate shadows don’t match well in all cases. This happens very rarely but it can be very noticeable when seen on a human face. This problem can be reduced with a bias on the Lambert shading result (making the darks a bit larger) or other methods. It’s surprisingly hard to model without causing such problems. Even a box with beveled edges can show some soft curvature on the flat sides. In offline rendering you might just subdivide the mesh a bit more, but for real-time rendering this might be too expensive. Unfortunately I don’t have a good solution for this problem. I would like to avoid a bias on the ray start location as this can cause problems elsewhere. I think it should be possible to offset the ray start to match the shading.

Area light source shading makes this even worse (as Brian Karis recently pointed out): the bias trick fights the shading.

BTW: In case you noticed the flat shading brightness variation: this is due to my crappy method of reconstructing the triangle normal with ddx/ddy. This can be fixed and is not related to the shadow algorithm.

Does anyone have a good solution/reference for this problem?