Different ways to implement a stack in the shader

I am still trying to resolve some copyright issues before I can the demo. It shouldn’t be more than a few days.

After some discussion with David Luebke from NVIDIA about how to implement a stack I decided to test a few things with a benchmark. This graph is the result:


This code show how the different stack implementations look like:


Some of the method require a specific threading mode. I have the classic dispatch method (ThreadMode=0) and a simple persistent thread method (ThreadMode=1) where the dispatch call spawns just enough threads to fill the GPU (at least it seems to fill my NV580) and the shader function doesn’t exit but iterates through all pixels with the same position within the tile.

The persistent thread code was meant to restrict the memory requirements for the UAV method, it’s not very sophisticated. I believe with some inter thread communication this could be improved a lot. D3D11 doesn’t have ballot but that functionality could be emulated with a few group shared memory instructions. The functionality would be the same but as it’s meant to be an optimization it’s might be not be worth testing the emulation – it would not prove anything if it’s slower.

Note that (for the local stack method) the FrameTime goes slightly down with a larger stack (tested sizes: 12, 16, 20). You might expect it to get slower as it uses more memory per stack and therefore gives less latency hiding to the memory requests. Here the demo actually makes use of the larger stack and gets more efficient as the alternative (using the grid with a linked list of nodes AND a tranformation matrix) adds overhead. It might be interesting to just increase the stack size without using the extra memory.

A lot of more experiments could be done. I would be curious how a pixel shader is doing. Here we would have to stick to the StackMode=0. The pixel shaer can auto tune the thread layout and it should have a better 2D layout for coherency. The coherency could also be improved with some extra shader code (Morton Code / Space Filling Curve). I also could implement a short stack with restart or a stackless method.

I was expecting ThreadMode 1 and 2 (both using group shared memory) to be more different as one has to fight with more bank conflicts but I guess that isn’t the main cost here.

The UAV layout could be changed to interleave the stack data differently but I always expected this method to be the slowest so I wasn’t giving it much attention. Looking at the graph I probably should reconsider (for a larger stack it beats the group shared memory).

Still – the local stack with a large size wins. For me that was the least expected one. It cuts from the latency hiding memory and allows less threads to be in flight. Depending on what the driver and the compiler are doing it might even generate a large switch statement to index into the registers. I have a NV580 and it has Compute 2.0 capability which would not allow indexing into register memory (see here). The driver and the compiler could also move the array into group shared memory but then I shouldn’t see a performance difference to the code where I do that explicitly. The switch statement could result in large binary conditional jumps, conditional register assignments (no branch) or a mix. I wish I could see the actual shader assembly. I don’t think the byte code would give me enough insight.

Special thanks to David Luebke and other NVIDIA folks for the recent discussions on this topic.

1 thought on “Different ways to implement a stack in the shader

  1. My first guess for the random-access local stack would have been that it is converted to ‘local memory’ just like in CUDA (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses), meaning it would in fact just be global memory with better locality guarantees. (Also, memory layout should be warp-optimized, i.e. 32-wide SIMD-optimized.) This would also fit your graph in some way. However, on second thought, I have never really investigated this for compute shaders, so it would be interesting to get some substantiated statement on this.

    On a related note, the CUDA programming guide states that the new Maxwell GPUs (Compute 5.x) no longer cache device memory reads in L1, not even local memory reads. Therefore, there could be much more benefit from using shared memory on these next generation of GPUs.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s