I am still trying to resolve some copyright issues before I can release the demo. It shouldn’t take more than a few days.
After some discussion with David Luebke from NVIDIA about how to implement a stack I decided to test a few things with a benchmark. This graph is the result:
This code shows what the different stack implementations look like:
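The demo’s actual code isn’t reproduced here, but the three storage variants could be sketched roughly like this (`STACK_SIZE`, `GROUP_SIZE`, `ROOT_NODE` and all function names are illustrative assumptions, not the demo’s identifiers):

```hlsl
// Hypothetical sketch of the three stack storage variants.

#define STACK_SIZE 16
#define GROUP_SIZE 64

// Variant A: local stack - a per-thread array, ideally held in registers
// (or spilled to local memory by the compiler).
void TraverseLocal()
{
    uint stack[STACK_SIZE];
    uint top = 0;
    stack[top++] = ROOT_NODE;      // push
    while (top > 0)
    {
        uint node = stack[--top];  // pop
        // ... visit node, push children ...
    }
}

// Variant B: group shared memory stack - each thread owns a slice of one
// groupshared array.
groupshared uint gsStack[GROUP_SIZE * STACK_SIZE];

void TraverseGroupShared(uint threadIndex)
{
    uint top = 0;
    gsStack[threadIndex * STACK_SIZE + top++] = ROOT_NODE;
    // ... pops read from gsStack[threadIndex * STACK_SIZE + top] ...
}

// Variant C: UAV stack - each thread owns a slice of a global buffer
// sized ThreadCount * STACK_SIZE.
RWStructuredBuffer<uint> StackUAV;
```

The only real difference between the variants is where the per-thread slice lives; the push/pop logic stays the same.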
Some of the methods require a specific threading mode. I have the classic dispatch method (ThreadMode=0) and a simple persistent-thread method (ThreadMode=1), where the dispatch call spawns just enough threads to fill the GPU (at least it seems to fill my NV580) and the shader function doesn’t exit but instead iterates through all pixels that share the same position within their tile.
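A minimal sketch of that persistent-thread loop, assuming 8×8 tiles and assumed names like `GroupsInFlight`, `TotalTiles`, `TilesX`, and `TracePixel` (not the demo’s exact code):

```hlsl
// ThreadMode=1 sketch: only enough groups are dispatched to fill the GPU;
// each thread then loops over all tiles, handling the pixel at its fixed
// position within the tile, instead of exiting after a single pixel.
[numthreads(8, 8, 1)]
void CSMain(uint3 groupId : SV_GroupID, uint3 threadInGroup : SV_GroupThreadID)
{
    for (uint tile = groupId.x; tile < TotalTiles; tile += GroupsInFlight)
    {
        uint2 tileOrigin = uint2(tile % TilesX, tile / TilesX) * uint2(8, 8);
        uint2 pixel = tileOrigin + threadInGroup.xy;
        TracePixel(pixel); // hypothetical per-pixel traversal function
    }
}
```

The point of the loop is that the UAV stack only needs to be sized for the threads actually in flight, not for every pixel on screen.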
The persistent-thread code was meant to restrict the memory requirements of the UAV method; it’s not very sophisticated. I believe it could be improved a lot with some inter-thread communication. D3D11 doesn’t have ballot, but that functionality could be emulated with a few group shared memory instructions. The functionality would be the same, but as it’s meant to be an optimization, it might not be worth testing the emulation – it would prove nothing if it’s slower.
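One way such a ballot emulation could look in D3D11 compute, for a 32-thread group (purely a sketch; `gsBallot` and the lane numbering are assumptions, and larger groups would need one mask per 32-thread slice):

```hlsl
groupshared uint gsBallot;

// Emulate ballot(predicate): every thread contributes one bit, and every
// thread reads back the full 32-bit mask of which lanes passed.
uint EmulatedBallot(bool predicate, uint laneIndex)
{
    if (laneIndex == 0)
        gsBallot = 0;
    GroupMemoryBarrierWithGroupSync();

    if (predicate)
        InterlockedOr(gsBallot, 1u << laneIndex);
    GroupMemoryBarrierWithGroupSync();

    return gsBallot;
}
```

The two barriers and the atomic are exactly the overhead a hardware ballot would avoid, which is why a slow emulation wouldn’t prove much.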
Note that (for the local stack method) the FrameTime goes slightly down with a larger stack (tested sizes: 12, 16, 20). You might expect it to get slower, as a larger stack uses more memory per thread and therefore leaves less latency hiding for the memory requests. But here the demo actually makes use of the larger stack and gets more efficient, because the alternative (using the grid with a linked list of nodes AND a transformation matrix) adds overhead. It might be interesting to increase just the stack size without using the extra memory.
Many more experiments could be done. I would be curious how a pixel shader would do; there we would have to stick to StackMode=0. The pixel shader can auto-tune the thread layout, and it should have a better 2D layout for coherency. The coherency could also be improved with some extra shader code (Morton code / space-filling curve). I could also implement a short stack with restart, or a stackless method.
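For the Morton code idea, a 2D index can be computed with the standard bit-interleaving trick (generic code, not from the demo):

```hlsl
// Spread the lower 16 bits of x out with a zero between each bit
// (the standard "Part1By1" bit trick).
uint Part1By1(uint x)
{
    x &= 0x0000ffff;
    x = (x ^ (x << 8)) & 0x00ff00ff;
    x = (x ^ (x << 4)) & 0x0f0f0f0f;
    x = (x ^ (x << 2)) & 0x33333333;
    x = (x ^ (x << 1)) & 0x55555555;
    return x;
}

// 2D Morton code: consecutive indices along this curve map to pixels that
// stay close together in 2D, which should help ray coherency.
uint MortonEncode2D(uint2 p)
{
    return Part1By1(p.x) | (Part1By1(p.y) << 1);
}
```

Remapping the flat thread index through the inverse of this curve would give each warp a compact 2D footprint instead of a long scanline.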
I was expecting ThreadMode 1 and 2 (both using group shared memory) to differ more, as one has to fight with more bank conflicts, but I guess that isn’t the main cost here.
The UAV layout could be changed to interleave the stack data differently, but I always expected this method to be the slowest, so I didn’t give it much attention. Looking at the graph, I probably should reconsider (for a larger stack it beats the group shared memory).
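The two obvious layouts differ only in the indexing (a sketch; `ThreadCount` and the function names are assumptions):

```hlsl
RWStructuredBuffer<uint> StackUAV;

// Blocked layout: each thread owns a contiguous slice of the buffer.
uint BlockedIndex(uint thread, uint depth)
{
    return thread * STACK_SIZE + depth;
}

// Interleaved layout: all entries at the same depth sit next to each other,
// so neighbouring threads pushing/popping at the same depth hit adjacent
// addresses and their memory accesses can coalesce.
uint InterleavedIndex(uint thread, uint depth)
{
    return depth * ThreadCount + thread;
}
```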
Still – the local stack with a large size wins. For me that was the least expected result. It cuts into the memory available for latency hiding and allows fewer threads to be in flight. Depending on what the driver and the compiler are doing, it might even generate a large switch statement to index into the registers. I have an NV580, which has Compute 2.0 capability and would not allow indexing into register memory (see here). The driver and the compiler could also move the array into group shared memory, but then I shouldn’t see a performance difference against the code where I do that explicitly. The switch statement could result in large binary conditional jumps, conditional register assignments (no branch), or a mix. I wish I could see the actual shader assembly; I don’t think the byte code would give me enough insight.
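The switch-statement lowering I’m speculating about would effectively scalarize the array into individual registers, something like this (pure speculation about codegen, written at shader level for illustration):

```hlsl
// If stack[] lives in registers stack0..stack15, a dynamic read stack[i]
// could be lowered to a branch over all cases (or to a tree of conditional
// selects with no branches at all).
uint ReadStack(uint i)
{
    switch (i)
    {
        case 0:  return stack0;
        case 1:  return stack1;
        // ... one case per register-resident element ...
        default: return stack15;
    }
}
```

Seeing the real shader assembly is the only way to tell which of these forms the compiler actually picked.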
Special thanks to David Luebke and other NVIDIA folks for the recent discussions on this topic.