Huh, I would have expected the request for additional registers to be linked int...

phire · 2025-04-06T01:55:13 1743904513

The feature was clearly designed for the ray-tracing use case, where the waves are independent and really shouldn't run into deadlocks as long as one wave can continue. That's why they have the deadlock avoidance mode, it guarantees one wave will always continues.

I believe they are using this to store the BVH traversal stack in registers, with allocation size depending on traversal depth. Eventually all the rays in the wave will hit or miss and free those registers.

Busy waiting is only the naive solution. I bet they went with the software based solution so they can experiment with other approaches. Such as dynamically switching to the previous approach of a stack in shared-memory when registers run out. Or maybe they can sort the rays, moving the long-running rays out, grouping them together into waves dedicated for deep traversal.

phire · 2025-04-06T04:10:24 1743912624

Actually, I'm wrong.

AMD's deep dive [1] says register use is low during BVH traversal, so they are still using shared memory for the stack. According to the ISA doc [2] they have a special mode for the stack that overwrites older entries instead of overflowing and a fallback stackless mode.

So the register allocation is actually increased when invoking the hit shader, allowing each hit shader to use a different number of VGPRs, spinning is probably the most logical approach. They probably don't even need the deadlock avoidance mode with their current implementation, which seems to allocate the full set of registers when calling the hit shader and only free them at the end.

[1] https://www.techpowerup.com/review/amd-radeon-rx-9070-series...

[2] See 12.5.3. DS Stack Operations for Ray Tracing - https://www.amd.com/content/dam/amd/en/documents/radeon-tech...