Bottom line: With new random number generator, RTOW-OptiX sample on Turing now runs in ~0.5 secs ….
Since several people have asked for Turing numbers for my “RTOW in OptiX” example I finally sat down and ran it. First result – surprise: In my original code there was hardly any difference between using Turing and Volta – and that just didn’t make sense. Sure, you do still need a special development driver to even use the Turing ray tracing cores from within OptiX, but I actually had that, so why didn’t it get faster? And sure, there’s only so much speedup you can except in a scene that doesn’t have any triangles at all, and only a very small number of primitives to start with. But still, that didn’t make sense. There also was hardly any difference between iterative and recursive versions … and none of that made sense whatsoever.
Well – in cases like that a good first step is always to have a look at the assembly (excuse me: PTX) code that one’s code is actually generating. In our OptiX example, that’s actually super-easy: Not only is PTX way easier to read than regular assembly, the very nature of OptiX’ “programs” approach also means that you don’t have to sift through an entire program’s worth of asm output to find the one function you’re interested in…. instead, you only look at the PTX code for the one kernel that you’re interested in. And even simpler, the cmakefile already generates all these ptx files (that’s the way OptiX works), so looking at that was very easy.
Now looking at the ray gen program, I was at first what, for lack of a better word, I can only call “dumbfounded”: thousands of lines of cryptic PTX code, with movs, xor’s, loads, and stores, all apparently randomly thrown together, and hardly anything that looked like “useful” code. Clearly my “actual” ray gen program was at the end of this file, and looked great – but what was all that other stuff?? No wonder this wasn’t any faster on Turing than on Volta – all it did was garbling memory!
Turns out the culprit was what I had absolutely not expected: CuRand. I hadn’t even known about curand before I saw Roger Allen’s CUDA example, but when I first saw it this looked like an easy-to-use equivalent to Pete’s use of drand48(), and simply used it for my sample, too. Now CuRand does indeed seem to be a very good random number generator, and to have some really nice properties – but it also has a very, very – did I say: very! – expensive set-up phase, where it’s taking something like a 25,000-sized scratchpad and garbling around in it. And since I ran that once per pixel it turns out that just initializing that random number generator was more expensive in this example than all rendering taken together ….
Of course, the solution to that was simple: Pete already used ‘drand48()’ in his reference CPU example, and though that function doesn’t exist in the CUDA runtime it’s trivially simple to implement. Throwing that into my example – and taking curand out – and lo and behold, my render time goes down to something like 0.5 sec. And in that variant I also see exactly what I had expected: that iterative is way faster than recursive, and Turing was way faster than Volta. Of course, changing the random number generator also changed the image (I haven’t looked in detail yet, but it “feels” as if the curand image was better), and has of course also made the Volta code faster. Either way – for now, 500ms is good with me 🙂
With that – back to work….