HPG Paper Preprint: Using RTX cores for something other than ray tracing….


Hey – just a quick heads-up: our HPG (short) paper on using RTX cores for something other than ray tracing got accepted; and for everybody interested in that I just uploaded an “author’s preprint” (i.e., without the final edits) to my usual publications page (at http://www.sci.utah.edu/Publications).

The core idea of this paper was to play with the idea of “now that we have this ‘free’ hardware for tracing rays, what else can we use it for, in applications that wouldn’t otherwise use these units?” – after all, the hardware is already there, it’s actually doing some non-trivial tree traversal, it’s massively powerful (billions of such tree traversals per second!), and if it’s not otherwise being used then pretty much anything you can offload to it is a win… (and yes, pretty much the first thing we tried worked out well).

For this paper we only looked into one such application (point location, in a tet mesh volume renderer), just as a “proof of concept” … but yes, there’s a ton more where it’d make sense: I already used it for some AMR rendering, too (same basic concept), but there’s sure to be more. If you play with it and find some interesting uses, let me know – I’m curious to see what others will do with it!

Hope you’ll like it – this has been a lot of fun, hope you’ll enjoy reading it, too…


Preprint here: http://www.sci.utah.edu/Publications

Full citation: RTX Beyond Ray Tracing – Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location, Ingo Wald, Will Usher, Nathan Morrical, Laura L Lediaev, and Valerio Pascucci, Proceedings of High Performance Graphics (HPG) 2019. (to appear).

“Accidental Art”: PBRT v3 ‘landscape’ model in my RTOW-OptiX Sample…

Just produced an (accidental) pic that in some undefinable way struck me – dunno why, but IMHO it “got something” – and wanted to quickly share this (see below).

For the curious: The way that pic was produced was that I took the latest version of my pbrt parser (https://github.com/ingowald/pbrt-parser), hooked it up to my RTOW-in-OptiX sample (https://github.com/ingowald/RTOW-OptiX), and ran that on a few of the PBRT v3 sample models (https://pbrt.org/scenes-v3.html). And since that RTOW-in-OptiX sample can’t yet do any of the PBRT materials I just assigned Pete’s “Lambertian” model (with a per-material random albedo value), which for the PBRT v3 “landscape” (view0.pbrt) produced the following pic. And I personally find it kind-of cute, so …. Enjoy!

PS: The buddha with random Metal material looks cool, too 🙂


PBRTParser V1.1

For all those planning on playing around with Disney’s Moana Island model (or any other PBRT format models, for that matter) : Check out my recently re-worked PBRTParser library on github (https://github.com/ingowald/pbrt-parser).

The first version of that library – as I first wrote it a few years ago – was rather “experimental” (in plain English: “horribly incomplete, and buggy”), and only did the barest necessities to extract triangles and instances for some ray traversal research I was doing back then. Some brave users back then had already tried using that library, but as I just said, back then it was never really intended as a full parser, didn’t do anything meaningful with materials, etc…. so bottom line, I’m not really sure how useful it was back then.

Last year, however – when I/we first started playing around with Moana – I finally dug up that old code, and eventually fleshed it out to a point where we could use it to import the whole of Moana – now also with textures, materials, lights, and curves – into some internal format we were using for the 2018 Siggraph demo. That still didn’t do anything more than required for Moana (e.g., it only did the “Disney” material, and only Ptex textures), but anyway, that was a major step – not so much in functionality, but in completeness, robustness, and general “usability”.

And finally, after my switching employers (and thus, no longer having access to that ospray-internal format) – yet still wanting to play with this model – I spent some time on and off over the last few months cleaning that library up even more, fleshing it out to the point that it (apparently?) reads all PBRT v3 models, and in particular, to a point where all materials, textures, etc, are “fully” parsed to specific C++ classes (rather than just sets of name:value pairs as in the first version). And maybe best of all – in particular for those planning on playing with Moana! : The library can not only parse existing ASCII .pbrt files, but can also load and store any parsed model in an internal binary file format that is a few gazillion times faster to load and store than parsing those ASCII files (to give you an idea: parsing the 40GBs of PBRT files for Moana takes close on half an hour … reading the binary takes … wait… less time than it took me to write that sentence).

Mind – this is still not a PBRT “renderer” of any sort – but for those that do want to play around with some PBRT style renderers (and in particular, with PBRT style models!), this library should make it trivially simple to at least get the data in, so you can then worry about the renderer. In particular, it should by now be reasonably complete and stable enough to work out of the box. No, I cannot guarantee that library’s working state for Windows or Mac (it did compile at some point, but I’m not regularly testing those two), but at least on Linux I’d expect it to work – and will gladly help fix whatever bugs come up. Of course, despite all these claims about completeness and robustness (and yes, I do use it on a daily basis): This is an open-source project, and I’m sure there will be some bugs and issues as soon as people start using it on models – or in ways – that I haven’t personally tried yet. If so: I’d be happy to fix, just let me know (preferably on gitlab).

Anyway: If you plan on playing with it, check it out on either github, or gitlab. I will keep those two sites’ repositories in sync (easy enough with git …), so they should always contain the same code, at least in the master branch. However, gitlab is somewhat easier to use with regard to issue tracking and, in particular, pull requests by users, so if you do plan on filing issues or sending pull requests, I’d suggest gitlab. Of course, any criticism, bugs, issues, or requests for improvement are highly appreciated….


PS: Just to show that it can really parse all of Moana, I just added a little screenshot of a normal shader from a totally unrelated renderer of mine. Note that the lack of shading information is entirely due to that renderer; the parser will have the full material and texture information – it’s just the renderer that doesn’t support all effects, yet, so I don’t want to prematurely post any images of it, yet.

RTOW in OptiX – Fun with CuRand…

Bottom line: With new random number generator, RTOW-OptiX sample on Turing now runs in ~0.5 secs ….

Since several people have asked for Turing numbers for my “RTOW in OptiX” example I finally sat down and ran it. First result – surprise: In my original code there was hardly any difference between using Turing and Volta – and that just didn’t make sense. Sure, you do still need a special development driver to even use the Turing ray tracing cores from within OptiX, but I actually had that, so why didn’t it get faster? And sure, there’s only so much speedup you can expect in a scene that doesn’t have any triangles at all, and only a very small number of primitives to start with. But still, that didn’t make sense. There also was hardly any difference between iterative and recursive versions … and none of that made sense whatsoever.

Well – in cases like that a good first step is always to have a look at the assembly (excuse me: PTX) code that one’s code is actually generating. In our OptiX example, that’s actually super-easy: Not only is PTX way easier to read than regular assembly, the very nature of OptiX’ “programs” approach also means that you don’t have to sift through an entire program’s worth of asm output to find the one function you’re interested in…. instead, you only look at the PTX code for the one kernel that you’re interested in. And even simpler, the cmakefile already generates all these ptx files (that’s the way OptiX works), so looking at that was very easy.

Now looking at the ray gen program, I was at first what, for lack of a better word, I can only call “dumbfounded”: thousands of lines of cryptic PTX code, with movs, xor’s, loads, and stores, all apparently randomly thrown together, and hardly anything that looked like “useful” code. Clearly my “actual” ray gen program was at the end of this file, and looked great – but what was all that other stuff?? No wonder this wasn’t any faster on Turing than on Volta – all it did was garbling memory!

Turns out the culprit was what I had absolutely not expected: CuRand. I hadn’t even known about curand before I saw Roger Allen’s CUDA example, but when I first saw it this looked like an easy-to-use equivalent to Pete’s use of drand48(), and simply used it for my sample, too. Now CuRand does indeed seem to be a very good random number generator, and to have some really nice properties – but it also has a very, very – did I say: very! – expensive set-up phase, where it’s taking something like a 25,000-sized scratchpad and garbling around in it. And since I ran that once per pixel it turns out that just initializing that random number generator was more expensive in this example than all rendering taken together ….

Of course, the solution to that was simple: Pete already used ‘drand48()’ in his reference CPU example, and though that function doesn’t exist in the CUDA runtime it’s trivially simple to implement. Throwing that into my example – and taking curand out – and lo and behold, my render time goes down to something like 0.5 sec. And in that variant I also see exactly what I had expected: that iterative is way faster than recursive, and Turing was way faster than Volta. Of course, changing the random number generator also changed the image (I haven’t looked in detail yet, but it “feels” as if the curand image was better), and has of course also made the Volta code faster. Either way – for now, 500ms is good with me 🙂
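For the curious, here is roughly what such a replacement can look like. This is a minimal sketch, not the code from my repo: the struct name `DRand48` and the idea of seeding with a pixel index are assumptions made for illustration; the constants are the classic drand48 ones (a 48-bit linear congruential generator with a = 0x5DEECE66D, c = 0xB), and the seeding follows the usual srand48 convention. It is written as plain C++ so it can be compiled on the host; in a CUDA or OptiX program the member functions would additionally need a `__device__` qualifier.

```cpp
#include <cstdint>

// Sketch of a drand48()-style generator: X' = (a*X + c) mod 2^48.
// The entire per-pixel state is one 64-bit integer, so "set-up"
// is a shift and an OR rather than filling a big scratchpad.
struct DRand48 {
  uint64_t state = 0;

  // Same convention as srand48(): the seed goes into the upper 32 bits
  // of the 48-bit state, the low 16 bits are set to 0x330E.
  // (Seeding with a pixel index is an assumption for this sketch.)
  void init(uint32_t seed) {
    state = (uint64_t(seed) << 16) | 0x330Eull;
  }

  // Advance the generator and return a value in [0,1).
  double operator()() {
    const uint64_t a    = 0x5DEECE66Dull;
    const uint64_t c    = 0xBull;
    const uint64_t mask = (1ull << 48) - 1;
    // Unsigned overflow wraps mod 2^64, which is harmless here because
    // we only keep the low 48 bits anyway.
    state = (a * state + c) & mask;
    return double(state) / double(1ull << 48);
  }
};
```

Usage would be one `DRand48 rng; rng.init(pixelIndex);` per pixel, followed by `rng()` wherever Pete's code calls drand48(). Keep in mind that a plain LCG with neighboring seeds is statistically much weaker than curand, which may well be why the curand image "felt" better.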

With that – back to work….

RTOW in OptiX – added iterative variant…

Huh, how fitting: “Ray Tracing on a Weekend”, and I’m sitting here, Sunday morning, over a coffee, and writing about ray tracing on a weekend … on a weekend. And if that wasn’t recursive enough, I’m even writing about recursion in ….. uh-oh.

Aaaaanyway. For reference, I also just added a purely iterative variant of the “RTOW-in-OptiX” example that I wrote about in my previous two posts: The original code I published Friday night tried to stay as close as possible to Pete’s example, and therefore used “real” recursion, in the sense that the “closest hit” programs attached to the spheres did the full “Material::scatter” of its respective material (lambertian vs dielectric vs metal), plus doing a recursive “rtTrace()” to continue the path, thus doing some real recursive ray (actually: path) tracing.

Now if you read the previous section very closely you may have seen that I put “real” in quotes, for good reason: OptiX will internally re-factor that code to not really recurse in the way Pete’s CPU version did – with very deep stack and everything – but will likely do something more clever by re-factoring that code, which you can read more about in the original OptiX SIGGRAPH paper.

All that said, no matter what OptiX may or may not do with it, from a programmer’s standpoint it’s true recursion …. and though OptiX may do some refactoring to avoid the “gigantic stacks” problem – it’ll still have to do something to handle all the recursive state – and that, of course, is not cheap. Consequently, real recursion is generally something to be avoided (which, BTW, typically makes the renderer simpler to argue about, anyway).

Roger Allen’s CUDA-version already did this transformation, and used an iterative version: Since his example used CUDA directly, there was no way for any compiler framework to re-factor the code, so if he had used recursion the CUDA compiler would really have had to use enough stack space per pixel to store up to 50 recursive trace contexts, which would probably not have ended well.

In my original OptiX example, I didn’t have this problem, and could trust OptiX to handle that recursion for me in a reasonable way. Nevertheless, as said above real recursion is usually not the right choice to go about it (and BTW: on a CPU it usually isn’t, either!), so the downside of my staying close to Pete’s original solution was that this original example might actually have led some readers to think that I wanted them to write such recursive code, which of course is not what I intended.

As such, for reference, I just added an iterative version to my example as well. The particular challenge in this example is that while the CPU and CUDA versions have real “Material” classes with real virtual functions, in OptiX it’s a bit tricky to attach real virtual classes to OptiX objects (yes, you can do it – after all, programs are written in general CUDA code – but let’s not go there right now). For my particular version, the way I went about this is to have the closest hit programs do one Material::scatter() operation for the material associated with that geometry, and return the resulting scattered ray and attenuation back to the ray generation program via the PRD. Of course, this approach works only because the Material in Pete’s code does only exactly one thing – scatter() – and wouldn’t have worked if the ray generation program had had to call multiple different material methods … but hey, this example is not about “how to write a complex path tracer in OptiX” – that may come at a later time, but for now, this is only about how to map Pete’s example, nothing more.
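To make the control flow a bit more concrete, here is a minimal host-side sketch of that iterative scheme. It is not the actual code from the repo and uses no OptiX API at all: `PerRayData`, `ScatterEvent`, `TraceFn` and `shadeIterative` are names made up for this illustration, and the `traceOneRay` callback merely stands in for "trace a ray and let the closest-hit (or miss) program fill in the PRD". The only thing taken from the discussion above is the structure: one scatter per hit, result handed back via the PRD, and a loop of at most 50 bounces in the ray generation code.

```cpp
#include <functional>

struct Vec3 { float x, y, z; };

// Component-wise product, used to accumulate attenuation along the path.
inline Vec3 operator*(const Vec3 &a, const Vec3 &b) {
  return {a.x * b.x, a.y * b.y, a.z * b.z};
}

struct Ray { Vec3 origin, direction; };

// What a single trace call can report back to the ray generation code.
enum class ScatterEvent { Bounced, Absorbed, Missed };

// Hypothetical per-ray data: filled in by the hit/miss code, read by ray gen.
struct PerRayData {
  ScatterEvent event;
  Ray          scattered;    // valid only if event == Bounced
  Vec3         attenuation;  // valid only if event == Bounced
  Vec3         missColor;    // valid only if event == Missed
};

// Stand-in for "trace one ray and run exactly one scatter()" (an assumption).
using TraceFn = std::function<PerRayData(const Ray &)>;

// Iterative path loop: no recursion, only a running throughput.
Vec3 shadeIterative(const TraceFn &traceOneRay, Ray ray, int maxDepth = 50) {
  Vec3 throughput = {1.f, 1.f, 1.f};
  for (int depth = 0; depth < maxDepth; ++depth) {
    const PerRayData prd = traceOneRay(ray);
    if (prd.event == ScatterEvent::Missed)
      return throughput * prd.missColor;   // path left the scene
    if (prd.event == ScatterEvent::Absorbed)
      return {0.f, 0.f, 0.f};              // material absorbed the path
    throughput = throughput * prd.attenuation;
    ray = prd.scattered;                   // continue with the scattered ray
  }
  return {0.f, 0.f, 0.f};                  // gave up after maxDepth bounces
}
```

The point of this design is that the only state carried from bounce to bounce is the current ray and the accumulated throughput, so there is no stack of trace contexts for anybody (compiler or framework) to manage.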

I do hope the reference code will be useful; and as usual: any feedback is welcome!

With that – back to …. work?

PS: For those interested in having a look: I already pushed the code to github (https://github.com/ingowald/RTOW-OptiX). I’ll be running some more extensive numbers when I’m back to a real machine (no, I don’t bring my turing to my sunday-morning coffee…), but at least on my “somewhat dated” Thinkpad P50 laptop, I get the following (both using 1200x800x128 samples):

  • pete’s version (with -O3, and excluding image output), on a Core i7-6700HQ@2.6Ghz(running at 3.2Ghz turbo): 12m32s.
  • optix version, on a Quadro M1000M: 18 sec.

Of course, this comparison is extremely flawed: Pete’s version doesn’t even use threads, let alone an acceleration structure, both of which my OptiX version does. Take this with a grain of salt – or an entire salt truck’s worth of it, for that matter! That said, the parallelism in the OptiX version comes for free, and the acceleration structure …. well, all that took was adding a single line of code (‘gg->setAcceleration(g_context->createAcceleration("Bvh"))’) …

PPS: First performance numbers on some more powerful card (driver 410.57, optix 5.1.1):

  • 1070, recursive: 0.58s build, 6s render
  • 1070, iterative: 0.66s build, 5.5s render
  • Titan V, recursive: 0.57s build, 2.6s render
  • Titan V, iterative: 0.63s build, 2.1s render
  • Turing: to come…

“RTOW in OptiX” sample code now on github…

As promised in last night’s post, I cleaned up the sample code and pushed to github: https://github.com/ingowald/RTOW-OptiX.

I haven’t tried the cleanups on windows yet, but it should work. If you run into trouble, let me know!

One note on the code: I’ll very happily accept pull requests that cover bugs, typos, build fixes, etc. Please note I do want to stay as close as possible to the original example, though, so please don’t send pull requests with major restructurings, general improvements, or feature additions, even if they’d be useful in their own right…. this is not supposed to be a “how to do cool things in OptiX” repo; just an OptiX “port” of Pete’s example.

And now – back to work 🙂

Ray Tracing in a Weekend … in Optix (Part 0 of N :-) )

Yay! I finally have my first OptiX-version of Pete Shirley’s “Ray Tracing in a Week-end” tutorial working. Not the whole series yet (that’s still to come), but at least the “final scene”… pic below.


Ever since Pete’s now-famous “Ray Tracing in a Week-end” came out (see, e.g., this link for more details), lots of people have used his mini-books to learn more about ray tracing. Those books are, in fact, absolutely amazing learning material (if you have not read them yet – you should!), but suffer from one big disadvantage: yes, they’ll teach you the fundamental basics (and in particular, the elegance and beauty!) of ray tracing – but they won’t teach you how to use modern GPUs for that. And in particular since the introduction of Turing, one really should know how to do that.

To fix that shortcoming, I recently suggested to Pete that “somebody” should actually sit down and write up how to do that same book series – step by step – in OptiX. Roger Allen has since done that same exercise for CUDA (see here for that (also excellent!) article), but that still has a shortcoming in that by using “plain” CUDA it doesn’t use Turing’s ray tracing hardware acceleration. To use the latter, one would have to either use Windows-only DXR (e.g., through Chris Wyman’s – equally excellent! 🙂 – DXR samples), or through using OptiX.

Long story short: I did eventually start on an “OptiX On a Week-End” (“OO-Awe”!) equivalent of Pete’s book series (and hope Pete will jump in – he’s a much better writer than I am :-/)… but writing an entire mini-book, with examples and everything, turns out to be even more work than feared. So, following my motto of “better something useful early than something perfect too late” I finally sat down and skipped all the step-by-step introductions, all the detailed explanations, etc, and just wrote the final chapter example in OptiX. I’ll still write all this other stuff, but at least for now, I’ll do a much shorter version just with the final chapter.

So, what’s to come:

First, I’ll clean up the code a bit, and push that one final chapter example (with cmake build scripts etc) on github (I’ll write another post when that’s done). Once that’s public, I’ll write a series of little posts on how that sample works, relative to Pete’s CPU-only book. And only when all of that is out and written, then I will go back to doing the longer mini-book version. As such, this blog post was actually “part 0” of a series of posts that will soon be coming…. I hope you’ll find it useful!

With that – back to work…. 🙂



Joining NVidia…

As I’m sure some of you will have heard by now, today is my last day at Intel, and starting on Monday, I’ll be working for NVidia.

Looking back, I’ve now been working for Intel for almost exactly 11 years, and if you were to include all the time I worked “closely with Intel technologies” during my PhD and Post-Doc times, it’s actually close on two decades: Even before starting my PhD (while working on Alex Keller’s ray tracer while in Kaiserslautern) I was already drilling holes into Celeron chips (and soldering on cables) to make them dual-socket capable (they were supposed to be single-socket only 🙂 ); and at the start of my PhD we (including Carsten, in Saarbruecken) were writing the first interactive SSE ray tracer prototypes, at a time when the Linux kernel didn’t even save the SSE registers, yet (yes, that makes for fun-to-replicate bugs on a dual-socket machine!). Later on, while finally working for Intel, I’ve been lucky to have worked on virtually every cool technology that had come out, from Larrabee, to Knights-anything, to pretty much any Xeon architecture built in the last two decades, to lots of other cool stuff. It’s been fun, I’ve worked with truly talented people (some of whom are, in their field, hands-down the best in the world, and some of whom I have known for longer than I’ve had my kids!). And yes, we’ve done some pretty cool projects, too: From the first real-time ray tracers on Larrabee, to things like compilers (my IVL, and Matt’s ISPC), to several prototype ray tracers that never made it into the public, and all the way to projects like Embree and OSPRay, both of which turned into massively successful projects. In other words, I’ve had the chance to work on pretty much anything I wanted, which was typically anything that either involves, requires, or is required for, the tracing of rays.

All that said, as Matt recently wrote on his blog: “the world it is a-changing” (see this link for his blog article); and once again channeling Matt (man – that seems to become a pattern here!?) I felt like I needed “to be in the thick of all of that and to help contribute to it actually happening”… so when the opportunity to do so came up I simply couldn’t say no. So with all that: Today is my last day at Intel, and Monday will be my first at NVidia – looking forward to it, that’ll be interesting indeed!

One final note…

While trying to figure out how to best break this news I had a second close look at the article Matt had written when he joined NVidia a few weeks back. While doing so, it was actually for the first time that I realized just how deeply he had thought about all this “ray tracing for real time” topic. Of course I had “read” that before, but never really appreciated how much thought went into it.

Anyway – just to follow up on that particular topic from my point of view: For me personally, it’s never been a question of the “if”, but only of the “when”, and the “who” will be the first to make it happen. To explain: Even when I was still in the middle of my master’s degree (say, ’96 or so), it was already clear that all high-quality rendering was done via ray tracing – sure, there were interesting discussions on whether it’d be path tracing, backwards/reverse/forward path tracing, photon mapping, bidirectional path tracing, or Metropolis (all of which at some point in time I had played with back then 🙂 )… but in the end, they all used ray tracing. At the same time, anything that was primarily time-constrained was doing something else (at that time: REYES’s “split-and-dice”, the equivalent of rasterization), but even then it seemed clear to me that with “computers” getting “faster” every year it’d eventually only be a question of time until the time constraint would go away, and that eventually, “the simpler/more elegant algorithm” would get used (because at the end of the day, that’s what it always comes down to: Once you can afford it, you always pick the more elegant, and more general, solution).

And sure enough, over the last decade-and-a-half we’ve already seen this happening in the movie industry: When I started my PhD, the general opinion was still that this industry would “never” switch to ray tracing, because it needed too much memory (REYES could do streaming), because it was too slow (REYES was faster), because it needed nasty acceleration structures, and because all this photo-realism wasn’t all that important (and at least apparently, sometimes detrimental!) to the artistic process, anyway … yet still, by today virtually every production renderer has switched to ray tracing, because in the budget allocated for a frame it is now possible to do it, and once it was, it was just simpler to express that renderer in ray-based terms. As such, at least in my eyes it’s always been merely a matter of time until real-time graphics will do what the movie industry has already gone through – at some point in time ray tracing will be fast enough to do it in real time, and once it is – if history is any guide – people will use it.

Anyway – no matter how you do reach that same conclusion, whether you think deeply about it or simply extrapolate into the future – it does look like ray tracing is here to stay. Let’s see where it takes us. It’ll be a few interesting years ahead.

Preprint of our Vis’19 paper on Iso-surface ray tracing of AMR Data now available …

Finally got around to making an “author’s copy” and uploading it to my blog, but here it now is – a preprint of our Vis 2019 paper on “CPU Isosurface Ray Tracing of Adaptive Mesh Refinement Data” (link to pdf).


A few notes:

  • This paper is a direct follow-up to our previous AMR volume ray tracing paper (published at last year’s SigAsia Vis Symposium), but adds implicit iso-surface ray tracing capability (using a correct, analytic intersection method). The “octant method” reconstruction scheme was actually already sketched in the original submission of that previous paper, but wasn’t explained well enough back then, so got axed in the final version.
  • The “octant method” that this paper introduces is actually – if I may say so – pretty neat, because it’s both interpolating and continuous, even in corner cases. It may, however, well be the one thing in my career that I had to expend the most brain power on to get right – it’s trivial in 1D, but even in 2D it took a while to get it right, and 3D has even more corner cases that some earlier attempts failed on (If only you could see the stack of notebooks all full of sketches: at one point I used xfig to draw a 2D AMR example, printed a few hundred pages full of that template, and pretty much used them all up going through the algorithm step by step, for each cell, until it finally worked!?). Worked on this – on and off – for almost 3 years, which is kind-of ridiculous …
  • The code is all implemented in OSPRay (of course?), as a loadable ospray module that is fully compatible with all other ospray actors (renderers, other geometry types, MPI parallel rendering, etc). This module is not yet part of any official ospray release, but is already available upon request (Ethan should be able to provide – it’s all Apache License, so fully free), and will hopefully “at some point” be included in mainline ospray as well.
  • Though the paper’s title is exclusively on the adaptive mesh refinement (AMR) part, the actual code is just as much about the general implicit iso-surfacing code itself – the “impi” module (for imp-licit i-sosurface) is actually generally applicable to other volume types as well, and does come with an implementation for structured volumes, too. The paper itself is actually kind-of two papers in one, too… part on the IMPI module, and part on the octant method to use that for iso-surface ray tracing of AMR data. As such, I’d fully expect this module to be used as much without AMR as with AMR.
  • One reviewer (correctly!) pointed out that with all the “theoretical” continuity we claim in this paper there’s still a chance that there could be pixel-sized “shoot throughs” due to numerical accuracy issues: Even if we make the boundaries between levels fully continuous in a mathematical sense, the fact that different voxels/octants on different sides of the boundary use different floating point values for the cell coordinates (and those in different order of computations) means there can be elimination effects in the (limited-precision) floating point computations. Yes, that is perfectly correct, and I had fully overlooked it in the original submission (maybe one of the best reviewer catches I’ve ever seen!). But then, exactly the same effect will happen even for voxels in structured volumes, without any level continuities ….
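To illustrate that last point about "different order of computations": the following tiny sketch has nothing to do with the actual impi/octant code (the function names are made up for this example), it just adds the same three single-precision numbers in two different orders. Assuming standard IEEE-754 float arithmetic (no fast-math style compiler flags, no extended-precision intermediates), the two orders give different answers for inputs like 1e8, -1e8 and 1, because adding 1 to -1e8 is below the float resolution at that magnitude and simply gets lost.

```cpp
// The same mathematical sum a + b + c, evaluated in two different orders.
// Intermediates are stored in float variables on purpose, so that each
// partial result is rounded to single precision.
inline float sumABThenC(float a, float b, float c) {
  const float ab = a + b;   // first a + b ...
  return ab + c;            // ... then + c
}

inline float sumBCThenA(float a, float b, float c) {
  const float bc = b + c;   // first b + c ...
  return a + bc;            // ... then a +
}
```

Calling both with (1e8f, -1e8f, 1.0f) should give 1 for the first order and 0 for the second under the assumptions above. Two neighboring cells that compute "the same" boundary value via different expressions can disagree in exactly this way, which is all it takes for a ray to slip through.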