Silly(?), Useful Tools: Generating Data for Scaling Experiments

Over the many different rendering projects I’ve done over the years, I’ve frequently stumbled – again and again – over the same problem: How to get “useful” data for doing scaling-style stress-testing of one’s software. Sure, you can always take N random spheres, or if you need triangle meshes take the bunny and create N random copies of that with random positions … but then you still quickly run into lots of other issues, like, for example: “now i added all these copies they just all overlap?!”, or “ugh, how can i create the scene such that multiple scales still make sense with the same camera for rendering?”, or “what if i want more instances rather than more triangles?”, or “what if I want to look at more ‘shading’ data like textures”, etc.

All of those questions are “solvable” (this is not rocket science), but I’m always amazed how much time I spent over the years – again and again – to write (and debug, and re-debug, etc) those mini-tests. And since I just did that again I decided that this time should be the last one… so as a result of that I did add all the features I wanted, and pushed that into my miniScene repo on github.

The way this tool works is actually quite simple, generating a whole lot of spheres (or actually, “funnies”, see my tweet on how i stumbled over those), but allowing the user to control a whole lot of different parameters that can influence things like how much instantiation vs “real” geometry, what tessellation level for the funnies (ie, triangle density per sphere), what texture resolution to use for each of the funnies, etc. In particular one can control:

how many “non-instantiated spheres” to generate
how many different kinds of spheres to generate for instantiation
how many different instances of spheres to generate
what tesselation density to use per sphere
what texture res to use per sphere (sphere gets its own checkerboard pattern texture)

These spheres then all get put into a fixed-size slab that covers the space from (0,0,0) to (1000, 100, 1000), with sphere radii and instance scaling adjusted such that there should always be a reasonably equal density within that slab. Note that slab is intentionally 10x less high than wide, so we neither end up with just a 2D plane, nor with something that’s a cube (where all interior geometry is usually occluded by that at the boundary).

In particular, this tool allows for easily controlling whether you want to scale in instance count (increase instance count) or triangle count (increase num non-instances spheres and/or sphere tessellation level); whether to put more triangles into just finer surface tesselation or into more different meshes, how much of the output size should be in textures vs geometry, etc.

Here a few examples:

/miniGenScaleTest -o scaleTest.mini (ie, with trivially simple default settings) generates this:

num instances : 2
num objects : 2

num unique meshes : 101
num unique triangles : 40.40K (40400)
num unique vertices : 22.22K (22220)

num actual meshes : 101
num actual triangles : 40.40K (40400)
num actual vertices : 22.22K (22220)

num textures : 101

Which with my latest renderer looks like this:

Now let’s change that to use 10k instances: ./miniGenScaleTest -o scaleTest.mini -ni 10000, and we get this:

num instances		:   10.00K	(10001)
num objects		:   101
----
num *unique* meshes	:   200
num *unique* triangles	:   80.00K	(80000)
num *unique* vertices	:   44.00K	(44000)
----
num *actual* meshes	:   10.10K	(10100)
num *actual* triangles	:   4.04M	(4040000)
num *actual* vertices	:   2.22M	(2222000)
----
num textures		:   200
 - num *ptex* textures	:   0
 - num *image* textures	:   201
total size of textures	:   204.80K	(204800)
 - #bytes in ptex	:   0
 - #byte in texels	:   204.80K	(204800)
num materials		:   200

which looks like this:

But since that scene complexity is mostly all in instances (which for “large model rendering” is often considered “cheating” let’s instead add a few non-instanced spheres as well (but let’s add more instances, too, just for the fun of it): ./miniGenScaleTest -o scaleTest.mini -ni 10000000 -nbs 100000 -tr 32 (this creates 10 million instances of spheres (each having 4k triangles), and then another 100,000 spheres that are not instances, for a total of this:

num instances		:   10.00M	(10000001)
num objects		:   101
----
num *unique* meshes	:   100.10K	(100100)
num *unique* triangles	:   40.04M	(40040000)
num *unique* vertices	:   22.02M	(22022000)
----
num *actual* meshes	:   10.10M	(10100000)
num *actual* triangles	:   4.04G	(4040000000)
num *actual* vertices	:   2.22G	(2222000000)
----
num textures		:   100.10K	(100100)
 - num *ptex* textures	:   0
 - num *image* textures	:   100.10K	(100101)
total size of textures	:   1.64G	(1640038400)
 - #bytes in ptex	:   0
 - #byte in texels	:   1.64G	(1640038400)
num materials		:   100.10K	(100100)

(and zooming in a bit)

(note the “artifacts” on some of those spheres are intentional – they’re “funnies”, not spheres. I find these funnies more useful as testing geometry, but of course, if you want to generate “non-funny” spheres there’s a flag for that as well).

Now finally, let’s use this to push my two RTX8000 cards to the limit, and do this: ./miniGenScaleTest -o /slow/mini/scaleTest.mini -ni 10000 -nbs 2000000 -tr 32 … with which we end up at a whopping 800M unique triangles and an additional 32 GBs of texture data:

num instances		:   10.00K	(10001)
num objects		:   101
----
num *unique* meshes	:   2.00M	(2000100)
num *unique* triangles	:   800.04M	(800040000)
num *unique* vertices	:   440.02M	(440022000)
----
num *actual* meshes	:   2.01M	(2010000)
num *actual* triangles	:   804.00M	(804000000)
num *actual* vertices	:   442.20M	(442200000)
----
num textures		:   2.00M	(2000100)
 - num *ptex* textures	:   0
 - num *image* textures	:   2.00M	(2000101)
total size of textures	:   32.77G	(32769638400)
 - #bytes in ptex	:   0
 - #byte in texels	:   32.77G	(32769638400)
num materials		:   2.00M	(2000100)

The result looks like this:

… and just to show that this is really about to push my GPUs to the limit (even with my latest data-parallel multi-GPU renderer) here also the output from nvidia-smi:

Guess I might have squeezed a bit more (some 3GBs still unused on each GPU!), but the goal of this exercise was to have something that can bring my renderer to its limits, and guess that’s pretty much it for now.

BTW: The result still runs at 17 fps 🙂

If you want o have a look at this tool: have a look at the miniScene repo, then tools/genScaleTest.cpp. The resulting .mini file should be trivial to read and use for your own stuff, so …. enjoy!

Parallel/GPU-based Construction of Left-Balanced k-d Trees

Similar to the article I wrote last week on stack-free k-d traversal (see here), this post is about a “left-over” from many years back – one that I’ve been using on-and-off for ages, but never found the time to actually write up and share.

Well, as the very existence of this post may give away I actually did find the time to finally write this up and share it (and man, does it feel good to finally get this off my stack, after a few years of that pressing on me!). If you’re interested: the write-up is on my usual publications page (https://www.sci.utah.edu/~wald/Publications/), as well as on ArXiv (well, it will be there once it actually finishes processing – by Friday, if the email I just got is to be trusted). In addition, there’s also sample code in the github repo that I created for the reference implementation, see here: https://github.com/ingowald/cudaKDTree .

If you’re working with k-d trees (the left-balanced sort, that is), and need a GPU-accelerated way of building them, have a look. I’m not claiming this to be the un-doubtedly fastest way you could ever build a left-balanced k-d tree on a GPU (in fact, I’m fairly sure that faster methods exist, ahem), but what I really, really like about this particular method is what I’d call the the “soft metrics” of ease of use and simplicity – it’s all a pretty simple CUDA update kernel (that literally fits into a single laptop screen!), plus a bit of parallel sort… easy to use with/update for/template over different data types like int2, float3, float4, structs-with-payload (eg, photons in photon mapping), etcpp, and can be used in a header-only way, too. I’ve now used this method several times, and unlike for some other methods I have it was always trivially simple to just drop this in and use it.

The weakest part of that algorithm is that – at least for the reference implementation – I used thrust::sort for sorting, and since I need to use a custom sort operator that does fall back to a relatively slow implementation of sorting, and to one that actually needs a roughly 2x memory overhead during sorting. Both of that could, of course, be fixed by doing a better swap-based sorting via bitonic sorting … but “that is left as an exercise to the reader” :-).

Enjoy.

Stack-free k-d Tree Traversal

Yes, I haven’t written anything on this blog in a while – been too busy coding and already way overdue writing things up in latex/paper form – but since I finally managed to finally finish the write-up for that technique (which by now was overdue for now several years!) I thought I’d celebrate that milestone by at least dropping a note on that technique here, in case it’s useful for others, too….

The technique I’m referring to is a stack-free traversal method for k-d trees – and to be specific, I mean the kind of k-d trees that encode k-dimensional data points https://en.wikipedia.org/wiki/K-d_tree, not the ray tracing-related ones I used so much in my early days of ray tracing research… The algorithm is actually super-simple once you think it through – a write-up in “paper”-form (i.e., with full explanation of how it works) can be found here https://arxiv.org/abs/2210.12859; in addition, if you’re somebody that prefers header-files over PDFs you can also find some sample code here: https://github.com/ingowald/cudaKDTree. The repo that the latter link points to also contains some pretty neat GPU-friendly and parallel builder for left-balanced k-d trees, too (which is arguably even more interesting than the traversal…) … but that’s the topic for another post if and when I ever finish the write-up of that technique (in another few years!?).

Now what’s so interesting about this technique? First, it’s simple, which is always good; but more importantly, it allows you to traverse k-d trees (for photon mapping, point data, p-kd trees, etc) without needing a stack. It’s not necessarily faster than stack-based techniques (and if memory is not a concern a good BVH is almost certainly faster than a k-d tree, anyway), but in my experience on a GPU stacks always come with some strings attached, so I’ve come to now default to this technique if and when I ever need to use a k-d tree. The samples I added to this repo are intentionally small and self-contained; you can use the same technique for other queries, other dimension and data types (templates are your friend…), etc; but I wanted to keep that repo small – feel free to send PRs with other use cases if you have any to share.

With that – have fun!

PS: Again:

link to whitepaper: https://arxiv.org/abs/2210.12859
link to sample code: https://github.com/ingowald/cudaKDTree

EGPGV Paper Talk on AMR Streaming with ExaBricks EGPGV Paper Talk on Streaming

Just back from Rome, where Stefan gave a talk on our latest paper on how we extended our “ExaBricks” AMR rendering project to allow for efficiently streaming animated AMR data.

And as there’s always a lot more involved in such a project than you could seriously put into a paper he also just wrote a very nice blog post on all the “extra” stuff that’s not in that paper. I actually thought of doing the same, but he beat me to it …. anyway: check it out here!

“Ray Tracing Massive Models using Hierarchically Compressed Geometry” – or: Huh, the things you find when sifting through your old stuff…

Sometimes you just “stumble” over things from the past that you had mostly forgotten about (or “gone into denial over”!?) …. in this case, I was sifting through my back-then-when backups for a copy of some of the large scanned models that I have previously worked with in the past (like the David, Lucy, Atlas, etc). Now my trusted find tool eventually did find some matches on my 12TB backup server, but they were png files in an old paper repo from my TACC account backup … and since I couldn’t immediately place the name of that paper repo I started to get curious, so pulled out that directory, and looked inside. What i found what this here – a “aborted” but almost-finished paper draft:

quant Download

Now this draft was never actually submitted anywhere, and the style file is wrong, too (it was not submitted to EG06; in fact it wasn’t even written until around 2010 or 2011)… but still, some pretty cool ideas in there for the time – mind that this is over a decade old by now. Some of the ideas that this already talked about (in 2011-ish!):

a method of encoding large meshes with only 5-8 bytes per triangle including the BVH
using the concept of “quads” (triangle pairs with a shared edge, not necessarily planar) throughout everything, even for regular triangles; which not all ray tracers do yet even today
automatica generation of such quads from input triangle meshes
hierarchical encoding of both geometry and acceleration structure; basically the same of what we used in our 2021 paper on GPU volume ray tracing compressed unstructured mesh data sets. (And while I’m on that topic: a huge shout-out to Ben Segovia and Manfred Ernst, though, which pioneered this with their way under-appreciated Memory Efficient Ray Tracing with Hierarchical Mesh Quantization paper!)
a Quad-BVH with hierarchical encoding of the children relative to their parent node; similar to Carsten’s Compressed-Leaf BVHes paper from 2018
first interactive ray tracing of billion-plus sized models on fixed-memory “GPUs”; in this paper I used the “Phi” GPU equivalent that intel had at the time (either KNF or KNC, I really don’t know any more). Using that hardware that “may or may not” have been the reason this paper never got published at the time – but I digress.
first SPMD ray tracer / path tracer (RIVL) written with a predecessor/sister-project to what later become ISPC – think an ISPC-based path tracer all running on that KNx architecture.
an out of core streaming framework for large data (for the construction), that could even do out of core SAH construction etc, with on-the-fly compressed triangle streams etcpp (I still use that same technology today, just a different implementation).
…

Sigh. So much cool stuff, and all of that over a decade ago … and really cool results (<10 bytes per triangle, including BVH!) … and then I had completely forgotten about it.

Anyway …. just thought I’d drop that draft here; maybe it’ll inspire somebody to pick up some of these ideas, and do something cool with it – the paper does have some few missing text pieces, and in particular several blank tables with performance data – but the latter is actually a good thing: I didn’t do this intentionally for this blog post (I’m not even sure I still have the latex sources for that paper?!); but I would actually have somehow stripped them, anyway, so it’s good they’re not in there. That doesn’t mean that I didn’t have this running at the time (I did), or that it didn’t perform well (oh yes, it did; absolutely) …. but whoever was going to pick up on these ideas would have to re-implement that all on more modern hardware, anyway (and yes, that would probably just scream on a 3080!).

So; if anybody wants to go for it and re-investigate these ideas: feel free – I won’t do it, because I don’t have the time; but if somebody is looking for a student project, a master’s or bachelor’s thesis, or anything else of that sort, then this might still suit – and might still produce a paper, too, depending on results. And if or when somebody wants to do this on a GPU, and wants to chat about how to best do that – I can still chat, I just won’t write the code, or a new paper.

The things one finds in old accounts, indeed…

Parallel BVH Construction

Now this is a new one : usually I use this blog to write about newly published papers, or share some experience with some currently ongoing projects – but this one time, I’ll actually use this blog to direct attention back to some papers that actually happened a looong time ago. Reason: I recently had several discussions with people about the topic of fast, parallel BVH construction, and always replied by “just read that paper I wrote about that; I still use the same techniques as in my paper from 10 years ago, just with a GPU rather than a CPU” …. only to get long stares because whoever I talk to then often doesn’t know what the heck I’m talking about.

That confused me a while, until last week I actually tried to google this handful of papers that I had so vividly in my mind … and realized I couldn’t actually find them myself, because google just didn’t find what I was looking for. Now as it turns out, I’m not so delirious that I was imgining any papers that I never wrote – but unfortunately, when I wrote those papers I didn’t know how the lingo in ray tracing would eventually evolve, and apparently ended up picking paper titles that google just doesn’t match to those keyword you were probably search for: What I (naturally) googled for was “parallel bvh construction” (and if you reach this page, that’s probably exactly what you did, too) … but back then when I wrote the paper, I didn’t know that this would be the terminology everybody nowadays uses, and used different terms.

Parallel BVH Construction

So, if you are interested in parallel BVH construction (hopefully on a GPU, nowadays), I’d like to point your attention (and hopefully, google’s search algorithms) to the following two papers: First, the first one I did that actually looked at parallel construction (on a brand new “Clovertown” CPU back then, with four(!) hardware threads!) was “On fast Construction of SAH-based Bounding Volume Hierarchies“. Funnily even the web link contains the terms “parallel” and “BVH”, but the title itself contains neither, so google – at the time of this writing – really doesn’t want to find it. But if you are interested in some of the very first approaches to building SAH BVHes in parallel, this is the one – today I’d use CUDA kernels, of course, and way more threads – but the core concepts are the same. Interestingly, if you look at section “4.2. Grid-based Binning” that’s actually just a different name for what nowadays you’d call a Morton pre-binning step (well, it’s a bit more general because Morton codes are power of two, and that grid is not – but the core idea of an initial pre-binning, with an implicit BVH over those bins and a second BVH building step inside those bins, that’s the same …. but I digress).

Second, a more real-time implementation of that was done a few years later, at the time on the “MIC” Many-Core Knights-Series of Intel CPU in the following paper: Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture. That one is a bit more specialized because it really targeted those certain chips that Intel was doing at the time (bonus assignment for the advanced graphics geek: try to guess what the “LRB” in the URL may refer to…), but it’s also still relevant because this was the first one to really target real-time re-builds, and really had to look at all these things like having lots of parallel threads (just like a GPU today), at having wide “SIMD”, etc – in fact, what I do in my latest builders on a GPU is very similar to what this paper does except that today you have “warp parallelism” instead of “SIMD” for the SAH evaluation, and you you have CUDA async task launches where this paper talks about a hand-crafted tasking system. So, if you’re really into that topic, might be worth reading, too.

Refitting BVHes and Two-Level BVH Structure

Finally, once we are on the topic of (re-)building BVHes, two other papers come to mind: The first one where we proposed the idea of BVHes for real-time ray tracing – and in particular, of using refitting – was Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies (again, no “BVH” in the title, nor “refitting” – I really suck at picking well-google’able paper titles); and the first paper that proposed the idea of two-level hierarchies with a top-level acceleration structure (TLAS) and botom level accels (BLASes) – back then, still using kd-trees instead of BVHes – was Distributed Interactive Ray Tracing of Dynamic Scenes … that actually also talks about “distributed” because that was the only way to get enough compute for real-time ray tracing back then when; but it’s actually about the two-level accel structure more than it is about the “parallel rendering” component.

Real-time BVH Builds Today

Now today, times have obviously changed, and for most of the time you can probably rely on the BVH builders that come with OptiX or DXR. But from time to time one still needs to have a BVH to traverse in one’s own traversal code (maybe for searches other than tracing rays?). If I “were to go about writing a real-time CUDA builder today, I’d follow the same basic ideas as above, but obviously using GPUs and CUDA for it: One useful technique is still the pre-binning described in the first paper; and/or a two-phase build where a first stage has all threads (across all SMs) work together on splitting individual, large subtrees until they reach a certain size; then a second stage where different SMs (or different warps) work on a different subtree each; with warp-parallelism (ie, 32 threads executing the same SAH evaluation in parallel). Getting the memory allocations right requires some care, as does the proper stream handling to keep the GPU busy – but the base idea is still exactly the same … and still works like a charm. Don’t have any code to share as of right now, unfortunately; but the base ideas are right there. Have fun!

New paper preprint: “A Memory Efficient Encoding for Ray Tracing Large Unstructured Data” (aka “the ‘bigmesh’ paper”), IEEE Vis 21

TL/DR: After having worked on this on and off for now roughly three years, our work on ray tracing really(!) large unstructured data on GPUs has finally been accepted at IEEE Vis; preprint here:

Short story long: This particular paper is part of a larger project that all started when NASA first released the Fun3D “Mars Lander Retropulsion Study” data at https://data.nas.nasa.gov/fun3d/ (with a HUGE “thank you” to Pat Moran for helping to get this data out into the community: Pat, we all owe you one!). This NASA project was about using supercomputing and numerical simulation (using NASA’s “Fun3D” solver) to simulate how a capsule entering Mars’ atmosphere could use “reverse” thrusters to slow down…. and generated a truly tremendous amount of data: The simulations were performed on three different model resolutions (each being a unstructured mesh with mostly tets, but also some wedges and hexes). The fun part: even the “smallest” version of that unstructured mesh was bigger than anything else I had been able to lay my hands on at the time, and the largest one having a solid 3 billion-ish unstructured elements (billion, not million)…. plus several different scalar fields, and many, many time steps.

Just downloading this data initially took me several weeks’ work, with initially a multitude of shell scripts to get all the files, check for broken data/interrupted connections, etcetc (that process has much improved since then…); and I also had to marshal quite some hardware for it, too: over the course of this project I not only got a whole stack of 48GB RTX8000 GPUs (from NVIDIA), I also bought – explicitly for this purpose – something on the order of about terabyte of RAM to upgrade two or three machines we used for data wrangling; and built two new RAID NASes (one 12TB, one 8TB) to deal with the resulting data.

On the software front, the effort to deal with this model was, if anything, worse: though we had plenty code to build on from some previous papers on tet-mesh point location and unstructured-mesh rendering, these codes initially were totally inadequate to deal with this scale of data – no matter how well a given piece of software works, once you throw something like two to three orders or magnitude larger data sizes at it you’ll find “where the timbers start to creak”. Well, that was the whole point of looking at this data, i guess…

Eventually, looking at this data quickly turned from “oh, a new data set to play with” to what turned into a major effort where pretty much each and every one of the tools we had been using before had to be completely rewritten to better handle data of that size; and where we ended up having to go into a whole lot of totally orthogonal directions that each had their own non-trivial challenges: building tools for even interfacing with unstructured meshes of that scale (our umesh library), different approaches to render it (with sampling vs tet/cell marching, different space skipping methods, different (adaptive) sampling methods, etcpp), with lots and lots of new helper tools for stitching, for computing mesh connectivity, iso-surfaces, shells, tetrahedralizations, partitionings, ghost-cells/regions, etcpp; a new framework for data-parallel rendering that can deal with this sort of data (where, for example, individual partitions aren’t convex), and so on, and so on, and so on….

What we realized at some point was that trying to write “one” paper about all of that simply wouldn’t work: there’s far too many individual pieces that one needs to understand for the others to make sense, and just not enough space in one paper to do that. So, to focus our first paper we decided to first describe and evaluate only exactly one particular angle of the whole project – namely, how far you could possibly “squeeze” the data required for a sample-based volume ray marcher operating on an unstructured mesh: given how expensive rendering this data is it was kind of obvious that we’d want to use GPUs for it – but for models of this size, the kind of algorithms and data structures we had used before were simply using (far) too much data. Consequently, one of the angles throughout this project was always to try and encode the same(!) data in a more efficient way, to try and fit ever larger parts of the data on a single GPU … until at some point we could fit all but the largest of the three versions of this data set (and even that largest one fits on two GPUs in a single workstation).

Unfortunately, even that single angle – how far you could possibly squeeze an instructured mesh and BVH to fit on a GPU (and how to best do that!) – turned into a non-trivial effort that in total we have now worked on (on and off) for over two years: part of that problem is the non-trivial amount of data wrangling, where even running one new variant of your encoding can take you another day to run through; but a totally unexpected one was just how much effort we had to put into evaluating/comparing that method. Reviewers (correctly so) asked questions like “how does that compare to tet marching” or “how does it compare to data parallel rendering” – and though these are absolutely interesting and valid questions, the problem for us was that there simply was nothing useful to compare against … so we eventually ended up having to solve all these totally orthogonal problems, too, just to be able to evaluate ours.

For example, just for something as simple as “comparing to tet-marching” you first need to have an implementation of tet-marching, so you need a tet marcher that can deal with this size of data … but before you can even start on writing the tet marcher you first need to have a tet-mesh neighborhood/connectivity information, and computing that for “several billion tets”-sized models (think hundreds of Gigabytes) isn’t exactly trivial, either …. and of course, you can’t even start computing the tet connectivity until you have a tet-only version of that model to start with, which in turn requires you to be able to robustly tetrahedralize a model with billions of mixed elements into another one with even more billions of tetrahedral elements, while properly handling bilinear quadrilateral faces etc while tetrahedralizing, etcpp … and if want to do your tet-marching in data-parallel (which you have to, because it’s too big for one GPU), then you also have to deal with partitioning, computing boundaries/”shells”, etcpp. If I include all this “extra” work required to make this paper, then there was probably far more effort required in doing this paper than in any other paper I’ve ever worked on… and with a healthy margin to spare.

Anyway; the first paper of this project is now finally “through” the reviewing pipeline, and has just been accepted at IEEE Vis 2021 (to later appear in TVCV). As said above this paper touches upon only one part of the whole project, and upon only one particular problem – one that’s an important ingredient to all the stuff around it, but nevertheless only part of the problem. A totally orthogonal aspect of that same “master project” has also just been accepted as a short paper at IEEE vis, but that’ll deserve its own blog post at some later time.

So for now: hope you’ll enjoy the above paper – it’s been a lot of work, but also a lot of fun. In case of any questions, feel free to drop me an email or leave a comment….. enjoy!

PS: Just a few of the images, to give you an idea of that data:

First, a volume rendering of the whole lander – note that there’s literally hundreds of millions of elements in that single view, and often dozens of tets per pixel (and not just in depth, but side by side!):

Now to give an idea of just how detailed the data behind this innocuous looking image is, here a is-surface that I extracted from that (yet another tool we had to write from scratch to deal with this size of data) :

Note this is for a different camera partition, but if you mirror it left-to-right you’ll probably see where they roughly match up. Doesn’t look too bad yet, because the tets (and thus, triangles) on the outside are much larger than those in the turbulent region, giving the totally wrong impression that the mesh wasn’t actually that fine: but this mesh alone is several hundred million triangles – and of course, is only a isosurface “cut” through the volumetric mesh.

Next let let’s use a color-coded version of this iso-surface, where different colors indicate some spatial pre-partitioning into blocks of already non-trivial models (remember, this is a isosurface from a volumetric mesh). As to how big each color-coded chunk is: I can’t say for sure (that image is over a year old), but I’m fairly certain those are the same as the “submeshes” we talk about in the paper, so roughly 64K vertices (maybe around 250K elements?) per chunk.

Now from that, let’s look at only the tiny region that is marked with a red square, and zoom in on that one, with the same “submesh ID” shader:

And finally, let’s take this same view, and render with a triangle ID shader to show individual triangles of this iso-surface:

Though hard to show in still images (it’s much more impressive if you can do that interactively 🙂 ), this hopefully helps in showing just how ridiculously finely tessellated this model is: each triangle you see in the iso-surface is roughly one tet in the volumetric model – except all the “whilespace” that the isosurface doesn’t show is full of tets (and other elements), too. And remember, that is only one tiny part of one plume of turbulence in the model, that in the beauty-shot above maps to just a handful of pixels… fun indeed 🙂

Time to Image: 2 Hours … :-)

If you read this title for the first time, you may be excused for wondering why I’m happy about a 2-hour “time to image”: usually “time to image” means the time it takes for a interactive renderer to get from starting to load a model to when it is ready to display the first image on screen … and anything over “seconds” is usually not all that useful. In this case, however, I meant “time to first image” in the sense of “starting with a completely blank repo, to having a interactive OWL/OptiX-based viewer that can render a model”. And in that meaning, two hours IMHO is pretty awesome!

Let’s take a step back. Why did I actually do this?

On Sunday, I finally posted a first article on OWL, which only recently got tagged as “1.0”, i.e., with some reasonable claim to “completeness”. In this article, I made a big deal about how much easier OWL can make your life if you want to write an OptiX/RTX accelerated GPU ray tracer …. but what could I actually base this claim on? Sure, I/we have by now several non-trivial OWL-based renderers that show what you can do with … but how would one actually measure how easy or productive its use would be?

While on the way to get my coffee I decided to simply run a little self-experiment – initially not to prove a point, but simply because I was curious myself: To do that, I decided I’d pick something simple I always wanted to write but never got around to (I picked a structured-volume direct volume renderer), and then spend a day or two trying to write one. Using OWL, of course, but otherwise starting with a blank repo, with no other libraries, etc…. and to keep an eye on the clock, to see where time goes while doing that.

Stage 1

Oh-kay …. how’d it go? I got home from the coffee run that gave me that idea (had only sipped at this coffee so far…), then went to gitlab, created a new blank repo, called it owlDVR. Cloned it, added submodule, created a CMakeLists.txt that is about 15 lines long, created a ‘deviceCode.cu’ with a dummy raygen program, and started on some simple host code: created a ‘Model’ class to hold a initially procedurally generated N^3 float volume, created a new windowed viewer by deriving from the owlViewer that comes with the owl samples, overrode the ‘render()’, ‘cameraChanged()’ and ‘resize()’ methods, and then started on the actual OWL code: create context, launch params, and raygen; build pipeline and SBT; upload the volume into a buffer and assign that buffer and volume size to launch params in the constructor (and of course, build pipeline and SBT); then add some owlLaunch2D() in Viewer::render() and some owlParamsSet()’s for the camera in Viewer::cameraChanged() ….. and that was pretty much it from the host side.

Then take the deviceCode.cu, and start filling out the raygen program: create a ray from the camera parameters that I put into the launch params, intersect ray with the bounding box of the volume, step the ray in dt’s through the resulting [t0,t1] interval, add a tri-linear interpolation on the buffer holding the volume data; put the result into a hand-crafted procedural “transfer function” (pretty much a clamped ramp function), and get this (and of course, you can interactively fly around in this):

Not really super impressive as a volume renderer per se …. but finally looking at the clock, the whole project took slightly less than two hours, on nothing more than a laptop, while sipping a cofffee. And full disclosure: these two hours also include helping my son with some linear-algebra math homework (yes, covid-homeschooling is just great…).

Of course, I’ve now written a ton of other OWL samples myself, so I’ve gone through the motions of creating an OWL project before, and knew exactly what to do – somebody that’s never done that before might take a while longer to figure out what goes where. However, this still surprised myself about just how quickly that worked – after all, this is not just a simple blank window, but even already includes volume generation, upload, and a more or less complete ray marcher with transfer function etcpp … so the actual time spent on writing OWL code is even way less than that – and that surprised even me.

Stage 2: Fleshing it out

After this “first light” was done, I did spend another two hours fleshing this sample out a bit more: first adding CUDA hardware texturing for the volume and the transfer function, then later some actual model loaders for various formats, and some prettier shading via gradient shading …. all of which took another two hours, which were roughly spent in equal times on

a) creating textures, in particular the 3D volume texture : the transfer function texture was simple, because OWL already supports 2D textures, so that was trivial; the volume texture I had to first figure out how CUDA 3D textures even work, which took a while.

b) creating the actual transfer functions and color maps – I liberally stole some of Will Usher’s code from our exabricks project for that, but then spent half an hour in finding a bug in re-sampling a color map from N to M samples. (I guess the coffee had worn off by that time ).

c) adding loaders, fleshing out float vs uint8 format differences, adding gradient shading, etc.

All together that second stage took another two hours (and in that case this does include the time to help our “tile guy” unload his truck for our bathroom remodel…). Here some images that show the progression of that second stage: left the original first-stage volume with a simple hard-coded ramp transfer function and manual tri-linear interpolation; then the first picture with CUDA textures and a real color map in the transfer function; then after adding a loader for the 302^3 float “heptane” model, and on the right, after adding uint8 volumes and adding gradient shading, with the 2048^2×1920 LLNL Richtmyer/Meshkov-250 model (I had to move to my 3090 for the latter – laptop didn’t have enough memory).

The main thing I haven’t added yet is a real transfer function editor widget, and of course, as with any GUI code that may well end up taking more time than all the above combined… but as a “proof of concept” I’d still argue this experiment was quite successful, because the one thing that I did not have to spend much time on in the entire project was anything involving OWL, SBTs, programs, buffers, etc.

One valid question that any observant reader could raise is that in this sample I didn’t actually use anything that really required RTX and OptiX – I did not create a single geometry, and could have done the same with just plain CUDA. This is true (obviously), and in retrospect I might have picked another example. However, adding some additional lines of code to create any triangle meshes at this point would indeed be trivial, and would almost certainly take less than 5 minutes… In fact, I might do just that for adding space skipping: all I’d need is creating some rough description of which regions of the volume are active, then I could trace rays against that to find entry- and exit-points, and done.

Where to go from here?

As mentioned above, this little experiment started as a little exercise in “I want to know for myself”; and the main motivation for this blog post was to share that experience. However, literally while writing this blog I also just realized how useful it would be for for some users if I documented the individual steps of this toy project in a bit more detail, ideally in enough detail that somebody interested in OWL and/or OptiX could follow the steps one by one, and end up with something that would be “my first OWL / OptiX program, in less than a day”….. That would indeed make a lot of sense; but that’ll to wait for another post…

Introducing OWL: A Node Graph Abstraction Layer on top of OptiX 7

Finally, the day’s arrived: I’m hereby officially “introducing” OWL to the world.

This post is long overdue – I’ve already mentioned / hinted at OWL in previous posts (in fact, we already used an cited it in several papers), but have never yet actually posted or written about it, yet: initially, I didn’t want to write about it before it was “ready enough” for public consumption; then later, I got too busy using OWL for my own projects … and the more we did with it, and the more features it supported, the harder it got to find a good place to start talking about it.

Anyway; I finally tagged OWL as version 1.0 last week, so it’s about time to write a bit about it.

What is OWL?

OWL is, as the title of this post suggests, a “node graph” abstraction layer on top of OptiX 7. To be clear, OWL is not a “renderer” on top of OptiX (such as PRBT is), nor is it real-time “rendering engine” on top of OptiX 7 (as, for example, OSPRay is one on top of Embree); instead, it’s sole purpose it to take all the features of OptiX 7.x, and add a little bit of “magic” on top of it, in order to:

a) make it easier to get started with – and to properly use – OptiX, RTX technology, and hardware accelerated ray tracing; in particular for those are not the kind of “Ninja”-level RTX practitioners that can define the proper data layout and element ordering of a Shader Binding Table (SBT) at 3 am in the morning. Somebody recently decribed OWL as “training wheels for using RTX”; and though I think it’s more than that, it’s still a good picture.

b) to make it more productive to use OptiX 7 and RTX ray tracing, even for those that do dream about SBTs at night (yes, yes, I know…) – by automating some things that do need to get done in any OptiX program, but that the user really shouldn’t need to worry about, such as building the SBT, acceleration structures, etc. By making some of the more common (but time-consuming and bug-prone) tasks simpler for the user, the user can concentrate on what he/she really wants to do (the shader programs that implement the renderer!), not the care and feeding of device buffers, acceleration structures, and SBTs. E.g., once all geometries and groups have been created, building a shader binding table in OWL (even if you have no clue what it is) is as simple as calling “owlBuildSBT()”.

Now before I go into some more detail, here just a few sample pictures that have been rendered with OWL over the last few weeks…. or as I should probably say more accurately, pictures that “have been rendered with several different RTX accelerated renderers built with OWL in the last few weeks”:

Why do we really need something like this?

If you wanted to write a GPU ray tracer a few years ago, you had two options: Either use OptiX, or write your own in CUDA. Today, you not only have much faster ray tracing thanks to hardware accelerated ray tracing, you also have more choice in the sense that you could also use DirectX Ray Tracing, or Vulkan Ray Tracing extensions. With all this choice, the question is why one would need anything more than that.

To fully understand why one would need something like OWL, it is useful to take a look back at OptiX before RTX and OptiX 7 came around: Initially – and up to vertion 6.5 – OptiX was a rather high-level abstraction library, where it was quite easy to get something going rather quickly, by writing the desired closest-hit, any-hit, ray-gen, etc, programs, then define a few “attributes”, “buffers”, and “variables” in the device code that the programs could use. One would create a usually rather simple node graph on the host (that would define, say, a triangle mesh and an acceleration structure), set a few variables to parameterize the device-side programs and geometries, and done. Sure, it still took some time to wrap one’s head around what all these programs were for, and how to map one’s conceptual ray tracer to these ray-gen/closest-hit/etc programs …. but once you had that figured out, the mechanics of creating this “pipeline” of ray tracing programs – and mapping to it the device – was relatively simple. In particular, you wouldn’t even need to know what, say, a “Shader Binding Table” even was (let alone how to build it), or which data structures would need to get built when, etc … OptiX 6 would do all that, fully automatically.

The downside of that approach was that OptiX 6 was pretty opaque: once you started becoming a power-user you might end up with OptiX 6 doing things that you didn’t intend it to do, or at times you’d rather it wouldn’t, etc… and because it was closed source, you could easily end up not even knowing why it sometimes did what it did, or what to do to avoid it. So very easy to get started, but sometimes too opaque for power-users.

When OptiX 7 came around, it made short shrift of this problem, by pretty much stripping away all the “convenience” functionality, all the node graph, etc, and instead exposing the (then newly added) RTX technology on what is pretty much a driver level abstraction (which, by the way, is similar to the DirectX/RT and VulkanRT abstraction levels). In that new “driver-level” abstraction the user has full control over everything, including what CUDA streams get used at what point in time, and which memory allocations happen where, which which type of CUDA memory, etc. This change unlocked a whole new level of performance that users have since made impressive use of, and that was the key behind the last two years’ rapid developments in high-end GPU ray tracing. For a power-user (well, at least for me!), the switch from OptiX 6 to OptiX 7 was an experience that was just amazing, plain and simple.

The downside of this change to a driver level API was that it has become much harder to get started with OptiX (or DXR or VKR, for that matter): instead of setting up a simple node graph on the host, you now have to understand the intricacies of acceleration structures, of setting up build inputs and building/compacting/refitting acceleration structures; of compiling programs and pipelines, building shader binding tables, etc. With OptiX 7 this is still much easier than with, say, DXR or VKR, but for the un-initiated, it can still be daunting…. and even for those that do by now fully understand all these low-level technical details, due to their low-level nature there are a lot of opportunities to shoot oneself in the foot by overlooking something or committing copy-and-paste bugs. This can easily take time that could more productively be spent somewhere else.

What OWL aims to do is help users bridge this gap between productivity and convenience on one hand, and performance and low-level control on the other: Like OptiX 6, it offers a node graph abstraction in which the user can create and parameterize relatively high-level entities like “Buffers”, “Geometries”, “Groups” (ie, acceleration structures), and “LaunchParams”, with OWL then doing all the menial tasks of managing the required device memory, building/compacting/refitting the acceleration structures, setting up launch constants, handling multi-device and async launches, and in particular building programs, pipelines, and, yes, the infamous shader binding table.

While thus clearly aiming for convenience, OWL also borrows a lot of the “give the user control” philosophy from OptiX 7: In particular, OWL is a much “thinner” abstraction layer on top of OptiX 7 than OptiX 6 was: there is no magic compiler technology anywhere in OWL, and all device-side shader code is pretty much exactly the same OptiX code as without OWL (though with a few convenience functions). OWL is also much more “explicit” than OptiX 6 was, in that, for example, it is the user that says when the SBT gets built. Third, OWL is completely “transparent” in the sense that unlike OptiX 6 it is completely open source (https://github.com/owl-project/owl), so the user can always see exactly what it is doing at what point in time … and even if he or she may or may not ever want to touch any code in OWL itself, he or she can still always see exactly what OWL does at any point in time, and what to do to avoid the bottlenecks or crashes. Finally, OWL explicitly aims to allow easy and efficient “inter-op” with CUDA, and one can, for instance, at any type query the device addresses of buffers or the CUDA streams used for a launch, etc, …. so it is, for example, absolutely possible to launch a CUDA kernel that reads from or writes to any of OWL’s data, or to launch into the same CUDA streams used by OWL, to run CUDA kernels asynchronously to OWL launches, etc.

How “real” is this?

The web (or even only the github-section of it) is full of libraries at varying levels of completeness, often abandoned, or doing only exactly what the proejcts’ author needed the project to do, and hopelessly incomplete for anything else. And yes, I’m absolutely sure that there will also be some bugs or missing features in OWL, that simply haven’t been found yet because nobody has yet used it in a specific way that would trigger the respective bug or missing feature. In particular, OWL is not a official “product” with a big team of engineers whose sole job it is to maintain this code – it is a library I’ve originally written because I had need of it myself, and that has simply grown to be much more than that.

At this stage OWL is still relatively new, and will undoubtedly still have some teething problems. However, it is now more than two years in the making, and at least judging from the last few months the teething problems seem to be mostly over. OWL now has been used successfully by different users, and for actually several very different kinds of different applications; in fact, I don’t think I’ve done a single project in the last year that did not get easier by using it, and the list of things that got newly added over the last few months is rather small. To give an idea of how far OWL has come, and what it can already do: here a brief selection of what has already been publicly written about (or is otherwise publicly accessible), and all the pictures here have all been done with renderers that built on OWL.

“Mowlana“. Mowlana started out as a sandbox for comparing/stress-testing different rendering back-ends (e.g., OptiX 6 vs OptiX 7), but has since developed into a bit more of “Moana on OptiX” viewer. I’ve recently written about this (https://ingowald.blog/2020/10/26/moana-on-rtx-first-light/), and though it’s clearly still “work in progress” virtually all the work of the last few weeks and months has revolved around things like data wrangling, with the entire OptiX component done by OWL, period.

OWL Samples. OWL itself comes with a few intentionally simple and self-contained samples; though intentionally simple these already demonstrate features like triangle and user geometry, different buffer and acceleration structure types, multi-gpu (pretty much free in OWL), different ray types, multi-level instancing, refitting, motion blur, etc. Here a few screenshots showing a OWL version of Pete Shirley’s Ray Tracing on a Weekend (plus some extensions, just because with OWL it was so easy to add this :-)), a OWL version of our OptiX 7 Siggraph course viewer, and some simple ones with multi-level instancing, IAS updating/refitting, and motion blur:

“Exabricks“: Arguably the prettiest images I’ve been involved in in a while (though most of the credit for the right transfer functions etc goes to Will, Nate, and Stefan…) – our “ExaBricks” Adaptive Mesh Refinement (AMR) Rendering project that we presented last week at IEEE Vis – with all the actual ray tracing and rendering of course done in OWL. (OWL also automatically handled the multi-GPU rendering that this challenging scene could really make use of).

Paper: “Ray Tracing Structured AMR Data Using ExaBricks”. I Wald, S Zellmann, W Usher, N Morrical, U Lang, and V Pascucci. IEEE TVCG(Proceedings of IEEE Vis 2020).
https://www.willusher.io/publications/exabrick

“OWL Tubes”: This started as a proof of concept that RTX ray transforms can also be used to accelerate intersections for thin primitives like tubes and hair (see our HPG 2020 Paper on this through this link) … but even for this rather low-level operation, it was eventually easier to do it through OWL than through OptiX natively. For these pictures, too, all the credit goes to the other authors of the mentioned paper:

“Unstructured Mesh Rendering”: Though some of the latest results that produced these particular images are not actually published yet, here a few screenshots from our latest unstructured-mesh rendering, on the NASA Mars Lander Retropulsion Study data set:

We also used OWL for some of our recent papers on fast tet-mesh and unstructured-mesh point location (and some applications of that), but I’ll skip these for now.

“OWL Prime, and Primer”: While the “main” interface to OptiX 6 was the node graph layer – with closest hit, intersection, miss, and raygen programs etc – it also came with an additional API that allowed users to set up only the geometry part of the scene, and then trace entire wave-fronts of rays, and get back wave-fronts of intersections. This abstraction comes with a lot caveats (that I will not go into here), but – based on feedback I got – was still quite useful for a lot of users. For OWL, I developed a similar library called “OWL Prime” – though by now I mostly refer to it as “Prime Owl”, as if it was a feathery animal – that offers the same abstraction level, including asynchronous launches, automatic (and async) upstreaming/downstreaming of ray/hit streams (if data lives on the host, etc).

And just to put that library through its paces I also wrote a “little” wavefront renderer (fittingly called “primer”) that, by now, has stolen almost all the shading / material / fresnel / distribution / sampling / microfacet / etc code from PBRT – doing the PBRT shading in CUDA, and all the tracing in prime-owl. Again, here a few proof of concept pics, these ones fresh from Sunday night:

This is clearly still “work in progress” (PBRT isn’t ported in a weekend …), but as a proof of concept it was more than successful: Literally all the remaining work is entirely on the shading/sampling/materials side – I didn’t even have to change anything in prime-owl, let alone in OWL.

I also have some version of Pete Shirley’s “Ray Tracing in a Weekend” scene where all the shading is done on the host (using a literal copy of Pete’s code), and all the tracing is done asnchronously on the GPU (with prime-owl using OWL’s launches and support for CUDA interop to do this, of course).

VisII: VisII is a python-scriptable, ray tracing based “scene imaging interface” (ie, a renderer) that allows a python user to easy create, modify, and render both photo-realistic images as well as certain derived images like normals, object IDs, etc. The main use case for this is use in robotics and other AI / Machine Learning based algorithms, where users can easily create high-quality images in python, including the additional “labels” that many algorithms require. Here again a few pictures:

Of particular interest to this post is that VisII was the first renderer that I did not write at least a significant portion of myself; it is almost entirely written by Nate Morrical (latest code is on https://github.com/owl-project/ViSII), and though initially there were certainly things missing in OWL that I had add for some of his more advanced use cases – in particular, I had to add buffers of textures, different texture settings, and motion blur, if I remember correctly – it is nevertheless a non-trivial application that could be done by somebody other than the author of the underlying library… which is always a good litmus test.

There are, in fact, a few more projects using OWL, but for the sake of space I’ll leave those out for now.

Now finally, how does one actually use it??

Oh-kay .. this has been a loong post; there will, eventually, be much more to be written about, but at least for now, I hope this will have given a rough idea of what OWL is, how it works, and what it can already do. If, as I hope, this had made you at least curious as to how it actually works, then your next question will likely be “well, but does this actually look like to a potential user?”.

For that, I had initially written a rather detailed “primer” on all the OWL concepts, with sample codes, etc, but that turned out to be a bit too long for a single post, so I’ll defer that to a later post. For now, all I’ll do is provide a tiny (and obviously incomplete!) example of creating a simple scene of a set of triangle meshes (and a single instance thereof), very similar to the kind of content that the “optix7course” sample would have done. Leaving out all the pieces of loading data, etc, this would look roughly like this:

// declares how device-side triangle meshes look like, used in 
// owlGeomTypeCreate, not shown here
OWLVarDecl triMeshVars[] = {
  { "diffuseColor",OWL_FLOAT3,OWL_OFFSETOF(MyTriMeshClass,diffuseColor) },
   ...
};
...
OWLGroup createWorld(host::Scene *scene)
{
  std::vector<OWLGeom> geoms;
  for (auto mesh : scene->meshes) {
    // first, create the vertex/index buffers:
    OWLBuffer vertices
       = owlDeviceBufferCreate(context,OWL_FLOAT,
               mesh->vertices.size(), mesh->vertices.data());
    ...
    // second, create the geometry, and assign buffers
    OWLGeom geom = owlGeomCreate(context,triMeshGeomType);
    owlTrianglesSetVertices(geom,vertices,....);
    owlTrianglesSetIndices(geom,indices,...);
    
    // third, assign user data (_user_ declared what that is!)
    owlGeomSet3f(geom,"diffuseColor",mesh->diffuseColor.x,...);
    ...
    geoms.push_back(geom);       
  }
  // build bottom level accel:
  OWLGroup blas 
    = owlTrianglesGeomGroupCreate(context,geoms.size(),&geoms);
  owlGroupBuildAccel(blas);

  // build top-level/instance accel struct:
  OWLGroup ias
    = owlInstanceGroupCreate(context,1,&blas);
  owlGroupBuild(ias);

  //done:
  return ias;
}

Of course, I left out a whole lot of stuff here – context creation, definition of the geometry type, creating a module that contains the device programs, the device-side closest-hit and other device programs themselves, the launch call, etc. However, if you can read the above code then all these other things I left out should be rather simple to use, too. For example, once all geometries and accel structs are built, creating the shader binding table is a single call:

owlBuildSBT(context); // that's it ...

Similarly, assuming a launch params object has already been created, launching a frame is as simple as this:

void render() {
   // note _what_ variables are in the launch params is _user_'s choice!
   owlParamsSetWorld(myLaunchParams, "world", world);
   owlParamsSetBuffer(myLaunchParams, "fb",frameBuffer);
   ...
   owlLaunch2D(rayGenProgram, fbSize.x, fbSize.y, myLaunchParams);
   float4 *pixels = (float4*)owlBufferGetPointer(frameBuffer,0);
   ...
}

Again, I’ve left out a whole lot of stuff; I’ll write a more detailed “primer” on OWL soon, but for this post, all I wanted to do is give you an idea that OWL exists, what it is, and that it is now ready for use.

If you’re interested in learning more: first, have a look at the OWL repo on github, namely https://github.com/owl-project/owl. In particular, OWL comes with a set of samples that are intentionally built in an almost “tutorial” style way, from very simple command-line ones that create a single triangle mesh, to more advanced ones with interactive model viewers, etc; these are all part of the github repo, under https://github.com/owl-project/owl/tree/master/samples. If you just want to get an idea of how OWL works, most of these should be rather self-explanatory, though I’d suggest to start with the simple command-line ones, so you don’t get distracted by any windowing code. And finally, I just started creating a github wiki as well, to provide some pages with explanations about what variables and launch params are, how to use them, etc: https://github.com/owl-project/owl/wiki. Going forward, this wiki will likely become the main entry point for documentation, how-to’s and frequently asked questions.

Wrap up

If this post did entice you to “maybe” give OWL a try, please feel free to do so; it’s completely free and Apache “do-as-you-please” licensed. If you run into issues, or have any specific “but how do I …” questions, most certainly let me know, and/or file an issue on the github issue tracker. And if you end up using OWL as “training wheels” to get the hang of it, and then later-on decide to go OptiX native – maybe even by copying useful pieces from OWL and discarding then rest – then that’s perfectly fine, too. I really do hope you’ll like it – there’s a tremendous amount of work in OWL, and I personally couldn’t imagine not using it any more … so I really do hope others will find it as useful as I do. And finally: If you do anything cool with it, let me know!

With that: back to work 🙂

PS: some links to further info:

OWL itself, on github: https://github.com/owl-project/owl
the (still-in-progress) OWL wiki pages: https://github.com/owl-project/owl/wiki
VisII, as a quite advanced OWL sample: https://github.com/owl-project/ViSII

PPS: Just because I know the question will come up: OWL stands for “OWL Wrappers Library”, because that’s exactly what it originally started out as: namely, a set of simple “wrappers” that would help with things like GPU memory allocation, up/download, building acceleration structures, setting up build inputs, etc. It just turned out that this abstraction level just wasn’t nearly enough to address what was the real elephant in the room (the beloved SBT… ), so by now that is a total misnomer …. but it sticks, so OWL it still is.

“The Elephant on RTX” – First Light. (or: “Ray Tracing Disney’s Moana Island using RTX, OptiX, and OWL”)

TL/DR: After Matt’s original “Swallowing the Elephant” with PBRT, and my own “Digesting the Elephant” with OSPRay, I finally got some “first light” on my “Moana on OptiX” sandbox; not fully done yet, but good enough to at least show a glimpse of. I’ll briefly say something about the voyage I took to get there, and will give a brief overview of what’s currently implemented, and what’s still missing.

Moana on OptiX/RTX: Background

As already mentioned/hinted at last week, I had actually been working on “Moana on OptiX”, on and off, since the very day I joined NVidia… which by now is over two years ago. In fact, there’s more than one such sandbox: at some point I had an OptiX 6 based Moana viewer (no RTX), an OptiX 6.5 one (with RTX), an OptiX prime one (pre-7, obviously), an optix-7-alpha one (way before 7 got released); then another pure CUDA one (with my own CUDA ray tracing core), and even one that ran the same CUDA shading code (with a few #define’s and template magic) on both CUDA, and on the CPU w/ either embree or OptiX prime as a backends … and probably a few more that I now forgot about. I also had a version that used NVLink to split the model over multiple GPUs; one that used managed memory to “page” to the host; and even one that can run mpi-data parallel (with full path tracing!) across diffurent GPUs and/or several different GPU nodes….

Of course, most of those Moana viewers were rather “prototypical” in the sense that they all looked at different aspects of the problem. For example, the CUDA-only version was heavily optimized towards lowest possible memory consumption, etc. Moreover, several of these sandboxes tried to be useful for both Moana and other heavily instanced PBRT models like “landscape” and “ecosys” – but since these all use very different materials, textures, lights, etc the latter ended up being a giant distraction….

All these different sandboxes were all nice and well …. but none of those sandboxes ever went all the way, and not one ever rendered the that model in the way you’d expect it to look: either the respective sandbox didn’t have a path tracer, or it didn’t include the water, or it didn’t use the lights, or the textures, or the curves, or … something that made it “un-showable”. That’s not to say that I didn’t ever have any of these individual components: in fact, I had some reservoir-sampling based direct lighting (from both quad lights and HDR envmap) over a year ago; I also at some point borrowed the Disney BRDF implementation from Will “TwinkleBear” Usher’s ChameleonRT ray tracer (https://github.com/Twinklebear/ChameleonRT); I had baking of the PTex textures in my github pbrtParser repo (https://github.com/ingowald/pbrt-parser) a long while back; Dave Hart gave me his curves tessellation code for the PBRT curves a loooong time ago, etc …. I just never had all of those particular pieces in the same viewer at the same time.

Anyway – I still don’t have all these pieces together right now, but triggered by Chris’ blog post last week I at least sat down and started pulling together all the pieces I still had, and trying to get my Moana back to rendering on a GPU. In particular, I finally sat down and extended my path tracer to also be able to handle water, by pretty much stealing Pete Shirley’s “Dielectric” material from his “Ray Tracing on a Weekend” series (because yes, I have no clue about things like Schlick, Fresnel, etc) …. and as of Saturday night, I finally have some “first light”, including something that’s at least looking like water:

BTW: The above renders quite interactively; currently at something like 25-ish fps on a RTX 8000 (at 2560×1080), and using about 32 out of the 48GBs of RAM. I (obviously) use progressive refinement, so while you move around the image is somewhat more noisy – but by the time you can even click on the screenshot app you pretty much get the above. (I’ll release the code at some point in time, when I cleaned it up a bit).

Moana on the GPU: Main Challenges

As several previous posts/articles have pointed out, there are a multitude of challenges in this model. I’d particularly point to Matt Pharr’s original “Swallowing the Elephant” series; to my own blog post (and accompanying paper) on “Digesting the Elephant”; and to Chris Hellmuth’s recent “GPU-Motonui” blog.

Most of these issues revolve around the sheer amount of data involved in this model, and in particular the data wrangling required to even get it loaded into a form that you can even start rendering it. For the GPU version, however, a few of these particularly stuck out:

Textures: All the textures in Moana are in Disney’s PTex format …. but there’s no PTex on the GPU, yet (at least, none that I could find). My first versions rendered without textures, but without textures, this model looks really different, as you can see by playing with this fancy new wordpress “image compare” feature:

Envmap: The envmap not only comes in two forms (EXR for HDR, and png for the default PBRT model), it’s also super-important to exactly match the orientation used in the pbrt file: in particular, the envmap is tilted to the back to account for the fact that the default camera points downwards; if you don’t do that you see the envmap’s lower “ground” half peeking through betwen the ocean and the clouds, which is really disturbing. I had to literally copy pieces of Matt’s PBRT code over to make it match
Instance count: The model has close on a 40 million instances (even if you make sure to only create those you need ….) – but early versions of OptiX only allowed 16 million per instance BVH. Early versions of OptiX didn’t allow multi-level instancing, either, so at times I had to have three “root” BVHes to trace into serially. Later versions used two-level instancing, which one root IAS over smaller second-level IASes …. which of course asks the question of “how do you best partigion 40 million instances into N groups of less than 16M each …. but I digress – since 7.1 OptiX can do more than 40 million, so this problem is gone.
Number of tiny meshes: Though the PBRT file contains some really big meshes, there are also a lot (!) of tiny ones. This was an issue mostly for OptiX 6, but even in OptiX 7 this would create a lot of different build inputs and SBT entries. Even worse, for some of the objects one ends up with a few really large meshes plus a ton of tiny ones, all in the same group/BVH…. which caused some other issues.
Water: The water in this scene is actually particularly tricky: not only do you need a water shader at all, the water is also – in some parts of the model – modeled twice : there’s the main body of water in a giant box (with some low-frequency waves pattern on top), but there’s also a second surface with a higher-resolution tessellation of the waves within the default camera frustum. Now as long as you only use that default camera it’ll be OK, but once you move around, and some pixels refract the water twice, you get some really disturbing pictures.
Memory: This model is big. Really big.

Moana on RTX: Current State

In its current state, my sandbox does the following:

Textures: To make the ptex-textures appear on the GPU I currently use some bake-out tool I wrote for exactly that purpose: For each of the input model’s polygon meshes I first re-construct the original quads, and bake out a tiny 16×16 texel “micro-texture” for each of those quads; these then get throws into a larger texture atlas that gets uploaded as a 2k-x-whatever CUDA texture. During rendering, the code material shader then reconstructs the current triangle’s corresponding texture coordinates within its corresponding mesh’s atlas, and uses a cuda bi-linear texture lookup. Total memory for that – at 16×16 texels per quad – is about 3.5GB, which isn’t too bad. I’m sure that this baking out does create some lower texture quality than PTex, but right now I’m pretty happy with it – it also saves a ton of memory (3.5GB vs 40-ish in original PTex files), and is super-fast (’cause I can use texture hardware …).
Materials: Though I did have some Disney BRDF at some point in time (borrowed from Will’s ChameleonRT) I could never make this work with the water. Currently, I use the material’s “specTrans” value to determine if the material is water or not, and use either Pete Shirley’s RTOW “Dielectric” material (for water), or his “Lambertian” (for everything else). For the water I’m tracking whether I’m already in/out of the water, and just pass straight through any second water surface the path may encounter.
Lights: I currently use a plain forward path tracer, until the path hits the environment map. All other light soruces get ignores, and even for the envmap I currently use only the LDR version, since that’s what the original PBRT file uses.
Curves: Curves currently get tessellated into triangle meshes during loading. Since this generates a ton of additional geometry I use a very low tessellation rate, but still can’t see any artifacts from that, likely because the input patches are already pretty small.
Triangle Meshes: every other geometry in this model is a triangle mesh, anyway. To avoid the many small meshes I currently merge all triangle meshes within a PBRT “object” into a single, large triangle mesh, and store, for each triangle in this model, an index of a tiny struct describing the corresponding sub-mesh’s data. This triangle mesh then gets uploaded to OWL, by creating the proper OWLGeom and OWLTrianglesGeomGroup.
To save on memory I do not use the texcoord and primitiveID arrays from the PBRT file, and simply compute this information on the fly in the CH program.
Instances: For older versions of OptiX I had to do some extra steps with multiple BVHes and/or multi-level instancing; in the latest version this is no longer required: I simply create a single list of all instances, throw those into and OWLInstanceGroup, and done.
Model import: I use my github pbrtParser project for all model importing – this library allows to first convert from the ascii PBRT model to binary “.pbf” version (with the exact same data), and loading from this format is few orders of magnitude faster than from the ASCII version … so very useful. Some of the set-up stages then get done right away on the “scene graph” that this library loaded: ie, transforming into a strict single-level instancing model, tessellating the curves into triangle meshes, extracting (and then removing!) the light sources, extracting the default camera pose and screen resolution, etc.
OptiX use: The actual OptiX usage in my latest viewer is all through OWL: OWL makes it (so!) much easier to deal with things like buffer uploads, launch params, building of data structures, constructing SBTs, etc, that I wouldn’t want to part with it (well, it got written largely for this very purpose…).
In particular, having all the low-level OptiX code “hidden” through owl is a great help in debugging: getting this model wrangled is enough of an issue in itself, so knowing one doesn’t even have to look for bugs in things like setting up build inputs or building SBTs is a huge help.
Other than that, the use of OptiX is pretty straightforward: I create the per-object BLASes and instance accel struct as described above; there’s one closest hit program for the triangle mesh that “deconstructs” the merged-mesh information, and stores primitive ID, instance ID, material ID, texture ID, etc, in the per-ray data. All textures, materials, etc, get first “serialized” on the host (ie, all textures, materials, etc, first all get collected into a single linear array each), then uploaded into an OWLBuffer of the respetive OWL_USER_TYPE(DisneyMaterial), OWL_TEXTURE, etc; these buffers then get attached to to the global LaunchParams from which they are accessible to both CH program and raygen program. All path tracing currently happens in the raygen program.

Moana RTX: What’s (still) Missing

OK, with all this implemented, what’s still missing? A lot, actually:

Disney BRDF: The current material model I have is “either it’s Dielectric, or it’s Lambertian” – for this particular model that’s actually not too far off, since most of the matrials are actually configured to look pretty much like that. However, it would still be useful to have the full Disney BRDF working.
HDR Env-map: The env-map should be HDR, but in the code above isn’t – I did have that in the past, but then took the EXR loader out to avoid some windows issues…
Quad Lights: The model contains a lot of “accent (quad-)lights”, in particular over the beaches; and these give the scene a nice reddish-warm “glow” that makes it look totally different. In the code above the light geometry is already separated from the surface geometry (else they show up as annoying white quads all over the place); but in the code above they’re not yet used. Given the large number of them one has to do some importance sampling, for which I have in the past used some reservoir sampling…. but that code got lost somewhere, so isn’t hooked up right now.
Next-event Estimation and MIS: Right now the path tracer just traces a ray until it hits a light source; and for the LDR envmap that works just well … but the moment I’ll make that envmap be HDR again that’ll get noisy… so will have to (re-)add some NEE and MIS.
“Real” Curves: I currently tessellate the curves (palm fronds, mostly) intro triangle meshes; but since OptiX 7.2 can also do curves I should now be able to save that memory.
Denoising: Currently not yet done – mostly because I’m still not sure whether I should do “the right thing” and first add denoising cleanly in owl, and then get it for free here … or rather to the “quick-hack” and get it done here, first …. we’ll see.
Animated water: The one piece I’ve never done yet – but in theory, the water in this model is animated…
More memory squeezing: I currently use about 32GBs of memory for this model – but since last week I got a shiny new 3090, so would obviously like to get that below 24GB. There’s several obvious ways for doing that; in particular the normal arrays can be encoded with way fewer bits, and right now I’m not even freeing the vertex and index arrays after building, which since OptiX 7.1 I could actually do (one can query the hit triangle’s vertices from OptiX).
Model/material fixes: there’s several objects in the PBRT model that have obviously broken/missing material data. Most of those I’ve now found and eliminated, but for some reason I haven’t yet identified my beach and ocean floor surface are all gray – which may well be the most annoying visual artifact in the current viewer. This may of course be a bug in my parser/importer, but whatever it is, I haven’t found it yet.

Anyway, having worked on this model for so long, it was a really “high”-moment on Saturday night, finally seeing something that looks at least roughly as one’d expect.

I’ll be working on those missing features on and off going forward; will update once I get some of those things working.