Parallel BVH Construction

Now this is a new one : usually I use this blog to write about newly published papers, or share some experience with some currently ongoing projects – but this one time, I’ll actually use this blog to direct attention back to some papers that actually happened a looong time ago. Reason: I recently had several discussions with people about the topic of fast, parallel BVH construction, and always replied by “just read that paper I wrote about that; I still use the same techniques as in my paper from 10 years ago, just with a GPU rather than a CPU” …. only to get long stares because whoever I talk to then often doesn’t know what the heck I’m talking about.

That confused me a while, until last week I actually tried to google this handful of papers that I had so vividly in my mind … and realized I couldn’t actually find them myself, because google just didn’t find what I was looking for. Now as it turns out, I’m not so delirious that I was imgining any papers that I never wrote – but unfortunately, when I wrote those papers I didn’t know how the lingo in ray tracing would eventually evolve, and apparently ended up picking paper titles that google just doesn’t match to those keyword you were probably search for: What I (naturally) googled for was “parallel bvh construction” (and if you reach this page, that’s probably exactly what you did, too) … but back then when I wrote the paper, I didn’t know that this would be the terminology everybody nowadays uses, and used different terms.

Parallel BVH Construction

So, if you are interested in parallel BVH construction (hopefully on a GPU, nowadays), I’d like to point your attention (and hopefully, google’s search algorithms) to the following two papers: First, the first one I did that actually looked at parallel construction (on a brand new “Clovertown” CPU back then, with four(!) hardware threads!) was “On fast Construction of SAH-based Bounding Volume Hierarchies“. Funnily even the web link contains the terms “parallel” and “BVH”, but the title itself contains neither, so google – at the time of this writing – really doesn’t want to find it. But if you are interested in some of the very first approaches to building SAH BVHes in parallel, this is the one – today I’d use CUDA kernels, of course, and way more threads – but the core concepts are the same. Interestingly, if you look at section “4.2. Grid-based Binning” that’s actually just a different name for what nowadays you’d call a Morton pre-binning step (well, it’s a bit more general because Morton codes are power of two, and that grid is not – but the core idea of an initial pre-binning, with an implicit BVH over those bins and a second BVH building step inside those bins, that’s the same …. but I digress).

Second, a more real-time implementation of that was done a few years later, at the time on the “MIC” Many-Core Knights-Series of Intel CPU in the following paper: Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture. That one is a bit more specialized because it really targeted those certain chips that Intel was doing at the time (bonus assignment for the advanced graphics geek: try to guess what the “LRB” in the URL may refer to…), but it’s also still relevant because this was the first one to really target real-time re-builds, and really had to look at all these things like having lots of parallel threads (just like a GPU today), at having wide “SIMD”, etc – in fact, what I do in my latest builders on a GPU is very similar to what this paper does except that today you have “warp parallelism” instead of “SIMD” for the SAH evaluation, and you you have CUDA async task launches where this paper talks about a hand-crafted tasking system. So, if you’re really into that topic, might be worth reading, too.

Refitting BVHes and Two-Level BVH Structure

Finally, once we are on the topic of (re-)building BVHes, two other papers come to mind: The first one where we proposed the idea of BVHes for real-time ray tracing – and in particular, of using refitting – was Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies (again, no “BVH” in the title, nor “refitting” – I really suck at picking well-google’able paper titles); and the first paper that proposed the idea of two-level hierarchies with a top-level acceleration structure (TLAS) and botom level accels (BLASes) – back then, still using kd-trees instead of BVHes – was Distributed Interactive Ray Tracing of Dynamic Scenes … that actually also talks about “distributed” because that was the only way to get enough compute for real-time ray tracing back then when; but it’s actually about the two-level accel structure more than it is about the “parallel rendering” component.

Real-time BVH Builds Today

Now today, times have obviously changed, and for most of the time you can probably rely on the BVH builders that come with OptiX or DXR. But from time to time one still needs to have a BVH to traverse in one’s own traversal code (maybe for searches other than tracing rays?). If I “were to go about writing a real-time CUDA builder today, I’d follow the same basic ideas as above, but obviously using GPUs and CUDA for it: One useful technique is still the pre-binning described in the first paper; and/or a two-phase build where a first stage has all threads (across all SMs) work together on splitting individual, large subtrees until they reach a certain size; then a second stage where different SMs (or different warps) work on a different subtree each; with warp-parallelism (ie, 32 threads executing the same SAH evaluation in parallel). Getting the memory allocations right requires some care, as does the proper stream handling to keep the GPU busy – but the base idea is still exactly the same … and still works like a charm. Don’t have any code to share as of right now, unfortunately; but the base ideas are right there. Have fun!

One thought on “Parallel BVH Construction

  1. Hi! I was mind blowed when found Karras’ 2012 paper –

    Probably I find it via NVIDIA’s blog posts series “thinking parallel” –

    The possibility of adapting such traditionally linear (and often recursive) algorithm like splitting data into two parts to form some kind of KD-tree into truly per-element independent form, into massively parallel fashion was so inspiring! šŸ™‚


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s