Parallel/GPU-based Construction of Left-Balanced k-d Trees

Similar to the article I wrote last week on stack-free k-d traversal (see here), this post is about a “left-over” from many years back – one that I’ve been using on-and-off for ages, but never found the time to actually write up and share.

Well, as the very existence of this post may give away I actually did find the time to finally write this up and share it (and man, does it feel good to finally get this off my stack, after a few years of that pressing on me!). If you’re interested: the write-up is on my usual publications page (https://www.sci.utah.edu/~wald/Publications/), as well as on ArXiv (well, it will be there once it actually finishes processing – by Friday, if the email I just got is to be trusted). In addition, there’s also sample code in the github repo that I created for the reference implementation, see here: https://github.com/ingowald/cudaKDTree .

If you’re working with k-d trees (the left-balanced sort, that is), and need a GPU-accelerated way of building them, have a look. I’m not claiming this to be the un-doubtedly fastest way you could ever build a left-balanced k-d tree on a GPU (in fact, I’m fairly sure that faster methods exist, ahem), but what I really, really like about this particular method is what I’d call the the “soft metrics” of ease of use and simplicity – it’s all a pretty simple CUDA update kernel (that literally fits into a single laptop screen!), plus a bit of parallel sort… easy to use with/update for/template over different data types like int2, float3, float4, structs-with-payload (eg, photons in photon mapping), etcpp, and can be used in a header-only way, too. I’ve now used this method several times, and unlike for some other methods I have it was always trivially simple to just drop this in and use it.

The weakest part of that algorithm is that – at least for the reference implementation – I used thrust::sort for sorting, and since I need to use a custom sort operator that does fall back to a relatively slow implementation of sorting, and to one that actually needs a roughly 2x memory overhead during sorting. Both of that could, of course, be fixed by doing a better swap-based sorting via bitonic sorting … but “that is left as an exercise to the reader” :-).

Enjoy.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s