Device/GPU side sorting in OpenMP (ie, with #pragma omp target)

I’ve been playing a bit with OpenMP recently – in particular, with the pragma omp target based device offloading that OpenMP 5.0 and newer are offering. Overall I really like it, but one of the things I found is that whereas classical GPU languages have lots of helper libraries for common things like sorting, this isn’t (yet?) the case for OpenMP target offloading. Of course, you can also just sort on the host by properly mapping the data, but for those that approach omp target offloading in the same way they would with CUDA this doesn’t feel exactly right.

So, bottom line: For some OpenMP BVH builder I was writing I realized I needed an OpenMP based sorter, and not finding one I just took an existing CUDA-based bitonic sorter that I had lying around, and ported it over to OpenMP. Not the fastest way of device-side sorting, maybe, but super flexible because it works for any operator<(key_t,key_t) comparable data type, trivial to extend to key/value sorts, workable for any size and type of input data, etcpp. And since that code might also be useful for others I just put it into its own repo, too: https://github.com/ingowald/openmp_target_sort/
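For those who have not seen this kind of sorter before, here is a minimal sketch of the general idea. It was written for this post and is not copied from the repo: the name bitonic_sort_sketch is made up, and it assumes that keys already points to device-accessible memory (for example from omp_target_alloc()). Each pass is one parallel loop of independent compare-and-swaps, and pairs whose partner would fall outside the array are simply skipped, which is what lets it handle any input size.

```cpp
#include <cstddef>

// Sketch only, NOT the code from the openmp_target_sort repo.
// Assumes 'keys' is device-accessible and key_t supports operator<.
template <typename key_t>
void bitonic_sort_sketch(key_t *keys, std::size_t n)
{
  // k: length of the sorted runs that exist after this round of passes
  for (std::size_t k = 2; (k >> 1) < n; k <<= 1) {
    for (std::size_t j = k >> 1; j > 0; j >>= 1) {
      // first pass of a round pairs i with its mirror image inside the
      // run (i ^ (k-1)); all later passes pair i with i ^ j
      const std::size_t mask = (j == (k >> 1)) ? (k - 1) : j;

      #pragma omp target teams distribute parallel for is_device_ptr(keys)
      for (std::size_t i = 0; i < n; i++) {
        const std::size_t partner = i ^ mask;
        // each pair is handled by its lower index only, so no two
        // iterations of one pass ever touch the same element
        if (partner > i && partner < n && keys[partner] < keys[i]) {
          key_t tmp     = keys[i];
          keys[i]       = keys[partner];
          keys[partner] = tmp;
        }
      }
    }
  }
}
```

Because every comparison moves the smaller key to the lower index, skipping out-of-range partners behaves as if the array had been padded with "infinitely large" keys at the end. If this is built without OpenMP enabled the pragma is ignored and the same loops run serially on the host, which is handy for testing.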

To make this look somewhat like the existing omp_target_alloc() etc I called the function omp_target_sort() (mostly to make it clear to whoever calls it that the key and/or value array(s) have to be device-side data). The code is fully templated, and header only; so should be easy to use in any codebase, and with any data type. I have not yet bothered adding a custom comparator based version, but that should be easy to add if/when required (if so, let me know).
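To give an idea of what a key/value variant with a custom comparator could look like, here is a hypothetical sketch. None of these names (bitonic_sort_pairs_sketch, DescendingSketch) exist in the repo, and the comparator is assumed to be a small, trivially copyable functor; treat it as an illustration of the idea, not as the repo's interface.

```cpp
#include <cstddef>

// Hypothetical sketch, NOT part of the openmp_target_sort repo.
// Assumes 'keys' and 'values' are device-accessible arrays of length n,
// and that 'less' is a small, trivially copyable comparison functor.
template <typename key_t, typename value_t, typename less_t>
void bitonic_sort_pairs_sketch(key_t *keys, value_t *values,
                               std::size_t n, less_t less)
{
  for (std::size_t k = 2; (k >> 1) < n; k <<= 1) {
    for (std::size_t j = k >> 1; j > 0; j >>= 1) {
      // mirror partner in the first pass of a round, i ^ j afterwards
      const std::size_t mask = (j == (k >> 1)) ? (k - 1) : j;

      #pragma omp target teams distribute parallel for \
              is_device_ptr(keys, values) firstprivate(less)
      for (std::size_t i = 0; i < n; i++) {
        const std::size_t partner = i ^ mask;
        if (partner > i && partner < n && less(keys[partner], keys[i])) {
          // values always travel together with their keys
          key_t   tmpKey   = keys[i];
          value_t tmpValue = values[i];
          keys[i]         = keys[partner];
          values[i]       = values[partner];
          keys[partner]   = tmpKey;
          values[partner] = tmpValue;
        }
      }
    }
  }
}

// example comparator: "a comes before b" if a is the larger one,
// which yields a descending sort
struct DescendingSketch {
  template <typename T>
  bool operator()(const T &a, const T &b) const { return b < a; }
};
```

Like any bitonic sort this is not stable, so values with equal keys may end up in either order.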

With that: enjoy….

Update: OWL now (finally) hosted on https://github.com/NVIDIA/owl

I’m aware that not everybody (yet) knows what OWL actually is, and for those that don’t: yes, I know I have to write a blog article about OWL at some point in time to introduce it – but this is not the blog post that’ll do this.

For those that do know what OWL is, and are currently using it: Please be aware that the “root” github repo for OWL has changed: due to broader use OWL is now officially hosted under the github NVIDIA org (to be more precise, at https://github.com/NVIDIA/owl ). If you’re using OWL in your project, please make sure to update your git submodule URL accordingly, as all future updates, bugfixes, etc, will all happen in this repo.

Unfortunately, this move has forced a few (one time) changes: In particular, whereas the original repo used “master” and “devel” branches (and often had multiple different users’ feature branches as well) the new repo will only have a single “main” branch, and all feature-“branches” should henceforth be in dedicated forks and get merged in via PRs. Also, old and new repo do not share a common history, so merging changes from one to the other via PRs will not be possible. Thus: if you do use OWL, please update your remote ASAP (and ideally, fully re-clone from the new location).

On the upside: this move to a new upstream repo not only comes with higher visibility, it also comes with some long-overdue cleanups and maintenance. In particular, all the interactive samples – and with that, the dependencies on GLFW and OpenGL (for the interactive viewer window stuff) – have been removed from the project, leading to a much simpler cmake project and build process. I also added some basic CI to check build status on both Windows and Linux, fixed some Windows DLL declspec issues, etc.

Any issues with the new version: please file a github issue – and of course, do so at the new location at https://github.com/NVIDIA/OWL/issues

Happy coding.

Quick Update: Just Updated github ‘optix7course’

For a while now I’ve been telling everybody that my original ‘optix7course’ repo on github (ie, https://github.com/ingowald/optix7course) is a bit deprecated, and everybody should instead look at OWL (which since last week can be found at a new location, at https://github.com/NVIDIA/owl, but that’s a topic for another post…), but since that repo still has a lot of stars, watchers, and even the occasional bug report I decided to at least build and test it … and immediately found that it didn’t even build any more, because cmake has changed over the years (to the better, but still).

So, just updated that repo to now use all ‘modern cmake’ for includes and dependencies, including updated deviceCode-to-ptx-embedding macros (all stolen from OWL and back-ported to this older repo). I would still suggest to everybody to instead have a closer look at OWL (an updated post on that is looong overdue – by about a couple years by now), but at least on linux it now builds and runs again, with latest optix and latest cmake.

Multi-GPU (but non-MPI) Data-Parallel Rendering in Barney/ANARI

This article is about a capability – and in particular, about the “how do you actually use that” – that i recently added to Barney, namely, to do data-parallel multi-GPU rendering through Barney’s ANARI interface layer. This is not a comprehensive discussion about barney, or data-parallel rendering in Barney, but enough people asked about this that I think it’s easier to explain this once in a blog, and share it this way, instead of trying to explain over and over again in email and slack (of course, an actual paper would be even better, but that’d take much longer, and I don’t want anybody to have to wait for that … so here we go).

The Problem

‘K, for those that already know what the problem is: just skip this section, and jump to “ANARI Device Tethering”. For those that don’t, let’s first explain the problem: The core of the problem is that “data parallel rendering” refers to rendering where the model is actually split across multiple different GPUs and/or nodes, and since that’s a totally different beast than “regular” rendering on a single GPU (or even replicated rendering where each GPU has a full copy of the same model) this can create all kinds of issues.

More specifically, Barney itself (ie, the renderer I recently worked a lot on) can natively do data-parallel just fine, and it can do that in both MPI and non-MPI multi-GPU ways – but the Khronos ANARI 3D Cross-Platform Rendering API layer that most users would arguably want to use Barney through has no actual notion of data parallelism, so it can’t easily express this. Barney itself has an explicit notion of both multiple GPUs as well as multiple “data ranks” within a single barney context, and is just fine with the app loading different data onto different GPUs, and then asking the (single) context to render an image across all those different GPUs’ different data. That’s a capability built deep within barney from the very beginning, and it’ll do that just fine. This article is not going to discuss how that is done, or what it all can or cannot do – let’s just assume it can. Barney can actually do this data parallel multi-GPU thingy both over MPI (each MPI rank can have its own data) as well as for a single process that doesn’t even know about MPI (and instead specifies multiple GPUs and data ranks with that same context)… and it can even mix the two in all sorts of ways (so yes, you can have N ranks with G GPUs each, having D <= N*G different types of data, affinities, etcpp).

How it does that does not matter for this post – but what does matter is that users would likely want to use Barney not through its “native” API (which explicitly supports all that stuff), but would instead want to use it through the Khronos ANARI 3D Cross-Platform Rendering API (which Barney also supports)… but the ANARI API doesn’t have any native concept of “data parallel”, yet … so that’s a problem. For the most common way of using data-parallel rendering in Sci-Vis (which is over MPI, with different data per rank) we already proposed and described an ANARI “Extension” that would allow using that in Barney (more details on this in this 2024 LDAV paper), but this one has the catch that the user has to use MPI to use that … and not everybody is comfortable with that (a notion I can fully understand, actually).

As such, the problem we faced was that Barney already does have a concept of data parallel multi-GPU within a single process, but ANARI does not. To dig a little deeper, the problem is that if you do want to do data parallel single process rendering you have to deal with the fact that some “entities” in the rendering process are intrinsically “per process” (ie, you have one model, and one frame that you’re rendering, etc), but others are intrinsically “per device” (one geometry might live only on one device, others only on another)… but in ANARI there is only one device to create all of these entities, and no way to say “but this goes here, and that goes there”.

Now for the MPI-based data-parallel ANARI extension mentioned above we allow the user to express this in a per-rank way – different ranks implicitly load different data, but then do some calls collectively – but this doesn’t easily work within a single process. The “different ranks have different data” you could still express as different devices (actually, that’s what we do as well), but the entire “collective” thing becomes a bit more tricky, and does not actually map all that well.

ANARI “Device Tethering”

So, the way we decided to realize that same functionality in (B)ANARI is through what we call “tethering” of devices: Basically, there will be N different devices (one per GPU), but these are all “tethered” to a single “leader” that plays a special role. Basically, the lead device is the one to talk to for any sort of “per process” operation (like creating a frame, rendering a frame, mapping a rendered frame buffer, etc), while all the other devices exist merely to describe how to create data on different GPUs – and the “tethering” expresses that – and how – these actually belong together. Ie, because the other devices are all explicitly tethered to that lead device, that lead device will know that there are other GPUs, that they each have different data, but that it is responsible for doing all the work.

So, how does that work in practice? Basically, that answer splits into two separate categories: How to initialize the whole thing (such that there are N devices that know they’re tethered together), and how to then use that during rendering.

Initial (Tethered) Device Setup

Basically, almost all the secret sauce lies in the initial setup and device creation, where we have to create the N different devices, tell them which is to run on which GPU, and tell them who’s the leader, and how they’re tethered together.

To do this, the first step is to prepare our app for having multiple different devices (one per GPU), so instead of having a single ANARIDevice you’d probably have something like this:

int numGPUs = ...;
std::vector<ANARIDevice> dev(numGPUs);

Now, the first step is to create the lead device. This lead device is like any other device (ie, we can also load data onto it), so we’ll just store it as dev[0] – we’ll just know later on that it has a special role to play. Creating that lead device would work just like any other device

// load barney - note this loads the non-MPI barney device!
ANARILibrary barney = anariLoadLibrary("barney", ...);
dev[0] = anariNewDevice(barney,...);

… except that we’ll also tell it – through some specially named parameters – that it’ll eventually be one of many. To do this we set the variables tetherCount and tetherIndex. Through the first we’ll tell this device how many others there are, through the second, we’ll tell it that it’s the first (and thus, implicitly the lead) device:

anari::setParameter(dev[0],dev[0],"tetherIndex",(int)0);
anari::setParameter(dev[0],dev[0],"tetherCount",numGPUs);
anari::setParameter(dev[0],dev[0],"dataGroupID",(int)0);
anari::setParameter(dev[0],dev[0],"cudaDevice", (int)0);

Note how each of these calls has the dev[0] parameter twice – this is not a typo, but actually correct: the first one is the handle to the device that is setting the variable, the second one the object that this variable is being set on… it just so happens that this device has to set this variable on itself, but that is correct. Note that through the cudaDevice variable we also explicitly tell that device to run on GPU 0.

Also worth explaining is the "dataGroupID" parameter: In barney, data parallel rendering is realized by giving each device (actually, each of what Barney calls a “data slot”) a numerical index that describes what part of the entirety of the data it will have. Not which exact geometries or objects – those are going to come later, through anari entity creation calls – but what logical part of a hypothetical whole: If you create two devices that each have their dataGroupID set to 0 (or don’t set this variable at all, because 0 is the default), barney will interpret that as you guaranteeing that these devices will have the same data loaded onto them eventually – so it can treat those devices as replicas of each other rather than as data parallel peers. But if you tell the first device that it has data group ID 0, and the second one that it has data group ID 1, then barney knows that there are two different kinds of data, and that these two devices need to work together to produce the right output. You can also do things like setting GPU0 and GPU1 both to ‘0’, and then setting GPU2 and GPU3 to ‘1’, in which case Barney will have GPU0 and GPU2 working data-parallel on some pixels, and GPU1 and GPU3 working together on others – but let’s not go into that in detail – bottom line is that if you have N different devices, and set each device’s data group to a different numerical ID, barney will realize that you want these devices to be run in data parallel mode (and yes, please use numerical IDs 0, 1, 2, etc; not 13, 47, 3, etc… I assume that’s what any sane person would use, so didn’t even bother to implement the latter).
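To make the GPU0/GPU1 vs GPU2/GPU3 example a bit more concrete, here is one way to picture that grouping in code. This is only an illustration and not how barney implements it: the function name groupIntoTeamsSketch is made up, and the requirement that every data group is replicated equally often is an assumption of this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustration only (hypothetical helper, not barney code).
// Input:  dataGroupOfGPU[g] = the dataGroupID assigned to GPU g.
// Output: teams[r] = the GPUs that together hold one full copy of the
//         data, ie, that have to work together in a data parallel way.
std::vector<std::vector<int>>
groupIntoTeamsSketch(const std::vector<int> &dataGroupOfGPU)
{
  // collect which GPUs hold which data group
  std::vector<std::vector<int>> gpusInGroup;
  for (std::size_t gpu = 0; gpu < dataGroupOfGPU.size(); gpu++) {
    const int id = dataGroupOfGPU[gpu];
    if (id < 0)
      throw std::invalid_argument("data group IDs must not be negative");
    if (static_cast<std::size_t>(id) >= gpusInGroup.size())
      gpusInGroup.resize(static_cast<std::size_t>(id) + 1);
    gpusInGroup[static_cast<std::size_t>(id)].push_back(static_cast<int>(gpu));
  }
  if (gpusInGroup.empty())
    return {};

  // IDs have to be 0,1,2,... without gaps, and (assumption of this
  // sketch) every group has to exist on the same number of GPUs
  const std::size_t numReplicas = gpusInGroup[0].size();
  for (const std::vector<int> &group : gpusInGroup)
    if (group.empty() || group.size() != numReplicas)
      throw std::invalid_argument("use IDs 0,1,2,... replicated equally");

  // team r consists of the r-th GPU of every data group
  std::vector<std::vector<int>> teams(numReplicas);
  for (std::size_t r = 0; r < numReplicas; r++)
    for (const std::vector<int> &group : gpusInGroup)
      teams[r].push_back(group[r]);
  return teams;
}
```

With the IDs {0,0,1,1} from the example this yields the two teams {GPU0,GPU2} and {GPU1,GPU3}; with {0,1,2,3} it yields a single team of all four GPUs, which is the purely data parallel case this article is about.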

Now at this point, it is time for this device to be committed, so it’ll actually know about these variables:

anari::commitParameters(dev[0],dev[0]);

At this point, our lead device is created, and it can initialize itself. In particular, it will also know that there will eventually be more devices that will try to tether themselves to it. At this point you cannot actually use that device yet, because it’ll wait for those other devices to be created before they can then all together finish their setup (just to say this again because it is important: This device is not yet ready to be used for rendering – we told it that there’s some other devices coming up, and we cannot use this device for actual rendering calls until these have been created, too!). So, let’s create the other devices, and tether them to that lead device:

for (int i=1; i<numGPUs; i++) {
  // create the device, just like we did for dev[0]
  dev[i] = anariNewDevice(barney,...);

  // these are the same as for dev[0]
  anari::setParameter(dev[i],dev[i],"tetherIndex",(int)i);
  anari::setParameter(dev[i],dev[i],"tetherCount",numGPUs);
  anari::setParameter(dev[i],dev[i],"dataGroupID",(int)i);
  anari::setParameter(dev[i],dev[i],"cudaDevice", (int)i);

  // this is to tell those devices who to tether to:
  anari::setParameter(dev[i],dev[i],"tetherDevice",dev[0]);
  anari::commitParameters(dev[i],dev[i]);
}

This initialization code is almost identical to the one for GPU 0 (in fact, it can be run in the same loop), except that the "tetherDevice" variable cannot be set until dev[0] has been created and committed (and it may, in fact, not be set on dev[0], because this variable being null is what tells that device that it’s the lead device). Luckily ANARI already has a notion of setting object handles as parameters, and because an ANARIDevice is also implicitly an ANARIObject, both the setting of parameters on devices, as well as passing a device as a tetherDevice parameter all worked out of the box. Also note again how each GPU gets a different cudaDevice and a different dataGroupID. This is exactly how we tell each device what GPU to run on, and that they all have different data.

After this initialization step, we’ll have numGPUs different devices, each on a different GPU (cudaDevice), each expecting different data (dataGroupID), all knowing that they’re tethered together, and all knowing that dev[0] is the lead device. Also, all devices have been committed now, are now all fully initialized, and can now be used for other ANARI calls.
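Since the lead device and the others differ only in the "tetherDevice" parameter, the whole setup fits into a single loop. The sketch below shows that structure, but with the ANARI calls replaced by a tiny recording stand-in so that it compiles without the ANARI SDK. RecordingBackend, its member functions, and createTetheredDevicesSketch are all made up for this illustration; only the parameter names are the ones described above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the handful of ANARI calls used above; it
// only records what was called, in which order.
struct RecordingBackend {
  using Device = int; // stand-in for ANARIDevice
  std::vector<std::string> log;
  int numCreated = 0;

  Device newDevice() {
    log.push_back("new " + std::to_string(numCreated));
    return numCreated++;
  }
  void setInt(Device d, const std::string &name, int value) {
    log.push_back("set " + std::to_string(d) + " " + name + " "
                  + std::to_string(value));
  }
  void setDevice(Device d, const std::string &name, Device other) {
    log.push_back("set " + std::to_string(d) + " " + name + " dev"
                  + std::to_string(other));
  }
  void commit(Device d) { log.push_back("commit " + std::to_string(d)); }
};

// One loop for all devices: dev[0] is created and committed first and
// never gets a "tetherDevice"; all others point to dev[0].
template <typename Backend>
std::vector<typename Backend::Device>
createTetheredDevicesSketch(Backend &api, int numGPUs)
{
  std::vector<typename Backend::Device> dev(numGPUs);
  for (int i = 0; i < numGPUs; i++) {
    dev[i] = api.newDevice();
    api.setInt(dev[i], "tetherIndex", i);
    api.setInt(dev[i], "tetherCount", numGPUs);
    api.setInt(dev[i], "dataGroupID", i); // different data on each GPU
    api.setInt(dev[i], "cudaDevice", i);  // one device per GPU
    if (i > 0) // leaving this unset is what marks dev[0] as the lead
      api.setDevice(dev[i], "tetherDevice", dev[0]);
    api.commit(dev[i]);
  }
  return dev;
}
```

In a real app each call on the stand-in would be replaced by the corresponding anariNewDevice() / anari::setParameter() / anari::commitParameters() call from the snippets above.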

ANARI Rendering with Tethered Devices

With these properly tethered devices now created, we can do rendering almost identical to how it was done before. Basically, I assume that your existing (single GPU) ANARI workflow for a non-data parallel app looks something like this:

// load app-specific model
MyModel model = loadModel(....);
ANARIDevice device = createMyAnariDevice();

// issue anariNewVolume, anariNewGeometry, etc calls
ANARIWorld world = createAnariWorld(device,model);

// create a frame to be rendered
ANARIFrame frame = anariNewFrame(...);

// assign instances to this frame
anariSetParameter(frame,"world",world);

// render an actual frame
anariRender(frame);

// and read back the frame
float4 *pixels = anariMap(frame, "color", ...);

I’ve obviously taken some liberties with ANARI here – this isn’t literal code – but if you already have a single-device ANARI implementation I’m fairly sure you’ll recognize what I mean.

For our data parallel device, this will be fairly similar in concept, with just a few differences. Most obviously, you won’t have one model in your app, but will have one per GPU. Whether you load a pre-partitioned model, or partition a model on the fly, or extract it on-the-fly from some other data already on the GPU doesn’t matter, but conceptually you’ll have something like this:

// create partitioned model - one 'model' per GPU
std::vector<MyModel> modelPerGPU(numGPUs);
for (int i=0; i<numGPUs; i++)
  modelPerGPU[i] = createModelForGPU(i);

We can now create our tethered devices, using the code we described above:

std::vector<ANARIDevice> dev = createTetheredDevices(numGPUs);

Remember this looks like just N different devices, but we know that dev[0] is special, and that the others are tethered, so they belong together.

Now with that we can also render our geometry as before, but of course, we’ll end up with a different ANARIWorld object for each GPU. So again we simply have the same we had before, just once per device:

std::vector<ANARIWorld> worldPerGPU(numGPUs);
for (int i=0; i<numGPUs; i++)
  worldPerGPU[i] = createAnariWorld(dev[i],modelPerGPU[i]);

Note how each createAnariWorld() call uses a different device – that means that you can use the exact same createAnariWorld() function you had before: by simply giving it a different device (which knows that it’s on a different GPU) all the ANARI geometry/volume creation calls you make within this createAnariWorld() will automatically get routed to that device – and thus, (only) this device’s GPU. The objects created on that specific GPU’s device will only appear on that device; they won’t even be valid on other devices, and no object on any other device should ever even know about them.

Similarly, we now create one frame per device. In theory we’d only ever want a single frame (we’re not rendering N different images, right!?), but one of the “quirks” of ANARI is that the “world” to be rendered is specified as a variable on the frame object. And since we have N different world objects we also have to have N different frame objects to set this variable on. So, once again we have something “per device”:

std::vector<ANARIFrame> frame(numGPUs);
for (int i=0; i<numGPUs; i++)
  frame[i] = anariNewFrame(dev[i],...);

We can now also set each device’s frame’s "world" to the set of instances we created on that device:

for (int i=0; i<numGPUs; i++)
  anariSetParameter(dev[i],frame[i],"world",worldPerGPU[i]);

Note again how these have to match: a world created on device i has to be set on the frame created on device i, because otherwise the handles wouldn’t even be valid.

Now it’s important to mention one thing: At this point, it is clear to us that these frames “belong together” in that they each store one GPU’s worth of data that we know belongs to a single logical model … but there has to be some means for barney to recognize that, too. The easiest would be if we could pass all N frames to the anariRender() call, but that call takes only a single frame parameter, so we need another way to express this.

One way to do that would be to explicitly tether those frames as we did for devices, and that would surely work … but we found this to be too cumbersome. Instead, what we do in Barney is to assume that it is the order in which frames are created on the different devices that expresses this tethering relationship: If each one of 8 (tethered) devices creates exactly one ANARIFrame each, then obviously it’s these 8 ANARIFrame handles that logically belong together (or better: whose “world” variables belong together!). Similarly, if each one of those devices creates two frame objects, the first one of each belong together, and the second one of each belong together, etc. As such, each (logical) frame always has to be created once on each device (you’ll probably only ever need one frame per device – I do – but still…).
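The “matching by creation order” rule is easy to picture as a transpose. The following sketch is only meant to illustrate that rule and is not barney’s implementation: frame handles are modeled as plain ints, and the name matchFramesByCreationOrder as well as the decision to throw on mismatching counts are made up for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustration only (hypothetical helper, not barney code).
// Input:  framesPerDevice[i] = the frame handles created on device i,
//         in creation order (handles are modeled as ints here).
// Output: logicalFrames[k][i] = the k-th frame created on device i, ie,
//         all the per-device handles that make up logical frame k.
std::vector<std::vector<int>>
matchFramesByCreationOrder(const std::vector<std::vector<int>> &framesPerDevice)
{
  if (framesPerDevice.empty())
    return {};

  // every logical frame has to be created exactly once on each device
  const std::size_t numLogicalFrames = framesPerDevice[0].size();
  for (const std::vector<int> &frames : framesPerDevice)
    if (frames.size() != numLogicalFrames)
      throw std::invalid_argument("each device must create the same number of frames");

  std::vector<std::vector<int>> logicalFrames(numLogicalFrames);
  for (std::size_t k = 0; k < numLogicalFrames; k++)
    for (const std::vector<int> &frames : framesPerDevice)
      logicalFrames[k].push_back(frames[k]);
  return logicalFrames;
}
```

For three devices that each created two frames, logical frame 0 consists of the first handle of each device and logical frame 1 of the second handle of each device.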

Barney will also assume that frames that logically belong together are always resized and formatted in a consistent manner across all devices, so a resize would resize each such frame handle:

// properly set size and format of "the" frame (on each dev!):
for (int i=0; i<numGPUs; i++) {
  anariSetParameter(dev[i],frame[i],"size",...);
  anariSetParameter(dev[i],frame[i],"channel.color",...);
  anariCommitParameters(dev[i],frame[i]);
}

Now that “the” model has been created and “its” instances have been set, we can proceed to rendering and mapping of the rendered pixels – but because barney will know that these different frame handles are actually all referring to the same logical frame and model, we can issue this call for only dev[0], and frame[0] … and barney will know what to do:

// this is only done once PER PROCESS, NOT per device!
anariRender(dev[0],frame[0]);

At this point, barney will start rendering, using, on device[i], the world stored on frame[i]. It’ll actually only fill the pixels on frame[0]; the other ones are only there to hold their respective GPU’s “world” variable. Once rendering is done we can map “the” frame as usual, using the lead device’s frame handle, as before:

float4 *pixels = (float4*)anariMap(dev[0],frame[0],...);

Unlike all other calls, the render and map calls (anariRenderFrame() and anariMapFrame() in the actual ANARI API, abbreviated to anariRender() and anariMap() in the snippets here) are only ever called on device 0 (the frame handles for other devices only exist to store their respective device’s “world”) and that is actually the point, because the render call can only take one device and one frame – which is why we had to do this entire tethering in the first place, because now that one device internally knows that it is only part of a bigger group.

Final Notes

This was a long blog – much longer than expected. In fact, this almost makes it sound like this was a super-complicated thing to do – but in practice, the exact opposite is true. I just wanted to explain how this all works in detail (because somebody will ask, and it’s easier to point to one detailed write-up), but what you’ll realize is that if you already have an existing ANARI renderer then implementing this is going to be trivially simple: In my own main model viewer (called Haystack) adding this (once it was implemented in barney) was a thing of a few minutes, and it only affected a few lines of code (out of thousands). In fact, I’m almost certain you’ll spend more time dealing with how to even load or partition data into multiple different, per-gpu models – but once you have that, and you have code that can render your geometries and volumes into “a” ANARI device … then you simply create the N different devices in the tethered way described above, and call your anari render function once for each device.

Anyway, that’s enough for today; barney doesn’t write itself, nor does it get better by me writing blogs. Take care, and if you dare, have fun playing with this.

New release: pynari 1.2.8

After a lot of fixes, pynari 1.2.8 was finally released earlier this week.
Biggest update: Milan Jaros (https://www.researchgate.net/profile/Milan-Jaros-3) provided some first interactive samples using a 3D volume viewer he wrote:

Other updates:

updated pip packages to use latest barney 0.9.8 that contains multiple bugfixes, and supports a more complete list of anari features and data types (e.g., spheres, cylinders, and cones)
pynari.Frame can now be read directly into a GPU buffer without first going over the host, allowing users of pyopengl and pycuda to copy frame buffers on the GPU (which is what allows the above example to run at 60+fps)
Fixed previously wrong handling of numpy arrays’ shape()s – old code wrongly interpreted these shapes in XYZ order, but they are actually ZYX. Unfortunately that also means that all existing pynari codes that used np.reshape() to ‘properly’ declare multi-dimensional pynari arrays have to be updated accordingly. I already did that for all samples (and above viewer); apologies for the inconvenience – “mea culpa” indeed.
Several new “demo”(-and-testing!) samples for different data types such as cylinders, cones, and some more ‘interesting’ triangle mesh geometry (blatantly stolen from TSD):
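On the shape fix listed above: a numpy array in its default (C-order) layout that is indexed as a[z][y][x] has shape (nz, ny, nx), with the last axis varying fastest in memory. The small C++ sketch below spells out what that means for a flat buffer. Both function names are made up for this illustration, and the idea that the receiving side wants its dimensions in x,y,z order is an assumption, not a statement about pynari’s internals.

```cpp
#include <array>
#include <cstddef>

// Illustration only; both helpers are hypothetical.
// 'shapeZYX' is a numpy-style shape (nz, ny, nx) of a C-order array.

// position of element a[z][y][x] in the flat buffer: x varies fastest
inline std::size_t linearIndexZYX(const std::array<std::size_t, 3> &shapeZYX,
                                  std::size_t x, std::size_t y, std::size_t z)
{
  const std::size_t ny = shapeZYX[1];
  const std::size_t nx = shapeZYX[2];
  return (z * ny + y) * nx + x;
}

// the same array described as (nx, ny, nz), assuming the consumer
// expects its dimensions in x,y,z order
inline std::array<std::size_t, 3>
dimsXYZFromNumpyShape(const std::array<std::size_t, 3> &shapeZYX)
{
  return {shapeZYX[2], shapeZYX[1], shapeZYX[0]};
}
```

So a volume that is 4 cells wide, 3 cells high and 2 cells deep has numpy shape (2, 3, 4), not (4, 3, 2), and stepping by one in x moves by one element in memory while stepping by one in z moves by 12.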

“First light” of Barney Rendering of Drosophila Brain Data

I’ll have to write a bit more later on the “how” of it, but for now, just a quick “first light” of barney rendering the latest “Drosophila Brain” data set (using four H100 GPUs).

That “droso” data set is from the “Virtual Fly Brain” project https://www.virtualflybrain.org/, and was recently used by/featured in multiple major articles (e.g., this one: https://www.nature.com/articles/d41586-024-03190-y)… and the best thing about this is that one can actually download the full neuron connectome data (in SWC format). Now I’m sure they’re primarily sharing that data for purposes other than just rendering – but this is still a very nice “hero” data set for testing a (GPU!-)ray tracer with: The full droso has 140,000 neurons (the image on the nature page shows only the 50 largest), and though that doesn’t sound all that much, it actually is: each individual neuron can consist of “multiple” (ie, a lot) of different “segments”… so the total data is – if I can trust the importer code I wrote – a total of 727 million such segments. The input SWC files alone are 34 GB, and with additional data for colors, acceleration structure, etc, this is far more than can be fit on a single GPU.

Barney already had a “capsules” geometry that can handle this kind of “link” data, and the capsules are relatively easy to distribute across multiple GPUs (barney doesn’t care how they get distributed, so I literally assign them in file order) … so other than data wrangling this pretty much worked out of the box. The full thing – with over 700 million links – needs more than one GPU, but it does (just barely) fit into a machine with four H100s.
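For illustration, “assigning in file order” could be as simple as cutting the list of segments into one contiguous block per GPU. The sketch below does exactly that. Whether the actual importer uses contiguous blocks, round-robin, or something else is not spelled out here, so take the contiguous split, the RangeSketch type, and the function name splitInFileOrder as assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// Illustration only (hypothetical helper, not the actual importer).
// A half-open range [begin,end) of item indices in file order.
struct RangeSketch {
  std::size_t begin;
  std::size_t end;
};

// Cut numItems items into numGPUs contiguous blocks whose sizes differ
// by at most one; the items keep the order in which they were read.
std::vector<RangeSketch> splitInFileOrder(std::size_t numItems,
                                          std::size_t numGPUs)
{
  std::vector<RangeSketch> ranges;
  for (std::size_t gpu = 0; gpu < numGPUs; gpu++) {
    RangeSketch r;
    r.begin = (gpu * numItems) / numGPUs;
    r.end   = ((gpu + 1) * numItems) / numGPUs;
    ranges.push_back(r);
  }
  return ranges;
}
```

For the 727 million segments and four GPUs mentioned above this gives roughly 182 million segments per GPU.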

And thus: ta-daa – here’s the first few images…

And while the image on the nature page (https://www.nature.com/articles/d41586-024-03190-y) shows “the 50 largest” of the neurons … this is all 140,000 of them :-).

PS: before I forget – Big kudos to Stefan Zellmann (University of Cologne), Serkan Demirci (Bilkent University), Alper Sahistan (Univ of Utah), and Milan Jaros (it4innovation Institute, Ostrava), who were the ones that did these two specific images – I did the original data wrangling, and getting barney to be able to render that …. but the actual rendering, choice of colors, lights, camera, depth of field, and hardware wrangling – that’s all theirs!

PPS: Yes, barney is an interactive renderer, so in theory this data can be rendered interactively. But this particular machine didn’t have any X server, so this was rendered directly to file.

Funny Image of the Day…

One of the best things about working in graphics is that at least some of the bugs one encounters are – in some way, shape, or form – funny. I’ve always liked the “accidental art” that often comes out of this, and had my fair share of that myself … and every time I tell myself “you should archive this” – and then go on chasing the bug, and forget all about it. Not this time.

Here’s one “accidental art” / “hilarious image of the day” situation that I just ran into before even having had my first coffee … and which already made my day.

It all started with me hacking something up into barney to dump some points – I’m working on some kNN-query related project, needed to simulate some “LIDAR”-kind of data, and decided to just generate some by taking an outdoors scene (PBRT’s “landscape” model), letting barney trace some rays into it, and just dumping the first hit point.

Now to verify that these points are actually useful, I decided to just load those points into barney, and just render them as spheres (haystack has some handy command-line options to do that without any extra work). And what I get is this:

Now at first that looks good – certainly good enough to use for knn queries.

But… I did seem to see some sort of unexpected pattern in this image (after years of staring at images to find bugs I’ve gotten pretty good at it), and since any rendering bug would be worth chasing down independent of the knn project I – of course – couldn’t let this go.

So, navigating closer (the perks of having fast ray tracing, i guess) shows this:

And yes – certainly some rendering bug on those spheres. For sure.

I first blamed the denoiser – probably hallucinating something onto those spheres …. but no, even after refining for several seconds it doesn’t go away. Not a denoiser bug.

“Obviously” that then means there’s some numerical issue – surface acne for the shadow rays or something like that. Need to get just a little bit closer to see it better ….

And that then shows the actual reason for these rendering artifacts:

Turns out all that happened is that the default material I had chosen in haystack – my “DisneyMaterial” – has, by default, a low but non-zero glossy/reflective component on it … and those “artifacts” are all just reflections of the entire scene …

Duh. So many incoherent rays going all around in that scene (default path depth of 10 :-/), and one doesn’t even notice any more ….

pynari 1.1 coming up: Support for system-level ANARI installs, and CPU fallback

Good news; getting closer to the next version of pynari: Feedback to 1.0 was overwhelmingly positive, but two things stood out: First, there seems to be some demand for having a CPU-fallback for the kind of hardware that doesn’t have any CUDA capable GPUs (MacBooks in particular seem to be en vogue here :-/). I mean, you definitely should be using hardware with NVIDIA GPUs when doing graphics [Disclaimer: yes, I do work for NVIDIA in my day job], but still, I get it. Second, there was some interest in having pynari also support other – system-installed – ANARI backends, not only the barney backend it’s currently shipping with. This of course is partially related to the first point – the C++ ANARI SDK comes with some CPU devices as well – but is also partly based on the fact that the baked barney I’m currently including doesn’t have all the bells and whistles that a ‘full’ build of barney has. In particular, full barney can do data-parallel rendering over MPI, the baked version can’t, and since even one of the samples on the pynari web pages uses MPI this of course is a bummer.

Based on that feedback, version 1.1 is currently coming along as follows:

a) pynari will still ship with barney as a ‘baked’ backend that you can always use even if you have not manually installed any system-wide/non-python ANARI libraries. Creating a “default” device will still use that baked barney backend, just as in 1.0 – but if you specify any other backend name during device creation it’ll also look up system-installed ANARI devices. Ie, you don’t have to install any external ANARI SDK (there’s still the baked ‘default’), but you can if you want to (and yes, that also means you can use a system-installed barney with MPI support).

b) In addition to the baked CUDA/OptiX version of barney I’ll also include a CPU fallback device that’ll also work on non-NVIDIA hardware. This fallback uses Embree (obviously), but given that your CPU will likely have quite a bit less “oomph” than your GPU that fallback version will be “quite a bit slower” than the CUDA/OptiX accelerated version (in particular when using any textures or volume data – you really want to have texture units for that). I mean – it’s still better than nothing at all, and certainly faster than writing a native Python ray tracer … but …. ugh, it is going to be slower than barney_cuda, believe me on that.
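
For point a), here is a minimal sketch of how application code might pick a backend by name and still end up on the baked default. It assumes (this is not confirmed for 1.1) that device creation raises an exception when the requested backend cannot be found. The helper name new_device_with_fallback and the stand-in factory are hypothetical; only anari.newDevice('default') comes from pynari itself.

```python
def new_device_with_fallback(new_device, preferred, fallback='default'):
    """Try the preferred backend name first, then the baked-in default.

    new_device: a callable taking a backend name (e.g. pynari's newDevice).
    ASSUMPTION: a missing backend is reported by raising an exception.
    Returns the device and the backend name that was actually used.
    """
    try:
        return new_device(preferred), preferred
    except Exception:
        return new_device(fallback), fallback


# Hypothetical stand-in factory so the sketch runs without any ANARI install:
def fake_new_device(name):
    if name != 'default':
        raise RuntimeError(f"no such ANARI backend: {name}")
    return f"<device {name}>"


device, used = new_device_with_fallback(fake_new_device, 'some_backend')
print(used)
```

With pynari installed, the first argument would presumably be anari.newDevice instead of the stand-in.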

In theory that version should already be done – I already have 1.1-to-be building and running on both linux and windows – all the way to locally built wheels – but I’m still hitting some hiccups on the github actions builds I need for uploadable wheels – and since each build attempt now takes well over an hour, ironing these hiccups out just takes a while. Anyway – with the holidays coming up I should have some time to finish this up, so hopefully this 1.1 should come out before the year’s end.

PS: At least for the CPU fallback version I should in theory also be able to make it build on Macs… we’ll see. I got myself a shiny new macbook for just that very reason (Christmas present to myself, kind of!), and that just arrived this morning – but it still needs some setup, and I’m sure it won’t just build all that software out of the box, so even with the holidays coming up I don’t have any ETA on that just yet.

pynari (Ray Traced ANARI rendering in Python) – First Light!

I haven’t written much for a while – partly because I was too busy, partly because some of the stuff I worked on I couldn’t write about, and mostly, because much of the stuff that I could in theory write about wasn’t exactly “ready enough” to do so. Much of that is now slowly falling into place, though, so it’s about time I start writing some of it up.

Let’s start with pynari (https://pypi.org/project/pynari/) : for those that already know what ANARI is that name will be an immediately recognizable play on two different words – the "py" comes from python, obviously, and the “nari” comes from ANARI…. and that’s exactly what it is: a python-interface for the ANARI ray traced rendering API (including an OptiX/NVIDIA GPU accelerated renderer implementing that API), all installable through pip, and really easy to use from python in a way similar to this:

pip install pynari

then in python:

import pynari as anari

device = anari.newDevice('default')
# create the world:
mesh = device.newGeometry('triangles')
...
world = device.newWorld()
...
world.setParameterArray('surface', anari.SURFACE, ...) 
...
frame.setParameter('world',world)
# render a frame:
frame.render()
fb_color = frame.get('channel.color')

(for some complete, ready-to-run samples look here: https://github.com/ingowald/pynari).

ANARI

If you already know what ANARI is (you should!), then the above should be instantly recognizable, and you’ll probably be able to write some ANARI code in python right away. For those that don’t (yet?) know about ANARI, let’s first rectify this.

ANARI is a fairly recent effort by the Khronos group (the guys that are also spec’ing OpenGL, OpenCL, and all other kinds of cross-platform things) to standardize an API for “analytical rendering”. Now I’m not entirely sure what “analytical” is really supposed to mean, so let’s just call it by another name: it’s a ray tracing rendering API, plain and simple, roughly based on the same concepts that were originally used in Intel’s OSPRay API (more info on that here: https://www.ospray.org/talks/IEEEVis2016_OSPRay_paper_small.pdf). In particular, compared to the more widely known ray tracing APIs like OptiX, DXR, or Vulkan, the API level in ANARI is “one step higher” : you don’t have a low-level API that traces individual rays (with you having to write the renderer), but instead, the ANARI API is a ray tracing rendering API, where you create a “world”, populate it with “surfaces”, “volumes”, “lights”, etc, and eventually ask it to render a “frame”. You don’t have to be a ray tracing expert (you probably don’t even have to know how it works at all!), you just set up the world, and ask it to render images. For those interested in the official ANARI 1.0 spec – and/or the official SDK – please look here https://registry.khronos.org/ANARI/specs/1.0/ANARI-1.0.html and here https://github.com/KhronosGroup/ANARI-SDK .

PY-NARI

Anyway, back to pynari. The group of users most benefitting from the ANARI API is, of course, the group of people that want to use ray tracing, but that do not necessarily want to become experts in writing their own renderers. Having said that, I eventually realized that this description would probably also – and maybe even in particular – fit python users: many python users (in my experience) tend to be really good at just using libraries/packages that do the heavy work (often in native C/C++/CUDA code)… avoiding the need to become experts in whatever is going on inside that package, as long as it has a nice “pythonic” way of accessing its goodies. (C/C++ users instead tend to be the opposite, generally preferring to re-implement each API for themselves “just because” …. well, guilty as charged, I guess).

So, having realized that ANARI should in theory be pretty useful to at least some python users (there just must be some folks out there that want to generate some ray traced images in python!) the next task was to figure out how to make that accessible to such users – enter pynari. The first decision I made was to write my own python interface (said pynari): the ANARI SDK already does provide some low-level python bindings, but these only expose the C99 API, and I didn’t think that that was sufficiently “pythonic” for the typical python user. As such, for pynari I took the liberty of slightly deviating from the C API, and instead adopted a much more object-oriented API (which actually fits ANARI very well, because ANARI itself is all about different “objects” that jointly describe what is to be rendered). For example, what in the official C99 interface looks like this:

ANARILibrary library = anariLoadLibrary("default",...);
ANARIDevice device = anariNewDevice(library,...);
anariCommitParameters(device,device);
ANARICamera camera
  = anariNewCamera(device,"perspective");
/* parameters are passed by pointer in the C API */
float aspect = width/(float)height;
anariSetParameter(device, camera, "aspect",
                  ANARI_FLOAT32, &aspect);
...

… in pynari becomes what I’d consider more pythonic like this:

import pynari as anari
device = anari.newDevice('default')
camera = device.newCamera('perspective')
camera.setParameter('aspect',anari.FLOAT32,width/height)

etc.

For “bulk” data like vertex or index arrays, volume voxel data, etc, I decided to mostly build on numpy – i.e., you’d load/create/manipulate the bulky data in numpy, then create an anari “array” wrapper, and use that:

vertex = np.array(....,dtype=np.float32)
array = anari.newArray(anari.FLOAT32,vertex)
mesh.setParameter('vertex.position',array)
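
To make the numpy side a bit more concrete, here is a small sketch of what such “bulk” arrays typically look like for a triangle mesh: one row per vertex in float32, one row per triangle in an unsigned integer type. The helper name make_quad and the choice of uint32 for the indices are my assumptions for illustration; how the arrays are then wrapped and attached is exactly what the snippet above shows.

```python
import numpy as np


def make_quad():
    """Return (vertex, index) arrays for a unit quad built from two triangles."""
    # one row per vertex: x, y, z
    vertex = np.array([[0, 0, 0],
                       [1, 0, 0],
                       [1, 1, 0],
                       [0, 1, 0]], dtype=np.float32)
    # one row per triangle: three indices into the vertex array
    index = np.array([[0, 1, 2],
                      [0, 2, 3]], dtype=np.uint32)
    return vertex, index


vertex, index = make_quad()
# Native wrappers generally want tightly packed data of a known element type,
# so it is worth checking dtype and contiguity before handing arrays over.
print(vertex.shape, vertex.dtype, vertex.flags['C_CONTIGUOUS'])
print(index.shape, index.dtype, index.flags['C_CONTIGUOUS'])
```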

Other than slightly adapting the API to look more pythonic, the second big “digression” from the true path of ANARI I made is to – at least for now – hard-bake a single backend implementation into the pynari pip-wheels: In theory, the ANARI API is supposed to be “abstract” in the sense that Khronos only specifies the API itself, so different vendors/developers can each provide their own implementations for it. In your ANARI application, the first thing you’d then do is specify which implementation you want by “loading” a specific “library” (say “ospray” if you’re on a CPU, or “barney” or “visrtx” if you have an RTX-capable GPU, etc). The problem with that is that this multi-backend thing makes the building of the python wheels annoyingly tricky, because you’d have to build all these different backends into the same python wheel – and though it’s probably “possible” to do that it certainly ain’t for the faint of heart (it’s already rather non-trivial, believe you me!). So, while I’m absolutely not trying to make a vendor-independent API vendor-specific on python, at least for now pynari has a single working back-end (and for those wondering, it’s obviously my “barney” GPU renderer). Consequently, to run the pynari that’s currently up on PyPI you currently need an RTX-capable NVIDIA GPU (Turing or newer, data center GPUs most certainly included!). If you have one of those, however, you should now be able to pip-install pynari on either Windows (python 3.12 or 3.13) or Linux (3.9 and newer).

Volunteers, one step forward!

Long story short: on either Windows or Linux, you should by now be able to simply do

pip install pynari

and then run any of the examples I’ve been providing on the pynari github repo (https://github.com/ingowald/pynari)… and of course, you should be able to modify these examples, add to them, write new ones, etc.

Fair warning: this really is “first light” for this package – the backend (barney) has now been used for quite a few different things, but pynari itself is still very “hot off the press”, and “not much tested”. I’m fairly sure there’ll be missing or broken things in there, and I’d certainly expect quite a few “hard aborts” if you do something in a way that isn’t supported yet. That said, I can’t be fixing things I don’t know about, so what I’m looking for is a set of adventurous “volunteers” that would be interested in at least playing with it. Install it, run it, let me know how it goes – send me an email, file an issue on github, comment to this post, … any feedback is useful. Extend some of the samples (I’d particularly like one that’d create a better test volume for the volume rendering sample!), or write your own samples, etc – let me know. And if you create some additional samples, I’d be happy to share them, either here or on the github repo. Any feedback is useful!

And finally, just some eye-candy to show what it can do (these are simply the samples from the pynari github repo):

(Note that the data-parallel MPI example – the fancy-colored cube of cubes – will not currently support MPI on the pip-installed package, you’d have to build the module locally for that).

And finally, to show that it really works even on Windows (the Linux part is always a given for anything that I have written…), here is a screenshot I took last night after the buildwheel finally ran through:

(the careful observer will notice Warhammer running in the background – I literally just ran that on the gaming machine my son was just playing on, and it worked out of the box!).

Kudos

Last but not least, a few kudos for those without whose help this wouldn’t have been possible:

Wenzel Jakob: pynari internally makes heavy use of Wenzel’s amazing “pybind11” library (https://github.com/pybind/pybind11). Thanks to pybind, the actual python bindings were – by far – the least of the problems in building this.
Jeff Amstutz and anybody else behind the ANARI_SDK (https://github.com/KhronosGroup/ANARI-SDK): pynari doesn’t directly use the ANARI SDK – only barney does – but those folks were incredibly helpful in fixing things that were required to make barney work in pynari (such as building static libraries).
Jeff Amstutz and Stefan Zellmann – not involved in pynari itself (yet?), but without those two barney would never have had an ANARI interface to start with, and without barney-anari pynari wouldn’t exist.
Nate Morrical, from whose NVISII project I stole a lot over the years – BRDF components, guidance on building wheels, and many other things. (and without whom I’d probably never have started to learn python in the first place).

A few links for further reading (some of those have appeared above):

The pynari github repo, with a set of readily runnable samples : https://github.com/ingowald/pynari
pynari on PyPI : https://pypi.org/project/pynari/ (currently on 1.0.36 as of this post)
pybind11 – the key to writing python bindings in C++ (while staying sane): https://github.com/pybind/pybind11 . I’ll probably soon start to write a blog series about how to turn bindings into complete wheels – that can be a journey! – but it all starts with pybind.
The official ANARI spec – describes what kind of object types there are in ANARI, and what their parameters are. Not all will be supported in pynari, yet, but this is what should exist: https://registry.khronos.org/ANARI/specs/1.0/ANARI-1.0.html
The ANARI SDK – contains all to get you started if you want to write your own ANARI devices (though just using pynari will certainly be easier 🙂 ): https://github.com/KhronosGroup/ANARI-SDK .
A relatively recent paper about data-parallel rendering with ANARI – that’s probably not what you want to start with, but it also gives a bit of info on barney (the backend I’m currently using in pynari): https://arxiv.org/pdf/2407.00179v1 or here https://www.researchgate.net/publication/381883066_Standardized_Data-Parallel_Rendering_Using_ANARI
Another paper about some of the technology behind barney: https://www.researchgate.net/publication/372855659_Data_Parallel_Multi-GPU_Path_Tracing_using_Ray_Queue_Cycling
Pete Shirley’s “Ray Tracing in One Weekend”, whose iconic “spheres” scene obviously provided the template for ‘sample02’: https://raytracing.github.io/books/RayTracingInOneWeekend.html

Silly(?), Useful Tools: Generating Data for Scaling Experiments

Over the many different rendering projects I’ve done over the years, I’ve frequently stumbled – again and again – over the same problem: How to get “useful” data for doing scaling-style stress-testing of one’s software. Sure, you can always take N random spheres, or if you need triangle meshes take the bunny and create N random copies of that with random positions … but then you still quickly run into lots of other issues, like, for example: “now i added all these copies they just all overlap?!”, or “ugh, how can i create the scene such that multiple scales still make sense with the same camera for rendering?”, or “what if i want more instances rather than more triangles?”, or “what if I want to look at more ‘shading’ data like textures”, etc.

All of those questions are “solvable” (this is not rocket science), but I’m always amazed how much time I spent over the years – again and again – to write (and debug, and re-debug, etc) those mini-tests. And since I just did that again I decided that this time should be the last one… so as a result of that I did add all the features I wanted, and pushed that into my miniScene repo on github.

The way this tool works is actually quite simple, generating a whole lot of spheres (or actually, “funnies”, see my tweet on how I stumbled over those), but allowing the user to control a whole lot of different parameters that can influence things like how much instantiation vs “real” geometry, what tessellation level for the funnies (ie, triangle density per sphere), what texture resolution to use for each of the funnies, etc. In particular one can control:

how many “non-instantiated spheres” to generate
how many different kinds of spheres to generate for instantiation
how many different instances of spheres to generate
what tessellation density to use per sphere
what texture res to use per sphere (sphere gets its own checkerboard pattern texture)

These spheres then all get put into a fixed-size slab that covers the space from (0,0,0) to (1000, 100, 1000), with sphere radii and instance scaling adjusted such that there should always be a reasonably equal density within that slab. Note that the slab is intentionally 10x less high than wide, so we neither end up with just a 2D plane, nor with something that’s a cube (where all interior geometry is usually occluded by that at the boundary).
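
One way to get that kind of “equal density” behavior can be sketched as follows: keep the fraction of the slab volume covered by spheres fixed, and derive the radius from the sphere count. To be clear, this is my illustration of the idea, not the formula the tool actually uses; the fill parameter and both function names are hypothetical.

```python
import math
import random

SLAB_MIN = (0.0, 0.0, 0.0)
SLAB_MAX = (1000.0, 100.0, 1000.0)


def sphere_radius(num_spheres, fill=0.1):
    """Radius such that num_spheres spheres together cover roughly a
    fraction `fill` of the slab volume (ignoring overlaps)."""
    slab_volume = 1.0
    for lo, hi in zip(SLAB_MIN, SLAB_MAX):
        slab_volume *= hi - lo
    volume_per_sphere = fill * slab_volume / num_spheres
    # invert V = 4/3 * pi * r^3
    return (3.0 * volume_per_sphere / (4.0 * math.pi)) ** (1.0 / 3.0)


def place_spheres(num_spheres, seed=0):
    """Uniformly random centers inside the slab, plus the shared radius."""
    rng = random.Random(seed)
    centers = [tuple(rng.uniform(lo, hi) for lo, hi in zip(SLAB_MIN, SLAB_MAX))
               for _ in range(num_spheres)]
    return centers, sphere_radius(num_spheres)


centers, radius = place_spheres(1000)
print(len(centers), radius)
```

With this choice, going to 8x as many spheres halves the radius, so the picture looks about equally “full” at every scale.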

In particular, this tool allows for easily controlling whether you want to scale in instance count (increase instance count) or triangle count (increase the number of non-instanced spheres and/or sphere tessellation level); whether to put more triangles into just finer surface tessellation or into more different meshes, how much of the output size should be in textures vs geometry, etc.
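
To make the difference between “unique” and “actual” (ie, after instantiation) counts concrete, here is a small back-of-the-envelope sketch. The per-sphere constants are inferred from the stats printed for the runs in this post (for example 80000 unique triangles over 200 unique meshes), and the assumption that every instance references exactly one sphere mesh is mine as well; none of this is read from the tool’s source. With the values used below it reproduces the numbers of the -ni 10000 run.

```python
# Inferred from the printed stats, not from the tool's source:
TRIS_PER_SPHERE = 400    # 80000 unique triangles / 200 unique meshes
VERTS_PER_SPHERE = 220   # 44000 unique vertices  / 200 unique meshes
BYTES_PER_TEXEL = 16     # 204800 texture bytes / (200 textures * 8 * 8 texels)


def scene_stats(num_instances, num_instantiable, num_base, tex_res):
    """Estimate unique vs. actual scene sizes.

    num_instances:    instances created (each assumed to reference one sphere)
    num_instantiable: distinct sphere meshes available for instancing
    num_base:         non-instanced spheres
    tex_res:          texture resolution per sphere (tex_res x tex_res texels)
    """
    unique_meshes = num_instantiable + num_base
    actual_meshes = num_instances + num_base
    return {
        "unique_meshes": unique_meshes,
        "unique_triangles": unique_meshes * TRIS_PER_SPHERE,
        "unique_vertices": unique_meshes * VERTS_PER_SPHERE,
        "actual_meshes": actual_meshes,
        "actual_triangles": actual_meshes * TRIS_PER_SPHERE,
        "actual_vertices": actual_meshes * VERTS_PER_SPHERE,
        "texture_bytes": unique_meshes * tex_res * tex_res * BYTES_PER_TEXEL,
    }


stats = scene_stats(num_instances=10_000, num_instantiable=100,
                    num_base=100, tex_res=8)
for name, value in stats.items():
    print(f"{name:17s}: {value}")
```

Scaling num_instances only moves the “actual” rows, while scaling num_base (or tex_res) moves the “unique” rows and the texture size, which is what actually has to fit into memory.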

Here are a few examples:

./miniGenScaleTest -o scaleTest.mini (ie, with trivially simple default settings) generates this:

num instances : 2
num objects : 2

num unique meshes : 101
num unique triangles : 40.40K (40400)
num unique vertices : 22.22K (22220)

num actual meshes : 101
num actual triangles : 40.40K (40400)
num actual vertices : 22.22K (22220)

num textures : 101

Which with my latest renderer looks like this:

Now let’s change that to use 10k instances: ./miniGenScaleTest -o scaleTest.mini -ni 10000, and we get this:

num instances		:   10.00K	(10001)
num objects		:   101
----
num *unique* meshes	:   200
num *unique* triangles	:   80.00K	(80000)
num *unique* vertices	:   44.00K	(44000)
----
num *actual* meshes	:   10.10K	(10100)
num *actual* triangles	:   4.04M	(4040000)
num *actual* vertices	:   2.22M	(2222000)
----
num textures		:   200
 - num *ptex* textures	:   0
 - num *image* textures	:   201
total size of textures	:   204.80K	(204800)
 - #bytes in ptex	:   0
 - #byte in texels	:   204.80K	(204800)
num materials		:   200

which looks like this:

But since that scene complexity is mostly all in instances (which for “large model rendering” is often considered “cheating”) let’s instead add a few non-instanced spheres as well (but let’s add more instances, too, just for the fun of it): ./miniGenScaleTest -o scaleTest.mini -ni 10000000 -nbs 100000 -tr 32. This creates 10 million instances of spheres (each having 400 triangles, going by the stats below), and then another 100,000 spheres that are not instances, for a total of this:

num instances		:   10.00M	(10000001)
num objects		:   101
----
num *unique* meshes	:   100.10K	(100100)
num *unique* triangles	:   40.04M	(40040000)
num *unique* vertices	:   22.02M	(22022000)
----
num *actual* meshes	:   10.10M	(10100000)
num *actual* triangles	:   4.04G	(4040000000)
num *actual* vertices	:   2.22G	(2222000000)
----
num textures		:   100.10K	(100100)
 - num *ptex* textures	:   0
 - num *image* textures	:   100.10K	(100101)
total size of textures	:   1.64G	(1640038400)
 - #bytes in ptex	:   0
 - #byte in texels	:   1.64G	(1640038400)
num materials		:   100.10K	(100100)

(and zooming in a bit)

(note the “artifacts” on some of those spheres are intentional – they’re “funnies”, not spheres. I find these funnies more useful as testing geometry, but of course, if you want to generate “non-funny” spheres there’s a flag for that as well).

Now finally, let’s use this to push my two RTX 8000 cards to the limit, and do this: ./miniGenScaleTest -o /slow/mini/scaleTest.mini -ni 10000 -nbs 2000000 -tr 32 … with which we end up at a whopping 800M unique triangles and an additional 32 GB of texture data:

num instances		:   10.00K	(10001)
num objects		:   101
----
num *unique* meshes	:   2.00M	(2000100)
num *unique* triangles	:   800.04M	(800040000)
num *unique* vertices	:   440.02M	(440022000)
----
num *actual* meshes	:   2.01M	(2010000)
num *actual* triangles	:   804.00M	(804000000)
num *actual* vertices	:   442.20M	(442200000)
----
num textures		:   2.00M	(2000100)
 - num *ptex* textures	:   0
 - num *image* textures	:   2.00M	(2000101)
total size of textures	:   32.77G	(32769638400)
 - #bytes in ptex	:   0
 - #byte in texels	:   32.77G	(32769638400)
num materials		:   2.00M	(2000100)

The result looks like this:

… and just to show that this is really about to push my GPUs to the limit (even with my latest data-parallel multi-GPU renderer) here is also the output from nvidia-smi:

I guess I might have squeezed in a bit more (some 3 GB still unused on each GPU!), but the goal of this exercise was to have something that can bring my renderer to its limits, and I guess that’s pretty much it for now.
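
For a rough feel of where that memory goes, the raw input data alone can be estimated from the stats above. This is only a lower bound under assumptions that are mine: 12 bytes per vertex (three float32 positions) and 12 bytes per triangle (three 32-bit indices), with acceleration structures, other vertex attributes, and renderer-internal buffers all coming on top. The function name is hypothetical.

```python
def raw_scene_bytes(num_triangles, num_vertices, texture_bytes,
                    bytes_per_vertex=12, bytes_per_triangle=12):
    """Lower bound on scene size: positions + index triples + texels.

    ASSUMPTION: float32 x/y/z positions and three 32-bit indices per
    triangle; instances are ignored since they only reference meshes.
    """
    geometry = num_vertices * bytes_per_vertex + num_triangles * bytes_per_triangle
    return geometry + texture_bytes


# unique triangle/vertex counts and texture size from the stats above
total = raw_scene_bytes(800_040_000, 440_022_000, 32_769_638_400)
print(f"{total / 1e9:.2f} GB of raw geometry and texture data")
```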

BTW: The result still runs at 17 fps 🙂

If you want to have a look at this tool: have a look at the miniScene repo, then tools/genScaleTest.cpp. The resulting .mini file should be trivial to read and use for your own stuff, so …. enjoy!