RTOW in OptiX – Fun with CuRand…

Bottom line: With new random number generator, RTOW-OptiX sample on Turing now runs in ~0.5 secs ….

Since several people have asked for Turing numbers for my “RTOW in OptiX” example I finally sat down and ran it. First result – surprise: In my original code there was hardly any difference between using Turing and Volta – and that just didn’t make sense. Sure, you do still need a special development driver to even use the Turing ray tracing cores from within OptiX, but I actually had that, so why didn’t it get faster? And sure, there’s only so much speedup you can except in a scene that doesn’t have any triangles at all, and only a very small number of primitives to start with. But still, that didn’t make sense. There also was hardly any difference between iterative and recursive versions … and none of that made sense whatsoever.

Well – in cases like that a good first step is always to have a look at the assembly (excuse me: PTX) code that one’s code is actually generating. In our OptiX example, that’s actually super-easy: Not only is PTX way easier to read than regular assembly, the very nature of OptiX’ “programs” approach also means that you don’t have to sift through an entire program’s worth of asm output to find the one function you’re interested in…. instead, you only look at the PTX code for the one kernel that you’re interested in. And even simpler, the cmakefile already generates all these ptx files (that’s the way OptiX works), so looking at that was very easy.

Now looking at the ray gen program, I was at first what, for lack of a better word, I can only call “dumbfounded”: thousands of lines of cryptic PTX code, with movs, xor’s, loads, and stores, all apparently randomly thrown together, and hardly anything that looked like “useful” code. Clearly my “actual” ray gen program was at the end of this file, and looked great – but what was all that other stuff?? No wonder this wasn’t any faster on Turing than on Volta – all it did was garbling memory!

Turns out the culprit was what I had absolutely not expected: CuRand. I hadn’t even known about curand before I saw Roger Allen’s CUDA example, but when I first saw it this looked like an easy-to-use equivalent to Pete’s use of drand48(), and simply used it for my sample, too. Now CuRand does indeed seem to be a very good random number generator, and to have some really nice properties – but it also has a very, very – did I say: very! – expensive set-up phase, where it’s taking something like a 25,000-sized scratchpad and garbling around in it. And since I ran that once per pixel it turns out that just initializing that random number generator was more expensive in this example than all rendering taken together ….

Of course, the solution to that was simple: Pete already used ‘drand48()’ in his reference CPU example, and though that function doesn’t exist in the CUDA runtime it’s trivially simple to implement. Throwing that into my example – and taking curand out – and lo and behold, my render time goes down to something like 0.5 sec. And in that variant I also see exactly what I had expected: that iterative is way faster than recursive, and Turing was way faster than Volta. Of course, changing the random number generator also changed the image (I haven’t looked in detail yet, but it “feels” as if the curand image was better), and has of course also made the Volta code faster. Either way – for now, 500ms is good with me 🙂

With that – back to work….

RTOW in OptiX – added iterative variant…

Huh, how fitting: Ray Tracing on a Weekend“, and I’m sitting here, Sunday morning, over a coffee, and writing about ray tracing on a weekend … on a weekend. And if that wasn’t recursive enough, I’m even writing about recursion in ….. uh-oh.

Aaaaanyway. For reference, I also just added a purely iterative variant of the “RTOW-in-OptiX” example that I wrote about in my previous two posts: The original code I published Friday night tried to stay as close as possible to Pete’s example, and therefore used “real” recursion, in the sense that the “closest hit” programs attached to the spheres did the full “Material::scatter” of its respective material (lambertian vs dielectric vs metal), plus doing a recursive “rtTrace()” to continue the path, thus doing some real recursive ray (actually: path) tracing.

Now if you read the previous section very closely you may have seen that I put “real” in quotes, for good reason: OptiX will internally re-factor that code to not really recurse in the way Pete’s CPU version did – with very deep stack and everything – but will likely do something more clever by re-factoring that code, which you can read more about in the original OptiX SIGGRAPH paper.

All that said, no matter what OptiX may or may not do with it, from a programmer’s standpoint it’s true recursion …. and though OptiX may do some refactoring to avoid the “gigantic stacks” problem – it’ll still have to do something to handle all the recursive state – and that, of course, is not cheap. Consequently, real recursion is generally something to be avoided (which, BTW, typically makes the renderer simpler to argue about, anyway).

Roger Allen’s CUDA-version already did this transformation, and used a recursive version: Since his example used CUDA directly, there was no way for any compiler framework to re-factor the code, so if he had used recursion the CUDA compiler would really have had to use enough stack space per pixel to store up to 50 recursive trace contexts, which would probably not have ended well.

In my original OptiX example, I didn’t have this problem, and could trust OptiX to handle that recursion for me in a reasonable way. Nevertheless, as said above real recursion is usually not the right choice to go about it (and BTW: on a CPU it usually isn’t, either!), so the downside of my staying close to Pete’s original solution was that this originally example might actually have led some readers to think that I wanted them to write such recursive code, which of course is not what I intended.

As such, for reference, I just added a iterative version to my example as well. The particular challenge in this example is that while the CPU and CUDA versions have real “Material” classes with real virtual functions, in OptiX it’s a bit tricky to attach real virtual classes to OptiX objects (yes, you can do it – after all, programs are written in general CUDA code – but let’s not go there right now). For my particular version, the way I went about this is to have the closest hit programs do one Material::scatter() operation for the material associated to that geometry, and return the resulting scattered ray and attenuation back to the ray generation program via the PRD. Of course, this approach works only because the Material in Pete’s code does only exactly one thing – scatter() – and wouldn’t have worked if we the ray generation program would have had to call multiple different material methods … but hey, this example is not about “how to write a complex path tracer in OptiX” – that may come at a later time, but for now, this is only about how to map Pete’s example, nothing more.

I do hope the reference code will be useful; and as usual: any feedback is welcome!

With that – back to …. work?

PS: For those interested in having a look: I already pushed the code to github (https://github.com/ingowald/RTOW-OptiX). I’ll be running some more extensive numbers when I’m back to a real machine (no, I don’t bring my turing to my sunday-morning coffee…), but at least on my “somewhat dated” Thinkpad P50 laptop, I get the following (both using 1200x800x128 samples):

  • pete’s version (with -O3, and excluding image output), on a Core i7-6700HQ@2.6Ghz(running at 3.2Ghz turbo): 12m32s.
  • optix version, on a Quadro M1000M: 18 sec.

Of course, this comparison is extremely flawed: Pete’s version doesn’t even use threads, let alone an acceleration structure, both of which my OptiX version does. Take this with a grain of salt – or an entire salt-trucks worth of it, for that matter! That said, the parallelism in the OptiX version comes for free, and the acceleration structure …. well, all that took was adding a single line of code (‘gg->setAcceleration(g_context->createAcceleration(“Bvh”))‘) …

PPS: First performance numbers on some more powerful card (driver 410.57, optix 5.1.1):

  • 1070, recursive: 0.58s build, 6s render
  • 1070, iterative: 0.66s build, 5.5s render
  • Titan V, recursive: 0.57s build, 2.6s render
  • Titan V, iterative: 0.63s build, 2.1s render
  • Turing: to come…

“RTOW in OptiX” sample code now on github…

As promised in last night’s post, I cleaned up the sample code and pushed to github: https://github.com/ingowald/RTOW-OptiX.

I haven’t tried the cleanups on windows yet, but it should work. If you run into trouble, let me know!

One note on the code: I’ll very happily accept pull requests that cover bugs, typos, build fixes, etc. Please note I do want to stay as close as possible to the original example, though, so please don’t send pull requests with major restructurings, general improvements, or feature additions, even if they’d be useful in their own right…. this is not supposed to be a “how to do cool things in optix” repo; just a optix “port” of Pete’s example.

And now – back to work 🙂

Ray Tracing in a Weekend … in Optix (Part 0 of N :-) )

Yay! I finally have my first OptiX-version of Pete Shirley’s “Ray Tracing in a Week-end” tutorial working. Not the whole series yet (that’s still to come), but at least the “final scene”… pic below.


Ever since Pete’s now-famous “Ray Tracing in a Week-end” came out (see, e.g., this link for more details), lots of people have used his mini-books to learn more about ray tracing. Those books are, in fact, absolutely amazing learning material (if you have not read them yet – you should!), but suffer from one big disadvantage: yes, they’ll teach you the fundamental basics (and in particular, the elegance and beauty!) of ray tracing – but they won’t teach you how to use modern GPUs for that. And in particular since the introduction of Turing, one really should know how to do that.

To fix that shortcoming, I recently suggested to Pete that “somebody” should actually sit down and write up how to do that same book series – step by step – in OptiX. Roger Allen has since done that same exercise for CUDA (see here for that (also excellent!) article), but that still has a shortcoming in that by using “plain” CUDA it doesn’t use Turing’s ray tracing hardware acceleration. To use the latter, one would have to either use Windows-only DXR (e.g., through Chris Wyman’s – equally excellent! 🙂 – DXR samples), or through using OptiX.

Long story short: I did eventually start on a “OptiX On a Week-End” (“OO-Awe”!) equivalent of Pete’s book series (and hope Pete will jump in – he’s such a much better writer than I am :-/)… but writing an entire mini-book, with examples and everything, turns out to be even more work than feared. So, following my motto of “better something useful early than something perfect too late” I finally sat down and skipped all the step-by-step introductions, all the detailed explanations, etc, and just wrote the final chapter example in OptiX. I’ll still write all this other stuff, but at least for now, I’ll do a much shorter version just with the final chapter.

So, what’s to come:

First, I’ll clean up the code a bit, and push that one final chapter example (with cmake build scripts etc) on github (I’ll write another post when that’s done). Once that’s public, I’ll write a series of little posts on how that sample works, relative to Pete’s CPU-only book. And only when all of that is out and written, then I will go back to doing the longer mini-book version. As such, this blog post was actually “part 0” of a series of posts that will soon be coming…. I hope you’ll find it useful!

With that – back to work…. 🙂



Joining NVidia…

As I’m sure some of you will have heard by now, today is my last day at Intel, and starting on Monday, I’ll be working for NVidia.

Looking back, I’ve now been working for Intel for almost exactly 11 years, and if you were to include all the time I worked “closely with intel technologies” during my PhD and Post-Doc times, it’s actually close on two decades: Even before starting my PhD (while working on Alex Keller’s ray tracer while in Kaiserslautern) I was already drilling holes into Celeron chips (and soldering on cables) to make them dual-socket capable (they were supposed to be single-socket only 🙂 ); and at the start of my PhD we (including Carsten, in Saarbruecken) were writing the first interactive SSE ray tracer prototypes, at a time when the Linux kernel didn’t even save the SSE registers, yet (yes, that makes for fun-to-replicate bugs on a dual-socket machine!). Later on, while finally working for Intel, I’ve been lucky to have worked on virtually every cool technology that had come out, from Larrabee, to Knights-anything, to pretty much any Xeon architecture built in the last two decades, to lots of other cool stuff. It’s been fun, I’ve worked with truly talented people (some of which are, in their field, hands-down the best in the world, and some of which I know for longer than I have my kids!). And yes, we’ve done some pretty cool projects, too: From the first real-time ray tracers on Larrabee, to things like compilers  (my IVL, and Matt’s ISPC), to several prototype ray tracers that never made it into the public, and all the way  to projects like Embree and OSPRay, both of which turned into massively successful projects. In other words, I’ve had the chance to work on pretty much anything I wanted, which was typically anything that either involves, requires, or is required for, the tracing of rays.

All that said, as Matt recently wrote on his blog: “the world it is a-changing” (see this link for his blog article); and once again channeling Matt (man – that seems to become a pattern here!?) I felt like I needed “to be in the thick of all of that and to help contribute to it actually happening”… so when the opportunity to do so came up I simply couldn’t say no. So with all that: Today is my last day at Intel, and Monday will be my first at NVidia – looking forward to it, that’ll be interesting indeed!

One final note…

While trying to figure out how to best break this news I had a second close look at the article Matt had written when he joined NVidia a few weeks back. While doing so, it was actually for the first time that I realized how just deeply he had thought about all this “ray tracing for real time” topic. Of course I had “read” that before, but never really appreciated how much thought went into it.

Anyway – just to follow up on that particular topic from my point of view: For me personally, it’s never been a question of the “if”, but only of the “when”, and the “who” will be the first to make it happen. To explain: Even when I was still in the middle of my masters degree (say, ’96 or so), it was already clear that all high-quality rendering was done via ray tracing – sure, there were interesting discussions on whether it’d be path tracing, backwards/reverse/forward path tracing, photon mapping, bidirectional path tracing, or Metropolis (all of which at some point in time I had played with back then 🙂 )… but in the end, they all used ray tracing. At the same time, anything that was primarily time-constrained was doing something else (at that time: REYES’s “split-and-dice”, the equivalent of rasterization), but even then it seemed clear to me that with “computers” getting “faster” every year it’d eventually only be a question of time until the time constraint would go away, and that eventually, “the simpler/more elegant algorithm” would get used (because at the end of the day, that’s what it always comes down to: Once you can afford it, you always pick the more elegant, and more general, solution).

And sure enough, over the last decade-and-half we’ve already seen this happening in the movie industry: When I started my PhD, the general opinion was still that this industry would “never” switch to ray tracing, because it needed too much memory (REYES could do streaming), because it was too slow (REYES was faster), because it needed nasty acceleration structures, and because all this photo-realism wasn’t all that important (and at least apparently, sometimes detrimental!) to the artistic process, anyway … yet still, by today virtually every production enderer has switched to ray tracing, because in the budget allocated for a frame it is now possible to do it, and once it is, it was just simpler to express that renderer in ray-based terms. As such, at least in my eyes it’s always been merely a matter of time until real-time graphics will do what the movie industry has already gone through – at some point in time ray tracing will be fast enough to do it in real time, and once it is – if history is any guide – people will use it.

Anyway – no matter how you do reach that same conclusion, whether you think deeply about it or simply extrapolate into the future – it does look like ray tracing is here to stay. Let’s see where it takes us. It’ll be a few interesting years ahead.

Preprint of our Vis’19 paper on Iso-surface ray tracing of AMR Data now available …

Finally gotten to making an “authors copy” and uploading it to my blog, but here it now is – a preprint of our Vis 2019 paper on “CPU Isosurface Ray Tracing of Adaptive Mesh Refinement Data”  (link to pdf).


A few notes:

  • This paper is a direct follow-up to our previous AMR volume ray tracing paper (published at last year’s SigAsia Vis Symposium), but adds implicit iso-surface ray tracing capability (using a correct, analytic intersection method). The “octant method” reconstruction scheme was actually already sketched in the original submission of that previous paper, but wasn’t explained well enough back then, so got axed in the final version.
  • The “octant method” that this paper introduces is actually – if I may say so – pretty neat, because it’s both interpolating and continuous, even in corner cases. It may, however, well be the one thing in my career that I had to expent the most brain power to get right – it’s trivial in 1D, but even in 2D it took a while to get it right, and 3D has even more corner cases that some earlier attempts failed on (If only you could see the stack of notebooks all full of sketches: at one point I used xfig to draw a 2D AMR example, printed a few hundred pages full of that template, and pretty much used them all up going through the algorithm step by step, for each cell, until it finally worked!?). Worked on this – on and off – for almost 3 years, which is kind-of ridiculous …
  • The code is all implemented in OSPRay (of course?), as a loadable ospray module that is fully compatible with all other ospray actors (renderers, other geometry types, MPI parallel rendering, etc). This module is not yet being part of any official ospray release, but is already available upon request (Ethan should be able to provide – it’s all Apache License, so fully free), and will hopefully “at some point” be included in mainline ospray as well.
  • Though the paper’s title is exclusively on the adaptive mesh refinement (AMR) part, the actual code is just as much about the general implicit iso-surfacing code itself – the “impi” module (for imp-licit i-sosurface) is actually generally applicable to other volume types as well, and does come with an implementation for structured volumes, too. The paper itself is actually kind-of two papers in one, too… part on the IMPI module, and part on the octant method to use that for iso-surface ray tracing of AMR data. As such, I’d fully expect this module to be used as much without AMR as with AMR.
  • One reviewer (correctly!) pointed out that with all the “theoretical” continuity we claim in this paper there’s still a chance that there could be pixel-sized “shoot throughs” due to numerical accuracy issues: Even if we make the boundaries between levels fully continuous in a mathematical sense, the fact that different voxels/octants on different sides of the boundary use different floating point values for the cell coordinates (and those in different order of computations) means there can be elimination effects in the (limited-precision) floating point computations. Yes, that is perfectly correct, and I had fully overlooked it in the original submission (maybe one of the best reviewer catches I’ve ever seen!). But then, exactly the same effect will happen even for voxels in strutured volumes, without any level continuities ….


CfI: Embree on ARM/Power/…?

Executive Summary: This is a “CfI” – a “call for involvement” – for anybody interested in building and, in particular, maintaining a Embree version for non-IA ISAs such as ARM, Power, etc. If you’re mostly based on – or at least, very active on – any of those platforms, and interested in being involved with creating and maintaining a version of Embree on them …. let me know!


As most of those that read this blog will surely know already, Embree (http://embree.github.io) is a ray tracing library that focuses on accelerating BVH construction and ray traversal, thus allowing the user to build their own renderer with minimum constraints, yet good performance. Arguably the biggest strength of Embree – other than performance – is its versatility, in that it allows things like user-programmable geometries, intersection callbacks, different ray layouts and BVH types, etc.

Another big strength of Embree is that it will automatically select – and use – the best vector instructions available on your CPU, and make full use of it; e.g., if your CPU has AVX512 it’ll fully use it; if it doesn’t it will fall back to AVX2, or AVX, or SSE, … you get the point. The caveat of this, though, is that Embree today only supports Intel-style vector extensions; yes, it supports SSE, AVX, AVX2, and AVX512; and yes, it works on AMD CPUs just as well as it does on Intel CPUs …. but if you’re on Power, ARM, SPARC, etc, it currently won’t work.

Embree on Non-IA CPUs?

On the face of it, only supporting IA (Intel Architecture) style CPUs isn’t too big a limitation … in particular in high-end rendering almost every rendering is being done on Xeons, anyway. However, if you are an ISV whose software is supposed to also run on non-IA CPU types – think a game studio, or the Steam Audio 2 that’s been recently announced (see here), then you’re currently faced with two choices: either don’t use embree at all (even where it would be highly useful); or change your software to support two different ray tracers, depending on which platform you’re on (ugh). As such, I would personally argue – and yes, this is my own personal view – that it would be highly useful to have a version of Embree that will also compile – and run – on non IA CPUs.

In fact, doing just that is way simpler than you might think! Of course, everybody’s first thought is that with all the man-years of development that went into Embree “as is”, doing the same for other vector units would have to be a major undertaking. In fact, if you only dare to take a look at Embree’s source (if you want to, you can conveniently browse it on its github page) you’ll very quickly realize that almost everything in Embree is written using some SIMD “wrapper classes” (see code in embree/common/simd) that implement things like a logical “8 wide float”, “16 wide bool”, etcpp… and that then implements these wrapper classes once in SSE, once in AVX, once in AVX512, etc.

In other words, once you implement those wrappers in your favorite non-IA vector intrinsics, you’re 95% there towards having all of Emrbee compile – and run – on that ISA. After that, there’s still a few few more things to do in particular relating to the build system (adapting the cmake scripts to your architecture), in properly “registering” the new kernels (because Embree’s automatic CPU detection currently only looks at Intel CPUs), etc … but all that is “small beer” – all the traversal kernels, API implementation, BVH building, threading, etcpp, should all work out of the box.

I am, of course, not the first one to figure that out: In fact, almost two years ago a github user “marty1885” already did exactly that for ARM (see this blog article of his for an excellent write-up!). Funny thing back then was that I had just done “almost” the same in writing purely scalar implementations for those wrapper classes, while he independently did this port to ARM Neon. (And just to make this clear: With “scalar” I do not mean that Embree itself was re-written without vectors; I’m talking about an implementation that realizes, say, the 16-wide float “vectors” as doing 16 scalar add/mul/etc, rather than using an explicit _mm512_add_ps etc; this means it’ll still compile on any platform, but the rest of Embree still “sees” this as a 16-wide vector).

Both of those test implementations yielded interesting results: For mine, it was the fact that this purely “scalar” implementation of float4, float8, etc worked really well – auto-vectorizers may be, well, “problematic” for complex code, but for something that continuously does 8 adds, 8 muls, etc at a time, they absolutely do see that those can be mapped to vector instructions – not just as good an manual intrinsics, but surprisingly close. I did, however, never go through the exercise of changing the makefiles, so never even tried on ARM etc (well, I don’t have an ARM!). For Marty, he went “all the way”, and in particular, got his entire renderer working on ARM. His finding? That performance was pretty good out of the box, without any rewriting of kernels etc altogether… which is pretty cool…. and highly promising.

So, what (still) needs to be done?

OK, proof of concept done – what would we still need? Well – the problem is that both mine and Marty’s implementations are about 2 years old, and as such, pretty deprecated right now. It’s not hard to re-do that for the latest Embree, but it has to be done (and I sure won’t have time for that!).

Second, even if that exercise was re-done, it would still have to be maintained: Every new release of embree adds a few additional things, fixed bugs, adds improvements, etc… and to be really useful for ISVs, the “ported” embree would have to keep up to date with those updates. Now thanks to git, that “keeping it updated” shouldn’t be too hard – very few of those release updates would touch the wrapper classes, CPU detection, or build system, so in 95% of cases I’d expect such an update to be as simple as a “git pull remote” to make a new release…. but it would still have to be done. In fact, if I were to build and maintain such a “non-IA” version of Embree, I’d do it exactly as such: clone the embree repo once, port the wrapper functions and makefiles just like Marty did it, then push that onto another, easy to find repo… and of course, pull in the master diffs every time Embree makes a new release, fix whatever needs to get fixed (probably not much), and push that release too.

Be that as it may – I will likely not have the time to maintain such a project …. but if anybody out there is eager to make himself a name in non-IA CPU ray tracing land (and possibly, with some of existing users of Embree that want to be platform agnostic) – well, here’s a relatively easy project to do just that!

Anyway – ‘nough for today … back to work!

PS: And of course, once you have a non-IA version of Embree, it’s trvially simple to also have a non-IA version of OSPRay, too: OSPRay itself doesn’t use vector intrinsics at all; it only uses them indirectly, through ISPC, and through Embree. ISPC can already emit to “scalar” targets, so as soon as Embree could emit to whatever ISA you require (or to scalar, as I had done) ….. well, all you’d need is a C++11 compliant compiler, and likely MPI… which aren’t all too rare nowadays :-). As such, if you do have a non-IA supercomputer (hello there, Summit, Sierra, Tianhe, Sunway, Titan, Sequoia, etc!), and you need a good, scalable, fast ray tracer …. take your sign!

ISPC Bag of Tricks … “Object Oriented Programming” in ISPC

As mentioned in the past few articles, I intended to use this blog as a means of explaining “useful things” we’ve done in ISPC – in particular, as part of the OSPRay project – that are hard to describe in any other medium. Now while some of the previous topics were more useful “across the board” (ie, even for little kernels, no matter what program), this article will cover what is arguably – at least for projects on the scale of OSPRay – even more useful …. namely, our experience with mapping at least “some” C++/object oriented programming concepts into non-trivial ISPC programs.

Background: ISPC and C++ …

Looking at the design and features of ISPC, it’s pretty good at expressing low-level performance related issues like data layouts, vectorized (and in particular mixed-vectorized-and-scalar) control flow, etc … but somewhat less strong in the area of building large, complex code bases. Sure, in theory you could, but you rather quickly run into limitations like no templates, no STL, no virtual functions (or, in fact, any sort of methods), etcpp. As such, the way it looks like it is intended to be used is in a mix with C++ – C++ for the “Big system framework”, and ISPC for a few, performance-critical kernels. And of course, in that sense it’s not all that different from other frameworks such as OpenCL, either.

There are, however, cases – like building an entire, non-trivial renderer – where that design isn’t all that easy to stick with. In OSPRay, for example, the very base concept is that as a ray tracer it can do all these abstractions of “I don’t care which actual type of geometry it is, as long as it can provide a bounding box” etc, which just cries out for a C++ object hierarchy … and which you want to use so fine-grained that you really don’t want to always go back-and-forth between “host code” and “device code” (in particular because unlike for most OpenCL use cases, there is no difference between host code and device code in ISPC – it usually does all run on the same CPU!).

As such, pretty early on we decided to go and try to “emulate” as much of C++ in ISPC as we could, using function pointers etc in exactly the same way the first C++ compilers were built, too (for those old enough to remember: the first C++ compilers were actually “pre-“compilers that compiled from C++ to C, then used a C compiler to do the rest!). In this article, I’ll explain a bit on how we did this.


Starting with the bad news first – templates don’t exist. As in “not at all”. And no, there’s no good or easy way to do it. If I had one wish for ISPC, it’d be exactly that – give me some templates. Not even the full-features ones with which you can do template-metaprogramming, and write programs that you can neither understand the program nor the compiler’s error messages any more … but some basic templates like templating over base types and integers would go a long way.

The way we solved it in OSPRay therefore is, to large degree, a plain “not at all”. In some cases we’ve “solved” it by always going with a good-enough intermediary representation (eg, intermediary frame buffer tiles always store floats, even when bytes would have been enough). In a few cases, we’ve gone with using preprocessor macros, that emulate template parameters through textural processing. E.g., you first define a “template” using something like that:

#define define_vec3_operator(T) \
   inline vec3##T operator+(vec3##T a, vec3##T b) { .... }\
   inline vec3##T operator-(....) { ... } \

… then “instantiate” this with something like that:

define_vec3_operators(f); // defines vec3f+vec3f, vec3f*vec3f etc
define_vec3_operators(i); // ...
#undef define_vec3_operators

That certainly ain’t a nice solution, but it’s a solution, which is better than nothing. I suggest you use it sparingly, but it can be very useful. For example, in our volume sampling code using this eventually led to some 2x performance improvement (I won’t go into details here, but if interested can write more about it).

Inheritance and Virtual Functions

Though templates would be “useful” to have, a proper class hierarchy – and virtual functions – is even more important. It’s just very hard to imagine a non-trivial ray tracer without object oriented programming and virtual functions. For example, in OSPRay we did want to be able to say that there are “Geometries” that can “intersect a ray” and “query surface information re a given intersection”, no matter whether that geometry is a triangle mesh, a set of spheres, a implicit iso-surface, or anthing else … and the best way to do that is, of course, a “Geometry” base class, with virtual methods for “Geometry::intersect(Ray)” and “Geometry::postIntersect(Hit)”, etc.

Unfortunately, ISPC doesn’t have these language features – but quite fortunately, that is only a result of the parser not offering it, because everything you do need to implement this paradigm is there. As such, we can emulate it:


To start with, let’s look at inheritance. Say we want each Geometry to have a bounding box, and both a TriangleMesh and a Sphere to be geometries. In C++, we’d write this as follows:

/* inheritance example, c++ */
class Geometry {
   box3f bounds;
class TriangleMesh : public Geometry { 
   std::vector<vec3f> vertex; 
class Sphere : public Geometry {
   vec3f center;
   float radius;

In ISPC, we can’t directly do that – but if you look at how inheritance actually works in C++, all it means is that basically each “inherited” class first has all the members of its parent, and then adds some of its own. As such, what we can do in ISPC, is roughly the following:

/* "pseudo"-inheritance example, ISPC */
struct Geometry {
   box3f bounds;
struct TriangleMesh {
   Geometry super; // "inherit" all parent fields
   vec3f *vertex;
struct Sphere {
   Geometry super;
   vec3f center;
   float radius;

Now ignore for a moment that the “std::vector” in TriangleMesh::vertex is no longer a vector; and ignore that the syntax is a bit “cumbersome” – but the base concept of inheritance is there, at least for class members, and at least for single, straight-line inheritance: Every pointer to a derived class  can always be cast to a pointer to the parent class, and then act just like the parent, which is exactly what inheritance means.

Before I go on to how to do the same thing for methods – and virtual functions – two quick notes on this:

Inheritance, polymorphism, and pointer-demotion

First, as part of how its polymorphism works C++ will automatically “demote” any pointers when calling a function expecting the parent class:

/* this works in C++ : */ 
void foo(Geometry *) { ... }

void bar(TriangleMesh *mesh) { .... foo(mesh); ...}

In ISPC, that same code won’t work, because even though a Mesh “acts” the same way as a Geometry it still isn’t a geometry from the type system. As such, you actually have to do this manually by either type-casting, or by passing a reference to the actual inhertied base class:

/* this is how it works in ISPC */
void foo(Geometry *uniform) { ... }

/* this C++ equivalent will *NOT* work: */
void bar0(TriangleMesh *uniform mesh) { ... foo(mesh); .... }

/* this *will* work */
void bar1(TriangleMesh *uniform mesh) 
{ ... foo((Geometry *uniform)mesh); ... }

/* this will *also* work */
void bar2(TriangleMesh *uniform mesh) 
{ ... foo(&mesh->super); ... }

Inheritance – and the need for discipline

A second note on this concept of inheritance is a personal “recommendation”: Be very, very careful, because with all the manual typecasting you take away any sort of error checking from the compiler… so this may easily lead to “funny” errors. There are, however, ways of making this easier – for example, in the above example the second solution (bar2) is somewhat less prone to errors than the bar1 – not much, but somewhat.

Second, we also learned (the hard way, of course) that a good – and above all, consistent – naming scheme is super important, in particular if the code grows in complexity. Initially, in OSPRay every hierarchy used a different name – one used “parent” for the inherited members, some used “geometry” or “volume”, some used “super”, some “inhertied”, etc – and keeping that straight turned into a nightmare. Today, we consistently use “super”, and that saves a lot of headache. The same topic will come up below.

Methods and “this”

In C++, you can add a method right into a class, which is very useful. Again, you can’t do that in ISPC, but if you look at it closely enough, a “method” – at least a non-virtual one – is nothing other than a static function with an implicit “this” that points to the class that this method lives in. As such, to emulate:

/* method in C++ */
class Sphere : public Geometry { 
   box3f getBounds();
   vec3f center; float radius; 
box3f Sphere::getBounds() 
{ return box3f(center-radius,center+radius); }


struct Sphere { vec3f center; float radius; };

uniform box3f Sphere_getBounds(Sphere *uniform self)
{ return make_box3f(self->center - make_vec3f(self->radius),
                    self->center + make_vec3f(self->radius); }

Before we go on, again a few notes: First, there’s no method definition, because ISPC simply can’t do that.

Second, the biggest difference you’ll see is that the implicit “this” of the C++ variant is gone, and is replaced with an explicit variant of this (which we called “self”) – and we have to explicitly prefix each member access with this “self->” thing. But though somewhat more ugly, it does the trick.

Third, one thing that does require some attention is the use of “uniform” vs “varying”: It is actually pretty important whether “self” is a uniform or a varying pointer, or a pointer to a uniform or a varying base class. Though my own research compiler equivalent could handle this fact (and automatically generate code for all variants), in ISPC you can’t, so you may eventually have to end up writing at least two variants of this method – one with a uniform ‘self’ pointer, and one with a varying. That may be cumbersome, but it works.

One final note on method: Just like for inheritance, I can’t over-emphasize the importance of good and consistent naming. Initially we used all kind of schemes and names (“self”, “this”, “sphere”, “geometry”, “_THIS”, are all examples we used at various times) all over the place, and again, that turned into a mess. In the end, we always use “self”, and always use it as the first parameter … and switching to this has been very useful indeed.

Virtual Methods

Now that we have inheritance and methods, let’s go on to the really intersting stuff, which is “inheriting functions” – also known as “virtual functions”. Once again looking at how virtual functions actually work in practice, they’re nothing other than a function pointer to a method (which we now know how to do) that is inherited from the base class (which we now know how do to, too). So, let’s slightly extend our C++ example:

/* virtual example in C++ */
class Geometry { 
   virtual box3f getBounds() = 0;
class Sphere : public Geometry {
   virtual box3f getBounds() override;
   vec3f center;
   float radius;
box3f Sphere::getBounds() { return box3f(center-radius ....); }

In ISPC, we first have to define a function pointer for the Geometry::getBounds() function – which is a method, and as such needs an implicit ‘self’, which in this case has to be pointer to a geometry:

struct Geometry {
   uniform box3f (*getBounds)(Geometry *uniform self);

Note this doesn’t actually declare a function itself, and thus can’t take a function body – it merely states that this geometry will contain a pointer to a function/method with the given signature: a “self” (which happens to be a geometry) going in, and a box3f going out. We still have to write this function (like we’d have to do in C++), and – unlike C++ – have to actually set this function pointer to the right function – so before we talk about inheritance and “overriding” this function, let’s briefly insert the topic of “constructors”.


In C++, a lot of what constructors do is fully automatic. Sure, you can overwrite them, but in the above example, the constructor would always and automatically assign the virtual method table (with all the function pointers to the virtual methods). The ISPC variant of course doesn’t have implicit constructors, so no matter which “getBounds” functions you might be writing, the pointer in that class will remain undefined unless we explicitly set them. So, let’s write a (explicit) constructor first:

void Geometry_Construct(Geometry *uniform self) 
{ self->getBounds = NULL; }

In this example, we’ve set the ‘setBounds’ to NULL, which of course is only marginally better than leaving it undefined (though you could argue that as a virtual function it should be “0”, too). So let’s make this a bit more useful by “forcing” the caller to pass a useful function:

void Geometry_Constructor(Geometry *uniform self,
                          uniform box3f (*getBounds)(.....))
{ self->getBounds = getBounds;

This way, whoever wants to construct a geometry has to specify that function, which is useful.

Overriding Virtual Functions

Now, back to actually creating a userful subclass of geometry, and overriding its virtual function. For our Sphere, we can do the following:

struct Sphere {
   Geometry super; //< note this "inherits" the bounds function!
   vec3f center; 
   float radius;

Note now the “super” also inherits the Geometry’s getBounds() method – though to be more exact, it inherits the function pointer , not the method itself.

Let’s create there Sphere’s method we want to override with:

uniform box3f Sphere_getBounds(Sphere *uniform self)
{ return make_box3f(self->center + ........ ); }

and write a constructor for the sphere object:

void Sphere_Constructor(Sphere *uniform self)
   // first, make sure the 'parent' is constructed, too
   // ... then 'override' the getBounds method
   self->super.getBounds = Sphere_getBounds;

… or, using the constructor that expects afunction pointer, it’ll look like this:

void Sphere_Constructor(Sphere *uniform self)

Aaaaand – as soon as you try this out, you’ll immediately see that this won’t actually work, because the signatures of Sphere_getBounds doesn’t actually match the signature of the Geometry::getBounds pointer: the former expects a sphere, the latter a geometry, and as stated above, ISPC does not have C++’es automatic pointer demotion.

As such, you have two choices:

  1. you can give the derived methods use the signature of the parent, and (manually) upcast the “self” pointer to the derived class’es type; or
  2. you insert the right typecasts for the function signatures.

As an example of method 1:

/* Option 1: using the signature of the parent class */
void Sphere_getBounds(Geometry *uniform _self) 
   /* note how 'self' is a Geometry (not a sphere), just as 
      the parent's getBounds method type would expect */
   // first, typecast 'us' to our real type
   Sphere *uniform self = (Sphere *uniform)_self;
   return .... self->center ....;

And, for completness, here an example for method 2, using some typecasts to make the code less ugly:

typedef uniform box3f (*Geometry_getBounds_t)(Geometry *uniform);

/* the Sphere::getBounds, with its own type */
void Sphere_getBounds(Sphere *uniform self) { ... }

/* and use of that in the constructor */
void Sphere_Construct(Sphere *uniform self)

Once again, which of the two options you go with doesn’t matter too much, as long as you’ll remain consistent (else it does turn into a veritable nightmare). I personally prefer method 1, because the typecasts “feel” a bit more localized, but that’s probably only a matter of taste.

Calling Virtual Functions

With all that, the actual process of calling a virtual function is pretty mundane:

void buildBVH(Geometry **uniform geoms, uniform int numGeoms)
   for (uniform int i=0;i<numGeoms;i++) {
      bounds[i] = geom[i]->getBounds();

will do the trick just fine. Where it does get a bit more tricky is if we’re not dealing with a varying “warp” of object instances: ISPC can actually call a vector of function pointers just fine (it’ll implicitly and automatically serialize over all unique instances of it); but at least the way we’ve done it above the ‘self’ parameter of each such instance expects a uniform, so just calling a varying’s worth of getBounds()’s won’t work.

Option 1 for this is to actually implement each virtual method twice – once with a uniform self, and once with a varying …. but that’s a lot of work, and ugly. Instead, what we did is go with – as always in such cases – the “foreach_unique to the rescue” option, and serialize explicitly:

void Scene_postIntersect(Scene *uniform self, varying int geomID) 
   Geometry *varying geom = self->geom[geomID];
   /* geom is a *vector*, so we can't just call geom->getBounds(),
      at least not with geom as a parameter to it */

   // instead: serialize over unique geometries
   foreach_unique(uni_geom in geom) {
      ... uni_geom->getBounds(uni_geom);

The end … ?

Oh-kay – guess that was a much longer post than expected, but still hope it’s useful for those setting out to write non-trivial ISPC programs themselves. (Or, of course, those trying to get a better idea of the concepts behind OSPRay!). If you are interested in having a closer look at those concepts used in practice, by all means go and have a look at the OSPRay sources, at https://github.com/ospray/OSPRay (and in particular, the class hierarchies of Geometry, Renderer, etc) – in fact, it’s all over the place in OSPRay.

Most of what I described above will sound trivial to those that have done similar coding before; though I fear those that haven’t will still struggle, because truth be told, it’s not as easy as doing C++ with “real” C++. That said, the concepts described above have been immeasurably helpful in OSPRay – in fact, I couldn’t imagine OSPRay in any other way, nor could I imagine how to do it in any other way other than with “real” C++.

As such, I hope this has been useful…. and as always: Comments/Criticism/Corrections … all welcome!


ISPC Bag of Tricks Part 3: Addressing 64-bit arrays in 32-bit addressing mode

Some time last week or so I wrote a first article on the importance of using ISPC’s 32-bit addressing mode  whereever and whenever possible, because it is so much more friendly to the underlying CPU architecture …. and as such, can generate way faster code.

The caveat, of course, is that the times of 32-bit address spaces is over (by about a decade or two!), and most “serious” applications today can expect to be asked to handle many gigabytes of data. As already mentioned in this previous article, “in most cases” this is not actually a problem, because “32 bit mode” in ISPC only means that varying accesses are actually 32-bit varying offsets relative to a 64-bit base pointer – so as long as all your individual arrays are smaller than 32-bits, the sum of all such arrays can very well exceed the 4GB barrier (well, 2GB, to be exact, due to sign) without any issues. In more practical words: Assume you have a triangle mesh with order a million vertices, such that this mesh’s vertex array is about 10MB large.

Now further assume I read a scene that contains a thousand of such meshes, then I’ll end up with a total of one thousand such vertex arrays of a total of about 16GB – well beyond the 2GB mark, but still perfectly fine as long as all memory accesses are always relative to each mesh’s own vertex array… which will work our just fine in practice, without doing anything else. So one mesh with 1 billion vertices (16GB) would have led to crashes when the varying array index overflows – but 100 meshes of 10 million vertices each would work just fine. That is, in fact, exactly what we did in OSPRay for a long, long time:  we were rendering (many-)gigabyte sized scenes all day, in pure 32-bit addressing mode, and it just worked our fine, because typically the many gigabytes of data were always split over multiple arrays that were each smaller than 2GB.

But what if I have individual arrays larger than 2GB?

In practice, however, “usually work out OK” is just another term for saying “every now and then it won’t”, which isn’t all that great. So how can we safely handle cases where individual arrays are larger than 2GB?

One option, of course, is to recompile in 64-bit mode – but that has to be done for the entire project, so will come with a huge performance penalty even for the majority of cases where we would not need it.

The other option that we eventually ended up using (a lot!) in OSPRay is what we called “segmented” 64-bit addressing: Thought he array actually is a single array, we can logically view it as if it actually consisted of several smaller arrays – “segments” – placed right next to each other in memory, and manually serialize accesses into those segments. That sounds complicated, but isn’t – here an example.

First, assume we have – and this is a real example from OSPRay – a “Sphere Mesh” geometry that contains a set of N spheres, specified as a data array with N structs that each looked roughly like this:

struct Sphere {
   vec3f origin;
   float radius;
   vec3fa color;

… and the sphere geometry then looks roughly like that:

struct SphereGeometry {
    Sphere *uniform sphereArray;
    int numSpheres;

… and then something like this:

vec3f getSphereColor(Sphere *uniform sphere, varying int primID)
{ return sphere[primID].color; }

Note, in particular, that this is a real-world example from OSPRay – somewhat simplified, but mostly very close to this example (in fact, the first case where we ran into this issue!). Now of course, we had well documented that each geometry can hold only 2 billion geometries, because the primitive ID is a 32-bit int, as is ‘numSpheres’. However, users were getting core dumps well below this “2 billion spheres” limit, because the


expression – in 32-bit mode – evaluates to a 64-bit base (“sphere”) with 32-bit offsets (“primID*sizeof(Sphere)”), and since each sphere is 32 bytes large, the latter gets a 2GB overflow way before reaching 2 billion spheres (64 million spheres, in this example).

So, how to fix it? Assume we see our sphere array as made up of “segments” of at most 1 million spheres – in this case, each such segment can address “its” spheres with 32-bit offsets just fine. Also, for each primID we can easily compute both the segment that this primitive is in, where – in 64-bit memory – this segment begins, and what this sphere’s offset is within this segment. Now with that, all we have to make sure is that we serialize across the different segments, which we can do through a foreach_unique, as such:

vec3fa getSphereColor(Sphere *uniform sphereArray, int primID) 
   vec3fa returnValue;
   varying int segmentID     = primID / (1024*1024);
   varying int segmentOffset = primID % (1024*1024);
   foreach_unique(uniformSegID in segmentID) {
      Sphere *uniform segmentBase = sphereArray+uniformSegID;
      returnValue = segmentBase[segmentOffset].color; 
   return returnValue;

Now in order to understand why this works, let’s have a closer look: We first compute, in each lane (ie, in varying), the ID of the segment, and the offset therein. We then iterate over all the unique segments in this ‘segmentID’ value – and since that’s exactly how “foreach_unique” is defined, the unique values it then iterates over are uniform values: i.e., in each iteration, we iterate over a uniform segment ID. And because of this base address of the segment that we then compute in this body is uniform, too, which means is a plain, scalar, 64-bit pointer – no offsets whatsoever, and thus no overflows anywhere. In the following line, then, we do a varying array access, but relative to a “safe” 64-bit base pointer, and with offsets that are guaranteed to be less than 32 million bytes, so totally safe.

What about performance?

Of course, this code is “a bit” more costly than a simpler array access that is guaranteed to fit within 32GBs …. but it’s not actually all too horrible: If the accesses are mostly coherent, chances are that only very few iterations of this look will be required, because coherent accesses will predominantly go to the same segment, and then be done in a single loop. And if accesses are not coherent, chances are you’ll have performance issues with TLB walks etc, anyway (which would, quite interestingly, internally serialize across TLB pages almost exactly the same way this loop does over segments!).

That said, there is of course some performance penalty of doing this. In OSPRay, we actually do flag meshes as to whether they require this fix or not, and then, in “postIntersect”, either to the “safe” gather, or a “fast” gather. But even without going through these lengths – doing this safe gather only in select places is certainly better than recompiling everything in 64-bit mode.

One final note: In OSPRay, we do use this technique in many different places, in various modifications … the core idea, however, is always the same.

Hope this is useful to anybody that wants to handle (at least occasionally) “large” data in thei ISPC programs – try it out, so far at least for us it’s been rather useful!

PS: As always : comments, corrections, improvements are welcome!

ISPC Bag of Tricks Part 2: On Calling Back-and-Forth between ISPC and C/C++

As mentioned previously I wanted to use this blog to share little nuggets on “how to do things” with tools such as ISPC – little things that are useful, but not easily publishable in any form other than a blog post.

In the first such article (on the ISPC front) I wrote a bit about the different aspects of addressing modes (and their caveats) in ISPC, which is something we constantly stumbled over / used in our OSPRay project. In this article, I’ll write a bit about something any non-trivial project will at some point stumble over: calling back-and-forth between ISPC and C++.

Typically, the reason we want to mix ISPC and C/C++ at all is that ISPC is good at expressing (low-level) parallel code for vectorization, but a little bit less good at building more complex software systems that nowadays use all the goodies of C++, such as templates, virtual functions, copious amounts of libraries (often heavily based on templates, too), etc… all things that ISPC – very unfortunately – lacks. As such, the typical set-up for most ISPC programs I’ve seen so far is that there’s a “main application” that does all the “big systems” stuff (I/O, parsing, data structure construction, scene graph, etcpp) in C++ – and smaller “kernels” of the performance-critical code written in ISPC. Obviously, those two parts have to communicate.

The obvious part: Calling ISPC from C/C++

Now the first, and most common, operation involved in that is having the C/C++ program call into ISPC, to launch and/or otherwise execute those kernels (I’ll say a bit more on “launching” vs “executing” below). Luckily, that part is easy, even has some language support (the “export” keyword), and is well documented: Basically, you declare an ISPC function as “export <function>”; have ISPC generate a header file with a C-callable function defitinition of that (the “-h” flag in ISPC invocation”; include the thus-generated header-file in your C/C++ code, and simply call that function.

Even with that simple way, there’s a few caveats: In particular, though ISPC can export the function declaration to C/C++, it can’t easily export all the parameter types that this function may expect as parameters – in particular if those parameters include compound types (structs), varying types (ugh, it really doesn’t like that), an implicit active mask, functions pointers with varying arguments, etcpp.

Example 1: Getting our feet wet

To start with, let’s look at a simple case, a ISPC kernel that operates on a float array (it doesn’t matter what it does). In this case, we’d simply declare the ISPC side (in foo.ispc) as

export void foo_simple(float *uniform A, uniform int N) { ... }

Once we compile this function with the “-h” argument, ISPC will emit a C callable function (with “extern C” linkage, decared in foo_ispc.h), with the following signature:

/* in foo_ispc.h */
namespace ispc {
  extern "C" void foo_simple(float *A, int N);

Obviously, calling this function from C++ is trivial. Worst that could happen – and usually does happen in many programs – is that the data on the C++ side isn’t actually stored as a plain C-array, but rather in a std::array or std::vector, but even that can easily be fixed with a bit of typecasting:

/* in foo.cpp */

#include "foo_ispc.h"

void fooOnVec(const std::vector<float> &vec) 
   ispc::foo((float *)&vec[0],(int)vec.size());

Of course, any C++ purist with a drop of blood left in his veins will cry out at this abuse of C++ : A std::vector should be opaque, we shouldn’t be assuming that it internally stores this as a float array, just getting the base pointer by taking the address of the first element is devil’s work, and don’t even get me started on casting a pointer to a const vector’s internal storage to a non-const float array. But hey, this is about performance, not cleanliness, so I’ve intentionally chosen this example just to make this point: In many cases you will have to type-cast from “better” C++ types to something that’s more easily digestible in ISPC. Of course, the cleaner way would have been to first copy from that std::vector to a temporary array, and call that kernel on that array – but calling into ISPC is all about speed, so there we are (and if speed is not the goal: just write all the code in C++ to start with!).

Example 2: Structs as function arguments

The reason the previous example was so simple is that it used only built-in types like “float” and “<pointer>” (and those in purely uniform variability) that ISPC can map one-to-one into the generated header file. For compound types and/or varying it’s a little bit more tricky, because typically ISPC and C++ can not include the same header files (except in trivial example), and the compound types that ISPC exports are not always very useful on the C++ side. For example, consider making out example a little bit more tricky:

/* in foo.ispc */

/* declare a three-float vector type */
struct vec3f { float x, y, z };

/* export a kernel that operates on an array of those */
export void foo(uniform vec3f &v) { .. }

In this example, all parameters are still purely uniform; however, there’s already a struct type involved. Now ISPC can already export this struct (generating a “ispc::vec3f” struct with float members x,y,z as above), and the C++ code can already use this. But generally, this “pure C99” struct wouldn’t be much use in C++ (typically a C++ version of a vec3f would have all kinds of methods associated with it that we can’t add in ISPC), so more likely, we’ll also be using some “class vec3f” on the C++ side. Nevertheless, if we do design our ISPC and C++ classes such that they do have exactly the same memory layout (e.g., in this example, as “three floats, called x, y, z, right after each other”), then we can once again simply typecast:

/* in foo.cpp */

#include "foo_ispc.h"

namespace cpp { 
   class vec3f {
      /* all kinds of C++ goodies */
      float x, y, z;

   void fooPlusPlus(const ours::vec3f &vec) {
     /* just typecast the reference type, and call ISPC */
     ispc::foo((const ispc::vec3f &)vec);

And lo and behold, everything works as expected. Again, a C++ purist might object – but that’s the kind of pattern we’re using in OSPRay all over the place, and so far it’s been very useful. Do note, though, that we do typecast the reference types – this basically tells the compiler that it doesn’t even have to know how to “convert” from one type to the other, and that it’s simply a pointer for which we guarantee the right underlying data layout.

The caveat, of course, is exactly what every C++ purist will cry out about: There is no type-guarantees in that at all that the compiler can even give you a warning about: If you change the order of the members on the C++ side without also doing that on the ISPC side (or vice versa), or simply add a member on one side but not the other … or do anything else that changes the memory layout on one side but not the other …. well, then you’ll get funny results. Be warned.

Example 3: What about varying argument types?

Oh-kay, in the previous examples we’ve passed only uniform types, but what about varying ones? Answer: generally speaking, you won’t need it. This sounds counter-intuitive (after all ISPC is all about varying types), but isn’t: typically it’s only the ISPC side that “creates” the varying types, so if anything you’d want to pass a varying type from ISPC to C++ (see below), but not the other way around.

Now if you still think you have to do that, you’ll have to go the way of arrays – ISPC will simply refuse to emit an export header for anything that contains an actual varying type. For the curious: There’s actually no reason it can’t do that – my own version of a SPMD compiler (called IVL, and at one point ISPC++) could actually do that, by exporting a struct of appropriate size :-/ …. but it’s simply not implemented in ISPC, so don’t even try it. Unfortunately, the same goes even with “indirect” references to varying types. For example, in OSPRay we make heavy use of “poor mans C++ in ISPC” with function pointers and stuff – and though the classes we use – and the function pointers therein – are perfectly uniform every time we would even want to export them, the uniform function pointers inside those uniform classes often have varying parameters, so currently ISPC can’t export them (IVL/ISPC in these cases wouldn’t even have emitted the varying types themselves, only forward declarations). Either way – don’t export varying types, or those with function pointers with varying arguments, and you’ll be fine – and as said above, in 99.9% of all cases, you probably don’t need this, anyway.

Now, how about the other way around? Calling C/C++ functions from ISPC?

While the above way of calling from C/C++ into ISPC is pretty well documented, significantly fewer people realize that the other way around works perfectly well, too: ISPC obviously can’t directly call any C++ functions – it would have to understand all the mangling (and in almost all cases, C++ types) to do so …. but what it can do perfectly well is call functions with extern C linkage, no matter what language was used to generate them. I.e., you absolutely can do the following:

/* in foo.cpp */

extern "C" void someCppMagic()
   std::ifstream .... /* whatever C++ code you want */

Then, once declared as “extern C” in ISPC, we can call this function:

/* in foo.ispc */

/* _declare_ the C++ function (with extern C linkage */
extern "C" {
   unmasked void someCppMagic();

/* now call from another ispc function */
void myIspcKernel(varying ....) 

Now in this example, there’s two little details that aren’t immediately obvious, but important: First, on the ISPC side we have used a

extern "C" { 

rather than the simpler

extern "C" ...

… and though that looks mondane, it’s actually important, since the ISPC parser can digest the former, but – for whatever reason – can’t digest the latter. So using the former is more cumbersome, but actually important to make it work.

Second, you may or may not have noticed the “unmasked” keyword in the ISPC declaration of “someCppMagic()”. In this particular example this is actually superfluous, but in some more complex ones it isn’t, so let me explain: In ISPC, all the “varying” control flow is handled implicitly, meaning ISPC keeps track which “programs” (lanes) in a “program group” (aka warp, work group, SIMD vector, etc) are active or not. It does that fully implicitly, without the user seeing it – but it obviously does have to pass this information between (non-inlined) functions, which it does by always putting an additional, completely implicit, parameter on the stack of each function call (and of course, that parameter contains the active mask).

Now in our example the C++ side doesn’t expect any parameters, so even if it was on the stack it wouldn’t hurt, since the C++ side would simply ignore it. If the function did have any paramters, however, putting the “unmasked” keyword is actually pretty important, since the C++ compiler would assume one sort of data on the stack (i.e., no active mask), but ISPC would put both the “real” arguments and the active mask there… and that can lead to rather funny results (believe me, we did stumble over that in both OSPRay and Embree ….).

Now, what about passing stuff?

Now just like in the “C++ to ISPC” example, the real fun starts once we want to pass arguments from ISPC back to C. In its simplest form, we can do this by simply using foreach_active to serialize over all active programs, and calling a (scalar) C/C++ function with it:

/* foo.ispc */

extern "C" {
unmasked void fooScalarCpp(uniform float f);

void myIspc(varying float f) 
   foreach_active(uniform_f in f) 

And yes, that’ll work perfectly well.

More often, however, you actually do want to pass varying data to C/C++, so let’s talk about how to do that. Just as in the C-to-ISPC example, uniform basic types are trivial, as are pointers/references … and varying ones are.

First, the simple stuff – varying built-in types

For built-in types (float, int, …) the easiest way of passing them in varying form is actually as arrays:

/* in foo.ispc */

extern "C" {
unmasked void myCppFunc(float *uniform, uniform int);

void myIspcFunc(varying float &f)
  myCppFunc((float *uniform)&f, programCount)

… and everything works fine.

Now what about compound types?

Now where it does get tricky is varying compound types, because these get internally stored in struct-of-arrays (SoA) form, but must typically be converted to array-of-structs (AoS) before C++ can digest them. For example, the typical way to do so is the following:

extern "C" {
unmasked void myCpp(vec3f *uniform, uniform int);

void myIspc(varying vec3f &v)
   uniform vec3f vInAos[programCount];
   vInAos[programIndex] = v;

In this example, we first use ‘programIndex’ and ‘programCount’ to transform from SoA to AoS, then call the C++ side function with this array of (uniform) structs, just like above (note that we do not have to typeast when calling from ISPC to C++, because we did the declaration of the function with ISPC types … there is no “foo_cpp.h” to include).

Of course, just like we converted data on the way “out” to C++, so we’ll have to do once again if that C++ function modified that data. Also “of course”, this conversion is pretty horrible performance-wise, so “use it wisely” is all I can say.

What about active/inactive lanes?

One thing all the previous examples had in common is that they all assumed that the C++ function would always want to process all the lanes, or at least, that it was oblivious of the fact that some of the program instances might not actually be active – in fact, I even harped on the importance of the “unmasked” keyword to intentionally not pass any active masks. In practice, however, we often do have to pass information which lanes are vs are not active… so how to do that?

The first thing that comes to mind is to not use the “unmasked” keyword on the ISPC side, and then, on the C++ side, simply declare this additional parameter explicitly – after all, ISPC will put this “varying bool” on the stack, so if we only declare it properly, the C++ side can use it. Only one answer: Don’t do it. Of course, we actually did at first, and it does in fact “kind of” work – for a while. However, ISPC doesn’t guarantee the form that this parameter takes on different ISAs, so code that may work perfectly fine on AVX may lead to “funny results” when run on AVX512 … which we had to learn the hard way, of course. As such, don’t do it.

Instead, the better way is to create an explicit array of active bits, and pass this the same way as any other array (ie, with “unmasked” enabled). Here’s an example:

/* foo.ispc */

extern "C" {
unmasked void fooCpp(float *uniform v  /* the data */, 
                     int *uniform active /* explicit active mask */,
                     int uniform N);
void fooIspc(varying float f)
   /* the same stuff we did above: */
   uniform float fArray[programCount];
   fArray[programCount] = f;

   /* now "construct" an active mask - mind the 'unmasked'!!!! */
   uniform int activeMask[programCount];
   unmasked { activeMask[programIndex] = 0; }
   activeMask[programIndex] = 1;


Once again, there’s one hidden “imporant caveat” in this example that is easily overlooked, and which I’ll therefore call out explicitly here: the second “unmasked {}” statement, when we construct the active mask array. Ie, in particular those two lines:

 unmasked { activeMask[programIndex] = 0; }
 activeMask[programIndex] = 1;

This may look particularly awkward, but is actually important: If we only used the second of those lines, the semantics of ISPC do not actually say what the values of the inactive lanes would be: Yes, the active lanes would be 1, but the inactive lanes may be something other than 0, which will, once again, give “funny results” on the C++ side. Even worse, in most cases these lanes will actually get initialized to 0, and work just fine – but as I said, there’s no guarantee for that, and I’ve seen cases where they weren’t …. so this one is the “correct” way of doing it: The first line tells ISPC to set all lanes to zero, becuase the “unmaksed{}” around the assignment forces all lanes to active, thereby all lanes will get set to 0. Then, the second statement will revert to only the active lanes, and set (only) those to 1, which is exactly what one wants.

One other observation in this: I’ve actually passed the active lanes as an array of ints rather than an array of bools – you can do either, of course, but ints are often easier to digest, so I always use ints.

Last “Trick”: Calling C++ virtual functions from ISPC …

Taken to the extreme, you can even use the above things to call “real” virtual functions (on the C++ side) from the ISPC side. Imagine, for example, that you have a base class “Geometry” with a virtual function “commit”, and a derived “TriangleMesh”, derived from Geometry:

/* in geom.cpp */

class Geometry { 
   virtual void commit() = 0;

class TriangleMesh : public Geometry { 
   virtual void commit() override { ... }

Now further assume, we have at some point created an instance of such a geometry (say, a triangle mesh) on the C++ side, and have passed a poitner to this geometry to the ISPC side, for example such as this:

void Scene::doSomethingWith(Geometry *geom) 
   ispc::doSomethingInIspc((void *)geom);

(note how to typecast “geom” to a “void *”, because ISPC doesn’t know anything about these classes).

Now further, let’s assume “somethign” on the ISPC side has this “void *” that we know points to a geometry, and wants to call this geometry’s virtual commit() method. Directly calling this isn’t possible, because ISPC of course doesn’t understand any of this. However, we can create a simple “hook” for it – on the C++ side – that will allow that, say:

/* in geometry.cpp */

extern "C" callGeometryCommit(void *ptr)
   Geometry *g = (Geometry *)ptr;

All this function does is de-ambiguate the void poitner back to a geometry, and call the respective method. And since it has extern C linkage an only plain C data types (void *), you can perfectly well call it from ISPC:

/* foo.ispc */

extern "C" {
unmasked void callGeometryCommit(void *unform);

void doSomethingInIspc(void *geometry)

…. and la-voila, we’ve had ISPC call a virtual function. Not so hard, is it? And if you do have a varying pointer of geometries, you can of course serialize this via foreach_unique, and call “callCommit” for each instance individually).

Of course, the downside of this method is that there’s two function calls – first the ISPC-to-C function call (which can’t be inlined), and then the actual virtual function, which again is a function call. So, some overhead … but definitely useful to be able to do it at all, so I do hope this’ll be useful to those that always wanted to mix and match ISPC and C++ more than is demonstrated in the simpler examples.

Anyway – this article turned out to be way longer than imagined – or intended – so for today that’ll be all. As said above I do hope it’ll be useful reading for all those that do set out to do some more non-trivial stuff with ISPC. Much of the above took some experimentation to figure out – but we do use all of those patterns throughout OSPRay, and so far they work fine. As such: Hope this has been helful – and as always: feel free to comment if I’ve done something wrong, or un-necessarily hard ….