ISPC Bag of Tricks Part 3: Addressing 64-bit arrays in 32-bit addressing mode

Some time last week or so I wrote a first article on the importance of using ISPC’s 32-bit addressing mode  whereever and whenever possible, because it is so much more friendly to the underlying CPU architecture …. and as such, can generate way faster code.

The caveat, of course, is that the times of 32-bit address spaces is over (by about a decade or two!), and most “serious” applications today can expect to be asked to handle many gigabytes of data. As already mentioned in this previous article, “in most cases” this is not actually a problem, because “32 bit mode” in ISPC only means that varying accesses are actually 32-bit varying offsets relative to a 64-bit base pointer – so as long as all your individual arrays are smaller than 32-bits, the sum of all such arrays can very well exceed the 4GB barrier (well, 2GB, to be exact, due to sign) without any issues. In more practical words: Assume you have a triangle mesh with order a million vertices, such that this mesh’s vertex array is about 10MB large.

Now further assume I read a scene that contains a thousand of such meshes, then I’ll end up with a total of one thousand such vertex arrays of a total of about 16GB – well beyond the 2GB mark, but still perfectly fine as long as all memory accesses are always relative to each mesh’s own vertex array… which will work our just fine in practice, without doing anything else. So one mesh with 1 billion vertices (16GB) would have led to crashes when the varying array index overflows – but 100 meshes of 10 million vertices each would work just fine. That is, in fact, exactly what we did in OSPRay for a long, long time:  we were rendering (many-)gigabyte sized scenes all day, in pure 32-bit addressing mode, and it just worked our fine, because typically the many gigabytes of data were always split over multiple arrays that were each smaller than 2GB.

But what if I have individual arrays larger than 2GB?

In practice, however, “usually work out OK” is just another term for saying “every now and then it won’t”, which isn’t all that great. So how can we safely handle cases where individual arrays are larger than 2GB?

One option, of course, is to recompile in 64-bit mode – but that has to be done for the entire project, so will come with a huge performance penalty even for the majority of cases where we would not need it.

The other option that we eventually ended up using (a lot!) in OSPRay is what we called “segmented” 64-bit addressing: Thought he array actually is a single array, we can logically view it as if it actually consisted of several smaller arrays – “segments” – placed right next to each other in memory, and manually serialize accesses into those segments. That sounds complicated, but isn’t – here an example.

First, assume we have – and this is a real example from OSPRay – a “Sphere Mesh” geometry that contains a set of N spheres, specified as a data array with N structs that each looked roughly like this:

struct Sphere {
   vec3f origin;
   float radius;
   vec3fa color;

… and the sphere geometry then looks roughly like that:

struct SphereGeometry {
    Sphere *uniform sphereArray;
    int numSpheres;

… and then something like this:

vec3f getSphereColor(Sphere *uniform sphere, varying int primID)
{ return sphere[primID].color; }

Note, in particular, that this is a real-world example from OSPRay – somewhat simplified, but mostly very close to this example (in fact, the first case where we ran into this issue!). Now of course, we had well documented that each geometry can hold only 2 billion geometries, because the primitive ID is a 32-bit int, as is ‘numSpheres’. However, users were getting core dumps well below this “2 billion spheres” limit, because the


expression – in 32-bit mode – evaluates to a 64-bit base (“sphere”) with 32-bit offsets (“primID*sizeof(Sphere)”), and since each sphere is 32 bytes large, the latter gets a 2GB overflow way before reaching 2 billion spheres (64 million spheres, in this example).

So, how to fix it? Assume we see our sphere array as made up of “segments” of at most 1 million spheres – in this case, each such segment can address “its” spheres with 32-bit offsets just fine. Also, for each primID we can easily compute both the segment that this primitive is in, where – in 64-bit memory – this segment begins, and what this sphere’s offset is within this segment. Now with that, all we have to make sure is that we serialize across the different segments, which we can do through a foreach_unique, as such:

vec3fa getSphereColor(Sphere *uniform sphereArray, int primID) 
   vec3fa returnValue;
   varying int segmentID     = primID / (1024*1024);
   varying int segmentOffset = primID % (1024*1024);
   foreach_unique(uniformSegID in segmentID) {
      Sphere *uniform segmentBase = sphereArray+uniformSegID;
      returnValue = segmentBase[segmentOffset].color; 
   return returnValue;

Now in order to understand why this works, let’s have a closer look: We first compute, in each lane (ie, in varying), the ID of the segment, and the offset therein. We then iterate over all the unique segments in this ‘segmentID’ value – and since that’s exactly how “foreach_unique” is defined, the unique values it then iterates over are uniform values: i.e., in each iteration, we iterate over a uniform segment ID. And because of this base address of the segment that we then compute in this body is uniform, too, which means is a plain, scalar, 64-bit pointer – no offsets whatsoever, and thus no overflows anywhere. In the following line, then, we do a varying array access, but relative to a “safe” 64-bit base pointer, and with offsets that are guaranteed to be less than 32 million bytes, so totally safe.

What about performance?

Of course, this code is “a bit” more costly than a simpler array access that is guaranteed to fit within 32GBs …. but it’s not actually all too horrible: If the accesses are mostly coherent, chances are that only very few iterations of this look will be required, because coherent accesses will predominantly go to the same segment, and then be done in a single loop. And if accesses are not coherent, chances are you’ll have performance issues with TLB walks etc, anyway (which would, quite interestingly, internally serialize across TLB pages almost exactly the same way this loop does over segments!).

That said, there is of course some performance penalty of doing this. In OSPRay, we actually do flag meshes as to whether they require this fix or not, and then, in “postIntersect”, either to the “safe” gather, or a “fast” gather. But even without going through these lengths – doing this safe gather only in select places is certainly better than recompiling everything in 64-bit mode.

One final note: In OSPRay, we do use this technique in many different places, in various modifications … the core idea, however, is always the same.

Hope this is useful to anybody that wants to handle (at least occasionally) “large” data in thei ISPC programs – try it out, so far at least for us it’s been rather useful!

PS: As always : comments, corrections, improvements are welcome!

ISPC Bag of Tricks Part 2: On Calling Back-and-Forth between ISPC and C/C++

As mentioned previously I wanted to use this blog to share little nuggets on “how to do things” with tools such as ISPC – little things that are useful, but not easily publishable in any form other than a blog post.

In the first such article (on the ISPC front) I wrote a bit about the different aspects of addressing modes (and their caveats) in ISPC, which is something we constantly stumbled over / used in our OSPRay project. In this article, I’ll write a bit about something any non-trivial project will at some point stumble over: calling back-and-forth between ISPC and C++.

Typically, the reason we want to mix ISPC and C/C++ at all is that ISPC is good at expressing (low-level) parallel code for vectorization, but a little bit less good at building more complex software systems that nowadays use all the goodies of C++, such as templates, virtual functions, copious amounts of libraries (often heavily based on templates, too), etc… all things that ISPC – very unfortunately – lacks. As such, the typical set-up for most ISPC programs I’ve seen so far is that there’s a “main application” that does all the “big systems” stuff (I/O, parsing, data structure construction, scene graph, etcpp) in C++ – and smaller “kernels” of the performance-critical code written in ISPC. Obviously, those two parts have to communicate.

The obvious part: Calling ISPC from C/C++

Now the first, and most common, operation involved in that is having the C/C++ program call into ISPC, to launch and/or otherwise execute those kernels (I’ll say a bit more on “launching” vs “executing” below). Luckily, that part is easy, even has some language support (the “export” keyword), and is well documented: Basically, you declare an ISPC function as “export <function>”; have ISPC generate a header file with a C-callable function defitinition of that (the “-h” flag in ISPC invocation”; include the thus-generated header-file in your C/C++ code, and simply call that function.

Even with that simple way, there’s a few caveats: In particular, though ISPC can export the function declaration to C/C++, it can’t easily export all the parameter types that this function may expect as parameters – in particular if those parameters include compound types (structs), varying types (ugh, it really doesn’t like that), an implicit active mask, functions pointers with varying arguments, etcpp.

Example 1: Getting our feet wet

To start with, let’s look at a simple case, a ISPC kernel that operates on a float array (it doesn’t matter what it does). In this case, we’d simply declare the ISPC side (in foo.ispc) as

export void foo_simple(float *uniform A, uniform int N) { ... }

Once we compile this function with the “-h” argument, ISPC will emit a C callable function (with “extern C” linkage, decared in foo_ispc.h), with the following signature:

/* in foo_ispc.h */
namespace ispc {
  extern "C" void foo_simple(float *A, int N);

Obviously, calling this function from C++ is trivial. Worst that could happen – and usually does happen in many programs – is that the data on the C++ side isn’t actually stored as a plain C-array, but rather in a std::array or std::vector, but even that can easily be fixed with a bit of typecasting:

/* in foo.cpp */

#include "foo_ispc.h"

void fooOnVec(const std::vector<float> &vec) 
   ispc::foo((float *)&vec[0],(int)vec.size());

Of course, any C++ purist with a drop of blood left in his veins will cry out at this abuse of C++ : A std::vector should be opaque, we shouldn’t be assuming that it internally stores this as a float array, just getting the base pointer by taking the address of the first element is devil’s work, and don’t even get me started on casting a pointer to a const vector’s internal storage to a non-const float array. But hey, this is about performance, not cleanliness, so I’ve intentionally chosen this example just to make this point: In many cases you will have to type-cast from “better” C++ types to something that’s more easily digestible in ISPC. Of course, the cleaner way would have been to first copy from that std::vector to a temporary array, and call that kernel on that array – but calling into ISPC is all about speed, so there we are (and if speed is not the goal: just write all the code in C++ to start with!).

Example 2: Structs as function arguments

The reason the previous example was so simple is that it used only built-in types like “float” and “<pointer>” (and those in purely uniform variability) that ISPC can map one-to-one into the generated header file. For compound types and/or varying it’s a little bit more tricky, because typically ISPC and C++ can not include the same header files (except in trivial example), and the compound types that ISPC exports are not always very useful on the C++ side. For example, consider making out example a little bit more tricky:

/* in foo.ispc */

/* declare a three-float vector type */
struct vec3f { float x, y, z };

/* export a kernel that operates on an array of those */
export void foo(uniform vec3f &v) { .. }

In this example, all parameters are still purely uniform; however, there’s already a struct type involved. Now ISPC can already export this struct (generating a “ispc::vec3f” struct with float members x,y,z as above), and the C++ code can already use this. But generally, this “pure C99” struct wouldn’t be much use in C++ (typically a C++ version of a vec3f would have all kinds of methods associated with it that we can’t add in ISPC), so more likely, we’ll also be using some “class vec3f” on the C++ side. Nevertheless, if we do design our ISPC and C++ classes such that they do have exactly the same memory layout (e.g., in this example, as “three floats, called x, y, z, right after each other”), then we can once again simply typecast:

/* in foo.cpp */

#include "foo_ispc.h"

namespace cpp { 
   class vec3f {
      /* all kinds of C++ goodies */
      float x, y, z;

   void fooPlusPlus(const ours::vec3f &vec) {
     /* just typecast the reference type, and call ISPC */
     ispc::foo((const ispc::vec3f &)vec);

And lo and behold, everything works as expected. Again, a C++ purist might object – but that’s the kind of pattern we’re using in OSPRay all over the place, and so far it’s been very useful. Do note, though, that we do typecast the reference types – this basically tells the compiler that it doesn’t even have to know how to “convert” from one type to the other, and that it’s simply a pointer for which we guarantee the right underlying data layout.

The caveat, of course, is exactly what every C++ purist will cry out about: There is no type-guarantees in that at all that the compiler can even give you a warning about: If you change the order of the members on the C++ side without also doing that on the ISPC side (or vice versa), or simply add a member on one side but not the other … or do anything else that changes the memory layout on one side but not the other …. well, then you’ll get funny results. Be warned.

Example 3: What about varying argument types?

Oh-kay, in the previous examples we’ve passed only uniform types, but what about varying ones? Answer: generally speaking, you won’t need it. This sounds counter-intuitive (after all ISPC is all about varying types), but isn’t: typically it’s only the ISPC side that “creates” the varying types, so if anything you’d want to pass a varying type from ISPC to C++ (see below), but not the other way around.

Now if you still think you have to do that, you’ll have to go the way of arrays – ISPC will simply refuse to emit an export header for anything that contains an actual varying type. For the curious: There’s actually no reason it can’t do that – my own version of a SPMD compiler (called IVL, and at one point ISPC++) could actually do that, by exporting a struct of appropriate size :-/ …. but it’s simply not implemented in ISPC, so don’t even try it. Unfortunately, the same goes even with “indirect” references to varying types. For example, in OSPRay we make heavy use of “poor mans C++ in ISPC” with function pointers and stuff – and though the classes we use – and the function pointers therein – are perfectly uniform every time we would even want to export them, the uniform function pointers inside those uniform classes often have varying parameters, so currently ISPC can’t export them (IVL/ISPC in these cases wouldn’t even have emitted the varying types themselves, only forward declarations). Either way – don’t export varying types, or those with function pointers with varying arguments, and you’ll be fine – and as said above, in 99.9% of all cases, you probably don’t need this, anyway.

Now, how about the other way around? Calling C/C++ functions from ISPC?

While the above way of calling from C/C++ into ISPC is pretty well documented, significantly fewer people realize that the other way around works perfectly well, too: ISPC obviously can’t directly call any C++ functions – it would have to understand all the mangling (and in almost all cases, C++ types) to do so …. but what it can do perfectly well is call functions with extern C linkage, no matter what language was used to generate them. I.e., you absolutely can do the following:

/* in foo.cpp */

extern "C" void someCppMagic()
   std::ifstream .... /* whatever C++ code you want */

Then, once declared as “extern C” in ISPC, we can call this function:

/* in foo.ispc */

/* _declare_ the C++ function (with extern C linkage */
extern "C" {
   unmasked void someCppMagic();

/* now call from another ispc function */
void myIspcKernel(varying ....) 

Now in this example, there’s two little details that aren’t immediately obvious, but important: First, on the ISPC side we have used a

extern "C" { 

rather than the simpler

extern "C" ...

… and though that looks mondane, it’s actually important, since the ISPC parser can digest the former, but – for whatever reason – can’t digest the latter. So using the former is more cumbersome, but actually important to make it work.

Second, you may or may not have noticed the “unmasked” keyword in the ISPC declaration of “someCppMagic()”. In this particular example this is actually superfluous, but in some more complex ones it isn’t, so let me explain: In ISPC, all the “varying” control flow is handled implicitly, meaning ISPC keeps track which “programs” (lanes) in a “program group” (aka warp, work group, SIMD vector, etc) are active or not. It does that fully implicitly, without the user seeing it – but it obviously does have to pass this information between (non-inlined) functions, which it does by always putting an additional, completely implicit, parameter on the stack of each function call (and of course, that parameter contains the active mask).

Now in our example the C++ side doesn’t expect any parameters, so even if it was on the stack it wouldn’t hurt, since the C++ side would simply ignore it. If the function did have any paramters, however, putting the “unmasked” keyword is actually pretty important, since the C++ compiler would assume one sort of data on the stack (i.e., no active mask), but ISPC would put both the “real” arguments and the active mask there… and that can lead to rather funny results (believe me, we did stumble over that in both OSPRay and Embree ….).

Now, what about passing stuff?

Now just like in the “C++ to ISPC” example, the real fun starts once we want to pass arguments from ISPC back to C. In its simplest form, we can do this by simply using foreach_active to serialize over all active programs, and calling a (scalar) C/C++ function with it:

/* foo.ispc */

extern "C" {
unmasked void fooScalarCpp(uniform float f);

void myIspc(varying float f) 
   foreach_active(uniform_f in f) 

And yes, that’ll work perfectly well.

More often, however, you actually do want to pass varying data to C/C++, so let’s talk about how to do that. Just as in the C-to-ISPC example, uniform basic types are trivial, as are pointers/references … and varying ones are.

First, the simple stuff – varying built-in types

For built-in types (float, int, …) the easiest way of passing them in varying form is actually as arrays:

/* in foo.ispc */

extern "C" {
unmasked void myCppFunc(float *uniform, uniform int);

void myIspcFunc(varying float &f)
  myCppFunc((float *uniform)&f, programCount)

… and everything works fine.

Now what about compound types?

Now where it does get tricky is varying compound types, because these get internally stored in struct-of-arrays (SoA) form, but must typically be converted to array-of-structs (AoS) before C++ can digest them. For example, the typical way to do so is the following:

extern "C" {
unmasked void myCpp(vec3f *uniform, uniform int);

void myIspc(varying vec3f &v)
   uniform vec3f vInAos[programCount];
   vInAos[programIndex] = v;

In this example, we first use ‘programIndex’ and ‘programCount’ to transform from SoA to AoS, then call the C++ side function with this array of (uniform) structs, just like above (note that we do not have to typeast when calling from ISPC to C++, because we did the declaration of the function with ISPC types … there is no “foo_cpp.h” to include).

Of course, just like we converted data on the way “out” to C++, so we’ll have to do once again if that C++ function modified that data. Also “of course”, this conversion is pretty horrible performance-wise, so “use it wisely” is all I can say.

What about active/inactive lanes?

One thing all the previous examples had in common is that they all assumed that the C++ function would always want to process all the lanes, or at least, that it was oblivious of the fact that some of the program instances might not actually be active – in fact, I even harped on the importance of the “unmasked” keyword to intentionally not pass any active masks. In practice, however, we often do have to pass information which lanes are vs are not active… so how to do that?

The first thing that comes to mind is to not use the “unmasked” keyword on the ISPC side, and then, on the C++ side, simply declare this additional parameter explicitly – after all, ISPC will put this “varying bool” on the stack, so if we only declare it properly, the C++ side can use it. Only one answer: Don’t do it. Of course, we actually did at first, and it does in fact “kind of” work – for a while. However, ISPC doesn’t guarantee the form that this parameter takes on different ISAs, so code that may work perfectly fine on AVX may lead to “funny results” when run on AVX512 … which we had to learn the hard way, of course. As such, don’t do it.

Instead, the better way is to create an explicit array of active bits, and pass this the same way as any other array (ie, with “unmasked” enabled). Here’s an example:

/* foo.ispc */

extern "C" {
unmasked void fooCpp(float *uniform v  /* the data */, 
                     int *uniform active /* explicit active mask */,
                     int uniform N);
void fooIspc(varying float f)
   /* the same stuff we did above: */
   uniform float fArray[programCount];
   fArray[programCount] = f;

   /* now "construct" an active mask - mind the 'unmasked'!!!! */
   uniform int activeMask[programCount];
   unmasked { activeMask[programIndex] = 0; }
   activeMask[programIndex] = 1;


Once again, there’s one hidden “imporant caveat” in this example that is easily overlooked, and which I’ll therefore call out explicitly here: the second “unmasked {}” statement, when we construct the active mask array. Ie, in particular those two lines:

 unmasked { activeMask[programIndex] = 0; }
 activeMask[programIndex] = 1;

This may look particularly awkward, but is actually important: If we only used the second of those lines, the semantics of ISPC do not actually say what the values of the inactive lanes would be: Yes, the active lanes would be 1, but the inactive lanes may be something other than 0, which will, once again, give “funny results” on the C++ side. Even worse, in most cases these lanes will actually get initialized to 0, and work just fine – but as I said, there’s no guarantee for that, and I’ve seen cases where they weren’t …. so this one is the “correct” way of doing it: The first line tells ISPC to set all lanes to zero, becuase the “unmaksed{}” around the assignment forces all lanes to active, thereby all lanes will get set to 0. Then, the second statement will revert to only the active lanes, and set (only) those to 1, which is exactly what one wants.

One other observation in this: I’ve actually passed the active lanes as an array of ints rather than an array of bools – you can do either, of course, but ints are often easier to digest, so I always use ints.

Last “Trick”: Calling C++ virtual functions from ISPC …

Taken to the extreme, you can even use the above things to call “real” virtual functions (on the C++ side) from the ISPC side. Imagine, for example, that you have a base class “Geometry” with a virtual function “commit”, and a derived “TriangleMesh”, derived from Geometry:

/* in geom.cpp */

class Geometry { 
   virtual void commit() = 0;

class TriangleMesh : public Geometry { 
   virtual void commit() override { ... }

Now further assume, we have at some point created an instance of such a geometry (say, a triangle mesh) on the C++ side, and have passed a poitner to this geometry to the ISPC side, for example such as this:

void Scene::doSomethingWith(Geometry *geom) 
   ispc::doSomethingInIspc((void *)geom);

(note how to typecast “geom” to a “void *”, because ISPC doesn’t know anything about these classes).

Now further, let’s assume “somethign” on the ISPC side has this “void *” that we know points to a geometry, and wants to call this geometry’s virtual commit() method. Directly calling this isn’t possible, because ISPC of course doesn’t understand any of this. However, we can create a simple “hook” for it – on the C++ side – that will allow that, say:

/* in geometry.cpp */

extern "C" callGeometryCommit(void *ptr)
   Geometry *g = (Geometry *)ptr;

All this function does is de-ambiguate the void poitner back to a geometry, and call the respective method. And since it has extern C linkage an only plain C data types (void *), you can perfectly well call it from ISPC:

/* foo.ispc */

extern "C" {
unmasked void callGeometryCommit(void *unform);

void doSomethingInIspc(void *geometry)

…. and la-voila, we’ve had ISPC call a virtual function. Not so hard, is it? And if you do have a varying pointer of geometries, you can of course serialize this via foreach_unique, and call “callCommit” for each instance individually).

Of course, the downside of this method is that there’s two function calls – first the ISPC-to-C function call (which can’t be inlined), and then the actual virtual function, which again is a function call. So, some overhead … but definitely useful to be able to do it at all, so I do hope this’ll be useful to those that always wanted to mix and match ISPC and C++ more than is demonstrated in the simpler examples.

Anyway – this article turned out to be way longer than imagined – or intended – so for today that’ll be all. As said above I do hope it’ll be useful reading for all those that do set out to do some more non-trivial stuff with ISPC. Much of the above took some experimentation to figure out – but we do use all of those patterns throughout OSPRay, and so far they work fine. As such: Hope this has been helful – and as always: feel free to comment if I’ve done something wrong, or un-necessarily hard ….


Float atomics in OpenCL …

OpenCL – the final frontier …. uh… not really, but still: Just stumbled over a situation where I needed some atomic min/max operations for floats (which OpenCL apparently doesn’t have yet?!)… Now I’m sure a lot of people will have way better solutions than what I went with, but since even googl’ing didn’t produce an easy hit I wanted to at least put my solution out here.


Some time last week I finally started doing some experimenting with building BVHs in OpenCL – not exactly my favorite language, but at least one that works across a broad spectrum of hardware architectures. I had written some OpenCL programs before (one of them – wait! – a ray tracer!), but so far had always done only the simple stuff (ie, traversal and shading) on the OpenCL side, and left BVH builds to the host.

Now when playing with BVH build, I  quickly realized that OpenCL – at least, without vendor specific extensions – still doesn’t have floating point atomics… in particular, not the floating point min/max’es that I had wanted to use when different work groups were collectively computing the scene’s bounding box: yes, there’s atomics for integer min/max, but not for floats. Really? In the 20th century?

First attempt to work around that was to simply bit-cast the float to and int, do an integer min/max, and bitcast back – but that gets a bit tricky in the presence of special values (NaN, inf, etc – which your BVH builder shouldn’t actually have to be able to deal with), as well as for “negative zero” values (which unfortunately do happen in practice).


However, there’s also a very simple solution using just compare-exchange (“cmpexch”), which luckily is available for floats: Simply read the value, do a non-atomic max with the new value, and use cmpexch to write this value back: If the value returned from cmpexch is smaller or equal than “our” local max, then we’re done; but if there was a race condition where another thread wrote a larger value than what we read then cmpexch will return that – we can then do another max with this new value, rinse and repeat, done:

inline void atomic_maxf(volatile __global float *g_val, float myValue)
  float cur;
  while (myValue > (cur = *g_val))
    myValue = atomic_xchg(g_val,max(cur,myValue));

Pretty simple, actually. Probably pretty expensive, but that’s kind of expected for atomics, anyway, so this should be used as a matter of last resort, anyway.

If anybody has a better solution: Feel free to comment! As said above, OpenCL is not exactly my favorite language, so I’m sure there’s better/more elegant solutions out there …

PS: Yes, atomic add/sub/mul is a bit more tricky, but luckily I don’t need this for BVH builds, so won’t go into that here. At some point I might – just for the fun of it – but for now 🙂


ISPC Bag of Tricks, Part 1: Addressing…

Hm; took a while longer to find the time to actually write some (hopefully?) useful content, but here we go: As promised in the blog’s intro, I’ll try to use the blog – among other purposes – as a means of sharing some experiences and “bag of tricks” that other media such as twitter or papers would be badly suited for. In particular, I feel we (ie, not just I, but the entire embree/OSPRay teams) have accumulated quite a bit of experiences/know-how on the entire front of “how to get performance out of CPUs” in general, and on “using ISPC to do that” in particular. As such, there’s actually a lot of different little topics I plan on eventually writing about re out way of using ISPC, which I’ll refer to as the “ISPC Bag of Tricks” series …

As a first installment, I want to write a bit more about “addressing” in ISPC, in particular about why addressing is important, what we do in OSPRay re this topic, and why,,, and in particular, what this infamous “64-plus-32”-addressing is that I sometimes talk about. Of course, this article assumes at least some very basic familiarity with ISPC – if you don’t yet have a clue what I’m talking about, I suggest starting at ISPC’s github overview page…. and of course, there’s a ton of ISPC related material on Matt Pharr’s blog as well.

Why “addressing” is such a big deal in ISPC…

So, let’s start slowly, and first talk about why addressing is actually such a big deal with ISPC (at least if your goal is performance – which I assume as a given if you started looking at ISPC in the first place).

To fully understand this, you’ll have to a little bit closer look at both how ISPC internally works, and, in particular, on the CPU Instruction Set(s) that ISPC will eventually have to map its instructions to. Don’t worry, I won’t go into crazy details – hopefully just enough to make the point.

First, before talking about instruction sets, let me briefly summarize one key concept of ISPC that will eventually be behind all of what I’m talking about below: the key idea of ISPC is that it’s a “SPMD” (single program, multiple data) language in which the compiler executes a given “single” program on “multiple” data instances – same as work items in OpenCL, threads on CUDA, etc. In particular, the key idea is to have each “lane” of the SIMD registers execute a different program instance, and use masking/predication to emulate control flow (a bit more complex than that, but not much). As such, if your SIMD units consist of 8 “lanes” that are always operated on in parallel, then you have 8 such programs running in parallel, with with first such program’s variables all in lane 0 of the SIMD registers, the second program’s variables in lane 1, etc.

So far, so good. That concept is used in at least a dozen languages (and actually what many auto-vectorizers use, too), it works pretty well (for a lot of cases, at least), and conceptually is very simple – which unfortunately is just another way of saying that the devil is in the details. In particular, there’s two problems that continually haunt the “code generation” phases.

Problem 1: “Missing” Vector Instructions

First, vector/SIMD instruction sets are often not “complete”, in the sense that there are often some instructions that a CPU can execute in scalar code, but for which there is no direct equivalent in SIMD form. For example, a AVX2 capable CPU can do a int32+int32 addition in both scalar (add) and SIMD (vaddpi) form – where the vector form just does 8 of those int32 adds in parallel, as you’d expect – but whereas it can do a int64/int64 division in scalar instructions (divq) there is no corresponding vector instruction for that in AVX. As a result, some operations are way more costly to do in “varying” form than you’d expect them to be when you’re used to from purely scalar code. That’ll apply to many kinds of operations, and some of those are related to addressing and memory operations, unfortunately – I’ll come back to that in a sec.

Problem 2: What are lanes, actually?

The second big problem in this SPMD paradigms (at least on SIMD CPUs) is that it implicitly assumes this concept of “each register is made up of N lanes” – and yes, many people (me included) often speak of it in exactly this way, as if, for example, a AVX2 register (or even a AVX2 CPU) was actually “8 wide”. The problem, with that conceptual view, however, is that it depends on the data type: SIMD registers do not always have “N” lanes of everything; instead, they have a fixed number of bits. A AVX2 register, for example, is 256 bits wide – and though that does indeed fit 8 int32s, 8 float32s, etc, it will only fit four int64s. Now of course, you can simply use two such registers for int64s – thus being logically back to “8 wide” – but even if we do this, we now have to suddenly think of which bits each of these 8 lanes actually actually map to in these two registers: For 32-bit types, lane 0 is always bits 0..31, and lane 1 is bits 32…63, etc; but if for a 64-bit type lane 0 is now in lanes 0…63 and lane 1 in bits 64…128 … then getting the different types to play nicely with each other becomes a bit of an issue, because you’re constantly using shuffles and stuff to move data values between where different-width types map their lanes to.

(Note to the compiler-savvy: All this can be somewhat mitigated by realizing 64-bit types differently (8 times lower 32 bits in one register, and 8 times upper 32 bits in another), but that has issues, too, because the instruction set has to then support operations that are split across different registers…. which of course they often don’t, see problem 1. Chicken and egg, anybody?)

Back to addressing…

Now, with these two issues in mind, let’s have another look at addressing. The problem here is that the most common data types in most programs (in particular in graphics) are int32 and float32, so most naturally you want ISPC to use a mapping of 32-bit types; ie, in the AVX2 example you’d indeed want each register be treated as 8 lanes of 32-bit types wherever possible (and yes, that’s exactly what it does).

Unfortunately, however, while most computations are on 32-bit types, it’s become pretty standard to assume 64-bit address spaces, as even the laptop I type this on has more than 4GB of memory. As such, all address computations today are typically assumed to be done in 64-bit…. and though that is perfectly fine for scalar code – where CPUs have lots of instructions to help with 64-bit addressing (hey, you can even do 64-bit indirect addressing with scale in the instruction operands!), as soon as you have anything even as trivial as a varying array index the compiler now has to do all the actual address computations in explicit vector arithmetic (because they’re different for each lane, and because things like “lea” or (vector-)register-indirect operands are not generally available for vectors, see “problem 1” from above).

Even worse, if we assume a 64-bit addressing mode, all these operations have to be done in 64-bit, which we’ve just argued above creates a lot of issues with moving data around. Alternatively, if you can tell ISPC that you’ll be fine if all addressing is done in 32 bits, then you’ll still have some overhead from “problem 1”, but at least all the 64-bit arithmetic – and going back-and-forth between 32-bit and 64-bit types – is going way … well, mostly. And that, exactly, is what ISPC’s “–addressing={32,64}” flag is about.

Now really, how important is 32-bit addressing?

Now if you’re like most people – that is, you’re used to mostly scalar code – you’ll probably have a hard time believing that this entire addressing is indeed such a big issue – after all, it works just fine in scalar code, so how hard an it be, really?

To illustrate this point, I’ve created a tiny little example, and run it through ISPC once with “–addressing=32”, and once with “–addressing=64”. Here’s the tiny little sample function:

extern float foo(float *uniform array, int uniform scale, int varying idx)
 return array[scale*idx];

…and here’s the makefile (yes, good old gmake!):

all: avx2_32.s avx2_64.s avx1_32.s avx1_64.s

avx2_32.s: test.ispc
 ispc -O3 -o $@ $< --emit-asm --addressing=32 --target=avx2-i32x8

avx2_64.s: test.ispc
 ispc -O3 -o $@ $< --emit-asm --addressing=64 --target=avx2-i32x8

avx1_32.s: test.ispc
 ispc -O3 -o $@ $< --emit-asm --addressing=32 --target=avx1-i32x8

avx1_64.s: test.ispc
 ispc -O3 -o $@ $< --emit-asm --addressing=64 --target=avx1-i32x8

As you can see, this example is trivial (the makefile is larger than the code!), but it already illustrates the point: First, let’s look at the AVX2 version of the code that’s being generated, assuming 32-bit addressing (some header and footer gunk removed):

 vmovd %esi, %xmm2
 vpbroadcastd %xmm2, %ymm2
 vpmulld %ymm2, %ymm0, %ymm0
 vpslld $2, %ymm0, %ymm2
 vpxor %ymm0, %ymm0, %ymm0
 vgatherdps %ymm1, (%rdi,%ymm2), %ymm0

… and here’s the 64-bit variant:

 vmovd %esi, %xmm2
 vpbroadcastd %xmm2, %ymm2
 vpmulld %ymm0, %ymm2, %ymm0
 vpmovsxdq %xmm0, %ymm2
 vextracti128 $1, %ymm0, %xmm0
 vpmovsxdq %xmm0, %ymm0
 vpsllq $2, %ymm2, %ymm2
 vextractf128 $1, %ymm1, %xmm3
 vxorps %xmm4, %xmm4, %xmm4
 vgatherqps %xmm1, (%rdi,%ymm2), %xmm4
 vpsllq $2, %ymm0, %ymm0
 vxorps %xmm1, %xmm1, %xmm1
 vgatherqps %xmm3, (%rdi,%ymm0), %xmm1
 vinsertf128 $1, %xmm1, %ymm4, %ymm0

As you can see, even for this tiny example the number of instructions more than doubles! And yes – that was a tiny example; it’ll get even funnier with large ones.

Now that example was only on the 64-bit arithmetic. If you also want an even more crass illustration of how the “instruction set” issues can play into this, I suggest you have a look at the code emitted for AVX1 (or even better, SSE) …. the avx1_64.s version is 55 lines for this example.

Though I do know that the example was too trivial to be much representative, I do hope it gives a rough idea. How big the impact is going to be in practice will depend on many factors: First of all, on how much data I/O you have relative to the compute you’re doing with it, on whether you can live with 32 bit addressing, and finally, how much you can “massage” the loads/stores to be more vector friendly (see below).

As a single – probably not representative, and certainly not up to date – number of something more complex, I point to a somewhat outdated anecdote: A while back we changed OSPRay to use 64-bit addressing rather than default 32-bit addressing …. just to test it out … and impact on total runtime was more than 2x. I have no idea if that is still the case – and most certainly it’ll depend on the actual renderer, data set, CPU type, etc – but either way, addressing overhead shouldn’t be neglected.

Now, based on all this long discourse on addressing in vector arithmetic, a few tips that can hopefully help:

Tip #1: Use 32-bit addressing if at all possible

Most obviously, use 32-bit addressing wherever you can. Unfortunately, ISPC doesn’t allow to specify that on a per-function basis, so it’s “all 32” or “all 64” (this is because the addressing flag changes how pointers are represented internally, so you can’t mix and match different functions any more).

Of course, the downside to 32-bit addressing is that you shouldn’t use it on anything that requires large, 64-bit offsets. On the upside, it is important to realize that in ISPC, (varying) pointers are usually realized as a 64-bit scalar base address with a 32-bit offset. So “32 bit addressing” doesn’t mean that it’ll only ever run in a 32-bit windows build, or that you can never allocate more than 4GB of memory – all it means is you can’t ever access into an array with 64-bit offsets. I.e., having 16 separate 1GB arrays each is “generally” no problem, but one single 16GB array is.

In practice – e.g., in OSPRay – that usually means you compile in 32-bit mode, and make sure that the above restrictions are obeyed in your code. And of course, that wherever you do have to do 64-bit indexing you’ll solve it in another way – I’ll talk about that later.

Tip #2: do uniform addressing wherever you can!

Another thing you should take home from the above examples is how important it is to do the addressing right, because if you don’t, it can cost a lot. Aside from the 32- vs 64-bit addressing, one thing that I cannot hammer home often enough is the importance of telling the compiler that something you know will be uniform actually is uniform.

For example, let’s take the above example, and be a little bit lazy, in that we just leave out the “uniform”s for the parameter declarations. I.e., we now have two functions:

extern float foo(float *uniform array, int uniform scale, int varying idx)
{ return array[scale*idx]; }

extern float lazy_foo(float *array, int scale, int idx)
{ return array[scale*idx]; }

Both functions of course do exactly the same thing, and the reasons I called the second one the “lazy” one is that this is exactly what a “lazy” programmer (or one that wouldn’t  know better) would write. In fact, I’ve seen tons of such examples in the past, and it’s particularly tricky because the lazy function does exactly what the “fast” one does, so it’s where most people stop worrying – it’s doing its job, it’s in ISPC, so it’s probably fast, so I’m done, right ?

Now, let’s take a look at the generated asm (only for avx2, 32 bit addressing): For the original one, we still have

 vmovd %esi, %xmm2a
 vpbroadcastd %xmm2, %ymm2
 vpmulld %ymm2, %ymm0, %ymm0
 vpslld $2, %ymm0, %ymm2
 vpxor %ymm0, %ymm0, %ymm0
 vgatherdps %ymm1, (%rdi,%ymm2), %ymm0

… but the lazy one – doing exactly the same – now looks like this:

 vpmulld %ymm3, %ymm2, %ymm2
 vpslld $2, %ymm2, %ymm2
 vpmovsxdq %xmm2, %ymm3
 vextracti128 $1, %ymm2, %xmm2
 vpmovsxdq %xmm2, %ymm2
 vpaddq %ymm0, %ymm3, %ymm0
 vextractf128 $1, %ymm4, %xmm3
 xorl %eax, %eax
 vxorps %xmm5, %xmm5, %xmm5
 vgatherqps %xmm4, (%rax,%ymm0), %xmm5
 vpaddq %ymm1, %ymm2, %ymm0
 vpxor %xmm1, %xmm1, %xmm1
 vgatherqps %xmm3, (%rax,%ymm0), %xmm1
 vinsertf128 $1, %xmm1, %ymm5, %ymm0

Now does this look just as bad as the 64-bit variant? Well, that is because it mostly is: The compiler no longer knows when it can do 32-bit offsets to a common base pointer, nor does it know which operations are scalar, etc. And again, that was the easy example.

Oh, and just for reference: If we knew that ‘idx’ was also a uniform variable, and we’d tell the compiler that, too, then that’s what would come out instead:


extern uniform float alluniform_foo(float *uniform array, int uniform scale, int uniform idx)
{ return array[scale*idx]; }


 imull %edx, %esi
 movslq %esi, %rax
 vmovss (%rdi,%rax,4), %xmm0 # xmm0 = mem[0],zero,zero,zero

Quite a difference, eh?

Now of course, telling the compiler what is uniform and what isn’t requires some additional knowledge about the program. BUT if you don’t pass this information even where you have it, the compiler can’t do any better than assume the worst, and you’ll get “sub optimal” performance. So, “some discipline required”, unfortunately, at least if you want to get good performance. On the upside, ISPC at least allows polymorphism where you can “overload” the same function with different uniform/varying combinations of the parameters, which helps a lot in that ISPC can then, at the call site, figure out which version to best use. But again, this requires that the programmer tell the compiler that there’s different versions, so “some discipline required”.

Tip #3: “foreach_unique” to uniform’ize (where applicable)

One particularly useful trick – that we use a lot in OSPRay, but which seems not well known outside of OSPRay – is to use the “foreach_unique” statement to “uniformize” otherwise varying expressions.

To re-cap:

 foreach_unique (uniformValue in varyingVariable) { ... }

executes over a varying variable ‘varyingVar’, and executes the body (with the right masks) exactly once for each unique value in that varying variable. Now if that varying variable usually has all-different fields, anyway, then all this achieves is serializing the computations, which in general isn’t all such a good idea….

BUT – and here’s the trick – if we know that varyingVariable will usually have but few different values – then we can use this to “trade” multiple iterations for some operations becoming uniform/scalar. For example, assume you write a ray tracer (say, OSPRay), and each lane can have hit a different geometry, shader, etc …. but where, at least in most cases, the rays will all have hit the same geometry, or only two or three. In that case, if we use the varying geometry ID in any address computations all computations will become varying (which is usually expensive, see above)… but if we first foreach_unique over the different geometry IDs we may have multiple iterations over that body, but many operations in this body will be uniform (ie, scalar), which will often be way cheaper. So, if there’s not too many of those iterations, and the savings are big enough, it can well pay off.

As an example, consider the following – horribly trivialized – example of having a “two level” scene with a list of “geometries” that each have “primitives”, and where we now have to access a given primitive in a given geometry: In its trivial form, this looks like this

struct Geom 
{ float prim[100]; };

extern float foo(Geom **uniform geom, int geomID, int primID)
{ return geom[geomID]->prim[primID]; }

Now if we use foreach_unique, we can also write the latter as

extern float foo(Geom **uniform geom, int geomID, int primID)
 float ret;
 foreach_unique(uniGeomID in geomID) 
    ret = geom[uniGeomID]->prim[primID];
 return ret;

Yes, it’s a bit more complicated but not that much, is it? The key, though, is to look at the generated asm code: For the above (simple) example this looks as this:

 vextractf128 $1, %ymm2, %xmm3
 vpmovsxdq %xmm3, %ymm3
 vpmovsxdq %xmm2, %ymm4
 vextracti128 $1, %ymm0, %xmm5
 vpxor %ymm6, %ymm6, %ymm6
 vpxor %ymm7, %ymm7, %ymm7
 vpgatherdq %ymm4, (%rdi,%xmm0), %ymm7
 vpgatherdq %ymm3, (%rdi,%xmm5), %ymm6
 vpslld $2, %ymm1, %ymm0
 vpmovsxdq %xmm0, %ymm1
 vpaddq %ymm1, %ymm7, %ymm1
 vextracti128 $1, %ymm0, %xmm0
 vpmovsxdq %xmm0, %ymm0
 vextractf128 $1, %ymm2, %xmm3
 xorl %eax, %eax
 vpxor %xmm4, %xmm4, %xmm4
 vgatherqps %xmm2, (%rax,%ymm1), %xmm4
 vpaddq %ymm0, %ymm6, %ymm0
 vxorps %xmm1, %xmm1, %xmm1
 vgatherqps %xmm3, (%rax,%ymm0), %xmm1
 vinsertf128 $1, %xmm1, %ymm4, %ymm0

While with the foreach_unique variant, it looks like that:

.LBB0_5: # %foreach_find_next33
 # =>This Inner Loop Header: Depth=1
  tzcntq %rax, %rcx
  movslq (%rsp,%rcx,4), %rcx
  movq (%rdi,%rcx,8), %rdx
  vmovd %ecx, %xmm4
  vpbroadcastd %xmm4, %ymm4
  vpcmpeqd %ymm0, %ymm4, %ymm4
  vpand %ymm2, %ymm4, %ymm5
  vxorps %ymm6, %ymm6, %ymm6
  vgatherdps %ymm5, (%rdx,%ymm3), %ymm6
  vmovmskps %ymm4, %ecx
  andnq %rax, %rcx, %rax
  vblendvps %ymm4, %ymm6, %ymm1, %ymm1
  jne .LBB0_5
.LBB0_3: # %foreach_done
  vmovaps %ymm1, %ymm0
  movq %rbp, %rsp
  popq %rbp

(in both cases I’ve cut some common preamble code, adn the “all on” special cases).

Now that doesn’t look too much simpler, but keep in mine two things: First, the second exactly already includes the instructions for the iteration over the different values, and is still shorter than the simple one. And second, in this trivial example there’s hardly anything to be gained by uniforming in the inner loop – if you had more complex stuff like calling a per-geometry function pointer, looking at multiple data arrays (vertices, normals, colors, etc) the impact will be quite more pronounced.

And why does it get simpler in the first place? In the original example, ‘geomID’ is varying, so the entire expression “geom[geomID]->prim” becomes varying, all the pointers in there are varying, etc, which is costly. In the foreach_unique variant, ‘uniGeoMID” is a uniform, so the entire expression of “geom[geomID]->prim” evaluates to a single, uniform (ie, scalar) pointers, and the only varying gather required in this example is the final gather relative to this scalar base pointer …. which is way cheaper.

As said above, this trick doesn’t always help – in fact, in many cases it can be detrimental. But in some cases, it can be hugely beneficial, as the foreach_unique can now guarantee that the variable we’re iterating over will now be uniform in the body of that expression – which will help the compiler express things more cheaply than it it didn’t know this. (and of course, it still produces exactly the same results).


Well – there’s probably books I could fill with little discussions on what to do and what to (try to) avoid in ISPC (or on CPUs in general), but given that my laptop’s battery is slowly running out I guess this’ll be it for today. I’ll write some more a bit soon on some of the pitfalls of 32-bit addressing soon (including some of the ways of fixing it), but at least for today that’s more than the average person will likely be willing to read, anyway. I do hope it’ll still be useful for those that do have a closer look at ISPC (which seems a growing numbers). And of course, feel free to send me emails – and/or use the ‘comments’ section – if there’s anything to comment on in these examples.

Now, back to work …

Preprint of our HPG2018 “Compressed Leaf BVH” (short-)paper…

In the past, I always tried to be good and put accepted papers on the web to share, but usually only as a direct link, that – well – tended to get lost or forgotten very soon. Sure, there’s my “publications” page at SCI, but that required copying the pdf to a magic location on a remote node’s file system, (remote-)editing the HTML files, setting permissions correctly, etcpp … which usually meant this turned into a giant mess … which turned into a “well, I’ll finish that tomorrow”, which … well, you get the point.

So from now on, what I’m going to do is always share accepted papers here on the blog; this not only to makes it way easier to publish those through wordpress, it also means they’ll be more easily findable by google, and, in particular, you can actually post comments and questions to a given paper if you so desired – and even better, I can actually answer them, add updates, etc. Yay!

Anyway, for the first such installment: Here a preprint of our “Compressed Leaf” BVHes (a recent extension to Embree to allow saving memory without losing much performance) that just got accepted at HPG 2018, and will thus be presented at Vancouver. Of course, all the real work in that paper was Carsten’s (I just helped with the writing), so if you have some real questions you’ll probably have to ask him ;-).

Also one note: The version of the paper here is a un-shortened version – it’s basically the originally submitted version with the reviewer’s comments addressed (well, mostly), but without having been shortened down to the 4 pages that it has been allocated as a short paper. As such, I hope it can server as a additional source of information; but note that the only “real” official version will be the short one in the ACM digital library.

With that, here’s the final link to the paper (in PDF) – I hope it’ll be useful!


Another blog ….

…. is probably the last thing the world needs right now – so of course, I just had to sit down and create one!. So if you’ve miraculously found this page, and really have nothing  better to do than reading this – then welcome!

Now of course, those that know me also know that I can be a bit facetious at times… and yes, that intro may fall in this category, too – I just couldn’t stand the auto-generated “and so it begins” that WordPress wanted me to start with. Be that as it may, as with all facetious comments there’s always a grain of truth to it, so “why another blog”? Sure, almost about everybody seems to have one nowadays – some better, some worse – so “why not”? That, however, is only half the truth, because a blog can actually be useful, and I do hope this one will, too.

Before I say anything else, I’ll first have to admit that I’m not actually as new to blogging as this “intro” may suggest: I do actually already write another blog – under a different name, and on a different topic – that WordPress claims has a few hundred visitors a day (I’ll leave it as an “exercise to the reader” to figure out what blog that might be!).

What this blog aims to be…

As with the other blog, I’ll try to use this blog as a medium to share information, mostly around the work I do: updates on papers that got accepted, upcoming tech demos of the work we have done, cool new results we’ve achieved, newly added features that got added to the software projects I work on, and in particular, any sort of  information that didn’t make it into a paper, that a paper wouldn’t be the right medium for, etc; in particular, the  little technical “nuggets” around ISPC, Embree, OSPRay, etc that I believe will be interesting to the users of those technologies, but that simply wouldn’t have made it into any sort of “real” paper.

The latter of those reasons is actually the main reason I finally decided to get busy and start this blog, and the thing I find blogs to be most useful for: Over the course of my now 20 years’ career I’ve now found many different instances where I (or “we”, because usually it’s the work of many!) have done something that didn’t fully work out, or that did work out but wasn’t good enough to ever make it into a paper, or in fact, something that was good enough that it should have been a paper but that we never found the time to write up, or that we did too bad a job to ever get past the reviewers, etc. And in many of these cases, it eventually turned out that many others – often years later – started to look at the same problems, often at least partially re-doing what I/we had done before. Sometimes they did this with better success, sometimes with worse – but every time with a lot of work that could have been saved if only those that came later had had access to the information I/we had already had.

This, in fact, is what I’ll try to convey in this blog: Little things I’ve done – or been doing – that I believe may be of interest, yet that would certainly not make it out in any other form. Inevitably there will be topics and posts that will not be useful for everybody (and maybe a few that won’t be useful for anybody!?), but that I guess is what google’ing and like’ing are for.

All of this said: Welcome to my blog, and I sincerely hope you’ll find at least some useful information in the articles to come!