ISPC Bag of Tricks Part 3: Addressing 64-bit arrays in 32-bit addressing mode

Sometime last week I wrote a first article on the importance of using ISPC’s 32-bit addressing mode wherever and whenever possible, because it is so much friendlier to the underlying CPU architecture … and as such, can generate way faster code.

The caveat, of course, is that the time of 32-bit address spaces is over (by about a decade or two!), and most “serious” applications today can expect to be asked to handle many gigabytes of data. As already mentioned in that previous article, in most cases this is not actually a problem, because “32-bit mode” in ISPC only means that varying accesses use 32-bit varying offsets relative to a 64-bit base pointer – so as long as each individual array is smaller than what a 32-bit offset can address (4GB in theory; 2GB in practice, due to the sign bit), the sum of all such arrays can very well exceed the 4GB barrier without any issues. In more practical words: Assume you have a triangle mesh with on the order of a million vertices, such that this mesh’s vertex array is about 16MB large.

Now further assume I read a scene that contains a thousand such meshes: I’ll end up with a total of one thousand such vertex arrays, totalling about 16GB – well beyond the 2GB mark, but still perfectly fine as long as all memory accesses are always relative to each mesh’s own vertex array… which will work out just fine in practice, without doing anything else. So one mesh with 1 billion vertices (16GB) would have led to crashes when the varying array index overflows – but 100 meshes of 10 million vertices each would work just fine. That is, in fact, exactly what we did in OSPRay for a long, long time: we were rendering (many-)gigabyte-sized scenes all day, in pure 32-bit addressing mode, and it just worked out fine, because typically the many gigabytes of data were always split over multiple arrays that were each smaller than 2GB.
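To make that pattern concrete, here is a minimal ISPC sketch of this “one array per mesh” setup (Mesh, vertex, and getVertex are made-up names for illustration, not actual OSPRay code):

struct Mesh {
   vec3fa *uniform vertex;   // this mesh's own vertex array, each < 2GB
   int numVertices;
};

vec3fa getVertex(Mesh *uniform mesh, varying int vertexID)
{
   // 64-bit uniform base pointer ("mesh->vertex") plus 32-bit varying
   // offset: safe, as long as this one array stays below 2GB.
   return mesh->vertex[vertexID];
}

Each mesh carries its own 64-bit base pointer, so no single 32-bit offset ever has to span more than that one (sub-2GB) array, no matter how large the scene as a whole gets.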

But what if I have individual arrays larger than 2GB?

In practice, however, “usually works out OK” is just another way of saying “every now and then it won’t”, which isn’t all that great. So how can we safely handle cases where individual arrays are larger than 2GB?

One option, of course, is to recompile in 64-bit addressing mode – but that has to be done for the entire project, so it comes with a significant performance penalty even for the majority of accesses that would not need it.
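For reference, switching the whole translation unit over is done via ispc’s --addressing flag; the target and file names below are just an example invocation:

ispc --addressing=64 --target=avx2-i32x8 -O2 kernel.ispc -o kernel.o -h kernel.h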

The other option – the one we eventually ended up using (a lot!) in OSPRay – is what we called “segmented” 64-bit addressing: Though the array actually is a single array, we can logically view it as if it consisted of several smaller arrays – “segments” – placed right next to each other in memory, and manually serialize accesses across those segments. That sounds complicated, but isn’t – here’s an example.

First, assume we have – and this is a real example from OSPRay – a “sphere mesh” geometry that contains a set of N spheres, specified as a data array of N structs that each look roughly like this:

struct Sphere {
   vec3f origin;   // 12 bytes
   float radius;   //  4 bytes
   vec3fa color;   // 16 bytes
};                 // -> sizeof(Sphere) == 32 bytes

… and the sphere geometry then looks roughly like this:

struct SphereGeometry {
    ...
    Sphere *uniform sphereArray; // all N spheres, in one single array
    int numSpheres;              // note: a 32-bit int
};

… and the code that fetches a sphere’s color then looks something like this:

vec3fa getSphereColor(Sphere *uniform sphere, varying int primID)
{ return sphere[primID].color; }

Note, in particular, that this is a real-world example from OSPRay – somewhat simplified, but very close to the actual code (in fact, this was the first case where we ran into this issue!). Now of course, we had well documented that each geometry can hold only 2 billion primitives, because the primitive ID is a 32-bit int, as is ‘numSpheres’. However, users were getting core dumps well below this “2 billion spheres” limit, because the

sphere[primID].color

expression – in 32-bit mode – evaluates to a 64-bit base (“sphere”) with 32-bit varying offsets (“primID*sizeof(Sphere)”), and since each sphere is 32 bytes large, this offset overflows 2GB way before reaching 2 billion spheres (at 64 million spheres, in this example).
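Spelled out, the arithmetic behind that number:

// signed 32-bit offsets can address at most 2^31 bytes = 2GB
// sizeof(Sphere) = 32 bytes
// => the offset overflows at 2^31 / 32 = 2^26 = 67,108,864 spheres
//    (i.e., at about 64 million spheres - nowhere near 2 billion)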

So, how do we fix it? Assume we view our sphere array as made up of “segments” of at most 1 million spheres each – each such segment can then address “its” spheres with 32-bit offsets just fine. Also, for each primID we can easily compute the segment that this primitive is in, where – in 64-bit memory – this segment begins, and what this sphere’s offset is within this segment. With that, all we have to make sure is that we serialize across the different segments, which we can do through a foreach_unique, like so:

vec3fa getSphereColor(Sphere *uniform sphereArray, varying int primID)
{
   vec3fa returnValue;
   // segments of 1M spheres (32MB each): well below the 2GB limit
   varying int segmentID     = primID / (1024*1024);
   varying int segmentOffset = primID % (1024*1024);
   foreach_unique(uniformSegID in segmentID) {
      // uniformSegID is uniform, so this is plain scalar 64-bit
      // pointer arithmetic - no 32-bit varying offsets involved:
      Sphere *uniform segmentBase
         = sphereArray + (uniform int64)uniformSegID * (1024*1024);
      returnValue = segmentBase[segmentOffset].color;
   }
   return returnValue;
}

Now, to understand why this works, let’s have a closer look: We first compute, in each lane (ie, in varying), the ID of the segment, and the offset therein. We then iterate over all the unique values in this ‘segmentID’ variable – and since that is exactly how “foreach_unique” is defined, the values it iterates over are uniform values: i.e., in each iteration, we process one uniform segment ID. Because of this, the base address of the segment that we compute in the loop body is uniform, too, which means it is a plain, scalar, 64-bit pointer – no offsets whatsoever, and thus no overflows anywhere. In the following line we then do a varying array access, but relative to a “safe” 64-bit base pointer, and with offsets that are guaranteed to be less than 32 million bytes – so, totally safe.
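To see it in context, here is a small usage sketch (averageRed is a made-up example, not OSPRay code) – the segmentation is hidden entirely inside getSphereColor, so calling code looks exactly like it would with a plain gather:

export uniform float averageRed(SphereGeometry *uniform geom)
{
   varying float sum = 0.f;
   foreach (i = 0 ... geom->numSpheres) {
      // safe even when sphereArray spans more than 2GB:
      vec3fa c = getSphereColor(geom->sphereArray, i);
      sum += c.x;
   }
   return reduce_add(sum) / geom->numSpheres;
}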

What about performance?

Of course, this code is “a bit” more costly than a simple array access that is guaranteed to fit within 2GB … but it’s not actually all that horrible: If the accesses are mostly coherent, chances are that only very few iterations of this loop will be required, because coherent accesses will predominantly go to the same segment, and will then be done in a single iteration. And if accesses are not coherent, chances are you’ll have performance issues with TLB walks etc anyway (which would, quite interestingly, internally serialize across TLB pages in almost exactly the same way this loop does over segments!).

That said, there is of course some performance penalty to doing this. In OSPRay, we actually flag each mesh as to whether it requires this fix or not, and then, in “postIntersect”, either do the “safe” gather or the “fast” gather. But even without going to these lengths – doing the safe gather only in select places is certainly better than recompiling everything in 64-bit mode.
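In code, that flagging could look roughly like the following sketch (needsSafeGather is a made-up field name for illustration; the actual OSPRay code differs):

// at geometry creation time: does the array cross the 2GB mark?
geom->needsSafeGather =
   ((uniform int64)geom->numSpheres * sizeof(uniform Sphere))
   >= ((uniform int64)1 << 31);

// later, in postIntersect-like code:
vec3fa color;
if (geom->needsSafeGather)
   color = getSphereColor(geom->sphereArray, primID);  // segmented gather
else
   color = geom->sphereArray[primID].color;            // plain fast gather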

One final note: In OSPRay, we use this technique in many different places, in various modifications … the core idea, however, is always the same.

Hope this is useful to anybody that wants to handle (at least occasionally) “large” data in their ISPC programs – try it out; so far, at least for us, it’s been rather useful!

PS: As always: comments, corrections, improvements are welcome!
