Hm; it took a while longer to find the time to actually write some (hopefully?) useful content, but here we go: As promised in the blog’s intro, I’ll try to use the blog – among other purposes – as a means of sharing some experiences and “bag of tricks” that other media such as twitter or papers would be badly suited for. In particular, I feel we (ie, not just I, but the entire embree/OSPRay teams) have accumulated quite a bit of experience/know-how on the entire front of “how to get performance out of CPUs” in general, and “using ISPC to do that” in particular. As such, there are actually a lot of different little topics I plan on eventually writing about re our way of using ISPC, which I’ll refer to as the “ISPC Bag of Tricks” series …
As a first installment, I want to write a bit more about “addressing” in ISPC – in particular, about why addressing is important, what we do in OSPRay on this topic and why, and what this infamous “64-plus-32” addressing is that I sometimes talk about. Of course, this article assumes at least some very basic familiarity with ISPC – if you don’t yet have a clue what I’m talking about, I suggest starting at ISPC’s github overview page … and of course, there’s a ton of ISPC related material on Matt Pharr’s blog as well.
Why “addressing” is such a big deal in ISPC…
So, let’s start slowly, and first talk about why addressing is actually such a big deal with ISPC (at least if your goal is performance – which I assume as a given if you started looking at ISPC in the first place).
To fully understand this, you’ll have to take a somewhat closer look at both how ISPC works internally and, in particular, at the CPU instruction set(s) that ISPC will eventually have to map its instructions to. Don’t worry, I won’t go into crazy details – hopefully just enough to make the point.
First, before talking about instruction sets, let me briefly summarize one key concept of ISPC that will eventually be behind all of what I’m talking about below: the key idea of ISPC is that it’s a “SPMD” (single program, multiple data) language, in which the compiler executes a given “single” program on “multiple” data instances – same as work items in OpenCL, threads in CUDA, etc. In particular, the key idea is to have each “lane” of the SIMD registers execute a different program instance, and to use masking/predication to emulate control flow (a bit more complex than that, but not much). As such, if your SIMD units consist of 8 “lanes” that are always operated on in parallel, then you have 8 such programs running in parallel, with the first such program’s variables all in lane 0 of the SIMD registers, the second program’s variables in lane 1, etc.
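To make that a bit more concrete, here’s a minimal sketch – essentially the standard ISPC “hello world”-style example, not code from OSPRay – of what that mapping looks like in source form; on an 8-wide AVX2 target, each iteration of the foreach loop runs 8 program instances at once, one per SIMD lane:

export void scale2x(uniform float vin[], uniform float vout[], uniform int count)
{
  foreach (index = 0 ... count) {
    // 'index' is varying: lane 0 holds index+0, lane 1 holds index+1, ...,
    // lane 7 holds index+7 -- one program instance per SIMD lane
    vout[index] = 2.0f * vin[index];
  }
}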
So far, so good. That concept is used in at least a dozen languages (and is actually what many auto-vectorizers use, too), it works pretty well (for a lot of cases, at least), and conceptually it is very simple – which unfortunately is just another way of saying that the devil is in the details. In particular, there are two problems that continually haunt the code generation phase.
Problem 1: “Missing” Vector Instructions
First, vector/SIMD instruction sets are often not “complete”, in the sense that there are often instructions that a CPU can execute in scalar code, but for which there is no direct equivalent in SIMD form. For example, an AVX2-capable CPU can do an int32+int32 addition in both scalar (add) and SIMD (vpaddd) form – where the vector form just does 8 of those int32 adds in parallel, as you’d expect – but while it can do an int64/int64 division with a scalar instruction (divq), there is no corresponding vector instruction for that in AVX. As a result, some operations are way more costly to do in “varying” form than you’d expect when you’re used to purely scalar code. That applies to many kinds of operations, and some of them are related to addressing and memory operations, unfortunately – I’ll come back to that in a sec.
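As a toy illustration (my own example, not from OSPRay) of what such a “hole” in the instruction set means in ISPC terms: the first function below maps to a single packed add, while for the second the compiler has no packed 64-bit divide to fall back on, and has to extract the lanes and divide them one by one:

// int32+int32 exists as a packed SIMD instruction (vpaddd on AVX2)
varying int32 add32(varying int32 a, varying int32 b) { return a + b; }

// there is no packed int64 divide, so this one gets scalarized lane by lane
varying int64 div64(varying int64 a, varying int64 b) { return a / b; }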
Problem 2: What are lanes, actually?
The second big problem in this SPMD paradigm (at least on SIMD CPUs) is that it implicitly assumes this concept of “each register is made up of N lanes” – and yes, many people (me included) often speak of it in exactly this way, as if, for example, an AVX2 register (or even an AVX2 CPU) was actually “8 wide”. The problem with that conceptual view, however, is that it depends on the data type: SIMD registers do not always have “N” lanes of everything; instead, they have a fixed number of bits. An AVX2 register, for example, is 256 bits wide – and though that does indeed fit 8 int32s, 8 float32s, etc, it will only fit four int64s. Now of course, you can simply use two such registers for int64s – thus being logically back to “8 wide” – but even if we do this, we now suddenly have to think about which bits each of these 8 lanes actually maps to in these two registers: for 32-bit types, lane 0 is always bits 0..31, lane 1 is bits 32..63, etc; but if for a 64-bit type lane 0 is now bits 0..63 and lane 1 is bits 64..127 … then getting the different types to play nicely with each other becomes a bit of an issue, because you’re constantly using shuffles and stuff to move values between where different-width types map their lanes to.
(Note to the compiler-savvy: all this can be somewhat mitigated by realizing 64-bit types differently – 8 times the lower 32 bits in one register, and 8 times the upper 32 bits in another – but that has issues, too, because the instruction set then has to support operations that are split across different registers … which of course it often doesn’t, see problem 1. Chicken and egg, anybody?)
Back to addressing…
Now, with these two issues in mind, let’s have another look at addressing. The problem here is that the most common data types in most programs (in particular in graphics) are int32 and float32, so most naturally you want ISPC to use a mapping for 32-bit types; ie, in the AVX2 example you’d indeed want each register to be treated as 8 lanes of 32-bit values wherever possible (and yes, that’s exactly what it does).
Unfortunately, however, while most computations are on 32-bit types, it has become pretty standard to assume 64-bit address spaces – even the laptop I’m typing this on has more than 4GB of memory. As such, all address computations today are typically assumed to be done in 64 bits … and while that is perfectly fine for scalar code – where CPUs have lots of instructions to help with 64-bit addressing (hey, you can even do 64-bit indirect addressing with scale in the instruction operands!) – as soon as you have anything even as trivial as a varying array index, the compiler has to do all the actual address computations in explicit vector arithmetic (because they’re different for each lane, and because things like “lea” or (vector-)register-indirect operands are not generally available for vectors, see “problem 1” above).
Even worse, if we assume a 64-bit addressing mode, all these operations have to be done in 64 bits, which as just argued creates a lot of issues with moving data around. Alternatively, if you can tell ISPC that you’ll be fine with all addressing being done in 32 bits, then you’ll still have some overhead from “problem 1”, but at least all the 64-bit arithmetic – and the going back-and-forth between 32-bit and 64-bit types – is going away … well, mostly. And that, exactly, is what ISPC’s “--addressing={32,64}” flag is about.
Now really, how important is 32-bit addressing?
Now if you’re like most people – that is, you’re used to mostly scalar code – you’ll probably have a hard time believing that this entire addressing business is indeed such a big issue – after all, it works just fine in scalar code, so how hard can it be, really?
To illustrate this point, I’ve created a tiny little example, and run it through ISPC once with “--addressing=32”, and once with “--addressing=64”. Here’s the tiny little sample function:
extern float foo(float *uniform array, int uniform scale, int varying idx)
{
  return array[scale*idx];
}
…and here’s the makefile (yes, good old gmake!):
all: avx2_32.s avx2_64.s avx1_32.s avx1_64.s

avx2_32.s: test.ispc
	ispc -O3 -o $@ $< --emit-asm --addressing=32 --target=avx2-i32x8

avx2_64.s: test.ispc
	ispc -O3 -o $@ $< --emit-asm --addressing=64 --target=avx2-i32x8

avx1_32.s: test.ispc
	ispc -O3 -o $@ $< --emit-asm --addressing=32 --target=avx1-i32x8

avx1_64.s: test.ispc
	ispc -O3 -o $@ $< --emit-asm --addressing=64 --target=avx1-i32x8
As you can see, this example is trivial (the makefile is larger than the code!), but it already illustrates the point: First, let’s look at the AVX2 version of the code that’s being generated, assuming 32-bit addressing (some header and footer gunk removed):
vmovd %esi, %xmm2
vpbroadcastd %xmm2, %ymm2
vpmulld %ymm2, %ymm0, %ymm0
vpslld $2, %ymm0, %ymm2
vpxor %ymm0, %ymm0, %ymm0
vgatherdps %ymm1, (%rdi,%ymm2), %ymm0
retq
… and here’s the 64-bit variant:
vmovd %esi, %xmm2
vpbroadcastd %xmm2, %ymm2
vpmulld %ymm0, %ymm2, %ymm0
vpmovsxdq %xmm0, %ymm2
vextracti128 $1, %ymm0, %xmm0
vpmovsxdq %xmm0, %ymm0
vpsllq $2, %ymm2, %ymm2
vextractf128 $1, %ymm1, %xmm3
vxorps %xmm4, %xmm4, %xmm4
vgatherqps %xmm1, (%rdi,%ymm2), %xmm4
vpsllq $2, %ymm0, %ymm0
vxorps %xmm1, %xmm1, %xmm1
vgatherqps %xmm3, (%rdi,%ymm0), %xmm1
vinsertf128 $1, %xmm1, %ymm4, %ymm0
retq
As you can see, even for this tiny example the number of instructions more than doubles! And yes – that was a tiny example; it’ll get even funnier with large ones.
Now, that example was only about the 64-bit arithmetic. If you want an even more crass illustration of how the “instruction set” issues play into this, I suggest you have a look at the code emitted for AVX1 (or even better, SSE) …. the avx1_64.s version is 55 lines for this example.
Though I do know that the example was too trivial to be very representative, I do hope it gives a rough idea. How big the impact is going to be in practice will depend on many factors: first of all, on how much data I/O you have relative to the compute you’re doing with it; then, on whether you can live with 32-bit addressing; and finally, on how much you can “massage” the loads/stores to be more vector friendly (see below).
As a single – probably not representative, and certainly not up to date – number for something more complex, I point to a somewhat outdated anecdote: a while back we changed OSPRay to use 64-bit addressing rather than the default 32-bit addressing … just to test it out … and the impact on total runtime was more than 2x. I have no idea if that is still the case – and most certainly it’ll depend on the actual renderer, data set, CPU type, etc – but either way, addressing overhead shouldn’t be neglected.
Now, based on all this long discourse on addressing in vector arithmetic, a few tips that can hopefully help:
Tip #1: Use 32-bit addressing if at all possible
Most obviously, use 32-bit addressing wherever you can. Unfortunately, ISPC doesn’t allow you to specify that on a per-function basis, so it’s “all 32” or “all 64” (this is because the addressing flag changes how pointers are represented internally, so you can’t mix and match different functions any more).
Of course, the downside of 32-bit addressing is that you shouldn’t use it on anything that requires large, 64-bit offsets. On the upside, it is important to realize that in ISPC, (varying) pointers are usually realized as a 64-bit scalar base address plus 32-bit offsets. So “32-bit addressing” doesn’t mean that your code will only ever run in a 32-bit build, or that you can never allocate more than 4GB of memory – all it means is that you can’t ever access into a single array with 64-bit offsets. I.e., having 16 separate 1GB arrays is “generally” no problem, but one single 16GB array is.
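To make that restriction a bit more concrete, here’s a hypothetical little sketch (names made up for illustration, not OSPRay code) of the kind of indexing that is fine under 32-bit addressing, and the kind that isn’t:

// fine with --addressing=32: many separate arrays (say, 1GB each), each
// reached through its own 64-bit uniform base pointer; the varying index
// into each individual array still fits into 32 bits
extern float get(float **uniform bricks, uniform int brickID, int voxelID)
{
  return bricks[brickID][voxelID];
}

// NOT fine: one single, say, 16GB array, where the offset itself would
// need 64 bits -- that is exactly what 32-bit addressing cannot express.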
In practice – e.g., in OSPRay – that usually means you compile in 32-bit mode, and make sure that the above restrictions are obeyed in your code. And of course, that wherever you do have to do 64-bit indexing you’ll solve it in another way – I’ll talk about that later.
Tip #2: do uniform addressing wherever you can!
Another thing you should take home from the above examples is how important it is to get the addressing right, because if you don’t, it can cost a lot. Aside from 32- vs 64-bit addressing, one thing that I cannot hammer home often enough is the importance of telling the compiler that something you know to be uniform actually is uniform.
For example, let’s take the above example, and be a little bit lazy, in that we just leave out the “uniform”s for the parameter declarations. I.e., we now have two functions:
extern float foo(float *uniform array, int uniform scale, int varying idx)
{
  return array[scale*idx];
}

extern float lazy_foo(float *array, int scale, int idx)
{
  return array[scale*idx];
}
Both functions of course do exactly the same thing, and the reason I called the second one the “lazy” one is that this is exactly what a “lazy” programmer (or one who wouldn’t know better) would write. In fact, I’ve seen tons of such examples in the past, and it’s particularly tricky because the lazy function does exactly what the “fast” one does, so it’s where most people stop worrying – it’s doing its job, it’s in ISPC, so it’s probably fast, so I’m done, right?
Now, let’s take a look at the generated asm (only for avx2, 32 bit addressing): For the original one, we still have
vmovd %esi, %xmm2
vpbroadcastd %xmm2, %ymm2
vpmulld %ymm2, %ymm0, %ymm0
vpslld $2, %ymm0, %ymm2
vpxor %ymm0, %ymm0, %ymm0
vgatherdps %ymm1, (%rdi,%ymm2), %ymm0
retq
… but the lazy one – doing exactly the same – now looks like this:
vpmulld %ymm3, %ymm2, %ymm2
vpslld $2, %ymm2, %ymm2
vpmovsxdq %xmm2, %ymm3
vextracti128 $1, %ymm2, %xmm2
vpmovsxdq %xmm2, %ymm2
vpaddq %ymm0, %ymm3, %ymm0
vextractf128 $1, %ymm4, %xmm3
xorl %eax, %eax
vxorps %xmm5, %xmm5, %xmm5
vgatherqps %xmm4, (%rax,%ymm0), %xmm5
vpaddq %ymm1, %ymm2, %ymm0
vpxor %xmm1, %xmm1, %xmm1
vgatherqps %xmm3, (%rax,%ymm0), %xmm1
vinsertf128 $1, %xmm1, %ymm5, %ymm0
retq
Now does this look just as bad as the 64-bit variant? Well, that is because it mostly is: The compiler no longer knows when it can do 32-bit offsets to a common base pointer, nor does it know which operations are scalar, etc. And again, that was the easy example.
Oh, and just for reference: if we knew that ‘idx’ was also a uniform variable, and we told the compiler that, too, then this is what would come out instead:
Input:
extern uniform float alluniform_foo(float *uniform array, int uniform scale, int uniform idx)
{
  return array[scale*idx];
}
asm:
imull %edx, %esi
movslq %esi, %rax
vmovss (%rdi,%rax,4), %xmm0   # xmm0 = mem[0],zero,zero,zero
retq
Quite a difference, eh?
Now of course, telling the compiler what is uniform and what isn’t requires some additional knowledge about the program. BUT if you don’t pass this information on even where you have it, the compiler can’t do any better than assume the worst, and you’ll get “sub-optimal” performance. So, “some discipline required”, unfortunately, at least if you want to get good performance. On the upside, ISPC at least allows polymorphism, where you can “overload” the same function with different uniform/varying combinations of the parameters, which helps a lot in that ISPC can then, at the call site, figure out which version is best to use. But again, this requires that the programmer tell the compiler that there are different versions, so “some discipline required”.
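As a rough sketch of what such overloading can look like (my own example reusing the foo() from above, not literally OSPRay code), you can provide several versions of the same function and let ISPC pick the best match at each call site:

// fully uniform version: compiles down to plain scalar loads
uniform float foo(float *uniform array, uniform int scale, uniform int idx)
{
  return array[scale * idx];
}

// varying-index version: uniform base pointer plus varying 32-bit offsets
float foo(float *uniform array, uniform int scale, varying int idx)
{
  return array[scale * idx];
}

// at a call site, ISPC picks whichever overload matches the arguments:
//   uniform int uIdx = ...;   foo(data, 4, uIdx);   // scalar version
//   varying int vIdx = ...;   foo(data, 4, vIdx);   // gather version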
Tip #3: “foreach_unique” to uniform’ize (where applicable)
One particularly useful trick – that we use a lot in OSPRay, but which seems not well known outside of OSPRay – is to use the “foreach_unique” statement to “uniformize” otherwise varying expressions.
To re-cap:
foreach_unique (uniformValue in varyingVariable) { ... }
iterates over a varying variable ‘varyingVariable’, and executes the body (with the right mask) exactly once for each unique value occurring in that variable. Now if that varying variable usually has all-different values anyway, then all this achieves is serializing the computations, which in general isn’t such a good idea …
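As a small illustrative sketch (made-up values, assuming an 8-wide target; not code from OSPRay) of these semantics:

export void sketch(uniform float results[])
{
  varying int geomID = (programIndex & 1) ? 2 : 3;  // lanes hold 3,2,3,2,...
  varying float result = 0.0f;
  foreach_unique (uniID in geomID) {
    // the body runs twice: once with uniID == 3 (lanes 0,2,4,6 active),
    // and once with uniID == 2 (lanes 1,3,5,7 active); uniID itself is uniform
    result = (float)uniID;
  }
  results[programIndex] = result;
}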
BUT – and here’s the trick – if we know that varyingVariable will usually take only a few different values, then we can use this to “trade” multiple iterations for some operations becoming uniform/scalar. For example, assume you write a ray tracer (say, OSPRay), where each lane can have hit a different geometry, shader, etc … but where, at least in most cases, the rays will all have hit the same geometry, or only two or three different ones. In that case, if we use the varying geometry ID in any address computations, all those computations become varying (which is usually expensive, see above) … but if we first foreach_unique over the different geometry IDs, we may get multiple iterations over the body, but many operations in that body become uniform (ie, scalar), which is often way cheaper. So, if there aren’t too many of those iterations, and the savings are big enough, it can well pay off.
As an example, consider the following – horribly trivialized – case of a “two level” scene with a list of “geometries” that each have “primitives”, and where we now have to access a given primitive in a given geometry. In its trivial form, this looks like this:
struct Geom {
  float prim[100];
};

extern float foo(Geom **uniform geom, int geomID, int primID)
{
  return geom[geomID]->prim[primID];
}
Now if we use foreach_unique, we can also write this as
extern float foo(Geom **uniform geom, int geomID, int primID)
{
  float ret;
  foreach_unique(uniGeomID in geomID)
    ret = geom[uniGeomID]->prim[primID];
  return ret;
}
Yes, it’s a bit more complicated, but not that much, is it? The key, though, is to look at the generated asm code: for the above (simple) example it looks like this:
vextractf128 $1, %ymm2, %xmm3
vpmovsxdq %xmm3, %ymm3
vpmovsxdq %xmm2, %ymm4
vextracti128 $1, %ymm0, %xmm5
vpxor %ymm6, %ymm6, %ymm6
vpxor %ymm7, %ymm7, %ymm7
vpgatherdq %ymm4, (%rdi,%xmm0), %ymm7
vpgatherdq %ymm3, (%rdi,%xmm5), %ymm6
vpslld $2, %ymm1, %ymm0
vpmovsxdq %xmm0, %ymm1
vpaddq %ymm1, %ymm7, %ymm1
vextracti128 $1, %ymm0, %xmm0
vpmovsxdq %xmm0, %ymm0
vextractf128 $1, %ymm2, %xmm3
xorl %eax, %eax
vpxor %xmm4, %xmm4, %xmm4
vgatherqps %xmm2, (%rax,%ymm1), %xmm4
vpaddq %ymm0, %ymm6, %ymm0
vxorps %xmm1, %xmm1, %xmm1
vgatherqps %xmm3, (%rax,%ymm0), %xmm1
vinsertf128 $1, %xmm1, %ymm4, %ymm0
retq
While with the foreach_unique variant, it looks like this:
.LBB0_5:                          # %foreach_find_next33
                                  # =>This Inner Loop Header: Depth=1
  tzcntq %rax, %rcx
  movslq (%rsp,%rcx,4), %rcx
  movq (%rdi,%rcx,8), %rdx
  vmovd %ecx, %xmm4
  vpbroadcastd %xmm4, %ymm4
  vpcmpeqd %ymm0, %ymm4, %ymm4
  vpand %ymm2, %ymm4, %ymm5
  vxorps %ymm6, %ymm6, %ymm6
  vgatherdps %ymm5, (%rdx,%ymm3), %ymm6
  vmovmskps %ymm4, %ecx
  andnq %rax, %rcx, %rax
  vblendvps %ymm4, %ymm6, %ymm1, %ymm1
  jne .LBB0_5
.LBB0_3:                          # %foreach_done
  vmovaps %ymm1, %ymm0
  movq %rbp, %rsp
  popq %rbp
  retq
(In both cases I’ve cut some common preamble code, and the “all on” special cases.)
Now, that doesn’t look too much simpler, but keep in mind two things: first, the second variant already includes the instructions for the iteration over the different values, and is still shorter than the simple one. And second, in this trivial example there’s hardly anything to be gained by uniformizing the inner loop – if you had more complex stuff, like calling a per-geometry function pointer or looking at multiple data arrays (vertices, normals, colors, etc), the impact would be quite a bit more pronounced.
And why does it get simpler in the first place? In the original example, ‘geomID’ is varying, so the entire expression “geom[geomID]->prim” becomes varying, all the pointers in there are varying, etc, which is costly. In the foreach_unique variant, ‘uniGeomID’ is uniform, so the entire expression “geom[uniGeomID]->prim” evaluates to a single, uniform (ie, scalar) pointer, and the only varying gather required in this example is the final gather relative to this scalar base pointer … which is way cheaper.
As said above, this trick doesn’t always help – in fact, in many cases it can be detrimental. But in some cases it can be hugely beneficial, as foreach_unique can now guarantee that the variable we’re iterating over will be uniform in the body of that expression – which helps the compiler express things more cheaply than if it didn’t know this. (And of course, it still produces exactly the same results.)
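And just to give a flavor of the “more complex stuff” mentioned above, here’s a rough, hypothetical sketch – made-up names and signatures, with the function-pointer syntax written from memory, so take it as an illustration rather than gospel – of the per-geometry function pointer case where this trick tends to pay off most:

// hypothetical per-geometry sampling function; one (scalar) pointer per geometry
typedef float (*uniform SampleFct)(varying int primID);

float shade(SampleFct *uniform sampleFctOfGeom, varying int geomID, varying int primID)
{
  float ret = 0.0f;
  foreach_unique (uniGeomID in geomID) {
    // one scalar indirect call per *unique* geometry, with the mask set for
    // exactly those lanes that hit this geometry -- instead of a (much more
    // expensive) fully varying function-pointer call
    ret = sampleFctOfGeom[uniGeomID](primID);
  }
  return ret;
}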
Summary
Well – there are probably books I could fill with little discussions of what to do and what to (try to) avoid in ISPC (or on CPUs in general), but given that my laptop’s battery is slowly running out, I guess this’ll be it for today. I’ll write some more soon on the pitfalls of 32-bit addressing (including some of the ways of fixing them), but at least for today this is more than the average person will likely be willing to read, anyway. I do hope it’ll still be useful for those that do have a closer look at ISPC (which seems to be a growing number). And of course, feel free to send me emails – and/or use the ‘comments’ section – if there’s anything to comment on in these examples.
Now, back to work …
Hi Ingo,
thanks for the insights.
One question regarding foreach_unique. From your post it reads as if the body S of
foreach_unique (val in x) S
will always be executed uniformly. However, my understanding was the following (and this is how the official documentation reads):
Suppose the current mask is: {1, 1, 1, 0}
Say x is: {3, 2, 3, 4}
This means that S will perform two iterations:
val = 3 with current mask = {1, 0, 1, 0}
and
val = 2 with current mask = {0, 1, 0, 0}
In particular, S will be executed in a varying way. Maybe ispc specializes S to be executed uniformly if only one mask element is true (in the second iteration), but in general S will be executed in a varying way. Am I wrong? Am I missing something?
Thanks,
Roland
Yes, you’re correct. I’d have to re-read my original formulation to see how I phrased it, but the way I think about it is that there is no “uniform execution” vs “varying execution” of a _block_ – instead, _variables_ are uniform vs varying. So the way I view foreach_unique is that you put a varying in, but within the body of the loop that variable is then uniform during execution. Does that make sense?
Yes, that makes sense 🙂