CfI: Embree on ARM/Power/…?

Executive Summary: This is a “CfI” – a “call for involvement” – for anybody interested in building and, in particular, maintaining a Embree version for non-IA ISAs such as ARM, Power, etc. If you’re mostly based on – or at least, very active on – any of those platforms, and interested in being involved with creating and maintaining a version of Embree on them …. let me know!


As most of those that read this blog will surely know already, Embree ( is a ray tracing library that focuses on accelerating BVH construction and ray traversal, thus allowing the user to build their own renderer with minimum constraints, yet good performance. Arguably the biggest strength of Embree – other than performance – is its versatility, in that it allows things like user-programmable geometries, intersection callbacks, different ray layouts and BVH types, etc.

Another big strength of Embree is that it will automatically select – and use – the best vector instructions available on your CPU, and make full use of it; e.g., if your CPU has AVX512 it’ll fully use it; if it doesn’t it will fall back to AVX2, or AVX, or SSE, … you get the point. The caveat of this, though, is that Embree today only supports Intel-style vector extensions; yes, it supports SSE, AVX, AVX2, and AVX512; and yes, it works on AMD CPUs just as well as it does on Intel CPUs …. but if you’re on Power, ARM, SPARC, etc, it currently won’t work.

Embree on Non-IA CPUs?

On the face of it, only supporting IA (Intel Architecture) style CPUs isn’t too big a limitation … in particular in high-end rendering almost every rendering is being done on Xeons, anyway. However, if you are an ISV whose software is supposed to also run on non-IA CPU types – think a game studio, or the Steam Audio 2 that’s been recently announced (see here), then you’re currently faced with two choices: either don’t use embree at all (even where it would be highly useful); or change your software to support two different ray tracers, depending on which platform you’re on (ugh). As such, I would personally argue – and yes, this is my own personal view – that it would be highly useful to have a version of Embree that will also compile – and run – on non IA CPUs.

In fact, doing just that is way simpler than you might think! Of course, everybody’s first thought is that with all the man-years of development that went into Embree “as is”, doing the same for other vector units would have to be a major undertaking. In fact, if you only dare to take a look at Embree’s source (if you want to, you can conveniently browse it on its github page) you’ll very quickly realize that almost everything in Embree is written using some SIMD “wrapper classes” (see code in embree/common/simd) that implement things like a logical “8 wide float”, “16 wide bool”, etcpp… and that then implements these wrapper classes once in SSE, once in AVX, once in AVX512, etc.

In other words, once you implement those wrappers in your favorite non-IA vector intrinsics, you’re 95% there towards having all of Emrbee compile – and run – on that ISA. After that, there’s still a few few more things to do in particular relating to the build system (adapting the cmake scripts to your architecture), in properly “registering” the new kernels (because Embree’s automatic CPU detection currently only looks at Intel CPUs), etc … but all that is “small beer” – all the traversal kernels, API implementation, BVH building, threading, etcpp, should all work out of the box.

I am, of course, not the first one to figure that out: In fact, almost two years ago a github user “marty1885” already did exactly that for ARM (see this blog article of his for an excellent write-up!). Funny thing back then was that I had just done “almost” the same in writing purely scalar implementations for those wrapper classes, while he independently did this port to ARM Neon. (And just to make this clear: With “scalar” I do not mean that Embree itself was re-written without vectors; I’m talking about an implementation that realizes, say, the 16-wide float “vectors” as doing 16 scalar add/mul/etc, rather than using an explicit _mm512_add_ps etc; this means it’ll still compile on any platform, but the rest of Embree still “sees” this as a 16-wide vector).

Both of those test implementations yielded interesting results: For mine, it was the fact that this purely “scalar” implementation of float4, float8, etc worked really well – auto-vectorizers may be, well, “problematic” for complex code, but for something that continuously does 8 adds, 8 muls, etc at a time, they absolutely do see that those can be mapped to vector instructions – not just as good an manual intrinsics, but surprisingly close. I did, however, never go through the exercise of changing the makefiles, so never even tried on ARM etc (well, I don’t have an ARM!). For Marty, he went “all the way”, and in particular, got his entire renderer working on ARM. His finding? That performance was pretty good out of the box, without any rewriting of kernels etc altogether… which is pretty cool…. and highly promising.

So, what (still) needs to be done?

OK, proof of concept done – what would we still need? Well – the problem is that both mine and Marty’s implementations are about 2 years old, and as such, pretty deprecated right now. It’s not hard to re-do that for the latest Embree, but it has to be done (and I sure won’t have time for that!).

Second, even if that exercise was re-done, it would still have to be maintained: Every new release of embree adds a few additional things, fixed bugs, adds improvements, etc… and to be really useful for ISVs, the “ported” embree would have to keep up to date with those updates. Now thanks to git, that “keeping it updated” shouldn’t be too hard – very few of those release updates would touch the wrapper classes, CPU detection, or build system, so in 95% of cases I’d expect such an update to be as simple as a “git pull remote” to make a new release…. but it would still have to be done. In fact, if I were to build and maintain such a “non-IA” version of Embree, I’d do it exactly as such: clone the embree repo once, port the wrapper functions and makefiles just like Marty did it, then push that onto another, easy to find repo… and of course, pull in the master diffs every time Embree makes a new release, fix whatever needs to get fixed (probably not much), and push that release too.

Be that as it may – I will likely not have the time to maintain such a project …. but if anybody out there is eager to make himself a name in non-IA CPU ray tracing land (and possibly, with some of existing users of Embree that want to be platform agnostic) – well, here’s a relatively easy project to do just that!

Anyway – ‘nough for today … back to work!

PS: And of course, once you have a non-IA version of Embree, it’s trvially simple to also have a non-IA version of OSPRay, too: OSPRay itself doesn’t use vector intrinsics at all; it only uses them indirectly, through ISPC, and through Embree. ISPC can already emit to “scalar” targets, so as soon as Embree could emit to whatever ISA you require (or to scalar, as I had done) ….. well, all you’d need is a C++11 compliant compiler, and likely MPI… which aren’t all too rare nowadays :-). As such, if you do have a non-IA supercomputer (hello there, Summit, Sierra, Tianhe, Sunway, Titan, Sequoia, etc!), and you need a good, scalable, fast ray tracer …. take your sign!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s