Sharing “my” NASA Mars Lander Unstructured-Mesh Data Set

TLDR: If any vis (or other researcher) is in need of a large unstructured mesh data set (tets and/or other linear elements): I’m hereby sharing a properly wrangled and pre-processed version of the “NASA Mars Lander” Data Set (link at bottom).

For those that haven’t yet heard of this data set: It’s one of the most amazing data sets I’ve ever gotten my hands on – partly because of the amazing back-story (simulating a landing on Mars, how much geekier could it get?), but also because

  1. it’s gigantic (over 6 billion tets – yes, billion, not million)
  2. it’s not – as most ‘big’ data sets are – some artificial test case, but a “real world” data set (yes, they did simulate at that accuracy)
  3. it even contains multiple time steps, so you can make cool animations (see here:
  4. it’s a “raw” data drop in the sense that this is really what the sim code (Fun3D) wrote out (ie, it’s useful for “in situ” and “data-parallel rendering” research, too); and
  5. it actually looks awesome when rendered:


    (image credits: Nate Morrical, UofU)

The full, unadulterated data for that “Mars Lander Retropulsion Study” is available from the “Fun3D Retropulsion Data Portal”, and has been made available for the wider vis rendering community by the scientists that ran this simulation (for full attribution, see the data portal), with a lot of help from Pat Moran.

Unfortunately, if you start working on that data you’ll quickly realize that the main reasons for its awesomeness – its  “raw data dump” nature, and sheer size – have a flip side in that getting this data into any form useful for rendering comes with a “non trivial” amount of data wrangling : you need to get the thousands of different files downloaded, parse them, strip ghost cells, extract variables and time steps, re-merge from hundreds of per-rank results to a single mesh, etc. Doing so has been a lot of fun, but it was also a lot(!) of work (even for somebody that has a lot of hardware, and a lot of experience dealing with those things)… so to make it easier for others to use this data I decided to make both the result of my wrangling, and the code used for doing it, available to others that might want to work with it.

The resulting data for me is – for both the “small” (order three-quarters of a billion tets) and the large lander (about six billion – with a ‘b’) — a single unstructured mesh, with a single per-vertex scalar field (for me, ‘rho’, for one of the later time steps). For really high-quality rendering you probably want more than one variable; and/or multiple time steps …. but since my google drive space is limited I’ll provide only these two dumps, which should be more than enough to get you started. I’ll also provide the library I wrote to deal with this data set, so anybody serious enough to deal with data of that size should be able to follow my steps and extract other variables, time steps, etc.

With that:

And as usual: Any comments/praise/feedback/criticism…. let me know. I’d be particularly interested in hearing from you if you actually use this data!



PS: If you end up using this data, please do not forget to properly attribute the researchers that made this data available; please check the corresponding info on the original data portal!

Fun Problems #1: How to tetrahedralize an unstructured mesh with pyramids, wedges, and hexes, without losing face connectivity

Having introduced the idea of a series of articles on “fun problems” earlier today, let’s get the first one out of the door: How to tetrahedralize an unstructured mesh that contains pyramids, wedges, and hexes, into a mesh that contains only tets, and in a way that one does not lose any shared-face connectivity in the output mesh. Most of the unstructured meshes I dealt with in the past have been “pure” tetrahedral meshes, but more recently more and more of those I encountered in unstructured-mesh visualization projects also contained wedges, pyramids, and hexes – sometimes because those are natively the most obvious choices (eg, in the Agulhas data set multiple water depth layers for a triangulated ocean surface natively form wedges, etc), and sometimes because the dual mesh of a structured AMR data set actually contains pyramids, wedges, and hexes, … so there we are. Maybe the most prominent example of such data is the NASA Mars Lander Retropulsion study data set that they recently released – almost all tets, but a few wedges thrown in, too. But since triangles and tets are so much easier to deal with, the most obvious choice for such data sets (if only for comparison purposes) is typically to simply tetrahedralize them into a tet-only mesh.

Now: Why is this tricky? After all, splitting a given wedge, pyramid, or hexahedron into tetrahedra isn’t all that complicated (in fact, I actually do remember that as a toddler I had a toy puzzle with colored plastic tetrahedra that did just that!). After googling for solutions to that I found that you can actually also split a hex into five tets (rather than the obvious six), but that aside, the basic concept of tetrahedralizing such unstructured meshes isn’t all that complicated.

The problem, in fact, comes in through the back-door, if one innocently expects this tetrahedralization to maintain proper face connectivity: ie, if two elements shared a face in the input, we want the generated tetrahedra to also share faces in the output. For those faces in the input mesh that were triangle faces – ie, the four faces of any input tet, the four sides of an input pyramid, and the front and back sides of a wedge – that’s actually not a problem: those are triangles, and those won’t change. For those faces that are quadrilaterals, however – ie, the base of a pyramid, the left side, right side, and bottom of a wedge, or the six faces of a hexahedron – it’s a bit more complicated: if the element itself gets split into tets these sides will, by necessity, have to end up as pairs of triangles… and there’s two ways of doing that, based on which of the two opposing pairs of vertices in that quadrilateral one ends up connecting.

Now if both elements that originally shared a quadrilateral face do this split in exactly the same way then both elements would end up with the “same” pair of triangles, so each triangle from the one element will have a matching one from the other one, and all is good. If, however, those two adjoining elements end up splitting that face in different ways, then the one’s triangles will not match the other ones, which can be, ahem, “problematic” for all the kind of algorithms that assume that each face can tell what the neighboring tet on the other side of that face would be (e.g., for face-to-face ray marching, or for the “shared faces” method in our 2019 “RTX Beyond Ray Tracing” paper).
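
To make that a bit more concrete, here is a tiny Python sketch (not the code I actually used; the function name and the vertex indices are made up for illustration). It splits one quad face along the diagonal starting at a given corner, and compares the triangles two neighboring elements would generate for that same face:

```python
def split_quad(quad, corner):
    """Split a quad (four vertex indices in cyclic order) into two
    triangles along the diagonal that starts at vertex `corner`.

    Triangles are returned as a set of frozensets, so vertex order and
    winding do not matter when comparing two splits.
    """
    i = quad.index(corner)
    a, b, c, d = quad[i:] + quad[:i]  # rotate so the diagonal is a-c
    return {frozenset((a, b, c)), frozenset((a, c, d))}


# The same face, as seen from the two elements that share it
# (the second element sees it with opposite winding).
face_from_left = (10, 11, 12, 13)
face_from_right = (13, 12, 11, 10)

# Both elements pick the diagonal 10-12: the triangles match.
matching = split_quad(face_from_left, 10) == split_quad(face_from_right, 10)

# One picks diagonal 10-12, the other 11-13: no triangle matches.
mismatch = split_quad(face_from_left, 10) & split_quad(face_from_right, 11)

print(matching, len(mismatch))
```

In the mismatched case not a single triangle is shared any more, which is exactly what breaks any “who is on the other side of this face” lookup.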

In fact, even for algorithms that do not need to know this “neighbor on the other side”, ending up with an inconsistent way of splitting these shared faces can result in nasty surprises: quadrilateral faces in an unstructured mesh do not (!) have to be planar, but can actually be general bilinear patches (to visualize this imagine a hexahedron that’s a perfect cube, then take one of the vertices and pull it into an arbitrary direction …. then the three faces adjoining that vertex will become curved, bilinear patches). Now if two elements share a bilinear patch but split that into two triangles with different edges, then the resulting tet-mesh would either have a gap between these elements (if both chose the inward-flipping edge), or both elements would overlap in a tet-shaped region (if both flip outwards … and in case you were wondering: one going inwards and one going outwards only happens if they select the same edge).

Though this problem is kind-of obvious once one thinks a bit about it, I actually had to learn the hard way; most illustrations of wedges and hexes show only planar faces, so it took me a while to figure out why rendering my first tetrahedralized meshes seemed to produce some nasty “pockets” of empty space. Well, learned that one, didn’t we?

Anyway, I digress again. How to fix that problem? First obvious idea is to just make sure that all pairs of elements that share a face will always flip the same way; e.g., by always starting the edge at the vertex with the smallest index, or always picking the shorter of the two edges, etc. Again I spent quite a while trying to do just that, only to realize that this isn’t actually that trivial, either: to do this one would have to be able to independently choose the edge orientation in any of a hex’s faces …. but you can’t do that because at least one pair of opposing faces in a hex always has to have the same orientation (in case you’re wondering: tetrahedralizing a hex works by first using a diagonal plane to split it into two wedges, but that means that the two opposing faces split by that plane will have the same edge orientation….). Now I’m sure there’s a global optimization problem somewhere in there that would allow to somehow rescue that idea, but at that point I realized that this is a good deal more tricky than initially thought.

So, what else to do? Instead of always splitting each quad face with an edge into two triangles, I decided to instead insert a new vertex into the center of each such face, and split it into four triangles by connecting this new vertex to the four vertices of that face (for unstructured meshes with scalar per-vertex data we obviously also have to interpolate the scalar values for this new vertex). The obvious downside of this is that one has to create some new vertices, which has all kinds of issues (more vertices, more tets, more memory; the need to interpolate scalar values, etc) …. but if we’re willing to do that we can guarantee that any shared bi-linear patch will always end up with the same four triangles.
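
As a rough sketch of that face split in Python (the function name and the plain list-of-tuples mesh layout are my assumptions for this example, not the layout of my actual code; averaging the four corner scalars is just the simplest way to interpolate):

```python
def split_quad_with_center(vertices, scalars, quad):
    """Append the center of `quad` as a new vertex and return the four
    triangles that connect it to the quad's corners.

    vertices: list of (x, y, z) tuples, scalars: one float per vertex,
    quad: four vertex indices in cyclic order.
    """
    center = len(vertices)  # index the new vertex will get
    vertices.append(tuple(
        sum(vertices[i][axis] for i in quad) / 4.0 for axis in range(3)
    ))
    # per-vertex scalar for the new vertex: average of the four corners
    scalars.append(sum(scalars[i] for i in quad) / 4.0)
    a, b, c, d = quad
    return [(a, b, center), (b, c, center), (c, d, center), (d, a, center)]


vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
scalars = [0.0, 1.0, 2.0, 3.0]
triangles = split_quad_with_center(vertices, scalars, (0, 1, 2, 3))
print(triangles, vertices[4], scalars[4])
```

Each of the four triangles contains exactly one of the original quad edges plus the new center, so the pattern does not depend on which element generates it.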

Having decided to do this for the faces, the next question is how to actually tetrahedralize the elements such that the faces will end up with this pattern. Here some little illustrations:

First, a tet will obviously remain a tet, no changes whatsoever.

Second, let’s look at a pyramid, which has exactly one bilinear face (see sketch below this paragraph, red and blue faces just for illustration): We create a new vertex C in the center of this face (by averaging vertices 0,1,2, and 3; green arrow), then connect this new vertex with the top (4) and the four base vertices (0-3, new edges in green), and end up with four nice tets that, on the bottom face, create exactly the pattern we’re aiming for (green for new edges on that face). Perfect.


(and, yes, i absolutely did steal my kid’s school pencils for this awesomely professional illustration… and free-hand sketching is so much faster than all the fancy programs!)
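
In code the pyramid case could look roughly like this (a Python sketch; the function name is made up, I assume vertices 0-3 are the base in cyclic order and 4 is the top, and I do not try to keep a consistent tet orientation here):

```python
def pyramid_to_tets(pyramid, center):
    """Split a pyramid into four tets around the base-face center.

    pyramid: (v0, v1, v2, v3, top), with v0..v3 the base in cyclic order.
    center:  index of the (already created) vertex in the base center.
    """
    v0, v1, v2, v3, top = pyramid
    base = (v0, v1, v2, v3)
    # one tet per base edge: (edge start, edge end, base center, top)
    return [(base[i], base[(i + 1) % 4], center, top) for i in range(4)]


tets = pyramid_to_tets((0, 1, 2, 3, 4), center=5)
print(tets)
```

Dropping the top vertex from each of these tets leaves the triangles (0,1,C), (1,2,C), (2,3,C), and (3,0,C), ie, the four-triangle pattern on the base.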

Third, let’s look at a wedge with front face 0,1,2 (blue) and back face 3,4,5 (red). We now have three quadrilaterals to deal with, for each of which we want to have the “four triangles” pattern described above. Luckily, there’s an easy way to create just that: we simply ignore the faces completely, and create a new vertex C in the center of the wedge (again, by averaging); now if we connect this new vertex to the existing five faces we end up with two tets (C to front and back triangle), and three pyramids (C to bottom, left and right quadrilateral). For the pyramids we do the same as described above, so our desired pattern appears on all quad faces of our wedge, too … perfect.


One will obviously have to further subdivide the three pyramids in this sketch (ending up with a total of 14 tets: two from the triangles, plus three times four from the pyramids), but that got too much to draw …. and should be obvious.
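
For those that prefer code over sketches, a rough Python version of the wedge case (function and parameter names are made up; I assume the wedge is given as front triangle 0,1,2 and back triangle 3,4,5 with 0-3, 1-4, and 2-5 being the connecting edges, and that the four new vertices already exist):

```python
def wedge_to_tets(wedge, cell_center, face_centers):
    """Split a wedge into tets around a center vertex.

    wedge:        (v0..v5), front triangle v0,v1,v2, back triangle v3,v4,v5
    cell_center:  index of the new vertex in the center of the wedge
    face_centers: indices of the three quad-face centers, in the same
                  order as the `quads` list below
    """
    v0, v1, v2, v3, v4, v5 = wedge
    # the two triangle faces each form one tet with the cell center
    tets = [(v0, v1, v2, cell_center), (v3, v4, v5, cell_center)]
    # the three quad faces each form a pyramid with the cell center,
    # which then gets split into four tets around its face center
    quads = [(v0, v1, v4, v3), (v1, v2, v5, v4), (v2, v0, v3, v5)]
    for quad, face_center in zip(quads, face_centers):
        for i in range(4):
            tets.append((quad[i], quad[(i + 1) % 4], face_center, cell_center))
    return tets


wedge_tets = wedge_to_tets((0, 1, 2, 3, 4, 5), cell_center=6, face_centers=(7, 8, 9))
print(len(wedge_tets))
```

That is 2 + 3 x 4 tets per wedge, every one of them touching the center vertex.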

Finally, compared to the wedge case the hexes are almost trivial: create a center vertex C, then connect it to the six faces, and get six pyramids …. feed those into step 2, and done.
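
Same thing as a Python sketch for the hex (names are made up; I assume the common ordering with bottom face 0,1,2,3 and top face 4,5,6,7, where vertex 4 sits above vertex 0, and that the seven new vertices already exist):

```python
def hex_to_tets(hexa, cell_center, face_centers):
    """Split a hex into tets: six pyramids around the cell center,
    each split into four tets around its face center.

    hexa:         (v0..v7), bottom v0,v1,v2,v3 and top v4,v5,v6,v7
    cell_center:  index of the new vertex in the center of the hex
    face_centers: indices of the six face centers, in the same order
                  as the `quads` list below
    """
    v0, v1, v2, v3, v4, v5, v6, v7 = hexa
    quads = [
        (v0, v1, v2, v3), (v4, v5, v6, v7),  # bottom, top
        (v0, v1, v5, v4), (v1, v2, v6, v5),  # four sides
        (v2, v3, v7, v6), (v3, v0, v4, v7),
    ]
    tets = []
    for quad, face_center in zip(quads, face_centers):
        for i in range(4):
            tets.append((quad[i], quad[(i + 1) % 4], face_center, cell_center))
    return tets


hex_tets = hex_to_tets(tuple(range(8)), cell_center=8, face_centers=tuple(range(9, 15)))
print(len(hex_tets))
```

Six pyramids times four tets gives 24 tets per hex.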


All in all, that algorithm is trivially simple to code up (oh, if only I had found that one earlier ….). Only one thing to consider: thanks to limited floating-point accuracy it is, in theory, possible that two elements sharing a face might end up with slightly different results for the center vertex, which might then throw off the code to find the “matching” triangles. This can, however, easily be avoided by always sorting the four vertex indices before adding the vertices, in which case both elements would always perform the same computations, and end up with exactly the same vertex (note we only have to do that for the face vertices, the inner vertices in the wedge and hex case are only ever created once, anyway).
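
A small Python sketch of that sorting trick (names are made up). Note that the dictionary keyed on the sorted indices is an extra assumption of this sketch: it makes sure each face center is only created once, which is one possible way of finding the “matching” vertex, not necessarily how you would do it in a data-parallel setting:

```python
def face_center(cache, vertices, quad):
    """Return the index of the center vertex of `quad`, creating it
    on first use.

    The four indices are sorted first, so both elements sharing the
    face add up the same numbers in the same order and get a
    bit-identical result, no matter how each one enumerates the face.
    """
    key = tuple(sorted(quad))
    if key not in cache:
        x = y = z = 0.0
        for i in key:  # always accumulate in sorted-index order
            px, py, pz = vertices[i]
            x += px
            y += py
            z += pz
        cache[key] = len(vertices)
        vertices.append((x / 4.0, y / 4.0, z / 4.0))
    return cache[key]


vertices = [(0.1, 0.0, 0.3), (1.1, 0.2, 0.0), (1.3, 1.0, 0.7), (0.0, 0.9, 0.2)]
cache = {}
from_left = face_center(cache, vertices, (0, 1, 2, 3))
from_right = face_center(cache, vertices, (3, 2, 1, 0))  # neighbor's winding
print(from_left, from_right, len(vertices))
```

Both calls return the same vertex index, and only one new vertex gets appended.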

As said above, the algorithm is simple, and foolproof; the main disadvantage is that the additional vertices require more memory, and that it produces more tets than tessellating without introducing new vertices. For example a hex can be represented with as few as 5 tets… but our method would create 24.

Hope that whoever found that page found it useful – I only wish I had found one like this before I tried the other routes. I wouldn’t be surprised at all if lots of people had done exactly that before, too … in fact, I’d be surprised if nobody did…. I just couldn’t find it. Any comments, suggestions, or issues with this: let me know!

PS: Fun fact – this all started with “just” needing to produce some more testing data for our “RTX Beyond Ray Tracing” paper – why not simply take an AMR data set (that we had plenty of), compute the dual mesh, and then tetrahedralize the result … can’t be that hard, right? Well …

“Fun Problems”

It’s been a while since I’ve written anything; for a lot of reasons (it’s 2020, what can I say….).

Anyway. One thing I recently realized is that there’s a ton of stuff I should be writing either papers and/or blog articles about (such as the OWL project, some of my “Moana on GPUs” experiences, or some of my recent work on data-parallel ray tracing) …. but that I don’t get to because I spend far too much time worrying about writing up other things that are fun, but significantly less important. These are typically “side problems” that I ran into while working on something totally different – often things I thought were trivial, but that suddenly turned out to be unexpectedly tricky, and that I (say: google) simply couldn’t find any solutions for.

Anyway. Many of these (solutions to) fun problems do indeed need to get documented if only so I could reference them in my code or papers …. but in “proper paper form” – with previous work, discussions, comparisons to other solutions, and in particular all the insane latex-polishing and perfect-figure-crafting – it just takes up too much time, and distracts from the real problems.

So. To break that log-jam I’ve decided to instead share some of these ideas in blog form; using hand-drawn-and-scanned scribbles rather than perfectly designed illustrations (if only i could have all the time back I spent experimenting with ever new sketching program ….), using wordpress rather than latex (oh my beloved \vspace*{…}, \multicolumn{}, and \includegraphics{}….), and doing away with all the stuff that otherwise takes up so much time. To distinguish those write-ups from any other “update”-style articles I’ll explicitly tag each one with a “Fun Problems:” prefix; in the same spirit as the “ISPC bag of tricks” series i wrote a few years ago. Basically they’re the same category: something worth sharing that’ll hopefully(?) help others; but that’s not worth making a real paper about.

As such: on to the first one – how to tetrahedralize an unstructured mesh with pyramids, wedges, and hexes, without losing proper shared-face connectivity….

Quick note on CUDA/OWL on Ubuntu 20

Not sure if a “blog post” is the right medium for that, but just in case anybody stumbles over this: I just upgraded my ubuntu 18 based laptop to ubuntu 20, and after that ran into some issues compiling my CUDA projects – apparently, the gcc version 9 that Ubuntu 20 ships with isn’t yet supported in CUDA (at least, not in the version I have installed!?), meaning I ended up with an error of

/usr/local/cuda/include/crt/host_config.h:129:2: error:
#error -- unsupported GNU version! gcc versions later than 8 are not supported!

Just in case anybody else stumbles over this, the solution is actually quite simple: First, make sure to install the ‘g++-8’ package:

sudo apt install g++-8

(note: mind the “++”: my machine already had `gcc-8` installed by default, but you need g++-8)

Once installed, just tell nvcc to always use ‘/usr/bin/gcc-8’. If you’re using cmake under linux, that’s as simple as setting CUDA_HOST_COMPILER to /usr/bin/gcc-8; cmake and make should then run through just fine.

Hope that helps.

PS: I also previously experimented with using ‘alternatives’ to change my system to use gcc-8 by default, but that didn’t go all too well: yes, setting gcc-8 as default compiler did make nvcc work, but the next apt update ended up being quite confused …. not good. Setting it in the project seems to work just fine.

PPS: Update – for another machine I also installed U20 “from scratch” (rather than upgrading). In this case, installing U20 was actually an amazingly impressive experience: not only did it install out-of-the-box on an RTX-enabled Thinkpad laptop that I had not gotten any other distribution to run on at all, it even came with a pre-packaged nvidia 440 driver, with apt packages for CUDA, etc. You still need the “g++-8” thing from above, but overall that was a very pleasant experience.

(Short-)Paper Preprint: Using RTX for Glyph Rendering

Haven’t written in a while…. (very busy…) but currently so down with allergies that updating my web presence is pretty much the only thing I’m still useful for, so : Just uploaded another paper pre-print, this time on the following (short) paper:

High-Quality Rendering of Glyphs Using Hardware-Accelerated Ray Tracing
Stefan Zellmann, Martin Aumueller, Nate Marshak, and Ingo Wald
Eurographics Parallel Graphics and Visualization (EGPGV) Short Papers, 2020.

(paper, as usual, on my publications page at SCI)

The core idea behind this paper was to look into how hardware ray tracing – and in particular, the ability to easily and cheaply create millions of “copies” of an object via instancing – would impact glyph rendering. At least in theory, there’s three key benefits that ray tracing brings to the table for glyph rendering: First, the ability to create lots of copies of objects, very cheaply, is good because there’s typically a lot of glyphs involved in rendering an image (and animating them is cheap, too, because all you have to do is change some transforms). Sure, you could previously do this with fragment shaders, too (at least for primary visibility), but with ray tracing you have this somewhat more “cleanly” integrated in the overall rendering system, you have fewer issues with overdraw, etc…. so this should be useful.

Second, the ability to have arbitrary intersection programs allows for more easily creating non-trivial glyph shapes without having to worry about tessellation, which can either be tricky (say, for superquadrics) or – at least when done naively – un-necessarily expensive (say, tessellating millions of arrows into hundreds of triangles each). Again, even before hardware ray tracing you could fix this with fragment shaders, but still….

Third – and something that’s significantly less easy to fix with fragment shaders – with ray tracing it’s relatively easy to add secondary shading effects like shadows, AO, or indirect illumination …. and while this may also make images look “better”, the core motivation for that is that when you draw lots and lots of glyphs without such effects you often end up with just a garbled mess on the screen, where it is hard to see which of the glyphs are actually where relative to the others (fancy jargon: “visual clutter”). In the last few years we’ve seen again and again how much shadows and AO can help with that in, e.g., particle visualization … and whether particles, arrows, superquadrics, or other glyphs – it’s exactly the same problem.

In theory, all three of those advantages were kind-of obvious; the big question just was how well this would work in practice, how much effort it would be, whether there’s any un-foreseen pitfalls, and whether we could actually get this to render fast enough. And as shown in this (short) paper: it does actually work out pretty well: in fact, we even stumbled over some additional ideas we hadn’t initially expected – for example, you can actually use motion blur to convey motion, and I’m sure there’s other things we haven’t looked at yet, too (eg, would motion blur work for uncertainty visualization? For conveying error ranges? Would defocus or warping of the primary rays be useful, too?…?).

Anyway, it’s been a fun paper to play with; Stefan, Nate, and Martin have done a great job at that …. Enjoy!

PS: This was also one of the first (public) papers made with my new “OWL” library, with which building that framework turned out to be pretty easy… (well, that was the idea of OWL! 🙂 ) … but since I realize that the long-planned blog article about OWL still hasn’t actually been written yet I’ll say no more for now …

Digesting the Elephant…

TL;DR: We just uploaded our “experiences with getting moana rendering interactively (and with all bells and whistles!) in embree/ospray” whitepaper to ArXiv:



Quite a while ago (by now), Matt wrote an excellent article named “Swallowing the Elephant” – basically, his experiences with all the unexpected things that happen when you want to make even a well-established renderer like PBRT “swallow” a model on the scale of Disney’s “Moana Island” model.

This Moana model (graciously donated to the research community by Disney about two years ago) is the first time a major studio released a real “production asset” – with unbelievably detailed geometry, millions of instances, lots and lots of textures, full material data, lights, etc…. basically, “the full thing” as it would be used in a movie. The reason that this is/was such a big deal is that when you develop a renderer you need data to test and profile it with, and truly realistic model data is hard to come by – sure, you can take any model and replicate it a few times, but that takes you only so far. As such: Disney, if you’re reading this – you’ve done an immeasurable service to the rendering community, we can’t thank you enough!

While I was still at Intel, we had actually gotten a very early version of this model; we’ve worked on it for quite a while, and it has since been shown several times at major events. Just as a teaser for the rest of this article: Here’s a few “beauty”-shots of our embree/ospray based renderer on this model:


(click on the images for full res versions)

And just to give you an idea, here is a close-up of some of the region under the palm of the previous image:

Digesting the Elephant

Anyway, Matt at some point wrote an article about exactly this experience: What happens when you take even an established renderer that’s already pretty amazing – and then try to drop Moana into it. That article, or actually, series of articles, made for excellent reading, and in particular, rang some bells for us, because “we” (which means I and my now-ex Intel colleagues) had gone through many of exactly the same experiences, too, plus a few additional ones because our goal was not just offline rendering, but getting this thing to render interactively on top of Embree and OSPRay. And to be clear, our goal was not to just get “some” interactive version of moana, but the full thing, with every instance, every texture, curves, subdiv, full path tracing with Disney BRDF, every-frigging-thing... but also, at interactive rates.

And yes, many of the things Matt wrote about sounded incredibly familiar (we had actually worked on that way before his set of articles): you think you have a good software system to start with, you’ve rendered models with lots of triangles or lots of instances before, you’ve done path tracing, cluster based parallel rendering, etc, it’s available in some human-readable file format (even two different ones), so: how bad could it be?

For us as it was for Matt, the answer to this naive “how bad could it be” question turned out to be (to put it mildly) “not as easy as you thought it might be”: In fact, I think I spent an entire week just working on fixing/extending my then-already-existing pbrt-format parser to be able to handle this thing even remotely correctly (or completely), and initially it took hours to parse; and at some point, I vividly remember Jim (my then manager) sending me an additional 512GB of memory for just that purpose (by now we need much less than that, but at the time …). And all that doesn’t include the time spent on several aborted attempts at using the JSON format instead, let alone that at the time we hadn’t even started with the real work of getting that data into the ospray scene graph, into ospray, embree, cluster, etc., let alone dealing with things like PTex textures, subdiv, hair/curves, Disney BRDF, denoising, …. hm.

In fact, the number and kind of pitfalls you run into once you get such a “pushing the boundaries”-model to test with is truly astounding: For example, at the time we started the ospray scene graph still used a lot of strings for maintaining things like names of scene graph nodes or parameters of those nodes; all of that had been used plenty of times before without issues – creating an instance node was only a millisecond or so, which surely can’t be too bad, right…. unless you throw ~100 million instances at it, at which point even a one-millisecond processing time for an instance node suddenly becomes an issue (though 100M milliseconds still sounds benign, it’s actually 100,000 seconds… or roughly 1,700 minutes …. or about 28 hours…. just to create the instance nodes in the scene graph). And there were plenty such examples.
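
Just to spell out that back-of-the-envelope calculation (the one millisecond per instance node is the rough figure from above, not a measurement):

```python
instances = 100_000_000   # ~100 million instance nodes
ms_per_instance = 1.0     # rough cost of creating one instance node

seconds = instances * ms_per_instance / 1000.0
minutes = seconds / 60.0
hours = minutes / 60.0

print(f"{seconds:,.0f} s = {minutes:,.0f} min = {hours:.1f} h")
```

In other words: shaving even a fraction of a millisecond off the per-instance cost saves hours of scene graph setup on a model of this size.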

And just to demonstrate a bit of what one has to deal with in this model geometry-wise (Matt has several similar shots), here’s a few color-coded “instance ID” shots around various places of this model:


Again – every different color in these pics is a different instance of often thousands of triangles. My favorite here: a single Coral in the water off the coast – barely covering a full pixel in some of the overview shots above:


I digress….

Yes, I really love this model… but anyway, I digress.

As said above, reading Matt’s articles reminded me of all that, and once I read those posts I realized that Matt had actually done something pretty useful: not just “battle through” and “make it work”, but actually share what he learned by writing it up – so others would know what to expect, and hopefully have it a bit easier (when writing a piece of software it’s so easy to just say “oh, that will surely never happen” – yes, it does!). As a result, we decided to follow his lead, and to also write up our experiences and “lessons learned”…. and since our task was not only to “swallow” that model into an existing renderer, but actually to also add lots and lots of things to that renderer that it had never been designed to do (remember, OSPRay was a purely “Sci-Vis” focused renderer, then), I concluded that a fitting title for that write-up would be “Digesting the Elephant” (after all, digesting something is what you have to do after you managed to swallow it … :-/).

Anyway, I digress. Again.

The problem with this write-up was that I had since switched to NVidia, so writing articles on my pre-NVidia work wasn’t exactly top priority any more. I spent pretty much all of the 2018(!) Christmas holidays writing up a first draft, then all my ex-colleagues started adding in all the stuff I no longer had access to (plus a lot of new text and results, too, of course) but eventually we all kind-of “dropped” it because we all had other stuff to do. At some point we did submit it as a paper, but an “experiences” story just doesn’t fit that well at academic conferences, so not surprisingly it didn’t get in, and we completely forgot about it for another few months. Finally, a few days ago I accidentally stumbled over the paper repository, and realized we had all completely forgotten about it… but since the paper was already written, I asked around, and we decided we should at least make it publicly available; “technical novelty” or not, I still firmly believe (or hope?) that the story and experiences we share in there will be useful to those treading in our footsteps.

Long story short: We finally uploaded the PDF to ArXiv, which just published it under the following URL:


With that, I do hope that this will be useful; and for those for whom it is : Enjoy!

PS: As usual – any feedback, comments, criticism on what we did: let me know…

PPS: And as still images show only so much, here a video that Will Usher made, uploaded to youtube:

PPPS: and a final little personal plug: If you’re playing with this model – or intending to do so – also have a look at my github “pbrt-parser” library, that I also use when working with this model. It has a binary file format, too, so once the model is converted you can load the full thing in a few seconds as opposed to half an hour of parsing ascii files :-). Link here:

My newest toy: A Jetson Xavier…

Since it was getting closer to Christmas I decided to treat myself to a new toy – so since Friday evening, I’m the proud owner of a shiny, new, Jetson Xavier AGX (Newegg had a special, where you could get one for 60% off…). And since I found some of my experiences at least somewhat un-intuitive, I thought it might help other newbies to just write it up.

Why a Xavier?

Okay, first: Why a Xavier? Having for years worked only on x86 CPUs I had actually wanted to play with an ARM CPU for quite a while (also see this past post on Embree on ARM), but never gotten my hands on one to do so. And in particular after NVidia showed some ARM reference designs at SC this year I really wanted to get my hands on one to experiment with how all my recent projects would do on such an architecture.

Now last year my wife had already gotten me a Raspberry Pi for Christmas – but though this thing is kind-of cute I found myself struggling with the fact that it’s just too wimpy on its own – yes, you can attach a keyboard and a monitor, and it kind-of runs linux, but everything feels just a tiny bit too stripped down (Ubuntu Core doesn’t even have ‘apt’!?), so actually developing on it turned out to be problematic, and cross-compiling is just not my cup of tea (Yes, I understand how it works; yes, I’ve done it in the past, and yes, it’s a …. well, let’s keep this kid friendly). Eventually I want to use “my arm system” also as a gitlab CI runner, and with the Raspberry, that just sounds like a bridge too far.

In contrast, if you look at a Xavier it does have an 8-core, 64-bit CPU in it, 32GB of memory, a pretty powerful Volta GPU, NVidia driver, CUDA, cuDNN, etc – so all in all, this thing should be at least as capable as what I’m doing most of my development on (my laptop), and since it has the full NV software/driver stack and a pretty decent GPU I should in theory be able to not only do automated compile CIs, but even automated testing. So. Xavier it is.

First Steps – or “Where did you say the deep end is?”

OK – now that I got one, first thing you do is plug in a monitor and a keyboard, press the power button, and up comes a Linux. Yay. That was significantly easier than the Pi’s “create an account somewhere, then download an image from there, then burn that”. So far, so good. There’s also a driver install script (so of course, I went ahead and installed that), then there’s your usual ‘apt’ (so I went ahead and did apt update/apt upgrade), and yes, there’s a full linux (so I created a new user account, started installing packages, etc). Just like a regular laptop. Great.

Except – the first roadblock: I have the nvidia driver installed, I have gdm running, have cmake, gcc, etc,  but where’s my cuda? And wait – that driver is from early 2018, I somehow doubt that’ll have Optix 7 on it?

So, start searching for ARM NVidia drivers to install – and there is one, but it’s only 32bit? Wait, what? Turns out that even though the thing looks like a regular laptop/PC, that’s apparently not how you’re supposed to use it, at least not yet, and at least not from the developer tools point of view: The right way to get all that to work – as I eventually had to realize after I had already done the above – is to use the NVidia “JetPack” tool set. Good news: this tool is actually quite powerful – bad news: it flashes your Xavier, so all the time I spent in installing driver, creating users, updating system, etc … hm; I hope you will read this first.

Doing it the right way: JetPack

So, JetPack. The way JetPack works is that you download it for a host system, install it there, and use that host system to flash your Xavier. Initially I was a bit skeptical when I read that, because this entire “host system” smacked a lot of exactly the “cross-compile workflow” I wanted to avoid in the first place. Luckily, it turns out you really only need this for the initial flashing, and for installing the SDK – once that’s done, you no longer need the host system (well, “apparently” – it’s not that I wrote all that much software on it, yet).

OK, so just to summarize: To do it the right way, go to, and download the .deb file (sdkmanager_0.9.14-4964_amd64.deb in my case), then install this with

sudo apt install ./sdkmanager_0.9.14-4964_amd64.deb

Then start the newly installed ‘sdkmanager’ tool, and you should see something like this: [screenshot: sdkmanager]

Now this being me, I had of course already clicked through the first two steps before I took that picture, but all the values in those two steps were correct by default, so there’s not much to show for those first two steps, anyway. SDKManager now downloads a ton of stuff, until in step 4 you can then start installing.

Install Issues

During install, you first have to connect your Xavier – with the accompanying USB cable – to your host PC, then get it to do a factory install. I was a bit surprised it couldn’t just do that through ssh (after all, my system was already up and on the network), but factory reset it is. To do that the tool tells you to “press left button, then press center button, then release both” – which isn’t all that complicated, except … apparently this only works if your Xavier is off (at least mine simply ignored it when it was still on). So, first unplug, plug back in, and then do this button magic. That tiny aside, flashing then went as expected.

Some time later, the Xavier is apparently flashed, and SDKManager wants to install the SDK (driver, CUDA, etc) – for which it apparently also wants to use the USB connection (the weird IP it shows in that dialog is apparently where the IP-over-USB connection is supposed to be!?). Two tiny problems: a) For some reason my host system complained about “error stablishing network connection”, so there was no USB connection. And b), that dialog asks for a username and password to log into the Xavier with, but since you haven’t even created one, yet, what do you use?

Turns out, after the first flash and reboot your Xavier is actually waiting for you to click on “accept license”, create a user account, etc (very helpful to know if your screen has already gone to screensaver, and you unplugged keyboard/mouse to plug in the host USB cable :-/). So before you can even do the SDK install you first have to plug in a keyboard and mouse, accept the license, and create a user account … then you can go back to SDKManager on the host system to install the SDK through that user account.

That is, of course, if your host PC could establish an IP-over-USB connection, which as stated before mine didn’t (and unplugging the host USB to connect keyboard and mouse probably didn’t help, either). Solution: ignore the error messages, plug in an ethernet cable to your Jetson, open a terminal, and do ‘ifconfig’ (or ‘ip addr’) to figure out the IP address of the ethernet network. Then back to the host PC, change the weird default USB IP to the system’s real ethernet IP, and voilà, it starts installing.

And voilà, we have an ARM Development Box…

These little stumbling-stones aside, once the driver and SDK are installed, everything seems to be working just fine: reboot, apt update and apt upgrade, apt install cmake, emacs, etc, and suddenly everything works just exactly as you’d expect it from any other Ubuntu system – the right emacs with the right syntax highlighting, cmake that automatically picks up gcc and cuda, etc, and everything works out of the box.

Now git clone one of my projects from gitlab, run cmake on it, and it even automatically picks up the right gcc, CUDA (only 10.0, but that’ll do), etc; so from here on it looks like we’re good. I haven’t run any ray tracing on it, yet, but once I got my first “hello world” to run last night, it really does look like a regular linux box. Yay!!!

Now where do we go from here? Of course, it only gets real fun once we get OptiX to work; and of course I’d really like to put a big, shiny Titan RTX or RTX8000 into that PCIe slot … but that’s for another time.

PS: Core Count and Frequency

PS – or as Columbo would say “oh, only one last thing” – one last thing I stumbled over when compiling is that the system looked un-naturally slow; and when looking into /proc/cpuinfo it showed only four cores (it should be 8!), and showed “revision 0” for the CPU cores, even though the specs say “revision 2”. Turns out that by default the system starts in a low-power mode in which only four cores are active, at only half the frequency (the ‘rev 0’ is OK, it’s just that the cpu reports something different than what you’d expect – it is a revision 2 core).

To change that, look at the top right of your screen, where you can change the power mode. Click on that, and change it to look like this:


Once done, you should have 8 cores at 2.2 GHz, which makes quite a difference when compiling with “make -j” :-).
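If you would rather check the core count from a script than by eyeballing /proc/cpuinfo, here is a minimal Python sketch. It simply counts the “processor” entries in the file, which is where I saw four instead of eight; the function name is made up for illustration, and this is just one way of doing it.

```python
from pathlib import Path


def count_processors(cpuinfo_text: str) -> int:
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        # Only lines whose key is exactly 'processor' mark a core entry.
        if sep and key.strip() == "processor":
            count += 1
    return count


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(f"{count_processors(cpuinfo.read_text())} cores currently listed")
```

Run it once before and once after changing the power mode; the number it prints should go up accordingly.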

So, for now, that’s it – I’ll share some updates if and when (?) I get some first ray tracing working on those thingys (wouldn’t it be great to have a cluster of a few of those? 🙂 ). But at least for now, I’m pretty happy with it. And as always: feedback, comments, and suggestions are welcome!

Ray Tracing the “Gems” Cover, Part II

As Matt Pharr just let me know: the original Graphics Gems II cover image was in fact a 3D model, and was in fact even ray traced! (that really made my day 🙂 ).

That notwithstanding, the challenge still stands: Anybody out there to create (and donate) a 3D model of that that looks somewhat more “ray trace’y” for us all to play with?

PS: And I promise, we’ll work very hard to ray trace that faster than what was possible back then … I’ll leave it to you to find out how fast that was! (Tip: The book has an “About the Cover” page 🙂 )

Recreating the “Graphics Gems 2” Cover for Ray Tracing?

Usually I use this blog to share updates, such as newly published papers, etc … but this time, I’m actually – kind-of – calling for some “community help”: In particular, I’m curious if there’s any graphics people out there that know how to use a modelling program (I don’t; not if my life depended on it :-/), and that would be able to create a renderable model of the “scene” depicted on the “Graphics Gems 2” book. To be more specific, to create a model that looks roughly like that:

[image: screenshot of the “Graphics Gems 2” cover]

Now you may wonder “where does that suddenly come from?” … so let me give a tiny bit of background: We – Adam Marrs, Pete Shirley, and I – recently brainstormed a bit about some “potentially” upcoming “Ray Tracing Gems II” book (which hasn’t been officially announced, yet, but ignore that for now), and while doing so we all realized how much we are, in fact, channeling the original “Graphics Gems” series (which at that time was awesome, by the way!).

Now at one instance, I was actually googling for the old books (to check if they used “II/III/IV” or “2/3/4” in the title, in case you were wondering), and while doing so stumbled over this really nice cover … which would probably look amazing if that was a properly ray traced 3D model rather than an artist’s sketch. And what could possibly be a better cover image – or test model – used in RTG2 than the original cover from GG2!?

So – if there is anybody around that knows their modelling tools, and is looking for a fun challenge: Here is one! If anybody does model this and has anything to share – ideally full model with public-access license, or even only just images – please send it over. I can of course not promise that we’ll actually use such material in said book (which so far is purely hypothetical, anyway) – but I’d find it amazing if we could find a way of doing so.

PS: Of course, a real-time demo of that scene would be totally awesome, too 🙂

New Preprint: “BrickTree” Paper@LDAV

Aaaand another paper that just made it in: Our “BrickTree” paper (or – using its final, official, title: our paper on “Interactive Rendering of Large-Scale Volumes on Multi-core CPUs“) just got accepted at LDAV (ie, the IEEE Symposium on Large-Data Analysis and Visualization).

The core idea of this paper was to develop some data structure (the “BrickTree”) that had some intrinsic “hierarchical representation” capabilities similar to an octree, but much lower memory overhead … (because if your input data is already terabytes of voxel data, then you really don’t want to spend a 2x or 3x overhead on encoding tree topology :-/). The resulting data structure is something that is more or less a generalization of an octree with NxNxN branching factor, but with some pretty nifty encoding that keeps memory overhead really low, while at the same time having some really nice cache/memory-related properties, and (relatively) cheap traversal to find individual cell values (several variants of this core data structure have been used before; the key here is the actual encoding).
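To make the “octree with NxNxN branching factor” part a bit more concrete, here is a small Python sketch of how a cell coordinate maps to one child index per level in such a tree. To be clear: this is only the generic indexing idea, not the encoding from the paper (which is the actual contribution); the function name and the x-fastest child ordering are assumptions made for illustration.

```python
def child_path(x: int, y: int, z: int, n: int, depth: int) -> list[int]:
    """Child index chosen at each level (root first) to reach cell (x, y, z).

    Assumes a cubic volume of n**depth cells per side and a tree in which
    every inner node has n*n*n children; n == 2 gives a plain octree.
    """
    side = n ** depth
    if not (0 <= x < side and 0 <= y < side and 0 <= z < side):
        raise ValueError("cell lies outside the volume")
    path = []
    for level in range(depth):
        # Side length (in cells) of one child region at this level.
        cell = n ** (depth - 1 - level)
        cx, cy, cz = (x // cell) % n, (y // cell) % n, (z // cell) % n
        # Linearize the 3D child coordinate, x running fastest.
        path.append(cx + n * (cy + n * cz))
    return path
```

With n=2 this is just the usual octree descent with 8 children per node; with n=4 every node has 64 children, so the same volume needs only half as many levels to reach an individual cell.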

Such a hierarchical encoding will, of course, allow for some sort of progressive loading / implicit level of detail rendering, where you can get some first impression of the data set long before the full thing is loaded – because even if your renderer can handle data of that size, loading a terabyte data set can literally take hours to first pixel! (And just to throw this in: this problem of bridging load times is, IMHO, one of the most under-appreciated problems in interactive data vis today: yes, we’ve made giant progress in rendering large data sets once the data has been properly preprocessed and loaded … but what good is an eventual 10-frames-per-second frame rate if it takes you an hour to load the model?!).
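Just to put rough numbers on that “hours to first pixel” claim, here is a back-of-the-envelope Python sketch. Both the 200 MB/s sustained read bandwidth and the “coarse level is 1/64 of the data” ratio are assumptions picked purely for illustration (the latter is what one level up would be with 4x4x4 branching); they are not measurements from the paper.

```python
def load_time_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Time to read num_bytes at a sustained bandwidth, ignoring all overheads."""
    if bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bytes_per_second


TERABYTE = 1e12            # data set size in bytes
ASSUMED_BANDWIDTH = 200e6  # assumption: 200 MB/s sustained read rate

full = load_time_seconds(TERABYTE, ASSUMED_BANDWIDTH)
coarse = load_time_seconds(TERABYTE / 64, ASSUMED_BANDWIDTH)  # assumed coarse level

print(f"full data set: {full / 3600:.2f} hours")
print(f"coarse level:  {coarse:.0f} seconds")
```

Under those assumptions the full terabyte takes well over an hour, while a coarse level shows up in a bit more than a minute, which is the whole point of loading progressively.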

Anyway – read the paper … there’s tons of things that could be added to this framework; I’d be curious to see some of those explored! (if you have questions, feel free to drop me an email!). Maybe most of all, I’d be curious re how that same idea would work on, say, an RTX 8000 – yes, the current paper mostly talks about bridging load times (assuming you’ll eventually load the full thing, anyway), but what is to stop one from simply ending the load once a certain memory budget has been filled!? This should be an obvious approach to rendering such large data, but I’m sure there’ll be some devil or two hiding in the details… so would be curious if somebody were to look at that (if anybody wants to: drop me an email!).
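In its most naive form, that “stop once the budget is full” idea could look like the Python sketch below: load levels coarse to fine, and stop at the first one that no longer fits. This is a hypothetical illustration only; it is not something the paper implements, and the function name and the per-level granularity are made up (a real version would more likely decide per brick).

```python
def levels_within_budget(level_sizes: list[int], budget_bytes: int) -> tuple[int, int]:
    """Return (number of levels loaded, bytes used) for a coarse-to-fine load.

    level_sizes lists the size in bytes of each level, coarsest first.
    Loading stops at the first level that would exceed the budget, so the
    result is always a complete prefix of the hierarchy.
    """
    used = 0
    loaded = 0
    for size in level_sizes:
        if used + size > budget_bytes:
            break
        used += size
        loaded += 1
    return loaded, used
```

For example, with (made-up) level sizes of 10, 80, 640 and 5120 units and a budget of 1000, the first three levels fit and the finest one gets dropped. The devil in the details is then, of course, which parts of the finest level you would actually have wanted.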

Anyway – enough for today; feedback of course appreciated!

PS: Link to PDF of paper is embedded above, but just in case: PDF is behind this link, or as usual, on my publications page.