My newest toy: A Jetson Xavier…

Since it was getting closer to Christmas I decided to treat myself to a new toy – so as of Friday evening, I’m the proud owner of a shiny, new Jetson Xavier AGX (Newegg had a special where you could get one for 60% off…). And since I found some of my experiences at least somewhat un-intuitive, I thought it might help other newbies if I just wrote it up.

Why a Xavier?

Okay, first: Why a Xavier? Having worked for years only on x86 CPUs, I had actually wanted to play with an ARM CPU for quite a while (also see this past post on Embree on ARM), but had never gotten my hands on one. And in particular after NVidia showed some ARM reference designs at SC this year, I really wanted to try how all my recent projects would do on such an architecture.

Now last year my wife had already gotten me a Raspberry Pi for Christmas – but though this thing is kind-of cute, I found myself struggling with the fact that it’s just too wimpy on its own – yes, you can attach a keyboard and a monitor, and it kind-of runs Linux, but everything feels just a tiny bit too stripped down (Ubuntu Core doesn’t even have ‘apt’!?), so actually developing on it turned out to be problematic, and cross-compiling is just not my cup of tea (yes, I understand how it works; yes, I’ve done it in the past; and yes, it’s a …. well, let’s keep this kid-friendly). Eventually I want to use “my ARM system” also as a gitlab CI runner, and with the Raspberry Pi, that just sounds like a bridge too far.

In contrast, if you look at a Xavier, it does have an 8-core, 64-bit CPU, 32 GB of memory, a pretty powerful Volta GPU, the NVidia driver, CUDA, cuDNN, etc – so all in all, this thing should be at least as capable as what I do most of my development on (my laptop). And since it has the full NV software/driver stack and a pretty decent GPU, I should in theory be able to do not only automated compile CIs, but even automated testing. So: Xavier it is.

First Steps – or “Where did you say the deep end is?”

OK – now that I got one: the first thing you do is plug in a monitor and a keyboard, press the power button, and up comes a Linux. Yay. That was significantly easier than the Pi’s “create an account somewhere, then download an image from there, then burn that”. So far, so good. There’s also a driver install script (so of course I went ahead and installed that), then there’s your usual ‘apt’ (so I went ahead and did apt update/apt upgrade), and yes, there’s a full Linux (so I created a new user account, started installing packages, etc.). Just like a regular laptop. Great.

Except – the first roadblock: I have the NVidia driver installed, I have gdm running, have cmake, gcc, etc – but where’s my CUDA? And wait – that driver is from early 2018; I somehow doubt that’ll have OptiX 7 on it?

So, I started searching for ARM NVidia drivers to install – and there is one, but it’s only 32-bit? Wait, what? Turns out that even though the thing looks like a regular laptop/PC, that’s apparently not how you’re supposed to use it, at least not yet, and at least not from the developer-tools point of view: The right way to get all that to work – as I eventually had to realize after I had already done the above – is to use the NVidia “JetPack” tool set (https://developer.nvidia.com/embedded/jetpack). Good news: this tool is actually quite powerful – bad news: it flashes your Xavier, so all the time I spent installing the driver, creating users, updating the system, etc … hm; I hope you’ll read this first.

Doing it the right way: JetPack

So, JetPack. The way JetPack works is that you download it for a host system, install it there, and use that host system to flash your Xavier. Initially I was a bit skeptical when I read that, because this entire “host system” business smacked of exactly the “cross-compile workflow” I wanted to avoid in the first place. Luckily, it turns out you really only need this for the initial flashing, and for installing the SDK – once that’s done, you no longer need the host system (well, “apparently” – it’s not that I’ve written all that much software on it, yet).

OK, so just to summarize: To do it the right way, go to https://developer.nvidia.com/nvidia-sdk-manager, and download the .deb file (sdkmanager_0.9.14-4964_amd64.deb in my case), then install this with

sudo apt install ./sdkmanager_0.9.14-4964_amd64.deb

Then start the newly installed ‘sdkmanager’ tool, and you should see something like this:

[screenshot: sdkmanager]

Now, this being me, I had of course already clicked through the first two steps before I took that picture – but all the values in those steps were correct by default, so there’s not much to show there, anyway. SDKManager then downloads a ton of stuff, until in step 4 you can start installing.

Install Issues

During install, you first have to connect your Xavier – with the accompanying USB cable – to your host PC, then get it to do a factory reset. I was a bit surprised it couldn’t just do that through ssh (after all, my system was already up and on the network), but factory reset it is. To do that, the tool tells you to “press left button, then press center button, then release both” – which isn’t all that complicated, except … apparently this only works if your Xavier is off (at least mine simply ignored it when it was still on). So: first unplug, plug back in, and then do this button magic. That tiny aside, flashing then went as expected.

Some time later, the Xavier is apparently flashed, and SDKManager wants to install the SDK (driver, CUDA, etcpp) – for which it apparently also wants to use the USB connection (the weird IP it shows in that dialog is apparently where the IP-over-USB connection is supposed to be!?). Two tiny problems: a) for some reason my host system complained about an “error establishing network connection”, so there was no USB connection. And b), that dialog asks for a username and password to log into the Xavier with – but since you haven’t even created one yet, what do you use?

Turns out, after the first flash and reboot your Xavier is actually waiting for you to click on “accept license”, create a user account, etc (very helpful to know if your screen has already gone to screensaver, and you unplugged keyboard/mouse to plug in the host USB cable :-/). So before you can even do the SDK install you first have to plug in a keyboard and mouse, accept the license, and create a user account … then you can go back to SDKManager on the host system to install the SDK through that user account.

That is, of course, if your host PC could establish an IP-over-USB connection, which as stated before mine didn’t (and unplugging the host USB cable to connect keyboard and mouse probably didn’t help, either). Solution: ignore the error messages, plug an ethernet cable into your Jetson, open a terminal, and do ‘ifconfig’ (or ‘ip addr’) to figure out the Xavier’s ethernet IP address. Then, back on the host PC, change the weird default USB IP to that real ethernet IP, and la-voila, it starts installing.

And la-voila, we have an ARM Development Box…

These little stumbling-stones aside, once the driver and SDK are installed, everything seems to work just fine: reboot, apt update and apt upgrade, apt install cmake, emacs, etc – suddenly everything behaves exactly as you’d expect from any other Ubuntu system: the right emacs with the right syntax highlighting, a cmake that automatically picks up gcc and CUDA, etcpp – all out of the box.

Now git clone one of my projects from gitlab, run cmake on it, and it even automatically picks up the right gcc, CUDA (only 10.0, but that’ll do), etc; so from here on it looks like we’re good. I haven’t run any ray tracing on it, yet, but once I got my first “hello world” to run last night, it really does look like a regular linux box. Yay!!!

Now where do we go from here? Of course, it only gets real fun once we get OptiX to work; and of course I’d really like to put a big, shiny Titan RTX or RTX8000 into that PCIe slot … but that’s for another time.

PS: Core Count and Frequency

PS – or as Columbo would say “oh, only one last thing” – one last thing I stumbled over when compiling is that the system seemed unnaturally slow; and when looking into /proc/cpuinfo it showed only four cores (it should be eight!), and showed “revision 0” for the CPU cores, even though the specs say “revision 2”. Turns out that by default the system starts in a low-power mode in which only four cores are active, at only half the frequency (the ‘rev 0’ is OK – it’s just that the CPU reports something different than what you’d expect; it is a revision 2 core).

To change that, look at the top right of your screen, where you can select the power mode. Click on that, and set it to look like this:

[screenshot: arm-power]

Once done, you should have 8 cores at 2.2 GHz, which makes quite a difference when compiling with “make -j” :-).

So, that’s it for now – I’ll share some updates if and when (?) I get some first ray tracing working on those thingies (wouldn’t it be great to have a cluster of a few of those? 🙂 ). At least for now, I’m pretty happy with it. And as always: feedback, comments, and suggestions are welcome!

Ray Tracing the “Gems” Cover, Part II

As Matt Pharr just let me know: the original Graphics Gems II cover image was in fact created from a 3D model, and was in fact even ray traced! (That really made my day 🙂 ).

That notwithstanding, the challenge still stands: anybody out there willing to create (and donate) a 3D model of that scene that looks somewhat more “ray-trace-y”, for us all to play with?

PS: And I promise, we’ll work very hard to ray trace that faster than what was possible back then … I’ll leave it to you to find out how fast that was! (Tip: the book has an “About the Cover” page 🙂 )

Recreating the “Graphics Gems 2” Cover for Ray Tracing?

Usually I use this blog to share updates such as newly published papers, etc … but this time I’m actually – kind-of – calling for some “community help”: In particular, I’m curious if there are any graphics people out there who know how to use a modelling program (I don’t; not if my life depended on it :-/), and who would be able to create a renderable model of the “scene” depicted on the cover of the “Graphics Gems 2” book. To be more specific, to create a model that looks roughly like this:

[image: the “Graphics Gems 2” cover]

Now you may wonder “where does that suddenly come from?” … so let me give a tiny bit of background: We – Adam Marrs, Pete Shirley, and I – recently brainstormed a bit about some “potentially” upcoming “Ray Tracing Gems II” book (which hasn’t been officially announced, yet, but ignore that for now), and while doing so we all realized how much we are, in fact, channeling the original “Graphics Gems” series (which at that time was awesome, by the way!).

Now at one point I was actually googling for the old books (to check whether they used “II/III/IV” or “2/3/4” in the title, in case you were wondering), and while doing so I stumbled over this really nice cover … which would probably look amazing if it were a properly ray traced 3D model rather than an artist’s sketch. And what could possibly be a better cover image – or test model – for RTG2 than the original cover from GG2!?

So – if there is anybody around who knows their modelling tools and is looking for a fun challenge: here is one! If anybody does model this and has anything to share – ideally a full model with a public-access license, or even just images – please send it over. I of course can’t promise that we’ll actually use such material in said book (which so far is purely hypothetical, anyway) – but I’d find it amazing if we could find a way of doing so.

PS: Of course, a real-time demo of that scene would be totally awesome, too 🙂

New Preprint: “BrickTree” Paper@LDAV

Aaaand another paper that just made it in: our “BrickTree” paper (or – using its final, official title – our paper on “Interactive Rendering of Large-Scale Volumes on Multi-core CPUs”) just got accepted at LDAV (i.e., the IEEE Symposium on Large-Data Analysis and Visualization).

The core idea of this paper was to develop a data structure (the “BrickTree”) that has some intrinsic “hierarchical representation” capabilities similar to an octree, but with much lower memory overhead … (because if your input data is already terabytes of voxel data, then you really don’t want to spend a 2x or 3x overhead on encoding tree topology :-/). The resulting data structure is more or less a generalization of an octree with an NxNxN branching factor, but with some pretty nifty encoding that keeps memory overhead really low, while at the same time having some really nice cache/memory-related properties and (relatively) cheap traversal to find individual cell values (several variants of this core data structure have been used before; the key here is the actual encoding).
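
To make that a bit more concrete, here is a deliberately naive sketch of such an N^3-branching structure and its cell lookup – to be clear, this is only an illustration of the traversal logic, not the paper’s actual (much more compact) encoding; all names here are made up for this post:

// Deliberately naive sketch of an N^3-branching "brick tree" - NOT the
// paper's actual encoding (which is far more compact), just an illustration
// of the traversal logic. Each brick stores N*N*N cell values plus, per
// cell, the index of the brick that refines it (or NO_CHILD for a leaf).
#include <cstdint>
#include <vector>

template <int N>
struct NaiveBrickTree {
  static constexpr uint32_t NO_CHILD = ~0u;

  struct Brick {
    float    value[N * N * N];   // per-cell (averaged) value
    uint32_t child[N * N * N];   // index of refining brick, or NO_CHILD
  };

  std::vector<Brick> brick;      // brick[0] is the root

  // look up the finest available cell value at integer coordinate (x,y,z),
  // assuming the root covers a domain of N^numLevels cells per dimension
  float lookup(int x, int y, int z, int numLevels) const
  {
    int cellWidth = 1;           // width (in finest cells) of one cell of the current brick
    for (int i = 1; i < numLevels; ++i) cellWidth *= N;

    uint32_t brickID = 0;
    while (true) {
      const Brick &b   = brick[brickID];
      const int cellID =
          ((x / cellWidth) % N) +
          ((y / cellWidth) % N) * N +
          ((z / cellWidth) % N) * N * N;
      if (cellWidth == 1 || b.child[cellID] == NO_CHILD)
        return b.value[cellID];  // leaf, or finest data available
      brickID    = b.child[cellID];
      cellWidth /= N;
    }
  }
};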

Such a hierarchical encoding will, of course, allow for some sort of progressive loading / implicit level-of-detail rendering, where you can get a first impression of the data set long before the full thing is loaded – because even if your renderer can handle data of that size, loading a terabyte data set can literally take hours to first pixel! (And just to throw this in: this problem of bridging load times is, IMHO, one of the most under-appreciated problems in interactive data vis today: yes, we’ve made giant progress in rendering large data sets once the data has been properly preprocessed and loaded … but what good is an eventual 10-frames-per-second frame rate if it takes you an hour to load the model?!)

Anyway – read the paper … there are tons of things that could be added to this framework; I’d be curious to see some of those explored! (If you have questions, feel free to drop me an email.) Maybe most of all, I’d be curious how that same idea would work on, say, an RTX 8000 – yes, the current paper mostly talks about bridging load times (assuming you’ll eventually load the full thing, anyway), but what is to stop one from simply stopping the load once a certain memory budget has been filled!? This seems an obvious approach to rendering such large data, but I’m sure there’ll be a devil or two hiding in the details… so I’d be curious if somebody were to look at that.

Anyway – enough for today; feedback of course appreciated!

PS: Link to PDF of paper is embedded above, but just in case: PDF is behind this link, or as usual, on my publications page.

IEEE Short Paper Preprint: RTX Accelerated Space Skipping and Adaptive Sampling of Unstructured Data

As a follow-up to our “RTX Beyond Ray Tracing” paper, we (Nate, Will, Valerio, and I) also worked on using that paper’s sampling technique within a larger rendering context: we took that technique and added some nifty machinery for efficient space skipping and, in particular, adaptive sampling on top of it. And happy to say: the paper finally got accepted in the IEEE Vis 2019 short papers track, so Nate will present it in Vancouver later this year.

In the meantime I’ve already uploaded an “author’s preprint” of that paper to my publications page (http://www.sci.utah.edu/~wald/Publications/index.html), so if you want to read about it before his talk, please feel free to download and read.

The core idea of that paper is, of course, to realize that in a time where we all have access to fast “traceRay()” operations (in our case on the GPU, but the same should work on a CPU, too), you can use exactly that operation for some sort of hierarchical acceleration of volume rendering, too. In our case, we build a hierarchy over regions of the volume, compute min/max opacity for those regions (for a given transfer function), then use this traceRay operation to step through the leaves of this data structure. For each leaf you can then either skip it, or adjust the sample rate, based on how “important” it is (at least for the space-skipping side of that, a similar idea was recently proposed by David Ganter at HPG, too, though in a somewhat different context).
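
To make the skip-or-refine logic a bit more tangible, here is a much-simplified sketch – emphatically not our actual implementation; nextLeafAlongRay() and sampleOpacity() are hypothetical placeholders for the hardware-accelerated leaf traversal and for sample reconstruction plus transfer-function lookup:

// Much-simplified sketch of the skip-or-refine logic described above; NOT
// the paper's actual code. nextLeafAlongRay() and sampleOpacity() are
// hypothetical placeholders for the hardware-accelerated leaf traversal
// and for sample reconstruction + transfer function lookup, respectively.
#include <math.h>

struct Ray { float org[3], dir[3]; };

struct LeafInterval {
  bool  valid;        // false once the ray has left the volume
  float t0, t1;       // ray parameter range covered by this leaf
  float maxOpacity;   // max opacity any cell in this leaf can have,
                      // given the current transfer function
};

LeafInterval nextLeafAlongRay(const Ray &ray, float t);   // hypothetical
float        sampleOpacity(const Ray &ray, float t);      // hypothetical

float integrateOpacity(const Ray &ray, float baseStep)
{
  float alpha = 0.f;
  float t     = 0.f;
  while (alpha < 0.98f) {                     // early ray termination
    const LeafInterval leaf = nextLeafAlongRay(ray, t);
    if (!leaf.valid) break;                   // ray has left the volume
    if (leaf.maxOpacity <= 0.f) {             // region can't contribute:
      t = leaf.t1;                            // ... skip it entirely
      continue;
    }
    // adaptive sampling: take coarser steps in low-importance regions
    const float step = baseStep / fmaxf(leaf.maxOpacity, 0.1f);
    for (t = fmaxf(t, leaf.t0); t < leaf.t1; t += step)
      alpha += (1.f - alpha) * sampleOpacity(ray, t);
  }
  return alpha;   // color compositing omitted to keep the sketch short
}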

Details in the paper; go read it… enjoy!

PS: Of course the technique is not restricted to tet meshes. That’s what we demonstrated it on, but it should work for any other type of volumetric data…


HPG Paper Preprint: Using RTX cores for something other than ray tracing….

[image: teaser1]

Hey – just a quick heads-up: our HPG (short) paper on using RTX cores for something other than ray tracing got accepted; and for everybody interested in that, I just uploaded an “author’s preprint” (i.e., without the final edits) to my usual publications page (at http://www.sci.utah.edu/Publications).

The core of this paper was to play with the question of “now that we have this ‘free’ hardware for tracing rays, what else can we use it for, in applications that wouldn’t otherwise use these units?” – after all, the hardware is already there, it’s actually doing some non-trivial tree traversal, it’s massively powerful (billions of such tree traversals per second!), and if it’s not otherwise being used, then pretty much anything you can offload to it is a win… (and yes, pretty much the first thing we tried worked out well).

For this paper we only looked into one such application (point location in a tet-mesh volume renderer), just as a “proof of concept” … but yes, there are plenty of other places where it’d make sense: I’ve already used it for some AMR rendering, too (same basic concept), and there’s sure to be more. If you play with it and find some interesting uses, let me know – I’m curious to see what others will do with it!
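
For the curious, here is a very rough sketch of what that point-location mapping boils down to – emphatically not our actual OptiX implementation, just the basic idea, with traceClosestHit() as a hypothetical stand-in for the hardware trace call and made-up per-face tables:

// Rough conceptual sketch only - not the paper's actual code. The tet mesh's
// faces are fed to the ray tracing hardware as triangles; each face knows
// which tet lies on its front and on its back side. A point query then
// becomes a single ray traced from the sample point in an arbitrary (fixed)
// direction: the closest face we hit must belong to the tet containing the
// point, and the side we hit it from tells us which of the two tets that is.
struct FaceHit {
  bool valid;        // did the ray hit anything at all?
  int  faceID;       // which shared-face triangle was hit
  bool frontFacing;  // did we hit its front side or its back side?
};

// hypothetical stand-in for the actual hardware-accelerated trace call
FaceHit traceClosestHit(const float org[3], const float dir[3], float tMax);

// per-face tables, built once on the host (hypothetical names);
// -1 marks a boundary face with no tet on that side
extern const int *tetOnFrontSide;
extern const int *tetOnBackSide;

int locateContainingTet(const float samplePos[3])
{
  const float dir[3] = { 1.f, 0.f, 0.f };     // any fixed direction will do
  const FaceHit hit  = traceClosestHit(samplePos, dir, 1e20f);
  if (!hit.valid) return -1;                  // sample lies outside the mesh
  return hit.frontFacing ? tetOnFrontSide[hit.faceID]
                         : tetOnBackSide[hit.faceID];
}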

This has been a lot of fun to work on – hope you’ll like it, and enjoy reading it, too…

PS:

Preprint here: http://www.sci.utah.edu/Publications

Full citation: RTX Beyond Ray Tracing – Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location, Ingo Wald, Will Usher, Nathan Morrical, Laura L Lediaev, and Valerio Pascucci, Proceedings of High Performance Graphics (HPG) 2019. (to appear).

“Accidental Art”: PBRT v3 ‘landscape’ model in my RTOW-OptiX Sample…

Just produced an (accidental) pic that in some undefinable way struck me – dunno why, but IMHO it “got something” – and wanted to quickly share it (see below).

For the curious: the way that pic was produced was that I took the latest version of my pbrt parser (https://github.com/ingowald/pbrt-parser), hooked it up to my RTOW-in-OptiX sample (https://github.com/ingowald/RTOW-OptiX), and ran that on a few of the PBRT v3 sample models (https://pbrt.org/scenes-v3.html). And since that RTOW-in-OptiX sample can’t yet do any of the PBRT materials, I just assigned Pete’s “Lambertian” model (with a random per-material albedo value), which for the PBRT v3 “landscape” (view0.pbrt) produced the following pic. And I personally find it kind-of cute, so … enjoy!

PS: The Buddha with a random Metal material looks cool, too 🙂

[image: landscape.accidental-art.jpg]

PBRTParser V1.1

For all those planning on playing around with Disney’s Moana Island model (or any other PBRT-format models, for that matter): check out my recently re-worked PBRTParser library on github (https://github.com/ingowald/pbrt-parser).

The first version of that library – as I first wrote it a few years ago – was rather “experimental” (in plain English: “horribly incomplete and buggy”), and only did the barest necessities to extract triangles and instances for some ray traversal research I was doing at the time. Some brave users had already tried using that library, but as I just said, it was never really intended as a full parser, didn’t do anything meaningful with materials, etc… so, bottom line, I’m not really sure how useful it was back then.

Last year, however – when I/we first started playing around with Moana – I finally dug up that old code, and eventually fleshed it out to a point where we could use it to import the whole of Moana – now also with textures, materials, lights, and curves – into some internal format we were using for the 2018 Siggraph demo. That still didn’t do anything more than required for Moana (e.g., it only did the “Disney” material, and only Ptex textures), but anyway, that was a major step – not so much in functionality, but in completeness, robustness, and general “usability”.

And finally, after switching employers (and thus no longer having access to that ospray-internal format) – yet still wanting to play with this model – I spent some time on and off over the last few months cleaning that library up even more, and fleshing it out to the point that it (apparently?) reads all PBRT v3 models, and, in particular, to a point where all materials, textures, etc, are “fully” parsed into specific C++ classes (rather than just sets of name:value pairs, as in the first version). And maybe best of all – in particular for those planning on playing with Moana!: the library can not only parse existing ASCII .pbrt files, but can also load and store any parsed model in an internal binary file format that is a few gazillion times faster to load and store than parsing those ASCII files (to give you an idea: parsing the 40 GB of PBRT files for Moana takes close to half an hour … reading the binary takes … wait … less time than it took me to write that sentence).

Mind – this is still not a PBRT “renderer” of any sort – but for those that do want to play around with PBRT-style renderers (and, in particular, with PBRT-style models!), this library should make it trivially simple to at least get the data in, so you can then worry about the renderer. In particular, it should by now be complete and stable enough to work out of the box. No, I cannot guarantee the library’s working state on Windows or Mac (it did compile at some point, but I’m not regularly testing those two), but at least on Linux I’d expect it to work – and I will gladly help fix whatever bugs come up. Of course, despite all these claims about completeness and robustness (and yes, I do use it on a daily basis): this is an open-source project, and I’m sure there will be some bugs and issues as soon as people start using it on models – or in ways – that I haven’t personally tried yet. If so: I’d be happy to fix them, just let me know (preferably on gitlab).

Anyway: if you plan on playing with it, check it out on either github or gitlab. I will keep those two sites’ repositories in sync (easy enough with git …), so they should always contain the same code, at least in the master branch. However, gitlab is somewhat easier to use with regard to issue tracking and, in particular, merge requests by users, so if you do plan on filing issues or sending merge requests, I’d suggest gitlab. Of course, any criticism, bugs, issues, or requests for improvement are highly appreciated…

Enjoy!

PS: Just to show that it can really parse all of Moana, I added a little screenshot of a normal shader from a totally unrelated renderer of mine. Note that the lack of shading information is entirely due to that renderer; the parser does have the full material and texture information – it’s just that the renderer doesn’t support all effects yet, so I don’t want to prematurely post any fancier images of it.

RTOW in OptiX – Fun with CuRand…

Bottom line: with the new random number generator, the RTOW-OptiX sample on Turing now runs in ~0.5 secs…

Since several people have asked for Turing numbers for my “RTOW in OptiX” example, I finally sat down and ran it. First result – surprise: in my original code there was hardly any difference between using Turing and Volta – and that just didn’t make sense. Sure, you do still need a special development driver to even use the Turing ray tracing cores from within OptiX, but I actually had that, so why didn’t it get faster? And sure, there’s only so much speedup you can expect in a scene that doesn’t have any triangles at all, and only a very small number of primitives to start with. But still, that didn’t add up. There also was hardly any difference between the iterative and recursive versions … and none of that made sense whatsoever.

Well – in cases like that, a good first step is always to have a look at the assembly (excuse me: PTX) code that one’s code is actually generating. In our OptiX example, that’s actually super-easy: not only is PTX way easier to read than regular assembly, the very nature of OptiX’s “programs” approach also means that you don’t have to sift through an entire program’s worth of asm output to find the one function you’re interested in… instead, you only look at the PTX code for the one kernel that you care about. And even simpler, the cmakefile already generates all these ptx files (that’s the way OptiX works), so getting at them was trivial.

Now, looking at the ray gen program, I was – for lack of a better word – “dumbfounded”: thousands of lines of cryptic PTX code, with movs, xors, loads, and stores, all apparently randomly thrown together, and hardly anything that looked like “useful” code. Clearly my “actual” ray gen program was at the end of this file, and looked great – but what was all that other stuff?? No wonder this wasn’t any faster on Turing than on Volta – all it did was garble memory!

Turns out the culprit was something I had absolutely not expected: CuRand. I hadn’t even known about curand before I saw Roger Allen’s CUDA example, but when I first saw it, it looked like an easy-to-use equivalent to Pete’s use of drand48(), so I simply used it for my sample, too. Now, CuRand does indeed seem to be a very good random number generator, and to have some really nice properties – but it also has a very, very – did I say: very! – expensive set-up phase, where it takes something like a 25,000-element scratchpad and garbles around in it. And since I ran that once per pixel, it turns out that just initializing the random number generator was more expensive in this example than all the rendering taken together…
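
Just to illustrate the pattern I mean (in plain CUDA, not the OptiX sample’s literal code – consider the kernel and names illustrative): one curand_init() per pixel per launch, which for the default XORWOW generator is far more expensive than the few random numbers actually drawn afterwards:

// Plain-CUDA illustration of the costly pattern described above - not the
// OptiX sample's literal code. The point is the one curand_init() call per
// pixel per launch: for curand's default XORWOW generator that setup is far
// more expensive than the handful of random numbers actually drawn here.
#include <curand_kernel.h>

__global__ void renderPixel(float *out, int width, int height,
                            int samplesPerPixel, unsigned long long seed)
{
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;
  const int pixelID = x + y * width;

  curandState rng;
  curand_init(seed, /*subsequence=*/pixelID, /*offset=*/0, &rng);  // <- the expensive part

  float sum = 0.f;
  for (int s = 0; s < samplesPerPixel; ++s)
    sum += curand_uniform(&rng);   // stand-in for tracing one actual sample
  out[pixelID] = sum / samplesPerPixel;
}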

Of course, the solution to that was simple: Pete already used ‘drand48()’ in his reference CPU example, and though that function doesn’t exist in the CUDA runtime, it’s trivially simple to implement. I threw that into my example – and took curand out – and lo and behold, my render time went down to something like 0.5 sec. In that variant I also saw exactly what I had expected: iterative is way faster than recursive, and Turing way faster than Volta. Of course, changing the random number generator also changed the image (I haven’t looked in detail yet, but it “feels” as if the curand image was better), and it of course also made the Volta code faster. Either way – for now, 500ms is good with me 🙂
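
For reference, a drand48()-style generator along those lines – a minimal sketch using libc’s standard 48-bit LCG constants; not necessarily line-for-line what’s in the sample, but it shows how little state and setup this approach needs:

// Minimal sketch of a drand48()-style generator for the GPU: the standard
// 48-bit LCG (same multiplier/increment as libc's drand48), one 64-bit state
// word per ray/pixel. Not necessarily identical to the sample's code - just
// an illustration of how cheap both the state and the per-pixel init are.
#include <stdint.h>

struct DRand48 {
  uint64_t state;

  __device__ void init(uint64_t seed)
  {
    // cheap per-pixel init: same state layout that srand48() uses
    state = (seed << 16) | 0x330Eull;
  }

  __device__ double next()
  {
    // advance the 48-bit LCG, then map the state to [0,1)
    state = (0x5DEECE66Dull * state + 0xBull) & ((1ull << 48) - 1);
    return double(state) / double(1ull << 48);
  }
};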

With that – back to work….