Sunday, December 4, 2011

libobjects++ : A Python-esque library for C++

Nearly exactly a year ago, I wrote about C+Objects: A Python-esque library for C/C++. The goal of this library was to ease programming in C/C++ in such a way that the code would resemble Python (to a certain extent) and that it would make programming in C/C++ as easy as it is in Python. I implemented the library in C++, which wasn't very successful, and last summer I implemented a version of it in standard C. The standard C version handles very much like Objective-C (rather than like Python); all objects are pointers. All objects are manually reference counted using retain() and release() calls. There is a print() function that prints any Object. It has runtime typing information and even features inheritance (in standard C!).

After a while I did get a bit tired of the interface though. Standard C has no this pointer for objects so you have to pass a pointer along as first argument to the method functions. Secondly, there is no operator overloading. For example, for the Array class you would call arrayItemAt(arr, idx) rather than just arr[idx]. This was particularly bothersome because in C++ you can overload the subscript operator and make it accept this syntax. Furthermore, standard C has no references like C++ has, so you can't address an array item and modify it on the spot.
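To illustrate the difference, here is a tiny sketch (a made-up class purely for illustration, nothing to do with the actual libobjects code) of what operator overloading buys you in C++:

class IntArray {
    int items[16];

public:
    /* returning a reference lets the caller modify the item on the spot */
    int & operator[](int idx) { return items[idx]; }
};

void demo(void) {
    IntArray arr;
    arr[3] = 42;    /* C++: natural subscript syntax, item modified in place */
    /* the standard C interface needed a call like arrayItemAt(arr, 3) instead,
       and it could not hand back a modifiable reference */
}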

So even though C++ is not my favorite language (C++ is too complicated to get any work done quickly), it does have some neat features that I needed for libobjects to work the way I wanted. A recap of the would-like-to-haves:
  • everything is an Object
  • objects can be stored into containers: arrays, lists, dictionaries
  • containers are also Objects (see first bullet)
  • container classes like arrays should only store pointers
  • objects are automatically reference counted
  • strings should be copy-on-write
  • number objects should be copied, not referenced
  • adding new user-defined classes should be made easy

For libobjects++ (the all new and shiny C++ rewrite of libobjects) I made a couple of design decisions that changed the game for the better.

One of the problems I was wrestling with was this: in C++ you can have an instance go out of scope, and it will self-destruct. If you wish to retain this object, you would have to make a copy using new and keep that pointer around somehow. For this reason, in the old C++ implementation, every Object was a proxy that had a pointer to a reference counted backend. It didn't work very nicely because it required you to add two classes for every newly defined class: one for the proxy and one for the backend.

The solution to this problem has multiple aspects. Firstly, the new implementation has the convention that the backend is only a struct. It follows that all the method code that fiddles with the structure fields lives in the proxy class. Secondly, the Object proxy class itself has no real data fields; it only has a pointer to the data structure. Thirdly, the data structure has been encapsulated in a reference counting back end object. This encapsulation has been realized by means of a C++ template. Templates can be hard to write, but they ensure correct typing, which is a good thing when you are building robust code. The Object class acts much like a smart pointer to the original data structure. The user of the Object class sees only the data structure without ever knowing that it is automatically being reference counted. I say it acts much like a smart pointer because it isn't really one: the user of the class doesn't bother with pointers at all. The user mostly works with the Integer number class, the String class, and the Array class, which are derived from Object.

A note about reference counting: in my opinion it is best implemented as intrusive reference counting. Intrusive means the counter itself is part of the data structure, i.e. part of the object that is being reference counted. The least intrusive way of adding intrusive reference counting is to wrap the whole data structure in a new class that includes the reference count. Note how this is different from having a reference counted base class that the object must be derived from (see NSObject in Objective-C, which doesn't have templates and therefore solved it that way). Another note: I started out with a non-intrusive reference counting design, in which the reference count lives in a smart pointer that points at the referenced object. This costs another “hop” and therefore (a little) performance.
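To make the idea a bit more concrete, here is a minimal sketch of such a design: a template wraps a plain data struct together with an intrusive reference count, and a thin front end class holds nothing but a pointer to that wrapper. This is not the actual libobjects++ source; the names and the StringData fields are made up for illustration.

#include <cstddef>

/* back end: wraps any plain data struct together with an intrusive refcount */
template <typename T>
struct RefCounted {
    T data;
    int refcount;

    RefCounted() : data(), refcount(1) {}
    void retain()  { refcount++; }
    void release() { if (--refcount == 0) delete this; }
};

/* a plain data struct; no methods, per the convention described above */
struct StringData {
    char *buf;          /* a real String would free this when the back end dies */
    std::size_t len;
};

/* front end: no data fields of its own, just a pointer to the back end */
class String {
    RefCounted<StringData> *backend;

public:
    String() : backend(new RefCounted<StringData>()) {}
    String(const String &other) : backend(other.backend) { backend->retain(); }
    ~String() { backend->release(); }

    String & operator=(const String &other) {
        other.backend->retain();    /* retain first, so self-assignment is safe */
        backend->release();
        backend = other.backend;
        return *this;
    }

    std::size_t length() const { return backend->data.len; }
};

Copying a String then only copies a pointer and bumps the counter; the character data itself is shared, which is also the natural starting point for making strings copy-on-write.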

Although the internals of libobjects++ were kind of hard to write, it is a joy to use and see in action. I'd post a snippet of demo code, but libobjects++ is not fully finished at this point and not available for download yet either. The first things to be added next are the Array and Dictionary classes. After that I would like to add a File, Buffer, Thread, and maybe Socket. These would simply be re-implementations of older code I have lying around. Adding multi-threading to a reference counting library is another problematic topic that I won't go into any deeper right now.

The best way to think of libobjects++ is as a useful replacement for the STL. All this basic functionality is available in the STL and Boost, but somehow those never quite did it for me. Their interfaces simply do not handle as easily as Python's or libobjects++'s.

I'd like to add that although libobjects++ is really nice (and fully type safe), Python handles things quite fundamentally differently. In Python, a variable is a pointer to a backend of any type. When you, for example, index the variable as if it were an array, Python will try to treat it as an array. If you then address the variable like it's a string, Python will treat it like a string. A runtime type check is made to see whether the object really supports a given method or operator, and if it doesn't, an exception is thrown. Think for a moment about how different that approach is from C++'s compile time type checking and my (now seemingly straightforward) implementation of libobjects++.
I generally consider Python's loose typing one of its great strengths, but sometimes all this runtime checking can be a major nuisance. Other than a basic syntax check, there appears to be hardly any compile time checking performed at all.

I could go on ranting about this topic for days, but to conclude this blog entry, I would like to give you something to ponder: C++ is becoming a language of decades past, while Python seems to stay overshadowed by Java and JavaScript.
(Not that it matters much, every language has its uses).

Update: I ran into a problem with the templated back end. C++ has strict typing, so it became nearly impossible to create a generic container class that can hold any kind of item. There are ways around this ... I actually started over yet again with a design that is a compromise between the old and the new way: it has two classes, a front end and a back end. By convention, the back end does not implement many methods. The back end declares the front end as a friend, so the front end can manipulate the back end directly. This works pretty well so far ...

Monday, November 14, 2011

Thread safety and thread specific data

Every once in a while I work again on my Pythonesque / Objective-C-ish library for standard C. One of the goals of this library is simplicity and ease of use. One way to arrive at an interface that is easy to use is to use static variables. Statics keep state around and save it for later. A negative side-effect of statics, however, is that they break thread safety. In a multi-threaded program you cannot have static variables because every thread works on the same memory.

Thread safety is an important issue. A library that is not thread-safe can only work correctly when used in serial code. Today's computers have multiple cores, and tomorrow's computers will have many (and I mean many) cores. Although in some cases it is alright to use a forking model, it is generally not a bad thing to use threading whenever and wherever you can.

For synchronized access to global variables there are mutexes, semaphores and condition variables. For access to static variables we need something totally different. Each thread should have its own copy of the static variable, because the static keeps state and the state may be different for each thread. This is called thread-specific data. The creators of the POSIX thread library recognized the need for an interface to handle thread-specific data, so they included some functions for this in their API.

The answer to the problem of converting static variables into thread-specific data lies in using the functions pthread_key_create(), pthread_getspecific() and pthread_setspecific().
#include <pthread.h>
#include <stdlib.h>     /* for malloc() and free() */

pthread_key_t key_buf = (pthread_key_t)0L;

void func(void) {
    /* static char buf[64];   old code; not thread-safe */

    char *buf = pthread_getspecific(key_buf);
    if (buf == NULL) {
        if ((buf = (char *)malloc(64)) == NULL)
            error();

        if (pthread_setspecific(key_buf, buf) != 0)
            error();
    }
    ... do something ...
    ... save state into buf ...
}

void initialize(void) {
    if (key_buf != (pthread_key_t)0L)
        return;

    if (pthread_key_create(&key_buf, free) != 0)
        error();
}
In the code example above, key_buf is a key that is used to get access to the thread-specific data. The key must first be created (or initialized). This can be done in a serial part of the code, or you can put a mutex lock around the code that creates the key. Once the key has been created, it exists for all threads, even for threads that will be created later.
In func() we obtain the thread-specific buffer using the key for that thread-specific data. If this thread does not have its own allocated buffer yet, we allocate one and register it as this thread's specific data.
What if the thread exits? We allocated a buffer for this thread, so would it leak memory? The answer is no; when we created the key, we passed free() as second argument to pthread_key_create(). The second argument is the destructor; it gets called when the thread exits. Note that the destructor is bound to the key, so you must use each unique key in the same way for every thread. (Which makes perfect sense, but I'm just tellin' ya).

Concluding, thread-specific data is the perfect mechanism for keeping state in threads, where you would normally – in purely serial code – use static variables.

Sunday, November 6, 2011

Portal rendering engine (or not)

All this thinking and writing about the id Tech 5 engine in my last post brought back a lot of memories of DOOM and Quake. It sparked an old desire in me to develop a 3D engine for a first person game. As computers have become faster over the years, making a DOOM-like game now is a lot easier than it was in 1993. On the other hand, people also expect to see more in a game now than back then. It shouldn't be your goal to do better than Rage or to write the next Crysis. As an amateur, you cannot set the bar that high for yourself, or you will fail. Yet it is still a challenge to create an engine that will allow you to walk smoothly through fairly large data sets.

The odd thing about DOOM is that it's hard to comprehend how its internals work because it's not really 3D and uses complex trickery rather than straightforward OpenGL to draw the scene. In 1993 there were no 3D hardware graphics cards for the PC and, even worse, the strongest CPUs of that day had a hard time doing 3D math. Therefore, DOOM employs techniques that you would not use nowadays.
I find DOOM interesting though for thought experiments. It works with a 2D map, which simplifies things greatly. The map may be 2D, but it is all vector based and it does include height. Starting out in 3D from nothing with little experience at all is a daunting task. Therefore I say it is a-okay to start out in a vector based 2D world and work up from there.

Imagine a floor plan of a building and having to visualize a first person view, walking through that building. Note that I start out by considering only a single floor and not a full apartment tower block.
Suppose the player is looking down a corridor. Left and right are closed doors to offices. Further on, an office door is open on the left side. At the end, the corridor continues to the right.
The engine may have loaded the entire map into memory, but there is no reason to calculate nor draw any geometry for rooms that are not visible. So, you want to take some scissors and clip away the parts of the map that can not be seen from the current player's position. It's easy to do when you look at it, but how can we do this in a computer program?
The hardest part of the whole rendering engine is determining what walls are visible and what walls are not, from any position on the map, looking into any direction.

The answer lies in dividing the map up into sectors. This is not based on a rectangular grid; a sector is merely a collection of walls. The corridor itself is such a sector. The office room on the left is a sector. The part at the end, where the corridor takes a right turn, is a new sector. These sectors are connected through doors, but also through invisible doors that are always open; they are just lines on the map. These are portals. You can see through them and you can walk through them. Portals can also be small windows with bars, if you like.

Back to our rendering engine. The player is in sector 1. The engine considers drawing the walls in sector 1. During this process, it also encounters a portal to sector 2 (office room) and a portal to sector 3 (corridor takes a right turn). The player can see inside these sectors so the engine has to take sector 2 and 3 into consideration as well.

Only small parts of sectors 2 and 3 should be visible (as we are standing in the corridor), so optimizations can be made. Before drawing any geometry at all, the engine performs view frustum culling. Since you can only see what is inside the portal, the viewing frustum for the next sector is adjusted to the portal itself. The lines of sight from the player's viewpoint to the boundaries of the portal now form the new viewing frustum for looking from one sector into the other; from the corridor into the next room. This will cull any walls in sector 2 that are not visible and draw only the walls that are visible from the player's point of view.

Sector 3 is another corridor which connects to more rooms and more portals. The other portals will be culled as described above. Hence it doesn't matter how much geometry lies beyond sector 3; it will not be considered.
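In pseudo-C++ the recursion described above could look roughly like this. It is a conceptual sketch only; the Sector, Portal and Frustum types and the helper functions are made up for illustration.

#include <vector>
#include <cstddef>

struct Wall { /* ... vertices, texture ... */ };
struct Frustum { /* ... clipping planes ... */ };
struct Sector;

struct Portal {
    Sector *target;                   /* the sector you can see (and walk) into */
    /* ... the portal polygon itself ... */
};

struct Sector {
    std::vector<Wall>   walls;
    std::vector<Portal> portals;
};

/* assumed helpers, not shown here */
bool    frustum_contains(const Frustum &f, const Wall &w);
bool    frustum_contains(const Frustum &f, const Portal &p);
Frustum clip_frustum_to_portal(const Frustum &f, const Portal &p);
void    draw_wall(const Wall &w);

void render_sector(const Sector &sector, const Frustum &frustum) {
    /* draw the walls of this sector that fall inside the current frustum */
    for (std::size_t i = 0; i < sector.walls.size(); i++)
        if (frustum_contains(frustum, sector.walls[i]))
            draw_wall(sector.walls[i]);

    /* for each visible portal: shrink the frustum to the portal opening
       and recurse into the sector behind it */
    for (std::size_t i = 0; i < sector.portals.size(); i++) {
        const Portal &p = sector.portals[i];
        if (!frustum_contains(frustum, p))
            continue;
        render_sector(*p.target, clip_frustum_to_portal(frustum, p));
    }
}

A real implementation would also skip portals that face away from the viewer, otherwise the recursion would happily go right back through the portal it just came from.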

In a game engine, you can put this to other uses as well. For example, to activate monsters: an enemy soldier in the next area might be walking in circles while whistling a tune, so that the player can hear him before seeing him. This kind of game ties enemies down to the areas where they live on the map; if the player rushes past them on to the next sector, they are typically literally out of view and no longer considered (or controlled) by the game engine.

Now for the real bonus. The geometry of a building is static, so unless you have moving walls there is no further computation involved. Simply send the geometry to the video hardware and the scene is being rendered.

So, is everything great? Not really. In theory you can adjust the view frustum to the portal to see into the other sector, but in practice it's not super easy to do in 3D. The portal may have an irregular shape which complicates matters.
Another practical issue is where to put portals on the map. It's easy for a human level designer to draw them into a 2D map, but it gets hard for 3D maps. Moreover, it's incredibly hard to develop a program that does this automatically. You can use a binary space partitioning (BSP) method, but it tends to cut up a map into many more pieces than a human would, creating many more portals. That is not incorrect, but having many portals in a scene means having to do more calculations on the CPU, which is exactly what we were trying to avoid.

You will find no references to the term portal in technical descriptions of how the Quake engine works. Quake maps are partitioned into clusters (sectors) and each cluster carries a potentially visible set (PVS) in the form of a bit array that tells the engine which other clusters are visible from this cluster. The engine then does coarse grained view frustum culling using bounding boxes, and finally renders the geometry. OpenGL will take care of backface culling.
The PVS data is precomputed and the maps have been optimally segmented, so Quake never has to bother with portals at run time. It is quite possible that portals are used in the preprocessing step, although they might just as well use a costly brute force method like casting rays of light and seeing which walls are lit.

Playing id's latest game Rage, I realized that it's largely based on the same engine. Rage's texturing is awesome (read my previous post) but the way the corridors and halls work is still the same as before. DOOM 3 and Rage feature very high polygon count environments, but the trick is that they are still simple blocky corridors. A player can never climb on or crawl under the pipes that run along the walls, which reveals that the player is in reality walking through a crude maze of invisible walls, with some very detailed visuals situated behind those plexiglass walls.
The game's rendering engine deals with a small collection of flat walls per sector and selects which walls to render. Each flat wall has a property that points to the complex geometry that the player is about to see. The engine then sends that geometry to the GPU and the result is a highly detailed 3D scene for a very low cost of CPU power.

For further reading on portal engines I recommend flipcode's articles on Building a 3D Portal Engine. I have to say it is doubtful that any modern game today uses a true portal renderer. That aside, it is an interesting way of looking at the problem and flipcode's articles are a fun and educational read.

Tuesday, October 25, 2011

RAGE, MegaTexture, clipmapping

Recently, id software (of DOOM and Quake fame) released their highly anticipated new title named RAGE. I don't normally write game reviews on this blog, but since I wrote about this game before way back in 2008 – three ... years ... ago! – and because I played through the game I felt like writing some more about how its new "Tech 5" engine works. The game's release was a bit of a disaster on PC (lead developer John Carmack raged out on twitter) because both NVidia and ATI did not deliver promised updates to their video drivers in time, but hey, I play the PlayStation 3 version.

One of the most impressive aspects of the game is its visuals. The outdoor scenes have been described as "a moving painting". And indeed, it does look painted, which has mostly to do with the artists who did the graphics. Aside from that, this game is known for the fact that the graphics show no repeating textures. Every location, every object and every little detail has a unique look. The terrain in the game world is composed from a single gigantic and highly detailed texture. This is very special, as it is usually not the case with traditional texture mapping, and id software is probably the first to show this off in a game. (Note: RAGE is not their first game to use this technique; in 2007 Enemy Territory: Quake Wars used an earlier version of the same code).

MegaTextures
Carmack uses a texture mapping technique in his rendering engine that he calls "mega textures" and "virtual textures". This is not just marketing hype, first of all because Carmack is a geeky programmer and not a marketing guy, and secondly because it looks too good to be just marketing hype. I think I now have a good impression of how it works; it goes something along these lines.

Traditionally if you want to texture map a cube, you'd have six faces of the cube and you would choose to texture it using six textures, one for each face. Or you could choose to have just one texture, and have it repeat. In games, if you wanted to texture map a large area, you would repeat the texture over and again. When you use a different texture for each face, not only will you be using up texture ids quickly, it will also cost performance. So what Carmack has done since the beginning (already in Quake 1) was to make a texture map like a sheet of paper and wrap it around the object. This way you get a fully textured model, with just one texture per model.

Terrain data is like one giant object. Using geomipmapping you can have a single highly detailed mesh for the terrain, while still being able to render it efficiently and having level of detail (LOD) with respect to viewing distance. (NB. It is also possible that they faked the level of detail using multiple terrain maps, each with a little less detail, but this doesn't matter with respect to texturing the terrain.)
To texture the terrain, a giant texture map with a resolution of something like 64000x64000 is used. It is impossible to load this texture into memory all at once, so it loads only tiles. Each tile is 4096x4096 pixels. For geometry that is further in the distance, a lower resolution texture is chosen (like mipmapping).
I don't know if RAGE employs this, but it is possible to do a dynamic in-between level of detail by blending the higher resolution texture with a lower res texture. My guess is that this would cost too much performance, but I really don't know.

Carmack mentions using tiles of 4096x4096 on the PS3. Note that the PS3 has five year old hardware by now, but also note that 4096x4096 is still a lot. For example, 1080p "full HD" is only a fraction of that.
Modern PCs have larger video memory than the PS3 so RAGE should look even better on PC, but there is a bandwidth problem with loading the data from disk that gets in the way, so not everything is sunshine on the PC.
4K by 4K amounts to 16 megapixels per texture. Take 24 bits per pixel and you're up to 48 megabytes of memory per texture. This may not sound like a lot of data, but it takes more than half a second to load 48 megabytes from a hard drive.
Now compare the 5 year old PS3 hardware to a modern PC, and notice how the PC has much more memory, much more graphics memory and much more processing power, but both are limited in maximum performance by the same speed of the hard drive.

To speed up loading times, textures are compressed. Compression has the disadvantage that it takes processing power to decompress the textures when they are loaded. You can choose to use hardware compression, in which case the GPU has to decompress the texture. The problem with this is that it keeps the GPU busy, while it was already busy doing the rendering. Another thing is that there are better and faster compression codecs available for the CPU ... in RAGE, a CPU thread is used to decompress the texture data.

In addition to mega textures, Carmack uses a technique he calls "virtual textures". I guess it works like a demand paging system; whenever a texture is needed because it is going to be displayed on screen, it needs to be loaded into the video texture memory.
The texture may be loaded from a core memory buffer (cache), or from a memory mapped file (disk cache), or lastly from the source media, the Blu-ray disc. Of course, loading off the disc is slow and that's when you see "texture pop-in" artifacts.
This technique is also called texture streaming because it is constantly loading texture data as you move around and look around the world.
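My mental picture of such a scheme (and this is purely my own guess at the mechanism, not id's actual code) is a demand-paged tile cache along these lines:

#include <map>

struct TileId {
    int mip;                                /* level of detail */
    int x, y;                               /* tile coordinates within that level */

    bool operator<(const TileId &o) const { /* ordering so it can be a map key */
        if (mip != o.mip) return mip < o.mip;
        if (x != o.x) return x < o.x;
        return y < o.y;
    }
};

struct Tile { /* ... the texel data, already uploaded to the GPU ... */ };

/* assumed helpers, not shown */
void  load_tile_async(const TileId &id);    /* start disk read + decompression;
                                               inserts into 'resident' when done */
Tile *lowest_detail_fallback(void);         /* always kept in memory */

std::map<TileId, Tile *> resident;          /* tiles currently in video memory */

/* called for every tile the renderer is about to sample this frame */
Tile *request_tile(const TileId &id) {
    std::map<TileId, Tile *>::iterator it = resident.find(id);
    if (it != resident.end())
        return it->second;                  /* hit: tile is resident */

    load_tile_async(id);                    /* miss: start streaming it in */
    return lowest_detail_fallback();        /* meanwhile show a blurrier version;
                                               this is the pop-in you sometimes see */
}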


60 Hertz refresh rate
The game is promoted for its 60 Hz refresh rate. Many games use 30 Hz. Experienced gamers say they can easily tell the difference, as 60 Hz games feel a lot faster and snappier. Just try to remember Quake 3 Arena when it ran on yer ole 100 MHz 486, as opposed to Quake 3 Arena on modern hardware: the game is much more action packed now because everything reacts faster at a higher frame rate.
RAGE does feel fast at times indeed. It is an achievement to write a game that leans so heavily on graphics and still have a frame rate of 60 Hz. Where does all this speed come from? For one thing, RAGE doesn't do multi-texturing. Why would it, if all textured space looks unique anyway? All the graffiti on the walls has already been "burned in" by the artists. All that hard work was already done during preprocessing. Other things that are known to have been tricked are lighting and shadows. It's a game; as long as it looks good, it is good enough. No need to pull off compute-intensive raytracing-style mathematics for no apparent reason.

Clipmapping
Carmack is a very good 3D graphics programmer but he sure did not come up with these ideas all by himself. He does do his homework very well, apparently.

In the legendary game DOOM he popularized the use of BSP trees to efficiently render indoor environments. Use of BSP trees in computer graphics had been researched for many years by then, but it was a research paper published in 1991 that first described how to gain lots of performance by using BSP trees to draw a scene in a (then) unusual way: from front to back.

The mega texturing technique sounds an awful lot, no, exactly like clipmapping, a technique thought up by SGI researchers and published in 1998 in their paper titled “The Clipmap: A Virtual Mipmap”. You can read the full paper here.
Another term for clipmapping is the use of cliptextures.

Surprisingly, the paper also mentions a frame rate of 60 Hz. The 60 Hz frame rate is tied to the hardware limits of a high end graphics machine from around 1998, which does make you wonder if modern hardware could pull this off at 30 Hz too.

Carmack is not a bad guy and he knows where he came from; every now and then id software releases the source code to their old engines for the sake of teaching students how to create 3D game engines.

The Game
RAGE is a fun game, but I have the feeling that they tried to make a game more along the lines of Red Dead Redemption, and too bad for them, Rockstar beat them to it. Unlike RAGE, Red Dead is a truly open world game and it is very different from the shooter genre. RAGE is a good shooter when compared to Quake, but it's not as frantic as shooters like Crysis or Killzone.

Maybe id software should take a shot at going back to their underground roots and produce a fairly simple, but large enough, game that is pure brute and blunt fun, rather than working for ages on producing a Hollywood blockbuster type of game. Take Super Turbo Turkey Puncher for example. That was cool, and not because it was a mini game. It was simple and it was fun, and I wish I had it as a separate game. It would be killer on iPhone, I'm sure.

Remember DOOM and how it would get your adrenaline pumping. Halls full of imps to slaughter. The clicking sound as you picked up the shotguns left behind by enemies was so addictive. The monsters would moan and howl in pain as you were gunning them down.
Trent Reznor of NIN once phrased it like this: “the we don't give a shit attitude, it really struck a chord”.

Sunday, October 23, 2011

Exceptional conditions: adding exceptions in standard C

An important aspect of programming is error handling. A program is running, all is well, and then ... oh no! All of a sudden a condition occurs that was not supposed to happen, or at least the programmer wished it would not happen. A program usually checks for the most common errors and may retry, work around, or simply abort the operation, depending on how hard the error is to resolve.

There are software error conditions and there are hardware errors. Hardware errors are represented by software error codes. There are different ways in which an error condition can be reported to an application:
  • error return value from a function
  • library reports error: in standard C, check errno
  • UNIX signal received
  • an exception was thrown or raised
Easiest to understand and use is the first one; if the return value of a function has a certain value (error code), then it signifies an error and the application should act accordingly. The application programmer himself must choose what error codes to use for error conditions.

The second case describes how the standard C library reports errors; most library functions return zero on success and -1 on error. To get more information about what error occurred, examine the error code in errno. Note that errno is not really a normal variable these days, but a macro that expands to a call returning a thread-safe, per-thread errno value. All errno values are predefined by the operating system. The errno values are (either translated or directly propagated) return codes from the operating system's system calls.
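For example (plain standard C library usage):

#include <stdio.h>
#include <errno.h>
#include <string.h>

int try_open(const char *filename) {
    FILE *f = fopen(filename, "r");
    if (f == NULL) {
        /* fopen() failed; errno tells us why (ENOENT, EACCES, ...) */
        fprintf(stderr, "%s: %s\n", filename, strerror(errno));
        return -1;
    }
    /* ... use the file ... */
    fclose(f);
    return 0;
}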

UNIX signals may be sent by the (UNIX) operating system to signify certain conditions. These conditions vary from SIGALRM (a timer expired) to hard errors like SIGSEGV (segmentation violation, which is illegal memory access) and SIGBUS (bus error). Signals are handled by the process itself, but the signal handler is run in a different context. This means that the normal program flow is interrupted and continues after the signal handler has finished executing. A signal can even interrupt a system call, after which the system call will report the errno value EINTR (system call was interrupted). Signal handlers may be hooked by the application, but an application cannot define its own signal numbers. The signal numbers are predefined by the operating system.

Exceptions are exceptional conditions that may be thrown or raised. An application may catch an exception, in which case it will resume execution at the point where it first tried the operation. Because this involves stack unwinding and jumping through code, exceptions generally require special language support. Moreover, exceptions generally have a specific syntax in the form of try { ... } catch(Exception) { ... } or try: ... except Exception: .... Exceptions are a key aspect of the Java and Python programming languages, in which it is common practice to throw exceptions for error conditions. In C++ exceptions exist, but it is left more to the programmer whether to use them extensively or stick with error codes. In Objective-C exceptions exist but they are not commonly used.

Exceptions do not exist in standard C, but I thought it would be fun to add them. How do you add a language feature that is completely absent from C? Well, you can emulate it. In fact, I stole this neat idea from Objective-C, where NSException is implemented using setjmp()/longjmp(). Without using any macros to define try and catch, I came up with the following:
if (!catchException(exception_number)) {
    ... try some code ...

    throwException(exception_number, "descriptive message");

} else {
    printf("caught exception! %s\n", caughtExceptionMsg());
}
endException();
In this code, catchException() really sets up a try block. What it does behind the scenes is create an Exception object containing a jmp_buf and push it onto an exception watching stack. The code that actually catches the Exception really lives in throwException(), which examines the stack to see if we want to catch the thrown exception at all. When it finds the corresponding Exception object, it cleans up the stack and jumps back to the starting point of the try block. If the exception is not being caught, it prints a message about the uncaught exception and aborts the program.
If the exception does not occur, you are left with an object on the exception watching stack, which is why you need to explicitly call endException() to clean up the stack. Other languages do this implicitly.
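To give an idea of the machinery involved, here is a stripped-down sketch of how such an exception stack can be built on top of setjmp()/longjmp(). This is not the actual library code and the names differ slightly; one caveat baked into this sketch is that setjmp() has to execute in a stack frame that is still alive when longjmp() fires, so here the caller does the setjmp() on a jmp_buf handed out by the push function.

#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_NESTED      16

struct Exception {
    jmp_buf env;
    int number;
    const char *msg;
};

static struct Exception exc_stack[MAX_NESTED];
static int exc_top = -1;

/* push a new Exception watcher; the caller must setjmp() on the returned buffer
   (no overflow check here, for brevity) */
jmp_buf *pushException(int number) {
    exc_top++;
    exc_stack[exc_top].number = number;
    exc_stack[exc_top].msg = NULL;
    return &exc_stack[exc_top].env;
}

void throwException(int number, const char *msg) {
    int i;

    /* walk the watcher stack from the top down, looking for a matching catcher */
    for (i = exc_top; i >= 0; i--) {
        if (exc_stack[i].number == number) {
            exc_stack[i].msg = msg;
            exc_top = i;                    /* unwind the watcher stack */
            longjmp(exc_stack[i].env, 1);   /* jump back into the try block */
        }
    }
    fprintf(stderr, "uncaught exception %d: %s\n", number, msg);
    abort();
}

const char *caughtExceptionMsg(void) { return exc_stack[exc_top].msg; }

void endException(void) { exc_top--; }

/* usage:

    if (!setjmp(*pushException(5))) {
        ... try some code ...
        throwException(5, "descriptive message");
    } else {
        printf("caught exception! %s\n", caughtExceptionMsg());
    }
    endException();
*/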

This way we have added exceptions to standard C. There is a huge caveat to using exceptions correctly: as your program is able to jump back and forth through large pieces of code, you have to be extra careful not to leak memory. You can't have everything, but if you have something like an AutoReleasePool then this may help a great deal.

Sunday, September 4, 2011

Bytecode interpreters, virtual machines, emulators

Lately I've been working on programming a utility that translates more or less natural text to more cryptic firewall statements. In the translation process, I use two passes: the first pass translates to bytecode, and the second pass translates from bytecode to the final output. A huge advantage of using the intermediate bytecode stage is that this makes it possible to easily support multiple kinds of firewalls while having only one syntax for the input. The code for the output generator is completely decoupled from the input parser code. Cross compilers work in the same way. Modern compilers like gcc do something similar the other way around: they support multiple frontends for compiling different programming languages.

When I was still a student, I once wrote an assembler/disassembler for a nonexistent architecture. While a nonexistent architecture may seem an odd choice, I was inspired by Java bytecode, which executes in a virtual machine rather than on a real CPU. The only reason the professor did not give me a straight A was that I had not actually written the virtual machine that would run the code. The instruction set was rather extensive and implementing the full virtual machine would have been too much work for the limited time of the assignment. The feeling of what if has always stuck with me, however.

An emulated CPU has the following properties:
  • a bunch of general purpose registers
  • a program counter (this is just an address; the instruction pointer)
  • a stack pointer
  • a flags register (with a zero, sign, carry, overflow flag)
  • optionally: interrupt flag or current interrupt level
  • optionally: supervisor mode flag or current privilege level
Emulated interrupts enable you to let the system interact with I/O devices much like it happens in real computers. Whenever an I/O device is ready, it will interrupt the CPU, which will then automatically jump to the Interrupt Service Routine (ISR). The address of the ISR is read from an interrupt vector table that typically resides at a low memory address.
The supervisor mode or privilege levels offer the possibilities of implementing a sturdy “operating system” in the virtual machine, adding a hypervisor, and ultimately having virtual machines running in virtual machines.

The instruction set needs to have load, store, arithmetic operations like add, subtract, multiply and divide, logical operations like and, or, exclusive or, and bitwise operations like bit shifting and possibly bit rotation. Furthermore you need a jump (goto) instruction and a set of instructions for conditional branching, and push/pop to work with the stack. Lastly you could add some privileged instructions like loading the flags register or resetting the machine. At the bare minimum, there are about 25 instructions to be implemented. By comparison, the 8086 CPU has about 75 different instructions.

For a memory model, 64K ought to be enough for anybody ... Or you could give the system very little real memory and implement paging and swapping. Implementation-wise, the virtual memory management could take place in either the guest or the host ... (when implemented in the host, the guest OS would never know its memory was being swapped in and out!)
A related issue is protected memory: can the current process write to a given address or not? Early micro-computers did not have support for paging, but they did have read-only memory in the form of ROM. You probably want to have memory mapped I/O, an interrupt vector table and some display memory representing the screen.

One of the problems I've always had with this model is that programming for such an emulated system is hell as you would need to write everything in its assembly language. To overcome this problem, you would really need to make a high level compiler generate the code. Nowadays we have gcc and if you're handy it should be possible to have it generate code for platform X.

Now let us take a step back and get back to bytecode interpreters. Java, Perl, PHP, Python, various kinds of BASIC and even some MUD codebases all use a stack machine at the heart of their bytecode interpreters. The stack machine is the simplest kind of processor: it has no registers and its instructions only manipulate a stack. The lack of registers makes both compilation and execution (or emulation) easy. Interpreted languages benefit from bytecode because interpreting bytecode is faster than having to go through the language parsing procedures all the time.
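A toy stack machine really is nothing more than a loop around a switch statement; something like this (a minimal sketch with a made-up instruction set, nothing like a production interpreter):

#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

void run(const int *code) {
    int stack[256];
    int sp = 0;                       /* stack pointer */
    int pc = 0;                       /* program counter */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            stack[sp++] = code[pc++]; /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_PRINT:
            printf("%d\n", stack[--sp]);
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void) {
    /* compute (2 + 3) * 4 and print it */
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL,
                      OP_PRINT, OP_HALT };
    run(program);
    return 0;
}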

A bytecode interpreter for a scripting language is quite different from a full blown system emulator. For a bytecode interpreter, you can make up your own opcodes for any common operation you like. There is no need for an opcode to perform only low level operations; they can be high level operations just the same. For example, Python has opcodes for slicing strings. Until Python 3, print was a statement with an opcode equivalent rather than being a function. It's not cheating, it increases performance.

It's good fun designing your own bytecode interpreting virtual machine. There are a lot of design choices to be made beforehand. Do you want to fully emulate a CPU or not? Do you maybe want to emulate other hardware components as well? Will you write anything like a system BIOS in bytecode? Are you going to develop a self-hosting system? How far are you willing to take it?

Wednesday, August 3, 2011

Lion: Yet Another Review

Cheerios, finally a new blog post after a long while. Haven't I been programming a lot lately? Well, yes and no. I've been working real hard on synctool, which is a sysadmin tool written in Python for doing software configuration management on clusters of computers. On GitHub the development branch is now something like 180 commits ahead of the stable branch. Then I got the flu, and after that I was on a well deserved long vacation. Just when I got back home, Apple released Mac OS X Lion, so I had some interesting upgrading to do. In the past I sometimes blogged about Ubuntu (Linux) upgrades; nowadays “I'm a mac” (sorry — btw, still doing lots of things with Linux at work), so I'll write about my experiences with Lion.

Surprise surprise
Upgrading to Lion is a breeze. Just buy it from the App Store, download the program and off you go. After an hour or so you are now running Mac OS X Lion. It's as easy as that. But wait, as a sysadmin, there are a few things here that are making me feel uneasy.

I got Lion from the App Store, so I have no CD-ROM to install the system from if my system ever breaks. Apple solved this partially by including a system restore partition on your hard drive. Eww. Uhm. Well ... OK. I guess..?

Apparently there is a way to create a Lion install CD-ROM, but Apple does not tell you how. It is (apparently) reserved for “power users” (or should I say “hackers”?). Doing so is not easy, especially because the Lion installer magically disappears from your hard drive after the upgrade is done. After downloading the 4 gig or so installer, I had to download it again to upgrade my laptop. This is not such a nice thing for Apple to do.

Upgrading my laptop took a bit more work. It's an older model white macbook with only 1 gig of memory. Lion only works on systems with 2 gigs of memory or more, so I had to do a small hardware upgrade first. I can't really explain, but Lion uses a ton of memory even though the running apps do not show big memory footprints in Activity Monitor.

Gimme gimme new features
Lion offers a number of new things. There isn't an awful lot, but then again, it was only 29 dollars or something. Some of those things were good, and some of those were bad. I love my Mac enough to blog about it and I do like Lion but grrrrrowls ... I'll give you the bad first.

Ironically most of Lion's biggest selling points were pretty useless to me. Honestly, it's mostly marketing hype. Here's why: my main system is a 27 inch iMac. Gimmicks like Launchpad and fullscreen Mail are nice on a laptop, but they are not useful (just plain awful) on a 27 inch screen. Lion's features are mostly ideal for laptops but not for desktop Macs.
  • new UI. Really? Yeah, but you have to look super hard to notice. And then when you do notice, it annoys the crap out of you because it does not look better than Leopard. Except for the squared buttons, which look more like Microsoft Windows?
  • full screen apps. Great feature, but on large displays it is not useful. You just drown in this amazingly large screen. Which is kind of fun actually, because a regular PC never does that to me.
  • 3 column layout for Mail. Oh wow. I hate it. It's a cheap Outlook imitation if you ask me. And it's ugly.
  • Launchpad. A big screen full of icons. Great. Sigh. I will never use it.
  • new trackpad gestures. Yeah, I do have a trackpad with my iMac. I'm just not comfortable with four and five finger gestures, rotating your hand, etc. For photo editing, I use a mouse (!)
  • Mission Control. Where did Expose go??
  • Dashboard is awful on a large display. It was nice in Leopard, the screen would fade out to darkness and your widgets would appear. You would click next to a widget in the void and the whole thing would zoom out and you would return to your desktop. In Lion, the whole desktop slides offscreen to the right (a dazzling animation that gives you a headache) and you are presented with a million (really, a million!) tiny knobs that are some kind of Lego base plate, and you can't even change that annoying background without hacking into your system library folder (!!!)
  • AirDrop. Maybe Lion's coolest feature, but it doesn't work with my old macbook. I don't even blame Apple for this because it is an older model and the WiFi chip in this macbook does not have the needed functionality, but as said, AirDrop is unfortunately useless to me.
  • FileVault now has whole disk encryption. To a lot of people, this feature came like five years too late. On the other hand, this feature is arguably useless to everyone. It enables you to encrypt even your system files, files that are identical on any Lion installation. Moreover, even though Apple denies it, FileVault makes your system noticeably slower.
Apps that broke:
  • PowerPC apps simply don't work anymore. Apple removed the brilliant Rosetta software. I say brilliant because it allowed me to play WarCraft 3 (for PowerMac) on my intel Leopard system flawlessly.
  • TrueCrypt / MacFUSE. I found a website that offered up-to-date MacFUSE software that fixed the problem. But it was not on the official MacFUSE website. Odd. Very odd. But it really does work now.
  • Some folks say that Spaces is broken in Lion because Apple changed it. I don't use Spaces but looking at Mission Control, I believe them.
  • Printing from Google Chrome crashes the app. Of course, it's all Google's fault and you should use Safari. But I'm pretty sure that I could print from the Chrome browser before.
  • Lion removed my compilers and did so even without asking. gcc, clang, make, all gone. Insane! Apple decided that thou shalt use Xcode 4 on a Lion system. It's free, and thou shalt download the 4 gig Xcode 4 installer from the App Store and you are not given any option to install only the command-line tools. So I spent some additional hours to get my programming environment back in order again.

The Good Stuff
So, is Lion all useless? No, here's what I do like about it:
  • More system programs are 64-bit binaries.
  • The OS kernel runs in 64-bit mode now, which is faster than 32-bit mode on modern CPUs.
  • ASLR has now been implemented properly, or so they say. Address space layout randomization is important for security; it helps prevent system crackers from gaining root access by smashing the stack. I find this to be a very important feature because I do online banking on my system, and who doesn't nowadays.
  • The price is fair, even for a guy like me who has been spoiled rotten for years with Linux and GNU where software is typically free (as in beer — yes, RMS says you can make money from it but everyone is getting it for free anyway. But wait, isn't that exactly the same in the Windows world? Hmmm).
  • The arrows on the scroll bars are gone. I'm not too crazy about the new scroll bars, but you have to admit that no one was seriously clicking those arrows anymore and removing them is a bold move. No doubt Ubuntu and other Linuxes will follow this example, as will Microsoft.

I Still Love You
Despite my criticism on Lion I have to say that I'm still very much in love with the Mac OS. It just works. I can't put it any other way. I love working with it. I love programming for it. I love its strangely brilliant Objective-C. I love digging in the docs of its Objective-C API. I love its brilliant BSD UNIX that is its core. I love that you can sit down and get some work done rather than having to click away popup after popup as your system is begging for attention because it needs you to be a sysadmin night after night. I love that OS upgrades do not totally break your system and that you do not have to spend a full weekend on hacking away in it in a futile attempt to fix it. I love how you can copy gigabyte files around and the whole desktop remains responsive. I love how my old white macbook runs the latest Mac OS X smoothly and without feeling sluggish at all. I love the finger scroll with inertia on the Magic Mouse that seems to ‘just know’ where you want to scroll to. I love the beautiful screen fonts. They are not ugly like on other platforms. And I love how the Dashboard worked in Leopard. No other OS in the world implemented widgets in this way and it was done exactly right. And I love Spotlight. Spotlight has to be the only desktop search tool that does not grind on your hard drive as it is rebuilding an index every single day from a daily cron job.

I could go on and on. There are few things in Mac OS that I do not like. The Finder does not behave like Windows Explorer — but then again, it is not a Windows application. The default settings for PageUp and PageDown in the Terminal and Xcode editor appear broken as they do not react intuitively. Launchpad is rubbish. They messed with my favorite app, the Dashboard.
Other than that, the Mac is still insanely great.

Update: In System Preferences|Mission Control, uncheck “Show Dashboard as a space” to get the old Dashboard behavior back.

Sunday, May 8, 2011

Quaternion versus Matrix performance

When doing 3D graphics programming, you will be dealing with vertices, vectors, translations and rotations. OpenGL will happily do the translations and rotations for you, but in some cases you will want to do the math by yourself anyway. For example, I like keeping the orientation of an object around so I can rotate it whenever I like (like in an animation or game loop or heartbeat routine). At that point, OpenGL is not involved. Later, a drawing routine is invoked that calls OpenGL to do the necessary rendering to display.

OpenGL uses matrices to represent 3D space in memory, so it makes sense to store the orientation of the object in a matrix as well. The 3D space is actually represented in a 4x4 matrix, which is generally written as an array of 16 floats. Now, there are some gotchas like the column major format for laying out the matrix in memory and the right-handedness rule, but once you get that right, all the drawing routine really has to do is call glMultMatrixf() on your 'matrix' array of float values, and the object will (hopefully, if you got everything else right as well) appear under the desired angles. That's easy.
Multiplying matrices together to make combined rotations is no small task and takes 4x16 = 64 floating point multiply operations! There are 4x12 = 48 add/subtract operations as well, but I assume that the multiplications have the worst impact on performance. If you want to do a glRotate-style rotation you will have to set up the matrix first, which adds another 24 multiply operations in the worst case, not counting the multiplications and the sqrt() call needed to normalize the axis vector.

There is another way of storing orientation, and it's the quaternion. This is something like a vector in complex 4D space and consists of "something like" an <x,y,z> vector and an additional w component. What, only 4 values? Yup, that's all. The memory footprint of a quaternion is really small compared to that of a matrix. (By the way, this is a non-issue to me because modern computers have plenty of memory — even the mobile devices do. But maybe you have some whack project in which you want to keep tons of different orientations and memory becomes a problem.) Because there are only 4 values, initializing a quaternion is dirt cheap in terms of CPU usage. Multiplying quaternions is also relatively cheap with only 16 floating point multiply operations. There are 12 add/subtract operations as well, but I assume that the multiplications have the worst impact on performance. Again, not counting the operations needed to normalize the rotation vector.
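For reference, this is what combining two quaternions looks like, written out in full rather than cleverly optimized; count the multiplies and you indeed arrive at 16 (a plain sketch, independent of any particular library):

struct Quat {
    float w, x, y, z;
};

/* Hamilton product: the combined rotation of a and b
   (which one is applied first depends on your convention) */
Quat multiply(const Quat &a, const Quat &b) {
    Quat r;
    r.w = a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z;
    r.x = a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y;
    r.y = a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x;
    r.z = a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w;
    return r;
}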
So, are quaternions the golden egg? Well, yes and no. Yes, they are great, but the main drawback is that OpenGL works with matrices. Converting the quaternion back to a matrix costs 27 multiplies.

In my book (I keep a little black book to pen down these kinds of numbers) 16+27 = 43 is still less than the 88 that matrices cost. However, there is a special case where matrices will still be faster. The trick is that when working with matrices, initially, you will have the identity matrix. Since multiplying the identity matrix with another matrix equals that other matrix (check this yourself, it's fun ...), you can greatly optimize the first matrix rotation, as it requires no multiply at all. This requires that you keep a flag on the matrix saying that it is identity. Or you can make a separate routine that simply initializes the matrix in its first rotated state. It makes for an embarrassingly fast rotate call, especially if you were going to do just one rotation of the object.
Of course, this is cheating. You can cheat in a similar way with quaternions, saving an extracted copy of its corresponding matrix and flagging it as dirty whenever it needs to be updated. Just keep in mind that if you rotate the object all the time, you will have to extract the matrix for use with OpenGL all the time.

I want to end this post with a couple of remarks:
  1. Quaternions are apparently terrific for combined rotations. If you hardly do combined rotations, matrices will be faster. If you do combined rotations all the time (like for animating skeletons and such) then you probably already knew that quaternions are the way to go.
  2. Quaternions produce less floating point drift than matrices, because combining them takes fewer multiply operations than multiplying matrices does. They do drift, however, and don't let anybody tell you that they don't.
  3. In my post I made a remark about modern computers having enough memory ("640K ought be enough for anybody ..."). The same goes for CPU power, really. However, on mobile devices it probably does pay off to investigate app performance not only because of the less powerful CPU, but also because of battery power consumption.
  4. I got to writing this blog entry because I spent a day wondering why my matrix rotation around an arbitrary axis gave weird results. The quaternion code did work, until I passed in a vector that was not normalized and it displayed the exact same weird result! That was an eye-opener. After normalizing said vector, the matrix code gave just as good results.
  5. Over a year ago, when I wrote in my blog about quaternions for the first time, I made a remark that NeHe's code has a sign wrong somehow. To my surprise, my model was rotating clockwise using my own quaternion code. I fixed it to have it rotate anti-clockwise. Either I got the matrix column major layout wrong before, or my other project was working with different axes. Anyway, NeHe's quaternion code is probably alright after all. I didn't use it. By the way, I see that Wikipedia now also shows code examples for quaternions.

Sunday, March 6, 2011

Database performance in practice

For a project at work, I wanted to put some syslog messages into a database so you would be able to query the dataset easily and most importantly, quickly. To my surprise, this was not as easy as I thought it would be. Database engines are not the magic tools they promise to be.

The main problem with syslog data is its volume. The relevant logging lines of several machines combined, gathered over three years' time, amounted to about 50 million records. You would think a database would be able to hold 50 million records ...

MongoDB
MongoDB is a fairly recent high performance database engine. MongoDB is "web-scale", meaning that you can spread load over multiple servers. I have only one server, so web-scale doesn't apply in this case. MongoDB doesn't understand SQL; instead it uses a JavaScript interface. I don't like SQL much, so initially I was quite happy with MongoDB. There is some controversy around MongoDB because it defers commits to disk (which means undetected data loss in case of a power outage) in order to gain performance, but in this case I could live with that and it wasn't a show stopper.
It comes with nice Python bindings so developing the app was quickly done. MongoDB doesn't require you to define any schemas or anything at all so it's very easy to get something working quickly.
Tests with a month's worth of syslog data were satisfactory, so I decided to load up all the data, with records dating back as far as three years. This took a while so I let it run overnight.
Queries now took a long time to complete. Unacceptably long. To cut down on query time, I took the extreme measure to make a subset of the data that contained only three months worth of data.
I read that querying on datetime objects was slow, so I broke the date down into year/month/day numbers and used those instead. After adding these columns, the database files on disk blew up from 50 gigs to 120 gigs. Disk space was not really an issue on this system, but really, 120 gigabytes?
Memory usage was much worse though, to the point where it became unbearable. MongoDB uses the mmap() system call to map on-disk files into memory, with the unfortunate consequence that when you do a costly query, it will consume all system memory. Linux has a-OK memory management so your system won't die on the spot, but it was clearly having a hard time. Since I didn't want to dedicate the full box to MongoDB, this was a no-go and Mongo had to go.

MySQL
MySQL is a famous free relational database and would have been many a person's first choice anyway, so I decided to give it a go. One of the reasons I don't like SQL databases is that you have to sit down and take the time to set them up: create the database and the user(s), set passwords, write the schema, etcetera, before you can start. What you get in return for this investment is that you can give certain users or apps read-only rights to the database (which is kind of important when dealing with logging data), and you can easily review what attributes and data types you used to create the tables.
Tests with a month's worth of syslog data were satisfactory, so I decided to load up all the data, with records dating back as far as a year. This was practically impossible. Inserting rows into MySQL goes alright for a while, then it slows down, and slows down more and more until the point where it takes minutes to insert data into the database! I tried improving this with transactions (do multiple inserts, then commit) and it didn't help!
Fifty million records is too much for MySQL. MySQL's on-disk and in-memory footprints were excellent, but who cares when you can't even get the data into your database. You can do tricks with partitioning in MySQL, but I didn't try it because it creates new issues to consider. Someone said I should use PostgreSQL, but I was already too tired to try.

'Proprietary' solution
I was convinced that a modern PC should be able to handle 50 million records, so I decided to do it the hardcoded way using a packed struct and dump that into a binary file. Database lovers cringed and called me an idiot but I let them. Using some shortcuts, it was possible to cram a syslog line into just 48 bytes — and this includes two magic bytes to detect possible data corruption. The total amounts to 2.5 gigabytes which takes about half a minute to read through from disk, but the subset is only a few hundred megs and is processed much quicker.
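To give an idea of what such a record looks like, here is a sketch. The exact fields below are made up for illustration (my real record layout differs), and the packed attribute is GCC/Clang specific.

#include <stdint.h>

/* one syslog line, packed into a fixed-size binary record */
struct logrecord {
    uint16_t magic;          /* two magic bytes to detect corruption */
    uint32_t timestamp;      /* seconds since the epoch */
    uint16_t host_id;        /* index into a separate table of hostnames */
    uint8_t  facility;
    uint8_t  severity;
    char     msg[38];        /* truncated/normalized message text */
} __attribute__((packed));   /* GCC/Clang: no padding, exactly 48 bytes */

Fixed-size records also make it trivial to mmap() the data file, or to fread() it in big chunks and walk through it with plain pointer arithmetic.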
Because the data can be packed so small, it's no problem doing a linear search to find what you are looking for. A linear scan through memory may be an inefficient way of searching but it finishes in a split second so no further optimisations are necessary at this point. Mind you, there is no indexing going on because I'm not trying to imitate a database engine.
What's also cool about this data is that when you are searching by date, you can skip through the data using a binary partitioning method (just like bsearch() does). Moreover, data can be easily partitioned by writing to a new data file every month.
The major drawback of this solution is of course development time. It's written in C for performance reasons and as a consequence the program is not yet fully featured, and it will take considerable effort before it is.

Conclusion
Databases are generic tools that save you from writing lots of code that has to do with storing and searching through data. In that respect, databases are to hand-rolled code what scripting languages are to low-level programming languages. You can outperform (or maybe in this case I should say outmaneuver) them, but whether it's a smart thing to go that way depends.

The given problem can probably be solved using MySQL by dynamically creating a new partition each month and adapting the application to work with that. I'm pretty confident that it will work but this too will require some more work to get done.

What bothers me is that database engines are overkill for small data store problems (where you would typically put all records in some flat ASCII file), while at the same time these heavyweight tools buckle under the load when you put a lot of records into them. All of a sudden you have to be a database expert to solve a seemingly simple problem.

Update (March 14, 2011)
I set up a MySQL db with the archive storage engine and partitioned it by month. It's currently loaded with 150 million records (!) and counting ... It's holding up nicely. We will get some more experience with it in the coming weeks, but so far, I'm quite impressed with it especially because it is running on a rather cheap and simple PC server.