The Developer's Cry

Solving the mother of all sudoku puzzles

2015-07-20T18:34:00.000+02:00

This post is all about sudoku, the venerable Japanese puzzle. Even though the hype lies far behind us, every once in a while I get bored and return to solving a few puzzles. But then one day I was given an exceptionally hard to beat sudoku puzzle. Even after hours of staring it down, I wasn’t getting anywhere. Hmmm… I was not going to be belittled by some stupid puzzle! And so I decided to cheat a little and crack it by writing a sudoku solver.

Sudoku is a numbers game, but there is no calculation involved whatsoever. The ‘game board’ is a grid of 9×9 cells, each containing a number from 1 to 9. The numbers in each row must be unique, as must the numbers in each column and the numbers in each 3×3 sub block.
Some numbers are given; it is up to you (the player) to fill in the gaps. Once all the numbers are filled in according to the given rules, the puzzle is solved. A proper sudoku puzzle has exactly one solution.

First of all, let me mention that brute-forcing it is a definite no-no. There are countless possibilities that do not lead to the solution. I figured that the total number of permutations would be around 9×9! (over 3 million), but two researchers have calculated that there are 5,472,730,538 (almost 5.5 billion) essentially different sudoku grids. The number of valid grids, when symmetries are not factored out, is much larger still: about 6.67×10²¹.
Trying them all is going to take forever, even on the fastest computer available today. We need to be smarter than that—why not simply play the game and see where it gets us?

In order to let the computer play, we are going to need a couple of routines:

load a given sudoku puzzle
check whether puzzle is correct
find cell with only one possibility left

It’s easy enough to store the puzzle into a 9×9 array of cells. Each cell holds a number, but also has a property saying whether it’s a ‘solid’ cell, or can be filled in by the player.
Checking the correctness of the puzzle is a three-step process: check all the rows, columns, and 3×3 blocks. Finding a cell with only one possible solution also checks all the rows, columns, and 3×3 blocks.
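
As a minimal sketch of that three-step check, here is one way to write it in C++. The `Grid` type and the name `is_valid` are assumptions made up for this illustration (they are not taken from the solver described in this post), and a 0 is assumed to mark a blank cell.

```cpp
#include <array>

using Grid = std::array<std::array<int, 9>, 9>;   // assumption: 0 marks a blank cell

// Returns true if no digit 1..9 occurs twice in any row, column, or 3x3 block.
// Blank cells are ignored, so a partially filled puzzle can be checked too.
bool is_valid(const Grid &g) {
    for (int unit = 0; unit < 9; unit++) {
        bool row[10] = {}, col[10] = {}, blk[10] = {};
        for (int i = 0; i < 9; i++) {
            const int r = g[unit][i];   // i-th cell of row 'unit'
            const int c = g[i][unit];   // i-th cell of column 'unit'
            const int b = g[unit / 3 * 3 + i / 3][unit % 3 * 3 + i % 3];  // i-th cell of block 'unit'
            if (r != 0) { if (row[r]) return false; row[r] = true; }
            if (c != 0) { if (col[c]) return false; col[c] = true; }
            if (b != 0) { if (blk[b]) return false; blk[b] = true; }
        }
    }
    return true;
}
```

Each pass of the outer loop handles one row, one column, and one block at the same time, which keeps the three checks in a single function.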

This is a reasonable start for a solver program, but note that it will only solve the most naive beginner-level sudoku puzzles. The problem is, there will be cells that have more than just one option, and this solver is just not good enough. For the super evil sudoku, we are going to have to do better than this.

Rather than finding a cell with only one possible solution left, let’s take a look at what possible solutions there are for a blank cell. A blank cell may have any number between 1 and 9, unless that number is already present in that row, column, or 3×3 block. Any already filled-in numbers are removed from the set of possible solutions for the cell. The remaining set holds a number of possibilities, of which one is the solution for this cell. Let’s try them all, but because we want to have the solution as quickly as possible, start with the cell that has the smallest set. So if there is a cell with only one or two possible solutions, try that one first before moving on to harder to crack cells.

As you are trying out possible solutions, the numbers lock into place and they can no longer be part of solution sets in other cells that are in the same row, column or block. Slowly, the puzzle is being solved. But we may still hit a wall and have to go back to the point where we tried a wrong possibility; this happens because not all attempts lead to a perfect puzzle. Many configurations are invalid, and need to be discarded.

Going back is easily done by means of backtracking. Backtracking works by recursion. I find it easy to think about in terms of functional programming; in functional programming, objects are immutable, they can not be changed. If you wish to change the state of an object, then that is a new object, albeit a modified clone of the object.
So whenever trying a solution to a cell, clone the whole puzzle. Put that puzzle object onto a stack. Now recurse, and try solving the next cell. If the puzzle is invalid, backtrack: pop the stack and return. As the recursive function is returning, it will continue trying the next solution to the cell. Notice how cells that ‘work’ are followed-up on and lead to the next step in solving the puzzle, and solutions that don’t work tend to cut off excess work.

I implemented this solver in Python. Implementation detail: cloning is done with copy.deepcopy(). My aging computer solves the evil sudoku in under half a second. On the Raspberry Pi 2 it takes 3.5 seconds. Remember that brute-forcing would take forever. You don’t need a fast computer (nor a low-level programming language) to pull this off. What’s important is choosing the right algorithm.
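
The Python source is not reproduced here. As a rough sketch of the same idea in C++ (pick the blank cell with the smallest candidate set, clone the grid for every attempt, and backtrack on failure), something along these lines should do. All names (`Grid`, `from_string`, `candidates`, `solve`) are assumptions invented for this illustration, a 0 is assumed to mark a blank cell, and the given numbers are assumed not to conflict with each other.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

using Grid = std::array<std::array<int, 9>, 9>;   // assumption: 0 marks a blank cell

// Build a grid from 81 characters; anything that is not '1'..'9' is a blank.
Grid from_string(const char *s) {
    Grid g{};
    for (int i = 0; i < 81 && s[i] != '\0'; i++) {
        if (s[i] >= '1' && s[i] <= '9') g[i / 9][i % 9] = s[i] - '0';
    }
    return g;
}

// Set of digits still allowed at (r, c): bit d is set if digit d may go there.
std::bitset<10> candidates(const Grid &g, int r, int c) {
    std::bitset<10> set;
    for (int d = 1; d <= 9; d++) set.set(d);
    for (int i = 0; i < 9; i++) {
        set.reset(g[r][i]);                                   // same row
        set.reset(g[i][c]);                                   // same column
        set.reset(g[r / 3 * 3 + i / 3][c / 3 * 3 + i % 3]);   // same 3x3 block
    }
    return set;
}

// Returns true and leaves the solution in 'g' if one exists.
bool solve(Grid &g) {
    // find the blank cell with the fewest candidates
    int best_r = -1, best_c = -1;
    std::size_t best_count = 10;
    for (int r = 0; r < 9; r++) {
        for (int c = 0; c < 9; c++) {
            if (g[r][c] != 0) continue;
            const std::size_t n = candidates(g, r, c).count();
            if (n < best_count) { best_count = n; best_r = r; best_c = c; }
        }
    }
    if (best_r < 0) return true;          // no blanks left: solved

    const std::bitset<10> set = candidates(g, best_r, best_c);
    for (int d = 1; d <= 9; d++) {
        if (!set.test(d)) continue;
        Grid next = g;                    // clone the puzzle for this attempt
        next[best_r][best_c] = d;
        if (solve(next)) { g = next; return true; }
        // otherwise: backtrack by dropping the clone and trying the next digit
    }
    return false;                         // dead end, caller must backtrack
}
```

Here the call stack plays the role of the explicit stack of puzzle objects: every recursion level holds its own clone, and returning `false` is the "pop and return" step.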

Backtracking is the algorithm for this kind of puzzle, where a partial list of candidates could lead to a solution. In fact, after finding a solution, the algorithm may continue to run and find all solutions. In a bold move, I handed it a completely blank 9×9 grid. It started spewing out all solutions to sudoku … each and every sudoku there is, ever was, or ever will be. And after a while, I hit Ctrl-C.

Live Coding & Hotloading

2015-06-14T13:50:00.000+02:00

This marks this blog's 100th post. Well, it's number 98 really, but two posts have been held back forever as draft. Anyway, I have been saving something special—a magic trick to impress your friends with. I saw this trick being performed live by an oldskool game programmer on Twitch, and it's called “live coding”. It enables you to type in some C code, and immediately see the results onscreen, even while the game is running. How can that be? It's a kind of magic.

Computers aren't any good at performing magic tricks, so something must be happening behind the scenes. As you may know, games run in a loop. What you see onscreen, is the result of frames being rendered by the render loop. With live coding, we dynamically load newly compiled code, and execute it. So, we are effectively changing the game code as it is running. The result is as if we are using C as an internal scripting language.
This is made possible by dynamic linking; rather than statically linking executable code, a shared object is loaded and linked at run-time. The coolio term for dynamic linking is “hotloading”.

Hotloading requires the game to be written with a certain structure; the game loop is in the main program, while the actual gameplay functions are dynamic: they are compiled into a shared object. It is a loadable module, like a plugin. The main program will check whether a newer shared object file is present in the directory, and if so, load it.

The functions for loading a shared object file at run-time are platform-specific. On UNIX (think Linux, OSX) you use dlopen(), dlsym(), dlclose().
On Windows, use LoadLibrary(), GetProcAddress(), and FreeLibrary().
Or we can use SDL to do this in a portable way:

typedef void (*Func)(void *);

    if (handle != NULL) {
        SDL_UnloadObject(handle);
        handle = NULL;
    }
    handle = SDL_LoadObject(filename);
    if (handle == NULL) {
        error("failed to load %s", filename);
    }
    func = (Func)SDL_LoadFunction(handle, "game");
    if (func == NULL) {
        error("failed to find symbol 'game'");
    }

The code shown will load a shared object file, and find the function named game. Afterwards, we can call func(), effectively calling game().

It's not all that easy though as there are some caveats. You will want to add live coding capability in the early stages of development; adding it later on is going to be painful. I will try to explain why.

Notice that we obtained a pointer to a function by searching for the symbol, i.e. the function name. Mind that on Windows, that works both ways; if you want to call a function from the game code that resides in the main part, you need to know where that function is. Hence the desire to structure the program such that we need to know about as few functions as possible.

I had no problem whatsoever on UNIX, where calling functions back and forth like that works naturally, out of the box. If you develop on UNIX and forget about this bitsy detail, you will be in a world of pain when you want to port to Windows later on.

Secondly, all global variables must be declared in the main part of the program. You can't put them in the shared object, because that memory is going to be unloaded and replaced once the new version gets loaded. This means that you also can not have any static variables in the module. If you do, the variable will lose its value when the module is unloaded and the game will misbehave.

This asks for a coding style where you invariably pass a pointer to a big bad struct that contains everything, and I do mean everything, that should be globally accessible to the entirety of the program.
Personally, I'm not fond of this style at all—say what you will, but I like having statics in my modules.

In turn, this quickly leads to doing your own memory management, although it is a myth that live coding strictly requires you to do so.

Another myth is you can't use C++ with live coding. C++ compilers do name mangling, throwing a spanner in the works. You can work around this by making sure that the interface is in plain C, and does not rely on C++. Entry point functions should be declared extern "C".
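
To make the last few points concrete, here is a small sketch of what the module side could look like when compiled as C++. The `GameState` struct and its fields are invented for illustration. The entry point matches the `void (*)(void *)` function pointer type used by the loader code earlier.

```cpp
// All persistent state lives in one struct that is owned by the main program.
// (Hypothetical fields, for illustration only.)
struct GameState {
    float monster_x;
    float monster_speed;
    long  frame_count;
};

// Entry point of the module. extern "C" switches off name mangling, so the
// symbol is literally "game" and can be looked up by that name at run time.
extern "C" void game(void *memory) {
    GameState *state = static_cast<GameState *>(memory);

    // no statics and no globals in here: everything goes through 'state'
    state->monster_x += state->monster_speed;
    state->frame_count++;
}
```

The main program allocates a `GameState` once and passes its address on every call, so the values survive when the module is unloaded and replaced.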

Because all globals live in a big struct, you can't use live coding for making large changes. You can't add new globals. Once you touch the layout of the big memory structure, the program will crash. And if it doesn't crash, you will be overwriting fields in memory unintentionally.

My programmin’ brother dismissed live coding as a gimmick. It is super useful for tweaking small parameters like monster speed, hit points, and such. But all these parameters shouldn't be hard-coded in the first place; they should be tunables in the game engine that developers can play with as the engine is running.

Despite its drawbacks, seeing live coding in action is most impressive. It allows you to experiment in a playground. It turns good old C into a scripting language, and you must admit, that's all we ever wanted.

Loading PNG images with libpng

2015-05-17T22:05:00.000+02:00

For textures in games I'm often inclined to quickly grab some old TGA (targa) loader. The code works, and thus I've been dragging it along for years. Another often used format is Windows BMP. Writing an image loader can be quite an undertaking, which is because of the huge variety of features a single image format may support. Lucky for us, we don't have to write a PNG loader from scratch; let's use the libpng library.

About the PNG image format
Nearly 20 years ago the PNG (Portable Network Graphics) image format was invented for the Web. Up till then, the most used image formats were GIF and JPEG. GIF supports only 256 colors, and uses a patented compression method. JPEG does not support transparency, and uses lossy compression, resulting in visual artefacts.
PNG improved greatly on this by supporting 32-bit color, transparency, and lossless compression. Moreover, it's absolutely patent-free and freely available. Since it was invented for the Web and not photography or printing press, PNG lacks alternate color schemes like CMYK, and you won't find the image format being used for those applications. It's a good fit for our purposes, however.

The PNG file format consists of an 8-byte signature followed by a number of chunks. The most basic chunks are IHDR (header), PLTE (palette), IDAT (image data), and IEND (end marker), but there are many other kinds of chunks, for example for storing color information and metadata, like last modified time. The image data is compressed with the deflate algorithm, which is the algorithm used by zip/unzip. Deflate does lossless compression, and more importantly, the algorithm is not patented. It is present in zlib, which is also freely available.
Before compression, the image lines are filtered by one of five possible filter types to improve the compression ratio. Many kinds of pixel formats are supported. Finally, PNG supports interlacing, so pixels aren't necessarily stored neatly in sequence.
This is just to illustrate that writing a PNG loader from scratch is not easily done. We don't have so much time on our hands, and will therefore be using the library.
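
To give an idea of the container format (and only that), here is a small sketch that walks the chunks of a PNG file held in memory. It relies on the layout of the file: an 8-byte signature, then chunks that each consist of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC. The function names are made up for this example, the CRC is not verified, and nothing is decompressed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Read a 32-bit big-endian integer.
static std::uint32_t read_be32(const unsigned char *p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Returns the chunk types found in 'data' in file order, or an empty list if
// the PNG signature is missing. Stops at the first truncated chunk.
std::vector<std::string> list_png_chunks(const unsigned char *data, std::size_t size) {
    static const unsigned char signature[8] =
        {0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};

    std::vector<std::string> types;
    if (size < 8 || std::memcmp(data, signature, 8) != 0) return types;

    std::size_t pos = 8;
    while (size - pos >= 12) {                  // room for length + type + CRC
        const std::uint32_t length = read_be32(data + pos);
        if (length > size - pos - 12) break;    // chunk runs past the end
        types.emplace_back(reinterpret_cast<const char *>(data + pos + 4), 4);
        pos += 12 + length;                     // skip data and CRC (not verified)
    }
    return types;
}
```

For a typical image this yields something like IHDR, a few optional chunks, one or more IDAT chunks, and IEND. Getting from there to actual pixels is where the real work starts.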

A walkthrough of the code
Let's load up some pixels — should be easy, right? So you open up the manual for libpng, only to find that it starts with a whopping 20 pages listing just function headers. That's right, there are about 260 different functions in libpng. We won't need to use them all, but unfortunately, loading a PNG image is not quite straightforward. But let's get down to it.

Our PNG_Image class will have height, width, a bytes per pixel member, a color format (for OpenGL), and a pixel buffer. Following is the annotated code needed to load a PNG image. The method will take a filename and return true on success.

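At the top of the file, include the appropriate headers:

#include <cstdio>      // fopen(), fread(), fclose()
#include <csetjmp>     // setjmp(), used for libpng error handling
#include <png.h>
#include <GL/gl.h>     // OpenGL constants; this is <OpenGL/gl.h> on OSX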

First off, we will want to open a .png file and check the header for the “PNG” signature.

    FILE *f = fopen(filename, "rb");
    if (f == nullptr) {
        return false;
    }

    // check that the PNG signature is in the file header

    unsigned char sig[8];
    if (fread(sig, 1, sizeof(sig), f) < 8) {
        fclose(f);
        return false;
    }
    if (png_sig_cmp(sig, 0, 8)) {
        fclose(f);
        return false;
    }

Next, we need to create two data structures: png_struct and png_info. Many of libpng's functions will require us to pass in these structs. You can think of png_struct as a “PNG object” that holds state for various functions. It won't hold any image data, however. The png_info struct will hold metadata like image height and width. Creating these structs allocates memory from the heap; if anything goes wrong we must take care to deallocate them, or else we'd be leaking memory.
[Also mind that we still have an open file].

    // create two data structures: 'png_struct' and 'png_info'

    png_structp png = png_create_read_struct(PNG_LIBPNG_VER_STRING, nullptr, nullptr, nullptr);
    if (png == nullptr) {
        fclose(f);
        return false;
    }

    png_infop info = png_create_info_struct(png);
    if (info == nullptr) {
        png_destroy_read_struct(&png, nullptr, nullptr);
        fclose(f);
        return false;
    }

Mind that plain C has no exceptions. To work around this, libpng has an error handling mechanism that works with setjmp(). Imagine decompressing a corrupted image, and you don't want the program to crash. What libpng does is, it jumps back to the point where you issued the setjmp() call. So this here is just some cleanup code.

    // set libpng error handling mechanism
    if (setjmp(png_jmpbuf(png))) {
        png_destroy_read_struct(&png, &info, nullptr);
        fclose(f);

        if (pixels != nullptr) {
            delete [] pixels;
            pixels = nullptr;
        }
        return false;
    }

Q: Where did that pixels array come from?
A: That's our pixel buffer member variable. It won't be allocated until later. The construction with setjmp() is turning the code inside out.

We are just getting started here. Let's gather some basic information: image height, width, and such.

    // pass open file to png struct
    png_init_io(png, f);

    // skip signature bytes (we already read those)
    png_set_sig_bytes(png, sizeof(sig));

    // get image information
    png_read_info(png, info);

    w = png_get_image_width(png, info);
    h = png_get_image_height(png, info);

We're only interested in RGB and RGBA data, so other bit depths and color types are converted. So even if the image is in grayscale or whatever, we will get RGB data when reading the image. It's nice that libpng can do these conversions on the fly.

    // use at least one byte per channel
    if (png_get_bit_depth(png, info) < 8) {
        png_set_packing(png);
    }

PNG may have a separate transparency chunk. Convert it to alpha channel.

    // if transparency, convert it to alpha
    if (png_get_valid(png, info, PNG_INFO_tRNS)) {
        png_set_tRNS_to_alpha(png);
    }

Next, get the color type. The format variable is meant as a parameter for OpenGL texturing. If the format can not be determined, give up and return an error.

    switch(png_get_color_type(png, info)) {
        case PNG_COLOR_TYPE_GRAY:
            format = GL_RGB;
            png_set_gray_to_rgb(png);
            break;

        case PNG_COLOR_TYPE_GRAY_ALPHA:
            format = GL_RGBA;
            png_set_gray_to_rgb(png);
            break;

        case PNG_COLOR_TYPE_PALETTE:
            format = GL_RGB;
            png_set_expand(png);
            break;

        case PNG_COLOR_TYPE_RGB:
            format = GL_RGB;
            break;

        case PNG_COLOR_TYPE_RGBA:
            format = GL_RGBA;
            break;

        default:
            format = -1;
    }
    if (format == -1) {
        png_destroy_read_struct(&png, &info, nullptr);
        fclose(f);
        return false;
    }

We don't want to receive any interlaced image data. Even if the image is interlaced, we want to have the ‘final’ image data. There's a func for that:

png_set_interlace_handling(png);

We are done requesting conversions. Now tell libpng to update the info struct, so that it describes the image data as it will be delivered after conversion:

png_read_update_info(png, info);

Only now can we get the number of bytes per pixel, regardless of channels and bit depth. Asking for it before the update would give the row size of the image as stored in the file, not of the converted data:

bpp = (unsigned int)png_get_rowbytes(png, info) / w;
if (png_get_channels(png, info) == 4) format = GL_RGBA;   // tRNS may have added alpha

[That bpp variable is bytes per pixel, not bits per pixel].

Now we are ready to go read the image data. We deferred allocating the pixel buffer up till now (this is how the manual says you should do it).

Annoyingly enough, png_read_image() expects an array of pointers to store rows of pixel data. So you need to construct both a pixel buffer and a temporary array of pointers to the rows in that pixel buffer. And then you can read the image data.

    // allocate pixel buffer
    pixels = new unsigned char[w * h * bpp];

    // setup array with row pointers into pixel buffer
    png_bytep rows[h];
    unsigned char *p = pixels;
    for(int i = 0; i < h; i++) {
        rows[i] = p;
        p += w * bpp;
    }

    // read all rows (data goes into 'pixels' buffer)
    // Note that any decoding errors will jump to the
    // setjmp point and eventually return false
    png_read_image(png, rows);

You actually need to read the end of the PNG data. This will read the final chunk in the file, and check that the file is not corrupted.

png_read_end(png, nullptr);

Finally, clean up and return true, indicating success.

    png_destroy_read_struct(&png, &info, nullptr);
    fclose(f);
    return true;

Well ... that was a little odd, but we managed to load some pixels. Not included in this tutorial: we can pass the pixel buffer on to OpenGL to create textures.

Opinion or plain justified criticism

libpng provides tons of functionality for working with PNG images. Be that as it may, there is a lot not to like here. For one thing, loading an image is a ridiculous, unwieldy process. Conspicuously absent is a simple and stupid png_load() function that does all that listed above, now left as an exercise by the developers of libpng.

All I wanted to do was load an image, but libpng required me to learn everything there is to know about the file format, the library, and then some.

At any rate, do the dance and it will work. If it's too much for you, there's always SDL_image.

Fast Pseudo Random Number Generator

2015-04-27T06:10:00.001+02:00

Games need a good pseudo random number generator (PRNG). Flipping cards, rolling dice, determining hit points, choosing whether to fire a bullet or not, all these are random choices. Random is a strange phenomenon though. The set of numbers 1,1,1,1 is just as random as 4,2,1,3 — even though I carefully hand-picked both sequences, the latter looks more random. A shuffled list of numbers is perceived as being more random than a list of truly random numbers.

It is a well-known fact (or actually, myth) that computers do not do random numbers; instead they are computed and therefore they are predictable. This was true until computer scientists found that real-world noise like typing speed, mouse movement, networking interrupts, and hard drive rotation statistics may be used for entropy: a source for random number generation. This is only a source though and the produced random number sequences may still be predictable depending on the algorithm used. Cryptographically safe RNGs aim to make it impossible to guess the next random number. There is a kind of chicken-and-egg relation here: crypto functions need good RNGs to be able to scramble the input; the scrambled output appears as random as any random data. Therefore the crypto function can be used as a RNG.

Anyway, there are some problems when it comes to games. Fast-paced action games may make hundreds of random choices every frame. At 60 frames per second, we don't want to take too much time generating random numbers. Therefore cryptographically safe RNGs are off limits; we are going to have to settle for a more predictable PRNG.

Another problem is that at 60 fps, we might quickly drain the entropy pool. We can't afford having to wait for new entropy, and we don't want the operating system to run out. So even though modern operating systems can supply fairly truthful random numbers, we are not going to bother and stick with “decent enough” instead; we are going at it old-skool and calculate the pseudo random numbers by ourselves.

PRNG requirements for games:

it must be fast
it must show good distribution
it must not repeat numbers often
it does not need to be cryptographically safe

In them old days, we would use rand() and seed it by calling srand(). rand() is a notoriously bad PRNG and you should not use it nowadays. The BSD man page even calls it “bad random number generator”. It has a low period, meaning that after a while the same sequence of numbers starts to repeat. This is a bad habit for a PRNG.

I personally have bad experiences with rand(). My Tetris clone would happily generate the same block over and over again, giving some really bad gameplay experiences. I started handing out bonuses for this, but it didn't make it more fun. The game notably improved when I switched to a different PRNG.

The modern, new and improved equivalent of rand() is random(). It should be noted however that there might be a portability issue here; random() may not be available on your platform of choice.

Mersenne Twister
A popular modern PRNG is the Mersenne Twister. It is named after the Mersenne prime 2¹⁹⁹³⁷ − 1, which is the length of its period. It is a well respected PRNG, and an easy choice for C++ programmers, as it is included in the C++11 standard library. The C++11 way to get some random numbers is:

#include <random>

    std::random_device rd;     // only used once for seeding
    std::mt19937 rng(rd());    // select Mersenne Twister
    std::uniform_int_distribution<int> dist(0, 1000);
    int n = dist(rng);

The Mersenne Twister is relatively slow for game code. Apparently it badly stresses the CPU cache.

XorShift

The simplest and fastest decent PRNG there is, is the XorShift algorithm. It simply does a couple of bit-shift and exclusive-OR operations, and that's it. It has four internal state variables that you should initialize (in a more or less random manner, as you see fit).

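#include <stdint.h>

// internal state; seed it as you see fit, but never with all zeroes
// (these particular values are just an example)
static struct { uint32_t x, y, z, w; } state =
    { 123456789, 362436069, 521288629, 88675123 };

uint32_t xorshift128(void) {
    // algo taken from wikipedia page
    uint32_t t = state.x ^ (state.x << 11);
    state.x = state.y;
    state.y = state.z;
    state.z = state.w;
    state.w = state.w ^ (state.w >> 19) ^ t ^ (t >> 8);
    return state.w;
}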

XorShift is incredibly simple, but good enough for games. In that respect, it is a perfect fit for our purposes. There are XorShift* and XorShift+ variants that use multiplication and addition respectively, and produce 64-bit values.

Statistical Analysis

In a feeble attempt to prove that XorShift suffices, I calculated the standard deviation for a series of numbers between 0 and 1000 generated by rand(), random(), std::mt19937, and xorshift128(). They all produce a standard deviation around 285, meaning that they all exhibit the same fairly high spread of values. In that sense, we can say that probably they all produce sets of random values usable for games. This is by no means an extensive analysis of the PRNGs, of course. There exists a test suite named TestU01 for testing PRNGs, but I'll leave it up to the math and computer scientists to play around with that.
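
The measurement code is not shown, but a test like this can be sketched in a few lines. The function name `sample_stddev` and the use of `% 1001` to map values onto 0..1000 are assumptions for this example. For reference, an ideal uniform generator over the integers 0..1000 has a standard deviation of sqrt((1001² − 1) / 12), which is roughly 289.

```cpp
#include <cmath>
#include <random>

// Standard deviation of 'count' samples in the range 0..1000, drawn from any
// generator object that can be called like next() and returns an unsigned value.
template <typename Generator>
double sample_stddev(Generator &next, int count) {
    double sum = 0.0;
    double sum_sq = 0.0;
    for (int i = 0; i < count; i++) {
        const double v = static_cast<double>(next() % 1001);
        sum += v;
        sum_sq += v * v;
    }
    const double mean = sum / count;
    return std::sqrt(sum_sq / count - mean * mean);   // sqrt of the variance
}
```

Usage would be along the lines of `std::mt19937 rng(12345); double sd = sample_stddev(rng, 100000);`. Any of the other generators can be passed in by wrapping it in a lambda.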

For now, I'm happy to have found that the XorShift PRNG is exceptionally fast, while still producing decent enough pseudo random numbers. It is a perfect fit for games.

Wrapping Worlds

2015-03-08T13:37:00.000+01:00

In the classic game Pac-Man, there is a tunnel in the middle of the level; when you go in on one side, you emerge on the other side of the screen. It plays an important part in gameplay, as you can use it to evade the fast red ghost.

In the classic Defender, if you fly your spaceship far enough to the right (or left, as you can go both ways), space will wrap around and you will eventually return to the same point. The game world of Defender is cylindrical.

You will find this technique of wrapping worlds a lot in classic 2D games. It doesn't work well for 3D games: because of their level of realism and immersiveness, it feels weird if a 3D world wraps. [The irony is that the real world does seem to wrap around, as observed by a traveller on the ground]. On the other hand, I always feel weird when reaching the edge of the obviously square world in 3D games.

Anyway, wrapping worlds are a lot of fun for 2D retro style games, and I couldn't find much information on the net how to go about and implement such a thing. So I devised a couple of methods and wrote them down here.

1. The straightforward way

Wrapping is simply achieved by using the modulo operator. If the player moves off the map, her new position is taken modulo the world width. Consequently, when a player moves off the map on the right, she re-emerges on the left. That was easy.

But not so fast. A monster chasing the player will effectively see the player being teleported from the right side of the map, to the far left side of the map. The monster will now change direction, and has to travel all the way across the map in order to get near the player again. So to fix this, you might set a flag that a monster is not allowed to turn if ... But there is more. A similar problem exists for monsters that live in the left side of the map. They will not see that the player is actually close nearby, until the player is teleported in over the right edge of the map. By now it's becoming clear that wrapping is not as easy as it looks.
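
A minimal sketch of the modulo wrap, with integer coordinates assumed for simplicity. The second function is not one of the methods described in this post; it is merely one possible patch for the chasing problem, in which a monster looks at the shortest distance around the world rather than at raw map coordinates. Both function names are made up.

```cpp
// Wrap x into the range [0, width). The extra "+ width" step is needed because
// the % operator in C and C++ gives a negative result for negative x.
int wrap(int x, int width) {
    return ((x % width) + width) % width;
}

// Signed shortest distance from 'from' to 'to' on a world that wraps at 'width'.
// Positive means "go right", negative means "go left".
int wrapped_delta(int from, int to, int width) {
    int d = wrap(to - from, width);     // 0 .. width-1, always going right
    if (d > width / 2) d -= width;      // going left is shorter
    return d;
}
```

With a world width of 100, a monster at 95 and a player at 5 are only 10 units apart, while plain subtraction would claim a distance of 90 in the opposite direction.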

2. Overscan areas

Rather than having the map just be the map, allow the player (and everything else) to exist in a boundary outside the world. The boundary area is like an overscan area.
This region is a copy of the other side of the world. Once the player ventures too far, teleport everything in the overscan area to the other side.

The overscan area should at least be a screen wide to work properly. Maps should also be at least two screens wide. This method is relatively costly on memory because it duplicates large parts of the map.

3. Far offset

The previous method still has a problem with the direction of monsters that are outside the screen.
A way to solve this is to have the game take place at a far offset. The problem with wrapping is partly because we're too close to zero. By translating the origin to, say, 100,000, we're largely preventing the problem of monsters heading the wrong way. It's mostly a band-aid though as the situation still does exist. Eventually the player may reach the zero line, at which point everything needs to be rebased at the far offset again. It is unlikely that any player will venture so far that she wrapped around the world a dozen times, but nevertheless possible.

4. Sliding window

Let the area around the player be a sliding window that glides over the map. The area should at least be what's visible on screen, but may be larger to allow off-screen monsters to come in and attack.

The sliding window uses ‘stripping’ to phase in and phase out relevant parts of the map. For example, as the window slides to the right, a vertical strip slides into the window on the right, and a vertical strip slides out of view on the left. This works particularly well for tiled maps. Monsters can be frozen into the tile map, and come to life as soon as they come into view. The coordinates of alive objects are relative to the sliding window (they are no longer in ‘map’ space) and therefore there is no directional problem when the world wraps.
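
As a rough illustration of the bookkeeping involved, here is a sketch for a tiled map that wraps horizontally. The `Window` struct and the function names are hypothetical, and only whole tile columns are considered.

```cpp
// Wrap x into the range [0, width), also for negative x.
int wrap(int x, int width) {
    return ((x % width) + width) % width;
}

struct Window {
    int left;    // map column at the left edge of the window
    int width;   // window width in columns
};

// Slide the window one column to the right. Returns the map column of the
// vertical strip that enters the window on the right-hand side.
int slide_right(Window &w, int map_width) {
    w.left = wrap(w.left + 1, map_width);
    return wrap(w.left + w.width - 1, map_width);
}

// Convert a map column to a column relative to the window's left edge.
// Results of w.width or more lie outside the window.
int to_window(const Window &w, int map_col, int map_width) {
    return wrap(map_col - w.left, map_width);
}
```

On a map of 10 columns, a window of width 4 starting at column 8 covers columns 8, 9, 0 and 1. After sliding right it starts at column 9, and column 2 is the strip that came into view.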

I implemented each of these and they all work. What seems to me the best method however is the sliding window technique. Even for games that are not tile based at all, you may want to consider adding a tile grid anyway. The sliding window technique makes a clear distinction between objects that are alive, and objects that are not worth spending time on. You can enjoy a massive world with millions of objects without breaking a sweat. And what's fun, the world wraps.

Rock solid frame rates with SDL2

2015-02-11T22:05:00.000+01:00

For fast-paced action games you need to have a steady framerate. If the framerate is not steady, you are likely to see stutter. It's a periodic wobble, jerky movement, as if a frame was skipped. It's very annoying, and once you see it, you can't unsee it. I ran into this problem and for a moment there, I just blamed SDL, heh! In forums online, I found more people complaining about stutter with SDL. The accepted answer on StackOverflow for this problem was wrong. Game dev forums did turn up some good tips. Finally, I was able to piece it together. Apparently this topic is not well understood by many—and the ones who do get it right don't seem to get just what the problem is. It has indeed to do with frame rate, but it's not skipping. Something else is up.

Loop the loop

Games run in a loop. You update some monsters, do the rendering, and cap the frame rate by delaying until it's time to display the next frame. It's likely to be programmed something like this:

    move_monsters();
    render();

    float freq = SDL_GetPerformanceFrequency();

    now = SDL_GetPerformanceCounter();
    secs = (now - last_time) / freq;
    SDL_Delay(FRAMERATE - secs * 1000);

    swap_buffers();

NB. You might also use SDL_GetTicks(). SDL 2 offers new performance counter functions. These have much greater precision than SDL_GetTicks().

We've subtracted the time that the game was busy, so to complete a frame we need to wait the remaining time. This is exactly where the bug is, odd as it may seem. To prove it, try timing SDL_Delay() calls in a loop:

    float freq = SDL_GetPerformanceFrequency();

    t0 = SDL_GetPerformanceCounter();
    SDL_Delay(30);
    t1 = SDL_GetPerformanceCounter();

    printf("time slept: %.2f msecs\n", (t1 - t0) / freq * 1000.0f);

Result:

    time slept: 31.51 msecs
    time slept: 31.31 msecs
    time slept: 34.06 msecs
    time slept: 32.15 msecs
    time slept: 34.54 msecs
    time slept: 33.07 msecs

But we asked to sleep only 30 milliseconds! How can this be?

The problem is not so much with SDL as with the operating system. This can be proved by replacing SDL_Delay() with usleep(), and the problem persists. Even if you try the insanely precise nanosleep() call, the problem persists. Sleeping is inaccurate—despite the fact that modern computers are very capable of doing high precision timing.

Some try fixing this by sleeping a little less, and then doing a busy loop to get as close to the sleep time as possible. This almost works, but it isn't perfect and you will still see stutter. The non-deterministic nature of a multitasking operating system gets in our way.

Thou shalt wait

So, the system is incapable of sleeping an exact amount of time. How about not sleeping at all?

The trick of sleeping until the next frame is probably a spin-off of the waiting for vsync trick. But sleeping isn't the same thing, and it doesn't work because it doesn't necessarily bring you in sync with anything. So instead of sleeping, we are going to wait for vsync.

    sdl_window = SDL_CreateWindow(title, x, y, w, h, SDL_WINDOW_OPENGL);
    gl_context = SDL_GL_CreateContext(sdl_window);
    SDL_GL_SetSwapInterval(1);

This is for use with OpenGL. It will enable the swap interval (wait for vsync) before swapping the render buffers. You can remove any SDL_Delay() calls, because during the swap interval the program already sleeps. If you are using the SDL render functions, it's done like this:

    renderer = SDL_CreateRenderer(win, -1,
        SDL_RENDERER_ACCELERATED|SDL_RENDERER_PRESENTVSYNC);

Now SDL_RenderPresent() will wait for vsync before presenting the buffer.

Moving on

That's only half of the story. The problem of stutter is not solely caused by an inaccurate sleep timer. It stutters because movement is too rigid:

new_pos = old_pos + speed;

This sort of code always advances the position by the same amount, no matter whether time advanced evenly or not. Since time never advances perfectly in line with the desired rate (as we saw earlier), this code will produce stutter.

To solve this problem, we must write code that can handle variable timings. It works just like in physics class:

    /* elapsed time for this frame, in seconds */
    float delta_time = (now - t0) / (float)freq;

    new_pos = old_pos + speed * delta_time;

With SDL_GetPerformanceCounter() and SDL_GetPerformanceFrequency() you can calculate the time step for this frame in seconds. Note that delta_time holds a small value, so crank up the speed. If your game world is defined in meters, you can simulate things rather faithfully with speed in meters per second.
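
As a rough sketch of the time step calculation, here it is in plain C without any SDL calls, so it can be tried in isolation. The function names elapsed_seconds() and advance() are made up for this example; in a real game the counter values would come from SDL_GetPerformanceCounter() and the frequency from SDL_GetPerformanceFrequency().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: turn two performance counter readings into
 * elapsed seconds. freq is the counter frequency in ticks per second. */
double elapsed_seconds(uint64_t now, uint64_t last, uint64_t freq)
{
    assert(freq > 0 && now >= last);
    return (double)(now - last) / (double)freq;
}

/* Hypothetical helper: move a position by speed (units per second)
 * over delta_time seconds. */
double advance(double old_pos, double speed, double delta_time)
{
    return old_pos + speed * delta_time;
}
```

With a counter that ticks 1000 times per second, a frame that took 500 ticks gives a delta_time of 0.5, and an object moving at 4 units per second travels 2 units in that frame, whatever the frame rate happens to be.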

At last

Lo and behold, our frame rate is now rock steady and scrolling is super smooth. It's locked down at 60 fps, with very low CPU usage, even on my five year old computer. Like with Columbus' egg, it's easy once you know how.

My advice for testing is to create a large checkerboard and scroll it across the screen. If it stutters, you will see it. Also try disabling the vsync just for fun, and watch the fps counter go nuts.

Memory Management: the SLUB allocator

2015-01-11T17:09:00.002+01:00

Last time I wrote about the SLAB allocator, a method for caching objects in a pool. And then I read about SLUB, the slab allocator that is present in modern versions of the Linux kernel, and got a bit carried away with it. SLUB is brilliant due to its simplicity, and like the Linux kernel, by now I have thrown out my SLAB code in favor of SLUB. That's how it goes with the software lifecycle; out with the old, in with the new. Descriptions of how these allocators work often either do not go into enough detail, or are way too cryptic to understand, so I'll do my best to shed some light.

Pool allocators basically cache objects of the same size in an array. Because we are talking low-level memory allocators here, the absolute size of the array is always going to be a single page. A single page is 4 kilobytes. [Yes, I know hugepages can be 2 megabytes, and sometimes even a gigabyte, but for the sake of simplicity we'll work with regular 4k pages here].
To keep track of what array slots are in use and what slots are free, they keep a free list. The free list is just a list of indices saying which slots are free. It is metadata that has to be kept somewhere. The difference between SLAB and SLUB is the way they deal with that metadata (and boy, what a difference it makes).

In SLAB, the freelist is kept at the beginning of the memory page. The size of the freelist depends on the number of objects that can fit in the page. When the objects are small, more of them fit into the page and the freelist is larger; when the objects are large, fewer fit and the freelist is smaller.
SLUB is much more clever than this. SLUB decides to keep the metadata elsewhere, and uses the full page for object storage. The result is that SLUB can do something that SLAB can not: it can store four objects of 1k each in a 4k page, or two objects of 2k, or one object of 4k in size. This is something that SLAB could never do.

But what about the metadata? SLUB puts the free pointers inside the memory slots. The objects are free, so this memory is available for us to use.
The other metadata (like, where is the first free object? What is my object size?) is kept in a Page struct. All pages in memory are tracked by an array of Page structs. Beware the confusion here: a Page struct is not the actual memory page, it just represents the metadata for that memory page.

All we did was take that little bit of metadata from each slab, and store it elsewhere. The net result is that we can make use of memory much more effectively than before.

In our last implementation we used C++ templates to make things type safe. While this makes sense from a C++ programmer's perspective, it doesn't make any sense from a memory allocator's perspective. The memory allocator has bins for sizes, not for types. This alone is reason enough not to code it using templates. Stroustrup himself said that object allocation and object creation are separate, and there's no arguing with that.
Therefore the SLUB is implemented as plain C code, and it turns out very straightforward and easy to understand. You have 4k of memory, and the number of objects you can store equals 4k divided by object_size. The first free object slot is page_struct.free_index. When you allocate an object, return the address of slot[page_struct.free_index]. Update the free_index and zero the object's memory, of course. And that's all there is to it, really.
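
To make that a bit more tangible, here is a minimal sketch of the idea in plain C. It is not the kernel's code and not my actual allocator. The names (page_meta, slub_init, slub_alloc, slub_free, SLUB_NO_SLOT) are invented for the example, and it assumes the link stored inside each free slot is a slot index rather than a real pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLUB_PAGE_SIZE 4096
#define SLUB_NO_SLOT   0xFFFFu   /* assumed marker: no free slot left */

/* Metadata kept outside the page (the "Page struct" of the text). */
struct page_meta {
    unsigned char *mem;          /* the 4k page, used only for objects */
    size_t object_size;          /* size of one slot in bytes */
    unsigned short free_index;   /* first free slot, or SLUB_NO_SLOT */
};

void slub_init(struct page_meta *pg, unsigned char *mem, size_t object_size)
{
    size_t n, i;

    assert(object_size >= sizeof(unsigned short));
    assert(object_size <= SLUB_PAGE_SIZE);
    pg->mem = mem;
    pg->object_size = object_size;
    n = SLUB_PAGE_SIZE / object_size;
    /* every free slot stores the index of the next free slot */
    for (i = 0; i < n; i++) {
        unsigned short next = (unsigned short)((i + 1 < n) ? i + 1 : SLUB_NO_SLOT);
        memcpy(mem + i * object_size, &next, sizeof next);
    }
    pg->free_index = 0;
}

void *slub_alloc(struct page_meta *pg)
{
    unsigned char *obj;

    if (pg->free_index == SLUB_NO_SLOT)
        return NULL;                          /* page is full */
    obj = pg->mem + (size_t)pg->free_index * pg->object_size;
    /* the slot we hand out tells us which slot is free next */
    memcpy(&pg->free_index, obj, sizeof pg->free_index);
    memset(obj, 0, pg->object_size);          /* hand out zeroed memory */
    return obj;
}

void slub_free(struct page_meta *pg, void *ptr)
{
    unsigned char *obj = ptr;
    size_t offset = (size_t)(obj - pg->mem);

    assert(obj >= pg->mem && offset < SLUB_PAGE_SIZE);
    assert(offset % pg->object_size == 0);
    memset(obj, 0, pg->object_size);
    /* the freed slot becomes the head of the free chain */
    memcpy(obj, &pg->free_index, sizeof pg->free_index);
    pg->free_index = (unsigned short)(offset / pg->object_size);
}
```

With an object_size of 1024 this hands out exactly four objects from one page, which is the case described above that SLAB could not handle.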

The engineers at Sun actually took into account CPU cachelines and had slab queues per CPU. This may have been a good idea in 1992 but becomes kind of crazy when you realize that today's supercomputers comprise tens (if not hundreds) of thousands of processor cores.
SLAB also needs to take into account object alignment, which is automatic in SLUB as long as the object size is a multiple of the CPU's word size, and this is already taken care of by the compiler when it packs and pads a struct. At the end of the day, SLAB is unnecessarily complicated, over-engineered. SLUB says: Keep It Simple, Stupid!

Caches
When the slab runs out of free memory slots, you will want to get a new page and initialize it as a new slab. Likewise, when a slab is entirely free, you will want to release the page. This is trivially implemented using malloc() and free(). It's more fun however to actually cache these pages in a linked list. The pointers for the linked list can again be members of Page struct, it's metadata after all.

The Linux kernel optimizes for speed by keeping linked lists of full, partially full, and empty pages. When an object is allocated or freed, it always looks at the partially full list first. If after allocation the page is now full, the page is moved to the full list. If after freeing an object the page is now empty, the page is moved to the empty list.

You may wonder why it holds on to empty pages rather than releasing them immediately. The answer is that getting a new page may well take a long time. So, the empty list is an insurance policy that the cache can always quickly get a page.
If the cache is empty, it will have to get a new page from another allocator. The Linux kernel implements the buddy allocator for managing blocks of pages. Implementing buddy allocator code is a bit complicated (and fun); I might write about that some other time.
Anyway, slab allocators are fascinating, and it's nice to know how they work. You can't talk about SLAB without talking about SLUB, and noting that simplicity beats complexity.

Memory Management: Slab allocator

2014-12-21T23:42:00.001+01:00

Last time we saw we could speed up memory allocations dramatically by using a pool allocator. I had a nagging feeling though. My pool allocator uses a std::vector, and it's doing all kinds of things behind the scenes. Resizing on the fly is cheating and undesired. Another thing that's certainly not right is that we're manually calling the destructor, while the object is alive and well, and present in the vector. This ultimately leads to the double destruction of objects once the pool is destructed; the destructor of the pool destructs the vector, which in turn destructs all objects it holds—effectively destructing objects twice. Double destruction is not a problem when your destructors are coded neatly, but it is something that normally is not supposed to happen. That's a big FIXME remark that needs to be addressed.

As a proposed fix for the problem, let's do away with std::vector altogether. Let's change the vector into a fixed size array. Upon initialization, the space for the array is preallocated, but no objects are created, no constructors are being run. When an object is allocated from the array, its constructor is invoked via placement new. When the (fixed size) array runs out of space, allocate a new ‘subpool’ and chainlink them together as a linked list.
When an object is freed, you can look at the address and work out which array slot it came from. When all is well, manually call the destructor. If that pointer came from somewhere else, generate a fatal error.

This is a slab allocator. There is an opportunity here to implement garbage collection, and it's not as difficult as it may seem. All the allocator needs to do is keep track of which slots are allocated, and which slots are free—and it already needs to do that anyway. Once we get that right, garbage collection is simply destroying any objects that are still lingering around when the pool is destructed.

In order to track which slots are free, you could keep a small bitmap. If there are eight slots per slab, the bitmap is as small as one single byte. If you choose a single 64-bit int for a bitmap, the slab can hold 64 slots. That's a big slab. Personally, I went for sixteen objects in a slab.
Although a bitmap is perfectly okay for tracking free slots, you'll need to run through lots of bit shifting loops for the bookkeeping. The performance can be greatly improved by employing a different technique.
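
For comparison, this is roughly what the bitmap bookkeeping could look like in plain C for a slab of sixteen slots. It is only a sketch; bitmap_alloc() and bitmap_free() are names made up for the example. Note the loop that has to hunt for a clear bit on every allocation.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS_PER_SLAB 16   /* sixteen objects per slab, one bit each */

/* Find the first clear bit, set it, and return its slot number.
 * Returns -1 when every slot is taken. */
int bitmap_alloc(uint16_t *bitmap)
{
    int slot;

    for (slot = 0; slot < SLOTS_PER_SLAB; slot++) {
        uint16_t mask = (uint16_t)(1u << slot);
        if ((*bitmap & mask) == 0) {
            *bitmap |= mask;
            return slot;
        }
    }
    return -1;
}

/* Clear the bit for a slot that is currently in use. */
void bitmap_free(uint16_t *bitmap, int slot)
{
    uint16_t mask;

    assert(slot >= 0 && slot < SLOTS_PER_SLAB);
    mask = (uint16_t)(1u << slot);
    assert(*bitmap & mask);          /* freeing a free slot is a bug */
    *bitmap &= (uint16_t)~mask;
}
```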

I know the Linux kernel internally uses a slab allocator, and therefore I took a peek in the book Understanding the Linux Kernel. It's in there alright, but unfortunately I couldn't make any sense of it. This stuff is hard. Or maybe not so much: I did find a rather good explanation on the web in the documentation on kernel.org.

What happens in the kernel is that each slab has a freelist, which is an array of integers (since we store only small numbers, these can be bytes rather than 64-bit ints). Each entry in the array represents an index to the next free slot. So, the index functions somewhat as a pointer to a list of free objects.
The final entry in the freelist array is -1, marking the end, when the slab runs out of memory. Additionally, there is an integer free_slot which denotes the first free slot. Since this is always the index of the first free slot, all the allocator does is access this number when an object is requested. It doesn't even have to search a free block, it already knows which object slot is free.
It's not described exactly in the kernel documentation, but at this point you should mark the freelist slot as being in use. Again, this is important for garbage collection.
[In the kernel source the member is actually called free, which in my code would inconveniently clash with the std::free() function, as well as my Slab<T>::free() method. Even though C++ has namespacing I still went with free_slot instead.]

For illustration purposes, consider what happens when we have a slab allocator with four object slots. Upon initialisation:

free_slot: 0
free_list: { 1, 2, 3, END }

Now, let's allocate an object.

free_slot: 1
free_list: { IN_USE, 2, 3, END }

Let's allocate one more.

free_slot: 2
free_list: { IN_USE, IN_USE, 3, END }

When memory is freed, look at the address to find out to which slab it belongs, and which slot it is. The free_slot pointer points at the next free object; insert this number into the freelist slot. This effectively marks the slot as free again. If the slot was not marked as in use, something would be terribly wrong and the program should abort immediately. The free_slot pointer now becomes the slot number itself, thus the first free object is the object we just freed. Note how the freelist works in a LIFO fashion.

When we free the first allocated object, the result is:

free_slot: 0
free_list: { 2, IN_USE, 3, END }

If we want to garbage collect at this point, we can easily see that the object in slot #1 is still hanging around. We can garbage collect by freeing it. The situation then becomes as follows:

free_slot: 1
free_list: { 2, 0, 3, END }

The freelist now appears shuffled, but that doesn't matter; it operates as a LIFO queue. If we now allocate an object again, it fetches the object in slot #1.
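
The walk-through above can be replayed with a few lines of plain C. This is only a sketch of the bookkeeping, without any object storage. The names fl_init(), fl_alloc() and fl_free() are invented for the example, and the values chosen for FL_END and FL_IN_USE are assumptions (any two values that are not valid slot numbers would do).

```c
#include <assert.h>

#define NUM_SLOTS  4
#define FL_END     (-1)   /* end of the free chain */
#define FL_IN_USE  (-2)   /* slot has been handed out */

struct freelist {
    signed char free_slot;              /* first free slot, or FL_END */
    signed char free_list[NUM_SLOTS];   /* next free slot, per slot */
};

void fl_init(struct freelist *fl)
{
    int i;

    for (i = 0; i < NUM_SLOTS; i++)
        fl->free_list[i] = (signed char)((i + 1 < NUM_SLOTS) ? i + 1 : FL_END);
    fl->free_slot = 0;
}

/* Returns the allocated slot number, or FL_END when the slab is full. */
int fl_alloc(struct freelist *fl)
{
    int slot = fl->free_slot;

    if (slot == FL_END)
        return FL_END;
    fl->free_slot = fl->free_list[slot];   /* next free slot becomes the head */
    fl->free_list[slot] = FL_IN_USE;
    return slot;
}

void fl_free(struct freelist *fl, int slot)
{
    assert(slot >= 0 && slot < NUM_SLOTS);
    assert(fl->free_list[slot] == FL_IN_USE);   /* otherwise: double free */
    fl->free_list[slot] = fl->free_slot;        /* link to the old head */
    fl->free_slot = (signed char)slot;          /* freed slot is the new head */
}
```

Allocating twice, then freeing slot 0 and slot 1 in that order, leaves free_slot at 1 and the list at { 2, 0, 3, END }, the same state as in the walk-through.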

Managing the freelist like this is deceptively simple. This slab allocator greatly outperforms our previous solution.
Combining everything said here, the data structure in code is:

template <typename T>
struct Slab {
    Slab<T>* next;
    signed short free_slot;
    signed char free_list[NUM_OBJECTS];
    T objects[NUM_OBJECTS];
};

What I initially did here is use fixed size arrays. In the Linux kernel however, the only constant is the size of the memory page for the slab. It calculates how many objects of the desired object size can fit onto that page, taking into account the size of the metadata and necessary alignment. It crams as many objects as possible into a page, but note that the precise amount is different for every type T. For instance, many small objects of type A may fit into a single memory page, while another slab can hold a few large objects of type B.

The Linux kernel improves efficiency of the allocator even more by keeping three separate lists to manage slabs: lists for empty, full, and partially full slabs. On top of that, it keeps caches of slabs, so it always has quick access to them in case of need.
I simply couldn't resist the temptation, so the updated version is:

template <typename T>
struct Slab {
    Slab<T>* prev, *next;
    unsigned short num_objects, num_free;
    unsigned short free_slot;
    // compiler may insert padding here
    // free_list is directly behind the struct in memory
    // objects are behind the free_list in memory
};

Slab<T>* empty, *full, *partial;

[Now, num_objects isn't really a property of the slab; it's a property of the slab type. So why is it implemented as such? The reason is it's not a straightforward constexpr due to alignment, and the freelist having a maximum limit (we're using bytes as slot index, that's only 8 bits). The slab is written as a self-contained object allocator, and therefore it knows how many objects it can contain. In the Linux kernel, each slab also knows its object size. Since we're using C++ templates here, we can simply get away with sizeof(T).]

And you know what they say, safety first. The metadata is kept inside the slab, as well as the objects that are handed out to the application side of things. A badly behaving application may well overwrite other objects, or worse, our freelist. To guard against this, my slab allocator implements red zoning and object markers; magic numbers are inserted in between, and they are checked every single time.
For game code, the extra checks are #ifdef-ed out in a release build, but for other kinds of applications it's probably best to leave them in anyway.
Moreover, be mindful that a memory allocator should always zero the memory: not only when handing out objects, but also when they are freed. If you don't, you will get a massive Heartbleed disaster—the baddest security bug of 2014, the decade, and maybe of all computing history.

Happy holidays, and be careful with fireworks on New Year's Eve. You will need those fingers to do some more programming in 2015.

Memory Management: Pool Allocator

2014-11-30T12:38:00.000+01:00

I want to get back to game programming, so I had another look at the quad-tree. I did a post on quad-trees in August last year. Remember that it's a technique for grouping objects together by chopping up 2D spaces into four quarters. Each interesting quarter is again divided into four parts. This basically goes on until max_depth is reached. The purpose of this space partitioning is grouping objects for the sake of collision testing; objects that are near each other may collide and are worth testing for collisions. Objects that are not near each other can not possibly collide, so we skip those when collision testing.

Having lots of moving objects, it's easiest to construct the quad-tree on every frame, or game tick. Now imagine that they are slow moving objects. If the objects are moving slowly, chances are that the quad-tree will look exactly the same at every frame. So why would you spend all that time building up and tearing down the same quad-tree on every frame? Unfortunately, we have to, and trying to bend an existing quad-tree into shape is much more difficult than simply rebuilding it from scratch. But we can optimize in another way.
The process of rebuilding the quad-tree takes a lot of malloc() and free() calls, and they are not cheap functions. If you want 60 fps, you will want to spend your time better than calling malloc() and free() all the time.
Rather than trying to reuse the entire quad-tree, let's reuse the timber that it's made of. Let's grab the objects that represent the quad-tree nodes and leaves from a pool allocator.

A pool allocator is simply a stack of pointers to objects in memory. Rather than freeing the memory, you push the object onto a stack. Next, when you need to allocate an object, you pop it from the stack. If the stack is empty, you simply allocate a new object using malloc().
A pool allocator only allocates objects for a specific data type. All the objects in the pool are of the same type and size. By no means try to make it a generic allocator, or you'll be writing your own fragmenting memory manager.

In C, you might have:

static QuadTree *qt_pool[MAX_QT_POOL];
static int qt_pool_idx = 0;

QuadTree *qt_alloc(void);
void qt_free(QuadTree *);
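
A possible way to fill in those two functions is sketched below. The QuadTree struct shown here is just a stand-in (the real one has more in it), the value of MAX_QT_POOL is an arbitrary assumption, and freeing for real when the stack is full is one choice among several.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_QT_POOL 256   /* assumed capacity of the pool */

/* Stand-in node type for the sake of the example. */
typedef struct QuadTree {
    struct QuadTree *child[4];
} QuadTree;

static QuadTree *qt_pool[MAX_QT_POOL];
static int qt_pool_idx = 0;   /* number of pointers on the stack */

QuadTree *qt_alloc(void)
{
    if (qt_pool_idx > 0)
        return qt_pool[--qt_pool_idx];   /* pop a recycled node */
    return malloc(sizeof(QuadTree));     /* stack is empty: ask malloc() */
}

void qt_free(QuadTree *qt)
{
    if (qt == NULL)
        return;
    assert(qt_pool_idx >= 0);
    if (qt_pool_idx < MAX_QT_POOL)
        qt_pool[qt_pool_idx++] = qt;     /* push for later reuse */
    else
        free(qt);                        /* stack is full: release for real */
}
```

Note that a recycled node still contains whatever was in it before, so the caller has to reinitialize it after qt_alloc().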

Things get quite interesting when you move to C++. First of all, C++ has a std::vector class that you can use as a dynamic array, meaning that you can grow the pool as large as is necessary without having to worry too much about what the exact limit should be.
Secondly, C++ has template classes for generic programming. A templated Pool class can work with all kinds of types: QuadTrees, Monsters, Bullets, Explosions, etcetera.
Thirdly, C++ classes have constructors and destructors. In order to do this right, the pool allocator must properly construct and destruct objects when they are allocated and freed. In C++, placement new takes some already allocated memory, and calls the constructor for it. And in case you're wondering, manually calling a destructor requires no special trickery.

template <typename T>
class Pool {
public:
    // insert boilerplate here ...

    T* alloc(void) {
        if (v_.empty()) {
            // allocate new object
            return new T();
        }

        // take last object in pool
        size_t idx = v_.size() - 1;

        // use placement new to construct
        T* t = new (v_[idx]) T();

        // remove item from vector
        v_.pop_back();

        // return object pointer
        return t;
    }

    void free(T* t) {
        v_.push_back(t);
        // destruct t
        t->~T();
    }

private:
    std::vector<T*> v_;
};

This code works with raw pointers, and you can get philosophical and argue that these should be std::unique_ptr<T> objects, that are std::move()d around. And with C++14, you can use std::make_unique<T>().
[I didn't bother because personally, I'm no fan of std::unique_ptr. I think the syntax is horrid, and I wish they would have invented a new keyword instead for strong referencing pointers.]

Growing the vector means doing implicit memory allocations, something that we were trying to avoid in the first place..! Luckily, this will only happen the first time around; the pool will grow to a certain size and then stay that size as objects start being reused. An optimization would be to reserve chunks of space in advance, so that the vector need not be resized as often.
You can avoid working with pool arrays altogether, by chain linking objects to a free-list. This requires the objects to have a pointer member though (to make the chain link), and this breaks the generic nature of the Pool template class that is shown above. It would be an elegant solution for plain C however.

Round Balls and Flat Disks

2014-10-25T18:26:00.000+02:00

I used to call gluSphere() to render simple planets in 3D space programs. For reasons beyond me, this function has been deprecated. Apparently, all GLU functions have been scrapped in the newer OpenGL standard. I've been told by an expert that OpenGL maintains backward compatibility by using version numbering, meaning that you should still be able to use old OpenGL 2, as long as you don't try to mix it with OpenGL 4 functions. GLU and GLUT are not really part of the OpenGL standard so that's where the story ends, I suppose. The problem is, I found myself being unable to compile older programs anymore after upgrading the operating system — totally unexpected, an unpleasant surprise. Searching for an answer, I found StackOverflow. The answer: “Do the math.”

I had to scratch my head a little there. I have to admit, it's been a while since I ‘did the math’ on anything, really. A visit to MathWorld turned up some nice formulas. Now, how to put that into code.

Implementing gluSphere()

As you can see in wireframe mode, the gluSphere is built up from quads. Clearly, triangle strips speed up the process. It should be possible to use a single triangle strip, but I didn't bother trying to figure that one out. So I went for stacked triangle strips that wrap around the sphere. This method uses slices and stacks (similar to lines of longitude and latitude) just like gluSphere() does.

Anyway, the math. We need to calculate all the vertices for the triangle strips that make up the sphere. A sphere is like rotating a vector 360 degrees around in a horizontal plane (left, right), while rotating a vector 180 degrees up and down in a vertical plane. We'll call the first angle alpha and the second beta. The size of the sphere is determined by radius r. Any point (x, y, z) on the sphere is given by:

    x = r ⋅ cos α ⋅ sin β
    y = r ⋅ cos β
    z = r ⋅ sin α ⋅ sin β

One caveat is that the angles should be given in radians rather than degrees for the C math library, so use M_PI. Another is the OpenGL coordinate system; you may have to work with -z, or like in my case, I had to exchange the Y and Z-axis to get the desired result.

Knowing this, calculating all vertices is easy. Set up a double loop that calculates all coordinates, taking into account the number of stacks and slices that you wish to have.

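    /* integer counters keep rounding errors out of the loop bounds */
    for(i = 0; i <= num_stacks; i++) {
        beta = M_PI * i / num_stacks;
        for(j = 0; j <= num_slices; j++) {
            alpha = 2.0 * M_PI * j / num_slices;
            x = r * cos(alpha) * sin(beta);
            y = r * cos(beta);
            z = r * sin(alpha) * sin(beta);
            /* store x, y, z in vertex array */
        }
    }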

Calculating Sphere Normals
To get good lighting on the sphere, we need to know the normal vectors. For a sphere centered on the origin, the normal at each vertex is given by normalize(x,y,z). That's easy.

Calculating Sphere Texture Coordinates
Having planets without texture is not fun. Texture coordinates run from 0.0 to 1.0.

x = alpha / (2.0 * M_PI)
y = (M_PI - beta) / M_PI

Implementing gluDisk()

For the rings of Saturn, it's easy to use a gluDisk. gluDisk() renders a flat disk, with an inner hole in the center. In math terms, the disk uses a circle formula, and there are two radii (one for the outer circle, one for the inner hole). So a gluDisk is really nothing but a single triangle strip that loops around, and its vertices are determined by two circles with radius r1 and r2.
The vertices are given by:

    // inner vertex
    ax = r1 ⋅ cos α
    ay = r1 ⋅ sin α
    az = 0.0

    // outer vertex
    bx = r2 ⋅ cos α
    by = r2 ⋅ sin α
    bz = 0.0

Calculating Disk Normals

The disk is really a 2D object in 3D space. The normals point to +Z, they are all defined as (0, 0, 1).

Calculating Disk Texture Coordinates
Texturing a disk is not at all like texturing a sphere. When you put a texture on a gluDisk, it's like sticking a label onto a DVD. You can take a rectangular image and put it onto the circular disk without distorting the image or wrapping it around the object. The texture is mapped to the bounding rectangle of the disk, which is two times the outer radius.

x = ax / (2.0 * r2) + 0.5
y = ay / (2.0 * r2) + 0.5

Do the same for bx and by. Note the plus one half: ax / (2.0 * r2) runs from -0.5 to 0.5, and adding 0.5 shifts it into the 0.0 to 1.0 texture range, so that the image ends up centered on the disk.

Implementing gluCylinder()

I did not implement gluCylinder(), but with the information given for gluDisk() you should be able to pull it off. It's just two circles and an added Z-axis.

Roundup
So, that's it. A bit of simple math, but still a sizable amount of work to get spheres going again in OpenGL. It's something that any serious 3D programmer should be able to do, but I still feel a bit bummed that they just deprecated those really useful functions.

On a final note I'd like to mention that there is another way of creating spheres, one that presumably looks better. Imagine an icosahedron, and subdividing each face into more triangles to enhance resolution. You can subdivide as many times as you like, and at a certain point it will be a good approximation of a sphere. This method is described in the Red Book.

How To strcpy()

2014-09-22T13:29:00.000+02:00

I've written a lot about strcpy() already, the tiny but famous C library function for copying strings. There is a lot to say about its (in)security, and ways to improve on that. I didn't realize before that implementing your own strcpy() function could be part of a job interview for a programmer position. If I were asked to rewrite strcpy(), I would at least put in an abort() (for lack of exceptions) when passing in a NULL pointer. Moreover, I would add a length parameter for doing bounds checking — albeit that might be considered cheating, because it would change the function declaration; the signature of the function.

Rather than zooming in on this rather old topic, I'm going to be lazy for once and just link to this excellent post that tells the story of the solutions given by job applicants for the problem of writing your own strcpy() function. It's a must read, hilarious, but also somewhat alarming.

How Do I Copy Thee? Let Me Count the Ways by David Avraamides

A Bug's Life

2014-08-03T15:46:00.000+02:00

Software bugs are everywhere. This shouldn't be so bad, except that software is everywhere — from your office computer, to your cell phone, home router, television set, car engine, and the list goes on and on. Heck, even an electric toothbrush has a tiny microcontroller programmed with firmware in it. Most of this software has been tested and tried, but bugs keep popping up nevertheless.

Last week I kept a bug diary. The dreadful results are described below.

Saturday, July 26 - FaceTime would not connect. After updating iOS the problem was resolved. Of course, “you must always update to the latest version”, but mind you, this problem never existed before. User experience: Everything was working fine, and then all of a sudden it didn't.

Sunday, July 27 - Tried the OS X Yosemite Beta. Okay, it's a beta release. Reported a bug where the GUI was actually messing up the screen. Sadly couldn't take a screenshot, because it would redraw the screen, effectively clearing things up. Luckily, the bug did reproduce every time.

Monday, July 28 - Didn't go near my computer, but today a colleague sent in a bug report for a piece of software of mine. My bad. One of those cases of “this should never happen”. A set of dictionary keys holds only unique items, so how could it be that an item appeared twice?

Tuesday, July 29 - Encountered trouble trying to build the PCRE regular expression library with gcc 4.9.1 on Linux. Somehow it wouldn't pass confidence tests. Reportedly did work with older gcc; smelled an awful lot like a compiler problem.

Found a recent online tirade by Linus Torvalds, in which he called gcc, I quote, “pure and utter *shit*.” Turns out the compiler's code generator emits buggy code.

Wednesday, July 30 - The pedestrian lights near the office stayed on “Walk” all day long. Nearly got run over by a BMW, as the driver hit the gas when he got the green light. Could have been caused by a loose wire somewhere, but I like to think it was a software bug.

Thursday, July 31 - Twitter app on iPhone kept crashing upon loading a particular news article. Who knows, maybe it was ad malware trying to infect the system.

Friday, August 1 - At last, a small victory. Fixed an endless loop problem in the RADIUS PAM module. Found some additional related issues, such as what by the looks of it is a ~~broken~~ ~~stupid~~ unusual implementation of our RADIUS server.

Special bonus bug of the week (Linux): Exiting an interactive bash shell and getting “Illegal instruction” on the screen. This is most likely caused by stack corruption.

Game On

2014-07-06T14:22:00.000+02:00

I have a PS3 game controller lying around, so I got the idea to add support for it to some old SDL game code. This code was still using SDL 1.2, which is now a thing of the past. So the first thing to do was quickly porting it over to SDL 2, a more modern game programming API. It even includes a GameController API, and because of that, programming support for game controllers with SDL is remarkably easy.

Initialising initialisation
For initialisation, call SDL_Init() with extra flag SDL_INIT_GAMECONTROLLER. Additionally, you must open the game controller. The game controller is found by first querying the system for the number of connected joysticks.

    SDL_GameController *ctrl = NULL;

    SDL_Init(SDL_INIT_VIDEO|SDL_INIT_GAMECONTROLLER);

    int num = SDL_NumJoysticks();
    if (num <= 0) {
        printf("no joy connected\n");
        return false;
    }
    if (!SDL_IsGameController(0)) {
        printf("it's not a game controller\n");
        return false;
    }
    ctrl = SDL_GameControllerOpen(0);
    if (ctrl == NULL) {
        printf("open failed: %s\n", SDL_GetError());
        return false;
    }
    const char *name = SDL_GameControllerName(ctrl);
    printf("game controller: %s\n", name);

This code is just an example. It assumes the first connected joystick is indeed the game controller that the player wants to use. This is of course ridiculous, and you should really loop over all joysticks and open as many connected devices as possible.

Mapping the mappings
In them old days, supporting game controllers was ... problematic. The thing is, every game controller is different and there is no common standard among manufacturers for numbering buttons and analog sticks. Even a simple D-pad differs between controllers: think XBox360, PS3, PS4, but also PS2, Logitech, or the OUYA game controller. SDL 2 solves this incompatibility by mapping the raw device button ids to generic SDL enums. You do need to provide the correct mappings; these can be imported from a flat-file game controller database.

    SDL_GameControllerAddMappingsFromFile("gamecontrollerdb.txt");

    if (SDL_GameControllerMapping(ctrl) == NULL) {
        printf("no mapping for %s\n", name);
        SDL_GameControllerClose(ctrl);
        ctrl = NULL;
        return false;
    }

Finally, enable events so that SDL will generate events for us to collect in the main event loop.

    SDL_GameControllerEventState(SDL_ENABLE);

Moving forward
When SDL game controller events are enabled, you will receive events of type SDL_CONTROLLERAXISMOTION in the event loop. You will find that SDL sends lots of these events, even when just holding the controller. This is just noise on the sensor, and the analog sticks are quite sensitive. Merely touching already registers movement. The ‘axis values’ are a signed 16-bit value, between -32768 and 32767. The SDL documentation says values between -3200 and 3200 are in the ‘dead zone’ where the sticks should be considered at rest.

The SDL_CONTROLLERAXISMOTION event boils down to this:

event.caxis.which     connected joystick number
event.caxis.axis      axis number
event.caxis.value     amount of movement

Analog triggers L2 and R2 register as axis 4 and 5, but they only report value 32767 when pressed, acting like digital buttons. This could be a driver issue (I'm on a Mac) or SDL not properly supporting the triggers. I'm not sure.

Benjamin Buttons
SDL generates SDL_CONTROLLERBUTTONDOWN and SDL_CONTROLLERBUTTONUP events. Inspect event.cbutton.button to see which button was pressed.
The button codes are hard to find in the documentation, but look in SDL_gamecontroller.h to find the enums. The API is modeled after an XBox type controller, for PlayStation the button ‘A’ is a cross, button ‘Y’ is a triangle, etcetera. The ‘guide’ button is the PS logo button.

SDL_CONTROLLER_BUTTON_INVALID = -1
SDL_CONTROLLER_BUTTON_A
SDL_CONTROLLER_BUTTON_B
SDL_CONTROLLER_BUTTON_X
SDL_CONTROLLER_BUTTON_Y
SDL_CONTROLLER_BUTTON_BACK
SDL_CONTROLLER_BUTTON_GUIDE
SDL_CONTROLLER_BUTTON_START
SDL_CONTROLLER_BUTTON_LEFTSTICK
SDL_CONTROLLER_BUTTON_RIGHTSTICK
SDL_CONTROLLER_BUTTON_LEFTSHOULDER
SDL_CONTROLLER_BUTTON_RIGHTSHOULDER
SDL_CONTROLLER_BUTTON_DPAD_UP
SDL_CONTROLLER_BUTTON_DPAD_DOWN
SDL_CONTROLLER_BUTTON_DPAD_LEFT
SDL_CONTROLLER_BUTTON_DPAD_RIGHT

Any other business
I have a single player game so programmatically, I took the easy road and just map all buttons and stick movement to key presses, internally translating to a keyboard joystick. This works, but it's not good enough for multiplayer games. It also loses the ability to properly react to stick acceleration; you might want to scale the axis values to have a proper difference between large and small stick movement.

Some tutorials suggest you ignore the ‘dead zone’, but this is wrong because when you fully ignore those events, you are unable to detect when someone suddenly lets go of the joystick. Instead, the dead zone should be mapped to zero movement. You can also keep track of sticks entering/leaving the dead zone, and react correspondingly.
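
As a rough illustration of mapping the dead zone to zero movement, here is a minimal sketch. The logic is shown in Python for brevity, but it carries over to C directly. The names apply_dead_zone and AXIS_MAX are made up for this example; the threshold of 3200 is the value quoted from the SDL documentation above. Rescaling the remaining range to -1.0 .. 1.0 is just one possible choice.

```python
DEAD_ZONE = 3200    # dead zone threshold quoted from the SDL documentation
AXIS_MAX = 32767    # largest positive value of a signed 16-bit axis


def apply_dead_zone(value, dead_zone=DEAD_ZONE):
    """Map a raw axis value to -1.0 .. 1.0, with the dead zone mapped to 0.0."""
    if -dead_zone <= value <= dead_zone:
        # do not drop the event: report the stick as being at rest
        return 0.0
    # rescale so that movement ramps up smoothly from the edge of the dead zone
    sign = 1.0 if value > 0 else -1.0
    magnitude = (abs(value) - dead_zone) / (AXIS_MAX - dead_zone)
    # clamp, because -32768 lies one step beyond -AXIS_MAX
    return sign * min(magnitude, 1.0)


print(apply_dead_zone(1500))     # inside the dead zone -> 0.0
print(apply_dead_zone(32767))    # stick pushed all the way -> 1.0
```

Because a value inside the dead zone yields 0.0 rather than being ignored, letting go of the stick still produces a "no movement" update.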

The DualShock 3 controller has pressure sensitive buttons. SDL has no support for measuring the amount of pressure. Nor does it support the sixaxis motion sensor. SDL 2 can do rumble however via its Haptic API.

The cool thing about SDL is that it also ports to other platforms, maybe most notably the Steam OS. Since I'm on OS X I also had a quick look at how to do this joystick & game controller thing in pure Cocoa — not so easy at all. I'm with SDL 2 on this one.

Links
SDL 2 GameControllerDB file
SDL 2 API by Category
How to connect a PS3 controller

Go Rust Swift, Vala

2014-06-08T13:13:00.001+02:00

In the WWDC keynote speech, Apple announced their new programming language named Swift. It looks like a scripting language, but it actually compiles to binary, giving you lots of performance. For years, I've been griping about the need for a programming language that's as easy as scripting, while delivering decently high performance. Lately, we've seen a number of new languages that do exactly this. So, it's time to discuss a few of these new, modern languages.

Go

Go was developed at Google and was first publicly released at the end of 2009. It's no secret that I'm a bit of a fan of Go, and I'll admit that of the four new languages discussed here, it's the only one I actually wrote a decent amount of code in.
Go is awesome for parallel programming. Goroutines and communication channels are plain brilliant. In fact, it completely changes the way you think about parallel programming. Go's duck typing system is marvelous, an enormous sigh of relief after decades of OOP and I just want to hug the creators of Go.
Still, it has not yet become my primary language of choice. A big reason for that is mandatory CamelCasing. In Go, the case of an identifier's first letter actually controls whether the symbol is exported or not; a really awkward design decision (golang's only mishap?).
Go is great for systems programming (i.e. for UNIX tools, services), however ... it's as clunky as C when trying to parse a config file, and its flag command-line argument parser module isn't great either. Then I tried making a game in Go and simply couldn't get started because there were no stable bindings for the needed libs at hand at the time.
Go has a very “C feel” to it, every time I code in it I feel like I'm constructing a tiny, delicate machine. Everything about Go is excellently engineered. Some clever men with beards really thought things through.

What will golang's popularity be, five years from now? We shall see if it's truly to replace C/C++ as top systems programming language.

Rust

Mozilla's language named Rust started in 2006, but didn't go public until 2010. Rust is quite impressive, but the tutorial kind of lost me once it started on heap allocation, object ownership of memory, control of mutability, references, and confusing syntax involving glyphs. This leads to some pretty horrific code that shouldn't have been that way, in my opinion. Secretly, Rust is trying hard to be like C++.
The tutorial has some notes on features that are currently missing or buggy. In that respect, Rust is not ready for prime time yet. It does show a lot of potential and it did manage to create a buzz in the open source world, so check back once it hits version 1.0.

Swift

Apple's Swift language was announced only last week and is said to have been in development since 2010. At a glance, it's suspiciously a lot like Rust. This is particularly interesting given the fact that Rust is sponsored by rival company Samsung. What's the true story behind this?

Anyway, Swift looks awesome, and it's already miles ahead of Rust. Swift builds upon a foundation laid by Objective-C, all the way from NextStep to OS X and iOS. Cocoa and Cocoa Touch bindings are in place and ready to use. Under the hood Swift uses reference counting for garbage collection, and here lies a major pitfall; you should take care not to create any cyclic references, and/or know the difference between weak and unowned references. If you get this wrong, memory will leak just as it did in Obj-C. [Note that the same is true for Rust and Vala].

Swift lives in an Apple-only world, which should be no surprise — I suppose this compares to Microsoft's strategy regarding Visual Basic, C#, .NET.
There is no doubt that Swift is going to be huge on iOS, and probably also on OS X. Nobody likes Obj-C (apart from some weird guys like me). The truth of the matter is, Obj-C is outdated; it has just been obsoleted by Swift. Two years from now, no one will be writing new applications in Obj-C.

Vala

Last but not least, there is Vala, which has been around since 2006 already. Vala is a Java/C#/Mono-like language for developing GNOME applications. It is said Elementary OS's main apps are built in Vala. The Vala compiler converts Vala code to plain C with GObject, which is in turn compiled by a generic C compiler. In a sense, Vala is just syntactical sugar, removing all the bad things that C has (the tedious task of programming intricate details), while keeping all the good (high performance).
Its syntax will be appealing to mostly Java and C# programmers. Personally, I get lost in the soup of keywords. A Vala source can be like a C++ code in disguise. There are just too many keywords; the designers forgot that less is more. It's probably too late now, but my wish for Vala is to be stripped down, simplified, made easier to work with.
Nevertheless it is a great choice for developing GNOME applications, and there is Elementary OS to prove it. Currently it looks like Vala's success is tied to that of GNOME, and it's a bit of a shame because it deserves more attention from a broader audience.
Vala wins the prize for “Best chosen name for new programming language” simply because googling it lands you on the right pages right away.

Programming is hard

2014-05-29T14:10:00.004+02:00

I have always seen programming as something creative, like an art. And like in art, it is hard work to create something beautiful. It's hard to get the details right. If you get it right, then people will like the result, and of course, some may not because tastes differ and you can't make everybody happy. But aside from people's reaction to the picture painted, a painter also gets his satisfaction from mere crafting, using this technique here and that trick there, knowing the internals of his work.
Picture a musician, selecting an instrument to play, a rhythm, writing a melody.

That sure is a romantic view of programming. In reality, it's more of a purely technical process. When you want to accomplish goal G you are going to have to make work A and B together. This might be done with technique C. This is engineering. Apply a trick here, and it is called a hack. Hacks can be dirty hacks, or they can be clever hacks. In a security context, they might even be dangerous hacks.

Some say building applications is like building a house, and there should be blueprints for building software as well. Houses are built from small, simple building blocks. All houses have walls and a roof, a kitchen, a living and a bedroom, yet each house is unique. For each room, you can zoom in and work out the details. Users have the option of changing the wallpaper. And of course, a construction error may take down the entire house.
It's a nice metaphor that works well, but I suspect that whoever thought up that idea is not a programmer.

Programming is a mind game. The programmer builds a model in her mind, writes down the code and tries it. If it doesn't work, she tries to find out where it goes wrong. You may use a debugger or sketch it on paper but mostly the problem is cracked in your mind. Like a game of chess, it takes a lot of concentration trying to think six moves ahead; a single mistake will play out badly sooner or later.

Bugs are rarely ever “happy little accidents” like in Bob Ross paintings. They range from apparently random program crashes to worldwide Heartbleed-scale disasters. Slowly the world is waking up; it has become apparent that eventually all software is flawed. Consequently, using the internet is as big a risk as running across a highway, naked.

Programming is hugely underestimated. Programming is hard. It is really, really hard. The human mind was not made for this job. It is incredible that we have computers at all. Steve Jobs said that everyone should learn to program, because it teaches you to think. I agree a little, but it's like saying “Everyone can paint!” Or, “everyone should learn how to build a house.”

The best way to keep out bugs? Keep It Simple, Stupid. Simplify, simplify. Write code in a clear, consistent style. Add useful comments. Sometimes comment on why you did it this way. Test, test, test. Eat your own dog food.

There is a nice test to see if you've written good code. Open up a source file that you wrote six months ago. If you can't make out how it works within five minutes, then it's just not good enough. Now open up a source code that someone else wrote.

Boom. Oh, the horror.

Note that even this test is flawed, because you might not be seeing what you think you are seeing. Think about it. Programming is hard. And so is building houses.

In exceptional conditions

2014-04-07T15:18:00.000+02:00

On the topic of condition variables, Wikipedia says: “In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true.”

This is commonly used in producer/consumer-like problems.

The code for programming such a monitor looks like this:

lock(&mutex);
while (!predicate) {
    condition_wait(&cond, &mutex);
}
consume(&item);
unlock(&mutex);

This can be implemented, for example, with pthread_mutex_lock() and pthread_cond_wait(). If you would wrap things into a class, it might look like the following:

mutex.lock();
while (!predicate) {
    cond.wait(mutex);
}
consume(item);
mutex.unlock();

However, in a language that uses exceptions, this code would be flawed. Suppose that consume() (or even cond.wait(), however unlikely) throws an exception, then the mutex remains locked. Therefore, in a language like Java, you should write:

mutex.lock();
try {
    while (!predicate) {
        cond.wait(mutex);
    }
    consume(item);
} finally {
    mutex.unlock();
}

This ensures the mutex gets unlocked, even in case an exception occurs.
In Python, the mutex is part of the condition object. This simplifies things somewhat by hiding the mutex under the hood:

cond.acquire()
try:
    while not predicate:
        cond.wait()
    consume(item)
finally:
    cond.release()
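
For completeness, a small runnable sketch of the same pattern. It assumes a deque as the shared queue and uses made-up producer() and consumer() functions; the predicate is simply "the queue is not empty". The with statement is shorthand for the acquire/try/finally/release dance shown above, so the lock is released even when an exception occurs.

```python
import threading
from collections import deque

queue = deque()                  # shared state, protected by cond's lock
cond = threading.Condition()     # the mutex lives inside the condition object


def producer(items):
    for item in items:
        with cond:               # acquire; released on exit, even on exceptions
            queue.append(item)
            cond.notify()        # wake up one waiting consumer


def consumer(count, out):
    for _ in range(count):
        with cond:
            while not queue:     # always re-check the predicate after waking
                cond.wait()      # releases the lock while blocked
            out.append(queue.popleft())


if __name__ == "__main__":
    results = []
    c = threading.Thread(target=consumer, args=(3, results))
    p = threading.Thread(target=producer, args=(["a", "b", "c"],))
    c.start()
    p.start()
    p.join()
    c.join()
    print(results)
```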

C++ does not have a finally keyword — it doesn't need it, because it has RAII (“Resource Acquisition Is Initialisation”). In C++, it should be coded as:

std::mutex mx;
{
    std::unique_lock<std::mutex> lk(mx);
    while (!predicate) {
        cond.wait(lk);
    }
    consume(item);
}

Note that the destructor of std::unique_lock will ensure that the mutex is unlocked. The destructor will run when the stack unwinds, which is also done when an exception occurs.

Although entirely correct, and arguably the most elegant solution, I have to say I find the C++ code unintuitive and difficult to follow. The reason is that there are no explicit statements in this code block for dealing with locking; the mutex is locked by the constructor and unlocked by the destructor, that run automagically. It is almost as if the mutex variable isn't doing anything at all. Annoyingly, this code can not be refactored to resemble the other code fragments, because of RAII. And you are being forced to write it this way now that threads, mutexes, and condition variables are part of the std library; there is no other way in which a condition variable can be used because the other way would be incorrect in C++.
This is one of those cases where C++ shines but somehow it doesn't feel gratifying.

Python multiprocessing: I did it my way

2014-03-16T12:27:00.000+01:00

Python has a wonderful multiprocessing module for parallel execution. It's well known that Python's threading module does not do real multithreading due to the Global Interpreter Lock (GIL). Hence multiprocessing, which forks a pool of workers so you can efficiently run tasks in parallel.

In principle:

    p = multiprocessing.Process(target=func)
    p.start()
    p.join()

In practice, I have had weird issues with multiprocessing. Two notable issues:

program does not start any processes at all, and just exits
program raises an exception in QueueFeederThread at exit

Apparently the multiprocessing module has its problems, and although there are fixes in the most recent versions, I can't rely on that when the users of my program are running on operating systems that don't always include the latest greatest Python.

So, a decision had to be made. I ditched the multiprocessing module and replaced it with my own, that calls os.fork(). The resulting code is much easier to handle than with multiprocessing, too.

Note that os.fork() does not port to Microsoft Windows. My target platform is UNIX anyway.

Pseudo-code:

    parallel(func, work_array):
        for i < numproc:
            fork()
            if child:
                work_items = part_of(work_array)
                for item in work_items:
                    func(item)

                # child exits
                exit(0)

        wait_for_child_procs()

So, a pool of numproc workers is spawned. Each of the child processes does its part of the work given in work_array. No communication is needed here because fork() causes the child to be a copy of the parent process, and thus it gets its own copy of work_array.
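
A minimal runnable sketch of the pseudo-code above might look as follows. Two things here are assumptions made for illustration and are not taken from the original module: child i takes every numproc-th item as its part of the work, and the parent returns the list of child exit codes. It is UNIX only, since it relies on os.fork().

```python
import os


def parallel(func, work_array, numproc=4):
    """Run func(item) for all items, spread over numproc forked children."""
    pids = []
    for i in range(numproc):
        pid = os.fork()
        if pid == 0:
            # child: we have our own copy of work_array, take our share
            status = 1
            try:
                for item in work_array[i::numproc]:
                    func(item)
                status = 0
            finally:
                # os._exit() leaves right away, without running the
                # cleanup handlers that were inherited from the parent
                os._exit(status)
        pids.append(pid)

    # parent: wait for all children and collect their exit codes
    codes = []
    for pid in pids:
        _, wstatus = os.waitpid(pid, 0)
        if os.WIFEXITED(wstatus):
            codes.append(os.WEXITSTATUS(wstatus))
        else:
            codes.append(-1)     # child was killed by a signal
    return codes
```

Note that anything func() returns or changes stays in the child. If results have to travel back to the parent, some form of communication (a pipe, files) is needed after all.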

This is the simplest kind of parallel programming in UNIX. Surprisingly, this bit of code works better for me than the multiprocessing module — which supposedly does the same thing, under the hood.

Having decent language support for parallel programming is of utmost importance in today's world, where having a quadcore CPU is no exception; practically all modern computers are multi-core machines. A modern programming language should offer some easy mechanisms to empower the programmer, enabling you to take advantage of the hardware at hand.

Proper multithreading is exceptionally hard (if not impossible) to do for interpreted languages. This is a fact of life. Using a forking model, you can still get good parallelism.

MD5 and libs and licenses

2014-02-24T14:27:00.000+01:00

I keep running into code that looks like this:

#define S21 5
#define S22 9
#define S23 14
#define S24 20
    GG ( a, b, c, d, in[ 1], S21, UL(4129170786)); /* 17 */
    GG ( d, a, b, c, in[ 6], S22, UL(3225465664)); /* 18 */
    GG ( c, d, a, b, in[11], S23, UL( 643717713)); /* 19 */
    GG ( b, c, d, a, in[ 0], S24, UL(3921069994)); /* 20 */
    GG ( a, b, c, d, in[ 5], S21, UL(3593408605)); /* 21 */
    GG ( d, a, b, c, in[10], S22, UL( 38016083)); /* 22 */
    GG ( c, d, a, b, in[15], S23, UL(3634488961)); /* 23 */
    /* ... etcetera ... */
    HH ( ... )
    II ( ... )

This is an excerpt of C code for the MD5 algorithm as implemented by Ron Rivest, published in RFC-1321 in 1992. What's wrong with this picture? Well, not so much, except that there is this nice OpenSSL library that is present practically everywhere. The OpenSSL library provides this functionality and the code has been reviewed by dozens of people and is being used by millions.

Using libssl for MD5 digests is easy:

#include <openssl/md5.h>

    unsigned char digest[16];
    MD5_CTX ctx;

    MD5_Init(&ctx);
    MD5_Update(&ctx, buf, buf_len);
    /* multiple calls to MD5_Update() may be made */
    MD5_Final(digest, &ctx);

Link with -lcrypto (the MD5 routines live in libcrypto, which libssl itself depends on) and you're done.

On OSX, something is up. OSX no longer ships OpenSSL. Instead, Apple now provides their “CommonCrypto API”, which is, ironically, not so common. An ugly trick to get OpenSSL MD5 code to work with CommonCrypto:

#include <CommonCrypto/CommonDigest.h>

#define MD5_CTX      CC_MD5_CTX
#define MD5_Init     CC_MD5_Init
#define MD5_Update   CC_MD5_Update
#define MD5_Final    CC_MD5_Final

#ifdef MD5_DIGEST_LENGTH
#undef MD5_DIGEST_LENGTH
#endif
#define MD5_DIGEST_LENGTH    CC_MD5_DIGEST_LENGTH

Why did Apple do this, you may ask. The reason is probably software licensing; OpenSSL's dualistic license is problematic in particular for apps on iOS devices. Apple could probably have kept OpenSSL on the Mac, but they didn't.

(* It's either this or the NSA forced them into putting some weakened PRNGs into CommonCrypto).

Coming full circle, you would be justified not to use a common library in cases where you run into issues with the software license. There are many free and open software licenses, and many of them have quirks. This is a major annoyance in software development, and something to consider when using libraries.

One and one are eleven

2014-01-19T14:30:00.000+01:00

Oh C++, why art thou so hard? Thou giveth us nought but grievance and pain. Often I've cried out because C++ just wouldn't play nice. The language is too hard to learn, and it couldn't do anything you couldn't do in plain C. But I've actually started to like C++. I have come to appreciate it a lot. I suppose it ages like fine wine. C++11 brings a lot of good to the table, well, as long as your compiler is up to par.

Rather than go over all the features of C++11 (there's enough of that on the web to go 'round), I want to show you a neat trick. Lo and behold.

template <typename... Tpack>
void go(Tpack&&... args) {
    std::function<void()> func = std::bind(std::forward<Tpack>(args)...);
    // detach, or the std::thread destructor would terminate the program
    std::thread(trampoline, func).detach();
}

This defines a go() function that allows you to fire up any function (with arguments!) as a thread. The thing with Tpack is a variadic template, which is a type-safe kind of varargs. The call to std::bind turns it into a std::function object, which is a kind of functor. std::thread creates a new thread and launches a trampoline function that will invoke the function functor for us. Okay, the actual implementation is a little different, I have simplified it here for the sake of example. For one thing, you should catch the std::system_error exception that is thrown in case the operating system fails to launch the thread.

So now we can easily run any function as a thread. Next, I wanted go() to accept my custom Functor class as well. Normally you would go about doing that like so:

void go(const Functor&);
void go(Functor&);

Guess what, it didn't work! The result was a (not so) glorious Segmentation fault. Tracing it revealed that the compiler was favoring the go(Tpack&&...) variant over the specific Functor functions. But no worries, template specialization to the rescue:

// specialized templates for passing in a Functor
// const Functor& is only used for const functions or
// explicitly const Functors
// in most cases go(Functor&) is used
template <>
inline void go<const Functor&>(const Functor& f) {
    do_something_with(f);
}

template <>
inline void go<Functor&>(Functor& f) {
    do_something_with(f);
}

This weird looking stuff is C++ dark magick, written in the black of night by the light of a candle. If you look past all the confusing glyphs, it's simply telling the compiler to call the right function for what kind of argument we have.

All of this wasn't quite possible to this extent before C++11. Other than impressing your girlfriend with this trickery (not!), C++11 comes with a most important change: the move constructor. This is something you MUST learn. It's hard to believe, but before eleven you could not cheaply return instances from a function, because the destructor would kick in: unless the compiler managed to optimise it away, you would return a copy and the original object would be destroyed. C++11 fixes this long standing problem by handing you the opportunity to move the content of the original to the copy, allowing you to return objects, like in any sane programming language. Actually, the move constructor is much more important than showing off template and thread trickery, but Brian and me really wanted to bring you the go() function this time.

The Yearly Countdown

2013-12-29T14:54:00.000+01:00

End of year, December 31st. Every year millions of families world-wide gather round the tv-set just before midnight to watch the clock tick tick tick away the final seconds of the year. The clock strikes twelve and we yell “Happy New Year!” while outside fireworks burst, lighting up the sky.

It's about time

For some years now I have been running a desktop widget that counts down the year. I made it with Dashcode when I was just messing around on a boring January day. After nearly a year, I found an interesting bug: it started counting up in the next year..! Just like a regular clock does.

Anyway, it's fun saying things like “99 days left!” out of the blue and see people reacting “what was that all about?”.

The year 2012 had a leap second so there was another bug: it was off by a second. This would have been totally unnecessary had I implemented it properly ... Just saying that dealing with time can be tricky.

Two minutes to midnight

Counting down the days until year+1 Jan 1 00:00:00 is fun and not hard to do. Be mindful though that the naive implementation of doing sleep(1) in a loop is plain wrong. You will find your family and friends yelling out Happy New Year a split second before the counter reaches zero. Why is that? The thing is, you forgot about milliseconds, microseconds, and nanoseconds. This is illustrated by the following timeline:

23:59:56.852 countdown program start, sleep 1
23:59:57.852 sleep 1
23:59:58.852 sleep 1
23:59:59.852 sleep 1
00:00:00.000 Happy New Year! on tv
00:00:00.852 countdown program reacts too late

Here, the program starts at an offset of 852 milliseconds causing it to react too late; it practically misses the strike of midnight. 852 milliseconds feels like almost a full second, so the countdown program just isn't good enough.

A better way to do it is to use either usleep() or nanosleep() to sleep fractions of a second and round it off to a (near) exact tick of a second.
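
The idea can be sketched in a few lines of Python, where time.sleep() accepts fractions of a second much like usleep() does. The function names are made up for this example, and the clock and sleep parameters exist only so the logic can be tried out without actually waiting. It assumes the target time is a whole second, such as midnight.

```python
import math
import time


def sleep_until_next_second(clock=time.time, sleep=time.sleep):
    """Sleep just long enough to wake up on the next whole second."""
    now = clock()
    delay = math.floor(now) + 1 - now    # the fraction left in this second
    sleep(delay)
    return delay


def countdown(target, clock=time.time, sleep=time.sleep):
    """Print the seconds left until target, ticking on second boundaries."""
    while clock() < target:
        print(math.ceil(target - clock()), "seconds left")
        sleep_until_next_second(clock, sleep)
    print("Happy New Year!")
```

Started at 23:59:56.852, the first sleep takes only 0.148 seconds, and from then on the program ticks along with the clock on tv. The sleep may still overrun by a little, which is the topic of the next section.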

Sleep with one eye open

Although you can do nanosecond sleeps, the practical accuracy of your software clock is about 15 milliseconds. This is because of timeslicing in multitasking operating systems. Really? Well, yes and no. Modern computers have a High Precision Event Timer (HPET) that can be used to do more accurate timing of events. This is entirely meant for doing things like performance measurements and not so much for wallclock time keeping.

No time to waste

A countdown timer is pretty neat and not so hard to write. It gets more fun when you add a big LED display in OpenGL graphics. I suppose it's too late now to get it approved in the App store and get rich before 2014 starts, but we can always try again next year.

Best wishes! and if you want to read more about high precision timing, see this excellent blog post by another code monkey:

http://tdistler.com/2010/06/27/high-performance-timing-on-linux-windows

C++ shared_ptr (or really C++ standard mess)

2013-11-27T21:53:00.000+01:00

The C++ shared_ptr is a pointer to an allocated instance that ensures the instance is deleted after the last shared_ptr to the object goes out of scope (or is explicitly deleted). It's a great mechanism for doing some behind-the-scenes automated memory management. You can point multiple shared_ptr's to the same object and not have too much difficulty in managing the memory of the object that it's pointing at.

The shared_ptr first appeared in the boost library. Once upon a time we would write:

#include <boost/shared_ptr.hpp>

boost::shared_ptr<T> ptr;

Later, a committee found shared_ptr so cool, they said “we want that too” and incorporated it into the C++ library. The shared_ptr became part of the TR1 extension (first Technical Report), and we would write:

#include <tr1/memory>

std::tr1::shared_ptr<T> ptr;

For about a decade, work on the next C++ standard continued under the name C++0x. At some point, the gcc team at GNU recognized that TR1 would soon become part of the next C++ standard. So they already incorporated TR1 into the std namespace, but you would have to invoke the compiler with a special flag because it wasn't quite standard yet:

    #include <memory>

    std::shared_ptr<T> ptr;

    g++ -std=c++0x

Not that long ago, the new standard arrived as ‘C++11’. The compilers and library were updated. There was more to C++11, and it was a new standard after all, so GNU added a new flag:

    #include <memory>

    std::shared_ptr<T> ptr;

    g++ -std=c++11

By now, things should have been good since we have C++11. In practice however, we're out of luck. Every platform ships its own ‘current’ version of the compiler, and it works differently every time. Older compilers choke on C++11, and newer compilers that prefer C++11 choke on the TR1 stuff. In order to write C++ code that actually compiles across multiple platforms, you have to:

use autoconf (yuck)
use ugly ifdefs to get the correct includes
use typedefs, or hack the std namespace, or import entire namespaces
use the correct compiler flags, and mind which flag to use for what version of g++

It's either this or telling people “Your compiler is too old!”.
We can only hope that time passes quickly and that TR1 will soon be out of the picture, only a vague memory of an obscurity in the distant past. Until then, ...

The quad-tree

2013-08-18T23:45:00.000+02:00

Six years ago (which seems like an eternity already) I made a 2D arcade shooter in which you walk around the screen, zapping monsters. Each level is just one screen, and once you clear it, it's on to the next one. For collision detection, it checked every monster against the player, and every bullet against every monster. The game ran fine – especially on today's computers with near infinite power – but in terms of efficiency, it's quite bad to do that many collision tests. To make things better, I followed the good advice of the game programming gurus and added in a quad-tree.

Objects can only collide when they are near each other. So if an object is far away, you don't even want to test whether it might collide. Furthermore, if a group of objects is far away, you don't want to test for collisions with any member of that group. A quad-tree enables skipping large amounts of objects, thus bringing down the number of collision tests to perform.

The quad-tree is a space partitioning “device”; it divides 2D space up in compartments, automatically grouping objects together. It helps you select only the relevant group of objects for a possible collision, keeping the count of collision tests down to a minimum.

This post is not a tutorial on quad-trees, see below for a link to a good tut at gamedev.net. I had some additional thoughts on quad-trees.

Considerations

People often resort to using a uniform grid, also known as binning. With this technique, imagine putting a regular grid over the screen and adding objects to the bin (grid cell) that they are in. It's easy, but this is a poor solution that has all kinds of issues. For example when an object is on a cell boundary, it can be in multiple cells at the same time. If an object is so large that it spans many cells, there will be a big problem. If the player is at the edge of a cell, you'll have to check the adjacent cell as well. If the player is at the edge of the world, take care not to go out of bounds. All in all this technique just isn't very good, and it wastes memory when the world is large but sparsely populated. The quad-tree however, is elegant and efficient and works in all cases.

Some people seem to think that a quad-tree can only deal with fixed size problems, i.e. the root node should be able to hold all objects that are in the world. This is so not true (!) If your quad-tree implementation uses arrays, you are doing it wrong. It is much more efficient to chainlink objects together using pointers. Like so:

q.objlist => obj1.neighbor => obj2.neighbor => NULL

This scales dynamically, without cost, without growing arrays or allocating any extra memory at all.
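
To make the chaining concrete, here is a small sketch of a quad-tree that links objects together through a neighbor reference, as in the chain shown above. It is a simplified illustration and not the code from the game: objects are treated as points that lie inside the root's bounds, and the split threshold, the depth limit and all class and method names are arbitrary choices for the example.

```python
class Obj:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y
        self.neighbor = None             # next object in the same node


class QuadTree:
    MAX_OBJECTS = 4                      # split threshold (arbitrary)
    MAX_DEPTH = 8                        # stop splitting at this depth

    def __init__(self, x, y, w, h, depth=0):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.depth = depth
        self.objlist = None              # head of the chain of objects
        self.count = 0
        self.children = None             # four sub-nodes once split

    def insert(self, obj):
        if self.children is not None:
            self._child_for(obj).insert(obj)
            return
        obj.neighbor = self.objlist      # push on the front of the chain
        self.objlist = obj
        self.count += 1
        if self.count > self.MAX_OBJECTS and self.depth < self.MAX_DEPTH:
            self._split()

    def _child_for(self, obj):
        col = 1 if obj.x >= self.x + self.w / 2 else 0
        row = 1 if obj.y >= self.y + self.h / 2 else 0
        return self.children[row * 2 + col]

    def _split(self):
        hw, hh, d = self.w / 2, self.h / 2, self.depth + 1
        self.children = [
            QuadTree(self.x, self.y, hw, hh, d),
            QuadTree(self.x + hw, self.y, hw, hh, d),
            QuadTree(self.x, self.y + hh, hw, hh, d),
            QuadTree(self.x + hw, self.y + hh, hw, hh, d),
        ]
        # hand the chained objects down to the new sub-nodes
        obj, self.objlist, self.count = self.objlist, None, 0
        while obj is not None:
            nxt = obj.neighbor
            self._child_for(obj).insert(obj)
            obj = nxt

    def query(self, qx, qy, qw, qh, found=None):
        """Collect all objects that lie inside the given rectangle."""
        if found is None:
            found = []
        # skip this node, and everything below it, if there is no overlap
        if (qx >= self.x + self.w or qx + qw <= self.x or
                qy >= self.y + self.h or qy + qh <= self.y):
            return found
        if self.children is not None:
            for child in self.children:
                child.query(qx, qy, qw, qh, found)
            return found
        obj = self.objlist
        while obj is not None:           # walk the chain
            if qx <= obj.x < qx + qw and qy <= obj.y < qy + qh:
                found.append(obj)
            obj = obj.neighbor
        return found
```

For collision detection you would query the tree with the bounding box of the player or a bullet, and run the real collision test only on the objects that come back.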

Other uses for quad-trees

The quad-tree is ideal for static geometry like wall segments. Since the walls don't change, you can generate the quad-tree only once when loading the level and keep it around for testing against while the game is running. For moving walls and destructible environments it's probably easiest to use a separate, second quad-tree. (* If you are ~~a graficks wizard~~ clever, you may use BSP trees for this purpose. Quad-trees are much easier to use though.)

Another ideal use of quad-trees is for view frustum culling in 3D engines. If a node falls outside the view, then all of its leaves will not be visible either, culling a large number of vertices at once. This is typically used in terrain rendering, but also works for other kinds of scenes.

Taking it one step further, you can use the same technique for doing occlusion culling. From a nearby object you can create an occlusion frustum. The nodes that are in the occlusion frustum area are not visible and can be culled away. If it's a large building or a mountain that is in the view, it will pay off to do this little extra work. Occlusion culling is an advanced topic, and I'm not convinced that modern games actually do this.

It's trivially easy to extend the quad-tree code to an octree for use with 3D data.

Quad-trees in science

Cosmologists like doing simulations of colliding galaxies. This involves simulating gravitational forces between large numbers of stars (or bodies). Because every body interacts with every other body, this is called an n-body problem. One way of implementing such a simulation is the Barnes-Hut algorithm, which uses a quad-tree to group bodies together. Barnes-Hut may treat a distant group of bodies as a single body located at the group's center of mass, so that the whole group exerts a single gravitational force. It's easy to see how the use of a quad-tree greatly speeds up such a simulation.
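To make the "group as a single body" idea concrete, here is a small sketch in C that collapses a set of 2D bodies into one equivalent body. The body_t type and the center_of_mass() function are assumptions made for this illustration; a real Barnes-Hut code would compute this per quad-tree node while building the tree.

```c
/* Hypothetical 2D body: a position and a mass. */
typedef struct {
    double x, y;
    double mass;
} body_t;

/* Combine n bodies into one equivalent body: the total mass,
 * positioned at the mass-weighted average of the positions. */
static body_t center_of_mass(const body_t *bodies, int n) {
    body_t c = { 0.0, 0.0, 0.0 };
    int i;

    for (i = 0; i < n; i++) {
        c.x += bodies[i].x * bodies[i].mass;
        c.y += bodies[i].y * bodies[i].mass;
        c.mass += bodies[i].mass;
    }
    if (c.mass > 0.0) {
        c.x /= c.mass;
        c.y /= c.mass;
    }
    return c;
}
```

A far-away body then needs only one force calculation against the combined body, instead of one per star in the group.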

Links

Quadtrees for collision detection at GameDev.net
Barnes-Hut algorithm by Ventimiglia & Wayne, who teach at Princeton University (not game programming class)
Barnes-Hut simulation at Wikipedia, with cool pictures

Programming mkdir -p

2013-07-15T14:19:00.000+02:00

It's midsummer so let's do something light and easy for a change. In UNIX there is the mkdir -p command, meaning “create directory and all leading (parent) directories”. It is a convenient command with which you can quickly create a new directory tree. In programming, there is the mkdir() system call. This system call creates new filesystem directories, but it does not, however, automatically create any new leading parent directories for you. An attempt to do so is an error: ENOENT, or “No such file or directory”. How can we programmatically make deep directories?

A naive answer is to call system("mkdir -p my/deep/path"). While not incorrect, it uses comparatively many resources, since it spawns a shell and a separate process. But even if we don't care about that, it poses the interesting question: “Well, how does the mkdir shell command do it?”

A possible approach is to break the path argument up into pieces, and for each piece create the new subdirectory. Something along these lines:

break up "my/deep/path" into "my", "deep", "path"
mkdir "my"
mkdir "my/deep"
mkdir "my/deep/path"

For example, in Python:

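import os

def mkdir_p(path):
    # build the path up one component at a time, creating what is missing
    arr = path.split(os.sep)
    newpath = ''
    for elem in arr:
        if not newpath:
            newpath = elem
            if not elem:
                # leading empty element: the path is absolute
                newpath = os.sep
        else:
            newpath = os.path.join(newpath, elem)

        if os.path.exists(newpath):
            continue

        os.mkdir(newpath)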

It has some checks for correctly handling relative and absolute paths, and it gets the job done. It performs poorly, however, for long paths that already exist for the most part. For example, for mkdir_p("/home/walter/src/python/blog/2013/mkdir_p") it would do seven loops already before creating a new directory. Surely we can do better.

If we could work backwards, the performance could be improved. First see if the deep path exists, take a step back, see if that path exists, and if it does, rewind and create the deeper path. This is achieved by recursion.

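def mkdir_p(path):
    # ran out of path components: nothing left to create
    if not path:
        return

    # this part already exists, so everything above it does too
    if os.path.exists(path):
        return

    # first make sure the parent exists, then create the directory itself
    mkdir_p(os.path.dirname(path))
    os.mkdir(path)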

The recursive version is wonderfully terse and, as is often the case with recursive functions, it may warp your mind and take some effort to grasp.

I confess I used the non-recursive implementation for years. Ironically, Python already has a function to do this:

os.makedirs(path)

I had a peek in the source of Python's os module, and os.makedirs() indeed uses the recursive implementation. Mind that like os.mkdir(), os.makedirs() will by default throw an exception if the path already exists, unlike the mkdir -p shell command that doesn't complain if the directory is already there. Since Python 3.2 you can pass exist_ok=True to get the mkdir -p behavior.
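The same backwards-working idea carries over to the mkdir() system call in C. What follows is only a sketch, not how the mkdir shell command is actually implemented: it tries the deep path first and recurses into the parent only when the system call reports ENOENT. The function name mkdir_p() and the choice to use the same mode for all parent directories are assumptions of this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Sketch: create `path` plus any missing parents, like `mkdir -p`.
 * Returns 0 on success (also when the directory already exists),
 * -1 on error with errno set. */
int mkdir_p(const char *path, mode_t mode) {
    char parent[PATH_MAX];
    struct stat st;
    size_t len;
    char *slash;

    if (path == NULL || *path == '\0') {
        errno = ENOENT;
        return -1;
    }
    /* Try the deep path first; often this is the only call needed. */
    if (mkdir(path, mode) == 0)
        return 0;
    if (errno == EEXIST) {
        /* Already there: fine, as long as it is a directory. */
        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
            return 0;
        errno = ENOTDIR;
        return -1;
    }
    if (errno != ENOENT)
        return -1;

    /* A parent is missing: cut off the last component and recurse. */
    len = strlen(path);
    if (len >= sizeof(parent)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memcpy(parent, path, len + 1);
    while (len > 1 && parent[len - 1] == '/')   /* drop trailing slashes */
        parent[--len] = '\0';
    slash = strrchr(parent, '/');
    if (slash == NULL || slash == parent) {
        errno = ENOENT;
        return -1;
    }
    *slash = '\0';
    if (mkdir_p(parent, mode) != 0)
        return -1;
    /* EEXIST here means someone else created it in the meantime. */
    if (mkdir(path, mode) == 0 || errno == EEXIST)
        return 0;
    return -1;
}
```

Calling mkdir_p("my/deep/path", 0755) twice in a row succeeds both times, which matches the forgiving behavior of the shell command.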

Object(ive)-oriented programming in plain C

2013-06-24T15:46:00.000+02:00

Last time I wrote about a safe string type implementation in a C library, and mentioned liboco. It's a library that provides some fundamental building blocks like a safe string type, a safe array type, and a map (or dictionary). In liboco, everything is a reference counted object, and arrays and maps store objects. liboco takes concepts from Objective-C, and its basic function compares to Foundation Kit. Like in Objective-C, reference counting is done manually using retain and release. It even has a memory pool with which you can deallocate objects in a deferred fashion using autorelease.

How does it work? Plain C is not object oriented; it has no classes, no constructors, no inheritance, etcetera. But we can implement object orientation in C. In fact, the first C++ compiler was a translator that produced mangled plain C source code as output.

As you might suspect, a safe string type would be implemented as a struct with member fields for a char* pointer to the data and an integer length. Now take a step back, and say that everything should be an object. This means a string would be derived from an object, and inherit the base object's properties. In plain C, we do this by including the struct of the base object in the struct of the derived object.

typedef struct {
    object_t obj;

    char *str;
    int length;
} string_t;

What does the base object type look like? As said, every object is reference counted, so it should include a reference count.

typedef struct {
    int refcount;
} object_t;

It doesn't seem like much, but we've already laid down a basis. We can create custom types and derive them from object. We could even create a new type and derive it from the string_t type, if we wanted to.
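Why does embedding the base struct work as inheritance? Because C guarantees that a pointer to a struct also points at its first member. The sketch below illustrates this with the two types from the text; the helper refcount_of() is hypothetical and only exists to show the trick.

```c
#include <stddef.h>

typedef struct {
    int refcount;
} object_t;

typedef struct {
    object_t obj;       /* the base object must be the first member */

    char *str;
    int length;
} string_t;

/* Hypothetical helper: works on any instance whose struct begins
 * with an object_t, because the addresses coincide. */
static int refcount_of(const void *instance) {
    return ((const object_t *)instance)->refcount;
}
```

A function written against object_t can therefore be handed a string_t, or anything else derived from object.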

There is no automatic constructor, but like in Objective-C when we allocate an object of a certain type, we should initialize it through an init function. This is its constructor.

string_t *s = init_string(calloc(1, sizeof(string_t)));

calloc() zeroes out the memory for us. Alloc and init are two distinct operations. If you combine them into a single function, you'll run into trouble later (when we look at inheritance) so keep them separate.

Since I don't like the syntax of what we now have, alloc and init are wrapped as:

string_t *s = new_string();

Note that every object instance is addressed by a pointer to the object. This isn't strange when you realize that in plain C, strings are char pointers and arrays are referenced by a pointer to the array.

The reference count of a newly allocated object is 1. In a reference counting system, you don't delete instances like you do in C++. Instead, you let go of them by calling release(). Releasing an object instead of bluntly deleting it will start to make sense once you put objects into containers.

release(s);

The release() function lowers the reference count, and when the reference count drops to zero it will destruct the object and free the memory. To keep an object around, increase its reference count by calling retain(). When you put an object into a container, like an array or a map, the container automatically retains the object, which is another way of saying that it holds a strong reference to the object. The container keeps the object around — until the container itself is released.

Functions like retain() and release() work on any type derived from object. But C checks pointer types ... so you'd need to cast to object_t * every time to satisfy the compiler, or else you'd get a ton of warnings. This issue is solved by declaring the functions with void* arguments:

void retain(void *);
void release(void *);

Now, if you don't supply a valid object, the compiler will not complain and these functions will happily trash memory. So to make it safe we need some kind of runtime type checking. Wouldn't that be incredibly slow? No, it's just a single if-statement. We actually already need to know the type of the object for another reason: when the reference count drops to zero, how would release() know what destructor function to call? It knows because every object holds run-time type information (RTTI).

typedef struct {
    objtype_t *objtype;
    int refcount;
} object_t;

Mind that in C++, the compiler holds type information at compile time. In C++, RTTI may or may not be available during run-time depending on whether the class is polymorphic and whether RTTI is enabled during compilation. What's fun is that RTTI also allows us to easily implement type introspection.

What information is in the object type structure? It holds a pointer to the destructor function. release() will call this function before freeing the allocated instance.
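Putting the pieces so far together, the following is a minimal sketch of how alloc, init, retain() and release() could fit around the type structure. It is not liboco's actual code. The fields of objtype_t, the destroy_string() function, the strings_destroyed counter and the NULL test that stands in for the runtime type check are all assumptions made for this example.

```c
#include <stdlib.h>

/* Assumed layout of the type structure: a name and a destructor. */
typedef struct {
    const char *name;
    void (*destructor)(void *self);
} objtype_t;

typedef struct {
    objtype_t *objtype;
    int refcount;
} object_t;

typedef struct {
    object_t obj;

    char *str;
    int length;
} string_t;

static int strings_destroyed = 0;   /* demo counter, not part of the design */

static void destroy_string(void *self) {
    string_t *s = self;

    free(s->str);                   /* give back the character data */
    s->str = NULL;
    strings_destroyed++;
}

static objtype_t string_type = { "string", destroy_string };

/* init: set the type and start with a reference count of 1 */
string_t *init_string(string_t *s) {
    if (s == NULL)
        return NULL;
    s->obj.objtype = &string_type;
    s->obj.refcount = 1;
    return s;
}

/* alloc and init stay separate operations; this only wraps them */
string_t *new_string(void) {
    return init_string(calloc(1, sizeof(string_t)));
}

void retain(void *p) {
    object_t *o = p;

    if (o == NULL || o->objtype == NULL)    /* crude stand-in for a type check */
        return;
    o->refcount++;
}

void release(void *p) {
    object_t *o = p;

    if (o == NULL || o->objtype == NULL)
        return;
    if (--o->refcount > 0)
        return;
    /* last reference is gone: destruct, then free the instance */
    if (o->objtype->destructor != NULL)
        o->objtype->destructor(o);
    free(o);
}
```

Note that release() never needs to know it is dealing with a string; it simply follows the objtype pointer to find the right destructor.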

Other than just the destructor, we can also include a constructor. Mind that in Objective-C, you have to manually call the superclass's init function to correctly initialize an instance of a derived class. This is also the case in Python. In C++, the compiler automatically calls the base class constructors.

At first I actually made it work this way, using the default constructor only. There is a compelling case for having manual control anyway, and not including a constructor in the object's type definition.

It's because of parameters: constructors often take parameters so it makes sense to write them this way. This means a constructor of an object can have any number of parameters and there is no strict declaration that fits all types. Thus, automatic calling of base class constructors goes out the window.

As a slight added advantage, it is clearer how to do multiple inheritance.

As said, the whole thing is a reference counting system. So when we copy an array, the contents of the copy point at the same elements as the original, and each element has had its reference count increased by one. In order to make a deep copy, we need a copy constructor. The copy constructor is a common idiom in C++, but it's not as well-known in Objective-C. Objective-C has the NSCopying protocol, and you implement a method -copyWithZone: that returns a newly allocated instance: the copied object.

liboco adds the copy constructor as a special kind of constructor (analogous to C++), and likewise it automatically calls any base class copy constructors.

C++11 adds a move constructor, with which you can move values from one instance to another under the hood. This is typically needed for doing return by value efficiently.

We have no need for this feature. The main reason is that objects are pointers, and pointers already efficiently return by value. Another reason is that we do not automatically destroy instances when they go out of scope, like C++ does. It's a manual reference counting system, so the programmer decides when an object is released or not.

Now that we have our own typing system, this gives way to adding a feature that Python and Go offer out of the box: printing an object. (Objective-C comes close with its -description method and the %@ format specifier.) For printing any object we need a method that converts an object to a string.

Additionally, we need a special printing function that calls the string conversion method when requested.

print("object type: %T\n", o);
print("object value: %v\n", o);

It's quite powerful; this allows us to print the value of any type of object with the format string specifier ‘%v’.
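How could such a print function find the right conversion? One way is to keep a string conversion method in the type structure and look it up whenever ‘%v’ is encountered. The sketch below is an assumption about how this might be done, not liboco's implementation: it formats into a buffer instead of writing to the screen, it only understands ‘%T’ and ‘%v’, and the names sprint(), to_str, number_t and number_type are invented for the example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Assumed type structure: a name plus a string conversion method. */
typedef struct {
    const char *name;
    void (*to_str)(const void *self, char *buf, size_t size);
} objtype_t;

typedef struct {
    objtype_t *objtype;
    int refcount;
} object_t;

/* Hypothetical demo type that wraps an int. */
typedef struct {
    object_t obj;
    int value;
} number_t;

static void number_to_str(const void *self, char *buf, size_t size) {
    snprintf(buf, size, "%d", ((const number_t *)self)->value);
}

static objtype_t number_type = { "number", number_to_str };

/* Format into out (at most size bytes, always terminated).
 * Every other character is copied through unchanged. */
void sprint(char *out, size_t size, const char *fmt, ...) {
    va_list ap;
    size_t n = 0;

    if (size == 0)
        return;
    va_start(ap, fmt);
    while (*fmt != '\0' && n + 1 < size) {
        if (fmt[0] == '%' && (fmt[1] == 'T' || fmt[1] == 'v')) {
            const object_t *o = va_arg(ap, void *);
            char tmp[64];
            size_t len;

            if (fmt[1] == 'T')
                snprintf(tmp, sizeof(tmp), "%s", o->objtype->name);
            else
                o->objtype->to_str(o, tmp, sizeof(tmp));
            len = strlen(tmp);
            if (len > size - 1 - n)     /* truncate to what still fits */
                len = size - 1 - n;
            memcpy(out + n, tmp, len);
            n += len;
            fmt += 2;
        } else {
            out[n++] = *fmt++;
        }
    }
    out[n] = '\0';
    va_end(ap);
}
```

Every new type only has to supply its own to_str method to become printable.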

It's also possible to add operator functions. Although plain C does not allow redefinition of operators, they can be simulated using function calls. This would be pretty annoying for hand-written code, but could be useful for machine generated code.

I haven't touched upon overriding. It gets close but doesn't become real OOP until derived types can override methods, which enables polymorphism. Method overriding can be implemented by including a table of methods (function pointers) in the object's type definition. Calling a method would resemble this chain:

self->objtype->methods.method(self, parameters);

This is how virtual function calls work in C++. A derived type has a copy of the table of its base class. In Objective-C, it works a little differently. There, invoking a method means “sending a message”. What happens is, the table is searched for the method, and if the method is not found, the table of the superclass will be searched. This way, all methods of the superclass are inherited by the subclass.

Both ways can be simulated in plain C, but there is the tiresome work of writing out the virtual tables. Moreover, the first argument to the method would always have to be void* to prevent compiler warnings.
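A small sketch of the C++-style variant, where each type carries its own method table, might look as follows. The types animal_t and dog_t, the single describe method and the wrapper function describe() are invented for illustration; only the self->objtype->methods chain is taken from the text.

```c
#include <string.h>

typedef struct objtype objtype_t;

typedef struct {
    objtype_t *objtype;
    int refcount;
} object_t;

/* The type structure carries a table of methods (function pointers). */
struct objtype {
    const char *name;
    struct {
        const char *(*describe)(void *self);
    } methods;
};

/* Hypothetical base type and a type derived from it. */
typedef struct {
    object_t obj;
} animal_t;

typedef struct {
    animal_t base;
} dog_t;

static const char *animal_describe(void *self) {
    (void)self;
    return "some animal";
}

static const char *dog_describe(void *self) {
    (void)self;
    return "a dog";
}

/* dog_type fills the describe slot with its own function */
static objtype_t animal_type = { "animal", { animal_describe } };
static objtype_t dog_type = { "dog", { dog_describe } };

/* The "virtual call": look the method up through the instance's type. */
static const char *describe(void *self) {
    object_t *o = self;

    return o->objtype->methods.describe(self);
}
```

The caller of describe() does not need to know which concrete type it holds; the instance's own type decides which function runs.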

Finally, a word on using liboco. Programming in C with liboco is a lot like programming in Objective-C, and like in Objective-C, you practically leave plain C behind. You are now working with this new API, and the code is full of calls into this library, which is your foundation. Adding a new type for some kind of data structure means implementing it as an object, and writing the destructor, copy constructor, and string representation methods. This can be a drag, but it's part of what liboco is. Having an array and map type at your disposal in which any object can be placed is a great win. You can quickly prototype something in Python and convert it to C rather easily. Wouldn't it be great if we could do automated machine translation from a scripting language to C and then compile it to native code? It's maybe something to work on at a later time.

More on unsafe C strings: A followup

2013-05-13T14:33:00.000+02:00

Recently John Carmack, lead programmer of DOOM, wrote in a tweet: “Considering that 8 bit BASICs of the 70s had range checked and garbage collected strings, it is amazing how much damage C has done”. This is in line with my previous posts, and it's fairly remarkable that C stuck around so long without having this fixed. There have been many updates to the language, but this issue was simply not addressed, at least not until C11. C11 defines the strcpy_s() and strcat_s() safe string functions in its optional Annex K, so an implementation is not required to provide them. It feels a bit too little, too late but, at last, something has been done about it.

Are our troubles over now? Not really, C11 is still too new. My system does not have the new safe string functions so I can't use them. Adoption of the new standard takes time. Documentation (and school books) has to be updated. Next, the old strcpy() and strcat() have to be deprecated and finally be eradicated from our source code.
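Until those functions are widely available, a bounds-checked copy can be approximated with what every C library already has. The function below is a hypothetical helper written for this post, not a drop-in replacement for strcpy_s(): it refuses to copy when the source does not fit and leaves an empty string behind instead of overflowing the buffer.

```c
#include <string.h>

/* Copy src into dst, which has room for `size` bytes including the
 * terminating zero. Returns 0 on success, or -1 when the copy would
 * not fit (dst is then set to the empty string). */
static int copy_string(char *dst, size_t size, const char *src) {
    size_t len;

    if (dst == NULL || size == 0)
        return -1;
    if (src == NULL) {
        dst[0] = '\0';
        return -1;
    }
    len = strlen(src);
    if (len >= size) {          /* no room for the data plus terminator */
        dst[0] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);  /* includes the terminating zero */
    return 0;
}
```

The important difference with strcpy() is that the caller has to pass the size of the destination and has to check the return value.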

The bounds checking problem continues to exist for arrays.

Other than C11 I want to expand on two other possibilities for fixing the issue. The first solution is using an external library, the second is choosing a different programming language.

1. Using an external library

Rather than just supplying a safe string copying function, I mean using a library that supplies a safe string type and associated functions. As is often the case, this library would be written in C itself. One problem is that you have to rigorously adapt to using the library's string functions, and resist the temptation of using the old char* pointers for quick and dirty strings anyway.

Another problem is that system functions such as open(), write(), etcetera expect char* pointers, so you keep converting between the safe string type and the char* pointer.

Typical library solutions allocate strings on the heap, giving an extra performance hit. Using a library means marrying that library; your whole program source code will look different and be tightly coupled to it. Switching libraries will have a big impact and involve lots of work.

My GitHub project liboco is a C library that implements a safe string type, as well as an array and a map type, and a few other things. It follows the Objective-C design principle that everything is a reference counted object. Like in Objective-C, this means manual reference counting. It isn't bad, but can be problematic if you are not used to it. Hard-to-track-down memory leaks may happen. Other than that, liboco is pretty neat.

Comparably, I have written oolib for C++ which is much easier to use because it does automatic reference counting under the hood. This brings us to the next point.

2. Choosing a different programming language

It is fair to say that the problem of unsafe strings and having no bounds checks is intrinsic to C. It is a low-level language, after all. So if you want that extra bit of safety, you have to let go of C and choose a different language.

C++ fixes strings with std::string. Its interface annoys me, but anyway, C++ is a complicated language and it's really hard to learn. Even after thoroughly studying C++ I'm not sure I would recommend the language to anyone. Sure, classes are useful, but you almost have to be an academic to master C++.

Other languages more popular than C on GitHub are PHP, Python, Java and JavaScript, all byte code interpreted languages. JavaScript is a total mess, but anyway, these languages by design lack the punch that C can deliver in terms of performance. The process of byte code interpreting is essentially emulation, and emulation is slow. I bet that even the laziest, dumbest compiler would produce faster running code than these byte code interpreters.
The argument of performance appears debatable, “do you really need that kind of performance?” The answer, in short, is: Yes. I have personally seen scripts run for hours, where the equivalent written in C ripped through it in half a minute. A speedup by a factor of 100 is no exception. Compiled languages offer vastly superior performance.

There is Objective-C, but it is different enough to say it changes everything. Choosing Objective-C in practice means marrying OpenStep's Foundation Kit. It is a good choice if you are Apple-only and intend to keep it that way. Objective-C has manual reference counting, which is alright but kind of tough to get right. Apple's compiler, however, sports automatic reference counting (ARC) which does some clever source code analysis.

Whatever happened to Pascal? Although a rather verbose language (consider PROCEDURE / BEGIN / END versus void { }), Pascal wasn't so bad, except for the fact that it has the worst string type of all (!). Although Pascal strings are safe, they have an absolute maximum length of 255 bytes. In modern times, with Unicode text stored as UTF-8 at up to four bytes per character, this means that for some languages only 63 characters can be stored in a single Pascal string, making it a totally useless language by today's standards.

Go deserves an honorable mention. It is an easy to learn language with cool core features, and it compiles to native code. Even though I'm not fond of its mandatory CamelCasing, golang is really great. Go was designed as a systems programming language, and it shows. There are some Plan9 peculiarities like the command-line flags parser not being GNU getopt()-like, and the tar archiving module doesn't grok GNU tar extensions. Other than that, Go has great potential.

Truthfully, I don't want a language that strays too far from C. Interestingly, although many modern languages are described as “C-like”, none of them are plain C with an added safe string type. I suppose it is hard to change C. For one thing, a ‘string’ is a fairly high-level concept that doesn't fit nicely in a low-level language.