Sunday, December 21, 2014

Memory Management: Slab allocator

Last time we saw that we could speed up memory allocations dramatically by using a pool allocator. I had a nagging feeling though. My pool allocator uses a std::vector, which does all kinds of things behind the scenes. Resizing on the fly is cheating and undesired. Another thing that's certainly not right is that we manually call the destructor while the object's pointer is still present in the vector. This ultimately leads to double destruction once the pool itself is destructed: the pool's destructor tears down all objects it still holds, destructing the already-destructed ones a second time. Double destruction is harmless when your destructors are coded neatly, but it is something that is simply not supposed to happen. That's a big FIXME remark that needs to be addressed.

As a proposed fix for the problem, let's do away with std::vector altogether. Let's change the vector into a fixed size array. Upon initialization, the space for the array is preallocated, but no objects are created and no constructors are run. When an object is allocated from the array, its constructor is invoked via placement new. When the (fixed size) array runs out of space, allocate a new ‘subpool’ and chain the subpools together into a linked list.
When an object is freed, you can look at the address and work out which array slot it came from. When all is well, manually call the destructor. If that pointer came from somewhere else, generate a fatal error.



This is a slab allocator. There is an opportunity here to implement garbage collection, and it's not as difficult as it may seem. All the allocator needs to do is keep track of which slots are allocated, and which slots are free—and it already needs to do that anyway. Once we get that right, garbage collection is simply destroying any objects that are still lingering around when the pool is destructed.

In order to track which slots are free, you could keep a small bitmap. If there are eight slots per slab, the bitmap is as small as one single byte. If you choose a single 64-bit int for a bitmap, the slab can hold 64 slots. That's a big slab. Personally, I went for sixteen objects in a slab.
Although a bitmap is perfectly okay for tracking free slots, you'll need to run through lots of bit shifting loops for the bookkeeping. The performance can be greatly improved by employing a different technique.
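For illustration, here is what the bitmap bookkeeping might look like. A minimal sketch: the sixteen-slot slab size matches my choice above, but the variable and function names are made up for this example.
unsigned short bitmap;             // one bit per slot: 0 = free, 1 = in use

int find_free_slot(void) {
    for (int i = 0; i < 16; i++) {
        if (!(bitmap & (1 << i))) {
            bitmap |= (1 << i);    // claim the slot
            return i;
        }
    }
    return -1;                     // slab is full
}
Every allocation potentially scans all the bits; that is exactly the overhead the freelist technique described below avoids.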

I know the Linux kernel internally uses a slab allocator, and therefore I took a peek in the book Understanding the Linux Kernel. It's in there alright, but unfortunately I couldn't make any sense of it. This stuff is hard. Or maybe not so hard after all: I did find a rather good explanation in the documentation on kernel.org.

What happens in the kernel is that each slab has a freelist, which is an array of integers (since we store only small numbers, these can be bytes rather than 64-bit ints). Each entry in the array represents an index to the next free slot. So, the index functions somewhat as a pointer to a list of free objects.
The final entry in the freelist array is -1, marking the end: once free_slot reaches it, the slab has run out of memory. Additionally, there is an integer free_slot which denotes the first free slot. Since this is always the index of the first free slot, all the allocator does is read this number when an object is requested. It doesn't even have to search for a free block; it already knows which object slot is free.
It's not described exactly in the kernel documentation, but at this point you should mark the freelist slot as being in use. Again, this is important for garbage collection.
[In the kernel source the member is actually called free, which in my code would inconveniently clash with the std::free() function, as well as my Slab<T>::free() method. Even though C++ has namespacing I still went with free_slot instead.]

For illustration purposes, consider what happens when we have a slab allocator with four object slots. Upon initialisation:
free_slot: 0
free_list: { 1, 2, 3, END }
Now, let's allocate an object.
free_slot: 1
free_list: { IN_USE, 2, 3, END }
Let's allocate one more.
free_slot: 2
free_list: { IN_USE, IN_USE, 3, END }
When memory is freed, look at the address to find out which slab it belongs to, and which slot it is. Write the current value of free_slot into the freed slot's freelist entry; this effectively marks the slot as free again, chaining it to the previous head of the list. If the slot was not marked as in use, something is terribly wrong and the program should abort immediately. Finally, free_slot becomes the freed slot's own number, so the first free object is the object we just freed. Note how the freelist works in LIFO fashion.

When we free the first allocated object, the result is:
free_slot: 0
free_list: { 2, IN_USE, 3, END }
If we want to garbage collect at this point, we can easily see that the object in slot #1 is still hanging around. We can garbage collect by freeing it. The situation then becomes as follows:
free_slot: 1
free_list: { 2, 0, 3, END }
The freelist now appears shuffled, but that doesn't matter; it operates as a LIFO stack. If we now allocate an object again, it hands out the object in slot #1.

Managing the freelist like this is deceptively simple. This slab allocator greatly outperforms our previous solution.
Combining everything said here, the data structure in code is:
template <typename T>
struct Slab {
    Slab<T>* next;                       // subpools are chained into a linked list
    signed short free_slot;              // index of the first free slot
    signed char free_list[NUM_OBJECTS];  // NUM_OBJECTS is 16 in my case
    T objects[NUM_OBJECTS];
};
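To make the bookkeeping concrete, here is a minimal sketch of allocating and freeing against this struct. The function names are mine, and so is the IN_USE marker value (the text above only fixes END at -1):
#include <cstdlib>    // abort()
#include <new>        // placement new

enum { END = -1, IN_USE = -2 };

template <typename T>
T* slab_alloc(Slab<T>* slab) {
    if (slab->free_slot == END)
        return nullptr;                         // slab is full; try the next one

    int slot = slab->free_slot;
    slab->free_slot = slab->free_list[slot];    // advance head to the next free slot
    slab->free_list[slot] = IN_USE;             // mark it; needed for garbage collection
    return new (&slab->objects[slot]) T();      // construct in place
}

template <typename T>
void slab_free(Slab<T>* slab, T* obj) {
    int slot = (int)(obj - slab->objects);      // work out the slot from the address
    if (slot < 0 || slot >= NUM_OBJECTS || slab->free_list[slot] != IN_USE)
        abort();                                // bad pointer or double free

    obj->~T();                                  // manually call the destructor
    slab->free_list[slot] = (signed char)slab->free_slot;  // old head chains behind us
    slab->free_slot = slot;                     // the freed slot becomes the new head
}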
What I initially did here is use fixed size arrays. In the Linux kernel however, the only constant is the size of the memory page for the slab. It calculates how many objects of the desired object size can fit onto that page, taking into account the size of the metadata and necessary alignment. It crams as many objects as possible into a page, but note that the precise amount is different for every type T. For instance, many small objects of type A may fit into a single memory page, while another slab can hold a few large objects of type B.

The Linux kernel improves efficiency of the allocator even more by keeping three separate lists to manage slabs: lists for empty, full, and partially full slabs. On top of that, it keeps caches of slabs, so it always has quick access to them in case of need.
I simply couldn't resist the temptation, so the updated version is:
template <typename T>
struct Slab {
    Slab<T>* prev, *next;
    unsigned short num_objects, num_free;
    unsigned short free_slot;
    // compiler may insert padding here
    // free_list is directly behind the struct in memory
    // objects are behind the free_list in memory
};

Slab<T>* empty, *full, *partial;
[Now, num_objects isn't really a property of the slab; it's a property of the slab type. So why is it implemented as such? The reason is that it's not a straightforward constexpr, due to alignment and the freelist having a maximum limit (we're using bytes as slot index, so only 8 bits). The slab is written as a self-contained object allocator, and therefore it knows how many objects it can contain. In the Linux kernel, each slab also knows its object size. Since we're using C++ templates here, we can simply get away with sizeof(T).]
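For the sake of argument, a naive version of that calculation, ignoring the alignment padding that makes it messy in practice (page_size is an assumed parameter here, not the kernel's macro):
template <typename T>
unsigned short estimate_num_objects(size_t page_size) {
    // after the metadata header, each slot costs one freelist byte plus one object
    return (unsigned short)((page_size - sizeof(Slab<T>)) / (1 + sizeof(T)));
}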

And you know what they say, safety first. The metadata is kept inside the slab, as well as the objects that are handed out to the application side of things. A badly behaving application may well overwrite other objects, or worse, our freelist. To guard against this, my slab allocator implements red zoning and object markers; magic numbers are inserted in between, and they are checked every single time.
For game code, the extra checks are #ifdef-ed out in a release build, but for other kinds of applications it's probably best to leave them in anyway.
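What the red zoning might look like, as a rough sketch; the marker value and the exact layout are illustrative, not my actual implementation:
const unsigned int RED_ZONE = 0xdeadbeef;

template <typename T>
struct ZonedObject {
    unsigned int red1;      // guard word in front of the object
    T object;
    unsigned int red2;      // guard word behind the object
};

template <typename T>
void check_red_zones(const ZonedObject<T>* z) {
    if (z->red1 != RED_ZONE || z->red2 != RED_ZONE)
        abort();            // something scribbled over our guard words
}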
Moreover, be mindful that a memory allocator should always zero out memory: not only when handing out objects, but also when they are freed. If you don't, stale data lingers around and you risk a Heartbleed-style disaster; Heartbleed being the worst security bug of 2014, the decade, and maybe of all computing history.

Happy holidays, and be careful with fireworks on New Year's Eve. You will need those fingers to do some more programming in 2015.

Sunday, November 30, 2014

Memory Management: Pool Allocator

I want to get back to game programming, so I had another look at the quad-tree. I did a post on quad-trees in August last year. Remember that it's a technique for grouping objects together by chopping up 2D spaces into four quarters. Each interesting quarter is again divided into four parts. This basically goes on until max_depth is reached. The purpose of this space partitioning is grouping objects for the sake of collision testing; objects that are near each other may collide and are worth testing for collisions. Objects that are not near each other can not possibly collide, so we skip those when collision testing.

Having lots of moving objects, it's easiest to construct the quad-tree on every frame, or game tick. Now imagine that they are slow moving objects. If the objects are moving slowly, chances are that the quad-tree will look exactly the same at every frame. So why would you spend all that time building up and tearing down the same quad-tree on every frame? Unfortunately, we have to, and trying to bend an existing quad-tree into shape is much more difficult than simply rebuilding it from scratch. But we can optimize in another way.
The process of rebuilding the quad-tree takes a lot of malloc() and free() calls, and they are not cheap functions. If you want 60 fps, you will want to spend your time better than calling malloc() and free() all the time.
Rather than trying to reuse the entire quad-tree, let's reuse the timber that it's made of. Let's grab the objects that represent the quad-tree nodes and leaves from a pool allocator.

A pool allocator is simply a stack of pointers to objects in memory. Rather than freeing the memory, you push the object onto the stack. Next, when you need to allocate an object, you pop it from the stack. If the stack is empty, you simply allocate a new object using malloc().
A pool allocator only allocates objects of a specific data type. All the objects in the pool are of the same type and size. By no means try to make it a generic allocator, or you'll be writing your own fragmenting memory manager.

In C, you might have:
static QuadTree *qt_pool[MAX_QT_POOL];
static int qt_pool_idx = 0;

QuadTree *qt_alloc(void);
void qt_free(QuadTree *);
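A sketch of what those two functions might do; the fallback to malloc() when the pool is empty, and the behavior when the pool overflows, are my own choices here:
#include <stdlib.h>

QuadTree *qt_alloc(void) {
    if (qt_pool_idx > 0)
        return qt_pool[--qt_pool_idx];            /* pop a recycled object */
    return (QuadTree *)malloc(sizeof(QuadTree));  /* pool empty: fall back to malloc */
}

void qt_free(QuadTree *qt) {
    if (qt_pool_idx < MAX_QT_POOL)
        qt_pool[qt_pool_idx++] = qt;              /* push onto the pool stack */
    else
        free(qt);                                 /* pool is full: really free it */
}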

Things get quite interesting when you move to C++. First of all, C++ has a std::vector class that you can use as a dynamic array, meaning that you can grow the pool as large as is necessary without having to worry too much about what the exact limit should be.
Secondly, C++ has template classes for generic programming. A templated Pool class can work with all kinds of types: QuadTrees, Monsters, Bullets, Explosions, etcetera.
Thirdly, C++ classes have constructors and destructors. In order to do this right, the pool allocator must properly construct and destruct objects when they are allocated and freed. In C++, placement new takes some already allocated memory, and calls the constructor for it. And in case you're wondering, manually calling a destructor requires no special trickery.
template <typename T>
class Pool {
public:
    // insert boilerplate here ...

    T* alloc(void) {
        if (v_.empty()) {
            // allocate new object
            return new T();
        }

        // take the last pointer in the pool
        T* t = v_.back();
        v_.pop_back();

        // use placement new to construct in the recycled memory
        return new (t) T();
    }

    void free(T* t) {
        // destruct t; keep the raw memory for reuse
        t->~T();
        v_.push_back(t);
    }

private:
    std::vector<T*> v_;
};
This code works with raw pointers, and you can get philosophical and argue that these should be std::unique_ptr<T> objects, that are std::move()d around. And with C++14, you can use std::make_unique<T>().
[I didn't bother because personally, I'm no fan of std::unique_ptr. I think the syntax is horrid, and I wish they would have invented a new keyword instead for strong referencing pointers.]

Growing the vector means doing implicit memory allocations, something that we were trying to avoid in the first place..! Luckily, this will only happen the first time around; the pool will grow to a certain size and then stay that size as objects start being reused. An optimization would be to reserve chunks of space in advance, so that the vector need not be resized as often.
You can avoid working with pool arrays altogether, by chain linking objects to a free-list. This requires the objects to have a pointer member though (to make the chain link), and this breaks the generic nature of the Pool template class that is shown above. It would be an elegant solution for plain C however.

Saturday, October 25, 2014

Round Balls and Flat Disks

I used to call gluSphere() to render simple planets in 3D space programs. For reasons beyond me, this function has been deprecated. Apparently, all GLU functions have been scrapped in the newer OpenGL standard. I've been told by an expert that OpenGL maintains backward compatibility by using version numbering, meaning that you should still be able to use old OpenGL 2, as long as you don't try to mix it with OpenGL 4 functions. GLU and GLUT are not really part of the OpenGL standard so that's where the story ends, I suppose. The problem is, I found myself being unable to compile older programs anymore after upgrading the operating system — totally unexpected, an unpleasant surprise. Searching for an answer, I found StackOverflow. The answer: “Do the math.”

I had to scratch my head a little there. I have to admit, it's been a while since I ‘did the math’ on anything, really. A visit to MathWorld turned up some nice formulas. Now, how to put that into code.

Implementing gluSphere()
As you can see in wireframe mode, the gluSphere is built up from quads. Clearly, triangle strips speed up the process. It should be possible to use a single triangle strip, but I didn't bother trying to figure that one out. So I went for stacked triangle strips that wrap around the sphere. This method uses slices and stacks (similar to lines of longitude and latitude) just like gluSphere() does.

Anyway, the math. We need to calculate all the vertices for the triangle strips that make up the sphere. A sphere is like rotating a vector 360 degrees around in a horizontal plane (left, right), while rotating a vector 180 degrees up and down in a vertical plane. We'll call the first angle alpha and the second beta. The size of the sphere is determined by radius r. Any point (x, y, z) on the sphere is given by:
    x = r ⋅ cos α ⋅ sin β
    y = r ⋅ cos β
    z = r ⋅ sin α ⋅ sin β
One caveat is that the angles should be given in radians rather than degrees for the C math library, so use M_PI. Another is the OpenGL coordinate system; you may have to work with -z, or like in my case, I had to exchange the Y and Z-axis to get the desired result.

Knowing this, calculating all vertices is easy. Set up a double loop that calculates all coordinates, taking into account the number of stacks and slices that you wish to have.
    for(int i = 0; i <= num_stacks; i++) {
        double beta = M_PI * i / num_stacks;

        for(int j = 0; j <= num_slices; j++) {
            double alpha = 2.0 * M_PI * j / num_slices;
            double x = r * cos(alpha) * sin(beta);
            double y = r * cos(beta);
            double z = r * sin(alpha) * sin(beta);
            /* store x, y, z in the vertex array */
        }
    }
Calculating Sphere Normals
To get good lighting on the sphere, we need to know the normal vectors. The normal at each vertex is given by normalize(x, y, z). That's easy.

Calculating Sphere Texture Coordinates
Having planets without texture is not fun. Texture coordinates run from 0.0 to 1.0.
    x = alpha / (2.0 * M_PI)
    y = (M_PI - beta) / M_PI

Implementing gluDisk()
For the rings of Saturn, it's easy to use a gluDisk. gluDisk() renders a flat disk, with an inner hole in the center. In math terms, the disk uses a circle formula, and there are two radii (one for the outer circle, one for the inner hole). So a gluDisk is really nothing but a single triangle strip that loops around, and its vertices are determined by two circles with radius r1 and r2.
The vertices are given by:
    // inner vertex
    ax = r1 ⋅ cos α
    ay = r1 ⋅ sin α
    az = 0.0

    // outer vertex
    bx = r2 ⋅ cos α
    by = r2 ⋅ sin α
    bz = 0.0
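In loop form, the strip could be emitted like this (emit_vertex() is a hypothetical stand-in for storing the inner and outer vertices into your vertex array):
    for (int i = 0; i <= num_slices; i++) {
        double alpha = 2.0 * M_PI * i / num_slices;

        emit_vertex(r1 * cos(alpha), r1 * sin(alpha), 0.0);  /* inner vertex */
        emit_vertex(r2 * cos(alpha), r2 * sin(alpha), 0.0);  /* outer vertex */
    }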
Calculating Disk Normals
The disk is really a 2D object in 3D space. The normals point to +Z, they are all defined as (0, 0, 1).

Calculating Disk Texture Coordinates
Texturing a disk is not at all like texturing a sphere. When you put a texture on a gluDisk, it's like sticking a label onto a DVD. You can take a rectangular image and put it onto the circular disk without distorting the image or wrapping it around the object. The texture is mapped to the bounding rectangle of the disk, which is two times the outer radius.
    x = ax / (2.0 * r2) + 0.5
    y = ay / (2.0 * r2) + 0.5
Do the same for bx and by. Note the plus one half: ax / (2.0 * r2) runs from -0.5 to 0.5, and adding a half shifts it into the 0.0 to 1.0 texture range.

Implementing gluCylinder()
I did not implement gluCylinder(), but with the information given for gluDisk() you should be able to pull it off. It's just two circles and an added Z-axis.

Roundup
So, that's it. A bit of simple math, but still a sizable amount of work to get spheres going again in OpenGL. It's something that any serious 3D programmer should be able to do, but I still feel a bit bummed that they just deprecated those really useful functions.

On a final note I'd like to mention that there is another way of creating spheres, one that presumably looks better. Imagine an icosahedron, and subdividing each face into more triangles to enhance resolution. You can subdivide as many times as you like, and at a certain point it will be a good approximation of a sphere. This method is described in the Red Book.

Monday, September 22, 2014

How To strcpy()

I've written a lot about strcpy() already, the tiny but famous C library function for copying strings. There is a lot to say about its (in)security, and ways to improve on that. I didn't realize before that implementing your own strcpy() function could be part of a job interview for a programmer position. If I were asked to rewrite strcpy(), I would at least put in an abort() (for lack of exceptions) when passing in a NULL pointer. Moreover, I would add a length parameter for doing bounds checking — albeit that might be considered cheating, because it would change the function declaration; the signature of the function.
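For the record, here is roughly what I mean. Consider it a sketch rather than a drop-in replacement; the name and the extra size parameter are my own:
#include <stdlib.h>

/* classic strcpy() semantics, plus an abort() on NULL pointers and a
   length parameter for bounds checking (which changes the signature) */
char *my_strcpy(char *dst, const char *src, size_t dst_size) {
    if (dst == NULL || src == NULL || dst_size == 0)
        abort();                /* for lack of exceptions in C */

    size_t i;
    for (i = 0; i < dst_size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';              /* always NUL-terminate */
    return dst;
}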

Rather than zooming in on this rather old topic, I'm going to be lazy for once and just link to this excellent post that tells the story of the solutions given by job applicants for the problem of writing your own strcpy() function. It's a must read: hilarious, but also somewhat alarming.

How Do I Copy Thee? Let Me Count the Ways by David Avraamides

Sunday, August 3, 2014

A Bugs Life

Software bugs are everywhere. This shouldn't be so bad, except that software is everywhere — from your office computer, to your cell phone, home router, television set, car engine, and the list goes on and on. Heck, even an electric toothbrush has a tiny microcontroller programmed with firmware in it. Most of this software has been tested and tried, but bugs keep popping up nevertheless.
Last week I kept a bug diary. The dreadful results are described below.

Saturday, July 26 - FaceTime would not connect. After updating iOS the problem was resolved. Of course, “you must always update to the latest version”, but mind you, this problem never existed before. User experience: Everything was working fine, and then all of a sudden it didn't.

Sunday, July 27 - Tried the OS X Yosemite Beta. Okay, it's a beta release. Reported a bug where the GUI was actually messing up the screen. Sadly couldn't take a screenshot, because it would redraw the screen, effectively clearing things up. Luckily, the bug did reproduce every time.

Monday, July 28 - Didn't go near my computer, but today a colleague sent in a bug report for a piece of software of mine. My bad. One of those cases of “this should never happen”. A set of dictionary keys holds only unique items, so how could it be that an item appeared twice?

Tuesday, July 29 - Encountered trouble trying to build the PCRE regular expression library with gcc 4.9.1 on Linux. Somehow it wouldn't pass confidence tests. Reportedly did work with older gcc; smelled an awful lot like a compiler problem.
Found a recent online tirade by Linus Torvalds, in which he called gcc, I quote, “pure and utter *shit*.” Turns out the compiler's code generator emits buggy code.

Wednesday, July 30 - The pedestrian lights near the office stayed on “Walk” all day long. Nearly got run over by a BMW, as the driver hit the gas when he got the green light. Could have been caused by a loose wire somewhere, but I like to think it was a software bug.

Thursday, July 31 - Twitter app on iPhone kept crashing upon loading a particular news article. Who knows, maybe it was ad malware trying to infect the system.

Friday, August 1 - At last, a small victory. Fixed an endless loop problem in the RADIUS PAM module. Found some additional related issues, such as what looks like a broken, unusual implementation in our RADIUS server.


Special bonus bug of the week (Linux): Exiting an interactive bash shell and getting “Illegal instruction” on the screen. This is most likely caused by stack corruption.

Sunday, July 6, 2014

Game On

I have a PS3 game controller lying around, so I got the idea to add support for it to some old SDL game code. This code was still using SDL 1.2, which is now a thing of the past. So the first thing to do was quickly port it over to SDL 2, a more modern game programming API. SDL 2 even includes a GameController API, and because of that, programming support for game controllers with SDL is remarkably easy.

Initialising initialisation
For initialisation, call SDL_Init() with extra flag SDL_INIT_GAMECONTROLLER. Additionally, you must open the game controller. The game controller is found by first querying the system for the number of connected joysticks.
    SDL_GameController *ctrl = NULL;

    SDL_Init(SDL_INIT_VIDEO|SDL_INIT_GAMECONTROLLER);

    int num = SDL_NumJoysticks();
    if (num <= 0) {
        printf("no joy connected\n");
        return false;
    }
    if (!SDL_IsGameController(0)) {
        printf("it's not a game controller\n");
        return false;
    }
    ctrl = SDL_GameControllerOpen(0);
    if (ctrl == NULL) {
        printf("open failed: %s\n", SDL_GetError());
        return false;
    }
    const char *name = SDL_GameControllerName(ctrl);
    printf("game controller: %s\n", name);
This code is just an example. It assumes the first connected joystick is indeed the game controller that the player wants to use. This is of course ridiculous, and you should really loop over all joysticks and open as many connected devices as possible.

Mapping the mappings
In them old days, supporting game controllers was ... problematic. The thing is, every game controller is different, and there is no common standard among manufacturers for numbering buttons and analog sticks. Even a simple D-pad differs between controllers: think Xbox 360, PS3, PS4, but also PS2, Logitech, and the OUYA game controller. SDL 2 solves this incompatibility by mapping the raw device button ids to generic SDL enums. You do need to provide the correct mappings; these can be imported from a flat-file game controller database.
    SDL_GameControllerAddMappingsFromFile("gamecontrollerdb.txt");

    if (SDL_GameControllerMapping(ctrl) == NULL) {
        printf("no mapping for %s\n", name);
        SDL_GameControllerClose(ctrl);
        ctrl = NULL;
        return false;
    }
Finally, enable events so that SDL will generate events for us to collect in the main event loop.
    SDL_GameControllerEventState(SDL_ENABLE);

Moving forward
When SDL game controller events are enabled, you will receive events of type SDL_CONTROLLERAXISMOTION in the event loop. You will find that SDL sends lots of these events, even when you're just holding the controller. This is just noise on the sensors; the analog sticks are quite sensitive, and merely touching them already registers movement. The axis values are signed 16-bit values, between -32768 and 32767. The SDL documentation says values between -3200 and 3200 are in the ‘dead zone’, where the sticks should be considered at rest.

The SDL_CONTROLLERAXISMOTION event boils down to this:
event.caxis.which     connected joystick number
event.caxis.axis      axis number
event.caxis.value     amount of movement
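A sketch of what handling this looks like in the event loop; the dead zone treatment and the move_horizontal() function are my own illustration:
    SDL_Event event;

    while (SDL_PollEvent(&event)) {
        if (event.type == SDL_CONTROLLERAXISMOTION) {
            int value = event.caxis.value;

            /* map the dead zone to zero movement */
            if (value > -3200 && value < 3200)
                value = 0;

            if (event.caxis.axis == SDL_CONTROLLER_AXIS_LEFTX)
                move_horizontal(value);    /* hypothetical game function */
        }
    }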


Analog triggers L2 and R2 register as axis 4 and 5, but they only report value 32767 when pressed, acting like digital buttons. This could be a driver issue (I'm on a Mac) or SDL not properly supporting the triggers. I'm not sure.

Benjamin Buttons
SDL generates SDL_CONTROLLERBUTTONDOWN and SDL_CONTROLLERBUTTONUP events. Inspect event.cbutton.button to see which button was pressed.
The button codes are hard to find in the documentation, but look in SDL_gamecontroller.h to find the enums. The API is modeled after an XBox type controller, for PlayStation the button ‘A’ is a cross, button ‘Y’ is a triangle, etcetera. The ‘guide’ button is the PS logo button.
SDL_CONTROLLER_BUTTON_INVALID = -1
SDL_CONTROLLER_BUTTON_A
SDL_CONTROLLER_BUTTON_B
SDL_CONTROLLER_BUTTON_X
SDL_CONTROLLER_BUTTON_Y
SDL_CONTROLLER_BUTTON_BACK
SDL_CONTROLLER_BUTTON_GUIDE
SDL_CONTROLLER_BUTTON_START
SDL_CONTROLLER_BUTTON_LEFTSTICK
SDL_CONTROLLER_BUTTON_RIGHTSTICK
SDL_CONTROLLER_BUTTON_LEFTSHOULDER
SDL_CONTROLLER_BUTTON_RIGHTSHOULDER
SDL_CONTROLLER_BUTTON_DPAD_UP
SDL_CONTROLLER_BUTTON_DPAD_DOWN
SDL_CONTROLLER_BUTTON_DPAD_LEFT
SDL_CONTROLLER_BUTTON_DPAD_RIGHT

Any other business
I have a single player game, so I took the easy road and just map all buttons and stick movements to key presses, internally translating the controller into a keyboard-driven joystick. This works, but it's not good enough for multiplayer games. It also loses the ability to properly react to stick acceleration; you might want to scale the axis values to have a proper difference between large and small stick movements.

Some tutorials suggest you ignore the ‘dead zone’, but this is wrong because when you fully ignore those events, you are unable to detect when someone suddenly lets go of the joystick. Instead, the dead zone should be mapped to zero movement. You can also keep track of sticks entering/leaving the dead zone, and react correspondingly.

The DualShock 3 controller has pressure sensitive buttons. SDL has no support for measuring the amount of pressure. Nor does it support the sixaxis motion sensor. SDL 2 can do rumble however via its Haptic API.

The cool thing about SDL is that it also ports to other platforms, maybe most notably the Steam OS. Since I'm on OS X I also had a quick look at how to do this joystick & game controller thing in pure Cocoa — not so easy at all. I'm with SDL 2 on this one.

Links
SDL 2 GameControllerDB file
SDL 2 API by Category
How to connect a PS3 controller

Sunday, June 8, 2014

Go Rust Swift, Vala

In the WWDC keynote speech, Apple announced their new programming language named Swift. It looks like a scripting language, but it actually compiles to binary, giving you lots of performance. For years, I've been griping about the need for a programming language that's as easy as scripting, while delivering decently high performance. Lately, we've seen a number of new languages that do exactly this. So, it's time to discuss a few of these new, modern languages.

Go
Go was developed at Google and was first publicly released at the end of 2009. It's no secret that I'm a bit of a fan of Go, and I'll admit that of the four new languages discussed here, it's the only one I actually wrote a decent amount of code in.
Go is awesome for parallel programming. Goroutines and communication channels are plain brilliant. In fact, it completely changes the way you think about parallel programming. Go's duck typing system is marvelous, an enormous sigh of relief after decades of OOP and I just want to hug the creators of Go.
And yet, it has not yet become my primary language of choice. A big reason for that is mandatory CamelCasing. In Go, lower/upper CamelCasing actually controls whether the symbol is being exported or not; a really awkward design decision (golang's only mishap?).
Go is great for systems programming (i.e. UNIX tools and services), however ... it's as clunky as C when trying to parse a config file, and its flag command-line argument parsing module isn't great either. Then I tried making a game in Go, and simply couldn't get started because there were no stable bindings for the needed libs at the time.
Go has a very “C feel” to it, every time I code in it I feel like I'm constructing a tiny, delicate machine. Everything about Go is excellently engineered. Some clever men with beards really thought things through.
What will golang's popularity be, five years from now? We shall see if it's truly to replace C/C++ as top systems programming language.

Rust
Mozilla's language named Rust started in 2006, but didn't go public until 2010. Rust is quite impressive, but the tutorial kind of lost me once it started on heap allocation, object ownership of memory, control of mutability, references, and confusing syntax involving glyphs. This leads to some pretty horrific code that shouldn't have been that way, in my opinion. Secretly, Rust is trying hard to be like C++.
The tutorial has some notes on features that are currently missing or buggy. In that respect, Rust is not ready for prime time yet. It does show a lot of potential and it did manage to create a buzz in the open source world, so check back once it hits version 1.0.

Swift
Apple's Swift language was announced only last week and is said to have been in development since 2010. At a glance, it looks suspiciously like Rust. This is particularly interesting given the fact that Rust is sponsored by rival company Samsung. What's the true story behind this?
Anyway, Swift looks awesome, and it's already miles ahead of Rust. Swift builds upon a foundation laid by Objective-C, all the way from NextStep to OS X and iOS. Cocoa and Cocoa Touch bindings are in place and ready to use. Under the hood Swift uses reference counting for garbage collection, and here lies a major pitfall; you should take care not to create any cyclic references, and/or know the difference between weak and unowned references. If you get this wrong, memory will leak just as it did in Obj-C. [Note that the same is true for Rust and Vala].
Swift lives in an Apple-only world, which should be no surprise — I suppose this compares to Microsoft's strategy regarding Visual Basic, C#, .NET.
There is no doubt that Swift is going to be huge on iOS, and probably also on OS X. Nobody likes Obj-C (apart from some weird guys like me). The truth of the matter is, Obj-C is outdated; it has just been obsoleted by Swift. Two years from now, no one will be writing new applications in Obj-C.

Vala
Last but not least, there is Vala, which has been around since 2006 already. Vala is a Java/C#/Mono-like language for developing GNOME applications. It is said Elementary OS's main apps are built in Vala. The Vala compiler converts Vala code to plain C with GObject, which is in turn compiled by a generic C compiler. In a sense, Vala is just syntactical sugar, removing all the bad things that C has (the tedious task of programming intricate details), while keeping all the good (high performance).
Its syntax will be appealing to mostly Java and C# programmers. Personally, I get lost in the soup of keywords. A Vala source can be like a C++ code in disguise. There are just too many keywords; the designers forgot that less is more. It's probably too late now, but my wish for Vala is to be stripped down, simplified, made easier to work with.
Nevertheless it is a great choice for developing GNOME applications, and there is Elementary OS to prove it. Currently it looks like Vala's success is tied to that of GNOME, and it's a bit of a shame because it deserves more attention from a broader audience.
Vala wins the prize for “Best chosen name for new programming language” simply because googling it lands you on the right pages right away.


Thursday, May 29, 2014

Programming is hard

I have always seen programming as something creative, like an art. And like in art, it is hard work to create something beautiful. It's hard to get the details right. If you get it right, then people will like the result, and of course, some may not because tastes differ and you can't make everybody happy. But aside from people's reaction to the picture painted, a painter also gets his satisfaction from mere crafting, using this technique here and that trick there, knowing the internals of his work.
Picture a musician, selecting an instrument to play, a rhythm, writing a melody.

That sure is a romantic view of programming. In reality, it's more of a purely technical process. When you want to accomplish goal G, you are going to have to make A and B work together. This might be done with technique C. This is engineering. Apply a trick here, and it is called a hack. Hacks can be dirty hacks, or they can be clever hacks. In a security context, they might even be dangerous hacks.

Some say building applications is like building a house, and there should be blueprints for building software as well. Houses are built from small, simple building blocks. All houses have walls and a roof, a kitchen, a living and a bedroom, yet each house is unique. For each room, you can zoom in and work out the details. Users have the option of changing the wallpaper. And of course, a construction error may take down the entire house.
It's a nice metaphor that works well, but I suspect that whoever thought up that idea is not a programmer.

Programming is a mind game. The programmer builds a model in their mind, writes down the code and tries it. If it doesn't work, they try to find out where it goes wrong. You may use a debugger or sketch things on paper, but mostly the problem is cracked in your mind. Like a game of chess, it takes a lot of concentration trying to think six moves ahead, and a single mistake will play out badly sooner or later.

Bugs are rarely ever “happy little accidents” like in Bob Ross paintings. They range from apparently random program crashes to worldwide Heartbleed scale disasters. Slowly the world is waking up; it has become apparent that eventually all software is flawed. Consequently, using the internet is as big a risk as running across a highway, naked.

Programming is hugely underestimated. Programming is hard. It is really, really hard. The human mind was not made for this job. It is incredible that we have computers at all. Steve Jobs said that everyone should learn to program, because it teaches you to think. I agree a little, but it's like saying “Everyone can paint!” Or, “everyone should learn how to build a house.”

The best way to keep out bugs? Keep It Simple, Stupid. Simplify, simplify. Write code in a clear, consistent style. Add useful comments. Sometimes comment on why you did it this way. Test, test, test. Eat your own dog food.

There is a nice test to see if you've written good code. Open up a source file that you wrote six months ago. If you can't make out how it works within five minutes, then it's just not good enough. Now open up source code that someone else wrote.

Boom. Oh, the horror.

Note that even this test is flawed, because you might not be seeing what you think you are seeing. Think about it. Programming is hard. And so is building houses.

Monday, April 7, 2014

In exceptional conditions

On the topic of condition variables, wikipedia says: “In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true.”
This is commonly used in producer/consumer-like problems.

The code for programming such a monitor looks like this:
lock(&mutex);
while (!predicate) {
    condition_wait(&mutex);
}
consume(&item);
unlock(&mutex);
This can be implemented, for example, with pthread_mutex_lock() and pthread_cond_wait(). If you would wrap things into a class, it might look like the following:
mutex.lock();
while (!predicate) {
    cond.wait(mutex);
}
consume(item);
mutex.unlock();
However, in a language that uses exceptions, this code would be flawed. Suppose that consume() (or even cond.wait(), however unlikely) throws an exception, then the mutex remains locked. Therefore, in a language like Java, you should write:
mutex.lock();
try {
    while (!predicate) {
        cond.wait(mutex);
    }
    consume(item);
} finally {
    mutex.unlock();
}
This ensures the mutex gets unlocked, even in case an exception occurs.
In Python, the mutex is part of the condition object. This simplifies things somewhat by hiding the mutex under the hood:
cond.acquire()
try:
    while not predicate:
        cond.wait()
    consume(item)
finally:
    cond.release()
C++ does not have a finally keyword — it doesn't need it, because it has RAII (“Resource Acquisition Is Initialisation”). In C++, it should be coded as:
std::mutex mx;
std::condition_variable cond;
{
    std::unique_lock<std::mutex> lk(mx);
    while (!predicate) {
        cond.wait(lk);
    }
    consume(item);
}
Note that the destructor of std::unique_lock will ensure that the mutex is unlocked. The destructor will run when the stack unwinds, which is also done when an exception occurs.

Although entirely correct, and arguably the most elegant solution, I have to say I find the C++ code unintuitive and difficult to follow. The reason is that there are no explicit statements in this code block that deal with locking; the mutex is locked by the constructor and unlocked by the destructor, which run automagically. It is almost as if the mutex variable isn't doing anything at all. Annoyingly, this code can not be refactored to resemble the other code fragments, because of RAII. And you are being forced to write it this way now that threads, mutexes, and condition variables are part of the std library; any other way of using a condition variable would simply be incorrect C++.
This is one of those cases where C++ shines but somehow it doesn't feel gratifying.

Sunday, March 16, 2014

Python multiprocessing: I did it my way

Python has a wonderful multiprocessing module for parallel execution. It's well known that Python's threading module does not do real multithreading due to the Global Interpreter Lock (GIL). Hence multiprocessing, which forks a pool of workers so you can efficiently run tasks in parallel.

In principle:
    p = multiprocessing.Process(target=func)
    p.start()
    p.join()
In practice, I have had weird issues with multiprocessing. Two notable issues:

  • program does not start any processes at all, and just exits
  • program raises an exception in QueueFeederThread at exit

Apparently the multiprocessing module has its problems, and although there are fixes in the most recent versions, I can't rely on that when the users of my program are running on operating systems that don't always include the latest greatest Python.

So, a decision had to be made. I ditched the multiprocessing module and replaced it with my own that calls os.fork(). The resulting code is much easier to handle than with multiprocessing, too.
Note that os.fork() does not port to Microsoft Windows. My target platform is UNIX anyway.

In Python, the idea looks like this:
    import os

    def parallel(func, work_array, numproc):
        for i in range(numproc):
            if os.fork() == 0:
                # child: do our share of the work
                for item in work_array[i::numproc]:
                    func(item)

                # child exits
                os._exit(0)

        # parent: wait for all child processes to finish
        for i in range(numproc):
            os.wait()
So, a pool of numproc workers is spawned. Each of the child processes does its part of the work given in work_array. No communication is needed here, because fork() makes the child a copy of the parent process, so it gets a copy of work_array as well.
This is the simplest kind of parallel programming in UNIX. Surprisingly, this bit of code works better for me than the multiprocessing module — which supposedly does the same thing, under the hood.

Having decent language support for parallel programming is of utmost importance in today's world, where having a quadcore CPU is no exception; practically all modern computers are multi-core machines. A modern programming language should offer some easy mechanisms to empower the programmer, enabling you to take advantage of the hardware at hand.
Proper multithreading is exceptionally hard (if not impossible) to do for interpreted languages. This is a fact of life. Using a forking model, you can still get good parallelism.

Monday, February 24, 2014

MD5 and libs and licenses

I keep running into code that looks like this:
#define S21 5
#define S22 9
#define S23 14
#define S24 20
    GG ( a, b, c, d, in[ 1], S21, UL(4129170786)); /* 17 */
    GG ( d, a, b, c, in[ 6], S22, UL(3225465664)); /* 18 */
    GG ( c, d, a, b, in[11], S23, UL( 643717713)); /* 19 */
    GG ( b, c, d, a, in[ 0], S24, UL(3921069994)); /* 20 */
    GG ( a, b, c, d, in[ 5], S21, UL(3593408605)); /* 21 */
    GG ( d, a, b, c, in[10], S22, UL(  38016083)); /* 22 */
    GG ( c, d, a, b, in[15], S23, UL(3634488961)); /* 23 */
    /* ... etcetera ... */
    HH ( ... )
    II ( ... )
This is an excerpt of C code for the MD5 algorithm as implemented by Ron Rivest, published in RFC-1321 in 1992. What's wrong with this picture? Well, not so much, except that there is this nice OpenSSL library that is present practically everywhere. The OpenSSL library provides this functionality and the code has been reviewed by dozens of people and is being used by millions.

Using libssl for MD5 digests is easy:
#include <openssl/md5.h>

    unsigned char digest[16];
    MD5_CTX ctx;

    MD5_Init(&ctx);
    MD5_Update(&ctx, buf, buf_len);
    /* multiple calls to MD5_Update() may be made */
    MD5_Final(digest, &ctx);
Link with -lcrypto (the MD5 functions live in OpenSSL's libcrypto, not in libssl proper) and you're done.

On OSX, something is up. OSX no longer ships OpenSSL. Instead, Apple now provides their “CommonCrypto API”, which is, ironically, not so common. An ugly trick to get OpenSSL MD5 code to work with CommonCrypto:
#include <CommonCrypto/CommonDigest.h>

#define MD5_CTX      CC_MD5_CTX
#define MD5_Init     CC_MD5_Init
#define MD5_Update   CC_MD5_Update
#define MD5_Final    CC_MD5_Final

#ifdef MD5_DIGEST_LENGTH
#undef MD5_DIGEST_LENGTH
#define MD5_DIGEST_LENGTH    CC_MD5_DIGEST_LENGTH
#endif
Why did Apple do this, you may ask. The reason is probably software licensing; OpenSSL's dual license is problematic, in particular for apps on iOS devices. Apple could probably have kept OpenSSL on the Mac, but they didn't.
(* It's either this or the NSA forced them into putting some weakened PRNGs into CommonCrypto).

Coming full circle, you would be justified not to use a common library in cases where you run into  issues with the software license. There are many free and open software licenses, and many of them have quirks. This is a major annoyance in software development, and something to consider when using libraries.

Sunday, January 19, 2014

One and one are eleven

Oh C++, why art thou so hard? Thou giveth us nought but grievance and pain. Often I've cried out because C++ just wouldn't play nice. The language is too hard to learn, and it couldn't do anything that you couldn't do in plain C. But I've actually started to like C++. I have come to appreciate it a lot. I suppose it ages like fine wine. C++11 brings a lot of good to the table, well, as long as your compiler is up to par.

Rather than go over all the features of C++11 (there's enough of that on the web to go 'round), I want to show you a neat trick. Lo and behold.
template <typename... Tpack>
void go(Tpack&&... args) {
    std::function<void()> func = std::bind(std::forward<Tpack>(args)...);
    // detach, or the std::thread destructor calls std::terminate()
    std::thread(trampoline, func).detach();
}
This defines a go() function that allows you to fire up any function (with arguments!) as a thread. The thing with Tpack is a variadic template, which is a type-safe kind of varargs. The call to std::bind turns it into a std::function object, which is a kind of functor. std::thread creates a new thread and launches a trampoline function that will invoke the function functor for us. Okay, the actual implementation is a little different, I have simplified it here for the sake of example. For one thing, you should catch the std::system_error exception that is thrown in case the operating system fails to launch the thread.
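For completeness, a trampoline as simple as this would do; again a sketch, since the real implementation differs as said:
static void trampoline(std::function<void()> func) {
    func();    // invoke the bound functor in the new thread
}
Calling it then looks like go(download_file, url, 3), for any function and arguments you care to run in the background (download_file being hypothetical here, of course).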

So now we can easily run any function as a thread. Next, I wanted go() to accept my custom Functor class as well. Normally you would go about that like so:
void go(const Functor&);
void go(Functor&);
Guess what, it didn't work! The result was a (not so) glorious Segmentation fault. Tracing it revealed that the compiler was favoring the go(Tpack&&...) variant over the specific Functor functions. But no worries, template specialization to the rescue:
// specialized templates for passing in a Functor
// const Functor& is only used for const functions or
// explicitly const Functors
// in most cases go(Functor&) is used
template <>
inline void go<const Functor&>(const Functor& f) {
    do_something_with(f);
}

template <>
inline void go<Functor&>(Functor& f) {
    do_something_with(f);
}
This weird looking stuff is C++ dark magick, written in the black of night by the light of a candle. If you look past all the confusing glyphs, it's simply telling the compiler to call the right function for what kind of argument we have.

All of this wasn't quite possible to this extent before C++11. Other than impressing your girlfriend with this trickery (not!), C++11 comes with a most important change: the move constructor. This is something you MUST learn. It's hard to believe, but before eleven, returning an instance from a function meant handing out a full copy while the original was destroyed (unless the compiler happened to elide the copy for you). C++11 fixes this long standing problem by handing you the opportunity to move the contents of the original into the result, allowing you to cheaply return objects by value, like in any sane programming language. Actually, the move constructor is much more important than showing off template and thread trickery, but Brian and I really wanted to bring you the go() function this time.
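To illustrate the move constructor itself, a minimal sketch:
#include <cstddef>

class Buffer {
public:
    Buffer(size_t n) : data_(new char[n]), size_(n) {}
    ~Buffer() { delete[] data_; }

    // move constructor: steal the contents, leave the source empty
    Buffer(Buffer&& other) : data_(other.data_), size_(other.size_) {
        other.data_ = nullptr;
        other.size_ = 0;
    }

private:
    char* data_;
    size_t size_;
};

Buffer make_buffer() {
    Buffer b(4096);
    return b;    // moved (or elided), never deep-copied
}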