Monday, June 22, 2009

Taking out the trash

While working on a hobby project in one of my all-time favorite programming languages (standard C), I once again encountered an old problem: a memory leak. I hunted down the problem and fixed it, but it bothered me that I made this mistake, and it bothered me that it was so easy to make. In my second favorite programming language (Python), this would never have happened. Python features a built-in garbage collector. Why can't C have a garbage collector? The answer is: it can, but you have to implement it yourself.

Description of the problem: the pitfall
The common mistake happens when doing things like this:
int foo() {
    char *p;
    struct some_struct s;

    p = (char *)malloc(NUMBER_OF_BYTES);
    bar(&s, p);

    if (...) {
        if (... && ...) {
            ...
        } else {
            ...
            return -1;
        }
    }
    free(p);
    return 0;
}

Clever folks will see that there's a memory leak at the "return -1" (I forgot to call free(p)), but did you spot the other memory leak? That's right ... what I didn't tell you is that function bar() also allocates some memory and stores the pointer in a member of the struct, and it is never freed because it's not obvious that the struct is holding memory.

Use another language: C++
So, I can hear you say, use C++. C++ will call the destructor when the local variable goes out of scope. This only partly solves the problem. In practice, I've learned to hate C++ because it makes things even less obvious than before, causing more problems than it solves.

Reference counting
How does Python pull it off? In essence, it's simple: it uses reference counting. Reference counting means that when you point a pointer variable at a piece of memory, you are referencing that object, and so a counter is incremented. When you stop using the object, the counter is decremented. Once the counter hits zero, no pointer variable is tracking this memory anymore, meaning that the object has become useless, and therefore OK to clean up. Hence, the object is garbage collected. Note how the object can be cleaned up immediately, and that there is absolutely no need for a separate thread to actively scan memory to collect bits of unused memory. I have no idea where this myth stems from (the Java world?).
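To make this concrete, here is a minimal sketch of what reference counting looks like in C; the object_t type and the ref()/unref() names are my own invention, not any particular library's:

#include <stdlib.h>

typedef struct {
    int refcount;
    /* ... payload ... */
} object_t;

object_t *new_object(void) {
    object_t *obj = (object_t *)calloc(1, sizeof(object_t));
    if (obj != NULL)
        obj->refcount = 1;        /* the creator holds the first reference */
    return obj;
}

object_t *ref(object_t *obj) {
    obj->refcount++;              /* another pointer now tracks this object */
    return obj;
}

void unref(object_t *obj) {
    if (--obj->refcount == 0)     /* last reference gone: garbage collect */
        free(obj);
}

The object disappears the instant the last unref() brings the counter to zero; no background scanning required.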

Mimicking the interpreted language
Reference counting can be done in standard C too, but it's actually quite hard to do right. Every single time you assign a pointer variable, you have to increment the associated reference count. This is doable, and you can even write a set_xxx() wrapper function or macro to do it for you. What's much harder is to unreference the object when you let go of it. You are bound to forget to do this correctly, just as pairing malloc()/free() calls was too hard to get right. The only reason (byte-code) interpreted languages like Perl, Python, PHP, Java, etc. do get it right is that the outer layer that really drives the language does all the assignments and keeps track of the references, rather than the programmer trying to track things from the inside. (Or am I mixing up "inside" and "outside" here? Are you still with me?!)
I think it was Guido himself who once said that if you got all the reference counting with objects right in C, you'd practically be programming in Python already.
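For what it's worth, such an assignment wrapper could look something like this; the SET_OBJECT name is mine, building on the ref()/unref() sketch above:

/* point 'dst' at 'src', maintaining both reference counts;
   ref() comes before unref() so that SET_OBJECT(p, p) is safe */
#define SET_OBJECT(dst, src)  do {      \
        if ((src) != NULL)              \
            ref(src);                   \
        if ((dst) != NULL)              \
            unref(dst);                 \
        (dst) = (src);                  \
    } while(0)

The increment side is the easy part; it is releasing the old value, every time and everywhere, that is so easy to forget.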

Tracking mallocs
What if you'd register all malloc()ed memory with a Memory Manager, and replace C's standard "return" statement with a macro "Return" that first visits the Memory Manager to see if there's anything left behind to clean up?
Sounds great, but how can it tell the difference between a temporarily allocated buffer and an object that was allocated to live for the rest of the lifetime of the process? This is exactly the core of the problem: we use the same malloc() for two different things. We want temp memory that is somehow magically destroyed at the end of the function, and we also want to allocate memory because our program simply needs it. (NB. There may be situations where an object is temporary, but not quite ... although I can't say I can think of such a situation right now.)

The solution to this problem would be to implement a tmp_malloc() function that is used for temp variables (like local variables), and have it track this memory so that a "Return" statement can clean it up. This actually works, but in practice a lot of duplicate code is needed, because you want to be able to call malloc() in one situation and tmp_malloc() in another. A real slick trick to solve this is to make Malloc a pointer to a memory allocating function, and have it point to either malloc() or tmp_malloc(). All the programmer needs to do now is decide whether he wants to use memory temporarily, or assign it for a longer period of time.
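On paper, the idea could look something like this; a rough, untested sketch, and all names (tmp_malloc, tmp_free_all, Return, Malloc) are made up for the occasion:

#include <stdlib.h>

#define MAX_TMP_ALLOCS  64

/* per-process (not thread-safe) registry of temp allocations */
static void *tmp_allocs[MAX_TMP_ALLOCS];
static int num_tmp_allocs = 0;

void *tmp_malloc(size_t size) {
    void *p = malloc(size);
    if (p != NULL && num_tmp_allocs < MAX_TMP_ALLOCS)
        tmp_allocs[num_tmp_allocs++] = p;   /* register with the manager */
    return p;
}

void tmp_free_all(void) {
    while (num_tmp_allocs > 0)
        free(tmp_allocs[--num_tmp_allocs]); /* clean up what was left behind */
}

/* "Return" visits the memory manager before actually returning */
#define Return(x)  do { tmp_free_all(); return (x); } while(0)

/* the allocator pointer: point it at malloc or tmp_malloc as needed */
void *(*Malloc)(size_t) = tmp_malloc;

Note that this naive sketch frees all registered temp memory, no matter which function allocated it; a real implementation would have to keep a per-function (or at least per-thread) list.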

Implementing the solution
The solution with the Temp Memory Manager and the Malloc pointer to the allocator function is so exotic that I have not implemented it yet. Do I really want to write this kind of crazed code, just to keep from making small mistakes? One could just as easily label local variables as such, and alert the programmer to double check that this space needs to be free()d.

#define LOCAL

int foo() {
    int a;
    LOCAL char *p;    /* <-- indicator, just to alert the programmer */

    p = (char *)malloc(SOME_BYTES);
    ...

Here, the empty, seemingly meaningless #define alerts the programmer to double check this situation, and the mistake of forgetting to call free() is easily avoided. No need for in-program memory managers and self-inspection, only one simple empty #define and your common sense.

Conclusion
We've taken a shot at trying to find a way to do garbage collection in standard C. We've found reference counting, and thought up a way to free all 'locally' allocated memory. In the end, we're probably better off tagging potential leaks as such and leaving it up to the programmer to do his job right. While this non-technical solution may not appeal to everyone, at least we're not going for the technical solution for the sake of having a technical solution. Sometimes, "good is better than optimal".

I may still implement the technical solution, just for fun, of course.

Sunday, May 24, 2009

pthreads and UNIX signals

A while ago, I wrote in this blog entry that when sending signals to a multi-threaded process, it is unclear which thread receives the signal. This little problem is actually easy to solve once you realize that you can create a separate thread whose sole purpose is to catch signals and respond to them.

You can mask out certain signals using sigprocmask() and its threaded nephew, pthread_sigmask(). These signals are effectively disabled and will never be delivered in those contexts.
In the "signal processing" thread, enable the blocked signals. This will now be the only thread that receives the signals.

In almost finished C-code:
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>

void *signal_processor(void *arg);

int main(int argc, char *argv[]) {
    sigset_t set;
    pthread_t thread_id;

    /* create signal processing thread */
    pthread_create(&thread_id, NULL, signal_processor, NULL);

    /*
        in the main thread, set up the desired signal mask, common to most threads;
        any newly created threads will inherit this signal mask
    */
    sigemptyset(&set);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGUSR1);
    sigaddset(&set, SIGUSR2);
    sigaddset(&set, SIGALRM);
    /* sigaddset(&set, SIGWHATEVER); */

    /* block out these signals */
    sigprocmask(SIG_BLOCK, &set, NULL);

    /* create other threads */
    pthread_create(...);

    /* ... */

    /* pthread_join(...); */

    /* ... */

    return 0;
}

/*
    implementation of the signal processing thread
*/
void *signal_processor(void *arg) {
    sigset_t set;
    int sig;

    /*
        install signal handlers for interesting signals;
        I choose to install dummy handlers (see below for why)
    */
    install_signal_handler(SIGHUP);
    install_signal_handler(SIGINT);
    install_signal_handler(SIGUSR1);
    install_signal_handler(SIGUSR2);
    install_signal_handler(SIGALRM);
    /* install_signal_handler(SIGWHATEVER); */

    for(;;) { /* forever */
        /*
            wait for any signal
        */
        sigfillset(&set);
        sigwait(&set, &sig);

        /*
            handle the signal here, rather than in a signal handler
        */
        switch(sig) {
        case SIGTERM:
            exit(1);

        case SIGHUP:
            reload_config();
            break;

        default:
            printf("caught signal %d\n", sig);
        }
    }
    return NULL;    /* never reached */
}

void dummy_signal_handler(int sig) {
    /* nothing */
    ;
}

So, the code first creates a signal processing thread. This thread has the default signal mask and will therefore receive signals. The main thread then installs a signal mask that blocks most signals. Any threads created from the main thread will inherit this new signal mask.
The signal processing thread uses sigaction() (not shown here) to install signal handlers. These signal handlers are empty dummies. The reason is that by installing a signal handler, you tell the operating system that you do not want the default signal action for these signals. Next, call sigwait() to wait indefinitely until any signal arrives. Then, handle the caught signal.
It is possible to install useful signal handlers instead of the dummies, but I'd much rather use sigwait() and make signal handling synchronous as opposed to asynchronous. (The threads still run asynchronously with respect to each other, though.)

It is tempting to simply use sigfillset() and block all signals. While this is perfectly possible, it does have implications. Signals generated by errors, like SIGSEGV, SIGBUS, and SIGILL, are directed at the very thread that caused the error; blocking them there means the fault is not delivered where it happened, and it may surface in the signal processing thread, far away from the offending code. Be wary of this, as it is likely to produce debugging headaches.
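If you do want to block nearly everything, a small sketch of how to leave the synchronous error signals alone, using only standard sigset operations:

sigset_t set;

sigfillset(&set);           /* start from "all signals" ... */
sigdelset(&set, SIGSEGV);   /* ... but leave the synchronous */
sigdelset(&set, SIGBUS);    /* error signals unblocked, so they */
sigdelset(&set, SIGILL);    /* are delivered to the thread */
sigdelset(&set, SIGFPE);    /* that caused them */
sigprocmask(SIG_BLOCK, &set, NULL);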

There is a pthread_kill() function that you can use to send signals to specific threads. Do not use this mechanism to communicate between threads. Threads should synchronize and communicate using mutexes, barriers, and condition variables, and if you want to terminate a thread you should use pthread_cancel(), or tell it to exit so that it can call pthread_exit() by itself.
About the only legitimate use of pthread_kill() is to copy a signal to other threads that should also receive it, if you have (weird) code where multiple threads must be signalled.

Wednesday, May 6, 2009

Multi-core games programming

I'm on sick leave, and although I've got some bad stomach problems and my head hurts, this is finally an opportunity to think over the problem of multi-core games programming. You see, modern computers are parallel machines, and it is a challenge to the programmer to take advantage of all this processing power.

Games are an interesting class of computer programs. Even the simplest of games draw our attention, because of the interaction between human and machine. On the inside, games are interesting as well. The code often contains things like an input handler, object allocators, sound file loaders, sprite or 3D object rendering code, etc, etc.
When switching from a single threaded game design to a multi-threaded one, you would probably think it would be smart to make a functional split:
  • a thread that handles sound
  • a thread that handles AI
  • a thread that handles rendering
  • a thread that handles input processing
While there is nothing wrong with this design, you are not getting the most out of the machine. What if we have an 8-core or 16-core or 100-core machine?

Imagine an old 8-bit 2D game where you walk around in a maze, picking up objects, and shooting bad guys. Do you have it visualized in your mind? Now, imagine a modern 3D shooter. What has changed? The amount of data, but also the amount of math. 3D worlds can be huge, with high-res textures, and realtime reflections and shadows. Great graphics hardware will take care of rendering objects to the screen, but the decision of what data of this vast 3D world is pushed to the GPU is up to the programmer. All the culling has to be performed on the CPU, not the GPU.
What if the player bumps into a wall? Collision detection is a job done by the CPU. In our old 8-bit game, we could get away with checking the byte value (representing a wall) next to the player position in the two-dimensional array that represented the map. In a modern 3D game, we have to do a lot more than that: we have to compute intersections between the player and any nearby wall objects. In theory, we'd have to check every wall in the map; in practice, this is made more efficient by splitting the map into sections. Still, a section can be full of walls and other objects to bump into. This takes a vast amount of computation.
Another subject is AI. While you'd probably think that good AI takes an enormous amount of CPU power, this is usually not the case. Yes, AI may use complex formulas, but this is generally unnecessary. Making a character behave or perform certain actions is accomplished by implementing a simple state machine, and the state machine need not do much to get great results. It is the (vast) number of intelligently behaving objects in the game world that makes AI take up CPU time.
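As an illustration, such a state machine can be as simple as an enum and a switch; a sketch, in which the monster_t type and the player_*() helper functions are made-up names:

typedef enum { IDLE, PATROL, CHASE, ATTACK } ai_state_t;

typedef struct {
    ai_state_t state;
    /* position, health, ... */
} monster_t;

/* hypothetical helpers, to be provided by the game code */
int  player_nearby(monster_t *m);
int  player_in_range(monster_t *m);
void move_towards_player(monster_t *m);
void follow_waypoints(monster_t *m);
void shoot_at_player(monster_t *m);

void artificial_intelligence(monster_t *m) {
    switch(m->state) {
    case IDLE:
        if (player_nearby(m))
            m->state = CHASE;
        break;

    case PATROL:
        follow_waypoints(m);
        if (player_nearby(m))
            m->state = CHASE;
        break;

    case CHASE:
        move_towards_player(m);
        if (player_in_range(m))
            m->state = ATTACK;
        break;

    case ATTACK:
        shoot_at_player(m);
        if (!player_in_range(m))
            m->state = CHASE;
        break;
    }
}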
One last thing is animation. In the old 8-bit days, animations were done by drawing the next animation frame on every game tick. In the modern days of 3D graphics, there are still frames, but the 3D points can be interpolated in between frames, allowing for supersmooth animation. This requires every point to be interpolated on every game tick. Moreover, nowadays we have multi-tasking operating systems, and a game tick is no longer a hard, fixed value: one tick can last 10 ms, while the next lasts 50 ms or more. This irregular behaviour of the game clock calls for interpolation too, in order to keep animations from stuttering.
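The interpolation itself is usually plain linear interpolation between two key frames; a tiny sketch:

/* interpolate one coordinate between two key frames;
   t runs from 0.0 (previous key frame) to 1.0 (next key frame)
   and is derived from the measured duration of the game tick */
float lerp(float prev, float next, float t) {
    return prev + (next - prev) * t;
}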

As you may have realized by now, it is wise to parallelize by splitting on "data", or object handling. If we spawn as many threads as there are processor cores, we are fully utilizing the CPU. Maybe we even want to play nice and leave a core unallocated so that the operating system gets a chance to run its tasks while we are busy gaming.
We parallelize each big for-loop in the code:

PARALLEL DO
    foreach obj in objects
    do
        animate(obj)
        artificial_intelligence(obj)
        collision_detect(obj)
        culling(obj)      # set flag whether object is visible
    end
END PARALLEL DO

render:
    foreach obj in objects
    do
        if obj.visible
        then
            draw_object(obj)
        endif
    end
    flush_screen()

Note how the rendering loop is not parallel. Funnily enough, the GPU is addressed serially, even though the GPU itself is a massively parallel processor.
I do not recommend putting the rendering loop in a thread of its own. If you do, you will need to mutex-lock each object while it is being rendered, which is less efficient than simply rendering serially.
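To sketch how little code this takes in C: with OpenMP (one possible choice, not the only one; compile with gcc -fopenmp) the parallel loop is a single directive. The game_object_t type and the per-object functions are hypothetical, standing in for the ones from the pseudo-code:

typedef struct {
    int visible;
    /* ... */
} game_object_t;

/* the per-object functions from the pseudo-code above (not shown) */
void animate(game_object_t *);
void artificial_intelligence(game_object_t *);
void collision_detect(game_object_t *);
void culling(game_object_t *);
void draw_object(game_object_t *);
void flush_screen(void);

void game_tick(game_object_t *objects, int num_objects) {
    int i;

    /* OpenMP divides the iterations among a team of threads,
       by default one thread per processor core */
    #pragma omp parallel for
    for(i = 0; i < num_objects; i++) {
        animate(&objects[i]);
        artificial_intelligence(&objects[i]);
        collision_detect(&objects[i]);
        culling(&objects[i]);    /* sets objects[i].visible */
    }

    /* rendering stays serial, as argued above */
    for(i = 0; i < num_objects; i++) {
        if (objects[i].visible)
            draw_object(&objects[i]);
    }
    flush_screen();
}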

Enough talk and pseudo-code, it's time to sit down and implement this for real. But first, I need to get rid of my headache ...

Saturday, February 28, 2009

Dining Philosophers

A while ago, I wrote about threading. Last week, one of the students I'm guiding for their internship mentioned he still had to complete a programming assignment: implementing the dining philosophers problem in C, using pthreads. To a student, this can be quite a challenge.

The dining philosophers problem is a very fun programming problem concerning resource locking. The origins date all the way back to 1965, a time when computers were big clunky machines with green terminals, and the operators were computer scientists wearing lab coats.

An excellent description of the problem can be found on Wikipedia.

I remember my brother once explained it to me when I was young. He loves spaghetti, so I guess the idea of the philosophers having spaghetti kind of appealed to him. He noted that, in a real world situation, the problem is easily solved by adding a director, who tells the philosophers when to eat and when to think. So, a scheduling assistant can help in solving the problem. However, the problem can also easily balance itself by doing the right kind of locking, and be solved without the use of an explicit arbiter.

In pseudo code:
proc run_thread() {
    while plate not empty {
        lock(left_fork)
        if not trylock(right_fork) {
            unlock(left_fork)
            continue
        }
        eat from plate

        unlock(right_fork)
        unlock(left_fork)
    }
}
In order to avoid deadlock, trylock is used rather than a hard lock. trylock is an atomic operation like bit-test-and-set; it acquires the lock only if it was free, and does not wait until you actually own the lock.
When locking multiple resources at once, it is good practice to unlock in the reverse order of locking, also to avoid deadlock.
Note that there is no need to sleep anywhere in the loop; waiting for short periods of time is something that a human would do, but is totally unnecessary in a computer program. (Think of all the cycles needed to load/execute the instructions in between as waiting for short periods of time!)
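In C with pthreads, the body of the loop could look roughly like this; the fork_lock mutexes (initialized elsewhere with pthread_mutex_init()) and the plate counters are assumptions about the surrounding code:

#include <pthread.h>

#define NUM_PHILOSOPHERS 5

pthread_mutex_t fork_lock[NUM_PHILOSOPHERS];
int plate[NUM_PHILOSOPHERS];

/* started as: pthread_create(&tid, NULL, run_thread, &philosopher_num) */
void *run_thread(void *arg) {
    int n = *(int *)arg;                     /* philosopher number */
    int left = n, right = (n + 1) % NUM_PHILOSOPHERS;

    while(plate[n] > 0) {
        pthread_mutex_lock(&fork_lock[left]);
        if (pthread_mutex_trylock(&fork_lock[right]) != 0) {
            /* right fork is taken; release the left one and retry,
               so we never hold one fork while waiting for the other */
            pthread_mutex_unlock(&fork_lock[left]);
            continue;
        }
        plate[n]--;                          /* eat from plate */

        pthread_mutex_unlock(&fork_lock[right]);
        pthread_mutex_unlock(&fork_lock[left]);
    }
    return NULL;
}

pthread_mutex_trylock() returns nonzero when the lock is already taken, which is exactly the trylock behaviour described above.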

Starvation
In the above code, it's possible that one philosopher finishes his plate, while the others haven't even started yet — they are literally starving. It would be good table manners if we instructed each philosopher not to eat faster than his neighbors. This would produce automatic balancing in the use of the resources.

In pseudo-code:
    if my_plate >= left_neighbor_plate
    and my_plate >= right_neighbor_plate then
        eat from plate
Making the wrong if-statement here results in nobody eating; for instance, demanding strictly greater-than deadlocks the table as soon as all plates are equal. You can also move this if-statement up, and refrain from trying to lock while the neighbors haven't eaten yet. This gives your neighbors a chance to eat; it results in spinning before the lock, but also in a higher success rate for acquiring both locks.

What's fun is seeing how the outcome is different every time; you can actually gamble on which philosopher will finish first. Multi-threaded programs have a weird way of being nondeterministic.

As an exercise, and just for the fun of it, I coded up the dining philosophers problem with N philosophers in C with pthreads in less than an hour. I think that I would have had a much harder time at it when I was still a student, though.