Sunday, May 24, 2009

pthreads and UNIX signals

A while ago, I wrote in this blog entry that when sending signals to a multi-threaded process, it is unclear which thread receives the signal. This little problem is actually easy to solve once you realize that you can create a separate thread whose sole purpose is to catch signals and respond to them.

You can block certain signals using sigprocmask() and its threaded nephew, pthread_sigmask(). Blocked signals are effectively disabled and will never be delivered to the threads that block them.
In the "signal processing" thread, leave these signals unblocked. It will then be the only thread that receives them.

In almost-finished C code:
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>

void *signal_processor(void *arg);
void install_signal_handler(int sig);

int main(int argc, char *argv[]) {
    sigset_t set;
    pthread_t thread_id;

    /* create signal processing thread */
    pthread_create(&thread_id, NULL, signal_processor, NULL);

    /*
        in the main thread, set up the desired signal mask, common to most threads;
        any newly created threads will inherit this signal mask
    */
    sigemptyset(&set);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGUSR1);
    sigaddset(&set, SIGUSR2);
    sigaddset(&set, SIGALRM);
    /* sigaddset(&set, SIGWHATEVER); */

    /* block out these signals; in a multi-threaded program,
       use pthread_sigmask() rather than sigprocmask() */
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    /* create other threads */
    pthread_create(...);

    /* ... */

    /* pthread_join(...); */

    /* ... */

    return 0;
}

/*
implementation of the signal processing thread
*/
void *signal_processor(void *arg) {
    sigset_t set;
    int sig;

    /*
        install signal handlers for interesting signals
        I choose to install dummy handlers (see below for why)
    */
    install_signal_handler(SIGHUP);
    install_signal_handler(SIGINT);
    install_signal_handler(SIGUSR1);
    install_signal_handler(SIGUSR2);
    install_signal_handler(SIGALRM);
    /* install_signal_handler(SIGWHATEVER); */

    for(;;) {    /* forever */
        /*
            wait for any signal
        */
        sigfillset(&set);
        sigwait(&set, &sig);

        /*
            handle the signal here, rather than in a signal handler
        */
        switch(sig) {
        case SIGTERM:
            exit(1);

        case SIGHUP:
            reload_config();
            break;

        default:
            printf("caught signal %d\n", sig);
        }
    }
}

void dummy_signal_handler(int sig) {
    /* nothing */
    ;
}
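
The install_signal_handler() function itself is not part of the code above; here is a minimal sketch of how it might look (my assumption, not the original code), simply registering the dummy handler via sigaction():

#include <string.h>

/* hypothetical implementation: register dummy_signal_handler()
   for the given signal using sigaction() */
void install_signal_handler(int sig) {
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = dummy_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    if(sigaction(sig, &sa, NULL) == -1) {
        perror("sigaction");
        exit(1);
    }
}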

So, the code first creates a signal processing thread. This thread inherits the default signal mask and can therefore receive signals. The main thread then installs a signal mask that blocks most signals. Any threads created from the main thread after that point will inherit the new, blocking mask.
The signal processing thread uses sigaction() (sketched above) to install signal handlers. These signal handlers are empty dummies. The reason for this is that by installing a signal handler, you tell the operating system that you do not want the default signal action for these signals. Next, it calls sigwait() to wait indefinitely until any signal arrives, and then handles the caught signal.
It is possible to install useful signal handlers instead of using the dummies, but I'd much rather use sigwait() and make signal handling synchronous as opposed to asynchronous. (The threads still run asynchronously with respect to each other, though.)

It is tempting to simply use sigfillset() and block all signals. While this is perfectly possible, it does have implications. Signals generated by program errors, like SIGSEGV, SIGBUS, and SIGILL, are delivered to the very thread that caused the error; if that thread has them blocked, they will not arrive where the error happened (strictly speaking, the behavior is undefined). Be wary of this, as it is likely to produce debugging headaches.
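
One way to avoid that trap (a sketch of my own, not from the original code) is to start from a full set and explicitly take the synchronous error signals back out before blocking or waiting on it:

sigset_t set;

/* wait for everything... */
sigfillset(&set);

/* ...except the synchronously generated error signals, which should
   keep their default action in the thread that caused them */
sigdelset(&set, SIGSEGV);
sigdelset(&set, SIGBUS);
sigdelset(&set, SIGILL);
sigdelset(&set, SIGFPE);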

There is a pthread_kill() function that you can use to send signals to individual threads. Do not use this mechanism to communicate between threads. Threads should synchronize and communicate using mutexes, barriers, and condition variables, and if you want to terminate a thread you should use pthread_cancel(), or communicate to it that it should exit, so that it can call pthread_exit() by itself.
The only real use of pthread_kill() is to forward a signal to other threads that should also receive it, if you have (weird) code where multiple threads must be signalled.
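
To illustrate the "communicate to it that it should exit" part, here is a minimal sketch (my code, with hypothetical names) using a flag guarded by a mutex and a condition variable:

#include <pthread.h>

pthread_mutex_t exit_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t exit_cond = PTHREAD_COND_INITIALIZER;
int should_exit = 0;

/* worker thread: sleep until told to exit */
void *worker(void *arg) {
    pthread_mutex_lock(&exit_lock);
    while(!should_exit) {
        /* atomically releases the mutex while waiting,
           reacquires it on wakeup */
        pthread_cond_wait(&exit_cond, &exit_lock);
    }
    pthread_mutex_unlock(&exit_lock);

    pthread_exit(NULL);
}

/* called from any other thread to request a clean shutdown */
void request_exit(void) {
    pthread_mutex_lock(&exit_lock);
    should_exit = 1;
    pthread_cond_broadcast(&exit_cond);
    pthread_mutex_unlock(&exit_lock);
}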

Wednesday, May 6, 2009

Multi-core games programming

I'm on sick leave, and although I've got some bad stomach problems and my head hurts, this is finally an opportunity to think over the problem of multi-core games programming. You see, modern computers are parallel machines, and it is a challenge to the programmer to take advantage of all this processing power.

Games are an interesting class of computer programs. Even the simplest of games draw our attention, because of the interaction between human and machine. On the inside, games are interesting as well. The code often contains things like an input handler, object allocators, sound file loaders, sprite or 3D object rendering code, etc, etc.
When switching from a single-threaded game design to a multi-threaded one, you would probably think it smart to make a functional split:
  • a thread that handles sound
  • a thread that handles AI
  • a thread that handles rendering
  • a thread that handles input processing
While there is nothing wrong with this design, you are not getting the most out of the machine. What if we have an 8-core or 16-core or 100-core machine?

Imagine an old 8-bit 2D game where you walk around in a maze, picking up objects and shooting bad guys. Do you have it visualized in your mind? Now, imagine a modern 3D shooter. What has changed? The amount of data, but also the amount of math. 3D worlds can be huge, with high-res textures, real-time reflections, and shadows. Great graphics hardware will take care of rendering objects to the screen, but deciding which parts of this vast 3D world are pushed to the GPU is up to the programmer. All the culling has to be performed on the CPU, not the GPU.
What if the player bumps into a wall? Collision detection is a job done by the CPU. In our old 8-bit game, we could get away with checking the byte value (representing a wall) next to the player position in the two-dimensional array that represented the map. In a modern 3D game, we have to do a lot more than that: we have to test the player for intersection with every wall that is nearby. In theory, we'd have to check every wall in the map; in practice, this is made more efficient by splitting the map into sections. Still, a section can be full of walls and other objects to bump into. This takes a vast amount of computation.
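
To give an idea of what such a test looks like, here is a minimal sketch (my code, not from any particular engine) of an axis-aligned bounding box overlap check between the player and a nearby wall:

#include <stdbool.h>

typedef struct {
    float min_x, min_y, min_z;
    float max_x, max_y, max_z;
} aabb_t;

/* two axis-aligned boxes intersect only if they overlap
   on all three axes */
bool aabb_intersect(const aabb_t *a, const aabb_t *b) {
    return a->min_x <= b->max_x && a->max_x >= b->min_x &&
           a->min_y <= b->max_y && a->max_y >= b->min_y &&
           a->min_z <= b->max_z && a->max_z >= b->min_z;
}

Run this for the player against each wall in the player's section, and fall back to finer-grained intersection math only when a box overlap is found.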
Another subject is AI. While you'd probably think that good AI takes an enormous amount of CPU power, this is usually not the case. Yes, AI may use complex formulas, but this is generally unnecessary. Making a character behave or perform certain actions is accomplished by implementing a simple state machine. The state machine need not do much to get great results. It is the (vast) number of intelligently behaving objects in the game world that results in AI taking up CPU time.
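
Such a state machine can be as small as an enum and a switch; a minimal sketch (the states and distances are made up for illustration):

typedef enum { STATE_IDLE, STATE_CHASE, STATE_ATTACK } ai_state_t;

typedef struct {
    ai_state_t state;
    float dist_to_player;    /* updated elsewhere, every tick */
} enemy_t;

/* one cheap AI step per enemy per game tick; the cost comes from
   running this for thousands of enemies, not from the logic itself */
void ai_step(enemy_t *e) {
    switch(e->state) {
    case STATE_IDLE:
        if(e->dist_to_player < 50.0f)
            e->state = STATE_CHASE;
        break;

    case STATE_CHASE:
        if(e->dist_to_player < 5.0f)
            e->state = STATE_ATTACK;
        break;

    case STATE_ATTACK:
        if(e->dist_to_player >= 5.0f)
            e->state = STATE_CHASE;
        break;
    }
}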
One last thing is animation. In the old 8-bit days, animations were done by drawing the next animation frame on every game tick. In the modern days of 3D graphics, there are still key frames, but the 3D points can be interpolated in between frames, allowing for super-smooth animation. This requires every point to be interpolated on every game tick. Moreover, nowadays we have multi-tasking operating systems, and a game tick is no longer a hard-set value: one tick can last 10 ms, while the next one lasts 50 ms or more. This irregular behaviour of the game clock calls for interpolation too, in order to keep animations from stuttering.
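
The interpolation itself is plain linear blending; a minimal sketch (the names are my own):

/* blend one coordinate between two animation key frames;
   t runs from 0.0 (previous frame) to 1.0 (next frame) and is
   computed from how far the clock has progressed into the
   current frame interval */
float lerp(float prev, float next, float t) {
    return prev + (next - prev) * t;
}

For example, if key frames are 40 ms apart and 10 ms have elapsed since the previous one, t is 0.25 and every animated point is drawn a quarter of the way towards its next position.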

As you may have realized by now, it is wise to parallelize by splitting on "data", or object handling. If we spawn as many threads as there are processor cores, we are fully utilizing the CPU. Maybe we even want to play nice and leave a core unallocated so that the operating system gets a chance to run its tasks while we are busy gaming.
We parallelize each big for-loop in the code:

PARALLEL DO
    foreach obj in objects
    do
        animate(obj)
        artificial_intelligence(obj)
        collision_detect(obj)
        culling(obj)        # set flag whether object is visible
    end
END PARALLEL DO

render:
    foreach obj in objects
    do
        if obj.visible
        then
            draw_object(obj)
        endif
    end
    flush_screen()
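
In C, the parallel loop maps almost directly onto an OpenMP parallel for. A minimal sketch, assuming hypothetical per-object functions that write only to the object they are handed:

#include <omp.h>

#define NUM_OBJECTS 10000

typedef struct {
    int visible;
    /* position, animation state, etc. */
} object_t;

object_t objects[NUM_OBJECTS];

/* hypothetical per-object work, defined elsewhere */
void animate(object_t *obj);
void artificial_intelligence(object_t *obj);
void collision_detect(object_t *obj);
void culling(object_t *obj);    /* sets obj->visible */
void draw_object(object_t *obj);
void flush_screen(void);

void game_tick(void) {
    int i;

    /* the big per-object loop, spread over all available cores;
       iterations are independent, so no locking is needed */
    #pragma omp parallel for
    for(i = 0; i < NUM_OBJECTS; i++) {
        animate(&objects[i]);
        artificial_intelligence(&objects[i]);
        collision_detect(&objects[i]);
        culling(&objects[i]);
    }

    /* rendering stays serial: feed the GPU from a single thread */
    for(i = 0; i < NUM_OBJECTS; i++) {
        if(objects[i].visible)
            draw_object(&objects[i]);
    }
    flush_screen();
}

Compile with something like gcc -fopenmp; by default, OpenMP typically picks one thread per core.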

Note how the rendering loop is not parallel. Funnily enough, the GPU is addressed serially, which is ironic given that the GPU itself is a massively parallel processor.
I do not recommend putting the rendering loop in a thread by itself. The reason is that if you do, you will need to mutex-lock each object while it is being rendered, and that is less efficient than simply rendering in serial.

Enough talk and pseudo-code, it's time to sit down and implement this for real. But first, I need to get rid of my headache ...