Wednesday, May 6, 2009

Multi-core games programming

I'm on sick leave, and although I've got some bad stomach problems and my head hurts, this is finally an opportunity to think over the problem of multi-core games programming. You see, modern computers are parallel machines, and it is a challenge to the programmer to take advantage of all this processing power.

Games are an interesting class of computer programs. Even the simplest of games draw our attention, because of the interaction between human and machine. On the inside, games are interesting as well. The code often contains things like an input handler, object allocators, sound file loaders, sprite or 3D object rendering code, etc, etc.
When switching from a single threaded game design to a multi-threaded one, you would probably think it would be smart to make a functional split:
  • a thread that handles sound
  • a thread that handles AI
  • a thread that handles rendering
  • a thread that handles input processing
While there is nothing wrong with this design, you are not getting the most out of the machine. What if we have a 8-core or 16-core or 100-core machine?

Imagine an old 8-bit 2D game where you walk around in a maze, picking up objects, and shooting bad guys. Do you have it visualized in your mind? Now, imagine a modern 3D shooter. What has changed? The amount of data, but also the amount of math. 3D worlds can be huge, with high-res textures, and realtime reflections and shadows. Great graphics hardware will take care of rendering objects to the screen, but the decision of what data of this vast 3D world is pushed to the GPU, is up to the programmer. All the culling has to be performed on the CPU, not the GPU.
What if the player bumps into a wall? Collision detection is a job done by the CPU. In our old 8-bit game, we could get away with checking the byte value (representing a wall) next to the player position in the two-dimensional array that represented the map. In a modern 3D game, we have to do a lot more than just that -- we have to compute the intersection between player and wall object with any of the walls that are near. In theory, we'd have to check every wall in the map, however, this is made more efficient by splitting the map into sections. Still, a section can be full of walls and other objects to bump in to. This takes a vast amount of computations.
Another subject is AI. While you'd probably think that good AI takes an enormous amount of CPU power, this is usually not the case. Yes, AI may utilize complex formulas, but this is generally totally unnecessary. Making a character behave or perform certain actions is accomplished by implementing a simple state machine. The state machine need not do much to get great results. It is the (vast) amount of intellectually behaving objects in the game world that results in AI taking up CPU time.
One last thing is animation. In the old 8-bit days, animations were done by drawing the next animation frame on every game tick. In the modern days of 3D graphics, there are still frames, but the 3D points can be interpolated in between frames, allowing for supersmooth animation. This requires every point to be interpolated on every game tick. In fact, nowadays we have multi-tasking operating systems, and a game tick is no longer a hard set value — one tick can last 10 ms, while the next one lasts 50 ms or more. This irregular behaviour of the game clock calls for interpolation too, in order to keep animations from stuttering.

As you may have realized by now, it is wise to parallelize by splitting on "data", or object handling. If we spawn as many threads as there are processor cores, we are fully utilizing the CPU. Maybe we even want to play nice and leave a core unallocated so that the operating system gets a chance to run its tasks while we are busy gaming.
We parallelize each big for-loop in the code:

PARALLEL DO
foreach obj in objects
do
animate(obj)
artificial_intelligence(obj)
collision_detect(obj)
culling(obj) # set flag whether object is visible
end
END PARALLEL DO

render:
foreach obj in objects
do
if obj.visible
then
draw_object(obj)
endif
end
flush_screen()

Note how the rendering loop is not parallel. Funnily enough, the GPU is addressed in serial. This is funny because the GPU in itself is a massively parallel processor.
I do not recommend putting the rendering loop in a thread by itself. The reason for this is that if you do, you will need to mutex-lock each object that is being rendered. This is less efficient than simply doing it in serial.

Enough talk and pseudo-code, it's time to sit down and implement this for real. But first, I need to get rid of my headache ...