Wednesday, December 30, 2009

New adventures in Cocoa OpenGL

It's been a while since my last post: one reason is that I was away on vacation, another that it's the holiday season, and another that I caught a bad, bad cold out in the snow. Well, enough with the excuses; it's time for some interesting programming blogging. Santa brought me a brand new Mac, so I will venture into the (to me, at least) unknown territory of Cocoa with OpenGL.

Nearly a year ago, I threw together an OpenGL demo for the iPhone. This was a bit of a hack, since I merely called my plain old C code from an Xcode template. While this works, it's not really the right way to go about developing for the iPhone, nor for the Mac.

When you say Cocoa, you say it in Objective-C. This weird dialect of C has some pretty powerful features that are probably best left alone and saved for later. The basics, however, are exactly the same as in any other OOP language that you may have encountered before — you have a class, members, methods, a constructor, and inheritance.

When learning a new programming language I usually go through these stages:
  1. read about it, get sick over the syntax, and hate it
  2. don't actually use it, and keep complaining about the syntax
  3. let it rest for a while, sometimes for as long as a couple of months
  4. read about it a little more
  5. use it, and fall in love with it (if it's good)
Objective-C is good stuff. Although the APIs are from a totally different world than where I come from, I cannot help but think how Objective-C could have helped me in past projects. In Objective-C, objects are reference counted, which effectively takes care of the most difficult problems in C (and C++, for that matter): memory management (e.g. with string manipulation) and pointers.

Strangely enough, developing Cocoa applications is not all about writing code. A part of "the magic" is done in Interface Builder. With this tool you not only draw your windows, you also visually connect class instances together by drawing lines. This works not only for predefined classes, but also for classes that you have created yourself (!).
When you think about it, it makes perfect sense for a windowing, event-driven operating system to have this kind of development model.
The applications themselves revolve around a design pattern called "Model-View-Controller", where a "controller" controls the data (the model) behind the application, and sees to it that the view presented to the end user is kept up to date. While you are not obliged to follow this paradigm, the code turns out cleaner and more reusable when implemented this way.

Enough talk, let's get to the details. For implementing demos and games, what we need is an OpenGL window that reads keyboard and mouse input. Steve Jobs has summarized this for us in a Cocoa NSOpenGLView class. There is a lot of code on the internet that does not use NSOpenGLView, so my guess is that it's a relatively new class. It makes things so much easier, once you know how to use it.
You should subclass it and call it something like MyOpenGLView. You can drag the NSOpenGLView into a window in Interface Builder and rename it in the Object Inspector. In the Object Inspector, you should also specify whether you want a depth buffer, stencil buffer, accumulator buffer, etc., or you won't be able to use those. There is also a tab that allows you to specify how the view resizes when the underlying window is resized.

Writing code for NSOpenGLView is easy, but like I said, you have to know how to use the predefined methods correctly.
/*
   initialize, but do not set the matrices or the viewport here
*/
-(void)prepareOpenGL {
    glClearColor(0, 0, 0, 1);
    glClearDepth(1);

    // load textures here
    // (I wrote a texture manager class to do this)

    glEnable(GL_DEPTH_TEST);    // ... plus whatever other state you need
}

/*
   this gets called when the window is resized
*/
-(void)reshape {
    NSRect boundsInPixelUnits = [self convertRectToBase:[self bounds]];
    glViewport(0, 0, boundsInPixelUnits.size.width, boundsInPixelUnits.size.height);

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    // glFrustum(...), glOrtho(...), or gluPerspective(...) goes here

    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
}

/*
   draw to screen
*/
-(void)drawRect:(NSRect)rect {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glLoadIdentity();

    // reverse the camera ... or you could put this in a camera class
    glTranslatef(-cam_x, -cam_y, -cam_z);

    // draw stuff ... maybe do this from another class

    glFlush();
    GLenum err = glGetError();
    if (err != GL_NO_ERROR)
        NSLog(@"glGetError(): %d", (int)err);
}
To get keyboard and mouse input, use the keyDown and mouseDown methods. To get these events at all, you need to "tell" MacOS that you want to receive these events:

-(BOOL)acceptsFirstResponder { return YES; }
-(BOOL)becomeFirstResponder { return YES; }
-(BOOL)resignFirstResponder { return YES; }
keyDown will not see any meta-keys. For detecting key presses of the Ctrl, Shift, Alt/Option, Command keys and such, override the method flagsChanged.
You will find that MacOS key events have a method keyCode for getting the "virtual key code". A virtual key code is like a keyboard scan code, only with different values. There appear to be no symbolic constants for these virtual key codes, so go ahead and make some #defines yourself. Mind that key codes are usable for cursor and meta keys, but you should never use them for the other keys because of keyboard layout issues — look for the unicode character instead.
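A few defines for common keys could look like the snippet below. The values are the virtual key codes commonly listed for Apple keyboards (the kVK_* constants in Carbon's Events.h); treat them as a starting point and verify them on your own machine.

/* virtual key codes as reported by -[NSEvent keyCode]
   (values taken from the kVK_* constants; verify on your own keyboard) */
#define VK_RETURN        36
#define VK_ESCAPE        53
#define VK_LEFT_ARROW   123
#define VK_RIGHT_ARROW  124
#define VK_DOWN_ARROW   125
#define VK_UP_ARROW     126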

There is a mouseMoved method that does not do anything by default. To enable mouse move events, do this:
-(void)awakeFromNib {
[[self window] setAcceptsMouseMovedEvents:YES];
}
Now, you can put all your code concerned with keyboard and mouse input directly into the MyOpenGLView class, or you can be a good application developer and create a "controller" class that acts as a controller for the OpenGL view.
Create the class MyController as subclass of NSResponder. Put all the keyboard and mouse input code in MyController.m.

Add the controller definition to the MyOpenGLView class:
IBOutlet NSResponder *controller;
In Interface Builder, instantiate the MyController class and connect MyOpenGLView's controller outlet to it.

To enable the controller, do this in awakeFromNib in MyOpenGLView:
-(void)awakeFromNib {
[self setNextResponder:controller];
}

-(BOOL)acceptsFirstResponder { return YES; }
/* you may comment this one out now
-(BOOL)becomeFirstResponder { return YES; }
*/
Now, MyOpenGLView will pass any events it does not handle itself on to its next responder. Hence, the keyboard and mouse events end up in the controlling object.

After the controller has modified the model (e.g. the user is moving left), the view should be updated. Therefore the controller is also connected to the view ... Updating the view becomes as easy as this:
[glview setNeedsDisplay:YES];
This will trigger drawRect in the MyOpenGLView class.

Games and demos usually run at a frame rate; because there is so much going on, they update the screen all the time. The easiest way of getting this done is by running the "main loop" as a timer function.
// set up timer for running the main loop

timer = [NSTimer scheduledTimerWithTimeInterval:1/30.0f
                                         target:self
                                       selector:@selector(mainloop)
                                       userInfo:nil
                                        repeats:YES];
[[NSRunLoop currentRunLoop] addTimer:timer
                             forMode:NSEventTrackingRunLoopMode];
The funny thing is, you do not need to call update explicitly from this mainloop/timer function. All you do is move some monsters around and call setNeedsDisplay whenever needed. Cocoa takes care of the rest.

NSOpenGLView does not provide a method for switching to fullscreen. I did find example code of how to do this on Mac Dev Center, but it looks advanced so I'll leave it at that for now. It involves creating a new OpenGL context with a special attribute NSOpenGLPFAFullScreen in the pixel format, and then calling setFullScreen. You can "share" this context, meaning that it's not needed to reload textures, reinitialize OpenGL, etc (which is great). What strikes me as odd, is that the example code utilizes an SDL-like event processing loop — exactly the kind that Cocoa is trying to hide from us.

For learning Objective-C and Cocoa, I recommend MacResearch. Although the man is a scientist, he knows how to explain things well enough to get you started. After a few lessons, it gets pretty advanced, at which point you should stop reading and try out programming some stuff yourself.

Sunday, October 4, 2009

Writing your own CoverFlow (for Linux)

In case you were wondering what the last post was all about, well, it was a little bit of "tech" that I needed to research for a nice little project of mine: a music player with a CoverFlow(tm)-like interface.

I've been wanting a CoverFlow for Linux for years now. I waited. I googled, found nothing. I waited more. No one implemented it. iTunes version X dot Y was released for Mac and Windows. I googled some more. I found something, but the author himself said it was dreadfully slow to startup and animated at two frames per second. Well, I'm terribly sorry but you must have done something wrong, dude! There was no option left but to give it a go myself.

Actually, I would have settled for a quick and simple popup-like music browser, but Linux did not even offer that — and deep in my heart I knew I craved CoverFlow anyway.
So, in the design phase, my wannahaves list quickly came to hold these items:
  • popup-like app, with a borderless window
  • should display album art
  • should be a simple player to play albums, since that's all I ever do anyway
  • no app-centric database ... I hate databases; use the filesystem
  • startup should be fast, no loading lag
At first I thought implementing this was going to be easy ... well, not exactly. The borderless window gave enough headaches to dedicate a whole post to.
Making the startup fast was really a matter of not loading all album art at startup, but only the ones that are visible. There are only 10 or 11 or so covers visible at any time, and to go easy on resources it does not load any more than that.

The next problem was that I really did not want to have to write a complete music player. Luckily, there is this great player named MPD, the Music Player Daemon, which is a music server to which controlling clients can connect to act as a frontend. There are many frontends available for mpd, and this app would be another one.
Interfacing with mpd is easily demonstrated by the following:
$ telnet localhost 6600
OK
mpd version something ...
play
OK
So, I dusted off some old inet-code, only to find out that my code was all character based rather than line based, so it still needed some recoding.
mpd has commands to list its internal database (which is really fast, too), but sadly no command to ask where the music directory is, so I ended up parsing the /etc/mpd.conf file anyway to find the album directories and the corresponding album cover art.
The mpd protocol is quite well documented and you can play with it using telnet (as shown above) to see how it reacts.
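To give an idea of how little code the protocol needs, here is a minimal sketch of a client that sends one command and prints the reply. It assumes mpd is listening on localhost:6600 (the default); a real client would keep the connection open and read the replies line by line.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int mpd_command(const char *cmd) {        /* e.g. mpd_command("play\n") */
    struct sockaddr_in sa;
    char buf[1024];
    int n, sock;

    if ((sock = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(6600);
    sa.sin_addr.s_addr = inet_addr("127.0.0.1");

    if (connect(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(sock);
        return -1;
    }
    n = read(sock, buf, sizeof(buf) - 1);    /* greeting: "OK MPD ..." */
    write(sock, cmd, strlen(cmd));
    n = read(sock, buf, sizeof(buf) - 1);    /* reply ends in "OK" or "ACK ..." */
    if (n > 0) {
        buf[n] = '\0';
        printf("%s", buf);
    }
    close(sock);
    return 0;
}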

Then there was the challenge of loading and displaying the album cover art. I've worked with BMP and TGA formats before, but how do you load a JPEG image? Luckily, there is the SDL_image library to take care of it. It's surprisingly simple, and what's particularly nice, it just works:
SDL_Surface *img = IMG_Load("cover.jpg");
Well, now we're kind of stuck with an SDL_Surface. I don't like these, because I want OpenGL textures. Making a texture out of img->pixels is easy enough, but beware that OpenGL really wants the dimensions of the texture to be a power of two. This is never a problem on my NVIDIA card (which happily textures just about any dimension you feed it), but always a problem on my laptop, which has a much cheaper intel video chip. To counter this problem, we must find the next power of two for the dimensions of the image, and scale it to these new dimensions, before creating the texture.

Peepz on the net find the next power of two by using round() or ceil() and log2(). Yuck! I say yuck because of the slow floating point arithmetic. A computer works with bits, and each bit position is itself a power of two. There are two neat algorithms on Wikipedia that I used:
The first is (nearly) only an AND operation and the second is a couple of bit shifts, whee!
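In C, my transcription of the two looks roughly like this (the function names are mine); double-check against the Wikipedia article before trusting it:

/* non-zero v is a power of two iff exactly one bit is set (the AND trick) */
int is_pow2(unsigned int v) {
    return v != 0 && (v & (v - 1)) == 0;
}

/* round up to the next power of two with a couple of shifts (v >= 1) */
unsigned int next_pow2(unsigned int v) {
    v--;                /* so that an exact power of two stays put */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}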

So, if we have an album art image of 200x200 pixels, it will scale it to 256x256, and if it's 300x300 it will scale to 512x512. Note that this is a must-do for OpenGL textures to work correctly on all systems.
We still have to actually scale the pixel data to fit the new dimensions, before turning it into an OpenGL texture. Scaling images is incredibly hard to get right ... unless we use another library that handles it for us. There is the SDL_gfx library that includes a zoomSurface() routine, which is exactly what I needed. The zoomSurface() works very well and very fast, and I'm very happy with it.
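Putting it together, the loading path looks roughly like this sketch. next_pow2() is the routine from the snippet above, load_cover_texture() is a made-up name, and the code assumes the scaled surface comes out as plain RGB or RGBA; real code should inspect the pixel format (and byte order) more carefully.

#include "SDL_image.h"
#include "SDL_rotozoom.h"    /* zoomSurface() lives here in SDL_gfx */
#include <GL/gl.h>

GLuint load_cover_texture(const char *filename) {
    GLuint tex = 0;
    GLenum fmt;
    SDL_Surface *img, *scaled;
    int w, h;

    if ((img = IMG_Load(filename)) == NULL)
        return 0;

    /* scale up to power-of-two dimensions */
    w = next_pow2(img->w);
    h = next_pow2(img->h);
    scaled = zoomSurface(img, (double)w / img->w, (double)h / img->h, SMOOTHING_ON);
    SDL_FreeSurface(img);
    if (scaled == NULL)
        return 0;

    /* upload as an OpenGL texture */
    fmt = (scaled->format->BytesPerPixel == 4) ? GL_RGBA : GL_RGB;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
    glTexImage2D(GL_TEXTURE_2D, 0, fmt, w, h, 0, fmt, GL_UNSIGNED_BYTE, scaled->pixels);
    SDL_FreeSurface(scaled);
    return tex;
}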

After using this much SDL code, I almost wondered why I wanted OpenGL in the first place. (Of course, animating the rotating album covers in 3D is implemented with glRotate().)
Adding a shiny mirror effect for the album covers was easy; set the color to 30% (or something) and put the texture coordinates upside-down so that it looks like the object is mirrored in the shiny black glass table (or wherever they're situated). Note that I did not use any blending here; when you blend multiple images together, they will blend together (well, that's what it does, right?), so you'd end up with the covers being blended through one another in the reflection, rather than having one cover in the back and another one in the front center.
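In immediate-mode OpenGL the mirror trick looks roughly like this sketch; the coordinates are made up, texturing is assumed to be enabled, and the default GL_MODULATE texture environment takes care of the darkening:

void draw_cover_with_reflection(GLuint tex) {
    glBindTexture(GL_TEXTURE_2D, tex);

    /* the cover itself, standing upright on y == 0 */
    glColor3f(1.0f, 1.0f, 1.0f);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 1); glVertex3f(-1,  0, 0);
    glTexCoord2f(1, 1); glVertex3f( 1,  0, 0);
    glTexCoord2f(1, 0); glVertex3f( 1,  2, 0);
    glTexCoord2f(0, 0); glVertex3f(-1,  2, 0);
    glEnd();

    /* the reflection: same quad below y == 0, texture upside-down, color at 30% */
    glColor3f(0.3f, 0.3f, 0.3f);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 1); glVertex3f(-1,  0, 0);
    glTexCoord2f(1, 1); glVertex3f( 1,  0, 0);
    glTexCoord2f(1, 0); glVertex3f( 1, -2, 0);
    glTexCoord2f(0, 0); glVertex3f(-1, -2, 0);
    glEnd();
}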

Next challenge, the album title needed to be displayed. Rendering text in OpenGL is, as always, a nightmare. OpenGL was not meant to render text, so it can't really do it. I decided to have a look at SDL_ttf, a TrueType font rendering library. SDL_ttf renders to SDL_Surfaces which you need to turn into textures again. After writing a lot of code and finally getting it to work, I found that SDL_ttf produces some really ugly output. I was quite unhappy with SDL_ttf.
So, I ripped out the SDL_ttf code again and threw it in the dustbin. Then I took some old piece of code that uses display lists and glBitmap() to blit text using a bitmap font. This looked quite nice, but it kind of bothered me that display lists are not in OpenGL/ES, so ... I ripped out this code as well, and used dedicated character blitting code to a temporary pixel buffer to create an OpenGL texture that represents the blitted string. For displaying text in OpenGL, you can also create a texture per glyph and use that to texture strings (this would be even faster, too), but in this case, I did not bother. (Maybe tomorrow, who knows?)

For the user interface, I opted to have as few buttons and bells and whistles as possible (1. to have a clutter free interface, and 2. because you have to implement all these bells and whistles too, which can be a lot of work). I decided that:
  • double click plays an album
  • single click pauses playback
  • right click skips song
  • click on the side flips through the album collection
  • click in the top corner flips to a screen resembling an "About" box
  • mouse click and move in the top to drag the window
  • shake the window to shuffle songs
This called for some interesting mouse event code; especially the window shake was a bit of a challenge.

This concludes my story of implementing a CoverFlow-like interface for Linux, and I must say, I'm quite happy with it because it looks great and works nice and fast. There are some constraints to using it though; it only plays full albums, and you must have your music directory organized in albums like I have. Furthermore, it does not automatically download album art, but there are other tools to do this. I used Google images for a couple of hours to update all my album art to higher res images ...

What's really nice, is that I combined a number of technologies, and added some new things myself:
  • socket code for connecting to mpd
  • using the mpd protocol to interface with mpd
  • Xlib code, mainly for dragging a borderless window
  • SDL for event handling
  • SDL_image for loading JPEG image files
  • SDL_ttf for rendering TrueType fonts (which was taken out again)
  • OpenGL for 3D graphics
  • the power of two routines from Wikipedia
  • bitmap font blitting into a texture
  • window shake mouse event code
Anyway, I should end by providing the download link: mpflow

Wednesday, September 23, 2009

Dragging an SDL_NOFRAME borderless window

Despite the curious title of this blog entry, I hope you will keep reading. I ran into a funny problem with SDL (the cross-platform library that is used for cool things like graphics and game programming). In SDL, you can create a main application window using the call SDL_SetVideoMode(). You can pass a flag SDL_NOFRAME to this library function to create a borderless window. Borderless windows are nice for making splash screens and such. There are two problems with borderless windows in SDL:
  1. How to center the splash screen when it appears? In X-Windows, the window manager intelligently places the windows on the desktop, but the splash screen is not automatically placed in the center of the screen.
  2. How do you move a window, when it has no title bar where you can grab it with the mouse to drag it across the screen?
I googled and googled, and I could not find an answer online, especially not to the second issue. So I guess this blog entry will be of value to some, as I did manage to solve it, and I will be giving the answer now.

The answer is: SDL can't do it. However, Xlib can.

Luckily, SDL has a hook for interfacing with the system's window manager, and this is what we'll be using. Mind that portability ends here; Xlib functions work for X11 (UNIX-like systems only, and you might be a little happy to know that MacOS X is derived from BSD UNIX and includes X11 support too) and not for Microsoft Windows or other systems.
Xlib programming is quite hard when trying to build great software, which is why people resort to GUI toolkits like GTK, Qt, KDE, GNOME, or SDL in the first place, but we're going to stick to SDL as much as possible and do only the missing bits with Xlib.

Centering the splash window
So, you've called SDL_SetVideoMode() with SDL_NOFRAME and got a borderless window. Now it's time to tell X11 to center this window onscreen. The little bit of trouble with this is, X11 works with screen coordinates, so you need to know the display resolution. SDL does not seem to have a way of getting the current display resolution — or did I miss something? Therefore, we ask X11 for the dimensions of the root window. The root window id can be obtained by calling XQueryTree(). After getting the dimensions, we can calculate the desired window position and set it by calling XMoveWindow().

The code looks a lot like this:
#include "SDL_syswm.h"

SDL_SysWMinfo info;

SDL_VERSION(&info);
if (SDL_GetWMinfo() > 0 && info.subsystem == SDL_SYSWM_X11) {
XWindowAttributes attrs;
Window root, parent, *children;
unsigned int n;

info.info.x11.lock_func();

/* find the root window */
XQueryTree(info.info.x11.display, info.info.x11.wmwindow, &root, &parent, &children, &n);
if (children != NULL)
XFree(children); /* not really interested in this */

/* get dimensions of root window */
XGetWindowAttributes(info.info.x11.display, root, &attrs);
printf("debug: display res == %d by %d\n", attrs.width, attrs.height);

/* center the splash window on screen */
x = (attrs.width - window_width) / 2;
y = (attrs.height - window_height) / 2;
XMoveWindow(info.info.x11.display, info.info.x11.wmwindow, x, y);

/* force raise window to top */
XMapRaised(info.info.x11.display, info.info.x11.wmwindow);

info.info.x11.unlock_func();
}

Dragging a borderless window
For normal windows, the window manager takes care of window dragging when the user holds the window by the title bar using the mouse. Borderless windows do not have a title bar and it is left up to the application what to do on a mouse click and move.
For dragging a borderless window, we will also be using XMoveWindow(). SDL supports mouse events, and there is a SDL_MouseMotionEvent that we can use.
Sadly, when you try to implement window drag using the coordinates reported by the SDL_MouseMotionEvent, you will fail (or at least, I did). The problem is that SDL reports the mouse coordinates relative to the application window. Next, you are moving the window. This causes "jumps" in the mouse coordinates that SDL reports, which causes the window to jump, which causes larger mouse jumps, which causes a larger window movement, which causes ... In other words, SDL's mouse coordinate system is not good enough in this case. What we need to know is the absolute mouse coordinates on the desktop. The easiest way of getting these coordinates is by calling Xlib's XQueryPointer():
 XQueryPointer(info.info.x11.display, info.info.x11.wmwindow, &root, &child, &abs_mouse_x, &abs_mouse_y, &win_mouse_x, &win_mouse_y, &modstate);

Rocksolid window dragging is now easily implemented as:
mouse_button_down:
    if (mouse_y < window_height / 8) {    /* only top of window activates drag */
        mouse_drag = 1;
        call_XQueryPointer(... &drag_x, &drag_y, ...);
    }

mouse_move:
    if (mouse_drag) {
        call_XQueryPointer(... &new_x, &new_y, ...);

        call_XMoveWindow_delta(new_x - drag_x, new_y - drag_y);

        drag_x = new_x;
        drag_y = new_y;
    }

Naturally, the call_XMoveWindow_delta(dx, dy) calls XMoveWindow() with the current window position plus dx, dy.
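For completeness, here is a sketch of what call_XMoveWindow_delta() could look like. It reuses the SDL_SysWMinfo 'info' (and the includes) from the centering code above, and it translates the window origin to root coordinates first; with a reparenting window manager you may need to be a bit more careful.

void move_window_delta(int dx, int dy) {
    Display *dpy = info.info.x11.display;
    Window win = info.info.x11.wmwindow;
    Window child;
    int wx, wy;

    info.info.x11.lock_func();
    /* current window position in root (desktop) coordinates */
    XTranslateCoordinates(dpy, win, DefaultRootWindow(dpy), 0, 0, &wx, &wy, &child);
    XMoveWindow(dpy, win, wx + dx, wy + dy);
    info.info.x11.unlock_func();
}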

Yippee
Although the X11 code is not very cross-platform, I'm quite happy with it, since it fixes some shortcomings of the SDL. If someone has ported this solution to Microsoft Windows, please send me your code; you never know when it might come in handy.

Monday, August 31, 2009

Implementing the end of the loop


After yesterday's blog entry, I may have left you thinking "this guy Walter, he sure can talk the talk, but can he code the code?" Well, yeah. After writing the entry, I felt the irresistible need to actually implement the (not new) idea of having a generic framework for running a compute kernel distributed over N threads on a system with M cores. It was not as easy as I thought it would be, and there was a catch interesting enough to write about here. Talking the talk is one thing, but coding the code still is another.

Description of the problem
You want to be able to process an array using a generic "compute kernel", which is a subroutine that computes a value for a single cell in the array. The system should run as many threads as needed to complete this task, so it should virtually run one thread per cell. As this is a framework, the compute kernel routine must be passed as a function pointer, so that it becomes a generic interface for processing arrays in parallel.

Designing a solution
A big problem is that it makes no sense to spawn a hundred threads on a (for example) dual core system, as the overhead would play a significant role in slowing the system down rather than speeding it up in comparison to its serial brother. The solution is to spawn only as many threads as there are processor cores. Each thread processes its part of the array (using an "old-fashioned" loop) and does this without locking, so it's very fast.
However, the compute kernel is written in such a way that each invocation still works on only one array element. This is a requirement, because otherwise it wouldn't be a real compute kernel. From the compute kernel's point of view, it should look like there are a hundred threads (or more) running in the system. So rather than passing a real thread id to the compute kernel, the framework cheats and passes the cell number, which now acts as a "virtual thread id".

Even though threads are lightweight, in a system like this it is not maximally efficient to spawn threads on an as-needed basis. I chose to pre-spawn (or pre-fork) the worker threads and to let them sleep until called for. This requires some basic thread scheduling mechanism. After finishing the parallel work, there should be another synchronization point as well. So the end result is that an array gets processed in parallel, after which the program can continue in serial.

Implementation details 1: Automated data partitioning
OK, suppose we have 100 array elements that we want to process on only 2 cores. That's easy: each thread handles 50 array elements. Now suppose we have 115 array elements, and 8 cores. Easy too: each thread handles a 115/8th part of the array, plus a little part that is the remainder. First, let's focus on the 115/8th part. The compute kernel routine expects a cell number (or "virtual thread id" as I like to call it), which is determined as follows:
    part = 115 / 8            # num_elem / n_threads
    for i = 0 to part - 1 do
        vtid = part * tid + i
        call compute_kernel(vtid)
    end
By the way, I implemented this in C, but it's easier to explain in pseudo-code. Also, you may have noticed that you can optimize this pseudo-code by moving the multiplication for the "base vtid" outside the loop, but that's not really important for what I'm trying to convey here. The idea is that every thread acts on its own part of the array, and it selects its own part by using its thread id as a base to index the array. Now, you should be aware that the thread id of a POSIX thread can be an arbitrary number, so what you use here should be a sequential number that is known to this thread and has been set by you at initialization time of the thread. In other words, you give the first created thread a tid of 0, the second 1, the third 2, etc., and this number should be known to the thread itself (you can do this by passing it as an argument to the thread's main function). This may seem very obvious, but as said, POSIX thread ids are arbitrary numbers, and otherwise this trick of selecting the correct region in the array would never work.
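In C, the body of one worker thread could look roughly like this sketch; job_t and its fields are made up for the example, and the important part is that tid is the sequential number you assigned yourself, not the pthread id:

typedef struct {
    int tid;                    /* sequential thread number: 0, 1, 2, ... */
    int n_threads;
    int num_elem;
    void (*kernel)(int vtid);   /* the compute kernel */
} job_t;

void *worker(void *arg) {
    job_t *job = (job_t *)arg;
    int part = job->num_elem / job->n_threads;
    int base = part * job->tid;
    int i;

    /* this thread's slice of the array */
    for(i = 0; i < part; i++)
        job->kernel(base + i);            /* base + i is the "virtual thread id" */

    /* the remainder: the first (num_elem % n_threads) threads take one extra cell */
    if (job->tid < job->num_elem % job->n_threads)
        job->kernel(part * job->n_threads + job->tid);

    return NULL;
}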

When you have an uneven distribution, as is the case with 115 array elements and 8 cores, there is a remainder of the array to be processed. The remainder is computed as the modulo; 115 % 8 equals 3. The way to handle this is the same as in the following case; suppose we had an array of only 3 elements and 8 cores, then we'd do it like this:
    if tid < num_elem then
        call compute_kernel(tid)
    endif
So, this is also how you handle the remainder part of the array in parallel, except that you'd make a simple calculation to get the "vtid" right.

This was all really simple, but we've actually accomplished a great thing here; we have created the ability to automatically process an arbitrarily sized array using any number of threads — meaning that this code will run well on a dual core system, and formidably on an 8-core system, with near linear performance improvement. Given the nature of scaling by data partitioning, this scales to 1000 cores just as easily, and further if you can find a machine that big (1).

Implementation details 2: Thread scheduling
Data partitioning was the easy part. The second part is all about thread synchronization, which is a schoolbook example, really, but very frustrating to get right if you didn't get it right the first time. I like to pre-fork the threads, and have them sleep until they are assigned work to do. Then they should be woken up, do their job, and go back to sleep. When all threads are sleeping again, that's our barrier point at which the parallel part is done, and the main application can continue in serial fashion.

This behavior can probably be implemented using mutexes only, but what's really fit for this job are condition variables. A condition variable enables a thread to sleep on a state variable, and then another thread (like in this case, the main thread) can signal it, to wake it up.

The way to do it is to use a state variable that says whether it's OK for the thread to run now, or not. This is a boolean that I named runnable. A thread is runnable when it can run, else it should go back to sleep.
thread code:
    mutex_lock(my_mutex)
    runnable = False
    while not runnable do
        condition_wait(my_mutex)
    end
    mutex_unlock(my_mutex)

main thread:
    foreach thread do
        mutex_lock(thread's mutex)
        thread runnable = True
        condition_signal(thread)
        mutex_unlock(thread's mutex)
    end
To understand this code, you should know that condition_wait unlocks the mutex when going to sleep, and automatically (and atomically) relocks the mutex when waking up again. Apparently (at least according to the pthread specifications), there may be spurious wake ups, so the waking thread must always check its state when it wakes up. That's why it's enclosed in a while loop.

The main thread isn't so exciting, it simply signals all threads to start work. A caveat here is that upon startup, not all threads may have gone to sleep yet. So there must be a barrier sync before starting, and after ending. The barrier sync simply checks whether all spawned threads are not runnable, ie. already sleeping.
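In real pthread code, the sleep/wake part could look something like this sketch (worker_state is made-up bookkeeping; the barrier sync would simply loop over all workers and check their runnable flags under the lock):

#include <pthread.h>

struct worker_state {
    pthread_t id;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int runnable;                /* the state variable the thread sleeps on */
};

/* worker side: go to sleep until the main thread marks us runnable */
void wait_for_work(struct worker_state *w) {
    pthread_mutex_lock(&w->mutex);
    w->runnable = 0;
    while (!w->runnable)                    /* guards against spurious wakeups */
        pthread_cond_wait(&w->cond, &w->mutex);
    pthread_mutex_unlock(&w->mutex);
}

/* main thread side: wake one worker up */
void wake_worker(struct worker_state *w) {
    pthread_mutex_lock(&w->mutex);
    w->runnable = 1;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}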

Implementation details 3: The number of threads to use
I said it's best to spawn as many threads as there are processor cores. This is true for tight code; if that doesn't drive your system close to 100% CPU utilization, you may add more threads until it does. How do you find out how many cores are in the system? This is a tough question, as it is different on every system. See this great info on stackoverflow.com, which shows how to get the number of cores from the system on various operating systems.
In my code, I chose the easy way and simply read an environment variable that you can set yourself, which is kind of inspired by OMP_NUM_THREADS, really.
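Something along these lines; the RP_NUM_THREADS variable name is made up, and the sysconf() fallback only works on systems that define _SC_NPROCESSORS_ONLN:

#include <stdlib.h>
#include <unistd.h>

int get_num_threads(void) {
    long n = 0;
    char *env = getenv("RP_NUM_THREADS");    /* made-up variable name */

    if (env != NULL)
        n = atoi(env);
#ifdef _SC_NPROCESSORS_ONLN
    if (n <= 0)
        n = sysconf(_SC_NPROCESSORS_ONLN);    /* number of online cores */
#endif
    return (n > 0) ? (int)n : 1;
}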

Conclusion
Whew, this blog entry ended up with much more text and much less code than I anticipated, but I have to tell you that it's a joy to see a function call like
    run_parallel(process_array, arr, sizeof(arr))
work in parallel by the number of threads given in an environment variable. What's really great about this routine is its flexibility and its scalability. It really scales "for free", gaining more performance when you add more cores.



1. At NASA Ames Research Center, they actually have an SGI Altix supercomputer named Columbia that has a partition running a single system image of 2048 processors, which is incredibly large, even by supercomputer standards.

Sunday, August 30, 2009

The end of the loop

Multi-core processors have been on the market for several years now. In commodity hardware we've seen dual processor systems (SMPs), four-way SMP systems, dual cores, quad cores, dual quad cores, quad dual cores, quad quad cores, and even the seemingly odd triple cores (XBox 360) and AMD's 6-core processors. IBM stepped into this game by putting 8 low power SPEs and a controlling PowerPC into a single chip: the Cell processor that is the engine under the hood of the Playstation 3.

Consider a simple for-loop, running over an array to do some simple operation:
FOR i = 0 TO 100
DO
arr[i] = arr[i] + 1
DONE
The execution time of this loop is determined as follows: it's a memory fetch and store, an increment, a loop counter increment, and conditional branch (jump back if the loop counter is less than 100), so it's about 5 operations times 100, which adds up to 500 ticks. This is a very rough estimate because a modern CPU has some tricks up its sleeve like combining certain operations, but the 500 ticks will be good to illustrate what I'm talking about.

Now imagine you have a dual core system. On a dual core system, you could run this loop on one core, and play a CPU-consuming game on the second core. However, if you wanted to be clever rather than waste your time playing games, you could split the loop and be done with it in half the wallclock time.
LOCAL i                          LOCAL i
FOR i = 0 TO 50                  FOR i = 50 TO 100
DO                               DO
arr[i] = arr[i] + 1              arr[i] = arr[i] + 1
DONE                             DONE
or you could split it like this, where the first core processes the even elements, and the second core processes the odd elements in the array:
LOCAL i                          LOCAL i
FOR i = 0 TO 100 STEP 2          FOR i = 1 TO 100 STEP 2
DO                               DO
arr[i] = arr[i] + 1              arr[i] = arr[i] + 1
DONE                             DONE
This code finishes in 250 ticks.
So, by dividing the array up we are enabling it for processing in parallel. This is called data partitioning, and often called a divide and conquer method. Running it in parallel made it twice as fast.

Now imagine you have a hundred cores in your CPU. It doesn't, of course, and it sounds insane, but what if technology had advanced so far that you did have a hundred cores in your CPU? If that were the case, then there would be no need to write the loop statement at all. The loop over the array would look something like this:
LOCAL i
arr[i] = arr[i] + 1
This one statement would be processed by a hundred cores in parallel, and the outcome of the program would be identical to the outcome we obtained previously. This code would finish in just 3 ticks (fetch, increment, store) and no loop statement. (NB. Maybe the architecture would be such that it would be even less than 3 ticks, but it wouldn't be much more than just 3 ticks.) Compare this to the 500 ticks that we started out with.
For a long time, many believed that this could only be achieved through quantum computing.

The funny thing is, this technique is possible today. A modern graphics card has a hundred cores (and even more than that). Technology has indeed advanced that much. The GPU is a massively parallel processing unit that was made for one purpose only: processing as many pixels (array elements) in parallel as possible. The graphics cards are programmable so that the graphics gurus can do cool things with shaders, which are essentially little programs that compute a value to put into each pixel/array element.
The computing power of graphics cards has been recognized before, but now the graphics card vendors have stepped into the world of high performance computing, literally delivering supercomputer power to your home. Moreover, Apple's latest release of OS X includes OpenCL, a programming API to harness the power of the NVIDIA chipset that is in the MacBook Pro.

The API (at the moment, either OpenCL or CUDA) allows you to create and destroy thousands of threads on the fly so you don't even have to care much how your array is going to be processed by the GPU cores. This is nice because often, you don't even know how many cores are in the machine that the code is running on.
It changes the way of programming because you must no longer think in terms of loops, but instead you should think only about what it is you want to do with each array element (1).

Of course, it is not all too good to be true; some problems simply cannot be parallelized in this way, e.g. due to interdependencies or relations between array elements.
Still, it's pretty exciting to see supercomputing power coming to your laptop, and I can't wait until NVIDIA comes up with its next big thing.


1. This is, of course, already possible with pthreads. The only problem is that when you create a hundred threads on a dual core system, the CPU overhead of creating the threads is much higher than simply looping the loop, and thus slowing the program down. If you had the cores ready to run at your disposal, you wouldn't have this problem and you could run the code in a massively parallel manner without any overhead.

Saturday, August 15, 2009

What I've been up to

Hi folks, this post is a follow-up to my last blog entry, which was about GLFW. As quickly as I fell in love with GLFW, just as quickly my crush was over; I switched back to SDL. The reason: GLFW — a "portable" library — was giving me huge problems with, you guessed right, portability. I really liked GLFW, and I really tried, but it just didn't work out. In the end, it only made me cry.

Portability is a strange phenomenon. Code that works perfectly on one platform simply doesn't work on another. This is especially true if you haven't tested it and seen it work with your own eyes. Unless you have actively engaged in the act of porting your software over, it is unlikely to run 100% correctly. I am saying this because I just ported my (now SDL) game over to MacOS X and uncovered a really dumb bug: using an uninitialized variable, assuming it is zero. The code worked like a charm in Linux, but not on the Mac. Why the GNU C compiler didn't see this problem is beyond me, since it does complain in most cases, but hey, you can't expect a free compiler to be bug free, either.

Anyway, I'm back with SDL and pulled a really neat trick; I wrapped it with some code to make it more GLFW-like! You see, I really liked GLFW's structure, hiding the event loop, using callback functions, etcetera. So now I can live with a framework that handles really easily, and on top of that, I may replace the inner workings of the "SDK" wrapper with something else some day — maybe even the next version of GLFW — without having to overhaul all of the code.
IMHO, this surrogate blend of a library is actually even better than the real thing.

In the meantime, I also switched editor. The editor is a very personal kind of tool for the programmer. I was fairly happy with joe in Linux, until the release came that has an annoying bug that SEGVs the editor when giving multiple files to edit on the command line. An editor that bombs out when you want to edit files, how great is that? Ah well. At least Joe gave away his Own Editor for free.

And then I found the Code::Blocks editor (cool name, by the way). Although it's kind of awkward to use a Windows program in a UNIX environment, I'm getting used to it. You read that right; Code::Blocks may run natively in Linux, but the portable wxWidgets library makes it look and feel like a Windows program, independently of whatever platform or OS you are running it on. wxWidgets takes portability to the max. This means no X mouse clicks to copy text (argh!); use the Ctrl-C/Ctrl-V copy/paste combination (oh, the horror!).
This surrogate blend is not really better than the real thing, but you do get a nice IDE that handles a lot friendlier than make and gdb. Yeah, I know you all like gvim ... but for some reason I never got to like vi enough to use it for anything larger than shell scripts. Code::Blocks is a great alternative, but I do miss integration of man pages and svn or git. These are available through plugins but I haven't sorted out all that stuff yet.
Being used to just a tabbed terminal window, I find that Code::Blocks also suffers a bit from too much screen clutter: toolbars, panels, etcetera. It would be nice if you could press a magic key (the Windows key, perhaps ...) to make the clutter disappear and actually have a normal editing window on your desktop. Elaborating on that idea, it would be great if the "full screen" view would actually make the editor full screen, rather than making the application full screen and keeping all toolbars and panels in view.
I should end by saying that Code::Blocks is a great developers tool, even while it insists on linking standard C code using the g++ command. (Somewhere out there, there must be a brilliant mind who can give a perfectly logical explanation for this. Naturally, I did not have g++ installed, since plain old C is my thing, but ... sigh. As I said, I should end by saying ... Code::Blocks is a great developers tool).

On the Mac, there is of course Xcode. Xcode has the same kind of interface with panels all over and a relatively small editing window, but that doesn't matter, you can't get anything done in Xcode without having taken Apple's developer course. I don't know why, but while Xcode looks fantastic, it is immensely complex to build software with. When I write software on the Mac, I use make on the command line. And that is one sad fact.

I'd like to finish this post on a positive note. The good news is, my game is nearing completion. I had something playable already after three weekends, but you know how it goes: you keep adding new stuff as you keep getting more and more ideas. Creativity spawns more creative ideas. Some programmers call it "being in the zone". I've been there. It's addictive. I want more. But I also really want this project to be done with, mostly so I can find peace and round off two other, smaller projects that I've been working on. When you are working on three projects at the same time, none of them gets finished any time soon.
So when it's done, I can catch my breath ... and start off writing code for a really cool idea that I've been having lately ...

Wednesday, July 29, 2009

Experiences with a different cross platform graphics library

There are periods when I don't write any code at all. These times are like the silence before the storm ... Last month, the silence broke and the storm was unleashed. I wrote code, code, code, and still managed to keep my head above water in the sea of binary.
I'm currently working on three interesting hobby projects at once: a rewrite of an old skool BBS using pthreads and an SQL database backend; a cluster administration tool written entirely in Python; an OpenGL arcade game. So there's plenty to blog about, but for now, I'll pick the game because OpenGL and games are always interesting. As you may know, I like using the SDL for developing OpenGL programs because it is a portable library. And then I ran into some post on the net where a guy said, "I always use glfw". I decided to check it out, and believe it or not, I ported my game over from SDL to GLFW in just a couple of hours. GLFW is deceptively simple. I decided it's out with SDL, and in with GLFW.

Starting out with GLFW
A first look at the website reveals that the latest stable version dates back to 2007. Hmmm ... this could mean that it's really stable, or that bugs are no longer being fixed. There is mention of an upcoming new release, though.

How to make use of GLFW? Just create a window, and make library calls. The library does not assume you use an event loop in your code. While this is surprising, it's really all for the better. I spent lots of time perfecting the event loop that SDL needs. GLFW just doesn't care. You want to know if a key has been pressed? Well, ask for it! Want to know the state of your windowed app? Call a glfwGet function to obtain it. If you really want, you can also register a callback function, for example for keyboard input events, and do things the event driven way.

Example code:

void key_event(int key, int state);    /* forward declaration */

void init_events(void) {
    glfwSetKeyCallback(key_event);
}

void key_event(int key, int state) {
    if (state == GLFW_PRESS)
        handle_keypress(key);
    else
        handle_keyrelease(key);
}


No library is perfect
As it says on the GLFW website, no library is perfect. I ran into little problems that had great consequences and had me hit the reset button of the computer (!) multiple times. Hello ... hitting reset is so 1989. But then again, I was programming an old skool arcade game.

I would like to detail the troubles I ran into here. Mind that they are not really big obstacles, just gotchas, things you need to know about GLFW, and maybe annoyances that might make you favor SDL anyway, but to each his own.
  1. There seems to be no good way to toggle between full screen and windowed mode. I tried to resolve this by closing the window and recreating it, but the window manager somehow doesn't like it. It is not an OpenGL problem, nor a problem of my app — SDL can do it and GLFW can't. So stick to either windowed or full screen, not both.
  2. When you resize a window, you should reinitialize OpenGL, reload textures, etc. This may take some time. When you drag a window using the mouse, GLFW will register multiple resize events, causing your app to reinitialize multiple times and causing considerable lag. I never had this problem with SDL.
  3. The characters of keys that are pressed will be reported in uppercase, even when shift is not being held down.
  4. For text input, do not use glfwSetKeyCallback(). Use glfwSetCharCallback() instead, which does ASCII translation, etc. This will give lowercase characters, a working Shift key, and so on. One thing I really don't like is that it does not translate the Return key to '\n', nor Backspace to '\b', and Escape is not code 27. So, I use glfwSetCharCallback() for the regular input, and glfwSetKeyCallback() at the same time to translate often-used keys like Escape, Return, Enter, and Backspace (see the sketch after this list).
  5. glfwGetTime() returns a double, in seconds. This is actually great, but keep this in mind because other libraries, like SDL, work with integers and milliseconds.
  6. The library includes functions for multi-threading. Don't use them; they will be removed in a later version. Use pthreads instead.
  7. GLFW does not include functions for sound. This is intended; use OpenAL for sound.
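To illustrate point 4, here is a minimal sketch, assuming GLFW 2.x: printable text comes in through the char callback, and a key callback maps a few special keys onto the ASCII codes mentioned above. handle_char() stands in for whatever your own input code expects.

#include <GL/glfw.h>

void handle_char(char c);    /* your own input routine (assumed) */

void char_callback(int character, int action) {
    if (action == GLFW_PRESS && character >= 32 && character < 127)
        handle_char((char)character);    /* printable text, correct case */
}

void key_callback(int key, int action) {
    if (action != GLFW_PRESS)
        return;

    switch(key) {
    case GLFW_KEY_ENTER:
    case GLFW_KEY_KP_ENTER:
        handle_char('\n');
        break;
    case GLFW_KEY_BACKSPACE:
        handle_char('\b');
        break;
    case GLFW_KEY_ESC:
        handle_char(27);
        break;
    }
}

void init_text_input(void) {
    glfwSetCharCallback(char_callback);
    glfwSetKeyCallback(key_callback);
}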

Do you want to know more?

If you would like to get started using GLFW, I recommend reading the users guide that is available in PDF format on the website: http://glfw.sourceforge.net/documentation.html

Monday, June 22, 2009

Taking out the trash

While working on a hobby project in one of my all-time favorite programming languages (standard C), I once again encountered an old problem: memory leakage. I hunted down the problem and fixed it, but it bothered me that I made this mistake, and it bothered me that it was so easy to make this mistake. In my second favorite programming language (Python), this would never have happened. Python features a built-in garbage collector. Why can't C have a garbage collector? The answer is, it can, but you just have to implement it.

Description of The Problem, The Pitfall
The common mistake happens when doing things like this:
int foo() {
    char *p;
    struct some_struct s;

    p = (char *)malloc(NUMBER_OF_BYTES);
    bar(&s, p);

    if (...) {
        if (... && ...) {
            ...
        } else {
            ...
            return -1;
        }
    }
    free(p);
    return 0;
}

Clever folks will see that there's a memory leak at the "return -1" (forgot to call free(p)), but did you spot the other memory leak? That's right ... what I didn't tell you, is that function bar() also allocated some memory to store into a member of the struct, and it's not being freed anywhere because it's not obvious that it's using memory.

Use another language: C++
So, I can hear you say, use C++. C++ will call the destructor when the local variable goes out of scope. This only partly solves the problem. In practice, I've learned to hate C++ because it makes things even less obvious than before, causing more problems than it solves.

Reference counting
How does Python pull it off? In essence, it's simple — it uses reference counting. Reference counting means that when you point a pointer variable at a piece of memory, you are referencing this object, and so a counter is incremented. When you stop using this object, the counter is decreased. Once the counter hits zero, no pointer variable is tracking this space anymore, meaning that this object has become useless, and therefore OK to clean up. Hence, the object is being garbage collected. Note how the object can be cleaned up immediately, and that there is absolutely no need for a separate thread to actively scan memory to collect bits of unused memory. I have no idea where this myth stems from (the Java world?).
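In C, the whole mechanism fits in a few lines. A minimal sketch (object_t is a made-up example type):

#include <stdlib.h>

typedef struct {
    int refcount;
    /* ... payload ... */
} object_t;

object_t *new_object(void) {
    object_t *obj = (object_t *)malloc(sizeof(object_t));
    if (obj != NULL)
        obj->refcount = 1;        /* the creator holds the first reference */
    return obj;
}

void incref(object_t *obj) {
    obj->refcount++;
}

void decref(object_t *obj) {
    if (obj != NULL && --obj->refcount <= 0)
        free(obj);                /* last reference gone: clean up right now */
}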

Mimicking the interpreted language
Reference counting can be done in standard C too, but it's actually quite hard to do it right. Every single time you assign a pointer variable, you have to increment the associated reference count. This is doable, and you can even write a set_xxx() wrapper function or macro to do it for you. What's much harder, is to unreference the object when you let go of it. You are bound to forget to do this correctly just the same as pairing malloc()/free() calls was too hard to get right. The only reason why (byte-code) interpreted languages like Perl, Python, PHP, Java, etc. do get it right, is because the outer layer that really drives the language is doing all assignments and keeping track of the references, rather than the programmer trying to track things from the inside. (Or am I mixing up "inside" and "outside" here, are you still with me?!?!)
I think it was Guido himself who once said that if you would get all the reference counting with objects right in C, you'd practically already be programming in Python.

Tracking mallocs
What if you'd register all malloc()ed memory with a Memory Manager, and replace C's standard "return" statement with a macro "Return" that first visits the Memory Manager to see if there's anything left behind to clean up?
Sounds great, but how can it see the difference between a temporarily allocated buffer and an object that was allocated to live for the rest of the lifetime of the process? This is exactly the core of the problem — we use the same malloc() for two different things; we want temp memory that is somehow magically destroyed at the end of the function, and we also would like to allocate memory because our program needs it. (NB. There may be a situation where an object is temporary, but not quite ... although I can't say I can think of such a situation right now.) The solution to this problem would be to implement a tmp_malloc() function that is used for temp variables (like local variables), and have it track this memory so that a "Return" statement can clean it up. This actually works, but in practice, a lot of duplicate code is needed because you want to be able to call malloc() in one situation, and tmp_malloc() in another. A real slick trick to solve this is to make Malloc a pointer to a memory allocating function, and have it point to either malloc() or tmp_malloc(). All the programmer needs to do now is decide whether he wants to temporarily use memory, or assign it for a longer period of time.

Implementing the solution
The solution with the Temp Memory Manager and the Malloc pointer to the allocator function is so exotic that I have not implemented it yet. Do I really want to write this kind of crazed code, just to keep from making small mistakes? One could just as easily label local variables as such, and alert the programmer to double check that this space needs to be free()d.

#define LOCAL

int foo() {
    int a;
    LOCAL char *p;    /* <-- indicator, just to alert the programmer */

    p = (char *)malloc(SOME_BYTES);
    ...

Here, the empty, seemingly meaningless #define alerts the programmer to double check this situation, and the mistake of forgetting to call free() is easily avoided. No need for in-program memory managers and self-inspection, only one simple empty #define and your common sense.

Conclusion
We've taken a shot at trying to find a way to do garbage collection in standard C. We've found reference counting, and thought up a way to free all 'locally' allocated memory. In the end, we're probably better off tagging potential leaks as such and leaving it up to the programmer to do his job right. While this non-technical solution may not appeal to everyone, at least we're not going for the technical solution for the sake of having a technical solution. Sometimes, "good is better than optimal".

I may still implement the technical solution, just for fun, of course.

Sunday, May 24, 2009

pthreads and UNIX signals

A while ago, I wrote in this blog entry that when sending signals to a multi-threaded process, it is unclear which thread receives the signal. This little problem is actually easy to solve when you realize that you can create a separate thread whose sole purpose is to catch signals and respond to them.

You can mask out certain signals using sigprocmask() and its threaded nephew, pthread_sigmask(). These signals will be effectively disabled and will never be received in those contexts.
In the "signal processing" thread, enable the blocked signals. This will now be the only thread that receives the signals.

In almost finished C-code:
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <pthread.h>

void *signal_processor(void *arg);

int main(int argc, char *argv[]) {
    pthread_t thread_id;
    sigset_t set;

    /* create signal processing thread */
    pthread_create(&thread_id, NULL, signal_processor, NULL);

    /*
       in the main thread, set up the desired signal mask, common to most threads
       any newly created threads will inherit this signal mask
    */
    sigemptyset(&set);
    sigaddset(&set, SIGHUP);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGUSR1);
    sigaddset(&set, SIGUSR2);
    sigaddset(&set, SIGALRM);
    /* sigaddset(&set, SIGWHATEVER); */

    /* block out these signals */
    sigprocmask(SIG_BLOCK, &set, NULL);

    /* create other threads */
    pthread_create(...);

    /* ... */

    /* pthread_join(...); */

    /* ... */

    return 0;
}

/*
implementation of the signal processing thread
*/
void *signal_processor(void *arg) {
    sigset_t set;
    int sig;

    /*
       install signal handlers for interesting signals
       I choose to install dummy handlers (see below for why)
    */
    install_signal_handler(SIGHUP);
    install_signal_handler(SIGINT);
    install_signal_handler(SIGUSR1);
    install_signal_handler(SIGUSR2);
    install_signal_handler(SIGALRM);
    /* install_signal_handler(SIGWHATEVER); */

    for(;;) {            /* forever */
        /*
           wait for any signal
        */
        sigfillset(&set);
        sigwait(&set, &sig);

        /*
           handle the signal here, rather than in a signal handler
        */
        switch(sig) {
        case SIGTERM:
            exit(1);

        case SIGHUP:
            reload_config();
            break;

        default:
            printf("caught signal %d\n", sig);
        }
    }
}

void dummy_signal_handler(int sig) {
    /* nothing */
    ;
}

So, the code first creates a signal processing thread. This thread has the default signal mask and will therefore receive signals. The main thread then installs a signal mask that blocks most signals. Any threads created from the main thread will inherit this new signal mask.
The signal processing thread uses sigaction() to install the signal handlers (a sketch of that helper follows below). These signal handlers are empty dummies. The reason for this is that by installing a signal handler, you tell the operating system that you do not want the default signal action for these signals. Next, call sigwait() to wait indefinitely until any signal arrives. Then, handle the caught signal.
It is possible to install useful signal handlers instead of the dummies, but I'd much rather use sigwait() and make signal handling synchronous as opposed to asynchronous. (The threads still run asynchronously with respect to each other, though.)
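The install_signal_handler() helper could be as simple as this sketch; it registers the dummy_signal_handler() from the code above with sigaction(), so that the default action is replaced while the actual handling still happens in the sigwait() loop:

void install_signal_handler(int sig) {
    struct sigaction sa;

    sa.sa_handler = dummy_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(sig, &sa, NULL);
}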

It is tempting to simply use sigfillset() and block all signals. While this is perfectly possible, it does have some implications. Any errors that generate signals like SIGSEGV, SIGBUS, SIGILL, will be blocked in the thread that generated the error, and caught in the signal processing thread. Be wary of this, as this is likely to produce debugging headaches.

There is a pthread_kill() function that you can use to send signals to threads. Do not use this mechanism to communicate between threads. Threads should synchronize and communicate using mutexes, barriers, and condition variables, and if you want to terminate a thread you should use pthread_cancel(), or communicate to it that it should exit, so that it can call pthread_exit() by itself.
The only use of pthread_kill() is to copy the signal to other threads that should also receive it, if you have a (weird) code where multiple threads can be signalled.

Wednesday, May 6, 2009

Multi-core games programming

I'm on sick leave, and although I've got some bad stomach problems and my head hurts, this is finally an opportunity to think over the problem of multi-core games programming. You see, modern computers are parallel machines, and it is a challenge to the programmer to take advantage of all this processing power.

Games are an interesting class of computer programs. Even the simplest of games draw our attention, because of the interaction between human and machine. On the inside, games are interesting as well. The code often contains things like an input handler, object allocators, sound file loaders, sprite or 3D object rendering code, etc, etc.
When switching from a single threaded game design to a multi-threaded one, you would probably think it would be smart to make a functional split:
  • a thread that handles sound
  • a thread that handles AI
  • a thread that handles rendering
  • a thread that handles input processing
While there is nothing wrong with this design, you are not getting the most out of the machine. What if we have an 8-core or 16-core or 100-core machine?

Imagine an old 8-bit 2D game where you walk around in a maze, picking up objects, and shooting bad guys. Do you have it visualized in your mind? Now, imagine a modern 3D shooter. What has changed? The amount of data, but also the amount of math. 3D worlds can be huge, with high-res textures, and realtime reflections and shadows. Great graphics hardware will take care of rendering objects to the screen, but the decision of what data of this vast 3D world is pushed to the GPU is up to the programmer. All the culling has to be performed on the CPU, not the GPU.
What if the player bumps into a wall? Collision detection is a job done by the CPU. In our old 8-bit game, we could get away with checking the byte value (representing a wall) next to the player position in the two-dimensional array that represented the map. In a modern 3D game, we have to do a lot more than just that -- we have to compute the intersection between the player and any of the wall objects that are near. In theory, we'd have to check every wall in the map; however, this is made more efficient by splitting the map into sections. Still, a section can be full of walls and other objects to bump into. This takes a vast amount of computation.
Another subject is AI. While you'd probably think that good AI takes an enormous amount of CPU power, this is usually not the case. Yes, AI may use complex formulas, but that is generally unnecessary. Making a character behave or perform certain actions is accomplished with a simple state machine, and the state machine need not do much to get great results. It is the (vast) number of intelligently behaving objects in the game world that makes AI take up CPU time.
One last thing is animation. In the old 8-bit days, animations were done by drawing the next animation frame on every game tick. In the modern days of 3D graphics there are still keyframes, but the 3D points can be interpolated in between them, allowing for super-smooth animation. This requires every point to be interpolated on every game tick. Moreover, nowadays we have multi-tasking operating systems, and a game tick is no longer a hard-set value: one tick may last 10 ms, while the next one lasts 50 ms or more. This irregular behaviour of the game clock calls for interpolation too, in order to keep animations from stuttering.
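The interpolation itself is cheap per point; a minimal sketch (my own helper, not from any particular engine), where t runs from 0.0 at one keyframe to 1.0 at the next, computed from the measured frame time:

typedef struct {
    float x, y, z;
} vec3;

/* linear interpolation between two keyframe positions */
static vec3 lerp_vec3(vec3 a, vec3 b, float t) {
    vec3 r;

    r.x = a.x + (b.x - a.x) * t;
    r.y = a.y + (b.y - a.y) * t;
    r.z = a.z + (b.z - a.z) * t;
    return r;
}

The real cost comes from doing this for every point of every animated object, every tick.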

As you may have realized by now, it is wise to parallelize by splitting on "data", or object handling. If we spawn as many threads as there are processor cores, we are fully utilizing the CPU. Maybe we even want to play nice and leave a core unallocated so that the operating system gets a chance to run its tasks while we are busy gaming.
We parallelize each big for-loop in the code:

PARALLEL DO
    foreach obj in objects
    do
        animate(obj)
        artificial_intelligence(obj)
        collision_detect(obj)
        culling(obj)        # set flag whether object is visible
    end
END PARALLEL DO

render:
    foreach obj in objects
    do
        if obj.visible
        then
            draw_object(obj)
        endif
    end
    flush_screen()

Note how the rendering loop is not parallel. Funnily enough, the GPU is addressed serially, even though the GPU itself is a massively parallel processor.
I do not recommend putting the rendering loop in a thread of its own. If you do, you will need to mutex-lock each object while it is being rendered, which is less efficient than simply rendering in serial.
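To make this a bit more concrete, here is a rough sketch of the data-parallel update loop in C with pthreads. The object structure, the thread count, and spawning fresh threads every frame are all simplifications (a real game would keep a worker pool around):

#include <pthread.h>

#define NUM_THREADS  4            /* ideally: the number of CPU cores */
#define NUM_OBJECTS  1000

typedef struct {
    float x, y, z;
    int visible;
    /* ... */
} object_t;

static object_t objects[NUM_OBJECTS];

static void update_object(object_t *obj) {
    /* placeholder for animate(), AI, collision detection and culling */
    obj->visible = 1;
}

static void *update_worker(void *arg) {
    int id = (int)(long)arg;
    int i;

    /* each worker takes an interleaved slice of the object array */
    for (i = id; i < NUM_OBJECTS; i += NUM_THREADS)
        update_object(&objects[i]);

    return NULL;
}

void update_all_objects(void) {
    pthread_t tid[NUM_THREADS];
    int n;

    for (n = 0; n < NUM_THREADS; n++)
        pthread_create(&tid[n], NULL, update_worker, (void *)(long)n);

    for (n = 0; n < NUM_THREADS; n++)
        pthread_join(tid[n], NULL);

    /* all workers have joined: the serial render loop can now run safely */
}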

Enough talk, pseudo-code, and sketches; it's time to sit down and implement this for real. But first, I need to get rid of my headache ...

Saturday, February 28, 2009

Dining Philosophers

A while ago, I wrote about threading. Last week, one of the students I'm guiding for their internship mentioned he still had to complete a programming assignment: implementing the dining philosophers problem in C, using pthreads. To a student, this can be quite a challenge.

The dining philosophers problem is a very fun programming problem concerning resource locking. The origins date all the way back to 1965, a time when computers were big clunky machines with green terminals, and the operators were computer scientists wearing lab coats.

An excellent description of the problem can be found on Wikipedia.

I remember my brother once explained it to me when I was young. He loves spaghetti, so I guess the idea of the philosophers having spaghetti kind of appealed to him. He noted that, in a real world situation, the problem is easily solved by adding a director, who tells the philosophers when to eat and when to think. So, a scheduling assistant can help in solving the problem. However, the problem can also easily balance itself by doing the right kind of locking, and be solved without the use of an explicit arbiter.

In pseudo code:
proc run_thread() {
    while plate not empty {
        lock(left_fork)
        if not trylock(right_fork) {
            unlock(left_fork)
            continue
        }
        eat from plate

        unlock(right_fork)
        unlock(left_fork)
    }
}
In order to avoid deadlock, trylock is used rather than a hard lock. trylock is an atomic operation, much like bit-test-and-set: it takes the lock only if it was free, and does not wait until the lock becomes available.
When locking multiple resources at once, it is good practice to unlock in the reverse order of locking, also to avoid deadlock.
Note that there is no need to sleep anywhere in the loop; waiting for short periods of time is something that a human would do, but is totally unnecessary in a computer program. (Think of all the cycles needed to load/execute the instructions in between as waiting for short periods of time!)
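For the concrete version, here is a minimal sketch of one philosopher thread with pthreads. The array names and the bite counting are my own choices, not a canonical implementation; forks[] and plate[] are assumed to be initialized in main() before the threads start:

#include <pthread.h>

#define N 5

pthread_mutex_t forks[N];         /* one mutex per fork, pthread_mutex_init()'ed in main() */
int plate[N];                     /* bites left on each plate */

void *philosopher(void *arg) {
    int me = (int)(long)arg;      /* my seat at the table */
    int left = me;                /* fork on my left      */
    int right = (me + 1) % N;     /* fork on my right     */

    while (plate[me] > 0) {
        pthread_mutex_lock(&forks[left]);

        if (pthread_mutex_trylock(&forks[right]) != 0) {
            /* right fork is taken; put the left one back and retry */
            pthread_mutex_unlock(&forks[left]);
            continue;
        }

        plate[me]--;              /* eat one bite */

        pthread_mutex_unlock(&forks[right]);
        pthread_mutex_unlock(&forks[left]);
    }
    return NULL;
}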

Starvation
In the above code, it's possible that one philosopher finishes his plate, while the others haven't even started yet — they are literally starving. It would be good table manners if we instructed each philosopher not to eat faster than his neighbors. This would produce automatic balancing in the use of the resources.

In pseudo-code:
    if my_plate >= left_neighbor_plate
    and my_plate >= right_neighbor_plate then
        eat from plate
Here, a plate holds what is left on it, so a philosopher only eats when he is not ahead of either neighbor. Get the comparison wrong (for instance, demand strictly more than both neighbors) and nobody eats at all. You can also move this check up and refrain from even trying to lock while you are ahead of your neighbors; this gives them a chance to eat, and results in some spinning before the lock, but also in a higher success rate for acquiring both locks, as sketched below.
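In C, the check could be a small helper used at the top of the while loop in the sketch above (again assuming plate[] counts the bites that are left):

/* only eat if we have not eaten more than either neighbor; reading the
   neighbors' plates without locking is fine here, it is only a hint */
static int may_eat(int me) {
    int left_neighbor = (me + N - 1) % N;
    int right_neighbor = (me + 1) % N;

    return plate[me] >= plate[left_neighbor] &&
           plate[me] >= plate[right_neighbor];
}

Call it at the top of the loop and simply continue while it returns 0.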

What's fun is seeing how the outcome is different every time; you can actually gamble on which philosopher will finish first. Multi-threaded programs have a weird way of being non-deterministic.

As an exercise, and just for the fun of it, I coded up the dining philosophers problem with N philosophers in C with pthreads in less than an hour. I think that I would have had a much harder time at it when I was still a student, though.

Sunday, January 25, 2009

OpenAL - sounds good!

For getting sound to work on the iPhone, I needed to acquaint myself with the OpenAL sound system. OpenAL is also particularly interesting because it is a 3D sound system. You set the position of a sound source, and OpenAL will take care of getting the volume just right, as well as the perceived position the sound appears to come from when you hear it played on a surround setup. It can also do cool things like the Doppler effect. While I do not need 3D sound and have only just started using OpenAL, it would certainly be cool to get some simple stereo sound effects into my game. This is something that SDL_mixer cannot do.

OpenAL tutorials and ALUT
There are some nice tutorials on the web, and they all use ALUT. It turns out that ALUT had some (simple to resolve!) bugs in it, and it got marked as deprecated. Once functions are marked deprecated, gcc starts warning about them, and with warnings treated as errors the code no longer compiles; so now all those nice online tutorials break. ALUT had two functions that were really convenient: alutInit() and alutLoadWAVFile(). Writing my own WAV loader took some time, but I managed, and initializing OpenAL properly takes a little magic sequence; read below to see how that works.

Ubuntu Linux suxxors
First a sidestep from writing code: getting your system to work and able to produce sound at all. My favorite development platform is still Linux. For some reason, Ubuntu decided to break sound in version 8.10 (Intrepid) and I've been hassling with it for days to get any audio output out of my programs. As it turns out, Ubuntu Intrepid mixes ALSA (asound), esd, and PulseAudio, with an apparent preference for PulseAudio, except that it's not properly integrated or set up; it just doesn't work in all cases. That's right, one program seems to work, while others don't. The solution: I removed all PulseAudio packages from my system and installed their ALSA counterparts. (In fact, I compiled the mpd music player daemon by hand, because Ubuntu's version tried to use PulseAudio anyway.) After taking these drastic measures, my sound programs produced audio output, hurray!!

Initializing OpenAL
As said, we are going to have to do without ALUT. OpenAL needs to be initialized by creating a "context" and selecting the audio device, otherwise you will get nothing but errors. In some tutorials, I found that you may select "DirectSound3D" as a device, but that obviously only works on Windows. I have no clue what devices are available on Linux, MacOS X, or whatever other platform you may think of. Luckily, there is a way to get the default device.
I'll share The Code with you:
#include "al.h"
#include "alc.h"

int init_openal(void) {
ALCcontext *context;
ALCdevice *device;
const ALCchar *default_device;

default_device = alcGetString(NULL,
ALC_DEFAULT_DEVICE_SPECIFIER);

printf("using default device: %s\n", default_device);

if ((device = alcOpenDevice(default_device)) == NULL) {
fprintf(stderr, "failed to open sound device\n");
return -1;
}
context = alcCreateContext(device, NULL);
alcMakeCurrentContext(context);
alcProcessContext(context);

/* you may use alcGetError() at this point to see if everything is still OK */

atexit(atexit_openal);

/* allocate buffers and sources here using alGenBuffers() and alGenSources() */
/* ... */

alGetError(); /* clear any AL errors beforehand */
return 0;
}
First a remark about the includes. Including the files "al.h" and "alc.h" is much more portable than using <AL/al.h> and <AL/alc.h>, and it saves you from resorting to "#ifdef PLATFORM_SO_AND_SO" constructs.
Using the default device may not always be the best choice. There should be an option in the program to let the user decide on what device to use.
Note the atexit() call; OpenAL will complain loudly when you exit the program without cleaning up properly. It has a right to complain, too: you have allocated valuable hardware resources, and it is quite possible that the operating system is not babysitting this hardware, meaning that buffers you never released may stay marked 'in use' until the computer is rebooted (!).
A good atexit() handler frees the allocated resources and closes the sound device.
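I won't paste my exact handler, but a minimal sketch could look like this, assuming init_openal() stored its handles in globals and we only have one source and one buffer:

#include "al.h"
#include "alc.h"

/* assumed to be filled in by init_openal() */
static ALCdevice *device = NULL;
static ALCcontext *context = NULL;
static ALuint source, buffer;

void atexit_openal(void) {
    alSourceStop(source);           /* first stop playing           */
    alDeleteSources(1, &source);    /* then delete the sources      */
    alDeleteBuffers(1, &buffer);    /* then delete the buffers      */

    alcMakeContextCurrent(NULL);    /* detach and destroy context   */
    alcDestroyContext(context);
    alcCloseDevice(device);         /* and finally close the device */
}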
#include "al.h"
#include "alc.h"

int init_openal(void) {
ALCcontext *context;
ALCdevice *device;
const ALCchar *default_device;

default_device = alcGetString(NULL,
ALC_DEFAULT_DEVICE_SPECIFIER);

printf("using default device: %s\n", default_device);

if ((device = alcOpenDevice(default_device)) == NULL) {
fprintf(stderr, "failed to open sound device\n");
return -1;
}
context = alcCreateContext(device, NULL);
alcMakeCurrentContext(context);
alcProcessContext(context);

/* you may use alcGetError() at this point to see if everything is still OK */

atexit(atexit_openal);

/* allocate buffers and sources here using alGenBuffers() and alGenSources() */
/* ... */

alGetError(); /* clear any AL errors beforehand */
return 0;
}
Mind the order of the OpenAL calls in the atexit handler: always first stop the sound, then delete the sound sources, and then delete the buffers. After that, destroy the context, and finally close the device. If you don't do it in this order, the sound system will not be de-initialized properly.

Loading .WAV files
Loading WAV sound files is easy, although there is one big gotcha: it is a binary format, stored little endian. The Intel x86 line of processors is little endian, so no problem there. However, I like to write portable code, so this is where the fun starts. I have the pleasure of having access to a big endian machine, so I could test my WAVE loader code there, too. I came up with a very simple test to see whether a machine is big or little endian:
#include <stdint.h>
#include <string.h>

int is_big_endian(void) {
    unsigned char test[4] = { 0x12, 0x34, 0x56, 0x78 };
    uint32_t val;

    /* copy with memcpy rather than a pointer cast, to stay clear of
       alignment and strict aliasing trouble; on a big endian machine
       the four bytes read back as 0x12345678 */
    memcpy(&val, test, sizeof(val));
    return val == 0x12345678;
}
The is_little_endian() function is left as an exercise to the reader.

Back to the WAVE loader: what I do is simply read the whole file into memory and then map a struct that represents the WAV header over it. Then I do the integer fixups (the values are stored little endian), and perform some basic checks on whether this really is a good WAV file. For a music player you should do these checks before loading the music data, but for my game code I was happy with simply loading the file all at once.
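The "integer fixups" boil down to byte swapping on big endian machines only; a couple of helpers along these lines will do (my own names, reusing is_big_endian() from above):

#include <stdint.h>

int is_big_endian(void);    /* defined above */

/* the WAV header fields are little endian; on a big endian host, swap them */
static uint16_t le16_to_host(uint16_t v) {
    if (!is_big_endian())
        return v;
    return (uint16_t)((v >> 8) | (v << 8));
}

static uint32_t le32_to_host(uint32_t v) {
    if (!is_big_endian())
        return v;
    return (v >> 24) | ((v >> 8) & 0x0000ff00U) |
           ((v << 8) & 0x00ff0000U) | (v << 24);
}

Every multi-byte field in the header gets run through one of these after the struct has been mapped over the file data.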

You can find good information about the structure of a WAV file header on the web. Note how a WAVE loader is really a RIFF loader, and in my case, also a PCM sound loader.
There is a gotcha when blindly pouring the WAV header information into a struct: alignment. The compiler may pad the struct and mess with its size ... use #pragma pack to instruct the compiler to refrain from doing that:
#pragma pack(1)
typedef struct {
    uint32_t Chunk_ID;
    uint32_t ChunkSize;
    ...
    uint16_t AudioFormat;
    ...
} WAV_header_t;
#pragma pack()
I've noticed that there are quite a few WAV files out there with small mistakes in their headers; in particular the final field data_size (or subchunk2_size), which holds the length of the sound data, is often wrong. The correct value is the file size minus the header size, and with that knowledge you can even 'repair' the WAV file.
Now that we've succeeded in loading the .WAV file, feed this data into alBufferData() and we're ready to make some noise.
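To give an idea of that last step, a sketch (assuming plain 16-bit PCM); wav_data, wav_size, num_channels and sample_rate are hypothetical names for whatever your loader pulled out of the file:

#include "al.h"

extern unsigned char *wav_data;     /* the raw PCM sample data        */
extern int wav_size;                /* the (repaired) data_size value */
extern int num_channels;            /* 1 = mono, 2 = stereo           */
extern int sample_rate;             /* e.g. 44100                     */

void play_wav(void) {
    ALuint buffer, source;
    ALenum format;

    format = (num_channels == 2) ? AL_FORMAT_STEREO16 : AL_FORMAT_MONO16;

    alGenBuffers(1, &buffer);
    alBufferData(buffer, format, wav_data, wav_size, sample_rate);

    alGenSources(1, &source);
    alSourcei(source, AL_BUFFER, buffer);   /* attach the buffer */
    alSourcePlay(source);                   /* make some noise   */
}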

Closing remarks
In this blog entry I've shown how you can survive without ALUT. Although SDL_mixer happily handles .mp3s, and many other formats, it doesn't give you the stereo and 3D sound that OpenAL does. For OpenAL, you will need to write your own sample loader. Painful as it seems, it is really just a good exercise in handling data formats.
In the devmaster OpenAL tutorials they show how to stream Ogg Vorbis files through OpenAL. Be sure to read all their tutorials, they're pretty straightforward once you've gotten through lesson 1. Also note how they rely on ALUT ... forget about ALUT. Use OpenAL, it sounds good.

Sunday, January 11, 2009

Porting an OpenGL demo to iPhone

You may read this blog entry as "a beginner's guide to OpenGL ES". Like many others, I wanted to write a cute little app for the iPhone. Also, like many others, I know nothing about MacOS X programming (the Cocoa API), so I quickly found myself lost in the woods. I do have experience with SDL and OpenGL, however, so I decided to port the infamous "WHYz!" demo to the iPhone, using Apple's Xcode and the iPhone simulator.

History of the WHYz! demo
The WHYz! demo is a story by itself. It is included here as background information and for the sake of story-telling. Skip this section if you are interested in the OpenGL ES code only.
Way back in them good old days, 1996, I wrote the first WHYz! demo on a 33 MHz 486 PC. It was meant as an electronic birthday card. During a dark night of coding, I rewrote the whole thing in 100% assembly code, hoping to squeeze out a few extra frames.
In 1997, I ported the C version to Linux using svgalib. I also played around with Qt around that time, but I don't think I ever made a Qt port of the WHYz! demo.
In early 2006, I was first trying out SDL and made an SDL implementation of the WHYz! demo.
In 2007, I made a completely new rewrite in SDL/GL. This version is different from the original, but demonstrates the same kind of wave effect.
Now the wavy WHYz! demo comes to the iPhone with OpenGL ES.

Taking the first hurdle: Objective-C
The iPhone's operating system closely resembles MacOS X. In MacOS X, applications are typically written in Objective-C, an object-oriented dialect of C. When you see Objective-C code for the first time in your life, your stomach will turn, your eyes won't know where to look, and you will feel dizzy. Remember this: Objective-C is much like C++, only with a different syntax. Readers of this blog may know I have already expressed my aversion to C++; well, Objective-C is even less pretty on the outside. On the other hand, a first impression is not everything. Objective-C seems to come with a very nice API, one that is going to cost time and dedication to master. At this very moment life is too short to spend valuable time on this, so lucky for us, it is possible to mix standard C with Objective-C, just like it is possible to write "C++" code that is really just standard C and only benefit from the handy "// comment" syntax.
So, we will have Xcode generate the startup code for the main application for us, and call our good old standard C code from there. Cheating? No, just not rewriting working code in a dialect that I don't speak.

Taking the second hurdle: SDL for iPhone
As my existing code is all SDL/GL, I really need SDL for the iPhone. How else would we set up the screen, handle input events from the operating system, play sound, and finally render our OpenGL scene? As of yet, the answer is ... you don't. Maybe I didn't look hard enough, but I did not find an SDL library for the iPhone; it appears to be in development. For this demo I managed to do without SDL, but things will be much harder when trying to port a full-blown game.

For setting up the screen: Let Xcode generate an EAGLView.m for you. Insert your own init_gl() routine to set up the viewport, and reset the projection and model view matrices.
Input events: The demo doesn't need input, but I'm sure Apple has some neat CoreInput or whatever framework for this.
Play sound: Not used in the demo, but they say you should use OpenAL, which sounds nice. OpenAL is portable and doesn't look hard to use.
Render OpenGL scene: Insert your own draw_scene() code into the EAGLView drawView() method.

All done? NO, the best is yet to come. Hit the "Build and go" button in Xcode and you will get a ton of errors on OpenGL calls.

Hurdle #3: The limitations of OpenGL ES
OpenGL ES stands for "OpenGL for Embedded Systems". Mobile devices are always relatively slow, simple, low-power computers. In the beginning there often wasn't even an FPU; if your mobile device has no FPU and therefore only implements the fixed-point (Common-Lite) profile of OpenGL ES 1.0, you are stuck with fixed point calculations. Lucky for us, modern mobile devices like the iPhone are capable of doing floating point calculations in hardware.
The graphics chips in mobile devices are also really simple compared to their mean big brothers in gaming PCs. The OpenGL ES API has been slimmed down to mirror this.

Just to annoy developers, the great creators have altered a fundamental concept of OpenGL, only to force people into writing more efficient code. Shocking news: in OpenGL ES, there is no glBegin()/glEnd(). There is no glVertex(). There is no glTexCoord(). Argh!
There are only glDrawArrays() and glDrawElements().
(By the way, remember my OpenGL optimization lesson? Use glDrawArrays() to speed things up a little).

There are not even glFrustum() and glOrtho(), but their signatures have been made more consistent with the rest of the API: call glFrustumf() and glOrthof() instead.

There is no polygon mode, so no more cool debugging in wireframe (my absolute favorite polygon drawing mode). Well, you can do it, but not through glPolygonMode().

There are no quads; only GL_TRIANGLES, GL_TRIANGLE_STRIP, and GL_TRIANGLE_FAN remain.

There are no display lists, use a subroutine as workaround.

You cannot push/pop attributes. The programmer should know what the state is at all times, and should know what he's doing anyway. Just like in the real world.

There is no full set of color manipulation routines, only glColor4f().

There is no GLdouble, only GLfloat (and the fixed point GLfixed).

Oh yeah, and you have to find the right include somewhere, too.

This leaves you with a rather big mess of perfectly good standard OpenGL code that just doesn't work in OpenGL ES. I guess you could write a set of macros that converts existing code to ES, but I'd rather stay away from that headache. Nothing remains but to go through the code and do the tedious work of converting each and every block of OpenGL code to ES.

Taking hurdles: Converting existing OpenGL code
Converting the existing code is mostly looking for glBegin()s and putting all glVertex()s into a vertex array. Finally, call glDrawArrays(). I will give a 2D example for the sake of simplicity:
/* old code, in some kind of orthogonal coordinate system */

#include "SDL_opengl.h"

glBegin(GL_TRIANGLE_STRIP);
    glTexCoord2i(0, 0);
    glVertex2i(-1, 1);
    glTexCoord2i(0, 1);
    glVertex2i(-1, -1);
    glTexCoord2i(1, 0);
    glVertex2i(1, 1);
    glTexCoord2i(1, 1);
    glVertex2i(1, -1);
glEnd();
New:
#include <OpenGLES/ES1/gl.h>

GLfloat vertex_arr[8] = {
    -1.0f,  1.0f,
    -1.0f, -1.0f,
     1.0f,  1.0f,
     1.0f, -1.0f
};
GLfloat tex_arr[8] = {
    0.0f, 0.0f,
    0.0f, 1.0f,
    1.0f, 0.0f,
    1.0f, 1.0f
};

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);

glVertexPointer(2, GL_FLOAT, 0, vertex_arr);
glTexCoordPointer(2, GL_FLOAT, 0, tex_arr);

glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
I was lucky that my code already used triangle strips ...

If you use display lists, convert those to subroutines. You will lose the performance benefit that glCallLists() has. My guess is that mobile devices lack the GL "compiler" and cache memory that are needed for display lists.
Simply delete calls that set the polygon mode. If you need to draw lines ... pass GL_LINES to glDrawArrays().
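For example, a crude stand-in for wireframe debugging (a sketch; GL_LINES, GL_LINE_STRIP, and GL_LINE_LOOP are all available in ES):

/* draw the outline of a rectangle with a line loop */
GLfloat outline[8] = {
    -1.0f,  1.0f,
     1.0f,  1.0f,
     1.0f, -1.0f,
    -1.0f, -1.0f
};

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, outline);
glDrawArrays(GL_LINE_LOOP, 0, 4);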
Finally, search your source for "glColor" and replace all by glColor4f().
glColor3ub(0xff, 0x80, 0);
becomes:
glColor4f(1.0f, 0.5f, 0.0f, 1.0f);

The final hurdle: Reaching the finish line
When I finally got the code to compile, it still did not work. One of the problems was that the textures did not load. The thing is, on MacOS X, applications and their data files are part of a bundle. If you want to load a texture, you are supposed to access the bundle using the NSBundle class from the Cocoa library. Since my code is based on fopen(), I chose to cheat and not use NSBundle: at startup, I do a chdir(dirname(argv[0])) and find my texture files there. This is quite ugly, as dirname() manipulates argv[0].
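For reference, a sketch of that cheat; giving dirname() a copy of argv[0] would even avoid the clobbering:

#include <libgen.h>
#include <string.h>
#include <unistd.h>

static void chdir_to_app_dir(const char *argv0) {
    char path[1024];

    /* dirname() may modify its argument, so hand it a copy */
    strncpy(path, argv0, sizeof(path) - 1);
    path[sizeof(path) - 1] = '\0';

    chdir(dirname(path));
}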

There was another gotcha that was pretty obvious once I figured it out ... I had deleted the SDL resize_window() code from the source. That routine also (re)initialized OpenGL in the proper way ... so I was trying to start up without having set the right coordinate system (oops).
When this was fixed, I could now happily enjoy watching the WHYz! demo in the iPhone simulator. (Yeah, it's time to go buy a real iPhone ...)