Monday, August 31, 2009

Implementing the end of the loop


After yesterday's blog entry, I may have left you thinking "this guy Walter, he sure can talk the talk, but can he code the code?" Well, yeah. After writing the entry, I felt the irresistible need to actually implement the (not new) idea of having a generic framework for running a compute kernel distributed over N threads on a system with M cores. It was not as easy as I thought it would be, and there was a catch interesting enough to write about here. Talking the talk is one thing, but coding the code is still another.

Description of the problem
You want to be able to process an array using a generic "compute kernel", which is a subroutine that computes the value of a single cell in the array. The system should run as many threads as needed to complete this task, so virtually it runs one thread per cell. As this is a framework, the compute kernel routine is passed as a function pointer, giving a generic interface for processing arrays in parallel.
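In C, such an interface could look roughly like this (a sketch with my own names, not taken from the actual code):

    /* Sketch: a compute kernel processes exactly one array element,
       identified by its cell number (the "virtual thread id"). */
    typedef void (*compute_kernel_t)(int vtid, int *arr);

    /* Example kernel: increment a single cell. */
    void increment_cell(int vtid, int *arr)
    {
        arr[vtid] = arr[vtid] + 1;
    }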

Designing a solution
A big problem is that it makes no sense to spawn a hundred threads on (for example) a dual core system, as the thread overhead would slow the system down rather than speed it up compared to its serial brother. The solution is to spawn only as many threads as there are processor cores. Each thread processes its part of the array (using an "old-fashioned" loop) and does this without locking, so it's very fast.
However, the compute kernel is written in such a way that each invocation still works on only one array element. This is a requirement, because otherwise it wouldn't be a real compute kernel. From the compute kernel's point of view, it should look like there are a hundred threads (or more) running in the system. So rather than passing a real thread id to the compute kernel, the framework cheats and passes the cell number, which now acts as a "virtual thread id".

Even though threads are lightweight, in a system like this it is not maximally efficient to spawn threads on an as-needed basis. I chose to pre-spawn (or pre-fork) the worker threads and to let them sleep until called for. This requires a basic thread scheduling mechanism. After finishing the parallel work, there should be another synchronization point as well. The end result is that an array gets processed in parallel, after which the program can continue in serial.

Implementation details 1: Automated data partitioning
OK, suppose we have 100 array elements that we want to process on only 2 cores. That's easy: each thread handles 50 array elements. Now suppose we have 115 array elements and 8 cores. Easy too: each thread handles a 115/8th part of the array, plus a little part that is the remainder. First, let's focus on the 115/8th part. The compute kernel routine expects a cell number (or "virtual thread id", as I like to call it), which is determined as follows:
    part = 115 / 8                # num_elem / n_threads
    for i = 0 to part - 1 do
        vtid = part * tid + i
        call compute_kernel(vtid)
    end
By the way, I implemented this in C, but it's easier to explain in pseudo-code. You may also have noticed that you can optimize this pseudo-code by moving the multiplication for the "base vtid" outside the loop, but that's not really important for what I'm trying to convey here. The idea is that every thread acts on its own part of the array, and it selects that part by using its thread id as the base index into the array. Be aware that the thread id of a POSIX thread is an arbitrary number, so the tid used here must be a sequential number that you assign at thread initialization time and that is known to the thread itself. In other words, you give the first created thread a tid of 0, the second 1, the third 2, etc., and the thread learns this number because you pass it as an argument to the thread's main function. This may seem obvious, but since POSIX thread ids are arbitrary numbers, the trick of selecting the correct region in the array would never work otherwise.
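In C, the slice loop of a single worker might look roughly like this (my own names; for brevity it calls the example kernel from the sketch above directly, where the real framework would go through the function pointer):

    /* Sketch: each worker processes its own contiguous part of the array.
       'tid' is the sequential number assigned at thread creation time,
       not the arbitrary pthread_t value. */
    void process_slice(int tid, int n_threads, int num_elem, int *arr)
    {
        int part = num_elem / n_threads;    /* e.g. 115 / 8 = 14 */
        int base = part * tid;              /* multiplication hoisted out of the loop */
        int i;

        for (i = 0; i < part; i++)
            increment_cell(base + i, arr);  /* base + i is the virtual thread id */
    }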

When you have an uneven distribution, as is the case with 115 array elements and 8 cores, there is a remainder of the array left to process. The remainder is computed with the modulo operator; 115 % 8 equals 3. Handling it works the same as in the following case: suppose we had an array of only 3 elements and 8 cores, then we'd do it like this:
    if tid < num_elem then
        call compute_kernel(tid)
    endif
So this is also how you handle the remainder part of the array in parallel, except that you'd do a small calculation to get the "vtid" right.
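Continuing the sketchy C from above, the remainder part could be handled like this:

    /* Sketch: the first (num_elem % n_threads) workers each take one
       leftover element, located after the evenly divided part. */
    void process_remainder(int tid, int n_threads, int num_elem, int *arr)
    {
        int part      = num_elem / n_threads;
        int remainder = num_elem % n_threads;           /* e.g. 115 % 8 = 3 */

        if (tid < remainder)
            increment_cell(part * n_threads + tid, arr); /* vtid in the tail of the array */
    }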

This was all really simple, but we've actually accomplished a great thing here: we have created the ability to automatically process an arbitrarily sized array using any number of threads. This code will run well on a dual core system and formidably on an 8-core system, with near-linear performance improvement. Given the nature of scaling by data partitioning, it scales to 1000 cores just as easily, and beyond, if you can find a machine that big (1).

Implementation details 2: Thread scheduling
Data partitioning was the easy part. The second part is all about thread synchronization, which is a textbook example really, but very frustrating to get right if you didn't get it right the first time. I like to pre-fork the threads and have them sleep until they are assigned work to do. Then they should be woken up, do their job, and go back to sleep. When all threads are sleeping again, that's our barrier point: the parallel part is done, and the main application can continue in serial fashion.

This behavior can probably be implemented using mutexes only, but what's really fit for this job is the condition variable. A condition variable enables a thread to sleep on a state variable, and another thread (in this case, the main thread) can then signal it to wake it up.

The way to do it is to use a state variable that says whether it's OK for the thread to run now or not. This is a boolean that I named runnable. A thread is runnable when it may run; otherwise it should go back to sleep.
thread code:
    mutex_lock(my_mutex)
    runnable = False
    while not runnable do
        condition_wait(my_condition, my_mutex)
    end
    mutex_unlock(my_mutex)

main thread:
    foreach thread do
        mutex_lock(thread's mutex)
        thread's runnable = True
        condition_signal(thread's condition)
        mutex_unlock(thread's mutex)
    end
To understand this code, you should know that condition_wait unlocks the mutex when going to sleep, and automatically (and atomically) relocks it when waking up again. According to the pthread specification, there may be spurious wakeups, so a thread must always recheck its state when it wakes up. That's why the wait is enclosed in a while loop.
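Translated into actual pthread calls, the sketch above could look something like this (struct and variable names are mine, not taken from the original code):

    #include <pthread.h>
    #include <stdbool.h>

    /* Sketch: per-worker bookkeeping, one mutex and condition variable each. */
    struct worker {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        bool            runnable;
    };

    /* Worker side: go to sleep until the main thread marks us runnable. */
    void worker_wait(struct worker *w)
    {
        pthread_mutex_lock(&w->mutex);
        w->runnable = false;
        while (!w->runnable)                        /* guards against spurious wakeups */
            pthread_cond_wait(&w->cond, &w->mutex); /* unlocks while asleep, relocks on wakeup */
        pthread_mutex_unlock(&w->mutex);
    }

    /* Main thread side: wake all workers up. */
    void wake_all(struct worker *workers, int n_threads)
    {
        int i;
        for (i = 0; i < n_threads; i++) {
            pthread_mutex_lock(&workers[i].mutex);
            workers[i].runnable = true;
            pthread_cond_signal(&workers[i].cond);
            pthread_mutex_unlock(&workers[i].mutex);
        }
    }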

The main thread isn't so exciting; it simply signals all threads to start work. A caveat here is that upon startup, not all threads may have gone to sleep yet. So there must be a barrier sync before starting as well as after ending. The barrier sync simply checks that all spawned threads are not runnable, i.e. already sleeping.
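A minimal barrier sketch, continuing the struct from above, could simply poll until every worker has marked itself not runnable (a real implementation would rather wait on a counter and its own condition variable):

    #include <sched.h>

    /* Sketch: wait until all workers have gone back to sleep. */
    void barrier_wait(struct worker *workers, int n_threads)
    {
        int i;
        for (i = 0; i < n_threads; i++) {
            for (;;) {
                pthread_mutex_lock(&workers[i].mutex);
                bool sleeping = !workers[i].runnable;
                pthread_mutex_unlock(&workers[i].mutex);
                if (sleeping)
                    break;
                sched_yield();   /* give the worker a chance to finish its part */
            }
        }
    }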

Implementation details 3: The number of threads to use
I said it's best to spawn as many threads as there are processor cores. This is true for tight code; if that doesn't drive your system close to 100% CPU utilization, you may add more threads until it does. How do you find out how many cores are in the system? This is a tough question, as the answer is different on every system. See this great info on stackoverflow.com, which shows how to get the number of cores from the system for various operating systems.
In my code, I chose the easy way and simply read an environment variable that you can set yourself, which is kind of inspired by OMP_NUM_THREADS, really.
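In C, that boils down to a getenv call, with the OS as a fallback (the variable name below is made up, the real one is merely inspired by OMP_NUM_THREADS; sysconf is known to work on Linux and MacOS X, other systems differ):

    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch: thread count from an environment variable, falling back to
       the number of online cores reported by the OS. */
    int get_num_threads(void)
    {
        const char *env = getenv("NUM_THREADS");   /* made-up name */
        if (env != NULL && atoi(env) > 0)
            return atoi(env);

        long n = sysconf(_SC_NPROCESSORS_ONLN);    /* Linux, MacOS X */
        return (n > 0) ? (int)n : 1;
    }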

Conclusion
Whew, this blog entry ended up with much more text and much less code than I anticipated, but I have to tell you that it's a joy to see a function call like
    run_parallel(process_array, arr, sizeof(arr))
run in parallel with the number of threads given in an environment variable. What's really great about this routine is its flexibility and its scalability. It really scales "for free", gaining more performance as you add more cores.
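For completeness, a sketch of what the caller's side might look like (my guess at the signatures; in this sketch the third argument is an element count, whereas the call above passes sizeof(arr), so the real routine may take a byte count instead):

    #define ARRAY_SIZE 115

    /* Assumed prototype for the framework entry point (not the actual one). */
    void run_parallel(void (*kernel)(int, int *), int *arr, int num_elem);

    int arr[ARRAY_SIZE];

    /* The user-supplied compute kernel: one invocation per element. */
    void process_array(int vtid, int *a)
    {
        a[vtid] = a[vtid] + 1;
    }

    int main(void)
    {
        run_parallel(process_array, arr, ARRAY_SIZE);
        return 0;
    }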



1. At NASA Ames Research Center, they actually have an SGI Altix supercomputer named Columbia that has a partition running a single system image of 2048 processors, which is incredibly large, even by supercomputer standards.

Sunday, August 30, 2009

The end of the loop

Multi-core processors have been on the market for several years now. In commodity hardware we've seen dual processor systems (SMPs), four-way SMP systems, dual cores, quad cores, dual quad cores, quad dual cores, quad quad cores, and even the seemingly odd triple cores (XBox 360) and AMD's 6-core processors. IBM stepped into this game by putting 8 low power SPEs and a controlling PowerPC into a single chip: the Cell processor that is the engine under the hood of the Playstation 3.

Consider a simple for-loop, running over an array to do some simple operation:
FOR i = 0 TO 100
DO
arr[i] = arr[i] + 1
DONE
The execution time of this loop is determined as follows: it's a memory fetch and store, an increment, a loop counter increment, and a conditional branch (jump back if the loop counter is less than 100), so it's about 5 operations times 100, which adds up to 500 ticks. This is a very rough estimate, because a modern CPU has some tricks up its sleeve, like combining certain operations, but the 500 ticks will do to illustrate what I'm talking about.

Now imagine you have a dual core system. On a dual core system, you could run this loop on one core, and play a CPU-consuming game on the second core. However, if you wanted to be clever rather than waste your time playing games, you could split the loop and be done with it in half the wallclock time.
core 1:                          core 2:
LOCAL i                          LOCAL i
FOR i = 0 TO 50                  FOR i = 50 TO 100
DO                               DO
arr[i] = arr[i] + 1              arr[i] = arr[i] + 1
DONE                             DONE
or you could split it like this, where the first core processes the even elements, and the second core processes the odd elements in the array:
core 1:                          core 2:
LOCAL i                          LOCAL i
FOR i = 0 TO 100 STEP 2          FOR i = 1 TO 100 STEP 2
DO                               DO
arr[i] = arr[i] + 1              arr[i] = arr[i] + 1
DONE                             DONE
This code finishes in 250 ticks.
So, by dividing the array up, we enable it to be processed in parallel. This is called data partitioning, and is often referred to as a divide and conquer method. Running it in parallel made it twice as fast.

Now imagine you have a hundred cores in your CPU. You don't, of course, it sounds insane, but what if technology had advanced so far that you did have a hundred cores in your CPU? If that were the case, there would be no need to write the loop statement at all. The loop over the array would look something like this:
LOCAL i
arr[i] = arr[i] + 1
This one statement would be processed by a hundred cores in parallel, and the outcome of the program would be identical to the outcome we obtained previously. This code would finish in just 3 ticks (fetch, increment, store), with no loop statement at all. (NB. Maybe the architecture would be such that it would be even less than 3 ticks, but it wouldn't be much more than 3.) Compare this to the 500 ticks that we started out with.
For a long time, many believed that this could only be achieved through quantum computing.

The funny thing is, this technique is possible today. A modern graphics card has a hundred cores (and even more than that). Technology has indeed advanced that much. The GPU is a massively parallel processing unit that was made for one purpose only: processing as many pixels (array elements) in parallel as possible. The graphics cards are programmable so that the graphics gurus can do cool things with shaders, which are essentially little programs that compute a value to put into each pixel/array element.
The computing power of graphics cards has been recognized before, but now the graphics card vendors have stepped into the world of high performance computing, literally delivering supercomputer power to your home. Moreover, Apple's latest release of OSX includes OpenCL, a programming API to harness the power of the NVIDIA chipset that is in the Powerbook.

The API (at the moment, either OpenCL or CUDA) allows you to create and destroy thousands of threads on the fly so you don't even have to care much how your array is going to be processed by the GPU cores. This is nice because often, you don't even know how many cores are in the machine that the code is running on.
It changes the way of programming: you no longer think in terms of loops, but only about what it is you want to do with each array element (1).

Of course, it is not all too good to be true; some problems simply cannot be parallelized in this way, e.g. due to interdependencies or relations between array elements.
Still, it's pretty exciting to see supercomputing power coming to your laptop, and I can't wait until NVIDIA comes up with its next big thing.


1. This is, of course, already possible with pthreads. The only problem is that when you create a hundred threads on a dual core system, the CPU overhead of creating the threads is much higher than that of simply looping the loop, thus slowing the program down. If you had that many cores ready to run at your disposal, you wouldn't have this problem and you could run the code in a massively parallel manner without the overhead.

Saturday, August 15, 2009

What I've been up to

Hi folks, this post is a follow-up to my last blog entry, which was about GLFW. As quickly as I fell in love with GLFW, just as quickly my crush was over; I switched back to SDL. The reason: GLFW, a "portable" library, was giving me huge problems with, you guessed right, portability. I really liked GLFW, and I really tried, but it just didn't work out. In the end, it only made me cry.

Portability is a strange phenomenon. Code that works perfectly on one platform simply doesn't work on another. This is especially true if you haven't tested it and actually seen it work with your own eyes. Unless you have actively engaged in the act of porting your software over, it is unlikely to run 100% correctly. I am saying this because I just ported my (now SDL) game over to MacOS X and uncovered a really dumb bug: using an uninitialized variable, assuming it is zero. The code worked like a charm in Linux, but not on the Mac. Why the GNU C compiler didn't see this problem is beyond me, since it does complain in most cases, but hey, you can't expect a free compiler to be bug free, either.

Anyway, I'm back with SDL and pulled a really neat trick; I wrapped it with some code to make it more GLFW-like! You see, I really liked GLFW's structure, hiding the event loop, using callback functions, etcetera. So now I can live with a framework that handles really easily, and on top of that, I may replace the inner workings of the "SDL" wrapper with something else some day (maybe even the next version of GLFW) without having to overhaul all of the code.
IMHO, this surrogate blend of a library is actually even better than the real thing.

In the meantime, I also switched editors. The editor is a very personal kind of tool for a programmer. I was fairly happy with joe in Linux, until a release came along with an annoying bug that SEGVs the editor when you give it multiple files to edit on the command line. An editor that bombs out when you want to edit files, how great is that? Ah well. At least Joe gave away his Own Editor for free.

And then I found the Code::Blocks editor (cool name, by the way). Although it's kind of awkward to use a Windows program in a UNIX environment, I'm getting used to it. You read that right; Code::Blocks may run natively in Linux, but the portable wxWidgets library makes it look and feel like a Windows program, independent of whatever platform or OS you are running it on. wxWidgets takes portability to the max. This means no X mouse clicks to copy text (argh!); you use the Ctrl-C/Ctrl-V copy/paste combination instead (oh, the horror!).
This surrogate blend is not really better than the real thing, but you do get a nice IDE that is a lot friendlier to work with than make and gdb. Yeah, I know you all like gvim ... but for some reason I never got to like vi enough to use it for anything larger than shell scripts. Code::Blocks is a great alternative, but I do miss integration of man pages and svn or git. These are available through plugins, but I haven't sorted all that out yet.
Being used to just a tabbed terminal window, I find that Code::Blocks suffers a bit from too much screen clutter: toolbars, panels, etcetera. It would be nice if you could press a magic key (the Windows key, perhaps ...) to make the clutter disappear and actually have a normal editing window on your desktop. Elaborating on that idea, it would be great if the "full screen" view would actually make the editor full screen, rather than making the application full screen and keeping all toolbars and panels in view.
I should end by saying that Code::Blocks is a great developer's tool, even while it insists on linking standard C code using the g++ command. (Somewhere out there, there must be a brilliant mind who can give a perfectly logical explanation for this. Naturally, I did not have g++ installed, since plain old C is my thing, but ... sigh. As I said, I should end by saying ... Code::Blocks is a great developer's tool.)

On the Mac, there is of course Xcode. Xcode has the same kind of interface with panels all over and a relatively small editing window, but that doesn't matter; you can't get anything done in Xcode without having taken Apple's developer course anyway. I don't know why, but while Xcode looks fantastic, it is immensely complex to build software with. When I write software on the Mac, I use make on the command line. And that is one sad fact.

I'd like to finish this post on a positive note. The good news is, my game is nearing completion. I had something playable already after three weekends, but you know how it goes, you keep adding and adding new stuff as you keep getting more and more ideas. Creativity spawns more creative ideas. Some programmers call it "being in the zone". I've been there. It's addictive. I want more. But I also really want this project to be done, mostly so I can find peace and round off two other, smaller projects that I've been working on. When you are working on three projects at the same time, none of them gets finished any time soon.
So when it's done, I can catch my breath ... and start off writing code for a really cool idea that I've been having lately ...