Sunday, December 29, 2013

The Yearly Countdown

End of year, December 31st. Every year, millions of families worldwide gather round the TV set just before midnight to watch the clock tick, tick, tick away the final seconds of the year. The clock strikes twelve and we yell “Happy New Year!” while outside fireworks burst, lighting up the sky.

It's about time
For some years now I have been running a desktop widget that counts down the year. I made it with Dashcode when I was just messing around on a boring January day. After nearly a year, I found an interesting bug: once the new year arrived, it started counting up, just like a regular clock does.
Anyway, it's fun saying things like “99 days left!” out of the blue and seeing people react with “what was that all about?”.

The year 2012 had a leap second, so there was another bug: the countdown was off by a second. This would have been a non-issue had I implemented it properly ... Just saying that dealing with time can be tricky.

Two minutes to midnight
Counting down the days until year+1 Jan 1 00:00:00 is fun and not hard to do. Be mindful though that the naive implementation of doing sleep(1) in a loop is plain wrong. You will find your family and friends yelling out Happy New Year a split second before the counter reaches zero. Why is that? The thing is, you forgot about milliseconds, microseconds, and nanoseconds. This is illustrated by the following timeline:
23:59:56.852 countdown program start
             sleep 1
23:59:57.852 sleep 1
23:59:58.852 sleep 1
23:59:59.852 sleep 1
00:00:00.000 Happy New Year! on tv
00:00:00.852 countdown program reacts too late
Here, the program starts at an offset of 852 milliseconds causing it to react too late; it practically misses the strike of midnight. 852 milliseconds feels like almost a full second, so the countdown program just isn't good enough.
A better way is to use usleep() or nanosleep() to sleep a fraction of a second first, so that each subsequent tick lands on a (near) exact second boundary.

Sleep with one eye open
Although you can do nanosecond sleeps, the practical accuracy of your software clock is about 15 milliseconds. This is because of timeslicing in multitasking operating systems. Really? Well, yes and no. Modern computers have a High Precision Event Timer (HPET) that can be used to do more accurate timing of events. This is entirely meant for doing things like performance measurements and not so much for wallclock time keeping.

No time to waste
A countdown timer is pretty neat and not so hard to write. It gets more fun when you add a big LED display in OpenGL graphics. I suppose it's too late now to get it approved in the App store and get rich before 2014 starts, but we can always try again next year.

Best wishes! And if you want to read more about high-precision timing, see this excellent blog post by another code monkey:
http://tdistler.com/2010/06/27/high-performance-timing-on-linux-windows

Wednesday, November 27, 2013

C++ shared_ptr (or really C++ standard mess)

The C++ shared_ptr is a pointer to an allocated instance that ensures the instance is deleted after the last shared_ptr to the object goes out of scope (or is explicitly deleted). It's a great mechanism for doing some behind-the-scenes automated memory management: you can point multiple shared_ptrs at the same object and have little difficulty managing the memory of the object being pointed at.

The shared_ptr first appeared in the boost library. Once upon a time we would write:
    #include <boost/shared_ptr.hpp>

    boost::shared_ptr<T> ptr;

Later, a committee found shared_ptr so cool, they said “we want that too” and incorporated it into the C++ library. The shared_ptr became part of the TR1 extension (first Technical Report), and we would write:
    #include <tr1/memory>

    std::tr1::shared_ptr<T> ptr;

For about a decade, work on the next C++ standard continued under the name C++0x. At some point, the gcc team at GNU recognized that TR1 would soon become part of the next C++ standard. So they already incorporated TR1 into the std namespace, but you would have to invoke the compiler with a special flag because it wasn't quite standard yet:
    #include <memory>

    std::shared_ptr<T> ptr;

    g++ -std=c++0x

Not that long ago, the new standard arrived as ‘C++11’. The compilers and library were updated. There was more to C++11, and it was a new standard after all, so GNU added a new flag:
    #include <memory>

    std::shared_ptr<T> ptr;

    g++ -std=c++11

By now, with C++11 in place, things should have been good. In practice, however, we're out of luck. Every platform ships its own ‘current’ version of the compiler, and it works differently every time. Older compilers choke on C++11, and newer compilers that prefer C++11 choke on the TR1 stuff. In order to write C++ code that actually compiles across multiple platforms, you have to:
  • use autoconf (yuck)
  • use ugly ifdefs to get the correct includes
  • use typedefs, or hack the std namespace, or import entire namespaces
  • use the correct compiler flags, and mind which flag to use for what version of g++
It's either this or telling people “Your compiler is too old!”.
We can only hope that time passes quickly and that TR1 will soon be out of the picture, only a vague memory of an obscurity in the distant past. Until then, ...

Sunday, August 18, 2013

The quad-tree

Six years ago (which seems like an eternity already) I made a 2D arcade shooter in which you walk around the screen, zapping monsters. Each level is just one screen, and once you clear it, it's on to the next one. For collision detection, it checked every monster against the player, and every bullet against every monster. The game ran fine – especially on today's computers with near infinite power – but in terms of efficiency, it's quite bad to do that many collision tests. To make things better, I followed the good advice of the game programming gurus and added in a quad-tree.

Objects can only collide when they are near each other. So if an object is far away, you don't even want to test whether it might collide. Furthermore, if a group of objects is far away, you don't want to test for collisions with any member of that group. A quad-tree enables skipping large amounts of objects, thus bringing down the number of collision tests to perform.

The quad-tree is a space partitioning “device”; it divides 2D space up into compartments, automatically grouping objects together. It helps you select only the relevant group of objects for a possible collision, keeping the number of collision tests down to a minimum.

This post is not a tutorial on quad-trees; see below for a link to a good tutorial at gamedev.net. I did have some additional thoughts on quad-trees, though.

Considerations
People often resort to using a uniform grid, also known as binning. With this technique, imagine putting a regular grid over the screen and adding objects to the bin (grid cell) that they are in. It's easy, but it's a poor solution with all kinds of issues. For example, when an object sits on a cell boundary, it can be in multiple cells at the same time. If an object is so large that it spans many cells, the problem gets even bigger. If the player is at the edge of a cell, you'll have to check the adjacent cell as well. If the player is at the edge of the world, take care not to go out of bounds. All in all this technique just isn't very good, and it wastes memory when the world is large but sparsely populated. The quad-tree, however, is elegant and efficient, and works in all cases.

Some people seem to think that a quad-tree can only deal with fixed-size problems, i.e. that the root node should be able to hold all objects in the world. This is simply not true (!) If your quad-tree implementation uses arrays, you are doing it wrong. It is much more efficient to chain-link objects together using pointers. Like so:
    q.objlist => obj1.neighbor => obj2.neighbor => NULL
This scales dynamically, without cost, without growing arrays or allocating any extra memory at all.

Other uses for quad-trees
The quad-tree is ideal for static geometry like wall segments. Since the walls don't change, you can generate the quad-tree only once, when loading the level, and keep it around for testing against while the game is running. For moving walls and destructible environments it's probably easiest to use a separate, second quad-tree. (* If you are a clever graphics wizard, you may use BSP trees for this purpose. Quad-trees are much easier to use, though.)

Another ideal use of quad-trees is for view frustum culling in 3D engines. If a node falls outside the view, then all of its leaves will not be visible either, culling a large number of vertices at once. This is typically used in terrain rendering, but also works for other kinds of scenes.
Taking it one step further, you can use the same technique for doing occlusion culling. From a nearby object you can create an occlusion frustum. The nodes that are in the occlusion frustum area are not visible and can be culled away. If it's a large building or a mountain that is in the view, it will pay off to do this little extra work. Occlusion culling is an advanced topic, and I'm not convinced that modern games actually do this.

It's trivially easy to extend the quad-tree code to an octree for use with 3D data.

Quad-trees in science
Cosmologists like doing simulations of colliding galaxies. This involves simulating gravitational forces between large numbers of stars (or bodies). Because each body interacts with every other body, this is called an n-body problem. One way of implementing such a simulation is the Barnes-Hut algorithm, which uses a quad-tree to group bodies together. Barnes-Hut may treat a group of bodies as a single body by using its center of mass, thus computing a single gravitational force for the whole group. It's easy to see how the use of a quad-tree greatly speeds up such a simulation.

Links

Monday, July 15, 2013

Programming mkdir -p

It's midsummer so let's do something light and easy for a change. In UNIX there is the mkdir -p command, meaning “create directory and all leading (parent) directories”. It is a convenient command with which you can quickly create a new directory tree. In programming, there is the mkdir() system call. This system call creates new filesystem directories, but it does not, however, automatically create any new leading parent directories for you. An attempt to do so is an error: ENOENT, or “No such file or directory”. How can we programmatically make deep directories?

A naive answer is to call system("mkdir -p my/deep/path"). While not incorrect, it is comparatively expensive (it spawns a shell in a child process), but even if we don't care about that, it poses the interesting question: “Well, how does the mkdir shell command do it?”

A possible approach is to break the path argument up into pieces, and for each piece create the new subdirectory. Something along these lines:
break up "my/deep/path" into "my", "deep", "path"
mkdir "my"
mkdir "my/deep"
mkdir "my/deep/path"
For example, in Python:
import os

def mkdir_p(path):
    arr = path.split(os.sep)
    newpath = ''
    for elem in arr:
        if not newpath:
            newpath = elem
            if not elem:
                newpath = '/'
        else:
            newpath = os.path.join(newpath, elem)

        if os.path.exists(newpath):
            continue

        os.mkdir(newpath)
It has some checks for correctly handling relative and absolute paths, and it gets the job done. It performs poorly, however, for long paths that already exist for the most part. For example, mkdir_p("/home/walter/src/python/blog/2013/mkdir_p") would do seven loop iterations before creating a single new directory. Surely we can do better.

If we could work backwards, the performance could be improved. First see if the deep path exists, take a step back, see if that path exists, and if it does, rewind and create the deeper path. This is achieved by recursion.
def mkdir_p(path):
    if not path:
        return

    if os.path.exists(path):
        return
   
    mkdir_p(os.path.dirname(path))
    os.mkdir(path)
The recursive version is wonderfully terse and, as is often the case with recursive functions, it may warp your mind and take some effort to grasp.

I confess I used the non-recursive implementation for years. Ironically, Python already has a function to do this:
os.makedirs(path)
I had a peek in the source of Python's os module, and os.makedirs() indeed uses the recursive implementation. Mind that like os.mkdir(), os.makedirs() will throw an exception if the path already exists — unlike the mkdir -p shell command that doesn't complain if the directory is already there.

Monday, June 24, 2013

Object(ive)-oriented programming in plain C

Last time I wrote about a safe string type implementation in a C library, and mentioned liboco. It's a library that provides some fundamental building blocks like a safe string type, a safe array type, and a map (or dictionary). In liboco, everything is a reference counted object, and arrays and maps store objects. liboco takes concepts from Objective-C, and its basic function compares to Foundation Kit. Like in Objective-C, reference counting is done manually using retain and release. It even has a memory pool with which you can deallocate objects in a deferred fashion using autorelease.

How does it work? Plain C is not object oriented: it has no classes, no constructors, no inheritance, etcetera. But we can implement object orientation in C. In fact, the first C++ compiler was a translator that produced plain C source code (with mangled names) as output.

As you might suspect, a safe string type would be implemented as a struct with member fields for a char* pointer to the data and an integer length. Now take a step back, and say that everything should be an object. This means a string would be derived from an object, and inherit the base object's properties. In plain C, we do this by including the struct of the base object in the struct of the derived object.
typedef struct {
    object_t obj;

    char *str;
    int length;
} string_t;
What does the base object type look like? As said, every object is reference counted, so it should include a reference count.
typedef struct {
    int refcount;
} object_t;
It doesn't seem much, but we've already laid down a basis. We can create custom types and derive them from object. We could even create a new type and derive it from the string_t type, if we wanted to.
There is no automatic constructor, but like in Objective-C when we allocate an object of a certain type, we should initialize it through an init function. This is its constructor.
string_t *s = init_string(calloc(1, sizeof(string_t)));
calloc() zeroes out the memory for us. Alloc and init are two distinct operations. If you combine them into a single function, you'll run into trouble later (when we look at inheritance) so keep them separate.
Since I don't like the syntax of what we now have, alloc and init are wrapped as:
string_t *s = new_string();
Note that every object instance is addressed by a pointer to the object. This isn't strange when you realize that in plain C, strings are char pointers and arrays are referenced by a pointer to the array.
The reference count of a newly allocated object is 1. In a reference counting system, you don't delete instances like you do in C++. Instead, you let go of them by calling release(). Releasing an object instead of bluntly deleting it will start to make sense once you put objects into containers.
release(s);
The release() function lowers the reference count, and when the reference count drops to zero it will destruct the object and free the memory. To keep an object around, increase its reference count by calling retain(). When you put an object into a container, like an array or a map, the container automatically retains the object, which is another way of saying that it holds a strong reference to the object. The container keeps the object around — until the container itself is released.

Functions like retain(), release() work on any type derived from object. But C is a strictly typed language ... so you'd need to typecast down to object_t every time to satisfy the compiler, or else you'd get a ton of warnings. This issue is solved by declaring the functions with void* arguments:
void retain(void *);
void release(void *);
Now, if you don't supply a valid object, the compiler will not complain and these functions will happily trash around. So to make it safe we need some kind of runtime type checking. Wouldn't that be incredibly slow? No, it's just a single if-statement. We actually already need to know the type of the object for another reason: when the reference count drops to zero, how would release() know what destructor function to call? It knows because every object holds run-time type information (RTTI).
typedef struct {
    objtype_t *objtype;
    int refcount;
} object_t;
Mind that in C++, the compiler holds type information at compile time. In C++, RTTI may or may not be available during run-time depending on whether the class is polymorphic and whether RTTI is enabled during compilation. What's fun is that RTTI also allows us to easily implement type introspection.

What information is in the object type structure? It holds a pointer to the destructor function. release() will call this function before freeing the allocated instance.

Other than just the destructor, we can also include a constructor. Mind that in Objective-C, you have to manually call the superclass's init function to correctly initialize an instance of a derived class. This is also the case in Python. C++, by contrast, calls the base class constructors automatically.
At first I actually made it work this way, using the default constructor only. There is a compelling case, however, for keeping manual control and not including a constructor in the object's type definition.
It's because of parameters: constructors often take parameters, so it makes sense to write them by hand. A constructor can have any number of parameters, and there is no single declaration that fits all types. Thus, automatic calling of base class constructors goes out the window.
As a slight added advantage, it becomes clearer how to do multiple inheritance.

As said, the whole thing is a reference counting system. So when we copy an array, the contents of the copy point at the same elements as the original, and each element has had its reference count increased by one. In order to make a deep copy, we need a copy constructor. The copy constructor is a common idiom in C++, but it's not as well-known in Objective-C. Objective-C has the NSCopying protocol, and you implement a function -copyWithZone: that returns a newly allocated instance; the copied object.
liboco adds the copy constructor as a special kind of constructor (analogous to C++), and likewise it automatically calls any base class copy constructors.

C++11 adds a move constructor, with which you can move values from one instance to another under the hood. This is typically needed for doing return by value efficiently.
We have no need for this feature. The main reason is that objects are pointers, and pointers already efficiently return by value. Another reason is that we do not automatically destroy instances when they go out of scope, like C++ does. It's a manual reference counting system, so the programmer decides when an object is released or not.

Now that we have our own typing system, this opens the way to a feature that is present in neither Objective-C nor C++ (but is present in Python and Go): printing an object. For printing any object, we need a method that converts an object to a string.
Additionally, we need a special printing function that calls the string conversion method when requested.
print("object type: %T\n", o);
print("object value: %v\n", o);
It's quite powerful; this allows us to print the value of any type of object with the format string specifier ‘%v’.

It's also possible to add operator functions. Although plain C does not allow redefinition of operators, they can be simulated using function calls. This would be pretty annoying for hand-written code, but could be useful for machine-generated code.

I haven't touched upon method overriding. It gets close, but it doesn't become real OOP until methods can be overridden, which is what enables polymorphism. Overridable methods can be implemented by including a table of methods (function pointers) in the object's type definition. Calling a method would resemble this chain:
self->objtype->methods.method(self, parameters);
This is how virtual function calls work in C++. A derived type has a copy of the table of its base class. In Objective-C, it works a little differently. There, invoking a method means “sending a message”. What happens is, the table is searched for the method, and if the method is not found, the table of the superclass is searched. This way, all methods of the superclass are inherited by the subclass.
Both ways can be simulated in plain C, but there is the tiresome work of writing out the virtual tables. Moreover, the first argument to the method would always have to be void* to prevent compiler warnings.

Finally, a word on using liboco. Programming in C with liboco is a lot like programming in Objective-C, and like in Objective-C, you practically leave plain C behind. You are now working with this new API, and the code is full of calls into this library, which is your foundation. Adding a new type for some kind of data structure means implementing it as an object, and writing the destructor, copy constructor, and string representation methods. This can be a drag, but it's part of what liboco is. Having an array and map type at your disposal in which any object can be placed, is a great win. You can quickly prototype something in Python and convert it to C rather easily. Wouldn't it be great if we could do automated machine translation from a scripting language to C and then compile it to native code? It's maybe something to work on at a later time.

Monday, May 13, 2013

More on unsafe C strings: A followup

Recently John Carmack, lead programmer of DOOM, wrote in a tweet: “Considering that 8 bit BASICs of the 70s had range checked and garbage collected strings, it is amazing how much damage C has done”. This is in line with my previous posts, and it's fairly remarkable that C stuck around so long without having this fixed. There have been many updates to the language, but this issue was simply not addressed — at least, not until C11. C11 includes the safe string functions strcpy_s() and strcat_s() (in the optional Annex K). It feels a bit too little, too late but, at last, something has been done about it.

Are our troubles over now? Not really; C11 is still too new. My system does not have the new safe string functions, so I can't use them. Adoption of the new standard takes time. Documentation (and school books) have to be updated. Next, the old strcpy() and strcat() have to be deprecated and, finally, eradicated from our source code.
The bounds checking problem continues to exist for arrays.

Other than C11 I want to expand on two other possibilities for fixing the issue. The first solution is using an external library, the second is choosing a different programming language.

1. Using an external library
Rather than just supplying a safe string copying function, I mean using a library that supplies a safe string type and associated functions. As is often the case, this library would be written in C itself. One problem is that you have to rigorously adapt to using the library's string functions, and resist the temptation of using the old char* pointers for quick and dirty strings anyway.
Another problem is that system functions such as open(), write(), etcetera expect char* pointers, so you keep converting between the safe string type and the char* pointer.
Typical library solutions allocate strings on the heap, giving an extra performance hit. Using a library means marrying that library; your whole program source code will look different and be tightly coupled to it. Switching libraries will have a big impact and involve lots of work.

My github project liboco is a C library that implements a safe string type, as well as an array and a map type, and a few other things. It follows the Objective-C design principle that everything is a reference counted object. Like in Objective-C, this means manual reference counting. That isn't bad, but it can be problematic if you are not used to it: hard-to-track-down memory leaks may happen. Other than that, liboco is pretty neat.

Similarly, I have written oolib for C++, which is much easier to use because it does automatic reference counting under the hood. This brings us to the next point.

2. Choosing a different programming language
It is fair to say that the problem of unsafe strings and having no bounds checks is intrinsic to C. It is a low-level language, after all. So if you want that extra bit of safety, you have to let go of C and choose a different language.

C++ fixes strings with std::string. Its interface annoys me, but anyway, C++ is a complicated language and it's really hard to learn. Even after thoroughly studying C++ I'm not sure I would recommend the language to anyone. Sure, classes are useful, but you almost have to be an academic to master C++.

Other languages more popular than C on github are: PHP, Python, Java, JavaScript — all byte code interpreted languages. JavaScript is a total mess, but anyway, these languages by design lack the punch that C can deliver in terms of performance. The process of byte code interpreting is essentially emulation, and emulation is slow. I bet that even the laziest, dumbest compiler would produce faster-running code than these byte code interpreters.
The argument of performance appears debatable, “do you really need that kind of performance?” The answer, in short, is: Yes. I have personally seen scripts run for hours, where the equivalent written in C ripped through it in half a minute. A speedup by a factor of 100 is no exception. Compiled languages offer vastly superior performance.

There is Objective-C, but it is different enough to say it changes everything. Choosing Objective-C in practice means marrying OpenStep's Foundation Kit. It is a good choice if you are Apple-only and intend to keep it that way. Objective-C has manual reference counting, which is alright but kind of tough to get right. Apple's compiler, however, sports automatic reference counting (ARC), which does some clever source code analysis.

Whatever happened to Pascal? Although a rather verbose language (consider PROCEDURE / BEGIN / END versus void { }), Pascal wasn't so bad — except for the fact that it has the worst string type of all (!). Although Pascal strings are safe, they have an absolute maximum length of 255 bytes. In modern times, with unicode encoding, this means that for some languages only 63 characters can be stored in a single Pascal string, making it totally useless by today's standards.

Go deserves honorable mention. It is an easy to learn language with cool core features, and it compiles to native code. Even though I'm not fond of its mandatory CamelCasing, golang is really great. Go was designed as a systems programming language, and it shows. There are some Plan9 peculiarities, like the command-line flags parser not being GNU getopt()-like, and the tar archiving module not grokking GNU tar extensions. Other than that, Go has great potential.

Truthfully, I don't want a language that strays too far from C. Interestingly, although many modern languages are depicted “C like”, none of them are plain C with an added safe string type. I suppose it is hard to change C. For one thing, a ‘string’ is a fairly high-level concept that doesn't fit nicely in a low-level language.

Sunday, April 7, 2013

strcpy(): The safety of an unsafe string copy function

Last time I wrote about the strcpy() function and said that it's unsafe. But why exactly is it unsafe? Let us see the details of what is going on under the hood when strcpy() is called. To do so, we will dive down to the machine level and have a look at what is happening in the stack memory.

In the C language, strings have no explicit length. Instead, the length is determined by a terminating NUL character. Therefore, strcpy() copies bytes until it sees a zero byte:
 void strcpy(char *dst, char *src) {
     /* copy src to dst, including the terminating NUL byte */
     while ((*dst++ = *src++) != 0)
         ;
 }
Consider the following function, which includes the common programming mistake of assuming that the input will fit into the buffer:
 void func(char *input) {
    char buf[256];

    strcpy(buf, input);
 }
Now let's examine what happens at the machine level when this baby executes. When a subroutine is called, the CPU pushes the address of the next instruction onto the stack. This address is the return address. To return from a subroutine, the return address is popped off the stack and loaded as the current instruction pointer. Thus the program jumps back and resumes execution at the point right after the subroutine was called. Using a stack allows for nested subroutine calls.
Local variables and function parameters are placed on the stack as well. This is great because now when a subroutine ends, the local variables go out of scope as the stack frame is ‘cleaned up’ (in fact, the data is still there, but the stack pointer is moved).
This amounts to the following picture of a stack frame when we are executing the func:
 stack pointer ->
 +------------------------------+
 | local var:  char buf[256]    |
 |------------------------------|
 |        return address        |
 |------------------------------|
 | parameter:  char *input      |
 |------------------------------|
 | local vars       of caller   |
 |------------------------------|
 | return address   of caller   |
 |------------------------------|
 | parameters       of caller   |
 |------------------------------|
 | ...                          |
 end of stack
You can see that an attacker can overwrite the return address when he supplies an input string that is longer than buf. Not only can he overwrite the return address, he can insert a specially crafted mini-program into buf. What many exploits do is put the return address as the address of buf so that the program will jump back to execute the payload in buf.

To prove that this actually works, we can do a little experiment. Write a small program that fills a buffer with garbage and overwrites the return address as described above. Guess what, it doesn't work! The program gets killed by a run-time check, cleverly inserted by the compiler. Here is a postmortem stack trace:
(gdb) bt
#0  0x00007fff9a45cd46 in __kill ()
#1  0x00007fff9ad98ec0 in __abort ()
#2  0x00007fff9ad5a77d in __chk_fail ()
#3  0x00007fff9ad5aa4f in __strcpy_chk ()
#4  0x0000000100000de4 in func (input=0x7fff5fbff7e0 'x' , "\030??_") at hijack.c:19
#5  0x0000000100000e5e in main () at hijack.c:29
This is in OS X using clang. Googling turns up __memcpy_chk for gcc:

“GCC implements a limited buffer overflow protection mechanism that can prevent some buffer overflow attacks.”

void *
__memcpy_chk (void *__restrict__ dest,
              const void *__restrict__ src,
              size_t len, size_t slen)
{
    if (len > slen)
        __chk_fail ();
    return memcpy (dest, src, len);
}
As you can see, the compiler inserts a run-time check for the size of the buffer. On top of this, a second run-time check is made that checks the integrity of the stack. This technique is known as inserting a ‘stack canary’ and can be observed by studying a disassembly of our func below. Here you can see that 288 bytes (0x120 in hexadecimal) of stack space are reserved for local variables. This is more than the 256 we actually requested; the remaining 32 bytes hold the stack canary and a few other saved values.
Then it calls __strcpy_chk() rather than strcpy(). Finally, the stack canary is checked, which may result in __stack_chk_fail() being called. Otherwise, the stack frame is cleaned up and the function returns normally.
0x0000000100000d90  push   %rbp
0x0000000100000d91  mov    %rsp,%rbp
0x0000000100000d94  sub    $0x120,%rsp
0x0000000100000d9b  mov    0x26e(%rip),%rax
0x0000000100000da2  mov    (%rax),%rax
0x0000000100000da5  mov    %rax,-0x8(%rbp)
0x0000000100000da9  mov    $0x100,%rdx
0x0000000100000db3  lea    -0x110(%rbp),%rax
0x0000000100000dba  mov    %rdi,-0x10(%rbp)
0x0000000100000dbe  mov    -0x10(%rbp),%rsi
0x0000000100000dc2  mov    %rax,%rdi
0x0000000100000dc5  callq  0x100000e8c <__strcpy_chk>
0x0000000100000dca  mov    0x23f(%rip),%rdx
0x0000000100000dd1  mov    (%rdx),%rdx
0x0000000100000dd4  mov    -0x8(%rbp),%rsi
0x0000000100000dd8  cmp    %rsi,%rdx
0x0000000100000ddb  mov    %rax,-0x118(%rbp)
0x0000000100000de2  jne    0x100000df1 <func+97>
0x0000000100000de8  add    $0x120,%rsp
0x0000000100000def  pop    %rbp
0x0000000100000df0  retq  
0x0000000100000df1  callq  0x100000e86 <__stack_chk_fail>
So, the compiler does a lot of work for us to prevent simple buffer overflows. The canary is initialized with a random value before main() runs, so in practice it cannot be defeated by guessing. But beware: an attacker may still influence the program's behavior in other ways and deliberately leave the canary untouched.

Other techniques that help prevent buffer overflow attacks are ASLR (address space layout randomization) and DEP (data execution prevention) or NX pages (non-executable memory pages). Although they make it more difficult, these too can be circumvented by trickery.

Mind you, all of this is plain impossible if you (or the high-level language itself!) always properly check buffer sizes and array bounds. That is something C normally does not do for you, so be mindful: the compiler will not always be able to save you from writing insecure code.

Monday, April 1, 2013

strlcpy(): The unsafety of a safe string copy function

Every C programmer should know that the strcpy() function is insecure; if the destination buffer is too small to hold the string, strcpy() will happily overflow the buffer and copy over whatever was there in memory. Other than corrupting variables, it may overwrite return addresses in stack memory, which spells doom for system security because it allows an attacker to inject code into the running program, commonly known as ‘exploiting a vulnerability.’

Sure enough, strcpy() is a very dangerous function. OpenBSD thought up the safe string functions strlcpy() and strlcat() to counter the problem. With these, you must always supply the size of the destination buffer. Strings that are too long will not overflow the buffer; any attempt at a buffer overflow is simply stopped. strlcpy() terminates the copied string at the end of the buffer, thereby plugging the hole.

A remarkably simple but effective solution. The strlcpy() and strlcat() functions are widely used in OpenBSD and were adopted by other operating systems as well, like FreeBSD and Mac OS X. But I wouldn't have known about these functions in the first place if I hadn't run into code that would not compile on Linux, where strlcpy() is missing from the GNU C library, and maybe rightly so.

However superficially brilliant strlcpy() may seem, it is all too easy. The function may truncate the copy of the string, so ... it copies the string, except when it doesn't; then it only partially copies it. Many people consider this incorrect: it isn't logical for a copy function to truncate the copy. You can get really weird results from this; for example, when a UTF-8 string is truncated at the wrong position, the result is a corrupted string. And what if the truncated string is a file path or a URL?
Of course, you should check the return value of the function. That's standard programming practice, and it applies to the traditional string handling functions just as well. But in a strange way, these safe string functions are actually unsafe: a false sense of security creeps in. Not copying the string at all would actually have been better than truncating it.

Nevertheless, strlcpy() remains in use in various codebases (like OpenSSH and rsync). Wouldn't it be nice if these functions were available just for the sake of portability? In the Linux world, that argument just isn't good enough.

So, what are your options? For portability, you might want to use autoconf and #ifdef HAVE_STRLCPY, but note that it doesn't really help you: it's sugar coating that looks advanced, but it doesn't make anything better. My advice is to steer clear of strlcpy(); just don't use it. Stick with the traditional string functions and keep checking those buffer lengths.
You might use an external string library that supports growable strings. Personally, I'm good with my_strcpy() function which is basically strlcpy() with a twist: call abort() when the destination buffer is too small. It's not user-friendly, but it gets you out of a bad situation quickly.
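A sketch of such a my_strcpy(), assuming the same (dst, src, size) argument order as strlcpy(); the function name and exact behavior are my own:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Like strlcpy(), but refuses to truncate: if src does not fit
   in dstsize bytes (including the terminating NUL byte), bail
   out hard rather than hand back a silently mangled string. */
char *my_strcpy(char *dst, const char *src, size_t dstsize) {
    size_t len = strlen(src);

    if (len + 1 > dstsize) {
        fprintf(stderr, "my_strcpy: buffer too small\n");
        abort();
    }
    memcpy(dst, src, len + 1);
    return dst;
}
```

The abort() is deliberate: a crash at the exact point of failure beats a truncated path or URL quietly wandering through the rest of the program.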
Other than that, accept that C is maybe not the best choice for implementing userland code written by mere mortals. Try a different language, like golang. It has very robust string handling.

Next time we'll have a look under the hood and examine how buffer overflows work.

Monday, March 18, 2013

OpenGL Pac-Man

Every once in a while I like to re-create an arcade classic game just for fun. Many arcade classics are simple enough to be built by a single person with a reasonable amount of spare time. The game of choice is Pac-Man.

When you think about it, Pac-Man is rather bizarre. You are a pill eating yellow mouth that is being chased through a maze by four ghosts. Their mission: KILL THE PAC-MAN! What makes the game fun is that you may temporarily turn the tables by swallowing down a power-pill, after which you can EAT the ghosts. The hunted becomes the hunter. Combined with cutesy graphics it becomes child's play. This is not an easy game though, Pac-Man is notoriously difficult. Personally, I still can't make it past level five.

Programming a Pac-Man clone isn't particularly hard, but it does take effort to perfect it. There are lots of little tweaks and details that make the original so good. For example, the corners in the maze are rounded. This allows Pac-Man to cut corners and stay ahead of the ghosts that are in close pursuit.

Graphics
Rather than using old-fashioned clunky bitmaps, I went for scalable vector graphics so the game neatly scales to any screen resolution. Clearly, Pac-Man is depicted as a yellow circle with a pizza slice missing. But when you draw exactly that, you'll notice it doesn't look quite right. The thing is, Pac-Man's mouth is larger than a clean-cut pizza slice. The center of the circle is slightly off, allowing for a bigger mouth. For OpenGL this is not a problem; the Pac-Man character can be drawn as a triangle fan with circle coordinates, but with the first vertex not at the center of the circle.


The ghosts are drawn with a rectangle and a number of circles and ellipses. Drawing ellipses is not easy, but since these are all perfectly symmetrical ellipses, they can be drawn as stretched circles: scale in the Y direction and draw a circle; the result is a symmetrical ellipse.

A simple trick for drawing circles in OpenGL: calculate the coordinates for a circle with a radius of 1.0. This we call the unit circle. When drawing a circle with radius r, scale X and Y by r and draw the unit circle.

Movement
In the classic Pac-Man game the ghosts move from tile to tile, and so they do in my implementation. But since we're using OpenGL anyway, it wouldn't be a bad choice to use OpenGL's world coordinates instead and move from corner to corner. This would theoretically allow for ultra-high speeds. Come to think of it, maybe it would be nice if the ghosts accelerated and hit the brakes before taking a corner. This would subtly change the game, however.

In terms of AI, I always thought that the ghosts moved in predetermined patterns, which is not true. Nor do they turn at random. Each ghost has its own behavioural rules. You can read all about it at the Pac-Man Dossier. This page describes the game in enough detail to allow you to create your very own Pac-Man clone.

Sunday, February 24, 2013

Estimating tar size

There's a rather old but interesting question on StackOverflow: Is it possible to estimate how large the tar archive will become before you start packing? Some say yes, some say no. Well, of course it's possible. The exact figure would be given by:

    tar cf - directory/ | wc -c

This can be a rather slow command though, as it already runs tar and reads all data from disk. But can we estimate the archive size, and do so in a more efficient way?
Yes, it's possible, but you have to mimic the tar program really well when adding up the sizes of the individual files. Unexpectedly, this is kind of hard. Tar is a much more complicated utility than you might think.

In essence, tar concatenates files after one another. The resulting tar file is called an archive, and every file in it is a member. It saves the original file name, size, and other relevant metadata such as ownership in a header directly in front of every member. So the size of the archive will be bigger than the sum of the sizes of the members, but by how much exactly?

Tar works with 512 byte blocks. Every header occupies one 512 byte block, padded with null bytes, and all member data is stored in 512 byte blocks, with the final data block padded with null bytes as necessary. Members like directories and symbolic links have no data blocks; they are stored in the archive by their headers alone. Classic tar ends the archive by appending two blocks filled with null bytes.

With this information you should already be able to get a pretty good estimate of how large the resulting tar archive is going to be.
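For classic tar, that bookkeeping comes down to something like the following sketch (classic format only; PaX and GNU extended headers add more blocks):

```c
#include <stdint.h>

#define TAR_BLOCK 512

/* Round a byte count up to a whole number of 512-byte blocks. */
static uint64_t blocks(uint64_t nbytes) {
    return (nbytes + TAR_BLOCK - 1) / TAR_BLOCK;
}

/* Estimate a classic tar archive's size: one header block per
   member, the member data padded out to whole blocks (directories
   and symlinks carry no data, so their size is 0), plus two null
   blocks as the end-of-archive marker. */
uint64_t estimate_tar_size(const uint64_t sizes[], int nmembers) {
    uint64_t total = 0;

    for (int i = 0; i < nmembers; i++)
        total += 1 + blocks(sizes[i]);  /* header + data blocks */
    total += 2;                         /* two trailing null blocks */
    return total * TAR_BLOCK;
}
```

For example, a single 100-byte file costs one header block plus one (padded) data block plus the two trailing blocks: 4 blocks, or 2048 bytes.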

But it isn't quite right. Why is that? The classical view of the format still works, but modern tar archives aren't quite like that anymore.

One of the problems with classic tar was storing long filenames. Tar does not only store the file's name, but also its leading path. As the directory structure gets deeper, the stored filename becomes longer. There were only 100 bytes of room in the tar header for the filename, so storing a file with a longer path was simply not possible with classic tar.
Later, UNIX gained ACLs and high precision file access times, and tar should be able to store and restore those as well.

In order to resolve these issues, extensions to the format were crafted, in which a regular tar header may be followed by an extended header block. The PaX format became a POSIX standard. Archives that use the PaX format may optionally have a global header at the beginning of the archive accounting for two extra blocks (regular header plus an extended header). A PaX extended header is in the form of "key=value" and therefore can contain any kind of information. Tar implementations that encounter unknown keys may ignore them or give warnings.

Meanwhile, GNU crafted their own extensions that work in a similar way. For example, GNU tar has an extension for supporting sparse files with multiple holes.

To estimate the tar archive size, you have to guess (or know) what extended headers are going to be added. By taking filename lengths into account, as well as the destination paths for symbolic links, you can make a good guess of how many blocks of metadata this is going take.

Remember I said that tar archives end with two blocks of null bytes. Well, GNU tar likes to work with a blocking factor of 20, meaning it does I/O in chunks of 20 blocks at a time. As a consequence, it writes archives in multiples of 20 blocks. So for GNU tar archives, there will often be more than just two blocks of null bytes at the end.
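So for GNU tar output, the estimate gets rounded up to a whole record of 20 blocks (10240 bytes by default); a small sketch:

```c
#include <stdint.h>

#define TAR_BLOCK  512
#define GNU_FACTOR 20       /* GNU tar's default blocking factor */

/* Pad an estimated archive size up to a whole record of
   20 * 512 = 10240 bytes, the way GNU tar writes its output. */
uint64_t pad_to_record(uint64_t estimated_bytes) {
    uint64_t record = (uint64_t)GNU_FACTOR * TAR_BLOCK;

    return (estimated_bytes + record - 1) / record * record;
}
```

Note that the blocking factor can be changed with tar's -b option, so 20 is only the default assumption.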

I actually implemented all of this in code and it's pretty accurate at estimating tar archive sizes. I have seen cases when it was off, though. The only way you could really get 100% accuracy is to use tar anyway. I suppose tar could use a dry run mode.

When you take a close look at the tar format, you find this hodgepodge of headers and extended headers and slight variations. This is what 30+ years of software evolution does to a program. It's almost poetic, seeing tar as an organic mass.

For a more detailed description of the tar format, see:

Sunday, January 20, 2013

A tale of two tiny bugs in Linux PAM's pam_access

Recently, a colleague of mine decided to introduce netgroups to our systems. NIS netgroups are a means of logically grouping users. So I reconfigured PAM to use netgroups: users had to be a member of a certain netgroup to be granted access to the system. To my surprise, it didn't work on one system, while it did on another. Clearly we had hit a bug, and one system was running a different version of the pam_access module. A little digging led to the excavation of an old bug. Interestingly enough, I discovered a second tiny mistake in the code.

First of all, I want it to be clear that I have great respect for all the developers that do the hard work for Linux. Zooming in on this little piece of code is for the love of code, and to illustrate the nature of the bug in detail for educational purposes.

Let us do a bit of NIS programming. You need the innetgr() function to tell whether a user is member of a group. Its signature is:
int innetgr(const char *netgroup, const char *host, const char *user, const char *domain);
The function innetgr() returns 1 for a successful match and 0 otherwise. The host and domain parameters may be NULL. The domain must be set, though, if a NIS domain has been configured. To find out what the NIS domain is, you can use either yp_get_default_domain() or getdomainname().
int yp_get_default_domain(char **domp);
int getdomainname(char *name, int namelen);
Both functions get the domain name; apparently there are systems where only one of the two is available. If no domain name is configured, yp_get_default_domain() will set the pointer to NULL. The function getdomainname() may set the name to "" (the empty string) or it may set it to "(none)".
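Putting that together, the correct handling is to map both the empty string and "(none)" to NULL before calling innetgr(). A sketch of the idea, using only getdomainname(); normalize_domain() and user_in_netgroup() are my own helper names:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>

/* innetgr() treats domain == NULL as "any domain", but an empty
   string never matches. So "no configured domain" must become
   NULL, not "". Some systems report "(none)" for "not set". */
const char *normalize_domain(const char *name) {
    if (name == NULL || name[0] == '\0' || strcmp(name, "(none)") == 0)
        return NULL;
    return name;
}

/* Return 1 if user is a member of netgroup, 0 otherwise. */
int user_in_netgroup(const char *netgroup, const char *user) {
    char buf[256];
    const char *domain = NULL;

    if (getdomainname(buf, sizeof(buf)) == 0)
        domain = normalize_domain(buf);
    return innetgr(netgroup, NULL, user, domain);
}
```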

The pam_access code looked like this:
/* netgroup_match - match group against machine or user */

static int
netgroup_match (pam_handle_t *pamh, const char *netgroup,
        const char *machine, const char *user, int debug)
{
  int retval;
  char *mydomain = NULL;

#ifdef HAVE_YP_GET_DEFAUTL_DOMAIN
  yp_get_default_domain(&mydomain);
#elif defined(HAVE_GETDOMAINNAME)
  char domainname_res[256];

  if (getdomainname (domainname_res, sizeof (domainname_res)) == 0)
    {
      if (strcmp (domainname_res, "(none)") == 0)
        {
  /* If domainname is not set, some systems will return "(none)" */
          domainname_res[0] = '\0';
        }
      mydomain = domainname_res;
    }
#endif
So, suppose you have a system that has no NIS domain name configured, and it used getdomainname() to get the domain name. The name will then be either an empty string or the string "(none)". If the domain is "(none)", the code resets it to an empty string. But if you pass an empty string as the domain to innetgr(), it will fail and report that the user is not part of the netgroup. The reason is that the domain parameter of innetgr() must be NULL to indicate ‘any domain name’; an empty string will not do.

Note that this is an old version of the code. However, this code was still current on a Debian Linux system. Debian is known for running ‘stable’ code, since new software is likely to include new and buggy features. This particular bug was fixed over a year ago, though.

The second problem with the code was still present, and it is right here:
#ifdef HAVE_YP_GET_DEFAUTL_DOMAIN
which is clearly a typo. When you make a typo in code, the compiler will usually catch it, but this is a preprocessor directive, and it managed to go unnoticed for at least three years. The typo caused yp_get_default_domain() never to be used at all. That wasn't much of a problem, since Linux systems typically provide both functions, but ironically, the block with getdomainname() contained a bug. Fixing the typo causes yp_get_default_domain() to be used, effectively discarding the buggy getdomainname() block.

Computers are very unforgiving to the mistakes that we humans make. The smallest of mistakes can be of great consequence (like users not being able to authenticate (!)). I was lucky to encounter and fix this problem, and I guess it shows the beauty of open source. Anyone can contribute to open source software, and anyone can help in making the whole a little bit better.