Monday, May 13, 2013

More on unsafe C strings: A followup

Recently John Carmack, lead programmer of DOOM, wrote in a tweet: “Considering that 8 bit BASICs of the 70s had range checked and garbage collected strings, it is amazing how much damage C has done”. This is in line with my previous posts, and it's fairly remarkable that C stuck around so long without having this fixed. There have been many updates to the language, but this issue was simply not addressed, at least not until C11. C11 includes the strcpy_s() and strcat_s() safe string functions as part of its optional bounds-checking interfaces (Annex K). It feels like too little, too late, but at last something has been done about it.

Are our troubles over now? Not really; C11 is still too new. My system does not have the new safe string functions, so I can't use them. Adoption of the new standard takes time. Documentation (and school books) has to be updated. Next, the old strcpy() and strcat() have to be deprecated and, finally, eradicated from our source code.
The bounds checking problem continues to exist for arrays.
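
Until strcpy_s() is available everywhere, the idea behind it can be approximated by hand. The following is only a minimal sketch: bounded_strcpy() is a hypothetical helper name, not a standard function, and its return convention (0 on success, -1 on failure) is my own assumption rather than the C11 one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard library.
 * Copies src into dst, where dst has room for dstsize bytes.
 * Returns 0 on success. Returns -1 if src (plus its NUL) does not fit;
 * in that case dst is set to the empty string instead of being truncated. */
int bounded_strcpy(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL);

    if (dstsize == 0)
        return -1;                /* no room even for the terminator */

    size_t len = strlen(src);
    if (len >= dstsize) {
        dst[0] = '\0';            /* refuse to overflow or silently truncate */
        return -1;
    }

    memcpy(dst, src, len + 1);    /* len + 1 copies the terminating NUL too */
    return 0;
}
```

The caller still has to pass the right buffer size, typically sizeof buf, so this moves the problem rather than removing it. That is exactly why a real string type is the more attractive fix.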

Other than C11, I want to expand on two other possibilities for fixing the issue. The first solution is using an external library; the second is choosing a different programming language.

1. Using an external library
Rather than just supplying a safe string copying function, I mean using a library that supplies a safe string type and associated functions. As is often the case, this library would be written in C itself. One problem is that you have to rigorously adapt to using the library's string functions, and resist the temptation of using the old char* pointers for quick and dirty strings anyway.
Another problem is that system functions such as open(), write(), et cetera expect char* pointers, so you keep converting between the safe string type and the char* pointer.
Typical library solutions allocate strings on the heap, giving an extra performance hit. Using a library means marrying that library; your whole program source code will look different and be tightly coupled to it. Switching libraries will have a big impact and involve lots of work.
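
To give an idea of what such a library looks like from the inside, here is a rough sketch of a heap-allocated string type. All names (sstr, sstr_new(), sstr_append(), sstr_cstr(), sstr_free()) are hypothetical and made up for illustration; they are not the API of any particular library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical safe string type: the length and capacity travel with the data. */
typedef struct {
    size_t len;    /* bytes in use, excluding the terminating NUL */
    size_t cap;    /* bytes allocated, including room for the NUL */
    char *data;    /* heap buffer, always kept NUL-terminated */
} sstr;

/* Create a string from a C string. Returns NULL if allocation fails. */
sstr *sstr_new(const char *init)
{
    assert(init != NULL);

    sstr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;

    s->len = strlen(init);
    s->cap = s->len + 1;
    s->data = malloc(s->cap);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->data, init, s->cap);
    return s;
}

/* Append a C string, growing the buffer when needed.
 * tail must not point into s itself. Returns 0 on success, -1 on failure. */
int sstr_append(sstr *s, const char *tail)
{
    assert(s != NULL && tail != NULL);

    size_t tail_len = strlen(tail);
    size_t need = s->len + tail_len + 1;
    if (need > s->cap) {
        char *p = realloc(s->data, need);
        if (p == NULL)
            return -1;            /* the original string is left untouched */
        s->data = p;
        s->cap = need;
    }
    memcpy(s->data + s->len, tail, tail_len + 1);
    s->len += tail_len;
    return 0;
}

/* Borrow a read-only char* for system functions such as open(). */
const char *sstr_cstr(const sstr *s)
{
    assert(s != NULL);
    return s->data;
}

void sstr_free(sstr *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}
```

Every call site that talks to the operating system ends up going through something like sstr_cstr(), and every string costs at least two heap allocations in this sketch. That is the conversion overhead and the performance hit described above.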

My GitHub project liboco is a C library that implements a safe string type, as well as an array and a map type, and a few other things. It follows the Objective-C design principle that everything is a reference counted object. Like in Objective-C, this means manual reference counting. It isn't bad, but can be problematic if you are not used to it. Hard-to-track-down memory leaks may happen. Other than that, liboco is pretty neat.
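
For readers who have never done manual reference counting, this is roughly what the discipline looks like. It is a generic sketch with hypothetical names (obj, obj_new(), obj_retain(), obj_release()); it is not liboco's actual API or object layout. The live_objects counter exists only to make leaks visible in this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object. */
typedef struct {
    int refcount;   /* number of owners */
    int value;      /* stand-in for the real payload */
} obj;

/* Demo-only bookkeeping: how many objects have not been freed yet. */
int live_objects = 0;

/* A new object starts with one owner: the caller. */
obj *obj_new(int value)
{
    obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    o->refcount = 1;
    o->value = value;
    live_objects++;
    return o;
}

/* Every additional owner must retain the object... */
obj *obj_retain(obj *o)
{
    assert(o != NULL && o->refcount > 0);
    o->refcount++;
    return o;
}

/* ...and every owner must release it exactly once. */
void obj_release(obj *o)
{
    assert(o != NULL && o->refcount > 0);
    if (--o->refcount == 0) {
        free(o);
        live_objects--;
    }
}
```

Forget one obj_release() and the object lives forever; call it once too often and you have a use-after-free. Nothing in the language checks the pairing for you, which is where the hard-to-find leaks come from.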

Similarly, I have written oolib for C++, which is much easier to use because it does automatic reference counting under the hood. This brings us to the next point.

2. Choosing a different programming language
It is fair to say that the problem of unsafe strings and having no bounds checks is intrinsic to C. It is a low-level language, after all. So if you want that extra bit of safety, you have to let go of C and choose a different language.

C++ fixes strings with std::string. Its interface annoys me, but anyway, C++ is a complicated language and it's really hard to learn. Even after thoroughly studying C++ I'm not sure I would recommend the language to anyone. Sure, classes are useful, but you almost have to be an academic to master C++.

Other languages more popular than C on GitHub are PHP, Python, Java, and JavaScript, all byte code interpreted languages. JavaScript is a total mess, but anyway, these languages by design lack the punch that C can deliver in terms of performance. The process of byte code interpreting is essentially emulation, and emulation is slow. I bet that even the laziest, dumbest compiler would produce faster-running code than these byte code interpreters.
The performance argument may seem debatable: “do you really need that kind of performance?” The answer, in short, is: Yes. I have personally seen scripts run for hours, where the equivalent written in C ripped through it in half a minute. A speedup by a factor of 100 is no exception. Compiled languages offer vastly superior performance.

There is Objective-C, but it is different enough to say it changes everything. Choosing Objective-C in practice means marrying OpenStep's Foundation Kit. It is a good choice if you are Apple-only and intend to keep it that way. Objective-C has manual reference counting, which is alright but kind of tough to get right. Apple's compiler, however, sports automatic reference counting (ARC), which does some clever source code analysis.

Whatever happened to Pascal? Although a rather verbose language (consider PROCEDURE / BEGIN / END versus void { }), Pascal wasn't so bad, except for the fact that it has the worst string type of all (!). Although Pascal strings are safe, they have an absolute maximum length of 255 bytes. In modern times, with Unicode encoding, a character can take up to four bytes, which means that for some languages only 63 characters can be stored in a single Pascal string, making it a totally useless language by today's standards.
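
The 255 byte limit follows directly from the layout of the string. Here is a sketch in C, assuming the classic Pascal representation of a single length byte followed by the character data; the names pstr, pstr_set() and pstr_max_chars() are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define PSTR_MAX 255   /* the largest value a single length byte can hold */

/* Assumed classic Pascal layout: one length byte, then the data. */
typedef struct {
    unsigned char len;      /* 0..255 */
    char data[PSTR_MAX];    /* not NUL-terminated; len says where it ends */
} pstr;

/* Store a C string. Returns 0 on success, -1 if it exceeds 255 bytes.
 * The bounds check is what makes the type safe. */
int pstr_set(pstr *p, const char *src)
{
    assert(p != NULL && src != NULL);

    size_t n = strlen(src);
    if (n > PSTR_MAX)
        return -1;

    p->len = (unsigned char)n;
    memcpy(p->data, src, n);
    return 0;
}

/* How many characters fit if each one takes bytes_per_char bytes? */
size_t pstr_max_chars(size_t bytes_per_char)
{
    assert(bytes_per_char > 0);
    return PSTR_MAX / bytes_per_char;
}
```

With four bytes per character, pstr_max_chars(4) gives 255 / 4 = 63, which is where the 63 character figure comes from.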

Go deserves honorable mention. It is an easy to learn language with cool core features, and it compiles to native code. Even though I'm not fond of its mandatory CamelCasing, golang is really great. Go was designed as a systems programming language, and it shows. There are some Plan 9 peculiarities, like the command-line flags parser not being GNU getopt()-like, and the tar archiving module doesn't grok GNU tar extensions. Other than that, Go has great potential.

Truthfully, I don't want a language that strays too far from C. Interestingly, although many modern languages are described as “C-like”, none of them are plain C with an added safe string type. I suppose it is hard to change C. For one thing, a ‘string’ is a fairly high-level concept that doesn't fit nicely in a low-level language.