The Developer's Cry: August 2012

Sunday, August 26, 2012

Crash course in ncurses

The UNIX terminal is an archaic text interface to the computer. It evolved from teletype machines, which were essentially keyboards with a line printer attached. Any text I/O would be directly printed on paper. Later, the paper was replaced by a monitor. If you want to create a terminal program with a text interface that is anything fancier than lines of text scrolling up the screen, you are likely to go with ncurses. ncurses is a library for creating text mode user interfaces.

A word of warning: as with all things that are new, learning to use ncurses is very much a step-by-step process. ncurses is not a magical tool for easily crafting great UIs. It has the same 1970s style interface as the rest of the UNIX that we all love.
In this post I will throw a lot of library function names at you with only a minimal description of what they do. Trust me, this is all you need to get you kickstarted with ncurses, but do check out the man pages afterwards to get more deeply involved.

Start off by including <ncurses.h>. First thing in main(), initialize the library with a call to initscr(). To deinitialize, call endwin(). You should always call endwin() upon program termination to prevent leaving the terminal in a state that the UNIX shell (or rather, the user) doesn't like. By the way, why endwin() isn't named deinitscr(), we may never know.

initscr() initializes a global variable in the library named stdscr. It's a pointer to an ncurses window that represents the terminal screen. It is not really the terminal screen, but think of it as a back buffer; you do your operations on the back buffer, then call refresh() to get output to the screen. Because of this, do not use printf() to display text. Instead, use printw() to print to stdscr (or wprintw() to print to a specific window) and call refresh() at an appropriate time afterwards. To print text at a fixed position, use mvprintw() for the screen or mvwprintw() for a window. You can print text with attributes using attron() and attroff(). For example, attron(A_BOLD) enables the bold attribute. Colors work the same way with attron(COLOR_PAIR(n)), after calling start_color() and defining the pair with init_pair(). Clear the screen with clear(). To get the screen dimensions, call the macro getmaxyx(stdscr, height, width); because it is a macro, height and width are passed without an address-of operator.

The cursor can be freely controlled. You can hide or show it using curs_set(visible), and you can move it around with move(y, x).

You can read key presses with getch(). But before that, you should put the terminal in the right mode, usually during program initialization. Normally, the terminal is in cooked mode, meaning that the terminal driver has already done a lot of processing before a pressed key is passed on to the running process. For example, the driver handles Ctrl-Z itself and suspends the process.
By default, input is line buffered. This means that the user has to hit return before the key is passed to the process. To keep this from happening, set cbreak mode (break after each character). Setting cbreak mode is easy: call cbreak().
getch() will echo the characters to the screen. If you don't want this, call noecho(). What's really nice is that getch() responds to nearly all keys on your keyboard, including the arrow keys, page up, page down, and so on. The symbolic constants for keys are named KEY_xxx. You can find them in the man page for curs_getch, or grep for "KEY_" in /usr/include/ncurses.h.
The arrow and function keys are not translated by default. Call keypad(stdscr, TRUE) to enable them.
getch() will implicitly call refresh() as necessary.

Call raw() to set the terminal in raw mode. In raw mode, interrupt and flow control characters like Ctrl-C, Ctrl-Z, and the like behave like any other key press without producing a signal. My personal preference is to set cbreak, noecho, and keypad, but there are cases in which raw mode is well-suited too.
You may be familiar with the tcsetattr() function that manipulates the terminal mode. Word of advice: don't bother using it—the ncurses functions are easier and more comprehensible.

ncurses allows you to work with windows. A window is a rectangular screen area. Allocate a new window with newwin(), and deallocate it with delwin(). To neatly close a window and erase its content, use werase() and call wrefresh(), before finally deallocating it with delwin().
Subwindows are created using derwin() (as in "derived window"). I prefer derwin() over subwin() because derwin() uses coordinates relative to its parent window.
Subwindows are windows just like any other ncurses window; they are all of type pointer to WINDOW. Again, deallocate with delwin(), and delete a subwindow before its parent.

Windows are confined to the screen area. To create windows larger than the screen, or even off-screen windows, use pads. A pad window is allocated using newpad(), and deallocated using ... delwin(). Because a pad can only have a small visible portion onscreen, you should use prefresh() rather than wrefresh() to redisplay pad windows.

This is ncurses in a nutshell. It should be just enough to get you started. Don't forget to call refresh()! Frankly, I just couldn't get used to ncurses' odd parameter order of "h,w,y,x" so I wrapped everything in my own interface. People say it's better to learn the standard API, but hey. ncurses is not exactly UIKit for terminal based apps. But fair enough, it's fun being able to handle the arrow keys in a UNIX program, and to print text at any screen position without having to write ANSI escape sequences.

For a more elaborate introduction with code examples and everything, please see the NCURSES Programming HOWTO at The Linux Documentation Project.

Saturday, August 11, 2012

Multibyte and wide character strings in C (addendum)

Last time I wrote about multibyte character strings in C, and said that the easiest way to deal with them is to convert them to wide character strings. Unfortunately, there is an issue with the wide character wchar_t type; it just so happens that its size is 32 bits on UNIX (and UNIX-like) platforms, while it is only 16 bits wide on the Windows platform. On UNIX mbstowcs() converts to a UTF-32 string, and on Windows mbstowcs() converts to a UTF-16 string. What this means is that everything I talked about in the previous post is quite alright on UNIX and not so cool on Windows. I'm a UNIX programmer and I don't work on Windows, but I do care about portability across platforms, and wchar_t is hopelessly broken across platforms.

So, what is going on? The C standard actually leaves the size of wchar_t implementation-defined, and the Unicode standard advises that portable code should not use wchar_t for Unicode text. Oddly enough, C does provide a complete set of functions for handling wide character strings (!). Since wchar_t is not defined as a portable type, then a) how are we supposed to work with strings and Unicode, and b) what is it doing in the standard in the first place?
The problem stems from the fact that Unicode started out with just 16-bit code points, but its designers later realized they were going to need a few bits more. Hence the jump to 32 bits. By that time, Microsoft had long been happily using a 16-bit wchar_t. When others started supporting Unicode, they implemented wchar_t as a 32-bit value so it could hold a single UTF-32 character. Portability ended right there.
Consequently, wchar_t was a good idea that turned out to be a failure.
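
To illustrate why a 16-bit wchar_t cannot hold every character in one unit, here is a sketch of how UTF-16 splits a code point above U+FFFF into a surrogate pair. utf16_encode() is a name I made up for this example; it is not a standard library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Writes one or two 16-bit units
 * to out and returns how many were written. Returns 0 for values that are
 * not valid code points (the surrogate range, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                     /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));    /* low surrogate */
    return 2;
}
```

So on a platform with a 16-bit wchar_t, a single character may occupy two array elements, and indexing a wide string by position no longer addresses whole characters.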

Today, if you want to work around this problem, you are going to have to work with uint32_t for characters and roll your own string functions, including your own UTF-8 encoding and decoding functions. It's pretty sad. There is a bit of good news on the horizon; the ISO C11 standard includes two new data types, char16_t and char32_t, and associated conversion functions. Missing, however, are string handling functions for these types. In effect, you are discouraged from working with strings in their UTF-32 form.
At the time of writing, no compiler fully implements C11. These new character types are also present in C++11, and recent versions of the g++ and clang++ compilers do support them.
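
If you do go down the road of rolling your own, the encoding and decoding functions could look something like the sketch below. utf8_encode() and utf8_decode() are hypothetical names of my own, and this is just one way to do it: the decoder rejects overlong forms, surrogates and values above U+10FFFF, and handles one code point at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-8 into out. Returns bytes written (1 to 4), 0 if invalid. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode one code point from the first n bytes at s. Stores it in *cp and
 * returns the bytes consumed, or 0 on truncated or malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or bad lead byte */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                       /* overlong or out of range */
    *cp = c;
    return len;
}
```

Decoding a whole string is then a loop that calls utf8_decode() and advances by the returned byte count until it hits the end or a zero.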

Tuesday, August 7, 2012

Multibyte and wide character strings in C

Over a century ago, man transmitted messages over a wire using a five-bit character encoding scheme. Much later, the ASCII table became the standard for encoding characters, using seven bits per character. ASCII is nice for representing English text but it doesn't work well for other languages, so nowadays we have Unicode. Unicode defines code points (numbers) for characters and symbols of every language there is, or was. Documents aren't written in raw Unicode; we use UTF-8 encoding.

UTF-8 encoding is a variable length encoding and uses one to four bytes to encode characters. This means some characters (like the ones in the English alphabet) will be represented by a single byte, while others take two, three, or four bytes.
For C programmers using char pointers this means:

strlen() does not return the number of characters in a string;
strlen() does return the number of bytes in the string (minus the terminating nul character);
buf[pos] does not address a character;
buf[pos] addresses a byte in the UTF-8 stream;
buf[32] does not reserve space for 31 characters;
buf[32] might reserve space for only 7 characters ...
strchr() really searches for a byte rather than a character.
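
The difference between bytes and characters is easy to see in code. The sketch below counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). utf8_strlen() is a name I made up, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count the code points in a nul-terminated UTF-8 string. Every byte that
 * is not a continuation byte (10xxxxxx) starts a new code point.
 * Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a string holding five characters of which one needs two bytes, strlen() reports six while utf8_strlen() reports five.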

If you want to be able to address and manipulate individual characters in multibyte character strings, the best thing you can do is to convert the string to wide character format and work with that. On UNIX-like platforms a wide character (wchar_t) is 32 bits wide.

The operating system is configured to work with a native character encoding set (which is often UTF-8, but could be something else). All I/O should be done using that encoding. So if you do a printf(), print the multibyte character string.

During initialization of your program (like in main()), set the locale. If you forget to do this, the string conversion may not work properly.

setlocale(LC_ALL, "");

Converting a multibyte character string to a wide character string:

mbstowcs(wstr, str, n);

Converting a wide character string back to a multibyte character string:

wcstombs(str, wstr, n);

One problem with these functions is estimating the buffer size. Either play it safe and assume each character takes four bytes, or write a dedicated routine that correctly calculates the needed size.
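
One way to write such a dedicated routine is to let mbstowcs() do the counting. The sketch below relies on the POSIX behaviour that a NULL destination makes mbstowcs() return the required length without storing anything; that is an assumption about your platform, so check the man page. mbs_to_wcs_alloc() is a hypothetical name.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string to a newly allocated wide character string.
 * Returns NULL on an invalid multibyte sequence or allocation failure.
 * The caller must free() the result. Call setlocale() beforehand. */
wchar_t *mbs_to_wcs_alloc(const char *str)
{
    size_t n = mbstowcs(NULL, str, 0);   /* count only, nothing is stored */
    wchar_t *wstr;

    if (n == (size_t)-1)
        return NULL;                     /* invalid multibyte sequence */

    wstr = malloc((n + 1) * sizeof *wstr);
    if (wstr == NULL)
        return NULL;

    mbstowcs(wstr, str, n + 1);          /* n characters plus terminator */
    return wstr;
}
```

The same trick works in the other direction with wcstombs(NULL, wstr, 0), which gives the number of bytes needed, excluding the terminating nul.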

It's fun seeing your program being able to handle Chinese, Japanese, etcetera. For more on the subject, these two pages are highly recommended: