Sunday, January 20, 2013

A tale of two tiny bugs in Linux PAM's pam_access

Recently, a colleague of mine decided to introduce netgroups to our systems. NIS netgroups are a means of logically grouping users. So I reconfigured PAM to use netgroups: users had to be member of a certain netgroup to be granted access to the system. To my surprise, it didn't work on one system, while it did on another. Clearly we hit a bug, and one system was running a different version of the pam_access module. A little digging let to the excavation of an old bug. Interestingly enough, I discovered a second tiny mistake in the code.

First of all, I want it to be clear that I have great respect for all the developers that do the hard work for Linux. Zooming in on this little piece of code is for the love of code, and to illustrate the nature of the bug in detail for educational purposes.

Let us do a bit of NIS programming. You need the innetgr() function to tell whether a user is member of a group. Its signature is:
int innetgr(const char *netgroup, const char *host, const char *user, const char *domain);
The function innetgr() returns 1 for a successful match and 0 otherwise. The host and domain parameters may be NULL. The domain must be set though if a NIS domain has been configured. In order to find out what the NIS domain is, you can use either yp_get_default_domain() or getdomainname().
int yp_get_default_domain(char **domp);
int getdomainname(char *name, int namelen);
Both functions get the domain name, apparently there are systems where only one of these is available. If no domain name is configured, yp_get_default_domain() will set the pointer to NULL. The function getdomainname() may set the name to "\0" (empty string) or it may set it to "(none)".

The pam_access code looked like this:
/* netgroup_match - match group against machine or user */

static int
netgroup_match (pam_handle_t *pamh, const char *netgroup,
        const char *machine, const char *user, int debug)
  int retval;
  char *mydomain = NULL;

  char domainname_res[256];

  if (getdomainname (domainname_res, sizeof (domainname_res)) == 0)
      if (strcmp (domainname_res, "(none)") == 0)
  /* If domainname is not set, some systems will return "(none)" */
          domainname_res[0] = '\0';
      mydomain = domainname_res;
So, suppose you have a system that has no NIS domain name configured and it used getdomainname() to get the domain name. Consequently the name will be either an empty string or the string "(none)". If the domain is "(none)", it will be reset to an empty string. If you pass an empty string as domain to innetgr(), it will fail and report the user is not part of the netgroup. The reason is that the domain parameter of innetgr() must be NULL to indicate any domain name, an empty string will not do.

Note that this is an old version of the code. However, this code was still current for a debian linux system. Debian is known for running ‘stable’ code, as new software is likely to include new and buggy features. This particular bug was fixed over a year ago though.

The second problem with the code was still present, and it is right here:
which is clearly a typo. When you make typos in code the compiler will usually catch them, but this is a preprocessor statement and it managed to go unnoticed for at least three years. This typo caused yp_get_default_domain() never to be used at all. This wasn't much of a problem since Linux systems typically provide both functions, but ironically, the block with getdomainname() contained a bug. Fixing the typo causes yp_get_default_domain() to be used and effectively discards the block with getdomainname().

Computers are very unforgiving to the mistakes that we humans make. The smallest of mistakes can be of great consequence (like users not being able to authenticate (!)). I was lucky to encounter and fix this problem, and I guess it shows the beauty of open source. Anyone can contribute to open source software, and anyone can help in making the whole a little bit better.