Sunday, February 24, 2013

Estimating tar size

There's a rather old but interesting question on Stack Overflow: is it possible to estimate how large a tar archive will become before you start packing? Some say yes, some say no. Well, of course it's possible. The exact figure is given by:

    tar cf - directory/ | wc -c

This can be a rather slow command though, as it actually runs tar and reads all the data from disk. But can we estimate the archive size, and do so in a more efficient way?
Yes, it's possible, but you have to mimic the tar program really well when adding up the sizes of the individual files. Surprisingly, this is kind of hard. Tar is a much more complicated utility than you might think.

In essence, tar concatenates files one after another. The resulting tar file is called an archive, and every file in it is a member. Tar saves the original file name, size, and other relevant metadata such as ownership in a header directly in front of every member. So the archive will be bigger than the sum of the sizes of its members, but by how much exactly?

Tar works with 512-byte blocks. Every header occupies one 512-byte block, padded with null bytes, and all member data is stored in 512-byte blocks. The final data block of a member is padded with null bytes as necessary. Members like directories and symbolic links do not have data blocks. They are stored in the archive just by their headers. Classic tar ends the archive by appending two blocks filled with null bytes.
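Here is a minimal sketch of that arithmetic in Python. The function names are made up for illustration, and it assumes a classic archive with none of the complications discussed later: hard links, sockets, sparse files and long names are all ignored.

```python
import os
import stat

BLOCK = 512  # tar works in 512-byte blocks


def padded(size):
    """Round a byte count up to a whole number of blocks."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def member_size(mode, size):
    """One header block, plus padded data for regular files only."""
    data = padded(size) if stat.S_ISREG(mode) else 0
    return BLOCK + data


def classic_tar_estimate(root):
    """Estimate the archive size for `root` under the classic layout."""
    st = os.lstat(root)
    total = member_size(st.st_mode, st.st_size)
    # os.walk does not follow symlinks by default; lstat() makes sure a
    # symlink is counted as a header-only member.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            st = os.lstat(os.path.join(dirpath, name))
            total += member_size(st.st_mode, st.st_size)
    return total + 2 * BLOCK  # two trailing blocks of null bytes
```

This only calls lstat() on every entry and never opens a file, which is where the speedup over actually running tar comes from.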

With this information you should already be able to get a pretty good estimate of how large the resulting tar archive is going to be.

But it isn't quite right. Why is that? The classic picture still works as a baseline, but real-world tar archives no longer look quite like that.

One of the problems with classic tar was storing long filenames. Tar stores not only the file's name, but also its leading path. As the directory structure gets deeper, the stored filename becomes longer. There were only 100 bytes of room in the tar header for filenames, so storing a file with a longer path was simply not possible using classic tar.
Later, UNIX gained ACLs and high-precision file access times, and tar needed to be able to store and restore those.

To resolve these issues, extensions to the format were devised in which an extra header, with data blocks of its own, is placed in front of a member's regular tar header. The PaX format became a POSIX standard. Archives that use the PaX format may optionally have a global header at the beginning of the archive, accounting for two extra blocks (a regular header plus an extended header). A PaX extended header holds records of the form "key=value" and can therefore carry any kind of information. Tar implementations that encounter unknown keys may ignore them or give warnings.
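To put a number on a PaX extended header, it helps to know that each record is written as "LEN key=value" followed by a newline, where LEN is the decimal length of the whole record, including the digits of LEN itself. The sketch below computes that length and the resulting block count. The function names are hypothetical, and which records a given tar actually emits for a given file is an assumption you have to supply yourself.

```python
BLOCK = 512


def padded(size):
    """Round a byte count up to a whole number of blocks."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def pax_record_len(key, value):
    """Byte length of one 'LEN key=value<newline>' record, LEN included."""
    # Everything except the length digits: space, '=', newline, key, value.
    body = len(key.encode("utf-8")) + len(value.encode("utf-8")) + 3
    total = body + 1
    # The digit count of LEN depends on LEN itself, so grow until stable.
    while total != body + len(str(total)):
        total = body + len(str(total))
    return total


def pax_header_overhead(records):
    """Bytes added by one extended header carrying `records` (a dict)."""
    payload = sum(pax_record_len(k, v) for k, v in records.items())
    # One header block announcing the extended header, plus its padded data.
    return BLOCK + padded(payload)
```

For a 150-byte path stored under the "path" key, the record is 160 bytes long, so the extended header costs two extra blocks (1,024 bytes) for that member.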

Meanwhile, GNU crafted their own extensions that work in a similar way. For example, GNU tar has an extension for supporting sparse files with multiple holes.

To estimate the tar archive size, you have to guess (or know) what extended headers are going to be added. By taking filename lengths into account, as well as the destination paths of symbolic links, you can make a good guess of how many blocks of metadata this is going to take.
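As an illustration for the GNU format, the sketch below assumes the scheme used by Python's tarfile module in GNU mode: a name or link target longer than 100 bytes gets an extra header block, followed by the full string plus a terminating null byte, padded to whole blocks. The function name is hypothetical, and other tar implementations or formats may draw the line differently.

```python
BLOCK = 512
NAME_FIELD = 100  # bytes available in the header for a name or link target


def padded(size):
    """Round a byte count up to a whole number of blocks."""
    return (size + BLOCK - 1) // BLOCK * BLOCK


def gnu_long_overhead(name, linkname=""):
    """Extra bytes needed for over-long names under the assumed GNU scheme."""
    extra = 0
    for field in (name, linkname):
        n = len(field.encode("utf-8"))
        if n > NAME_FIELD:
            # One extra header block, then the string plus a null terminator.
            extra += BLOCK + padded(n + 1)
    return extra
```

A 150-byte path therefore adds 1,024 bytes on top of the member's regular header, and a symbolic link whose name and target are both too long pays the price twice.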

Remember I said that tar archives end with two blocks of null bytes? Well, GNU tar likes working with a blocking factor of 20, meaning that it does I/O 20 blocks at a time. As a consequence, it writes archives in multiples of 20 blocks (10,240 bytes). So for GNU tar archives there will often be more than just two blocks of null bytes at the end.
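The final rounding step is simple enough to sketch in Python. The function name is made up, and the input is assumed to be the archive size with the two trailing null blocks already counted.

```python
BLOCK = 512


def round_to_record(size, blocking_factor=20):
    """Round an archive size up to a whole number of records."""
    record = BLOCK * blocking_factor  # 10240 bytes for the default factor
    return (size + record - 1) // record * record
```

For example, an archive that adds up to 5,632 bytes ends up as 10,240 bytes, and one of 10,241 bytes grows to 20,480.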

I actually implemented all of this in code and it's pretty accurate at estimating tar archive sizes. I have seen cases where it was off, though. The only way to really get 100% accuracy is to run tar anyway. I suppose tar could use a dry run mode.

When you take a close look at the tar format, you find this hodgepodge of headers and extended headers and slight variations. This is what 30+ years of software evolution does to a program. It's almost poetic, seeing tar as an organic mass.

For a more detailed description of the tar format, see: