Unix Tip: Comparing Files with Checksums

Unix systems provide numerous ways to compare files. The most common way to verify that you have received or downloaded the proper file is to compute a checksum and compare it against one computed by a reliable source. MD5 is frequently used to compute checksums because it is extremely unlikely that two different files will have the same MD5 checksum by chance. Similar commands, such as sum and cksum, also compute checksums, but not as reliably. Let's look at all three and see why.

One of the first things you'll notice if you compare the output of the sum, cksum and md5 commands is the length of each calculated value. The sum command prints two numbers. The first (31339 in our example) is a 16-bit checksum. This means that any file will yield one of only 65,536 distinct values (0 through 65,535). The chance that two particular files which are different get the same checksum is small, roughly 1 in 65,536 if the values are evenly spread. If you have 65,000 files to compare, however, it is a near certainty that some of them will share a checksum even though they differ. In fact, with evenly spread values, the odds of at least one false match pass 50% at only about 300 files.

# sum /export/home/jdoe/bigfile.gz
31339 165523 /export/home/jdoe/bigfile.gz
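To put rough numbers on those odds, here is a minimal Python sketch of the standard "birthday" calculation. It assumes every 16-bit value is equally likely, which a simple checksum like sum does not guarantee, so treat the results as an idealized estimate. The function name collision_probability is ours and is not part of any Unix tool.

```python
import math


def collision_probability(num_files: int, checksum_bits: int) -> float:
    """Chance that at least two of num_files share a checksum value.

    Assumption: checksum values are spread evenly over 2**checksum_bits.
    """
    space = 2 ** checksum_bits
    if num_files > space:
        return 1.0  # more files than possible values, so a match is forced
    # P(no match) is the product of (1 - i/space) for i = 0 .. num_files-1.
    # Summing logarithms avoids multiplying many numbers close to 1.
    log_no_match = sum(math.log1p(-i / space) for i in range(num_files))
    return 1.0 - math.exp(log_no_match)


if __name__ == "__main__":
    for n in (2, 100, 302, 1000, 65000):
        print(f"{n:>6} files: {collision_probability(n, 16):.6f}")
```

Passing 128 instead of 16 as the second argument gives the same estimate for a checksum the size of the one md5 prints.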

One characteristic of the sum command is that the value of the checksum tracks the contents of the file in an obvious way. If one file contains "abc" and another contains "abd", the checksums differ by only 1. The values below are consistent with simply adding up the byte values: 97 + 98 + 99 for "abc", plus 10 for the trailing newline, gives 304. A calculation this simple is better suited to spotting accidental damage to a file than to heavy-duty or high-security file checking.

# sum /tmp/ab*
304 1 /tmp/abc
305 1 /tmp/abd

The second number that sum prints is the number of 512-byte blocks in the file. This helps considerably to ensure that dissimilar files are recognized as dissimilar. Unless the files you are comparing also report the same number of blocks, the fact that the checksums are the same can be discounted.
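The values 304 and 305 are what you get by adding up the bytes of each file, assuming each file holds its three letters followed by a newline. The Python sketch below mimics that style of calculation, using the folding step found in System V-style sum implementations. The function name sysv_sum is ours, and we are assuming, not asserting, that the sum command on the system shown works exactly this way.

```python
from typing import Tuple


def sysv_sum(data: bytes) -> Tuple[int, int]:
    """Return (checksum, 512-byte blocks) in the style of System V sum."""
    total = sum(data)  # add up every byte value in the file
    # Fold the (32-bit) running total down to 16 bits, carrying the overflow.
    folded = (total & 0xFFFF) + ((total & 0xFFFFFFFF) >> 16)
    checksum = (folded & 0xFFFF) + (folded >> 16)
    blocks = (len(data) + 511) // 512  # round up to whole 512-byte blocks
    return checksum, blocks


if __name__ == "__main__":
    print(sysv_sum(b"abc\n"))  # (304, 1)
    print(sysv_sum(b"abd\n"))  # (305, 1)
    print(sysv_sum(b"cba\n"))  # (304, 1): same bytes in a different order
```

The last line shows why so simple a sum is weak: rearranging the bytes of a file leaves both numbers unchanged.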

The cksum command works similarly. The first number that it prints is a cyclic redundancy check (CRC) for the file. As you can see from the sample output below, the CRC is a fairly large number: it is a 32-bit value, so it can take on more than four billion distinct values. This decreases the chance that two files will be taken as being identical when they are not. Notice how different the checksums of our two small files are (four bytes each, counting the trailing newline), even though the files differ by a single character.

# cksum /tmp/ab*
1112837078      4       /tmp/abc
1197460547      4       /tmp/abd

Using cksum against the large file we saw earlier, we see a checksum of similar magnitude even though the file is dramatically larger.

# cksum /export/home/jdoe/bigfile.gz
3574185895      84747520        /export/home/jdoe/bigfile.gz

The second number in the cksum output is the number of octets (bytes) in the file. This is a similar concept to the number of blocks, but is considerably finer grained. Two files occupying the same number of blocks are still likely to contain a different number of octets.
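For the curious, the cksum calculation can be sketched in a few lines of Python. This follows our reading of the POSIX description of cksum: a 32-bit CRC over the file's bytes, followed by the bytes of the file's length, with the result complemented. The names posix_cksum and blocks_512 are ours, and the sketch is meant for experimenting, not as a replacement for the real command.

```python
from typing import List, Tuple

POLY = 0x04C11DB7  # generator polynomial that POSIX specifies for cksum


def _make_table() -> List[int]:
    """Precompute the CRC contribution of each possible byte value."""
    table = []
    for i in range(256):
        c = i << 24
        for _ in range(8):
            c = ((c << 1) ^ POLY) if (c & 0x80000000) else (c << 1)
            c &= 0xFFFFFFFF
        table.append(c)
    return table


_TABLE = _make_table()


def posix_cksum(data: bytes) -> Tuple[int, int]:
    """Return (CRC, octet count) in the style of the cksum command."""
    crc = 0
    for byte in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ byte]
    n = len(data)
    while n:  # feed in the length, least significant byte first
        crc = ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ (n & 0xFF)]
        n >>= 8
    return crc ^ 0xFFFFFFFF, len(data)


def blocks_512(num_octets: int) -> int:
    """Number of 512-byte blocks needed to hold num_octets bytes."""
    return (num_octets + 511) // 512


if __name__ == "__main__":
    print(posix_cksum(b"abc\n"))
    print(posix_cksum(b"abd\n"))
    # Files of 84747265 through 84747776 octets all occupy 165523 blocks.
    print(blocks_512(84747265), blocks_512(84747520), blocks_512(84747776))
```

The first two lines of output can be compared against the cksum results shown earlier, on the assumption that the two small files hold three letters and a newline. The last line illustrates the point about granularity: 512 different octet counts collapse into a single block count.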

The md5 command is the most reliable of the three commands and the only one recommended for serious file checking. If you are sending a gzipped file to a customer and want the customer to be confident that the file received is both intact and the file you intended to send, providing an md5 checksum along with it is a very good idea. Notice the length of the checksum below.

# md5 /export/home/jdoe/bigfile.gz
MD5 (/export/home/jdoe/bigfile.gz) = e1e0aec5c73eeb3bcf4cff4d5a44b067

This 32-digit hexadecimal number can take on any of 2 ** 128 possible values. This is a bigger number than most of us can think about. It's more than a billion times a billion times a billion times a billion. I am told it is exactly:

340282366920938463463374607431768211456

Probably so. I don't even want to think about calculating so large a number.
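Fortunately, nobody has to calculate it by hand. Python handles integers of any size, so a couple of lines are enough to check the count of possible md5 values. The variable names are ours.

```python
digits = 32             # hexadecimal digits in an md5 checksum
values = 16 ** digits   # each hex digit carries 4 bits, so this is 2 ** 128

print(values)              # the exact count of possible checksums
print(f"{values:.3e}")     # the same number in scientific notation
print(len(str(values)), "decimal digits")
```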

The chance of two different files having the same md5 checksum by accident is vanishingly small. Looking at the two small files, we see that the md5 checksums seem to have no similarity whatsoever.

# md5 /tmp/ab*
MD5 (/tmp/abc) = 0bee89b07a248e27c83fc3d5951213c1
MD5 (/tmp/abd) = 8f0abafc5f8e6686a882c78cac4bcb9f
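To see how thoroughly a one-character change scrambles the result, the Python sketch below computes both checksums with the standard hashlib module and counts how many of the 128 bits differ. It assumes the two files hold "abc" and "abd" followed by a newline; the helper names md5_hex and differing_bits are ours.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the md5 checksum of data as 32 hexadecimal digits."""
    return hashlib.md5(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two hex strings differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


a = md5_hex(b"abc\n")
b = md5_hex(b"abd\n")

if __name__ == "__main__":
    print(a)
    print(b)
    print(differing_bits(a, b), "of 128 bits differ")
```

With sum, the same change moved the checksum by exactly 1. A well-mixed checksum should instead flip something like half of its bits.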

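Of course, to be valuable, checksums have to compute identically on different systems. Fortunately for us, this is the case for cksum and md5, whose algorithms are fixed by published specifications (POSIX and RFC 1321, respectively). The sum command is the exception: BSD-derived and System V-derived versions use different algorithms, so compare sum output only between systems that use the same one.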
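Because the result does not depend on the system, the recipient of a file can recompute the checksum anywhere and compare it with the one the sender published. Here is a minimal Python sketch of that check. The function names md5_of_file and matches_published are ours, and the file is read in chunks on the assumption that it may be too large to load into memory at once.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the md5 checksum of the file at path as hex digits."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read a chunk at a time until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """True if the file's checksum equals the published one.

    Surrounding whitespace and letter case in published_hex are ignored.
    """
    return md5_of_file(path) == published_hex.strip().lower()
```

For the large file used in this column, the call would look something like matches_published("/export/home/jdoe/bigfile.gz", "e1e0aec5c73eeb3bcf4cff4d5a44b067").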

This story, "Unix Tip: Comparing Files with Checksums" was originally published by ITworld.

Copyright © 2006 IDG Communications, Inc.
