Unix How To: Sorting Log Data

Sorting data in system log files can be a chore, but only under certain circumstances. Whenever a text file of any kind uses uniform delimiters, you can sort its data on any particular column with the sort command.

To sort the /etc/passwd file on the user's full name (the GECOS field), for example, you can do this:

sort -t: -k5 /etc/passwd

The -t argument specifies the delimiter used in the file, and -k5 tells sort to start its sort key at the 5th field. Older versions of sort also accept the zero-based form +4 (four fields to the right of the first field), but that syntax is obsolete and some current implementations reject it. To sort on the UID (third field), you also have to specify that you want a numeric sort:

sort -t: -n -k3 /etc/passwd | more
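To try both sorts without touching the real password file, here is a minimal sketch that runs them against a few made-up records. The sample entries and the function names sample_passwd, sort_by_name and sort_by_uid are invented for illustration. The sketch uses -k5 and -k3, the POSIX spellings of the older +4 and +2 forms.

```shell
#!/bin/sh
# Hypothetical sample data in /etc/passwd layout:
# name:password:UID:GID:GECOS:home:shell
sample_passwd() {
    cat <<'EOF'
root:x:0:0:Super-User:/:/sbin/sh
zoe:x:1002:10:Zoe Adams:/home/zoe:/bin/sh
bob:x:101:10:Bob Clark:/home/bob:/bin/sh
EOF
}

# Sort on the 5th colon-separated field (the GECOS / full name field).
sort_by_name() {
    sort -t: -k5
}

# Sort numerically on the 3rd field (the UID).
sort_by_uid() {
    sort -t: -n -k3
}

sample_passwd | sort_by_name
sample_passwd | sort_by_uid
```

Without -n, the UID 1002 would sort ahead of 101 because the comparison would be made character by character rather than by numeric value.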

Sorting columns in text files is relatively easy if the files use uniform delimiters, whether colons, commas, white space, or some other character that appears in the files only as a delimiter. Working with more complex files that involve several different delimiters is a considerably harder problem.

Take as an example this line from the /var/adm/messages file:

Dec 10 14:07:13 boson xntpd[544]: [ID 798733 daemon.notice] using kernel phase-lock loop 0041

If all we want to do is sort on the message portion of lines such as these, we can do it with a while loop. Note that the read statement in this sample script stuffs everything after the 5th field into $Message.


while read Month Day Time System Process Message
do
    echo "$Message" >> /tmp/msgs$$
done < /var/adm/messages.3

sort /tmp/msgs$$ | uniq -c
rm /tmp/msgs$$
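The same idea can be sketched without the temporary file by piping the loop's output straight into sort. This is only one possible variation, not the script above: the function names message_counts and sample_log and the sample lines are made up, and read -r is added so that backslashes in log text are left alone.

```shell
#!/bin/sh
# Count how often each message occurs, reading syslog-style lines from stdin.
# The first five whitespace-separated fields go into the named variables and
# everything left over on the line goes into Message.
message_counts() {
    while read -r Month Day Time System Process Message
    do
        printf '%s\n' "$Message"
    done | sort | uniq -c
}

# Made-up sample input: the xntpd line twice (the second time and PID are
# invented) plus one obviously fictional entry.
sample_log() {
    cat <<'EOF'
Dec 10 14:07:13 boson xntpd[544]: [ID 798733 daemon.notice] using kernel phase-lock loop 0041
Dec 10 15:07:13 boson xntpd[601]: [ID 798733 daemon.notice] using kernel phase-lock loop 0041
Dec 10 16:00:00 boson myapp[12]: [ID 1 daemon.notice] hypothetical test message
EOF
}

sample_log | message_counts
```

On a real system the function would be fed from the log itself, for example with message_counts < /var/adm/messages.3.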

We have both white space and colons acting as delimiters in /var/adm/messages, as well as a set of square brackets joining several fields together into a single message component. Consider also that not every record in the file will necessarily have the same format. We might also have lines like these:

Dec 10 07:01:33 boson       Fault_PC 0x1035ae0 Esynd 0x0094
Dec 10 12:09:56 boson   last message repeated 1 time

Obviously, some of the fields in /var/adm/messages are optional. This adds another element of complexity to the task of sorting on the file's various components.
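To see why optional fields are a problem for a purely positional split, this small sketch feeds the two lines above through a six-variable read and prints what lands in the Process and Message variables. The function name show_split is invented for the illustration.

```shell
#!/bin/sh
# Split each line positionally and show which text ends up in which variable.
show_split() {
    while read -r Month Day Time System Process Message
    do
        printf 'Process=<%s> Message=<%s>\n' "$Process" "$Message"
    done
}

show_split <<'EOF'
Dec 10 07:01:33 boson       Fault_PC 0x1035ae0 Esynd 0x0094
Dec 10 12:09:56 boson   last message repeated 1 time
EOF
# prints:
# Process=<Fault_PC> Message=<0x1035ae0 Esynd 0x0094>
# Process=<last> Message=<message repeated 1 time>
```

In both cases the first word of the message is swallowed by the Process variable, because these lines have no process field to fill that position.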

Since the requester mentioned using awk, I have to assume he was attempting to parse his log records. With awk's split and substr functions plus its arrays, you can get around some of the problems I've mentioned, but it would still be a very tricky task to separate the records into logical units.

In my humble opinion, perl would be a better tool to use than awk. With perl's regular expressions, you can denote optional fields such as those showing up as "xntpd[544]:" and "[ID 798733 daemon.notice]" in our sample record above.

Here's a quick stab at a perl script that breaks out the date/time, system name, optional process and alert messages and the message text. Notice that the data assigned to $proc has to be handled in an if statement since it might not be assigned for any particular record. I've commented out an if statement surrounding a print just as a reminder that we can't just assume $proc has a value.

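#!/usr/bin/perl -w
# parse /var/adm/messages, storing message part in msgs

my $infile  = "/var/adm/messages";
my $outfile = "msgs";

open IN,  "<$infile"  or die "cannot read $infile: $!";
open OUT, ">$outfile" or die "cannot write $outfile: $!";

while ( <IN> ) {
    my($line) = $_;
    # date/time, system name, and everything after the system name
    my($date,$system,$message) = /(^\S+\s+\d+ \d+:\d+:\d+) (\S+)\s+(.*)$/;
    # optional "process[pid]: [ID ...]" portion; undefined if absent
    my($proc) = /(\S+\[\d+\]: \[.*\])/;
    # if ( $proc ) {
    #     print "proc: $proc\n";
    # }
    next unless defined $message;    # skip lines that don't match
    print OUT "$message\n";
}

close IN;
close OUT;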

This script creates a file (msgs) containing everything that follows the system name on each line of /var/adm/messages, so the optional process and [ID ...] tags are still attached to the message text. Once the file is created, you can run "sort msgs | uniq -c" to get a sorted list of these messages along with a count of how many times each appears in the file. All sorts of other options can be put into play, but I'd have to know more about what you are trying to do to make additional suggestions.
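If the goal is one count per distinct message regardless of which process ID logged it, one option is to strip the timestamp, the system name, and the optional process and [ID ...] tags before sorting. The following is a rough sketch that uses sed in place of the Perl script. The function names strip_prefix and sample_messages are made up, the second xntpd line is an invented variation, and the -E option (extended regular expressions) is assumed to be available, as it is in GNU and BSD sed.

```shell
#!/bin/sh
# Remove "Mon DD HH:MM:SS host " and, when present, "proc[pid]: " and
# "[ID ...] " from the front of each line. The two parenthesized groups are
# followed by ? so that lines lacking those fields still match.
strip_prefix() {
    sed -E 's/^[A-Za-z]+ +[0-9]+ [0-9:]+ [^ ]+ +([^ ]+\[[0-9]+\]: )?(\[ID [^]]*\] )?//'
}

# Sample input built from the lines shown earlier; the PID 601 entry is invented.
sample_messages() {
    cat <<'EOF'
Dec 10 14:07:13 boson xntpd[544]: [ID 798733 daemon.notice] using kernel phase-lock loop 0041
Dec 10 15:07:13 boson xntpd[601]: [ID 798733 daemon.notice] using kernel phase-lock loop 0041
Dec 10 07:01:33 boson       Fault_PC 0x1035ae0 Esynd 0x0094
Dec 10 12:09:56 boson   last message repeated 1 time
EOF
}

sample_messages | strip_prefix | sort | uniq -c
```

With the prefixes removed, the two xntpd lines collapse into a single entry with a count of 2, while the lines that never had a process field pass through with their full text intact.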

This article is published as part of the IDG Contributor Network.
