Saturday, March 8, 2008

A little known Unix command - join

Unix/Linux offers a lot of small but powerful tools for text manipulation. One of them is a little-known filter called join. It is both useful and powerful, yet you rarely hear about it anywhere.

The join command matches lines from two sorted files on a common field. That makes it a handy way to compare two sets of inputs and find out what they have in common and what is missing from each.

Here is an example:

I have two sets of inputs, fileA and fileB, which I sort first:

$ sort fileA > fileA.sorted
$ sort fileB > fileB.sorted
$ cat fileA.sorted
a
b
c
d
e
$ cat fileB.sorted
a
d
g
h
j

To list what is in fileA but not in fileB, use the join command as follows. The "-j 1" switch tells join to use the first field of each line, in both files, for the comparison. The "-v 1" switch lists the entries that are in the first file but not in the second.

$ join -j1 -v 1 fileA.sorted fileB.sorted
b
c
e

To list the entries in fileB but not in fileA, you just have to use the "-v 2" switch instead of "-v 1":

$ join -j1 -v 2 fileA.sorted fileB.sorted
g
h
j

To list the entries that appear in both lists, just remove the "-v" switch altogether.

$ join -j1 fileA.sorted fileB.sorted
a
d

More options

The join command has many options. Some of the commonly used ones are "-t" to specify the field delimiter, "-1" and "-2" to specify a different join field for each file, and "-i" to ignore case differences. You can get more details about the command from the Unix man pages.
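
As a rough illustration of "-t" together with "-1" and "-2", here is a small sketch. The file names and their contents are made up for the example: sites.csv has the endpoint name in its first field, os.csv has it in its second field, and both are already sorted on that field.

```shell
# Hypothetical sample data, comma-separated.
tmp=$(mktemp -d)
printf '%s\n' 'ep01,london' 'ep02,paris' 'ep03,tokyo' > "$tmp/sites.csv"   # key is field 1
printf '%s\n' 'win,ep01' 'linux,ep03' 'aix,ep04' > "$tmp/os.csv"           # key is field 2

# -t ,  : fields are separated by commas
# -1 1  : join on field 1 of the first file
# -2 2  : join on field 2 of the second file
joined=$(LC_ALL=C join -t , -1 1 -2 2 "$tmp/sites.csv" "$tmp/os.csv")
echo "$joined"

rm -rf "$tmp"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second, so this should print "ep01,london,win" and "ep03,tokyo,linux".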

Applications

Where would you use this command in a typical Tivoli setup? It is very useful for identifying discrepancies in data. For example, suppose you have one file containing the endpoints in the inventory database and another containing the list of all endpoints (wlookup output). You can easily use the join command to find out which endpoints have been scanned, which have not been scanned, and which are in the inventory database but no longer in the Tivoli database.
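
The inventory comparison could look something like the sketch below. The endpoint names and file names are invented for the example, and it assumes each file has already been reduced to one endpoint name per line; real wlookup or inventory exports would need to be trimmed to that shape first.

```shell
tmp=$(mktemp -d)
# Hypothetical lists, one endpoint name per line, sorted before use.
printf '%s\n' ep03 ep01 ep02 ep05 | LC_ALL=C sort > "$tmp/all_endpoints.sorted"
printf '%s\n' ep02 ep09 ep01      | LC_ALL=C sort > "$tmp/inventory.sorted"

# In both lists: endpoints that have been scanned.
scanned=$(LC_ALL=C join "$tmp/all_endpoints.sorted" "$tmp/inventory.sorted")
# Only in the endpoint list: not scanned yet.
not_scanned=$(LC_ALL=C join -v 1 "$tmp/all_endpoints.sorted" "$tmp/inventory.sorted")
# Only in the inventory list: no longer a known endpoint.
stale=$(LC_ALL=C join -v 2 "$tmp/all_endpoints.sorted" "$tmp/inventory.sorted")

echo "Scanned:" $scanned
echo "Not scanned:" $not_scanned
echo "Stale inventory entries:" $stale

rm -rf "$tmp"
```

With this sample data, ep01 and ep02 come out as scanned, ep03 and ep05 as not scanned, and ep09 as a stale inventory entry.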

You will also find it useful to keep track of your distribution list. For example, if you have a list of all your endpoints in one file and a set of endpoints that received the distribution in another file, you can easily find out which endpoints have/have not received distributions.

The only restriction is that both input files must be sorted on the join field, and in the same collation order that join itself uses, or you will not get the correct result.
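
One way to stay on the safe side is sketched below, reusing the fileA and fileB data from the example above. Setting LC_ALL=C for both sort and join is my own suggestion rather than a requirement, and "sort -c" is used only to check whether a file is already in order.

```shell
tmp=$(mktemp -d)
# The same entries as fileA and fileB above, deliberately out of order.
printf '%s\n' d a c b e > "$tmp/fileA"
printf '%s\n' j a h d g > "$tmp/fileB"

# sort -c only checks the order; it exits non-zero when the file is not sorted.
if ! LC_ALL=C sort -c "$tmp/fileA" 2>/dev/null; then
    unsorted_detected=yes
fi

# Sort both files under the same locale that join will run in.
LC_ALL=C sort "$tmp/fileA" > "$tmp/fileA.sorted"
LC_ALL=C sort "$tmp/fileB" > "$tmp/fileB.sorted"

common=$(LC_ALL=C join "$tmp/fileA.sorted" "$tmp/fileB.sorted")
echo "$common"

rm -rf "$tmp"
```

Running sort and join under the same locale avoids the case where sort orders the lines one way and join expects another.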

Hope you find it useful.
