In a typical work week, a Unix systems administrator is likely to have at least one small mystery to solve -- one "huh?", one "that doesn't make any sense" or one "I've never seen this before". Most of the time when I find myself baffled by something on one of the systems I manage, it's because I've overlooked some element of the problem. Soon afterwards, I'm usually saying "oh, yes, of course!", having pulled the missing piece into focus during my review of how things are supposed to work. In this week's column, we'll follow my train of thought as I poked through one such small mystery.
This particular puzzle began when I noticed that I could not run a remote shell command from one of the two servers that I use as launch pads for managing many other servers. From one of them, the command retrieved the requested information. From the other, the same command to the same system failed with an unexpected error.
Logged into the first server, remote commands worked just fine:
# ssh beanybaby date
Mon Nov 8 08:26:17 EST 2004
From the other server, I got the error shown here:
# ssh beanybaby date
ssh: connect to host beanybaby port 22: Connection timed out
I was in the process of verifying that a relatively large number of systems will each accept a superuser command run from either of the two secured systems. This configuration will come in very handy if ever I need to shut them all down, remove an account from all of them in short order or install an important patch.
Immediately, I began to run through a list of some things that might be wrong and what I'd expect to see in each case. For example, if the hostname wasn't resolving on one of the servers, I'd be getting a response like this:
ssh: beanybaby: host/servname not known
If the remote host wasn't configured to allow password-free SSH commands, I'd be prompted to enter a password:
root@beanybaby's password:
If the remote system's fingerprint wasn't in the local known_hosts file (/.ssh/known_hosts), I'd be getting this:
The authenticity of host 'beanybaby (10.9.8.7)' can't be established.
DSA key fingerprint is 6a:7f:a0:ac:bc:28:3a:7f:10:38:83:e1:0b:27:95:6f.
Are you sure you want to continue connecting (yes/no)?
While the particular error generated by my ssh command suggests that I am having a problem connecting to port 22 (the SSH port) on the target system, it was obvious that the problem could not be related to the sshd process on that system because the same connection request worked properly on the first server.
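The checklist above can be collected into a small lookup, which is handy when the same check is run against many hosts. This is only a sketch: classify_ssh_error is a hypothetical helper name (not part of OpenSSH), and the message fragments it matches are the ones quoted in this column, so other SSH versions may word them differently.

```shell
# Map an ssh message to the most likely cause, following the checks above.
# classify_ssh_error is a hypothetical helper, not an OpenSSH command.
classify_ssh_error() {
    case "$1" in
        *"not known"*)            echo "hostname is not resolving" ;;
        *"password:"*)            echo "password-free login is not set up" ;;
        *"authenticity of host"*) echo "host key missing from known_hosts" ;;
        *"Connection timed out"*) echo "no answer on the SSH port at the resolved address" ;;
        *)                        echo "unrecognized message" ;;
    esac
}

# Example: the error seen on the second server
classify_ssh_error "ssh: connect to host beanybaby port 22: Connection timed out"
```

The last case is the interesting one here: the name resolved and authentication never started, so the address itself is the next thing to question.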
Same Target?
The next thing that I questioned was whether the two servers were actually connecting to the same system when I issued my "ssh beanybaby" command. Using nslookup, I was quickly able to determine that they were both pulling the proper information from DNS. Both servers responded with the information shown below.
> nslookup beanybaby
Server: ns1
Address: 127.0.0.1
Name: beanybaby.example.org
Address: 10.10.2.11
Knowing that DNS is generally the second source for resolving hostnames, however, I then checked the /etc/hosts file on each system. On one of the servers, I noticed this:
# grep beanybaby /etc/hosts
#10.10.2.11 beanybaby beanybaby.example.org
10.9.1.90 beanybaby beanybaby.example.org
Aha! On one of the servers (the working one), someone had commented out the host entry for beanybaby and replaced it with another. This was clearly the source of the discrepancy.
As it turned out, the change made to the /etc/hosts file on the first of the two servers corresponded to a redeployment of the particular system on a different subnet. The problem I ran into came about because the change was made locally on one server and wasn't folded into the zone file on the DNS server. The "Connection timed out" message came about because the hostname resolved, but the resultant IP address was no longer valid.
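A check like the one I did by hand can be scripted so that a local hosts entry that disagrees with DNS stands out. The sketch below assumes the hosts file is consulted before DNS, as on my servers. The function names hosts_addr and compare_sources are made up for illustration, and the DNS address is passed in by hand (copied from the nslookup output) rather than parsed, since nslookup output varies between versions.

```shell
# hosts_addr FILE NAME: print the first address that FILE (in /etc/hosts
# format) gives for NAME, ignoring commented-out entries.
hosts_addr() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                       # strip comments
        { for (i = 2; i <= NF; i++)              # names follow the address
              if ($i == name) { print $1; exit } }
    ' "$1"
}

# compare_sources FILE NAME DNS_ADDR: report whether FILE overrides DNS.
compare_sources() {
    local_addr=$(hosts_addr "$1" "$2")
    if [ -z "$local_addr" ]; then
        echo "$2: no hosts entry, DNS answer $3 is used"
    elif [ "$local_addr" = "$3" ]; then
        echo "$2: hosts file and DNS agree on $3"
    else
        echo "$2: MISMATCH hosts file says $local_addr, DNS says $3"
    fi
}

# Example: compare the local hosts file with the address nslookup returned
compare_sources /etc/hosts beanybaby 10.10.2.11
```

Run on the working server, this would have flagged the mismatch between the local 10.9.1.90 entry and the stale 10.10.2.11 record in DNS.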
After determining why my ssh command had failed, I updated the DNS zone file to reflect the new location of the system in question and gave a little thought to the process of maintaining proper records in an environment in which systems move from one subnet to another with some frequency. I also gave some thought to the troubleshooting process. The two approaches that seem most popular are asking what has changed recently that might account for the problem we have run into and comparing similar systems to determine what is different between the two. Either approach would have been useful in this case, but the first would require that I know more about what other sysadmins are doing.
One trick that I use to verify that a system is properly registered in DNS is to ask it to reflect its hostname back to me with a command like this:
# ssh beanybaby uname -n
beanybaby
Since a fairly large number of the systems I manage are configured to respond to such commands, I can run this command against a collection of them with a simple loop on the command line:
# for server in `cat server_list`
> do
> echo $server
> ssh $server "uname -n"
> done
spongebob
spongebob
barbie
barbie
powerranger
powerranger
...
If any of the hosts doesn't respond, I jot down its name and check it separately to determine whether it has been moved or its configuration has changed. Inaccessible systems slow down my loop, but not so much that it wastes much of my time.
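A variation on the loop can do the jotting down for me and keep unreachable hosts from stalling it for long. This is a sketch rather than the loop I actually ran: check_servers is a hypothetical function name, the five-second limit is an arbitrary choice, and it assumes an OpenSSH client that supports the BatchMode and ConnectTimeout options.

```shell
# check_servers FILE: ask each host listed in FILE (one name per line) for
# its own hostname, and flag hosts that fail or report a different name.
check_servers() {
    while read -r server; do
        [ -z "$server" ] && continue
        # BatchMode=yes fails instead of stopping at a password prompt;
        # ConnectTimeout caps the wait for an unreachable host;
        # -n keeps ssh from swallowing the rest of the server list on stdin.
        reported=$(ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$server" "uname -n" 2>/dev/null)
        if [ $? -ne 0 ]; then
            echo "$server: FAILED"
        elif [ "$reported" != "$server" ]; then
            echo "$server: reports itself as $reported"
        else
            echo "$server: ok"
        fi
    done < "$1"
}

# Example: run against the same server_list file used above, if present
if [ -f server_list ]; then
    check_servers server_list
fi
```

A host that returns its fully qualified name will show up as "reports itself as", which is worth a glance but is not necessarily a problem.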
There are eight million stories in the Unix sysadmin's bag of tricks; this has been one of them.
This story, "Unix Tip: Debugging tales: SSH command failure" was originally published by ITworld.