Skip the navigation

Server's down: How do I find out what's wrong?

By Kyle Rankin
January 28, 2013 05:55 AM ET

In all of the cases mentioned, it's easy to bypass DNS so your troubleshooting results are accurate. All of the commands we discussed earlier accept an -n option, which disables any attempt to resolve IP addresses into hostnames. I've just become accustomed to adding -n to all of the commands I introduced you to in the first part of this chapter unless I really do want IP addresses resolved.

Note: DNS resolution can also affect your web server's performance in an unexpected way. Some web servers are configured to resolve every IP address that accesses them into a hostname for logging. Although that can make the logs more readable, it can also dramatically slow down your web server at the worst times -- when you have a lot of visitors. Instead of serving traffic, your web server can get busy trying to resolve all of those IPs.

Find the network slowdown with traceroute

When your network connection seems slow between your server and a host on a different network, sometimes it can be difficult to track down where the real slowdown is. Especially in situations where the slowdown is in latency (the time it takes to get a response) and not overall bandwidth, it's a situation traceroute was made for. traceroute was mentioned earlier as a way to test overall connectivity between you and a server on a remote network, but traceroute is also useful when you need to diagnose where a network slowdown might be. Since traceroute outputs the reply times for every hop between you and another machine, you can trace down servers that might be on a different continent or gateways that might be overloaded and causing network slowdowns. For instance, here's part of a traceroute between a server in the United States and a Chinese Yahoo server:

$ traceroute yahoo.cn
traceroute to yahoo.cn (202.165.102.205), 30 hops max, 60 byte packets
1 64-142-56-169.static.sonic.net (64.142.56.169) 1.666 ms 2.351 ms 3.038 ms
2 2.ge-1-1-0.gw.sr.sonic.net (209.204.191.36) 1.241 ms 1.243 ms 1.229 ms
3 265.ge-7-1-0.gw.pao1.sonic.net (64.142.0.198) 3.388 ms 3.612 ms 3.592 ms
4 xe-1-0-6.ar1.pao1.us.nlayer.net (69.22.130.85) 6.464 ms 6.607 ms 6.642 ms
5 ae0-80g.cr1.pao1.us.nlayer.net (69.22.153.18) 3.320 ms 3.404 ms 3.496 ms
6 ae1-50g.cr1.sjc1.us.nlayer.net (69.22.143.165) 4.335 ms 3.955 ms 3.957 ms
7 ae1-40g.ar2.sjc1.us.nlayer.net (69.22.143.118) 8.748 ms 5.500 ms 7.657 ms
8 as4837.xe-4-0-2.ar2.sjc1.us.nlayer.net (69.22.153.146) 3.864 ms 3.863 ms 3.865 ms
9 219.158.30.177 (219.158.30.177) 275.648 ms 275.702 ms 275.687 ms
10 219.158.97.117 (219.158.97.117) 284.506 ms 284.552 ms 262.416 ms
11 219.158.97.93 (219.158.97.93) 263.538 ms 270.178 ms 270.121 ms
12 219.158.4.65 (219.158.4.65) 303.441 ms * 303.465 ms
13 202.96.12.190 (202.96.12.190) 306.968 ms 306.971 ms 307.052 ms
14 61.148.143.10 (61.148.143.10) 295.916 ms 295.780 ms 295.860 ms
...

Without knowing much about the network, you can assume just by looking at the round-trip times that once you get to hop 9 (at the 219.158.30.177 IP), you have left the continent, as the round-trip time jumps from 3 milliseconds to 275 milliseconds.

Find what is using your bandwidth with iftop

Sometimes your network is slow not because of some problem on a remote server or router, but just because something on the system is using up all the available bandwidth. It can be tricky to identify what process is using up all the bandwidth, but there are some tools you can use to help identify the culprit.

top is such a great troubleshooting tool that it has inspired a number of similar tools like iotop to identify what processes are consuming the most disk I/O. It turns out there is a tool called iftop that does something similar with network connections. Unlike top, iftop doesn't concern itself with processes but instead lists the connections between your server and a remote IP that are consuming the most bandwidth. For instance, with iftop you can quickly see if your backup job is using up all your bandwidth by seeing the backup server IP address at the top of the output.

Sample iftop output
Sample iftop output

iftop is available in a package of the same name on both Red Hat- and Debian-based distributions, but in the case of Red Hat-based distributions, you might have to find it from a third-party repository. Once you have it installed, just run the iftop command on the command line (it will require root permissions). Like with the top command, you can hit Q to quit.

At the very top of the iftop screen is a bar that shows the overall traffic for the interface. Just below that is a column with source IPs followed by a column with destination IPs and arrows between them so you can see whether the bandwidth is being used to transmit packets from your host or receive them from the remote host. After those columns are three more columns that represent the data rate between the two hosts over 2, 10, and 40 seconds, respectively. Much like with load averages, you can see whether the bandwidth is spiking now, or has spiked some time in the past. At the very bottom of the screen, you can see statistics for transmitted data (TX) and received data (RX) along with totals. Like with top, the interface updates periodically.

The iftop command run with no arguments at all is often all you need for your troubleshooting, but every now and then, you may want to take advantage of some of its options. The iftop command will show statistics for the first interface it can find by default, but on some servers you may have multiple interfaces, so if you wanted to run iftop against your second Ethernet interface (eth1), type iftop -i eth1.

By default iftop attempts to resolve all IP addresses into hostnames. One downside to this is that it can slow down your reporting if a remote DNS server is slow. Another downside is that all that DNS resolution adds extra network traffic that might show up in iftop! To disable network resolution, just run iftop with the -n option.

Normally iftop displays overall bandwidth used between hosts, but to help you narrow things down, you might want to see what ports each host is using to communicate. After all, if you knew a host was consuming most of your bandwidth over your web port, you would perform different troubleshooting than if it was connecting to an FTP port. Once iftop is launched, press P to toggle between displaying all ports and hiding them. One thing you'll notice, though, is that sometimes displaying all the ports can cause hosts you are interested in to fall off the screen. If that happens, you can also hit either S or D to toggle between displaying ports only from the source or only from the destination host, respectively. Showing only source ports can be useful when you run iftop on a server, since for many services, the destination host uses random high ports that don't necessarily identify what service is being used, but the ports on your server are more likely to correspond to a service on your machine. You can then follow up with the netstat -lnp command referenced earlier to find out what service is listening on that port.

Like with most Linux commands, iftop has an advanced range of options. What we covered should be enough to help with most troubleshooting efforts, but in case you want to dig further into iftop's capabilities, just type man iftop to read the manual included with the package.

This article is excerpted from the book DevOps Troubleshooting: Linux Server Best Practices by Kyle Rankin, published by Pearson/Addison-Wesley Professional. It is reprinted by permission. Copyright 2013 Pearson Education Inc., all rights reserved.

Our Commenting Policies