The art of network troubleshooting

In many ways, the art of networking today is quite different from when I entered the field about 17 years ago as a part-time student assistant in the networking group at the university where I was working on my undergraduate degree. At that time, the network group had begun to dabble in a new technology called Ethernet, which promised to deliver data at the stunning speed of 10Mbit/sec. via coaxial cable (Thinnet, also known as 10Base2) between clients and servers.

While the technology has changed, the basic methods for troubleshooting networks really haven't. Sure, there are fancier sniffers, analyzers and monitors, but the real basics that demand an understanding of networking to the core level remain the same. Let me present a case history from the days of Thinnet to illustrate.

Lessons from history

Early in my career, while interviewing for a network engineering position, my potential supervisor and I engaged in a mutual storytelling session. I had recently solved a problem in which users in a large building were experiencing severe network latency, and I felt that sharing this experience in the interview would help demonstrate my networking skills.

The design was standard networking for the time: a Thinnet backbone running up one closet stack of the 10-story building, continuing along the ceiling of the top floor, and then back down the closet stack on the other side of the building. The backbone was connected to the campus network via a 10Mbit/sec. transceiver connected to a multimode fiber circuit that terminated at a Wellfleet router.

Latency occurred only when the network experienced a significant load, usually beginning shortly after 8 a.m. and ending at around 5 p.m. I plugged a Network General sniffer, which was the size of a small suitcase and boasted a screaming 486 processor, into a tap on the Thinnet backbone and observed multiple late collisions. This led me to conclude there was a physical layer problem.

The most basic method for testing the physical integrity of a Thinnet segment was to ensure proper cable termination. Thinnet required a 50-ohm resistor at each end of the cable to absorb the signal's energy and prevent it from reflecting back along the cable. Removing the terminator at one end and using a digital multimeter to check continuity and resistance was an intrusive but effective method of determining the backbone's health. This test returned positive continuity and a resistance over 50 ohms, consistent with baseline observations of similar Thinnet segments yet on the high side for the estimated cable length.
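The multimeter test lends itself to a rough calculation. The sketch below assumes an ideal 50-ohm terminator and a cable loop resistance of about 50 milliohms per meter, a figure recalled from typical Thinnet cable specifications rather than taken from this case. It ignores connector and tap resistance, and the function names and the sample reading are hypothetical.

```python
# Rough model of the multimeter test: with one terminator removed, the meter
# at the open end sees the far terminator plus the resistance of the cable.

TERMINATOR_OHMS = 50.0       # nominal Thinnet terminator
LOOP_OHMS_PER_METER = 0.05   # ASSUMPTION: approximate loop resistance of the coax
MAX_SEGMENT_METERS = 185.0   # 10Base2 maximum segment length


def expected_reading(length_m: float) -> float:
    """Resistance expected at the open end of a healthy segment of this length."""
    return TERMINATOR_OHMS + LOOP_OHMS_PER_METER * length_m


def estimated_length(reading_ohms: float) -> float:
    """Invert the model to estimate cable length from a measured resistance."""
    return (reading_ohms - TERMINATOR_OHMS) / LOOP_OHMS_PER_METER


if __name__ == "__main__":
    reading = 62.0  # hypothetical meter reading, not a value from the case history
    length = estimated_length(reading)
    print(f"Estimated length: {length:.0f} m")
    if length > MAX_SEGMENT_METERS:
        print("Segment may exceed the 10Base2 length limit")
```

A reading noticeably above what expected_reading() gives for the documented length is a hint, not proof, that the cable is longer than the documentation says.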

Collisions were a normal aspect of half-duplex Ethernet, but late collisions usually indicated that either a station was not employing the CSMA/CD algorithm properly or a segment exceeded the maximum allowed length. The higher resistance reading pointed to the latter. Consultation with the network engineer who was responsible for this particular network design verified that the backbone had indeed recently been extended at the furthest end to accommodate network connectivity for new offices.
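The difference between a normal and a late collision can be put in numbers. On 10Mbit/sec. Ethernet, a collision is expected to be detected within the first 512 bit times (64 bytes) of a transmission; one detected after that point is considered late. The following is a minimal sketch of that rule, with hypothetical function names and made-up sample offsets, not a reconstruction of what the sniffer did internally.

```python
SLOT_TIME_BITS = 512     # collision window for 10Mbit/sec. Ethernet (64 bytes)
BIT_RATE = 10_000_000    # bits per second


def slot_time_us() -> float:
    """Length of the collision window in microseconds."""
    return SLOT_TIME_BITS / BIT_RATE * 1e6


def is_late_collision(bytes_sent_before_collision: int) -> bool:
    """A collision is late if it occurs after the collision window has passed."""
    return bytes_sent_before_collision * 8 > SLOT_TIME_BITS


def count_late_collisions(offsets_bytes) -> int:
    """Count the late collisions in a list of per-collision byte offsets."""
    return sum(1 for offset in offsets_bytes if is_late_collision(offset))


if __name__ == "__main__":
    sample_offsets = [12, 40, 70, 200, 63]  # hypothetical values for illustration
    print(f"Collision window: {slot_time_us():.1f} microseconds")
    print(f"Late collisions: {count_late_collisions(sample_offsets)}")
```

On a healthy segment the late count should be zero, which is why even a handful of late collisions in a trace deserves attention.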

The solution seemed simple: shrink the size of the collision domain. Since physically reducing the length of the Thinnet backbone wasn't possible, I opted to divide the backbone into two collision domains by inserting an Ethernet bridge (essentially a two-port Ethernet switch) at the midpoint of the backbone, in one of the communications rooms on the 10th floor. After I made the necessary connections, network performance immediately improved, allowing me to not miss my regular 10:30 coffee break.

I chose this story not because I enjoy reminiscing about older Ethernet technology (although I do) but to demonstrate that the basic approach to solving a network performance problem has not changed since then. While the steps that follow may not include every action required to solve a particular problem, they do represent most of the basics that should be attempted.

Understand the topology

In my example, my first action was to understand the building's network topology. Whether it's a single Thinnet backbone feeding repeaters daisy-chained together or a fiber star topology in which all intermediate distribution frames are connected via fiber optics to a centrally located switch in the main distribution frame, without understanding the overall topology, there is no way to determine where to start the troubleshooting.

Gather information

Next, I gathered as much detail about the problem as I could from those experiencing the problem. This sort of "social networking," if you will, is invaluable. After all, the end goal is to solve the user's access problems (and close that help desk call). Since users will almost always place any sort of access blame on "the network," it is the duty of the network professional to discern whether or not the problem is really network-related and, if so, gather information to help determine the cause.

Have the right tools

There is no greater waste of time than arriving on site and not having the proper tools to diagnose and fix a problem. A well-stocked tool kit, proper documentation, a laptop, serial cable, network equipment passwords, communication room keys, the trouble ticket and a sniffer are a few of the items that should accompany every network professional on a trouble call.

Use the sniffer

More often than not, it's necessary to see what is "on the wire" by doing a packet analysis of the traffic, if only to determine which direction is most likely to lead to a rapid solution. Then and now, a sniffer (be it tcpdump, Wireshark or a commercially available sniffer) coupled with knowledge is a networker's best friend. Unfortunately, this is a critical step that many seem to ignore or not employ early in the troubleshooting process. Knowing where to place the sniffer is also critical. In this instance, I knew the problem was building-wide, so I needed to perform a traffic analysis on the backbone.

Examine the trace

The trace (the output of a sniffer) will most likely define the next step, so long as one understands the basics of reading a trace and knows from previous baseline measurements what a trace of normal traffic on the network looks like. Whether the problem is a broadcast storm, an infected computer, a machine hogging bandwidth or, as in this case, related to the physical media, without sniffing, you are left guessing at causes.
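Comparing a trace against a baseline can be as simple as checking a few counters. The sketch below assumes the sniffer's summary statistics for comparable time periods have already been exported into plain dictionaries. The counter names, the sample values and the threshold factor are all hypothetical and would need to match whatever the actual tool reports.

```python
def find_anomalies(baseline: dict, current: dict, factor: float = 2.0) -> list:
    """Return counters that are well above baseline or that were zero at baseline."""
    anomalies = []
    for name, value in sorted(current.items()):
        base = baseline.get(name, 0)
        if base == 0:
            # Anything that never appeared in the baseline is worth a look.
            if value > 0:
                anomalies.append(name)
        elif value > base * factor:
            anomalies.append(name)
    return anomalies


if __name__ == "__main__":
    # Hypothetical per-hour counters, for illustration only.
    baseline = {"collisions": 120, "late_collisions": 0, "broadcasts": 300}
    current = {"collisions": 900, "late_collisions": 45, "broadcasts": 320}
    for counter in find_anomalies(baseline, current):
        print(f"Investigate: {counter}")
```

The point is less the code than the habit: without a recorded baseline there is nothing to compare against, and every counter looks equally suspicious.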

Ensure standards compliance

Having diagnosed a physical layer problem, I traced the cause of the errors to excessive cable length. The IEEE standards exist for a reason, and if one component of the complex communication process between computers is not compliant with those standards, all bets are off.

Determine recent changes

Cables do not grow longer by themselves, so I employed the next classic troubleshooting tool, which is determining what has changed from the days it worked fine to the state of inoperability. Sometimes this step should be done earlier in the process, and in my example, I had consulted the network documentation as part of attempting to understand the topology before troubleshooting the problem. However, the network engineering staff had not yet added the recent backbone extension to the network documentation. Documentation should, but does not always, reflect reality.

Maintain an adequate spares inventory

When an adequate inventory of equipment spares is available, the time needed to replace malfunctioning equipment or, as in this case, re-engineer a network is greatly reduced. Maintenance contracts are essential, but for critical situations, waiting for a part and a technician to arrive on site can mean costly delays.

Choose the correct solution

Certainly, one alternative was to reduce the length of the backbone by breaking the building into two subnets, installing another fiber link to a new router interface and the associated network hardware, and manually changing all of the clients' network information (this was in the days before widespread deployment of Dynamic Host Configuration Protocol). However, that would have necessitated downtime measured in days, not hours, and therefore was unacceptable. Thus, it is very important to devise a solution that is not only technically correct but also takes into account the business functions of those being served.

Verify the network operation

It may seem obvious, but the solution implemented must be tested to verify that it indeed solved the problem. Testing involves not only verifying network connectivity and throughput and taking traces but also querying the end users about application performance. The value of good public relations can't be overstated.

It should be evident that basic network troubleshooting methods have not changed even while the technology has. It is reasonable to expect that the basic methods will continue to be valid troubleshooting techniques as the technology continues to evolve. The challenge is to not get lost in the bells and whistles of the latest and greatest network troubleshooting toy. When it comes to networking, there is no panacea.

Incidentally, I was offered and accepted the network engineering position. A few months into my new job, I asked my supervisor what led him to hire me over other candidates. His response was, "Remember that story about the Thinnet backbone?"

Greg Schaffer is a freelance writer based in Tennessee. He has over 15 years of experience in networking, primarily in higher education. He can be reached at

Copyright © 2006 IDG Communications, Inc.
