Troubleshooting application problems by looking at network traffic

My favorite onsite meetings are those where I help to solve a major problem. A couple of days ago I was visiting a large online gaming company. Most of what they do revolves around a web-based game application which they host. As with a lot of businesses now, their web presence is their main source of income, and any downtime costs money.

One of the reasons for my visit was that their main web servers were slowing down intermittently. The server and application teams were monitoring CPU, memory and other local resources. During the slowdowns, CPU usage would shoot up. Log files were not helping due to the thousands of transactions that the servers were processing. When they tried to switch on extra logging, it just slowed things down even more.

I recommended that they step back from the servers and look at what was connecting to them. We used port mirroring on the network switches to take a copy of all traffic going to and from the web servers. I discussed ways of monitoring your network using this technology in an earlier blog post. This copy of the web traffic allowed us to see what was happening without interfering with online gamers. I sent the data to a traffic analysis system. If you are interested in these traffic analysis systems, they range from free packet capture applications that run on a PC or laptop right up to systems that perform deep packet inspection (DPI) on this data.

Josh Stevens over on the EtherGeek blog has an interesting post on layer 4 of the OSI model. This is the layer where I focused my efforts at this stage. I only looked at TCP traffic where the destination port was 80, which is typical of web application traffic. If your application uses a different port, you can normally find the port information via regular search engines.
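To make the filter concrete, here is a minimal Python sketch of the same idea. It is not the tool I used on site; it assumes you have already exported per-packet summaries from your capture, and the `PacketSummary` record, its field names, and the sample addresses are all hypothetical.

```python
from dataclasses import dataclass

TCP = 6  # IANA-assigned IP protocol number for TCP


@dataclass(frozen=True)
class PacketSummary:
    # Hypothetical record: one row per captured packet.
    src_ip: str
    dst_ip: str
    protocol: int   # IP protocol number (6 = TCP, 17 = UDP)
    dst_port: int


def web_requests(packets, port=80):
    """Keep only TCP packets whose destination port matches `port`."""
    return [p for p in packets if p.protocol == TCP and p.dst_port == port]


# Made-up sample data: only the first packet is TCP to port 80.
sample = [
    PacketSummary("10.0.0.5", "192.0.2.10", TCP, 80),
    PacketSummary("10.0.0.5", "192.0.2.10", 17, 53),      # UDP, dropped
    PacketSummary("10.0.0.9", "192.0.2.10", TCP, 443),    # other port, dropped
    PacketSummary("192.0.2.10", "10.0.0.5", TCP, 51514),  # server reply, dropped
]

filtered = web_requests(sample)
print(len(filtered), "of", len(sample), "packets are TCP to port 80")
```

Filtering on the destination port alone keeps only the client-to-server direction, which is enough when the question is who is talking to the servers.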

The next step was to look at who was connecting to the web servers. What became obvious very early on was that one client was responsible for most of the connections to the servers. The client in this case is defined as the system that established the connection to the application server. I began to suspect a denial-of-service (DoS) attack, so I moved on to taking a closer look at the web traffic. My DPI system had a web protocol decoder built in. If you don't have something like this, you can take a look at the packet content for traffic on port 80 or whatever port number your application uses.
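If your tool does not give you a ready-made view of connections per client, the count is easy to approximate yourself. The sketch below is one way to do it in Python, under some assumptions: the input is a hypothetical list of `(src_ip, syn, ack)` tuples taken from TCP headers, a packet with SYN set and ACK clear is treated as a new connection attempt from that client, and the 50% threshold is an arbitrary choice for illustration. Retransmitted SYNs would be counted more than once.

```python
from collections import Counter


def connections_per_client(events):
    """Count connection attempts per source IP.

    `events` is an iterable of (src_ip, syn, ack) tuples. A SYN without
    ACK is the first packet of the TCP handshake, so its sender is the
    client that established the connection.
    """
    counts = Counter()
    for src_ip, syn, ack in events:
        if syn and not ack:
            counts[src_ip] += 1
    return counts


def dominant_client(counts, threshold=0.5):
    """Return the busiest client if its share exceeds `threshold`, else None."""
    total = sum(counts.values())
    if total == 0:
        return None
    ip, n = counts.most_common(1)[0]
    return ip if n / total > threshold else None


# Made-up sample: one client opens 8 of the 10 connections.
events = [("10.0.0.5", True, False)] * 8 + [
    ("10.0.0.7", True, False),
    ("10.0.0.8", True, False),
    ("192.0.2.10", True, True),  # SYN-ACK from the server, ignored
]

counts = connections_per_client(events)
print(counts.most_common(3))
print("dominant client:", dominant_client(counts))
```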

As soon as I started looking at the web pages being accessed, I could see that one single page was being accessed over and over. The page itself was something like stress-test.cgi. Immediately we had the answer to what was causing the problem. The IP address was traced back to an employee. When asked about the traffic and the page being accessed, the employee confirmed that he was running stress tests. He had no idea about the impact this was having on the performance of the web servers.
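Without a web protocol decoder, you can get a similar view by reading the request line at the start of each HTTP payload. The following Python sketch shows the idea under stated assumptions: the traffic is unencrypted HTTP/1.x, each payload starts at the beginning of a request, and the input is a hypothetical list of `(client_ip, payload)` pairs. The page name and addresses are illustrative only.

```python
from collections import Counter


def request_path(payload: bytes):
    """Return the path from an HTTP/1.x request line, or None if absent."""
    line = payload.split(b"\r\n", 1)[0]
    parts = line.split(b" ")
    # A request line has the form: METHOD SP PATH SP HTTP/x.y
    if len(parts) != 3 or not parts[2].startswith(b"HTTP/"):
        return None
    return parts[1].decode("ascii", errors="replace")


def top_pages(requests, n=3):
    """Count (client, path) pairs and return the `n` most frequent."""
    counts = Counter()
    for client_ip, payload in requests:
        path = request_path(payload)
        if path is not None:
            counts[(client_ip, path)] += 1
    return counts.most_common(n)


# Made-up sample: one client requests the same page repeatedly.
stress = b"GET /stress-test.cgi HTTP/1.1\r\nHost: example.com\r\n\r\n"
normal = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
requests = [("10.0.0.5", stress)] * 5 + [
    ("10.0.0.7", normal),
    ("10.0.0.7", b"\x16\x03\x01"),  # not an HTTP request, skipped
]

for (client, path), hits in top_pages(requests):
    print(client, path, hits)
```

Grouping by client and page together, rather than by page alone, is what makes a single source hammering a single URL stand out.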

In summary, I would suggest that you sometimes need to step back and get greater visibility of an application problem. Too much logging on application servers can sometimes add to the problem. Look at who is connecting to your applications and drill down to the detail. It can be easier to spot some problems by monitoring the network switch ports connecting your application servers.

Darragh

Darragh Delaney is head of technical services at NetFort Technologies. As Director of Technical Services and Customer Support, he interacts on a daily basis with NetFort customers and is responsible for the delivery of a high quality technical and customer support service.

Copyright © 2011 IDG Communications, Inc.
