The Answer’s in the Packets


by Adam White | 06.26.2013
Categories: Blog

“The NETWORK is slow today.”

Feel free to fill in whatever term your organization’s users prefer: “The Server,” “The SharePoint Site,” “Google,” and my personal favorite, “The Internet!”

When users experience performance issues, it’s common to blame the network. True, network performance can sometimes be a factor, but often it isn’t. What tools do you have at your disposal to help you get to the root cause of a user-facing performance issue?

As network engineers at Emergent Networks, we see a wide variety of performance issues from our customers, each of which requires a logical, systematic approach to troubleshooting.

Recently a customer experienced an interesting issue with web browsing performance. Their users began reporting that pages on all websites would frequently (but not always) take thirty seconds or more to load. Sometimes certain elements of a page would fail to load, and sometimes the browser would time out and fail to load the page at all.

After hearing reports like these we always try to recreate the issue to see the symptom first-hand. So we logged into a server at the customer’s headquarters and tried to load google.com. Sure enough, after refreshing the page three times successfully, the fourth refresh of the page took almost twenty seconds to load. Hmm…an issue with Google’s web farm today? Doubtful.
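As an aside, a symptom like this can also be reproduced and timed outside the browser with a few lines of script. The sketch below is purely illustrative (the URL and the number of attempts are placeholders); at the time we simply refreshed the page in a browser.

```python
# Illustrative reproduction sketch: fetch a page several times and record how
# long each request takes, so a slow outlier stands out immediately.
import time
import urllib.request

URL = "http://www.google.com/"   # placeholder target

for attempt in range(1, 5):
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=60) as resp:
            status = resp.status
            resp.read()
        print(f"Attempt {attempt}: HTTP {status} in {time.time() - start:.2f} s")
    except OSError as exc:
        print(f"Attempt {attempt}: failed after {time.time() - start:.2f} s ({exc})")
```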

The next step after recreating the issue was to perform packet captures of the issue so we could analyze them using third-party tools.

We took two simultaneous packet captures of this issue. The first was taken with Wireshark installed directly on the server we were initiating the page refreshes from. The second was taken on the firewall (a Juniper SRX240 in this case), capturing all traffic ingressing and egressing the ISP-facing interface. We also set up a 1:1 static NAT to source-translate the server’s private IP to an unused public IP so that we could easily find and filter its traffic later.
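(A targeted capture like this can also be taken from any host in the path with a scriptable tool. The sketch below, using Scapy, is only an illustration; the interface name is a placeholder, and the actual firewall capture was taken on the SRX itself.)

```python
# Illustrative capture sketch with Scapy: sniff only traffic to/from the
# statically NATed public IP and save it to a pcap for later analysis.
from scapy.all import sniff, wrpcap

PUBLIC_IP = "5.12.44.23"          # the 1:1 static NAT address for the test server
packets = sniff(iface="eth0",     # capture interface (assumed name)
                filter=f"host {PUBLIC_IP}",   # BPF filter applied at capture time
                timeout=120)      # capture window, in seconds

wrpcap("firewall_external.pcap", packets)
print(f"Captured {len(packets)} packets")
```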

Once we had the capture files we used a powerful reporting tool from Riverbed called Cascade Pilot. Think of Pilot as a reporting engine for Wireshark.

Let’s walk through the troubleshooting steps we took to narrow down the root cause of this problem. (IPs have been obfuscated to protect the innocent.)

First we load the two capture files (Client.pcapng and Firewall External.pcap) into Pilot. Then we apply the “IP Conversations” view to each of them.

[Figure: IP conversations between the client (large dot on the left) and all of the various IPs it was communicating with]

Note the highlighted conversation between our client (10.0.0.207) and google.com (74.125.225.115). Let’s see what the IP conversations look like from the perspective of the outside interface on the firewall. In the firewall capture we expect to see mostly conversations between the firewall’s main source-NAT IP (used for all client internet access) and the public IPs of various websites.

[Figure: IP conversations seen from the ISP-facing interface of the firewall]

And indeed that’s exactly what we see here. The large dot on the right is the IP that the firewall is source-NATing its clients’ internet traffic to. Most of the other public IPs are the various websites the clients are visiting. The large dot on the left, in case you’re wondering, is my IP; that large conversation was an RDP session I had open at the time.

Recall that I set up a 1:1 static NAT for the client (10.0.0.207) to translate its traffic to an unused public IP (5.12.44.23). This makes it easy to find the conversation we’re interested in. I’ve highlighted the conversation between our server and google.com above.
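The “IP Conversations” view is essentially a per-host-pair tally of packets and bytes. For readers without Pilot, a rough equivalent can be pulled from a capture file with a short script; the sketch below (using Scapy against the client capture from this walkthrough) is an illustration, not the tool we used.

```python
# Rough, script-only equivalent of an "IP Conversations" view: tally packets
# and bytes per source/destination IP pair in a capture file.
from collections import Counter
from scapy.all import rdpcap, IP

pkts = rdpcap("Client.pcapng")        # capture taken on the client
pair_packets = Counter()
pair_bytes = Counter()

for pkt in pkts:
    if not pkt.haslayer(IP):
        continue
    pair = tuple(sorted((pkt[IP].src, pkt[IP].dst)))   # direction-agnostic key
    pair_packets[pair] += 1
    pair_bytes[pair] += len(pkt)

for (a, b), count in pair_packets.most_common(10):
    print(f"{a:>15} <-> {b:<15} {count:6d} pkts {pair_bytes[(a, b)]:9d} bytes")
```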

Let’s drill into the TCP sessions in each of these conversations side by side. To do this we apply a view called “TCP Sequence Diagram.” What we expect to see is four separate instances of refreshing the page at www.google.com. Below is a table of (roughly) when I initiated each refresh and the outcome I saw in the browser:
Refresh Attempt | Refresh Initiated At | Perceived Page Load Time
1               | 0 sec                | OK (<1 sec)
2               | 15 sec               | OK (<1 sec)
3               | 24 sec               | OK (<1 sec)
4               | 33 sec               | Browser sat there for ~20 sec before displaying the page
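(Timing like this can also be recovered directly from the client capture by grouping packets into connections and measuring each connection’s span on the wire. The sketch below, keyed on the client’s ephemeral port and using the IPs from this example, is an illustrative approximation; it measures total time on the wire rather than perceived page load time.)

```python
# Illustrative sketch: per-TCP-connection timing (first packet to last packet)
# from the client capture, grouping packets by the client's ephemeral port.
from scapy.all import rdpcap, IP, TCP

CLIENT = "10.0.0.207"
SERVER = "74.125.225.115"   # the google.com address seen in this capture

sessions = {}               # client ephemeral port -> [first_time, last_time]
for pkt in rdpcap("Client.pcapng"):
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    if {pkt[IP].src, pkt[IP].dst} != {CLIENT, SERVER}:
        continue
    # key on whichever TCP port belongs to the client
    port = pkt[TCP].sport if pkt[IP].src == CLIENT else pkt[TCP].dport
    first, last = sessions.setdefault(port, [pkt.time, pkt.time])
    sessions[port] = [min(first, pkt.time), max(last, pkt.time)]

for port, (first, last) in sorted(sessions.items(), key=lambda kv: kv[1][0]):
    print(f"client port {port}: {float(last - first):.2f} s on the wire")
```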

Notice how these page loads appear in the TCP sequence diagram below.

[Figure: TCP sequence diagram of the four sessions between the client and google.com, as seen from the perspective of the client]

You can see that the first three page refreshes (each represented by a blue horizontal line) finished very quickly. However, the fourth refresh (initiated at approximately the 33-second mark) resulted in several TCP retransmits (red lines) over about 20 seconds. We didn’t end up getting our page back until the 54-second mark.
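(Retransmissions like these can also be spotted programmatically. The sketch below is a simplified heuristic, not a substitute for Wireshark’s analysis: it flags any data-carrying segment whose flow and sequence number have already been seen, which catches straightforward retransmits but not every corner case.)

```python
# Simplified retransmission check: flag data-carrying TCP segments whose
# (flow, sequence number) pair has already appeared earlier in the capture.
from scapy.all import rdpcap, IP, TCP

seen = set()
retransmits = 0
for pkt in rdpcap("Client.pcapng"):
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    if len(pkt[TCP].payload) == 0:
        continue                      # skip pure ACKs, SYNs, FINs, etc.
    key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
    if key in seen:
        retransmits += 1
        print(f"possible retransmit at {float(pkt.time):.3f} s, seq={pkt[TCP].seq}")
    seen.add(key)

print(f"{retransmits} likely retransmitted segments")
```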

Now let’s look at these sessions from the capture that was taken from the firewall.

[Figure: The four page loads as seen from the ISP-facing interface of the firewall]

Interestingly, we don’t see the TCP retransmits leaving the firewall during the fourth session. Let’s use Cascade’s built-in Wireshark integration to send just these packets to Wireshark for further analysis. We highlight the last session and then right-click to send to Wireshark.

[Figure: Sending the highlighted session from Cascade Pilot to Wireshark]

Next we filter out everything except the specific TCP stream we’re interested in (tcp.stream == 2).
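(The same Wireshark display filter can be applied from a script as well. The sketch below uses pyshark, which drives tshark under the hood; the filename and stream number match this example, but it is just an illustration of the filter, not part of the original workflow.)

```python
# Illustrative use of a Wireshark display filter from Python via pyshark
# (a wrapper around tshark): print each packet in TCP stream 2.
import pyshark

cap = pyshark.FileCapture("Firewall External.pcap",
                          display_filter="tcp.stream == 2")
for pkt in cap:
    print(pkt.sniff_time, pkt.ip.src, "->", pkt.ip.dst, pkt.highest_layer)
cap.close()
```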

[Figure: Side-by-side comparison of the client capture (left) and the firewall capture (right)]

A side-by-side comparison of the client capture and the firewall capture reveals some interesting details. We used the “IP Identifier” field (ip.id) in the IP header to match packets between the two captures. For clarity I’ve spaced the packets out vertically so that they all appear together in chronological order.

Note that the first three packets are the TCP three-way handshake, which appears in both captures. However, the next packet (IP ID 3776, highlighted in yellow) doesn’t appear in the firewall capture until the 20.456-second mark! Why is it being delayed?
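(Matching packets by IP ID between two captures can also be automated. The sketch below is a rough aid rather than a rigorous tool: IP IDs can repeat, and the two capture hosts’ clocks need to be reasonably in sync for the reported delays to mean anything.)

```python
# Rough sketch: correlate packets between the client and firewall captures by
# the IP Identification field and report how much later each appears outside
# the firewall. IP IDs can repeat, so treat the output as a hint, not proof.
from scapy.all import rdpcap, IP

def ipid_first_seen(path):
    times = {}
    for pkt in rdpcap(path):
        if pkt.haslayer(IP):
            times.setdefault(pkt[IP].id, pkt.time)   # keep the first sighting
    return times

client = ipid_first_seen("Client.pcapng")
firewall = ipid_first_seen("Firewall External.pcap")

for ip_id, t_client in sorted(client.items(), key=lambda kv: kv[1]):
    if ip_id in firewall:
        delay = float(firewall[ip_id] - t_client)
        if delay > 1.0:                              # flag anything held up > 1 s
            print(f"IP ID {ip_id}: appears {delay:.2f} s later at the firewall")
```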

As you can imagine, the browser isn’t going to just wait around for 20 seconds for a reply from google.com, so it starts retransmitting (IP IDs 3807-4003, highlighted in red), but we never see these retransmits leave the firewall. Clearly the firewall is the culprit here. It’s intervening by delaying the HTTP request and blocking the TCP retransmits.

Armed with this information, we took a close look at the firewall and found it was configured to redirect HTTP requests to a Websense content-filtering server and then wait for a block/allow response from that server. We handed this information to the in-house Websense experts, and they were able to resolve the issue on their end. The “network” was exonerated!

This is a classic example of how an understanding of transport (TCP) and application (in this case HTTP) protocols, combined with the tools to analyze packet captures (Cascade Pilot and Wireshark), can speed up troubleshooting of complex issues and isolate root causes that might otherwise remain hidden.