One of the ways I think an engineer can separate themselves from the pack is not in the number of commands they know or how large a network they have worked on, but in how deep their understanding of the protocols and technologies goes. Think about OSPF for a second. How many people do you know who can get a single area up and running? A lot. How many can do multiple areas, multiple types of areas and lots of redistribution? Still a lot of people. Now ask those people to explain a type 4 LSA and you will get a wide range of answers, lots of them wrong. Because in today’s world many people just learn how to do X. They don’t take the time to learn why. Certifications don’t always go to this level and many people only learn what is needed to pass the exam.
So now I am going to tell you a story about some TCP maximum segment lifetime fun we had, as one example of why you really need to understand the protocols. I had saved the packet captures with the intent of doing a detailed blog post, but it has been almost two years since this happened and there is still no tech deep dive, so you must settle for a story.
We had moved a large portion of our key business systems to a new third party, which provided their own private connection into our data centers. We peered with them via BGP using public IPs and then NAT’d to our internal systems. They did the same, so the private WAN was all public IP space and they NAT’d on their end as well. A fairly common setup, so nothing fancy there.
Well, months after moving to this we started to get reports of one app that accessed those remote third-party systems acting odd. Person A would work fine for a few hours, then stop working entirely, while person B sitting next to them had no issues using the same app hitting the same IP. Then for no apparent reason person A would start working again. Very random, but it was limited to a single app using this connection, and we had quite a few others that worked without issue. As it was just a few users on a single app while everything else worked fine, we initially ruled out a local network issue and pushed it back to the third party (who also managed the WAN) as their issue.
Well, they could not find the issue and pushed it back to us. A packet capture was done on our side and it showed a lot of opening and closing of connections. Turns out this app would not keep a connection open. It would open the connection, do what it needed to, then send a TCP reset to close the connection. This process would take around 5 seconds, so if the user was active in the app it would open and close a lot of TCP connections. While that is interesting, and it is debatable whether that is the right way to do it versus something more efficient, it works, so other than using a lot of TCP connections this is not the problem. What we would see when the connection failed is that the client sends a SYN but never gets a SYN/ACK back. Because we could see the traffic leaving our network but never got a response, we again pushed it back to the third party. The catch is they found the same thing: they would see the SYN come in and send a SYN/ACK, but would never get an ACK back from us. So both of us pushed it to the WAN service provider. They did some investigation and showed no packet loss. So all sides had “proof” it was not their issue. What next?
Well, we stepped back and took a closer look at the packet capture. (Here is where the packet capture would really help this blog post.) You would see what I described as well as traffic from all the other sessions. The capture on our side was done outside the firewall, pre-NAT on our side and post-NAT on theirs, so it all appeared as 1 IP going to 1 IP. During this evaluation we noticed something odd. All the issues were on TCP ports of recently closed sessions. For example, the client would send the request, the NAT would send it out on port 55001, it would work, the client would send a TCP reset and move on to the next transaction. What we noticed was that a minute later we would see port 55001 used again. The SYN would go out on port 55001 and this time it would not complete the 3-way handshake. So this narrowed it down a lot.
This is where really understanding TCP and how it works with NAT matters. When you do a many-to-one NAT, each unique connection is assigned a port, and that is one way it keeps track of what traffic belongs to what internal connection. You also have timers here for when to reuse connections, typically double the maximum segment lifetime, commonly referred to as 2MSL. If you don’t understand TCP this is something you will never see looking at a packet capture.
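
Since I don't have the capture to show, here is a rough Python sketch of the pattern we were looking for: a SYN going out on a source port that was closed less than 2MSL earlier. The `PacketEvent` record, the `find_early_reuse` helper, and the sample timestamps are all made up for illustration; they are not from the real capture, and a real capture would first have to be reduced to events like these.

```python
from dataclasses import dataclass


@dataclass
class PacketEvent:
    time: float    # seconds since the start of the capture (illustrative)
    src_port: int  # translated source port seen outside the firewall
    kind: str      # "SYN", "SYN-ACK", or "RST"


def find_early_reuse(events, window_seconds):
    """Return (port, gap) for every SYN sent on a port that was
    closed less than window_seconds before the SYN."""
    last_closed = {}  # port -> time of the most recent RST on that port
    flagged = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "RST":
            last_closed[ev.src_port] = ev.time
        elif ev.kind == "SYN" and ev.src_port in last_closed:
            gap = ev.time - last_closed[ev.src_port]
            if gap < window_seconds:
                flagged.append((ev.src_port, gap))
    return flagged


# Hypothetical events: port 55001 is reused 60 s after it was closed,
# port 55002 is reused well outside a 120 s (2 x 1 minute MSL) window.
events = [
    PacketEvent(0.0, 55001, "SYN"),
    PacketEvent(0.1, 55001, "SYN-ACK"),
    PacketEvent(5.0, 55001, "RST"),
    PacketEvent(6.0, 55002, "SYN"),
    PacketEvent(6.1, 55002, "SYN-ACK"),
    PacketEvent(11.0, 55002, "RST"),
    PacketEvent(65.0, 55001, "SYN"),
    PacketEvent(200.0, 55002, "SYN"),
]

print(find_early_reuse(events, window_seconds=120))  # [(55001, 60.0)]
```

Retransmitted SYNs on the same port would each be flagged, which is fine for spotting the pattern by eye but would need de-duplicating for anything more serious.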
The problem here is in poorly worded RFCs and how vendors implement them. Our vendor had a 1 minute MSL, so if a connection just stopped it would wait 2 minutes before reusing that connection. However, if the connection was closed cleanly, like it was in this case, it would only wait 1 minute and then reuse the connection. The third party’s vendor did not do that. Despite seeing the connection close cleanly, if a new connection came in within the 2MSL window using the same port it would assume it was part of the old connection and thus try to skip over the 3-way handshake. This was the root cause of our issue: vendors doing things differently, and both within the guidelines of the RFC. The fix is rather simple. First, don’t use NAT. Since that is not practical in most cases, we just upped the MSL on our gear and this resolved the issue. Just note this issue is not limited to NAT; it applies to anything that uses the 2MSL to reuse connections. Some will reuse quicker if the connection closed cleanly, others won’t. Both sides need to be timed the same. Also be aware that upping the timer may impact something else, so make sure you understand the consequences of this.
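
To make the mismatch concrete, here is a toy Python model of the two behaviors described above. It is only a sketch: the function names are invented, and the remote side's 1 minute MSL is my assumption for the example (the story only tells you our MSL). No real device exposes these as functions.

```python
MSL_SECONDS = 60  # our gear's 1 minute MSL from the story


def our_nat_hold(closed_cleanly, msl=MSL_SECONDS):
    """Seconds our side waits before reusing a port: one MSL after a
    clean close, 2MSL if the connection just stopped."""
    return msl if closed_cleanly else 2 * msl


def remote_hold(msl=MSL_SECONDS):
    """Seconds the remote side keeps treating the port as the old
    connection: always 2MSL, even after a clean close."""
    return 2 * msl


def reuse_is_safe(local_msl, remote_msl, closed_cleanly=True):
    """True if we wait at least as long as the remote side remembers."""
    return our_nat_hold(closed_cleanly, local_msl) >= remote_hold(remote_msl)


# Original settings: we reuse after 60 s, the remote side remembers for 120 s.
print(reuse_is_safe(local_msl=60, remote_msl=60))   # False

# After upping the MSL on our gear to 2 minutes (remote MSL assumed unchanged).
print(reuse_is_safe(local_msl=120, remote_msl=60))  # True
```

The only point of the model is the comparison in `reuse_is_safe`: whatever the actual values are, the side that reuses ports has to wait at least as long as the other side holds on to the old connection.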
The point of this story is that if you don’t have a deep understanding of protocols, what they are doing and why, things like this are easy to miss. Both of us in this example missed it. We both had packet captures that “proved” our side was not the issue, and we were right. Both of us were following the RFC. The catch was we looked at the surface and did not dig deep enough. The good thing was that both parties, when seeing the other’s packet capture, knew something more was going on and worked together to find the issue. We found it fairly quickly once we got the captures, because both sides understood what to look for; we just missed it the first time by making assumptions. Learn the protocols, learn why they do what they do and not just how. This will make you better at your job, reduce downtime and make everything perform better.
tl;dr: Your career is working on the network so LEARN HOW THE NETWORK REALLY WORKS!
[Sorry that was not as technical as it could be, just wanted to share those thoughts and hopefully give you one more reason to go learn something more about networking to help make your life easier.]