Fault Tolerance and Network Teaming

 

Anyone who has called Microsoft for help with a networking problem has likely heard the question: “Are you using network teaming?” I have often heard Microsoft’s customers call this a “quick out,” an excuse for Microsoft to pass the responsibility on to someone else. As someone who has been on both ends of the phone, and at the highest escalation point within Microsoft’s networking queues, I can tell you that it is a question born of wisdom and tempered with experience.

While I was working the phones at Microsoft, supporting the largest and most critical systems in the US, it was rare to get a call about the same problem more than once. Rarer still was for everyone in our group to get the same calls and have the same experiences. I recall it happening when we fought the Blaster worm, and when Microsoft’s “Scalable Networking Pack” was released with Windows Server 2003 SP2. These were bad, but after a few months, except for a few stragglers, the phone calls stopped: the world got wise to the issue and the problem was resolved.

I was amazed, though, to take one to three calls a week about network issues CAUSED by network teaming. I could not help but be struck by the irony of a program meant to prevent network failure so often causing it. I talked to colleagues (and I have found no better single source in the industry than the people at Microsoft) and found that even the old-timers, with more than 15 years at the company, told the same stories of problems caused by network teaming that we constantly experience today. I am amazed that an industry as wise and agile as the computer industry has stuck with such a poor technology.

I always asked my customers as they called with problems, usually critical ones, “What is teaming these network cards getting you?” Almost unanimously the answer was fault tolerance, to which I would reply rhetorically, “How often do your NICs or switches fail, and how often has teaming caused a network failure?” In my opinion, it is unforgivable for an application to constantly cause the very problem it was written to avoid. It should give us pause to reflect on whether the technology is well suited to its function, whether it is just poorly written, or whether all of its implementations have similar problems. Technology today is beyond network teaming. There are far better methods of providing fault tolerance, with manual and automatic failover. Most application writers have built in fault tolerance at the service level that supersedes anything network teaming offers. Network teaming should be a dead technology, because it is killing us.

Finally, if you are considering using network teaming, or have had reason to reconsider its use, maybe these questions will help your assessment:

 

What is my goal with using network teaming?

Can I gain availability by using a more capable NIC?

How often have my NICs failed?

When NICs have failed, were they the only failure, or did they fail together with a motherboard or other component that made the service unavailable anyway?

What are my needs for uptime for these services?

Would a manual failover (the simplest of options) be viable for this service?

What options for automatic failover do I have (since most applications can have multiple providers through configuration)?
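
To make that last question concrete, here is a minimal sketch of automatic failover done at the application level rather than at the NIC. It assumes a service that can be reached through any of several configured providers. The names `tcp_probe` and `first_reachable`, and the host names and port in the usage note, are hypothetical and not taken from any particular product.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds within the timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def first_reachable(
    endpoints: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Try the configured providers in order and return the first that answers.

    Returns None when no provider is reachable.
    """
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

A caller might pass a list such as `[("server1.example.com", 443), ("server2.example.com", 443)]` read from the application's configuration, and connect to whatever `first_reachable` returns. The point of the sketch is only that the failover decision lives in the service or client, where a failed NIC, switch, or whole server all look the same and are all handled the same way.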

 

One other note to add. While working the phones at Microsoft, and later as a consultant to large and federal organizations, I found one thing that held true most of the time: when a problem occurred, it was rarely the OS itself at fault, but something foreign to its normal processes. Simplicity and minimalism are among the keys to a healthy server and environment. It is often necessary to introduce other applications and services, but I do not think nearly as often as we do.
Note: Microsoft does not support network teaming, because it does not own the software that provides it. In certain instances, as with OCS, Microsoft flat out will not support the product if teaming is enabled on the server.