CiscoLive 2017 in Las Vegas was a great event. I took full advantage of the opportunity to talk to fellow network geeks about different issues they were dealing with in their networks. One recurring question was “How do I define what failure is when monitoring redundancy?” In particular, monitoring inter-data center redundancy was the focus of quite a few of these discussions. These conversations were fascinating, so I thought I’d take a minute to summarize them and discuss how the Apstra Operating System™ (AOS) can help network engineers monitor redundancy in their networks.
What is redundancy?
There are many ways to build redundancy into your network. When dealing with inter-data center redundancy, how exactly do we achieve it? Do we buy redundant services from the same transport provider? Do we buy parallel services from different providers? The services we employ could be “layer 2,” meaning we, the customer, own all the layer 3 peering and configuration. Or the services could be “layer 3,” in which case we exchange routes with our provider. Perhaps the redundancy is a combination of such services.
What is failure?
Once we have designed our redundancy, how exactly do we define failure? Perhaps we should start by defining what is normal. When all services are operating as expected, are we load-balancing across redundant paths, or do we expect to use one service primarily? This line of thinking doesn’t quite ring true to me, though. It describes the monitoring of individual services, which is worth doing, but it doesn’t say much about redundancy itself.
Failure in redundancy is something slightly more abstract. If one transport service between data centers fails, and the remaining service(s) continue carrying traffic between them as expected, is this a failure? On the contrary, it is evidence that redundancy is working as designed! This is a higher-level way of looking at redundancy, and frankly it is not something captured well, if at all, in existing network monitoring or network automation tools.
Separating Intent and Implementation
This simple example demonstrates the need to separate our intent from the implementation details. In our case, our intent is that network service should be unaffected by the loss of one of our transport services. If one of our transport services fails, that is a failure in and of itself, separate from whether redundancy is working, or will work, as expected. Likewise, if load-balancing is not working as expected, that is an issue separate from redundancy.
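To make that separation concrete, here is a minimal Python sketch. It is not AOS code: the TransportService type, the function names, and the service names are all hypothetical, and it assumes the simplest possible model, in which any single healthy service can carry all traffic between the data centers. The point is only that the health of individual services and the redundancy intent are reported as separate signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportService:
    """Hypothetical record for one inter-data center transport service."""
    name: str
    up: bool


def failed_services(services):
    """Implementation-level signal: names of services that are down."""
    return [s.name for s in services if not s.up]


def connectivity_holds(services):
    """True if at least one service can still carry traffic right now."""
    return any(s.up for s in services)


def can_absorb_single_loss(services):
    """Intent-level signal: would traffic survive losing any one service?

    Assumes any single healthy service can carry the full load, so the
    intent holds exactly when two or more services are up.
    """
    healthy = [s for s in services if s.up]
    return len(healthy) >= 2


def summarize(services):
    """Report each signal separately instead of folding them into one alarm."""
    return {
        "failed_services": failed_services(services),
        "connectivity_holds": connectivity_holds(services),
        "can_absorb_single_loss": can_absorb_single_loss(services),
    }


if __name__ == "__main__":
    services = [
        TransportService("provider_a", up=True),
        TransportService("provider_b", up=False),
    ]
    print(summarize(services))
```

With one of two services down, the sketch flags the failed service, shows that connectivity still holds (redundancy did its job), and separately warns that a further loss would not be absorbed.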
Suppose you are running BGP over two redundant paths in your network. You could periodically check “show ip bgp,” looking for secondary routes with specific attributes that would indicate that, in the event of failure, BGP would reconverge over the redundant path. You could further run ICMP probes between loopback interfaces that are only advertised over the redundant path, ensuring that traffic is actually passing as expected. Effectively, these two pieces of information check the control and forwarding planes, and together they provide assurance that your design will work as expected. With AOS, you can roll this information up into something more consumable via the AOS API. A simple “<url>/checkRedundancy” call to the API could return “True” or “False.”
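The roll-up logic itself is easy to sketch in Python. What follows is an illustration only and does not use the AOS API: every function name and data shape here is an assumption. It also assumes the BGP table has already been parsed into a list of dictionaries and that the ICMP probe counts have already been collected, since the parsing and probing are out of scope. The addresses come from the ranges reserved for documentation.

```python
def control_plane_ok(bgp_paths, prefix, primary_next_hop):
    """True if a valid path to `prefix` exists via a next hop other than the primary."""
    return any(
        path["prefix"] == prefix
        and path["valid"]
        and path["next_hop"] != primary_next_hop
        for path in bgp_paths
    )


def forwarding_plane_ok(probes_sent, probes_received, min_success_ratio=1.0):
    """True if enough probes sent over the redundant path were answered."""
    if probes_sent <= 0:
        return False  # no probes sent means no evidence, so do not report success
    return probes_received / probes_sent >= min_success_ratio


def check_redundancy(bgp_paths, prefix, primary_next_hop,
                     probes_sent, probes_received):
    """Single True/False answer: both planes must pass."""
    return (
        control_plane_ok(bgp_paths, prefix, primary_next_hop)
        and forwarding_plane_ok(probes_sent, probes_received)
    )


if __name__ == "__main__":
    # Hypothetical, already-parsed BGP paths for one remote prefix.
    paths = [
        {"prefix": "203.0.113.0/24", "next_hop": "192.0.2.1", "valid": True},
        {"prefix": "203.0.113.0/24", "next_hop": "198.51.100.1", "valid": True},
    ]
    print(check_redundancy(paths, "203.0.113.0/24", "192.0.2.1",
                           probes_sent=5, probes_received=5))
```

Requiring both checks to pass is a deliberate choice: a secondary route that exists but cannot forward traffic, or a path that forwards traffic but has no usable route in the table, should both count as a redundancy failure.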
The higher-level intent of the redundancy design needs to be modeled, and the relevant state (i.e., telemetry) needs to be collected, in order to validate that the intent of the design is being met. This needs to happen apart from the monitoring of the individual components of the design. This is where Apstra’s AOS platform really shines. AOS comes with libraries and tools designed specifically to help network engineers solve these kinds of problems quickly and easily.