Network resiliency and availability – understanding and optimizing

Network resiliency and availability are two important factors to take into account when designing a production network. In this post, I’ll go over some of the ways to measure them, the different stages of convergence and how to optimize each one.

Network availability vs resiliency, so what is the difference?

Network availability is how often the network is up and working. This is usually expressed as a percentage or as a number of “9’s”; “5 9’s” means 99.999% up-time.

Some common network availability terms that you might see:

Screenshot from 2017-12-24 16-20-16.png
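
The arithmetic behind the “number of 9’s” notation can be sketched in a few lines of Python. This is only an illustration: the function name and the 365-day year are my own assumptions, not something taken from a vendor definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability_pct = 100 * (1 - 10 ** -n)
    print(f"{n} nines ({availability_pct:.3f}%): "
          f"{downtime_minutes_per_year(n):.2f} minutes of downtime per year")
```

With these assumptions, “5 9’s” leaves a budget of a little over five minutes of downtime per year.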

Network resiliency, on the other hand, is how long the network works without breaking. Equipment vendors usually express this as a time or as MTTF (Mean Time To Failure). For example, a vendor might say that a specific type of router will be resilient for up to 4 years or has an MTTF of 4 years.

These two concepts together bring us to what you should really be focusing on: Mean Time To Repair (MTTR). Network MTTR is a straightforward number: it’s the time between when a network outage starts and when service is restored, averaged over outages. This gives us the average time that the network will take to recover from a failure.
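
To see why MTTR deserves the attention, it helps to use the standard steady-state relationship availability = MTTF / (MTTF + MTTR). The sketch below reuses the 4-year MTTF from the example above; the MTTR values are made up purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years (assumption)


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


mttf = 4 * HOURS_PER_YEAR  # the 4-year MTTF from the router example

# Hypothetical repair times: a day, four hours, half an hour.
for mttr in (24.0, 4.0, 0.5):
    print(f"MTTR {mttr:>4} h -> availability {availability(mttf, mttr) * 100:.4f}%")
```

With the MTTF fixed by the hardware, the only lever left to the designer is MTTR, which is why the rest of this post is about convergence time.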

Good, so now we know how to express availability and resiliency, but how do we measure a network outage?

When a network outage happens and traffic needs to be rerouted onto a backup path, four steps need to occur and be measured in order to determine the overall convergence time:

  1. Detection – How long does it take to discover that there was a failure?
  2. Announcement – How long does it take to report back that there was a failure (spread the information)?
  3. Calculation – How long does it take to calculate a new path?
  4. Install – How long does it take to install this new path in the FIB/RIB?

These four steps added together give us the time it takes for a network to converge after a failure. As network designers, we want to make sure that we optimize each of these steps. Let’s take a look at the tools available to us for each stage:
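
As a rough way of keeping track of the four stages, the convergence time can be modeled as a simple sum. The class and field names below are my own, and the numbers are hypothetical, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ConvergenceBudget:
    """Time spent in each convergence stage, in milliseconds."""
    detection_ms: float
    announcement_ms: float
    calculation_ms: float
    install_ms: float

    def total_ms(self) -> float:
        """Overall convergence time: the four stages added together."""
        return (self.detection_ms + self.announcement_ms
                + self.calculation_ms + self.install_ms)

    def dominant_stage(self) -> str:
        """Name of the stage that contributes the most, i.e. where to optimize first."""
        stages = vars(self)
        return max(stages, key=stages.get)


# Hypothetical figures for a single failure event.
budget = ConvergenceBudget(detection_ms=150, announcement_ms=50,
                           calculation_ms=30, install_ms=200)
print(budget.total_ms(), budget.dominant_stage())
```

Measuring each stage separately like this shows which of the tools below is worth reaching for first.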

Detection:

  • Carrier detection settings: In most cases, network devices are directly connected and the PHY will detect the failure, so only the carrier detection settings need to be tuned.
  • Protocol hello timers: If the devices are not directly connected (for example going through a switch or through an L2VPN), you can modify the protocol timers to decrease detection time. By default, most protocols have long keepalive/hello timers (often seconds to tens of seconds), so it is recommended to adjust these if you cannot rely on carrier detection.
  • BFD: BFD is usually the most reliable of these options because, on many platforms, its keepalives are handled in hardware. Be careful not to set its timers too aggressively in a network with packet loss (for example wireless back-haul networks), as this could cause false failure detections and excessive churn.
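
To put some numbers on the detection stage: for both protocol hellos and BFD, detection time is roughly the packet interval multiplied by the number of missed packets tolerated. The sketch below compares a commonly seen OSPF default (10-second hello, dead interval of four hellos) with an illustrative BFD setting of 50 ms x 3. It also shows why aggressive timers hurt on lossy links, under the simplifying assumption that packet losses are independent. Function names are my own.

```python
def detection_time_ms(interval_ms: float, multiplier: int) -> float:
    """Approximate time to declare a neighbor down: `multiplier` missed packets in a row."""
    return interval_ms * multiplier


def false_down_probability(loss_rate: float, multiplier: int) -> float:
    """Chance that `multiplier` consecutive packets are all lost.

    Assumes independent loss, which is optimistic for bursty links.
    """
    return loss_rate ** multiplier


print("OSPF hello 10 s x 4:", detection_time_ms(10_000, 4), "ms")
print("BFD 50 ms x 3:      ", detection_time_ms(50, 3), "ms")

# On a link with 5% loss (hypothetical), a lower multiplier detects faster
# but is much more likely to declare a healthy neighbor down.
for multiplier in (2, 3, 5):
    print(multiplier, false_down_probability(0.05, multiplier))
```

The false-down probability looks tiny per detection window, but with a 50 ms interval there are twenty packets per second, so on a lossy link it adds up quickly.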

Announcement:

  • LSA/LSP throttling timers: By default, OSPF/IS-IS implementations throttle LSA/LSP generation to prevent excessive flooding. Lowering these timers reduces convergence time.
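
Throttling of this kind is commonly configured with three values: an initial delay, a hold time that doubles on each successive event, and a maximum wait. The sketch below is a generic model of that behavior, not any vendor's exact algorithm. The parameter names are hypothetical, and it assumes every new event arrives while the network is still unstable (real implementations fall back to the initial delay after a quiet period).

```python
def throttle_delays(start_ms: int, hold_ms: int, max_ms: int, events: int) -> list[int]:
    """Wait applied before each of `events` consecutive triggers during instability."""
    delays = []
    wait = hold_ms
    for i in range(events):
        if i == 0:
            # The first event after a quiet period only waits the initial delay.
            delays.append(start_ms)
        else:
            # Later events back off exponentially, capped at the maximum wait.
            delays.append(min(wait, max_ms))
            wait *= 2
    return delays


print(throttle_delays(start_ms=50, hold_ms=200, max_ms=5000, events=7))
# -> [50, 200, 400, 800, 1600, 3200, 5000]
```

Lowering the initial delay speeds up the reaction to an isolated failure, while the hold and maximum values still protect the network during a flap.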

Calculate:

  • iSPF/PRC: The use of incremental SPF or partial route computation can have a huge impact on convergence time in certain topologies and for specific types of failure when running a link-state routing protocol. These techniques work by recalculating routes without running a full SPF, or by running it only partially. iSPF and PRC are often not activated by default in OSPF implementations, so I would highly recommend looking into these features if high availability is a design goal.
  • SPF throttling: In some implementations, an exponential backoff algorithm is used when scheduling SPF runs in order to avoid excessive calculations in periods of instability. This is the same principle as the LSA/LSP throttling in the “Announcement” stage, and these timers can likewise be reduced to improve convergence times.
  • FRR/LFA: One can also pre-calculate the backup paths with technologies like FRR and IP LFAs/Remote LFAs. This is the best way to ensure low convergence times in link-state protocols because, without it, the calculate stage will only be as fast as the slowest node in the network.
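
To make the pre-calculation idea more concrete, here is a minimal sketch of the basic loop-free alternate check from RFC 5286: a neighbor N of router S is a safe backup next hop towards destination D if dist(N, D) < dist(N, S) + dist(S, D), meaning N will not send the traffic back through S. The topology, node names and costs are invented for illustration, and the code assumes a connected graph with symmetric link costs.

```python
import heapq


def dijkstra(graph: dict, source: str) -> dict:
    """Shortest-path cost from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def loop_free_alternates(graph: dict, s: str, d: str) -> list:
    """Neighbors of `s` usable as backup next hops towards `d`."""
    dist_s = dijkstra(graph, s)
    alternates = []
    for n, link_cost in graph[s].items():
        dist_n = dijkstra(graph, n)
        if link_cost + dist_n[d] == dist_s[d]:
            continue  # n is already a primary next hop
        if dist_n[d] < dist_n[s] + dist_s[d]:  # RFC 5286 loop-free condition
            alternates.append(n)
    return alternates


# Hypothetical topology: S reaches D via A (cost 2); B is the candidate backup.
graph = {
    "S": {"A": 1, "B": 1},
    "A": {"S": 1, "D": 1},
    "B": {"S": 1, "D": 2},
    "D": {"A": 1, "B": 2},
}
print(loop_free_alternates(graph, "S", "D"))  # -> ['B']
```

If the B-D cost is raised to 3, B's own shortest path to D ties with the path back through S, the strict inequality no longer holds, and S has no LFA. This is the kind of coverage gap Remote LFA is meant to address.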

Install:

  • Summarization: The install stage is sometimes overlooked, but optimizing it can contribute the most to reducing convergence time in networks with a large number of prefixes. Summarization helps, but sometimes comes with the trade-off of suboptimal routing due to topology hiding.
  • Prefix suppression: In most typical network cores, the transit link prefixes are not needed because loopback addresses are used. Prefix suppression removes the transit link prefixes and is another way to reduce the number of prefixes that need to be installed in the RIB/FIB.
  • Priority-driven RIB: This feature is not available on all platforms, but it can also help reduce convergence time by inserting the most important prefixes into the forwarding table first. For example, you can configure it so that the PE /32 prefixes are inserted in the FIB first.
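
As a small illustration of how summarization shrinks the number of prefixes to install, Python's standard `ipaddress` module can collapse contiguous networks into their covering prefix. The prefixes are made up, and note that `collapse_addresses` only merges networks that aggregate exactly. A hand-configured summary on a router may also cover address space that isn't in use.

```python
import ipaddress


def summarize(prefixes: list[str]) -> list[str]:
    """Collapse contiguous prefixes into the smallest equivalent set."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


# Hypothetical routes: four contiguous /24s plus one PE loopback.
routes = [
    "10.0.0.0/24",
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24",
    "192.0.2.1/32",
]
summary = summarize(routes)
print(f"{len(routes)} prefixes -> {len(summary)}: {summary}")
# -> 5 prefixes -> 2: ['10.0.0.0/22', '192.0.2.1/32']
```

The four /24s become a single /22 to install, while the /32 loopback is left untouched.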

 

One last point: people often associate redundancy with resiliency, but this is a common mistake. Redundancy is a tool to increase resiliency, but it can arguably also decrease it. Let me explain with an example of a small network with a single link between all devices.

Screenshot from 2017-12-24 15-47-41.png

One of the links fails and the network goes down. In order to improve resiliency, the network designer decides to add another link to the topology:

 

Screenshot from 2017-12-24 15-51-26.png

If a link in this new topology fails again, traffic will be re-routed to one of the back-up links. This is great, we added resiliency.

However, suppose we keep adding more links to the same network, as below:

 

Screenshot from 2017-12-24 15-53-20.png

By doing this, we add more prefixes and more state to the network, and convergence becomes slower (think about the Announcement, Calculate and Install stages of convergence), thus increasing our MTTR. As we keep increasing redundancy, the actual resiliency of the network goes down because of the added links. In general, 2 to 3 nodes/links should be the target for the best balance between redundancy and resiliency.
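
To get a feel for how quickly state grows, here is a rough comparison between a minimally connected network (a chain of n routers) and the extreme case of a full mesh. This is a simplification I'm assuming for illustration, not the exact topologies from the figures. Each link typically brings its own transit prefix and adjacency that must be flooded, calculated and installed.

```python
def chain_links(n: int) -> int:
    """Links in a chain of n routers: one link between each adjacent pair."""
    return max(n - 1, 0)


def full_mesh_links(n: int) -> int:
    """Links when every router connects directly to every other router."""
    return n * (n - 1) // 2


for n in (4, 6, 8, 10):
    print(f"{n} routers: chain={chain_links(n)} links, "
          f"full mesh={full_mesh_links(n)} links")
```

The chain grows linearly while the mesh grows quadratically, which is why piling on links past a certain point costs more in convergence time than it buys in redundancy.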

To summarize, one must take several factors into account when characterizing the resiliency and availability of a network. Understanding the different stages of network convergence and the various tools to optimize these is an important skill for the network designer.
