Issue Summary:

A configuration error in the backbone network caused a 27-minute outage of Website’s services and a 50% drop in traffic across the network. The network’s architecture allowed the outage to occur within a specific geographic region. The outage began when a configuration update was applied to a server during work on a separate, unrelated issue. The update contained an error that rerouted traffic from one city to another, overwhelming the servers at the receiving location and causing the outage.


Root cause and resolution:

Because traffic was congested in cityC, a decision was made to redirect some of its incoming traffic elsewhere. Instead of moving traffic away from cityC, a one-line error in the configuration redirected more traffic toward it. This worsened the congestion in cityC and spread the outage to more locations.

Configuration changes should have been validated before the redirection was applied, especially after an update.
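The kind of pre-deployment check described above can be sketched as a small validation step that rejects a redirect rule pointing traffic toward an already-congested location. This is a minimal illustration under stated assumptions, not the tooling actually used: the `Redirect` structure, the `validate_redirect` function, the `fraction` field, and the location `cityD` are all hypothetical; only `cityC` comes from this report.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Redirect:
    """Hypothetical traffic-shift rule: move `fraction` of traffic from `source` to `target`."""
    source: str
    target: str
    fraction: float


def validate_redirect(rule: Redirect, congested: set[str]) -> list[str]:
    """Return a list of problems with the rule; an empty list means these checks passed."""
    problems = []
    # The share of traffic being moved must be a sensible fraction.
    if not 0.0 < rule.fraction <= 1.0:
        problems.append(f"fraction {rule.fraction} is outside (0, 1]")
    # Redirecting a location to itself is almost certainly a typo.
    if rule.source == rule.target:
        problems.append("source and target are the same location")
    # The error in this incident: traffic was sent *toward* the congested site.
    if rule.target in congested:
        problems.append(f"target {rule.target} is already congested")
    # A drain rule should start from a congested site; otherwise the direction may be flipped.
    if rule.source not in congested:
        problems.append(f"source {rule.source} is not congested; nothing to drain")
    return problems
```

With `congested = {"cityC"}`, the intended rule `Redirect("cityC", "cityD", 0.3)` returns an empty list, while the flipped one-line version `Redirect("cityD", "cityC", 0.3)` returns two problems, so it would be blocked before deployment.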

Corrective and preventative measures:



