A configuration error in the backbone network caused a 27-minute outage of Website’s services and a 50% drop in traffic across the network. Because of the network's architecture, the outage occurred in a specific geography. It began when a configuration update was made on a server during work on a separate, unrelated issue. The update contained an error that rerouted traffic from one city to another, which overwhelmed the server at the receiving location and caused the outage.
Timeline:
- 20:25: Loss of backbone link between cityA and cityB
- 20:25: Backbone between cityC and cityD became congested
- 21:12 to 21:39: cityC attracted traffic from across the backbone
- 21:39 to 21:47: cityC dropped from the backbone, service restored
- 21:47 to 22:10: Core congestion caused some logs to drop; the edge continued operating
- 22:10: Full recovery, including logs and metrics
Root cause and resolution:
Because cityC was congested, a decision was made to move some of its incoming traffic elsewhere. A one-line error in the configuration did the opposite: instead of redirecting traffic away from cityC, it redirected traffic toward it. This worsened the congestion and spread the outage to more locations.
The configuration change should have been checked before the redirection was applied, especially because it followed an update.
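The kind of check described above can be sketched as a comparison of projected load before and after a proposed change. This is a minimal illustration, not the tooling used in the incident: the function name `redirect_reduces_load`, the load dictionaries, and all utilization figures are hypothetical.

```python
def redirect_reduces_load(before, after, site):
    """Return True if the proposed change lowers projected load at `site`.

    `before` and `after` map a location name to its projected link
    utilization (1.0 means fully utilized). Both are hypothetical inputs
    that a traffic simulation or dry run would have to supply.
    """
    return after.get(site, 0.0) < before.get(site, 0.0)


# Illustrative numbers only; they are not taken from the incident.
before = {"cityC": 0.95, "cityD": 0.40}
intended = {"cityC": 0.60, "cityD": 0.75}   # traffic moved away from cityC
erroneous = {"cityC": 1.80, "cityD": 0.10}  # traffic pulled toward cityC

intended_ok = redirect_reduces_load(before, intended, "cityC")
erroneous_ok = redirect_reduces_load(before, erroneous, "cityC")
print(intended_ok, erroneous_ok)
```

A gate like this would reject the erroneous change before deployment, because the projected load at cityC rises instead of falling.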
Corrective and preventative measures:
- Introduce a maximum-prefix limit on the backbone data share sessions. During this incident the limit would have shut down the backbone in cityC, which is acceptable because the network is built to function properly without a backbone. This change will be deployed on Monday, May 24.
- Change the local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident.
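Both measures can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual router configuration: the `Session` and `Route` classes, the prefixes, and the preference values are hypothetical, and the model assumes that the local-preference change makes locally originated routes win over routes learned from the backbone (a higher local preference is preferred).

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy backbone session with a maximum-prefix limit (hypothetical model)."""
    max_prefixes: int
    prefixes: set = field(default_factory=set)
    up: bool = True

    def receive(self, prefix):
        if not self.up:
            return  # a session that is down accepts nothing
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: shut the session down and drop its routes.
            self.up = False
            self.prefixes.clear()


@dataclass(frozen=True)
class Route:
    prefix: str
    origin: str
    local_pref: int


def best_route(routes):
    """Select the route with the highest local preference."""
    return max(routes, key=lambda r: r.local_pref)


# A session limited to two prefixes shuts down when a third arrives.
session = Session(max_prefixes=2)
for p in ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]:
    session.receive(p)

# With a higher local preference, the local route beats the backbone route.
candidates = [
    Route("192.0.2.0/24", "local", 200),
    Route("192.0.2.0/24", "backbone-cityC", 100),
]
chosen = best_route(candidates)
print(session.up, chosen.origin)
```

In this model the first measure limits how many routes one session can inject, and the second keeps each location serving its own traffic even if another location advertises a competing route.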