Postmortem

Soledad Frechou
2 min readMay 23, 2021

Issue Summary:

A configuration error in backbone network caused an outage for Website’s services that lasted for 27 minutes and caused a 50% traffic drop across the network. The architecture of the network allowed the outage to occur in a certain geography. The outage occurred because while working on another, unrelated issue, a configuration update was made on a server. This configuration contained an error that caused a reroute of traffic from one city to another. This overwhelmed that location server and caused the outage.

Timeline:

  • 20:25: Loss of backbone link between cityA and cityB
  • 20:25: Backbone between cityC and cityD is congesting
  • 21:12 to 21:39: cityC attracted traffic from across the backbone
  • 21:39 to 21:47: cityC dropped from the backbone, service restored
  • 21:47 to 22:10: Core congestion caused some logs to drop, edge continues operating
  • 22:10: Full recovery, including logs and metrics

Root cause and resolution:

As there was a traffic congestion in cityC, a decision was made to remove some of this incoming traffic. Instead of redirecting away, a one line error caused the redirection to be made towards it. This caused a bigger traffic congestion and outage in more locations.

Configuration changes should have been checked before a redirection decision was made, especially after an update.

Corrective and preventative measures:

  • Introduce a maximum-prefix limit on the backbone data share sessions — this would have shut down the backbone in cityC, but the network is built to function properly without a backbone. This change will be deployed on Monday, May 24.
  • Change the local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident.

--

--