Postmortem

Soledad Frechou
2 min read · May 23, 2021

Issue Summary:

A configuration error in the backbone network caused an outage of Website’s services that lasted 27 minutes and caused a 50% drop in traffic across the network. Because of the network’s architecture, the outage affected a certain geography. The outage occurred during work on another, unrelated issue, when a configuration update was made on a server. The update contained an error that rerouted traffic from one city to another, which overwhelmed the server at that location and caused the outage.

TRAFFIC CONGESTION !!! DON’T PANIC (unless you are in charge of fixing it)

Timeline:

  • 20:25: Loss of backbone link between cityA and cityB
  • 20:25: Backbone between cityC and cityD becomes congested
  • 21:12 to 21:39: cityC attracted traffic from across the backbone
  • 21:39 to 21:47: cityC was dropped from the backbone; service was restored
  • 21:47 to 22:10: Core congestion caused some logs to be dropped; the edge continued operating
  • 22:10: Full recovery, including logs and metrics
SAVED THE DAYYYYYY
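
As a quick check on the timeline, the 27-minute outage in the summary appears to correspond to the 21:12 to 21:39 window, while the whole incident ran from 20:25 to 22:10. The minimal Python sketch below works out both durations; the helper name `minutes_between` is hypothetical, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Window in which cityC attracted traffic from across the backbone.
outage_minutes = minutes_between("21:12", "21:39")

# First backbone link loss through to full recovery of logs and metrics.
incident_minutes = minutes_between("20:25", "22:10")

print(outage_minutes, incident_minutes)
```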

Root cause and resolution:

Because of traffic congestion in cityC, a decision was made to divert some of its incoming traffic. Instead of redirecting traffic away from cityC, a one-line error redirected traffic towards it. This worsened the congestion and caused outages in more locations.

REDIRECT! REDIRECT! REDIRECT!

Configuration changes should have been reviewed before the redirection was carried out, especially following an update.

Corrective and preventative measures:

  • Introduce a maximum-prefix limit on the backbone data share sessions. This would have shut down the backbone in cityC, but the network is built to function properly without a backbone. This change will be deployed on Monday, May 24.
  • Change the local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change was deployed following the incident.
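
To make the two measures concrete, here is a minimal Python sketch. It is not the actual router configuration. It assumes a simplified route-selection model in which the route with the highest local-preference wins and ties go to the shortest path, and it assumes the local-preference fix raises the value on local routes. The names `Route`, `best_route`, and `Session`, the site "cityB", and all numeric values are hypothetical.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Route:
    prefix: str
    origin_site: str   # site announcing the route, e.g. "cityC"
    local_pref: int    # higher value is preferred
    path_length: int   # shorter wins when local_pref ties


def best_route(routes):
    """Pick the preferred route: highest local_pref, then shortest path."""
    return max(routes, key=lambda r: (r.local_pref, -r.path_length))


@dataclass
class Session:
    """A backbone session that shuts down once it exceeds max_prefixes."""
    max_prefixes: int
    prefixes: set = field(default_factory=set)
    up: bool = True

    def receive(self, prefix: str) -> bool:
        """Accept a prefix; return False once the session is shut down."""
        if not self.up:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: drop everything learned and close the session.
            self.up = False
            self.prefixes.clear()
        return self.up


# Seen from hypothetical site "cityB": its own route vs one leaked via cityC.
local = Route("203.0.113.0/24", "cityB", local_pref=100, path_length=1)
leaked = Route("203.0.113.0/24", "cityC", local_pref=200, path_length=3)
before_fix = best_route([local, leaked]).origin_site

# Assumed fix: local routes carry a higher local-preference than leaked ones.
local_fixed = replace(local, local_pref=300)
after_fix = best_route([local_fixed, leaked]).origin_site

# Maximum-prefix limit: the third prefix exceeds the limit of 2.
session = Session(max_prefixes=2)
for p in ["203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"]:
    session.receive(p)

print(before_fix, after_fix, session.up)
```

In this model the leaked route wins before the fix, so traffic is pulled towards cityC, and the local route wins afterwards. The session closes once the number of received prefixes passes the limit, which mirrors the trade-off described above: the backbone in cityC goes down, but the network keeps operating without it.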
