Postmortem

Issue Summary:

A configuration error in backbone network caused an outage for Website’s services that lasted for 27 minutes and caused a 50% traffic drop across the network. The architecture of the network allowed the outage to occur in a certain geography. The outage occurred because while working on another, unrelated issue, a configuration update was made on a server. This configuration contained an error that caused a reroute of traffic from one city to another. This overwhelmed that location server and caused the outage.

Timeline:

Root cause and resolution:

As there was a traffic congestion in cityC, a decision was made to remove some of this incoming traffic. Instead of redirecting away, a one line error caused the redirection to be made towards it. This caused a bigger traffic congestion and outage in more locations.

Configuration changes should have been checked before a redirection decision was made, especially after an update.

Corrective and preventative measures:

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store