At 10:03 UTC on Sunday 30 August 2020, Continent 8’s monitoring systems began to notice something was awry with traffic traversing our links to and from CenturyLink/Level(3).
The events which followed were relatively simple to handle for Continent 8 and service providers in a similar position, but many others globally (including some large names) had no warning that they were heading into a global outage of over four hours. Casualties of the issue, to varying degrees, included Reddit, Hulu, AWS, Blizzard, Steam, Cloudflare, Xbox Live, Discord, and dozens more.
Continent 8 is connected to CenturyLink/Level(3) among a large and diverse set of network providers. In simple terms, when we see an increase in errors or an outage from one network provider, our systems automatically switch all the affected traffic across alternative providers. Given the number of providers we have access to, we are able to continue to route all traffic even when one provider has a global issue.
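The automated side of this can be sketched in a few lines. The snippet below is a hypothetical illustration only — the provider names, the 5% threshold and the function are our own assumptions for the example, not Continent 8’s actual system — showing the core idea: when one provider’s error rate crosses a threshold, its traffic is reassigned to the healthiest alternative.

```python
# Hypothetical sketch of error-rate-based failover (illustrative only):
# traffic for any provider whose probe error rate exceeds the threshold
# is reassigned to the healthiest remaining provider.

ERROR_THRESHOLD = 0.05  # assumed: >5% probe failures marks a provider unhealthy


def choose_routes(error_rates: dict[str, float]) -> dict[str, str]:
    """Map each provider to the provider that should carry its traffic."""
    healthy = {p: r for p, r in error_rates.items() if r < ERROR_THRESHOLD}
    if not healthy:
        raise RuntimeError("no healthy upstream providers available")
    best = min(healthy, key=healthy.get)  # lowest error rate wins
    # Healthy providers keep their own traffic; unhealthy ones fail over.
    return {p: (p if p in healthy else best) for p in error_rates}


routes = choose_routes({"CenturyLink": 0.40, "ProviderB": 0.01, "ProviderC": 0.02})
# "CenturyLink" traffic shifts to "ProviderB"; the others are unaffected.
```

The key property is that the decision depends only on measured health, so a single provider’s global failure degrades nothing for customers — provided enough alternative capacity exists.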
In addition to automated route and peering provider management, specific endpoint monitoring is set up across the C8 internet peering, and the NOC is also able to manually manipulate peering relationships and priorities should we identify an inability to reach source or destination locations. Providers relying solely on automated protocols such as BGP with upstream peering or exchange providers can amplify an issue by continuing to treat a provider as “good” when in reality it isn’t.
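To see why data-plane probing matters, consider a toy classifier (the names, states and thresholds here are illustrative assumptions, not any vendor’s API): a BGP session can report “Established” while traffic through that provider is being blackholed, and only active endpoint probes expose the mismatch.

```python
# Toy illustration: combine control-plane (BGP) and data-plane (probe)
# signals. A provider can look "up" to BGP while blackholing traffic.

def provider_status(bgp_state: str, probe_successes: int, probes_sent: int) -> str:
    """Classify a provider by combining BGP state with endpoint probe results."""
    if bgp_state != "Established":
        return "down"       # BGP itself reports the failure; automation handles it
    if probes_sent and probe_successes / probes_sent < 0.5:
        return "degraded"   # BGP looks fine, but endpoints are unreachable:
                            # this is the case that needs manual de-preferencing
    return "healthy"


# During an incident like Sunday's, an affected provider could look like this:
print(provider_status("Established", probe_successes=1, probes_sent=20))  # degraded
```

The “degraded” branch is precisely the condition that pure BGP automation misses, and why a NOC with manual override capability recovers faster.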
The specific behaviour of the issue on Sunday was such that the routing protocol did not trigger automatic redirection of all traffic away from CenturyLink (the situation was the same for all ISPs), and so manual intervention was required. By 10:35 UTC the Continent 8 NOC had identified the problem and moved all traffic away from CenturyLink to other providers, and all services remained fully available to our customers.
The same action was then taken by those providers fortunate enough to have similar options. Some had services available again within the hour, some took longer. Notably, some large ISPs took much longer to switch manually, and the online feedback on when services were eventually reinstated makes for interesting reading.
So why did some individuals, businesses and service providers suffer for over four hours? Well, CenturyLink/Level(3) is among the largest network providers in the world, and also happens to provide some of the lowest-latency routing. As a result, many hosting providers rely solely on its connectivity to the Internet, especially in some of the most densely connected locations where CenturyLink/Level(3) operates or peers at local exchanges. Because this outage appeared to take the entire CenturyLink/Level(3) network offline, both ISPs solely reliant on it and individuals who are CenturyLink customers were unable to reach any other Internet provider until the issue was resolved, over four hours later. They literally had no option but to wait.
In addition, there can be a “knock-on” effect: peering providers or exchanges that do not migrate away from CenturyLink/Level(3) can compound the issue if they sit between a source and a destination. This means that a broader set of peers, combined with the capability for manual intervention, can best minimise the impact, if not completely resolve it.
As a provider, CenturyLink/Level(3) responded and communicated well throughout, and did everything it could during what was clearly a globally significant event. However, if ever there was a case study for carrier redundancy and awareness that even the most reliable, robust and highest-capacity networks can go down, this was it. Stacking the odds in your favour and investing in services which are truly carrier redundant may at times seem like overkill, but it clearly pays off when events like this inevitably happen.
In terms of the details around what happened, CenturyLink released a statement advising that the root cause was an incorrect Flowspec announcement, which effectively prevented BGP sessions from establishing correctly. Regardless of the detail, though, even if that specific issue were prevented in future, there are many other failure modes that could take the largest providers offline again.
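To make the Flowspec mechanism concrete, here is a deliberately simplified model (our own illustration — the rule shown is an assumption, not the actual offending announcement). Flowspec distributes packet-match criteria plus an action via BGP itself, so a rule that matches the wrong traffic — here, illustratively, TCP port 179, BGP’s own port — would drop the very sessions needed to withdraw it.

```python
# Simplified model of a Flowspec rule (illustrative, not the real announcement):
# a match on protocol + destination port, with a "discard" action. Matching
# BGP's own port (TCP/179) would break the sessions that carry the rule.

RULE = {"protocol": "tcp", "dest_port": 179, "action": "discard"}  # assumed shape


def forward(packet: dict) -> str:
    """Apply the Flowspec rule; forward anything that doesn't match."""
    if (packet.get("protocol") == RULE["protocol"]
            and packet.get("dest_port") == RULE["dest_port"]):
        return RULE["action"]
    return "forward"


print(forward({"protocol": "tcp", "dest_port": 179}))  # BGP keepalive: discard
print(forward({"protocol": "tcp", "dest_port": 443}))  # web traffic: forward
```

The self-referential failure mode — a control-plane mechanism filtering the control plane itself — is what makes this class of incident so hard to recover from remotely.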
ISPs and businesses across the globe should again be asking themselves this week: “Do we, and our service providers, have sufficient carrier redundancy and incident management capability?”