Rogers Explains How a Piece of Code Took Down its Entire Network
The Canadian Radio-television and Telecommunications Commission (CRTC) had many questions for Rogers about the network outage earlier this month that knocked out cellphone and internet service across the country. Rogers answered them in a detailed submission, and Canada’s telecom regulator made the documents public in redacted form last week.
As part of its response, Rogers explained how a mere coding error during a scheduled upgrade caused a flood of issues on its core network that ultimately took much of Canada offline (via The Globe and Mail).
At 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecom networks, packets of data are guided and directed by devices called routers, and filters prevent those routers from becoming overwhelmed, by limiting the number of possible routes that are presented to them.
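To make the idea of a routing filter concrete, here is a minimal Python sketch. It is not Rogers’s configuration, which the redacted filings do not reveal; the allowed address block, the sample prefixes and the `route_filter` function are all assumptions made up for illustration. It only shows the general principle of passing a router a limited subset of the routes on offer.

```python
import ipaddress

# Hypothetical allow-list: only routes inside these blocks are passed on.
ALLOWED_BLOCKS = [ipaddress.ip_network("10.0.0.0/8")]


def route_filter(advertised, allowed=ALLOWED_BLOCKS):
    """Return only the advertised prefixes that fall inside an allowed block."""
    kept = []
    for prefix in advertised:
        net = ipaddress.ip_network(prefix)
        if any(net.subnet_of(block) for block in allowed):
            kept.append(prefix)
    return kept


# Illustrative prefixes only (private and documentation address ranges).
advertised = ["10.1.0.0/16", "10.2.0.0/16", "192.0.2.0/24", "198.51.100.0/24"]
filtered = route_filter(advertised)
```

With the filter in place, the router in this sketch is handed two of the four advertised routes. Swapping the allow-list for one that matches everything hands it all four, which is roughly what deleting a filter amounts to.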
The coding error occurred during the sixth phase of a planned seven-phase upgrade for the core network. All five of the previous phases had gone smoothly, according to Rogers.
With the filter deleted, more traffic flowed through most of Rogers’s network devices. Many of the routers were ultimately overwhelmed, and the core network shut down.
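The failure described here can be pictured with a toy simulation in Python. Everything in it is an assumption for illustration: the `Router` class, the capacity of 100 entries, the 1,000 made-up routes and the `keep_internal` rule do not come from the filings, and real routers fail in more complicated ways. The sketch simply models "overwhelmed" as receiving more routes than a device can hold.

```python
class Router:
    """Toy router that is overloaded once its table exceeds a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical limit on stored routes
        self.table = set()
        self.overloaded = False

    def receive(self, routes, route_filter=None):
        # Apply the filter, if one is configured, before storing routes.
        if route_filter is not None:
            routes = [r for r in routes if route_filter(r)]
        self.table.update(routes)
        if len(self.table) > self.capacity:
            self.overloaded = True
        return self.overloaded


def keep_internal(route):
    """Hypothetical filter rule: pass on only the first 50 routes."""
    return int(route.split("-")[1]) < 50


all_routes = [f"route-{i}" for i in range(1000)]

with_filter = Router(capacity=100)
without_filter = Router(capacity=100)
with_filter.receive(all_routes, keep_internal)
without_filter.receive(all_routes)
```

In this sketch the filtered router stores 50 routes and stays within its limit, while the unfiltered one is handed all 1,000 and is flagged as overloaded.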
Rogers has one core network that supports all of its connectivity services, from wireline and broadband to wireless cellular. The company now plans to separate its wireless and wireline core networks to improve resiliency.
The telecom’s core network comprises routers from two different manufacturers, and the company said the differences between the equipment from the two vendors were at the root of the network breakdown.
In the same documents, Rogers revealed that both Bell and Telus had offered to take on its customers while it dealt with the failure. However, the company said it couldn’t accept either offer because the added weight of the Rogers traffic would have overwhelmed the two rivals’ networks.
Rogers also said in the filings that the disruption prevented it from delivering four emergency alerts to customers.
On the eve of its hearing before the House of Commons, the company announced that it now plans to spend $10 billion over the next three years to make its network more reliable. Industry Minister François-Philippe Champagne has also tasked the company with establishing an agreement with competitors Bell and Telus to pool resources and provide emergency roaming to each other’s customers.
Rogers will delve deeper into what caused the disruption and how future incidents can be prevented when the company’s new CTO, Ron McKenzie, and other senior executives appear before the House of Commons Industry and Technology committee this morning to answer for the outage.