DNS Propagation Service Incident: 3/31/2020 Master Tkt #77952
Stephen Myers -
On March 31, 2020, the TeleFlex NOC alerted on incidents of multiple customers experiencing rolling loss of connectivity, ultimately related to Domain Name System resolution and propagation failures from the authoritative registrar nameservers for TeleFlex service-related domains.
Total Incident Duration: ~1 Hour
Duration of Global SLA Impact: ~28 minutes
(Defined as average Answer-Seizure Ratio (ASR) falling below 65%. The ASR baseline average includes all WAN origination/termination as well as all internal call events, including ring-group and queue calls to multiple endpoints for a single ASR log entry.)
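For clarity on the SLA threshold above, the check can be sketched as a simple calculation. This is a minimal illustration only; the function names and structure are hypothetical, not TeleFlex's actual monitoring code:

```python
# Illustrative sketch: compute an Answer-Seizure Ratio (ASR) and check it
# against the 65% SLA floor described above. Names are hypothetical.

SLA_ASR_THRESHOLD = 65.0  # percent

def answer_seizure_ratio(answered_calls: int, total_seizures: int) -> float:
    """ASR = answered call attempts / total call attempts (seizures), as a percentage."""
    if total_seizures == 0:
        return 0.0
    return 100.0 * answered_calls / total_seizures

def sla_breached(answered_calls: int, total_seizures: int) -> bool:
    """True when the average ASR falls below the SLA floor."""
    return answer_seizure_ratio(answered_calls, total_seizures) < SLA_ASR_THRESHOLD

# Example: 520 answered out of 1000 seizures -> 52.0% ASR, below the 65% floor.
```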
Based on initial customer report timestamps and examination of system monitoring, DNS resolution failures started to impact some customers at approximately 8:50AM Central Time and resolved fully for all customers by 9:50AM Central Time. These service issues were found to have sporadically impacted some customer phones at different intervals and varying locations, following a failure of DNS propagation and response from authoritative nameservers provided by our registrar, GoDaddy.
The majority of service-impacting incidents occurred between 9:08AM and 9:36AM. The sporadic impact on certain locations and devices was observed to be a rolling disruption of DNS resolution as endpoint DNS cache TTL timers expired from each endpoint's last DNS lookup of the record. The fact that the impact spanned exactly 1 hour further pointed to a DNS propagation failure within the standard 1-hour TTL for most DNS records.
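The rolling pattern follows directly from per-endpoint cache TTLs. The following is a hypothetical, stdlib-only illustration (endpoint names and times are invented) of how endpoints with staggered last-lookup times fail one by one as their cached records expire:

```python
# Hypothetical illustration of the rolling cache-expiry pattern described
# above. Each endpoint keeps a cached DNS record until its TTL expires;
# once the authoritative nameservers stop answering, an endpoint only
# starts failing after its own cached copy ages out, so impact "rolls"
# across devices instead of hitting them all at once.

TTL_MINUTES = 60  # the standard 1-hour TTL cited above

def failure_time(last_lookup: int, ttl: int = TTL_MINUTES) -> int:
    """Minute at which an endpoint's cached record expires and resolution
    begins to fail (assuming its last successful lookup preceded the outage)."""
    return last_lookup + ttl

# Three illustrative endpoints that last refreshed the record at staggered
# times (minutes past 8:00AM) before the 8:50AM propagation failure:
last_lookups = {"phone-a": 10, "phone-b": 25, "phone-c": 40}
failures = {ep: failure_time(t) for ep, t in last_lookups.items()}
# phone-a fails ~9:10, phone-b ~9:25, phone-c ~9:40 -- all within one
# TTL of the 8:50AM failure, matching the observed 1-hour impact window.
```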
NOC Incident Response
TeleFlex NOC received the first customer report of service-impacting issues at 9:03AM, approximately 13 minutes after DNS propagation issues began at 8:50AM due to the failure of the authoritative nameservers to respond and/or propagate. TeleFlex NOC immediately investigated while escalating to engineering.
At 9:13AM TeleFlex NOC identified the apparent root issue with name resolution propagation across networks and confirmed findings with engineering.
By 9:18AM Engineering issued a series of bulk updates to live and "dummy" SRV and CNAME records in each impacted DNS zone, with the intent to force a refresh/propagation of records and effectively reset TTL timers as soon as possible to encourage replication.
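The record-touch tactic can be sketched generically. The classes and update logic below are illustrative assumptions, not TeleFlex's tooling; a real implementation would issue these changes through the registrar's or DNS provider's update API:

```python
# Hypothetical sketch of the "touch records to force re-propagation" tactic:
# rewriting each SRV/CNAME record with a shortened TTL and bumping the zone
# serial signals secondaries and resolvers that the zone changed, prompting
# propagation and faster subsequent cache turnover. Zone/Record are
# illustrative stand-ins for a provider's API objects.

from dataclasses import dataclass, field

@dataclass
class Record:
    name: str
    rtype: str        # e.g. "SRV", "CNAME", "A"
    value: str
    ttl: int = 3600   # standard 1-hour TTL

@dataclass
class Zone:
    origin: str
    serial: int
    records: list = field(default_factory=list)

def touch_records(zone: Zone, rtypes=("SRV", "CNAME"), short_ttl=300) -> int:
    """Rewrite matching records with a short TTL and bump the zone serial,
    encouraging propagation and quicker cache expiry on endpoints."""
    touched = 0
    for rec in zone.records:
        if rec.rtype in rtypes:
            rec.ttl = short_ttl
            touched += 1
    if touched:
        zone.serial += 1  # a changed serial tells secondaries to re-transfer
    return touched
```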
TeleFlex observed DNS resolution succeeding on initially failed endpoints, as well as a reduction in name-resolution timeouts for additional endpoints. By 9:36AM ASR levels returned to SLA norms. No further customer tickets were reported after 9:40AM, and log events related to potential DNS resolution failures cleared as of 9:50AM.
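Recovery of this kind can be verified with a simple per-hostname resolution check using only the standard library. The hostnames below are placeholders; the sketch uses the operating system's resolver rather than querying specific nameservers:

```python
# Minimal resolution health check (illustrative). Uses the OS resolver via
# socket.getaddrinfo; a monitoring loop would run this per service hostname
# and alert on failures.

import socket

def resolves(hostname: str, port: int = 5060) -> bool:
    """Return True if the OS resolver can currently resolve hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# e.g., loop over service hostnames (placeholders) and flag failures:
# for host in ("sip.example.com", "media.example.com"):
#     if not resolves(host):
#         print(f"resolution failing for {host}")
```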
Implemented and Ongoing Mitigations & Response
There have been, and continue to be, drastic changes to network operations across the country as mass shifts in human behavior force networks to handle use cases few, if any, would have predicted. TeleFlex has been monitoring carrier networks and choke points as well as customer Quality of Experience, and has been implementing tailored measures to respond to network challenges and mitigate service-availability risks to our customers.
Regional, US, and global telecom carriers and service providers have all been working to respond to the overwhelming spike in network load as the response to COVID-19 drastically changed the topology and levels of data utilization on carrier networks. In the first 72 hours following the announcement of the first 15 days of social-distancing measures, regional Internet utilization increased 50% to 70%.
The major challenge, and the cause of widespread congestion and service-degradation events, is not just the uptick in overall utilization to nearly double normal use. The challenge is that the nodes, ingress points, egress points, and interconnects this traffic must traverse were not necessarily designed for or upgraded to this need. The massive data shift includes the entirety of business use moving off corporate and metro fiber infrastructure with engineered and commercially contracted SLAs, and onto largely residential neighborhood networks with no SLAs and no high-level coordination across the competing CLEC, cable, wireless, and other operators serving these communities.
To directly address the risk of another issue such as the DNS propagation event on 3/31/2020, TeleFlex has accelerated execution and completion of planned upgrades to the DNS and CDN-enabling components of our architecture, originally scheduled for the end of Q2 2020.
As of EOD 4/2/2020, TeleFlex has moved DNS services for all domains used in service delivery to CloudFlare's DNS, CDN, and leading-edge security and cloud delivery platform. This migration enables "edge point-of-access" delivery of underlay services such as DNS, as well as both static content and streaming services. The architecture allows for distributed caching with rapid propagation as changes occur, moving access to critical DNS and route information to multiple points close to the endpoints requesting it.
Over the last 4 weeks TeleFlex has added multiple 10Gbps VXCs directly connected to major Cloud Service Providers, in addition to establishing multiple BGP IX peering relationships with commercial aggregation nodes. These upgrades provide significant cost and performance benefits to our customers, who are using cloud and virtual services more and more frequently under the current demands of changing daily operations. The BGP peering exchanges provide direct routes to cloud providers, regional carriers, and alternate long-haul routes.
TeleFlex will continue to implement improvements dynamically suited to our customers' needs. If you have any questions or need assistance, please reach out to our team. We will be announcing several major features and entire product platforms which are being made available at little to no cost to help our customers navigate the "new normal."
Stephen Myers, PMP
TeleFlex Networks, LLC
Customer NOC: 713.231.5005
1510 Primewest Parkway | Suite 800
Katy, TX 77449