Production Outage
Incident Report for TrackAbout, Inc.
Postmortem

Yesterday’s extended outage was caused by a failure at Microsoft Azure, the cloud hosting environment in which TrackAbout operates all its services.

TrackAbout first detected the problem at 19:37 UTC. The problem was not fully resolved for TrackAbout until 21:18 UTC.

The Microsoft Azure outage was serious and global, and it impacted thousands of companies. It made news.

Microsoft posted the following report yesterday and promised a full root cause analysis (RCA) within 72 hours.

Summary of Impact: Between 19:43 and 22:35 UTC on 02 May 2019, customers may have experienced intermittent connectivity issues with Azure and other Microsoft services (including M365, Dynamics, DevOps, etc). Most services were recovered by 21:30 UTC with the remaining recovered by 22:35 UTC.

Preliminary Root Cause: Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.

Mitigation: To mitigate, engineers corrected the nameserver delegation issue. Applications and services that accessed the incorrectly configured domains may have cached the incorrect information, leading to a longer restoration time until their cached information expired.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A detailed RCA will be provided within 72 hours.
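
The caching effect described in Microsoft's mitigation note explains why recovery lagged behind the fix. The following is a minimal sketch of that behavior, not Microsoft's or TrackAbout's actual resolver code: the `TtlCache` class, the host name `db.example.test`, the addresses, and the 300-second TTL are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str       # the answer the upstream resolver handed back
    expires_at: float  # time (seconds) after which the answer must be re-fetched


class TtlCache:
    """Toy DNS-style cache: an answer is reused until its TTL runs out."""

    def __init__(self, lookup, ttl_seconds):
        self._lookup = lookup      # hypothetical upstream resolver function
        self._ttl = ttl_seconds
        self._records = {}

    def resolve(self, name, now):
        record = self._records.get(name)
        if record is None or now >= record.expires_at:
            # Cache miss or expired entry: ask upstream and remember the answer.
            record = CachedRecord(self._lookup(name), now + self._ttl)
            self._records[name] = record
        return record.address


# Hypothetical upstream data: starts out wrong, then gets corrected.
upstream = {"db.example.test": "BAD-DELEGATION"}
cache = TtlCache(lambda name: upstream[name], ttl_seconds=300)

during_outage = cache.resolve("db.example.test", now=0)     # caches the bad answer
upstream["db.example.test"] = "10.0.0.5"                    # the provider applies the fix
just_after_fix = cache.resolve("db.example.test", now=60)   # still served from the cache
after_expiry = cache.resolve("db.example.test", now=300)    # TTL elapsed, fresh lookup
```

In this sketch the client keeps receiving the bad answer after the upstream record is corrected, and only recovers once the cached entry expires. That matches the pattern in the report, where restoration took longer for applications that had cached the incorrect information.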

Microsoft Azure has been reliable over the 3+ years we’ve been using it. This is the worst outage we’ve seen since joining the platform.

I maintain confidence in Microsoft Azure as our cloud provider. They have been an excellent provider (in all other regards) and continue to compete fiercely with the cloud heavyweight in the industry, Amazon AWS.

I don’t anticipate that the forthcoming RCA from Microsoft will shed much light on anything we could have done differently to prevent the outage. All secondary fail-over environments in Azure were similarly impacted and unavailable. Short of duplicating and synchronizing everything to a completely different cloud provider (which is not realistic for TrackAbout, given our reliance on Microsoft, and specifically Azure, technologies), there was unfortunately no avoiding yesterday’s outage.

We regret the downtime, inconvenience and hassle this undoubtedly caused our customers.

Sincerely,

Larry Silverman
Chief Technology Officer
TrackAbout, Inc.

Posted May 03, 2019 - 11:59 EDT

Resolved
Services are now stable. We are closing out this incident.
Posted May 02, 2019 - 18:18 EDT
Update
Systems have been online for 15 minutes or so. We continue to monitor.
Azure reports that the problem is with its DNS name resolution system.
Posted May 02, 2019 - 17:27 EDT
Monitoring
We continue to monitor the situation. Azure and Office 365 customers are affected globally. The Azure status page is showing network infrastructure impacted around the globe. https://status.azure.com
Posted May 02, 2019 - 16:56 EDT
Update
Microsoft Azure is now stating there are multiple issues with Azure services affecting many clients globally. We're going to have to wait and see what happens here. https://twitter.com/search?q=azure%20outage&src=typd
Posted May 02, 2019 - 16:27 EDT
Investigating
It appears Microsoft's Azure SQL Database is having a rolling outage. We have confirmation via Twitter from other Azure users who are experiencing similar problems. We will share more news as we get it.
Posted May 02, 2019 - 16:05 EDT
Monitoring
Services are back online and we are monitoring.
Posted May 02, 2019 - 15:50 EDT
Investigating
We are currently investigating the issue. Hang tight.
Posted May 02, 2019 - 15:43 EDT
This incident affected: Production Services (Production Application Web Site, Production TAMobile 6 Sync Services, TAMobile 7 iOS, TAMobile 7 Android).