On June 6th, 2022, TrackAbout experienced a critical outage of services. The root cause of the outage was a failure of datacenter cooling systems in the Microsoft Azure East US 2 region. This failure caused a region-wide outage impacting many Microsoft Azure customers using services in the East US 2 region.
Microsoft Azure has not yet released their own RCA (root-cause analysis). However, we currently know enough to issue our own, as it's clear the root cause of TrackAbout's outage was Microsoft Azure's outage.
We first became aware of trouble on June 6th, 2022 at 11:34 PM Eastern. Our alert system detected SQL database connection issues appearing in our logs and notified our DevOps team. The alerts indicated database connection pool exhaustion, which occurs when either database resources are taxed beyond their available capacity or attempts to create new database connections from app servers fail.
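To illustrate the failure mode our alerts detected, here is a minimal sketch of a bounded connection pool. The class and sizes are illustrative stand-ins, not TrackAbout's actual stack: when every pooled connection is checked out and new acquires cannot be satisfied before the timeout, callers start failing with exactly this kind of "pool exhausted" error.

```python
import queue

# Minimal illustrative connection pool. Real database drivers implement
# this far more elaborately; the names here are assumptions for the sketch.
class ConnectionPool:
    def __init__(self, max_size, acquire_timeout):
        self._available = queue.Queue()
        for i in range(max_size):
            # Stand-ins for real database connections.
            self._available.put(f"conn-{i}")
        self._acquire_timeout = acquire_timeout

    def acquire(self):
        try:
            return self._available.get(timeout=self._acquire_timeout)
        except queue.Empty:
            # All connections checked out and none freed in time.
            raise TimeoutError("connection pool exhausted")

    def release(self, conn):
        self._available.put(conn)

pool = ConnectionPool(max_size=2, acquire_timeout=0.1)
a = pool.acquire()
b = pool.acquire()      # pool is now empty
try:
    pool.acquire()      # a third caller times out: "pool exhaustion"
except TimeoutError as e:
    print(e)            # connection pool exhausted
pool.release(a)
c = pool.acquire()      # succeeds once a connection is returned
```

The same symptom appears whether the pool is starved because queries are slow (resource exhaustion) or because attempts to open replacement connections fail outright, which is why our next step was to distinguish the two.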
By checking our database performance metrics, we ruled out resource exhaustion. We confirmed through attempting to connect to the SQL database servers that connections were either failing or taking so long that our applications would time out attempting to connect.
At 11:49 PM, Microsoft Azure released a Service Issue health event (https://app.azure.com/h/NMB2-ND0/04e477). Initially, the description of this event mentioned only that Azure was aware of database connection issues to SQL databases in our region of East US 2.
This would be the first of some 28 alerts and updates that we would receive from Microsoft Azure over the next 10+ hours pertaining to their outage.
We began executing our Incident Response plan, which involves announcing what we know via our public Status Page at https://status.trackabout.com, and started a new internal Incident Log to capture the timeline and actions taken over time to mitigate the issue. We would continue to update customers throughout the incident via the Status Page.
We opened a Severity 1 (Critical) ticket with our Cloud Service Provider (CSP) Logicworks, who in turn opened a Sev1 ticket with Microsoft Azure. We established a Microsoft Teams chat including TrackAbout DevOps, the Logicworks NOC (network operations center) and Azure Support.
Our alert system began alerting us to connection issues with application components that connect to Azure Service Bus, a messaging system that allows software components to communicate with one another.
At 12:33 AM on June 7th, we also received an alert from Microsoft Azure indicating that their Managed Identity service upon which we rely was also experiencing problems.
Other alerts from Azure indicating issues with Storage and Compute resources would soon follow.
It was now clear that Microsoft Azure was experiencing a multi-system failure, at a minimum within the East US 2 region where most of our services are hosted.
Failures in Azure are rare, but they do happen. In our experience, they tend to be short-lived, with resolution taking an hour or two at most, and with TrackAbout services restored usually before Azure announces a resolution.
TrackAbout maintains a disaster recovery plan which enables migrating services from one Azure region to another in the event of a regional failure. However, performing a regional failover poses its own set of complexities and risks. Usually, the best course of action is to suffer the short-term outage and wait for Microsoft to fix whatever is broken.
At this point in the incident, the root cause of the Azure outage was unclear. It was not clear whether key Azure components required to perform a geographic DR failover were impacted, or whether we would be met with similar problems in our target geographic region in Azure Central US.
We decided to stay put in Azure East US 2 and continue to wait for more information from Microsoft.
At around 2:04 AM EDT, Microsoft Azure amended their Service Health alert to finally indicate a root cause.
Summary of Impact: Starting at 02:41 UTC on 07 Jun 2022, a subset of customers with resources in the East US 2 region may experience difficulties connecting to resources hosted in this region. Engineers are aware of an issue where multiple scale units were impacted by a cooling failure in the East US 2 datacenter which has impacted Storage and Compute resources. Due to the reliance on these types of resources, other Azure services within this region may also be experiencing impact.
Current Status: We have taken steps to bring down the temperatures in the impacted areas of the datacenter to address the cooling issues. We are now in the early stages in mitigation efforts to restore hardware for underlying resources. The next update will be provided as events warrant.
Microsoft Azure subsequently updated this ticket several times, eventually expanding the list of impacted services and regions.
Datacenters are buildings full of racks of computers running full throttle and generating tremendous amounts of heat. Cooling systems are required to keep temperatures within those computers' safe operating ranges. Without proper cooling, most computers' thermal overload protection will either throttle performance to reduce heat or shut the machine down entirely to avoid damaging sensitive components.
With so many Azure components affected, it seemed clear that some part of the East US 2 datacenter went into a forced slowdown or shutdown, impacting a great many Azure services in the region.
At this point, with a root cause known, we very seriously considered executing a full DR failover from East US 2 to a different Azure region. However, in hindsight, with “Azure Active Directory (Global)” mentioned among the impacted services, it’s not clear whether we could have executed that failover successfully.
Regardless, with Azure having identified the problem and reduced temperatures in the data center, we noticed that SQL connections were having better success, and at the very least our Application Web Site for TrackAbout was largely operational.
Connections to Azure Service Bus continued to be impacted. The result of this was that any process that required the Service Bus would fail. We included the list of impacted services in an update to our customers on our own status page.
Although Azure was announcing many different service failures, the only ones that seemed to be impacting TrackAbout were SQL connections and Azure Service Bus. With SQL on the mend and our application site largely functional, we focused on Service Bus.
We set about testing the viability of failing over just our Azure Service Bus instance to the Central US region. We proved this was workable by executing the DR (disaster recovery) failover against our customer-facing Client Test environment. We then executed the failover for Production.
At 4:54 AM EDT, we succeeded in failing over our Production instance of Azure Service Bus to the Central US region. At that point, all TrackAbout systems were fully operational.
Azure announced a resolution of their own issues around 10:30 AM.
We’ve had a few days to think about what, if anything, we may have done differently to restore TrackAbout services more quickly, and what changes we might consider making because of the experience.
A couple of potential architecture improvements stand out.
We can upgrade our Azure Service Bus to the Premium tier, at increased cost, for added high availability within a single region. We can’t know whether this would have spared us trouble during this outage, but it improves our odds. We may also investigate cross-region geo-redundancy options for Azure Service Bus, but it’s not yet clear that’s necessary. Creating an entirely new Service Bus in another Azure region is very fast, although adopting it requires modifying configuration files and restarting services, which causes a brief service disruption.
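The manual failover we performed amounts to swapping which namespace clients connect to. A sketch of pushing that choice into the client, so it tries namespaces in priority order and falls back on connection failure, looks like the following. The `connect` function and namespace names here are stand-ins for illustration; a real implementation would open clients through the Azure Service Bus SDK rather than these fakes.

```python
# Sketch of client-side failover across Service Bus namespaces: try each
# configured namespace in order, falling back to the next on connection
# failure. All names below are illustrative assumptions, not our config.

class ConnectError(Exception):
    pass

def send_with_failover(namespaces, connect, message):
    """Try namespaces in priority order; return the one that accepted the send."""
    last_err = None
    for ns in namespaces:
        try:
            client = connect(ns)  # would open an SDK client in real code
            client.send(message)
            return ns             # report which namespace handled the send
        except ConnectError as e:
            last_err = e          # this namespace unreachable: try the next
    raise last_err

# Simulated outage: the East US 2 namespace refuses connections.
class FakeClient:
    def __init__(self):
        self.sent = []
    def send(self, msg):
        self.sent.append(msg)

central = FakeClient()

def connect(ns):
    if ns == "sb-eastus2":
        raise ConnectError("connection refused")
    return central

used = send_with_failover(["sb-eastus2", "sb-centralus"], connect, "hello")
print(used)  # sb-centralus
```

The trade-off is that both namespaces must already exist and be configured, which is essentially what the Premium tier and geo-redundancy options buy in a managed form.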
We will investigate adopting an Azure SQL feature called “failover groups”, which simplifies failing over large groups of Azure SQL databases. This may accelerate our ability to conduct a geo failover without requiring configuration changes on our web and app servers.
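As a rough sketch of what this looks like with the Azure CLI (all resource names below are placeholders, not our actual servers or groups), a failover group pairs a primary server with a partner in another region and is then failed over with a single command, while clients keep connecting through the failover group's listener endpoint:

```shell
# Hypothetical example: pair a primary server in East US 2 with a
# partner server in Central US, replicating two databases.
az sql failover-group create \
    --name example-fog \
    --resource-group example-rg \
    --server example-sql-eastus2 \
    --partner-server example-sql-centralus \
    --add-db db1 db2

# During a regional outage, promote the Central US partner to primary;
# the failover-group endpoint clients use stays the same.
az sql failover-group set-primary \
    --name example-fog \
    --resource-group example-rg \
    --server example-sql-centralus
```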
Aside from spending more to improve our high-availability posture, we can take further steps, at some increased cost, to make our Test environment more like Production so that we can run DR exercises more often than we currently can.