Outage

Incident Report for TrackAbout System

Postmortem

Root Cause Analysis

On June 6th, 2022, TrackAbout experienced a critical outage of services. The root cause of the outage was a failure of datacenter cooling systems in the Microsoft Azure East US 2 region. This failure caused a region-wide outage impacting many Microsoft Azure customers using services in the East US 2 region.

Microsoft Azure has not yet released their own RCA (root-cause analysis). However, we at TrackAbout know enough currently to issue our own RCA as it’s clear the root cause of TrackAbout’s outage was Microsoft Azure’s outage.

We first became aware of trouble on June 6th, 2022 at 11:34 PM Eastern. Our alert system detected SQL database connection issues appearing in our logs and notified our DevOps team. The alerts indicated database connection pool exhaustion, which occurs when either database resources are taxed beyond their available capacity or attempts to create new database connections from app servers fail.
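To make the second cause concrete, here is a minimal sketch of a bounded connection pool. It is a toy model for illustration, not TrackAbout's actual data access stack; the class and exception names are hypothetical. Each caller must obtain one of a fixed number of slots, and when every slot is held (for example, by connection attempts that hang against an unreachable database), further callers time out with an exhaustion error.

```python
import threading


class PoolExhausted(Exception):
    """Raised when no connection slot frees up within the timeout."""


class ConnectionPool:
    """Toy fixed-size pool; real database drivers are more involved."""

    def __init__(self, max_size, connect):
        self._slots = threading.BoundedSemaphore(max_size)
        self._connect = connect  # hypothetical factory that opens a connection

    def acquire(self, timeout):
        # Wait for a free slot; give up after `timeout` seconds.
        if not self._slots.acquire(timeout=timeout):
            raise PoolExhausted("no free connection slot")
        try:
            return self._connect()
        except Exception:
            # Opening failed: hand the slot back so other callers can try.
            self._slots.release()
            raise

    def release(self, conn):
        # Return the slot to the pool (a real pool would also recycle `conn`).
        self._slots.release()
```

With a pool of size 2, a third `acquire` raises `PoolExhausted` until one of the first two connections is released. The symptom is the same whether the slots are held by slow queries or by stalled connection attempts, which is why the alert alone could not distinguish the two causes.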

By checking our database performance metrics, we ruled out resource exhaustion. We confirmed through attempting to connect to the SQL database servers that connections were either failing or taking so long that our applications would time out attempting to connect.
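A connectivity check of this kind can be sketched as a timed TCP probe. This is an illustrative example rather than the tooling we actually used; the function name and thresholds are assumptions. It only tests whether a TCP connection can be opened, not whether a database login succeeds.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0, slow_after=1.0):
    """Classify one TCP connection attempt as 'ok', 'slow', or 'failed'.

    Returns a (status, elapsed_seconds) tuple.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return "failed", time.monotonic() - start
    return ("slow" if elapsed > slow_after else "ok"), elapsed
```

For a SQL Server style endpoint one might call `probe_tcp("example-server.database.windows.net", 1433)`, where the host name is a placeholder and 1433 is the default SQL Server TCP port. Repeating the probe every few seconds gives a rough picture of whether connections are failing outright or merely slow.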

At 11:49 PM, Microsoft Azure released a Service Issue health event (https://app.azure.com/h/NMB2-ND0/04e477). Initially, the description of this event mentioned only that Azure was aware of database connection issues to SQL databases in our region of East US 2.

This would be the first of some 28 alerts and updates that we would receive from Microsoft Azure over the next 10+ hours pertaining to their outage.

We began executing our Incident Response plan, which involves announcing what we know via our public Status Page at https://status.trackabout.com and starting a new internal Incident Log to capture the timeline and the actions taken to mitigate the issue. We continued to update customers throughout the incident via the Status Page.

We opened a Severity 1 (Critical) ticket with our Cloud Service Provider (CSP) Logicworks, who in turn opened a Sev1 ticket with Microsoft Azure. We established a Microsoft Teams chat including TrackAbout DevOps, the Logicworks NOC (network operations center) and Azure Support.

Our alert system began letting us know about connection issues with our application components that connect to the Azure Service Bus, a messaging system that is used to allow software components to talk to one another.

At 12:33 AM on June 7th, we also received an alert from Microsoft Azure indicating that their Managed Identity service upon which we rely was also experiencing problems.

Other alerts from Azure indicating issues with Storage and Compute resources would soon follow.

It was now clear that Microsoft Azure was experiencing a multi-system failure, at a minimum within the East US 2 region where most of our services are hosted.

Failures in Azure are rare, but they do happen. In our experience, they tend to be short-lived, with resolution taking an hour or two at most, and with TrackAbout services restored usually before Azure announces a resolution.

TrackAbout maintains a disaster recovery plan which enables migrating services from one Azure region to another in the event of a regional failure. However, performing a regional failover poses its own set of complexities and risks. Usually, the best course of action is to suffer the short-term outage and wait for Microsoft to fix whatever is broken.

At this point in the incident, the root cause of the Azure outage was unclear. It was not clear whether key Azure components required to perform a geographic DR failover were impacted, or whether we would be met with similar problems in our target geographic region in Azure Central US.

We decided to stay put in Azure East US 2 and continue to wait for more information from Microsoft.

At around 2:04 AM EDT, Microsoft Azure amended their Service Health alert to finally indicate a root cause.

Summary of Impact: Starting at 02:41 UTC on 07 Jun 2022, a subset of customers with resources in the East US 2 region may experience difficulties connecting to resources hosted in this region. Engineers are aware of an issue where multiple scale units were impacted by a cooling failure in the East US 2 datacenter which has impacted Storage and Compute resources. Due to the reliance on these types of resources, other Azure services within this region may also be experiencing impact.

Current Status: We have taken steps to bring down the temperatures in the impacted areas of the datacenter to address the cooling issues. We are now in the early stages in mitigation efforts to restore hardware for underlying resources. The next update will be provided as events warrant.

IMPACTED SERVICE(S) AND REGION(S)

Microsoft Azure subsequently updated this ticket several times, eventually listing issues with:

SQL Database
Azure Relay
Backup
Event Hubs
ExpressRoute \ ExpressRoute Gateways
Service Bus
Azure Active Directory (Global)
Storage
HDInsight

Datacenters are buildings full of racks of computers running full throttle and generating tremendous amounts of heat. Cooling systems are required to maintain safe temperatures within tolerance ranges of those computers. Without proper cooling, most computers have thermal overload protection and will either slow performance to reduce heat or simply shut down so as not to damage sensitive components.

With so many Azure components affected, it seemed clear that some part of the East US 2 datacenter went into a forced slowdown or shutdown, impacting a great many Azure services in the region.

At this point, with a root cause known, we very seriously considered executing a full DR failover to a different Azure region than East US 2. However, in hindsight, with “Azure Active Directory (Global)” being mentioned among the impacted services, it’s not clear whether we would have run into problems executing.

Regardless, with Azure having identified the problem and reduced temperatures in the data center, we noticed that SQL connections were having better success, and at the very least our Application Web Site for TrackAbout was largely operational.

Connections to Azure Service Bus continued to be impacted. The result of this was that any process that required the Service Bus would fail. We included the list of impacted services in an update to our customers on our own status page.

Although Azure was announcing many different service failures, the only ones that seemed to be impacting TrackAbout were SQL connections and Azure Service Bus. With SQL on the mend and our application site largely functional, we focused on Service Bus.

We set about testing the viability of failing just our Azure Service Bus instance over to the Central US region. We proved this was workable by executing the DR (disaster recovery) failover against our customer-facing Client Test environment. We then executed the failover for Production.

At 4:54 AM EDT, we succeeded in failing over our Production instance of Azure Service Bus to the Central US region. All TrackAbout systems at that point were fully operational.

Azure announced a resolution of their own issues around 10:30 AM.

Corrective and Preventative Actions

We’ve had a few days to think about what, if anything, we may have done differently to restore TrackAbout services more quickly, and what changes we might consider making because of the experience.

A couple of potential architecture improvements have jumped out.

- We can upgrade our Azure Service Bus to the Premium tier, at increased cost, for added high availability within a single region. We can’t know whether this would have spared us trouble during this outage, but it improves our odds. We may also investigate cross-region geo-redundancy options for Azure Service Bus, but it’s not yet clear that’s necessary. Creating an entirely new Service Bus in another Azure region is very fast, although adopting it requires modifying configuration files and restarting services, which causes a brief service disruption.
- We will investigate adopting a feature of Azure SQL called “failover groups”, which simplifies failing over large groups of Azure SQL databases. This may accelerate our ability to conduct a geo failover without requiring configuration changes within our web and app servers.
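To illustrate why switching to a new Service Bus involves a configuration change and a restart, here is a small sketch. The configuration layout, key names, and namespace names are all hypothetical and do not reflect TrackAbout's real configuration; the sketch only assumes that services read their messaging endpoint from configuration at startup.

```python
import json

# Hypothetical configuration; namespace names and keys are illustrative only.
CONFIG_JSON = """
{
  "serviceBus": {
    "active": "primary",
    "namespaces": {
      "primary": "sb://example-eastus2.servicebus.windows.net/",
      "secondary": "sb://example-centralus.servicebus.windows.net/"
    }
  }
}
"""


def active_endpoint(config):
    """Return the Service Bus endpoint the application should use."""
    bus = config["serviceBus"]
    return bus["namespaces"][bus["active"]]


def fail_over(config, target):
    """Return a copy of the config pointing at `target`; the input is unchanged."""
    if target not in config["serviceBus"]["namespaces"]:
        raise KeyError(f"unknown namespace: {target}")
    updated = json.loads(json.dumps(config))  # simple deep copy via JSON
    updated["serviceBus"]["active"] = target
    return updated
```

Calling `fail_over(json.loads(CONFIG_JSON), "secondary")` yields a configuration whose active endpoint is the Central US namespace. A service that resolves `active_endpoint` only once at startup keeps using the old endpoint until it is restarted, which is the brief disruption described above.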

Aside from spending more to improve our High Availability posture, we can take further steps to make our Test environment more like Production, at some increased cost, so that we can practice more DR exercises than we can currently.

Posted Jun 10, 2022 - 12:40 EDT

Resolved

All systems are fully operational. We are resolving this incident.

Posted Jun 07, 2022 - 10:21 EDT

Monitoring

We have performed a partial fail-over to move affected resources out of the impacted Azure East US 2 region. Things appear to be stable and we will continue monitoring.

Posted Jun 07, 2022 - 04:54 EDT

Update

I'm afraid we don't have any good news to share just yet. Microsoft Azure continues to work to restore impacted hardware and devices which shut down due to the cooling problems.

While we have a disaster recovery plan to migrate our resources from the East US 2 Azure region to the Central US region, it's a riskier bet to exercise the plan at this time. We are trying to get better information from Azure with some kind of expected resolution timeframe before we go down that road.

Posted Jun 07, 2022 - 03:34 EDT

Update

Just passing along an update from Microsoft Azure and their status page at https://status.azure.com/en-us/status

We continue to experience issues at TrackAbout as a result. We've heard that some customers are unable to sync their mobile devices.

We'll pass along updates as we get them.

From Microsoft Azure:

Warning - Multiple Azure Resources - East US 2 - Mitigation in Progress

Starting at 02:41 UTC on 07 Jun 2022, a subset of customers with resources in the East US 2 region may experience difficulties connecting to resources hosted in this region. Engineers are aware of an issue where multiple scale units were impacted by a cooling failure in the East US 2 datacenter which has impacted Storage and Compute resources. Due to the reliance on these types of resources, other Azure services within this region may also be experiencing impact.

We have taken steps to bring down the temperatures in the impacted areas of the datacenter to address the cooling issues. We are now in the early stages in mitigation efforts to restore hardware for underlying resources. The next update will be provided in 60 minutes, or as events warrant.

This message was last updated at 05:54 UTC on 07 June 2022

Posted Jun 07, 2022 - 02:23 EDT

Update

Microsoft Azure released this statement a short while ago. We continue to experience rolling service outages as a result.

Multiple Azure Resources - East US 2 - Mitigation in Progress

Starting at 02:41 UTC on 07 Jun 2022, a subset of customers with resources in the East US 2 region may experience difficulties connecting to resources hosted in this region.

Engineers are aware of an issue where multiple scale units were impacted by a cooling failure in the East US 2 datacenter which has impacted Storage and Compute resources. As a result, downstream services may also be experiencing impact. We are currently actively engaged in mitigation efforts to restore services. The next update will be provided in 60 minutes or as events warrant.

This message was last updated at 04:56 UTC on 07 June 2022

Posted Jun 07, 2022 - 01:18 EDT

Update

Issues with Azure Service Bus continue. See our last announcement for a list of services at TrackAbout impacted by Service Bus unavailability.

We are also seeing intermittent slow page loads in the app web site.

Microsoft Azure has now posted alerts pertaining to "Managed Identity" and "Virtual Machines". Our virtual machines are as-yet unaffected, but we will continue to monitor the situation.

Posted Jun 07, 2022 - 00:48 EDT

Update

Azure SQL Database appears to have recovered, but we're also seeing connection issues to Azure Service Bus. It seems like Azure is experiencing multiple issues. We are monitoring.

The following actions may be impacted by Azure Service Bus failure:
- TAMobile 6 record saves
- Delivery receipt emails
- Lot number generation during fills
- Inventory audits
- Record Alerts
- Rental bill generation

Posted Jun 07, 2022 - 00:23 EDT

Update

Services appear to be back online although performance is not yet 100%. It's possible service may be disrupted again.

Posted Jun 06, 2022 - 23:59 EDT

Identified

We have received notice from Microsoft Azure that Azure SQL Databases in the East US 2 region are experiencing issues. Microsoft Azure is aware of the situation and is actively investigating.

Posted Jun 06, 2022 - 23:53 EDT

Investigating

We are currently investigating an issue with the Azure SQL Database back-end. Database connections are failing. TrackAbout is offline.

Posted Jun 06, 2022 - 23:51 EDT

This incident affected: Production Services (Production Application Web Site, Production TAMobile 6 Sync Services, TAMobile 7 iOS, TAMobile 7 Android, Custom Reports and OpenData).