Sync Record Processing Outage
Incident Report for TrackAbout System
Postmortem

At 10:49 AM Eastern on April 30, 2020, our monitoring systems alerted us to a loss of connectivity from our Production web servers and services to Azure Service Bus. Azure Service Bus is a messaging platform we use to decouple systems from each other: producers hand messages to the bus, and consumers process them independently.
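For illustration only, here is a minimal sketch of that decoupling pattern using the @azure/service-bus SDK. The queue name, message shape, and function names are hypothetical, not our actual implementation:

    import { ServiceBusClient } from "@azure/service-bus";

    // Hypothetical names for illustration; not TrackAbout's actual code.
    const client = new ServiceBusClient(process.env.SERVICEBUS_CONNECTION_STRING!);

    // The web tier enqueues the record and returns immediately; a separate
    // worker consumes and processes it later, so the two systems stay decoupled.
    async function enqueueSyncRecord(record: object): Promise<void> {
      const sender = client.createSender("sync-records"); // hypothetical queue name
      try {
        await sender.sendMessages({ body: record });
      } finally {
        await sender.close();
      }
    }

When the bus is unreachable, sendMessages rejects, which is the failure mode described in this incident.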

We determined that several instances of Azure Service Bus hosted in the Microsoft Azure cloud were not accessible. We could not load the management tools in the Azure web portal, and we could not connect to the service bus using client tools.

We opened a ticket with Microsoft Azure at critical severity, the highest level.

Our website and most services remained operational throughout the incident. However, wireless record saving and syncing from mobile devices was impacted. Our mobile applications are designed to store saved records locally when our services are unavailable, so data is not lost, merely delayed.
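As a rough sketch of that store-and-forward behavior (the names and in-memory queue below are hypothetical stand-ins; the real app persists records to durable on-device storage):

    // Hypothetical sketch of the store-and-forward pattern.
    interface PendingRecord {
      id: string;
      payload: unknown;
    }

    // Stand-in for durable on-device storage.
    const localQueue: PendingRecord[] = [];

    function saveRecord(record: PendingRecord): void {
      localQueue.push(record); // persist locally first, so nothing is lost
      void trySync();          // then attempt to sync in the background
    }

    async function trySync(): Promise<void> {
      while (localQueue.length > 0) {
        try {
          await uploadToServer(localQueue[0]); // hypothetical upload call
          localQueue.shift(); // dequeue only after a confirmed upload
        } catch {
          return; // server unreachable: keep the record, retry on the next attempt
        }
      }
    }

    async function uploadToServer(record: PendingRecord): Promise<void> {
      // Placeholder for the real HTTPS call to the sync service.
    }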

We noticed that other Azure Service Bus instances we use for development and testing were still working.

We had begun planning to provision a new Service Bus instance to handle our production workload when our existing production instance came back online.

At 11:26 AM, our monitoring systems reported that systems had come back online and were operational.

Total impacted time was approximately 37 minutes.

At 2:33 PM Eastern, Microsoft Azure sent an Azure Service Health Alert email conveying the preliminary root cause of the outage:

PRELIMINARY ROOT CAUSE: Engineers determined that a localized power event affected a limited portion of Azure infrastructure in this region, leading to downstream impact for Service Bus, Event Hubs, and Azure Relay.

MITIGATION: Engineers validated that the platform returned to a healthy state once power was restored to the affected infrastructure.

About 350 records synced from mobile devices across various customers had been delayed by the outage. We processed these shortly after services came back online.
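Draining such a backlog is ordinary message consumption once the bus is reachable again. A hedged sketch of a worker doing so with @azure/service-bus (the queue name and processing function are hypothetical):

    import { ServiceBusClient } from "@azure/service-bus";

    // Hypothetical worker that drains messages accumulated during an outage.
    async function drainBacklog(connectionString: string): Promise<void> {
      const client = new ServiceBusClient(connectionString);
      const receiver = client.createReceiver("sync-records"); // hypothetical queue name
      try {
        for (;;) {
          const batch = await receiver.receiveMessages(20, { maxWaitTimeInMs: 5000 });
          if (batch.length === 0) break; // queue drained
          for (const message of batch) {
            await processSyncRecord(message.body);    // stand-in for real processing
            await receiver.completeMessage(message);  // remove from queue only after success
          }
        }
      } finally {
        await receiver.close();
        await client.close();
      }
    }

    async function processSyncRecord(body: unknown): Promise<void> {
      // Placeholder for the real record-processing pipeline.
    }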

We apologize for the impact to affected customers.

Posted Apr 30, 2020 - 15:57 EDT

Resolved
All backlogged syncs have been processed.

Microsoft Azure is investigating the root cause, and we will add a postmortem once we learn the nature of the failure.

We are marking this issue as resolved.
Posted Apr 30, 2020 - 14:10 EDT
Update
The underlying outage has been resolved. Services are operating normally.

We do have a small backlog of syncs that did not process during the outage and are working on pushing them through.

We will update this incident report when the sync backlog has been successfully processed.

We are also awaiting the root cause analysis from Microsoft Azure and will share it here in a postmortem when we have the information.
Posted Apr 30, 2020 - 12:02 EDT
Monitoring
Azure Service Bus connectivity has been restored. We are monitoring the situation.
Posted Apr 30, 2020 - 11:30 EDT
Identified
As a result of the Azure Service Bus outage, TAMobile 6 wireless saves and syncs will fail. Records will be saved on-device to be synced later when services are restored.

Outgoing integration messages to external systems (push messages) will also fail, but will be queued to be sent later.
Posted Apr 30, 2020 - 11:13 EDT
Investigating
We are currently investigating an issue that is preventing sync files from processing. We are experiencing connectivity problems with Azure Service Bus, on which this processing depends.
Posted Apr 30, 2020 - 11:00 EDT
This incident affected: Customer Test Services (Customer Test TAMobile 6 Sync Services) and Production Services (Production TAMobile 6 Sync Services).