At 10:49 AM Eastern, our monitoring systems alerted us to a loss of connectivity from our production web servers and services to Azure Service Bus, a messaging platform used to decouple systems from one another.
We determined that several instances of Azure Service Bus hosted in the Microsoft Azure cloud were not accessible. We could not load the management tools in the Azure web portal, and we could not connect to the service bus using client tools.
We opened a critical-severity ticket, the highest level, with Microsoft Azure support.
Our web site and most services remained operational. However, record saving and syncing from mobile devices over wireless connections was affected. Our mobile applications are designed to store saved records locally when our services are unavailable, so that data is not lost, merely delayed.
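The store-and-forward design described above can be sketched as a small durable local queue. This is an illustrative sketch only, not our actual mobile code; the `send` callable stands in for whatever publishes a record to Service Bus, and all names are hypothetical.

```python
import json
import sqlite3
import time


class OfflineRecordStore:
    """Durable local queue: saved records persist until the backend confirms them."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            " id INTEGER PRIMARY KEY, body TEXT NOT NULL, created REAL NOT NULL)"
        )

    def save(self, record: dict) -> None:
        # Always persist locally first, so an outage only delays the data.
        self.db.execute(
            "INSERT INTO pending (body, created) VALUES (?, ?)",
            (json.dumps(record), time.time()),
        )
        self.db.commit()

    def sync(self, send) -> int:
        """Try to push pending records via `send`; return how many were confirmed."""
        rows = self.db.execute("SELECT id, body FROM pending ORDER BY id").fetchall()
        sent = 0
        for row_id, body in rows:
            try:
                send(json.loads(body))
            except ConnectionError:
                break  # backend still down; keep the records and retry later
            self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            sent += 1
        self.db.commit()
        return sent

    def pending_count(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
```

A record is deleted only after the publish call succeeds, so a crash or outage at any point leaves the record queued for the next sync attempt.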
We noticed that other Azure Service Bus instances we use for development were still working. We had begun planning to provision a new instance to handle our production workload; before we could, the original production instance came back online.
At 11:26 AM, our monitoring systems reported that systems had come back online and were operational.
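Outage and recovery detection of the kind described above can be approximated with a simple reachability probe. Azure Service Bus accepts AMQP-over-TLS connections on TCP port 5671, so a plain TCP connect is a cheap liveness signal. This is a minimal sketch, not our monitoring system; it does not verify authentication or full protocol health.

```python
import socket


def endpoint_reachable(host: str, port: int = 5671, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 5671 is the AMQP-over-TLS port used by Azure Service Bus. A successful
    connect only shows the endpoint is accepting connections, not that the
    messaging service behind it is fully healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A monitor would call this on an interval and alert on transitions from reachable to unreachable and back.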
Total impacted time was approximately 37 minutes.
At 2:33 PM Eastern, Microsoft Azure sent an Azure Service Health Alert email describing the preliminary root cause of the outage:
PRELIMINARY ROOT CAUSE: Engineers determined that a localized power event affected a limited portion of Azure infrastructure in this region, leading to downstream impact for Service Bus, Event Hubs, and Azure Relay.
MITIGATION: Engineers validated that the platform returned to a healthy state once power was restored to the affected infrastructure.
About 350 records syncing from mobile devices, across various customers, were delayed by the outage. We processed them shortly after services came back online.
We apologize for the impact to affected customers.