Mar 14, 15:52 EDT
This incident has been resolved. We will publish a post-mortem soon with our findings and the improvements we plan to make going forward.
Mar 14, 11:02 EDT
The system appears to have stabilized. We are monitoring.
Mar 14, 05:22 EDT
We're still seeing intermittent slow performance. We currently believe it is due to a backlog of work from earlier. We're monitoring the situation and keeping an eye on individual high-load databases.
Mar 14, 05:04 EDT
We believe we have identified the root cause of the sudden spike in worker process usage from 3% to 100% that caused the outage.
We have a separate team working on building a data warehouse/BI offering. They had scheduled a number of heavy data migrations for 1 AM Eastern Daylight Time, the exact time we started experiencing problems.
We found the data warehouse job and killed it. The system began recovering almost immediately.
We expect operations to return to normal shortly.
Mar 14, 04:04 EDT
The Azure SQL Elastic Pool transition is 86% complete.
We've determined that at about 1:00 AM Eastern Daylight Time, our SQL pool worker percentage (the percentage of available worker processes in use in the SQL environment) jumped from 3% to 100% and stayed there. We're trying to find the cause of this very sudden spike in load.
When the workers in the SQL engine are exhausted, the engine cannot process further work until the load subsides.
Increasing the size of our Azure SQL Elastic Pool, as we have done, raises the maximum number of workers.
Finding the cause of this spike is our highest priority.
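For readers who want to see the worker-percentage idea concretely, here is a minimal sketch in Python. It is not our monitoring code: the `Sample` type, the `worker_percent` and `first_saturation` helpers, the sample values, and the worker limits of 200 and 400 are all hypothetical and chosen only for illustration. It shows how the percentage is derived, why a larger pool lowers it for the same load, and one way to locate the moment a metric becomes pinned at 100%.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    minute: int            # minutes since the start of a made-up window
    worker_percent: float  # workers in use as a percentage of the pool limit


def worker_percent(active_workers: int, max_workers: int) -> float:
    """Workers in use as a percentage of the pool's worker limit."""
    if max_workers <= 0:
        raise ValueError("max_workers must be positive")
    return 100.0 * active_workers / max_workers


def first_saturation(samples: Sequence[Sample], threshold: float = 100.0,
                     sustained: int = 3) -> Optional[Sample]:
    """Return the first sample of a run of `sustained` consecutive samples
    at or above `threshold`, or None if no such run exists."""
    run_start: Optional[Sample] = None
    run_length = 0
    for s in samples:
        if s.worker_percent >= threshold:
            if run_length == 0:
                run_start = s
            run_length += 1
            if run_length >= sustained:
                return run_start
        else:
            # The run was broken, so start counting again.
            run_start = None
            run_length = 0
    return None


# Illustrative data only: a 3% baseline, then pinned at 100%.
samples = [Sample(0, 3.0), Sample(1, 3.0), Sample(2, 100.0),
           Sample(3, 100.0), Sample(4, 100.0)]
hit = first_saturation(samples)

# Hypothetical limits: the same 200 busy workers against a larger pool.
before_resize = worker_percent(200, 200)
after_resize = worker_percent(200, 400)
```

Requiring several consecutive saturated samples is a design choice to avoid flagging a single brief peak; the right window depends on how often the metric is sampled.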
Mar 14, 03:41 EDT
While the Azure SQL Elastic Pool is migrating, we are seeing numerous database connection timeouts, which indicate that some of our customers are unable to use the application. The capacity migration is 45% complete at this time. It started at 31 minutes past the hour. We hope to have operations restored in less than an hour.
Mar 14, 03:09 EDT
We've identified that the Azure SQL Elastic Pool is again performing poorly under load. The pool is currently transitioning to a higher capacity. This may result in periods of inaccessibility while the transition takes place.
Mar 14, 02:48 EDT
We are investigating a report of site slowness. We are increasing the capacity of our Azure SQL instance to compensate.
Mar 14, 02:37 EDT