First off, we at TrackAbout would like to offer our apology for the performance issues that have impacted you, our customers, today and last week. We understand how critical our service is to your operations, how much you depend on us to be there, and we fully appreciate what a work stoppage means to you. We know it means inventory doesn’t get filled, trucks don’t roll, orders don’t get delivered, and everyone gets frustrated. It disrupts your business.
We let you down and we’re sorry.
Let’s talk about what happened, and what we’re going to do to avoid the same thing happening again.
Root Cause Analysis
We feel confident that the incident from earlier today, and similar incidents from last week, were caused by database resource contention due to a data warehouse export job executing within the same SQL environment as many of our large customers' databases.
The data warehouse team was initially directed to read from our read-only geo-replica Azure SQL environment to isolate the load from production. However, due to a technical limitation that prevents accessing SQL Server Change Tracking information in a read-only environment, the team instead decided to read from the primary writable data source.
Compounding the issue, the destination for the exported data was in the same Elastic Pool as the sources.
Neither the sources nor the destination data warehouse databases showed up as poor performers in our usual performance monitoring graphs and alerts; measured on their own, they were working as designed. That delayed us in identifying them as the root cause. Meanwhile, their activity was consuming enough shared resources that other databases began to suffer, and the result was a general cascading failure.
We believe we’re correct on this being the root cause because killing the data warehouse export jobs resulted in an immediate drop in SQL worker thread utilization and allowed the environment to slowly recover.
There are several resources in any Azure SQL environment that must be monitored and managed for optimal performance. There’s CPU, data I/O (input/output), log I/O, memory, worker threads, and more. If any one of those resources is maxed out, there can be severe consequences. In this case, the worker thread pool became exhausted, which led to connection failures from our applications. That resulted in a terrible user experience and work stoppage.
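To illustrate the kind of check involved, here is a minimal sketch of a threshold alert across those resource dimensions. The column names mirror Azure SQL's `sys.dm_db_resource_stats` view; the 90% threshold and the sample reading are illustrative only, not our production monitoring configuration:

```python
# Illustrative sketch: flag any Azure SQL resource dimension that is
# near exhaustion. Column names mirror sys.dm_db_resource_stats;
# the 90% threshold is an example, not a production setting.

THRESHOLDS = {
    "avg_cpu_percent": 90.0,
    "avg_data_io_percent": 90.0,
    "avg_log_write_percent": 90.0,
    "avg_memory_usage_percent": 90.0,
    "max_worker_percent": 90.0,  # the dimension that was exhausted in our case
}

def exhausted_resources(stats_row: dict) -> list:
    """Return the resource dimensions at or above their threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if stats_row.get(name, 0.0) >= limit
    ]

# A hypothetical reading during the incident: CPU and I/O look healthy,
# but the worker thread pool is effectively exhausted.
sample = {
    "avg_cpu_percent": 35.0,
    "avg_data_io_percent": 20.0,
    "avg_log_write_percent": 10.0,
    "avg_memory_usage_percent": 60.0,
    "max_worker_percent": 99.8,
}
print(exhausted_resources(sample))  # → ['max_worker_percent']
```

The point of the sketch: a database can look perfectly healthy on the commonly watched dimensions (CPU, I/O) while quietly exhausting a less visible one, which is exactly what made this incident hard to diagnose.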
In a cloud hosting environment, when the root cause is unknown or will take time to track down, the first move is usually to increase capacity by whatever means are available.
Both last week and this morning, this was the first step we took to try to bring relief to the situation: we began what is called an elastic pool migration, moving our databases to a larger-capacity Azure SQL Elastic Pool.
Unfortunately, migrating 50+ customer databases to a higher capacity Azure SQL Elastic Pool takes substantial time. In our case, about 90 minutes. During the migration, databases that move sooner will experience relief sooner. But we do not have control over which databases move when.
Following a successful migration, there is usually a backlog of user requests that will need to be processed. Those can take some time to clear out as well and can cause an abnormal spike in load that risks the system further. Users will continue to retry their requests until they get results, creating more load than normal. This is natural user behavior, and there’s no getting around it except to ensure you’ve increased capacity enough to compensate for this short-lived spike.
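That retry dynamic is why automated clients conventionally spread their retries out with randomized exponential backoff. The sketch below is a generic illustration of that technique, not code from our applications; the base and cap values are arbitrary examples:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: wait a random time between 0 and
    min(cap, base * 2**attempt) seconds before retry number `attempt`.
    Randomizing the delay smears retries across the window instead of
    having every failed client come back at the same instant."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Successive retries by one client: each window doubles, up to the cap.
delays = [backoff_delay(a) for a in range(5)]
```

Human users tapping "retry" offer no such courtesy, which is why, as noted above, the only real remedy is to over-provision capacity for the recovery spike.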
After last week’s incidents, we implemented suggestions from the Azure SQL Support Engineering team. Indeed, the steps we took then resulted in a significant performance improvement, which held for two days. This led us to conclude it was safe to scale the Azure SQL Elastic Pool back down to its original size, so we did.
We thought we had treated the root cause: stale database statistics and an incorrect default setting for the maximum degree of parallelism (MAXDOP) in Azure SQL. But it turns out we had not found the true root cause after all.
Very early this morning, we began to experience the same symptoms as last week. We immediately upscaled the elastic pool to combat the load, and again, predictably, it took 90 minutes to complete. During that time, we were able to identify the data warehouse jobs, which we now believe to be the actual root cause. Killing the running jobs resulted in an immediate change in system behavior, for the better. Further bolstering our confidence, a scheduled job caused the data warehouse export to start again 20 minutes later, and we saw the same spike in load. We intervened, killed the job again, and permanently disabled it. Again, the system recovered.
The first obvious and immediate change will be to re-engineer the data warehouse processing to insulate the production systems from its load.
Beyond that, we have devised a new strategy for better database distribution among multiple Azure SQL Elastic Pools.
Ninety minutes is too long to wait to increase capacity. The duration is proportional to the total size of the databases in a pool. Fewer databases equals faster transitions. With smaller pools, we would be able to increase or decrease capacity more quickly and impact fewer customers at a time should the need arise.
To that end, we’re going to split our high-performance, business-critical Azure SQL Elastic Pool into three separate pools.
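Under the simple proportionality assumption described above, the arithmetic is straightforward. The numbers here are illustrative, extrapolated from our observed ~90-minute migration of one large pool, not a guarantee:

```python
# Illustrative only: if migrating one pool of 50+ databases takes about
# 90 minutes, and the duration is roughly proportional to the total data
# in the pool, then a pool holding a third of the data should migrate in
# roughly a third of the time.

def estimated_migration_minutes(pool_fraction: float,
                                full_pool_minutes: float = 90.0) -> float:
    """Rough estimate of migration time for a pool holding the given
    fraction of the original data, assuming linear scaling."""
    return full_pool_minutes * pool_fraction

print(estimated_migration_minutes(1 / 3))  # roughly 30 minutes
```

Just as importantly, a migration of one smaller pool disrupts only the customers in that pool, rather than everyone at once.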
We’ll be distributing customer databases among the pools with an eye towards minimizing overlap during busy periods.
We believe we’ve identified the root cause of the problem and have several positive changes that should prevent something similar from happening in the future. We’ll work hard to restore your trust in us and our systems.
Chief Technology Officer