Slow Performance

Incident Report for TrackAbout System

Postmortem

First off, we at TrackAbout would like to offer our apology for the performance issues that have impacted you, our customers, today and last week. We understand how critical our service is to your operations, how much you depend on us to be there, and we fully appreciate what a work stoppage means to you. We know it means inventory doesn’t get filled, trucks don’t roll, orders don’t get delivered, and everyone gets frustrated. It disrupts your business.

We let you down and we’re sorry.

Let’s talk about what happened, and what we’re going to do to avoid the same thing happening again.

Root Cause Analysis

We feel confident that the incident from earlier today, and similar incidents from last week, were caused by database resource contention due to a data warehouse export job executing within the same SQL environment as many of our large customers' databases.

The data warehouse team was initially directed to read from our read-only geo-replica Azure SQL environment to isolate the load from production. However, due to a technical limitation that prevents accessing SQL Server Change Tracking information in a read-only environment, the team decided to instead read from the primary writeable data source.

Compounding the issue, the target destination for writing the data was in the same Elastic Pool.

Neither the sources nor the destination data warehouse databases showed up as poor performers in our usual performance monitoring graphs and alerts. As far as they were concerned, they were working as designed. This delayed us in identifying them as the root cause. However, that activity was consuming enough shared resources that other databases began to suffer, and as a result we experienced a general cascade failure.

We believe we’re correct on this being the root cause because killing the data warehouse export jobs resulted in an immediate drop of SQL worker thread percentage and allowed the environment to slowly recover.

Explanation

There are several resources in any Azure SQL environment that must be monitored and managed for optimal performance. There’s CPU, data I/O (input/output), log I/O, memory, worker threads, and more. If any one of those resources is maxed out, there can be severe consequences. In this case, the worker thread pool became exhausted, which led to connection failures from our applications. That resulted in a terrible user experience and work stoppage.

If root cause is unknown or will take time to track down, in a cloud hosting environment, the first move will usually be to increase capacity by whatever means are available.

Both last week and this morning, this was the first step we took to try to bring relief to the situation. We began what is called an Elastic Pool migration to a larger-capacity elastic pool.

Unfortunately, migrating 50+ customer databases to a higher capacity Azure SQL Elastic Pool takes substantial time. In our case, about 90 minutes. During the migration, databases that move sooner will experience relief sooner. But we do not have control over which databases move when.

Following a successful migration, there is usually a backlog of user requests that will need to be processed. Those can take some time to clear out as well and can cause an abnormal spike in load that risks the system further. Users will continue to retry their requests until they get results, creating more load than normal. This is natural user behavior, and there’s no getting around it except to ensure you’ve increased capacity enough to compensate for this short-lived spike.

After last week’s incidents, we implemented suggestions from the Azure SQL Support Engineering team. Indeed, the steps we took then resulted in a significant performance improvement which we maintained for two days. This led us to conclude we would be successful in moving the size of the Azure SQL Elastic Pool back to where we started, so we did.

We thought we had treated the root cause, bad statistics in the database and an incorrect default setting for maximum degrees of parallelism in Azure SQL. But it turns out we did not know the root cause after all.

Very early this morning, we began to experience the same symptoms as last week. We immediately upscaled the elastic pool to combat the load, and again, predictably, it took 90 minutes to complete. During that time, we were able to identify the data warehouse jobs, which we now believe to be the actual root cause. Killing the running jobs resulted in an immediate change in system behavior, for the better. Further bolstering our confidence in this as the root cause, a scheduled job caused the data warehouse export to start again 20 minutes later, and we saw the spike. We intervened, killed the job again, and permanently disabled it. Again, the system recovered.

Going Forward

The first obvious and immediate change will be to re-engineer the data warehouse processing to insulate the production systems from load.

Beyond that, we have a devised new strategy for better database distribution among multiple Azure SQL Elastic Pools.

Ninety minutes is too long to wait to increase capacity. The duration is proportional to the total size of the databases in a pool. Fewer databases equals faster transitions. With smaller pools, we would be able to increase or decrease capacity more quickly and impact fewer customers at a time should the need arise.

To that end, we’re going to split up our high performance, business critical Azure SQL Elastic Pool into 3 separate pools.

We’ll be distributing customer databases among the pools with an eye towards minimizing overlap during busy periods.

Conclusion

We believe we’ve identified the root cause of the problem and have several positive changes that should prevent something similar from happening in the future. We’ll work hard to restore your trust in us and our systems.

Sincerely,

Larry Silverman
Chief Technology Officer
TrackAbout, Inc.

Posted Mar 14, 2023 - 15:52 EDT

Resolved

This incident has been resolved. We'll be publishing a post-mortem soon with our findings and improvements we'll be making going forward.

Posted Mar 14, 2023 - 11:02 EDT

Update

The system appears to have stabilized. We are monitoring.

Posted Mar 14, 2023 - 05:22 EDT

Update

We're continuing to see some slow performance here and there. We're currently thinking it's due to backlog of work from earlier. We're monitoring the situation and keeping an eye on individual high-load databases.

Posted Mar 14, 2023 - 05:04 EDT

Monitoring

We believe we have identified the root cause of the sudden worker process spike from 3% to 100% that caused the outage.

We have a separate team working on building a data warehouse/BI offering. They had scheduled a number of heavy data migrations for 1 AM Eastern Daylight Time, the exact time we started experiencing problems.

We found the data warehouse job and killed it. The system began recovering almost immediately.

We expect operations to return to normal shortly.

Posted Mar 14, 2023 - 04:04 EDT

Update

Azure SQL Elastic Pool transition is at 86% complete.

We've determined that at about 1:00 AM Eastern Daylight Time, our SQL pool worker percentage (the percentage of available worker processes in the SQL environment) went from 3% right up to 100% and stayed there. We're trying to find the cause of this very sudden spike in load.

When the workers in the SQL engine are exhausted, it cannot process further work until things calm down.

Increasing the size of our Azure SQL Elastic Pool as we have done increases the maximum number of workers.

Finding the cause of this spike is our highest priority.

Posted Mar 14, 2023 - 03:41 EDT

Update

While the Azure SQL Elastic Pool is migrating, we are seeing numerous database connection timeouts which indicate some of our customers are unable to use the application. The capacity migration is at 45% complete at this time. It started at 31 minutes past the hour. We hope to have operations restored in less than an hour.

Posted Mar 14, 2023 - 03:09 EDT

Identified

We've identified the Azure SQL Elastic Pool is again suffering poor performance under load. The pool is currently transitioning to increase power. This may result in periods of inaccessibility while the transition takes place.

Posted Mar 14, 2023 - 02:48 EDT

Investigating

We are investigating a report of slowness of the site. We are boosting capacity of our Azure SQL instance to compensate.

Posted Mar 14, 2023 - 02:37 EDT

This incident affected: Production Services (Production Application Web Site, Production TAMobile 6 Sync Services, TAMobile 7 iOS, TAMobile 7 Android).