Recent Performance Issues
Incident Report for TrackAbout System
Resolved
Twice in the past two days, at around 3:00 AM US Eastern Time, we experienced a significant increase in load from a small number of large customers. Specifically, we saw very high CPU consumption in our Azure SQL Database environment.
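
For readers curious how this kind of CPU pressure shows up, Azure SQL Database exposes recent resource consumption through the sys.dm_db_resource_stats view (and sys.elastic_pool_resource_stats at the pool level). The sketch below shows the general shape of such a check; the 80% threshold is illustrative only and is not our actual alerting rule.

    -- Sketch: recent CPU readings for one database.
    -- sys.dm_db_resource_stats is sampled every ~15 seconds and
    -- retains roughly the last hour of data.
    SELECT end_time,
           avg_cpu_percent,
           avg_data_io_percent,
           avg_log_write_percent
    FROM sys.dm_db_resource_stats
    WHERE avg_cpu_percent > 80   -- illustrative threshold
    ORDER BY end_time DESC;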

This resulted in application slowness, inaccessibility at times, and understandable customer frustration.

Both times this occurred, we increased capacity, but a capacity increase for an Azure SQL Database Elastic Pool can take over an hour to complete.

When the performance problems appeared, our database systems appear to have hit a critical breaking point, resulting in a cascading failure. As demand grew, database queries returned more slowly, and users grew impatient.

Impatient users retry. We saw one of our users request the same Load Truck page URL over 100 times in a rather short time span.

Every page request represents work that must be queued and eventually processed, whether or not the user waits around for the response. The work queued up, the system slowed down further, and it could not recover.

We can't point to any one thing that is "broken". We are simply getting more load from our customers than we are used to.

It is a constant challenge to maintain a level of capacity that is both cost-effective and yields acceptable performance. In a large, dynamic system with an uneven workload, you sometimes, unfortunately, learn where that line is by crossing it.

The options at this time are to (1) throw money at the problem and increase capacity, and (2) tune our databases and queries to perform better, which is a slow, methodical exercise.

Since the problem cannot wait, we have already increased our database capacity to a level that should handle the load.

We also opened a Severity 1 ticket with Azure Engineering, and they came back with a few worthwhile suggestions to explore around database statistics updates, reindexing, and adjusting MAXDOP (maximum degree of parallelism) in our databases. We are working through all of their suggestions.
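
For those who want specifics, those suggestions correspond to standard T-SQL maintenance commands. The sketch below is illustrative only: the table name dbo.SomeLargeTable and the MAXDOP value of 4 are placeholders, not the exact script we are running.

    -- 1. Refresh query optimizer statistics on a heavily used table.
    UPDATE STATISTICS dbo.SomeLargeTable WITH FULLSCAN;

    -- 2. Rebuild its indexes online to avoid blocking production traffic.
    ALTER INDEX ALL ON dbo.SomeLargeTable REBUILD WITH (ONLINE = ON);

    -- 3. Cap the maximum degree of parallelism at the database level.
    ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;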

More to come, if needed. Let's see how the next 24 hours goes.
Posted Mar 07, 2023 - 03:00 EST