Continued performance issues
Incident Report for TrackAbout System
Resolved
We had a clean, well-performing 24 hours. We're going to close this incident while we continue to work with Azure Engineering on the root cause analysis. We'll share our findings in a post-mortem covering this and the previous incidents on our status page.
Posted Nov 15, 2023 - 09:50 EST
Update
Today, we've been working on performance tuning and root cause analysis.

In recent weeks we have been experiencing anomalous behavior in the Azure environment, and we've engaged with Azure Engineering to get to the bottom of it. We had a good call today with a skilled engineer, but he needs to engage other members of the team to answer some of our questions.

For now, we have isolated the busiest database in our environment, which was partly to blame for the outage. That database is currently running on a dedicated, standalone server.
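
As a rough sketch of this kind of change (the database name, service objective, and connection details below are placeholders, not our actual configuration), assigning a standalone service objective in Azure SQL Database moves a database out of its elastic pool onto dedicated compute. The example assumes Python with pyodbc:

    import pyodbc

    # Connect to the logical server's master database (connection string is illustrative).
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};Server=tcp:example.database.windows.net;"
        "Database=master;Uid=admin_user;Pwd=...;Encrypt=yes;",
        autocommit=True,  # ALTER DATABASE cannot run inside a transaction
    )

    # Assigning a standalone edition/service objective removes the database from its
    # elastic pool and gives it dedicated resources. 'BusyDb' and 'S7' are hypothetical.
    conn.execute("ALTER DATABASE [BusyDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S7');")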

We have also increased our Azure SQL elastic pool service level objective by two full tiers to increase our performance headroom.
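
For illustration only, a change like this can be made with the Azure CLI's az sql elastic-pool update command; the resource names, edition, and capacity below are placeholders (and the exact parameter names may vary by CLI version), not the values we applied:

    import subprocess

    # Raise the elastic pool's service level objective (all values are placeholders).
    subprocess.run(
        [
            "az", "sql", "elastic-pool", "update",
            "--resource-group", "example-rg",
            "--server", "example-sqlserver",
            "--name", "example-pool",
            "--edition", "Standard",   # assumed parameter name; check your CLI version
            "--capacity", "800",       # pool eDTU/vCore limit, placeholder value
        ],
        check=True,
    )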

We are reducing MAXDOP, the maximum degree of parallelism, across the database fleet. This should reduce the number of threads/workers each database consumes; running out of threads/workers was one symptom of our issues last night.
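
In Azure SQL Database, MAXDOP is a database-scoped configuration, so it is set per database. A minimal sketch of what that looks like, assuming pyodbc and a hypothetical list of database names (the value 4 is illustrative, not necessarily what we chose):

    import pyodbc

    SERVER = "tcp:example.database.windows.net"   # placeholder logical server
    DATABASES = ["Db1", "Db2", "Db3"]             # hypothetical database fleet

    for db in DATABASES:
        conn = pyodbc.connect(
            f"Driver={{ODBC Driver 18 for SQL Server}};Server={SERVER};"
            f"Database={db};Uid=admin_user;Pwd=...;Encrypt=yes;",
            autocommit=True,
        )
        # Cap the degree of parallelism for queries in this database.
        conn.execute("ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;")
        conn.close()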

We've found some queries whose performance has gone awry, which can contribute to a cascading failure. We're working on those at present.
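
One way to surface such queries (a sketch of the general technique, not necessarily the tooling we used) is Query Store, which records per-query runtime statistics in each Azure SQL database. The database name and connection details are placeholders:

    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};Server=tcp:example.database.windows.net;"
        "Database=ExampleDb;Uid=admin_user;Pwd=...;Encrypt=yes;"
    )

    # Top queries by total duration recorded in Query Store.
    rows = conn.execute("""
        SELECT TOP 20
               q.query_id,
               SUM(rs.count_executions)                   AS executions,
               SUM(rs.avg_duration * rs.count_executions) AS total_duration_us,
               MAX(qt.query_sql_text)                     AS sample_text
        FROM sys.query_store_runtime_stats AS rs
        JOIN sys.query_store_plan          AS p  ON p.plan_id = rs.plan_id
        JOIN sys.query_store_query         AS q  ON q.query_id = p.query_id
        JOIN sys.query_store_query_text    AS qt ON qt.query_text_id = q.query_text_id
        GROUP BY q.query_id
        ORDER BY total_duration_us DESC;
    """).fetchall()

    for r in rows:
        print(r.query_id, r.executions, r.total_duration_us)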

We are continuing to focus our time on performance engineering.
Posted Nov 14, 2023 - 18:08 EST
Update
Performance looks much better at the moment. We don't have any more changes to make at this time. We're going to monitor the situation.
Posted Nov 14, 2023 - 07:18 EST
Monitoring
This is a continuation of the issue from earlier. We closed the incident too soon and could not append to it.

We've been getting reports from the field that users are unable to sync handhelds. This appears to be due to continued high database load.

Our Azure SQL Database elastic pool's transition to a higher performance tier took 2.5 hours, which is 1.5 hours longer than we'd ever seen before.

That transition has just completed, and we're monitoring performance.
Posted Nov 14, 2023 - 06:40 EST
This incident affected: Production Services (Production Application Web Site, Production TAMobile 6 Sync Services, TAMobile 7 iOS, TAMobile 7 Android).