Continued performance issues
Incident Report for TrackAbout System
Resolved
We had a clean, well-performing 24 hours. We're going to close this incident while we continue to work with Azure Engineering on the root cause analysis. We'll share our findings in a post-mortem covering this and the previous incidents on our status page.
Posted Nov 15, 2023 - 09:50 EST
Update
Today, we've been working on performance tuning and root cause analysis.

In recent weeks we have been experiencing anomalous behavior in the Azure environment, and we've engaged with Azure Engineering to get to the bottom of it. We had a good call today with a skilled engineer, but he needs to engage other members of the team to answer some of our questions.

For now, we have isolated the busiest database in our environment, which was partly to blame for the outage. That database is currently running on a dedicated, standalone server.
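
As a rough sketch of this kind of change (the database name, service objective, and connection details below are placeholders, not our actual configuration), assigning a standalone service objective in Azure SQL Database moves a database out of its elastic pool onto dedicated compute. The example assumes Python with pyodbc:

    import pyodbc

    # Connect to the logical server's master database (connection string is illustrative).
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};Server=tcp:example.database.windows.net;"
        "Database=master;Uid=admin_user;Pwd=...;Encrypt=yes;",
        autocommit=True,  # ALTER DATABASE cannot run inside a transaction
    )

    # Assigning a standalone edition/service objective removes the database from its
    # elastic pool and gives it dedicated resources. 'BusyDb' and 'S7' are hypothetical.
    conn.execute("ALTER DATABASE [BusyDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S7');")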

We have also increased our Azure SQL elastic pool service level objective by two full tiers to increase our performance headroom.
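
For illustration only, a change like this can be made with the Azure CLI's az sql elastic-pool update command; the resource names, edition, and capacity below are placeholders (and the exact parameter names may vary by CLI version), not the values we applied:

    import subprocess

    # Raise the elastic pool's service level objective (all values are placeholders).
    subprocess.run(
        [
            "az", "sql", "elastic-pool", "update",
            "--resource-group", "example-rg",
            "--server", "example-sqlserver",
            "--name", "example-pool",
            "--edition", "Standard",   # assumed parameter name; check your CLI version
            "--capacity", "800",       # pool eDTU/vCore limit, placeholder value
        ],
        check=True,
    )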

We are reducing MAXDOP, the maximum degree of parallelism, across the database fleet. This should reduce the number of threads/workers each database consumes; running out of threads/workers was one symptom of our issues last night.
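
In Azure SQL Database, MAXDOP is a database-scoped configuration, so it is set per database. A minimal sketch of what that looks like, assuming pyodbc and a hypothetical list of database names (the value 4 is illustrative, not necessarily what we chose):

    import pyodbc

    SERVER = "tcp:example.database.windows.net"   # placeholder logical server
    DATABASES = ["Db1", "Db2", "Db3"]             # hypothetical database fleet

    for db in DATABASES:
        conn = pyodbc.connect(
            f"Driver={{ODBC Driver 18 for SQL Server}};Server={SERVER};"
            f"Database={db};Uid=admin_user;Pwd=...;Encrypt=yes;",
            autocommit=True,
        )
        # Cap the degree of parallelism for queries in this database.
        conn.execute("ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;")
        conn.close()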

We've found some queries whose performance has gone awry, which can contribute to a cascading failure. We're working on those at present.
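
One way to surface such queries (a sketch of the general technique, not necessarily the tooling we used) is Query Store, which records per-query runtime statistics in each Azure SQL database. The database name and connection details are placeholders:

    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};Server=tcp:example.database.windows.net;"
        "Database=ExampleDb;Uid=admin_user;Pwd=...;Encrypt=yes;"
    )

    # Top queries by total duration recorded in Query Store.
    rows = conn.execute("""
        SELECT TOP 20
               q.query_id,
               SUM(rs.count_executions)                   AS executions,
               SUM(rs.avg_duration * rs.count_executions) AS total_duration_us,
               MAX(qt.query_sql_text)                     AS sample_text
        FROM sys.query_store_runtime_stats AS rs
        JOIN sys.query_store_plan          AS p  ON p.plan_id = rs.plan_id
        JOIN sys.query_store_query         AS q  ON q.query_id = p.query_id
        JOIN sys.query_store_query_text    AS qt ON qt.query_text_id = q.query_text_id
        GROUP BY q.query_id
        ORDER BY total_duration_us DESC;
    """).fetchall()

    for r in rows:
        print(r.query_id, r.executions, r.total_duration_us)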

We are continuing to focus our time on performance engineering.
Posted Nov 14, 2023 - 18:08 EST
Update
Performance looks much better at the moment. We don't have any more changes to make at this time. We're going to monitor the situation.
Posted Nov 14, 2023 - 07:18 EST
Monitoring
This is a continuation of the issue from earlier. We closed the incident too soon and could not append to it.

We've been getting reports from the field that users are unable to sync handhelds. This appears to be due to continued high database load.

Our Azure SQL Database elastic pool's transition to a higher performance tier took 2.5 hours, which is 1.5 hours longer than we'd ever seen before.

That transition has just completed, and we're monitoring performance.
Posted Nov 14, 2023 - 06:40 EST
This incident affected: Production Services (Production Application Web Site, Production TAMobile 6 Sync Services, TAMobile 7 iOS, TAMobile 7 Android).