Our patch appears to have solved the problem. We're closing out this incident.
In terms of root cause, we believe we've identified a specific case where our own code did not change between releases, but a third-party dependency's implementation of asynchronous execution did. The problem surfaced only under specific circumstances involving multiple independent actions occurring at the same time. This is not something we'd be likely to catch in pre-release testing, which explains why we didn't detect it until the code reached Production.
We'll continue investigating how our code interacts with this third-party dependency in order to create a path forward, as we do want to stay current on third-party dependencies.
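As an aside for the technically curious: the class of bug described above, where code is correct when callbacks run one at a time but breaks when a dependency starts interleaving them, can be illustrated with a minimal sketch. This is a purely hypothetical example (the function and variable names are ours, not the actual dependency's) showing how switching from sequential to concurrent async execution exposes a latent lost-update race:

```python
import asyncio

# Hypothetical illustration: a counter that is safe when tasks run
# sequentially, but loses updates when they are interleaved.
counter = 0

async def record_action():
    global counter
    current = counter       # read the shared value
    await asyncio.sleep(0)  # yield to the event loop, as async libraries do
    counter = current + 1   # write back; stale if another task ran in between

async def run_actions(concurrent: bool) -> int:
    global counter
    counter = 0
    actions = [record_action() for _ in range(100)]
    if concurrent:
        # Interleaved execution: every task reads before any task writes.
        await asyncio.gather(*actions)
    else:
        # One-at-a-time execution: each read/write pair completes atomically.
        for action in actions:
            await action
    return counter

print(asyncio.run(run_actions(concurrent=False)))  # 100: sequential is safe
print(asyncio.run(run_actions(concurrent=True)))   # fewer than 100: lost updates
```

The same application code produces correct results in the sequential mode and silently wrong results in the concurrent one, which is why such a change in a dependency's execution model can slip past release testing.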
Nov 30, 20:15 EST
We deployed a minor patch to roll back one change released with yesterday's Production update (an upgraded third-party dependency) that we suspect caused the issues. We're now monitoring the system's performance.
Nov 30, 19:15 EST
Earlier this morning, we boosted capacity a second time to deal with continued excessive CPU load. The CPU load is now low and stable.
We are continuing to investigate the issue as we're seeing some errors in our logs and we have reports that some mobile actions are not completing reliably.
Nov 30, 12:01 EST
Performance looks much better after boosting capacity. We're going to continue investigating the root cause.
Nov 30, 06:04 EST
We are seeing unusually high CPU load in our Production environment. We are investigating the root cause. In order to improve user experience, we are boosting our capacity in Azure to accommodate the higher load.
Nov 30, 05:37 EST