Temporary outage (4 minutes)

Incident Report for TrackAbout System

Postmortem

Our working theory is that the Redis instance may have come under high memory pressure, which triggered a mass eviction of keys stored in the cache. The eviction relieved the memory pressure, and the cluster was then able to return to a healthy state. We’re going to scale up our Redis instance over the weekend, re-measure its health, and decide where to go from there.
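One way to check a theory like this is to compare memory use against the configured limit and watch the eviction counter in the output of the Redis INFO command. The sketch below is illustrative only and is not the tooling we use: the function names, the sample values, and the 90% threshold are assumptions, while `used_memory`, `maxmemory`, and `evicted_keys` are standard INFO fields.

```python
def parse_info(text: str) -> dict:
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def memory_pressure(info: dict, threshold: float = 0.9) -> dict:
    """Summarize memory pressure from parsed INFO fields.

    The threshold is an assumed alerting level, not a Redis default.
    """
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))
    evicted = int(info.get("evicted_keys", 0))
    # maxmemory of 0 means no limit is configured, so no ratio can be computed.
    ratio = used / limit if limit > 0 else None
    return {
        "used_ratio": ratio,
        "evicted_keys": evicted,
        "under_pressure": ratio is not None and ratio >= threshold,
    }


# Made-up sample values for illustration; not data from this incident.
SAMPLE_INFO = """# Memory
used_memory:950000000
maxmemory:1000000000
# Stats
evicted_keys:120000
"""

if __name__ == "__main__":
    print(memory_pressure(parse_info(SAMPLE_INFO)))
```

A sharp rise in `evicted_keys` between two samples, together with a `used_ratio` close to 1, would be consistent with the mass-eviction theory above.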

Posted Apr 04, 2024 - 14:16 EDT

Resolved

From 10:46 AM to 11:00 AM we experienced a brief outage. We have traced it to an unusual spike in our Redis Cache backend, where the CPU, load, and memory usage metrics all jumped to 100% at the same time.

The Redis Cache is the data store for user session information. Because Redis was unavailable, users could not complete requests to TrackAbout.

We're investigating the cause of the spike, but it does not appear to have been caused by our application, as there was no spike in operations per second or in connections made to Redis. This suggests that the fault may be a Microsoft Azure back-end issue. We are opening a ticket with Microsoft to investigate.

Posted Apr 02, 2024 - 11:00 EDT