Service outage

Incident Report for TrackAbout System

Postmortem

The TrackAbout application web site was down from 2022-05-13 05:14:24 to 2022-05-13 05:45:24 UTC, 31 minutes.

The TrackAbout Sync Server, which is critical to mobile device operation, was down from 2022-05-13 05:14:24 to 2022-05-13 06:14:10 UTC, approximately 60 minutes.
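The two durations above can be checked with a few lines of arithmetic. This is only a sketch: it assumes both timestamp pairs are in UTC (the report states this explicitly only for the web site), and the helper name `outage_duration` is ours, not part of any TrackAbout tooling.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout used in this report.
FMT = "%Y-%m-%d %H:%M:%S"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps, both assumed to be UTC."""
    start_dt = datetime.strptime(start, FMT).replace(tzinfo=timezone.utc)
    end_dt = datetime.strptime(end, FMT).replace(tzinfo=timezone.utc)
    return end_dt - start_dt


web = outage_duration("2022-05-13 05:14:24", "2022-05-13 05:45:24")
sync = outage_duration("2022-05-13 05:14:24", "2022-05-13 06:14:10")

print(web)   # 0:31:00
print(sync)  # 0:59:46
```

The sync outage works out to 59 minutes 46 seconds, which the report rounds to 60 minutes.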

Post Mortem and Root Cause Analysis

During patching of our Active Directory Domain Services (AD DS) domain controllers, our managed service provider did not adhere to our agreed-upon runbook procedures. Specifically, after patching and rebooting the primary domain controller, they did not verify that all required services were operational before patching and rebooting the secondary. One of the required services was DNS. Another was Active Directory Federation Services (ADFS).
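The verification step that was skipped amounts to a simple gate between controllers. The sketch below is illustrative only and is not our actual runbook: the service list, the function names, and the `is_running` probe are all hypothetical stand-ins, and a real probe would query the freshly patched controller.

```python
from typing import Callable, Iterable, List

# Hypothetical list: the report names DNS and ADFS as two of the required services.
REQUIRED_SERVICES = ["DNS", "ADFS"]


def failed_services(required: Iterable[str],
                    is_running: Callable[[str], bool]) -> List[str]:
    """Return the required services that the probe reports as not running."""
    return [name for name in required if not is_running(name)]


def safe_to_patch_next(required: Iterable[str],
                       is_running: Callable[[str], bool]) -> bool:
    """Gate: allow the next controller to be patched only if nothing failed."""
    failed = failed_services(required, is_running)
    if failed:
        print("STOP: not running on patched controller:", ", ".join(failed))
        return False
    print("OK: all required services running; next controller may be patched.")
    return True


# Stand-in probe reproducing the state described in this incident.
status_after_patch = {"DNS": False, "ADFS": False}
safe_to_patch_next(REQUIRED_SERVICES,
                   lambda name: status_after_patch.get(name, False))
```

Injecting the probe as a function keeps the gate logic independent of how service status is actually collected.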

With DNS offline on both primary and secondary domain controllers, application servers were unable to resolve the hostnames of backend services such as Azure SQL Database. With ADFS offline, some services could not authenticate. The TrackAbout service as a whole came down.

Our DevOps team was alerted to the outage both by automated alerts and by our customer support team calling with customer complaints.

Many Windows services configured to start automatically on the domain controllers were refusing to start, even after the server had booted up completely. The service management tool would respond with an error dialog stating the services were not starting in a timely fashion.

We did manage to start DNS services on the domain controllers, but ADFS continued to fail. With DNS up and running, the majority of TrackAbout’s services came back online after 31 minutes, but mobile device sync services continued to be down a full 60 minutes until we restarted them. Then they too started operating normally. That left only one legacy service for syncing TAMobile 5 devices still down.

We escalated with our managed services provider to determine what procedures they may have been following that led to this situation.

Our first theory was that something in the monthly Windows Updates may have caused this to happen and we began to consider rolling back the updates or restoring the domain controllers from backup. However, domain controllers are notoriously fussy when they get out of sync with each other, and the risk seemed unacceptably high. So we continued investigating how to fail forward, find the root cause, and fix the problem.

We found that our tertiary disaster recovery domain controller, which had also been patched and rebooted, was suffering the same problems. This gave us a “safe” domain controller (one not relied upon for Production) on which to probe and experiment with getting the necessary services back online.

Ultimately, after much searching, an escalation engineer from our managed service provider discovered a single setting on the domain controllers which seemed to be set incorrectly. The setting was the registry value ServicesPipeTimeout under the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control. This registry value was set to 96.

ServicesPipeTimeout controls how long the Windows Service Control Manager will wait for a service to start before killing it. The setting is in milliseconds.

96 milliseconds is not long enough to start most Windows services.
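A check along the following lines would flag a value like this. It is a minimal sketch, not the procedure we used: it assumes Windows falls back to 30,000 ms when ServicesPipeTimeout is absent (our understanding of the default), and the function names and the choice of that default as the minimum threshold are ours. The registry read uses Python's standard `winreg` module and runs only on Windows.

```python
import sys
from typing import Optional

# Assumption: 30,000 ms applies when the value is absent from the registry.
ASSUMED_DEFAULT_MS = 30_000
CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control"
VALUE_NAME = "ServicesPipeTimeout"


def timeout_looks_sane(timeout_ms: Optional[int],
                       minimum_ms: int = ASSUMED_DEFAULT_MS) -> bool:
    """Return True if the timeout is absent (None) or at least minimum_ms."""
    if timeout_ms is None:
        return True
    return timeout_ms >= minimum_ms


def read_services_pipe_timeout() -> Optional[int]:
    """Read the value from HKEY_LOCAL_MACHINE; return None if it is not set."""
    import winreg  # standard library, available on Windows only

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CONTROL_KEY) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
        except FileNotFoundError:
            return None
    return int(value)


if __name__ == "__main__" and sys.platform == "win32":
    current = read_services_pipe_timeout()
    print(VALUE_NAME, "=", current, "ms; sane:", timeout_looks_sane(current))
```

With the value found on our domain controllers, `timeout_looks_sane(96)` returns `False`.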

We set the value to a reasonable timeout (on the order of minutes, expressed in milliseconds) and rebooted each server, which is required for the change to take effect.

Upon boot, all services were able to start successfully.

It remains unknown how the ServicesPipeTimeout came to be set to 96 milliseconds. We are continuing to research how and when that may have happened. We currently do not have a plausible explanation. We are investigating the impact of enabling system change tracking going forward, so that we can audit registry changes should something like this happen in the future.
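The kind of audit we have in mind boils down to comparing a known-good snapshot of settings against the current state. The sketch below shows only that comparison, under stated assumptions: snapshots are plain dictionaries, the names `diff_snapshots` and `ABSENT` are hypothetical, and the baseline value in the example is invented for illustration, since the previous value on our servers is not known.

```python
from typing import Any, Dict, List, Tuple

# Marker for a setting that exists in one snapshot but not in the other.
ABSENT = "<absent>"

Snapshot = Dict[str, Any]


def diff_snapshots(before: Snapshot,
                   after: Snapshot) -> List[Tuple[str, Any, Any]]:
    """Return (name, old, new) for every setting added, removed, or changed."""
    changes = []
    for name in sorted(set(before) | set(after)):
        old = before.get(name, ABSENT)
        new = after.get(name, ABSENT)
        if old != new:
            changes.append((name, old, new))
    return changes


# Illustrative data only: the baseline value is made up.
baseline = {"ServicesPipeTimeout": 30_000}
current = {"ServicesPipeTimeout": 96}

for name, old, new in diff_snapshots(baseline, current):
    print(f"{name}: {old} -> {new}")
```

Running a comparison like this on a schedule, and recording when each difference first appears, would narrow down when a change was made even if it cannot say who made it.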

We are also having a lively conversation with our managed service provider, who failed to follow our agreed-upon patching procedures, which caused the outage. Had they stopped the patching cycle when services failed to start on the first domain controller, we would not have suffered the outage at all.

Posted May 13, 2022 - 16:13 EDT

Resolved

We have identified and resolved the issue. We have verified that all services are operational.

Posted May 13, 2022 - 04:20 EDT

Update

The TAM6 sync services are operational now. We are continuing to investigate.

Posted May 13, 2022 - 02:26 EDT

Update

The web site is operational, but the TAM6 sync services are currently down. We are continuing to investigate.

Posted May 13, 2022 - 02:09 EDT

Update

We are continuing to investigate this issue.

Posted May 13, 2022 - 02:04 EDT

Investigating

We just experienced an outage in our application. The main web site is operational now. We are checking different areas of the application to verify that they are all operational.

Posted May 13, 2022 - 01:59 EDT

This incident affected: Production Services (Production Application Web Site, Production TAMobile 6 Sync Services) and Customer Test Services (Customer Test Application Web Site).