Incident with API Requests, Git Operations, Webhooks and Copilot
Incident Report for GitHub
Resolved
From 22:45 UTC on March 11, 2024 until 00:48 UTC on March 12, 2024, various GitHub services were degraded and returned intermittent errors for users. During this incident, customers saw API error rates as high as 1%, Copilot error rates as high as 17%, and error rates as high as 100% for Secret Scanning and for 2FA using GitHub Mobile, which dropped to 30% starting at 22:55 UTC. The elevated error rates were due to a degradation of our centralized authentication service, upon which many other services depend.

The issue was caused by a deployment of network-related configuration that was inadvertently applied to the incorrect environment. This error was detected within 4 minutes and a rollback was initiated. Error rates began dropping quickly at 22:55 UTC, and from that point many failed requests succeeded upon retrying. However, the rollback failed in one of our data centers, leading to a longer recovery time. The rollback failure was due to an unrelated issue earlier in the day, in which the datastore for our configuration service was polluted in a way that required manual intervention. The bad data in the configuration service caused the rollback in this one data center to fail. Manually removing the incorrect data allowed the full rollback to complete at 00:48 UTC, restoring full access to services. We understand how the corrupt data was deployed and continue to investigate why the specific data caused the subsequent deployments to fail.
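
Because many failed requests succeeded on retry once error rates began to fall, a client-side retry loop is one way integrations can ride out this kind of intermittent failure. The following is a minimal sketch, not GitHub guidance: `TransientError`, `call_with_retries`, and all of the default values are hypothetical names and numbers chosen for illustration, and the caller is assumed to translate whatever errors it considers retryable into `TransientError`.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for an intermittent failure that is worth retrying."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call operation() and retry on TransientError with exponential backoff.

    The wait before each retry is drawn uniformly from [0, ceiling], where the
    ceiling doubles on every attempt and is capped at max_delay ("full jitter").
    The last TransientError is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The randomized delay spreads retries out so that many clients recovering at once do not add a synchronized burst of load to a service that is still degraded.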

We are working on measures to make this kind of configuration change safer, to detect such problems faster through better monitoring of the related subsystems, and to make our underlying configuration system more robust, including prevention and automatic cleanup of polluted records so that we can recover automatically from this kind of data issue in the future.
Posted Mar 12, 2024 - 01:00 UTC
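
The summary above traces the incident to configuration applied to the wrong environment. One common safeguard for that class of mistake is to record the intended environment on the change itself and refuse to apply it anywhere else. The sketch below only illustrates the idea and does not describe GitHub's deployment tooling: `ConfigChange`, `EnvironmentMismatchError`, `apply_change`, and the environment names are all hypothetical.

```python
from dataclasses import dataclass


class EnvironmentMismatchError(Exception):
    """Raised when a change is aimed at an environment it was not written for."""


@dataclass(frozen=True)
class ConfigChange:
    name: str
    intended_environment: str  # hypothetical label, e.g. "staging"


def apply_change(change, target_environment, apply):
    """Run apply(change) only if the target matches the change's declared environment."""
    if change.intended_environment != target_environment:
        raise EnvironmentMismatchError(
            f"{change.name!r} is intended for {change.intended_environment!r}, "
            f"not {target_environment!r}"
        )
    return apply(change)
```

A check like this fails closed: a mismatch stops the deployment before anything is changed, so no rollback is needed.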
Update
We believe we've resolved the root cause and are waiting for services to recover.
Posted Mar 12, 2024 - 01:00 UTC
Update
API Requests is operating normally.
Posted Mar 12, 2024 - 00:56 UTC
Update
Git Operations is operating normally.
Posted Mar 12, 2024 - 00:55 UTC
Update
Webhooks is operating normally.
Posted Mar 12, 2024 - 00:54 UTC
Update
Copilot is operating normally.
Posted Mar 12, 2024 - 00:54 UTC
Update
We're continuing to investigate issues with our authentication service, which are impacting multiple services.
Posted Mar 12, 2024 - 00:14 UTC
Update
Webhooks is experiencing degraded performance. We are continuing to investigate.
Posted Mar 11, 2024 - 23:55 UTC
Update
Webhooks is operating normally.
Posted Mar 11, 2024 - 23:31 UTC
Update
Copilot is experiencing degraded performance. We are continuing to investigate.
Posted Mar 11, 2024 - 23:21 UTC
Update
Git Operations is experiencing degraded performance. We are continuing to investigate.
Posted Mar 11, 2024 - 23:20 UTC
Update
Webhooks is experiencing degraded performance. We are continuing to investigate.
Posted Mar 11, 2024 - 23:09 UTC
Investigating
We are investigating reports of degraded availability for API Requests, Git Operations and Webhooks.
Posted Mar 11, 2024 - 23:01 UTC
This incident affected: Git Operations, API Requests, Webhooks, and Copilot.