Incident with Pull Requests, Git Operations, Actions, API Requests, Issues and Webhooks
Incident Report for GitHub
Resolved
On April 24, 2024, between 10:26 and 16:30 UTC, the Pull Request page took longer than usual to enable a Pull Request to be merged. During the incident, the 75th percentile time to merge a pull request went from around 3 seconds to just over 6 minutes. We also saw slightly elevated error rates across the service, with an average error rate of 0.1%, peaking at 0.31% of requests.

The underlying cause of the incident was a repeated problematic query that caused MySQL replicas to crash, which in turn created a backlog in the jobs that compute whether PRs can be merged.

At 16:30 the incident self-mitigated when the volume of the problematic query decreased. Concurrently, we deployed a mitigation that removed the mergeability polling code from the PR experience to relieve pressure on the MySQL servers. As a side effect of this mitigation, the merge button no longer enabled automatically and required a page refresh instead. At 17:40 we opened a second incident for Pull Requests to cover this issue. We rolled back the mitigation, resolving the second incident at 18:20.

Going forward, to improve mitigation we are investing in faster recovery from cascading read-replica crashes, improving query retry logic to more aggressively back off when read replicas are unhealthy, improving our ability to diagnose server crashes and trace them back to user activity, and making it easier to block problematic queries.
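
As an illustration of retry logic that backs off more aggressively when read replicas are unhealthy, here is a minimal sketch. It is not GitHub's implementation: the function names, the `ConnectionError` failure signal, the health check callback, and every constant (base delay, cap, multiplier) are assumptions chosen for the example. It uses exponential backoff with full jitter and widens the delay window when the replica is reported unhealthy.

```python
import random
import time


def backoff_delay(attempt, replica_healthy, base=0.1, cap=30.0,
                  unhealthy_factor=4.0, rng=random.random):
    """Seconds to wait before the next retry (attempt is 0-based).

    All parameters are hypothetical defaults for illustration only.
    """
    # Exponential growth, bounded by the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Back off harder when the replica is unhealthy (assumed multiplier).
    if not replica_healthy:
        ceiling = min(cap, ceiling * unhealthy_factor)
    # Full jitter: pick a random point in [0, ceiling) to spread out retries.
    return rng() * ceiling


def run_with_retries(query, is_healthy, max_attempts=5, sleep=time.sleep):
    """Call query(); on ConnectionError, wait and retry up to max_attempts."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return query()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            sleep(backoff_delay(attempt, is_healthy()))
    raise last_error
```

Spreading retries over a randomized window keeps many clients from retrying in lockstep, and the wider window for unhealthy replicas lowers the query rate against a server that is still recovering.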

To prevent recurrence, we are eliminating the query that caused the server to crash, improving our detection and mitigation of problematic queries, and investigating and remediating the underlying issues that caused MySQL to crash. We are also improving the merge button polling mechanism and mergeability background job to be resilient to this class of incident so customers can still merge PRs.
Posted Apr 24, 2024 - 16:16 UTC
Update
Issues is operating normally.
Posted Apr 24, 2024 - 16:13 UTC
Update
Actions is operating normally.
Posted Apr 24, 2024 - 16:13 UTC
Update
Pull Requests is operating normally.
Posted Apr 24, 2024 - 16:13 UTC
Update
Webhooks is operating normally.
Posted Apr 24, 2024 - 16:13 UTC
Update
Git Operations is operating normally.
Posted Apr 24, 2024 - 16:13 UTC
Update
API Requests is operating normally.
Posted Apr 24, 2024 - 16:12 UTC
Update
We are seeing site-wide recovery but continue to closely monitor our systems and are putting additional mitigations in place to ensure we are back to full health.
Posted Apr 24, 2024 - 15:50 UTC
Update
We are continuing to see consistent impact, and we’re continuing to work on multiple mitigations to reduce load on our systems.
Posted Apr 24, 2024 - 14:08 UTC
Update
We have found an issue that may be contributing additional load to the website and are working on mitigations. We don't see any additional impact at this time and will provide another update within an hour if we see improvements or fully mitigate the issue based on this investigation.
Posted Apr 24, 2024 - 12:47 UTC
Update
We have applied some mitigations and now see less than 0.3 percent of requests failing site-wide. We still see elevated 500 errors, so we will keep this incident open and continue investigating until we are confident the error rate is back to baseline.
Posted Apr 24, 2024 - 12:00 UTC
Update
We are seeing increased 500 errors for various GraphQL and REST APIs related to database issues. Some users may see periodic 500 errors. The team is looking into the problematic queries and mitigations now.
Posted Apr 24, 2024 - 11:13 UTC
Update
Actions is experiencing degraded performance. We are continuing to investigate.
Posted Apr 24, 2024 - 11:09 UTC
Update
Git Operations is experiencing degraded performance. We are continuing to investigate.
Posted Apr 24, 2024 - 11:06 UTC
Update
Pull Requests is experiencing degraded performance. We are continuing to investigate.
Posted Apr 24, 2024 - 10:55 UTC
Update
Webhooks is experiencing degraded performance. We are continuing to investigate.
Posted Apr 24, 2024 - 10:52 UTC
Update
Issues is experiencing degraded performance. We are continuing to investigate.
Posted Apr 24, 2024 - 10:51 UTC
Investigating
We are investigating reports of degraded performance for API Requests.
Posted Apr 24, 2024 - 10:45 UTC
This incident affected: Git Operations, Webhooks, API Requests, Issues, Pull Requests, and Actions.