Incident Report: May 13, 2024 (lasting approximately 4 hours)
On May 13 at 10:40 UTC, GitHub Copilot Chat began returning error responses to 6% of users. The problem was identified, and a status update was posted shortly after. A mitigation was in place by 14:30 UTC, which reduced the impact.
The root cause of the incident was a combination of issues in the request handling process. Specifically, some requests were malformed and, as a result, were routed to the wrong deployment. That deployment was not resilient to malformed requests and returned errors.
To mitigate the immediate impact, requests were routed away from the failing deployment. This temporarily reduced the number of errors while the underlying issue was investigated and resolved.
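As an illustration only, routing requests away from a failing deployment can be sketched as marking that deployment as drained so that the router skips it. The `Router` class, its round-robin policy, and the deployment names are hypothetical and do not describe the production system.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical round-robin router over named deployments."""

    deployments: list[str]
    drained: set[str] = field(default_factory=set)
    _next: int = 0

    def drain(self, name: str) -> None:
        """Stop sending new requests to the named deployment."""
        if name not in self.deployments:
            raise ValueError(f"unknown deployment: {name}")
        self.drained.add(name)

    def pick(self) -> str:
        """Return the next deployment that has not been drained."""
        active = [d for d in self.deployments if d not in self.drained]
        if not active:
            raise RuntimeError("no active deployments available")
        choice = active[self._next % len(active)]
        self._next += 1
        return choice


# Usage: drain the failing deployment, then route as usual.
router = Router(["deploy-a", "deploy-b", "deploy-c"])  # assumed names
router.drain("deploy-b")
picks = [router.pick() for _ in range(4)]
```

In this sketch, draining is a configuration-level change: the failing deployment keeps running and can be inspected while it no longer receives traffic.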
To prevent similar incidents in the future, we have enhanced validation checks for incoming requests to ensure proper handling and routing. In addition, we upgraded backend systems to provide more robust error handling and observability.
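The report does not describe the specific checks that were added. As a minimal sketch of the general idea, a request can be validated before a routing decision is made, so that a malformed request is rejected up front instead of reaching a deployment that cannot handle it. The payload shape, the `ROUTES` table, and all function names here are assumptions made for illustration.

```python
# Hypothetical mapping from request kind to deployment name.
ROUTES = {
    "chat": "chat-deployment",
    "completions": "completions-deployment",
}


class MalformedRequest(ValueError):
    """Raised when a request fails validation before routing."""


def validate(payload: object) -> dict:
    """Check the assumed payload shape and return it unchanged if valid."""
    if not isinstance(payload, dict):
        raise MalformedRequest("payload must be an object")
    kind = payload.get("kind")
    if not isinstance(kind, str) or kind not in ROUTES:
        raise MalformedRequest("missing or unknown 'kind'")
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        raise MalformedRequest("'messages' must be a non-empty list")
    for message in messages:
        if not isinstance(message, dict) or not isinstance(message.get("content"), str):
            raise MalformedRequest("each message needs string 'content'")
    return payload


def route(payload: object) -> str:
    """Validate first, then pick the deployment for the request kind."""
    valid = validate(payload)
    return ROUTES[valid["kind"]]
```

Rejecting a malformed request with an explicit error also makes the failure visible at the edge, which helps observability compared with an exception raised deep inside a backend.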
Posted May 13, 2024 - 15:44 UTC
Update
We are applying configuration changes to mitigate impact to Copilot Chat users.
Posted May 13, 2024 - 14:56 UTC
Update
We continue to investigate the root cause of elevated errors in Copilot Chat.
Posted May 13, 2024 - 14:13 UTC
Update
Copilot is experiencing degraded performance. We are continuing to investigate.
Posted May 13, 2024 - 13:27 UTC
Update
We are investigating an increase in exceptions impacting Copilot Chat usage from IDEs.