On Monday February 12th, 2024, 03:00 UTC we deployed a code change to a component of Copilot. At 06:00 UTC we observed an increase in timeouts for code completions impacting 55% of Copilot users at peak across Asia and Europe.
At 12:00 UTC we restarted the nodes, and response durations returned to normal operation until 13:00 UTC when response durations degraded again. At 16:15 UTC we made a configuration change to send traffic to regions that were not exhibiting the errors, which resulted in code completions working fully although completing at a higher latency than normal for some users. At 18:00 UTC we reverted the deploy and response durations returned to normal.
We have added better monitoring to components that failed to decrease resolution times to incidents like this in the future.