On July 5, 2024, between 16:31 UTC and 18:08 UTC, the Webhooks service was degraded, delaying all webhook deliveries. Deliveries were delayed by 24 minutes on average, with a maximum of 71 minutes. The cause was a configuration change to the Webhooks service, which led it to send unauthenticated requests to the background job cluster. We repaired the configuration error and redeployed the service, which resolved the issue. However, the recovery created a thundering herd effect that overloaded the background job queue cluster and pushed its API layer to maximum capacity. This resulted in timeouts for other job clients, which presented as increased latency for API calls.
Shortly after resolving the authentication misconfiguration, we had a separate issue in the background job processing service: health probes were failing, which reduced capacity in the background job API layer and magnified the effects of the thundering herd. From 18:21 UTC to 21:14 UTC, Actions runs on pull requests were delayed by approximately 2 minutes, with a maximum delay of 12 minutes. A deployment of the background job processing service remediated the issue.
To reduce our time to detection, we have streamlined our dashboards and added alerting for this specific runtime behavior. Additionally, we are working to reduce the blast radius of background job incidents through better workload isolation.
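For readers unfamiliar with the thundering herd effect mentioned above: when many clients retry at the same moment after a recovery, the combined burst can overload the service they depend on. One common general-purpose mitigation is exponential backoff with random jitter. The sketch below is illustrative only. The function name, parameters, and default values are hypothetical and do not describe the retry logic of the services involved in this incident.

```python
import random


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Return a randomized delay in seconds before retry number `attempt` (0-based).

    Uses "full jitter": the delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that failed together
    are unlikely to retry together.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Show the delays a single client might wait across six retries.
    for attempt in range(6):
        print(f"retry {attempt}: wait {backoff_delay(attempt):.2f}s")
```

Because each client draws its own random delay, retries spread out over the backoff window instead of arriving as a single spike.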
Posted Jul 05, 2024 - 20:57 UTC
Update
We are seeing recovery in Actions start times and are observing for any further impact.
Posted Jul 05, 2024 - 20:44 UTC
Update
We are still seeing about 5% of Actions runs taking longer than 5 minutes to start. We are scaling and shifting resources to speed up recovery.
Posted Jul 05, 2024 - 20:32 UTC
Update
We are still seeing about 5% of Actions runs taking longer than 5 minutes to start. We are evaluating mitigations to increase capacity and reduce latency.
Posted Jul 05, 2024 - 19:58 UTC
Update
We are seeing about 5% of Actions runs not starting within 5 minutes. We are continuing investigation.
Posted Jul 05, 2024 - 19:19 UTC
Update
We have seen recovery of Actions run delays. Keeping the incident open to monitor for full recovery.
Posted Jul 05, 2024 - 18:40 UTC
Update
Webhooks is operating normally.
Posted Jul 05, 2024 - 18:10 UTC
Update
We are seeing delays in Actions runs due to the recovery of webhook deliveries. We expect this to resolve as webhooks recover.
Posted Jul 05, 2024 - 18:09 UTC
Update
Actions is experiencing degraded performance. We are continuing to investigate.
Posted Jul 05, 2024 - 18:07 UTC
Update
We are seeing recovery as webhooks are being delivered again. We are burning down our queue of events. No events have been lost. New webhook deliveries will be delayed while this process recovers.
Posted Jul 05, 2024 - 17:57 UTC
Update
Webhooks is experiencing degraded performance. We are continuing to investigate.
Posted Jul 05, 2024 - 17:55 UTC
Update
We are reverting a configuration change that is suspected of contributing to the problem with webhook deliveries.
Posted Jul 05, 2024 - 17:42 UTC
Update
Our telemetry shows that most webhooks are failing to be delivered. We are queueing all undelivered webhooks and are working to remediate the problem.
Posted Jul 05, 2024 - 17:20 UTC
Update
Webhooks is experiencing degraded availability. We are continuing to investigate.
Posted Jul 05, 2024 - 17:17 UTC
Investigating
We are investigating reports of degraded performance for Webhooks.