On November 21, 2023, at 09:50 UTC GitHub Actions jobs encountered delays due to an incident in our background job service caused by excessive rebalancing in a Kafka consumer group. After a quick mitigation, we began to see recovery on the job queues by 10:02 UTC. During this time window 100% of Actions jobs were delayed in starting for up to 11 minutes.
Unfortunately, the rapid queue recovery sent a thundering herd of jobs to Actions hosted runner pools, causing a database deadlock that resulted in some hosted runner pools having increased latency when accepting new jobs. This affected only a small percentage of overall jobs, around 2%. Configuration changes led to a resolution and the system was fully recovered by 11:27 UTC and all in progress jobs were processed.
The incident is now resolved.
Posted Nov 21, 2023 - 11:27 UTC
We've applied a mitigation to fix the issues with queuing and running Actions jobs. We are seeing improvements in telemetry and are monitoring for full recovery.
Posted Nov 21, 2023 - 11:12 UTC
We have recovery for the underlying issue but are waiting for Actions queues to catch up. We expect this to be completed in less than 1 hour(s).
Posted Nov 21, 2023 - 10:24 UTC
We are investigating reports of degraded performance for Actions