On March 11, 2024, between 18:44 UTC and 19:10 UTC, GitHub Actions performance was degraded and some users experienced errors when trying to queue workflows. Approximately 3.7% of runs queued during this time were unable to start.
The issue was partially caused by a deployment of an internal system that Actions relies on to process workflow run events. Queue processing was paused for about 3 minutes during this deployment, which caused a spike in queued workflow runs. When processing resumed, the high number of queued workflows overwhelmed a secret-initialization component of the workflow invocation system. The errors generated by this overwhelmed component ultimately delayed workflow invocation. Through our alerting system, we received initial indications of an issue at approximately 18:44 UTC. However, we did not see impact on our run start delays and run queuing availability metrics until approximately 18:52 UTC. As the large queue of workflow run events burned down, we saw recovery in our key customer impact measures by 19:11 UTC, but waited until 19:22 UTC to declare the incident resolved while verifying there was no further customer impact.
We are working on various measures to reduce spikes in queue buildup during deployments of our queueing system, and have scaled up the workers that handle secret generation and storage during the workflow invocation process.
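To illustrate the failure mode described above, here is a minimal sketch in Python of how a backlog built up during a paused queue can exceed a downstream component's capacity when it is released all at once, and how capping the dispatch rate avoids that. This is not the actual Actions implementation: the function `drain_backlog`, its parameters, and all numbers (1 event per second, a downstream capacity of 5 per second) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def drain_backlog(backlog, arrivals_per_tick, capacity_per_tick,
                  max_dispatch_per_tick=None):
    """Simulate draining a queue backlog into a downstream component.

    All names and numbers are hypothetical. Returns a tuple
    (ticks_to_drain, overloaded_ticks), where an "overloaded" tick is one in
    which more events were dispatched than the downstream could handle.
    """
    if max_dispatch_per_tick is not None and max_dispatch_per_tick <= arrivals_per_tick:
        # With a cap at or below the arrival rate the backlog never drains.
        raise ValueError("dispatch cap must exceed the arrival rate")

    pending = backlog
    ticks = 0
    overloaded_ticks = 0
    while pending > 0:
        pending += arrivals_per_tick  # new events keep arriving
        if max_dispatch_per_tick is None:
            dispatch = pending  # uncapped: release everything at once
        else:
            dispatch = min(pending, max_dispatch_per_tick)
        if dispatch > capacity_per_tick:
            overloaded_ticks += 1
        pending -= dispatch
        ticks += 1
    return ticks, overloaded_ticks


# Assumed scenario: a 3-minute pause at 1 event/second leaves 180 events queued.
uncapped = drain_backlog(backlog=180, arrivals_per_tick=1, capacity_per_tick=5)
capped = drain_backlog(backlog=180, arrivals_per_tick=1, capacity_per_tick=5,
                       max_dispatch_per_tick=5)
print("uncapped:", uncapped)
print("capped:", capped)
```

In this toy model the uncapped drain sends the whole backlog downstream in a single tick, far above the assumed capacity, while the capped drain takes longer but never exceeds it. Scaling up the downstream workers, as described above, corresponds to raising `capacity_per_tick`.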
Posted Mar 11, 2024 - 19:22 UTC
Update
Actions experienced a period of decreased workflow run throughput, and we are seeing recovery now. We are in the process of investigating the cause.
Posted Mar 11, 2024 - 19:21 UTC
Investigating
We are investigating reports of degraded performance for Actions.