On September 16, 2024, between 21:11 UTC and 22:20 UTC, the Actions and Pages services were degraded. Customers who deploy Pages from a source branch experienced delayed runs. Approximately 1,100 runs were delayed long enough to be marked as abandoned. The runs that were not abandoned completed successfully after we recovered from the incident. Actions jobs were delayed by 23 minutes on average, with some jobs delayed by as much as 45 minutes. Over the course of the incident, 17% of runs were delayed by more than 5 minutes; at peak, as many as 80% of runs were delayed by more than 5 minutes. The root cause was a misconfiguration in the service that manages runner connections, which caused CPU throttling and degraded that service's performance.
We mitigated the incident by diverting runner connections away from the misconfigured nodes. We are working to improve our internal monitoring and alerting so that we can detect and mitigate issues like this one more quickly in the future.
Posted Sep 16, 2024 - 22:08 UTC
Update
Actions is experiencing degraded performance. We are continuing to investigate.
Posted Sep 16, 2024 - 21:55 UTC
Update
The team is investigating issues with some Actions jobs being queued for a long time and a percentage of jobs failing. A mitigation has been applied and jobs are starting to recover.
Posted Sep 16, 2024 - 21:53 UTC
Update
Pages is operating normally.
Posted Sep 16, 2024 - 21:52 UTC
Update
Actions is experiencing degraded availability. We are continuing to investigate.
Posted Sep 16, 2024 - 21:37 UTC
Investigating
We are investigating reports of degraded performance for Actions and Pages.