On September 27, 2023 at 00:12 UTC, our alerting systems detected an increase in the time it took GitHub Actions workflow runs to start. During the incident, some customers experienced delays in starting GitHub Actions workflow runs and in receiving status updates for in-progress runs. The root cause was a change deployed to an internal distributed event streaming platform, which caused several worker nodes to exceed a misconfigured memory limit. These nodes then restarted, which reduced job processing throughput. GitHub Actions relies on events delivered through this event streaming platform to start workflow runs and update their status. Delays in receiving these events led to run delays for about 40% of Actions workflows.
We mitigated the incident by rolling back the offending change at 00:18 UTC. This allowed our event streaming platform to catch up on the backlog of workflow runs that had queued during the incident. The backlog was processed by 00:44 UTC. We have additional repair items in place to prevent a recurrence.
Posted Sep 27, 2023 - 00:58 UTC
We are investigating reports of degraded performance for Actions.