On November 11, 2023, starting at 1:00 UTC, GitHub background jobs encountered delays lasting up to 50 minutes. These delays affected services that rely on background jobs, including Actions, Webhooks, Pull Requests, and Pages. The impact persisted for approximately one hour, until 2:10 UTC.
During the incident, some customers experienced delays in starting GitHub Actions workflow runs and Pages builds. We estimate that about 10% of Actions workflow runs were delayed during the impact window and 99% of Pages builds failed from 1:00 UTC to 1:20 UTC. Users may have experienced a delay in seeing recent pushes reflected in pull request views. This delay averaged between 5 and 10 minutes and affected up to 30% of pull request page views during the incident. 1% of pull request page views experienced delays of up to 60 minutes. Finally, 30% of webhook deliveries in this window missed our target of being delivered within 1 minute of the triggering event.
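As a rough illustration of how a figure like the webhook miss rate can be derived, the sketch below computes the share of deliveries that exceeded a latency target. The function name and the sample latencies are hypothetical and are not taken from the incident data.

```python
def fraction_missing_target(latencies_s, target_s=60.0):
    """Return the fraction of deliveries that took longer than target_s seconds.

    latencies_s holds, for each delivery, the seconds elapsed between the
    triggering event and the delivery. A delivery that takes exactly
    target_s seconds still counts as meeting the target.
    """
    if not latencies_s:
        return 0.0
    missed = sum(1 for latency in latencies_s if latency > target_s)
    return missed / len(latencies_s)


# Hypothetical sample: 3 of these 10 deliveries took longer than 60 seconds.
sample_latencies = [5, 20, 45, 58, 61, 90, 30, 12, 600, 40]
miss_rate = fraction_missing_target(sample_latencies)
```

With this made-up sample, `miss_rate` is 0.3, which corresponds to 30% of deliveries missing a 1 minute target.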
This incident was caused by excessive rebalancing in our Kafka consumer group that feeds our background job system. We have altered our Kafka configuration to reduce the likelihood of this issue, created diagnostic tools to identify future causes, and will be breaking up this relay into multiple groups to limit the blast radius if the problem does reoccur.
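For readers unfamiliar with consumer group rebalancing, the following is a minimal sketch of the kind of consumer settings that commonly influence how often a group rebalances. This report does not state which settings were changed. The keys are standard Kafka consumer configuration names (with the librdkafka-style value for the assignment strategy); every value, the group and instance names, and both helper functions are hypothetical.

```python
def build_consumer_config(group_id, instance_id):
    """Build an illustrative consumer config aimed at fewer rebalances.

    All values are hypothetical examples, not the settings used in production.
    """
    return {
        "group.id": group_id,
        # Static membership: a consumer that restarts with the same id can
        # rejoin without triggering a rebalance, provided it returns within
        # the session timeout.
        "group.instance.id": instance_id,
        # Cooperative rebalancing moves only the partitions that must move,
        # so consumers keep processing their unaffected partitions.
        "partition.assignment.strategy": "cooperative-sticky",
        # How long the broker waits without heartbeats before evicting a member.
        "session.timeout.ms": 45000,
        # How often the consumer sends heartbeats to the group coordinator.
        "heartbeat.interval.ms": 15000,
        # Maximum time between polls before the consumer is considered stuck.
        "max.poll.interval.ms": 300000,
    }


def check_rebalance_settings(config):
    """Return a list of warnings about settings that tend to cause rebalances."""
    problems = []
    session = config["session.timeout.ms"]
    heartbeat = config["heartbeat.interval.ms"]
    # Kafka's documentation suggests keeping the heartbeat interval at no more
    # than one third of the session timeout.
    if heartbeat * 3 > session:
        problems.append("heartbeat.interval.ms exceeds 1/3 of session.timeout.ms")
    if not config.get("group.instance.id"):
        problems.append("group.instance.id is unset, so membership is dynamic")
    return problems


if __name__ == "__main__":
    cfg = build_consumer_config("background-jobs", "worker-1")
    print(check_rebalance_settings(cfg))
```

Splitting one large consumer group into several smaller ones, as described above, would mean building a config like this per group with a distinct `group.id`, so that a rebalance in one group does not pause consumers in the others.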
Posted Nov 11, 2023 - 02:14 UTC
Update
Pages is operating normally.
Posted Nov 11, 2023 - 02:14 UTC
Update
Actions is operating normally.
Posted Nov 11, 2023 - 02:13 UTC
Update
Rebalancing completed and job queues are improving. We continue to monitor for full recovery of Webhooks, Actions, and Pages workflows.
Posted Nov 11, 2023 - 01:53 UTC
Update
Actions is experiencing degraded performance. We are continuing to investigate.
Posted Nov 11, 2023 - 01:42 UTC
Update
Webhooks is experiencing degraded performance. We are continuing to investigate.
Posted Nov 11, 2023 - 01:41 UTC
Update
Pages builds, webhooks, and other workflows were delayed starting at 1:00 UTC. We have failed over the service that was contributing to the delays and are seeing successful processing. We are continuing to monitor for full recovery.
Posted Nov 11, 2023 - 01:40 UTC
Investigating
We are investigating reports of degraded performance for Pages.
Posted Nov 11, 2023 - 01:26 UTC
This incident affected: Webhooks, Actions, and Pages.