On September 16, 2024, between 13:24 UTC and 14:28 UTC, the Git Operations service experienced a degradation, leading to intermittent SSH connection drops. The overall SSH error rate during this period was 0.0005%, with a peak error rate of 0.3%.
The root cause was traced to a regression in the service reload mechanism, which caused SSH hosts to drop connections every hour. As SSH hosts were rebooted for routine security updates, the issue progressively affected more hosts.
To mitigate the impact, we removed the affected hosts from production traffic. The SSH regression has since been identified and resolved, and all SSH hosts have been fully restored. Additionally, we have implemented new monitoring to alert us to any SSH connection refusals.
Posted Sep 16, 2024 - 14:28 UTC
Update
We are no longer seeing dropped Git SSH connections and believe we have mitigated the incident. We are continuing to monitor and investigate to prevent recurrence.
Posted Sep 16, 2024 - 14:27 UTC
Update
We have taken the suspect hosts out of rotation and have not seen any impact in the last 20 minutes. We are continuing to monitor to ensure the problem is resolved and are investigating the cause.
Posted Sep 16, 2024 - 14:11 UTC
Update
We are seeing up to 2% of Git SSH connections failing.
We have taken the suspected problematic hosts out of rotation, are monitoring for recovery, and are continuing to investigate.
Posted Sep 16, 2024 - 13:38 UTC
Update
We are investigating failed connections for Git SSH. Customers may be experiencing failed SSH connections both in CI and interactively. Retrying the connection may be successful. Git HTTP connections appear to be unaffected.
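For CI jobs affected by intermittent SSH failures like these, retrying can be automated. The following is a minimal sketch, not an official recommendation: the helper name `run_with_retry`, its parameters, and the backoff values are illustrative assumptions. It retries on any non-zero exit code, which is a simplification, since not every failure is a dropped connection.

```python
import subprocess
import time
from typing import Callable, Sequence


def _default_runner(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code without raising."""
    return subprocess.run(list(cmd), check=False).returncode


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 4,
    base_delay: float = 2.0,
    runner: Callable[[Sequence[str]], int] = _default_runner,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd up to `attempts` times with exponential backoff.

    Hypothetical helper: retries on ANY non-zero exit code. Returns 0 on
    the first success, otherwise the exit code of the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 0
    for attempt in range(attempts):
        code = runner(cmd)
        if code == 0:
            return 0
        # Wait before the next try, doubling the delay each time;
        # skip the wait after the final attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return code


# Example (assumed remote name "origin"):
#   exit_code = run_with_retry(["git", "fetch", "origin"])
```

The `runner` and `sleep` parameters exist only so the retry logic can be exercised without a network or real delays.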