How a Two-Minute Blip Became a Seven-Hour Outage
Or: why four NICs across two switches saved nothing
Last week a routine firmware update on a single piece of network gear took our production Kubernetes cluster down for seven hours. The firmware reboot lasted about two minutes. The cluster outage lasted about four hundred and twenty minutes.
That ratio — two minutes of trigger, seven hours of damage — is the whole point of this article. The trigger was small and survivable. The blast radius came from layers of subtle design decisions, each individually defensible, that compounded into something none of us had modeled.
No data was lost, and every service came back. Storage was intact end-to-end, the state store's committed history was intact, every blocklisted client was successfully remounted by hand, and every workload returned to where it had been at 3 AM. The damage was time — not data.
The short version: redundancy is only as good as your health probe. The long version is what follows.
3 AM and everything was quiet
The on-call phone did not ring. There was no alert thunderclap. The cluster was simply gone, and nobody knew about it until morning.
(The reason on-call didn't ring deserves its own article: the monitoring stack was hosted inside the same cluster that died. We'll get there.)
So we started where you start: who had access at 3 AM? Nobody. What did the change log say? Nothing. What was on the change calendar? Nothing.
We walked it backwards from what we could see.
The trigger: The gateway that updated itself
Our perimeter gateway — the single piece of equipment all site traffic routes through — auto-updated its firmware in the small hours and rebooted to apply it. About four hours later, in the middle of incident response, one of us rebooted it again — it looked frozen from the outside, and a reboot felt like the obvious thing to try. (It wasn't frozen. We'll come back to that.)
Each reboot dropped the site network for about two minutes.
By itself, that's a blip. Two minutes of no internet at 3 AM doesn't bring down a production cluster, and a deliberate two-minute reboot during incident response definitely shouldn't. We have four NICs per node bonded across two independent access switches for exactly this reason. The redundancy was real. It just didn't help.
This is the part that took us the longest to figure out.
The amplifier: Why four NICs saved nothing
Every node in the cluster has the same network setup. Two pairs of NICs, each pair bundled into a link aggregation group, each group attached to a different access switch. Sitting above the two aggregation groups is an active-backup bond. On paper, this survives any single switch loss, any single NIC loss, any single link loss. In practice, none of that protection mattered in this incident, because what failed was not a switch, a NIC, or a link.
The reason is brutally simple: the active-backup bond doesn't check link carrier. It pings the gateway. If two consecutive ARP pings to the gateway fail, the bond declares its slave dead and fails over. The standby slave's ARP target is the same gateway. So the moment the gateway stops answering ARP — two seconds in — all slaves on the host are simultaneously declared dead, and the bond drops to "no active interface".
Read that again. Every node, every NIC, every aggregation group, both switches — all of it converged on one liveness target, and that target was the very device whose reboot caused the incident.
The L1 and L2 redundancy was complete. The health probe was a single point of failure. The redundancy lost.
We know it wasn't physical because the switches' own logs confirm it: at the time the bonds collapsed, no server-facing switch port ever changed state. The cables, the switch ports, the link aggregations — all up, the entire time. The bonds gave up because they couldn't reach one IP address.
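The collapse is easy to reproduce on paper. What follows is a minimal sketch of the probe logic as described above, not the bonding driver's actual code. The one-second probe interval is inferred from "two consecutive" failures landing "two seconds in", and the slave names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    name: str                 # hypothetical label for one aggregation group
    carrier_up: bool = True   # physical link state; never changes in this model
    missed: int = 0           # consecutive unanswered ARP probes
    alive: bool = True        # the bond's opinion of this slave


def first_collapse(slaves, target_up, probe_interval_s=1, max_missed=2, duration_s=120):
    """Return the first second at which the bond has no live slave, or None."""
    for t in range(probe_interval_s, duration_s + 1, probe_interval_s):
        answered = target_up(t)  # every slave probes the same single target
        for s in slaves:
            s.missed = 0 if answered else s.missed + 1
            s.alive = s.missed < max_missed
        if not any(s.alive for s in slaves):
            return t
    return None


def gateway_up(t):
    """The gateway starts rebooting at t=0 and stays silent for two minutes."""
    return not (0 <= t < 120)


slaves = [Slave("lag-switch-a"), Slave("lag-switch-b")]
collapse_at = first_collapse(slaves, gateway_up)
print(collapse_at)                          # 2
print(all(s.carrier_up for s in slaves))    # True: every link was physically fine
```

Both slaves die in the same probe cycle because they share one target, so there is nothing left to fail over to, even though carrier never dropped.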
If you take one thing from this article, take this:
Redundant data paths behind a single-point-of-failure liveness probe are not redundancy.
The three traps that turned a blip into seven hours
So now both two-minute network blips have hit every node in the cluster. With well-behaved software, the nodes would reconnect, the cluster would reconverge, and we would never have heard about it.
We heard about it. Three traps in the software stack made sure we did.
Trap #1 — the leader-election crash-loop
Our Kubernetes distribution uses leader election in its control daemon. When the network died, the leader-holding node could no longer renew its lease, and after about 47 seconds it gave up and exited with a fatal log line.
That single exit killed the daemon's child container runtime. Which started the next trap (see below). And then, when the supervisor restarted the daemon, the daemon discovered it couldn't talk to its local state store — because the state store was frozen — and exited again. And again. And again. Hundreds of restarts before a human got there.
The pattern:
The daemon's startup path required something the daemon had just broken.
Self-healing only works when the heal path doesn't depend on what just failed.
Trap #2 — the state store that froze, not crashed
When the container runtime died with the control daemon, it took the state store's log-pipe consumer down with it. The state store kept running, and kept trying to write logs. The pipe filled. The write blocked. Because the state store writes its logs under a lock, every other goroutine that wanted to log queued behind the blocked write.
The whole process froze — alive, but paralyzed. Health checks returned nothing. The state store's log file stopped dead. And — crucially — the freeze persisted even after the network came back. The pipe was still full, its consumer still gone. Until we unblocked it by hand, the state store was permanently wedged.
The pattern:
A "transient" failure can become permanent through subtle inter-process coupling.
A producer–consumer relationship between two processes is fine until one of them is supposed to recover by itself.
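The mechanism can be shown with nothing more than a kernel pipe. This is a minimal sketch, not the state store's code: it uses a non-blocking write end so the demonstration reports "would block" instead of hanging, whereas a real blocking write made under a logging lock would simply stall there and stall every other logger behind it.

```python
import os


def fill_pipe(write_fd, chunk=b"x" * 4096):
    """Write until the kernel refuses more data; return the bytes accepted."""
    total = 0
    while True:
        try:
            total += os.write(write_fd, chunk)
        except BlockingIOError:
            return total


def can_write(write_fd, data=b"log line\n"):
    """True if one more log line fits in the pipe right now."""
    try:
        os.write(write_fd, data)
        return True
    except BlockingIOError:
        return False


def drain(read_fd):
    """Read everything currently buffered; return the bytes removed."""
    total = 0
    while True:
        try:
            data = os.read(read_fd, 65536)
        except BlockingIOError:
            return total
        if not data:
            return total
        total += len(data)


read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)
os.set_blocking(read_fd, False)

buffered = fill_pipe(write_fd)             # the consumer is gone, the producer keeps logging
blocked_before = not can_write(write_fd)   # a blocking writer would hang at this point
drained = drain(read_fd)                   # the manual fix: read from the other end
writable_after = can_write(write_fd)       # the producer can log again

os.close(read_fd)
os.close(write_fd)
```

Nothing about the network appears in this sketch, which is the point: once the pipe is full and its reader is gone, restoring connectivity changes nothing until something reads from the other end.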
Trap #3 — DNS lived inside its own failure domain
This is the trap that kept us down for hours after we had everything else fixed.
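Our perimeter gateway forwards LAN DNS queries to an internal resolver. That internal resolver is a load-balanced service running inside the Kubernetes cluster, on a virtual IP advertised by (and this is where the trap snaps shut) a DaemonSet that has to pull its container image to start. Pulling that image means resolving the registry's hostname, and with the in-cluster resolver down, nothing could resolve it. DNS was waiting on an image pull that was waiting on DNS.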
The cluster could not bootstrap itself out of this. Every minute it spent trying confirmed it never would.
The pattern:
Foundational infrastructure — DNS, image registry, time, identity — must not live inside the failure domain it serves.
If your DNS goes down whenever your cluster goes down, your cluster will never come back without a human.
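A loop like this can be caught mechanically if startup dependencies are written down. Below is a minimal sketch of a cycle check over a dependency map. The node names and edges are my reading of the chain described in this article, not an export from any real tool, and "external upstream DNS" stands in for the emergency resolver used during recovery.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    color = {node: WHITE for node in deps}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:          # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# "X: [Y]" means X cannot work until Y works.
startup_deps = {
    "gateway DNS forwarder": ["in-cluster resolver"],
    "in-cluster resolver": ["virtual IP DaemonSet"],
    "virtual IP DaemonSet": ["image pull"],
    "image pull": ["registry hostname lookup"],
    "registry hostname lookup": ["gateway DNS forwarder"],
}
cycle = find_cycle(startup_deps)
print(" -> ".join(cycle))

# The emergency fix: give hostname lookups a path that does not pass through the cluster.
fixed_deps = dict(startup_deps)
fixed_deps["registry hostname lookup"] = ["external upstream DNS"]
print(find_cycle(fixed_deps))   # None
```

The check only knows about the edges it is given, so its value depends on someone recording the unglamorous ones, like "image pulls need DNS".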
What we actually did (and what we did wrong on the way there)
That second reboot — the manual one — was the one that finished us off. About four hours in, the gateway looked frozen from the outside: ARP wasn't answering, the management UI crawled, nothing seemed to be talking to it. The obvious move was to reboot it. The obvious move was wrong. The gateway wasn't stuck — the cluster was, and the gateway looked weird precisely because the cluster couldn't talk to it. The two-minute reboot meant to "wake up" the gateway is what finished off the two control-plane nodes that had been limping along on a degraded cluster up until that point. Lesson for later.
Once we understood what the actual freeze was, recovery was mechanical:
- Unfreeze the state store by hand on every control-plane node — drain the blocked log pipe by reading from the other end. (Yes, really. We unblocked a Linux kernel pipe by reading the file descriptor from the dead container runtime's child file table.)
- Snapshot the state store before doing anything further. We had a freeze, not a crash, but trust nothing.
- Kill every cluster process clean on every control-plane node, then restart the daemon. With the state store responding again, the daemon's startup path completed.
- Break the DNS deadlock by pointing the gateway at an external upstream DNS so that something, anything, could resolve image-registry hostnames. The DaemonSet pulled its image. The in-cluster DNS service came back. The gateway eventually got its in-cluster resolver back too — but only because we had given it an emergency external one first.
- Remediate every blocklisted storage client. Worker nodes that had been unresponsive too long during the partition were blocklisted by the storage cluster's metadata server when they tried to reconnect, and their mounts were permanently dead with no auto-recovery. We unmounted and remounted by hand.
The cluster was back about seven hours after the first reboot. Zero data loss. Every workload that had been running at 3 AM was running again, on the same state and the same data.
Lessons (or: the patterns worth internalizing)
Specific tools come and go. Patterns persist. These are the ones we walked away with.
The rest of the network has to survive when the gateway reboots — because the gateway will reboot. Auto-updates, scheduled maintenance, hardware faults, a power blip, or — as we found out the hard way — a responder hitting reboot during incident response because the device looks stuck. The reboot is going to happen, and probably more often than you think. The right question isn't how do we prevent the gateway from rebooting — it's what fails when it does, and how do we make sure nothing critical depends on it answering ARP at any particular moment.
During an incident, the obvious fix can be the wrong fix. A gateway that "looks frozen from the outside" is sometimes just a healthy gateway sitting next to a broken cluster. The reboot we did to recover the gateway extended the outage to the remaining nodes — because the gateway wasn't actually broken; the cluster was. Before pulling the power on something during an incident, ask: Could this be a symptom rather than the cause?
Redundancy is only as good as the health check sitting on top of it. Four NICs, two switches, two aggregation groups — all defeated by one ARP probe target. If your liveness check has a single point of failure, the data-plane redundancy underneath it is theater. Health probes should be multi-target, ideally toward devices in your own broadcast domain, and never exclusively toward the single device whose failure you are most worried about.
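As a rough sketch of what "multi-target" means here, the function below decides whether a slave counts as up from a set of probe replies. It is an illustrative model only, assuming an "any target answered" policy. The target names are hypothetical, and none of this is the bonding driver's real interface.

```python
def slave_is_up(replies, policy="any"):
    """Decide slave health from probe replies, given as {target: answered}.

    policy="any": up if at least one target answered.
    policy="all": up only if every target answered.
    """
    if not replies:
        raise ValueError("at least one probe target is required")
    if policy not in ("any", "all"):
        raise ValueError(f"unknown policy: {policy!r}")
    answered = list(replies.values())
    return any(answered) if policy == "any" else all(answered)


# Probe results on one slave while the gateway is mid-reboot (hypothetical targets).
single_target = {"gateway": False}
multi_target = {"gateway": False, "switch-a-mgmt": True, "switch-b-mgmt": True}

print(slave_is_up(single_target))          # False: the bond gives up on a healthy link
print(slave_is_up(multi_target))           # True: the link is judged on more than one witness
print(slave_is_up(multi_target, "all"))    # False: "all" reintroduces the single point of failure
```

The last line is the design choice worth noticing: adding targets only helps if one silent target is not enough to condemn the link.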
Self-healing requires that the heal path doesn't depend on what just broke. Control-plane daemons that restart but can't start without a healthy state store. State stores that can't make progress without a healthy logger. DNS services that need DNS to start. Audit every "auto-recover" claim and ask: what does this depend on, and what happens when that dependency is the thing that failed?
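One way to run that audit is to record what each component needs in order to start, then ask what is stranded when a given component fails. The sketch below does that with a breadth-first walk. The component names and edges are my reading of the crash-loop described earlier, not something taken from the cluster itself.

```python
from collections import deque


def cannot_restart(failed, requires):
    """Return every component whose startup transitively requires `failed`.

    `requires` maps a component to the components it needs in order to start.
    If `failed` appears in the result, its own heal path depends on itself.
    """
    dependents = {}
    for component, needs in requires.items():
        for need in needs:
            dependents.setdefault(need, set()).add(component)

    blocked = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in blocked:
                blocked.add(dependent)
                queue.append(dependent)
    return blocked


requires = {
    "control daemon": ["state store"],       # startup talks to the local state store
    "state store": ["log consumer"],         # freezes once its log pipe fills
    "log consumer": ["container runtime"],   # went away with the runtime
    "container runtime": ["control daemon"], # runs as the daemon's child
}

blocked = cannot_restart("control daemon", requires)
print(sorted(blocked))
print("control daemon" in blocked)   # True: the daemon needs what its own exit broke
```

A component that shows up in its own result is exactly the case where a restart policy will loop forever without a human.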
Foundational services must live outside the failure domain they serve. DNS, image registry, NTP, identity, the monitoring you use to find out things are broken — none of these can depend on the thing they exist to keep alive. We had been saying this for years. We were doing it for one of them. We had not finished doing it for the rest.
A short trigger does not imply a short outage. The blast radius of a two-minute event is determined entirely by the software stack underneath. Our trigger was small. Our blast radius was seven hours and entirely manual to recover. If those two numbers can have that ratio, your stack has compounding traps you haven't surfaced yet.
Final thought
None of the design decisions that compounded into this outage were obviously wrong when they were made. The bond ARP probe pointed at the gateway because the gateway is what you ping to verify the network actually works. The DNS service lived on a cluster VIP because cluster VIPs are how you get HA service IPs. The control daemon's leader election guarded against split-brain, which is the thing that destroys clusters. Each was sound in isolation.
What's worth taking away is not don't do these things. It is:
When these things stack, ask what happens if all of them have a bad day at once.
Because sooner or later one of them will, and the others will join in.
The cluster's back. The bond probes are getting a multi-target rewrite. The internal DNS is moving off-cluster onto a standalone resolver. We're explicitly not turning off the gateway's auto-update — the gateway is going to reboot regardless of what we do, and the fix that matters is making everything downstream survive it. None of the fixes are exciting. None of them needed to be.
That's usually how it goes.