The Liturgy of the Restart: What a GPU Node Reboot Actually Costs
A reboot looks instant on a dashboard. The node goes red, then green, and the on-call engineer marks the incident resolved. The machine knows better. Between those two colors sits a sequence of slow, expensive rituals, and if you do not account for them you will keep promising recovery times you cannot meet.
Consider what actually has to happen before a single token comes back out of a freshly restarted node serving a 70B model.
The weights have to land
Model weights do not live in GPU memory by default. They live on disk, or worse, on a remote object store, and they have to be read, deserialized, and copied across PCIe into HBM. For a 70B model in bf16, that is roughly 140 GB to move. Even at a healthy 4 GB/s from local NVMe, you are looking at 35 seconds of pure transfer before anything else can begin. Pull those weights from S3 over a saturated network and the same step can stretch past five minutes.
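The arithmetic above is simple enough to keep as a helper. This is a minimal sketch that assumes decimal gigabytes and a sustained, uncontended throughput; the function name and parameters are illustrative, not taken from any library.

```python
def weight_transfer_seconds(params_billion: float,
                            bytes_per_param: int,
                            gb_per_second: float) -> float:
    """Estimate seconds of pure transfer for a model's weights.

    Assumes decimal GB (1e9 bytes) and a sustained throughput, so
    1e9 params * bytes_per_param / 1e9 bytes-per-GB reduces to a product.
    """
    size_gb = params_billion * bytes_per_param
    return size_gb / gb_per_second


# 70B parameters in bf16 (2 bytes each) read from local NVMe at 4 GB/s.
local_s = weight_transfer_seconds(70, 2, 4.0)
print(f"{local_s:.0f} s from local NVMe")  # 35 s from local NVMe

# Effective throughput implied by a five-minute (300 s) remote pull.
implied_gb_s = (70 * 2) / 300
print(f"{implied_gb_s:.2f} GB/s effective")  # 0.47 GB/s effective
```

Real loads rarely sustain peak throughput, so treat the result as a lower bound on the transfer step rather than a prediction.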
The first lesson of the restart is that storage topology is a latency decision, not a cost decision. Teams that colocate weights on local NVMe and teams that stream them from a bucket experience completely different reboots, and they discover the difference at the worst possible moment.
The collectives have to reconverge
If the model is sharded across eight GPUs with tensor parallelism, the process group has to re-form. NCCL has to rediscover its peers, negotiate transports, and rebuild its communication rings. On a healthy node with NVLink this is quick. On a node where one GPU is flapping or a NIC is renegotiating its link speed, the initialization can hang for minutes before it either succeeds or throws an error that points nowhere useful.
Watch NCCL_DEBUG=INFO during a cold start sometime. The log reads like an invocation. Every rank announces itself, the ring assembles, and only when the last participant reports in does the collective come alive. Until then, your inference server is a process holding 140 GB of weights it cannot yet use.
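One way to capture that log is to set the variable in the environment of the serving process at launch. The sketch below assumes the server starts as a subprocess; the helper names are hypothetical, while `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables.

```python
import os
import subprocess
from typing import Mapping, Optional, Sequence


def nccl_debug_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with NCCL init logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    # Restrict output to initialization and network setup, the parts
    # that matter during a cold start.
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
    return env


def launch_with_nccl_logs(command: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a (hypothetical) server command with NCCL logging switched on."""
    return subprocess.run(list(command), env=nccl_debug_env(), check=False)
```

Logging at this level is verbose, so it is probably best left to cold-start investigations rather than enabled permanently.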
The cache starts empty
Here is the cost almost everyone forgets. A warm inference node carries a populated KV cache and, often, a prefix cache full of system prompts and common prefixes. A rebooted node carries neither. Every request in the first minutes after restart pays full prefill cost with zero cache reuse.
If your traffic leans on a long shared system prompt, this is brutal. A node that served requests at 40 ms of prefill while warm can spend 600 ms per request while cold, simply because it is recomputing attention over tokens it used to have cached. Latency-based autoscaling reads this as overload and spins up more nodes, which themselves start cold and push measured latency higher still. The machine, briefly, eats its own tail.
Counting the true number
Add the steps honestly: weight transfer, framework and CUDA context initialization, collective formation, JIT compilation of fused kernels on first call, and the cache warmup period during which latency runs high. For a large sharded model pulling weights from local disk, a realistic floor is 90 to 180 seconds before the node serves at its steady-state latency. Pull from remote storage or hit a kernel cache miss and you can double that.
This number matters because it sets the real ceiling on your recovery objective. A rolling deploy across 40 nodes at 150 seconds each, with a concurrency of four, runs in ten waves and takes about 25 minutes, during which a portion of your fleet is degraded. If your SLO assumed instant cutover, you wrote a promise the machine never agreed to.
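The deploy estimate can be checked with a few lines. This sketch assumes nodes restart in fixed waves, that every node takes the same time, and that a wave finishes before the next begins; the function name is illustrative.

```python
import math


def rolling_deploy_seconds(nodes: int,
                           seconds_per_node: float,
                           concurrency: int) -> float:
    """Wall-clock time for a rolling restart done in fixed-size waves."""
    waves = math.ceil(nodes / concurrency)
    return waves * seconds_per_node


total_s = rolling_deploy_seconds(nodes=40, seconds_per_node=150, concurrency=4)
print(f"{total_s / 60:.0f} minutes")  # 25 minutes
```

Drain time, health-check delays, and any straggler node add to this figure, so in practice it is a floor.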
Designing around the ritual
You cannot abolish the restart, so you plan its liturgy. A few practices pay for themselves quickly.
Keep weights local and pre-staged, so transfer is bounded and predictable. Pin a kernel cache to disk so the first request does not pay for JIT compilation. Warm the prefix cache before a node rejoins the load balancer, replaying a handful of representative prompts so the node enters rotation already primed. Drain gracefully and add back gradually, so cold nodes take a trickle of traffic rather than a flood while their caches fill.
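The warm-before-rejoin step might look like the sketch below. It assumes a hypothetical `send_prompt` callable that blocks until the node has answered one prompt, and it treats "the slowest prompt in a round finished under a target" as the readiness signal, which is one possible criterion among several.

```python
import time
from typing import Callable, Iterable


def warm_before_ready(send_prompt: Callable[[str], None],
                      prompts: Iterable[str],
                      target_ms: float,
                      max_rounds: int = 5) -> bool:
    """Replay representative prompts until a whole round is fast enough.

    Returns True once the slowest prompt in a round completes within
    target_ms, or False if that never happens within max_rounds.
    """
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one representative prompt")
    for _ in range(max_rounds):
        worst_ms = 0.0
        for prompt in prompt_list:
            start = time.perf_counter()
            send_prompt(prompt)  # hypothetical: blocks until the reply is done
            elapsed_ms = (time.perf_counter() - start) * 1000
            worst_ms = max(worst_ms, elapsed_ms)
        if worst_ms <= target_ms:
            return True
    return False
```

A readiness probe could call this once at startup and report healthy to the load balancer only after it returns True.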
Most of all, instrument the gap. Emit a metric for time-from-process-start to first-token-at-steady-latency, not just to process-healthy. The distance between those two events is the part of the restart that lies to you, and the only way to stop being surprised by it is to measure it on every reboot until the number stops moving.
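A small tracker is enough to expose that gap. In this sketch, "steady" means the last `window` request latencies were all at or below a threshold; that definition, the class name, and its parameters are assumptions chosen for illustration.

```python
import time
from collections import deque
from typing import Callable, Optional


class RestartGapTracker:
    """Track time from process start to healthy and to steady latency."""

    def __init__(self, steady_ms: float, window: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()
        self._steady_ms = steady_ms
        self._recent = deque(maxlen=window)
        self.healthy_after_s: Optional[float] = None
        self.steady_after_s: Optional[float] = None

    def mark_healthy(self) -> None:
        """Call when the health check first passes."""
        if self.healthy_after_s is None:
            self.healthy_after_s = self._clock() - self._start

    def observe(self, latency_ms: float) -> None:
        """Feed each request latency until steady state is reached."""
        if self.steady_after_s is not None:
            return
        self._recent.append(latency_ms)
        window_full = len(self._recent) == self._recent.maxlen
        if window_full and max(self._recent) <= self._steady_ms:
            self.steady_after_s = self._clock() - self._start

    def gap_s(self) -> Optional[float]:
        """Seconds between healthy and steady, or None if not yet known."""
        if self.healthy_after_s is None or self.steady_after_s is None:
            return None
        return self.steady_after_s - self.healthy_after_s
```

Emitting `healthy_after_s`, `steady_after_s`, and `gap_s()` as metrics on every reboot gives the trend line the paragraph above asks for.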
The machine recovers on its own schedule. Our job is to know that schedule well enough to stop pretending it is zero.