Reading the Pulse: Where Tail Latency Hides in Distributed Inference

The median lies. A serving stack can post a beautiful p50 while a tenth of its requests crawl, and the users who feel that slowness are often the ones you most want to keep. Tail latency is where the machine reveals its real character, so that is where you learn to listen.

Here is where the long tail tends to live.

Inside the batch

Continuous batching is the engine of modern inference throughput, and it is also the first suspect when the tail goes bad. Requests of wildly different lengths share a batch. A short request that would finish in 200 ms gets scheduled alongside a 4,000-token generation, and depending on your scheduler it may wait, or it may decode in lockstep at the batch's pace.

The symptom is a tail that scales with batch diversity rather than load. Two requests per second of mixed lengths can produce a worse p99 than ten requests per second of uniform short prompts. If your tail worsens when traffic gets more varied rather than heavier, look at the batch composition before you look at anything else.
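A toy model makes the effect concrete. The sketch below compares a scheduler that holds every request until the longest one in its batch finishes with one that releases each request as soon as it is done. The function names and the 10 ms per decode step are illustrative assumptions, and the model ignores the way each step slows down as the batch grows.

```python
def lockstep_latencies(batch_tokens, ms_per_step=10.0):
    """Latency per request when the batch only returns after its longest member.

    ms_per_step is an assumed, constant cost per decode step.
    """
    steps = max(batch_tokens)
    return [steps * ms_per_step for _ in batch_tokens]


def release_early_latencies(batch_tokens, ms_per_step=10.0):
    """Latency per request when each request leaves the batch as soon as it finishes."""
    return [tokens * ms_per_step for tokens in batch_tokens]


mixed = [20, 20, 20, 4000]    # three short requests share a batch with one long one
uniform = [20, 20, 20, 20]    # same batch size, uniform lengths

# Under lockstep, every short request in the mixed batch pays for the long one.
mixed_lockstep = lockstep_latencies(mixed)          # [40000.0, 40000.0, 40000.0, 40000.0]
uniform_lockstep = lockstep_latencies(uniform)      # [200.0, 200.0, 200.0, 200.0]
mixed_released = release_early_latencies(mixed)     # [200.0, 200.0, 200.0, 40000.0]
```

The load is identical in both batches. Only the composition changed, and with it the latency of every short request under the lockstep policy.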

In the prefill

Prefill is compute that scales with prompt length, and it competes with decode for the same GPU. A single 32,000-token prompt arriving mid-stream can stall every active decode while the machine grinds through attention over that long context. Every other user in the batch feels the pause, even the ones who sent ten tokens.

Chunked prefill exists precisely to tame this. By splitting a long prompt into smaller pieces interleaved with decode steps, you trade a little prefill throughput for a much smoother tail. Whether you have it on, and how you have sized the chunks, is often the single biggest lever on p99 in a stack that serves variable prompt lengths.
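To see why chunk size matters, here is a minimal sketch of the arithmetic. It assumes one decode step runs between consecutive prefill chunks and that prefill cost is linear in tokens at a made-up 0.05 ms per token. Real attention cost grows faster than linearly with context, so treat the numbers as a lower bound on the benefit rather than a prediction. Both function names are hypothetical.

```python
def prefill_chunks(prompt_tokens, chunk_size):
    """Split a prompt into chunk lengths of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    full, rest = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def longest_decode_stall_ms(prompt_tokens, chunk_size, ms_per_prefill_token=0.05):
    """Worst pause an active decode sees while the prompt is prefilled.

    Assumes a decode step is interleaved after every chunk, so the stall
    is the cost of the largest single chunk. The per-token cost is an
    illustrative assumption, not a measured value.
    """
    return max(prefill_chunks(prompt_tokens, chunk_size)) * ms_per_prefill_token


chunks = prefill_chunks(32_000, 2_048)                 # 15 chunks of 2048, then one of 1280
unchunked = longest_decode_stall_ms(32_000, 32_000)    # roughly 1600 ms in one stall
chunked = longest_decode_stall_ms(32_000, 2_048)       # roughly 102 ms per stall
```

Smaller chunks shrink the worst stall but add more scheduling steps before the long prompt produces its first token, which is the throughput you are trading away.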

In the queue

Before a request reaches a GPU it waits in line, and that line is rarely as short as your dashboards suggest. Admission queueing, scheduler quantum, and the gap between when a request arrives and when the next batch forms all add latency that no GPU profiler will show you.

Measure queue wait separately from execution time. A request that spent 180 ms waiting and 90 ms generating is a queue problem wearing a generation costume, and tuning your kernels will do nothing for it. The fix lives in admission control and batch cadence, not in the model.
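One way to keep the two apart is to stamp each request three times and derive both durations from the stamps. The sketch below is a minimal version of that idea. The class and field names are hypothetical, and it assumes all three timestamps come from the same monotonic clock.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Three timestamps per request, in milliseconds on one monotonic clock."""
    arrived_ms: float    # request accepted by the server
    started_ms: float    # request first scheduled onto a GPU
    finished_ms: float   # last token produced

    @property
    def queue_wait_ms(self):
        return self.started_ms - self.arrived_ms

    @property
    def execution_ms(self):
        return self.finished_ms - self.started_ms

    def dominant_phase(self):
        """Name the phase that accounts for more of the latency."""
        return "queue" if self.queue_wait_ms > self.execution_ms else "execution"


# The request from the text: 180 ms waiting, 90 ms generating.
timing = RequestTiming(arrived_ms=0.0, started_ms=180.0, finished_ms=270.0)
phase = timing.dominant_phase()   # "queue"
```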

In the network

Sharded models talk constantly. Every token in a tensor-parallel deployment triggers all-reduce operations across GPUs, and those collectives run at the speed of the slowest participant. One GPU running slightly hot and throttling its clocks drags the whole shard group down with it, silently, because the model still produces correct output. It just produces it late.

This is the cruelest tail because nothing is broken. No error fires. The collective simply waits on its slowest member, every single token, and your p99 climbs while every health check stays green. The only way to catch it is per-device telemetry: clock speeds, temperatures, and per-rank step times that let you spot the laggard.
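Spotting the laggard from per-rank step times can be as simple as comparing each rank's median against the group's. The sketch below assumes you record each rank's local compute time per step, measured before it enters the collective. Timing that includes the collective tends to look the same on every rank, because they all wait for the slowest one. The function name and the 10 percent tolerance are illustrative choices.

```python
from statistics import median


def find_laggards(step_times_ms, tolerance=1.10):
    """Return ranks whose median step time exceeds the group median by the tolerance.

    step_times_ms maps rank -> list of per-step compute times in ms,
    measured before the rank enters the all-reduce (an assumption about
    where the timer sits).
    """
    per_rank = {rank: median(times) for rank, times in step_times_ms.items()}
    group = median(per_rank.values())
    return sorted(rank for rank, value in per_rank.items() if value > group * tolerance)


samples = {
    0: [10.0, 10.0, 10.0],
    1: [10.0, 11.0, 10.0],
    2: [14.0, 15.0, 14.0],   # the throttled device
    3: [10.0, 10.0, 10.0],
}
slow_ranks = find_laggards(samples)   # [2]
```

Medians are used on both levels so that a single noisy step, or a single slow rank, does not shift the baseline it is being compared against.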

Listening properly

Three habits separate teams that understand their tail from teams that fight it blindly.

Decompose the latency. Queue wait, prefill, decode, and network should each be their own measurement, because each has a different cause and a different fix. A single end-to-end number tells you something hurts and nothing about where.

Watch percentiles by request shape. Bucket your latency by prompt length and generation length before you read it. A p99 computed across all shapes blends unrelated populations and obscures each of them.
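A minimal sketch of that bucketing follows. The bucket edges (1,024 prompt tokens, 256 output tokens), the record layout, and the function names are all assumptions chosen for illustration, and the percentile uses the simple nearest-rank definition with an integer percentile.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # integer ceiling of pct*n/100
    return ordered[rank - 1]


def shape_bucket(prompt_tokens, output_tokens, prompt_edge=1024, output_edge=256):
    """Label a request by coarse shape; the edges are arbitrary examples."""
    prompt = "long_prompt" if prompt_tokens >= prompt_edge else "short_prompt"
    output = "long_output" if output_tokens >= output_edge else "short_output"
    return (prompt, output)


def p99_by_shape(records):
    """records is an iterable of (prompt_tokens, output_tokens, latency_ms)."""
    buckets = defaultdict(list)
    for prompt_tokens, output_tokens, latency_ms in records:
        buckets[shape_bucket(prompt_tokens, output_tokens)].append(latency_ms)
    return {shape: percentile(latencies, 99) for shape, latencies in buckets.items()}


records = (
    [(10, 10, 100.0)] * 98      # short requests on a good day
    + [(10, 10, 150.0)] * 2     # short requests in their own tail
    + [(8000, 10, 900.0)] * 10  # long-prompt requests
)
blended = percentile([latency for _, _, latency in records], 99)   # 900.0
by_shape = p99_by_shape(records)
# by_shape[("short_prompt", "short_output")] is 150.0
# by_shape[("long_prompt", "short_output")] is 900.0
```

The blended figure reports only the long-prompt population, so a regression in the short-request tail would not move it.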

Trace the slow ones. Sample full traces of requests above your p99 threshold and read them, actually read them, until you can narrate where the time went. The machine will tell you what it is doing. You only have to keep records faithful enough to hear it.
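If the tail produces more traces than anyone can read, a bounded uniform sample is one option. The sketch below keeps at most `k` traces above a latency threshold using reservoir sampling, so every slow request has the same chance of being kept regardless of when it arrived. The trace layout (a dict with a `latency_ms` key), the function name, and the fixed seed are assumptions.

```python
import random


def sample_slow_traces(traces, threshold_ms, k, seed=0):
    """Keep a uniform random sample of at most k traces slower than threshold_ms.

    Uses reservoir sampling (Algorithm R), so it works on a stream
    without knowing in advance how many slow traces there will be.
    """
    rng = random.Random(seed)
    kept = []
    seen = 0   # number of slow traces encountered so far
    for trace in traces:
        if trace["latency_ms"] <= threshold_ms:
            continue
        seen += 1
        if len(kept) < k:
            kept.append(trace)
        else:
            slot = rng.randrange(seen)
            if slot < k:
                kept[slot] = trace
    return kept


stream = [{"id": i, "latency_ms": float(i)} for i in range(1000)]
to_read = sample_slow_traces(stream, threshold_ms=990.0, k=5)   # 5 of the 9 slow traces
```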

The median is the story the machine tells about its good days. The tail is the story it tells about the rest, and the rest is where your users decide whether to trust it.
