Batch Size Drift: Why Your Inference Throughput Degrades Overnight
You deploy a model, profile it, tune the maximum batch size to saturate your A100s without blowing the KV cache budget, and ship it. Throughput looks great. Two weeks later, p50 latency has crept up 40% and GPU utilization is reading somewhere between "confused" and "insulted." Nothing changed in the model. Nothing changed in the serving config. Or so you think.
What actually happened is batch size drift: a quiet failure in which the effective batch size your inference server assembles at runtime diverges from the size you tuned for, and the gap widens with every traffic pattern shift that nobody bothered to alert on.
Where the Drift Comes From
Most production inference stacks use some form of continuous batching (vLLM, TGI, TensorRT-LLM's inflight batching). These systems fill a batch by pulling waiting requests up to a configured token budget or request count, whichever binds first. The tuning you did at launch assumed a specific mix of prompt lengths and generation lengths. That mix drifts.
Consider what happens when a new client integration starts sending longer system prompts. The token budget fills faster, so the scheduler assembles smaller request batches. Fewer sequences per forward pass means the matrix multiplications that dominate transformer compute run at lower arithmetic intensity. You paid for 312 TFLOPS; you're getting the utilization profile of a machine doing polite work.
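To make the arithmetic concrete, here is a rough sketch of how a fixed token budget turns into requests per prefill batch. The function name and the numbers (an 8192-token budget, a 64-request cap, 512- versus 1200-token prompts) are illustrative assumptions, and the model deliberately ignores chunked prefill, decode tokens, and KV cache limits that a real scheduler also accounts for.

```python
def requests_per_prefill_batch(token_budget: int,
                               mean_prompt_tokens: int,
                               max_requests: int) -> int:
    """Whole prompts that fit in one batch: whichever limit binds first."""
    if mean_prompt_tokens <= 0:
        raise ValueError("mean_prompt_tokens must be positive")
    by_tokens = token_budget // mean_prompt_tokens
    return min(by_tokens, max_requests)


# Hypothetical launch-time mix versus the drifted mix, same budget.
tuned = requests_per_prefill_batch(8192, 512, 64)
drifted = requests_per_prefill_batch(8192, 1200, 64)
print(tuned, drifted)  # 16 6
```

With the budget unchanged, a prompt that grows from 512 to 1200 tokens cuts the batch from 16 requests to 6, and the request cap never comes into play.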
Generation length shifts cause the mirror-image problem. If clients start asking for longer outputs, sequences linger in the batch longer, blocking new arrivals and inflating queue depth. Once the concurrency cap binds, the scheduler stops admitting new sequences, again limiting effective batch size.
What the Metrics Say (and What They Hide)
The insidious part: your dashboard probably shows GPU utilization holding steady. GPU utilization is a coarse signal. A GPU at 85% utilization could be doing efficient large-batch matmuls or it could be spinning on a handful of tiny ones with terrible memory bandwidth efficiency. dcgm-exporter gives you DCGM_FI_DEV_GPU_UTIL, which reports the fraction of time the GPU had at least one kernel running. It does not tell you whether that time was productive.
What you actually want to watch:
- Requests per forward pass (your serving framework should expose this; in vLLM it surfaces under vllm:num_requests_running)
- Mean prompt token length and mean generation token length, tracked as distributions, not just averages
- Tokens generated per second per GPU, not total throughput, which inflates when request rate increases
- Batch assembly wait time: how long a request sits in queue before it joins a batch
When effective batch size drifts down, tokens-per-second-per-GPU drops, queue wait time rises, and request-count-per-pass shrinks, usually alongside a visible shift in the prompt or generation length distributions. All four moving together is the signature.
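One way to check for that signature programmatically is to compare a current metrics snapshot against a launch-time baseline. This is a minimal sketch: the `Snapshot` fields, the `drift_signature` helper, and the 10% threshold are all assumptions made for illustration, not names from any serving framework.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    requests_per_pass: float
    tokens_per_sec_per_gpu: float
    queue_wait_ms: float
    mean_prompt_tokens: float
    mean_generation_tokens: float


def rel_change(baseline: float, current: float) -> float:
    """Signed fractional change relative to the baseline value."""
    return (current - baseline) / baseline


def drift_signature(base: Snapshot, cur: Snapshot, threshold: float = 0.10) -> bool:
    """True only when all four signals have moved in the drift direction."""
    batch_down = rel_change(base.requests_per_pass, cur.requests_per_pass) <= -threshold
    tput_down = rel_change(base.tokens_per_sec_per_gpu, cur.tokens_per_sec_per_gpu) <= -threshold
    wait_up = rel_change(base.queue_wait_ms, cur.queue_wait_ms) >= threshold
    lengths_up = (
        rel_change(base.mean_prompt_tokens, cur.mean_prompt_tokens) >= threshold
        or rel_change(base.mean_generation_tokens, cur.mean_generation_tokens) >= threshold
    )
    return batch_down and tput_down and wait_up and lengths_up


# Made-up example values, not measurements.
baseline = Snapshot(24, 2400, 40, 512, 200)
current = Snapshot(11, 1700, 95, 1200, 210)
print(drift_signature(baseline, current))  # True
```

Requiring all four conditions keeps the check from firing on an ordinary traffic dip, where throughput falls but lengths and queue wait stay put.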
Diagnosing the Specific Cause
graph TD
A[Throughput drops] --> B{Batch size per pass down?}
B -->|Yes| C{Prompt tokens up?}
B -->|No| D[Look elsewhere: memory pressure, preemption]
C -->|Yes| E[Token budget binding: audit system prompt length]
C -->|No| F{Generation length up?}
F -->|Yes| G[Sequence eviction or concurrency cap binding]
F -->|No| H[Request rate drop or scheduler regression]
Start with the batch size signal. If it's down, branch on whether the token budget is binding (prompt side) or the concurrency cap is binding (generation side). Both are fixable, but the fixes are different.
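The same decision flow can be written as a small function, which is handy if you want a runbook script to print the first thing to check. The function name and boolean inputs are hypothetical; how you derive each boolean from your metrics is up to you.

```python
def diagnose(batch_size_down: bool,
             prompt_tokens_up: bool,
             generation_length_up: bool) -> str:
    """Walk the flowchart: batch size first, then prompt side, then generation side."""
    if not batch_size_down:
        return "Look elsewhere: memory pressure, preemption"
    if prompt_tokens_up:
        return "Token budget binding: audit system prompt length"
    if generation_length_up:
        return "Sequence eviction or concurrency cap binding"
    return "Request rate drop or scheduler regression"


print(diagnose(batch_size_down=True, prompt_tokens_up=False, generation_length_up=True))
```

The prompt side is checked before the generation side, matching the order in the chart, so a deployment where both have grown will point you at the token budget first.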
Fixing It
For prompt-length drift, the lever is the token budget per batch. If your max_num_batched_tokens in vLLM was set assuming 512-token prompts and clients are now sending 1200-token prompts, recalculate. Profile the new prompt distribution, estimate the batch size you can fit in VRAM with the updated lengths, and raise or lower the limit accordingly. Consider prefix caching (vLLM's enable_prefix_caching) if long system prompts are reused across requests. Caching the KV state for the shared prefix lets you reclaim the token budget it would otherwise consume.
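For the "estimate the batch size you can fit in VRAM" step, a back-of-the-envelope KV cache calculation is often enough to get started. The sketch below assumes a hypothetical model shape (32 layers, 32 KV heads, head dimension 128, 16-bit cache), a hypothetical 40 GiB left over for KV cache after weights, and 256 generated tokens per request. It also assumes every sequence holds its full length at once, which is conservative for paged allocators.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token; the factor of 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_sequences(kv_budget_bytes: int, tokens_per_sequence: int,
                  bytes_per_token: int) -> int:
    """Whole sequences that fit if each one holds its full token count."""
    return kv_budget_bytes // (tokens_per_sequence * bytes_per_token)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
kv_budget = 40 * 1024**3  # assumed KV cache budget in bytes

tuned_fit = max_sequences(kv_budget, 512 + 256, per_token)
drifted_fit = max_sequences(kv_budget, 1200 + 256, per_token)
print(per_token, tuned_fit, drifted_fit)  # 524288 106 56
```

Under these assumptions the same memory holds roughly half as many sequences once prompts grow to 1200 tokens, which is the number to carry into the new limit.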
For generation-length drift, the knob is max_num_seqs, the maximum concurrent sequences. If sequences are running longer, you can either accept lower concurrency (and lower throughput) or increase it and accept higher VRAM pressure. There is no free answer here. The honest move is to profile the actual generation length distribution monthly, compare it against your provisioning assumptions, and adjust max_num_seqs accordingly.
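The monthly comparison can be as simple as computing percentiles over logged generation lengths and checking them against the figure you provisioned for. This sketch uses only the standard library; the helper names, the 15% tolerance, and the sample data are assumptions.

```python
import statistics


def length_percentiles(samples: list[int]) -> dict[str, float]:
    """p50 and p95 of observed generation lengths (needs at least two samples)."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points: cuts[k-1] is pk
    return {"p50": cuts[49], "p95": cuts[94]}


def exceeds_assumption(samples: list[int], assumed_p95: float,
                       tolerance: float = 0.15) -> bool:
    """True when observed p95 is more than `tolerance` above the assumed p95."""
    return length_percentiles(samples)["p95"] > assumed_p95 * (1 + tolerance)


# Stand-in data; in practice this would come from request logs.
observed = list(range(1, 101))
stats = length_percentiles(observed)
needs_retune = exceeds_assumption(observed, assumed_p95=64)
print(stats, needs_retune)
```

Tracking p95 rather than the mean matters here because a long tail of slow-finishing sequences is what holds concurrency slots open.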
Set alerts. A 15% drop in tokens-generated-per-second-per-GPU sustained over a 10-minute window should page someone. Batch size drift is slow enough to miss in daily spot checks but fast enough to cause real capacity problems before your next quarterly review.
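As a sketch of the alert condition itself, the function below fires only when every sample in the most recent window sits at least 15% below baseline and there is at least a full window of history. The function name, the `(timestamp_seconds, value)` sample format, and the baseline of 2400 are assumptions. In a Prometheus setup the same idea would normally live in an alerting rule with a `for:` duration rather than in application code.

```python
def sustained_drop(samples: list[tuple[float, float]], baseline: float,
                   drop_fraction: float = 0.15, window_s: float = 600.0) -> bool:
    """samples: (timestamp_seconds, tokens_per_sec_per_gpu), sorted by time."""
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < window_s:
        return False  # not enough history to call the drop sustained
    threshold = baseline * (1 - drop_fraction)
    recent = [value for ts, value in samples if ts >= latest - window_s]
    return all(value <= threshold for value in recent)


# One sample per minute for 15 minutes, all well below a 2400 baseline.
degraded = [(60.0 * i, 2000.0) for i in range(16)]
# Same series with a single recovery blip inside the window.
blip = degraded[:-3] + [(degraded[-3][0], 2300.0)] + degraded[-2:]

print(sustained_drop(degraded, baseline=2400.0))  # True
print(sustained_drop(blip, baseline=2400.0))      # False
```

Requiring the whole window to be below threshold trades a few minutes of detection delay for far fewer pages on momentary dips.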
The model did not change. The hardware did not change. The traffic did, and nobody was watching the right number.