KV Cache Eviction: What Gets Dropped and Why It Costs You
Every token a transformer generates leans on every token that came before it. That backward dependency is why the key-value cache exists: rather than recompute attention over the full context on every decoding step, the runtime stores those intermediate tensors and reads from them. The cache is the reason autoregressive inference is tractable at scale.
Photo by panumas nikhomkhai on Pexels.
When it fills up, something has to go. That decision, made in microseconds by whatever eviction policy your serving runtime implements, shapes latency in ways that don't always show up in the metrics you're watching.
What the Cache Is Actually Holding
For each transformer layer, the KV cache stores the projected key and value tensors for every token in the active context. A single cache slot for one token in one layer costs 2 × num_heads × head_dim × bytes_per_element. Multiply across layers and you understand quickly why a 70B model in fp16 can burn through GPU HBM just holding contexts.
vLLM's PagedAttention (introduced in v0.1, now central to most serious serving stacks) treats the cache as paged memory, allocating fixed-size blocks rather than contiguous buffers. This solves fragmentation but doesn't eliminate the underlying pressure: when all pages are occupied and a new request arrives, the scheduler must either reject the request, preempt an active sequence, or evict pages belonging to a sequence that hasn't been touched recently.
The Three Eviction Decisions
Most production systems implement one of three policies, or a blend.
LRU (Least Recently Used). Evict the sequence whose pages were last accessed longest ago. Cheap to implement, reasonable for uniform workloads. Breaks badly when you have a mix of long-running sessions and short completions: a 32K-token conversation that's waiting on a user gets evicted to make room for a flood of 512-token batch requests.
Beam search priority. Some runtimes track which sequences are part of active beam searches and protect them. Pages belonging to pruned beams are candidates; surviving beams are not. This keeps active decode trees coherent but adds bookkeeping overhead.
Prefix sharing with eviction guards. When multiple requests share a common system prompt, caching that prefix once and pinning it saves significant compute. vLLM's prefix caching marks shared blocks as "reference-counted" rather than owned by a single sequence. Eviction only fires when the ref count drops to zero. The failure mode here is a shared prefix that never fully expires, quietly consuming pages that decode throughput needs.
graph TD
A[New request arrives] --> B{Free pages available?}
B -- Yes --> C[Allocate pages, begin decode]
B -- No --> D{Preempt or evict?}
D --> E[Score active sequences by policy]
E --> F{Sequence selected}
F --> G[Swap to CPU or discard]
G --> H[Recompute on reschedule]
G --> C
Where the Latency Hit Actually Lives
Eviction itself is fast. The cost arrives later, when the evicted sequence gets rescheduled and the runtime has to recompute its KV state from scratch. That recomputation is a full prefill pass over the evicted context: O(n²) attention over however many tokens were in that sequence when it got dropped.
If you're tracing latency with something like Prometheus + Grafana and watching p50 TTFT (time to first token), eviction-driven recomputation shows up as periodic spikes that correlate with high concurrency. They're easy to misread as network jitter or scheduler overhead. The tell is that spike duration scales with context length: a 4K-token sequence takes proportionally longer to recompute than a 512-token one.
Profile this with vLLM's built-in metrics: vllm:num_preemptions_total and vllm:gpu_cache_usage_perc together tell you whether you're regularly hitting the ceiling and paying the recompute tax.
Practical Tuning Levers
Raise gpu_memory_utilization carefully. vLLM defaults to 0.90; pushing to 0.95 buys more cache pages but narrows the safety margin before OOM. Monitor nvidia-smi's memory reserved versus used, not just the utilization percentage.
Limit max concurrency explicitly. If your p99 latency target requires that no request waits more than 500ms for recompute, calculate the maximum context length your throughput budget can afford to recompute, then cap max_num_seqs accordingly. Shedding requests at the load balancer is cheaper than silent degradation from constant eviction.
For workloads with predictable long-lived sessions (agents, multi-turn assistants), consider CPU offload rather than discard. vLLM supports swapping evicted blocks to CPU memory; retrieval latency is measurable (typically 5-20ms depending on PCIe bandwidth) but far cheaper than a full prefill. The tradeoff only pays out if the evicted sequence actually gets rescheduled.
Prefix caching is worth enabling for any deployment with a fixed system prompt. Measure the prefix hit rate via vllm:gpu_prefix_cache_hit_rate. If it's below 60% on a workload you expected to benefit, audit whether prompt variability is defeating the cache key hash.
The cache fills. Something gets evicted. The question worth asking before that happens is whether your serving config was designed to answer it.
Get Omnissiah Systems in your inbox
New posts delivered directly. No spam.
No spam. Unsubscribe anytime.