Prefill Latency and the Cost of Long Prompts: Where Your TTFT Goes
Time-to-first-token is the number your users feel before they can articulate why they're frustrated. A model that generates at 80 tokens per second feels broken if it spends four seconds staring at the prompt before producing anything. Those four seconds are prefill, and its cost grows with every prompt token: linearly through the model's weight matrices, and quadratically through standard attention.
Most teams instrument decode latency carefully and ignore prefill entirely. That's the wrong allocation of attention.
What Prefill Actually Costs
During prefill, the model processes every token in the input prompt in a single forward pass, building the KV cache that decode will consume. Unlike decode (which is memory-bandwidth-bound, one token at a time), prefill is compute-bound. You're doing full matrix multiplications across the entire prompt length. On an A100 80GB with a 7B-parameter model, a 512-token prompt takes roughly 80-120ms to prefill. Push that to 4,096 tokens and you're looking at 600ms to over a second, depending on batch composition.
The quadratic relationship comes from attention: each token attends to every preceding token, so the attention computation scales as O(n²) in sequence length. Even with FlashAttention-2 (which reduces memory overhead dramatically through tiling), the FLOPs are the same. You're just doing them more efficiently, not fewer of them.
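A back-of-envelope estimate shows how much of prefill the quadratic term accounts for at different prompt lengths. This is a rough sketch, not a profiler: the parameter count, hidden size, and layer count are illustrative 7B-class values assumed here, and the function names are hypothetical.

```python
def prefill_flops(n_tokens: int, n_params: float = 7e9,
                  d_model: int = 4096, n_layers: int = 32) -> tuple[float, float]:
    """Return (linear_flops, attention_flops) for one prefill pass.

    Assumed shapes (n_params, d_model, n_layers) are illustrative only.
    linear:    about 2 FLOPs per parameter per token (the weight matmuls).
    attention: QK^T and the weighted sum over V each cost about
               2 * n^2 * d_model FLOPs per layer; the causal mask lets a
               kernel skip roughly half of that work.
    """
    linear = 2.0 * n_params * n_tokens
    attention = 0.5 * (4.0 * n_tokens ** 2 * d_model) * n_layers
    return linear, attention


def attention_share(n_tokens: int) -> float:
    """Fraction of estimated prefill FLOPs spent in the quadratic term."""
    linear, attention = prefill_flops(n_tokens)
    return attention / (linear + attention)


if __name__ == "__main__":
    for n in (512, 4096, 32768):
        print(f"{n:>6} tokens: attention is {attention_share(n):.1%} of prefill FLOPs")
```

Under these assumptions the linear term dominates short prompts, and the quadratic term takes a growing share as prompts reach tens of thousands of tokens. Doubling the prompt doubles the linear work but quadruples the attention work.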
This matters operationally because many RAG pipelines, tool-use agents, and chat applications with long history windows are quietly stuffing 8k to 32k tokens into every request.
How Prefill Hides in Your Metrics
If you're logging TTFT as a single number, you've already lost the signal. Separate your traces:
- Queue wait time: how long the request sat before the engine touched it
- Prefill duration: time from first forward pass to KV cache complete
- First decode step: the additional step before the first output token emits
vLLM records the timestamps needed for this breakdown on each request, which you can read by tracing at the AsyncLLMEngine level; the exact field names vary by version. TGI (text-generation-inference) reports prefill and decode timings through its metrics endpoint. Pull those into your Prometheus scrape and graph them by prompt length bucket.
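The three-way split and the per-bucket view can be sketched in a few lines. The `RequestTrace` record and its field names are hypothetical; map whatever timestamps your engine actually emits onto them. The bucket boundaries are illustrative as well.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestTrace:
    # Hypothetical trace record, not any engine's real schema.
    prompt_tokens: int
    arrived: float        # seconds: request received
    prefill_start: float  # first forward pass begins
    prefill_end: float    # KV cache complete
    first_token: float    # first output token emitted

    def breakdown(self) -> dict[str, float]:
        """Split TTFT into the three stages listed above."""
        return {
            "queue_wait": self.prefill_start - self.arrived,
            "prefill": self.prefill_end - self.prefill_start,
            "first_decode": self.first_token - self.prefill_end,
        }


BUCKETS = (512, 2048, 8192)  # illustrative upper bounds, in tokens


def bucket_label(prompt_tokens: int) -> str:
    for upper in BUCKETS:
        if prompt_tokens <= upper:
            return f"<={upper}"
    return f"{BUCKETS[-1]}+"


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; fine for dashboards, crude for statistics."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ttft_by_bucket(traces: list[RequestTrace], pct: float = 99.0) -> dict[str, dict[str, float]]:
    """Per prompt-length bucket, the pct-th percentile of each TTFT stage."""
    grouped: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for trace in traces:
        for stage, seconds in trace.breakdown().items():
            grouped[bucket_label(trace.prompt_tokens)][stage].append(seconds)
    return {
        bucket: {stage: percentile(vals, pct) for stage, vals in stages.items()}
        for bucket, stages in grouped.items()
    }
```

Keeping the stages separate per bucket is the point: a high p99 in `queue_wait` for short prompts tells a different story than a high p99 in `prefill` for long ones.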
What you'll find, almost always: the p99 TTFT is dominated by a small fraction of requests with very long prompts. Those requests also block GPU compute that shorter requests could have used. Prefill is not just slow for the long-prompt user; it delays everyone queued behind it.
graph TD
A[Request Arrives] --> B{Prompt Length?}
B --> C[Short: under 512 tokens]
B --> D[Long: 2k+ tokens]
C --> E(Fast Prefill: ~80ms)
D --> F(Slow Prefill: 600ms+)
E --> G[Decode Begins]
F --> G
F --> H[Queued Requests Stall]
Practical Mitigations
Prompt length bucketing and priority routing. Route requests above a token threshold to a separate pool or lower-priority queue. This is a scheduling decision, not a model change. You keep short-prompt latency stable while long-prompt requests drain in the background. What this buys is isolation, not speed: the long requests finish no sooner, and may wait longer.
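A routing rule of this kind can be very small. The sketch below assumes the prompt has already been tokenized and counted; the pool names, the `Route` type, and the 2,048-token threshold are all hypothetical placeholders to tune against your own traffic.

```python
from dataclasses import dataclass

LONG_PROMPT_THRESHOLD = 2048  # tokens; illustrative, not a recommendation


@dataclass(frozen=True)
class Route:
    pool: str      # hypothetical pool name
    priority: int  # lower number is served first


def route_request(prompt_tokens: int, threshold: int = LONG_PROMPT_THRESHOLD) -> Route:
    """Send long prompts to their own pool so they cannot stall short ones."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > threshold:
        return Route(pool="long-prompt", priority=1)
    return Route(pool="interactive", priority=0)
```

The threshold is the only knob worth arguing about; pick it from the prompt-length distribution you actually observe rather than from a round number.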
Prefix caching. If your system prompt or RAG preamble is shared across requests (common in production), prefix caching lets the engine reuse KV cache entries for the shared prefix. vLLM's prefix cache (available since the v0.3 series) hashes prompt prefixes and reuses computed KV blocks. With a 2k-token shared system prompt, this can eliminate the bulk of prefill cost for repeat prefixes. Measure your cache hit rate; if it's below 60%, your prompts aren't as consistent as you think.
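To estimate what hit rate your traffic could achieve before touching the engine, a toy model of block-level prefix hashing is enough. This is an illustration of the general idea only, not vLLM's implementation: the class name, the 16-token block size, and the lack of eviction are all simplifying assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; illustrative


class ToyPrefixCache:
    """Counts how many prompt tokens a block-hash prefix cache could reuse."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.seen: set[str] = set()  # no eviction: an unbounded toy cache
        self.hit_tokens = 0
        self.total_tokens = 0

    def _block_hashes(self, token_ids: list[int]) -> list[str]:
        # Each hash covers the block plus everything before it, so two
        # prompts share a hash only if they share the whole prefix.
        hashes: list[str] = []
        parent = ""
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            block = token_ids[start:start + self.block_size]
            payload = parent + "|" + ",".join(map(str, block))
            parent = hashlib.sha256(payload.encode()).hexdigest()
            hashes.append(parent)
        return hashes

    def admit(self, token_ids: list[int]) -> int:
        """Record a request; return the number of tokens served from cache."""
        hashes = self._block_hashes(token_ids)
        cached_blocks = 0
        for block_hash in hashes:
            if block_hash not in self.seen:
                break
            cached_blocks += 1
        self.seen.update(hashes)
        reused = cached_blocks * self.block_size
        self.hit_tokens += reused
        self.total_tokens += len(token_ids)
        return reused

    @property
    def hit_rate(self) -> float:
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0
```

Replaying a sample of logged prompts through something like this shows quickly whether a timestamp, user name, or other per-request detail near the top of the prompt is breaking the shared prefix.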
Chunked prefill. Rather than processing a long prompt in a single monolithic pass, chunked prefill splits it into fixed-size chunks and interleaves them with decode steps from other requests. vLLM added this in v0.4 under --enable-chunked-prefill. The tradeoff: individual long-prompt TTFT gets slightly worse, but p50 TTFT for concurrent short requests improves significantly. Profile your actual request mix before enabling it blindly.
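The interleaving can be pictured with a toy planner. It assumes a fixed per-step token budget shared between decode and prefill and a constant number of in-flight decode requests, each consuming one token per step; `plan_chunked_prefill` and its parameters are hypothetical and do not mirror any engine's scheduler.

```python
def plan_chunked_prefill(prompt_tokens: int, decoding_requests: int,
                         token_budget: int = 512) -> list[dict[str, int]]:
    """Split one long prefill into per-step chunks that share a token budget.

    Decode requests are scheduled first (one token each); whatever budget
    is left in the step goes to the next chunk of the long prompt.
    """
    if prompt_tokens < 0 or decoding_requests < 0:
        raise ValueError("token and request counts must be non-negative")
    if decoding_requests >= token_budget:
        raise ValueError("token budget leaves no room for prefill")

    steps: list[dict[str, int]] = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, token_budget - decoding_requests)
        steps.append({"decode_tokens": decoding_requests, "prefill_tokens": chunk})
        remaining -= chunk
    return steps
```

The plan makes the tradeoff visible: the long prompt now needs several engine steps before its first token, while every in-flight request keeps producing a token on each of those steps instead of stalling.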
Prompt compression and truncation. This is often the highest-leverage intervention: audit what's actually in your prompts. Retrieval pipelines frequently concatenate more context than the model needs to answer. Compression tools such as LLMLingua, or a reranking pass that drops low-relevance chunks, can reduce context by 3-5x with minimal quality loss on many tasks. Smaller input means faster prefill, smaller KV cache, and more room for concurrency.
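The simplest version of this is a hard token budget on retrieved context. The sketch below is score-ordered truncation, a much cruder technique than LLMLingua-style compression, and it assumes relevance scores already exist. `trim_context` is a hypothetical helper, and the whitespace token count is a stand-in for your model's real tokenizer.

```python
from typing import Callable


def _whitespace_tokens(text: str) -> int:
    # Stand-in token counter; swap in the real tokenizer for production use.
    return len(text.split())


def trim_context(chunks: list[tuple[float, str]], max_tokens: int,
                 count_tokens: Callable[[str], int] = _whitespace_tokens) -> list[str]:
    """Keep the highest-scoring chunks that fit inside max_tokens.

    chunks: (relevance_score, text) pairs from the retriever or reranker.
    Chunks that would overflow the budget are skipped, so a smaller,
    lower-scoring chunk can still fill the remaining space.
    """
    kept: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue
        kept.append(text)
        used += cost
    return kept
```

Even a blunt budget like this makes prompt length an explicit, reviewable number instead of whatever the retriever happened to return.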
The Discipline Here
Treat prompt length as an infrastructure variable, not just a product decision. Every token you add to the context window has a compute price. Profile it, trace the breakdown, and route around the cost where you can. The machine does not forgive inattention to its input.