Tensor Parallelism Under Pressure: What Breaks When You Scale Width
Tensor parallelism is one of those techniques that looks elegant on a whiteboard and turns feral in production. Split the weight matrices across ranks, issue your all-reduce after each layer, collect the result. Simple. Until you're staring at a p99 latency spike that only appears when tensor parallel degree (TP) hits 8, and no single GPU is above 60% utilization.
Photo by Brett Sayles on Pexels.
This post is about what actually breaks when you widen your TP degree, and how to find it.
The All-Reduce is Your New Hot Path
At TP=2 or TP=4, the all-reduce operations at the end of each attention and MLP block are cheap enough to hide. Push to TP=8 across two nodes and the NVLink-to-PCIe boundary comes into view fast. At TP=8 on an A100 SXM cluster with NVLink within nodes, a single all-reduce on a 4096-token sequence can run 80-120 µs. Do the math across 32 transformer layers with two all-reduces each: you are now spending roughly 5-8 ms per forward pass purely on synchronization. That's before any compute.
Profile this with nsys profile --trace=cuda,nvtx and look for ncclAllReduce spans in the CUDA timeline. If they are back-to-back with no compute overlap, your all-reduces are not pipelined and you're paying full serialization cost. Megatron-LM and vLLM both have flags to enable async tensor parallel (--tp-comm-overlap in Megatron, experimental in vLLM as of v0.4). Enable them; the overlap is real but it demands your CUDA graphs are stable.
Numerical Drift at High TP Degree
Float16 reduction order is not associative. This is a known property that most people know about and few people actually track. At TP=2, the difference between the canonical single-GPU output and the tensor-parallel output is usually below 1e-3 in absolute terms. At TP=8, partial sums accumulate more rounding error per rank before the final all-reduce, and the gap widens.
For generation tasks this is usually tolerable. For scoring pipelines, embeddings, or any use case where you compare logits across requests, the drift starts to corrupt results in subtle ways. Validate by running the same prompt through TP=1 and TP=8 paths and comparing raw logit tensors with torch.allclose(atol=1e-2, rtol=1e-2). If you're failing that tolerance, consider running the final scoring head at TP=1 as a gather step, or switch the accumulation dtype to bfloat16, which handles large-magnitude sums better than float16 in practice.
Load Imbalance Across Ranks
Even with perfectly sharded weights, attention heads rarely divide evenly. A model with 40 attention heads at TP=8 gives you 5 heads per rank. A model with 36 heads gives you 4.5, which means some ranks carry 5 and some carry 4. The slower ranks gate every all-reduce. You can see this in dcgmi dmon as a persistent 10-15% utilization delta across ranks in the same TP group.
The fix is not always to change TP degree. Sometimes you pad the head count at model load (vLLM does this automatically for several architectures). Other times you accept the imbalance and compensate with a tighter timeout on the all-reduce collective to catch hangs early rather than stall the whole request batch.
graph TD
A[Request Arrives] --> B{TP Degree?}
B -->|TP=1| C[Single GPU Forward Pass]
B -->|TP>1| D[Shard QKV + MLP Weights Across Ranks]
D --> E[Per-Rank Partial Compute]
E --> F[All-Reduce Collective]
F --> G{Numerical Tolerance Check}
G -->|Pass| H[Output Logits]
G -->|Fail| I[Fallback to TP=1 Scoring]
When to Stop Scaling TP Width
The honest answer: TP degree should stop at the NVLink domain boundary unless you have a very good reason to cross it. A single DGX A100 node gives you 8 GPUs with full NVLink; TP=8 within that node is efficient. TP=16 across two nodes sends your all-reduces over InfiniBand, and even HDR (200 Gb/s) InfiniBand cannot match NVLink bandwidth for small collectives. At that point, pipeline parallelism or a second replica behind a load balancer usually gives better throughput per dollar.
Choose TP degree by profiling, not by instinct. Measure ncclAllReduce wall time at each candidate degree. Measure GPU SM utilization per rank. Check your head-count divisibility before committing to a serving config. The model doesn't care how elegant your parallelism strategy looks on paper.
Scale with data first. Scale with replicas second. Reach for tensor parallelism when model weight memory genuinely demands it, and then watch the collectives carefully.
Get Omnissiah Systems in your inbox
New posts delivered directly. No spam.
No spam. Unsubscribe anytime.