Pipeline Parallelism Stalls: Finding the Bubble and Paying It Down
Pipeline parallelism splits a model's layers across multiple devices and passes activations forward stage by stage. On paper, once the pipeline fills, every device stays busy. In practice, the bubble eats your throughput and the telemetry rarely tells you where.
The bubble is the idle time each device spends waiting while the pipeline fills at the start of a batch and drains at the end. For a pipeline with p stages and m microbatches per batch, the theoretical bubble fraction is (p-1)/(m+p-1). Run four stages with four microbatches and roughly 43% of your forward-pass time is wasted at the edges. Double the microbatch count to eight and the bubble drops to about 27%, a cut of roughly a third rather than a half, because the returns diminish as m grows. The math is not subtle. What is subtle is noticing this in a real system, because your GPU utilization dashboards will still read 60-80% and nothing will look obviously broken.
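To make the arithmetic concrete, here is a minimal sketch of the formula above. The function name `bubble_fraction` is ours, chosen for illustration; it is not part of any library.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Theoretical pipeline bubble fraction: (p - 1) / (m + p - 1)."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be positive integers")
    return (stages - 1) / (microbatches + stages - 1)


if __name__ == "__main__":
    # Four stages, increasing microbatch counts.
    for m in (4, 8, 16, 32):
        print(f"p=4, m={m:>2}: bubble = {bubble_fraction(4, m):.1%}")
```

For four stages this gives 42.9%, 27.3%, 15.8%, and 8.6% at m = 4, 8, 16, and 32. Each doubling buys less than the one before it.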
Where the Signal Hides
GPU utilization is a blunt instrument here. A device busy-waiting on an NCCL receive still shows compute active in nvidia-smi. Profile at the right layer. Use torch.profiler with record_shapes=True and export a Chrome trace; look for long gaps in the CUDA kernel timeline that correlate with cross-device transfers. In a healthy pipeline, those gaps are short and uniform. In a bubbled one, you will see jagged stalls at the first and last stages that do not appear in the middle stages.
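Scanning a trace by eye gets tedious across many captures, so the gap hunt can be scripted. The sketch below uses only the standard library and makes several assumptions you should check against your own traces: the file is Chrome trace JSON with a top-level `traceEvents` list, GPU kernels are complete events (`"ph": "X"`) whose category contains `kernel`, timestamps are in microseconds, and the file holds a single rank. The helper names `load_events` and `kernel_gaps` are hypothetical. NCCL kernels are skipped by name so that time spent waiting inside a receive counts as idle rather than busy.

```python
import json


def load_events(path):
    """Load events from a Chrome trace file (assumed layout: {"traceEvents": [...]})."""
    with open(path) as f:
        trace = json.load(f)
    return trace["traceEvents"] if isinstance(trace, dict) else trace


def kernel_gaps(events, min_gap_us=1000.0, ignore_substrings=("nccl",)):
    """Return (gap_start_us, gap_length_us) pairs for idle gaps between kernels.

    Kernels whose name contains any of ignore_substrings (case-insensitive)
    are excluded, so communication waits show up as gaps.
    """
    spans = []
    for e in events:
        if e.get("ph") != "X" or "kernel" not in str(e.get("cat", "")).lower():
            continue
        name = str(e.get("name", "")).lower()
        if any(s in name for s in ignore_substrings):
            continue
        start = float(e["ts"])
        spans.append((start, start + float(e.get("dur", 0))))
    spans.sort()

    gaps = []
    busy_until = None
    for start, end in spans:
        if busy_until is not None and start - busy_until >= min_gap_us:
            gaps.append((busy_until, start - busy_until))
        # Kernels on different streams may overlap, so track the latest end time.
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps
```

Run it once per stage, for example `kernel_gaps(load_events("stage0_trace.json"))` with whatever file name you exported, and compare the gap lengths across stages.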
If you are on a managed cluster, check your inter-stage P2P bandwidth with nccl-tests point-to-point benchmarks before you assume the bubble is your fault. A degraded NVLink lane or a congested InfiniBand port will artificially inflate what looks like bubble time because the activation transfer itself is slower, pushing the next stage's start further out.
graph TD
A[/Microbatch enters Stage 1/] --> B[Stage 1 forward]
B --> C{Transfer activation}
C --> D[Stage 2 forward]
D --> E{Transfer activation}
E --> F[Stage 3 forward]
F --> G((Output))
C --> H[Stage 1 idle: bubble]
Three Levers Worth Pulling
Increase microbatch count. This is the first adjustment to try and the cheapest to test. The bubble fraction shrinks as m grows, but memory pressure on each stage grows with it because you are holding more intermediate activations in flight. Profile peak device memory before committing. With FP16 activations for a 7B-parameter model sharded across four stages, each additional microbatch in flight can cost 800MB to 1.2GB per stage depending on sequence length. Know your headroom before you increase m.
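The trade-off between bubble and memory can be tabulated before touching the cluster. This is a rough sketch under stated assumptions: every microbatch stays in flight at once (a GPipe-style schedule), the per-microbatch cost is a number you measured yourself, and `sweep_microbatches` is a hypothetical helper name. The 1.2 GB figure in the usage is the pessimistic end of the range quoted above, not a measurement of your model.

```python
def sweep_microbatches(stages, headroom_gb, cost_gb_per_microbatch, candidates):
    """For each candidate m, report bubble fraction and whether memory fits.

    Assumes all m microbatches are in flight simultaneously on a stage.
    Returns a list of (m, bubble_fraction, extra_memory_gb, fits) tuples.
    """
    rows = []
    for m in candidates:
        bubble = (stages - 1) / (m + stages - 1)
        extra_gb = m * cost_gb_per_microbatch
        rows.append((m, bubble, extra_gb, extra_gb <= headroom_gb))
    return rows


if __name__ == "__main__":
    # Illustrative inputs: 4 stages, 10 GB of free memory per stage.
    for m, bubble, extra_gb, fits in sweep_microbatches(4, 10.0, 1.2, [4, 8, 16]):
        status = "fits" if fits else "exceeds headroom"
        print(f"m={m:>2}  bubble={bubble:.1%}  activations={extra_gb:.1f} GB  {status}")
```

With these inputs, m = 8 fits and m = 16 does not, so the practical floor on the bubble is set by memory, not by the formula.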
Overlap communication with compute. PyTorch's PipelineStage (available via torch.distributed.pipelining since PyTorch 2.4; earlier versions relied on the separate PiPPy package) supports scheduling the activation send asynchronously while the stage begins computing the next microbatch's layers. This does not eliminate the bubble but reduces its wall-clock cost. The catch: async sends compete for PCIe or NVLink bandwidth with the forward kernels running in parallel, so you need to measure, not assume, that overlap helps. On bandwidth-constrained nodes it can hurt.
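One way to keep yourself honest about whether overlap helps is to feed four measured timings into a simple comparison. The sketch below is a toy cost model, not a description of how any scheduler works: it assumes a serial step costs compute plus transfer, and an overlapped step costs whichever of the two finishes last. The function name and all numbers in the usage are illustrative.

```python
def overlap_speedup(compute_ms, transfer_ms, compute_overlapped_ms, transfer_overlapped_ms):
    """Ratio of serial step time to overlapped step time (>1 means overlap helps).

    compute_ms / transfer_ms: timings measured with the send done synchronously.
    *_overlapped_ms: the same two timings measured while they run concurrently,
    which captures any slowdown from contention.
    """
    serial = compute_ms + transfer_ms
    overlapped = max(compute_overlapped_ms, transfer_overlapped_ms)
    return serial / overlapped


if __name__ == "__main__":
    # Illustrative numbers only: mild contention, overlap wins.
    print(f"{overlap_speedup(10.0, 4.0, 11.0, 6.0):.2f}x")
    # Heavy contention on a cheap transfer: overlap loses.
    print(f"{overlap_speedup(10.0, 1.0, 12.0, 3.0):.2f}x")
```

The second case is the bandwidth-constrained scenario: the transfer was cheap to begin with, so the contention penalty on compute outweighs the time saved.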
Rebalance stage depth. Uneven layer counts per stage mean some devices finish faster and stall longer. Profiling stage latency per microbatch often reveals that the first stage (embedding lookup plus early transformer blocks) finishes in 80% of the time that the last stage (final norm plus head projection) takes. In that case, move a transformer block or two off the last stage onto the second-to-last, then shift the earlier boundaries back so the slack on the first stage is actually used. The rebalancing calculation is tedious but the implementation is a one-line change to your stage split indices.
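The tedious part can be automated once you have per-layer latencies from the profiler. Below is a minimal sketch using the classic linear-partition dynamic program: it splits layers into contiguous stages so that the slowest stage is as fast as possible. It assumes latencies are additive and ignores transfer cost and memory limits; `balance_stages` is a hypothetical helper, and the latencies in the usage are made up.

```python
from itertools import accumulate


def balance_stages(layer_ms, stages):
    """Split layers into contiguous stages, minimizing the slowest stage's latency.

    Returns a list of (start, end) layer index pairs, end exclusive.
    """
    n = len(layer_ms)
    if not 1 <= stages <= n:
        raise ValueError("need 1 <= stages <= number of layers")
    prefix = [0.0, *accumulate(layer_ms)]
    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck using k stages for the first i layers.
    best = [[inf] * (n + 1) for _ in range(stages + 1)]
    cut = [[0] * (n + 1) for _ in range(stages + 1)]
    best[0][0] = 0.0
    for k in range(1, stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # stage k covers layers j..i-1
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j
    # Walk the recorded cut points backwards to recover the stage boundaries.
    bounds, i = [], n
    for k in range(stages, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    return bounds[::-1]


if __name__ == "__main__":
    # Made-up latencies: seven uniform blocks plus a heavier final block.
    print(balance_stages([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0], stages=4))
```

The returned index pairs are the stage split indices; compare them against your current split to see which boundaries need to move.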
The Measurement Loop
Before and after any change, record three numbers: bubble fraction estimated from profiler traces, end-to-end batch latency at your target batch size, and peak device memory per stage. A change that improves bubble fraction but increases peak memory past your headroom is not a win; you will either OOM under load or be forced to reduce batch size, trading one problem for another.
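The three numbers fit naturally into a small record that you can diff between runs. This is a sketch of one possible acceptance rule, not a standard: it assumes a change counts as a win only if every stage stays within its memory limit and end-to-end latency drops. The class and function names are ours.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Measurement:
    bubble_fraction: float               # estimated from profiler traces
    batch_latency_ms: float              # end-to-end, at the target batch size
    peak_memory_gb: Tuple[float, ...]    # one entry per stage


def is_win(before: Measurement, after: Measurement, memory_limit_gb: float) -> bool:
    """Accept a change only if memory stays in budget and latency improves."""
    if max(after.peak_memory_gb) > memory_limit_gb:
        return False  # a better bubble does not matter if a stage will OOM
    return after.batch_latency_ms < before.batch_latency_ms


if __name__ == "__main__":
    # Illustrative values only.
    before = Measurement(0.43, 120.0, (30.0, 30.0, 30.0, 30.0))
    after = Measurement(0.27, 100.0, (36.0, 36.0, 36.0, 41.0))
    print(is_win(before, after, memory_limit_gb=40.0))
```

In this example the bubble and the latency both improve, yet the change is rejected because the last stage exceeds its 40 GB budget.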
Schedule regular profiler captures as part of your inference deployment process, not just at launch. Activation shapes change when context length distributions shift in production traffic. A pipeline tuned for 512-token sequences develops a different bubble profile at 2048 tokens because the activation transfer volume quadruples and transfer latency starts dominating the gap.
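The scaling is easy to check with back-of-the-envelope arithmetic. The sketch below assumes the tensor handed between stages is a single hidden-state activation of shape (microbatch, sequence, hidden) in FP16, and that transfer time is simply bytes divided by sustained bandwidth. The hidden size of 4096 and the 25 GB/s bandwidth are placeholder assumptions; substitute your model's width and the figure you measured with nccl-tests.

```python
def activation_bytes(microbatch_size, seq_len, hidden_size, bytes_per_element=2):
    """Size of one inter-stage activation tensor (FP16 by default)."""
    return microbatch_size * seq_len * hidden_size * bytes_per_element


def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    hidden = 4096      # assumption: typical width for a 7B-class model
    bandwidth = 25.0   # assumption: placeholder GB/s, use your measured value
    for seq_len in (512, 2048):
        size = activation_bytes(1, seq_len, hidden)
        print(f"seq_len={seq_len}: {size / 2**20:.0f} MiB, "
              f"{transfer_ms(size, bandwidth):.3f} ms per microbatch")
```

At these settings the tensor grows from 4 MiB to 16 MiB per microbatch, which is the fourfold jump described above.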
The bubble is not an anomaly. It is the cost of the serialization you chose when you ran out of memory on a single device. Pay attention to what you are actually paying.