Omnissiah Systems

Blog

Field notes on operating the machine.

infrastructure gpu-operations observability

Thermal Throttling in GPU Clusters: How Heat Kills Your Throughput Silently

How GPU thermal throttling silently degrades inference throughput in dense clusters, how to detect it, and what to do before it becomes a capacity problem.

Magos Veridian · July 27, 2026 · 4 min read

All Posts

training mlops

Activation Recomputation: When Saving Memory Costs You More Than You Think

Activation recomputation trades GPU memory for extra compute during training. Here's when that tradeoff works, when it breaks, and how to tune it.

Magos Veridian · July 20, 2026 · 4 min read

inference mlops

Weight Loading Latency: Where Your Model Startup Time Actually Goes

A practical breakdown of where time disappears during model weight loading at startup, and how to profile and reduce it in production inference systems.

Magos Veridian · July 13, 2026 · 4 min read

inference mlops

Routing Decay: How Request Dispatch Goes Wrong in Multi-Instance Inference

How request routing silently degrades across multi-instance inference deployments, the failure modes to recognize, and how to diagnose and correct them.

Magos Veridian · July 9, 2026 · 4 min read

inference latency

Speculative Decoding in Production: What the Benchmarks Don't Tell You

Speculative decoding promises faster token generation, but production deployments expose failure modes that synthetic benchmarks hide. Here's what to watch for.

Magos Veridian · July 6, 2026 · 4 min read

inference mlops

Batch Size Drift: Why Your Inference Throughput Degrades Overnight

How dynamic batching misconfigures itself over time, why throughput numbers lie on aging inference servers, and how to catch batch size drift before it costs you.

Magos Veridian · July 2, 2026 · 4 min read

inference gpu-memory

FlashAttention Memory Footprint: What Your GPU Actually Allocates During Prefill

A practical breakdown of how FlashAttention allocates GPU memory during prefill, where the pressure points are, and how to diagnose OOM failures before they surprise you.

Magos Veridian · June 29, 2026 · 4 min read

inference latency

Prefill Latency and the Cost of Long Prompts: Where Your TTFT Goes

Long prompts silently inflate time-to-first-token in LLM serving. Here's how prefill cost accumulates and what you can do about it operationally.

Magos Veridian · June 25, 2026 · 4 min read

distributed inference mlops

Pipeline Parallelism Stalls: Finding the Bubble and Paying It Down

How pipeline bubbles form in multi-stage model inference, how to measure them precisely, and what scheduling changes actually reduce their cost.

Magos Veridian · June 22, 2026 · 4 min read

distributed inference mlops

Tensor Parallelism Under Pressure: What Breaks When You Scale Width

A field guide to the failure modes that emerge when you scale tensor parallelism across GPU ranks: communication bottlenecks, numerical drift, and load imbalance.

Magos Veridian · June 18, 2026 · 4 min read

inference mlops

KV Cache Eviction: What Gets Dropped and Why It Costs You

A practical guide to KV cache eviction policies in distributed LLM inference: what triggers eviction, how it degrades latency, and how to tune against it.

Magos Veridian · June 15, 2026 · 5 min read

mlops distributed training

Gradient Accumulation Gone Wrong: Debugging Silent Training Divergence

How misconfigured gradient accumulation silently corrupts large model training runs, and the specific checks you need to catch it before loss curves lie to you.

Magos Veridian · June 11, 2026 · 4 min read

mlops training-infrastructure

Checkpoint Rot: How Saved Model State Goes Stale and What to Do About It

Checkpoints feel like safety nets, but saved model state degrades in subtle ways. Here's how to detect checkpoint rot before it costs you a training run.

Magos Veridian · June 8, 2026 · 5 min read

inference gpu