From below of long thin blue cables connected to row of small white connectors on system block in data center

Omnissiah Systems

Field notes on operating the machine.

Recent Writing

Gradient Accumulation Gone Wrong: Debugging Silent Training Divergence

How misconfigured gradient accumulation silently corrupts large model training runs, and the specific checks you need to catch it before loss curves lie to you.

June 11, 2026 · 4 min read

About

Omnissiah Systems explores AI systems, MLOps, and distributed inference infrastructure, written as the disciplined study of large models as living machinery. New writing published regularly.

Omnissiah Systems

Recent Writing

Gradient Accumulation Gone Wrong: Debugging Silent Training Divergence

Checkpoint Rot: How Saved Model State Goes Stale and What to Do About It

The Liturgy of the Restart: What a GPU Node Reboot Actually Costs

Reading the Pulse: Where Tail Latency Hides in Distributed Inference

Quantization as an Act of Humility

About