From below of long thin blue cables connected to row of small white connectors on system block in data center

Omnissiah Systems

Field notes on operating the machine.

Recent Writing

MLOpsdistributed trainingdebugging

Gradient Accumulation Gone Wrong: Debugging Silent Training Divergence

How misconfigured gradient accumulation silently corrupts large model training runs, and the specific checks you need to catch it before loss curves lie to you.

· 4 min read

About

Omnissiah Systems explores AI systems, MLOps, and distributed inference infrastructure, written as the disciplined study of large models as living machinery. New writing published regularly.