OMA system preserves Kubernetes crash evidence before rotation
Lost diagnostic context in crash loops? OMA captures it in under 1 ms
Kubernetes clusters generate rich operational events during pod lifecycle transitions, but the native event retention model discards the most diagnostically valuable context. The LastTerminationState field, which records a container's most recent failure, is overwritten shortly after a pod restart, a phenomenon the paper defines as the "evidence horizon." During high-frequency crash loops, this horizon may be crossed several times before anyone inspects the pod, and the evidence from each earlier failure is permanently lost. The Operational Memory Architecture (OMA) addresses this by treating evidence retention and causal reconstruction as explicit architectural requirements.
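To make the evidence horizon concrete, the following Go sketch copies each termination record out of a status update before a later restart can overwrite it. It is an illustration only, not the paper's implementation: `Termination`, `ContainerStatus`, `EvidenceKey`, and `EvidenceStore` are hypothetical, simplified stand-ins rather than the real Kubernetes client types, and keying records by restart count is an assumption made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// Termination is a simplified stand-in for the terminated state that
// Kubernetes reports for a container (not the real client-go type).
type Termination struct {
	Reason     string
	ExitCode   int32
	FinishedAt time.Time
}

// ContainerStatus is a hypothetical, reduced view of a container status.
type ContainerStatus struct {
	Name            string
	RestartCount    int32
	LastTermination *Termination // nil until the container has failed once
}

// EvidenceKey identifies one termination of one container.
type EvidenceKey struct {
	Pod          string
	Container    string
	RestartCount int32
}

// EvidenceStore keeps every termination it has observed, so a later
// restart cannot overwrite an earlier record.
type EvidenceStore struct {
	records map[EvidenceKey]Termination
}

func NewEvidenceStore() *EvidenceStore {
	return &EvidenceStore{records: make(map[EvidenceKey]Termination)}
}

// Observe copies the last termination out of a status update.
// It returns true when the update carried evidence not seen before.
func (s *EvidenceStore) Observe(pod string, st ContainerStatus) bool {
	if st.LastTermination == nil {
		return false
	}
	key := EvidenceKey{Pod: pod, Container: st.Name, RestartCount: st.RestartCount}
	if _, seen := s.records[key]; seen {
		return false
	}
	s.records[key] = *st.LastTermination // store a copy, not the pointer
	return true
}

// Len reports how many distinct terminations have been preserved.
func (s *EvidenceStore) Len() int { return len(s.records) }

func main() {
	store := NewEvidenceStore()
	now := time.Now()
	second := &Termination{Reason: "Error", ExitCode: 1, FinishedAt: now.Add(10 * time.Second)}
	updates := []ContainerStatus{
		{Name: "app", RestartCount: 1, LastTermination: &Termination{Reason: "OOMKilled", ExitCode: 137, FinishedAt: now}},
		{Name: "app", RestartCount: 2, LastTermination: second},
		{Name: "app", RestartCount: 2, LastTermination: second}, // duplicate delivery
	}
	for _, u := range updates {
		if store.Observe("web-0", u) {
			fmt.Printf("preserved: restart=%d reason=%s\n", u.RestartCount, u.LastTermination.Reason)
		}
	}
	fmt.Println("records kept:", store.Len())
}
```

A store like this only preserves what it is shown: if two restarts happen between observed updates, the intermediate termination is still lost, which is why capture has to run continuously rather than at inspection time.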
OMA links operational events into causal chains using three patterns: OOMKill chain, ConfigMap variable misconfiguration, and ConfigMap volume mount propagation. The implementation is a Go-based Kubernetes watcher with an SQLite operational memory store and a simple query interface. Empirical evaluation on Minikube and Azure Kubernetes Service (AKS) includes a 30-run latency analysis and stress tests with up to 20 crash-looping pods. Results show that causal edges are built with a mean latency below 1 ms and that the collector processes about 2.8 events/sec while using under 10 MB of memory, indicating minimal overhead and effective evidence preservation.
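The idea of building causal edges as events arrive can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not OMA's actual code: the event kinds (`OOMKilled`, `ContainerRestarted`, `BackOff`), the rule table, the time window, and the `Memory` type are all hypothetical, and plain in-memory slices stand in for the SQLite store so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical operational event captured by a watcher.
type Event struct {
	ID   int
	Pod  string
	Kind string
	At   time.Time
}

// Edge links a cause event to the effect event it is assumed to explain.
type Edge struct {
	CauseID  int
	EffectID int
	Pattern  string
}

// rule: an event of kind Effect may be caused by an earlier event of
// kind Cause on the same pod. The kinds below are illustrative guesses.
type rule struct{ Cause, Effect, Pattern string }

var oomKillRules = []rule{
	{"OOMKilled", "ContainerRestarted", "oomkill-chain"},
	{"ContainerRestarted", "BackOff", "oomkill-chain"},
}

// Memory is an in-memory stand-in for the operational memory store.
// Events are assumed to be appended in time order.
type Memory struct {
	events []Event
	edges  []Edge
	window time.Duration // how far back a cause may lie
}

// Append stores the event and returns any causal edges it completes.
func (m *Memory) Append(e Event) []Edge {
	var added []Edge
	for _, r := range oomKillRules {
		if r.Effect != e.Kind {
			continue
		}
		// Scan backwards for the most recent matching cause on the same pod.
		for i := len(m.events) - 1; i >= 0; i-- {
			c := m.events[i]
			if e.At.Sub(c.At) > m.window {
				break // older events fall outside the window
			}
			if c.Pod == e.Pod && c.Kind == r.Cause {
				added = append(added, Edge{CauseID: c.ID, EffectID: e.ID, Pattern: r.Pattern})
				break
			}
		}
	}
	m.events = append(m.events, e)
	m.edges = append(m.edges, added...)
	return added
}

// Chain walks edges from an event back to its root cause and returns
// the event IDs in effect-to-cause order.
func (m *Memory) Chain(id int) []int {
	chain := []int{id}
	cur := id
	for {
		found := false
		for _, ed := range m.edges {
			if ed.EffectID == cur {
				cur = ed.CauseID
				chain = append(chain, cur)
				found = true
				break
			}
		}
		if !found {
			return chain
		}
	}
}

func main() {
	t0 := time.Now()
	m := &Memory{window: 30 * time.Second}
	m.Append(Event{ID: 1, Pod: "web-0", Kind: "OOMKilled", At: t0})
	m.Append(Event{ID: 2, Pod: "web-0", Kind: "ContainerRestarted", At: t0.Add(2 * time.Second)})
	m.Append(Event{ID: 3, Pod: "web-0", Kind: "BackOff", At: t0.Add(5 * time.Second)})
	fmt.Println(m.Chain(3)) // prints [3 2 1]
}
```

Matching edges at append time, rather than at query time, is one way to keep per-edge cost small, since each new event only scans a short recent window on the same pod.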
- OMA captures causal failure chains (OOMKill, ConfigMap variable misconfiguration, ConfigMap volume mount propagation) before Kubernetes rotates event data
- Built with Go watcher, SQLite store, and simple query interface; open-source code available
- Builds causal edges with mean latency <1 ms (30-run analysis) and processes ~2.8 events/sec in <10 MB of memory, evaluated on Minikube and AKS
Why It Matters
Enables SREs to debug Kubernetes crash loops with full causal context, preventing data loss across pod restarts.