Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
New spec language catches AMD vs. NVIDIA gradient mismatches and silent precision bugs.
Cooper Veit's paper on Kernel Contracts tackles a growing problem in ML: when kernels produce different results across hardware platforms (e.g., AMD vs. NVIDIA), there's no formal way to arbitrate the dispute. The proposed specification language defines a contract with eight components: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. Veit establishes twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. A key requirement is three-state calibration: every contract must be demonstrated against at least one conforming implementation and at least one violating implementation that nonetheless passes basic functional tests, showing that the contract catches defects functional testing alone would miss.
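The eight-part contract could be modeled as a simple record type with a conformance check against the reference oracle. This is a minimal sketch under assumed names and types (the field names mirror the components listed above, but the class, helper lambdas, and the toy summation example are illustrative, not the paper's actual schema):

```python
import math
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an eight-part kernel contract.
# Field names follow the components listed in the summary;
# the types and check logic are illustrative assumptions.
@dataclass
class KernelContract:
    identifier: str                      # unique contract ID
    scope: str                           # which kernels/platforms it governs
    precondition: Callable[..., bool]    # must hold on inputs for the contract to apply
    postcondition: Callable[..., bool]   # must hold on the kernel's output
    tolerance: float                     # allowed deviation from the oracle
    reference_oracle: Callable           # trusted reference implementation
    measurement_protocol: str            # how deviation is measured
    violation_signature: str             # observable symptom when violated

    def check(self, kernel: Callable, *inputs) -> bool:
        """Return True if `kernel` conforms to this contract on `inputs`."""
        if not self.precondition(*inputs):
            return True  # contract holds vacuously outside its precondition
        out = kernel(*inputs)
        ref = self.reference_oracle(*inputs)
        return self.postcondition(out) and abs(out - ref) <= self.tolerance

# Toy example: a summation kernel checked against a float64 oracle.
contract = KernelContract(
    identifier="SUM-PRECISION-001",
    scope="reduction kernels",
    precondition=lambda xs: len(xs) > 0,
    postcondition=lambda y: math.isfinite(y),
    tolerance=1e-6,
    reference_oracle=lambda xs: sum(xs),
    measurement_protocol="absolute error vs. float64 summation",
    violation_signature="accumulated rounding error exceeds tolerance",
)

assert contract.check(lambda xs: sum(xs), [0.1] * 10)
```

A kernel that returns a non-finite value fails the postcondition, and one whose output drifts past the tolerance fails the oracle comparison, so both the exceptional-value and precision failure modes map onto the same `check`.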
Veit applies the framework to three documented incidents: Huawei Ascend's silent precision coercion, Sakana AI's CUDA Engineer reward hacking, and AMD's out-of-bounds silent acceptance. In each case, the informal diagnosis is mapped to a specific contract violation with a measurable signature. The paper positions kernel contract suites as normative references for conformance grading, similar to how ISASecure grades industrial control systems against IEC 62443. This provides a much-needed formal mechanism for ensuring ML kernel correctness across heterogeneous silicon.
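The calibration idea above, a kernel that looks fine to functional tests but violates a precision contract, can be illustrated with a small sketch. The helper names, tolerance, and stress input here are my own assumptions, not taken from the paper; the float32 accumulator stands in for a silently coerced kernel of the kind in the Huawei Ascend incident:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to float32 precision (illustrative helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(xs):
    """A silently imprecise kernel: accumulates in float32."""
    acc = 0.0
    for x in xs:
        acc = to_f32(acc + x)
    return acc

def f64_sum(xs):
    """Reference oracle: float64 accumulation."""
    return sum(xs)

# State 1 (functional pass): both kernels agree on a benign input,
# so a basic functional test cannot tell them apart.
assert abs(f32_sum([1.0, 2.0, 3.0]) - 6.0) < 1e-6
assert abs(f64_sum([1.0, 2.0, 3.0]) - 6.0) < 1e-6

# State 2 (contract check): on a precision-stressing input -- many small
# values added to one large value -- the float32 kernel loses them all,
# while the float64 oracle does not.
xs = [1e8] + [1e-2] * 10_000
rel_err = abs(f32_sum(xs) - f64_sum(xs)) / abs(f64_sum(xs))
conforming = rel_err <= 1e-9  # assumed relative-error tolerance
print(conforming)  # False: violates the contract despite passing functional tests
```

The point of requiring such a violating-but-functionally-passing implementation is that it proves the contract's measurement protocol actually discriminates, rather than merely restating what a unit test already covers.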
- Kernel Contracts define eight-part specifications: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature.
- Twelve contract classes cover precision, ordering, compiler-induced, and exceptional-value failure modes, all grounded in empirical evidence.
- Applied to three real incidents: Huawei Ascend precision coercion, Sakana AI reward hacking, and AMD out-of-bounds errors, each mapped to a measurable violation.
Why It Matters
Provides a formal framework to ensure ML kernel correctness across diverse hardware, preventing silent errors in production.