Research & Papers

Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon

New spec language catches AMD vs. NVIDIA gradient mismatches and silent precision bugs.

Deep Dive

Cooper Veit's paper on Kernel Contracts tackles a growing problem in ML: when kernels produce different results across hardware platforms (e.g., AMD vs. NVIDIA), there's no formal way to arbitrate the dispute. The proposed specification language defines a contract with eight components: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. Veit establishes twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. A key requirement is three-state calibration: every contract must have at least one conforming implementation and one violating implementation that still passes basic functional tests.
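
To make the eight components concrete, here is a minimal Python sketch of how such a contract might be represented and checked. It is an illustration under stated assumptions, not the paper's actual syntax: the `KernelContract` class, the `conforms` and `is_calibrated` helpers, the summation example, and the tolerance value are all hypothetical. The calibration check covers only the two implementation roles described above (one conforming, one violating implementation that still passes a basic functional test).

```python
import math
import struct
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector], float]


@dataclass(frozen=True)
class KernelContract:
    """Hypothetical container for the eight components named in the summary."""
    identifier: str
    scope: str
    precondition: Callable[[Vector], bool]
    postcondition: Callable[[Vector, float], bool]
    tolerance: float                         # max absolute error vs. the oracle
    reference_oracle: Kernel
    measurement_protocol: Sequence[Vector]   # inputs the kernel is measured on
    violation_signature: str


def conforms(contract: KernelContract, kernel: Kernel) -> bool:
    """True if the kernel meets the contract on every in-scope input."""
    for x in contract.measurement_protocol:
        if not contract.precondition(x):
            continue  # input is outside the contract's stated domain
        y = kernel(x)
        if not contract.postcondition(x, y):
            return False
        if abs(y - contract.reference_oracle(x)) > contract.tolerance:
            return False
    return True


def is_calibrated(contract, conforming, violating, functional_test) -> bool:
    """One implementation conforms; another passes a basic test yet violates."""
    return (conforms(contract, conforming)
            and functional_test(violating)
            and not conforms(contract, violating))


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f64(xs: Vector) -> float:
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def sum_f32(xs: Vector) -> float:
    """Same algorithm, but the accumulator is silently kept in binary32."""
    acc = 0.0
    for x in xs:
        acc = to_f32(acc + to_f32(x))
    return acc


# Illustrative contract for a summation kernel (all values are assumptions).
sum_contract = KernelContract(
    identifier="sum.precision.example",
    scope="1-D summation of finite floats",
    precondition=lambda x: len(x) > 0 and all(math.isfinite(v) for v in x),
    postcondition=lambda x, y: math.isfinite(y),
    tolerance=1e-6,
    reference_oracle=math.fsum,
    measurement_protocol=((1.0, 2.0, 3.0), (1e8, 1.0, -1e8)),
    violation_signature="absolute error above tolerance on cancellation input",
)


def basic_test(kernel: Kernel) -> bool:
    """A functional test too weak to expose the precision problem."""
    return kernel((1.0, 2.0, 3.0)) == 6.0
```

With these definitions, `is_calibrated(sum_contract, sum_f64, sum_f32, basic_test)` holds: the binary32 accumulator passes the simple functional test but loses the small term on the cancellation input, which is the kind of silent failure a contract is meant to expose.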

Veit applies the framework to three documented incidents: Huawei Ascend's silent precision coercion, Sakana AI's CUDA Engineer reward hacking, and AMD's out-of-bounds silent acceptance. In each case, the informal diagnosis is mapped to a specific contract violation with a measurable signature. The paper positions kernel contract suites as normative references for conformance grading, similar to how ISASecure grades industrial control systems against IEC 62443. The goal is to settle cross-vendor kernel disputes against a shared, testable reference rather than through case-by-case debugging.
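
As a rough illustration of what a "measurable signature" could look like for a silent precision coercion, the Python sketch below compares a kernel's worst-case relative error against the unit roundoff of the declared format (2^-24 for binary32) and of a lower one (2^-11 for binary16). This is an assumption about how such a signature might be measured, not the signature the paper defines; the kernels, inputs, and the `classify` labels are hypothetical.

```python
import struct

U_FP32 = 2.0 ** -24  # unit roundoff of IEEE 754 binary32
U_FP16 = 2.0 ** -11  # unit roundoff of IEEE 754 binary16


def round_to(fmt: str, x: float) -> float:
    """Round x to the format named by a struct code ('f' = fp32, 'e' = fp16)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def oracle(x: float) -> float:
    """Reference result, computed in binary64."""
    return x / 3.0


def honest_fp32(x: float) -> float:
    """Kernel that really delivers binary32 precision."""
    return round_to("f", x / 3.0)


def coerced_fp16(x: float) -> float:
    """Kernel that claims binary32 but silently computes in binary16."""
    return round_to("f", round_to("e", x / 3.0))


def max_rel_error(kernel, inputs) -> float:
    return max(abs(kernel(x) - oracle(x)) / abs(oracle(x)) for x in inputs)


def classify(err: float) -> str:
    """Hypothetical signature: which precision does the error look like?"""
    if err <= U_FP32:
        return "conforming"
    if err <= U_FP16:
        return "consistent with fp16 coercion"
    return "unclassified violation"


inputs = [1.0, 2.0, 5.0, 7.0, 10.0]
honest_verdict = classify(max_rel_error(honest_fp32, inputs))
coerced_verdict = classify(max_rel_error(coerced_fp16, inputs))
```

Both kernels return values close enough to pass a loose functional check, but only the first stays within the binary32 error bound; the second lands in the band that binary16 rounding would explain.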

Key Points
  • Kernel Contracts define eight-part specifications: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature.
  • Twelve contract classes cover precision, ordering, compiler-induced, and exceptional-value failure modes, all grounded in empirical evidence.
  • The framework is applied to three real incidents (Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, and AMD out-of-bounds silent acceptance), each mapped to a measurable contract violation.

Why It Matters

Kernel contracts give teams a formal, testable way to check ML kernel correctness across diverse hardware, helping catch silent errors before they reach production.