Research & Papers

Preisach Attention Layer slashes transformer cost to O(n log n)

Replaces softmax with hysteresis, achieving Turing-completeness at O(1) depth...

Deep Dive

The Preisach Attention Layer (PAL), introduced by Piotr Frydrych, is a radical departure from standard transformer architectures. It replaces the softmax attention mechanism with a binary relay operator inspired by the classical Preisach hysteresis model from physics. PAL maintains an internal stack of the input's local extrema, which lets it process sequences at O(1) depth while achieving Turing-completeness by simulating a two-stack pushdown automaton. In contrast, standard hard-attention transformers require O(log n) depth for the same result. The architecture also has a total inference cost of O(n log n), compared with O(n²) for standard attention.
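To make the "binary relay operator" concrete, here is a minimal sketch of the textbook non-ideal relay from the classical Preisach model, plus a weighted sum of such relays. It is not taken from the PAL paper: the names `Relay` and `preisach_output`, the thresholds, the weights, and the initial state of -1 are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Relay:
    """Non-ideal relay: switches up at alpha, down at beta (beta < alpha)."""
    alpha: float
    beta: float
    state: int = -1  # assumed initial state (negative saturation)

    def step(self, x: float) -> int:
        if x >= self.alpha:
            self.state = 1
        elif x <= self.beta:
            self.state = -1
        # Between the two thresholds the relay keeps its previous state,
        # which is what gives the operator its memory (hysteresis).
        return self.state


def preisach_output(relays, weights, xs):
    """Feed xs through every relay; return the weighted sum after each input."""
    outputs = []
    for x in xs:
        outputs.append(sum(w * r.step(x) for r, w in zip(relays, weights)))
    return outputs


# The same input value (0.0) gives different outputs depending on history.
print(preisach_output([Relay(0.5, -0.5)], [1.0], [0.0, 0.6, 0.0, -0.6, 0.0]))
```

The output at input 0.0 depends on whether the relay was last pushed above 0.5 or below -0.5, so the operator's response is determined by past extremes of the input rather than by the current value alone.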

Functionally, PAL and standard transformers are incomparable: PAL computes historical range statistics in O(1) layers, where a transformer would need O(log n) layers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The key property is rate-independence: PAL responds only to the sequence of local extrema and ignores absolute token positions and temporal spacing. This makes PAL particularly efficient for tasks involving long episodic memory, such as time series analysis, continuous control, or any domain where relative ordering matters more than exact positions. The extremum stack acts as a minimal sufficient statistic for all rate-independent functionals, giving PAL a formal analogue of the wiping property in hysteresis theory.
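The extremum stack and the wiping property can be illustrated with the classical "reduced memory sequence" from hysteresis theory. The sketch below is an assumption-laden illustration, not PAL's actual implementation: the function names `push` and `extremum_stack` are hypothetical, inputs are assumed finite, and the stack is seeded with a `-inf` sentinel to model a start from negative saturation.

```python
import math


def push(stack, x):
    """Update the reduced memory sequence in place with a new sample x."""
    if x == stack[-1]:
        return  # a repeated value carries no new information
    rising = x > stack[-1]
    # Monotone continuation: the previous sample was not a turning point,
    # so it is replaced rather than kept.
    if len(stack) >= 2 and (stack[-1] > stack[-2]) == rising:
        stack.pop()
    # Wiping: a new extreme that reaches past an older one erases the
    # inner (min, max) pair it has overtaken.
    while len(stack) >= 2 and (x >= stack[-2] if rising else x <= stack[-2]):
        stack.pop()
        stack.pop()
    stack.append(x)


def extremum_stack(xs):
    """Return the alternating extrema that still matter after reading xs."""
    stack = [-math.inf]  # assumed initial state: negative saturation
    for x in xs:
        push(stack, x)
    return stack


fast = [0, 5, -1, 3, 1, 2, 4]
slow = [0, 0, 2, 5, 5, 3, -1, 1, 3, 3, 2, 1, 2, 4]  # same path, resampled
print(extremum_stack(fast))
print(extremum_stack(slow))
```

Both inputs trace the same path at different speeds and end with the same stack, which is the rate-independence described above. The final rise to 4 also erases the inner pair (3, 1), which is the wiping behavior. Under this sketch's convention, `stack[1]` is the running maximum of the input, one simple example of a historical range statistic that can be read off the stack directly.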

Key Points
  • PAL uses a Preisach hysteresis operator with binary relay, achieving Turing-completeness at O(1) depth vs O(log n) for hard-attention transformers.
  • Inference cost drops from standard attention's O(n²) to O(n log n), an asymptotic speedup of roughly n / log n on long sequences.
  • Excels at historical range statistics and episodic memory tasks where relative ordering matters more than exact position, thanks to rate-independent processing.
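As a rough sense of scale for the cost claim, the ratio between n² and n log n is n / log n. The sketch below evaluates it for a few sequence lengths. The helper name `attention_cost_ratio` is hypothetical, the logarithm base 2 is an assumption, and constant factors are ignored, so the numbers are not predictions of real speedups.

```python
import math


def attention_cost_ratio(n: int) -> float:
    """Ratio n^2 / (n * log2(n)) = n / log2(n), ignoring constant factors."""
    return n / math.log2(n)


for n in (1_024, 65_536, 1_048_576):
    print(n, round(attention_cost_ratio(n)))
```

The ratio keeps growing with n, but more slowly than n itself, so the gap over quadratic attention widens as sequences get longer.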

Why It Matters

PAL could make long-context transformers dramatically more efficient for time series, episodic memory, and continuous control tasks.