[Release] Ouro-2.6B-Thinking — first working inference (ByteDance's recurrent "thinking" model, fixed for transformers 4.55)
Recurrent 'thinking' model runs 48 layers 4 times per token, requiring custom inference fixes.
ByteDance's experimental Ouro-2.6B-Thinking model uses an unusual recurrent Universal Transformer architecture: each token passes through all 48 layers four times in sequence (192 layer applications in total), unlike standard models that traverse the stack once. This recurrence caused existing GGUF quantizations to fail. Developer scpalmetto identified and patched two critical bugs in the original modeling code that made it incompatible with Hugging Face transformers 4.55: a cache inheritance issue that raised an AttributeError, and a missing method for causal mask generation. The fixed model demonstrates explicit chain-of-thought reasoning inside <think> tags before producing its final answer, though its recurrent design forces full context recomputation at every generation step (use_cache=False). Benchmarks show roughly 3.8 tokens/second on an NVIDIA L4 GPU with about 5.3 GB of VRAM in float16.
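For reference, a minimal loading-and-generation sketch under the constraints described above; the repo id, chat-template usage, and message format are assumptions, not details taken from the release:

```python
# Minimal sketch. The repo id below is hypothetical; the model ships its own
# modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "scpalmetto/Ouro-2.6B-Thinking"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,   # ~5.3 GB VRAM reported on an L4 in fp16
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The recurrent forward pass does not work with standard KV caching, so
# generation recomputes the full context every step (use_cache=False).
output = model.generate(inputs, max_new_tokens=512, use_cache=False)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=False)

# The model emits its reasoning inside <think>...</think> before the answer.
print(text)
```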
- Recurrent architecture processes 48 layers 4 times per token (192 passes) vs a standard single pass (see the sketch after this list)
- Required patching a cache inheritance bug and a missing causal-mask method for transformers 4.55 compatibility
- Demonstrates explicit chain-of-thought reasoning with <think> tags before final answer generation
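To make the recurrence concrete, here is a simplified, hypothetical sketch of a Universal-Transformer-style forward pass in which one shared layer stack is applied several times to the same hidden states; it only illustrates the idea and is not ByteDance's actual modeling code:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Toy Universal-Transformer-style block: the same layer stack is
    re-applied num_recurrences times, so effective depth = layers * recurrences."""

    def __init__(self, d_model=2048, n_heads=16, num_layers=48, num_recurrences=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
            )
            for _ in range(num_layers)
        )
        self.num_recurrences = num_recurrences

    def forward(self, hidden_states, attn_mask=None):
        # 4 outer passes x 48 layers = 192 layer applications per token,
        # with the weights shared across the outer passes.
        for _ in range(self.num_recurrences):
            for layer in self.layers:
                hidden_states = layer(hidden_states, src_mask=attn_mask)
        return hidden_states
```

One consequence of this loop is that each layer sees four different hidden states per token, so a single per-layer KV cache has no unambiguous slot to write to, which is consistent with the post's use_cache=False requirement.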
Why It Matters
Shows a working implementation of a recurrent 'thinking' architecture that could enable more deliberate reasoning in smaller models.