[Release] Ouro-2.6B-Thinking — first working inference (ByteDance's recurrent "thinking" model, fixed for transformers 4.55)
Recurrent 'thinking' model runs 48 layers 4 times per token, requiring custom inference fixes.
ByteDance's experimental Ouro-2.6B-Thinking model uses an unusual recurrent Universal Transformer architecture: each token passes through all 48 layers four times in sequence (192 layer applications in total), unlike standard models that traverse the stack once. This recurrence caused existing GGUF quantizations to fail. Developer scpalmetto identified and patched two critical bugs in the original modeling code that made it incompatible with Hugging Face transformers 4.55: a cache inheritance issue that raised an AttributeError, and a missing method for causal mask generation. The fixed model demonstrates explicit chain-of-thought reasoning inside <think> tags before producing its final answer, though its recurrent design forces full context recomputation at every generation step (use_cache=False). Benchmarks show roughly 3.8 tokens/second on an NVIDIA L4 GPU with about 5.3 GB of VRAM in float16.
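For reference, a minimal loading-and-generation sketch under the constraints described above; the repo id, chat-template usage, and message format are assumptions, not details taken from the release:

```python
# Minimal sketch. The repo id below is hypothetical; the model ships its own
# modeling code, hence trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "scpalmetto/Ouro-2.6B-Thinking"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,   # ~5.3 GB VRAM reported on an L4 in fp16
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The recurrent forward pass does not work with standard KV caching, so
# generation recomputes the full context every step (use_cache=False).
output = model.generate(inputs, max_new_tokens=512, use_cache=False)
text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=False)

# The model emits its reasoning inside <think>...</think> before the answer.
print(text)
```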
- Recurrent architecture processes 48 layers 4 times per token (192 passes) vs a standard single pass (see the sketch after this list)
- Required patching a cache inheritance bug and a missing causal-mask method for transformers 4.55 compatibility
- Demonstrates explicit chain-of-thought reasoning with <think> tags before final answer generation
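To make the recurrence concrete, here is a simplified, hypothetical sketch of a Universal-Transformer-style forward pass in which one shared layer stack is applied several times to the same hidden states; it only illustrates the idea and is not ByteDance's actual modeling code:

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Toy Universal-Transformer-style block: the same layer stack is
    re-applied num_recurrences times, so effective depth = layers * recurrences."""

    def __init__(self, d_model=2048, n_heads=16, num_layers=48, num_recurrences=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
            )
            for _ in range(num_layers)
        )
        self.num_recurrences = num_recurrences

    def forward(self, hidden_states, attn_mask=None):
        # 4 outer passes x 48 layers = 192 layer applications per token,
        # with the weights shared across the outer passes.
        for _ in range(self.num_recurrences):
            for layer in self.layers:
                hidden_states = layer(hidden_states, src_mask=attn_mask)
        return hidden_states
```

One consequence of this loop is that each layer sees four different hidden states per token, so a single per-layer KV cache has no unambiguous slot to write to, which is consistent with the post's use_cache=False requirement.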
Why It Matters
Shows a working implementation of a recurrent 'thinking' architecture that could enable more deliberate reasoning in smaller models.