Open Source

Running MiniMax M2.7 at 100k context on Strix Halo

After heavy tweaking, a Reddit user runs 100k context locally on Strix Halo.

Deep Dive

Reddit user /u/Zc5Gwu shares a custom llama-server setup that runs the MiniMax M2.7 GGUF model at 100k context. Key flags: `--no-context-shift` (fail loudly when the context window is exhausted rather than silently shifting out old tokens), `--no-mmap` (per Donato's guides), `--kv-unified` (share one KV cache across two concurrent sessions to save VRAM), and `--cache-ram 0` (keep the prompt cache in VRAM, which resolved the OOM crashes). The machine runs headless Fedora Linux (again per Donato's guides), with recommendations to increase swap size and set `OOMScoreAdjust=500` on the server so that it, rather than critical system processes, is killed first under memory pressure. The model excels at coding "intuition" and understanding intent, but is less well-rounded than Qwen3.6 27B, which the user finds stronger at coding architecture discussions, code review, and non-coding tasks.
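
For concreteness, a launch command reassembled from the flags the post names might look like the sketch below; the model filename, `--n-gpu-layers`, `--parallel`, host, and port values are illustrative assumptions, not the user's exact settings.

```bash
# Flags named in the post:
#   --no-context-shift  error out when the 100k window fills instead of
#                       silently evicting old tokens
#   --no-mmap           load weights into memory rather than memory-mapping
#   --kv-unified        one shared KV cache across both parallel slots
#   --cache-ram 0       no host-RAM prompt cache (it stays in VRAM); per the
#                       post, this is what fixed the OOM crashes
# Model path, --n-gpu-layers, --parallel, host, and port are assumptions.
llama-server \
  --model ~/models/MiniMax-M2.7-Q4_K_M.gguf \
  --ctx-size 100000 \
  --n-gpu-layers 999 \
  --parallel 2 \
  --kv-unified \
  --no-context-shift \
  --no-mmap \
  --cache-ram 0 \
  --host 127.0.0.1 \
  --port 8080
```

The `OOMScoreAdjust=500` recommendation is a standard systemd service directive; a minimal drop-in sketch follows, assuming the server runs as a unit named `llama-server.service` (the unit name is an assumption). The post also recommends enlarging swap, but since the right way to do that depends on the filesystem setup (Fedora defaults to btrfs plus zram), only the OOM-score tweak is sketched here.

```bash
# Raise the server's OOM score so the kernel prefers killing llama-server
# under memory pressure, protecting critical system processes.
# Unit name llama-server.service is assumed, not from the source.
sudo mkdir -p /etc/systemd/system/llama-server.service.d
sudo tee /etc/systemd/system/llama-server.service.d/override.conf <<'EOF'
[Service]
OOMScoreAdjust=500
EOF
sudo systemctl daemon-reload
sudo systemctl restart llama-server.service
```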

Key Points
  • Achieves 100k context with MiniMax M2.7 via a custom llama-server configuration on AMD Strix Halo
  • Key flags: --no-context-shift, --no-mmap, --kv-unified, and --cache-ram 0, the last of which fixed the OOM crashes
  • MiniMax M2.7 outperforms Qwen3.6 27B in coding intuition but lags in code review, architecture discussions, and non-coding tasks

Why It Matters

Enables professionals to run long-context coding models locally on AMD hardware, preserving privacy and reducing cloud costs.