Open Source

Running MiniMax M2.7 at 100k context on Strix Halo

After heavy tweaking, a Reddit user runs 100k context locally on Strix Halo.

Deep Dive

Reddit user /u/Zc5Gwu shares a custom llama-server setup that runs the MiniMax M2.7 GGUF model at 100k context. Key flags: `--no-context-shift` (fail loudly when the context window is exhausted rather than silently shifting out old tokens), `--no-mmap` (per Donato's guides), `--kv-unified` (share one KV cache across two concurrent sessions to save VRAM), and `--cache-ram 0` (keep the prompt cache in VRAM, which resolved the OOM crashes). The machine runs headless Fedora Linux (again per Donato's guides), with recommendations to increase swap size and set `OOMScoreAdjust=500` on the server so that it, rather than critical system processes, is killed first under memory pressure. The model excels at coding "intuition" and understanding intent, but is less well-rounded than Qwen3.6 27B, which the user finds stronger at coding architecture discussions, code review, and non-coding tasks.
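
For concreteness, a launch command reassembled from the flags the post names might look like the sketch below; the model filename, `--n-gpu-layers`, `--parallel`, host, and port values are illustrative assumptions, not the user's exact settings.

```bash
# Flags named in the post:
#   --no-context-shift  error out when the 100k window fills instead of
#                       silently evicting old tokens
#   --no-mmap           load weights into memory rather than memory-mapping
#   --kv-unified        one shared KV cache across both parallel slots
#   --cache-ram 0       no host-RAM prompt cache (it stays in VRAM); per the
#                       post, this is what fixed the OOM crashes
# Model path, --n-gpu-layers, --parallel, host, and port are assumptions.
llama-server \
  --model ~/models/MiniMax-M2.7-Q4_K_M.gguf \
  --ctx-size 100000 \
  --n-gpu-layers 999 \
  --parallel 2 \
  --kv-unified \
  --no-context-shift \
  --no-mmap \
  --cache-ram 0 \
  --host 127.0.0.1 \
  --port 8080
```

The `OOMScoreAdjust=500` recommendation is a standard systemd service directive; a minimal drop-in sketch follows, assuming the server runs as a unit named `llama-server.service` (the unit name is an assumption). The post also recommends enlarging swap, but since the right way to do that depends on the filesystem setup (Fedora defaults to btrfs plus zram), only the OOM-score tweak is sketched here.

```bash
# Raise the server's OOM score so the kernel prefers killing llama-server
# under memory pressure, protecting critical system processes.
# Unit name llama-server.service is assumed, not from the source.
sudo mkdir -p /etc/systemd/system/llama-server.service.d
sudo tee /etc/systemd/system/llama-server.service.d/override.conf <<'EOF'
[Service]
OOMScoreAdjust=500
EOF
sudo systemctl daemon-reload
sudo systemctl restart llama-server.service
```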

Key Points
  • Achieves 100k context with MiniMax M2.7 via a custom llama-server configuration on AMD Strix Halo
  • Key flags: --no-context-shift, --no-mmap, --kv-unified, and --cache-ram 0, the last of which fixed the OOM crashes
  • MiniMax M2.7 outperforms Qwen3.6 27B in coding intuition but lags in code review, architecture discussions, and non-coding tasks

Why It Matters

Enables professionals to run long-context coding models locally on AMD hardware, preserving privacy and reducing cloud costs.