Lawyer builds 12-GPU V100 cluster for local AI legal drafting with MoE models
MoE models achieve 50 tok/s on 122B model across 4 V100s
Get AI news that actually matters
One email a day. Zero fluff. Join 10,000+ professionals.
A lawyer has assembled a formidable local AI cluster for legal document drafting, comprising 12 V100-SXM2 32GB GPUs on a Threadripper Pro workstation plus a second box with an EPYC 7302P, 512GB RAM, 4x RTX 3090s, and 2x V100-PCIe cards. Initially relying on vLLM, the builder quickly discovered that dense models were a trap on Volta architecture: decode speeds for 32B dense models hit only ~20-28 tok/s, well below the 40 tok/s floor needed for practical use. Switching to llama.cpp and MoE (mixture-of-experts) GGUFs transformed performance. Qwen3.5-122B-A10B, a 122B parameter model with only 10B active per token, now runs at ~50 tok/s on a single 4-card board, while Gemma-4-26B-A4B reaches ~113 tok/s. These speeds hold even at long contexts (25k+ tokens), where dense models previously choked.
To maximize throughput, the lawyer abandoned a single-model approach in favor of a sequential orchestrator pipeline that routes legal tasks across multiple models, each pinned to its own GPU board. The workflow for a full affidavit or motion lights up 16 GPUs across both boxes: Qwen3.6-35B-A3B handles workhorse drafting, Qwen3.5-122B-A10B tackles heavy reasoning, a small gate model evaluates claim viability, and an adversarial reviewer attacks the generated draft. Secondary tasks (financial extraction, routing) run on the 3090s via Ollama. This architecture proves that with careful model selection and orchestration, a local V100 cluster can rival cloud-based solutions for complex legal drafting—no FP8 or high-end ampere cards required.
- 12x V100-SXM2 32GB on Threadripper Pro plus second box with 4x RTX 3090 and 2x V100-PCIe
- Switched from vLLM to llama.cpp because MoE GGUFs are not supported on Volta; MoE models achieve 2-3x speed over dense
- Custom orchestrator pipeline routes tasks (gate model, drafting, adversarial review) across 16 GPUs sequentially for full legal document generation
Why It Matters
Proves MoE models on modest V100 hardware can outperform dense models for real-world legal drafting.