Lawyer builds 12-GPU V100 cluster for local AI legal drafting with MoE models

r/LocalLLaMA May 26, 2026

⚡MoE models achieve 50 tok/s on 122B model across 4 V100s

Deep Dive

A lawyer has assembled a local AI cluster for legal document drafting: 12 V100-SXM2 32GB GPUs on a Threadripper Pro workstation, plus a second box with an EPYC 7302P, 512GB RAM, 4x RTX 3090s, and 2x V100-PCIe cards. The builder started on vLLM and quickly found that dense models were a trap on the Volta architecture: 32B dense models decoded at only ~20-28 tok/s, well below the 40 tok/s floor needed for practical use. Switching to llama.cpp and MoE (mixture-of-experts) GGUFs transformed performance. Qwen3.5-122B-A10B, a 122B-parameter model with only 10B parameters active per token, now runs at ~50 tok/s on a single 4-card board, while Gemma-4-26B-A4B reaches ~113 tok/s. These speeds hold even at long contexts (25k+ tokens), where dense models previously choked.
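One rough way to see why the MoE switch helps: single-stream decoding is usually limited by how fast weights can be read from GPU memory, so the parameters touched per token matter more than the total model size. The sketch below is a back-of-envelope estimate, not a measurement from the post. It assumes about 4.5 bits per weight (typical of a 4-bit GGUF quantization; the post does not state which quantization was used), the V100-SXM2's 900 GB/s memory bandwidth, and that only one GPU's bandwidth is in play at any moment. The function and parameter names are illustrative.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float = 4.5,    # assumed quantization
                         bandwidth_gb_s: float = 900.0):  # V100-SXM2 HBM2
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Active parameters per token, taken from the model names in the article.
    for label, active_b in [("32B dense", 32.0),
                            ("Qwen3.5-122B-A10B", 10.0),
                            ("Gemma-4-26B-A4B", 4.0)]:
        print(f"{label}: ceiling ~{decode_ceiling_tok_s(active_b):.0f} tok/s")
```

Under these assumptions the ceiling scales inversely with active parameters, so a 10B-active model has about 3.2 times the headroom of a 32B dense model. Real throughput lands well below the ceiling because of KV-cache reads, expert routing, and inter-GPU transfers, which the estimate ignores.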

To maximize throughput, the lawyer abandoned a single-model approach in favor of a sequential orchestrator pipeline that routes legal tasks across multiple models, each pinned to its own GPU board. The workflow for a full affidavit or motion lights up 16 GPUs across both boxes: Qwen3.6-35B-A3B handles workhorse drafting, Qwen3.5-122B-A10B tackles heavy reasoning, a small gate model evaluates claim viability, and an adversarial reviewer attacks the generated draft. Secondary tasks (financial extraction, routing) run on the 3090s via Ollama. The build suggests that, with careful model selection and orchestration, a local V100 cluster can rival cloud-based solutions for complex legal drafting, with no FP8 support or high-end Ampere cards required.
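A sequential pipeline of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lawyer's actual orchestrator: the stage roles come from the article, but the stage order, the endpoint URLs, the prompt templates, and the rule that a failed gate check stops the run are all hypothetical. The model call is passed in as a function so the sketch runs without any servers; in a real setup it might be an HTTP request to the OpenAI-compatible chat endpoint that llama.cpp's bundled server provides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    endpoint: str  # hypothetical URL of the server pinned to one GPU board
    prompt: str    # template; may use {task} and {previous}


# Hypothetical layout: one server per board, roles taken from the article.
STAGES: List[Stage] = [
    Stage("gate", "http://box1:8001",
          "Answer VIABLE or NOT_VIABLE for this claim:\n{task}"),
    Stage("draft", "http://box1:8002",
          "Draft the document for this matter:\n{task}"),
    Stage("review", "http://box1:8003",
          "Attack this draft and list its weaknesses:\n{previous}"),
]

# (endpoint, prompt) -> model output; injected so it can be stubbed.
ModelCall = Callable[[str, str], str]


def run_pipeline(task: str, stages: List[Stage],
                 call_model: ModelCall) -> Dict[str, str]:
    """Run stages one after another, feeding each output to the next stage."""
    outputs: Dict[str, str] = {}
    previous = ""
    for stage in stages:
        prompt = stage.prompt.format(task=task, previous=previous)
        previous = call_model(stage.endpoint, prompt)
        outputs[stage.name] = previous
        # Assumed behavior: stop early when the gate rejects the claim.
        if stage.name == "gate" and "NOT_VIABLE" in previous:
            break
    return outputs
```

Keeping the stage list as plain data makes it easy to move a role to a different board or swap the model behind an endpoint without touching the control flow.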

Key Points

12x V100-SXM2 32GB on Threadripper Pro plus second box with 4x RTX 3090 and 2x V100-PCIe
Switched from vLLM to llama.cpp because vLLM does not support MoE GGUFs on Volta; MoE models run 2-3x faster than dense models
Custom orchestrator pipeline routes tasks (gate model, drafting, adversarial review) across 16 GPUs sequentially for full legal document generation

Why It Matters

Suggests that MoE models on modest V100 hardware can outperform dense models for real-world legal drafting.
