Does going from 96GB -> 128GB VRAM open up any interesting model options?
A developer combines an RTX Pro 6000 with a 5090 via Thunderbolt, unlocking new AI model possibilities.
A developer has sparked discussion by detailing an unusual hardware setup for local AI development. By attaching an NVIDIA RTX 5090 to their existing RTX Pro 6000 workstation GPU via a Thunderbolt 4 dock, they've expanded total VRAM from 96GB to 128GB. The configuration is bandwidth-limited by the Thunderbolt link, but it lets them explore larger, more capable open-source language models that were previously out of reach locally. Their primary goal is to find models with exceptional coding proficiency that can leverage the expanded memory pool, moving beyond their current daily driver, the GPT-OSS-120B model.
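A quick sanity check for a setup like this is simply confirming that the driver sees both cards and their full memory before pointing an inference engine at them. A minimal sketch using `nvidia-smi` (device names and ordering will depend on how the Thunderbolt-attached card enumerates):

```bash
# List every visible GPU with its total VRAM. In a setup like the one described,
# the Thunderbolt-attached RTX 5090 should appear alongside the RTX Pro 6000,
# and the memory.total column should add up to roughly 128GB across both.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```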
The post doubles as a technical deep dive, revealing a specific bug encountered when using the popular llama.cpp inference engine. The user reported that Qwen 3.5 models produced garbled, random token outputs until they switched from the default `-sm layer` tensor-splitting strategy to `-sm row` or forced single-GPU execution. This has turned the thread into a collaborative troubleshooting session, with other developers weighing in on multi-GPU inference quirks. The community is now actively suggesting model candidates like deeper quantizations of Mixtral 8x22B, potential variants of DeepSeek-Coder, and other 70B+ parameter models that could fully utilize the new 128GB memory ceiling for enhanced code generation and reasoning tasks.
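For context, the switch described above corresponds to llama.cpp's `--split-mode` / `-sm` option. A hedged sketch of what a multi-GPU server launch along those lines might look like; the model path, tensor-split ratio, and context size here are illustrative placeholders, not values taken from the thread:

```bash
# Hypothetical llama-server launch splitting a large GGUF model across both cards.
# "-sm row" splits individual tensors row-wise across GPUs; the post reports this
# fixed the garbled output seen with the default "-sm layer" (per-layer) split.
llama-server \
  -m ./models/some-coding-model-Q4_K_M.gguf \
  -ngl 999 \
  -sm row \
  -ts 96,32 \
  -c 32768
```

The `-ts 96,32` tensor-split simply weights allocation toward the larger 96GB card; whether the poster used such a ratio isn't stated in the thread.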
- A developer created a 128GB VRAM system by pairing an RTX Pro 6000 (96GB) with an RTX 5090 via Thunderbolt 4.
- The setup aims to run larger, more capable coding-focused LLMs than the current GPT-OSS-120B, seeking community model recommendations.
- The post highlights a llama.cpp bug with Qwen 3.5 models on multi-GPU setups, requiring a switch to `-sm row` for correct output.
Why It Matters
The setup showcases the bleeding edge of consumer AI hardware hacking, extending what's possible for local, high-performance coding assistants without cloud dependencies.