Models & Releases

Alibaba Qwen3-235B fine-tuned model beats GPT, Claude in finance tasks

A fine-tuned version of Alibaba’s Qwen3-235B outperformed GPT, Claude, and Gemini variants on a financial document-triage evaluation, reaching 84.7% accuracy at 13.8x lower inference cost.

Deep Dive

Bridgewater Associates’ AIA Labs and Thinking Machines Lab have published internal evaluation results showing that a fine-tuned version of Alibaba’s open-weight Qwen3-235B model outperformed leading commercial AI models, including variants of GPT, Claude, and Gemini, on finance-specific tasks. The tuned model achieved 84.7% accuracy, compared with 78.2% for the strongest frontier model tested, while cutting inference cost per 1,000 tasks by a factor of 13.8.

The evaluation focused on document triage, a task that matters to financial firms but is hard to automate because the correct answer often depends on private workflows rather than public knowledge. Given task descriptions alone, frontier models averaged only about 50% accuracy. Expert-written prompts raised that to the mid-70% range, still below Bridgewater’s 80% trust threshold.

The fine-tuned Qwen model was trained on Thinking Machines Lab’s Tinker platform using LoRA-based adapters. Training incorporated Bridgewater’s proprietary labels, review rules, and expert corrections to encode investor judgment.

While the results highlight the potential of domain-specific fine-tuning, Bridgewater cautions that AI outputs may still contain inaccuracies or vulnerabilities, and it emphasizes careful deployment over blind trust in automated systems.
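To put the reported numbers side by side, here is a minimal arithmetic sketch. It uses only the figures quoted in this article; the function and variable names are illustrative, and the "mid-70%" prompt-engineering result is left out because no exact value is given.

```python
# Figures as reported in the article (percentages and a cost ratio).
TUNED_ACCURACY = 84.7      # fine-tuned Qwen3-235B
FRONTIER_ACCURACY = 78.2   # strongest frontier model tested
TRUST_THRESHOLD = 80.0     # Bridgewater's stated trust threshold
COST_REDUCTION = 13.8      # baseline cost divided by tuned-model cost


def summarize(tuned, frontier, threshold, reduction):
    """Derive a few comparison figures from the reported results."""
    return {
        # Accuracy gap in percentage points (not a relative percentage).
        "gap_points": round(tuned - frontier, 1),
        "tuned_clears_threshold": tuned >= threshold,
        "frontier_clears_threshold": frontier >= threshold,
        # Tuned-model cost expressed as a percentage of the baseline cost.
        "relative_cost_percent": round(100 / reduction, 1),
    }


summary = summarize(TUNED_ACCURACY, FRONTIER_ACCURACY, TRUST_THRESHOLD, COST_REDUCTION)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Read this way, the tuned model leads by 6.5 percentage points and is the only one of the two reported results above the 80% threshold, while a 13.8x reduction means it costs roughly 7% as much per task as the baseline.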

Key Points
  • The fine-tuned Alibaba Qwen3-235B model achieved 84.7% accuracy vs. 78.2% for the best commercial model tested on finance tasks, at 13.8x lower inference cost.
  • The model drew on Bridgewater’s private workflow judgments and expert labels, addressing the limits of public knowledge in financial document triage.
  • Fine-tuning ran on Thinking Machines Lab’s Tinker platform with LoRA adapters, enabling cost-effective customization without exposing sensitive data.
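As background on why LoRA adapters are considered efficient: instead of updating a full weight matrix, LoRA trains two small low-rank matrices whose product is added to the frozen weights. The sketch below only counts parameters for a single layer. The layer dimensions and rank are hypothetical values chosen for illustration; they are not Qwen3-235B’s actual configuration or the settings used in the Bridgewater evaluation.

```python
def full_params(d_in, d_out):
    """Trainable parameters when updating the whole d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter on the same layer.

    LoRA learns A (rank x d_in) and B (d_out x rank); the base weights stay frozen.
    """
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
D_IN = 4096
D_OUT = 4096
RANK = 16

full = full_params(D_IN, D_OUT)
adapter = lora_params(D_IN, D_OUT, RANK)
fraction = adapter / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA adapter:     {adapter:,} trainable parameters")
print(f"adapter share:    {fraction:.2%}")
```

Under these assumed dimensions, the adapter trains under 1% of the parameters that full fine-tuning of the same layer would touch, which is the general reason adapter-based tuning is cheaper to run and store.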

Why It Matters

Domain-specific fine-tuning can deliver superior accuracy and cost efficiency in finance, but deployment requires rigorous validation due to privacy and compliance risks.
