POLARIS-9B: Small AI model writes long stories rivaling much larger ones

Trained on just four A100 GPUs, POLARIS-9B is competitive with Qwen3.5-27B in story quality.

Deep Dive

Small open-weight language models typically struggle with long-form creative writing: they either cut stories short or see quality degrade sharply as length increases. A new paper from UMass Amherst and Google Research introduces POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a low-compute recipe aimed at both failure modes. The team applied it to Qwen3.5-9B, using only 1.4K prompt-story pairs derived from 100 short-story anthologies and training on just 4 A100 GPUs. The recipe has two key components. First, a frontier LLM judge scores each generated story against a structured quality rubric, and those scores serve as online rewards. Second, human-reference injection (HRI) teacher-forces a human-written story into each GRPO group, where it acts as a high-reward anchor. The authors report that the resulting model, POLARIS-9B, matches or exceeds much larger models in story quality and length adherence.
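To make the anchoring idea concrete, here is a minimal sketch of how a human reference could shift group-relative advantages in a GRPO-style update. It is an illustration under stated assumptions, not the paper's implementation: the function names, the reward values, and the choice of population standard deviation for normalization are all hypothetical, and in practice the rewards would come from the LLM judge rather than being hard-coded.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group: (r - mean) / (std + eps).

    Assumption: population standard deviation; real GRPO
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantages_with_reference(sampled_rewards, reference_reward, eps=1e-6):
    """Append the human reference's reward to the group before normalizing.

    Returns (advantages of the sampled stories, advantage of the reference).
    """
    group = list(sampled_rewards) + [reference_reward]
    adv = group_advantages(group, eps)
    return adv[:-1], adv[-1]


# Hypothetical judge scores in [0, 1] for four sampled stories,
# plus a higher score for the human-written reference.
sampled = [0.40, 0.50, 0.45, 0.55]
policy_adv, ref_adv = advantages_with_reference(sampled, reference_reward=0.90)

print("sampled-story advantages:", [round(a, 3) for a in policy_adv])
print("reference advantage:", round(ref_adv, 3))
```

With these made-up scores, the reference raises the group mean above every sampled story, so all sampled stories receive a negative advantage while the reference receives a positive one. That is the sense in which an injected reference can act as a high-reward anchor: the policy is pushed toward the human-written story even when its own samples score similarly to one another.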

Across five benchmarks covering both in-distribution and out-of-distribution prompts, POLARIS-9B is competitive with Qwen3.5-27B (three times its size) while following length instructions far more precisely than base Qwen3.5-9B. A blinded human evaluation found that POLARIS-9B is preferred over the base model and on par with the 27B variant. Remarkably, despite training only on stories of up to 4K words, the model preserves quality when asked to produce stories of up to 12K words, a threefold length generalization that most open-weight models fail at. The authors argue that length generalization is a meaningful stress test for creative writing models and a useful lens for distinguishing models that otherwise perform similarly. The results suggest that, with the right training strategy, small models can punch far above their weight, which lowers the cost of high-quality long-form story generation.
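One simple way to think about length adherence is the relative gap between the requested and the actual word count. The sketch below is a hypothetical metric for illustration only; the summary does not state how the paper measures length adherence, and whitespace splitting is an assumed stand-in for whatever word counting the authors use.

```python
def length_deviation(story: str, target_words: int) -> float:
    """Relative gap between actual and requested word count (0.0 = exact match).

    Hypothetical metric: words are counted by splitting on whitespace.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    actual_words = len(story.split())
    return abs(actual_words - target_words) / target_words


# A 900-word story written for a 1,000-word request is 10% off target.
short_story = "word " * 900
deviation = length_deviation(short_story, target_words=1000)
print(f"deviation: {deviation:.2f}")

# Ratio between the longest evaluated length and the longest training length.
generalization_factor = 12_000 / 4_000
print(f"length generalization factor: {generalization_factor:.0f}x")
```

A metric like this treats overshooting and undershooting symmetrically, which may or may not match the paper's definition.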

Key Points
  • POLARIS combines online rewards from a frontier LLM judge, scored on a structured story-quality rubric, with human-written reference stories injected as high-reward anchors in each GRPO group.
  • Trained on only 1,400 prompt-story pairs using 4 A100 GPUs, yet rated on par with Qwen3.5-27B (3x its size) and preferred over base Qwen3.5-9B in blinded human evaluation.
  • Trained on stories of up to 4K words, it preserves quality on requests of up to 12K words, a threefold length generalization that most open-weight models fail at.

Why It Matters

Smaller, cheaper AI models can now produce long-form creative content, lowering the barrier for high-quality story generation.