Open Source

Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents

New framework uses algorithmic rewards to close the gap between fluent chat and actual task completion in e-commerce.

Deep Dive

A team from Owlgebra AI has introduced Ecom-RLVE, a significant evolution of the RLVE (Reinforcement Learning with Verifiable Environments) framework. Moving beyond single-turn puzzles, Ecom-RLVE-GYM creates eight distinct, procedurally generated e-commerce scenarios—including product discovery, cart building, returns, and order tracking—each with a 12-axis difficulty curriculum. The core innovation is its use of purely algorithmic reward functions: instead of relying on subjective LLM-as-a-judge scoring, the system verifies outcomes like cart accuracy and constraint satisfaction with code that checks against a hidden ground-truth goal, removing the subjectivity and hallucination risk of model-based judges.
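To make the idea concrete, here is a minimal sketch of what a code-based reward check might look like. This is illustrative, not the authors' implementation: the `CartItem` type, the hidden `goal` cart, and the partial-credit scoring are all assumptions.

```python
# Illustrative sketch of an algorithmic, verifiable reward in the spirit
# of Ecom-RLVE: the agent's final cart is compared against a hidden
# ground-truth goal purely in code, with no LLM judge involved.
from dataclasses import dataclass


@dataclass(frozen=True)
class CartItem:
    product_id: str
    quantity: int


def cart_reward(agent_cart: set, goal_cart: set) -> float:
    """Score the agent's cart against the hidden goal.

    Exact match scores 1.0; partial overlap earns F1-style partial
    credit; hallucinated items (not in the goal) dilute precision.
    """
    if not agent_cart and not goal_cart:
        return 1.0
    overlap = len(agent_cart & goal_cart)
    if overlap == 0:
        return 0.0
    precision = overlap / len(agent_cart)
    recall = overlap / len(goal_cart)
    return 2 * precision * recall / (precision + recall)


# Toy episode: agent found one goal item and added one spurious item.
goal = {CartItem("sku-123", 1), CartItem("sku-456", 2)}
agent = {CartItem("sku-123", 1), CartItem("sku-999", 1)}
print(round(cart_reward(agent, goal), 2))  # 0.5
```

Because the check is deterministic code over a hidden goal, the same trajectory always earns the same reward, which is what makes the signal safe to scale.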

In practice, an agent like the Qwen 3 8B model they trained interacts with a simulated user, using tools to search catalogs and modify carts. Its performance is scored on metrics such as the F1 score over correct product tuples, plus efficiency bonuses. Early results from 300 steps of Direct Preference Optimization (DPO) training suggest that this combination of environment scaling and adaptive difficulty can transfer to agentic, real-world task completion. The project addresses a critical deployment gap: while LLMs are fluent conversationalists, they often fail at the precise, multi-step tool use that real shopping workflows require. By making rewards verifiable and adaptive, Ecom-RLVE provides a scalable training ground for agents that must reliably execute complex, multi-intent customer journeys.
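A plausible way to fold an efficiency bonus into the task score is to add a small, bounded term for finishing under a tool-call budget, so efficiency can break ties but never outweigh correctness. The function, budget, and 0.1 weight below are illustrative assumptions, not values from the paper.

```python
# Hypothetical episode score combining task F1 with an efficiency bonus.
# The 0.1 weight and 20-call budget are illustrative, not from Ecom-RLVE.
def episode_score(f1: float, tool_calls: int, budget: int = 20) -> float:
    """Task F1 plus a small bonus for staying under the tool-call budget.

    The bonus is capped at 0.1, so a wrong-but-fast trajectory can
    never outscore a correct-but-slow one.
    """
    efficiency = max(0.0, (budget - tool_calls) / budget)
    return f1 + 0.1 * efficiency


print(round(episode_score(1.0, tool_calls=5), 3))   # 1.075
print(round(episode_score(1.0, tool_calls=25), 3))  # 1.0
```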

Key Points
  • Extends RLVE to 8 multi-turn e-commerce environments (product discovery, returns, etc.) with procedural generation and a 12-axis difficulty curriculum.
  • Uses 100% algorithmic, verifiable rewards—no LLM judges—scoring agents on F1 for correct products, efficiency, and hallucination checks.
  • Trained a Qwen 3 8B model with DPO over 300 steps, showing promising transfer to real-world, tool-augmented task completion.
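Verifiable rewards pair naturally with DPO, which learns from (chosen, rejected) trajectory pairs rather than raw scores. A minimal sketch of that wiring, under the assumption that two trajectories per scenario are sampled and ranked by the code-based reward (all names here are illustrative):

```python
# Illustrative sketch (not the authors' code): turning verifiable
# episode rewards into DPO preference pairs by ranking two sampled
# trajectories for the same scenario.
def make_preference_pair(traj_a, traj_b, reward_fn):
    """Return (chosen, rejected) ranked by verifiable reward; None on ties."""
    reward_a, reward_b = reward_fn(traj_a), reward_fn(traj_b)
    if reward_a == reward_b:
        return None  # a tie carries no preference signal
    return (traj_a, traj_b) if reward_a > reward_b else (traj_b, traj_a)


# Toy usage: trajectories scored by the fraction of goal items found.
goal = {"sku-1", "sku-2"}
reward = lambda traj: len(set(traj) & goal) / len(goal)
pair = make_preference_pair(["sku-1"], ["sku-1", "sku-2"], reward)
print(pair)  # (['sku-1', 'sku-2'], ['sku-1'])
```

Because the ranking comes from deterministic code rather than a judge model, the preference data stays consistent as the environments scale.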

Why It Matters

Provides a scalable, objective method to train AI shopping assistants that can reliably complete complex transactions, not just chat.