Production AI is very different from the demos
Real-world AI token usage can run 10x higher than in prototypes, draining budgets fast.
A developer recently shared a cautionary tale about deploying an AI feature into production. During prototyping, costs were low due to small volumes and short prompts—typical for demos. But once real traffic hit, token usage scaled dramatically. Customers asked longer, more ambiguous questions than the test set, forcing the team to add context retrieval that effectively doubled the input length on every API call. The feature ran on OpenAI's GPT-4o, which offered excellent response quality, but after a few weeks of volume, the bill came in far higher than expected.
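A back-of-envelope sketch shows how those two effects compound. The numbers below are hypothetical stand-ins (the source gives only "10x" and "doubled"), but the arithmetic is the point: ten times the traffic and twice the input length multiply daily input tokens by twenty.

```python
# Illustrative prototype-vs-production token math; all numbers are made up.
proto_requests_per_day = 500
proto_input_tokens = 400        # short demo prompts

prod_requests_per_day = 5_000   # 10x real traffic
prod_input_tokens = 800         # context retrieval doubles the input

proto_daily = proto_requests_per_day * proto_input_tokens   # 200,000 tokens/day
prod_daily = prod_requests_per_day * prod_input_tokens      # 4,000,000 tokens/day

print(prod_daily / proto_daily)  # → 20.0
```

Output costs scale on top of this, since longer, more ambiguous questions also tend to produce longer answers.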
The real pain point emerged when finance asked for a breakdown of which features or models drove the costs. The OpenAI dashboard only shows aggregate totals, not per-feature usage. The developer now spends half a day every week exporting token counts and trying to map them back to specific features manually—and still lacks confidence in the numbers. This experience highlights a growing challenge for AI teams: production costs can diverge wildly from demos, and existing billing tools lack the granularity needed for internal cost attribution and budgeting.
- Real user queries are longer and more ambiguous than test sets, doubling input length via context retrieval.
- OpenAI billing dashboard provides only aggregate costs, not per-feature breakdowns.
- Engineer spends half a day weekly manually reconciling token counts against feature usage.
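One way to avoid the weekly reconciliation chore is to tag every API call with a feature label at the call site and record the token counts the provider's response already returns (the OpenAI SDK exposes these as `usage.prompt_tokens` and `usage.completion_tokens`). Below is a minimal sketch of such a ledger; the `CostLedger` class, feature names, and per-million-token prices are illustrative assumptions, not the developer's actual setup — check your provider's current pricing before relying on the figures.

```python
from collections import defaultdict

# Hypothetical per-1M-token prices; substitute your provider's real rates.
PRICE_PER_M = {"gpt-4o": {"input": 2.50, "output": 10.00}}


class CostLedger:
    """Aggregate token usage and estimated cost per feature tag."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0, "cost": 0.0})

    def record(self, feature, model, prompt_tokens, completion_tokens):
        """Record one API call; in production, pass the counts from
        response.usage instead of hard-coded numbers."""
        p = PRICE_PER_M[model]
        cost = (prompt_tokens * p["input"]
                + completion_tokens * p["output"]) / 1_000_000
        entry = self.usage[feature]
        entry["input"] += prompt_tokens
        entry["output"] += completion_tokens
        entry["cost"] += cost
        return cost


ledger = CostLedger()
ledger.record("search_assist", "gpt-4o", 1200, 300)
ledger.record("summarize", "gpt-4o", 800, 150)
print(ledger.usage["search_assist"]["cost"])  # → 0.006
```

Writing each record to a database instead of an in-memory dict gives finance a per-feature breakdown without any manual export from the billing dashboard.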
Why It Matters
AI cost management is now a first-class engineering concern: production deployments must plan for token usage that scales unpredictably with real traffic, and for attributing those costs to specific features and models.