[d] i think my scm + llm agent architecture is fundamentally broken (identifiability issues)
Developer questions whether combining causal graphs with black-box LLMs is fundamentally broken.
Oransim is an ambitious project aiming to build a predictive counterfactual engine for marketing: letting users ask 'what if I move 30% of the budget from creator A to creator B?' before spending money. The architecture combines a Structural Causal Model (SCM) backbone using do-calculus, Hawkes processes for viral self-excitation, and LLM agents representing user archetypes that react to content via an embedding bus. The developer admits the approach may be 'fundamentally broken' and outlines three core issues.
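The self-excitation piece the post describes can be sketched with a minimal univariate Hawkes simulator using Ogata's thinning algorithm. The exponential kernel and every parameter value below are illustrative assumptions, not Oransim's actual model:

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate a univariate Hawkes process with exponential kernel
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))
    via Ogata's thinning algorithm. Stable when alpha / beta < 1."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        # The intensity at time t upper-bounds the intensity until the next
        # event, because the exponential kernel only decays between events.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        t += rng.expovariate(lam_bar)
        if t >= horizon:
            return events
        lam_t = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:  # accept with prob lambda(t)/lam_bar
            events.append(t)

# Illustrative parameters only: branching ratio alpha/beta ~= 0.67, so each
# event spawns roughly 0.67 "viral" follow-up events on average.
spikes = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=100.0)
print(len(spikes))
```

The branching ratio alpha/beta is the quantity to watch: at or above 1 the cascade explodes, which is exactly the regime a viral-marketing simulator flirts with.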
First, the SCM vs. LLM boundary: SCMs require clean structural equations, while LLMs are black boxes. Treating agent outputs as a noisy observation layer for the SCM may be theoretically indefensible, like mixing oil and water. Second, identifiability: once an LLM mediates a causal node, it is unclear whether prompt-level interventions genuinely map to do-operators on latent user states; the developer fears this is hand-waving. Third, the sim-to-real gap: fitting Hawkes parameters on agent-generated data yields decent marginals but poor covariance structure compared to real-world logs, and the developer questions whether anyone has solved this for point processes. The post is a plea for expert input before investing another six months.
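The identifiability worry can be made concrete on a toy linear SCM: conditioning on a value of X (which a prompt-level nudge may amount to) is not the same as do(X=x), which severs X from its causes. The variables and coefficients below are hypothetical stand-ins for latent user state, exposure, and conversion, not anything from Oransim:

```python
import random

# Toy linear SCM with a confounder, a hypothetical stand-in for
# latent user state (U) -> exposure (X) -> conversion (Y), with U -> Y:
#   X = U + eps_x
#   Y = 2*X + U + eps_y
def sample(n, do_x=None, seed=1):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)
        # do(X=x) cuts the edge U -> X; plain observation keeps it.
        x = do_x if do_x is not None else u + rng.gauss(0, 0.1)
        y = 2 * x + u + rng.gauss(0, 0.1)
        rows.append((u, x, y))
    return rows

# Interventional: E[Y | do(X=1)] = 2*1 + E[U] = 2
do_mean = sum(y for _, _, y in sample(100_000, do_x=1.0)) / 100_000
# Observational: conditioning on X ~= 1 drags in the confounder, E[Y | X~=1] ~= 3
obs = [y for _, x, y in sample(100_000) if abs(x - 1.0) < 0.05]
obs_mean = sum(obs) / len(obs)
print(round(do_mean, 2), round(obs_mean, 2))  # roughly 2.0 vs 3.0
```

The gap between the two estimates is the confounding bias; the open question in the post is whether a prompt-level intervention behaves like the first line (a genuine do) or the second (mere conditioning on the LLM's internal state).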
- SCM + LLM boundary: Black-box LLM outputs break the clean structural equations required by SCMs, creating a theoretical 'oil and water' problem.
- Identifiability: Prompt-level interventions may not correspond to valid do-operators on latent user states, undermining causal inference.
- Sim-to-real gap: Agent-generated Hawkes process data matches marginals but fails on covariance vs. real-world logs, a known challenge for point processes.
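The third bullet can be illustrated with summary statistics on binned event counts: a bursty "real" series and an i.i.d. "agent" series are tuned to the same marginal mean, yet their lag-1 autocovariances diverge. Both series here are synthetic stand-ins, not Oransim data:

```python
import random

def lag_autocov(xs, k):
    """Lag-k autocovariance of a count series."""
    m = sum(xs) / len(xs)
    pairs = [(xs[i] - m) * (xs[i + k] - m) for i in range(len(xs) - k)]
    return sum(pairs) / len(pairs)

rng = random.Random(7)

# "Real" logs: bursty counts driven by a slowly varying latent rate (AR(1)).
real, rate = [], 1.0
for _ in range(5000):
    rate = 0.8 * rate + 0.2 * rng.expovariate(1.0)   # stationary mean 1
    real.append(sum(rng.random() < rate / 5 for _ in range(5)))

# "Agent" logs: i.i.d. counts tuned to the same marginal mean of 1.
agent = [sum(rng.random() < 0.2 for _ in range(5)) for _ in range(5000)]

mean = lambda xs: sum(xs) / len(xs)
print(round(mean(real), 2), round(mean(agent), 2))   # marginal means agree
print(round(lag_autocov(real, 1), 3), round(lag_autocov(agent, 1), 3))
# the real series shows positive lag-1 autocovariance; the agent series is near zero
```

A fit that only matches marginals would score both series as equivalent; the autocovariance exposes the missing burst structure, which is the gap the post reports when fitting Hawkes parameters on agent-generated data.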
Why It Matters
Questions the viability of combining causal inference with LLM agents for predictive marketing simulations.