Research & Papers

Lakehouse queries vary nearly 2x in runtime; modeling the variance cuts prediction error by up to 80%

The same query can take nearly twice as long depending on cloud conditions. Here is how the researchers account for it.

Deep Dive

A team led by James Nurdin has published a systematic study of runtime variance in distributed lakehouse systems on arXiv (2606.03464). Using Kubernetes deployments across AWS, Azure, GCP, and a private cloud, they ran three analytical benchmarks at multiple database scales. The key finding: repeated executions of the same query can vary in runtime by nearly twofold, a major source of unpredictability for platform operators trying to optimize monetary, resource, and environmental costs.
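To make "nearly twofold" concrete, here is a minimal sketch of how run-to-run spread for a single query might be summarized. The function name `runtime_spread`, the two metrics, and the timings are illustrative assumptions, not the measures or data used in the paper.

```python
import statistics


def runtime_spread(runtimes_s):
    """Summarize run-to-run variance for repeated executions of one query.

    Hypothetical helper: the paper's own variance metrics may differ.
    """
    if len(runtimes_s) < 2:
        raise ValueError("need at least two runs to measure spread")
    fastest, slowest = min(runtimes_s), max(runtimes_s)
    return {
        # Ratio of slowest to fastest run; 2.0 would mean a twofold spread.
        "max_over_min": slowest / fastest,
        # Sample standard deviation relative to the mean runtime.
        "coeff_of_variation": statistics.stdev(runtimes_s)
        / statistics.mean(runtimes_s),
    }


# Made-up timings in seconds for four runs of the same query.
runs = [10.0, 12.0, 15.0, 19.0]
spread = runtime_spread(runs)
print(spread)
```

With these made-up timings the slowest run takes 1.9 times as long as the fastest, which is the kind of gap the study reports.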

Beyond quantifying variance, the researchers conducted a factor analysis isolating data locality, co-tenant load, and caching effects as primary contributors. They then tested state-of-the-art Query Performance Prediction (QPP) models and showed that explicitly modeling these variance sources reduces prediction error by up to 80%. Finally, they demonstrated downstream benefits for low-carbon scheduling: accounting for runtime variance led to a significant reduction in carbon costs, making workload management more efficient and environmentally friendly.
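The idea of explicitly modeling a variance source can be illustrated with a toy predictor. The sketch below is not one of the QPP models the researchers tested: it compares a per-query mean against a mean conditioned on one assumed factor (whether the cache is warm), using made-up observations and in-sample error only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (query_id, cache_warm, runtime_seconds).
observations = [
    ("q1", True, 10.0),
    ("q1", True, 11.0),
    ("q1", False, 19.0),
    ("q1", False, 20.0),
]


def by_query(row):
    """Group key that ignores the variance source."""
    return row[0]


def by_query_and_cache(row):
    """Group key that includes the cache state as a variance source."""
    return (row[0], row[1])


def fit_group_means(rows, key):
    """Predict the runtime of a row as the mean runtime of its group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: mean(v) for k, v in groups.items()}


def mean_abs_error(rows, model, key):
    """In-sample mean absolute error of the group-mean predictor."""
    return mean(abs(row[2] - model[key(row)]) for row in rows)


baseline = fit_group_means(observations, by_query)
aware = fit_group_means(observations, by_query_and_cache)

baseline_mae = mean_abs_error(observations, baseline, by_query)
aware_mae = mean_abs_error(observations, aware, by_query_and_cache)
print(baseline_mae, aware_mae)
```

On this toy data the cache-aware predictor has a much lower error because the warm and cold runs form two tight clusters. Real workloads would need held-out evaluation and more factors, such as data locality and co-tenant load.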

Key Points
  • Identical lakehouse queries varied in runtime by nearly 2x across repeated runs on AWS, Azure, GCP, and a private cloud.
  • Factor analysis pinpoints data locality, co-tenant load, and caching as main variance drivers.
  • Explicitly modeling these sources cuts QPP prediction error by up to 80% and reduces carbon costs in scheduling.

Why It Matters

Lakehouse operators who account for runtime variance can predict performance more reliably, which helps cut wasted resources and carbon emissions.