Research & Papers

Beyond Jupyter Notebooks: The real work behind Production ML systems [D]

Model training is just ~10% of the ML lifecycle, says an ML platform engineer.

Deep Dive

In a viral post, an ML Platform Engineer from one of India's top tech startups shares the reality behind production ML systems. They emphasize that model training and algorithm selection are only a small part of the lifecycle. Most of the work involves building and maintaining reliable infrastructure: data pipelines, feature stores, training pipelines, model registries, low-latency inference paths, monitoring, drift detection, retraining workflows, and rollback strategies. Without these, even the best model can fail silently.

The engineer, who came from a software and data engineering background, credits their success to a product-oriented mindset and deep collaboration with data scientists and product managers. They regularly use tools like Kubernetes, Docker, CI/CD, open table formats, and messaging queues. The key insight: a model is not a product until it's wrapped in a reliable operational system.
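To make the registry-and-rollback idea concrete, here is a minimal sketch of the pattern the post describes: a candidate model is promoted to production only if it beats the current version on a held-out metric, otherwise the last known-good model keeps serving. All names (`ModelRegistry`, `promote`, the `v1`/`v2` versions) are illustrative, not from the post; real systems would use a tool like a managed model registry rather than this in-memory toy.

```python
class ModelRegistry:
    """Minimal in-memory registry: tracks versions and a 'production' pointer."""

    def __init__(self):
        self._versions = {}      # version -> (model, validation metric)
        self._production = None  # version currently serving traffic

    def register(self, version, model, metric):
        self._versions[version] = (model, metric)

    def promote(self, version, min_improvement=0.0):
        """Point production at `version` only if it beats the current metric."""
        _, new_metric = self._versions[version]
        if self._production is not None:
            _, cur_metric = self._versions[self._production]
            if new_metric < cur_metric + min_improvement:
                return False  # rejected: keep serving the known-good model
        self._production = version
        return True

    def production_model(self):
        model, _ = self._versions[self._production]
        return model


reg = ModelRegistry()
reg.register("v1", lambda x: x * 2, metric=0.80)
reg.promote("v1")
reg.register("v2", lambda x: x * 3, metric=0.75)  # candidate with a worse metric
print(reg.promote("v2"))                          # False: v2 rejected
print(reg.production_model()(10))                 # 20: v1 still serving
```

The point of the gate in `promote` is that a bad candidate never reaches traffic, so "rollback" is mostly avoided by never promoting in the first place; a fuller system would also keep prior versions around to revert after a post-deploy failure.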

Key Points
  • Only ~10% of production ML work involves model training; the remaining ~90% is infrastructure such as data pipelines, feature stores, and monitoring.
  • ML Platform Engineers need DevOps skills (Kubernetes, Docker, CI/CD) and close collaboration with data scientists and product managers.
  • Without drift detection and rollback strategies, a model can return wrong predictions while the serving API still responds 200 OK, so failures stay silent.
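One common way to catch the silent failures described above is to compare the live distribution of an input feature against its training-time distribution. The sketch below uses the Population Stability Index (PSI), a standard drift metric; the function names, thresholds in the comments, and the synthetic data are illustrative assumptions, not details from the post.

```python
import math
import random


def psi(reference, live, bins=10):
    """Population Stability Index between two samples of one feature.

    Rule of thumb (an assumption, not from the post): < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift worth alerting on.
    """
    lo, hi = min(reference), max(reference)
    # Equal-width bin edges from the reference (training-time) sample.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fracs(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x > e)] += 1  # clip into last bin
        # Small epsilon so empty buckets don't produce log(0) or divide by zero.
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    ref, liv = bucket_fracs(reference), bucket_fracs(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref, liv))


random.seed(0)
reference = [random.gauss(0, 1) for _ in range(5000)]  # training distribution
stable = [random.gauss(0, 1) for _ in range(5000)]     # live traffic, no drift
shifted = [random.gauss(1.5, 1) for _ in range(5000)]  # live traffic, drifted

print(f"stable PSI:  {psi(reference, stable):.3f}")   # small: no alert
print(f"shifted PSI: {psi(reference, shifted):.3f}")  # large: trigger retraining/rollback
```

In a real pipeline this check would run on a schedule against logged inference inputs, and a PSI above the alert threshold would page an engineer or kick off the retraining workflow, which is exactly the monitoring loop the bullet points describe.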

Why It Matters

For professionals moving ML to production, success depends on platform engineering, not just algorithm accuracy.