Research & Papers

[D] Training a classifier entirely in SQL (no iterative optimization)

A new linear classifier, implemented entirely in SQL, achieves 0.954 AUC on fraud detection with no iterative optimization, making training fully parallelizable.

Deep Dive

A novel machine learning implementation is challenging the traditional data science workflow by showing that useful models can be trained directly within a database. A developer has implemented SEFR, a fast linear classifier, entirely in SQL on Google BigQuery. This approach eliminates the iterative optimization loops common in algorithms like Logistic Regression, instead using a fully parallelizable, single-pass calculation. The result is a dramatic speed increase: SEFR trains approximately 18 times faster than a standard Logistic Regression model on the same hardware and data.
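The post does not reproduce the SQL itself, but the single-pass rule it describes matches the published SEFR algorithm. Below is a minimal Python sketch of that training rule, assuming non-negative feature values and binary labels in {0, 1}; the formulas follow the SEFR paper, not the author's actual BigQuery implementation:

```python
# Sketch of the SEFR training rule (per the published SEFR paper;
# not the author's BigQuery SQL). Assumes non-negative features
# and binary labels {0, 1}. Training is one pass, no iteration.
def sefr_fit(X, y, eps=1e-7):
    n_features = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]

    # Single pass: per-feature mean within each class.
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(n_features)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(n_features)]

    # Weight = normalized difference of the class means.
    w = [(p - n) / (p + n + eps) for p, n in zip(mu_pos, mu_neg)]

    # Bias: average of the per-class mean scores, weighted by the
    # opposite class size to compensate for class imbalance.
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    m_pos = sum(score(x) for x in pos) / len(pos)
    m_neg = sum(score(x) for x in neg) / len(neg)
    b = (len(neg) * m_pos + len(pos) * m_neg) / (len(pos) + len(neg))
    return w, b

def sefr_predict(x, w, b):
    # Positive class if the weighted sum clears the learned bias.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= b else 0
```

Because every step is a sum, count, or mean over rows, each line maps naturally to SQL aggregate functions, which is what makes the in-database implementation possible.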

Benchmarked on a real-world fraud detection dataset containing 55,000 samples, SEFR demonstrated robust performance. It achieved an Area Under the Curve (AUC) score of 0.954, coming reasonably close to the 0.986 AUC of the more computationally intensive Logistic Regression. This performance-to-speed trade-off makes SEFR particularly compelling for scenarios requiring rapid, in-database analytics on large-scale datasets, such as real-time fraud scoring, large-scale A/B test analysis, or feature exploration directly on terabyte-scale tables without any data movement.

Key Points
  • SEFR, a linear classifier, was implemented entirely in SQL within Google BigQuery, requiring no external code.
  • On a 55k-sample fraud detection dataset, it achieved 0.954 AUC vs. Logistic Regression's 0.986, but trained ~18x faster.
  • Its non-iterative, parallelizable design allows model training directly on massive datasets inside a data warehouse.
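The non-iterative design in the points above works because SEFR's core statistics collapse to grouped aggregates, the kind of computation SQL engines parallelize natively. As an illustrative sketch (hypothetical data layout; the author's actual BigQuery query is not shown in the post), the per-class feature means amount to roughly `SELECT label, AVG(f0), AVG(f1), ... FROM data GROUP BY label`:

```python
# Illustrative: SEFR's training statistics reduce to one grouped
# aggregation pass, the Python analogue of a SQL GROUP BY on the
# label column. Hypothetical row format: (label, [f0, f1, ...]).
from collections import defaultdict

def class_means(rows):
    """One pass over (label, features) rows -> per-class feature means."""
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for label, feats in rows:
        if sums[label] is None:
            sums[label] = [0.0] * len(feats)
        for j, v in enumerate(feats):
            sums[label][j] += v          # SQL: SUM(fj) per label group
        counts[label] += 1               # SQL: COUNT(*) per label group
    return {lbl: [s / counts[lbl] for s in sums[lbl]] for lbl in sums}
```

Because the aggregation touches each row exactly once and groups commute, the warehouse can shard the scan across workers, which is the source of the parallelism claim.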

Why It Matters

It enables rapid, large-scale machine learning directly inside data warehouses, simplifying MLOps and reducing data movement costs.