Pseudo-Labeling for Unsupervised Domain Adaptation with Kernel GLMs
A new framework tackles covariate shift, pairing a 55-page theoretical analysis with open-source Python code.
Researchers Nathan Weill and Kaizheng Wang have published a 55-page paper introducing a principled framework for unsupervised domain adaptation (UDA) in kernel Generalized Linear Models (GLMs). The work specifically addresses covariate shift—where training (source) and deployment (target) data have different distributions—in models like kernelized linear, logistic, and Poisson regression with ridge regularization. Their core innovation is a two-stage pseudo-labeling strategy: they split labeled source data to first train a family of candidate models, then use a separate batch to build an imputation model. This imputation model generates pseudo-labels for the unlabeled target data, enabling robust model selection to minimize prediction error in the target domain.
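The two-stage strategy can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' released code: it uses kernel ridge regression with an RBF kernel, a 50/50 source split, and pseudo-label squared error as the selection criterion, all of which are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def fit_krr(X, y, lam, gamma=1.0):
    # Kernel ridge regression: solve (K + lam * n * I) alpha = y
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Xq: rbf_kernel(Xq, X, gamma) @ alpha

rng = np.random.default_rng(0)

# Toy covariate shift: target covariates drawn from a shifted distribution
f = lambda X: np.sin(3 * X[:, 0])
Xs = rng.normal(0.0, 1.0, (200, 1))
ys = f(Xs) + 0.1 * rng.normal(size=200)      # labeled source data
Xt = rng.normal(1.0, 1.0, (300, 1))          # unlabeled target covariates

# Stage 1: split the source data; train candidate models on one half
X1, y1, X2, y2 = Xs[:100], ys[:100], Xs[100:], ys[100:]
lams = [1e-4, 1e-3, 1e-2, 1e-1]              # candidate ridge penalties
candidates = {lam: fit_krr(X1, y1, lam) for lam in lams}

# Stage 2: fit an imputation model on the held-out half and use it
# to generate pseudo-labels for the unlabeled target covariates
imputer = fit_krr(X2, y2, lam=1e-3)
pseudo_y = imputer(Xt)

# Model selection: pick the candidate with the smallest
# pseudo-label prediction error on the target domain
scores = {lam: np.mean((m(Xt) - pseudo_y) ** 2) for lam, m in candidates.items()}
best_lam = min(scores, key=scores.get)
```

The key point the sketch captures is that the selection step needs no target labels: the imputation model, trained on a disjoint slice of source data, stands in for them.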
The paper provides rigorous theoretical backing with non-asymptotic excess-risk bounds that quantify adaptation performance through an "effective labeled sample size," explicitly accounting for the unknown distribution shift. This moves beyond heuristic approaches by offering provable guarantees. Experiments on both synthetic and real datasets demonstrate that their framework delivers consistent performance improvements over standard source-only training baselines. The researchers have made their work fully reproducible, releasing all Python solvers and experiment scripts alongside the paper, facilitating immediate application and further research in the field.
- Proposes a pseudo-labeling framework for unsupervised domain adaptation in kernel GLMs (linear, logistic, Poisson regression).
- Establishes non-asymptotic excess-risk bounds with an "effective labeled sample size" to quantify adaptation performance.
- Open-sources Python code, showing consistent gains over source-only baselines in experiments.
Why It Matters
Provides a theoretically sound, practical method for adapting AI models to new data domains without costly re-labeling.