
Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

OpenStreetMap and satellite imagery improve motor insurance risk models by up to 12%

Deep Dive

A new study by Sherly Alfonso-Sánchez, Cristián Bravo, and Kristina G. Stankova, published on arXiv (2604.21893), examines how geographic data from alternative sources can improve motor insurance claim frequency models. Using the BeMTPL97 dataset, the researchers adopted a zone-level framework that predicts claim frequency at the postcode level, incorporating environmental indicators from OpenStreetMap and CORINE Land Cover as well as orthoimagery from the Belgian National Geographic Institute. They tested three baseline models: generalized linear models (GLMs), regularized GLMs, and gradient-boosted trees. They also evaluated convolutional neural networks applied to raw imagery.
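To make the zone-level framing concrete, here is a minimal sketch of how policy-level records could be rolled up into a claim frequency per postcode. The field names (`postcode`, `exposure`, `claims`) and the sample records are hypothetical and are not taken from BeMTPL97 or from the paper.

```python
from collections import defaultdict


def zone_frequencies(policies):
    """Aggregate policy records to per-postcode claim frequency.

    Each record is a dict with hypothetical keys: 'postcode',
    'exposure' (policy-years) and 'claims' (claim count).
    """
    totals = defaultdict(lambda: [0.0, 0])  # postcode -> [exposure, claims]
    for p in policies:
        totals[p["postcode"]][0] += p["exposure"]
        totals[p["postcode"]][1] += p["claims"]
    # Frequency = claims per unit of exposure; zones without exposure are skipped.
    return {
        zone: claims / exposure
        for zone, (exposure, claims) in totals.items()
        if exposure > 0
    }


# Illustrative records only (made up for this sketch).
policies = [
    {"postcode": "1000", "exposure": 1.0, "claims": 1},
    {"postcode": "1000", "exposure": 1.0, "claims": 0},
    {"postcode": "9000", "exposure": 0.5, "claims": 0},
]
freqs = zone_frequencies(policies)
```

In a setup like this, each postcode becomes one modelling unit, and zone-level predictors such as environmental indicators would be joined on the postcode key.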

The results show that augmenting traditional actuarial variables with geographic information improves predictive performance. Both linear and tree-based models benefited most from combining coordinates with environmental features extracted at a 5 km scale, which yielded an 8-12% improvement. Image embeddings from pretrained vision transformers improved accuracy and stability for regularized GLMs, but only when environmental features were absent. The study concludes that the predictive value of geography depends more on how it is represented than on model complexity, which gives insurers a practical way to incorporate spatial context even when individual-level data is limited.
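As a rough illustration of what a relative improvement of this kind means, the sketch below compares a baseline and an augmented model using mean Poisson deviance. This summary does not say which metric the authors used, so the choice of Poisson deviance is an assumption, and the function names are hypothetical.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance; every prediction must be strictly positive."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        # The y * log(y / mu) term is taken as 0 when y == 0.
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - (y - mu))
    return total / len(y_true)


def relative_improvement(baseline_loss, augmented_loss):
    """Fractional reduction in a loss where lower is better."""
    return (baseline_loss - augmented_loss) / baseline_loss
```

For example, `relative_improvement(0.50, 0.45)` is roughly 0.10, meaning the augmented model reduces the loss by about 10% relative to the baseline. The loss values here are made up for illustration.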

Key Points
  • Combining coordinates with environmental features at a 5 km scale improved predictive performance by 8-12% for both GLMs and gradient-boosted trees
  • Image embeddings from pretrained vision transformers helped only when environmental features were absent, improving accuracy and stability for regularized GLMs
  • The study used the BeMTPL97 dataset with OpenStreetMap, CORINE Land Cover, and Belgian orthoimagery for zone-level predictions

Why It Matters

The findings suggest that insurers can use free geographic data to refine zone-level risk models, potentially lowering premiums for safer zones.