Case study · Insurance · GLM / Pricing

Pure-premium pricing — freMTPL2

Two GLM families and a gradient-boosted challenger compared on the canonical French motor third-party liability dataset (678k policies). Tweedie wins on segmentation power; Poisson + Gamma wins on top-decile lift. Choosing between them is an actuarial decision, not a modelling one.

Read · 6 min · 1,319 words
Best on Gini · Tweedie GLM (0.310)
Best on top-decile lift · Poisson + Gamma GLM (2.66)
Data · 678,013 policies · ~5% with claims
Overdispersion · Var/Mean = 1.083 (mild, favours Tweedie)

Summary

On the canonical freMTPL2 French motor third-party liability dataset, a Tweedie compound-Poisson GLM beats both a separately-fit Poisson + Gamma GLM and a gradient-boosted challenger on the standard segmentation metric (Gini coefficient on policy-level pure premium). But Poisson + Gamma wins on a different operationally-meaningful metric — top-decile lift — and remains the more defensible choice when frequency and severity drivers diverge. The "right" pricing model is the one whose biases match how the rating engine and underwriting team will actually use the score.

The business question

A motor insurer wants policy-level expected-loss estimates (the pure premium): accurate enough to feed a tier-pricing engine and explainable enough for actuarial sign-off. Two operational uses sit on top of the score: the rating engine prices off it, continuously or binned into tiers, and the underwriting team reviews the policies it flags as highest-risk.

An incumbent Poisson GLM was already in production (industry-standard). The question: would Tweedie or an ML challenger meaningfully improve either dimension?

Data

Real freMTPL2 from the French Federation of Insurers, redistributed via Kaggle as floser/french-motor-claims-datasets-fremtpl2freq (claims) and floser/fremtpl2sev (severities). After joining and capping severities at the 99.9th percentile (standard practice to limit catastrophic-claim distortion):

CSV · Kaggle · Policy-level · Severities capped at 99.9% · 5-fold-style 80/20 hold-out
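The capping step is simple but worth pinning down, since it determines the severity tail the downstream severity model sees. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def cap_severities(sev, q=0.999):
    """Cap claim severities at the q-th empirical quantile to limit
    the influence of catastrophic claims on the severity fit."""
    cap = np.quantile(sev, q)
    return np.minimum(sev, cap)

# toy severities: mostly small claims plus one extreme outlier
sev = np.array([800.0, 1200.0, 950.0, 2500.0, 1_000_000.0])
capped = cap_severities(sev)
```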

EDA & overdispersion

Two empirical facts dominate the modelling decision:

  1. The claim-count distribution is heavy on zero. Almost 95% of policies file no claims. Of those that do, most file exactly one; the long tail of two-or-more is genuinely small.
  2. Severity is heavy-right-tailed. Most positive claims are small (~1k EUR), but a long tail extends to the per-policy cap. Modelling severity in linear space without a log-link is a non-starter.
Claim-count distribution and severity histogram
Figure 1. Left: count of claims per policy on a log-y axis — 95% zero, ~5% with one claim, very few with two or more. Right: severity distribution on positive claims, capped at 30k EUR for visibility; the right tail extends well beyond the plot.

The empirical variance-to-mean ratio of claim count is 1.083. Mild overdispersion. Poisson is a defensible baseline, but a negative-binomial or Tweedie family captures the extra variance more cleanly.
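The check itself is one line once claim counts are in hand. A sketch on a stand-in Poisson draw (the real freMTPL2 counts give 1.083):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in claim counts: ~95% zeros, mimicking the freMTPL2 shape
counts = rng.poisson(0.05, size=100_000)

# ratio is ~1 for a pure Poisson draw; values above 1 signal
# overdispersion, which Tweedie or negative binomial absorb
ratio = counts.var() / counts.mean()
```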

Modelling approach

Three modelling families on the same 80/20 train/test split. All trained with offset = log(exposure) for frequency, and severity restricted to positive claims. Pure-premium predictions on the test set combine frequency and severity where those live in separate models, and come straight from the single model where they don't.

1. Poisson + Gamma GLMs (industry baseline)

The classical actuarial decomposition: model claim frequency with a Poisson GLM (log link, exposure offset) and conditional severity with a Gamma GLM on positive claims. Pure premium per unit exposure = (E[N | X] / exposure) × E[Y | X, N ≥ 1]. Easy to interpret per-coefficient; easy for the rating engine to consume; easy to explain to a regulator.
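A minimal scikit-learn sketch of the decomposition on synthetic data. Note that sklearn's GLMs expose no offset argument, so exposure enters by fitting the per-exposure claim rate with sample_weight = exposure, which is equivalent to a log-exposure offset; the features and signal strengths here are illustrative, not the article's fitted model:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, GammaRegressor

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 3))             # stand-in rating factors
exposure = rng.uniform(0.1, 1.0, n)     # policy-years in force

# synthetic claim frequency with a mild signal on the first factor
lam = exposure * np.exp(-3.0 + 0.3 * X[:, 0])
n_claims = rng.poisson(lam)
has_claim = n_claims > 0
sev = rng.gamma(2.0, 600.0, n)          # synthetic severities, mean ~1200

# frequency: Poisson GLM on the per-exposure rate, weighted by exposure
freq_glm = PoissonRegressor(alpha=1e-4).fit(
    X, n_claims / exposure, sample_weight=exposure)

# severity: Gamma GLM on policies with at least one claim
sev_glm = GammaRegressor(alpha=1e-4).fit(X[has_claim], sev[has_claim])

# pure premium per unit exposure = frequency x severity
pure_premium = freq_glm.predict(X) * sev_glm.predict(X)
```

Both components use a log link, so the product stays strictly positive and the rating engine can consume it directly.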

2. Tweedie compound-Poisson GLM

A single model on the per-exposure pure premium target with a Tweedie(var_power=1.5) distribution and log link. Handles the zero-inflation and the positive tail in one fit. Exposes a cleaner story: a single set of coefficients on the same response. The trade-off is that frequency and severity are no longer modelled separately, so a coefficient can't be split into "this driver is more likely to file" vs. "this driver files larger claims."
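The single-model alternative in the same sketch style. scikit-learn's TweedieRegressor names the variance power simply power (the article's var_power = 1.5), and selects the log link automatically for power between 1 and 2:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

rng = np.random.default_rng(2)
n = 5_000
X = rng.normal(size=(n, 3))
exposure = rng.uniform(0.1, 1.0, n)

# synthetic per-policy losses: mostly zero, positive right tail
n_claims = rng.poisson(exposure * 0.05)
losses = np.where(n_claims > 0, rng.gamma(2.0, 600.0, n), 0.0)

# one compound-Poisson GLM on per-exposure pure premium;
# exposure again enters through sample_weight
glm = TweedieRegressor(power=1.5, alpha=1e-4, max_iter=500).fit(
    X, losses / exposure, sample_weight=exposure)

pred = glm.predict(X)   # strictly positive under the log link
```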

3. Gradient-boosted regressor (challenger)

A pair of GradientBoostingRegressor models — one on the per-exposure frequency, one on log-severity — multiplied to give pure premium. No monotonicity constraints (a production system would add them on bonus_malus), no interaction priors. The point is to see what this model family picks up that the GLMs miss.
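A sketch of the challenger pair on the same synthetic setup; the hyperparameters here are illustrative defaults, not tuned values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 5_000
X = rng.normal(size=(n, 3))
exposure = rng.uniform(0.1, 1.0, n)
n_claims = rng.poisson(exposure * 0.05)
has_claim = n_claims > 0
sev = rng.gamma(2.0, 600.0, n)

# one GBM on per-exposure frequency, weighted by exposure
freq_gbm = GradientBoostingRegressor(max_depth=3, n_estimators=100).fit(
    X, n_claims / exposure, sample_weight=exposure)

# one GBM on log-severity, positive claims only
sev_gbm = GradientBoostingRegressor(max_depth=3, n_estimators=100).fit(
    X[has_claim], np.log(sev[has_claim]))

# multiply back on the original scale for pure premium
pure_premium = freq_gbm.predict(X) * np.exp(sev_gbm.predict(X))
```

Unlike the GLMs, the squared-loss frequency GBM is not guaranteed positive, which is one more reason production systems constrain this family.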

Results

80/20 hold-out. Two metrics:

Model | Gini (PP) | Top-10% lift
Tweedie GLM (var_power = 1.5) | 0.310 | 2.52
Poisson + Gamma GLM | 0.241 | 2.66
GBM (Poisson + Gamma) | 0.211 | 1.45
Lorenz curves for Tweedie, Poisson+Gamma, and GBM
Figure 2. Lorenz curves on the held-out test set. The diagonal is the random baseline. The further the curve bows below the diagonal, the better the model concentrates loss in the lowest-risk policies (and equivalently, the better it concentrates predicted risk in the highest-loss policies). Tweedie sits below Poisson + Gamma overall, but the GLMs cross at the upper-right, exactly the region the top-decile-lift metric is sensitive to.
Two metrics, two winners. Tweedie captures more of the rank-ordering signal across the whole policy base (Gini), so it would price better on average. But Poisson + Gamma identifies the extreme upper tail more sharply (top-decile lift 2.66 vs Tweedie's 2.52), so it would route more correctly to the highest pricing tier. Which model "wins" depends on whether the rating engine prices continuously off the score or bins into tiers, and on which tier matters most.
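Both metrics come straight from the held-out losses and predictions. A minimal sketch (the function name is illustrative; the Gini here is the area-based coefficient on the Lorenz curve ordered by predicted risk):

```python
import numpy as np

def gini_and_lift(y_true, y_pred, top_frac=0.10):
    """Gini coefficient from the Lorenz curve of actual losses ordered
    by predicted risk, plus top-decile lift: mean actual loss in the
    predicted top decile over the overall mean loss."""
    order = np.argsort(y_pred)                 # lowest predicted risk first
    cum_loss = np.concatenate(
        [[0.0], np.cumsum(y_true[order]) / y_true.sum()])
    cum_pop = np.linspace(0.0, 1.0, len(y_true) + 1)
    # trapezoidal area under the Lorenz curve; Gini = 1 - 2 * area
    area = np.sum((cum_loss[1:] + cum_loss[:-1]) * np.diff(cum_pop)) / 2.0
    gini = 1.0 - 2.0 * area

    k = int(np.ceil(top_frac * len(y_true)))
    lift = y_true[order[-k:]].mean() / y_true.mean()
    return gini, lift
```

A model that rank-orders perfectly pushes the Lorenz curve furthest below the diagonal; a random score gives Gini near 0 and lift near 1.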

Trade-offs

  1. Poisson + Gamma: two models to maintain, but each coefficient splits cleanly into a frequency effect and a severity effect, which is the most defensible form for actuarial and regulatory sign-off.
  2. Tweedie: one model, one coefficient set, and the best overall Gini, but a driver's effect can no longer be split into "files more often" vs. "files larger claims".
  3. GBM: can pick up interactions the GLMs miss, but unconstrained it underperforms both; production use would need monotonicity constraints and calibration.

Deployment sketch

For the rating engine and the actuarial team:

Lessons

  1. Pick the metric your downstream consumer actually uses, then pick the model. Tweedie wins overall Gini; Poisson + Gamma wins top-decile lift. If your rating engine bins into tiers, the right answer is not the higher-Gini model.
  2. Mild overdispersion is not Tweedie-mandatory. Var/Mean = 1.083 means Poisson isn't badly mis-specified. Tweedie's win is real but modest; the real differentiator is the modelling-philosophy fit (one model vs. two).
  3. GLMs still beat unconstrained ML on this kind of data. Without monotonicity priors and feature regularisation, a GBM is structurally too flexible for the signal-to-noise ratio of motor insurance. The right ML deployment in this domain is monotonic-constrained boosting plus calibration. The unconstrained version is a research point, not a production lift.