ModelSkill comes with a comprehensive set of skill metrics. This page is a when-to-use guide: how the metrics differ, which one answers which question, and the common pitfalls. Each metric name links to its full mathematical definition in the metrics API reference.
Metrics are passed to skill (and mean_skill, score) as lower-case strings or as callables.
Construct a comparer from observation and model data, then name the metrics you want in the metrics= argument.
In the tables below the first column gives the metric string ModelSkill accepts; the alias (a longer synonym that also works) is listed underneath it. Both link to the API.
When you pass no metrics= argument, skill reports the default set — the point count n plus bias, rmse, urmse, mae, cc, si, r2. You can change the default via ms.options.metrics.list = [...].
Error magnitude — same units as the data (scale-dependent)
These are in the data’s units (m, m³/s, …), so they are not comparable across stations of different magnitude: at the same relative accuracy, a 5,000 m³/s river will show a far larger RMSE than a 50 m³/s tributary.
max_error: Largest single absolute error, i.e. the worst-case miss.
Use when: you want the error in physical units. bias for systematic offset, rmse as the default, mae when a few outliers shouldn’t dominate, urmse to isolate random error from bias, max_error for safety-/threshold-critical checks.
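The relationship between these metrics can be checked by hand. The following is a minimal sketch in plain Python, not ModelSkill's implementation; it assumes the residual is defined as model minus observation and that urmse is the RMSE of the residual after removing its mean. The function name `error_metrics` and the sample values are illustrative only.

```python
import math

def error_metrics(obs, model):
    """Return bias, rmse, urmse, mae and max_error for paired values."""
    res = [m - o for o, m in zip(obs, model)]  # residual = model - obs (assumed sign)
    n = len(res)
    bias = sum(res) / n
    rmse = math.sqrt(sum(r * r for r in res) / n)
    # Unbiased RMSE: spread of the residual around its own mean
    urmse = math.sqrt(sum((r - bias) ** 2 for r in res) / n)
    mae = sum(abs(r) for r in res) / n
    max_error = max(abs(r) for r in res)
    return {"bias": bias, "rmse": rmse, "urmse": urmse,
            "mae": mae, "max_error": max_error}

obs = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [1.5, 2.5, 3.5, 4.5, 8.0]  # constant +0.5 offset plus one large miss

stats = error_metrics(obs, model)
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

With this data the single large miss pulls rmse (about 1.41) well above mae (1.0), and rmse² equals bias² + urmse², which is why urmse and bias together split the error into a random and a systematic part.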
Relative / normalised error — dimensionless
Divide the error by a scale, so they can be compared across stations — but they divide by observed values, so they are unstable when observations pass through zero (watch out for water level around a datum).
mape: Mean absolute percentage error, mean(|model − obs| / |obs|). Intuitive as a percentage, but blows up near obs = 0.
Use when: comparing error across stations of very different magnitude, and the observed values stay comfortably away from zero. For water level, prefer the efficiency scores below over mape/si.
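The near-zero instability is easy to reproduce. The sketch below is plain Python written for illustration, not ModelSkill's code; it implements the formula mean(|model − obs| / |obs|) as a fraction (multiply by 100 for a percentage), and both sample series are made up.

```python
def mape(obs, model):
    """Mean absolute percentage error as a fraction: mean(|model - obs| / |obs|)."""
    return sum(abs(m - o) / abs(o) for o, m in zip(obs, model)) / len(obs)

# The model is off by the same 0.05 at every point in both cases.
far_from_zero = [10.0, 12.0, 11.0, 13.0]   # e.g. a discharge-like series
crossing_zero = [0.5, 0.01, -0.4, 0.3]     # e.g. water level around a datum

mape_far = mape(far_from_zero, [o + 0.05 for o in far_from_zero])
mape_near = mape(crossing_zero, [o + 0.05 for o in crossing_zero])

print(f"far from zero : {100 * mape_far:.2f} %")
print(f"crossing zero : {100 * mape_near:.2f} %")
```

The absolute error is identical, yet the score goes from under 1 % to over 100 % because one observation (0.01) sits next to zero. An observation of exactly zero would make this sketch raise ZeroDivisionError.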
Efficiency / skill scores — dimensionless, comparable across stations
The “one number for overall skill” family. All reference the observed mean or variance. For NSE, a score of 0 means “no better than predicting the observed mean”; for KGE the same mean benchmark scores 1 − √2 ≈ −0.41 rather than 0.
RMSE / std(obs) = √(1 − NSE). Same information as NSE, expressed as an error (lower is better).
Use when: you need a single dimensionless score to rank models or compare stations. Reach for kge when you want to diagnose why a model fails (it separates correlation, bias, and variance); nse is the hydrology standard for overall predictive power; willmott if you want a strictly bounded [0,1] score.
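To make the difference between nse and kge concrete, here is a hedged sketch in plain Python. It assumes the standard NSE definition and the 2009 formulation of KGE (correlation r, standard-deviation ratio alpha, mean ratio beta); ModelSkill's own implementation may differ in detail, and the helper names are illustrative.

```python
import math

def _mean(x):
    return sum(x) / len(x)

def _std(x):
    mu = _mean(x)
    return math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))

def nse(obs, model):
    """Nash-Sutcliffe efficiency: 1 - sum((obs-model)^2) / sum((obs-mean(obs))^2)."""
    mu = _mean(obs)
    num = sum((o - m) ** 2 for o, m in zip(obs, model))
    den = sum((o - mu) ** 2 for o in obs)
    return 1.0 - num / den

def kge_components(obs, model):
    """Return (r, alpha, beta): correlation, std ratio and mean ratio (model/obs)."""
    mo, mm = _mean(obs), _mean(model)
    so, sm = _std(obs), _std(model)
    cov = sum((o - mo) * (m - mm) for o, m in zip(obs, model)) / len(obs)
    return cov / (so * sm), sm / so, mm / mo

def kge(obs, model):
    """Kling-Gupta efficiency (2009 form, assumed here)."""
    r, alpha, beta = kge_components(obs, model)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = [2.0, 4.0, 6.0, 8.0, 10.0]
model = [1.5 * o for o in obs]  # right timing, 50 % too large in mean and amplitude

r, alpha, beta = kge_components(obs, model)
rmse = math.sqrt(sum((o - m) ** 2 for o, m in zip(obs, model)) / len(obs))
rsr = rmse / _std(obs)  # RMSE / std(obs), which equals sqrt(1 - NSE)

print(f"NSE = {nse(obs, model):.3f}, KGE = {kge(obs, model):.3f}, RSR = {rsr:.3f}")
print(f"r = {r:.2f}, alpha = {alpha:.2f}, beta = {beta:.2f}")
```

NSE only reports that the model is worse than the observed mean (a negative score), while the KGE components show where the problem lies: correlation is perfect and the mean and the variability are each 50 % too large.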
cc, ev and r2: a nested hierarchy
A common question is how cc, ev and r2 relate. They answer increasingly strict versions of “how much of the observed variation does the model capture?”, each penalising one more kind of error:
| Score | Penalises bias (offset)? | Penalises wrong amplitude? | Scored against |
| --- | --- | --- | --- |
| cc² (square of cc) | no | no | the best-fit line (free slope + intercept) |
| ev | no | yes | obs variance, ignoring a constant offset |
| r2 (= nse) | yes | yes | the 1:1 line |
For the same data this gives the ordering cc² ≥ ev ≥ r2: cc² forgives both a constant offset and a wrong amplitude, ev forgives only the offset, and r2/nse forgives nothing (it scores against the 1:1 line). The gap ev − r2 is exactly the squared bias normalised by the observed variance, bias² / var(obs).
In practice, report cc (timing/phase) plus one efficiency score (nse or kge) plus bias separately. That trio localises a failure to phase, offset, or amplitude, which no single number can. cc² rarely adds anything once you already report cc, and ev (scikit-learn’s explained_variance_score) sits between cc² and r2 and is seldom reported on its own in water modelling.
Warning: r2 is the coefficient of determination, not squared correlation
ModelSkill’s r2 equals NSE — they are the same number under two names — not squared Pearson correlation. The two generally differ (r2 penalises bias, cc² does not). If you want squared correlation, compute cc and square it yourself.
Correlation & amplitude — dimensionless
These ignore systematic bias, so always read them alongside bias.
hit_ratio: Fraction of points within an acceptable deviation a of the observation. Takes an a= argument.
Use when: peak_ratio for storm-surge / flood-peak capture. hit_ratio for acceptance-criterion reporting, e.g. “X % of points within ±0.1 m” (set the tolerance with a=).
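Two of these ideas are simple enough to sketch in plain Python: correlation is unchanged by a constant offset, and a hit ratio counts the share of points inside a tolerance. This is an illustration only, not ModelSkill's implementation; in particular, treating the tolerance boundary as exclusive (`<`) is an assumption, and the sample values are made up.

```python
import math

def cc(obs, model):
    """Pearson correlation coefficient."""
    n = len(obs)
    mo, mm = sum(obs) / n, sum(model) / n
    cov = sum((o - mo) * (m - mm) for o, m in zip(obs, model))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sm = math.sqrt(sum((m - mm) ** 2 for m in model))
    return cov / (so * sm)

def bias(obs, model):
    """Mean of model - obs."""
    return sum(m - o for o, m in zip(obs, model)) / len(obs)

def hit_ratio(obs, model, a=0.1):
    """Fraction of points with |model - obs| < a (exclusive boundary assumed)."""
    return sum(abs(m - o) < a for o, m in zip(obs, model)) / len(obs)

obs = [0.2, 0.8, 1.4, 0.9, 0.3]
shifted = [o + 1.0 for o in obs]        # same shape, sitting 1 m too high
close = [0.25, 0.78, 1.60, 0.95, 0.31]  # small errors except at the peak

print(f"shifted: cc = {cc(obs, shifted):.3f}, bias = {bias(obs, shifted):.2f}, "
      f"hit ratio = {hit_ratio(obs, shifted, a=0.1):.2f}")
print(f"close  : cc = {cc(obs, close):.3f}, bias = {bias(obs, close):.2f}, "
      f"hit ratio = {hit_ratio(obs, close, a=0.1):.2f}")
```

The shifted series correlates perfectly yet has no point within ±0.1 m, while the close series misses the tolerance only at the peak, giving a hit ratio of 0.8.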
Directional / circular — for direction variables only
For wind / wave / current direction, where 359° and 1° are 2° apart, not 358°. These require the quantity to be flagged directional (Quantity(..., is_directional=True)); they handle the 0–360° wrap-around. See the directional data example.
Use when: the variable is an angle. Don’t use the scalar metrics above on direction data.
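The wrap-around problem can be shown with a few lines of plain Python. This is a sketch of the underlying idea (wrapping the residual into [−180°, 180°)), not the code behind ModelSkill's directional metrics; `angular_residual` is a hypothetical helper name.

```python
def angular_residual(obs_deg, model_deg):
    """Signed model - obs difference, wrapped to the interval [-180, 180) degrees."""
    return (model_deg - obs_deg + 180.0) % 360.0 - 180.0

obs = [359.0, 1.0, 350.0, 10.0]
model = [1.0, 359.0, 5.0, 355.0]

# Treating directions as ordinary numbers gives absurdly large errors.
naive_mae = sum(abs(m - o) for o, m in zip(obs, model)) / len(obs)

# Wrapping first gives the shortest angular distance.
wrapped = [angular_residual(o, m) for o, m in zip(obs, model)]
circular_mae = sum(abs(r) for r in wrapped) / len(wrapped)

print("wrapped residuals:", wrapped)
print(f"naive MAE = {naive_mae:.1f} deg, circular MAE = {circular_mae:.1f} deg")
```

The scalar calculation reports a mean error above 350°, whereas the wrapped residuals show the model is never more than 15° off.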
Quick decision guide
| Your question | Reach for |
| --- | --- |
| Is the model systematically high or low? | bias |
| Does it under-/over-respond in amplitude? | lin_slope |
| How big is a typical error, in my units? | rmse (penalise big misses) or mae (robust) |
| Split random vs systematic error? | urmse vs bias (rmse² = bias² + urmse²) |
| What’s the worst single miss? | max_error |
| One dimensionless score, comparable across stations? | nse or kge |
| Why does the model fail (corr / bias / variance)? | kge |
| Compare error across very different-sized stations? | si or nse/kge (avoid rmse) |
| Timing / phase agreement? | cc |
| Do we capture the storm peaks? | peak_ratio |
| What fraction meets an acceptance tolerance? | hit_ratio (set a=) |
| Working with direction (wind/wave/current)? | the c_* family |
Cross-cutting reminders
Scale-dependent vs dimensionless. The error-magnitude metrics are in data units — never compare them across stations of different magnitude. The rest are dimensionless and comparable.
Always pair correlation with bias. cc/rho are blind to offset; a model can correlate perfectly while sitting 1 m too high.
Division-by-obs metrics (mape, si) are fragile near zero observations.
Custom metrics: any f(obs, model) -> float callable drops straight into metrics=[...]; the column takes the function name. See the custom metric example.
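As an illustration of the f(obs, model) -> float shape, here is a hypothetical custom metric in plain Python: the 90th percentile of the absolute error, using the nearest-rank method. The name `abs_error_p90` and the choice of percentile are assumptions made for this example; the commented line at the end shows how such a function would be listed alongside built-in names, with `cmp` standing for a comparer that is not constructed here.

```python
import math

def abs_error_p90(obs, model) -> float:
    """Hypothetical custom metric: 90th percentile of |model - obs| (nearest rank)."""
    errors = sorted(abs(m - o) for o, m in zip(obs, model))
    rank = max(1, math.ceil(0.9 * len(errors)))  # 1-based nearest-rank index
    return float(errors[rank - 1])

# Stand-alone check with made-up values: errors are 1, 2, ..., 10
obs = [0.0] * 10
model = [float(i) for i in range(1, 11)]
p90 = abs_error_p90(obs, model)
print(p90)

# Usage sketch, assuming an existing comparer `cmp`:
# cmp.skill(metrics=["rmse", abs_error_p90])
```

Because the result column takes the function name, a descriptive name such as `abs_error_p90` keeps the skill table readable.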