Skill Scores

Visualizing model performance provides qualitative insights, but quantitative metrics are essential for objective assessment and comparison. Skill scores serve this purpose by providing numerical measures of how well a model’s predictions match observed data. ModelSkill facilitates the calculation of these key statistics through its Comparer and ComparerCollection objects.

Comparer

The Comparer object (e.g., comparer_116l1) calculates skill scores for a single observation-model pair.

Skill Table

The skill() method returns a SkillTable object, which is a specialized data structure provided by ModelSkill for presenting multiple skill scores in a clear, tabular format.

sk_single = comparer_116l1.skill()
sk_single
               n      bias      rmse     urmse       mae        cc        si        r2
observation
116l1        121  0.003229  0.068114  0.068037  0.054517  0.991234  0.172594  0.982405

You’ll notice several metrics listed (e.g., bias, rmse, cc). We’ll cover the definitions of common metrics in more detail at the end of this section.

The SkillTable object can be converted to a Pandas DataFrame using sk_single.to_dataframe().
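
For example, a minimal sketch of continuing the analysis in Pandas (the rounding and column selection below are purely illustrative):

# Convert the SkillTable to a regular Pandas DataFrame
df_single = sk_single.to_dataframe()

# From here, any Pandas operation applies, e.g. rounding or selecting columns
print(df_single.round(3))
print(df_single[['bias', 'rmse']])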

To get a subset of metrics, pass a list of metric names to the metrics argument.

sk_subset_single = comparer_116l1.skill(metrics=['rmse', 'bias', 'nse'])
sk_subset_single
               n      rmse      bias       nse
observation
116l1        121  0.068114  0.003229  0.982405

Single Score

Use score() for direct access to a single numerical value for a specific metric. If model results within the Comparer are named (as in this example with “MIKE+”), this method returns a dictionary where keys are model names.

rmse_val_dict = comparer_116l1.score(metric='rmse')
print(f"RMSE for MIKE+ at 116l1: {rmse_val_dict['MIKE+']:.4f}")

bias_val_dict = comparer_116l1.score(metric='bias')
print(f"Bias for MIKE+ at 116l1: {bias_val_dict['MIKE+']:.4f}")
RMSE for MIKE+ at 116l1: 0.0681
Bias for MIKE+ at 116l1: 0.0032

ComparerCollection

The ComparerCollection (e.g., cc) assesses model performance across multiple observation points.

Skill Table

Calling skill() on a ComparerCollection returns a SkillTable object summarizing skill for each Comparer within the collection.

sk_coll = cc.skill()
sk_coll
               n      bias      rmse     urmse       mae        cc        si        r2
observation
116l1        121  0.003229  0.068114  0.068037  0.054517  0.991234  0.172594  0.982405
12l1         121 -0.004083  0.063414  0.063282  0.049679  0.971574  0.305942  0.942928

You can request specific metrics for all comparisons.

sk_subset_coll = cc.skill(metrics=['rmse', 'bias'])
sk_subset_coll
               n      rmse      bias
observation
116l1        121  0.068114  0.003229
12l1         121  0.063414 -0.004083

Mean Skill Table

The mean_skill() method calculates average skill scores across all locations, presented in a SkillTable.

sk_mean = cc.mean_skill()
sk_mean
         n      bias      rmse     urmse       mae        cc        si        r2
model
MIKE+  242 -0.000427  0.065764   0.06566  0.052098  0.981404  0.239268  0.962667

And for specific metrics:

sk_mean_subset = cc.mean_skill(metrics=['rmse', 'bias', 'nse'])
sk_mean_subset
         n      rmse      bias       nse
model
MIKE+  242  0.065764 -0.000427  0.962667

The mean_skill() method allows for weighted averages. You can provide weights for each observation if, for example, you trust some observation points more than others or if they represent areas of different importance. See the ModelSkill documentation for details on applying weights.
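
As a hedged sketch, assuming weights can be passed as a dictionary keyed by observation name (the weight values below are purely illustrative):

# Hypothetical weighting: trust 116l1 twice as much as 12l1
sk_mean_weighted = cc.mean_skill(weights={'116l1': 2.0, '12l1': 1.0})
sk_mean_weighted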

Score

The score() method on a ComparerCollection calculates a score for each model across all relevant observations. It returns a Python dictionary where keys are the model names (e.g., ‘MIKE+’) and values are these scores (e.g., mean RMSE for ‘MIKE+’). This provides a single summary value for each model’s performance on a specific metric.

# For our ComparerCollection 'cc' containing one model named "MIKE+"
score_rmse_scores = cc.score(metric='rmse')
print(f"Mean RMSE for models: {score_rmse_scores}")

score_bias_scores = cc.score(metric='bias')
print(f"Mean Bias for models: {score_bias_scores}")
Mean RMSE for models: {'MIKE+': 0.06576367286619714}
Mean Bias for models: {'MIKE+': -0.00042693940822553073}

Similar to mean_skill(), the score() method on a ComparerCollection also supports weighting. This enables you to calculate a weighted mean score (e.g., weighted RMSE) for each model across all observations.
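
A possible sketch, assuming score() accepts the same weights argument as mean_skill() (check the signature in your installed ModelSkill version; the weight values are illustrative):

# Weighted mean RMSE per model, with hypothetical per-observation weights
weighted_rmse_scores = cc.score(metric='rmse', weights={'116l1': 2.0, '12l1': 1.0})
print(weighted_rmse_scores)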

Working with SkillTables

SkillTable objects are more than just static tables; they offer several useful features for analysis and presentation.

Sorting Values

You can sort the SkillTable by any of its columns (metrics or identifiers). This is useful for ranking models or observations.

# Sort by RMSE in ascending order
sk_coll_sorted = sk_coll.sort_values('rmse', ascending=True)
sk_coll_sorted
               n      bias      rmse     urmse       mae        cc        si        r2
observation
12l1         121 -0.004083  0.063414  0.063282  0.049679  0.971574  0.305942  0.942928
116l1        121  0.003229  0.068114  0.068037  0.054517  0.991234  0.172594  0.982405

Styling Tables

SkillTable objects integrate with Pandas’ styling capabilities, allowing you to highlight important values, apply color maps, or format numbers for better readability in Jupyter environments.

sk_coll.style()
               n   bias   rmse  urmse    mae     cc     si     r2
observation
116l1        121  0.003  0.068  0.068  0.055  0.991  0.173  0.982
12l1         121 -0.004  0.063  0.063  0.050  0.972  0.306  0.943
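
For finer control, one option is to go through to_dataframe() and use standard Pandas styling; the colour map, column subset, and precision below are illustrative choices, not ModelSkill defaults:

# Colour-grade the RMSE column and round all values to 3 decimals
(
    sk_coll.to_dataframe()
    .style.background_gradient(cmap='RdYlGn_r', subset=['rmse'])
    .format(precision=3)
)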

Plotting Skills

SkillTable objects have a .plot accessor for quickly visualizing skill scores, such as creating bar charts of metrics.

# Bar plot of RMSE for each observation point
sk_coll["rmse"].plot.bar()
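
The plot methods are assumed to return a Matplotlib Axes, so the figure can be customized further; the axis label below is an illustrative assumption about the data's unit:

import matplotlib.pyplot as plt

# Customize the bar chart via the returned Axes (assumed behaviour; verify in your version)
ax = sk_coll["rmse"].plot.bar()
ax.set_ylabel("RMSE [m]")  # hypothetical unit label
plt.show()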

These are just a few examples. The SkillTable’s .style and .plot accessors offer more customization. Refer to the ModelSkill documentation and Pandas styling documentation for further details.

Useful Skill Metrics

ModelSkill calculates numerous metrics. The choice of metrics depends on your modelling goals. Some useful metrics include the following (see the sketch after this list):

  • Bias (bias): Average difference (Modeled - Observed). Ideal: 0.
  • RMSE (Root Mean Square Error) (rmse): Typical magnitude of error. Ideal: 0.
  • NSE (Nash-Sutcliffe Efficiency) (nse): Measures the predictive power of the model compared to using the mean of the observed data as the prediction. Ranges from \(-\infty\) to 1. Ideal: 1.
  • KGE (Kling-Gupta Efficiency) (kge): A composite metric evaluating correlation, bias, and variability components. Ranges from \(-\infty\) to 1. Ideal: 1.
  • Willmott’s Index of Agreement (willmott): Measures the degree of model prediction error, standardized by observed variability. Ranges from 0 to 1. Ideal: 1.
  • Peak Ratio (pr): Ratio of the maximum modeled value to the maximum observed value over the matched time period. Ideal: 1.0.
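
To make these definitions concrete, here is a small NumPy sketch computing bias, RMSE, and NSE from matched observed and modelled values (the arrays are made-up illustration data, not taken from the example above):

import numpy as np

obs = np.array([0.50, 0.62, 0.71, 0.66, 0.58])  # observed values (illustrative)
mod = np.array([0.52, 0.60, 0.74, 0.65, 0.57])  # modelled values (illustrative)

bias = np.mean(mod - obs)                  # average error, ideal 0
rmse = np.sqrt(np.mean((mod - obs) ** 2))  # typical error magnitude, ideal 0
nse = 1 - np.sum((mod - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)  # ideal 1

print(f"bias={bias:.4f}, rmse={rmse:.4f}, nse={nse:.4f}")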

You can change the default list of metrics that are used by skill() and mean_skill() as follows:

ms.set_option("metrics.list", ['bias', 'rmse', 'nse', 'kge', 'willmott', 'pr'])
cc.skill()
               n      bias      rmse       nse       kge  willmott        pr
observation
116l1        121  0.003229  0.068114  0.982405  0.977502  0.995498  0.944704
12l1         121 -0.004083  0.063414  0.942928  0.932798  0.984693  0.891797

Note: Reset to the default values with ms.reset_option("metrics.list").

For a comprehensive list of all available metrics and their precise definitions, please refer to the official ModelSkill API documentation for metrics.

ModelSkill’s metrics are extensible. You can define and use your own custom skill score functions if needed. If you believe a metric would be broadly useful, consider suggesting it for inclusion in ModelSkill via a GitHub issue.
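
As a sketch of the idea, the function below defines a simple custom metric and passes it to skill() alongside a built-in metric name. This assumes skill() accepts plain callables taking (obs, model) arrays; check the ModelSkill metrics documentation for the exact convention in your version.

import numpy as np

def max_abs_error(obs, model) -> float:
    # Largest absolute deviation between model and observations (illustrative custom metric)
    return float(np.max(np.abs(model - obs)))

cc.skill(metrics=['rmse', max_abs_error])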