scores: A Python package for verifying and evaluating models and predictions with xarray

`scores` is a Python package containing mathematical functions for the verification, evaluation and optimisation of forecasts, predictions or models. It supports labelled n-dimensional (multidimensional) data, which is used in many scientific fields and in machine learning. At present, `scores` primarily supports the geoscience communities; in particular, the meteorological, climatological and oceanographic communities. `scores` not only includes common scores (e.g., Mean Absolute Error), it also includes novel scores not commonly found elsewhere (e.g., FIxed Risk Multicategorical (FIRM) score, Flip-Flop Index), complex scores (e.g., threshold-weighted continuous ranked probability score), and statistical tests (such as the Diebold Mariano test). It also contains isotonic regression which is becoming an increasingly important tool in forecast verification and can be used to generate stable reliability diagrams. Additionally, it provides pre-processing tools for preparing data for scores in a variety of formats including cumulative distribution functions (CDF). At the time of writing, `scores` includes over 50 metrics, statistical techniques and data processing tools. All of the scores and statistical techniques in this package have undergone a thorough scientific and software review. Every score has a companion Jupyter Notebook tutorial that demonstrates its use in practice. `scores` supports `xarray` datatypes, allowing it to work with Earth system data in a range of formats including NetCDF4, HDF5, Zarr and GRIB among others. `scores` uses Dask for scaling and performance. Support for `pandas` is being introduced. The `scores` software repository can be found at https://github.com/nci/scores/


Summary
scores is a Python package containing mathematical functions for the verification, evaluation and optimisation of forecasts, predictions or models. It supports labelled n-dimensional (multidimensional) data, which is used in many scientific fields and in machine learning. At present, scores primarily supports the geoscience communities; in particular, the meteorological, climatological and oceanographic communities.
scores not only includes common scores (e.g., Mean Absolute Error), it also includes novel scores not commonly found elsewhere (e.g., FIxed Risk Multicategorical (FIRM) score, Flip-Flop Index), complex scores (e.g., threshold-weighted continuous ranked probability score), and statistical tests (such as the Diebold Mariano test). It also contains isotonic regression, which is becoming an increasingly important tool in forecast verification and can be used to generate stable reliability diagrams. Additionally, it provides pre-processing tools for preparing data for scores in a variety of formats including cumulative distribution functions (CDF). At the time of writing, scores includes over 50 metrics, statistical techniques and data processing tools.
All of the scores and statistical techniques in this package have undergone a thorough scientific and software review. Every score has a companion Jupyter Notebook tutorial that demonstrates its use in practice.
scores supports xarray datatypes, allowing it to work with Earth system data in a range of formats including NetCDF4, HDF5, Zarr and GRIB among others. scores uses Dask for scaling and performance. Support for pandas is being introduced.
The scores software repository can be found at https://github.com/nci/scores/.
Labelled, n-dimensional data is widely used in many scientific fields. The Earth system science community makes heavy use of physics-based and machine learning models, both to process observations (such as identifying land use from satellite data) and to make predictions about the future (such as forecasting the weather). These models, predictions and forecasts undergo verification and evaluation to assess their correctness.
The purpose of scores is (a) to mathematically verify and validate models and predictions and (b) to foster research into new scores and metrics.
scores handles dimensionality and weighting (e.g., latitude weighting) more effectively than commonly used data science packages. While there are existing open source Python verification packages for labelled n-dimensional data (see "Related Software Packages" further below), none of these packages offers all of the key benefits of scores.
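To make the idea of dimension-aware, weighted scoring concrete, the following is a minimal sketch written directly against xarray. It does not use the scores API; the helper `weighted_mae`, its parameters and the toy data are hypothetical and chosen only for illustration. It computes a mean absolute error that is reduced over named spatial dimensions, with and without cosine-of-latitude weights.

```python
import numpy as np
import xarray as xr


def weighted_mae(fcst, obs, reduce_dims, weights=None):
    """Mean absolute error over the named dimensions in `reduce_dims`.

    Hypothetical helper for illustration only (not part of scores).
    """
    abs_err = abs(fcst - obs)
    if weights is not None:
        # xarray matches the weights to the data by dimension name.
        return abs_err.weighted(weights).mean(dim=reduce_dims)
    return abs_err.mean(dim=reduce_dims)


lat = [-60.0, 0.0, 60.0]
obs = xr.DataArray(
    np.zeros((2, 3, 2)),
    dims=("time", "lat", "lon"),
    coords={"time": [0, 1], "lat": lat, "lon": [0.0, 180.0]},
)
# Toy forecast: the error is 3 at the equator and 1 at +/-60 degrees.
err_by_lat = xr.DataArray([1.0, 3.0, 1.0], dims="lat", coords={"lat": lat})
fcst = obs + err_by_lat

# Grid cells shrink towards the poles, so weight by cos(latitude).
lat_weights = np.cos(np.deg2rad(obs["lat"]))

plain = weighted_mae(fcst, obs, reduce_dims=["lat", "lon"])
weighted = weighted_mae(fcst, obs, reduce_dims=["lat", "lon"], weights=lat_weights)
print(plain.values, weighted.values)
```

Because the reduction is expressed by dimension name rather than axis position, the `time` dimension is preserved in both results, and the weighted score gives the equatorial errors more influence than the high-latitude ones.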

Key Benefits of scores
To meet the needs of researchers and other users, scores provides the following key benefits.

Data Handling
• Works with labelled, n-dimensional data (e.g., geospatial, vertical and temporal dimensions) for both point-based and gridded data. scores can effectively handle the dimensionality, data size and data structures commonly used for:
  ◦ gridded Earth system data (e.g., numerical weather prediction models)
  ◦ tabular, point, latitude/longitude or site-based data (e.g., forecasts for specific locations).
• Handles missing data, masking of data and weighting of results.

Usability
• A companion Jupyter Notebook (Jupyter Team, 2024) tutorial for each metric and statistical test that demonstrates its use in practice.
• Novel scores not commonly found elsewhere (e.g., FIRM (Taggart et al., 2022), Flip-Flop Index (Griffiths et al., 2019, 2021)).
• Commonly used scores are also included, meeting user requests to use scores as a standalone package.
• All scores and statistical techniques have undergone a thorough scientific and software review.
• An area specifically to hold emerging scores which are still undergoing research and development. This provides a clear mechanism for people to share, access and collaborate on new scores, and to easily re-use versioned implementations of those scores.

Compatibility
• Highly modular: provides its own implementations, avoids extensive dependencies and offers a consistent API.
• Easy to integrate and use in a wide variety of environments. It has been used on workstations, servers and in high performance computing (supercomputing) environments.
• Maintains 100% automated test coverage.

Metrics, Statistical Techniques and Data Processing Tools Included in scores
At the time of writing, scores includes over 50 metrics, statistical techniques and data processing tools. For an up-to-date list, please see the scores documentation.

Categorical
Scores (including contingency table metrics) for evaluating forecasts of categories.
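As an illustration of what a contingency table metric involves, the sketch below counts the four outcomes of a binary event forecast and derives two standard scores from them. The function names and the example data are hypothetical and are not taken from the scores API; the event is assumed to be "value at or above a threshold".

```python
import numpy as np


def contingency_counts(fcst, obs, threshold):
    """Count the 2x2 contingency table for the event `value >= threshold`."""
    f = np.asarray(fcst) >= threshold
    o = np.asarray(obs) >= threshold
    return {
        "hits": int(np.sum(f & o)),
        "misses": int(np.sum(~f & o)),
        "false_alarms": int(np.sum(f & ~o)),
        "correct_negatives": int(np.sum(~f & ~o)),
    }


def probability_of_detection(counts):
    """Fraction of observed events that were forecast."""
    return counts["hits"] / (counts["hits"] + counts["misses"])


def false_alarm_ratio(counts):
    """Fraction of forecast events that did not occur."""
    return counts["false_alarms"] / (counts["hits"] + counts["false_alarms"])


# Toy rainfall amounts (mm); the event is "at least 1 mm".
fcst = [0.0, 2.5, 1.2, 0.4, 3.0, 0.1]
obs = [0.0, 1.5, 0.2, 1.1, 2.0, 0.0]

counts = contingency_counts(fcst, obs, threshold=1.0)
pod = probability_of_detection(counts)
far = false_alarm_ratio(counts)
print(counts, pod, far)
```

Both ratios are undefined when their denominators are zero (for example, when the event is never observed), so a fuller implementation would need to decide how to handle that case.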

Spatial
Scores that take into account spatial structure.

Statistical Tests
Tools to conduct statistical tests and generate confidence intervals.
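For readers unfamiliar with this kind of test, the following sketch shows the simplest one-step-ahead form of the Diebold-Mariano statistic, which compares the mean difference in loss between two forecasts with its estimated standard error. It is an assumption-laden illustration rather than the variant implemented in scores: it ignores autocorrelation in the loss differentials, applies no small-sample correction, and relies on the asymptotic normal approximation. The function name and data are hypothetical.

```python
import math


def diebold_mariano(loss_a, loss_b):
    """Return (statistic, two-sided p-value) for equal predictive accuracy.

    Simplest form: one-step-ahead forecasts, no autocovariance terms.
    """
    if len(loss_a) != len(loss_b) or len(loss_a) < 2:
        raise ValueError("need two loss series of equal length >= 2")
    d = [a - b for a, b in zip(loss_a, loss_b)]  # loss differentials
    n = len(d)
    d_bar = sum(d) / n
    gamma0 = sum((x - d_bar) ** 2 for x in d) / n  # variance of d
    if gamma0 == 0.0:
        raise ValueError("loss differentials are constant; statistic undefined")
    stat = d_bar / math.sqrt(gamma0 / n)
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Toy absolute errors from two competing forecasts, scored with squared loss.
errors_a = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
errors_b = [2.0, 3.0, 3.0, 3.0, 2.0, 4.0]
sq_a = [e ** 2 for e in errors_a]
sq_b = [e ** 2 for e in errors_b]

stat, p_value = diebold_mariano(sq_a, sq_b)
print(stat, p_value)
```

A negative statistic indicates that the first forecast has the lower average loss. With only six cases, as in this toy example, the normal approximation is not reliable; the example is intended only to show the mechanics.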

Processing Tools
Tools to pre-process data.
Data matching, discretisation, cumulative distribution function manipulation.
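One example of this kind of pre-processing is turning a set of ensemble members into a cumulative distribution function evaluated at fixed thresholds, so that it can be passed to a CDF-based score. The sketch below is a hypothetical numpy illustration, not the scores implementation; `empirical_cdf` and the sample values are assumptions made for the example.

```python
import numpy as np


def empirical_cdf(members, thresholds):
    """Fraction of ensemble members less than or equal to each threshold."""
    ordered = np.sort(np.asarray(members, dtype=float))
    # side="right" counts members that are <= the threshold.
    counts = np.searchsorted(ordered, thresholds, side="right")
    return counts / ordered.size


members = [2.0, 0.5, 1.0, 3.5]          # toy ensemble forecast
thresholds = np.array([0.0, 1.0, 2.0, 4.0])

cdf_values = empirical_cdf(members, thresholds)
print(cdf_values)
```

The result is non-decreasing in the threshold and runs from 0 to 1, which are the properties a downstream CDF-based score would typically expect.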

Use in Academic Work
In 2015, the Australian Bureau of Meteorology began developing a new verification system called Jive, which became operational in 2022. For a description of Jive see Loveday, Griffiths, et al. (2024). The Jive verification metrics have been used to support several publications (Foley & Loveday, 2020; Griffiths et al., 2017; Taggart, 2022a, 2022b, 2022c).
scores has arisen from the Jive verification system and provides Jive verification functions as a modular, open source package. scores also includes additional metrics that Jive does not contain.

Related Software Packages
There are multiple open source verification packages in a range of languages. Below is a comparison of scores to other open source Python verification packages. None of these include all of the metrics implemented in scores (and vice versa).
xskillscore (Bell et al., 2021) provides many but not all of the same functions as scores. The Jupyter Notebook tutorials in scores cover a wider array of metrics.
climpred (Brady & Spring, 2021) uses xskillscore combined with data handling functionality, and is focused on ensemble forecasts for climate and weather. climpred makes some design choices related to data structure (specifically associated with climate modelling) which may not generalise effectively to broader use cases. Releasing scores separately allows the differing design philosophies to be considered by the community.
METplus (Brown et al., 2021) is a substantial verification system used by weather and climate model developers. METplus includes a database and a visualisation system, with Python and shell script wrappers to use the MET package for the calculation of scores. MET is implemented in C++ rather than Python. METplus is used as a system rather than providing a modular Python API.
Verif (Nipen et al., 2023) is a command line tool for generating verification plots, whereas scores provides a Python API for generating numerical scores.
Pysteps (Imhoff et al., 2023; Pulkkinen et al., 2019) is a package for producing short-term ensemble predictions, focusing on probabilistic nowcasting of radar precipitation fields. It includes a significant verification submodule with many useful verification scores. Pysteps does not provide a standalone verification API.
PyForecastTools (Morley & Burrell, 2020) is a Python package for model and forecast verification which supports dmarray rather than xarray data structures and does not include Jupyter Notebook tutorials.

Table 1: A curated selection of the metrics, tools and statistical tests currently included in scores.