Abstract
Forecasting and forecast evaluation are inherently sequential tasks. Predictions are often issued on a regular basis, such as every hour, day, or month, and their quality is monitored continuously. However, the classical statistical tools for forecast evaluation are static, in the sense that statistical tests for forecast calibration are only valid if the evaluation period is fixed in advance. Recently, e-values have been introduced as a new, dynamic method for assessing statistical significance. An e-value is a nonnegative random variable with expected value at most one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a conservative p-value. Since they naturally lead to statistical tests that are valid under optional stopping, e-values are particularly suitable for sequential forecast evaluation. This article proposes e-values for testing probabilistic calibration of forecasts, which is one of the most important notions of calibration. The proposed methods are also more generally applicable for sequential goodness-of-fit testing. We demonstrate in a simulation study that the e-values are competitive in terms of power when compared to extant methods that do not allow for sequential testing. In this context, we introduce test power heat matrices, a graphical tool to compactly visualize results of simulation studies on test power. In a case study we show that the e-values provide new and useful insights in the evaluation of probabilistic weather forecasts.
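To illustrate the idea of the abstract, the following is a minimal sketch (not the authors' proposed method) of a sequential e-value test for probabilistic calibration. Under the null, probability integral transform (PIT) values are uniform on [0, 1]; the betting factor 1 + λ(2z − 1) is nonnegative for |λ| ≤ 1 and has expectation one under uniformity, so its running product is an e-process whose value at any stopping time is a valid e-value. The choice λ = 0.5 and the power transform used to mimic miscalibration are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def e_process(pits, lam=0.5):
    """Running product of betting factors 1 + lam*(2z - 1).

    For |lam| <= 1 each factor is nonnegative and has expectation 1
    when z is uniform on [0, 1], so the cumulative product is an
    e-process: stopping at any time yields a valid e-value.
    """
    factors = 1.0 + lam * (2.0 * np.asarray(pits) - 1.0)
    return np.cumprod(factors)

# Calibrated forecasts: PIT values are uniform, the e-process stays small.
e_null = e_process(rng.uniform(size=2000))

# Miscalibrated forecasts: PIT values skewed toward 1 (illustrative).
e_alt = e_process(rng.uniform(size=2000) ** 0.5)

# By Ville's inequality, rejecting once the e-process exceeds 1/alpha
# gives a level-alpha sequential test.
alpha = 0.05
print("final e-value (calibrated):", e_null[-1])
print("threshold 1/alpha crossed (miscalibrated):",
      bool((e_alt >= 1 / alpha).any()))
```

Because the test remains valid under optional stopping, monitoring can continue as new forecasts arrive, and evaluation may be halted as soon as the e-process crosses 1/α.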
Funding Statement
This work was supported by the Swiss National Science Foundation.
Acknowledgments
The authors are grateful to Sebastian Lerch for providing data for the case study and thank Timo Dimitriadis, Tilmann Gneiting, and the members of his group for valuable discussions and inputs. Valuable comments by Aaditya Ramdas and an anonymous reviewer helped us to improve this article. Computations have been performed on UBELIX (https://ubelix.unibe.ch/), the HPC cluster of the University of Bern.
Citation
Sebastian Arnold, Alexander Henzi, Johanna F. Ziegel. "Sequentially valid tests for forecast calibration." Ann. Appl. Stat. 17(3): 1909-1935, September 2023. https://doi.org/10.1214/22-AOAS1697