Ranking earthquake forecasts using proper scoring rules: Binary events in a low probability environment

Probabilistic earthquake forecasts can be used to estimate the chance of future earthquake hazards, or to model important risk quantities such as the number of fatalities, damage to infrastructure or economic losses. The Collaboratory for the Study of Earthquake Predictability (CSEP) is a global community initiative that seeks to make research on earthquake forecasting more open and rigorous, comparing different forecasts in a competitive setting and basing judgements of accuracy and usefulness on objective grounds. CSEP aims to compare the predictive performance of diverse earthquake forecasts generated by physical, stochastic or hybrid models built from a variety of input data, including past seismicity, deformation rates and fault maps.
A key notion in this context is that of “proper scoring rules” – rules for ranking forecasts that, on average, assign the highest score to the forecasting model closest to the actual statistical distribution generating the observations. In a recent paper, LML Fellow Maximilian Werner and colleagues examine why it is crucial for a scoring rule to be proper, and what the consequences are of using an improper one.
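As a concrete illustration (not drawn from the paper itself), the sketch below checks numerically that the log score for a single binary event is proper: its expectation under the true distribution is maximised when the forecast probability equals the true probability. The true probability and the grid of candidate forecasts are illustrative choices.

```python
# Minimal sketch (illustrative, not from the paper): the log score for a binary
# event is proper, i.e. its expectation under the true distribution is maximised
# when the forecast probability equals the true event probability.
import numpy as np

def log_score(p, y):
    """Log score of forecast probability p for a binary outcome y (1 = event occurred)."""
    return y * np.log(p) + (1 - y) * np.log(1 - p)

true_p = 0.02                                # assumed true (low) event probability
candidates = np.linspace(0.005, 0.10, 200)   # candidate forecast probabilities

# Expected score under the true distribution, computed analytically:
# E[S(p, Y)] = true_p * log(p) + (1 - true_p) * log(1 - p)
expected = true_p * np.log(candidates) + (1 - true_p) * np.log(1 - candidates)

best = candidates[np.argmax(expected)]
print(f"Expected log score is maximised near p = {best:.3f} (true p = {true_p})")
```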
To compare the performance of the scores fairly in a realistic framework, they simulate data from a known model and score that model against alternative models. In doing this, it is crucial to account for the uncertainty in the observed score difference. As they show, the propriety of a rule ensures that, at least on average, the rule provides the correct ranking. However, the score calculated from any finite set of observations can still be far from its average, and this uncertainty needs to be accounted for. In the paper, Werner and colleagues show how to express a preference between models using confidence intervals for the expected score difference. Importantly, this method introduces the possibility of not expressing a preference at all. Such an outcome is potentially useful, the authors argue, because it indicates that, under a given scoring rule, the forecasts may perform similarly, or that the data are not sufficient to distinguish between the models. This analysis offers scientists a stronger basis for evaluating the scientific hypotheses and models underlying the forecasts, and provides decision-makers with a more robust model ranking for hazard and risk applications.
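The following sketch illustrates, under simple assumptions, the confidence-interval ranking idea described above: binary events are simulated from a known set of low probabilities, two hypothetical candidate forecasts are scored with the log score, and a preference is expressed only if an approximate 95% interval for the mean score difference excludes zero. The model names, probabilities and normal approximation are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch of confidence-interval ranking for binary-event forecasts.
# All numbers and model definitions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)

n_bins = 5000                                          # space-time bins in a testing region
true_p = rng.uniform(0.001, 0.02, n_bins)              # "true" low event probabilities
model_a = np.clip(true_p * rng.lognormal(0.0, 0.2, n_bins), 1e-6, 1 - 1e-6)  # close to truth
model_b = np.clip(true_p * rng.lognormal(0.5, 0.2, n_bins), 1e-6, 1 - 1e-6)  # biased upwards

y = rng.binomial(1, true_p)                            # observed outcomes (event / no event per bin)

def log_score(p, y):
    return y * np.log(p) + (1 - y) * np.log(1 - p)

# Per-bin score differences; positive values favour model A.
diff = log_score(model_a, y) - log_score(model_b, y)
mean, se = diff.mean(), diff.std(ddof=1) / np.sqrt(n_bins)
lo, hi = mean - 1.96 * se, mean + 1.96 * se            # ~95% normal-approximation interval

if lo > 0:
    print(f"Prefer model A (CI = [{lo:.4f}, {hi:.4f}])")
elif hi < 0:
    print(f"Prefer model B (CI = [{lo:.4f}, {hi:.4f}])")
else:
    print(f"No preference expressed (CI = [{lo:.4f}, {hi:.4f}] contains zero)")
```

The third branch, where the interval contains zero, corresponds to the "no preference" outcome the authors highlight: either the forecasts perform similarly under this score, or the data are insufficient to separate them.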
The paper is available at https://academic.oup.com/gji/advance-article/doi/10.1093/gji/ggac124/6555032?login=false
