Ranking earthquake forecasts using proper scoring rules: binary events in a low probability environment

Probabilistic earthquake forecasts estimate the spatial and/or temporal evolution of seismicity and offer practical guidance to authorities during earthquake sequences, especially following notable earthquakes. They have been applied, for example, to monitor the seismicity after the 2010 Darfield earthquake in New Zealand, just prior to the Christchurch earthquake. In Italy, the Istituto Nazionale di Geofisica e Vulcanologia (INGV) produces regular probabilistic earthquake forecasts and ground-motion hazard forecasts to inform the Italian government on the risks associated with this natural hazard. INGV is working to use such probabilistic forecasts as a basis for modelling quantities important for operational loss forecasting, including the number of evacuated residents, the extent of damaged infrastructure, and the number of fatalities.
Supporting a wider uptake of such efforts requires further demonstrations of the operational utility of these forecasts and of the validity of the scientific hypotheses incorporated in the underlying models. The Collaboratory for the Study of Earthquake Predictability (CSEP) is a global community initiative that seeks to make earthquake research more rigorous through an open-science approach, by facilitating the competitive comparison of forecasts against future data in pre-defined testing regions. In a recent paper, LML External Fellow Maximilian Werner and colleagues focus on comparing the different forecasts that can be made from such competing models in the light of observed data.
As they point out, one metric frequently used to rank competing models is the Parimutuel Gambling score, which allows alarm-based forecasts to be compared with probabilistic ones. They examine the suitability of this score for ranking competing earthquake forecasts, first proving analytically that the score is in general ‘improper’: it does not, on average, prefer the model that generated the data, even when that model is among the candidates. In the special case where it is proper, the authors show it can still be applied in an improper way. The researchers then compare the performance of this metric with two commonly used proper scores, the Brier and logarithmic scores, taking into account the uncertainty around the observed average score.
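To make the distinction concrete, the following is a minimal sketch (not the authors' code) of how two probabilistic forecasts of a binary event, such as whether an earthquake above a magnitude threshold occurs in a given space-time bin, can be ranked with the Brier and logarithmic scores. All probabilities and bin counts here are illustrative assumptions.

```python
# Minimal sketch: ranking two binary-event forecasts with two proper
# scoring rules, the Brier score and the logarithmic score.
# Scores are written as losses, so lower is better (a convention chosen here).
import numpy as np

rng = np.random.default_rng(42)

p_true = 0.02          # assumed data-generating event probability per bin
n_bins = 50_000        # assumed number of space-time bins observed

# Two competing forecasts: one matches the data-generating probability,
# the other overstates it.
forecasts = {"true model": 0.02, "overstated model": 0.05}

# Simulated binary observations (1 = event occurred in the bin).
x = rng.binomial(1, p_true, size=n_bins)

def brier(p, x):
    """Brier score (mean squared error) for binary outcomes; lower is better."""
    return np.mean((p - x) ** 2)

def log_score(p, x):
    """Mean negative log likelihood of the outcomes; lower is better."""
    return np.mean(-(x * np.log(p) + (1 - x) * np.log(1 - p)))

for name, p in forecasts.items():
    print(f"{name}: Brier = {brier(p, x):.5f}, log = {log_score(p, x):.5f}")
```

Because both rules are proper, the forecast that matches the data-generating probability attains the lower expected score; in any finite sample the observed ranking still fluctuates, which is why the paper also considers the uncertainty around the observed average score.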
Among other things, the analysis clarifies how much data a test requires, in principle, to express a preference towards a particular forecast. Such thresholds may be used in experimental design to specify the duration, time windows and spatial discretisation of earthquake models and forecasts.
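As a rough illustration of such a data threshold, the back-of-envelope sketch below (again hedged, and not the paper's procedure) estimates how many bins are needed before the average log-score difference between two forecasts exceeds roughly two standard errors, using a simple normal approximation; the probabilities are assumed for illustration.

```python
# Hedged, back-of-envelope estimate of the number of bins needed for the
# mean log-score difference between two forecasts to stand out from noise.
import numpy as np

p_true = 0.02   # assumed data-generating event probability per bin
q_alt  = 0.05   # assumed competing forecast probability

def per_bin_log_loss(p, x):
    return -(x * np.log(p) + (1 - x) * np.log(1 - p))

# Per-bin score difference (competitor minus true model) for each outcome,
# then its mean and variance under the data-generating model.
diff_event    = per_bin_log_loss(q_alt, 1) - per_bin_log_loss(p_true, 1)
diff_no_event = per_bin_log_loss(q_alt, 0) - per_bin_log_loss(p_true, 0)
mean_diff = p_true * diff_event + (1 - p_true) * diff_no_event
var_diff  = (p_true * diff_event**2 + (1 - p_true) * diff_no_event**2) - mean_diff**2

# Bins needed for the mean difference to exceed ~2 standard errors.
z = 2.0
n_required = (z * np.sqrt(var_diff) / mean_diff) ** 2
print(f"expected per-bin difference: {mean_diff:.5f}")
print(f"approximate bins needed: {int(np.ceil(n_required))}")
```

The rarer the events and the closer the competing probabilities, the more bins are needed before a preference can be expressed, which is the kind of consideration that feeds into the choices of duration, time windows and spatial discretisation mentioned above.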
The paper is available at https://arxiv.org/pdf/2105.12065.pdf
