During the 2016 LML summer school, Max and I ended up discussing the irreproducibility crisis in science, and that discussion led to a draft manuscript, published today on the arXiv (arXiv:1706.07773); see also this earlier blog post. Over the last few years several (apparently reproducible!) studies have come out that confirm the crisis. For instance, a survey of scientists conducted by Nature found that 70% of respondents had at some point tried but failed to reproduce a published result.
Our discussion was motivated by some curious aspects of earthquake statistics, to which we’ll come in a moment. This is one of Max’s long-standing research interests: what can we learn from the statistics of earthquakes? What kinds of forecasts or predictions can we make, based on a mix of physical understanding and statistical data? How good are the forecasts? Max addresses these questions as a group leader in the Collaboratory for the Study of Earthquake Predictability (CSEP), which was initiated by the Southern California Earthquake Center (SCEC). CSEP’s aim is to evaluate earthquake forecast models in an independent, rigorous and truly prospective manner. Given earthquake prediction’s murky past, CSEP aims for transparency, replicability, a controlled environment and community-endorsed standards. Seismologists develop models based on a variety of hypotheses and install them on CSEP’s servers; CSEP runs these models automatically and compares them against new, incoming data from an independent source.
Of course, the holy grail of earthquake science is to make reliable and precise predictions of earthquakes and save lives. But the problem is fiendishly difficult. For instance, we don’t know, at present, how to make predictions that would pinpoint dangerous future earthquakes accurately enough in time and space to allow viable evacuations before they happen. On the other hand, we do know some things. We can draw maps of earthquake frequency that have some predictive skill. For example, regions that have had more earthquakes in the past will generally have more earthquakes in the future; these maps often show what we interpret as fault lines and the boundaries of tectonic plates. We are even beginning to update such maps dynamically to capture the temporal clustering of earthquakes.
This very practical problem quickly leads into conceptually challenging territory. We may not be able to make reliable predictions of the type “there will be a magnitude 5.2 earthquake in Tooting tomorrow afternoon”, but we can make broader forecasts such as “the likelihood of a magnitude-5.2 earthquake tomorrow afternoon is lower in Tooting than in San Francisco.” But what does that mean? A likelihood is either a frequency, over time or over an ensemble, or it is a subjective belief. Since we’ve fixed the time variable at tomorrow afternoon, we can only be speaking about an ensemble or a subjective belief. But what ensemble? We only have one Tooting and one San Francisco. You can see we had a lot of fun in this discussion, and it led us quite naturally to ensemble theory and the ergodicity problem, one of my own favourites.
The particular feature of earthquake statistics that we discussed is this: when comparing different forecast models, one model may be consistently better than the others over some time period. The stability of this observation over time suggests that the model captures a meaningful aspect of the physical process, even if we don’t know what that aspect is. Surprisingly, despite this stability, a single earthquake can occasionally reverse the models’ apparent relative quality: what seemed like a real effect may turn out not to be reliable.
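A toy calculation, with entirely made-up numbers (this is only an illustration, not CSEP’s actual testing procedure), shows how a long-standing ranking can be reversed by a single rare event:

```python
import math

# Hypothetical forecasts: each observed event falls into a "small" or
# "large" magnitude bin, and each model assigns the bin a probability.
# The models are compared by the cumulative log-likelihood ratio
# log L_A - log L_B (positive means model A is ahead).
p_A = {"small": 1 - 1e-6, "large": 1e-6}   # slightly sharper on routine events
p_B = {"small": 0.99,     "large": 1e-2}   # hedges towards rare large events

quiet_period = ["small"] * 500
llr_quiet = sum(math.log(p_A[e] / p_B[e]) for e in quiet_period)
print(f"after 500 small events: {llr_quiet:+.2f}")   # positive: model A leads

# One large earthquake is enough to reverse the ranking:
llr_total = llr_quiet + math.log(p_A["large"] / p_B["large"])
print(f"after one large event:  {llr_total:+.2f}")   # negative: model B leads
```

Model A piles up a small advantage on every routine event, but its near-zero weight on large events means a single large earthquake wipes out hundreds of events’ worth of accumulated lead.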
From these thoughts we abstracted, as described in the manuscript. We chose the simplest non-ergodic process we could think of, Brownian motion, and looked at the behaviour of its time average. This average stabilises over time, in the sense that changes in the time average vanish if we average for long enough: additional observations barely change the result. However, because the process is non-ergodic, this stability over time does not imply stability across the ensemble. Repeating the experiment, i.e. measuring the time average of Brownian motion in another realisation of the process, will yield a different result.
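A minimal simulation makes this concrete (a sketch in Python/NumPy; the discretisation and variable names are mine, not the manuscript’s):

```python
import numpy as np

rng = np.random.default_rng(1)

def running_time_average(n_steps, dt=1.0):
    """Simulate standard Brownian motion W and return its running
    finite-time average A(T) = (1/T) * integral of W(t) dt over [0, T]."""
    dw = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    w = np.cumsum(dw)                      # W at t = dt, 2*dt, ...
    t = dt * np.arange(1, n_steps + 1)
    return np.cumsum(w) * dt / t           # A(T) at each step

n = 100_000
a1 = running_time_average(n)
a2 = running_time_average(n)

# Within one realisation the average "stabilises": the change caused by
# one extra observation shrinks as the record grows...
early = np.abs(np.diff(a1[:1000])).max()
late = np.abs(np.diff(a1[-1000:])).max()
print(f"largest single-step change, early: {early:.4f}, late: {late:.6f}")

# ...but repeating the whole experiment gives a different time average:
print(f"time average, realisation 1: {a1[-1]:+.2f}")
print(f"time average, realisation 2: {a2[-1]:+.2f}")
```

Within one run the per-observation changes become tiny, so the average looks settled; yet the two realisations end up with clearly different values, which is exactly the distinction between stability over time and stability across the ensemble.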
First of all, it occurred to us that this feels counter-intuitive and may be behind some of the reported irreproducible results: stability over time does not automatically imply stability across the ensemble. Secondly, it turned out that the relevant calculations are beautifully simple, and the problem can be solved exactly with pen and paper. It seems relevant to the reproducibility debate, and it’s intellectually pleasing. Hence the manuscript.