How significance tests are misused in climate science
Posted on 12 November 2010 by Maarten Ambaum
Guest post by Dr Maarten H. P. Ambaum from the Department of Meteorology, University of Reading, U.K.

Climate science relies heavily on statistics to test hypotheses. For example, we may want to ask whether the global mean temperature has really risen over the past ten years. A standard answer is to calculate a temperature trend from the data and then ask whether this trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question. But it turns out that this is precisely the wrong thing to do.
This poor practice appears to be widespread. A new paper in the Journal of Climate reports that three quarters of the papers in a randomly selected issue of that journal used significance tests in this misleading way. It is fair to say, though, that most of the time significance tests are only one part of the evidence provided.
The post by Alden Griffith on the 11th of August 2010 lucidly points to some of the problems with significance tests. Here we summarize the findings from the Journal of Climate paper, which explores how it is possible that significance tests are so widely misused and misrepresented in the mainstream climate science literature.
Not surprisingly, preprints of the paper have been enthusiastically picked up by those on the sceptic side of the climate change debate. We had better find out what is really happening here.
Consider a scientist who is interested in measuring some effect and who does an experiment in the lab. Now consider the following thought process that the scientist goes through:
1. My measurement stands out from the noise.
2. So my measurement is not likely to be caused by noise.
3. It is therefore unlikely that what I am seeing is noise.
4. The measurement is therefore positive evidence that there is really something happening.
5. This provides evidence for my theory.
To the surprise of most, the logical fallacy occurs between step 2 and step 3. Step 2 says that there is a low probability of finding our specific measurement if our system were just producing noise. Step 3 says that there is a low probability that the system just produces noise. These sound the same, but they are entirely different.
This can be described compactly using Bayesian statistics, which relies heavily on conditional probabilities. We use notation such as p(M|N) to mean the probability that M is true if N is known to be true, that is, the probability of M given N. Now say that M is the statement “I observe this effect” and N is the statement “My system just produces noise”. Step 2 in our thought experiment says that p(M|N) is low. Step 3 says that p(N|M) is low. As you can see, the conditionals are swapped; these probabilities are not the same. We call this the error of the transposed conditional.
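To see how different the two conditionals can be, here is a minimal simulation sketch in Python (not from the paper; the 50/50 prior, the size of the effect, and the detection threshold are all invented, illustrative numbers). It estimates p(M|N), how often a noise-only experiment happens to stand out, and p(N|M), how often an experiment that stands out is nevertheless pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100_000

# Hypothetical set-up: half of the experiments are pure noise (N true),
# half contain a weak real effect (N false). Numbers are illustrative only.
is_noise = rng.random(n_runs) < 0.5

# Measured value: unit-variance noise, plus a shift of 1 when a real effect exists
measurement = rng.normal(size=n_runs) + np.where(is_noise, 0.0, 1.0)

# M: "the measurement stands out from the noise" (exceeds a chosen threshold)
stands_out = measurement > 2.0

# p(M|N): fraction of noise-only runs that happen to stand out
p_M_given_N = stands_out[is_noise].mean()

# p(N|M): fraction of runs that stand out but are nevertheless just noise
p_N_given_M = is_noise[stands_out].mean()

print(f"p(M|N) is roughly {p_M_given_N:.3f}")   # about 0.02 for these numbers
print(f"p(N|M) is roughly {p_N_given_M:.3f}")   # about 0.13 for these numbers
```

With these invented numbers a “significant” result (p(M|N) of about 2%) coexists with a better than one-in-ten chance that the observation is nothing but noise; the two conditionals are simply not the same quantity.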
How about a significance test? A significance test in fact returns a value of p(M|N), the so-called p-value. In this context N is called the “null-hypothesis”. It returns the probability of observing an outcome (M: we observe an upward trend in the temperature record) given that the null-hypothesis is true (N: in reality there is no upward trend, there are just natural variations).
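As an illustration of what such a test actually computes, here is a short Monte Carlo sketch (the ten “annual temperature anomalies” are invented values, purely for illustration). It fits a linear trend to the record and then asks how often pure noise with the same scatter would produce a trend at least as steep. That frequency is p(M|N), the p-value; nothing in the calculation refers to the probability that the record itself is just noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ten-year record of annual-mean temperature anomalies (made-up values)
years = np.arange(10)
temps = np.array([0.05, -0.10, 0.20, 0.02, 0.15, 0.08, 0.25, 0.10, 0.22, 0.30])

# Observed least-squares trend (degrees per year)
slope, intercept = np.polyfit(years, temps, 1)

# Null hypothesis N: no trend at all, only year-to-year noise with the observed scatter
residual_std = np.std(temps - (slope * years + intercept))
n_surrogates = 20_000
noise_slopes = np.array([
    np.polyfit(years, rng.normal(scale=residual_std, size=years.size), 1)[0]
    for _ in range(n_surrogates)
])

# p-value = p(M|N): how often noise alone gives a trend at least as steep as observed
p_value = (noise_slopes >= slope).mean()
print(f"observed trend: {slope:.3f} deg/yr, p(M|N) = {p_value:.4f}")
```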
The punchline is that we are not at all interested in this probability. We are interested in the probability p(N|M), the probability that the null hypothesis is true (N: there is no upward temperature trend, just natural variability) given that we observe a certain outcome (M: we observe some upward trend in the temperature record).
Climate sceptics want to argue that p(N|M) is high (“Whatever your data show me, I still think there is no real trend; probably this is all just natural variability”), while many climate scientists have tried to argue that p(N|M) is low (“Look at the data: it is very unlikely that this is just natural variability”). Note that low p(N|M) means that the logical opposite of the null-hypothesis (not N: there really is an upward temperature trend) is likely to be true.
Who is right? There are many independent reasons to believe that p(N|M) is low; standard physics for example. However many climate scientists have shot themselves in the foot by publishing low values of p(M|N) (in statistical parlance, low p(M|N) means a “statistically significant result”) and claiming that this is positive evidence that p(N|M) is low. Not so.
We can make some progress, though. Bayes' theorem shows how the two probabilities are related. The aforementioned paper shows in detail how this works. It also shows how significance tests can be used, typically to debunk false hypotheses. These aspects may be the subject of a further post.
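For readers who want to see the relation spelled out, here is a small numerical sketch of Bayes' theorem, p(N|M) = p(M|N) p(N) / p(M); the prior p(N) and the likelihood under the alternative are invented numbers, not values from the paper.

```python
# Hypothetical numbers, purely to illustrate Bayes' theorem
p_N = 0.5                 # prior probability that the null hypothesis (just noise) is true
p_M_given_N = 0.05        # p-value: chance of the observation if there is only noise
p_M_given_notN = 0.50     # chance of the observation if there really is a trend

# Total probability of the observation, and the posterior for the null hypothesis
p_M = p_M_given_N * p_N + p_M_given_notN * (1.0 - p_N)
p_N_given_M = p_M_given_N * p_N / p_M

print(f"p(M|N) = {p_M_given_N:.2f}")   # 0.05: "statistically significant"
print(f"p(N|M) = {p_N_given_M:.2f}")   # about 0.09: the quantity we actually care about
```

Only when the prior p(N) and the likelihood of the observation under the alternative are brought in does a p-value translate into a statement about the hypothesis itself; change the prior and the same p-value leads to a very different p(N|M).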
In the meantime, we need to live with the fact that “statistically significant” results are not necessarily significant in any relevant sense. This doesn't mean that those results are false or irrelevant. It just means that the significance test does not provide a way of quantifying the validity of some hypothesis.
So next time someone shows you a “statistically significant” result, do tell them: “I don't care how low your p-value is. Show me the physics and tell me the size of the effect. Then we can discuss whether your hypothesis makes sense.” Stop quibbling about meaningless statistical smoke and mirrors.
Reference:
M. H. P. Ambaum, 2010: Significance tests in climate science. J. Climate, 23, 5927-5932, doi:10.1175/2010jcli3746.1
1. We still have not found a single species that would not increase its numbers exponentially in a favorable environment.
2. No environment is known that would be stable on a geological timescale.
3. No environment with infinite or exponentially growing resources is found (except the environment of human society, due to the ever-shifting technological definition of "resource").
4. No species is found where offspring and progenitor are either strictly identical or dissimilar (if the entire life cycle of the species is taken into account).
In this sense the empirical basis of the theory is not falsified. The problems discovered later have nothing to do with this quick-and-dirty inductive step; it is still a masterpiece. The problems arose not because of hasty induction, but because of some vagueness and much hand-waving in its deductive structure (in the ratiocination phase, to use Mill's term), some of which is due to the sketchy definition of basic concepts, some to the lack of rigorous formalism. The issues centered on point 4 above. He stuck to the idea of blending inheritance until the end of his life, although he would have had the chance to read about Mendel's results in Hermann Hoffman's book (1869), had he not skipped page 52 for lack of mathematical training and interest. The important difference between phenotype and genotype, along with the quantized nature of inheritance, was therefore unknown to him (and, understandably, so was recombination, which was discovered later). However, even with the tremendous advance in formalization and the description (and utilization) of the standard digital information storage and retrieval system encapsulated in all known life forms, the evolution of complexity is still not understood (although this is the single most important aspect of the theory as far as general human culture is concerned). Even a proper, widely agreed definition of complexity is lacking, and while there is no way to assign probabilities to candidates like Kolmogorov complexity, it makes even less sense to talk about the probability that individual propositions depending on this concept are true, either in a Bayesian context or otherwise.

The current status of AGW theory is much the same. It is also highly deductive, based on the single observation that carbon dioxide has a strong emission line in the thermal infrared. That is the only inductive step, other than those necessary for launching general atmospheric physics, of course. Otherwise the structure of the theory is supposed to be entirely deductive, relying on computational climate models as devices of inference. However, according to Galileo, the great Book of the Universe is written in the language of mathematics, not computer programs. The difference is essential. Mathematical formulae as used in physics lend themselves to all kinds of transformations, revealing hidden symmetries or conservation principles and making possible perturbation theories, equivalence proofs (like Schroedinger's exploit with matrix and wave mechanics), or the analysis of the general properties of a dynamical system (like the existence and geometry of attractors). On the other hand, there is no meaningful transformation of the code base of a General Circulation Model (other than compiling it under a specific operating system). Move on, folks, there's nothing to see here.

There is a metaphysical difference between our viewpoints. In unstructured problems like spam filtering, Bayesian inference may be useful. As soon as some noticeable structural difference appears between spam and legitimate email, spammers are quick to exploit it, so it is a race for crumbs of information. Stock prices work much the same way, from a strictly statistical point of view. On the other hand, as soon as meaning is considered, it is no longer justified to attach Bayesian probabilities to propositions concerning that meaning. One either understands what is being said or not (if you take your time and actually read and understand each piece of your incoming mail, it is easy to tell spam and the rest apart, even for non-experts).
To make a long story short, I think there is more to Galileo's statement about the hidden language than metaphor. There is indeed a message to be decoded, written in an utterly non-human language. It is a metaphysical statement, of course, and as such has no immediate bearing on questions of physics. Nevertheless, one's metaphysical stance plays an undeniable role in the manner in which people approach problems, and even in their choice of problems, and more so still in their assessment of what constitutes a proper solution.