## How significance tests are misused in climate science

#### Posted on 12 November 2010 by Maarten Ambaum

**Guest post by Dr Maarten H. P. Ambaum from the Department of Meteorology, University of Reading, U.K.**

Climate science relies heavily on statistics to test hypotheses. For example, we may want to ask whether the global mean temperature has really risen over the past ten years. A standard answer is to calculate a temperature trend from data and then ask whether this temperature trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question. But it turns out that this is precisely the wrong thing to do.

This poor practice appears to be widespread. A new paper in the *Journal of Climate* reports that three quarters of papers in a randomly selected issue of the same journal used significance tests in this misleading way. It is fair to say, though, that most of the time significance tests are only one part of the evidence provided.

The post by Alden Griffith on the 11th of August 2010 lucidly points to some of the problems with significance tests. Here we summarize the findings from the *Journal of Climate* paper, which explores how it is possible that significance tests are so widely misused and misrepresented in the mainstream climate science literature.

Not surprisingly, preprints of the paper have been enthusiastically picked up by those on the sceptic side of the climate change debate. We had better find out what is really happening here.

Consider a scientist who is interested in measuring some effect and who does an experiment in the lab. Now consider the following thought process that the scientist goes through:

1. My measurement stands out from the noise.
2. So my measurement is not likely to be caused by noise.
3. It is therefore unlikely that what I am seeing is noise.
4. The measurement is therefore positive evidence that there is really something happening.
5. This provides evidence for my theory.

To the surprise of most, the logical fallacy occurs between step 2 and step 3. Step 2 says that there is a low probability of finding our specific measurement if our system just produced noise. Step 3 says that there is a low probability that the system just produces noise. These sound the same but they are entirely different.

This can be described compactly in the language of conditional probabilities, on which Bayesian statistics relies heavily. We use notations such as `p(M|N)` to mean the probability that `M` is true if `N` is known to be true, that is, the probability of `M`, given `N`. Now say that `M` is the statement “I observe this effect” and `N` is the statement “My system just produces noise”. Step 2 in our thought experiment says that `p(M|N)` is low. Step 3 says that `p(N|M)` is low. As you can see, the conditionals are swapped; these probabilities are not the same. We call this the error of the transposed conditional.
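To see that the two conditionals really can differ, here is a minimal sketch in Python. The counts are entirely made up for illustration (they are assumptions, not numbers from the paper): imagine 1000 hypothetical systems, 900 of which only produce noise and 100 of which contain a real signal.

```python
# Hypothetical counts, chosen only for illustration (not data from the paper).
noise_and_effect = 45     # noise-only systems where the effect M is observed anyway
noise_no_effect = 855     # noise-only systems where M is not observed
signal_and_effect = 80    # systems with a real signal where M is observed
signal_no_effect = 20     # systems with a real signal where M is not observed

# p(M|N): among the noise-only systems, how often is the effect observed?
p_m_given_n = noise_and_effect / (noise_and_effect + noise_no_effect)

# p(N|M): among the systems where the effect is observed, how many are noise-only?
p_n_given_m = noise_and_effect / (noise_and_effect + signal_and_effect)

print(f"p(M|N) = {p_m_given_n:.3f}")  # prints p(M|N) = 0.050
print(f"p(N|M) = {p_n_given_m:.3f}")  # prints p(N|M) = 0.360
```

With these invented counts, the effect shows up in only 5% of the noise-only systems, yet more than a third of the systems that show the effect are noise-only. Swapping the conditional changes the answer.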

How about a significance test? A significance test in fact returns a value of `p(M|N)`, the so-called `p`-value. In this context `N` is called the “null-hypothesis”. The test returns the probability of observing an outcome (`M`: we observe an upward trend in the temperature record; strictly, an outcome at least as extreme as the one observed) given that the null-hypothesis is true (`N`: in reality there is no upward trend, there are just natural variations).
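As a rough illustration of what such a test computes, the sketch below estimates `p(M|N)` by brute force: it generates many series that contain nothing but noise and counts how often their fitted trend is at least as large as an "observed" one. Everything here is an assumption made for the example: the function names, the observed slope, the series length, and above all the null model of independent Gaussian noise, which is far simpler than real natural variability.

```python
import random

def ols_slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

def monte_carlo_p_value(observed_slope, n_points, noise_sd, n_trials=2000, seed=0):
    """Estimate p(M|N): the fraction of noise-only series whose trend is
    at least as large as the observed one (one-sided)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Null hypothesis N: no trend, just independent Gaussian noise (an assumption).
        noise = [rng.gauss(0.0, noise_sd) for _ in range(n_points)]
        if ols_slope(noise) >= observed_slope:
            hits += 1
    return hits / n_trials

# Hypothetical numbers: 30 yearly values, noise of 0.1 degrees, observed trend 0.004 degrees/year.
p = monte_carlo_p_value(observed_slope=0.004, n_points=30, noise_sd=0.1)
print(f"estimated p(M|N) = {p:.3f}")
```

Note what goes in and what comes out: the null hypothesis is assumed true throughout the calculation, so the result cannot by itself say how probable that hypothesis is.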

The punchline is that we are not at all interested in this probability. We are interested in the probability `p(N|M)`, the probability that the null hypothesis is true (`N`: there is no upward temperature trend, just natural variability) given that we observe a certain outcome (`M`: we observe some upward trend in the temperature record).

Climate sceptics want to argue that `p(N|M)` is high (“Whatever your data show me, I still think there is no real trend; probably this is all just natural variability”), while many climate scientists have tried to argue that `p(N|M)` is low (“Look at the data: it is very unlikely that this is just natural variability”). Note that low `p(N|M)` means that the logical opposite of the null-hypothesis (not `N`: there really is an upward temperature trend) is likely to be true.

Who is right? There are many independent reasons to believe that `p(N|M)` is low; standard physics, for example. However, many climate scientists have shot themselves in the foot by publishing low values of `p(M|N)` (in statistical parlance, low `p(M|N)` means a “statistically significant result”) and claiming that this is positive evidence that `p(N|M)` is low. Not so.

We can make some progress though. Bayes' theorem shows how the two probabilities are related. The aforementioned paper shows in detail how this works. It also shows how significance tests *can* be used; typically to debunk false hypotheses. These aspects may be the subject of a further post.
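For readers who want to see the relation at work, Bayes' theorem can be written as `p(N|M) = p(M|N) p(N) / [p(M|N) p(N) + p(M|not N) p(not N)]`. The Python sketch below simply evaluates this formula; the function name and all the input numbers are hypothetical, chosen only to show that the same `p`-value can go with very different values of `p(N|M)`.

```python
def posterior_null(p_m_given_n, p_m_given_not_n, prior_n):
    """Bayes' theorem: probability of the null hypothesis N given the outcome M.

    p_m_given_n     -- p(M|N), the p-value
    p_m_given_not_n -- p(M|not N), chance of the outcome if the effect is real
    prior_n         -- p(N), prior probability of the null hypothesis
    """
    numerator = p_m_given_n * prior_n
    evidence = numerator + p_m_given_not_n * (1.0 - prior_n)
    return numerator / evidence

# The same hypothetical p-value of 0.05 in three made-up situations.
print(f"{posterior_null(0.05, 0.80, 0.5):.3f}")  # prints 0.059
print(f"{posterior_null(0.05, 0.05, 0.5):.3f}")  # prints 0.500
print(f"{posterior_null(0.05, 0.80, 0.9):.3f}")  # prints 0.360
```

In the second case the outcome is just as unlikely when the effect is real as when it is not, so a "significant" result leaves the probability of the null hypothesis exactly where it started. The `p`-value alone does not fix `p(N|M)`; the other two inputs matter just as much.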

In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant. This doesn't mean that those results are false or irrelevant. It just means that the significance test does not provide a way of quantifying the validity of some hypothesis.

So next time someone shows you a “statistically significant” result, do tell them: “I don't care how low your `p`-value is. Show me the physics and tell me the size of the effect. Then we can discuss whether your hypothesis makes sense.” Stop quibbling about meaningless statistical smoke and mirrors.

M. H. P. Ambaum, 2010: Significance tests in climate science. *J. Climate*, **23**, 5927-5932. doi:10.1175/2010jcli3746.1
#### Comments

**Alexandre** at 01:45 AM on 13 November, 2010

> In the meantime, we need to live with the fact that “statistically significant” results are not necessarily in any relevant sense significant.

I think you could statistically correlate car sales and global warming, for instance, and it would mean nothing. It's the underlying physics AND the statistics that will give you the evidence - which is the case.

**Daniel Bailey** at 01:57 AM on 13 November, 2010

"perceived flattening of the global temperature rise". Sigh. In life and statistics, some will see only what they expect to see. The Yooper

**CBDunkerson** at 02:06 AM on 13 November, 2010

**cynicus** at 02:31 AM on 13 November, 2010

**Andrew Mclaren** at 02:32 AM on 13 November, 2010

**Paul D** at 02:38 AM on 13 November, 2010

**Bob Lacatena** at 03:35 AM on 13 November, 2010

**DSL** at 04:12 AM on 13 November, 2010

**Steve L** at 04:15 AM on 13 November, 2010

**KirkSkywalker** at 05:57 AM on 13 November, 2010

**Moderator Response:** This thread is narrowly focused on concepts related to assessing statistical significance. General discussion of broad categories of evidence for global warming should go in an appropriate thread, such as this or this.

Also, please note that in a series of visits over the past month, you've left at least five versions of the same comment about ice cores, in five different threads. Most of them have now been deleted or redirected here.

Please try to post your comments in the appropriate thread and then stick with them there, rather than spreading discussions across many different threads. This helps make the site more readable for everyone.

**macoles** at 10:44 AM on 13 November, 2010

**dansat** at 10:49 AM on 13 November, 2010

**Daniel Bailey** at 11:22 AM on 13 November, 2010

Yo-ho, yo-ho, indeed. The Yo-ho-Yooper

**michael sweet** at 14:03 PM on 13 November, 2010

**Tom Dayton** at 16:03 PM on 13 November, 2010

**dansat** at 18:03 PM on 13 November, 2010

**Eric L** at 22:41 PM on 13 November, 2010

**Eric L** at 22:50 PM on 13 November, 2010

**TonyL** at 06:34 AM on 14 November, 2010

**Miriam O'Brien (Sou)** at 07:04 AM on 14 November, 2010

**michael sweet** at 12:06 PM on 14 November, 2010

**HumanityRules** at 00:21 AM on 15 November, 2010

**muoncounter** at 00:42 AM on 15 November, 2010

**forensicscience** at 02:01 AM on 15 November, 2010

**TOP** at 08:26 AM on 15 November, 2010

"This thread is narrowly focused on concepts related to assessing statistical significance."

There is a high probability that only an off topic post by a skeptic will be flagged while more egregiously off topic posts about pirates will go unanswered.

Based on Ambaum's statement that 3/4 of the articles in a recent randomly picked issue of a prestigious climate publication contained this error, is it likely that the papers that the IPCC uses in its publications are tainted? Ambaum further stated that this number was up from an issue ten years earlier, where the error occurred only 1/2 the time.

I have seen what Ambaum alludes to in his paper, an increased use of computer programs to analyze data without understanding the underlying reasoning. You will typically see this on tests when asking students to take the sin(pi/3)/cos(pi/3)/sqrt(3). A calculator dependent student will more often than not get this wrong.

Temperature anomaly is a low signal to noise ratio quantity. I'd sure like to see a study of the proper use of statistics in deriving that quantity. In fact it seems like there was one in a past topic. Can't quite recall the name at the moment.

@muoncounter "No, but it does mean that 75% of climate denier posts are misleading -- and that's significant." Guess I'm not seeing the connection to "climate deniers".
What is a "climate denier" anyway? Someone who denies that there is such a thing as climate? I wasn't aware that the *Journal for Climate* was an anti-anthropogenic global warming publication. After all they put out this, "Global Warming is Unequivocal: The Evidence from NOAA" 5/6/2010.

@TonyL The Ambaum Article

**Daniel Bailey** at 09:09 AM on 15 November, 2010

[...] Pirate Chart was used to illustrate Alexandre's point, that just because things *can* be correlated doesn't mean that the correlation itself has any meaning.

Just because comments by skeptics get flagged for being off-topic doesn't mean comments by those who believe in climate science do not get flagged for being off-topic. Check out the Deleted Comments bin sometime. I've had comments land there before; I can also guarantee I'll end up there again sometime. Comments that are off-topic get deleted; fact of life here.

Tamino has some insights into the Ambaum piece here. The Yooper

**Tom Dayton** at 09:27 AM on 15 November, 2010

**HumanityRules** at 12:34 PM on 15 November, 2010

**dhogaza** at 12:48 PM on 15 November, 2010

**Daniel Bailey** at 13:00 PM on 15 November, 2010

Sad. There was a time when I thought you had something constructive to offer, HumanityRules. Now I find I can't take you seriously anymore as it seems you aren't even trying, preferring to serve up inflammatory distortions instead. The Yooper

**muoncounter** at 13:21 PM on 15 November, 2010

[...] *everyone* in the affected class. That includes Watt$, Godd@rd, Mc&tyre and the like. If you want to stick with this nonsense, that *requires* that 75% of climate change denier posts are misleading. Better to drop both the name-calling ('fear-mongering'? really?) and the gross generalizing. Then maybe we can have an intelligent conversation.

**Tom Dayton** at 15:32 PM on 15 November, 2010

**Eric L** at 15:33 PM on 15 November, 2010

[...] *insignificance*, as when many climate deniers misunderstood Phil Jones' remarks about warming since 1995 not being statistically significant as evidence that warming has stopped.
A statistically insignificant warming trend isn't evidence either way. This is not the sort of error Dr. Ambaum is talking about. Are you aware of instances where climate scientists have made this error?

TOP, be sure you are not misinterpreting the author as saying climate scientists should be making weaker claims or that they are publishing "statistically significant" results that if tested the way the author thinks they should be would be insignificant. Chances are climate scientists would use Bayesian statistics to show that they can make even stronger claims of confidence. For example, because the physics of climate lead you to believe it should be warming with high probability, you can combine this prior probability with your analysis of the temperature data to give an even stronger confidence in the existence of a warming trend than you would have otherwise.

If Phil Jones had followed Dr. Ambaum's advice when calculating statistical significance, he would have said something far less useful to those trying to cast doubt on warming. But I think he was right not to do it that way, as I mentioned above. And I should qualify that by saying I haven't read the paper, only this post, so maybe I don't understand what it is Dr. Ambaum thinks they should be doing when analyzing data.

**Maarten Ambaum** at 22:17 PM on 15 November, 2010

[...] *and* physics to make any progress. What I am highlighting, though, is not that specific issue (which is serious and important in itself). I am highlighting that significance tests are used to give certain statistical results higher "credibility" than others, based on a largely spurious test. So it is the selection of statistical results that I am objecting to, not the statistical results per se.

Some posts (specifically Steve L) refer to the frequentist vs Bayesian discussion. This is interesting in itself, but in my paper I am simply applying Bayes' equation, which a frequentist would also accept as indisputable.
The difference comes in the interpretation of the meaning of these probabilities. Indeed, significance tests have a clear frequentist flavour, while hypothesis tests have a much more Bayesian flavour. I think it is hard to escape that scientific hypotheses naturally fit a Bayesian framework. Nonetheless, I think the distinction between Bayesian and frequentist interpretations is largely irrelevant to the discussion at hand.

Several posts point out that scientists should know about this and also that climate science should not be singled out. Indeed, in my paper I point to more general references which highlight the misuse of significance tests in a wide spectrum of fields (medicine, economics, sociology, psychology, biology, ...). In fact, I suspect that your average research psychologist knows more about the pitfalls of significance tests than the climate scientist. In those "softer" fields, people have had to rely mainly on statistics from the start and therefore needed to know how to use statistics from day one. In those fields, many people have pointed this problem out (and it still seems to persist). Climate science has always been a subfield of physics, where significance tests are largely irrelevant. I bet that most physicists (by training, I am a theoretical physicist myself) didn't get a stats course in their curriculum! However, these days more and more geographical thinking seems to enter the field of climate science with the resulting lack of rigour and physical underpinning. Many climate scientists have become geographers of their model worlds!

Also, the point I am making is not new: many people are aware of the problems with significance tests, and many people have pointed it out before (although most practitioners probably believe that climate scientists would know better). It boggles the mind that the error keeps on being propagated - surely an interesting question for a psychologist or sociologist to get their teeth into.
I do have an opinion about why this may be, but that would make this post even longer.

Regarding the somewhat rambling posts about 75% of papers being misleading in part: I claim that 75% of papers (in my own paper I clearly state that this is based on just 1 (one) sample and make no claim regarding its statistical significance!) make a technical misuse of significance tests: they use them to select or highlight certain statistical results in favour of others. Perhaps I should write a post where I discuss what significance tests can be used for (largely for debunking fake hypotheses, but even this is an application with its own pitfalls). However, this is generally not how significance tests are presented in the literature. The latter of course follows from the fact that very few scientists would publish negative results (in fact, they would probably have a hard time getting them past the reviewers).

Some people, including John Cook himself, pointed me to a post by Tamino. Tamino also highlights some further points from my original paper. Let me just add two little comments to Tamino's interesting post:

Tamino states that "I’ve certainly struggled to emphasize to colleagues that a highly significant statistical result does not prove that one’s hypothesis is true, it merely negates the null hypothesis." This is again the error of the transposed conditional: a low `p`-value does *not* negate the null-hypothesis, it just indicates that our statistical result would be unlikely in case the null-hypothesis were true. It is remarkable how easily we can stray into this error.

Tamino also seems to indicate that the `p`-value does provide useful quantitative information. I cannot find any evidence in his post of this. Yes, the `p`-value is quantitative, but its usefulness is never really made clear. The `p`-value is perhaps an indication of the signal-to-noise ratio; a high `p`-value means that it will be difficult to see any evidence of any claimed effect.
A low `p`-value indicates very little really: we want to study the validity of some hypothesis *assuming* it is false; some attempt at a *reductio ad absurdum* proof of your hypothesis - unfortunately it is not quite that ...

**Dikran Marsupial** at 22:26 PM on 15 November, 2010

**Maarten Ambaum** at 22:38 PM on 15 November, 2010

[...] *mean* to say that a low `p`-value is evidence for their hypothesis, but publishing the low `p`-value along with phrases such as, "this or that effect is significant at the 95% level" certainly seems to imply that they want to use these statistics as positive evidence at face value.

**Dikran Marsupial** at 00:01 AM on 16 November, 2010

**Maarten Ambaum** at 00:16 AM on 16 November, 2010

[...] `p`-value cannot *objectively* be used to reject a null-hypothesis; it simply does not contain the required information to do so. I formalize this in my paper, if you would like to know more.

On the other hand, a high `p`-value indicates that the presented evidence is easily consistent with the null hypothesis. This is not evidence that the null-hypothesis is true; the evidence could also be consistent with the alternative hypothesis. A significance test simply contains no information either way. Using Occam's razor we can then conclude that there is no evidence for our hypothesis, so we better stick with the null-hypothesis. It is Occam's razor that makes the argument here, not the significance test.

Maarten

**Daniel Bailey** at 00:25 AM on 16 November, 2010

**Dikran Marsupial** at 00:58 AM on 16 November, 2010

**HumanityRules** at 01:47 AM on 16 November, 2010

**Daniel Bailey** at 02:06 AM on 16 November, 2010

**Eric L** at 10:59 AM on 16 November, 2010

**Maarten Ambaum** at 20:55 PM on 16 November, 2010

[...] `p`-value does not contain enough information to calculate the probability of the truth of a hypothesis, or the null hypothesis (such statements can be perfectly well framed in frequentist terms).

Regarding the dendrochronologist, this is an example that is very interesting. Equation 6 in my paper states how to view this.
It is simply Bayes' equation written in terms of prior and posterior odds:

`posterior odds = prior odds × p(M|not N) / p(M|N)`

where I used the notation as in the post above (note the `p(M|N)` is the `p`-value). So whether your confidence in the global warming hypothesis has been increased by your tree work depends on whether the `p`-value is smaller than the probability of seeing your measurement in situations where we know there is global warming. This statement is independent of the prior odds; the actual posterior odds of course do depend on the prior odds. In other words, every single measurement increases our knowledge (changes our confidence in a hypothesis) in the same way; this is independent of whether you were a "believer" or not to start with.

This discussion is getting quite long now. I will probably write another post with some of this stuff in sometime soon where I can also comment on the suggestion by HumanityRules. I think John Cook agreed that I could send in another guest post about this subject anyway.

Best wishes to all and thank you very much for your interest in this post and for an interesting discussion,

Maarten Ambaum

**HumanityRules** at 01:32 AM on 17 November, 2010

**Berényi Péter** at 01:59 AM on 17 November, 2010

[...] *does not make sense* to talk about the probability of hypotheses being true (or false). It's either true or false. Of course it is entirely possible we are ignorant about its truth value; in that case one should say *I do not know* (a perfectly legitimate scientific stance), but it surely has a truth value, even if no one was able to determine it so far (provided of course the hypothesis makes sense in the first place).

The Bayesian method you describe could only serve as a *heuristic* device, but only if we had a clear (quantifiable!) picture of *prior probabilities* regarding our own ignorance. That's almost never the case.
If we knew how ignorant we were (having a reliable *structural* model of our own ignorance), most of the job required to *overcome* this ignorance would already be completed. However, when heuristics is most needed, we are at the edge of utter darkness, just feeling our way around, not even equipped to make educated guesses about Bayesian priors of our own state of mind regarding the subject matter. In cases like that almost any fractional understanding is better than fake formal methods to arrive at a reasonable conclusion regarding the way forward.

It may be different for *decision makers* (like politicians or business people) who rely on *expert advice* in certain matters, but are not equipped to actually *understand* and *evaluate* the detailed reasoning behind those expert opinions (they only digest the *executive summary*, anyway). They may well wonder how likely it is the experts have got it right, and in complicated cases it makes perfect sense for them to seek a quantified description of uncertainty. To ask an *independent* group of experts to give an estimate of prior probabilities and build a Bayesian model to evaluate reliability of expert propositions may be a way forward. However, in practice extra rounds like that are seldom better than honest *expert meta-opinion*, expressed in plain language.

There is a more restricted domain where statistics can (and do) come into play in natural sciences. That's measurement laden with noise. However, in this case there is no room for theoretical ambiguity. We should know pretty much everything about how the signal we are looking for is supposed to look, along with the statistical properties of the noise behind which it is hiding. This knowledge should take the form of a bunch of *true* propositions about the phenomenon under scrutiny, none of which has a dubious truth value expressible in a probabilistic form.
If this knowledge is given, we should be able to build an adequate statistical model which enables us to recover the signal from noise as much as possible. Of course the first thing to do is not to rely on statistical speculations, but to improve the signal to noise ratio of measurement whenever it is practicable.

Unfortunately in climate studies most of the noise is not from the measurement procedure itself, but it is *weather noise*, that is, an inherent property of the system itself. There is no way to get rid of it during the measurement phase. Weather is an open thermodynamic system, and as such it works on the edge of chaos, in other words it is always in *critical state* (by way of SOC - Self Organized Criticality). Systems like this are characterized by system variables with *pink noise* characteristics (the noise has random phase and the same power in each octave). Pink noise is scale invariant with no lower cutoff frequency, therefore system variables like this do not make a natural distinction between weather and climate, no matter how long is the averaging window used (how low the upper cutoff). Pink noise is never *stationary*, it has an arbitrarily long autocorrelation scale.

This is why it is a bit tricky to look for *trend* (as signal) in a climate variable laden with weather noise. A simple model of a linear trend plus some stationary noise would surely not do (even if *mainstream* climate science is almost always guilty of using such simplistic models). Pink noise can have spontaneous excursions on all scales, including extremely low frequency ones (well in the supposed *climate* range of 30+ years).

You say "A standard answer [to the question if temperatures are rising or not] is to calculate a temperature trend from data and then ask whether this temperature trend is “significantly” upward; many scientists would then use a so-called significance test to answer this question.
But it turns out that this is precisely the wrong thing to do." Yes, but it is not wrong just because the result of an otherwise correctly applied significance test is misused, but in most cases people also apply the wrong significance test (that fails to take into account the very long autocorrelation timescale).

The above statements on weather (or climate) noise, critical state, self-organized criticality, pink noise, etc. are simply *true* statements with no further qualification whatsoever. It is not *likely* they are true, not even 100% sure, they are simply adequate descriptions of certain aspects of the behavior of open thermodynamic systems with many degrees of freedom. Still, they are entirely missing from IPCC reports, prepared by *experts* for *decision makers*. Phrases like "pink noise" (or "1/f noise") are not even mentioned under http://ipcc.ch. Funny.

**michael sweet** at 03:13 AM on 17 November, 2010

**Dikran Marsupial** at 04:45 AM on 17 November, 2010

**KR** at 05:17 AM on 17 November, 2010

[...] 1/f relationship would indicate the largest variations on low frequencies, where what we observe (glacial cycles, for example) is a fairly direct tracking of climate variables (temperature, ice cover, etc.) to historic forcings.

(2) The universe is what it is - that's the final arbiter of our theories. However, our *knowledge is imperfect*, and our *hypotheses* are probabilistic, as per the first definition of probability. We can only state that a particular hypothesis is more probable than others given the evidence, the statistics of our data. And whether using Bayesian or frequentist methods, we can estimate from the statistics the probability (second definition) that our hypothesis is supported by that data.

That's how induction works, and how we can learn something new. We can be pretty sure, but we can only work with the evidence we have - we don't have perfect knowledge of anything. At a certain point we become certain enough to label a particular hypothesis a *fact*.
Gravity, evolution, and it appears climate change fall into that category as well. But even the strongest "fact" is supported by our *inductive* conclusion that the laws of physics are consistent over space and time, and won't change on us - incredibly well supported, but the rules could change tomorrow.

Crystalline proofs of the type you describe would be nice, but they don't exist.

**Tom Dayton** at 06:00 AM on 17 November, 2010