On Statistical Significance and Confidence

Posted on 11 August 2010 by Alden Griffith

Guest post by Alden Griffith from Fool Me Once

My previous post, “Has Global Warming Stopped?”, was followed by several (well-meaning) comments on the meaning of statistical significance and confidence. Specifically, there was concern about the way that I stated that we have 92% confidence that the HadCRU temperature trend from 1995 to 2009 is positive. The technical statistical interpretation of the 92% confidence interval is this: "if we could resample temperatures independently over and over, we would expect the confidence intervals to contain the true slope 92% of the time."  Obviously, this is awkward to understand without a background in statistics, so I used a simpler phrasing. Please note that this does not change the conclusions of my previous post at all. However, in hindsight I see that this attempt at simplification led to some confusion about statistical significance, which I will try to clear up now.

So let’s think about the temperature data from 1995 to 2009 and what the statistical test associated with the linear regression really does (it's best to have already read my previous post). The procedure first fits a line through the data (the “linear model”) such that the deviations of the points from this line are minimized, i.e. the good old line of best fit. This line has two parameters that can be estimated, an intercept and a slope. The slope of the line is really what matters for our purposes here: does temperature vary with time in some manner (in this case the best fit is positive), or is there actually no relationship (i.e. the slope is zero)?  
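The line-of-best-fit step can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line` is hypothetical, and the temperature values are synthetic placeholders generated on the spot, not the HadCRU record.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of the x-y cross products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic placeholder data (NOT real measurements): 15 "years" with a
# small made-up upward drift plus random noise.
rng = random.Random(0)
years = list(range(1995, 2010))
temps = [0.01 * (yr - 1995) + rng.gauss(0.0, 0.1) for yr in years]

intercept, slope = fit_line(years, temps)
print(f"slope = {slope:.4f} per year, intercept = {intercept:.2f}")
```

The slope returned here is the quantity that the rest of the discussion is about: the test asks whether it is distinguishable from zero.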

Figure 1: Example of the null hypothesis (blue) and the alternative hypothesis (red) for the 1995-2009 temperature trend

Looking at Figure 1, we have two hypotheses regarding the relationship between temperature and time:  1) there is no relationship and the slope is zero (blue line), or 2) there is a relationship and the slope is not zero (red line). The first is known as the “null hypothesis” and the second is known as the “alternative hypothesis”. Classical statistics starts with the null hypothesis as being true and works from there. Based on the data, should we accept that the null hypothesis is indeed true or should we reject it in favor of the alternative hypothesis? 

Thus the statistical test asks: what is the probability of observing the temperature data that we did (or data showing an even stronger trend), given that the null hypothesis is true?

In the case of the HadCRU temperatures from 1995 to 2009, the statistical test reveals a probability of 7.6%. Thus there’s a 7.6% probability that we would have observed the temperatures that we did (or a more extreme trend) if temperatures are not actually rising. Confusing, I know… This is why I had inverted 7.6% to 92.4% to make it fit more in line with Phil Jones’ use of “95% significance level”.
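One common way to arrive at a probability of this kind is a two-tailed t-test on the regression slope. The sketch below assumes independent, normally distributed residuals (no correction for autocorrelation) and again uses synthetic placeholder data, so it will not reproduce the 7.6% figure and is not necessarily the exact procedure behind it. The helper names are hypothetical.

```python
import math
import random

def slope_t_statistic(x, y):
    """Return (slope, t, df) for the null hypothesis 'slope is zero'."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    std_err = math.sqrt(sse / df / sxx)
    return slope, slope / std_err, df

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule for the area between -|t| and |t|."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # the density is symmetric about zero
    return max(0.0, 1.0 - central_area)

# Synthetic placeholder data (NOT real measurements).
rng = random.Random(0)
years = list(range(1995, 2010))
temps = [0.01 * (yr - 1995) + rng.gauss(0.0, 0.1) for yr in years]

slope, t_value, df = slope_t_statistic(years, temps)
print(f"slope = {slope:.4f}, t = {t_value:.2f}, p = {two_tailed_p(t_value, df):.3f}")
```

With 15 yearly values there are 13 degrees of freedom, and a p-value below 0.05 corresponds to the conventional "95% significance level".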

Essentially, the lower the probability, the more we are compelled to reject the null hypothesis (no temperature trend) in favor of the alternative hypothesis (yes temperature trend). By convention, “statistical significance” is usually set at 5% (I had inverted this to 95% in my post). Anything below is considered significant while anything above is considered nonsignificant. The problem that I was trying to point out is that this is not a magic number, and that it would be foolish to strongly conclude anything when the test yields a relatively low, but “nonsignificant” probability of 7.6%. And more importantly, that looking at the statistical significance of 15 years of temperature data is not the appropriate way to examine whether global warming has stopped (cyclical factors like El Niño are likely to dominate over this short time period).

Ok, so where do we go from here, and how do we take the “7.6% probability of observing the temperatures that we did if temperatures are not actually rising” and convert it into something that can be more readily understood? You might first think that perhaps we have the whole thing backwards and that really we should be asking: “what is the probability that the hypothesis is true given the data that we observed?” and not the other way around. Enter the Bayesians!

Bayesian statistics is a fundamentally different approach that certainly has one thing going for it: it’s not completely backwards from the way most people think! (There are many other touted benefits that Bayesians will gladly put forth as well.) When using Bayesian statistics to examine the slope of the 1995-2009 temperature trend line, we can actually get a more-or-less straightforward probability that the slope is positive. That probability? 92% (see footnote 1). So after all this, I believe that one can conclude (based on this analysis) that there is a 92% probability that the temperature trend for the last 15 years is positive.
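For readers who want to see what a Bayesian calculation of this kind can look like, here is a minimal Monte Carlo sketch. It assumes a flat (non-informative) prior on the intercept, the slope and the log of the noise variance, independent normal errors, and synthetic placeholder data; it is not the Systat script used for the analysis, and the function names are hypothetical.

```python
import math
import random

def posterior_slope_draws(x, y, draws=20000, seed=1):
    """Sample the slope's posterior under a flat (non-informative) prior."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope_hat = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept_hat = mean_y - slope_hat * mean_x
    sse = sum((yi - (intercept_hat + slope_hat * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        # Noise variance: SSE divided by a chi-square draw with n - 2
        # degrees of freedom (chi-square = gamma with shape df/2, scale 2).
        chi_square = rng.gammavariate(df / 2, 2.0)
        sigma2 = sse / chi_square
        # Slope given that variance: normal around the least-squares estimate.
        samples.append(rng.gauss(slope_hat, math.sqrt(sigma2 / sxx)))
    return samples

def summarize(samples):
    """Return (P(slope > 0), mass of the largest central interval excluding 0)."""
    p_positive = sum(s > 0 for s in samples) / len(samples)
    largest_interval = 1 - 2 * min(p_positive, 1 - p_positive)
    return p_positive, largest_interval

# Synthetic placeholder data (NOT real measurements).
rng = random.Random(0)
years = list(range(1995, 2010))
temps = [0.01 * (yr - 1995) + rng.gauss(0.0, 0.1) for yr in years]

p_positive, largest_interval = summarize(posterior_slope_draws(years, temps))
print(f"P(slope > 0) = {p_positive:.3f}")
print(f"largest central credible interval excluding zero = {largest_interval:.3f}")
```

The sketch reports two related but distinct summaries: the posterior probability that the slope is positive, and the mass of the largest central credible interval that excludes zero (the quantity described in the footnote). Under these particular assumptions the posterior is symmetric about the least-squares slope, so the two are linked but are not the same number.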

While this whole discussion comes from one specific issue involving one specific dataset, I believe that it really stems from the larger issue of how to effectively communicate science to the public. Can we get around our jargon?  Should we embrace it?  Should we avoid it when it doesn’t matter?  All thoughts are welcome…

Footnote 1: To be specific, 92% is the largest credible interval that does not contain zero. For those of you with a statistical background, we’re conservatively assuming a non-informative prior.

  1. I would appreciate some more background on how you computed the Bayesian credible interval. For example, what exactly do you mean by non-informative prior? Uniform? And how did you deal with auto-correlations if at all? (I realize that I am asking for the complexity you seek to simplify--fair enough, but a 'technical appendix' might be helpful for those more conversant with statistics.)
  2. Good post, Alden. Communicating anything to the public does indeed require a minimal usage of jargon; but as we all know, there exist those who live to be contrarians, for whom no level of clear explanations exist that cannot be obfuscated. Thanks again! The Yooper
  3. Thanks for this, you have a great website. btw, did you check out the Bayes factor relative to the "null"?
  4. Another interesting way to look at it is to look at the actual slope of the line of best fit, which I get to be 0.01086. Now take the actual yearly temperatures and randomly assign them to years. Do this (say) a thousand times. Then fit a line to each of the shuffled data sets and look at what fraction of the time the shuffled data produces a slope of greater than 0.01086 (the slope the actual data produced). So for my first trial of 1000 I get 3.5% as the percentage of times random re-arrangement of the temperature data produces a greater slope than the actual data. The next trial of 1000 gives 3.5% again, and the next gave 4.9%. I don't know exactly how to phrase this as a statistical conclusion, but you get the idea. If the data were purely random with no trend, you'd be expecting ~50%.
  5. I hate to admit this -- I'm very aware some will snort in derision -- but as a reasonably intelligent member of the public, I don't really understand this post and some of the comments that follow. My knowledge of trends in graphs is limited to roughly (visually) estimating the area contained below the trend line and that above the trend line, and if they are equal over any particular period then the slope of that line appears to me to be a correct interpretation of the trend. That's why, to me, the red line seems more accurate than the blue line on the graph above. And this brings me to the problem we're up against in explaining climate science to the general public: only a tiny percentage (and yes, it's probably no more than 1 or 2 percent of the population) will manage to wade through the jargon and presumed base knowledge that scientists assume can be followed by the reader. Some of the principles of climate science I've managed to work out by reading between the lines and googling -- turning my back immediately on anything that smacks just of opinion and lacks links to the science. But it still leaves huge areas that I just have to take on trust, because I can't find anyone who can explain it in words I can understand. This probably should make me prime Monckton-fodder, except that even I can see that he and his ilk are politically-motivated to twist the facts to suit their agenda. Unfortunately, the way real climate science is put across, provides massive opportunities for the obfuscation that we so often complain about. Please don't take this personally, Alden; I'm sure you're doing your best to simplify -- it's just that even your simplest is not simple enough for those without the necessary background.
  6. The data set contains two points which are major 'outliers' - 1996 (low) and 1998 (high). I appreciate 1998 is attributable to a very strong El Nino. Very likely, the effect of the two outliers is to cancel one another out. Nevertheless, it would be an interesting exercise to know the probability of a positive slope if either or both outliers were removed (a single and double cherry pick if you like) given the 'anomalous' nature of the gap between two temperatures in such a short space of time.
  7. As has been mentioned elsewhere by others, given that the data prior to this period showed a statistically significant temperature increase, with a calculated slope, then surely the null hypothesis should be that the trend continues, rather than there is no increase? I guess it depends on whether you take any given interval as independent of all other data points... stats was never my strong point - we had the most uninspiring lecturer when I did it at uni, it was a genuine struggle to stay awake!
  8. John Brooks: yes, this is definitely one way to test significance. It's called a "randomization test" and really makes a whole lot of sense. Also, there are fewer assumptions that need to be made about the data. However, the reason that you are getting lower probabilities is that you are conducting the test in a "one-tailed" manner, that is, you are asking whether the slope is greater instead of whether it is simply different (i.e. could be negative too). Most tests should be two-tailed unless you have your specific alternative hypothesis (positive slope) before you collect the data. -Alden p.s. I'll respond to others soon, I just don't have time right now.
  9. John Russell: If it is any consolation, I don't think it is overly controversial to suggest that there are many (I almost wrote majority ;o) active scientists who use tests of statistical significance every day but don't fully grasp the subtleties of the underlying statistical framework. I know from my experience of reviewing papers that it is not unknown for a statistician to make errors of this nature. It is a much more subtle concept than it sounds. chriscanaris: I would suggest that the definition of an outlier is another difficult area. IMHO there is no such thing as an outlier independent of assumptions made regarding the process generating the data (in this case, the "outliers" are perfectly consistent with climate physics, so they are "unusual" but not strictly speaking outliers). The best definition of an outlier is an observation that cannot be reconciled with a model that otherwise provides satisfactory generalisation. ABG: Randomisation/permutation tests are a really good place to start in learning about statistical testing, especially for anyone with a computing background. I can recommend "Understanding Probability" by Henk Tijms for anyone wanting to learn about probability and stats as it uses a lot of simulations to reinforce the key ideas, rather than just maths.
  10. "While this whole discussion comes from one specific issue involving one specific dataset, I believe that it really stems from the larger issue of how to effectively communicate science to the public. Can we get around our jargon? Should we embrace it? Should we avoid it when it doesn’t matter? All thoughts are welcome…" More research projects should have meta-analysis as a goal, the outcomes of which should be distilled à la John's one-line responses to denialist arguments, and these simplifications should be subject to peer review: firstly by scientists, but also by sociologists, advertising executives, politicians, school teachers, etc. As messages become condensed the scope for rhetorical interpretation increases. Science should limit its responsibility to science but should structure itself in a way that facilitates simplification. I think this is why we have political parties, or any committee. I hope the blogosphere can keep these mechanics in check. The story of the Tower of Babel is perhaps worth remembering. It talks about a situation where we reach for the stars and we end up not being able to communicate with one another.
  11. I'm going to have a go at explaining why 1 minus the p-value is not the confidence that the alternative hypothesis is true, in (only) slightly more mathematical terms. The basic idea of a frequentist test is to see how likely it is that we should observe a result assuming the null hypothesis is true (in this case that there is no positive trend and the upward tilt is just due to random variation). The less likely the data under the null hypothesis, the more likely it is that the alternative hypothesis is true. Sound reasonable? I certainly think so. However, imagine a function that transforms the likelihood under the null hypothesis into the "probability" that the alternative hypothesis is true. It is reasonable to assume that this function is strictly decreasing (the more likely the null hypothesis, the less likely the alternative hypothesis) and gives a value between 0 and 1 (which are traditionally used to mean "impossible" and "certain"). The problem is that other than the fact it is decreasing and bounded by 0 and 1, we don't know what that function actually is. As a result there is no direct calibration between the probability of the data under the null hypothesis and the "probability" that the alternative hypothesis is true. This is why scientists like Phil Jones say things like "at the 95% level of significance" rather than "with 95% confidence". He can't make the latter statement (although that is what we actually want to know) simply because we don't know this function. As a minor caveat, I have used lots of "" in this post because under the frequentist definition of a probability (long run frequency) it is meaningless to talk about the probability that a hypothesis is true. That means in the above I have been mixing Bayesian and frequentist definitions, but I have used the "" to show where the dodginess lies. As to simplifications: we should make things as simple as possible, but not more so (as noted earlier). But also we should only make a simplification if the statement remains correct after the simplification, and in the specific case of "we have 92% confidence that the HadCRU temperature trend from 1995 to 2009 is positive" that simply was not correct (at least for the traditional frequentist test).
  12. Alden # Original Post We can massage all sorts of linear curve fits and play with confidence limits to the temperature data - and then we can ask why are we doing this? The answer is that the temperatures look like they have flattened over the last 10-12 years and this does not fit the AGW script! AGW believers must keep explaining the temperature record in terms of linear rise of some kind - or the theory starts looking more uncertain and explanations more difficult. It is highly likely that the temperature curves will be non-linear in any case - because the forcings which produce these temperature curves are non-linear - some are logarithmic, some are exponential, some are sinusoidal and some we do not know. The AGW theory prescribes that a warming imbalance is there all the time and it is increasing with CO2GHG concentration. With an increasing energy imbalance applied to a finite Earth system (land, atmosphere and oceans) we must see rising temperatures. If not, the energy imbalance must be falling - which means that radiative cooling and other cooling forcings (aerosols and clouds) are offsetting the CO2GHG warming effects faster than they can grow, and faster than AGW theory predicts.
  13. Ken Lambert #12 wrote: "The answer is that the temperatures look like they have flattened over the last 10-12 years and this does not fit the AGW script!" This is fiction. Temperatures have not "flattened out"... they have continued to rise. Can you cherry pick years over a short time frame to find flat (or declining!) temperatures? Sure. But that's just nonsense. When you look at any significant span of time, even just the 10-12 years you cite, what you've got is an increasing temperature trend. Not flat. "With an increasing energy imbalance applied to a finite Earth system (land, atmosphere and oceans) we must see rising temperatures." We must see rising temperatures SOMEWHERE within the climate system. In the oceans for instance. The atmospheric temperature on the other hand can and does vary significantly from year to year.
  14. Since we are on the basics of statistics: I studied statistics in ecology and agriculture for a “long three years”. Why exactly 15 years? I have written repeatedly that the period over which a trend is calculated should not be chosen as a round number in the decimal system, because noise-type variability (EN(LN)SO, etc.) does not run on that system. For example, AMO trends over 100 and 150 years combine the negative phase of the AMO with the positive one, "improving" the results. The period over which we compute the trend must have a sound reason behind it. While in the above-mentioned cases (100, 150 years) the error is small, in this particular case (a "flat" phase of the AMO after a period of growth up to 1998, an extreme El Nino), the trend should be calculated from the same phase of EN(LN)SO, after a period of recovery following the extreme El Nino, i.e. after 2001, or else the "noise" should be removed: the extreme El Nino and the "leap" from the cold to the warm phase of the AMO. This, however, may not matter for whether it is currently getting warmer or not; once again I (very much) regret the tropical fingerprint of CO2 (McKitrick et al. - unfortunately published in Atmos Sci Lett. - here, too, it came down to statistics, including the selection of data).
  15. Stephan Lewandowsky: I used the Bayesian regression script in Systat using a diffuse prior. In this case I did not specifically deal with autocorrelation. We might expect that over such a short time period, there would be little autocorrelation through time which does appear to be the case. You are right that this certainly can be an issue with time-series data though. If you look at longer temperature periods there is strong autocorrelation. apeescape: I'm definitely not a Bayesian authority, but I'm assuming you're asking whether I examined this in more of a hypothesis testing framework? No - in this case I just examined the credibility interval of the slope. Ken Lambert: please read my previous post -Alden
  16. Discussing trends and statistical significance is something that I attempt to do - with no training in statistics. All I have learned from various websites over the last few years is conceptual, not mathematical. I would appreciate anyone with sufficient qualifications straightening out any misconceptions re the following: 1) Generally speaking, the greater the variance in the data, the more data you need (in a time series) to achieve statistical significance on any trend. 2) With too-short samples, the resulting trend may be more an expression of the variability than any underlying trend. 3) The number of years required to achieve statistical significance in temperature data will vary slightly depending on how 'noisy' the data is in different periods. 4) If I wanted to assess the climate trend of the last ten years, a good way of doing it would be to calculate the trend from 1980 - 1999, and then the trend from 1980 - 2009 and compare the results. In this analysis, I am using a minimum of 20 years of data for the first trend (statistically significant), and then 30 years of data for the second, which includes the data from the first. (With Hadley data, the 30-year trend is slightly higher than the 20-year trend) Aside from asking these questions for my own satisfaction, I'm hoping they might give some insight into how a complete novice interprets statistics from blogs, and provide some calibration for future posts by people who know what they're talking about. :-) If it's not too bothersome, I'd be grateful if anyone can point me to the thing to look for in the Excel regression analysis that tells you what the statistical significance is - and how to interpret it if it's not described in the post above. I've included a snapshot of what I see - no amount of googling helps me know which box(es) to look at and how to interpret.
  17. #13 CBDunkerson at 00:09 AM on 12 August, 2010: "We must see rising temperatures SOMEWHERE within the climate system. In the oceans for instance." Nah. It's coming out, not going in recently.
  18. John Russell: You're not alone! Statistics is a notoriously nonintuitive field. Instead of getting bogged down in the details, here's perhaps a more simple take home message: IF temperatures are completely random and are not actually increasing, it would still be rather unlikely that we would see a perfectly flat line. So I've taken the temperature data and completely shuffled them around so that each temperature value is randomly assigned to a year: So here we have completely random temperatures but we still sometimes see a positive trend. If we did this 1000 times like John Brookes did the average random slope would be zero, but there would be plenty of positive and negative slopes as well. So the statistical test is getting at: is the trend line that we actually saw unusual compared to all of the randomized slopes? In this case it's fairly unusual, but not extremely. To get at your specific question - the red line definitely fits the data better (it's the best fit, really). But that still doesn't mean that it couldn't be a product of chance and that the TRUE relationship is flat. [wow - talking about stats really involves a lot of double negatives... no wonder it's confusing!!!] -Alden
  19. I bet you can get low-ish significance trends in any short interval in the last half century. There's nothing special in the "lack of significance" of this recent period. One could claim forever that "the last x years did not reach 95% significance".
  20. Ken Lambert @12: No scientist who studies climate would use 10 or 12 years, or the 15 in the OP, to identify a long-term temperature trend. For reasons that have been discussed at length many times, here and elsewhere, there is quite a bit of variance in annualized global temperature anomalies, and it takes a longer period for reliable (i.e., statistically significant) trends to emerge. Phil Jones was asked a specific question about the 15-year trend, and he gave a specific answer. Alden Griffith was explaining what he meant. Neither, I believe, would endorse using any 15-year period as a baseline for understanding climate, nor would most climate scientists. The facts of AGW are simple and irrefutable: 1. There are multiple lines of direct evidence that human activity is increasing the CO2 in the atmosphere. 2. There is well-established theory, supported by multiple lines of direct evidence, that increasing atmospheric CO2 creates a radiative imbalance that will warm the planet. 3. There are multiple lines of direct evidence that the planet is warming, and that that warming is consistent with the measured CO2 increase. One cannot rationally reject AGW simply because the surface temperature record produced by one organization does not show a constant increase over whatever period of years, months, or days one chooses. The global circulation of thermal energy is far too complex for such a simplistic approach. The surface temperature record is but one indicator of global warming, it is not the warming itself. When viewed over a period long enough to provide statistical significance, all of the various surface temperature records indicate global warming.
  21. BP @17: Nice. That level of disingenuousness must be applauded. Using a plot of localized ENSO-related temperature anomaly to suggest that the oceans are losing heat is pure genius. Anyone interested in the source and significance of BP's plot is directed here. See, in particular, the "Weekly ENSO Evolution, Status, and Prediction Presentation."
  22. ABG at 01:29 AM on 12 August, 2010 Thanks, Alden. I actually understood exactly what you're getting at. Whether I can remember and apply it in future is another matter!
  23. #14 Arkadiusz Semczyszak, "Why exactly 15 years?" Good question. The answer is that the person asking the question of Phil Jones used the range 1995-2009, knowing that if he used the range 1994-2009, Dr. Jones would have been able to answer 'yes' instead of 'no'.
  24. #12 Ken Lambert, It is well known that CO2 is not the only influence on the earth's energy content. As temperature has a reasonably good relationship with energy content (leaving out chemical or phase changes), it is reasonable to use air temperatures to some extent. (Ocean temps should be weighed far more heavily than air temps, but regardless...) If you pull up any reputable temperature graph, you will see that there have been about 4 to 6 times in the past 60 years where the temperature has actually dipped. So, according to your logic GW has stopped 4 to 6 times already in the last 60 years. However, it continues to be the case that every decade is warmer than the last. What I find slightly alarming is that, despite the sun being in an unusually long period of low output, the temperatures have not dipped.
    Moderator Response: Rather than delve once more into specific topics handled elsewhere on Skeptical Science and which may be found using the "Search" tool at upper left, please be considerate of Alden's effort by trying to stay on the topic of statistics. Examples of statistical treatments employing climate change data are perfectly fine, divorcing discussion from the thread topic is not considerate. Thanks!
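The randomization test discussed in comments 4, 8 and 18 can be sketched as follows. This is a minimal illustration on synthetic placeholder data (not the HadCRU values), with hypothetical function names; it reports both the one-tailed fraction (shuffled slope at least as large as the observed one) and the two-tailed fraction (shuffled slope at least as large in absolute value).

```python
import random

def slope(x, y):
    """Least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx

def permutation_test(x, y, trials=1000, seed=0):
    """Shuffle y across x many times and compare slopes with the observed one.

    Returns (observed_slope, one_tailed_fraction, two_tailed_fraction).
    """
    rng = random.Random(seed)
    observed = slope(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    at_least_as_extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # randomly reassign temperatures to years
        s = slope(x, shuffled)
        if s >= observed:
            at_least_as_large += 1
        if abs(s) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_large / trials, at_least_as_extreme / trials

# Synthetic placeholder data (NOT real measurements).
rng = random.Random(0)
years = list(range(1995, 2010))
temps = [0.01 * (yr - 1995) + rng.gauss(0.0, 0.1) for yr in years]

observed, one_tailed, two_tailed = permutation_test(years, temps)
print(f"observed slope = {observed:.4f}")
print(f"one-tailed = {one_tailed:.3f}, two-tailed = {two_tailed:.3f}")
```

Because shuffled slopes are distributed roughly symmetrically around zero, the two-tailed fraction is typically about twice the one-tailed one, which is the difference pointed out in comment 8.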
