# Of Averages and Anomalies - Part 1A. A Primer on how to measure surface temperature change

In recent years a number of claims have been made about ‘problems’ with the surface temperature record: that it is faulty, biased, or even ‘being manipulated’. Many of the criticisms often seem to revolve around misunderstandings of how the calculations are done and thus exaggerated ideas of how vulnerable to error the analysis of the record is. In this series I intend to look at how the temperature records are built and why they are actually quite robust. In this first post (Part 1A) I am going to discuss the basic principles of how a reasonable surface temperature record should be assembled, Then in Part 1B I will look at how the major temperature products are built. Finally in Parts 2A and 2B I will then look at a number of the claims of ‘faults’ against this to see if they hold water or are exaggerated based on misconceptions.

### How NOT to calculate the Surface Temperature

So, we have records from a whole bunch of meteorological stations from all around the world. They have measurements of daily maximum and minimum temperatures for various parts of the last century and beyond. And we want to know how much the world has warmed or not.

Sounds simple enough. Each day we add up all these station’s daily average temperatures together, divide by the number of stations and, voilá, we have the average temperature for the world that day. Then do that for the next day and the next and…. Now we know the world’s average temperature, each day, for all that measurement period. Then compare the first and last days and we know how much warming has happened – how big the ‘Temperature Anomaly’ is - between the two days. We are calculating the ‘Anomaly of the Averages’. Sounds fairly simple doesn’t it? What could go wrong?

Absolutely everything.

So what is wrong with the method I described above?

1. Every station may not have data for the entire period covered by the record. They have come and gone over the years for all sorts of reasons. Or a station may not have a continuous record. It may not be measured on weekends because there wasn’t the budget for someone to read the station then. Or it couldn’t be reached in the dead of winter.

Imagine we have 5 measuring stations, A to E that have the following temperatures on a Friday:

A = 15, B = 10, C = 5, D = 20 & E = 25

The average of these is (15+10+5+20+25)/5 = 15

Then on Saturday, the temperature at each station is 2 °C colder because a weather system is passing over. But nobody reads station C because it is high in the mountains and there is no budget for someone to go up there at the weekend. So the average we calculate from the data we have available on Saturday is:

(13+8+18+23)/4 = 15.5.

But if station C had been read as well it would have been:

(13+8+3+18+23)/5 = 13

This is what we should be calculating! So our missing reading has distorted the result.

We can’t just average stations together! If we do, every time a station from a warmer climate drops off the record, our average drops. Every time a station from a colder climate drops off, our average rises. And the reverse for adding stations.  If stations report erratically then our record bounces erratically. We can’t have a consistent temperature record if our station list fluctuates and we are just averaging them. We need another answer!

2. Our temperature measurements aren’t from locations spaced evenly around the world. Much of the world isn’t covered at all – the 70% that is oceans. And even on land our stations are not evenly spread. How many stations are there in the roughly 1000 km between Maine and Washington DC, compared to the number in the roughly 4000 km between Perth & Darwin?

We need to allow for the fact that each station may represent the temperature of very different size regions. Just doing a simple average of all of them will mean that readings from areas with a higher station density will bias the result.  Again, we can’t just average stations together!

We need to use what is called an Area Weighted Average. Do something like: take each station's value, multiply it by the area it is covering, add all these together, and then divide by the total area. Now the world isn’t colder just because the New England states are having a bad winter!

3. And how good an indicator of its region is each station anyway? A station might be in a wind or rain shadow. It might be on a warm plain or higher in adjacent mountains, or in a deep valley that cools quicker as the Sun sets. It might get a lot more cloud cover at night or be prone to fogs that cause night-time insulation. So don’t we need a lot of stations to sample all these micro-climates to get a good reliable average? How small does each station’s ‘region’ need to be before its readings are a good indicator of that region?  If we are averaging stations together we need a lot of stations!

4. Many sources of bias and errors can exist in the records. Were the samples always taken at the same time of day? If Daylight Savings Time was introduced, was the sampling time adjusted for this? Where log sheets for a station (in the good old days before new fangled electronic recording gizmos) written by someone with bad handwriting – is that a 7 or a 9? Did the measurement technology or their calibrations change? Has the station moved, or changed altitude? Are there local sources of biasing around the station? And do these biases cause one-off changes or a time-varying bias?

We can’t take the reading from a station at face-value. We need to check for problems. And if we find them we need to decide whether we can correct for the problem or need to throw that reading or maybe all that station’s data away. But each reading is a precious resource – we don’t have a time-machine to go back and take another reading. We shouldn’t reject it unless there is no alternative.

So, we have a Big Problem. If we just average the temperatures of stations together, even with the Area Weighting answer to problem #2, this doesn’t solve problems #1, #3 or #4. It seems we need a very large detailed network, which has existed for all of the history of the network, with no variations in stations, measurement instruments etc, and without any measurement problems or biases.

And we just don’t have that. Our station record is what it is. We don’t have that time machine. So do we give up? No!

### How do stations' climates change?

Let’s consider a few key questions. If we look at just one location over its entire measurement history, say down on the plains, what will the numbers look like? Seasons come and go; there are colder and warmer years. But what is the longer term average for this location? What is meant by ‘long term’? The World Meteorological Organisation (WMO) defines Climate as the average of Weather over a 30 year period. So if we look at a location averaged over something like a 30 year period and compare the same location averaged over a different 30 year period, the difference between the two is how much the average temperature for that location has changed. And what we find is that they don’t change by very much at all. Short term changes may be huge but the long term average is actually pretty stable.

And if we then look at a nearby location, say up in the mountains, we see the same thing: lots of variation but a fairly stable average with only a small long term change. But their averages are very different from each other. So although a station’s average change over time is quite small, an adjacent station can have a very different average even though its change is small as well. Something like this:

Next question: if each of our two nearby stations averages only change by a small amount, how similar are the changes in their averages? This is not an idle question. It can be investigated, and the answer is: mostly very similar. Nearby locations will tend to have similar variations in their long term averages. If the plains warm long term by 0.5°C, it is likely that the nearby mountains will warm by say 0.4–0.6°C in the long term. Not by 1.5 or -1.5°C.

It is easy to see why this would tend to be the case. Adjacent stations will tend to have the same weather systems passing over them. So their individual weather patterns will tend to change in lockstep. And thus their long term averages will tend to be in lock-step as well. Santiago in Chile is down near sea level while the Andes right at its doorstep are huge mountains. But the same weather systems pass over both. The weather that Adelaide, Australia gets today, Melbourne will tend to get tomorrow.

Final question. If nearby locations have similar variations in their climate, irrespective of each station's local climate, what do we mean by ‘nearby’? This too isn’t an idle question; it can be investigated, and the answer is many 100’s of kilometres at low latitudes, up to 1000 kilometres or more at high latitudes. In Climatology this is the concept of ‘Teleconnection’ – that the climates of different locations are correlated to each other over long distances.

Figure 3, from Hansen & Lebedeff 1987 (apologies for the poor quality, this is an older paper) plots the correlation coefficients versus separation for the annual mean temperature changes between randomly selected pairs of stations with at least 50 common years in their records. Each dot represents one station pair. They are plotted according to latitude zones: 64.2-90N, 44.4-64.2N, 23.6-44.4N, 0-23.6N, 0-23,6S, 23.6-44.4S, 44.4-64.2S.

Notice how the correlation coefficients are highest for stations closer together and less so as they stretch farther apart. These relationships are most clearly defined at mid to high northern latitudes and mid southern latitudes – the regions of the Earth with higher proportions of land to ocean.

This makes intuitive sense since surface air temperatures of the oceanic regions are influenced also by water temperatures, ocean currents etc instead of just air masses passing over them, while land temperatures don’t have this other factor. So land temperatures would be expected to have better correlation since movement of weather systems over them is a stronger factor in their local weather.

This is direct observational evidence of Teleconnection. Not just climatological theory but observation.

So what if we do the following? Rather than averaging all our stations together, instead we start out by looking at each station separately. We calculate its long term average over some suitable reference period. Then we recalculate every reading for that station as a difference from that reference period average. We are comparing every reading from that station against its own long term average. Instead of a series of temperatures for a station, we now have a series of ‘Temperature Anomalies’ for that station.  And then we repeat this for each individual station, using the same reference period to produce the long term average for each separate station.

Then, and only then, do we start calculating the Area Weighted Average of these Anomalies. We are now calculating the ‘Area Average of the Anomalies’ rather than the ‘Anomaly of the Area Averages’ – now there’s a mouthful. Think about this. We are averaging the changes, not averaging the absolute temperatures.

Does this give us a better result? In our imaginary ideal world where we have lots of stations, always reporting all the time, no missing readings, etc., then these two methods will give the same result.

The difference arises when we work in an imperfect world. Here is an example (for simplicity I am only doing simple averages here rather than area weighted averages):

Let's look at stations A to E. Let's say their individual long term reference average temperatures are:

A = 15, B = 10, C = 5, D = 20 & E = 25

Then for one day's data their individual readings are:

A = 15.8, B = 10.4, C = 5.7, D = 20.4 & E = 25.3

Using the simple Anomaly of Averages method from earlier we have:

(15.8+10.4+5.7+20.9+25.3)/5 - (15+10+5+20+25)/5 = 0.52

While using our Average of Anomalies method we get:

((15.8-15) + (10.4-10) + (5.7-5) + (20.4-20) + (25.3-25))/5 = 0.52

Exactly the same!

However, if we remove station C as in our earlier example, things look very different. Anomaly of Averages gives us:

(15.8+10.4+20.4+25.3)/4 - (15+10+5+20+25)/5 = 2.975 !!

While Average of Anomalies gives us:

((15.8-15) + (10.4-10) + (20.4-20) + (25.3-25))/4 = 0.475

Obviously both values don’t match what the correct value would be if station C were included, but the second method is much closer to the correct value. Bearing in mind that Teleconnection means that adjacent stations will have similar changes in anomaly anyway, this ‘Average of Anomalies’ method is much less sensitive to variations in station availability.

Now let’s consider how this approach could be used when looking at station histories over long periods of time. Consider 3 stations in ‘adjacent’ locations. A has readings from 1900 to 1960. B has reading from 1930 to 2000 and C has readings from 1970 to today. A overlaps with B, B overlaps with C. But C doesn’t overlap with A. If our reference period is say 1930 – 1960, we can use the readings from A & B. But C doesn’t have any readings from our reference period. So how can we splice together A, B, & C to give a continuous record for this location?

Doesn’t this mean we can’t use C since we can’t reference it to out 1930-1960 baseline? And if we use a more recent reference period we lose A. Do we have to ignore C’s readings entirely? Surely that means that as the years roll by and the old stations disappear, eventually we will have no continuity to our record at all? That’s not good enough.

However there is a way we can ‘splice’ them together.

A & B have a common period from 1930-1960. And B & C have a common period from 1970-2000. So if we take the average of B from 1930 to 1960 and compare it to the same average from A for the same period we know how much their averages differ. Similarly we can compare the average of B from 1930-1960 to the average for B from 1970-2000 to see how much B has changed over the intervening period. Then we can compare B vs C over the 1970-2000 period to relate them together. Knowing these three differences, we can build a chain of relationships that links C1970-2000 to B1970-2000 to B1930-1960 to A1930-1960

Something like this:

If we have this sort of overlap we can ‘stitch together’ a time series stretching beyond more than one station’s data. We have the means to carry forward our data series beyond the life (and death) of any one station, as long as there is enough time overlap between them. But we can only do this if we are using our Average of Anomalies method. The Anomaly of Averages method doesn’t allow us to do this.

So where has this got us in looking at our problems? The Average of Anomalies approach directly addresses problem #1. Area Weighted Averaging addresses problem #2. Teleconnection and comparing a station to itself helps us hugely with problem #3 – if fog provides local insulation, it probably always had, so any changes are less related to the local conditions and more to underlying climate changes. Local station bias issues still need to be investigated but if they don’t change over time, then they don’t introduce ongoing problems. For example, if a station is too close to an artificial heat source, then this biases that station's temperature. But if this heat source has been a constant bias over the life of the station, then it cancels out when calculate the anomaly for the station. So this method also helps us with (although doesn’t completely solve) problem #4. In contrast, using the Anomaly of Averages method, local station biases and erratic station availability will compound each other making things worse.

So this looks like a better method.

Which is why all the surface temperature analyses use it!

The Average of Anomalies approach is used precisely because it avoids many of the problems and pitfalls.

In Part 1B I will look at how the main temperature records actually compile their trends.

Posted by Glenn Tamblyn on Sunday, 29 May, 2011