Spawned from another thread, I'd like to work through an example of regression analysis and some of the complications that come with it. So everyone's on the same page, I'm using the data freely available at http://data.giss.nasa.gov/gistemp/graphs_v3/. For the moment I want to use the Global Monthly Mean Surface Temperature Change data. There are two columns of data within that set. For the purposes of argument, let's use the Land-Ocean temperature index, since it best demonstrates the problems with regression analysis.

So, the question is, given that the data set is quite noisy, what conclusions can you draw from it? Does it show a trend? Does it show no trend? (Note that those could both be false.)

More specifically, if you fit a simple linear model to the data, you find that the best-fit line has a slope of 0.01 and that slope's standard error is 0.0015, which gives it a p-value of 3.5 * 10^-10. Now, that's not much of a slope, but that's also a tiny p-value. In case you don't know, a p-value estimates how likely it is that you'd see a slope at least that large purely by chance if there were actually no trend. If the assumptions underlying the model are sound, it's a pretty good estimate. If you're using a statistical package to do this analysis (such as R, which is free) you can also look at some diagnostic graphs to check whether those assumptions hold, and in this case, those graphs look pretty reasonable to me.
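To make that concrete, here's a quick sketch in Python (using scipy rather than R, and synthetic data rather than the actual GISS file — the 0.01 slope and 0.14 noise level are taken from the numbers quoted above, everything else is made up for illustration). The point is that a small trend buried in much larger noise can still produce a tiny p-value when there are enough data points:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the anomaly series: a 0.01-per-year trend
# buried in noise with standard deviation 0.14.
rng = np.random.default_rng(42)
months = np.arange(1600)            # roughly 133 years of monthly data
x = months / 12.0                   # time in years
y = 0.01 * x + rng.normal(0.0, 0.14, size=x.size)

result = stats.linregress(x, y)
print(f"slope = {result.slope:.4f} +/- {result.stderr:.4f}, "
      f"p = {result.pvalue:.2e}")
# The fit recovers a slope near 0.01 with a vanishingly small p-value,
# even though the noise dwarfs the per-year trend.
```

With real data you'd obviously load the GISS file instead of simulating, but the qualitative behavior is the same.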

Now, the standard deviation of the temperature is 0.14, and the contention is that such a huge deviation relative to the slope means that you can't draw any conclusions at all from the data set.

I think that there are two separate things getting confused here. There's the error associated with the regression line and its coefficients, and there's the error associated with any value predicted using that line, and those aren't the same thing. As you get more and more data points, the errors associated with the line itself go to zero, since the individual errors average out. On the other hand, the error associated with any prediction is limited by the errors within the data itself. You can't make a guess more accurate than the data you started with.
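You can see both behaviors directly from the standard formulas. At the center of the data, the standard error of the fitted line is roughly s/sqrt(n), which shrinks as n grows, while the standard error for a new single observation is roughly s*sqrt(1 + 1/n), which can never drop below the residual scatter s. A sketch (again in Python, with made-up data using the 0.14 noise level from above):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.14

def interval_ses(n):
    # Fit a line to n noisy points; return the standard error of the
    # fitted mean response at the center of the data, and the standard
    # error of a single new observation there.
    x = np.linspace(0, 10, n)
    y = 0.01 * x + rng.normal(0, sigma, n)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    s = np.sqrt(np.sum(resid**2) / (n - 2))  # residual standard error
    se_mean = s * np.sqrt(1 / n)             # SE of the line at x = mean(x)
    se_pred = s * np.sqrt(1 + 1 / n)         # SE of a new observation there
    return se_mean, se_pred

for n in (20, 200, 2000):
    se_mean, se_pred = interval_ses(n)
    print(n, round(se_mean, 4), round(se_pred, 4))
# se_mean shrinks roughly like 1/sqrt(n); se_pred stays near sigma = 0.14.
```

So more data nails down the trend ever more precisely, but it doesn't make any individual month more predictable.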

What I think this means in this case is that we can be confident that a small but real upward trend exists within the data, but that trend is small enough that trying to say much more than that (for example, what the temperature will be next year) is pointless, since the 0.01-degree contribution from the trend would be lost in the 0.14-degree noise within the data.
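The back-of-the-envelope version of that comparison, taking the quoted 0.01 slope and 0.14 standard deviation as given:

```python
# Rough comparison using the numbers quoted above.
slope_per_year = 0.01        # fitted trend
noise_sd = 0.14              # residual scatter of the anomalies
band = 1.96 * noise_sd       # ~95% half-width for one new observation
print(round(band, 3))        # prints 0.274
# One year of trend (0.01 degrees) sits inside a prediction band of
# roughly +/- 0.27 degrees, about 27 times wider.
```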

As a starting point for references: the Wikipedia article on mean and predicted response, and http://www.stat.cmu.edu/~roeder/stat707/lectures.pdf. I'm looking for more and better references, but this isn't the kind of stuff you find in papers. It's textbook stuff, and I don't have any good stat textbooks on me. (If anyone has any better links, feel free to post them.)