Updated: happy now; the words "hypothesis" and "null" do not appear (except to discard them).

There’d be no more abused tool in all of science than the linear regression “p-value” for trend significance. Don’t take my word for it; the problem is so severe that some technical journals have actually considered banning publication of p-values.¹

So what’s a trend p-value? The example I’m going to use is our Spencers Creek season peak snow depth record (midway between Perisher Valley and Thredbo, NSW, Australia, from Snowy Hydro). That looks like this:

The straight line there is the *ordinary least squares linear regression*, the “best fit” obtained using a standard mathematical technique that minimises the sum of the squares of the vertical differences between the line and the data points. “Minimising the squares” is not some arbitrary convenient choice; if the scatter about the line is normally distributed, it can be shown to be the method that gives the highest likelihood of correctly matching an underlying relationship (the “maximum likelihood” estimate).
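For anyone who wants to see the “minimising the squares” idea in action, here is a minimal sketch in plain Python. The data are made up for illustration (they are *not* the Spencers Creek record), and the helper names `ols_fit` and `sum_of_squares` are my own. It fits the line with the usual closed-form formulas, then checks that nudging the line in any direction only makes the sum of squares bigger.

```python
import random

def ols_fit(x, y):
    """Closed-form ordinary least squares fit: returns (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def sum_of_squares(x, y, slope, intercept):
    """Sum of squared vertical differences between the line and the points."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, NOT the real record: 61 "years" expressed in decades,
# an invented downtrend of -7 cm/decade plus random scatter.
rng = random.Random(1)
decades = [i / 10 for i in range(61)]
depth_cm = [200.0 - 7.0 * d + rng.gauss(0.0, 60.0) for d in decades]

slope, intercept = ols_fit(decades, depth_cm)
best = sum_of_squares(decades, depth_cm, slope, intercept)

# Try eight nudged lines around the fitted one; each should fit worse.
nudged = [
    sum_of_squares(decades, depth_cm, slope + ds, intercept + di)
    for ds in (-1.0, 0.0, 1.0)
    for di in (-5.0, 0.0, 5.0)
    if (ds, di) != (0.0, 0.0)
]
print(f"slope = {slope:.1f} cm/decade, sum of squares = {best:.0f}")
print("every nudged line is worse:", all(s > best for s in nudged))
```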

The slope of the line is -7.1 cm/decade. Also shown is an “R-squared” value, a statistical measure of “goodness of fit” (English expression isn’t high on the list of statisticians’ skills). R-squared is a measure of how much of the variance in the data is “explained” by the line. That word “explained” *is* actually well chosen; we’re not saying the line necessarily means anything, just that if you subtract it from the data, the variance (standard deviation squared) falls by a proportion equal to R². [Note that it’s R-*squared* (R²) — a reminder that the effect belongs to the variance (σ²) not the standard deviation (σ).]
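The “variance falls by a proportion equal to R²” statement is easy to check numerically. The sketch below uses invented data again (not the real record) and computes R² two ways: as the proportional drop in variance once the line is subtracted, and as the squared correlation between x and y. It also shows that the standard deviation falls by a smaller proportion, which is the point of the bracketed reminder above.

```python
import math
import random
from statistics import pvariance

# Hypothetical data for illustration only.
rng = random.Random(2)
x = [i / 10 for i in range(61)]
y = [200.0 - 7.0 * xi + rng.gauss(0.0, 60.0) for xi in x]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_mean - slope * x_mean
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# R-squared as the proportion of variance removed by subtracting the line.
r_squared = 1.0 - pvariance(residuals) / pvariance(y)
# The same number as the squared correlation coefficient.
r_squared_from_correlation = sxy ** 2 / (sxx * syy)
# The standard deviation falls by a smaller proportion than the variance.
sd_reduction = 1.0 - math.sqrt(1.0 - r_squared)

print(f"R-squared (variance reduction):    {r_squared:.4f}")
print(f"R-squared (squared correlation):   {r_squared_from_correlation:.4f}")
print(f"reduction in standard deviation:   {sd_reduction:.4f}")
```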

So we know what the slope is and how much of the jiggle in the data it “explains” (not all that much here). But does the trend mean anything, or is it just the stuff of chance observation? That’s the p-value.

### Significance testing

At this point standard treatments veer off into null hypotheses and the like … and the general reader turns off. Let’s not go there.

*The p-value is the probability that, by chance, we could see a trend as steep or steeper than that in our regression, if the underlying process contained no trend at all.*

Note that, although we call that “trend significance”, it’s actually more like a probability that our trend is spurious. But but but … it’s definitely not one minus the probability that our trend is valid. There’s heaps of grey between those two, and this method really can’t properly get at either. It’s a kind of *index test*; a guide rather than something absolute and definitive. (Therein lies much of the problem, and much woolly thinking … a discussion for another time.)

“The probability of seeing a trend as steep or steeper” … we don’t need fancy formulas for that; these days we can easily get it by statistical simulation. Just randomly generate a few hundred series with similar stats to ours but no (deliberate) trend, do a regression on each, record the slopes and fit a distribution to them. Then we can just read off the probability. There’ll be lots of generated slopes around about zero, with less and less the further you go outwards — both on the positive side (upslope) and on the negative side (downslope). That’s got to be a bell curve, like a normal distribution:
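A rough sketch of that kind of simulation, in plain Python, might look like the following. Several things here are assumptions of mine rather than the settings behind the plot: the noise is plain normal (no skewness or autocorrelation), its standard deviation of 62 cm is back-calculated from the slope standard error quoted further down, and the probability is read off by simply counting simulated slopes rather than by fitting a distribution to them.

```python
import random

N_YEARS = 61
OBSERVED_SLOPE = -7.1   # cm/decade, from the regression above
NOISE_SD_CM = 62.0      # assumption: back-calculated, not taken from the data
N_SIMS = 2000

def ols_slope(x, y):
    """Least squares slope of y against x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy / sxx

rng = random.Random(42)
decades = [i / 10 for i in range(N_YEARS)]

# Generate trend-free series and record the slope that chance alone produces.
slopes = []
for _ in range(N_SIMS):
    series = [rng.gauss(0.0, NOISE_SD_CM) for _ in decades]
    slopes.append(ols_slope(decades, series))

# Fraction of chance slopes at least as steep as the observed one.
p_down = sum(s <= OBSERVED_SLOPE for s in slopes) / N_SIMS
p_either = sum(abs(s) >= abs(OBSERVED_SLOPE) for s in slopes) / N_SIMS

print(f"as steep or steeper, downwards only:  {p_down:.3f}")
print(f"as steep or steeper, either direction: {p_either:.3f}")
```

The two counts differ only in whether upward slopes are included; which one is appropriate is a separate question.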

But we didn’t actually need to do all that. The problem can be solved analytically — was solved, long ago, largely by the famous RA Fisher. It turns out that our bell curve is just Mr Student’s old t-distribution.³ That’s a distribution very like the normal distribution⁴ but with one extra parameter that fattens its tails a bit. Its parameters are the *mean*, the *standard error* (“SE”, analogous to standard deviation) and the *degrees of freedom* (“df”, the ‘effective’ sample size — equal to n-2 for ordinary linear regression). Like the normal distribution, the raw t-distribution is commonly standardised by subtracting the mean (here that’s zero anyway) and dividing through by the standard error. The position in the standardised normal distribution is called a z-score — the sample deviation from the mean divided by the standard deviation — and for the t-distribution it’s the t-score — the sample value (our slope, “m”) divided by its standard error (and there’s a standard formula for that … of course).

So it’s easy … our p-value is just the cumulative t-distribution probability that the slope falls beyond the t-score, and the t-score is just the regression slope divided by its standard error.⁵ [Incidentally, that smooth curve in the probability plot above is not a fit to the simulation results; it’s a t-distribution with the standard error and degrees of freedom taken directly from our regression.]
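To make “the regression slope divided by its standard error” concrete, here is a small sketch using the textbook formula for the standard error of an ordinary least squares slope. The function name `trend_t_score` and the data are hypothetical. As a cross-check it also computes the t-score from R² and the degrees of freedom, using the relationship given in note 5.

```python
import math
import random

def trend_t_score(x, y):
    """Return (slope, standard_error, t_score, df, r_squared) for an OLS line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual sum of squares about the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    # Standard error of the slope: residual scatter scaled by the spread in x.
    standard_error = math.sqrt(sse / df / sxx)
    t_score = abs(slope / standard_error)
    r_squared = 1.0 - sse / syy
    return slope, standard_error, t_score, df, r_squared

# Hypothetical data for illustration only (not the Spencers Creek record).
rng = random.Random(3)
decades = [i / 10 for i in range(61)]
depth_cm = [200.0 - 7.0 * d + rng.gauss(0.0, 60.0) for d in decades]

slope, se, t_score, df, r_squared = trend_t_score(decades, depth_cm)
# Same t-score via R-squared and the degrees of freedom.
t_from_r2 = math.sqrt(df * r_squared) / math.sqrt(1.0 - r_squared)

print(f"slope = {slope:.2f}, SE = {se:.2f}, t = {t_score:.3f}, df = {df}")
print(f"t from R-squared = {t_from_r2:.3f}")
```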

But there’s still a quirk. We said “as steep or steeper”, but the conventional default is to interpret that as “as steep or steeper, upwards or downwards”. Effectively that’s saying we’re interested in whether or not there’s a trend, but we don’t care at all which way it goes. If so, we should allow that our deviation could be on either side of the distribution, leading to the so-called “2-tailed” or “2-sided” test. The p-value becomes the *total of the two sides*. Accepting that (more below), head off to some tables or a suitable stats function somewhere and you have your answer:

For the Spencers Creek peak depths, the values are:

m = -7.1 cm/decade

SE = 4.5 cm/decade

t = |m/SE| = 1.56

n = 61

df = n – 2 = 59

which gives

p = T(1.56, 59, 2-tails) = 12%.

We would conclude that the slope is *not* statistically significant, because the conventional cutoff for that is p ≤ 5%.
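If you have no tables or stats package to hand, the tail probability can be computed from scratch. The sketch below is one way to do it, assuming nothing beyond the Python standard library: it integrates the t-distribution density numerically with Simpson’s rule. The function names are mine, and the `tails` argument just mirrors the T(t, df, tails) shorthand used above. In practice a library routine (SciPy’s `scipy.stats.t.sf`, for instance) would do the same job.

```python
import math

def t_pdf(x, df):
    """Density of the standardised Student t-distribution."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_tail_prob(t, df, tails=2, steps=2000):
    """Probability of a t-score at least as extreme as t (one or two tails)."""
    t = abs(t)
    h = t / steps
    # Simpson's rule for the area under the density between 0 and t
    # (steps must be even).
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # Half the distribution lies above zero; what is left beyond t is one tail.
    one_tail = 0.5 - area_0_to_t
    return tails * one_tail

p = t_tail_prob(1.56, 59, tails=2)
print(f"p = {p:.3f}")  # about 0.12, the 12% quoted above
```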

### One or two tails?

The standard default for trend significance testing goes “is there a trend; don’t care positive or negative”, for which the answer is clearly two-tailed. The dubious contention is that this has to be the least biased (most conservative) assumption, but there are plenty of situations where it makes no sense at all. For example, for our snow depth record, even if the trend were positive (which looks seriously unlikely), in the context of discussing a possible global warming downtrend, that would be just another contrary result, not one that ought to be lumped together with finding a downtrend. In such circumstances it makes sense to frame the thing differently.

If we instead say that we’re interested in the presence or absence of a *downtrend* — but not concerned at all with an uptrend — the two tailed thing vanishes. The t-score is the same (and has much the same meaning), but the test is just on one side of the distribution. The t-distribution happens to be symmetrical, so the one-sided p-value is exactly half the two-sided. Our Spencers Creek peak depth p-value then falls back to 6%, still just outside the magic 5%.

### Outliers and bad data

Are we cherry picking here yet? There’s one more thing to think about with our dataset — data quality. There are several problems with bad data points in regression, the most obvious being that “sum of squares” thing. The square of the deviation to a bad data point — one far from the trend for some exogenous reason — may be very large; large enough to greatly skew the result. Obviously the slope will be most affected by outliers near the ends of the data, while the line position (intercept) can be affected by an outlier at any location. Users of regression tools need to be wary of outliers.
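The point about where an outlier sits can be shown with a toy example. Everything in this sketch is hypothetical: a perfectly straight 61-point series falling at 7 cm/decade, with a single point pushed down by 165 cm (a size borrowed purely for illustration) at the start, the middle or the end.

```python
def ols_fit(x, y):
    """Closed-form ordinary least squares fit: returns (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

decades = [i / 10 for i in range(61)]
clean = [200.0 - 7.0 * d for d in decades]   # exact line, no scatter at all

def fit_with_outlier(index, offset_cm=-165.0):
    """Fit the line after pushing one point away from it."""
    y = list(clean)
    y[index] += offset_cm
    return ols_fit(decades, y)

slope_start, intercept_start = fit_with_outlier(0)
slope_middle, intercept_middle = fit_with_outlier(30)
slope_end, intercept_end = fit_with_outlier(60)

for label, s, c in [("start", slope_start, intercept_start),
                    ("middle", slope_middle, intercept_middle),
                    ("end", slope_end, intercept_end)]:
    print(f"outlier at {label:6s}: slope = {s:6.2f} cm/decade, intercept = {c:6.1f} cm")
```

With these made-up numbers, the low point at either end shifts the slope by roughly 2.6 cm/decade (in opposite directions), while the same point in the middle leaves the slope at -7 and only drags the whole line down a little.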

One of the more distant outliers in our 61-year peak snow depth record is in the very first year, 1954. That year is also the worst fit to my multi-parameter snow depth prediction model. That’s *by a long way*: -1.65 m or a 3.7-sigma deviation, about a 1 in 10,000 chance. Were there teething problems that produced a spurious peak depth estimate that year? It is the case that the sampling was much less intense in 1954 than the other years; there are gaps in the data of three weeks (in July) and two weeks (in August and also in September). Did the measurements miss the true peak, perhaps by a substantial margin?
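As a quick check on the “1 in 10,000” figure: assuming the model errors are normally distributed and counting only one side of the distribution (both are my assumptions about how that number was reached), the chance of a deviation beyond 3.7 sigma comes from the complementary error function.

```python
import math

def normal_upper_tail(z):
    """One-sided probability that a standard normal value exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p = normal_upper_tail(3.7)
print(f"one-sided chance beyond 3.7 sigma: {p:.2e}, or about 1 in {1 / p:,.0f}")
```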

There are multiple responses we could try, but we need to be extremely careful; we’re transgressing ground that a good statistician may fear to tread.⁶ One thing to try — not so rare in the real world of, say, engineering flood frequency analysis — would be to artificially pad the 1954 data point to make up for the possible under-measurement. Simpler and more defensible (but with more dramatic effects in this case) is to just exclude 1954 altogether. That results in the following:

m = -8.9 cm/decade

SE = 4.6 cm/decade

t = |m/SE| = 1.95

n = 60

df = n – 2 = 58

which gives

p = T(1.95, 58, 1-tail) = 2.8%.

The trend becomes statistically significant (p ≤ 5%), but whether what we just did is appropriate is at best highly debatable.

### Issues

There’s a stack of issues with the classical t-score / p-value approach to assessing trend significance:

- It’s an ordinary (“frequentist”) statistical test, so it only considers the tested data. But there’s often *much more* known about the system than just that data. For our snow depth record, we know that our world is warming, and that that should reduce snow depths by well understood mechanisms: changes in the winter proportion of rain vs snow precipitation, reduction in overall winter precipitation, increased and earlier melt. It would be surprising indeed if there were no observable downtrend. The trend significance test can only say what the particular data shows; it *cannot* indicate the overall confidence in the validity of the trend.¹,²

- The test only applies to the slope, but the regression line is (usually) a *two parameter fit*. What about the other parameter, the intercept with the y-axis? In the classical test of trend significance it is simply ignored; the method just focuses on the slope. (The simplest fix there is to consider the confidence limits of the whole fit, not just of the slope … also for another time.)

- It’s assumed that the estimated slope fits a t-distribution with n-2 degrees of freedom. In fact that is an approximation. It is reasonably valid only in specific circumstances that are often neither considered nor met.

- Most importantly the data points need to have near normally distributed, independent and “homoscedastic” errors. Homoscedastic means that the spread (variance) of the errors should be unchanging; in our case, that it needs to be “stationary” over time. The Spencers Creek peak depths probably aren’t homoscedastic, but the effect seems to be slight. Nor are they exactly normally distributed, but the skewness is pretty small. The most abused of the three is “independent”. If there is positive autocorrelation (data points are correlated with nearby or preceding points, as they often are in a time series) then the test will overestimate trend significance. That’s *by a lot*, if the correlation is strong. The Spencers Creek peak depth series autocorrelation is actually weakly negative, and probably spurious.

  By the way, if we get our p-value by simulation instead of from the t-distribution (like we started out with above) these restrictions don’t apply. That’s provided of course that the simulated data appropriately reproduces the distribution, autocorrelation and heteroscedasticity of the real data.⁷

- The distinction between 1-tailed and 2-tailed testing is often misunderstood; even wantonly misapplied.
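The autocorrelation issue is the easiest of these to demonstrate by simulation. The sketch below is an illustration under my own assumptions, not an analysis of the snow record. It generates trend-free series of 60 points, either independent or with a simple AR(1) memory (each value is 0.5 times the previous one plus fresh noise), runs the classical test on each, and counts how often the trend is declared “significant” at the two-sided 5% level. The cutoff of 2.0 is the approximate critical t-score for 58 degrees of freedom.

```python
import math
import random

N = 60
N_SIMS = 2000
T_CRIT = 2.0  # approximate two-sided 5% critical t-score for df = 58

def abs_t_score(y):
    """|slope / standard error| for a regression of y on 0, 1, 2, ..."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (yi - y_mean) for i, yi in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * i) ** 2 for i, yi in enumerate(y))
    return abs(slope) / math.sqrt(sse / (n - 2) / sxx)

def ar1_series(rng, n, phi):
    """Trend-free AR(1) noise; phi = 0 gives independent values."""
    value = rng.gauss(0.0, 1.0) / math.sqrt(1.0 - phi * phi)  # stationary start
    out = []
    for _ in range(n):
        out.append(value)
        value = phi * value + rng.gauss(0.0, 1.0)
    return out

def false_alarm_rate(phi, seed):
    """Fraction of trend-free series that the classical test calls significant."""
    rng = random.Random(seed)
    hits = sum(abs_t_score(ar1_series(rng, N, phi)) > T_CRIT for _ in range(N_SIMS))
    return hits / N_SIMS

rate_independent = false_alarm_rate(0.0, seed=7)
rate_autocorrelated = false_alarm_rate(0.5, seed=7)

print(f"false alarms, independent data:    {rate_independent:.3f}")
print(f"false alarms, autocorrelated data: {rate_autocorrelated:.3f}")
```

With independent data the false alarm rate should sit near the nominal 5%; with positive autocorrelation it comes out several times higher, even though none of the series contains a trend.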

### Responses

First, notice that there are other, often better, ways (try confidence limits or Mann-Kendall).⁸ But if you must use the simple classical t-test for trend significance, learn to apply the thing correctly. Most importantly, treat it as a guide rather than some sort of fundamental gospel truth — as originally recommended by RA Fisher, the guy who invented the thing.

And if by chance you happen to be reading Mr Peterson on snow depth trend significance, be aware that he gets most of it wrong.

### References and notes

- Nuzzo, R. Scientific method: statistical errors. *Nature* 506 (2014): 150–152.

- Churchill, Gary A. When are results too good to be true? *Genetics* 198.2 (2014): 447–448.

- No, that’s not his real name, nor did he actually invent the thing — he just helped popularise it. “Student” worked at the Guinness brewery (really!) and needed to hide his identity. (What’s more, it turns out I know his granddaughter, an Australian mathematician.)

- The normal distribution and t-distribution are virtually identical above about 100 degrees of freedom.

- It turns out that the t-score is also a simple function of R-squared and the degrees of freedom: t = √(df·R²) / √(1−R²). So *for a given sample size*, if you know R-squared you also know the p-value, and if you know the p-value, you can also say how much of the variance the regression explains.

- Not quite. There’s an argument for excluding rare outliers even if you think they’re 100 percent valid, because classical regression techniques work poorly in the presence of rare, extreme data points. There’s even another one of those arcane statistical tests to help decide. If we were convinced that the point is valid, I’d be more inclined to arbitrarily move it into the “reasonable” range (say put it at 2-sigma) rather than exclude it altogether, which would amount to discarding valid information.

- My simulation above included the skewness but not the negative autocorrelation, which I think is probably spurious. It didn’t and shouldn’t have included the small proportional heteroscedasticity (think about it…).

- Yue, Sheng, Paul Pilon, and George Cavadias. Power of the Mann–Kendall and Spearman’s rho tests for detecting monotonic trends in hydrological series. *Journal of Hydrology* 259.1 (2002): 254–271.

And for a good all round summary, try http://www.biostathandbook.com/linearregression.html.