One of the most common mistakes I saw when teaching simple statistics in an introductory ecology lab is a misinterpretation of R² from a linear regression. Students tend to use R² as a measure of association strength; that is, they take a high R² to indicate a strong correlation. Although we have emphasized to students that R² is a goodness-of-fit measure, not a measure of association strength, I have never found a convincing way to explain to students why. John Vinson and I had a discussion on this issue today, and this post is an attempt to explain why.

Let’s start by making some data. We have *x* from 1 to 20 and *y* = *x* + some noise. We use a standard normal distribution, i.e. *N*(0, 1), for the noise. This is what the simulated data set looks like.

Now imagine that when we actually collect data, we may not cover the whole range of *x*. We may only collect a subset of the data, as shown in the figures below.
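As a rough sketch of this setup, the R snippet below simulates one data set and then keeps only the middle six points to mimic incomplete sampling. The seed, the choice of six points, and the variable names (`middle`, `fit_full`, `fit_middle`) are assumptions made for illustration; they are not part of the original simulation.

```r
# Sketch: one simulated data set, fitted on the full range and on a subset.
set.seed(1)                            # arbitrary seed, only for reproducibility
x <- 1:20
y <- x + rnorm(20, mean = 0, sd = 1)   # y = x + N(0, 1) noise

middle <- 8:13                         # hypothetical subset: the middle 6 points
fit_full   <- lm(y ~ x)                # regression on all 20 points
fit_middle <- lm(y[middle] ~ x[middle])  # regression on the subset only

r2_full   <- summary(fit_full)$r.squared
r2_middle <- summary(fit_middle)$r.squared

# Compare R^2 and the estimated slopes of the two fits.
print(c(full = r2_full, middle = r2_middle))
print(c(full = unname(coef(fit_full)[2]),
        middle = unname(coef(fit_middle)[2])))
```

Both fits estimate the same underlying slope of 1, so any difference in R² between them comes from the range of *x* that was sampled and from the noise, not from a change in the relationship itself.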

What I am going to show is this: the range of *x* influences the R² from a regression. R² tends to be larger when the range of *x* is larger. Keep in mind that *y* = *x* + some noise, so we know that *x* and *y* are “associated” to the same degree no matter what range of *x* we have. The fact that R² varies with the range of *x*, even though the association strength is the same, tells us that R² is not an indicator of association strength.
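One way to see why this should happen is to write R² as the share of the variance in *y* that is explained by *x*. The expression below is a sketch under the assumptions of this simulation (true slope β = 1, noise variance σ² = 1), and it describes the long-run, population-level value rather than the R² of any single fit.

```latex
R^2 \approx \frac{\beta^2 \,\mathrm{Var}(x)}{\beta^2 \,\mathrm{Var}(x) + \sigma^2}
    = \frac{\mathrm{Var}(x)}{\mathrm{Var}(x) + 1}
\qquad (\beta = 1,\ \sigma^2 = 1)
```

The slope and the noise are fixed, so the only thing that changes with the sampled range is Var(*x*). For *n* consecutive integers the variance is (*n*² − 1)/12, which gives 1.25 for the middle four points and 33.25 for all twenty, and so a ratio of roughly 0.56 versus 0.97. Individual fits, especially on only four points, can differ considerably from these values.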

Here are the details. For each simulated data set (*y* = *x* + noise), I ran a series of linear regressions over different ranges of *x*. That is, I regressed *y* on *x* using only the middle 4, 6, 8, 10, …, 20 points, and recorded the R² from each regression. I then repeated the whole procedure many times, simulating new data each time, so that each range of *x* has many R² values. These R² values differ from simulation to simulation (vertical dots), but overall R² increases with the range of *x* (as shown by the red line).

If you are interested, here is the R code for generating the data set and getting the R² from the linear regressions.

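    # Simulate y = x + N(0, 1) noise 100 times; for each simulation, regress
    # y on x using only the middle 4, 6, ..., 20 points and store each R^2.
    x <- 1:20
    Rsquare <- array(NA, dim = c(100, 9))   # rows: simulations; columns: ranges
    for (i in 1:100) {
      y <- x + rnorm(20, 0, 1)
      for (j in 1:9) {
        idx <- (10 - j):(11 + j)            # middle 2 * (j + 1) points
        Rsquare[i, j] <- summary(lm(y[idx] ~ x[idx]))$r.squared
      }
    }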
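To draw something like the figure described above (dots for individual simulations, a red line for the overall trend), the following sketch reruns the simulation and summarises R² for each range. The seed, the helper names `n_sim`, `n_points` and `mean_r2`, and the plotting choices are assumptions for illustration; the original figure may have been drawn differently.

```r
# Sketch: rerun the simulation, then summarise and plot R^2 by range of x.
set.seed(42)                         # arbitrary seed for reproducibility
x <- 1:20
n_sim <- 100
Rsquare <- matrix(NA_real_, nrow = n_sim, ncol = 9)

for (i in seq_len(n_sim)) {
  y <- x + rnorm(20, mean = 0, sd = 1)
  for (j in 1:9) {
    idx <- (10 - j):(11 + j)         # middle 2 * (j + 1) points
    Rsquare[i, j] <- summary(lm(y[idx] ~ x[idx]))$r.squared
  }
}

n_points <- 2 * (1:9) + 2            # 4, 6, ..., 20 points per regression
mean_r2  <- colMeans(Rsquare)        # average R^2 for each range

# Dots: individual simulations; red line: mean R^2 per range.
# The plot is only drawn in an interactive session (e.g. the R console).
if (interactive()) {
  matplot(n_points, t(Rsquare), pch = 16, col = "grey40",
          xlab = "Number of points used (range of x)", ylab = "R squared")
  lines(n_points, mean_r2, col = "red", lwd = 2)
}
```

Printing `round(mean_r2, 2)` gives the average R² for each range, which is the quantity the red line traces.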


## About Chao Song

I am a postdoctoral research associate at the Department of Fisheries and Wildlife, Michigan State University.