Why Is R2 Not a Measure of Association Strength?

One of the most common mistakes I saw when teaching simple statistics in an introductory ecology lab is a misinterpretation of R2 from a linear regression. Students tend to use R2 as a measure of association strength. That is, they read a high R2 as evidence of a strong correlation. Although we have emphasized to students that R2 is a goodness-of-fit measure, not a measure of association strength, I have never found a convincing way to explain why. John Vinson and I discussed this issue today, and this post is an attempt at that explanation.

Let’s start by making some data. We have x from 1 to 20 and y = x + some noise. We use a standard normal distribution, i.e. N(0,1), for the noise. This is what the simulated data set looks like.

[Figure: scatter plot of the simulated y against x]
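For anyone who wants to reproduce a data set like this one, here is a minimal sketch in R. The seed is an arbitrary choice added only so the numbers are repeatable, so the points will not match the figure exactly.

```r
# Simulate y = x + N(0, 1) noise for x = 1, ..., 20.
set.seed(1)  # assumed seed, for reproducibility only
x <- 1:20
y <- x + rnorm(length(x), mean = 0, sd = 1)

# Scatter plot of the full simulated data set
plot(x, y, xlab = "x", ylab = "y", main = "Simulated data: y = x + noise")
```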

Now imagine that when we actually collect data, we may not cover the whole range of x. We may only collect a subset of the data, as shown in the figures below.

[Figure: y against x for subsets of the data covering different ranges of x]

What I am going to show is this: the range of x influences the R2 from your regression. R2 tends to be larger when the range of x is larger. Keep in mind that y = x + some noise, so we know x and y are “associated” to the same degree no matter what range of x we have. The fact that R2 varies with the range of x, even though we know the association strength is the same, tells us that R2 is not an indicator of association strength.
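One way to see why this happens is a rough sketch that treats x as a random variable with variance \sigma_x^2. The symbols \beta (slope), \sigma_x^2 and \sigma^2 (noise variance) are notation I am introducing here for illustration. For y = \beta x + \varepsilon, the population R2 is the share of the variance of y that comes from x:

```latex
R^2 = \frac{\beta^2 \sigma_x^2}{\beta^2 \sigma_x^2 + \sigma^2}
```

In our setup \beta = 1 and \sigma^2 = 1 are fixed, so the only quantity that changes with the range of x is \sigma_x^2. A wider range means a larger \sigma_x^2, which pushes R2 toward 1. The R2 from a small sample will not match this value exactly, but it follows the same trend.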

Here are the details. For any set of x and y (y is simulated as x + noise), I ran a series of linear regressions over different ranges of x. That is, I regressed y on x using only the middle 4, 6, 8, 10, ..., 20 points. Each regression gives one R2. I then repeated the whole procedure many times, simulating a new y each time, so for each range of x I get many R2 values. These values differ from simulation to simulation (vertical dots), but overall R2 increases with the range of x (as shown by the red line).

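[Figure: R2 against the range of x; dots are individual simulations and the red line shows the overall trend]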

If you are interested, here is the R code for generating the data set and getting the R2 from the linear regressions.

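x <- 1:20
# Rows are simulations; columns are x ranges (middle 4, 6, ..., 20 points)
Rsquare <- array(NA, dim = c(100, 9))
for (i in 1:100) {
	# Simulate y = x + N(0, 1) noise
	y <- x + rnorm(20, 0, 1)
	for (j in 1:9) {
		# Fit on the middle 2 * (j + 1) points and store the R2
		idx <- (10 - j):(11 + j)
		Rsquare[i, j] <- summary(lm(y[idx] ~ x[idx]))$r.squared
	}
}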
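To turn the simulated R2 values into a plot of dots with a trend line, something like the following sketch could be used. It repeats the simulation so that it runs on its own. The seed and the names n_points, r_squared and mean_r2 are illustrative assumptions, and I am assuming here that the trend line is the mean R2 across simulations for each range.

```r
set.seed(42)  # assumed seed, for reproducibility only
x <- 1:20
n_sim <- 100
n_points <- seq(4, 20, by = 2)  # number of middle points used in each fit

# Rows are simulations, columns are ranges of x
r_squared <- matrix(NA_real_, nrow = n_sim, ncol = length(n_points))
for (i in seq_len(n_sim)) {
  y <- x + rnorm(length(x), mean = 0, sd = 1)
  for (j in seq_along(n_points)) {
    idx <- (10 - j):(11 + j)  # middle 2 * (j + 1) points
    r_squared[i, j] <- summary(lm(y[idx] ~ x[idx]))$r.squared
  }
}

# Mean R2 for each range of x (assumed definition of the trend line)
mean_r2 <- colMeans(r_squared)

# One column of dots per range, with the mean overlaid in red
matplot(n_points, t(r_squared), type = "p", pch = 16, col = "grey50",
        xlab = "Number of points (range of x)", ylab = "R2")
lines(n_points, mean_r2, col = "red", lwd = 2)
```

Printing mean_r2 gives one average R2 per range, which makes the upward trend easy to check without the plot.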

About Chao Song

I am a PhD student in Odum School of Ecology at the University of Georgia. I study carbon dynamics in various ecosystems, using both theoretical and experimental approaches.
This entry was posted in Statistics, Teaching.