A Complete Work Flow of Fitting a Simple Linear Model

I recently discussed with one of my fellow graduate students testing linear contrasts from a two way ANOVA model. He was a little confused about implementing it in R and interpreting some of the model outputs. As I discussed in an earlier post, we often learn statistical methods piece by piece without the complete workflow in class. Hence, this is the first of a series of blog posts I intend to write to show examples of complete workflow of data analysis. For those familiar with statistics, this is very simple. But I hope this is useful for those who are not so familiar with it.


I will use a built in dataset in R package MASS for demonstration. The dataset, called oats, comes from an experiment where researchers investigated how nitrogen fertilizer dose and oats variety influence yield. The original experiment has a split plot design within block but let’s ignore that for simplicity. Let’s assume it is a full factorial design with three varieties and four nitrogen levels. Within each treatment combination, there are 6 replicates, resulting in a total of 72 measurements. The first few lines of data are shown below. Here Y is yield, V is variety of oats and N is the nitrogen fertilization level.

> library(MASS)
> head(oats)
  B       V          N   Y
1 I Victory     0.0cwt 111
2 I Victory     0.2cwt 130
3 I Victory     0.4cwt 157
4 I Victory     0.6cwt 174
5 I Golden.rain 0.0cwt 117
6 I Golden.rain 0.2cwt 114

Exploratory data analysis

The first step to analyze the data is to get an idea of how the data look like. Function xyplot in package lattice is handy for showing data in the factorial design.

> library(lattice)
> xyplot(Y~N|V, data=oats, layout=c(3,1)


Based on the graph, it appears that high N level leads to higher yield. When comparing three panels, the yield does not seem to differ that much among varieties. If we compare the “slope” across three panels, there does not seem to be much difference. We will test these more formally later.

Fit linear model

Next, we can fit a linear model to the data. Both variety of oats and nitrogen level are treated as categorical variables. Some may wonder if it is possible to treat N as continuous variable. It is possible but we are making a more specific assumption about how nitrogen influence yield. When treated as categorical variable, we assume that each level of nitrogen corresponds to a mean yield. This is a very general assumption. When treated as continuous variable,  we assume that yield change linearly with nitrogen. So the choice should first be based on examining the graph and your knowledge on how nitrogen should theoretically affects yield. You should also consider the goal of the research. If you are interested in comparing effects of particular levels of fertilization, it is preferable to treat it as categorical variable. But if you are interested in characterizing the response curve and predict yield with arbitrary amount of nitrogen, it is more helpful to treat it as continuous. In short, you should consider 1) theoretical relationship between variables based on  your knowledge of the system and 2) practical goal of the data analysis. Here, we treat N as a categorical variable for demonstration. We fit a model with both main effect of variety and nitrogen and their interaction.

> mod = lm(Y~V*N, data=oats)
> summary(mod)

lm(formula = Y ~ V * N, data = oats)

    Min      1Q  Median      3Q     Max
-38.500 -16.125   0.167  10.583  55.500 

                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          80.0000     9.1070   8.784 2.28e-12 ***
VMarvellous           6.6667    12.8792   0.518 0.606620
VVictory             -8.5000    12.8792  -0.660 0.511793
N0.2cwt              18.5000    12.8792   1.436 0.156076
N0.4cwt              34.6667    12.8792   2.692 0.009199 **
N0.6cwt              44.8333    12.8792   3.481 0.000937 ***
VMarvellous:N0.2cwt   3.3333    18.2140   0.183 0.855407
VVictory:N0.2cwt     -0.3333    18.2140  -0.018 0.985459
VMarvellous:N0.4cwt  -4.1667    18.2140  -0.229 0.819832
VVictory:N0.4cwt      4.6667    18.2140   0.256 0.798662
VMarvellous:N0.6cwt  -4.6667    18.2140  -0.256 0.798662
VVictory:N0.6cwt      2.1667    18.2140   0.119 0.905707
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.31 on 60 degrees of freedom
Multiple R-squared:  0.4257,	Adjusted R-squared:  0.3204
F-statistic: 4.043 on 11 and 60 DF,  p-value: 0.0001964

After fitting the model, we should evaluate whether the model assumptions are roughly adequate. This is done mostly through evaluation of residuals. In a classical linear model, we assume an independent and identical normal distribution of error. For linear model, you can directly use plot(mod) to see a few diagnostic plots. Alternatively, you can extract the residual and fitted value to do that manually.

> par(mfrow=c(2,2), mar=c(4,4,1,1), oma=c(0,0,0,0),cex=0.85)
> qqnorm(rstandard(mod))
> plot(rstandard(mod)~fitted(mod))
> plot(rstandard(mod)~as.numeric(oats$N))
> plot(rstandard(mod)~as.numeric(oats$V))

Here, I use quantile plot to examine normality. I also plot residual against fitted value and all factors in the model to see if the assumption of linearity and homoskedasticity hold. I use standardized residual instead of raw residual because the variance of raw residual is not strictly equal (Although using raw residual for diagnosing heteroskedasticity is common and the heteroskedasticity in raw residual is not severe). The reason I use as.numeric(oats$N) is to force R to use scatter plot instead of box plot. This is probably not standard graphing practice but it gets the job done.


The quantile plot shows some skewness to the right but it is not that severe. The residual does not have a trend when plotted against fitted value (which would suggest nonlinearity ). The variance is roughly constant across treatments and fitted values. The diagnostic plot based on residual seems to suggest that the assumptions of linear model is adequate.

Test hypotheses

The most common goals of factorial experiment are to determine 1) whether yield differs among varieties, 2) whether yield differs among nitrogen levels and 3) whether there is interaction, i.e. the influence of nitrogen depends on variety. We achieve that with analysis of variance after fitting the linear model.

> anova(mod)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value    Pr(>F)
V          2  1786.4   893.2  1.7949    0.1750
N          3 20020.5  6673.5 13.4108 8.367e-07 ***
V:N        6   321.7    53.6  0.1078    0.9952
Residuals 60 29857.3   497.6
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We see that N level significantly influences yield. There is not a significant interaction and the varieties do not differ significantly in their yield. Give that the nitrogen effect is significant, we can perform a pairwise comparison of nitrogen levels. There are several ways to do the post hoc multiple comparison in linear model. I choose to use function glht to do it as it is a very versatile function to test hypothesis generally.

> summary(glht(mod, linfct=mcp(N="Tukey", interaction_average=T)))

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = Y ~ V * N, data = oats)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)
0.2cwt - 0.0cwt == 0   19.500      7.436   2.622  0.05264 .
0.4cwt - 0.0cwt == 0   34.833      7.436   4.685  < 0.001 ***
0.6cwt - 0.0cwt == 0   44.000      7.436   5.917  < 0.001 *** 0.4cwt - 0.2cwt == 0   15.333      7.436   2.062  0.17741     0.6cwt - 0.2cwt == 0   24.500      7.436   3.295  0.00865 **  0.6cwt - 0.4cwt == 0    9.167      7.436   1.233  0.60874     --- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Adjusted p values reported -- single-step method) 

What we see from the post-hoc pairwise comparison is that 0.4cwt increment in nitrogen fertilizer causes a difference in yield. We can also easily construct a confidence interval of these difference.

> confint(glht(mod, linfct=mcp(N="Tukey", interaction_average=T)))

	 Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = Y ~ V * N, data = oats)

Quantile = 2.6435
95% family-wise confidence level

Linear Hypotheses:
                     Estimate lwr      upr
0.2cwt - 0.0cwt == 0  19.5000  -0.1567  39.1567
0.4cwt - 0.0cwt == 0  34.8333  15.1766  54.4900
0.6cwt - 0.0cwt == 0  44.0000  24.3433  63.6567
0.4cwt - 0.2cwt == 0  15.3333  -4.3234  34.9900
0.6cwt - 0.2cwt == 0  24.5000   4.8433  44.1567
0.6cwt - 0.4cwt == 0   9.1667 -10.4900  28.8234

Function glht also allows us to test any linear hypothesis. For example, we may want to see if the nitrogen influence is additive for oats Victory. That is, we want to see if the increases in yield due to 0.6cwt nitrogen is the sum of increase caused by 0.2 and 0.4cwt individually for this specific variety of oats. We can test this hypothesis in a linear contrast. Using the term in the linear model output, The yield increase of Victory with 0.2cwt nitrogen is  N0.2cwt+VVictory:N0.2cwt. We can calculate the yield increase for other dose of nitrogen amount. In the end, the hypothesis we want to test is (N0.2cwt+VVictory:N0.2cwt)+(N0.4cwt+VVictory:N0.4cwt)-(N0.6cwt+VVictory:N0.6cwt)=0. We can implement this hypothesis test with the following code.

> K=matrix(c(0,0,0,1,1,-1,0,1,0,1,0,-1), nrow=1)
> summary(glht(mod, linfct=K))

	 Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = Y ~ V * N, data = oats)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)
1 == 0    10.50      18.21   0.576    0.566
(Adjusted p values reported -- single-step method)

What we see from here is that the sum of 0.2cwt and 0.4cwt does not show significant different than 0.6cwt. This is intuitive as we see when plot the data. The relationship between nitrogen and yield seem to be fairly linear.

Model prediction

With the fitted model, we can calculate confidence interval of predicted mean or the prediction interval. For example, we want to predict the yield of variety Victory at 0.2cwt nitrogen fertilizer. The code below does this prediction. The choice of “p” and “c” stands for prediction interval and confidence interval of the mean. We can also predict the response from a range of predictors and then visualize them in figures.

> predict(mod, newdata=data.frame(V="Victory",N="0.2cwt"), interval="p")
       fit     lwr      upr
1 89.66667 41.4699 137.8634
> predict(mod, newdata=data.frame(V="Victory",N="0.2cwt"), interval="c")
       fit   lwr      upr
1 89.66667 71.45 107.8833

Few more things

1. Regarding the pairwise comparison, I used Tukey method to control family wise error rate in this case. This is appropriate because I want to do all pairwise comparison. Tukey’s method is generally considered the most appropriate method for comparing all pairs. If the goal is to compare all treatments to a control, the Dunnett method has more statistical power. It can also be implemented in glht. For more in depth understanding of multiple comparison and its implementaion in R, the book Multiple Comparison Using R is an excellent reference.

2. I did not perform any model selection before hypothesis testing. In my opinion, the necessity of  model selection depends on the goal of research. If it is a designed experiment to specifically test certain factors, the model should reflect the experimental design. Model selection is essentially hypothesis testing. I would rather retain the structure of model that reflects the experimental design and intention. Then I leave the decision of significance of a term to hypothesis test. However, if the goal of fitting the model is to explore bunch of predictors and find a most efficient predictive model, it seems to be proper to perform a model selection. And you should be careful in what model selection metric you use. See this wonderful discussion in Ecology on the difference between AIC and BIC.

3. The dataset here is a balanced data set, i.e. there are equal number of replicates for each treatment level. In the case of balanced data, all types of sum of square are equal. It does not matter which type you choose. Function anova in R automatically use type I sum of square, as known as the sequential partition of sum of square. But in case you data are unbalanced and you want to test equality among model parameters, type III ANOVA should be used. That can be done with function Anova in package car. You should also specify a sum of zero contrast coding when doing type III sum of square. Otherwise, it is not testing the correct hypothesis. This can be achieved with following code.

> library(car)
> options(contrasts=c("contr.sum", "contr.poly"))
> mod = lm(Y~N*V, data=oats)
> Anova(mod, type=3)

About Chao Song

I am a postdoctoral research associate at the Department of Fisheries and Wildlife, Michigan State University.
This entry was posted in R, Statistics. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s