I recently discussed testing linear contrasts from a two-way ANOVA model with one of my fellow graduate students. He was a little confused about implementing it in R and interpreting some of the model outputs. As I discussed in an earlier post, we often learn statistical methods piece by piece in class, without seeing the complete workflow. Hence, this is the first of a series of blog posts in which I intend to show examples of a complete data analysis workflow. For those familiar with statistics, this is all very simple, but I hope it is useful for those who are less so.

**Data**

I will use a built-in dataset from the R package MASS for demonstration. The dataset, called oats, comes from an experiment in which researchers investigated how nitrogen fertilizer dose and oat variety influence yield. The original experiment had a split-plot design within blocks, but let's ignore that for simplicity and assume it is a full factorial design with three varieties and four nitrogen levels. Within each treatment combination there are 6 replicates, for a total of 72 measurements. The first few lines of the data are shown below. Here Y is yield, V is the oat variety and N is the nitrogen fertilization level.
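Before going further, it is worth confirming the structure and balance of the data. This quick check is my addition and uses only base R plus MASS:

```r
library(MASS)  # provides the oats dataset

# Inspect the structure: B (block), V (variety), N (nitrogen) are factors, Y is numeric
str(oats)

# Cross-tabulate variety by nitrogen level: every cell should contain 6 replicates
with(oats, table(V, N))

# Total number of measurements: 3 varieties x 4 N levels x 6 replicates = 72
nrow(oats)
```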

```
> library(MASS)
> head(oats)
  B           V      N   Y
1 I     Victory 0.0cwt 111
2 I     Victory 0.2cwt 130
3 I     Victory 0.4cwt 157
4 I     Victory 0.6cwt 174
5 I Golden.rain 0.0cwt 117
6 I Golden.rain 0.2cwt 114
```

**Exploratory data analysis**

The first step in analyzing the data is to get an idea of what the data look like. The function xyplot in the package lattice is handy for displaying data from a factorial design.

```
> library(lattice)
> xyplot(Y ~ N | V, data = oats, layout = c(3, 1))
```

Based on the graph, it appears that higher N levels lead to higher yield. Comparing the three panels, yield does not seem to differ much among varieties. If we compare the “slope” across the three panels, there does not seem to be much difference either, which suggests little interaction. We will test these impressions more formally later.

**Fit linear model**

Next, we can fit a linear model to the data. Both oat variety and nitrogen level are treated as categorical variables. Some may wonder whether it is possible to treat N as a continuous variable. It is possible, but doing so makes a more specific assumption about how nitrogen influences yield. When N is treated as categorical, we assume only that each level of nitrogen corresponds to some mean yield, which is a very general assumption. When N is treated as continuous, we assume that yield changes linearly with nitrogen. So the choice should first be based on examining the graph and your knowledge of how nitrogen should theoretically affect yield. You should also consider the goal of the research. If you are interested in comparing the effects of particular levels of fertilization, it is preferable to treat N as categorical. But if you are interested in characterizing the response curve and predicting yield at arbitrary amounts of nitrogen, it is more helpful to treat it as continuous. In short, you should consider 1) the theoretical relationship between the variables based on your knowledge of the system and 2) the practical goal of the data analysis. Here, we treat N as a categorical variable for demonstration. We fit a model with the main effects of variety and nitrogen and their interaction.
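For comparison, here is a sketch of the continuous-N alternative. This is my addition, and the column name Nnum is made up for this example; the dose is parsed out of factor labels like "0.2cwt" and entered as a linear term:

```r
library(MASS)

# Parse the numeric dose (in cwt) out of labels like "0.2cwt";
# Nnum is a hypothetical column name introduced here for illustration
oats$Nnum <- as.numeric(sub("cwt", "", as.character(oats$N)))

# Linear-in-nitrogen model: assumes yield changes linearly with dose,
# with a possibly different slope for each variety
mod_lin <- lm(Y ~ V * Nnum, data = oats)

# 6 coefficients instead of 12: intercept, 2 variety contrasts,
# one common slope, and 2 variety-specific slope adjustments
length(coef(mod_lin))
```

This model can predict yield at any dose in the observed range, at the cost of assuming linearity.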

```
> mod = lm(Y ~ V * N, data = oats)
> summary(mod)

Call:
lm(formula = Y ~ V * N, data = oats)

Residuals:
    Min      1Q  Median      3Q     Max
-38.500 -16.125   0.167  10.583  55.500

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          80.0000     9.1070   8.784 2.28e-12 ***
VMarvellous           6.6667    12.8792   0.518 0.606620
VVictory             -8.5000    12.8792  -0.660 0.511793
N0.2cwt              18.5000    12.8792   1.436 0.156076
N0.4cwt              34.6667    12.8792   2.692 0.009199 **
N0.6cwt              44.8333    12.8792   3.481 0.000937 ***
VMarvellous:N0.2cwt   3.3333    18.2140   0.183 0.855407
VVictory:N0.2cwt     -0.3333    18.2140  -0.018 0.985459
VMarvellous:N0.4cwt  -4.1667    18.2140  -0.229 0.819832
VVictory:N0.4cwt      4.6667    18.2140   0.256 0.798662
VMarvellous:N0.6cwt  -4.6667    18.2140  -0.256 0.798662
VVictory:N0.6cwt      2.1667    18.2140   0.119 0.905707
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.31 on 60 degrees of freedom
Multiple R-squared:  0.4257,	Adjusted R-squared:  0.3204
F-statistic: 4.043 on 11 and 60 DF,  p-value: 0.0001964
```

After fitting the model, we should evaluate whether the model assumptions are roughly adequate. This is done mostly through examination of the residuals. In a classical linear model, we assume independent, identically distributed normal errors. For a linear model, you can directly use plot(mod) to see a few diagnostic plots. Alternatively, you can extract the residuals and fitted values and make the plots manually.

```
> par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 0, 0), cex = 0.85)
> qqnorm(rstandard(mod))
> plot(rstandard(mod) ~ fitted(mod))
> plot(rstandard(mod) ~ as.numeric(oats$N))
> plot(rstandard(mod) ~ as.numeric(oats$V))
```

Here, I use a quantile-quantile plot to examine normality. I also plot the residuals against the fitted values and against each factor in the model to see whether the assumptions of linearity and homoskedasticity hold. I use standardized residuals instead of raw residuals because the variances of the raw residuals are not strictly equal (although using raw residuals to diagnose heteroskedasticity is common, and the heteroskedasticity in the raw residuals is not severe here). The reason I use as.numeric(oats$N) is to force R to draw a scatter plot instead of a box plot. This is probably not standard graphing practice, but it gets the job done.

The quantile plot shows some skewness to the right, but it is not severe. The residuals show no trend when plotted against the fitted values (a trend would suggest nonlinearity). The variance is roughly constant across treatments and fitted values. Overall, the residual diagnostics suggest that the assumptions of the linear model are adequate.
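If you want a formal test to complement the visual diagnostics, a Shapiro-Wilk test on the standardized residuals is one option. This is my addition; note that with moderate sample sizes such tests can flag departures from normality that are practically harmless, so the plots should remain the primary tool:

```r
library(MASS)
mod <- lm(Y ~ V * N, data = oats)

# Shapiro-Wilk test of normality applied to the standardized residuals;
# a small p-value indicates evidence against normality
shapiro.test(rstandard(mod))
```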

**Test hypotheses**

The most common goals of a factorial experiment are to determine 1) whether yield differs among varieties, 2) whether yield differs among nitrogen levels, and 3) whether there is an interaction, i.e. whether the influence of nitrogen depends on the variety. We can address all three with an analysis of variance after fitting the linear model.

```
> anova(mod)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq F value    Pr(>F)
V          2  1786.4   893.2  1.7949    0.1750
N          3 20020.5  6673.5 13.4108 8.367e-07 ***
V:N        6   321.7    53.6  0.1078    0.9952
Residuals 60 29857.3   497.6
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
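Each line of this sequential (type I) table corresponds to a nested model comparison. As an illustration (my addition), the N line is the improvement from adding N to a model that already contains V. The sum of squares matches the table exactly; the F statistic differs slightly because anova() on two models uses the smaller model pair's residual mean square rather than the full model's:

```r
library(MASS)

# Nested comparison: does adding N improve on a variety-only model?
reduced <- lm(Y ~ V, data = oats)
with_n  <- lm(Y ~ V + N, data = oats)

# "Sum of Sq" for the second row reproduces the 20020.5 in the ANOVA table
anova(reduced, with_n)
```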

We see that the N level significantly influences yield. There is no significant interaction, and the varieties do not differ significantly in their yield. Given that the nitrogen effect is significant, we can perform pairwise comparisons of the nitrogen levels. There are several ways to do post hoc multiple comparisons in a linear model. I choose the function glht from the package multcomp, as it is a very versatile function for testing general linear hypotheses.

```
> library(multcomp)
> summary(glht(mod, linfct = mcp(N = "Tukey", interaction_average = TRUE)))

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = Y ~ V * N, data = oats)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)
0.2cwt - 0.0cwt == 0   19.500      7.436   2.622  0.05264 .
0.4cwt - 0.0cwt == 0   34.833      7.436   4.685  < 0.001 ***
0.6cwt - 0.0cwt == 0   44.000      7.436   5.917  < 0.001 ***
0.4cwt - 0.2cwt == 0   15.333      7.436   2.062  0.17741
0.6cwt - 0.2cwt == 0   24.500      7.436   3.295  0.00865 **
0.6cwt - 0.4cwt == 0    9.167      7.436   1.233  0.60874
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
```

What we see from the post hoc pairwise comparisons is that a 0.4cwt increment in nitrogen fertilizer produces a significant difference in yield, whereas a 0.2cwt increment does not. We can also easily construct confidence intervals for these differences.

```
> confint(glht(mod, linfct = mcp(N = "Tukey", interaction_average = TRUE)))

	 Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = Y ~ V * N, data = oats)

Quantile = 2.6435
95% family-wise confidence level

Linear Hypotheses:
                     Estimate lwr      upr
0.2cwt - 0.0cwt == 0  19.5000  -0.1567  39.1567
0.4cwt - 0.0cwt == 0  34.8333  15.1766  54.4900
0.6cwt - 0.0cwt == 0  44.0000  24.3433  63.6567
0.4cwt - 0.2cwt == 0  15.3333  -4.3234  34.9900
0.6cwt - 0.2cwt == 0  24.5000   4.8433  44.1567
0.6cwt - 0.4cwt == 0   9.1667 -10.4900  28.8234
```
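As a cross-check that needs no extra packages (my addition), base R's TukeyHSD on an aov fit gives the same estimated differences for this balanced design; its adjusted p-values come from the studentized range distribution rather than multcomp's single-step method, so they can differ slightly:

```r
library(MASS)

# Refit with aov(), which TukeyHSD() requires
fit <- aov(Y ~ V * N, data = oats)

# All pairwise comparisons among nitrogen levels, Tukey-adjusted
TukeyHSD(fit, which = "N")
```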

The function glht also allows us to test any linear hypothesis. For example, we may want to see whether the nitrogen influence is additive for the variety Victory. That is, we want to see whether the increase in yield due to 0.6cwt nitrogen equals the sum of the increases caused by 0.2cwt and 0.4cwt individually for this specific variety. We can test this hypothesis with a linear contrast. In terms of the linear model output, the yield increase of Victory with 0.2cwt nitrogen is N0.2cwt + VVictory:N0.2cwt, and similarly for the other doses. In the end, the hypothesis we want to test is (N0.2cwt + VVictory:N0.2cwt) + (N0.4cwt + VVictory:N0.4cwt) - (N0.6cwt + VVictory:N0.6cwt) = 0. We can implement this hypothesis test with the following code, where each entry of K is the weight of the corresponding model coefficient.

```
> K = matrix(c(0, 0, 0, 1, 1, -1, 0, 1, 0, 1, 0, -1), nrow = 1)
> summary(glht(mod, linfct = K))

	 Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = Y ~ V * N, data = oats)

Linear Hypotheses:
       Estimate Std. Error t value Pr(>|t|)
1 == 0    10.50      18.21   0.576    0.566
(Adjusted p values reported -- single-step method)
```
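To demystify what glht is doing with a single contrast, the same numbers can be computed by hand in base R (my addition): the estimate is K %*% coef(mod), and its standard error comes from the coefficient covariance matrix:

```r
library(MASS)
mod <- lm(Y ~ V * N, data = oats)

# Contrast weights, in the order of coef(mod):
# (Intercept), VMarvellous, VVictory, N0.2cwt, N0.4cwt, N0.6cwt,
# VMarvellous:N0.2cwt, VVictory:N0.2cwt, VMarvellous:N0.4cwt,
# VVictory:N0.4cwt, VMarvellous:N0.6cwt, VVictory:N0.6cwt
K <- matrix(c(0, 0, 0, 1, 1, -1, 0, 1, 0, 1, 0, -1), nrow = 1)

est  <- drop(K %*% coef(mod))                 # contrast estimate
se   <- sqrt(drop(K %*% vcov(mod) %*% t(K)))  # its standard error
tval <- est / se                              # t statistic
pval <- 2 * pt(-abs(tval), df = df.residual(mod))  # two-sided p-value
c(estimate = est, se = se, t = tval, p = pval)
```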

What we see here is that the sum of the 0.2cwt and 0.4cwt effects is not significantly different from the 0.6cwt effect. This is intuitive given what we saw when plotting the data: the relationship between nitrogen and yield appears fairly linear.

**Model prediction**

With the fitted model, we can calculate a confidence interval for the predicted mean or a prediction interval for a new observation. For example, suppose we want to predict the yield of the variety Victory at 0.2cwt of nitrogen fertilizer. The code below does this. The choices "p" and "c" stand for the prediction interval and the confidence interval of the mean, respectively. We can also predict the response over a range of predictor values and then visualize the predictions in figures.

```
> predict(mod, newdata = data.frame(V = "Victory", N = "0.2cwt"), interval = "p")
       fit     lwr      upr
1 89.66667 41.4699 137.8634
> predict(mod, newdata = data.frame(V = "Victory", N = "0.2cwt"), interval = "c")
       fit   lwr      upr
1 89.66667 71.45 107.8833
```
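To see where these intervals come from (my addition), they can be rebuilt by hand from predict(..., se.fit = TRUE): the confidence interval uses only the standard error of the fitted mean, while the prediction interval also adds the residual variance of a new observation:

```r
library(MASS)
mod  <- lm(Y ~ V * N, data = oats)
newd <- data.frame(V = "Victory", N = "0.2cwt")

p <- predict(mod, newdata = newd, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)

# Confidence interval for the mean: fit +/- t * se(fit)
ci <- p$fit + c(-1, 1) * tcrit * p$se.fit

# Prediction interval: the variance of a new observation adds
# the residual variance to the variance of the fitted mean
pi <- p$fit + c(-1, 1) * tcrit * sqrt(p$se.fit^2 + p$residual.scale^2)

round(c(ci = ci, pi = pi), 2)
```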

**A few more things**

1. Regarding the pairwise comparisons, I used Tukey's method to control the family-wise error rate. This is appropriate because I wanted to make all pairwise comparisons, and Tukey's method is generally considered the most appropriate for that purpose. If the goal is instead to compare all treatments to a control, Dunnett's method has more statistical power. It can also be implemented in glht. For a more in-depth understanding of multiple comparisons and their implementation in R, the book Multiple Comparisons Using R is an excellent reference.

2. I did not perform any model selection before hypothesis testing. In my opinion, the necessity of model selection depends on the goal of the research. If it is a designed experiment intended to test specific factors, the model should reflect the experimental design. Model selection is essentially hypothesis testing, so I would rather retain the model structure that reflects the experimental design and intention, and leave the decision about the significance of a term to the hypothesis tests. However, if the goal of fitting the model is to explore a large set of predictors and find the most efficient predictive model, it is proper to perform model selection. You should then be careful about which model selection criterion you use; see this wonderful discussion in Ecology on the difference between AIC and BIC.

3. The dataset here is balanced, i.e. there is an equal number of replicates for each treatment combination. With balanced data, all types of sums of squares are equal, so it does not matter which type you choose. The function anova in R uses the type I sum of squares, also known as the sequential partition of the sum of squares. But if your data are unbalanced and you want to test equality among model parameters, a type III ANOVA should be used. That can be done with the function Anova in the package car. You should also specify a sum-to-zero contrast coding when computing type III sums of squares; otherwise it does not test the correct hypothesis. This can be achieved with the following code.

```
> library(car)
> options(contrasts = c("contr.sum", "contr.poly"))
> mod = lm(Y ~ N * V, data = oats)
> Anova(mod, type = 3)
```
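A quick way to convince yourself that balance makes the type of sum of squares irrelevant (my addition, base R only): with the balanced oats data, the sequential type I sums of squares are identical whichever order the factors enter the model, which would not hold for unbalanced data:

```r
library(MASS)

# Type I (sequential) ANOVA with the two possible factor orders
a1 <- anova(lm(Y ~ V * N, data = oats))
a2 <- anova(lm(Y ~ N * V, data = oats))

# With balanced data, the sum of squares attributed to N is the same
# whether N is entered after V or first
a1["N", "Sum Sq"]
a2["N", "Sum Sq"]
```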