## Proper Consideration of Error Propagation in Data Analysis

This post is motivated by working with my lab mate Elizabeth Guinessey on some data analyses for her thesis. The goal of the study is to investigate how greenhouse gas flux rates differ in various vegetation types in salt marshes. To do that, she chose several locations for each vegetation type. At each location, she put a closed chamber on the ground and measured the concentration change over time in the head space of the chamber. What is the proper way to analyze these data?

This seems to be an easy problem. The concentration of greenhouse gas will increase over time in the chamber head space. The slope of a linear regression between concentration and time for each chamber gives the flux rate. One possible approach is to first calculate slope for each chamber to obtain the flux rate. The flux rate for each chamber is then used as the response variable in ANOVA, where vegetation type is used as a categorical explanatory variable. To translate this analysis into formal statistical language, what we do can be represented as follows. First, we fit a linear regression between concentration and time for each chamber: $c_{ijk}=\alpha_{ij} + \beta_{ij}t_{ijk}+\epsilon_{ijk}$. Here, $c_{ijk}$ is the concentration of gas in vegetation type $i$ at location $j$ at time $k$. $\epsilon$ is random error with a normal distribution. After obtaining the slope $\beta_{ij}$ for each chamber, we use the estimated slope as the response variable in ANOVA as $\beta_{ij}=\mu_{i}+e_{ij}$. Here $\mu_{i}$ is the mean flux rate of vegetation $i$ and $e_{ij}$ is a random error with normal distribution.

This seems to be a reasonable approach. Indeed, it is probably commonly used in ecology literature. But one problem with this approach is doing “statistics on statistics”. Specifically, in the ANOVA, we assume that replicates of the same vegetation type have the same mean flux rate and all the replicates from the same vegetation type follows a normal distribution. But when we calculate flux rate for each chamber within a particular vegetation type, such distributional constraints are not considered at all. Thus, analyzing data using two separate linear models (i.e. one regression to calculate flux rate and one ANOVA to compare among vegetation types) this way is not theoretically consistent. In general, when we use a statistics (e.g. the flux is a calculated slope) as a response variable for future analysis, we have the risk of not considering error propagation properly.

Here is what I believe as a more consistent way to analyze this kind of data. This is where using the formal statistical language to describe our analysis is helpful. If we plug $\beta_{ij}=\mu_{i}+e_{ij}$ into $c_{ijk}=\alpha_{ij} + \beta_{ij}t_{ijk}+\epsilon_{ijk}$, we get $c_{ijk}=\alpha_{ij} + (\mu_{i}+e_{ij})t_{ijk}+\epsilon_{ijk}$. This model can be expanded as $c_{ijk}=\alpha_{ij} + \mu_{i}t_{ijk}+e_{ij}t_{ijk}+\epsilon_{ijk}$. This combines two classic linear models into one linear mixed effect model. We can fit a model using all measurements in one step in a linear mixed effect model. Here, we will include a chamber specific intercept, a vegetation type–time interaction, a random slope of time for each chamber and a normally distributed random error. Testing the significance of vegetation type–time interaction tells us whether vegetation type influence flux rate.

Essentially, we collapse a hierarchical model into one mixed effect model. This approach is fairly common in education research but it is not commonly used in ecological research. Obviously, whether the ignorance of error propagation causes significant difference in inference is context specific. If you have this kind of data, I am curious to analyze it and see if it makes a difference.