I talked with my fellow graduate student Rachel Smith this morning about analyzing her species abundance data. The goal is to use environmental variables to predict the abundance of species. One particular concern she has is overdispersion, that is, the data vary more than the model allows. Overdispersion often shows up in logistic regression and Poisson regression. Here are some thoughts on dealing with it.
What is overdispersion?
Overdispersion occurs when the variance of the data exceeds what the model expects. For example, in a Poisson regression we expect the mean and variance of the response to be roughly equal, because the mean and variance of a Poisson distribution are equal. In a logistic regression on binomial data (n trials with success probability p), we expect a particular relationship between mean and variance (mean = np and variance = np(1-p)). Any deviation from the expected mean-variance relationship violates the model assumptions. In theory underdispersion can also occur, but in practice overdispersion is much more common.
If we ignore overdispersion when fitting a logistic or Poisson regression, the point estimates of the regression coefficients are still consistent, but the standard error estimates are incorrect (typically too small). Consequently, hypothesis tests and confidence intervals for the regression coefficients are unreliable.
How do we detect overdispersion?
The first problem when dealing with overdispersion is detecting it. In classic linear models, problems with the distributional assumptions are often diagnosed by examining the residuals. In generalized linear models, residual plots are much less informative, and what we need to look at is the mean-variance relationship. That requires us to calculate the variance of the response at a particular mean and check whether it conforms to the model assumption. In certain situations we can examine the mean-variance relationship directly from the data. For example, if all predictors are categorical and we have multiple replicates within each treatment combination (just like a multi-way ANOVA with replication), we can calculate the mean and variance within each treatment combination and plot one against the other to see whether the relationship meets the model assumption. If we have just one continuous predictor, we can roughly bin the predictor and calculate the mean and variance of the response within each bin. But in general, visualizing the mean-variance relationship is impossible or extremely difficult when we don’t have this kind of model structure or experimental design.
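For the replicated-design case, a minimal sketch of the calculation might look like the following. It is written in Python only for illustration, and the group labels and counts are invented, not Rachel's data. For Poisson-like counts the variance-to-mean ratio should stay near 1 across groups.

```python
from collections import defaultdict
from statistics import mean, variance

def mean_variance_by_group(records):
    """Return {group: (mean, variance)} for an iterable of (group, count) pairs."""
    groups = defaultdict(list)
    for group, count in records:
        groups[group].append(count)
    return {g: (mean(v), variance(v)) for g, v in groups.items()}

# Hypothetical replicated design: three treatment combinations, five replicates each.
records = [
    ("low", 0), ("low", 2), ("low", 1), ("low", 0), ("low", 2),
    ("mid", 3), ("mid", 9), ("mid", 1), ("mid", 6), ("mid", 1),
    ("high", 4), ("high", 22), ("high", 9), ("high", 1), ("high", 14),
]

summary = mean_variance_by_group(records)
for group, (m, v) in summary.items():
    # A ratio well above 1 suggests more variance than a Poisson model allows.
    print(f"{group}: mean={m:.2f} variance={v:.2f} ratio={v / m:.2f}")
```

In this made-up example the ratio climbs as the mean increases, which is the pattern one would look for in a plot of variance against mean.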
A more common approach to detecting overdispersion is to examine the model fit. If we fit a model that ignores overdispersion and it shows significant lack of fit, that is often a signal of overdispersion (or underdispersion). For generalized linear models, we usually compare the residual deviance with the residual degrees of freedom. For an adequately fitting model the residual deviance should be roughly equal to the residual degrees of freedom. If the residual deviance is much larger than the degrees of freedom, that is a signal of overdispersion.
Overdispersion is not the only reason for poor model fit. If the predictors simply do not predict the response well, we also get a poor fit (shown as a large residual deviance). So before we blame the poor fit on overdispersion, we should check whether the model predictions match the data. If they do, we can go ahead and deal with overdispersion. If not, we should first focus on the predictors.
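To make the deviance check concrete, here is a small sketch under simplifying assumptions. The counts are invented, and the model has a single categorical predictor, so the fitted Poisson means are just the group means and no fitting routine is needed. With real data the deviance and degrees of freedom would come from the fitted model's summary.

```python
import math

def poisson_deviance(y, mu):
    """Residual deviance of a Poisson fit: 2 * sum(y * log(y / mu) - (y - mu))."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # The y * log(y / mu) term is taken as 0 when y == 0.
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

# Hypothetical counts for two treatment groups (invented for illustration).
groups = {"A": [4, 22, 9, 1, 14], "B": [3, 9, 1, 6, 1]}

y, mu = [], []
for counts in groups.values():
    group_mean = sum(counts) / len(counts)
    y.extend(counts)
    mu.extend([group_mean] * len(counts))  # fitted value = group mean

deviance = poisson_deviance(y, mu)
resid_df = len(y) - len(groups)  # observations minus number of fitted means

print(f"residual deviance = {deviance:.2f} on {resid_df} df")
print(f"deviance / df = {deviance / resid_df:.2f}")
```

A ratio of deviance to degrees of freedom far above 1, as in this toy data, is the warning sign described above.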
How do we deal with overdispersion?
There are two general approaches to dealing with overdispersion: the mechanistic approach and the empirical approach. Although the two approaches tackle the issue from different perspectives, the end result is quite similar. Both introduce one more parameter into the model to adjust the variance so that the expected mean-variance relationship fits the data better.
The mechanistic approach deals with overdispersion by considering the data-generating mechanism. For example, the beta-binomial model assumes that the response is binomial conditional on the success probability, and that the success probability is drawn from a beta distribution. The commonly used negative binomial model for count data assumes that the response follows a Poisson distribution conditional on the mean, and that the mean is drawn from a gamma distribution. In these mechanistic approaches, overdispersion arises from our understanding of how the data are generated. But the end result is that we introduce another parameter into the model to fit the larger-than-expected variance.
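The gamma-Poisson mechanism is easy to check by simulation. The sketch below draws plain Poisson counts and gamma-mixed Poisson counts with the same mean and compares their variances. The mean, the shape parameter k, and the sample size are illustrative choices, and the Poisson sampler is a simple hand-written one (Knuth's method) so that only the Python standard library is needed.

```python
import math
import random
from statistics import mean, variance

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def gamma_poisson_draw(rng, mu, k):
    """Poisson count whose mean is gamma-distributed with mean mu and shape k."""
    lam = rng.gammavariate(k, mu / k)  # shape k, scale mu / k, so E[lam] = mu
    return poisson_draw(rng, lam)

rng = random.Random(1)  # fixed seed so the run is repeatable
mu, k, n = 5.0, 2.0, 20000  # illustrative values, not estimates from real data

plain = [poisson_draw(rng, mu) for _ in range(n)]
mixed = [gamma_poisson_draw(rng, mu, k) for _ in range(n)]

print(f"Poisson:       mean={mean(plain):.2f} variance={variance(plain):.2f}")
print(f"gamma-Poisson: mean={mean(mixed):.2f} variance={variance(mixed):.2f}")
print(f"negative binomial variance mu + mu^2/k = {mu + mu**2 / k:.2f}")
```

Both samples have a mean near 5, but the mixed sample's variance should land near mu + mu^2/k rather than near mu. The extra parameter k is what lets the negative binomial absorb the excess variance.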
Another commonly encountered source of overdispersion is zero inflation. The signal is that the data contain many more zeros than the model predicts. This situation is often handled with zero-inflated regression. That is, you model the probability that an observation is an excess zero, and you model the remaining counts with the appropriate count model. Again, this method is motivated mechanistically. For example, suppose you want to use environmental variables to predict species abundance at various locations, but because of dispersal limitation the species is absent from many of your sampling sites. Those zero counts are not the result of an unsuitable environment but of dispersal limitation, so they are not something your environmental predictors can explain. In this case it is appropriate to model the probability of an excess zero first, and then to use a generalized linear model for the counts at the remaining sites. In R, you can use the package pscl to fit zero-inflated Poisson or negative binomial regression.
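A quick, rough screen for excess zeros is to compare the observed fraction of zeros with the fraction a Poisson distribution with the same mean would produce, exp(-mean). The sketch below does this for invented counts and ignores predictors entirely, so it is only a first look, not a substitute for checking the fitted model.

```python
import math

def zero_fractions(counts):
    """Return (observed zero fraction, zero fraction expected under Poisson)."""
    n = len(counts)
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-sum(counts) / n)  # P(Y = 0) for Poisson(mean of counts)
    return observed, expected

# Hypothetical site counts: the species is absent from 12 of 20 sites.
counts = [0] * 12 + [3, 5, 2, 4, 6, 3, 4, 5]

observed, expected = zero_fractions(counts)
print(f"observed zero fraction = {observed:.2f}")
print(f"Poisson-expected zero fraction = {expected:.2f}")
```

Here far more sites are empty than a Poisson model with the same mean would predict, which is the pattern that motivates a zero-inflated model.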
The empirical approach is more straightforward. Simply put, we don’t care what generates the overdispersion. We just recognize that it exists and introduce a dispersion parameter directly into the model to account for it. This is what quasi-likelihood does. If your data motivate a logistic or Poisson regression but you have an overdispersion problem, you can account for it with quasi-likelihood. In R, you specify “family = quasibinomial” or “family = quasipoisson” to do that. One caveat is that a quasi-likelihood is not a true likelihood, so you cannot use likelihood-based tools, such as AIC or the likelihood ratio test, for model comparison.
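As a sketch of what the dispersion parameter does, the code below estimates it as the Pearson chi-square statistic divided by the residual degrees of freedom, which is the usual quasi-Poisson estimate. The counts are invented, and the model again has a single categorical predictor, so the fitted means are the group means. Under quasi-Poisson the coefficient estimates are unchanged and their standard errors are multiplied by the square root of the dispersion.

```python
import math

def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom."""
    chi_sq = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi_sq / (len(y) - n_params)

# Hypothetical counts for three treatment groups (invented for illustration).
groups = {
    "low": [0, 2, 1, 0, 2],
    "mid": [3, 9, 1, 6, 1],
    "high": [4, 22, 9, 1, 14],
}

y, mu = [], []
for counts in groups.values():
    group_mean = sum(counts) / len(counts)
    y.extend(counts)
    mu.extend([group_mean] * len(counts))  # fitted value = group mean

phi = pearson_dispersion(y, mu, n_params=len(groups))
se_inflation = math.sqrt(phi)  # factor applied to the Poisson standard errors

print(f"estimated dispersion = {phi:.2f}")
print(f"standard errors inflate by a factor of {se_inflation:.2f}")
```

A dispersion near 1 would mean the plain Poisson standard errors were about right. A value well above 1 means they were too optimistic.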
A simplified recipe
Based on the discussion above, here is a simplified recipe for dealing with overdispersion. Real data are often more complex and have their own nuances, but the procedure below should be a good starting point:
1. Fit the model without considering overdispersion. For example, fit count data directly with Poisson regression, or a binary or binomial response with logistic regression. Examine how well the model predictions match the data. If they do not match well, think about modifying the model structure. If the model predicts well, move on.
2. Examine the residual deviance. If the residual deviance is much larger than the residual degrees of freedom, that signals overdispersion.
3. If the overdispersion results from excess zeros in the data, try zero-inflated regression. If not, think about whether the data-generating mechanism could lead to overdispersion. That would warrant models such as negative binomial regression or beta-binomial regression. Fit these models.
4. Compare the fit with that of the model that ignores overdispersion, and see whether the fit improves substantially after allowing for overdispersion. This can be done by comparing AIC. If the models are nested (for example, Poisson is a special case of negative binomial), you can also compare them with a likelihood ratio test. The model should fit a lot better if overdispersion is indeed the problem.
5. If you don’t have any mechanistic reason for overdispersion, try quasi-likelihood. Examine the estimated dispersion parameter and see whether it is much larger than 1. If it is, that signals overdispersion.
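The model comparison in step 4 can be sketched for the simplest possible case, an intercept-only model with no predictors. The counts below are invented, the negative binomial shape k is found by a crude grid search rather than a proper optimizer, and in practice a regression routine would do the fitting. Note also that the Poisson model sits at the boundary of the negative binomial family (k going to infinity), so the usual chi-square reference for the likelihood ratio statistic is only approximate.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood with a common mean mu."""
    return sum(yi * math.log(mu) - mu - math.lgamma(yi + 1) for yi in y)

def negbin_loglik(y, mu, k):
    """Negative binomial log-likelihood with mean mu and variance mu + mu^2 / k."""
    log_p0 = math.log(k / (k + mu))
    log_p1 = math.log(mu / (k + mu))
    return sum(
        math.lgamma(yi + k) - math.lgamma(k) - math.lgamma(yi + 1)
        + k * log_p0 + yi * log_p1
        for yi in y
    )

# Hypothetical clumped counts (invented for illustration).
counts = [0, 0, 12, 1, 0, 9, 0, 0, 3, 1, 0, 5, 2, 0, 0, 7, 1, 0, 15, 0]

# With no predictors, the sample mean is the ML estimate of mu in both models.
mu_hat = sum(counts) / len(counts)

# Crude grid search for k on a log scale from 0.01 to 1000.
k_grid = [10 ** (i / 100) for i in range(-200, 301)]
k_hat = max(k_grid, key=lambda k: negbin_loglik(counts, mu_hat, k))

ll_pois = poisson_loglik(counts, mu_hat)
ll_nb = negbin_loglik(counts, mu_hat, k_hat)

aic_pois = 2 * 1 - 2 * ll_pois  # one parameter: mu
aic_nb = 2 * 2 - 2 * ll_nb      # two parameters: mu and k
lrt = 2 * (ll_nb - ll_pois)     # likelihood ratio statistic

print(f"k_hat = {k_hat:.2f}")
print(f"AIC Poisson = {aic_pois:.1f}, AIC negative binomial = {aic_nb:.1f}")
print(f"likelihood ratio statistic = {lrt:.1f}")
```

For data this clumped, the negative binomial model should have a much lower AIC and a large likelihood ratio statistic, which is what "fits a lot better" looks like in practice.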