A while ago I wrote a blog post on the practice of plotting everything on everything else in ecology. Since then, I have had quite a few insightful conversations with colleagues and done some reading on the issue. I think it will be useful to write a follow-up and share some of the insights I gained.
Imagine I measured a response variable and many potential predictors. I want to ask which of the predictors drive the response. I then regress the response on all of the predictors, and I find that two of them have significant relationships with the response. I then write up a paper claiming that I hypothesized these two predictors would have significant relationships with the response, and that my data supported my hypothesis.
In my opinion, it is totally fine to ask such an open-ended question. It is a valid scientific question. It is also OK to perform exploratory data analysis. What’s problematic is the selective reporting of the significant relationships. With such selection bias, the validity of the significance is questionable. After all, if we search hard enough we will almost surely find some significant relationships, because every test carries a type I error rate and many tests give chance many opportunities. A fishing expedition for statistical significance creates bad science. However, I would not simply ignore these significant relationships. What if they are real? I might miss some exciting findings. To make full use of the data in a correct way, the key is to follow good practice so that conclusions are reliable and the chance of false discovery is reduced. Below are some thoughts on how to reach reliable conclusions when we make lots of comparisons or analyses in an exploratory study.
Some thoughts on good practice
1. The problem of false discovery is similar to a multiple comparison problem. When we test a lot of hypotheses simultaneously (e.g. lots of regressions at one time), the chance of finding at least one significant relationship is fairly large even if no relationship actually exists. The methods used to control the family-wise error rate in multiple comparisons may be applicable here. For example, if one relationship shows up as highly significant in a statistical sense, i.e. with a very small p-value that would survive such a correction, it is unlikely to be a false discovery. However, if we run lots of analyses together and get borderline p-values, we should not conclude that the relationships are significant without further analysis.
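To make the size of the problem concrete, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true, which real regressions on correlated predictors will not satisfy exactly. The Holm step-down procedure is used as one example of a family-wise error correction; the p-values are hypothetical, not from any real study.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject flag per p-value, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank + 1)-th smallest p-value with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


if __name__ == "__main__":
    # With 20 independent tests at alpha = 0.05, the rate is about 0.64.
    print(round(familywise_error_rate(20), 3))

    hypothetical_p = [0.0004, 0.03, 0.04, 0.20, 0.65]  # made-up p-values
    print(holm_reject(hypothetical_p))
```

In this made-up example only the very small p-value survives the correction, while the borderline values of 0.03 and 0.04 do not.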
2. If we find significant relationships in an exploration of everything we measured, the correct approach is to perform an independently designed follow-up study that specifically tests these findings. This is not always feasible. But maybe we can split the current data set, using one part for exploration and the other part for confirmation, as if it were a new study. How to split the data should depend on the context. A random split may be quite generally applicable. What should be avoided is deliberately choosing a split in order to find significance.
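A random split can be sketched in a few lines of Python. The function name, the 50/50 fraction, and the plot IDs below are illustrative assumptions. The one design choice that matters is fixing the random seed before looking at any results, so the split cannot be tuned to produce significance.

```python
import random


def random_split(records, explore_fraction=0.5, seed=0):
    """Randomly split records into an exploration set and a confirmation set."""
    rng = random.Random(seed)  # seed chosen in advance, not after seeing results
    shuffled = list(records)   # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    plots = list(range(40))  # hypothetical sampling-unit IDs
    explore, confirm = random_split(plots)
    print(len(explore), len(confirm))
```

If the observations are not independent (for example, repeated measurements from the same site), splitting whole groups instead of individual rows may be more appropriate. That is one way the split can depend on context.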
The idea of using multiple lines of evidence should also be taken with caution, especially when the evidence is gathered by searching the relevant literature. Daniel Kahneman pointed out that publication bias towards significant results in the literature could mislead us. If lots of underpowered studies with small sample sizes all point in a particular direction, this could be evidence of a very strong effect or of publication bias. After all, we cannot expect to detect an effect every time when the power of a study is low. If low-power studies all show a consistent effect, either the effect is very strong or results are reported selectively in the literature.
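The arithmetic behind this argument is short. As a rough sketch, assume the studies are independent and share the same power; both are simplifications. The chance that every one of them detects a real effect is then the power raised to the number of studies.

```python
def prob_all_significant(power, n_studies):
    """Chance that every one of n independent studies detects a real effect."""
    return power ** n_studies


if __name__ == "__main__":
    # Illustrative power values, not estimates from any real literature.
    for power in (0.3, 0.5, 0.8):
        print(power, prob_all_significant(power, 5))
```

With 30% power, five significant results out of five would occur in roughly 0.2% of cases even when the effect is real, so an unbroken run of significant findings from small studies deserves a second look.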
3. Multiple measurements sometimes stem from the operationalization of a concept. For example, bone quality may be evaluated with many metrics, such as bone density, size, mineral content, etc. In these situations, it might be useful to combine the multiple measurements into a single indicator. A principal component analysis or factor analysis could be useful for this dimension reduction. Reducing the number of variables explored reduces the number of tests, and with it the chance of false discovery.
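As a hedged illustration of combining several metrics into one indicator, the sketch below computes the first principal component with NumPy. The data are simulated: three hypothetical metrics that all track one latent trait plus noise. Standardizing the columns first is an assumption that suits metrics measured in different units.

```python
import numpy as np


def first_principal_component(data):
    """Return PC1 scores and the share of total variance that PC1 explains.

    `data` is an (n_samples, n_metrics) array of related measurements.
    """
    X = np.asarray(data, dtype=float)
    # Standardize so metrics on different scales contribute comparably.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The rows of Vt from the SVD of Z are the principal axes.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[0]  # the sign of the scores is arbitrary
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, explained


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=100)  # simulated underlying trait
    # Three hypothetical metrics: the latent trait plus independent noise.
    metrics = np.column_stack(
        [latent + rng.normal(scale=0.5, size=100) for _ in range(3)]
    )
    scores, explained = first_principal_component(metrics)
    print(scores.shape, float(explained))
```

The single column of scores can then stand in for the three original metrics in the exploration, provided the first component explains a large share of the variance.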
4. A mechanistic model could help. A mechanistic model prescribes how variables should relate to each other and thus permits us to test the model against multiple aspects of the observations. Confirming findings with multiple lines of evidence may help us avoid false discovery. Some researchers even argue that trying to understand ecological data without a mechanistic model is a waste of time. In my opinion, this statement is a little extreme. Implementing mechanistic models with realistic assumptions can be challenging, and a mechanistic modeling approach may not be applicable in many situations. The traditional hypothesis testing framework of doing science is still useful. I would not say that a mechanistic model is the only approach. But I would say it is a very powerful and useful approach in many situations.
Finally, doing an exploratory study in a post hoc fashion is OK in my opinion. We just need to be careful when drawing conclusions. Not all science has to be done in a hypothesis confirmation way, and we don’t need to package an exploratory study as if it were a hypothesis confirmation study. That’s why I am strongly against the view that each paper has to have a specific hypothesis. I don’t see any problem as long as the study defines a clear question. Maybe this is a topic worth more discussion in a future blog post.