I have seen the following scenario a lot in ecological research. After we collect a bunch of data from the field, we often explore the data first. We plot the response variables of interest against all the possible predictors we measured. Some show a significant relationship; some do not. The significant (in a statistical sense) relationships end up in papers while the insignificant ones don’t. I have certainly done things this way before. But as I reflected on how research is done in general, I started to feel bad about doing this.

**An obvious problem**

This is what data dredging, or p-value hacking, is. To illustrate the point, let’s start with a simple simulation. Assume I measured a predictor called x and many different responses. I want to investigate which response variables are influenced by this predictor. In the simulation, I made all response variables totally unrelated to the predictor. Then I perform a linear regression on each pair and check whether the slope of the regression is significantly different from 0 at a significance level of 0.05. Below is the R code I used for the simulation.

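```r
set.seed(100)
x <- 1:100
p.value <- vector(mode = "numeric", length = 100)
for (i in 1:100) {
  # Response is pure noise: normal with mean 5 and sd 1, independent of x
  y <- rnorm(100, 5, 1)
  # fstatistic holds the F value, numerator df, and denominator df
  f <- summary(lm(y ~ x))$fstatistic
  p.value[i] <- pf(f[1], f[2], f[3], lower.tail = FALSE)
}
# Indices of the responses that look "significant" at the 0.05 level
which(p.value < 0.05)
```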

I generated each response as a normally distributed random variable with a fixed mean. Thus, all the response variables are totally unrelated to the predictor. But when I ran the linear regressions on the simulated data set, I found that 6 response variables had a significant relationship with the predictor. If these 6 relationships are what we publish as findings, we clearly misinform our readers. Statistically, this is not hard to understand. When we set a significance level, we set the probability of rejecting a null hypothesis that is actually true. If we test many hypotheses, it is almost certain that we will reject some of them even when every null hypothesis is true.
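The arithmetic behind "almost certain" can be sketched in a few lines of R. This is only a rough illustration, and it assumes the 100 tests are independent of one another, which holds in the simulation because each response is drawn separately but may not hold for real field data. The variable names here are mine, not part of the original simulation.

```r
alpha <- 0.05   # significance level used in the simulation
m <- 100        # number of response variables tested

# Expected number of false positives when every null hypothesis is true
expected_false_positives <- m * alpha

# Probability of at least one false positive, ASSUMING independent tests
prob_at_least_one <- 1 - (1 - alpha)^m

expected_false_positives     # 5
round(prob_at_least_one, 3)  # 0.994
```

So under these assumptions, about 5 spurious "significant" relationships out of 100 are expected, and the chance of seeing none at all is well under 1%.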

**A not so obvious problem**

We are usually taught that research should be question driven. We should develop specific hypotheses before designing an experiment to test them. As a result, an analysis that simply searches for significant patterns is often packaged as if it were a hypothesis-driven confirmatory study. Papers often present a specific hypothesis in the introduction as if it were the predetermined goal of the study, when the hypothesis is in fact the result of an extensive search through everything measured.

This confuses me a lot. On one hand, I can develop a hypothesis ahead of the experiment for each response variable. Then I collect these responses and predictors, investigate each individual pair, and report my findings. Philosophically, this is a valid approach. I have clearly defined hypotheses. I collect the data to test them. I report what I find. But on the other hand, this is no different in practice from the significance-seeking analysis. I still have some variables in mind, and I measure all of them. Then I plot everything against everything else and report my major findings. We phrase the two approaches differently, but what we actually do is essentially the same.

The line between the “bad” approach and the “good” approach seems vague. While I can see the philosophical difference, they are rather similar in practice. To some extent, it looks like a matter of packaging. That is why I sometimes feel it is unnecessary to be religious about having a specific hypothesis in a paper or presentation. To me, it is somewhat semantic.

The vagueness also poses a dilemma for me. On one hand, we ecologists face complex problems. It seems naive to test just one thing at a time, and it seems natural to collect all the possibly relevant information in the field. On the other hand, we fall into the trap of significance chasing if we just plot everything against everything else to find a pattern.

I don’t have a solution to this problem for now. We often measure a lot of potential responses and predictors. What is the best practice for examining their relationships without falling into the p-value hacking trap? I would love to hear your thoughts.

The second part of this is an old debate over how science should be done. R. H. Peters’s “A Critique of Ecology” gives the argument for correlation being useful in prediction in the most forceful terms.


Thanks, Walter, for the comments. I was thinking of reading that book, and it seems it will be worthwhile. It is very eye-opening to look into so many different perspectives on this issue.

