After completing an experiment, we usually perform statistical tests to determine whether our results are “significant.” P-values are commonly used to report statistical significance in scientific papers, but biologists have been criticized in recent years for misunderstanding and misusing this statistic. A recent paper in PLOS Biology surveyed the scientific literature and found widespread evidence of “p-hacking”: the manipulation of analysis choices, such as sample size and the removal of outlier data points, for the sole purpose of obtaining statistically significant p-values. Below I’ll explain how to use p-values and why they matter in biology research.
Probability theory: Frequentism vs. Bayesianism
First, it’s important to note that there are several different interpretations of the concept of “probability,” perhaps the two most notable belonging to the Bayesian and Frequentist schools of statistics. According to the Bayesian approach (named after the 18th-century mathematician Thomas Bayes), probability is best thought of as a degree of belief in a particular outcome, given our prior knowledge of the situation in addition to newly acquired data. To give a commonplace example: when searching for a lost set of keys in your home, you will want to estimate the probability that they are in a given location, most likely by remembering previous occasions on which the keys were lost and where they were recovered. This “prior” knowledge will factor heavily into your probability estimate. You can then use new data to update this estimate, for instance if you know with certainty that the keys are not in your pants pocket. The Bayesian interpretation of probability aligns more closely with our common, everyday usage of the term.
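The key search can be sketched as a small Bayesian update. This is only an illustration: the three locations, the prior values, and the function name bayes_update are all made up for the example, and the "likelihood" encodes the single observation that the pocket was checked and found empty.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses using Bayes' rule.

    prior and likelihood are dicts keyed by hypothesis name.
    """
    # Multiply each prior by the likelihood of the observation under it.
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the observation is impossible under every hypothesis")
    # Renormalize so the posterior probabilities sum to 1.
    return {h: weight / total for h, weight in unnormalized.items()}


# Hypothetical prior beliefs about where the keys are (they sum to 1).
prior = {"pants pocket": 0.5, "kitchen counter": 0.3, "coat": 0.2}

# Likelihood of the observation "I checked my pocket and it was empty"
# under each hypothesis: impossible if the keys really are in the pocket.
likelihood = {"pants pocket": 0.0, "kitchen counter": 1.0, "coat": 1.0}

posterior = bayes_update(prior, likelihood)
for place, prob in posterior.items():
    print(f"{place}: {prob:.2f}")
```

With these made-up numbers, ruling out the pocket shifts its probability onto the other two locations in proportion to their priors, so the kitchen counter moves from 0.3 to 0.6 and the coat from 0.2 to 0.4.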
However, the understanding of probability that dominates in the biological sciences is known as Frequentism; most p-value statistics in biological research are computed using this school’s methods. According to frequentist statistics, the probability of a given event is the frequency with which it occurs over many repeated trials. To give a simple example: if a coin is flipped 100 times and lands “heads” on 58 flips, the estimated probability of the coin’s landing heads is 0.58. If the coin is fair, then as the number of flips approaches infinity, the observed frequency of heads will approach the “true” probability of 0.5. Frequentism is based on the notion that repeated randomized trials, or experiments, will in the long run approximate the true probability of an event.
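A quick simulation can illustrate this long-run behavior. The sketch below assumes a fair coin (true probability 0.5) and uses an arbitrary random seed; the function name heads_frequency and the chosen flip counts are only for illustration.

```python
import random


def heads_frequency(n_flips, rng):
    """Flip a simulated fair coin n_flips times and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


rng = random.Random(42)  # fixed seed so the run is repeatable

# Observed frequency of heads for increasingly long runs of flips.
frequencies = {n: heads_frequency(n, rng) for n in (100, 10_000, 100_000)}
for n, freq in frequencies.items():
    print(f"{n:>7} flips: observed frequency of heads = {freq:.4f}")
```

Short runs can easily land some distance from 0.5, as in the 58-out-of-100 example, while longer runs tend to settle closer to it.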
So what is a p-value?
When running an experiment, a biologist may want to know the probability of her hypothesis being true, given the experimental data she observes. However, a p-value calculated using a standard t-test tells her something different: it describes the data given a hypothesis, not the hypothesis given the data. A common experimental “null hypothesis” is a statement of no relationship between the variables under observation (e.g. the populations behind two data sets have equal means). The p-value is the probability of observing the experimental data, or a data set more extreme, when assuming that this null hypothesis is correct. A lower p-value makes a stronger case for rejecting the null hypothesis.
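To make the definition concrete, here is a minimal sketch that computes an exact p-value for the coin example above (58 heads in 100 flips), taking "the coin is fair" as the null hypothesis. It uses an exact binomial test rather than a t-test because the arithmetic can be written out directly; the function name binomial_two_sided_p is hypothetical, and "more extreme" is taken to mean any outcome that is no more probable under the null than the one observed, which is one common convention.

```python
from math import comb


def binomial_two_sided_p(heads, flips, p_null=0.5):
    """Exact two-sided p-value for observing `heads` in `flips` coin flips.

    Sums the null-hypothesis probability of every outcome that is at most
    as probable as the observed one.
    """
    def pmf(k):
        # Probability of exactly k heads if the null hypothesis is true.
        return comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # Small relative tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


p_value = binomial_two_sided_p(58, 100)
print(f"p-value for 58 heads in 100 flips: {p_value:.3f}")
```

The result is roughly 0.13, so a fair coin would produce a result at least this lopsided in about 13% of 100-flip experiments. At the conventional 0.05 threshold, 58 heads would not justify rejecting the null hypothesis.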
There are a few things that the p-value definitely does not tell a scientist. First, do experimental results with a low p-value tell a scientist that her hypothesis is correct? No. Rejecting the statistical null hypothesis is not equivalent to accepting her particular biological hypothesis. Does the p-value equal the probability that the null hypothesis is correct? Again, no. The p-value is calculated by assuming that the null hypothesis is true, so it cannot also measure how likely that assumption is. Biologists and statisticians also use the term “hypothesis” very differently. When the statistician and evolutionary biologist Ronald Fisher popularized use of the p-value in the 1920s, it was never intended as a metric for confirming or refuting biological hypotheses. It was meant to be a general heuristic for judging whether a data set might warrant a second look or follow-up experiments; the p-value itself does not decisively settle any experimental questions.
How to avoid “p-hacking”
What should researchers do to avoid p-hacking? One recent paper recommends choosing the experimental sample sizes in advance, reporting in detail any removal of outlier data points, and allowing other researchers access to the raw data. P-value statistics can be useful when employed properly, but they are not the whole story. As scientists face continued pressure to report “significant” findings and publish in high-tier journals, understanding procedures for proper data interpretation will be increasingly important. Hopefully, the trend towards open access publication will encourage greater transparency and scrutiny of experimental data reporting, along with a better understanding of p-value statistics and their applications.
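The reason for fixing the sample size in advance can be seen in a small simulation. The sketch below is an illustration under simplifying assumptions, not a reconstruction of any published analysis: every simulated experiment draws data for which the null hypothesis is true (mean 0, known standard deviation 1), a one-sample z-test stands in for the t-test so that only the standard library is needed, and the function names, look schedule, and seed are all hypothetical. It compares testing once at the planned sample size with "peeking" every few observations and stopping at the first p < 0.05.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_experiments=2000, max_n=50, first_look=10,
                         step=5, alpha=0.05, seed=1):
    """Simulate null experiments; return (fixed-n rate, peeking rate) of p < alpha."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_experiments):
        # The null hypothesis is true here: the data really have mean 0.
        data = [rng.gauss(0, 1) for _ in range(max_n)]
        # Honest analysis: one test at the sample size chosen in advance.
        if z_test_p(data) < alpha:
            fixed_hits += 1
        # p-hacked analysis: test at every look and stop at the first "hit".
        looks = range(first_look, max_n + 1, step)
        if any(z_test_p(data[:n]) < alpha for n in looks):
            peeking_hits += 1
    return fixed_hits / n_experiments, peeking_hits / n_experiments


fixed_rate, peeking_rate = false_positive_rates()
print(f"false positives, fixed sample size: {fixed_rate:.3f}")
print(f"false positives, with peeking:      {peeking_rate:.3f}")
```

In this setup the fixed-sample analysis should flag about 5% of the null experiments as "significant", as the 0.05 threshold intends, while the peeking analysis flags noticeably more, even though no real effect exists in any of them.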