Scientists are always warning the public (and each other) not to confuse correlation with causation. When a study is published linking our favorite food to cancer, heart attacks, or other health problems, we’re cautioned to take these findings with a grain of salt because identifying causes in a complex sea of correlations is a daunting task. How can we possibly tell correlation from causation?
What is causation?
Despite the challenge, a major task of researchers is to uncover causes — whether it’s the simple mechanism of a protein’s activity within the cell, or a population-level analysis of interactions among genes that increase the risk of disease.
This raises the question: what exactly does it mean for X to cause Y? The concept of causality has existed for a long time, predating the scientific revolution by many centuries. Aristotle explained causation by dividing it into four separate aspects. Take the simple example of a wooden table: its material cause is the wood of which it is composed, its efficient cause is the carpenter who crafted it, its formal cause is the particular shape which makes it a table rather than something else, and its final cause is the purpose for which it was created, maybe to hold a lamp.
Scientists today don’t labor under such a multifaceted theory of causation. Although the meaning of “cause” is usually taken for granted in everyday life, when pressed for a precise definition a biologist would likely explain cause and effect in terms of probabilities. According to probabilistic theories of causation, a cause both precedes its effect and increases its probability, all other things being equal. For instance, we know that smoking causes heart disease; this does not imply that everyone who smokes will suffer heart problems, but it does mean that smokers have a higher probability than non-smokers of developing heart disease, all other factors being held equal.
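One common way to write this definition down, offered as a sketch rather than the only formalization: let C be the candidate cause, E the effect, and B a fixed set of background factors. These symbols are introduced here for illustration and do not come from the text above. Then C counts as a probabilistic cause of E when it precedes E and

```latex
\[
P(E \mid C, B) > P(E \mid \neg C, B)
\]
```

In the smoking example, C is smoking, E is heart disease, and B holds everything else (diet, exercise, age, and so on) equal.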
How can we study causal relationships?
To scientifically study causal relationships, we need the ability to intervene in a system and manipulate individual variables. Luckily, researchers can often alter experimental variables and examine counterfactual scenarios, which take the form “if X causes Y, then if X does not occur, Y will not occur.” Model organism biologists (like those who work with Drosophila melanogaster, for instance) pride themselves on applying this skill in the laboratory. If I want to determine whether a particular mutation is the cause of an interesting phenotype, I can compare flies that are genetically identical in all respects except for the mutation in question. Because this design eliminates confounding variables, any difference in phenotype between the two groups can be attributed directly to the mutation.
Correlation vs Causation
What, then, is the relationship between causation and correlation? Two variables or events are correlated when they tend to vary together, either in the same direction (a positive correlation) or in opposite directions (a negative correlation). At first glance, a correlation between two variables may suggest a causal relationship, but this conclusion does not necessarily follow. Fires and fire trucks are often correlated, but obviously it is not the fire trucks that cause fires. To demonstrate this point, simply take a look at the ridiculous spurious correlations that can occur between events that are not causally linked. Ice cream sales and sunburns are another example: both rise in hot, sunny weather, so they are correlated even though neither causes the other.
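The ice cream and sunburn case can be sketched as a small simulation. Everything in it is an assumption made for illustration: the variable names, the temperature range, and the effect sizes are invented, not measured. Temperature drives both quantities, and neither one influences the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical daily temperatures (degrees C): the shared cause.
temperature = [rng.uniform(10, 35) for _ in range(2000)]

# Each outcome depends on temperature plus its own independent noise.
# Neither outcome appears in the other's formula.
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
sunburns = [0.5 * t + rng.gauss(0, 3) for t in temperature]

# Across all days, the two outcomes are clearly correlated.
r_all = pearson(ice_cream, sunburns)

# Hold the common cause roughly fixed: keep only days between 24 and 25 degrees.
band = [(i, s) for t, i, s in zip(temperature, ice_cream, sunburns) if 24 <= t < 25]
r_band = pearson([i for i, _ in band], [s for _, s in band])

print(f"all days: r = {r_all:.2f}")
print(f"24-25 degree days only: r = {r_band:.2f}")
```

Under these assumptions the correlation is strong across all days and close to zero once temperature is held nearly constant, which is the signature of a shared cause rather than a causal link between the two outcomes.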
To make the issue more confusing, even if we know with certainty that X causes Y, it does not follow that the two variables will be correlated. Imagine a mixed community of smokers and non-smokers: cigarette smoking is a known cause of heart disease, but in this hypothetical population all of the smokers exercise while the non-smokers do not. If the heart-healthy benefits of the smokers’ exercise perfectly counteract their increased risk of heart disease, then there will be no correlation between smoking and heart disease at the population level. Interesting, right?
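Here is a minimal sketch of that hypothetical community. The risk numbers are made up purely so that the exercise benefit exactly cancels the smoking risk, as the thought experiment requires. They are not real epidemiological estimates.

```python
import random

# Assumed, illustrative risks (not real data).
BASELINE = 0.10         # risk for a sedentary non-smoker
SMOKING_EFFECT = 0.10   # added risk from smoking
EXERCISE_EFFECT = -0.10 # risk removed by exercise

def disease_risk(smokes, exercises):
    """Probability of heart disease for one person (flags are 0 or 1)."""
    return BASELINE + SMOKING_EFFECT * smokes + EXERCISE_EFFECT * exercises

def disease_rate(people, rng):
    """Simulated fraction of a group of (smokes, exercises) pairs who get sick."""
    cases = sum(rng.random() < disease_risk(s, e) for s, e in people)
    return cases / len(people)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 50_000

# Observed community: every smoker exercises, no non-smoker does.
smokers = [(1, 1)] * n
non_smokers = [(0, 0)] * n
observed_gap = disease_rate(smokers, rng) - disease_rate(non_smokers, rng)

# Controlled comparison: hold exercise equal (nobody exercises).
sedentary_smokers = [(1, 0)] * n
controlled_gap = disease_rate(sedentary_smokers, rng) - disease_rate(non_smokers, rng)

print(f"observed gap in disease rate: {observed_gap:+.3f}")
print(f"gap with exercise held equal: {controlled_gap:+.3f}")
```

In the observed community the two groups fall ill at nearly the same rate, so smoking and heart disease look unrelated. Once exercise is held equal, the gap reappears at roughly the size of the assumed smoking effect.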
In a game of billiards, the precise ordering of cause and effect is obvious to the observer. In the real world, discovering causal relationships is often a slow and arduous process, but it’s what scientists signed up to do.