Why Statistics Do Not Always Reflect Reality
- Christos Nikolaou

The Parable of the Two Exams
In science, we often treat the p-value as a definitive verdict on reality. If a study reports a "significant" result (p < 0.05), we assume the effect is real. If it reports a "non-significant" result (p > 0.05), we assume the effect doesn't exist.
However, statistics are not a direct measure of reality; they are a measure of uncertainty, and that uncertainty depends on the design of the experiment, not just on the subject's talent.
To understand how a study design can mask the truth—making a skilled student look average—consider the scenario of Alice and Bob.
The Scenario: Identical Ability
Alice and Bob are sitting a mathematics assessment. Their observed performance is identical in every measurable way:
Alice works for exactly one hour, answers 9 questions, and gets 7 correct.
Bob works for exactly one hour, answers 9 questions, and gets 7 correct.
In reality, they are equally skilled. They produced the exact same data in the exact same time. However, the setup of their exams was different.
Context A: The "Closed" Experiment (Fixed N)
The Setup: Alice was given a paper containing exactly 9 questions.
The Analysis: Alice faced a "Closed Universe." Her sample size was capped at 9 by design. There was no possibility of her answering a 10th question. She hit the ceiling of the test.
The Statistical Verdict: Because she got 7 out of a maximum of 9, the probability of scoring at least this high by chance alone is low.
Result: Significant (p = 0.046). Statistics conclude that Alice is a skilled mathematician.
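For readers who want to check the arithmetic, Alice's verdict can be reproduced with an exact binomial test. The null baseline used below, an "average" student who answers 4 out of every 9 questions correctly, is an illustrative assumption chosen to be consistent with the p-value quoted above.

```python
from scipy.stats import binomtest

# Alice: fixed N = 9 questions, 7 answered correctly.
# Assumed null hypothesis: an average student answers 4 of every 9 correctly (p0 = 4/9).
result = binomtest(k=7, n=9, p=4/9, alternative="greater")
print(f"Alice (exact binomial, fixed N): p = {result.pvalue:.3f}")  # ≈ 0.046
```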
Context B: The "Open" Experiment (Fixed Time)
The Setup: Bob was given an infinite stack of questions and told to work for one hour.
The Analysis: Bob faced an "Open Universe." Although he physically stopped at 9 questions, his sample size was not fixed in advance; statistically, it was fluid.
The Statistical Penalty: Because Bob's test allowed for the possibility of answering more questions (unlike Alice's), the number of questions he attempted is itself random, and the statistical formula must account for that extra uncertainty. It asks: "What if he had answered a 10th question and got it wrong?" This wider range of possible outcomes creates more "statistical noise."
The Statistical Verdict: Bob's score of 7/9 is treated with more caution because the "ceiling" of his test was undefined. The maths is less convinced that this is anything more than a lucky run.
Result: Not Significant (p = 0.11). Statistics conclude that we cannot be sure of Bob's skill.
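Bob's verdict can be reproduced in the same spirit with an exact Poisson test, using the matching illustrative baseline of 4 correct answers per hour. Same data, same baseline, different sample space, different p-value.

```python
from scipy.stats import poisson

# Bob: fixed time = 1 hour, 7 correct answers.
# Assumed null hypothesis: an average student produces 4 correct answers per hour (lambda = 4).
p_value = poisson.sf(7 - 1, mu=4)  # P(X >= 7 | lambda = 4)
print(f"Bob (exact Poisson, fixed time): p = {p_value:.3f}")  # ≈ 0.11
```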

Real World Application: Count vs. Continuous Data
This parable reveals a major trap in retrospective research, but the severity of the trap depends on the type of data you are collecting.
1. The Danger Zone: Count Data (e.g., Complications, Adverse Reactions)
If you are counting discrete events retrospectively (like Bob answering questions), you are in danger.
Example: "We looked at all surgeries in 2023 (Window) and counted how many had complications (Events)."
The Trap: If you analyse this using a standard proportion test (assuming N is fixed), you are making Alice's assumption for Bob's reality. You are ignoring the fact that the volume of surgeries fluctuated throughout the year.
The Fix: You must use a Poisson Rate Test. Instead of calculating "Percentage of Complications" (7/9), you must calculate "Complications per Month" (a rate). This treats time as the fixed denominator, correcting the error (a minimal sketch follows below).
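As a minimal sketch of what that looks like in practice (the counts and the benchmark rate below are hypothetical), the retrospective audit is tested as a rate per month against a benchmark, rather than as a percentage of cases:

```python
from scipy.stats import poisson

# Hypothetical audit: 14 complications observed across all surgeries in 2023.
complications = 14
months = 12                 # the fixed quantity is the time window, not the number of cases
benchmark_rate = 0.8        # assumed benchmark: 0.8 complications per month

# Exact one-sided Poisson rate test:
# P(seeing at least this many events if the true rate were the benchmark)
expected_events = benchmark_rate * months
p_value = poisson.sf(complications - 1, mu=expected_events)

print(f"Observed rate: {complications / months:.2f} complications per month")
print(f"Exact Poisson rate test: p = {p_value:.3f}")
```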
2. The Safety Zone: Continuous Data (e.g., Blood Pressure, Weight)
If you are measuring continuous averages (Means), the "Bob Penalty" is much smaller.
Example: "We looked at all patients in 2023 and measured their average Blood Pressure."
Why it's safer: Here, you would use a t-test. Unlike counting questions, adding more patients doesn't change the "scale" or "ceiling" of Blood Pressure (BP stays between 50–200 mmHg regardless of sample size).
The Nuance: While a t-test is mathematically valid here even if N is random (conditioned on the observed N), the risk is behavioural. Because the window is flexible, researchers can easily "cherry-pick" the window that gives the lowest p-value (p-hacking), as the simulation below illustrates.
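To make the behavioural risk concrete, here is a small simulation (all numbers are hypothetical) in which the true mean blood pressure equals the reference value, so every "significant" result is a false positive. Analysing one pre-specified window keeps the false-positive rate near the nominal 5%; scanning every possible window and reporting the best one inflates it well above that.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_sims = 500
alpha = 0.05
reference_bp = 120          # hypothetical reference value for mean systolic BP

fixed_window_hits = 0       # analyse one pre-specified 12-month window
cherry_picked_hits = 0      # try every contiguous window of months, keep the best p-value

for _ in range(n_sims):
    # 12 months of data, 20 patients per month, true mean equal to the reference (null is true)
    monthly = [rng.normal(loc=reference_bp, scale=15, size=20) for _ in range(12)]

    # Honest analysis: the full, pre-specified window
    p_fixed = ttest_1samp(np.concatenate(monthly), popmean=reference_bp).pvalue
    fixed_window_hits += p_fixed < alpha

    # p-hacked analysis: scan all contiguous windows and report the smallest p-value
    best_p = min(
        ttest_1samp(np.concatenate(monthly[start:end]), popmean=reference_bp).pvalue
        for start in range(12)
        for end in range(start + 1, 13)
    )
    cherry_picked_hits += best_p < alpha

print(f"False-positive rate, fixed window:  {fixed_window_hits / n_sims:.1%}")
print(f"False-positive rate, cherry-picked: {cherry_picked_hits / n_sims:.1%}")
```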
The Disconnect Between Statistics and Reality
This parable illustrates a fundamental limitation of p-values: They can fail to detect the truth depending on the assumptions of the design.
1. The Design Can Hide the Talent
In reality, Bob is just as good as Alice. However, the statistical test failed to recognise his ability (p > 0.05). This is not because Bob lacks skill, but because his experimental design was inherently "noisier."
By running a time-based experiment, Bob introduced a variable (fluctuating sample size) that Alice did not have. The statistical test penalised him for this extra variable by increasing his p-value. A real effect was missed simply because of how the room was set up.
2. Statistics Are Based on "Could Have Beens"
Alice's result is significant because she had no opportunity to fail further (she ran out of questions). Bob's result is non-significant because he could have failed further (he had infinite questions).
Statistics do not just measure what did happen; they calculate probabilities based on what could have happened under the null hypothesis. Because Bob's design allowed for more hypothetical outcomes, the statistics were less impressed by his actual performance.
Conclusion
A non-significant p-value (p > 0.05) does not prove that an effect is absent. As Bob's case shows, you can have a real effect and a talented subject, but if the study design introduces too much theoretical uncertainty (like the open-ended nature of the time limit), the statistics may fail to reflect the reality.
When we read a study that claims "no significant effect was found," we must remember Bob: The talent might be there, but the statistical assumptions might have simply been too strict to see it.