Module 3 · Significance & Power

The drug that "didn't work" — until it did.

A promising new drug is tested against placebo. The result comes back not statistically significant — no clear benefit. The programme is shelved; the drug is written off as a failure.

Two years later, a much larger trial of the very same drug finds a clear, real benefit. The drug worked all along. So what went wrong the first time?

The first trial found 'no significant effect.' What's the most likely explanation?

A statistical test doesn't deliver the truth — it delivers a verdict, and verdicts can be wrong in two opposite ways. Understanding those two ways, and the thing called 'power' that governs them, is what separates someone who reads 'not significant' correctly from someone who gets fooled by it.

A verdict is not the truth

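Here's the mental model that fixes almost everything. There are two separate things: the hidden truth (is there really an effect or not?) and your study's verdict (significant or not significant).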

These don't always agree. Cross them and you get four possibilities — two where the verdict matches the truth, and two where the test gets it wrong:

The test can raise a false alarm — shout "effect!" when there's really nothing there. Or it can miss — stay silent when there's a real effect to be found. Same test, two completely different ways to be fooled. Let's map them.

The two errors

Rows are the hidden truth; columns are your study's verdict.

                  Significant                   Not significant
Effect is real    Correct: effect caught        Miss (Type II error)
No real effect    False alarm (Type I error)    Correct: nothing claimed

Two errors, pulling in opposite directions. The first (false alarm) is controlled by your significance threshold. The second (miss) is controlled by something we haven't met yet — power.
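To see the first error in action, here is a minimal simulation sketch in Python. It assumes each trial's test statistic is a standard normal draw when there is truly no effect (a simplification, not something this lesson specifies) and counts how often a two-sided test at threshold alpha shouts "effect!" anyway. The function name false_alarm_rate and its parameters are illustrative.

```python
import random
from statistics import NormalDist

def false_alarm_rate(alpha=0.05, n_trials=50_000, seed=1):
    """Fraction of simulated no-effect trials that still come back 'significant'."""
    rng = random.Random(seed)
    # Two-sided critical value, about 1.96 when alpha = 0.05
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Assumption: under no real effect, the test statistic is a standard normal draw
    alarms = sum(abs(rng.gauss(0.0, 1.0)) > z_crit for _ in range(n_trials))
    return alarms / n_trials

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  simulated false-alarm rate = {false_alarm_rate(alpha):.3f}")
```

The simulated rate should sit close to whatever alpha you choose, which is the sense in which the threshold "controls" false alarms. Nothing in this sketch says anything about misses; that takes power.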

Power: the ability to see

Power is a study's ability to detect a real effect when one genuinely exists. If a drug truly works, power is the probability your trial will actually come back "significant" and catch it. (Formally, power = 1 − the miss rate.)

A study with low power is like a blurry camera: even when there's something to photograph, it often comes back with nothing. A "not significant" result from such a study tells you almost nothing — it couldn't have seen the effect even if it were there.

What raises power? Three things:

The crucial, counterintuitive part: a non-significant result from an underpowered study is not evidence the drug doesn't work. It's evidence the study couldn't tell. Let's watch that happen.
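As a rough sketch of what drives power, the snippet below uses the textbook normal approximation for a two-arm comparison of means. The levers it exposes (sample size, size of the true effect, and noise in the outcome) are the standard ones and are assumed here rather than taken from this lesson, as is the standard deviation of 15 mmHg. The function name power_two_sided is illustrative.

```python
from statistics import NormalDist

def power_two_sided(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two equal-sized arms."""
    z = NormalDist()
    se = sd * (2.0 / n_per_arm) ** 0.5     # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)      # about 1.96 when alpha = 0.05
    shift = abs(effect) / se               # true effect measured in standard errors
    # Chance the test statistic lands beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Assumed baseline: true effect 6 mmHg, sd 15 mmHg, 25 patients per arm
print(f"baseline:       {power_two_sided(6, 15, 25):.0%}")
print(f"more patients:  {power_two_sided(6, 15, 100):.0%}")
print(f"bigger effect:  {power_two_sided(12, 15, 25):.0%}")
print(f"less noise:     {power_two_sided(6, 10, 25):.0%}")
```

With a true effect of zero the function returns alpha itself: when there is nothing to find, the only way to reach "significant" is a false alarm.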

Watch power grow

The drug below truly works: it really does lower the outcome by 6 mmHg. That never changes. All you're going to change is the sample size. Watch what happens to the verdict.

Sample size: N = 25
[Interactive chart: 95% confidence interval for the difference from control (mmHg), axis from -5 to 20, with markers at 0 (no effect) and at the true effect (+6).]
Verdict: Not significant
Power ≈ chance of catching this real effect ≈ 32%
95% CI: [-2.0, 14.0] mmHg

Try this: start small. The interval sprawls across zero — "not significant," even though the drug genuinely works. Now drag N up and watch the verdict flip to "significant" and the power climb.

Nothing about the drug changed — it always worked. Yet at a small sample the verdict was "no significant effect," and at a large one it was "significant." The only thing that moved was power. So when you read "not significant," your first question is never "so it doesn't work?" — it's "was this study even big enough to find out?"
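If you can't drag the slider, this sketch reproduces the idea numerically. It is calibrated to the one data point shown above (a 95% interval of [-2.0, 14.0] at N = 25, so a half-width of 8) and assumes the half-width shrinks in proportion to 1/sqrt(N) and that a normal approximation applies. Those scaling assumptions, and the names demo_at, REF_N and REF_HALF_WIDTH, are mine rather than the demo's actual internals.

```python
from statistics import NormalDist

TRUE_EFFECT = 6.0       # mmHg, the fixed true effect from the lesson
REF_N = 25              # sample size shown in the demo
REF_HALF_WIDTH = 8.0    # half-width of the demo's 95% CI [-2.0, 14.0]

def demo_at(n):
    """Return (typical 95% CI, approximate power) at sample size n."""
    z = NormalDist()
    z_crit = z.inv_cdf(0.975)
    # Assumption: interval half-width scales as 1 / sqrt(N)
    half_width = REF_HALF_WIDTH * (REF_N / n) ** 0.5
    se = half_width / z_crit
    shift = TRUE_EFFECT / se
    power = z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
    # "Typical" interval: centred on the true effect, ignoring sampling luck
    ci = (TRUE_EFFECT - half_width, TRUE_EFFECT + half_width)
    return ci, power

for n in (25, 50, 100, 200, 400):
    (lo, hi), power = demo_at(n)
    verdict = "significant" if lo > 0 else "not significant"
    print(f"N={n:4d}  CI=[{lo:5.1f}, {hi:5.1f}]  power={power:.0%}  typical verdict: {verdict}")
```

At N = 25 the approximation lands in the low 30s percent, close to the 32% shown in the demo. The true effect is a constant in every row; only the interval width and the power move.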

Trap 1, closed: "not significant" ≠ "no effect"

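Now you can fully dismantle the trap from the p-value lesson. When a study reports "no significant difference," there are two completely different things it could mean, and the confidence interval tells them apart: a narrow interval hugging zero means the study looked closely and found little or nothing, while a wide interval spanning zero means the study couldn't tell.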

A wide confidence interval is the visible fingerprint of low power. "Not significant" from a wide interval doesn't close the question — it reopens it, and the honest response is "we need a bigger study," not "the drug doesn't work."

This matters most where it's quietest: a real harm, or a real difference from a competitor, can vanish behind "not significant" simply because nobody powered the study to find it.

Significant vs important: two different questions

Now the opposite trap. Crank the sample size high enough and almost any effect — however trivial — becomes "statistically significant." With 100,000 patients, a drug that lowers blood pressure by 0.3 mmHg will sail past p < 0.05. Real? Yes. Worth anything to a patient? No.

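This is the gap between two ideas people constantly merge: statistical significance (is the effect real?) and practical importance (is it big enough to care about?).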

They are not the same, and one does not imply the other. A huge trial can make a meaningless effect "highly significant." A small trial can leave a genuinely important effect "not significant."
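A quick calculation makes both halves of that concrete. This sketch assumes a two-arm trial split evenly, a standard deviation of 15 mmHg, and an observed difference exactly equal to the true one; none of those numbers come from this lesson, and two_sided_p is an illustrative name.

```python
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_arm):
    """Two-sided p-value for an observed difference in means (normal approximation)."""
    se = sd * (2.0 / n_per_arm) ** 0.5    # standard error of the difference
    z_stat = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Trivial 0.3 mmHg effect, huge trial: 100,000 patients, 50,000 per arm
p_trivial_huge = two_sided_p(0.3, 15, 50_000)

# Meaningful 6 mmHg effect, tiny trial: 20 patients, 10 per arm
p_important_tiny = two_sided_p(6.0, 15, 10)

print(f"0.3 mmHg, 100,000 patients: p = {p_trivial_huge:.4f}")
print(f"6.0 mmHg, 20 patients:      p = {p_important_tiny:.4f}")
```

Under these assumptions the trivial effect clears p < 0.05 comfortably and the important one does not. The p-value is tracking sample size as much as it is tracking the drug.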

Always ask both questions. First: is it real? (significance, the CI excluding zero). Then — and this is the one that decides funding — is it big enough to care about? And recall from the very first module: with a fixed budget, "real but tiny" is exactly the kind of benefit that isn't worth what it displaces.
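One way to make the two questions mechanical is to read the confidence interval against the smallest effect you would care about. The sketch below does that; the threshold smallest_important is hypothetical (this lesson does not give a number), the labels are my own wording, and read_result is an illustrative name.

```python
def read_result(ci_low, ci_high, smallest_important):
    """Classify a 95% CI for an effect against a hypothetical importance threshold."""
    excludes_zero = ci_low > 0 or ci_high < 0
    if not excludes_zero:
        # Not significant: could the interval still be hiding an effect that matters?
        if -smallest_important < ci_low and ci_high < smallest_important:
            return "not significant; important effect ruled out"
        return "not significant; inconclusive"
    # Significant: is the effect big enough to care about?
    nearest = min(abs(ci_low), abs(ci_high))    # end of the interval closest to zero
    farthest = max(abs(ci_low), abs(ci_high))   # end of the interval farthest from zero
    if farthest < smallest_important:
        return "significant; too small to matter"
    if nearest >= smallest_important:
        return "significant; big enough to matter"
    return "significant; size uncertain"

# Assumed threshold: 5 mmHg is the smallest difference worth caring about
print(read_result(-2.0, 14.0, 5.0))   # the small-sample demo interval
print(read_result(0.1, 0.5, 5.0))     # huge trial, trivial effect
print(read_result(-0.4, 0.6, 5.0))    # tight interval around zero
print(read_result(5.5, 9.0, 5.0))     # clearly real and clearly large
```

The useful habit is that "significant" and "not significant" each split into more than one reading once you look at where the interval sits.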

The winner's curse

One last, subtle trap — and it explains why so many exciting early results fade.

Think about small studies that do reach significance. To clear the bar with few patients, the observed effect has to be large, and the easiest way to get a large observed effect in a small sample is a lucky, exaggerated draw. So the small studies that make the cut are systematically the ones that overstated the effect.

This is the winner's curse: a small study that reaches significance tends to overestimate how big the effect really is. The first dramatic result is usually too good to be true — and a larger, calmer study later brings it back to earth.
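A small simulation shows the selection at work. It assumes many identical small trials of a drug whose true effect is 6 mmHg, each with a standard error of about 4.08 mmHg (the value implied by a 95% interval half-width of 8), and normally distributed estimates. Those settings, and the name simulate_winners_curse, are assumptions for illustration.

```python
import random
from statistics import NormalDist, mean

def simulate_winners_curse(true_effect=6.0, se=4.08, n_studies=20_000, seed=0):
    """Compare the average estimate across all trials with the average among 'winners'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    # Each small trial's estimate is the truth plus sampling noise
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Winners: trials that reach significance in the beneficial direction
    winners = [e for e in estimates if e / se > z_crit]
    return mean(estimates), mean(winners), len(winners) / n_studies

avg_all, avg_winners, share = simulate_winners_curse()
print(f"average estimate, all trials:          {avg_all:.1f} mmHg")
print(f"average estimate, significant trials:  {avg_winners:.1f} mmHg")
print(f"share of trials reaching significance: {share:.0%}")
```

Across all trials the estimates average out near the true 6. But a trial here only counts as significant if its estimate exceeds roughly 1.96 × 4.08 ≈ 8 mmHg, so every winner reports more than the truth, and their average has to sit above it.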

It's a major reason real effects "shrink" on replication, and why a single small, spectacular trial should make you cautious, not excited. (We'll meet its cousin — the way extreme results drift back toward average — again later.)

Why this matters for HTA

This is daily ammunition for reading a submission honestly:

"Significant" answers only "is it probably real?" The decision needs two more answers it can't give: is it big enough to matter, and was the study even able to find out?

Significance, power & the traps, in one breath

A verdict of "significant" or "not" is the beginning of reading a result — never the end. Ask: is it real, is it big enough to matter, and was the study able to tell?

Everything so far assumed one test, one question. But real trials test many things at once — multiple outcomes, multiple subgroups, multiple looks at the data. And once you're running many tests, false alarms stop being rare accidents and start becoming almost guaranteed. Next: multiplicity and p-hacking — how testing enough things manufactures "significance" out of pure noise.