Module 3 · Multiplicity & p-hacking

The trial failed. Except in left-handed women over 60.

A large trial tests a new drug against placebo. Overall result: no benefit. The drug doesn't work — disappointing, but clear.

Then the manufacturer points to a footnote. "In women over 60 who were left-handed, the drug showed a clear benefit — p = 0.03, statistically significant!" They want this subgroup approved.

The overall trial failed, but one oddly specific subgroup "worked," p = 0.03. Should you believe it?

Everything you've learned so far assumed one test, one question. Real trials test many things at once — many outcomes, many subgroups, many time points. And once you're running many tests, a false alarm stops being a rare 1-in-20 accident and becomes almost inevitable. That shift is one of the most important — and most exploited — facts in all of evidence appraisal.

One test versus many

Recall the deal with significance: at p < 0.05, a single test of a drug that truly does nothing has a 5% chance of a false alarm — crying "effect!" by pure chance. One in twenty. Acceptably rare.

But what if you don't run one test? What if you run twenty?

Each test still has its own 5% chance of a false alarm. So across twenty tests of things that all do nothing, the false alarms add up. The right question is no longer "will this test fool me?" but "across all these tests, how likely is it that at least one fools me?" — and the answer is unsettling.

Five percent per test sounds safe. Run enough tests, and "at least one false alarm" goes from unlikely to almost certain. Let's watch it happen — with a drug that does absolutely nothing.

Watch noise become "significance"

The drug below is a dud. It has zero real effect — we've built it that way. But we're going to test it separately in a whole batch of subgroups. Run the trial and see what "significant" results turn up.

Number of subgroups tested: 20

Try this: run it several times at 20 subgroups. Sometimes none light up. Often one. Sometimes two or three. Now slide it up to 40 and run again.

Look what you produced. A drug that does nothing just handed you a "statistically significant" subgroup, not because it works, but because you tested enough things that chance was likely to deliver one. You didn't find an effect. You found noise. And with enough subgroups, noise almost always offers you something to report.
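If you'd rather see the experiment as code, here is a minimal sketch of the same idea. It is an illustration under stated assumptions, not the simulator behind this page: outcomes are drawn from a normal distribution with a known standard deviation of 1, subgroups are independent, and each subgroup is checked with a two-sided z-test. The names count_false_alarms and n_per_arm are made up for this example.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def count_false_alarms(n_subgroups, n_per_arm, rng, alpha=0.05):
    """Test a drug with zero true effect in each subgroup; count 'significant' ones."""
    hits = 0
    for _ in range(n_subgroups):
        # Both arms come from the same distribution: the drug does nothing.
        drug = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        # Two-sided z-test for a difference in means (SD assumed known and equal to 1).
        standard_error = (2.0 / n_per_arm) ** 0.5
        z = (mean(drug) - mean(placebo)) / standard_error
        p_value = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))
        if p_value < alpha:
            hits += 1
    return hits

# Repeat the whole 20-subgroup trial many times and see how often noise "finds" something.
rng = random.Random(0)
runs = [count_false_alarms(20, 30, rng) for _ in range(1000)]
share_with_hit = sum(1 for hits in runs if hits > 0) / len(runs)
print(f"Trials with at least one 'significant' subgroup: {share_with_hit:.0%}")
```

Under these assumptions the printed share should land somewhere around 60 to 70 percent, even though the drug never does anything. Changing 20 to 40 in the call pushes it higher still.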

Name it, and count it

This is the multiple comparisons problem, also called multiplicity. The more tests you run, the more false alarms you should expect, even when nothing is real.

And you can put a number on it. If each test has a 5% false-alarm rate, then across many tests the expected number of false alarms is just:

Expected false alarms ≈ number of tests × 0.05

Worked example: 20 tests × 0.05 = 1 expected false alarm

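And the chance of at least one false alarm, if the 20 tests are independent, is 1 − 0.95²⁰ ≈ 64%. Each test stays quiet with probability 0.95, so all twenty stay quiet with probability 0.95²⁰ ≈ 0.36, and at least one fires the rest of the time. More likely than not.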
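The two calculations above can be sketched in a few lines. This assumes every test is independent and every tested effect is truly zero; the function names are chosen for this example only.

```python
def expected_false_alarms(n_tests, alpha=0.05):
    """Average number of false alarms when every tested effect is truly zero."""
    return n_tests * alpha

def chance_of_at_least_one(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests crosses alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 5, 10, 20, 40):
    print(
        f"{n:>3} tests: expect {expected_false_alarms(n):.2f} false alarms, "
        f"chance of at least one = {chance_of_at_least_one(n):.0%}"
    )
```

Real subgroups and outcomes are often correlated, which changes the exact percentages, but the direction is the same: more tests, more chances to be fooled.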

Your turn. Same rule, bigger number: 100 tests, each at 5% false-alarm rate.

100 tests × 0.05 = ?

Test 20 things, expect 1 false alarm. Test 100, expect 5. The "findings" were never findings — they're the statistical exhaust of running many tests. The p-value of any single one looks perfectly respectable. The problem is invisible unless you know how many tests were run.

p-hacking: the active version

So far we assumed the tests just happened. But there's a more insidious version, where a researcher — often without any intent to deceive — steers toward significance. It's called p-hacking.

The trap is that a single dataset offers dozens of quiet choices: which outcome to focus on, which subgroups to look at, which patients to exclude, where to put a cut-off, when to stop collecting data. Each choice seems reasonable in isolation. But try a handful, keep the analysis that crosses p < 0.05, and report only that — and you've manufactured a "finding" from noise without ever lying.

This is the garden of forking paths: so many defensible analyses of the same data that one of them is almost bound to look significant. p-hacking rarely means fraud. It usually means a well-meaning analyst who tried several things and reported the one that "worked," forgetting that the discarded analyses change what the p-value means.
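To see how a single innocent-looking choice inflates false alarms, here is a rough sketch of just one fork: deciding when to stop collecting data. It compares an analyst who tests once at the planned sample size with one who peeks after every 10 patients per arm and stops the moment p < 0.05. The drug has no effect in either case. The setup is an assumption for illustration (normal outcomes, known standard deviation of 1, a two-sided z-test), and the function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value_for_difference(sum_drug, sum_placebo, n_per_arm):
    """Two-sided z-test p-value for a difference in means (SD assumed to be 1)."""
    mean_difference = (sum_drug - sum_placebo) / n_per_arm
    standard_error = (2.0 / n_per_arm) ** 0.5
    z = mean_difference / standard_error
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))

def study_finds_effect(rng, peek_every, max_per_arm, alpha=0.05):
    """Run one null study, testing after every `peek_every` patients per arm.

    Returns True if any look crosses alpha, i.e. the analyst stops and reports it.
    """
    sum_drug = 0.0
    sum_placebo = 0.0
    for n in range(1, max_per_arm + 1):
        # Both arms share the same distribution: there is nothing to find.
        sum_drug += rng.gauss(0.0, 1.0)
        sum_placebo += rng.gauss(0.0, 1.0)
        if n % peek_every == 0:
            if p_value_for_difference(sum_drug, sum_placebo, n) < alpha:
                return True
    return False

rng = random.Random(1)
n_studies = 2000
one_look = sum(study_finds_effect(rng, 100, 100) for _ in range(n_studies)) / n_studies
ten_looks = sum(study_finds_effect(rng, 10, 100) for _ in range(n_studies)) / n_studies
print(f"False-alarm rate, one planned look: {one_look:.1%}")
print(f"False-alarm rate, ten peeks:        {ten_looks:.1%}")
```

The single planned look should stay near 5%, while the peeking analyst's rate comes out several times higher. No data were altered and no test was computed wrongly; the only difference is how many chances the analysis was given.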

The defence: declare it in advance

If running many analyses breaks the p-value, the fix is simple in principle: decide what you're testing before you see the data.

Pre-specification. A trial names its primary endpoint — the one outcome it will be judged on — in advance, before any data arrives. That single test keeps its honest 5%. Everything else is secondary, exploratory, not confirmatory.
Pre-registration. The whole analysis plan is locked in publicly before the trial runs, so anyone can see whether a "finding" was the planned question or a later fishing trip.
Correcting for multiplicity. If you genuinely must run many tests, raise the bar: demand a stricter threshold so the overall false-alarm rate stays at or below 5%. (The simplest version, known as the Bonferroni correction, just divides 0.05 by the number of tests: run 10 tests, require p < 0.005 for each.)
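The divide-by-the-number-of-tests rule (the Bonferroni correction) can be sketched like this. The list of p-values is invented purely for illustration, and the familywise calculation assumes independent tests.

```python
def bonferroni_threshold(n_tests, overall_alpha=0.05):
    """Per-test threshold that keeps the overall false-alarm rate at or below overall_alpha."""
    return overall_alpha / n_tests

def familywise_rate(n_tests, per_test_alpha):
    """Chance of at least one false alarm across n independent null tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests

# Hypothetical p-values from ten subgroup tests (made up for this example).
p_values = [0.03, 0.20, 0.004, 0.65, 0.048, 0.011, 0.51, 0.09, 0.33, 0.77]

threshold = bonferroni_threshold(len(p_values))
survivors = [p for p in p_values if p < threshold]

print(f"Per-test threshold: {threshold:.4f}")
print(f"Results that survive correction: {survivors}")
print(f"Overall false-alarm rate, uncorrected: {familywise_rate(10, 0.05):.0%}")
print(f"Overall false-alarm rate, corrected:   {familywise_rate(10, threshold):.1%}")
```

Three of these ten invented results sit below 0.05, but only one clears the corrected bar. The correction is deliberately strict: it buys back the 5% guarantee at the cost of making real effects harder to detect.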

The single most useful question you can ask of any striking result: was this the question the study set out to answer — or one it went looking for afterwards?

Trustworthy, or a red flag?

Put it to work. For each result below, decide whether it's trustworthy evidence or a multiplicity red flag.

Subgroups: where this bites hardest

One application matters more than any other in health technology assessment (HTA): subgroup analyses. A trial that fails overall is a commercial disaster, so there is enormous pressure to find some slice of patients where the drug "worked."

The result is a flood of subgroup claims: the drug helped the older patients, or the sicker ones, or those with a particular biomarker. Each is one of many comparisons, each multiplies the false-alarm risk, and each tends to surface after the main result disappointed.

A subgroup finding is hypothesis-generating, not confirmatory. At best it says "this might be worth testing properly in a dedicated trial." It never, on its own, proves the drug works in that subgroup. The right response to "but it worked in this subgroup" is always: how many subgroups did you look at, and was this one named in advance?

Why this matters for HTA

This is one of the sharpest tools you'll use in appraising a submission, because multiplicity is everywhere in the evidence a manufacturer presents:

A subgroup claim — ask how many subgroups were examined, and whether this one was pre-specified. Post-hoc subgroups are hypothesis-generating at best.
A trial that "succeeded" on a secondary outcome after missing its primary — treat with suspicion; the primary was the honest test, and secondaries multiply false alarms.
A result with a suspiciously specific patient definition — narrow, oddly-cut populations are a fingerprint of fishing for significance.
Always ask for the pre-registered protocol — it tells you which findings were confirmatory and which were discovered along the way.

The same p < 0.05 means completely different things depending on how many tests stood behind it. Your job is to ask the question the p-value can't answer on its own: how many shots did they take?

Multiplicity & p-hacking, in one breath

One test at p < 0.05 has a 5% false-alarm rate — but run many tests and false alarms pile up fast (expected ≈ tests × 0.05).
Test enough subgroups or outcomes of a drug that does nothing, and a "significant" result is almost guaranteed — pure noise.
p-hacking is the active version: trying many analyses and reporting only the ones that "worked" — usually without any intent to deceive.
The defences are pre-specification, pre-registration, and correcting the threshold when many tests are genuinely needed.
Subgroup findings are hypothesis-generating, never confirmatory — always ask how many comparisons were run.

A p-value only means what it claims if it stood alone. Behind every striking "finding," ask the question that reveals the truth: how many things did they test?

That completes the toolkit for the hardest question in evidence: is this effect real? You can now judge a single result, read its uncertainty, and see through the traps that fake significance. But "real" is only half of what a decision needs. The other half is how big — and there are several different ways to measure the size of an effect, some of which can make the very same result look modest or miraculous depending on which you choose. That's where we turn next: the measures of effect.