Module 3 · Multiplicity & p-hacking

The trial failed. Except in left-handed women over 60.

A large trial tests a new drug against placebo. Overall result: no benefit. The drug doesn't work — disappointing, but clear.

Then the manufacturer points to a footnote. "In women over 60 who were left-handed, the drug showed a clear benefit — p = 0.03, statistically significant!" They want this subgroup approved.

The overall trial failed, but one oddly specific subgroup 'worked,' p = 0.03. Should you believe it?

Everything you've learned so far assumed one test, one question. Real trials test many things at once — many outcomes, many subgroups, many time points. And once you're running many tests, a false alarm stops being a rare 1-in-20 accident and becomes almost inevitable. That shift is one of the most important — and most exploited — facts in all of evidence appraisal.

One test versus many

Recall the deal with significance: at p < 0.05, a single test of a drug that truly does nothing has a 5% chance of a false alarm — crying "effect!" by pure chance. One in twenty. Acceptably rare.

But what if you don't run one test? What if you run twenty?

Each test still has its own 5% chance of a false alarm. So across twenty tests of things that all do nothing, the false alarms add up. The right question is no longer "will this test fool me?" but "across all these tests, how likely is it that at least one fools me?" — and the answer is unsettling.

Five percent per test sounds safe. Run enough tests, and "at least one false alarm" goes from unlikely to almost certain. Let's watch it happen — with a drug that does absolutely nothing.

Watch noise become "significance"

The drug below is a dud. It has zero real effect — we've built it that way. But we're going to test it separately in a whole batch of subgroups. Run the trial and see what "significant" results turn up.

Number of subgroups tested: 20

Try this: run it several times at 20 subgroups. Sometimes none light up. Often one. Sometimes two or three. Now slide it up to 40 and run again.

Look what you produced. A drug that does nothing just handed you a "statistically significant" subgroup, not because it works, but because you tested enough things that chance was likely to deliver one. You didn't find an effect. You found noise. And with enough subgroups, noise will almost always offer you something to report.
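If you can't run the interactive trial, here is a rough sketch of the same experiment in Python. It is an illustration under stated assumptions, not the code behind the widget: the function names, the 200 patients per arm, the 30% response rate, and the use of a two-proportion z-test are all choices made for this sketch.

```python
import math
import random


def two_sided_p(events_a, events_b, n_per_arm):
    """Two-sided p-value from a two-proportion z-test (normal approximation)."""
    pooled = (events_a + events_b) / (2 * n_per_arm)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = ((events_a - events_b) / n_per_arm) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


def run_dud_trial(n_subgroups, n_per_arm=200, response_rate=0.30, alpha=0.05, rng=None):
    """Test a drug with zero real effect in each subgroup.

    Returns the indices of the subgroups that come out 'significant'.
    All default values are assumptions chosen for illustration.
    """
    if rng is None:
        rng = random.Random()
    hits = []
    for subgroup in range(n_subgroups):
        # Both arms share the same true response rate: the drug is a dud.
        drug = sum(rng.random() < response_rate for _ in range(n_per_arm))
        placebo = sum(rng.random() < response_rate for _ in range(n_per_arm))
        if two_sided_p(drug, placebo, n_per_arm) < alpha:
            hits.append(subgroup)
    return hits


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the runs can be repeated
    for run in range(1, 6):
        hits = run_dud_trial(20, rng=rng)
        print(f"run {run}: {len(hits)} 'significant' subgroup(s) out of 20 -> {hits}")
```

Change the 20 to 40 and run it again, just as with the slider. Every "hit" it prints is a false alarm by construction, because both arms were drawn from the same response rate.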

Name it, and count it

This is the multiple comparisons problem, also called multiplicity. The more tests you run, the more false alarms you should expect, even when nothing is real.

And you can put a number on it. If each test has a 5% false-alarm rate, then across many tests the expected number of false alarms is just:

Expected false alarms ≈ number of tests × 0.05

Worked example: 20 tests × 0.05 = 1 expected false alarm

And the chance of at least one false alarm, if the 20 tests are independent of one another, is 1 − 0.95²⁰ ≈ 64%: more likely than not.
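Where does that 64% come from? A short derivation, assuming the tests are independent and nothing being tested is real. The symbols are introduced here for convenience: α is the per-test false-alarm rate and m is the number of tests.

```latex
\begin{align*}
P(\text{no false alarm in one test}) &= 1 - \alpha \\
P(\text{no false alarm in any of } m \text{ tests}) &= (1 - \alpha)^{m} \\
P(\text{at least one false alarm}) &= 1 - (1 - \alpha)^{m} \\
\text{with } \alpha = 0.05,\ m = 20: \qquad 1 - 0.95^{20} &\approx 1 - 0.358 = 0.642
\end{align*}
```

If the tests overlap (for example, subgroups that share patients), the exact figure will differ, but the direction is the same: more tests, more chances to be fooled.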

Your turn. Same rule, bigger number: 100 tests, each at 5% false-alarm rate.

  1. 100 tests × 0.05 = ?

Test 20 things, expect 1 false alarm. Test 100, expect 5. The "findings" were never findings — they're the statistical exhaust of running many tests. The p-value of any single one looks perfectly respectable. The problem is invisible unless you know how many tests were run.
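To see how quickly the numbers grow, here is a minimal sketch that applies the two rules above to several test counts. The function names are made up for this example, and the "at least one" figure assumes the tests are independent.

```python
def expected_false_alarms(n_tests, alpha=0.05):
    """Expected number of false alarms when nothing tested is real."""
    return n_tests * alpha


def chance_of_at_least_one(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests is a false alarm."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 20, 40, 100):
        print(
            f"{n:>3} tests: expect {expected_false_alarms(n):.2f} false alarms, "
            f"{chance_of_at_least_one(n):.0%} chance of at least one"
        )
```

The 20-test and 100-test rows reproduce the figures in the text: 1 and 5 expected false alarms.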

p-hacking: the active version

So far we assumed the tests just happened. But there's a more insidious version, where a researcher — often without any intent to deceive — steers toward significance. It's called p-hacking.

The trap is that a single dataset offers dozens of quiet choices: which outcome to focus on, which subgroups to look at, which patients to exclude, where to put a cut-off, when to stop collecting data. Each choice seems reasonable in isolation. But try a handful, keep the analysis that crosses p < 0.05, and report only that — and you've manufactured a "finding" from noise without ever lying.

This is the garden of forking paths: so many defensible analyses of the same data that one of them is almost bound to look significant. p-hacking rarely means fraud. It usually means a well-meaning analyst who tried several things and reported the one that "worked," forgetting that the discarded analyses change what the reported p-value means.
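A rough simulation makes the damage visible. The sketch below compares two analysts working on a drug with no effect: one reports the outcome named in advance, the other measures several outcomes and reports whichever has the smallest p-value. Everything here is an assumption made for illustration: the function names, 8 outcomes, 50 patients per arm, and outcomes that are independent of one another. Real forking paths are usually correlated choices on the same data, so the exact inflation will differ.

```python
import math
import random


def p_value_for_outcome(rng, n_per_arm):
    """Simulate one outcome with no true drug effect; return its two-sided p-value."""
    drug = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    placebo = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    diff = sum(drug) / n_per_arm - sum(placebo) / n_per_arm
    # z-test: the standard deviation is 1 by construction in this simulation
    z = diff / math.sqrt(2 / n_per_arm)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_studies(n_studies=2000, n_outcomes=8, n_per_arm=50, alpha=0.05, seed=7):
    """Return (share of studies significant on the pre-specified outcome,
    share significant on the best-looking outcome)."""
    rng = random.Random(seed)
    prespecified_hits = 0
    best_of_hits = 0
    for _ in range(n_studies):
        p_values = [p_value_for_outcome(rng, n_per_arm) for _ in range(n_outcomes)]
        if p_values[0] < alpha:    # honest: report the outcome named in advance
            prespecified_hits += 1
        if min(p_values) < alpha:  # p-hacked: report whichever outcome "worked"
            best_of_hits += 1
    return prespecified_hits / n_studies, best_of_hits / n_studies


if __name__ == "__main__":
    honest, hacked = simulate_studies()
    print(f"pre-specified outcome significant: {honest:.1%} of studies")
    print(f"best of 8 outcomes significant:    {hacked:.1%} of studies")
```

The pre-specified analyst should be fooled in roughly 1 study in 20, as promised. The analyst who picks the best of 8 is fooled far more often, even though no single analysis involved any dishonesty.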

The defence: declare it in advance

If running many analyses breaks the p-value, the fix is simple in principle: decide what you're testing before you see the data.

The single most useful question you can ask of any striking result: was this the question the study set out to answer — or one it went looking for afterwards?

Trustworthy, or a red flag?

Put it to work. For each result below, decide whether it's trustworthy evidence or a multiplicity red flag.

Subgroups: where this bites hardest

One application matters more than any other in health technology assessment (HTA): subgroup analyses. A trial that fails overall is a commercial disaster, so there is enormous pressure to find some slice of patients where the drug "worked."

The result is a flood of subgroup claims: the drug helped the older patients, or the sicker ones, or those with a particular biomarker. Each is one of many comparisons, each multiplies the false-alarm risk, and each tends to surface after the main result disappointed.

A subgroup finding is hypothesis-generating, not confirmatory. At best it says "this might be worth testing properly in a dedicated trial." It never, on its own, proves the drug works in that subgroup. The right response to "but it worked in this subgroup" is always: how many subgroups did you look at, and was this one named in advance?

Why this matters for HTA

This is one of the sharpest tools you'll use in appraising a submission, because multiplicity is everywhere in the evidence a manufacturer presents.

The same p < 0.05 means completely different things depending on how many tests stood behind it. Your job is to ask the question the p-value can't answer on its own: how many shots did they take?

Multiplicity & p-hacking, in one breath

A p-value only means what it claims if it stood alone. Behind every striking "finding," ask the question that reveals the truth: how many things did they test?

That completes the toolkit for the hardest question in evidence: is this effect real? You can now judge a single result, read its uncertainty, and see through the traps that fake significance. But "real" is only half of what a decision needs. The other half is how big — and there are several different ways to measure the size of an effect, some of which can make the very same result look modest or miraculous depending on which you choose. That's where we turn next: the measures of effect.