Flukes and flaws

Connections between salt and heart disease or passive smoking and lung cancer are discovered in one study only to be disproved by another; and claims made for wonder drugs are routinely disappointed in the real world. Too much research seems to be based on flawed statistics. What's going on?
November 20, 1998

It's a scary world out there: breathing in other people's cigarette smoke, eating too much salt, living near power lines-all have a stack of scientific studies linking them to life-threatening diseases, from cancer to heart disease. Just as well, then, that there is also an impressive list of studies of these same risks finding no links at all. In any case, doctors now have a host of clinically proven wonder drugs to save our lives if we ever do fall prey to these diseases. But here is another funny thing: there is an equally impressive stack of evidence showing that these wonder drugs are anything but wonderful.

Surely these are statistical blips, the inevitable outcome of research into complex issues. Or are they? Last year, the British Medical Journal published a study by a team of cardiac specialists at Queen's Medical Centre in Nottingham. Dr Nigel Brown and his colleagues had asked a simple question: how big an impact on heart attack survival have all those "clinically proven" wonder drugs had since the early 1980s? Trials of these drugs suggested that they would double survival rates. But what Brown found on the wards was disquieting. Back in 1982, the death rate on the wards was 20 per cent. Ten years later, it was the same. Somewhere between the clinical trials and the wards, the wonder drugs had lost their life-saving abilities.

A similar picture emerges from cancer research; so many therapies have failed to perform on the wards that talk of wonder drugs is now taboo. According to the US National Cancer Institute, despite all those "clinically proven" therapies that have emerged since the launch of President Nixon's war on cancer in 1971, the overall survival prospects of patients have risen by only 4 per cent.

Researchers have tried to explain these cases of "the vanishing breakthrough." Some blame the tabloid-like preference of academic journals for positive findings rather than boring refutations. Others cite dodgy experimental methods, a failure to rule out other causes and over-reliance on data from tiny samples. But there is another culprit whose identity has been known within academic circles for decades. Warnings about its effects on the reliability of research findings have been made repeatedly since the 1960s. Yet scientists are doing nothing about them.

The warnings focus on techniques that lie at the very heart of modern science: so-called "significance tests," the statistical methods used to gauge the importance of new results. Suppose, for example, that trials of some new cancer treatment show that 15 per cent of patients given the drug survived, compared to 8 per cent of those given the standard therapy. Is this a "significant" difference? To answer this question, standard statistical methods ask how likely it is that the increased survival rate arose by pure chance. If calculations show that this probability is sufficiently low, then the effects of the drug will be deemed "significant."
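To see what such a calculation involves, here is a minimal sketch in Python of the standard approach (a two-proportion z-test). The 15 per cent and 8 per cent survival figures come from the example above; the trial sizes of 200 patients per arm are invented purely for illustration.

```python
# A minimal sketch of the standard significance test for the example above
# (a two-proportion z-test). The 15% and 8% survival rates come from the text;
# the trial sizes of 200 patients per arm are invented purely for illustration.
from scipy.stats import norm

n_new, n_std = 200, 200                 # hypothetical numbers of patients per arm
surv_new, surv_std = 30, 16             # 15% and 8% survivors respectively

p_new, p_std = surv_new / n_new, surv_std / n_std
p_pool = (surv_new + surv_std) / (n_new + n_std)   # survival rate if chance alone were at work
se = (p_pool * (1 - p_pool) * (1 / n_new + 1 / n_std)) ** 0.5
z = (p_new - p_std) / se

p_value = 2 * norm.sf(abs(z))           # chance of a gap at least this big under pure fluke
print(f"z = {z:.2f}, P = {p_value:.3f}")  # about 0.03 here, so "significant" by the usual rule
```

On those invented numbers the test returns a P-value of about 0.03, below the usual threshold discussed later in this piece.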

Relied on by the scientific community for over 70 years, these tests have become the underpinning of thousands of scientific papers and hundreds of millions of pounds of research funding. Yet some of the world's most distinguished statisticians have been warning for years that these techniques routinely exaggerate the size and significance of implausible findings. Used to analyse claims for a new health risk, these techniques will see "significance" in results which are in fact the product of chance. Applied to the outcome of clinical trials, they can easily double the real efficacy of a useless drug-and allow it to sail past the regulatory approval system.

The problems with significance testing become clear when others try to confirm the claimed link, or use the supposed wonder drug to save lives on wards-and find they can't. Instead, the evidence just flip-flops this way or that for years on end, or bursts spectacularly on to the scene for a few years-and then fades away again like a spent firework.

Take the long-running saga of power lines and cancer. To this day, no one has come up with a plausible explanation of how electromagnetic fields from pylons could trigger cancer in humans. Even so, dozens of studies have claimed that those living or working near electric power lines face a higher risk of developing the disease. In 1992, a study by the respected Karolinska Institute in Stockholm found a highly significant three to four-fold increase in risk.

But over the same time period, a whole raft of studies has also pointed to no significant risk at all. The most recent and most comprehensive study involved over 1,200 children and was published in the New England Journal of Medicine in July last year. Some epidemiologists believe that it marks the end of the debate; others say that the evidence is not definitive. Both sides can point to studies supporting their case-while all the time physicists and others insist the claims simply make no sense.

Then there is the bizarre story of the infectious heart attacks. In 1994 a team led by Michael Mendall at St George's Hospital Medical School, London, published evidence suggesting that heart disease was linked to infection with a bacterium called Helicobacter pylori. Found in the stomach, H pylori was already known to be a cause of peptic ulcers, but quite how a bug in the stomach could cause heart disease was anyone's guess. Like pylons and cancer, the link lacked any solid support. No matter: Mendall and his team found statistical evidence that men with coronary heart disease were 50 per cent more likely to test positive for H pylori than healthy men.

The mere suggestion that heart disease might be both infectious and treatable with antibiotics was enough to prompt many other studies of the link. By the end of last year, around 20 had been published. But as with pylons and cancer, the findings have just flip-flopped this way and that. Some studies have pointed to a quadrupling in the risk of heart disease, while others have found nothing at all. And again, the latest and largest study, published last year in the British Medical Journal, shows no link at all.

Dietary salt and blood pressure, vitamin K injections and leukaemia, passive smoking and lung cancer: all these have been criticised as lacking real plausibility, and all have seen statistical evidence of risk come and go like the morning dew. It is the same story with many supposed wonder therapies found by clinical studies. Nitrate patches and magnesium injections for heart attacks; aspirin for pre-eclampsia in pregnancy-during the 1980s, all produced results in clinical trials pointing to astonishing levels of efficacy. And all have since proved almost useless.

Raise doubts about significance testing, and you raise doubts about the reliability of much of the research carried out in experimental sciences over the last half-century or more. Medical research is not alone in falling prey to the flaws in significance tests. The same symptoms can be seen across the scientific spectrum, from psychology to genetics and studies of the paranormal. But how did the scientific community come to put its faith in such flawed methods?

Ironically, the scientific community adopted these flawed statistical methods out of its desire for truly reliable knowledge. Scientists believed that significance testing could give them insights that were not subject to fashion, or opinion or prejudice. Unfortunately, they were wrong-and the explanation lies in a formula first deduced by an English mathematician and cleric over 200 years ago. It is known as Bayes's Theorem, and it holds the key to the modern-day scandal of significance testing.

Born in 1702, Thomas Bayes became one of the founders of modern-day probability; he was elected a Fellow of the Royal Society at the age of 40. His most important work, however, appeared in his "Essay Towards Solving a Problem in the Doctrine of Chances," published in 1763, two years after his death. In it, Bayes showed that he had discovered a mathematical recipe of great power: a formula showing how we should change our belief in a theory in the light of new evidence.

Bayes's discovery was investigated by other leading mathematicians, who tidied up his work and turned his formula into a recipe for assessing the impact of new evidence on a scientific theory. This recipe states that all we have to do is combine our initial belief in our theory-its "prior probability"-with the weight of evidence provided by the new data we have acquired. That weight is captured by the so-called "likelihood ratio," which compares how probable the new data are under our explanation with how probable they are under the alternative explanations. The end result is an updated value for the probability of our claim being correct, in the light of the new data.
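In odds form, that recipe takes only a few lines to write down. The sketch below is purely illustrative: the prior probability and the likelihood ratio are invented numbers, not figures from any real study.

```python
# Bayes's recipe in odds form: posterior odds = prior odds x likelihood ratio.
# Both input numbers below are invented, purely to show the mechanics.

def update(prior_prob: float, likelihood_ratio: float) -> float:
    """Updated probability that a claim is true, after seeing new evidence."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# A claim thought 10% likely beforehand, backed by data five times more probable
# under that claim than under its rivals, rises to roughly a 36% probability:
print(update(prior_prob=0.10, likelihood_ratio=5.0))
```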

But how do we determine the "prior probability" of a claim being correct? The answer is: it depends. For example, 30 years of bitter experience in clinical trials shows that it is pretty unlikely that a new drug will cut cancer death rates by 50 per cent or more. Something closer to 5 to 10 per cent is more plausible-and that experience can be used in Bayes's Theorem to put new findings in some sort of perspective. With more outlandish claims, such as the existence of telepathy, a greater degree of scepticism is required-and this again can be factored in using Bayes's Theorem.

Bayesian methods reflect the obvious fact that while evidence might convince some people that there is something going on, it may not convince others-sceptics will start with higher odds of fluke being the most plausible explanation. Crucially, however, Bayes's Theorem shows that as the data accumulate, everyone ends up being driven to the same conclusion: whatever one's prior belief in fluke as an explanation, that initial belief becomes progressively less influential as the data comes in.
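A rough numerical illustration of that convergence, again with invented figures: suppose a sceptic starts with a 1 per cent belief in a genuine effect and an enthusiast with 50 per cent, and suppose each new study multiplies the odds in favour of a genuine effect by two.

```python
# Illustration of how accumulating evidence swamps different priors.
# The likelihood ratio of 2 per study and both starting beliefs are invented.

def update(prior_prob, likelihood_ratio):
    odds = prior_prob / (1 - prior_prob) * likelihood_ratio
    return odds / (1 + odds)

sceptic, enthusiast = 0.01, 0.50
for study in range(1, 11):
    sceptic = update(sceptic, 2.0)
    enthusiast = update(enthusiast, 2.0)
    print(f"after study {study:2d}: sceptic {sceptic:.2f}, enthusiast {enthusiast:.2f}")
# After ten such studies both probabilities exceed 0.9: the data have taken over.
```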

For over 150 years, Bayes's Theorem formed the foundation of "statistical inference." But during the early part of this century, a number of mathematicians and philosophers began to raise objections to it. The most damning centred on the fact that the Theorem only shows how new data can alter existing belief in a theory. Clearly, different people may have different starting points-and so could end up coming to different conclusions about the same data. Faced with the same experimental evidence for, say, telepathy, true believers could use Bayes's Theorem to say the new data was wholly convincing evidence for it. Sceptics, in contrast, could use Bayes's Theorem to insist the jury was still out.

To non-scientists, this may not seem like an egregious failing. But this is to ignore the scientific community's fear and loathing of subjectivity. To most scientists, subjectivity is the Great Satan. It is the mind-virus which has turned the humanities into an intellectual free-for-all, where the idea of "progress" is dismissed as bourgeois, and belief in "facts" naive. Once allowed into the citadel of science, subjectivity would turn all research into glorified Lit. Crit.

It was to these sentiments that the critics of Bayes appealed. By the 1920s, Bayes's Theorem had been declared heretical-leaving scientists with a problem. They still had data to analyse and conclusions to draw from them. How were they going to do it, now that Bayes was beyond the pale? The answer came from one of Bayes's most brilliant critics: the Cambridge mathematician and geneticist, Ronald Aylmer Fisher.

Few scientists had greater need of a replacement for Bayes than Fisher, who frequently worked with complex data from plant breeding trials. And if anyone could find a truly objective way of drawing conclusions from data, Fisher could. By 1925, he appeared to have succeeded; his book, Statistical Methods for Research Workers, gave researchers a raft of apparently objective methods for analysing data.

It was to become one of the most influential scientific texts; its methods were swiftly adopted throughout the scientific community. It was Fisher who recommended turning raw data into something called probability-values or P-values. This is the probability of getting results at least as impressive as those obtained assuming mere fluke was their cause. If this P-value was less than 0.05 (one in 20), the results could be declared "significant." Fisher's recipe has been used to judge significance ever since. Open any leading scientific journal and you will see the phrase "P < 0.05"-the hallmark of a "significant" finding.

Following the publication of Fisher's book, questions were raised about whether he had really succeeded in banishing subjectivity. The distinguished Cambridge mathematician Harold Jeffreys, writing in his own treatise on statistics published in 1939, raised an especially incisive question: why did Fisher set 0.05 as the crucial dividing line for significance?

The 0.05 figure remains central to a large number of claims made by researchers. Even decisions as to whether a new drug is approved by the government depend on meeting Fisher's standard of P < 0.05. So what were the insights which led Fisher to choose that talismanic figure of 0.05, on which so much scientific research stands or falls? As Fisher himself admitted, there weren't any. He had simply picked the figure because it was "convenient."

But there is a more serious problem. Suppose you are a scientist observing the effect of a new drug. The question you would like to be able to answer is whether the effect is due to mere chance or to the drug having some genuine effect on patients. The P-value seems to answer that question, but it doesn't. The P-value tells you only how likely the observed effect would be, assuming it were due to chance. If this probability is less than one in 20, scientists have taken the result to be significant. But this is not the same as asking how likely it is that chance is actually responsible for the observed effect. This second question is the important one-and it is not answered by asking how likely the observed effect would be, assuming it were due to chance. Only Bayes's Theorem can give us the answer to the question we are interested in.
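A toy calculation makes the difference between the two questions concrete. All the numbers are invented: suppose only one in ten drugs entering trials genuinely works, that trials detect a genuine effect 80 per cent of the time, and that "significant" means P < 0.05.

```python
# How often is chance the real culprit behind a "significant" result?
# All three inputs are invented, purely to illustrate the distinction.
p_real = 0.10      # prior plausibility: 1 in 10 tested drugs genuinely works
power = 0.80       # chance a genuine effect produces P < 0.05
alpha = 0.05       # chance a useless drug produces P < 0.05 by fluke

p_sig = power * p_real + alpha * (1 - p_real)       # overall rate of "significant" trials
p_fluke_given_sig = alpha * (1 - p_real) / p_sig    # Bayes: the question we actually care about

print(f"P(significant | fluke) = {alpha:.2f}")              # what the 0.05 threshold controls
print(f"P(fluke | significant) = {p_fluke_given_sig:.2f}")  # about 0.36 on these assumptions
```

On those assumptions, more than a third of "significant" results are flukes: a far cry from the one-in-20 that the threshold appears to promise.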

But before we can use Bayes's Theorem, its central lesson must be heeded: the "significance" of data cannot be judged in isolation; it must be weighed in the light of context, of prior knowledge and plausibility. In trying to free himself of Bayes's Theorem, Fisher had developed significance tests which had no means of taking plausibility into account. The awful consequences of this crucial difference between Bayes and Fisher started to emerge in the early 1960s, when a number of statisticians published comparisons of the results from significance tests based on each approach. In 1963, a team of statisticians led by the probability expert Leonard Savage at the University of Michigan showed that by failing to factor in plausibility, Fisher's P-values can easily boost the "significance" of results by a factor of 10 or more.

Publishing their findings in the prestigious journal Psychological Review, the team issued a warning that Fisher's techniques were "startlingly prone" to see significance in fluke results. It is a warning which has been repeated many times since. During the 1980s, Professor James Berger of Duke University published a series of papers again warning of the "astonishing" tendency of Fisher's P-values to exaggerate significance. In 1988, Berger warned: "Significant evidence at the 0.05 standard can actually arise when the data provide very little or no evidence in favour of an effect." Suddenly all those vanishing breakthroughs start to make sense again. Using Bayes's Theorem, Savage and others showed that at least a quarter of all claims said to be "significant" with P < 0.05 are actually meaningless flukes.
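The same point can be checked by brute force. The Monte Carlo sketch below uses invented parameters throughout (one tested effect in ten is real, the genuine effects are modest, 50 patients per arm); the exact fraction it reports depends entirely on those assumptions, but on figures like these it comfortably exceeds one in four.

```python
# Monte Carlo sketch: how many results that clear P < 0.05 are flukes?
# All parameters are invented for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_trials, p_real, effect, n = 100_000, 0.10, 0.4, 50   # simulated trials, prior, effect size (SDs), patients per arm

real = rng.random(n_trials) < p_real                    # which hypotheses are genuinely true
z = rng.normal(np.where(real, effect, 0.0) * np.sqrt(n / 2), 1.0)  # z-statistic of each two-arm trial
p = 2 * norm.sf(np.abs(z))                              # two-sided P-values

significant = p < 0.05
fluke_share = (significant & ~real).sum() / significant.sum()
print(f"'significant' results that are flukes: {fluke_share:.0%}")
```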

After decades of clinical trials, it is clear that most drugs for treating big killers such as heart disease have only a small impact on survival. So when a new wonder therapy such as nitrate therapy comes along claiming 50 per cent reductions in mortality, scepticism is clearly justified. Bayes's Theorem allows such scepticism to be taken into account. But Fisher's significance tests, lacking any way of factoring in plausibility, inevitably boost hopes too high-guaranteeing a very hard landing.

So why does such a fundamental deficiency in so basic a technique continue to be ignored? Part of the explanation lies in the same sentiments that led to the adoption of Fisher's methods in the first place. Over the last 70 years, the scientific community has come to terms with uncertainty in quantum physics, undecidability in mathematics and unpredictability in natural phenomena. Yet for most scientists subjectivity in assessing evidence remains beyond the pale.

Another key factor is that most working scientists know little and care less about statistics. Many of them could not even give you a decent definition of a P-value, much less explain Bayes's Theorem. What they can all parrot straight back at you, though, is Fisher's 70-year-old mantra about a P-value of less than 0.05 implying "significance." Every working scientist knows the importance of reaching that magic figure in the scramble to get research papers out-and research funding in.

And that raises an altogether more cynical explanation for the lack of action on significance tests: sticking with P-values makes it far easier to get publishable results. For scientists to abandon the textbook significance tests would be professional suicide. As long as the academic institutions and their journals accept P-values, there is no advantage to be gained in using Bayesian methods-except, of course, that the results are less likely to be rubbish.

To date, however, not a single academic institution has taken decisive action on the unreliability of conventional significance tests. In 1995, the British Psychological Society (BPS) set up a working party to consider replacing P-values with other ways of stating results. It got nowhere. "It just sort of petered out," said one senior BPS figure. "The view was that it would cause too much upheaval for the journals."

The BPS's sister organisation, the American Psychological Association, has also tried weaning researchers off P-values. In 1994, it issued guidelines "encouraging" researchers not simply to state P-values, but also to show how big the effect they had found actually was. A modest proposal, one might think. Yet studies of its impact show that it has made absolutely no difference: the journals are as full of results based solely on P-values as ever.

Dissatisfaction with traditional significance testing in medicine has produced some innovations. "Meta-analyses," which bring together the data from a very large number of studies, are one; another is the use of very large "simple" studies. Many leading medical journals have also accepted that P-values are pretty uninformative ways of stating results, and many medical papers now give results as so-called "95 per cent confidence intervals," which state both the size and likely range of the claimed effect. But despite their impressive name, confidence intervals suffer from the same flaw as P-values: they don't factor in plausibility.
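For concreteness, here is how such an interval is calculated, reusing the invented trial figures from the earlier sketch (200 patients per arm, 15 per cent versus 8 per cent survival); note that plausibility enters nowhere in the arithmetic.

```python
# A 95% confidence interval for the difference in survival rates, using the
# same invented trial sizes as the earlier sketch. It reports the size and
# likely range of the effect, but involves no judgment of prior plausibility.
from scipy.stats import norm

n_new, n_std = 200, 200
p_new, p_std = 30 / n_new, 16 / n_std       # 15% and 8% survival

diff = p_new - p_std
se = (p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std) ** 0.5
z = norm.ppf(0.975)                          # 1.96 for a 95% interval

print(f"risk difference {diff:.2f}, 95% CI ({diff - z * se:.2f} to {diff + z * se:.2f})")
```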

So how do scientists account for all the evidence for the failings in significance tests, the flip-flop studies and vanishing breakthroughs? One of the favourite explanations is over-reliance on small samples.

It is true that small samples are generally less reliable than larger ones, for all sorts of reasons. But the real problem with small samples is that they lack the statistical power to detect effects that are there, not that they detect effects that aren't there. With many health scares, however, it is small studies that have claimed to see "statistically significant" evidence which larger studies have failed to confirm. And you can't blame sample size for a small study finding a "significant" result that is later overturned by a larger study, because P-values take sample size into account.

A more plausible explanation is that the dodgy claims of big effects have been caused by what scientists call "bias and confounding." These are the famous bug-bears of medical research: selective publishing of positive studies, for example, can create a misleading impression of the real efficacy of some drug, while a failure to take into account factors such as diet or smoking can conjure up a link between pylons and cancer which is a correlation, not a cause.

There is no doubt that bias and confounding have the power to make a mockery of any research study, and Bayes's Theorem won't cure these problems overnight. This has prompted some defenders of standard significance tests to argue that there is no point bothering with Bayesian methods. It is an argument that makes as much sense as refusing to take aspirin for headaches because it does not cure all known diseases. For while Bayesian methods may not provide instant solutions to all the problems of difficult research, they do allow action to be taken right now on that far from trivial issue: plausibility.

Yet this pragmatic argument is one that the scientific community has eschewed for years, preferring instead to cling to the old ways like a comfort-blanket. Nobody claims that Bayesian methods are a panacea. But they are a whole lot better than the statistical fix that the scientific community has accepted for the last 70 years.