Clinical significance is not the same as statistical significance

A great example of why details and context always, always matter, from the surgeon/blogger at The Skeptical Scalpel:

Twelve patients who served as their own controls wore compression stockings for a week and then no stockings for a week alternating. The stockings lowered the amount of fluid in the neck by 60%, a statistically significant difference. So far, so good.

This resulted in another highly statistically significant finding, which was a 36% reduction in episodes of apnea [cessation of breathing] and hypopnea [inadequate breathing]. Sounds good, right? The problem is that the average number of episodes of apnea/hypopnea decreased from 48 per hour to 31 per hour. Patients experiencing more than 30 episodes of apnea/hypopnea per hour are classified as having severe obstructive sleep apnea. This means that the treatment only put the patients in the low range of severe obstructive sleep apnea. They still would require maximum therapy.

Via Ivan Oransky
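To make the quoted numbers concrete, here is a minimal sketch of the arithmetic. The cutoffs in `ahi_severity` are the commonly cited adult bands (under 5 normal, 5 to 15 mild, 15 to 30 moderate, 30 and above severe). They are an assumption added here, since the quote only mentions the 30-per-hour line, and both function names are made up for illustration.

```python
def ahi_severity(ahi: float) -> str:
    """Classify an apnea-hypopnea index (events per hour).

    Assumption: commonly cited adult cutoffs of 5, 15 and 30 events/hour.
    The quoted post only states the 30-per-hour threshold for "severe".
    """
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Group averages quoted in the post.
before, after = 48.0, 31.0

print(f"relative reduction: {relative_reduction(before, after):.0%}")
print(f"before: {ahi_severity(before)}, after: {ahi_severity(after)}")
```

Computed from the two averages, the drop comes out at roughly 35%, close to the reported 36% (the post does not say how that figure was calculated). Both averages land in the same "severe" band, which is the point the surgeon is making.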


  1. Ask yourself: if you had that choice, would you say, “No, the improvement isn’t enough–I think I’ll skip it”?

    I didn’t think you would. That’s the difference between being a skeptic and being a rational person.

    1. I think it depends on what the treatment was. If my symptoms improved only slightly but the treatment was very uncomfortable, I would be willing to forgo the treatment entirely in order to maintain comfort.

    2. Actually, that is what I’d say under those conditions – an AHI of 31 is still high enough you’d want to treat it with CPAP (nighttime breathing machines), so you might as well try just the CPAP and see how the patient responds.  If that doesn’t do a good enough job (e.g. only gets their AHI down to 10, or gets it down to 5 (i.e. “good enough”) but requires a blow-the-mask-off-your-face level of pressure), _then_ you might try using the stockings as well as the CPAP. 

      Now, if the tradeoff was stockings vs. jaw surgery, yeah, probably you’d go with stockings, but stockings vs. minor tonsil surgery you’d probably go with the surgery.  Or if stockings worked well enough that the tradeoff was stockings vs. CPAP, you’d decide which was less annoying – it might be really interesting if the patient had an untreated AHI of 10-15, for instance.

  2. Go Maggie!  Go Maggie!

    It is ironic that this summation came from a surgeon given their oft and increasing bad rap of cutting first and asking questions later only if necessary : )  That said, I would want this surgeon to be my surgeon.
    But you bring up a very important point that is all too often massaged or outright skirted in clinical research, and one that makes epidemiologists squirm and shake their heads: p values and null hypotheses.  At the risk of making a mess of it, I will leave this to be best described by Wikipedia (kudos to the authors, they do a better job than my epi professors did).

    p values are what separate the anecdotal from the accepted for publication, or the FDA fast lane.  Although, when desperate I would be one of the first who might say p values be damned!  Give it to me now!

  3. Love this.  Glad you put it up.  Consider another example: taking aspirin.  While taking aspirin reduces the risk of a heart attack by only 1 or 2%, the cost and effort are *so minimal* that there really is no reason why doctors should not be insisting that their patients take it.

    In this case, too, the clinical significance is extremely low, but the effect is statistically significant.  But the amount of effort required for the small benefit is negligible, and possible harm is almost nonexistent, so the treatment should be taken seriously.

    Aspirin is now routine after many surgeries to prevent DVTs, for the same reasons: no reason not to capitalize on this small but significant benefit.

  4. I can relate. I’ve had an article rejected from three journals because my result isn’t statistically significant. It has a larger effect size for the difference between groups than the only other study that I can compare it to. That study had a significant p value because it had over 39,000 participants, which is a massive sample in my field (psychology). The effect size of that study was .02, which is far from clinically significant for the study question. But because it is statistically significant, it can be published.

    1. …Which is a perfect example of a related can of worms, err… topic, called “publication bias.”

    2. I’m confused – if your results aren’t statistically significant, why do you think they’re ready to be published?   If one of your patients got turned into a newt, and the other one got better, that’s a clinically significant difference in outcomes, but if you don’t have enough statistical significance, you can’t tell if it’s because you treated their witchcraft problem or if one of them just got lucky. 

      Sometimes you’ve got enough ancillary data to make it a useful paper anyway, but it sounds like what you have is really more like the kind of documentation you need to get more grant money to study the problem with more patients, because your small trial has promising but preliminary results.

  5. It seems to me the author is being overly pedantic. While it may be technically true that “patients experiencing more than 30 episodes of apnea/hypopnea per hour are classified as having severe obstructive sleep apnea”, there is still a continuum; one person’s “severe” is another person’s “eh, I can live with it.” More important, while the stocking treatment decreased the average number of episodes by a given percent, that means certain patients would have experienced a larger decrease in episodes, putting them below the “severe” threshold. (I’m pretty sure this is what the first commenter on the Skeptical Scalpel blog said, in so many words.)

  6. I somehow doubt that the 30-per-hour cutoff is a magic number: 29 and 31 are probably not much different in terms of their health impact, and if we had 14 fingers the cutoff would probably be 28 in our base-14 number system. So a 36% reduction can only be a good thing, and whether or not it crosses some numerological threshold for clinical classification is probably not relevant to a patient’s health.

    It’s good to acknowledge that significance depends on context, and also on a cost/risk-benefit analysis, though. Many researchers get so caught up in looking for something publishable that they don’t (or can’t afford to) think about whether statistical significance is what actually matters.

  7. I work in the slightly murky world of epidemiology, where finding an effect size of a few percent is great (if it’s p<0.05!). For an individual, knowing that eating oily fish successfully reduces your chance of dementia by 2% would be neither here nor there, and would certainly not be of clinical significance. However, on a population basis that 2% could be thousands of people over a 10-year period. So my point really is that statistical significance may not mean clinical significance BUT might still be important at a population level.

  8. In medicine we also make a distinction between efficacy and effectiveness.

    The stocking demonstrated efficacy in reducing apneic episodes, but did not demonstrate effectiveness in curing the disorder or meaningfully reducing symptoms and making the patients well or much better.

    This is important because we often combine statistically efficacious treatments to get a net result that is effective.

    If the stockings used in combination with another treatment improved the outcome better than either treatment alone, it may still be a useful addition to a physician’s options for managing apnea. But that would require another study with a different design examining combined treatment effects.

  9. As a patient with severe sleep apnea (my AHI is over 70) I would like to point out that the average person without sleep apnea has an AHI of around 1.   Anything beyond that could create issues for that individual (no, it may not need a CPAP/APAP etc, but a bite guard could be enough).

    When my AHI goes above about a 3 (yes 3) I can usually notice it the next day, so this “30” is meaningless.

    One of the big issues here is that the study was limited in number and scope, and therefore probably statistically irrelevant.   For the record, most patients with sleep apnea above 30 have additional conditions such as obesity, high blood pressure, and heart-related issues, just to name a few.  All of these were excluded from the study.

  10. I don’t think this is how “clinical significance” is normally applied. The idea isn’t to come up with an arbitrary threshold to dismiss a statistically significant therapy after the fact, but rather a tool to help design the clinical trial in the first place.

    By making the trial bigger, you gain power to detect a statistically significant difference for treatments which have smaller effects. Since you want to be statistically significant (to publish, or market your treatment, or just to convince yourself it works) you should make the trial as large as possible, right? Well no: bigger trials cost more, in terms of money, time, and patients who don’t receive the new therapy while you’re still trying to figure out if it works. So you want the trial to be big enough, but not bigger.

    How big is big enough? That’s where you introduce clinical significance. It’s an arbitrary threshold for the lowest effectiveness you’d like to detect in doing the trial. You then plug the numbers through to figure out how big your trial needs to be.

    After the trial is over, you either reach statistical significance, and can then evaluate how effective your treatment is in practice, or you don’t, in which case you can at least be assured that in all likelihood any effect of the therapy is less than your predetermined minimal level of effectiveness. It helps avoid a limbo of, “Oh I’m still sure it works, I just need a few more patients to reach statistical significance.” Power your trial right, and you get a clear answer, one way or the other.

  11. I like Gerd Gigerenzer’s book Calculated Risk: How to Know When Numbers Deceive You, where he documents a bunch of studies that demonstrated people getting confused by stats. A big chunk of it is about doctors, who (surprise, surprise) aren’t statisticians. 

    This doctor is right that being statistically significant only means that the results he saw are unlikely to be due to chance alone, not that the changes were large. But he doesn’t seem to get that lowering the average number of apnea incidents for his patients may have moved some of the patients out of the severe classification. If even one patient is no longer in his classification, that is clinically significant too. (Assuming his classifications are worth anything – more fun with stats.)

  12. Surely it should be 31 plus or minus a confidence interval (95% or whatever is acceptable) and so the threshold of 30 should also have a c.i. Then if the two overlap, the incidence is not significantly above the target range, whereas the original (non-treated) incidence was significantly above this range. Hence a clinically significant result.
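
The sample-size reasoning in comments 4 and 10 can be sketched in a few lines. This is a rough illustration rather than anyone's actual trial design: it uses the textbook normal approximation for a two-sided comparison of two group means, and it assumes the effect size is a standardized mean difference (Cohen's d), which comment 4 does not actually specify. The function name `n_per_group` and the defaults of 5% significance and 80% power are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group to detect `effect_size`.

    Normal-approximation formula for a two-sided, two-sample comparison
    of means: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    `effect_size` is assumed to be a standardized difference (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller minimal effects of interest demand much larger trials.
for d in (0.5, 0.2, 0.02):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under these assumptions a medium effect (d = 0.5) needs a few dozen participants per group, while d = 0.02 needs tens of thousands. That is why a very large study can reach statistical significance on an effect too small to matter clinically, and why choosing the smallest clinically meaningful effect up front fixes how big the trial has to be.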
