Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test

https://doi.org/10.1016/j.jclinepi.2003.08.009

Abstract

Objective

Despite guidelines recommending the use of formal tests of interaction in subgroup analyses in clinical trials, inappropriate subgroup-specific analyses continue. Moreover, trials designed to detect overall treatment effects have limited power to detect treatment–subgroup interactions. This article quantifies the error rates associated with subgroup analyses.

Study design and setting

Simulations quantified the risks of misinterpreting subgroup analyses as evidence of differential subgroup effects and the limited power of the interaction test in trials designed to detect overall treatment effects.

Results

Although formal interaction tests performed as expected with respect to false positives, subgroup-specific tests were considerably less reliable: a significant effect in one subgroup only was observed in 7% to 64% of simulations, depending on trial characteristics. Regarding the power of the interaction test, a trial with 80% power for the overall effect had only 29% power to detect an interaction effect of the same magnitude. For interactions of this size to be detected with the same power as the overall effect, the sample size would need to be inflated fourfold, increasing dramatically for interactions smaller than 20% of the overall effect.

Conclusion

Although it is generally recognized that subgroup analyses can produce spurious results, the extent of the problem may be underestimated.

Introduction

Subgroup analyses in randomized controlled trials are common [1], [2]. Despite many easily accessible guidelines on their selection and analysis [3], [4], [5], key messages have not been universally accepted, and inappropriate analyses continue to occur [2], [6]. Although guidelines state that subgroup findings should be considered exploratory in nature and that only in exceptional circumstances should they affect the conclusions drawn from the trial, they are commonly over-emphasized [2]. This has potentially serious implications because erroneous identification of differential subgroup effects may lead to inappropriate provision or withholding of treatment [7], [8], [9], [10]. Ultimately this issue relates to individual patients, requiring a study design in which each patient receives various treatments [11]. The focus of this article is on common approaches to subgroup analyses whereby randomized trial data analyses are extended to investigate potential differential effects of a treatment across subgroups of patients.

One common analytical approach is to perform separate (stratified) analyses of the treatment effect within each subgroup. This approach has consequences in terms of multiple testing, increasing the risk of obtaining a false positive result (i.e., finding a statistically significant treatment effect in one or both subgroups when there are no true treatment effects). As regards failing to detect a true treatment effect within one or both subgroups (a false negative result), subgroup-specific analyses require the data to be subdivided into smaller data sets, each with reduced power to detect a similar treatment effect.
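The multiple-testing problem described above is easy to reproduce by simulation. The sketch below (an illustration, not the authors' simulation code; the sample size, number of simulations, and use of a t-test are arbitrary assumptions) generates trials with no true treatment effect in either of two subgroups and counts how often a separate subgroup-specific test is significant in exactly one subgroup:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def subgroup_pvalues(n_per_arm=100):
    """Simulate one trial with NO true treatment effect in either subgroup,
    then run a separate two-sample t-test within each subgroup."""
    pvals = []
    for _ in range(2):  # two subgroups
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(0.0, 1.0, n_per_arm)  # null: no effect
        pvals.append(stats.ttest_ind(treated, control).pvalue)
    return pvals

n_sims = 5000
exactly_one = sum(
    (p1 < 0.05) != (p2 < 0.05)
    for p1, p2 in (subgroup_pvalues() for _ in range(n_sims))
)
print(f"Significant in exactly one subgroup: {exactly_one / n_sims:.1%}")
# roughly 2 * 0.05 * 0.95 ≈ 9.5% under the null
```

Even under the global null, roughly one trial in ten shows the "effect in one subgroup only" pattern that is most easily misread as a differential subgroup effect.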

A test for interaction between treatment and subgroup is the appropriate way to examine whether treatment effects differ between subgroups [1], [3], [4], [12], [13], [14], [15]. This approach tests and estimates the difference between treatment effects across subgroups directly. It involves one statistical test irrespective of the number of subgroups, whereas subgroup-specific analyses involve two or more. Although the single interaction test partially overcomes the concerns of a false (positive) conclusion of a treatment–subgroup interaction, such tests are likely to be underpowered [1], [16]; a true differential treatment effect across subgroups may therefore be missed. Power calculations for randomized trials usually relate to the overall treatment effect rather than the interaction, and many published trials have insufficient power to detect even the overall effect [17]. Sample size determination for subgroup analyses is rarely performed, and there is relatively little literature to aid such calculations. The literature that is available tends to be theoretical and is likely to be of limited practical use to trialists, with most articles relating to the case-control study or gene–environment interactions [18], [19], [20], [21], [22], [23], [24], [25].
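The low power of the interaction test follows directly from first principles in the simplest case of two equal-sized subgroups: the interaction estimate is the difference between two subgroup-specific effects, so its standard error is twice that of the overall effect (fourfold variance). A short calculation (a sketch assuming a two-sided 5% test and a normal approximation) reproduces the power and sample-size figures quoted in the abstract:

```python
from scipy.stats import norm

alpha = 0.05
z_a = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 5% test
z_power = norm.ppf(0.80)        # 0.84 for 80% power

# Standardized overall effect sized so the trial has 80% power:
# delta / se_overall = z_a + z_power
signal = z_a + z_power          # ≈ 2.80

# With two equal subgroups, the interaction estimate (difference of the
# two subgroup effects) has twice the standard error of the overall
# effect, so the detectable signal is halved.
power_interaction = norm.cdf(signal / 2 - z_a)
print(f"Power for an interaction of the same magnitude: {power_interaction:.0%}")
# prints 29%, matching the figure reported in the abstract

# Restoring 80% power requires squaring the standard-error ratio:
inflation = 2 ** 2
print(f"Required sample size inflation: {inflation}x")
```

Halving the detectable signal drops power from 80% to about 29%, and because sample size scales with the square of the standard-error ratio, a fourfold inflation is needed to recover it.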

There are three main aims of this article: (1) to reinforce existing advice to conduct formal tests of interaction by quantifying the risks of misinterpreting subgroup-specific analyses as evidence of differential treatment effects across subgroups, (2) to quantify the power of the interaction test in trials sufficiently powered only to detect the overall treatment effect, and (3) to determine the appropriate sample size required to investigate such interactions reliably. This article summarizes a more detailed account of this research published as a Health Technology Assessment monograph [26].

Section snippets

Methods

Simulations were used rather than a theoretical approach to make the methods, and hence their interpretation, as transparent as possible for a general audience. Moreover, even if a theoretical approach had been taken, simulation would still have been highly desirable to confirm the results.
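The simulation approach can be sketched as follows for the continuous-outcome case: generate a trial with two arms and two subgroups, then apply the formal interaction test (here a z-test on the difference of the two subgroup effects; the cell sizes, effect parameters, and number of simulations are illustrative assumptions, not the authors' specification):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def interaction_pvalue(n_per_cell=50, effect=0.0, interaction=0.0):
    """One simulated trial: continuous outcome, two arms x two subgroups.
    Returns the p-value of the formal treatment-by-subgroup interaction
    test (z-test on the difference of the two subgroup effects)."""
    diffs, variances = [], []
    for sub in (0, 1):
        delta = effect + (interaction if sub == 1 else 0.0)
        control = rng.normal(0.0, 1.0, n_per_cell)
        treated = rng.normal(delta, 1.0, n_per_cell)
        diffs.append(treated.mean() - control.mean())
        variances.append(treated.var(ddof=1) / n_per_cell
                         + control.var(ddof=1) / n_per_cell)
    z = (diffs[1] - diffs[0]) / np.sqrt(variances[0] + variances[1])
    return 2 * stats.norm.sf(abs(z))

# With no true interaction, the formal test should reject at ~5%.
n_sims = 4000
rejections = sum(interaction_pvalue() < 0.05 for _ in range(n_sims))
print(f"Type I error of interaction test: {rejections / n_sims:.1%}")
```

Repeating this over a grid of sample sizes and effect specifications, and tabulating the rejection rates, yields estimates of both the false-positive rates and the power figures discussed below.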

Results

The findings for the continuous and binary data were similar, except for greater instability in the percentages at smaller sample sizes with binary data. Unless otherwise stated, the results for the continuous case are presented here. The percentages showed no systematic trend with increasing sample size, a steady state being reached at a total sample size of around 200 for all specifications (steady-state percentages are presented throughout this article).

The test for overall treatment effect

Discussion

We have quantified the dangers of conducting subgroup analyses by performing subgroup-specific tests. Among data sets simulated with no true subgroup effects (no treatment–subgroup interaction), the scenario with the greatest potential to be misinterpreted was a significant treatment effect in one subgroup only. If the overall finding were significant, the chance of this observation could be as high as 2 in 3. If the overall test result were nonsignificant, then the chance of just one

Acknowledgements

The authors thank Dr. Jonathan Sterne (Department of Social Medicine, University of Bristol, UK) and the reviewers for their very helpful comments on the manuscript. This work was supported by the NHS R&D Health Technology Assessment Programme, UK.

References (31)

  • Gruppo Italiano per lo Studio della Streptochinasi nell'Infarto Miocardico (GISSI). Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction. Lancet (1986)
  • Randomized trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet (1988)
  • R. Peto. Misleading subgroup analyses in GISSI. Am J Cardiol (1990)
  • S. Senn. Individual therapy: new dawn or false dawn? Drug Inf J (2001)
  • D. Moher et al. The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. JAMA (2001)