Original article
Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test
Introduction
Subgroup analyses in randomized controlled trials are common [1], [2]. Despite many easily accessible guidelines on their selection and analysis [3], [4], [5], key messages have not been universally accepted, and inappropriate analyses continue to occur [2], [6]. Although guidelines state that subgroup findings should be considered exploratory in nature and that only in exceptional circumstances should they affect the conclusions drawn from the trial, they are commonly over-emphasized [2]. This has potentially serious implications because erroneous identification of differential subgroup effects may lead to inappropriate provision or withholding of treatment [7], [8], [9], [10]. Ultimately this issue relates to individual patients, requiring a study design in which each patient receives various treatments [11]. The focus of this article is on common approaches to subgroup analyses whereby randomized trial data analyses are extended to investigate potential differential effects of a treatment across subgroups of patients.
One common analytical approach is to perform separate (stratified) analyses of the treatment effect within each subgroup. This approach has consequences in terms of multiple testing, increasing the risk of obtaining a false positive result (i.e., finding a statistically significant treatment effect in one or both subgroups when there are no true treatment effects). It also raises the risk of a false negative result (failing to detect a true treatment effect within one or both subgroups), because subgroup-specific analyses require the data to be subdivided into smaller data sets, each with reduced power to detect a treatment effect of the same size.
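Both risks can be illustrated with a short calculation. The sketch below is not taken from the article: it assumes independent tests (reasonable when subgroups contain different patients), a two-sided z-test at the 5% level, and a normal approximation. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the approximation


def familywise_false_positive_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests is significant
    when there is no true effect in any subgroup."""
    return 1.0 - (1.0 - alpha) ** n_tests


def power_two_sided_z(noncentrality: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test whose statistic has the given mean."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (_STD_NORMAL.cdf(noncentrality - z_crit)
            + _STD_NORMAL.cdf(-noncentrality - z_crit))


def subgroup_power(overall_power: float, fraction: float,
                   alpha: float = 0.05) -> float:
    """Approximate power of the same test inside a subgroup holding
    `fraction` of the patients, assuming the same true effect."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Mean of the overall z statistic that yields the requested power
    # (the negligible opposite tail is ignored at this step).
    overall_nc = z_crit + _STD_NORMAL.inv_cdf(overall_power)
    # The standard error grows as 1/sqrt(n), so the z mean shrinks by sqrt(fraction).
    return power_two_sided_z(overall_nc * fraction ** 0.5, alpha)


if __name__ == "__main__":
    print(familywise_false_positive_rate(2))   # two subgroup-specific tests
    print(subgroup_power(0.80, 0.5))           # each of two equal subgroups
```

Under these assumptions, two subgroup-specific tests at the 5% level give a 9.75% chance of at least one false positive, and a trial with 80% power overall has only about 50% power within each of two equal-size subgroups.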
A test for interaction between treatment and subgroup is the appropriate way to examine whether treatment effects differ between subgroups [1], [3], [4], [12], [13], [14], [15]. This approach tests and estimates the difference between treatment effects across subgroups directly. It involves one statistical test irrespective of the number of subgroups, whereas subgroup-specific analyses involve two or more. Although the single interaction test partially overcomes the concerns of a false (positive) conclusion of a treatment–subgroup interaction, such tests are likely to be underpowered [1], [16]; a true differential treatment effect across subgroups may therefore be missed. Power calculations for randomized trials usually relate to the overall treatment effect rather than the interaction, and many published trials have insufficient power to detect even the overall effect [17]. Sample size determination for subgroup analyses is rarely performed, and there is relatively little literature to aid such calculations. The literature that is available tends to be theoretical and is likely to be of limited practical use to trialists, with most articles relating to the case-control study or gene–environment interactions [18], [19], [20], [21], [22], [23], [24], [25].
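As a rough illustration of what such a test involves, the sketch below compares the treatment effects estimated in two disjoint subgroups with a single z-test, and approximates the power of that test in a trial sized only for the overall effect. It is a minimal sketch, not the article's method: the function names are hypothetical, and the power calculation assumes two equal-size subgroups, equal allocation, a common outcome variance, and an interaction as large as the overall treatment effect.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def interaction_z_test(effect_a: float, se_a: float,
                       effect_b: float, se_b: float):
    """Test whether the treatment effect differs between two disjoint
    subgroups. Returns (difference, standard error, z, two-sided p)."""
    diff = effect_a - effect_b
    # Disjoint subgroups give independent estimates, so variances add.
    se = sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se
    p = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))
    return diff, se, z, p


def interaction_power(overall_power: float, alpha: float = 0.05) -> float:
    """Approximate power to detect an interaction equal in size to the
    overall effect, under the assumptions listed in the lead-in."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    overall_nc = z_crit + _STD_NORMAL.inv_cdf(overall_power)
    # With two equal subgroups the interaction estimate has twice the
    # standard error of the overall effect, halving the z mean.
    nc = overall_nc / 2.0
    return _STD_NORMAL.cdf(nc - z_crit) + _STD_NORMAL.cdf(-nc - z_crit)


if __name__ == "__main__":
    print(interaction_z_test(4.0, 1.0, 0.0, 1.0))
    print(interaction_power(0.80))
```

Because the standard error doubles under these assumptions, detecting an interaction of the same size as the overall effect with the same power would need roughly four times the sample size.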
There are three main aims of this article: (1) to reinforce existing advice to conduct formal tests of interaction by quantifying the risks of misinterpreting subgroup-specific analyses as evidence of differential treatment effects across subgroups, (2) to quantify the power of the interaction test in trials sufficiently powered only to detect the overall treatment effect, and (3) to determine the appropriate sample size required to investigate such interactions reliably. This article summarizes a more detailed account of this research published as a Health Technology Assessment monograph [26].
Methods
Simulations were used rather than a theoretical approach to make the methods and hence interpretation as transparent as possible for a general audience. Moreover, if theory were used, then a simulation approach would be highly desirable for confirmation.
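A simulation of this general kind might look like the sketch below. It is an illustrative assumption, not the authors' simulation design: it uses a continuous outcome with known unit variance, two equal-size subgroups, equal allocation, z-tests at the 5% level, and hypothetical names and default parameters throughout.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% critical value


def z_stat(treated, control, sigma=1.0):
    """z statistic for a difference in means, treating sigma as known
    (a simplification chosen for this sketch)."""
    se = sigma * sqrt(1.0 / len(treated) + 1.0 / len(control))
    return (fmean(treated) - fmean(control)) / se


def simulate(n_total=200, effect=0.4, interaction=0.0, n_sims=2000, seed=1):
    """Return the proportion of simulated trials in which each test
    was significant. `interaction` is the difference in treatment
    effect between subgroups A and B."""
    rng = random.Random(seed)
    n_cell = n_total // 4  # patients per subgroup-by-arm cell
    counts = {"overall": 0, "interaction": 0, "one_subgroup_only": 0}
    for _ in range(n_sims):
        cells = {}
        for name, sub_effect in (("A", effect + interaction / 2.0),
                                 ("B", effect - interaction / 2.0)):
            control = [rng.gauss(0.0, 1.0) for _ in range(n_cell)]
            treated = [rng.gauss(sub_effect, 1.0) for _ in range(n_cell)]
            cells[name] = (treated, control)
        z_a = z_stat(*cells["A"])
        z_b = z_stat(*cells["B"])
        z_overall = z_stat(cells["A"][0] + cells["B"][0],
                           cells["A"][1] + cells["B"][1])
        # Equal cell sizes give both subgroup effects the same standard
        # error, so the interaction z is (z_a - z_b) / sqrt(2).
        z_int = (z_a - z_b) / sqrt(2.0)
        sig_a, sig_b = abs(z_a) > Z_CRIT, abs(z_b) > Z_CRIT
        counts["overall"] += abs(z_overall) > Z_CRIT
        counts["interaction"] += abs(z_int) > Z_CRIT
        counts["one_subgroup_only"] += sig_a != sig_b
    return {key: value / n_sims for key, value in counts.items()}


if __name__ == "__main__":
    print(simulate())  # true overall effect, no true interaction
```

The defaults (an effect of 0.4 standard deviations with 200 patients) were chosen so that the overall test has roughly 80% power under the normal approximation. Setting `interaction` to a nonzero value gives an estimate of the interaction test's power in the same trial.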
Results
The findings for the continuous and binary data were similar, except that the percentages were less stable at smaller sample sizes with binary data. Unless otherwise stated, the results for the continuous case are presented here. The percentages showed no pattern with increasing sample size, reaching a steady state at a total sample size of around 200 for all specifications (steady state percentages are presented throughout this article).
The test for overall treatment effect
Discussion
We have quantified the dangers of conducting subgroup analyses by performing subgroup-specific tests. Among data sets simulated with no true subgroup effects (no treatment–subgroup interaction), the scenario with the greatest potential to be misinterpreted was a significant treatment effect in one subgroup only. If the overall finding were significant, the chance of this observation could be as high as 2 in 3. If the overall test result were nonsignificant, then the chance of just one
Acknowledgements
The authors thank Dr. Jonathan Sterne (Department of Social Medicine, University of Bristol, UK) and the reviewers for their very helpful comments on the manuscript. This work was supported by the NHS R&D Health Technology Assessment Programme, UK.
References (31)
- et al. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet (2000)
- Subgroup analysis. Lancet (1988)
- et al. Prognostic significance of the extent of myocardial injury in acute myocardial infarction treated by streptokinase (the GISSI trial). Am J Cardiol (1989)
- et al. Design and methodological considerations in the National Cooperative Gallstone Study: A multicenter clinical trial. Control Clin Trials (1981)
- et al. Power calculations for detecting interaction in stratified 2x2 tables. Stat Prob Lett (1999)
- The use and abuse of subgroup analysis in epidemiological research. Prev Med (1987)
- et al. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med (1987)
- et al. A consumer's guide to subgroup analyses. Ann Intern Med (1992)
- Statistical principles for clinical trials ICH E9. Stat Med (1999)
- et al. On wisdom after the event. J Clin Epidemiol (1997)
- Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction. Lancet
- Randomized trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet
- Misleading subgroup analyses in GISSI. Am J Cardiol
- Individual therapy: new dawn or false dawn? Drug Inf J
- The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. JAMA