Comparison of aggregate and individual participant data approaches to meta-analysis of randomised trials: An observational study

Background
It remains unclear when standard systematic reviews and meta-analyses that rely on published aggregate data (AD) can provide robust clinical conclusions. We aimed to compare the results from a large cohort of systematic reviews and meta-analyses based on individual participant data (IPD) with meta-analyses of published AD, to establish when the latter are most likely to be reliable and when the IPD approach might be required.


Methods and findings
We used 18 cancer systematic reviews that included IPD meta-analyses: all of those completed and published by the Meta-analysis Group of the MRC Clinical Trials Unit from 1991 to 2010. We extracted or estimated hazard ratios (HRs) and standard errors (SEs) for survival from trial reports and compared these with IPD equivalents at both the trial and meta-analysis level. We also extracted or estimated the number of events. We used paired t tests to assess whether HRs and SEs from published AD differed on average from those from IPD. We assessed agreement, and whether this was associated with trial or meta-analysis characteristics, using the approach of Bland and Altman. The 18 systematic reviews comprised 238 unique trials or trial comparisons, including 37,082 participants. A HR and SE could be generated for 127 trials, representing 53% of the trials and approximately 79% of eligible participants. On average, trial HRs derived from published AD were slightly more in favour of the research interventions than those from IPD (HR_AD to HR_IPD ratio = 0.95, p = 0.007), but the limits of agreement show that for individual trials, the HRs could deviate substantially. These limits narrowed with an increasing number of participants (p < 0.001) or a greater number (p < 0.001) or proportion (p < 0.001) of events in the AD. On average, meta-analysis HRs from published AD slightly tended to favour the research interventions whether based on fixed-effect (HR_AD to HR_IPD ratio = 0.97, p = 0.088) or random-effects (HR_AD to HR_IPD ratio = 0.96, p = 0.044) models, but the limits of agreement show that for individual meta-analyses, agreement was much more variable. These limits tended to narrow with an increasing number (p = 0.077) or proportion of events (p = 0.11) in the AD. However, even when the information size of the AD was large, individual meta-analysis HRs could still differ from their IPD equivalents by a relative 10% in favour of the research intervention to 5% in favour of control. We utilised the results to construct a decision tree for assessing whether an AD meta-analysis includes sufficient information, and when estimates of effects are most likely to be reliable. A lack of power at the meta-analysis level may have prevented us from identifying additional factors associated with the reliability of AD meta-analyses, and we cannot be sure that our results are generalisable to all outcomes and effect measures.

Conclusions
In this study we found that HRs from published AD were most likely to agree with those from IPD when the information size was large. Based on these findings, we provide guidance for determining systematically when standard AD meta-analysis will likely generate robust clinical conclusions, and when the IPD approach will add considerable value.

Author summary
Why was this study done?
• Most standard systematic reviews and meta-analyses of the effects of interventions are based on aggregate data (AD) extracted from trial publications.
• It is not clear when such AD meta-analyses provide reliable estimates of intervention effects.
• It is also not clear when the collection of more detailed individual participant data (IPD) is needed.

What did the researchers do and find?
• Based on 18 cancer systematic reviews, we compared trial and meta-analysis results based on IPD with those based on AD.
• Results from AD were most likely to agree with those from IPD when the number of participants or events (absolute information size) and the proportion of participants or events available from the AD relative to the IPD (relative information size) were large.
• Based on findings from this study, we provide guidance on assessing when AD meta-analysis will likely lead to robust clinical conclusions, and when the IPD approach might add considerable value.

What do these findings mean?
• If the absolute information size is small, AD meta-analysis results will be unreliable, and there will be little value in collecting IPD unless it will lead to a considerable increase in information.
• If the absolute information size is sufficient, but the relative information size small, AD meta-analysis results will be unreliable, and more AD and/or IPD will be needed.

Introduction
It remains unclear when standard systematic reviews and meta-analyses of published aggregate data (AD) are reliable enough to form robust clinical conclusions, and consequently when the 'gold standard' individual participant data (IPD) approach might be required. Most standard reviews continue to rely on published AD [1,2], and if some eligible trials are unpublished, or reported trial analyses are based on a subset of participants or outcomes, then information may be limited, and AD meta-analyses will be at risk of reporting biases [3]. There are additional considerations for AD meta-analyses evaluating the effects of interventions on time-to-event outcomes, which are frequently based on hazard ratios (HRs), either derived directly from trial publications, or estimated indirectly from published statistics or from data extracted from Kaplan-Meier (KM) curves [4][5][6]. Inevitably, each of these methods requires progressively more, and stronger, assumptions, which, together with varying lengths of follow-up, could have repercussions for the reliability of the results. The collection of IPD can help circumvent publication and other reporting biases associated with AD, provided data on unpublished trials and all (or most) participants and outcomes are obtained, and, if relevant, follow-up is extended beyond the time point of the trial publication [7][8][9][10]. Also, IPD enable more complex or detailed analyses, such as the investigation of whether intervention effects vary by participant characteristics [11]. However, it remains unclear whether the IPD approach is always needed for the reliable evaluation of overall effects, and because these projects can take many years to complete, results may not be sufficiently timely. Moreover, the IPD approach may not be feasible, owing to the expertise and resources required [7,8] or to difficulties obtaining the necessary data. Hence, patients, clinicians, and policy makers will continue to rely on standard AD meta-analyses.
While some guidance is available to help reviewers gauge when AD might suffice and when IPD might add value [8,12], it is not backed by empirical evidence. A large systematic review of published AD versus IPD meta-analyses found that conclusions were often similar, but the comparisons could only be made on the basis of statistical significance [13]. For meta-analyses of published time-to-event outcomes, individual case studies have shown that they can produce effects that are larger than, smaller than, or similar to their IPD equivalents [14][15][16][17][18][19][20][21][22][23]. Bria et al. [24] compared effect estimates (HRs) from a cohort of AD meta-analyses with IPD equivalents and concluded that they gave very similar results. However, each AD meta-analysis had to include at least 90% of eligible participants and was compared to an IPD meta-analysis of the same set of trials, which may have minimised differences and is perhaps an unrealistic comparison of the 2 approaches. Moreover, both reviews [13,24] included multiple outcomes from the same meta-analyses, marring interpretation. Here, for a single outcome, we compare the results from a large cohort of cancer systematic reviews and meta-analyses based on IPD, with the best meta-analyses of published AD possible at the time these were completed, to establish when the latter are most likely to be reliable, and when the IPD approach might be required.

Methods
The study did not follow a protocol or pre-specified analysis plan. It is reported according to the STROBE checklist.

Data collection
We used a cohort of 18 cancer systematic reviews that included IPD meta-analyses: all of those completed and published by the Meta-analysis Group of the MRC Clinical Trials Unit at University College London over a 20-year period (1991 to 2010) [25][26][27][28][29][30][31][32][33][34][35][36], including updates where relevant. Each IPD review included a comprehensive search for all eligible trials, irrespective of publication status. Thus, at the time point each IPD meta-analysis was completed, we could ascertain which trials were published and include them in the related AD meta-analysis. This ensured that we were comparing each IPD meta-analysis with a meta-analysis of the published data available at that time. We used the corresponding publications for extraction of AD, and if a trial was reported in multiple publications, we used the one with the most up-to-date or complete information. Although a variety of research and control interventions were used, overall survival was the primary outcome in all of the meta-analyses, and the HR was the effect measure, so these are used as the basis for all our comparisons.
One author (JFT, SB, or DJF) independently extracted all data relevant to the derivation of the HR for the effect of treatment on overall survival and the associated standard error (SE) of its natural logarithm [4,6], and these data were cross-checked by another author. These data included reported HRs and SEs, confidence intervals and p-values, numbers of participants randomised and analysed, and numbers of events. If KM curves were available, we also extracted survival probabilities across a series of time intervals and the related numbers at risk [5,6], or the actual or estimated [4,6] minimum and maximum follow-up, to estimate HRs and SEs [4][5][6]. One author (JFT) reviewed all KM curve estimates to ensure a consistent approach to deciding the number and size of these intervals.

Estimating HRs from published AD
We estimated the HRs and SEs using all possible methods [4][5][6], but preferentially used estimates calculated directly from the reported observed and expected events or the hazard rates for the research intervention and control groups [4,6]. If this was not possible, we used HRs and SEs estimated indirectly using a published log-rank, Mantel-Haenszel, or Cox p-value, and either the associated confidence interval or the number of events, provided the confidence intervals and p-values were given to at least 2 significant figures [4]. Finally, in the absence of these statistics, we used HRs and SEs derived from KM curves [4,6]. This meant we used the best possible estimate of each trial HR.
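To make the indirect route concrete, the following is a minimal Python sketch of estimating a HR and SE from a reported two-sided log-rank p-value and the total number of events, one of the methods described in [4]; the function name, defaults, and example values are ours, and a real analysis should follow the full published methods.

```python
from math import exp, sqrt
from statistics import NormalDist

def hr_from_logrank_p(p_two_sided, total_events, favours_research=True,
                      prop_research=0.5):
    """Indirect HR and SE(log HR) from a two-sided log-rank p-value and
    the total number of events, given the allocation proportions."""
    # Variance of the log-rank (observed minus expected) statistic;
    # reduces to total_events / 4 under 1:1 allocation.
    V = total_events * prop_research * (1 - prop_research)
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    sign = -1.0 if favours_research else 1.0
    log_hr = sign * z / sqrt(V)
    return exp(log_hr), 1 / sqrt(V)

# Example: p = 0.04 with 200 events gives HR ~ 0.75, SE(log HR) ~ 0.14.
print(hr_from_logrank_p(0.04, 200))
```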
We matched each AD meta-analysis to the relevant IPD meta-analysis in terms of both the intervention comparisons and the analyses. Thus, if treatment effects were reported by participant subgroup, the subgroup HRs and SEs were combined using a fixed-effect inverse-variance meta-analysis to provide an appropriate AD estimate for the whole trial or treatment comparison. For a small number of 3-arm trials, we combined very similar treatment arms to provide a single estimate of treatment versus control. Whilst not best practice, this replicated the original analyses. For multi-arm trials with treatment comparisons that were eligible for different meta-analyses, or a single treatment comparison that was eligible for more than 1 meta-analysis, estimates for the individual comparisons were included as appropriate. However, trials or treatment comparisons were not used more than once in the trial-level comparisons of HRs from AD and IPD.

Statistical methods for comparing HRs from AD and IPD
We compared HRs and SEs derived from AD and IPD at both the trial level and the meta-analysis level. At the trial level, we included all trials with both an AD and an IPD result. The meta-analyses were based on all available published AD and all available IPD, thus representing the best possible AD and IPD estimates available at the time the IPD meta-analysis was published. The IPD meta-analysis estimates were derived from the original IPD projects using 2-stage fixed-effect inverse-variance models, with trial-level HRs and SEs derived using Cox regression. We also performed sensitivity analyses using the DerSimonian and Laird random-effects model [37][38][39].
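As an illustration only, here is a minimal Python sketch of the 2-stage pooling described here (and of the fixed-effect inverse-variance combination used for subgroup estimates above), assuming trial-level log HRs and SEs are already in hand; this is not the original analysis code.

```python
import math

def fixed_effect(log_hrs, ses):
    """Inverse-variance fixed-effect pooled log HR and its SE."""
    w = [1 / se ** 2 for se in ses]
    pooled = sum(wi * y for wi, y in zip(w, log_hrs)) / sum(w)
    return pooled, math.sqrt(1 / sum(w))

def dersimonian_laird(log_hrs, ses):
    """DerSimonian and Laird random-effects pooled log HR and its SE."""
    w = [1 / se ** 2 for se in ses]
    fe, _ = fixed_effect(log_hrs, ses)
    Q = sum(wi * (y - fe) ** 2 for wi, y in zip(w, log_hrs))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (Q - (len(log_hrs) - 1)) / c)  # between-trial variance
    w_re = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(w_re, log_hrs)) / sum(w_re)
    return pooled, math.sqrt(1 / sum(w_re))
```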
All data included in these analyses were aggregate in nature, whether derived from trial publications or from the original analyses of anonymised participant data, and therefore ethical approval was not required.
Estimates were compared on the log scale throughout, because the log HR is approximately normally distributed. However, we present the differences between log HRs from AD and IPD as back-transformed ratios of the AD HRs to the IPD HRs (i.e., the HR_AD to HR_IPD ratio). Differences between log SEs were also back-transformed, giving ratios that are always greater than 0 and are interpretable as relative percentage changes [40].
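For example, with hypothetical paired estimates (the numbers are ours, for illustration only), the back-transformations are:

```python
import math

# Hypothetical paired estimates for illustration only.
hr_ad, hr_ipd = 0.80, 0.85
ratio = math.exp(math.log(hr_ad) - math.log(hr_ipd))  # HR_AD/HR_IPD ~ 0.94

se_ad, se_ipd = 0.12, 0.10
pct = (math.exp(math.log(se_ad) - math.log(se_ipd)) - 1) * 100  # +20%
```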
We used paired t tests to assess whether (log) HRs and SEs from AD differed on average from their IPD equivalents, recognising that the statistical significance of these tests relates to the amount of data available. More pertinently, we assessed agreement between HR and SE estimates from AD and IPD using the approach of Bland and Altman [40][41][42]. This involves plotting the differences between the AD and IPD estimates against their average, along with 95% 'limits of agreement' (defined as mean ± 1.96 × standard deviation), which represent a range within which most differences are expected to lie. Wide limits suggest poor agreement, although note that they are not 95% confidence intervals and do not test a statistical hypothesis. At the trial level, we also used ANOVA to investigate whether the estimation method (direct, indirect, or KM curve) influenced the extent of agreement.
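A minimal sketch of this comparison on the log scale, reported as back-transformed ratios, might look as follows (assuming paired lists of trial log HRs; the function name is ours, and SciPy is used only for the paired t test):

```python
import math
from statistics import mean, stdev
from scipy.stats import ttest_rel

def compare_ad_ipd(log_hr_ad, log_hr_ipd):
    """Paired t test and Bland-Altman 95% limits of agreement for paired
    log HR estimates, reported as HR_AD to HR_IPD ratios."""
    diffs = [a - i for a, i in zip(log_hr_ad, log_hr_ipd)]
    d, sd = mean(diffs), stdev(diffs)
    p = ttest_rel(log_hr_ad, log_hr_ipd).pvalue
    return {"mean_ratio": math.exp(d),
            "limits_of_agreement": (math.exp(d - 1.96 * sd),
                                    math.exp(d + 1.96 * sd)),
            "paired_t_p": p}
```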
The Bland-Altman method also allowed us to examine whether agreement was associated with trial or meta-analysis characteristics. This involved plotting the differences between the AD and IPD log HRs against each characteristic and testing for a non-zero regression slope, both for the average agreement and for non-constant limits of agreement [40]. As described above, we initially plotted these differences against their averages, thus testing whether agreement improves or worsens with increasing size of the estimates [42]. We then went on to examine whether agreement was associated with the number of trials, participants, and events in the AD meta-analysis, as well as the proportion of trials, participants, and events in the AD meta-analysis relative to the IPD analysis. Regression slopes were reported as standardised beta coefficients.
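A sketch of this regression approach, following Bland and Altman [40]: a non-zero slope on the differences indicates that average agreement changes with the characteristic, and a non-zero slope on the absolute residuals indicates non-constant limits of agreement. The function below is our own minimal rendering, not the analysis code.

```python
import numpy as np
from scipy.stats import linregress

def agreement_vs_characteristic(diffs, x):
    """Regress AD-IPD log HR differences on a characteristic, then
    regress absolute residuals on it to test for non-constant limits."""
    diffs, x = np.asarray(diffs, float), np.asarray(x, float)
    fit_mean = linregress(x, diffs)            # average agreement
    resid = diffs - (fit_mean.intercept + fit_mean.slope * x)
    fit_width = linregress(x, np.abs(resid))   # width of the limits
    # 1.96 * sqrt(pi/2) scales fitted mean |residual| to a 95% half-width
    # under approximate normality of the residuals.
    half_width = 1.96 * np.sqrt(np.pi / 2) * (fit_width.intercept
                                              + fit_width.slope * x)
    return fit_mean, fit_width, half_width
```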
Subsequently, we also used sensitivity analyses to assess whether agreement at the meta-analysis level might be improved by excluding trials where the reported analyses were at potential risk of bias [43] from incomplete outcome data or had limited or imbalanced follow-up. Pre-specified criteria were mutually agreed and applied independently by 2 authors (DJF and SB, or DJF and JFT). We considered trials that excluded greater than 10% of participants overall, or that had a greater than 10% imbalance in participant exclusion by arm, to be at potential risk of bias from incomplete outcome data [44]. Trials in which more than half of participants were estimated to have been censored prior to what would be considered an appropriate follow-up time for the site and stage of cancer (Table 1) were considered to have insufficient follow-up. We classified these based on the reported KM curves and extracted or estimated levels of censoring. Note that only trials judged to be at low risk of bias in terms of randomisation sequence generation and allocation concealment (based on information supplied by investigators and checking of the IPD) were included in our IPD meta-analyses.
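As a concrete reading of the 10% thresholds, a hypothetical helper; the exact definition of the by-arm imbalance, as a difference in per-arm exclusion proportions, is our assumption rather than something stated in the paper.

```python
def at_risk_incomplete_outcome(n_rand_a, n_excl_a, n_rand_b, n_excl_b):
    """Flag a trial as at potential risk of bias from incomplete outcome
    data: >10% of participants excluded overall, or >10% imbalance in
    the proportion excluded between arms (imbalance definition assumed)."""
    overall = (n_excl_a + n_excl_b) / (n_rand_a + n_rand_b)
    imbalance = abs(n_excl_a / n_rand_a - n_excl_b / n_rand_b)
    return overall > 0.10 or imbalance > 0.10
```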

A decision tree for assessing the reliability of AD meta-analyses
We utilised these results to construct a decision tree for assessing when AD meta-analyses are most likely to be reliable, making it only as generalisable as the data allow.

Results

Feasibility of estimating HRs and associated SEs from published AD
The 18 systematic reviews included 243 trials, 5 of which were eligible for inclusion in 2 separate meta-analyses. Of the 238 unique trials, 33 (14%) were unpublished in any form, and 205 (86%) were published: 175 (74%) in peer-reviewed journals, 4 (2%) as book chapters, and 26 (11%) as abstracts in conference proceedings, with publication dates ranging from 1976 to 2005. HRs and SEs could be obtained or estimated from trial reports for 127 of the trials, representing 61% of published trials, 53% of all trials, and approximately 79% of eligible participants (Table 1). Of the remaining 78 trial reports, 49 (63%) did not include overall survival results (e.g., providing disease response or progression results instead) or presented survival results that could not be used to estimate a HR reliably (e.g., median survival [45] or survival rates); 8 (10%) included a KM curve, but with insufficient information to estimate censoring; 15 (19%) presented survival results, but not for the specific treatment comparison and/or data sample of interest; and 6 (8%) reports could not be accessed.
We obtained HR and SE estimates from IPD for 196 (82%) of the trials, representing 89% of randomised participants (Table 1). As well as being able to include trials that had not been published, and trials that had not been reported in sufficient detail, we were also able to obtain data for additional participants who had been excluded from published analyses, and additional events arising from updated follow-up.
The best available method for estimating HRs from published AD was direct extraction or calculation for 23 trials (18%), from a p-value for 31 trials (24%), and from a KM curve for 73 trials (57%; Table 1). For the SE, the best available method was direct extraction for 1 trial, from a confidence interval for 17 trials (13%), from the number of events for 58 trials (46%), and from a KM curve for 51 trials (40%). Where estimation from a KM curve was the best available method, the associated numbers at risk were reported for only 4 trials, so the minimum and maximum follow-up was used by default to estimate censoring [4].

Reliability of trial HRs and SEs estimated from published AD
Among the 114 trials with estimates available from both AD and IPD, trial HRs derived from AD were on average slightly more in favour of the research intervention than those from IPD (HR_AD to HR_IPD ratio = 0.95, 95% CI 0.92 to 0.99, paired t test p = 0.007). However, the wide Bland-Altman limits of agreement (Fig 1) show that for any individual trial, HRs derived from AD could deviate from those derived from IPD by around a relative 30% in favour of either the research (HR_AD to HR_IPD ratio = 0.67) or control intervention (HR_AD to HR_IPD ratio = 1.36). There was no clear evidence that agreement was associated with the size of effect (standardised β = +0.08, p = 0.39) or the estimation method (F statistic on 2 and 111 degrees of freedom = 0.26, p = 0.77; Fig 1). Also, there was no good evidence that agreement was related to the number (standardised β = +0.13, p = 0.17) or proportion (standardised β = −0.09, p = 0.36) of participants represented by the AD relative to the IPD, but the limits of agreement did narrow as the absolute number of participants increased (standardised β = −0.45, p < 0.001). Moreover, average agreement improved (standardised β = +0.30 and +0.25, p = 0.001 and p = 0.009, respectively), and the limits of agreement narrowed (standardised β = −0.44 and −0.31, p < 0.001 and p < 0.001, respectively), as the absolute number of events in the AD, and the number of events in the AD relative to the IPD, increased (Fig 2).

Individual trial SEs based on AD were larger than those based on IPD (average percentage change = +12%, 95% CI +8% to +16%, p < 0.001; Bland-Altman 95% limits of agreement = −20% to +57%), a difference that was more pronounced as the average SE increased (standardised β = +0.44, p < 0.001). After adjusting for this, agreement was also associated with a greater proportion of participants (standardised β = −0.15, p = 0.082) and a greater number or proportion of events (standardised β = −4.55 and −0.88, respectively; p < 0.001 for both) being included in the AD analysis relative to the IPD analysis.

Reliability of meta-analyses of HRs and SEs estimated from published AD
IPD were typically available for a high proportion of eligible trials (65% to 100%) and participants (75% to 100%; Table 1), with most including in excess of 85% of those eligible. While the AD meta-analyses tended to include a smaller proportion of eligible trials (33% to 83%; Table 1), often they still included a high proportion of eligible participants (42% to 96%; Table 1) relative to the IPD meta-analyses, but not necessarily such a high proportion of events (e.g., Sarcoma, Bladder 2, Ovary 5; Table 1). Many HRs from AD and IPD meta-analyses were very similar (Fig 3), and, on average, meta-analyses from published AD were only slightly more likely to favour research interventions than those from IPD, irrespective of whether a fixed-effect (HR_AD to HR_IPD ratio = 0.97, 95% CI 0.94 to 1.00, paired t test p = 0.087) or random-effects (HR_AD to HR_IPD ratio = 0.96, 95% CI 0.93 to 0.99, paired t test p = 0.043; Fig 4) model was used. However, the Bland-Altman 95% limits of agreement suggest that an individual (fixed-effect) AD meta-analysis could deviate by up to around a relative 15% in favour of the research intervention (HR_AD to HR_IPD ratio = 0.86) to 10% in favour of control (HR_AD to HR_IPD ratio = 1.10) (Fig 4A). Findings were very similar with the random-effects model (Bland-Altman 95% limits of agreement for the HR_AD to HR_IPD ratio = 0.84 to 1.11; Fig 4B).
Based on the fixed-effect model, there was no clear evidence that average agreement was associated with the average size of the HRs (standardised β = +0.06, p = 0.82; Fig 5A), the number (standardised β = −0.40, p = 0.099) or proportion (standardised β = −0.21, p = 0.40) of eligible trials (Fig 5A and 5B), or the number (standardised β = −0.23, p = 0.35) or proportion (standardised β = −0.29, p = 0.24) of eligible participants (Fig 5C and 5D). We also found no evidence that the limits of agreement narrowed when trials with published analyses at potential risk of bias from incomplete outcome data, or with limited or imbalanced follow-up, were excluded (Table 2). There was some evidence that the limits of agreement became narrower as the total number of events (standardised β = −0.42, p = 0.079; Fig 5E), and, less clearly, the proportion of events (standardised β = −0.39, p = 0.11; Fig 5F), in the AD relative to the IPD increased. However, even at the maximum proportion of events observed in this dataset (87% AD to IPD events), an AD meta-analysis might still differ from its IPD equivalent by around a relative 10% in favour of the research intervention (HR_AD to HR_IPD ratio = 0.90) to 5% in favour of control (HR_AD to HR_IPD ratio = 1.05). Statistical evidence for these associations was less clear under a random-effects model.
Meta-analysis SEs were consistently larger with AD than with IPD, by an average of around 30% (e.g., fixed-effect 95% CI 18% to 35%; fixed-effect and random-effects p < 0.001), with wide Bland-Altman limits of agreement (e.g., fixed-effect 95% limits of agreement −3% to +63%). Not surprisingly, agreement improved when a greater proportion of trials (standardised β = −0.63, p = 0.005), participants (standardised β = −0.89, p < 0.001), and events (standardised β = −0.99, p < 0.001) were included in the AD meta-analysis. These associations all remained significant under a random-effects model.

A decision tree for assessing the reliability of AD meta-analyses of HRs
Taking results at the trial and meta-analysis level together, HRs derived from published AD were most likely to concur with those from IPD when the overall number of participants or events ('absolute information size') was high, and also when the proportion of events included in the AD relative to the IPD ('relative information size') was high. Hence, ascertaining the absolute and relative information size of the available AD is a critical part of determining whether a meta-analysis of published HRs is sufficient for robust syntheses, and when IPD might be needed (Fig 6). Intuitively, establishing information size should also be a goal for AD meta-analyses of other outcomes and effect measures. For time-to-event outcomes and binary outcomes, information size will mostly relate to the number of participants and events, and for continuous outcomes, to the number of participants.

Fig 4. Comparison of meta-analysis HRs from AD versus IPD.
Bland-Altman plots showing how the ratio of the HR from AD to the HR from IPD, as estimated by fixed-effect (A) and random-effects models (B), respectively, varies with the average HR (i.e., the geometric mean of the 2 HR estimates). The red horizontal line represents no difference (i.e., a ratio of 1). The shaded area represents the 95% Bland-Altman limits of agreement. Dashed and dotted lines represent statistical precision around the average ratio and the limits of agreement, respectively. AD, aggregate data; HR, hazard ratio; IPD, individual participant data. https://doi.org/10.1371/journal.pmed.1003019.g004

Fig 5. Potential predictors of the extent of agreement between (fixed-effect) meta-analysis HRs from AD and IPD.
Bland-Altman plots showing how the ratio of the HR from AD to the HR from IPD varies according to the number of trials (A), participants (C), and events (E) available from AD, and the proportion of trials (B), patients (D), and events (F) available from AD relative to IPD. The red horizontal lines represent no difference (i.e., a ratio of 1). The shaded areas represent the 95% Bland-Altman limits of agreement, with fitted linear dependence upon the value of the covariate. Dashed and dotted lines represent statistical precision around the average ratios and the limits of agreement, respectively. AD, aggregate data; HR, hazard ratio; IPD, individual participant data.

The starting point for assessing the absolute information size is to establish the total number of eligible participants and, if relevant/possible, the number of events. For accuracy, this assessment needs to be based on all trials, whether published, unpublished, or ongoing, and the actual or projected accrual figures for each. If the absolute information size is small, an AD meta-analysis will lack power and be unreliable. Also, the collection of IPD will add little value unless it can bring about an increase in the number of participants or events (Fig 6).
If the absolute information size is deemed sufficient, but AD are only available for a small proportion of the eligible participants or the number of events is low, it follows that the relative information size will be small, and any AD estimate is likely to be unreliable. If further AD are not available, the collection of IPD could be very valuable in increasing the number of participants or events (Fig 6).
If the absolute information size is adequate, and AD are available for a large proportion of the eligible participants, and/or most events have already happened, the relative information size is likely to be large, and an AD meta-analysis is expected to be reliable. In this scenario, the collection of IPD would only be useful if an intervention effect has been detected and more detailed analyses are required.
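The logic of these three scenarios (Fig 6) can be summarised in a hypothetical sketch; the boolean inputs and the wording of the recommendations are our paraphrase of the decision tree, not a verbatim encoding of the figure.

```python
def ad_meta_analysis_guidance(absolute_info_sufficient,
                              relative_info_large,
                              effect_detected):
    """Paraphrase of the Fig 6 decision tree for a single outcome."""
    if not absolute_info_sufficient:
        return ("AD meta-analysis underpowered; collect IPD only if it "
                "would considerably increase participants or events")
    if not relative_info_large:
        return ("AD estimate likely unreliable; seek further AD, or IPD "
                "if it would considerably increase information")
    if effect_detected:
        return ("AD likely reliable; IPD mainly useful for subgroup or "
                "other detailed analyses")
    return "AD likely reliable; little justification for collecting IPD"
```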
Our results also suggest that there may still be uncertainty in the size and direction of effect, which could influence any decision to collect IPD. In particular, for time-to-event outcomes, we found that even if both the absolute and relative information size of an AD meta-analysis are large, an AD meta-analysis HR can still differ unpredictably from its IPD equivalent, by an approximate relative 10% in favour of the research interventions (HR_AD to HR_IPD ratio = 0.90) to 5% in favour of control (HR_AD to HR_IPD ratio = 1.05). By applying these limits to a plausible range of AD meta-analysis HRs (i.e., dividing them by 0.90 and 1.05), we can see how estimates might change when IPD are collected and what these would mean in absolute terms. This helps to gauge which observed HRs are most likely to be reliable (Table 3). For example, an observed HR ≤ 0.75 would translate mostly to sizeable potential IPD absolute benefits, and therefore a benefit is likely confirmed without the need for IPD (Table 3; Fig 6). For an observed AD meta-analysis HR of around 0.80 to 0.90, the potential IPD absolute effects would not necessarily be clinically worthwhile (Table 3). Hence, IPD might be needed to provide a greater degree of certainty about whether an effect exists, and its size and precision (Fig 6). Finally, with an observed AD meta-analysis HR ≥ 0.95, a lack of benefit is probably confirmed, and the collection of IPD would be difficult to justify (Table 3; Fig 6). Note that our example HR ranges purposefully leave gaps, reflecting regions where the reliability of AD and the need for IPD may be context-specific and harder to judge (Table 3).
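For illustration, a short sketch of applying these empirical limits to an observed AD HR, and translating the resulting range into absolute survival differences under proportional hazards; the function names and the baseline survival figure are hypothetical.

```python
def plausible_ipd_hrs(hr_ad, ratio_lo=0.90, ratio_hi=1.05):
    """Range of IPD HRs consistent with an observed AD meta-analysis HR,
    given HR_AD/HR_IPD limits of 0.90 to 1.05."""
    return hr_ad / ratio_hi, hr_ad / ratio_lo  # most to least favourable

def absolute_gain(hr, control_survival):
    """Absolute survival gain at a timepoint, using the proportional
    hazards relation S_research(t) = S_control(t) ** HR."""
    return control_survival ** hr - control_survival

lo, hi = plausible_ipd_hrs(0.85)  # ~0.81 to ~0.94
# With a hypothetical 30% control-arm survival, the plausible absolute
# gains span roughly +8 to +2 percentage points.
print(absolute_gain(lo, 0.30), absolute_gain(hi, 0.30))
```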

Discussion

Findings
We compared trial and meta-analysis HRs from published AD with those from IPD, and found they were most likely to agree when both the absolute and relative information size (number and proportion of events or participants) of the AD were large. However, the AD meta-analysis results could still differ from their IPD equivalents by up to a relative 10% in favour of the research interventions to 5% in favour of control. There was no clear evidence that agreement between meta-analysis HRs from AD and IPD was associated with the number or proportion of eligible trials, the number of participants included in the AD analyses, or the method of estimating the HR. Furthermore, agreement was not improved by excluding trials with reported analyses that were potentially at risk of bias from incomplete outcome data or that had insufficient follow-up. These results have been used to construct a decision tree for determining when an AD meta-analysis might be sufficiently reliable and when the IPD approach might be required (Fig 6).

Context
Our results support the assertion that in order for a meta-analysis to be reliable, the information size should be at least as large as an adequately powered trial [46]. Although there is greater interest now in estimating the (absolute) information size of meta-analyses [47][48][49][50][51][52], surprisingly little attention has been paid to explicitly quantifying the relative information size of an AD meta-analysis [48][49][50][51]. A comprehensive systematic review of published comparisons of AD and IPD meta-analyses did not find that agreement was associated with the information they contained (the number of trials or participants) [53], but without access to the primary studies, the authors could not investigate this more thoroughly, and, as stated previously, multiple outcomes from the same meta-analyses were included. However, the authors recommend that systematic reviewers conduct an AD meta-analysis first and carefully consider the potential benefits of an IPD meta-analysis [13], and our decision tree provides the means to do this. Unlike previous studies [4], there was no strong indication that HRs estimated indirectly from KM curves were systematically biased, at either the trial or meta-analysis level. In fact, some AD meta-analyses that relied heavily on HRs estimated from KM curves were very similar to their IPD equivalents. Thus, if other survival statistics cannot be obtained, we encourage reviewers to include HRs estimated carefully from KM curves [6]. Although alternative weighting approaches [54] and digital methods to extract data from KM curves [55] have emerged, they do not necessarily improve HR estimates [55]. However, a HR may not always be the most appropriate effect measure, for example, if there are non-proportional hazards within 1 or more trials in a meta-analysis. Non-proportionality of hazards can be readily checked with IPD and alternative effect measures used if desired (e.g., Wei et al. [56]), but such checks are also possible with AD [57], if 'IPD' can be reconstructed from published KM curves [55].

Strengths
To our knowledge, our study represents the largest systematic comparison of trial and meta-analysis HRs from AD and IPD, and is the first to reveal characteristics associated with the reliability of results based on published AD. Our findings are based on all cancer systematic reviews and meta-analyses of IPD conducted by the MRC Clinical Trials Unit at University College London over a 20-year period. By utilising a cohort of 18 reviews and 238 unique trials, we avoid the potential publication bias that might be associated with reviewing published comparisons of AD and IPD meta-analyses [13]. The sample is diverse in terms of the cancer and intervention types, number of trials and participants, availability of data, and mix of methods used to estimate the AD HRs (Table 1), which increases generalisability. From recent data [1], we estimate that approximately 1,200 oncology intervention reviews, which may be of variable quality, are published each year, so we expect our findings to be of widespread use. IPD were collected for over 80% of eligible trials and nearly 90% of eligible participants, and often included updated follow-up. Thus, the included IPD meta-analyses provide a true 'gold standard' with which to compare the HRs derived from AD.

Limitations
Our analyses may lack power at the meta-analysis level, which could have prevented us from identifying additional factors associated with the reliability of AD meta-analyses based on HRs. Also, we cannot be sure that results from a cohort of cancer systematic reviews are entirely generalisable to other healthcare areas and outcomes, although they do emphasise that information size should be considered alongside the direction, precision, and consistency of effects when appraising an AD meta-analysis. Only about half of the eligible trials were included in the AD meta-analyses, but these trials represented around 80% of participants, minimising the impact of selective outcome reporting bias [58] on our findings. However, we could only estimate a HR and SE for 61% of published eligible trials in our time window of 1991-2010, a situation that has likely improved since the publication of the CONSORT statement [59,60]. Thus, we would strongly encourage other custodians of multiple IPD meta-analyses to conduct similar comparisons and add to this body of evidence, particularly for other conditions, outcomes, and effect measures. In the meantime, it is worthwhile factoring a degree of uncertainty into the interpretation of any AD meta-analysis.

Implications
Once the absolute and relative information size of an AD meta-analysis have been ascertained, our decision tree can be used to systematically assess whether it will likely suffice or if IPD might be required (Fig 6). If the absolute information size indicates that a meta-analysis will be clearly underpowered to assess the primary research question, we do not recommend the collection of IPD unless it would lead to a considerable increase in information, for example, as a result of further follow-up of the included trials or reinstatement of participants that were excluded from the published analyses. If an AD meta-analysis likely has power but the relative information size is small, the meta-analysis results are more likely to be biased or otherwise unreliable, and the collection of further AD should be prioritised, for example, from trials that are unpublished or published in insufficient detail. If this is not feasible, but the collection of IPD could bring about a substantial increase in the amount of information, this is where the approach could add considerable value. If the absolute and relative information size of the AD are both large, the results of an AD meta-analysis are most likely reliable, so if there is no evidence of an effect, there is little justification for going to the trouble of collecting IPD. In contrast, if an effect has been detected based on AD, there may be motivation to collect IPD in order to conduct subgroup or other detailed analyses and provide more nuanced results. The absolute and relative information size are also useful for anticipating when accumulating evidence from trials might be sufficient for reliable AD meta-analysis, using a prospective framework for adaptive meta-analysis (FAME) [48][49][50][51].

Conclusions
In this study, we show how to determine systematically when standard AD meta-analysis will likely generate robust clinical conclusions, and when the IPD approach will add considerable value.
Supporting information

S1 Checklist. Completed STROBE checklist for the study.