Multiple Testing, Cut-Point Optimization, and Signs of Publication Bias in Prognostic FDG–PET Imaging Studies of Head and Neck and Lung Cancer: A Review and Meta-Analysis

Positron emission tomography (PET) imaging with 2-deoxy-2-[18F]-fluorodeoxyglucose (FDG) has been proposed as a prognostic marker in radiotherapy. Various uptake metrics and cut points have been used, potentially leading to inflated effect estimates. Here, we performed a meta-analysis and systematic review of the prognostic value of pretreatment FDG–PET in head and neck squamous cell carcinoma (HNSCC) and non-small cell lung cancer (NSCLC), with tests for publication bias. Hazard ratios (HRs) for overall survival (OS), disease-free survival (DFS), and local control were extracted or derived from the 57 included studies. Tests for publication bias were performed, and the number of statistical tests and cut-point optimizations was registered. Egger's regression relating SUVmax to OS/DFS yielded p = 0.08/p = 0.02 for HNSCC and p < 0.001/p = 0.014 for NSCLC. No outcome showed a significant correlation with SUVmax after adjusting for the publication bias effect, whereas all four outcomes showed a correlation in the conventional meta-analysis. The numbers of statistical tests and cut points were high, with no indication of improvement over time. Our analysis showed significant evidence of publication bias leading to inflated estimates of the prognostic value of SUVmax. We suggest that improved management of these complexities, including predefined statistical analysis plans, is critical for a reliable assessment of FDG–PET.


Introduction
Positron emission tomography (PET) offers a non-invasive method to assess functional biological characteristics of a tumor in an individual patient with cancer. A number of positron-emitting tracers have been developed to study various aspects of tumor biology [1][2][3][4][5][6][7]. However, clinical practice in cancer PET imaging is still dominated by very few tracers, with 2-deoxy-2-[18F]-fluorodeoxyglucose (FDG) being the clinical workhorse for most tumor sites. FDG-PET imaging is primarily used for staging purposes, as a supplement to anatomical images, but advances in the availability of PET imaging have led to an increased interest in the feasibility of PET-guided radiotherapy planning [8]. Several studies have investigated the prognostic value of FDG-PET, and dose escalation to PET-positive areas within the tumor is one of the potential strategies for increasing the effect of radiotherapy [9][10][11][12].
If the HR with CI was not stated in a report, one of two methods was used for its estimation. If the HR was given with a p-value, but without the CI, we assumed a normal distribution of the logarithm of the HR and estimated the CI by first finding the z-parameter of the normal distribution pertaining to the reported p-value. We then calculated the standard error of the ln(HR) estimate as SE_ln(HR) = ln(HR)/z. A number of studies did not report an HR, but only a p-value, together with the outcome at one or more specified points in time, in most cases in the form of a plot of Kaplan-Meier curves. In these cases, we estimated the HR from the relationship HR(t) = ln(p2(t))/ln(p1(t)), where p1(t) is the Kaplan-Meier estimate for low SUVmax at time t and p2(t) is the estimate for high SUVmax. When possible, we sampled the ratio at multiple time points, ranging from the first time an event occurred in both groups to the end of follow-up, and averaged the resulting HR(t) estimates. The CI was then calculated from the p-value as explained above. The methodology was previously described in more detail [18]. Variances were then calculated and used as study weights in the meta-analysis, using RevMan [17].
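To make these estimation steps concrete, the following is a minimal sketch in Python (NumPy/SciPy); this is not the code used for the present analysis, the function names are illustrative, and a two-sided p-value is assumed when converting to a z-score. It reconstructs the standard error and CI of ln(HR) from a reported p-value, approximates the HR from Kaplan-Meier survival estimates, and pools studies with inverse-variance weights.

import numpy as np
from scipy.stats import norm

def se_lnhr_from_pvalue(hr, p_value):
    # SE of ln(HR) from a reported HR and (assumed two-sided) p-value,
    # using the normal approximation: SE = ln(HR) / z
    z = norm.isf(p_value / 2.0)
    return abs(np.log(hr)) / z

def ci_from_hr_and_p(hr, p_value, level=0.95):
    # Reconstruct the confidence interval of the HR from the HR and its p-value
    se = se_lnhr_from_pvalue(hr, p_value)
    z_ci = norm.isf((1.0 - level) / 2.0)
    return np.exp(np.log(hr) - z_ci * se), np.exp(np.log(hr) + z_ci * se)

def hr_from_km(p_low, p_high):
    # HR(t) = ln(p_high(t)) / ln(p_low(t)), averaged over the sampled time points;
    # p_low/p_high are Kaplan-Meier survival estimates for the low- and
    # high-SUVmax groups read off the published curves at the same time points
    p_low, p_high = np.asarray(p_low, float), np.asarray(p_high, float)
    return float(np.mean(np.log(p_high) / np.log(p_low)))

def pooled_hr(ln_hr, se):
    # Fixed-effect, inverse-variance pooling of per-study ln(HR) estimates
    # (an approximation of the weighting scheme used in meta-analysis software)
    w = 1.0 / np.asarray(se) ** 2
    ln_pooled = np.sum(w * np.asarray(ln_hr)) / np.sum(w)
    return np.exp(ln_pooled), np.sqrt(1.0 / np.sum(w))

# Illustrative use: a study reporting HR = 2.1 with p = 0.03 but no CI
print(ci_from_hr_and_p(2.1, 0.03))   # roughly (1.07, 4.10)
# ...and a study reporting only Kaplan-Meier curves for the two SUVmax groups
print(hr_from_km(p_low=[0.90, 0.80, 0.72], p_high=[0.78, 0.62, 0.50]))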
Assessment of publication bias was performed visually, by ordering the studies in forest plots according to variance and by funnel plots. We did not systematically assess the risk of bias in the individual studies. The risk of bias across studies was assessed with Egger's regression as a formal test for publication bias [19]. We performed the quantitative assessment using Egger's method as follows. For each endpoint and comparison, a linear fit of log HR versus the standard error of the study estimate was performed to assess whether the effect sizes of the included studies depended on study precision. Formally, the regression equation was ln(HR_i) = α + β·SE_i + ε_i (Equation (1)), weighted by the inverse variance of each study. Here, α and β were the fitting parameters, SE_i was the standard error of ln(HR) of the i-th study, and ε_i was the residual error, assumed to follow a normal distribution. If β was different from zero at the 95% confidence level, it was concluded that the effect size estimates depended on the study precision, a clear indication of publication bias. The intercept α in Equation (1) is the extrapolation to zero SE and was used to estimate a publication-bias-corrected value of ln(HR).
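As an illustration, a weighted regression of this form can be fitted in a few lines. The sketch below (Python with statsmodels; an assumed implementation rather than the actual analysis workflow, with made-up numbers) fits ln(HR_i) = α + β·SE_i with weights 1/SE_i², tests β = 0, and reads off α as the publication-bias-corrected ln(HR).

import numpy as np
import statsmodels.api as sm

def egger_regression(ln_hr, se):
    # Weighted (inverse-variance) linear fit of ln(HR) on its standard error.
    # Returns alpha (extrapolation to SE -> 0, i.e. the bias-corrected ln(HR)),
    # beta (the slope), and the p-value for the test of beta != 0.
    X = sm.add_constant(np.asarray(se))                 # columns: intercept, SE
    fit = sm.WLS(np.asarray(ln_hr), X, weights=1.0 / np.asarray(se) ** 2).fit()
    alpha, beta = fit.params
    return alpha, beta, fit.pvalues[1]

# Illustrative data: small (high-SE) studies reporting larger effects
ln_hr = np.log([3.0, 2.4, 1.9, 1.5, 1.2])
se = np.array([0.60, 0.45, 0.30, 0.20, 0.10])
alpha, beta, p_beta = egger_regression(ln_hr, se)
print(f"bias-corrected HR = {np.exp(alpha):.2f}, slope p = {p_beta:.3f}")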
The number of statistical tests and the use of cut-point optimization were assessed independently by MMC and IRV, and disagreements were resolved by a consensus meeting. Only statistical tests relating an imaging metric to oncological outcome or baseline characteristics were counted. When the same question was addressed in both univariate and multivariate analysis, the corresponding p-value was only counted once. Similarly, it was only counted once if it was part of a rational model-building procedure, including forward or backward elimination. In cases where a large number of multivariate models with a functional imaging metric were examined outside of a model-building procedure, the multivariate tests were included in the count of statistical tests, e.g., Schwartz et al. [20].

Results
The study selection process for this analysis is presented in Figure 1. Of the 930 studies identified by the initial search, 57 were analyzable. A total of 178 full-text articles were screened, and 133 of these were excluded, as data were not assessable (no HR reported, univariate log-rank test without a p-value, or no cut-point for SUV). In other words, 25.3% of the screened full-text reports were included. Twelve studies were added from manual cross-referencing of articles, reviews, and browsing, for a total of 57 included studies: 27 studies in patients with HNSCC and 30 studies in patients with NSCLC.
Study characteristics are summarized in Tables 1 and 2 for HNSCC and NSCLC, respectively. The vast majority of studies were retrospective analyses: 20 studies of HNSCC (74%) and 26 studies of NSCLC (87%). The included studies comprised a total of 5102 patients: 1704 patients in the HNSCC group and 3398 in the NSCLC group. The median study size was 74. The NSCLC studies were generally larger, with a median study sample size of 95 patients compared to 58 in the HNSCC group. Twenty-seven studies did not perform MVA for DFS or OS, which were the primary endpoints of this analysis. A few studies reported no events until late in follow-up, which gave rise to additional uncertainty in the HR estimate [54,68]. One study reported no events in the low-uptake group, giving rise to an infinite HR estimate, and the study had to be excluded [77]. A single study was excluded due to problems with interpretation of the KM plots [78].
The patient cohorts in both the HNSCC and NSCLC groups were quite heterogeneous with respect to stage, treatment, and follow-up time. Figures 2 and 3 display the forest plots for SUVmax as a predictor of DFS and OS for HNSCC and NSCLC, respectively, with the studies ordered according to inverse variance from top to bottom. There was a trend for the HR to decrease with decreasing variance, and publication bias is therefore suspected. The HR estimate from the pooled analysis is shown by the diamond-shaped mark, and it favored low SUV. However, this should be interpreted with caution due to the suspicion of publication bias. The same trend was observed for LC (Supplemental Figure S1). The data entered in RevMan and the HRs for UVA and MVA for all studies are listed in Supplemental Table S1.
When Egger's regression was applied, neither OS nor DFS appeared to be significantly associated with FDG uptake in either HNSCC or NSCLC after correction for publication bias. The regression slopes were significantly greater than zero in three of the four cases: DFS for HNSCC (p = 0.02) and both OS (p < 0.001) and DFS (p = 0.014) for NSCLC. See the supplement for details and the associated plots (Supplemental Figure S2).
Figure 4 shows a plot of the number of patients, number of statistical tests, and number of cut-point optimizations against the year of publication. Unfortunately, there is little sign of a consistent improvement in study characteristics, i.e., the number of statistical tests, the number of cut-point analyses, and the size of the study population, over time.
For the HNSCC studies, there was a statistically significant increase in the number of cut-point optimizations over time (Spearman rho = 0.5, p = 0.02), while there was no significant increase in the number of patients in later studies, and the data were also consistent with no change in the number of tests performed (Figure 4A). Data for the NSCLC studies are shown in Figure 4B, where the Spearman rank correlation coefficients were consistent with no change over time in the number of patients, the number of cut-points, or the number of tests performed (p > 0.27 for all coefficients).
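For reference, the trend tests above are plain Spearman rank correlations between publication year and each study characteristic; a minimal sketch (Python/SciPy, with made-up illustrative numbers rather than the extracted study data) is:

from scipy.stats import spearmanr

# Hypothetical example: publication year and number of cut-point optimizations per study
years = [2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018]
n_cutpoints = [1, 1, 2, 1, 3, 2, 4, 3]

rho, p_value = spearmanr(years, n_cutpoints)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")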
Power calculations were not performed in any of the included studies, and only three studies mention or adjust for multiple testing [27,33,60].


Discussion
Publication bias is a well-known problem, enriching the literature with positive studies (whether true or false positives) that are not balanced by studies with negative findings, which are more likely to remain unpublished. This in turn inflates the effect size estimated for an intervention or the discriminatory power of a diagnostic test. The inflated effect sizes from individual studies carry over to a meta-analysis [79], thus reducing the value of the meta-analysis in evidence-based medicine. Indeed, our systematic analysis of prognostic studies of FDG uptake found statistically significant evidence of publication bias. Small studies are at particular risk of inflated effect-size bias [80], and it is thus a concern that the median study size was only 58 and 95 patients in the published HNSCC and NSCLC studies, respectively. The TRIPOD reporting guidelines [81] attempt to address the problem by requiring a sample-size justification in reporting, but this was not provided in any of the studies included here. Additionally, it might be argued that the general reporting guidelines of TRIPOD, albeit relevant, are not sufficiently specific for adequate reporting of image-based prognostic studies. In particular, it is an important concern that a large number of possible predictors can be extracted from a PET scan: SUVmax, SUVpeak, SUVmean, MTV, and TLG, to name a few. Multiple comparisons, post-hoc searches for positive associations, and scanning for 'optimal' cut-off values separating the low- and high-uptake groups all increase the risk of false positive findings [82], as also discussed by Vesselle et al. in the context of FDG prognostication [73].
A limitation of our study was that we did not have access to individual patient data, which led to the exclusion of some reports. Most of the included studies were conducted as retrospective studies (80.7%), without a predefined data analysis plan. While this may be defensible in the explorative setting, it increases the risk of overestimating the effect size if the cohort studies are not followed by controlled trials or studies with pre-specified protocols. In particular, with FDG, we would argue that we are beyond the exploratory phase and should perform larger studies with predefined protocols, to unequivocally establish the prognostic or predictive role of FDG uptake in cancers that are common in the two sites studied in the present work. With the high number of correlations that are testable in image-based prognostication, it appears prudent to require predefined research protocols and, perhaps, publication of raw data to allow independent validation of findings, regardless of the chosen cut-point or predictor. It is possible that a functional-imaging-specific extension to the TRIPOD or REMARK guidelines could be of use. When published studies perform tens of comparisons and multiple cut-point optimizations in datasets of fewer than 100 patients, without correction for multiple comparisons, the field is bound to be dominated by false or exaggerated correlations, which will ultimately harm patients if applied in clinical decision making and harm a promising field of research by misusing resources.
It is a substantial challenge to accommodate cross-study synthesis of data in a meta-analysis while at the same time allowing the individual authors to handle the coding of image metrics appropriately in their own study. Decisions to use continuous coding of SUV, logarithmic transformation, or a limited number of cut-points are all fair (if performed correctly), but they hamper the ability to perform a meta-analysis. It appears to us that the complexity of these analyses implies that publication of the raw modeling data is a necessity for a meaningful synthesis of data. We believe that the observations of the current study imply that such a synthesis is necessary for real progress.

Conclusions
Functional imaging with FDG or other tracers remains a promising tool for prognostication, prediction, and treatment selection for cancer patients. However, the current study points to issues limiting the interpretation, including inadequate sample sizes, lack of predefined analysis plans, lack of correction for multiple testing, and post-hoc cut-point optimizations. These issues result in a high risk of inflated effect sizes or false positive correlations that must be addressed to avoid leading the field astray.