An Adaptive Thresholding Method for BTV Estimation Incorporating PET Reconstruction Parameters: A Multicenter Study of the Robustness and the Reliability

Objective. The aim of this work was to assess robustness and reliability of an adaptive thresholding algorithm for the biological target volume estimation incorporating reconstruction parameters. Method. In a multicenter study, a phantom with spheres of different diameters (6.5–57.4 mm) was filled with 18F-FDG at different target-to-background ratios (TBR: 2.5–70) and scanned for different acquisition periods (2–5 min). Image reconstruction algorithms were used varying number of iterations and postreconstruction transaxial smoothing. Optimal thresholds (TS) for volume estimation were determined as percentage of the maximum intensity in the cross section area of the spheres. Multiple regression techniques were used to identify relevant predictors of TS. Results. The goodness of the model fit was high (R 2: 0.74–0.92). TBR was the most significant predictor of TS. For all scanners, except the Gemini scanners, FWHM was an independent predictor of TS. Significant differences were observed between scanners of different models, but not between different scanners of the same model. The shrinkage on cross validation was small and indicative of excellent reliability of model estimation. Conclusions. Incorporation of postreconstruction filtering FWHM in an adaptive thresholding algorithm for the BTV estimation allows obtaining a robust and reliable method to be applied to a variety of different scanners, without scanner-specific individual calibration.


Introduction
In the last years the coregistration of 18 F-fluorodeoxyglucose positron emission tomography ( 18 F-FDG PET) images with computed tomography (CT) images has gained an increasing interest in the staging and treatment planning for radiotherapy of several tumor sites. However, a standardized way of converting PET signals into target volumes is not yet available [1].
New semiautomatic and automatic segmentation methods have been developed implying gradient, region growing, clustering, statistical methods and other approaches [2][3][4][5][6][7][8][9]. While referring to these promising methods, it should be pointed out that some of these new methods suffer from the need of extensive preprocessing of the images (e.g., edgedetection). Moreover, the majority of these new algorithms are not widely available, so their use is currently restricted to  the developers and, as a consequence, they are not independently validated. Apart from visual inspection of PET scans, which suffers from interobserver variability [10], thresholding methods are widely used as PET segmentation approach in clinical practice for biological target volume (BTV) delineation for radiotherapy planning. Adaptive thresholding methods based on contrast-oriented contouring algorithms have been developed independently by many groups and validated in patient data in head and neck, in lung cancer, and in lymph nodes [11][12][13]. These methods are based on phantom measurements to derive a relationship between the "true" volume and the threshold to be applied to PET images.
These threshold-volume curves for one PET/CT scanner have been previously obtained varying target-to-background ratio (TBR), target dimensions, and postreconstruction smoothing [14]. It has been previously demonstrated that the emission scan duration (ESD) and background activity concentration, related to the level of image noise, are not predictors of the thresholding level of PET images [15]. Moreover, adaptive-threshold segmentation algorithms are not influenced by the different conditions of attenuation and scatter which may be encountered in different anatomical districts [16] and by the degree of convergence of iterative reconstruction algorithms [14].
Although adaptive thresholding methods are applicable to every PET scanner, it is generally assumed that the values of the parameters obtained during model building are system dependent so that a specific calibration for each PET system is required. On the other hand, a less hardwaredependent solution to the problem of PET segmentation could provide a robust algorithm, easily usable with images acquired by different scanner models without needing any previous optimization of the individual image quality.
Our hypothesis is that this goal can be accomplished by incorporating in our algorithm the reconstruction parameters that impact on threshold determination. To validate this hypothesis we firstly developed an original method to adapt the thresholds (TS) used to estimate the BTV in PET images. The proposed method incorporates the PET reconstruction parameters that influence the threshold determination. Secondly, we investigated in a multicenter trial the robustness of this method with respect to various scanner models, reconstruction settings, and acquisition conditions: a multivariable approach was adopted to study the dependence of the TS that define the boundaries of 18 F-FDG uptake on object characteristics (contrast, size), acquisition parameters (scan duration), and reconstruction modalities (reconstruction algorithm, number of iterations, and amount of postreconstruction smoothing) in eleven state-of-the-art PET/CT scanners installed in eight different institutions. Finally, we assessed the reliability of the regression models through the use of split-sample analysis.

Phantoms.
Measurements were performed on the NEMA IEC Body Phantom Set (Data Spectrum Corporation, Hillsborough, NC). This phantom contains 6 coplanar spheres, with internal diameters (ID) of 10, 13, 17, 22, 28, and 37 mm. A supplemental set of 2 microhollow spheres of 6.5 and 8.1 mm ID and 1 sphere of 57.4 mm ID were positioned at the bottom of the phantom. The experimental setup is depicted in Figure 1, together with sphere IDs (mm), maximum cross section areas ( ) (mm 2 ), and volumes (mL). The same positioning of the phantom was ensured through laser localizer and a scout CT acquisition.

Phantom Acquisition.
The background of the IEC phantom was filled with 3 kBq/mL activity concentration of 18 F-FDG. A standard protocol was designed to generate the following acquisitions for each scanner model.  1), determined by the dose calibrator and dilution, were imaged in different acquisition sessions. The measured TBRs were determined in the reconstructed image as the maximum pixel intensity in a region of interest (ROI) encircling the cross sectional area of the target, divided by the average pixel intensity of ROIs surrounding the sphere. These TBRs ranged from 70 down to 2.5 and were within the full range observed in patients.
(b) Four different ESD (2, 3, 4, and 5 min) were acquired to provide independent replicates of the experiments.

Discovery ST, Biograph HI-REZ, and Biograph TRUEV.
These systems use a 2D Fourier-rebinning (FORE) ordered subset expectation maximization (OSEM) algorithm with all corrections (scatter, random, dead time, attenuation, and normalization) incorporated into the iterative reconstruction scheme. In these systems the user can independently specify the number of iterations and subsets and the amount of the transaxial postreconstruction Gaussian smoothing, through the filter full-width-at-half-maximum (FWHM) expressed in mm.

Discovery STE and Discovery 600.
The D-600 system uses a fully 3D-OSEM algorithm with all corrections incorporated into the iterative reconstruction scheme. The reconstruction settings are the same as above with the only difference that the axial filter is a mean filter with available kernels of 1 : 2 : 1, 1 : 4 : 1, and 1 : 6 : 1.

Discovery 690.
The D-690 system uses a fully 3D-OSEM algorithm with all corrections incorporated into the iterative reconstruction scheme. Furthermore, new reconstruction algorithms are available on the D-690, which add to the standard configuration the time of flight information (TOF) and/or a 3D model of the D-690 PET point spread function (PSF). The activation of TOF and/or PSF does not require the setting of any new parameter compared to those used with the 3D-OSEM algorithm (number of subsets, number of iterations, reconstructed field of view (FOV), image matrix, and axial and transaxial postfilters). In this study, both TOF and PSF information were included in the reconstruction scheme.

Gemini XL.
This system uses a fully 3D lineof-response (LOR) based iterative reconstruction algorithm named row-action maximum likelihood algorithm (RAMLA) [24]. The number of iterations is fixed (2 iterations and 33 subsets) and the reconstruction protocols contain one modifiable parameter that can be set to adjust the quality of the images as normal, smooth, or sharp.

Gemini TF.
This system uses the TOF maximum likelihood expectation-maximization reconstruction algorithm (TF-MLEM) [24]. The reconstruction protocols contain three modifiable parameters that can be set to adjust the quality of the images: the first is the number of iterations (3 iterations and 20 subsets or 3 iterations and 33 subsets); the second is a so called relaxation parameter that can be set between 1, 0.7 and 0.5 and controls the magnitude of change that each iteration makes to the image. A third parameter, the kernel width of the TOF, can be set by the user at two levels (Gemini TF manual) [25].
The type of reconstruction algorithm, the degree of the convergence of the iterative algorithm, and the amount of the postreconstruction smoothing applied on images were varied starting from the clinical acquisition protocols used in each institution for radiotherapy planning. Overall, in each scanner, the maximum of theoretical independent combinations of acquisition parameters available for the subsequent model fitting were 9 sphere × 9 TBR × 4 ESD = 324. The 8.1 and 6.5 mm spheres was not always included in the analysis because they were not clearly visible in all the phantom acquisitions. The number of reconstruction modalities available for model fitting depends on the scanner capabilities: the details of the reconstruction parameters together with the voxel size of the reconstructed images and the number of data points that is actually available for model fitting in each scanner are shown in Table 2.

Image
Analysis. TS were determined as a percentage of the maximum intensity in the cross section area of the spheres. Target cross sections of area were selected in the middle of the spheres, which constitutes the largest cross section of the sphere. The values of TS were entirely based on the apparent activity concentration in the images and not on the known activities actually placed in the spheres. To find the TS value that yielded an area best matching the true value, the cross sections were autocontoured in the attenuation corrected slices varying TS in step of 1%, until the area so determined differed by less than 10 mm 2 versus its known physical value.
The analysis was performed by means of an automatic routine, EyeLite RT v.1.1 (G-Squared, Vicenza, Italy), to avoid the influence of the operator in ROIs dimensioning and to minimize the influence of the operator in the ROIs positioning. The operator placed six 17 mm-diameter ROIs in the background area surrounding the spheres. The mean intensity of these 6 ROIs was used as a background value (BG). ROI analyses were performed only for visually detectable spheres: this accounted for the discrepancy between theoretical and experimental data points collected for each scanner.

Statistical Analysis.
For each combination of EMequivalent iteration number ( ) and ESD ( ), the following variables were evaluated: 1 defined as target cross section , 2 defined as 1 − 1/TBR, and 3 defined as the FWHM. Multiple linear regression analysis was performed in order to define the relationship between the best TS (TS ) (providing the most accurate sphere cross sectional area) and 1 , 2 , and 3 . The multiple regression model used for the fit was Computational and Mathematical Methods in Medicine 5    [14,15]. However, nonlinear objective functions or even indicator functions according to the different parameters range could be investigated for fitting TS. Additional variables, reflecting the characteristics of the reconstruction protocol of each considered scanner, were inserted in the model as independent predictors. Axial smoothing was considered for the Discovery 690 and voxel dimensions were inserted for the Biograph Hi-REZ, while relaxation parameter and TOF kernel width were accounted for in the Gemini and in the Gemini TF, respectively.
Stepwise forward selection was used as a strategy for selecting the variables and statistic was used as a criterion for selecting a model. Goodness of fit for each regression model was expressed using the adjusted coefficient of determination ( 2 ). Goodness of fit was reported at each stage of model building as partial 2 . The criteria for retaining a variable in a model were > 4 and an increment of at least 0.01 in the 2 in order to be cautious in including redundant variables into the models. The weight of the different independent variables in explaining TS was quantified by means of standardized regression coefficients .
The reliability of the regression models was assessed through split-sample analysis [26]. Using this methodology, all observations in each scanner model were randomly assigned to one of two groups, the training group or the holdout group. The regression models were derived using the training group and the sample squared multiple correlation 2 was obtained. Then the prediction equation for the training group was used to compute predicted values for the holdout group. Finally, the univariate correlation 2 * (cross validation correlation) was obtained between these predicted values and the observed responses in the holdout group. The reliability of the regression models was expressed by using the shrinkage on cross validation coefficients 2 − 2 * . As a criterion, shrinkage values of less than 0.10 were considered as indicative of a reliable model.
In order to compare separate multiple regressions of TS as a function of the independent variables for two different scanners, an additional dummy variable, coding for each scanner, was inserted in the model. A regression model was then built by pooling the data coming from the two scanners and inserting this dummy variable as a predictor. The criteria for retaining this variable in the model were the ones specified above.
Statistical analysis was performed using the software Statistica 6.0 (Statsoft Inc., Tulsa OK). Figure 2 shows the plots of averaged TS versus cross sectional areas for a coarse grouping of TBRs for each scanner model.

Multiple Linear Regression.
The TS versus predictor variables plot was fitted only for cross sectional area > 133 mm 2 that is in the range of clinically relevant volumes comprised between 1 and 100 mL. The cross sectional area of 133 mm 2 (that corresponds to a sphere ID of 13 mm and approximately the twofold FWHM of the scanners) was selected as a separator of the data due to the resolution characteristics of the scanners.
Following (1), the regression equations that best summarize the results obtained in a multiple regression model with TS as the predicted variable are reported for each scanner in Table 3 together with the values of the corresponding parameters 0 -3 . In the third column of Table 3 the multiple-2 of model fitting are reported, while the last column shows the ranking of the independent predictors together with the standardized regression coefficients and the amount of TS variance explained by each predictor.
The emission scan duration and the degree of the convergence of the iterative algorithm were never significant predictors of TS. This provided a confirmation of previously reported findings. Also the axial smoothing in the Discovery 690, the voxel size in the Biograph Hi-REZ, the relaxation in the Gemini scanners, and the Kernel width of the time of flight correction in the Gemini TF were not significant predictors of TS.
The goodness of the model fit, assessed by the coefficient of determination 2 , was high, ranging from a minimum of 0.74 for the Discovery 690 to a maximum of 0.92 for both the Discovery 600 and the Biograph Hi-REZ. The most relevant variable for TS prediction was TBR with a partial 2 accounting for 74% to 91% of TS variability. In the case of the Discovery 690, TBR only accounted for 40% of TS variability, although it remains the best individual predictor. Second came the amount of smoothing in the transaxial plane (FWHM) that showed an additional 2 roughly explaining from 1 to 5% of TS variability. The only exceptions were the Gemini scanners, where this parameter cannot be varied by the user, and the Discovery 690, where its contribution is significantly increased to 29% of TS variability. Last came the lesion size ( ) that played an independent role only in the Discovery 690 and in the Gemini scanners accounting for 5%-8% of TS variability.
The comparison of the regression lines obtained from two scanners of the same model (Discovery ST, Discovery 600, and Gemini TF) did not evidence any relevant difference. The test of the hypothesis of coincident regression lines for the two DST scanners provided an 3,2145 = 0.73 ( = 0.53). This statistics is small ( is large), so we do not reject 0 and therefore have no statistical basis for believing that the two lines are not coincident. The details of the test are reported in Table 4. Moreover, the additional 2 of the "scanner" dummy variable as predictor of TS variance was below <0.001. Similar results were found for the D600 and GTF scanners (not shown). Accordingly, the results of the regression analysis obtained by pooling all the measurements from scanners of the same model are reported in Table 3.

Regression Model Reliability.
The results of the reliability study on regression models are reported in Table 3. The shrinkage on cross validation was always below 0.07, which is quite small and indicative of an excellent reliability of estimation. An important aspect related to assessing the reliability of a model involves considering difference score of the form TS observed − TS predicted , where only holdout cases are used and when the training sample equation is used to compute the predicted values. The "unstandardized residuals" can be subjected to various residual analyses. The most helpful entails univariate descriptive statistics such as the box and whiskers plots depicted in Figure 3. In our case a few large residuals are present, but they are neither sufficiently implausible nor influential to require further investigations. In our investigation, we derived the calibration curve for eleven PET scanners (eight models, three manufacturers, and eight sites) to apply the adaptive-threshold algorithm for PET-based contouring. The eight scanner types investigated in this study differ in scintillation crystal, scanner electronics, and reconstruction methodologies. Methods of retrospective image resolution recovery such as PSF-reconstruction or TOF measurements were also characterized in the present study.
At present, there is considerable variability in the way standard PET/CT scans are performed in different centers [27,28]. Thus, there was no chance for a multicenter standardization of all scanners and all imaging protocols in use. Instead, we chose to directly incorporate in our adaptive thresholding algorithm the reconstruction parameters that can be selected by the user and that are relevant for TS determination. This should increase the robustness of the proposed method by avoiding the need to perform individual calibrations in each center of the algorithm. Noteworthy, the comparison of the regression lines obtained from two scanners of the same model did not evidence any relevant difference, at least for the three scanner models tested. This brings another relevant consequence; that is, with the incorporation of the reconstruction parameters in the regression models the calibration curve in a specific scanner model need not to be obtained at each site. Instead, it can be derived once and applied irrespectively of the specific scanner being utilized provided that is of the same model.

Comparison with Previous Published Papers.
To the best of our knowledge only three studies have been published so far on the integration of PET/CT scans from different hospitals into radiotherapy treatment planning. In the first study,Öllers et al. [29] used a TBR algorithm to evaluate head-and-neck tumors. To this purpose only small spheres of volumes ranging from 2 to 16 mL (i.e., sphere ID less than 3 cm) were used. TBRs, as determined by the dose calibrators, ranged from 2 to 12. The authors performed phantom measurements on three scanners of the same manufacturer (Biograph Accel, Siemens) equipped with Pico 3D (2 scanner) or standard (1 scanner) detector electronics. Identical acquisition and reconstruction protocols were used. To study the effect of different reconstruction parameters on the results PET raw data were reconstructed varying the number of iterations (IT from 2 to 64) with a fixed smoothing of FWHM = 5 mm. They found that the standardized uptake value (SUV) threshold of the scanner equipped with standard electronic differed significantly from those of the other two scanners and that at least 16 iterations are required in order to produce reliable SUV thresholds. Our own results support these findings. On the one hand, the regression lines did not differ significantly between scanners of the same type equipped with similar electronics, while the calibration curves for scanners of different type clearly differ (Figure 2). On the other hand, the number of iteration is not a significant predictor of TS, provided that this number is kept above a certain level which is both recommended by the manufactures and necessary to have good image quality. Table 3: Scanner-model specific calibration curves, model 2 , shrinkage on cross validation and independent predictor of TS ranked in order of significance.   In a second study, Hatt et al. [30] evaluated the robustness and repeatability of a TBR algorithm in comparison to fuzzy C-means clustering and fuzzy locally adaptive Bayesian algorithm. The authors performed phantom measurements on four different PET/CT scanners (Philips Gemini and Gemini TF, Siemens Biograph, and GE Discovery LS) using a standard acquisition protocol with two TBR (4 and 8) and Three ESD (1, 2, and 5 min). PET raw data were reconstructed using routine clinical image reconstruction and two voxel size volume for all scanners. They reported a higher robustness of the fuzzy locally adaptive Bayesian algorithm while the repeatability provided by all segmentation methods was very high with a negligible variability of <5% in comparison to that associated with manual delineation. However, as recognized by the same authors, in order to assess the robustness of the TBR approach they applied an adaptive thresholding using the parameters optimized on other scanners to the image datasets acquired with the Siemens Biograph, which is sort of misleading since the TBR approach is system dependent. Instead, by adopting scanner-model specific calibration curves, similar mean classification error (∼10%) and variability (∼5%) would have been obtained for the TBR algorithm and for the fuzzy locally adaptive Bayesian approach. By first principles, the inclusion of the postreconstruction smoothing (not considered in the study of Hatt) should increase both the accuracy and robustness of adaptive thresholding algorithms possibly leading to results even superior to those achieved by advanced image segmentation methods. Noteworthy, the coefficient of regression for the TBR variable reported by Hatt for the Gemini TF is very similar to the one obtained for the same variable in the present work ( TBR = 61.4 versus 59.3), also considering that the two regression models are not identical. This provides, although indirectly, a further confirmation of the robustness of the scanner-model specific approach in deriving TS calibration curves.
In the last study Schaefer et al. [31] evaluated the calibration of an adaptive SUV thresholding algorithm in eleven centers equipped with 5 Siemens Biograph, 5 Philips Gemini, and one Siemens ECAT ART scanners. They reported only minor differences in calibration parameters for scanners of the same type provided that identical imaging protocols were used, whereas significant differences were found comparing scanners of different types. Moreover, they reported no statistically significant differences among SUV thresholds calculated for each site by use of the "site-specific" calibration neither among scanners of the same type at different sites nor among scanners of different types at different sites. Our own results support these findings only partially. In our study both acquisition and reconstruction parameters were varied and relevant parameters were incorporated into the "site-specific" algorithms so that there is no need to force individual centers to adopt a fixed protocol of image acquisition and image reconstruction. Bearing in mind this relevant difference, also in our study the calibration curves were not significantly different between scanners of the same type, whereas significant differences were found comparing scanners of different types. On the contrary, both the measured (Figure 2) and the calculated TS (Table 3) were significantly different among scanners of different types. For instance, the measured TS averaged over the entire spectrum of acquisition and reconstruction parameters for larger targets (sphere > 133 mm 2 ) for the Discovery 690 (S6) and the Discovery 600 (S4-5) were 39.0 ± 5.8% versus 45.4 ± 10.8, respectively ( < 0.0001). This difference largely reflects the hot-contrast recovery capabilities of the different scanners which, in the case of the Discovery 690, are emphasized by the introduction of PSF techniques in the reconstruction process.

Study Advantages and Limitations.
The proposed method for the definition of BTV has several advantages, even if the results of this study must be interpreted in the context of some limitations.
Our method is feasible in a clinical context in those lesions presenting a uniform radiotracer uptake, as for different oncological lesions. In this case, the proposed method can be effective for extracting functional biomarkers and for using PET imaging in image-guided radiotherapy treatments. As a representative example, Figure 4 shows the application of the proposed method to a head and neck oncological patient, a candidate for image-guided radiotherapy. In these patients, BTV can be used to optimize radiotherapy treatment taking advantages from the information of functional imaging. The effects of lesion movement in lung tumors have been recently incorporated in an adaptive thresholding algorithm using multiple regression techniques similar to those in the present study [32]. Though the effects of lesion movement were not included in this study, we believe that the conclusions regarding the effect of smoothing and TBR on thresholds still apply in the case of moving targets.
Threshold techniques do not take into account variations in tumor heterogeneity. This has motivated the investigation of advanced segmentation techniques not based on thresholding. While referring to these important methods for segmentation of nonuniform tracer concentration it should be pointed out that until they are further developed and validated, adaptive threshold segmentation methods are and will be used in most clinics and therefore need to be accurately characterized.
Furthermore, it has to be pointed out that in our work BTVs were fitted only for cross sections larger than 133 mm 2 . This choice is justified by the fact that several studies found severe errors in the volume estimation for tumor volume < 2 mL corresponding to cross sections < 192 mm 2 (in terms of sphere-equivalent cross section) [33][34][35].

Conclusion
This study demonstrated that the calibration curves for the proposed adaptive thresholding method were not significantly different between scanners of the same type at different sites. The incorporation of the postreconstruction Gaussian smoothing in the algorithms avoids the need of systemdependent optimization procedures. This, together with the demonstrated high level of reliability of this approach, may provide robust and reliable tools to aid physicians as an initial guess in segmenting biological volumes on FDG-PET images.