Postoperative thyroglobulin as a yard-stick for radioiodine therapy: decision tree analysis in a European multicenter series of 1317 patients with differentiated thyroid cancer

Purpose An accurate postoperative assessment is pivotal to inform postoperative 131I treatment in patients with differentiated thyroid cancer (DTC). We developed a predictive model for post-treatment whole-body scintigraphy (PT-WBS) results (as a proxy for persistent disease) by adopting a decision tree model. Methods Age, sex, histology, T stage, N stage, risk classes, remnant estimation, TSH, and Tg were identified as potential predictors and entered into a regression algorithm (conditional inference tree, ctree) to develop a risk stratification model for predicting the presence of metastases in PT-WBS. Results The lymph node (N) stage partitioned the population into two subgroups (N-positive vs N-negative). Among N-positive patients, a Tg value > 23.3 ng/mL conferred an 83% probability of metastatic disease, as compared to those with lower Tg values. Additionally, N-negative patients were further substratified into three subgroups with different risk rates according to their Tg values. The model remained stable and reproducible in the iterative process of cross-validation. Conclusions We developed a simple and robust decision tree model able to provide reliable information on the probability of persistent/metastatic DTC after surgery. This information may guide post-surgery 131I administration and select patients requiring curative rather than adjuvant 131I therapy schedules.


Introduction
A trend towards de-escalating the role of iodine-131 (131 I) in the treatment of differentiated thyroid cancer (DTC) emerged after the release of the 2015 edition of the American Thyroid Association (ATA) Management Guidelines for adult patients with thyroid nodules and differentiated thyroid cancer [1]. Basically, a three-tiered postoperative risk assessment system, primarily based on pathology reports, was proposed to inform decisions on postoperative 131 I administration (i.e., ATA risk classes). Such a system, however, cannot detect the presence of postoperative biochemical, structural, or functional persistent disease that requires curative 131 I administration. Moreover, different options (i.e., wait-and-see, remnant ablation, adjuvant 131 I therapy) are available in patients with no evidence of persistent disease. ATA risk classes are relevant in assigning patients to different strategies, but additional factors such as local resources and expertise as well as patients' preferences should be incorporated and discussed, preferably in a multidisciplinary context (i.e., tumor board), in order to do justice to the modern concept of an individualized, targeted therapy [2]. The postoperative application of 131 I, which integrates diagnostics (i.e., post-treatment whole-body scintigraphy, PT-WBS) and therapeutics, long served as the gold standard for detecting persistent disease, assessing its 131 I avidity, and predicting response to 131 I, thus enabling personalized management of DTC patients. However, following current standards, the omission of 131 I therapy in selected cases prevents the treating physician from obtaining such information; hence, alternative markers are warranted. Thyroglobulin (Tg) is a glycoprotein produced by thyroid follicular cells, and its serum level is roughly related to the amount of thyroid tissue present [3].
In DTC patients who undergo total thyroid ablation (i.e., thyroidectomy and 131 I ablation), Tg is a powerful tumor marker and is monitored to detect persistent or recurrent disease, evaluate disease progression, and provide prognostic information [4]. Preablation Tg measurement can accurately determine the likelihood of achieving remission or having persistent or recurrent disease after the initial 131 I therapy [5], and was also proposed to assess the postoperative status and guide 131 I therapy selection, with sparse results [6][7][8]. As a matter of fact, the predictive value of postoperative Tg is influenced by many factors. These include the time elapsed since surgery, the amount of thyroid remnant [9], the selected Tg cutoff level, and the TSH level at the time of Tg measurement [10][11][12][13]. Furthermore, the presence of Tg autoantibodies (TgAb), often "imported" by co-existing thyroid autoimmune diseases, significantly affects the measurable Tg values in up to 25% of DTC patients in the early postoperative period [3,4,14]. Accordingly, Tg reference intervals mathematically normalized to TSH levels and estimated amounts of thyroid remnants have been claimed to improve the reliability of postoperative Tg measurement [15,16]. Moreover, the pre-test individual risk of loco-regional and distant metastases further complicates the issue. As a result, optimal Tg cutoff levels to distinguish normal residual thyroid tissue from persistent thyroid cancer requiring curative 131 I administration are not yet available [1,17,18]. In the available literature, patients treated with 131 I were predominantly grouped together as one cohort, irrespective of whether 131 I treatment was carried out in the context of remnant ablation, adjuvant therapy, or as therapy for known metastases.
Considering new recommendations, however, early detection of patients with persistent disease after surgery is pivotal to optimize 131 I treatment (i.e., patients' preparation and administered 131 I activities) and maximize treatment effectiveness. First, metastatic DTC cells have a lower density and poorer functionality of the sodium-iodide symporter (NIS), and the TSH elevation over time (i.e., area under the curve of TSH stimulation) is relevant to increase 131 I uptake and retention [19]. Accordingly, thyroid hormone withdrawal (THW) is preferable to recombinant human TSH (rhTSH) in these cases (i.e., with the exception of patients who are either unable to elevate endogenous TSH during THW, or in whom THW is contraindicated for medical reasons). Second, ablative and adjuvant treatments are performed with 131 I activities ranging from 1.1 to 3.7 GBq, while treatment of known disease is performed with higher activities ranging from 3.7 to 5.5 GBq for small-volume loco-regional disease and 5.6-7.4 GBq (150-200 mCi) or even more for advanced loco-regional disease and/or small-volume distant metastatic disease. Identification of iodine-avid diffuse metastatic disease may lead to escalation of the prescribed therapeutic 131 I activity with or without dosimetry calculations [20]. All in all, an accurate postoperative assessment is pivotal to inform our treatments and avoid suboptimal 131 I administration (i.e., adjuvant instead of curative strategy). The present study aimed to develop a predictive model for PT-WBS results (as a proxy for persistent disease) by adopting a decision tree model integrating postoperative TSH and Tg levels, thyroid remnant estimate, patients' demographic and clinical data, and ATA risk classes in a large series of 1317 TgAb-negative DTC patients.

Patients
From the institutional databases of participating centers, all patients 18 years and older with histologically proven DTC who underwent (near-)total thyroidectomy and THW-assisted 131 I therapy were included. Records without information on (i) TgAb levels; (ii) preablation TSH, Tg, and 24-h radioiodine uptake (RAIU) values within 1 week before 131 I therapy; and (iii) post-treatment whole-body scintigraphy (PT-WBS) results were excluded from the present study.

Radioiodine treatment
Patients selected for 131 I administration underwent THW and received a 131 I activity determined at the discretion of the attending physician according to nuclear medicine guidelines and practice recommendations (median 2.4 GBq, range 1.1-4.5 GBq). Radioprotection issues were managed in strict adherence to national regulations [18,21].

RAIU testing and post-treatment WBS
All RAIU testing and PT-WBS examinations were performed strictly following the EANM procedure guidelines in all participating centers. Single photon emission computed tomography/computed tomography (SPECT/CT) was performed in addition to PT-WBS at the judgment of attending physicians [22]. The PT-WBS was classified as negative (i.e., absent uptake within the thyroid bed and absent non-physiological uptakes in other regions); remnant only (i.e., uptake within the thyroid bed without non-physiological uptakes in other regions); or positive (pathological uptake outside the thyroid bed, with or without uptake within the thyroid bed).
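The three-level PT-WBS classification above can be written as a small rule. The following is a minimal sketch only: the function name, the enum, and the two boolean inputs are hypothetical names introduced for illustration and are not part of the study software.

```python
from enum import Enum


class PTWBSResult(Enum):
    NEGATIVE = "negative"
    REMNANT_ONLY = "remnant only"
    POSITIVE = "positive"


def classify_pt_wbs(thyroid_bed_uptake: bool,
                    pathological_uptake_outside_bed: bool) -> PTWBSResult:
    """Map scan findings to the three PT-WBS categories described in the text.

    Both arguments are hypothetical reader-provided flags (assumption):
    uptake within the thyroid bed, and non-physiological uptake elsewhere.
    """
    # Pathological uptake outside the thyroid bed is positive,
    # with or without uptake within the thyroid bed.
    if pathological_uptake_outside_bed:
        return PTWBSResult.POSITIVE
    # Uptake confined to the thyroid bed is classified as remnant only.
    if thyroid_bed_uptake:
        return PTWBSResult.REMNANT_ONLY
    # No uptake in the thyroid bed and none elsewhere.
    return PTWBSResult.NEGATIVE
```

In the decision tree analysis, only the positive category corresponds to the endpoint (presence of metastases in PT-WBS).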

Laboratory
Serum TSH levels were measured by 2nd or 3rd generation immunoassays on conventional automated analytical platforms. Assays employed to quantify serum Tg and TgAb are reported in Table 1.

Decision tree model
Age, sex, histology, T stage, N stage, ATA risk, RAIU, TSH, and Tg were identified as potential predictors and entered into a regression algorithm (conditional inference tree, ctree) to develop a risk stratification model for predicting the presence of metastases in PT-WBS. Conditional inference tree analysis builds a decision tree by recursively splitting the population into subgroups according to the specified clinical endpoint. At each partition, the algorithm searches for the best predictor and the corresponding cut-off value that split the cohort into two nodes with significantly different outcomes. The process is iterated for each node until the algorithm cannot find any predictor that leads to significantly different subclasses, thus creating an algorithm for predicting future outcomes within more homogeneous subgroups. The final subpopulations are called terminal nodes. The ctree algorithm can include both continuous and categorical variables in the analysis. Additionally, it ranks the relevance of the different variables instead of focusing merely on outcome prediction with less attention to the contribution of each variable. To minimize overfitting, the dataset was divided into training and validation datasets through 200-fold cross-validation after a preliminary selection of the most relevant variables. First, we identified the most frequently selected parameters by varying the dataset. To this end, the selection of patients to be included in the dataset was iterated 1000 times, maintaining the original proportion of variables and endpoint. The clinical parameters selected in at least 95% of the iterations were used in the subsequent analysis. In a second step, these parameters were used to build the final model, which was then validated by a 200-fold cross-validation process applying a 70:30 splitting ratio. The correct proportion of variables and events was also maintained in this case.
For each iteration, the performance of the model was verified by calculating accuracy, positive predictive value (PPV), and negative predictive value (NPV). Since the cutoff value of each node could differ slightly between iterations because of the different patients included, the median and interquartile range of each threshold were evaluated. Finally, the performance of the model was tested for each center to assess the impact of different Tg and TgAb assays.
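The resampling scheme can be illustrated with a short sketch. It reads the "200-fold" procedure as 200 repeated random 70:30 splits and stratifies on the endpoint only; both points are assumptions, since the study also maintained the proportion of the predictor variables. The function names are hypothetical, and the tree fitting itself (ctree) is not reproduced here.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=None):
    """Return (train_idx, valid_idx), preserving the endpoint proportion.

    `labels` is a sequence of endpoint values (e.g. 1 = metastases in PT-WBS).
    Stratifying on the endpoint alone is a simplification (assumption).
    """
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for value in sorted(set(labels)):
        # Indices of all patients sharing this endpoint value.
        idx = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(idx)
        n_train = round(train_fraction * len(idx))
        train_idx.extend(idx[:n_train])
        valid_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(valid_idx)


def repeated_splits(labels, n_repeats=200, train_fraction=0.7, seed=0):
    """Yield n_repeats independent stratified train/validation splits."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        yield stratified_split(labels, train_fraction,
                               seed=rng.randrange(2**32))
```

Each yielded pair of index lists would then be used to fit the tree on the training part and to evaluate it on the validation part.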
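As an illustration of the per-iteration metrics, the sketch below computes accuracy, PPV, and NPV from binary outcomes and summarizes a node cutoff across iterations by its median and interquartile range. The function names are hypothetical, and the quartile method (the default of Python's `statistics.quantiles`) is an assumption that may differ from the one used in R.

```python
from statistics import median, quantiles


def performance(y_true, y_pred):
    """Accuracy, PPV, and NPV for binary outcomes (truthy = disease present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,  # TP / predicted positive
        "npv": tn / (tn + fn) if tn + fn else nan,  # TN / predicted negative
    }


def summarize_threshold(cutoffs):
    """Median and interquartile range of one node's cutoff across iterations.

    Requires at least two values, as statistics.quantiles does.
    """
    q1, _, q3 = quantiles(cutoffs, n=4)
    return {"median": median(cutoffs), "iqr": (q1, q3)}
```

Averaging the `performance` output over all training or validation splits gives mean values of the kind reported for the model.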

Statistical analysis
To analyze differences between different groups, the χ² test and the Kruskal-Wallis or Mann-Whitney U tests were used for categorical and continuous variables, respectively. Differences were considered statistically significant when P ≤ 0.05. Statistical analyses were carried out with R and the integrated

Results
Demographic, clinical, histopathological, and biochemical data included in our statistical model are summarized in Table 2 for the overall series and for the individual participating centers. Relevant between-center differences emerged for almost all parameters considered in our analysis (Table 2).

Decision tree model
The percentage of analytic cycles in which each variable was selected as predictive of persistent disease in PT-WBS was estimated. Variables selected in more than 95% of cycles were retained in the subsequent analysis. Accordingly, Tg values and N stage were the best predictive parameters in the first analytic round, recurring in 100% and 97.4% of 1000 iterative cycles of analysis, respectively (Table 3). In the second step, the 200-fold cross-validation analysis was performed and the algorithm generated a conditional inference tree with five terminal nodes using Tg and N stage as predictive variables, as depicted in Fig. 1. According to the decision tree model, in  Table 4. In the training cohorts, the mean accuracy, PPV, and NPV of the generated predictive model were 88%, 68%, and 90%, respectively. Similar performance was also obtained in the validation sets, with an accuracy of 88%, PPV of 60%, and NPV of 91% (Table 5). Finally, as summarized in Table 6, the accuracy, NPV, PPV, and AUC values were similar for each center, with the exception of center 3, which had a lower PPV than the others, likely due to the use of the Roche Tg assay, which produces higher results than the Tg assays employed in other centers [23].

Discussion
We developed and validated an algorithm to predict whether viable tumor lesions after surgery will be visualized by PT-WBS. Notably, it is intended as a tool to better select patients who will profit from a more intense treatment with curative intent rather than to exclude patients "per se" from 131 I application. Currently, no definitive data exist to safely support the omission of 131 I adjuvant treatment in low- to intermediate-risk DTC. Therefore, clinical decisions should include local factors and patients' values and preferences in addition to the conventional risk stratification. As the main result of our study, the combination of lymph node status and Tg values outperformed any other tested factor (i.e., age, sex, T, histology, ATA risk classes, TSH, and RAIU values) in predicting the presence of persistent disease after surgery. This highlights the drawback of the ATA risk stratification system alone in predicting postoperative persistent disease and underscores the role of a more sophisticated postoperative assessment of DTC patients [2,24]. Based on our data, a decision tree is provided to guide clinical decision-making depending on the presence of lymph node involvement and, subsequently, on Tg levels in different subsets of patients. Notably, differences in patient selection, surgical skills, and related postoperative RAIU values (i.e., estimates of thyroid tissue remnant) and preablation TSH levels are common in real-life clinical practice, as is the use of different Tg and TgAb assays in different centers. Overall, these factors represent a major limitation in selecting general thresholds and decision limits to inform postoperative clinical actions. Interestingly, however, neither RAIU values nor TSH levels were independently retained in our model, and the impact of different Tg assays was negligible in our analysis.
Accordingly, the Tg thresholds at each node of our tree proved to be actionable even in different local populations, which is a relevant result of our study. Some limitations of our study must be mentioned. First, a potential drawback of our study is the lack of an external dataset for model validation. Rather, we performed an internal cross-validation procedure. On the other hand, this method reduces the risk of overfitting and provides a more robust estimate of the model's performance since all data are used for both training and validation. In addition, most analyzed parameters varied significantly between centers, supporting the use of an internal cross-validation instead of an external one. All in all, a more homogeneous population was obtained, reflecting the real-life distribution of the parameters. Second, different Tg assays were employed in different centers, and the PPV was lower in one center, which used a Tg assay that produces higher results than other assays [23]. However, the overall performance of the model remained good when retested in each subgroup. This is likely related to the good alignment of the Tg assays employed in the other participating centers. Additionally, Tg concentrations in our patients were significantly higher than those usually measured during the long-term follow-up of cured DTC patients (i.e., 0.1-1 ng/mL), making the clinical impact of such analytical differences less relevant. Nevertheless, a careful evaluation of the local Tg assay is advised before adopting our decision model in clinical practice [3,4,16,23]. Third, our model included only patients treated after THW, as most patients were treated with this preparation protocol in our centers. A significant positive correlation exists between Tg values measured under thyroxine, after rhTSH stimulation, and after THW, with basal/rhTSH-Tg and basal/THW-Tg ratios of 1:5 and 1:10, respectively.
Notwithstanding, our results cannot be directly translated to patients under thyroxine or those stimulated by rhTSH, and further specific studies are warranted [16,25,26]. Finally, our model is explicitly intended as a tool to inform curative 131 I administration in patients with a high probability of persistent structural disease, independently of the initial risk stratification [27]. Notably, adjuvant therapy with 131 I could still decrease the recurrence risk even in intermediate-risk patients with unstimulated Tg ≤ 1 ng/mL or stimulated Tg ≤ 10 ng/mL [28], so our system cannot be used to rule out adjuvant 131 I administration in low- and intermediate-risk DTC.

Conclusions
In conclusion, we developed a simple, accurate, and reproducible decision tree model able to provide reliable information on the probability of persistent/metastatic DTC after surgery. The information provided by our model is highly relevant to refine the initial risk stratification and guide 131 I administration with adjuvant or therapeutic intent based on the probability of persistent and/or metastatic disease.