Benign descriptors and ADNEX in two‐step strategy to estimate risk of malignancy in ovarian tumors: retrospective validation in IOTA5 multicenter cohort

ABSTRACT
Objective Previous work has suggested that the ultrasound‐based benign simple descriptors (BDs) can reliably exclude malignancy in a large proportion of women presenting with an adnexal mass. This study aimed to validate a modified version of the BDs and to validate a two‐step strategy to estimate the risk of malignancy, in which the modified BDs are followed by the Assessment of Different NEoplasias in the adneXa (ADNEX) model if modified BDs do not apply.
Methods This was a retrospective analysis using data from the 2‐year interim analysis of the International Ovarian Tumor Analysis (IOTA) Phase‐5 study, in which consecutive patients with at least one adnexal mass were recruited irrespective of subsequent management (conservative or surgery). The main outcome was classification of tumors as benign or malignant, based on histology or on clinical and ultrasound information during 1 year of follow‐up. Multiple imputation was used when outcome based on follow‐up was uncertain according to predefined criteria.
Results A total of 8519 patients were recruited at 36 centers between 2012 and 2015. We excluded patients who were already in follow‐up at recruitment and all patients from 19 centers that did not fulfil our criteria for good‐quality surgical and follow‐up data, leaving 4905 patients across 17 centers for statistical analysis. Overall, 3441 (70%) tumors were benign, 978 (20%) malignant and 486 (10%) uncertain. The modified BDs were applicable in 1798/4905 (37%) tumors, of which 1786 (99.3%) were benign. The two‐step strategy based on ADNEX without CA125 had an area under the receiver‐operating‐characteristics curve (AUC) of 0.94 (95% CI, 0.92–0.96). The risk of malignancy was slightly underestimated, but calibration varied between centers. A sensitivity analysis in which we expanded the definition of uncertain outcome resulted in 1419 (29%) tumors with uncertain outcome and an AUC of the two‐step strategy without CA125 of 0.93 (95% CI, 0.91–0.95).
Conclusion A large proportion of adnexal masses can be classified as benign by the modified BDs. For the remaining masses, the ADNEX model can be used to estimate the risk of malignancy. This two‐step strategy is convenient for clinical use. © 2022 The Authors. Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of International Society of Ultrasound in Obstetrics and Gynecology.


INTRODUCTION
Ovarian cancer is the fifth leading cause of cancer death among women in developed countries. Patients with ovarian cancer treated in tertiary oncology referral centers have a better prognosis compared with those managed in general gynecology departments [1][2][3][4] . Correct diagnosis is important to facilitate the delivery of optimal treatment.
To help clinicians decide on appropriate management, mathematical models to predict malignancy in adnexal masses have been developed on cohorts of patients that underwent surgery. A well-known model is the risk-of-malignancy index (RMI) 5 . The International Ovarian Tumor Analysis (IOTA) group created and validated four models to estimate the risk of malignancy in adnexal masses: logistic regression model 1, logistic regression model 2, simple rules risk model (SRRisks) and Assessment of Different NEoplasias in the adneXa (ADNEX) [6][7][8][9][10] . Systematic reviews and prospective cohort studies have shown that IOTA models discriminate better between benign and malignant tumors than do all other models including the RMI 6,[11][12][13][14] . The ADNEX model uses simple predictor variables and calculates the risk of four types of malignancy 7 .
Some adnexal lesions can be classified easily as benign or malignant using the IOTA simple descriptors. These are based on easily recognizable ultrasound features and do not require access to a computer 15 . If a benign simple descriptor (BD) applies to a tumor selected for surgery, the tumor is almost certainly benign (> 99%), while > 92% of tumors to which a malignant simple descriptor applies are malignant 6,15 . In clinical practice, it would be logical to first apply the BDs. If one of these applies, the mass could be classified as benign (risk of malignancy < 1%), while if none applies, a mathematical model could be used to estimate the risk of malignancy. To the best of our knowledge, such a two-step strategy has not been suggested before, nor has it been validated either in masses removed by surgery or in those managed conservatively.
The primary aim of this study was to validate the diagnostic performance of the modified BDs and of a two-step strategy, i.e. modified BDs followed by ADNEX if modified BDs do not apply, when used in both surgically and conservatively managed adnexal masses.

METHODS

Study design
This was a retrospective analysis of the interim data from the IOTA Phase-5 study (IOTA5), an international multicenter prospective cohort study that is ongoing 16,17 . Consecutive patients with at least one adnexal tumor examined with transvaginal ultrasonography were included. Surgery or conservative management was suggested by the ultrasound examiner based on the ultrasound appearance of the tumor (pattern recognition), symptoms and evolution of the tumor over time. Recruitment into IOTA5 stopped in December 2016. However, patient follow-up will continue until each conservatively managed patient has been followed up for at least 5 years. The interim analysis includes patients enrolled between 1 January 2012 and 1 March 2015, and follow-up data until 30 June 2017. A total of 36 centers in 14 countries participated in the study, including both oncology referral centers (tertiary centers with a specific gynecologic oncology unit) and other types of centers. Approval was obtained from the ethics committee of the University Hospitals Leuven, Leuven, Belgium as the co-ordinating center (B32220095331/S51375) and from the local ethics committee of each contributing center (ethical approval numbers are listed in Table S1). The IOTA5 study protocol can be found at ClinicalTrials.gov (NCT01698632). The current report is written in accordance with transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines 18 .

Inclusion and exclusion criteria
Patients were eligible for inclusion if they were at least 18 years old and had at least one adnexal (ovarian/paraovarian or tubal/paratubal) tumor on ultrasonography. We used the IOTA definition of adnexal tumor (lesion), i.e. 'the part of an ovary or an adnexal mass that is judged from assessment of ultrasound images to be inconsistent with normal physiologic function' 19 .
Cysts judged to be physiologic (follicular cysts, corpus luteum cysts) with a largest diameter < 3 cm were not eligible for inclusion in IOTA5. Denial or withdrawal of informed consent were other exclusion criteria. Pregnancy was not an exclusion criterion. For the analysis of this study, patients with adnexal tumors already diagnosed and in follow-up in the participating center before enrolment in the IOTA5 study were excluded.

Ultrasound examination and CA125 measurement
At inclusion, ultrasound examiners performed a standardized transvaginal ultrasound examination and registered clinical information following a research protocol. By design, the ultrasound examiners were blinded to the outcome. They were not actively blinded to clinical information, nor to results of biomarkers or other imaging, such as computed tomography, that might have been performed before the ultrasound examination. All ultrasound examiners (n = 77) had passed the IOTA certification test (https://www.iotagroup.org/certified-members). Most scans were performed by Level-II or -III examiners and very few were performed by Level-I examiners (level defined by the European Federation of Societies of Ultrasound in Medicine and Biology (EFSUMB)) 20 . IOTA terminology was used to describe the ultrasound findings 19 . Information on predefined ultrasound variables was collected for each patient (Table S2). Using subjective assessment (pattern recognition), ultrasound examiners classified each tumor as benign, borderline or malignant and specified the degree of certainty with which the diagnosis was made (certain, probable, uncertain). The ultrasound diagnoses were based on knowledge of the typical ultrasound appearance of benign, borderline and malignant lesions and that of different types of specific adnexal pathology 21 . If there were multiple masses, the one with the most complex ultrasound morphology was registered by the ultrasound examiner as the dominant tumor. The dominant tumor was used in our statistical analysis. At follow-up visits, ultrasound examination was performed following the same protocol as at the inclusion scan, and clinical information, including information on symptoms, was obtained. At each examination, the ultrasound examiner proposed management (surgical removal or follow-up) based on the ultrasound diagnosis and the patient's symptoms. 
However, the final decisions about management were made by the referring clinicians, taking into account clinical symptoms, ultrasound findings, findings from other imaging modalities such as computed tomography or magnetic resonance imaging, tumor markers and patient preference.
Conservative management comprised clinical and ultrasound follow-up at intervals of 3 months, 6 months and 12 months, and then every 12 months thereafter. Measurement of serum CA125 was encouraged, but it was not an inclusion criterion for the study. Measurements of CA125 were performed according to local practice in each center.

Data collection and cleaning
Patient data were registered on a secure electronic platform (IOTA5 Study Screen; astraia software, Munich, Germany). A unique identifier code was assigned automatically to each patient. All data communications were encrypted to guarantee data security. Data cleaning was performed by a team of ultrasound examiners and biostatisticians. It included queries to local investigators to amend inconsistencies and complete missing data. A standardized questionnaire (Appendix S1) for patients and/or managing clinicians was used at the local centers to retrieve missing information. Before analyzing our data, we defined the criteria for a study center to be included in our analysis. For a center to be included, we required it to have recruited at least 50 patients, to have recruited patients consecutively, irrespective of suggested management (surgery or conservative management with follow-up) and to have good quality follow-up data for at least 70% of the recruited patients. We defined good follow-up data as a recorded study outcome (surgery at any point, spontaneous resolution of the mass or patient death) or a last follow-up visit at least 10 months after inclusion. The 70% cut-off was chosen arbitrarily, because it seemed reasonable to members of the IOTA Steering Committee (details in Appendix S2).
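The center-level inclusion criteria above can be written as a small check. This is a minimal sketch only: the names `CenterSummary`, `has_good_followup` and `center_is_eligible` are hypothetical and do not come from the study's analysis code, and the thresholds (50 patients, 70%, 10 months) are taken directly from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CenterSummary:
    """Hypothetical per-center summary used only for this illustration."""
    n_recruited: int
    consecutive_recruitment: bool
    n_good_followup: int  # patients meeting the good follow-up definition


def has_good_followup(outcome_recorded: bool,
                      months_to_last_visit: Optional[float]) -> bool:
    """Good follow-up: a recorded study outcome (surgery, spontaneous
    resolution or death) or a last visit at least 10 months after inclusion."""
    if outcome_recorded:
        return True
    return months_to_last_visit is not None and months_to_last_visit >= 10


def center_is_eligible(center: CenterSummary,
                       min_patients: int = 50,
                       min_good_fraction: float = 0.70) -> bool:
    """Apply the three prespecified center-level criteria from the text."""
    if center.n_recruited < min_patients:
        return False
    if not center.consecutive_recruitment:
        return False
    return center.n_good_followup / center.n_recruited >= min_good_fraction
```

Counting patients with `has_good_followup` and passing the total to `center_is_eligible` mirrors the two-level definition in the text: follow-up quality is judged per patient, while inclusion is decided per center.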

Modified BDs and the two-step strategy
We modified a priori the original BDs 15 by requiring the largest diameter of the tumor to be < 10 cm for all four descriptors instead of only for the third descriptor (Figure 1). We refer to these descriptors as modified BDs. The size criterion was added to decrease the likelihood of a malignant tumor being misclassified as benign. Based on data from the IOTA Phase-1-3 studies (n = 5914) 6 , 1618 (27%) tumors fulfilled the criteria of an original BD, of which 11 (0.7%) were malignant. Among the same 5914 tumors, 1427 (24%) fulfilled the criteria of a modified BD, of which six (0.4%) were malignant (unpublished data).
The malignant simple descriptors 15 are not used in the two-step strategy. As a first step in the two-step strategy, the modified BDs are used. When the modified BDs do not apply, the second step is to use ADNEX. ADNEX calculates the probability of five outcome categories: benign, borderline, Stage-I primary invasive ovarian malignancy, Stage-II-IV primary invasive ovarian malignancy and metastasis in the ovary from another primary origin (e.g. breast cancer or colon cancer) 7 . ADNEX uses three clinical and six ultrasound predictors: type of center (oncology center vs other), patient age, CA125 level, maximum diameter of the lesion, proportion of solid tissue, number of papillary projections, presence of > 10 cyst locules, presence of acoustic shadows and presence of ascites. ADNEX can also be used without CA125. Details on the ADNEX model are provided in Appendix S3. Model predictions are based on information obtained at the inclusion scan and hence are blinded to the outcome.

Figure 1 Modified benign descriptors.
Descriptor 1: Unilocular cyst with ground-glass echogenicity and largest diameter < 10 cm, in a premenopausal woman (suggestive of endometrioma)
Descriptor 2: Unilocular cyst with mixed echogenicity, acoustic shadows and largest diameter < 10 cm, in a premenopausal woman (suggestive of benign cystic teratoma)
Descriptor 3: Unilocular cyst with anechoic cyst fluid, smooth internal walls and largest diameter < 10 cm, in a pre- or postmenopausal woman (suggestive of simple cyst or cystadenoma)
Descriptor 4: All other unilocular cysts with smooth internal walls and largest diameter < 10 cm, in a pre- or postmenopausal woman
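The control flow of the two-step strategy can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the `Mass` fields and the string encoding of echogenicity are hypothetical, Descriptor 4 is read literally as any remaining unilocular cyst with smooth internal walls, and the ADNEX model itself is not reproduced here (its coefficients are in Appendix S3), so it is passed in as a caller-supplied function.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Mass:
    """Hypothetical encoding of the ultrasound findings needed for step 1."""
    unilocular: bool
    largest_diameter_mm: float
    premenopausal: bool
    echogenicity: str          # e.g. "anechoic", "ground_glass", "mixed"
    acoustic_shadows: bool
    smooth_internal_walls: bool


def modified_bd(mass: Mass) -> Optional[int]:
    """Return the number (1-4) of the modified benign descriptor that
    applies, or None if none applies."""
    # All four modified descriptors require a unilocular cyst < 10 cm.
    if not mass.unilocular or mass.largest_diameter_mm >= 100:
        return None
    if mass.premenopausal and mass.echogenicity == "ground_glass":
        return 1
    if (mass.premenopausal and mass.echogenicity == "mixed"
            and mass.acoustic_shadows):
        return 2
    if mass.smooth_internal_walls and mass.echogenicity == "anechoic":
        return 3
    if mass.smooth_internal_walls:
        return 4
    return None


def two_step(mass: Mass,
             adnex: Callable[[Mass], float]) -> Tuple[int, Optional[int], Optional[float]]:
    """Return (step, descriptor, adnex_risk).

    Step 1: a modified BD applies, so the mass is classified as benign
    and ADNEX is not called. Step 2: no BD applies, so the caller-supplied
    `adnex` function (an assumption of this sketch) estimates the risk.
    """
    descriptor = modified_bd(mass)
    if descriptor is not None:
        return 1, descriptor, None
    return 2, None, adnex(mass)
```

No numeric risk is returned in step 1 because the risk estimates assigned when a modified BD applies are described in Appendix S5 and are not reproduced in this sketch.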

Reference standard
The reference standard refers to the nature of the adnexal tumor (benign or malignant) at inclusion. Borderline tumors were classified as malignant. Each adnexal mass was classified as benign or malignant based on histology, if the tumor was surgically removed, otherwise on the results of follow-up examinations (see below). The histology of the surgically removed tumor was determined at the local center. Central pathology review was not performed, because we found little difference between local and central pathology reports in a previous IOTA study 8 . Pathologists were blinded to ultrasound predictor variables and model predictions but might have received information on the subjective assessment by the ultrasound examiner when clinically relevant. Malignant tumors were classified according to the International Federation of Gynecology and Obstetrics (FIGO) guidelines 22 . If the tumor was not surgically removed, it was classified as benign or malignant based on clinical and ultrasound findings during 12 ± 2 months of follow-up (i.e. minimum follow-up time to assign an outcome was 10 months). Different scenarios were possible: some patients underwent surgery without follow-up, others were managed conservatively with or without surgery later. For some patients, we had no information from after the inclusion visit. If data to classify the tumor as benign or malignant at inclusion were not available, the outcome was classified as uncertain. Table 1 describes the criteria for classifying tumors as benign, malignant or uncertain.

Table 1 footnotes: In line with previous work 7 , we used 120 days as the maximum interval between inclusion and surgery. When surgery was performed more than 120 days after inclusion and histology was malignant, we recognize the possibility that the tumor was benign at inclusion but underwent malignant transformation. In these cases, we relied on subjective assessment at inclusion and on follow-up scans to decide whether to label the outcome as malignant or uncertain. *Type of malignancy could not be determined so was treated as a missing value and imputed (Appendix S4). B, benign; M, malignant; U, uncertain at inclusion.

Study endpoints
In line with the study objectives, the main study endpoints were: (1) the percentage of tumors to which the modified BDs apply; (2) the percentage of malignant tumors among lesions to which the modified BDs apply; and (3) the diagnostic performance in terms of discrimination and calibration of the two-step strategy. The secondary study endpoint was the discriminative ability of ADNEX (with and without CA125) when applied only to tumors to which the modified BDs do not apply.

Statistical analysis
A summary of the statistical analysis is provided below. Details on statistical analysis and discussion of sample size can be found in Appendices S4-S6. The statistical analysis was performed with R version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria). We calculated the percentage of patients to which the modified BDs applied and the prevalence of malignancy in tumors to which the modified BDs applied. To assess performance of the two-step strategy, we needed risk estimates for each of the five tumor outcomes when the modified BDs applied. Appendix S5 describes how these risk estimates were obtained. We evaluated discrimination between benign and malignant tumors using the area under the receiver-operating-characteristics curve (AUC). To evaluate calibration of the estimated risk of malignancy, we calculated the calibration intercept and slope using a logistic recalibration model 23 . Clinical utility was assessed using decision-curve analysis by calculating net benefit at thresholds for estimated risks of malignancy between 5% and 50% to decide which patients to refer to specialized oncological care 24 . For the two-step strategy, we assessed further the AUC for each pairing of the five tumor subtypes, the polytomous discrimination index (PDI) as a multiclass AUC, and calibration for the estimated risks of each of the five tumor subtypes 25,26 .
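Two of the measures named above, the AUC and the net benefit used in decision-curve analysis, can be illustrated with a short standard-library sketch. The study's analysis was done in R, and its exact code is not shown in this report, so the functions below are textbook definitions with hypothetical names: the AUC as the probability that a malignant tumor receives a higher estimated risk than a benign one (ties count one half), and net benefit as TP/n − FP/n × t/(1 − t) at risk threshold t. Treating a risk equal to the threshold as test-positive is an assumption of this sketch.

```python
from typing import Sequence


def auc(risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    outcomes: 1 = malignant, 0 = benign.
    """
    pos = [r for r, y in zip(risks, outcomes) if y == 1]
    neg = [r for r, y in zip(risks, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    score = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                score += 1.0
            elif p == q:
                score += 0.5
    return score / (len(pos) * len(neg))


def net_benefit(risks: Sequence[float], outcomes: Sequence[int],
                threshold: float) -> float:
    """Net benefit of referring every patient with risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# A decision curve evaluates net benefit over a grid of thresholds,
# here 5% to 50% as in the text.
thresholds = [t / 100 for t in range(5, 51, 5)]
```

Evaluating `net_benefit` at each value in `thresholds` gives the points of one decision curve; the calibration intercept and slope require fitting a logistic recalibration model and are not covered by this sketch.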
For the percentage of patients to which the modified BDs applied, AUC for benign vs malignant tumors, calibration of the risk of malignancy and decision-curve analysis, we addressed heterogeneity between centers. This was done by calculating center-specific performance and combining the results using meta-analysis (Appendix S4) 23,24,27 . Heterogeneity in the AUC was quantified using 95% prediction intervals (PI) 28 , which describe the range of AUC values that can be expected in a new center. Because the number of malignant tumors was too low, meta-analysis was not possible for the prevalence of malignancy in tumors to which the modified BDs applied, AUC for each pairing of tumor subtypes, PDI and calibration for each tumor subtype. For these analyses, data from all centers were pooled. For the percentage of patients to which the modified BDs applied, we performed both a meta-analysis and a pooled analysis.
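To make the idea of combining center-specific results concrete, the sketch below pools per-center estimates with a DerSimonian-Laird random-effects model. This is an assumption for illustration only: the meta-analysis method actually used is specified in Appendix S4 and may differ, the function names are hypothetical, and working on the logit scale for AUC is a common convention rather than something stated in the text. The 95% prediction interval is omitted because it needs a t-distribution quantile that the Python standard library does not provide.

```python
import math
from typing import Sequence, Tuple


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dersimonian_laird(estimates: Sequence[float],
                      variances: Sequence[float]) -> Tuple[float, float, float]:
    """Random-effects pooling of center-specific estimates.

    Returns (pooled estimate, its standard error, between-center
    variance tau^2), all on the scale of the inputs.
    """
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need matching estimates and variances for >= 2 centers")
    w = [1.0 / v for v in variances]                 # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)               # between-center variance
    w_star = [1.0 / (v + tau2) for v in variances]   # random-effects weights
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sw_star
    return pooled, math.sqrt(1.0 / sw_star), tau2
```

With center-specific AUCs, one would pass `logit(auc_i)` and the variance of each logit-AUC, then back-transform the pooled value and its confidence limits with `expit`.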
Subgroup analyses were performed for menopausal status and type of center.

Methods to address potential sources of bias
We implemented several procedures to reduce potential bias. First, we followed a prespecified statistical analysis plan to avoid selecting analyses based on results. Second, to handle differential verification, we used prespecified criteria to determine whether tumor outcome was benign, malignant or uncertain (Table 1). Third, the primary analysis included all patients after multiple imputation of missing CA125 levels and uncertain outcomes (Appendix S4). Excluding participants with uncertain outcome leads to partial verification bias and excluding participants with missing CA125 leads to selection bias 18,27,29,30 . Multiple imputation is a recommended approach to avoid such exclusions 30 . Fourth, we performed a prespecified sensitivity analysis, in which we expanded the definition of uncertain outcomes to include all groups in which subjective assessment of ultrasound images was used to label outcomes as benign or malignant (B2, M2,3 and U1-4 in Table 1). This was done to address possible optimistic bias due to differential verification. Fifth, we used prespecified criteria for data quality in order to include only data from centers with consecutive inclusion and sufficiently complete and accurate data (Appendix S2). This may limit potential attrition bias by avoiding exclusions at patient-level (instead, we excluded entire centers) and limits the number of uncertain outcomes. Finally, an additional prespecified analysis was performed in which masses with uncertain outcome as per Table 1 were excluded. This was done for completeness only, because exclusions based on missing data result in high risk of bias 18,30 .

RESULTS
Patient flow is shown in Figure S1. A total of 8519 patients recruited at 36 centers were included in the interim dataset of IOTA5 (Table S3). Twenty-five patients were excluded due to withdrawal of consent. Another 2777 patients from 19 centers were excluded from the primary analysis: one center terminated participation, seven centers recruited < 50 participants, three centers were excluded due to non-consecutive recruitment and eight centers due to suboptimal data quality. Suboptimal data quality was explained by lack of staff (three centers), problems with information technology (two centers) and/or difficulties with making patients return for planned follow-up visits (four centers) (Appendix S2). Of the 5717 patients in the remaining 17 centers, 812 (14%) patients had a mass that was already being followed up in the recruitment center before inclusion. Therefore, 4905 patients were included in our primary analysis (Table S1). In 4151 (85%) of the 4905 women, the ultrasound examiner's suggestion for management was followed, in 445 (9%) it was not followed and in 309 (6%) the actual management was unknown.
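The patient-flow numbers reported above are internally consistent, which can be checked with a few lines of arithmetic. The variable names are illustrative only; every number is taken from the text.

```python
# Patient flow, as reported in the text.
recruited = 8519
withdrew_consent = 25
excluded_with_centers = 2777          # patients in the 19 excluded centers
already_in_followup = 812

remaining = recruited - withdrew_consent - excluded_with_centers
analysed = remaining - already_in_followup

# Reasons for excluding centers (number of centers per reason).
excluded_centers = {
    "terminated participation": 1,
    "recruited < 50 participants": 7,
    "non-consecutive recruitment": 3,
    "suboptimal data quality": 8,
}
included_centers = 36 - sum(excluded_centers.values())

# Agreement between suggested and actual management.
management = {"followed": 4151, "not followed": 445, "unknown": 309}
management_pct = {k: round(100 * v / analysed) for k, v in management.items()}
```

Running this gives 5717 patients in the remaining 17 centers and 4905 patients in the primary analysis, with management percentages of 85%, 9% and 6%, matching the reported figures.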
Patient and tumor characteristics are shown in Table 2. Median age of the 4905 patients was 48 (interquartile range (IQR), 36-62; range, 18-98) years, 2151 patients (44%) were postmenopausal, 2140 (44%) had a dominant mass that was a unilocular cyst and 1734 (35%) had a dominant mass containing solid components. Median maximum lesion diameter was 55 (IQR, 38-83; range, 7-751) mm and 2031 masses (41%) had no detectable blood flow on color or power Doppler (color score 1). Information on CA125 was missing in 2620/4905 (53%) patients. Missing CA125 values were less common for patients who underwent surgery (32%) or who had a tumor considered probably malignant (23%) or certainly malignant (14%) (Table S4). In all, 3441 (70%) tumors were benign, 978 (20%) were malignant (borderline or invasive) and for 486 (10%) tumors the outcome was uncertain (Table S1). Uncertain outcome was explained by loss to follow-up (n = 432) or by conflicting information during follow-up (n = 54) (Table 1). Loss to follow-up was more common when conservative management was suggested (13%) than when surgery was suggested (5%), and more common when the diagnosis based on subjective assessment was benign (21%) compared with when it was uncertain or malignant (16%) (Table S5). A smaller proportion of tumors in this study manifested malignant ultrasound features compared with the development dataset of ADNEX. This is because the development set included only patients that underwent surgery 17 .
The overall calibration curves for the two-step strategies are shown in Figure 3. Risk estimates were slightly underestimated and heterogeneity between centers was observed (Figure S2). The summary decision curves of the two-step strategies overlapped completely with the curves showing the results when using ADNEX in all tumors (Figure 4).
The ability of the two-step strategies to discriminate between different tumor types is shown in Table S8. With two exceptions, the two-step strategies manifested similar discriminative ability: using ADNEX with CA125 as the second step instead of ADNEX without CA125 discriminated better between Stage-II-IV and Stage-I ovarian malignancy (AUC, 0.81 vs 0.72) and between Stage-II-IV ovarian malignancy and metastases (AUC, 0.76 vs 0.64). For discrimination between benign tumors and each malignant subtype, AUCs ranged from 0.91 to 0.98. Calibration of the predicted risks for the five subgroups of tumor was good for both two-step strategies, albeit with some underestimation of the risk of secondary metastasis (Figure S3).

Subgroup analyses
The modified BDs were applicable less often in postmenopausal women (24%, 509/2151) compared with premenopausal women (47%, 1289/2754), and the prevalence of malignancy among tumors to which a modified BD applied was higher in postmenopausal compared with premenopausal women (1.0% vs 0.5%) (Table S9). The modified BDs were applicable less often in patients examined in oncology centers (33%, 1020/3094) compared with those examined in non-oncology centers (43%, 778/1811), and the prevalence of malignancy among tumors to which a modified BD applied was higher in oncology centers compared with in non-oncology centers (0.8% vs 0.4%) (Table S10). The discriminative ability of the two-step strategies was similar in pre- and postmenopausal women, but the two-step strategies were better calibrated in postmenopausal women (Figures S4-S9, Table S11). The discriminative ability and the calibration of the two-step strategies were similar in oncology centers and non-oncology centers (Figures S10-S15, Table S12).

Additional analyses
The results of the additional analyses are shown in Tables S13 and S14 and Figures S16-S21. Omitting patients with uncertain tumor outcome from the analysis slightly increased the overall prevalence of malignancy compared with the primary analysis (22.1% vs 21.1%), but had minimal effect on discriminative performance and calibration. Our sensitivity analysis, in which we expanded the definition of uncertain outcome, resulted in 1419 (29%) tumors with uncertain outcome and an AUC of the two-step strategy without CA125 of 0.93 (95% CI, 0.91-0.95; 95% PI, 0.82-0.98).

DISCUSSION
We describe the diagnostic performance of the modified BDs and of a two-step strategy using the modified BDs as a first step and ADNEX as a second step when applied to patients managed either surgically or conservatively. The results indicate that the modified BDs are applicable in almost 40% of patients with an adnexal mass, that the risk of malignancy is very low if a modified BD applies, and that the two-step strategy has excellent discriminative performance and is reasonably well calibrated.
The study strengths include: first, the large sample size and high number of participating centers; second, the prospective ultrasound protocol with agreed ultrasound terms, definitions and measurement techniques; and finally, consecutive inclusion of patients managed either surgically or conservatively.
We acknowledge four limitations related to potential bias. First, we excluded all data from 19 centers because they did not fulfill our predefined quality criteria. This means that, as in any study, we cannot rule out selection bias at center level. We made these exclusions to obtain data that were informative (excluding centers with limited recruitment), representative (excluding centers with non-consecutive recruitment) and reliable (excluding centers with low-quality data or centers that stopped participation). This also reduced the potential for attrition bias (exclusions at patient level). Including other centers could have resulted in higher or lower performance due to case-mix heterogeneity 31 . We do not expect our exclusion of centers to have resulted in an overestimation of diagnostic performance, because we do not expect the quality of the ultrasound examinations to be lower in the excluded than in the included centers. In a study on the same dataset, a sensitivity analysis using immediately operated patients from all 36 centers resulted in the same AUC for ADNEX as the primary analysis based on 17 centers 17 . Second, the tumor outcome was based on multiple reference standards (differential verification), and for a small group of patients, tumor outcome could not be determined due to conflicting information or insufficient follow-up (partial verification) 29 . We addressed the potentially optimistic bias from partial and differential verification by using multiple imputation and a sensitivity analysis 27 . Model performance changed very little depending on the definition of uncertainty that we used. Third, 53% of the patients had a missing value for CA125. This affects the performance of ADNEX with CA125 but not that of the modified BDs or ADNEX without CA125. To deal with missing CA125 values, we used multiple imputation, which is the recommended approach to reduce bias due to missing values 18,30 .
Excluding cases with missing CA125 values is likely to bias AUCs downwards, because missing values are most common among tumors judged to be benign on ultrasound (Table S4) 32 . The high number of missing values adds uncertainty to the results, and imputing multiple times acknowledges this as reflected in wider confidence intervals around performance estimates. Fourth, there was no blinding of examiners to previous information about the patient or of pathologists to clinical information. Imposing such blinding would be unethical and unrealistic. Lack of blinding may induce information bias when assessing predictors and detection bias when determining outcome based on histology. Information bias may lead to overestimation of performance (even though, importantly, the outcome was unknown at the inclusion scan when the predictors were assessed). We consider detection bias to be limited. Pathologists are unlikely to be influenced by preoperative ultrasound findings. This assumption is supported by findings in the IOTA Phase-1 study, in which the results of central pathology review (blinded to clinical information) were highly similar to local results 8 . In summary, we used several recommended approaches to reduce bias. Information bias, an unavoidable clinical reality, may nevertheless have biased performance optimistically.
This study is the first to evaluate the performance of the modified BDs, to suggest and evaluate the performance of a two-step strategy using the modified BDs as a first step and ADNEX as a second step, and to do this in patients managed either surgically or conservatively. Three studies have externally validated a three-step strategy using both the benign and malignant original simple descriptors as a first step, followed by the IOTA simple rules 10 as a second step and by subjective assessment by an expert as a final step. All three studies included patients managed either surgically or conservatively and showed the three-step strategy to have excellent ability to discriminate between benign and malignant adnexal masses [33][34][35] . We believe that our two-step strategy has advantages over the three-step strategy: all tumors can be classified by a single ultrasound examiner, and a risk of malignancy is assigned to all tumors as well as a likelihood estimate of type of malignancy.
It is reassuring that the discriminative ability of ADNEX when applied to tumors to which the modified BDs do not apply was almost as good (AUC, 0.92 (with CA125) and 0.91 (without CA125)) as when ADNEX was applied on all tumors. When applied on all 4905 tumors, the AUC for ADNEX both with and without CA125 was 0.94 17 . Moreover, the discriminative performance of the two-step strategies (AUC, 0.95 and 0.94) was similar to that of using ADNEX on all 4905 masses, and the clinical utility of the two-step strategies was the same as that of applying ADNEX on all masses (Figure 4). This shows that using ADNEX on all masses has no advantage over using the two-step strategy.
Two issues require further research. First, evaluating ultrasound images is affected by the level of experience of the examiner. The IOTA5 study involved mainly Level-II and -III examiners. Even though previous studies have suggested that ADNEX works well also in the hands of less experienced examiners 36 , the role of experience should be investigated more explicitly. A large multicenter study, in which examiner experience is quantified before patient recruitment, could elucidate whether and how experience affects discrimination and calibration performance. Second, we observed heterogeneity between centers regarding discrimination and calibration for all models validated on IOTA Phase-5 data 17 . When more data become available, it will be important to study possible reasons for this heterogeneity.
In conclusion, the two-step strategy lends itself very well to clinical use. A large proportion of adnexal masses can be classified by the modified BDs as having a very low risk of malignancy without computer support. For the remaining masses, an estimate of risk of malignancy and type of malignancy can be obtained using the ADNEX model. An ADNEX calculator is available online and as an application for smartphones (https://iotagroup.org/iotamodels-software/adnex-risk-model). It is also embedded in many ultrasound machines. This facilitates its use in clinical practice. The two-step strategy can be used for patient counseling to individualize management. It could also be used to stratify patients into risk groups, such as the ovarian-adnexal reporting and data system (O-RADS) risk groups [37]. Risk stratification can facilitate selection of optimal management for patients with adnexal masses [38].
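The decision flow of the two-step strategy can be illustrated with a minimal Python sketch. It is not the IOTA implementation: the function name `two_step_risk`, the `TwoStepResult` type and all numeric values are hypothetical, the risk assigned when a modified BD applies is a caller-supplied placeholder (the a-priori risks used in the study are given in Appendix S5 and are not reproduced here), and the ADNEX model is represented only by a callable that returns a risk, not by its actual formula.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TwoStepResult:
    step: str    # which step produced the estimate: "modified BDs" or "ADNEX"
    risk: float  # estimated risk of malignancy, between 0 and 1


def two_step_risk(
    bd_applies: bool,
    bd_risk: float,
    adnex_risk: Callable[[], float],
) -> TwoStepResult:
    """Hypothetical sketch of the two-step flow.

    bd_applies: True if one of the modified benign descriptors applies.
    bd_risk:    placeholder risk assigned in step 1 (an assumption here).
    adnex_risk: stand-in for an ADNEX calculator; called only in step 2.
    """
    if bd_applies:
        # Step 1: a modified BD applies, so no model calculation is needed.
        step, risk = "modified BDs", bd_risk
    else:
        # Step 2: no modified BD applies, so fall back on the ADNEX estimate.
        step, risk = "ADNEX", adnex_risk()
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk of malignancy must lie between 0 and 1")
    return TwoStepResult(step, risk)


# Usage with placeholder numbers (illustrative only, not study results).
first = two_step_risk(True, 0.01, lambda: 0.30)
second = two_step_risk(False, 0.01, lambda: 0.30)
print(first.step, first.risk)
print(second.step, second.risk)
```

Passing the ADNEX estimate as a callable reflects the point made above: the model only needs to be evaluated for masses to which no modified BD applies.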

SUPPORTING INFORMATION ON THE INTERNET
The following supporting information may be found in the online version of this article:

Table S1
Descriptive statistics of centers that were included in primary analysis

Table S7
Sensitivity and specificity of two-step strategies for prediction of malignancy at prespecified risk thresholds (n = 4905; meta-analysis)

Table S8
Pairwise areas under the receiver-operating-characteristics curve (AUC) and polytomous discrimination index for two-step strategies (n = 4905; pooled analysis)

Table S9
Tumor subtypes for masses to which modified benign descriptors applied by menopausal status (n = 1798; pooled data)

Table S10
Tumor subtypes for masses to which modified benign descriptors applied by type of center (n = 1798; pooled data)

Table S11
Sensitivity and specificity of two-step strategies for prediction of malignancy by menopausal status

Table S12
Sensitivity and specificity of two-step strategies for prediction of malignancy by type of center at which patients were examined

Table S13
Outcome of masses to which modified benign descriptors applied in two prespecified additional analyses (pooled analysis)

Table S14
Sensitivity and specificity of two-step strategies for prediction of malignancy, for two prespecified additional analyses

Appendix S1
Standardized questionnaire for IOTA5.

Appendix S2
Details on data quality and exclusion of centers.

Appendix S3
Assessment of Different NEoplasias in the adneXa (ADNEX) model.

Appendix S4
Details on imputation and statistical analysis.

Appendix S5
A-priori estimated risks when modified benign simple descriptors apply.

Appendix S6
Discussion of sample size.

Figure S2
Calibration curves per center for two-step strategies (n = 4905).

Figure S4
Forest plots with center-specific areas under receiver-operating-characteristics curve of two-step strategies and results of meta-analysis in premenopausal patients (n = 2754).

Figure S5
Forest plots with center-specific areas under receiver-operating-characteristics curve of two-step strategies and results of meta-analysis in postmenopausal patients (n = 2151).

Figure S6
Overall calibration curves of two-step strategies in premenopausal patients (n = 2754; meta-analysis).

Figure S7
Calibration curves per center of two-step strategies in premenopausal patients (n = 2754).

Figure S8
Overall calibration curves of two-step strategies in postmenopausal patients (n = 2151; meta-analysis).

Figure S9
Calibration curves per center of two-step strategies in postmenopausal patients (n = 2151).

Figure S10
Forest plots with center-specific areas under receiver-operating-characteristics curve of two-step strategies and results of meta-analysis in patients examined in oncology centers (n = 3094).

Figure S11
Forest plots with center-specific areas under receiver-operating-characteristics curve of two-step strategies and results of meta-analysis in patients examined in non-oncology centers (n = 1811).

Figure S12
Overall calibration curves of two-step strategies in patients examined in oncology centers (n = 3094; meta-analysis).

Figure S13
Calibration curves per center of two-step strategies in patients examined in oncology centers (n = 3094).

Figure S14
Overall calibration curves of two-step strategies in patients examined in non-oncology centers (n = 1811; meta-analysis).

Figure S15
Calibration curves per center of the two-step strategies in patients examined in non-oncology centers (n = 1811).

Figure S16
Forest plots with center-specific areas under receiver-operating-characteristics curve (AUCs) of two-step strategies and results of meta-analysis for additional analysis in which patients with uncertain outcome were omitted (n = 4419).

Figure S17
Forest plots with center-specific areas under receiver-operating-characteristics curve (AUCs) of two-step strategies and results of meta-analysis for sensitivity analysis in which broader definition of uncertain outcome was used (n = 4905).

Figure S18
Overall calibration curves of two-step strategies for analysis in which patients with uncertain outcome were omitted (n = 4419; meta-analysis).

Figure S19
Calibration curves per center of two-step strategies for analysis in which patients with uncertain outcome were omitted (n = 4419).

Figure S20
Overall calibration curves of two-step strategies for sensitivity analysis in which broader definition of uncertain outcome was used (n = 4905; meta-analysis).

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.