Developing symptom-based predictive models of endometriosis as a clinical screening tool: results from a multicenter study

Objective: To generate and validate symptom-based models to predict endometriosis among symptomatic women prior to undergoing their first laparoscopy.
Design: Prospective, observational, two-phase study, in which women completed a 25-item questionnaire prior to surgery.
Setting: Nineteen hospitals in 13 countries.
Patient(s): Symptomatic women (n = 1,396) scheduled for laparoscopy without a previous surgical diagnosis of endometriosis.
Intervention(s): None.
Main Outcome Measure(s): Sensitivity and specificity of endometriosis diagnosis predicted by symptoms and patient characteristics from optimal models developed using multiple logistic regression analyses in one data set (phase I), and independently validated in a second data set (phase II) by receiver operating characteristic (ROC) curve analysis.
Result(s): Three hundred sixty (46.7%) women in phase I and 364 (58.2%) in phase II were diagnosed with endometriosis at laparoscopy. Menstrual dyschezia (pain on opening bowels) and a history of benign ovarian cysts most strongly predicted both any and stage III and IV endometriosis in both phases. Prediction of any-stage endometriosis, although improved by ultrasound scan evidence of cyst/nodules, was relatively poor (area under the curve [AUC] = 68.3). Stage III and IV disease was predicted with good accuracy (AUC = 84.9, sensitivity of 82.3% and specificity 75.8% at an optimal cut-off of 0.24).
Conclusion(s): Our symptom-based models predict any-stage endometriosis relatively poorly and stage III and IV disease with good accuracy. Predictive tools based on such models could help to prioritize women for surgical investigation in clinical practice and thus contribute to reducing time to diagnosis. We invite other researchers to validate the key models in additional populations.

Endometriosis is a chronic disease defined by the presence of endometrial-like tissue outside the uterus (1). Diagnostic delay ranging from 7 to 12 years is well documented in endometriosis (2)(3)(4)(5) and contributes to the impaired quality of life and significant personal and societal costs associated with the condition (5). Surgery under general anaesthesia, most commonly a laparoscopy, is required to make a definitive diagnosis but this is expensive and potentially associated with complications (6,7). The availability of a noninvasive method to evaluate the likelihood of finding endometriosis at laparoscopy could reduce the diagnostic delay (8) and the number of women undergoing surgery unnecessarily. Accordingly, and in keeping with a consensus statement on research priorities in endometriosis (9), the last decade has seen much effort directed at identifying nonsurgical methods of diagnosing the disease or at least predicting its presence, which can then inform therapies for associated pelvic pain and/or infertility. Peripheral biomarkers (e.g., CA-125), endometrial biomarkers (e.g., endometrial nerve fibre density), and imaging have all been evaluated (10,11), showing varying degrees of accuracy and potential clinical utility. Generally, however, these procedures predict endometriosis inadequately and are largely either invasive or semi-invasive themselves. In recent years, the potential for using clinical information to predict endometriosis before surgery has been explored.
In one such study of 90 women undergoing diagnostic laparoscopy (12), a classification tree based on symptoms, physical examination, and ultrasound findings correctly predicted only 38% of nonovarian endometriosis, an unsurprising finding given the generally limited predictive accuracy of classification trees (13). Another related study described the development of an externally validated predictive model of pregnancy rates following a surgical diagnosis of endometriosis, but prediction was not based solely on symptoms. Two other key studies have evaluated the likelihood of finding deep infiltrating (14) and bladder (15) endometriosis based on standardized symptom questionnaires, showing "acceptable" and "excellent" diagnostic value, respectively, for these subtypes. However, the studies used relatively small samples and, although internal cross-validation within the data set was attempted in one of the studies (14) through a bootstrap method, the models have never been validated in external data sets. Because predictive models typically perform better on the data from which they were generated than on new data, external validation is essential before implementing predictive models in clinical practice (16)(17)(18).
In addition to the limitations highlighted above, the global utility of predicting endometriosis clinically is often hampered by inconsistencies in the definition of the disease and associated symptoms across studies and populations (19). The question as to whether a valid predictive model based on symptoms associated with endometriosis, generalizable across populations in different countries, can be generated, led us to initiate the Women's Health Symptom Survey (WHSS). Its aims were to [1] develop symptom-based models that predict the likelihood of finding endometriosis at laparoscopy in women who are being investigated for endometriosis-associated pain and/or infertility and [2] determine the sensitivity and specificity of the models in predicting the likelihood of a diagnosis of endometriosis in a separate validation sample of women. Although physical examination plays an important role in the clinical evaluation of symptomatic women, in this study, the models were generated solely on the basis of symptom/medical history profiles and ultrasound evidence, to allow for standardized evaluation across centers and countries. Such models, if shown to have good predictive power, could be used to generate a screening assessment tool to prioritize women for further surgical evaluation.

Study Population
The WHSS was a two-phase (model development/validation), clinic-based study in 19 hospitals in 13 countries. Between September 2008 and January 2010, we prospectively recruited 1,396 consecutive pre-menopausal women, aged 18-45, undergoing diagnostic laparoscopy because of at least one of the following symptoms: dysmenorrhoea (34.0% in phase I vs. 31.8% in phase II), dyspareunia (12.3% in phase I vs. 14.5% in phase II), nonmenstrual pelvic pain (36.1% in phase I vs. 37.0% in phase II), menstrual dyschezia (6.9% in phase I vs. 8.0% in phase II), or infertility (56.5% in phase I vs. 47.5% in phase II). Women with a previous surgical diagnosis of endometriosis, amenorrhoea or current pregnancy, or who had taken hormonal medication (including combined oral contraception) within the previous 3 months, were excluded.
During the model-generating phase I (September 2008 to June 2009), consenting women who met the inclusion criteria completed, prior to their scheduled surgery, a 25-item self-administered questionnaire in their own language (www.endometriosisfoundation.org/WERF-WHSS-Questionnaire-English.pdf). During the model-validating phase II (July 2009 to January 2010), premenopausal women, aged 18-45 years, were recruited using the same inclusion and exclusion criteria as in phase I; they completed the same 25-item questionnaire before their surgery. The questionnaire incorporated items to elicit women's past medical, obstetric, and family histories, as well as items to evaluate the intensity and frequency of pelvic pain. Pelvic pain intensity was assessed on 11-point numerical pain rating scales (20) ranging in possible values from 0 (no pain) to 10 (worst possible pain). The questionnaire also included standardized questions previously validated in women with pelvic pain or other symptom groups. These instruments included [1] the IBS Rome III questionnaire to identify women with pelvic pain symptoms due to irritable bowel syndrome (21) and [2] standardized pelvic pain symptom assessment used in earlier studies in Oxford (22,23). The questionnaire also asked for sociodemographic, lifestyle, and physical attributes. Experienced gynecologists recorded the laparoscopic findings in a standard manner (http://www.endometriosisfoundation.org/WERF-GSWH-WHSS-surgical-sheet.pdf). For those women who had preoperative pelvic ultrasound (84.7% and 92.2% of women in phases I and II, respectively), imaging findings were recorded on the surgical sheet.
Cases were defined as women in the study populations who, at laparoscopy, were found to have endometriosis, diagnosed on visual evidence alone according to the European Society of Human Reproduction and Embryology guideline (1) and staged using the revised American Fertility Society classification: I (minimal), II (mild), III (moderate), or IV (severe) (24). Controls were women in the study populations without endometriosis (with or without other diagnoses) at laparoscopy. The Mid and South Buckinghamshire Research Ethics Committee in the United Kingdom approved the study, followed by approval from all the local ethics committees.

Statistical Analysis
Model generation. Women who had [1] any stage of endometriosis and [2] stage III or IV endometriosis were compared to controls. To compare cases and controls on categorical variables, Pearson's χ² tests or Fisher's exact test were used where appropriate. Continuous parametric variables were assessed using Student's t-test and nonparametric variables with the Mann-Whitney U test.
As the outcome of interest (presence of any-stage and stage III or IV endometriosis) was binary in nature, logistic regression was used for the predictive modeling. The WHSS questionnaire contained more than 200 variables, which, if entered into one logistic regression model, would have resulted in overfitting of models to the data. As a guide, in the final model, the number of degrees of freedom (df) should not exceed approximately 10% of the number of observations in the smaller outcome category (25). We therefore employed a tiered approach to building the predictive model in phase I (Fig. 1).
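To make the 10% guide concrete, the short sketch below applies it to the phase I any-stage counts reported in this paper (360 cases among 771 women). The function name `max_degrees_of_freedom` and its `percent` parameter are hypothetical, introduced only for illustration; the guide is a rule of thumb, not a hard limit.

```python
def max_degrees_of_freedom(n_cases: int, n_controls: int, percent: int = 10) -> int:
    """Approximate ceiling on model df: a percentage of the smaller outcome group.

    Integer arithmetic is used so the result is always rounded down.
    """
    smaller_group = min(n_cases, n_controls)
    return smaller_group * percent // 100


# Phase I any-stage counts taken from this paper.
phase1_total = 771
phase1_cases = 360
phase1_controls = phase1_total - phase1_cases  # women without endometriosis

print(max_degrees_of_freedom(phase1_cases, phase1_controls))
```

With more than 200 candidate variables, a ceiling of this size is what motivates reducing the variable set in tiers rather than fitting a single model.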
First, groups of clinically related variables were assessed in a multivariate logistic regression framework, to assess which variables within each group showed little or no association with endometriosis, and could therefore be excluded. For each group, this was done iteratively by first excluding variables for which tests of association with endometriosis resulted in significance levels of P>.5, progressing to dropping variables with P>.2, resulting in a submodel for each of the groups. In each group, the goodness of fit of the submodel to the data was tested using the Hosmer and Lemeshow test (26), whereas the drop in Nagelkerke's R² (assessing the disease variance explained by the variables in the model) when removing variables was not to exceed 10%. Variables from each of the submodels were then included in the complete models, which were again reduced iteratively by considering [1] the Hosmer and Lemeshow test for goodness of fit, [2] the drop in Nagelkerke's R² when removing variables, and [3] the significance of the association of each variable in the model with endometriosis (dropping variables from P>.5 down to P>.2). This resulted in best-fitting final models for the prediction of any-stage and stage III and IV endometriosis based on the phase I data. For each of the two diagnostic outcomes, one model including and one model excluding preoperative ultrasound evidence of cysts/nodules were generated (Supplemental Table 1). Final models were subsequently fitted to phase II data to assess their predictive performance (see model validation below). In addition, reduced models were generated, which excluded variables in the best-fitting final models that had opposite directions of effect in phase I and II data sets. These reduced models were also assessed for performance; thus, a total of eight model types were generated and assessed (Supplemental Table 1).
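The Nagelkerke's R² criterion used during model reduction can be illustrated with a minimal sketch. It uses the standard definition of Nagelkerke's R² in terms of the log-likelihoods of the intercept-only and fitted models. The function names, the example log-likelihood values, and the reading of "10%" as a relative (rather than absolute) drop are assumptions made for illustration and are not taken from the study's analysis files.

```python
import math


def nagelkerke_r2(ll_null: float, ll_model: float, n: int) -> float:
    """Nagelkerke's R^2 from log-likelihoods of the null and fitted models."""
    # Cox & Snell R^2: 1 - (L0 / L1)^(2/n), written with log-likelihoods.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Largest value Cox & Snell R^2 can reach: 1 - L0^(2/n).
    cox_snell_max = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / cox_snell_max


def drop_is_acceptable(r2_before: float, r2_after: float,
                       max_relative_drop: float = 0.10) -> bool:
    """Assumed reading of the rule: relative loss in R^2 must stay within 10%."""
    return (r2_before - r2_after) / r2_before <= max_relative_drop


# Hypothetical log-likelihoods for a sample of 771 women (illustrative only).
r2_full = nagelkerke_r2(ll_null=-532.0, ll_model=-380.0, n=771)
r2_reduced = nagelkerke_r2(ll_null=-532.0, ll_model=-390.0, n=771)
print(round(r2_full, 3), round(r2_reduced, 3),
      drop_is_acceptable(r2_full, r2_reduced))
```

In an actual reduction step, this check would be applied alongside the Hosmer and Lemeshow test and the P-value thresholds described above.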

Model validation.
To illustrate the performance of the derived predictive models in the phase I population data, and the relative drop in performance when externally validating in phase II data, the accuracy of the models in predicting any-stage and stage III and IV endometriosis was assessed by analyzing the receiver operating characteristic (ROC) curve in both phase I and II. The ROC curve displays the relationship between sensitivity and 1 - specificity, and the area under the ROC curve (AUC) depicts how well the model distinguishes women with and without endometriosis; a model with a greater AUC has a better-performing risk function. Model sensitivity, specificity, and positive and negative likelihood ratios were also calculated, and the best model cut-off points were considered to be those that corresponded to the highest sum of specificity and sensitivity. All univariate and logistic regression analyses were done using the Statistical Package for the Social Sciences 16.0 (SPSS, Inc.); prediction analyses in both phase I and II populations were conducted within the binary logistic module in S-PLUS 6.0 (TIBCO Software, Inc.); and ROC analysis using MedCalc 11.6 (MedCalc Software).
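The two quantities described above, the AUC and the cut-off maximizing the sum of sensitivity and specificity, can be sketched in a few lines. This is an illustrative pure-Python version on made-up predicted probabilities, not the MedCalc procedure used in the study; the function names and the toy data are hypothetical. Note that the AUC is returned on a 0 to 1 scale, whereas this paper reports it multiplied by 100.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random case scores above a random control.

    Ties between a case and a control count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def best_cutoff(scores, labels):
    """Cut-off with the highest sensitivity + specificity.

    A woman is predicted to have disease when her score is >= the cut-off.
    On ties, the lowest such cut-off is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (cut, sens, spec)
    return best


# Toy predicted probabilities and true disease status (1 = endometriosis).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]

print(roc_auc(toy_scores, toy_labels))
print(best_cutoff(toy_scores, toy_labels))
```

Maximizing sensitivity + specificity is equivalent to maximizing Youden's J statistic (sensitivity + specificity - 1), which weights false positives and false negatives equally.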

Description of the Predictive Models
As shown in Supplemental Figure 1, 771 (phase I) and 625 (phase II) of the women recruited met the inclusion criteria and had complete surgical information at the close of the study. Among participants, the proportions of cases at centers varied from 35% in Ibadan to 97% in Guangzhou. Supplemental Table 2 shows the average age of case and control women in both phases and the frequency of pathology found at surgery. Endometriosis was diagnosed in 360 (46.7%) women in phase I and 364 (58.2%) women in phase II.
The results of the best-fitting final models for any-stage and stage III and IV endometriosis are shown in Supplemental Table 3 and Table 1, respectively. The results of the reduced models (only retaining variables with consistent evidence in phases I and II [see Methods]) are not shown, but these variables are highlighted in bold. The full "any-stage no ultrasound" model 1 (see Supplemental Table 3) had 26 variables but 12 of these (46%) were dropped in the reduced model 2 (see Methods). Similarly, 9 of 23 variables (39%) in the full "any-stage ultrasound" model 3 were dropped in the corresponding reduced model 4. Notably, menstrual dyschezia (pain on opening bowels during periods) and a medical history of benign ovarian cysts were most strongly associated with any-stage endometriosis in models with ultrasound (Phase II OR = 3.12, 95% CI = 1.07-9.10, P=.037, and OR = 2.92, 95% CI = 1.35-6.30, P=.006, respectively) and without ultrasound (Phase II OR = 3.47, 95% CI = 1.40-8.57, P=.007, and OR = 4.15, 95% CI = 2.19-7.86, P<.001, respectively). Rectal bleeding during menstruation, IBS (Rome III), unspecified functional bowel disorder, duration of smoking, subfertility due to blocked tubes, and ethnicity (Asian/Oriental and other/mixed) were inconsistently associated with endometriosis in both any-stage endometriosis models. The any-stage ultrasound model 3 explained substantially more variability in endometriosis than the any-stage no ultrasound model 1 (Nagelkerke's R² = 0.54 vs. 0.44 in phase I). Although the any-stage model including ultrasound (model 3) retained its R² value of 0.54 in phase II, its value dropped to 0.30 for model 1, indicating better predictive performance of the models that include ultrasound evidence.
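As a reading aid, the relationship between a reported odds ratio, its 95% CI, and its P value can be checked with a short calculation. The sketch below assumes a Wald-type interval that is symmetric on the log-odds scale, which the paper does not state explicitly; the function name `wald_from_ci` is hypothetical. It uses the phase II estimates for menstrual dyschezia (model with ultrasound) and history of benign ovarian cysts (model without ultrasound) quoted above.

```python
import math


def wald_from_ci(odds_ratio: float, ci_low: float, ci_high: float,
                 z_crit: float = 1.96):
    """Recover SE of log(OR), Wald z, and two-sided P from an OR and its 95% CI.

    Assumes the CI was built as exp(log(OR) +/- z_crit * SE).
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_value


# Phase II estimates quoted in the text.
dyschezia_us = wald_from_ci(3.12, 1.07, 9.10)   # reported P = .037
cysts_no_us = wald_from_ci(4.15, 2.19, 7.86)    # reported P < .001

for label, (se, z, p) in [("menstrual dyschezia, with ultrasound", dyschezia_us),
                          ("benign ovarian cysts, no ultrasound", cysts_no_us)]:
    print(f"{label}: SE = {se:.3f}, z = {z:.2f}, P = {p:.4f}")
```

Under this assumption the recomputed P values agree with those reported, up to rounding of the published OR and CI limits.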
The full stage III and IV no ultrasound model 5 (Table 1), had 23 variables, but 6 of these were dropped in the reduced model 6. In comparison, 3 of 18 variables in the full stage III

Validation of the Predictive Models
All four full models were evaluated for their predictive performance in the phase II data. As expected, ultrasound evidence alone showed high sensitivity but very low specificity in the prediction of any or stage III or IV endometriosis (Table 2), that is, positive ultrasound evidence was a good predictor of the presence of endometriosis, but negative evidence was a very poor predictor of absence of disease. As shown in Table 2 and Figure 2, the full any-stage no ultrasound model 1 had good discrimination in phase I data (AUC = 84.2, 95% CI = 81.1-87.0, P<.0001), but its performance was reduced substantially when applied to the phase II validation data set (AUC = 68.3, 95% CI = 63.9-72.4, P<.0001). Predictive ability in phase II data was somewhat improved in the reduced model 2 to AUC = 72.2 (95% CI = 68.1-76.1, P<.001), although the optimal model cut-off of 0.59 would provide a low sensitivity of 54%. Similarly, the full any-stage ultrasound model 3 had good discrimination in phase I data (AUC = 87.3, 95% CI = 84.2-90.0, P<.0001), and although some reduction in its performance was evident when applied to phase II data (AUC = 80.0, 95% CI = 75.6-83.3, P<.0001), this reduction was not as substantial as for the any-stage no ultrasound model 1. The corresponding reduced model 4 improved predictive ability in phase II data to AUC = 85.1 (95% CI = 81.5-88.2, P<.001), its optimal model cut-off of 0.51 providing a sensitivity of 80% and a specificity of 77% (cf. respective values of 0.80, 58% and 89% for the full model).
The full stage III or IV no ultrasound model 5 had good discrimination in phase I data (AUC = 87. 3 reduced model optimal cut-off was 0.29, with a sensitivity of 80% and a specificity of 80% (cf. respective values of 0.24, 82%, and 76% for the full model).

DISCUSSION
In this multicenter study of symptomatic premenopausal women presenting with symptoms that were potentially indicative of endometriosis, we show that a combination of symptom characteristics and variables in the medical history, with or without ultrasound evidence of cysts/nodules, can predict the finding of stage III and IV endometriosis at laparoscopy with reasonably good accuracy. The best-fitting predictive model included, along with ultrasound evidence, menstrual dyschezia, ethnicity, and a history of benign ovarian cysts as the variables with the strongest predictive performance. These variables are mostly disease risk factors reported in previous studies. Specifically, menstrual dyschezia is strongly associated with deep infiltrating endometriosis, a severe form of the disease (27), which had a relatively low prevalence in our clinical population (7.0% in phase I, and 7.8% in phase II). The positive association of stage III and IV endometriosis with Black ethnicity, however, conflicts with previous reports which suggest that White and Asian women have a greater risk of disease (28), although these reports generally relate to any-stage, rather than stage III and IV, endometriosis.
In this study, we report the development in one sample population, and validation in another drawn from the same source population, of models predicting [1] any-stage and [2] stage III and IV endometriosis, with or without ultrasound scan evidence. The ability to predict any-stage endometriosis in the model excluding ultrasound evidence was generally poor (AUC = 68.3). Some improvement (AUC = 72.2) could be gained by removing from the model variables with inconsistent association with endometriosis across phase I and II data, but the external validity of the reduced model would need to be evaluated in an independent data set. When including ultrasound scan evidence, the prediction of any-stage endometriosis was improved (AUC = 80.0), but the optimal model cut-off results in a relatively low sensitivity (58%), which reduces the utility of the model as a potential clinical screening tool. The poor predictability of any-stage endometriosis is not surprising given that similar findings are reported for models based on serum markers (29), and stage I (minimal) endometriosis is considered pathogenetically to be different to stage III and IV disease (30)(31)(32)(33). In contrast to our findings, both any-stage and stage III and IV endometriosis were reported to be predictable from the medical history of 1,079 prospectively recruited subfertile women in Portugal (34). However, as the predictive models were not validated in an independent data set, their findings should be interpreted with some caution.

[Figure: Receiver operating characteristic curves for the full models predicting any-stage and stage III or IV endometriosis.]
The models predicting stage III and IV endometriosis (± ultrasound evidence) showed much better performance. Of the two, the model that included ultrasound evidence performed better, which is not surprising given the value of ultrasound to diagnose ovarian endometriomas (35) and deep infiltrating endometriosis affecting the bowel (36). Optimal model cut-offs resulted in sensitivities of 70.9% and 82.3%, and specificities of 84.7% and 75.8%, for models excluding and including ultrasound evidence, respectively. In contrast to any-stage models, only marginal improvement could potentially be gained by excluding variables with inconsistent association with stage III and IV endometriosis across the data from phases I and II; this relative consistency in association across phases provided further evidence of the superior predictability of stage III and IV endometriosis over "any stage" disease.
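To help interpret these operating points, the sketch below converts the sensitivities and specificities quoted above into positive and negative likelihood ratios, using the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The values are recomputed here from the rounded figures in the text and may differ slightly from those in Table 2; the function and dictionary names are hypothetical.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Positive and negative likelihood ratios for a dichotomized test."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Sensitivity and specificity at the optimal cut-offs, as quoted in the text.
stage_3_4_models = {
    "stage III/IV, excluding ultrasound": (0.709, 0.847),
    "stage III/IV, including ultrasound": (0.823, 0.758),
}

for name, (sens, spec) in stage_3_4_models.items():
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    youden_j = sens + spec - 1.0  # the quantity maximized by the cut-off rule
    print(f"{name}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, J = {youden_j:.3f}")
```

The comparison makes the trade-off explicit: at these cut-offs the model without ultrasound yields a larger LR+ (a positive result shifts the odds of disease more), whereas the model with ultrasound yields a smaller LR- (a negative result is more reassuring).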
The stage III and IV endometriosis models could, in addition to ultrasound and physical examination, be used to prioritize women presenting with symptoms for laparoscopy in clinical practice, that is, mirroring the setting in which the present study was conducted, or to initiate medical therapies sometimes reserved until a surgical diagnosis of endometriosis has been made. To what extent the models have predictive power in other settings (e.g., self-selected women with pelvic pain symptoms in the general population) is unknown, and therefore the utility of the tool should not be advocated for this purpose. Indeed, although identifying a noninvasive diagnostic test for endometriosis is an explicit priority in endometriosis research, other authors have cautioned that such a tool could be misappropriated as a population screening tool for a disease that may not fit a population screening model (37).
A potential argument against the use of the models to prioritize symptomatic women for surgery is that a high prevalence of other pathologies was found among controls in this study (72%), which could warrant surgical intervention. However, whether surgical intervention would be deemed appropriate for these pathologies, or indeed whether they were likely to be the underlying reason for the symptoms or a coincidental finding, is a matter for debate. We believe that the stage III and IV prediction models in particular are potentially useful clinically, as the likely presence of moderate/severe disease would be a good basis for prioritization of surgical exploration and intervention.
As far as we know, this is the first study to generate symptom-based predictive models of endometriosis using robust modeling techniques and then validate them externally, in a large prospectively recruited cohort of women across different countries and ethnicities. Previous attempts, focused on subtypes of endometriosis (13,14), have been hampered by small sample sizes (11), and failed to validate models in populations independent of those from which model parameters were generated (34). The enrollment of women from diverse backgrounds according to a uniform set of criteria potentially addresses issues with the global utility of clinical prediction of endometriosis arising from inconsistencies in disease definition across studies and populations. We invite other research groups to validate the key models in this paper in additional populations, as well as in subgroups that may be of specific clinical or population-based interest (e.g., those women who had infertility as the only surgical indication, who had biopsy-proven disease, or who were of a particular ethnicity).
Although the WHSS was designed to improve on the limitations of earlier studies, it had potential limitations of its own. First, endometriosis was diagnosed visually, without histologic confirmation, although this followed the European Society of Human Reproduction and Embryology guideline (23), based on the premise that negative histology does not exclude the presence of disease. Consequently, disease status may have been inappropriately assigned; however, participating hospitals were experienced in diagnosing endometriosis. In a separate diagnostic validation study, 29 surgeons from the participating centers viewed nine standardized videos to allow, in a blinded manner, the assessment of consistency in diagnosis and staging of disease. Preliminary analysis suggested substantial inter-rater agreement in disease identification and staging (both Fleiss' κ > 0.60; C. Becker and K. May, unpublished data). Second, the generation of the models with reduced numbers of variables was based on both phase I and phase II data. More parsimonious models are always preferable (38); however, because they are partly based on phase II evidence, they would require additional external validation. Third, although a strength of the study was that results were generated using data from a wide variety of clinical centers worldwide covering a range of patient profiles, this meant that the results may have been affected to some extent by selection bias possibly arising from [1] differential frequency of concomitant pathologies, in particular the higher proportion of women with nonendometriotic adhesions among controls compared to cases (32.4% vs. 14.7% in phase I), and [2] the significant variations across centers in proportions of cases in the sample populations. This is another reason why we call for further independent validation of the models in additional clinical populations, each of which is likely to have its own unique patient profile.
In conclusion, the diagnostic delay, high investigation costs, and personal suffering associated with endometriosis might be reduced by access to a screening tool that predicts endometriosis with good accuracy in women presenting in a clinical setting. Although prediction of any-stage endometriosis is relatively poor, the symptom-based models developed and validated in this study predict stage III and IV endometriosis with a good degree of accuracy. They suggest that such a tool might help to prioritize women for surgical investigation in gynecologic practice.

* Variables retained in the reduced models. Variables in the best-fitting final models that had opposite directions of effect in phase I and II data sets were dropped in the reduced models.
IBS = irritable bowel syndrome.
a Included as a linear term: OR represents unit change.
b Included as a linear term: OR represents unit change from light to moderate to heavy.
c Included as a linear term: OR represents unit change from no pain to 0-3 months, 4-6 months, 7-12 months, 1-5 years, to 5 years ago.
d Included as a linear term: OR represents unit change from 0 to 10.
e Included as a linear term: OR represents unit change from 0, 1, 2, to 3+.
f Included as a linear term: OR represents unit change from none, 1-10, 11-50, to 50+.
g Ethnicity is included as a categorical term, with largest group (white) as reference.