Machine-intelligence for developing a potent signature to predict ovarian response to tailor assisted reproduction technology

The prediction of poor ovarian response (POR) for stratified intervention is a critical clinical issue that has received increasing attention in recent years. Conventional, human-driven diagnostic approaches remain too simple to handle actual clinical complexity. Therefore, this study screened models derived from a variety of machine learning algorithms, including random forest (RF), decision tree, eXtreme Gradient Boosting (XGBoost), support vector machine (SVM), and artificial neural network (ANN), to develop two models, the COS pre-launch model (CPLM) and the hCG pre-trigger model (HPTM), which assess POR at different stages of treatment. The CPLM constructed using ANN achieved the highest AUC of all algorithms at COS pre-launch (AUC = 0.859, C-index = 0.87, good calibration), and the HPTM constructed using random forest was the most effective at hCG pre-trigger (AUC = 0.903, C-index = 0.90, good calibration). Notably, CPLM and HPTM performed better than common clinical characteristics (0.859 [CPLM] and 0.903 [HPTM] compared with 0.824 [anti-Müllerian hormone (AMH)] and 0.799 [antral follicle count (AFC)]). Furthermore, variable importance analysis highlighted the value of AMH, AFC, and E2 level and follicle number on hCG day, which provides theoretical guidance and experimental data for further application. Overall, the CPLM and HPTM can offer effective POR prediction for patients receiving assisted reproduction technology (ART) and have great potential for guiding the clinical treatment of infertility.


INTRODUCTION
As assisted reproduction technologies (ART) have advanced, improving the clinical pregnancy rate has remained both a high priority and a significant difficulty for fertility doctors [1]. Meanwhile, the response to controlled ovarian stimulation (COS) during ART is highly diverse, and ovarian response plays a crucial role in this process [2]. In particular, poor ovarian response (POR) generally refers to a poor response to gonadotropin stimulation and is characterized by a low number of growing follicles, which may result in poor oocyte retrieval, cycle cancellation, or even a failed reproductive outcome [3][4][5].
Encouragingly, researchers have found that identifying poor responders in advance can help provide patients with more directed counseling, which can lessen the disappointment of undesirable outcomes [6]. Generally, predicting POR before COS may contribute to formulating individualized programs [7], and prediction before hCG trigger day can facilitate the adjustment of trigger protocols (for example, when POR is predicted, a GnRH-a + hCG double trigger [8,9] can be used to improve IVF outcomes). These findings inspired us to predict POR based on clinical data at COS pre-launch and hCG pre-trigger in order to offer sufficient decision support.
Several clinically predictive indicators associated with POR have already been identified, such as age, basal follicle stimulating hormone (FSH), antral follicle count (AFC), and anti-Müllerian hormone (AMH) [10][11][12][13]. Significant attention has been paid to the comprehensive analysis of various indicators [14][15][16], but among current POR assessment approaches, traditional logistic regression is highly subjective and time-consuming [17], and is also unable to exploit interconnections between predictors and combinations of factors that may not be significant individually. Machine learning algorithms can analyze interactions between the explanatory variables of large data sets without knowledge of the specific parametric form underlying the relationship [18]. Furthermore, many classical algorithms have been widely applied in ART, such as logistic regression (LR) [19] and machine learning methods, including decision tree [20], support vector machine (SVM) [21], and artificial neural network (ANN) [22,23]. However, very few works have reported machine-learning models for the prediction of ovarian response. Further exploration of the predictive potential of machine-learning algorithms in this field was therefore warranted.
In this study, the clinical data of patients undergoing IVF/ICSI were analyzed in order to establish optimum models for POR prediction (COS pre-launch model [CPLM] and hCG pre-trigger model [HPTM]) using different algorithms (typical statistical methods and machine learning models). With these models, clinicians may be able to apply the therapeutic strategies mentioned above to infertile couples in order to increase the probability of favorable IVF outcomes.

Data processing
The clinical data of 1,110 infertile women who had undergone IVF/ICSI treatment for the first time between July 2018 and May 2019 in Renmin Hospital of Wuhan University was retrospectively analyzed. Women with several different infertility factors were incorporated in order to establish a universal approach for POR prediction at our center.

Patients' characteristics and main outcomes
In this cohort analysis, the main outcome measure was POR, which was defined as the retrieval of four or fewer oocytes or cycle cancellation [24]. Variables with a potential relationship to ovarian response were incorporated into our research, and models were constructed based on the various therapeutic stages of the treatment cycle:
(1) Variables of COS pre-launch model: age, BMI, infertility cause, infertility duration, infertility type, AMH, basic hormone levels (E2, FSH, and LH), AFC, pelvic surgery, and gravidity history.
(2) Variables of hCG pre-trigger model: all factors of the COS pre-launch model, plus therapeutic regimen, dosage of Gn (recombinant human follicle-stimulating hormone for injection, Gonal-f, Merck Serono, Germany), days of Gn, E2 level on hCG day, and follicle number on hCG day (follicles with a diameter of ≥ 14 mm in bilateral ovaries).

Feature selection
EpiData 3.1 software was used to establish a database, which was double-entered and validated by two qualified personnel. Once checked, the data were transferred to R software (version 3.6.4), and parameters with a direct effect on ovarian response were screened using logistic regression; variables with P < 0.05 were chosen for further analysis. After the effects of features on outcomes were fully assessed, least absolute shrinkage and selection operator (LASSO) regression was used to further minimize the risk of over-fitting, and variables with high collinearity were eliminated. The LASSO regression relied on a cyclical coordinate descent algorithm and was conducted using the glmnet package in R. The workflow of the study is presented in Figure 1.
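The cyclical coordinate descent behind LASSO can be illustrated with a minimal sketch. To stay short, it assumes a squared-error loss rather than the penalized logistic loss appropriate for a binary outcome such as POR, and it is not the glmnet implementation; the function names and the toy data are hypothetical.

```python
def soft_threshold(z, penalty):
    """Shrink z toward zero by `penalty`; values within [-penalty, penalty] become 0."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimize (1/2n) * ||y - X b||^2 + penalty * ||b||_1 by cycling over coordinates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


# Hypothetical toy data: the outcome depends only on the first feature.
X_toy = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y_toy = [2.0, -2.0, 2.0, -2.0]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, penalty=0.5)
```

In this toy case the coefficient of the uninformative second feature is shrunk exactly to zero, which is the property that makes LASSO usable for feature selection.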

Construction of model
All data were randomly divided into a training dataset (70%) for feature selection and model training, and an independent validation dataset (30%) for repeated optimization and verification of the prediction model. The models were fitted with the default parameters in R.
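A random 70/30 partition of this kind can be sketched as follows. The seed, the rounding rule, and the function name are assumptions made for illustration; the study does not state how the random split was generated.

```python
import random


def split_train_validation(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) for reproducibility
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical patient identifiers standing in for the 1,110 records.
patient_ids = list(range(1110))
train_ids, validation_ids = split_train_validation(patient_ids)
```

Splitting once with a fixed seed keeps the validation records untouched during feature selection and training, which is what makes the validation estimate independent.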

Multivariable logistic model
Normality was evaluated using a Kolmogorov-Smirnov test, and Spearman's rho (nonparametric) or Pearson's (parametric) bivariate correlation analysis was performed as appropriate. For the independent variables selected for the generalized multivariable logistic model, the stepwise Akaike information criterion (AIC) was applied to eliminate multicollinearity, and the model with the lowest AIC was selected as the final model. A multivariable logistic model was also used to construct the ovarian response predictive model (ORPM). To facilitate this, the risk score was calculated using the following formula:
$$\text{Risk score} = \sum_{i=1}^{n} \beta_i E_i \quad (1)$$
where the risk score, defined as the ORPM-based risk signature, was calculated by the ORPM; n represents the total number of features included in the ORPM, β_i represents the regression coefficient of feature i, and E_i refers to the value of feature i in the constructed model.
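As a worked illustration of the risk score, the sketch below computes a weighted sum over n features. It assumes that E_i denotes a patient's value for feature i; the coefficients and feature values shown are hypothetical and are not taken from the fitted ORPM.

```python
def risk_score(coefficients, feature_values):
    """Sum of beta_i * E_i over the features included in the model."""
    if len(coefficients) != len(feature_values):
        raise ValueError("one coefficient is required per feature")
    return sum(beta * value for beta, value in zip(coefficients, feature_values))


# Hypothetical coefficients and one hypothetical patient (not study results).
example_betas = [0.5, -1.0, 0.25]
example_patient = [2.0, 1.5, 4.0]
example_score = risk_score(example_betas, example_patient)  # 1.0 - 1.5 + 1.0 = 0.5
```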

Machine learning
Decision tree
Decision tree algorithms use the Gini index to evaluate each decision point and create an optimal separation of the independent variables [25]. At each division, the split that minimized the Gini index was selected as the optimal partition of that subset of the data. Splitting continued on the predictors according to this criterion, subject to a tree depth of 11.
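The Gini criterion can be made concrete with a small sketch that scores every candidate threshold on a single predictor for a binary outcome. The function names and the toy values are hypothetical; a full tree would apply the same search recursively to each resulting subset.

```python
def gini_impurity(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p0^2 - p1^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the split `value <= threshold` with lowest impurity."""
    n = len(values)
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical predictor values with label 1 marking a poor responder.
toy_values = [0.4, 0.8, 1.0, 2.5, 3.1, 4.0]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_split = best_threshold(toy_values, toy_labels)  # (1.0, 0.0): a perfectly pure split
```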

Random forest (RF)
RF combines multiple decision trees, randomizing the variables and data used by each tree and aggregating their results [26]. This study used an RF containing 1,000 trees, where the maximum depth of each tree was determined based on the final number of included features.

eXtreme gradient boosting (XGBoost)
XGBoost applies gradient descent to minimize the loss each time a new model is added, so that each new function is fitted to the residual of the previous prediction [27]. Similarly, XGBoost was run as an iterative model for up to 1,000 rounds, and the maximum depth of each tree was determined based on the final number of included features.
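The idea of repeatedly fitting a new function to the residual of the previous prediction can be sketched with one-split regression stumps. This is plain gradient boosting on squared error under simplifying assumptions (one feature, no regularization), not the XGBoost objective; the names, round count, and learning rate are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regressor to the residuals (x needs at least two distinct values)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for v, r in zip(x, residuals) if v <= threshold]
        right = [r for v, r in zip(x, residuals) if v > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(v) for pi, v in zip(predictions, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)


# Hypothetical one-feature data with a step at x = 3.
toy_model = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

Each round removes a fixed fraction of the remaining residual, so the fitted values approach the observed outcomes gradually rather than in one step.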

Support vector machines (SVM)
The aim of SVM is to establish a classification hyperplane that correctly classifies each sample while maximizing the distance between the hyperplane and the samples of each class that lie closest to it [28].
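The quantity that SVM maximizes, the distance from the hyperplane to the closest samples, can be computed directly for a given linear hyperplane w·x + b = 0. The sketch below only evaluates that margin for two candidate hyperplanes; it does not perform the optimization, and the data and names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, samples, labels):
    """Smallest label-signed distance; positive only if every sample is classified correctly."""
    return min(label * signed_distance(w, b, x) for x, label in zip(samples, labels))


# Hypothetical two-class data with labels +1 and -1.
toy_samples = [[2.0, 1.0], [3.0, -1.0], [-1.0, 0.0], [-4.0, 2.0]]
toy_classes = [1, 1, -1, -1]
margin_a = margin([1.0, 0.0], 0.0, toy_samples, toy_classes)   # 1.0
margin_b = margin([1.0, 0.0], -0.5, toy_samples, toy_classes)  # 1.5, the better of the two
```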

Artificial neural network (ANN)
An ANN consists of an input layer, an output layer, and one or more hidden layers between them. The most prominent training algorithm for such networks is resilient backpropagation [29]. In a typical process, the hidden layers are determined according to the actual data, the threshold is set to 0.005, the learning rate is set to 0.1, and parameter optimization is performed using the rprop+ method.
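The rprop+ update rule (resilient backpropagation with weight backtracking) can be sketched on a simple objective instead of a full network. The sketch assumes that the threshold acts as a stopping criterion on the partial derivatives; the step-size factors, the initial step, and the quadratic test function are illustrative assumptions rather than the settings of the study.

```python
def sign(v):
    return (v > 0) - (v < 0)


def rprop_plus(gradient, weights, threshold=0.005, n_iter=1000, step_init=0.1,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Minimize a function from its gradient using per-weight adaptive step sizes."""
    weights = list(weights)
    steps = [step_init] * len(weights)
    prev_grad = [0.0] * len(weights)
    prev_update = [0.0] * len(weights)
    for _ in range(n_iter):
        grad = gradient(weights)
        if max(abs(g) for g in grad) < threshold:
            break  # all partial derivatives are below the threshold
        for i in range(len(weights)):
            agreement = grad[i] * prev_grad[i]
            if agreement > 0:
                # Same sign as before: accelerate.
                steps[i] = min(steps[i] * eta_plus, step_max)
                prev_update[i] = -sign(grad[i]) * steps[i]
                weights[i] += prev_update[i]
                prev_grad[i] = grad[i]
            elif agreement < 0:
                # Sign flipped: the last step overshot, so shrink the step and undo it.
                steps[i] = max(steps[i] * eta_minus, step_min)
                weights[i] -= prev_update[i]
                prev_grad[i] = 0.0
            else:
                prev_update[i] = -sign(grad[i]) * steps[i]
                weights[i] += prev_update[i]
                prev_grad[i] = grad[i]
    return weights


def quadratic_gradient(w):
    """Gradient of (w0 - 3)^2 + (w1 + 1)^2, a stand-in for a network's error function."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


fitted = rprop_plus(quadratic_gradient, [0.0, 0.0])
```

Only the sign of each partial derivative is used for the update, which is why the method is insensitive to the scale of the gradient.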

Validation of COS pre-launch model (CPLM) and hCG pre-trigger model (HPTM)
Several approaches were used to assess the stratification ability of all models. The area under the curve (AUC) was calculated from the receiver operating characteristic (ROC) curve and used to estimate the discrimination of each model. The accuracy of the derived models was evaluated with calibration plots, and models that closely followed the dotted reference line were regarded as well calibrated [30]. In addition, the net reclassification index (NRI) was used to quantify the improvement in the predictive ability of each model. The models with the highest ovarian response prediction accuracy at COS pre-launch and hCG pre-trigger were defined as CPLM and HPTM, respectively. The contribution and importance of each CPLM/HPTM-based signature were quantified using the mean concordance index (C-index). Spearman correlation analysis was then performed to determine the correlation between each patient's CPLM and HPTM scores and the corresponding number of retrieved oocytes.
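For a binary outcome, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name and toy scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks: 3 of the 4 positive-negative pairs are ordered correctly.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

With a binary outcome such as POR, the C-index reduces to this same pairwise quantity.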

Grouped analysis for potential difference of clinical features
Statistical comparisons of patients' clinical characteristics were performed using Wilcoxon's test, and P-values were adjusted using the Benjamini-Hochberg procedure.
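The Benjamini-Hochberg adjustment multiplies each P-value by m/rank, where m is the number of tests and rank is its position in ascending order, and then enforces monotonicity from the largest P-value downward. A minimal sketch with hypothetical P-values:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, keeping a running minimum.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P-values from four group comparisons.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four values the adjusted results are approximately 0.02, 0.04, 0.04, and 0.02.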

Statistics
R software (version 3.6.4) was used for data processing and analysis.

Ethics approval and consent to participate
Written informed consent was obtained from each participant and the study was approved by the ethical committee of the Renmin Hospital of Wuhan University.

Demographic and clinical characteristics of participants
Based on the number of oocytes retrieved, the prevalence of POR was 14.59% in the present cohort. The demographic parameters of participants are displayed in Table 1. Poor ovarian responders were older than the normal to high responders, and exhibited significantly higher E2, FSH, days of Gn, dosages of Gn, E2 level and follicle number on hCG day. Significant differences were also observed regarding infertility cause and therapeutic regimen.

Feature engineering
Feature engineering was conducted to reduce the risk of over-fitting and to screen the important features that affect outcomes, thereby optimizing the constructed models. LASSO regression combined with univariable logistic regression was performed to narrow the candidate features, the results of which are displayed in Table 2 and Supplementary Figure 1A, 1B. A total of 11 of the original 19 features remained, and those selected were confirmed to be important regarding outcome. The significant variables identified by the selection procedure were as follows: AFC, AMH, age, E2, FSH, and infertility factors were incorporated in the COS pre-launch model. Variables in the hCG pre-trigger model included all factors from the COS pre-launch model, in addition to E2 level and follicle number on hCG day, therapeutic regimen, days of Gn, and dosages of Gn.

Construction and comparison of method performance
After feature selection was completed, the statistical models and machine-learning models were trained and validated according to the aforementioned methods. For the COS pre-launch models, the parameters of the logistic model and decision tree are presented in Table 3, and the frameworks of the other machine-learning models, including SVM and ANN, are available online at https://data.mendeley.com/datasets/tpj39wptts/1. For the hCG pre-trigger models, the components of the logistic model and decision tree are shown in Table 4 and Supplementary Figure 4, respectively, and the frameworks of the machine-learning models are available online at https://data.mendeley.com/datasets/tpj39wptts/1.
Table footnote: Normally distributed data, skewed distribution data, and nominal data are described by mean ± SD, median ± interquartile range, and frequency (relative frequency), respectively. The Wilcoxon signed-rank test and chi-square test were applied to skewed distribution data and nominal data, respectively, with r and V used as the effect sizes to quantify the significance.
It has been demonstrated that the area under the ROC curve (AUC) is a powerful indicator for the prediction of dichotomous outcomes, so the AUC was examined to assess the accuracy of the constructed models. As can be seen in Figure 2A-2L, ANN yielded the best predictive ability and accuracy among all algorithms for the COS pre-launch models, with an AUC of 0.859, and RF had the highest AUC (0.903) for the hCG pre-trigger models. The predictive ability and accuracy of logistic regression (AUC = 0.848 and 0.883 for the COS pre-launch and hCG pre-trigger models, respectively) and decision tree (AUC = 0.701 and 0.800) were lower than those of ANN or RF. XGBoost produced relatively poor results, with AUCs of 0.724 and 0.693, and SVM exhibited the lowest prediction efficiency, with AUCs of 0.556 and 0.519. Similar trends were also observed in the training cohort.

Validation of CPLM and HPTM
As they proved to be the best models for the estimation of ovarian response, the derived ANN and RF models were designated CPLM and HPTM, respectively, and investigated further. The C-index was determined to reaffirm the prediction accuracy of CPLM and HPTM. After 1,000 estimations with the bootstrap method, the mean C-indices of CPLM and HPTM in the validation cohort were 0.87 and 0.90, respectively. This demonstrated that the results predicted by CPLM and HPTM were highly consistent with the actual values, representing high accuracy among the constructed models [31]. The training cohort showed similar C-index results.
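A bootstrap estimate of the mean C-index can be sketched as follows: patients are resampled with replacement, the C-index is recomputed on each resample, and the estimates are averaged. The handling of resamples that contain only one outcome class (they are redrawn), the seed, and the toy data are assumptions made for illustration.

```python
import random


def c_index_binary(labels, scores):
    """C-index for a binary outcome: fraction of concordant positive-negative pairs."""
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    concordant = 0.0
    for p in positives:
        for q in negatives:
            concordant += 1.0 if p > q else 0.5 if p == q else 0.0
    return concordant / (len(positives) * len(negatives))


def bootstrap_mean_c_index(labels, scores, n_resamples=1000, seed=0):
    """Average the C-index over bootstrap resamples of the patients."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_resamples:
        indices = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in indices]
        if len(set(sample_labels)) < 2:
            continue  # redraw: the C-index is undefined without both classes
        sample_scores = [scores[i] for i in indices]
        estimates.append(c_index_binary(sample_labels, sample_scores))
    return sum(estimates) / len(estimates)


# Hypothetical outcomes and predicted risks for eight patients.
toy_outcomes = [0, 0, 0, 0, 1, 1, 1, 1]
toy_risks = [0.1, 0.3, 0.2, 0.6, 0.5, 0.7, 0.8, 0.9]
toy_mean_c = bootstrap_mean_c_index(toy_outcomes, toy_risks, n_resamples=200)
```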

In addition, a calibration plot measuring calibration ability also showed that the predicted value of the CPLM and HPTM-based signature was in accordance with the observed proportion (Figure 3A, 3B).
To further evaluate the credibility of the models, the correlation between each patient's CPLM and HPTM scores and the corresponding number of retrieved oocytes was analyzed. The results demonstrated that each patient's CPLM and HPTM scores were negatively correlated with retrieved oocytes, suggesting that the number of retrieved oocytes gradually decreased as the score increased (Figure 3C, 3D); the corresponding correlation coefficients were 0.59 and 0.69 for CPLM and HPTM, respectively.
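Spearman's coefficient is Pearson's correlation computed on ranks, with tied values given their average rank. The sketch below shows the calculation on hypothetical scores and oocyte counts in which a higher score always accompanies fewer oocytes.

```python
def average_ranks(values):
    """Rank values from 1 upward, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical model scores and retrieved oocyte counts (perfectly inverse ordering).
toy_rho = spearman([0.9, 0.7, 0.4, 0.2, 0.1], [2, 3, 8, 12, 15])  # -1.0
```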
Taken together, this series of investigations strongly indicated that the constructed models performed well while employing a small number of clinical characteristics without losing their predictive value.

Comparison between CPLM/HPTM and common clinical characteristics
Numerous studies have shown AMH and AFC to be the most effective parameters for the prediction of poor ovarian response in ART [4,32]. The effectiveness of the CPLM and HPTM was therefore evaluated by comparison with these characteristics to establish both their superiority and applicability in clinical practice. The results were encouraging and revealed that the AUCs of CPLM and HPTM (0.859 and 0.903) (Figure 2A, 2H) were superior to those of the most common clinical characteristics, AMH and AFC (0.824 and 0.796) (Figure 4A, 4B), indicating that the constructed models had more valuable prediction signatures than common clinical characteristics.
NRI is a method for measuring a model's accuracy based on changes in the number of correct classifications. The results showed that CPLM had better accuracy than AMH and AFC (NRI = 13.4% and 18.8%, respectively). In addition, HPTM's accuracy was considerably higher than that of AMH and AFC (compared to AMH, NRI = 74.7%; compared to AFC, NRI = 82.6%), indicating that the prediction efficiency of CPLM and HPTM was preferable. Similar trends were observed in the training cohort (Table 5).
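One common form of the NRI, the categorical version, can be sketched as follows: among patients with the event, it counts the net proportion moved to the higher-risk class by the new model, and among patients without the event, the net proportion moved to the lower-risk class. Whether the study used this categorical form or a continuous variant is not stated, so the sketch and its toy data are illustrative assumptions.

```python
def net_reclassification_index(outcomes, old_class, new_class):
    """Categorical NRI for classifications coded 0 (low risk) and 1 (high risk)."""
    events = [(o, n) for y, o, n in zip(outcomes, old_class, new_class) if y == 1]
    nonevents = [(o, n) for y, o, n in zip(outcomes, old_class, new_class) if y == 0]

    def net_upward(pairs):
        up = sum(1 for o, n in pairs if n > o)
        down = sum(1 for o, n in pairs if n < o)
        return (up - down) / len(pairs)

    # Upward moves are correct for events; downward moves are correct for non-events.
    return net_upward(events) - net_upward(nonevents)


# Hypothetical outcomes and classifications from a reference marker and a new model.
toy_events = [1, 1, 1, 1, 0, 0, 0, 0]
toy_reference = [1, 0, 0, 1, 0, 1, 1, 0]
toy_new_model = [1, 1, 0, 1, 0, 0, 1, 0]
toy_nri = net_reclassification_index(toy_events, toy_reference, toy_new_model)  # 0.5
```

A positive value means the new model reclassifies more patients correctly than incorrectly relative to the reference.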

Variable importance ranking in CPLM and HPTM
To facilitate the clinical decision process, variable importance figures of CPLM and HPTM were used to investigate the models. As can be seen in Supplementary Figure 2 and Figure 4C, AMH was the most important predictor for POR, consistent with the findings of a recent study that emphasized the significance of AMH [33]. Indicators including AFC and FSH, which are commonly used to assess ovarian response, also contributed significantly to the objective function. In addition, HPTM highlighted the prominent roles of E2 level and follicle number on hCG day in prediction at hCG pre-trigger. However, age, dosages of Gn, E2, therapeutic regimen, and days of Gn proved slightly less important in the models.

Potential differences between high- and low-risk groups identified by CPLM or HPTM
In order to detect potential differences in clinical characteristics between the high-risk group (with a higher predicted risk of POR) and the low-risk group defined by CPLM and HPTM, grouped analyses were performed. As shown in Figure 5A-5J, significant differences were found between the two groups, with the exception of age and days of Gn.

DISCUSSION
This study provides the first report of establishing CPLM and HPTM for the prediction of ovarian response at various therapeutic stages of IVF cycles using multiple machine learning algorithms, at points where individualized intervention is available to infertile couples. This study was also the first attempt to apply machine learning to routine medical practice in this setting to facilitate the improvement of clinical management and provide successful outcomes for infertile couples in ART.
One significant advantage of this study is that the machine learning-based CPLM and HPTM can be implemented in related clinical processes to predict ovarian response in infertile women, which will also allow the application of individualized stratified intervention. Machine learning is based on non-linear parallel processing and has opened a new direction in the field of IVF, improving reasoning and self-organization as it continues to learn [34,35]. Several machine learning algorithms, including RF, decision tree, XGBoost, SVM, and ANN, were used in this research to select two models, one for COS pre-launch and one for hCG pre-trigger, which were designated CPLM and HPTM. [14][36][37][38][39]. These findings strongly demonstrate that there is great clinical application potential for this study's constructed CPLM and HPTM due to their high accuracy for POR prediction.
To further evaluate the importance of the features incorporated in the chosen CPLM and HPTM, variable importance rankings were established. Notably, AMH, AFC, and FSH were highly important in both models, irrespective of the time period, which indicates the value of these traits during IVF, as concluded in previous research [40][41][42]. This study's results were similar to those of previous studies in indicating that AMH, which had the highest variable importance value in CPLM and HPTM, is the most important variable for POR prediction [43,44]. Although age had previously been considered to be of great value for ovarian response prediction [45], several studies have placed more focus on "ovarian age", and this study was consistent with them in demonstrating that age should not be regarded as a stable characteristic for POR prediction [14,46,47]. In addition, the variable importance results in HPTM showed that both E2 level and follicle number on hCG day play important roles. E2 level on hCG day reflects the secretory function of follicles and is related to the number and size of follicles in both ovaries during COS, and it is therefore considered to be a marker of ovarian reactivity [48]. Previous research has also demonstrated that E2 level on hCG day is an independent POR marker, which further highlights the importance of the indicator [49,50]. It is of interest that days of Gn are associated with follicular maturation, and an appropriate extension of days of Gn can improve follicular maturation and the number of retrieved oocytes [51]. Similarly, the models used in this study also attached importance to days of Gn, suggesting that clinicians should focus more on the individualized use of ovulatory drugs.
In this study, the prediction efficiency of HPTM was shown to be greater than that of CPLM. The main reason for this could be that HPTM incorporates additional important characteristics (E2 level and follicle number on hCG day), which are particularly significant in ovarian response prediction [52,53]. However, it is notable that HPTM is suited only to the hCG pre-trigger stage, because it relies on information that becomes available later in the cycle. Accordingly, clinicians can assess ovarian response with the CPLM before treatment cycles to formulate individualized regimens, whereas HPTM can be used for guidance on the day of hCG administration.
This study was limited by its retrospective design and by the fact that the data were obtained from only one fertility center. In addition, the models did not predict the number of retrieved oocytes, embryo quality, or IVF outcomes. Therefore, long-term research with a larger, multicenter sample and a more in-depth exploration of IVF outcomes is required to confirm the efficacy of our findings.

CONCLUSIONS
To summarize, the current study's CPLM and HPTM exhibited higher accuracy for poor ovarian response prediction in infertile women than AMH and AFC used alone as clinical indicators. The constructed models can support more precise individualized intervention in related clinical processes, which may help achieve better pregnancy outcomes.

Data availability statement
All generated data was included in the present study.