A novel system based on artificial intelligence for predicting blastocyst viability and visualizing the explanation

Abstract Purpose The purpose of the study was to invent and evaluate the novel artificial intelligence (AI) system named Fertility image Testing Through Embryo (FiTTE) for predicting blastocyst viability and visualizing the explanations via gradient‐based localization. Methods The authors retrospectively analyzed 19 342 static blastocyst images with related inspection histories from 9961 infertile patients who underwent in vitro fertilization. Among these data, 17 984 cycles of single‐blastocyst transfer were used for training, and data from 1358 cycles were used for testing purposes. Results The prediction accuracy for clinical pregnancy achieved by a control model using conventional Gardner scoring system was 59.8%, and area under the curve (AUC) was 0.62. FiTTE improved the prediction accuracy by using blastocyst images to 62.7% and AUC of 0.68. Additionally, the accuracy achieved by an ensemble model using image plus clinical data was 65.2% and AUC was 0.71, representing an improvement in prediction accuracy. The visualization algorithm showed brighter colors with blastocysts that resulted in clinical pregnancy. Conclusions The authors invented the novel AI system, FiTTE, which could provide more precise prediction of the probability of clinical pregnancy using blastocyst images secondary to single embryo transfer than the conventional Gardner scoring assessments. FiTTE could also provide explanation of AI prediction using colored blastocyst images.

over the world. 3 Recently, the clinical use of preimplantation genetic testing (PGT) has become widespread. This technology enables more accurate embryo selection by analyzing the embryo's chromosomes.
However, PGT requires embryo biopsy to obtain embryonic genetic material, which increases the cost of IVF and may compromise embryo viability in some cases. 4 Moreover, recent studies have reported that some embryos classified as mosaic by PGT, and therefore deemed unsuitable for transfer, were reclassified as chromosomally normal on re-analysis. [5][6][7] Other researchers have also suggested that the transfer of "abnormal" embryos (as per PGT) offered robust pregnancies and high chances of live births with low miscarriage rates; therefore, PGT cannot reliably determine which embryos should or should not be transferred. 8 The development of a new embryo grading technology that enables accurate prediction of embryo viability therefore remains an important challenge.
Several studies have tested non-invasive artificial intelligence (AI)-based approaches to aid in predicting embryo viability during IVF. 9,10 The reported prediction accuracy is about 0.65, which indicates that AI models can improve accuracy by 10%-20% compared with traditional grading methods (<50%). 9,11 Because many variables affect pregnancy, including uterine condition, hormonal status, and complications other than infertility, the theoretical accuracy of predicting clinical pregnancy is estimated to be <80%. 12 In fact, the clinical pregnancy rate after transfer of euploid blastocysts is reported to be about 70%. 13 Therefore, a non-invasive technology that can predict clinical pregnancy with 70% accuracy would have great value.
This study aimed to develop a novel AI system that can predict clinical pregnancy using ensemble modeling to combine blastocyst images and clinical data, such as age, hormonal status, and uterine condition.
Another goal of this study was to develop a function that explains the AI prediction. The black-box nature of deep learning is considered a major hindrance to clinical application. For example, if an AI system predicts the viability of certain embryos and its prediction diverges from the traditional grading system, physicians will have difficulty explaining the result. Many methods, such as saliency mapping, class activation mapping (CAM), and gradient-weighted CAM (Grad-CAM), have been developed for the visual explanation of AI systems. 14 Grad-CAM generates a heatmap that visualizes the class-discriminative region, which can help physicians identify regions of clinical value. This study developed a novel AI system for predicting blastocyst viability. We also investigated the validity of this system by regional visualization using Grad-CAM.

| Patients
This analysis included all patients who underwent single embryo transfer with known pregnancy outcomes. There were no exclusion criteria based on patient characteristics. Clinical data, including age, serum anti-müllerian hormone (AMH), hormonal profiles, pregnancy history, ART history, height, weight, body mass index, menstrual cycle, blood pressure, endometrial thickness, and ART method, were available for 1358 patients. Since the ensemble AI model requires both patient inspection histories and hormonal profiles of cycles as inputs, the development of such models was restricted to this subset of 1358 embryos. All data were anonymized and sent to NextGeM Inc. for analysis. All patients were well informed regarding the use of these medical data for research purposes, and written informed consent was obtained from them prior to the treatment period. A web page with additional information, including an opt-out button for this study, was set up on the official website of Hanabusa Women's Clinic.

| Control model using embryo grading by embryologist
An AI algorithm based on conventional embryo grading by experienced embryologists was used as a control model. The conventional embryo grading system was based on Gardner's grading scale. 15 Embryos were evaluated on day five or six after oocyte retrieval. Embryos that had not yet progressed past the morula stage were excluded from [...] were used for testing purposes.

| Image processing
Images were converted into grayscale and rescaled to a resolution of 480 × 640 pixels. All images were initially labeled as "viable" or "non-viable" according to the pregnancy outcome: blastocyst images that resulted in positive serum human chorionic gonadotropin (hCG) and fetal heartbeats were regarded as "viable," and all other images were regarded as "non-viable." Two types of algorithms were evaluated: an image-only model and an ensemble model, which combines deep learning algorithms for image inputs and machine-learning algorithms for non-image inputs (Figure 1).
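The preprocessing and labeling steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the luma weights (ITU-R BT.601), the nearest-neighbour resampling, the reading of 480 × 640 as 480 rows by 640 columns, and the function names `to_grayscale`, `resize_nearest`, and `label_from_outcome` are all assumptions made for illustration, since the text does not specify them.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[int]]

# Assumed target size: 480 rows x 640 columns (the text gives only "480 x 640").
TARGET_H, TARGET_W = 480, 640


def to_grayscale(rgb: RGBImage) -> GrayImage:
    """Convert an RGB image to grayscale with BT.601 luma weights (an assumption)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb
    ]


def resize_nearest(gray: GrayImage, out_h: int = TARGET_H, out_w: int = TARGET_W) -> GrayImage:
    """Rescale a grayscale image by nearest-neighbour sampling (an assumption)."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def label_from_outcome(hcg_positive: bool, fetal_heartbeat: bool) -> str:
    """Label an image 'viable' only if both hCG and fetal heartbeat were positive."""
    return "viable" if (hcg_positive and fetal_heartbeat) else "non-viable"


if __name__ == "__main__":
    tiny = [[(255, 255, 255), (0, 0, 0)], [(0, 0, 0), (255, 255, 255)]]
    resized = resize_nearest(to_grayscale(tiny))
    print(len(resized), len(resized[0]), label_from_outcome(True, True))
```

In practice an imaging library would handle both steps; the sketch only makes the order of operations and the labeling rule explicit.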

| Outcomes
The primary end point of this study was clinical pregnancy, defined as a rising serum human chorionic gonadotropin (hCG) level and fetal heartbeats in the uterus detected by transvaginal ultrasound. The secondary outcome was live birth. Although live birth is more important than clinical pregnancy, its sample size was smaller, and the ensemble model for live birth prediction was not completed because the sample was insufficient for deep learning; therefore, we set clinical pregnancy as the primary end point and live birth as the secondary outcome. Accuracy was used as the main measure of the performance of the AI algorithms and was defined as the percentage of both viable and non-viable embryos correctly identified by the AI models. A second measure of the performance of FiTTE was calculated from the receiver operating characteristic (ROC) curve, generated by plotting the true-positive rate against the false-positive rate across all thresholds of the predicted confidence score, with the actual pregnancy outcome as the reference.
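The two performance measures defined above can be written out as a short sketch. The 0.5 decision threshold, the trapezoidal integration of the ROC curve, and the function names are assumptions for illustration; the text does not state how the threshold was chosen or how the area under the curve was computed.

```python
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of embryos (viable=1, non-viable=0) classified correctly.

    The 0.5 threshold is an assumed default, not a value from the study.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(false-positive rate, true-positive rate) pairs, one per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    outcomes = [1, 1, 0, 0]             # toy labels, not study data
    confidence = [0.9, 0.4, 0.6, 0.2]   # toy model scores
    print(accuracy(outcomes, confidence), auc(roc_points(outcomes, confidence)))
```

Accuracy depends on the chosen threshold, whereas the area under the ROC curve summarizes ranking quality across all thresholds, which is why both are reported.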
Another end point was to confirm the validity of the model by visualizing the basis of the AI prediction with the Grad-CAM algorithm. Figure 1 shows the layered architecture of FiTTE, from the input images and clinical data to the prediction of clinical pregnancy or live birth. For prediction from blastocyst images, images were first fed into a residual network (ResNet18), a deep convolutional neural network that is widely recognized as a strong model in the field of image recognition. 16
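As a rough illustration of how Grad-CAM turns a network's internal state into a heatmap, the sketch below implements only the final combination step of the standard Grad-CAM formulation: each feature map of the last convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped. The feature maps and gradients are taken as given inputs here (in a real system they would come from backpropagation through the trained network), and the normalization to [0, 1] and the name `grad_cam` are assumptions for display purposes, not details from the study.

```python
from typing import List

FeatureMap = List[List[float]]


def grad_cam(activations: List[FeatureMap], gradients: List[FeatureMap]) -> FeatureMap:
    """Combine K feature maps (each H x W) into one Grad-CAM heatmap.

    gradients[k][i][j] is the derivative of the class score with respect to
    activations[k][i][j]; both are assumed to be precomputed.
    """
    n_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average of the gradients over each map.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_maps)
    ]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be rendered as a colour overlay (assumed step).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


if __name__ == "__main__":
    acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    print(grad_cam(acts, grads))
```

The resulting low-resolution map is normally upsampled to the input image size and overlaid on the blastocyst image.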

| Algorithm architecture and training methods
Convolutional neural networks (CNNs) comprise convolutional layers, which pass their results to the next layer; pooling layers, which combine the outputs of neurons into a single neuron; and fully connected layers, which produce the outputs. 17,18 FiTTE utilizes five convolution blocks made of 18 layers, two pooling layers, and one fully connected layer. The architecture ends with binary cross entropy for prediction. In this study, we used the embryo images and non-image clinical data from 17 984 cycles of single-blastocyst transfer for training. Additional data from 1358 cycles were used for testing purposes. For live birth prediction, data from 10 643 single embryo transfer cycles were used for analysis. Among these cycles, 9091 were used for training, and 1552 were used for testing purposes. As shown in Figure 1B, the ensemble prediction algorithm for blastocyst images and non-image clinical data used the same pipeline from image processing to binary cross entropy.
The processed data were then fed into a random forest classifier together with the non-image clinical data for prediction. Because only a limited number of samples (1358) had all the required data, the predictive accuracy of the ensemble model was evaluated using 10-fold cross-validation.
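For readers unfamiliar with the loss named above, the following sketch shows the standard mean binary cross entropy between outcome labels and predicted probabilities, together with a consistency check of the cycle counts quoted in the text. The clipping constant `eps` and the function name are assumptions added for numerical safety and illustration; the study's actual training code is not described.

```python
import math
from typing import Sequence

# Cycle counts quoted in the text.
PREGNANCY_TRAIN, PREGNANCY_TEST = 17_984, 1_358
LIVE_BIRTH_TRAIN, LIVE_BIRTH_TEST = 9_091, 1_552


def binary_cross_entropy(y_true: Sequence[int], p_pred: Sequence[float], eps: float = 1e-12) -> float:
    """Mean binary cross entropy; probabilities are clipped to avoid log(0).

    y_true holds 1 for a positive outcome and 0 otherwise; p_pred holds the
    predicted probability of the positive outcome.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


if __name__ == "__main__":
    # The training and test counts add up to the totals given in the paper.
    print(PREGNANCY_TRAIN + PREGNANCY_TEST)    # total images analysed
    print(LIVE_BIRTH_TRAIN + LIVE_BIRTH_TEST)  # cycles used for live birth
    print(binary_cross_entropy([1, 0], [0.8, 0.3]))
```

A confident wrong prediction is penalized much more heavily than an uncertain one, which pushes the network towards calibrated confidence scores.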
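The 10-fold cross-validation used for the ensemble model can be sketched as follows. This is a generic illustration, not the authors' code: the shuffling seed, the round-robin fold assignment, and the `fit`/`predict` callables (stand-ins for whatever classifier is plugged in, such as a random forest) are assumptions.

```python
import random
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


def cross_validated_accuracy(
    features: Sequence[Any],
    labels: Sequence[int],
    fit: Callable[[List[Any], List[int]], Any],
    predict: Callable[[Any, Any], int],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Mean accuracy over k folds; each sample is tested exactly once."""
    fold_accuracies = []
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in test)
        fold_accuracies.append(hits / len(test))
    return sum(fold_accuracies) / len(fold_accuracies)


if __name__ == "__main__":
    # Hypothetical stand-in classifier: always predicts the majority training label.
    def fit_majority(xs: List[Any], ys: List[int]) -> int:
        return max(set(ys), key=ys.count)

    def predict_majority(model: int, x: Any) -> int:
        return model

    toy_features = list(range(1358))
    toy_labels = [1] * 1358  # toy labels, not study data
    print(cross_validated_accuracy(toy_features, toy_labels, fit_majority, predict_majority))
```

With 1358 samples and ten folds, each held-out fold contains 135 or 136 samples, so every model is trained on roughly 90% of the data.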

| DISCUSSION
The present study showed that the AI system, FiTTE, can predict the likelihood of pregnancy from blastocyst images more accurately than the conventional Gardner grading system. This can help physicians select embryos for transfer and help embryologists who grade embryos manually under an optical microscope.
Deep learning using AI has recently come into the spotlight for various medical imaging diagnosis applications such as detecting bone fracture, cancer, and diabetic retinopathy, [19][20][21] with an accuracy of >90%. Unlike these studies, the difficult point of this field is that nobody can tell which embryo is viable or not. For example, when analyzing the bone fracture image, certain physicians can detect the bone fracture within the image, and the programmer can teach the AI the correct answer. Similarly, when detecting cancer or diabetic retinopathy, answers can be found in the images. However, in the field of predicting pregnancy, no definite answer can be obtained because viable embryos do not always result in pregnancy. This is because, besides embryos, various factors such as endometrial thickness, uterine myoma, endometritis, autoimmune system, and hormonal conditions can affect the outcome of clinical pregnancy.
So far, several studies have developed AI models in the field of embryology. Filho et al. introduced a semi-automatic grading system of human embryos, with accuracy rates ranging from 67%-92%. 22 Similarly, Khosravi et al. developed an AI model that classifies blastocyst images using Gardner's classification. 23 Although these reports achieved high accuracies, they set the end point as embryo classification and do not assess clinical pregnancy as an end point.
In the present study, FiTTE achieved an accuracy of 62.7% for clinical pregnancy from embryo images alone. We also confirmed that an ensemble model of blastocyst images and clinical data improved the accuracy of pregnancy prediction by 2.5 percentage points. The overall accuracy of 65.2% for prediction of clinical pregnancy was considered relatively high given that the theoretical maximum accuracy for prediction of pregnancy based on embryo evaluation is estimated to be <80%. 12 [...] Whisperer and our present model, the maximum predictivity from this model will be between 60% and 70%.
To date, PGT is generally considered the most accurate way to predict embryo viability. Indeed, PGT-based selection outperforms current AI prediction models: the clinical pregnancy rate after transfer of euploid blastocysts is reported to be approximately 70%. 13 However, PGT has several drawbacks, including mosaicism, embryo damage, and high cost. Therefore, a non-invasive AI model that can predict clinical pregnancy with 65% accuracy is considered valuable.
The present study confirmed that, of several variables, the blastocyst image was the best predictor of clinical pregnancy, followed by age, pregnancy history, AMH, and estradiol and progesterone levels at the time of embryo transfer. This result seems reasonable since most clinicians consider embryo quality the most important factor for pregnancy. In fact, the prediction accuracy of FiTTE stratified by age did not differ among age groups (Figures S1 and S2), indicating that FiTTE can be used for patients of all reproductive ages. The present study shows a non-significant improvement in prediction accuracy (2.5%) using an ensemble model as compared with an image-only model. This result indicates that the present AI model does not fully assess variables other than blastocyst images. This is presumably because gynecological images such as those obtained using uterine ultrasonography, hysteroscopy, hysterosalpingography, and magnetic resonance imaging were not included in the present study.
In addition, the small sample size available for the ensemble model made it difficult to compare its results with those of the image-only model. Therefore, further studies using additional data will be needed.
Another important aim of this study was the explanation of [...]
This study is not without limitations. First, this study was based on the analysis of static blastocyst images. Although the use of still images in an AI system is important for the IVF laboratory, since not all laboratories are equipped with time-lapse systems, dynamic analysis of embryos using time-lapse images is needed for a more accurate analysis. Second, this study only included Asian populations, mostly Japanese, whose characteristics may differ from those of other populations; therefore, it may be difficult to apply the present FiTTE system to the entire cohort of patients receiving ART.
Finally, this study was a non-randomized, retrospective study, with a limited sample size. Moreover, since the embryos were selected for embryo transfer by humans, the false-positive rate of the AI model might have been underrepresented. Therefore, a larger prospective study is necessary to establish the true benefit of FiTTE in the field of ART.
In conclusion, we invented the novel AI system named FiTTE, which could predict the probability of clinical pregnancy using blastocyst images secondary to single embryo transfer more precisely than the conventional Gardner scoring assessment. Although the prediction accuracy was slightly improved in the model of images plus clinical data, blastocyst images had the greatest impact on the prediction. FiTTE could also provide visual explanation of the AI prediction using colored blastocyst images.

This study was funded in part by the Kobe Medical Industry Development Project, which is a public grant providing support in the form of research materials. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Isao Miyatsuka and Le My An are employees of NextGem Inc. The equipment and programs have been jointly provided by NextGem Inc.

ETHICAL APPROVAL
The statement of approval from the Institutional Review Board:

This study was approved by the Ethical Committee of Hanabusa Women's Clinic, which consists of members chosen by our institute and a third-party medical institute (approval number: 2019-06).

HUMAN RIGHTS STATEMENT AND INFORMED CONSENT
All patients were well-informed, and written informed consent was obtained prior to the treatment period.

ANIMAL RIGHTS
This article does not contain any studies with animal subjects performed by any of the authors.

Noritoshi Enatsu https://orcid.org/0000-0002-9375-5191
Junko Otsuki https://orcid.org/0000-0002-8183-8651

FIGURE 6 Gradient-weighted class activation mapping (Grad-CAM)-assisted image identification for pregnancy prediction from blastocyst images. Cases 1-3 are blastocysts that resulted in positive pregnancies, and cases 4-6 are negative pregnancies. The images represent the original and those generated after applying the model with Grad-CAM, which visualizes the class-discriminative regions as the predictors of pregnancy