Introduction

Since the birth of the first baby conceived through in vitro fertilization (IVF) in 1978, assisted reproductive technology (ART) has evolved significantly. Over the last 40 years, ART has provided infertile couples with the possibility to conceive, culminating in the birth of over eight million children1. IVF protocols are complex and require intensive monitoring, with clinicians and embryologists responsible for several key decision points before and during the cycle (Fig. 1). Although several of these decisions have a solid evidence base, many are highly subjective and vary immensely with clinical experience, with an inevitable non-reproducible impact on clinical outcomes, leading to the mantra that ART is an art.

Fig. 1: Potential targets for artificial intelligence in assisted reproductive technology.

Potential targets for the application of artificial intelligence and machine learning methods during clinical and embryological steps in assisted reproductive technology (ART). Investigations of infertility and pre-treatment counseling are not captured here and are discussed separately in the section ‘Pre-treatment counseling’. The order and timings of the steps can differ depending on the ART protocol used. Figure created with BioRender.com.

Given these limitations, there is increasing recognition that alternative data-driven approaches, which harness the large number of ART cycles undertaken and facilitate objective, consistent, and optimal decision-making, may be associated with improved outcomes. The large amounts of data generated during IVF cycles have enabled interdisciplinary researchers to propose artificial intelligence (AI) methodologies to drive individualized approaches. These have ranged from algorithmic drug dosing tools to ‘human-in-the-loop’ AI clinical decision support systems (CDSSs) for embryo selection, whereby humans are supported by AI but ultimately make the final decision. Harnessing the symbiosis between the experience of clinicians and personalized recommendations from AI models based on the one million cycles undertaken annually has the potential to synergistically improve clinical outcomes. In this review, we examine current implementations of AI models within ART, and future prospects concerning their utility, efficacy, and application in the field.

Artificial intelligence methods for assisted reproductive technology

AI is an overarching term that encompasses a growing number of subfields including machine learning (ML), robotics, and computer vision (Fig. 2). Principally, ML methods learn patterns from data and draw inferences, and can therefore be used to build models that optimize or personalize ART protocols for a specified outcome. Traditionally, ML falls under a supervised, unsupervised, or reinforcement learning framework. In supervised learning, data are labeled as inputs and outputs, and the goal is to develop models that capture the relationship between the two, which can then be used to predict outputs when presented with new, unseen inputs. Conversely, in unsupervised learning, models are built to capture the structure (e.g., clustering) of data with no output labels (‘unlabeled’), which can be used to interpret new data or generate synthetic data. Reinforcement learning trains an ML agent that interacts with a defined environment towards achieving a goal and receives a ‘reward’ for its actions.

Fig. 2: The artificial intelligence landscape.

A Venn diagram providing a holistic view of the artificial intelligence (AI) landscape, with a particular focus on machine learning (ML) methods. ML is a subfield that is often used in conjunction with other AI subfields, such as computer vision. Some methods can be used in alternative learning frameworks; however, their most common current manifestations are presented here.

Supervised methods include decision trees, linear/logistic regression, k-nearest neighbors, support vector machines, random forests, artificial neural networks (ANNs), and more. Decision trees are models used to classify or predict outcomes based on input data. They can effectively capture non-linear relationships and can be visualized intuitively as tree-like structures: starting from the root, each branch represents a decision rule that selects which subsequent branch should be followed; the final nodes (‘leaves’) of the tree represent outcomes. Extending this to an ‘ensemble’ of trees gives rise to the random forest algorithm, where each tree is trained on a random subset of the data and of its input features. The final prediction is determined by a voting mechanism, combining the predictive power of all decision trees. This generally makes the model less prone to ‘overfitting’, a phenomenon whereby a model may perform very well on training data but poorly on new, unseen data. Supervised methods have widespread applications with tabular (i.e., numerical or categorical) outcomes in ART.
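As a minimal illustration of this tabular, supervised setting, the following sketch trains a random forest classifier on synthetic data using scikit-learn; the feature names (age, AMH, antral follicle count, total FSH dose) and the binary outcome are hypothetical and are not drawn from any cited study.

```python
# Minimal sketch: a random forest on hypothetical tabular ART features
# (feature names, values, and outcome are illustrative synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
# Hypothetical predictors: age (years), AMH (ng/mL), antral follicle count, total FSH dose (IU)
X = np.column_stack([
    rng.normal(35, 4, n),
    rng.gamma(2.0, 1.5, n),
    rng.poisson(12, n),
    rng.normal(2250, 500, n),
])
# Hypothetical binary outcome, e.g. ">15 oocytes retrieved"
y = (rng.random(n) < 0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Each tree votes; the forest averages the votes into a class probability
probs = model.predict_proba(X_test)[:, 1]
print("AUC on held-out data:", roc_auc_score(y_test, probs))
```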

ANNs are networks of connected computational units representing artificial neurons—they receive inputs, process them, and signal the result to other connected neurons in a multi-layered structure (e.g., the multi-layer perceptron algorithm). The input layer receives the data to be processed, and the output layer presents the model output. The strengths (‘weights’) of connections between artificial neurons comprise the parameters of the ANN and are calibrated during model training. ‘Deep’ learning refers to ANNs with complex architectures comprising many layers; an example is the convolutional neural network (CNN), which is useful for spatial, grid-like data (e.g., embryo images).
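To make the CNN idea concrete, the sketch below defines a small convolutional network for single-channel, grid-like inputs such as grayscale embryo or oocyte crops; the layer sizes, 64x64 input resolution, and two-class output are illustrative assumptions written in PyTorch, not an architecture from the cited literature.

```python
# Minimal sketch of a small CNN for grid-like image inputs; sizes are illustrative.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learn local spatial filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)           # flatten spatial maps into a feature vector
        return self.classifier(x)  # class scores (logits)

model = SmallCNN()
dummy = torch.randn(4, 1, 64, 64)  # batch of 4 single-channel 64x64 images
print(model(dummy).shape)          # -> torch.Size([4, 2])
```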

As for unsupervised methods, k-means is a popular algorithm for clustering data into k groups based on the distance of data points from the centroid of each group. Another example is generative adversarial networks (GANs), in which one network is trained to generate synthetic data whilst the other learns to discriminate synthetic from real data. The two networks are trained in parallel, competing as adversaries, so that the generator produces increasingly realistic synthetic data. Multimodal generative AI has recently caught mass media attention, especially through text generators (e.g., ChatGPT and Med-PaLM) and text-to-image generators (e.g., DALL-E), which have been evolving rapidly in performance since their inception2. These frameworks bring together large language models (LLMs), a type of natural language processing model built with ANNs, and diffusion models, an alternative generative methodology to GANs based on iterative de-noising to estimate the distribution of image data and thereby generate a desired image3.
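A minimal k-means example follows, clustering synthetic, unlabeled cycle-level data into three groups with scikit-learn; the two features (AMH and oocytes retrieved) are hypothetical and serve only to illustrate the unsupervised workflow.

```python
# Minimal sketch: unsupervised k-means clustering of unlabeled cycle data into
# k groups; the two features used here are hypothetical synthetic values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical unlabeled data: AMH (ng/mL) and oocytes retrieved per cycle
X = np.column_stack([rng.gamma(2.0, 1.5, 300), rng.poisson(10, 300)])

X_scaled = StandardScaler().fit_transform(X)   # distances are scale-sensitive
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_scaled)

print(kmeans.labels_[:10])        # cluster assignment of the first 10 cycles
print(kmeans.cluster_centers_)    # centroids in standardized feature space
```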

During model development, it is standard practice to use ‘training’, ‘validation’, and ‘test’ datasets: ‘training’ to fit the model, ‘validation’ to fine-tune the model’s hyperparameters, and a ‘test’ set to independently evaluate the model’s performance. For generally more reliable estimates of model performance, cross-validation can be used to evaluate the model on multiple training/validation data splits. Using test datasets that are externally sourced and temporally distinct (e.g., from a different clinic or time period) can provide further reassurance of a model’s generalizability. The fundamental choice of ML algorithm for a certain task is multifaceted and often driven by contextual reasoning. Nevertheless, Table 1 presents some rules-of-thumb regarding popular ML methods (Fig. 2) within the context of ART.
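The sketch below illustrates this train/validation/test discipline on synthetic data with scikit-learn: cross-validation is performed on the development split to estimate performance, and a held-out test set is touched only once at the end. All data and model choices here are illustrative assumptions.

```python
# Minimal sketch of a train/validation/test workflow with cross-validation,
# using synthetic data purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Hold out a final test set that is never used for model fitting or tuning
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the development data estimates performance
# across multiple training/validation splits
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc")
print("Cross-validated AUC:", round(cv_scores.mean(), 3))

# Only after tuning is the model refit and evaluated once on the held-out test set
model.fit(X_dev, y_dev)
print("Test-set accuracy:", round(model.score(X_test, y_test), 3))
```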

Table 1 Rules-of-thumb for most suitable machine learning algorithms

Pre-treatment counseling

Classically, age-stratified population estimates have been used to inform patients of their overall chance of success; however, these often fail to incorporate important determinants of outcome, such as previous treatment attempts or, for treatment-naive patients, their ovarian reserve and likely ovarian response. To tailor these models further, both population data and clinic-specific datasets have been used to develop models for a variety of outcomes, including cumulative live births across multiple cycles4. These models are now being used by patients and by a range of stakeholders, including national healthcare services and insurance providers managing access to care, clinics, and third-party companies offering shared-risk financial programs5. Moreover, the emergence of AI chatbots using LLMs could improve efficiency in the initial assessment of infertility. A recent ‘Fast Track to Fertility’ program using semi-automated two-way text messages reduced the time to complete a workup by 50%6. The deployment of LLMs for fertility assessment offers unique challenges and currently remains experimental, whilst the frameworks for validation and regulation of such systems are yet to be formalized2,7.

Gonadotropin dosing for ovarian stimulation

Ovarian stimulation (OS) is used to promote the growth of multiple ovarian follicles so that multiple oocytes can be retrieved8. IVF treatment is a profligate process, as not all follicles yield oocytes, not all oocytes will fertilize, and not all embryos will develop, implant, or be capable of becoming healthy babies. Various preparations of gonadotropins exist, but most contain a supra-physiological amount of follicle-stimulating hormone (FSH) to extend the ‘FSH window’ by maintaining high FSH levels and induce multi-follicular growth9. Optimization of the gonadotropin dosing regimen can maximize the number of follicles with respect to ovarian potential9. As such, an optimal initial dose of FSH can ensure that sufficient follicles are recruited, whilst avoiding the recruitment of too many follicles (often defined as >15 oocytes at pickup) and the associated increased risk of ovarian hyperstimulation syndrome (OHSS)8,10.

The application of ML approaches to retrospective datasets has demonstrated the potential to personalize FSH dose, as summarized in Table 2. Fanton et al. aimed to identify the 100 most similar patient profiles to each patient, in order to then generate individualized dose-response curves11. The authors reported limitations including a protocol-agnostic approach, and that 87% of cycles included both pure FSH and Menopur (for luteinizing hormone (LH)-like activity) during OS11. The methodology was further evaluated against the national US database (SART CORS), including 365,473 patients, and reported upon in conference proceedings12. The results similarly predicted that an increased number of two-pronuclear fertilized embryos (2PNs) and blastocysts could be obtained whilst using significantly lower total FSH doses, key in reducing high medication costs for patients12. Nevertheless, OS protocols vary across clinical practice, and the generated dose-response curves presented less confidence in predicting oocyte yield at lower FSH doses11, which are the norm in Europe (where 150-225 IU is suggested for normal responders8). Therefore, it is necessary to determine whether the proposed models are directed at certain geographies or intend to be universal. Setting a precedent for the conduct of future multi-center studies is central to achieving either objective—Ferrand et al. successfully leveraged a federated learning framework13, a potentially effective approach that allows data to be kept decentralized and private whilst deploying ML models for collaborative training between clinics14,15.
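As a rough sketch of the federated idea, the example below simulates three clinics that each fit a local linear model and share only their coefficients, which are then averaged in proportion to clinic size (a FedAvg-style scheme). The clinics, data, and model are entirely synthetic assumptions and do not reproduce the Ferrand et al. implementation.

```python
# Minimal FedAvg-style sketch: clinics train locally and share only model
# parameters, which are averaged; all data here are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(2)

def make_clinic(n):
    X = rng.normal(size=(n, 3))                  # hypothetical standardized predictors
    y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(0, 0.1, n)
    return X, y

clinics = [make_clinic(n) for n in (120, 80, 200)]   # three clinics, different sizes
global_coef, global_intercept = np.zeros(3), 0.0

for _ in range(10):                                  # communication rounds
    coefs, intercepts, sizes = [], [], []
    for X, y in clinics:
        local = SGDRegressor(max_iter=50, tol=None, random_state=0)
        # warm-start each clinic's model from the current global parameters
        local.fit(X, y, coef_init=global_coef.copy(), intercept_init=global_intercept)
        coefs.append(local.coef_)
        intercepts.append(local.intercept_[0])
        sizes.append(len(y))
    w = np.array(sizes) / sum(sizes)                 # weight updates by clinic size
    global_coef = np.average(coefs, axis=0, weights=w)
    global_intercept = np.average(intercepts, weights=w)

print("Federated coefficients:", global_coef.round(2))
```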

Table 2 Ovarian stimulation assessment studies using artificial intelligence

Recent studies have also focused on the effects of demographic, endocrine, and genetic data to optimize OS and thereby predict the retrieval of mature oocytes16,17,18. Although these are retrospective studies, they highlight the need to explore available characteristics and further assess their impact on clinical outcomes when determining dosing regimens, as endocrine monitoring or genomic sequencing during ART cycles may be efficacious for some patients19,20. To identify such predictors in an unbiased manner, the treatment cycles of patients should not exist in both the training and test sets21. An independent test set of patients should be partitioned at random, or, if cross-validation is employed, cycles from the same patient must not appear across the training and test folds.
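One simple way to enforce this patient-level separation is group-aware cross-validation, as sketched below with scikit-learn's GroupKFold; the patient identifiers, predictors, and oocyte-count outcome are synthetic and purely illustrative.

```python
# Minimal sketch: group-aware cross-validation so that cycles from the same
# patient never appear in both training and validation folds.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(3)
n_cycles = 400
patient_id = rng.integers(0, 150, n_cycles)     # several cycles per patient
X = rng.normal(size=(n_cycles, 5))              # hypothetical cycle-level predictors
y = rng.poisson(8, n_cycles).astype(float)      # e.g. mature oocytes retrieved

cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=0),
    X, y,
    groups=patient_id,     # the splitter keeps each patient's cycles in one fold
    cv=cv,
    scoring="neg_mean_absolute_error",
)
print("Patient-grouped CV MAE:", round(-scores.mean(), 2))
```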

Ultimately, determining the efficacy of introducing individualized gonadotropin dosing algorithms into the clinic will require appropriate validation across different geographies. The three prospective international multi-center randomized controlled trials (RCTs) for follitropin delta (recombinant-FSH; Ferring Pharmaceuticals) that assess a unique algorithm to facilitate individualization of dose based on anti-Müllerian hormone (AMH) and body weight are an apt example of that critical step22,23,24. The retrospective studies in Table 2 would benefit from similar prospective validation in multiple centers to establish whether their adoption in the clinic is appropriate and of value for patients.

Induction of oocyte maturation

Once multiple follicles have grown during OS, a hormonal trigger is administered to mature the oocytes in preparation for retrieval. The triggering agent is most efficacious when follicles are neither too large nor too small10. In turn, AI/ML techniques have been harnessed to optimize the trigger day (TD), as summarized in Table 3. Our research team previously developed a random forest model to determine the follicle sizes on TD that most contributed to the number of mature oocytes retrieved25. Maximizing the number of follicles sized 12-19 mm on TD was determined to be optimal for yielding mature oocytes and could be used as a feature in conjunction with baseline endocrine characteristics to predict oocyte yield19.

Table 3 Trigger day assessment studies using artificial intelligence

A more recent study leveraged patients who had ultrasound scans both on the day before trigger and on the actual TD, to learn why a clinician might decide to wait a further day to trigger26. They found follicles sized 16-20 mm to be most contributory in determining the optimal TD, and predicted superior outcomes in terms of 2PN and blastocyst yield compared with the clinician’s decision alone26. With a similar methodology but a simpler model, Fanton et al. confirmed these findings with even further granularity, showing that follicles sized 14-15 mm were most predictive on TD, whilst those sized 11-13 mm on the day prior to triggering were most contributory27. The aforementioned studies employed ML methods that provide predictor-importance measures against the desired outcome (oocytes retrieved), and therefore offer a useful data-driven target for oocyte maturation based upon many previous IVF cycles25,26,27. Transparent models such as these should be favored at embryonic stages of AI-driven developments, to ensure clinicians and patients can gain trust towards CDSSs as part of ART workflows28,29. It is crucial to take into account the nuances of workload management in daily clinical practice in order to incorporate AI models into workflows effectively30. Real-world data, where ultrasound scans may not be conducted every day, can challenge the precision of models developed to assess TD or misrepresent the predictive capacity of certain features.
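The following sketch illustrates the predictor-importance idea with a random forest whose inputs are counts of follicles per size bin on trigger day; the bins, the simulated outcome, and the resulting importances are synthetic and do not reproduce the findings of the cited studies.

```python
# Minimal sketch: ranking follicle-size bins by their contribution to a
# random forest's prediction of oocyte yield; all data are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
bins = ["<10 mm", "10-13 mm", "14-15 mm", "16-19 mm", ">=20 mm"]
# Hypothetical counts of follicles per size bin on trigger day, for 500 cycles
X = pd.DataFrame(rng.poisson([2, 4, 3, 3, 1], size=(500, 5)), columns=bins)
# Synthetic outcome in which mid-sized follicles contribute most to oocyte yield
y = 0.3 * X["10-13 mm"] + 0.9 * X["14-15 mm"] + 0.7 * X["16-19 mm"] + rng.normal(0, 1, 500)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=bins).sort_values(ascending=False)
print(importances.round(3))   # ranks size bins by their contribution to the prediction
```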

A proof-of-concept CDSS by Letterie and MacDonald (Table 2) also considered a decision point to trigger or cancel the cycle30. This notion was further developed in a later study looking specifically at TD assignment to optimize the retrieval of oocytes31. Features included pre-cycle characteristics, as well as estradiol level and follicle diameters determined on the single ‘best day’ for assessing TD, for which baseline AMH alone was most predictive31. A stacking model was trained, which combines the predictions of multiple ML models to improve overall robustness. This CDSS addresses the need to streamline follicular monitoring, which may arise for reasons such as long-distance travel to clinics or unprecedented public health constraints. In response to the constraints enforced by COVID-19, Robertson et al. demonstrated that day-5 of OS would be the ‘best day’ for predicting both the risk of OHSS and the optimal TD32. Both studies highlight that reduced monitoring may be possible in certain clinical settings, which could lower resource requirements in the clinic and the burden upon patients. The timing of the TD is a multifaceted decision point; therefore, to confirm utility in practice, prospective validation of the developed models in diverse populations would be a prudent next step.
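A minimal stacking example is sketched below with scikit-learn, in which a ridge regression meta-learner is fitted on the out-of-fold predictions of a random forest and a support vector regressor; the data are synthetic and the base models are illustrative choices, not those used in the cited study.

```python
# Minimal sketch of a stacking ensemble: a meta-learner combines the
# out-of-fold predictions of several base models; data are synthetic.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("svr", SVR(C=1.0)),
    ],
    final_estimator=Ridge(),   # meta-learner fitted on the base models' predictions
    cv=5,                      # base predictions are generated out-of-fold
)

scores = cross_val_score(stack, X, y, cv=5, scoring="r2")
print("Stacked model cross-validated R^2:", round(scores.mean(), 2))
```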

In the embryology laboratory

The application of AI in the embryology lab has attracted significant recognition in recent years and has been reviewed comprehensively33,34, with more recent developments summarized here (Tables 4, 5, and 6). The capacity of AI techniques to objectively analyze large amounts of complex data such as images and time-lapse videos, enabling non-invasive, real-time assessment of gametes and embryos, has significant potential for future impact in achieving healthy live birth. This can lessen the need for specialist embryology resources whilst automating some of the processes involved to reduce costs.

Table 4 Sperm assessment studies using artificial intelligence
Table 5 Oocyte assessment studies using artificial intelligence
Table 6 Embryo assessment studies using artificial intelligence

Sperm assessment

Computer-aided sperm analysis

Standard semen analysis, comprising concentration, motility, and morphology assessment, remains the first-line investigation of pre-treatment male fertility potential. Computer-aided sperm analysis (CASA) aims to reduce the intra-operator subjectivity and variability associated with manual assessment while standardizing and increasing throughput capacity. CASA analysis of sperm concentration and motility has shown a good correlation with manual assessment35, while estimates of progressive motility are also significantly linked to both in vivo and in vitro fertilization rates36,37,38,39. However, CASA-based morphological assessment tends to correlate the least with manual assessment, likely as a result of heterogeneity within a given semen sample and the subjective nature of interpretation35.

The latest WHO manual on sperm analysis40 (2021) recognized the ability of CASA to accurately determine sperm concentration and progressive motility parameters through the use of fluorescent DNA stains and tail-detection algorithms41. These advancements have improved the distinction between immotile spermatozoa and particulate debris; a problem that has led to the overestimation of concentration, and underestimation of progressive motility, since the inception of computer-aided systems.

At a population level, ML algorithms could be useful for identifying individuals at risk of an abnormal semen profile. An ANN based on an 11-question demographic questionnaire (including age, alcohol consumption, smoking status, urbanization, and occupational exposures) achieved 92.9% accuracy in predicting abnormal sperm concentration, and 85.7% for predicting any sperm abnormality42. Although developed in a small cohort of only 141 men, if replicated, such an AI-driven triage model could be used as a preliminary screening tool with early recourse to diagnostic testing.

Further, an ANN using semen parameters as inputs in 177 men was able to predict seminal plasma biochemical markers including fructose, zinc, and total protein content43. The added value of these biochemical parameters over standard semen analysis is still unclear, but a number of omics-based markers in seminal fluid have been identified as helpful in determining fertilization prognosis in a cost-effective manner44. Incorporating these techniques into the IVF clinic is challenging, namely due to initial set-up costs and the specialized techniques required for analysis. Moreover, whether these markers and profiles could drive the selection of an individual spermatozoon for fertilization remains unclear.

Motility

Accurate assessment of sperm motility is paramount in fully understanding genetic and biochemical factors that may impact normal fertilization and thus plays a key role in selection for ART. Motility prediction based on deep learning using sperm videos has been examined with promising results45,46,47. AI software may begin to allow correlation of kinetic motility patterns with other crucial factors such as sperm morphology, likelihood of fertilization, or blastocyst formation to aid in selection for intra-cytoplasmic sperm injection (ICSI) in real-time48,49. These studies show the potential of incorporating temporal features into deep learning models to extract insights into sperm motility consistently and efficiently.

Morphology

Staining of spermatozoa is currently required to identify morphological abnormalities and defects for diagnostic purposes. However, given that the staining of sperm affects their vitality and motility, tested spermatozoa are no longer viable for use in ICSI and thus, do not aid in sperm selection for fertilization50. Consequently, morphological assessment of a single spermatozoon in a non-invasive manner using AI techniques is of interest for sperm selection34. Some models consider specifically the sperm head morphology51,52,53,54, whereas others consider a more comprehensive analysis of the whole sperm55.

The WHO describes eleven different sperm head abnormalities, taking into account shape, size, and consistency40. Some of these subtypes present further challenges, with their morphology forming a vast continuum with overlaps, such that discrimination is difficult for the naked eye. Using a dictionary learning approach combined with segmented microscopic sperm head images, Shaker et al. achieved 92.3% accuracy in distinguishing between four sub-types against a ground truth dataset agreed by three experts52.

Open datasets of spermatozoa are becoming accessible to researchers and have been used to benchmark different models against one another51,52,56. The latest deep learning advancements with CNNs are capable of detecting morphological deformities of the sperm head, acrosome, and vacuoles in real time using low-magnification microscopes (400-600x), without staining and with increased objectivity56,57.

Non-invasive AI methods are also capable of assessing morphological features of immotile or frozen sperm that are difficult to characterize manually. Current viability tests require cytotoxic staining that renders individual spermatozoa unusable for ICSI. Recently, Jiang et al. described an AI model capable of identifying viable sperm from a single bright-field image without the need for any sample processing or reagents58. The model exhibited 94.9% accuracy, 97.0% sensitivity, and 93.3% specificity, based on subtle morphological changes to the cell nucleus. Incorporation of such AI models into existing CASA systems could further reduce the need for sperm staining in the future, especially in the context of surgically retrieved or frozen sperm with unknown viability.

To our knowledge, no computer-aided systems yet exist to improve the surgical retrieval of sperm. Current testicular sperm extraction techniques for ICSI can be challenging, with outcomes being greatly operator-dependent59. However, AI techniques to aid the identification of sperm from biopsies during testicular sperm extraction have been investigated. Wu et al. described a deep CNN capable of finding sperm in testicular biopsy samples with good accuracy (mean average precision of 0.74) but did not compare this to standard embryology techniques60. ML models employing 16 preoperative assessment variables (e.g., hormonal parameters, genetic, demographic, lifestyle, and urogenital history) have also been shown to predict the success of testicular sperm extraction with moderate performance61. Given the clinical implications of not pursuing surgical sperm retrieval (i.e., unequivocal use of donor sperm), further external validation of this promising model is required. The inclusion of additional biomarkers, such as more detailed genetic information, seminal plasma microRNA, or additional hormones, as a way of further improving model performance would also be of interest.

Sperm selection for ICSI is not standardized, and WHO guidelines are interpreted subjectively by embryologists. High-throughput AI models have the potential to be more objective and tackle the fundamental challenge of selecting individual sperm with the best potential for embryo formation from a sample of over 10⁸ gametes50. Nonetheless, with respect to morphology, there are currently no studies that assess AI performance against manual assessment according to WHO guidelines34. Indeed, the potential performance of AI networks is directly linked to the quality of the database used for training, as well as the caliber of data used as input. Progress on its use in sperm selection would benefit from global collaboration between clinical and laboratory teams to build a robust and definitive database of sperm images to establish a consensus ground truth.

DNA fragmentation

Existing techniques for sperm DNA fragmentation similarly lack data at the single-spermatozoon level. Modern-day tests of DNA integrity are invasive and conducted at the sample level, making them an unsuitable metric for the selection of individual sperm for ICSI. McCallum et al. described a CNN trained on a set of 1064 images of individual sperm cells of known DNA integrity to provide a DNA integrity prediction from a single bright-field image in under 10 ms62. Recently, Kuroda et al. described further progress with their AI-augmented sperm chromatin dispersion (SCD) test kit, capable of assessing DNA fragmentation in >5000 spermatozoa at once, compared with a limit of 300 in the widely used commercial Halosperm SCD test63. The improved kit showed a good correlation with the conventional test that requires manual counting (Halosperm G2; r = 0.69, p = 0.02). DNA fragmentation counting took 5 min with the automated device compared to around 20 min with the manual method63.

Emerging evidence increasingly suggests that sperm DNA fragmentation is associated with reduced male reproductive capability and can be assessed in combination with conventional sperm analysis64. However, routine testing remains contentious and may not necessarily provide predictive value65. Other technical limitations exist, in particular the use of different stains, microscopes, and assays for DNA fragmentation, which can challenge the training of an accurate AI model. Guidelines and optimal techniques for testing sperm DNA fragmentation have been proposed66,67, but testing is still not widely recommended. Progress in this field thus relies on the standardization and optimization of DNA fragmentation assays, prospective evaluation of their impact on ART outcomes, and the development of therapies to improve sperm DNA fragmentation levels68. Should this be achieved, ML algorithms that combine morphological, motility, and DNA fragmentation data with outcomes such as fertilization, miscarriage, and live birth rates could standardize, and vastly improve, single sperm assessment and selection by reducing subjectivity and inter-operator variability between embryologists.

Oocyte assessment

Nuclear maturity of human oocytes can only be verified by observation of the extruded polar body, which requires removal of the cumulus10. Automated, non-invasive methods to assess nuclear and cytoplasmic maturity and future reproductive potential would be desirable, particularly for fertility preservation. Accurate prediction of oocyte quality and fertilization prospects would allow better estimation of personalized live birth predictions from a pool of cryopreserved oocytes. Consideration of whether this is sufficient to realize a desired family size may dictate the need for further cycles of OS and cryopreservation. Clinicians would also be able to manage expectations for success and reduce the number of poor-quality embryos with low implantation potential69.

Currently, assessment of nuclear oocyte maturity is performed visually by embryologists in a subjective manner prior to fertilization. Oocyte scoring systems assessing cytoplasmic morphological features, such as the presence of vacuoles, degree of perivitelline space, and cytoplasmic granularity, among others, have long been proposed as predictors of insemination outcome but remain points of contention as prognostic indicators of embryo development and implantation70,71. Substantial labeled datasets of oocytes are scarce—as such, Kanakasabapathy et al. combined a retrospective dataset of oocyte images with known fertilization outcomes with synthetic oocyte images generated by a GAN to train a synthetically augmented CNN72. This synthetically-extended CNN outperformed the raw CNN and delivered an accuracy of 82.58% with an AUC of 0.81 in identifying oocytes that would fertilize normally to form two-pronuclear zygotes (2PNs), versus those that would not (non-2PNs)72. This study showed the value of using AI to augment the training, predictive power, and robustness of existing CNNs available for the embryology lab, perhaps widening their scope of use in ART73.

A non-invasive CNN-based software, VIOLET™ (Future Fertility), has been shown to predict fertilization and blastulation with 91.2% and 63% accuracy, respectively, based on morphological features of 2D oocyte images. The tool’s assessment was much quicker than that of expert embryologists and also outperformed them in accuracy74. VIOLET™ aims to give users undergoing oocyte cryopreservation a personalized estimate of live birth potential based on the morphology of the oocytes cryopreserved, as opposed to generalized age-related outcomes. Similarly, the MAGENTA™ tool employs 2D images of denuded oocytes and a similar morphology-based CNN to score oocytes and predict the potential for high-quality blastocyst formation with good accuracy75. Though promising in correlating oocyte morphology with blastocyst potential, these estimates do not account for potential male factor subfertility and could benefit from the incorporation of clinical variables, such as BMI or endometriosis, to enhance the prediction of outcomes such as clinical pregnancy or live birth.

More recently, a non-invasive gene expression test was prospectively trialed by Link et al.76. The ‘OsteraTest’ software is composed of eight ML modules and uses a 25-gene network to predict oocyte quality based on cumulus cells76. This bioinformatics-inspired approach was able to non-invasively predict oocyte development to a day-5 blastocyst with 86% accuracy76. Though further large-scale validation is necessary, this type of AI approach could change current practices in oocyte selection prior to cryopreservation and ICSI, as well as reduce the pool of embryos formed, cryopreserved, and tested prior to embryo transfer. This may be particularly beneficial in countries with regulatory frameworks surrounding embryos, such as Poland, where only six oocytes may be fertilized per cycle, or Germany, where no more than three embryos can be stored per treatment attempt. Additionally, it may guide egg sharing or donor oocyte cycles and inform how oocytes, or their total number within a cohort, should be distributed depending on blastocyst potential.

Although these approaches provide direction for further research, the data must be viewed with caution until published in peer-reviewed journals. In developing an AI model, it is imperative to define a set end goal, such as oocyte quality following oocyte cryopreservation. If fertilization is planned and blastocyst potential is being predicted, then spermatozoon quality and other male confounders should be considered. Proposed biomarkers to predict oocyte potential include follicular fluid markers (insulin-like growth factor, zinc levels77), cumulus-oocyte complex composition78, and cytoplasmic features such as mitochondrial function79. Consideration of these methods to guide oocyte selection in the future would also require analysis of whether they are feasible in daily practice or indeed as cost-effective as fertilizing all suitable oocytes80.

Embryo assessment

Embryo selection based on morphological assessment is an important predictor of success in IVF cycles but is primarily based on static visual observations at specific developmental time points. Information obtained in this manner is not only highly subjective, with great inter-operator variability, but also fails to capture the dynamic nature of a developing embryo in culture, thus limiting its accuracy. AI-driven embryo analysis is suited to predicting developmental potential, non-invasive aneuploidy assessment, and ultimately the selection of the embryo with the best live birth potential for transfer.

Morphokinetics and morphology

Examples of developments in embryo evaluation include the assessment of pronuclear-stage embryos to differentiate between 2PN and non-2PN zygotes81,82. Morphokinetic data such as cytoplasmic movements have also shown potential to predict blastocyst formation at early cleavage stages in a time series-based ANN model83. Further assessments of interest include morphological classification of pronuclei size and arrangement to monitor embryo development84. CNN models showed results comparable to manual labeling, with high precision and reproducibility, at a fraction of the time required by clinicians (12.18 s vs. 130 h)84. Despite promising results, standard morphological assessment, which is subjective and labor-intensive, remains the international consensus.

Time-lapse images combined with automated, CNN-based morphology assessment of embryos have shown promise, capable of outperforming individual embryologists with excellent accuracy85,86. Other fully automated deep learning-based models using time-lapse images, such as iDAScore (Vitrolife), have shown the ability to accurately assess embryo morphology without the need for concurrent embryologist assessment or annotation, and to predict implantation outcome87,88,89. The benefit of using time-lapse incubation systems and/or AI technology in the embryo selection process has yet to be proven superior to current means in double-blind RCTs90,91. The SelecTIMO trial recently showed no improvement in cumulative live birth rates when using uninterrupted culture conditions with routine morphological embryo selection compared to a time-lapse-based embryo selection algorithm alongside uninterrupted culture for day-3 embryos92. With no improvement in cumulative pregnancy rates or time-to-pregnancy, time-lapse-based selection may not improve pregnancy rates; however, whether this applies to day-5 embryos is still to be clarified. Nevertheless, the time-lapse technology was not inferior and therefore could achieve similar outcomes in an automated and less subjective manner. Importantly, with modern advancements in cryopreservation, it is likely that the most viable embryos will eventually be transferred if needed. Additionally, human input may be needed to aid the assessment of embryo quality, for example by repositioning embryos to obtain a better view, which should be taken into account when considering the application of AI for this task. Validation data from the VISA Study (ClinicalTrials.gov Identifier: NCT04969822), a noninferiority, prospective, multi-center RCT, may further reflect the clinical impact of AI-driven systems compared to manual morphology assessment by embryologists for day-5 embryos. Such studies highlight the necessity for the accuracy of predictions made via AI techniques to be prospectively validated prior to adoption into clinical practice, with appropriate mitigation of study biases and evaluation of cost-effectiveness20,93.

Recently, a biomarker-scoring CDSS based on 799 blastocyst videos, CHLOE EQ (Fairtility), has been described, which takes into account patient and embryo data including blastocyst diameter, degree and time of expansion, and other morphokinetic markers. Though preliminary results are promising, these new systems still require external validation and larger-scale prospective studies before widespread adoption, to realize the end goal of fully automated blastocyst assessment and accurate embryo prognosis94,95. It is paramount that future algorithms focus not only on the competitive selection of the best embryos for culture and transfer but can also differentiate between embryos that are otherwise morphologically indistinguishable to the naked eye, wherein the real challenge lies.

Aneuploidy

Rates of pre-implantation genetic testing for aneuploidy (PGT-A) as a screening tool to improve clinical outcomes in ART cycles have increased in recent years. Currently, PGT-A is performed by trophectoderm biopsy on blastocysts followed by whole-genome or targeted DNA amplification and a next-generation sequencing assay. Multiple blinded non-selection studies have now shown that an aneuploid result is highly predictive of failure to achieve a live birth96,97. Furthermore, discarding uniformly aneuploid embryos is unlikely to have a meaningful impact on cumulative live birth rates, especially in women over 35 years of age, where it is more likely to be employed98. As modern invasive techniques still bring technical and financial challenges, non-invasive AI-driven PGT-A could offer the benefits of PGT-A without embryo manipulation and biopsy. Recent single-center studies have shown ongoing validation of AI models that feed time-lapse imaging data into CNNs to predict ploidy status from abnormal morphokinetic patterns with good accuracy99,100. These models may not replace PGT-A but highlight the potential for PGT-A triage and well-informed guidance towards embryo selection in a non-invasive manner99,100,101,102. Once again, further validation and large multi-center datasets must be compiled for standardization and generalization of these AI-driven models.

Omics

A comprehensive understanding of the embryo at a molecular level may present another adjunct for the high-throughput and comprehensive capabilities of AI-driven predictive models in the future. Various metabolomic signatures of an embryo have been investigated over the years, mainly pertaining to metabolites or biomarkers in spent culture media as a reflection of complex physiological and pathological responses and, in turn, reproductive potential or ploidy status. Conflicting results for this approach have been reported103,104,105,106,107, while a previous meta-analysis including four RCTs and a total of 924 women showed no meaningful effects of metabolomic assessment on clinical outcomes108. Interestingly, an ANN employing a combination of conventional embryological data and thirteen metabolite levels identified by nuclear magnetic resonance spectroscopy has shown promise in predicting blastocyst implantation, though at a very small scale with a test dataset of twelve spent culture media samples109.

Current limitations of the omics approach lie in the vast variability in culture media components used and in the handling of spent media, contrasting infertility phenotypes, the absence of definitive biomarkers predictive of reproductive potential, and a general lack of conclusive evidence that fertility outcomes can be optimized through omics profiling. Though non-invasive, highly specific, and perhaps crucial towards a better understanding of gamete development, it is unclear whether omics profiling can effectively contribute to an improvement in clinical outcomes or will remain principally a research tool110. Furthermore, the complexities of omics analysis and the interpretation of output data present significant barriers to adoption in daily laboratory practice.

Embryo quality aside, reproductive outcomes also depend on implantation and the endometrium. The construction of models should therefore also integrate features of the uterus and the crosstalk between an embryo and the endometrium. To date, the clinical benefit of the endometrial receptivity array (ERA) for such assessment has yet to be proven111. The invasive nature of biopsy for endometrial receptivity testing, the time needed for results preventing immediate embryo transfer, and the uncertain accuracy of the diagnostic test itself are further limitations112. AI is, however, well suited to drive collaboration between ART clinics and omics-focused research groups, on account of its ability to perform large-scale data throughput and analysis. Whether these approaches will alter conventional therapies remains unclear, particularly as diagnoses such as true recurrent implantation failure and their relevance are currently being hotly debated113. However, given the lessons to date, the value of any ‘AI-omics’ platform should be validated in appropriately powered RCTs.

Conclusions and future prospects

With respect to ART, several groups have developed CDSS frameworks or decision-making tools for use at key decision points in the clinic and/or embryology laboratory17,30,31. Personalization in further avenues could further improve the clinical outcomes of ART. Ovarian response has been shown to vary significantly with ovarian reserve, between ethnic groups114,115, with FSH receptor genetic polymorphisms116, and with body weight19,117. Therefore, incorporating such factors, which influence pharmacokinetic parameters, when dosing gonadotropins9,19,20 or suppressing premature ovulation20,118 may be beneficial. ML methods could also help tailor luteal phase support regimens to certain patient subgroups, where a lack of clinical consensus currently exists119.

The ubiquity of electronic health records (EHRs) has accelerated the development of CDSSs15. A predominant barrier to adoption is trustworthiness, especially with ‘black-box’ AI systems29. This has led to transparency becoming a key characteristic preferred by clinicians, as transparent models offer simpler interpretations, although they may compromise accuracy when applied to more complicated learning tasks28. Implementations of ‘black-box’ models are nevertheless evolving, especially for embryological analyses, where the data are primarily image-based; in turn, efforts in explainability have emerged to seek insights into model generalizability, fairness, and trustworthiness94,95. Misleading conclusions may be reached if clinical inference is neglected during the decision-making process, since such methods are often correlation-based and prone to ‘overfitting’120. Counterfactual examples in this context, such as “what if the optimal TD had been the day before?” or “what if the other embryo had been implanted?”, are generally unavailable; to further exacerbate this, ground truths are often based upon clinical guidelines and scoring rather than objective outcome labels. The emergence of omics analyses offers an alternative, and arguably more efficient, solution for clinical and embryological assessment, although advancements currently remain preliminary18,108. Ultimately, appropriate assessment of CDSSs for ART is necessary in practical, ethical, and clinical contexts prior to clinical adoption. Rigorous validation with comprehensive standardized reporting is essential for establishing trustworthy models before attempting viable integration into clinical workflows21,121. Research conduct and reporting guidelines such as PROBAST-AI are in progress for the wider field of AI in healthcare, and with these at hand, a more granular and contextual guideline for AI in the domain of ART can be proposed122,123.

Salient efforts from both academia and industry have validated the utility of retrospective data to enable data-driven decision-making for ART123. To ensure viable deployment, these models would benefit from larger, multi-center datasets that incorporate heterogeneous patient populations and capture the idiosyncratic nature of clinical practice worldwide. This is best achieved through a collaborative effort from all stakeholders, representing multiple disciplines across the AI and healthcare landscape21. Furthermore, streamlining workloads is an essential objective of CDSSs, and seamless implementation with, or within, EHR systems is essential so as not to inadvertently decrease the efficiency of clinical workflows. Prospective validation (e.g., well-designed RCTs) with relevant outcome measures is a key step to assess the efficacy and efficiency of these models in clinical environments and thus demonstrate impact on patient outcomes. With such efforts in place, a comprehensive end-to-end CDSS seems a plausible future goal. Whether this paradigm should extend to an autonomous AI clinician within the ART domain remains an open and contentious question. The use of AI to automate some of the tasks currently performed by clinicians or laboratory staff could have implications for training and a potential loss of expertise in the workforce, but may also free up staff time to focus on more challenging and physically demanding technical processes. Reflections on the literature to date elicit valuable questions regarding future studies, including determining what should be measured or captured, to what precision, and how often. Decision points cannot necessarily be considered in isolation, and the relationships between some of the key topics described in this review require further interdisciplinary research to prioritize the individualization and utility of certain decisions over others. The intersection of AI and ART undoubtedly remains a nascent and valuable field of study, which has the potential to reduce intensive resource use whilst ultimately improving clinical outcomes for patients.