Computers in Biology and Medicine

Embryo selection is a critical step in assisted reproduction: good selection criteria are expected to increase the probability of inducing a pregnancy. Machine learning techniques have been applied for implantation prediction or embryo quality assessment, which embryologists can use to make a decision about embryo selection. However, this is a highly uncertain real-world problem, and current proposals do not always model all the sources of uncertainty. We present a novel probabilistic graphical model that accounts for three different sources of uncertainty: the standard embryo and cycle viability, and a third one that represents any unknown factor that can drive a treatment to failure in otherwise perfect conditions. We derive a parametric learning method based on the Expectation–Maximization strategy, which accounts for uncertainty issues. We empirically analyze the model on a real database consisting of 604 cycles (3125 embryos) carried out at Hospital Donostia (Spain). Embryologists followed the protocol of the Spanish Association for Reproduction Biology Studies (ASEBIR), based on morphological features, for embryo selection. Our model predictions are correlated with the ASEBIR protocol, which validates our model. The benefits of accounting for the different sources of uncertainty and the importance of the cycle characteristics are shown. Considering only transferred embryos, our model does not further discriminate them as implanted or failed, suggesting that the ASEBIR protocol could be understood as a thorough summary of the available morphological features.


Introduction
Assisted reproductive technologies (ARTs) are a set of invasive medical techniques that attempt to induce a pregnancy. Each trial of treatment is known as a cycle. The woman first follows a treatment of ovarian stimulation for several weeks to induce the development of multiple follicles with a large number of oocytes. Then, oocytes are retrieved and fertilized, and the resulting embryos are cultured for several days. Finally, clinicians need to select which embryos are transferred to the woman's uterus [1]. This process is physically and psychologically tough, especially for women, and success is not guaranteed. The Spanish Society of Fertilization (SEF) reported in 2018 that only 35.6% of the ART cycles succeeded (ended up in pregnancy) [2]. The probability of success can be improved by increasing the number of transferred embryos [3], but this also leads to higher multiple-birth rates, which is considered risky for both mother and fetuses [3,4]. Thus, many countries restrict the number of embryos that can be transferred (e.g., Spanish law limits it to 3). Therefore, the selection of the most viable embryos is a critical step to optimize the probability of pregnancy.
Embryo selection is a complex and partially subjective task. The evaluation of embryos is based mainly on the evolution of their morphological characteristics. The protocol of the Spanish Association for Reproduction Biology Studies (ASEBIR) [1], the criteria of reference in Spain, classifies embryos on an ordinal scale (from A, high-quality embryos, to D, low-quality embryos) using morphological criteria. It was proposed as a unified protocol to address the lack of consensus in embryo quality assessment [5].
In recent years, machine learning (ML) techniques have been used to assist embryologists in embryo selection and pregnancy prediction [6–9]. Most of them rely on supervised classification and require complete and fully labeled training data. That is, we would need to know, for each embryo in our training dataset, its viability to induce a pregnancy. However, in ARTs, viability can only be determined after transfer, by the occurrence of embryo implantation. Moreover, for a transfer of multiple embryos, current techniques are unable to identify individually which embryo(s) implanted. This implies that many embryos are not (fully) labeled in ART data samples, and previous works usually discarded all the embryos lacking a full labeling. More recently, specific methods [8] have been proposed that also learn from cycles with partial implantation (not all the transferred embryos were implanted).

https://doi.org/10.1016/j.compbiomed.2022.106160
Received 29 April 2022; Received in revised form 8 September 2022; Accepted 1 October 2022
All current methods, ML-based or not, use a combination of descriptive characteristics of embryos and cycles to predict embryo implantation. Yet, there exists a recurrent situation in assisted reproduction units: apparently viable cycles, using allegedly viable embryos, do not succeed. The repeated occurrence of this type of failure suggests that there still exist unknown factors which also determine cycle success.
In this paper, we propose a novel probabilistic graphical model that works under the assumption of independence between embryo and cycle viability, and accounts for a third source of uncertainty corresponding to unknown factors that can lead a cycle to fail. We have derived a learning algorithm specifically for this model based on the Expectation-Maximization (EM) strategy, given the context of partially labeled data and latent variables. We use two probabilistic classifiers to approximate the probability distributions of embryo and cycle viability given their respective descriptive features.
We perform a thorough experimental validation of the model using real data. It is compared with several baseline approaches designed in an incremental way in order to test different working hypotheses. We also test the importance of the cycle features for embryo implantation prediction, as well as the relationship between the predictions of our model and the ASEBIR protocol. A preliminary version of this last part of the empirical validation was presented in [10]. The results show the ability of our model to learn and take advantage of all the available information. Its behavior is in line with the ASEBIR score, which validates our model.
The rest of the paper is organized as follows. First, the state of the art is reviewed. Then, we describe the real data available for this study and the proposed model, as well as the learning technique derived for it. In Section 4, we discuss a complete empirical evaluation of our model against several baseline techniques. The paper closes with conclusions and future work.

State of the art
ART treatments are complex processes involving maternal hormonal changes, immune responses, and maturational events in the embryo. A treatment can fail when these events are not synchronized [11]. Despite the great improvements in ovarian stimulation protocols and fertilization procedures, implantation rates per embryo remain at approximately 15% and many patients experience multiple failed attempts [12]. Recurrent implantation failure (RIF) is a condition resulting from repetitive unsuccessful ART cycles [13], and it provides evidence of the existence of still unknown factors that affect ART success.
All this has provided an ideal context for the application of ML methods. For more than 20 years, ML techniques have provided the standardized and efficient tools demanded in laboratories for evaluating the different processes in ARTs: from embryo selection to assessing patient reproductive potential, or individualizing stimulation protocols [14]. Since the popularization of infertility treatments, many works have focused on the problem of ART outcome prediction and one of its critical steps: the selection of embryos [15–17]. This subfield has rapidly evolved in the last decade due to technological advances. In the classical scenario, embryologists collect the most relevant morphological traits of embryos by visual inspection under a microscope [18]. More recently, the use of (static) photographs of the embryos enabled the use of automated image processing techniques [19,20]. Current embryo incubators incorporate cameras that allow embryologists to inspect the whole evolution of the embryos through time-lapse videos.
This data is being fed directly to ML models for embryo viability prediction [21]. Indeed, these sources of data are complementary and can be combined in a single ML method [22,23].
The technological breakthroughs have brought novel machine learning methods too, which have been applied to ARTs. Standard artificial intelligence techniques have been used, such as ranking algorithms [24], statistical models, ensemble techniques, neural networks [25], Bayesian networks [6,8,26], Support Vector Machines [27], classification and regression trees, logistic regression, case-based reasoning systems, etc. [15,17]. More recently, deep learning methods [9,28] have been used to analyze the vast amount of data coming from time-lapse incubators. ML techniques are of great interest since traditional morphokinetic grading systems can be subjective and variable. It is generally agreed that ML methods are promising for the ART community but still require further validation [29]. For example, fully automated (time-lapse imaging) approaches require costly equipment and have not demonstrated sufficient predictive ability yet [30]. Moreover, there exist doubts about the deployment of these systems in the medical domain, regarding technological and ethical aspects. Recently, Müller et al. [31] proposed a list of ten principles for designing ML-based decision support systems: an ethical system should be transparent, explainable, fair, repeatable, under the responsibility and monitoring of a physical person, and its suggestions must imply no human harm.
Most of these ML works take the standard supervised classification approach, which requires completely labeled datasets. However, labeling all the embryos is not always possible: in a cycle where not all the transferred embryos get implanted, current medical techniques reveal how many embryos got implanted, but not exactly which ones. Many previous works [9,32,33] directly disregard the embryos from these partially observed cycles. To circumvent this issue, Morales et al. [26] proposed joining the descriptive vectors of all embryos in each cycle and learning to predict a pregnancy. Hernández-González et al. [8] reformulated the task as a weakly supervised learning problem, and learned using all the embryos and the available information of supervision (label proportions or counts of implanted embryos per cycle).
Another widespread approach is the embryo-uterine model (EU), introduced by Speirs et al. [34] and later extended by Zhou and Weinberg [35]. It assumes that, for a pregnancy to happen, both a fertile patient (receptive uterus) and a viable embryo are required. Two separate modules (embryo [E] and uterus [U]) compose it: the probability of implantation is predicted as the product of the probabilities given by both submodules. These models suffer from even harder issues of partial observability: if a cycle fails and no embryo implants, one cannot know if the embryos were not viable, if the cycle was not fertile, or both. Roberts [18] addressed this via the Expectation-Maximization (EM) algorithm, and Corani et al. [6] used a Bayesian network trained with an averaging approach as an alternative to MAP estimation using a very limited set of descriptive features for cycles and embryos. Roberts and Stylianou [36] used an EU model to try to assess other unknown factors that might be related to a given patient when they undergo several ART cycles.

Data
The database, originally presented by Hernández-González et al. [8], was collected by the Unit of Assisted Reproduction of the Hospital Donostia (Spain) from January 2013 to June 2015. In total, 604 cycles were carried out, comprising 3125 embryos. Each cycle has a certain number of associated embryos, only some of which were actually transferred. As detailed in Table 1, in this period 412 cycles failed to induce a pregnancy (839 embryos), and only in 57 cycles were all the transferred embryos (108) implanted. In the remaining 135 cycles, only a subset of the 307 transferred embryos were implanted. This last subset is of relevance in our analysis, as we cannot determine the real fate of each embryo individually (it is not possible to know which specific embryos are the ones implanted). Among all the cycles, 1871 embryos were not selected for transfer. The reasons for limiting the number of embryos to transfer range from the low quality of the embryos to legal restrictions (in Spain, the maximum number of embryos that can be transferred in a single trial is 3).
Each cycle is described by 25 features including characteristics of the patients (female and male) and stimulation procedure. Moreover, summary variables of the associated embryos are provided (e.g., cycle's fertility rate, i.e., the proportion of oocytes successfully fertilized). Each embryo is described by 20 features, mainly morphological characteristics at different stages of development (up to 48 h after fertilization, when the transfer was carried out). Appendix A details the descriptive features of both subsets. In practice, only informative variables were considered. We used one-hot encoding to transform categorical features into numeric ones. All of them were then standardized (centered and scaled to unit variance). After preprocessing, 36 features for cycles and 25 for embryos were left.
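The two preprocessing steps mentioned above (one-hot encoding and standardization) can be illustrated with a minimal, standard-library-only sketch. The helper names `one_hot` and `standardize` and the toy values are hypothetical and are not taken from the study's pipeline, which may well rely on library implementations instead.

```python
from statistics import fmean, pstdev


def one_hot(values):
    """Encode a categorical column as one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def standardize(column):
    """Center a numeric column and scale it to unit (population) variance."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0.0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Toy usage with made-up values (not real features of the database).
categories, encoded = one_hot(["a", "b", "a"])
scaled = standardize([1.0, 2.0, 3.0])
```

Note that one-hot encoding increases the number of columns, which is consistent with the feature counts growing after preprocessing.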
A key feature is success rate (∈ [0, 1]), which indicates the proportion of transferred embryos in the cycle that implanted. Note that this is the ultimate information we would like our models to predict. The value 0 indicates that all the embryos of the cycle failed to implant, 1 that all of them implanted, and any value in the interval (0, 1) indicates the proportion of implanted (and failed) embryos. This latter case is directly related to the aforementioned problem of partially observed data: we cannot know the identity of the implanted embryos (we do not know their actual outcome) in the cycle; we only know that some of them were implanted.
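As a small illustration of how the success rate determines what is known about each transferred embryo, the following sketch maps a cycle to one of the three situations described above. The function name and the returned strings are hypothetical labels chosen for this example.

```python
from fractions import Fraction


def cycle_status(n_implanted: int, n_transferred: int) -> str:
    """Classify a cycle by its success rate (implanted / transferred)."""
    if n_transferred <= 0:
        raise ValueError("a cycle needs at least one transferred embryo")
    if not 0 <= n_implanted <= n_transferred:
        raise ValueError("implanted count must lie in [0, n_transferred]")
    rate = Fraction(n_implanted, n_transferred)
    if rate == 0:
        return "failed"  # every transferred embryo is a known negative
    if rate == 1:
        return "fully_implanted"  # every transferred embryo is a known positive
    return "partially_implanted"  # only the count of positives is known


# Embryo counts reported for the database: failed, fully implanted,
# partially implanted (transferred), and never transferred.
assert 839 + 108 + 307 + 1871 == 3125
```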
For each embryo, we also have a quality score (A-D, from high to low-quality embryos) given by embryologists according to the ASEBIR protocol [1], which assigns each embryo to a category based on its morphological characteristics. The distribution of embryos among categories is rather balanced, although category C (mid-low quality) stands out (see Fig. 1). This quality score is a decisive factor in the selection process performed by embryologists, as can be seen in Fig. 2(a).
Ideally, there should be a clear difference in the implantation rate of embryos graded in different categories. We display in Fig. 2(b) the fraction of transferred embryos with different outcomes for each quality category. It can be observed that there is a small signal: as the quality score increases, the proportion of non-implanted embryos decreases. Given the difficulty of this problem, these numbers support the effectiveness of the ASEBIR protocol as an indicator of implantation.

A probabilistic implantation model for ART
In this section, we present a probabilistic model that comprehensively accounts for three different sources of uncertainty in the ART problem. Later on, we present its learning method, which uses all the available information, even the partial label information from cycles where not all the embryos were implanted.
We model the problem of ART by means of a probabilistic graphical model (PGM) [37], which rests on a solid mathematical foundation. A directed acyclic graph is used to encode a set of conditional independencies between the random variables, and the joint distribution factorizes as the product of conditional probability distributions for each variable given its parents. Given a fixed structure, the model parameters can be estimated from data.
The proposed model takes into account three sources of uncertainty related to the success of an ART procedure, namely the viability of embryos and cycles, and other unknown factors.
The viability of the embryo. A widely accepted assumption in ART is that the individual characteristics of an embryo ($x^E_{ic}$) are relevant in order to predict the probability of such embryo implanting in the uterus. According to the provided data, an embryo's viability is assumed to be related to its morphological traits. Here, the distribution $p(e_{ic} = 1 \mid x^E_{ic}; \beta)$ (Eq. (1)) measures the probability of the embryo to implant in a ''perfect cycle'' (fully fertile patient), where $x^E_{ic}$ represents the descriptive characteristics of embryo $i$ of cycle $c$. We will model this distribution with a probabilistic classifier.
The viability of the cycle. Another common assumption is that the individual patient features and the stimulation treatment undergone influence her fertility potential. This is how we define cycle viability. The distribution $p(v_c = 1 \mid x^C_c; \alpha)$ (Eq. (2)) assesses how the descriptive characteristics of the cycle, $x^C_c$, influence fertility potential. We will model this distribution through a probabilistic classifier too. These two components form the classical embryo-uterine modeling approach. It implies that we assume that the fertility potential of the patient is statistically independent from the embryo characteristics. This is a practical assumption, but highly unlikely when the patient's own oocytes are used, as was the case in this study.

Other unknown factors.
There is consensus in the ART scientific literature that there are still unknown factors that (partially) determine the outcome of an ART treatment, like those provoking recurrent implantation failure [38]. We model this uncertainty by means of a Bernoulli distribution with parameter $\theta_1 \in [0, 1]$. The implantation of a viable embryo in a fertile cycle follows the distribution $z_{ic} \mid (e_{ic} = 1, t_{ic}, v_c = 1) \sim \mathrm{Bernoulli}(\theta_1 \cdot t_{ic})$ (Eq. (3)), where $t_{ic}$ is 1 if embryo $i$ was transferred in cycle $c$, and 0 otherwise. Here, $\theta_1$ is the probability that in a cycle that has been properly configured ($v_c = 1$), a viable embryo ($e_{ic} = 1$), selected for transfer ($t_{ic} = 1$), gets implanted. Ideally, there would be no such unknown factor and $\theta_1 = 1$. For modeling convenience, we use a second Bernoulli with fixed parameter $\theta_0 = 0$, which states that there will be no implantation if any of the random variables $e_{ic}$, $v_c$ or $t_{ic}$ is 0 (no viable embryo or cycle, or embryo not transferred).
Finally, the number of embryos implanted in cycle $c$, i.e., the outcome $y_c$, is deterministically assessed as $y_c = \sum_{i} z_{ic}$. Note that depending on the practice of the specific ART unit, more than one transfer could be carried out for the same cycle. Following the practice of our Unit of reference (and as reflected in the data), here we only consider the case where a single transfer of one or more embryos is carried out in each cycle.
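To make the three sources of uncertainty concrete, the generative reading of the model can be sketched as a small sampler. This is only an illustration under assumed inputs: `p_cycle` and `p_embryos` stand for the outputs of the cycle and embryo classifiers, `theta1` for the Bernoulli parameter, and all names are hypothetical.

```python
import random


def sample_cycle(p_cycle, p_embryos, transferred, theta1, rng):
    """Sample one cycle outcome: (number implanted, per-embryo implantation)."""
    # Source 1: is the cycle viable (fertile patient)?
    cycle_viable = rng.random() < p_cycle
    implanted = []
    for p_embryo, was_transferred in zip(p_embryos, transferred):
        # Source 2: is this embryo viable?
        embryo_viable = rng.random() < p_embryo
        # Source 3: unknown factors. Implantation is possible only when the
        # cycle and the embryo are viable and the embryo was transferred;
        # otherwise the probability is fixed to 0.
        if cycle_viable and embryo_viable and was_transferred:
            p_implant = theta1
        else:
            p_implant = 0.0
        implanted.append(int(rng.random() < p_implant))
    # The cycle outcome is the deterministic sum of implanted embryos.
    return sum(implanted), implanted


rng = random.Random(0)
outcome, per_embryo = sample_cycle(0.7, [0.6, 0.4, 0.5], [1, 1, 0], 0.8, rng)
```

By construction, an embryo that was not transferred never implants, and the outcome never exceeds the number of transferred embryos.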
The graphical structure of our model is shown in Fig. 3. The observed variables are shaded ($x^C_c$, $x^E_{ic}$, $t_{ic}$, $y_c$), whereas white nodes ($v_c$, $e_{ic}$, $z_{ic}$) represent latent variables, whose values need to be inferred. In certain cases, the values of some of these latter random variables can be known. Finally, $\alpha$, $\beta$ and $\theta$ are the hyper-parameters of the cycle and embryo classifiers (Eqs. (2) and (1), respectively) and of the Bernoulli distribution (Eq. (3)). All the notation is summarized in Table 2.
The joint probability distribution of the model factorizes according to the graph in Fig. 3: each variable is conditioned only on its parents.
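As a hedged sketch of that factorization, and assuming the notation $v_c$ (cycle viability), $e_{ic}$ (embryo viability), $t_{ic}$ (transfer indicator), $z_{ic}$ (implantation), $y_c$ (cycle outcome) and parameters $\langle \alpha, \beta, \theta \rangle$, the product implied by the described parent sets would read as follows. The symbols are assumptions made for illustration and may differ from those used in the original equation.

```latex
\begin{equation*}
p(\mathbf{y}, \mathbf{z}, \mathbf{e}, \mathbf{v} \mid \mathbf{x}^{E}, \mathbf{x}^{C}, \mathbf{t}; \alpha, \beta, \theta)
= \prod_{c} p(v_c \mid x^{C}_{c}; \alpha)
  \Big[ \prod_{i \in c} p(e_{ic} \mid x^{E}_{ic}; \beta)\,
        p(z_{ic} \mid e_{ic}, t_{ic}, v_c; \theta) \Big]\,
  p(y_c \mid z_{c}),
\qquad
p(y_c \mid z_{c}) = \mathbb{1}\Big[\, y_c = \sum_{i} z_{ic} \Big].
\end{equation*}
```

Each factor corresponds to one of the components described in this section: the two classifiers, the Bernoulli distribution for unknown factors, and the deterministic outcome.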

Machine learning method
In the presented model, there are latent variables ($v$, $e$ and $z$) whose value is (generally) unknown, which makes the learning of the model parameters $\langle \alpha, \beta, \theta \rangle$ difficult. We use an Expectation-Maximization (EM) algorithm [39] to overcome this issue. EM is an iterative strategy to find (local) maximum likelihood estimators of the model parameters in the presence of missing data or latent variables. First, the expected value of the missing data is obtained. Then, the maximum likelihood estimates (MLE) of the parameters are obtained for that completed data.
Formally, let $X$ be the observed variables in the model and $Z$ the unobserved latent ones. The complete log-likelihood is $\ell(\Theta; X, Z)$, where $\Theta$ are the parameters that we want to estimate by maximizing the likelihood.
The expectation (E) step consists in computing the conditional expected value of the log-likelihood given the observed variables and the current fit of the parameters $\Theta^{(t)}$: $Q(\Theta \mid \Theta^{(t)}) = \sum_{Z} p(Z \mid X; \Theta^{(t)})\, \ell(\Theta; X, Z)$, where $p(Z \mid X; \Theta^{(t)})$ is the conditional probability distribution of the unobserved variables $Z$ conditioned on the observed variables $X$ and the current fit of the parameters $\Theta^{(t)}$.
The $\theta_1$ value that maximizes this expression, i.e., the maximum likelihood estimator of $\theta_1$, is the probability that a viable embryo selected for transfer in a fertile cycle gets implanted. It can be understood as our ability to model the uncertainty of the problem: the higher this probability is, the larger the portion of the uncertainty modeled by the classifiers of embryos and cycles, and thus the more explanatory the model can be for new cycles. The full derivation of this estimator is given in Appendix B. As aforementioned, both Eqs. (1) and (2) are approximated by means of probabilistic classifiers, and the model parameters $\alpha$ and $\beta$ represent the hyperparameters of the respective classifier. In this sense, the values of $\alpha$ and $\beta$ that maximize the previous conditional expectation are obtained by learning a new fit for the classifiers using Eqs. (6) and (7), respectively, to weigh the instances of the training set.
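To sum up, after initialization, our EM algorithm iteratively repeats these two steps:
(i) Expectation: The expectation of the latent variables $v_c$, $e_{ic}$ and $z_c$ is computed with Eqs. (6) to (8), using the current fit of the model $\langle \alpha^{(t)}, \beta^{(t)}, \theta^{(t)} \rangle$. Note that there exist cases where we do know the value of the latent variables. When a cycle ended up in a pregnancy ($y_c \geq 1$), we know that the cycle was viable ($v_c = 1$), so we can safely use $p(v_c = 1) = 1$ and $p(v_c = 0) = 0$. Moreover, when the number of implanted embryos is the same as the number of transferred embryos ($y_c = |T_c|$, success rate = 1, with $T_c$ the set of embryos transferred in cycle $c$), we know that all the transferred embryos were viable ($e_{ic} = 1$), so we can safely use $p(e_{ic} = 1) = 1$ and $p(e_{ic} = 0) = 0$ for all $i \in T_c$. In this case, there also exists a single valid implantation vector $z \in \Omega_c$ ($|\Omega_c| = 1$, with $\Omega_c$ the set of implantation vectors compatible with $y_c$), and thus $p(z_c = z) = 1$.
(ii) Maximization: The parameters are updated given these expectations: $\theta_1$ through its maximum likelihood estimator, and $\alpha$ and $\beta$ by refitting the classifiers with the training instances weighted by the corresponding expectations.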
The method iterates these two steps until the stopping condition is met. The pseudocode of the resulting method is shown in Algorithm 1.
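The alternation between the two steps can be illustrated with a deliberately simplified sketch. It is not Algorithm 1: here the classifier outputs are kept fixed (the actual method refits both classifiers with weighted instances in each M-step), only transferred embryos are considered, and only the Bernoulli parameter is re-estimated. The update used, total implanted embryos divided by the expected number of viable embryos transferred in viable cycles, is the estimator one obtains from the model description under these simplifications. Convergence is tested on the parameter rather than on the sample weights. All names are hypothetical.

```python
from itertools import product
from math import comb


def e_step(p_v, p_e, y, theta1):
    """Posterior expectations for one cycle by enumerating latent states.

    p_v: prior probability that the cycle is viable (cycle classifier).
    p_e: prior viability probabilities of the transferred embryos.
    y:   observed number of implanted embryos.
    Returns E[v], the list of E[e_i], and the list of E[v * e_i].
    """
    n = len(p_e)
    total, exp_v = 0.0, 0.0
    exp_e, exp_ve = [0.0] * n, [0.0] * n
    for v in (0, 1):
        prior_v = p_v if v else 1.0 - p_v
        for e in product((0, 1), repeat=n):
            prior = prior_v
            for e_i, p_i in zip(e, p_e):
                prior *= p_i if e_i else 1.0 - p_i
            k = sum(e) if v else 0  # embryos that are able to implant
            # Each of the k candidates implants independently with theta1.
            lik = 0.0 if y > k else comb(k, y) * theta1 ** y * (1.0 - theta1) ** (k - y)
            w = prior * lik
            total += w
            exp_v += w * v
            for i, e_i in enumerate(e):
                exp_e[i] += w * e_i
                exp_ve[i] += w * v * e_i
    if total == 0.0:
        raise ValueError("observed outcome has zero probability under the model")
    return exp_v / total, [x / total for x in exp_e], [x / total for x in exp_ve]


def fit_theta1(cycles, theta1=0.5, max_iter=100, tol=1e-8):
    """EM for theta1 only; cycles is a list of (p_v, p_e, y) tuples."""
    for _ in range(max_iter):
        implanted, eligible = 0.0, 0.0
        for p_v, p_e, y in cycles:
            _, _, exp_ve = e_step(p_v, p_e, y, theta1)  # E-step
            implanted += y
            eligible += sum(exp_ve)
        new_theta1 = implanted / eligible if eligible > 0 else theta1  # M-step
        if abs(new_theta1 - theta1) < tol:
            return new_theta1
        theta1 = new_theta1
    return theta1
```

In this sketch the special cases discussed above arise naturally: for a fully implanted cycle the only latent configuration with non-zero posterior is the one where the cycle and all its transferred embryos are viable.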

Set up
To initialize Algorithm 1, we assign initial probabilities directly to the sample weights (for cycles, embryos, and implantation vectors) and obtain a first fit of the model with them (as if it were an M-step). All the weights are randomly generated and normalized to sum up to 1. The only exceptions are the special cases previously discussed where we actually know the value of these latent variables, for which no random initialization is required.
We use a two-fold stopping condition: we test convergence by comparing the sample weights assigned in consecutive iterations, and we fix a maximum number of iterations (100) that the algorithm can run.
As is well known, the EM strategy is only guaranteed to reach local maxima or saddle points of the likelihood. We run our algorithm multiple (10) times with different initializations to try to reach other local maxima and keep only the best one, thus mitigating the local-maximum problem of EM algorithms.
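The set-up described above (random normalized weights, and several restarts of which only the best is kept) can be sketched as follows. The sketch assumes that "best" means the run with the highest log-likelihood, which the text does not state explicitly; `fit` is a hypothetical callable that runs one full EM from a random initialization and returns `(parameters, log_likelihood)`.

```python
import random


def random_weights(n, rng):
    """Draw n strictly positive random weights normalized to sum to 1."""
    raw = [1.0 - rng.random() for _ in range(n)]  # values in (0, 1]
    total = sum(raw)
    return [r / total for r in raw]


def best_of_restarts(fit, n_restarts=10, seed=0):
    """Run fit(rng) several times and keep the run with the highest score."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        params, log_lik = fit(rng)
        if best is None or log_lik > best[1]:
            best = (params, log_lik)
    return best
```

A single seeded generator is shared across restarts so that each run starts from a different initialization while the whole experiment stays reproducible.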

Empirical validation
In this section, we aim to perform a robust validation of the proposed model. To do so, we use different probabilistic classifiers for our embryo viability (Eq. (1)) and cycle fertility potential (Eq. (2)) modules, and compare them to others learned with up to 3 different baseline approaches. In particular, we carry out three sets of experiments:
• Experiment #1: we compare the results of our model against the classifiers obtained with a series of baseline approaches to test the behavior of our proposal.
• Experiment #2: we estimate the relevance of the information of the cycle in the embryo implantation predictive task by including the cycle's characteristics as descriptive variables for the baseline approaches too.
• Experiment #3: we validate our model by comparing its results to the ASEBIR protocol. Specifically, we compare the performance of our model with and without this score as a feature.
The interpretation of the predictions that we can obtain from our model needs proper consideration. For instance, by using the whole model we obtain the probability of implantation of an embryo in a cycle (it assumes independence between embryo and cycle), which is calculated as $p(z_{ic} = 1 \mid x^E_{ic}, x^C_c, t_{ic}) = p(e_{ic} = 1 \mid x^E_{ic}; \beta) \cdot p(v_c = 1 \mid x^C_c; \alpha) \cdot p(z_{ic} = 1 \mid e_{ic} = 1, t_{ic}, v_c = 1; \theta)$ (Eq. (10)), where $p(z_{ic} = 1 \mid e_{ic} = 1, t_{ic}, v_c = 1; \theta) = \theta_1 \cdot t_{ic}$. Remember that if $t_{ic} = 0$, $p(z_{ic} = 1 \mid e_{ic}, t_{ic} = 0, v_c; \theta) = 0$. This is the reason why the evaluation will be only performed with embryos that were transferred ($t_{ic} = 1$). The other two terms (Eqs. (1) and (2)) represent the probabilistic classifiers of embryo and cycle viability, respectively. In fact, we could unplug these classifiers from the learned model and use them to predict embryo/cycle viability. Note that in real practice, at embryo-selection time, the cycle's stimulation is already finished; thus, if we fix the cycle, the ranking of embryos given by Eqs. (1) and (10) is the same.
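The remark about rankings can be checked with a short sketch: within a fixed cycle, the cycle term and the Bernoulli parameter are positive constants shared by all embryos, so multiplying by them does not change the order induced by the embryo classifier alone. Function and argument names are hypothetical, and the probabilities are made-up values.

```python
def implantation_probability(p_embryo, p_cycle, theta1, transferred=True):
    """Product of embryo viability, cycle viability and the Bernoulli term."""
    if not transferred:
        return 0.0  # a non-transferred embryo cannot implant
    return p_embryo * p_cycle * theta1


def rankings(p_embryos, p_cycle, theta1):
    """Embryo indices sorted by full-model probability and by embryo term only."""
    full = [implantation_probability(p, p_cycle, theta1) for p in p_embryos]
    idx = range(len(p_embryos))
    by_full = sorted(idx, key=lambda i: full[i], reverse=True)
    by_embryo = sorted(idx, key=lambda i: p_embryos[i], reverse=True)
    return by_full, by_embryo


by_full, by_embryo = rankings([0.2, 0.9, 0.5], p_cycle=0.6, theta1=0.8)
assert by_full == by_embryo
```

The equivalence requires the shared factor to be strictly positive; if the cycle term or the Bernoulli parameter were 0, all embryos of the cycle would tie.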
As aforementioned, we approximate Eqs. (1) and (2) by means of probabilistic classifiers. By the no free lunch theorem, we know that different classifiers may perform differently depending on the context. In order to make a fair comparison, in these experiments we have tested three types of classifiers of different nature: Logistic Regression (LR), Random Forest (RF) and Gradient Boosting (GBOOST) classifiers. We use the default parametrization of these techniques, as implemented by Python's Scikit-learn library [40].

Baseline approaches
The two main characteristics of our model are the way it combines the information from the cycle and the embryos, and its ability to learn using all the available examples independently of the amount of class information that they carry. The baselines that we use in this study for comparison follow simplistic approaches to these two aspects. All of them use the same types of probabilistic classifiers previously described, for the sake of fair comparison. Note that the models learned with these baseline methods directly predict implantation. This is slightly different from the embryo module of our probabilistic model (Eq. (1)), which actually predicts whether an embryo is able to implant (viability).
To learn the classifiers, baseline approaches use a transformed dataset in which different assumptions are made to assign a label to examples that are originally (partially) unlabeled (Table 1 summarizes the number of embryos in our database with (un)known fate). We design the baseline approaches in an incremental way regarding these assumptions on the partially labeled embryos, as summarized in Table 3.
Our pessimistic approach assumes that all the embryos with unknown fate are negative examples (unviable embryos). This is the simplest decision as it leads to a completely labeled dataset that can be directly learned using standard supervised learning techniques. However, it holds a heavy assumption: all non-transferred embryos are not viable for implantation (questionable), and all the embryos in partially implanted cycles are not viable for implantation (wrong: some of them are, but we do not know their identity). This approach brings a severe class imbalance problem, with only a tiny portion of embryos labeled as positive.
With the objective of relaxing this heavy assumption, our second baseline approach does not assign any label to embryos of unknown fate (see Table 3). This decision leads to a semi-supervised learning setting. We use a standard EM algorithm [41], which we call simple EM, to learn from this type of data. Thus, we allow the learning technique to unveil the class label of the embryos of unknown fate, alleviating at the same time the class imbalance problem.
This previous approach still dismisses the class information of embryos in partially implanted cycles: the label proportions or counts of implanted embryos per cycle. Our third baseline approach lets the model be learned from these counts of implanted embryos per cycle (see Table 3). This decision leads to a learning from label proportions setting. We use the EM algorithm proposed by Hernández-González et al. [8], which we call LP-EM, to learn from this type of data. One can reasonably argue that this approach uses all the available information of supervision.
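The incremental labeling assumptions of the three baselines can be summarized in a small lookup sketch. The fate names and dictionary layout are hypothetical; the entry for non-transferred embryos under LP-EM is an assumption (treated here as unlabeled, since no count information exists for them), and the embryo counts are those reported in the Data section.

```python
# Labels: 0 = negative, 1 = positive, "?" = unknown, "lp" = label proportions.
BASELINE_LABELS = {
    "pessimistic": {"failed": 0, "fully_implanted": 1, "partial": 0, "not_transferred": 0},
    "simple_em": {"failed": 0, "fully_implanted": 1, "partial": "?", "not_transferred": "?"},
    # Assumption: LP-EM leaves non-transferred embryos unlabeled.
    "lp_em": {"failed": 0, "fully_implanted": 1, "partial": "lp", "not_transferred": "?"},
}

EMBRYO_COUNTS = {"failed": 839, "fully_implanted": 108, "partial": 307, "not_transferred": 1871}


def positive_fraction(baseline):
    """Fraction of all embryos that a baseline labels as positive."""
    labels = BASELINE_LABELS[baseline]
    positives = sum(n for fate, n in EMBRYO_COUNTS.items() if labels[fate] == 1)
    return positives / sum(EMBRYO_COUNTS.values())


# Under the pessimistic approach only 108 of 3125 embryos are positive.
pessimistic_share = positive_fraction("pessimistic")
```

This makes the class imbalance of the pessimistic approach explicit: roughly 3.5% of the embryos end up labeled as positive.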

Evaluation
As this is a weakly supervised problem [42], a fair evaluation of the models is not trivial and needs to be properly addressed. Fully unlabeled examples (non-transferred embryos) might be used for learning but not for model performance assessment. Fortunately, partially labeled examples (transferred embryos in cycles with partial implantation), where only the proportion of implanted embryos is known, can carefully be used for evaluation.
Performance is assessed in terms of different metrics, which are applied in each of the experiments only if all the required information is available.

Table 3
Description of the labeling used by the baselines. In each method, embryos of different (un)known fates receive these labels: negative (0), positive (1), unknown (?), or label-proportions (lp).

To test the ability to predict embryo implantation, we use the area under the ROC curve (AUC-ROC) [43,44]. The ROC curve plots the true positive rate against the false positive rate as the discrimination threshold is varied. The metric does not require fixing a threshold to estimate the predictive performance of a probabilistic classifier. It represents the probability that the classifier will assign a higher probability to a positive instance than to a negative one. The higher the score, the better the classifier (a random classifier would obtain a value of 0.5). Note that AUC-ROC may not be appropriate when the dataset is highly unbalanced, as it produces overly optimistic measurements [45,46]. Many alternatives have been proposed to address this limitation, including partial AUC-ROC [46] (which focuses on the most relevant parts of the ROC curves) and the area under the Precision-Recall curve [45] (which focuses exclusively on the minority and relevant class). In our study, this issue is expected to impact mainly one of our baselines, the pessimistic approach, the results of which should be interpreted accordingly. As all the labels of the individual embryos are needed to calculate this metric, only those belonging to failed or completely-implanted cycles are considered.
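The probabilistic interpretation of AUC-ROC given above can be computed directly by comparing every positive with every negative instance. The sketch below is a quadratic-time illustration of that definition (counting ties as one half), not the implementation used in the experiments; the function name and toy scores are made up.

```python
def auc_roc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are correctly ordered.
assert auc_roc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]) == 0.75
```

Because every label must be known, such a computation can only involve embryos from failed or completely implanted cycles, as stated above.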

To also account for the embryos in cycles with partial implantation, we use the negative log-likelihood. It measures the confidence of the model in predicting each of the labels. Formally, we calculate the mean probability of the real number of implanted embryos per cycle given the current model, taking into account $|T_c|$, the number of transferred embryos in cycle $c$. The probability $p(y_c)$ of cycle $c$ having $y_c$ implanted embryos is obtained over all valid implantation vectors $z_c$, where each $p(z_{ic})$ is given by Eq. (10). For model performance assessment, we use 10 × 5-fold cross-validation. All the results show the average value. Fig. 4 displays the workflow of this study.
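The cycle-level probability used by this metric can be sketched by enumerating the valid implantation vectors, i.e., all the ways of choosing which transferred embryos implanted so that their number matches the observed count. The sketch assumes that per-embryo implantation probabilities are combined as if independent, and it averages the negative log-probability per cycle; the exact normalization used in the study (which involves the number of transferred embryos) is not reproduced here. Names are hypothetical.

```python
from itertools import combinations
from math import log


def prob_count(p, y):
    """Probability that exactly y of the embryos implant.

    p: per-embryo implantation probabilities of the transferred embryos.
    Sums over all implantation vectors with exactly y implanted embryos.
    """
    total = 0.0
    for implanted in combinations(range(len(p)), y):
        chosen = set(implanted)
        term = 1.0
        for i, p_i in enumerate(p):
            term *= p_i if i in chosen else 1.0 - p_i
        total += term
    return total


def mean_nll(cycles, eps=1e-12):
    """Mean negative log-probability of the observed counts; lower is better.

    cycles: list of (probabilities, observed_count) pairs.
    eps guards against log(0) for outcomes the model deems impossible.
    """
    return -sum(log(max(prob_count(p, y), eps)) for p, y in cycles) / len(cycles)
```

For instance, for two transferred embryos with probability 0.5 each and one observed implantation, two implantation vectors are valid and the cycle probability is 0.5.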

Experiment #1: Performance comparison
In this first snapshot of the experiments, we show a comparison between our model and the different baseline approaches when used for embryo implantation prediction, for different base probabilistic classifiers. Table 4 shows the results in terms of different metrics.
Our model obtains the best performance in terms of the AUC-ROC metric (consistently for all the classifier types). From the detailed inspection of the densities produced by these approaches,¹ we can appreciate signs of learning, though they might be limited, for classifiers learned with all the approaches (e.g., the high-quality embryos receive a higher probability of implantation). However, the results of the baselines in terms of AUC-ROC are rather limited. As mentioned previously, AUC-ROC is calculated using only embryos with known fate. Thus, it is reasonable to think that it favors those approaches which can detect if the cycle is actually a critical factor. In this set of experiments, the only approach that uses the cycle information is our complete model. This could provide an explanation for the performance gap observed in the results, and it is precisely the idea that we test in the second set of experiments.

¹ Figures available in the supplementary material at https://jhernandezgonzalez.github.io/supp_arts_pgm.html
In terms of negative log-likelihood, which measures the confidence of the model about its predictions and allows us to use also the partially implanted cycles for evaluation, the pessimistic approach, which deals with a highly unbalanced dataset due to its unrealistic but simplifying assumption, shows the worst results.The EM-based approaches perform better: they use partially labeled cycles in model learning without any hard assumptions.These approaches, mainly the one that considers also label proportions, seem promising as their performance reaches that of our model: they match the best global result (with RF classifiers) of our model and even outperform it when learning LR classifiers.

Experiment #2: Cycle characteristics for embryo implantation prediction
In this second set of experiments, we pay attention to a different dimension of the problem: the importance of the features describing the cycle configuration for the predictive task of embryo implantation. Our model includes them, whereas the baselines do not (as configured so far). Table 5 shows the results in terms of different metrics for our model and the pessimistic approach (already shown in the previous section), together with the results of the pessimistic approach when the training dataset is enlarged with the cycle features.
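Enlarging the training dataset with the cycle features can be sketched as follows. This is a minimal illustration under assumed data structures (the field names `cycle_id` and `features`, and the toy values, are hypothetical): each embryo row is extended with the descriptors of the cycle in which it was transferred.

```python
def add_cycle_features(embryos, cycle_features):
    """Append to each embryo's feature list the features of its cycle.
    embryos: list of dicts with keys 'cycle_id' and 'features' (hypothetical layout).
    cycle_features: dict mapping cycle_id to a list of cycle descriptors."""
    return [
        embryo["features"] + cycle_features[embryo["cycle_id"]]
        for embryo in embryos
    ]


# Toy data: two embryos from cycle "c1" and one from cycle "c2".
toy_embryos = [
    {"cycle_id": "c1", "features": [4.0, 1.0]},
    {"cycle_id": "c1", "features": [8.0, 0.0]},
    {"cycle_id": "c2", "features": [6.0, 1.0]},
]
toy_cycle_features = {"c1": [34.0], "c2": [41.0]}
toy_rows = add_cycle_features(toy_embryos, toy_cycle_features)
```

Embryos of the same cycle share the appended values, so a classifier trained on the enlarged rows can pick up cycle-level signal even though it still predicts one embryo at a time.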
The results in terms of log-likelihood are not conclusive: in some cases, the inclusion of cycle features improves the performance of the models, but the effect is not consistent. However, we can clearly observe a relevant improvement in terms of AUC-ROC, with results that become competitive with our complete model. Inspecting the density of the probability values (available in the Supplementary Material), we observe that, while truly unviable embryos are clearly concentrated around a probability of 0, the density for the truly implanted embryos is clearly shifted towards higher probability values. That is, in some cases where the cycle is identified as viable, the models are more confident when predicting implantation. This seems to explain the differences in the AUC-ROC values of the pessimistic approach with and without the cycle features.
To fully grasp the behavior of our model, we inspect the probability densities for successful and failed cycles in Fig. 5, separately for embryo viability prediction (Eq. (1), left column), cycle fertility-potential prediction (Eq. (2), middle column), and cycle success prediction (whole model, right column). An ideal model would completely separate the densities in this last column. The results of all classifiers show a large overlap between both densities, but the mode of the density for successful cycles (pregnancy) is shifted to the right of that of the failed cycles. This points to a small signal: the model seems to predict success, on average, more for actually implanted embryos than for those that failed. According to the plots of the first column (embryo viability), there is almost no difference between successful and failed treatments. At first glance, embryos seem to be irrelevant to predicting a pregnancy. Nevertheless, it is noteworthy that the embryos employed in this part of the study are only the transferred ones, that is, a subset of the set of embryos manually selected by the embryologists as the best embryos for transfer (see Fig. 2(a)). Most of the predictive power of the model seems to come from the cycle descriptors. The middle column of Fig. 5 shows that cycles that actually induced a pregnancy receive a higher probability of cycle viability. One can conclude that the protocol followed by the embryologists for embryo selection based on the morphological features performs well, as our model does not seem able to further discriminate the embryos based on this data alone (the same data they used).

Experiment #3: Our model and the effect of the ASEBIR score
So far, we have focused on analyzing the performance of our model and the baselines regarding their ability to predict ART success. We also have available a measure of embryo quality, calculated by the embryologists according to the ASEBIR protocol [1], for each individual embryo in our database. In this last set of experiments, we test whether our model agrees with the ASEBIR quality score.
To study the agreement between our method and the ASEBIR score, we compare two versions of our complete model: (i) a model trained with an embryo dataset where the ASEBIR score is just another descriptive feature (the ASEBIR score is an element of the embryo's feature vector), and (ii) a model trained with a dataset from which the ASEBIR score has been completely removed. In Table 6, we show the results obtained with both models (with and without the ASEBIR score feature) for the different probabilistic classifiers.
It is noteworthy that there are no significant differences between the two models. Including the ASEBIR score as a descriptive feature of the embryos does not appear to boost the performance of the model. Table 6 also shows the mean value estimated for the model parameter that measures the probability that a viable embryo actually gets implanted in a viable cycle. It represents the third source of uncertainty in our proposal, which captures the effect of any unknown factors. Its value is usually close to 0.5. This means that, even if the classifiers consider that both embryo and cycle are viable, the model expects that only half of these pairs will succeed. The standard deviation is low, implying a consistent estimation.
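The role of this third source of uncertainty can be made concrete with a small worked example. The sketch below assumes that implantation requires both a viable embryo and a viable cycle, that the two viabilities are independent, and that the three probabilities therefore combine as a product; this factorization is an assumption made for illustration, and the function and argument names are hypothetical.

```python
def implantation_probability(p_embryo_viable, p_cycle_viable, p_success_given_viable):
    """Probability of implantation under an assumed product form:
    P(embryo viable) * P(cycle viable) * P(implants | both viable).
    ASSUMPTION: implantation is impossible unless both are viable,
    and embryo and cycle viability are independent."""
    return p_embryo_viable * p_cycle_viable * p_success_given_viable


# Even when both classifiers are certain of viability, a parameter
# value of 0.5 caps the predicted implantation probability at 0.5.
best_case = implantation_probability(1.0, 1.0, 0.5)

# With less confident (hypothetical) classifier outputs, the
# prediction shrinks further.
typical_case = implantation_probability(0.8, 0.6, 0.5)
```

Under these assumptions, a parameter value close to 0.5 acts as an upper bound on the predicted implantation probability of any embryo, which is consistent with the interpretation that only about half of the fully viable embryo-cycle pairs are expected to succeed.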
As before, we also inspect the probability densities, shown in Fig. 6, to understand the behavior of the model regarding the ASEBIR score. Specifically, we show the results with the version of the model that does not use the ASEBIR score feature. Under the independence hypothesis, the quality of an embryo should not be related to the probability that a cycle is viable. In general, we observe that the embryo information has not leaked into the cycle classifier: the probability density of cycle viability (middle column) is almost the same for all the ASEBIR categories. Embryo quality has the highest impact on the model's ability to predict embryo viability (left column). All classifiers (mainly LR and GBOOST) tend to separate the best (A) and worst (D) quality embryos, but barely discriminate embryos of medium quality (B and C). The model mostly agrees with the ASEBIR score in the identification of the most and least promising embryos, even without explicitly considering the ASEBIR score as a feature.
All in all, although our method does not seem to directly consider the ASEBIR score feature as relevant, it is important to bear in mind that the rest of the descriptive variables of the embryos are exactly the ones used in the ASEBIR protocol [1]. Given the alignment between the ASEBIR score and our model's results observed in Fig. 6, the irrelevance of the ASEBIR score feature is possibly due to the fact that our method finds the relevant information among the rest of the variables when this key feature is not given. This interpretation would suggest that the ASEBIR protocol already extracts the relevant information from the available morphological features, which is also captured by our model.

Conclusions
In this work, we address the problem of embryo selection for ARTs, a complex real-world problem with partial observability issues. We propose a novel probabilistic graphical model, an extension of the standard embryo-uterine model, which assumes independence between embryos and cycles. It is, to the best of our knowledge, the first one that takes into account three different possible sources of uncertainty, accounting for the unknown factors that cause viable embryos, selected for transfer, to fail to implant. We also derived its learning procedure, which is able to learn from all the available supervision information, including partially labeled data. Using morphological data for each individual embryo and characteristics of the cycle, the model is able to predict embryo implantation.
We studied the effect of the ASEBIR embryo quality score within our model. The models learned with and without the ASEBIR score show a similar separation between categories. Our results suggest that, once embryologists have made their selection, the model does not provide more information about individual embryos. This might indicate that the current protocol already extracts most of the value from the available morphological data. The performance of the model was further validated against three baseline approaches. We show the benefits of implementing an EM strategy for the learning process, letting the learning technique unveil the label of embryos of unknown fate. We observe that the cycle's features play a key role in predicting implantation, especially when either all embryos in a cycle or none were implanted. More importantly, we obtain an estimation of the uncertainty originating from unknown, external factors. The most common result suggests that, even when the embryo and cycle are viable, there is only about a 50% probability of actually inducing pregnancy. This novel result increases the modeling ability of the system and may assist clinicians in decision-making in real ART practice.
Many issues are still open. The empirical validation of the method in an enlarged experimental setting remains to be done, as does the use of real data from more than one hospital or source. Moreover, the learning techniques of the classifiers could be fine-tuned to optimize their predictive power, as we only used default configurations. Another direction would be to conceive new, maybe simpler, PGMs to test the assumptions of our current model (independence between embryos and cycles, awareness of a third source of error, etc.). Finally, the most challenging idea for future work would be to try to validate, in collaboration with embryologists, the value obtained by our model for the parameter of the third source of uncertainty and its relationship with the proportion of promising treatments that failed to implant.

Fig. 1. Embryo counts for each of the categories of the ASEBIR scoring system [1].

Fig. 5. Densities of the predicted probabilities of our model separated by outcome (pregnancy or not). Each row shows results with a different type of base classifier. Each column shows densities of the predicted probabilities (i) for embryo viability (left column), (ii) for cycle viability (middle column), and (iii) for cycle success (whole model, right column).

Fig. 6. Densities of the predicted probabilities of our model separated by ASEBIR quality category. The learned models do not use the ASEBIR quality score as a descriptive feature. Each row shows results with a different type of base classifier. Each column shows densities of the predicted probabilities (i) for embryo viability (left column), (ii) for cycle viability (middle column), and (iii) for cycle success (whole model, right column).

Table 1
Number of cycles and embryos, separated by use (transferred or not) and success (pregnancy or not, embryo implanted or not).

Table 2
Notation employed in this paper.

Table 4
Results in terms of AUC-ROC and log-likelihood of classifiers of different types learned with our model and the 3 baseline approaches.

Table 5
Results in terms of AUC-ROC and log-likelihood of classifiers of different types learned with our model and the pessimistic baseline (with and without the cycle features).

Table 6
Results in terms of AUC-ROC and log-likelihood of classifiers of different types learned with our model (with and without the ASEBIR score as a feature). The last column shows the mean value learned for the model parameter of the third source of uncertainty.