Constrained parameter estimation with uncertain priors for Bayesian networks

Abstract: In this paper we investigate parameter learning in Bayesian networks and, in particular, deal with prior uncertainty within a Bayesian framework. Parameter learning is explored in the context of Bayesian inference, and we subsequently introduce Bayes, constrained Bayes and robust Bayes parameter learning methods. Bayes and constrained Bayes estimates of the parameters are obtained to meet the twin objective of simultaneous estimation and closeness between the histogram of the estimates and the posterior estimate of the parameter histogram. To treat the prior uncertainty, we consider several classes of prior distributions and derive simultaneous posterior regret Gamma minimax estimates of the parameters. The merits of the various procedures are evaluated using synthetic data and a real clinical dataset.


Introduction
Bayesian networks (BNs) have become one of the most popular probabilistic models for representing joint probability distributions of a set of random variables [7,28,33]. Learning BNs from data is normally split into two different, although related steps: (1) learning the structure of the network and (2) learning the parameters [8,18]. Sometimes the network structure is designed using expert knowledge. Once the structure of a network is obtained, parameter learning becomes possible.
Several methods are available to learn the structure of a BN (see [6,10,17,44] among others), and there are many good software implementations of many of these [e.g. 30].
The focus of this paper is on the task of parameter learning only in BNs whose nodes represent discrete random variables. However, in our final remarks we refer to three common approaches for extending such BNs to BNs whose nodes are continuous random variables. Parameter learning has also been studied widely, giving rise to many different approaches. Most studies are based on the maximum likelihood (ML), the maximum a posteriori (MAP), or the posterior mean (PM) criterion. ML estimation is a classical technique that provides a parameter estimator by maximizing the joint probability density function (pdf), while the MAP and PM estimates, as Bayesian solutions, combine the information derived from the data with a priori knowledge concerning the parameter; see [4,8,9,25,35] among others.
Parameters of a BN possess an inherent symmetry, as the parameters of a specific node always sum to one, and thus we expect their estimates to satisfy this condition as well. Our main focus is to estimate parameters of a BN using Bayesian methods, but it is well known that Bayes estimates depend strongly on the hyperparameters of the chosen prior, and this may affect the corresponding results. Such dependence in a learning procedure has been reported to be a serious problem [1,42]. We adjust the task of Bayesian parameter learning using the idea of constrained Bayesian (CB) estimation of [29]. Further, we introduce and motivate the use of the simultaneous robust Bayes concept. The notion of robustness used in this paper differs from the one explored by [34] in their description of the robust Bayes estimator, where missing data are handled by means of probability intervals. This paper is organized as follows: In Section 2, we introduce some preliminaries. Section 3 is devoted to simultaneous Bayes estimation and the idea of CB learning; in addition, explicit forms for the parameter estimates are derived. In Section 4, we introduce the idea of simultaneous posterior regret gamma minimax (SPRGM) learning in the presence of prior uncertainty and derive the corresponding estimates. In Section 5, we carry out an experimental study and compare the performance of the proposed estimators using synthetic data from a well-known example network. Further, we study the impact of the proposed methods using real clinical data and a real-world BN. Finally, we conclude with some final remarks and a discussion. To keep readers on track, all proofs, along with some supplementary material, are provided in the Appendix.

Basic notions
A BN consists of a set of variables (or nodes) V = {X 1 , . . . , X d } and a subset of directed links E (also sometimes called edges or arcs) contained in the Cartesian product V × V . We say the structure of a BN is known if the variables in the set V are connected to each other according to the links in E. Mathematically, the structure is called a directed graph. The directed graph is called acyclic, if it does not contain any directed cycle. We refer to such a directed acyclic graph (DAG) by G = (V, E). In the BNs context, a node is instantiated when its value is known through observing what it represents. We say we have a complete instantiation if all the nodes of a BN are simultaneously observed.
Suppose that for each j = 1, . . . , d, the variable X_j takes values in the set X_j = {x_j^(1), . . . , x_j^(k_j)}. The set of all possible outcomes of the experiment may be denoted by X = X_1 × · · · × X_d. Hence, a sample of cases is given by x = (x^(1), . . . , x^(n)), where x^(i) = (x_1^(i), . . . , x_d^(i)). Let Λ_j denote the parent set of X_j, with configurations λ_1, . . . , λ_{q_j}, so that (x_j, λ_l) is a configuration of the family (X_j, Λ_j). Let θ ∈ Θ denote the set of parameters defined by θ_jil = P(X_j = x_j^(i) | Λ_j = λ_l), for l = 1, . . . , q_j, i = 1, . . . , k_j, j = 1, . . . , d, with ∑_{i=1}^{k_j} θ_jil = 1. Using the decomposition of the probability distribution defined by the BN, the joint probability of a case x^(k) may be written as a product of the θ_jil matching the values and parent configurations observed in x^(k). For independent observations (x^(1), . . . , x^(n)), the joint probability of the cases is L(θ) = ∏_{j,i,l} θ_jil^{n_jil.}, where n_jil. = ∑_{k=1}^n n_jilk; this is the likelihood function. One can observe that the ML estimate of θ_jil in Eq. (2.2) is given by δ^ML_jil = n_jil. / n_j.l., where n_j.l. = ∑_{i=1}^{k_j} n_jil.. Observe that all parameters (θ_j1l, . . . , θ_jk_jl) of a specific node X_j preserve the inherent symmetry ∑_{i=1}^{k_j} θ_jil = 1. We expect the corresponding estimates (δ_j1l, . . . , δ_jk_jl) to preserve this symmetry and satisfy the constraint ∑_{i=1}^{k_j} δ_jil = 1. This constraint is automatically achieved by the ML estimates in Eq. (2.3), since ∑_{i=1}^{k_j} δ^ML_jil = 1.

Example 2.1. Consider the DAG depicted in Fig. 1 with five nodes X_1, . . . , X_5. Suppose that all the nodes except X_3 are binary variables and X_3 takes the values 0, 1 and 2 with equal probability. Hence, d = 5, k_i = 2 for i = 1, 2, 4, 5, and k_3 = 3.
The ML estimate of θ_5l = (θ_51l, θ_52l) is given by δ^ML_5il = n_5il. / n_5.l. for i = 1, 2.
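To make the counting concrete, the ML step above can be sketched in a few lines; the data layout and names below are illustrative, not the paper's notation.

```python
# A minimal sketch of ML parameter learning for one node of a discrete BN
# with known structure; data layout and names are illustrative.
from collections import Counter

def ml_cpt(samples, child, parents):
    """Return delta^ML = n_jil. / n_j.l.: the relative frequency of each
    child value within every observed parent configuration."""
    joint = Counter()   # n_jil.: (parent configuration, child value) counts
    marg = Counter()    # n_j.l.: parent configuration counts
    for row in samples:
        lam = tuple(row[p] for p in parents)
        joint[(lam, row[child])] += 1
        marg[lam] += 1
    return {key: c / marg[key[0]] for key, c in joint.items()}

# Toy data in the spirit of Example 2.1: X5 with a single parent X3.
data = [{"X3": 0, "X5": 0}, {"X3": 0, "X5": 0},
        {"X3": 0, "X5": 1}, {"X3": 1, "X5": 1}]
est = ml_cpt(data, "X5", ["X3"])
# est[((0,), 0)] == 2/3 and est[((0,), 1)] == 1/3: the estimates within a
# parent configuration sum to one, preserving the symmetry constraint.
```

Note that the estimates for each parent configuration automatically sum to one, as discussed above.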

Bayesian learning methods
In Example 2.1, one might believe that the sequence (0, 0, 1, 1, 1) occurs in 80 percent of cases. If so, we can take such a priori knowledge into account, assuming it is available in the form of a prior distribution.
Under the conjugate Dir(α_j1l, . . . , α_jk_jl) prior, it is easy to observe that the MAP and PM estimates of θ_jil are δ^MAP_jil = (n_jil. + α_jil − 1) / (n_j.l. + α_j.l − k_j) and δ^PM_jil = (n_jil. + α_jil) / (n_j.l. + α_j.l), where α_j.l = ∑_{i=1}^{k_j} α_jil.
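Under the conjugate Dirichlet prior used throughout the paper, the PM estimate is a simple count correction, and the usual Dirichlet posterior mode gives the MAP form; a minimal sketch (names are illustrative, and the mode formula assumes n_i + α_i ≥ 1 so the mode is interior):

```python
# Hedged sketch of the conjugate-Dirichlet PM and MAP updates for one node
# and one parent configuration; names are illustrative.
def bayes_cpt(counts, alphas):
    """PM:  (n_i + a_i) / (n + a)
       MAP: (n_i + a_i - 1) / (n + a - k)   (Dirichlet posterior mode)"""
    n, a, k = sum(counts), sum(alphas), len(counts)
    pm = [(ni + ai) / (n + a) for ni, ai in zip(counts, alphas)]
    mode = [(ni + ai - 1) / (n + a - k) for ni, ai in zip(counts, alphas)]
    return pm, mode

pm, mode = bayes_cpt([8, 2], [4, 1])   # k_j = 2, n_j.l. = 10
# pm == [0.8, 0.2]; both vectors sum to one, as the constraint requires.
```

With no prior counts the PM reduces toward the ML relative frequencies as the sample grows.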

Constrained Bayesian learning
In the preceding section, assuming the squared error loss (SEL) function, we observed that if the only objective is simultaneous estimation of the BN parameters θ_jl = (θ_j1l, . . . , θ_jk_jl), with θ_jil defined in Eq. (2.2), the Bayes estimate is the vector of posterior means, i.e., δ^PM_jl = (δ^PM_j1l, . . . , δ^PM_jk_jl) with δ^PM_jil given by Eq. (2.6). In this section, following the idea of CB estimation of [29], we provide an adjusted version of δ^PM_jil. To avoid ambiguity, we define the following key terms. Let δ_jl = (δ_j1l, . . . , δ_jk_jl) be a vector of arbitrary estimates of the elements of θ_jl = (θ_j1l, . . . , θ_jk_jl), with θ_jil defined in Eq. (2.2), and denote by PESM and PESV the posterior expectations of the sample mean and the sample variance of the ensemble of parameters θ_jil, respectively. [29] suggested that problems with using posterior means as Bayes estimates might be dealt with by constructing a vector of CB estimators for which the sample mean and the sample variance of the ensemble of estimates equal the PESM and the PESV of the ensemble of parameters, respectively. In particular, he proved that under a normal likelihood with a normal prior, the sampling variability of a collection of Bayes estimates is smaller than the posterior expectation of the corresponding population variability; see [16]. This property holds true in BNs, as stated in the following lemma. See the Appendix for a detailed verification of this inequality.
By the CB approach of [29], the empirical distribution function of CB estimates becomes close to the empirical distribution function of the corresponding unknown parameters. This way, the sampling variability of a collection of estimates is a better estimate of the underlying variability among the population parameters. For more details, see [11,13,14,15].
The idea of matching the first two moments of the posterior distribution of the parameters with the corresponding moments of the distribution of the estimates has been followed in a wide range of problems, mostly to derive adjusted empirical Bayes estimators in areas such as disease mapping or environmental risk assessment. For example, in disease mapping it is supposed that there are k regions labeled with the indices 1, 2, . . . , k. In this setting, [11] follow a hierarchical model for disease counts and estimate the true disease rates θ_i, i = 1, 2, . . . , k. For more information, see [11,16] and the papers cited therein. Now, consider the problem of estimating θ_jil, defined in (2.2), under the SEL function for fixed j and l. If interest lies in both simultaneous estimation and closeness between the distribution of the estimates and the posterior distribution of the parameters, the idea of CB estimation can be helpful. To motivate this, it is of interest to compare our estimation problem with the disease mapping problem considered in [11]. In the latter, several levels were considered for the parameter of interest (the true disease rates θ_i, i = 1, 2, . . . , k); in our problem, for a specific node and a specific parent configuration, the parameter of interest θ_jil takes different levels as i varies in {1, 2, . . . , k_j}. Thus, CB estimation can be used to meet the twin objective of simultaneous estimation and closeness between the distribution of the estimates and the posterior distribution of the parameters.
Here, we consider the problem of obtaining CB estimates of θ_jil, subject to the constraints considered by [29] and an additional constraint imposed by the nature of parameter learning in BNs, namely ∑_{i=1}^{k_j} δ_jil = 1. It is interesting to note that since ∑_{i=1}^{k_j} θ_jil = ∑_{i=1}^{k_j} δ^PM_jil = 1, constraint (i) implies (iii). However, each of the constraints (i) and (iii) plays a separate role, and hence we consider both simultaneously for later use. The following theorem provides CB estimates of the parameters in BNs. The main idea of this theorem comes from a proof that appeared in [29]; see the Appendix for a version of the proof compatible with the constraints considered in this paper. Bayes estimates generally depend on the hyperparameters of the chosen prior, and this can affect the resulting inference; the following example clarifies this point. That the hyperparameters affect the learning of BN structures has been reported as a serious problem [41,45]. In the next section, we consider this issue and explore robust Bayesian methods to overcome it.
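A rough numerical illustration of the CB idea for one node and one parent configuration follows. The affine moment-matching form and the Monte Carlo approximation of the PESV are our own simplifications for illustration; the paper's Theorem 3.1 gives the exact constrained expressions, which are not reproduced here.

```python
# Hedged sketch of the constrained Bayes (CB) idea: rescale the spread of
# the posterior means so that the sample variance of the estimates matches
# the posterior expected sample variance (PESV), while keeping the sum-to-one
# constraint.  PESV is approximated by Monte Carlo from the Dirichlet posterior.
import random

def cb_estimate(counts, alphas, draws=20000, seed=0):
    rng = random.Random(seed)
    post = [n + a for n, a in zip(counts, alphas)]  # Dirichlet posterior params
    k = len(post)
    pm = [p / sum(post) for p in post]              # posterior means
    pesv = 0.0                                      # E[sample var of theta | data]
    for _ in range(draws):
        g = [rng.gammavariate(p, 1.0) for p in post]
        theta = [x / sum(g) for x in g]             # one posterior draw
        pesv += sum((t - 1 / k) ** 2 for t in theta) / k
    pesv /= draws
    sv = sum((p - 1 / k) ** 2 for p in pm) / k      # sample variance of the PMs
    b = (pesv / sv) ** 0.5 if sv > 0 else 1.0       # b >= 1 (cf. the lemma above)
    # Centering at 1/k keeps the adjusted estimates summing to one.
    cb = [1 / k + b * (p - 1 / k) for p in pm]
    return pm, cb
```

The ensemble mean of the parameters is always 1/k_j because they sum to one, which is why centering at 1/k preserves the constraint.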

Posterior regret Gamma minimax learning
Even when available, a particular prior distribution is usually somewhat arbitrary, and there are good reasons to question its reliability. Usually, there is no way for a user to say that a particular prior is better than another. Thus, in practice, prior knowledge is often vague. Alternatively, the expert may be unable to specify the prior completely. This situation may also occur when two or more experts agree on the choice of a prior distribution arising in a decision-making problem but differ in opinion w.r.t. the choice of the hyperparameters. A common solution for handling prior uncertainty in Bayesian statistical inference is to choose a class Γ of prior distributions and compute some quantity, such as the posterior risk, the Bayes risk or the posterior expected value, as the prior ranges over Γ. This is known as robust Bayesian analysis. This methodology is connected with studying the effect of changing a prior within a class Γ on some quantity; see [1,2,3]. In this section, we use the idea of SPRGM estimation in the parameter learning procedure. Readers may refer to the treatise by [19] for a detailed discussion of the literature on various robust Bayes analysis problems. The book contains chapters on robust Bayes rules, including many references dealing with various standard classes of priors (e.g., Chapters 8 and 13), as well as some applications provided in Chapters 17-21.
It is worth stressing that, in addition to the debate on robust Bayesian analysis, there are other strong arguments in the literature for incorporating prior knowledge into data analysis, of which [12] and [37] are excellent references. The relevant approach, known as the hierarchical Bayes approach, robustifies the conjugate distribution by assuming a fully Bayesian model. The idea is that one may have structural and subjective prior information at the same time and would like to model this in stages. Attention is often restricted to two-stage priors, used when the first stage of prior elicitation leads to a class Γ of priors and the statistician, in the second stage, puts a prior on Γ. Thus, if Γ = {π_1 : π_1 is of a given functional form and λ ∈ Λ}, then the second stage consists of putting a prior π_2(λ) on the hyperparameter λ. While specification of the hyperparameter is usually based on subjective beliefs, assuming it reflects the statistician's best guess, it is difficult, and the difficulty becomes more tangible as the number of hyperparameters increases. BNs are a prime example of such a complicated specification, and thus in this paper we focus on the robust Bayes approach. The difficulty of specifying the hyperprior has made the use of noninformative priors at the second stage common [e.g. 1, 37], but noninformative priors may themselves lead to inappropriate choices of priors. In contrast, the robust Bayes approach we consider in this paper not only obviates the complexity of prior elicitation but also provides global protection against inappropriate choices of priors or their hyperparameters [22,23]. See [12,37] for more information on robust Bayes and hierarchical Bayes approaches, and [21,22] for applications of these approaches as well as a quick list of some of their advantages and disadvantages. Now, let ρ(π, δ_jil) be the posterior risk of the estimate δ_jil of θ_jil defined in Eq. (2.2).
For a learning procedure of the parameters θ_jl in a DAG under the SEL function and given a class of priors Γ, the posterior regret of choosing δ_jil instead of the Bayes estimate δ^PM_jil is r_p(δ_jil, δ^PM_jil) = ρ(π, δ_jil) − ρ(π, δ^PM_jil). With respect to simultaneous estimation, we define the posterior regret of choosing δ_jl instead of δ^PM_jl to be the sum of these componentwise regrets, with the constraint ∑_{i=1}^{k_j} δ_jil = 1. Then we define δ^SPR_jl = (δ^SPR_j1l, . . . , δ^SPR_jk_jl) to be the SPRGM value over the class Γ of priors if it minimizes, over the class D of all possible estimates of θ_jl, the supremum over Γ of this posterior regret. As is clear from Eq. (4.1), deriving the SPRGM estimate requires determining the supremum of r_p(δ_jil, δ^PM_jil) as the prior varies over the class Γ. As δ_jil does not depend on the prior, one way to gain insight into this supremum is to examine the behavior of the Bayes estimate δ^PM_jil in Eq. (2.6). For fixed data and fixed j and l, variation of the hyperparameters α_j1l, α_j2l, . . . , α_jk_jl in given intervals determines the behavior of the PM estimate δ^PM_jil, and thus the supremum of r_p(δ_jil, δ^PM_jil) can be analyzed. To make this clear, recall Example 2.2, where δ^PM_512 = (n_512. + α_512) / (n_5.2. + α_512 + α_522). Obviously, δ^PM_512 is increasing in α_512 and decreasing in α_522. Now, if the hyperparameters α_512 and α_522 (which reflect prior beliefs) vary over some intervals, δ^PM_512 attains a minimum and a maximum value, and thus the supremum of r_p(δ_jil, δ^PM_jil) can be analyzed in order to determine the SPRGM estimate.
To derive SPRGM estimates of θ_jil, i = 1, . . . , k_j, once again consider the conjugate Dir(α_j1l, . . . , α_jk_jl) prior and let K_j = {α_j1l, . . . , α_jk_jl}. To adopt prior information in the robust learning methodology, our prior knowledge about the Dirichlet hyperparameters may cluster them into three disjoint sets: the prior information may indicate that some elements of K_j, say α_jul, are fixed known constants, while other elements, say α_jvl, vary over fixed known intervals. We refer to these index sets as U_j and V_j, respectively. Thus, α_jul is a fixed hyperparameter if u ∈ U_j and, similarly, α_jvl is a varying hyperparameter if v ∈ V_j. To cover all remaining possibilities of hyperparameter variation, let W_j = K_j − U_j − V_j consist of all the other cases. The set W_j is not necessarily empty, since prior knowledge may suggest letting the sum of all the hyperparameters vary in a fixed known interval. This clustering leads to different classes of priors; examples of such classes of Dirichlet priors Π_j = Dir(α_j1l, . . . , α_jk_jl) are given in (4.2) and (4.3), where the constants α*_jul, the interval endpoints for α_jvl, and the endpoints for the sum are known. The classes in (4.2) and (4.3) are very general. A special case occurs when either U_j = ∅ in Γ† or V_j = K_j in Γ‡; the resulting class of priors is (4.4), where the interval endpoints for α_jvl are fixed known constants. As seen above, a wide variety of classes of Dirichlet priors can arise in a specific problem. We emphasize that each possible class of priors reflects the prior knowledge behind its choice; this does not mean that a chosen class is superior to the alternatives. In fact, when choosing a class of priors, we decide based only on our experience. Although SPRGM estimates of θ_jl = (θ_j1l, . . . , θ_jk_jl) can be derived for different values of k_j, we provide the two most relevant cases, k_j = 2 and k_j = 3.
The following theorem provides the SPRGM estimator of θ_jl under the sum of SEL functions when k_j = 2. For the proof, see the Appendix.

Theorem 4.1. Let Γ be a class of priors and suppose that, for i = 1, 2, the bounds δ_jil(X) ≡ δ_jil = inf_{π∈Γ} δ^PM_jil and δ̄_jil(X) ≡ δ̄_jil = sup_{π∈Γ} δ^PM_jil are finite. Then the SPRGM estimate of (θ_j1l, θ_j2l) over the class Γ, subject to the constraint δ_j1l + δ_j2l = 1, is given by one of two displayed expressions, according to whether a certain quantity exists.

The following example illustrates how to derive SPRGM estimates in practice.
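For k_j = 2 with a single hyperparameter ranging over a known interval (a special case in the spirit of the class in (4.4)), the range of the posterior mean is easy to compute. The sketch below then takes the midpoint of that range, a standard posterior-regret Gamma-minimax choice under squared error; it is our simplified stand-in for the full case analysis of Theorem 4.1, not a reproduction of it.

```python
# Simplified SPRGM sketch for k_j = 2 when only alpha_j1l varies over a
# known interval [a1_lo, a1_hi] and alpha_j2l is fixed; names are ours.
def sprgm_k2(n1, n2, a1_lo, a1_hi, a2):
    pm = lambda a1: (n1 + a1) / (n1 + n2 + a1 + a2)  # posterior mean of theta_j1l
    lo, hi = pm(a1_lo), pm(a1_hi)                    # pm is increasing in a1
    d1 = (lo + hi) / 2                               # midpoint of [inf, sup]
    return d1, 1 - d1                                # enforce d1 + d2 = 1

d1, d2 = sprgm_k2(16, 4, 35, 45, 10)
# d1 is the midpoint of [51/65, 61/75], roughly 0.799
```

Monotonicity of the posterior mean in each hyperparameter, noted after Eq. (4.1), is what makes the infimum and supremum attainable at the interval endpoints.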

Synthetic data
In this section, we provide a simulation study to compare the performance of the ML, MAP, PM, CB and SPRGM estimates. For this purpose, we use the well-known metastatic lung cancer BN shown in Fig. 2. This network appeared in the early literature on BNs; see [24,43] among others. For our simulation study, let X_1 be distributed according to B(1, 0.2), where B(1, p) stands for a Bernoulli distribution with success probability p. To generate values for the variables X_2 and X_3, note their possible parent configurations λ_1 and λ_2. Finally, we define the variable X_5 to be zero with probability θ_511 = 0.4 if the output of X_3 is zero; otherwise, X_5 takes the value one with probability θ_522 = 0.8, indicating that a patient who has a brain tumour will suffer from severe headaches.
To draw conclusions about the performance of the different estimates provided earlier, we consider estimates of the conditional probabilities θ_5l = (θ_51l, θ_52l), l = 1, 2. To obtain the MAP, PM and CB estimates of θ_52l, l = 1, 2, we use the conjugate Dir(α_52l, α_51l) prior distribution. Notice that in Bayes estimation of θ_51l, the conjugate prior is Dir(α_51l, α_52l). To choose hyperparameters for estimating θ_522, suppose that three experts have provided information about having a brain tumour and subsequently estimated the chance of being affected by severe headaches. Assume that one of the experts, based on some prior knowledge, assumes the conjugate Dir(α_522, α_512) prior with α_512 = 5 and α_522 = 35, implying a mean chance of 35/40 = 0.875. Suppose that this opinion does not attract the consensus of the two other experts; rather, they believe in different hyperparameters and attribute Dir(40, 10) and Dir(45, 15) priors, respectively, reflecting prior means of about 0.80 and 0.75. We shall refer to these three chosen priors as π_1, π_2 and π_3, respectively. Clearly, the three experts attributed priors with means around the true parameter 0.8, but the resulting Bayes estimates can still be quite different. To deal with this issue, we consider the class of priors Γ in (5.1), which incorporates the three experts' beliefs, and we rely on this class to derive the SPRGM estimate of θ_52 = (θ_512, θ_522). Now, to estimate θ_521, the probability that a patient has severe headaches in the absence of a brain tumour, suppose, similarly to the above, that three experts have provided estimates of this conditional probability. Their opinions are expressed by the Dir(α_521, α_511) priors with (α_521, α_511) = (40, 25), (45, 25), (35, 30), implying mean chances of around 0.60. We shall refer to these priors as π*_1, π*_2 and π*_3, respectively.
To obtain the SPRGM estimate of θ_51 = (θ_511, θ_521), we consider a class of priors Γ* incorporating these three experts' beliefs. Consider the three priors π_j, π*_j, j = 1, 2, 3, and the classes of priors Γ and Γ*, as defined above. The simulation study proceeds in the following steps.

Step 1. A sample of size n is generated from the network.

Step 2. The ML estimate, the MAP, PM and CB estimates under each of the three priors, and the SPRGM estimates δ^SPR_5i2,Γ of θ_5i2 and δ^SPR_5i1,Γ* of θ_5i1 are computed. These computations result in 11 estimates of θ_5i2, denoted d[k, i], and 11 estimates of θ_5i1, denoted d*[k, i], k = 1, . . . , 11, i = 1, 2.
Step 3. Steps 1 and 2 are repeated N = 10,000 times. Based on the generated data, the mean, the average Kullback-Leibler divergence (AKLD) and the average sample variance (ASV) of the ensemble of estimates (d[k, 1], d[k, 2]) of (θ_512, θ_522), k = 1, . . . , 11, are computed as in Eq. (5.3). The quantitative results for different values of n are summarized in Table 1 and Table A.1 of the Appendix. Before drawing any conclusions, we restate that the true values of the parameters θ_511, θ_521, θ_512 and θ_522 are 0.4, 0.6, 0.2 and 0.8, respectively. Thus, by the mean criterion in Step 3, any of the proposed estimates whose mean is close to the corresponding true value is preferred to the alternatives. By the AKLD criterion, the estimate with the lowest AKLD value is preferred. We introduced the ASV criterion based on condition (ii) of Theorem 3.1. From Table 1, we observe that the simulation process failed to compute the ML estimate for n = 25, 50, 100. However, for n = 200 in Table 1 and all sample sizes in Table A.1 of the Appendix, the ML estimates perform quite well, although one should note that in practice they are used when there is no source of prior knowledge.
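The three criteria of Step 3 can be computed as sketched below; this is our reading of the criteria, with illustrative function names, and the exact definitions are those of Eq. (5.3).

```python
# Evaluation criteria for an ensemble of simulated estimates of a
# two-component parameter vector (illustrative names).
import math

def kld(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def evaluate(true_theta, estimates):
    """Componentwise mean, average KL divergence (AKLD) and average sample
    variance (ASV) of the estimates over N simulation runs."""
    N, k = len(estimates), len(true_theta)
    mean = [sum(e[i] for e in estimates) / N for i in range(k)]
    akld = sum(kld(true_theta, e) for e in estimates) / N
    asv = sum(sum((ei - 1 / k) ** 2 for ei in e) / k for e in estimates) / N
    return mean, akld, asv

mean, akld, asv = evaluate((0.2, 0.8), [(0.2, 0.8), (0.3, 0.7)])
# mean == [0.25, 0.75]; asv == (0.09 + 0.04) / 2 == 0.065
```

The sample variance is taken about 1/k because estimates satisfying the sum-to-one constraint always have ensemble mean 1/k.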
The three different priors in Table 1 lead to different prior-based MAP, PM and CB estimates. When considering π_2, i.e., the Dir(α_522, α_512) prior with α_512 = 10 and α_522 = 40 (in this case the prior mean equals the true parameter 0.8), the corresponding MAP, PM and CB estimates, i.e., δ^MAP,π2_5i2, δ^PM,π2_5i2 and δ^CB,π2_5i2, perform better than the other Bayes and CB estimates. Similarly, in Table A.1 of the Appendix, the MAP, PM and CB estimates w.r.t. the prior π*_1 (whose mean is closer than those of the other priors to the true parameter 0.6) outperform the other Bayes and CB estimates. As noted earlier, one should not rely on only one source of prior information; rather, one should respect the knowledge of all the experts. Viewed this way, we observe that the SPRGM estimates computed over Γ and Γ* perform better than the other prior-based estimates. In other words, the SPRGM estimates are better in most cases, because the estimate based on the correct prior is equally good only when correct prior knowledge is available, which may be rare.

Table 1: π_1, π_2 and π_3 stand for the Dir(35, 5), Dir(40, 10) and Dir(45, 15) priors, respectively. Γ stands for the class of conjugate Dirichlet priors in (5.1).
Conducting a more precise investigation, Fig. 3 provides side-by-side histograms of the sample variance of the ensemble of estimates δ_52 = (δ_512, δ_522) of the parameters θ_52 = (θ_512, θ_522), with δ_5i2 replaced by one of the estimates δ^MAP,πj_5i2, δ^PM,πj_5i2, δ^CB,πj_5i2 and δ^SPR_5i2,Γ, i = 1, 2, j = 1, 2, 3 (the simulation process failed to compute the ML estimates). For these simulations we took n = 50, but our investigation led to similar results for other values of n. Associated with each of the priors π_j, j = 1, 2, 3, in each row of Fig. 3, we provide histograms of the PESV of the ensemble of parameters in θ_52 to show how similarly they behave compared to the sample variance of the ensemble of estimates in δ_52. We observe that the histograms of the PESV and of the sample variance of the ensemble of CB estimates coincide for all the priors π_j, j = 1, 2, 3. This is in fact an illustration of Theorem 3.1. It is also of interest to note that, as Fig. 3 shows, the CB estimator is the only estimator whose distribution matches the posterior distribution of the parameters (the corresponding histogram and the histogram of the PESV fall on each other). Further, Fig. 4 provides the averages of the PESV (APESV) of the ensemble of parameters in θ_52 and the ASV of the ensemble of the different estimates w.r.t. all the priors π_j, j = 1, 2, 3, given by Eq. (5.3). This figure also confirms that the PESV associated with each of the priors π_j, j = 1, 2, 3, is always greater than the sample variance of the corresponding PM estimates (as provided by Lemma 3.1), and that the CB estimator is the only estimator whose sample variance equals the PESV of the ensemble of parameters in θ_52 (an illustration of Theorem 3.1).
On the other hand, if δ_5l estimates θ_5l very well, the corresponding ASV is expected to be close to (1/2)[(θ_51l − 1/2)² + (θ_52l − 1/2)²], which equals 0.01 for l = 1 and 0.09 for l = 2. From Fig. 4 we observe that the ASV of the SPRGM estimates of θ_52 is not close to the APESV, but it is very close to 0.09. This is also clearly observed in Fig. 3, in which the histogram of the SPRGM estimates is centred around 0.09. Comparing the PM and CB estimates, we observe that the ASV of the CB estimates w.r.t. the prior π_3 is closer to 0.09 than that of the corresponding PM estimates, and thus their performance is better than that of the PM estimates w.r.t. π_3. This is also confirmed by Table 1. Thus, in some situations the CB estimates perform better than the PM ones.
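The two target values quoted above follow directly from the sample-variance formula; a quick check:

```python
# Sample variance of the two-component ensemble (theta_1, theta_2) with
# theta_1 + theta_2 = 1, so the ensemble mean is always 1/2.
def target_sv(t1, t2):
    return ((t1 - 0.5) ** 2 + (t2 - 0.5) ** 2) / 2

sv_l1 = target_sv(0.4, 0.6)   # l = 1
sv_l2 = target_sv(0.2, 0.8)   # l = 2
# sv_l1 is 0.01 and sv_l2 is 0.09, matching the values in the text
```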

Real clinical data
In this section, we analyze a clinical dataset using an associated, expert-designed BN and compare the performance of the ML, MAP, PM, CB and SPRGM estimates. For this purpose, we consider the Hepar BN model [32], which is a causal BN concerning a subset of the domain of hepatology: 11 liver diseases (described by 9 disorder nodes), 18 risk factors, and 44 symptoms and laboratory test results. Fig. 5 shows a simplified fragment of the Hepar BN model. The fragment models 18 variables related to the diagnosis of a small set of hepatic disorders: three risk factors, 12 symptoms and test results, and three disorder nodes. To give the reader an idea of the number of numerical parameters needed to quantify a BN, let us assume for simplicity that each variable in the model in Fig. 5 is binary.
We are interested in computing the probability P(PBC | Evidence), where 'PBC' stands for primary biliary cirrhosis, one of the possible liver diseases modelled in the network, and 'Evidence' is a set of variables with values that pertain to PBC in some way. We will use this as an example to examine the effects of the different parameter estimation methods. For example, the evidence LE cells = 0, Antimytochondrial AB = 1 and Gender = female already gives a high probability if Age is above 40; these are indeed part of the characteristics of the disease according to the clinical literature.
For simplicity, we refer to PBC, LE cells, Antimytochondrial AB, Gender and Age by B, L, A, G and E, respectively. The variables A, G and L are assumed to be binary. For the variable Age, we let E take the value 0 if the patient's age is under 40 and the value 1 otherwise. Also, in our clinical data, B takes either the value zero (disease absent) or one (disease present). Thus, our goal is to compute the conditional probability given in Eq. (5.5). Suppose we know from prior experience that the probability θ_312 = P(B = 1 | E = 1, A = 0) has a high value, but we are unable to determine its exact value reliably from the available data. From the data we first determine point estimates for the parameters in Eq. (5.5); in particular, we can propose a prior distribution by looking at possible estimates of θ_312, e.g., its ML estimate, which is 0.883. Based on this value, one may consider the Dir(α_312, α_322) prior with α_312 = 50 and α_322 = 5, which gives a prior mean of 50/55 = 0.9091. However, this specific estimate would change with a different sample, since a change in the available data changes the point estimates. To make sure that the proposed prior is rich enough to include other plausible cases, one may consider the class Γ_1 below, which reflects the possibility of obtaining estimates in the interval (40/48, 60/62).
By expressing the uncertainty of the parameters in terms of some classes of conjugate distributions as listed below, we make sure that a wider range of probabilities is covered.
One way to derive the SPRGM estimate of θ_D is to compute the SPRGM estimate of each θ_jil appearing in Eq. (5.5). The resulting estimates are shown in Table 2. It can be observed that the SPRGM estimate of the desired parameter θ_D is high, as expected. For comparison, we also report the corresponding ML estimates in Table 2. It should be emphasized that since ML estimates do not depend on prior knowledge, comparing ML and Bayesian estimates is not entirely fair; we should rely on the ML estimates only in situations in which we do not have access to any source of prior information.

Final remarks
In this paper we focused on discrete random variables. We would like to stress that this is common in BNs. For example, the well-known bnlearn software [39] is based on discrete random variables. One main reason is the fact that many BN learning algorithms are unable to treat continuous variables efficiently. However, as [5] reports, three common approaches for extending BNs to continuous variables have been introduced in the literature. One approach, introduced by [47], is to first model the conditional pdf of each continuous random variable using a certain family of distributions and then redesign the corresponding BN inference based on the parameterizations. Another approach is to use nonparametric distributions such as Gaussian processes [20], and a third approach is to discretize the continuous variables based on some criterion such as the minimum description length. The third approach has been used extensively in the literature, and new developments have been introduced; see for example [17] among many others. Thus, assuming the random variables are discrete is not a restrictive assumption. We also highlight that in our developments we assumed a BN with complete instantiations is available, meaning that no missing values are present. However, we would like to emphasize that incomplete/missing data can be handled with one of the available methods in the literature. [8] provided a theoretical approach to the problem of learning with missing data; they show that one can solve this problem by taking a sum of the conditional probabilities over all possible values for each missing data point. [27] studied the parameter learning task in the presence of missing data based on the Expectation-Maximization (EM) technique. [36] applied the importance sampling technique to such a problem.
Among the existing methods, we suggest using the EM algorithm because it is easy to implement and converges relatively quickly [38]. To apply the EM algorithm, suppose that in the kth sample x^{(k)}, k = 1, 2, . . . , n, X_m is the variable whose value is missing. The EM algorithm starts with an initial estimate θ^0; at each iteration t, the data set is completed based on θ^t and the parameters are then re-estimated from the completed data set, yielding θ^{t+1}. The E-step computes the conditional expectation of the complete-data log-likelihood, given the observed component of the data and the current values of the parameters; that is, it computes the current expected log-likelihood of θ given the data x, denoted Q(θ|θ^t), where

m^t_{jil} = Σ_k Σ_{x_m} P(X_m = x_m | X^{(k)} = x^{(k)}, θ^t) n_{jilk},

and n_{jilk} is given by Eq. (2.1). The M-step then computes the next estimate θ^{t+1} by maximizing the current expected log-likelihood, i.e., θ^{t+1} = arg max_θ Q(θ|θ^t). After some algebraic manipulation we obtain, for all i, j and l,

θ^{t+1}_{jil} = m^t_{jil} / Σ_{i'} m^t_{ji'l}.

Here m^t_{jil} is interpreted as the number of cases in the completed data set where X_j = i when its parent configuration is in the state λ_l. Thus, θ^{t+1}_{jil} is interpreted as the expected proportion of cases where X_j = i among all possibilities when its parent configuration is in the state λ_l.
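The E- and M-steps above can be sketched on a toy network: a binary root X with unknown θ = P(X = 1) and an observed child Y with a known CPT. This is a deliberately minimal stand-in for the general BN setting, and all names, CPT entries and data values below are hypothetical.

```python
import numpy as np

# Known CPT of the observed child Y given X (hypothetical values):
p_y_given_x = np.array([[0.8, 0.2],   # P(Y=0|X=0), P(Y=1|X=0)
                        [0.3, 0.7]])  # P(Y=0|X=1), P(Y=1|X=1)

x_obs = np.array([1, 0, 1, 1, -1, -1, -1, 0])  # -1 marks a missing X
y_obs = np.array([1, 0, 1, 0, 1, 1, 0, 0])     # Y is always observed

theta = 0.5                                    # initial estimate theta^0
for t in range(100):
    # E-step: expected count of X=1 (the m^t quantity in the text).
    # Observed cases contribute their value; missing cases contribute
    # P(X=1 | Y=y, theta^t), obtained by Bayes' rule.
    p1 = theta * p_y_given_x[1, y_obs]
    p0 = (1 - theta) * p_y_given_x[0, y_obs]
    resp = p1 / (p1 + p0)                      # P(X=1 | Y=y, theta^t)
    m1 = np.where(x_obs >= 0, x_obs, resp).sum()
    # M-step: theta^{t+1} is the expected proportion of X=1 cases.
    theta = m1 / len(x_obs)

print(round(theta, 3))
```

Note how each missing case is "completed" with a fractional count rather than a hard imputation, which is exactly what the expected counts m^t_{jil} express in the general case.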
Since the EM algorithm converges [26,38], this iterative approach leads to replacing θ_{jil} by θ^s_{jil}, where s is the iteration after which θ^t_{jil} remains constant. Once this replacement is done, δ^{ML}_{jil} in Eq. (2.3) is derived. Since n_{j·l·} is known, we get n_{jil·} = δ^{ML}_{jil} n_{j·l·}. Substituting this new value of n_{jil·} into Eq. (2.5) and Eq. (2.6), as well as Theorems 3.1, 4.1 and 4.2, yields the MAP, PM, CB and SPRGM estimates of the parameters associated with the variable whose value is missing.

Conclusions and discussion
In this paper we considered the task of parameter learning in BNs. We provided improvements of Bayesian methods, extending and applying the simultaneous-estimation and robust Bayesian methodologies to the context of parameter learning in BNs.
Assuming some prior knowledge is accessible, we dealt with different approaches to incorporating it and derived explicit forms of the Bayes (MAP and PM), adjusted Bayes (CB) and robust Bayes (SPRGM) estimates. From the Bayesian estimation literature it is understood that, in the presence of crisp prior knowledge, one can reach a reliable Bayes estimate of the desired parameter. Prior knowledge can be specified by determining the hyperparameters of the underlying prior distribution, but in many situations there is a lack of consensus among experts or decision-makers concerning these hyperparameters. In such situations, one sensible approach, as adopted in this paper, is to define a class of priors broad enough that the existing knowledge falls within it. The corresponding rule, which we referred to as the 'robust Bayes rule', can then be used in the hope of arriving at a rule consistent with the real world.
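To make the robust Bayes idea concrete, here is a minimal sketch for a single Beta-Binomial parameter under squared-error loss, where the posterior regret of an estimate d relative to a prior π reduces to (d − δ^π)², so the PRGM estimate is the midpoint of the range of posterior means over the class. The class of priors and all numbers are hypothetical, and the two-category Beta case merely stands in for the paper's Dirichlet setting.

```python
import numpy as np

n, s = 20, 7                         # n trials, s successes (hypothetical data)
alphas = np.linspace(2, 10, 50)      # class of Beta(alpha, beta) priors
betas = np.linspace(2, 10, 50)

# Posterior mean under a Beta(a, b) prior: (s + a) / (n + a + b).
post_means = [(s + a) / (n + a + b) for a in alphas for b in betas]

# Under squared-error loss, the posterior regret of an estimate d w.r.t.
# prior pi is (d - posterior_mean(pi))^2, so the estimate minimizing the
# maximal regret over the class is the midpoint of the posterior means.
d_prgm = (min(post_means) + max(post_means)) / 2
print(round(d_prgm, 4))
```

Note how d_prgm hedges between the most optimistic and most pessimistic priors in the class, which is precisely the "global prevention of bad prior choices" discussed below.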
Our simulation study emphasizes that when crisp prior knowledge is present, the Bayes and CB rules are reliable methods. This was evident from the Dir(40,10) and Dir(35,10) priors and the corresponding Bayes and CB estimates in Table A.1 of the Appendix, where the true parameters were 0.2 and 0.4, respectively. For the other specified priors, however, the resulting Bayes and CB estimates are quite far from the true parameters, so those priors are bad choices. As noted earlier in the simulation study, in practice exact prior knowledge in the form of specific prior hyperparameters is rarely available. The overall class (5.1) was rich enough to include all the prior information provided by the three experts. In addition to preventing bad choices of priors, the quantitative statistics show that the SPRGM estimates perform quite well.
We emphasize that when the values of the hyperparameters are not justifiably chosen, or when exact prior knowledge is not available, the SPRGM estimates outperform the Bayes rules, as we should expect, since robust rules aim to guard globally against a bad choice of a single prior. For a well-justified choice of a single prior the results may of course reverse: the Bayes estimate then outperforms the robust Bayes rules. When specific hyperparameters of a prior are available, we encourage the use of the MAP and PM estimates. We encourage the CB estimates only when interest lies both in simultaneous estimation and in closeness between the distribution of the estimates and the posterior distribution of the parameters. When there is a lack of consensus of opinion about the prior hyperparameters, we encourage the SPRGM estimate(s), in the hope of reaching an optimal estimate.
We would like to wrap up this work by addressing the main interest of the Bayesian analysis considered in this paper. Although different prior-based point estimates of the desired parameters have been provided, all of these point estimates are driven by the posterior distribution. The MAP and PM rules are the points that minimize the posterior expected loss, which informally averages the loss of choosing an estimator of the desired parameter with respect to the posterior distribution; the CB estimates adjust the PM estimates according to the additional constraints (i)-(iii) of Section 3; and the SPRGM estimates minimize, over the class of priors, the largest difference between the posterior risk of an arbitrary estimator and the posterior risk of the Bayes estimator.

Appendix
Proof of Lemma 3.1.

Proof of Theorem 3.1. To derive the CB estimates of the elements of θ_{jl}, we minimize (A.1). The first term on the RHS of (A.1) does not depend on the estimates δ_{jil}; hence it suffices to minimize the remaining term subject to the conditions (i)-(iii).