Mixed noise and posterior estimation with conditional DeepGEM

We develop an algorithm for jointly estimating the posterior and the noise parameters in Bayesian inverse problems, motivated by indirect measurements and applications from nanometrology with a mixed noise model. We propose to solve the problem with an expectation maximization (EM) algorithm. Based on the current noise parameters, we learn in the E-step a conditional normalizing flow that approximates the posterior. In the M-step, we propose to find the noise parameter updates again by an EM algorithm, which admits analytical formulas. We compare the training of the conditional normalizing flow with the forward and reverse Kullback–Leibler divergence, and show that our model is able to incorporate information from many measurements, unlike previous approaches.


Introduction
In a variety of healthcare and other contemporary applications, the variables of primary interest are obtained through indirect measurements, such as in Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). For some of these applications, the reliability of the results is of particular importance. The accuracy and trustworthiness of the outcomes obtained through indirect measurements are significantly influenced by two critical factors: the degree of uncertainty associated with the measuring instrument and the appropriateness of the (forward) model used for the reconstruction of the parameters of interest (measurand). In this paper, we consider Bayesian inversion to obtain the measurand from signals measured by the instrument and a noise model that mimics background noise coming from the instrument and the variation of the measurement, depending on the forward model. Within this framework, we develop an extension of the expectation maximization (EM) algorithm that is able to handle Bayesian inversion with a measurement noise model. As a result, we obtain the posterior distribution of the parameters of interest (distribution of the measurand), which is a measure of the reliability of the measurement results. To demonstrate the applicability and effectiveness, we apply the algorithm to two real examples from nanometrology, i.e., EUV scatterometry. The key focus of this work is the development of a noise-adapted posterior sampler based on DeepGEM [24], which can incorporate information from several measurements simultaneously.
In this context we consider Bayesian inverse problems

$$Y_\theta = F(X) + \eta_\theta \qquad (1)$$

with a possibly nonlinear forward operator $F \colon \mathbb{R}^d \to \mathbb{R}^n$ and a random noise variable $\eta_\theta$ which depends on an unknown parameter $\theta$ and on $F(X)$. Note that $Y$ describes the signals of the instrument, whereas $X$ are the parameters of interest. The posterior (parameter distribution) $P_{X|Y_\theta = y}$ for observations $y$ will ultimately depend on these parameters $\theta$, as the likelihood $P_{Y_\theta|X=x}$ depends on them. Therefore, we aim to estimate the parameter $\theta$ from observations $y_i \in \mathbb{R}^n$, $i = 1, \ldots, N$, where $N$ is possibly small. There exists plenty of literature on estimating the standard deviation $\sigma$ within the Gaussian noise model $\eta_\theta = \eta_\sigma \sim \mathcal{N}(0, \sigma^2 I_n)$. However, motivated by applications in nanometrology [30], we are interested in a mixture of additive and multiplicative Gaussian noise of the form

$$\eta_\theta = a\,\varepsilon_1 + b\,F(X) \odot \varepsilon_2, \qquad \varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, I_n), \quad \theta = (a,b), \qquad (2)$$

i.e., $\eta_\theta \mid X = x \sim \mathcal{N}\big(0,\, a^2 I_n + b^2 \operatorname{diag}(F(x)^2)\big)$, where $F(x)^2 = (F_i(x)^2)_{i=1}^n$ and the identity in $\mathbb{R}^{n \times n}$ is given by $I_n$. For convenience we assume that the instrument noise and other sources can be described by the simple ansatz made here. In general, different noise models may appear in applications. The noise model (2) was used in several previous studies in optics [29,30,32,59] and analyzed in [19]. A similar noise model appears in analytical chemistry [58] and the study of gene expression arrays [57]. It belongs to the class of heteroskedastic noise models [22], and an algorithm for parameter estimation in a slightly different problem was proposed in [23]. Learning the noise model without any parametric form was done using NFs in [1].
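For intuition, the following minimal sketch draws synthetic observations from the noise model (2); the function name is ours and `forward` stands in for an arbitrary forward operator.

```python
import torch

def sample_mixed_noise(F_x: torch.Tensor, a: float, b: float) -> torch.Tensor:
    """Sample y = F(x) + a*eps1 + b*F(x)*eps2 with eps1, eps2 ~ N(0, I_n),
    i.e. y | x ~ N(F(x), a^2 I_n + b^2 diag(F(x)^2))."""
    eps1 = torch.randn_like(F_x)
    eps2 = torch.randn_like(F_x)
    return F_x + a * eps1 + b * F_x * eps2

# e.g. y = sample_mixed_noise(forward(x), a=0.005, b=0.1)   # forward: placeholder
```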
The standard approach for parameter estimation is maximum likelihood estimation. That is, we choose $\theta$ as the minimizer of the negative log-likelihood function $-\sum_{i=1}^N \log(p_\theta(y_i))$. Related amortized approaches train posterior and summary network together in an amortized (i.e., conditional) manner. However, they do not optimize iteratively, since they do not treat the problem in an EM framework, and they do not discuss noise modelling. The main idea of this paper was presented by some of the authors in a one-page extended abstract in [56].
Contributions. First, we propose to use conditional normalizing flows [7,66] in the E-step. This allows incorporating several measurements from the same error model and solving the inverse problem for all measurements simultaneously. Fortunately, the forward KL [7] can be used as loss function for training the conditional NFs, which makes the method mode covering. Second, we propose an inner EM algorithm for solving the M-step more efficiently. For our special noise model (2), we deduce analytic expressions for the E- and M-steps of this inner algorithm. The performance of our approach is demonstrated on two applications from nano-optics. In particular, we propose a conditional version of DeepGEM and benchmark it against forward conditional DeepGEM, where the reverse KL is replaced by a forward KL.
Organization. We start in Section 2 by recalling the general EM algorithm. Then, in Section 3, we construct the E-step and M-step for our application. That is, we show how conditional normalizing flows can be incorporated into the E-step and describe how the M-step can be solved for our noise model with an "inner" EM algorithm whose steps can be given in closed analytical form. Some of the technical computations are postponed to Appendix A. We test our algorithms on two nano-optics problems in Section 4. Finally, conclusions are drawn in Section 5.

EM Algorithm
In this section, we introduce the EM algorithm as a maximization-maximization algorithm of an evidence lower bound. A general introduction to the EM algorithm can be found, e.g., in [10].
Let $\{Y_\theta : \theta \in \Theta\}$ be a family of $n$-dimensional random variables having probability density functions $p_\theta$, $\theta \in \Theta$. Given i.i.d. samples $y_1, \ldots, y_N \in \mathbb{R}^n$ from $Y_{\theta^*}$ for some unknown $\theta^*$, we want to approximate $\theta^*$ by computing the maximum log-likelihood estimator

$$\hat\theta \in \operatorname*{arg\,max}_{\theta \in \Theta} \mathcal{L}(\theta), \qquad \mathcal{L}(\theta) := \sum_{i=1}^N \log(p_\theta(y_i)).$$

In the literature, the term $\log(p_\theta(y))$ is also called the evidence of $y$ under $\theta$. As in many applications it is hard to maximize $\mathcal{L}$, we introduce an absolutely continuous $d$-dimensional auxiliary random variable $X$ such that the joint density $p_{X,Y_\theta}$ exists and is easy to evaluate. Then it holds, by the law of total probability and Jensen's inequality, for any probability density function $q$ on $\mathbb{R}^d$, that

$$\log(p_\theta(y)) = \log\Big(\int_{\mathbb{R}^d} \frac{p_{X,Y_\theta}(x,y)}{q(x)}\, q(x)\,\mathrm{d}x\Big) \;\geq\; \mathbb{E}_{x \sim q}\Big[\log \frac{p_{X,Y_\theta}(x,y)}{q(x)}\Big] =: \mathcal{F}(q, \theta \,|\, y).$$

We call the random variable $X$ the hidden variable and the expression $\mathcal{F}(q, \theta|y)$ the evidence lower bound (ELBO). Now, instead of maximizing the log-likelihood function directly, the EM algorithm is a maximization-maximization algorithm for the ELBO, i.e., starting with an initial estimate $\theta^{(0)}$, it consists of the following two steps:

E-step: $\quad q_i^{(r+1)} \in \operatorname*{arg\,max}_{q \in \mathrm{pdf}} \mathcal{F}(q, \theta^{(r)} \,|\, y_i), \quad i = 1, \ldots, N, \qquad (3)$

M-step: $\quad \theta^{(r+1)} \in \operatorname*{arg\,max}_{\theta \in \Theta} \sum_{i=1}^N \mathcal{F}(q_i^{(r+1)}, \theta \,|\, y_i), \qquad (4)$

where pdf is the space of $d$-dimensional probability density functions. The E-step (3) can be solved based on the following standard lemma, which can be found, e.g., in [9, Section 9.4]. For convenience, we provide the simple proof. Recall that the Kullback-Leibler (KL) divergence of two probability measures $P, Q$ with densities $p, q$ is defined by

$$\mathrm{KL}(P, Q) := \int_{\mathbb{R}^d} p(x) \log\frac{p(x)}{q(x)}\,\mathrm{d}x$$

if $P$ is absolutely continuous with respect to $Q$, and $\mathrm{KL}(P, Q) = +\infty$ otherwise. Further, we use the convention $0 \log 0 = 0$.
Lemma 1. Let $X \in \mathbb{R}^d$ be an absolutely continuous random variable and let $Q$ be an absolutely continuous measure on $\mathbb{R}^d$ with probability density function $q$. Then it holds that

$$\mathcal{F}(q, \theta \,|\, y) = \log(p_\theta(y)) - \mathrm{KL}\big(Q, P_{X|Y_\theta = y}\big).$$
Proof. By definition of the conditional distribution, we have

$$\mathcal{F}(q, \theta \,|\, y) = \mathbb{E}_{x \sim q}\Big[\log \frac{p_{X,Y_\theta}(x,y)}{q(x)}\Big] = \mathbb{E}_{x \sim q}\Big[\log \frac{p_{X|Y_\theta = y}(x)\, p_\theta(y)}{q(x)}\Big] = \log(p_\theta(y)) - \mathrm{KL}\big(Q, P_{X|Y_\theta = y}\big). \qquad \square$$

As the KL divergence $\mathrm{KL}(Q, P)$ is minimal if and only if $Q = P$, the lemma implies that the solutions $q_i$ of the E-step (3) are given explicitly by

E-step: $\quad q_i^{(r+1)} = p_{X|Y_{\theta^{(r)}} = y_i}, \quad i = 1, \ldots, N. \qquad (5)$

For solving the M-step (4), we decompose the ELBO as

$$\mathcal{F}(q, \theta \,|\, y) = \mathbb{E}_{x \sim q}\big[\log(p_{Y_\theta|X=x}(y))\big] + \mathbb{E}_{x \sim q}\Big[\log \frac{p_X(x)}{q(x)}\Big]. \qquad (6)$$

Note that only the first summand depends on $\theta$. Using this decomposition and the explicit form (5) of the $q_i$ in the M-step, we obtain the classical EM algorithm as proposed in [16]:

E-step: $\quad q_i^{(r+1)} = p_{X|Y_{\theta^{(r)}} = y_i}, \quad i = 1, \ldots, N,$

M-step: $\quad \theta^{(r+1)} \in \operatorname*{arg\,max}_{\theta \in \Theta} \sum_{i=1}^N \mathbb{E}_{x \sim p_{X|Y_{\theta^{(r)}} = y_i}}\big[\log(p_{Y_\theta|X=x}(y_i))\big]. \qquad (7)$

A convergence analysis of the EM algorithm based on KL proximal point algorithms was done in [13,14]. In particular, we obtain the following convergence properties.
Proposition 2. Let the sequence $(q^{(r)}, \theta^{(r)})$ be generated by the EM algorithm. Then the following holds true.

(i) The sequence of ELBO values $\sum_{i=1}^N \mathcal{F}(q_i^{(r)}, \theta^{(r)} \,|\, y_i)$ is monotone increasing.

(ii) The sequence of likelihood values $\mathcal{L}(\theta^{(r)})$ is monotone increasing.
Remark 3 (Generalized EM algorithms). Several papers propose so-called generalized EM algorithms [34,42,46,52]. The key idea of these generalizations is to replace the maximization steps (3) and (4) by increase steps. More precisely, in each iteration the values $q_i^{(r+1)}$ and $\theta^{(r+1)}$ are chosen such that

$$\mathcal{F}(q_i^{(r+1)}, \theta^{(r)} \,|\, y_i) \geq \mathcal{F}(q_i^{(r)}, \theta^{(r)} \,|\, y_i), \qquad \sum_{i=1}^N \mathcal{F}(q_i^{(r+1)}, \theta^{(r+1)} \,|\, y_i) \geq \sum_{i=1}^N \mathcal{F}(q_i^{(r+1)}, \theta^{(r)} \,|\, y_i).$$

Using such increase steps, generalized EM algorithms often achieve simpler and faster steps than the original EM algorithm, even though they might require more steps until convergence. By construction, part (i) of Proposition 2 remains valid for generalized EM algorithms, while part (ii) is no longer guaranteed for certain of these algorithms.
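To make the alternating scheme concrete, consider the following self-contained toy instance (our own illustration, not one of the applications below): for purely additive noise $Y_\sigma = X + \sigma\varepsilon$ with known prior $X \sim \mathcal{N}(0,1)$ and $\varepsilon \sim \mathcal{N}(0,1)$, both the E-step posterior and the M-step update have closed forms.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_true, N = 0.5, 5000
x = rng.standard_normal(N)                     # hidden X ~ N(0, 1)
y = x + sigma_true * rng.standard_normal(N)    # observations of Y_sigma

sigma2 = 4.0  # start deliberately too large ("from above", cf. Section 3)
for _ in range(500):
    # E-step: X | Y=y is N(m, v) with m = y/(1+sigma^2), v = sigma^2/(1+sigma^2)
    v = sigma2 / (1.0 + sigma2)
    m = y / (1.0 + sigma2)
    # M-step: sigma^2 <- (1/N) sum_i E[(y_i - X)^2 | y_i]
    sigma2 = np.mean((y - m) ** 2 + v)

print(np.sqrt(sigma2))  # approx. 0.5 = sigma_true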

Parameter Estimation in Bayesian Inverse Problems
Now we consider the inverse problem (1), where we assume that $X$ has density $p_X$. Given $N$ observations $y_1, \ldots, y_N$ of $Y_\theta$, we aim to determine the parameter $\theta$. We will derive an EM algorithm for this problem, where the hidden variable is given by the ground truth random variable $X$. In particular, we will deal with the noise model (2). Here the parameter $\theta = (a, b)$ can be updated in the M-step analytically.

E-Step: Conditional NFs
As we have seen in (5), the E-step corresponds to finding the posterior densities for given $\theta^{(r)}$. We propose to approximate these posteriors by conditional NFs. This extends the so-called DeepGEM from [24] to the conditional case. We will see that our approach brings the forward KL into play instead of the reverse KL, which has several advantages, see Remark 4.
A conditional NF is a mapping $T_\phi \colon \mathbb{R}^n \times \mathbb{R}^d \to \mathbb{R}^d$ depending on some parameters $\phi$ such that $T_\phi(y, \cdot)$ is invertible for any $y \in \mathbb{R}^n$. In this paper, $T_\phi$ is a neural network. Several architectures for NFs have been proposed in the literature. They include GLOW [39], real NVP [18], invertible ResNets [8,12,33] and autoregressive flows [15,20,36,53]. They were extended to the conditional setting in [7,17,27,40,66]. The parameters $\phi$ are learned such that

$$T_\phi(y_i, \cdot)_\# P_Z \approx P_{X|Y_{\theta^{(r)}} = y_i} \quad \text{for all } i = 1, \ldots, N,$$

where $P_Z$ is some latent distribution, usually a standard Gaussian one. Once we have learned the conditional NF $T_\phi$ for an appropriate parameter $\theta$, this provides us with the desired approximation of the posterior.
In other words, given $y \in \mathbb{R}^n$, we can sample $z$ from $Z$ and produce a sample from $P_{X|Y=y}$ by $T_\phi(y, z)$. Now we could learn the conditional NF $T_\phi$ by minimizing the loss function

$$J_{\mathrm{reverse}}(\phi) := \sum_{i=1}^N \mathrm{KL}\big(T_\phi(y_i, \cdot)_\# P_Z,\; P_{X|Y_{\theta^{(r)}} = y_i}\big) \propto -\sum_{i=1}^N \mathcal{F}\big(p_{T_\phi(y_i, \cdot)_\# P_Z},\, \theta^{(r)} \,|\, y_i\big),$$

where the last relation follows from Lemma 1 and "$\propto$" indicates equality up to a constant. In the literature, this loss function is known as the reverse or backward KL loss function. Applying the change of variables formula for push-forward measures and Bayes' formula, this can be rewritten as

$$J_{\mathrm{reverse}}(\phi) \propto \sum_{i=1}^N \mathbb{E}_{z \sim P_Z}\Big[-\log\big(p_{Y_{\theta^{(r)}}|X = T_\phi(y_i, z)}(y_i)\big) - \log\big(p_X(T_\phi(y_i, z))\big) - \log\big|\det \nabla_z T_\phi(y_i, z)\big|\Big],$$

see [2,3,40] for a detailed explanation and applications. In order to evaluate these terms, we have to be able to evaluate the prior density $p_X$ as well as the conditional densities $p_{Y_{\theta^{(r)}}|X=x}(y)$, which contain the forward operator and the noise model for given parameters $\theta^{(r)}$. Unfortunately, it is known from the literature that the reverse KL is prone to mode collapse, see [48]. That is, in the case that $P_{X|Y_\theta = y_i}$ is multimodal, it tends to generate only samples from one of the modes.
As a remedy, we interchange the arguments in the KL divergence in $J_{\mathrm{reverse}}$ and replace the sum over the $y_i$ by the expectation over $P_{Y_{\theta^{(r)}}}$. Then we arrive at the so-called forward KL loss function

$$J_{\mathrm{forward}}(\phi) := \mathbb{E}_{y \sim P_{Y_{\theta^{(r)}}}}\Big[\mathrm{KL}\big(P_{X|Y_{\theta^{(r)}} = y},\; T_\phi(y, \cdot)_\# P_Z\big)\Big] \propto \mathbb{E}_{(x,y) \sim P_{X,Y_{\theta^{(r)}}}}\Big[-\log\big(p_Z(T_\phi^{-1}(y,x))\big) - \log\big|\det \nabla_x T_\phi^{-1}(y,x)\big|\Big] \approx \frac{1}{\tilde N} \sum_{j=1}^{\tilde N} -\log\big(p_Z(T_\phi^{-1}(\tilde y_j, \tilde x_j))\big) - \log\big|\det \nabla_x T_\phi^{-1}(\tilde y_j, \tilde x_j)\big|,$$

where $T_\phi^{-1}(y, \cdot)$ denotes the inverse of $T_\phi(y, \cdot)$. To compute these terms we need samples $(\tilde x_j, \tilde y_j)$, $j = 1, \ldots, \tilde N$, from the joint distribution $P_{X, Y_{\theta^{(r)}}}$. Note that such samples can be generated from just knowing the $\tilde x_j$ by evaluating the forward operator and the noise model. In this setting, we need access neither to the prior density $p_X$ nor to the conditional densities $p_{Y_{\theta^{(r)}}|X=x}$. The forward KL is more standard in (conditional) generative modelling [7,18,66] due to these properties and is also known as maximum likelihood training [67].
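As an illustration, here is a minimal sketch of this maximum likelihood training with the FrEIA package used in Section 4; the dimensions, subnetwork and block count are our own placeholder choices, not the exact architecture of the experiments.

```python
import torch
import FrEIA.framework as Ff
import FrEIA.modules as Fm

d, n = 3, 23  # x- and y-dimension, e.g. of the photomask example below

def subnet(c_in, c_out):
    return torch.nn.Sequential(torch.nn.Linear(c_in, 128), torch.nn.ReLU(),
                               torch.nn.Linear(128, c_out))

# conditional NF T_phi: invertible in x, conditioned on the measurement y
flow = Ff.SequenceINN(d)
for _ in range(6):
    flow.append(Fm.AllInOneBlock, cond=0, cond_shape=(n,),
                subnet_constructor=subnet)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)

def forward_kl_step(x_joint, y_joint):
    """One gradient step on the empirical forward KL, i.e. on
    1/N_tilde * sum_j -log q_phi(x_j | y_j) for joint samples
    (x_j, y_j) ~ P_{X, Y_theta}; the latent P_Z is standard normal."""
    z, log_jac_det = flow(x_joint, c=[y_joint])   # z = T_phi^{-1}(y, x)
    loss = (0.5 * (z ** 2).sum(dim=1) - log_jac_det).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# posterior sampling afterwards: x, _ = flow(z, c=[y], rev=True), z ~ N(0, I_d)
```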
Remark 4 (Forward versus reverse KL). Note that in the case that $T_\phi$ is a universal approximator, we have for both loss functions that the optimal parameters $\hat\phi$ fulfill $T_{\hat\phi}(y_i, \cdot)_\# P_Z = P_{X|Y_{\theta^{(r)}} = y_i}$. This is important, as we propose to replace the reverse KL in the E-step by the forward KL. Moreover, the assumptions for training and the approximation properties differ. For the reverse KL, we have to be able to evaluate the density $p_X$ of the prior distribution, while the forward KL needs samples from $P_X$. In practice, it depends on the problem which assumption is more realistic. On the other hand, the forward KL loss function is not as prone to mode collapse. The universality of conditional normalizing flows has been discussed in [44].

M-Step: Inner EM for Mixed Noise Model
As described in (7), the M-step is given by

$$\theta^{(r+1)} \in \operatorname*{arg\,max}_{\theta \in \Theta} \sum_{i=1}^N \mathbb{E}_{x \sim P_{X|Y_{\theta^{(r)}} = y_i}}\big[\log(p_{Y_\theta|X=x}(y_i))\big]. \qquad (8)$$

Unfortunately, to the best of our knowledge, an analytic solution of (8) is not available. Therefore, we discretize the expectation in (8) by

$$\theta^{(r+1)} \in \operatorname*{arg\,max}_{\theta \in \Theta} \sum_{i=1}^N \frac{1}{M} \sum_{k=1}^M \log\big(p_{Y_\theta|X = x_i^k}(y_i)\big), \qquad (9)$$

where the $x_i^k$, $k = 1, \ldots, M$, are sampled from $P_{X|Y_{\theta^{(r)}} = y_i}$. This can be solved by various iterative methods, e.g., by a stochastic gradient algorithm [38] as done in [24].
For our special noise model (2) with $\theta = (a, b)$, we propose to use again an EM algorithm, since both the E- and M-step of the "inner" EM can be computed analytically, as will be shown in the following paragraphs. We use that for this noise model we have

$$p_{Y_\theta|X=x}(y) = \mathcal{N}\big(y;\; F(x),\; a^2 I_n + b^2 \operatorname{diag}(F(x)^2)\big).$$

For simplicity, we assume that we have only $M = 1$ sample. The case $M > 1$ can be reduced to this case by considering $M$ copies of $y_i$. In our EM algorithm for (9) we use $V_\theta \sim \mathcal{N}(0, a^2 I_n)$ as hidden variable, which corresponds to the "additive part" of the noise.
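Since the covariance is diagonal, this likelihood can be evaluated directly; a short sketch (the function name is ours):

```python
import torch

def mixed_noise_loglik(y: torch.Tensor, F_x: torch.Tensor,
                       a: float, b: float) -> torch.Tensor:
    """log N(y; F(x), a^2 I_n + b^2 diag(F(x)^2)) for batches of shape (B, n)."""
    var = a ** 2 + b ** 2 * F_x ** 2
    return (-0.5 * (y - F_x) ** 2 / var
            - 0.5 * torch.log(2 * torch.pi * var)).sum(dim=1)
```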
Inner E-step. We have to compute the conditional distribution $P_{V_\theta|(X,Y_\theta)=(x,y)}$. Using Bayes' formula, we obtain

$$\log p_{V_\theta|(X,Y_\theta)=(x,y)}(v) \propto -\frac{\|v\|^2}{2a^2} - \sum_{j=1}^n \frac{(y_j - F_j(x) - v_j)^2}{2 b^2 F_j(x)^2} \propto -\sum_{j=1}^n \frac{a^2 + b^2 F_j(x)^2}{2 a^2 b^2 F_j(x)^2}\Big(v_j - \frac{a^2 (y_j - F_j(x))}{a^2 + b^2 F_j(x)^2}\Big)^2,$$

where the quotients in the last line are understood componentwise and "$\propto$" indicates that we have equality up to an additive constant independent of $v$. Consequently, the conditional distribution $P_{V_\theta|(X,Y_\theta)=(x,y)}$ is given by

$$P_{V_\theta|(X,Y_\theta)=(x,y)} = \mathcal{N}\Big(\frac{a^2 (y - F(x))}{a^2 + b^2 F(x)^2},\; \operatorname{diag}\Big(\frac{a^2 b^2 F(x)^2}{a^2 + b^2 F(x)^2}\Big)\Big). \qquad (10)$$

Inner M-step. We just outline the final result; the quite technical proof is deferred to Appendix A. There, it is shown that the M-step can be rewritten as

$$\min_{a, b}\; A_1(a) + A_2(b),$$

where

$$A_1(a) = \sum_{i=1}^N \sum_{j=1}^n \Big(\log(a) + \frac{\mathbb{E}_{v \sim Q_i}[v_j^2]}{2a^2}\Big), \qquad A_2(b) = \sum_{i=1}^N \sum_{j=1}^n \Big(\log\big(b\,|F_j(x_i)|\big) + \frac{\mathbb{E}_{v \sim Q_i}\big[(y_{ij} - F_j(x_i) - v_j)^2\big]}{2 b^2 F_j(x_i)^2}\Big),$$

with $Q_i := P_{V_{\theta^{(r)}}|(X,Y_{\theta^{(r)}})=(x_i,y_i)}$, $y_i = (y_{i1}, \ldots, y_{in})$ and $F = (F_1, \ldots, F_n) \colon \mathbb{R}^d \to \mathbb{R}^n$. By setting the derivatives of $A_1$ and $A_2$ to zero, this is equivalent to

$$(a^{(r+1)})^2 = \frac{1}{Nn} \sum_{i=1}^N \sum_{j=1}^n \mathbb{E}_{v \sim Q_i}[v_j^2], \qquad (b^{(r+1)})^2 = \frac{1}{Nn} \sum_{i=1}^N \sum_{j=1}^n \frac{\mathbb{E}_{v \sim Q_i}\big[(y_{ij} - F_j(x_i) - v_j)^2\big]}{F_j(x_i)^2}, \qquad (11)$$

which are the update rules we will use.
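A numpy sketch of one inner EM iteration under these formulas, with one posterior sample per measurement stacked row-wise (array and function names are ours):

```python
import numpy as np

def inner_em_step(y, F_x, a, b):
    """One analytic inner EM update for (a, b); y and F_x have shape (N, n)."""
    # inner E-step (10): posterior moments of the additive component V
    denom = a ** 2 + b ** 2 * F_x ** 2
    mean_v = a ** 2 * (y - F_x) / denom                       # E[v_j]
    second_v = mean_v ** 2 + a ** 2 * b ** 2 * F_x ** 2 / denom  # E[v_j^2]
    # inner M-step (11): closed-form minimizers of A_1 and A_2
    a_new = np.sqrt(np.mean(second_v))
    resid2 = (y - F_x) ** 2 - 2 * (y - F_x) * mean_v + second_v  # E[(y-F(x)-v)^2]
    b_new = np.sqrt(np.mean(resid2 / F_x ** 2))
    return a_new, b_new
```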

Resulting Algorithm
The summary of the two nested algorithms can be seen in Algorithm 1. Here, both the E-steps and the M-steps are run not for one iteration but for several. In particular, the analytical M-step is cheap, and therefore it is intuitive to make use of this. For the E-step we usually take 10 optimizer steps to perform the posterior updates. The initialization of a, b is done in such a way that we approximate the posterior distribution "from above". This is important so that the observed measurements are included in the distribution $P_{Y_\theta}$, which is similar to the logdet schedule proposed in [63]. It can indeed be shown that making the logdet term larger corresponds to scaling the noise higher for additive Gaussian noise, which makes the estimated distributions broader and therefore prevents mode missing, although we found that this does not solve the problem completely.
Algorithm 1 EM algorithm for mixed noise estimation via conditional NFs
Input: $y_1, \ldots, y_N \in \mathbb{R}^n$, conditional normalizing flow $T_\phi$, initial estimates $a^{(0)}, b^{(0)}$, number of $x$ samples in total $K = 2000 \geq N$.
for $r = 0, 1, \ldots, R-1$ do
    E-step: perform $P$ optimizer steps on the forward (or reverse) KL loss for $T_\phi$ with noise parameters $(a^{(r)}, b^{(r)})$.
    M-step: draw in total $K$ posterior samples $x_i^k \sim T_\phi(y_i, \cdot)_\# P_Z$ and perform $L$ inner EM updates (11) to obtain $(a^{(r+1)}, b^{(r+1)})$.
end for
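In code, the nested scheme reads roughly as follows; this is a sketch only, reusing the helpers sketched in Section 3, and sample_prior, forward, y_obs, a0 and b0 are placeholders.

```python
import torch

a, b = a0, b0  # initialized "from above", i.e. larger than expected
for r in range(R):
    # E-step: P gradient steps on the forward KL at the current noise level
    for _ in range(P):
        x_j = sample_prior(K)                          # x_j ~ P_X, placeholder
        y_j = sample_mixed_noise(forward(x_j), a, b)   # joint samples
        forward_kl_step(x_j, y_j)
    # M-step: posterior samples for the observed y_i, then L inner EM updates
    with torch.no_grad():
        z = torch.randn(N, d)
        x_post, _ = flow(z, c=[y_obs], rev=True)       # x_i ~ T_phi(y_i, .)_# P_Z
        F_post = forward(x_post)
    for _ in range(L):
        a, b = inner_em_step(y_obs.numpy(), F_post.numpy(), a, b)
```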

Experiments
We benchmark our algorithm on two problems from nano-optics, the first one being low-dimensional and the second one harder and more recent. The first was introduced in [29,30], and the second one is part of a current research project. The goal is to learn both a reasonable posterior reconstruction and the error parameters a, b jointly. To showcase the advantages of making the models conditional, we also vary the number of measurements, expecting that more measurements lead to better reconstructions.
Generally, we use the PyTorch framework [54] and the FrEIA package for the implementation of the conditional normalizing flows [6]. The code is available on GitHub¹. We train our models using the Adam optimizer [38] and fix some hyperparameter choices across the experiments. In particular, we use a learning rate of 1e-3 and set P = 10, K = 2000, R = 5000 and L = 20 in Algorithm 1. The choice of K is in particular constant no matter how many measurements N are used. This allows us to compare whether the information of many measurements is beneficial for the estimation of a and b. However, these hyperparameters were not optimized in a grid search, and therefore it is likely that one can improve the performance. We generate synthetic measurements via surrogate forward operators with known noise levels a_true, b_true, similarly to [3, Section 3]. In both experiments, we take these surrogate neural networks as forward operators. The extension to real-world measurements and the relation to the true PDE inverse problem is left for future work. This setup allows us, given some noise parameters, to sample (x, y) data on the fly.
Then we apply our proposed algorithms to learn a and b as well as the posterior reconstructions. This enables us to compare the models and error parameters on two metrics. The metrics and models evaluated are summarized below.

Models evaluated
We evaluate two EM-based models. The first is the conditional version of the DeepGEM method [24], which we combine with our M-step; note that this amounts to using the reverse KL divergence in Algorithm 1. However, we propose to use the forward KL divergence instead, which we call conditional forward DeepGEM. To put the EM algorithms into perspective, we also implement a grid search over (a, b) and save the "best" model of this search, as sketched below. The grid search is still possible since we are searching over a two-dimensional space, but it quickly becomes infeasible for higher-dimensional noise models. We call this model grid conditional NF and also evaluate its forward and reverse KL versions.
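The grid baseline can be sketched as follows; train_cnf and elbo_estimate are hypothetical helpers standing in for the conditional NF training and the ELBO validation described below, and the grid ranges and step count anticipate those of the photomask experiment.

```python
import numpy as np

best_score, best_ab = -np.inf, None
for a in np.linspace(0.001, 0.03, 8):          # 8 x 8 grid over (a, b)
    for b in np.linspace(0.01, 0.2, 8):
        flow_ab = train_cnf(a, b, steps=1200)  # fixed (a, b) per grid point
        score = elbo_estimate(flow_ab, a, b)   # validate via the ELBO
        if score > best_score:
            best_score, best_ab = score, (a, b)
```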

Metrics
We are going to benchmark the models using the following two metrics.
• Distance to true a and b: We consider synthetic data, where the a and b used to generate the observations are known. This metric is given by the distances $|\hat a - a_{\mathrm{true}}|$ and $|\hat b - b_{\mathrm{true}}|$. However, this is an imperfect metric, as there can be other combinations of a and b which explain the observations equally well. Still, we hope that for sufficiently many observations we converge to the true a and b.
• ELBO: From Lemma 1 we see that maximizing the ELBO $\mathcal{F}$ amounts to minimizing the KL distance to the true posteriors as well as maximizing the likelihoods of the observations under the estimated error parameters a, b. This is a good proxy, as both the likelihood of the observations and the KL distance to the posteriors are intractable in high dimensions; a Monte Carlo sketch of this estimate follows after this list.
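The sketch below estimates the ELBO by sampling, assuming the FrEIA flow and the mixed_noise_loglik helper sketched in Section 3, a uniform prior on $[-1,1]^d$ and a placeholder forward operator.

```python
import math
import torch

def elbo_estimate(flow, y, a, b, S=2000, d=3):
    """Monte Carlo estimate of F(q, theta | y) for one measurement y of shape
    (1, n), with q = T_phi(y, .)_# P_Z and a uniform prior on [-1, 1]^d."""
    z = torch.randn(S, d)
    y_rep = y.expand(S, -1)
    x, log_det = flow(z, c=[y_rep], rev=True)          # x = T_phi(y, z)
    log_q = (-0.5 * (z ** 2).sum(dim=1)
             - 0.5 * d * math.log(2 * math.pi) - log_det)
    log_prior = torch.full((S,), -1e10)                # zero mass outside prior
    log_prior[(x.abs() <= 1).all(dim=1)] = -d * math.log(2.0)
    log_lik = mixed_noise_loglik(y_rep, forward(x), a, b)  # forward: placeholder
    return (log_lik + log_prior - log_q).mean().item()
```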
The above discussion also yields a suitable model selection criterion. We train all models for the same number of steps, but validate after every EM step according to the ELBO on the measurements. We then load the best model for every run and evaluate our metrics.

Scatterometry
For chip manufacturing, the control of nanopatterns in the lithography process is essential, and non-destructive measurement methods with high throughput are desirable. In addition to standard scanning electron microscopy (low throughput, destructive), scatterometry is gaining importance. Scatterometry is a non-destructive optical measurement technique for assessing the critical dimensions (CDs) of periodic nanostructures in lithography [37]. In this measuring method, nanostructured periodic surfaces are illuminated with light and diffraction patterns are detected. From these patterns, geometry parameters are reconstructed by solving an inverse problem. In terms of Eq. (1), observations are given by the diffraction patterns, the forward operator is determined by the time-harmonic Maxwell's equations, and the noise stems from the instrument as well as from the model error.
In the following, we consider two examples to demonstrate the performance of the developed algorithm for applications in the nanometrology of chip production. The first example considers a typical photomask for extreme ultraviolet (EUV) light and the second a line grating.

Photomask
The EUV photomask considered here consists of periodic absorber lines, capping layers and a multilayer stack functioning as a mirror for light of 13.4 nm wavelength (EUV range). Key geometry parameters include the line width, the height and the angle of the sidewall (3 parameters). The diffraction patterns comprise 23 intensities (maxima of the diffraction orders), and the measurement/model noise is assumed to be distributed according to our mixed noise model.
The problem has x-dimension 3 and y-dimension 23 and is therefore well-suited for first experiments. Furthermore, by [30] it is known that the posterior is indeed multimodal. The prior is chosen uniformly on [−1, 1] and its density is approximated as in [26] for the reverse KL. For this example we train both the conditional DeepGEM and the conditional forward DeepGEM using data from the finite element method (FEM) based forward model [30], which is approximated by a surrogate neural network. The forward DeepGEM is a bit quicker to train. The true a and b used to generate simulated signals of the instrument were set to 0.005 and 0.1, respectively. We now benchmark the four methods: the conditional DeepGEM with forward and reverse KL as well as the grid conditional normalizing flows (gridCNF) with forward and reverse KL. For the grids we chose an equispaced grid with 8 points each for a and b, ranging from 0.001 to 0.03 for a and from 0.01 to 0.2 for b. Concerning training time, the forward and reverse conditional DeepGEM were similar with 9 minutes per run. The grid versions took approximately 13 minutes to train, where we took 1200 optimizer steps per grid point. Generally, we can see that the grid methods are outperformed by our EM versions, even though they take more time. From Table 1 and Fig. 1 we can see that the forward KL and the reverse KL have similar performance in terms of distance to the true a and b, where the forward KL seems to have a slight edge in the case of many measurements. However, in terms of ELBO, we observe in Table 2 that the forward KL performs favorably. This is somewhat remarkable, as the reverse KL is the ELBO objective when ignoring the parameters independent of the flow. Considering posterior measures obtained from simulated measurements, we notice that the reverse KL does not exactly reproduce the modes in some of the examples, see Fig. 2, whereas the forward KL performs quite well. The inability of the reverse KL to detect the correct modes of the posterior can indeed explain the better performance of the forward conditional DeepGEM. Both algorithms improve with more measurements, see Fig. 1a and 1b.

Line grating with oxide layer
The second example involves a periodic line grating consisting of a silicon bulk and an oxide layer on top. Similar samples were investigated, e.g., in [43].
In addition to the geometry parameters used in the previous example, the optical constants (OCs) of the materials are assumed to be not accurately known. In practice this is often the case if the material composition was changed due to oxidation and contamination of the sample. So for each material there are two parameters for the complex refractive index (real and imaginary part) [31], which depend on the material density. Hence the dimension of the parameter vector x increases accordingly. For the simulations, the forward model was solved with the software package JCMsuite², based on the FEM, which solves a boundary value problem derived from Maxwell's equations [21]. In order to get a strong response for the OC of the oxide layer, we used a wavelength of 12.99 nm, right before the absorption edge [4,31]. For this work we standardized the data from the forward simulation [11] to [0, 1] and chose a uniform prior for the x-data. Again, we plot two example posterior distributions. The distribution shapes seen in Fig. 4 clearly reflect the sensitivity of the forward operator to the line height (parameter 0) and the silicon oxide density (parameter 4), and its insensitivity to the layer roughness (parameter 6). The true a and b were set to 0.03 and 0.25, respectively. As in the first scatterometry example, we can see in Tables 3 and 4 that the forward KL performs a bit better in terms of distance to the true a and b as well as ELBO. Similarly, one can observe that the first x-component, the height, can be multimodal, where the reverse KL can indeed miss a mode. This can be observed in Fig. 4.
Similarly, the distance to the true a and b decreases when adding more simulated measurements, which can be seen in Fig. 3.

Conclusions and Limitations
We developed nested EM algorithms: one for estimating the posterior distribution via a conditional NF, and a second one to solve the M-step within the former EM algorithm in order to estimate the error model parameters. For the special kind of non-additive noise appearing in our applications, we derived analytic formulas for the inner E- and M-steps. We showed the advantages of using the forward KL for modelling multimodal distributions; the reverse KL often led to mode collapse. However, there is a plethora of literature tackling this issue of the reverse KL, namely [5,45,47,64]. It would be interesting to compare these approaches to the forward KL. Moreover, one could replace the conditional normalizing flow by other methods for posterior sampling, such as score-based diffusion models [35,60], conditional GANs [49] or posterior MMD flows [25]. Furthermore, we chose synthetic a_true and b_true. One of the next steps is to test these approaches on real-world measurements. Even though the novel algorithm was applied to two specific real-world experiments, it may have an impact on a wide range of applications where indirect measurements are involved. The extension of the algorithm to noise distributions other than Gaussian is analogous. An advantage over standard approaches like Markov chain Monte Carlo methods is the fact that, once the network has been trained, further similar measurements can be evaluated very quickly. This benefit opens up the possibility of using scatterometry and similar measurement techniques for real-time applications, e.g., for process control.
In terms of limitations, it would also be interesting to test the algorithm on other inverse problems. Intuitively, we believe that the scatterometric inverse problem is particularly well-suited for these estimations, since the observed F(X)-data lives on a low-dimensional manifold in a nominally high-dimensional space. One can easily think of an inverse problem where recovering the noise parameters is much harder because the observed data already fills the full space.

A Derivation of the inner M-step
For simplicity of notation, we use the abbreviation

$$Q_i := P_{V_{\theta^{(r)}} \mid (X, Y_{\theta^{(r)}}) = (x_i, y_i)}.$$

Using the decomposition (6) of the ELBO with hidden variable $V_\theta$, and noting that the second summand within (6) does not depend on the parameters $\theta = (a, b)$, we obtain that the optimization problem (4) is equivalent to

$$\min_{a, b}\; -\sum_{i=1}^N \mathbb{E}_{v \sim Q_i}\big[\log\big(p_{Y_\theta|(X, V_\theta)=(x_i, v)}(y_i)\big) + \log\big(p_{V_\theta}(v)\big)\big].$$

Now, the objective function reads as $A_1(a) + A_2(b)$, where (leaving out constants with respect to a or b)

$$A_1(a) = \sum_{i=1}^N \sum_{j=1}^n \Big(\log(a) + \frac{\mathbb{E}_{v \sim Q_i}[v_j^2]}{2a^2}\Big), \qquad A_2(b) = \sum_{i=1}^N \sum_{j=1}^n \Big(\log\big(b\,|F_j(x_i)|\big) + \frac{\mathbb{E}_{v \sim Q_i}\big[(y_{ij} - F_j(x_i) - v_j)^2\big]}{2 b^2 F_j(x_i)^2}\Big).$$

Now, by (10), the expressions $\mathbb{E}_{v \sim Q_i}[v_j]$ and $\mathbb{E}_{v \sim Q_i}[v_j^2]$ are the first and second moments of certain normal distributions, such that

$$\mathbb{E}_{v \sim Q_i}[v_j] = \frac{(a^{(r)})^2 \big(y_{ij} - F_j(x_i)\big)}{(a^{(r)})^2 + (b^{(r)})^2 F_j(x_i)^2}$$

and

$$\mathbb{E}_{v \sim Q_i}[v_j^2] = \Big(\frac{(a^{(r)})^2 \big(y_{ij} - F_j(x_i)\big)}{(a^{(r)})^2 + (b^{(r)})^2 F_j(x_i)^2}\Big)^2 + \frac{(a^{(r)})^2 (b^{(r)})^2 F_j(x_i)^2}{(a^{(r)})^2 + (b^{(r)})^2 F_j(x_i)^2}.$$
Putting everything together, we obtain the update rules (11).

B Sensitivity analysis for the line grating

The sensitivity of the forward model with respect to the individual parameters can be quantified by Sobol' indices [62,61]. Making an approximation of the forward model in a polynomial basis using polynomial chaos (PC) [65,21] makes it very easy to calculate the Sobol' indices. The indices in Fig. 6 come from a PC approximation with a relative $L^2$-error of about 0.076 and show the dependence on each single parameter. It is clearly seen that the height of the grating line and the density of the oxide layer, and hence the OC of silicon oxide, have a huge impact on the forward model. In general, the sensitivity analysis fits very well to the reconstruction of the line grating, since in Fig. 4 the distributions for parameters 0 and 4 are very sharply defined, while those which are broadly distributed also show a low impact on the forward model. For the sensitivity analysis, the open source software tool PyThia was used [28].

Figure 1: Distance to the hyperparameters (a, b) for forward and reverse KL conditional DeepGEM.
Figure 2: Posterior reconstructions for different measurements using forward/reverse conditional DeepGEM, via one-dimensional histograms on the diagonal and two-dimensional histograms on the off-diagonal. The ground truth x is depicted by the blue line. (a) Posterior reconstruction for forward KL conditional DeepGEM for one simulated measurement. (b) Posterior reconstruction for reverse KL conditional DeepGEM for one simulated measurement. (c) Posterior reconstruction for forward KL conditional DeepGEM for another simulated measurement. (d) Posterior reconstruction for reverse KL conditional DeepGEM for another simulated measurement.

Figure 3: Distance to the hyperparameters (a, b) for forward and reverse KL conditional DeepGEM.

Figure 4: (a) Posterior reconstruction for forward KL conditional DeepGEM for one simulated measurement. (b) Posterior reconstruction for reverse KL conditional DeepGEM for one simulated measurement. (c) Posterior reconstruction for forward KL conditional DeepGEM for another simulated measurement. (d) Posterior reconstruction for reverse KL conditional DeepGEM for another simulated measurement.

Figure 5: Convergence plots for a and b, where we save every 20 EM steps.

Figure 6: Bar plot of Sobol' indices for the PC expansion of the forward model.

Table 1: Distance of estimated a and b to the true ones over 10 runs.

Table 2: ELBO of the algorithms over 10 runs, calculated for the measurements based on 2000 samples.

Table 3: Distance of estimated a and b to the true ones over 5 runs.

Table 4: ELBO of the algorithms over 5 runs, calculated for the measurements based on 10000 samples.