Bayesian inverse problems with Monte Carlo forward models

. The full application of Bayesian inference to inverse problems requires exploration of a posterior distribution that typically does not possess a standard form. In this context, Markov chain Monte Carlo (MCMC) methods are often used. These methods require many evaluations of a computationally intensive forward model to produce the equivalent of one independent sample from the posterior. We consider applications in which approximate forward models at multiple resolution levels are available, each endowed with a probabilistic error estimate. These situations occur, for example, when the forward model involves Monte Carlo integration. We present a novel MCMC method called MC 3 that uses low-resolution forward models to approximate draws from a posterior distribution built with the high-resolution forward model. The acceptance ratio is estimated with some statistical error; then a conﬁdence interval for the true acceptance ratio is found, and acceptance is performed correctly with some conﬁdence. The high-resolution models are rarely run and a signiﬁcant speed up is achieved. Our multiple-resolution forward models themselves are built around a new importance sampling scheme that allows Monte Carlo forward models to be used eﬃciently in inverse problems. The method is used to solve an inverse transport problem that ﬁnds applications in atmospheric remote sensing. We present a path-recycling methodology to eﬃciently vary parameters in the transport equation. The forward transport equation is solved by a Monte Carlo method that is amenable to the use of MC 3 to solve the inverse transport problem using a Bayesian formalism.


(Communicated by Jari Kaipio)
Abstract. The full application of Bayesian inference to inverse problems requires exploration of a posterior distribution that typically does not possess a standard form. In this context, Markov chain Monte Carlo (MCMC) methods are often used. These methods require many evaluations of a computationally intensive forward model to produce the equivalent of one independent sample from the posterior. We consider applications in which approximate forward models at multiple resolution levels are available, each endowed with a probabilistic error estimate. These situations occur, for example, when the forward model involves Monte Carlo integration. We present a novel MCMC method called M C 3 that uses low-resolution forward models to approximate draws from a posterior distribution built with the high-resolution forward model. The acceptance ratio is estimated with some statistical error; then a confidence interval for the true acceptance ratio is found, and acceptance is performed correctly with some confidence. The high-resolution models are rarely run and a significant speed up is achieved.
Our multiple-resolution forward models themselves are built around a new importance sampling scheme that allows Monte Carlo forward models to be used efficiently in inverse problems. The method is used to solve an inverse transport problem that finds applications in atmospheric remote sensing. We present a path-recycling methodology to efficiently vary parameters in the transport equation. The forward transport equation is solved by a Monte Carlo method that is amenable to the use of M C 3 to solve the inverse transport problem using a Bayesian formalism.
1. Introduction. The Bayesian methodology has proven to be a convenient framework to solve inverse problems from available data with limited information. Sampling the posterior distribution is the major computational difficulty associated with Bayesian inversion. This distribution is often high dimensional and does not possess a standard form in most applications of interest. Several authors have found ways to use low-resolution forward models to speed MCMC simulation. In particular, we note the "two-level" chains developed in [14,7,5,15]. Two-level chains use low-and high-resolution forward models in a Metropolis-Hastings scheme and may be summarized as follows: Proposals to sample the posterior distribution are generated as usual, and then pre-accepted using an acceptance ratio built with a low-resolution model. If pre-accepted, they are then accepted or rejected using an acceptance ratio that measures a misfit between low and high resolution models. As a result, the high-resolution forward model is rarely run-and when it is, acceptance is almost assured.
Using these two-level chains as a starting point, we introduce a multi-level MCMC scheme adapted to Monte Carlo forward models, called M C 3 . Since our forward model output is itself a Monte Carlo estimate (hence the name (M C) 3 = M C × M CM C), we can control resolution by adjusting the number of random samples drawn. This gives rise to multiple resolution levels and corresponding forward models F 1 , F 2 , . . ., converging to F ∞ . Since the Monte Carlo estimate is a sum of independent random variables, the error is well approximated by a Gaussian random variable and we can obtain a posteriori estimates of its variance. Thus, when we utilize a low-resolution model we are able to accurately calibrate its effect on the posterior at that level. These finite-resolution posteriors π i (6) are used in the first stage of a two-level scheme as described above. For the second stage we do not fix an upper resolution level. Instead, we progressively increase the resolution j until the second-level acceptance probability β ij has a desired level of accuracy. The desired "accuracy" corresponds to making the correct decision with confidence λ ∈ (0, 1). By "correct decision" we mean the acceptance/rejection choice corresponding to an evaluation of β i∞ , where β i∞ uses the exact forward model F ∞ . This modification necessarily introduces some distortion of the resultant posterior. We obtain a stationary distribution π λ for an idealized version of our chain and characterize its deviation from π ∞ in Theorem 2.1 below. This deviation can be made arbitrarily small and depends on λ and the initial resolution level. Very little distortion is present if (i) λ is close to 1 independent of the accuracy at the initial resolution level; or (ii) if the initial resolution level is chosen to be reasonably accurate.
Since sampling of the posterior distribution requires a large number of forward solves (at different resolutions in the M C 3 scheme) for different values of the inversion parameters, it is important to reduce the computational cost of these forward solves as well. In applications to remote sensing, one of the main reasons Monte Carlo forward models are slow is that they are measuring the probability of a rare event (a photon from the sun reaching a small detector). This rare event sampling can be accelerated using an importance sampling scheme similar to perturbation Monte Carlo developed for medical imaging [10,9,4]. This path recycling scheme sends many paths once through a reference atmosphere and then stores only those that hit the detector. Then, to simulate detector hits in another atmosphere an importance sampling scheme is used whereby the original paths are re-weighed to account for changes in atmospheric absorption/scattering (provided these changes result in an equivalent measure, as is the case in photon propagation modeled by a transport equation). We differ from standard perturbation Monte Carlo schemes in that our path recycling scheme also provides multiple-resolution forward models that are used to speed up the sampling of the posterior distribution. To that end, our finite-resolution forward model F j recycles a j−dependent number H j of all available paths, but uses information from many more stored paths H max in order to reduce variance (see (24)). Although our paper focuses on the M C 3 algorithm, we obtain a significant variance reduction using our path-recycling scheme, which appears to be novel. The theoretical variance reduction results are summarized in Theorems 3.2 and 3.3.
Both the M C 3 and path-recycling methods were motivated by an inverse problem in atmospheric imaging. Here we are presented with a passive remote sensing problem involving the identification of an absorbing plume (e.g., a pollution cloud) using detected sunlight. The detector is spatially small and measures incoming light from fifteen different angles. Our atmosphere and plume are parameterized with five unknown parameters (plume absorption cross-section α, plume center (x p , z p ), plume radius ρ, and atmospheric constant c). Since the measurements are sparse, introducing prior information in the Bayesian setting is useful. Since the number of parameters to recover is small, full exploration of the posterior distribution is both feasible and desirable. Unfortunately, because the detector is small, most Monte Carlo forward paths do not reach it. Importance sampling schemes can increase the probability of a detector hit [2], but we instead use the path recycling scheme described above, as it is better suited to the multi-resolution inverse problem. On its own, path recycling reduces forward model run-time by a factor of thousands, but this is still too slow for posterior exploration. We therefore use path-recycling in conjunction with M C 3 and present results in Section 4. We stress again that because of the enormous computational cost involved in transport simulations and the Bayesian framework, we would not be able to produce such results on desktop architectures without the huge variance reductions afforded by path-recycling and M C 3 .
While we will demonstrate the M C 3 scheme on an atmospheric imaging problem, we note that many other applications use forward model predictions that are Monte Carlo estimates. These include more general problems of radiation transport, molecular dynamics simulations, floating random walk methods for capacitance extraction in electronic design, determination of pricing measures in finance, and DSMC (Direct simulation Monte Carlo) methods in fluid flow.
In Section 2 we describe a MCMC scheme for Bayesian inverse problems where the forward model is available at multiple resolution levels and endowed with tight (probabilistic) error estimates. In Section 3 we develop our path-recycling forward model. In Section 4 we combine techniques from Sections 2 and 3 to solve a fivedimensional Bayesian inverse problem.

2.
Multi-level metropolis-hastings. Here we describe Metropolis-Hastings schemes that use multiple-resolution forward models (along with accurate error estimates) to significantly speed up a Metropolis Hastings scheme. In particular, M C 3 is presented as an extension of the two-level schemes developed in [7,5,15]. The novelty of M C 3 is evaluation of an acceptance ratio up to some confidence (the λ−acceptance method ); this aspect of M C 3 may be used with standard Metropolis or any scheme where an acceptance ratio must be evaluated. We refer the reader to [11] for an introduction to Bayesian inverse problems.
2.1. Basic setup. The Bayesian viewpoint models the unknown quantities as random variables. Our unknown is a random vector x ∈ Γ ⊂ R n with prior probability density f prior (x). This is the distribution we assume (from prior knowledge) on x before any data are collected.
We assume our data d ∈ R m are given by an infinite-resolution forward model F ∞ : R n → R m plus an additive independent (Gaussian) noise term E.
The methods presented here are independent of the choice of prior density. The additive Gaussian noise model will simplify the algebra involved in the λ−acceptance evaluation, but is not necessary either.
Our infinite resolution posterior is thus where for vectors v ∈ R m and matrices A ∈ R m×m we define v 2 A := v T Av. We do not have access to F ∞ , but instead a sequence of approximate models F 1 , F 2 , · · · . In our framework, the approximate models are an unbiased sum of i.i.d. random variables, and so we are justified using a Gaussian error model We assume that Var {F j (x)} =: Σ j (x) can be estimated accurately. This is the case when the forward model is solved by Monte Carlo.
Equation (3) leads to an enhanced noise model (see e.g. [1,12], or [11] as well as [16] for more discussion of model error and inverse problems) at resolution level j: and a likelihood at resolution level j, Instead of one posterior, we have a sequence of finite resolution posteriors Assume that for a.e. fixed x ∈ supp(f prior ), F j (x) → F ∞ (x) and Var {F j (x)} → 0. Then, as a consequence of dominated convergence, we can approximate expectations against π ∞ by using π j in the sense that for all f such that |f (x)|π(x) dx < ∞, From here on we omit the explicit conditioning on d and simply write π j (x), j = 1, 2, . . . , ∞.

The algorithms.
Here we present three MCMC chains that take advantage of our multiple resolution models F j and the error estimates.
2.2.1. Metropolis Hastings at resolution level j. As a starting point, we first present the standard Metropolis-Hastings algorithm (algorithm 1) [17], using our finite resolution posterior π j from (6).

Algorithm 1 Metropolis-Hastings at resolution level j
3: With probability α, accept and set x k+1 = y. Otherwise set x k+1 = x k Note that the transition kernel associated with standard Metropolis-Hastings is where δ x (·) is the Dirac measure on R n concentrated at x.

2.2.2.
Two-level Metropolis-Hastings. The two-level chain (algorithm 2, as developed in [7,5,15]) is as follows: Algorithm 2 Two-Level Metropolis Hastings at resolution levels-ij 3: With probability α i (x k , y), pre-accept y. Otherwise set y ← x k . 4: The second-level proposal is now y, effectively drawn from Set β ij (x k , y) : = min 1, q i (x k | y)π j (y) q i (y | x k )π j (x k ) .
Note that there is no need to compute the integral defining q i since The equality of the two minimizers above is shown in (16). One can show that under reasonable assumptions on the proposal q, π j (·) is the stationary distribution of the two-level chain (see [7,5]).
Remark 1. The two-level chain (algorithm 2) can be viewed in two ways: (a) If we are "pre-rejected" (y ← x k ) in line 3 of algorithm 2, then we are guaranteed β ij = 1 and we need not evaluate β ij (or F j ). In this sense the preacceptance stage filters out poor draws. This in turn allows one to make more "courageous" proposals (e.g., from the prior) and not waste time evaluating the high-resolution forward model on them.
(b) In algorithm 1, the "ideal" proposal is π j . Since π i ≈ π j we can also view the two-level chain as a low-resolution chain producing high-quality proposals for a high-resolution chain.

2.2.3.
The multi-stage algorithm M C 3 . The M C 3 chain is now introduced. We will re-use α i from algorithm 2 and replace β ij with β i∞ (x k , y) : = min 1, The equality of the two minimizers above is similar to the case of β ij , which is shown in (16).
If we used β i∞ in place of β ij in the two-level chain (or in place of α i in the onelevel chain) then the invariant distribution would be the infinite resolution posterior π ∞ (·). Since in practice we do not have access to π ∞ (·), we instead use an asymptotic formula to approximate draws from π ∞ using a level of confidence λ ∈ (0, 1). Roughly speaking, the ratio π ∞ (y)/π ∞ (x k ) in β i∞ is replaced by π j (y)/π j (x k ) plus a Gaussian error term. See section 2.3 for more detail.
The two-stage algorithm 2 performs its second-level acceptance step as follows: Step Step 2: If u < β ij accept, otherwise reject.
The multi-stage algorithm M C 3 replaces step 2 with the λ-acceptance step: Step 2a: Using data generated at resolution level j, determine if, with confidence λ, we can say β i∞ > u or β i∞ < u Step 2b: If β i∞ > u with confidence λ, then accept. If β i∞ < u with confidence λ, then reject. If our confidence in β i∞ versus u is less than λ, then increase resolution j and goto step 2a.
Algorithm 3 M C 3 at initial level i, final level j max , and confidence λ The second-level proposal is now y, effectively drawn from This algorithm requires the definition of intervals of confidence parameterized by the confidence level λ ∈ (0, 1). Assuming that the Monte Carlo forward simulations are sufficiently accurate, then the statistical errors are accurately described by a Gaussian approximation as an application of the central limit theorem. The resulting intervals of confidence are analyzed in the following section, which is an intrinsic part of the definition of M C 3 . More precisely, the notion of inequalities with confidence λ at j th level is described in (14) below.
2.3. The asymptotic confidence interval. We use the delta method to derive an asymptotic confidence interval for the acceptance test used in algorithm 3. Note that in our numerical results we will have the following isotropic model for the likelihood function: Since this simplifies the presentation in this section significantly, we only consider this special case here. Generalizations to noniid (or even non-Gaussian) additive error are straightforward but not considered for concreteness.
We first reduce the test β i∞ > u to a form more amenable to an interval test. First recall, so that Therefore, our "ideal" acceptance criteria can be written Second, we derive an asymptotic relation between ϕ(X,Ȳ ) and data available at resolution level j. Defining we have This approximation is accurate for small |X − X j | and |Ȳ − Y j |, which is the case so long as Note that we define the covariance and variance of random vectors X, Y by We therefore approximate This argument may be made rigorous, see e.g., [3] for a proof of the delta method in one dimension. Note that these estimates do hold asymptotically, but are biased. This bias could affect the resultant sampling distribution (see section 2.4.2). Assuming (12), (9) becomes Third, we use (12) to derive a confidence interval around a random variable ϕ(X j , Y j ) that contains the mean E {ϕ(X j , Y j )} with probability λ ∈ (0, 1). We do not quite know µ j in (12) and must content ourselves with an estimateμ j , which can be easily computed using the random variables defining X j , Y j and is therefore obtained at resolution level j; see the description in section 3.3.2 for the inverse transport problem. Briefly, F j (x), F j (y) (and hence X j , Y j ) are sums of i.i.d. random variables, so we use the empirical covariance matrix as an estimate of Cov (X j , Y j ). Then we use X j , Y j in place ofX,Ȳ in ∇ϕ(X,Ȳ ). This is put together to giveμ 2 j . With z λ > 0 such that P[−z λ < N (0, 1) < z λ ] = λ, we define our confidence interval We say that with confidence λ, we have ϕ(X,Ȳ ) ∈ C λ . See [3] for an introduction to confidence intervals.
Finally, to use this confidence interval in the second stage of algorithm 3, we start from (13) and then (with , denoting >, < with confidence λ at the j th level, i.e., for the j−dependent confidence interval C λ ): The above formulas summarize the multi-stage algorithm M C 3 . At a given level j, we either accept or reject according to the above probabilities, and go to the next level j + 1 otherwise.

Chain analysis.
The two-level and multi-level chains do not have as limiting distribution π ∞ (this would require access to F ∞ ). Instead, the two-level chain limits to π j (a broader version of π ∞ defined explicitly in (6)) and M C 3 has π λ (a distorted version of π ∞ defined implicitly in Theorem 2.1).
2.4.1. Two-level chain analysis. For the two-level chain, following [7,5] we have that (for x k = y) As a result, we have The transition kernel of algorithm 2 is thus It satisfies the detailed balance equation Under standard assumptions, π j may be shown to be the limiting distribution [7,5].
2.4.2. Approximate M C 3 chain analysis. We present here an approximation of the M C 3 chain for which we are able to obtain an invariant distribution.
To construct the approximation, first assume that every time the model F j is used, a new set of i.i.d. random variables are drawn to compute F j (x). So for example, if we visit a particular x twice (so that F j (x) is computed twice), then the two computations are independent. This assumption is not met by our forward model described in section 3. There, the same set of paths is recycled for all x.
As a second assumption, we idealize the implementation of the accept/reject conditions in (14). Our idealization assumes that we accept/reject once we know, with confidence exactly equal to λ, that β i∞ < u or β i∞ > u. While this is a correct interpretation of a confidence interval, this does not strictly hold (in our M C 3 implementation) for two reasons: (a) The normal approximation (delta method) is only asymptotically correct.
(b) We consider the λ−acceptance test (14) for multiple lower resolution levels before finally resolving the question at a high enough level. In detail: we increase j until we can evaluate (9) with confidence greater than or equal to λ. We always initially take j = i, and most of the time this results in a confidence interval C κ for some κ j > 0. So long as λ > κ j , we keep increasing resolution, until at some point κ j > λ and we stop. Given these two assumptions, M C 3 becomes an algorithm (the λ-approximate algorithm) that acts like two-level MCMC but makes the wrong decision at every step with probability λ. We present this as algorithm 4 and analyze the error made on the invariant measure.
We are not able to find the invariant measure associated to M C 3 (or even prove that one exists). The results of this section should then be interpreted as advisory. Indeed, we use them to set the levels i and λ later in section 4, but conclude that our method works for the problem at hand only after obtaining numerical results.
Note that algorithm 4 makes use of the functions q i , β i∞ from algorithm 3.
Algorithm 4 λ-approximate algorithm Using algorithm 4 we accept y (recall that we could have y = x at this point) with probability If we reject y, it must then be done with probability 1 − R λ (x). So algorithm 4 has transition kernel We have the following theorem, whose proof we relegate to the appendix.
Theorem 2.1. If λ ∈ (0, 1] the transition kernel K λ associated to algorithm 4 has invariant density If we assume further that E + := {x : π ∞ (x)} is open and connected, and for every bounded G ⊂ E + there exist constants C j , C q , D j , δ such that for x, y ∈ G with |x − y| < δ and j = 1, 2, . . .
and with · T V the total variation distance (see (17)), µ an arbitrary initial distribution, K n λ (x, ·) the measure for n iterations of the transition kernel K λ , and Π λ the distribution associated with π λ Remark 2. Some heuristics are evident: (a) When λ is smaller, m λ (x) is more dependent on r i (x), hence will vary more with x, and hence more distortion is present. (b) If one wishes to maintain a fixed amount of distortion (measured in some metric), i can be decreased only if λ is simultaneously increased.

Distortion example.
To investigate the distortion of the measure and verify the heuristics in remark 2 we compute explicitly the distortion in a simple onedimensional case. The target distribution and the proposal are both Gaussian.
. This roughly corresponds to a forward+noise model for some M 1 so that the prior has negligible impact on the posterior. To approximate (6) we give π i broadened likelihood with variance σ 2 E + σ 2 i . In other words, Using this model, we can numerically compute m λ and then the distance between the distributions Π ∞ , Π λ (corresponding to π ∞ , π λ ) can be measured with the total variation distance: Inverse Problems and Imaging The first equality is a definition and the second can be seen by integrating In Figure 1 we visualize this simple experiment. We set σ E = 1, σ q = 3, and sweep λ and σ i . We see that maximum distortion occurs indeed for small λ and large σ i . While decreasing σ q (still holding σ 2 E < σ 2 E + σ 2 i < σ 2 q ) increases error, the same contour-line shape remains.
2.5. Numerical comparison of chains. We use the one-level, two-level, and M C 3 chains with a variety of parameter values to generate samples. The posterior distribution of interest fits the framework in section 2.1; it is in fact the atmospheric imaging problem alluded to in the introduction and described in section 4. Also see section 4 for a plot of the posterior. The purpose of the present section, however, is to compare chain performance for different choices of initial resolution level i, final level j max , and confidence λ.
We compute the autocorrelation time, for each one-dimensional marginal of samples (call this AT , = 1, . . . , n). The autocorrelation time for a stationary sequence X k ∞ k=0 is defined as

It follows then that
Asymptotically, N/AT correlated samples have the same variance-reducing power as the N uncorrelated samples. See the discussion of effective sample size in [17]. Note that calculating autocorrelation time is non-trivial, and straightforward methods may have non-vanishing variance in the limit N → ∞. For this reason, we use the initial positive sequence estimator described in section 3.3 of [8].
In all cases we used 20 different resolution levels, 0 through 19, with level j + 1 recycling roughly twice as many random variables as level j. Runtime is almost linearly proportional to the number of recycled random variables. We therefore let the total number of random variables recycled in a simulation run serve as a proxy for forward-model-time. Our performance metric is fwd-model-time/1000-effective-samples : = #Recycled-random-variables Simple estimates of the error incurred by using a finite resolution posterior at level j are shown in Table 1. We consider j = 6 a very low resolution and j = 15 a moderate resolution. We compared various schemes with various parameter choices. One-level results were obtained at level j = 6; two-level results were obtained with i = 6 and j = 15; while M C 3 results were obtained with i = 6 and j max = 19. See Table 2 and also Figure 2 for a performance comparison. We chose i = 6 since this resulted in the lowest fwd-model-time/1000-effective-samples for two-level and M C 3 . At this level, a sufficient number of paths interact with the plume to satisfy (27). In other words, we trust the central limit approximation involved in our λ−acceptance step. Table  1 indicates that we are around the σ i = 2.0 line in Figure 1, and therefore λ ≈ 0.90 should give small distortion. Increasing i results would be a safer choice, but this increases forward model time. When moderate resolution is desired (j, j max = 15), M C 3 is about twice as fast at producing uncorrelated random variables as the twolevel scheme. When high resolution is desired, the improvement increases to nine times. Figure 3 shows how forward model time may be reduced by decreasing λ.
The dynamics of the M C 3 chain are partially explained by Figure 3. Here we see that lower values of λ result in fewer uses of high-resolution forward models. We also see that the high-resolution forward models, despite being used infrequently, still account for a significant fraction of forward model evaluation time.   3. Path recycling for Monte Carlo transport. Here we describe a new pathrecycling scheme that allows Monte Carlo forward models to be used efficiently in inverse problems. After the transport models of photon propagation are recalled, the path-recycling scheme is presented in the construction of the Monte Carlo estimator T n [γ] in (23) below, which measures the probability of a photon reaching a given detector in a given environment γ. This estimator is very similar to those used in perturbation Monte Carlo [10,9,4]. When very accurate information is available in a reference environment, a significantly more efficient (i.e., with much lower variance) unbiased estimator F j (γ) is then introduced in (24) below. We then combine the path recycling scheme for inverse transport calculations with the (M C) 3 algorithm and characterize our model errors and intervals of confidence.
3.1. Transport of photons in the atmosphere. We consider photon flux at a single wavelength (all velocities v ∈ S) in a connected domain R ⊂ R 2 with boundary ∂R. Denote by Γ ± the incoming/outgoing boundary flux. That is, witĥ n(r) the outgoing normal at r ∈ ∂R, Γ ± := {(r, v) ∈ ∂R × S : ±n(r) · v > 0}.
where S is the photon source and The functions θ, Θ, σ s , and σ = σ s + σ a account for scattering, absorption, and scattering+absorption in the domain. See [13]. We also assume the presence of a purely absorbing disk-shaped plume with center (x p , z p ), radius ρ. Away from this plume the absorption/scattering cross sections σ a , σ s are given by σ a (r) : = σ a,0 e −(c0+c)z , σ s (r) = σ s,0 e −(c0+c)z .
In the support of the plume σ a is modified by the addition of the constant α.
We group the unknown parameters into the vector γ = (α, x p , z p , ρ, c). In the name of realism many constraints must be invoked.
All other parameters/constants are known and the inverse problem involves the recovery of γ. In particular, the detector location and size are fixed.

Path measures and path recycling.
Here we describe the viewpoint that different atmospheric parameters correspond to different measures on the space of possible photon paths. Our forward model uses one fixed set of paths to compute measurements in many different atmospheres. This allows significant time savings since we need not re-simulate paths that miss the detector. Probabilistic error estimates are also obtained.
Collectively, the parameter γ describes a world in which we simulate photon travel. In determining unknown parameters we will have to simulate many different worlds. This path measure is induced by the Markov chain corresponding to the chosen atmospheric parameters. We give below a heuristic description of the Markov chain, and refer the reader to [13] for details. 1. A starting position and direction (x 0 , v 0 ) are drawn from the source probability distribution S. 2. The photon travels along the path x 0 +tv 0 , t > 0, interacting by point x 0 +tv 0 with probability 1−exp − t 0 σ(x 0 + sv 0 )ds . If the photon does not interact in the volume it will always interact at the surface. 3. At an interaction, the photon is either absorbed or scattered.
• At a volume interaction, the photon will be absorbed with probability σ a (r)/σ(r). If not, it will scatter into a new direction using the probability density θ(r, v 0 →v 1 ). • At a surface interaction, the photon will be absorbed with probability σ a (r)/σ(r). If not it will choose a new direction using the density Θ(r, v 0 → v 1 ). 4. This continues until the photon is absorbed. Exit from the domain is accounted for by making σ a = ∞, σ s = 0 outside of R. We note that σ a = ∞, σ s = 0 at the detector as well.
Note that under reasonable conditions the stopping time τ < ∞ and so we have a probability measure P γ . In the special case γ = γ 0 := (0, 0, 2, 0, c 0 ) (corresponding to no plume and nominal background) we have our reference measure P. This allows us to define a differential measure dP (and similarly dP γ ) and expectation E P {·} by where for A ⊂ Ω, the indicator function As an example of a subset of paths consider those that hit (and are necessarily absorbed in) the detector. Denote these by a disjoint union D = D 1 ∪ · · · ∪ D m , meaning that if ω ∈ D then the path ω ended up in the detector, and if ω ∈ D ν then ω hit the detector with incoming angle in the interval (ν−1)π m − π 2 , νπ m − π 2 . Let D : = D 1 × · · · × D m , 3.2.1. The multi-resolution forward models. We are now in position to describe our forward model and obtain probabilistic error estimates. This model uses importance sampling to compute P γ [D] from one fixed set of reference paths. This technique is an advancement upon schemes developed in the context of medical imaging (see e.g., [10,9,4]).
Choosing n ∈ N, we generate n paths {ω 1 , . . . , ω n }. Now for any random variable X, For example, we could generate paths from measure P γ , and then n −1 n It is very important to realize that since we only intend on estimating expectations involving detector hits (e.g., E γ {1 D X}), we only need store paths that hit the detector. The expected number of detector hits is exactly nP [D] n. For every new γ we could generate a new set of paths and repeat the above procedure. This would be costly since path generation involves complicated steps. Instead, consider fixing one set of reference paths {ω j } n j=1 (in practice storing only those that hit the detector) generated by the reference measure P and then set Computation of T n requires computing the Radon-Nikodym derivative for the ≈ nP[D] paths that hit the detector. This is significantly faster than sending n new paths. For details as to this calculation, see [13]. For the present work, it will suffice to assume the following Assumptions 3.1. Assume that for every γ ∈ Γ, the Radon-Nikodym derivative dPγ dP exists.
Note that this requires absolute continuity of the measures, and in particular P must allow scattering in the same directions (at the same points) as P γ .
Although fast, T n can be significantly improved by using (for relatively small n) information from a simulation that used very large n. This is where we depart from the previously mentioned perturbation Monte Carlo schemes.
We first generate n max paths using the reference measure P. Denote by H ν max the collection of paths ω k ∈ D ν . That is, be nested subsets of H ν max of (fixed, deterministic) size |H ν j |. Note that H ν j , H ν max consist of i.i.d. draws from P[· | D ν ]. Since |H ν max | = n −1 max nmax k=1 1 Dν (ω k ), we have Above Cov P (X, Y ) is defined as in (11) and the subscript P makes it clear that expectations are with respect to the measure P. Although {|H ν max |} m ν=1 are negatively correlated, so long as |H ν j | may be selected independently of H ν max , the sets H ν j are independent. We will always ensure this condition holds.
Notice that if P = P γ , then F ν j sums |H ν j | i.i.d. draws from P[· | D ν ], and each of them scores a hit |H ν max |/n max . In other words, up to the approximations P γ ≈ P, and |H ν max |/n max ≈ P[D ν ], F ν j (γ) sums |H ν j | random variables, each one recording the exact solution. Hence (up to these approximations) F ν j (γ) computes P[D ν ] with zero variance.
On the practical side, for j < j , H ν j ⊂ H ν j , and therefore computation of F j is quicker after computation of F j is done.
We collect this and other facts in a lemma Lemma 3.1.
The sets {H ν j }, are independent, and for all j, the set H ν j consists of |H ν j | i.i.d. draws from dP[· | D ν ] = P[D ν ] −1 1 Dν dP. Conditioning on |H ν max | does not change this: The next theorem shows that F j is unbiased. See the appendix A for the proof.
Remark 3. Since |H ν j | is deterministic, it follows that F ν j (γ) is a sum of |H ν j | unbiased random variables and so by the strong law of large numbers, for fixed x, F j (x) → F ∞ (x) a.s.. It is straightforward then to show that with probability one, F j (x) → F ∞ (x) for a.e. x. Equation (7) then follows.
The following theorem shows that in the limit dP γ → dP, and |H ν max | → ∞, the F ν j (γ) are uncorrelated zero-variance estimates of P γ [D ν ]. See appendix A for a proof. Theorem 3.3. As n max → ∞,

Remark 4. A similar calculation shows that
One can replace (in the expression for Var P {F j }) |H ν j | with the approximation n j P[D ν ] and see that Var P {F j (γ)} Var P T nj (γ) if dP ≈ dP γ . In other words, the variance of our unbiased estimator F j is significantly smaller than the estimator T nj typically used in aforementioned perturbation Monte Carlo schemes.
3.3. Path recycling and M C 3 . Here we show that the multiple resolution forward models F j (24) meet most of the assumptions from section 2. As was established in remark 3, lemmas 3.2 and 3.3 establish that F j → F ∞ in the sense of (7). The Gaussian error model F j (γ) ∼ N (F ∞ (γ) , Σ j (γ)) will be established in sections 3.3.1. The covariance estimate used in section 2.3 is then established in section 3.3.2.
Note that since we are recycling paths, different runs of these models are not independent. This is the only way the F j do not meet the criteria of section 2. This fact turns out to be an advantage for practical reasons: Since we expect F j (x) and F j (y) to be positively correlated for x near y, we see thatμ 2 j (an estimate of the variance in our acceptance-ratio estimate, see (26)) will be significantly lower. Thus, we are able to use fewer paths to determine acceptance/rejection. If we were using an optimization-based inversion method, a similar phenomenon would occur. The general theme is that if X and Y are random variables (say some functions In other words, error in the differential measurements is reduced. This is absolutely essential since if x is close to y, we expect |F ∞ (x) − F ∞ (y)| Var {F j (x)} unless j is very large.
3.3.1. Characterization of the model error. We use lemma 3.3 and the central limit theorem to characterize the error. We estimate this using standard techniques. The result is that we approximate F j (γ) as following a N (P γ [D] , Σ j (γ)) distribution, where The final sum is the standard estimate of the variance of the random variable dPγ dP (ω k ).

3.3.2.
The estimate of µ j . For our confidence interval in section 2.3 we need to estimate where ϕ(X, Y ) : = |X| 2 − |Y | 2 , and for ν = 1, . . . , m and Y ν is defined similarly. We also haveX := The question arises, "how large should |H ν j | be before we believe the CLT approximation?" Using a theorem on the remainder in a CLT approximation we arrive at the following scaling criteria (see appendix B) # ω k ∈ H ν j : ω k hits the plume 1.
In the process of computing the weights dPγ dP we can easily keep track of whether or not the plume was hit. We thus choose our lowest resolution level high enough so that (27)  The performance increase is dramatic. For example, it took 11,727 minutes to generate approximately 231 million paths (of which 1/348 hit the detector). These paths can be recycled in only 30.9 seconds (22,770 times quicker). One might have expected a speed-up of only about 348 times. The dramatic difference is due mostly to the fact that the original paths were cast using complicated Python code that explicitly stepped the photons through their path, while the much simpler recycling could be done using optimized Cython code. In any case, recycling paths only involves computing a ratio of weights, and in many cases may be much quicker than sending the original paths. 4. Use in an inverse problem. Here we combine the forward models from section 3 in a Bayesian inverse problem fitting the framework of section 2.
The scene is a sunlit valley with reflecting mountain, a detector located on the right side, and an absorbing plume to be reconstructed. See Figure 4. The plume is parameterized by γ = (α, x p , z p , ρ, c) where α is the plume absorption, (x p , z p ) is the plume center, ρ is the plume radius, and c is the atmospheric constant. We assume γ ∈ Γ (defined in (21)). Our prior density f prior (γ) is In other words, we assume all components are independent with α being Gamma distributed, and all the others being uniform. We ran a Monte Carlo forward simulation in an atmosphere containing a plume to generate the "noiseless" data. A separate set of paths was then generated for use in the inverse problem. We added Gaussian noise corresponding to an SNR of 10. We then sampled from the posterior using the M C 3 algorithm using initial resolution level i = 6, j max = 15, and confidence λ = 0.90. This is compared with two-level scheme at i = 6, j = 15 and one-level scheme at j = 6. One can see that M C 3 generates virtually the same posterior as the more expensive two-level scheme. The difference could be entirely attributed to the finite number of samples used to construct the histogram. The one-level scheme at j = 6 has too low resolution: the likelihood is too broad and hence the posterior π 6 closely follows the prior. We also ran M C 3 simulations with λ = 0.70, 0.99 and observed results similar to those obtained with λ = 0.90.
From the practical standpoint, we find that the horizontal position x p and the plume absorption and plume radius are reasonably reconstructed, in the sense that the posterior marginals are much tighter than the prior marginals for these parameters. The reconstruction of the vertical position z of the plume remains very inaccurate: The true value is near the tail of the posterior. The Bayesian framework allows us to quantify such statements and the M C 3 algorithm allows us to do so at a more reasonable computational cost than other methods.

5.
Conclusions and outlook. A coherent framework was presented for solving Bayesian inverse problems when the forward model involves Monte Carlo. Pathrecycling yields efficient multi-resolution forward models for use in the Bayesian framework, though the utility of these models is not restricted to Bayesian inference. In this context, however, M C 3 takes advantage of multi-resolution forward models to quickly draw from the posterior distribution. Its application is not limited to the forward models considered here; M C 3 may be used separately whenever multiresolution forward models are available with tight probabilistic error estimates.
M C 3 was shown to be robust with respect to the choice of initial resolution level i and confidence λ: the discrepancy for finite i and λ < 1 was quantified in a simple toy model, and shown to be small in our transport problem (which used path-recycling). This does not provide strict parameter choice rules for (i, λ) as would be desirable. However, a reasonable procedure was explained and verified in sections 2.5 and 4. In general, the forward-model-time/1000-effective-samples for M C 3 at levels i, j max is superior to that of a two-level MCMC scheme at i, j = j max (we experienced two to nine times speedup), but slower than a one-level scheme at resolution i. The latter, however, may be extremely inaccurate. The efficiency gain provided by M C 3 rises significantly when higher resolution is desired.
We see path recycling as a necessary element in solving any inverse problem with a Monte Carlo forward model. We see M C 3 as a great time saver in Bayesian applications. Future work should provide further intuition and ideally some concrete results about parameter choice. We also mention that the M C 3 scheme could be adapted to forward models that are not Monte Carlo, yet that are endowed with tight error estimates.
Appendix A. Proofs.
Proof of Theorem 2.1. The proof of Theorem 2.1 follows standard techniques. We include it since the chain in algorithm 4 is not a Metropolis-Hastings scheme and so some care must be taken.
If λ = 1 then the result is trivial so assume in the sequel that λ ∈ (0, 1). We first show that the density π λ is well-defined. Since R λ is a convex combination of r i∞ (x) and 1 − r i∞ (x), we have 0 < R λ < 1. Therefore, λ ≤ m λ and thus 0 < π ∞ /m λ ≤ π ∞ /λ is integrable and we can normalize it to define the density π λ .
To show the convergence results we assume (i) and (ii) and proceed to show irreducibility, a-periodicity, and Harris recurrence.
Let x 0 ∈ E + and K n λ (x 0 , ·) be the transition kernel associated with n steps of the chain starting at x 0 . Let B ⊂ E + with B π ∞ (y) dy > 0. Irreducibility will follow once we show K n λ (x 0 , B) > 0 for some n. To that end first note that K λ (x, y) ≥ λq(y | x)α i (x, y)β i∞ (x, y).
Hence the chain is π ∞ irreducible. A-periodicity follows since by choosing a small enough ball B x0 x 0 , one can show that K n λ (x 0 , B x0 ) > 0 for all n. We now show Harris recurrence. The proof here follows more-or-less lemma 7.3 in [17] or corollary 2 in [18]. and relies on showing that the only bounded harmonic functions are constant. Recall that a function h is harmonic with respect to our chain if E h(X 1 ) | X 0 = x 0 = h(x 0 ). Since π λ is an invariant probability measure of the chain, the chain is recurrent by proposition 6.36 of [17]. As in lemma 7.3 from [17] this implies that h is π λ almost everywhere equal to a constanth. To show that h is constant everywhere (on the support of π λ ) we write The above integral is unchanged if we replace h(x 1 ) byh. Then since m λ q i β i∞ dx 1 = R λ (x 0 ) we have R λ (x 0 ) h − h(x 0 ) = 0. The irreducibility portion of this proof shows that R λ > 0, hence h(x 0 ) =h and h is constant.
Having shown the chain is Harris recurrent and a-periodic with invariant measure π ∞ , the convergence results follow directly from Theorems 6.63, 6.51, and proposition 6.52 in [17].
Proof of Theorem 3.2. Using lemma 3.1, we have that E P F ν j (γ) is equal to The last equality follows since E P {|H max |} is the expected number of hits in D ν , equal to n max P[D ν ].
Proof of Theorem 3.3. Writing |H ν max |/n max = (|H ν max |/n max − P[D ν ]) + P[D ν ], and similarly for µ, we can express Cov P F ν j , F ν j as    plus terms that tend to zero as n max → ∞ (due to lemma 3.1). We now compute the variance of this term, keeping in mind that the ω k ∈ H ν j are i.i.d. draws from P[· | D ν ].
This is the statement of the theorem.
Appendix B. Estimate of µ j and the scaling criteria. Note that where due to the approximate independence of the components (true as n max → ∞) Cov P (X, Y ) ≈ Cov P (Y, X) = diag (Cov P (X ν , Y ν )) , Since X, Y are sums are i.i.d. random variables, we can estimate the covariance using Theorem 3.3 (this is similar to (25)).