Borel Kernels and their Approximation, Categorically

This paper introduces a categorical framework to study the exact and approximate semantics of probabilistic programs. We construct a dagger symmetric monoidal category of Borel kernels where the dagger-structure is given by Bayesian inversion. We show functorial bridges between this category and categories of Banach lattices which formalize the move from kernel-based semantics to predicate transformer (backward) or state transformer (forward) semantics. These bridges are related by natural transformations, and we show in particular that the Radon-Nikodym and Riesz representation theorems - two pillars of probability theory - define natural transformations. With the mathematical infrastructure in place, we present a generic and endogenous approach to approximating kernels on standard Borel spaces which exploits the involutive structure of our category of kernels. The approximation can be formulated in several equivalent ways by using the functorial bridges and natural transformations described above. Finally, we show that for sensible discretization schemes, every Borel kernel can be approximated by kernels on finite spaces, and that these approximations converge for a natural choice of topology. We illustrate the theory by showing two examples of how approximation can effectively be used in practice: Bayesian inference and the Kleene star operation of ProbNetKAT.


Introduction
Finding a good category in which to study probabilistic programs is a subject of active research [6,19,23,24]. In this paper we present a dagger symmetric monoidal category of kernels whose dagger-structure is given by Bayesian inversion. The advantages of this new category are two-fold.
Pre-print, March 2018

Firstly, the most important new construct introduced by probabilistic programming, viz. Bayesian inversion, is interpreted completely straightforwardly by the †-operation which is native to our category. In particular we never leave the world of kernels and we therefore do not require any normalization construct. Consider for example the following simple Bayesian inference problem in Anglican ([26]):

    (defquery example
      (let [x (sample (normal 0 1))]
        (observe (normal x 1) 0.5)
        (> x 1)))

The semantics of this program is built easily and compositionally in our category:
• The second line builds a Borel space equipped with a normally distributed probability measure, i.e. an object (R, µ) of our category.
• The (normal x 1) instruction builds a Borel kernel, i.e. a morphism f : (R, µ) → (R, ν) in our category.
• The observe statement builds the Bayesian inverse of the kernel, i.e. the morphism f† : (R, ν) → (R, µ) in our †-category.
• Finally, the kernel f† is evaluated, i.e. the denotation of the program above is f†(0.5)(]1, ∞[).
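For this particular program the denotation can be checked by hand, since a Gaussian prior with a Gaussian likelihood has a Gaussian posterior. The following minimal sketch computes it with the standard conjugate update; the function names are illustrative, and we assume that the parameters of `normal` are a mean and a standard deviation (both readings coincide here since the second parameter is 1).

```python
import math

def gaussian_posterior(prior_mean, prior_var, noise_var, observation):
    """Posterior of x given y = observation, for x ~ N(prior_mean, prior_var)
    and y | x ~ N(x, noise_var) (standard conjugate Gaussian update)."""
    precision = 1.0 / prior_var + 1.0 / noise_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + observation / noise_var)
    return post_mean, post_var

def gaussian_tail(mean, var, threshold):
    """P(x > threshold) for x ~ N(mean, var), via the complementary error function."""
    z = (threshold - mean) / math.sqrt(2.0 * var)
    return 0.5 * math.erfc(z)

# Prior N(0, 1), likelihood N(x, 1), observation 0.5, query x > 1.
post_mean, post_var = gaussian_posterior(0.0, 1.0, 1.0, 0.5)
prob_x_gt_1 = gaussian_tail(post_mean, post_var, 1.0)
```

Under these assumptions the posterior has mean 1/4 and variance 1/2, and `prob_x_gt_1` plays the role of f†(0.5)(]1, ∞[).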
The functoriality of † ensures compositionality.

Secondly, since Bayesian inference problems are in general very hard to compute (although the one given above has an analytical solution), it makes sense to seek approximate solutions, i.e. approximate denotations to probabilistic programs. As we will show, our category of kernels comes equipped with a generic and endogenous approximating scheme which relies on its involutive structure and on the structure of standard Borel spaces. Moreover, for a natural choice of topology, this approximation scheme can be shown to converge for any choice of kernel.
Main contributions.
1. We build a category Krn of Borel kernels (§2) and we show how two kernels which agree almost everywhere can be identified under a categorical quotient operation. This technical construction is what allows us to define Bayesian inversion as an involutive functor, denoted †. This is a key technical improvement on [6], where the †-structure was hinted at but was not functorial. We show that Krn is a dagger symmetric monoidal category.
All the proofs can be found in the Appendix.
Related work. Quasi-Borel sets have recently been proposed as a semantic framework for higher-order probabilistic programs in [24]. The main differences with our approach are: (i) unlike [23,24] we never leave the realm of kernels, and in particular we never need to worry about normalization. This makes the interpretation of observe statements, i.e. of Bayesian inversion, simpler and more natural. However, (ii) unlike the quasi-Borel sets of [24], our category is not Cartesian closed. We can therefore not give a semantics to all higher-order programs. This shortcoming is partly mitigated by the fact that the category of Polish spaces, on which our category ultimately rests, does have access to many function spaces, in particular all the spaces of functions whose domain is locally compact. We can thus in principle provide a semantics to higher-order programs, provided that λ-abstraction is restricted to locally compact spaces like the reals and the integers, although this won't be investigated in this paper.

The approximation of probabilistic kernels has been a topic of investigation in theoretical computer science for nearly twenty years (see e.g. [4,9,10,11]), and for much longer in the mathematical literature (e.g. [5]). Our results build on the formalism developed in [4] with the following differences: (i) we can approximate kernels, their associated stochastic operator (backward predicate transformer), or their associated Markov operator (forward state transformer) with equivalent ease, and move freely across the three formalisms.
(ii) Given a kernel f : X ⇝ Y, we can define its approximation f′ : X′ → Y′ along any quotients X′ of X and Y′ of Y as in [4], but we can also 'internalize' the approximation as a kernel f* : X → Y of the original type. Morally f′ and f* are the same approximation, but the second approximant, being of the same type as the original kernel, can be compared with it. In particular it becomes possible to study the convergence of ever finer approximations, which we do in Section 6. Finally, (iii) we opt to work with Banach lattices rather than the normed cones of [4,20] because it allows us to formulate the operator side of the theory very naturally, and it connects to a large body of classic mathematical results ([2,27]) which have been used in the semantics of probabilistic programs as far back as Kozen's seminal [17].

A category of Borel kernels
In [6] the first three authors presented a category of Borel kernels similar in spirit to the construction of this section, but with a major shortcoming. As we will shortly see, our category Krn of Borel kernels can be equipped with an involutive functor (a dagger operation † in the terminology of [21]) which captures the notion of Bayesian inversion and is absolutely crucial to everything that follows. In [6] this operation had merely been identified as a map, i.e. not even as a functor. In this section we show that Bayesian inversion does indeed define a †-structure on a more sophisticated, but measure-theoretically very natural, category of kernels.

Standard Borel spaces and the Giry monad
A standard Borel space (or SB space for short) is a measurable space (X, S) for which there exists a Polish topology T on X whose Borel sets are the elements of S, i.e. such that S = σ(T) (see e.g. [16] for an overview). Let us write SB for the category of standard Borel spaces and measurable maps.
One key structural feature of SB is the following:
Theorem 1. Every SB object is a limit of a countable codirected diagram of finite spaces.
The Giry monad was originally defined in two variants [15]:
- As an endofunctor G_Pol of Pol, the category of Polish spaces, one sets G_Pol(X, T) to be the space of Borel probability measures over X together with the weak topology. This space is Polish [16, Th 17.23], and the Portmanteau Theorem [16, Th 17.20] gives multiple characterizations of the weak topology.
- As an endofunctor G_Meas of Meas, the category of measurable spaces, one sets G_Meas(X, S) to be the set of probability measures on X together with the initial σ-algebra for the maps ev_A : G_Meas(X, S) → R, µ ↦ µ(A), A ∈ S.
In both cases the Giry monad is defined on an arrow f : X → Y as the map f_* which sends a measure µ on X to the pushforward measure f_*µ on Y, defined as
\[ f_*\mu(B) = \mu(f^{-1}(B)) \quad \text{for every measurable } B \subseteq Y. \]
We want to define the Giry monad on the category SB of standard Borel spaces (and measurable maps), and the two versions of the Giry monad described above offer us natural ways to do this: given an SB space (X, σ(T)) we can either compute G_Pol(X, T) and take the associated standard Borel space, or directly compute G_Meas(X, σ(T)). Fortunately, the two methods agree.

Theorem ([16, Th 17.24]). Let B : Pol → SB denote the functor sending a Polish space (X, T) to its associated SB-space (X, σ(T)) and leaving morphisms unchanged, then
\[ G_{Meas} \circ B = B \circ G_{Pol}. \]
We define the Giry monad on SB spaces to be the endofunctor G : SB → SB defined by either of the two equivalent constructions above. The monadic data of G is given at each SB space X by the unit δ_X : X → GX, x ↦ δ_x, the Dirac measure at x, and the multiplication m_X : G²X → GX,
\[ m_X(\mathbb{P})(A) = \int_{GX} ev_A \, d\mathbb{P}. \]
We refer the reader to [15] for proofs that δ_X and m_X are measurable.
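On a finite space all of this data can be written down explicitly. The following minimal sketch, which is an illustration and not the construction used in the paper, represents a finitely supported probability measure as a dictionary from points to exact rational weights; the names `dirac`, `pushforward` and `multiply` are hypothetical.

```python
from fractions import Fraction

def dirac(x):
    """Unit of the monad: the Dirac measure at x."""
    return {x: Fraction(1)}

def pushforward(f, mu):
    """f_* mu: the mass of y is the total mu-mass of the preimage of y."""
    out = {}
    for x, w in mu.items():
        out[f(x)] = out.get(f(x), Fraction(0)) + w
    return out

def multiply(big_mu):
    """Multiplication: average a measure over measures.
    big_mu is a list of (measure, weight) pairs, since dicts are not hashable."""
    out = {}
    for mu, w in big_mu:
        for x, p in mu.items():
            out[x] = out.get(x, Fraction(0)) + w * p
    return out

mu = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
nu = pushforward(lambda x: x % 2, mu)          # image measure under the parity map
flat = multiply([(dirac(0), Fraction(1, 3)), (mu, Fraction(2, 3))])
```

Here `nu` puts mass 3/4 on 0 and 1/4 on 1, and `flat` is the mixture of δ_0 and µ with weights 1/3 and 2/3.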

The construction of Krn
Let us denote by SB_G the Kleisli category associated with the Giry monad (G, δ, m). We denote Kleisli arrows, i.e. Markov kernels, by X ⇝ Y, and we call such an arrow deterministic if it can be factorized as an ordinary measurable function followed by the unit δ. Kleisli composition is denoted by •. The category * ↓ SB_G has arrows * ⇝ X as objects, where * is the one point SB space (the terminal object in SB). An arrow from µ : * ⇝ X to ν : * ⇝ Y is a Kleisli arrow f : X ⇝ Y such that ν = f • µ, i.e. such that
\[ \nu(A) = \int_X f(x)(A)\, d\mu \]
for any measurable subset A of Y. This situation will be denoted in short by f : (X, µ) ⇝ (Y, ν), and we will call a pair (X, µ) a measured SB space.
We want to construct a quotient of * ↓ SB_G such that two * ↓ SB_G arrows are identified if they disagree only on a null set w.r.t. the measure on their domain. For g, g′ : (X, µ) ⇝ (Y, ν), let
\[ N(g, g') = \{x \in X \mid g(x) \neq g'(x)\} \]
be the set on which they disagree; it is measurable because GY is standard Borel. We now define a relation ∼ on Hom((X, µ), (Y, ν)) by saying that for any two arrows g, g′ : (X, µ) ⇝ (Y, ν), g ∼ g′ if µ(N(g, g′)) = 0. This clearly defines an equivalence relation on Hom((X, µ), (Y, ν)). In order to perform the quotient of the category * ↓ SB_G modulo ∼, we need to check that it is compatible with composition.

Definition 5. Let Krn be the category obtained by quotienting the hom-sets of * ↓ SB_G by ∼.
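To make the quotient concrete, here is a minimal finite sketch: kernels are dictionaries of rows, a measure is a dictionary of weights, and two kernels are identified when the set of points where their rows differ has measure zero. The names `compose`, `push` and `equivalent`, and the example data, are assumptions made for illustration.

```python
from fractions import Fraction as F

def compose(g, f):
    """Kleisli composite of finite kernels given as dicts x -> {y: prob}: first f, then g."""
    out = {}
    for x, fx in f.items():
        row = {}
        for y, p in fx.items():
            for z, q in g[y].items():
                row[z] = row.get(z, F(0)) + p * q
        out[x] = row
    return out

def push(f, mu):
    """The measure nu(y) = sum_x f(x)(y) mu(x) obtained by composing f with mu."""
    nu = {}
    for x, w in mu.items():
        for y, p in f[x].items():
            nu[y] = nu.get(y, F(0)) + w * p
    return nu

def equivalent(g1, g2, mu):
    """g1 ~ g2 iff the set N(g1, g2) of points where the rows differ is mu-null."""
    return sum(mu[x] for x in mu if g1[x] != g2[x]) == 0

mu = {"a": F(1, 2), "b": F(1, 2), "c": F(0)}          # "c" is a null point
g1 = {"a": {0: F(1)}, "b": {0: F(1, 2), 1: F(1, 2)}, "c": {0: F(1)}}
g2 = {"a": {0: F(1)}, "b": {0: F(1, 2), 1: F(1, 2)}, "c": {1: F(1)}}
h = {0: {"u": F(1)}, 1: {"v": F(1)}}
```

In this example `g1` and `g2` differ only at the null point, push µ to the same measure, and remain equivalent after post-composition with `h`, which is the finite shadow of compatibility with composition.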
The following Theorem is of great practical use and generalizes the well-known result for deterministic arrows.

The dagger structure of Krn
Krn has an extremely powerful inversion principle: every deterministic Krn-morphism f : (X, µ) → (Y, ν) admits a kernel f†_µ : (Y, ν) ⇝ (X, µ) which is a right inverse of f, i.e.
\[ f \bullet f^\dagger_\mu = \mathrm{id}_{(Y,\nu)}. \]
The kernel f†_µ is called the disintegration of µ along f. As our notation suggests, the disintegration depends fundamentally on the measure µ over the domain; however, we will omit this subscript when there is no ambiguity. The following lemma relates disintegrations to conditional expectations.

Lemma 8 ([8]). Let f : (X, µ) → (Y, ν) be a deterministic Krn-morphism, and let ϕ : X → R be measurable, then µ-a.e.
\[ \mathbb{E}[\phi \mid \sigma(f)](x) = \int_X \phi \; d f^\dagger_\mu(f(x)). \]
We can extend the definition of (−)† to any Krn-morphism f : (X, µ) ⇝ (Y, ν) in a functorial way, although f† will not in general be a right inverse to f. The construction of f† is detailed in [6], but let us briefly recall how it works. The category SB has products, which are built in the same way as in Meas via the product of σ-algebras. Given any kernel f : (X, µ) ⇝ (Y, ν), we can canonically construct a probability measure γ_f on the product X × Y of SB-spaces by defining it on the rectangles of X × Y as
\[ \gamma_f(A \times B) = \int_A f(x)(B)\, d\mu. \]
Equivalently, for every measurable C ⊆ X × Y,
\[ \gamma_f(C) = \int_X (\delta_x \otimes f(x))(C)\, d\mu. \]
Letting π_X : X × Y → X and π_Y : X × Y → Y be the canonical projections, we observe that Gπ_X(γ_f) = µ and Gπ_Y(γ_f) = ν: in other words, γ_f is a coupling of µ and ν. The disintegration of γ_f along π_Y is a kernel π_Y† : (Y, ν) ⇝ (X × Y, γ_f). Finally we define:
\[ f^\dagger_\mu = \pi_X \bullet \pi_Y^\dagger. \]
The situation is summed up in Krn by the span (X, µ) ← (X × Y, γ_f) → (Y, ν) given by π_X and π_Y, together with f : (X, µ) ⇝ (Y, ν) and the two kernels π_Y† and f† going back from (Y, ν). The following property characterizes the action of (−)† on Krn-morphisms: for all measurable A ⊆ X and B ⊆ Y,
\[ \int_A f(x)(B)\, d\mu = \int_B f^\dagger_\mu(y)(A)\, d\nu. \tag{4} \]
In view of Eq. (4), we will call f† the Bayesian inversion of f, and refer to (−)† as the Bayesian inversion operation on Krn. It will be crucial throughout the rest of this paper. It is important to see that f† absolutely depends on the choice of µ and not only on f seen as a function. We can now improve on [6] and show that (−)† is indeed a †-operation in the strict categorical meaning of the term.
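On finite spaces Bayesian inversion reduces to Bayes' rule, which gives a quick way to experiment with the †-operation. The sketch below is an illustration under that finiteness assumption; `push` and `dagger` are hypothetical names, and rows for ν-null points are simply omitted since the inverse is only determined ν-almost everywhere.

```python
from fractions import Fraction as F

def push(f, mu):
    """The measure nu(y) = sum_x f(x)(y) mu(x)."""
    nu = {}
    for x, w in mu.items():
        for y, p in f[x].items():
            nu[y] = nu.get(y, F(0)) + w * p
    return nu

def dagger(f, mu):
    """Bayesian inverse of f with respect to the prior mu (finite case, Bayes' rule).
    Returns the inverse kernel together with the pushed-forward measure nu."""
    nu = push(f, mu)
    inv = {}
    for y, ny in nu.items():
        if ny == 0:
            continue  # the inverse is only determined nu-almost everywhere
        inv[y] = {x: f[x].get(y, F(0)) * mu[x] / ny for x in mu}
    return inv, nu

mu = {0: F(1, 4), 1: F(3, 4)}
f = {0: {"a": F(1, 2), "b": F(1, 2)}, 1: {"a": F(1, 3), "b": F(2, 3)}}
f_dag, nu = dagger(f, mu)          # inverse kernel from {"a", "b"} back to {0, 1}
f_dag_dag, mu_back = dagger(f_dag, nu)
```

In this example inverting twice returns the original kernel and prior, and the two sides of the characterizing identity agree on singletons.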

Banach lattices
It is well-known that kernels can alternatively be seen as predicate (i.e. real-valued function) transformers, or as state (i.e. probability measure) transformers. The latter perspective was adopted by Kozen in [17] to describe the denotational semantics of probabilistic programs (without conditioning). We shall see in this section and the next that the predicate and state transformer perspectives are dual to one another in the category of Banach lattices, a framework incidentally also used in [17]. For an introduction to the theory of Banach lattices we refer the reader to e.g. [2,27].
An ordered real vector space V is a real vector space together with a partial order ≤ which is compatible with the linear structure in the sense that for all u, v, w ∈ V and all real λ ≥ 0,
\[ u \le v \;\Rightarrow\; u + w \le v + w \quad\text{and}\quad u \le v \;\Rightarrow\; \lambda u \le \lambda v. \]
An ordered vector space (V, ≤) is called a Riesz space if the poset (V, ≤) is a lattice; in a Riesz space we write |v| = v ∨ (−v), and we call v positive if v ≥ 0. A Riesz space is σ-order complete if every non-empty countable subset of V which is order bounded has a supremum.
A normed Riesz space is a Riesz space (V, ≤) equipped with a lattice norm, i.e. a norm ∥·∥ : V → R such that
\[ |v| \le |w| \;\text{ implies }\; \|v\| \le \|w\|. \tag{5} \]
A normed Riesz space is called a Banach lattice if it is (norm-) complete, i.e. if every Cauchy sequence (for the norm ∥·∥) has a limit in V.
Example 11. For each measured space (X, µ), and in particular for each Krn-object, and each 1 ≤ p ≤ ∞, the space L_p(X, µ) is a Riesz space with the pointwise order. When it is equipped with the usual L_p-norm, it is a Banach lattice. This fact is often referred to as the Riesz-Fischer theorem (see [2, Th 13.5]). We will say that p, q ∈ N ∪ {∞} are Hölder conjugate if one of the following conditions holds: (i) 1 < p, q < ∞ and 1/p + 1/q = 1, or (ii) p = 1 and q = ∞, or (iii) p = ∞ and q = 1.
There are two very natural modes of 'convergence' in a Banach lattice: order convergence and norm convergence. The latter is well-known, the former less so. An order bounded sequence {v_n}_{n∈N} in a σ-complete Riesz space (and thus in a Banach lattice) converges in order to v if either of the following equivalent conditions holds: (i) there exists a decreasing sequence w_n ↓ 0 such that |v_n − v| ≤ w_n for all n; (ii)
\[ v = \bigvee_n \bigwedge_{m \ge n} v_m = \bigwedge_n \bigvee_{m \ge n} v_m. \]
For a monotone increasing sequence v_n, this definition simplifies to v = ⋁_n v_n, which is often written v_n ↑ v.
In a general σ-complete Riesz space, order and norm convergence are disjoint concepts, i.e. neither implies the other (see [27, Ex. 15.2] for two counter-examples). However, if a sequence converges both in order and in norm then the limits are the same (see [27, Th. 15.4]). Moreover, for monotone sequences norm convergence implies order convergence. In a Banach lattice we have the following stronger property.
Proposition 14 (Lemma 16.1 and Theorem 16.2 of [27]). If {v n } n ∈N is a sequence of positive vectors in a Banach lattice such that sup n ∥v n ∥ converges, then n v n exists and ∥ n v n ∥ = n ∥v n ∥.
It can also happen that order convergence implies norm convergence. A lattice norm on a Riesz space is called σ-order continuous if v_n ↓ 0 (v_n is a decreasing sequence whose infimum is 0) implies ∥v_n∥ ↓ 0.
Example 15. For 1 ≤ p < ∞, the L p -norm is σ -order continuous, and thus order convergence and norm convergence coincide. However, for p = ∞ this is not the case as the following simple example shows. Consider the sequence of essentially bounded functions v n = 1 [n,+∞[ : it is decreasing for the order on L ∞ (R, λ) with the constant function 0 as its infimum, i.e. v n ↓ 0. However ∥v n ∥ = 1 for all n.
Many types of morphisms between Banach lattices are considered in the literature but most are at least linear and positive, that is to say they send positive vectors to positive vectors. From now on, we will assume that all morphisms are positive (linear) operators. Other than that, we will only mention two additional properties, corresponding to the two modes of convergence which we have examined. The first notion is very well-known: a linear operator T : V → W between normed vector spaces is called norm-bounded if there exists C ∈ R such that ∥Tv ∥ ≤ C ∥v ∥ for every v ∈ V . The following result is familiar: Theorem 16. An operator T : V → W between normed vector spaces is norm-bounded iff it is continuous.
Thus norm-bounded operators preserve norm-convergence. The corresponding order-convergence concept is defined as follows: an operator T : V → W between σ-order complete Riesz spaces is said to be σ-order continuous if whenever v_n ↑ v, Tv = ⋁_n Tv_n. It follows that we can consider two types of dual spaces on a Banach lattice V: on the one hand we can consider the norm-dual
\[ V^* = \{f : V \to \mathbb{R} \mid f \text{ is linear and norm-continuous}\}, \]
and on the other hand the σ-order-continuous dual
\[ V^\sigma = \{f : V \to \mathbb{R} \mid f \text{ is linear and } \sigma\text{-order continuous}\}. \]
The latter is sometimes known as the Köthe dual of V (see [12,27]). The two types of duals coincide for a large class of Banach lattices of interest to us.
Theorem 17. If a Banach lattice V admits a strictly positive linear functional and has a σ-order-continuous norm, then
\[ V^* = V^\sigma. \]
Example 18. The result above can directly be applied to our running example: given a measured space (X, µ) and an integer 1 ≤ p < ∞, the Lebesgue integral provides a strictly positive functional on L_p(X, µ), and we already know from Example 15 that L_p(X, µ) has a σ-order-continuous norm. It follows that L_p(X, µ)* = L_p(X, µ)^σ. Moreover, it is well-known that if (p, q) are Hölder conjugate and 1 < p, q < ∞, then L_p(X, µ)* = L_q(X, µ), and thus L_p(X, µ)^σ = L_q(X, µ). It is also known that L_1(X, µ)* = L_∞(X, µ), and thus L_1(X, µ)^σ = L_∞(X, µ).
However, Theorem 17 does not hold for L_∞(X, µ) since the L_∞-norm is not σ-order continuous, as was shown in Example 15. It is well-known that L_∞(X, µ)* ≠ L_1(X, µ), and in fact L_∞(X, µ)* can be concretely described as the Banach lattice ba(X, µ) of charges (i.e. finitely additive finite signed measures) on X which are absolutely continuous w.r.t. µ (see [13, IV.8.16]). However, as is shown in e.g. [4,27], L_∞(X, µ)^σ = L_1(X, µ).

As Examples 15 and 18 show, the (−)^σ operation brings a lot of symmetry to the relationship between L_p-spaces, since L_p(X, µ)^σ = L_q(X, µ) for any Hölder conjugate pair (p, q) with 1 ≤ p ≤ ∞. For this reason we will consider the category BL_σ whose objects are Banach lattices and whose morphisms are σ-order continuous positive operators. Note that the Köthe dual of a Banach lattice is a Banach lattice, and it easily follows that (−)^σ in fact defines a contravariant functor BL_σ^op → BL_σ which acts on morphisms by pre-composition. As we will now see, BL_σ is the category in which predicate and state transformers are most naturally defined.

From Borel kernels to Banach lattices
The functors S_p and T_p. For 1 ≤ p ≤ ∞, the operation which associates to a Krn-object (X, µ) the space L_p(X, µ) can be thought of as either a contravariant or a covariant functor. We define the functors S_p : Krn → BL_σ^op, 1 ≤ p ≤ ∞, as expected on objects, and on Krn-morphisms f : (X, µ) ⇝ (Y, ν) via the well-known 'predicate transformer' perspective:
\[ S_p(f) : L_p(Y,\nu) \to L_p(X,\mu), \qquad \phi \mapsto \Big(x \mapsto \int_Y \phi \, df(x)\Big). \]
For a proof that this defines a functor see [6]. We define the covariant functors T_p : Krn → BL_σ, 1 ≤ p ≤ ∞, in the same way on objects.

Every band in a Banach lattice is itself a Banach lattice. Of particular importance is the band B_v generated by a singleton {v}, which can be described explicitly as
\[ B_v = \Big\{ w \in V \;\Big|\; \bigvee_n \big(|w| \wedge n|v|\big) = |w| \Big\}. \]
Given µ ∈ ca(X), the band B_µ generated by µ is just the set of measures of bounded variation which are absolutely continuous w.r.t. µ. In particular B_µ is a Banach lattice.
Radon-Nikodym is natural. We now present a first pair of natural transformations which will establish a natural isomorphism between the functors T_1 and M≪·. First, we define the Radon-Nikodym transformation rn : M≪· → T_1 at each Krn-object (X, µ) by the map
\[ rn_{(X,\mu)} : M_{\ll\mu}(X) \to L_1(X,\mu), \qquad \nu \mapsto \frac{d\nu}{d\mu}. \]
The fact that this transformation defines a positive operator between Banach lattices is simply a restatement of the usual Radon-Nikodym theorem [13, III.10.7], combined with the well-known linearity property of the Radon-Nikodym derivative. To see that it is also σ-order-continuous, consider a monotone sequence µ_n ↑ µ converging in order to µ in M≪ν(X). This means that for any measurable set A of X, lim_{n→∞} µ_n(A) = µ(A). Since (dµ_n/dν)_{n∈N} is bounded in L_1-norm, the function g = ⋁_n dµ_n/dν exists and is simply the pointwise limit g(x) = lim_{n→∞} dµ_n/dν(x). It now follows from the monotone convergence theorem (MCT) that
\[ \mu(A) = \lim_{n\to\infty} \mu_n(A) = \lim_{n\to\infty} \int_A \frac{d\mu_n}{d\nu}\, d\nu = \int_A g \, d\nu, \]
in other words, g = dµ/dν and rn is well-defined. That rn is also natural has, to our knowledge, never been published.
Theorem 21. The Radon-Nikodym transformation is natural.
Secondly, we define the Measure Representation transformation mr : T_1 → M≪· at each Krn-object (X, µ) by the map
\[ mr_{(X,\mu)} : L_1(X,\mu) \to M_{\ll\mu}(X), \qquad h \mapsto \Big(B \mapsto \int_B h \, d\mu\Big). \]
This is a very well-known construction in measure theory, and the fact that mr_(X,µ) is a σ-order continuous operator between Banach lattices is immediate from the linearity of integrals and the MCT.

Riesz representations are natural. We now present a second pair of natural transformations which will establish a natural isomorphism between (−)^σ • S_∞ and M≪·. First, we define the Riesz Representation transformation rr : (−)^σ • S_∞ → M≪· at each Krn-object (X, µ) by the map
\[ rr_{(X,\mu)} : (L_\infty(X,\mu))^\sigma \to M_{\ll\mu}(X), \qquad F \mapsto \big(B \mapsto F(1_B)\big). \]
This construction is key to a whole collection of results in functional analysis commonly known as Riesz Representation Theorems (see [2, Chapter 14] for an overview). One can readily check that the Riesz Representation transformation is well-defined: rr_(X,µ)(F)(∅) = F(0) = 0 and the σ-additivity of rr_(X,µ)(F) follows from the σ-order-continuity of F. To see that rr_(X,µ)(F) ≪ µ, assume that µ(B_X) = 0, then clearly 1_{B_X} = 0 µ-a.e., i.e. 1_{B_X} = 0 in L_∞(X, µ), and thus F(1_{B_X}) = 0.
Theorem 23. The Riesz Representation transformation is natural.
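Finally, we define the Functional Representation transformation fr : M≪· → (−)^σ • S_∞ at each Krn-object (X, µ) by the map
\[ fr_{(X,\mu)} : M_{\ll\mu}(X) \to (L_\infty(X,\mu))^\sigma, \qquad \nu \mapsto \Big(\phi \mapsto \int_X \phi \, d\nu\Big). \]
This construction is also completely standard in measure theory, although it has never to our knowledge been seen as a natural transformation.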
Theorem 24. The Functional Representation transformation is well-defined, i.e. fr (X, µ) is a σ -order continuous positive operator, and is natural.

Natural Isomorphisms
We have now defined the following four natural transformations:
\[ rn : M_{\ll\cdot} \to T_1, \qquad mr : T_1 \to M_{\ll\cdot}, \qquad rr : (-)^\sigma \bullet S_\infty \to M_{\ll\cdot}, \qquad fr : M_{\ll\cdot} \to (-)^\sigma \bullet S_\infty. \]
In fact, both pairs form natural isomorphisms, and these can be restricted to arbitrary Hölder conjugate pairs (p, q).
Theorem 25. rn and mr are inverses of one another; in particular there exists a natural isomorphism between M≪µ(X) and L_1(X, µ).
Theorem 26. rr and fr are inverses of each other; in particular there exists a natural isomorphism between M≪µ(X) and (L_∞(X, µ))^σ.
We can now conclude that the isomorphism proved in Theorem 6 of [6] is in fact natural. We can in fact restrict this result to any Hölder conjugate pair (p, q):
Theorem 28. For 1 ≤ p ≤ ∞ with Hölder conjugate q, the natural transformation rn • rr restricts to a natural transformation (−)^σ • S_p → T_q.
The correspondence between the various categories and functors discussed in this section is summarized as follows:
[Diagram summarizing the functors between Krn and BL_σ.]
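The content of Theorem 25 is easy to test on a finite space, where a Radon-Nikodym derivative is just a pointwise quotient of weights. The sketch below is an illustration only; `rn` and `mr` mimic the two transformations at one fixed finite object, and the value chosen on µ-null points is an arbitrary representative.

```python
from fractions import Fraction as F

def rn(nu, mu):
    """Radon-Nikodym derivative d nu / d mu on a finite space (requires nu << mu)."""
    assert all(mu[x] != 0 or nu[x] == 0 for x in mu), "nu must be absolutely continuous w.r.t. mu"
    # On mu-null points the derivative is arbitrary; 0 is chosen as representative.
    return {x: (nu[x] / mu[x] if mu[x] != 0 else F(0)) for x in mu}

def mr(h, mu):
    """Measure representation: the measure with density h w.r.t. mu, stored by point masses."""
    return {x: h[x] * mu[x] for x in mu}

mu = {0: F(1, 2), 1: F(1, 2), 2: F(0)}
nu = {0: F(1, 4), 1: F(3, 4), 2: F(0)}
h = {0: F(2), 1: F(0), 2: F(5)}

density = rn(nu, mu)              # {0: 1/2, 1: 3/2, 2: 0}
round_trip_measure = mr(density, mu)
round_trip_function = rn(mr(h, mu), mu)
```

Going from a measure to its density and back is the identity, while going from a function to a measure and back recovers the function only µ-almost everywhere, as expected for elements of L_1(X, µ).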

Approximations
In this section we develop a scheme for approximating kernels which follows naturally from the †-structure of Krn.
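Given a kernel f : (X, µ) ⇝ (Y, ν) and deterministic maps p : X → X′ and q : Y → Y′, the †-structure of Krn allows us to define the new kernels
\[ f^{p,q} := q^\dagger \bullet q \bullet f \bullet p^\dagger \bullet p : (X,\mu) \rightsquigarrow (Y,\nu) \tag{9} \]
\[ f_{p,q} := q \bullet f \bullet p^\dagger : (X', p_*\mu) \rightsquigarrow (Y', q_*\nu) \tag{10} \]
The superscript notation is meant to indicate that the approximation lives 'upstairs' in Diagram (8) and conversely for the subscripts. Intuitively, f^{p,q} and f_{p,q} take the average of f over the fibres given by p, q according to µ and ν (see Section 7 for concrete calculations). The advantage of (10) is that we can approximate a kernel on a huge space by a kernel on a, say, finite one. The advantage of (9) is that, although it is more complicated, it is morally equivalent and has the same type as f, which means that we can compare it to f.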
A very simple consequence of our definition is that Bayesian inversion commutes with approximations. We shall use this in §7.1 to perform approximate Bayesian inference.
Theorem 29. Let f : (X, µ) ⇝ (Y, ν), and let p : X → X′ and q : Y → Y′ be a pair of deterministic maps, then (f†)^{q,p} = (f^{p,q})† and (f†)_{q,p} = (f_{p,q})†.
In practice we will often consider endo-kernels f : X ⇝ X with a single coarsening map p : X → X′ to a finite space. In this case (9) simplifies greatly.
Proposition 30. Under the situation described above
\[ f^p = f \bullet p^\dagger_\mu \bullet p. \tag{11} \]
In the case covered by Proposition 30, the interpretation of f^p is very natural: for each x ∈ X the measure f(x) is approximated by its average over the fibre to which x belongs, conditioned on being in the fibre. For fibres with strictly positive µ-probability, this is simply
\[ f^p(x)(A) = \frac{1}{\mu(p^{-1}(p(x)))} \int_{p^{-1}(p(x))} f(y)(A)\, d\mu(y). \]
However (11) also covers the case of µ-null fibres. Note also that in the case where f^p = f, the map p corresponds to what is known as a strong functional bisimulation for f.
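The commutation of inversion and approximation can be checked exactly on small finite spaces. The sketch below assumes that the two approximants are the composites q • f • p† (between the quotients) and q† • q • f • p† • p (of the original type), with all measures of full support so that Bayes' rule defines every inverse; the helper names and the example data are illustrative.

```python
from fractions import Fraction as F

def push(f, mu):
    nu = {}
    for x, w in mu.items():
        for y, p in f[x].items():
            nu[y] = nu.get(y, F(0)) + w * p
    return nu

def compose(g, f):
    """Kleisli composite: first f, then g."""
    return {x: push(g, fx) for x, fx in f.items()}

def dagger(f, mu):
    """Bayesian inverse by Bayes' rule (all nu-weights are assumed non-zero)."""
    nu = push(f, mu)
    return {y: {x: f[x].get(y, F(0)) * mu[x] / nu[y] for x in mu} for y in nu}

def det(fn, points):
    """Deterministic kernel induced by a function."""
    return {x: {fn(x): F(1)} for x in points}

def same(k1, k2):
    """Equality of kernels, treating missing entries as zero."""
    return set(k1) == set(k2) and all(
        k1[a].get(b, F(0)) == k2[a].get(b, F(0)) for a in k1 for b in set(k1[a]) | set(k2[a]))

mu = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
f = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {1: F(1)},
     2: {0: F(1, 4), 2: F(3, 4)}, 3: {0: F(1, 3), 1: F(1, 3), 2: F(1, 3)}}
nu = push(f, mu)
p, q = det(lambda x: x // 2, mu), det(lambda y: min(y, 1), nu)
p_dag, q_dag, f_dag = dagger(p, mu), dagger(q, nu), dagger(f, mu)

f_low = compose(q, compose(f, p_dag))                    # q . f . p-dagger
f_up = compose(q_dag, compose(f_low, p))                 # q-dagger . q . f . p-dagger . p
inv_low = compose(p, compose(f_dag, q_dag))              # p . f-dagger . q-dagger
inv_up = compose(p_dag, compose(inv_low, q))             # p-dagger . p . f-dagger . q-dagger . q
```

Inverting the approximants with respect to the pushed-forward prior (for `f_low`) or the original prior (for `f_up`) gives back `inv_low` and `inv_up` respectively.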
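The fibre-averaging reading of the approximation is straightforward to compute on a finite space. The following sketch averages the rows of a kernel over the fibres of a coarsening map, weighted by µ; it only handles fibres of strictly positive mass, and the name `fibre_average` and the example data are assumptions made for illustration.

```python
from fractions import Fraction as F

def fibre_average(f, mu, p):
    """Replace each row f(x) by the mu-weighted average of the rows over the
    fibre of p containing x. Only fibres of strictly positive mass are handled."""
    mass, total = {}, {}
    for x, w in mu.items():
        k = p(x)
        mass[k] = mass.get(k, F(0)) + w
        row = total.setdefault(k, {})
        for y, q in f[x].items():
            row[y] = row.get(y, F(0)) + w * q
    return {x: {y: t / mass[p(x)] for y, t in total[p(x)].items()} for x in mu}

mu = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
f = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {1: F(1)},
     2: {0: F(1, 4), 2: F(3, 4)}, 3: {0: F(1, 3), 1: F(1, 3), 2: F(1, 3)}}

coarse = fibre_average(f, mu, lambda x: x // 2)   # fibres {0, 1} and {2, 3}
finest = fibre_average(f, mu, lambda x: x)        # every fibre is a singleton
```

The approximant is constant on each fibre, and when every fibre is a singleton the kernel is left unchanged.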
Approximating is non-expansive. It is well-known that conditional expectations are non-expansive and we know from Lemma 8 that pre-composing by p † µ •p as in (11) amounts to conditioning. The following lemma is an easy consequence.
Lemma 31. Let f : (X, µ) ⇝ (Y, ν) and q : X → X′ be a deterministic quotient, then for all 1 ≤ p ≤ ∞ and ϕ ∈ L_p(Y, ν)
\[ \| S_p(f \bullet q^\dagger \bullet q)(\phi) \|_p \le \| S_p(f)(\phi) \|_p. \]
Compositionality of approximations. In the case where we wish to approximate a composite kernel g • f, it might be convenient, for modularity reasons, to approximate f and g separately. This does not entail any loss of information provided the quotient maps are hemi-bisimulations, in the following sense. Let p : X → X′, q : Y → Y′, r : Z → Z′ be deterministic quotients and let f : (X, µ) ⇝ (Y, ν), g : (Y, ν) ⇝ (Z, ρ) be composable kernels. We say that q is a left hemi-bisimulation for f if f = q† • q • f, and conversely that it is a right hemi-bisimulation for g if g = g • q† • q holds. In either case, one can verify using Theorems 7 and 29 that approximation commutes with composition, i.e. that (g • f)^{p,r} = g^{q,r} • f^{p,q}.
Discretization schemes. We will use (10) and (11) to build sequences of arbitrarily good approximations of kernels. For this we introduce the following terminology.
Definition 32. We define a discretization scheme for an SB-space X to be a countable co-directed diagram (ccd) of finite spaces for which X is a cone (not necessarily a limit).
If (X_i)_{i∈I} is a discretization scheme of X and p_i : X → X_i are the maps making X a cone, then it follows from the definition that if i < j, σ(p_i) ⊆ σ(p_j), where σ(p_i) is the σ-algebra generated by p_i. For each i ∈ I the finite quotient p_i defines a measurable partition of X whose disjoint components p_i^{-1}({k}), k ∈ X_i, we will call cells. By Theorem 1 every SB-space has a discretization scheme for which it is not just a cone but a limit.
In practice we will work with discretization schemes linearly ordered by N. In this case the sequence (X, σ(p_n))_{n∈N} defines what probabilists call a filtration, and we will denote the approximation f^{p_n} given by (11) simply by f^n.

Convergence
We now turn to the question of convergence of approximations. There appears to be little literature on the subject of the convergence of approximations of Markov kernels. One rare reference is [5]. Via the functor S p defined above in Sections 3 and 4 we can seek a topology in terms of the operators associated to a sequence of kernels. Indeed, following [5], we will prove convergence results for the Strong Operator Topology (SOT).
Definition 33. We will say that a sequence of kernels f_n : (X, µ) ⇝ (Y, ν) converges to f : (X, µ) ⇝ (Y, ν) in strong operator topology, and write f_n →_s f, if S_1 f_n converges to S_1 f in the strong operator topology, i.e. if
\[ \lim_{n\to\infty} \| S_1 f_n(\phi) - S_1 f(\phi) \|_1 = 0 \quad \text{for all } \phi \in L_1(Y,\nu). \]
Proving convergence. We start with the following key lemma, which is a consequence of Lévy's upward convergence Theorem ([25, Th. 14.2]).
Lemma 34. Let f : (X , µ) (Y , ν ) be a Krn-morphism and let p n : X → X n , n ∈ N be a discretization scheme such that for B X the Borel σ -algebra of X we have

Theorem 35 (Convergence of Approximations Theorem).
Under the conditions of Lemma 34, for µ-almost every x ∈ X lim In other words f n −→ s f .
Note that operators of the shape S_p f_n obtained from a discretization scheme are finite rank operators. We have thus also obtained a theorem for approximating stochastic operators by stochastic operators of finite rank in the SOT. In general, we cannot hope for convergence in the stronger norm topology, since the identity operator (which is stochastic) is a limit of operators of finite rank in the norm topology iff the space is finite dimensional.
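The convergence can be observed on a toy example. In the sketch below X has eight points with the uniform measure, the discretization scheme is the dyadic one (blocks of size 8, 4, 2, 1), and the approximant at level n is assumed to be the fibre average of f over the n-th partition; we then compute the L_1 distance between the images of one fixed predicate under the two predicate transformers. All names and data are illustrative.

```python
from fractions import Fraction as F

X = range(8)
mu = {x: F(1, 8) for x in X}
f = {x: {0: 1 - F(x, 7), 1: F(x, 7)} for x in X}   # kernel from X to {0, 1}
phi = {0: F(0), 1: F(1)}                           # indicator of the point 1

def fibre_average(f, mu, p):
    """mu-weighted average of the rows of f over the fibres of p (positive-mass fibres)."""
    mass, total = {}, {}
    for x, w in mu.items():
        k = p(x)
        mass[k] = mass.get(k, F(0)) + w
        row = total.setdefault(k, {})
        for y, q in f[x].items():
            row[y] = row.get(y, F(0)) + w * q
    return {x: {y: t / mass[p(x)] for y, t in total[p(x)].items()} for x in mu}

def predicate_transformer(f, phi):
    """The function x -> integral of phi against f(x)."""
    return {x: sum(q * phi[y] for y, q in row.items()) for x, row in f.items()}

def l1_error(n):
    """L1(mu) distance between the transformed predicate at level n and the exact one."""
    p_n = lambda x: x >> (3 - n)          # dyadic blocks of size 2 ** (3 - n)
    approx = predicate_transformer(fibre_average(f, mu, p_n), phi)
    exact = predicate_transformer(f, phi)
    return sum(mu[x] * abs(approx[x] - exact[x]) for x in X)

errors = [l1_error(n) for n in range(4)]
```

For this predicate the error halves at each refinement and vanishes once the partition separates all points.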
Note also that the various relationships established in Section 4 allow us to move from an approximation of a kernel to an approximation of the corresponding Markov operator. Since a discretization scheme making f n −→ s f will also make (f † ) n −→ s f † , it follows from Theorem 25 that we get a finite rank approximation of the Markov operator M ≪· (f ).

Approximate Bayesian Inference
Consider again the inference problem from the introduction. There one needed to invert f(x) = N(x, 1) with prior µ = N(0, 1). We can use Theorem 29 to see how our approximate Bayesian inverse compares to the exact solution, which in this simple case is known to be f†_µ(0.5) = N(1/4, 1/2). To do this, we use a doubly indexed discretization scheme q_mn : R → 2 × m × n + 2, which defines a window of width 2m centred at 0 and divided into 2mn equal intervals, with the remaining intervals (−∞, −m] and (m, ∞) each sent to a point (hence the +2 above).
The approximant f_{m,n} := q_mn • f • q_mn† given by (10) is then the stochastic matrix
\[ f_{m,n}([k])([l]) = \frac{1}{\mu([k])} \int_{[k]} f(x)([l])\, d\mu(x), \]
where [k], [l] range over classes of q_mn. The corresponding stochastic matrices are shown in Fig. 2 and 3 for m, n = 5, 3 and 6, 10 respectively.
Since these approximants are finite, their Bayesian inverse can be computed directly by Bayes theorem (i.e. taking the adjoint of the stochastic matrices):
\[ f_{m,n}^\dagger([l])([k]) = \frac{f_{m,n}([k])([l])\, \mu([k])}{\nu([l])}, \]
with ν = f_*(µ). Commutation of inversion and approximation guarantees that the f_{m,n}† converge to f†. Indeed, Fig. 1 shows the Lebesgue density of f_{m,n}†(0.5) for m, n = 3, 2 (in dashed blue) and 7, 5 (dashed red). The latter approximant is already hardly distinguishable from the exact solution (solid black).
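The computation can be reproduced numerically. The sketch below is one possible implementation and not the one used to produce the figures: each unbounded tail cell is truncated to a finite length `pad`, integrals over cells are approximated by the midpoint rule, and only the column of the joint matrix corresponding to the cell containing the observation is computed, since this is all that Bayes' rule needs. The function names are hypothetical.

```python
import math

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cells(m, n, pad=8.0):
    """Cells of the discretization: two tails plus 2mn intervals of width 1/n on [-m, m).
    ASSUMPTION: each unbounded tail is truncated to length `pad` for quadrature."""
    edges = [-m + i / n for i in range(2 * m * n + 1)]
    inner = list(zip(edges[:-1], edges[1:]))
    return [(-m - pad, float(-m))] + inner + [(float(m), m + pad)]

def joint_mass(cell_x, cell_y, steps=16):
    """Integral over cell_x of f(x)(cell_y) against the prior N(0, 1), by the midpoint rule.
    This is the matrix entry for (cell_x, cell_y) multiplied by the prior mass of cell_x."""
    (a, b), (c, d) = cell_x, cell_y
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += std_normal_pdf(x) * (std_normal_cdf(d - x) - std_normal_cdf(c - x)) * h
    return total

def approx_tail_probability(m, n, y_obs=0.5, threshold=1.0):
    """Approximate posterior probability of x > threshold given the cell containing y_obs.
    Assumes |y_obs| < m and that threshold is a grid point of the discretization."""
    cs = cells(m, n)
    l = 1 + math.floor((y_obs + m) * n)       # index of the cell containing y_obs
    joint = [joint_mass(cx, cs[l]) for cx in cs]
    nu_l = sum(joint)                         # mass of the observation cell
    posterior = [j / nu_l for j in joint]     # Bayes' rule on the finite space
    first = 1 + round((threshold + m) * n)    # first cell to the right of threshold
    return sum(posterior[first:])

exact = 0.5 * math.erfc(0.75)                 # tail of N(1/4, 1/2) above 1
coarse = approx_tail_probability(3, 2)
fine = approx_tail_probability(7, 5)
```

With these choices the finer scheme lands much closer to the exact posterior probability than the coarser one, in line with the behaviour described for the densities.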
It must be emphasized that this example is meant only as an illustration and does not constitute a universal solution to the irreducibly hard (not even computable in general [1]) problem of performing Bayesian inversion. Also, not all quotients are equally convenient: what makes the approach computationally tractable is that the fibres are easily described and the measure conveniently evaluated on such fibres.

Approximating the Kleene star of ProbNetKAT
ProbNetKAT ([14,22]) is a probabilistic network specification language extending Kleene Algebras with Tests ([18]) with network primitives and a binary probabilistic choice operator ⊕_λ, λ ∈ [0, 1]. For the purpose of the example shown here we will not need to introduce the full syntax and semantics of ProbNetKAT; rather, we will focus on a single ProbNetKAT program, which we will call cantor and which is given by:
\[ \mathtt{cantor} := p;(\mathtt{dup};p)^* \quad \text{where} \quad p := \pi_0! \oplus_{1/2} \pi_1! \tag{13} \]
The program acts on sets of finite sequences of 0 and 1, which can be thought of as packet histories. We will write H for the set {0, 1}* of all packet histories and H_n for the set of histories of length at most n. A ProbNetKAT program is always interpreted as a kernel 2^H → G2^H. Programs with both dup and * proved to be quite complex from the earliest development of the language. As we will describe, cantor denotes a continuous distribution, and hence having a way to approximate it is crucial for practical uses of the language. The denotation of π_0! on a single sequence {(a_0, ..., a_n)} is:
\[ [\![\pi_0!]\!](\{(a_0, \dots, a_n)\}) = \delta_{\{(0, a_1, \dots, a_n)\}}, \]
in other words π_0! overwrites the first entry in the sequence with 0. Similarly, π_1! overwrites the first entry with 1. This semantics is extended to sets of sequences in the obvious way by taking direct images. The semantics of p is thus:
\[ [\![p]\!](a) = 0.5\, \delta_{\pi_0!(a)} + 0.5\, \delta_{\pi_1!(a)}. \]
The denotation of dup is given on singleton histories by
\[ [\![\mathtt{dup}]\!](\{(a_0, \dots, a_n)\}) = \delta_{\{(a_0, a_0, \dots, a_n)\}}, \]
i.e. dup shifts the history to the right and duplicates the first entry. Again, this is extended to sets of histories by taking direct images. The sequential composition operator ; is interpreted by Kleisli composition.
The interpretation of the Kleene star is more involved, and we describe it here categorically. To avoid any confusion we will not use Kleisli arrows in this construction, i.e. all kernels will be explicitly typed as kernels. Note first that the infinite product $(2^H)^\omega$ can be defined as the limit of the ccd given by the maps $q_{n+1,n} : (2^H)^{n+1} \to (2^H)^n$ dropping the last component. By Bochner's theorem ([7]) this also holds of $G((2^H)^\omega)$. Next, consider any program $r$. We turn $2^H$ into a cone for the diagram with limit $G((2^H)^\omega)$ via the inductively defined maps $a_n = a_{n-1} \otimes r \bullet \Delta_n$, where $\Delta_n : (2^H)^n \to (2^H)^n \times 2^H$ is the map copying the last entry. It is easy to check that $q_{n+1,n} \bullet a_n \bullet a_{n-1} \bullet \dots \bullet a_1 = a_{n-1} \bullet \dots \bullet a_1$, and the diagram described by the morphisms $b_n := a_n \bullet \dots \bullet a_1 : 2^H \to G(2^H)^n$ makes $2^H$ a cone for $\varprojlim G(2^H)^n$. There must therefore exist a unique morphism $r^\infty : 2^H \to G((2^H)^\omega)$. For each input, this kernel builds a distribution on the sample paths of the discrete-time stochastic process associated with $r$ and this input.

We now define $r^* := G(\bigcup) \bullet r^\infty$, where $\bigcup : (2^H)^\omega \to 2^H$ takes the union of a sequence of sets of histories. Since the definition above makes sense for any kernel $f$ on $2^H$, we will overload the Kleene star and put $f^* := G(\bigcup) \bullet f^\infty$. Given the input (0), a sample path of cantor will draw uniformly a history of size 1, then a history of size 2 whose suffix matches the size 1 history drawn at the previous step, and so on for every integer. The distribution cantor(0) associates to a measurable collection of sets of histories $A$ the probability that the union of a sample path from (0) belongs to $A$. For example $\mathtt{cantor}(0)\{A \mid (01) \in A\} = 1/4$, since there is a 1/4 chance that a sample path will have drawn (01) amongst the histories of size 2.

We start by turning $2^H$ into a Krn-object. Consider the countable directed diagram given by all injections $i_{mn} : H_m \to H_n$, $n > m$; then $H = \varinjlim H_n$, and it follows that $2^H = \varprojlim 2^{H_n}$ since $2^{-}$ turns colimits into limits.
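The sample-path reading of the Kleene star given above can be illustrated by truncating the process after finitely many stages. The sketch below is only an approximation of the construction under stated assumptions: histories are non-empty bit tuples, the path starts with the output of p on the input, each later stage applies dup ; p to the last entry, and the names `extend` and `truncated_star` are hypothetical.

```python
from fractions import Fraction
from collections import defaultdict

def p(state):
    """Fair choice between overwriting the first entry with 0 or with 1."""
    out = defaultdict(Fraction)
    for bit in (0, 1):
        out[frozenset((bit,) + h[1:] for h in state)] += Fraction(1, 2)
    return dict(out)

def r(state):
    """The loop body dup ; p acting on a set of non-empty histories."""
    duplicated = frozenset((h[0],) + h for h in state)
    return p(duplicated)

def extend(paths, kernel):
    """Append one kernel step, drawn from the last entry of each path."""
    out = defaultdict(Fraction)
    for path, w in paths.items():
        for nxt, v in kernel(path[-1]).items():
            out[path + (nxt,)] += w * v
    return dict(out)

def truncated_star(initial, kernel, stages):
    """Distribution of the union of the first `stages` entries of a sample path."""
    paths = {(s,): w for s, w in initial.items()}
    for _ in range(stages - 1):
        paths = extend(paths, kernel)
    unions = defaultdict(Fraction)
    for path, w in paths.items():
        unions[frozenset().union(*path)] += w
    return dict(unions)

dist = truncated_star(p(frozenset({(0,)})), r, stages=3)
prob_01 = sum(w for s, w in dist.items() if (0, 1) in s)
```

Since histories of size 2 are only drawn at the second stage, any truncation with at least two stages already gives the value 1/4 quoted for the event that (01) belongs to the union.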
We know from Bochner's theorem that $G2^H = \varprojlim G2^{H_n}$, and we use this fact to place a canonical measure on $2^H$ as follows: since each $2^{H_n}$ is finite with cardinality $c_n := 2^{\sum_{i=0}^{n} 2^i} = 2^{2^{n+1}-1}$, and can thus be equipped with the uniform measure assigning mass $1/c_n$ to each point, we can find a limit measure $\mu$ on $2^H$ with the pleasing property that for all history truncating maps $p_n : 2^H \to 2^{H_n}$, the pushforward $\mu_n := (p_n)_* \mu$ is the uniform measure on $2^{H_n}$. It is clear that these maps define a discretization scheme on $2^H$ which satisfies the condition of Theorem 35. We will now show that if $f_n \xrightarrow{s} f$, then $(f_n)^* \xrightarrow{s} f^*$. To prove this we need the following lemma, which is interesting in its own right.
Lemma 36. The monoidal structure of Krn is continuous for the SOT, i.e. $f_n \xrightarrow{s} f$ and $g_n \xrightarrow{s} g$ implies $f_n \otimes g_n \xrightarrow{s} f \otimes g$.
Theorem 37. Under the set-up described above, for any kernel $f : (2^H, \mu) \to (2^H, \nu)$ we have $(f_n)^* \xrightarrow{s} f^*$.
The advantage of working over finite spaces is that $(f_n)^*$ can, in principle at least, be computed for kernels defined in ProbNetKAT. Let us examine this in the case of cantor and of the discretization scheme $p_n : 2^H \to 2^{H_n}$.
In the case n = 3 the underlying Markov chain has $2^7$ states, but it has an interesting property which means we need not consider them all: when we compute $p ; (\mathtt{dup}; p)_3^\infty$, the process necessarily lands in an ergodic component of the chain consisting of the singletons of histories of length exactly 3. The reason is that once the process reaches histories of length 3 it starts randomly rewriting the histories, and with probability 1 any two histories will eventually get rewritten to the same thing. Once a set of histories has decreased in cardinality by one, it can never go back; thus eventually any set of histories gets rewritten to a single length 3 history, and then loops among length 3 singletons indefinitely. The situation is represented from the initial state (0) in Figure 4 where, for clarity's sake, the ergodic component is symbolized by common double-sided arrows to a new state. In other words, at n = 3 we have the first two steps in the construction of the Cantor distribution towards which cantor converges.
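The counting argument behind the canonical measure on $2^H$ can be checked by brute force for small n. The sketch below rests on two assumptions that are not spelled out in the text: $H_n$ is taken to include the empty history (which matches the count $2^{n+1}-1$), and the map between levels is modelled as the restriction $A \mapsto A \cap H_n$. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product, combinations
from collections import Counter

def histories(n):
    """All binary histories of length at most n, including the empty one."""
    return [h for k in range(n + 1) for h in product((0, 1), repeat=k)]

def powerset(items):
    """All subsets of a finite list, as frozensets."""
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def pushforward_of_uniform(n_big, n_small):
    """Push the uniform measure on subsets of H_{n_big} along A -> A & H_{n_small}."""
    small = set(histories(n_small))
    big_sets = powerset(histories(n_big))
    counts = Counter(a & small for a in big_sets)
    return {s: Fraction(c, len(big_sets)) for s, c in counts.items()}

c1 = len(powerset(histories(1)))   # number of subsets of H_1
c2 = len(powerset(histories(2)))   # number of subsets of H_2
pushed = pushforward_of_uniform(2, 1)
```

Under these assumptions the counts agree with $c_n = 2^{2^{n+1}-1}$ for n = 1, 2, and the pushforward of the uniform measure on the finer level is again uniform on the coarser one, which is the compatibility needed to obtain a limit measure.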

Conclusion
We have presented a framework for the exact and approximate semantics of first-order probabilistic programming. The semantics can be read off either in terms of kernels between measured spaces, or in terms of operators between $L_p$ spaces. Both forms come with related involutive structures: Bayesian inversion for (measured) kernels between standard Borel spaces, and Köthe duality for positive linear and σ-continuous operators between Banach lattices. Functorial relations between both forms can themselves be related by way of natural isomorphisms. Our main result is the convergence of general systems of finite approximants in terms of the strong operator topology (the SOT theorem). Thus, in principle, one can compute arbitrarily good approximations of the semantics of a probabilistic program of interest for any given (measurable) query. Future work may allow one to derive stronger notions of convergence given additional Lipschitz control on kernels, or to develop approximation schemes that are adapted to the measured kernel of interest.
More ambitiously perhaps, one could investigate whether MCMC sampling schemes commonly used to perform approximate Bayesian inference in the context of probabilistic programming could be seen as randomized approximations of the type considered in this paper.
Proof of Lemma 3. By Dynkin's π-λ theorem, two finite measures are equal if and only if they agree on a π-system generating the σ-algebra. Any standard Borel space admits such a countable π-system (any countable basis for a Polish topology generating the σ-algebra). Let $\{B_n\}_{n \in \mathbb{N}}$ be such a π-system. Then, for all $x \in X$, $g(x) \neq g'(x) \Leftrightarrow \exists n.\, g(x)(B_n) \neq g'(x)(B_n)$. Hence, Clearly, for any space V and any deterministic function u : It is now enough to show that $\lambda(N(g \bullet f, g' \bullet f)) = 0$. Let us reason contrapositively. We have:

■
Proof of Theorem 6. If $\phi$ is ν-integrable, there exists a monotone sequence $\{\phi_n\}$ of simple functions such that $\phi_n \uparrow \phi$ and $\int_Y \phi_n \, d\nu \to \int_Y \phi \, d\nu < \infty$. By definition each $\phi_n = \sum_{i=0}^{k} \alpha_i 1_{B_i}$, and by unravelling the definition we have ∫ and the result follows from the Monotone Convergence Theorem (MCT).

Proof of Theorem 9.
It follows by definition of $f^\dagger$ and from the disintegration theorem that ∫ from which Eq. 4 follows easily. It remains to prove that this uniquely characterizes $f^\dagger$. Let us reason contrapositively. Assume there exists $g : (Y, \nu) \to (X, \mu)$ verifying for all A, B measurable $\int_{y \in Y} g(y)(A) \cdot 1_B(y) \, d\nu = \gamma_f(A \times B)$ as in Eq. 16 and such that $\nu(N(f^\dagger, g)) > 0$ (assuming we take some representative of $f^\dagger$). Let $\{A_n\}_{n \in \mathbb{N}}$ be a countable π-system generating the σ-algebra of X. It is enough to test equality of measures on X on this π-system. Therefore,

■
Proof of Theorem 10.
Let us first show that $(-)^\dagger$ is a functor Krn → Krn$^{op}$, i.e. that $\mathrm{id}_{(X,\mu)}^\dagger = \mathrm{id}_{(X,\mu)}$ and that for any f : Let $(X, \mu)$ be an object of Krn and $\mathrm{id}_{(X,\mu)}$ the corresponding identity. By Th. 9, it is enough to prove, for all A, A′ measurable subsets of X, that ∫ The same calculation on the right hand side of the first equation yields trivially the same result. Hence the equality is verified.
Now, on to compatibility w.r.t. composition. In light of Th. 9, it is enough to show that for all A ⊆ X, In the following, for X a measurable space, we denote by SF(X) the set of simple functions over X (finite linear combinations of indicator functions of measurable sets). We will repeatedly use the monotone convergence theorem (MCT). The left hand side of the above equation can be re-written as: ∫ x ∈X (1) is because $g_n \uparrow g(-)(C)$, $g_n \in SF(Y)$ and (2) by monotone convergence. Note that the n-indexed family $x \mapsto \int_{y \in Y} g_n(y) \, df(x)$ is pointwise increasing. Therefore, We have proved the sought identity.
Finally let us show that $(-)^\dagger$ is involutive, i.e. that for any $f : (X, \mu) \to (Y, \nu)$, $(f^\dagger)^\dagger = f$. This follows easily by two applications of Th. 9: we have The fact that $(f \otimes g)^\dagger = f^\dagger \otimes g^\dagger$ follows immediately from the definitions and the property of disintegrations given by Th. 9. The fact that the associator, unitors and braiding transformations are unitary follows immediately from the fact that they are deterministic isomorphisms and Th. 7.

■
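On finite spaces the statements of Theorem 10 can be checked by direct computation. The sketch below assumes the familiar finite form of Bayesian inversion, $f^\dagger(y)(x) = f(x)(y)\,\mu(x)/\nu(y)$ with $\nu = f \bullet \mu$ and $\nu(y) > 0$; kernels are nested dictionaries with exact rational entries, and the names `push`, `dagger` and `compose` as well as the example data are hypothetical.

```python
from fractions import Fraction as F

def push(f, mu):
    """Pushforward f . mu of a finite measure mu through a kernel f."""
    nu = {}
    for x, m in mu.items():
        for y, w in f[x].items():
            nu[y] = nu.get(y, F(0)) + m * w
    return nu

def dagger(f, mu):
    """Bayesian inverse of f : (X, mu) -> (Y, f . mu), on points of positive mass."""
    nu = push(f, mu)
    return {y: {x: f[x].get(y, F(0)) * mu[x] / nu[y] for x in mu}
            for y in nu if nu[y] > 0}

def compose(g, f):
    """Kernel composition: g after f."""
    return {x: push(g, fx) for x, fx in f.items()}

# Example data: a prior mu on {a, b} and two composable kernels f and g.
mu = {"a": F(1, 3), "b": F(2, 3)}
f = {"a": {0: F(1, 2), 1: F(1, 2)}, "b": {0: F(1, 4), 1: F(3, 4)}}
g = {0: {"u": F(1, 5), "v": F(4, 5)}, 1: {"u": F(1, 2), "v": F(1, 2)}}
nu = push(f, mu)

involutive = dagger(dagger(f, mu), nu) == f
contravariant = dagger(compose(g, f), mu) == compose(dagger(f, mu), dagger(g, nu))
```

With exact rational arithmetic both comparisons are equalities of dictionaries rather than approximate checks, so for this example they directly confirm $(f^\dagger)^\dagger = f$ and $(g \bullet f)^\dagger = f^\dagger \bullet g^\dagger$.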
Proof of Proposition 20. Let $B \subseteq Y$ be a measurable set. By definition, we have $(f \bullet \rho)(B) = \int_X ev_B \bullet f \, d\rho$ where we recall that $ev_B : G(X) \to \mathbb{R}^+$ is the evaluation morphism. Let $\{f^B_n\}_{n \in \mathbb{N}}$ be an increasing chain of simple functions converging pointwise to $ev_B \bullet f$ such that for each n, $f^B_n = \sum_{i=1}^{k_n} \alpha^n_i 1_{A^n_i}$ with $\alpha^n_i \geq 0$. By the MCT, Similarly, Notice that since the integral is linear and the sequence $\{f^B_n\}_n$ is increasing, the sequences $\{\int_X f^B_n \, d\rho\}_n$ and $\{\int_X f^B_n \, d\mu\}_n$ are also increasing. Assume $\nu(B) = 0$. Then for all n, $\int_X f^B_n \, d\mu = 0$. We deduce that for all n, for all $1 \leq i \leq k_n$, either $\alpha^n_i = 0$ or $\mu(A^n_i) = 0$. Using that $\rho \ll \mu$, we deduce that for all $1 \leq i \leq k_n$, either $\alpha^n_i = 0$ or $\rho(A^n_i) = 0$, from which we conclude that for all n, $\int_X f^B_n \, d\rho = 0$ and finally, $(f \bullet \rho)(B) = 0$. Hence, $f \bullet \rho \ll \nu$.

■
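The content of Proposition 20 is easy to see on finite spaces, where absolute continuity reduces to an inclusion of supports. The following sketch uses hypothetical names and example data, and assumes $\nu = f \bullet \mu$ as in the proof above.

```python
from fractions import Fraction as F

def push(f, mu):
    """Pushforward f . mu of a finite measure mu through a kernel f."""
    nu = {}
    for x, m in mu.items():
        for y, w in f[x].items():
            nu[y] = nu.get(y, F(0)) + m * w
    return nu

def absolutely_continuous(rho, mu):
    """On a finite space: rho << mu iff rho vanishes wherever mu does."""
    points = set(rho) | set(mu)
    return all(rho.get(x, 0) == 0 for x in points if mu.get(x, 0) == 0)

# A deterministic kernel on three points, a reference measure mu, and two
# candidate measures: rho is dominated by mu, rho_bad is not.
f = {"a": {0: F(1), 1: F(0), 2: F(0)},
     "b": {0: F(0), 1: F(1), 2: F(0)},
     "c": {0: F(0), 1: F(0), 2: F(1)}}
mu = {"a": F(1, 2), "b": F(1, 2), "c": F(0)}
rho = {"a": F(1), "b": F(0), "c": F(0)}
rho_bad = {"a": F(0), "b": F(0), "c": F(1)}

nu = push(f, mu)
dominated = absolutely_continuous(push(f, rho), nu)
not_dominated = absolutely_continuous(push(f, rho_bad), nu)
```

Here `rho` is absolutely continuous with respect to `mu` and its pushforward stays dominated by `nu`, whereas `rho_bad` charges a µ-null point and its pushforward charges a ν-null point, which shows that the hypothesis $\rho \ll \mu$ cannot simply be dropped.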
Proof of Theorem 21. We start by proving the following Lemma Proof. We start by showing the equation on characteristic Since $\phi$ is measurable and integrable, there exists a sequence $\phi_n \uparrow \phi$ of simple functions such that $\lim_n \int_Y \phi_n \, d\nu < \infty$, and the result follows by the linearity of integration and the MCT. □
We can now prove the naturality of rn. Let $f : (X, \mu) \to (Y, \nu)$ be a morphism in Krn; we have on the one hand and on the other To show the equality of these two maps in $L_1(Y, \nu)$ it is enough to show that they are equal ν-a.e. To see this, we show that (∗∗) satisfies the condition to be the Radon-Nikodym derivative (∗). Let $B_Y$ be a measurable subset of Y. We have from the well-known property of Radon-Nikodym derivatives: ∫ where (1) is by Lemma 38 and (2) is a well-known property of Radon-Nikodym derivatives.

Proof of Theorem 22.
We start with the following elementary lemma.
Lemma 39. If $\psi, \phi \in L_1(X, \mu)$ then $\int_X \psi\phi \, d\mu = \int_X \psi \, d(mr_{(M,\mu)} \phi)$.
Proof. The proof of naturality now follows easily: it is enough to show the equality in the case where $\psi = 1_{B_X}$ for a measurable subset $B_X$ of X, and the result then extends to all measurable functions by linearity of integrals and the MCT. We have ∫ To show naturality we now let $f : (X, \mu) \to (Y, \nu)$ be a Krn-morphism, $\phi \in L_1(X, \mu)$ and = ∫ Again, we start with a simple but helpful Lemma.
Lemma 40. Let $F \in (S_\infty(X, \mu))^\sigma$ and $\phi \in S_\infty(X, \mu)$, then Proof. Starting with characteristic functions, let $\phi = 1_B$ for some measurable subset B of X. We then have We can then extend the result to simple functions by linearity and then to all functions in $L_\infty(X, \mu)$ by the MCT. □
To show naturality we now let $f : (X, \mu) \to (Y, \nu)$ be a Krn-morphism, $F \in (S_\infty(X, \mu))^\sigma$ and $B_Y$ measurable in Y. We have

Proof of Theorem 24.
We start by showing that fr is well defined. The linearity of $fr_{(X,\mu)}$ is easily checked on simple functions and extended by the MCT. Positivity is also immediate. For the σ-order continuity, let $\mu_m \uparrow \mu$, $\phi \in L_\infty(X, \mu)$, and $\phi_n \uparrow \phi$ be a monotone approximation of $\phi$ by simple functions. We need to show that Note first that the doubly indexed series $\int_X \phi_n \, d\mu_m$ is monotonically increasing in m, since the $\mu_m$ are monotonically increasing. Note also that the differences Let $(X, \mu)$ be a Krn-object, let $F \in (L_\infty(X, \mu))^\sigma$ and let $\phi \in L_\infty(X, \mu)$. We have where the last equality follows from Lemma 40. Similarly, we have $rr_{(X,\mu)} \bullet fr_{(X,\mu)}(\rho)(B_X) = fr_{(X,\mu)}(\rho)(1_{B_X}) = \int_X 1_{B_X} \, d\rho = \rho(B_X)$.

■
Proof of Theorem 28. The case p = 1 has been treated already, for the case of 1 < p < ∞, see for example the proof of Theorem 4.4.1 of [3]. Finally for the case of p = ∞, see Proposition 3.3 of [4].

■
Proof of Proposition 30. Note in (11) that we disintegrate p with respect to two different measures. For notational clarity let us define the endokernels $a := p^\dagger_\mu \bullet p$ and $b := p^\dagger_\nu \bullet p$. The kernel a associates to each $x \in X$ in a fibre $p^{-1}(\{i\})$ the measure $\pi^\dagger_\mu(i)$ supported by this fibre. In particular it is constant on each fibre, and similarly for b. We can now