A Predicate/State Transformer Semantics for Bayesian Learning

This paper establishes a link between Bayesian inference (learning) and predicate and state transformer operations from programming semantics and logic. Specifically, a very general definition of backward inference is given via first applying a predicate transformer and then conditioning. Analogously, forward inference involves first conditioning and then applying a state transformer. These definitions are illustrated in many examples in discrete and continuous probability theory and also in quantum theory.


Introduction
Increasingly, probabilistic programs are used to describe problems in Bayesian inference ([2]), see e.g. [10,19,4,1,21]. The term 'inference' is used for what is informally best called learning. Learning involves updating one's knowledge in the light of certain evidence, typically given via the validity of a certain predicate (which may be a fuzzy one). In this situation one represents knowledge in terms of likelihoods, via a probability distribution (in the discrete case) or a probability measure (in the continuous case). Updating one's knowledge then involves computing a conditional distribution/measure.
Now that the overlap between the (probabilistic) programming community and the Bayesian community is growing, a merging of concepts and techniques can be expected. This paper is an example. It shows how the notions of predicate and state transformer from programming language semantics ([7]) can be used in precisely defining two fundamental notions of learning: backward and forward inference. A conditioning operation, which makes a certain distribution/measure depend on a predicate, also plays a role. In a nutshell, the correspondence can be summarised as follows.

Backward Inference = Predicate transformer, and then Conditioning
Forward Inference = Conditioning, and then State transformer
This connection hopefully works as an Aha Erlebnis, giving a sudden insight. Indeed, predicate transformers work backwards, from predicates on post-states to predicates on pre-states. This is precisely what is at stake in backward inference, as we will demonstrate. Similarly, state transformers work in a forward direction, which is what happens in forward learning.
Strictly speaking, the main contribution of this paper is only one definition, namely of (backward and forward) inference, see Definition 2.1. Contrary to traditional approaches, our formulation is not tied to the probabilistic setting, but works in the context of any effectus, that is, a categorical notion embracing a wide spectrum of computational models: classical, probabilistic and also quantum, see [11,5]. Within the theory of effectuses, predicate and state transformers are well-defined, and predicates (or effects) and states can be nicely organised in state-and-effect triangles, which connect predicates and states via a (dual) adjunction (1), see also [12]. Intriguingly, these triangles correspond to what physicists call the duality between states and effects, referring to the opposite directions in the work of Schrödinger and Heisenberg on quantum foundations. Within this effectus context one can also describe normalisation and conditioning of states in an abstract manner (see [13,5]). Therefore, we believe effectuses form the right setting for developing a general approach to inference.
Still, precisely recognising what is what in this setting is a subtle matter. For instance, what is a predicate, at the abstract level? Traditionally in probability theory 'events' are used as predicates. Formally they are subsets of the sample space, corresponding to 'sharp' predicates on this space. More generally, 'fuzzy' predicates are considered; they are functions taking values in the unit interval [0, 1]. The sharp predicates can then be characterised as the ones taking values in the Boolean subset {0, 1} ⊆ [0, 1]. In discrete probability every distribution is at the same time a fuzzy predicate. This blurs the picture: the confusion between states and predicates is particularly evident in Bayesian network representations, where nodes may play both roles. In continuous probability there is, in principle, a clear distinction between states (probability measures) and predicates (measurable functions to [0, 1]). But again, things easily get mixed up when a state/measure is given by a probability density function (pdf), which looks very much like a predicate. The framework of effectus theory helps in this respect, since it gives a clear distinction between states, as maps of the form 1 → X, and predicates, as maps X → 1 + 1. Only once this perspective is recognised does the role of predicate and state transformers become clear. It is for this reason that we think it is justified to dedicate an entire paper to elaborating and explaining a single definition.
The paper is organised as follows. We first introduce the notions of backward and forward inference in terms of predicate and state transformers and show some basic properties. Then, we concentrate on illustrating the impact and power of our definition in many situations. We show what our abstract setting translates to in discrete and continuous probability theory and also (briefly) in quantum theory. We elaborate many examples of computations of how inference works, and what it produces. Of special interest is the application of our definition of inference in Bayesian networks. It is shown that the forward/backward distinction can be used flexibly, and can describe what inference means at different points in the network.

Backward and forward inference, abstractly
In this section we describe our abstract setup for inference, both in a backward and forward manner. This works in the setting of an effectus: briefly, this is a category with finite coproducts (+, 0) and a final object 1, such that certain diagrams are pullbacks and certain maps are jointly monic. By virtue of these basic requirements, an effectus is able to capture some basic aspects of quantum computation, with probabilistic computation as special case, see [11,5].
States in an effectus C are maps of the form 1 → X and predicates are maps X → 2 = 1 + 1. The set of states Stat(X) of an object X forms a convex set, and the set of predicates Pred(X) on X forms an effect module. States and predicates give rise to a 'state-and-effect triangle' of the form:
[Diagram (1): state-and-effect triangle connecting the effectus C, via the functors Pred and Stat, to effect modules and convex sets.]
We refer to [11] for details about effect modules and convex sets. In the current setting we need the predicate transformer $f^* = \mathrm{Pred}(f)$ and state transformer $f_* = \mathrm{Stat}(f)$ operations associated with a map f : X → Y in the base category C. They are given by pre- and post-composition:
\[ \mathrm{Pred}(Y) \xrightarrow{\;f^* \,=\, (-)\circ f\;} \mathrm{Pred}(X) \qquad\qquad \mathrm{Stat}(X) \xrightarrow{\;f_* \,=\, f\circ(-)\;} \mathrm{Stat}(Y) \]
In concrete examples of effectuses, states are distributions (in the Kleisli category of the distribution monad), or probability measures (in the Kleisli category of the Giry monad), or just states (in C*- or W*-algebras). We will understand states as descriptions of our state of knowledge. Given a predicate p and a state ω on the same object X two definitions are of interest: the validity ω |= p and the conditional state $\omega|_p$. We now distinguish two forms of inference (learning).
Definition 2.1 Backward inference $\omega|_{f^*(q)}$ involves first applying a predicate transformer and then computing a conditional. This applies in situations of the form:
\[ 1 \xrightarrow{\;\omega\;} X \xrightarrow{\;f\;} Y \xrightarrow{\;q\;} 2 \tag{3} \]
More explicitly, one first applies the predicate transformer $f^*$ to the predicate q on Y, and then computes the backwardly inferred conditional state $\omega|_{f^*(q)}$ on X. Forward inference $f_*(\omega|_p)$ is first computing a conditional and then applying a state transformer. This works in a situation:
\[ 1 \xrightarrow{\;\omega\;} X \xrightarrow{\;f\;} Y \qquad \text{with a predicate } X \xrightarrow{\;p\;} 2 \tag{4} \]
In this case the conditional state on X is $\omega|_p$, and applying the state transformer $f_*$ gives the forwardly inferred state $f_*(\omega|_p)$ on Y.
In the trivial case where the map f is the identity there is no difference between backward and forward inference. Inference then just involves updating a state (of knowledge). Notice that in backward inference we use a predicate on the codomain of the map f, namely q, and update our knowledge about the state on f's domain X. In forward inference we use a predicate on the domain of f, namely p, and use it to infer more about the state on f's codomain Y. This may also be called 'evidence propagation'.
In the situation (4) we have the following Galois-style equalities for validity, for a state ω on X and a predicate q on Y:
\[ f_*(\omega) \models q \;=\; q \circ f \circ \omega \;=\; \omega \models f^*(q). \]
In general, there are very few 'nice' algebraic properties for conditional states. For instance, we do have $f_*\big(\omega|_{f^*(q)}\big) = (f \circ \omega)|_q$, but only for the special case where the map f is 'pure'. The latter means, for instance in a Kleisli category, that the map comes from the underlying category.
In the remainder of this paper we shall illustrate these forms of inference via several examples, involving various kinds of computation, and including Bayesian networks where the above map f in (3) and (4) arises from a graph (network of conditional probability tables). The composition notation '∘' used above looks deceptively simple, but will each time be interpreted differently in different categories.
This leads to various concrete forms of inference which are all instances of the same pattern.

Inference with discrete probability
We shall write D for the discrete probability monad on the category Set of sets and functions. The set D(X) contains the finite discrete probability distributions ω over X, which we write as formal convex combinations:
\[ \omega = r_1|x_1\rangle + \cdots + r_n|x_n\rangle \qquad \text{where } x_i \in X \text{ and } r_i \in [0,1] \text{ with } \textstyle\sum_i r_i = 1. \]
The 'ket' notation $|x\rangle$ is meaningless syntactic sugar, used to distinguish elements x ∈ X from their occurrence in such formal convex sums. Notice that such ω ∈ D(X) can be identified with functions ω : X → [0, 1] with finite support and with $\sum_x \omega(x) = 1$. This function-description is often more convenient. We shall write K(D) for the Kleisli category of the distribution monad D. Its objects are sets X, and its morphisms X → Y are stochastic matrices, in the form of functions X → D(Y).
We will see later (in Section 3.1) how Bayesian networks can be seen as certain arrows of K(D). For this interpretation, it is important that K(D) forms a symmetric monoidal category, with the following ingredients. The monoidal product ⊗ is defined on objects as the cartesian product × in Set, with unit 1. On arrows f : A → X and g : B → Y, it is defined as
\[ f \otimes g \;=\; \Big( A \times B \xrightarrow{\;f \times g\;} D(X) \times D(Y) \xrightarrow{\;\mathrm{dst}\;} D(X \times Y) \Big) \]
where the map dst sends a pair (ρ, σ) ∈ D(X) × D(Y) to the distribution in D(X × Y) given by (x, y) ↦ ρ(x) · σ(y). The symmetry is induced by the swap map X × Y → Y × X in Set; we will omit its subscripts when X and Y are clear from the context.
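The monoidal product on arrows can be made concrete in a few lines of Python. This is a minimal sketch, assuming distributions are represented as dictionaries from elements to probabilities and Kleisli maps as functions returning such dictionaries. The names `dst` and `tensor` mirror the text, while the maps `f` and `g` are hypothetical examples that do not occur in the paper.

```python
from fractions import Fraction
from itertools import product

def dst(rho, sigma):
    """Product distribution: dst(rho, sigma)(x, y) = rho(x) * sigma(y)."""
    return {(x, y): rho[x] * sigma[y] for x, y in product(rho, sigma)}

def tensor(f, g):
    """Monoidal product of Kleisli maps f : A -> D(X) and g : B -> D(Y)."""
    return lambda ab: dst(f(ab[0]), g(ab[1]))

# Hypothetical Kleisli maps, chosen only for illustration.
def f(a):
    if a:
        return {"x0": Fraction(1, 3), "x1": Fraction(2, 3)}
    return {"x0": Fraction(1)}

def g(b):
    return {"y0": Fraction(1, 2), "y1": Fraction(1, 2)}

# The tensor acts on pairs of inputs and yields a joint distribution.
joint = tensor(f, g)((True, 0))
for outcome, prob in sorted(joint.items()):
    print(outcome, prob)
```

Exact fractions are used so that the probabilities can be compared with hand calculations without rounding.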
We now turn to the description of states and predicates in K(D). Notice that states ω : 1 → X in K(D) can be identified with distributions ω ∈ D(X). Since D(2) ≅ [0, 1] we can identify predicates X → 2 = 1 + 1 in K(D) with functions X → [0, 1], that is, with fuzzy predicates. We will often make both identifications when emphasising the role of states and predicates in a computation.
Given a Kleisli map f : X → D(Y), a state ω ∈ D(X) and a predicate $q \in [0,1]^Y$ we have the following descriptions for state and predicate transformation. They arise from unravelling (Kleisli) composition in K(D).
\[ f_*(\omega) \;=\; \sum_{y \in Y} \Big( \sum_{x \in X} \omega(x) \cdot f(x)(y) \Big) |y\rangle \qquad\qquad f^*(q)(x) \;=\; \sum_{y \in Y} f(x)(y) \cdot q(y) \tag{5} \]
For a distribution ω ∈ D(X) and a predicate $p \in [0,1]^X$ on the same set X we define the validity ω |= p in [0, 1] as:
\[ \omega \models p \;=\; \sum_{x \in X} \omega(x) \cdot p(x). \tag{6} \]
If this validity ω |= p is non-zero, then the conditional state $\omega|_p \in D(X)$ is given as formal convex sum:
\[ \omega|_p \;=\; \sum_{x \in X} \frac{\omega(x) \cdot p(x)}{\omega \models p} \, |x\rangle. \tag{7} \]
We shall describe a familiar medical test example in the current setting. We use the following notational convention. We write a letter D for a certain disease, which is represented as a two-element set $2_D = \{d, d^\perp\}$, where the element d represents occurrence of the disease, and $d^\perp$ represents non-occurrence. A distribution over $2_D$ is, e.g., of the form $\frac{1}{4}|d\rangle + \frac{3}{4}|d^\perp\rangle$, when describing that the disease occurs with probability $\frac{1}{4}$. Similarly we write $2_T$ for a (positive) test, where $2_T = \{t, t^\perp\}$. For each such set $2_A = \{a, a^\perp\}$ we write $A? : 2_A \to [0, 1]$ for the sharp predicate given by A?(a) = 1 and $A?(a^\perp) = 0$.
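The operations for finite distributions (state and predicate transformation, validity, conditioning, and the two forms of inference from Definition 2.1) can be sketched in Python as follows. This is only a sketch under stated assumptions: distributions are dictionaries, predicates are functions into [0, 1], and Kleisli maps are functions returning dictionaries. All function names are ours, and the data `omega`, `f` and `q` at the end are hypothetical, used only to check that validity is preserved when a predicate is pulled back instead of a state being pushed forward.

```python
from fractions import Fraction

def validity(omega, p):
    """omega |= p: the expected value of predicate p in state omega."""
    return sum(w * p(x) for x, w in omega.items())

def condition(omega, p):
    """Conditional state omega|_p; only defined for non-zero validity."""
    v = validity(omega, p)
    if v == 0:
        raise ValueError("validity is zero, conditional state undefined")
    return {x: w * p(x) / v for x, w in omega.items()}

def state_transform(f, omega):
    """f_*(omega): push the state forward along the Kleisli map f."""
    out = {}
    for x, w in omega.items():
        for y, r in f(x).items():
            out[y] = out.get(y, 0) + w * r
    return out

def predicate_transform(f, q):
    """f^*(q): pull the predicate q back along the Kleisli map f."""
    return lambda x: sum(r * q(y) for y, r in f(x).items())

def backward(omega, f, q):
    """Backward inference: predicate transformer first, then conditioning."""
    return condition(omega, predicate_transform(f, q))

def forward(omega, f, p):
    """Forward inference: conditioning first, then state transformer."""
    return state_transform(f, condition(omega, p))

# Hypothetical data, not taken from the paper.
omega = {0: Fraction(1, 4), 1: Fraction(3, 4)}

def f(x):
    if x == 0:
        return {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    return {"a": Fraction(1, 5), "b": Fraction(4, 5)}

def q(y):
    return Fraction(1) if y == "a" else Fraction(1, 3)

# Both routes compute the same validity.
lhs = validity(state_transform(f, omega), q)
rhs = validity(omega, predicate_transform(f, q))
print(lhs, rhs)
```

Representing predicates as functions rather than dictionaries keeps sharp and fuzzy predicates uniform: a sharp predicate simply returns only 0 or 1.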
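Consider the following situation in the Kleisli category K(D).
\[ 1 \xrightarrow{\;\omega\;} 2_D \xrightarrow{\;s\;} 2_T \qquad \text{with} \qquad \omega = \tfrac{1}{100}|d\rangle + \tfrac{99}{100}|d^\perp\rangle, \quad s(d) = \tfrac{9}{10}|t\rangle + \tfrac{1}{10}|t^\perp\rangle, \quad s(d^\perp) = \tfrac{1}{20}|t\rangle + \tfrac{19}{20}|t^\perp\rangle \]
The state ω describes the a priori probability of 1% that someone has the disease. The map s describes the sensitivity of the test: when someone has the disease, the test will be positive in 90% of the cases, and when someone does not have the disease there is still a 5% chance that the test is positive. A basic question is: what is the chance that I have the disease if I test positive? We formalise this by adding the predicate $T? : 2_T \to [0, 1]$, which expresses that there is a positive test. We then compute consecutively the predicate $s^*(T?) : 2_D \to [0, 1]$, the validity $\omega \models s^*(T?)$ and the inferred conditional state $\omega|_{s^*(T?)}$. We use formulas (5), (6), and (7) for backward inference from Definition 2.1:
\[ \begin{array}{rcl} s^*(T?)(d) &=& \tfrac{9}{10}, \qquad s^*(T?)(d^\perp) \;=\; \tfrac{1}{20} \\[4pt] \omega \models s^*(T?) &=& \tfrac{1}{100}\cdot\tfrac{9}{10} + \tfrac{99}{100}\cdot\tfrac{1}{20} \;=\; \tfrac{117}{2000} \\[4pt] \omega|_{s^*(T?)} &=& \frac{9/1000}{117/2000}\,|d\rangle + \frac{99/2000}{117/2000}\,|d^\perp\rangle \;=\; \tfrac{18}{117}|d\rangle + \tfrac{99}{117}|d^\perp\rangle \end{array} \tag{8} \]
Hence after a positive test the chance that I have the disease is 18/117 ≈ 15%. This is an instance of backward inference, where an observation on the codomain (the test outcome) changes the state of knowledge about the domain (the disease occurrence). Of course, standard Bayesian methods will arrive at the same outcome. The point is that we can describe these methods here in a uniform, abstract manner via calculations in (Kleisli) categories.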
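The backward inference for the medical test can be replayed with exact fractions. The sketch below hard-codes the prior (1%), the sensitivity (90%) and the false-positive rate (5%) stated in the text; the variable names are ours. It also compares the result with a direct application of Bayes' rule, as a sanity check that both routes agree.

```python
from fractions import Fraction as F

# Prior state on the disease set {d, not_d}.
omega = {"d": F(1, 100), "not_d": F(99, 100)}

# Sensitivity map s from disease status to test outcome.
def s(x):
    if x == "d":
        return {"t": F(9, 10), "not_t": F(1, 10)}
    return {"t": F(1, 20), "not_t": F(19, 20)}

# Sharp predicate T? on test outcomes: 1 on a positive test, 0 otherwise.
def T(y):
    return 1 if y == "t" else 0

# Step 1: predicate transformer, giving s^*(T?) as a predicate on diseases.
pulled = {x: sum(r * T(y) for y, r in s(x).items()) for x in omega}

# Step 2: validity of the transformed predicate in the prior.
v = sum(omega[x] * pulled[x] for x in omega)

# Step 3: conditioning yields the backwardly inferred state.
posterior = {x: omega[x] * pulled[x] / v for x in omega}

# Cross-check with Bayes' rule: P(d | t) = P(t | d) * P(d) / P(t).
bayes = s("d")["t"] * omega["d"] / v

print(v, posterior["d"], float(posterior["d"]))
```

Note that Python reduces 18/117 to its lowest terms, 2/13, which is the same number.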
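We briefly describe a forward example. Suppose that I know that the chance of having this disease is half as likely for me, for instance because I belong to a particular age group. We model this via the predicate $p : 2_D \to [0, 1]$ given by $p(d) = \frac{1}{2}$ and $p(d^\perp) = 1$. We would like to learn what the probability is of getting a positive test under these circumstances.
We take a step back, and ask ourselves: what is the probability of getting a positive test in general, without the adapted likelihood? This probability is computed via the state transformer $s_*$ from (5), that is, via Kleisli composition in K(D), as:
\[ s_*(\omega) \;=\; \big(\tfrac{1}{100}\cdot\tfrac{9}{10} + \tfrac{99}{100}\cdot\tfrac{1}{20}\big)|t\rangle + \big(\tfrac{1}{100}\cdot\tfrac{1}{10} + \tfrac{99}{100}\cdot\tfrac{19}{20}\big)|t^\perp\rangle \;=\; \tfrac{117}{2000}|t\rangle + \tfrac{1883}{2000}|t^\perp\rangle. \]
For forward inference we first compute the conditional state $\omega|_p$ and then push it forward to a state $s_*(\omega|_p)$ on $2_T$.
\[ \begin{array}{rcl} \omega \models p &=& \tfrac{1}{100}\cdot\tfrac{1}{2} + \tfrac{99}{100}\cdot 1 \;=\; \tfrac{199}{200} \\[4pt] \omega|_p &=& \tfrac{1}{199}|d\rangle + \tfrac{198}{199}|d^\perp\rangle \\[4pt] s_*(\omega|_p) &=& \big(\tfrac{1}{199}\cdot\tfrac{9}{10} + \tfrac{198}{199}\cdot\tfrac{1}{20}\big)|t\rangle + \big(\tfrac{1}{199}\cdot\tfrac{1}{10} + \tfrac{198}{199}\cdot\tfrac{19}{20}\big)|t^\perp\rangle \;=\; \tfrac{216}{3980}|t\rangle + \tfrac{3764}{3980}|t^\perp\rangle \end{array} \]
Hence, upon knowing that I have a reduced (halved) risk, my chance of getting a positive test goes down from 117/2000 ≈ 5.8% to 216/3980 ≈ 5.4%. The impact is limited, because I only have a very small chance of having the disease in the first place, and the false positive probability of the test is 5%.
By imposing the predicate p on the disease state ω we adapt the influence of the state ω on the outcome. This may be useful for counterfactual reasoning, see [17]. In this way one can test to what extent a conclusion depends on certain initial states. For instance, if a particular conclusion is reached starting in a state where 70% of the participants are female, then by imposing an additional predicate on this state that changes the gender percentage, one can check if the same conclusion is reached.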
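The forward computation can be checked in the same style. The sketch below uses the numbers from the text (prior 1%, sensitivity 90%, false-positive rate 5%, and the fuzzy predicate with value 1/2 on the disease and 1 otherwise); the helper name `push` and the variable names are ours.

```python
from fractions import Fraction as F

omega = {"d": F(1, 100), "not_d": F(99, 100)}

# Sensitivity map s as a table: disease status -> distribution on outcomes.
SENS = {
    "d": {"t": F(9, 10), "not_t": F(1, 10)},
    "not_d": {"t": F(1, 20), "not_t": F(19, 20)},
}

# Fuzzy predicate p expressing a halved disease risk.
p = {"d": F(1, 2), "not_d": F(1)}

def push(state):
    """State transformer s_*: push a disease state forward to test outcomes."""
    out = {"t": F(0), "not_t": F(0)}
    for x, w in state.items():
        for y, r in SENS[x].items():
            out[y] += w * r
    return out

# Conditioning first ...
v = sum(omega[x] * p[x] for x in omega)
conditioned = {x: omega[x] * p[x] / v for x in omega}

# ... then the state transformer.
before = push(omega)        # positive-test probability without the predicate
after = push(conditioned)   # forwardly inferred state

print(float(before["t"]), float(after["t"]))
```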

Inference in a Bayesian network
Bayesian networks are graph-like structures, widely adopted for the representation of probabilistic relationships between random events. They are usually depicted as directed acyclic graphs with nodes standing for random variables and edges indicating causal dependencies between them. Inference tasks are one of the fundamental uses of these networks. They are typically performed by updating a single node event and then propagating the information to the rest of the network. Computing the inference typically goes through a repeated use of Bayes' rule for conditional probability, see e.g. [16,18,17,2].
In this subsection we show how our abstract account of inference instantiates to the case of Bayesian networks. Our approach predicts the same outcomes as traditional Bayesian inference, but also improves on it in two ways. First, it is more flexible and compositional, as it allows one to focus on single nodes in the same way as on bigger portions of the network, with the same methodology. Second, it is more structured, in the sense that the computations that would require the use of Bayes' rule are carried out by the categorical machinery, essentially by composition of arrows in a category.
In order to illustrate this picture, we will use as a running example the situation of a scientist who wants to publish a paper at a conference. The specification for the corresponding Bayesian network consists of a graph together with conditional probability tables.
[Figure (9): Bayesian network for the paper-acceptance example, with nodes T (time), S (skill), R (results), W (readability), P (positive reviews), M (PC member championing) and A (acceptance), each with its probability table, such as Pr(T) and Pr(S) for the initial nodes.]
The initial conditions of the example estimate whether there is enough time available to prepare the paper (the variable T) and whether the scientist is sufficiently skilled to do the necessary research (S). The results that the scientist is able to obtain (R) depend both on the time and the skill, while how well the paper reads (W) only depends on the time. Both results and readability have an influence on whether the reviews will be positive (P), but results will be more relevant. Similarly, these two factors may lead a PC member to enthusiastically endorse the paper (M), independently of what the reviewers say, although this possibility is quite rare. Finally, acceptance (A) is influenced by the reviews and by the possible endorsement of a PC member.
In order to study inference in this example, we first need to formulate it in more categorical terms. We shall express our Bayesian network (9) as an arrow in the Kleisli category K(D) of the distribution monad D. First, each node N of the graph, say with k incoming edges from nodes $N_1, N_2, \ldots, N_k$, is associated with an arrow $N : 2^k \to 2$ in K(D), which we conveniently write using the same labeling convention for the elements of 2 as in the disease example:
\[ N : 2_{N_1} \otimes \cdots \otimes 2_{N_k} \to 2_N. \]
The probability distributions defining N are given according to the probability table of the node. For instance, the Kleisli map $A : 2_P \otimes 2_M \to 2_A$ for acceptance is defined by:
\[ \begin{array}{rcl} (p, m) &\mapsto& 1|a\rangle \\ (p, m^\perp) &\mapsto& \tfrac{7}{10}|a\rangle + \tfrac{3}{10}|a^\perp\rangle \\ (p^\perp, m) &\mapsto& \tfrac{8}{10}|a\rangle + \tfrac{2}{10}|a^\perp\rangle \\ (p^\perp, m^\perp) &\mapsto& \tfrac{1}{10}|a\rangle + \tfrac{9}{10}|a^\perp\rangle. \end{array} \]
Another example is the initial map $T : 1 \to 2_T$ for the time node, which amounts to the distribution $\frac{4}{10}|t\rangle + \frac{6}{10}|t^\perp\rangle$.
In order to recover the whole network (9), one pastes node-arrows together using the symmetric monoidal structure of K(D), which we recalled in the beginning of this section. Nodes in (9) that have multiple outgoing edges are modeled by composing the corresponding arrow $2^k \to 2$ with the pairing map $\delta : 2 \to 2^2$ defined by $x \mapsto 1|(x, x)\rangle$. The Bayesian network (9) in its entirety is then expressed as the following arrow in K(D), where for simplicity we omit the subscripts naming the elements of each copy of 2.
[Diagram: the network (9) written as a composite of node arrows, pairing maps δ and symmetries in K(D).]
We have written the "structural" arrows vertically. A more insightful representation of the same arrow can be given using the graphical language of string diagrams [20], with $2^k$ depicted as a bundle of k wires and δ as a wire that forks into two. The result almost resembles the original network.
[String diagram (10): the network (9) drawn with the nodes T, S, W, R, P, M and A as boxes connected by wires.]
It may be calculated that the entire arrow 1 → 2 in (10) amounts to the distribution $0.48|a\rangle + 0.52|a^\perp\rangle$ in D(2) ≅ [0, 1]. In words: given a 40% chance that the scientist has enough time available and a 70% chance of being adequately skilled, the probability of having a paper accepted at the conference is about 48%.
We now have everything in place to instantiate our framework for inference. As this example is more elaborate than the previous ones, it gives us the possibility to explore the situation in which knowledge update only involves a segment of the computation, namely f or g in the following partitioned version of (10).
[Diagram: the arrow (10) partitioned into the initial state ω on $2_T \otimes 2_S$ followed by two segments f and g.]
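To make the node-as-Kleisli-map reading concrete, the acceptance node can be encoded as a table and used both as a predicate transformer and as a state transformer. In the sketch below the table `A` reproduces the acceptance probabilities given in the text. The state `sigma` on reviews and championing is a hypothetical stand-in, since the tables of the other nodes are not reproduced here, so the printed numbers illustrate the mechanics and not the 48% result.

```python
from fractions import Fraction as F

# Acceptance node A : 2_P (x) 2_M -> 2_A, as given in the text.
A = {
    ("p", "m"): {"a": F(1), "not_a": F(0)},
    ("p", "not_m"): {"a": F(7, 10), "not_a": F(3, 10)},
    ("not_p", "m"): {"a": F(8, 10), "not_a": F(2, 10)},
    ("not_p", "not_m"): {"a": F(1, 10), "not_a": F(9, 10)},
}

# Sharp predicate A? on 2_A: "the paper is accepted".
def accepted(y):
    return 1 if y == "a" else 0

# Predicate transformer: A^*(A?) is a fuzzy predicate on 2_P (x) 2_M.
pulled = {pm: sum(r * accepted(y) for y, r in dist.items())
          for pm, dist in A.items()}

# HYPOTHETICAL state on 2_P (x) 2_M: positive reviews with probability 1/2,
# championing with probability 1/10, taken independently.
sigma = {(x, y): px * py
         for x, px in {"p": F(1, 2), "not_p": F(1, 2)}.items()
         for y, py in {"m": F(1, 10), "not_m": F(9, 10)}.items()}

# State transformer: probability of acceptance under sigma.
acceptance = sum(sigma[pm] * A[pm]["a"] for pm in sigma)

# Backward inference: update sigma after observing acceptance.
updated = {pm: sigma[pm] * pulled[pm] / acceptance for pm in sigma}

print(acceptance, updated[("p", "m")])
```

Because A? is sharp, the transformed predicate simply reads off the acceptance column of the table.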
In order to formulate a backward inference question, we follow the recipe (3) and introduce a predicate $A? : 2_A \to [0, 1]$ that tests for acceptance of the paper. It is a sharp predicate, defined by A?(a) = 1 and $A?(a^\perp) = 0$.
First we compute $\omega|_{(g \circ f)^*(A?)}$, that is, the chances that the accepted paper actually was submitted by a scientist with an adequate amount of time and skill to write it.
We observe that, after finding out that the paper has been accepted, the chances that the scientist had both sufficient time and skill rise from 28% to 39%. As a second example, we shift the attention from the author to the paper itself. The following state on $2_W \otimes 2_R$ expresses the chances that an accepted paper was actually well written and contained strong scientific results. Note that it mixes state and predicate transformers to bind different segments of the network.
We see that, in our model, roughly one half of the accepted papers had both qualities, and only 10% of them had neither.
Lastly, we consider an example of forward inference. Following the recipe (4), we introduce a predicate $E? : 2_T \otimes 2_S \to [0, 1]$ on the state $\omega : 1 \to 2_T \otimes 2_S$: it expresses the event that, while writing the paper, the scientist finds out that the main result contains a minor mistake and thus needs some revision.
Unlike A?, this E? is a fuzzy predicate: a mistake gets more likely the less time and skill are available to the scientist. If this situation occurs, the scientist may still be able to produce on time a paper that gets accepted, but chances are lower: they decrease from 48% to 43%. This is expressed by the following inference.

Remark 3.1
We have modeled a Bayesian network as a graph in the Kleisli category K (D).This is inspired by the approach of Fong [8], except that he uses the Kleisli category K (G) of the Giry monad (even though all his examples are discrete).Such graphs in K (D) or K (G) can be seen as symmetric monoidal functors from a PROP P, generated by a signature with the nodes and edges of the network, to the Kleisli category.We recall that a PROP (product and permutation category [15]) is a symmetric strict monoidal category with the natural numbers as objects and with monoidal product ⊕ given by addition of numbers.Intuitively, PROPs generalise Lawvere theories from the cartesian to the linear setting; functors from P as above are called the models of P.
In our case, the model P → K(D) sends ⊕ to the monoidal product ⊗ of K(D), and sends the number 1 to the object 2 = 1 + 1 in K(D). P has pairing (copying) maps, but a crucial point is that these copiers are not natural, as can be checked easily in K(D). This implies that P is not a Lawvere theory (cf. [3]), and there is no associated monad on Set.
This monad perspective comes up in the following way. A Bayesian network with set of nodes X can be seen as a coalgebra c : X → B(X). This coalgebra c sends a node N ∈ X to a pair $c(N) = \langle c_1(N), c_2(N) \rangle$, where $c_1(N) \subseteq_{\mathrm{fin}} X$ is a finite set of predecessor nodes of N, and $c_2(N)$ is the associated conditional probability table, a map $2^{c_1(N)} \to D(2)$ as used in the above description of the paper-acceptance example. It is not hard to see that the mapping X ↦ B(X) is a functor on Set, and comes with a unit map X → B(X). But B is not a monad, at least not in the expected obvious sense, precisely because the copiers are not natural.
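The failure of naturality of the copiers can be checked on the time node from the running example. Naturality of δ at the Kleisli map T : 1 → 2 would require that copying the outcome of T gives the same joint distribution as running T twice independently. The sketch below (with names of our choosing) shows that the two joint distributions differ.

```python
from fractions import Fraction as F

# Initial state of the time node, as given in the text: 4/10 |t> + 6/10 |t^perp>.
T = {"t": F(4, 10), "not_t": F(6, 10)}

def copy_then(state):
    """delta after the state: duplicate each outcome, x |-> (x, x)."""
    return {(x, x): w for x, w in state.items()}

def tensor(rho, sigma):
    """Product distribution: two independent draws."""
    return {(x, y): wx * wy for x, wx in rho.items() for y, wy in sigma.items()}

copied = copy_then(T)        # perfectly correlated pair
independent = tensor(T, T)   # independent pair

for pair in sorted(independent):
    print(pair, copied.get(pair, F(0)), independent[pair])
```

The copied state puts no weight on the mixed pairs, whereas the independent one does, so δ does not commute with probabilistic maps.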

Inference with continuous probability
Our abstract description of inference allows us to transfer the definitions from the discrete to the continuous approach simply by switching from the Kleisli category K(D) of the distribution monad to the Kleisli category K(G) of the Giry monad [9] on measurable spaces. We shall sketch an example where the function f in the inference situation (3) is the identity, but where we have multiple predicates $p_i$ for successive learning. Hence there is no predicate/state transformation involved. We describe the essentials and refer to [5] for more information. A state ω : 1 → X in the Kleisli category K(G) is a probability measure ω ∈ G(X), given by a function $\omega : \Sigma_X \to [0, 1]$ that maps measurable subsets to probabilities. A predicate p : X → [0, 1] is a measurable function. The validity ω |= p in [0, 1] and the conditional state $\omega|_p$ in G(X) are given by the following integration formulas, for a measurable subset $M \in \Sigma_X$.
\[ \omega \models p \;=\; \int p \, d\omega \qquad\qquad \omega|_p(M) \;=\; \frac{\int_M p \, d\omega}{\int p \, d\omega} \]
Often the state/probability measure ω that we start from is given by a probability density function. This means that ω is of the form $\phi|_q$, for some measure φ and predicate q.
In that case the conditional state $\omega|_p = (\phi|_q)|_p$ is the same as the conditional state for the product predicate: $\phi|_{q \cdot p}$ with pdf $q \cdot p$. This greatly simplifies the picture below.
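The identification of iterated conditioning with conditioning on the product predicate can be verified directly. The derivation below is a sketch, assuming the conditioning formula $\omega|_p(M) = \int_M p \, d\omega \,/ \int p \, d\omega$ and assuming that $\int q \, d\phi$ is finite and that $\int q \cdot p \, d\phi$ is non-zero. It uses that integrating a function g against $\phi|_q$ amounts to $\int g \cdot q \, d\phi \,/ \int q \, d\phi$.

\[
\begin{aligned}
(\phi|_q)|_p(M)
  &= \frac{\int_M p \; d(\phi|_q)}{\int p \; d(\phi|_q)}
   = \frac{\int_M p \cdot q \, d\phi \,\big/ \int q \, d\phi}{\int p \cdot q \, d\phi \,\big/ \int q \, d\phi} \\
  &= \frac{\int_M q \cdot p \, d\phi}{\int q \cdot p \, d\phi}
   = \phi|_{q \cdot p}(M).
\end{aligned}
\]

The normalisation constant $\int q \, d\phi$ cancels, which is why one can work with unnormalised products of predicates and normalise only once at the end.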
The inference example that we use is a continuous version of the archeological example described in [13]. The aim is to infer the date of a tomb at an archeological site of which we already know that it is from the interval 0-100 AD. We are specifically looking to find three kinds of objects, labelled 0, 1, 2, of which we know the time of use more precisely. They are used to infer the age of the tomb. This knowledge is represented by three predicates $p_0, p_1, p_2 : [0, 100] \to [0, 1]$. Our inference works as follows. We start from the uniform measure $\omega = \phi|_q$ with pdf $q(x) = \frac{1}{100}$ on [0, 100], for the Lebesgue measure φ. Its probability on the sub-interval [a, b] ⊆ [0, 100] is given by the integral:
\[ \omega([a, b]) \;=\; \int_a^b \tfrac{1}{100} \, dx \;=\; \frac{b - a}{100}. \]
We now successively observe objects $i_1, \ldots, i_n$, for $i_k = 0, 1, 2$, and compute the conditional probability measure $\big(\cdots\big((\omega|_{p_{i_1}})|_{p_{i_2}}\big)\cdots\big)|_{p_{i_n}}$. We can describe this measure via the product pdf $q \cdot p_{i_1} \cdots p_{i_n}$, after normalisation. Below we sketch the shape of some of the resulting pdf's (ignoring normalisation), after finding certain objects successively.
[Figure: four plots of the resulting pdf's, titled "after finding 2", "after finding 2,1", "after finding 2,1,0" and "after finding 2,1,0,0".]
These curves describe the inferred probability for the age of the tomb in the interval 0-100 AD.
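Successive updating of a density can be approximated numerically on a grid. The sketch below is an illustration only: the three predicates `p0`, `p1`, `p2` are hypothetical triangular shapes invented for this example and are not the predicates of the paper, and the midpoint rule on 1000 cells is an arbitrary discretisation choice. It checks that updating step by step agrees with normalising the product pdf once.

```python
N = 1000                      # number of grid cells on [0, 100]
DX = 100 / N                  # cell width
XS = [(k + 0.5) * DX for k in range(N)]   # cell midpoints

# HYPOTHETICAL fuzzy predicates [0, 100] -> [0, 1] (placeholders only).
def p0(x):
    return max(0.0, 1 - abs(x - 30) / 50)

def p1(x):
    return max(0.0, 1 - abs(x - 50) / 50)

def p2(x):
    return max(0.0, 1 - abs(x - 70) / 50)

PREDICATES = [p0, p1, p2]

def normalise(pdf):
    """Rescale grid values so that the Riemann sum equals 1."""
    total = sum(pdf) * DX
    return [v / total for v in pdf]

def update(pdf, p):
    """Conditioning: multiply the density by the predicate, then normalise."""
    return normalise([v * p(x) for v, x in zip(pdf, XS)])

def prob(pdf, a, b):
    """Approximate probability of the interval [a, b]."""
    return sum(v for v, x in zip(pdf, XS) if a <= x <= b) * DX

# Uniform prior with pdf q(x) = 1/100.
prior = [1 / 100] * N

# Successively observe objects 2, 1, 0, 0.
posterior = prior
for i in [2, 1, 0, 0]:
    posterior = update(posterior, PREDICATES[i])

# One-shot alternative: normalise the product pdf q * p2 * p1 * p0 * p0.
one_shot = normalise([(1 / 100) * p2(x) * p1(x) * p0(x) ** 2 for x in XS])

print(prob(prior, 0, 50), prob(posterior, 40, 60))
```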

Quantum inference
Our inference situations (3) and (4) can also be interpreted in the effectus of von Neumann algebras for quantum computation. Actually, one uses the opposite $\mathbf{vNA}^{\mathrm{op}}$ of the category vNA of von Neumann algebras, with normal completely positive unital maps between them (see [5] for details). We have to take the opposite category because maps between von Neumann algebras should be understood as predicate transformers. Typical examples are the von Neumann algebras B(H) of bounded operators on a Hilbert space H. Below we use the matrix algebra $M_2 = B(\mathbb{C}^2)$ as special case. For instance, the situation (3) translates into a diagram of maps in the category vNA pointing in the other direction:
\[ \mathbb{C} \xleftarrow{\;\omega\;} \mathscr{B} \xleftarrow{\;f\;} \mathscr{A} \xleftarrow{\;q\;} \mathbb{C}^2 \]
The conditional state $\omega|_{f^*(q)} : \mathscr{B} \to \mathbb{C}$ in backward inference is given by the general formula:
\[ \omega|_{f^*(q)}(b) \;=\; \frac{\omega\big(\sqrt{f(q)} \; b \; \sqrt{f(q)}\big)}{\omega\big(f(q)\big)} \]
In this situation predicate transformation $f^*(q) = f \circ q$ works in the opposite direction. The square-roots arise from the particular form of 'assert' map that is used for von Neumann algebras, see [5] for details. The predicate $q : \mathbb{C}^2 \to \mathscr{A}$ is a positive unital map, and can thus be identified with an effect in $\mathscr{A}$, that is, with an element $q \in \mathscr{A}$ satisfying 0 ≤ q ≤ 1.
Bayesian inference in a quantum setting is a relatively new topic, see e.g. [14,6]. At this stage we only apply our general pattern from Definition 2.1 in a quantum setting. The illustration below repeats the disease-test example from Section 3 for the von Neumann algebra $M_2$ of 2 × 2 complex matrices. Our only ambition at this stage is to show how the quantum description extends the probabilistic one. Consider therefore the diagram: We see that the outcome is the same, up to some re-shuffling, as in the discrete probabilistic presentation in (8). But this situation allows much richer structure, for instance using as state ρ :
The expression ω |= p describes the validity, or expected value, of the predicate p in the state ω. Typically its value is in the unit interval [0, 1]. If this validity ω |= p is non-zero, then the conditional state $\omega|_p$ exists. It is the updated state of knowledge after observing 'evidence' p. In each of the above concrete examples of states we