Fine-grained Semantics for Probabilistic Programs

Abstract. Probabilistic programming is an emerging technique for modeling processes involving uncertainty. Thus, it is important that these programs are assigned precise formal semantics that also cleanly handle typical exceptions such as non-termination or division by zero. However, existing semantics of probabilistic programs do not fully accommodate different exceptions and their interaction, often ignoring some or conflating multiple ones into a single exception state, making it impossible to distinguish exceptions or to study their interaction. In this paper, we provide an expressive probabilistic programming language together with a fine-grained measure-theoretic denotational semantics that handles and distinguishes non-termination, observation failures and error states. We then investigate the properties of this semantics, focusing on the interaction of different kinds of exceptions. Our work helps to better understand the intricacies of probabilistic programs and ensures their behavior matches the intended semantics.


Introduction
A probabilistic programming language allows probabilistic models to be specified independently of the particular inference algorithms that make predictions using the model. Probabilistic programs are formed using standard language primitives as well as constructs for drawing random values and conditioning. The overall approach is general and applicable to many different settings (e.g., building cognitive models). In recent years, the interest in probabilistic programming systems has grown rapidly, with various languages and probabilistic inference algorithms (ranging from approximate to exact). Examples include [11,13,14,25,36,26,27,29] and [10]; for a recent survey, please see [15]. An important branch of recent probabilistic programming research is concerned with providing a suitable semantics for these programs, enabling one to formally reason about a program's behaviors [2,3,4,33,34,35].
Often, probabilistic programs require access to primitives that may result in unwanted behavior. For example, the standard deviation σ of a Gaussian distribution must be positive (sampling from a Gaussian distribution with negative standard deviation should result in an error). If a program samples from a Gaussian distribution with a non-constant standard deviation, it is in general undecidable whether that standard deviation is guaranteed to be positive. A similar situation occurs for while loops: except in some trivial cases, it is hard to decide if a program terminates with probability one (even harder than checking termination of deterministic programs [20]). However, general while loops are important for many probabilistic programs. As an example, a Markov Chain Monte Carlo sampler is essentially a special probabilistic program, which in practice requires a non-trivial stopping criterion (see e.g. [6] for such a stopping criterion). In addition to offering primitives that may result in such unwanted behavior, many probabilistic programming languages also provide an observe primitive that intuitively allows filtering out executions violating some constraint.
Motivation. Measure-theoretic denotational semantics for probabilistic programs is desirable as it enables reasoning about probabilistic programs within the rigorous and general framework of measure theory. While existing research has made substantial progress towards a rigorous semantic foundation of probabilistic programming, existing denotational semantics based on measure theory usually conflate failing observe statements (i.e., conditioning), error states and non-termination, often modeling at least some of these as missing weight in a sub-probability measure (we show why this is practically problematic in later examples). This means that even semantically, it is impossible to distinguish these types of exceptions. However, distinguishing exceptions is essential for a solid understanding of probabilistic programs: it is insufficient if the semantics of a probabilistic programming language can only express that something went wrong during the execution of the program, lacking the capability to distinguish, for example, non-termination from errors. Concretely, programmers often want to avoid non-termination and assertion failure, while observation failure is acceptable (or even desirable). When a program runs into an exception, the programmer should be able to determine, from the semantics, the type of that exception.

This Work. This paper presents a clean denotational semantics for a Turing-complete first-order probabilistic programming language that supports mixing continuous and discrete distributions, arrays, observations, partial functions and loops. This semantics distinguishes observation failures, error states and non-termination by tracking them as explicit program states. Our semantics allows for fine-grained reasoning, such as determining the termination probability of a probabilistic program making observations from a sequence of concrete values.
In addition, we explain the consequences of our treatment of exceptions by providing interesting examples and properties of our semantics, such as commutativity in the absence of exceptions, or associativity regardless of the presence of exceptions. We also investigate the interaction between exceptions and the score primitive, concluding in particular that the probability of non-termination cannot be defined in this case. score intuitively allows increasing or decreasing the probability of specific runs of a program (for more details, see Section 5.3).
In this section we demonstrate several important features of our probabilistic programming language (PPL) using examples, followed by a discussion involving different kinds of exception interactions.

Features of Probabilistic Programs
In the following, we informally discuss the most important features of our PPL.

Conditioning. Listing 2 samples two independent values from the uniform distribution on the interval [0, 1] and conditions the possible values of x and y on the observation x + y > 1 before returning x. Intuitively, the first two lines express a-priori knowledge about the uncertain values of x and y. Then, a measurement determines that x + y is greater than 1, and we combine this new information with the existing knowledge. Because x + y > 1 is more likely for larger values of x, the return value has larger weight on larger values. Formally, our semantics handles observe by introducing an extra program state ↯ for observation failure. Hence, the probability distribution after the third line of Listing 2 will put weight 1/2 on ↯ and weight 1/2 on those x and y satisfying x + y > 1. In practice, one will usually condition the output distribution on there being no observation failure (↯). For discrete distributions, this amounts to computing Pr[X = x | X ≠ ↯] = Pr[X = x] / (1 − Pr[X = ↯]), where x is the outcome of the program (a value, non-termination or an error) and Pr[X = x] is the probability that the program results in x. Of course, this conditioning only works when the probability of ↯ is not 1. Note that tracking the probability of ↯ has the practical benefit of rendering the (often expensive) marginalization Pr[X = ↯] = 1 − Σ_{x ≠ ↯} Pr[X = x] unnecessary.
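The conditioning behavior described above can be sketched with a small Monte Carlo simulation. This is a sketch under assumptions: Listing 2 is not reproduced here, so the program below transcribes it as x, y ~ uniform(0, 1); observe(x + y > 1); return x, and the observation-failure state ↯ is modeled as a Python string.

```python
import random

OBS_FAIL = "observation_failure"  # explicit stand-in for the state ↯

def listing2():
    # Assumed shape of Listing 2: sample x, y ~ uniform(0, 1),
    # observe(x + y > 1), return x. A failed observation is returned
    # as an explicit outcome instead of being silently dropped.
    x = random.random()
    y = random.random()
    if not (x + y > 1):
        return OBS_FAIL
    return x

random.seed(0)
runs = [listing2() for _ in range(100_000)]

# Tracking OBS_FAIL directly gives Pr[observation failure] ~ 1/2
# without a separate marginalization pass.
p_fail = sum(r == OBS_FAIL for r in runs) / len(runs)

# Conditioning on no observation failure: the surviving x have density
# 2x on [0, 1], so their mean is ~2/3 (larger values get more weight).
successes = [r for r in runs if r != OBS_FAIL]
mean_x = sum(successes) / len(successes)
```

Note how conditioning simply discards the ↯ outcomes and renormalizes, exactly the division by 1 − Pr[X = ↯] described above.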
Other semantics often use sub-probability measures to express failed observations [4,34,35]. These semantics would say that Listing 2 results in a return value between 0 and 1 with probability 1/2 (and infer that the missing weight of 1/2 is due to failed observations). We believe one should improve upon this approach, as the semantics only implicitly states that the program sometimes fails an observation. Further, this strategy only allows tracking a single kind of exception (in this case, failed observations). This has led some works to conflate observation failure and non-termination [18,34]. We believe there is an important distinction between the two: observation failure means that the program behavior is inconsistent with observed facts, while non-termination means that the program did not return a result. Listing 3 illustrates that it is not possible to condition parts of the program on there being no observation failure. In Listing 3, conditioning the first branch x := 0; observe(flip(1/2)) on there being no observation failure yields Pr[x = 0] = 1, rendering the observation irrelevant. The same situation arises for the second branch. Hence, conditioning the two branches in isolation does not yield the correct semantics of the program as a whole.

Listing 4. Geometric distribution
Loops. Listing 4 shows a probabilistic program with a while loop. It samples from the geometric(1/2) distribution, which counts the number of failures (flip returns 0) until the first success occurs (flip returns 1). This program terminates with probability 1, but it is of course possible that a probabilistic program fails to terminate with positive probability. Listing 5 demonstrates this possibility.
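The geometric sampler of Listing 4 can be sketched directly, assuming the standard loop shape (keep flipping a fair coin until the first 1, counting the 0s); the listing itself is not reproduced here.

```python
import random

def geometric_half():
    # In the spirit of Listing 4: count failures of flip(1/2)
    # until the first success. Terminates with probability 1.
    count = 0
    while random.random() >= 0.5:  # flip returned 0: a failure
        count += 1
    return count

random.seed(1)
samples = [geometric_half() for _ in range(100_000)]
mean = sum(samples) / len(samples)   # E[geometric(1/2)] = (1 - p)/p = 1
p0 = samples.count(0) / len(samples)  # Pr[0 failures] = 1/2
```

Even though each run is guaranteed to terminate almost surely, no fixed bound on the number of iterations exists, which is what makes general loops semantically non-trivial.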
Listing 5. Program that may not terminate (x is initialized to 5)

Listing 5 modifies x until either x = 0 or x = 10. In each iteration, x is either increased or decreased, each with probability 1/2. If x reaches 0, the loop terminates. If x reaches 10, the loop never terminates. By symmetry, both termination and non-termination are equally likely. Hence, the program either returns 0 or does not terminate, each with probability 1/2. Other semantics often use sub-probability measures to express non-termination [4,23]. Thus, these semantics would say that Listing 5 results in 0 with probability 1/2 (and nothing else). We propose to track the probability of non-termination explicitly by an additional state ⟲, just as we track the probability of observation failure (↯).

Partial functions. Our language supports partial functions such as √x (undefined for x < 0). Listing 6 shows an example program using √x. Usually, semantics do not explicitly address partial functions [23,24,28,33] or use partial functions without dealing with failure (e.g. [19] uses Bernoulli(p) without stating what happens if p ∉ [0, 1]). Most of these languages could use a sub-probability distribution that misses weight in the presence of errors (in these languages, this results in conflating errors with non-termination and observation failures).

Fig. 1. Visual comparison of the exception handling capabilities of different semantics. For example, ⟲ is filled in [34] because its semantics can handle non-termination. However, the intersection between ⟲ and ↯ is not filled because [34] cannot distinguish non-termination from observation failure.
We introduce a third exception state ⊥ that can be produced when partial functions are evaluated outside of their domain. Thus, Listing 6 results in ⊥ with probability 1/2 and returns a value from [0, 1] with probability 1/2 (larger values are more likely). Some previous work uses an error state to capture failing computations, but does not propagate this failure implicitly [34,35]. In particular, if an early expression in a long program may fail evaluating √−4, every expression in the program that depends on this failing computation has to check whether an exception has occurred. While it may seem possible to skip the rest of the function in case of a failing computation (by applying the pattern if (x = ⊥) {return ⊥} else {rest of function}), this is non-modular and does not address the result of the function being used in other parts of a program.
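The stated distribution of Listing 6 can be reproduced by simulation. This is a sketch under an assumption: the listing is not shown, so the program below guesses its shape as x ~ uniform(−1, 1); return √x, which matches the stated outcome (⊥ with probability 1/2, otherwise a value in [0, 1] with larger values more likely); ⊥ is modeled as a Python string.

```python
import math
import random

ERR = "error"  # explicit stand-in for the error state ⊥

def listing6():
    # Assumed shape of Listing 6: sample x ~ uniform(-1, 1), return sqrt(x).
    # Evaluating sqrt outside its domain yields the error state instead of
    # crashing or losing the weight silently.
    x = random.uniform(-1.0, 1.0)
    if x < 0:
        return ERR
    return math.sqrt(x)

random.seed(2)
runs = [listing6() for _ in range(100_000)]
p_err = sum(r == ERR for r in runs) / len(runs)  # ~1/2

# Among the non-error results, sqrt of a uniform has density 2v on [0, 1],
# so larger values are more likely: Pr[v > 1/2] = 3/4.
vals = [r for r in runs if r != ERR]
p_large = sum(v > 0.5 for v in vals) / len(vals)
```

Because ⊥ is an explicit outcome, its probability is directly readable from the result distribution rather than inferred from missing weight.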
Although our semantics treats ⊥ and ↯ similarly, there is an important distinction between the two: ⊥ means the program terminated due to an error, while ↯ means that, according to observed evidence, the program did not actually run.

Interaction of Exception States
Next, we illustrate the interaction of different exception states. We explain how our semantics handles these interactions when compared to existing semantics. Figure 1 gives an overview of which existing semantics can handle which (interactions of) exceptions. We note that our semantics could easily distinguish more kinds of exceptions, such as division by zero or out-of-bounds accesses to arrays.
Non-termination and observation failure. Listing 7 shows a program that has been investigated in [22]. Based on the observations, it only admits a single behavior, namely always sampling x = 0 in the third line. This behavior results in non-termination, but it occurs with probability 0. Hence, the program fails an observation (ending up in state ↯) with probability 1. If we try to condition on not failing any observation (by rescaling appropriately), this results in a division by 0, because the probability of not failing any observation is 0.
The semantics of Listing 7 thus only has weight on ↯, and does not allow conditioning on not failing any observation. This is also the solution that [22] proposes, but in our case, we can formally back up this claim with our semantics.
Other languages handle both non-termination and observation failure by sub-probability distributions, which makes it impossible to conclude that the missing weight is due to observation failure (and not due to non-termination) [4,24,34]. The semantics in [28] cannot directly express that the missing weight is due to observation failure (rather, the semantics is undefined due to a division by zero). However, it enables a careful reader to determine that the missing weight is due to observation failure (by investigating the conditional weakest precondition and the conditional weakest liberal precondition). Some other languages can express neither while loops nor observations [23,33,35].

Assertions and non-termination. For some programs, it is useful to check assumptions explicitly. For example, the implementation of the factorial function in Listing 8 explicitly checks whether x is a valid argument to the factorial function. If x ∉ N, the program should run into an error (i.e., only have weight on ⊥). If x ∈ N, the program should return x! (i.e., only have weight on x!). This example illustrates that earlier exceptions (like failing an assertion) should bypass later exceptions (like non-termination, which occurs for x ∉ N if the programmer forgets the first two assertions). This is not surprising, given that this is also the semantics of exceptions in most deterministic languages. Most existing semantics either cannot express Listing 8 ([23,34] have no assertions, [35] has no iteration) or cannot distinguish failing an assertion from non-termination [24,28,33]. The consequence of the latter is that removing the first two assertions from Listing 8 does not affect the semantics. Handling assertion failure by sum types (as e.g. in [34]) could be a solution, but would force the programmer to deal with assertion failure explicitly. Only the semantics in [4] has the expressiveness to implicitly handle assertion errors in Listing 8 without conflating those errors with non-termination. Listing 9 shows a different interaction between non-termination and failing assertions. Here, even though the loop condition is always true, the first iteration of the loop will run into an exception. Thus, Listing 9 results in ⊥ with probability 1. Again, this behavior should not be surprising given the behavior of deterministic languages. For Listing 9, conflating errors with non-termination means the program semantics cannot express that the missing weight is due to an error and not due to non-termination.

Observation failure and assertion failure. In our PPL, earlier exceptions bypass later exceptions, as illustrated in Listing 8. However, because we are operating in a probabilistic language, exceptions can occur probabilistically. Listing 10 shows a program that may run into an observation failure, or into an assertion failure, or neither. If it runs into an observation failure (with probability 1/2), it bypasses the rest of the program, resulting in ↯ with probability 1/2 and in ⊥ with probability 1/4. Conditioning on the absence of observation failures, the probability of ⊥ is 1/2. An important observation is that reordering the two statements of Listing 10 results in a different behavior, even though there is no obvious data flow between the two statements. This is in sharp contrast to the semantics in [34], which guarantees (in the absence of exceptions) that only data flow is relevant and that expressions can be reordered. Our semantics illustrates that even without an explicit data dependency, some seemingly obvious properties (like commutativity) may not hold in the presence of exceptions. Some languages either cannot express Listing 10 ([23,33] lack observations), cannot distinguish observation failure from assertion failure [24], or cannot handle exceptions implicitly [34,35].
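The non-commutativity just described can be checked by exact enumeration. This is a sketch under an assumption: Listing 10 is not reproduced here, so it is modeled as observe(flip(1/2)); assert(flip(1/2)), which matches the stated probabilities (↯ with 1/2, ⊥ with 1/4); the exception states are modeled as strings.

```python
from itertools import product
from fractions import Fraction

OBS_FAIL, ERR, OK = "observation_failure", "error", "ok"

def run(stmts, coins):
    # Execute a statement sequence on fixed coin outcomes; the first
    # exception bypasses the rest of the program.
    coins = iter(coins)
    for stmt in stmts:
        c = next(coins)
        if stmt == "observe" and c == 0:
            return OBS_FAIL
        if stmt == "assert" and c == 0:
            return ERR
    return OK

def distribution(stmts):
    # Enumerate all coin sequences exactly; each has probability 1/4.
    dist = {}
    for coins in product([0, 1], repeat=len(stmts)):
        out = run(stmts, coins)
        dist[out] = dist.get(out, Fraction(0)) + Fraction(1, 4)
    return dist

d1 = distribution(["observe", "assert"])  # {↯: 1/2, ⊥: 1/4, ok: 1/4}
d2 = distribution(["assert", "observe"])  # {⊥: 1/2, ↯: 1/4, ok: 1/4}
```

The two orders produce observably different exception distributions even though neither statement reads data written by the other, which is exactly why commutativity fails in the presence of exceptions.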

Summary. In this section, we showed examples of probabilistic programs that exhibit non-termination, observation failures and errors. Then, we provided examples that show how these exceptions can interact, and explained how existing semantics handle these interactions.

Preliminaries
In this section, we provide the necessary theory. Most of the material is standard; however, our treatment of exception states is interesting and important for providing semantics to probabilistic programs in the presence of exceptions. All key lemmas (together with additional definitions and examples) are proven in Appendix A.

Natural numbers, [n], Iverson brackets, restriction of functions. We include 0 in the natural numbers, so that N := {0, 1, . . .}. For n ∈ N, [n] denotes the set {1, . . ., n}. The Iverson bracket [φ] is 1 if the condition φ holds and 0 otherwise.

Set of variables, generating tuples, preservation of properties, singleton set. Let Vars be a set of admissible variable names. We refer to the elements of Vars by x, y, z and x_i, y_i, z_i, v_i, w_i, for i ∈ N. For v ∈ A and n ∈ N, v!n := (v, . . ., v) ∈ A^n denotes the tuple containing n copies of v. A function f : A^n → A preserves a property if whenever a_1, . . ., a_n ∈ A have that property, f(a_1, . . ., a_n) ∈ A also has that property. Let 1 denote the set which only contains the empty tuple (), i.e. 1 := {()}. For sets of tuples S ⊆ A_1 × · · · × A_n, there are isomorphisms S × 1 ≅ 1 × S ≅ S. These isomorphisms are intuitive and we sometimes silently apply them.
Exception states, lifting functions to exception states. We allow the extension of sets with symbols that stand for the occurrence of special events in a program. This is important because it allows us to capture the event that a given program runs into specific exceptions. Let X := {⊥, ↯, ⟲} be a (countable) set of exception states. We denote by Ā := A ∪ X the set A extended with X (we require that A ∩ X = ∅). Intuitively, ⊥ corresponds to assertion failures, ↯ corresponds to observation failures and ⟲ corresponds to non-termination. For a function f : A_1 × · · · × A_n → B, f lifted to exception states, denoted by f̄ : Ā_1 × · · · × Ā_n → B̄, propagates the first exception in its arguments, or evaluates f if none of its arguments are exceptions. Formally, f̄(a_1, . . ., a_n) = a_1 if a_1 ∈ X; f̄(a_1, . . ., a_n) = a_2 if a_1 ∉ X and a_2 ∈ X; and so on. Only if a_1, . . ., a_n ∉ X do we have f̄(a_1, . . ., a_n) = f(a_1, . . ., a_n). Thus, f̄(↯, a, ⊥) = ↯. In particular, we write (a, b)‾ for the lifting of the tupling function, so that for example (↯, ⟲)‾ = ↯. To remove notational clutter, we do not distinguish the two different liftings f̄ : Ā → B̄ and f̄ : Ā_1 × · · · × Ā_n → B̄ notationally. Whenever we write f̄, it will be clear from the context which lifting we mean. We write S × T for {(s, t) | s ∈ S, t ∈ T}.
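The lifting f̄ can be sketched in a few lines. This is a minimal sketch: exception states are modeled as Python strings, and `lift` implements exactly the "first exceptional argument wins" rule defined above.

```python
X = {"bot", "obs_fail", "nonterm"}  # stand-ins for the states ⊥, ↯, ⟲

def lift(f):
    # f lifted to exception states: return the first exceptional argument,
    # otherwise apply f to the plain values.
    def lifted(*args):
        for a in args:
            if a in X:
                return a
        return f(*args)
    return lifted

pair = lift(lambda a, b: (a, b))  # the lifted tupling function (a, b)‾
add = lift(lambda a, b: a + b)
```

For example, `add(1, 2)` evaluates the underlying function, while `pair("obs_fail", "nonterm")` returns `"obs_fail"`, mirroring (↯, ⟲)‾ = ↯.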
Records. A record is a special type of tuple whose components are indexed by variable names instead of positions. We can access the elements of a record by their name.

In what follows, we provide the measure-theoretic background necessary to express our semantics.

σ-algebra, measurable set, σ-algebra generated by a set, measurable space, measurable functions. Let A be some set. A set Σ_A ⊆ P(A) is a σ-algebra on A if it contains A and is closed under complement and countable union. The elements of Σ_A are called measurable sets. For any set A, a trivial σ-algebra on A is its power set P(A). Unfortunately, the power set often contains sets that do not behave well. To come up with a σ-algebra on A whose sets do behave well, we often start with a set S ⊆ P(A) that is not a σ-algebra and extend it until we get a σ-algebra. For this purpose, let A be some set and S ⊆ P(A) a collection of subsets of A. The σ-algebra generated by S, denoted by σ(S), is the smallest σ-algebra that contains S. Formally, σ(S) is the intersection of all σ-algebras on A containing S. For a set A and a σ-algebra Σ_A on A, (A, Σ_A) is called a measurable space. We often leave Σ_A implicit; whenever it is not mentioned explicitly, it is clear from the context. Table 1 provides the implicit σ-algebras for some common sets.

Table 1. Implicit σ-algebras on common sets, for measurable spaces (A, Σ_A), (A_i, Σ_{A_i}). For example, R carries the Borel σ-algebra generated by all intervals.

As an example, some elements of Σ_R̄ include [0, 1] ∪ {⊥} and {1, 3, π}. For measurable spaces (A, Σ_A) and (B, Σ_B), a function f : A → B is measurable if f^{-1}(S) ∈ Σ_A for every S ∈ Σ_B. If one is familiar with the notion of Lebesgue measurable functions, note that our definition does not include all Lebesgue measurable functions. As a motivation for why we need measurable functions, consider the following scenario. We know the distribution of some variable x, and want to know the distribution of y = f(x). To figure out how likely it is that y ∈ S for a measurable set S, we can determine how likely it is that x ∈ f^{-1}(S), because f^{-1}(S) is guaranteed to be a measurable set.
Measures, examples of measures. For a measurable space (A, Σ_A), a function µ : Σ_A → [0, ∞] is called a measure on A if it satisfies two properties: null empty set (µ(∅) = 0) and countable additivity (for any countable collection {S_i}_{i∈I} of pairwise disjoint sets S_i ∈ Σ_A, we have µ(⋃_{i∈I} S_i) = Σ_{i∈I} µ(S_i)). Measures allow us to quantify the probability that a certain result lies in a measurable set. For example, µ([1, 2]) can be interpreted as the probability that the outcome of a process is between 1 and 2.
The Lebesgue measure λ : Σ_R → [0, ∞] assigns to every interval its length. The zero measure 0 : Σ_A → [0, ∞] is defined by 0(S) = 0 for all S ∈ Σ_A. For a measurable space (A, Σ_A) and some a ∈ A, the Dirac measure δ_a : Σ_A → [0, ∞] is defined by δ_a(S) = [a ∈ S]. Unfortunately, there are measures that do not satisfy some important properties (for example, they may not satisfy Fubini's theorem, which we discuss later on). The usual way to deal with this is to restrict our attention to σ-finite measures, which are well known and have been studied in great detail. However, σ-finite measures are too restrictive for our purposes. In particular, the s-finite kernels that we introduce later on can induce measures that are not σ-finite. This is why in the following, we work with s-finite measures. Table 2 gives an overview of the different kinds of measures that are important for understanding our work. In Table 2, the expression 1/2 · δ_1 stands for the pointwise multiplication of the measure δ_1 by 1/2, and λ refers to λ-abstraction and not to the Lebesgue measure. To distinguish the two λs, we always write "λx." (with a dot) when we refer to λ-abstraction. For more details on the definitions and for proofs about the provided examples, see Appendix A.1.

Table 2. Definition and comparison of different measures µ : Σ_A → [0, ∞] on measurable spaces (A, Σ_A), listing for each type of measure its characterization and examples. Reading the table top-down, we get from the most restrictive definition to the most permissive definition. For example, any sub-probability measure is also a σ-finite measure. We also provide an example for each type of measure that is not an example of the more restrictive type of measure. For example, the Lebesgue measure λ is σ-finite but not a sub-probability measure.
Product of measures, product of measures in the presence of exception states. For s-finite measures µ : Σ_A → [0, ∞] and µ′ : Σ_B → [0, ∞], we denote the product of measures by µ × µ′ : Σ_{A×B} → [0, ∞] and define it by (µ × µ′)(S) = ∫_{a∈A} ∫_{b∈B} [(a, b) ∈ S] µ′(db) µ(da). For s-finite measures µ : Σ_Ā → [0, ∞] and µ′ : Σ_B̄ → [0, ∞], we denote the lifted product of measures by µ ×‾ µ′ : Σ_{(A×B)‾} → [0, ∞] and define it using the lifted tupling function: (µ ×‾ µ′)(S) = ∫_{a∈Ā} ∫_{b∈B̄} [(a, b)‾ ∈ S] µ′(db) µ(da). While the product of measures µ × µ′ is well known for combining two measures to a joint measure, the lifted product of measures µ ×‾ µ′ is required to do the same for measures that have weight on exception states. Because the formal semantics of our probabilistic programming language makes use of exception states, we always use ×‾ to combine measures, appropriately handling exception states implicitly.
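For discrete measures, the two products can be computed exactly, which makes the difference visible. This is a sketch: discrete measures are modeled as dicts from outcome to weight, and the exception states as strings; the lifted product routes exception weight through the lifted tupling function.

```python
X = {"bot", "obs_fail", "nonterm"}  # stand-ins for ⊥, ↯, ⟲

def pair_bar(a, b):
    # Lifted tupling (a, b)‾: the first exceptional component wins.
    if a in X:
        return a
    if b in X:
        return b
    return (a, b)

def product(mu, nu):
    # Plain product of discrete measures: weight lands on pairs only.
    out = {}
    for a, wa in mu.items():
        for b, wb in nu.items():
            out[(a, b)] = out.get((a, b), 0.0) + wa * wb
    return out

def lifted_product(mu, nu):
    # Lifted product: exception weight stays on the exception states
    # instead of being spread over meaningless pairs.
    out = {}
    for a, wa in mu.items():
        for b, wb in nu.items():
            key = pair_bar(a, b)
            out[key] = out.get(key, 0.0) + wa * wb
    return out

mu = {0: 0.5, "obs_fail": 0.5}
nu = {1: 0.5, "bot": 0.5}
d = lifted_product(mu, nu)   # {(0, 1): 0.25, "bot": 0.25, "obs_fail": 0.5}
d_swapped = lifted_product(nu, mu)
```

Note that `d_swapped` puts weight 0.5 on `"bot"` instead of `"obs_fail"`: the lifted product is not commutative under swapping, unlike the plain product.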
Lemma 1. For measures µ : Σ_Ā → [0, ∞] and µ′ : Σ_B̄ → [0, ∞], for S ∈ Σ_{(A×B)‾}, (µ ×‾ µ′)(S) = (µ × µ′)(S ∩ (A × B)) + Σ_{x∈X} [x ∈ S] (µ({x}) · µ′(B̄) + µ(A) · µ′({x})), where on the right-hand side µ and µ′ are restricted to A and B, respectively.

Lemma 2. × and ×‾ for s-finite measures are associative, left- and right-distributive, and preserve (sub-)probability and s-finite measures.
Lebesgue integrals, Fubini's theorem for s-finite measures. Our definition of the Lebesgue integral is based on [31]. It allows integrating functions that sometimes evaluate to ∞, and allows Lebesgue integrals themselves to evaluate to ∞.
In the following, (A, Σ_A) and (B, Σ_B) are measurable spaces and µ : Σ_A → [0, ∞] is a measure. A simple function s : A → [0, ∞) is a finite weighted sum of indicator functions of measurable sets, s = Σ_{i=1}^n c_i · [· ∈ S_i] with c_i ∈ [0, ∞) and S_i ∈ Σ_A. For any simple function s, the Lebesgue integral of s over E ∈ Σ_A with respect to µ, denoted by ∫_{a∈E} s(a) µ(da), is defined by Σ_{i=1}^n c_i · µ(S_i ∩ E). Now let f : A → [0, ∞] be measurable but not necessarily simple. Then, the Lebesgue integral of f over E with respect to µ is defined by ∫_{a∈E} f(a) µ(da) := sup { ∫_{a∈E} s(a) µ(da) | s simple, s ≤ f }. Here, the inequalities on functions are pointwise. Appendix A.2 lists some useful properties of the Lebesgue integral. Here, we only mention Fubini's theorem, which is important because it entails a commutativity-like property of the product of measures: (µ × µ′)(S) = (µ′ × µ)(swap(S)), where swap switches the dimensions of S: swap(S) = {(b, a) | (a, b) ∈ S}. The proof of this property is straightforward, by expanding the definition of the product of measures and applying Fubini's theorem. As we show in Section 5, this property is crucial for the commutativity of expressions. In the presence of exceptions, it does not hold: in general, (µ ×‾ µ′)(S) ≠ (µ′ ×‾ µ)(swap(S)).
Theorem 1 (Fubini's theorem). For s-finite measures µ : Σ_A → [0, ∞] and µ′ : Σ_B → [0, ∞] and any measurable function f : A × B → [0, ∞], ∫_{a∈A} ∫_{b∈B} f(a, b) µ′(db) µ(da) = ∫_{b∈B} ∫_{a∈A} f(a, b) µ(da) µ′(db).

(Sub-)probability kernels, s-finite kernels, Dirac delta, Lebesgue kernel, motivation for s-finite kernels. In the following, let (A, Σ_A) and (B, Σ_B) be measurable spaces. A (sub-)probability kernel with source A and target B is a function κ : A × Σ_B → [0, ∞] such that κ(a, ·) : Σ_B → [0, ∞] is a (sub-)probability measure for every a ∈ A and κ(·, S) : A → [0, ∞] is measurable for every S ∈ Σ_B. An s-finite kernel is a pointwise countable sum of sub-probability kernels. We denote the set of s-finite kernels with source A and target B by A → B. Because we only ever deal with s-finite kernels, we often refer to them simply as kernels.
We can understand the Dirac measure as a probability kernel. For a measurable space (A, Σ_A), the Dirac delta δ : A × Σ_A → [0, ∞] is defined by δ(a, S) = [a ∈ S]. Note that for any a, δ(a, ·) : Σ_A → [0, ∞] is the Dirac measure. We often write δ(a)(S) or δ_a(S) for δ(a, S). Note that we can also interpret δ as an s-finite kernel from A to B for A ⊆ B. The Lebesgue kernel λ* : A → R is defined by λ*(a)(S) = λ(S), where λ is the Lebesgue measure. The definition of s-finite kernels is a lifting of the notion of s-finite measures: for an s-finite kernel κ, κ(a, ·) is an s-finite measure for all a ∈ A. In the context of probabilistic programming, s-finite kernels have been used before [34].
Working in the space of sub-probability kernels is inconvenient because, for example, λ* : R → R is not a sub-probability kernel. Even though λ*(x) is a σ-finite measure for all x ∈ R, not all s-finite kernels induce σ-finite measures in this sense. As an example, (λ* ; λ*)(x) is not a σ-finite measure for any x ∈ R (see Lemma 15 in Appendix A.1). We introduce (;) shortly, in Definition 1.
Working in the space of s-finite kernels is convenient because s-finite kernels have many nice properties.In particular, the set of s-finite kernels A → B is the smallest set that contains all sub-probability kernels with source A and target B and is closed under countable sums.
Lifting kernels to exception states, removing weight from exception states. For kernels κ : A → B or κ : A → B̄, κ lifted to exception states, κ̄ : Ā → B̄, passes exceptional inputs through unchanged: κ̄(x) = δ(x) for x ∈ X, and κ̄(a) = κ(a) for a ∈ A. When transforming κ into κ̄, we preserve (sub-)probability and s-finite kernels.
Composing kernels, composing kernels in the presence of exception states.

Definition 1. For kernels f : A → B and g : B → C, the composition f ; g : A → C is defined by (f ; g)(a)(S) = ∫_{b∈B} g(b)(S) f(a)(db). For kernels f : A → B̄ and g : B → C̄, the lifted composition f >=> g : A → C̄ is defined by (f >=> g)(a)(S) = ∫_{b∈B̄} ḡ(b)(S) f(a)(db), where ḡ is g lifted to exception states.
Note that f ; g intuitively corresponds to first applying f and then g. Throughout this paper, we mostly use >=> instead of (;), but we introduce (;) because it is well known and it is instructive to show how our definition of >=> relates to (;).

Lemma 3. (;) is associative, left- and right-distributive, has neutral element δ, and preserves (sub-)probability and s-finite kernels.

Lemma 4. For kernels f : A → B̄ and g : B → C̄, for all a ∈ A and S ∈ Σ_C̄, (f >=> g)(a)(S) = (f ; g)(a)(S) + Σ_{x∈X} f(a)({x}) · δ(x)(S).
Lemma 4 shows how >=> relates to (;), by splitting f >=> g into the non-exceptional behavior of f (handled by (;)) and the exceptional behavior of f (handled by a sum). Intuitively, if f produces an exception state x ∈ X, then g is not even evaluated. Instead, this exception is directly passed on, as indicated by δ(x)(S).
If f(a)(X) = 0 for all a ∈ A, or if S ∩ X = ∅, then the definitions are equivalent in the sense that (f ; g)(a)(S) = (f >=> g)(a)(S). The difference between >=> and (;) is the treatment of exception states produced by f. Note that technically, the target B̄ of f : A → B̄ does not match the source B of g : B → C̄. Therefore, to formally interpret f ; g, we silently restrict the domain of integration to B.

Lemma 5. >=> is associative, left-distributive (but not right-distributive), has neutral element δ, and preserves (sub-)probability and s-finite kernels.
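The split of Lemma 4 can be illustrated on discrete kernels. This is a sketch under assumptions: a discrete kernel is modeled as a Python function from an input to a dict of outcome weights, and exception states as strings; `compose` implements (;) by integrating g only over f's non-exceptional outputs, while `kleisli` implements >=> via the lifted ḡ.

```python
X = {"bot", "obs_fail", "nonterm"}  # stand-ins for ⊥, ↯, ⟲

def lift_kernel(g):
    # g lifted to exception states: pass exceptional inputs through as a
    # Dirac, otherwise run g.
    def lifted(b):
        if b in X:
            return {b: 1.0}
        return g(b)
    return lifted

def compose(f, g):
    # (f ; g): integrate g over f's non-exceptional outputs only, so
    # exception weight produced by f is simply dropped.
    def h(a):
        out = {}
        for b, wb in f(a).items():
            if b in X:
                continue
            for c, wc in g(b).items():
                out[c] = out.get(c, 0.0) + wb * wc
        return out
    return h

def kleisli(f, g):
    # f >=> g = f ; g-bar: exceptions produced by f bypass g entirely
    # and are passed on unchanged.
    g_bar = lift_kernel(g)
    def h(a):
        out = {}
        for b, wb in f(a).items():
            for c, wc in g_bar(b).items():
                out[c] = out.get(c, 0.0) + wb * wc
        return out
    return h

f = lambda a: {"obs_fail": 0.5, a + 1: 0.5}  # fails an observation half the time
g = lambda b: {b * 2: 1.0}                    # deterministic doubling

k = kleisli(f, g)(0)   # {"obs_fail": 0.5, 2: 0.5}
c = compose(f, g)(0)   # {2: 0.5} -- (;) silently loses the ↯ weight
```

The difference between `k` and `c` is exactly the exception sum of Lemma 4: f's weight on ↯ reappears in `k` via the Dirac on the exception state.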
Product of kernels, product of kernels in the presence of exception states. For s-finite kernels κ : A → B and κ′ : A → C, we define the product of kernels, denoted by κ × κ′ : A → B × C, as (κ × κ′)(a)(S) = (κ(a) × κ′(a))(S). For s-finite kernels κ : A → B̄ and κ′ : A → C̄, we define the lifted product of kernels, denoted by κ ×‾ κ′ : A → (B × C)‾, as (κ ×‾ κ′)(a)(S) = (κ(a) ×‾ κ′(a))(S). × and ×‾ allow us to combine kernels to a joint kernel. Essentially, this definition reduces the product of kernels to the product of measures.

Lemma 6. × and ×‾ for kernels preserve (sub-)probability and s-finite kernels, and are associative, left- and right-distributive.
Summary. The most important concepts introduced in this section are exception states, records, Lebesgue integration, Fubini's theorem and (s-finite) kernels.

A Probabilistic Language and its Semantics
We now describe our probabilistic programming language, its typing rules and its denotational semantics.

Syntax
Let V := Q ∪ {π, e} ⊆ R be a (countable) set of constants expressible in our programs. Let i, n ∈ N, r ∈ V, x ∈ Vars, ⊖ a generic unary operator (e.g., − inverts the sign of a value, ! is logical negation mapping 0 to 1 and all other numbers to 0, ⌊·⌋ and ⌈·⌉ round down and up, respectively), and ⊕ a generic binary operator (e.g., +, −, *, /, ∧ for addition, subtraction, multiplication, division and exponentiation; &&, || for logical conjunction and disjunction; =, ≠, <, ≤, >, ≥ to compare values). Let f : A → R → [0, ∞) be a measurable function that maps a ∈ A to a probability density function; we check that f is measurable by uncurrying it to f′ : A × R → [0, ∞). Among others, our expressions include array(e_1, e_2) (an array of length e_1 containing e_2 at every index) and F(e) (evaluating function F on argument e). To handle functions F(e_1, . . ., e_n) with multiple arguments, we interpret (e_1, . . ., e_n) as a tuple and apply F to that tuple. Our functions express λx.{P ; return e; } (a function taking argument x, running P on x and returning e), flip(e) (random choice from {0, 1}, returning 1 with probability e), uniform(e_1, e_2) (continuous uniform distribution between e_1 and e_2) and sampleFrom_f(e) (a value distributed according to the probability density function f(e)). An example for f is the density of the exponential distribution, indexed with rate λ: formally, f(λ)(x) = λ e^{−λx} for x ≥ 0 and f(λ)(x) = 0 otherwise. Often, f is partial (e.g., λ ≤ 0 is not allowed). Intuitively, arguments outside the allowed range of f produce the error state ⊥.
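The treatment of a partial density for sampleFrom can be sketched as follows. This is a minimal sketch using the exponential density named above; the error state ⊥ is modeled as a Python string, and returning it for rate ≤ 0 is the intended behavior for arguments outside the allowed range.

```python
import math

ERR = "error"  # stand-in for the error state ⊥

def exp_density(rate):
    # Indexed density family for sampleFrom: the exponential distribution
    # with the given rate. Arguments outside the allowed range (rate <= 0)
    # produce the error state instead of a density.
    if rate <= 0:
        return ERR
    def pdf(x):
        return rate * math.exp(-rate * x) if x >= 0 else 0.0
    return pdf

pdf = exp_density(2.0)   # a valid density; pdf(0.0) == 2.0
bad = exp_density(-1.0)  # out of range: the error state
```

A program evaluating sampleFrom_f(e) with an out-of-range e would thus end up with all its weight on ⊥, rather than on an undefined density.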
Our statements express skip (no operation), x := e (assigning to a fresh variable), x = e (assigning to an existing variable), P_1 ; P_2 (sequential composition of programs), if e {P_1} else {P_2} (if-then-else), {P} (static scoping), assert(e) (asserting that an expression evaluates to true; assertion failure results in ⊥), observe(e) (observing that an expression evaluates to true; observation failure results in ↯) and while e {P} (while loops; non-termination results in ⟲). We additionally introduce syntactic sugar e_1[e_2] = e_3 for e_1 = e_1[e_2 → e_3], if (e) {P} for if e {P} else {skip}, and func(e_2) for λx.{P ; return e_1; }(e_2) (using the name func for the function with argument x and body {P ; return e_1}).

Typing Judgments
Let n ∈ N. We define types by the following grammar in BNF, where τ[] denotes arrays over type τ:

τ ::= R | τ_1 × · · · × τ_n | τ[]

We sometimes write ∏_{i=1}^n τ_i for the product type τ_1 × · · · × τ_n. Note that we also use the type τ_1 → τ_2 of kernels with source τ_1 and target τ_2, but we do not list it here to avoid higher-order functions (discussed in Section 4.5).
Formally, a context Γ is a set {xᵢ : τᵢ}_{i∈[n]} that assigns a type τᵢ to each variable xᵢ ∈ Vars. In slight abuse of notation, we sometimes write x ∈ Γ if there is a type τ with x : τ ∈ Γ. We also write Γ, x : τ for Γ ∪ {x : τ} (where x ∉ Γ) and Γ, Γ′ for Γ ∪ Γ′ (where Γ and Γ′ have no common variables).
Fig. 3. The typing rules for expressions and functions in our language.

The rules in Figures 3 and 4 allow deriving the type of expressions, functions and statements. To state that an expression e has type τ under a context Γ, we write Γ ⊢ e : τ. Likewise, ⊢ F : τ → τ′ indicates that F is a kernel from τ to τ′. Finally, we write Γ ⊢ P : Γ′ to state that a context Γ is transformed to Γ′ by a statement P. For sampleFrom_f, we intuitively want f to map values from τ to probability density functions. To allow f to be partial, i.e., to be undefined for some values from τ, we use A ∈ Σ_τ (and hence A ⊆ ⟦τ⟧) as the domain of f (see Section 4.3).

Semantics
Semantic domains. We assign to each type τ a set ⟦τ⟧ together with an implicit σ-algebra Σ_τ on that set. Additionally, we assign a set ⟦Γ⟧ to each context Γ. The remaining semantic domains are outlined in Figure 5.
Fig. 5. Semantic domains for types.

Fig. 6. The semantics of expressions. v!n stands for the n-tuple (v, . . ., v). t[i] stands for the i-th element (0-indexed) of the tuple t, and t[i → v] is the tuple t where the i-th element is replaced by v. |t| is the length of a tuple t. σ stands for a program state over all variables in some Γ, with σ ∈ ⟦Γ⟧.
Expressions. Figure 6 assigns to each expression e typed by Γ ⊢ e : τ a probability kernel ⟦e⟧_τ : ⟦Γ⟧ → ⟦τ⟧. When τ is irrelevant or clear from the context, we may drop it and write ⟦e⟧. The formal interpretation of ⟦Γ⟧ → ⟦τ⟧ is explained in Section 3.3. Note that Figure 6 is incomplete, but extending it is straightforward. When we need to evaluate multiple terms (as in (e₁, . . ., eₙ)), we combine the results using ×. This makes sure that in the presence of exceptions, the first exception that occurs takes priority over later exceptions. In addition, deterministic functions (like x + y) are lifted to probabilistic functions by the Dirac delta (e.g., δ(x + y)), and partial functions (like x/y) are lifted to total functions via the explicit error state ⊥.
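To make the lifting concrete, here is a minimal, discrete sketch (our illustration, not the paper's measure-theoretic definitions): distributions are dictionaries from outcomes to weights, exceptions are string tokens, `delta` plays the role of the Dirac delta, `bind` plays the role of Kleisli composition (>=>), and `pair` models the lifted product ×, so the left operand's exception shadows the right one.

```python
# A discrete sketch of the expression semantics. The string tokens stand
# in for the error state, observation failure and non-termination.
EXC = {"error", "obs_fail", "nonterm"}

def delta(v):
    """Dirac delta: all weight on a single outcome."""
    return {v: 1.0}

def bind(mu, k):
    """Kleisli composition (>=>): exception tokens propagate unchanged."""
    out = {}
    for v, p in mu.items():
        nu = delta(v) if v in EXC else k(v)
        for w, q in nu.items():
            out[w] = out.get(w, 0.0) + p * q
    return out

def pair(mu1, mu2):
    """Lifted product (x): evaluate left first, so the first exception
    that occurs takes priority over later ones."""
    return bind(mu1, lambda v1: bind(mu2, lambda v2: delta((v1, v2))))

def div(t):
    """Partial function x/y, completed via the explicit error state."""
    x, y = t
    return delta("error") if y == 0 else delta(x / y)

# [[1/0]]: pairing the operands succeeds, the division itself errors.
print(bind(pair(delta(1), delta(0)), div))   # {'error': 1.0}
print(bind(pair(delta(1), delta(4)), div))   # {0.25: 1.0}
```

Because `bind` never evaluates the continuation for exceptional outcomes, an exception in the left operand of `pair` makes the right operand irrelevant, mirroring the exception-priority behavior of × described above.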
Statements. Figure 8 assigns to each statement P with Γ ⊢ P : Γ′ a probability kernel ⟦P⟧ : ⟦Γ⟧ → ⟦Γ′⟧. Note the use of × in δ × ⟦e⟧, which allows evaluating e while keeping the state σ in which e is being evaluated. Intuitively, if evaluating e results in an exception from X, the previous state σ is irrelevant, and the result of δ × ⟦e⟧ will be that exception from X.
While loop. To define the semantics of the while loop while e {P}, we introduce a kernel transformer ⟦while e {P}⟧_trans : (⟦Γ⟧ → ⟦Γ⟧) → (⟦Γ⟧ → ⟦Γ⟧) that transforms the semantics for n runs of the loop into the semantics for n + 1 runs of the loop. Concretely, ⟦while e {P}⟧_trans(κ) = (δ × ⟦e⟧) >=> λ(σ, b). if b = 0 then δ(σ) else (⟦P⟧ >=> κ)(σ).
This semantics first evaluates e, while keeping the program state around using δ. If e evaluates to 0, the while loop terminates and we return the current program state σ. If e does not evaluate to 0, we run the loop body P and feed the result to the next iteration of the loop, using κ.
We can then define the semantics of while e {P} using a special fixed-point operator fix : ((A → A) → (A → A)) → (A → A), defined by the pointwise limit fix(∆) = lim_{n→∞} ∆ⁿ(κ⇑), where κ⇑ := λσ.δ(⇑) maps every state to the non-termination state ⇑, and ∆ⁿ denotes the n-fold composition of ∆. ∆ⁿ(κ⇑) puts all runs of the while loop that do not terminate within n steps into the state ⇑. In the limit, weight remains on ⇑ only for those runs of the loop that never terminate. fix(∆) is only defined if its pointwise limit exists. Making use of fix, we can define the semantics of the while loop as follows: ⟦while e {P}⟧ = fix(⟦while e {P}⟧_trans).

Lemma 7. For ∆ as in the semantics of the while loop, and for each σ and each S, the limit lim_{n→∞} ∆ⁿ(κ⇑)(σ)(S) exists.
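For intuition, the fixed-point construction can be simulated on a discrete example. The sketch below is our illustration (not the paper's definition): distributions are dictionaries from outcomes to weights, the token "nonterm" stands in for the non-termination state, and the loop is the hypothetical while flip(1/2) {skip}. Iterating the transformer n times, starting from the kernel that maps every state to non-termination, leaves exactly 2⁻ⁿ weight on "nonterm".

```python
def bind(mu, k):
    """Kleisli composition (>=>); "nonterm" propagates unchanged."""
    out = {}
    for v, p in mu.items():
        nu = {v: 1.0} if v == "nonterm" else k(v)
        for w, q in nu.items():
            out[w] = out.get(w, 0.0) + p * q
    return out

def trans(kappa):
    """One application of [[while flip(1/2) {skip}]]_trans: evaluate the
    guard while keeping the state; stop on 0, otherwise run the body
    (skip) and continue with kappa."""
    def step(state):
        guard = {(state, 0): 0.5, (state, 1): 0.5}  # delta x [[flip(1/2)]]
        return bind(guard,
                    lambda sb: {sb[0]: 1.0} if sb[1] == 0 else kappa(sb[0]))
    return step

# Delta^n applied to the everywhere-non-terminating kernel accounts for
# at most n loop iterations; the rest of the mass stays on "nonterm".
kappa = lambda state: {"nonterm": 1.0}
for _ in range(10):
    kappa = trans(kappa)
print(kappa("s"))  # weight 1 - 2**-10 on "s", weight 2**-10 on "nonterm"
```

As n grows, the "nonterm" weight shrinks monotonically, matching the claim of Lemma 7 that probability mass can only flow from non-termination to other states.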
Lemma 7 holds because increasing n may only shift probability mass from the non-termination state to other states (we provide a formal proof in Appendix B). Kozen shows a different way of defining the semantics of the while loop [23], using least fixed points. Lemma 8 describes the relation of our while loop semantics to the while loop semantics of [23]. For more details on the formal interpretation of Lemma 8 and for its proof, see Appendix B.

Lemma 8. In the absence of exception states, and using sub-probability kernels instead of distribution transformers, the definition of the semantics of the while loop from [23] is equivalent to ours.

Theorem 2. The semantics of each expression e and statement P is indeed a probability kernel.
Proof. The proof proceeds by induction. Some lemmas that are crucial for the proof are listed in Appendix C. Conveniently, most functions that come up in our definition are continuous (like a + b) or continuous except on some countable subset (like a^b), and thus measurable.

Recursion
To extend our language with recursion, we apply the same ideas as for the while loop. Given the source code of a function F that uses recursion, we define its semantics in terms of a kernel transformer ⟦F⟧_trans. This kernel transformer takes the semantics of F up to a recursion depth of n and returns the semantics of F up to recursion depth n + 1. Formally, ⟦F⟧_trans(κ) follows the usual semantics, but uses κ as the semantics of recursive calls to F (we will provide an example shortly). Finally, we define the semantics of F by ⟦F⟧ := fix(⟦F⟧_trans). Just as for the while loop, fix(⟦F⟧_trans) is well-defined because stepping from recursion depth n to n + 1 can only shift probability mass from the non-termination state to other states. We note that we could generalize our approach to mutual recursion. To demonstrate how we define the kernel transformer, consider the recursive implementation of the geometric distribution in Listing 11 (to simplify presentation, Listing 11 uses early return). Given the semantics κ for geom : 1 → ℝ up to recursion depth n, we can define the semantics of geom up to recursion depth n + 1, as illustrated in Figure 9.
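As a discrete illustration of this construction (ours, not the paper's: distributions are dictionaries, the token "nonterm" stands in for the non-termination state, and we assume Listing 11 is the usual geom() { if flip(1/2) { return 0 } return 1 + geom() }), the kernel transformer for geom can be sketched and iterated as follows:

```python
def bind(mu, k):
    """Kleisli composition (>=>); "nonterm" propagates unchanged."""
    out = {}
    for v, p in mu.items():
        nu = {v: 1.0} if v == "nonterm" else k(v)
        for w, q in nu.items():
            out[w] = out.get(w, 0.0) + p * q
    return out

def geom_trans(kappa):
    """[[geom]]_trans: given kappa, the semantics of geom up to recursion
    depth n, return the semantics of geom up to recursion depth n + 1."""
    def sem(arg):
        coin = {1: 0.5, 0: 0.5}                     # [[flip(1/2)]]
        return bind(coin, lambda b: {0: 1.0} if b == 1
                    else bind(kappa(()), lambda r: {r + 1: 1.0}))
    return sem

# Depth 0: the recursive call is never unfolded, so all mass sits on
# "nonterm"; each unfolding moves half of the remaining mass out.
sem = lambda arg: {"nonterm": 1.0}
for _ in range(3):
    sem = geom_trans(sem)
print(sem(()))  # {0: 0.5, 1: 0.25, 2: 0.125, 'nonterm': 0.125}
```

After n unfoldings, the results 0, . . ., n − 1 carry their geometric probabilities and the residual 2⁻ⁿ sits on "nonterm", exactly the "semantics up to recursion depth n" described above.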

Higher-order Functions
Our language cannot express higher-order functions. When trying to give semantics to higher-order probabilistic programs, an important step is to define a σ-algebra on the set of functions from real numbers to real numbers. Unfortunately, no matter which σ-algebra is picked, function evaluation (i.e., the function that takes f and x as arguments and returns f(x)) is not measurable [1]. This is a known limitation that previous work has looked into (e.g., [35] addresses it by restricting the set of functions to those expressible as source code).
A promising recent approach replaces measurable spaces by quasi-Borel spaces [16]. This allows expressing higher-order functions, at the price of replacing the well-known and well-understood measurable spaces by a new concept.

Non-determinism
To extend our language with non-determinism, we may define the semantics of expressions, functions and statements in terms of sets of kernels. For an expression e typed by Γ ⊢ e : τ, this means that ⟦e⟧_τ ∈ P(⟦Γ⟧ → ⟦τ⟧), where P(S) denotes the power set of S. Lifting our semantics to non-determinism is mostly straightforward, except for loops. There, ⟦while e {P}⟧ contains all kernels obtained as pointwise limits of compositions ∆ₙ ∘ ⋯ ∘ ∆₁ applied to the everywhere-non-terminating kernel, where each ∆ᵢ ∈ ⟦while e {P}⟧_trans. Previous work has studied non-determinism in more detail, see e.g. [21,22].
Properties

We now investigate two properties of our semantics: commutativity and associativity. These are useful in practice, e.g., because they enable rewriting programs into a form that allows for more efficient inference [5].
In this section, we write e₁ ≡ e₂ when expressions e₁ and e₂ are equivalent (i.e., when ⟦e₁⟧ = ⟦e₂⟧). Analogously, we write P₁ ≡ P₂ for ⟦P₁⟧ = ⟦P₂⟧.

Commutativity
In the presence of exception states, our language cannot guarantee commutativity of expressions such as e₁ + e₂. This is not surprising, as in our semantics the first exception bypasses all later exceptions.

Lemma 9. For the function F() {while 1 {skip}; return 0;}, we have 1/0 + F() ≢ F() + 1/0.

Formally, this is because if we evaluate 1/0 first, we only have weight on ⊥. If instead we evaluate F() first, we only have weight on the non-termination state, by an analogous calculation. A more detailed proof is included in Appendix D.
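The counterexample can be replayed in a small sketch (our illustration; the tokens "error" and "nonterm" stand in for the error and non-termination states): left-to-right evaluation makes the first exception win, so swapping the operands of + swaps the resulting exception.

```python
def bind(mu, k):
    """Kleisli composition (>=>): the first exception token wins, because
    the continuation is never evaluated for exceptional outcomes."""
    out = {}
    for v, p in mu.items():
        nu = {v: 1.0} if v in ("error", "nonterm") else k(v)
        for w, q in nu.items():
            out[w] = out.get(w, 0.0) + p * q
    return out

def plus(e1, e2):
    """[[e1 + e2]]: evaluate e1, then e2 (left to right), then add."""
    return bind(e1(), lambda v1: bind(e2(), lambda v2: {v1 + v2: 1.0}))

one_over_zero = lambda: {"error": 1.0}   # [[1/0]]: division by zero
F = lambda: {"nonterm": 1.0}             # [[F()]]: while 1 {skip}; return 0

print(plus(one_over_zero, F))  # {'error': 1.0}
print(plus(F, one_over_zero))  # {'nonterm': 1.0}
```

The two results differ, which is exactly the failure of commutativity stated in Lemma 9.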
However, the only reason for non-commutativity is the presence of exceptions. Assuming that e₁ and e₂ cannot produce exceptions, we obtain commutativity:

Lemma 10. If ⟦e₁⟧(σ)(X) = ⟦e₂⟧(σ)(X) = 0 for all σ, then e₁ ⊕ e₂ ≡ e₂ ⊕ e₁, for any commutative operator ⊕.
The proof of Lemma 10 (provided in Appendix D) relies on the absence of exceptions and on Fubini's theorem. This commutativity result is in line with [34], which proves commutativity in the absence of exceptions.
In the analogous situation for statements, we cannot assume commutativity P₁; P₂ ≡ P₂; P₁, even if there is no dataflow from P₁ to P₂. We already illustrated this in Listing 10, where swapping two lines changes the program semantics. However, in the absence of exceptions and of dataflow between P₁ and P₂, we can guarantee P₁; P₂ ≡ P₂; P₁.

Associativity
A careful reader might suspect that since commutativity does not always hold in the presence of exceptions, a similar situation might arise for the associativity of some expressions. As an example, can we guarantee e₁ + (e₂ + e₃) ≡ (e₁ + e₂) + e₃, even in the presence of exceptions? The answer is yes, intuitively because exceptions can only change the behavior of a program if the order of their occurrence is changed, which is not the case for associativity. Formally, we derive the following:

Lemma 11. e₁ ⊕ (e₂ ⊕ e₃) ≡ (e₁ ⊕ e₂) ⊕ e₃, for any associative operator ⊕.

Adding the score Primitive
Some languages include the primitive score, which allows increasing or decreasing the probability of a certain event (or trace) [34,35].
Listing 12. Using score.

Listing 12 shows an example program using score. Without normalization, it returns 0 with probability 1/2 and 1 with "probability" 1/2 · 2 = 1. After normalization, it returns 0 with probability 1/3 and 1 with probability 2/3. Because score allows decreasing the probability of a specific event, it renders observe unnecessary: in general, we can replace observe(e) by score(e ≠ 0). However, performing this replacement means losing the explicit knowledge of the weight on the observation-failure state.

Listing 13. Reshaping a distribution.
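The numbers for Listing 12 can be reproduced by weighted enumeration. The sketch below is our illustration and assumes the listing is x := flip(1/2); if x { score(2) } (consistent with the probabilities stated above): score multiplies the weight of a trace, and normalization divides by the total weight.

```python
# Enumerate both traces of: x := flip(1/2); if x { score(2) }
traces = []
for x, p in ((0, 0.5), (1, 0.5)):
    weight = p * (2.0 if x == 1 else 1.0)  # score(2) doubles this trace's weight
    traces.append((x, weight))

unnormalized = dict(traces)            # {0: 0.5, 1: 1.0}
Z = sum(unnormalized.values())         # total weight 1.5 (may exceed 1)
normalized = {x: w / Z for x, w in unnormalized.items()}
print(normalized)  # 0 has probability 1/3, 1 has probability 2/3
```

Note that the unnormalized weights already sum to more than 1, which is precisely why probability kernels no longer suffice once score is added.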
score can be useful to modify the shape of a given distribution. For example, Listing 13 turns the distribution of x, which is a Gaussian distribution, into the Lebesgue measure λ, by multiplying the density of x by its inverse. Hence, the density of x at any location is 1. Note that the resulting distribution over x cannot be described by a probability measure, because e.g. the "probability" that x lies in the interval [0, 2] is 2. Unfortunately, termination in the presence of score is not well-defined, as illustrated in Listing 14. In this program, the only non-terminating trace keeps changing its weight, switching between 1 and 2. In the limit, it is impossible to determine the weight of non-termination.
Hence, allowing the use of the score primitive only makes sense after abolishing the tracking of non-termination, which can be achieved by only measuring sets that do not contain the non-termination state. Formally, this means restricting the semantics of expressions e typed by Γ ⊢ e : τ to kernels ⟦e⟧_τ whose target excludes the non-termination state. Intuitively, abolishing non-termination means that we ignore non-terminating runs (which would otherwise result in weight on the non-termination state). After doing this, we can give well-defined semantics to the score primitive.
After including score in our language, the semantics of the language can no longer be expressed in terms of probability kernels as stated in Theorem 2, because the probability of any event can be inflated beyond 1. Instead, the semantics must be expressed in terms of s-finite kernels.

Theorem 3. After adding the score primitive and abolishing non-termination, the semantics of each expression e and statement P is an s-finite kernel.
Proof. As for Theorem 2, the proof proceeds by induction. Most parts of the proof are analogous (e.g., >=> preserves s-finite kernels instead of probability kernels).

Related Work

In the classic semantics of [23], Kozen uses distribution transformers (i.e., functions from distributions to distributions). In later work [24], Kozen switches to sub-probability kernels, which has the advantage of avoiding redundancies. A different approach uses weakest preconditions to define the semantics, as in [28]. Staton et al. [35] use a different concept of measurable functions A → P(ℝ≥0 × B) (where P(S) denotes the set of all probability measures on S).
Typing. Some probabilistic languages are untyped [4,28], while others are limited to just a single type: ℝⁿ [23,24] or packet histories [33]. Some languages provide more interesting types, including sum types, distribution types and tuples [34,35]. We allow tuple and array types, and we could easily account for sum types.
Loops. Because the semantics of while loops is not always straightforward, some languages avoid while loops and recursion altogether [35]. Borgström et al. handle recursion instead of while loops, defining its semantics in terms of a fixed point [4]. Many languages handle while loops by least fixed points [23,24,28,33]. Staton defines while loops in terms of the counting measure [34], which is similar to defining them by a fixed point. We define the semantics of while loops in terms of a fixed point, which avoids the need to prove that the least fixed point exists (still, the classic while loop semantics of [23] and our formulation are equivalent).
Most languages do not explicitly track non-termination, but lose probability weight through non-termination [4,23,24,34]. This missing weight can be used to identify the probability of non-termination, but only if other exceptions (such as fail in [24] or observation failure in [4]) do not also result in missing weight. The semantics of [33] is tailored to applications in networks and loses non-terminating packet histories instead of weight (due to a particular least fixed point construction of Scott-continuous maps on algebraic and continuous directed-complete partial orders). Some works define non-termination as missing weight in the weakest precondition [28]. Specifically, the semantics in [28] can also explicitly express the probability of non-termination or of ending up in some state (using the separate construct of a weakest liberal precondition). We model non-termination by an explicit non-termination state, which has the advantage that in the context of lost weight, we know what part of that lost weight is due to non-termination.
Kaminski et al. [21] investigate the run-time of probabilistic programs with loops and fail (interpreted as early termination), but without observations. In [21], non-termination corresponds to an infinite run-time.
Error states. Many languages do not consider partial functions (like fractions a/b) and thus never run into an exception state [23,24,33]. Olmedo et al. [28] do not consider partial functions, but support the related concept of an explicit abort. The semantics of abort relies on missing weight in the final distribution. Some languages handle expressions whose evaluation may fail using sum types [34,35], forcing the programmer to deal with errors explicitly (we discuss the disadvantages of this approach at Listing 6). Formally, a sum type A + B is a disjoint union of the two sets A and B. Defining the semantics of an expression in terms of the sum type A + {⊥} allows that expression to evaluate either to a value a ∈ A or to ⊥. Borgström et al. [4] have a single state fail expressing exceptions such as dynamically detected type errors (without forcing the programmer to deal with exceptions explicitly). Our semantics also uses sum types to handle exceptions, but the handling is implicit, by defining semantics in terms of (>=>) (which defines how exceptions propagate in a program) instead of (;).
Constraints. To enforce hard constraints, we use the observe(e) statement, which puts the program into a special failure state if it does not satisfy e. We can encode soft constraints by observe(e), where e is probabilistic (this is a general technique). Borgström et al. [4] allow both soft constraints that reduce the probability of some program traces and hard constraints whose failure leads to the error state fail. Some languages can handle generalized soft constraints: they can not only decrease the probability of certain traces using soft constraints, but also increase it, using score(x) [34,35]. We investigate the consequences of adding score to our language in Section 5.3. Kozen [24] handles hard (and hence soft) constraints using fail (which results in a sub-probability distribution). Some languages can handle neither hard nor soft constraints [23,33]. Note though that the semantics of ProbNetKAT in [33] can drop certain packets, which is a similar behavior. Olmedo et al. [28] handle hard (and hence soft) constraints by a conditional weakest precondition that tracks both the probability of not failing any observation and the probability of ending in specific states. Unfortunately, this work is restricted to discrete distributions and is specifically designed to handle observation failures and non-termination. Thus, it is not obvious how to adapt the semantics if a different kind of exception is to be added.
Interaction of different exceptions. Most existing work handles at least some exceptions by sub-probability distributions [4,23,24,33,34]. Then, any missing weight in the final distribution must be due to exceptions. However, this leads to a conflation of all exceptions handled by sub-probability distributions (for the consequences of this, see, e.g., our discussion of Listing 8). Note that semantics based on sub-probability kernels can add more exceptions, but they will simply be conflated with all other exceptions. Some previous work does not (exclusively) rely on sub-probability distributions. Borgström et al. [4] handle errors implicitly, but still use sub-probability kernels to handle non-termination and score. Olmedo et al. can distinguish non-termination (which is conflated with exception failure) from failing observations by introducing two separate semantic primitives (conditional weakest precondition and conditional liberal weakest precondition) [28]. Because their solution specifically addresses non-termination, it is non-trivial to generalize this treatment to more than two exception states. By using sum types, some semantics avoid interactions of errors with non-termination or constraint failures, but still cannot distinguish the latter [34,35]. Note that semantics based on sum types can easily add more exceptions (although it is impossible to add non-termination). However, the interaction of different exceptions cannot be observed, because the programmer has to handle exceptions explicitly.
To the best of our knowledge, we are the first to give formal semantics to programs that may produce exceptions in this generality. One work investigates assertions in probabilistic programs, but explicitly disallows non-terminating loops [32]. Moreover, the semantics in [32] is operational, leaving the distribution (in terms of measure theory) of program outputs unclear. Cho et al. [8] investigate the interaction of partial programs and observe, but are restricted to discrete distributions and to only two exception states. In addition, this investigation treats these two exception states differently, making it non-trivial to extend the results to three or more exception states. Katoen et al. [22] investigate the intuitive problems when combining non-termination and observations, but restrict their discussion to discrete distributions and do not provide formal semantics. Huang [17] treats partial functions, but not different kinds of exceptions. In general, we know of no probabilistic programming language that distinguishes more than two different kinds of exceptions. Distinguishing two kinds of exceptions is simpler than three, because it is possible to handle one exception as an explicit exception state and the other one by missing weight (as, e.g., in [4]).
Cousot and Monerau [9] provide a trace semantics that captures probabilistic behavior by an explicit randomness source given to the program as an argument. This allows handling non-termination by non-terminating traces. While the work does not discuss errors or observation failure, it is possible to add both. However, using an explicit randomness source has other disadvantages, already discussed by Kozen [23]. Most notably, this approach requires a distribution over the randomness source and a translation from the randomness source to random choices in the program, even though we only care about the distribution of the latter.

Conclusion
In this work we presented an expressive probabilistic programming language that supports important features such as mixing continuous and discrete distributions, arrays, observations, partial functions and while loops. Unlike prior work, our semantics distinguishes non-termination, observation failures and error states. This allows us to investigate the subtle interaction of different exceptions, which is not possible for semantics that conflate different kinds of exceptions. Our investigation confirms the intuitive understanding of the interaction of exceptions presented in Section 2. However, it also shows that some desirable properties, like commutativity, only hold in the absence of exceptions. This situation is unavoidable, and largely analogous to the situation in deterministic languages.
Even though our semantics only distinguishes three exception states, it can be trivially extended to handle any countable set of exception states. This allows for an even finer-grained distinction of, e.g., division by zero, out-of-bounds array accesses or casting failures (in a language that allows type casting). Our semantics also allows enriching exceptions with the line number the exception originated from (of course, this is not possible for non-termination). For an uncountable set of exception states, an extension is possible but not trivial.

A Proofs for preliminaries
In this section, we provide lemmas, proofs and some definitions that were left out or cut short in Section 3. For a more detailed introduction to measure theory, we recommend "A crash course on the Lebesgue integral and measure theory" [7].
– We call µ s-finite if µ can be written as a countable sum µ = ∑_{i∈ℕ} µᵢ of sub-probability measures µᵢ.
Note that for a σ-finite measure µ, µ(A) = ∞ is possible, even though µ(Aᵢ) < ∞ for all i. As an example, the Lebesgue measure is σ-finite.

Lemma 12. The following definition of s-finite measures is equivalent to our definition (the difference is that the µᵢ are only required to be finite): we call µ : Σ_A → [0, ∞] an s-finite measure if it can be written as µ = ∑_{i∈ℕ} µᵢ for finite measures µᵢ.

Proof. Since any sub-probability measure is finite, one direction is trivial. For the other direction, let µ = ∑_{i∈ℕ} µᵢ for finite measures µᵢ. Obviously, µ ≥ 0, µ(∅) = 0 and µ(⊎_{i∈ℕ} Aᵢ) = ∑_{i∈ℕ} µ(Aᵢ) for mutually disjoint Aᵢ ∈ Σ_A, so µ is a measure. It remains to show that µ can be written as a sum of sub-probability measures.

Lemma 13. Any σ-finite measure is s-finite.

Proof. Without loss of generality, assume that the Aᵢ form a partition of A. Then µ is a countable sum of finite measures.

Proof. For the counting measure c, assume (toward a contradiction) that c = ∑_{i∈ℕ} cᵢ for finite measures cᵢ. For any measurable, countably infinite S′, some cᵢ(S′) = ∞, which means that cᵢ is not finite. Proceed analogously for the infinity measure.

Proof. µ = ∑_{i∈ℕ} λ, and λ is s-finite, so µ is s-finite. Assume (toward a contradiction) that µ is σ-finite. For associativity, the proof proceeds analogously for ×.
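As a concrete instance of these definitions (our example): the Lebesgue measure λ on ℝ is both σ-finite and s-finite, since it decomposes into countably many unit-mass pieces,

```latex
\lambda \;=\; \sum_{n \in \mathbb{Z}} \lambda_n,
\qquad
\lambda_n(A) \;:=\; \lambda\bigl(A \cap [n, n+1)\bigr),
```

where each λₙ is a (sub-)probability measure, since λₙ(ℝ) = 1. By contrast, the counting measure on ℝ admits no such decomposition, as argued above.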
Proof. The following properties can be proven for simple functions and for limits of simple functions (this suffices). For the other properties, see [31].

Theorem 1 (Fubini's theorem). For s-finite measures, the order of integration may be exchanged.

Proof. Let µ = ∑_{i∈ℕ} µᵢ and µ′ = ∑_{i∈ℕ} µ′ᵢ for bounded measures µᵢ and µ′ᵢ. The proof in the presence of exception states is analogous.
The proof for κ₁ >=> κ₂ ≤ κ′₁ >=> κ′₂ works analogously. Associativity and the preservation of (sub-)probability kernels are well known (see for example [12]). For s-finite kernels f = ∑_{i∈ℕ} fᵢ, g = ∑_{i∈ℕ} gᵢ and h = ∑_{i∈ℕ} hᵢ, we argue summand-wise for the sub-probability kernels fᵢ, gᵢ, hᵢ. (;) preserves s-finite kernels because for s-finite kernels f and g (with sub-probability kernels fᵢ, gⱼ), we have f ; g = ∑_{i,j∈ℕ} fᵢ ; gⱼ, a sum of sub-probability kernels.
To show that f >=> g preserves s-finite kernels, let f : A → B and g : B → C be s-finite kernels, decomposed into sub-probability kernels fᵢ. Note that for each x ∈ X and i ∈ ℕ, λa.λS.δ(x)(S) · fᵢ(a)({x}) is a sub-probability kernel. Thus, f >=> g is a sum of s-finite kernels and hence s-finite.
Proving that for sub-probability kernels f and g, f >=> g is also a (sub-)probability kernel is trivial, since we only need to show that (f >=> g)(a)(C) = 1 (or ≤ 1, respectively).
Proof. Associativity, left- and right-distributivity are inherited from the respective properties of the product of measures established by Lemma 2. Sub-probability kernels are preserved by Lemma 23.

B Proofs for Semantics
Lemma 7. For ∆ as in the semantics of the while loop, and for each σ and each S, the limit lim_{n→∞} ∆ⁿ(κ⇑)(σ)(S) exists (where κ⇑ maps every state to the non-termination state).
We proceed analogously when we restrict the allowed arguments of the kernel lim_{n→∞} ∆ⁿ(κ⇑)(σ)(S) to only those S that do not contain the non-termination state (where κ⇑ maps every state to that state), proving ∆ⁿ⁺¹(κ⇑) ≥ ∆ⁿ(κ⇑) for that case.

Lemma 8. In the absence of exception states, and using sub-probability kernels instead of distribution transformers, the definition of the semantics of the while loop from [23] is equivalent to ours.

Definition 6. In [23], Kozen shows a different way of defining the semantics of the while loop. In our notation, and in terms of probability kernels instead of distribution transformers, that definition becomes ⟦while e {P}⟧ = supₙ ∑_{i=0}^{n} (filter(e) >=> ⟦P⟧)ⁱ >=> filter(¬e). Here, exponentiation is in terms of Kleisli composition, i.e., κ⁰ = δ and κⁿ⁺¹ = κ >=> κⁿ. The sum and limit are meant pointwise. Furthermore, filter(e) is the kernel that keeps a state if e evaluates to true on it, and drops it otherwise (note that filter(e) and filter(¬e) are only sub-probability kernels, not probability kernels). In particular, we have used that left-distributivity does hold in this case, since S ∩ X = ∅.

C Probability kernel
In the following, we list lemmas that are crucial to prove Theorem 2 (restated for convenience).
Theorem 2. The semantics of each expression e and statement P is indeed a probability kernel.
Proof. We prove that f is an s-finite kernel. Let A∞ := {x ∈ A | f(x) = ∞}. Since f is measurable, the set A∞ must be measurable. This yields a sum of finite kernels, because the sets A∞ and {x | i ≤ f(x) < i + 1} = f⁻¹([i, i + 1)) are measurable. Note that any sum of finite kernels can be rewritten as a sum of sub-probability kernels.

The following lemma is important to show that the semantics of the while loop is a probability kernel.

Lemma 29. Suppose {κₙ}_{n∈ℕ} is a sequence of (sub-)probability kernels A → B. Then, if the limit κ = lim_{n→∞} κₙ exists, it is also a (sub-)probability kernel. Here, the limit is pointwise, in the sense ∀a ∈ A : ∀S ∈ Σ_B : κ(a)(S) = lim_{n→∞} κₙ(a)(S).
Proof. For every a ∈ A, κ(a, ·) is a measure, because the pointwise limit of finite measures is a measure. For every S ∈ Σ_B, κ(·, S) is measurable, because the pointwise limit of measurable functions fₙ : A → ℝ (with B as the σ-algebra on ℝ) is measurable.

D Proofs for consequences
In this section, we provide some proofs of consequences of our semantics, explained in Section 5.

Proof. If we evaluate 1/0 first, we only have weight on ⊥. If instead we evaluate F() first, we only have weight on the non-termination state, by an analogous calculation.
Proof. Here, we crucially rely on the absence of exceptions (for the third equality) and on Fubini's theorem (for the fourth equality).
Proof. The important steps of the proof are the following. Here, we make crucial use of associativity of the lifted product of measures from Lemma 6.

Discrete and continuous primitive distributions. Listing 1 illustrates a simple Gaussian mixture model (the figure only shows the function body). Depending on the outcome of a fair coin flip x (resulting in 0 or 1), y is sampled from a Gaussian distribution with mean 0 or mean 2 (and standard deviation 1). Note that in our PPL, we represent gauss(·, ·) by the more general construct sampleFrom_f(·, ·), with f : ℝ × [0, ∞) → ℝ → ℝ being the probability density function of the Gaussian distribution, f(µ, σ)(x) = (1/√(2πσ²)) · e^(−(x−µ)²/(2σ²)).

Listing 6. Using partial functions.

Partial functions. Many functions that are practically useful are only partial (meaning they are not defined for some inputs). Examples include uniform(a, b) (undefined for b < a) and the square root √e (undefined for e < 0).

For n ∈ ℕ, [n] := {1, . . ., n}. The Iverson brackets [·] are defined by [b] = 1 if b is true and [b] = 0 if b is false. A particular application of the Iverson brackets is to characterize the indicator function of a specific set S by [x ∈ S]. For a function f : X → Y and a subset of the domain S ⊆ X, f restricted to S is denoted by f|_S : S → Y.