Probabilistic Abductive Logic Programming using Dirichlet Priors

Probabilistic programming is an area of research that aims to develop general inference algorithms for probabilistic models expressed as probabilistic programs whose execution corresponds to inferring the parameters of those models. In this paper, we introduce a probabilistic programming language (PPL) based on abductive logic programming for performing inference in probabilistic models involving categorical distributions with Dirichlet priors. We encode these models as abductive logic programs enriched with probabilistic definitions and queries, and show how to execute and compile them to boolean formulas. Using the latter, we perform generalized inference using one of two proposed Markov Chain Monte Carlo (MCMC) sampling algorithms: an adaptation of uncollapsed Gibbs sampling from related work and a novel collapsed Gibbs sampling (CGS). We show that CGS converges faster than the uncollapsed version on a latent Dirichlet allocation (LDA) task using synthetic data. On similar data, we compare our PPL with LDA-specific algorithms and other PPLs. We find that all methods, except one, perform similarly and that the more expressive the PPL, the slower it is. We illustrate applications of our PPL on real data in two variants of LDA models (Seed and Cluster LDA), and in the repeated insertion model (RIM). In the latter, our PPL yields similar conclusions to inference with EM for Mallows models.


Introduction
Probabilistic programming is an area of research that aims to develop general inference algorithms for probabilistic models expressed as probabilistic programs whose execution corresponds to inferring the parameters of the probabilistic model. A wide range of probabilistic programming languages (PPLs) have been developed to express a variety of classes of probabilistic models. Examples of PPLs include Church [1], Anglican [2], BUGS [3], Stan [4] and Figaro [5]. Some PPLs, such as Church, enrich a functional programming language with exchangeable random primitives, and can typically express a wide range of probabilistic models. However, inference is not always tractable in these expressive languages. Other PPLs are logic-based. They typically add probabilistic annotations or primitives to a logical encoding of the model. This encoding usually relates either to first-order logic, e.g. Alchemy [6], BLOG [7], or to logic programming, e.g. PRiSM [8], ProbLog [9]. Most logic-based PPLs focus only on discrete models, and consequently are equipped with more specialized inference algorithms, with the advantage of making inference more tractable.
However, logic-based PPLs generally do not consider Bayesian inference with prior distributions. For instance, Alchemy implements Markov logic, encoding a first-order knowledge base into Markov random fields. Uncertainty is expressed by weights on the logical formulas, but it is not possible to specify prior distributions on these weights. ProbLog is a PPL that primarily targets the inference of conditional probabilities and the most probable explanation (maximum likelihood solution); it does not support the specification of prior distributions on categorical distributions. PRiSM is a PPL which introduces Dirichlet priors over categorical distributions and is designed for efficient inference in models with non-overlapping explanations.
This paper contributes to the field of logic-based PPLs by proposing an alternative approach to probabilistic programming. Specifically, we introduce a PPL based on logic programming for performing inference in probabilistic models involving categorical distributions with Dirichlet priors. We encode these models as abductive logic programs [10] enriched with probabilistic definitions and inference queries, such that the result of abduction allows overlapping explanations. We propose two Markov Chain Monte Carlo (MCMC) sampling algorithms for the PPL: an adaptation of the uncollapsed Gibbs sampling algorithm described in [11], and a newly developed collapsed Gibbs sampler. Our PPL is similar to PRiSM and different from ProbLog in that it can specify Dirichlet priors. Unlike PRiSM, but similarly to ProbLog, we allow overlapping explanations. However, in this paper, all the models we study have non-overlapping explanations.
We show how our PPL can be used to perform inference in two classes of probabilistic models: latent Dirichlet allocation (LDA, [12]), a well studied approach for topic modelling, including two variations thereof (Seed LDA and Cluster LDA); and the repeated insertion model (RIM, [13]), a model used for preference modelling whose generative story can be expressed using recursion. Our experiments demonstrate that our PPL can express a broad class of models in a compact way and scale up to medium size real data sets, such as LDA with approximately 5000 documents and 100 words per document. On synthetic LDA data, we compare our PPL with two LDA-specific algorithms: collapsed Gibbs sampling (CGS) and variational expectation maximization (VEM), and two state-of-the-art PPLs: Stan and PRiSM. We find that all methods, with the exception of the chosen VEM implementation, perform similarly, and that, with the exception of VEM, the more expressive the method, the slower it is, in the following order: CGS (fastest), PRiSM, VEM, our PPL, Stan (slowest).
The paper is organised as follows. Section 2 presents the class of probabilistic models supported by our PPL. In Section 3 we outline the syntax and the semantics of the PPL, whereas our two Gibbs sampling algorithms are discussed in Section 4. Section 5 shows our experimental results and Section 6 relates our PPL to other PPLs and methods. Finally, Section 7 concludes the paper.

The Probabilistic Model
This section begins with the description of a particular approach to probabilistic programming. Then we introduce peircebayes (PB), our probabilistic abductive logic programming language designed to perform inference in discrete models with Dirichlet priors. Throughout the paper we will use normal font for scalars ($\alpha$), arrow notation for vectors ($\vec{\alpha}$), and bold font for collections with multiple indexes ($\boldsymbol{\alpha}$), e.g. sets of vectors.
Probabilistic programs, as defined in [14], are "'usual' programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observe statements". We can instantiate this general definition by considering the notions of hyperparameters $\boldsymbol{\alpha}$, parameters $\boldsymbol{\theta}$, and observed variables $\vec{f}$, and assuming the goal to be the characterisation of the conditional distribution $P(\boldsymbol{\theta} \mid \vec{f}; \boldsymbol{\alpha})$.
Most PPLs assume such a conditional with a continuous sample space, i.e. they allow, in principle, probabilistic programs with an uncountably infinite number of unique outputs, should one not take into account issues of real number representations. In our approach, the conditional sample space is assumed to be finite, i.e. one can enumerate all possible executions. Specifically, in our PPL we restrict the class of distributions from which we can draw to categorical distributions with Dirichlet priors. The Dirichlet priors are chosen for their conjugacy, which supports efficient marginalisation.
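To make the role of conjugacy explicit, the standard Dirichlet-categorical identity (stated here only for reference; it is not part of the PB syntax) shows that observing categorical draws simply shifts the Dirichlet parameters by the observed counts:

$$P(\vec{\theta} \mid x_{1:N}; \vec{\alpha}) \;\propto\; \prod_{l} \theta_l^{\alpha_l - 1} \prod_{l} \theta_l^{c_l} \;=\; \prod_{l} \theta_l^{(\alpha_l + c_l) - 1}, \qquad \text{where } c_l = \sum_{n=1}^{N} [x_n = l],$$

so that $\vec{\theta} \mid x_{1:N} \sim \mathrm{Dirichlet}(\vec{\alpha} + \vec{c})$. This is the identity that later allows the model parameters to be marginalised out or updated in closed form.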
The generality of our PPL is not given by the range of probability distributions that we can draw from, but rather by the way the draws from categorical distributions interact in the "generative story" of the model. We choose our "usual programs" to be abductive logic programs enriched with probabilistic primitives. Similarly to Church, Anglican and other PPLs, this is a declarative programming language, but one in which the generative story is expressed as an abductive reasoning task responsible for identifying relevant draws from the categorical distributions given the observations. Our choice is motivated by the significant amount of related work in probabilistic logic programming, although both functional and logic programming are Turing complete, so they are equally general. In what follows we present the class of probabilistic models that are supported by our PPL. We define them first as uncollapsed models, then show how they can be collapsed. As demonstrated in Section 4, this dual formalisation leads naturally to the possibility of using in our PPL uncollapsed as well as collapsed MCMC sampling methods.

The Uncollapsed PB model
The class of probabilistic models that can be expressed in PB, inspired by "propositional logic-based probabilistic (PLP) models" [11], is depicted in Figure 1 (the PB plate model), using the general plate notation [15]. In the figure, unbounded nodes are constants, circle nodes are latent variables, shaded nodes are observed variables, diamond nodes are nodes that can be (but are not always) deterministic given their parent $x$ (we discuss the details below), and $A$, $I_a$, $N$, and $k_a$ are positive integers, with $k_a \geq 2$.
The above plate model encodes the following (joint) probability distribution:
$$P(\vec{f}, \boldsymbol{v}, \boldsymbol{x}, \boldsymbol{\theta}; \boldsymbol{\alpha}) = \prod_{n=1}^{N} P(f_n \mid \boldsymbol{v}_{n*}) \prod_{a=1}^{A} \prod_{i=1}^{I_a} \Big[ P(\vec{\theta}_{ai}; \vec{\alpha}_a) \prod_{n=1}^{N} P(\boldsymbol{v}_{nai*} \mid x_{nai})\, P(x_{nai} \mid \vec{\theta}_{ai}) \Big] \qquad (1)$$
We use $*$ to denote the set of variables obtained by iterating over the missing indexes, e.g. $\boldsymbol{v}_{nai*}$ is the set of all the variables $v_{naij}$, for $j = 1, \ldots, k_a - 1$, and $\boldsymbol{v}_{n*}$ is the set of all the variables $v_{naij}$, for $a = 1, \ldots, A$, $i = 1, \ldots, I_a$, $j = 1, \ldots, k_a - 1$. Un-indexed variables are implicitly such sets, e.g. $\boldsymbol{x} = x_*$.
In the model, each $\vec{\alpha}_a$, for $a = 1, \ldots, A$, is a vector of finite length $k_a \geq 2$ of positive real numbers. Each $\vec{\alpha}_a$ may have a different size, and represents the parameters of a Dirichlet distribution. From each such distribution $I_a$ samples are drawn, i.e.:
$$\vec{\theta}_{ai} \sim \mathrm{Dirichlet}(\vec{\alpha}_a), \qquad i = 1, \ldots, I_a$$
The samples $\vec{\theta}_{ai}$ are parameters of categorical distributions. For each $i$ and $a$, there are $N$ samples $x_{nai}$ from the associated categorical distribution:
$$x_{nai} \sim \mathrm{Categorical}(\vec{\theta}_{ai}), \qquad n = 1, \ldots, N$$
Each realization $x_{nai} = l$ is encoded, following [16], as a set of propositional variables $v_{naij} \in \{0, 1\}$, for $j = 1, \ldots, k_a - 1$, in the following manner: $x_{nai} = l$ corresponds to the conjunction $\overline{v}_{nai1} \cdots \overline{v}_{nai(l-1)}\, v_{nail}$ (with all literals negated when $l = k_a$), where $\overline{v}$ denotes boolean negation, and $2^{l-(k_a-1)}$ is a normalization constant over the variables left unconstrained by the realization.
Note that not all variables in $\boldsymbol{v}$ are deterministic; more specifically, for all $j$ such that $l < j < k_a$, $v_{naij}$ is not determined by the realization $x_{nai} = l$. Our presentation deviates from [11] in that we present both $\boldsymbol{x}$ and $\boldsymbol{v}$ in the same model, rather than considering two types of equivalent models ("base models" and "PLP models"). But the probabilistic semantics of $\boldsymbol{v}$ is the same as the one derived from the annotated disjunction (AD) compilation (cf. Section 3.3.1 of [17]). In our PB model, the boolean variables have probabilities $P(v_{naij} = 1; \vec{\theta}_{ai}) = \theta_{aij} / (1 - \sum_{m=1}^{j-1} \theta_{aim})$, and, for $j \leq l$, $P(v_{naij} = 1 \mid x_{nai} = l) = [j = l]$, where $[i = j]$ is the Kronecker delta function $\delta_{ij}$. Therefore, it follows that $P(x_{nai} = l; \vec{\theta}_{ai}) = \theta_{ail}$.

Example 2.1. For the reader unfamiliar with ADs, we offer a brief example.
Let there be a four-sided die with probabilities $\theta_1, \theta_2, \theta_3, \theta_4$ for its faces. The AD compilation encodes the four outcomes using three boolean variables $v_1, v_2, v_3$: outcome 1 corresponds to $v_1$, outcome 2 to $\overline{v}_1 v_2$, outcome 3 to $\overline{v}_1 \overline{v}_2 v_3$, and outcome 4 to $\overline{v}_1 \overline{v}_2 \overline{v}_3$, with $P(v_1 = 1) = \theta_1$, $P(v_2 = 1) = \theta_2/(1 - \theta_1)$ and $P(v_3 = 1) = \theta_3/(1 - \theta_1 - \theta_2)$, so that each face is generated with its intended probability.
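As a code-level illustration of the same encoding, the following small Python sketch (ours; the function names are illustrative and not part of the PB implementation) computes the boolean-variable probabilities of an AD compilation and decodes an assignment back to a category; it reproduces the probabilities of Example 2.1.

import numpy as np

def ad_probabilities(theta):
    """Probabilities P(v_j = 1) of the k-1 boolean variables that encode a
    categorical distribution with parameters theta (AD compilation)."""
    theta = np.asarray(theta, dtype=float)
    remaining = 1.0 - np.concatenate(([0.0], np.cumsum(theta[:-2])))  # mass left before v_j
    return theta[:-1] / remaining

def decode(v):
    """Decode an assignment of the boolean variables v (list of 0/1) to a
    category l in 1..k: the first true variable wins, else the last category."""
    for j, vj in enumerate(v, start=1):
        if vj:
            return j
    return len(v) + 1

theta = [0.1, 0.2, 0.3, 0.4]
print(ad_probabilities(theta))   # [0.1, 0.2222..., 0.4285...]
print(decode([0, 1, 0]))         # category 2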
The observed variables of the model, $f_n \in \{0, 1\}$, represent the output of boolean functions of $\boldsymbol{v}$, such that $f_n = \mathrm{Bool}_n(\boldsymbol{v}_{n*})$, where $\mathrm{Bool}_n(\boldsymbol{v})$ denotes an arbitrary boolean function of variables $\boldsymbol{v}$. In our approach, the observed value of each $f_n$ is assumed to be 1 (or true).
Inference in PB can be described in terms of the general schema of probabilistic inference presented at the beginning of this section, i.e. the characterization of $P(\boldsymbol{\theta} \mid \vec{f}; \boldsymbol{\alpha})$. The parameters and the hyper-parameters correspond to $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$, respectively. The observed data $\vec{f}$ is a vector of $N$ data points (observations) where, by convention, $f_n = 1$ ensures that the $n$-th observation is included in the model (assume this to be always the case). Furthermore, observation $f_n$ is independent of any other observation $f_{n'}$, $n' \neq n$, if conditioned on $\boldsymbol{x}$ (since $\boldsymbol{x}$ determines $\boldsymbol{v}$); this is implied by the joint distribution given in Equation (1). The various ways in which the $n$-th data point can be generated, as well as the distributions involved in this process, are encoded through the boolean function $\mathrm{Bool}_n(\boldsymbol{v}_{n*})$.

Example 2.2. LDA as a PB model
We illustrate the encoding of a popular probabilistic model for topic modelling, the latent Dirichlet allocation (LDA) [12], as a PB model. This will also serve as a running example throughout Section 3. LDA can be summarized as follows: given a corpus of $D$ documents, each document is a list of tokens, the set of all tokens in the corpus is the vocabulary, with size $V$, and there exist $T$ topics. There are two sets of categorical distributions: $D$ distributions over $T$ categories (one topic distribution per document), each distribution parametrized by $\vec{\mu}_d$; and $T$ distributions over $V$ categories (one token distribution per topic), each distribution parametrized by $\vec{\phi}_t$. The tokens of a document $d$ are produced independently by sampling a topic $t$ from $\vec{\mu}_d$, then sampling a token from $\vec{\phi}_t$. Furthermore, each distribution in $\boldsymbol{\mu}$ is sampled using the same Dirichlet prior with parameters $\vec{\gamma}$, and, similarly, each distribution in $\boldsymbol{\phi}$ is sampled using $\vec{\beta}$. Note that $\{\boldsymbol{\mu}, \boldsymbol{\phi}\}$ correspond to the parameters $\boldsymbol{\theta}$ in the general model, and $\{\vec{\gamma}, \vec{\beta}\}$ correspond to $\boldsymbol{\alpha}$. To instantiate a minimal LDA model, let us consider a corpus with 3 documents, 2 topics and a vocabulary of 4 tokens. The plate notation of the PB model for this minimal LDA, depicted in full, is given in Figure 2. Relating to the PB model, we have $A = 2$, i.e. one plate for each of $\vec{\gamma}$ and $\vec{\beta}$, $I_1 = 3$, one topic mixture for each of the 3 documents, $I_2 = 2$ and $k_1 = 2$ for the two topics, and $k_2 = 4$ for the 4 tokens in the vocabulary.
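As a side illustration of the generative story just described (not of the PB language itself), the following Python sketch samples a toy corpus; the dimensions, the seeding and all names are ours.

import numpy as np

def generate_lda_corpus(D=3, T=2, V=4, tokens_per_doc=10, gamma=1.0, beta=1.0, seed=0):
    """Sample a toy corpus from the LDA generative story described above.
    Returns (mu, phi, corpus) where corpus[d] is a list of token indexes."""
    rng = np.random.default_rng(seed)
    mu = rng.dirichlet(np.full(T, gamma), size=D)    # one topic mixture per document
    phi = rng.dirichlet(np.full(V, beta), size=T)    # one token distribution per topic
    corpus = []
    for d in range(D):
        doc = []
        for _ in range(tokens_per_doc):
            t = rng.choice(T, p=mu[d])               # sample a topic from mu_d
            w = rng.choice(V, p=phi[t])              # sample a token from phi_t
            doc.append(w)
        corpus.append(doc)
    return mu, phi, corpus

mu, phi, corpus = generate_lda_corpus()
print(corpus[0])   # e.g. [2, 0, 3, 3, 1, ...]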
Let the first data point be the observation of token 2 in document 3. Then the associated boolean function is:
$$\mathrm{Bool}_1(\boldsymbol{v}_{1*}) = v_{13}\, \overline{v}_{141} v_{142} + \overline{v}_{13}\, \overline{v}_{151} v_{152}$$
The literals $v_{13}$ and $\overline{v}_{13}$ denote the choice, in document 3, of topic 1 and 2, respectively, and the conjunctions $\overline{v}_{141} v_{142}$ and $\overline{v}_{151} v_{152}$ denote the choice of the second token from topic 1 and 2, respectively (Figure 2 shows the PB model for this LDA example with 3 documents, 2 topics and 4 tokens). Note that, in Figure 2, even though all possible edges between deterministic nodes and $f_n$ are drawn, not all the variables will necessarily and/or simultaneously affect the probability of $f_n$. For instance, the value of $\mathrm{Bool}_1(\boldsymbol{v}_{1*})$ does not depend on the value of $v_{12}$, which is related to document 2, since observations about document 3 and document 2 are independent. Also, $\mathrm{Bool}_1(\boldsymbol{v}_{1*})$ cannot depend at the same time on both $v_{141}$ and $v_{151}$, since the truth value of $v_{13}$ effectively filters out the effect of one or the other.

The Collapsed PB model
The above model, described as uncollapsed, can be inefficient to sample from, due to the large number of variables. The same PB model can be reformulated as a collapsed model, where the parameters of the categorical distributions are marginalised out. This is straightforward since we assume conjugate priors. We present the collapsed model here in order to introduce the distributions used in the derivation of the collapsed Gibbs sampler (CGS) given in Section 4.
Consider $a$ and $i$ fixed. Since we make multiple draws from a categorical distribution parametrized by $\vec{\theta}_{ai}$, it is convenient to work with count summaries. Therefore we overload the notation $\vec{x}^*_{ai}$ to denote a $k_a$-sized vector of counts, i.e. for all categories $l = 1, \ldots, k_a$, $x^*_{ail} = \sum_{n=1}^{N} [x_{nai} = l]$. Similarly, $\boldsymbol{x}$ is overloaded to denote the set of such vectors for each $a = 1, \ldots, A$, $i = 1, \ldots, I_a$. Integrating out $\theta$ for a single Dirichlet-categorical pair yields:
$$P(\vec{x}^*_{ai}; \vec{\alpha}_a) = \frac{\Gamma(\Sigma(\vec{\alpha}_a))}{\Gamma(\Sigma(\vec{\alpha}_a + \vec{x}^*_{ai}))} \prod_{l=1}^{k_a} \frac{\Gamma(\alpha_{al} + x^*_{ail})}{\Gamma(\alpha_{al})}$$
where $\Sigma(\vec{v})$ denotes the sum of the elements of some vector $\vec{v}$ and $\Gamma$ denotes the gamma function. Also, from the conditional independence of our $\vec{x}^*_{ai}$:
$$P(\boldsymbol{x}; \boldsymbol{\alpha}) = \prod_{a=1}^{A} \prod_{i=1}^{I_a} P(\vec{x}^*_{ai}; \vec{\alpha}_a)$$
The joint distribution of the collapsed PB model becomes:
$$P(\vec{f}, \boldsymbol{v}, \boldsymbol{x}; \boldsymbol{\alpha}) = P(\vec{f} \mid \boldsymbol{v})\, P(\boldsymbol{v} \mid \boldsymbol{x})\, P(\boldsymbol{x}; \boldsymbol{\alpha}) \qquad (5)$$
The joint distribution in Equation (5) is simply Equation (1) with $\boldsymbol{\theta}$ integrated out. Note that $\boldsymbol{v}$ must take values from the models of the formulas $\mathrm{Bool}_n$ in order to have $P(\vec{f} \mid \boldsymbol{v}) = 1$ (recall that $\vec{f}$ is observed). Furthermore, a given realization of $\boldsymbol{v}$ is uniquely decoded to a realization of $\boldsymbol{x}$. In practice, we always make sure that these conditions are met, cf. Section 3.
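The marginal above is straightforward to evaluate numerically; a possible implementation (ours, using SciPy's log-gamma function) is:

import numpy as np
from scipy.special import gammaln

def log_dirichlet_categorical(counts, alpha):
    """log P(x*_ai; alpha_a): marginal likelihood of categorical counts
    with the categorical parameters integrated out under a Dirichlet prior."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (gammaln(alpha.sum()) - gammaln((alpha + counts).sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# e.g. counts of a 4-sided die observed 10 times, symmetric prior alpha = 1
print(log_dirichlet_categorical([2, 3, 1, 4], [1.0, 1.0, 1.0, 1.0]))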
In summary, inference in PB models means to characterize $P(\boldsymbol{\theta} \mid \vec{f}; \boldsymbol{\alpha})$. Since Dirichlet priors are conjugate to categorical distributions, the posterior distributions are also Dirichlet distributions with parameters $\boldsymbol{\alpha}'$, where $\vec{\alpha}'_{ai} = \vec{\alpha}_a + \vec{x}^*_{ai}$. Therefore, the inference task is to estimate $\boldsymbol{\alpha}'$. Informally, the PB inference task is computed in two steps. The first step, described in Section 3, consists of representing a given probabilistic model as an abductive logic program enriched with probabilistic primitives and executing this program. The latter yields the formulas $\mathrm{Bool}_n$, and thus the PB model is completely specified. The second step consists of sampling the PB model and is described in Section 4.

Syntax and Semantics
In this section we formally define the syntax and semantics (up to MCMC sampling) of probabilistic programs in PB. PB programs will be defined as abductive logic programs where the set of abducibles corresponds to the sample space of $P(\boldsymbol{x} \mid \boldsymbol{\theta})$. Each observation in the model corresponds to an abductive query. The execution of the abductive query given the program means explaining that observation, and the result of this execution will be a formula in terms of abducibles. This formula is then translated into a boolean formula $\mathrm{Bool}$ expressed in terms of the boolean variables $\boldsymbol{v}$ from the PB model. We argue that not all observations need to be explained individually, i.e. by executing the previous steps for each observation, and instead propose a more efficient approach. Finally, we describe a compact encoding of each boolean formula $\mathrm{Bool}$ into a (reduced ordered) binary decision diagram, which will be used for MCMC sampling.

Abductive Logic Programming and PB
A PB program is an abductive logic program [10] enhanced with probabilistic predicates. Adapting the definitions from [18], an abductive logic program is a tuple $(\Pi, AB)$, where $\Pi$ is a normal logic program and $AB$ is a finite set of ground atoms called abducibles. In PB, an abductive logic program encodes the generative story of the model, as well as the observed data.
A query $Q$ is a conjunction of existentially quantified literals. In PB, there exists one query for each observation, describing how that observation should be explained. An abductive solution for a query $Q$ is a set of ground abducibles $\Delta \subseteq AB$ such that:
$$\mathrm{comp}_3(\Pi \cup \Delta) \cup \mathrm{CET} \models Q$$
where CET denotes the Clark Equality Theory axioms [19], and $\mathrm{comp}_3(\Pi)$ the Fitting three-valued completion of a program $\Pi$ [20]. CET is needed to define the semantics of universally quantified negation in programs containing variables. The result of a query is the disjunction of all the abductive solutions computed by the abductive logic program, where each abductive solution is considered a conjunction of its elements. This computation is described in Appendix A.
The syntax of the non-probabilistic predicates is similar to Prolog, and is documented in [21]. The domain of the abducibles is specified through a probabilistic predicate pb_dirichlet. This predicate defines a set of categorical distributions with the same Dirichlet prior, i.e. it allows draws from $P(x_{nai} \mid \vec{\theta}_{ai})$, for a fixed $a$ and all $n$ and $i$. A conjunction of such predicates defines the plate indexed by $a$. In PB, the syntax of the predicate is pb_dirichlet(Alpha_a, Name, K_a, I_a). The first argument, Alpha_a, corresponds to $\vec{\alpha}_a$ in the model, and can be either a list of $k_a$ positive scalars specifying the parameters of the Dirichlet, or a positive scalar that specifies a symmetric prior. The second argument, Name, is an atom that will be used as a functor when calling a predicate that represents a realization of a categorical random variable on the $a$-th plate. The third argument K_a corresponds to $k_a$, and I_a represents $I_a$, i.e. the number of categorical distributions having the same prior.
Declaring pb_dirichlet predicates simply states that the distributions on the $a$-indexed plate exist. The draws from these distributions, i.e. samples from $P(\boldsymbol{x} \mid \boldsymbol{\theta})$, are realized using a predicate Name(K_a, I_a). The first argument denotes a category from $1, \ldots, k_a$, and the second argument a distribution from $1, \ldots, I_a$. Informally, the meaning of the predicate is that it draws a value K_a from the I_a-th distribution with Dirichlet prior parametrized by $\vec{\alpha}_a$. All predicates Name(K_a, I_a) are assumed to be ground atoms when called. Therefore the set of abducibles can be defined as:
$$AB = \{\, \mathrm{Name}_a(l, i) \mid a = 1, \ldots, A,\ i = 1, \ldots, I_a,\ l = 1, \ldots, k_a \,\}$$
Example 3.1 encodes the LDA model of Example 2.2 as a PB program, using two pb_dirichlet declarations, named mu (the topic mixtures) and phi (the token distributions), together with observe facts and a generate rule. Each observe fact encodes a document, indexed using the first argument, and consisting of a bag-of-words in the second argument. The bag-of-words is a list of pairs: a token with an index and its (positive) count per document.
The generate rule describes how each observation, characterized by a particular Token in a Document, is generated (or explained). The variable Topic is grounded as either 1 or 2, in general 1 up to $T$. Note that observe and generate are not keywords, but descriptive conventional names. In this example, the abducibles are the predicates with functors mu and phi.
In PB, each abducible corresponds to a draw from $P(x_{nai} \mid \vec{\theta}_{ai})$, represented as a tuple $(a, i, l)$ via a bijection $\rho : AB \to \{(a, i, l) \mid a \in \{1, \ldots, A\}, i \in \{1, \ldots, I_a\}, l \in \{1, \ldots, k_a\}\}$. The tuple consists of an index of the distribution $(a, i)$ and the drawn category $l$. For an abducible Name(K_a, I_a), K_a corresponds to $l$, I_a corresponds to $i$, and $a$ is determined by the order of the pb_dirichlet definition involving Name. For instance, in the context of Example 3.1, $\rho(\mathrm{phi}(3,2)) = (2, 2, 3)$.
Abusing notation, the mapping ρ is also used to map abductive solutions, i.e. ρ(∆) is a list of tuples (a, i, l).
The semantics of the abductive logic program ensures that abductive solutions are minimal, which means no abducibles are included in an abductive solution unless they are necessary to satisfy the query. Probabilistically, this means that no draws are made except the ones needed to explain an observation. This implicit marginalisation and compactness (e.g. queries that produce $v_1$ and $v_1 v_2 + v_1 \overline{v}_2$ are equivalent) is effectively enforced by BDD compilation, described in Section 3.2.
The PB model constrains the draws from the categorical distributions such that, for each observation, we draw at most once per distribution. Since each observation is explained by an abductive query, we must enforce this constraint on our abductive solutions, i.e. there can be no $(a, i, l_1), (a, i, l_2) \in \rho(\Delta)$ such that $l_1 \neq l_2$.
We obtain the result of the query by applying $\rho$ to the abductive solutions and taking their disjunction.

Knowledge compilation and multiple observations
The purpose of abduction is to produce, for each observation $n = 1, \ldots, N$, a formula $\mathrm{Bool}_n(\boldsymbol{v}_{n*})$. Having computed the result of a query, all that is left is to translate each draw $(a, i, l)$ into boolean variables. We define this translation, denoted conditional AD compilation, as ordering all draws by their distribution index $(a, i)$, i.e. first by $a$ then $i$, and compiling each distribution as an AD with respect to the categories present in the result of the query. The result of the query is then parsed using conditional AD compilation into a boolean formula. Note that, in the LDA example, for a fixed number of topics this representation is invariant to $V$, the number of tokens in the vocabulary, due to the fact that, when explaining an observation, we draw a single token from some topic.
The proposed conditional AD compilation is different from conventional AD compilation (cf. Section 3.3.1 of [17]) in that, instead of considering the sample space of the distributions, it considers the conditional sample space of the distributions given a particular observation. In models such as LDA this difference is crucial, since in typical inference tasks the size of the vocabulary $V$ is of the order of tens of thousands of tokens, and conventional AD compilation would thus create conjunctions of up to $V - 1$ variables.
So far we have discussed the explanation of a single observation, i.e. the computation of a single abductive query and the translation of the result of the query into a boolean formula. It would be inefficient to treat multiple observations by computing a query for every single observation. For instance, in the LDA example, the same token Token can be produced multiple times from the same document Document. The queries corresponding to these observations are identical: generate(Document, Token), therefore it is redundant to execute the query more than once for all such observations. We make a further remark, namely that all observations generated from the same set of topics in an LDA task produce the same formula after conditional AD compilation, although the particular distributions used are different for different document-token pairs.
We exploit these properties when expressing queries in a PB program. To do so, we introduce the probabilistic predicate pb_plate(OuterQ, Count, InnerQ), where the queries OuterQ and InnerQ are conjunctions represented as lists, and Count is a positive integer. For each successful solution of OuterQ, the Count argument and the arguments of InnerQ are instantiated, then the predicate executes the query InnerQ. We assume that all observations defined by a pb_plate predicate yield the same formula from the abductive solutions of InnerQ. If one did not know whether the formulas are identical, one could simply write a pb_plate definition for each observation, with an empty outer query and a count argument of 1. In the LDA example, the pb_plate definition iterates over the observe facts with its outer query and has the form pb_plate(..., Count, [generate(Doc, Token)]).
An example of a PB program with multiple pb plate definitions is given for the Seed LDA model, discussed in Section 5, in Table 1.
The formula for each pb_plate definition is compiled into a reduced ordered binary decision diagram (ROBDD, in the rest of the paper the RO attributes are implicit) [22, 23], with the variables in ascending order according to their index. The BDD of a boolean formula $\mathrm{Bool}$ allows one to express in a compact manner all the models of $\mathrm{Bool}$. This enables us to sample (a subset of) the variables $\boldsymbol{v}$ such that $P(\vec{f} \mid \boldsymbol{v})$ equals 1. Furthermore, the conditional AD compilation allows us to correctly decode $\boldsymbol{v}$ into $\boldsymbol{x}$, such that all the deterministic constraints of the model are satisfied.
Compiling the abductive solutions into a BDD allows PB to perform inference in models with overlapping explanations, similarly to ProbLog.
The final step of inference in PB is sampling via Markov Chain Monte Carlo (MCMC), which we discuss in Section 4.

MCMC sampling
Inference in high-dimensional latent variable models is typically done using approximate inference algorithms, e.g. variational inference or MCMC sampling. We use the latter, in particular two algorithms: an adaptation to PB models of Ishihata and Sato's uncollapsed Gibbs sampling for PLP models [11], and our novel collapsed Gibbs sampling (CGS) for PB models.

Uncollapsed Gibbs sampling
Following [11], we can perform uncollapsed Gibbs sampling along two dimensions ($\boldsymbol{\theta}$ and $\boldsymbol{x}$) by alternately sampling from $P(\boldsymbol{x} \mid \boldsymbol{\theta}, \vec{f})$ and $P(\boldsymbol{\theta} \mid \boldsymbol{x}, \boldsymbol{\alpha})$. We use a hat to denote the samples in some iteration, and if the variable is a vector, we omit vector notation. A sample from $P(\boldsymbol{\theta} \mid \boldsymbol{x}, \boldsymbol{\alpha})$ yields an estimate $\hat{\boldsymbol{\theta}}$ that is averaged over the sampling iterations to produce our posterior belief of $\boldsymbol{\theta}$.
Sampling from $P(\boldsymbol{\theta} \mid \boldsymbol{x}, \boldsymbol{\alpha})$ means sampling:
$$\hat{\theta}_{ai} \sim \mathrm{Dirichlet}(\vec{\alpha}_a + \vec{x}^*_{ai}), \qquad a = 1, \ldots, A,\ i = 1, \ldots, I_a$$
Assume this is implemented by a function used in Algorithm 1. Sampling from $P(\boldsymbol{x} \mid \hat{\boldsymbol{\theta}}, \vec{f})$ can be done by sampling a path from the root to the "true" leaf in each of the BDDs for $\mathrm{Bool}_n$, $n = 1, \ldots, N$, which yields a sample from $\boldsymbol{v}$ that is then decoded to $\boldsymbol{x}$. To sample a BDD, we sample the truth value of each node, and therefore we need to use $P(\boldsymbol{v}; \boldsymbol{\theta})$, as defined in Section 2.1.
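For intuition, the following Python sketch shows one way to sample a satisfying path of a BDD in proportion to the probabilities of its variables. It is a simplified illustration under our own node representation; the actual procedure used by PB is the one described in Appendix B.

import random

# A BDD node is (var, low, high); leaves are the booleans True/False.
# 'var' is the index of a boolean variable, 'low'/'high' the children for v=0/v=1.

def backward_prob(node, p, cache=None):
    """Probability that the formula rooted at 'node' is satisfied,
    given P(v_j = 1) = p[j] for every variable j."""
    if cache is None:
        cache = {}
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    if id(node) not in cache:
        var, low, high = node
        cache[id(node)] = (1 - p[var]) * backward_prob(low, p, cache) \
                          + p[var] * backward_prob(high, p, cache)
    return cache[id(node)]

def sample_path(node, p):
    """Sample variable assignments along a root-to-'true'-leaf path, with
    probability proportional to P(v; theta) restricted to satisfying paths."""
    assignment = {}
    while node not in (True, False):
        var, low, high = node
        w1 = p[var] * backward_prob(high, p)
        w0 = (1 - p[var]) * backward_prob(low, p)
        bit = 1 if random.random() < w1 / (w0 + w1) else 0
        assignment[var] = bit
        node = high if bit else low
    return assignment

# Example: the formula v0 v1 + (not v0) v2, with P(v0)=0.3, P(v1)=0.6, P(v2)=0.5
bdd = (0, (2, False, True), (1, False, True))
print(sample_path(bdd, {0: 0.3, 1: 0.6, 2: 0.5}))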
According to the conditional AD compilation defined in Section 3.2, a compilation is performed for each individual observation, and the observations are grouped if they correspond to the same query, and grouped again when they correspond to the same compiled boolean formula $\mathrm{Bool}$. This means that, for each formula $\mathrm{Bool}$, the probabilities of the boolean variables $P(\boldsymbol{v}; \boldsymbol{\theta})$ must be computed from $\hat{\boldsymbol{\theta}}$ for each distinct query. Let $N_{bdd}$ be the number of boolean variables in BDD $bdd$, and $N_{Q,bdd}$ be the number of distinct queries whose result yields the same formula $\mathrm{Bool}$ encoded in $bdd$. Then $\boldsymbol{\theta}_{bdd}$ is an $N_{Q,bdd}$ by $N_{bdd}$ matrix, computed from $\hat{\boldsymbol{\theta}}$ using a dedicated function. The sampling algorithm is given in Algorithm 1, and the algorithm for sampling a single BDD (sample_x() from Algorithm 1) is explained in Appendix B.
In Algorithm 1, the inputs to the sampler are the priors $\boldsymbol{\alpha}$, a set of BDDs bdds, and a number of iterations maxit. We define some additional notation: we use [ ] to denote an empty list, append(list, el) to append element el to a list list, zeros() to create empty vectors or matrices of shapes specified by the arguments, and avg() to take the average of a vector or list. To update $\boldsymbol{x}$, we must sample all BDDs (one for each pb_plate definition), and all observations within each BDD. We assume that an observation obs takes values in $1, \ldots, N_{Q,bdd}$, and count represents the number of times the observation is repeated, as specified by the Count argument of a pb_plate predicate. For each observation obs we will use only the obs-th row from $\boldsymbol{\theta}_{bdd}$, denoted as $\boldsymbol{\theta}_{bdd}[obs, :]$.

Collapsed Gibbs sampling
Algorithm 2: Collapsed Gibbs sampling for PB. (In each iteration, the sampler updates the posterior parameters $\vec{\alpha}'_{ai} \leftarrow \vec{\alpha}_a + \vec{x}^*_{ai}$ for all $a = 1, \ldots, A$ and $i = 1, \ldots, I_a$, appends $\boldsymbol{\alpha}'$ to samples, and finally returns avg(samples).)
As explained in [11], it is not feasible to perform CGS in PLP models as a generalization of CGS for LDA in [24]. It is, however, possible to define a general CGS procedure for PB. This uses the independence property of our observations $\vec{f}$ given $\boldsymbol{x}$. The advantage of collapsed versus uncollapsed Gibbs sampling is the faster convergence of the collapsed sampler. We show that this is the case for an experiment using synthetic data in an LDA task in Section 5.1.
CGS for PB models entails sampling $P(\boldsymbol{x}_{n*} \mid \boldsymbol{\alpha}, \boldsymbol{x}_{-n*})$, for $n = 1, \ldots, N$, where the subscript $-n$ means the range of values $1, \ldots, n-1, n+1, \ldots, N$. Note that, in contrast to uncollapsed Gibbs sampling, we sample one observation at a time, and $\boldsymbol{\theta}$ is integrated out. Since sampling a BDD requires probabilities on boolean variables, $P(\boldsymbol{v} \mid \boldsymbol{\theta})$, we "uncollapse" $\boldsymbol{\theta}$ for this purpose as the average of its current Dirichlet posterior parametrized by $\boldsymbol{\alpha}'$:
$$\hat{\theta}_{ail} = \frac{\alpha'_{ail}}{\Sigma(\vec{\alpha}'_{ai})}$$
We assume this is implemented by avg($\boldsymbol{\alpha}'$).
The CGS is shown in Algorithm 2. Before sampling, we initialize $\boldsymbol{x}$ by setting $\boldsymbol{\theta}$ to the average of the Dirichlet prior, then $\boldsymbol{x}$ is sampled as in an iteration of the uncollapsed Gibbs sampler. We assume this setup is implemented in a function initialize($\boldsymbol{\alpha}$). We switch representation from a set of BDDs to a set $O$ of observations, each observation having its own BDD. Each iteration loops over the observations and removes the draws of the current observation from $\boldsymbol{x}$, then updates $\boldsymbol{\theta}$, then re-samples the current observation to update $\boldsymbol{x}$. To remove the draws of the current observation, we use a function draws(obs) that, for some observation obs, returns the list of draws required to explain that observation, i.e. a list of $(a, i, l)$ tuples. The draws represent the path sampled from the BDD in the previous iteration. Unlike the uncollapsed Gibbs sampling algorithm, where all observations were sampled given the same $\hat{\boldsymbol{\theta}}$, in CGS we update $\hat{\boldsymbol{\theta}}$ after each observation. Furthermore, the samples we record are not $\hat{\boldsymbol{\theta}}$, but the posterior Dirichlet parameters $\boldsymbol{\alpha}'$.
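As a concrete, much simplified instance of the update just described, the following Python sketch performs collapsed Gibbs sampling for plain LDA, operating directly on topic assignments and count matrices instead of BDD paths; all names and defaults are ours.

import numpy as np

def collapsed_gibbs_lda(docs, T, V, gamma=1.0, beta=0.1, iters=100, seed=0):
    """docs: list of lists of token indexes. Returns posterior Dirichlet
    parameters (gamma + doc-topic counts, beta + topic-token counts)."""
    rng = np.random.default_rng(seed)
    z = [[rng.integers(T) for _ in doc] for doc in docs]     # initial topic assignments
    ndt = np.zeros((len(docs), T)) + gamma                   # alpha' for topic mixtures
    ntw = np.zeros((T, V)) + beta                            # alpha' for token distributions
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndt[d, z[d][i]] += 1
            ntw[z[d][i], w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d, t] -= 1; ntw[t, w] -= 1               # remove the draws of this observation
                p = (ndt[d] / ndt[d].sum()) * (ntw[:, w] / ntw.sum(axis=1))  # "uncollapsed" theta
                t = rng.choice(T, p=p / p.sum())             # re-sample the observation
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1               # put the new draws back
    return ndt, ntw

docs = [[0, 1, 1, 2], [2, 3, 3, 0]]
print(collapsed_gibbs_lda(docs, T=2, V=4, iters=10))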

Evaluation
In this section we present experiments with PB. The first two experiments are quantitative, while the rest are qualitative. In the latter, we show that inference in PB yields reasonable results given our intuition about the models, and, in the case of the repeated insertion model, we reach similar conclusions compared to a different model (the Mallows model).

PB and collapsed Gibbs sampling (CGS) for LDA on synthetic data
We run a variation of the experiment performed in [24, 11]. A synthetic corpus is generated from an LDA model with parameters: 25 words in the vocabulary, 10 topics, 100 documents, 100 words per document, and a symmetric prior on the mixture of topics $\mu$, $\gamma = 1$. The topics used as ground truth specify uniform probabilities over 5 words, cf. [24, 11]. We evaluate the convergence of the PB samplers and an LDA-specific CGS implementation in the topicmodels R package. The parameters are: $\beta = \gamma = 1$ as (symmetric) hyper-parameters, and we run 100 iterations of the samplers. The experiments are run 10 times over each corpus from a set of 10 identically sampled corpora, yielding 100 values of the log likelihood per iteration. The average and 95% confidence interval (under a normal distribution) per iteration are shown in Figure 3. The experiment yields two conclusions: 1) similarly to [11], we find that uncollapsed Gibbs sampling converges slower than CGS, and 2) PB-CGS performs similarly to the LDA-specific CGS, which supports our claim that CGS in PB generalizes CGS in LDA.
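For readers who wish to reproduce a similar setup, a corpus of this kind can be generated along the following lines. This is our sketch, assuming the ground-truth topics are the rows and columns of a 5x5 word grid, in the spirit of [24]; the exact topic layout used in the experiment may differ.

import numpy as np

def synthetic_lda_corpus(D=100, doc_len=100, gamma=1.0, seed=0):
    """Synthetic LDA corpus: 25-word vocabulary arranged on a 5x5 grid,
    10 ground-truth topics uniform over the 5 words of each row or column."""
    rng = np.random.default_rng(seed)
    grid = np.arange(25).reshape(5, 5)
    topics = [grid[r, :] for r in range(5)] + [grid[:, c] for c in range(5)]
    phi = np.zeros((10, 25))
    for t, words in enumerate(topics):
        phi[t, words] = 1.0 / 5.0                     # uniform over 5 words
    docs = []
    for _ in range(D):
        mu = rng.dirichlet(np.full(10, gamma))        # per-document topic mixture
        zs = rng.choice(10, size=doc_len, p=mu)       # topic of each token
        docs.append([rng.choice(25, p=phi[z]) for z in zs])
    return docs

docs = synthetic_lda_corpus()
print(len(docs), len(docs[0]))    # 100 documents, 100 tokens each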

PB, CGS-LDA, VEM-LDA, PRiSM and Stan for LDA on synthetic data
We compare PB with two LDA-specific methods from the topicmodels R package: collapsed Gibbs sampling (tm-gibbs) and variational expectation maximization (tm-vem), and with two state-of-the-art PPLs: PRiSM [8] and Stan [4]. The metrics we are interested in are: 1) the intrinsic metric of each method, used to subjectively decide when a sampler has converged, 2) log likelihood, used to assess the quality of fit to the observed data, and 3) fold-average perplexity in 5-fold cross-validation, used to assess the generalization power of each method. In a previous paper, we showed that Church [1] does not seem to converge in a reasonable amount of time on simple LDA tasks, cf. Appendix C of [18]. We use a similar setup to the previous experiment ($T = 10$, $\gamma = 1$), except that we generate only one corpus of 1000 documents and average over 10 runs with different RNG seeds.
The first problem in evaluating the methods is deciding when the model has converged. We use the default settings of all implementations unless specified otherwise, and their intrinsic measures: the likelihood attributes of LDA objects for tm-gibbs and tm-vem (@logLiks), the LDA-likelihood for PB, cf. Appendix B of [18], the get_logposterior function for Stan, and the variational free energy (VFE) for PRiSM. Time is measured in seconds and averaged across runs, the metrics are averaged across runs, and the error bars show one standard error, though very often the methods show little variation.
We plot the results in Figures 4 and 5. Note that the methods should not be compared with each other based on these figures. The last value of the number of iterations is the one where we deem there is convergence. For sampling methods, we ran up to 200 iterations for tm-gibbs, 400 iterations for PB and 25 iterations for Stan to make this decision. For PB we use 50, 100, 150 and 200 iterations, for Stan 5, 10, 15, 20 iterations, for PRiSM 50, 150 and 200 iterations (based on the intrinsic convergence of VEM), for tm-gibbs 25, 50, 75, 100 iterations, for tm-vem 20, 30, 40, 50 iterations (based on the intrinsic convergence of VEM). When varying the number of topics, we used the maximum value of iterations for the sampling methods, and the rest ran until convergence.
When we vary the number of topics in the probabilistic program, e.g. in Figure 5, we typically expect the metric to increase up to 10 topics, and then decrease or remain the same. This behaviour is illustrated by the intrinsic measures of PB and tm-gibbs, while that of Stan behaves oddly with respect to this expectation. In what follows we shall see that when we track likelihood and perplexity, these behaviours change.
To be able to compare the methods, we used the following definition of "likelihood":
$$L(C) = \sum_{(w, d) \in C} \log \sum_{t=1}^{T} \mu_{dt}\, \phi_{tw}$$
where $C$ is a corpus of tokens $(w, d)$, $w$ is the token index, $d$ is the document index, and $T$, $\mu$ and $\phi$ are the same as in Example 2.2. We plot the results in Figures 6 and 7. We use the same settings for iteration numbers. As in the intrinsic metric experiment, the more expressive the method, the slower it is, with the exception of PRiSM being faster than tm-vem. The difference of more than an order of magnitude between PRiSM and PB can be explained by the fact that the assumption of non-overlapping explanations allows for significant optimizations in PRiSM, as well as the fact that the PRiSM implementation is much more mature. Concretely, PRiSM is written in C, Stan compiles its programs to C++ code, while our current PB prototype sampling implementation is written in Python and Cython.
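Given point estimates of $\mu$ and $\phi$, this likelihood is cheap to compute; a possible implementation (ours) is:

import numpy as np

def lda_log_likelihood(corpus, mu, phi):
    """corpus: iterable of (w, d) token/document index pairs.
    mu: D x T document-topic matrix, phi: T x V topic-token matrix."""
    return sum(np.log(mu[d] @ phi[:, w]) for (w, d) in corpus)

mu = np.array([[0.7, 0.3], [0.2, 0.8]])
phi = np.array([[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]])
corpus = [(0, 0), (2, 0), (3, 1), (1, 1)]
print(lda_log_likelihood(corpus, mu, phi))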
With respect to the number of topics, all methods perform similarly. Finally, we show the average per-fold perplexity in a 5-fold cross-validation split on every document of the corpus in Figures 8 and 9.
We define perplexity similarly to [25], as the exponential of the negative likelihood per test token:
$$\mathrm{perplexity}(C_{\mathrm{test}}) = \exp\!\left( -\frac{L(C_{\mathrm{test}})}{|C_{\mathrm{test}}|} \right)$$
We run the methods only four times, and due to faster overall convergence, we tune the number of iterations as follows: 10, 20, 30, 40, 200 for ...

PB for seed LDA on 20 newsgroups dataset.
Inspired by an experiment from [26], we use the computing-related (comp.*) newsgroups in the 20 newsgroups dataset [27]. We tokenize, lemmatize and remove stop words from all documents to obtain a corpus with $V = 27206$ unique tokens in $D = 4777$ documents with an average length of approx. 72 tokens. We set $T = 20$ topics and priors $\gamma = 50/T$, $\beta = 0.01$. We seed two topics, one with hardware-related terms (hardware, machine, memory, cpu) and one with software-related terms (software, program, version, shareware). Seeding a token in a topic means that whenever we observe that token we will assign it only to that topic.
We show the PB program, with the observe facts omitted, in Table 1. The omitted observe facts encode the corpus in the same manner as in Example 3.2. The seed predicate specifies a token as its first argument and a list of allowed topics as its second argument. The observations are split into seed tokens and non-seed tokens, and are placed on separate pb_plate predicates. To distinguish between the two types of tokens we use the predicate seed_naf, because negation is universally quantified. We run PB for 400 iterations and summarize the seeded topics in Figure 10. We observe that both topics give high weights to the seed words, as well as to other hardware-related (mhz, rom, disk, bios, board) or software-related (source, library, utility, server, user) terms.

PB for cluster LDA on arXiv abstracts
We consider a different variant of the LDA model, one in which we define a partition $C$ over the topics. In the experiment, we consider 25 topics clustered into 5 clusters of 5 topics. Each token is then generated by choosing a topic cluster according to a document-specific mixture of clusters, then a topic from the cluster, again according to a document-specific mixture, and finally the token is chosen from the topic. Note that this model is different from the parametric version of hierarchical Dirichlet processes [28], equation 29, namely that model considers a global set of topics, and each document (or group) selects a subset from the global set. In cluster LDA, all documents can use any of the topics, however the topic clusters are disjoint, as opposed to document-specific subsets of a global set of topics.
We collect all abstracts on arXiv submitted in 2007, from five categories: quantitative finance (q-fin), statistics (stats), quantitative biology (q-bio), computer science (cs), and physics (physics). We tokenize and remove stop words to obtain a corpus with $V = 26834$ unique tokens in $D = 5769$ documents with an average length of approx. 80 tokens. We use priors of $\gamma = 50/T = 10$ for each cluster mixture and topic mixture per cluster, and $\beta = 0.1$ for the topics.
The PB program for the cluster LDA task is shown in Table 2. We use psi to denote the draw of a topic cluster, and a predicate create_term to create terms that, when called with pb_call, choose between the topics within a cluster, as well as the tokens from the topics.
In Figure 11 we plot, as heat maps, the cluster mixtures for each document in each category. We observe that quantitative biology is well represented by cluster 2 and physics is well represented by cluster 3. Computer science is characterized by cluster 5, but also 1 and 4, while quantitative finance and statistics are very similar, consisting mainly of clusters 1, 2 and 4. The latter effect may be due to the small number of documents in both quantitative finance and statistics, as well as the fact that most quantitative finance papers focus on statistical methods.
Furthermore, if we inspect, for example, cluster 2, corresponding to quantitative biology, shown in Table 3, we find that most topics give high probability to terms used in biology, e.g. "proteins" in topic 2, "genetic" and "brain" in topic 3, "response" and "diffusion" in topic 4. The results agree with our intuition about the model: topics within the same cluster are similar.

PB for RIM on Sushi dataset
A repeated insertion model (RIM, [13]) provides a recursive and compact representation of $K$ probability distributions, called preference profiles, over the set of all permutations of $M$ items. This intuitively captures $K$ different types of people with similar preferences. We evaluate a variant of the repeated insertion model in an experiment inspired by [29], on a dataset published in [30]. The data consists of 5000 permutations over $M = 10$ sushi ingredients, each permutation expressing the preferences of a surveyed person. Following [29], we use $K = 6$ preference profiles, however we use the RIM rather than a Mallows model, and we train on the whole dataset. The parameters of the model are a $50/K$ symmetric prior for the mixture of profiles, and a 0.1 symmetric prior for all categorical distributions in all profiles.
We show the PB program used in Table 5, assuming that the create_term predicate is defined as in Table 2. We are not aware of any other implementation of RIM in a PPL, therefore we briefly describe the program. The mixture of profiles is characterized by $\pi$, a set of $K$ distributions, and for each profile there are $M - 1$ categorical distributions that specify the probabilities over the set of permutations of $M$ elements. An observed permutation is produced by selecting a latent profile, then generating that permutation by consecutively inserting elements from a permutation called the insertion order, e.g. $[0, 1, \ldots, 9]$, at the right position, according to the distributions in that profile. The right position is chosen using the insert_rim predicate, as naïvely generating all the possible permutations is intractable (and unnecessary). We run PB 10 times for 100 iterations and average the parameters. For each preference profile, we show its mixture parameter and its mode in Table 4. The inference yields similar conclusions to [29]: there is a strong preference for fatty tuna, a strong dislike of cucumber roll, and a strong positive correlation between salmon roe and sea urchin.
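To illustrate the repeated insertion process itself, independently of the PB encoding, the following Python sketch samples a permutation from a single preference profile; the parametrization is our rendering of the standard RIM generative story, and the toy probabilities are ours.

import numpy as np

def sample_rim_permutation(insertion_probs, insertion_order, rng):
    """Repeated insertion: item j of insertion_order is inserted at one of the
    j+1 possible positions of the partial ranking, with probabilities
    insertion_probs[j-1] (a vector of length j+1), for j = 1..M-1."""
    ranking = [insertion_order[0]]
    for j, item in enumerate(insertion_order[1:], start=1):
        pos = rng.choice(j + 1, p=insertion_probs[j - 1])
        ranking.insert(pos, item)
    return ranking

rng = np.random.default_rng(0)
M = 4
# One categorical distribution per insertion step (lengths 2, 3, ..., M),
# here biased towards inserting each new item near the end of the ranking.
insertion_probs = [np.array([0.2, 0.8]),
                   np.array([0.1, 0.2, 0.7]),
                   np.array([0.1, 0.1, 0.2, 0.6])]
print(sample_rim_permutation(insertion_probs, list(range(M)), rng))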

Related Work
In this section, we briefly describe the relation between the approach proposed in this paper and other relevant methods. We begin with each of the two fundamental steps of PB: 1) enumeration of the conditional sample space and 2) sampling thereof, and conclude with a high-level comparison with other probabilistic programming languages.
The first step of PB, the enumeration of the conditional sample space through abductive logic programming, can be compared to "logical inference" in ProbLog [9]. While both languages aim to generate a propositional formula and compile it into a decision diagram, "logical inference" in PB is based on abductive logic programming, while ProbLog grounds the relevant parts of the probabilistic program. Moreover, in PB the compilation of the boolean formulas is performed using (RO)BDDs, while ProbLog can use a wider range of decision diagrams, e.g. sentential decision diagrams (SDD) or deterministic, decomposable negation normal form (d-DNNF). These differences reflect the different aims of the two PPLs: ProbLog focuses on models where "logical inference" needs to be efficient and the resulting decision diagrams need to be compact, while PB focuses on models where "logical inference" is typically easy, but must be applied repeatedly, according to the nature and the number of the observations. In future work, however, PB could benefit from the use of more compact decision diagrams.
The second step of PB is inspired by the uncollapsed Gibbs sampling algorithm from [11], which we have adapted to PB. However, sampling in the PB model, unlike in PLP models, can be performed using collapsed Gibbs sampling, an algorithm with faster convergence than the uncollapsed version, as shown in Section 5.1.
In relation to Church [1] and many other related PPLs, PB is similar in that it uses a Turing-complete declarative language, but the set of probabilistic primitives available in PB is very restricted compared to Church. On the other hand, inference in discrete models such as LDA is difficult in highly expressive PPLs, whereas in PB inference is tractable for various discrete models.
Both PB and the logic-based PPL Alchemy [6] focus on discrete models; however, they differ at a fundamental level due to the fact that the probabilistic model of PB is directed, whereas that of Alchemy is undirected (Markov networks). Consequently, the specification of probabilistic programs is also different: in Alchemy, programs are expressed as weighted formulas, whereas PB specifies models using abductive logic programs where the abducibles represent draws from categorical distributions with Dirichlet priors. Furthermore, by using abductive logic programming instead of a first-order knowledge base, PB can easily encode recursive generative models, such as RIM. It is much less obvious how to do so using Alchemy.

Conclusions and Future Work
In this paper, we introduced PB, a probabilistic logic programming language for discrete models with Dirichlet priors. This paper bridges the gap between logical and probabilistic inference in the considered class of models, and addresses issues of representation of abductive solutions and of inference on "syntactically" identical BDDs. The main contribution to representation is the conditional AD compilation, while the main contribution to inference is the collapsed Gibbs sampling algorithm that generalizes the one proposed for LDA in [24].
We have shown, through the experiments in Section 5, that PB yields reasonable inference results both on synthetic and real datasets.
In future work, we hope to explore more probabilistic models that fit the PB paradigm, and to design, implement, and compare efficient algorithms for generalized probabilistic inference in PB models.
On the other hand, we wish to relax the important restriction to discrete models by using Dirichlet processes, which allow the discretization of any continuous distribution specified as the base distribution of the process.
Appendix A. Computing abductive solutions

We restrict our programs such that, in a success state, the denial store is empty. Let $SS_Q$ denote the set of success states for query $Q$. Then, the result of an abductive query is the formula
$$\mathcal{M}(Q) = \bigvee_{(G, \Delta, N) \in SS_Q} \bigwedge \Delta$$
It is not difficult to extend the result to non-empty denial stores, a case in which $\mathcal{M}(Q)$ will not be in disjunctive normal form (DNF), but rather an arbitrary formula, as we do not take advantage of the particular DNF encoding in the rest of the paper.
In defining an ASystem derivation tree, we mentioned the application of proof procedure rules, which we proceed to define in the rest of this section. Given an abductive framework $(\Pi, AB)$, let $S_k = (G_k, \Delta_k, N_k)$ denote an ASystem state, and let $F$ be a goal selected by $\Xi$ from $G_k$, such that $G^-_k = G_k \setminus \{F\}$. By applying a suitable proof procedure rule, we obtain a child state $S_{k+1} = (G_{k+1}, \Delta_{k+1}, N_{k+1})$. Some inference rules create more than one child, and the children states are separated by OR. In the description of the rules, we mention only the updated goal and elements in the stores; the rest remain the same as in the parent state.
We use the following notation: $Y$ denotes a set of variables, $u$ denotes a term, $U$ and $Z$ denote vectors of terms. Equality between variables and terms denotes unification, and is extended to vectors component-wise. $\Phi$ is used to denote rule bodies, i.e. conjunctions of literals. The symbols $a$, $i$, $l$ and $k_a$ have the same meaning as in Section 2, $p$ denotes a functor, and $L$ denotes a literal. $\mathrm{Var}(U)$ denotes the set of variables in $U$, where $U$ can be either a predicate, a set of terms or a literal.
Based on the type of the selected goal (i.e. literal or denial), we distinguish three inference rules for each type.
(1) If the selected goal is a literal $F$, then: (D1) if $F = p(u)$ and $p$ is pb_call, then: ...; if $F = p(U)$ is a non-abducible, let $p(Z_r) \leftarrow \Phi_r$, $r = 1, \ldots, R$, be the $R$ rules in $\Pi$, then: ...

Example 3.5. Continuing the previous examples of this section, we specify the last part of the PB program for an LDA task. The pb_plate predicate iterates through the corpus and generates each token according to the model. In this model, iterating through observations means selecting different pairs of document and token indexes.

Figure 4: Intrinsic metrics by log average execution time. The methods should not be compared based on these values.

Figure 5: Intrinsic measures by number of topics in the evaluated model.

Figure 6: Likelihood against log average execution time. Higher is better. The results are consistent with Figure 4.

Figure 7: Likelihood against number of topics in the evaluated model. Notice the difference w.r.t. Figure 5.

Figure 8: Average fold perplexity in 5CV against average execution time. Lower is better. The first four markers for Stan almost overlap.

Figure 9: Average fold perplexity in 5CV against number of topics in the evaluated model. Note the consistency w.r.t. Figure 7.

Figure 10: Seeded topics in the seed LDA task.

Table 1: PB program for seed LDA. Excerpt of the program (the observe facts are omitted): pb_dirichlet(2.5, theta, 20, 4777). pb_dirichlet(0.01, phi, 27206, 20). seed(9398, [1]). seed(21247, [2]). seed(13167, [1]). seed(17982, [2]). seed(13813, [1]). seed(24490, [2]). seed(4483, [1]).

Table 2: PB program for cluster LDA.

Table 4: Mixture parameters and modes of preference profiles on the Sushi dataset.