The Exact Inference of Beta Process and Beta Bernoulli Process From Finite Observations

Abstract: The Beta Process is a typical nonparametric Bayesian model, and the Beta Bernoulli Process provides a Bayesian nonparametric prior for models involving collections of binary-valued features. Some previous studies considered the Beta Process inference problem by giving a Stick-Breaking sampling method. This paper focuses on the exact form of the probability distribution that arises from the Stick-Breaking approach: the joint probability distribution is derived for any finite number of observable samples. It not only determines the probability distribution function of the Beta Process with finite observations (represented as a set of numbers in $[0,1]$), but also, using this distribution as a prior, gives the distribution function of the Beta Bernoulli Process of the same finite dimension (represented as a matrix whose elements take values 0 or 1).

The subsequent integral treatment must approximate the integration through sampling methods, and sampling is itself a tedious step. In this work, we take a finite number of random-variable observations sampled from a Beta Process, calculate the posterior Bernoulli Process, and make inferences. Here, we mainly do two things. First, for a high-dimensional sequence consisting of any finite number of real-valued observations in $[0,1]$, we only assume that it was generated by sampling from a Beta Process, and we derive its probability distribution analytically. Second, for a 0/1 matrix with any finite number of rows, each row having any finite number of columns, we only assume that it was sampled from a Beta Bernoulli Process and infer its distribution function. The definitions of the relevant parameters are similar to those given in [John and Lawrence (2011)]. The inference here requires no additional assumptions.

The above three restrictions can be relaxed in turn. First, we set up the joint probability distribution function of the Beta Process by taking the Beta random variables themselves as the observed variables and the other random variables as intermediate variables; in this way, the distribution function of the Beta Process can be generated directly by marginalization. Second, when constructing the likelihood function, instead of recursing over the round number of each sample in turn, we focus only on the round number of the last sample, so that the joint distribution function of the number of samples in each round can be constructed directly in one pass. Finally, we analyze the Beta Process of a finite number of observed samples and use it as a prior distribution to calculate directly the posterior probability of any finite-dimensional binary-valued matrix, so that this posterior probability can be computed analytically.

Finite-dimensional binary matrices can be used for factor selection, for example in modeling radar signal data: the radar transmits a set of full-bandwidth spectrum data, and the binary variables are used to achieve sparsity, being non-zero only at the positions of certain column vectors of $\Phi$. Here, for the binary variable $z_n$, we use the Beta Bernoulli Process prior. For such priors one usually resorts to a finite approximation of the Beta Bernoulli Process; here, however, the Beta Bernoulli Process prior can be used directly, without approximation.

The rest of this article is organized as follows. The second part provides preliminary knowledge of the Beta Process and analyzes the probability distribution function of the observed variables generated by the Stick-Breaking construction method.
In the third part, the inference method for the probability distribution function of the intermediate variables is given. In the last part, the final likelihood function of the observed variables is derived. Describing the likelihood function as efficiently as possible is an important step in machine learning with probabilistic models.

The definition of the beta process and stick breaking construction
The Beta Process is a nonparametric Bayesian method used to describe a sequence composed of an infinite number of atoms, in which each atom has a weight, and the weight follows a degenerate Beta distribution.

Beta process definition
Let $H_0$ be a non-atomic continuous measure on a measurable space $(\Omega, \mathcal{F})$, and let $b > 0$ be a concentration parameter. The Beta Process $B \sim \mathrm{BP}(b, H_0)$ is the completely random measure on $\Omega$ whose Lévy measure is $\nu(d\pi, d\omega) = b\,\pi^{-1}(1-\pi)^{b-1}\,d\pi\,H_0(d\omega)$ on $[0,1] \times \Omega$; a draw $B = \sum_k \pi_k \delta_{\omega_k}$ is discrete with probability one.

Stick breaking construction of beta process
By defining a concept called stick breaking, Paisley et al. [John and Aimee (2010)] proposed an explicit method to build the Beta Process. Stick breaking is a method used to generate discrete probability measures [Ishwaran and James (2001)].
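To make the construction concrete, the following minimal Python sketch draws a truncated stick-breaking sample of a Beta Process in the spirit of [John and Aimee (2010)]: round $i$ contributes $C_i \sim \text{Poisson}(\gamma)$ atoms, and an atom born in round $i$ receives weight $V^{(i)} \prod_{l<i} (1 - V^{(l)})$ with $V^{(l)} \sim \text{Beta}(1, b)$. The uniform base measure on $[0,1]$, the truncation level, and all identifiers are illustrative assumptions, not part of the original construction.

```python
import numpy as np

def sample_beta_process_sb(gamma, b, n_rounds, rng=None):
    """Truncated stick-breaking draw of a Beta Process (Paisley et al. style).

    Round i creates C_i ~ Poisson(gamma) atoms; each keeps weight
    pi = V_i * prod_{l<i}(1 - V_l) with V_l ~ Beta(1, b). Locations are
    drawn uniformly on [0, 1] as a stand-in for H_0 / gamma (an assumption).
    """
    rng = np.random.default_rng(rng)
    weights, locations, rounds = [], [], []
    for i in range(1, n_rounds + 1):
        for _ in range(rng.poisson(gamma)):   # C_i atoms born in round i
            v = rng.beta(1.0, b, size=i)      # one stick fraction per round so far
            weights.append(v[-1] * np.prod(1.0 - v[:-1]))
            locations.append(rng.uniform())   # omega ~ H_0 / gamma (assumed uniform)
            rounds.append(i)
    return np.array(weights), np.array(locations), np.array(rounds)

# Example: weights born in later rounds are products of more stick
# fractions, so they decay across rounds.
pi, omega, d = sample_beta_process_sb(gamma=3.0, b=2.0, n_rounds=20, rng=0)
```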

Calculation of the marginal distribution of $\pi_k$
Through the stick-breaking construction, Paisley et al. [John and Aimee (2010); John and Michael (2016)] proposed a method that makes the construction process of the Beta Process explicit. They divided the probability distribution of the elements in the Beta Process into two groups: those $\pi_j$ generated in the first round, and those $\pi_j$ not generated in the first round of cycles. Paisley et al. [John and Aimee (2010)] calculated the marginal probability distribution of these two types of observations respectively.
Here, when $\pi_k$ is specified to be generated in the $i$-th round, the conditional probability density function of $\pi_k$ can be defined as follows. When $i = 1$, the weight of an atom in this round follows a Beta distribution with first parameter $1$ and second parameter $b$; that is, $\pi_k = V_k$ with $V_k \sim \text{Beta}(1, b)$, whose probability density function is $p(\pi_k \mid d_k = 1) = b(1 - \pi_k)^{b-1}$. For the other case, $\pi_k = V_k^{(i)} \prod_{l=1}^{i-1} (1 - V_k^{(l)})$, and another probability density function can be obtained by working out the distribution of this function of random variables, namely the distribution of a product of random variables. The equation $d_k = i$ indicates that the $k$-th atom occurred in the $i$-th round. Thus, the probability of the $k$-th atom being observed in round 1 is as shown in Fig. 1.
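As a hedged numerical check of the densities just stated, the sketch below draws $\pi_k$ for a given round via $\pi_k = V^{(i)} \prod_{l<i}(1 - V^{(l)})$ and compares the $i = 1$ histogram against $b(1-\pi)^{b-1}$; the function name and parameter values are illustrative assumptions.

```python
import numpy as np

def sample_round_weight(i, b, size, rng=None):
    """Monte Carlo draws of pi_k given d_k = i: pi = V_i * prod_{l<i}(1 - V_l)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, b, size=(size, i))
    return v[:, -1] * np.prod(1.0 - v[:, :-1], axis=1)

# Round 1: the weight is V itself, so its density should be Beta(1, b),
# i.e. p(pi | d_k = 1) = b * (1 - pi)^(b - 1).
b = 2.0
draws = sample_round_weight(1, b, size=200_000, rng=0)
hist, edges = np.histogram(draws, bins=50, range=(0, 1), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
analytic = b * (1.0 - mid) ** (b - 1.0)
print(np.max(np.abs(hist - analytic)))  # small Monte Carlo error
```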

Inference for $d_k$
This is what is shown in the first-row frame of Fig. 1. By analogy, the probability of the $k$-th atom being observed in round 2 is obtained in the same way. Continuing the same reasoning for later rounds, we arrive at the final result for the probability of the $k$-th atom appearing in round $d_k$. On the other hand, the marginal probability can be viewed as the result of marginalizing the other variables of the joint probability distribution; in this way, the form of the joint probability distribution can be recovered from the marginal probability distributions. The inference procedure for the joint probability distribution of the number of samples generated in each round can be described as follows: when the loop completes, the output variable Mout holds the result. The loop executes only $d_k$ times, so the time complexity is $O(d_k)$. Next, after an illustrative check sketched below, we discuss the likelihood term and the prior term.
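Since $d_k = i$ simply means that the cumulative per-round counts satisfy $C_1 + \cdots + C_{i-1} < k \le C_1 + \cdots + C_i$, with $C_i \sim \text{Poisson}(\gamma)$ in the stick-breaking construction of [John and Aimee (2010)], the round probabilities can be checked by simple Monte Carlo. The rate, truncation, and names below are assumptions for illustration only.

```python
import numpy as np

def round_of_atom(k, gamma, max_rounds, rng):
    """Return d_k: the round in which the k-th atom is generated,
    given per-round counts C_i ~ Poisson(gamma)."""
    total = 0
    for i in range(1, max_rounds + 1):
        total += rng.poisson(gamma)
        if total >= k:       # cumulative count has reached atom k
            return i
    return max_rounds + 1    # truncation overflow (rare with enough rounds)

rng = np.random.default_rng(0)
k, gamma = 5, 3.0
draws = np.array([round_of_atom(k, gamma, 50, rng) for _ in range(100_000)])
for i in range(1, 6):
    print(i, np.mean(draws == i))   # Monte Carlo estimate of P(d_k = i)
```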

Likelihood term
To derive the likelihood term, the marginalization over the random variable $d_k$ must be handled first. Here, the conditional probability formula is adopted, and the joint probability is expanded over the values of the variable $d_k$; we then introduce the joint probability involving $C$. From the conditional probability formula, we obtain the expansion in Eq. (6), which gives a way to realize the joint probability distribution representing the observation sequence generated by the Beta Process. We use a finite number of observations (here $k$) to generate the observation likelihood.

Here $p(C, d_k)$ is the joint probability distribution generated by Eq. (5). Expanding and analyzing this formula yields the marginalization over $d_k$. Substituting Eq. (5) into Eq. (6), we obtain the result, where the corresponding term in Eq. (6) has been replaced by Eq. (3).

Derivation of conditional probability term
The data is generated from $H$ through the Beta Process and expressed in the form of an infinite-dimensional vector, with each element in $(0,1)$. The probability distribution is analyzed as follows: this formula is equivalent to the posterior distribution of $\pi_1, \pi_2, \ldots, \pi_k$ after the indicator sequence is given.

Inference for the first two rounds
In other words, the $k$ samples were generated in the first and second rounds, and the sum of the numbers of samples in the two rounds is $k$.
The $\pi$ term in Eq. (8) will be replaced by Eqs. (3) and (4). From the definition of rounds, we can obtain the elements of the $j$-th round:

Inference for $\pi$
Then, given the values of $C$ and $k$, the joint distribution of the variables $\pi$ can be obtained; the $\pi$ term in Eq. (9) will also be replaced by Eqs. (3) and (4).
The calculation process of the conditional probability can be described as follows (Fig. 4 shows the calculation process of this conditional probability). Here, $count_I$ is the loop variable; when the loop is complete, the output variable Sout holds the result. In this way, the joint probability distribution of the final observed variables can be calculated. Substituting Eqs. (3) and (4) into Eq. (10), the final exact joint probability distribution function for a finite number of observations of the Beta Process is obtained. Thus, the overall calculation of the likelihood term can be described as follows. The observations can be represented as a binary matrix in which each $z_{ij}$ has been specified to be drawn from a Bernoulli Process with parameter $\pi_i$.
Then the Joint probability distribution of { } z  and { } π  is subject to: Here each j π parameter follows a beta distribution. The joint probability distribution of π  can be calculated by Eq. (10). By substituting Eqs. (10), (12) into Eq. (13), the exact Joint Probability distribution function of Beta Bernoulli Process for finite observation can be obtained.
Using the above equation, the intermediate variable $\pi$ can be conveniently eliminated by integration, and the result is given in Eq. (15).
Next, by replacing the distribution inside the integral in Eq. (15) using Eq. (3), we obtain Eq. (16). Here, the variable $R$ is introduced through the properties of the Poisson distribution. In order to simplify the subsequent calculation, the expression can first be rearranged appropriately: Eq. (17) is a simple rearrangement of the last two terms in Eq. (15).

Likelihood term for round $d_k$
Using the conjugate relationship between the Beta Process and the Bernoulli Process, the following result can be obtained by integration, without loss of generality. The Bernoulli samples produced in round $d_k$ are analyzed here. In this case, $z_k$ will be used to represent $\{z_{k1}, \ldots, z_{kg_k}\}$, and the probability distribution of $\pi_k$ has been represented by Eq. (4).
By Taylor expansion, we can obtain the integral over the variable $\pi_k$ analytically.
Considering a series analysis of the middle part of Eq. (18), i.e., the Taylor expansion of the corresponding term, and applying a variable substitution, the integral result in Eq. (19) can be obtained. In order to calculate the posterior distribution of the given binary indicator variables, a prior distribution is required. To obtain the prior distribution, the following two steps can be performed:

Calculation of the proportional term in Eq. (17)
To calculate the proportional term in Eq. (17), we substitute the result of Eq. (23) into Eq. (17); in this way, we obtain Eq. (29), which is the final result required here.
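The conjugacy used above can be sanity-checked numerically in the simplest case $d_k = 1$, where $p(\pi_k \mid d_k = 1) = b(1-\pi_k)^{b-1}$ and the marginal likelihood of $m$ ones among $g$ Bernoulli draws reduces to a Beta function. The following sketch, with assumed toy values, compares the closed form against direct quadrature.

```python
import numpy as np
from scipy.special import beta as beta_fn
from scipy.integrate import quad

# Round d_k = 1: p(pi | d_k = 1) = b (1 - pi)^(b-1), so the marginal
# likelihood of m ones among g Bernoulli draws integrates in closed form:
#   int_0^1 pi^m (1-pi)^(g-m) * b (1-pi)^(b-1) dpi = b * B(m+1, g-m+b).
b, g, m = 2.0, 6, 4
closed_form = b * beta_fn(m + 1.0, g - m + b)
numeric, _ = quad(lambda p: p**m * (1 - p)**(g - m) * b * (1 - p)**(b - 1), 0, 1)
print(closed_form, numeric)   # the two values agree to quadrature accuracy
```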

The final result of the joint probability distribution of $\{z\}$
The result obtained in Eq. (31) is the final joint probability distribution function required in this paper. Here, the observed likelihood of each Bernoulli sample is calculated directly. This process can be described as follows: the result can be generated analytically because the order of integration of the beta variable $\pi_k$ and the intermediate variable $w$ is exchanged in the integrand, and a Taylor expansion is then carried out on the corresponding power term. The flow chart for calculating the final observation variable $Z$ in the Beta Bernoulli Process is almost the same as that in Fig. 4, since the probability $p(\pi \mid count_I, b)$ appearing in the process of Fig. 4 can be used directly here once the conditional probability distribution of the observation variable $z$ given the Beta Process weights $\pi$ is introduced.

Beta process factor analysis and the logarithmic likelihood function of the joint probability distribution for beta process
The most commonly used inference method in machine learning is variational inference, which in parameter estimation is often realized as the EM algorithm. One of its core steps is to compute the joint probability distribution function of the observed variables and the hidden variables. At the same time, taking the logarithm of the joint probability distribution function simplifies the final objective function. Therefore, finding the log-likelihood of the joint probability distribution function is one of the most important tasks in machine learning. For the same reason, the log-likelihood of the joint distribution of the Beta Process with finite observations is also calculated below.
A key use of the Beta Process is Beta Process Factor Analysis [John and Lawrence (2009); John and Lawrence (2011); Ishwaran and James (2001)]. In this setting, the Bernoulli Process, which takes the Beta Process as its parameter, is used for factor selection within the set of factors. Therefore, Factor Analysis with the finite-observation Beta Process is discussed in the following part.

Beta process factor analysis
Beta Process Factor Analysis is mainly described as follows: define a factor matrix $\Phi$ and a set of weight vectors, so that a matrix $X$ is generated from the matrix $\Phi$ through a linear transformation of those vectors. In this case, the probability distribution function $p(Z)$ can be described directly by Eq. (31). It is usually straightforward to set $g_i = G$ for all $i$. Only $X$ is observable here; the rest are unobservable variables. Theoretically, once the joint probability distribution function is constructed, the work of inference and machine learning can be performed.
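For concreteness, one common form of the BPFA generative model (in the spirit of [John and Lawrence (2009)]) writes each observation as $x_n = \Phi(z_n \circ w_n) + \varepsilon_n$, where the binary vector $z_n$ selects which factors are active. The sketch below samples this model with a fixed $\pi$ standing in for a Beta Process draw; the dimensions, noise scales, and names are illustrative assumptions.

```python
import numpy as np

def sample_bpfa(n, p, K, pi, sigma_w=1.0, sigma_eps=0.1, rng=None):
    """Minimal generative sketch of Beta Process Factor Analysis:
    x_n = Phi (z_n * w_n) + eps_n, where z_nk ~ Bernoulli(pi_k) selects
    which columns (factors) of Phi are active for sample n."""
    rng = np.random.default_rng(rng)
    Phi = rng.normal(size=(p, K))                  # factor loading matrix
    Z = rng.binomial(1, pi, size=(n, K))           # binary factor usage
    W = rng.normal(scale=sigma_w, size=(n, K))     # factor weights
    X = (Z * W) @ Phi.T + rng.normal(scale=sigma_eps, size=(n, p))
    return X, Phi, Z, W

# pi would be a (truncated) draw from the Beta Process; fixed here for brevity.
X, Phi, Z, W = sample_bpfa(n=100, p=30, K=10, pi=np.full(10, 0.3), rng=0)
```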

Logarithmic likelihood of the joint probability distribution of beta process
Through Eq. (10), the joint probability distribution function can be directly described. From Eqs. (8) and (9), it can be inferred that the two terms above admit a unified expression; that is, the conditional distribution of the Beta Process observation sequence can be expressed uniformly as in Eq. (32). At the same time, the joint probability distribution of the random variables can be deduced from Eqs. (5) and (32). Similarly, through Eq. (4) and the variable substitution $T = -\ln w$, the joint distribution function of the random variables $\{\pi, T\}$ can be obtained. Here, the range of the random variables can be limited to $0 \le T_k \le -\ln \pi_k$, and $d_k \ge 2$ is required at the same time.
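One reason the logarithmic substitution is convenient can be made explicit: since $1 - V \sim \text{Beta}(b, 1)$ when $V \sim \text{Beta}(1, b)$, each $-\ln(1 - V)$ is exponential with rate $b$, so the negative log of the round-$i$ product is Gamma distributed with shape $i - 1$. The sketch below checks this by simulation; it is an illustrative aside under stated assumptions, not part of the paper's derivation.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist, kstest

# If V^(l) ~ Beta(1, b), then 1 - V^(l) ~ Beta(b, 1) and
# -ln(1 - V^(l)) ~ Exponential(rate b). Hence, for an atom in round i,
# T = -ln prod_{l<i} (1 - V^(l)) is a sum of i-1 such terms,
# i.e. T ~ Gamma(shape=i-1, scale=1/b).
rng = np.random.default_rng(0)
b, i = 2.0, 4
v = rng.beta(1.0, b, size=(100_000, i - 1))
T = -np.log(np.prod(1.0 - v, axis=1))
print(kstest(T, gamma_dist(a=i - 1, scale=1.0 / b).cdf).statistic)  # near 0
```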
In addition, Eq. (36) is the joint probability distribution function of all the random variables that are actually needed; it is calculated below in order to simplify its representation.

Quotient calculation of the random variables
By substituting Eqs. (37) and (38) into Eq. (36), the joint probability distribution function of all required random variables can be obtained:

Logarithmic likelihood of joint probability distribution function
Taking the logarithm of Eq. (39), we can derive the log-likelihood. For the Poisson term, with $C \sim \text{Poisson}(\gamma)$, it can be calculated as $\ln p(C) = C \ln \gamma - \gamma - \ln C!$.
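The Poisson term can be checked directly against a library implementation; the sketch below compares the hand-written $C\ln\gamma - \gamma - \ln C!$ with scipy's log-pmf for a few assumed counts.

```python
import numpy as np
from scipy.stats import poisson

# Log-probability of per-round counts C_i ~ Poisson(gamma), the Poisson
# term in the log-likelihood: C ln(gamma) - gamma - ln(C!).
gamma = 3.0
C = np.array([4, 2, 5, 3])
log_fact = np.array([np.sum(np.log(np.arange(1, c + 1))) for c in C])
manual = C * np.log(gamma) - gamma - log_fact
print(manual.sum(), poisson.logpmf(C, gamma).sum())   # identical values
```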

Discussion
Theoretically, the joint probability distribution function must be able to handle any number of observations; importantly, the number of actual observations can be arbitrarily large, but not infinite. The method given here is simple and effective in dealing with this problem. Because the nonparametric Bayesian stochastic process discussed here does not satisfy the Kolmogorov consistency theorem, the observed variables are not independent and identically distributed; the form of the distribution function is therefore much more complicated than in traditional machine learning, and the number of unobserved variables has a direct impact on the form of the distribution function. The method proposed here eliminates the information irrelevant to the observations and thus gives the general form for any finite number of observations. We have obtained several results from this idea through the Stick-Breaking construction proposed by Paisley et al. [John and Aimee (2010)]: a more general construction for finite observations, and a new type of joint probability distribution function for Stick-Breaking Beta Processes, which shows that the Beta Process is a countable superposition of Poisson Processes and serves as a prior for the Bernoulli Process. Finally, the finite-observation treatment of a 0/1 matrix is completed.
In the future, we will extend the proposed method and use variational inference to address the fact that exact evaluation of the marginal distribution is too complex, so as to make it applicable to machine learning tasks requiring approximate parameter estimation for the Beta Process and the Beta Bernoulli Process. We will also explore approximate inference models for the distribution functions of nonparametric process variables, hoping to obtain better and simpler performance by means of variational inference. Similar methods can also be used for Gamma Processes [Anirban and Brian (2015)] and Gamma Poisson Processes [Michalis and Titsias (2007)]; this is the next step under consideration.

Regression analysis is one of the main research directions in machine learning. At present, Gaussian Process Regression is the main regression method based on stochastic processes, with Kalman filtering among its most widely used applications. Based on the same idea, once the joint probability distribution function of the Beta Process is given, the conditional probability can be calculated according to the Bayes formula and regression analysis can in principle be carried out. Therefore, regression analysis with the Beta Process is also one of the directions to be considered next. The idea described above can also be applied to Gamma Processes, which are similar to Beta Processes, so our results also contribute to establishing a general nonparametric Bayesian inference mechanism.

A well-known variant of the Beta Bernoulli Process is the Indian Buffet Process (IBP), which learns the number of features included in the model from the observed data, thus allowing the model to explain the data more accurately. Nonparametric Bayesian models based on the IBP can automatically learn the implicit features and determine the number of features in a scalable way; in theory, better predictive performance can therefore be achieved. In practical applications, the 0/1 output of the Beta Bernoulli Process is generally used to describe relationships between entities. In the sample matrix of the Beta Bernoulli Process, a specific entity is described by a set of binary features, which are then inferred from the observations. The sample matrix values of the Beta Bernoulli Process can be used as a basis for deciding whether entities are related. If a weight is attached to each 0/1 output of the Beta Bernoulli Process, the strength of influence between entities can be represented along with the correlation itself. Since the distribution of the Beta Bernoulli Process is long-tailed, and the per-round distribution functions generated by Beta Process stick breaking do not necessarily share the decay pattern of a power-law distribution, the model is in principle sufficient to describe an entity possessing any number of features. The general Beta Bernoulli Process describes a probability distribution that can capture relationships between entities, and these relationships are not necessarily symmetric. This asymmetry can be applied to important problems such as link prediction in social networks, which is an important issue in social network modeling [Miller, Michael and Thomas (2009)].
Here, it can be assumed that the link probability from one node to another is determined by the combined effect of pairwise feature interactions, as sketched below. If a weight is added to the 0/1 sample matrix of the Beta Bernoulli Process, with positive weights corresponding to a high link probability, negative weights to a low link probability, and zero weights indicating no interaction between the two features, then the representational power of the model is greatly improved and the influence relationships between nodes become more expressive. The relationship between entities can also be simplified: the simplified symmetric relationship is used to learn a complete symmetric weight matrix. The symmetric Beta Bernoulli Process model can likewise be used for co-authorship judgment in text mining, because the co-authorship relationship is symmetric [Teh, Jordan, Beal et al. (2006)]. Currently, multi-level IBP models proposed in the literature have been applied in Deep Learning, where they are used to learn the structure of a Deep Belief Network, including the number of layers, the number of neurons in each layer, and the connection structure between layers [Adams, Hanna and Zoubin (2010)]. In this paper, the exact analytical form of the probability distribution function of any finite dimension is derived directly for the Beta Bernoulli Process, and its properties as an objective function for machine learning are discussed. In subsequent prediction [Miller, Michael and Thomas (2009)] and learning [Adams, Hanna and Zoubin (2010)] tasks, it can be used directly as the prior probability distribution function of a discriminative model [Miller, Michael and Thomas (2009)] and substituted into the objective function for parameter optimization.
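A minimal sketch of the weighted link-probability idea discussed above, in the style of the latent-feature relational model of [Miller, Michael and Thomas (2009)]: the score $z_i^{\top} W z_j$ is squashed through a sigmoid, and an asymmetric $W$ yields asymmetric link probabilities. All names and values here are illustrative assumptions.

```python
import numpy as np

def link_probability(z_i, z_j, W):
    """Pairwise link probability sigma(z_i^T W z_j). Positive entries of W
    raise the link probability, negative entries lower it, and zeros mean
    the corresponding pair of features does not interact."""
    score = z_i @ W @ z_j
    return 1.0 / (1.0 + np.exp(-score))

rng = np.random.default_rng(0)
K = 5
W = rng.normal(size=(K, K))         # feature-interaction weights (not symmetric)
z_a = rng.binomial(1, 0.4, size=K)  # binary features of node a
z_b = rng.binomial(1, 0.4, size=K)  # binary features of node b
print(link_probability(z_a, z_b, W), link_probability(z_b, z_a, W))  # may differ
```

Symmetrizing $W$ recovers the symmetric case mentioned above for co-authorship-style relations.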
Since the marginal probability distribution function defined here is exact, parameter optimization can proceed as needed: either by optimizing the exact marginal probability distribution directly, or by choosing sampling [Miller, Michael and Thomas (2009)] or variational inference to perform approximate inference on the joint probability distribution function. In Deep Learning, ideas similar to those of Teh et al. [Teh, Jordan, Beal et al. (2006)] can also be used: the row count and the per-row column counts of the binary matrix output by the Beta Bernoulli Process can serve as the number of layers of a multi-layer neural network and the number of nodes in each layer. The two-parameter Beta Process description adopted in this paper theoretically generalizes the model of Adams et al. [Adams, Hanna and Zoubin (2010)], which directly adopted the Indian Buffet Process as the prior for the number of layers and the number of nodes per layer of a Deep Belief Network.

Conclusions
The Beta Process contains a collection of random variables. However, these random variables satisfy neither stationarity nor globally independent increments, so their probability distribution has an extremely complex form. The stick-breaking construction is a means of defining the Beta Process indirectly by describing its sampling process. Further analyzing and deriving the joint probability distribution of the observed samples from the described sampling process is the necessary next step for machine learning. The result presented here is an analytical method for directly calculating the probability distribution of the observable variables of the Beta Process. Through this calculation, on the one hand, all intermediate variables are marginalized out directly, completely eliminating the unobservable information; on the other hand, the observations of the Bernoulli Process generated directly with the Beta Process as its parameter acquire an analytic probability distribution function.
In the future, we will further extend the derived results and address the subsequent steps of machine learning on the Beta Process.