Objective Bayesian Inference in Probit Models with Intrinsic Priors Using Variational Approximations

There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. On the other hand, variational inference methods have been employed to solve classification problems using probit regression and logistic regression with normal priors. In this article, we propose to apply the variational approximation on probit regression models with intrinsic prior. We review the mean-field variational method and the procedure of developing intrinsic prior for the probit regression model. We then present our work on implementing the variational Bayesian probit regression model using intrinsic prior. Publicly available data from the world’s largest peer-to-peer lending platform, LendingClub, will be used to illustrate how model output uncertainties are addressed through the framework we proposed. With LendingClub data, the target variable is the final status of a loan, either charged-off or fully paid. Investors may very well be interested in how predictive features like FICO, amount financed, income, etc. may affect the final loan status.


Introduction
There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. So far, only two articles have explored intrinsic prior related methods for classification problems. Reference [1] implements integral priors in generalized linear models with various link functions, and reference [2] considers intrinsic priors for probit models. On the other hand, variational inference methods have been employed to solve classification problems using logistic regression [3] and probit regression [4,5] with normal priors. Variational approximation methods have been reviewed in [6,7] and, more recently, in [8].
In this article, we propose to apply variational approximations to probit regression models with intrinsic priors. In Section 3, procedures for developing intrinsic priors for probit models are introduced following [2]. In Section 4, we review the mean-field variational method used in this article. Our own work is presented in Section 5. Our motivations for combining the intrinsic prior methodology with variational inference are as follows:

• Avoiding manually set ad hoc plug-in priors, by automatically generating a family of non-informative priors that are less sensitive to subjective choices.
• References [1,2] do not consider inference on the posterior distributions of parameters; their focus is on model comparison. Although the development of intrinsic priors comes from a model selection background, we thought it would be interesting to apply intrinsic priors to inference problems. In fact, some recently developed priors proposed to solve inference or estimation problems have turned out to also be intrinsic priors, for example, the Scaled Beta2 prior [9] and the Matrix-F prior [10].
• Intrinsic priors concentrate probability near the null hypothesis, a condition that is widely accepted and should be required of a prior for testing a hypothesis.
• Intrinsic priors have flat tails that prevent finite sample inconsistency [11].
• For inference problems with large data sets, variational approximation methods are much faster than MCMC-based methods.
As for model comparison, since the output of variational inference methods cannot be used directly to compare models, we propose in Section 5.3 to use the variational approximation of the posterior distribution as an importance function and to obtain a Monte Carlo estimate of the marginal likelihood by importance sampling.

Bayes Factor
The Bayesian framework of model selection coherently involves the use of probability to express all uncertainty in the choice of model, including uncertainty about the unknown parameters of a model. Suppose that models M_1, M_2, ..., M_q are under consideration. We shall assume that the observed data x = (x_1, x_2, ..., x_n) are generated from one of these models, but we do not know which one it is. We express our uncertainty through prior probabilities P(M_j), j = 1, 2, ..., q. Under model M_i, x has density f_i(x | θ_i, M_i), where θ_i are unknown model parameters, and the prior distribution for θ_i is π_i(θ_i | M_i). Given observed data and prior probabilities, we can then evaluate the posterior probability of M_i using Bayes' rule,
\[ P(M_i \mid x) = \frac{p_i(x)\, P(M_i)}{\sum_{j=1}^{q} p_j(x)\, P(M_j)}, \tag{1} \]
where
\[ p_i(x) = \int f_i(x \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i \tag{2} \]
is the marginal likelihood of x under M_i, also called the evidence for M_i [12]. A common choice of prior model probabilities is P(M_j) = 1/q, so that each model has the same initial probability. However, there are other ways of assigning probabilities that correct for multiple comparison (see [13]). From (1), the posterior odds are therefore the prior odds multiplied by the Bayes factor,
\[ \frac{P(M_j \mid x)}{P(M_i \mid x)} = \frac{P(M_j)}{P(M_i)} \times B_{ji}, \tag{3} \]
where the Bayes factor of M_j to M_i is defined by
\[ B_{ji} = \frac{p_j(x)}{p_i(x)} = \frac{\int f_j(x \mid \theta_j)\, \pi_j(\theta_j)\, d\theta_j}{\int f_i(x \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i}. \tag{4} \]
Here we omit the dependence on the models M_j, M_i to keep the notation simple. The marginal likelihood p_i(x) expresses the preference shown by the observed data for different models. When B_ji > 1, the data favor M_j over M_i, and when B_ji < 1 the data favor M_i over M_j. A scale for the interpretation of B_ji is given by [14].
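As a minimal numerical sketch of a Bayes factor, consider two nested normal models with closed-form marginal likelihoods (the models, prior, and seed below are our own illustrative choices, not from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, size=20)          # data generated with theta = 0.5
n = len(x)

# M0: x_i ~ N(0, 1) -- no free parameter, so the marginal is the likelihood
log_m0 = stats.norm.logpdf(x, loc=0.0, scale=1.0).sum()

# M1: x_i ~ N(theta, 1) with theta ~ N(0, 1).  Marginally x ~ N(0, I + J),
# where J is the all-ones matrix (the shared theta induces covariance 1).
cov = np.eye(n) + np.ones((n, n))
log_m1 = stats.multivariate_normal.logpdf(x, mean=np.zeros(n), cov=cov)

B_10 = np.exp(log_m1 - log_m0)            # Bayes factor of M1 against M0
print(f"log p0(x) = {log_m0:.3f}, log p1(x) = {log_m1:.3f}, B_10 = {B_10:.3f}")
```

Both marginal likelihoods are exact here, so B_10 is the exact Bayes factor; values above one favor the alternative model M1.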

Motivation and Development of Intrinsic Prior
Computing B_ji requires specification of π_i(θ_i) and π_j(θ_j). Often in Bayesian analysis, when prior information is weak, one can use non-informative (or default) priors π_i^N(θ_i). Common choices for non-informative priors are the uniform prior, π_i^U(θ_i) ∝ 1, and the Jeffreys prior,
\[ \pi_i^J(\theta_i) \propto |I_i(\theta_i)|^{1/2}, \]
where I_i(θ_i) is the expected Fisher information matrix corresponding to M_i. Using any of the π_i^N in (4) would yield
\[ B_{ji}^N = \frac{\int f_j(x \mid \theta_j)\, \pi_j^N(\theta_j)\, d\theta_j}{\int f_i(x \mid \theta_i)\, \pi_i^N(\theta_i)\, d\theta_i}. \tag{5} \]
The difficulty with (5) is that the π_i^N are typically improper and hence are defined only up to an unspecified constant c_i, so B_ji^N is defined only up to the ratio c_j/c_i of two unspecified constants. An attempt to circumvent the ill definition of Bayes factors for improper non-informative priors is the intrinsic Bayes factor introduced by [15], which is a modification of a partial Bayes factor [16]. To define the intrinsic Bayes factor we consider the subsamples x(l) of the data x of minimal size l such that 0 < p_i^N(x(l)) < ∞. These subsamples are called training samples (not to be confused with a training sample in machine learning), and there is a total number L of such subsamples.
The main idea here is that the training sample x(l) is used to convert the improper π_i^N(θ_i) into the proper posterior
\[ \pi_i^N(\theta_i \mid x(l)) = \frac{f_i(x(l) \mid \theta_i)\, \pi_i^N(\theta_i)}{p_i^N(x(l))}. \tag{6} \]
Then the Bayes factor for the remainder of the data, x(n − l), where x(l) ∪ x(n − l) = x, using π_i^N(θ_i | x(l)) as the prior, is called a "partial" Bayes factor,
\[ B_{ji}^N(x(n-l) \mid x(l)) = \frac{\int f_j(x(n-l) \mid \theta_j)\, \pi_j^N(\theta_j \mid x(l))\, d\theta_j}{\int f_i(x(n-l) \mid \theta_i)\, \pi_i^N(\theta_i \mid x(l))\, d\theta_i} = \frac{B_{ji}^N(x)}{B_{ji}^N(x(l))}. \tag{7} \]
This partial Bayes factor is a well-defined Bayes factor, but B_ji^N(x(n − l) | x(l)) depends on the choice of the training sample x(l). To eliminate this arbitrariness and increase stability, reference [15] suggests averaging over all training samples, which yields the arithmetic intrinsic Bayes factor (AIBF)
\[ B_{ji}^{AI} = B_{ji}^N(x) \cdot \frac{1}{L} \sum_{l=1}^{L} B_{ij}^N(x(l)). \tag{8} \]
The strongest justification of the arithmetic IBF is its asymptotic equivalence to a proper Bayes factor arising from intrinsic priors, which were identified through an asymptotic analysis (see [15]). For the case where M_i is nested in M_j, it can be shown that the intrinsic priors are given by
\[ \pi_i^I(\theta_i) = \pi_i^N(\theta_i), \qquad \pi_j^I(\theta_j) = \pi_j^N(\theta_j)\, E_{x(l) \mid \theta_j}\!\left[ \frac{p_i^N(x(l))}{p_j^N(x(l))} \right]. \tag{9} \]
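The arithmetic IBF can be sketched numerically for the simplest nested pair, M0: x_i ~ N(0, 1) versus M1: x_i ~ N(θ, 1) with the improper flat prior π^N(θ) = 1 (this toy pair and the seed are our own illustrative choices). A minimal training sample is a single observation, so L = n and the correction factor averages B^N_{01} over single points:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.8, 1.0, size=30)
n = len(x)

log_p0 = stats.norm.logpdf(x).sum()                    # M0 has no free parameter
# p1^N(x) = (2*pi)^{-(n-1)/2} n^{-1/2} exp(-0.5 * sum((x_i - xbar)^2)),
# obtained by integrating the normal likelihood over theta with flat prior
log_p1N = (-0.5 * (n - 1) * np.log(2 * np.pi) - 0.5 * np.log(n)
           - 0.5 * np.sum((x - x.mean()) ** 2))
log_B10N = log_p1N - log_p0                            # defined only up to c1/c0

# For a single-observation training sample, p1^N(x_l) = 1 and p0(x_l) = phi(x_l),
# so B^N_{01}(x(l)) = phi(x_l); average over all L = n training samples:
correction = np.mean(stats.norm.pdf(x))
B10_AI = np.exp(log_B10N) * correction                 # arithmetic IBF
print(B10_AI)
```

The unspecified constant of the flat prior cancels between the two factors, which is exactly the point of the intrinsic Bayes factor construction.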

Bayesian Probit Model and the Use of Auxiliary Variables
Consider a sample y = (y_1, ..., y_n), where Y_i, i = 1, ..., n, is a 0-1 random variable such that, under model M_j, it follows a probit regression model with a (j + 1)-dimensional vector of covariates x_i, where j ≤ p and p is the total number of covariate variables under consideration. The probit model M_j has the form
\[ P(Y_i = 1 \mid x_i, \beta_j) = \Phi(x_i' \beta_j), \tag{10} \]
where Φ denotes the standard normal cumulative distribution function and β_j = (β_0, ..., β_j)' is a vector of dimension j + 1. The first component of the vector x_i is set equal to 1 so that, when considering models of the form (10), the intercept appears in every submodel. The maximum length of the vector of covariates is p + 1. Let π(β), proper or improper, summarize our prior information about β. Then the posterior density of β is given by
\[ \pi(\beta \mid y) \propto \pi(\beta) \prod_{i=1}^{n} \Phi(x_i' \beta)^{y_i} \left[ 1 - \Phi(x_i' \beta) \right]^{1 - y_i}, \tag{11} \]
which is largely intractable. As shown by [17], the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. Following the data augmentation approach [18], we introduce n latent variables Z_1, ..., Z_n, where
\[ Z_i \mid \beta \sim N(x_i' \beta, 1), \quad \text{independently}. \]
The probit model (10) can then be thought of as a regression model with incomplete sampling information, in which only the sign of z_i is observed. More specifically, define Y_i = 1 if Z_i > 0 and Y_i = 0 otherwise. This allows us to write the probability density of y_i given z_i as
\[ p(y_i \mid z_i) = 1(z_i > 0)\, 1(y_i = 1) + 1(z_i \le 0)\, 1(y_i = 0). \]
Expansion of the parameter set from {β} to {β, Z} is the key to achieving a tractable solution for variational approximation.
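The data augmentation above is the basis of the Gibbs sampler of [17]. The following sketch illustrates it with a plug-in N(0, τ²I) prior on β for simplicity (that prior, the function name, and all defaults are our own illustrative choices; the paper replaces the plug-in prior with the intrinsic prior):

```python
import numpy as np
from scipy import stats

def probit_gibbs(X, y, n_iter=500, tau2=100.0, seed=1):
    """Albert-Chib data augmentation for probit regression with a
    N(0, tau2 * I) prior on beta (illustrative plug-in prior only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)   # cov of beta | z
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # z_i | beta, y_i ~ N(mu_i, 1) truncated to (0, inf) if y_i = 1,
        # and to (-inf, 0] if y_i = 0 -- only the sign of z_i is observed
        lo = np.where(y == 1, 0.0, -np.inf)
        hi = np.where(y == 1, np.inf, 0.0)
        z = stats.truncnorm.rvs(lo - mu, hi - mu, loc=mu, scale=1.0,
                                random_state=rng)
        # beta | z ~ N(V X'z, V): an ordinary normal regression update
        beta = rng.multivariate_normal(V @ X.T @ z, V)
        draws[t] = beta
    return draws
```

Each sweep alternates a truncated normal draw for the latent z and a multivariate normal draw for β, which is exactly the tractability gained by expanding the parameter set from {β} to {β, Z}.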

Development of Intrinsic Prior for Probit Models
For the sample z = (z_1, ..., z_n)', the null normal model is
\[ M_1: \quad z = \alpha 1_n + \varepsilon, \qquad \varepsilon \sim N_n(0, I_n). \]
For a generic model M_j with j + 1 regressors, the alternative model is
\[ M_j: \quad z = X_j \beta_j + \varepsilon, \qquad \varepsilon \sim N_n(0, I_n), \]
where the design matrix X_j has dimensions n × (j + 1). Intrinsic prior methodology for the linear model was first developed by [19], and was further developed in [20] by using the methods of [21]. This intrinsic methodology gives us an automatic specification of the priors π(α) and π(β), starting with the non-informative priors π^N(α) and π^N(β) for α and β, which are both improper and proportional to 1.
The marginal distributions of the sample z under the null model, and under the alternative model with the intrinsic prior, are formally written as
\[ p_1(z) = \int f(z \mid \alpha)\, \pi^N(\alpha)\, d\alpha, \qquad p_j(z) = \iint f(z \mid \beta_j)\, \pi^I(\beta_j \mid \alpha)\, \pi^N(\alpha)\, d\beta_j\, d\alpha. \]
However, these are marginals of the sample z, while our selection procedure requires the Bayes factor of model M_j versus the reference model M_1 for the sample y = (y_1, ..., y_n). To solve this problem, reference [2] proposed transforming the marginal p_j(z) into the marginal p_j(y) by using the probit transformations y_i = 1(z_i > 0), i = 1, ..., n. These latter marginals are given by
\[ p_j(y) = \int_{A_1 \times \cdots \times A_n} p_j(z)\, dz, \]
where A_i = (0, ∞) if y_i = 1 and A_i = (−∞, 0] if y_i = 0.

Overview of Variational Methods
Variational methods have their origins in the 18th century with the work of Euler, Lagrange, and others on the calculus of variations (The derivation in this section is standard in the literature on variational approximation and will at times follow the arguments in [22,23]). Variational inference is a body of deterministic techniques for making approximate inference for parameters in complex statistical models. Variational approximations are a much faster alternative to Markov Chain Monte Carlo (MCMC), especially for large models, and are a richer class of methods than the Laplace approximation [6].
Suppose we have a Bayesian model and a prior distribution for the parameters. The model may also have latent variables; here we shall denote the set of all latent variables and parameters by θ. In addition, we denote the set of all observed variables by X. Given a set of n independent, identically distributed data, for which X = {x_1, ..., x_n} and θ = {θ_1, ..., θ_n}, our probabilistic model (e.g., the probit regression model) specifies the joint distribution p(X, θ), and our goal is to find an approximation to the posterior distribution p(θ|X) as well as to the marginal likelihood p(X). For any probability distribution q(θ), we have the following decomposition of the log marginal likelihood,
\[ \log p(X) = \mathcal{L}(q) + KL\left( q \,\|\, p(\cdot \mid X) \right), \]
where
\[ \mathcal{L}(q) = \int q(\theta) \log \frac{p(X, \theta)}{q(\theta)}\, d\theta \tag{14} \]
and
\[ KL\left( q \,\|\, p \right) = -\int q(\theta) \log \frac{p(\theta \mid X)}{q(\theta)}\, d\theta. \tag{15} \]
We refer to (14) as the lower bound of the log marginal likelihood with respect to the density q, and (15) is by definition the Kullback-Leibler divergence of the density q from the posterior p(θ|X). Based on this decomposition, we can maximize the lower bound L(q) by optimizing with respect to the distribution q(θ), which is equivalent to minimizing the KL divergence. The bound is attained when the KL divergence is zero, which happens when q(θ) equals the posterior distribution p(θ|X). Since the true posterior distribution is intractable, such a density cannot be found directly, which motivates restricting q to a tractable family.
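The decomposition log p(X) = L(q) + KL can be verified exactly on a toy model with a single binary latent variable, where the posterior and the evidence are computable in closed form (the numbers below are arbitrary illustrative choices):

```python
import numpy as np

# Toy model: binary theta with known p(theta) and p(x | theta), so the
# identity log p(x) = L(q) + KL(q || p(theta | x)) holds for ANY q.
prior = np.array([0.3, 0.7])            # p(theta = 0), p(theta = 1)
lik = np.array([0.9, 0.2])              # p(x | theta) at the observed x
joint = prior * lik                     # p(x, theta)
evidence = joint.sum()                  # p(x)
posterior = joint / evidence            # p(theta | x)

q = np.array([0.5, 0.5])                # an arbitrary approximating density
L_q = np.sum(q * np.log(joint / q))     # lower bound (14)
KL = np.sum(q * np.log(q / posterior))  # KL divergence (15)

assert np.isclose(np.log(evidence), L_q + KL)
print(np.log(evidence), L_q, KL)
```

Because KL ≥ 0, the printed L(q) never exceeds log p(x), and the gap closes exactly when q equals the posterior.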

Factorized Distributions
The essence of the variational inference approach is the approximation of the posterior distribution p(θ|X) by a density q(θ) for which the q-dependent lower bound L(q) is more tractable than the original model evidence. Tractability is achieved by restricting q to a more manageable class of distributions and then maximizing L(q) over that class.
Suppose we partition the elements of θ into disjoint groups {θ_i}, i = 1, ..., M. We then assume that the q density factorizes with respect to this partition, i.e.,
\[ q(\theta) = \prod_{i=1}^{M} q_i(\theta_i). \tag{16} \]
The product form is the only assumption we make about the distribution. Restriction (16) is also known as the mean-field approximation and has its roots in physics [24].
For all distributions q(θ) of the form (16), we seek the one for which the lower bound L(q) is largest. Restricting q to a subclass of product densities like (16) gives rise to explicit solutions for each product component in terms of the others, which in turn leads to an iterative scheme for obtaining the solutions. To achieve this, we first substitute (16) into (14) and then separate out the dependence on one of the factors q_j(θ_j). Denoting q_j(θ_j) by q_j to keep the notation clear, we obtain
\[ \mathcal{L}(q) = \int q_j \log \tilde{p}(X, \theta_j)\, d\theta_j - \int q_j \log q_j\, d\theta_j + \text{const}, \tag{17} \]
where log \tilde{p}(X, θ_j) is given by
\[ \log \tilde{p}(X, \theta_j) = E_{i \neq j}\left[ \log p(X, \theta) \right] + \text{const}. \tag{18} \]
The notation E_{i≠j}[·] denotes an expectation with respect to the q distributions over all variables θ_i for i ≠ j. We now maximize (17) with respect to all possible forms of the density q_j(θ_j). Recognizing that (17) is, up to an additive constant, the negative KL divergence between q_j(θ_j) and \tilde{p}(X, θ_j), we see that maximizing (17) is equivalent to minimizing this KL divergence, and the minimum occurs when q_j(θ_j) = \tilde{p}(X, θ_j). The optimal q_j^*(θ_j) therefore satisfies
\[ \log q_j^*(\theta_j) = E_{i \neq j}\left[ \log p(X, \theta) \right] + \text{const}. \tag{19} \]
This solution says that the log of the optimal q_j is obtained simply by taking the log of the joint distribution over all parameters, latent and observed variables, and then taking its expectation with respect to all the other factors q_i, i ≠ j. Normalizing the exponential of (19), we have
\[ q_j^*(\theta_j) = \frac{\exp\left\{ E_{i \neq j}[\log p(X, \theta)] \right\}}{\int \exp\left\{ E_{i \neq j}[\log p(X, \theta)] \right\} d\theta_j}. \tag{20} \]
The set of equations (19) for j = 1, ..., M is not an explicit solution, because the right-hand side of (19) for the optimal q_j^* depends on expectations taken with respect to the other factors q_i, i ≠ j. We therefore first initialize all of the factors q_i(θ_i) and then cycle through the factors one by one, replacing each in turn with the updated estimate given by the right-hand side of (19) evaluated using the current estimates of all the other factors. Convexity properties can be used to show that convergence to at least a local optimum is guaranteed [25]. The iterative procedure is described in Algorithm 1.

Algorithm 1
Iterative procedure for obtaining the optimal densities under factorized density restriction (16). The updates are based on the solutions given by (19).
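The cycling scheme of Algorithm 1 can be sketched on a standard didactic target, a bivariate normal, where the optimal factor updates from (19) are Gaussian with means given by the conditional means (the target's mean and covariance below are arbitrary illustrative values):

```python
import numpy as np

# Mean-field CAVI for a bivariate normal target N(mu, Sigma) with
# q(theta) = q1(theta1) q2(theta2): each update sets the factor mean to
# the conditional mean of the target given the other factor's mean.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
P = np.linalg.inv(Sigma)                  # precision matrix

m = np.zeros(2)                           # initial factor means
for _ in range(100):                      # cycle through factors (Algorithm 1)
    m[0] = mu[0] - P[0, 1] / P[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - P[1, 0] / P[1, 1] * (m[0] - mu[0])

print(m)  # converges to the true mean (1, -1)
```

Each sweep is a coordinate-ascent step on L(q), so the updates are guaranteed not to decrease the bound; here they contract geometrically to the true posterior mean (while, as is well known, the factorized variances understate the true ones).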

Derivation of Intrinsic Prior to Be Used in Variational Inference
Let X_l be the design matrix of a minimal training sample (mTS) of the normal regression model M_j for the variable z_l ~ N(X_l β_j, I_{j+1}). For the (j + 1)-dimensional parameter β_j, the marginal
\[ p_j^N(z_l) = \int f(z_l \mid \beta_j)\, \pi^N(\beta_j)\, d\beta_j \]
is finite and nonzero precisely when X_l has full column rank j + 1; therefore, the mTS size is j + 1 [2]. Given that the priors for α and β are proportional to 1, the intrinsic prior for β conditional on α can be derived. Let β_0 denote the vector with first component equal to α and the remaining components equal to zero. Based on Formula (9), we have
\[ \pi^I(\beta \mid \alpha) = N_{j+1}\left( \beta \mid \beta_0,\; 2\, (X_l' X_l)^{-1} \right). \]
Notice that X_l' X_l is unknown, because it is a theoretical design matrix corresponding to the training sample z_l. It can be estimated by averaging over all submatrices consisting of j + 1 rows of the n × (j + 1) design matrix X_j. This average is \frac{j+1}{n} X_j' X_j (see [26] and Appendix A in [2]), and therefore
\[ \pi^I(\beta \mid \alpha) = N_{j+1}\left( \beta \mid \beta_0,\; \frac{2n}{j+1}\, (X_j' X_j)^{-1} \right). \tag{21} \]
Next, based on π^I(β|α), the intrinsic prior for β is obtained as
\[ \pi^I(\beta) = \int \pi^I(\beta \mid \alpha)\, \pi^I(\alpha)\, d\alpha. \]
Since we assume that π^I(α) = π^N(α) is proportional to one, set π^N(α) = c, where c is an arbitrary positive constant. Denoting \frac{2n}{j+1}(X_j' X_j)^{-1} by Σ_{β|α}, we obtain
\[ \pi^I(\beta) = c \int (2\pi)^{-(j+1)/2}\, |\Sigma_{\beta|\alpha}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\beta - \beta_0)' \Sigma_{\beta|\alpha}^{-1} (\beta - \beta_0) \right\} d\alpha, \]
where the exponent expands as β' Σ_{β|α}^{-1} β − 2α γ_1' β + α² σ^{11}, with σ^{11} = (Σ_{β|α}^{-1})_{(1,1)}, the component of Σ_{β|α}^{-1} at row 1, column 1, and γ_1 the first column of Σ_{β|α}^{-1}. Integrating over α (a Gaussian integral in α), we have derived that
\[ \pi^I(\beta) \propto \exp\left\{ -\tfrac{1}{2}\, \beta' \left( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1'}{\sigma^{11}} \right) \beta \right\}. \tag{22} \]
For model comparison, the specific form of the intrinsic prior, including the constant factor, may be needed. Therefore, following (21) and (22), we have
\[ \pi^I(\beta) = c\, (2\pi)^{-j/2}\, |\Sigma_{\beta|\alpha}|^{-1/2}\, (\sigma^{11})^{-1/2} \exp\left\{ -\tfrac{1}{2}\, \beta' \left( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1'}{\sigma^{11}} \right) \beta \right\}. \tag{23} \]
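The covariance in (21) and the precision appearing in the exponent of (22) are straightforward to assemble numerically; the sketch below (with an arbitrary simulated design matrix) also checks that integrating out α leaves the prior flat along the intercept direction:

```python
import numpy as np

# Conditional intrinsic prior covariance (21) and integrated precision (22),
# assuming a design matrix X whose first column is the intercept.
rng = np.random.default_rng(0)
n, j = 200, 3                              # j covariates -> beta has j + 1 components
X = np.column_stack([np.ones(n), rng.normal(size=(n, j))])

Sigma = 2 * n / (j + 1) * np.linalg.inv(X.T @ X)   # Sigma_{beta|alpha}
Q = np.linalg.inv(Sigma)                            # its precision
sigma11 = Q[0, 0]                                   # sigma^{11}
gamma1 = Q[:, 0]                                    # first column of the precision
Q_beta = Q - np.outer(gamma1, gamma1) / sigma11     # precision in the exponent of (22)

# Integrating alpha out makes the prior improper (flat) in the intercept
# direction, so the resulting precision matrix is singular with rank j.
print(np.linalg.matrix_rank(Q_beta))
```

The rank deficiency is by construction: Q_beta annihilates the first coordinate axis, mirroring the arbitrary constant c left in the intercept after integration.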

Iterative Updates for Factorized Distributions
We have from Section 3.1 that, given β, the latent variables satisfy z | β ~ N_n(Xβ, I_n), and we have shown in Section 5.1 that π^I(β) takes the form (22). Since y is independent of β given z, we have the joint distribution
\[ p(y, z, \beta) = p(y \mid z)\, p(z \mid \beta)\, \pi^I(\beta) = \left[ \prod_{i=1}^{n} p(y_i \mid z_i) \right] N_n(z \mid X\beta, I_n)\, \pi^I(\beta). \tag{25} \]
To apply the variational approximation to probit regression model, unobservable variables are considered in two separate groups, coefficient parameter β and auxiliary variable Z. To approximate the posterior distribution of β, consider the product form q(Z, β) = q Z (Z)q β (β).
We proceed by first describing the distribution for each factor of the approximation, q Z (Z) and q β (β). Then variational approximation is accomplished by iteratively updating the parameters of each factor distribution.
Start with q_Z(Z). When y_i = 1, we have p(y_i | z_i) = 1(z_i > 0). According to (19) and Algorithm 1, the optimal q_Z is proportional to
\[ \exp\left\{ E_\beta[\log p(y, z, \beta)] \right\} \propto \prod_{i=1}^{n} p(y_i \mid z_i) \exp\left\{ -\tfrac{1}{2} \left( z_i - x_i' E_\beta[\beta] \right)^2 \right\}, \]
so for y_i = 1 the optimal factor is a normal density with location x_i' E_β[β] truncated to (0, ∞). A similar procedure covers the case y_i = 0. Therefore, the optimal approximation q_Z^* is a product of truncated normal distributions,
\[ z_i \sim \begin{cases} N\!\left( x_i' E_\beta[\beta],\, 1 \right) \text{ truncated to } (0, \infty), & y_i = 1, \\ N\!\left( x_i' E_\beta[\beta],\, 1 \right) \text{ truncated to } (-\infty, 0], & y_i = 0. \end{cases} \tag{26} \]
Denote X E_β[β] by μ_z, the location vector of the distribution q_Z^*(Z). The expectation E_β is taken with respect to the density q_β(β), which we shall derive now.
For q_β(β), given the joint form in (25), we have
\[ \log p(y, z, \beta) = \log p(y \mid z) - \tfrac{1}{2} (z - X\beta)'(z - X\beta) + \log \pi^I(\beta) + \text{const}. \]
Taking the expectation with respect to q_Z(z) and applying (19) and Algorithm 1, the optimal q_β(β) is proportional to exp{E_Z[log p(y, z, β)]},
\[ \log q_\beta^*(\beta) = -\tfrac{1}{2} \left( \beta' X' X \beta - 2\, \beta' X' E_Z[Z] \right) - \tfrac{1}{2}\, \beta' \left( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1'}{\sigma^{11}} \right) \beta + \text{const}. \]
First notice that all constant terms, including the constant factor in the intrinsic prior, cancel out due to the ratio form of (19) and (20). Then, recognizing the quadratic form in the above expression, we have
\[ q_\beta^*(\beta) = N_{j+1}\left( \beta \mid \mu_{q_\beta},\, \Sigma_{q_\beta} \right), \tag{27} \]
where
\[ \Sigma_{q_\beta} = \left( X'X + \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1'}{\sigma^{11}} \right)^{-1}, \qquad \mu_{q_\beta} = \Sigma_{q_\beta} X' E_Z[Z]. \]
Notice that μ_{q_β}, i.e., E_β[β], depends on E_Z[Z]; in addition, from the previous derivation, the update of E_Z[Z] depends on E_β[β]. Given that q_Z is a product of truncated normals with location vector μ_z = X E_β[β], we have
\[ E_Z[z_i] = \begin{cases} \mu_{z,i} + \dfrac{\phi(\mu_{z,i})}{\Phi(\mu_{z,i})}, & y_i = 1, \\[2ex] \mu_{z,i} - \dfrac{\phi(\mu_{z,i})}{1 - \Phi(\mu_{z,i})}, & y_i = 0, \end{cases} \]
where φ is the standard normal density and Φ is the standard normal cumulative distribution function. Denote E_Z[Z] by μ_{q_Z}. See properties of the truncated normal distribution in Appendix A. The updating procedures for the parameters μ_{q_β} and μ_{q_Z} of each factor distribution are summarized in Algorithm 2.

Algorithm 2
Iterative procedure for updating parameters to reach optimal factor densities q * β and q * Z in Bayesian probit regression model. The updates are based on the solutions given by (26) and (27).
until the increase in L(q) is negligible.
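A minimal runnable sketch of the Algorithm 2 updates, following our reading of (26) and (27), is given below (the function name, iteration count, and clipping safeguards are our own choices):

```python
import numpy as np
from scipy.stats import norm

def vb_probit_intrinsic(X, y, n_iter=100):
    """Cycle the means of q_beta (normal) and q_Z (product of truncated
    normals) as in Algorithm 2; a sketch, not the authors' code."""
    n, p = X.shape                            # p = j + 1 (intercept included)
    # intrinsic prior precision: Sigma_{beta|alpha} = 2n/p (X'X)^{-1}
    Q = (p / (2 * n)) * (X.T @ X)             # Sigma_{beta|alpha}^{-1}
    Q_beta = Q - np.outer(Q[:, 0], Q[:, 0]) / Q[0, 0]
    S = np.linalg.inv(X.T @ X + Q_beta)       # Sigma_{q_beta}, fixed over iterations
    mu_z = np.zeros(n)                        # E_Z[Z], initialized at zero
    for _ in range(n_iter):
        mu_beta = S @ (X.T @ mu_z)            # mean of q_beta
        eta = X @ mu_beta                     # truncated normal locations mu_z
        cdf = np.clip(norm.cdf(eta), 1e-10, 1 - 1e-10)
        # truncated normal means: eta + phi/Phi if y = 1, eta - phi/(1-Phi) if y = 0
        mu_z = np.where(y == 1,
                        eta + norm.pdf(eta) / cdf,
                        eta - norm.pdf(eta) / (1 - cdf))
    return mu_beta, S
```

Only the two mean vectors need to be cycled, since Σ_{q_β} does not change across iterations; in practice one would also monitor L(q) and stop when its increase is negligible.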

Evaluation of the Lower Bound L(q)
During the optimization of the variational approximation densities, the lower bound on the log marginal likelihood needs to be evaluated and monitored to determine when the iterative updating process has converged. Based on the derivations from the previous section, we now have the exact form of the variational inference density,
\[ q(\beta, Z) = q_\beta(\beta)\, q_Z(Z), \]
with q_β given by (27) and q_Z given by (26). According to (14), we can write down the lower bound L(q) with respect to q(β, Z),
\[ \mathcal{L}(q) = E_q[\log p(y \mid z)] + E_q[\log p(z \mid \beta)] + E_q[\log \pi^I(\beta)] - E_q[\log q_Z(Z)] - E_q[\log q_\beta(\beta)]. \tag{28} \]
Since q_Z places all of its mass on values of z consistent with the observed y, the first term vanishes, and L(q) divides into four parts, each an expectation over the variational approximation density q(β, Z) = q_β(β) q_Z(Z). We find the expression of these expectations one by one. The term E_q[log p(z | β)] requires the second moment E_Z[z'z], which we obtain from the properties of the truncated normal distribution in Appendix B; E_q[log q_Z(Z)] is the negative entropy of a product of truncated normals and likewise follows from Appendix B. The term E_q[log π^I(β)] follows from the exact expression of the intrinsic prior (23), with all constant terms collected into a single constant C, and E_q[log q_β(β)] is the negative entropy of the normal density (27). Combining the four parts gives a closed-form expression for L(q) that can be evaluated at each iteration.

Model Comparison Based on Variational Approximation
Suppose we want to compare two models, M_1 and M_0, where M_0 is the simpler model. An intuitive approach to comparing two models by variational approximation methods is simply to compare the lower bounds L(q_1) and L(q_0). However, comparing the lower bounds implicitly assumes that the KL divergences of the two approximations are equal, so that the bounds alone can serve as a guide. Unfortunately, it is not easy to determine theoretically how tight any particular bound is; if that could be done, we could estimate the log marginal likelihood more accurately in the first place. As clarified in [27], when comparing two exact log marginal likelihoods, we have
\[ \log p_1(X) - \log p_0(X) = \left[ \mathcal{L}(q_1) + KL(q_1 \,\|\, p_1) \right] - \left[ \mathcal{L}(q_0) + KL(q_0 \,\|\, p_0) \right]. \tag{42} \]
The difference in log marginal likelihood, log p_1(X) − log p_0(X), is the quantity we wish to estimate. However, if we base the comparison on the difference of the lower bounds, we are basing it on
\[ \mathcal{L}(q_1) - \mathcal{L}(q_0) \tag{43} \]
rather than on (42). Therefore, there is a systematic bias towards the simpler model whenever KL(q_1 ‖ p_1) − KL(q_0 ‖ p_0) is not zero, since the more flexible model typically incurs the larger approximation gap.
Realizing that we have a variational approximation q_β(β) (27) to the posterior distribution of β, we propose the following method to estimate p(x). First, writing the marginal likelihood as
\[ p(x) = \int p(x \mid \beta)\, \pi^I(\beta)\, d\beta = \int \frac{p(x \mid \beta)\, \pi^I(\beta)}{q_\beta(\beta)}\, q_\beta(\beta)\, d\beta, \]
we can interpret it as an expectation with respect to q_β(β). Next, draw samples β^{(1)}, ..., β^{(n)} from q_β(β) and obtain the estimated marginal likelihood
\[ \hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(x \mid \beta^{(i)})\, \pi^I(\beta^{(i)})}{q_\beta(\beta^{(i)})}. \]
Please note that this proposed method is equivalent to importance sampling with the importance function q_β(β), whose exact form we know, and for which generating the random draws β^{(i)} is easy and inexpensive.
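The estimator can be sketched as follows, working on the log scale for numerical stability (the function name and arguments are our own; log_prior stands for the log intrinsic prior, including its constant, and mu_q, Sigma_q are the variational posterior parameters):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_marginal_is(X, y, log_prior, mu_q, Sigma_q, n_draws=2000, seed=0):
    """Importance-sampling estimate of log p(x), with the variational
    posterior q_beta = N(mu_q, Sigma_q) as importance function."""
    rng = np.random.default_rng(seed)
    betas = rng.multivariate_normal(mu_q, Sigma_q, size=n_draws)
    q = multivariate_normal(mu_q, Sigma_q)
    logw = np.empty(n_draws)
    for i, b in enumerate(betas):
        eta = X @ b
        # probit log likelihood: logcdf avoids underflow in the tails
        loglik = np.sum(np.where(y == 1, norm.logcdf(eta), norm.logcdf(-eta)))
        logw[i] = loglik + log_prior(b) - q.logpdf(b)
    m = logw.max()                         # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(logw - m)))
```

Because q_β is typically narrower than the true posterior, in practice one would check the variability of the importance weights before trusting the estimate.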

Introduction
LendingClub (https://www.lendingclub.com/) is the world's largest peer-to-peer lending platform. LendingClub enables borrowers to create unsecured personal loans between $1000 and $40,000. The standard loan period is three or five years. Investors can search and browse the loan listings on the LendingClub website and select loans they want to invest in based on the information supplied about the borrower, the amount of the loan, the loan grade, and the loan purpose. Investors make money from interest; LendingClub makes money by charging borrowers an origination fee and investors a service fee. To attract lenders, LendingClub publishes most of the information available in borrowers' credit reports, as well as information reported by borrowers, for almost every loan issued through its website.

Modeling Probability of Default-Target Variable and Predictive Features
Publicly available LendingClub data, from June 2007 through the fourth quarter of 2018, contain a total of 2,260,668 issued loans. Each loan has a status: paid off, charged off, or ongoing. We kept only loans with a final status, i.e., either paid off or charged off, and this final loan status is the target variable. We then selected loan features (such as FICO score, loan amount, and annual income) as our predictive covariates. We took a sample from the original data set with customer yearly income between $15,000 and $60,000, ending up with a data set of 520,947 rows.

Addressing Uncertainty of Estimated Probit Model Using Variational Inference with Intrinsic Prior
Using the process developed in Section 5, we can update the intrinsic prior for the parameters of the probit model (see Figure 1) using variational inference and obtain the posterior distribution of the estimated parameters. Based on the derived parameter distributions, questions of interest can be explored with model uncertainty taken into account. Investors will be interested in understanding how each loan feature affects the probability of default for a given loan term, either 36 or 60 months. To answer this question, we sampled 6000 cases from the original data set and drew from the derived posterior distribution 100 times. We end up with 6000 × 100 calculated probabilities of default, where each of the 6000 samples yields 100 different probit estimates based on the 100 posterior draws. We summarize some of our findings in Figure 2, where red represents 36-month loans and green represents 60-month loans.

• In general, 60-month loans have a higher risk of default.
• There is no clear pattern with respect to income. This is probably because we only included customers with income between $15,000 and $60,000 in our training data, which may not represent the income distribution of the whole population.
Model uncertainty can also be measured through credible intervals. Again, with the derived posterior distribution, a credible interval is simply the range containing a particular percentage of the estimated effect/parameter values. For instance, the 95% credible interval of the estimated parameter value of FICO is simply the central portion of the posterior distribution that contains 95% of the estimated values. In contrast to frequentist confidence intervals, Bayesian credible intervals are much more straightforward to interpret. Using the Bayesian framework created in this article, from Figure 3 we can simply state that, given the observed data, the estimated effect of DTI on default has an 89% probability of falling within [8.300, 8.875]. Instead of the conventional 95%, we used 89% following suggestions in [28,29], which is just as arbitrary as any of the conventions.
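Computing such an equal-tailed interval from posterior draws reduces to taking the central percentiles (the draws below are hypothetical, generated only to illustrate the computation, not the paper's actual DTI posterior):

```python
import numpy as np

# An 89% equal-tailed credible interval is the central 89% of the draws.
rng = np.random.default_rng(7)
draws = rng.normal(loc=8.6, scale=0.15, size=10_000)   # hypothetical posterior draws
lo, hi = np.percentile(draws, [5.5, 94.5])             # central 89%
print(f"89% credible interval: [{lo:.3f}, {hi:.3f}]")
```

Swapping [5.5, 94.5] for [2.5, 97.5] gives the conventional 95% interval; the interpretation as a direct posterior probability statement is unchanged.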
One of the main advantages of using variational inference over MCMC is that variational inference is much faster. Comparisons were made between the two approximation frameworks on a 64-bit Windows 10 laptop with 32.0 GB of RAM. Using the data set introduced in Section 6.2, we found that
• with a conjugate prior and the Gibbs sampling scheme proposed by [17], it took 89.86 s to finish 100 simulations of the Gibbs sampler;
• with our method proposed in Section 5.2, it took 58.38 s to obtain the approximated posterior distribution and to sample 10,000 times from that posterior.

Model Comparison
Following the procedure proposed in Section 5.3, we compare a series of nested models. From the data set introduced in Section 6.2, 2000 records were sampled to estimate the likelihood p(x | β^{(i)}), where β^{(i)} is one of 2500 draws sampled directly from the approximated posterior distribution q_β(β), which serves as the importance function used to estimate the marginal likelihood p(x). The estimated log marginal likelihood for each model is plotted in Figure 4. We can see that the model evidence increases as the predictive features Loan Amount and Annual Income are added sequentially. However, if we further add home ownership information, i.e., the Mortgage Indicator, as a predictive feature, the model evidence decreases. The corresponding Bayes factors follow directly from the estimated marginal likelihoods.

Further Work
The authors thank the reviewers for pointing out that mean-field variational Bayes underestimates the posterior variance. This could be an interesting topic for our future research. We plan to study the linear response variational Bayes (LRVB) method proposed in [30] to see whether it can be applied to the framework we proposed in this article. To see whether our proposed method can bring the approximated posterior variance close enough to the true variance, comparisons should be made among the normal conjugate prior with the MCMC procedure, the normal conjugate prior with LRVB, and the intrinsic prior with LRVB.
Author Contributions: Methodology, A.L., L.P. and K.W.; software, A.L.; writing-original draft preparation, A.L., L.P. and K.W.; writing-review and editing, A.L. and L.P.; visualization, A.L. All authors have read and agreed to the published version of the manuscript.

Funding:
The work of L.R.Pericchi was partially funded by NIH grants U54CA096300, P20GM103475 and R25MD010399.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Density Function
Suppose X ~ N(μ, σ²) has a normal distribution and is restricted to the interval (a, b), −∞ ≤ a < b ≤ ∞. Then X conditional on a < X < b has a truncated normal distribution. Its probability density function f, for a ≤ x ≤ b, is given by
\[ f(x; \mu, \sigma, a, b) = \frac{\phi\!\left( \frac{x - \mu}{\sigma} \right)}{\sigma \left[ \Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right) \right]}, \]
and by f = 0 otherwise. Here
\[ \phi(\xi) = \frac{1}{\sqrt{2\pi}}\, e^{-\xi^2 / 2} \]
is the probability density function of the standard normal distribution and Φ(·) is its cumulative distribution function. If b = ∞, then Φ((b − μ)/σ) = 1; similarly, if a = −∞, then Φ((a − μ)/σ) = 0. The cumulative distribution function of the truncated normal distribution is
\[ F(x; \mu, \sigma, a, b) = \frac{\Phi(\xi) - \Phi(\alpha)}{Z}, \]
where ξ = (x − μ)/σ, α = (a − μ)/σ, β = (b − μ)/σ, and Z = Φ(β) − Φ(α).
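These Appendix A formulas, and the lower-truncation mean used in the Algorithm 2 updates, can be checked against SciPy's truncated normal (note that scipy.stats.truncnorm takes its bounds in standardized form, (a − μ)/σ and (b − μ)/σ):

```python
import numpy as np
from scipy.stats import truncnorm, norm

mu, sigma, a = 1.0, 2.0, 0.0                  # N(1, 4) truncated to (0, inf)
alpha = (a - mu) / sigma                      # standardized lower bound
dist = truncnorm(alpha, np.inf, loc=mu, scale=sigma)

x = 1.7
Z = 1.0 - norm.cdf(alpha)                     # Phi(beta) - Phi(alpha), b = inf
f = norm.pdf((x - mu) / sigma) / (sigma * Z)  # Appendix A density formula
assert np.isclose(f, dist.pdf(x))

# mean of N(mu, sigma^2) truncated to (a, inf): mu + sigma * phi(alpha) / Z
mean = mu + sigma * norm.pdf(alpha) / Z
assert np.isclose(mean, dist.mean())
print(f, mean)
```

With σ = 1 and a = 0 this mean formula reduces to μ + φ(μ)/Φ(μ), the y_i = 1 update for E_Z[z_i] in Section 5.2.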