A variational Bayes approach to a semiparametric regression using Gaussian process priors

This paper presents a variational Bayes approach to a semiparametric regression model that consists of parametric and nonparametric components. The assumed univariate nonparametric component is represented with a cosine series based on a spectral analysis of Gaussian process priors. Here, we develop fast variational methods for fitting the semiparametric regression model that reduce the computation time by an order of magnitude over Markov chain Monte Carlo methods. Further, we explore the possible use of the variational lower bound and variational information criteria for model choice of a parametric regression model against a semiparametric alternative. In addition, variational methods are developed for estimating univariate shape-restricted regression functions that are monotonic, monotonic convex or monotonic concave. Since these variational methods are approximate, we explore some of the trade-offs involved in using them in terms of speed, accuracy and automation of the implementation in comparison with Markov chain Monte Carlo methods and discuss their potential and limitations. MSC 2010 subject classifications: Primary 62G08; secondary 62F15.


Introduction
This paper develops a mean field variational Bayes approximation algorithm for a semiparametric regression model, known as a partial linear model, that consists of parametric and nonparametric components. The nonparametric component is represented with a cosine series based on a spectral analysis of Gaussian process priors. Specifically, the semiparametric regression model is given by

y_i = w_i'β + f(x_i) + ε_i,  i = 1, . . . , n,  (1.1)

where w_i'β is referred to as the parametric component, w_i and β are (p + 1)-dimensional vectors of known covariates and coefficients respectively, and f(·), the nonparametric component, is an unknown univariate function of x defined on the interval [0, 1]. The error terms {ε_i} are a random sample from a normal distribution with mean 0 and unknown variance σ². For modeling the nonparametric component f(x), a Gaussian process prior is used for the unknown f, f(x) = Z(x), where Z is a second-order Gaussian process with mean function equal to zero and covariance function ν(s, t) = Cov(Z(s), Z(t)), s, t ∈ [0, 1]. Gaussian processes provide a natural way to specify prior distributions on the space of functions for nonparametric regression (O'Hagan, 1978), and are also widely used in machine learning applications (e.g., Rasmussen and Williams (2006)). One of the main practical drawbacks in the application of Gaussian process regression (hereafter GPR) is the computational burden of fitting these models as the number of data points increases, owing to the large dense matrix calculations and associated storage requirements. An alternative approach that avoids these problems is to linearize the covariance function and to use a computationally efficient basis representation via the spectral representation of covariance functions, expanding Z in an orthonormal basis {φ_j} as

Z(x) = Σ_{j=0}^∞ θ_j φ_j(x),  θ_j ∼ N(0, τ_j²) independently,  (1.2)

with Σ_j τ_j² < ∞. For example, Paciorek (2007), Lázaro-Gredilla et al.
(2016) considered using the spectral representation of a stationary covariance function based on Bochner's theorem (see, e.g., Grenander (1981) and Cressie and Wikle (2011)).
In Lenk and Choi (2017), the Bayesian semiparametric regression framework using Gaussian process priors via (1.2) was adopted, referred to as Bayesian spectral analysis regression (BSAR), and Markov chain Monte Carlo (MCMC) methods were developed for it. In particular, they proposed a Bayesian method to estimate shape-restricted regression functions by assuming that the derivatives of the functions are squares of Gaussian processes. In regression models it is often the case that subject matter knowledge imposes shape restrictions on the unknown regression functions, and imposing them can yield fitted models that are more interpretable and perform better than unrestricted fits. The model based on BSAR in Lenk and Choi (2017) successfully handles shape restrictions for regression functions that are monotonic, monotonic convex or monotonic concave. Lenk (1999) considered related methods in the case without shape restriction. Lenk and Choi (2017) showed that their method is flexible for handling different kinds of shape restrictions and that it enjoys good performance compared to other Bayesian methods in the literature, for example, those using spline smoothing (Shively, Sager and Walker, 2009; Meyer, Hackstadt and Hoeting, 2011), Bernstein polynomials (Curtis and Ghosh, 2011), and Gaussian processes (Lin and Dunson, 2014; Wang and Berger, 2016). The comparisons in Lenk and Choi (2017) are restricted to the univariate setting.
However, the approach of Lenk and Choi (2017) has a disadvantage in handling large data sets, mainly the lengthy computation times required by MCMC, especially in regression models with shape constraints. This is despite the fact that it is based on carefully designed MCMC algorithms, resulting in methodology that is often faster than the alternative methods mentioned above, all of which are based on generic MCMC methods. Further, an R package is available for practitioners, using compiled Fortran code to maximize computational efficiency (Jo et al., 2017), but alternative numerical methods still need to be developed for real-time applications or large data sets. Variational Bayes (VB) methods are known to be fast deterministic alternatives to MCMC for Bayesian computation, facilitating approximate posterior inference for the parameters in complex statistical models (see, e.g., Waterhouse, Mackay and Robinson (1996), Jordan et al. (1999) and Attias (2000) for early developments of the method, and Titterington (2004), Jordan (2004) and Ormerod and Wand (2010) for nontechnical overviews). In the nonparametric and semiparametric regression context, variational approximation schemes have found increasing use; for instance, real-time semiparametric regression (Wand and Ormerod, 2011), truncated power splines for partially linear additive models with variable selection (Zhao and Lian, 2014), penalized splines for mean and quantile regression in geoadditive latent Gaussian regression (Waldmann and Kneib, 2015), and sparse spectrum Gaussian process regression (Tan et al., 2016).
The objective of the current study is to develop fast variational Bayes computation methods for the semiparametric regression model (1.1) using Gaussian process priors, which reduce computation time by an order of magnitude relative to the MCMC methods of Lenk and Choi (2017). Specifically, we provide variational Bayes approximation methods for spectral representations of one-dimensional Gaussian processes via the cosine basis expansion of (1.2). Further, we explore the possible use of the variational lower bound for model choice between a parametric regression model and a semiparametric alternative. In addition, we develop a variational Bayes approximation scheme to address the computational challenges associated with MCMC under shape restrictions on a univariate nonlinear regression function; this setting is more challenging than regression without shape restrictions because of the non-conjugacy of many factors associated with the shape restrictions. To the best of our knowledge, there exist no variational Bayes approximation methods in the literature for shape-restricted regression models, and thus our work is the first variational Bayes approach to semiparametric regression with shape restrictions. The new approach is limited to one-dimensional Gaussian processes and to the shape constraints of monotonicity and convexity in the current work, but it broadens the applicability of variational Bayes approximation in the context of Gaussian process regression modeling.
The rest of the paper is organized as follows. Section 2 provides a brief overview of variational Bayes approximation methods and reviews the basic model structure and the hierarchical prior specification proposed in Lenk and Choi (2017). Then, we develop a variational Bayes algorithm for fitting the unrestricted model, that is, BSAR without shape restriction. In Section 3, the shape restricted models, that is, BSAR with shape restrictions, are considered with monotonicity and convexity, and appropriate variational Bayes approximation schemes are developed. Section 4 illustrates the empirical performance of the proposed variational Bayes methods with simulation studies and real applications. Since these variational methods are approximate, we explore some of the trade-offs involved in using them in terms of speed, accuracy and automation of the implementation, in comparison with MCMC methods as well as other existing variational Bayes approximations for semiparametric regression in the literature. In Section 5, we discuss the potential and limitations of the methodology along with concluding remarks.

An overview of variational Bayes methods
Consider a general Bayesian model with parameter vector δ, prior density function p(δ), observation vector y, and assumed probability density function p(y|δ). We assume that y and δ are continuous for simplicity. Then, the posterior density function is given by

p(δ|y) = p(y|δ) p(δ) / p(y),

where p(y) = ∫ p(y|δ) p(δ) dδ is the marginal probability density function of y. For Bayesian inference with the posterior density p(δ|y), which is often mathematically intractable, variational approximation methods (e.g., Jordan (2004), Titterington (2004), and Ormerod and Wand (2010)) can be employed. In these variational approximations, the posterior density p(δ|y) is approximated by a density q(δ) from some tractable family, and q(δ) is chosen optimally in the sense of minimizing the Kullback-Leibler (KL) divergence between q(δ) and p(δ|y).
It is easy to see that

log p(y) = L(q) + KL(q ‖ p(·|y)),  L(q) = ∫ q(δ) log{p(y, δ)/q(δ)} dδ,  (2.1)

so that L(q) forms a lower bound on log p(y); the second term in (2.1) is the KL divergence between q(δ) and p(δ|y). The fact that L(q) is a lower bound follows from (2.1) and the non-negativity of the KL divergence, and maximizing L(q) with respect to q(·) is clearly equivalent to minimizing the KL divergence term in (2.1). The term variational Bayes (VB) is often used to denote variational inference when some kind of product restriction is made on the approximating distribution q(·) but this distribution is otherwise arbitrary. This approach is also known as mean field variational Bayes (MFVB). By a product restriction we mean that we partition the parameter vector δ into blocks, δ = (δ_1, . . . , δ_k), and consider a density function q(·) that is assumed to factorize as q(δ) = ∏_j q_j(δ_j). In variational Bayes a coordinate ascent approach is used to maximize L(q) by updating each term in q(δ) = ∏_j q_j(δ_j) in turn with all other terms held fixed. The update for q_j(δ_j) takes the form

q_j(δ_j) ∝ exp{E_{−j} log p(y, δ)},

where E_{−j}(·) denotes expectation with respect to ∏_{i≠j} q_i(δ_i). If all the conditional distributions have a conjugate-exponential structure, then each q_j takes the parametric form of an exponential family, and the variational update procedures are conveniently performed (Ghahramani and Beal, 2001). A general algorithmic implementation of the procedure in this setting is given by the variational message passing algorithm of Winn and Bishop (2005). When there are nonconjugate factors in the model, one way to proceed is to use a generalization of variational message passing, namely the nonconjugate variational message passing (NCVMP) algorithm (Knowles and Minka, 2011; Wand, 2014).
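As a concrete illustration of these coordinate-ascent updates (a toy conjugate model, not the BSAR model itself; all function and variable names below are our own), consider y_i ∼ N(μ, 1/τ) with independent priors μ ∼ N(μ_0, 1/λ_0) and τ ∼ Gamma(a_0, b_0), for which both mean field updates are available in closed form:

```python
import numpy as np

def mfvb_normal(y, mu0=0.0, lam0=1e-2, a0=1e-2, b0=1e-2, tol=1e-8, max_iter=200):
    """Mean field VB for y_i ~ N(mu, 1/tau) with independent priors
    mu ~ N(mu0, 1/lam0) and tau ~ Gamma(a0, b0).  Toy illustration of the
    coordinate ascent scheme; not the BSAR updates."""
    n, ybar = len(y), np.mean(y)
    E_tau = 1.0  # initial guess for E_q[tau]
    for _ in range(max_iter):
        # update q(mu) = N(m, v) given the current E_q[tau]
        v = 1.0 / (lam0 + n * E_tau)
        m = v * (lam0 * mu0 + E_tau * n * ybar)
        # update q(tau) = Gamma(a, b) given the current q(mu)
        a = a0 + 0.5 * n
        b = b0 + 0.5 * (np.sum((y - m) ** 2) + n * v)
        E_tau_new = a / b
        if abs(E_tau_new - E_tau) < tol:
            E_tau = E_tau_new
            break
        E_tau = E_tau_new
    return m, v, a, b
```

Each sweep applies the update q_j ∝ exp{E_{−j} log p(y, δ)} in turn, so the lower bound L(q) is non-decreasing over iterations.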
In NCVMP, a factor q_j(δ_j) with an intractable mean field update is assumed to have the parametric form of a natural exponential family,

q_j(δ_j) = exp{S_j(δ_j)'ρ_j − h_j(ρ_j)},

where ρ_j is the vector of natural parameters, S_j(δ_j) is the vector of sufficient statistics of δ_j, and h_j(ρ_j) is the log-normalizing factor. A fixed-point updating procedure can then be derived, which reduces to the variational message passing update in the conjugate-exponential case. See Knowles and Minka (2011) and Wand (2014) for further details.
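A minimal sketch of such a fixed-point scheme, using the Gaussian-factor updates (σ²)_new = −1/(2 ∂S/∂σ²) and μ_new = μ + (σ²)_new ∂S/∂μ of Wand (2014), where S(μ, σ²) is the expected log of the nonconjugate factor: we fit q(θ) = N(μ, σ²) to the posterior of a Poisson log-rate θ with one observed count k and a flat prior. This is a hypothetical example of our own, not the BSAR ψ update.

```python
import math

def ncvmp_gaussian(k, mu=0.0, s2=1.0, iters=200):
    """Gaussian NCVMP fixed point for a toy nonconjugate factor: a single
    Poisson count k with log-rate theta and a flat prior, for which
    S(mu, s2) = E_q[k*theta - exp(theta)] = k*mu - exp(mu + s2/2).
    Hypothetical illustration only."""
    for _ in range(iters):
        m = math.exp(mu + s2 / 2.0)   # E_q[exp(theta)]
        s2 = 1.0 / m                  # -1/(2 dS/ds2), since dS/ds2 = -m/2
        mu = mu + s2 * (k - m)        # mu + s2 * dS/dmu
    return mu, s2
```

At the fixed point exp(μ + σ²/2) = k, giving σ² = 1/k and μ = log k − 1/(2k), a close Gaussian match to the skewed log-gamma posterior.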

Bayesian spectral analysis regression (BSAR)
As briefly discussed in Section 1, the Bayesian spectral analysis regression (BSAR) model (Lenk, 1999; Lenk and Choi, 2017) expresses the Gaussian process as the infinite series expansion (1.2) and uses the cosine basis functions on [0, 1] as the choice of orthonormal system for the unknown nonparametric function f. In the semiparametric regression model (1.1), the parametric term w_i'β includes an intercept β_0, which is confounded with θ_0, and the basis function φ_0(x) is therefore dropped from the representation of f. The infinite series in (1.2) is approximated by the finite sum

Z_J(x) = Σ_{j=0}^{J} θ_j φ_j(x),

where J denotes the truncation point. The mean integrated squared error between Z and Z_J decreases in J and can be made as small as desired because the sum of the prior variances, Σ_j τ_j², is assumed to be finite. Note that if the prior distribution of θ_j is inherited from Z through the spectral representation, then the choice of J does not considerably affect the accuracy of estimating f for sufficiently large J (Lenk (1999) and Lenk and Choi (2017)).
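The cosine basis and the truncated series are straightforward to evaluate; the sketch below (our own illustration, using the smoothing prior Var(θ_j) = τ² exp(−jγ) described in the text) evaluates φ_0(x) = 1, φ_j(x) = √2 cos(jπx) and draws one realization of Z_J:

```python
import numpy as np

def cosine_basis(x, J):
    """Cosine basis on [0,1]: phi_0(x) = 1, phi_j(x) = sqrt(2) cos(j*pi*x)."""
    x = np.asarray(x)
    Phi = np.ones((len(x), J + 1))
    for j in range(1, J + 1):
        Phi[:, j] = np.sqrt(2.0) * np.cos(j * np.pi * x)
    return Phi

def sample_Z_J(x, J, tau=1.0, gamma=0.5, rng=None):
    """Draw Z_J(x) = sum_{j=0}^J theta_j phi_j(x) with coefficients
    theta_j ~ N(0, tau^2 exp(-j*gamma)), the exponentially decaying
    smoothing prior of the text (illustrative function names)."""
    rng = rng or np.random.default_rng()
    sd = tau * np.exp(-gamma * np.arange(J + 1) / 2.0)  # sqrt of prior variance
    theta = rng.normal(0.0, sd)
    return cosine_basis(x, J) @ theta
```

Because the prior variances decay geometrically, the truncated tail Σ_{j>J} τ² e^{−jγ} is negligible for moderate J, which is why the choice of J has little effect for sufficiently large J.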
The prior probability that θ_j is in a neighborhood of zero increases with j and γ, and it decays to zero exponentially fast as indicated in (2.6). Further, we consider hyperpriors on τ and γ in a hierarchical fashion (2.6), which allows the optimal smoothness to be selected from the data and the structure of the model. The prior specification is completed with a conjugate, scale-invariant prior distribution for β, β ∼ N(μ⁰_β, σ²Σ⁰_β), and a conjugate inverse gamma prior for σ². All the remaining hyperparameters are assumed to be known.

A variational Bayes approximation for BSAR
In this subsection, we provide a variational Bayes approximation for the semiparametric regression model, BSAR, without any shape restriction on f. To be specific, a variational Bayes approximation algorithm, Algorithm 1, is given based on the model structure of (2.5). The joint posterior distribution of (β, θ_J, σ², τ², ψ) is approximated by a variational approximation with the product form

q(β, θ_J, σ², τ², ψ) = q1(β) q2(θ_J) q3(σ²) q4(τ²) q5(ψ),  |ψ| = γ.  (2.7)

In (2.7), we have introduced a new hyperparameter ψ in place of γ via the reparametrization |ψ| = γ, and the corresponding prior distribution of ψ is the double exponential distribution, ψ ∼ DE(0, w_0), with density p(ψ) = 0.5 w_0 exp(−w_0|ψ|), −∞ < ψ < ∞. Note that this reparametrization makes ψ identifiable only up to a sign change, but this is not a problem in practice, as the variational optimization locks on to one of the equivalent local modes. The reparametrization allows us to use a normal distribution for the variational approximation to the posterior distribution of ψ in the corresponding NCVMP variational updates. Although one could apply the NCVMP update directly to the parametrization in γ with, say, a gamma distribution, unacceptable restrictions on the variational parameters would then be needed for the existence of all the moments in the variational lower bound, so we avoid this approach here. We use mean field variational updates for all the factors except q5(ψ). That is, the mean field updates are based on the commonly used conjugate distributions for q1-q4: q1(β) is a normal distribution, parametrized as N(μ^q_β, Σ^q_β), q2(θ_J) is also a normal distribution, denoted N(μ^q_θ, Σ^q_θ), q3(σ²) is an inverse gamma distribution, denoted IG(r_{q,σ}/2, s_{q,σ}/2), and q4(τ²) is an inverse gamma, denoted IG(r_{q,τ}/2, s_{q,τ}/2). These mean field updates are given in the Appendix.
Here and in the Appendix we use the following notation. If f(x) ∝ g(x) for two functions f(·) and g(·), we write log f(x) ≐ log g(x) to indicate that log f(x) and log g(x) differ by an additive constant not depending on x. All the expectations in the Appendix denoted by E_{−k}, k = 1, 2, 3, 4, are with respect to the marginal variational density of all parameters except those in the kth block under consideration.
Since the update for ψ is non-standard, we give some details here. For updating q5(ψ) we use an NCVMP update and take q5(ψ) to be normal, N(μ^q_ψ, (σ^q_ψ)²). In the derivation of the NCVMP updates for ψ below, all the expectations denoted by E_5 are with respect to the full variational density. Applying the general procedure of the NCVMP algorithm (Knowles and Minka, 2011) to q5(ψ), we first compute the statistics S_k ≡ S_k(μ^q_ψ, (σ^q_ψ)²), k = 1, 2, whose expressions involve the quantities Q_j(·) given in the Appendix; μ^q_ψ and (σ^q_ψ)² are then updated by the NCVMP fixed-point equations. In addition to deriving updates for the variational factors, we also require an expression for the variational lower bound L(q) in (2.1), whose individual terms are given in the Appendix. The variational lower bound L(q) is used for two purposes: defining the stopping rule of the variational Bayes approximation algorithm, and model selection between two competing models. For the stopping rule, the variational Bayes algorithm, given as Algorithm 1 below, is terminated when the increase of the lower bound on the log-likelihood in (2.1) is negligible. Further, we use the lower bound as an approximation to the marginal likelihood for testing the adequacy of the semiparametric regression model in the empirical analysis presented in Section 4. Based on all the above development, we now provide the variational Bayes approximation algorithm for BSAR without restriction, Algorithm 1, which describes the updates of all the variational parameters ϑ in the approximate distribution of (β, θ_J, σ², τ², ψ).
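The overall structure of Algorithm 1, coordinate-ascent sweeps monitored by the lower bound until its relative increase falls below a tolerance, can be sketched generically as follows. This is a schematic driver of our own: `step` and `elbo` stand in for the model-specific updates and lower-bound evaluation, demonstrated here on a toy two-block quadratic objective rather than the BSAR factors.

```python
def run_vb(step, elbo, state, tol=1e-8, max_iter=500):
    """Generic coordinate-ascent driver with the lower-bound stopping rule
    described in the text: apply `step` (one full sweep of variational
    updates) until the relative increase of `elbo` is below `tol`.
    Schematic sketch; the callables are user-supplied."""
    bounds = [elbo(state)]
    for _ in range(max_iter):
        state = step(state)
        bounds.append(elbo(state))
        # MFVB guarantees a monotone non-decreasing bound (up to rounding)
        if bounds[-1] - bounds[-2] < tol * (1.0 + abs(bounds[-2])):
            break
    return state, bounds
```

Because each sweep cannot decrease L(q), a stalled lower bound is a reliable convergence signal, in contrast to MCMC, where no comparable monotone diagnostic exists.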
In implementing Algorithm 1, note that the update for Σ q θ often results in a numerically singular matrix because the shrinkage spectral coefficients θ J essentially degenerate at zero. Thus, for numerical stability, we set such coefficients exactly to zero in implementing the scheme numerically, which effectively corresponds to a change of the truncation point J.

Variational Bayes approximations for the shape-restricted models
In this section, we consider the shape-restricted regression models, that is, BSAR with shape restrictions of monotonicity and convexity/concavity, and develop appropriate variational approximation schemes for them. As discussed in Section 1, several methods have been proposed for Bayesian shape-restricted regression, all of which use MCMC (see, e.g., Shively, Sager and Walker (2009), Meyer, Hackstadt and Hoeting (2011), Curtis and Ghosh (2011), Lenk and Choi (2017), and the references therein), and no results have been reported in the context of variational approximation.
Here, we focus on the shape-restricted regression models of Lenk and Choi (2017), BSAR with shape restrictions, in which the derivatives of the regression functions are modelled as squares of Gaussian processes, based on their spectral representations. That is, the approach of Lenk and Choi (2017) enforces shape restrictions on the ℓth derivative of f via the square of the Gaussian process Z(x) in (1.2),

f^{(ℓ)}(x) = δ Z(x)²,  (3.1)

where δ ∈ {−1, 1} and ℓ are given by the user. For example, when ℓ = 1 and δ = 1, f is non-decreasing, and f is a non-decreasing and convex function when ℓ = 2 and δ = 1. The method of Lenk and Choi (2017) based on the characterization (3.1) was shown to be flexible for handling different kinds of shape restrictions and to have good performance compared to other Bayesian methods in the literature. We provide fast and efficient variational Bayes approximation methods for monotonic, monotonic convex or monotonic concave regression models based on the framework of (3.1).

A variational Bayes approximation for the monotone function
We first consider the shape-restricted model with monotone regression functions and develop its variational approximation algorithm. The derivative representation in (3.1) for the monotone function with ℓ = 1 is rewritten in terms of the regression function f(·) by integration,

f(x) = δ ∫_0^x Z(s)² ds + c,

where δ is 1 for a non-decreasing function and −1 for a non-increasing function, and the constant c is chosen to satisfy the mean-centering condition on f(·). Then, using the spectral representation of Z, f(x) is expanded in terms of basis functions ϕ^a_{j,k}(x) that are obtained in closed form from the cosine basis. Thus, the semiparametric model (1.1) with the monotone restriction is written in matrix notation as in (3.4). The parameters in the model are the same as in the unrestricted model, (β, θ_J, σ², τ², ψ). Thus, we adopt the same hierarchical prior specification (2.6) as for BSAR without shape restrictions in Section 2, except for the prior distributions of θ_j: in the prior on θ_j, j ≥ 0, σ rather than σ² appears in the variance to ensure a scale-invariant prior specification, as discussed in Lenk and Choi (2017), in contrast to BSAR without restrictions. Note that we do not impose the identifiability condition θ_0 ≥ 0 of Lenk and Choi (2017) in the variational Bayes approximation scheme; the optimization in the variational approximation in general locks on to one of the two equivalent modes obtained by switching the signs of all elements of θ_J.
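The squared-process construction for a monotone function can be checked numerically; the sketch below (our own quadrature-based illustration, standing in for the closed-form basis functions ϕ^a_{j,k}) integrates Z(s)² and mean-centers the result, so the output is non-decreasing for δ = 1 and non-increasing for δ = −1 by construction:

```python
import numpy as np

def monotone_f(theta, x_grid, delta=1):
    """Sketch of f(x) = delta * int_0^x Z(s)^2 ds, mean-centered over [0,1],
    with Z the truncated cosine series.  Numerical quadrature stands in for
    the closed-form basis functions phi^a_{j,k} of the paper."""
    theta = np.asarray(theta, dtype=float)
    J = len(theta) - 1
    s = np.linspace(0.0, 1.0, 4001)
    h = s[1] - s[0]
    Phi = np.ones((len(s), J + 1))
    for j in range(1, J + 1):
        Phi[:, j] = np.sqrt(2.0) * np.cos(j * np.pi * s)
    Z2 = (Phi @ theta) ** 2
    # cumulative trapezoid: F(s_k) = int_0^{s_k} Z(u)^2 du
    F = np.concatenate([[0.0], np.cumsum((Z2[1:] + Z2[:-1]) / 2.0) * h])
    F_bar = np.sum((F[1:] + F[:-1]) / 2.0) * h  # int_0^1 F(s) ds
    return delta * (np.interp(x_grid, s, F) - F_bar)
```

Since Z(s)² ≥ 0 pointwise, the monotonicity constraint holds for every coefficient vector θ, which is exactly why the squared-process parametrization avoids explicit inequality constraints on θ_J.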
In the variational Bayes approximation to the joint posterior distribution for the regression model with the monotone restriction in (3.4), we use mean field updates for β, σ² and τ² and an NCVMP update for ψ as before, but also an NCVMP update with a normal factor for θ_J, in contrast to the case of BSAR without shape restrictions, because of the non-conjugacy for θ_J arising from the squared Gaussian process characterization in (3.1) and the scale-invariant prior specification involving σ in (3.5).
The assumed form of the variational approximation in terms of the blocks (β, θ_J, τ², ψ) is similar to that for BSAR without restrictions: q1(β) for β is parametrized as N(μ^q_β, Σ^q_β), the factor q2(θ_J) for θ_J is parametrized as N(μ^q_θ, Σ^q_θ), the factor q4(τ²) for τ² is inverse gamma, IG(r_{q,τ}/2, s_{q,τ}/2), and the factor q5(ψ) for ψ is parametrized as N(μ^q_ψ, (σ^q_ψ)²). Note that the factor q3(σ²) for the mean field update is not an inverse gamma but takes a different form because of the prior specification in (3.5), as mentioned before, with details given below. The mean field updates for β and τ² are described in the Appendix. Again we describe the non-standard non-conjugate updates in some detail, and again we note that the expectations denoted by E_{−k} are with respect to the marginal variational density of all parameters except those in the kth block under consideration, while the expectations denoted by E_k are with respect to the full variational density. We provide the details of the updating procedures as follows:
• For θ_J, the mean field update does not take the form of a standard distribution under the monotone restriction, and we use a multivariate normal approximation with the parameters updated by the NCVMP algorithm (Knowles and Minka, 2011; Wand, 2014).
Following Wand (2014), the NCVMP fixed-point update for this normal factor is obtained using standard rules of matrix differential calculus; the resulting update for θ_J is given in Algorithm 2.
Details of the lower bound calculation are given in the Appendix. Based on all the above computations and notation, we provide the variational Bayes algorithm for BSAR with monotone restriction in Algorithm 2.
Note once again that in Algorithm 2, μ^{q*}_θ is μ^q_θ with the first component removed, Σ^{q*}_θ is Σ^q_θ with the first row and column removed, and the analogue of S_2(μ^q_ψ, (σ^q_ψ)²) is obtained by replacing r_{q,σ}/s_{q,σ} with E_{−2}(1/σ); how to compute this last expectation is discussed in the Appendix. As in Algorithm 1, the update for Σ^q_θ often results in a numerically singular matrix, and in the implementation we set the coefficients whose variational posterior degenerates at zero exactly to zero, which, as mentioned earlier, effectively corresponds to a change of the truncation point J.

Variational Bayes approximations for the convex/concave function
We next consider variational Bayes approximation methods for convex or concave regression functions. When a twice-differentiable function is assumed to be either monotonic convex or monotonic concave, the first and second derivatives of the function have the same sign. Then, from the shape-restricted representation (3.1), the second derivative of f is modeled as the square of a Gaussian process Z(x): when ℓ = 2, f is a non-decreasing convex function when δ = 1 and a non-increasing concave function when δ = −1, and f is represented as in (3.7). Taking the first and second derivatives of f(x) in (3.7) shows that δα ≥ 0 ensures monotonicity. Similar to the monotone restriction, the spectral representation of f(x) with monotone convexity or concavity in (3.7) becomes (3.8), with the resulting basis functions ϕ^b_{j,k} obtained in closed form from the cosine basis. In order to develop the variational Bayes approximation method, instead of using the spectral representation of f(x) in (3.8) directly, we replace α with δα² to obtain an equivalent representation of f(x) in which the monotonicity and the convexity of f(x) are controlled by the single parameter δ. Then, we have the model structure (3.10) for monotonic convex/concave regression functions, in which the variational Bayes approximation method is explored, where θ^α_J = (α, θ_0, θ_1, . . . , θ_J)' and Φ^{b,α}_J is the corresponding basis matrix with J + 2 columns. By reformulating the model (3.10) with α² instead of α, we can use a normal distribution for the variational approximation to the posterior distribution of α, based on a normal prior α ∼ N(0, σσ²_{0,α}), instead of the truncated normal prior considered in Lenk and Choi (2017). For the remaining parameters, we adopt the same priors as in the monotone case in Section 3.1. Thus, the unknown parameters in the model (3.10) are (β, θ^α_J, σ², τ², ψ). Similar to the monotone case, we use mean field updates for β, σ², and τ², and NCVMP updates with normal factors for ψ and θ^α_J.
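The ℓ = 2 construction can likewise be checked numerically; the sketch below (again our own quadrature-based illustration, not the closed-form ϕ^b_{j,k} basis) integrates δZ(s)² twice with f'(0) = δα², so that δ = 1 yields a non-decreasing convex function:

```python
import numpy as np

def convex_f(theta, alpha, x_grid, delta=1):
    """Sketch of the l = 2 shape-restricted construction: f''(x) = delta*Z(x)^2
    with f'(0) = delta*alpha^2, so delta = 1 gives a non-decreasing convex f
    and delta = -1 a non-increasing concave one.  Quadrature-based
    illustration of our own, not the paper's closed-form basis."""
    theta = np.asarray(theta, dtype=float)
    J = len(theta) - 1
    s = np.linspace(0.0, 1.0, 4001)
    h = s[1] - s[0]
    Phi = np.ones((len(s), J + 1))
    for j in range(1, J + 1):
        Phi[:, j] = np.sqrt(2.0) * np.cos(j * np.pi * s)
    f2 = delta * (Phi @ theta) ** 2                                   # f''
    f1 = delta * alpha ** 2 + np.concatenate(
        [[0.0], np.cumsum((f2[1:] + f2[:-1]) / 2.0) * h])             # f'
    f0 = np.concatenate([[0.0], np.cumsum((f1[1:] + f1[:-1]) / 2.0) * h])
    return np.interp(x_grid, s, f0)
```

Note that writing the leading slope as δα² makes δ the single sign parameter, which is what permits an unconstrained normal variational factor for α.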
That is, q1(β) for β is parametrized as N(μ^q_β, Σ^q_β), the factor q3(σ²) has the same form as in the monotone case, the factor q4(τ²) for τ² is inverse gamma, IG(r_{q,τ}/2, s_{q,τ}/2), and the factor q5(ψ) for ψ is parametrized as N(μ^q_ψ, (σ^q_ψ)²). The factor q2(θ^α_J) for θ^α_J is parametrized as N(μ^q_{α,θ}, Σ^q_{α,θ}). We use the same notation for all the expectations as in Section 3.1. In addition, note that by replacing Σ^q_θ with Σ^q_{α,θ}, μ^q_θ with μ^q_{α,θ}, ϕ^a_J(x_i) with ϕ^{b,α}_J(x_i), and Υ with Υ_α = (σ²_{0,α}, σ²_0, τ² exp(−γ), . . . , τ² exp(−Jγ))', the updates for β, θ^α_J, τ² and ψ take the same form as the variational updates derived in Section 3.1. It therefore remains to describe the updating procedure for σ² and the variational lower bound:
• For σ², the update is log q(σ²) ≐ E(log p(σ²) + log p(y|β, θ^α_J, σ²) + log p(θ^α_J|σ², τ², ψ) + log p(β|σ²)), where the required expectations are obtained from the monotone case under the replacements above; the resulting density takes the same non-standard form as in the monotone case, with shape constant a = (r_{0,σ} + n + p + J + 2)/2.
To derive the variational lower bound, the computations are the same as in the monotone case under the replacements described above, except for the terms E(log p(θ^α_J|σ², τ², ψ)) and E(log q(θ^α_J)), which must be modified accordingly. Hence, the variational Bayes algorithm for BSAR with the monotonic convex restriction is the same as Algorithm 2 with the replacements described above.

Curve fitting with VB approximations
In this section, we compare the performance of the proposed VB approximations of BSAR with some existing methods, based on simulation studies. Specifically, we first consider fitting univariate nonparametric regression models and compare the proposed VB approximation method of BSAR without restrictions, referred to as VBU (variational Bayes for the unrestricted model), with three existing variational approximation approaches: VB-SSGP of Tan et al. (2016), VB-Spline of Zhao and Lian (2014) and VB-Pspline of Waldmann and Kneib (2015).
We simulate 50 datasets with two sample sizes, n = 100 and 200, from the regression model y = f(x) + ε and use the root mean integrated squared error (RMISE) between the true function f and the posterior mean f̂ for performance evaluation. We consider N = 50 simulated datasets and, writing f̂_j(·) for the posterior mean obtained from dataset j, we define

RMISE_j(f̂_j, f) = { (1/n) Σ_{i=1}^n (f̂_j(x_ij) − f(x_ij))² }^{1/2},

where x_ij is the ith value of the covariate x in dataset j. To compare the different methods, we consider the RMISE_j(f̂_j, f) values averaged over the datasets j = 1, . . . , N. The values of x are equally spaced on [0, 1], and the same values are used for each dataset with the same sample size.
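In code, the RMISE of a single fitted curve reduces to a root mean squared difference over the design points (a sketch under that reading of the definition above; the function name is our own):

```python
import numpy as np

def rmise(f_hat_vals, f_true_vals):
    """RMISE of one fitted curve, approximating the integrated squared
    error by the average over the design points:
    sqrt( (1/n) * sum_i (f_hat(x_i) - f(x_i))^2 )."""
    f_hat_vals = np.asarray(f_hat_vals, dtype=float)
    f_true_vals = np.asarray(f_true_vals, dtype=float)
    return float(np.sqrt(np.mean((f_hat_vals - f_true_vals) ** 2)))
```

The reported criterion is then the average of these per-dataset values over the N = 50 replications.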
In the first simulation study, we consider four nonlinear regression models, f1-f4, of the form y = f(x) + ε, where ε ∼ N(0, 1). Figure 1 displays the simulated data and the true mean curves. The average RMISE values for the four methods, VBU, VB-SSGP, VB-Spline and VB-Pspline, are summarized in Table 1, with the standard errors (s.e.) and computing times (time) in seconds given in parentheses.

Fig 1. Simulated data (black circles) and the true mean curves (red solid lines) for f1-f4.
It appears that no method dominates, and the four methods have equivalent performance up to the standard errors. VBU has the best average RMISE for the two functions f1 and f4, while other methods are best for f2 and f3. Overall, the simulation results indicate that VBU is competitive with the other variational methods in terms of both RMISE and computing time. In terms of computational speed, VBU and VB-Pspline are the two best variational methods, having the shortest computing times in all cases.
In the second simulation study, we consider monotone regression models of the form y = f(x) + ε, where ε ∼ N(0, 1). Figure 2 displays the simulated data and the true mean curves. Since there are no existing VB approximation methods for shape-restricted regression models, we compare the performance of the proposed VB approximation method of BSAR with shape restrictions, referred to as VBM (variational Bayes for the monotone model), with the BSARM monotone regression model of Lenk and Choi (2017) and the Bayesian regression splines with monotone restrictions (BRSM) of Meyer, Hackstadt and Hoeting (2011), the latter two implemented using MCMC. Our numerical implementation of the VBM approach is written in R, and the entire setup, including the initial values and the tolerance value, is the same as for VBU. For the numerical implementation of the other methods, the R package bsamGP (Jo et al., 2017) is used for BSARM, and BRSM is implemented with the R code available from the authors' website as given in Meyer, Hackstadt and Hoeting (2011). Table 2 summarizes the RMISEs, the standard errors and the computing times in seconds. Overall, the simulation results indicate that VBM is slightly worse than the two MCMC methods for the shape-restricted regression models in terms of RMISE. We believe the worse performance of VBM compared to the MCMC methods is due to two factors. The first is the posterior independence assumption inherent to the VB approximation, which can cause both underestimation of variability and bias in point estimates of variance parameters and smoothing parameters. These drawbacks are not confined to this application but apply to VB approaches more generally (see, e.g., Wang and Titterington (2004) and Turner and Sahani (2011)).
The second, less important factor that may explain the poorer performance of the VBM approach is the adaptive truncation of the number of basis functions used to avoid numerical singularities, as described in the remark following Algorithm 2. As expected, in all cases VBM has an advantage over BSARM and BRSM in terms of computational speed. Next, consider the following three regression models with monotonicity and convexity constraints, where ε ∼ N (0, 1). Both the Expo and QuadCos models are increasing and convex on [0, 1], while the LogX model is increasing and concave on [0, 1], as also shown in Figure 2.
As summarized in Table 3, the average RMISE of VBMC is slightly larger than that of the MCMC method BSARMC for the monotonic and convex/concave functions. We believe the worse performance of VBMC compared to BSARMC arises for reasons similar to those discussed earlier for the case of monotone constraints. We do note, however, that the performance gap relative to MCMC seems to narrow for concave/convex constraints compared to monotone constraints. We believe this occurs because, with more stringent shape constraints, the fit becomes less sensitive to the smoothing parameters and to any bias in their estimation. A comparison of computation times for the algorithms is given in Table 4. For both monotone and convex/concave shape constraints, the VB algorithms are an order of magnitude faster, which justifies some loss of statistical performance in cases where computation time is an important consideration. Figure 3 shows boxplots of the ratios of the individual RMISE values between BSARMC and VBMC for sample sizes n = 50, 100, and 200. Ratios less than one indicate that BSARMC has better point estimation performance than VBMC in terms of RMISE. The RMISEs of the VBMC and BSARMC approaches appear similar for each of the three models. Table 4 presents the average computation times of VBMC and BSARMC for the above examples. As expected, the variational Bayes approach has a much lower average computation time than the MCMC approach, and the differences become increasingly pronounced as the sample size increases. For example, when n = 500 and J = 100, the time required for the variational approach to converge is a small fraction (less than 2%) of the time required to run an MCMC analysis.

Credible interval estimation and model selection with application to electricity demand data
In this example, we use the electricity demand data in Yatchew (2003) to compare the VB and MCMC algorithms based on BSAR. The data contain 288 quarterly observations of Ontario's electricity demand from 1971 to 1994. Following Yatchew (2003), Lenk and Choi (2017) use the log ratio of electricity demand to GDP as the dependent variable and the log price ratio of electricity to natural gas as a covariate in W. The choice of dependent variable is intentional, as Yatchew (2003) found that the demand for electricity is co-integrated with the gross domestic product. Similar to Lenk and Choi (2017), we use "Temperature", the number of heating and cooling degree days relative to 68°F, as the independent variable x. We consider all three models, namely unrestricted, monotone, and monotonic convex, in our application.
The hyperparameters for the priors in the variational Bayes approach are set up as follows. We use μ_β^0 = (0, 0)^T and Σ_β^0 = 100 I_2 as hyperparameters for the prior of β. All other prior hyperparameters are set exactly as discussed in the previous sections. As in the simulation studies, our choice of starting points is determined by trial and error. For the unrestricted fit, we use μ_ψ^q = 1. For the non-increasing shape-restricted fit, we use μ_ψ^q = 5 and μ_θ^q = (5, 5, ..., 5)^T as the starting point. Finally, for the non-increasing convex fit, we use μ_ψ^q = 1 and μ_θ^q = (0.5, 0.5, ..., 0.5)^T. Further, the initial truncation point J is set to 60.
In addition to curve fitting for point estimation, we consider credible interval estimation. One advantage of the variational procedure is that we can simulate independent samples directly from the variational posterior distribution, which facilitates computation. For example, to estimate a credible interval for δ θ_J^T φ_J^a(x) θ_J in the monotone case, we first simulate a sufficiently large number of draws of θ_J from N(μ_θ^q, Σ_θ^q) and then plug each draw into the function δ θ_J^T φ_J^a(x) θ_J. A credible interval for δ θ_J^T φ_J^a(x) θ_J can then be obtained from the corresponding sample quantiles of these plug-in values. The procedure is similar to the one used for constructing credible intervals from MCMC output, except that in the case of MCMC the approximate posterior samples are dependent. Figure 4 shows the estimated posterior mean of f for all three models against temperature, together with the 95% credible intervals, using both VB and MCMC based on BSAR. In particular, Figure 4 (a), (b) and (c) shows the fits of the unrestricted, non-increasing and non-increasing convex models, respectively. We observe that in all three models the variational Bayes fit follows the MCMC fit quite closely. We also observe that, other than in the unrestricted case, the width of the 95% credible interval is similar for VB and MCMC in both the monotone and monotonic convex cases. This result is not always expected when using a variational Bayes approximation, since such approximations are known to underestimate variability in some situations. Our findings show that the estimated posterior mean of τ from the VB procedures is much larger than that of BSAR, but this does not seem to result in any corresponding inaccuracy in the estimation of the mean functions or credible intervals. Computation times for the unrestricted, decreasing, and decreasing convex cases were 1.65, 26.21 and 25.94 seconds respectively for MCMC, and 0.02, 12.39 and 4.08 seconds respectively for VB.
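The plug-in procedure for credible bands can be sketched in a few lines. The following is a minimal illustration, not the paper's R implementation: the function name, the grid representation of the quadratic-form basis `phi`, and the default draw count are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def vb_credible_band(mu_q, Sigma_q, phi, delta=1.0, n_draws=5000, level=0.95):
    """Monte Carlo credible band for g(x) = delta * theta' phi(x) theta,
    using independent draws from the variational posterior N(mu_q, Sigma_q).
    phi: array of shape (n_grid, J, J), the quadratic-form matrix at each
    grid point x (hypothetical representation for this sketch)."""
    # independent draws from the variational posterior: (n_draws, J)
    theta = rng.multivariate_normal(mu_q, Sigma_q, size=n_draws)
    # evaluate the quadratic form at every grid point for every draw
    g = delta * np.einsum('dj,xjk,dk->dx', theta, phi, theta)  # (n_draws, n_grid)
    # pointwise sample quantiles give the credible band
    lo, hi = np.quantile(g, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi
```

Because the draws are independent, no thinning or convergence diagnostics are needed, in contrast to credible intervals built from MCMC output.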
Furthermore, we test the adequacy of the parametric model against the semiparametric model for fitting the electricity demand data by computing the marginal likelihoods of the competing models. In particular, we compare a parametric model without "Temperature" (H_0) to a semiparametric model with "Temperature" (H_1), where x denotes "Temperature" as mentioned before. As summarized in Table 5, the semiparametric models H_1 with "Temperature" have larger marginal likelihoods, log p(y), than the parametric model H_0, and they also have better in-sample fits, with smaller root mean squared error (RMSE) between the observed y and the estimated regression function, based on VB as well as MCMC procedures. Here, the marginal likelihood is computed using the Gelfand and Dey approximation (Gelfand and Dey, 1994) for the MCMC methods, in exactly the same way as described in Lenk (1999) and Lenk and Choi (2017) for BSAR and BSARM. Specifically, let ϑ_j be the set of unknown parameters involved in BSAR and BSARM for model H_j; let p_j(ϑ_j) be the prior density of ϑ_j, p_j(y | ϑ_j) be the likelihood function of y given ϑ_j under H_j, and h_j(ϑ_j) be an auxiliary distribution on the support of ϑ_j. Then the Gelfand and Dey approximation p̂_j(y) of the marginal likelihood under H_j is given by

p̂_j(y)^{-1} = (1/B) Σ_{u=1}^{B} h_j(ϑ_j^{(u)}) / { p_j(y | ϑ_j^{(u)}) p_j(ϑ_j^{(u)}) },

where ϑ_j^{(u)} is the uth value of ϑ_j generated from the MCMC algorithm, and B denotes the total number of posterior samples after the burn-in period. As the auxiliary distribution h_j, we take the same distributions as the priors for β, σ^2, θ, τ^2 and θ_0, while we use the truncated normal distribution for γ. For comparison with BSAR and BSARM, we evaluate the lower bound L(q) of the VB approximation for the marginal likelihood computation.
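The Gelfand and Dey estimator above can be computed stably on the log scale with a log-sum-exp trick. This is a generic sketch, not the paper's implementation: it assumes the log-likelihood, log-prior, and log-auxiliary density have already been evaluated at each retained MCMC draw.

```python
import numpy as np

def gelfand_dey_log_ml(log_lik, log_prior, log_aux):
    """Gelfand-Dey estimate of the log marginal likelihood from MCMC draws.
    Inputs are 1-D arrays of log p(y | theta_u), log p(theta_u) and
    log h(theta_u) at each retained posterior draw theta_u.
    Implements  1/p(y) ~= (1/B) sum_u h(theta_u) / (p(y|theta_u) p(theta_u))
    on the log scale via log-sum-exp for numerical stability."""
    log_terms = np.asarray(log_aux) - np.asarray(log_lik) - np.asarray(log_prior)
    B = len(log_terms)
    m = log_terms.max()  # shift by the max before exponentiating
    log_inv_ml = m + np.log(np.exp(log_terms - m).sum()) - np.log(B)
    return -log_inv_ml  # log p(y)
```

The choice of auxiliary distribution h matters in practice: it should have lighter tails than the posterior so that the ratio terms remain bounded.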
Further, as shown in Table 5, the marginal likelihoods based on VBM and BSARM are larger than those from VBU and BSAR. This indicates that, among the semiparametric models H_1, the shape-restricted model with monotonicity is favored over the unrestricted model in terms of the marginal likelihood for both the VB and MCMC procedures, and that the VB lower bound could be used for model selection in addition to point estimation and credible interval construction. However, the variational lower bound can have errors of very different magnitudes for the shape-restricted and unrestricted cases, which suggests that it should be used with caution for model choice in this setting.
Alternatively, we consider two information criteria in the context of the variational Bayes approach, namely VAIC (variational AIC) and VBIC (variational BIC), proposed by You, Ormerod and Müller (2014), in particular for the Bayesian linear model with certain diffuse priors. We speculate that VAIC and VBIC are applicable to our problems and that they would ameliorate the limitations associated with normalizing constants and the diffuse priors we employed, in addition to the aforementioned concerns about using the variational lower bound for model selection. Computing VAIC and VBIC additionally requires evaluating E_q(log σ); details are given in the Appendix. The results summarized in Table 5 indicate that VBM is still favored over VBU in terms of VAIC, in agreement with the VB lower bounds, whereas VBU is favored by VBIC, as with RMSE. Note that VBIC still relies directly on the lower bound and hence shares its problems for model choice purposes; VAIC, however, does not depend directly on the lower bound and hence may be more reliable. Although the two VB information criteria do not agree, it is evident that the semiparametric models H_1 provide adequate descriptions of the electricity demand data compared to the parametric model H_0, as also shown by the RMSE and L(q) values.

A large data set with stock price
The last empirical analysis illustrates the merit of the VB approach for dealing with a large data set, specifically a stock price data set from the London Stock Exchange in the United Kingdom. A similar data set was also analyzed in Luts, Broderick and Wand (2014). The data set is based on London Stock Exchange data during its opening hours, collected through the R package quantmod with the time interval set from January 1, 2005 to July 21, 2016. The source of the data is Google Finance, https://www.google.com/finance. The predictor (x) and response variable (y) consist of the stock prices of two financial institutions, Barclays PLC and The Hongkong and Shanghai Banking Corporation (HSBC), as summarized in Table 6. Although this data set is moderately large, it is also chosen to be small enough that MCMC implementations of shape-restricted regression are still feasible for comparison.
We consider six different approaches, VB and MCMC for each of three models (unrestricted, monotone, and monotonic concave), to analyze the data set. Figure 5 presents the estimated fits for VBU and VBM with 95% credible intervals for the stock price data, and Table 7 summarizes additional information about the fits, including RMSE and computing time in seconds. As summarized in Table 7, the unrestricted model (VBU/BSAR) has the largest marginal likelihood and the smallest RMSE among the three models, and in terms of RMSE the VB approaches provide competitive fits compared to MCMC. The computational demands of the VB approach are several orders of magnitude lower than those of the MCMC approaches.

Conclusion
In this paper, we presented a variational Bayes approach to a semiparametric regression model based on a spectral analysis of Gaussian process priors.
In particular, we developed fast variational Bayes methods for semiparametric regression models with monotone and convex/concave restrictions on the regression function by modeling its derivatives with squared Gaussian processes. The variational approximation schemes we developed were shown to fit the semiparametric regression models within the framework of Lenk and Choi (2017) and to produce credible intervals and marginal likelihoods useful for uncertainty quantification and model selection in real data applications.
There are several issues that could be considered in future work. In our experiments with the variational algorithm, we found that the convergence of the variational approach can be quite sensitive to the starting point; in particular, the algorithm may become stuck in local modes in certain models or exhibit slow convergence. In the current work, suitable starting points were determined by trial and error, and more systematic methods for this are needed. Further, the use of variational methods to obtain better MCMC proposals could be explored. There are also other shape restrictions considered in Lenk and Choi (2017), such as U-shaped and S-shaped restrictions, and it would be interesting to implement a variational Bayes approach for these models. Variational approaches in these and other semiparametric models, for example functional regression, quantile regression, spatial data analysis, and non-Gaussian data (see, e.g., Goldsmith, Wand and Crainiceanu (2011), Luts and Wand (2015) and Waldmann and Kneib (2015)), may be particularly challenging and important in dealing with high-dimensional problems in the context of Gaussian process priors and shape restrictions. Alternatively, stochastic gradient approaches to variational inference (Ji, Shen and West, 2010; Nott et al., 2012; Paisley, Blei and Jordan, 2012) could be considered in these settings. Further, we plan to adapt the proposed VB methods for shape restrictions to the mean field VB of Neville, Ormerod and Wand (2014) for sparse signal shrinkage and the linear response VB of Giordano, Broderick and Jordan (2015), in order to overcome the limitations of mean field variational Bayes in underestimating variability, incurring bias and ignoring posterior dependence (see, e.g., Wang and Titterington (2004), Turner and Sahani (2011) and Neville, Ormerod and Wand (2014)).
Moreover, the proposed VB approach to Fourier series with shape restrictions could be extended to multivariate predictors. Most simply, a multivariate nonparametric component with an assumed additive structure could be used, with shape constraints on the additive terms. However, the additive assumption is limiting, and the more general problem of handling multivariate shape constraints is complex, with a much smaller existing literature than for univariate shape constraints. Expanding the BSAR methods to handle multivariate shape constraints is not easy, since the number of basis terms needed grows exponentially with the dimension. Existing methods for handling multivariate shape constraints include methods using Gaussian processes (Riihimäki and Vehtari, 2010; Lin and Dunson, 2014), as well as methods using multivariate basis functions with shape restrictions such as multivariate splines (e.g., Cai and Dunson (2007)), tensor product bases (e.g., Hofner, Kneib and Hothorn (2016)) and radial basis functions (e.g., Chakraborty, Ghosh and Mallick (2012) and Zhang et al. (2014)). It is fair to say, however, that most of these methods do not scale well with either the dimension or the sample size. An exception is the work of Riihimäki and Vehtari (2010) on monotone Gaussian process regression and classification using virtual derivative observations; that approach can handle genuinely multivariate shape constraints, and it is implemented with a scalable approximate inference algorithm, expectation propagation.

Conjugate variational updates and lower bound for model without shape restriction
We provide details of the updates for q_1–q_4 as follows:
• For β, the mean field update q_1(β) takes the form log q_1(β).
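For orientation, in a conjugate Gaussian linear model the mean field update for β has the familiar closed form below. This is a standard-form sketch under the generic model y = Wβ + f + ε with prior β ∼ N(μ_β^0, Σ_β^0), not the paper's exact expression, which depends on the spectral representation of f:

```latex
q_1(\beta) \;\propto\; \exp\!\left\{ \mathbb{E}_{-\beta}\!\left[\log p(y,\beta,\sigma^2,\ldots)\right] \right\}
\;=\; N\!\left(\mu^q_\beta,\ \Sigma^q_\beta\right),
```

with

```latex
\Sigma^q_\beta = \left( \mathbb{E}_q[\sigma^{-2}]\, W^{\top}W + (\Sigma^0_\beta)^{-1} \right)^{-1},
\qquad
\mu^q_\beta = \Sigma^q_\beta \left( \mathbb{E}_q[\sigma^{-2}]\, W^{\top}\bigl(y - \mathbb{E}_q[f]\bigr) + (\Sigma^0_\beta)^{-1}\mu^0_\beta \right),
```

where the expectations are taken with respect to the variational factors for the remaining parameters.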