Big Data Bayesian Linear Regression and Variable Selection by Normal-Inverse-Gamma Summation

Abstract. We introduce the normal-inverse-gamma summation operator, which combines Bayesian regression results from different data sources and leads to a simple split-and-merge algorithm for big data regressions. The summation operator is also useful for computing the marginal likelihood and facilitates Bayesian model selection methods, including Bayesian LASSO, stochastic search variable selection, Markov chain Monte Carlo model composition, etc. Observations are scanned in one pass and then the sampler iteratively combines normal-inverse-gamma distributions without reloading the data. Simulation studies demonstrate that our algorithms can efficiently handle highly correlated big data. A real-world data set on employment and wage is also analyzed.


Introduction
Advances in computation and storage technologies have led to exponential growth of data volume, velocity and variety. Data scientists face the challenge of scaling up the existing Bayesian inference methods when the data are too large to be processed by a single machine. It is natural to split the big data and merge the results obtained from a plurality of data sources. Scott et al. (2016) propose the consensus Monte Carlo, in which data are partitioned among multiple machines and the data-distributed results are averaged by appropriate weights. Neiswanger et al. (2014) show that combination of subsample results obtained from parallel simulation is asymptotically consistent, and they also bound the rate of convergence. For linear regression models, Miroshnikov et al. (2015) resort to the summary statistics as the input to Markov chain Monte Carlo (MCMC) simulation. Ordonez et al. (2014) discuss the generalized sufficient statistics in stochastic search variable selection (SSVS) by George and McCulloch (1993). Ghosh and Reiter (2013) study the partitioned data for secure Bayesian model averaging.
Conjugate Bayesian linear regressions are discussed in almost every Bayesian textbook, but we study the normal-inverse-gamma (NIG) distributions from a new perspective, namely the NIG summation operator, which effectively combines the data-distributed results and facilitates the marginal likelihood computation. Our approach is applicable to Bayesian linear regressions and a variety of Bayesian hierarchical shrinkage and variable selection methods, including Bayesian ridge regressions, Bayesian least absolute shrinkage and selection operator (LASSO), SSVS, MCMC model composition (MC^3) by Madigan et al. (1995), Bayesian calibration to frequentist Akaike/Bayesian information criteria (AIC/BIC) (George and Foster, 2000), the pseudo-prior method by Carlin and Chib (1995) and a form of reversible jump MCMC (Green, 1995). We put them in a unified framework that involves two stages. First, out-of-memory data are processed by a split-and-merge NIG summation algorithm. Second, in-memory MCMC samplers (or analytic solvers) iteratively combine NIG distributions to learn the posteriors without reloading data.
In this paper, big data refer to the large number of observations (n), while the number of variables (k) is moderate in the sense that it remains feasible to store and manipulate k × k matrices in memory. The conventional wisdom is that variable selection techniques are mostly ideal for large-k applications. We demonstrate that variable selection methods are also useful when several hundred variables exhibit near-perfect multicollinearity. Our algorithms can efficiently identify the promising predictors for successful out-of-sample forecasts.
The remainder of the paper is organized as follows. Section 2 reviews Bayesian linear regressions and Section 3 introduces the NIG summation operator and a split-and-merge algorithm for big data regressions. Section 4 applies the NIG summation operator to Bayesian variable selection methods. Section 5 is devoted to simulation studies and computational complexity analysis. Section 6 analyzes a real-world wage data set. Section 7 concludes the paper and suggests promising directions of future research.

Bayesian Linear Regression
Consider the multiple linear regression model:

Y = Xβ + σε,  (1)

where Y is an n × 1 response vector, X is an n × k predictor matrix and ε is a vector of independent standard normal disturbances.
Textbook Bayesian results show that the posterior distributions are analytically tractable under the conjugate and non-informative priors. As we will frequently refer to those results, we state them as Propositions 1 and 2.

Proposition 1. Under the conjugate prior NIG(μ, Λ, a, b) and n observations X, Y, the posterior distribution is given by NIG(μ̄, Λ̄, ā, b̄), where

μ̄ = (Λ + X′X)⁻¹(Λμ + X′Y), Λ̄ = Λ + X′X, ā = a + n/2, b̄ = b + ½Y′Y + ½μ′Λμ − ½μ̄′Λ̄μ̄.

Proposition 2. Under the non-informative prior p(β, σ²) ∝ σ⁻² and n observations X, Y, the posterior distribution is given by NIG(μ̂, Λ̂, â, b̂), where

μ̂ = (X′X)⁻¹X′Y, Λ̂ = X′X, â = (n − k)/2, b̂ = ½(Y − Xμ̂)′(Y − Xμ̂).

If the prior hyperparameters are calibrated to (μ, Λ, a, b) = (0_k, 0_{k×k}, −k/2, 0), then the posterior distribution in Proposition 1 reduces to the result shown in Proposition 2. The non-informative prior p(β, σ²) ∝ σ⁻² is not a proper NIG distribution. As far as the posterior distribution is concerned, we treat the non-informative prior as a special case of the calibrated NIG distribution, hence our notation NIG(0_k, 0_{k×k}, −k/2, 0). Proposition 2 has an inverse problem: is it possible to recover observations if we are given NIG(μ̂, Λ̂, â, b̂)? Proposition 3 provides a pseudo-observation interpretation of the NIG distribution.
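The update rules in Propositions 1 and 2 amount to a few lines of linear algebra. A minimal sketch (function and argument names are ours, not from the paper):

```python
import numpy as np

def nig_posterior(mu, Lam, a, b, X, Y):
    """Proposition 1 update (sketch; naming is ours): prior NIG(mu, Lam, a, b)
    plus observations (X, Y) -> parameters of the posterior NIG."""
    Lam_post = Lam + X.T @ X
    mu_post = np.linalg.solve(Lam_post, Lam @ mu + X.T @ Y)
    a_post = a + len(Y) / 2
    b_post = b + 0.5 * (Y @ Y + mu @ Lam @ mu - mu_post @ Lam_post @ mu_post)
    return mu_post, Lam_post, a_post, b_post
```

With the calibrated values (0_k, 0_{k×k}, −k/2, 0) as input, the same function returns the non-informative posterior of Proposition 2, whose mean is the OLS estimator.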
Proposition 3. Given NIG(μ̂, Λ̂, â, b̂) with positive definite Λ̂ and n = 2â + k, let X₁ be a k × k matrix satisfying X₁′X₁ = Λ̂ (e.g., the Cholesky factor of Λ̂), Y₁ = X₁μ̂, X₂ = 0_{(n−k)×k}, and Y₂ be any (n − k) × 1 vector satisfying Y₂′Y₂ = 2b̂. Then, under the non-informative prior and the pseudo observations (X₁′, X₂′)′, (Y₁′, Y₂′)′, the posterior distribution is NIG(μ̂, Λ̂, â, b̂). Furthermore, if another data set X*, Y* yields the same posterior distribution, the data must satisfy X*′X* = Λ̂, X*′Y* = Λ̂μ̂ and Y*′Y* = 2b̂ + μ̂′Λ̂μ̂.

Proposition 3 illustrates the observations extracted from the NIG distribution: X₁, Y₁ are informative on both β and σ², while X₂, Y₂ are not informative about β, but the nonzero Y₂ can update the distribution of σ². The Gaussian likelihood function p(Y|β, σ²) can be expressed as a function of X′X, X′Y and Y′Y. Therefore, the pseudo observations are determined up to those sufficient statistics.

NIG Summation and Big Data Regression
Motivated by the pseudo-observation interpretation of the conjugate distributions, we define the NIG summation operator. The sum of two NIG distributions can be thought of as the posterior distribution induced by concatenating the observations extracted from the two NIG distributions.
Definition 2. Let NIG(μ₁, Λ₁, a₁, b₁) and NIG(μ₂, Λ₂, a₂, b₂) be two k-dimensional NIG distributions. If NIG(μ₃, Λ₃, a₃, b₃) satisfies

1. μ₃ = (Λ₁ + Λ₂)⁻¹(Λ₁μ₁ + Λ₂μ₂);
2. Λ₃ = Λ₁ + Λ₂;
3. a₃ = a₁ + a₂ + k/2;
4. b₃ = b₁ + b₂ + ½(μ₁ − μ₂)′(Λ₁⁻¹ + Λ₂⁻¹)⁻¹(μ₁ − μ₂);

then it is said to be the sum of the two NIG distributions, denoted by

NIG(μ₃, Λ₃, a₃, b₃) = NIG(μ₁, Λ₁, a₁, b₁) ⊕ NIG(μ₂, Λ₂, a₂, b₂).

Occasionally, it is inconvenient to invert Λ₁ or Λ₂. By completing the squares, we can rewrite Rule 4 in Definition 2 as

b₃ = b₁ + b₂ + ½μ₁′Λ₁μ₁ + ½μ₂′Λ₂μ₂ − ½μ₃′Λ₃μ₃.

Proposition 4. The NIG summation operator satisfies

1. Commutativity: NIG(μ₁, Λ₁, a₁, b₁) ⊕ NIG(μ₂, Λ₂, a₂, b₂) = NIG(μ₂, Λ₂, a₂, b₂) ⊕ NIG(μ₁, Λ₁, a₁, b₁);

2. Associativity: [NIG(μ₁, Λ₁, a₁, b₁) ⊕ NIG(μ₂, Λ₂, a₂, b₂)] ⊕ NIG(μ₃, Λ₃, a₃, b₃) = NIG(μ₁, Λ₁, a₁, b₁) ⊕ [NIG(μ₂, Λ₂, a₂, b₂) ⊕ NIG(μ₃, Λ₃, a₃, b₃)];

3. Identity element: NIG(μ, Λ, a, b) ⊕ NIG(0_k, 0_{k×k}, −k/2, 0) = NIG(μ, Λ, a, b).
Proposition 4 is derived from Definition 2. Note that NIG(0_k, 0_{k×k}, −k/2, 0) serves as the "zero" in summation. Two use cases of the NIG summation operator are given by Propositions 5 and 6. Proofs of all propositions are provided in Supplementary Material (Qian, 2017).

Proposition 5. Let NIG(μ̄, Λ̄, ā, b̄) be the posterior distribution under the conjugate prior NIG(μ, Λ, a, b) and the observations X, Y, and let NIG(μ̂, Λ̂, â, b̂) be the posterior distribution under the non-informative prior and the same observations. Then NIG(μ̄, Λ̄, ā, b̄) = NIG(μ, Λ, a, b) ⊕ NIG(μ̂, Λ̂, â, b̂).

Proposition 6. Let the posterior distribution, under the non-informative prior and full-sample observations, be NIG(μ̂, Λ̂, â, b̂). Split the observations X, Y into m subsets X_i, Y_i, i = 1, …, m. Let the subset posterior distribution, under the non-informative prior and the data X_i, Y_i, be NIG(μ̂_i, Λ̂_i, â_i, b̂_i). Then NIG(μ̂, Λ̂, â, b̂) = NIG(μ̂₁, Λ̂₁, â₁, b̂₁) ⊕ ⋯ ⊕ NIG(μ̂_m, Λ̂_m, â_m, b̂_m).

An immediate consequence of Propositions 5 and 6 is a split-and-merge algorithm for big data Bayesian linear regressions.
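The summation operator is straightforward to implement; a short sketch (our own naming; Rule 4 is applied in its completed-square form, so it also handles the singular precision of the identity element NIG(0_k, 0_{k×k}, −k/2, 0)):

```python
import numpy as np

def nig_sum(p1, p2):
    """Sum of two NIG distributions (sketch; the tuple layout
    (mu, Lam, a, b) is our convention, not from the paper)."""
    mu1, Lam1, a1, b1 = p1
    mu2, Lam2, a2, b2 = p2
    k = len(mu1)
    Lam3 = Lam1 + Lam2
    mu3 = np.linalg.solve(Lam3, Lam1 @ mu1 + Lam2 @ mu2)
    a3 = a1 + a2 + k / 2
    # completed-square form of Rule 4: no inversion of Lam1 or Lam2 needed
    b3 = (b1 + b2 + 0.5 * (mu1 - mu3) @ Lam1 @ (mu1 - mu3)
                  + 0.5 * (mu2 - mu3) @ Lam2 @ (mu2 - mu3))
    return mu3, Lam3, a3, b3
```

Adding NIG(0_k, 0_{k×k}, −k/2, 0) returns the other operand unchanged, and swapping the operands leaves the result unchanged, matching the identity-element and commutativity properties.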

Algorithm 1. Consider Bayesian linear regression under the conjugate prior NIG(μ, Λ, a, b) and the observations X, Y, in which the sample size is so large that the data cannot be stored and/or processed by a single machine. The posterior distribution NIG(μ̄, Λ̄, ā, b̄) shown in Proposition 1 can be obtained by the following split-and-merge algorithm.
Step 1 partition the big data into m subsets X i , Y i , i = 1, . . . , m.
Step 2 run subset Bayesian linear regressions under the non-informative prior and obtain the subset posterior distributions NIG(μ̂_i, Λ̂_i, â_i, b̂_i), i = 1, …, m.

Step 3 gather and sum up the subset posterior distributions: NIG(μ̂, Λ̂, â, b̂) = NIG(μ̂₁, Λ̂₁, â₁, b̂₁) ⊕ ⋯ ⊕ NIG(μ̂_m, Λ̂_m, â_m, b̂_m).

Step 4 sum up the prior and the combined subset posterior distributions: NIG(μ̄, Λ̄, ā, b̄) = NIG(μ, Λ, a, b) ⊕ NIG(μ̂, Λ̂, â, b̂).

Some remarks on the implementation of Algorithm 1. First, subset sizes are not necessarily the same. Second, there is no dependency between the m subset regressions, which can be processed in an embarrassingly parallel fashion. Third, the NIG summation operator is associative, so Step 3 can be processed recursively by pairs. Fourth, Step 3 tolerates machine failures: if some subset regression fails to add, μ̂ reweighs the remaining regressions and the precision Λ̂ decreases without breakdown. Fifth, commutativity and associativity justify online updating for flow data. For example, when new data X_{m+1}, Y_{m+1} come in, we can update the distribution NIG(μ̂, Λ̂, â, b̂) ⊕ NIG(μ̂_{m+1}, Λ̂_{m+1}, â_{m+1}, b̂_{m+1}).

An application of Algorithm 1 is the big data ridge regression, which can be interpreted as Bayesian linear regression under the conjugate prior NIG(0_k, Λ, a, b), where Λ = λI_{k×k} and λ is the regularization parameter. The posterior mean and mode of β are given by μ̄ = (Λ + X′X)⁻¹X′Y, which is the ridge point estimator, regardless of the hyperparameter values for a and b.
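The split-and-merge steps can be sketched end to end. In the toy run below (our own naming; six in-memory chunks stand in for out-of-memory files), the merged subset posteriors reproduce the full-sample posterior exactly, and adding a ridge-type prior completes Step 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 9000, 3, 6
beta_true = np.array([1.0, -2.0, 0.0])
X = rng.normal(size=(n, k))
Y = X @ beta_true + rng.normal(size=n)

def noninf_posterior(Xi, Yi):
    # Proposition 2: subset posterior under the non-informative prior
    Lam = Xi.T @ Xi
    mu = np.linalg.solve(Lam, Xi.T @ Yi)
    a = (len(Yi) - k) / 2
    b = 0.5 * (Yi - Xi @ mu) @ (Yi - Xi @ mu)
    return mu, Lam, a, b

def nig_sum(p1, p2):
    # NIG summation with Rule 4 in its completed-square form
    mu1, L1, a1, b1 = p1
    mu2, L2, a2, b2 = p2
    L3 = L1 + L2
    mu3 = np.linalg.solve(L3, L1 @ mu1 + L2 @ mu2)
    b3 = (b1 + b2 + 0.5 * (mu1 - mu3) @ L1 @ (mu1 - mu3)
                  + 0.5 * (mu2 - mu3) @ L2 @ (mu2 - mu3))
    return mu3, L3, a1 + a2 + k / 2, b3

# Steps 1-3: partition, run subset regressions, merge recursively
parts = [noninf_posterior(Xi, Yi)
         for Xi, Yi in zip(np.array_split(X, m), np.array_split(Y, m))]
merged = parts[0]
for p in parts[1:]:
    merged = nig_sum(merged, p)

# Step 4: add a ridge-type conjugate prior NIG(0, 2I, 1, 1)
prior = (np.zeros(k), 2.0 * np.eye(k), 1.0, 1.0)
post = nig_sum(prior, merged)
```

The posterior mean of `post` equals the ridge estimator (Λ + X′X)⁻¹X′Y, as noted above.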

Big Data Bayesian LASSO
The LASSO of Tibshirani (1996) is a popular shrinkage and variable selection technique. It has a Bayesian interpretation with double exponential priors imposed on the regression coefficients. Park and Casella (2008) propose an efficient Gibbs sampler for Bayesian LASSO. Consider the model (1) with the priors

p(β, σ²) ∝ (σ²)^{−(a+1)} e^{−b/σ²} ∏_{j=1}^k (λ_j/(2σ)) e^{−λ_j|β_j|/σ},  (2)

where λ₁, …, λ_k are regularization parameters. The posterior mode of β coincides with the classic LASSO estimator conditional on σ² = 1.² The double exponential distribution has a scale mixture representation: by augmenting latent variables ψ = (ψ₁, …, ψ_k)′ with the prior ψ_j ~ Exponential(λ_j²/2), we have β_j|σ², ψ_j ~ N(0, σ²ψ_j), so that conditional on ψ the model is a conjugate Bayesian linear regression. The posterior conditional distribution of ψ is also tractable: ψ_j⁻¹|β, σ², Y follows an inverse Gaussian distribution with mean λ_jσ/|β_j| and shape λ_j² (see Park and Casella, 2008, p. 682). Therefore, Bayesian LASSO can be handled by a Gibbs sampler that alternately samples from the inverse Gaussian distributions and the posterior NIG distributions. By Propositions 5 and 6, an efficient algorithm with one pass of the big data is summarized below.

Algorithm 2. Consider the model (1) and the priors (2) with the big data X, Y. We have the following Gibbs sampling algorithm:

Step 1-3 the same as those in Algorithm 1, which produce NIG(μ̂, Λ̂, â, b̂).

Step 4 iterate the sub-steps until (β, σ², ψ) draws converge to stationary distributions: 4.1 given the current draw of ψ, draw (β, σ²) from NIG(0_k, Ψ⁻¹, a, b) ⊕ NIG(μ̂, Λ̂, â, b̂), where Ψ = diag(ψ₁, …, ψ_k); 4.2 given the current draw of (β, σ²), draw each ψ_j⁻¹ from the inverse Gaussian distribution with mean λ_jσ/|β_j| and shape λ_j².
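Sub-step 4.2 requires only inverse Gaussian draws, which NumPy's `wald` sampler provides directly. A minimal sketch (naming is ours):

```python
import numpy as np

def draw_psi_inv(beta, sigma, lam, rng):
    """Park-Casella sub-step (sketch): draw psi_j^{-1} | beta, sigma^2, Y
    from an inverse Gaussian with mean lam_j * sigma / |beta_j| and shape
    lam_j^2.  NumPy's `wald(mean, scale)` is exactly this distribution."""
    return rng.wald(lam * sigma / np.abs(beta), lam**2)
```

A draw with small |β_j| yields a large ψ_j⁻¹, which tightens the conditional prior precision on β_j in the next sweep, producing the LASSO-type shrinkage.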
² The posterior mode matching the frequentist LASSO estimator without conditioning on σ² can be achieved by an alternative prior p(β, σ²) ∝ (σ²)^{−(a+k/2+1)} exp{−σ⁻²(b + ∑_{j=1}^k λ_j|β_j|)}. Park and Casella (2008) note that the posterior density is concave and unimodal under (2), hence a preferable form.

George and McCulloch (1993) develop a variable selection method in which the regression coefficients have a hierarchical mixture prior that involves small and large variance terms. The posterior mixture probabilities identify the promising subset of predictors.

Big Data SSVS
Let γ ≡ (γ₁, …, γ_k)′, where γ_j ∈ {0, 1} is the variable selection indicator for the j-th predictor. For most applications, it is appropriate to adopt an independent prior:

p(γ) = ∏_{j=1}^k w_j^{γ_j}(1 − w_j)^{1−γ_j}.  (3)

The hyperparameter w_j is the prior probability of variable inclusion. To exempt a variable (e.g., the intercept term) from exclusion, we can put w_j = 1. Following George and McCulloch (1997), we assume a conditionally conjugate prior for (β, σ²):

(β, σ²)|γ ~ NIG(0_k, Λ_γ, a, b), Λ_γ = [diag(γ₁D + (1 − γ₁)d, …, γ_kD + (1 − γ_k)d)]⁻¹,  (4)

where the hyperparameters satisfy D > d > 0. By marginalizing γ, we see that β_j|σ² follows a zero-mean Gaussian mixture distribution: one component has a large variance σ²D, and the other has a small variance σ²d. Furthermore, β|σ² has a 2^k-component mixture prior, which uses observations to assign larger posterior probabilities to the more promising components. The crux of SSVS is a Gibbs sampler that iteratively generates draws from p(β, σ²|γ, Y) and p(γ|β, σ², Y), which circumvents the overwhelming problem of calculating the posterior probabilities for all 2^k subsets of predictors.
We note that the hierarchical prior structure in SSVS is similar to that in Bayesian LASSO. Both are Gaussian mixture models that involve latent variables (i.e., the discrete γ in SSVS and the continuous ψ in LASSO), and both contain a conjugate Bayesian linear sub-structure conditional on the latent variables. Propositions 5 and 6 suggest a big data SSVS sampler analogous to Algorithm 2.

Algorithm 3. Consider the priors (3) and (4), the likelihood (1) and the big data X, Y. We have the following SSVS Gibbs sampling algorithm:

Step 1-3 the same as those in Algorithm 1, which produce NIG(μ̂, Λ̂, â, b̂).
Step 4 iterate the sub-steps until (β, σ², γ) draws converge to stationary distributions: 4.1 given the current draw of γ, draw (β, σ²) from NIG(0_k, Λ_γ, a, b) ⊕ NIG(μ̂, Λ̂, â, b̂); 4.2 given the current draw of (β, σ²), generate a draw for each γ_j from the Bernoulli distribution with the probability

P(γ_j = 1|β, σ²) = w_j φ(β_j; 0, σ²D) / [w_j φ(β_j; 0, σ²D) + (1 − w_j) φ(β_j; 0, σ²d)],

where φ(·; 0, v) denotes the normal density with mean zero and variance v.
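Sub-step 4.2 is a two-component density ratio; a short sketch (our naming):

```python
import numpy as np

def gamma_update_prob(beta_j, sigma2, w_j, D, d):
    """SSVS sub-step 4.2 (sketch): posterior probability that gamma_j = 1
    given beta_j and sigma^2, mixing the wide component N(0, sigma2*D)
    against the narrow component N(0, sigma2*d)."""
    def phi(x, v):  # zero-mean normal density with variance v
        return np.exp(-0.5 * x**2 / v) / np.sqrt(2 * np.pi * v)
    wide = w_j * phi(beta_j, sigma2 * D)
    narrow = (1 - w_j) * phi(beta_j, sigma2 * d)
    return wide / (wide + narrow)
```

Draws of β_j near zero fall in the high-density region of the narrow component, so they push P(γ_j = 1|·) toward zero and the variable toward exclusion.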

Big Data MC^3
SSVS does not strictly exclude variables from the regression. The prior variance d must be small but positive, or Algorithm 3 yields a reducible Markov chain that violates the MCMC convergence conditions. MC^3, proposed by Madigan et al. (1995) and Raftery et al. (1997), can handle variable selection in the presence of degenerate distributions, which are analytically integrated out of the posterior distributions. Similar marginalization techniques are discussed in Geweke (1996) and Smith and Kohn (1996). In this subsection, we adapt MC^3 for big data applications.
Consider (4) with d = 0. Given a particular γ, we have the partition β = (β_γ′, β_o′)′, where β_γ (and β_o) represents the coefficients of the predictors included in (and excluded from) the regression. Then (4) can be decomposed as

p(β, σ²|γ) = p(β_γ, σ²|γ) p(β_o|γ, β_γ, σ²),

where p(β_o|γ, β_γ, σ²) is the Dirac delta with a spike at zero, and

(β_γ, σ²)|γ ~ NIG(μ_γ, Λ_γγ, a, b).  (5)

By (4), the hyperparameter μ_γ is a vector of zeros and Λ_γγ equals D⁻¹ times an identity matrix. In practice, μ_γ, Λ_γγ can take a more general form (see below). Since β_o is essentially a vector of zeros, the model (1) is reduced to

Y = X_γβ_γ + σε,  (6)

where X_γ is a sub-matrix of X with columns selected by γ.
For variable selection, we are mostly interested in p(γ|Y), which is proportional to the product of the prior probability p(γ) and the marginal likelihood p(Y|γ). Note that (5) and (6) constitute a conjugate regression with a subset of variables. Employing the NIG summation operator, we evaluate the marginal likelihood by the Bayes formula:

p(Y|γ) = p(Y|β_γ, σ², γ) p(β_γ, σ²|γ) / p(β_γ, σ²|Y, γ),  (7)

where p(β_γ, σ²|γ) is the prior density of NIG(μ_γ, Λ_γγ, a, b), while p(β_γ, σ²|Y, γ) is the posterior density of NIG(μ̄_γ, Λ̄_γγ, ā_γ, b̄_γ) obtained by Proposition 5 using the data X_γ, Y. Note that (7) holds for all β_γ and σ² values. We pick β_γ = 0 and an arbitrary σ² so that the likelihood function p(Y|β_γ, σ², γ) remains a constant for all γ. The implication is that the term p(Y|β_γ, σ², γ) can be dropped if we just evaluate (7) up to a proportionality constant.
If the sample size were small, MC 3 would be straightforward once the marginal likelihood had been evaluated by (7). However, it is too costly to reload the big data X γ , Y in each MCMC iteration. The following dimension-reduction propositions resolve that problem.
Propositions 7 and 8 are useful for Bayesian linear regressions with a subset of predictors. First, by Proposition 2 we obtain NIG(μ̂, Λ̂, â, b̂). Second, by Propositions 7 and 8 we recover NIG(μ̂_γ, Λ̂_γγ, â_γ, b̂_γ), the posterior distribution under the non-informative prior and the data X_γ, Y. Third, by Proposition 5 we insert the prior information by NIG summation: NIG(μ̄_γ, Λ̄_γγ, ā_γ, b̄_γ) = NIG(μ_γ, Λ_γγ, a, b) ⊕ NIG(μ̂_γ, Λ̂_γγ, â_γ, b̂_γ). Fourth, we use (7) for evaluating p(Y|γ) and p(γ|Y) up to a proportionality constant.
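Since (7) holds at the fixed point β_γ = 0, σ² = 1, the marginal likelihood needs only two NIG density evaluations. A sketch (our naming), with the γ-independent likelihood constant dropped:

```python
import numpy as np
from math import lgamma

def log_nig_at_zero(mu, Lam, a, b):
    """log NIG(mu, Lam, a, b) density at beta = 0, sigma^2 = 1 (sketch;
    assumes a > 0, b > 0 and positive definite Lam)."""
    k = len(mu)
    _, logdet = np.linalg.slogdet(Lam)
    log_normal = 0.5 * logdet - 0.5 * k * np.log(2 * np.pi) - 0.5 * mu @ Lam @ mu
    log_invgamma = a * np.log(b) - lgamma(a) - b  # IG(a, b) density at 1
    return log_normal + log_invgamma

def log_marginal_likelihood(prior, posterior):
    # Bayes formula (7) evaluated at beta_gamma = 0, sigma^2 = 1, with the
    # gamma-independent likelihood term p(Y | 0, 1, gamma) dropped
    return log_nig_at_zero(*prior) - log_nig_at_zero(*posterior)
```

Adding back the dropped constant −(n/2)log 2π − ½Y′Y recovers the exact log marginal likelihood of the conjugate regression.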
It is seldom feasible to enumerate p(γ|Y) for all 2^k scenarios. We resort to MC^3 by a random-walk Metropolis-Hastings sampler. Following Raftery et al. (1997), we define a neighborhood nbd(γ) which consists of γ itself and the sets of models that include either one variable more or one variable fewer than γ. Loosely speaking, we randomly change one element of γ as the proposal draw. We summarize the big data MC^3 algorithm in Algorithm 4.

Algorithm 4. Consider the priors (3) and (5), the model (6) and the big data X, Y. We have the following MC^3 algorithm:

Step 1-3 the same as those in Algorithm 1, which produce NIG(μ̂, Λ̂, â, b̂).
Step 4 iterate the sub-steps until draws of γ converge to stationary distributions: 4.1 let the current draw be γ*; propose a new draw γ from nbd(γ*) by randomly changing one element of γ*; 4.2 reduce NIG(μ̂, Λ̂, â, b̂) to NIG(μ̂_γ, Λ̂_γγ, â_γ, b̂_γ) by Propositions 7 and 8; 4.3 insert the prior information by NIG summation, NIG(μ̄_γ, Λ̄_γγ, ā_γ, b̄_γ) = NIG(μ_γ, Λ_γγ, a, b) ⊕ NIG(μ̂_γ, Λ̂_γγ, â_γ, b̂_γ), and evaluate p(γ)p(Y|γ) by (7) up to a proportionality constant; 4.4 accept the proposed γ with probability min{1, p(γ)p(Y|γ)/[p(γ*)p(Y|γ*)]}; otherwise, retain γ*.

An interesting application of Algorithm 4 is Bayesian calibration to frequentist information criteria for model selection. The prior precision matrix Λ_γγ in (5) is not necessarily a diagonal matrix. Instead, we may adopt a g-prior (Zellner, 1986), under which Λ_γγ is proportional to X_γ′X_γ with proportionality constant c⁻¹. George and Foster (2000) show that model selection by information criteria such as AIC and BIC corresponds to selection of maximum posterior models by calibrating the prior hyperparameters. Specifically, if the hyperparameter c and the probability w are calibrated such that [(1 + c)/c][2 ln((1 − w)/w) + ln(1 + c)] = 2, then the highest posterior model coincides with the model that minimizes AIC. Furthermore, if [(1 + c)/c][2 ln((1 − w)/w) + ln(1 + c)] = ln n, then the highest posterior model minimizes BIC.³ Such correspondence implies that Algorithm 4 is also a big data algorithm for frequentist model selection by the information criteria.
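The calibration equation has no closed-form solution in c, but it is monotone in c and easy to solve numerically. A sketch (our naming; bounds are assumptions that cover practical n):

```python
import numpy as np

def calibrate_c(target, w=0.5, lo=1e-6, hi=1e12):
    """Solve (1+c)/c * (2*ln((1-w)/w) + ln(1+c)) = target for c > 0 by
    bisection on a log scale (sketch; target = 2 calibrates the g-prior
    to AIC and target = ln(n) to BIC, per George and Foster)."""
    def f(c):
        return (1 + c) / c * (2 * np.log((1 - w) / w) + np.log(1 + c)) - target
    for _ in range(200):
        mid = np.sqrt(lo * hi)  # geometric midpoint keeps c positive
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)
```

With w = 1/2, the AIC calibration reduces to (1 + c)/c · ln(1 + c) = 2; the BIC calibration grows with n, yielding a flatter (more diffuse) g-prior for larger samples.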

Other Bayesian Variable Selection Techniques
In addition to Bayesian LASSO, SSVS and MC^3, various Bayesian variable selection techniques are proposed in the literature, such as the indicator-in-regression method by Kuo and Mallick (1998), the pseudo-prior settings by Carlin and Chib (1995), reversible jump MCMC (RJMCMC) by Green (1995) and the composite space approach by Godsill (2001). We review those techniques and propose a general-purpose algorithm for big data applications.

Kuo and Mallick (1998) introduce an indicator-in-regression method. Consider (1) and decompose β such that β_j = β̃_jγ_j, j = 1, …, k. Given a particular γ, we have the selected β̃_γ and the unselected β̃_o depending on whether γ_j equals 1 or 0. The genuine prior is imposed on β̃_γ, while p(β̃_o|γ, β̃_γ, σ²) is a pseudo prior (Carlin and Chib, 1995) that has no impact on the likelihood, and its hyperparameters μ_o, V_o are tuning parameters for efficient MCMC operation. The posterior conditional distribution p(β̃_γ, σ²|Y, γ) is the density of NIG(μ̄_γ, Λ̄_γγ, ā_γ, b̄_γ) (refer to Step 4.3 in Algorithm 4), while p(β̃_o|Y, γ, β̃_γ, σ²) remains the pseudo prior density.

Godsill (2001) further explores the idea of pseudo priors and demonstrates that RJMCMC, a powerful trans-dimensional model selection approach by Green (1995), can be derived from a composite model with fixed dimension. Each γ defines a model with parameters θ_γ ≡ (β_γ, σ²). Conditional on γ, the likelihood depends on θ_γ, but not on the unused parameters denoted by θ_{−γ}. Let θ = (θ₀, …, θ_{2^k−1}).⁴ The priors are decomposed as

p(γ, θ) = p(γ) p(θ_γ|γ) p(θ_{−γ}|γ, θ_γ),

where p(θ_γ|γ) is the genuine prior given by (5), while p(θ_{−γ}|γ, θ_γ) is the pseudo prior. Carlin and Chib (1995) resort to the Gibbs sampler, in which the posterior conditional distributions are given by

p(θ|Y, γ) = p(θ_γ|Y, γ) p(θ_{−γ}|γ, θ_γ),

where p(θ_γ|Y, γ) is NIG(μ̄_γ, Λ̄_γγ, ā_γ, b̄_γ) and p(θ_{−γ}|Y, γ, θ_γ) remains the pseudo prior. The sampler is computationally intensive in that p(θ_{−γ}|γ, θ_γ) involves parameters in 2^k − 1 models. Godsill (2001) considers the Metropolis-Hastings sampler with a proposal from the current state (γ*, θ*_{γ*}) of the form q(γ, θ_γ|γ*, θ*_{γ*}) = q₁(γ|γ*) q₂(θ_γ|γ, θ*_{γ*}), where q₁, q₂ denote the model and parameter transition, respectively. A new state will be accepted with the probability min(1, α), where

α = [p(Y|θ_γ, γ) p(θ_γ|γ) p(γ) q₁(γ*|γ) q₂(θ*_{γ*}|γ*, θ_γ)] / [p(Y|θ*_{γ*}, γ*) p(θ*_{γ*}|γ*) p(γ*) q₁(γ|γ*) q₂(θ_γ|γ, θ*_{γ*})].

Since the pseudo priors are canceled in α, drawing θ_{−γ} is conceptual and not performed in practice. This is a form of RJMCMC, in which the "dimension matching" variables, denoted by u* and u, are chosen such that (θ_γ, u) = (u*, θ*_{γ*}) with the Jacobian term being one. Also, if the model transition q₁ is a random change of an element of the current model γ*, and the parameter transition q₂ is independent of the current parameters θ*_{γ*}, with θ_γ drawn from NIG(μ̄_γ, Λ̄_γγ, ā_γ, b̄_γ), then the sampler reduces to a version of MC^3, because p(Y|θ_γ, γ) p(θ_γ|γ)/q₂(θ_γ|γ) = p(Y|γ) is the marginal likelihood and α is determined by the Bayes factor of the two models.

³ The calibration results hold exactly under a known σ², which may be replaced by an estimate σ̂² such as the mean squared residuals. The mean and variance of IG(a, b) are b/(a − 1) and b²/[(a − 1)²(a − 2)], respectively. If the prior hyperparameters are calibrated such that b, a are large and their ratio approximately equals σ̂², the prior density is concentrated around σ̂² and the variance tends to zero. By Propositions 1 and 2, the posterior density for σ² is close to the prior density. Therefore, we obtain an approximation to the regression results with σ² fixed at σ̂².

⁴ For notational convenience, θ_γ is indexed by the decimal representation of the k × 1 binary vector γ. For example, if k = 4, the vector γ = (1, 0, 0, 1)′ corresponds to the decimal number 9, hence θ₉.
To implement any of those variable selection techniques, the MCMC sampler requires at most three data-related inputs: the posterior distribution p(θ_γ|Y, γ), the likelihood function p(Y|θ_γ, γ) and the marginal likelihood p(Y|γ). They are functions of X_γ, Y, which vary as the value of γ updates in MCMC iterations. Recall that Proposition 3 introduces pseudo observations. If they are used in MCMC simulations, then only k, instead of n, dense observations are cached in memory. Proposition 9 demonstrates that pseudo observations are a perfect substitute for the big data in Bayesian inference.
Proposition 9. Consider the model (1) with the big data X, Y. Let NIG(μ̂, Λ̂, â, b̂) be the posterior distribution obtained by Proposition 2 under the non-informative prior. Let X₁, Y₁, X₂, Y₂ be the pseudo observations extracted from NIG(μ̂, Λ̂, â, b̂) by Proposition 3. Let γ be the variable selection indicator under which the regression is reduced to (6). For an arbitrary prior on θ_γ ≡ (β_γ, σ²), the genuine and pseudo observations yield the same posterior, likelihood and marginal likelihood.

Though the sample size of the pseudo observations is n, only the first k observations X₁, Y₁ contain dense data, while the remaining observations are trivial: the predictors are zeros and the response values are all equal. Therefore, we only need to store X₁, Y₁ as well as a scalar element of Y₂ in the computer memory. Proposition 9 leads to a generic big data algorithm suitable for many Bayesian variable selection methods.
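The extraction can be sketched with a Cholesky factor; the following is our reconstruction of the Proposition 3 construction (naming is ours), which preserves the sufficient statistics X′X, X′Y and Y′Y exactly:

```python
import numpy as np

def pseudo_observations(mu, Lam, a, b):
    """Extract pseudo observations from NIG(mu, Lam, a, b) in the spirit of
    Proposition 3 (our reconstruction): X1'X1 = Lam, Y1 = X1 mu, X2 = 0 and
    Y2'Y2 = 2b, with the n - k entries of Y2 all equal."""
    k = len(mu)
    n = int(round(2 * a + k))        # a = (n - k)/2 under Proposition 2
    X1 = np.linalg.cholesky(Lam).T   # upper-triangular factor, X1'X1 = Lam
    Y1 = X1 @ mu
    X2 = np.zeros((n - k, k))
    Y2 = np.full(n - k, np.sqrt(2 * b / (n - k)))
    return np.vstack([X1, X2]), np.concatenate([Y1, Y2])
```

Because the Gaussian likelihood depends on the data only through X′X, X′Y and Y′Y, any downstream sampler fed these pseudo observations sees exactly the same likelihood as one fed the original big data.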
Algorithm 5. Consider Bayesian variable selection for Gaussian linear regressions with big data X, Y. We have the following general-purpose algorithm:

Step 1-3 the same as those in Algorithm 1. After Step 3, save NIG(μ̂, Λ̂, â, b̂) only.

Step 4 extract the pseudo observations X₁, Y₁, X₂, Y₂ from NIG(μ̂, Λ̂, â, b̂) by Proposition 3.
Step 5 treat the pseudo observations as if they were the genuine data, and apply variable selection methods (e.g., Bayesian LASSO, SSVS, MC^3, RJMCMC, etc.) designed for in-memory computation.
Algorithm 5 is not necessarily the most computationally efficient algorithm, but it is the simplest way to address the storage and computation burden induced by big data. The original data are scanned only once and then replaced by the pseudo observations. Existing variable selection techniques designed for small data inputs are applicable with minimal adaptation.

Synthetic Data Examples
In this section, we evaluate the performance of Algorithms 1-5 on synthetic data. We consider two regimes: 1) a correctly specified model with highly correlated predictors, and 2) a regression with non-Gaussian (skewed and leptokurtic) disturbances, which are resampled from regression residuals of the real-world data used in Section 6.

Variable Selection with Highly Correlated Predictors
The data generating process (DGP) is given by (1) with n = 10^8, k = 100, σ = 10, β = (1, 0.9, …, 0.1, 0, …, 0)′. That is, among the 100 predictors, only the first 10 have non-zero coefficients. Each row of X is randomly sampled from a zero-mean multivariate normal distribution with correlation 0.99 for all variable pairs. The DGP shows two characteristics of big data: volume (large n) and veracity (large σ and multicollinearity).
The double-precision regression data occupy 80GB of disk space, saved in 1000 text files; each file contains 100 thousand observations. As the full-sample data cannot be loaded into our computer memory, we resort to the split-and-merge method (Algorithm 1). We consider the following factors regarding data partition: 1) the subsample size should be small enough to allow in-memory computation of the subset regressions; 2) the subsample size should be large enough so that the rank of X_i equals k; 3) as we implement Algorithm 1 via MATLAB datastore and tall array objects, the subsample size should be large enough to take advantage of the matrix computing platform; 4) the computing cost increases slightly with more subsets (see Section 5.3), but when n is large, this hardly has any impact on computing speed; and 5) the data-file location matters for partition: in this example, MATLAB tall array works efficiently when it reads all observations in a text file. About 80MB of data are loaded into the computer memory for each subset regression.
Though we run Bayesian regressions, Step 3 of Algorithm 1 also produces the ordinary least squares (OLS) results: the posterior mean μ̂ equals the OLS estimator. As is shown in Figure 1, the OLS estimator is volatile due to multicollinearity.
The big data are scanned only once, and then we reuse NIG(μ̂, Λ̂, â, b̂) for Bayesian LASSO, SSVS and MC^3 by Algorithms 2-4, and for RJMCMC with pseudo observations by Algorithm 5. Figure 1 plots the posterior means of the regression coefficients E(β_j|Y), j = 1, …, k, which are Bayesian model averaging results. Table 1 reports the posterior inclusion probabilities P(γ_j = 1|Y), which indicate Bayesian model selection results, for SSVS, MC^3 and RJMCMC.
Variable selection is challenging because of the high collinearity between predictors, but the big data algorithms work well. SSVS strongly favors the first 8 predictors in that E(γ_j|Y) ≈ 1, and the estimated coefficients are close to the true values specified by the DGP. The true coefficients for the 9th and 10th predictors are 0.2 and 0.1, and SSVS tends to exclude them. Given that the OLS standard error is about 0.1, it is not surprising that SSVS rejects "insignificant" predictors. The estimated coefficients and probabilities for the remaining predictors are close to zero, which indicates that they are decisively excluded from the regression. RJMCMC produces similar results (the solid line and the dot-cross line largely overlap in Figure 1), though RJMCMC is slower. MC^3 calibrated to BIC produces acceptable results: the first 10 estimated coefficients are close to the OLS results, and MC^3 suppresses most spikes produced by the OLS estimator for the remaining coefficients. The Bayesian LASSO estimator is also reasonable: though the shrinkage estimator is slightly smaller compared to the other estimators, LASSO effectively excludes most predictors whose true coefficients are zero under the DGP.

Regression with Skewed and Leptokurtic Disturbances
Before we analyze the real-world wage data in Section 6, we generate some synthetic data. The predictors X are copied from the real data, and the noises ε are resampled from OLS residuals using the real data. The noises have the sample skewness 2.7 and kurtosis 33.9. The synthetic response variable is constructed such that Y = Xβ + ε, where β = (5, 4.9, . . . , 0.1, 0, . . . , 0) . That is, the first 50 of the 327 predictors have non-zero coefficients.
As is seen in Figure 2, the synthetic-data OLS estimator substantially departs from zero, and the 171st and 172nd coefficients are extremely large in magnitude (see Section 6 for an explanation; they correspond to Treating and Diagnosing). In contrast, Bayesian LASSO, SSVS and RJMCMC effectively shrink the estimates towards zero after the 50th predictor. Meanwhile, the estimates for the first 50 coefficients are close to the true values specified by the DGP.

Computational Complexity
We measure complexity by floating point operations (flops). By convention, a floating-point addition, subtraction, multiplication, or division is counted as a flop. Solving a k-dimensional linear equation is assumed to take (2/3)k³ flops.

Algorithm 1 has the complexity O(k²n) + O(k³m), where n, k, m denote the number of observations, variables and subsets, respectively. Specifically, for the full-sample regression without data partition, the required flops are about 2k²n + (2/3)k³. With data split into m subsets, flops increase by (4/3)k³m, which is mild compared to 2k²n when n is large. If Algorithms 2, 3, 4 and 5 receive NIG(μ̂, Λ̂, â, b̂) produced by Algorithm 1 as the input, then they only add O(k³r) flops for MCMC simulations, where r denotes the number of draws. With the NIG summation operator, the MCMC samplers will not become more computationally expensive as n increases. For big data applications, Bayesian LASSO and SSVS might run about as fast as the OLS regression, because O(k²n) can be much larger than O(k³r).
To be specific, in the first synthetic data example, the OLS regression without data partition costs 2 × 10^12 flops. Splitting data into 1000 subsets for Algorithm 1, we add 1.3 × 10^9 flops, which is less than 0.1% of the OLS computing costs. Given the outputs of Algorithm 1, we count the run-time flops for Bayesian variable selection MCMC simulations with 10^5 draws. Both Bayesian LASSO and SSVS take about 2.2 × 10^11 flops. MC^3 and RJMCMC cost 1.4 × 10^12 and 1.5 × 10^12 flops, respectively.
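The headline figures above follow directly from the flop formulas with k = 100, n = 10^8 and m = 1000:

```python
# Reproducing the quoted flop counts from the formulas in the text
k, n, m = 100, 10**8, 1000
full_ols = 2 * k**2 * n + 2 * k**3 // 3  # full-sample regression, ~2e12 flops
split_extra = 4 * k**3 * m // 3          # overhead of the m-way split, ~1.3e9
print(full_ols, split_extra)
```

The split overhead is roughly six orders of magnitude smaller than the data-scan cost, which is why the partition count m barely affects total running time.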

Application
Occupational employment and wages are of interest to both researchers and the public. Despite various publications on the highest paying jobs, many studies rely on survey data and self-reported incomes, which are prone to the mis-reporting problem that could induce bias or inflate the variance of the estimator. We study the Labor Condition Application (LCA) disclosure data released by the Office of Foreign Labor Certification within the Department of Labor. When a U.S. employer sponsors an H1B visa, the employer files an LCA, on behalf of the worker, that includes the job title, Standard Occupational Classification (SOC) code, rate of pay (actual wage offered for the position), and the prevailing wage for the same position in that area. Since an LCA is a legal instrument prepared by employers' attorneys, we believe the job and pay information is accurate. The publicly available data contain 3 million observations for the years 2011-2016.

H1B wages are typically for technical jobs that require advanced degrees. The median annual wage equals 72.4K (where K stands for thousand dollars) and the standard deviation (sd.) is 33.0K. We generate regressors by extracting the 200 highest frequency (HF) words from the job titles and SOC names, as well as 100 HF employer names. We also add the year dummies, the census division and region dummies (such as New England, South Atlantic, etc.), 10 most affluent metropolitan area dummies (such as California Bay Area, New York, etc.), the employment length and part-time position dummies. There are 327 predictors in total. Details on data cleaning and processing are provided in Supplementary Material. Table 2 illustrates the job HF words and the average wages of the subsamples that contain the key words. HF words cover occupations (e.g., Engineer, Analyst), industries (e.g., Software, Finance), experience (e.g., Senior, Assistant), positions (e.g., Manager, Lead), skills (e.g., Database, SAS), countries (e.g., USA, India), etc.
Some HF words are associated with high wages, say Manager (99.3K) and Senior (97.6K), while other words such as Researcher (68.5K) and Accountant (65.8K) correspond to lower wages. Table 3 shows the top employers and the mean wages they offer. The most generous employers are Facebook (137.6K) and Google (131.4K). The largest H1B sponsors are multinational consultancy services such as Infosys (78.9K) and Tata (68.7K). University of Michigan (67.4K) and Johns Hopkins University (65.8K) are also among the top 100 sponsors.

OLS Results
The OLS estimator is obtained as the posterior mean μ under the non-informative prior by running Steps 1-3 of Algorithm 1. Split and merge are performed by MapReduce-like (Dean and Ghemawat, 2008) MATLAB tall array operations.
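The split-and-merge idea can be illustrated with a small sketch. This is not the paper's Algorithm 1 (which merges full NIG posteriors, and runs via MATLAB tall arrays); it is a minimal Python analogue resting on the standard fact that the sufficient statistics X'X and X'y add across data shards, so each shard can be mapped independently and the normal equations solved once after the merge.

```python
import numpy as np

def chunk_stats(X, y):
    """Map step: sufficient statistics for one data shard."""
    return X.T @ X, X.T @ y

def merge_stats(stats):
    """Reduce step: sufficient statistics add across shards."""
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return XtX, Xty

# Simulated data standing in for the big data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
beta = np.arange(1.0, 6.0)
y = X @ beta + rng.normal(size=1000)

# Split into shards, map, merge, then solve the normal equations once.
shards = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]
XtX, Xty = merge_stats([chunk_stats(Xs, ys) for Xs, ys in shards])
beta_hat = np.linalg.solve(XtX, Xty)
```

Because only the merged statistics are kept in memory, the data are scanned in a single pass regardless of the number of shards.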
Predictors with the largest positive and negative coefficients are listed in Table 4. Physician (+88.4K) and Lawyer (+55.1K) are among the highest paying jobs, while words like Preschool (-20.3K) and Assistant (-17.4K) have a substantially negative impact on the rate of pay. We can also learn the joint effects of multiple regressors. For example, Resident Physician has the net effect -105.3K + 88.4K = -16.9K, which is reasonable.

The dummy variables are informative. As we include all 9 census region dummies, their coefficients represent the intercept terms (i.e., baseline salaries) specific to the regions. Workers in the Pacific (74.0K) have the highest income, followed by New England (70.5K), Mid Atlantic (67.9K), West South Central (67.0K), Mountain (66.1K), South Atlantic (66.0K), West North Central (65.6K), East North Central (64.7K) and East South Central (64.0K). In addition, metropolitan workers in New York (+10.6K) and the Bay Area (+10.0K) earn substantially more.

A problem with the OLS regression is that HF words have heterogeneous predictive power and accuracy. Some coefficients are close to zero (by posterior mean); some nonzero coefficients are inaccurate (by large posterior sd.); and many predictors are apparently correlated. There is also an anomaly in Table 4: Treating has a large posterior mean of 22.7K, but its sd. surges to 10.4K. Though the coefficient remains "significant" (in frequentist terminology), the unusually large sd. hints at the possibility of multicollinearity. 6 After visual inspection of the predictors, we found that Treating mostly comes from the SOC name "Health Diagnosing and Treating Practitioners". Both Treating and Diagnosing enter the regression, and the latter has the coefficient -14.1K with sd. 10.4K. Near-perfect multicollinearity is difficult to detect when predictors are machine generated, especially in big data applications (because sd. decreases with the sample size, rendering all coefficients "significant").
As is shown below, Bayesian shrinkage and variable selection can overcome multicollinearity.

Bayesian LASSO Results
We implemented Algorithm 2 with a sequence of LASSO regularization parameters λ = 500, 1000, 2000, 5000 for the HF words and employers. The 27 geographic and time dummy variables enter the regression by economic, rather than statistical, significance, and we intend not to shrink them. As the Gibbs sampler cannot proceed under λ = 0, we assign those dummy variables a small value, 0.1. Note that the big data are scanned only once despite the multiple rounds of LASSO regressions under different regularization parameters, because Steps 1-3 of Algorithm 2 make no reference to the λ values. 7

Table 5 demonstrates that the magnitude of shrinkage is negatively related to the "t-statistic". For example, FINANCE has a large "t-statistic" of 128.2. As λ increases, the LASSO estimators are stable: 14.5K, 15.2K, 15.7K and 14.7K, all of which are close to the OLS estimator 13.7K. In comparison, TREATING has a small "t-statistic" of 2.2. Even under mild regularization, its estimator shrinks substantially, from 22.7K (the OLS estimator) to 2.6K (LASSO with λ = 500). As λ increases to 1000, 2000 and 5000, its estimator quickly drops to 0.38K, 0.04K and 0.01K, respectively. Bayesian LASSO overcomes multicollinearity and effectively discards predictors with inflated variances.
Frequentist LASSO is a popular variable selection method, as L1-penalized least squares yield corner solutions, rendering some coefficients exactly zero. Bayesian LASSO can shrink the posterior mode of weak predictors to zero, but the mode does not reveal itself from MCMC draws. Nevertheless, variable selection can be achieved by visual inspection of the posterior means. Table 5 indicates that there is a dichotomy between the strong and weak predictors, especially when λ exceeds 1000. For example, if we exclude every variable whose posterior mean is less than 0.1 in magnitude, most predictors are either substantially larger or substantially smaller than the threshold. Under that criterion, Bayesian LASSO selects 282, 228, 175 and 105 variables when λ = 500, 1000, 2000, 5000, respectively.
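The thresholding rule above is straightforward to code. A minimal sketch, with hypothetical posterior means chosen to echo the strong/weak dichotomy (the actual values come from the Gibbs sampler draws):

```python
def lasso_select(post_means, threshold=0.1):
    """Keep predictors whose posterior mean exceeds the threshold in magnitude."""
    return [name for name, m in post_means.items() if abs(m) >= threshold]

# Hypothetical posterior means (in $K) for illustration only.
post = {"FINANCE": 15.7, "TREATING": 0.04, "LAWYER": 55.0, "DIAGNOSING": 0.02}
selected = lasso_select(post)   # the strong predictors survive
```

Because the weak predictors are shrunk well below the threshold while the strong ones stay far above it, the selection is insensitive to the exact cutoff.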
6 In this particular case, it appears that multicollinearity inflates the variances of the correlated pair without contaminating other predictors. We removed Treating and ran the regression again; the results are largely the same as those reported in Table 4.

7 One interpretation of the multiple λ values is an unknown regularization parameter with a uniform prior over the points in the selected grid.

Table 5 caption: ... as well as Treating and Diagnosing. In the second column, Coeff represents the posterior means under the non-informative prior (the OLS estimator), and "t-stat" in parentheses is the ratio of the posterior mean to the standard deviation. Columns 3-6 correspond to Bayesian LASSO with regularization parameters λ = 500, 1000, 2000, 5000. Columns 7 and 8 are the results for SSVS and MC3.

SSVS and MC3 Results
In our implementation of Algorithms 3 and 4 for SSVS and MC3, the prior probability of selecting each HF word is assumed to be 0.5, and that of the geographic and time dummies equals 1 (so that their posterior selection probability also equals 1). SSVS involves the large and small variances of the mixture distribution as tuning parameters. We experimented with several pairs, such as 10/0.001, 100/0.001 and 1000/0.001, and the results are similar. Again, the big data are scanned once regardless of the multiple pairs of tuning parameters. The Gibbs sampler generates draws for the model indicators, and we favor the most frequently visited model, which is likely to be a model with a high, if not the highest, posterior probability. SSVS selects 229 out of the 300 predictors on occupations and employers. As is seen from Table 5, our previously identified strong predictors, such as Finance, Professor, Resident and Lawyer, are also selected by SSVS, and their coefficients are close to those under the LASSO regression. Unlike LASSO, which shrinks both Treating and Diagnosing, SSVS selects the former and discards the latter, which offers an alternative solution to the multicollinear anomaly in our regression.

Table 5 also reports the regression results for the most frequently visited model under MC3 by Algorithm 4. We adopt the g-prior, with hyperparameters calibrated to the frequentist AIC criterion (see George and Foster, 2000), and 276 variables are selected. Coefficients of the strong predictors are close to those under SSVS.
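The role of the large/small variance pair can be illustrated with the standard spike-and-slab indicator update of George and McCulloch (1993). This is a hedged sketch, not the paper's Algorithm 3: it shows only the conditional inclusion probability of a single coefficient given its current draw, with the variance pair (0.001 and 10 here) playing the role of the tuning parameters discussed above.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def inclusion_prob(beta_j, v_small=0.001, v_large=10.0, prior=0.5):
    """Spike-and-slab mixture: conditional probability that the model
    indicator gamma_j = 1 (slab component) given the current beta_j."""
    slab = prior * normal_pdf(beta_j, v_large)
    spike = (1.0 - prior) * normal_pdf(beta_j, v_small)
    return slab / (slab + spike)

# A coefficient far from zero is almost surely attributed to the slab;
# one near zero is attributed to the spike and the predictor is dropped.
p_strong = inclusion_prob(2.0)     # close to 1
p_weak = inclusion_prob(0.001)     # close to 0
```

Setting the prior inclusion probability to 1 for the geographic and time dummies, as done above, forces their indicators to stay in the model at every Gibbs iteration.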

Forecast Evaluation
Lastly, we perform out-of-sample forecasts using the latest LCA data release, covering October 2016 to March 2017. We consider LCA cases in which the job title contains at least one of the 200 HF words and the employer belongs to one of the 100 HF employers, leaving about 112 thousand observations for forecast evaluation. The mean absolute deviation (MAD) of the forecast by the OLS regression is 15.1K, which is reasonable as the sample sd. amounts to 33.0K. The MADs of the LASSO regressions are 14.5K, 14.0K, 13.3K and 13.6K when λ = 500, 1000, 2000, 5000, respectively. The MADs of SSVS and MC3 are both near 15.1K. It appears that the Bayesian LASSO regression with λ = 2000 works best for the current application.
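For completeness, the evaluation metric is simply the average absolute forecast error; a one-line sketch with made-up wages (in $K):

```python
def mean_absolute_deviation(y_true, y_pred):
    """MAD: average absolute forecast error over the evaluation sample."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: absolute errors are 0, 1 and 2, so MAD = 1.0.
mad = mean_absolute_deviation([70.0, 80.0, 90.0], [70.0, 81.0, 88.0])
```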

Conclusion
The primary advantage of the NIG summation operator is the ability to merge subset posterior distributions when the data are split into manageable pieces. It is also useful for Bayesian variable selection methods in which the priors have mixture NIG representations, as the MCMC samplers can iteratively combine NIG distributions after a single pass over the big data. Computational complexity analysis demonstrates that Bayesian variable selection algorithms with NIG summations are computationally efficient, and some MCMC samplers may run almost as fast as the OLS regression when the sample size is large.
NIG summation can be extended to subtraction and scalar multiplication. Subtraction can be thought of as taking some observations out of the NIG distribution, and scalar multiplication rescales the precision of the regression data. Consider online statistical learning, in which data become available in sequential order. For example, to study the time-varying beta in the capital asset pricing model, it is common practice to employ rolling-window regressions with five or ten years of moving observations (see Ang and Chen, 2007). With NIG summation (for sequential addition of the newest observations) and subtraction (for retirement of the oldest observations), the rolling regression complexity can drop from O(k²n²) + O(k³n) to O(k³n). Another use case is recursive least squares with a forgetting factor δ ∈ (0, 1) (see Branch and Evans (2006) for a macroeconomic forecasting application), which discounts past observations at a geometric rate. NIG summation and scalar multiplication of the form ∑_{i=1}^{m} δ^{m−i} NIG(μ_i, Λ_i, a_i, b_i) lead to a weighted regression. In another direction, a collection of NIG distributions closed under summation and scalar multiplication may constitute a linear space with interesting theoretical properties. That is left for future research.
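The rolling-window point can be made concrete at the level of sufficient statistics. A minimal sketch, assuming the usual rank-one updates to X'X and X'y (the paper's operator additionally updates the full NIG parameters (μ, Λ, a, b); only the least-squares part is shown here):

```python
import numpy as np

def roll_stats(XtX, Xty, x_new, y_new, x_old, y_old):
    """One rolling-window step: add the newest observation's contribution
    to (X'X, X'y) and retire the oldest observation's contribution."""
    XtX = XtX + np.outer(x_new, x_new) - np.outer(x_old, x_old)
    Xty = Xty + x_new * y_new - x_old * y_old
    return XtX, Xty

# Simulated data; the window holds the first 50 observations.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = rng.normal(size=60)
XtX = X[:50].T @ X[:50]
Xty = X[:50].T @ y[:50]

# Slide forward one step: observation 50 enters, observation 0 leaves.
XtX, Xty = roll_stats(XtX, Xty, X[50], y[50], X[0], y[0])
beta_hat = np.linalg.solve(XtX, Xty)
```

Each slide costs O(k²) for the update plus O(k³) for the solve, independent of the window length, which is the source of the complexity reduction noted above.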

Supplementary Material
Supplementary Material for "Big Data Bayesian Linear Regression and Variable Selection by Normal-Inverse-Gamma Summation" (DOI: 10.1214/17-BA1083SUPP; .pdf). Proofs of Propositions 1-9 and data cleaning procedures for Section 6 (in a separate document).