A Nonparametric Bayesian Technique for High-Dimensional Regression

This paper proposes a nonparametric Bayesian framework called VariScan for simultaneous clustering, variable selection, and prediction in high-throughput regression settings. Poisson-Dirichlet processes are utilized to detect lower-dimensional latent clusters of covariates. An adaptive nonlinear prediction model is constructed for the response, achieving a balance between model parsimony and flexibility. Contrary to conventional belief, cluster detection is shown to be a posteriori consistent for a general class of models as the number of covariates and subjects grows. Simulation studies and data analyses demonstrate that VariScan often outperforms several well-known statistical methods.


Introduction
An increasing number of studies involve the regression analysis of p continuous covariates and continuous or discrete univariate responses on n subjects, with p being much larger than n. The development of effective clustering and sparse regression models for reliable predictions is especially challenging in these "small n, large p" problems. The goal of the analysis is often three-pronged: (i) Cluster identification: We wish to identify clusters of covariates with similar patterns for the subjects. For example, in biomedical studies where the covariates are gene expression levels, subsets of genes associated with distinctive between-subject patterns may correspond to different underlying biological processes; (ii) Detection of sparse regression predictors: From the set of p covariates, we wish to select a sparse subset of reliable predictors for the subject-specific responses and infer the nature of their relationship with the responses. In most genomic applications, just a few of the biological processes are usually relevant to a response variable of interest, and we need reliable and parsimonious regression models; and (iii) Response prediction: Using the inferred regression relationship, we wish to predict the responses of ñ additional subjects for whom only covariate information is available. The reliability of an inference procedure is measured by its prediction accuracy for out-of-sample individuals.

Subharup Guha is Associate Professor, Department of Statistics, University of Missouri; Veerabhadran Baladandayuthapani is Associate Professor, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, and Adjunct Associate Professor, Department of Biostatistics, Rice University (email addresses: GuhaSu@missouri.edu, Veera@mdanderson.org). This work was supported by the National Science Foundation under awards DMS-0906734 and DMS-1461948 to SG and DMS-1463233 to VB, and by the National Institutes of Health under grant R01 CA160736 to VB. The authors thank Ganiraju Manyam for help with biological interpretations of the results and Chiyu Gu for key ideas about the Markov chain Monte Carlo procedure.
In high-throughput regression settings with continuous covariates and continuous or discrete outcomes, this paper proposes a nonparametric Bayesian framework called VariScan for simultaneous clustering, variable selection, and prediction.
1.1. Motivating applications. Our methods and computational endeavors are motivated by recent high-throughput investigations in biomedical research, especially in cancer. Advances in array-based technologies allow simultaneous measurements of biological units (e.g. genes) on a relatively small number of subjects. Practitioners wish to select important genes involved with the disease processes and to develop efficient prediction models for patient-specific clinical outcomes such as continuous survival times or categorical disease subtypes. The analytical challenges posed by such data include not only high-dimensionality but also the existence of considerable gene-gene correlations induced by biological interactions. In this article, we analyze gene expression profiles assessed using microarrays in patients with diffuse large B-cell lymphoma (DLBCL) [Rosenwald et al., 2002] and breast cancer [van't Veer et al., 2002]. Both datasets are publicly available and have the following general characteristics: for individuals i = 1, . . . , n, the data consist of mRNA expression levels x i1 , . . . , x ip for p genes, where p >> n, along with censored survival times denoted by w i . More details, analysis results, and gains using our methods over competing approaches are discussed in Section 6.
The scope and success of the proposed methodology and its associated theoretical results extend far beyond the examples we discuss in this paper. For instance, the technique is not restricted to biomedical studies; we have successfully applied VariScan in a variety of other high-dimensional applications and achieved high inferential gains relative to existing methodologies.
1.2. Challenges in high-dimensional predictor detection. Despite the large number of existing methods related to clustering, variable selection and prediction, researchers continue to develop new methods to meet the challenges posed by newer applications and larger datasets. Predictor detection becomes particularly problematic in big datasets due to the pervasive collinearity of the covariates.
For a simple demonstration of this fact, consider a process that independently samples n-variate covariate column vectors x 1 . . . , x p , so that p = 200 vectors with n = 10 i.i.d. elements are generated from a common normal distribution. Although the vectors are independently generated, extreme values of the pairwise correlations are observed in the sample, as shown in the histogram of Figure 1. The proportion of extremely high or low correlations typically increases with p, and with greater correlation of the generated vectors under the true process.
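This experiment is easily reproduced. The sketch below (NumPy; the seed and the 0.7 threshold are arbitrary illustrative choices, not values from the paper) generates p = 200 columns of n = 10 i.i.d. standard normal elements and summarizes the pairwise sample correlations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 200
X = rng.standard_normal((n, p))   # p independently generated n-variate columns

R = np.corrcoef(X, rowvar=False)  # p x p sample correlation matrix
iu = np.triu_indices(p, k=1)      # the p*(p-1)/2 distinct pairs
pair_cors = R[iu]

print(f"{pair_cors.size} pairwise correlations")
print(f"most extreme: {np.abs(pair_cors).max():.3f}")
print(f"fraction with |r| > 0.7: {(np.abs(pair_cors) > 0.7).mean():.4f}")
```

Even though the columns are independent by construction, with only n = 10 observations per pair some of the 19,900 sample correlations are invariably extreme.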
Multicollinearity is common in high-dimensional problems because the n-dimensional space of the covariate columns becomes saturated with the large number of covariates. This is disadvantageous for regression because a cohort of highly correlated covariates is weakly identifiable as regression predictors. For example, imagine that the j th and k th covariate columns have a sample correlation close to 1, but that neither covariate is really a predictor in a linear regression model. An alternative model in which both covariates are predictors with equal and opposite regression coefficients has a nearly identical joint likelihood for all regression outcomes. Consequently, an inference procedure is often unable to choose between these competing models as the likely explanation for the data.
In the absence of strong application-specific priors to guide model selection, collinearity makes it impossible to pick the true set of predictors in high-dimensional problems. Furthermore, collinearity causes unstable inferences and erroneous test case predictions [Weisberg, 1985]. The problem is exacerbated if some of the regression outcomes are unobserved, as with categorical responses and survival applications.
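The weak identifiability described above can be illustrated numerically (a minimal sketch; the sample size, coefficient beta = 5, and noise scale are arbitrary, hypothetical values): two nearly collinear columns entering with equal-and-opposite coefficients change the Gaussian log-likelihood of unrelated outcomes only negligibly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x_j = rng.standard_normal(n)
x_k = x_j + 0.01 * rng.standard_normal(n)    # sample correlation of x_j, x_k near 1
y = rng.standard_normal(n)                   # neither covariate is a true predictor

def gauss_loglik(resid, sigma2=1.0):
    """Gaussian log-likelihood of a residual vector with known variance."""
    return -0.5 * np.sum(resid ** 2) / sigma2 - 0.5 * resid.size * np.log(2 * np.pi * sigma2)

beta = 5.0
ll_null = gauss_loglik(y)                        # model with no predictors
ll_alt = gauss_loglik(y - beta * (x_j - x_k))    # equal-and-opposite coefficients
print(f"log-likelihood difference: {abs(ll_null - ll_alt):.3f}")
```

The two models are essentially indistinguishable from the data, even though one has two "predictors" with large coefficients and the other has none.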
1.3. Bidirectional clustering with adaptively nonlinear functional regression and prediction. Since the data in small n, large p regression problems are informative only about the combined effect of a cohort of highly correlated covariates, we address the issue of collinearity using clustering approaches. Specifically, VariScan utilizes the sparsity-inducing property of Poisson-Dirichlet processes (PDPs) to first group the p columns of the covariate matrix into q latent clusters, where q << p, with each cluster consisting of columns with similar patterns across the subjects. The data are allowed to direct the choice between a class of PDPs and their special case, a Dirichlet process, for selecting a suitable allocation scheme for the covariates. These partitions could provide meaningful insight into unknown biological processes (e.g. signaling pathways) represented by the latent clusters.
To flexibly capture the within-cluster pattern of the covariates, the n subjects are allowed to group differently in each cluster via a nested Dirichlet process. This feature is motivated by genomic studies [e.g., Jiang et al., 2004] which have demonstrated that subjects or biological samples often group differently under different biological processes. In essence, this modeling framework specifies a random, bidirectional (covariate, subject) nested clustering of the high-dimensional covariate matrix.
Clustering downsizes the small n, large p problem to a "small n, small q" problem, facilitating an effective stochastic search of the indices S* ⊂ {1, . . . , q} of potential cluster predictors. If necessary, we could then infer the indices S ⊂ {1, . . . , p} of the covariate predictors. This feature differentiates the VariScan procedure from black-box nonlinear prediction methods. In addition, the technique is capable of detecting functional relationships through elements such as nonlinear functional kernels and basis functions such as splines or wavelets. An adaptive mixture of linear and nonlinear elements in the regression relationship aims to achieve a balance between model parsimony and flexibility. These aspects of VariScan define a joint model for the responses and covariates, resulting in an effective model-based clustering and variable selection procedure, improved posterior inferences, and accurate test case predictions. Figure 2 illustrates the key ideas of VariScan using a toy example with n = 10 subjects and p = 25 covariates. The plot in the upper left panel represents a heatmap of the covariates. When investigators are interested in discovering a sparse prediction model for additional subjects, the posterior analysis averages over all possible realizations of two basic steps, both of which are stochastic and may be stylistically described as follows: (1) Clustering: The column vectors are allocated in an unsupervised manner to q = 11 PDP clusters. This is plotted in the upper right panel, where the columns are grouped via bidirectional clustering to reveal the similarities in the within-cluster patterns.
(2) Variable selection and regression: One covariate is stochastically selected from each cluster and is known as the cluster representative. The middle right panel displays the set of representatives, {x 7 , x 4 , x 11 , x 5 , x 24 , x 17 , x 9 , x 12 , x 3 , x 15 , x 14 }, for the 11 clusters. The regression predictors are stochastically selected from the random set of cluster representatives. Some representatives are not associated with the response; the remaining covariates are outcome predictors and may have either a linear or nonlinear regression relationship. The linear predictors {x 24 , x 12 , x 3 } and nonlinear predictors {x 11 , x 9 } are shown in the middle left panel. For a nonlinear function h, the regression equation for a subject is displayed in the lower panel for a zero-mean Gaussian error term. The subscripts of the β parameters are the cluster labels; e.g., covariate x 24 represents the fifth PDP cluster.
When out-of-sample prediction is not of primary interest, alternative variable selection strategies discussed in Section 2.2 may be applied.

Existing Bayesian approaches and limitations.
There is a vast literature on Bayesian strategies for one or more of the three inferential goals mentioned at the beginning of Section 1. A majority of Bayesian model-based clustering techniques rely on the celebrated Dirichlet process; see Müller and Mitra [2013, chap. 4] for a comprehensive review. A seminal paper by Lijoi, Mena, and Prünster [2007b] advocated the use of Gibbs-type priors [Gnedin and Pitman, 2005, Lijoi, Mena, and Prünster, 2007a] for accommodating more flexible clustering mechanisms than those induced by the Dirichlet process. This work also demonstrated the practical utility of PDPs in genomic applications. Among model-based clustering techniques based on Dirichlet processes, the approaches of Medvedovic and Sivaganesan [2002], Dahl [2006], and Müller et al. [2011] assume that it is possible to globally reshuffle the rows and columns of the covariate matrix to reveal the clustering pattern. More closely related to our clustering objectives is the nonparametric Bayesian local clustering (NoB-LoC) approach of Lee et al. [2013], which clusters the covariates locally using two sets of Dirichlet processes.

Figure 2. Stylized example illustrating the basic methodology for reliable prediction for n = 10 subjects and p = 25 covariates allocated to q = 11 PDP clusters. The column labels represent the covariate indices. The row labels are the subjects. See the text for further explanation.

Although some similarities exist between NoB-LoC and the clustering aspect
of VariScan, there are major differences. Specifically, the VariScan framework can accommodate high-dimensional regression in addition to bidirectional clustering. Furthermore, VariScan typically produces more efficient inferences by its greater flexibility in modeling a larger class of clustering patterns via PDPs. The distinction becomes especially important for genomic datasets where PDP-based models are often preferred to Dirichlet-based models by log-Bayes factors on the order of thousands; see Section 6 for an example. Moreover, the Markov chain Monte Carlo (MCMC) implementation of VariScan explores the posterior substantially faster due to its better ability to allocate outlying covariates to singleton clusters via augmented variable Gibbs sampling. From a theoretical perspective, contrary to widely held beliefs about the non-identifiability of mixture model clusters, we discover the remarkable theoretical property of VariScan that, as both n and p grow, a fixed set of covariates that (do not) co-cluster under the true VariScan model, also (do not) asymptotically co-cluster under its posterior. From a regression-based Bayesian viewpoint, perhaps the most ubiquitous approaches are based on Bayesian variable selection techniques in linear and nonlinear regression models. See Denison et al. [2002] for a comprehensive review. For Gaussian responses, the common linear methods include stochastic search variable selection [George and McCulloch, 1993], selection-based priors [Kuo and Mallick, 1997] and shrinkage-based methods [Park and Casella, 2008, Xu et al., 2015, Griffin et al., 2010]. Some of these methods have been extended to non-linear regression contexts [Smith and Kohn, 1996] and to generalized linear models [Dey et al., 2000, Meyer and Laud, 2002]. However, most of the aforementioned regression methods are based on strong parametric assumptions and do not explicitly account for the multicollinearity commonly observed in high-dimensional settings.
Nonparametric approaches typically assume priors on the error residuals [Hanson and Johnson, 2002, Kundu and Dunson, 2014] or on the regression coefficients using random effect representations [Bush and MacEachern, 1996, MacLehose and Dunson, 2010]. Nonparametric mean function estimations are typically based on basis function expansions such as wavelets [Morris and Carroll, 2006] and splines [Baladandayuthapani et al., 2005]. We take a fundamentally different approach in this article by defining a nonparametric joint model, first on the covariates and then via an adaptive nonlinear prediction model on the responses.
The rest of the paper is organized as follows. We introduce the VariScan model in Section 2. Some theoretical results for the VariScan procedure are presented in Section 4. Through simulations in Sections 5.1 and 5.2, we demonstrate the accuracy of the clustering mechanism and compare the prediction reliability of VariScan with that of several established variable selection procedures using artificial datasets. In Section 6, we analyze the motivating gene expression microarray datasets for lymphoma and breast cancer to demonstrate the effectiveness of VariScan and compare its prediction accuracy with those of competing methods. Additional supplementary materials contain the theorem proofs, as well as additional simulation and data analyses results.

VariScan Model
In this section, we lay out the detailed construction of the VariScan model components, which involves two major steps. First, we utilize the sparsity-inducing property of Poisson-Dirichlet processes to perform a bidirectional nested clustering of the covariate matrix (Section 2.1), and second, we describe the choice of the cluster-specific predictors and nonlinearly relate them to Gaussian regression outcomes of the subjects (Section 2.2). Subsequently, Section 2.3 provides details of the model justifications and generalizations to discrete and survival outcomes.
2.1. Covariate clustering model. First, each of the p covariate matrix columns, x 1 , . . . , x p , is assigned to one of q latent clusters, where q << p, and where the assignments and q are unknown. That is, for j = 1, . . . , p and k = 1, . . . , q, an allocation variable c j equals k if the j th covariate is assigned to the k th cluster.
We posit that the q clusters are associated with latent vectors v 1 , . . . , v q of length n. The covariate columns assigned to a latent cluster are essentially contaminated versions of that cluster's latent vector, which induces high correlations among covariates belonging to a cluster. In practice, however, a few individuals within each cluster may have highly variable covariates. We model this aspect by associating a larger error variance with those individuals. This is achieved via a Bernoulli variable, z ik , for which the value z ik = 0 indicates a high variance:

x_ij | c_j = k, z_ik, v_ik ~ N(v_ik, τ^2) if z_ik = 1, and N(v_ik, τ_1^2) if z_ik = 0,

where τ_1^2 and τ^2 are variance parameters with inverse gamma priors and τ_1^2 is much greater than τ^2. It is assumed that the support of τ is bounded below by a small, positive constant, τ*. Although not necessary from a methodological perspective, this restriction guarantees the asymptotic result of Section 4. The indicator variables for the individual-cluster combinations are a priori modeled as i.i.d.:

z_ik | ξ ~ Bernoulli(ξ), where ξ ~ beta(ι_1, ι_0).

The condition ι_1 >> ι_0 guarantees that the prior probability P(z_ik = 1) is nearly equal to 1, and so only a small proportion of the individuals have highly variable covariates within each cluster.
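This covariate model is straightforward to simulate. The sketch below uses arbitrary illustrative values for n, p, q, τ, τ_1, and ξ (none are the paper's settings) and checks that covariates allocated to a common cluster are, as intended, highly correlated contaminated copies of that cluster's latent vector:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 20, 100, 4
tau, tau1 = 0.1, 2.0               # tau1**2 >> tau**2; z_ik = 0 marks high variance
xi = 0.95                          # P(z_ik = 1) near 1, as implied by iota_1 >> iota_0

V = rng.standard_normal((n, q))    # latent vectors v_1, ..., v_q (columns)
c = rng.integers(0, q, size=p)     # covariate-to-cluster allocations c_j
Z = rng.random((n, q)) < xi        # individual-cluster indicators z_ik
sd = np.where(Z, tau, tau1)        # per-(i, k) error standard deviation

# each covariate column is a contaminated copy of its cluster's latent vector
X = V[:, c] + sd[:, c] * rng.standard_normal((n, p))

same = np.corrcoef(X[:, c == 0], rowvar=False)   # columns sharing one cluster
print(f"mean within-cluster correlation: {same[np.triu_indices_from(same, 1)].mean():.2f}")
```

The occasional high-variance individuals (z_ik = 0) lower the within-cluster correlations somewhat without destroying the cluster structure.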
Allocation variables. As an appropriate model for the covariate-to-cluster allocations that accommodates a wide range of allocation patterns, we rely on the partitions induced by the two-parameter Poisson-Dirichlet process, PDP(α 1 , d), with discount parameter 0 ≤ d < 1 and precision or mass parameter α 1 > 0. In genomic applications, for example, these partitions may allow the discovery of unknown biological processes represented by the latent clusters. We defer additional details of the empirical and theoretical justifications of using PDP processes until Section 2.3.
The PDP was introduced by Perman et al. [1992] and later investigated by Pitman [1995] and Pitman and Yor [1997]. Refer to Lijoi and Prünster [2010] for a detailed discussion of different classes of Bayesian nonparametric models, including Gibbs-type priors [Gnedin and Pitman, 2005, Lijoi, Mena, and Prünster, 2007a] such as Dirichlet processes and PDPs. Lijoi, Mena, and Prünster [2007b] were the first to implement Gibbs-type priors for more flexible clustering mechanisms than Dirichlet process partitions.
The PDP-based allocation variables are a priori exchangeable and evolve as follows. Since the cluster allocation labels are arbitrary, we may assume without loss of generality that c 1 = 1, i.e., the first covariate is assigned to the first cluster. Subsequently, for covariates j = 2, . . . , p, suppose there exist q^(j−1) distinct clusters among c 1 , . . . , c j−1 , with the k th cluster containing n_k^(j−1) covariates. The predictive probability that the j th covariate is assigned to the k th cluster is then

P(c_j = k | c_1, . . . , c_{j−1}) = (n_k^(j−1) − d) / (α_1 + j − 1), for k = 1, . . . , q^(j−1),
P(c_j = q^(j−1) + 1 | c_1, . . . , c_{j−1}) = (α_1 + d · q^(j−1)) / (α_1 + j − 1),

where the event c_j = q^(j−1) + 1 in the second line corresponds to the j th covariate opening a new cluster. When d = 0, we obtain the well-known Pólya urn scheme for Dirichlet processes [Ferguson, 1973]. In general, exchangeability holds for all product partition models [Barry and Hartigan, 1993, Quintana and Iglesias, 2003] and species sampling models [Ishwaran and James, 2003], of which PDPs are a special case. The number of clusters, q, stochastically increases as α 1 and d increase. For d fixed, the p covariates are each assigned to p singleton clusters in the limit as α 1 → ∞.
A PDP achieves dimension reduction in the number of covariates because q, the random number of clusters, is asymptotically equivalent to

T_{d,α_1} · p^d    (1)

for a random variable T_{d,α_1} > 0 as p → ∞ when d > 0, whereas q grows at the logarithmic rate α_1 · log p when d = 0. This implies that the number of Dirichlet process clusters (i.e., when d = 0) is asymptotically of a smaller order than the number of PDP clusters when d > 0. This property was effectively utilized by Lijoi, Mena, and Prünster [2007b] in species prediction problems and applied to gene discovery settings. The use of Dirichlet processes to achieve dimension reduction has precedence in the literature; see Medvedovic et al. [2004], Kim et al. [2006], and Dunson and Park [2008].
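The urn scheme and its effect on the number of clusters can be simulated directly (a sketch; the values of p, α_1, and d below are arbitrary illustrative choices):

```python
import numpy as np

def pdp_partition(p, alpha, d, rng):
    """Allocate p covariates to clusters via the PDP(alpha, d) urn scheme."""
    counts = []                               # current cluster sizes n_k
    for j in range(p):
        # existing cluster k has probability (n_k - d) / (alpha + j);
        # a new cluster has probability (alpha + d * q) / (alpha + j)
        probs = np.append(np.array(counts, dtype=float) - d, alpha + d * len(counts))
        k = rng.choice(probs.size, p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)                  # open a new cluster
        else:
            counts[k] += 1
    return counts

rng = np.random.default_rng(4)
p, alpha = 5000, 5.0
q_dp = len(pdp_partition(p, alpha, 0.0, rng))    # Dirichlet process: q ~ alpha * log p
q_pdp = len(pdp_partition(p, alpha, 0.5, rng))   # d = 0.5: q ~ T * p**0.5, much larger
print(f"p = {p}: q = {q_dp} (Dirichlet process) vs q = {q_pdp} (PDP, d = 0.5)")
```

With the same mass parameter, the PDP with d = 0.5 opens far more clusters than the Dirichlet process, mirroring the asymptotic rates quoted above.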
The PDP discount parameter d is given the mixture prior 1/2 δ 0 + 1/2 U (0, 1), where δ 0 denotes the point mass at 0. Posterior inferences of this parameter allow us to flexibly choose between Dirichlet processes and more general PDPs for the best-fitting clustering mechanism.
Latent vectors. The hierarchical prior for the covariates is completed by specifying a base distribution G (n) in R n for the latent vectors v 1 , . . . , v q . Consistent with our goal of developing a flexible and scalable inference procedure capable of fitting large datasets, we impose additional lower-dimensional structure on the n-variate base distribution. Specifically, since the n subjects are exchangeable, base distribution G (n) is assumed to be the n-fold product measure of a univariate distribution, G. This allows the individuals and clusters to communicate through the nq latent vector elements: v_ik | G ~ G independently, for i = 1, . . . , n and k = 1, . . . , q.
The unknown, univariate distribution, G, is itself given a nonparametric Dirichlet process prior, allowing the latent vectors to flexibly capture the within-covariate patterns of the subjects: G ~ DP(α_2 · N(µ_2, τ_2^2)) for mass parameter α 2 > 0 and univariate base distribution, N (µ 2 , τ 2 2 ). Being a realization of a Dirichlet process, distribution G is discrete and allows the subjects to group differently in different PDP clusters. The number of distinct values among the v ik 's is asymptotically equivalent to α 2 · log(nq), facilitating further dimension reduction and scalability of inference as n approaches hundreds or thousands of individuals, as commonly encountered in genomic datasets.
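The dimension reduction afforded by this nested Dirichlet process is easy to check by simulation. The sketch below draws the nq latent vector elements from a random G via the Pólya urn; the standard normal base distribution and the particular values of α_2, n, and q are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def dp_draws(m, alpha, rng):
    """m draws from a random G ~ DP(alpha * N(0, 1)) via the Polya urn."""
    vals = []
    for i in range(m):
        if rng.random() < alpha / (alpha + i):
            vals.append(rng.standard_normal())   # fresh atom from the base distribution
        else:
            vals.append(vals[rng.integers(i)])   # tie: copy one of the earlier draws
    return vals

rng = np.random.default_rng(5)
alpha2, n, q = 3.0, 100, 160
v = dp_draws(n * q, alpha2, rng)
distinct = len(set(v))
print(f"{distinct} distinct values among {n * q} latent vector elements; "
      f"alpha2 * log(nq) = {alpha2 * np.log(n * q):.1f}")
```

Despite there being nq = 16,000 latent vector elements, only a few dozen distinct values appear, in line with the α_2 · log(nq) rate.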
2.2. Prediction and regression model. Now, suppose there are n k covariates allocated to the k th cluster. We posit that each cluster elects from among its member covariates a representative, denoted by u k . A subset of the q cluster representatives, rather than the covariates, feature in an additive regression model that can accommodate nonlinear functional relationships. The cluster representatives may be chosen in several different ways depending on the application. Some possible options include: (i) Select with a priori equal probability one of the n k covariates belonging to the k th cluster as the representative. Let s k denote the index of the covariate chosen as the representative, so that c s k = k and u k = x s k . (ii) Set latent vector v k of Section 2.1 as the cluster representative.
Option (i) is preferable when practitioners are mainly interested in identifying the effects of individual regressors, as in gene selection applications in cancer survival times (as noted in the introduction). Option (ii) is preferable when the emphasis is less on covariate selection and more on identifying clusters of candidate variables (e.g., genomic pathways) that are jointly associated with the responses.
The regression predictors are selected from among the q cluster representatives, with their parent clusters constituting the set of cluster predictors, S * ⊂ {1, . . . , q}. Extensions of spike-and-slab approaches [George and McCulloch, 1993, Kuo and Mallick, 1997, Brown et al., 1998] are applied to relate the regression outcomes to the cluster representatives as:

y_i = η_i + ε_i, with ε_i ~ N(0, σ_i^2) and η_i = β_0 + Σ_{k: linear} u_ik β_k + Σ_{k: nonlinear} h(u_ik),    (4)

for i = 1, . . . , n, and h is a nonlinear function. Possible options for the nonlinear function h in equation (4) include reproducing kernel Hilbert spaces, nonlinear basis smoothing splines [Eubank, 1999], and wavelets. Alternatively, due to their interpretability as a linear model, order-r splines with m knots [de Boor, 1978, Hastie and Tibshirani, 1990, Denison et al., 1998] are especially attractive and computationally tractable.
The linear predictor η i in expression (4) implicitly relies on a vector of cluster-specific indicator variables, γ_k = (γ_k^(0), γ_k^(1), γ_k^(2)), whose elements add to 1 for each cluster k. If γ_k^(0) = 1, neither the cluster representative nor any of the covariates belonging to cluster k is associated with the responses. If γ_k^(1) = 1, the cluster representative appears as a simple linear regressor in equation (4); γ_k^(2) = 1 implies a nonlinear regressor. The numbers of linear predictors, nonlinear predictors, and non-predictors are, respectively,

q_1 = Σ_{k=1}^q γ_k^(1),  q_2 = Σ_{k=1}^q γ_k^(2),  and  q_0 = Σ_{k=1}^q γ_k^(0) = q − q_1 − q_2.

For a simple illustration of this concept, consider again the toy example of Figure 2, where one covariate is nominated from each cluster as the representative. Of the q = 11 cluster representatives, q 1 = 3 are linear predictors, q 2 = 2 are non-linear predictors, and the remaining q 0 = 6 representatives are non-predictors.
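The indicator bookkeeping and the resulting design matrix can be sketched as follows, mirroring the toy example's counts (q_0, q_1, q_2) = (6, 3, 2). The particular γ vector and covariate values are hypothetical, and linear splines (order r = 1) with m = 1 knot at the sample median are used, as in the applications of this paper:

```python
import numpy as np

rng = np.random.default_rng(6)
n, q = 10, 11
U = rng.standard_normal((n, q))    # hypothetical representatives u_1, ..., u_q (columns)

# gamma[k] in {0, 1, 2}: non-predictor, linear predictor, or nonlinear predictor
gamma = np.array([0, 1, 2, 0, 1, 0, 0, 2, 0, 1, 0])
q0, q1, q2 = [(gamma == g).sum() for g in (0, 1, 2)]

cols = [np.ones(n)]                                  # intercept column
for k in np.where(gamma == 1)[0]:
    cols.append(U[:, k])                             # linear regressor: 1 column
for k in np.where(gamma == 2)[0]:
    knot = np.median(U[:, k])                        # m = 1 knot at the sample median
    cols.append(U[:, k])                             # nonlinear regressor contributes
    cols.append(np.maximum(U[:, k] - knot, 0.0))     # m + r = 2 columns (truncated line)
U_gamma = np.column_stack(cols)
print(U_gamma.shape)   # (n, q1 + (m + r) * q2 + 1) = (10, 8)
```

The column count q_1 + (m + r) · q_2 + 1 matches the expression given for col(U_γ) in the next paragraph.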
For nonlinear functions h having a linear representation (e.g., splines), let U γ be a matrix of n rows consisting of the intercept column and the independent regressors based on the cluster representatives. For example, if we use order-r splines with m knots in equation (4), then the number of columns is col(U γ ) = q 1 + (m + r) · q 2 + 1. With [·] denoting densities of random variables, the prior for the indicators is

[γ | ω] ∝ ∏_{k=1}^q ω_0^{γ_k^(0)} ω_1^{γ_k^(1)} ω_2^{γ_k^(2)} · I(col(U_γ) < n),    (5)

where the probabilities ω = (ω 0 , ω 1 , ω 2 ) are given the Dirichlet distribution prior, ω ∼ D 3 (1, 1, 1). The truncated prior for γ is designed to ensure model sparsity and prevent overfitting, as explained below. Conditional on the variances of the regression outcomes in equation (4), we postulate a weighted g-prior for the regression coefficients:

β_γ | Σ ~ N(0, g · (U_γ' Σ^{-1} U_γ)^{-1}),

where matrix Σ = diag(σ 1 2 , . . . , σ n 2 ). A schematic representation of the entire hierarchical model involving both the clustering and prediction components is shown in Figure 3.

2.3. Model justification and generalizations. In this section, we discuss the justification, consequences, and generalizations of different aspects of the VariScan model. In particular, we investigate the appropriateness of PDPs in this application as a tool for covariate clustering. We also discuss the choice of basis functions for the nonlinear prediction model and consider generalizations to discrete and survival outcomes.
Empirical justification of PDPs. We conducted an exploratory data analysis (EDA) of the gene expression levels in the DLBCL dataset of Rosenwald et al. [2002]. Randomly selecting a set of p = 500 probes for n = 100 randomly chosen individuals, we iteratively applied the k-means procedure until the covariates were grouped into fairly concordant clusters with a small overall value of τ^2. The allocation pattern depicted in Figure 4 is atypical of Dirichlet processes which, as is well known among practitioners, are usually associated with a relatively small number of clusters and exponentially decaying cluster sizes. Instead, the large number of clusters (q = 161) and the predominance of small clusters suggest a non-Dirichlet model for the covariate-cluster assignments. More specifically, a PDP is favored due to the slower, power-law decay in the cluster sizes typically associated with these models.
Theoretical justifications for a PDP model. Sethuraman [1994] derived the stick-breaking representation for a Dirichlet process, and Pitman [1995] extended it to PDPs. These stick-breaking representations have the following consequences for the induced partitions. Let N be the set of natural numbers. Subject to a one-to-one transformation of the first q natural numbers into N, the allocation variables c 1 , . . . , c p may be regarded as i.i.d. samples from a discrete distribution F_{α_1,d} on N with stick-breaking probabilities

π_h = V_h · ∏_{l<h} (1 − V_l), where V_h ~ beta(1 − d, α_1 + h·d) independently, h ∈ N.

For large values of p and for clusters k = 1, . . . , q, the frequencies n_k^(p)/p are approximately equal to π_{h_k} for some distinct integers h 1 , . . . , h q .
As previously mentioned, the VariScan model assumes that the base distribution G (n) of the PDP is the n-fold product measure of a univariate distribution, G, which follows a Dirichlet process with mass parameter α 2 . This bidirectional clustering structure has some interesting consequences. Let {φ_h}_{h=1}^∞ be the stick-breaking probabilities associated with this nested Dirichlet process. For two or more of the q PDP clusters, the latent vectors are identical with a probability bounded above by q^2 · (Σ_{h=1}^∞ φ_h^2)^n. Applying the asymptotic relationship of p and q given in expression (1), we find that the upper bound tends to 0 as the dataset grows, provided p grows at a slower-than-exponential rate in n. In fact, for n as small as 50 and p as small as 250, in simulations as well as in data analyses, we found all the latent vectors associated with the PDP clusters to be distinct. Consequently, from a practical standpoint, the VariScan allocations may be regarded as clusters with unique characteristics in even moderate-sized datasets. Theorem 2.1 below provides formal expressions for the first and second moments of the random log-probabilities of the discrete distribution F_{α_1,d}. In conjunction with equation (1), this result justifies the use of PDPs when the observed number of clusters is large or the cluster sizes decay slowly. Part 2c provides an explanation for the fact that Dirichlet process allocations typically consist of a small number of clusters, only a few of which are large, with exponential decay in the cluster sizes. Part 1c suggests that in PDPs with d > 0 (i.e., non-Dirichlet realizations), there is a slower, power-law decay of the cluster sizes as d increases. Part 3 indicates that for every α 1 and d > 0, a PDP realization F_{α_1,d} is thicker tailed than a Dirichlet process realization, F_{α_1,0}. See the Appendix for a proof.
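The qualitative message of Theorem 2.1, exponential versus power-law decay of the stick-breaking probabilities, can be visualized by counting how many sticks are needed to cover most of the probability mass (a simulation sketch; α_1 = 5, d = 0.5, the 99% mass threshold, and the number of replicates are arbitrary choices):

```python
import numpy as np

def sticks_to_mass(alpha, d, mass, rng):
    """Number of stick-breaking weights needed to cover `mass` of a PDP(alpha, d) draw."""
    remaining, h = 1.0, 0
    while remaining > 1.0 - mass:
        h += 1
        V = rng.beta(1.0 - d, alpha + h * d)   # V_h; d = 0 recovers the Dirichlet process
        remaining *= 1.0 - V
    return h

rng = np.random.default_rng(7)
reps, alpha = 50, 5.0
dp_mean = np.mean([sticks_to_mass(alpha, 0.0, 0.99, rng) for _ in range(reps)])
pdp_mean = np.mean([sticks_to_mass(alpha, 0.5, 0.99, rng) for _ in range(reps)])
print(f"sticks covering 99% of mass: DP {dp_mean:.0f}, PDP(d = 0.5) {pdp_mean:.0f}")
```

The Dirichlet process concentrates its mass on a handful of large sticks (exponential decay), while the thicker-tailed PDP spreads mass over far more sticks (power-law decay), consistent with the allocation patterns discussed above.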
It should be noted that the differential allocation patterns of PDPs and Dirichlet processes are well known and have been emphasized previously by several papers, including Lijoi, Mena, and Prünster [2007a] and Lijoi, Mena, and Prünster [2007b]. However, it is difficult to come across a formal proof of this differential behavior. Although the theorem is primarily of interest when the base measure is non-atomic, it is relevant in this application because of the empirically observed uniqueness of the latent vectors in high-dimensional settings due to VariScan's nested structure.
Theorem 2.1. (1) For 0 < d < 1, the distribution F_{α_1,d} on N is a realization of a PDP with stick-breaking probabilities π h , where h ∈ N; because d > 0, F_{α_1,d} is not a Dirichlet process realization. The first and second moments of log π h then have closed-form expressions. Unlike a Dirichlet process realization, lim h→∞ Var(log π h ) is finite for every d > 0.
(2) For d = 0, the distribution F_{α_1,0} on N is a Dirichlet process realization with stick-breaking probabilities π * h , whose first and second moments also have closed-form expressions. These imply that as h → ∞, the random stick-breaking Dirichlet process probabilities, π * h , are stochastically equivalent to e^{−h/α_1}.
(3) As h → ∞, the ratios of the Dirichlet process and non-Dirichlet process stick-breaking random probabilities, π * h /π h , are stochastically equivalent to e^{−h/α_1} for every d > 0.
Remark. By Lemma 1 of Ishwaran and James [2003], lim h→∞ E(log π * h ) = −∞ in Part 2a of Theorem 2.1 implies that Σ_{h=1}^∞ π * h = 1 almost surely for a Dirichlet process. A similar comment applies in Part 1a for a PDP.
Empirical justification of nested Dirichlet process model for the latent vector elements. For the DLBCL dataset, Figure 5 presents a summary of the VariScan model estimates for the 14,000 latent vector elements with estimated Bernoulli indicators ẑ ik = 1. More than 87% of the nq = 16,500 latent vector elements were estimated to have ẑ ik = 1, implying that a relatively small proportion of covariate values for the DLBCL dataset can be regarded as random noise having no clustering structure. Further details about the inference procedure are provided in Section 3. In Figure 5, the small number of clusters corresponding to the large number of latent vector elements, and the sharp decline in the cluster sizes compared with Figure 4, are consistent with Dirichlet process allocation patterns. Similar results were obtained for the breast cancer data and for other genomic datasets that we have analyzed.
Choice of basis functions: model parsimony versus flexibility. The reliability of inference and prediction rapidly deteriorates as the number of cluster predictors and additive nonlinear components in equation (4) grows beyond a threshold value and approaches the number of subjects, n. The restriction in the prior (5) prevents over-fitting. It ensures that the matrix U_γ, consisting of the independent regression variables, has fewer columns than rows, and is a sufficient condition for the existence of (U_γ' Σ^{-1} U_γ)^{-1} and the least-squares estimate of β_γ in equation (4). In spline-based models, the relatively small number of subjects also constrains the order of the splines, often necessitating the use of linear splines with m = 1 knot per cluster in equation (4). In the applications presented in this paper, we fixed the knot for each covariate at the sample median.
Unusually small values of σ²_i in equation (4) correspond to over-fitted models, whereas unusually large values correspond to under-fitted models. Any parameters that determine σ²_1, ..., σ²_n are therefore key, and their priors must be carefully chosen. For instance, linear regression assumes that σ²_i = σ². We have found that noninformative priors for σ² do not work well because the optimal model sizes for variable selection are unknown. Additionally, we have found it helpful to restrict the range of σ² based on reasonable goals for inference precision. In the examples discussed in this paper, we assigned the truncated prior σ^{-2} ∼ χ²_ν · I(0.95^{-1}/Var(ŷ) < σ^{-2} < 0.5^{-1}/Var(ŷ)), where the degrees of freedom ν were appropriately chosen and the vector ŷ relied on EDA estimates of the latent regression outcomes from a previous study or the training set individuals. The support for σ^{-2} approximately corresponds to the constraint 0.5 < R² < 0.95, quantifying the effectiveness of the regression. As Sections 5.2 and 6 demonstrate, the aforementioned strategies often result in highly reliable response predictions.
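As a concrete sketch of the prior above, the following fragment draws from the truncated chi-square prior on σ^{-2} by rejection sampling; the EDA estimates ŷ and the degrees of freedom ν are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical EDA estimates of the latent regression outcomes; in practice
# these come from a previous study or from the training set individuals
y_hat = rng.normal(loc=2.0, scale=1.5, size=100)
v = y_hat.var()

# Support of sigma^{-2} implied by the indicator in the truncated prior
lo, hi = (1.0 / 0.95) / v, (1.0 / 0.5) / v
nu = 4  # degrees of freedom; assumed chosen by the analyst

# Rejection sampling from the truncated chi-square prior on sigma^{-2}
draws = []
while len(draws) < 2000:
    x = rng.chisquare(nu)
    if lo < x < hi:
        draws.append(x)
draws = np.asarray(draws)
print(draws.min() > lo, draws.max() < hi)  # True True: truncation respected
```

Rejection sampling is adequate here because the truncation region carries non-negligible chi-square mass; a sampler based on the inverse CDF of the chi-square distribution would avoid the rejections entirely.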
Generalizations for discrete or survival outcomes. In a general investigation, the subject-specific responses may be discrete or continuous, and/or may be censored. In such cases, the responses, denoted by w_1, ..., w_n, can be modeled as deterministic transformations of random variables R_i from an exponential family distribution. The Laplace approximation [Harville, 1977] transforms each R_i into a Gaussian regression outcome, y_i, that can be modeled using the VariScan model proposed above. The details of the calculation are as follows. For a set of functions f_i, we assume that w_i = f_i(R_i) and density function [R_i | ϑ_i, ς] = r(R_i, ς) exp{(R_i ϑ_i − ψ(ϑ_i))/ς}, where r(·) is a non-negative function, ς is a dispersion parameter, ϑ_i is the canonical parameter, ψ(·) is the cumulant function, and [·] represents densities with respect to a dominating measure. The Laplace approximation relates the R_i's to Gaussian regression outcomes: for an appropriate link function g(·), the quantity η_i equals g(µ_i), where µ_i = E(R_i) is the mean. Gaussian, Poisson, and binary responses are applicable in this setting. Accelerated failure time (AFT) censored outcomes [Buckley and James, 1979, Cox and Oakes, 1984] also fall into this modeling framework.
The idea of using a Laplace-type approximation for exponential families is not new. Some early examples in Bayesian settings include Zeger and Karim [1991], Albert and Chib [1994], and Albert et al. [1998]. For linear regression, the approximation is exact, with y_i = R_i. The Laplace approximation is not restrictive even when it is approximate; for example, MCMC proposals for the model parameters can be filtered through a Metropolis-Hastings step to obtain samples from the target posterior. Alternatively, inference strategies relying on normal mixture representations through auxiliary variables could be used to relate the R_i's to the y_i's. For instance, Albert and Chib [1993] used truncated normal sampling to obtain a probit model for binary responses, and Holmes and Held [2006] utilized a scale mixture representation of the normal distribution [Andrews and Mallows, 1974, West, 1987] to implement logistic regression using latent variables.
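As an illustration of the auxiliary-variable strategy, the following minimal sketch implements the Albert and Chib [1993] truncated normal augmentation for a probit model with one covariate and a flat prior on the coefficients; the data and settings are hypothetical, and this is not VariScan's full sampler:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(2)

# Simulated binary data from a probit model with one covariate (hypothetical)
n, beta_true = 500, 1.5
x = rng.normal(size=n)
w = (x * beta_true + rng.normal(size=n) > 0).astype(int)

X = np.column_stack([np.ones(n), x])  # intercept plus covariate
XtX_inv = np.linalg.inv(X.T @ X)

beta = np.zeros(2)
slope_draws = []
for it in range(1500):
    # Step 1: latent Gaussian outcomes truncated to agree with the binary data
    mu = X @ beta
    a = np.where(w == 1, -mu, -np.inf)  # standardized lower bounds
    b = np.where(w == 1, np.inf, -mu)   # standardized upper bounds
    z = truncnorm.rvs(a, b, loc=mu, scale=1.0, random_state=rng)
    # Step 2: beta from its Gaussian full conditional (flat prior, unit variance)
    beta = rng.multivariate_normal(XtX_inv @ (X.T @ z), XtX_inv)
    if it >= 500:
        slope_draws.append(beta[1])

print(np.mean(slope_draws))  # posterior mean close to the true slope 1.5
```

Both full conditionals are standard distributions, so the augmented sampler is a pure Gibbs scheme with no Metropolis-Hastings rejections.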

Posterior inference
Starting with an initial configuration obtained by a naïve, preliminary analysis, the model parameters are iteratively updated by MCMC methods. Due to the intensive nature of the posterior inference, the analysis is performed in two stages, with cluster detection followed by predictor discovery: Stage 1 Focusing on the covariates, MCMC samples are generated to detect the clusters, and the output is post-processed to compute the least-squares allocation and least-squares configuration. Stage 2 Conditional on the least-squares allocation and least-squares configuration, and focusing on the responses, the cluster predictors and latent regression outcomes, if any, are generated to obtain a third MCMC sample. This MCMC sample is post-processed to predict the responses for out-of-the-bag test set individuals. The interested reader is referred to Sections ??, ?? and ?? of the Appendix for details.
As a further benefit of a coherent model for the covariates, VariScan is able to perform model-based imputations of any missing subject-specific covariates as part of the MCMC procedure.

Clustering consistency
As mentioned in Section 2.1, the latent vectors associated with two or more PDP clusters may be identical under the VariScan model, but this probability becomes vanishingly small as n grows. Consequently, for practical purposes, the VariScan allocations may be interpreted as distinct, identifiable clusters in even moderately large datasets. In order to study the reliability of VariScan's clustering procedure in our targeted Big Data applications, we make large-sample theoretical comparisons between the VariScan model's cluster allocations and the true allocations of a hypothetical covariate generating process.
In the general problem of using mixture models to allocate p objects to an unknown number of clusters, the problem of non-identifiability and redundancy of the detected clusters has been extensively documented in Bayesian and frequentist applications [e.g., see Frühwirth-Schnatter, 2006]. Some partial solutions are available in the Bayesian literature. For example, in finite mixture models, rather than assuming exchangeability of the mixture component parameters, Petralia et al. [2012] regard them as draws from a repulsive process, leading to fewer, better separated and more interpretable clusters. Rousseau and Mengersen [2011] show that a carefully chosen prior leads to asymptotic emptying of the redundant components in over-fitted finite mixture models. The underlying strategy of these procedures is that they focus on detecting the correct number of clusters rather than the correct allocation of the p objects.
In contrast to the non-identifiability of the detected clusters in fixed n settings, Theorem 4.1 establishes the interesting fact that, when p and n are both large, a fixed set of covariates that (do not) co-cluster under the true process also (do not) asymptotically co-cluster under the posterior. The key intuition is that, as with most mixture model applications, when n-dimensional objects are clustered and n is small, it is possible for the clusters to be erroneously placed too close together even if p is large. On the other hand, if n is allowed to grow with p, then the objects in R^n eventually become well separated. Consequently, for n and p large enough, the VariScan method is able to infer the true clustering for a fixed subset of the p covariate columns. In Section 5.1, we use synthetic datasets to exhibit the high accuracy of VariScan's clustering-related inferences.
The true model. The VariScan model's exchangeability assumption for the p covariates stems from our belief in the existence of a true, unknown de Finetti density in R^n from which the column vectors arise as a random sample. In particular, for any given n and p, we make the following assumptions about the true covariate-generating process: (a) the column vectors x_1, ..., x_p are a random sample of size p from an n-variate distribution P_0^{(n)}; and (b)-(e) conditions on the true allocation variables c_1^0, ..., c_p^0, which map the p covariates to distinct atoms of P_0^{(n)}, and on the Gaussian noise with which those atoms are convolved. For subjects i = 1, ..., n, and columns j = 1, ..., p, the covariates are then distributed as in equation (7). For a fixed covariate subset L with inferred allocations c, the accuracy measure κ_L(c) is defined as the proportion of covariate pairs in L whose co-clustering status under c agrees with the truth; a value near 1 indicates the high accuracy of the inferred allocations c for the set L.
Notice that the measure κ_L(c) is invariant to permutations of the cluster labels. This is desirable because the labels are arbitrary.
Theorem 4.1. Denote the covariate matrix by X_np. In addition to assumptions (a)-(e) about the true covariate-generating process, suppose that the true standard deviation τ_0 in equation (7) is bounded below by τ*, the small, positive constant postulated in Section 2.1 as a lower bound for the VariScan model parameters, τ_1 and τ. Let L = {j_1, ..., j_L} ⊂ {1, ..., p} be a fixed subset of L covariate indexes. Then there exists an increasing sequence of numbers {p_n} such that, as n grows and provided p > p_n, the VariScan clustering inferences for the covariate subset L are a posteriori consistent. That is, the posterior expectation of the accuracy measure, E(κ_L(c) | X_np), converges to 1. See Section ?? of the Appendix for a proof. The result relies on non-trivial extensions, in several directions, of the important theoretical insights provided by Ghosal et al. [1999]. Specifically, it extends Theorem 3 of Ghosal et al. [1999] to densities on R^n arising as convolutions of vector locations with errors distributed as zero-mean finite normal mixtures.

Simulation studies
5.1. Cluster-related inferences. We investigate VariScan's accuracy as a clustering procedure using artificial datasets for which the true clustering pattern is known. We simulated the covariates for n = 50 subjects and p = 250 genes from a discrete distribution convolved with Gaussian noise, and compared the co-clustering posterior probabilities of the p covariates with the truth. The parameters of the true model were chosen to approximately match the corresponding estimates for the DLBCL dataset of Rosenwald et al. [2002]. Specifically, for each of 25 synthetic datasets, and for the true model's parameter τ_0 in Theorem 4.1 belonging to the range [0.60, 0.96], we generated the following quantities to obtain the matrix X in Step 3 below: (1) True allocation variables: We generated the allocations c_1, ..., c_p as the partitions induced by a PDP with true discount parameter d_0 = 0.33 and mass parameter α_1 = 20. The true number of clusters, Q_0, was thereby computed for this non-Dirichlet allocation.
(3) Covariates: x_ij ∼ N(λ_{i c_j}, τ_0²) for i = 1, ..., n and j = 1, ..., p, where the λ's are the latent cluster atoms generated in Step 2. No responses were generated in this study. Applying the general technique of Dahl [2006] developed for Dirichlet process models, we computed a point estimate for the allocations, called the least-squares configuration, and denoted by ĉ_1, ..., ĉ_p. For the full set of covariates, we estimated the accuracy of the least-squares allocation by the proportion of correctly clustered covariate pairs: κ̂ = (p choose 2)^{-1} Σ_{j_1 < j_2} I( I(ĉ_{j_1} = ĉ_{j_2}) = I(c_{j_1} = c_{j_2}) ). A high value of κ̂ is indicative of VariScan's high clustering accuracy for all p covariates.
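The pair-counting accuracy measure κ̂ can be computed directly; a small sketch with hypothetical allocation vectors:

```python
import numpy as np

def pair_accuracy(c_hat, c_true):
    """Proportion of covariate pairs whose co-clustering status under the
    estimated allocation agrees with the truth; the measure is invariant
    to permutations of the arbitrary cluster labels."""
    c_hat, c_true = np.asarray(c_hat), np.asarray(c_true)
    iu = np.triu_indices(len(c_hat), k=1)  # all pairs j1 < j2
    same_hat = (c_hat[:, None] == c_hat[None, :])[iu]
    same_true = (c_true[:, None] == c_true[None, :])[iu]
    return np.mean(same_hat == same_true)

c_true = [0, 0, 1, 1, 2]
print(pair_accuracy([5, 5, 7, 7, 9], c_true))  # 1.0: same partition, new labels
print(pair_accuracy([0, 0, 1, 2, 2], c_true))  # 0.8: two of ten pairs disagree
```

Because only pairwise agreement is compared, relabeling the clusters leaves the measure unchanged, matching the invariance property noted above.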
For each value of τ_0, the second column of Table 1 displays the percentage κ̂ averaged over the 25 independent replications. We find that, for each τ_0, significantly fewer than 5 pairs were incorrectly clustered out of the (250 choose 2) = 31,125 distinct covariate pairs, and so κ̂ was significantly greater than 0.999. The posterior inferences appear to be robust to large noise levels, i.e., large values of τ_0. For every dataset, q̂, the estimated number of clusters in the least-squares allocation, was exactly equal to Q_0, the true number of clusters. Recall that the non-atomicity of the true distribution G_0 is a sufficient condition of Theorem 4.1. Although the condition is not satisfied in this setting, we nevertheless obtained highly accurate clustering-related inferences for the full set of p = 250 covariates.

Table 1. For different values of the simulation parameter τ_0, column 2 displays the proportion of correctly clustered covariate pairs, with the standard errors for the 25 independent replications shown in parentheses. Column 3 presents 95% posterior credible intervals for the lower bound of the log-Bayes factor of PDP models relative to Dirichlet process models. See the text for further explanation.
Accurate inferences were also obtained for the PDP discount parameter, d ∈ [0, 1). Figure 6 plots the 95% posterior credible intervals for d against different values of τ_0. The posterior inferences are substantially more precise than the prior, and each interval contained the true value, d_0 = 0.33. Furthermore, in spite of being assigned a prior probability of 0.5, there is no posterior mass allocated to Dirichlet process models. The ability of VariScan to discriminate between PDP and Dirichlet process models was evaluated using the log-Bayes factor, log(P[d > 0 | X] / P[d = 0 | X]). With Θ* representing all the parameters except d, an application of Jensen's inequality shows that the log-Bayes factor exceeds E(log(P[d > 0 | X, Θ*] / P[d = 0 | X, Θ*]) | X), which (unlike the log-Bayes factor itself) can be estimated using just the post-burn-in MCMC sample. For each τ_0, the third column of Table 1 displays 95% posterior credible intervals for this lower bound. The Bayes factors are significantly greater than e^10 ≈ 22,026 and are overwhelmingly in favor of PDP allocations, i.e., the true model.
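Given per-draw conditional probabilities from the sampler, the lower bound is simply an MCMC average of conditional log-odds; a sketch with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical post-burn-in conditional probabilities P(d > 0 | X, Theta*),
# one per MCMC draw; in the actual sampler these are available in closed form
p_pos = rng.uniform(0.999, 0.99999, size=4000)

# Per the Jensen's-inequality bound in the text, the log-Bayes factor exceeds
# the posterior mean of the conditional log-odds, estimable by an MCMC average
log_odds = np.log(p_pos) - np.log1p(-p_pos)
lower_bound = log_odds.mean()
print(lower_bound)  # a large positive lower bound on the log-Bayes factor
```

A large positive lower bound already settles the comparison in favor of the PDP, even though the exact log-Bayes factor is not directly estimable from the MCMC output.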

Prediction accuracy.
We evaluate the operating characteristics of our methods using a simulation study based on the DLBCL dataset of Rosenwald et al. [2002]. To generate the simulated data, we selected p = 500 genes from the original gene expression dataset of 7,399 probes, as detailed below: (1) Select 10 covariates with pairwise correlations less than 0.5 as the true predictor set, S ⊂ {1, . . . , 500}, so that |S| = 10.
(2) For each value of β* ∈ {0.2, 0.6, 1.0}: (a) For subjects i = 1, ..., 100, generate failure times t_i from the distribution E_i, denoting the exponential distribution with mean exp(β* Σ_{j∈S} x_ij). Note that the model used to generate the outcomes differs from VariScan assumption (4) for the log-failure times. (b) For 20% of the individuals, generate their censoring times as u_i ∼ E_i · I(u_i < t_i). Set the survival times of these individuals to w_i = log u_i and their failure statuses to δ_i = 0. (c) For the remaining individuals, set w_i = log t_i and δ_i = 1.
(3) Randomly assign 67 individuals to the training set and the remaining 33 individuals to the test set. (4) Assuming the AFT survival model, apply the VariScan procedure with linear splines and m = 1 knot per spline. Choose a single covariate from each cluster as the representative as described in Section 2.2. Make posterior inferences using the training data and predict the outcomes for the test cases.
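Steps 2 and 3 above can be sketched as follows; the expression matrix here is a hypothetical stand-in for the selected genes, and the truncated exponential censoring times are drawn by inverse-CDF sampling:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical standardized expression matrix standing in for the 500 genes
n, p = 100, 500
x = rng.normal(size=(n, p))
S = np.arange(10)  # indices of the true predictor set (illustrative)
beta_star = 0.6

# Step 2(a): exponential failure times with subject-specific means
mean_i = np.exp(beta_star * x[:, S].sum(axis=1))
t = rng.exponential(mean_i)

# Step 2(b): for a random 20% of subjects, draw a censoring time from the
# same exponential distribution truncated to (0, t_i), via the inverse CDF
censored = rng.permutation(n) < int(0.2 * n)
U = rng.uniform(size=n)
u = -mean_i * np.log1p(-U * (1.0 - np.exp(-t / mean_i)))

# Steps 2(b)-(c): log survival times and failure statuses
w = np.where(censored, np.log(u), np.log(t))
delta = (~censored).astype(int)

# Step 3: 2:1 training/test split
perm = rng.permutation(n)
train, test = perm[:67], perm[67:]
print(delta.mean())  # 0.8: exactly 20% of subjects are censored
```

The inverse-CDF draw guarantees u_i < t_i for every censored subject, matching the truncation indicator in Step 2(b).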
We analyzed the same set of simulated data using six other techniques for gene selection with survival outcomes: lasso [Tibshirani, 1997], adaptive lasso [Zou, 2006], elastic net [Zou and Hastie, 2005], L2-boosting [Hothorn and Bühlmann, 2006], random survival forests [Ishwaran et al., 2010], and supervised principal components [Bair and Tibshirani, 2004], which are implemented in the R packages glmnet, mboost, randomSurvivalForest, and superpc. The "RSF-VH" version of the random survival forests procedure was chosen because of its success in high-dimensional problems. The selected techniques are excellent examples of the three categories of approaches for small n, large p problems (variable selection, nonlinear prediction, and regression based on lower-dimensional projections) discussed in Section 1. We repeated this procedure over fifteen independent replications.
We compared the prediction errors of the methods using the concordance error rate, defined as 1 − C, where C denotes the c-index of Harrell et al. [1982]. Let the set of "usable" pairs of subjects be U = {(i, j) : w_i < w_j, δ_i = 1} ∪ {(i, j) : w_i = w_j, δ_i ≠ δ_j}. The concordance error rate of a procedure is [May et al., 2004] 1 − |U|^{-1} Σ_{(i,j)∈U} I(w̃_i < w̃_j), where w̃_i is the predicted response of subject i. For example, for the VariScan procedure applied to analyze AFT survival outcomes, the predicted responses are w̃_i = exp(ỹ_i), where the ỹ_i are the predicted regression outcomes.
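The error rate above is straightforward to compute; a simplified sketch that ignores the tied-time pairs for brevity, applied to toy data:

```python
import numpy as np

def concordance_error(w, delta, w_pred):
    """1 - c index: the proportion of usable pairs whose predicted responses
    are ranked discordantly with the observed survival times. Simplified
    sketch: only pairs (i, j) with w_i < w_j and delta_i = 1 are treated as
    usable; the tied-time case is omitted for brevity."""
    w, delta, w_pred = map(np.asarray, (w, delta, w_pred))
    usable = concordant = 0
    for i in range(len(w)):
        if delta[i] != 1:
            continue  # a censored subject cannot anchor a usable pair
        for j in range(len(w)):
            if w[i] < w[j]:
                usable += 1
                concordant += int(w_pred[i] < w_pred[j])
    return 1.0 - concordant / usable

w = np.log([2.0, 5.0, 1.0, 8.0])
delta = np.array([1, 0, 1, 1])
print(concordance_error(w, delta, w))   # 0.0: perfect predictions
print(concordance_error(w, delta, -w))  # 1.0: completely reversed rankings
```

Because only the rankings of the predictions matter, any monotone transformation of the predicted responses leaves the error rate unchanged.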
The concordance error rate measures a procedure's probability of incorrectly ranking the failure times of two randomly chosen individuals. The accuracy of a procedure is inversely related to its concordance error rate. The measure is especially useful for comparisons because it does not rely on the survivor function, which is estimable by VariScan, but not by some of the other procedures. Figure 7 depicts boxplots of the concordance error rates of the procedures sorted by increasing order of prediction accuracy. We find that as β * increases, the concordance error rates progressively decrease for most procedures, including VariScan. For larger β * , the error rates for VariScan are significantly lower than the error rates for the other methods.
In order to facilitate a more systematic evaluation, Figure 8 plots the error rates versus model sizes for the different methods, thereby providing a joint examination of model parsimony and prediction. To aid visual interpretation, we did not include the supervised principal components method, since it performs the worst in terms of prediction and detects models that are two- to four-fold larger than those of L2-boosting, which typically produces the largest models among the depicted methods. The three panels correspond to increasing effect size, β*. A few facts are evident from the plots. VariScan balances sparsity and prediction the best for all values of β*, with its performance improving appreciably with β*. Penalization approaches such as lasso, adaptive lasso, and elastic net produce sparser models but have lower prediction accuracies. L2-boosting is comparable to VariScan in terms of prediction accuracy but detects larger models for the lower effect sizes (left and middle panels); VariScan is the clear winner for the largest effect size (right panel). Additionally, especially for the largest β*, we observe substantial variability between the simulation runs for the penalization approaches, as reflected by the large standard errors. Further simulation study comparisons of VariScan and the competing approaches are presented in Section ?? of the Appendix.

Nonlinearity measure. Unlike some existing approaches, VariScan is able to measure the degree of nonlinearity in the relationships between the responses and covariates. For example, we could define the nonlinearity measure N as the posterior expectation, given in equation (9), of the probability that a hypothetical, new cluster is a nonlinear predictor in equation (4) rather than a simple linear regressor. A value of N close to 1 corresponds to predominantly nonlinear associations between the responses and their predictors.
Averaging over the 15 independent replications of the simulation, as β* varied over the set {0.2, 0.6, 1.0}, the estimates of the nonlinearity measure N defined in equation (9) were 0.72, 0.41, and 0.25, respectively, with corresponding standard errors of 0.04, 0.07, and 0.06. This indicates that, on the scale of the simulated log-failure times, simple linear regressors are increasingly preferred to linear splines as the signal-to-noise ratio, quantified by β*, increases. Such interpretable measures of nonlinearity are not provided by the competing methods.

Analysis of benchmark data sets
Returning to the two publicly available datasets of Section 1, we chose p = 500 probes for further analysis. For the DLBCL dataset of Rosenwald et al. [2002], we randomly selected 100 out of the 235 individuals who had non-zero survival times. Of the individuals selected, 50% had censored failure times. For the breast cancer dataset of van't Veer et al. [2002], we analyzed the 76 individuals with non-zero survival times, of which 44 individuals (57.9%) had censored failure times.
We performed 50 independent replications of the three steps that follow. (i) We randomly split the data into training and test sets in a 2:1 ratio. (ii) We analyzed the survival times and p = 500 gene expression levels of the training cases using the techniques VariScan, lasso, adaptive lasso, elastic net, L 2 -boosting, random survival forests, and supervised principal components. (iii) The different techniques were used to predict the test case outcomes. For the VariScan procedure, a single covariate from each cluster was chosen to be the cluster representative.
The numbers of clusters in the least-squares allocations of the covariates, q̂, computed in Stage 1a of the analysis, were 165 and 117, respectively, for the DLBCL and the breast cancer datasets. The estimates of the nonlinearity measure N were 0.97 and 0.75, respectively, with small standard errors. This indicates that the responses in both datasets, but especially in the DLBCL dataset, have predominantly nonlinear relationships with the predictors. In spite of being assigned a prior probability of 0.5, the estimated posterior probability of the Dirichlet process model (corresponding to discount parameter d = 0) was exactly 0 for both datasets, justifying the PDP-based allocation scheme.
For the DLBCL data, the upper panel of Figure 9 displays the estimated posterior density of the PDP's discount parameter d. The estimated posterior probability of the event [d = 0] is exactly zero, implying that a non-Dirichlet process clustering mechanism is strongly favored by the data, as suggested earlier by the EDA. The middle panel of Figure 9 plots the estimated posterior density of the number of clusters. The a posteriori large number of clusters (for p = 500 covariates) is suggestive of a PDP model with d > 0, i.e., a non-Dirichlet process model. The lower panel of Figure 9 summarizes the cluster sizes of the least-squares allocation [Dahl, 2006]. The large number of clusters (q̂ = 165) and the multiplicity of small clusters are very unusual for a Dirichlet process, justifying the use of the more general PDP model.
The effectiveness of VariScan as a model-based clustering procedure can be shown as follows. For each of the q̂ = 165 clusters in the least-squares allocation of Stage 1a, we computed the correlations between its member covariates and the latent vector for the individuals with ẑ_ik = 1. The cluster-wise median correlations are plotted in the histogram of Figure 10. The plots reveal fairly good within-cluster concordance regardless of the cluster size. Figure 11 displays heatmaps for the DLBCL covariates that were allocated to column clusters having more than 10 members. The panels display the covariates before and after bidirectional clustering of the subjects and probes, with the lower panel of Figure 11 illustrating the within-cluster patterns revealed by VariScan. For each column cluster in the lower panel, the uppermost rows represent the covariates of any subjects that do not follow the cluster structure and which are better modeled as random noise (i.e., covariates with ẑ_ik = 0).

Figure 10. Histogram of the cluster-wise median correlations.
Comparing the test case predictions with the actual survival times, Figure 12 presents boxplots of the concordance error rates for all the methods. The success of VariScan appears to be robust to the different censoring rates of the survival datasets. Although L2-boosting had comparable error rates for the DLBCL dataset, VariScan had the lowest error rates for both datasets. Further data analysis results and comparisons are available in Section ?? of the Appendix.
Figure 11. Heatmaps of DLBCL covariates that were assigned to latent column clusters with more than 10 members. The panels display the covariates before and after bidirectional local clustering by VariScan. The vertical lines in the bottom panel mark the covariate clusters. The color key for both panels is displayed at the top of the plot.

Figure 12. Side-by-side boxplots of percentage concordance error rates for the benchmark datasets.

For subsequent biological interpretations, we selected genes having a high probability of being selected as predictors (with the upper percentile decided by the model size). We then analyzed these genes for their role in cancer progression by cross-referencing the existing literature. For the breast cancer dataset, our survey indicated several prominent genes related to breast cancer development and progression, such as TGF-B2 [Buck and Knabbe, 2006]; ABCC3, which is known to be up-regulated in primary breast cancers; and LAPTM4B, which is related to breast carcinoma relapse with metastasis [Li et al., 2010]. For the DLBCL dataset, we found several genes related to DLBCL progression, such as the presence of multiple chemokine ligands (CXCL9 and CCL18), the interleukin receptors of IL2 and IL5 [Lossos and Morgensztern, 2006], and BNIP3, which is down-regulated in DLBCL and is a known marker associated with positive survival [Pike et al., 2008]. A detailed functional/mechanistic analysis of the main set of genes for both datasets is provided in Section ?? of the Appendix.

Conclusions
Utilizing the sparsity-inducing property of PDPs, VariScan offers an efficient technique for clustering, variable selection, and prediction in high-dimensional regression problems. The covariates are grouped into a smaller number of clusters consisting of covariates with similar across-subject patterns. We theoretically demonstrate how a PDP allocation can be differentiated from a Dirichlet process allocation in terms of the relative sizes of the latent clusters. We also provide a theoretical explanation for VariScan's ability to detect, a posteriori, the true covariate clusters for a general class of models.
In simulations and real data analyses, we show that VariScan makes highly accurate cluster-related inferences. The technique consistently outperforms established methodologies, such as random survival forests, L2-boosting, and supervised principal components, in terms of prediction accuracy. In the analyses of the benchmark microarray datasets, we identified several genes with known implications in cancer development and progression, which lends further biological support to our findings.
The VariScan methodology focuses on continuous covariates as a proof of concept, achieving simultaneous clustering, variable selection, and prediction in high-throughput regression settings and possessing appealing theoretical and empirical properties. Generalization to count, categorical, and ordinal covariates is possible. It is important to investigate the dependence structures and theoretical properties associated with the more general framework. This will be the focus of our group's future research.
Due to the intensive nature of the MCMC inference, we performed these analyses in two stages, with cluster detection followed by predictor discovery. We are currently working on implementing VariScan's MCMC procedure in a parallel computing framework using graphical processing units [Suchard et al., 2010]. A single-stage analysis would allow the regression and clustering results to be interrelated, as implied by the VariScan model. We anticipate being able to speed up the calculations by multiple orders of magnitude, which will allow single-stage inferences for user-specified datasets on ordinary desktop and laptop computers. We plan to make the software available as an R package for general-purpose use in the near future.
University of Missouri, USA. E-mail address: GuhaSu@missouri.edu The University of Texas MD Anderson Cancer Center and Rice University, USA. E-mail address: Veera@mdanderson.org