A Novel Information-Theoretic Approach for Variable Clustering and Predictive Modeling Using Dirichlet Process Mixtures

In the era of big data, there is increasing interest in clustering variables to minimize data redundancy and maximize variable relevancy. Existing clustering methods, however, depend on nontrivial assumptions about the data structure. Note that nonlinear interdependence among variables poses significant challenges to the traditional framework of predictive modeling. In the present work, we reformulate the problem of variable clustering from an information-theoretic perspective that does not require assumptions about the data structure to identify nonlinear interdependence among variables. Specifically, we propose the use of mutual information to characterize and measure nonlinear correlation structures among variables. Further, we develop Dirichlet process (DP) models to cluster variables based on the mutual-information measures among variables. Finally, orthonormalized variables in each cluster are integrated with a group elastic-net model to improve the performance of predictive modeling. Both simulation and real-world case studies showed that the proposed methodology not only effectively reveals the nonlinear interdependence structures among variables but also outperforms traditional variable clustering algorithms such as hierarchical clustering.

Predictive modeling extracts useful information and patterns from data to drive decisions or actions. For example, insurance companies have gathered a vast amount of data in their data warehouses 1. The objective of the predictive model is not only to improve the pricing or marketing process, but also to analyze profitability, fraud, catastrophe, and other insurance operations. In the 21st century, wireless sensing, electronic health records, and the health Internet of Things are increasingly adopted to assist clinical decision making [2][3][4]. Information from such multiple sources provides numerous variables for the contemplated predictive model. When a predictive model involves a large number of variables (i.e., explanatory or response variables), researchers need to reduce the number of variables in order to build a compact model. Often it is not known in advance whether variables are redundant or relevant to the objective of the predictive model; this needs to be tested with real-world data. In addition, when there is an enormous number of variables, it becomes difficult to find the relationships between them. Building a model with too many variables reduces its compactness and efficiency, and may also increase its sensitivity to noise and cause it to overfit the data. The model parameters are not stable when variables are highly correlated. It is also more difficult to explain the physical meaning of the predictive model when there are many variables. Finally, model building with a large number of variables is computationally expensive, and an exhaustive search could take a prohibitively long time. An intermediate approach short of exhaustive search may also be time-consuming, and some combinations of variables could be overlooked.
Data clustering is an unsupervised method to group data samples into homogeneous clusters, while variable clustering is to detect subsets of homogeneous variables and then cluster them into the same group, in which variables have stronger interrelations to each other than to those in other groups. As shown in Fig. 1a, data clustering groups data samples into clusters and each sample has two values, e.g., (0.26, − 0.09) in the 2-dimensional space, where X-axis is the value of variable 1, and Y-axis is the value of variable 2. Data samples are clustered based on the similarity measure, e.g., Euclidean distance. However, variable clustering is different from data clustering. Figure 1b illustrates the clustering results of 15 variables, each of which has 1000 data samples. For example, the variable v 15 represents a series of 1000 samples. Notably, each point in Fig. 1b is a variable instead of a data sample.
Variable clustering uncovers natural groups of objects (variables, features, or factors) in a multivariate dataset. The hierarchical clustering (HC) 5 , a generic clustering procedure, sequentially merges pairs of clusters that share common characteristics based on similarity measures. HC procedures generate a nested set of partitions, also called hierarchy. The choice of the similarity measure plays an important role in the clustering process because it indirectly defines the structure of the clusters. This choice is not only guided by problems to solve, but also restricted to commonly used measures, such as the Euclidean distance or Pearson's correlation coefficient. However, nonlinear interdependence among variables cannot be adequately captured by linear correlation. Further, we cannot relocate the variables once the merge is done for two closest clusters, because HC is not a dynamic approach. There is no adaptive step for two variables to make modifications in the later stage if they are 'incorrectly' clustered at the early stage.
In this paper, we develop a new information-theoretic methodology for variable clustering and predictive modeling. The proposed approach investigates both redundancy and relevancy among variables. Specifically, nonlinear interdependence structures are measured among variables. Further, we introduce the nonparametric Dirichlet process to cluster embedded variables with their probability distributions. Finally, orthonormalized variables are integrated with group elastic-net models to improve the performance of predictive models. Both simulation and real-world case studies demonstrate that the proposed methodology not only outperforms traditional variable clustering algorithms such as hierarchical clustering, but also effectively identifies nonlinear interdependence structures among variables and further improves the performance of predictive modeling.

Research Background
Clustering Analysis. When "clustering" is used in the literature, it refers to "data clustering" most of the time. Data clustering groups data samples into homogeneous subsets, in which data samples in the same cluster are closer to each other than to those in other clusters. As shown in Fig. 2, data clustering is concerned with the samples, which are the rows (i.e., $s_1, s_2, \ldots, s_{N_s}$) in the table format of a dataset, whereas variable clustering focuses on the variables in the columns (i.e., $v_1, v_2, \ldots, v_N$). Data samples are also called nodes in the network or words in the text, where $s_j = (v_{j1}, v_{j2}, \ldots, v_{jN})^T$, $j = 1, 2, \ldots, N_s$. It may be noted that big data often brings a large number of variables that can exceed the number of samples, i.e., $N > N_s$. Complex interdependence structures among variables significantly challenge the traditional framework of predictive modeling. As such, variable clustering to delineate homogeneous groups of variables is urgently needed.
In recent years, community detection in network analysis has received increasing interest in data clustering. Network-based methods cluster nodes with strong connections into a community. For example, mixed membership stochastic blockmodels (MMSB) 6 were proposed to discover complex network structure in a variety of applications, e.g., large-scale protein interaction networks and social networks. The MMSB develops a novel class of latent variable models for relational data, and assumes that each variable belongs to multiple communities/clusters rather than a single community/cluster. Joint Gamma process Poisson factorization (J-GPPF) 7 was developed to jointly model large sparse networks and side information. The infinite edge partition model 8 was introduced not only to study overlapping communities and inter-community interactions but also to predict missing edges. However, community detection groups nodes that represent data samples (e.g., proteins), rather than variables, into communities by considering the unweighted or weighted edges between nodes.
In addition, topic models are widely used for data clustering in the field of text mining. Topic models are statistical models for discovering topics that occur in a collection of documents with a large number of words (i.e., data samples in rows of table-form data in Fig. 2). Latent Dirichlet allocation (LDA) 9 was first introduced as an unsupervised model to cluster documents in the topic space. LDA assumes the topic distribution to have a Dirichlet prior and maximizes the likelihood (or posterior probability) of the document collection. It may also be noted that supervised topic models with side information (e.g., document categories or review rating scores) were proposed to find latent topics and provide more predictive power than regression on unsupervised LDA features. For example, supervised latent Dirichlet allocation (sLDA) 10 introduced the real-valued document rating as regression response and jointly modeled the documents and response by maximizing the joint likelihood. Maximum entropy discrimination LDA (MedLDA) 11,12 proposed a unified constrained optimization framework that solves problems of dimensionality reduction and max-margin classification using features in the reduced-dimension space. Topic models formulate statistical models based on the intuition that specific words would appear more or less frequently in the document for a given topic. However, variable clustering does not hold this intuition. As such, topic models address special clustering problems in text mining that are different from other general data clustering or variable clustering problems.
Moreover, many previous approaches group a dataset into co-clusters (or biclusters), which are subsets of data samples that exhibit similar behaviors across a subset of variables, or vice versa. Co-clustering approaches have been widely used in a variety of applications such as biological gene expression data 13 and text mining 14,15. Notably, a simultaneous co-clustering and learning (SCOAL) 16 framework was proposed to generalize co-clustering and construct predictive models simultaneously. SCOAL co-clusters the entire dataset into subsets of samples and variables such that each subset can be well characterized by a predictive model. However, the whole dataset is divided into multiple subsets, each of which captures incomplete data information. These subsets are then used to construct multiple predictive models rather than one model. In addition, nonlinear correlations among variables were not fully utilized in traditional co-clustering approaches. Instead, nonlinear predictive models were usually introduced to account for data nonlinearity, which also brings a large number of parameters.
Hierarchical Clustering. Variable clustering is the task of grouping homogeneous variables into the same category, in which variables have stronger interrelations with each other than with those in other groups. Variable clustering considers the interdependence structure among variables, e.g., correlation. The Pearson's correlation 17 between variables $v_1$ and $v_2$ is

$$\rho_{v_1, v_2} = \frac{\mathrm{cov}(v_1, v_2)}{\sigma_{v_1}\sigma_{v_2}} = \frac{E[(v_1 - \mu_{v_1})(v_2 - \mu_{v_2})]}{\sigma_{v_1}\sigma_{v_2}}$$

where $\mathrm{cov}(v_1, v_2)$ is the covariance between $v_1$ and $v_2$, $\sigma_{v_1}$ and $\sigma_{v_2}$ are the standard deviations of $v_1$ and $v_2$, $\mu_{v_1}$ and $\mu_{v_2}$ are the means of $v_1$ and $v_2$, and $E$ is the expectation. However, the Pearson's correlation only measures the linear relationship between variables $v_1$ and $v_2$.
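The limitation can be illustrated with a minimal sketch. The variables `v_linear` and `v_square` below are hypothetical choices made for this illustration only, not the variables used in the paper; the sketch simply evaluates the covariance-over-standard-deviations definition on a linear and a quadratic function of the same standard normal variable.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)


rng = random.Random(0)
v1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
v_linear = [2.0 * a + 1.0 for a in v1]  # hypothetical linear function of v1
v_square = [a * a for a in v1]          # hypothetical nonlinear function of v1

rho_linear = pearson(v1, v_linear)  # essentially 1: the dependence is linear
rho_square = pearson(v1, v_square)  # near 0 although v_square is determined by v1
```

Although `v_square` is completely determined by `v1`, its Pearson's correlation with `v1` stays close to zero, because the dependence is symmetric around the mean rather than linear.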
In the literature, Pearson's correlation was integrated with hierarchical clustering (HC) for variable clustering 5. There are two ways to perform HC procedures: the agglomerative way and the divisive way. For example, agglomerative HC defines each variable as a singleton cluster in the first step. Then, the two closest clusters, i.e., those with the smallest dissimilarity measure, are merged into one cluster. The merging process recursively moves up along the hierarchy until the stopping criterion is satisfied, e.g., the maximum number of clusters or the maximum group-average (GA) dissimilarity. The criterion of group average measures the average intergroup dissimilarity between two clusters, i.e.,

$$d_{GA}(C_m, C_n) = \frac{1}{N_{C_m} N_{C_n}} \sum_{i \in C_m} \sum_{j \in C_n} d_{ij}$$

where $N_{C_m}$ and $N_{C_n}$ are the sizes of clusters $C_m$ and $C_n$, and $d_{ij}$ is the dissimilarity between variables $v_i$ and $v_j$, which is usually calculated from their correlation.
Scientific Reports | 6:38913 | DOI: 10.1038/srep38913
Here, a motivating example is shown to evaluate the performance of HC with Pearson's correlation for variable clustering. Two clusters of variables are generated, in which $v_1$ and $v_5$ are independent standard normal variables. In cluster 1, variable $v_1$ has a linear correlation with variable $v_2$ and nonlinear correlations with variables $v_3$ and $v_4$. Cluster 2 has a similar structure. Figure 3a shows the correlation matrix of these eight variables. The red color represents a high correlation, while the blue color indicates no interrelationship. It may be noted that the correlation matrix effectively detects the linear correlations between variables $v_1$ and $v_2$, and between $v_5$ and $v_6$. However, nonlinear correlations are not well captured. Figure 3b shows the hierarchical clustering results based on the Pearson's correlation. Variables $v_1$, $v_2$ and $v_4$ are grouped in one cluster, and variables $v_5$, $v_6$ and $v_8$ are grouped in another cluster. However, hierarchical clustering failed to assign variable $v_3$ to Cluster 1 and variable $v_7$ to Cluster 2. This is mainly due to the fact that nonlinear correlations among variables are not fully considered. Very little work has been done to cluster a large number of variables with complex structures of nonlinear interdependence. Thus, we propose a new methodology that integrates an information-theoretic approach with Dirichlet process mixtures for variable clustering and predictive modeling.
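The agglomerative procedure with the group-average criterion can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the pairwise dissimilarity is taken to be one minus the absolute correlation, which is one common choice, the toy correlation matrix is invented for the example, and the stopping rule is a fixed number of clusters.

```python
def group_average(dissim, cluster_m, cluster_n):
    """Average pairwise dissimilarity between two clusters of variable indices."""
    total = sum(dissim[i][j] for i in cluster_m for j in cluster_n)
    return total / (len(cluster_m) * len(cluster_n))


def agglomerative(dissim, n_clusters):
    """Start from singletons and merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dissim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = group_average(dissim, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # A merge is never revisited: this is the non-adaptive behavior of HC.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy correlation matrix for four variables (illustrative values only).
rho = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
# Assumed dissimilarity: 1 - |rho|.
dissim = [[1.0 - abs(r) for r in row] for row in rho]
clusters = agglomerative(dissim, n_clusters=2)
```

Because the dissimilarity is built from linear correlation, any pair of variables whose dependence is purely nonlinear looks far apart to this procedure, which is the failure mode described in the motivating example.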
Research Methodology
In this section, we first characterize nonlinear correlation (i.e., mutual information) among variables and then embed variables in a lower-dimensional space. Second, we introduce the nonparametric Dirichlet process (DP) to derive self-organizing clusters of homogeneous variables with specific consideration of nonlinear interdependence. Finally, we orthonormalize variables in each cluster and then integrate them with a group elastic-net model to improve the performance of predictive modeling.

Mutual Information based Embedding of Variables. First, mutual information is characterized and quantified among variables. Traditionally, such interrelationships are estimated with linear methods such as Pearson's correlation. As aforementioned, Pearson's correlation, a second-order quantity, evaluates merely the linear dependency among data and is limited in its ability to represent variable-to-variable dissimilarities. Therefore, we propose to characterize the variable-to-variable dissimilarity matrix using mutual information and further embed variables into low-dimensional feature vectors that preserve the dissimilarity distances among variables. Mutual information 18 quantifies both linear and nonlinear interdependence between two variables $v_i$ and $v_j$, i.e., the ith and jth columns in Fig. 2. Although there are various measures that capture nonlinear correlations among variables, mutual information has the advantage of providing an equitable measure of the statistical association between two variables that is insensitive to the form of the underlying function 19, where equitability means that the statistic gives similar scores to equally noisy relationships of different types 20. It may be noted that mutual information was introduced to cluster nonlinear structures among data samples (e.g., feature vectors of a gene, a company and a movie) by formulating a tradeoff function between the average similarity and the information carried by the cluster identities 21. However, this information-theoretic approach considers nonlinear correlation structures among data samples, rather than variables, by introducing mutual information as a similarity measure. Moreover, the number of clusters was pre-defined in order to solve the tradeoff function.
The mutual information is defined as:

$$MI(v_i, v_j) = \sum_{k} \sum_{l} p(v_{ik}, v_{jl}) \log \frac{p(v_{ik}, v_{jl})}{p(v_{ik})\, p(v_{jl})}$$

where $p(v_{ik}, v_{jl})$ is the joint probability distribution, and $p(v_{ik})$ and $p(v_{jl})$ are marginal probabilities. Figure 4 shows the practical implementation to compute the mutual information with the scatter plot of two variables $v_i$ and $v_j$, and the marginal histogram for each variable. Marginal probabilities $p(v_{ik})$ and $p(v_{jl})$ are computed as the number of points in bins $v_{ik}$ and $v_{jl}$, respectively, divided by the total number of points in the 2-dimensional space, while the joint probability $p(v_{ik}, v_{jl})$ is computed as the number of points in box $(v_{ik}, v_{jl})$ divided by the total number of points in the space. In practice, a large box size leads to an accurate estimation of the average probability, but a flat estimation of the joint probability $p(v_{ik}, v_{jl})$. As such, it underestimates the mutual information $MI(v_i, v_j)$. In contrast, a small box size estimates the joint probability $p(v_{ik}, v_{jl})$ at small scales but brings significant variations, which overestimate the mutual information $MI(v_i, v_j)$. In the present investigation, we choose the number of bins as $N_s/2$, where $N_s$ is the sample size.
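The box-counting estimate described above can be sketched as follows. This is a minimal illustration, not the authors' code: equal-width bins are assumed, the number of bins is left as a free parameter `n_bins` (the value 16 in the demonstration is an arbitrary choice for this example), and the test variables are hypothetical.

```python
import math
import random


def mutual_information(x, y, n_bins):
    """Histogram (box-counting) estimate of MI(x, y) in nats, equal-width bins."""
    def bin_index(value, lo, hi):
        if hi == lo:
            return 0
        k = int((value - lo) / (hi - lo) * n_bins)
        return min(k, n_bins - 1)  # the maximum value falls into the last bin

    n = len(x)
    x_lo, x_hi, y_lo, y_hi = min(x), max(x), min(y), max(y)
    joint = {}          # counts per occupied box (k, l)
    px = [0] * n_bins   # marginal counts for x
    py = [0] * n_bins   # marginal counts for y
    for a, b in zip(x, y):
        k, l = bin_index(a, x_lo, x_hi), bin_index(b, y_lo, y_hi)
        joint[(k, l)] = joint.get((k, l), 0) + 1
        px[k] += 1
        py[l] += 1

    mi = 0.0
    for (k, l), count in joint.items():
        p_kl = count / n
        # p_kl / (p_k * p_l), with the marginals written as counts / n
        mi += p_kl * math.log(p_kl * n * n / (px[k] * py[l]))
    return mi


rng = random.Random(1)
n_samples = 2000
v1 = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
v_square = [a * a for a in v1]                              # hypothetical nonlinear function of v1
v_indep = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # independent of v1

mi_square = mutual_information(v1, v_square, n_bins=16)
mi_indep = mutual_information(v1, v_indep, n_bins=16)
```

The estimate for the quadratic pair is clearly larger than for the independent pair, whose value is small but not exactly zero; that residual is the finite-sample bias that grows as the boxes shrink.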
Once the mutual information is computed for each pair of variables, the dissimilarity matrix among variables is generated. It may be noted that the mutual information is inversely proportional to the dissimilarity. Therefore, we define $\delta_{ij} = 1/MI(v_i, v_j)$ as the dissimilarity measure between the ith and jth variables in the $N \times N$ dissimilarity matrix $\Delta$. Further, an embedding algorithm is developed to transform the dissimilarity matrix into low-dimensional feature vectors that preserve the variable-to-variable dissimilarities. Let $y_i$ and $y_j$ denote the ith and jth feature vectors. The objective function is formulated as:

$$\min_{Y} \sum_{i<j} \left( \|y_i - y_j\| - \delta_{ij} \right)^2$$

where $\|\cdot\|$ is the Euclidean norm. To solve this optimization problem, the Gram matrix $B$ is first reconstructed from the dissimilarity matrix $\Delta$:

$$B = -\frac{1}{2} H \Delta^{(2)} H$$

where $H = I - N^{-1}\mathbf{1}\mathbf{1}^T$ is the centering matrix, $I$ is the identity matrix of size $N$, and $\mathbf{1}$ is a column vector of $N$ ones. $\Delta^{(2)}$ is the matrix of squared dissimilarities, i.e., each element of $\Delta^{(2)}$ is $\delta_{ij}^2$. Then the element $b_{ij}$ of matrix $B$ is:

$$b_{ij} = -\frac{1}{2}\left( \delta_{ij}^2 - \frac{1}{N}\sum_{k=1}^{N}\delta_{ik}^2 - \frac{1}{N}\sum_{k=1}^{N}\delta_{kj}^2 + \frac{1}{N^2}\sum_{g=1}^{N}\sum_{h=1}^{N}\delta_{gh}^2 \right)$$

By the property of the Gram matrix, it is defined as the scalar product $B = YY^T$, where the matrix $Y$ minimizes the aforementioned objective function. It is known that the Gram matrix $B$ is decomposed as:

$$B = V \Lambda V^T = \left(V \Lambda^{1/2}\right)\left(V \Lambda^{1/2}\right)^T$$

where $V$ is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues. Then, the matrix of feature vectors is obtained as $Y = V \Lambda^{1/2}$. As such, each variable is embedded as a feature vector in the low-dimensional space that preserves the dissimilarity matrix.
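The embedding step corresponds to classical multidimensional scaling: double-center the squared dissimilarities to get the Gram matrix, then take its eigendecomposition. The sketch below assumes NumPy and keeps the `n_dims` largest eigenvalues; the function name, the `n_dims` parameter, and the clipping of negative eigenvalues are choices made for this illustration. It is checked on points with known Euclidean distances rather than on mutual-information dissimilarities.

```python
import numpy as np


def classical_mds(delta, n_dims):
    """Embed N objects in n_dims dimensions from an N x N dissimilarity matrix."""
    n = delta.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix H = I - (1/N) 1 1^T
    b = -0.5 * h @ (delta ** 2) @ h              # Gram matrix from squared dissimilarities
    eigvals, eigvecs = np.linalg.eigh(b)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_dims]   # indices of the largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)     # guard: non-Euclidean input can give negatives
    return eigvecs[:, order] * np.sqrt(lam)      # rows are the feature vectors y_i


# Sanity check on four points whose pairwise distances are known exactly.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
delta = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
y = classical_mds(delta, n_dims=2)
recovered = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
```

For Euclidean input the pairwise distances of the embedding reproduce the input matrix, up to rotation and reflection of the coordinates. With reciprocal mutual information as the dissimilarity, the matrix is generally not Euclidean, so the embedding is only an approximation and some eigenvalues may be negative.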
Dirichlet Process for Variable Clustering. Furthermore, we propose to cluster the low-dimensional feature vectors that are embedded from variables. Although K-Means clustering is the most popular algorithm for data clustering 22, it has several drawbacks. First, it is a parametric model and the number of clusters needs to be predefined. For clusters that are not well separated, this may not be straightforward. Second, the K-Means algorithm needs to recalculate the objective function to assign a cluster label to a new variable. Third, the results of K-Means clustering are not unique due to the recalculation of the objective function. Therefore, we introduce nonparametric Dirichlet process (DP) models to cluster variables 23,24. DP models partition the vector space into local clusters, each of which follows a multivariate Gaussian distribution, and assign cluster labels to new observations according to the assignment probability derived from the mean and covariance of each cluster.
The Chinese Restaurant Process (CRP) is an effective representation of DP, which visualizes the clustering effects more explicitly. Figure 5 shows the algorithm and illustration of CRP. Suppose a restaurant has potentially infinitely many tables k = 1, 2, …, and each table has a value $\theta_k$ drawn from the base probability measure $G_0$. Customers are indexed by n = 1, 2, …, N as they arrive, while the indicator variable $c_n = k$ denotes that the nth customer chooses to sit at the kth table. The tables are chosen according to the following random process:
1. The first customer always chooses the first table.
2. The nth customer chooses an occupied table k or a new table with probabilities

$$p(c_n = k \mid c_1, \ldots, c_{n-1}) = \begin{cases} \dfrac{m_k}{n - 1 + \alpha}, & \text{if table } k \text{ is occupied} \\ \dfrac{\alpha}{n - 1 + \alpha}, & \text{if table } k \text{ is a new table} \end{cases}$$

where $\alpha > 0$ is a concentration parameter, and $m_k$ denotes the number of customers seated at the kth table. From the conditional probability distribution above, we can see that a customer is more likely to sit at a table if there are already many people sitting there. However, a customer will sit at a new table with a probability proportional to $\alpha$.
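The seating process can be simulated directly. The following sketch uses the standard CRP rule (an occupied table is chosen with probability proportional to its occupancy, a new table with probability proportional to the concentration parameter); the function name and the fixed seed are choices made for this illustration. Indices are 0-based in the code, so the normalizer `n + alpha` corresponds to n − 1 + α for the nth customer.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table assignments c_n and table occupancies m_k of a CRP."""
    rng = random.Random(seed)
    assignments = []  # table index chosen by each customer (0-based)
    counts = []       # number of customers at each occupied table
    for n in range(n_customers):
        # n customers are already seated, so the weights sum to n + alpha.
        weights = counts + [alpha]
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        table = len(counts)  # fall back to a new table
        for k, w in enumerate(weights):
            cumulative += w
            if r < cumulative:
                table = k
                break
        if table == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[table] += 1    # join an occupied table
        assignments.append(table)
    return assignments, counts


assignments, counts = chinese_restaurant_process(n_customers=200, alpha=1.0)
```

The first customer always opens the first table because the only available weight is `alpha`. Larger values of `alpha` tend to produce more occupied tables, which is how the concentration parameter influences the number of clusters without fixing it.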
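This CRP provides an effective representation for the inference in Dirichlet process mixture models (DPMM). In DPMM, the distribution of indicator variables $c_1, c_2, \ldots, c_N$ given mixing proportions $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ is multinomial:

$$p(c_1, \ldots, c_N \mid \pi) = \prod_{k=1}^{K} \pi_k^{m_k}$$

where $m_k = \sum_{n=1}^{N} \delta(c_n, k)$ is the number of data points in the kth cluster and $\sum_k m_k = N$. Since the Dirichlet distribution is conjugate to the multinomial, we can assume that the mixing proportions $\pi$ for K clusters have a symmetric Dirichlet prior:

$$p(\pi \mid \alpha) = \mathrm{Dir}(\alpha/K, \ldots, \alpha/K) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \prod_{k=1}^{K} \pi_k^{\alpha/K - 1}$$

Then, integrating out the mixing proportions gives:

$$p(c_1, \ldots, c_N \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)}$$

If the total number of clusters K is finite, then the probability that the nth data point belongs to the kth cluster, given all other data points and the concentration parameter $\alpha$, is

$$p(c_n = k \mid c_{-n}, \alpha) = \frac{m_{-n,k} + \alpha/K}{N - 1 + \alpha}$$

where $c_{-n}$ denotes all indices except n, and $m_{-n,k} = \sum_{i \neq n} \delta(c_i, k)$ is the number of data points in the kth cluster excluding the nth data point. If K is infinite, i.e., $K \to \infty$, we can update the posterior indicator distribution using Gibbs sampling as:

$$p(c_n = k \mid c_{-n}, \alpha) = \frac{m_{-n,k}}{N - 1 + \alpha} \ \ (m_{-n,k} > 0), \qquad p(c_n \neq c_i \ \text{for all}\ i \neq n \mid c_{-n}, \alpha) = \frac{\alpha}{N - 1 + \alpha}$$

The distribution for a new variable $y^*$ within a mixture cluster follows the normal distribution

$$p(y^* \mid c^* = k) = \mathcal{N}(y^*; \mu_k, \Sigma_k)$$

where the parameters $\mu_k$ and $\Sigma_k$ are the mean and the covariance for cluster k. As a result, a weight is obtained for each cluster. Due to the nonparametric nature of DP, neither the shape nor the number of clusters needs to be known a priori. Therefore, DP clusters are derived from characteristics inherent to the data.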
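The conditional assignment probabilities can be made concrete with a short sketch. It assumes the standard symmetric Dirichlet(α/K) prior on the mixing proportions, under which the finite-K conditional is (m + α/K)/(N − 1 + α) and the infinite limit gives m/(N − 1 + α) for an occupied cluster and α/(N − 1 + α) for a new one. Only the prior part of the Gibbs update is shown; in a full sampler each term would also be multiplied by the likelihood of the data point under the cluster. The function name and argument names are hypothetical.

```python
def assignment_probabilities(counts_without_n, alpha, n_total, n_components=None):
    """Prior conditional p(c_n = k | c_{-n}, alpha) for one data point.

    counts_without_n: cluster sizes with data point n removed.
    n_components: K for the finite model, or None for the limit K -> infinity,
                  in which case the last returned entry is the new-cluster probability.
    """
    denom = n_total - 1 + alpha
    if n_components is not None:
        return [(m + alpha / n_components) / denom for m in counts_without_n]
    occupied = [m / denom for m in counts_without_n]
    return occupied + [alpha / denom]


# Five data points in total; the other four are split 3 / 1 between two clusters.
finite = assignment_probabilities([3, 1, 0, 0], alpha=1.0, n_total=5, n_components=4)
infinite = assignment_probabilities([3, 1], alpha=1.0, n_total=5)
```

In the finite model the two empty clusters each receive probability 0.05; in the infinite limit the mass of all empty clusters is pooled into a single new-cluster probability of 0.2.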

Predictive Modeling with Clustered Variables. Although the Dirichlet process clusters variables into different groups, the variables in each group are similar to each other and thus bring redundant information. It is necessary to delineate the structure of latent variables hidden in each cluster. As such, homogeneous variables in each cluster are orthonormalized before predictive modeling. Assume we have K clusters and there are $M_k$ variables, i.e., $v_{k1}, v_{k2}, \ldots, v_{kM_k}$, in the k-th cluster. Then, the redundant information within the original variables is removed by orthonormalization. The first vector is normalized as

$$x_{k1} = v_{k1}, \qquad w_{k1} = \frac{x_{k1}}{\|x_{k1}\|}$$

where $w_{k1}$ is the normalized variable of $v_{k1}$. Then, we orthogonalize and normalize the second vector $v_{k2}$ as

$$x_{k2} = v_{k2} - \langle v_{k2}, w_{k1} \rangle\, w_{k1}, \qquad w_{k2} = \frac{x_{k2}}{\|x_{k2}\|}$$

where $w_{k2}$ is the second orthonormalized vector. The process is recursively updated to get the m-th orthogonal vector $x_{km}$:

$$x_{km} = v_{km} - \sum_{j=1}^{m-1} \langle v_{km}, w_{kj} \rangle\, w_{kj}, \qquad w_{km} = \frac{x_{km}}{\|x_{km}\|}$$

where $w_{km}$ is the m-th orthonormalized vector. Further, we leverage the orthonormalized variables in each cluster to develop a group elastic-net model 25, which achieves model sparsity by group-level and individual-level selection of features. The elastic net criterion combines the $\ell_1$ (lasso) and $\ell_2$ (ridge) penalties on the model coefficients $\beta$. To develop the group elastic-net model for logistic regression, we define $h_\beta(w, i)$ as the probability of $z_i$ being a success (i.e., $z_i = 1$), and thus $1 - h_\beta(w, i)$ is the probability of $z_i$ being a failure (i.e., $z_i = 0$). The coefficients are estimated from the penalized likelihood, where $\gamma$ and $\lambda$ are penalization parameters and the logistic function $h_\beta(w, i)$ is used in the likelihood function because of the binary responses. The proposed approach will be evaluated and validated using experimental studies. The details are shown in the next section.
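The recursive normalize-and-orthogonalize procedure described for each cluster is Gram-Schmidt orthonormalization. A minimal sketch, assuming NumPy and a cluster stored as the columns of an array, is given below. The code subtracts projections from the running residual (the modified Gram-Schmidt variant, which is equivalent in exact arithmetic), and the tolerance that skips numerically redundant columns is an addition for this illustration; the three demonstration variables are hypothetical.

```python
import numpy as np


def orthonormalize(cluster_vars, tol=1e-10):
    """Gram-Schmidt on the columns of an (n_samples x M_k) array of clustered variables."""
    basis = []
    for m in range(cluster_vars.shape[1]):
        x = cluster_vars[:, m].astype(float)   # astype returns a copy
        for w in basis:
            x = x - (x @ w) * w                # remove the component along an earlier vector
        norm = np.linalg.norm(x)
        if norm > tol:                         # skip columns that add no new direction
            basis.append(x / norm)
    return np.column_stack(basis)


rng = np.random.default_rng(0)
v1 = rng.standard_normal(100)
cluster = np.column_stack([
    v1,
    2.0 * v1 + 0.1 * rng.standard_normal(100),  # hypothetical noisy linear relative of v1
    v1 ** 2,                                     # hypothetical nonlinear relative of v1
])
w = orthonormalize(cluster)
gram = w.T @ w
```

The columns of `w` are mutually orthogonal with unit norm, so their Gram matrix is the identity up to rounding. Such columns no longer carry overlapping linear information, which is the property the group elastic-net step relies on.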

Experimental Materials and Results
In this section, we evaluate and validate the proposed methodology using both simulation and real-world case studies.
Simulation Study. First, a simulation study is shown to evaluate the performance of the proposed methodology for variable clustering. We simulate four clusters of variables as listed in Table 1. Figure 6a shows the matrix of Pearson's correlations among variables that is computed from the simulated data set. Notably, the linear correlation in Fig. 6a cannot fully identify the nonlinear interdependence among simulated variables. Figure 6b shows that HC cannot delineate the cluster structure of variables. This is mainly due to the fact that Pearson's correlation is limited in its ability to detect nonlinear interdependence structures among variables. Figure 7a shows the mutual-information-based correlation matrix among variables that is computed from the simulated data set. The red color represents a higher nonlinear correlation, while the blue color indicates no interrelationship. Figure 7a shows significant nonlinear correlation within the simulated clusters. Also, variables from different clusters have little interrelationship. If we use the Dirichlet process to cluster variables based on the low-dimensional vectors embedded from the dissimilarity matrix of mutual information, the four clusters of variables are distinctly separated in the space (see Fig. 7b). The simulation study shows that Dirichlet process models effectively cluster these 20 variables into 4 groups and identify the underlying cluster structures of variables.
Table 1. Four clusters of simulated variables. Cluster 4: $v_{17} = v_{16}(t + 10)$, $v_{18} = v_{16}(t + 20)$, $v_{19} = v_{16}(t + 30)$, $v_{20} = v_{16}(t + 40)$. Here $v_1$ and $v_6$ are independent standard normal variables, $v_{11}$ is a nonlinear variable sampled from the logistic map $v_{11}(n + 1) = 3.8\, v_{11}(n)(1 - v_{11}(n))$, and $v_{16}$ is a second-order autoregressive variable that is nonlinearly coupled with $x_{Lorenz}$, where $\varepsilon_n$ is Gaussian noise and $x_{Lorenz}$ is the x-component of a Lorenz system.
Real-world Case Study. In the previous work, we characterized and represented 3-dimensional vectorcardiogram (VCG) signals using a sparse basis function model 26. This sparse representation not only reduces large amounts of data to a limited number of model parameters, but also preserves the signal information. As opposed to the original data, the present paper utilizes the parameters of the basis function models as explanatory variables to predict myocardial infarctions. VCG signals are represented by L superposed basis functions in order to capture intrinsic characteristics of cardiac electrical activity, where $\phi_j$ and $\sigma_j$ are shifting and scaling factors, $\psi_j(\cdot)$ are basis functions, and $\omega_j$ are weight factors, respectively. The objective is to optimize the sparse representation of 3D VCG signals. In order to identify a compact set of basis functions that minimizes the representation error, the number of basis functions L is minimized and the basis functions $\psi$ are optimally placed. Model parameters $\omega$, $\phi$, $\sigma$ are adaptively estimated by "best matching" projections of VCG signals onto a dictionary of nonlinear basis functions. The optimization algorithms of the sparse basis function representation for spatiotemporal VCG signals were detailed in our previous work 26.
In the present study, model parameters, i.e., weight, shifting and scaling factors and residuals, are extracted from the sparse basis function representation of VCG signals, and are then utilized as explanatory variables for the identification of cardiac disorders (i.e., myocardial infarctions). The parameter set is $\{\omega_{3\times L}, \phi_{3\times L}, \sigma_{3\times L}\}$ for L basis functions because there are 3 channels of signals in the 3-lead VCG. Our previous study 26 showed that the model achieves greater than 99.9% goodness-of-fit with a parsimonious set of 20 basis functions for a variety of cardiac conditions. Hence, a total of 180 model parameters are adaptively estimated from the 3D VCG trajectory. In addition, we add other parameters in the present investigation, and the overall feature matrix is:

$$\{\omega_{3\times 20},\ |\omega|_{3\times 20},\ \phi_{3\times 20},\ \sigma_{3\times 20},\ RSS_{3\times 1},\ RR_{1\times 1}\}$$

where $|\omega|_{3\times 20}$ are absolute values of weights, describing the amplitude of each basis function and indicating local strengths of a heartbeat. The residual sum of squares $RSS_{3\times 1}$ measures the discrepancy between the model representation and the VCG signals in each channel. The heart rate $RR_{1\times 1}$ characterizes temporal beat-to-beat variations of cardiac electrical activity. Therefore, these 244 parameter-based features are used to represent the details of the original VCG signals. Notably, the high-dimensional VCG signals are reduced into a parsimonious set of model parameters using the sparse representation without losing clinically important information.
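The feature counts quoted above can be checked with a few lines of arithmetic. The grouping below is one reading that is consistent with the stated totals (180 model parameters and 244 features); the dictionary keys are labels chosen for this illustration.

```python
n_channels = 3   # X, Y and Z leads of the VCG
n_basis = 20     # basis functions per channel

block = n_channels * n_basis      # entries in one 3 x 20 parameter matrix
model_parameters = 3 * block      # weights, shifting factors and scaling factors

feature_counts = {
    "weights": block,
    "absolute_weights": block,
    "shifting_factors": block,
    "scaling_factors": block,
    "residual_sum_of_squares": n_channels,  # one value per channel
    "heart_rate": 1,
}
total_features = sum(feature_counts.values())
```

Four 3 × 20 blocks plus three residual terms and one heart-rate term give the 244 features, and the three estimated blocks alone give the 180 model parameters.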
A total of 388 (79 controls and 309 infarctions) 3-lead VCG signals, available in the PhysioNet Database 27, are used in this investigation. These signals were digitized at a 1 kHz sampling rate with 16-bit resolution over a range of 16.384 mV. Our previous study showed that most of the model-driven parameters (146 of 244 features) differ significantly between healthy controls and diseased conditions, i.e., their Kolmogorov-Smirnov (K-S) statistics are greater than the critical value 0.17 28. In addition, weight factors yield larger K-S statistics than other parametric features. However, the "curse of dimensionality" as well as overfitting problems arise when a large number of predictors is used for predictive modeling. Therefore, the lasso-penalized logistic regression model was utilized to shrink the number of predictors and further identify cardiac disorders (i.e., myocardial infarctions) in our previous study 28.
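The quoted critical value can be related to the group sizes with the usual large-sample approximation for the two-sample K-S test. The significance level is not stated in the text; the sketch below assumes 0.05, which happens to reproduce a value close to 0.17 for 79 controls and 309 infarctions. The function name is hypothetical.

```python
import math


def ks_critical_value(n1, n2, alpha=0.05):
    """Large-sample critical value of the two-sample Kolmogorov-Smirnov statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))   # about 1.358 for alpha = 0.05
    return c_alpha * math.sqrt((n1 + n2) / (n1 * n2))


# Assumed significance level 0.05; group sizes taken from the study description.
critical = ks_critical_value(79, 309)
```

A feature whose K-S statistic exceeds this threshold has empirical distributions for controls and infarctions that differ by more than chance alone would explain at the assumed level.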
Nonetheless, our previous study 28 focused on the relevancy between predictor and response variables, without specifically considering nonlinear interdependence structures among predictor variables. Prior research showed that collinearity (i.e., large correlation between variables) leads to stability problems in predictive models (i.e., increased variances of estimation) 29. The present paper further investigates the nonlinear correlations between variables and then identifies the cluster structures of variables to improve the predictive performance. Figure 8a visualizes the information-based dissimilarity matrix measured among variables. It may be noted that six groups of variables have stronger nonlinear relationships, i.e., $\omega_{3\times 20}$ and $|\omega|_{3\times 20}$, the weights and absolute weights in the X, Y and Z-axis directions. However, few, if any, previous studies have explicitly considered such relationships among variables in the process of predictive modeling. Moreover, the weight factors $\omega_{3\times 20}$ also have strong nonlinear correlation with the absolute weights $|\omega|_{3\times 20}$. Without taking these nonlinear interrelationships into account, predictive models are sensitive to extraneous noise and are limited in their ability to provide an effective prediction of myocardial infarctions. Figure 8b shows the nonparametric Dirichlet process for variable clustering of model-based parametric features. As shown in Fig. 8b, the Dirichlet process clusters all the variables into five groups based on the embedded features from the variable-to-variable dissimilarity matrix of mutual information. Three clusters are shown to be significant, i.e., the weight and absolute weight variables of the X, Y and Z axes, respectively. As a result, homogeneous variables are clustered into subset communities. It may be noted that the result of variable clustering is consistent with the prior knowledge and the variable-to-variable dissimilarity matrix of mutual information.
Figure 9 shows the results of variable clustering by our proposed algorithm and by the information-based clustering. Note that there are 244 variables represented as color markers, and markers with the same color belong to the same cluster. Each row denotes a type of variable. For example, the first row of 20 markers is the weight factors $\omega_{X,1:20}$ in the X-dimension of VCG signals. Figure 9a shows the clustering results for MI-DP clustering (also see Fig. 8b), while Fig. 9b shows the clustering results for the information-based clustering. It may be noted that information-based clustering was designed to cluster data samples rather than variables. We modified the original algorithm in ref. 21 for variable clustering. Because information-based clustering 21 needs the number of clusters to be predefined, we use the same number of clusters identified by our proposed algorithm. Note that Fig. 9 shows slight differences between the clustering results of MI-DP and information-based clustering. Figure 9b shows that a small portion of the Y weights is not accurately clustered by information-based clustering. In addition, some variables such as shifting and scaling factors and residuals are grouped together and cannot be well separated. As such, information-based clustering yields slightly inferior performance of predictive modeling in comparison with the proposed MI-DP approach (also see Fig. 10). Figure 10 shows the comparison of prediction performances of different clustering procedures in the real-world case study. "Without clustering" represents the results from the lasso-penalized logistic regression model in our previous study 28. "HC clustering" denotes the hierarchical clustering with linear correlation measured between variables. "Information clustering" is the information-based clustering from the literature 21. "MI-DP clustering" is the proposed information-theoretic approach for variable clustering using mutual information and Dirichlet Process Mixtures.
As shown in Fig. 10, MI-DP clustering yields better performance than "Without clustering". MI-DP clustering improves the predictive accuracy from 89.50% to 95.84%, the sensitivity from 94.33% to 97.56%, and the specificity from 84.80% to 93.78%. In addition, MI-DP clustering yields smaller standard deviations of the performance metrics (i.e., accuracy, sensitivity, and specificity) than "Without clustering". Similarly, the results of MI-DP clustering are better than those of "HC clustering" (i.e., accuracy 93.07%, sensitivity 96.05% and specificity 90.18%) and "Information clustering" (i.e., accuracy 95.38%, sensitivity 97.02% and specificity 93.33%). The experimental results show that MI-DP clustering effectively delineates the nonlinear correlation structures among variables and derives homogeneous groups of variables, thereby improving the prediction performance.
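For readers less familiar with the three performance metrics, the sketch below shows how they are commonly computed from confusion-matrix counts. It assumes the standard definitions of accuracy, sensitivity, and specificity. The function name `classification_metrics` and the counts are hypothetical and are not taken from the case study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of positive cases that are detected
    specificity = tn / (tn + fp)   # fraction of negative cases correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only; not results from the paper.
acc, sens, spec = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"accuracy={acc:.2%} sensitivity={sens:.2%} specificity={spec:.2%}")
```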

Discussion and Conclusions
Advanced sensing and real-time data acquisition have led to the proliferation of big data. This provides an unprecedented opportunity to advance data-driven knowledge discovery. However, big data commonly involves large numbers of variables with complex interdependence structures, which poses significant challenges to traditional modeling strategies. To tackle these challenges, variable selection and variable clustering are widely used in the literature. Nonetheless, variable selection focuses primarily on the relevancy between predictors and response variables, but does not explicitly consider the redundancy among variables. Variable clustering, on the other hand, focuses on the linear relevancy between variables. There is a need to develop new methodologies to improve the effectiveness and efficiency of variable clustering and predictive analytics.
[Figure 10 caption, continued] ... group elastic-net model with hierarchical clustering using linear correlation measured between variables; "Information clustering": information-based clustering 21 ; "MI-DP clustering": group elastic-net model with variable clustering using mutual information and Dirichlet Process Mixtures.
Scientific Reports | 6:38913 | DOI: 10.1038/srep38913
The computational complexity of MI-DP clustering consists of three components, namely the measurement of mutual information, low-dimensional embedding, and DP variable clustering. First, mutual information is measured among $N(N-1)/2$ pairs of variables. The computational complexity for one pair of variables is $O((\#\ \text{of bins})^2)$. Second, the complexity of low-dimensional embedding is shown to be $O(N\sqrt{N})$ in the literature 30 . Third, the Dirichlet process allocates each variable to a cluster with a computational complexity of $O(N)$. In the present case studies, computational complexity does not pose significant challenges. However, it is worth mentioning that a promising research direction is to design efficient algorithms for computing the mutual information (MI) between all pairs of variables, which would significantly improve the performance of the MI-DP approach for big data applications.
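As a rough illustration of the first component, the number of variable pairs can be worked out for the N = 244 variables of the real-world case study. The symbol b for the number of histogram bins and the value b = 10 are assumptions introduced here for illustration only; they are not values reported in the paper.

```latex
\begin{align*}
\text{number of pairs} &= \frac{N(N-1)}{2} = \frac{244 \times 243}{2} = 29{,}646, \\
\text{cost of all MI estimates} &\propto \frac{N(N-1)}{2}\, b^{2}
  = 29{,}646 \times 10^{2} \approx 2.96 \times 10^{6}
  \quad (b = 10,\ \text{assumed}).
\end{align*}
```

The quadratic growth in N of this term is consistent with the remark that efficient pairwise MI computation is the main lever for scaling the approach to larger variable sets.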
This paper presents a new information-theoretic approach for variable clustering and predictive modeling using Dirichlet process mixtures. The methodology investigates both redundancy and relevancy among variables to improve the performance of predictive modeling. Both simulation and real-world case studies demonstrate that the proposed MI-DP clustering algorithm not only outperforms traditional methods (i.e., lasso-penalized variable selection and classical hierarchical clustering), but also identifies nonlinear interdependence structures among variables and further improves the performance of predictive modeling. The MI-DP variable clustering methodology is generally applicable to predictive modeling in many disciplines that involve a large number of highly redundant variables. In future work, we will also consider integrating the proposed MI-DP clustering algorithm with a co-clustering approach to investigate the nonlinear interdependence among subsets of both samples and variables.