A Matrix Iteration Algorithm With Pruning for Pinpointing Multivariate Correlations From High Dimensional Data Sets

High dimensional data sets usually contain only a few dependent multivariate relationships, and identifying the dependent variables is an important issue in data analysis. The most frequently used approach is the enumeration method, in which all multivariate relationships in the data set are examined. However, the time complexity of the enumeration method is exponential ($2^{n}$), so the computational load is very heavy when the dimension is high. To solve this problem, a matrix iteration algorithm with pruning (MIP) is proposed for pinpointing multivariate dependent relationships in high dimensional data sets without examining all multivariate relationships. The pruning process of MIP discards some non-dependent relationships without examining them, which reduces the computing burden. The maximal information coefficient (MIC) is adopted as the measure of correlation in MIP because of its two excellent properties, generality and equitability. For a data set with 5 variables, more than 50% of the multivariate relationships are pruned without being examined. Numerical experiments also show that the computational burden is greatly reduced. Compared to the enumeration method, MIP saves 82.5% of the computing time and 98.5% of the MIC evaluations of multivariate relationships for a data set with two dependent multivariate relationships among 30 variables. The proposed MIP algorithm is effective for pinpointing multivariate dependent relationships in high dimensional data sets.


I. INTRODUCTION
Pinpointing correlated stochastic quantities in high dimensional data sets is a very important issue. With the wide application of information technologies, a great variety of information can be obtained. The era of big data is approaching, with the amount of data growing at a rate of fifty percent a year [1]. Besides velocity (the speed of data in and out), variety (the range of data types and resources) and veracity (an indication of data integrity and of the ability of an organization to trust the data and confidently use it to make crucial decisions) [2], [3], volume, the amount of data, is also an important property of big data, and high dimension is an important feature of large volume.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
In the exploratory analysis of data sets with hundreds or thousands of dimensions (stochastic quantities, variables), the first step may be to identify promising bivariate or multivariate correlations for further research. To identify correlations among stochastic quantities, a natural approach, the enumeration method, is to compute a measure of dependence for every relationship and then rank the relationships by their measure values. The highest-ranking relationships are the ones we want to identify. For bivariate correlations this natural approach may be appropriate, since only the measures of dependence of all pairs of stochastic quantities are computed. For multivariate correlations, however, if the interrelations among different groups of stochastic quantities are ignored and all measures are computed in the same way, the computational workload increases exponentially because of the combinations of variables. For example, given k stochastic variables (the dimension of the data set is k), $C_k^2 = k(k-1)/2$ bivariate relationships are examined, while $3C_k^3 + \cdots + kC_k^k > 3(2^k - C_k^0 - C_k^1 - C_k^2)$ multivariate relationships should be examined. Therefore, if the natural approach used for bivariate correlations, that is, calculating the measures of dependence of all relationships and ranking them, is adopted, it is infeasible to identify multivariate correlations in data sets with hundreds or thousands of dimensions. The reason is that every multivariate relationship is examined even if it contains independent variables. For high dimensional data sets, avoiding the calculation of measures for independent multivariate relationships as far as possible, and efficiently identifying multivariate correlations among a very large number of relationships, is an important task.
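The count above can be checked with a short sketch. It assumes, as in the text, that a bivariate relationship is an unordered pair and that a relationship with j >= 3 variables is examined once for each choice of target variable. The function names are illustrative and are not taken from the paper.

```python
from itertools import combinations
from math import comb


def enumerate_relationships(variables):
    """Yield every relationship the enumeration method would examine.

    Bivariate relationships are unordered pairs; a relationship with j >= 3
    variables is written (target, predictors), one for each choice of target.
    """
    for pair in combinations(variables, 2):
        yield pair
    for j in range(3, len(variables) + 1):
        for subset in combinations(variables, j):
            for target in subset:
                predictors = tuple(v for v in subset if v != target)
                yield (target, predictors)


def count_relationships(k):
    """Closed-form count: C(k,2) + sum over j = 3..k of j * C(k,j)."""
    return comb(k, 2) + sum(j * comb(k, j) for j in range(3, k + 1))


def multivariate_lower_bound(k):
    """The bound 3 * (2^k - C(k,0) - C(k,1) - C(k,2)) on the multivariate part."""
    return 3 * (2 ** k - comb(k, 0) - comb(k, 1) - comb(k, 2))


if __name__ == "__main__":
    names = ["x1", "x2", "x3", "x4", "y"]
    print(len(list(enumerate_relationships(names))))  # 65 for five variables
    print(count_relationships(30))                    # grows exponentially with k
```

For five variables the enumeration yields 65 relationships, which matches the count used later in the case study; the bound on the multivariate part is strict for k >= 4.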
There are many measures of correlation. The maximal information coefficient (MIC) was proposed by Reshef et al. [4] on the basis of mutual information [6], [8]. Compared to other measures, MIC has two excellent properties: generality and equitability. Generality means that, with sufficient sample size, MIC can discover a wide range of interesting functional and non-functional associations. Equitability means that MIC gives similar scores to equally noisy relationships of different types. However, the algorithm proposed by Reshef et al. [4], [5] can only identify bivariate correlations, and its computation time is long. To address these problems, Shao et al. [10] designed a fast algorithm for calculating the MIC of multiple variables. In this paper, MIC is adopted as the measure of dependence among multiple variables. Other measures of dependence can also replace MIC in the matrix iteration algorithm with pruning.
However, if the fast algorithm [10] is applied directly to identifying multivariate correlations in high dimensional data sets, that is, if the MIC of every multi-variable relationship is calculated, the computational workload is still very heavy. To solve this problem, the interrelations among multivariate relationships are exploited: some multivariate relationships that are not truly correlated are filtered out without calculating their measure values, and an efficient algorithm named the matrix iteration algorithm with pruning (MIP) is designed for pinpointing multivariate correlations in high dimensional data sets. The contributions of this paper are as follows.
Firstly, with the proposed MIP algorithm, multivariate correlations can be pinpointed among a very large number of multivariate relationships without calculating the measure values of all of them. Many multivariate relationships that are not truly correlated are filtered out (the pruning process), so multivariate correlations can be pinpointed among massive numbers of relationships with a relatively small computational workload.
Secondly, besides the pruning operation that improves computational efficiency, the MIP algorithm can also give the exact form of a multivariate dependent relationship. Given k variables $x_1, x_2, \ldots, x_k$, there are k multivariate relationships, namely those between the variable $x_i$ ($i = 1, 2, \ldots, k$) and the variables $x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k$. The MIP algorithm can give the specific form of the multivariate correlation, that is, it can determine the variable $x_i$.
Thirdly, if another measure of dependence replaces MIC, the algorithm can still be used for pinpointing multivariate correlations. The proposed MIP algorithm can therefore be regarded as an algorithmic framework with good extensibility and portability.
The paper is organized as follows. In Section II, related research work is reviewed. In Section III, the matrix iteration algorithm with pruning (MIP) for pinpointing multivariate correlations is proposed, building on the fast algorithm [10] for calculating the MIC value of a multi-variable relationship. A simple case of the MIP algorithm is given in Section IV. Section V gives some experimental results. Lastly, conclusions and future work are given in Section VI.

II. LITERATURE REVIEW
Measures of variable dependence can be roughly divided into the following four categories: grid-based method, mutual information estimation, distance/kernel-based statistics and correlation-based methods [9]. These measures are summarized in Table 1.
Grid-based method. The grid-based method explores the space of all possible grids drawn on the sampled data, assigns a score to each grid, and aggregates these scores. Normally, the value of the correlation coefficient is equal to the maximal value of the aggregated scores. The maximal information coefficient (MIC) [4] is the maximal value of the normalized mutual information scores. However, it is difficult to compute the MIC value efficiently over all grids. Fast approximate algorithms for computing the MIC values of bivariate and multivariate relationships were therefore proposed by Shao et al. [10], [11]. Albanese et al. [12] developed a practical tool for detecting associations in big data sets. HHG [13] explores the three-by-three grids defined by pairs of data points and uses as its score Pearson's $\chi^2$ test statistic computed on two-by-two contingency tables derived from the three-by-three grids. However, HHG is not distribution free. The coefficient $S_{DDP}$ [15] explores many more grids, and its score is the normalized mutual information.
Mutual information estimation. Because of its information-theoretic background, mutual information differs from the other measures of dependence of random variables. Its theoretical advantages derive from its close tie to Shannon entropy. The aim is to estimate mutual information from the data set alone, without knowing the densities of the random variables (including their joint density). The simplest and very crude approximations to mutual information are based on cumulant expansions [6]. However, these approximations are valid only for distributions close to Gaussian. Expressions obtained by entropy maximization using averages of functions of the data set are more robust [6]. Estimations based on explicit parametrization of the densities are useful but less efficient [7]. Promising methods are based on kernel density estimators [6], and Kraskov et al. estimate mutual information from k-nearest neighbor statistics [6].
Distance/kernel-based statistics. The distance correlation (dCor) [16], which is defined analogously to the ordinary correlation, uses the distance variance and covariance based on pairwise distances between points. Going a step further, the distance covariance has been extended to metric spaces of negative type, of which Euclidean spaces are a special case [17]. In addition to distance criteria, there are kernel-based measures, which are formulated by embedding probability distributions into reproducing kernel Hilbert spaces [18]. The Hilbert-Schmidt independence criterion (HSIC) is a more general statistic in reproducing kernel Hilbert spaces, of which dCor is a special case [19].
Correlation-based methods. Pearson's correlation coefficient was the first such coefficient, and many kinds of coefficients have been proposed since. The maximal correlation [20] may be the best-known correlation-based method; it searches for arbitrary measurable functions that maximize the coefficient and is hard to compute in general. However, the approximate method of alternating conditional expectations is widely used [21]. A more recent method, the randomized dependence coefficient (RDC) [22], is an estimate of the Hirschfeld-Gebelein-Rényi maximum correlation coefficient (HGR) and is defined as the largest canonical correlation between random non-linear projections of the respective empirical copula transformations of the variables.
Besides the above taxonomy, statistical correlation coefficients can also be divided into two categories: bivariate correlation coefficients and multivariate correlation coefficients. All of the above correlation coefficients can serve as measures of dependence between two random variables. However, big data contains many multivariate correlations, and relatively few correlation coefficients can be used to detect them [14], [23], [24].
Compared to other measures, the maximal information coefficient (MIC) has two excellent properties: generality and equitability. Generality means that, with sufficient sample size, MIC can capture a wide range of interesting associations, not limited to specific functional types. Equitability means that MIC gives similar scores to equally noisy relationships of different types. Reshef et al. (2011) [4], [25] first proposed MIC, which can only detect bivariate correlations. Shao et al. [10] further defined the MIC of multiple random variables and designed a fast approximate algorithm for calculating it. However, if the fast algorithm is applied directly (that is, by the enumeration method) to identifying dependent multivariate relationships in data sets with thousands of variables or more, dependent multi-variable relationships cannot be pinpointed effectively.

III. THE MATRIX ITERATION ALGORITHM WITH PRUNING
It is infeasible to examine all multivariate relationships in data sets with hundreds or thousands of dimensions in limited time. Therefore, in III-B, the efficient matrix iteration algorithm with pruning (MIP) is designed to avoid calculating some non-dependent multivariate relationships (the pruning process) and thereby reduce the computational load of detecting multivariate correlations. In this paper, the maximal information coefficient (MIC) is selected as the measure of dependence of variables because of its two excellent properties: generality and equitability. In III-A, the definition of MIC and the related algorithms are introduced.

A. INTRODUCTION TO THE MAXIMAL INFORMATION COEFFICIENT (MIC)
In this subsection, the definition of MIC [4], the approximate algorithm [4] calculating MIC values of bivariate relationships and the fast algorithm [10] calculating MIC values of multivariate relationships are introduced.
Given a finite data set D with two variables, x and y, the x-values of D are partitioned into s bins and the y-values are partitioned into t bins. An s-by-t grid G is thus obtained. $D|_G$ is the distribution induced by the points of D on the cells of G. Given a fixed data set D, different grids with the same or different numbers of partitions of the x-values and y-values result in different distributions. Specifically, in reference [4], the maximal information coefficient of two random variables is given in Definitions III.1-III.3.
Definition III.1 [4] Given a finite data set $D \subset \mathbb{R}^2$ and positive integers s, t, define $I^*(D, s, t) = \max_G I(D|_G)$, where $I(D|_G)$ is the mutual information of the distribution $D|_G$ and the maximum is over all grids G with s columns and t rows, that is, the first variable in the data set D is divided into s partitions and the second one is divided into t partitions.
Definition III.2 [4] The characteristic matrix M(D) of the data set $D \subset \mathbb{R}^2$ is a finite matrix with entries $M(D)_{s,t} = \frac{I^*(D, s, t)}{\log \min\{s, t\}}$.
Definition III.3 [4] The MIC of the data set $D \subset \mathbb{R}^2$ with sample size n and grid size less than B(n) is given by $\mathrm{MIC}(D) = \max_{st < B(n)} \{M(D)_{s,t}\}$.
From the above definitions, it can be seen that the most important task is calculating the mutual information of the distribution $D|_G$. The space of grids that must be searched grows exponentially with the number of data points. For computational efficiency, an approximate algorithm [4] was therefore designed to calculate an approximate MIC value by dynamic programming. Intuitively, an equal division of the axes may lead to the maximal value of MIC [4]. Hence, the approximate algorithm first partitions one variable of the data set equally. For the other axis, some candidate partition clumps are given to reduce the computational load, and the partition of that axis is then obtained by dynamic programming. With the partitions of the two axes, the grid on the scatter plot of the two variables is obtained. Another important element of the above definitions is the maximal grid size $B(n) = n^{\alpha}$. Setting B(n) too high can lead to non-zero scores even for random data, while setting B(n) too low means that only simple patterns are searched for. To balance these competing considerations, the parameter α is usually set to 0.6 according to practical experience [4].
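To make Definitions III.1-III.3 concrete, the following minimal sketch evaluates the normalized mutual information on equal-frequency grids only and takes the maximum over all grid sizes with $st < n^{\alpha}$. This is a simplifying assumption: it is neither the dynamic programming algorithm of [4] nor the fast adaptive-MIC algorithm of [10], and because it searches only a subset of the grids it can only underestimate the true MIC. All function names are illustrative.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def grid_mutual_information(xb, yb):
    """Mutual information (in nats) of the distribution induced by a grid."""
    n = len(xb)
    joint = Counter(zip(xb, yb))
    px = Counter(xb)
    py = Counter(yb)
    mi = 0.0
    for (i, j), c in joint.items():
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi


def approx_mic(x, y, alpha=0.6):
    """Max over s*t < n**alpha of I / log(min(s, t)), equal-frequency grids only."""
    n = len(x)
    b = n ** alpha
    best = 0.0
    s = 2
    while s * 2 < b:
        xb = equal_frequency_bins(x, s)
        t = 2
        while s * t < b:
            yb = equal_frequency_bins(y, t)
            score = grid_mutual_information(xb, yb) / math.log(min(s, t))
            best = max(best, score)
            t += 1
        s += 1
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(500)]
    noise = [rng.random() for _ in range(500)]
    print(approx_mic(xs, [5 * v + 1 for v in xs]))  # noiseless linear relationship
    print(approx_mic(xs, noise))                    # independent variables
```

A noiseless monotone relationship scores close to 1, while independent samples score close to 0, which is the behavior the threshold ε relies on later.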
The difficulty in applying the approximate algorithm [4] directly to multivariate correlations is how to partition multiple variables. To solve this problem, Shao et al. [10] first gave a definition of MIC for multiple variables and then solved the partitioning problem by means of a clustering algorithm in their fast adaptive-MIC algorithm. Compared to the approximate algorithm [4], the fast adaptive-MIC algorithm can calculate the MIC values of bivariate and multi-variable relationships in a very short time.
However, if the fast adaptive-MIC algorithm [10] is applied directly (that is, by the enumeration method) to detecting multivariate correlations in data sets with hundreds or thousands of dimensions, the computation time is very long, since most of the time is spent on random relationships. The matrix iteration algorithm with pruning (MIP) is therefore proposed for efficiently detecting multivariate correlations in high dimensional data sets.

B. THE PROPOSED MATRIX ITERATION ALGORITHM WITH PRUNING (MIP)
The matrix iteration algorithm with pruning (MIP) proposed in this paper can precisely and efficiently identify the existing multivariate correlations. It is based on MIC and the fast adaptive-MIC algorithm [10]. If new and better measures of dependence appear, MIC and the fast adaptive-MIC algorithm can be replaced by those measures and their corresponding algorithms, respectively.
The fast adaptive-MIC algorithm [10] can calculate the MIC values of relationships with VN (VN ≥ 2) variables in the data set D. Firstly, a variable Y is selected from the data set D, and the remaining VN − 1 variables, $X_1, X_2, \ldots, X_{VN-1}$, are treated as a whole. Secondly, the variables $X_1, X_2, \ldots, X_{VN-1}$ and the variable Y are clustered into x and y partitions, respectively, by the bisecting k-means clustering algorithm. Thirdly, the MIC value of the VN-variable relationship is calculated according to the definition of MIC.
There are at least two problems if the adaptive-MIC algorithm is applied directly (that is, by the enumeration method) to detecting multivariate correlations in big data with hundreds or thousands of variables. Firstly, the number of examined multi-variable relationships increases exponentially with the number k of variables in the data set (see the count in Section I). As a consequence, the computation time is very long when the number of variables is large. Secondly, some multivariate correlations with relatively low MIC values may be missed. The reason is as follows. Although MIC gives similar scores to equally noisy relationships of different types (the equitability property), there are slight differences among the MIC values of relationships of different types, and the MIC values of multivariate relationships may be lower than those of bivariate relationships. For example, the MIC value of a linear relationship might be higher than that of a sine/linear relationship for data sets of the same scale.
Consider the two relationships $y = \sin(4\pi x_1^2 x_3^2) + x_1 + x_3$ and $x_4 = 5x_2 + 1$ used in the case study of Section IV. A data set D with the five variables $x_1, x_2, x_3, x_4, y$ is generated according to the definitions of these two relationships, and the scale of D is 20000. If the enumeration method is adopted, that is, the adaptive-MIC algorithm [10] is applied directly to detecting multivariate correlations in D, the MIC values of all $C_5^2 + 3C_5^3 + 4C_5^4 + 5C_5^5 = 65$ multi-variable relationships in D are calculated. The highest MIC values, those of the relationships $(x_2, x_4)$, $(x_4, (x_1, x_2))$ and $(y, (x_1, x_3))$, are 0.9988, 0.9769 and 0.9178, respectively. According to these MIC values, it would seem that the multi-variable relationship $(x_4, (x_1, x_2))$ is more important than the relationship $(y, (x_1, x_3))$. However, this conclusion is not correct.
VOLUME 9, 2021
In fact, according to the process that generated the data set, the relationships $(x_2, x_4)$ and $(y, (x_1, x_3))$ are the most important ones and should be identified, while the relationship $(x_4, (x_1, x_2))$ should be excluded. This example shows that, if all relationships are merely ranked by MIC value, some unimportant relationships may receive more attention, and a lot of time is wasted on calculating their MIC values. An efficient algorithm is needed to exclude these unimportant relationships and to precisely identify the existing multivariate dependent correlations.
Aiming at solving these problems, the matrix iteration algorithm with pruning (MIP) is designed and the intuitions of the designed efficient MIP algorithm are as follows.
Firstly, pruning. A new relationship obtained by adding independent variables to an existing dependent relationship is given a lower MIC value than the existing dependent relationship, and a new relationship obtained by adding variables to an independent relationship is also not a relationship we are looking for, even if its MIC value is high. For example, if the variable y is dependent on the variables $(x_1, x_3)$, that is, the relationship $(y, (x_1, x_3))$ is dependent, then the MIC value of $(y, (x_1, x_3))$ is higher than those of the relationships $(y, (x_1, x_2, x_3))$, $(y, (x_1, x_3, x_4))$ and $(y, (x_1, x_2, x_3, x_4))$, where the variables $x_2$, $x_4$ are independent of the variables y, $x_1$ and $x_3$. Likewise, $(y, x_2)$ is a relationship between the independent variables $x_2$ and y, and the relationship $(y, (x_1, x_2))$ can be seen as adding the variable $x_1$ to the relationship $(y, x_2)$.
Secondly, identifying dependent multivariate correlations. Given a multivariate relationship, new relationships can be obtained by adding one of the remaining variables to it. If the MIC values of all the new relationships are lower than that of the given multivariate relationship, the given multivariate relationship is regarded as a multivariate correlation.
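The identification rule can be sketched as follows. The sketch assumes that a relationship is stored as a pair (target, sorted tuple of predictors), that MIC values are available in a dictionary, and that new relationships are formed only by adding a variable to the predictor side; these are simplifying assumptions of the example, not details taken from the paper. The MIC values in the usage part are invented purely for illustration.

```python
def extensions(relationship, remaining):
    """Yield the relationships obtained by adding one remaining variable."""
    target, predictors = relationship
    for v in remaining:
        yield (target, tuple(sorted(predictors + (v,))))


def is_multivariate_correlation(relationship, mic, remaining):
    """True if every one-variable extension has a lower MIC than the relationship."""
    base = mic[relationship]
    return all(mic[ext] < base for ext in extensions(relationship, remaining))


if __name__ == "__main__":
    # Hypothetical MIC values, invented purely for illustration.
    mic = {
        ("y", ("x1", "x3")): 0.92,
        ("y", ("x1", "x2", "x3")): 0.85,
        ("y", ("x1", "x3", "x4")): 0.84,
    }
    print(is_multivariate_correlation(("y", ("x1", "x3")), mic, ["x2", "x4"]))
```

In the matrix form used by MIP, the same comparison is made column by column: the MIC value in the header of a column is compared with all entries of that column.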
For clarity, the following symbols are introduced. $(x_i, (x_j, x_{j+1}, \ldots, x_{j+k-1}))$ is the multivariate relationship between the variable $x_i$ and the k variables $x_j, x_{j+1}, \ldots, x_{j+k-1}$, and its corresponding MIC value, which is calculated by the fast adaptive-MIC algorithm [10], is denoted by $MIC_{(x_i,(x_j,x_{j+1},\ldots,x_{j+k-1}))}$. For clarity, relationships with k variables (in this table, k = 2) and the corresponding MIC values are added above the designed matrix IA, and the relationships are also added on the left of the matrix IA. The elements $IA_{p,q}$ of the table body (the matrix IA) are MIC values of relationships with k + 1 variables (see Table 2).
Based on the above intuitions, the designed matrix iteration algorithm with pruning (MIP) is presented in Algorithm 2. In MIP, the most important procedure is the pruning process which is presented in Algorithm 1.
In Algorithm 1, RS is the set of relationships with i − 1 variables in the first line above the matrix IA, and IV is the set of independent relationships with i − 1 variables. Select two relationships, $R_p$ and $R_q$, from the set RS. If the number of different variables in $R_p$, $R_q$ is equal to 0, or equal to or larger than 2, no new relationship is generated from $R_p$, $R_q$, and the element $IA_{pq}$ = − does not need to be calculated (lines 2-3). If the number of different variables in $R_p$, $R_q$ is equal to 1, new relationships can be generated from $R_p$, $R_q$. There are three cases (line 5).
Algorithm 1 PrunPro(RS, IV): the pruning process of the proposed MIP algorithm
Require: The set of relationships RS, the set of independent relationships IV.
1: Select the p-th and q-th relationships, $R_p$, $R_q$, from RS.
2: if the number of different variables in $R_p$, $R_q$ is equal to 0, or equal to or larger than 2 then
3: $IA_{pq}$ = −, does not need to be calculated.
4: else
5: /* There is only one different variable in relationships …
Without loss of generality, let the p-th and q-th relationships be $(x_{i-1}, (x_1, x_2, \ldots, x_{i-2}))$ and $(x_i, (x_1, x_2, \ldots, x_{i-2}))$, respectively.
Case 1. If $(x_{i-1}, (x_1, x_2, \ldots, x_{i-2})) \in IV$ and $(x_i, (x_1, x_2, \ldots, x_{i-2})) \in IV$, prune completely. The MIC values of the relationships $(x_j, (x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_i))$, $j = 1, 2, \ldots, i$, do not need to be calculated, and neither does the element $IA_{pq}$.
Algorithm 2 presents the whole MIP algorithm. The data set D has n points and v variables. VR is the maximal number of variables in the detected relationships. The critical value ε is used for measuring the dependence of relationships: if the MIC value of a relationship is less than ε, the relationship is regarded as independent. DR and IV are the sets of identified dependent relationships and independent relationships, respectively. The counter i is the number of variables in the current relationships.
In Algorithm 2, the initialization process is presented in lines 1-7. All MIC values of bivariate relationships are calculated and the independent bivariate relationships are identified. The sets RS and IV are obtained, and then the initial matrix IA is obtained by the pruning process PrunPro(RS, IV). The specific procedure is as follows. Select two relationships, the p-th and q-th bivariate relationships $R_p$, $R_q$, from RS. If there are two different variables between $R_p$ and $R_q$, the element $IA_{pq}$ does not need to be calculated. If there is only one different variable between the p-th and q-th bivariate relationships in the set RS, such as $(x_l, x_m)$ and $(x_l, x_s)$, the situation is divided into three cases.
After the initialization procedure, the next step is the iterative loop for calculating the matrix IA in lines 8-25. The set DR of dependent relationships is updated in lines 8-9: the k-th relationship $R_k$ in the first row above IA is added into DR if the MIC of $R_k$ is larger than all MIC values ($IA_{\cdot k}$) in the k-th column. In lines 12-17, the set IV of independent relationships is updated. With the sets RS and IV, Algorithm 1 is called to calculate the iteration matrix IA.
If the set of relationships above the new matrix IA is empty, stop and return the dependent relationships in DR and the independent relationships in IV; otherwise, calculate the elements $IA_{pq}$ of the matrix IA with pruning.
If the number of different variables between the p-th and q-th relationships in the first row of the table header is equal to 0, or equal to or larger than 2, there is no need to calculate the element $IA_{pq}$. Otherwise (the number of different variables between the p-th and q-th relationships is equal to 1), the element $IA_{pq}$ is calculated with pruning, similarly to Step 1.4. Without loss of generality, let the p-th and q-th relationships be $(x_{i-1}, (x_1, x_2, \ldots, x_{i-2}))$ and $(x_i, (x_1, x_2, \ldots, x_{i-2}))$, respectively. There are also three cases.
The proposed MIP algorithm thus contains a pruning process: the relationships obtained by adding a variable to the existing dependent (independent) relationships in the set DR (IV) are pruned. At last, the dependent multi-variable relationships and the independent relationships in the data set D are pinpointed.
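The decision of whether a matrix element needs to be calculated can be sketched as follows. The sketch covers only what the text states explicitly: the test on the number of different variables and the complete pruning of Case 1. The label "evaluate" is a placeholder for the remaining cases, which are not spelled out here. The representation of relationships and all function names are assumptions of this example.

```python
def variables_of(relationship):
    """Flatten (target, (p1, p2, ...)) or a bivariate pair into a set of variables."""
    target, rest = relationship
    if isinstance(rest, tuple):
        return {target, *rest}
    return {target, rest}


def num_different_variables(rp, rq):
    """Number of variables of rp that do not appear in rq (equal-size relationships)."""
    return len(variables_of(rp) - variables_of(rq))


def pruning_decision(rp, rq, independent):
    """Return 'skip', 'prune' or 'evaluate' for the matrix element IA[p][q]."""
    if num_different_variables(rp, rq) != 1:
        return "skip"      # 0, or 2 or more, different variables: IA_pq = '-'
    if rp in independent and rq in independent:
        return "prune"     # Case 1: both relationships are in IV
    return "evaluate"      # placeholder for the cases not covered by this sketch


if __name__ == "__main__":
    iv = {("x1", "x2"), ("x1", "x3")}
    print(pruning_decision(("x1", "x2"), ("x1", "x3"), iv))  # prune
    print(pruning_decision(("x1", "x2"), ("x3", "x4"), iv))  # skip
    print(pruning_decision(("x2", "x4"), ("x2", "x1"), iv))  # evaluate
```

Only pairs labelled "evaluate" would lead to calls of the MIC routine, which is where the savings over the enumeration method come from.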
In the proposed MIP algorithm (Algorithm 2), there are two important parameters, i and ε. The integer i counts the number of variables in the relationships currently being calculated. The integer VR is the largest number of variables in the relationships we want to detect. Another important parameter is ε. If ε is set too large, many relationships that are not independent are added into the independent relationship set IV. In the extreme, all relationships are deemed independent and dependent relationships cannot be identified. However, if ε is set too small, the independent relationship set IV will end up empty and all independent relationships are deemed dependent, so that the MIP algorithm performs no pruning. In our opinion, the parameter ε should be set a little higher than the MIC value of two random variables at the same scale. Reference values of ε at different scales are given in Table 3.
The time complexity of MIP is related to the number of independent relationships in the data set. If there are many dependent relationships, that is, the ratio of the number of dependent relationships to that of independent relationships is relatively high, the computational complexity of MIP is high; if all relationships in the data set are dependent, the complexity of MIP equals $2^n$, which is the time complexity of the enumeration method.
Algorithm 2 (excerpt)
9: if the k-th MIC value of the matrix header is larger than $\max\{IA_{jk}, j = 1, 2, \ldots, m\}$ then
10: The k-th relationship $R_k$ of the matrix header is added into the set DR
13: Select two elements (relationships) from $IV_t$, $R_p$ = …, $R_q$ = ….
14: if the variables in the first part of $R_p$, $R_q$ are the same and there is only one different variable between the second parts, for example, $R_p = (x_{k_1}, (x_{k_2}, x_{k_3}, \ldots, x_{k_i}))$, $R_q = (x_{k_1}, (x_{k_2}, x_{k_3}, \ldots, x_{k_{i-1}}, x_{k_{i+1}}))$ then
15: …
16: …
However, if dependent relationships are sparse in the data set, that is, the ratio of the number of dependent relationships to that of independent relationships is close to 0, the complexity of MIP is much lower than $2^n$. In practice, the number of dependent relationships in high dimensional data sets is very small. The proposed MIP algorithm is therefore scalable and suitable for pinpointing dependent relationships in such data sets.
The MIP algorithm has the following advantages. Firstly, because of its pruning procedure, the MIC values of some unimportant relationships are not calculated, which reduces the computational workload. Secondly, the MIP algorithm can precisely identify the dependent and independent relationships in data sets with many variables. Relationships that can be regarded as adding one or more variables to dependent (independent) relationships are excluded, even though their MIC values are relatively high (low). Thirdly, the dependent and independent relationships are identified at the same time.

IV. CASE STUDY
The validity of the proposed MIP algorithm is verified through a simple case in this section. The data set D has five variables (attributes), $x_1, x_2, x_3, x_4, y$, where $x_1, x_2, x_3$ are mutually independent variables. There are two relationships in the data set D: $x_4 = 5x_2 + 1$ (linear relationship) and $y = \sin(4\pi x_1^2 x_3^2) + x_1 + x_3$ (sine/linear relationship). The scale of the data set D is 20000, that is, n = |D| = 20000. Obviously, there are two dependent relationships, $(x_2, x_4)$ and $(y, (x_1, x_3))$. The aim of the MIP algorithm is to precisely identify these two dependent relationships in the data set D.
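A data set of this form can be generated with a few lines of code. The sketch below makes two assumptions that the text does not state: the independent variables are drawn uniformly from [0, 1), and the argument of the sine is read as the product of 4π, the square of x1 and the square of x3. The function name is illustrative.

```python
import math
import random


def generate_case_study(n=20000, seed=0):
    """Return n rows (x1, x2, x3, x4, y) following the two stated relationships."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: x1, x2, x3 are independent and uniform on [0, 1).
        x1, x2, x3 = rng.random(), rng.random(), rng.random()
        x4 = 5 * x2 + 1                                           # linear relationship
        y = math.sin(4 * math.pi * x1 ** 2 * x3 ** 2) + x1 + x3  # sine/linear relationship
        rows.append((x1, x2, x3, x4, y))
    return rows


if __name__ == "__main__":
    data = generate_case_study()
    print(len(data), data[0])
```

Fixing the seed keeps the generated data reproducible, so that MIC values computed on it can be compared across runs.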
If the enumeration method is used directly to detect multi-variable relationships in the data set D by ranking MIC values, the $C_5^2 + 3C_5^3 + 4C_5^4 + 5C_5^5 = 10 + 30 + 20 + 5 = 65$ MIC values of all relationships in D, which are shown in Table 4, must be sorted. There are at least two problems. Firstly, every relationship is examined, even if it is definitely not dependent. Secondly, because MIC values differ slightly for different types of relationships at the same noise level [4], the MIC values of some non-dependent relationships obtained by adding independent variables to dependent relationships may be higher than those of some dependent relationships. For example, the MIC value of $(x_2, (x_1, x_4))$ is higher than that of $(y, (x_1, x_3))$. Judging only by the MIC values in Table 4, the relationships $(x_2, (x_1, x_4))$, $(x_2, (x_3, x_4))$, $(x_2, (x_1, x_3, x_4))$, . . ., $(x_4, (x_1, x_2, x_3, y))$ would be more important than the relationship $(y, (x_1, x_3))$. However, this result deviates considerably from the facts.
Now, the MIP algorithm is applied to pinpoint the dependent relationships $(x_2, x_4)$, $(y, (x_1, x_3))$ in the data set D (… Table 3).
Step 1: Initialization of the iteration matrix IA.
Step 1.1: i = 2.
Step 1.2: The $C_5^2 = 10$ MIC values of all bivariate relationships in the data set D are calculated and listed in Table 5.
$\{\mathrm{MIC}(x_3, (x_2, y)), \mathrm{MIC}(y, (x_2, x_3))\}$. Then the first row above IA and the column to the left of IA are the relationships $(x_1, x_2, y)$, $(x_1, x_4, y)$, $(x_3, x_4, y)$, $(x_1, x_3, y)$, $(x_2, x_3, y)$, and the second row above IA holds the corresponding MIC values. The updated matrix IA is shown in Table 7.
Besides, the calculation workload is reduced. The number of relationships whose MIC values are calculated is $C_5^2 + 17 + 2 = 29$, whereas the number is 65 if the enumeration method is employed. More than 50% of the calculation is saved. The proposed MIP algorithm can not only pinpoint the existing dependent relationships in big data but also reduce the calculation workload compared to the enumeration method.
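The counts in the case study can be checked with a short sketch. The weights in the enumeration sum are read here as follows: a bivariate relationship is counted once, while a relationship among k ≥ 3 variables is counted k times, once for each choice of the dependent variable. This reading, its extension to other values of n, and the function name `enumeration_count` are assumptions; only the case n = 5 is confirmed by the text.

```python
from math import comb


def enumeration_count(n):
    """Number of MIC values the enumeration method evaluates for n variables.

    Assumption: C(n, 2) bivariate relationships plus k * C(n, k) relationships
    for every k from 3 to n, following the pattern given for n = 5.
    """
    total = comb(n, 2)
    for k in range(3, n + 1):
        total += k * comb(n, k)
    return total


enum_n5 = enumeration_count(5)   # 10 + 30 + 20 + 5
mip_n5 = comb(5, 2) + 17 + 2     # count reported for the MIP algorithm
saved = 1 - mip_n5 / enum_n5     # fraction of MIC calculations avoided

print(f"{enum_n5} {mip_n5} {saved:.1%}")  # prints "65 29 55.4%"
```

The saved fraction of about 55% agrees with the statement that more than 50% of the calculation is avoided.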

V. EXPERIMENTAL RESULTS
In the enumeration method, the MIC values of relationships are calculated by the Adaptive-MIC algorithm proposed by Shao et al. [10], which is implemented in the C programming language with default parameters, that is, $\alpha = 0.6$ and $C = 15$. The proposed MIP algorithm is also implemented in the C programming language. Both C programs are run on a personal notebook computer with the following configuration: Windows 8 operating system; CPU: Intel(R) Core(TM) i7-4510U, 2.60 GHz; RAM: 8.00 GB.
To compare the calculating time of the proposed MIP algorithm with that of the enumeration method, Fig. 1 first shows the time taken by the fast Adaptive-MIC algorithm [10], which is used in both methods, to calculate the MIC values of three-variable and five-variable relationships at different scales. As the scale increases, the calculating time increases for both three-variable and five-variable relationships. At the same scale, the calculating time of three-variable relationships is slightly lower overall than that of five-variable relationships, but the difference is extremely small. The calculating time of the Adaptive-MIC algorithm is thus almost the same for relationships with different numbers of variables at the same scale. Therefore, in the enumeration method, the time of calculating the MIC value of a relationship, regardless of the number of variables in it, is approximated by the time of calculating the MIC value of a five-variable relationship.
The simple example in Section IV shows that the proposed MIP algorithm can significantly reduce the calculation times, that is, the number of MIC values of multivariate relationships that are calculated, when detecting dependent multivariate relationships in data sets. In all data sets of the experiments in Fig. 2, there are two dependent multivariate relationships, $x_3 = 3x_2 + 5x_1$ and $x_4 = x_5^2 - 6$, and the other variables are independent. As the number of variables in the data set increases, the calculation times increase. However, the calculation times of the proposed MIP algorithm are considerably lower than those of the enumeration method, and they also grow much more slowly. If the dimension of the data set is 15 or lower, the calculating time of MIP is longer than that of the enumeration method. If the dimension of the data set is 20 or higher, the calculating time of MIP is much shorter than that of the enumeration method. When the dimension of the data set is high, for example larger than 30, the time cost is reduced by about 1 order of magnitude by the MIP algorithm compared to the enumeration method. If there are 30 variables in the data set, the calculation times of the proposed MIP algorithm are 4 orders of magnitude lower than those of the enumeration method. The MIP algorithm can therefore significantly reduce the calculation times of MIC values when detecting dependent relationships in data sets, and it is suitable for detecting multivariate dependent relationships in high dimensional data sets.
In Fig. 3, the calculating time of the MIP algorithm is the average time. To further investigate the variation of the calculating time, its standard deviations are given in Fig. 4. Fig. 4 shows that the standard deviation of the calculating time varies very little, so the proposed MIP algorithm is relatively stable in terms of calculating time. The deviations are further analyzed by the t-test, in which the null hypothesis is that the two deviations are equal. The p-value of the t-test for each pair of deviations is shown in Table 8. From Table 8, it can be found that these p-values are all less than the confidence level 0.01. It is then concluded that there are no significant differences between any two deviations. That is, the deviations of the calculating time are equal across dimensions, and the variation of the calculating time of the MIP algorithm is the same for different dimensions.
In addition to the calculation times in Fig. 2, the calculating time of the MIP algorithm and that of the enumeration method, obtained by employing the Adaptive-MIC algorithm, are compared in Fig. 3. The scale of the data sets in the experiments is 100, and the calculating time of the enumeration method is equal to the product of its calculation times and the time of calculating the MIC value of a five-variable relationship with 100 data points. The calculating time of the MIP algorithm is the CPU time. When the dimension is low (10, 15), the calculating time of the MIP algorithm (168.69 and 1154.51 seconds) is higher than that of the enumeration method (15.16 and 580.64 seconds). However, when the dimension is high (20), the calculating time of MIP (5776.9 seconds) is much lower than that of the enumeration method (10185.33 seconds), and if the dimension is higher than 20, the calculating time of the MIP algorithm is about 1 order of magnitude lower than that of the enumeration method. The proposed MIP algorithm is suitable for detecting dependent relationships in high dimensional data sets, where much calculating time can be saved.
To make the comparison of calculation times and calculating time more intuitive, the ratio of the calculation times of the MIP algorithm to that of the enumeration method and the corresponding ratio of the calculating time are displayed in Fig. 5. Both ratios decrease with dimension. The ratio of the calculation times is very small even when the dimension is very low, and it declines further as the dimension grows. By contrast, the ratio of the calculating time is very large when the dimension is very low. As the dimension increases, the ratio of the calculating time rapidly declines to less than 1 and is close to zero when the dimension is large. When the dimension of the data set is large, both ratios are close to zero. That is, the calculating time and the calculation times of the MIP algorithm are then much lower than those of the enumeration method, respectively.
However, there is a big difference between the ratio of the calculation times and the ratio of the calculating time. No matter how large the dimension is, the ratio of the calculation times of the MIP algorithm to that of the enumeration method is very low; that is, the calculation times of the MIP algorithm are much lower than those of the enumeration method. By contrast, when the dimension is lower than 20, the ratio of the calculating time is larger than 1; that is, the calculating time of the MIP algorithm is longer than that of the enumeration method. When the dimension is equal to 10, the ratio of the calculating time is equal to 11.12, that is, the calculating time of the MIP algorithm is about 11 times that of the enumeration method. The calculating time thus behaves very differently at low and high dimensions. The reason is as follows. When the dimension is low, the time saved by skipping MIC calculations of multivariate relationships does not exceed the time of the additional processes introduced by the MIP algorithm. When the dimension is large, the saved time is much longer than the time of the additional processes. The proposed MIP algorithm is therefore suitable for detecting dependent multivariate relationships in high dimensional data sets.
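The ratios of the calculating time discussed above can be recomputed from the times reported in this section for dimensions 10, 15 and 20. The sketch below only divides the reported numbers; the names `reported_times` and `time_ratios` are hypothetical.

```python
# Reported calculating times in seconds: dimension -> (MIP, enumeration).
reported_times = {
    10: (168.69, 15.16),
    15: (1154.51, 580.64),
    20: (5776.9, 10185.33),
}


def time_ratios(times):
    """Ratio of the MIP calculating time to the enumeration calculating time."""
    return {dim: mip / enum for dim, (mip, enum) in times.items()}


ratios = time_ratios(reported_times)
for dim in sorted(ratios):
    slower_or_faster = "slower" if ratios[dim] > 1 else "faster"
    print(f"dimension {dim}: ratio {ratios[dim]:.2f} (MIP is {slower_or_faster})")
```

The ratio is roughly 11 at dimension 10, roughly 2 at dimension 15, and drops below 1 at dimension 20, which matches the crossover described in the text.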

VI. CONCLUSION
The matrix iteration algorithm with pruning (MIP) is proposed for pinpointing a small number of multivariate dependent relationships in high dimensional data sets. The MIP algorithm can also be seen as a framework in which other excellent coefficients can replace the maximal information coefficient (MIC) as the measure of correlation in the future. In MIP, a pruning process discards some non-dependent relationships with relatively high correlation values, so the calculation load is significantly reduced. As the sparsity of data sets increases, where sparsity is the ratio of the number of variables in dependent relationships to the number of all variables in the data set, the ratio of the number of correlation values calculated by the MIP algorithm to that of the enumeration method decreases and the calculating time of the MIP algorithm is greatly reduced. Without calculating the correlation values of all multivariate relationships, the proposed MIP algorithm can pinpoint correlations in high dimensional data sets.
ZHIQIANG HOU was born in Zhangjiakou, Hebei, China, in 1978. He received the B.S. and M.S. degrees in port, waterway and coastal engineering from Southeast University, Nanjing, China, in 2004, and the Ph.D. degree in communication and transportation engineering from Beijing Jiaotong University, Beijing, in 2017.
Since 2017, he has been a Researcher with the China Waterborne Transport Research Institute. He has authored two books, more than 40 articles, and ten norms, and holds 11 technical awards and five patents. His research interests include accident causation and risk assessment of engineering.
LIMIN JIA received the Ph.D. degree in automation and control in transportation from the China Academy of Railway Sciences, Beijing, China, in 1991. He is currently a Professor with Beijing Jiaotong University, and the Chief Scientist of the National Center of Collaborative Innovation Center for Rail Safety and the State Key Laboratory of Rail Traffic Control and Safety. He was selected in the first batch of the Millions of Leading Engineering Talents Project. His research interests include intelligent transportation systems, computational intelligence, and rail traffic control and safety.
ZHE ZHANG was born in 1988. He received the Ph.D. degree in transportation engineering from Beijing Jiaotong University, in 2017. He is currently a Researcher with the State Key Laboratory of Rail Traffic Control, Beijing Jiaotong University. His research interests include pedestrian flow simulation, crowd control, and facility optimization in terminals. VOLUME 9, 2021