Reduce the rank calculation of a high-dimensional sparse matrix based on network controllability theory

Numerical computing of the rank of a matrix is a fundamental problem in scientific computation. The datasets generated by the internet often correspond to the analysis of high-dimensional sparse matrices. Notwithstanding recent advances in the promotion of traditional singular value decomposition (SVD), an efficient estimation algorithm for the rank of a high-dimensional sparse matrix is still lacking. Inspired by the controllability theory of complex networks, we converted the rank of a matrix into maximum matching computing. Then, we established a fast rank estimation algorithm by using the cavity method, a powerful approximate technique for computing the maximum matching, to estimate the rank of a sparse matrix. In the merit of the natural low complexity of the cavity method, we showed that the rank of a high-dimensional sparse matrix can be estimated in a much faster way than SVD with high accuracy. Our method offers an efficient pathway to quickly estimate the rank of the high-dimensional sparse matrix when the time cost of computing the rank by SVD is unacceptable.


I. INTRODUCTION
With the development of online social networks, researchers often face complex networks composed of huge numbers of individuals and multiple relationships among them.For the analysis of these large complex networks, we need to convert the network into its corresponding matrix and obtain some characteristics of the original network from its matrix based on traditional matrix theory, such as the page-rank method [1], communities detective [2], and some dynamical problem [3,4].Rank is one of the most important numerical characteristics of a matrix.At present, a large number of researchers focus on the rank of the special matrix [5], low-rank problem [6][7][8][9], maximal rank problem [10], nullity of graphs [11,12], and application in robust principal component analysis [13][14][15].The most successful method of rank calculation is the traditional singular value decomposition (SVD), which computes the rank by decomposing the original matrix into singular values and computing the statistical properties of the decomposed matrix.However, the complexity of SVD is the cube of the matrix size (denoted as N), which makes the SVD numerically difficult to compute in high-dimensional situations.Therefore, several methods are developed to tackle the complexity problem based on the novel matrix decompositions [16,17], Monte Carlo simulation [18], and multicomputing technologies [19][20][21][22][23].However, all these methods cannot significantly improve the time complexity.
Benefitting from the development of the control theory of complex networks, we know that the rank of the coupling matrix reflects the exact controllability of sparse complex networks.On the other hand, the structural controllability can be measured by the maximum matching of sparse complex networks.For sparse complex networks, the exact controllability is equivalent to the structural controllability [24,25].The cavity method, a powerful approximation method developed in statistical mechanics [26,27], can be designed to calculate the maximum matching of complex networks.Therefore, the controllability of sparse complex networks builds a bridge between the cavity method and rank computation.
In other words, for an N-dimensional sparse matrix, we can convert it into an N-node complex network and compute its structural controllability through the cavity method.Due to its sparsity, the structural controllability is equal to the exact controllability, and the rank of the input N-dimensional sparse matrix can be approximately estimated.This process, which is a Fast Estimation method for a sparse matrix Rank called FER, can estimate the rank of a high-dimensional sparse matrix much faster than SVD.Therefore, we applied FER to randomly generalized sparse matrices and systematically compared FER with SVD in terms of efficiency, accuracy, and applicability in two typical distributions of nonzero elements of each row (denoted as k ).We found that the time cost of FER does not significantly increase during N growing with a constant k , and the results estimated by FER maintained high accuracy, which confirms that FER is an efficient tool for estimating the rank of high-dimensional matrices.We also studied the impact of k on the time cost and accuracy of FER, and the performance of FER remained very good.Finally, we applied FER to the matrices with the identity of nonzero elements.The efficiency and accuracy of FER were still very high.All the results suggest that FER is a valid access for estimating the rank of a sparse matrix, especially for estimating the rank of a high-dimensional sparse matrix, which is almost unacceptable for computing the rank by SVD while considering the time cost.

II. MATERIALS AND METHODS
FER is based on the development of controllability theory of complex networks.Two existing theoretical frameworks for quantifying the controllability of a complex network are structural controllability theory (SCT) and exact controllability theory (ECT) [28].SCT claims that the structural controllability of any directed network is determined by the maximum matching.The maximum matching can be solved by the cavity method when the network is directed with a structural matrix.The exact controllability obtained by ECT is determined by the maximum multiplicity of eigenvalues of the coupling matrix.In the sparse situation, ECT is an efficient tool to obtain the controllability of the networks by calculating the rank of the coupling matrix.When the network is sparse and the weights of links are weakly correlated, the structural controllability and the exact controllability are theoretically equivalent [24].Therefore, computing the rank of a sparse matrix can be converted into a maximum matching problem; then, we can estimate the rank by solving the corresponding coupling equations of the cavity method in an efficient way.This is the core of FER.
Without loss of generality, we consider an arbitrary sparse input matrix A with weakly correlated nonzero elements, as shown in Fig. 1a, where only the white grids represent the zero elements and the darker color represents the larger value of the nonzero elements.Then, we apply FER to the input matrix A, and the procedure of FER can be described as the following five steps: Step 1. Transfer the input matrix A into a structural matrix B, in which the elements can only be 0 or 1. 0s represent the zero elements denoted as white grids, and 1s represent the nonzero elements denoted as black grids, as shown in Fig. 1a-b; Step 2. Consider the structural matrix B as a coupling matrix of a complex network and construct a directed network, as shown in Fig. 1b-c; Step 3. Obtain the in-degree (P in (k)) and out-degree (P out (k)) distributions of the directed network, where P in(out) (k) = n in(out) (k)/N.n in(out) (k) is the number of nodes with the in(out)degree value k in the whole network, as illustrated in Fig. 1c-d; Step 4. Calculate the structural controllability (N C ) of the directed network according to the degree distribution by the cavity method [25], which is illustrated in Fig. 1d-e, where G(x) and Ĝ(x) are ordered by the following equations: shows the statistics for the in-degree and out-degree distributions of the complex network, where the horizontal axis represents the degrees, and the vertical axis represents the relative frequency of the corresponding degree.(e) inputs the degree distributions of the network into the coupling equations of the cavity method and solves the values of four coupling parameters.Following eq. ( 1), the structural controllability N C can be calculated.According to N C , (f) obtains the rank approximation of the input matrix A.
and ω 1 ,ω 2 ,ω 1 ,ω 2 are the solutions of the following coupling equations: in the above equations, the functions of H( * ) and Ĥ( * ) are shown as: According to eq. ( 3), H(x) and Ĥ(x) can be calculated by the degree distribution (P out and P in ).In most cases, the degree distribution, a primary statistical property, can be easily obtained from the empirical data as described in step 3.The above coupling equations are transcendental equations, and the solutions of ω 1 , ω 2 , ω1 , and ω2 can be obtained by numerically solving eq. ( 2).Finally, we obtain the structural controllability N C from eq. ( 1).
Step 5.As the SCT and the ECT are equivalent when the input matrix A is sparse, the structural controllability N C is equal to the exact controllability N − Rank(A).Thus, we can estimate the rank of the input sparse matrix A as illustrated in Fig. 1e-f: It is worth noting that, first, N C can be directly calculated by maximum matching based on SCT [28].The cavity method is an efficient tool based on statistical physics for estimating the maximum matching, which can be obtained just by the degree distribution.That is, the complexity of the FER method is determined by the complexity of the statistics on the degree distribution and the accuracy of the numerical solution.Second, if a matrix contains totally irrelevant element values (every nonzero element is a real random number), Rank(A) is theoretically equal to Rank(B) based on the SCT and ECT.However, the assumption is too strict for general cases, which means the result of FER is just an estimation tool for the rank of the input matrix.The correlation strength of nonzero elements in the input matrix indeed affects the accuracy of FER.

III. RESULTS
Some comparisons between FER and SVD are exhibited from the efficiency and accuracy aspects in some typical situations.To analyze the impact of the matrix size (N), we generate some matrices randomly with a fixed sparsity, i.e., the average number of nonzero elements in each row ( k ).The nonzero elements are generated following two typical distributions: random distribution and power-law distribution.Then, we apply FER and SVD to the generated matrices, and the results of comparing the efficiency and the accuracy are shown in Fig. 2. The efficiency of the algorithm is defined by the time cost of solving the task, denoted as T cost .As Fig. 2a and Fig. 2c show, if N increases, T cost of SVD increases following its theoretical computational complexity O(N 3 ).Although we can use a GPU for acceleration, the T cost of SVD increases beyond O(N 2 ) as N increases.In contrast, T cost of FER increases very little as N increases, which suggests that its computational complexity is determined by the size of the matrix and the average number of nonzero elements in each row together.The rank estimated by FER (denoted as r F ER M ) and SVD (denoted as r SV D M ) almost overlap, as shown in both Fig. 2b and Fig. 2d, which implies that these two methods obtain a similar result no matter how N increases.To explain the high accuracy of FER in more detail, we treat the rank computed by SVD as the ground truth and define the relative error as: On the other hand, to compare the FER accuracy as N increases, we define a normalized rank (denoted as r M ) as the following equation:  6).The inset figures in (b) and (d) show the relative errors of FER calculated by eq. ( 5).k is kept at 2, and all the nonzero elements are random in the generated matrices.The results of N ≤ 5000 are averaged over 50 independent calculations, and the results of 5000 < N ≤ 10000 are averaged over 20 independent calculations.
The inset figures in Fig. 2b and Fig. 2d show the relative error ∆r M in random distribution and power-law situations, respectively.∆r M are quite small with fluctuations as N increases and remains below 0.003 and 0.001 in random and power-law situations, respectively.The results indicate that FER has good performance in both typical scenarios.When N grows larger, ∆r M has a downward trend in both distributions, which implies that the relative error between FER and SVD should be very small when N is sufficiently large.In summary, for a high-dimensional sparse matrix, we can use FER to obtain an accurate estimation of rank efficiently with a similar accuracy as that obtained by SVD, regardless of the random or power-law distribution.
As shown in Fig. 3, we checked how the sparsity of the input matrix, measured by k , affects the efficiency and accuracy of FER when the matrix size is fixed as N = 3000.It is shown that T SV D cost and T F ER cost are functions of k in random situations (Fig. 3a) and power-law situations (Fig. 3c).T F ER cost is much smaller than T SV D cost in each situation.In Fig. 3b and Fig. 3d, we analyzed the accuracy of FER as k increased.There are almost no differences between the FER and SVD results, and the scatters in the main figures almost overlap.Then, we consider the relative error of FER, as shown in the inset figures of Fig. 3b and Fig. 3d.If k increases, The values of ∆r M are both much smaller in the two situations, which fluctuates obviously in random situations.∆r M remains almost constant and is smaller than 5 × 10 −5 , in the power-law situation.In summary, we find that FER is much more efficient than SVD, no matter when k increases, and the impact of k on the efficiency and accuracy of FER is quite small in both situations.
FER works only if all the nonzero elements in the sparse matrix are uncorrelated.However, there are many relevant elements in the real data, which means that errors are unavoidable if the nonzero elements are correlated.Thus, we discuss whether the result obtained approximately by FER is acceptable when the nonzero elements of the input matrix are correlated.In Fig. 4, we consider an extreme case where all the nonzero elements in the input matrix are identical (set as 1), which means that all the nonzero elements are strongly correlated.The strong correlation has a negative effect on T cost in both random situations (Fig. 4a) and power-law situations (Fig. 4c).T F ER cost is still much smaller than T SV D cost .However, the results shown in Fig. 4b and Fig. 4d indicate that the accuracy of FER has a significant decline compared with Fig. 3b and Fig. 3d.This means that the correlation of the input matrix does affect the accuracy of FER, which agrees with the limitation of structural controllability as well as the cavity method.Although the accuracy of FER has decreased, we can also learn from the inset figures that the relative error of FER is still very small in both situations.Especially when k increases over 4, ∆r M has an obvious descent.In other words, even though the nonzero elements in the sparse matrix are strongly correlated, the performance of FER is also acceptable in terms of efficiency and accuracy.The robustness of FER suggests its effectiveness in estimating the rank of a more general matrix extracted from the empirical data set.

IV. DISCUSSION
In summary, we utilized the cavity method to estimate the maximum matching.Based on controllability theory of complex networks, we know that the rank of the matrix is theoretically equal to the maximum matching of the network when the network is sparse and the weights of links are weakly correlated.Then, we established an efficient estimation tool for analyzing the rank of a high-dimensional sparse matrix by the cavity method, which is called FER.We discussed the impact of the input matrix size (N), the sparsity of the matrix (measured by k ), and the correlation of nonzero elements on the efficiency (measured by T cost ) and accuracy (measured by ∆r M ) of FER in random situations and power-law situations.We found that FER has remarkable performance in terms of both efficiency and accuracy in random distribution and power-law distribution.Although the characteristics of nonzero elements affect the results, FER can still be applied to most sparse matrices to estimate their rank with fast speed and high accuracy.It can significantly outperform SVD in terms of the time cost and has a similar accuracy to SVD.Therefore, FER provides an efficient and accurate method for estimating the rank of a sparse matrix.Especially for dealing with a large real network by some algorithms with its matrix rank, FER can do a good job to estimate the rank directly by its degree distribution obtained from the raw data, while SVD is inapplicable due to its excessive time cost.Furthermore, in some special situations, where only the structural information of a social network can be detected, such as degree distribution or partially missing degree distribution, FER is still applicable to estimate the rank of its corresponding matrix.This means that FER can potentially be used in some algorithms designed for incomplete data or data polluted by interference noise.

FIG. 1 :
FIG.1: Illustration of the fast estimation algorithm for matrix rank.(a) Matrix A represents a general sparse matrix as the input matrix, and each grid represents an element in the matrix, in which white grids denote zeros, and darker grids represent the nonzero elements.(b) shows the structural matrix of the input matrix A. (c) transfers the structural matrix B to a directed network.(d) shows the statistics for the in-degree and out-degree distributions of the complex network, where the horizontal axis represents the degrees, and the vertical axis represents the relative frequency of the corresponding degree.(e) inputs the degree distributions of the network into the coupling equations of the cavity method and solves the values of four coupling parameters.Following eq.(1), the structural controllability N C can be calculated.According to N C , (f) obtains the rank approximation of the input matrix A.

FIG. 2 :
FIG. 2:The impact of N on the efficiency and accuracy of FER.The changes of T cost to compute the rank of the input matrix using SVD and FER, when input matrix size N increases.k follows two typical distributions: random distribution (a) and power-law distribution (γ = 3) (c).The results of the rank calculated by SVD and the rank estimated by FER in random distribution (b) and power-law distribution (d) follow eq.(6).The inset figures in (b) and (d) show the relative errors of FER calculated by eq.(5).k is kept at 2, and all the nonzero elements are random in the generated matrices.The results of N ≤ 5000 are averaged over 50 independent calculations, and the results of 5000 < N ≤ 10000 are averaged over 20 independent calculations.

FIG. 3 :
FIG. 3:The impact of k on FER when N is fixed.The impact of k on T SV D cost and T F ER cost in random distribution (a) and power-law distribution (γ = 3) (c).The comparison between the rank calculated by SVD and FER on generated matrices with random distribution (b) or power-law distributions (d) for different k .The inset figures in (b,d) show the relative errors of FER versus k .The fixed size of all the simulated networks is N = 3000, and all the results are averaged over 100 independent calculations.

FIG. 4 :
FIG. 4:The efficiency and accuracy of FER when the nonzero elements of the input matrix are all strongly correlated.T SV D cost and T F ER cost on input matrices with random (a) or power-law (c) distributions for different k .b,d the ranks obtained by SVD and FER on generated matrices with random distribution and power-law distributions (d) for different average degrees.All the nonzero elements in the generated matrices are set as 1.The size of all the simulated networks is N = 3000, and all the results are averaged over 100 independent calculations.