Quantitative Analysis of a Weak Correlation between Complicated Data on the Basis of Principal Component Analysis

Mining weak correlation information between two highly complex data matrices is a very challenging task. A new method named principal component analysis-based multiconfidence ellipse analysis (PCA/MCEA) was proposed in this study. It first applies confidence ellipses to describe the difference and correlation among different categories of objects/samples on the basis of a PCA of a single targeted data matrix. This makes it possible to count the objects contained in the overlapping and nonoverlapping areas of the ellipses obtained from the PCA runs. Then, a quantitative evaluation index of the correlation between data matrices was defined by comparing the PCA results of more than one data matrix. The similarity and difference between data matrices were further quantified by comprehensively analyzing these outcomes. Complicated tobacco agriculture data, which include rich features of climate, altitude, and chemical compositions of tobacco leaves, were used as an example to illustrate the strategy of the proposed method. The data contain 171,516 objects with 14, 4, and 5 descriptors of climate, altitude, and chemicals, respectively. On the basis of the new method, the complex but weak relationship between these independent and dependent variables was studied. Three widely used conventional methods were applied for comparison. The results showed the power of the new method to discover weak correlations between complicated data.


Introduction
Information extraction and mining of highly complex data is attracting increasing interest in both the academic and industrial sectors. With the rapid development and use of smartphones, sensing equipment, and advanced scientific instruments such as chromatography, mass spectrometry, spectroscopy, and their coupled techniques, as well as theoretical calculations and simulation, the difficulty of generating data has been largely overcome [1][2][3][4][5]. Thus, developing powerful algorithms for knowledge discovery according to the characteristics of the different types of data has become a strong priority. The conventional strategies are not always flexible enough to mine the rich information hidden in datasets. Using typical data from the tobacco planting process as an example, the quality of tobacco leaves, such as the chemical compositions, and physical properties, such as consuming experience, are inevitably affected by ecological conditions, soil, planting, and production processes of tobacco. Of course, the genes and mutations of tobacco are also potential factors [6][7][8]. However, huge challenges may be encountered when attempting to discover the influence of independent variables/factors on dependent indices/indicators. The reason is that there are many different data types and sources, and the unknown relationships between the variables are complicated and not easy to discover. This makes reasonable interpretation difficult with the conventional methods for univariate or multivariate analysis [9,10].
Traditionally, quantitative modeling proceeds as follows. First, univariate statistical analysis of each feature in a single data matrix is performed, and hypothesis testing between different types of samples is used to obtain statistical results for a single feature. This helps to find and remove outliers along both the object and variable dimensions [11]. Next, exploratory data analysis methods such as principal component analysis (PCA) are used to mine the correlation between data matrices, but one of the main problems is the difficulty of providing quantitative results for evaluation. This mostly generates a model with low generalization ability and poor conclusions, especially for matrices with weak correlation [12][13][14]. Furthermore, methods for classification or regression analysis can be applied to build a multivariate model for qualitative or quantitative analysis, if required. This kind of analysis can help to relate the independent and dependent variables [15][16][17][18]. Of course, many other types of methods, such as Boolean association rules, decision trees, recommendation algorithms, and deep learning, can also be utilized to attain these goals [19][20][21][22]. In addition, canonical correlation analysis seeks to find the correlation between comprehensive data pairs and reflect the overall correlation between them. That is, it discovers the correlation between data matrices as a whole and extracts the representative information of the data by using canonical variates. However, it is still unable to correlate data with low internal connection [23,24].
Many researchers have reported studies that correlate multiple datasets. PCA is among the most widely used methods for this purpose in different fields, and it has been extensively developed in theory and applications, including sparse PCA built on the elastic net (LASSO), a probabilistic formulation on the basis of an associated likelihood function, and robust PCA for processing data with outliers [25][26][27][28]. Since it helps financial decision makers to deal with credit classification problems, the credit classification model has been widely applied in recent years. Tang et al. were inspired by the nonlinearity of dendrites in biological neural models and proposed a pruned neural network to solve classification problems. The results showed that it was superior to other classical algorithms in terms of accuracy and computational efficiency [29]. Using heavy metal adsorption as an example, Bingöl et al. proposed a nonlinear autoregressive network with exogenous inputs (NARX) and compared it with multiple linear regression analysis (MLR). The prediction ability of the NARX method was found to be superior to that of the MLR method using dummy variables, and it successfully evaluated the adsorption process on the experimental data [30]. Canonical correlation analysis (CCA), a classical statistical tool for correlating multivariate data, can be used to study the correlation between two datasets. Jendoubi et al. proposed a CCA probabilistic model in the form of a two-layer latent variable model and used it for the integrated analysis of gene expression data, lipid concentration, and methylation level omics datasets. It provided a new strategy to unify the spheroidization process, multiple regression, and the corresponding probability model [31]. In addition, deep learning has been widely applied in many areas of big data analysis. Litjens et al. introduced its applications in image classification, target detection, segmentation, and registration for the analysis of the nerve, retina, lungs, digital pathology, breast, heart, abdomen, and so on [32].
In this work, a new method called PCA/MCEA (PCA-based multiconfidence ellipse analysis) was developed for correlation analysis. It is based on PCA and aims to discover weak information between datasets. It first uses PCA for dimensionality reduction after data quality improvement. The confidence ellipse analysis then quantitatively analyzes the similarity and difference among different categories of samples. Next, a quantitative index is defined to express the correlation of samples in terms of the sample/object distribution in the individual ellipse of each sample class, as well as in the overlapping zone of the ellipses. Finally, the correlation between independent variables and dependent variables is obtained by comprehensively analyzing these findings across different samples. The findings showed that the PCA/MCEA method can be used as an effective tool to mine the correlation information between highly complex multidimensional data. Example data of tobacco agriculture, which included planting climate and altitude and the corresponding concentrations of chemical compositions, were used to demonstrate the strategy. The visualization results further explained the findings and showed rich characteristics and relationships of tobacco planting and production in different locations/regions. Three traditional methods were used for comparison. The strategy and procedure of the PCA/MCEA method can be extended to the analysis of other types of datasets.

Principal Component Analysis.
PCA has been widely employed for data processing. It reduces data size by projecting the raw data onto a low-dimensional space that contains most of the original variance and ignores the features with small variance [12,27]. The PCA method can be used both for data compression and for the extraction and removal of interference factors.
Singular value decomposition (SVD) is one strategy to realize PCA and obtain the orthogonal principal components (PCs), as shown in the following equation [27]:

$$A = U \Sigma V^{T},$$

where $A$ denotes the raw data with a size of $m \times n$ for decomposition, and the three matrices $U$, $\Sigma$, and $V^{T}$ denote the scores, singular values, and loadings with sizes of $m \times m$, $m \times n$, and $n \times n$, respectively.
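As a brief worked note on this decomposition, and assuming the columns of $A$ have been mean-centered (the text does not state this explicitly), the scaled PC scores and the share of variance carried by each PC can be read off the SVD. The symbols $T$ and $\sigma_j$ (the $j$-th diagonal entry of $\Sigma$) are notation introduced here for illustration only:

```latex
T = U\Sigma = AV,
\qquad
\text{variance share of PC } j \;=\; \frac{\sigma_j^{2}}{\sum_{l}\sigma_l^{2}} .
```

The second expression is the quantity that is accumulated when a cumulative variance threshold is used to decide how many PCs to keep.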

Multidimensional Confidence Ellipse.
As described above, the three matrices $U$, $V^{T}$, and $\Sigma$ obtained after the PCA operation represent the left singular vectors, the right singular vectors, and the singular values of the original data $A$, respectively. Through analyzing these matrices, the correlation and difference between objects/samples in $A$ can be discovered. Significant differences among these objects or features of $A$ can be identified with the help of confidence interval analysis [33]. Using two-dimensional data as an example, the score plot with the first two or three PCs can be constructed based on matrix $U$, and a confidence ellipse, for example, the 95% ellipse, can then be drawn on the plot to show which samples lie inside or outside the ellipses or inside the overlapping zone of two ellipses. Then, multiple confidence ellipses of different types of samples can be applied to mine the correlation of the distributions and further find the characteristics of the samples and the correlation and difference among them. Typically, the confidence ellipse analysis uses the smallest ellipse that covers 95% of the data points (objects/samples) of a certain class of samples in the score plot. The ellipse has two important parameters, that is, the area of the ellipse and the inclination of its principal axis relative to the x-axis or y-axis of the plot, which represents the direction of change of the ellipse. The ellipse can be calculated by assuming an approximately Gaussian distribution of the objects around their average value.
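The ellipse construction described above can be sketched as follows. This is a minimal illustration under the Gaussian assumption mentioned in the text, not the authors' implementation: the ellipse is taken as the set of points whose squared Mahalanobis distance from the class mean is below the chi-square quantile for 2 degrees of freedom. The function names (`fit_ellipse`, `inside`, `area_and_tilt`) and the toy points are hypothetical.

```python
import math

def fit_ellipse(points, level=0.95):
    """Fit a Gaussian confidence ellipse to 2-D score points (illustrative sketch)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance of the two score coordinates
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # For 2 degrees of freedom the chi-square quantile is -2*ln(1 - level)
    radius2 = -2.0 * math.log(1.0 - level)
    return {"mean": (mx, my), "cov": (sxx, sxy, syy), "radius2": radius2}

def inside(ellipse, point):
    """True if the squared Mahalanobis distance of the point is within the ellipse."""
    mx, my = ellipse["mean"]
    sxx, sxy, syy = ellipse["cov"]
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mx, point[1] - my
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 <= ellipse["radius2"]

def area_and_tilt(ellipse):
    """Area of the ellipse and angle (radians) of its major axis to the x-axis."""
    sxx, sxy, syy = ellipse["cov"]
    det = sxx * syy - sxy * sxy
    area = math.pi * ellipse["radius2"] * math.sqrt(det)
    tilt = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return area, tilt

# Toy score points for one class (made-up values)
toy_points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
ellipse = fit_ellipse(toy_points)
area, tilt = area_and_tilt(ellipse)
```

Applying `inside` with the ellipses of two classes to the same sample gives the inside/outside/overlap labels that the subsequent counting relies on.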

PCA/MCEA Method.
In this paper, a quantitative method to evaluate the correlation of two matrices is proposed on the basis of the PCA operation, and it is therefore named PCA-based multidimensional confidence ellipse analysis (PCA/MCEA). The flowchart of the PCA/MCEA method is shown in Figure 1. First, the original data are divided into multiple matrices according to the actual information in the data, as described in the data introduction section. This step provides the multiple data matrices for processing and correlation assessment. Next, PCA is performed independently to reduce the dimensionality of each data matrix. The classes of the samples are then determined by using prior knowledge. After that, the multidimensional confidence ellipse analysis can be used to analyze the similarity of the independent variables, and of the dependent variables, among the samples of each category. The potential correlation between independent variables and dependent variables, namely, the integrated analysis of the different matrices, can be achieved on the basis of these results.

Division of Raw Data with More Than One Data Matrix.
According to the source and type of the different indexes, the original data are divided into several matrices for correlation analysis. Using data with a size of $n \times m$ as an example, the original data can be expressed as in equation (2), where the three matrices $X = \{x_1, \ldots, x_{m_1}\}$, $Y = \{y_1, \ldots, y_{m_2}\}$, and $D = \{d_1, d_2, \ldots, d_{m_3}\}$ represent the data with independent variables, dependent variables, and sample description information, respectively. The sizes of the three matrices are $m_1$, $m_2$, and $m_3$. The independent variables and dependent variables of data $X$ and $Y$ can be further subdivided, as shown in equations (3) and (4). After these processing steps, the matrices of independent variables and dependent variables are obtained for analysis.

Dimensional Reduction Analysis by Using PCA.
As introduced above, the PCA operation can be used to analyze the two datasets of independent variables and dependent variables, and the PCs with the largest variance can be extracted while interference information is removed.
Supposing the data to be analyzed are $Z_i$, each $Z_i$ is decomposed, and the latent variables with the largest variance are selected for subsequent analysis.
Here, $\|Z_i\|_{pca}$ represents the result of applying PCA to $Z_i$, and $\mathrm{MinPriCol}_{\sigma}(\|Z_i\|_{pca})$ represents the minimum number of PCs whose cumulative variance contribution exceeds a predefined threshold $\sigma$. That is to say, with $k = \mathrm{MinPriCol}_{\sigma}(\|Z_i\|_{pca})$, the cumulative variance contribution of the first $k$ PCs of $\|Z_i\|_{pca}$ is greater than the threshold $\sigma$, and the cumulative variance contribution of the first $k - 1$ PCs is less than the threshold $\sigma$. Finally, the first $k$ latent variables after PCA are extracted to construct a new dataset of latent variables, $T_i$.
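The selection rule described above (the smallest number of PCs whose cumulative variance contribution exceeds the threshold) can be sketched in a few lines. This is an illustration only: the function name `min_pri_col` mirrors the MinPriCol notation in the text but is hypothetical, the per-PC variances are made-up values, and the strict ">" comparison follows the wording "exceeds", so the handling of exact ties is an assumption.

```python
def min_pri_col(variances, sigma=0.8):
    """Smallest k such that the first k PCs exceed the cumulative variance threshold.

    `variances` holds the variance of each PC, sorted from largest to smallest.
    Returns len(variances) if the threshold is never exceeded.
    """
    total = sum(variances)
    running = 0.0
    for k, var in enumerate(variances, start=1):
        running += var
        if running / total > sigma:
            return k
    return len(variances)

# Made-up PC variances: cumulative shares are 0.60, 0.85, 0.95, 1.00
pc_variances = [6.0, 2.5, 1.0, 0.5]
k = min_pri_col(pc_variances, sigma=0.8)
```

With the threshold of 0.8 used later in the workflow, the toy variances above keep the first two PCs.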

PCA Analysis to Each Class of Samples.
According to the sample description information $d \in D$, the samples can be divided into $|d|$ classes for analysis. Accordingly, each data matrix $T_i \in T$ can be divided into $|d|$ classes, $D = \{D_j\}$ ($j = 1, 2, \ldots, |d|$). A total of $(M_1 + M_2) \times |d|$ data matrices are then obtained for the $M_1 + M_2$ matrices of independent and dependent variables and recorded as $P = \{P_i\}$, where the set $P_i = \{P_{ij}\}$ ($j = 1, 2, \ldots, |d|$) is the result of the $|d|$-class division.

Analysis of the Multidimensional 95% Confidence Ellipse.
Based on the results of the PCA operation, the multidimensional confidence ellipse of each data matrix represents the multidimensional ellipses corresponding to the $|d|$ sample classes, as shown in Figure 2.
Journal of Analytical Methods in Chemistry
Each of the $|d|$ ellipses in the multidimensional confidence ellipse $\Theta_{ij}$ ($i = 1, 2, \ldots, M_1, M_1 + 1, \ldots, M_1 + M_2$; $j = 1, 2, \ldots, |d|$), corresponding to data $P_i$ of classification $Z_i$, divides the space into two parts, that is, inside and outside the ellipse. Through statistical analysis of the sample classes $D_p$ and $D_q$ ($p, q = 1, 2, \ldots, |d|$, and $p \neq q$), the numbers of samples of the different classes lying in the confidence ellipses $\Theta_{ip}$ and $\Theta_{iq}$ describe, for classification $Z_i$, the degree of internal aggregation of the samples and quantify the similarity and difference between the samples of $D_p$ and $D_q$. The quantitative index for evaluation is provided as follows:

$$S_{i,p,q} = \frac{\|\Theta_{ip} \cap \Theta_{iq}\|_{p}}{\|D_p\|},$$

where $\|D_p\|$ indicates the number of samples in sample class $D_p$ and $\|\Theta_{ip} \cap \Theta_{iq}\|_{p}$ indicates the number of samples that lie simultaneously in the ellipses of sample classes $D_p$ and $D_q$, which corresponds to the overlapping area in the multidimensional confidence ellipse. $S_{i,p,q}$ denotes the similarity between sample classes $D_p$ and $D_q$ in the case of classification $Z_i$. In particular, if $p = q$, it indicates the degree of aggregation of sample class $D_p$ in the case of index classification $Z_i$. By using the multidimensional confidence ellipse analysis introduced above, the similarity and difference among multiple classes of samples can be quantitatively evaluated, which helps to discover the characteristics of the sample classes under a classification such as $Z_i$.
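One reading of the index described above can be sketched as follows, assuming it is the ratio of the number of class-p samples lying in the overlap of the two ellipses to the total number of class-p samples. The function name `similarity`, the toy samples, and the use of unit circles as stand-ins for fitted confidence ellipses are all hypothetical simplifications for illustration.

```python
def similarity(samples_p, in_ellipse_p, in_ellipse_q):
    """Share of class-p samples that fall inside both ellipses.

    `in_ellipse_p` and `in_ellipse_q` are predicates returning True when a
    sample lies inside the confidence ellipse of class p or class q.
    Passing the same predicate twice gives the aggregation degree (p = q).
    """
    if not samples_p:
        raise ValueError("class p has no samples")
    overlap = sum(1 for s in samples_p if in_ellipse_p(s) and in_ellipse_q(s))
    return overlap / len(samples_p)

def circle(cx, cy, r):
    """Stand-in for a fitted ellipse: a circle membership test (assumption)."""
    return lambda s: (s[0] - cx) ** 2 + (s[1] - cy) ** 2 <= r * r

in_p = circle(0.0, 0.0, 1.0)
in_q = circle(1.0, 0.0, 1.0)

# Made-up PC scores of five class-p samples
class_p = [(-0.2, 0.0), (0.5, 0.0), (0.9, 0.0), (-0.5, 0.0), (3.0, 0.0)]

s_pq = similarity(class_p, in_p, in_q)  # similarity between classes p and q
s_pp = similarity(class_p, in_p, in_p)  # aggregation degree of class p
```

In this toy case two of the five class-p samples lie in the overlap and four lie inside the class-p region, so the aggregation degree is larger than the cross-class similarity.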

Integrated Correlation Analysis among More Than One Data Matrix.
The PCA/MCEA method quantitatively analyzes the results of each classification of samples $Z_i$ ($i = 1, 2, \ldots, M_1, M_1 + 1, \ldots, M_1 + M_2$), the degree of aggregation of different types of samples, and the similar characteristics between classes. This helps to determine the association between multiple data matrices. By performing the confidence ellipse analysis for the different classes of samples and then computing the correlation coefficient between the numbers of samples included in the ellipse of each class, the correlation between data matrices can be discovered from the numbers of samples of the different groups. This is especially helpful for mining weak correlations, for example, when the classification or regression relationship is not significant. The proposed method differs from the traditional strategies that directly analyze the correlation between independent variables (X) and dependent variables (Y) and seek a simple relationship, which is probably unsuitable for data matrices with a weak relationship. The PCA/MCEA method avoids the difficulties and challenges encountered in constructing multivariate classification or regression models and instead constructs confidence ellipses to analyze the relationship after PCA. The samples included in the overlapping and nonoverlapping areas of the ellipses effectively denote the similarity and difference between different classes of samples. After this, the correlation between more than one matrix can be found by individually analyzing the contribution of the different independent variables to the dependent variables.
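The integration step can be illustrated with a short sketch. The text speaks of a "correlation coefficient" without naming its type, so the Pearson coefficient is an assumption here, as is the idea of comparing, for one target class, the vector of ellipse-based similarity values obtained from one data matrix with the vector obtained from another. The function name `pearson` and all numbers are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (su * sv)

# Made-up similarity values of one target class against four other classes,
# computed once from a "climate" matrix and once from a "chemicals" matrix
s_climate = [0.9, 0.4, 0.2, 0.6]
s_chemicals = [0.8, 0.5, 0.3, 0.5]

r = pearson(s_climate, s_chemicals)
```

A coefficient near zero would indicate that classes that look alike under one data matrix do not tend to look alike under the other, which is the kind of weak link the method is designed to quantify.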

Conventional Methods.
In this study, stepwise regression analysis, partial least squares (PLS) regression analysis, and support vector regression (SVR) analysis were used for comparison with the proposed method.

Stepwise Regression Analysis.
In the process of stepwise regression analysis, an independent variable $X_i$ is introduced in each run, and the regression coefficients must be tested by the F test. The test statistics are recorded as $F^{(1)}_{1}, \ldots, F^{(1)}_{p}$, respectively, and it is assumed that $F^{(1)}_{i_1}$ is greater than the critical value $F^{(1)}$, which corresponds to the given significance level $\alpha$. Otherwise, the variable is excluded from the model [34]. The stepwise regression method screens the "optimal" independent variables by gradually introducing variables and calculating the partial regression sum of squares. It avoids the multicollinearity problem that occurs when all independent variables are used in the analysis.

Partial Least Squares Regression Analysis.
In PLS regression analysis, the independent and dependent variables are projected into a new space, in which a linear regression relationship is generated [35]. The PLS regression analysis method avoids structural uncertainty and nonnormal distribution problems by extracting the maximum information reflecting data variability.

Support Vector Regression Analysis.
The support vector regression (SVR) analysis uses the optimal model shown in formula (7), which helps to find the hyperplane with the "shortest distance" from the farthest sample to the hyperplane [36].
The SVR analysis is based on structural risk minimization and introduces an ε-insensitive loss function. In particular, it has strong generalization ability and reduces the requirement for balanced sampling of the data.
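The ε-insensitive loss mentioned above has a simple standard form: errors smaller than ε cost nothing, and larger errors are penalized linearly by the amount they exceed ε. The sketch below shows only this loss, not the full SVR optimization of formula (7); the function name and the default ε are illustrative choices.

```python
def eps_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Standard epsilon-insensitive loss: max(0, |y - f(x)| - epsilon)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# A prediction inside the epsilon tube costs nothing; one outside is penalized
loss_inside = eps_insensitive_loss(1.0, 1.05, epsilon=0.1)
loss_outside = eps_insensitive_loss(1.0, 1.5, epsilon=0.1)
```

Because small residuals are ignored, the fitted function depends only on the samples that lie on or outside the tube (the support vectors).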

Data Introduction and Analysis
In this article, example data of tobacco agriculture were employed to demonstrate the proposed strategy. The purpose is to study the impact of the ecological environment, including climate and altitude, on the quality of tobacco leaves. The data were originally collected from Yunnan Province, one of the largest tobacco planting districts in China. The detailed information of the dataset is given in Table 1, which introduces the climate and altitude indices and the chemical compositions used for quality evaluation of tobacco. In total, it includes 14, 4, and 5 indices of climate, altitude, and chemical compositions, respectively. The total number of samples reaches 171,516.
Before processing, the data quality was improved by the following steps, including deletion of missing values, removal of outliers, and data normalization: Deletion of missing values: the samples with missing values of climate, altitude, and/or chemicals of tobacco were eliminated before the next step. Removal of outliers: the box-plot strategy was applied to remove the outlier samples. Data normalization: the Z-score method was used for data normalization. That is, the mean (μ) and standard deviation (σ) of each variable were calculated, and the variable was standardized according to the following formula: x′ = (x − μ)/σ.
After these steps, a total of 168,643 samples were finally retained for analysis.
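The three preprocessing steps can be sketched with the Python standard library as follows. This is an illustration under stated assumptions rather than the authors' code: the box-plot rule is taken to be the common 1.5 × IQR whisker rule, quartiles use the default method of `statistics.quantiles`, the Z-score uses the population standard deviation, and the function names and toy values are hypothetical.

```python
import statistics

def drop_missing(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row)]

def boxplot_keep(values, whisker=1.5):
    """Boolean mask: True where a value lies within the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [low <= v <= high for v in values]

def zscore(values):
    """Standardize one variable: x' = (x - mu) / sigma."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

# Toy rows of (altitude, potassium content); values are made up
rows = [(1800.0, 1.9), (1950.0, None), (2100.0, 2.2)]
complete = drop_missing(rows)

# Toy variable with one obvious outlier
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
mask = boxplot_keep(raw)
kept = [v for v, ok in zip(raw, mask) if ok]
standardized = zscore(kept)
```

In a multi-variable dataset the outlier mask would be computed per variable and a sample dropped if any of its variables is flagged; that combination rule is also an assumption, since the text does not spell it out.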

Workflow of the PCA/MCEA Method.
To use the PCA/MCEA method for analysis, the 168,643 samples preprocessed as introduced above were divided into 35 classes on the basis of the geographic location attribute ("county and city/district") of the samples. The preprocessed data were then analyzed following the PCA/MCEA workflow introduced in Figure 1. The specific parameters and processing factors of the PCA/MCEA method are described as follows:
(1) The original data were divided into independent variables, including climate (X 1 ) and altitude (X 2 ), and dependent variables, including 5 indices of tobacco (Y 1 ).
(2) For PCA, the cumulative variance threshold was defined as σ = 0.8 for the three data matrices X 1 , X 2 , and Y 1 during dimensionality reduction.
(3) In terms of the sample description information, including the "county and city/district" index, the 168,643 real samples were divided into 35 classes for the subsequent confidence ellipse analysis.
(4) The final integrated analysis of correlation was carried out on the basis of these results and findings, in which the influence of the independent variables on the dependent variables was examined.
The data structure of the experimental data is shown in Figure 3, and the process for PCA/MCEA analysis is shown in Figure 4.

Results of the PCA/MCEA Method.
The three datasets of climate (X 1 ), altitude (X 2 ), and chemical compositions of tobacco leaves (Y 1 ) were analyzed by the PCA operation. Based on the cumulative variance threshold, the new variables with the largest variance were selected for analysis. Figure 5 shows the results of the PCA/MCEA method, which were obtained by comprehensively analyzing each data matrix with the help of PCA and confidence ellipses. The PCA results show the distribution of the samples obtained from different geographic locations. Each plot in Figures 5(a)-5(c) consists of three parts. The main results in the middle of the figure are obtained from PCA, and the points of two different colors correspond to the samples of the targeted category and all the remaining samples outside the target class. The common information of each class is explained by the overlapping zone of the 2-dimensional confidence ellipses, as described above. The top and right subgraphs are based on the results of each class of objects, which were extracted from different categories/geographic regions, and show the distribution density of the samples along the first PC and the second PC, respectively. The different categories of samples can be identified and distinguished from the two density curves. If two categories of samples are well distinguished, the overlapping area of the density curves is smaller, and vice versa. If the samples of the same category are more concentrated, the curve is sharper, with a smaller standard deviation (SD), and vice versa. Figures 5(a)-5(c) correspond to the analytical results of climate, altitude, and chemical compositions, respectively.

Figure 2: Schematic diagram of confidence ellipse analysis included in the proposed method, in which a two-dimensional confidence ellipse example is given to deliver the strategy. Here, the two ellipses represent two classes of samples, and the data points with two different colors denote samples corresponding to different classes. The overlapping and nonoverlapping areas of ellipses include the common and noncommon characteristics of the two types of samples.
A specific class of samples, namely, GuChengQu (the ancient district of a city, abbreviated as GCQ), was used as an example to illustrate the quantitative analysis of data with weak correlation by the PCA/MCEA method. As mentioned above, Figures 5(a)-5(c) show the samples of GCQ and of all other counties and cities except GCQ, together with their 2-dimensional confidence ellipses after the PCA operation. The results in Figure 5 show that the samples of GCQ have certain unique characteristics in terms of climate, altitude, and chemical compositions of tobacco leaves.
On the basis of the formulas given in (2)-(5), the 2-dimensional confidence ellipse analysis was carried out for the samples of the 35 locations, separately for the data matrices of climate, altitude, and chemical compositions. The number of samples distributing
Furthermore, we conducted an integrated analysis of the correlation between the three data matrices on the basis of the correlation coefficients defined in Figures 6(a)-6(c), following the strategy of the PCA/MCEA method. The correlation between each pair of data matrices for a specific location, for example, GCQ, can be calculated and then used to analyze the quantitative impact of climate and altitude on the chemical compositions of tobacco leaves. For the data pair of climate and chemicals, the minimum, average, and maximum correlation coefficients are −0.2796, 0.0320, and 0.3334, respectively. For the data pair of altitude and chemicals, the three correlation coefficients are −0.2610, 0.0759, and 0.3593, respectively. For the data pair of climate and altitude, the three correlation coefficients are −0.1718, 0.1612, and 0.4717, respectively. Using GCQ as an example, the correlation coefficients of the three abovementioned data pairs are 0.33073, 0.07855, and 0.26514. The results of GCQ show that the climate has a more significant effect on the quality of tobacco leaves than the altitude factor. It is particularly important to note that climate and altitude may have a nonlinear relationship with the quality of tobacco leaves, and the characteristics of different regions are quite different. In addition, many other kinds of factors may influence the quality of tobacco leaves, and the extent to which a specific factor affects tobacco quality may differ from case to case.
The advantage of the PCA/MCEA method is that it attempts to quantitatively analyze the relationship between the dependent variables and the two groups of independent variables from the perspective of a single group of samples. This gives it certain advantages and application prospects in contrast to the conventional methods.
In the next step, we further adopted radar chart analysis to visually analyze the results of a multidimensional confidence ellipse, which helps to intuitively show the differences of data in climate, altitude, and chemical of tobacco


Results of Conventional Methods.
As introduced above, the PCA/MCEA method was used to explain the effect of climate and altitude on the chemical compositions of tobacco leaves. In this section, three traditional regression methods were applied to the same data in an attempt to build more accurate and reliable models. With the leave-one-out (LOO) method for validation, stepwise regression, PLS regression, and SVR analysis were used to construct quantitative models for predicting the chemicals from climate, altitude, or other independent variables and the known chemical indices of tobacco leaves. Here, the influence of climate and altitude on the content of element K (potassium) of tobacco leaves is demonstrated with the results given in Figure 8. The models of climate, altitude, and content of K constructed by the three methods have R-squared values of 0.1336, 0.1386, and 0.1431, respectively. Obviously, the performance of such models is not good enough to discover the correlation among the datasets, and the models have poor predictive and generalization ability. That is to say, almost no information can be drawn from the results of regression modeling. The models for the contents of total sugar, reducing sugar, nicotine, and chlorine in tobacco leaves all have R-squared values of less than 0.1. These results illustrate the limitations of regression methods for finding the relationship between independent and dependent variables, such as climate and altitude and the chemicals in tobacco leaves. The modeling performance further shows the difficulty that conventional methods have in modeling and predicting data with weak correlation.
It is highly challenging to find a quantitative correlation by regression analysis when the influence of potential factors on independent variables is too complicated and prior knowledge is limited.
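How an R-squared value under leave-one-out validation can be obtained is sketched below. A single-predictor least-squares line stands in for the three regressors used in the study (an assumption made to keep the example dependency-free), and the R-squared is computed from the held-out predictions against the overall mean of the response, which is one common convention and not necessarily the one used by the authors. All names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def loo_r2(x, y):
    """R-squared computed from leave-one-out predictions (x and y are lists)."""
    preds = []
    for i in range(len(x)):
        # Refit the model without sample i, then predict sample i
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    my = sum(y) / len(y)
    ss_res = sum((b - p) ** 2 for b, p in zip(y, preds))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Made-up data: an exact linear relation and a noisy one
x = [0.0, 1.0, 2.0, 3.0, 4.0]
r2_exact = loo_r2(x, [1.0, 3.0, 5.0, 7.0, 9.0])
r2_noisy = loo_r2(x, [1.0, 3.0, 2.0, 5.0, 4.0])
```

Values close to zero, such as those reported above, mean that the held-out predictions are barely better than predicting the mean of the response.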
In this work, the PCA/MCEA method was constructed based on the division and pretreatment of the original data, and PCA is first performed for dimension reduction on the different data matrices obtained from the sample classification. Then, the relationship between samples is constructed by multidimensional confidence ellipse analysis, which identifies the samples of different classes lying inside and outside these ellipses. The comparative analysis of samples between different groups enables quantitative analysis of the difference and similarity between independent and dependent variables. This largely helps to explore the hidden information of weak correlation between datasets.
This overcomes the dilemma that traditional quantitative modeling methods face with weakly correlated data, for which the models have low capacity for new prediction.
In the plot, the values at the center and at the outermost edge of the circle are 1 and 0, respectively. The closer a point is to the center of the circle, the more similar it is to the characteristics of the samples in GCQ. If GCQ itself is closer to the center of the circle, it means that GCQ has high autocorrelation.

Conclusions
The PCA/MCEA method proposed in this work seeks to discover the weak correlation between complex data with the help of multidimensional confidence ellipses obtained from the results of the PCA operation. The common features and differences between different types of samples are characterized by the number of samples lying in the overlapping or nonoverlapping areas of the ellipses, which mathematically contain the characteristics of one or multiple classes of samples. The quantitative correlation between independent and dependent variables is comprehensively evaluated by individually analyzing the relationship of the data pairs. Data containing 171,516 tobacco leaf samples were handled as an example to demonstrate the strategy of the proposed method. In contrast to the conventional methods for classification and regression analysis, the results obtained from PCA/MCEA are more helpful for generating rich and informative conclusions. The method can also be widely used for more types of complicated datasets with low but potential correlations.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors Tao Pang and Yong Li are employed by CNTC/YB, and Haitao Zhang, Jun Tang, Bing Zhou, Qianxu Yang, and Jiajun Wang are employed by CTYI/CT. All these authors state that they cooperated for scientific study and have no conflicts of interest in the outcome of the work.

Authors' Contributions
Tao Pang and Haitao Zhang contributed equally to this work.