Unsupervised Feature Selection via Metric Fusion and Novel Low-Rank Approximation

Unsupervised feature selection aims to derive a compact set of features with the desired generalization ability by removing irrelevant and redundant features, yet it is challenging due to the unavailability of labels. Unsupervised feature selection methods typically need to construct a similarity matrix, which makes the selected features highly dependent on the accuracy of the similarity measurement. However, existing works usually leverage a single fixed metric to build the similarity matrix, which cannot fit various feature types well and may even damage the local manifold structure. To address this problem, we propose an adaptive multi-metric fusion that automatically integrates similarity across different metrics according to the specific data. Besides, to capture the global structure more precisely, a novel low-rank approximation method is proposed, which is relatively insensitive to the rank-norm parameter. Via the proposed low-rank approximation method, a better tradeoff between performance and robustness can be achieved. Experimental results show that the accuracy of the proposed method is boosted by $2\%-11\%$ compared with previous methods.

high-dimensional data was proposed to address this problem, which is called feature selection. Generally, feature selection methods fall into two categories, supervised feature selection and Unsupervised Feature Selection (UFS), according to the availability of labels in the data samples. Among them, UFS is widely used in realistic scenarios, because the data in realistic tasks is usually unlabelled.

Moreover, UFS methods are of three types: filters [1], wrappers [2], and embedding-based methods [3]. Among them, the performance of filter and wrapper approaches is affected by the search strategy, while the embedding-based method performs feature selection by learning. Thus, the embedding-based method achieves better performance than the other methods, thereby attracting more attention. A recent work [4] on embedding-based methods implements feature selection by discovering the global structure of the data space. Furthermore, based on the work [4], an improved work [5] was proposed, where a rank constraint is imposed on the self-representation matrix, and better performance is achieved. Besides, since the local manifold structure is superior to the global structure in some aspects, some recent embedding-based methods turn to leveraging the local manifold structure to select features.

The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.
In [8], the similarity between the i-th and j-th samples is measured by the inverse of the $\tau$-weighted squared Euclidean distance between them, i.e.,

$s_{ij} = 1 / (\tau \|\phi(x_i) - \phi(x_j)\|^2)$,  (3)

where $\phi(\cdot)$ is the kernel function. In [9], the KNN approach is adopted in the construction of the similarity matrix, where

$s_{ij} = 1$ if $x_j \in N_i$, and $s_{ij} = 0$ otherwise,  (4)

where $N_i$ is the neighbor set of sample $i$. In [10], the similarity matrix is constructed by the heat kernel, i.e.,

$s_{ij} = \exp(-\|x_i - x_j\|^2 / t)$.  (5)

Furthermore, the LPP problem can be described as

$\min_{W^T W = I_k} \sum_{i,j} \|W^T x_i - W^T x_j\|^2 s_{ij}$,  (6)

where $W \in R^{d \times k}$ is the projection matrix, $k$ is the dimension of the subspace, and the constraint $W^T W = I_k$ avoids the trivial solution [14]. As shown in Equation (6), the similarity between $x_i$ and $x_j$ in the original space is preserved in the subspace. Furthermore, to make the rows of $W$ sparse, an $l_{2,1}$-norm sparsity constraint is imposed, and the objective function (6) can be rewritten as

$\min_{W^T W = I_k} \sum_{i,j} \|W^T x_i - W^T x_j\|^2 s_{ij} + \beta \|W\|_{2,1}$.  (7)

Combining the low-rank self-representation with LPP yields

$\min_{W, Z} \|X - XZ\|_F^2 + \alpha\,\mathrm{rank}(Z) + \omega \sum_{i,j} \|W^T x_i - W^T x_j\|^2 s_{ij} + \beta \|W\|_{2,1}$, s.t. $W^T W = I_k$, $\mathrm{diag}(Z) = 0$,  (8)

where $\alpha$ and $\omega$ are the balance parameters for the low-rank constraint and LPP, respectively, and $\mathrm{diag}(Z) = 0$ is used to avoid the trivial solution.
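As a concrete illustration, the three similarity constructions above can be sketched in NumPy as follows. This is a minimal sketch: the bandwidth `t`, the weight `tau`, and the neighbor count `k` are illustrative placeholders rather than values from the paper, and the inverse-distance variant omits the kernel mapping $\phi(\cdot)$ for simplicity.

```python
import numpy as np

def pairwise_sq_dists(X):
    # Squared Euclidean distances between all rows of X (n x d).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.maximum(d2, 0.0)

def heat_kernel_similarity(X, t=1.0):
    # s_ij = exp(-||x_i - x_j||^2 / t), the heat-kernel construction.
    return np.exp(-pairwise_sq_dists(X) / t)

def knn_similarity(X, k=2):
    # s_ij = 1 if x_j is among the k nearest neighbors of x_i, else 0.
    d2 = pairwise_sq_dists(X)
    np.fill_diagonal(d2, np.inf)          # exclude each sample from its own neighbor set
    idx = np.argsort(d2, axis=1)[:, :k]
    S = np.zeros_like(d2)
    rows = np.arange(d2.shape[0])[:, None]
    S[rows, idx] = 1.0
    return np.maximum(S, S.T)             # symmetrize

def inverse_dist_similarity(X, tau=1.0):
    # s_ij proportional to the inverse of the tau-weighted squared distance.
    d2 = pairwise_sq_dists(X)
    S = 1.0 / (tau * d2 + 1e-12)          # small constant avoids division by zero
    np.fill_diagonal(S, 0.0)
    return S
```

Note that the KNN construction is symmetrized explicitly, since the k-nearest-neighbor relation itself is not symmetric.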

To improve the performance of UFS, we propose a novel low-rank approximation to capture the global structure more accurately, and a scheme of adaptive multi-metric fusion to fit the specific data. The details are discussed in the following. In this formulation, $J_\gamma(\sigma_i)$ is a piecewise function of the singular value $\sigma_i$. From Equation (12), we can see that the error between $\mathrm{rank}(Z)$ and its lower bound $\phi_\varepsilon(Z)$ approaches zero as the parameter $\varepsilon$ approaches zero. Therefore, $\phi_\varepsilon(Z)$ can be used to approximate the rank function almost unbiasedly. On the other hand, $\phi_\varepsilon(Z)$ is a continuous function of $\varepsilon$, and when $\sigma_i^2$ is much larger than $\varepsilon$, $\phi_\varepsilon(Z)$ barely changes with $\varepsilon$. Therefore, $\phi_\varepsilon(Z)$ has the desired robustness with respect to the rank-norm parameter $\varepsilon$. However, the rank function represented by $\phi_\varepsilon(Z)$ in Equation (12) is an implicit function of the variable $r$, which causes considerable inconvenience in the solution process of the UFS problem. Thus, we further transform $\phi_\varepsilon(Z)$ into an explicit function of the variable $r$. Based on the property of unitary matrices, $\phi_\varepsilon(Z)$ can be represented as in Equation (9).
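The exact definition of $\phi_\varepsilon(Z)$ is given in Equations (9)-(12). As a hedged numerical sketch, one smooth surrogate that exhibits the two properties just described (near-unbiasedness as $\varepsilon \to 0$, and insensitivity to $\varepsilon$ when $\sigma_i^2 \gg \varepsilon$) is $\sum_i \sigma_i^2/(\sigma_i^2 + \varepsilon)$; this particular form is an assumption made for illustration, not necessarily the paper's definition.

```python
import numpy as np

def rank_surrogate(Z, eps=1e-6):
    # A smooth rank surrogate (assumed form): sum_i sigma_i^2 / (sigma_i^2 + eps).
    # As eps -> 0 this tends to rank(Z); for sigma_i^2 >> eps each term is ~1,
    # so the value barely changes with eps, matching the robustness property.
    s = np.linalg.svd(Z, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + eps)))

# A rank-2 matrix: the surrogate should be close to 2 for small eps.
Z = np.array([[1.0, 0.0, 1.0],
              [3.0, 1.0, 2.0],
              [3.0, 0.0, 3.0]])
```

The surrogate value changes very little when `eps` is perturbed around a small value, which is exactly the parameter robustness claimed for $\phi_\varepsilon(Z)$.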

As shown in Equation (9), $\mathrm{diag}(\sigma^2(Z))$ is a diagonal matrix whose entries are the $r$ squared non-zero singular values of $Z$. Therefore, the explicit function can be derived accordingly, and the optimization problem (8) can be rewritten as problem (14).

Currently, existing works tend to leverage a single fixed metric to build the similarity matrix, which cannot fit various feature types well and may even damage the local manifold structure. To address this issue, we propose an adaptive multi-metric fusion that automatically integrates similarity across different metrics according to the specific data. Compared with traditional single-metric-based methods, the fused similarity metric can fit the specific data much better. Specifically, the multiple similarity metrics are adaptively fused into a unified similarity matrix $S$, where $m_t$ denotes the $t$-th given similarity metric, $m_t(X) \in R^{n \times n}$ is the similarity matrix based on the metric $m_t$, $q$ denotes the number of similarity metrics to be fused, and $h_t$ is the adaptive weight of the $t$-th similarity metric in the fusion process. If the difference between the matrix $S$ and a specific similarity matrix $m_t(X)$ is larger, a smaller weight is automatically assigned to the $t$-th metric. Hence, the weight $h_t$ should be inversely proportional to $\|m_t(X) - S\|_F^2$, and the weighting scheme is designed accordingly. In this way, the different similarity metrics can be fused adaptively to fit the specific data. Therefore, by considering the adaptive multi-metric fusion, the optimization problem (14) can be reformulated as problem (18). By incorporating the self-representation, problem (18) can be further written as problem (19). Then, by introducing the auxiliary variables $A$, $Z_1$ and $Z_2$, the optimization problem can be rewritten as problem (20), where $\tilde{L}_s = P^T L_s P$, and the augmented Lagrangian of Equation (20) is formed.

2) UPDATE $Z_1$
$Z_1$ can be updated iteratively as

$Z_1^{(m+1)} = Z_1^{(m)} - \eta_{Z_1} \nabla_{Z_1}$,
where $m$ represents the iteration index, $\eta_{Z_1}$ denotes the learning rate, and $\nabla_{Z_1}$ is the gradient with respect to $Z_1$. The update rule of $Z_2$ can be written analogously. Besides, given $A$, $Z_1$ and $Z_2$, the dual variables $\Lambda_1$ and $\Lambda_2$ can be updated by the standard dual-variable update rules in [17].
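Returning to the adaptive multi-metric fusion described above, the alternating computation of the fused matrix $S$ and the weights $h_t$ can be sketched as follows. The exact weighting formula belongs to the paper; here the rule $h_t \propto 1/\|m_t(X) - S\|_F^2$, normalized so the weights sum to one, is assumed purely for illustration.

```python
import numpy as np

def fuse_similarities(mats, n_iter=20):
    # Alternate between the fused similarity matrix S and the adaptive weights h_t.
    # Assumed weighting rule: h_t proportional to 1 / ||m_t(X) - S||_F^2,
    # normalized to sum to one, so metrics far from the consensus get small weight.
    q = len(mats)
    h = np.full(q, 1.0 / q)
    S = sum(h[t] * mats[t] for t in range(q))
    for _ in range(n_iter):
        d = np.array([np.linalg.norm(M - S, 'fro')**2 + 1e-12 for M in mats])
        h = (1.0 / d) / np.sum(1.0 / d)
        S = sum(h[t] * mats[t] for t in range(q))
    return S, h
```

With two nearly identical metrics and one outlier metric, the outlier is automatically down-weighted, which matches the behavior described in the text.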

When the self-representation matrix $Z$ is updated, the constant $X$ is updated with the latest $Z$ and used in the next iteration. To simplify the optimization problem, the update of $S$ is decomposed column-wise, where $s_i$ and $q_i$ are the $i$-th columns of $S$ and $Q$, respectively, $\nu$ is the multiplier for the equality constraint, and $u_i$ is the multiplier for the inequality constraint. For any $j$, the KKT conditions of Equation (32) yield the closed form

$s_{ij} = \max(q_{ij} + \nu, 0)$,

where $\nu$ is determined by the equality constraint. So far, the update rules for all variables have been provided; these rules are applied repeatedly until convergence or until the maximum number of iterations is reached. The update rules described above are summarized in the following algorithm.

The main computational cost of the optimization algorithm lies in the updates of $Z$ and $F$. Specifically, the complexity of updating $Z$ is $O(n^3)$, due to the matrix inversion and multiplication, and the complexity of updating $F$ is also $O(n^3)$, due to the SVD.
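If the constraints on each column $s_i$ are $s_i \geq 0$ and $s_i^T \mathbf{1} = 1$ (the usual setting in similarity learning, assumed here), the KKT conditions above reduce to a Euclidean projection onto the probability simplex, which has the well-known closed form $s_j = \max(q_j - \nu, 0)$ with $\nu$ chosen so the entries sum to one. A sketch of the standard projection:

```python
import numpy as np

def project_simplex(q):
    # Euclidean projection of q onto {s : s >= 0, sum(s) = 1}.
    # The KKT conditions give s_j = max(q_j - nu, 0), with the scalar nu
    # fixed by the equality constraint; nu is found by sorting the entries.
    u = np.sort(q)[::-1]                  # entries in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(q) + 1) > (css - 1))[0][-1]
    nu = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(q - nu, 0.0)
```

Each column update of $S$ then costs $O(n \log n)$ for the sort, which is negligible next to the $O(n^3)$ updates of $Z$ and $F$.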

In this section, we perform validation experiments on clustering tasks.

Algorithm 1
Input: X, $m_t$ ($t = 1, \cdots, q$), θ, α, β, ω, η, ε, µ, λ, c
Output: Projection matrix W
Initialize:

We compare several variants, such as self-adaptive multi-metric unsupervised feature selection with the nuclear norm (SMUN), with the Gamma norm (SMUG), and with the proposed low-rank approximation (Ours). In particular, in SMUN, to compare with the state-of-the-art, the weighted nuclear norm [22], the latest nuclear-norm-based method, is adopted. Moreover, all parameters of the existing approaches mentioned above are set according to the respective parameter-searching strategies provided in their open-source code or papers. Besides, all balance parameters of the proposed method are tuned from $\{50^{-3}, 10^{-3}, 50^{-2}, 10^{-2}, 50^{-1}, 10^{-1}, 1, 5, 10\}$. In addition, the parameter $\varepsilon$ of the proposed rank-norm approximation is tuned from $\{50^{-3}, 10^{-3}, 50^{-2}, 10^{-2}, 50^{-1}\}$, while the parameter $\gamma$ of the Gamma norm is varied from $50^{-3}$ to $10^{-1}$ with step $50^{-3}$. Apart from that, the learning rate is set to 0.05, and the maximum number of iterations is 300. To evaluate the performance of these algorithms, we use clustering accuracy (ACC) as the evaluation metric in this paper.

Table 2 shows the ACC performance on an artificial dataset in which the feature spaces of 10% and 30% of the original samples are corrupted by noise ranging from -0.3 to 0.3. As shown in Table 2, as the ratio of corrupted data increases, the ACC performance of every method decreases obviously. However, the ACC performance of the proposed method beats that of the other methods, and the proposed method shows a slower decrease in ACC as the ratio of corrupted data increases.

1 https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
2 http://cam-orl.co.uk/facedatabase.html/
This is because the noise problem can be alleviated by the self-representation, and the similarity across different metrics is adaptively integrated according to the specific data by the proposed multi-metric fusion. In contrast, existing works usually leverage a single fixed metric to build the similarity matrix, which cannot fit various feature types well and may even damage the local manifold structure.

Table 4 and Table 5 show the ACC performance on the public datasets when selecting 50% and 30% of the features, respectively. As shown in Table 4 and Table 5, compared with SOGFS, LLCFS and DGUFS, which consider the local manifold structure only, the ACC performance of LGSP, SMUN, SMUG and Ours is obviously better. This is because exploiting the local manifold structure and the global structure simultaneously achieves higher accuracy than preserving only the local or the global structure. Furthermore, one may observe that the ACC performance of SMUN, SMUG and Ours generally tends to outperform that of LGSP. Hence, the effectiveness of the proposed adaptive multi-metric fusion method is verified. Besides, it can be seen that the ACC performance of SMUG and Ours is better than that of SMUN. This is because, compared with the nuclear norm, the proposed low-rank approximation and the Gamma norm approximate the rank function almost unbiasedly. In addition, one may find that the ACC performance of Ours, which is based on the proposed low-rank approximation, is almost the same as that of SMUG. Therefore, the proposed low-rank approximation achieves almost the same rank-approximation performance as the Gamma norm.
However, although the approximation performance is similar, the robustness to the rank-norm parameter differs, as analyzed below.

Table 6 shows the effectiveness of the similarity-metric fusion, where the highest ACC under different numbers of fused similarity metrics is reported. As shown in Table 6, when only a single fixed metric is leveraged to build the similarity matrix, the ACC performance is inferior to that of existing methods such as JSCFS and PDUFS.

To further study the robustness of the Gamma norm and the proposed method with respect to their parameters, Table 7 shows how changes of the parameters of the low-rank approximation methods (i.e., $\varepsilon$ in the proposed low-rank approximation, or $\gamma$ in the Gamma norm) impact the ACC performance. Only the results corresponding to Table 4 are considered, namely, the results of selecting 50% of the features. In Table 7, $f$ denotes the variation ratio of the parameters relative to their values in Table 4. For example, $f = 10\%$ means that the parameters are increased and decreased by 10% from their values in Table 4, and the corresponding ACC in Table 7 is the higher of the two resulting ACC values. As can be seen, the impact of the variation $f$ on the ACC performance of Ours is smaller than its counterpart on the ACC performance of SMUG. Therefore, compared with the Gamma norm, the proposed method is less sensitive to the parameter $\varepsilon$.
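The ACC values reported in these comparisons are conventionally computed by matching predicted cluster labels to the ground-truth labels with the Hungarian algorithm; the sketch below shows this standard computation (it is the common convention, not code from the paper).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    # Build the confusion matrix between predicted clusters and true labels,
    # find the cluster-to-label matching that maximizes agreement
    # (Hungarian algorithm), and report the matched fraction as accuracy.
    labels_t = np.unique(y_true)
    labels_p = np.unique(y_pred)
    C = np.zeros((len(labels_p), len(labels_t)), dtype=int)
    for i, p in enumerate(labels_p):
        for j, t in enumerate(labels_t):
            C[i, j] = np.sum((y_pred == p) & (y_true == t))
    row, col = linear_sum_assignment(-C)   # negate to maximize matched counts
    return C[row, col].sum() / len(y_true)
```

The matching step is what makes ACC invariant to a permutation of cluster indices, which is necessary because clustering assigns arbitrary labels.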

Fig. 2 and Fig. 3 show the curves of the adaptive weights as a function of the number of iterations for the COIL20 dataset and the Lung dataset. It can be seen that the convergence rate of the adaptive metric weights in the proposed method is very fast: after a few iterations, the weights of all the similarity metrics used in this paper tend to converge to fixed values. Moreover, on one of the datasets a higher weight is assigned to the squared Euclidean metric, whereas the KNN metric, which performs well on the COIL20 dataset, receives lower weights. It can be seen that, on different datasets, the proposed method has different preferences in the choice of the similarity metric.

The analysis of parameter sensitivity is shown in Fig. 4, where different combinations of the balance parameters $\theta$ and $\omega$ are evaluated and the ACC is taken as the performance metric.

In this paper, to address the problem that a fixed metric cannot fit various feature types well, we propose an adaptive multi-metric fusion by automatically fusing refined similarity across different metrics according to the specific data.