
Unsupervised Feature Selection via Nonlinear Representation and Adaptive Structure Preservation

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14431)

Abstract

Unsupervised feature selection has attracted increasing attention because high-dimensional data are increasingly common and labeling them is costly. Most existing unsupervised feature selection methods assume that all feature associations can be explained by linear relationships. However, data with exclusively linear relationships are rare in practice. Moreover, the quality of the similarity matrix strongly affects the effectiveness of conventional spectral-based methods, and real-world data contain considerable noise and redundancy, making a similarity matrix built from the raw data unreliable. To address these problems, we propose a novel and robust feature selection method built on a new nonlinear mapping function that mines the nonlinear relationships among features. Furthermore, we incorporate manifold learning into the training process, with adaptive graph constraints based on the maximum entropy principle, to preserve the intrinsic structure of the data while capturing more accurate information. An efficient and effective algorithm is designed to optimize our model. Experiments on eight benchmark datasets covering face images, biology, and time series show that our method outperforms nine state-of-the-art algorithms, validating its superiority and effectiveness. The source code is available at https://github.com/aasdlaca/NRASP.

This work was supported in part by the National Natural Science Foundation of China under Grant 62306244, and in part by the Key Project of Shaanxi Province-City Linkage under Grant 2022GD-TSLD-53.


References

  1. Atashgahi, Z., et al.: Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders. Mach. Learn. 111(1), 377–414 (2022)

  2. Balın, M.F., Abid, A., Zou, J.: Concrete autoencoders: differentiable feature selection and reconstruction. In: Proceedings of the International Conference on Machine Learning, pp. 444–453 (2019)

  3. Cai, D., Zhang, C., He, X.: Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 333–342 (2010)

  4. Gong, X., Yu, L., Wang, J., Zhang, K., Bai, X., Pal, N.R.: Unsupervised feature selection via adaptive autoencoder with redundancy control. Neural Netw. 150, 87–101 (2022)

  5. Gu, Q., Li, Z., Han, J.: Joint feature selection and subspace learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1294–1299 (2011)

  6. Han, K., Wang, Y., Zhang, C., Li, C., Xu, C.: Autoencoder inspired unsupervised feature selection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2941–2945 (2018)

  7. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS], pp. 507–514 (2005)

  8. Huang, Q., Xia, T., Sun, H., Yamada, M., Chang, Y.: Unsupervised nonlinear feature selection from high-dimensional signed networks. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pp. 4182–4189 (2020)

  9. Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(4), 620 (1957)

  10. Li, X., Zhang, H., Zhang, R., Liu, Y., Nie, F.: Generalized uncorrelated regression with adaptive graph for unsupervised feature selection. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1587–1595 (2019)

  11. Li, Z., Yang, Y., Liu, J., Zhou, X., Lu, H.: Unsupervised feature selection using nonnegative spectral analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence (2012)

  12. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1), 503–528 (1989)

  13. Mahmud, M., Kaiser, M.S., Hussain, A., Vassanelli, S.: Applications of deep learning and reinforcement learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2063–2079 (2018)

  14. Nie, F., Zhu, W., Li, X.: Unsupervised feature selection with structured graph optimization. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1302–1308 (2016)

  15. Qian, M., Zhai, C.: Robust unsupervised feature selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1621–1627 (2013)

  16. Saberian, M.J., Vasconcelos, N.: Boosting algorithms for simultaneous feature extraction and selection. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2448–2455 (2012)

  17. Yang, Y., Shen, H.T., Ma, Z., Huang, Z., Zhou, X.: \({l}_{{2,1}}\)-norm regularized discriminative feature selection for unsupervised learning. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1589–1594 (2011)

  18. You, M., Ban, L., Wang, Y., Kang, J., Wang, G., Yuan, A.: Unsupervised feature selection with joint self-expression and spectral analysis via adaptive graph constraints. Multim. Tools Appl. 82(4), 5879–5898 (2023)

  19. You, M., Yuan, A., He, D., Li, X.: Unsupervised feature selection via neural networks and self-expression with adaptive graph constraint. Pattern Recognit. 135, 109173 (2023)

  20. You, M., Yuan, A., Zou, M., He, D.J., Li, X.: Robust unsupervised feature selection via multi-group adaptive graph representation. IEEE Trans. Knowl. Data Eng. (2021)

  21. Yuan, A., Huang, J., Wei, C., Zhang, W., Zhang, N., You, M.: Unsupervised feature selection via feature-grouping and orthogonal constraint. In: International Conference on Pattern Recognition (ICPR), pp. 720–726 (2022)

  22. Yuan, A., You, M., He, D., Li, X.: Convex non-negative matrix factorization with adaptive graph for unsupervised feature selection. IEEE Trans. Cybern. 52(6), 5522–5534 (2022)

  23. Zhang, Y., et al.: Unsupervised nonnegative adaptive feature extraction for data representation. IEEE Trans. Knowl. Data Eng. 31(12), 2423–2440 (2019)

  24. Zhu, P., Zhu, W., Hu, Q., Zhang, C., Zuo, W.: Subspace clustering guided unsupervised feature selection. Pattern Recogn. 66(C), 364–374 (2017)


Author information

Corresponding author

Correspondence to Aihong Yuan.

Appendices

A Derivation

1.1 A.1 Derivation of the Manifold Structure Preservation Item

Recalling the definition of manifold structure preservation: if two data points are close in the original data space, the projected points \(W^{T}x_i\) and \(W^{T}x_j\) should also be close. Therefore, the manifold structure preservation term is obtained as:

$$\begin{aligned} \min \limits _{W}{\frac{1}{2}{\sum \limits _{i,j}{\left\| {W^{T}x_{i} - W^{T}x_{j}} \right\| _{2}^{2}s_{ij}}}} \end{aligned}$$
(14)

where \(s_{ij}\) denotes the similarity between the data points \(x_i\) and \(x_j\). A large \(s_{ij}\) forces \(||W^{T}x_i - W^{T}x_j||_2^2\) to be small, whereas the distance is allowed to be large only when \(s_{ij}\) is small. Therefore, the neighborhood relationships of the original data points are preserved among the mapped data points.

We can verify that

$$\begin{aligned} \begin{aligned} {}&\frac{1}{2}\sum _{i,j}{||{W}^{T}}{{x}_{i}}-{{W}^{T}}{{x}_{j}}||_{2}^{2}{{s}_{ij}}\\ &=\sum _{i=1}^m(W^T x_i )^T W^T x_i D_{ii}- \sum _{i,j=1}^m(W^T x_i )^T W^T x_j s_{ij}\\ &=Tr(W^T X^T DXW)- Tr(W^T X^T SXW)\\ &=Tr(W^T X^T L_S XW) \end{aligned} \end{aligned}$$
(15)

where \(L_S\) is a Laplacian matrix and \(Tr(\cdot )\) denotes the trace of a matrix. \(L_S\) is computed as \(L_S = D - (\frac{S+S^{T}}{2}) \), where D is a diagonal matrix whose elements are defined as:

$$\begin{aligned} D_{ii} = \sum _{j=1}^m \frac{\left( s_{i j}+s_{ji}\right) }{2}, i=1,2,\cdots ,m. \end{aligned}$$
(16)
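The equivalence in Eq. (15) is easy to verify numerically. The following NumPy sketch (an illustration with toy sizes, not taken from the paper) compares the weighted pairwise distance sum with the trace form, using the Laplacian defined by Eq. (16):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, c = 6, 10, 3                       # samples, features, projected dimensions (toy sizes)
X = rng.standard_normal((m, d))          # rows are the samples x_i^T
W = rng.standard_normal((d, c))          # projection / feature-weight matrix
S = rng.random((m, m))                   # a (possibly asymmetric) similarity matrix

# Left-hand side: (1/2) * sum_ij ||W^T x_i - W^T x_j||_2^2 * s_ij
Y = X @ W                                # row i is (W^T x_i)^T
lhs = 0.5 * sum(S[i, j] * np.sum((Y[i] - Y[j]) ** 2)
                for i in range(m) for j in range(m))

# Right-hand side: Tr(W^T X^T L_S X W) with L_S = D - (S + S^T) / 2
D = np.diag((S.sum(axis=1) + S.sum(axis=0)) / 2)   # D_ii = sum_j (s_ij + s_ji) / 2, as in Eq. (16)
L_S = D - (S + S.T) / 2
rhs = np.trace(W.T @ X.T @ L_S @ X @ W)

print(np.isclose(lhs, rhs))              # True: both expressions agree
```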

1.2 A.2 Derivation of the KKT Condition

With the Lagrangian multiplier method, Eq. (6) (in main body) is rewritten as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{1}^{'} &{}= \frac{1}{2} \beta \sum _{i=1}^m \sum _{j=1}^m{||{W}^{T}}{{x}_{i}}-{{W}^{T}}{{x}_{j}}||_{2}^{2}{{s}_{ij}}+ \gamma \sum _{i=1}^m \sum _{j=1}^m s_{ij} \log s_{ij}\\ &{}~~~~ +\sum _{i=1}^m \sum _{j=1}^m \lambda _{ij} s_{ij}+\sum _{i=1}^m \mu _i\left( \sum _{j=1}^m s_{i j}-1\right) \end{aligned} \end{aligned}$$
(17)

where \(M = [\mu _1, \mu _2, \ldots , \mu _m]\) and \(\Lambda = [\lambda _{ij}]_{m \times m}\) are Lagrangian multipliers. The KKT conditions of Eq. (17) are summarized as

$$\begin{aligned} \begin{aligned} \left\{ \begin{array}{l} \dfrac{\partial \mathcal {L}_{1}^{'}}{\partial s_{i j}}=\dfrac{\beta }{2}\left\| W^{T} x_i-W^{T} x_j\right\| _{2}^{2}+\gamma \left( \log s_{i j}+1\right) +\lambda _{i j}+\mu _i=0 \\ s_{i j} \ge 0,~~ \lambda _{i j} \ge 0,~~ \lambda _{i j} s_{i j} = 0,~~ \sum _{j=1}^m s_{i j}=1 . \end{array}\right. \end{aligned} \end{aligned}$$
(18)

Solving Eq. (18) yields the optimal \(s_{ij}\) given in Eq. (7) (in main body).
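Eq. (7) itself is not reproduced in this appendix, but its form can be recovered from Eq. (18): assuming the multipliers \(\lambda _{ij}\) vanish at an interior optimum (\(s_{ij} > 0\)), the stationarity condition together with \(\sum _{j} s_{ij} = 1\) gives a row-wise softmax over the scaled negative projected distances. The NumPy sketch below illustrates this assumed form; it is not the authors' released implementation.

```python
import numpy as np

def update_similarity(X, W, beta, gamma):
    """Maximum-entropy similarity update implied by Eq. (18).

    Assumes lambda_ij = 0 at the optimum (s_ij > 0); stationarity plus the
    simplex constraint sum_j s_ij = 1 then yields a row-wise softmax over
    -beta * ||W^T x_i - W^T x_j||_2^2 / (2 * gamma).
    """
    Y = X @ W                                                         # projected samples as rows
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    logits = -beta * sq_dist / (2.0 * gamma)
    logits -= logits.max(axis=1, keepdims=True)                       # numerical stability
    S = np.exp(logits)
    return S / S.sum(axis=1, keepdims=True)                           # each row sums to one
```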

B Experiment Setup

Table 2. Statistics of the datasets used.

To validate the effectiveness of our method, we compared it with a baseline that uses all features and with nine representative unsupervised FS methods.

These methods are briefly described as follows.

  1.

    All-Fea: Use all features for clustering. This method serves as the baseline to verify whether the selected features outperform the full feature set in clustering.

  2.

    Laplacian score (LS) [7]: This method measures features by their variance and local structure preservation ability.

  3.

    Multi-cluster feature selection (MCFS) [3]: This method uses an \({{l}_{1}}\)-regularized regression model with spectral analysis to select the most important features, and is capable of preserving the data's clustering structure.

  4.

    Nonnegative discriminative feature selection (NDFS) [11]: This method exploits the discriminative information of the data by incorporating cluster label learning into FS. In addition, it imposes a nonnegative constraint to obtain more accurate cluster labels.

  5.

    Unsupervised discriminative feature selection (UDFS) [17]: This method integrates discriminative analysis with \({{l}_{2,1}}\)-norm regularization in a unified framework to exploit discriminative information for unsupervised FS.

  6.

    Generalized uncorrelated regression with the adaptive graph for unsupervised feature selection (URAFS) [10]: This method selects features and performs manifold learning simultaneously using an uncorrelated regression model while incorporating the data’s geometric structure into the manifold learning process.

  7.

    Autoencoder feature selection (AEFS) [6]: This method combines an autoencoder network with group LASSO, exploiting both the linear and nonlinear information among features to perform FS.

  8.

    Concrete autoencoder (CAE) [2]: This method uses a concrete autoencoder for differentiable feature selection and reconstruction. CAE employs a concrete selector layer with an effective learning algorithm that converges to a discrete feature subset.

  9.

    Quick selection (QS) [1]: This method selects features according to the strength of the neurons of sparse denoising autoencoders trained with a sparse evolutionary strategy.

It should be noted that the Laplacian score (LS) [7] is not an embedding-based method: it measures features by their variance and local structure preservation ability. Due to its efficiency and decent performance, LS remains a popular FS method, so we include it in our comparative study. The All-Fea baseline simply uses all features.

To evaluate the performance of the various unsupervised methods, we used the K-nearest neighbor (KNN) algorithm in LS, MCFS, UDFS, and NDFS, with the number of nearest neighbors set to five. In our method, the activation functions of the encoder and decoder are tanh. For parameter initialization, we used a grid search over \(\left\{ 10^{-3}, 10^{-2}, \ldots , 10^{3}\right\} \) to find the optimal parameters of UDFS, URAFS, AEFS, and our method. In AEFS, the hidden layer has 256 neurons. In CAE, we used one hidden layer with LeakyReLU(0.2) activation. In QS, we grid-searched the parameter \(\zeta \) over \(\left\{ 0.1, 0.2, 0.3, 0.4, 0.5\right\} \) and the parameter \(\varepsilon \) over \(\left\{ 2, 5, 10, 13, 20, 25\right\} \). In particular, for our proposed method, the parameters \(\alpha \), \(\beta \), and \(\gamma \) were also grid-searched over \(\left\{ 10^{-3}, 10^{-2}, \ldots , 10^{3}\right\} \). We selected \(k\in \left\{ 20, 40, \ldots , 300\right\} \) features to conduct the experiments.
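For reference, the sketch below builds a symmetric 5-nearest-neighbor affinity graph of the kind used by the spectral baselines (LS, MCFS, UDFS, NDFS). It is only illustrative: the exact edge weighting each baseline uses (binary, heat kernel, etc.) may differ from this 0/1 construction.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric 0/1 k-nearest-neighbor affinity graph (k = 5 as in the setup above)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    np.fill_diagonal(sq_dist, np.inf)                                 # exclude self-loops
    neighbors = np.argsort(sq_dist, axis=1)[:, :k]                    # indices of the k closest points
    S = np.zeros_like(sq_dist)
    rows = np.repeat(np.arange(X.shape[0]), k)
    S[rows, neighbors.ravel()] = 1.0
    return np.maximum(S, S.T)                                         # symmetrize

# Toy usage on random data.
S = knn_graph(np.random.default_rng(0).standard_normal((50, 20)), k=5)
```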

Considering Eq. (13) (in main body), since W is regularized to be row-sparse, \(||w_i||_2\) may become zero during training. To avoid this, we add a small positive constant \(\epsilon \) to ensure that \(Q_{ii}\) remains differentiable. Accordingly, Q is transformed into \(Q^{'}\), whose i-th diagonal element is defined as

$$\begin{aligned} Q_{ii}^{'} = \frac{1}{2\sqrt{w_i^Tw_i+\epsilon }} (i=1,2, \cdots , d) \end{aligned}$$
(19)

Replacing Q with \(Q^{'}\), Eq. (12) (in main body) can be written as follows:

$$\begin{aligned} \begin{array}{l} \dfrac{{\partial \mathcal {L}_{3}}}{{\partial W}} = \;\;2{a^{\left( 0 \right) }}{\left( {{\delta ^{\left( 1 \right) }}} \right) ^T} + 2\alpha WQ^{'} + 2\beta {X^T}{L_S}XW \end{array} \end{aligned}$$
(20)
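The sketch below illustrates Eq. (19) and the last two terms of Eq. (20). The reconstruction-term gradient \(2a^{(0)}(\delta ^{(1)})^T\) comes from backpropagation through the encoder and decoder and is omitted here; \(w_i\) is taken as the i-th column of W so that the product \(WQ^{'}\) conforms. This is an illustration of the formulas under those assumptions, not the authors' implementation.

```python
import numpy as np

def sparsity_and_manifold_grad(W, X, L_S, alpha, beta, eps=1e-8):
    """Last two terms of Eq. (20): 2*alpha*W*Q' + 2*beta*X^T*L_S*X*W.

    Q' follows Eq. (19): Q'_ii = 1 / (2 * sqrt(w_i^T w_i + eps)); the small
    constant eps keeps the expression differentiable when a norm shrinks to zero.
    """
    Q_prime = np.diag(1.0 / (2.0 * np.sqrt(np.sum(W ** 2, axis=0) + eps)))
    grad_sparsity = 2.0 * alpha * W @ Q_prime        # (sub)gradient of alpha * ||W||_{2,1}
    grad_manifold = 2.0 * beta * X.T @ L_S @ X @ W   # gradient of beta * Tr(W^T X^T L_S X W)
    return grad_sparsity + grad_manifold
```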
Fig. 4. Comparison of original and reconstructed images with MNIST and the visualization of the feature importance \(\Vert {{w}_{i}}\Vert \).

C Visualization

The training results of the novel nonlinear self-representation are visualized in Fig. 4. The input sample image is shown in Fig. 4(a) and the reconstructed image in Fig. 4(b), which shows that the nonlinear self-representation model effectively reconstructs the sample while preserving the intrinsic structure and the linear and nonlinear relationships among the original features. Figure 4(c) presents the importance \(\Vert {{w}_{i}}\Vert \) of each feature i, reshaped to the shape of one MNIST sample, such as the one in the top left of Fig. 4(a). The 40 most important features of the MNIST dataset are shown in Fig. 4(d). From Fig. 4(c) and Fig. 4(d), the meaningful features are primarily concentrated in the middle of the image rather than at the edges, which is highly consistent with the practical situation. Meanwhile, the region formed by the features with high importance resembles the shape of the digit 3. This may imply that the top, bottom, middle, and right parts of the image are key to distinguishing between different digits.
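A short sketch of how such a visualization can be produced is given below. The weight matrix here is a random stand-in for the learned W; only the 28x28 MNIST grid and the top-40 selection match the setting described above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 784))              # stand-in for the learned d x d weights (d = 784 pixels)

importance = np.linalg.norm(W, axis=1)           # ||w_i|| for each of the 784 pixel features
heatmap = importance.reshape(28, 28)             # Fig. 4(c): importance rendered as an image
top40 = np.zeros(784, dtype=bool)
top40[np.argsort(importance)[::-1][:40]] = True  # Fig. 4(d): the 40 most important features
mask = top40.reshape(28, 28)
```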

Table 3. Clustering performance (ACC% ± std) of different combinations of the parts of our method on 8 datasets. Bold indicates the best results.
Table 4. Clustering performance (NMI% ± std) of different combinations of the parts of our method on 8 datasets. Bold indicates the best results.

D Ablation Study

Ablation experiments are commonly performed to understand the function of each part of a model. We studied the role of each part of the proposed model by removing different parts in turn. The combinations shown in Tables 3 and 4 are as follows:

  1.

    Base linear model: The linear self-representation model with sparse constraints is as follows:

    $$\begin{aligned} \mathcal {L}=\left\| X-XW\right\| _F^2+\lambda \Vert {W}\Vert _{2,1} \end{aligned}$$
    (21)
  2.

    Nonlinear self-representation part (b): The basic nonlinear mapping model based on self-representation, with sparse constraints added. The loss function is

    $$\begin{aligned} \mathop {\min }\limits _{W, \varTheta } \quad \mathcal {L}(W; \varTheta ) = \left\| {X - f\left( {g\left( {XW} \right) } \right) } \right\| _F^2 + \;\alpha {\left\| W \right\| _{2,1}} \end{aligned}$$
    (22)
  3.

    \(b+\beta \): The base nonlinear self-representation model combined with manifold learning using a fixed similarity matrix.

    $$\begin{aligned} \begin{aligned} \mathop {\min }\limits _{W, \varTheta } \quad \mathcal {L}(W; \varTheta ) = &{}\left\| {X - f\left( {g\left( {XW} \right) } \right) } \right\| _F^2 + \;\alpha {\left\| W \right\| _{2,1}}+ \beta Tr({W^T}{X^T}{L_S}XW) \end{aligned} \end{aligned}$$
    (23)
  4.

    NRASP (\(b+\beta +\gamma \)): The whole proposed model, given by Eq. (5) (in main body).

As shown in Tables 3 and 4, we first take the linear self-representation model as the basis, on top of which we introduce the nonlinear mapping method that adopts the self-representation idea. The nonlinear model shows comparatively superior performance on most biological datasets, such as lymphoma, TOX_171, and GLIOMA, corroborating that the nonlinear self-representation model has better nonlinear learning ability than the linear model. However, the basic nonlinear self-representation model underperformed on the warpPIE10P image dataset, yet performed even better on all datasets when combined with manifold learning. This indicates that the nonlinear self-representation model captures nonlinear information well but may omit the structural information of the data. Furthermore, when the similarity matrix is dynamically updated during manifold learning, the model achieves excellent performance on all but the TOX_171 and HAR datasets, showing that the structural information of the data is captured better. By combining the different parts, NRASP balances them well and achieves superior performance.
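For clarity, the ablated objectives in Eqs. (21)-(23) can be written compactly as in the sketch below. The encoder g and decoder f are collapsed to parameter-free tanh maps for brevity (in the actual model they carry learnable parameters \(\varTheta \)), so this is only an illustrative sketch of the loss values, not the training code.

```python
import numpy as np

def l21_norm(W):
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))        # sum of row l2-norms

def g(Z):                                                 # encoder (simplified, no learnable weights)
    return np.tanh(Z)

def f(Z):                                                 # decoder (simplified, no learnable weights)
    return np.tanh(Z)

def ablation_losses(X, W, L_S, alpha, beta, lam):
    """Values of the ablated objectives in Eqs. (21)-(23) for given X, W, L_S."""
    base_linear = np.linalg.norm(X - X @ W, 'fro') ** 2 + lam * l21_norm(W)             # Eq. (21)
    nonlinear_b = np.linalg.norm(X - f(g(X @ W)), 'fro') ** 2 + alpha * l21_norm(W)     # Eq. (22)
    b_plus_beta = nonlinear_b + beta * np.trace(W.T @ X.T @ L_S @ X @ W)                # Eq. (23)
    return base_linear, nonlinear_b, b_plus_beta
```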

E Detailed Results

Table 5. Clustering performance (NMI% ± std) of 9 FS algorithms on 8 datasets. Bold indicates the best results.

Fig. 5. Clustering NMI of 9 FS algorithms on 8 datasets with different numbers of selected features.

Fig. 6. Convergence curves of our method on 4 additional datasets.

Fig. 7. Additional results of clustering ACC with different \(\alpha \), \(\beta \), \(\gamma \) on lymphoma and orlraws10P. Each row corresponds to one dataset: (d)–(f) are lymphoma, (g)–(h) are orlraws10P. Best viewed in color.

Fig. 8. Clustering NMI with different \(\alpha \), \(\beta \), \(\gamma \) on MNIST, lymphoma, and orlraws10P. Each row corresponds to one dataset: (a)–(c) are MNIST, (d)–(f) are lymphoma, (g)–(h) are orlraws10P. Best viewed in color.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Yuan, A., Lin, L., Tian, P., Zhang, Q. (2024). Unsupervised Feature Selection via Nonlinear Representation and Adaptive Structure Preservation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14431. Springer, Singapore. https://doi.org/10.1007/978-981-99-8540-1_12

  • DOI: https://doi.org/10.1007/978-981-99-8540-1_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8539-5

  • Online ISBN: 978-981-99-8540-1