NNNPE: non-neighbourhood and neighbourhood preserving embedding

ABSTRACT Manifold learning is an important class of methods for nonlinear dimensionality reduction. Among them, local linear embedding (LLE) reduces dimensionality while maintaining the relationships within local neighbourhoods of the original embedded manifold, and neighbourhood preserving embedding (NPE) is a linear approximation to LLE. However, these two algorithms only maintain the neighbour relationships of samples in the low-dimensional space and ignore the global features between non-neighbouring samples, such as the shooting angle of a face. To consider the nearest-neighbour structure and the global features of samples simultaneously while keeping the computation linear, this work proposes a novel linear dimensionality reduction approach named non-neighbourhood and neighbourhood preserving embedding (NNNPE). First, we rewrite the objective function of LLE according to the principle of the new algorithm. Second, we introduce a linear mapping into the objective function. Finally, the mapping matrix is calculated by a fast method for learning a Mahalanobis metric. The experimental results show that the proposed method is effective.


Introduction
Data dimensionality reduction is an important basic task of pattern recognition, and its methods are divided into traditional machine learning methods and deep learning-based methods. At present, the most commonly used and effective data dimensionality reduction method is automatic feature extraction based on deep learning. For example, high-dimensional raw image data are directly input into a convolutional neural network (CNN), such as ResNet (He et al., 2016), and a large amount of labelled data is used to drive supervised learning. The features obtained after dimensionality reduction are then passed to a fully connected network for classification. Alternatively, the image is input into an auto-encoder (Baldi, 2012), and unsupervised learning is performed to compress the feature dimension with as little loss of information as possible. The advantage of this type of method is that it does not require manually designed features and can automatically extract features according to the data-driven target. However, the method is limited by its need for a large amount of training data and by its lack of interpretability. For fields that focus on interpretability, such as medical data analysis, traditional data dimensionality reduction methods with sound theoretical support are still valuable.
Traditionally, dimensionality reduction is a type of projection method that performs a linear projection on incoming data and reduces its dimension from high to low, such as principal component analysis (PCA) (Turk & Pentland, 1991) and linear discriminant analysis (LDA) (Belhumeur et al., 1997). LDA maximises the ratio of between-class to within-class variance to identify an explicit projection, whereas PCA maximises the variance of the data in the low-dimensional space to learn a linear projection. New linear dimensionality reduction methods have performed well in recent years. Among them, SWRLDA (Yan et al., 2020) can automatically avoid the optimal mean calculation and learn the adaptive weights of class pairs to achieve fast convergence. Chang et al. (2015) proposed a composite rank-k projection algorithm to directly process matrices without converting them to vectors for bilinear analysis. However, these linear dimensionality reduction methods are usually inadequate for dealing with complex nonlinear data.
Another type of traditional data dimensionality reduction method is nonlinear dimensionality reduction. Nonlinear dimensionality reduction methods include methods based on kernel functions (Sumithra & Surendran, 2015; Zhu et al., 2012), early neural networks (Pehlevan & Chklovskii, 2015; Wang et al., 2012), and manifold learning (Cai et al., 2007; Yang et al., 2016). An important problem in kernel-based dimensionality reduction methods is how to choose the kernel function; in addition, the objective function used for dimensionality reduction does not consider maintaining the integrity of the data structure. The methods based on early neural networks, such as multilayer fully connected networks and BP networks (Schmidhuber, 2015), have high training complexity and lack interpretability. The goal of manifold learning is to recover the low-dimensional manifold structure from high-dimensional sampling data and obtain the corresponding embedding map to realise data dimensionality reduction or data visualisation, which has good interpretability. Brahma et al. (2015) explained from the perspective of manifold structure that deep learning needs to learn the hidden manifold structure from high-dimensional data.
Local linear embedding (LLE) and neighbourhood preserving embedding (NPE) (He et al., 2005) strive to preserve each neighbourhood's local configuration in low-dimensional space as it is in high-dimensional space. Thus, these methods preserve only each data point's local relationships during the dimensionality reduction process. However, they ignore the relationships between data points that are far from each other, especially in the supervised setting. Dimensionality reduction mainly preserves the intrinsic dimensionality, which is characterised by discrimination. For example, LDA ensures that inter- and intra-class distances are maximised and minimised, respectively. It seeks to preserve the local neighbourhood structure (within-class) while also maintaining the global non-neighbourhood structure (between-class). Meanwhile, Mahalanobis metric learning (MML) (Zheng et al., 2012) aims to maximise the distance between dissimilar points while decreasing the distance between similar points.

Main contributions
In this study, motivated by LDA, MML, and NPE, we present a novel linear projection approach, which is a supervised version of NPE. First, we create an adjacency graph per category and compute the weights on the graph's edges as in LLE and NPE. Then, we construct a dissimilar pairwise sample set and a pseudo similar pairwise sample set. By using techniques from MML and ensuring that the mapping function is linear, we obtain the objective function of the proposed algorithm, namely non-neighbourhood and neighbourhood preserving embedding (NNNPE). Finally, binary search and eigenvalue optimisation are employed to solve the optimisation problem efficiently (Bellet et al., 2013). The main contributions and innovations of this work are as follows:
(1) Similar to NPE, the proposed method can provide an ideal linear approximation to the true LLE embedding of the underlying data manifold. It is therefore not defined only on the training data points.
(2) The difference between the proposed algorithm and NPE is that this work considers the relationships between non-neighbour points, whereas NPE ignores them. Thus, our algorithm can provide a more meaningful representation of the data than NPE.
(3) We establish the connection between manifold learning and MML. By using techniques from MML and ensuring that the mapping function is linear, we obtain an objective function that is similar to that presented in a previous study (Xiang et al., 2008).
(4) This method belongs to the family of traditional dimensionality reduction methods derived from nonlinear manifold learning. It considers both local neighbourhood features and global features, and it has good interpretability. It has application value in fields with strict interpretability requirements, such as the medical field.
The paper is organised as follows. The associated dimension reduction approaches are briefly introduced in Section 2. Section 3 introduces other algorithms and our algorithm, and Section 4 discusses the theoretical justification. Section 5 offers a time-saving approach for resolving the goal function from Section 3. Section 6 presents the experimental outcomes. Section 7 summarises this study.

Manifold learning
Nonlinear dimensionality reduction approaches, in contrast to linear dimensionality reduction techniques, can deal with complex nonlinear data and have thus attracted widespread attention. Many nonlinear dimensionality reduction algorithms have been proposed in recent decades, such as Isomap (Tenenbaum et al., 2000), LLE (Roweis & Saul, 2000), Laplacian eigenmaps (LE) (Belkin & Niyogi, 2001), Hessian LLE (Donoho & Grimes, 2003), and LTSA (Zhang & Zha, 2004). These algorithms seek the nonlinear low-dimensional manifold, inherent in the high-dimensional space, on which the sample data lie. Isomap is a global approach that seeks to retain the pairwise geodesic distances among data points in the low-dimensional space. By contrast, the other techniques are local methods. LLE and LE endeavour to keep the local geometry of the data, so that points that are neighbours on the high-dimensional manifold remain neighbours on the low-dimensional manifold. Hessian LLE is similar to LE, except that it replaces the manifold Laplacian with the manifold Hessian. Meanwhile, LTSA is a technique that uses the local tangent space of each sample to characterise the local features of high-dimensional data (Van Der Maaten et al., 2009). These nonlinear dimensionality reduction methods have the advantage of finding the manifold embedding even when real-world data lie on a highly nonlinear manifold. However, their embeddings are defined only on the training points and not everywhere in the sample space.
Most manifold learning techniques, in contrast to linear dimension reduction approaches, do not provide explicit projections for data points. Recently, effective and efficient algorithms have been developed for the linearisation of nonlinear manifold learning (LML) techniques. IsoProjection (Cai et al., 2007) linearises Isomap by first constructing a weighted data graph whose weights approximate the geodesic distances on the data manifold; the pairwise distances are then preserved to determine the linear subspace. OIP is a variation of IsoProjection and considers an explicit orthogonal linear projection. LPP (He & Niyogi, 2003) and NPE (He et al., 2005) are linear approximations of LE and LLE, respectively, that attempt to keep the local geometry of the data. In addition, Yang et al. (2016) proposed a supervised Isomap method called the multimanifold discriminant isomap. On this basis, Zhang et al. (2018) proposed a semi-supervised local multimanifold isomap learning framework.

NPE
He et al. (2005) extended the LLE algorithm to learn a linear projection with eigenvalue optimisation. Suppose that $X = (x_1, \ldots, x_m)$ is a sample set. The first step is to use K nearest neighbours or an ε-neighbourhood to create an adjacency graph.
The weights are calculated in the second step. The edge weights are obtained by minimising the following objective function:

$$\min_W \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_j W_{ij} = 1,$$

where $W$ represents the weight matrix, and $W_{ij}$ represents the edge weight from node $i$ to node $j$.
The third step is to compute the projections by minimising the following objective function:

$$\min_a \; a^T X M X^T a \quad \text{s.t.} \quad a^T X X^T a = 1, \qquad M = (I - W)^T (I - W).$$

The transformation vector $a$ is given by the minimum-eigenvalue solution of the following generalised eigenvector problem:

$$X M X^T a = \lambda X X^T a.$$

Thus, $A \in R^{n \times d}$ can be composed of the eigenvectors belonging to the $d$ smallest eigenvalues, i.e. $A = (a_1, a_2, \ldots, a_d)$.
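The three NPE steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `lle_weights` and `npe`, the regularisation terms `reg` and `eps`, and the toy data are assumptions introduced here for illustration. The generalised eigenproblem is reduced to a standard symmetric one through a Cholesky factor of $X X^T$.

```python
import numpy as np


def lle_weights(X, k=5, reg=1e-3):
    """Reconstruction weights W (m x m) for the columns of X (n x m)."""
    n, m = X.shape
    W = np.zeros((m, m))
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
    for i in range(m):
        order = np.argsort(dist[i])
        nbrs = order[order != i][:k]             # k nearest neighbours of x_i
        Z = X[:, nbrs] - X[:, [i]]               # neighbours centred on x_i
        G = Z.T @ Z                              # local Gram matrix
        # Assumed regulariser: keeps G invertible when k exceeds n.
        G += (reg * np.trace(G) + 1e-10) * np.eye(len(nbrs))
        w = np.linalg.solve(G, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                 # each row sums to one
    return W


def npe(X, d, k=5, eps=1e-6):
    """Return the n x d projection A and the weights W for data X (n x m)."""
    n, m = X.shape
    W = lle_weights(X, k)
    I = np.eye(m)
    M = (I - W).T @ (I - W)
    left = X @ M @ X.T
    right = X @ X.T + eps * np.eye(n)            # assumed ridge term
    # Solve left a = lambda * right a via the Cholesky factor of `right`.
    L = np.linalg.cholesky(right)
    L_inv = np.linalg.inv(L)
    C = L_inv @ left @ L_inv.T
    C = 0.5 * (C + C.T)                          # symmetrise against round-off
    _, vecs = np.linalg.eigh(C)                  # eigenvalues in ascending order
    A = L_inv.T @ vecs[:, :d]                    # d smallest eigenvalues
    return A, W


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))                 # toy data: n = 5, m = 40
A, W = npe(X, d=2, k=6)
Y = A.T @ X                                      # 2 x 40 embedding
```

Because the projection is explicit, an unseen sample `x_new` is embedded simply as `A.T @ x_new`, which is the property that distinguishes NPE from LLE.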

Method
This section introduces NNNPE, the new linear dimensionality reduction approach proposed in this study. NNNPE is an improved NPE algorithm that retains all the properties of NPE. Therefore, NNNPE is also a linear approximation to the nonlinear dimensionality reduction algorithm LLE. According to a previous study (He et al., 2005), NPE is particularly useful when the data points $x_1, \ldots, x_m \in \mathcal{M}$ and $\mathcal{M}$ is a nonlinear manifold embedded in $R^n$. We study the supervised situation and suppose that the data points belong to $c$ classes. Let $C = (l_1, \ldots, l_m)$ denote the categories of the data, where $l_i \in \{1, \ldots, c\}$. The following steps describe the formal algorithmic procedure:
(1) Building an adjacency graph: Define $G$ as a graph with $m$ nodes, where node $i$ corresponds to data point $x_i$. Add an edge between $x_i$ and $x_j$ if they belong to the same category ($l_i = l_j$).
(2) Computing the weights: In this step, each data point is reconstructed as a linear combination of the other data points from the same category. Let $W$ be the weight matrix, so that the weight $W_{ij}$ summarises the contribution of the $j$th data point to the reconstruction of the $i$th. Enforce $W_{ij} = 0$ if $l_i \neq l_j$ (no edge between nodes $i$ and $j$), and compute the remaining weights that best linearly reconstruct $x_i$ from its neighbours by solving the following cost function:

$$\min_W \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_j W_{ij} = 1.$$

(3) Computing the covariance matrix: Compute the covariance matrices of the following pseudo congeneric and heterogeneous point pairs:

$$S = \{ (x_i, x_i^j) \mid i = 1, \ldots, m \}, \qquad D = \{ (x_i, x_j) \mid l_i \neq l_j \},$$

where $x_i^j$ is the linear reconstruction of $x_i$, i.e. $x_i^j = \sum_j W_{ij} x_j$. The covariance matrices of the point pairs in $S$ and $D$ are denoted by $S_w$ and $S_b$, respectively.
(4) Computing the projections: Compute the linear mapping matrix $A \in R^{n \times d}$, and then compute the embedding, i.e. $y_i = A^T x_i$. The following objective function can be used to obtain the best projections:

$$A^* = \arg\min_{A^T A = I} \frac{tr(A^T S_w A)}{tr(A^T S_b A)}.$$

The formulation can be regarded as an MML problem and can be efficiently solved, as shown in Section 5.
The above procedure is supervised. Its first step constructs the adjacency graph with category information, and the resulting graph is undirected. However, the algorithm is also suitable for unsupervised situations. In that case, we use the K-NN algorithm to build the adjacency graph and add a directed edge from node $i$ to node $j$ if $x_j$ is among the K nearest neighbours of $x_i$; the resulting graph is directed. The remaining steps are the same.
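Steps (1) to (3) of the supervised procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names `supervised_weights` and `scatter_matrices` and the regulariser `reg` are hypothetical, and the "covariance matrices" are taken to be unnormalised sums of outer products of the pair differences.

```python
import numpy as np


def supervised_weights(X, labels, reg=1e-3):
    """Weights that reconstruct each column of X from its same-class columns."""
    n, m = X.shape
    W = np.zeros((m, m))
    for i in range(m):
        # Neighbours of x_i: every other sample with the same label.
        nbrs = np.flatnonzero((labels == labels[i]) & (np.arange(m) != i))
        if nbrs.size == 0:
            continue                             # singleton class: row stays zero
        Z = X[:, nbrs] - X[:, [i]]
        G = Z.T @ Z
        # Assumed regulariser so that the local Gram matrix is invertible.
        G += (reg * np.trace(G) + 1e-10) * np.eye(nbrs.size)
        w = np.linalg.solve(G, np.ones(nbrs.size))
        W[i, nbrs] = w / w.sum()                 # row sums to one
    return W


def scatter_matrices(X, labels, W):
    """S_w from pairs (x_i, reconstruction of x_i); S_b from different-class pairs."""
    n, m = X.shape
    R = X - X @ W.T                              # column i: x_i - sum_j W_ij x_j
    S_w = R @ R.T
    S_b = np.zeros((n, n))
    for i in range(m):
        diff = X[:, labels != labels[i]] - X[:, [i]]
        S_b += diff @ diff.T                     # ordered pairs, so each pair counts twice
    return S_w, S_b


rng = np.random.default_rng(0)
labels = np.array([0] * 5 + [1] * 5 + [2] * 5)
X = rng.standard_normal((4, 15))                 # toy data: n = 4, m = 15
W = supervised_weights(X, labels)
S_w, S_b = scatter_matrices(X, labels, W)
```

Counting each heterogeneous pair twice only rescales $S_b$ by a constant, which does not change the minimiser of the trace ratio in step (4).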

Theoretical justification
The theoretical derivation of our algorithm, which is essentially based on LLE and NPE, is presented in this section.

Optimal linear embedding
LLE is a local dimensionality reduction approach, and it attempts to maintain only the local attributes of the data. In LLE, the local attributes of the data manifold are captured by writing each data point as a linear combination of its nearest neighbours. Although many real-world data have nonlinear manifold distributions, we can assume that the manifold is locally linear. The linear coefficients $W \in R^{m \times m}$ (reconstruction weights) that reconstruct each data point from its neighbours can characterise the local geometry of these patches. The following cost function measures the reconstruction errors:

$$\varepsilon(W) = \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2.$$

To compute the weights, we minimise the cost function subject to two constraints: $W_{ij} = 0$ if $x_j$ is not in the neighbourhood of $x_i$, and the rows of the weight matrix sum to one, i.e. $\sum_j W_{ij} = 1$.
In the lower-dimensional space, the reconstruction weights are preserved by the LLE algorithm. In other words, in the low-dimensional data representation, the same weight matrix should reconstruct data point $y_i$ from its neighbours. Let $Y = (y_1, \ldots, y_m) \in R^{d \times m}$ denote the $d$-dimensional data representation. It can be obtained by minimising the following cost function:

$$\Phi(Y) = \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2. \tag{9}$$

The NPE algorithm transforms this cost function by introducing a linear transformation matrix $A \in R^{n \times d}$ as follows:

$$\Phi(A) = \sum_i \Big\| A^T x_i - \sum_j W_{ij} A^T x_j \Big\|^2.$$

The low-dimensional embedding can then be obtained by $Y = A^T X$.
Here, we describe a new cost function. LLE and NPE both preserve the local manifold structure but ignore its global counterpart. Each data point can be represented as a linear combination of its neighbours; furthermore, through its relationships with non-neighbouring data points, a data point can be characterised by both a "local structure" and a "global structure." The essence of dimensionality reduction is to preserve the intrinsic characteristics of data. The intrinsic dimensionality of data refers to the smallest number of parameters required to account for their observed qualities. Each intrinsic dimension has a strong ability to discriminate, even in low-dimensional space. After dimensionality reduction, data points from the same region should remain similar, whereas data points from different regions should remain distinct. LDA, for example, maximises the ratio of between-class to within-class variance. We therefore introduce the global property ignored by the LLE algorithm: we expect the non-neighbouring data points to be separated as far as possible in the low-dimensional space while the neighbourhood relationships are preserved. Thus, we propose a new cost function as follows:

$$\min_Y \; \frac{\sum_i \big\| y_i - \sum_j W_{ij} y_j \big\|^2}{\sum_i \sum_{j \notin N_i} \| y_i - y_j \|^2}, \tag{11}$$

where $N_i$ denotes the index set of the neighbours of $x_i$ (the summation excludes $j = i$). We introduce the linear matrix $A = (a_1, a_2, \ldots, a_d)$ to linearise the cost function, similar to the NPE algorithm:

$$\min_A \; \frac{\sum_i \| A^T x_i - A^T x_i^j \|^2}{\sum_i \sum_{j \notin N_i} \| A^T x_i - A^T x_j \|^2},$$

where $x_i^j = \sum_j W_{ij} x_j$ denotes the reconstruction of data point $x_i$ from its neighbours. We define two sets of pairwise constraints, namely the pseudo congeneric point pairs $S$ and the heterogeneous point pairs $D$:

$$S = \{ (x_i, x_i^j) \mid i = 1, \ldots, m \}, \qquad D = \{ (x_i, x_j) \mid j \notin N_i, \; j \neq i \}.$$

Here, $S_w$ and $S_b$ are the respective covariance matrices of the point pairs in $S$ and $D$.
To avoid degenerate solutions, the LLE algorithm constrains the solution vectors $y_i^r$ (the $i$th row vector of $Y$) as $y_i^{rT} \times y_i^r = 1$. We also impose the constraint on matrix $A$ to satisfy $A^T A = I$. Finally, we obtain the optimal projections by solving the following objective function:

$$A^* = \arg\min_{A^T A = I} \frac{tr(A^T S_w A)}{tr(A^T S_b A)}. \tag{15}$$
The problem is similar to the MML problem. In the next section, we will introduce the connection between linear manifold learning and MML.
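The step from the sums of squared distances to the trace form relies on a standard identity. As a brief worked step, assuming that the "covariance matrices" $S_w$ and $S_b$ denote the unnormalised sums of outer products of the pair differences (an assumption, because the normalisation is not spelled out in the text):

```latex
\begin{aligned}
\sum_i \left\| A^T x_i - A^T x_i^j \right\|^2
  &= \sum_i \mathrm{tr}\!\left( A^T (x_i - x_i^j)(x_i - x_i^j)^T A \right)
   = \mathrm{tr}\!\left( A^T S_w A \right), \\
\sum_{(x_i, x_j) \in D} \left\| A^T x_i - A^T x_j \right\|^2
  &= \sum_{(x_i, x_j) \in D} \mathrm{tr}\!\left( A^T (x_i - x_j)(x_i - x_j)^T A \right)
   = \mathrm{tr}\!\left( A^T S_b A \right), \\
S_w &= \sum_i (x_i - x_i^j)(x_i - x_i^j)^T, \qquad
S_b = \sum_{(x_i, x_j) \in D} (x_i - x_j)(x_i - x_j)^T .
\end{aligned}
```

Both lines use $\|v\|^2 = \mathrm{tr}(v v^T)$ and the linearity of the trace, so any common normalisation constant in $S_w$ or $S_b$ only rescales the ratio and leaves its minimiser unchanged.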

Connection between LML and MML
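The Mahalanobis distance between points $x$ and $y$ in $n$-dimensional space is given by:

$$d_M(x, y) = \sqrt{(x - y)^T M (x - y)},$$

where $M \in R^{n \times n}$ is a symmetric positive semidefinite matrix. Learning $M$ is the classic MML problem. According to the property of $M$, it can be decomposed into $M = A^T A$, where $A \in R^{d \times n}$ and $d$ is the rank of $M$ (this $A$ is the transpose of the $n \times d$ mapping matrix used in the embedding $y = A^T x$). Then, we rewrite $d_M(x, y)$ as follows:

$$d_M(x, y) = \sqrt{(x - y)^T A^T A (x - y)} = \| A x - A y \|_2.$$

Clearly, the learning of $M$ is equivalent to the learning of a linear projective mapping $A$ in the sample space. In general, most methods learn the metric $M$ in a weakly supervised manner from pair-based constraints of the following forms:

$$S = \{ (x_i, x_j) \mid x_i \text{ and } x_j \text{ are similar} \}, \qquad D = \{ (x_i, x_j) \mid x_i \text{ and } x_j \text{ are dissimilar} \}.$$

The purpose of MML is to reduce the distance between similar points while increasing the distance between dissimilar points:

$$\min_M \sum_{(x_i, x_j) \in S} d_M^2(x_i, x_j), \tag{18}$$

$$\max_M \sum_{(x_i, x_j) \in D} d_M^2(x_i, x_j). \tag{19}$$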
Formulas (18) and (19) are similar to the numerator and denominator of formula (11), respectively. Formula (18) has the same form as objective function (9) of the LLE algorithm. Formula (18) minimises the sum of the distances between similar points in Mahalanobis space, whereas formula (9) minimises the sum of the distances between each sample $y_i$ and its reconstructed sample $y_i^j$ in low-dimensional space ($y_i^j = \sum_j W_{ij} y_j$).
The following formula can achieve the goal of MML:

$$\max_A \; \frac{tr(A^T S_b A)}{tr(A^T S_w A)},$$

where $A \in R^{n \times d}$ is the mapping matrix with $y = A^T x$, and $S_w$ and $S_b$ are the covariance matrices of the point pairs in $S$ and $D$, respectively. Maximising this ratio is equivalent to minimising its reciprocal, so this representation is the same as that in formula (15). Because LML approaches learn a linear transformation, they can be regarded as learning the projection matrix $A$ described above and then solving exactly the same problem as distance metric learning. As a result, each LML algorithm that is capable of learning an explicit projective mapping also has the goal of learning a distance measure. In the next section, we will introduce an efficient solution method to solve the objective function (21).
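The equivalence between learning $M$ and learning a linear projection can be checked numerically. The following sketch uses random toy data and the hypothetical helper name `mahalanobis`; it only confirms that the Mahalanobis distance under $M = A^T A$ equals the Euclidean distance after projecting with $A \in R^{d \times n}$.

```python
import numpy as np


def mahalanobis(x, y, M):
    """Mahalanobis distance sqrt((x - y)^T M (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ M @ diff))


rng = np.random.default_rng(0)
n, d = 6, 2
A = rng.standard_normal((d, n))          # toy projection, d x n as in M = A^T A
M = A.T @ A                              # symmetric positive semidefinite, rank d
x = rng.standard_normal(n)
y = rng.standard_normal(n)

d_m = mahalanobis(x, y, M)
d_proj = float(np.linalg.norm(A @ x - A @ y))   # Euclidean distance after projection
```

The two values `d_m` and `d_proj` agree up to floating-point round-off, which is why a learned projection can be read as a learned metric and vice versa.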

An efficient algorithm for learning projections
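In the previous section, we identified the objective function. A previous work (Xiang et al., 2008) proposed an efficient algorithm to obtain the optimal projections by solving formula (20). We now describe this algorithm, whose theoretical properties were proven in that work.
Following the notation of Xiang et al. (2008), $W \in R^{n \times d}$ denotes the projection matrix in this section (not the reconstruction weight matrix), with $W^T W = I$. That study also considered the rank $r$ of $S_w$ in two cases: if $d > n - r$, then $tr(W^T S_w W) > 0$; if $d \le n - r$, then $W$ can be chosen such that $tr(W^T S_w W) = 0$.
In the first case ($d > n - r$), formula (22) is the transformation of formula (21), which is achieved by introducing the parameter $\lambda$:

$$g(\lambda) = \max_{W^T W = I} tr\big( W^T (S_b - \lambda S_w) W \big). \tag{22}$$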
Thereafter, the task of finding the optimal solution does not depend on matrix $W^*$ but on parameter $\lambda^*$, which satisfies $g(\lambda^*) = 0$. On this basis, it can be proven that $\lambda^*$ has lower and upper limits, namely, $\lambda_{min} = tr(S_b) / tr(S_w)$ and $\lambda_{max} = \sum_{i=1}^{d} \alpha_i / \sum_{i=1}^{d} \beta_i$, respectively, where $\alpha_1, \ldots, \alpha_d$ are the $d$ largest eigenvalues of $S_b$, and $\beta_1, \ldots, \beta_d$ are the $d$ smallest eigenvalues of $S_w$. Two properties of the function $g(\lambda)$ hold naturally: if $g(\lambda) > 0$, then $\lambda < \lambda^*$; if $g(\lambda) < 0$, then $\lambda > \lambda^*$. This formulation indicates that we can iteratively find $\lambda^*$, for example by binary search on $[\lambda_{min}, \lambda_{max}]$, according to the sign of $g(\lambda)$.
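The binary search for $\lambda^*$ in the first case can be sketched as follows. This is an illustrative reading of the procedure, not the original MATLAB code: the function names `g` and `trace_ratio_bisection`, the tolerance, and the toy matrices are assumptions, and $S_w$ is taken to be positive definite so that the first case applies. For a fixed $\lambda$, $g(\lambda)$ is the sum of the $d$ largest eigenvalues of $S_b - \lambda S_w$, attained by the corresponding eigenvectors.

```python
import numpy as np


def g(lam, S_b, S_w, d):
    """Value and maximiser of tr(W^T (S_b - lam * S_w) W) over W^T W = I."""
    vals, vecs = np.linalg.eigh(S_b - lam * S_w)   # ascending eigenvalues
    return vals[-d:].sum(), vecs[:, -d:]           # top-d eigenpairs


def trace_ratio_bisection(S_b, S_w, d, tol=1e-10, max_iter=200):
    """Find lam* with g(lam*) = 0 by bisection; assumes d > n - rank(S_w)."""
    alpha = np.linalg.eigvalsh(S_b)[-d:]           # d largest eigenvalues of S_b
    beta = np.linalg.eigvalsh(S_w)[:d]             # d smallest eigenvalues of S_w
    lo = np.trace(S_b) / np.trace(S_w)             # lower limit of lam*
    hi = alpha.sum() / beta.sum()                  # upper limit of lam*
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        value, _ = g(lam, S_b, S_w, d)
        if value > 0:
            lo = lam                               # g(lam) > 0 means lam < lam*
        else:
            hi = lam                               # g(lam) <= 0 means lam >= lam*
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    _, W = g(lam, S_b, S_w, d)
    return lam, W


rng = np.random.default_rng(0)
n, d = 6, 2
P = rng.standard_normal((n, 20))
Q = rng.standard_normal((n, 20))
S_b = P @ P.T                                      # toy between-pair scatter
S_w = Q @ Q.T                                      # toy within-pair scatter, full rank
lam_star, W_star = trace_ratio_bisection(S_b, S_w, d)
ratio = np.trace(W_star.T @ S_b @ W_star) / np.trace(W_star.T @ S_w @ W_star)
```

At convergence, `ratio` coincides with `lam_star` up to the tolerance, and the columns of `W_star` are orthonormal. Each bisection step costs one symmetric eigendecomposition of an $n \times n$ matrix.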
In the second case ($d \le n - r$), $W$ lies in the null space of $S_w$, and $tr(W^T S_w W) = 0$. Thus, the objective function can be transformed into formula (23) after performing the null-space transformation $y = Z^T x$:

$$V^* = \arg\max_{V^T V = I} tr\big( V^T Z^T S_b Z V \big), \tag{23}$$

where $V \in R^{(n-r) \times d}$, and $Z \in R^{n \times (n-r)}$ is the matrix whose column vectors are the eigenvectors corresponding to the $(n - r)$ zero eigenvalues of $S_w$. Once the optimal matrix $V^*$ is obtained, the optimal matrix $W^* = Z \times V^*$ can also be determined. The specific algorithm process is shown below.

Experiments
For the face recognition task, we evaluated our algorithm on the ORL face database and compared it with Eigenface and Fisherface, which are two of the most popular linear techniques for face representation and recognition. We also compared our proposed algorithm with NPE. Our algorithm and NPE were both supervised in this experiment. The code was implemented on the MATLAB platform on a Windows 7 system.

Manifold of Frey's facial image
The Frey face set used in this research was the same as that used in the LLE work. The set comprised approximately 2000 images of Brendan's face taken from sequential frames of a short video. The size of each image was 20 × 28 pixels. We applied the NNNPE algorithm to the Frey face set to demonstrate that the linear NNNPE algorithm could also detect the manifold structure detected by the nonlinear techniques. In this experiment, we set the parameter k to 10. The mapping results are shown in Figure 1. The first two coordinates of NNNPE were used to map the facial images into a 2D embedding space. These two coordinates, as shown in the figure, indicate the facial expression and the viewing position of the faces, respectively. To a certain extent, the linear algorithm could detect the nonlinear manifold structure of the facial images. In various sections of the space, several representative faces are depicted next to the data points. Overall, the result shown in Figure 1 is similar to the result of the LLE algorithm in the past work, although the distribution of points in our figure is more dispersed. This property will be covered in detail in the next section. As shown in the figure, the facial images are clearly separated into two portions. The upper portion depicts faces with a closed mouth, whereas the lower portion depicts faces with an open mouth. This difference can be explained by the fact that the novel technique implicitly favours natural clusters in the data while attempting to retain the neighbourhood and non-neighbourhood features in the embedding. In the reduced representation space, it brings points that are adjacent in ambient space closer together and pushes non-neighbouring points farther apart. The bottom images correspond to locations along the top path (connected by a blue solid line), exhibiting pose variability with a particular facial expression.

Experiment on the ORL face database
ORL consists of a total of 400 facial images of 40 people (10 samples per person). The people in the set vary in terms of age, gender, and race. The photos were taken at various times and feature a variety of expressions (open or closed eyes, smiling or non-smiling) and facial characteristics (with or without glasses). The photos were taken with a tolerance of up to 20 degrees for tilting and rotation of the faces. This database is a widely used standard database.
Before conducting the experiment, we performed data processing similar to that in NPE. The original images were scaled and oriented in such a way that the two eyes would be aligned in the same place. Then, the final image was cropped to match the facial areas. In all of the experiments, the size of each cropped image was 32 × 32 pixels, with 256 grey levels per pixel.

Embedding structure in the low-dimensional space
The aim of our algorithm is to separate non-neighbouring data points as far as possible while preserving the neighbourhood relationship between neighbouring data points in low-dimensional space. Figure 2 presents the distribution of the data points in the 2D embedding space described by the first two coordinates of NPE and of our novel algorithm. The two figures have the same coordinate scale. Clearly, the degree of dispersion of the data points in the left figure is greater than that in the right figure. This finding demonstrates the ability of NNNPE to separate the non-neighbouring data points in contrast to NPE. Moreover, as shown in the figure, each point (individual) represents an overlap of ten data points (facial images) because their locations are extremely close to one another, such as the point marked by a circle in the figure. The local property is preserved by both NPE and NNNPE, which means that any data point can be represented as a linear combination of its neighbours. The distribution of the 10 neighbouring data points is then displayed in a coordinate figure with the same scale. Clearly, the 10 neighbouring data points in the left figure lie closer together than those in the right figure. Thus, NNNPE can preserve the neighbourhood structure better than NPE. The results also have a similar manifold structure. Figure 3 presents the facial images of the 10 data points inside the circle in Figure 2, ordered by their abscissa. This ordering represents a particular mode of variability in an individual's pose.
Table 1 shows the best outcomes in the ideal subspace and the corresponding dimensions for each approach. Our method outperforms the other methods in all situations. LDA outperforms NPE, but both methods are superior to PCA. Moreover, the optimal dimensionalities obtained by our novel algorithm, along with those of NPE and LDA, are much lower than those obtained by PCA.
The average detection time of NNNPE is 1 ms, and the training time is less than 60 s.

Face recognition based on CNN
A deep learning algorithm was also used to train a face recognition model. Even with an early deep image classification model, AlexNet (Krizhevsky et al., 2017), the recognition accuracy is higher than that of our proposed NNNPE method. Furthermore, in face recognition tasks in more complex scenes, deep learning algorithms perform much better than manifold learning. However, the algorithm proposed in this research is still valuable because manifold learning has good interpretability and offers a reference value in scenarios with strict interpretability requirements. Many scholars are currently attempting to study the interpretability of deep learning algorithms from the perspective of manifold structures.

Conclusion
This study introduced a new linear dimensionality reduction approach named NNNPE. This algorithm is inspired by the principles of the NPE, LLE, and LDA algorithms and by the MML metric learning method. This research also explored the connection between LML and MML, both of which have similar locality-preserving properties. In contrast to the nonlinear LLE method, which is defined solely on the training samples, LML and MML are defined everywhere (and thus on novel test data points). Despite being a linear algorithm, our method can identify the nonlinear structure of the data manifold. Furthermore, the novel algorithm has global-structure-preserving properties, which allow non-neighbouring data points to be separated as far as possible in low-dimensional space. The improved performance of our algorithm over other linear dimensionality reduction methods was demonstrated through experiments.
The main idea behind the current work may also be applied to other algorithms (e.g. LPP), which represents the linearisation of nonlinear algorithms. We aim to propose a general framework with this idea in the future. In addition, the NNNPE algorithm proposed in this paper preserves the non-nearest neighbour relationship between samples. When the sample size is larger than a certain level, it becomes difficult to preserve the non-neighbour relationship between all samples, and these problems need to be further explored.