1 Introduction

Spectral unmixing analysis has received increasing interest in hyperspectral remote sensing, since mixed pixels are widespread in hyperspectral imagery. The linear mixing model (LMM) plays an important role in hyperspectral unmixing analysis (Keshava and Mustard, 2002; Plaza et al., 2004; Bioucas-Dias et al., 2012). It assumes that any pixel in the image can be regarded as a linear combination of several pure spectral signatures (called 'endmembers') weighted by their abundance fractions. Owing to physical constraints, the abundances satisfy the abundance non-negativity constraint (ANC) and the abundance sum-to-one constraint (ASC). Under the LMM assumption, when all the endmembers are available, their abundances can be obtained conveniently using linear unmixing methods (Heinz and Chang, 2001; Parente and Plaza, 2010; Heylen et al., 2011). Therefore, many methods concentrate on endmember selection, for instance, the pixel purity index (PPI) (Boardman, 1992), N-FINDR (Winter, 1999; Ji et al., 2015), the orthogonal bases algorithm (OBA) (Tao et al., 2007a), iterative error analysis (IEA) (Neville et al., 1999), the simplex growing algorithm (SGA) (Chang et al., 2006), the successive projection algorithm (SPA) (Zhang et al., 2008), maximum volume by Householder transformation (MVHT) (Liu and Zhang, 2012), the Gaussian elimination method (GEM) (Geng et al., 2013b), and the fast Gram determinant based algorithm (FGDA) (Sun et al., 2014). All these algorithms assume that at least one pure pixel exists for each endmember in the image.

However, the pure-pixel assumption is hard to satisfy for real hyperspectral images. In this case, an alternative approach to unmixing analysis, endmember generation, is required. Methods of this type include the minimum volume transform (MVT) (Craig, 1994), minimum volume simplex analysis (MVSA) (Li and Bioucas-Dias, 2008), MINVEST (Hendrix et al., 2012), the minimum volume enclosing simplex (MVES) (Chan et al., 2009; Ambikapathi et al., 2011), simplex identification via split augmented Lagrangian (SISAL) (Bioucas-Dias, 2009), iterated constrained endmember (ICE) (Berman et al., 2004), and the geometric optimization model (GOM) (Geng et al., 2013a). Take ICE as an example: it formulates an optimization problem that minimizes the reconstruction error regularized by a constraint term, i.e., the sum of variances of the simplex vertices. In each iteration of ICE, the abundance fractions of each pixel are found by solving a quadratic programming problem, which is very time-consuming. In recent years, non-negative matrix factorization (NMF) has been applied to hyperspectral data unmixing (Miao and Qi, 2007; Zymnis et al., 2007; Jia and Qian, 2009; Huck et al., 2010; Liu et al., 2011; Ji et al., 2013; Zhu et al., 2014). The multiplicative update algorithm for NMF, proposed by Lee and Seung (1999), is computationally simple and needs no manually set parameters. Note, however, that NMF suffers from two main problems when used for hyperspectral unmixing analysis. First, the multiplicative iteration version of NMF is very time-consuming when applied directly to the original hyperspectral data, whose dimensionality is very high (generally more than 100 bands). Second, NMF is very sensitive to noise (outliers). Both problems can be addressed by dimensionality reduction, which not only reduces the data size but also improves the signal-to-noise ratio (SNR) of the data set. However, after dimensionality reduction, the multiplicative learning rule for NMF cannot be used, since dimensionality-reduced data generally contain negatives. In this paper, we explore the impact of principal component analysis (PCA) on NMF and find that the multiplicative updating rule remains applicable after a dimensionality reduction that contains only a rotation step and no translation. We then present a new approach for NMF in the principal component (PC) space, namely PCNMF.

2 Background

In this section, we briefly review the theory of LMM and NMF.

2.1 Linear mixing model

If the assumption of the LMM holds, each spectral vector \(r_i\) can be expressed as a linear combination of several endmember vectors \(e_j\) (\(j = 1, 2, \ldots, p\)), subject to the physical constraints:

$$r_i = \sum_{j = 1}^p c_{ij} e_j + n = E c_i + n, \quad (1)$$

$$\sum_{j = 1}^p c_{ij} = 1, \quad c_{ij} \geq 0, \quad (2)$$

where \(p\) is the number of endmembers in the image, \(c_i = [c_{i1}, c_{i2}, \ldots, c_{ip}]^{\rm T}\) with \(c_{ij}\) a scalar representing the fractional abundance of endmember \(e_j\) in pixel \(r_i\), \(E = [e_1, e_2, \ldots, e_p]\) is an \(L \times p\) mixing matrix (\(L\) is the number of bands of the original data), and \(n\) is an additive noise vector.

2.2 Non-negative matrix factorization

Given an \(L \times M\) non-negative matrix \(R\) (\(M \gg L\) in general) and a positive integer \(p < L\), the task of non-negative matrix factorization is to find two non-negative matrices \(E_{L \times p}\) and \(C_{p \times M}\) such that

$$R \approx EC. \quad (3)$$

A natural way to solve the NMF problem for E and C is to construct the following optimization problem:

$$\min_{E,C} \; f(E,C) = \frac{1}{2}\|R - EC\|_{\rm F}^2 \quad \text{subject to} \;\; E \geq 0, \; C \geq 0, \quad (4)$$

where \(\|\cdot\|_{\rm F}\) represents the Frobenius norm. There are many methods that can be used to solve Eq. (4), among which the most popular is the multiplicative iteration rule. The learning process for multiplicative NMF is

$$C = C .* (E^{\rm T}R) \; ./ \; (E^{\rm T}EC), \quad (5)$$

$$E = E .* (RC^{\rm T}) \; ./ \; (ECC^{\rm T}), \quad (6)$$

where \(.*\) and \(./\) denote element-wise multiplication and division, respectively.

If the initial matrices E and C are strictly non-negative, they remain non-negative throughout the iterations. If the ASC needs to be enforced, we can simply replace the matrices R and E by \(\bar R = \big[\begin{smallmatrix} R \\ \delta 1_M^{\rm T} \end{smallmatrix}\big]\) and \(\bar E = \big[\begin{smallmatrix} E \\ \delta 1_p^{\rm T} \end{smallmatrix}\big]\), where \(1_M\) is an \(M\)-dimensional column vector and \(1_p\) a \(p\)-dimensional column vector, both with all elements equal to one, and \(\delta\) is a positive parameter that controls the strength of the ASC, usually assigned manually.

In recent years, NMF has been widely used in unmixing analysis of hyperspectral data (Miao and Qi, 2007; Zymnis et al., 2007; Jia and Qian, 2009; Huck et al., 2010; Liu et al., 2011; Ji et al., 2013; Zhu et al., 2014). Most of these methods apply NMF directly to the original data, so they are generally time-consuming and sensitive to noise (outliers). Although dimensionality reduction can mitigate both problems, it leads to another intractable situation: the non-negativity property no longer holds for the dimensionality-reduced data, so the multiplicative updating rule cannot be applied in the reduced space. Interestingly, we find that the multiplicative updating rule is still applicable after a dimensionality reduction that contains only a rotation operation. This is elaborated in the following section.
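To make these rules concrete, the following is a minimal NumPy sketch of the multiplicative updates in Eqs. (5) and (6) with the ASC augmentation described above; the function name and the small `eps` guard against division by zero are our additions.

```python
import numpy as np

def nmf_multiplicative(R, E0, C0, delta=13.0, maxiter=4000, eps=1e-9):
    """Multiplicative-update NMF (Eqs. (5)-(6)) with ASC augmentation.

    R  : (L, M) non-negative data matrix, pixels as columns
    E0 : (L, p) initial endmembers; C0 : (p, M) initial abundances
    delta : weight controlling how strongly the sum-to-one
            constraint is enforced (assigned manually)
    """
    L, M = R.shape
    p = E0.shape[1]
    # Augment R and E with a constant row so that minimizing the fit
    # also pushes each column of C toward summing to one (ASC).
    R_bar = np.vstack([R, delta * np.ones((1, M))])
    E = np.vstack([E0.astype(float), delta * np.ones((1, p))])
    C = C0.astype(float).copy()
    for _ in range(maxiter):
        C *= (E.T @ R_bar) / (E.T @ E @ C + eps)   # Eq. (5)
        E *= (R_bar @ C.T) / (E @ C @ C.T + eps)   # Eq. (6)
    return E[:-1, :], C    # drop the augmentation row from E
```

Non-negative initializations stay non-negative throughout, because each update only multiplies by ratios of non-negative quantities.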

3 Impact of PCA on NMF

In this section, taking PCA (Jolliffe, 2002) as an example, we investigate the applicability of NMF in the PC space. As is well known, the standard PCA process contains two main steps, translation and rotation. More specifically, translation moves the center of the data to the origin by subtracting the mean vector, while rotation projects the mean-removed data onto the directions of the eigenvectors of its covariance matrix, which is equivalent to rotating the data by the orthogonal eigenvector matrix. Hence, in the following, we discuss the impact of general translation and rotation on NMF.

3.1 Impact of rotation

In general, the original hyperspectral data are non-negative, i.e., \(R \geq 0\). However, the data may contain negatives after rotation by an orthogonal matrix V. Denote \(\hat R = V^{\rm T}R\) and \(\hat E = V^{\rm T}E\) as the rotated data and endmember matrices, respectively. First, we consider the multiplicative learning rule for C, which can be rewritten as

$$\hat C = \hat C .* (\hat E^{\rm T}\hat R) \; ./ \; (\hat E^{\rm T}\hat E\hat C). \quad (7)$$

Substituting \(\hat R = V^{\rm T}R\) and \(\hat E = V^{\rm T}E\) into Eq. (7), we have

$$\begin{aligned} \hat C &= \hat C .* \left((V^{\rm T}E)^{\rm T}(V^{\rm T}R)\right) ./ \left((V^{\rm T}E)^{\rm T}V^{\rm T}E\hat C\right) \\ &= \hat C .* \left(E^{\rm T}VV^{\rm T}R\right) ./ \left(E^{\rm T}VV^{\rm T}E\hat C\right) \\ &= \hat C .* \left(E^{\rm T}R\right) ./ \left(E^{\rm T}E\hat C\right). \end{aligned} \quad (8)$$

Clearly, Eq. (8) is exactly equivalent to Eq. (5); that is, rotation does not change the multiplicative updating rule for the abundance matrix. This is a very interesting conclusion: the abundance matrix can be obtained directly in the PC space even though the rotated data may contain negatives. Considering that the information of hyperspectral data is concentrated mainly in the top few components, we can multiplicatively update the abundance matrix in the space spanned by the top principal components (PCs), which greatly reduces the computational complexity. Moreover, the accuracy of the endmembers and abundances can be improved by performing NMF in the PC space, since PCA suppresses noise and enhances the SNR.

Next, let us discuss the computation of the endmember matrix E in the PC space. The multiplicative updating rule for E is

$$\hat E = \hat E .* (\hat R\hat C^{\rm T}) \; ./ \; (\hat E\hat C\hat C^{\rm T}). \quad (9)$$

Similarly, plugging \(\hat C = C\) (by Eq. (8), the abundance update is unchanged by rotation) and \(\hat R = V^{\rm T}R\) into Eq. (9), we have

$$\begin{aligned} \hat E &= \hat E .* (\hat R C^{\rm T}) \; ./ \; (\hat E C C^{\rm T}) \\ &= \hat E .* (V^{\rm T}RC^{\rm T}) \; ./ \; (\hat E CC^{\rm T}). \end{aligned} \quad (10)$$

Unfortunately, the effect of V cannot be eliminated from the multiplicative updating rule for the endmember matrix, and both \(\hat R\) and \(\hat E\) in Eq. (10) may contain negatives in the PC space. This is the main obstacle preventing us from applying NMF in the PC space.
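The rotation invariance of the abundance update in Eq. (8) is easy to verify numerically. Below is a small sketch with randomly generated data; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, p = 50, 200, 3
E = rng.random((L, p))
R = E @ rng.dirichlet(np.ones(p), size=M).T       # non-negative data
C0 = rng.random((p, M))
V, _ = np.linalg.qr(rng.standard_normal((L, L)))  # random orthogonal matrix
R_hat, E_hat = V.T @ R, V.T @ E                   # rotated; may contain negatives
# One abundance update in the original space (Eq. (5)) and in the
# rotated space (Eq. (7)):
C1 = C0 * (E.T @ R) / (E.T @ E @ C0)
C2 = C0 * (E_hat.T @ R_hat) / (E_hat.T @ E_hat @ C0)
print(np.allclose(C1, C2))   # True: V cancels, as derived in Eq. (8)
```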

3.2 Impact of translation

From the above analysis, rotation does not change the update rule of C, but the update rule of E may not hold, since \(\hat R\) in Eq. (10) may contain negatives. A simple trick to make the rotated data non-negative is translation. Supposing the data and endmembers contain negatives, we can select an L-dimensional column vector \(r_0\) such that \(R - r_0 1_M^{\rm T} \geq 0\) and \(E - r_0 1_p^{\rm T} \geq 0\). The corresponding multiplicative updating rules then become

$$C = C .* \left((E - r_0 1_p^{\rm T})^{\rm T}(R - r_0 1_M^{\rm T})\right) ./ \left((E - r_0 1_p^{\rm T})^{\rm T}(E - r_0 1_p^{\rm T})C\right), \quad (11)$$

$$E = (E - r_0 1_p^{\rm T}) .* \left((R - r_0 1_M^{\rm T})C^{\rm T}\right) ./ \left((E - r_0 1_p^{\rm T})CC^{\rm T}\right) + r_0 1_p^{\rm T}. \quad (12)$$

Then the objective function of NMF in Eq. (4) becomes

$$\begin{aligned} f(E - r_0 1_p^{\rm T}, C) &= \frac{1}{2}\left\|(R - r_0 1_M^{\rm T}) - (E - r_0 1_p^{\rm T})C\right\|_{\rm F}^2 \\ &= f(E,C) + {\rm tr}\left((R - EC)(r_0 1_p^{\rm T}C - r_0 1_M^{\rm T})^{\rm T}\right) + \frac{1}{2}\left\|r_0 1_M^{\rm T} - r_0 1_p^{\rm T}C\right\|_{\rm F}^2, \end{aligned} \quad (13)$$

where tr(·) is the trace operator. We can see from Eq. (13) that when the abundance matrix C satisfies the ASC, i.e., \(1_M^{\rm T} = 1_p^{\rm T}C\), translation does not change the value of the objective function. However, the ASC of C cannot be guaranteed during the multiplicative learning process (see Eqs. (5), (6), (11), and (12)), so translating the data changes the final solution of NMF. Moreover, there are infinitely many vectors \(r_0\) that achieve the non-negativity of R and E, and different choices of \(r_0\) lead to different E and C even with the same initialization. Therefore, although translation can make \(R \geq 0\) and \(E \geq 0\), it is unnatural and not recommended in real applications.
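The dependence on \(r_0\) already shows up after a single translated update of E (Eq. (12)). The sketch below uses random matrices and two arbitrary valid translation vectors; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, p = 50, 200, 3
R = rng.random((L, M)); E = rng.random((L, p)); C = rng.random((p, M))
ones_M, ones_p = np.ones((1, M)), np.ones((1, p))

def e_update_translated(E, C, R, r0):
    """One multiplicative E-update on translated data, Eq. (12)."""
    Rs = R - r0 @ ones_M
    Es = E - r0 @ ones_p
    return Es * (Rs @ C.T) / (Es @ C @ C.T) + r0 @ ones_p

# Both choices keep R - r0*1^T and E - r0*1^T non-negative:
Ea = e_update_translated(E, C, R, -0.1 * np.ones((L, 1)))
Eb = e_update_translated(E, C, R, -1.0 * np.ones((L, 1)))
print(np.allclose(Ea, Eb))   # False: different r0, different updates
```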

4 Principal component non-negative matrix factorization

In Section 3, we learned that both steps of the PCA process can cause the multiplicative update rules of NMF to fail in the PC space. For the rotation step, the effect of the rotation matrix V is completely eliminated from the update rule for the abundance matrix, as shown in Eq. (8), so that rule remains unchanged. However, V cannot be eliminated from the update rule of E, as shown in Eq. (10), so that rule is no longer universally applicable. Although data translation can solve the problem caused by V, the analysis in Section 3.2 indicates that translation changes the final solution, and different \(r_0\) lead to different results.

Interestingly, based on our observations, the maximum spectral angle between pixels in real hyperspectral data is mostly small (generally less than 45°). In addition, a rotation operation does not change the spectral angles between data points. These two facts motivate us to apply the orthogonal Procrustes (OP) technique to solve the non-negativity problem of R and E in the PC space: by forcibly rotating all the data points into the first quadrant of the PC space, the update rule for E works as well.

The OP problem was first proposed by Green (1952), and its general solution was given by Schönemann (1966). It is the least-squares problem of transforming a given matrix B into a given matrix A by an orthogonal matrix Q that maps B onto A as closely as possible. Mathematically, the problem can be stated as follows:

$$\min_Q \; f(Q) = \|A - BQ\|_{\rm F} \quad \text{subject to} \;\; Q^{\rm T}Q = QQ^{\rm T} = I. \quad (14)$$

According to Schönemann (1966), the optimal solution of Eq. (14) is

$$Q = UW^{\rm T}, \quad (15)$$

where the matrices U and W satisfy the singular value decomposition (SVD) \(B^{\rm T}A = UDW^{\rm T}\).
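In code, the OP solution is a single SVD. Below is a minimal NumPy sketch (the function name is ours; SciPy's `scipy.linalg.orthogonal_procrustes` solves the same problem).

```python
import numpy as np

def procrustes_rotation(B, A):
    """Solve min_Q ||A - BQ||_F over orthogonal Q (Eq. (14)).

    Following Schönemann (1966): with B^T A = U D W^T,
    the optimal rotation is Q = U W^T (Eq. (15)).
    """
    U, _, Wt = np.linalg.svd(B.T @ A)   # Wt holds W^T
    return U @ Wt
```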

Here, a 2D example is shown in Fig. 1 to illustrate the OP process. The original data are distributed in the second quadrant. Let A = [1, 1] and B be the mean vector of the data set; Q is then calculated by Eq. (15). As shown, all data are transformed into the first quadrant and become non-negative. As mentioned before, to achieve non-negativity for both R and E, the data must satisfy one premise: the maximum spectral angle of the data set should be no larger than 90°. Since digital number, radiance, and reflectance values are all non-negative, real hyperspectral data lie entirely in the first quadrant, so the maximum spectral angle of a real data set is no larger than 90°. Moreover, we have computed a series of real hyperspectral data sets and found that the maximum spectral angle is usually less than 45° in practice. Therefore, by the OP transformation, we can apply the multiplicative update rules in the PC space without involving any translation.

Clearly, the PC transformation without translation and the OP transformation can be combined into one process, named the 'PC-OP transformation' here. Overall, the PC-OP transformation has the following advantages: (1) it reduces the dimensionality of the data set and thus the computational complexity; (2) it makes all data non-negative, so that the multiplicative update rules of NMF still work in the PC space; and (3) it improves the SNR of the data set by keeping only the first p − 1 PC bands, which improves the final accuracy.
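A possible implementation of the PC-OP transformation, under our reading of the text (a rotation-only 'PCA' via the eigenvectors of \(RR^{\rm T}\), i.e., no mean removal, followed by an OP rotation that maps the data mean toward the all-ones direction, as in the Fig. 1 example):

```python
import numpy as np

def pc_op_transform(R, p):
    """PC rotation (no translation) + OP rotation into the first quadrant.

    R : (L, M) data, pixels as columns; p : number of endmembers.
    Returns the (p-1) x M transformed data, which by the premise above
    is non-negative when the maximum spectral angle is below 90 degrees.
    """
    # Rotation-only PCA: eigenvectors of R R^T (data NOT mean-centered).
    _, V = np.linalg.eigh(R @ R.T)
    V = V[:, ::-1][:, :p - 1]                 # top p-1 eigenvectors
    R_pc = V.T @ R                            # rotated, dim-reduced data
    # OP step: rotate so the mean vector points toward [1, ..., 1]
    # (B = mean vector, A = all-ones vector, as in the 2D example).
    b = R_pc.mean(axis=1, keepdims=True).T    # B: 1 x (p-1)
    a = np.ones((1, p - 1))                   # A: 1 x (p-1)
    U, _, Wt = np.linalg.svd(b.T @ a)         # B^T A = U D W^T
    Q = U @ Wt                                # Eq. (15)
    return Q.T @ R_pc
```

Since \(B^{\rm T}A\) is rank one here, Q is not unique; the completion returned by the SVD is one valid choice that moves the data cloud toward the first quadrant.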

Fig. 1 Illustration of the orthogonal Procrustes (OP) process in the 2D space

After obtaining the non-negative data and endmember matrices, we can directly perform NMF on them for the final endmember extraction, which is summarized in Algorithm 1 (a code sketch of the whole procedure follows the list below). Some practical considerations need to be stated here:

1. Initialization: Owing to the non-convexity of the objective function in Eq. (4), random initialization of E and C can easily get trapped in local minima, making the solution meaningless. Taking a 2D situation as an example, the scatter points in Fig. 2 form a highly mixed data cloud. Points A, B, and C are the three true endmembers, and ΔABC is obviously the most compact triangle that encloses all the points. Because of the non-convexity of NMF, any triangle in the plane that encloses the data set is a local solution of NMF, such as ΔA1B1C1 and ΔA2B2C2. However, the vertices A1, B1, C1 or A2, B2, C2 differ greatly from the true endmembers A, B, and C, and thus have no physical meaning.

   To avoid unreasonable local minima, Tao et al. (2007b) used the N-FINDR outputs as the endmember initialization. We employ the same strategy: the results of FGDA (Sun et al., 2014), which are independent of the dimensionality of the data, are used as the initialization for E. C is then estimated by solving the non-negative least-squares problem (lsqnonneg) on the augmented data and endmember matrices, \(\bar R\) and \(\bar E_0\).

2. Stopping conditions: There are two widely used criteria, a maximum iteration number and a minimum error tolerance. Here, we employ the maximum iteration number.
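Putting the pieces together, the following is a hedged sketch of the overall PCNMF procedure as we read Algorithm 1. `pc_op_transform` is the sketch from earlier in this section, `endmember_idx` stands in for the FGDA output (any pure-pixel initializer would do), and SciPy's `nnls` plays the role of MATLAB's lsqnonneg; mapping the endmembers back to the original spectral space is omitted.

```python
import numpy as np
from scipy.optimize import nnls

def pcnmf(R, p, endmember_idx, delta=13.0, maxiter=4000, eps=1e-9):
    """PCNMF sketch: PC-OP transform, then multiplicative NMF.

    R : (L, M) data, pixels as columns
    endmember_idx : indices of initial endmember pixels (e.g., FGDA)
    """
    X = pc_op_transform(R, p)             # (p-1) x M, non-negative
    E = X[:, endmember_idx].copy()        # initial endmembers, (p-1) x p
    M = X.shape[1]
    # ASC augmentation (Section 2.2):
    Xb = np.vstack([X, delta * np.ones((1, M))])
    Eb = np.vstack([E, delta * np.ones((1, p))])
    # Initialize C by non-negative least squares on the augmented
    # matrices, one pixel at a time (lsqnonneg equivalent).
    C = np.column_stack([nnls(Eb, Xb[:, i])[0] for i in range(M)])
    C = np.maximum(C, 1e-6)   # zeros would stay zero under Eq. (5)
    for _ in range(maxiter):  # stopping rule: maximum iteration number
        C *= (Eb.T @ Xb) / (Eb.T @ Eb @ C + eps)
        Eb *= (Xb @ C.T) / (Eb @ C @ C.T + eps)
    return Eb[:-1, :], C      # endmembers in PC-OP space, abundances
```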

Fig. 2 Illustration of the local minima problem of NMF in the 2D space

5 Experiments

We have conducted tests on both simulated and real data to evaluate the performance of PCNMF, in terms of both endmember accuracy and computation time. For a fair comparison, we set the same initializations of E and C, the same maximum iteration number maxiter, and the same ASC weight δ for both NMF and PCNMF in all experiments.

5.1 Simulated data

Here, we design three experiments to evaluate the performance of FGDA, NMF, and PCNMF. The spectra of three minerals (Alunite, Calcite, and Kaolinite) from the U.S. Geological Survey (USGS) Digital Spectral Library are selected as endmember signatures. Then 2000 mixture vectors are generated according to Eqs. (1) and (2), with abundance fractions following a Dirichlet distribution. To ensure that no pure pixel exists, no fraction is allowed to exceed 0.9. The choice of the maximum iteration number, maxiter, may differ between data sets; for the simulated data sets used in this study, the NMF algorithms generally produce acceptable results after about 4000 iterations, so we set maxiter to 4000. The ASC weight is set to δ = 13 for all experiments. In addition, since the data are randomly generated, the average result of 10 runs is reported in the following. To evaluate the performance of these methods, the metrics rmsSAD, rmsSID (Nascimento and Bioucas-Dias, 2005), and the relative abundance error (RAE) (Geng et al., 2015) are used, where SAD stands for spectral angle distance and SID for spectral information divergence. The abundance matrix of FGDA is estimated by a least-squares method with the ASC weight also set to 13.
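As a concrete illustration of this setup, the sketch below generates such a mixed data set and defines the SAD metric; the USGS spectra are replaced by random stand-ins, so all numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, p = 224, 2000, 3
E = rng.random((L, p))      # stand-ins for Alunite, Calcite, Kaolinite
# Dirichlet abundances (ANC and ASC hold by construction); redraw any
# column whose largest fraction exceeds 0.9 so no near-pure pixel exists.
C = rng.dirichlet(np.ones(p), size=M).T            # p x M
while (C > 0.9).any():
    bad = (C > 0.9).any(axis=0)
    C[:, bad] = rng.dirichlet(np.ones(p), size=int(bad.sum())).T
R = E @ C                    # Eq. (1), noiseless mixtures
# (For Experiment 2, Gaussian noise at a chosen SNR is added to R.)

def sad(a, b):
    """Spectral angle distance between two spectra, in degrees."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```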

5.1.1 Experiment 1: accuracy test for noiseless data

Fig. 3 shows that the two NMF-based methods, NMF and PCNMF, achieve exactly the same accuracy. Compared with FGDA (rmsSAD about 0.81°), they obtain more accurate endmembers (rmsSAD about 0.49°), which can be attributed to the fact that no 'pure' pixels exist in the data cloud.

Fig. 3 The rmsSAD (a), rmsSID (b), and relative abundance error (RAE) (c) for different methods with noise-free data

5.1.2 Experiment 2: accuracy test for noisy data

Real data always contain noise. Here we investigate the performance of these methods on noisy data: Gaussian noise at different SNR levels (SNR = 10, 15, 20, 25, and 30 dB) is added to the simulated data. The accuracy results are shown in Fig. 4.

Fig. 4 The rmsSAD (a), rmsSID (b), and relative abundance error (RAE) (c) as functions of SNR for different methods with noisy data

As can be seen from Fig. 4, the proposed PCNMF is consistently better than both FGDA and NMF applied to the original, full-dimensional data. When SNR ≥ 15 dB, FGDA performs worst; when SNR = 10 dB, NMF has the lowest accuracy in all metrics, which indicates that NMF is sensitive to noise. The superiority of PCNMF can be attributed to the fact that PCA improves the SNR and suppresses noise, so NMF in the PC space is more accurate, particularly at low SNR.

5.1.3 Experiment 3: computation time

We have also recorded the computation time of these methods (Table 1). As shown, NMF is very time-consuming. Benefiting from computing E and C in the low-dimensional PC space, PCNMF costs much less time than NMF.

Table 1 Computation time for different methods

5.2 Real data

In this subsection, we evaluate the performance of PCNMF using the well-known AVIRIS Cuprite data set (Green et al., 1998). AVIRIS is a high-quality, low-noise hyperspectral instrument that acquires data in 224 contiguous spectral bands ranging from 365 to 2500 nm. The selected subscene, shown in Fig. 5, has a size of 191 × 250 pixels. Owing to water absorption or low SNR, bands 1–3, 104–113, 148–167, and 221–224 are removed. The maximum SADs of the original, PC transformed, and OP transformed data are listed in Table 2. The maximum SAD of the original data is around 23°, far less than 90°, and the PC and PC-OP transformations have little effect on the SADs between pixels.

Fig. 5 Band 8 (λ = 0.463 µm) of the real hyperspectral scene

Table 2 The maximum SAD of the original, PC transformed, and OP transformed data

According to the related references (Chan et al., 2009) and the ground truth (Swayze et al., 1992), we set the number of endmembers to p = 14. For a fair comparison, FGDA is also conducted on the data after dimensionality reduction by Algorithm 1, and the 14 endmembers extracted by FGDA are used as the initial endmembers for PCNMF. The same maximum iteration number and ASC weight as in the simulations are applied, i.e., maxiter = 4000 and δ = 13. The resampled spectra from the USGS Digital Spectral Library are used as the ground truth for comparison: for each mineral, the library spectrum that has both a small SAD and a small SID with respect to the endmember extracted by FGDA is selected as the reference spectrum. The comparison of FGDA and PCNMF in terms of SAD is shown in Table 3. Ten of the 14 endmembers extracted by PCNMF outperform those of FGDA, and PCNMF has a smaller average SAD.

Table 3 The SAD between USGS reference spectra and extracted endmembers by FGDA and PCNMF

Fig. 6 shows the abundance images of six main endmembers from PCNMF, which show a high level of similarity to the published geologic maps.

Fig. 6 Abundance maps of different endmembers from PCNMF: (a) E1, Muscovite; (b) E3, Alunite1; (c) E5, Montmorillonite; (d) E6, Kaolinite; (e) E7, Buddingtonite; (f) E10, Chalcedony

6 Conclusions

In this paper, we have explored the possibility of combining the PCA dimensionality reduction technique with the multiplicative update rules of NMF. The main obstacle is that data after the PC transformation may contain negatives, which can be caused by both steps of PCA (i.e., translation and rotation). We have proved that the rotation matrix of PCA is eliminated in the multiplicative learning rule for C, but not in that for E. One possible way to restore the non-negativity of PCA data is to translate all the data by a vector r0; however, different r0 lead to different E and C, so translation is not advisable for practical applications. According to our observations, the maximum SAD of real hyperspectral data is not large (generally less than 45°), and rotation does not change the SADs between data vectors. We therefore introduce the OP transformation to forcibly rotate the data cloud into the first quadrant of the PC space. The resulting NMF-based unmixing method, operating on data after the PC rotation and OP transformations, is named PCNMF. Compared with the original NMF, PCNMF is more robust to noise. Moreover, since the data dimensionality is reduced to p − 1, our method greatly saves computation time.