1 Introduction

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is a superset of activities that includes feature extraction, feature construction and feature selection, all of which are important steps in feature engineering. Nowadays, many practical applications in machine learning, visual recognition, text retrieval and bioinformatics need to deal with high-dimension and large-sample data efficiently (Alzubi and Abuarqoub 2020; Andras 2018; Chen et al. 2020; Liu et al. 2020; Vishwakarma and Singh 2019; Zhang et al. 2017). To effectively manipulate and analyze massive data, feature extraction and dimensionality reduction are imperative in feature engineering (Eldén 2005; Fukunaga 1991; Gado et al. 2016; Hastie et al. 2001; Kokiopoulou et al. 2010; Liu et al. 2020; Park and Park 2008; Zhang et al. 2010).

Linear Discriminant Analysis (LDA) is a popular method for feature extraction, whose goal is to find a suitable linear transformation such that each high-dimensional sample vector is projected onto a low-dimensional vector, while maintaining the original cluster structure and achieving maximum class separability (Fukunaga 1991; Hastie et al. 2001). In essence, it is also a process of dimensionality reduction. The trace-ratio problem is an important optimization problem involved in dimensionality reduction techniques such as Fisher linear discriminant analysis (LDA) (Belhumeur et al. 1997; Fukunaga 1991). More precisely, let \(X=[\mathbf{x}_{1},\mathbf{x}_{2},\ldots ,\mathbf{x}_{n}]\in {\mathbb {R}}^{d\times n}\) be a set of n training samples in a d-dimensional feature space. Assume that the data matrix is partitioned into k classes as \(X=[X_1,\ldots ,X_k]\), where \(X_j\) is the j-th class with \(n_{j}\) being its number of samples. Denote by \(\mathbf{c}_{j}\) the centroid vector of \(X_j\), and by \(\mathbf{c}\) the global centroid vector of the training data. Let \(\mathbf{1}_{j}=[1,1,\ldots ,1]^{T} \in {\mathbb {R}}^{n_j}\) be the vector of all ones of length \(n_j\); then the within-class scatter matrix is defined as

$$\begin{aligned} S_{W}=\sum _{j=1}^{k}\sum _{\mathbf{x}_{i}\in X_{j}}(\mathbf{x}_{i}-\mathbf{c}_{j})(\mathbf{x}_{i}-\mathbf{c}_{j})^{T}=H_WH_W^T, \end{aligned}$$

where \(H_{W}=[X_{1}-\mathbf{c}_{1} \mathbf{1}_{1}^{T},\ldots ,X_{k}-\mathbf{c}_{k} \mathbf{1}_{k}^{T}]\in {\mathbb {R}}^{d\times n}\). The between-class scatter matrix is defined as

$$\begin{aligned} S_{B}=\sum _{j=1}^{k}n_{j}(\mathbf{c}_{j}-\mathbf{c})(\mathbf{c}_{j}-\mathbf{c})^{T}=H_BH_B^T, \end{aligned}$$

where \(H_B=[\sqrt{n_{1}}(\mathbf{c}_{1}-\mathbf{c}),\sqrt{n_{2}}(\mathbf{c}_{2}-\mathbf{c}),\ldots ,\sqrt{n_{k}}(\mathbf{c}_{k}-\mathbf{c})]\in {\mathbb {R}}^{d \times k}\). The total scatter matrix is defined as

$$\begin{aligned} S_{T}=\sum _{j=1}^{n}(\mathbf{x}_{j}-\mathbf{c})(\mathbf{x}_{j}-\mathbf{c})^{T}=H_TH_T^T, \end{aligned}$$

where \(H_T=[\mathbf{x}_{1}-\mathbf{c}, \mathbf{x}_{2}-\mathbf{c},\ldots , \mathbf{x}_{n}-\mathbf{c}]\in {\mathbb {R}}^{d \times n}\). Moreover, it is well known that (Park and Park 2008)

$$\begin{aligned} S_T=S_W+S_B. \end{aligned}$$
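To fix ideas, the following minimal Python sketch (the toy data and variable names are ours, not from the paper) assembles \(S_W\), \(S_B\) and \(S_T\) for a small labeled sample matrix and checks the identity \(S_T=S_W+S_B\) numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 10, 5, 3                           # toy sizes (ours, for illustration)
X = rng.standard_normal((d, k * m))          # samples stored as columns
labels = np.repeat(np.arange(k), m)

c = X.mean(axis=1, keepdims=True)            # global centroid
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for j in range(k):
    Xj = X[:, labels == j]
    cj = Xj.mean(axis=1, keepdims=True)      # class centroid c_j
    S_W += (Xj - cj) @ (Xj - cj).T           # within-class scatter
    S_B += Xj.shape[1] * (cj - c) @ (cj - c).T   # between-class scatter
S_T = (X - c) @ (X - c).T                    # total scatter

print(np.allclose(S_T, S_W + S_B))           # True: S_T = S_W + S_B
```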

The LDA method resorts to maximizing the between-class scatter distance while minimizing the within-class scatter distance. This gives the following trace-ratio problem (Fukunaga 1991; Kokiopoulou et al. 2010; Kramer et al. 2018; Ngo et al. 2012)

$$\begin{aligned} \begin{aligned} {\widehat{\rho }}^{*}=\max _{\begin{array}{c} V\in {\mathbb {R}}^{d\times s} \\ V^{T}V = I \end{array}} \frac{{tr}(V^{T}S_{B}V)}{{tr}(V^{T}S_{W}V)}, \end{aligned} \end{aligned}$$
(1.1)

where tr\((\cdot )\) denotes the trace of a matrix, and \(s\ll d\) is the reducing dimension. However, this problem is difficult to solve (Fukunaga 1991; Park and Park 2008), and in general a simpler ratio-trace problem is solved instead

$$\begin{aligned} \begin{aligned} {\widetilde{\rho }}^{*}=\max _{\begin{array}{c} V\in {\mathbb {R}}^{d\times s} \\ V^{T}V = I \end{array}} {tr}\big ((V^{T}S_{W}V)^{-1}(V^{T}S_{B}V)\big ), \end{aligned} \end{aligned}$$
(1.2)

whose solution can be obtained from solving the following generalized eigenvalue problem

$$\begin{aligned} S_{B} \mathbf{v}=\lambda S_{W} \mathbf{v}. \end{aligned}$$
(1.3)
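For illustration, here is a minimal sketch of how the ratio-trace problem (1.2) is commonly solved through the generalized eigenproblem (1.3); the small ridge term is our own safeguard for the case that \(S_W\) is (nearly) singular and is not part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def ratio_trace_lda(S_B, S_W, s, reg=1e-8):
    """Solve the ratio-trace problem (1.2) via S_B v = lambda S_W v.
    reg*I is an assumed safeguard against a (nearly) singular S_W."""
    d = S_W.shape[0]
    vals, vecs = eigh(S_B, S_W + reg * np.eye(d))  # generalized symmetric eigenproblem
    return vecs[:, -s:]                            # eigenvectors of the s largest eigenvalues
```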

Indeed, the trace-ratio problem and the ratio-trace problem are not mathematically equivalent (Park and Park 2008), and the trace-ratio problem (1.1) has regained considerable attention in recent years (Guo et al. 2003; Jia et al. 2009; Jiang et al. 2017; Ngo et al. 2012; Nie et al. 2008; Wang et al. 2007; Zhang et al. 2010; Zhao et al. 2013). One reason is that the trace-ratio model (1.1) can yield markedly improved recognition results for supervised learning tasks compared to (1.2). However, it has long been believed that there is no explicit solution for the trace-ratio problem, and the commonly used techniques are inner-outer iterative algorithms (Jia et al. 2009; Ngo et al. 2012; Wang et al. 2007; Zhao et al. 2013): inner iterations solve eigenvector problems, while outer iterations update the trace-ratio value. More precisely, in the j-th outer iteration, we compute \(V_j\) by solving the eigenproblem with respect to \(A-\rho _j B\) for a given \(\rho _j\), where \(A=S_B\), \(B=S_W\) (or \(S_T\)), and then we compute \(\rho _{j+1}\) by using \(V_j\). The value of \(\rho _{j+1}\) is determined in different ways by different methods. For instance, the Newton-Lanczos algorithm (Ngo et al. 2012) and the method proposed by Wang et al. (2007) use the trace-ratio value \({tr}(V_j^{T}AV_j)/{tr}(V_j^{T}BV_j)\) as \(\rho _{j+1}\). The GFST algorithm uses the bisection method to choose \(\rho _{j+1}\) (Guo et al. 2003), while the DNM algorithm searches for \(\rho _{j+1}\) by calculating the first-order expansion of \(A-\rho _jB\) (Jia et al. 2009). However, in high-dimension and large-sample problems, both the dimension and the number of samples are very large, and the overhead of all these algorithms can be prohibitive.
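To make the inner-outer framework concrete, here is a generic sketch of such an iteration (ours, for illustration only; it does not reproduce any particular published method, and it uses a dense eigendecomposition, which is exactly what becomes prohibitive at large scale):

```python
import numpy as np

def trace_ratio_inner_outer(A, B, s, rho0=0.0, maxit=100, tol=1e-10):
    """Generic inner-outer iteration for max tr(V^T A V)/tr(V^T B V).
    Inner step: V_j spans the s eigenvectors of A - rho_j*B with largest eigenvalues.
    Outer step: rho_{j+1} = tr(V_j^T A V_j)/tr(V_j^T B V_j).
    Assumes tr(V^T B V) != 0 (e.g. B = S_T restricted to a suitable subspace)."""
    rho = rho0
    for _ in range(maxit):
        _, U = np.linalg.eigh(A - rho * B)          # dense symmetric eigendecomposition
        V = U[:, -s:]                               # top-s eigenvectors
        rho_new = np.trace(V.T @ A @ V) / np.trace(V.T @ B @ V)
        if abs(rho_new - rho) <= tol * max(1.0, abs(rho_new)):
            break
        rho = rho_new
    return V, rho_new
```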

For high-dimension and large-sample data, many algorithms are available for the ratio-trace problem. For instance, by using spectral graph analysis, the SRDA method casts discriminant analysis into a regression framework that facilitates both efficient computation and the use of regularization techniques (Cai et al. 2008). The LDADL method is designed for a new batch LDA model formulated as a regularized least squares (RLS) problem with multiple columns on the right-hand side (Zhang et al. 2017). The Rayleigh-Ritz discriminant analysis (RRDA) method is a gradient-type method based on the Rayleigh-Ritz framework for the generalized eigenvalue problem of LDA (Zhu and Huang 2014). Some randomized algorithms have also been investigated, such as RFDA/RP (Ye et al. 2017) and FastLDA (Gado et al. 2016). However, all of the above algorithms are designed only for the ratio-trace problem (1.2) rather than the trace-ratio problem (1.1), and all of them are parameter-dependent. It is well known that the optimal parameter is difficult to choose in practice if there is no other information available in advance (Gui et al. 2014). Therefore, how to solve the high-dimension and large-sample trace-ratio problem is an interesting topic that deserves further investigation (Jia et al. 2009; Ngo et al. 2012).

In this paper, we pay special attention to solving the high-dimension and large-sample trace-ratio problem efficiently. In Sect. 2, we provide an alternative way for (1.1), which does not rely on the inner-outer iterative framework. In Sect. 3, we provide a closed-form formula for this problem. Based on the formula and the randomized singular value decomposition (RSVD) (Halko et al. 2011), we propose a randomized algorithm for high-dimension and large-sample dense data problems. However, for large sparse data sets, RSVD will destroy the sparse structure of the original data. Thus, a method based on solving (consistent) under-determined systems is proposed for high-dimension and large-sample sparse data problems. Theoretical results are established to show the rationality and feasibility of the proposed methods. In Sect. 4, we perform some numerical experiments on some real-world data sets to illustrate the numerical behavior of our proposed algorithms. Concluding remarks are given in Sect. 5.

Throughout this paper, we suppose that the high-dimension and large-sample dense data matrix \(X\in {\mathbb {R}}^{d\times n}\) is of full column rank. In this work, by high-dimension and large-sample data, we mean that both the dimensionality d and the number of samples n are very large and even in the same order, but with the assumption that \(d\ge n\). Some notations used in this paper are listed in Table 1.

Table 1 Some notations used in this paper

2 An alternative way to solve the trace-ratio problem

In this section, we present an alternative way for solving the trace-ratio problem to take the place of the inner-outer iterative framework. Let \({\mathscr {W}}\) be a subspace and let W be a basis of it. If Z is a matrix, in this paper, by \(Z\in {\mathscr {W}}\), we mean there is a matrix S of appropriate size such that \(Z=WS\).

We rewrite (1.1) as

$$\begin{aligned} \begin{aligned} {\rho }^{*}=\max _{\begin{array}{c} V\in {\mathbb {R}}^{d\times s} \\ V^{T}V = I \end{array}} \frac{{tr}(V^{T}S_{B}V)}{{tr}(V^{T}S_{T}V)}. \end{aligned} \end{aligned}$$
(2.1)

It is seen that (1.1) and (2.1) are mathematically equivalent as \(S_T=S_W+S_B\) (Guo et al. 2003). Denote by \(Z_1\) an orthonormal basis for \({\mathscr {N}}^{\perp }(S_T)\), and by \(Z_2\) an orthonormal basis for \({\mathscr {N}}(S_T)\), respectively. Then \(Z=[Z_1,Z_2]\) is orthogonal, and for any matrix \(V\in {\mathbb {R}}^{d\times s}\), there are matrices \(W_1,~W_2\) such that

$$\begin{aligned} V=Z_1W_1+Z_2W_2=V_{11}+V_{22}, \end{aligned}$$

where \(V_{11}=Z_1W_1\in {\mathscr {N}}^{\perp }(S_T)={\mathscr {R}}(S_T)\) as \(S_T\) is symmetric, and \(V_{22}=Z_2W_2\in {\mathscr {N}}(S_T)\). Therefore,

$$\begin{aligned} S_BV=S_BZ_1W_1+S_BZ_2W_2=S_BZ_1W_1=S_BV_{11}, \end{aligned}$$

where we use the property that \(Z_2\) is also in the null space of \(S_B\). Indeed, as \(S_T=S_B+S_W\) and all three matrices \(S_B, S_W\) and \(S_T\) are symmetric positive semi-definite, the null space of \(S_T\) is contained in the intersection of the null spaces of \(S_B\) and \(S_W\). Thus, we have \(V^TS_BV=V_{11}^TS_BV_{11}\), and

$$\begin{aligned} {{tr}}(V^TS_BV)={tr}(V_{11}^TS_BV_{11}). \end{aligned}$$
(2.2)

Similarly, we can prove that

$$\begin{aligned} {{tr}}(V^TS_TV)={tr}(V_{11}^TS_TV_{11}). \end{aligned}$$
(2.3)

By (2.2) and (2.3),

$$\begin{aligned} \begin{aligned} {\rho }^{*}=\max _{\begin{array}{c} V\in {\mathbb {R}}^{d\times s} \\ V^{T}V = I \end{array}}\frac{{tr}(V^{T}S_{B}V)}{{tr}(V^{T}S_{T}V)}=\max _{\begin{array}{c} V\in {\mathbb {R}}^{d\times s}, V^{T}V = I \\ V\in {\mathscr {N}}^{\perp }(S_T) \end{array}}\frac{{tr}(V^TS_BV)}{{tr}(V^TS_TV)}, \end{aligned} \end{aligned}$$
(2.4)

which implies that the information in \({\mathscr {N}}(S_T)\) makes no contribution to the trace-ratio value at all. This coincides with the assertion that there is no useful information in \({\mathscr {N}}(S_T)\) for recognition (Park and Park 2008).

In view of (2.4), we focus on the following optimization problem in this work:

$$\begin{aligned} \begin{aligned} V^{*}=\arg \max _{\begin{array}{c} V\in {\mathbb {R}}^{d\times s}, V^{T}V = I \\ V\in {\mathscr {N}}^{\perp }(S_T) \end{array}} \frac{{tr}(V^{T}S_{B}V)}{{tr}(V^{T}S_{T}V)}. \end{aligned} \end{aligned}$$
(2.5)

If the data matrix \(X\in {\mathbb {R}}^{d\times n}\) is of full column rank, then \(\dim ({\mathscr {N}}^{\perp }(S_T))=\dim ({\mathscr {R}}(S_T))=n-1\) and \(\dim ({\mathscr {R}}(S_W))=n-k\) (Park and Park 2008), so that \(\dim ({\mathscr {N}}(S_W))=d-n+k\). Thus, \(\dim ({\mathscr {N}}^{\perp }(S_T))+\dim ({\mathscr {N}}(S_W))=d+k-1>d\). That is, \({\mathscr {N}}^{\perp }(S_T)\) and \({\mathscr {N}}(S_W)\) have a non-trivial intersection whose dimension is at least \(k-1\).

For any d-by-s orthonormal matrix \(V\in {\mathscr {N}}^{\perp }(S_T)\), there holds

$$\begin{aligned} \frac{{tr}(V^{T}S_{B}V)}{{tr}(V^{T}S_{T}V)}=\frac{{tr}(V^{T}S_{B}V)}{{tr}(V^{T}S_{B}V)+{tr}(V^{T}S_{W}V)}\le 1, \end{aligned}$$

and the equality holds if and only if \({tr}(V^{T}S_{W}V)=0\) and \({tr}(V^{T}S_{B}V)\ne 0\), which can be attained as \({\mathscr {N}}^{\perp }(S_T)\) and \({\mathscr {N}}(S_W)\) have non-trivial intersection. In other words, the orthonormal matrix \(V^{*}\in {\mathbb {R}}^{d\times s}\) is a solution to (2.5) if \(V^{*}\in {\mathscr {N}}(S_W)\cap {\mathscr {N}}^{\perp }(S_T)\), i.e., \(V^{*}\in {\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T)\). In conclusion, we have the following theorem for the solution of the trace-ratio problem (2.5).

Theorem 2.1

The subspace \({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T)\) is the solution space of the trace-ratio problem (2.5). Let s be the reducing dimension. If \(\dim ({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T))\ge s\), then any orthonormal basis for an s-dimensional subspace of \({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T)\) is a solution to (2.5).

Remark 2.1

Theorem 2.1 indicates that the trace-ratio problem does admit a simple solution when the dimension of the data points d is greater than or equal to the number of data points n. Note that the condition \(d\ge n\) is satisfied in many real applications involving image, video, audio and microarray data (Alzubi and Abuarqoub 2020; Andras 2018; Cai et al. 2008; Chen et al. 2020; Gado et al. 2016; Liu et al. 2020; Vishwakarma and Singh 2019; Zhang et al. 2017; Zhu and Huang 2014). Indeed, in the big data era, the dimensionality (number of features) of the data points is huge, and it is often larger than the number of samples. Note that our strategy works when both d and n are very large and even of the same order.

Theorem 2.1 also reveals that our method is related to the null space method (Chen et al. 2000; Chu and Thye 2010; Huang et al. 2002; Lu and Wang 2012; Wu and Feng 2015). Indeed, the null space method seeks a solution V in the null space of \(S_W\) while maximizing \({tr}(V^{T}S_{B}V)\). The authors in Zhao et al. (2012) mentioned that the solution of the null space method and that of the trace-ratio LDA method are equivalent for the singular case. We point out that our result is different from the one given in Zhao et al. (2012). First, due to (2.4), we exploit the model (2.5) as an alternative to (2.1). Second, the solution of the null space method is also in \({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T)\), so the conclusion in Zhao et al. (2012) is just a special case of ours.

When both the dimension d and the number of training samples n are large, however, computing the null spaces of \(S_W\) and \(S_T\) directly is prohibitive in practice. Consequently, a direct application of the null space method to high-dimension and large-sample dense data sets is impractical. Based on Theorem 2.1, we aim to seek an efficient algorithm to solve (2.5) in the following work.

3 New algorithms for solving the trace-ratio problem on high-dimension and large-sample data sets

In large-scale discriminant analysis, the number of samples can be very large and even of the order of the data dimension d (Cai et al. 2008; Gado et al. 2016; Zhu and Huang 2014). In this situation, both the number of rows and the number of columns of the data matrix are very large, and a direct decomposition of the data matrix is infeasible. In this section, we give insight into fast implementations based on Theorem 2.1 for large data matrices.

3.1 A closed-form solution

Consider the centered data matrix \({\overline{X}}=X(I_n-\mathbf{1}_{n} \mathbf{1}_{n}^T/n)=XL_T\), where \(L_T=I_n-\mathbf{1}_{n}\mathbf{1}_{n}^T/n\). Define W as the following \(n\times n\) block diagonal matrix

$$\begin{aligned} W= \left[ \begin{array}{cccc} \frac{1}{n_1} \mathbf{1}_{n_1} \mathbf{1}_{n_1}^T & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \frac{1}{n_2} \mathbf{1}_{n_2} \mathbf{1}_{n_2}^T & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \frac{1}{n_k} \mathbf{1}_{n_k} \mathbf{1}_{n_k}^T\\ \end{array} \right] , \end{aligned}$$

then we have \(S_B={\overline{X}}W{\overline{X}}^T\) and \(S_T={\overline{X}}{\overline{X}}^T\) (Cai et al. 2008). Thus, \(S_B\) and \(S_T\) can be rewritten as

$$\begin{aligned} S_B=XL_TWL_T{X}^T,\quad S_T=XL_TL_T{X}^T=XL_TX^T, \end{aligned}$$
(3.1)

where we use the fact that \(L_T^2=L_T\). Moreover, from \(W\mathbf{1}_n=\mathbf{1}_n\) and \(L_T\mathbf{1}_n=\mathbf{0}\), we obtain

$$\begin{aligned} S_W=S_T-S_B =X(L_T-L_TWL_T){X}^T =X(I_n-W)X^T. \end{aligned}$$
(3.2)
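The matrices \(W\) and \(L_T\) are cheap to form, and the factorizations (3.1)–(3.2) can be checked directly on a toy example; the sketch below (sizes and names are ours) does so.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sizes = 8, [3, 4, 5]                       # toy class sizes (ours)
n, k = sum(sizes), len(sizes)
X = rng.standard_normal((d, n))
labels = np.repeat(np.arange(k), sizes)

L_T = np.eye(n) - np.ones((n, n)) / n         # centering matrix L_T
W = np.zeros((n, n))                          # block-diagonal averaging matrix W
for j, nj in enumerate(sizes):
    idx = np.where(labels == j)[0]
    W[np.ix_(idx, idx)] = 1.0 / nj

S_B = X @ L_T @ W @ L_T @ X.T                 # (3.1)
S_T = X @ L_T @ X.T                           # (3.1)
S_W = X @ (np.eye(n) - W) @ X.T               # (3.2)
print(np.allclose(S_T, S_B + S_W))            # True
```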

In view of Theorem 2.1, the framework of our method is composed of two steps:

(i) First, we compute a basis for a k-dimensional subspace of \({\mathscr {N}}(S_W)\).

Theorem 3.1

Denote by Y the following \(n\times k\) matrix

$$\begin{aligned} Y=[\mathbf{{y}}_1,\mathbf{{y}}_2,\ldots ,\mathbf{{y}}_k]\equiv \left[ \begin{array}{cccc} \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_k}\\ \end{array} \right] \in {\mathbb {R}}^{n\times k}, \end{aligned}$$
(3.3)

where \(\mathbf{1}_{n_i}\) is the one vector of size \(n_i\). If X is of full column rank, then \(F=(X^T)^\dag Y\) is a basis for a k-dimension subspace of \({\mathscr {N}}(S_W)\), where \((X^T)^\dag\) denotes the Moore-Penrose inverse of \(X^T\).

Proof

As X and Y are of full column rank, \(F=(X^T)^\dag Y\) is also of full column rank. Moreover, we obtain from (3.3) that \(W\mathbf{{y}}_i=\mathbf{{y}}_i,~i=1,2,\ldots ,k\). By (3.2),

$$\begin{aligned} S_WF=X(I_n-W)X^T\big ((X^T)^\dag Y\big ) =X(I_n-W)Y =\mathbf{0}, \end{aligned}$$
(3.4)

which completes the proof.  \(\square\)

(ii) Second, we remove some information in \({\mathscr {N}}(S_T)\) from \({span}\{F\}\).

Theorem 3.2

Let \({\underline{Y}}=(I_n-\mathbf{{1}}_n\mathbf{{1}}_n^T/n)Y\), and let \({\overline{Y}}\in {\mathbb {R}}^{n\times (k-1)}\) be an orthonormal basis of \({span}\{{\underline{Y}}\}\), then

$$\begin{aligned} {\overline{F}}=(X^T)^\dag {\overline{Y}}\in {\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T). \end{aligned}$$
(3.5)

Specifically, if \(F_Q\in {\mathbb {R}}^{d\times (k-1)}\) is the Q-factor of the economized QR factorization of \({\overline{F}}\), then \(F_Q\) is a solution to (2.5).

Proof

On one hand, as \((X^T)^\dag =X(X^TX)^{-1}\) and \(W\mathbf{{1}}_n=\mathbf{1}_n\), we obtain from (3.4) that

$$\begin{aligned} S_W\big ((X^T)^\dag {\underline{Y}}\big )& = X(I_n-W)X^T\big ((X^T)^\dag {\underline{Y}}\big )\\& = X(I_n-W)(X^TX)(X^TX)^{-1}(I_n-\mathbf{{1}}_n\mathbf{{1}}_n^T)Y\\& = X(I_n-W)Y-X(I_n-W)\mathbf{{1}}_n\mathbf{{1}}_n^TY\\& = \mathbf{0}, \end{aligned}$$

i.e., \((X^T)^\dag {\underline{Y}}\) is in the null space of \(S_W\). On the other hand, as \(L_T\mathbf{{1}}_n=\mathbf{0}\), we get

$$\begin{aligned} S_T\big ((X^T)^\dag {\underline{Y}}\big )& = XL_TX^T\big ((X^T)^\dag {\underline{Y}}\big )\\& = XL_T(X^TX)(X^TX)^{-1}(I_n-\mathbf{{1}}_n\mathbf{{1}}_n^T)Y\\& = XL_TY-XL_T\mathbf{{1}}_n\mathbf{{1}}_n^TY\\& = XL_TY. \end{aligned}$$

Since \({\mathscr {N}}(L_T)={span}\{\mathbf{{1}}_n\}\), it follows from (3.3) that \(L_T\mathbf{{y}}_i\ne \mathbf{0}\) and \(XL_T\mathbf{{y}}_i\ne \mathbf{0},~i=1,\ldots ,k\). Thus, \(S_T\big ((X^T)^\dag {\underline{Y}}\big )\ne \mathbf{0}\), and it follows that

$$\begin{aligned} (X^T)^\dag {\underline{Y}}\in {\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T). \end{aligned}$$

Further, as \({\overline{Y}}\) is an orthonormal basis for \({span}\{{\underline{Y}}\}\), we have \((X^T)^\dag {\overline{Y}}\in {span}\{(X^T)^\dag {\underline{Y}}\}\), and thus the d-by-\((k-1)\) matrix \({\overline{F}}=(X^T)^\dag {\overline{Y}}\in {\mathscr {N}}(S_W) \setminus {\mathscr {N}}(S_T)\). Since \(F_Q\) is the Q-factor of the economized QR factorization of \({\overline{F}}\), by Theorem 2.1, it is a solution to (2.5). \(\square\)

Remark 3.1

From Sect. 2, if the data matrix \(X\in {\mathbb {R}}^{d\times n}\) is of full column rank, we have \({dim}({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T))\ge k-1\). Thus, without loss of generality, we usually set the reducing dimension \(s=k-1\). If the reducing dimension \(s<k-1\), then any s columns of \(F_Q\) form a solution to (2.5). For instance, we can choose the first s columns of \(F_Q\), i.e., \(F_Q(:,1:s)\), as a solution.
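For moderate sizes, the closed-form solution of this subsection can be implemented verbatim; the following sketch (our own illustration, assuming \(d\ge n\) and a full column rank X, with NumPy arrays for the data and labels) follows Theorems 3.1 and 3.2 with an explicit pseudoinverse. A quick sanity check is that \(S_WF_Q\) vanishes to machine precision while \(S_TF_Q\) does not, in line with Theorem 2.1.

```python
import numpy as np

def exact_trace_ratio_solution(X, labels):
    """Closed-form solution of (2.5) following Theorems 3.1-3.2 (illustrative
    only: the pseudoinverse is formed explicitly, so this is meant for
    moderate d and n with X of full column rank)."""
    d, n = X.shape
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k = len(classes)
    Y = np.stack([(labels == c).astype(float) for c in classes], axis=1)  # (3.3)
    Y_und = (np.eye(n) - np.ones((n, n)) / n) @ Y       # remove the 1_n direction
    Q, _ = np.linalg.qr(Y_und)
    Y_bar = Q[:, :k - 1]                                # orthonormal basis, n x (k-1)
    F_bar = np.linalg.pinv(X.T) @ Y_bar                 # (3.5)
    F_Q, _ = np.linalg.qr(F_bar)                        # solution to (2.5)
    return F_Q
```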

3.2 A randomized algorithm for the trace-ratio problem on high-dimension and large-sample dense data sets

In this paper, we assume that the data matrix X has full column rank; however, it may be ill-conditioned in practice, and a direct computation of (3.5) is inadvisable. To deal with this problem and make the solution less sensitive to perturbations, the technique of truncated singular value decomposition (TSVD) is often utilized (Eldén 2005; Golub and Van Loan 2013). However, for high-dimension and large-sample data, both d and n are very large, and the overhead of the truncated singular value decomposition will still be prohibitive.

To overcome this difficulty, we make use of the randomized singular value decomposition (RSVD) (Gu 2015; Halko et al. 2011; Martinsson et al. 2011; Woodruff 2014) to produce a low-rank approximation \({\widetilde{X}}^T\) to \(X^T\), and then compute an approximation to \({\overline{F}}\). We now briefly introduce the randomized singular value decomposition; for more details on its implementation and theoretical background, we refer to Gu (2015), Halko et al. (2011). Notice that the integer r in the algorithm is a user-provided parameter whose choice is problem-dependent.

Algorithm 1 The randomized singular value decomposition (RSVD) with power iteration
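Since the listing is not reproduced here, the following Python sketch gives a randomized SVD with power iteration in the spirit of Halko et al. (2011) and of the steps referenced below (an orthonormal basis \(Q_1\) of span\(\{G\}\), followed by an SVD of \(X^TQ_1\)); apart from r, p and q, all names are ours, and details may differ from the exact Algorithm 1.

```python
import numpy as np

def rsvd_transpose(X, r, p=20, q=2, seed=0):
    """Randomized SVD sketch in the spirit of Halko et al. (2011): returns
    rank-r factors U_r, s_r, V_r with X^T ~ U_r @ np.diag(s_r) @ V_r.T.
    r is the target rank, p the over-sampling parameter, q the number of
    power iterations (re-orthonormalization between powers is omitted)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    Omega = rng.standard_normal((n, r + p))        # Gaussian test matrix
    G = X @ Omega                                  # d x (r+p)
    for _ in range(q):
        G = X @ (X.T @ G)                          # power scheme: G = (X X^T)^q X Omega
    Q1, _ = np.linalg.qr(G)                        # orthonormal basis of span{G}
    B = X.T @ Q1                                   # n x (r+p)
    U2, s, Wt = np.linalg.svd(B, full_matrices=False)
    V1 = Q1 @ Wt.T                                 # X^T Q1 Q1^T = U2 diag(s) V1^T
    return U2[:, :r], s[:r], V1[:, :r]
```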

Let \({\widetilde{X}}^T\) be the approximation to \(X^T\) obtained from Algorithm 1. The idea is to compute an approximation to \({\overline{F}}\) by solving the following optimization problem:

$$\begin{aligned} {\widetilde{F}}=\arg \min _{F\in {\mathbb {R}}^{d\times (k-1)}} \Vert {\widetilde{X}}^TF-{\overline{Y}}\Vert _F. \end{aligned}$$
(3.6)

Thus, we propose the following randomized algorithm for solving the high-dimension and large-sample trace-ratio problem (2.5). The main overhead in Algorithm 2 is to compute \({\widetilde{U}}_r,{\widetilde{V}}_r\) and \({\widetilde{\varSigma }}_r\) by using Algorithm 1, which needs about \({\mathscr {O}}((d+n)r^2)\) flops, and the main storage requirement is to store a \(d\times r\) matrix (Halko et al. 2011).

Algorithm 2 A randomized algorithm for the trace-ratio problem on high-dimension and large-sample dense data sets
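Putting the pieces together, here is a sketch of the overall procedure for dense data (it reuses rsvd_transpose from the sketch above; the actual Algorithm 2 may differ in details such as how the least squares problem (3.6) is solved):

```python
import numpy as np

def randomized_trace_ratio(X, labels, r, p=20, q=2, seed=0):
    """Sketch of the randomized solver for dense data: (i) rank-r RSVD
    approximation to X^T (rsvd_transpose above), (ii) minimum-norm solution
    of min ||X_tilde^T F - Y_bar||_F via the factors, (iii) economized QR."""
    d, n = X.shape
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k = len(classes)
    Y = np.stack([(labels == c).astype(float) for c in classes], axis=1)
    Yc = (np.eye(n) - np.ones((n, n)) / n) @ Y
    Q, _ = np.linalg.qr(Yc)
    Y_bar = Q[:, :k - 1]                               # orthonormal, n x (k-1)
    U_r, s_r, V_r = rsvd_transpose(X, r, p, q, seed)   # X^T ~ U_r diag(s_r) V_r^T
    F_tilde = V_r @ ((U_r.T @ Y_bar) / s_r[:, None])   # (X_tilde^T)^+ Y_bar
    F_Q, _ = np.linalg.qr(F_tilde)                     # orthonormal d x (k-1)
    return F_Q
```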

Remark 3.2

We stress that Theorems 2.1, 3.1 and 3.2 hold for a general data matrix X, whether it is dense or not, and Algorithm 2 applies to both dense and sparse data sets. However, when the data matrix X is sparse, RSVD will destroy the sparse structure of the original data. Thus, Algorithm 2 is more appropriate for dense data sets.

We note that the key step in Algorithm 2 is to compute an approximation \({\widetilde{X}}^T\) to \(X^T\) by using Algorithm 1. The matrix \({\widetilde{X}}^T\) (of rank r) can be viewed as an approximation to \(X_r^T\), i.e., the rank-r truncated SVD (TSVD) of \(X^T\). Indeed, let \(X_r^T\) be the best rank-r approximation to \(X^T\) in terms of the 2-norm or the Frobenius norm (Golub and Van Loan 2013); then \((X_r^T)^{\dag }{\overline{Y}}\) is a reasonable choice to approximate \(({X}^T)^\dagger {\overline{Y}}\). The key questions are:

  (a) How large is the distance between the approximation \(({\widetilde{X}}^T)^{\dag }{\overline{Y}}\) obtained from (3.6) and \((X_r^T)^{\dag }{\overline{Y}}\) obtained from the truncated SVD (TSVD)?

  (b) Why does the proposed randomized algorithm still work even if the solution is “inexact”?

To answer these questions, we consider the distance between the subspace spanned by \(({\widetilde{X}}^T)^{\dag }{\overline{Y}}\) and that by \((X_r^T)^{\dag }{\overline{Y}}\). Along the line of Algorithm 2, we will divide our analysis into two procedures.

(i) First, we establish the relationship between \(X^T_r\) and the approximation \({\widetilde{X}}^T\).

Let \(P_{G}\) be the orthogonal projector onto the subspace span\(\{G\}\). In (Halko et al. 2011, pp.275), Halko et al. derived the following deviation bound for the randomized SVD algorithm without power scheme (i.e., with \(q=0\)): for all \(u,t\ge 1\), there holds

$$\begin{aligned} \Vert (I-P_{G})X\Vert _2\le \left( 1+t\sqrt{\frac{3r}{p+1}}+ut\frac{e\sqrt{r+p}}{p+1}\right) \sigma _{r+1}+t\frac{e\sqrt{r+p}}{p+1} \left( \sum _{j>r}\sigma ^2_j\right) ^{1/2},~p\ge 4, \end{aligned}$$
(3.7)

with failure probability at most \(2t^{-p}+e^{-u^2/2}\). Here \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _n>0\) are the (nonzero) singular values of X. For more bounds on the approximation obtained from the randomized power method, we refer to Gu (2015), Musco and Musco (2015), Woodruff (2014).

Based on (3.7), we will give a deviation bound for the approximation obtained from the randomized SVD accelerated by power iteration with \(q\ge 1\). Let

$$\begin{aligned} L=(XX^{T})^qX,\quad {and}\quad G=L\varOmega , \end{aligned}$$

then it follows from Algorithm 1 that \(Q_1Q_1^T\) is the orthogonal projector onto the subspace span\(\{G\}\). By (3.7),

$$\begin{aligned} \Vert (I-Q_1Q_1^T)L\Vert _2\le \left( 1+t\sqrt{\frac{3r}{p+1}}+ut\frac{e\sqrt{r+p}}{p+1}\right) {\widetilde{\sigma }}_{r+1}+t\frac{e\sqrt{r+p}}{p+1} \left( \sum _{j>r}{\widetilde{\sigma }}_j^2\right) ^{1/2},~p\ge 4, \end{aligned}$$

with failure probability at most \(2t^{-p}+e^{-u^2/2}\), where \({\widetilde{\sigma }}_{1}\ge {\widetilde{\sigma }}_{2}\ge \cdots \ge {\widetilde{\sigma }}_{n}\) are the singular values of L. Let \(X^T=U\varSigma P^T\) be the economized singular value decomposition of \(X^T\), where \(\varSigma ={diag}(\sigma _1,\sigma _2,\ldots ,\sigma _n)\in {\mathbb {R}}^{n\times n}\) with \(\sigma _1\ge \sigma _2\ge \cdots \ge \sigma _n>0\), and \(U\in {\mathbb {R}}^{n\times n}\), \(P\in {\mathbb {R}}^{d\times n}\) are two orthonormal matrices. Then we have \(L=P\varSigma ^{2q+1}U^T\) and \({\widetilde{\sigma }}_{i}=\sigma _i^{2q+1}\), \(i=1,2,\ldots ,n\). Moreover, it was shown that (Halko et al. 2011, pp.270)

$$\begin{aligned} \Vert (I-Q_1Q_1^T)X\Vert _2\le \Vert (I-Q_1Q_1^T)L\Vert _2^{1/{(2q+1)}}. \end{aligned}$$

As a result, for all \(u,t\ge 1\) and \(p\ge 4\), we have

$$\begin{aligned}&\Vert X^T-X^TQ_1Q_1^T\Vert _2=\Vert (I-Q_1Q_1^T)X\Vert _2 \\& \le \left[ \left( 1+t\sqrt{\frac{3r}{p+1}}+ut\frac{e\sqrt{r+p}}{p+1} \right) \sigma ^{2q+1}_{r+1}+t\frac{e\sqrt{r+p}}{p+1} \left( \sum _{j>r}\sigma ^{2(2q+1)}_j\right) ^{\frac{1}{2}}\right] ^{\frac{1}{2q+1}} \end{aligned}$$
(3.8)

with failure probability at most \(2t^{-p}+e^{-u^2/2}\).

Further, according to Steps 4–5 of Algorithm 1, we have \(X^TQ_1Q_1^T=U_2\varSigma _1V_1^T\), with \({\widetilde{X}}^T\equiv {\widetilde{U}}_r{\widetilde{\varSigma }}_r {\widetilde{V}}_r^{T}\) being its rank-r approximation. Let \(\sigma _{r+1}(X^TQ_1Q_1^T)\) be the \((r+1)\)-th largest singular value of \(X^TQ_1Q_1^T\), according to the interlacing theorem for singular values (Golub and Van Loan 2013), we have \(\sigma _{r+1}(X^TQ_1Q_1^T)\le \sigma _{r+1}\), where \(\sigma _{r+1}\) is the \((r+1)\)-th largest singular value of \(X^T\). Thus, from Steps 4–5 of Algorithm 1, we have

$$\begin{aligned} \Vert X^TQ_1Q_1^T-{\widetilde{X}}^T\Vert _2=\Vert U_2\varSigma _1V_1^T-{\widetilde{U}}_r{\widetilde{\varSigma }}_r {\widetilde{V}}_r^{T}\Vert _2=\sigma _{r+1}(X^TQ_1Q_1^T)\le \sigma _{r+1}. \end{aligned}$$
(3.9)

According to (3.8) and (3.9), we get

$$\begin{aligned}&\Vert X^T-{\widetilde{X}}^T\Vert _2=\Vert X^T-{\widetilde{U}}_r{\widetilde{\varSigma }}_r {\widetilde{V}}_r^{T}\Vert _2 \le \Vert X^T-X^TQ_1Q_1^T\Vert _2+\Vert U_2\varSigma _1V_1^T-{\widetilde{U}}_r{\widetilde{\varSigma }}_r {\widetilde{V}}_r^{T}\Vert _2 \\& \le \left[ \left( 1+t\sqrt{\frac{3r}{p+1}}+ut\frac{e\sqrt{r+p}}{p+1}\right) \sigma ^{2q+1}_{r+1}+t\frac{e\sqrt{r+p}}{p+1} \left( \sum _{j>r}\sigma ^{2(2q+1)}_j\right) ^{\frac{1}{2}}\right] ^{\frac{1}{2q+1}} +\sigma _{r+1} \end{aligned}$$
(3.10)

with failure probability at most \(2t^{-p}+e^{-u^2/2}\). Denote

$$\begin{aligned} \eta =\left[ \left( 1+e\sqrt{\frac{3r}{p}}+e^2\sqrt{\frac{2(r+p)}{p}}\right) \sigma ^{2q+1}_{r+1} +e^2\frac{\sqrt{r+p}}{p}\left( \sum _{j>r}\sigma ^{2(2q+1)}_j\right) ^{\frac{1}{2}} \right] ^{\frac{1}{2q+1}}+\sigma _{r+1}, \end{aligned}$$
(3.11)

if we take \(t=e\), \(u=\sqrt{2p}\), then (3.10) reduces to

$$\begin{aligned} \Vert X^T-{\widetilde{X}}^T\Vert _2\le \eta =\mu \sigma _{r+1}={\mathscr {O}}(\sigma _{r+1}), \end{aligned}$$
(3.12)

with failure probability at most \(3e^{-p}\), where \(\mu =\eta /\sigma _{r+1}\). Note that the value of \(\mu\) depends on the choice of q and could be in the order of \({\mathscr {O}}(1)\).

Recall that \(X^T=U\varSigma P^T\) is the economized singular value decomposition of \(X^T\), if we partition \(U=[U_r,~U_-]\), \(\varSigma =\left( \begin{array}{cc} \varSigma _r & \mathbf {0} \\ \mathbf {0} & \varSigma _- \\ \end{array} \right)\), and \(P=[P_r,~P_-]\), where \(U_r\in {\mathbb {R}}^{n\times r}\), \(\varSigma _r\in {\mathbb {R}}^{r\times r}\), and \(P_r\in {\mathbb {R}}^{d\times r}\), then

$$\begin{aligned}(X^T)^{\dag }=P\varSigma ^{\dag }U^T =[P_r,P_-]\left( \begin{array}{cc} \varSigma _r^{-1} & \mathbf {0} \\ \mathbf {0} & \varSigma _-^{-1} \\ \end{array} \right) \left( \begin{array}{c} U_r^T \\ U_-^T \\ \end{array} \right) =P_r\varSigma _r^{-1}U_r^T+P_-\varSigma _-^{-1}U_-^T. \end{aligned}$$

Notice that \(X^T_r=U_r\varSigma _rP_r^T\), we have \((X^T_r)^{\dag }=P_r\varSigma _r^{-1}U_r^T\), moreover,

$$\begin{aligned} \Vert (X_r^T)^{\dag }\Vert _2=\frac{1}{\sigma _r},~~{and}~~\Vert X^T-X_r^T\Vert _2&=\sigma _{r+1}. \end{aligned}$$
(3.13)

By (3.12) and (3.13),

$$\begin{aligned} \Vert X^T_r-{\widetilde{X}}^T\Vert _2\le \Vert X^T-X_r^T\Vert _2+\Vert X^T-{\widetilde{X}}^T \Vert _2\le (1+\mu )\sigma _{r+1}. \end{aligned}$$
(3.14)

(ii) Second, we consider the angle between the subspaces \({{\widetilde{\mathscr{X}}}}={span}\{({\widetilde{X}}^T)^{\dag }{\overline{Y}}\}\) and \({\mathscr {X}}={span}\{(X^T_r)^{\dag }{\overline{Y}}\}\).

Assume that \(({\widetilde{X}}^T)^{\dag }{\overline{Y}}\) and \((X^T_r)^{\dag }{\overline{Y}}\) are of full column rank, then

$$\begin{aligned} {rank}\big (({\widetilde{X}}^T)^{\dag }{\overline{Y}}\big )={rank}\big ((X^T_r)^{\dag }{\overline{Y}}\big ), \end{aligned}$$

and we have from Wedin (1973) and (3.14) that

$$\begin{aligned} \Vert (X^T_r)^{\dag }{\overline{Y}}-({\widetilde{X}}^T)^{\dag }{\overline{Y}}\Vert _2&\le \left( \frac{\Vert (X^T_r)^{\dag }-({\widetilde{X}}^T)^{\dag } \Vert _2}{\Vert (X^T_r)^{\dag }\Vert _2}\right) \Vert (X^T_r)^{\dag }\Vert _2\Vert {\overline{Y}}\Vert _2 \\&\le \sqrt{2}\Vert ({\widetilde{X}}^T)^{\dag }\Vert _2\Vert X^T_r-{\widetilde{X}}^T \Vert _2\Vert (X^T_r)^{\dag }\Vert _2\Vert {\overline{Y}}\Vert _2 \\&\le \sqrt{2}(1+\mu )\Vert ({\widetilde{X}}^T)^{\dag }\Vert _2\frac{\sigma _{r+1}}{\sigma _{r}}, \end{aligned}$$
(3.15)

where we use the fact that \({\overline{Y}}\) is orthonormal. On the other hand,

$$\begin{aligned} \Vert [(X^T_r)^{\dag }{\overline{Y}}]^{\dag }\Vert _2=\frac{1}{\sigma _{\min }((X^T_r)^{\dag } {\overline{Y}})} \le \frac{1}{\sigma _{\min }((X^T_r)^{\dag })}=\sigma _{\max }(X^T_r)=\sigma _{1}. \end{aligned}$$
(3.16)

Then from Sun (1984), (3.15) and (3.16), we have

$$\begin{aligned} \sin \angle ({\mathscr {X}}, {{\widetilde{\mathscr{X}}}})&=\Vert P_{(X^T_r)^{\dag } {\overline{Y}}}-P_{({\widetilde{X}}^T)^{\dag }{\overline{Y}}}\Vert _2 \\&\le \min \{\Vert [(X^T_r)^{\dag }{\overline{Y}}]^{\dag }\Vert _2, \Vert [({\widetilde{X}}^T)^{\dag }{\overline{Y}}]^{\dag }\Vert _2\} \Vert (X^T_r)^{\dag }{\overline{Y}}-({\widetilde{X}}^T)^{\dag }{\overline{Y}}\Vert _2 \\&\le \Vert [(X^T_r)^{\dag }{\overline{Y}}]^{\dag }\Vert _2\Vert (X^T_r)^{\dag }{\overline{Y}} -({\widetilde{X}}^T)^{\dag }{\overline{Y}}\Vert _2 \\&\le \sqrt{2}(1+\mu )\Vert ({\widetilde{X}}^T)^{\dag }\Vert _2(\sigma _{1}/\sigma _{r}) \cdot \sigma _{r+1}, \end{aligned}$$
(3.17)

where \(P_{(X^T_r)^{\dag }{\overline{Y}}}\) and \(P_{({\widetilde{X}}^T)^{\dag }{\overline{Y}}}\) are the orthogonal projectors onto the subspace span\(\{(X^T_r)^{\dag }{\overline{Y}}\}\) and span\(\{({\widetilde{X}}^T)^{\dag }{\overline{Y}}\}\), respectively.

In summary, we have the following theorem for Algorithm 2, in which \({\overline{F}}_Q\) denotes the Q-factor of the economized QR factorization of \((X^T_r)^{\dag }{\overline{Y}}\). It provides a bound on the distance between the subspaces spanned by the “truncated” solution \({\overline{F}}_Q\) and the “approximate” solution \({\widetilde{F}}_Q\):

Theorem 3.3

Let \({\mathscr {X}}={span}\{(X^T_r)^{\dag }{\overline{Y}}\}={span}\{{\overline{F}}_Q\}\) and \({{\widetilde{\mathscr{X}}}}={span}\{({\widetilde{X}}^T)^{\dag }{\overline{Y}}\}={span}\{{\widetilde{F}}_Q\}\). If \((X^T_r)^{\dag }{\overline{Y}}\) and \(({\widetilde{X}}^T)^{\dag }{\overline{Y}}\) are of full column rank, then under the above notations, we have

$$\sin \angle ({\mathscr {X}}, {{\widetilde{\mathscr{X}}}})\le \big (\sqrt{2}(1+\mu )\kappa (X_r)\Vert ({\widetilde{X}}^T)^{\dag }\Vert _2\big )\cdot \sigma _{r+1}$$
(3.18)

with failure probability at most \(3e^{-p}\), where \(p\ge 4\).

Remark 3.3

Recall that \({\widetilde{X}}^T\) is an approximation to \(X_r^T\). As \((X^T_r)^{\dag }{\overline{Y}}\) is of full column rank, the assumption that \({{rank}}(({\widetilde{X}}^T)^{\dag }{\overline{Y}})={{rank}}((X^T_r)^{\dag }{\overline{Y}})\) is not restrictive; moreover, we have \(\Vert ({\widetilde{X}}^T)^{\dag }\Vert _2={\mathscr {O}}(\frac{1}{\sigma _{r}})\). As a result, Theorem 3.3 indicates that the distance between the “truncated” solution space \({{span}}\{(X^T_r)^{\dag }{\overline{Y}}\}\) and the “approximate” solution space \({{span}}\{({\widetilde{X}}^T)^{\dag }{\overline{Y}}\}\) is of the order \({\mathscr {O}}\left( \kappa _2(X_r)\frac{\sigma _{r+1}}{\sigma _r}\right)\).

On the other hand, we point out that the upper bound given in (3.18) may be pessimistic in practice and even much larger than one. Our contribution is to indicate that the difference between the two subspaces \({\mathscr {X}}\) and \({\widetilde{\mathscr{X}}}\) is closely related to the condition number \(\kappa (X_r)\) and the gap \(\frac{\sigma _{r+1}}{\sigma _r}\) between singular values. If the singular values of X decay quickly and \(X_r\) is not too ill-conditioned, the distance between \({\mathscr {X}}\) and \({\widetilde{\mathscr{X}}}\) will be small.

Finally, we briefly interpret why our randomized algorithm works for recognition. Let \(\widehat{\mathbf{a}}_{i}\) be a sample from the training set, and let \(\widehat{\mathbf{b}}_{j}\) be a sample from the testing set. In the widely used nearest neighbour (NN) classifier (Cover and Hart 1967), the class membership is determined by the Euclidean distance as follows:

$$\begin{aligned} d_{ij}=\Vert {\overline{F}}_Q{\overline{F}}_Q^T(\widehat{\mathbf{a}}_{i}-\widehat{\mathbf{b}}_{j})\Vert _2=\Vert {\overline{F}}_Q^T(\widehat{\mathbf{a}}_{i}-\widehat{\mathbf{b}}_{j})\Vert _2. \end{aligned}$$
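For illustration, here is a minimal sketch of this classification rule in the reduced space (function and variable names are ours; y_train is assumed to be a NumPy array of class labels):

```python
import numpy as np

def nn_classify(F_Q, A_train, y_train, B_test):
    """1-NN classification in the reduced space: each test sample b_j gets the
    label of the training sample a_i minimizing ||F_Q^T (a_i - b_j)||_2."""
    A = F_Q.T @ A_train                        # s x n_train projected training set
    B = F_Q.T @ B_test                         # s x n_test  projected test set
    d2 = ((B[:, :, None] - A[:, None, :]) ** 2).sum(axis=0)   # n_test x n_train
    return y_train[np.argmin(d2, axis=1)]
```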

Along the line of (Wu et al. 2017, Theorem 6), we can establish the following relationship between the Euclidean distances obtained from \({\overline{F}}_Q\) and \({\widetilde{F}}_Q\).

Theorem 3.4

Let \({\overline{F}}_Q,{\widetilde{F}}_Q\in {\mathbb {R}}^{d\times s}\) be orthonormal matrices whose columns span the “truncated” solution space \({\mathscr {X}}\) and the “approximate” solution space \(\widetilde{{\mathscr {X}}}\) of (2.5), respectively. Denote by \(d_{ij}=\Vert {\overline{F}}_Q^T(\widehat{\mathbf{a}}_{i}-\widehat{\mathbf{b}}_{j})\Vert _2\) and by \({\widetilde{d}}_{ij}=\Vert {\widetilde{F}}_Q^T(\widehat{\mathbf{a}}_{i}-\widehat{\mathbf{b}}_{j})\Vert _2\) the “truncated” and the “approximate” Euclidean distances, respectively. If \(\Vert \widehat{\mathbf{a}}_i\Vert _2=\Vert \widehat{\mathbf{b}}_j\Vert _2=1\) and \(\cos \angle ({\mathscr {X}}, {\widetilde{\mathscr{X}}})\ne 0\), then

$$\begin{aligned} \frac{{\widetilde{d}}_{ij}-2\sin \angle ({\mathscr {X}}, {\widetilde{\mathscr{X}}})}{\cos \angle ({\mathscr {X}}, {\widetilde{\mathscr{X}}})}\le d_{ij}\le {\widetilde{d}}_{ij}\cos \angle ({\mathscr {X}}, {\widetilde{\mathscr{X}}})+2\sin \angle ({\mathscr {X}}, {\widetilde{\mathscr{X}}}). \end{aligned}$$
(3.19)

Remark 3.4

Theorem 3.4 shows that if \(\sin \angle ({\mathscr {X}}, {\widetilde{\mathscr{X}}})\) is not large, then the distances \(\{d_{ij}\}\) and \(\{{\widetilde{d}}_{ij}\}\) will be close to each other. Consequently, the recognition rates obtained from the “truncated” solution and the “approximate” solution will be about the same; see (Wu et al. 2017) for more details.

3.3 An iterative algorithm for the trace-ratio problem on high-dimension and large-sample sparse data sets

In many high-dimension data processing tasks such as text processing, the data matrix is often large and sparse (Cai et al. 2008; Tavernier et al. 2017; Zhu and Huang 2014). In this situation, applying the randomized SVD to X is unfavorable because it will destroy the sparse structure of the original data. Thus, it is necessary to propose new techniques for solving the trace-ratio problem on large and sparse data sets. It follows from Theorem 3.2 that \({\overline{F}}=(X^T)^\dag {\overline{Y}}\) is in the solution space \({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T)\). As X is of full column rank, we have

$$\begin{aligned} X^T{\overline{F}}=X^T(X^T)^\dag {\overline{Y}}={\overline{Y}}. \end{aligned}$$
(3.20)

Thus, the idea is to solve the underdetermined (consistent) block system (3.20) for \({\overline{F}}\), with no need to form \((X^T)^\dag\) explicitly. This avoids destroying the sparse structure of the original data matrix. We are ready to present the following algorithm for the large sparse trace-ratio problem.

Algorithm 3 An iterative algorithm for the trace-ratio problem on high-dimension and large-sample sparse data sets
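Since the listing is not reproduced here, the following sketch illustrates the idea of the iterative solver for sparse data: each column of (3.20) is solved by LSQR, which returns the minimum-norm solution \((X^T)^{\dag }{\overline{Y}}\) for a consistent system, followed by an economized QR factorization. All names are ours, and the actual Algorithm 3 may differ in details.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def sparse_trace_ratio(X, labels, tol=1e-6, maxiter=1000):
    """Sketch of the sparse solver: solve the consistent underdetermined
    system X^T F = Y_bar column by column with LSQR (minimum-norm solution),
    then orthonormalize by economized QR. X may be a scipy.sparse matrix;
    its sparsity is never destroyed."""
    d, n = X.shape
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k = len(classes)
    Y = np.stack([(labels == c).astype(float) for c in classes], axis=1)
    Yc = Y - np.ones((n, 1)) @ Y.mean(axis=0, keepdims=True)   # (I - 1 1^T/n) Y
    Q, _ = np.linalg.qr(Yc)
    Y_bar = Q[:, :k - 1]
    F = np.zeros((d, k - 1))
    Xt = X.T                                   # n x d, sparse or dense
    for j in range(k - 1):                     # one LSQR solve per right-hand side
        F[:, j] = lsqr(Xt, Y_bar[:, j], atol=tol, btol=tol, iter_lim=maxiter)[0]
    F_Q, _ = np.linalg.qr(F)
    return F_Q
```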

Remark 3.5

Unlike the methods given in Jia et al. (2009), Ngo et al. (2012), Wang et al. (2007), Zhao et al. (2013) for the trace-ratio problem, Algorithm 3 is not an inner-outer iterative method, so it is much simpler than those conventional methods. The main overhead of this algorithm is in Step 1, where we have to solve \(k-1\) least squares problems iteratively. We can use the lsrq package provided by Cai et al. (2008) to solve (3.21), which is a modification to the LSQR algorithm due to Paige and Saunders (1982). Thus, Algorithm 3 can preserve the sparse structure of X, and the main storage requirement is to store the matrix \({\widetilde{F}}_Q\) of size \(d\times (k-1)\).

The following theorem gives an error bound on solving (3.20) iteratively by Algorithm 3.

Theorem 3.5

Let \({\overline{F}}\) be the exact solution to (3.20), and let \({\widetilde{F}}\) be the computed solution of (3.21) satisfying \({\Vert X^T{\widetilde{F}}-{\overline{Y}}\Vert _F}/{\Vert {\overline{Y}}\Vert _F}\le {tol}\), where tol is the convergence tolerance used in Algorithm 3. Denote \({\mathscr {F}}={span}\{{\overline{F}}\}\) and \({\widetilde{\mathscr{F}}}={span}\{{\widetilde{F}}\}\). If \({\widetilde{F}}\) and \({\overline{F}}\) are of full column rank, and the columns of \({\widetilde{F}}-{\overline{F}}\) are not in \({\mathscr {N}}(X^T)\), then

$$\begin{aligned} \sin \angle ({\mathscr {F}}, {\widetilde{\mathscr{F}}})\le \sqrt{k-1}\kappa _2(X)\cdot {tol}. \end{aligned}$$
(3.22)

Proof

Since \(X^T{\overline{F}}={\overline{Y}}\) with \({\overline{Y}}\) being orthonormal, we have

$$\begin{aligned} \Vert X^T({\widetilde{F}}-{\overline{F}})\Vert _F=\Vert X^T{\widetilde{F}} -{\overline{Y}}\Vert _F\le {\Vert {\overline{Y}}\Vert _F}\cdot {tol}\le \sqrt{k-1}\cdot {tol}, \end{aligned}$$

where we used the convergence criterion \(\Vert X^T{\widetilde{F}}-{\overline{Y}}\Vert _F/\Vert {\overline{Y}}\Vert _F\le tol\) in Algorithm 3, as well as the fact that \(\Vert {\overline{Y}}\Vert _F=\sqrt{k-1}\).

If the columns of \({\widetilde{F}}-{\overline{F}}\) are not in \({\mathscr {N}}(X^T)\), then

$$\begin{aligned} \Vert X^T({\widetilde{F}}-{\overline{F}})\Vert _F^2\ge \sigma ^2_{\min }(X^T) \Vert {\widetilde{F}}-{\overline{F}}\Vert _F^2, \end{aligned}$$

where \(\sigma _{\min }(X^T)=\sigma _{\min }(X)\) is the smallest nonzero singular value of \(X^T\). So it follows that

$$\begin{aligned} \Vert {\widetilde{F}}-{\overline{F}}\Vert _2\le \Vert {\widetilde{F}}-{\overline{F}} \Vert _F\le \sqrt{k-1}\cdot \frac{{tol}}{\sigma _{\min }(X)}. \end{aligned}$$
(3.23)

On the other hand, \({\sigma _{\min }({\overline{F}})}={\sigma _{\min }((X^T)^{\dag }{\overline{Y}})} \ge {\sigma _{\min }((X^T)^{\dag })}=1/\sigma _{\max }(X)\), where we used the fact that \({\overline{Y}}\) is orthonormal. Thus,

$$\begin{aligned} \Vert {\overline{F}}^{\dag }\Vert _2=\frac{1}{\sigma _{\min }({\overline{F}})}\le \sigma _{\max }(X). \end{aligned}$$
(3.24)

Assume that \({\widetilde{F}}\) and \({\overline{F}}\) are of full column rank, then we have from (Sun 1984), (3.23) and (3.24) that

$$\begin{aligned} \sin \angle ({\mathscr {F}},{{\widetilde{\mathscr{F}}}})& = \Vert P_{{\widetilde{F}}} -P_{{\overline{F}}}\Vert _2 \\& \le \min \{\Vert {\widetilde{F}}^{\dagger }\Vert _2,\Vert {\overline{F}}^{\dagger } \Vert _2\}\Vert {\widetilde{F}}-{\overline{F}}\Vert _2 \\ & \le \Vert {\overline{F}}^{\dagger }\Vert _2\Vert {\widetilde{F}}-{\overline{F}}\Vert _2 \\& \le \sqrt{k-1}\kappa _2(X)\cdot {tol}, \end{aligned}$$

where \(P_{{\widetilde{F}}}\) and \(P_{{\overline{F}}}\) are the orthogonal projectors onto span\(\{{\widetilde{F}}\}\) and span\(\{{\overline{F}}\}\), respectively.     \(\square\)

Remark 3.6

As the exact solution \({\overline{F}}=(X^T)^{\dag }{\overline{Y}}\) is not in \({\mathscr {N}}(X^T)\) and \({\widetilde{F}}\) is an approximation to \({\overline{F}}\), the assumption that the columns of \({\widetilde{F}}-{\overline{F}}\) are not in \({\mathscr {N}}(X^T)\) is mild. Although the upper bound given in (3.22) can be pessimistic and even much larger than one when k is not small and the matrix X is ill-conditioned, Theorem 3.5 reveals that the distance between the computed solution and the exact solution is closely related to the product \(\kappa _2(X)\cdot {tol}\), where \(\kappa _2(X)\) is the 2-norm condition number of X and tol is a user-prescribed convergence threshold used in Algorithm 3. In other words, the distance will be close to zero provided X is not too ill-conditioned and tol is small.

Spectral regression discriminant analysis (SRDA) (Cai et al. 2008) is a popular method for large and sparse discriminant analysis. This method combines spectral graph analysis and regression to provide an efficient approach for discriminant analysis. The main task of SRDA is to solve a set of regularized least squares problems, and there is no eigenvector computation involved. However, SRDA aims to solve the ratio-trace problem (1.2) rather than the trace-ratio problem (1.1).

Let \({\widehat{X}}=[X^T, \mathbf{1}_n]^T\), it was shown that the solution of SRDA satisfies (Cai et al. 2008, pp.7)

$$\begin{aligned} {\widehat{F}}={\widehat{X}}({\widehat{X}}^T{\widehat{X}}+\alpha I_{n})^{-1}{\overline{Y}},\quad \alpha \rightarrow 0, \end{aligned}$$
(3.25)

where \(\alpha\) is a regularized parameter. The following theorem establishes the relationship between (3.5) and (3.25), i.e., the solution of (2.5) and that of SRDA. It indicates that our method is different from that of SRDA in essence.

Theorem 3.6

Let \({\widehat{F}}\) be defined in (3.25), and denote by \({\overline{F}}=(X^T)^{\dag }{\overline{Y}}\) the exact solution of (2.5). If X has full column rank, we have

$$\begin{aligned}{\widehat{F}}=\left( \begin{array}{c} I_d-\frac{1}{\beta } \mathbf{g} \mathbf{g}^T \\ \frac{1}{\beta } \mathbf{g}^T \\ \end{array} \right) {\overline{F}},\quad \alpha \rightarrow 0, \end{aligned}$$

where \(\beta =1+\mathbf{1}_n^T(X^TX)^{-1} \mathbf{1}_n\), and \(\mathbf{g}=(X^T)^{\dag } \mathbf{1}_n\).

Proof

We notice that

$$\begin{aligned}{\widehat{X}}^T{\widehat{X}}=\left( \begin{array}{cc} X^T & \mathbf{1}_n \\ \end{array} \right) \left( \begin{array}{c} X \\ \mathbf{1}_n^T \end{array}\right) =X^TX+\mathbf{1}_n\mathbf{1}_n^T \end{aligned}$$

is a rank-1 modification to the matrix \(X^TX\). Let \(\gamma =1+\mathbf{1}_n^T(X^TX+\alpha I_n)^{-1} \mathbf{1}_n\) and note that \(X^{\dag }X=I_n\). From the Sherman-Morrison-Woodbury formula (Golub and Van Loan 2013), we obtain

$$\begin{aligned} {\widehat{F}}&={\widehat{X}}({\widehat{X}}^T{\widehat{X}}+\alpha I_{n})^{-1} {\overline{Y}}\\&=\left( \begin{array}{c} X \\ \mathbf{1}_n^T \end{array}\right) (X^TX+\mathbf{1}_n\mathbf{1}_n^T+\alpha I_{n})^{-1}{\overline{Y}} \\&=\left( \begin{array}{c} X \\ \mathbf{1}_n^T \end{array}\right) \left[ (X^TX+\alpha I_{n})^{-1}-\frac{1}{\gamma }(X^TX +\alpha I_{n})^{-1} \mathbf{1}_n\mathbf{1}_n^T(X^TX+\alpha I_{n})^{-1}\right] {\overline{Y}}\\&=\left( \begin{array}{c} X(X^TX+\alpha I_{n})^{-1}{\overline{Y}}-\frac{1}{\gamma }X(X^TX +\alpha I_{n})^{-1} \mathbf{1}_n\mathbf{1}_n^T(X^TX+\alpha I_{n})^{-1}{\overline{Y}} \\ \mathbf{1}_n^T(X^TX+\alpha I_{n})^{-1}{\overline{Y}}-\frac{1}{\gamma } \mathbf{1}_n^T(X^TX+\alpha I_{n})^{-1} \mathbf{1}_n\mathbf{1}_n^T(X^TX+\alpha I_{n})^{-1}{\overline{Y}} \end{array}\right) \\&=\left( \begin{array}{c} X(X^TX+\alpha I_{n})^{-1}{\overline{Y}}-\frac{1}{\gamma }X(X^TX+\alpha I_{n})^{-1} \cdot \mathbf{1}_n\mathbf{1}_n^TX^{\dag }\cdot \big (X(X^TX+\alpha I_{n})^{-1}{\overline{Y}}\big ) \\ \big (1-\frac{1}{\gamma } \mathbf{1}_n^T(X^TX+\alpha I_{n})^{-1}\mathbf{1}_n\big )\cdot \mathbf{1}_n^TX^{\dag }\cdot \big (X(X^TX+\alpha I_{n})^{-1}{\overline{Y}}\big ) \end{array}\right) . \end{aligned}$$

As \(\alpha \rightarrow 0\), we see that \(X(X^TX+\alpha I_{n})^{-1}\rightarrow X(X^TX)^{-1}=(X^T)^{\dag },~X(X^TX+\alpha I_{n})^{-1}{\overline{Y}}\rightarrow {\overline{F}}\) and \(\gamma \rightarrow 1+\mathbf{1}_n^T(X^TX)^{-1} \mathbf{1}_n=\beta\). In conclusion,

$$\begin{aligned} {\widehat{F}} \rightarrow \left( \begin{array}{c} {\overline{F}}-\frac{1}{\beta }(X^T)^{\dag } \mathbf{1}_n\mathbf{1}_n^TX^{\dag } {\overline{F}} \\ \frac{1}{\beta } \mathbf{1}_n^TX^{\dag }{\overline{F}} \\ \end{array} \right) =\left( \begin{array}{c} I_d-\frac{1}{\beta }(X^T)^{\dag } \mathbf{1}_n\mathbf{1}_n^TX^{\dag } \\ \frac{1}{\beta } \mathbf{1}_n^TX^{\dag } \\ \end{array} \right) {\overline{F}},\quad \alpha \rightarrow 0. \end{aligned}$$

So we complete the proof. \(\square\)
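The limiting identity of Theorem 3.6 can be verified numerically on a small example; the sketch below (toy sizes, a random orthonormal \({\overline{Y}}\) and all names are ours) compares (3.25) with a tiny \(\alpha\) against the block formula.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, m = 40, 15, 4                            # toy sizes with d >= n (ours)
X = rng.standard_normal((d, n))
Y_bar, _ = np.linalg.qr(rng.standard_normal((n, m)))    # any orthonormal n x m block

ones = np.ones((n, 1))
alpha = 1e-10
X_hat = np.vstack([X, ones.T])                           # hat{X} = [X; 1_n^T]
F_hat = X_hat @ np.linalg.solve(X_hat.T @ X_hat + alpha * np.eye(n), Y_bar)  # (3.25)

F_bar = np.linalg.pinv(X.T) @ Y_bar                      # bar{F} = (X^T)^+ Y_bar
g = np.linalg.pinv(X.T) @ ones                           # g = (X^T)^+ 1_n
beta = 1.0 + (ones.T @ np.linalg.solve(X.T @ X, ones)).item()
F_blk = np.vstack([F_bar - g @ (g.T @ F_bar) / beta, (g.T @ F_bar) / beta])
print(np.allclose(F_hat, F_blk, atol=1e-6))              # True as alpha -> 0
```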

In Zhang et al. (2017), an incremental regularized least squares (LDADL) method was proposed by Zhang et al. With the help of Theorem 3.6 and (Zhang et al. 2017, Lemma 4.3), we can also establish the relationship between our proposed method and LDADL. We point out that Algorithm 3 is different from SRDA and LDADL in essence. First, the three methods Algorithm 3, LDADL and SRDA are designed for different problems. More precisely, Algorithm 3 is for the trace-ratio problem (1.1) while SRDA and LDADL are for the ratio-trace problem (1.2). Second, the left-hand sides of the least squares problems involved in LDADL and SRDA are the same, while the right-hand sides are different. Indeed, the right-hand sides involved in the former have one more column than those of the latter; refer to Zhang et al. (2017). As a comparison, the right-hand sides involved in our method and SRDA are the same, while the left-hand sides are different. More precisely, the left-hand sides involved in SRDA have one more row than those of the proposed method. Third, Algorithm 3 is parameter-free while both SRDA and LDADL involve regularization parameters. It is well known that the optimal parameter is difficult to choose in practice if there is no other information available in advance (Gui et al. 2014), so our method is preferable.

4 Numerical experiments

In this section, we perform numerical experiments on real-world high-dimension and large-sample data dimensionality reduction problems. In Sect. 4.1, we describe the databases and the benchmark algorithms used in the experiments. In Sect. 4.2, we compare our proposed algorithms with some state-of-the-art algorithms to show the merits of the new algorithms.

4.1 Datasets, benchmark algorithms and experiment settings

Seven real-world databases, covering video data, text documents, face images and audio data, are used in our experiments. The details of these databases, such as the dimensionality, the number of total samples, the background and the data type, are summarized in Table 2.

  • The Color FERET dataset was collected in 15 sessions between August 1993 and July 1996. The database contains 1564 sets of images for a total of 14126 images, covering 1199 individuals and 365 duplicate sets of images ranging from frontal to left and right profiles. In our experiment, a part of the FERET program containing \(N=3528\) images of \(k=269\) different people was utilized. We crop and scale the images to \(192\times 128\) pixels, and set the reducing dimension \(s=200\).

  • The AR database contains over 4000 color images corresponding to 126 people’s faces (70 men and 56 women). The images feature frontal views of these faces with different facial expressions, illumination conditions, and occlusions (e.g., sunglasses and scarf). The pictures were taken at the CVC under strictly controlled conditions. No restrictions on wear (clothes, glasses, etc.), make-up or hair style were imposed on participants. Each person participated in two sessions, separated by two weeks. The same pictures were taken in both sessions. A subset of \(k=100\) people with 26 images per person, i.e., \(N=2600\) images, is used in our experiment. We crop and scale the images to \(120 \times 165\) pixels, and the reducing dimension s is set to be 99.

  • The Extended YaleB data set contains 5760 single-light-source images of 10 subjects, each seen under 576 viewing conditions (9 different poses and 64 illumination conditions per person). The images have normal, sleepy, sad and surprised expressions. A subset of \(k=38\) persons with 64 images per person, i.e., \(N=2432\) images, is used in the example. We crop and scale the images to \(100 \times 100\) pixels in our experiment, and set the reduced dimension \(s=k-1=37\).

  • The CAS-PEAL face database (Gao et al. 2008) was constructed under the sponsorship of the National Hi-Tech Program and ISVISION, by the Face Recognition Group of JDL, ICT, CAS. It contains 99594 images of 1040 individuals (595 males and 445 females) with varying pose, expression, accessory, and lighting (PEAL). For each subject, 9 cameras spaced equally on a horizontal semicircular shelf are set up to simultaneously capture images across different poses in one shot. Each subject is also asked to look up and down to capture 18 images in another two shots. We also considered 5 kinds of expressions, 6 kinds of accessories (3 glasses and 3 caps), and 15 lighting directions. This face database is now partly made available. A subset named CAS-PEAL-R1 contains \(N=21840\) images of 1040 subjects, and for each subject 21 different poses without any other variations are included for research purposes. We crop and scale the images to \(360\times 480\) pixels, and set the reduced dimension \(s=400\).

  • The Youtube dataset (Wolf et al. 2011) is a large-scale video classification database. It contains 80 million YouTube video links, which are labeled with 4800 knowledge graph entities. Here we use a collection of 5020 videos with labels of 1595 persons. All images have been downloaded from YouTube. In our experiments we have employed up to 90 samples of each class, resulting in a dataset of \(N=124819\) feature samples and \(k=1595\) classes.

  • The Reuters dataset contains 21578 documents in 135 categories. The documents with multiple category labels are discarded, which leaves us with \(N=8213\) documents in \(k=65\) categories. After preprocessing, this corpus contains \(d=18933\) distinct terms.

  • The original TDT2 (Nist Topic Detection and Tracking) corpus was collected during the first half of 1998 and taken from 6 sources, including 2 newswires (APW, NYT), 2 radio programs (VOA, PRI) and 2 television programs (CNN, ABC). It consists of 11201 on-topic documents which are classified into 96 semantic categories. In this subset, those documents appearing in two or more categories were removed, and only the largest \(k=30\) categories were kept, thus leaving us with \(N=9394\) documents in total.

Table 2 Details of the databases: dimensionality (d), the number of total samples (N), the background and data type (sparse or dense)

As the PCA+Newton-Lanczos method (PCA+NL) (Ngo et al. 2012) is a popular algorithm for the trace-ratio problem, and SRDA is a state-of-the-art algorithm for large-scale discriminant analysis (Cai et al. 2008), we take them and the proposed Algorithm 2 as the benchmark algorithms in all the experiments. The details of the PCA+Newton Lanczos method and SRDA are listed as follows:

  • PCA+NL (Ngo et al. 2012): The PCA+Newton-Lanczos method, in which the PCA stage is performed before running the Newton-Lanczos method. Here we preserve 99% energy in the PCA stage. We find that infinite trace-ratio values may occur in this algorithm during the outer iterations. In order to deal with this problem, let \(V^{(j)},~V^{(j+1)}\) be the orthonormalized eigenvectors obtained from the j-th and the \((j+1)\)-th iteration, respectively; we use the sine of the angle between the subspaces span\(\{V^{(j)}\}\) and span\(\{V^{(j+1)}\}\) as the stopping criterion in this algorithm. As was suggested in Ngo et al. (2012), we use the MATLAB built-in function eigs.m for the eigenvalue problems involved, with the stopping criterion \(tol=10^{-4}\).

  • SRDA (Cai et al. 2008): The spectral regression discriminant analysis method based on solving regularized least squares problems. This method is one of the state-of-the-art algorithms for large-scale and sparse discriminant analysis problems, and it is required to solve s dense least squares problems of size \((d+1)\)-by-n. The lsrq package provided by Cai et al. is used to solve the regularized least squares problems involved, and the algorithm is stopped as soon as the residual norm is below \(tol=10^{-6}\) or the number of iterations reaches 20. The regularized parameter is chosen as \(\alpha =10^{-2}\).

  • In Algorithm 2, we set \(q=2\) and the over-sampling parameter \(p=20\). The target rank r is chosen in the following way: we set \(r=100\) if the number of training samples n is less than 1000, and set \(r=200\) if the number of training samples is in the interval (1000, 3500]; otherwise, we set \(r=400\).

All the experiments are run on an HP workstation with two 16-core Intel(R) Xeon(R) Platinum 8253 processors, with a CPU clock speed of 2.20 GHz and 256 GB of RAM. The operating system is 64-bit Windows 10. All the numerical results are obtained from running the MATLAB R2018b software.

In the examples, we randomly pick some, say, \(n=60\%N,~70\%N\) and \(80\%N\), samples as the training set, and the remaining samples are used as the testing set, where N is the total number of samples. More precisely, we rearrange the original N samples by using the MATLAB command randperm(N), which returns a row vector containing a random permutation of the integers from 1 to N, and then we select the samples corresponding to the first n elements of the row vector as the training set. We make use of the nearest neighbor (NN) classifier (Cover and Hart 1967) for classification in the experiments. Each experiment is repeated 10 times, and all the numerical results, i.e., the CPU time in seconds, the recognition rate and the standard deviation (Std-Dev), are the means over the 10 runs.
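For reference, the evaluation protocol just described can be sketched as follows (the solver and classify arguments stand for any of the reduction and classification sketches above; all names are ours):

```python
import numpy as np

def random_split_evaluation(X, labels, solver, classify,
                            train_frac=0.7, repeats=10, seed=0):
    """Sketch of the protocol: random train/test split (analogue of MATLAB's
    randperm), dimensionality reduction with `solver`, classification with
    `classify` (e.g. the 1-NN sketch above), averaged over repetitions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    N = X.shape[1]
    rates = []
    for _ in range(repeats):
        perm = rng.permutation(N)
        n_train = int(train_frac * N)
        tr, te = perm[:n_train], perm[n_train:]
        F_Q = solver(X[:, tr], labels[tr])
        pred = classify(F_Q, X[:, tr], labels[tr], X[:, te])
        rates.append(np.mean(pred == labels[te]))
    return float(np.mean(rates)), float(np.std(rates))
```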

Table 3 Example 1: numerical results on the Color FERET database, \(d=24576,~N=3528\), \(s=200\)

4.2 Numerical experiments

In this subsection, we perform numerical experiments to show the superiority of our randomized method Algorithm 2 and the iterative method Algorithm 3 over some state-of-the-art algorithms for high-dimension and large-sample data dimensionality reduction problems. As Algorithm 3 is proposed for solving the high-dimension and large-sample sparse trace-ratio problem, we only run it on large sparse data sets.

Example 1

In this example, we compare Algorithm 2 with some state-of-the-art algorithms for the large-scale trace-ratio problem. The test set is the Color FERET database. Besides the three benchmark algorithms Algorithm 2, PCA+NL and SRDA, we also run the following four state-of-the-art trace-ratio algorithms for this problem:

  • PCA+DNM (Jia et al. 2009): The trace-ratio linear discriminant method presented by Jia et al., whose MATLAB code is available from http://www.escience.cn/people/fpnie/papers.html. Here we preserve 99% energy in the PCA stage.

  • PCA+iITR (Zhao et al. 2013): A trace-ratio linear discriminant method proposed by Zhao et al., in which we preserve 99% energy in the PCA stage.

  • FastITR (Zhang et al. 2010): An algorithm advocated by Zhang et al. for the trace-ratio problem. The MATLAB built-in function eigs.m is used for computing the eigenvalue problems involved. In this method, a regularization parameter is required to ensure the positive definiteness of the total scatter matrix; we set it to \(10^{-2}\) in the experiments.

  • WangITR (Wang et al. 2007): An algorithm proposed by Wang et al. for the trace-ratio problem. As this algorithm solves a trace-difference problem via an eigenvalue decomposition in each outer iteration (Wang et al. 2007, Algorithm 1), we use the MATLAB built-in function eig.m for the eigenvalue problems involved. The numerical results are listed in Table 3.

We see from Table 3 that Algorithm 2 outperforms all the compared algorithms both in terms of CPU time and recognition accuracy. More precisely, it not only runs faster than the two benchmark algorithms PCA+NL and SRDA, but also beats all the other algorithms for solving the trace-ratio problem. For example, Algorithm 2 is about 90 times faster than SRDA and about 30 times faster than PCA+NL. Moreover, one finds that WangITR fails to converge within 1000 seconds for this problem. This is because it has to solve a trace-difference problem by an eigenvalue decomposition in each outer iteration, whose cost is prohibitively large for large-scale problems.

On the other hand, it is observed that the recognition rates of Algorithm 2 are much higher than those of the other algorithms. In fact, the approximation \({\widetilde{X}}\) obtained from the randomized algorithm is a low-rank approximation to the training data X, and this process has a denoising effect. Although the PCA step can also filter noise to some extent, preserving 99% energy in this stage may not denoise effectively for this problem.

Example 2

As was discussed in Sect. 2, Algorithm 2 is closely related to the null space method. In this example, we compare Algorithm 2 with some null space methods including NLDAfast (Lu and Wang 2012), NLDAS (Huang et al. 2002) and NLDA (Chen et al. 2000). The test set is the AR database. Table 4 lists the numerical results.

Table 4 Example 2: Numerical results on the AR database, \(d=19800,~N=2600\), \(s=99\)

Again, we observe from Table 4 that Algorithm 2 performs much better than the other algorithms. For example, it is about eight times faster than NLDAS, NLDAfast and PCA+NL, and about 50 times faster than NLDA and SRDA. On the other hand, the recognition rates and the standard deviations obtained from Algorithm 2 and the null space algorithms are about the same. This is because all of them seek solutions in the subspace \({\mathscr {N}}(S_W)\setminus {\mathscr {N}}(S_T)\), see Sect. 3.1, while our proposed algorithm runs much faster than the null space methods. Moreover, we notice that the recognition rates of these algorithms are a little lower than those of SRDA, and a little higher than those of PCA+NL. Indeed, which algorithm attains the best recognition rate is often problem-dependent.

Example 3

In this example, we compare Algorithm 2 with two randomized algorithms, FastLDA and RFDA/RP, for large-scale discriminant analysis. The test set is the Extended YaleB database.

  • FastLDA (Gado et al. 2016): A fast LDA algorithm that uses a random-projection-based feature extraction step to reduce the dimensionality, and then performs LDA in the reduced space. In essence, this algorithm can be understood as an RSVD+LDA method. In this algorithm, the regularization parameter is set to \(10^{-2}\), and the size of the Gaussian sampling matrix is chosen as \(d\times 160\), where d is the dimension of the data; a sketch of this projection stage is given after this list.

  • RFDA/RP (Ye et al. 2017): A fast LDA algorithm based on random projection. In this algorithm, the random projection matrix is determined by Theorem 3 of Ye et al. (2017), and its number of columns is chosen as \(\frac{{rank}(XL_T)-\log (1/\alpha )}{\epsilon ^{-2}}\), where \(L_T=I_n-\mathbf{1}_{n} \mathbf{1}_{n}^T/n\), \(\alpha =0.1\) is the regularization parameter, and \(\epsilon =10^{-2}\) is related to the desired failure probability. Table 5 lists the numerical results of the five algorithms.
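To make the random-projection stage of FastLDA concrete, the following lines sketch a Gaussian sketching step under our own assumptions: the scaling by \(1/\sqrt{\ell }\) and the variable names G, X and Xrp are illustrative and not taken from the code of Gado et al. (2016).

```matlab
% Gaussian random projection of the d-by-n training data X to ell dimensions,
% as used in the first stage of a FastLDA-type method (ell = 160 mirrors the
% d-by-160 sampling matrix mentioned above).
ell = 160;
G   = randn(d, ell) / sqrt(ell);  % Gaussian sampling matrix
Xrp = G' * X;                     % ell-by-n randomly projected data
% (regularized) LDA is then carried out on Xrp in place of X
```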

Table 5 Example 3: Numerical results on the Extended YaleB database, \(d=10000,~N=2432\), \(s=37\)

We see from Table 5 that the three randomized algorithms FastLDA, RFDA/RP, and Algorithm 2 run much faster than PCA+NL and SRDA, while our proposed randomized algorithm is about two times faster than FastLDA and five times faster than RFDA/RP. On the other hand, for this problem, the recognition rates of the three randomized algorithms are (a little) lower than those of SRDA and PCA+NL, while those of our proposed algorithm are the highest among the three randomized algorithms. Specifically, we find that the recognition rates of RFDA/RP are (much) lower and the standard deviations are (much) higher than those of the others, especially when n is relatively small. In fact, the success of RFDA/RP relies heavily on the random projection matrix used. However, this projection matrix depends on two parameters: one avoids the singularity problem, and the other determines the success probability of the low-rank approximation. How to choose optimal values for these two parameters is a difficult problem.

Example 4

In this example, we show the superiority of Algorithm 2 over many state-of-the-art algorithms for high-dimension and large-sample dense data dimensionality reduction problems. The test sets are two high-dimension and large-sample dense databases, CAS-PEAL and YouTube. As the total number of samples in the YouTube database is very large, we randomly pick \(n=40\%N,~50\%N\) and \(60\%N\) samples as the training set, and the remaining samples are used as the testing set, where N is the total number of samples. We compare Algorithm 2 with the above trace-ratio algorithms, two popular large-scale discriminant analysis methods SRDA (Cai et al. 2008) and LDADL (Zhang et al. 2017), two randomized algorithms FastLDA (Gado et al. 2016) and RFDA/RP (Ye et al. 2017), as well as three null space methods NLDAfast (Lu and Wang 2012), NLDAS (Huang et al. 2002) and NLDA (Chen et al. 2000).

In the experiments, we set the size of the Gaussian sampling matrix in FastLDA to \(d\times 400\). For the PCA plus algorithms, we preserve 99% and 95% energy in the PCA stage, respectively. In LDADL, we exploit the lsqr package provided by Cai et al. (2008) to solve the regularized least squares problems, and the algorithm is stopped as soon as the residual norm is below \(tol=10^{-6}\) or the number of iterations reaches 20; a sketch of this type of solve is given below. The numerical results are listed in Tables 6 and 7.
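For illustration, a regularized least squares problem of the kind solved inside SRDA and LDADL, \(\min _{\mathbf{x}}\Vert A\mathbf{x}-\mathbf{b}\Vert _2^2+\alpha \Vert \mathbf{x}\Vert _2^2\), can be handled with MATLAB's built-in lsqr through the standard augmented formulation. This is a generic sketch under our assumptions: A, b and \(\alpha \) are placeholders for the quantities arising in those methods, and the snippet is not the package of Cai et al. (2008).

```matlab
% Solve min_x ||A*x - b||^2 + alpha*||x||^2 via LSQR on the augmented system
% [A; sqrt(alpha)*I] * x = [b; 0], with the same tolerance and iteration
% limit as used in the experiments.
alpha  = 1e-2;                         % regularization parameter
[~, m] = size(A);                      % number of unknowns
Aaug = [A; sqrt(alpha) * speye(m)];
baug = [b; zeros(m, 1)];
x = lsqr(Aaug, baug, 1e-6, 20);        % tol = 1e-6, at most 20 iterations
```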

Table 6 Example 4: Numerical results on the CAS-PEAL database, \(d=172800,~N=21840\), \(s=400\)
Table 7 Example 4: Numerical results on the YouTube database, \(d=102400,~N=124819\), \(s=400\)

We observe from Table 6 that for the CAS-PEAL database, most of the algorithms fail to converge within 1000 seconds or suffer from heavy storage requirements. Some trace-ratio algorithms such as PCA+DNM and PCA+iITR do not work at all if we preserve 99% energy in the PCA process, as the recognition rates are only about 4%. However, if we preserve 95% energy in the PCA process, the recognition rates rise to about 50%. The reason is that the PCA plus methods may be unstable for dimensionality reduction (Shi et al. XXX). Furthermore, Algorithm 2 is about 10 times faster than PCA+DNM and PCA+iITR.

It is seen that the recognition rates of FastLDA are the highest for the CAS-PEAL database. For this problem, the recognition rates of Algorithm 2 are about 20% higher than those of RFDA/RP, but about 10% lower than those of FastLDA. On the other hand, Algorithm 2 is about 2-3 times faster than FastLDA, and about ten times faster than RFDA/RP.

Table 7 demonstrates the advantage of randomized algorithms for high-dimension and large-sample dense data sets. More precisely, except for Algorithm 2 and FastLDA, all the algorithms fail to converge within 1000 seconds or run out of memory. Algorithm 2 is about three times faster than FastLDA; however, its recognition rates are about 6% lower than those of FastLDA. This is similar to the numerical results given in Table 6 for the CAS-PEAL database. Therefore, for high-dimension and large-sample dense data sets, Algorithm 2 is a good choice when speed is more important, and it is a competitive candidate for such dimensionality reduction problems.

Example 5

In this example, we show the efficiency of Algorithm 2 and Algorithm 3 for high-dimension and large-sample sparse data dimensionality reduction problems. To this end, we compare Algorithm 2 and Algorithm 3 with all the algorithms considered above. The test sets are two high-dimension and large-sample sparse databases, TDT2 and Reuters. In the experiments, we set the size of the Gaussian sampling matrix in FastLDA to \(d\times 400\). For the PCA plus algorithms, we preserve 99% energy in the PCA stage. In LDADL, we exploit the lsqr package provided by Cai et al. (2008) to solve the regularized least squares problems, and the algorithm is stopped as soon as the residual norm is below \(tol=10^{-6}\) or the number of iterations reaches 20. The numerical results are listed in Tables 8 and 9.

Some comments are in order. First, as both the row and column dimensions of the training set are very large, the conventional trace-ratio algorithms PCA+NL, PCA+DNM, PCA+iITR, FastITR and WangITR are very slow or even fail to converge on these two data sets. It is clear that Algorithm 2 and Algorithm 3 run much faster than these trace-ratio algorithms, which illustrates the superiority of Algorithm 2 and Algorithm 3 for high-dimension and large-sample trace-ratio problems.

Table 8 Example 5: Numerical results on the TDT2 database, \(d=36771,~N=9394\), \(s=29\)
Table 9 Example 5: Numerical results on the Reuters database, \(d=18933,~N=8213\), \(s=64\)

Second, Algorithm 2 and Algorithm 3 run much faster than the three null space methods and the two randomized algorithms; moreover, the recognition rates of the two proposed algorithms are much higher than those of these five algorithms. Third, the numerical performance of the two proposed algorithms is comparable to that of SRDA and LDADL for large sparse data sets. Algorithm 2 runs a little slower than Algorithm 3, SRDA and LDADL, because the sparse structure of the data samples is not preserved by randomized algorithms; nevertheless, the CPU time and recognition rates of Algorithm 2 and Algorithm 3 remain comparable to those of SRDA and LDADL. As a result, Algorithm 2 is more suitable for high-dimension and large-sample dense data, while Algorithm 3 is a competitive candidate for high-dimension and large-sample sparse data.

5 Concluding remarks

The trace-ratio problem is crucial in high-dimensionality reduction and machine learning. However, it has long been believed that this problem has no known explicit solution in general. A goal of this paper is to provide alternative and ideally faster algorithms for this problem. We show that the trace-ratio problem does admit a closed-form solution when the dimension of the data points d is greater than or equal to the number of training samples n.

To the best of our knowledge, most of the algorithms for the trace-ratio problem are based on inner-outer iterations, which are complicated and may even be prohibitively expensive for high-dimension and large-sample data sets. Therefore, efficient algorithms for high-dimension and large-sample trace-ratio problems are still lacking, especially for dense data sets.

In this work, we pay special attention to the high-dimension and large-sample trace-ratio problem, and propose two methods to solve it that do not rely on inner-outer iterations. With the help of the closed-form solution and the randomized singular value decomposition, we first propose a randomized algorithm (i.e., Algorithm 2) for dense data sets. Theoretical results are given to show how to choose the target rank in the randomized SVD, and why an inexact solution still works for recognition.

For the high-dimension and large-sample sparse trace-ratio problem, we propose an iterative algorithm (i.e., Algorithm 3) based on the closed-form solution. Similar to SRDA and LDADL, it does not destroy the sparse structure of the original data matrix. An advantage of the new algorithm over SRDA and LDADL is that it is parameter-free, and one only needs to solve some (consistent) under-determined linear systems rather than regularized least-squares problems. The difference and the theoretical relation between SRDA and our new method are given, and the distance between the computed solution and the exact solution is established.
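As a purely illustrative aside, a consistent under-determined system \(A\mathbf{x}=\mathbf{b}\) of the kind Algorithm 3 requires can be solved for its minimum-norm solution in MATLAB as follows; A and b are placeholders here, since the actual systems arise from the closed-form solution and are not reproduced in this sketch.

```matlab
% Minimum-norm solution of a consistent under-determined system A*x = b,
% where A has more columns than rows (illustrative placeholders only).
x = lsqminnorm(A, b);         % direct minimum-norm least squares solution
% an iterative alternative that preserves sparsity of A:
% x = lsqr(A, b, 1e-6, 200);  % starting from zero, LSQR converges to the
%                             % minimum-norm solution for consistent systems
```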

Numerical experiments illustrate the behavior of our proposed algorithms and demonstrate the effectiveness of the theoretical results. They show that Algorithm 2 is superior to many state-of-the-art algorithms for dimensionality reduction on high-dimension and large-sample dense data sets, while Algorithm 3 is more suitable for high-dimension and large-sample sparse data sets. Overall, Algorithm 2 is the best among all the compared algorithms, especially when both d and n are very large and even of the same order.