GrassmannOptim : An R Package for Grassmann Manifold Optimization

The optimization of a real-valued objective function f ( U ), where U is a p × d, p > d, semi-orthogonal matrix such that U (cid:62) U = I d , and f is invariant under right orthogonal transformation of U , is often referred to as a Grassmann manifold optimization. Manifold optimization appears in a wide variety of computational problems in the applied sciences. In this article, we present GrassmannOptim , an R package for Grassmann manifold optimization. The implementation uses gradient-based algorithms and embeds a stochastic gradient method for global search. We describe the algorithms, provide some illustrative examples on the relevance of manifold optimization and ﬁnally, show some practical usages of the package.


Introduction
This article presents GrassmannOptim, an R (R Development Core Team 2012) package for orthogonally constrained optimization. Let U be a p × d, d < p, semi-orthogonal matrix, U U = I d . The package aims to maximize a real-valued function f (U) that is invariant under right orthogonal transformations of U: for any d × d orthogonal matrix O, f (U) = f (UO). This invariance of f means that the argument of f can be any representative orthonormal basis of the subspace spanned by the d columns of U. Thus, the optimization of f is performed over the set of all d-dimensional linear subspaces of R p . This set, denoted here by G (d,p) , is called a Grassmann manifold or Grassmannian.
The set of all one-dimensional subspaces of R 2 , G (1,2) , is simply the set of all lines passing through the origin, the set of all two-dimensional subspaces of R 3 , G (2,3) can be thought of as the set of all planes, and the collection of all d-dimensional subspaces of R p , G (d,p) , can be similarly viewed as the set of all d-dimensional hyperplanes. A single subspace in G (d,p) can be described uniquely by d(p − d) real angles, one angle being required to pick a line in R 2 and two angles being required to uniquely specify a plane in R 3 . The function f (U) could conceivably be parameterized and maximized in terms of these d(p−d) angles. However, using the geometry of a Grassmann manifold allows one to move continuously and smoothly through the subspaces and to develop efficient algorithms based on derivatives on the manifold.
For example, consider approximating a collection of n centered p dimensional vectors X 1 , . . .,X n , n i=1 X i = 0, with a corresponding collection of vectors Z i that lie in a d-dimensional subspace of R p . The approximating vectors can be represented in the form Z i = UC i , where U is a semi-orthogonal basis matrix for the d-dimensional subspace and C i holds the coordinates of Z i in terms of U. Using the Euclidean norm we wish to minimize n i=1 X i − UC i 2 . This norm is minimized over C i with U fixed by the value C i = B X i and thus we need to minimize n i=1 X i − UU X i 2 ; equivalently, we need to maximize f (U) = Tr(UU Σ), where Σ is the sample covariance matrix of the X i 's. This f (U) is invariant under right orthogonal transformations of U and so we have reduced the original problem to maximizing over a Grassmann manifold. In this case it is known that f (U) is maximized by choosing the column of U to be the first d eigenvectors of Σ. However, in more complicated problems an analytic solution will not be available and numerical optimization becomes necessary.
Many different manifolds are found in the literature. While several are unnamed, the often encountered are oblique manifold, Riemannian manifold, Stiefel manifold and Grassmann manifold. Defining algorithms on manifolds requires endowing these manifolds with a differentiable structure so that the computation of a gradient or the evaluation of an objective function is meaningful. An element of a given Grassmann manifold is a subspace. The differential geometry of Grassmann manifolds has been well studied (Wong 1967;Edelman, Arias, and Smith 1998;Chikuse 2003;Absil, Mahony, and Sepulchre 2004), and computations are often carried out using matrices to represent corresponding subspaces. A subspace can be specified non-uniquely by a basis. It also can be specified uniquely by a projection matrix.
In this article, we adopt the representation by a basis of the subspace.
Steepest ascent is the closest numerical analogy to the Grassmann manifold algorithm described in this article. Our algorithm is an iterative procedure that, given a point S U , computes an ascent direction where an increase in f is found. However, this procedure on a Grassmann manifold is not as straightforward as a steepest ascent method could be in a Euclidean space.
En route to presenting the algorithm, a few terms need to be defined. Let V be a p × (p − d) semi-orthogonal matrix that is a completion of U, such that Q = (U, V) is a p × p orthogonal matrix. Let B = (∇f (U)) V, the d × (p − d) matrix that is the directional derivative of f , which is the rate of change of f (U) in the direction of V. And finally, let A be a p × p skew-symmetric matrix define as In a nutshell, the algorithm works by rotating the starting basis Q to a new basis by right multiplication by an orthogonal matrix. A single step of the algorithm for a step size δ ∈ (0, 1) involves the update Q t+1 = Q t exp{δA}. (1) The matrix A depends on the directional derivative B that changes at each iteration, and A is updated during each iteration until a stopping criterion is met.
For S U to be the a maximizer of the objective function f , a necessary condition is that for any tangent vector at S U , the directional derivative of f in the direction of that vector should be zero. Practically, that means that B , the norm of B should be sufficiently small. To summarize, the algorithm is the following: 1. Provide an initial matrix p × p matrix Q 0 at t = 0 and then 2. Do until B < ε a. Compute the directional derivative B and form the matrix 3. S U is the subspace spanned by the first d columns of Q at the last iteration.
Numerically, the choice of δ is not trivial. In our implementation, a discrete optimization of f over the range of δ is used. Specifically at each iteration, a sequence of values of δ ∈ (0, 1) is used to generate candidates Q t+1 . The candidate that yields the largest f (U t+1 ) greater than f (U t ) is used.
This gradient method requires the directional derivative B = (∇f (U)) V of the objective function. In some applications, an explicit analytical expression for B can be obtained. As an example, consider the objective function f (U) = Tr{U WU} where Tr{S} represents the trace of the square matrix S. Differentiating f with respect to the elements of U yields (W +W )U. The analytical expression of the directional derivative is then U (W +W )V. In many other cases an approximation of the directional derivative can be computed. Liu et al. (2004) provided a method to approximate B = (α ij ) at any S U ∈ G (d,p) where for a small value of > 0 and U = Q exp{ E ij }J. The term E ij is a p × p skew-symmetric matrix with all entries being 0 except 1 in position (i, j) and −1 in position (j, i) and J is the p × d matrix of the first d columns of the p × p identity matrix.

Stochastic gradient
The basic gradient algorithm has the potential drawback to yield a local optimum at convergence when the starting value is not properly chosen. To help avoid this possibility and to reach a global optimum, we adapted the Markov chain-type simulated annealing process proposed by Gallivan et al. (2003).
Let U 0 be a representative of a starting subspace S U 0 ∈ G (d,p) . Set the iteration index t = 0 and choose an initial "temperature" T 0 .
2. Decrease the temperature T t to T t+1 , with T t+1 = T t /τ where τ > 1 is the cooling ratio.
3. Set t = t + 1 and go to Step 1 if T t+1 is greater than a threshold. Stop otherwise.
Simulating annealing can be quite time-consuming. The user provides the maximum number of iterations m, the cooling ratio τ specified in this algorithm and the initial temperature T 0 . Ideally, m should be large enough to allow the Markov chain at each temperature to be approximately in its stationary distribution before cooling. The cooling schedule should be slow. Various cooling schedules are suggested in the literature. Nourani and Andresen (1998) studied the performance of linear, logarithmic and exponential cooling schedules. Givens and Hoeting (2005, Section 3.4.1.2) also mentioned various cooling schedules. In general, a good cooling strategy should decrease the temperature rapidly at first and slowly over time. But in the above algorithm, a simple linear cooling schedule is adopted. Here, we suggest setting the cooling ratio τ close to 1. The value of the initial temperature T 0 is problemdependent, but it must be chosen properly. A useful strategy is to choose T 0 > 0 so that exp{[f (Y 1 ) − f (Y 2 )]/T 0 } is close to 1 for any pair of candidates matrices Y 1 and Y 2 .

Relevance
Computing eigenspaces is a task that occurs in many research areas, including signal processing (Comon and Golub 1990), data mining (Berry, Drmac, and Jessup 1999), and control theory (Patel, Laub, and Van Dooren 1994). The overarching goal is to reduce the complexity of a problem by focusing on a subset of relevant quantities that are dependent on eigenspaces.
In statistics, a small number of principal components may be computed to capture the important variability in the data. Many eigenvalue problems are optimization problems on manifolds (Absil et al. 2004;Edelman et al. 1998). In this section, we present three problems that can be phrased as manifold optimization problems. The first is related to linear discriminant analysis, the second is about independent components analysis and the third describes Cook's principal fitted components models.

Discriminant analysis
In a typical classification problem, let us suppose there are K classes that need to be separated in p-dimensional space. When p is large, the information relevant to the separation of the classes may be largely contained in a few directions α 1 , ..., α d ∈ R p where d < p. One goal of discriminant analysis is to identify these directions. Fisher's linear discriminant analysis (LDA) problem amounts to maximizing the function over α ∈ R p , where B is the between-class covariance matrix and W is the within-class covariance matrix. Equivalently, we can find arg max α α Bα subject to α Wα = 1. To estimate the d directions, standard practice in multivariate statistics is to use sequential maximization of the objective function: Supposing that {α 1 , ..., α m−1 } are the first m − 1 discriminant directions that maximize Equation 3, the next is α m = arg max α f (α) subject to α Wα j = 0, j = 1, ..., m − 1. In fact, the sequential optimization of the function f (α), identified as a Rayleigh quotient, is a generalized eigenvalue problem that can be solved simultaneously on a manifold. To see this, let U = (α 1 , . . . , α d ) and consider the generalized Rayleigh quotient f (U) = Tr{U BU(U WU) −1 }. With the change of variables Y = W 1/2 U and B * = W −1/2 BW −1/2 , maximizing f (U) is equivalent to maximizing Tr{Y B * Y} subject to Y Y = I. This maximum is achieved when span(Y) equals the span of the first d eigenvectors of B * , which is the same as span(α 1 , . . . , α d ) from the usual sequential procedure. Viewed as manifold optimization, there is nothing special about α 1 , . . . , α d other than the subspace they span.
In the case of Fisher's linear discriminant, sequential and simultaneous optimization yield the same subspace, but this equivalence will not always hold, as illustrated by the following two examples. The first example is the LDA projection pursuit index for classification (Lee, Cook, Klinke, and Lumley 2005). Based on Fisher's LDA, the index expression is given by where U is a (p × d) semi-orthogonal matrix. Using this index, the directions provided by the subspace spanned by the columns of U that maximize f (U) reveal differences between classes. For the second example, consider the optimal Bayes rule for classifying a new feature vector x into one of K normal populations with known means µ j and variances Σ j , j = 1, . . . , K, based on d linear combinations y = U x, where U is again a p × d matrix with rank d.
Using an optimal Bayes rule, the probability of correct classification is (Guseman, Peters, and Walker 1975) where N p is the p-dimensional normal density evaluated at y with the indicated mean and variance. It can be seen that in both examples, f (U) depends only on span(U) and thus maximizing f is again a Grassmann optimization problem. Sequential and simultaneous optimization will not necessarily yield the same subspace for these objective functions. This might be appreciated by recalling that sequential optimization requires an external inner product, and there does not seem to be a clear choice for that inner product in these optimization problems. Relatedly, Cook and Forzani (2010) discuss sequential versus full Grassmann optimization, and present an example illustrating the potential downside of sequential methods.

Independent components analysis
The independent component analysis (ICA) of a random vector consists of searching for a linear transformation that minimizes the statistical dependence between its components. Also known as blind source separation or referred to as sources separation problem, it was, according to Comon (1994), first proposed by J. Herault and C. Jutten around 1986. It has been used in biomedical applications like image analysis and in signal processing. A recurring example of the application of ICA is the cocktail party problem where the task is to recover p source signals s(t) = {s 1 (t), ..., s p (t)}, supposed to be statistically independent, from recordings where they appear as linear mixtures x(t) = {x 1 , ..., x n }. Mathematically, there is a mixing matrix A such that x(t) = As(t) and the goal of ICA is to identify the mixing matrix.
The problem is often translated into finding an n × p de-mixing matrix W such that the signals y(t) = W x(t) are as independent as possible (Absil et al. 2004). The quantification of the independence involves a cost function f such that f (WD) = f (W) for all nonsingular diagonal matrices D. Because of the invariance property of f , its optimization must be restrained to constrained sets corresponding to equivalence classes {WD : D diagonal}. A possible choice for the constraint set is the oblique manifold OB = {W ∈ R n×p : Diag(WW ) = I n }.
Thus, algorithms for independence component analysis can be formulated in terms of manifolds (Douglas 2000;Pham 2001). See Comon (1994) and Yeredor (2002) for further details on ICA.

Principal fitted components
One of the oldest and best known methods for reducing dimensionality in multivariate problems is principal component analysis (PCA). But PCA in a regression setting does not make any use of the outcome in reducing the dimensionality of the predictors. Cook (2007) proposed principal fitted component (PFC) models that can outperform PCA as a reductive method for regression. In this section, we discuss one specific PFC model where parameter estimation is carried over a Grassmann manifold.
Let X = (X 1 , ..., X p ) be a p-vector of random predictors and Y be the response. Cook (2007), using the stochastic nature of the predictors, proposed the following model: Here, X y denotes the conditional X given Y = y, f y ∈ R r is a function of the response, µ ∈ R p , Γ ∈ R p×d is a semi-orthogonal matrix and β ∈ R d×r . The error term ε ∈ R p is assumed to be normally distributed with mean 0 and variance ∆.
Essentially, this model can be seen as a way to partition the information in the predictors given the response into two parts: one part (Γβf y = E(X y − X)) is related to the response and the other (µ+ε) that is not. The term that is related to y suggests that the translated conditional means E(X y − X) fall in the d-dimensional subspace S Γ . Under the appropriate structure on ∆, the sufficient reduction is Γ X, and thus X can be replaced by Γ X without loss of information about the regression of Y on X. Here, the concept of sufficient reduction follows the definition of Cook (2007) However, as Γ X is a sufficient reduction, for any d × d orthogonal matrix O, (ΓO) X is also sufficient. Thus Γ is not estimable but the subspace spanned by its columns is estimable. The goal of an analysis is then to estimate the subspace S Γ spanned by the columns of Γ.
The estimation of S Γ depends on the structure of ∆. Various structures for ∆ can be modeled, including isotropic (σ 2 I), diagonal (σ 2 1 , ..., σ 2 p ), and general unstructured ∆ > 0 cases. The data structure dictates the variance structure to be used. While an isotropic structure of the variance may be appropriate for conditionally independent predictors that are on the same numerical scale, a diagonal structure may fit better when on different measurement scales, and an unstructured variance could be considered with predictors that are conditionally dependent. An interesting case arises with heterogeneous errors yielding where Γ 0 is the orthogonal completion of Γ such that (Γ, Γ 0 ) is a p×p orthogonal matrix. The matrices Ω ∈ R d×d and Ω 0 ∈ R (p−d)×(p−d) are assumed to be symmetric and full-rank. Under model (6) with covariance structure (7), Y is independent of X given Γ X and consequently Γ X carries all the predictive information that X has about Y .
We now turn to the maximum likelihood estimation of S Γ which involves a Grassmann manifold optimization. Assuming that a set of n data points is observed, letX be the sample mean of X and let X denote the n × p matrix with rows (X −X) . Let F denote the n × r matrix with rows (f y −f ) and set P F to denote the linear operator that projects onto the subspace spanned by the columns of F. Also let X = P F X denote the n × p matrix of the fitted values from the multivariate linear regression of X on f y . Setting Σ = X X/n and Σ fit = X X/n, we let Σ res = Σ − Σ fit . Then the maximum likelihood estimator of S Γ is where and V is an orthogonal completion of U. The objective function is a real-valued function with its argument being a p × d semi-orthogonal matrix. The invariant property of f , f (UO) = f (U) for any d × d orthogonal matrix O, and the orthogonal constraint U U = I d set up the optimization of f to be carried over the Grassmann manifold G (d,p) , which is the parameter space for S Γ .
Optimization over a Grassmann manifold is used in several other statistical models for dimension reduction, including covariance reducing models (Cook and Forzani 2008), envelope models (Cook, Li, and Chiaromonte 2010), generalized PFC (Cook and Li 2009), and likelihood acquired directions (Cook and Forzani 2009). In Section 5, an application to a set of data is presented where LAD is reviewed briefly and used.

The R package GrassmannOptim
In this section we give specific examples on how to use GrassmannOptim. The main visible function of this package is of the same name as the package name. To use this function, the user needs to provide the objective function; providing the analytical expression of the directional derivative is not required.

R> objfun <-function(W) list(value = f(W), gradient = Gradient(W))
The call of the function GrassmannOptim requires a minimum set of arguments. Other than objfun, the object W, which is of type list, is needed. As a list object, it may contain the optional initial starting matrix Qt that is a p × p orthogonal matrix. Regardless of Qt being provided or not, W must contain at least two components: the first is dim = c(d, p) where d is the dimension of the subspace sought and p is the number of predictors (d < p); the second is any list of arguments to be passed to the objective function f and the directional derivative Gradient. In the case of the current example, that list is sigmas. We present three scenarios to illustrate the use of GrassmannOptim. In the context of this application, d = 3 is the number of columns of Γ and p = 8. In the first, no initial matrix is provided. The call is: The process converges after only a handful iterations at the cost of the long simulated annealing process. At convergence, the first d components of the matrix output Qt is the solution to the optimization problem. This matrix is the basis of the estimate of the three dimensional subspace S Γ in R 8 .
The projected data onto the directions given by first three columns of Qt are plotted on Figure 1(right) using the following code.

Application
We applied GrassmannOptim to a dataset used by Cook and Forzani (2009) in their development and exposition of LAD, which requires Grassmann manifold optimization. Cook and Forzani (2009) carried out the optimization using the MATLAB (The MathWorks, Inc. 2010) program sg min (Lippert 2004), which enables us to contrast results from GrassmannOptim with those from a different code written in a different language. We briefly review the optimization portion of LAD and then turn to the example.

Likelihood acquired direction
Consider a regression in which the response Y is discrete with support S Y = {1, 2, ..., h}. Following standard practice, continuous response can be sliced into finite categories to meet this condition. Let X y ∈ R p denote a random vector of predictors distributed as X|(Y = y) and assume that X y ∼ N (µ y , ∆ y ), y ∈ S Y . Let µ = E(X) and Σ = var(X) denote the marginal mean and variance of X and let ∆ = E(∆ Y ) denote the average covariance matrix. Given n y independent observations of X y , y ∈ S Y , the goal is to obtain the maximum likelihood estimate of the d-dimensional central subspace S Y |X , which is defined informally as the smallest subspace such that Y is independent of X given its projection P S Y |X X onto S Y |X .
Let Σ denote the sample covariance matrix of X, let ∆ y denote the sample covariance matrix for the data with Y = y, and let ∆ = h y=1 f y ∆ y where f y is the fraction of cases observed with Y = y. The maximum likelihood estimator of S Y |X maximizes over S ∈ G (d,p) the log-likelihood function where |A| 0 indicates the product of the non-zero eigenvalues of a positive semi-definite symmetric matrix A and P S indicates the projection onto the subspace S in the usual inner product. The desired reduction is thenη X, where the columns ofη are a basis for the maximum likelihood estimate of S Y |X .

Is it a bird, a plane or a car?
The dataset is from a pilot study to assess the possibility of distinguishing birds, planes and cars by the sounds they make. The goal of the study was the reconstruction of sonic maps that identify both the level and source of sound.
A two-hour recording was made in a city, and then five second snippets of sounds were selected. This resulted in 58 recordings identified as birds, 43 as cars and 64 as planes. Each recording was processed and ultimately represented by 13 SDMFCCs (Scale Dependent Mel-Frequency Cepstrum Coefficients). The focus was on reducing this dimension of the 13-dimensional feature vector.
We applied LAD to estimate the central subspace via Grassmann manifold optimization using our package. The first two columns of the estimated basis matrix were used to form the two LAD predictors. These two LAD predictors, shown in Figure 2, nicely separate the sound sources. Figure 2 is quite similar to the one found in Cook and Forzani (2009). The orientation here is different than that found by Cook and Forzani (2009), but this is to be expected since the goal is to estimate a subspace and not a particular basis. Cook and Forzani (2009) and Tomassi (2011) adapted Lippert (2004)'s sg min 2.4.1 MATLAB (The MathWorks, Inc. 2010) package for Grassmann optimization with analytical first derivatives and numerical second derivatives, using Newton-Raphson iteration on G (d,p) . With GrassmannOptim, we used only numerical first derivatives with no use of the second derivatives. We also used the simulated annealing procedure to help reach a global optimum. Given these distinctions and other potentially relevant differences like starting values, we found the agreement between out results and those found by Cook and Forzani (2009) to be remarkable. The full R code used to obtain Figure 2 is provided in the online supplements. (More optimization examples using GrassmannOptim are available at http://www.math.umbc.edu/~kofi/GrassmannOptim/.)

Discussion
We have introduced GrassmannOptim, an R package for Grassmann manifold optimization, and described how many subspace optimization problems can be identified as manifold optimization. The current package uses a gradient-based algorithm. Minimally, the user is required to provide the objective function and the dimensions of the basis of the subspace to be estimated. Analytical expressions of the directional derivative typically result in more efficient estimation. However, analytic derivatives are not required. Limited theoretical results are available to analyze the convergence of the algorithm. But in practice, the convergence appears linear. The dimension of the Grassmann manifold affects greatly the speed of the package. We are dealing with a space where a single object is a subspace of dimension d × (p − d). When p is large, the algorithm becomes memory-greedy and slow.
A simulated annealing method was added to the package to help a global search. Without it, there is a chance that the algorithm could convergence to a local maximum. A good starting value can also help avoid local minima. As a Markov chain method, the processing time for simulated annealing depends on many elements, including the cooling schedule, the number of iterations at each stage of the Markov chain and the dimension of the subspace to optimize. The choice of the initial temperature, the cooling schedule and the maximum number of iterations at each temperature are left to the discretion of the user.