Speed-up and multi-view extensions to Subclass Discriminant Analysis

In this paper, we propose a speed-up approach for subclass discriminant analysis and formulate a novel, efficient multi-view solution to it. The speed-up approach is based on graph embedding and spectral regression, which involve eigendecomposition of the corresponding Laplacian matrix and regression to its eigenvectors. We show that by exploiting the structure of the between-class Laplacian matrix, the eigendecomposition step can be substituted with a much faster process. Furthermore, we formulate a novel criterion for multi-view subclass discriminant analysis and show that an efficient solution for it can be obtained in a manner similar to the single-view case. We evaluate the proposed methods on nine single-view and nine multi-view datasets and compare them with related existing approaches. Experimental results show that the proposed solutions achieve competitive performance, often outperforming the existing methods, while significantly decreasing the training time.


Introduction
In the modern world, the large amounts of data available for training machine learning algorithms have made these algorithms applicable and effective in many subject areas [1,2]. However, when the dimensionality of the data is high, the algorithms can become susceptible to the well-known curse of dimensionality: high-dimensional data is sparsely represented and, therefore, huge amounts of training data are required to estimate the parameters of a machine learning method. To address this problem, many dimensionality reduction methods have been proposed in recent years, acquiring an important role within the machine learning field. The objective of dimensionality reduction methods is to determine a feature space such that projecting the data onto it lowers the dimensionality while preserving the properties of the data that are of interest for the problem at hand.
Subspace learning methods can be divided into unsupervised and supervised ones, i.e., those relying solely on the structure of the data and those exploiting additional class label information provided by experts. Among the unsupervised dimensionality reduction methods, probably the most common one is Principal Component Analysis (PCA) [3], which projects the data onto the subspace where the data has the highest variance.
Supervised subspace learning methods assume that during training the data is given with class labels. Therefore, they lead to enhanced class discrimination compared to unsupervised methods and are more suitable for classification problems. One of the most well-known methods incorporating information on the class distribution is Linear Discriminant Analysis (LDA) [4,5], where the optimal subspace is obtained by optimizing the Fisher-Rao criterion [6], defined over the within-class and between-class scatter matrices, under the assumption that the classes are unimodal and follow a normal distribution. While incorporating the class label information, LDA can only define a subspace of at most d dimensions, where d is the rank of the between-class scatter matrix, which is equal to C-1 for the case of C classes.
The assumption of the class unimodality in LDA limits its performance in problems where classes form subclasses, i.e., classes are represented by multiple disjoint distributions. In order to address this limitation, approaches incorporating the subclass information in the optimization problem solved for determining the discriminant subspace have been proposed. Methods following this approach are the Subclass Discriminant Analysis (SDA) [7], Clustering Discriminant Analysis (CDA) [8], and Subclass Marginal Fisher Analysis (SMFA) [9]. In addition to better describing the classes' distributions, these methods are also able to determine discriminant subspaces of higher dimensionalities, since the maximum dimensionality of the learned feature space is limited by the rank of a modified between-class scatter matrix which is bounded by the total number of subclasses.
One of the main drawbacks of the subspace learning methods lies in the low speed for high-dimensional data and large datasets. For speeding up the training process several approaches have been proposed, including approximate solutions [1], incremental learning [10], and speed-up solutions [11,12,13,14,15]. In this paper, we propose a speed-up approach for SDA and its kernelized form, i.e., Kernel Subclass Discriminant Analysis (KSDA) [16]. The proposed approach is based on graph embedding [9,17] and exploitation of the structure of the between-class Laplacian matrix.
In some problems, descriptions of the same items are available from multiple differently distributed modalities. Such problems are referred to as multi-view or multimodal problems. The nature of multi-view problems is similar to the way humans perceive the world and make decisions, as real-world data is not limited to one source but consists of, e.g., visual and audio signals and tactile sensations. The data from different modalities is perceived by the human and the decision is made by combining information from different sources. A similar approach is followed by multi-view subspace learning methods, where the information coming from different views is combined by defining a latent feature space that is jointly determined using data from all available views during the training process. Moreover, the views can have different dimensionalities. An example of a multi-view problem is the classification of video sequences using their two views, i.e., audio and visual signals.
Extensions of supervised subspace learning methods to the multi-view case include Multi-view Discriminant Analysis (MDA) [18], which defines a variant of the LDA criterion to incorporate information from multiple views. In [18], the between-class scatter is maximized regardless of the difference between inter-view and intra-view covariances, while the within-class scatter is minimized. Multi-view Common Component Discriminant Analysis proposes a way to address nonlinearity, view discrepancy, and discriminability jointly by incorporating both label information and geometric information during subspace learning [19]. In order to address the problem of multi-label classification with a high number of classes on a multi-view dataset, a Multi-view Label Embedding model was proposed [20]. Besides, for problems with incomplete or incompletely labeled multi-view data, a unified subspace learning framework has been proposed [21]. In addition, several multi-view extensions of LDA have recently been proposed, including Standard Multi-view Discriminant Analysis (SMvDA) and Multi-view Modular Discriminant Analysis (MvMDA) [22]. Being extensions of LDA, these methods have similar limitations: the assumption of unimodality of the data within each view and a maximal number of dimensions bounded by the number of classes. In this work, we propose an approach to overcome these limitations by introducing Multi-view Subclass Discriminant Analysis, as well as its kernelized form, and show that the solution to its optimization problem can be obtained by following a fast and efficient process.

Related Work
This section describes the previous works related to the proposed supervised subspace learning methods.
Let us consider a set of N D-dimensional vectors $X = [x_1, x_2, ..., x_N] \in \mathbb{R}^{D \times N}$, each belonging to a class indicated by the corresponding label $c_i$. We define the subspace learning problem as the search for a d-dimensional feature space, with d < D, that provides the highest class separability for the data in X when projected onto that space. Most dimensionality reduction methods, including LDA, SDA, CDA, and SMFA, optimize the Fisher-Rao criterion [6]:
$$\mathcal{J}(W) = \arg\min_{W} \frac{\mathrm{Tr}(W^T S_w W)}{\mathrm{Tr}(W^T S_b W)}, \quad (1)$$
where $S_w$ and $S_b$ are symmetric positive semi-definite matrices, referred to as the within-class and between-class scatter matrices. The main differences between the subspace learning methods lie in the definition of these matrices. LDA [4] assumes that each class is unimodal and seeks a space in which the projected classes are compact and lie far from each other, therefore resulting in high discrimination between classes. The within-class and between-class scatter matrices are defined as
$$S_w = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T, \quad (2)$$
$$S_b = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T, \quad (3)$$
where C is the number of classes, $\mu$ is the mean of the data, $\mu_i$ is the mean of class i, $N_i$ is the number of samples in class i, and $x_{ij}$ is the j-th sample of class i. Many extensions to LDA have been proposed in recent years. Methods relaxing LDA's assumption of normally distributed classes and the limitations on the dimensionality of the learned subspace in binary problems have recently been proposed in [23,24,25].
CDA [8] relaxes the assumption on unimodal classes and applies clustering techniques to incorporate the subclass structure of the data in the training process. SMFA relies on a framework of Subclass Graph Embeddings [9], where the dimensionality reduction problem is described from a graph embedding perspective. The problem is defined by intrinsic and penalty graph matrices, which are built relying on the label information of k nearest neighbors of the data points, as defined by Euclidean distance or some other distance metric. The intrinsic graph matrix represents the compactness within the subclass, while penalty graph matrix enforces penalization to ensure inter-class separability.

Subclass Discriminant Analysis
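In order to relax the class unimodality assumption of LDA, SDA [7] expresses each class by a set of subclasses that are obtained by applying clustering on the class data. The difference between CDA and SDA lies in the definition of the within-class and between-class scatter matrices. In SDA, the total scatter matrix $S_t$ is minimized instead of the within-class scatter, as $S_t = S_b + S_w$. SDA uses the following definitions:
$$S_b = \sum_{i=1}^{C-1} \sum_{l=i+1}^{C} \sum_{j=1}^{Z_i} \sum_{h=1}^{Z_l} p_{ij} p_{lh} (\mu_{ij} - \mu_{lh})(\mu_{ij} - \mu_{lh})^T, \quad (4)$$
$$S_t = \sum_{q=1}^{N} (x_q - \mu)(x_q - \mu)^T, \quad (5)$$
where $\mu$ is the mean of the data, $\mu_{ij}$ is the mean of subclass j of class i, $Z_i$ is the number of subclasses of class i, i and l are the class labels, j and h are the subclass labels, and $p_{ij}$ and $p_{lh}$ are the subclass priors, $p_{ij} = N_{ij}/N$, where $N_{ij}$ is the number of samples in subclass j of class i and N is the total number of samples in X. The solution of (1) is given by solving the generalized eigendecomposition problem
$$S_t w = \lambda S_b w. \quad (6)$$
The obtained eigenvectors $[w_1, w_2, ..., w_d]$ that correspond to the d minimal eigenvalues form a projection matrix W. The projected data point $y_i$ can be computed as $y_i = W^T x_i$. It is trivial to see that, for data centered at $\mu$,
$$S_t = X X^T. \quad (7)$$
In addition, $S_b$ can be represented as follows:
$$S_b = X L_b X^T, \quad L_b(i,j) = \begin{cases} \frac{N - N_{c_i}}{N^2 N_{c_i z_i}}, & \text{if } z_i = z_j,\ c_i = c_j, \\ 0, & \text{if } z_i \neq z_j,\ c_i = c_j, \\ -\frac{1}{N^2}, & \text{if } c_i \neq c_j, \end{cases} \quad (8)$$
where $c_i$ is the class label of $x_i$, $z_i$ is the subclass label of $x_i$, $N_c$ is the number of samples in class c, and $N_{ch}$ is the number of samples in subclass h of class c.
The objective function of SDA can be reformulated into a maximization problem,
$$\mathcal{J}(W) = \arg\max_{W} \frac{\mathrm{Tr}(W^T S_b W)}{\mathrm{Tr}(W^T S_t W)} = \arg\max_{W} \frac{\mathrm{Tr}(W^T X L_b X^T W)}{\mathrm{Tr}(W^T X X^T W)}, \quad (9)$$
and, exploiting the formulations in (7) and (8), the solution is given by the generalized eigendecomposition problem
$$X L_b X^T w = \lambda X X^T w, \quad (10)$$
where the projection matrix is obtained by selecting the eigenvectors corresponding to the maximal eigenvalues.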
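The graph-embedding form of the between-class scatter can be checked numerically. The NumPy sketch below builds a Laplacian entry by entry and compares X L_b X^T with the scatter computed directly from the subclass means, i.e., the prior-weighted sum of outer products of differences between subclass means of different classes. The entry values used here, (N − N_c)/(N² N_ch) for two samples of the same subclass, 0 for samples of the same class but different subclasses, and −1/N² for samples of different classes, follow from expanding that sum and should be read as an assumption of this sketch; the function names are illustrative and not part of the original method description.

```python
import numpy as np

def between_class_laplacian(classes, subclasses):
    """Entry-wise L_b from class labels c_i and subclass labels z_i.

    Subclass labels only need to be unique within their own class.
    """
    c = np.asarray(classes)
    z = np.asarray(subclasses)
    n = c.size
    same_class = c[:, None] == c[None, :]
    same_sub = same_class & (z[:, None] == z[None, :])
    n_class = same_class.sum(axis=1)   # N_c for the class of each sample
    n_sub = same_sub.sum(axis=1)       # N_ch for the subclass of each sample
    same_sub_value = ((n - n_class) / (n**2 * n_sub))[:, None]
    return np.where(same_sub, same_sub_value,
                    np.where(same_class, 0.0, -1.0 / n**2))

def sda_between_scatter(X, classes, subclasses):
    """Direct S_b: sum over subclass pairs that belong to different classes."""
    c = np.asarray(classes)
    z = np.asarray(subclasses)
    n = X.shape[1]                     # X is D x N, samples are columns
    groups = sorted(set(zip(c.tolist(), z.tolist())))
    S = np.zeros((X.shape[0], X.shape[0]))
    for a, (ca, za) in enumerate(groups):
        for cb, zb in groups[a + 1:]:
            if ca == cb:
                continue               # pairs inside one class do not contribute
            ma = (c == ca) & (z == za)
            mb = (c == cb) & (z == zb)
            diff = X[:, ma].mean(axis=1) - X[:, mb].mean(axis=1)
            S += (ma.sum() / n) * (mb.sum() / n) * np.outer(diff, diff)
    return S
```

Each row of the matrix built this way sums to zero, which is the property that later makes the vector of ones an eigenvector with eigenvalue 0.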

Kernel Subclass Discriminant Analysis
Kernel methods are widely used in machine learning to overcome the limitation of linear separability, which is rarely present in real-world problems. In order to nonlinearly map each data point $x_i$ from the space $\mathbb{R}^D$ to its image $\phi_i$ in some space F, a nonlinear function $\phi(x)$ is defined, i.e., $\phi(x_i) \in F$. The dimensionality of F depends on the choice of the function and can be arbitrary. A linear projection is then defined in F, i.e., $y_i = W^T \phi(x_i)$.
The conventional approach to solving nonlinear problems exploits a kernel function, defined over pairs of data points in X, that maps them to the dot product of their images in F, $k(x_1, x_2) = \phi(x_1)^T \phi(x_2)$, and formulates the problem accordingly. By exploiting the dot product representation, the explicit mapping of each data point $x_i$ in X to its image $\phi_i = \phi(x_i)$ can be omitted, therefore avoiding the issues related to the arbitrary dimensionality of F. The N × N kernel matrix K is defined as
$$K = \Phi^T \Phi, \quad \Phi = [\phi(x_1), ..., \phi(x_N)]. \quad (11)$$
According to the Representer Theorem [26], W can be represented as a linear combination of the data in F:
$$W = \Phi A.$$
Therefore,
$$y_i = W^T \phi(x_i) = A^T \Phi^T \phi(x_i) = A^T k_i,$$
where $k_i$ is the i-th column of K. The kernelization of SDA can be easily obtained by exploiting the modified representation of $S_b$ and $S_t$ (7) [16]. Here we can assume that the data is centered in F. The kernel matrix of the centered data can be obtained as
$$\tilde{K} = K - \frac{1}{N} 1_N 1_N^T K - \frac{1}{N} K 1_N 1_N^T + \frac{1}{N^2} 1_N 1_N^T K 1_N 1_N^T, \quad (12)$$
where $1_N \in \mathbb{R}^N$ is a vector of ones. After mean-centering $\phi(X)$, $S_{kt}$ and $S_{kb}$ are given as follows:
$$S_{kt} = \sum_{q=1}^{N} (\phi(x_q) - \bar{\phi})(\phi(x_q) - \bar{\phi})^T, \quad (15)$$
$$S_{kb} = \sum_{i=1}^{C-1} \sum_{l=i+1}^{C} \sum_{j=1}^{Z_i} \sum_{h=1}^{Z_l} p_{ij} p_{lh} (\bar{\phi}_{ij} - \bar{\phi}_{lh})(\bar{\phi}_{ij} - \bar{\phi}_{lh})^T, \quad (16)$$
where $\bar{\phi}_{ij}$ is the mean of subclass j of class i in F, and $\bar{\phi}$ is the mean of the data in F. Exploiting (11), (15), and (16), the solution to KSDA is given by the generalized eigendecomposition problem
$$K L_b K a = \lambda K K a. \quad (17)$$
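Centering in the feature space never requires the explicit images of the data, only the kernel matrix. A minimal NumPy sketch of this step is given below; it uses the standard double-centering identity, and the function name is illustrative.

```python
import numpy as np

def center_kernel(K):
    """Kernel matrix of the data after mean-centering in feature space.

    Equivalent to H K H with H = I - (1/N) 1 1^T.
    """
    n = K.shape[0]
    E = np.full((n, n), 1.0 / n)       # (1/N) * 1 1^T
    return K - E @ K - K @ E + E @ K @ E
```

With a linear kernel K = X^T X, the result coincides with the Gram matrix of the explicitly mean-centered data, which gives a quick way to validate an implementation.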

Multi-view Extensions to Linear Discriminant Analysis
In multi-view learning, the data X = diag(X 1 , X 2 , ..., X V ) is described from V views and we seek to find V matrices W v that project the data X v from all views v = 1, ..., V to a common (latent) space, where the separability between the classes is the highest. A generalized framework for multi-view subspace learning, that includes many of the existing methods as special cases, was proposed in [22]. Here, the optimization problem is defined as where P and Q are the inter-view and intra-view covariance matrices. The solution is obtained by solving the generalized eigendecomposition problem where W v is the projection matrix of the view v. The feature vectors in the latent space are obtained as where L bij is either L * bij orL bij , as defined below, i and j are the view labels, and V is the number of views.
Using the above notations, SMvDA aims to maximize the distance between the class means regardless of the view and defines L bij as where e p is N -dimensional class vector with 1s at the positions corresponding to the samples belonging to class p and 0s elsewhere, i and j are views, and C is the number of classes. The MvMDA maximizes the distances between centers of different classes across different views: In both cases, the intra-view Laplacian matrix L w is defined as in (25), where where i is the view label, c is the class label, C is the total number of classes, and I is the identity matrix. Similarly, the solution to Kernel MvMDA and Kernel SMvDA is given by optimizing where L b is defined using L * bij orL bij and K is a block-diagonal matrix having K v as its v th block. The solution is then given by solving the eigendecomposition problem

Spectral Regression
In this section, we focus on the spectral regression approach, which was introduced as a way of speeding up the eigendecomposition step of LDA [28]. It has been shown that the generalized eigendecomposition problem (10) shares its eigenpairs with the problem $Jt = \lambda t$, for $t = X^T w$ and $J = L_b$:
$$X L_b X^T w = X J t = \lambda X t = \lambda X X^T w. \quad (33)$$
Exploiting this fact, the solution of (10) can be obtained by solving the eigenvalue decomposition problem $Jt = \lambda t$ and finding a w such that $X^T w = t$. In practice, such a w may not always exist, but it can be approximated with the closest value in the least squares sense:
$$W = \arg\min_{W} \|X^T W - T^T\|_F^2 + \alpha \|W\|_F^2 = (X X^T + \alpha I)^{-1} X T^T, \quad (36)$$
where $\alpha$ is a regularization parameter and $T = [t_1, ..., t_d]^T$. Spectral Regression Discriminant Analysis (SRDA) was proposed as an extension to LDA based on spectral regression [28]. It has been shown that in the case of LDA the matrix J (33) has C eigenvectors corresponding to nonzero eigenvalues, all of which correspond to the eigenvalue of 1 and have the form
$$e_p = [\underbrace{0, ..., 0}_{\sum_{i=1}^{p-1} N_i}, \underbrace{1, ..., 1}_{N_p}, \underbrace{0, ..., 0}_{\sum_{i=p+1}^{C} N_i}]^T, \quad p = 1, ..., C, \quad (37)$$
where p is the class label, $N_p$ is the number of samples in class p, and C is the number of classes. Therefore, the solution can be obtained by selecting the vector of ones as the first eigenvector and obtaining the rest by orthogonalization of the vectors of the structure in (37). A tensor extension to SRDA has recently been proposed in [29], where the eigendecomposition problem of Higher Order Discriminant Analysis is transformed into a regression problem.
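The two ingredients of spectral regression, target construction and regularized least squares, can be sketched in a few lines of NumPy. The code below is an illustration under stated assumptions rather than the reference SRDA implementation: targets are stored as columns (N × d), the regression is the ridge solution (X X^T + αI)^{-1} X T, and orthogonalization is done with a QR factorization, which yields the same vectors as Gram-Schmidt up to sign. The function names are hypothetical.

```python
import numpy as np

def srda_targets(labels):
    """Orthonormal class-indicator targets, orthogonal to the vector of ones."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # ones first, then C - 1 indicators (the C-th would be linearly dependent)
    cols = [np.ones(labels.size)]
    cols += [(labels == p).astype(float) for p in classes[:-1]]
    Q, _ = np.linalg.qr(np.column_stack(cols))
    return Q[:, 1:]                    # drop the column spanned by the ones vector

def spectral_regression(X, targets, alpha=1e-2):
    """Ridge regression of the targets (N x d) onto the data X (D x N)."""
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + alpha * np.eye(D), X @ targets)
```

A typical call would be `W = spectral_regression(X, srda_targets(labels))` on mean-centered data, followed by `Y = W.T @ X` to project.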

Kernel Regression
A kernelized version of the spectral regression was proposed in [11]. In this case, the objective is to solve the eigendecomposition problem JKa = λKa, which is equivalent to solving the eigendecomposition problem of Jt = λt given Ka = t: Then the kernel regression is applied to obtain where α is the regularization parameter.
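A minimal sketch of the kernel regression step follows. It assumes the usual kernel ridge form, in which the coefficients solve (K + αI) A = T with the targets stored as the columns of T; this form, the function names, and the layout of `k_new` are assumptions of the sketch. The system is solved through a Cholesky factorization, which is applicable because K + αI is symmetric positive definite for α > 0.

```python
import numpy as np

def kernel_regression(K, targets, alpha=1e-2):
    """Solve (K + alpha I) A = targets via a Cholesky factorization."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + alpha * np.eye(n))   # K + alpha I = L L^T
    # np.linalg.solve is a general solver; a dedicated triangular solver
    # would exploit the structure of L but is not required for correctness.
    return np.linalg.solve(L.T, np.linalg.solve(L, targets))

def project(A, k_new):
    """Project samples given their kernel values against the training set.

    k_new has one row per training sample and one column per new sample.
    """
    return A.T @ k_new
```

For the training data itself, `project(A, K)` returns the fitted targets up to the shrinkage introduced by α.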

Approximate Kernel Regression
For large-scale datasets, kernel regression method can be substituted by an approximate kernel regression, where W is expressed as a linear combination of r reference vectors (r < N ) [1]. We define W = ΨA, where Ψ is a set of reference vectors in F . The reference vectors in F correspond to r prototype vectors from R D that can be randomly selected training vectors from X, random data following the same distribution as data in X, subclass centers obtained by clustering all data, or subclass centers obtained by clustering data in each subclass separately.
Given W = ΨA, (40) becomes whereK = ΨΦ. Then, where α is a regularization parameter. It should be noted that in the case Ψ = Φ, the problem becomes equivalent to (41).

SDA with Spectral Regression
Subclass Discriminant Analysis has not previously been used together with Spectral Regression, but their combination is straightforward. The process of solving SDA using Spectral Regression can be defined as follows:
1. Create the between-class Laplacian graph (8).
2. Solve the eigendecomposition problem $L_b t = \lambda t$ and create the matrix T out of the obtained vectors.
3. Regress T to W as in (36).
4. Orthogonalize W such that $W^T W = I$.
Equivalently, for the kernel case, steps 3-4 are the regression of T to A as in (41) or (43) and the orthogonalization of A such that $A^T K A = I$.
The above-described process for solving the SDA optimization problem provides several advantages. Firstly, as we will show in the next section, the eigendecomposition step (33) can be substituted with a much faster process. Secondly, the eigendecomposition step (10) or (17) is avoided and substituted with the least squares regression, for which several efficient solutions exist [30].
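The linear pipeline described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the Laplacian `L_b` is passed in already built (step 1), the data matrix `X` is D × N and mean-centered, the regression is the ridge solution, and the final orthogonalization uses a QR factorization. The function name and its parameters are hypothetical.

```python
import numpy as np

def sda_spectral_regression(X, L_b, d, alpha=1e-2):
    """SDA via spectral regression with an explicit eigendecomposition of L_b."""
    # Step 2: eigenvectors of the symmetric matrix L_b, largest eigenvalues first
    eigvals, eigvecs = np.linalg.eigh(L_b)          # eigh returns ascending order
    T = eigvecs[:, ::-1][:, :d]                     # N x d target matrix
    # Step 3: regularized least-squares regression of the targets onto X
    W = np.linalg.solve(X @ X.T + alpha * np.eye(X.shape[0]), X @ T)
    # Step 4: orthogonalize so that W^T W = I
    Q, _ = np.linalg.qr(W)
    return Q
```

The eigendecomposition in step 2 costs O(N^3) for N samples, which is the part that a structure-based construction of the target vectors is meant to avoid.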

Proposed approach
In this section, the proposed methods are described. Firstly, we propose a speed-up approach for single-view SDA that relies on the structure of the Laplacian matrix $L_b$ and allows the eigendecomposition step of (33) to be substituted by a much faster process. Secondly, we propose linear and kernel solutions for multi-view SDA. Thirdly, we show that the solution to multi-view SDA can be obtained by a faster process that is similar to the one described for the single-view case.

Speeding up the eigendecomposition step
In this section, we show how the specific block structure of the Laplacian matrix $L_b$ in SDA allows the eigendecomposition step to be replaced with a much faster process.
Without loss of generality, we assume that the data in X is mean-centered and sorted according to the class and subclass labels, i.e., the samples are indexed as $[1, ..., N_{11}, ..., 1, ..., N_{CZ}]$, where $[1, ..., N_{CZ}]$ are the indices of the samples in subclass Z of class C.
It can be observed that $L_b$ has a block structure with constant values in the blocks, as described in (8), with different blocks of $L_b$ corresponding to different classes. The class blocks are further divided into subclass blocks.
Since L b has a block structure, its eigenvectors have the block structure as well. Moreover, bigger eigenvalues show larger differentiation and correspond to the eigenvectors discriminating class blocks, while smaller eigenvalues discriminate subblocks of class blocks, therefore representing subclasses. L b has a rank of C * Z − 1 and, therefore, it has C * Z − 1 nonzero eigenvalues, where Z is the number of subclasses in each class.
Assuming the eigenvectors are sorted according to the eigenvalues in decreasing order, the first C − 1 eigenvectors share similar values at indices corresponding to one class. The rest of the eigenvectors correspond to different classes, and in each of them the subclass structure of a certain class can be observed -the indices corresponding to data of the same subclass have the same nonzero value, while the indices corresponding to other classes have the value of 0. We observe that bigger eigenvalues correspond to the eigenvectors showing the subclass discrimination of classes with smaller number of samples; and the classes having the same amount of samples share the eigenvectors, i.e., samples at positions of both classes have nonzero values, that are the same within a subclass, while positions corresponding to other classes have the value of zero. In this case, such eigenvectors are repeated the number of times equal to the number of classes with the same amount of samples.
As an example, let us consider a problem of 2 classes, where class 1 contains 8 samples and class 2 contains 9 samples. Each class contains 2 subclasses, where class 1 has 3 samples in the first subclass and 5 in the second, and class 2 has 4 samples in the first subclass and 5 samples in the second subclass. Then the three eigenvectors of the between-class Laplacian matrix of this data that correspond to nonzero eigenvalues have the structure outlined in (44), where c corresponds to the class label, z to the subclass label, and $r_i$ to the i-th random value.
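The example above can be reproduced numerically. The sketch below builds the 17 × 17 between-class Laplacian for the described class and subclass sizes and inspects its spectrum. The entry values of the Laplacian are the ones implied by the pairwise subclass definition of the between-class scatter and are an assumption of this sketch. Under that form, a short calculation gives the nonzero eigenvalues 1/N for the class-level eigenvector and (N − N_c)/N² for the subclass-level eigenvector of class c, i.e., 1/17, 9/289, and 8/289 here.

```python
import numpy as np

# class 1: subclasses of 3 and 5 samples; class 2: subclasses of 4 and 5 samples
classes = np.array([0] * 8 + [1] * 9)
subclasses = np.array([0] * 3 + [1] * 5 + [0] * 4 + [1] * 5)
n = classes.size

same_class = classes[:, None] == classes[None, :]
same_sub = same_class & (subclasses[:, None] == subclasses[None, :])
n_class = same_class.sum(axis=1)
n_sub = same_sub.sum(axis=1)
L_b = np.where(same_sub, ((n - n_class) / (n**2 * n_sub))[:, None],
               np.where(same_class, 0.0, -1.0 / n**2))

eigvals, eigvecs = np.linalg.eigh(L_b)      # ascending eigenvalues
top_vals = eigvals[::-1][:3]                # the three nonzero eigenvalues
top_vecs = eigvecs[:, ::-1][:, :3]          # matching eigenvectors as columns

print(np.round(top_vals, 4))
print(np.round(top_vecs, 3))
```

The first column of `top_vecs` is constant within each class, the second is nonzero only on the smaller class, and the third is nonzero only on the larger class, in line with the ordering described above.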
Moreover, $L_b$ is a symmetric, weightless, constant-sum matrix. Therefore, all of its eigenvectors are orthogonal and a vector of ones is an eigenvector with eigenvalue 0 [31]. In addition, we can observe that for data with a subclass structure, the eigenvectors maximizing the criterion (9) are those with the block structure described above. Following this, the orthogonalization can be performed on random vectors that follow this block structure [32]. Therefore, we can choose the vector of ones as our first eigenvector and obtain the remaining C * Z − 1 vectors by orthogonalizing random vectors of the described structure following the Gram-Schmidt process [33]. The vector of ones can then be removed, as it corresponds to the eigenvalue 0. The detailed process of target vector creation is outlined in Algorithm 1.
Algorithm 1:
    ...
    T ← append T_clust as columns on the right;
end
T ← append N × 1 vector of ones as a column on the left;
Orthogonalize T;
remove first column of T;
return T
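The target vector creation can be sketched in NumPy as follows. Since only the final steps of the procedure are spelled out above, the details below are assumptions of this sketch: the column order (vector of ones, then C − 1 class-level random vectors, then Z − 1 subclass-level random vectors per class, taking classes in ascending order of size), the uniform random values, and the use of a QR factorization in place of explicit Gram-Schmidt (the two agree up to the sign of each column). The function name is hypothetical.

```python
import numpy as np

def sda_targets(classes, subclasses, seed=0):
    """Orthonormal target vectors with the class/subclass block structure."""
    rng = np.random.default_rng(seed)
    c = np.asarray(classes)
    z = np.asarray(subclasses)
    n = c.size
    labels, counts = np.unique(c, return_counts=True)

    cols = [np.ones(n)]                         # removed again at the end
    # class-level vectors: one random value per class
    for _ in range(len(labels) - 1):
        v = np.zeros(n)
        for lab in labels:
            v[c == lab] = rng.random()
        cols.append(v)
    # subclass-level vectors: random value per subclass, zero outside the class
    for lab in labels[np.argsort(counts, kind="stable")]:
        subs = np.unique(z[c == lab])
        for _ in range(len(subs) - 1):
            v = np.zeros(n)
            for s in subs:
                v[(c == lab) & (z == s)] = rng.random()
            cols.append(v)

    Q, _ = np.linalg.qr(np.column_stack(cols))  # sequential orthonormalization
    return Q[:, 1:]                             # drop the ones direction
```

No eigendecomposition is involved: the cost is that of orthogonalizing C * Z vectors of length N, and the resulting columns can be passed directly to the regression step.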

Multi-view Subclass Discriminant Analysis
In this section, we propose a novel method for multi-view subspace learning, Multi-view Subclass Discriminant Analysis, along with its kernelized version. The idea behind Multi-view Subclass Discriminant Analysis is to maximize the distance between the subclass means of different classes, while minimizing the distances between the samples of the same subclass. The total scatter matrix for the mean-centered data is defined as where $y^i_k$ is the k-th sample of view i in the latent space. The between-class scatter matrix is defined as where i and j are view labels, p and q are class labels, l and h are subclass labels, p i pl = The solution is then obtained by optimizing the Fisher-Rao criterion: where X and W are defined as in (47) and (48), respectively, and K is centered. Equivalently, the solution to the kernel version of the method is obtained by optimizing where K is a block-diagonal matrix having $K_v$ as its v-th block. The solution to the problem in (51) is obtained by solving the eigendecomposition problem Similarly, the solution to (52) is given by $L_b K a = \lambda K a$. Both of these problems can be solved by a process equivalent to the one described in 3.1.

Speeding up the eigendecomposition step: multi-view case
In this section, we describe a speed-up approach for Multi-view Subclass Discriminant Analysis, based on the specific structure of the Laplacian matrix $L^{mv}_b$. The process of speeding up the eigendecomposition step for the multi-view case is similar to the single-view one. The Laplacian matrix $L^{mv}_b$ is a constant-sum symmetric block matrix, and therefore has orthogonal eigenvectors, one of which is the vector of ones corresponding to the eigenvalue 0. The matrix has a block structure, where different blocks correspond to different views, and inside each diagonal view block we can observe the same block structure as in the single-view case. Due to this block structure, the eigenvectors of $L^{mv}_b$ have a block structure as well. Assuming that the number of clusters is the same in all views, the rank of $L^{mv}_b$ is C * Z * V − 1, and that is the maximum number of nonzero eigenvalues.
Let us consider the data of 2 views and 2 classes. Let the class 1 contain 2 subclasses, with 2 samples in the first subclass and 2 samples in the second subclass in both views. Let the class 2 contain 2 subclasses, with 3 samples in the first subclass and 4 samples in the second subclass in the first view, and 4 samples in the first subclass and 3 samples in the second subclass in the second view. Then the eigenvectors of the between-class Laplacian matrix of the proposed multi-view SDA will have the structure outlined in (53), where c corresponds to the class label, z corresponds to the subclass label, v corresponds to the view label and r i corresponds to i th random value.
It can be observed that the first C − 1 eigenvectors have the class block structure similar to the one in the single-view case, and the blocks are repeated across the positions corresponding to the different views. In the same way as in the previously described single-view case, the rest of the eigenvectors correspond to different classes and each of them exposes the subclass structure of specific class -the values corresponding to the same subclass are the same within each view in the eigenvector and the values corresponding to other classes are 0 in all the views. We observe that the classes with the same amount of samples share the eigenvectors in a way similar to the single-view case, and these eigenvectors are repeated for the number of times equal to the number of classes sharing the number of elements. The eigenvectors showing subclass discrimination of smaller classes correspond to bigger eigenvalues.
Following the procedure described for single-view case, the eigenvectors can be obtained by forming the random vectors of the structure described, and orthogonalizing starting from the vector of ones following the Gram-Schmidt process [33]. The vector of ones can then be removed. The detailed procedure is described in Algorithm 2.

Experimental results
In this section, the experimental results are presented. The results are compared with other subspace learning techniques, namely SDA, CDA, SMFA, and SRDA, as well as kernel SDA, CDA, and SMFA. In addition, we verify some of the assumptions regarding the proposed approach by performing eigendecomposition of $L_b$, regressing the obtained eigenvectors following (36), and projecting the data onto the obtained vectors that correspond to larger criterion values (9).
For the kernel version of the methods, we exploit the RBF kernel function: where we set the Gaussian scale σ to the mean Euclidean distance between the training vectors.
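A sketch of the kernel computation with a data-driven scale is given below. It assumes the common Gaussian form exp(−‖x − y‖² / (2σ²)) and takes σ as the mean Euclidean distance over all distinct pairs of training vectors; both the exact normalization in the exponent and the exclusion of zero self-distances from the mean are assumptions of this sketch, as are the function names.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between the columns of A (D x n) and B (D x m)."""
    sq = (A**2).sum(axis=0)[:, None] + (B**2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.maximum(sq, 0.0)          # clip tiny negatives caused by round-off

def mean_train_distance(X):
    """Mean Euclidean distance over distinct pairs of training vectors."""
    d = np.sqrt(pairwise_sq_dists(X, X))
    return d[np.triu_indices(X.shape[1], k=1)].mean()

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix between the columns of A and the columns of B."""
    return np.exp(-pairwise_sq_dists(A, B) / (2.0 * sigma**2))
```

The scale is estimated once on the training set, e.g. `sigma = mean_train_distance(X_train)`, and then reused for the train-test kernel `rbf_kernel(X_train, X_test, sigma)`.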
In our experiments, we assume that the subclass of each data point in each class is known and is determined by applying the k-means clustering in R D . The performance is tested for the different numbers of clusters Z = {1, 2, 3, 4, 5, 6}, and the same number of clusters is used for each class. We perform clustering in the original space and use the same cluster labels in the kernel methods. In the multi-view case, data in each view is clustered separately. The dimensionality d of the projection space is defined by the rank of the L b or L mv b matrix and is equal to C * Z − 1 and V * C * Z − 1, respectively, where V is the number of views, C is the number of classes, and Z is the number of clusters.
For each experiment, 5-fold stratified cross-validation was used, with 60% of data of each class belonging to training set, 20% to validation set, and 20% to test set, where validation set is used for hyperparameter tuning, and results are reported by training on the original training set and testing on the test set. All experiments were performed on a computer with 4-core Intel i7-4800Q CPU and 32 GB of RAM.
For single-view approaches, prior to using any method, we applied PCA, preserving the eigenvectors corresponding to 98% of total energy and the data was standardized. The hyperparameters of all methods, if any, were tuned with the grid search. For SMFA, k Int and k P en were selected from the range of [2..14] with step 3 and [20.
For calculating the distance matrix in SMFA/KSMFA, Gaussian similarity (54) with σ equal to the mean Euclidean distance between the training vectors was used. The regularization parameter for kernel regression was selected from the set $[10^{-3}, 10^{-2}, ..., 10^{3}]$. For the regularization of the other single-view kernel methods and the multi-view methods, the same parameter range was used. Cholesky decomposition was used for efficient matrix inversion.
In the multi-view kernel case, the solutions for the datasets containing more than 2500 samples were obtained with approximate kernel regression, with the kernel matrix formed with 1500 random vectors from the training data. In the single-view kernel case, approximate kernel regression [1] with prototype vectors formed by clustering all data, with a cardinality of 1000, was used on the large-scale SoF dataset for the proposed approach. On this dataset, the Nyström-based approximate kernel [34] was used for the KSDA, KCDA, and KSMFA methods with a cardinality of 1000.
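The PCA preprocessing used for the single-view experiments, which keeps the components carrying 98% of the total energy, can be sketched as follows. Energy is taken here to be the sum of squared singular values of the mean-centered data, i.e., the total variance; this interpretation, the function name, and the D × N data layout are assumptions of the sketch, and the standardization step is not shown.

```python
import numpy as np

def pca_energy(X, energy=0.98):
    """Leading principal directions of X (D x N) that retain the given energy.

    Returns the projection matrix (D x k) and the data mean (D x 1).
    """
    mean = X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(X - mean, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)       # cumulative energy per component
    k = int(np.searchsorted(ratio, energy)) + 1  # smallest k reaching the threshold
    k = min(k, s.size)
    return U[:, :k], mean
```

The projection is then applied as `P.T @ (X - mean)`, with `P` and `mean` estimated on the training split only.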

Single-view datasets
We conducted experiments on 4 facial image datasets, one large-scale facial image dataset, and 4 other datasets of various data types. The Jaffe [35], BU [36], and Cohn-Kanade [37] datasets contain facial images of people of different ethnic backgrounds with 7 different facial expressions: anger, happiness, fear, disgust, sadness, surprise, and neutral. The datasets contain 213, 100, and 245 images, respectively. The Extended Yale-B dataset [38] contains 2432 grayscale facial images of 38 people and, therefore, defines a face recognition problem with 38 classes. Each class is represented by 64 images of the same person under different illumination conditions, positions, and view angles. All the facial image datasets mentioned above were resized to images of 30 × 40 pixels and flattened to obtain 1200 × 1 vectors. The large-scale SoF dataset [39] consists of 42,592 images of 112 persons (66 male and 46 female) collected under different illumination conditions and containing images with occlusions (e.g., glasses). All images were converted to grayscale, resized to 30 × 40 pixels, and subsequently flattened to form 1200 × 1 vectors.
The Ionosphere dataset [40] contains radar data represented as 351 34-dimensional vectors, along with the information on whether they contain evidence of some type of structure in the ionosphere or not, therefore posing a binary classification problem. The Semeion dataset [41] contains 1593 instances of handwritten digits produced by 80 persons, each of whom wrote each digit twice: in a normal way and in a fast way. The digits are represented by 16 × 16 binarized images flattened to 1 × 256 vectors.
The MONKS2 dataset [42] is derived from a domain, where each instance is represented by 6 discrete features corresponding to one of the two classes. The Pima Indians Diabetes dataset [43] contains information on various medical attributes of patients, including the number of pregnancies the patient has had, their BMI, insulin level, age, along with the information on whether the patient has diabetes. The dataset contains 768 instances.

Multi-view datasets
For the evaluation of the multi-view methods seven datasets were used: Handwritten digits [44], Caltech-101 [45,46], NUS-WIDE [47,46], Human Action Recognition Using Smartphones [48], Robots Execution Failures [49], Healthy Old People Action Recognition [50], Million Song Dataset with Images (MSDI) [51]. The Handwritten digits dataset (HWD) [44]  The Human Action Recognition Using Smartphones dataset (HARS) [48] contains 3-axial angular velocity and linear acceleration data taken from the accelerometer and gyroscope data of a smartphone attached to a person's waist while the person is performing one of the 6 activities. Actions are described from 9 views: angular velocity of each of 3 axes, total acceleration of each of 3 axes, and body acceleration of each axis. Data was gathered from a group of 30 volunteers, resulting in 7352 instances. The cross-validation splits in our experiments were done such that the subjects performing the experiments are not repeated between training, validation, and test splits.
The Healthy Old People Action Recognition dataset (HOPAR) [50] comprises 2 datasets, each containing data from a wireless sensor worn by a person performing one of 4 activities: sitting on bed, sitting on chair, lying, and ambulating. The data is organized into 4 views, where views 1-3 represent the acceleration along each of the 3 axes and view 4 contains the received signal strength indication, frequency, and phase of the signal obtained from the sensor. The first dataset consists of data obtained from 60 subjects, from which 25% of the instances of each class were selected, resulting in 10495 instances. The second dataset contains data obtained from 27 subjects, resulting in 9057 instances.
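Selecting a fixed fraction of each class, as done for the first HOPAR dataset, keeps the class proportions of the reduced set equal to those of the full set. A minimal sketch is given below; the helper name `stratified_subsample` and the use of uniform random sampling within each class are assumptions, since the text does not state how the instances were chosen.

```python
import numpy as np

def stratified_subsample(labels, fraction=0.25, seed=0):
    """Return sorted indices of a subset holding `fraction` of the instances of every class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Keep at least one instance per class so that no class disappears.
        n_keep = max(1, int(round(fraction * cls_idx.size)))
        # Assumption: uniform sampling without replacement inside each class.
        keep.append(rng.choice(cls_idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Toy labels for 4 imbalanced classes (hypothetical sizes).
labels = np.repeat([0, 1, 2, 3], [400, 200, 100, 100])
idx = stratified_subsample(labels, fraction=0.25)
print(np.bincount(labels[idx]))
```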
The Robot Execution Failures dataset [49] consists of 5 subsets, each describing a different problem. For our experiments, subsets 1 and 4 were combined, resulting in a dataset of failures in approach to the grasp or ungrasp position. The data is divided into 4 classes (normal, collision, frontal collision, and obstruction) and described from 6 views: the force along each of the 3 axes and the torque about each of the 3 axes. The dataset consists of 205 instances.
The Million Song Dataset with Images (MSDI) [51] poses a music genre classification task for 15 different genres. Each instance represents a song that is described from two views: audio spectrograms of the audio signal and CNN features of the corresponding album cover. We perform evaluation on a subset of 7468 instances, chosen randomly from the dataset while preserving the initial class proportions.
Tables 1 and 2 show the results for the single-view linear and kernel methods, respectively. We performed experiments on 9 datasets using the proposed approach, which is compared with the conventional eigendecomposition-based approaches SDA, CDA, and SMFA and with SRDA in the linear case, and with KSDA, KCDA, and KSMFA in the kernel case.
Tables 3 and 4 show the results for the multi-view case. The following methods are compared: single-view SDA, where features from different views are concatenated, MvMDA, and SMvDA. For the single-view SDA we use the proposed fast approach.
We report the accuracy, the time taken for training, and the number of subclasses that resulted in the highest accuracy. For the multi-view datasets, the clustering time is included in the total time, as the comparison is made with methods that do not require clustering. For the single-view datasets, the total time does not include the time used for clustering, as the comparison is made with other clustering-based methods that use the same subclass labels. It can be seen that the proposed single-view method performs better than or close to the conventional methods, while always taking less time.

Results
In addition, by projecting the data onto the regressed eigenvectors of L_b, sorted by criterion value (9), we verify that for data with subclass structure the eigenvectors corresponding to larger criterion values are those following the described structure. The only exceptions were observed in the MONKS2 and Pima Indians Diabetes datasets, where some of the eigenvectors had random structure; this is due to the samples of different subclasses being mixed with each other. However, even in this case, the proposed approach results in competitive accuracy and higher speed. The accuracy obtained by projecting the data using the transformation matrix composed of the eigenvectors corresponding to the largest criterion values is shown in the second-to-last column of Table 1.
For the multi-view case, we compared the proposed multi-view SDA with other multi-view methods that assume unimodality of the data. It can be seen that the proposed approach results in a significant speed-up and competitive accuracy, often outperforming the competing methods.

Conclusions
This work presents two contributions: a fast and efficient solution for Subclass Discriminant Analysis, and a multi-view Subclass Discriminant Analysis with a fast solution to it. As can be seen from the experimental results, the proposed speed-up approach reduces the training time significantly, while being competitive in accuracy and often outperforming the conventional methods. This makes analysis possible on large-scale datasets, where solutions by conventional methods are not feasible. The proposed multi-view Subclass Discriminant Analysis provides superior accuracy compared to the methods relying on the assumption of unimodality of the data. In addition, the proposed speed-up approach can be applied to this formulation, resulting in a significant speed gain.