Multimodal Subspace Support Vector Data Description

In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each modality to a new common feature space while finding a joint compact description of the data coming from all the modalities. For the data in each modality, we define a separate transformation that maps the data from the corresponding feature space to the new optimized subspace by exploiting the available information from the class of interest only. The data description in the new subspace is obtained by Support Vector Data Description. We also propose different regularization strategies for the proposed method and provide both linear and non-linear formulations. We conduct experiments on two multimodal datasets and compare the proposed approach with baseline and recently proposed one-class classification methods, both combined with early fusion and applied to each modality separately. We show that the proposed Multimodal Subspace Support Vector Data Description outperforms all the methods using data from a single modality and performs better than or as well as the methods fusing data from all modalities.


I. INTRODUCTION
In our daily surroundings, we are exposed to information from many different sources. Different sensors are used to gather information about similar objects. Our brains usually perform well in combining the information from different sources to make a concise analysis of a particular entity. In order to analyze an entity, even a single source of information might be enough, but to make some critical decisions it is important to combine information from different sources in a systematic way. For example, if a person is walking in a crowd, the main information needed to avoid hitting anything comes from visual cues, but people can also warn each other by voice or even by touch, and this extra information helps in understanding the environment better. Smell could help to avoid unpleasant spots, too. As another example, while watching a movie, the visual information of the scenes alone may not be enough to understand the whole scenario, but the audio and/or captions combined with the visual information will provide the full information.
This work was supported by the NSF-Business Finland Center for Visual and Decision Informatics project Co-Botics, jointly sponsored by Tieto Oy Finland and CA Technologies (Broadcom recently acquired CA Technologies).
A. Iosifidis is with the Department of Engineering, Electrical and Computer Engineering, Aarhus University, DK-8200 Aarhus, Denmark (e-mail: ai@eng.au.dk).
In machine learning techniques for predictive data modelling, training data is needed to form a model which can accurately classify future instances into a predefined number of classes. In many cases, data comes from sensors and can be further processed to extract different features. The term multimodal is used to describe data coming from different sensors (each also referred to as a mode or modality); however, it is also used as a synonym for multi-view when different features are extracted from the same sensor or when there are multiple similar sensors, e.g., cameras.
Examples of multimodal representations are common in different application areas. In [1], an active multimodal sensor system for target recognition and tracking is studied, where information from three different sensors (visual, infrared, and hyperspectral) is used. In [2], a framework for vehicle tracking with multimodal data (velocity and images) is proposed, where the outcome of the velocity modality, estimated by using a Kalman filter on the data obtained from motion sensors, is fused with features learned from the image modality by the color-faster R-CNN method. In [3], a multimodal data collection framework for mental stress monitoring is studied. In the proposed framework, physiological and motion sensor data of people under stress is collected.
The data in multimodal applications comes from different modalities, where each modality has its own statistical properties and contains specific information. The different modalities usually share high level concepts and semantic information and all together contain more information than any single-modal data [4]. If we build a model separately for each modality, the relationship between the modalities cannot be exploited efficiently. In multimodal subspace learning the goal is to infer a shared latent representation, that can accurately model data from each original modality and exploit the relationship between the modalities.
In traditional multiclass machine learning, an adequate amount of data is available for all the categories during training and, hence, the algorithm takes advantage of all available training data from all classes to train a model. However, it is possible that during training the data is highly imbalanced or that the only data available is from a single class. In such cases, one-class classification techniques are used. They are useful in many different cases, such as outlier detection, predicting specific events, or, in general, predicting a specific target class [5] [6]. While much effort has been put into solving one-class classification tasks for data of a single modality, much less effort has been put into solving one-class multimodal challenges in general, and we are not aware of any work in the field of subspace learning for one-class classification [7]. In one-class multimodal tasks, it is assumed that the only data available is from a single class in many different modalities.
arXiv:1904.07698v1 [cs.LG] 16 Apr 2019
In this paper, we propose a novel method for solving multimodal one-class classification tasks. The proposed method, Multimodal Subspace Support Vector Data Description (MS-SVDD), finds a transformation for each modality along with defining a common model for all modalities in a lower-dimensional subspace optimized for one-class classification. The rest of the paper is organized as follows. In Section II, an overview of related work is presented. In Section III, the newly proposed MS-SVDD is derived and discussed. In Section IV, we present the experimental setup and results, and finally, in Section V, conclusions are drawn.

II. BACKGROUND AND RELATED WORK
In this section, we briefly discuss principles of multimodal learning along with subspace learning. We also provide an overview of traditional methods used for multiclass multimodal data description and one-class unimodal data description.

A. Multimodal learning
The availability of many different modalities can be a blessing if it increases the performance of a machine learning model. However, if the data description algorithm fails to make a strong connection between the different available modalities, the performance can be degraded. To ensure a better performance of the model when combining data from different modalities, mainly two principles should be ensured, i.e., the consensus and complementary principles [8]:
• The consensus principle aims at minimizing the disagreement between the data available from different modes. Maximizing the agreement reduces the error rate, and a better model of the data is achieved when combining data from different modalities.
• The complementary principle in the context of multimodal learning means that data from each modality may contain some knowledge not contained in the other ones. It is therefore necessary to exploit information from all the available modes to make an accurate description of the data.
Multimodal machine learning techniques can be described by three main properties: two-view vs. multi-view, linear vs. non-linear, and unsupervised vs. supervised [9]. As the name indicates, in two-view learning, the number of views is limited to two. In multi-view learning, the number of views is not limited. The difference between supervised and unsupervised learning is that, in supervised learning, the information on the output labels of the training data is taken into account when training the model, while in unsupervised methods the labels are not used to model the underlying structure or distribution of the data [10], [11]. Linear techniques for multimodal subspace learning may be too simple to provide a representative model; hence, kernel methods are proposed to capture non-linear patterns in data.
Multimodal machine learning applications can be broadly divided into four main application domains, i.e., audio-visual speech recognition, multimedia content indexing and retrieval, understanding human multimodal behaviors, and language and vision media description [12]. In audio-visual speech recognition, the main goal is to improve speech recognition performance by combining visual information with audio/speech signals [13], [14]. Content analysis tasks such as automatic shot-boundary detection, multimedia event detection, and searching visual and multimodal content in a dataset are a few examples of multimedia content indexing and retrieval [15], [16], [17], [18]. Human-robot collaboration, human emotion recognition, human-computer interaction, and automatic assessment of depression and stress come under the category of understanding human behaviour from multimodal input data during social interactions [19], [20], [21], [22], [23]. An example of language and vision media description is image captioning, where the goal is to generate a text description of images [24], [25].
In multimodal learning, the main goal is to develop a process for fusing information from various modalities. In [26], the fusion strategies are divided into two categories: model-agnostic and model-based approaches. In model-agnostic approaches, the fusion is either late, early, or hybrid. In early fusion, the data or extracted features are fused together at the very initial phase of modelling. A new feature vector is usually formed by concatenating all the available data from the different modes, and the model is trained with the new feature vector. In late fusion, multiple models are trained and the fusion is done on the scores generated by each model for the corresponding modality. The score generated by each model can be a thresholded decision or some probability used in decision making. Hybrid fusion exploits the advantages of both early fusion and late fusion. Model-based approaches explicitly fuse the data during their construction; examples include kernel-based approaches, graphical models, and neural networks. In this work, we present a model-based approach for data fusion.
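The difference between early and late fusion can be illustrated with a minimal sketch. The function names, the averaging rule, and the 0.5 threshold below are illustrative assumptions and are not taken from [26]; any score-combination rule could take their place.

```python
from statistics import mean

def early_fusion(features_per_modality):
    """Early fusion: concatenate one feature vector per modality into one vector."""
    fused = []
    for vec in features_per_modality:
        fused.extend(vec)
    return fused

def late_fusion(scores_per_modality, threshold=0.5):
    """Late fusion: combine the scores of separately trained per-modality models.

    Averaging followed by thresholding is only one illustrative combination rule.
    """
    return mean(scores_per_modality) >= threshold

# One item described by a 3-dimensional and a 2-dimensional modality.
fused = early_fusion([[0.1, 0.2, 0.3], [1.0, 2.0]])
# Scores produced by two hypothetical per-modality models for the same item.
decision = late_fusion([0.9, 0.4])
```

In the early-fusion case a single model would then be trained on `fused`, whereas in the late-fusion case each modality keeps its own model and only the scores meet.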

B. Subspace learning
In the current era of data science, where multidimensional multimodal big data is generated every minute in different industries, there is a need for extracting important insights and mining knowledge from this high-dimensional data. Subspace learning aims at representing data in a lower-dimensional space while preserving as much as possible of the information available in the original higher-dimensional space.
Algorithms developed for linear subspace learning find a projection matrix for labelled training data (represented by vectors) satisfying some optimality criteria. Principal Component Analysis (PCA) is one of the first subspace learning methods mentioned in the literature. In PCA, a subspace is learned by orthogonally projecting data to a subspace so that the variance of the data is maximized. PCA works only with a single mode of data, i.e., all data should have the same dimensionality. Another traditional subspace learning method is Linear Discriminant Analysis (LDA), which finds a linear transformation by exploiting the class information. Many extensions of PCA and LDA have been proposed [27], [28], [29].
Analogous to PCA, but used for two-view learning, is canonical correlation analysis (CCA) [30]. CCA is a classic method for subspace learning which aims at relating two sets of data by finding the pairs of directions which provide maximum correlation between the two sets. It has recently become one of the popular methods for unsupervised subspace learning because of its generalization capability and has been used extensively for multimodal data fusion and cross-media retrieval [31], [32], [33]. In subspace learning, state-of-the-art results are achieved by methods which draw on ideas from conventional subspace learning methods [34].
As an extension to methods for linear transformation, kernel methods are introduced to find a nonlinear function or decision boundary. In kernel methods, the data is mapped to possibly a higher dimensional kernel-space using a kernel function where it exhibits linear patterns [35]. For example, in [36], kernel-PCA performing a nonlinear form of PCA is proposed.

C. One-class classification
In one-class classification, the parameters of the model are estimated using data from the positive class only, because data from the other classes is either not available at all or is too diverse in nature to be modeled statistically [10]. The positive class is also called the target class, and the data from the other classes, which is not available during training, is called the negative or outlier class. For example, a unimodal biometric system uses a single biometric trait for verification or identification [37]. Support Vector Data Description (SVDD) [38] is among the most widely used one-class classification methods for anomaly detection and other related applications. SVDD obtains a spherical boundary around the target data, which can be made flexible by using the kernel trick. The obtained boundary is used to detect outliers during testing, i.e., anything inside the closed boundary is classified as the target class and anything else as an outlier. The Lagrangian of SVDD is given as
$$L = \sum_{i} \alpha_i x_i^T x_i - \sum_{i} \sum_{j} \alpha_i \alpha_j x_i^T x_j, \tag{1}$$
where $x_i$ are the input target training data, and maximizing (1) gives a set of $\alpha_i$, one corresponding to each instance. Instances with $\alpha_i > 0$ define the data description. Another representative one-class classification method is the One-Class Support Vector Machine (OC-SVM) [39]. Techniques for enhancing the performance of one-class classification methods, mainly extensions of SVDD, can be divided into four main categories: data structure, kernel issue, hypersphere boundary, and non-stationary data [40]. As the name indicates, in the data structure category the main focus is on the structure of the data. For example, in [41], a confidence coefficient is associated with each training sample to deal with the uncertainty of the data. In kernel issue extensions, the main focus is on reducing the complexity or proposing new kernels for one-class classification. For example, in [42] and [43], new kernels are proposed to improve the accuracy of SVDD.
Proposing changes in the boundary enclosing the target data falls under the third category for improving one-class classification accuracy. For example, in [44], [45] and [46], an ellipsoidal shape is used for encapsulating the target data instead of the traditional sphere used in SVDD. In [47], it is shown that both SVDD and OC-SVM lead to the same solution when exploiting the elliptical shape of the class. The last category of algorithms for improving one-class classifier performance attempts to handle non-stationary data. For example, in [48], an extension of SVDD is proposed to handle non-stationary or increasing data. Recently, in [49], an algorithm developed for reducing the effect of uncertain data around the hypersphere of SVDD achieved state-of-the-art results on many UCI [50] datasets. In this paper, we consider baseline SVDD combined with multimodal subspace learning. However, in the future, the method can be extended using similar ideas.
In the area of multimodal one-class classification, researchers have mainly focused on fusing the output labels of multiple models trained for each type of features independently, i.e., without taking into account information from other feature types for one model [51].

III. MULTIMODAL SUBSPACE SUPPORT VECTOR DATA DESCRIPTION
MS-SVDD maps data from high-dimensional feature spaces to a low-dimensional feature space optimized for one-class classification. The optimized subspace is shared by data coming from all modalities. MS-SVDD is an extension of Subspace Support Vector Data Description (S-SVDD), which was proposed for unimodal data in [52]. The main novelty of MS-SVDD is using the multimodal approach for one-class classification. Here, we first derive the linear MS-SVDD. Then we derive two non-linear versions using the kernel trick [35] and the Nonlinear Projection Trick (NPT) [53], respectively.

A. Linear MS-SVDD
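Let us assume that the items to be modelled are represented by $M$ different modalities. The instances in each modality $m$, $m = 1, \dots, M$, are represented by a matrix
$$X_m = [x_{m,1}, x_{m,2}, \dots, x_{m,N}], \quad x_{m,i} \in \mathbb{R}^{D_m},$$
where $N$ is the total number of instances and $D_m$ is the dimensionality of the feature space in modality $m$. MS-SVDD tries to find a projection matrix $Q_m \in \mathbb{R}^{d \times D_m}$ for each modality, which will project the corresponding instances to a lower ($d$) dimensional optimized subspace shared by all modalities. Thus, the feature vector $x_{m,i}$ is projected to a $d$-dimensional vector $y_{m,i}$ as
$$y_{m,i} = Q_m x_{m,i}, \quad \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}. \tag{2}$$
To obtain a common description of all the data transformed from their corresponding modalities to the new common subspace, we exploit Support Vector Data Description (SVDD) [38] to form a closed boundary around the target class data in the new subspace. The center and radius of the hypersphere are denoted by $a \in \mathbb{R}^d$ and $R$, respectively. Figure 1 depicts the basic idea of the proposed method. In order to find a compact hypersphere which encloses all the target data from all the modalities in the new subspace, we minimize
$$F(R, a) = R^2 \quad \text{s.t.} \quad \lVert Q_m x_{m,i} - a \rVert_2^2 \le R^2, \quad \forall m, \; \forall i. \tag{3}$$
By introducing slack variables $\xi_{m,i}$, such that most of the training data from all the modalities in the new common space should lie inside the hypersphere, the above criterion becomes:
$$\min F(R, a) = R^2 + C \sum_{m=1}^{M} \sum_{i=1}^{N} \xi_{m,i} \quad \text{s.t.} \quad \lVert Q_m x_{m,i} - a \rVert_2^2 \le R^2 + \xi_{m,i}, \quad \xi_{m,i} \ge 0, \quad \forall m, \; \forall i, \tag{4}$$
where $C > 0$ controls the trade-off between the volume of the hypersphere and the amount of training data allowed outside it. The Lagrange function corresponding to (4) can be given as
$$L = R^2 + C \sum_{m=1}^{M} \sum_{i=1}^{N} \xi_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \gamma_{m,i} \xi_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \left( R^2 + \xi_{m,i} - x_{m,i}^T Q_m^T Q_m x_{m,i} + 2 a^T Q_m x_{m,i} - a^T a \right). \tag{5}$$
The Lagrangian function should be maximized with respect to $\alpha_{m,i} \ge 0$ and $\gamma_{m,i} \ge 0$ and minimized with respect to $R$, $a$, $\xi_{m,i}$, and $Q_m$. By setting the partial derivatives to zero, we get:
$$\frac{\partial L}{\partial R} = 0 \Rightarrow \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} = 1, \tag{6}$$
$$\frac{\partial L}{\partial a} = 0 \Rightarrow a = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} Q_m x_{m,i}, \tag{7}$$
$$\frac{\partial L}{\partial \xi_{m,i}} = 0 \Rightarrow C - \alpha_{m,i} - \gamma_{m,i} = 0, \tag{8}$$
$$\frac{\partial L}{\partial Q_m} = 0 \Rightarrow Q_m \sum_{i=1}^{N} \alpha_{m,i} x_{m,i} x_{m,i}^T = \sum_{i=1}^{N} \alpha_{m,i} a x_{m,i}^T. \tag{9}$$
It is clear from (6)-(9) that the parameters $\alpha$ and $Q$ are interrelated and cannot be jointly optimized. Hence, we apply a two-step iterative optimization process where, in each step, we fix one parameter and optimize the other. Substituting (2), (6), (7) and (8) in the Lagrangian function (5), we get
$$L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} x_{m,i}^T Q_m^T Q_m x_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} x_{m,i}^T Q_m^T Q_n x_{n,j}. \tag{10}$$
We see that optimizing (10) for $\alpha$ corresponds to the traditional SVDD applied in the subspace. Maximizing (10) for a particular set of data will give us the $\alpha_{m,i}$ corresponding to each sample.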
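As a rough illustration of the quantities in this derivation, the following NumPy sketch evaluates the hypersphere center (the weighted sum of all projected instances) and the dual objective (10) for given projections and multipliers. The function name `center_and_dual` and the toy data are hypothetical; obtaining the multipliers themselves would additionally require a quadratic programming solver, which is not shown.

```python
import numpy as np

def center_and_dual(X_list, Q_list, alpha_list):
    """Return the center a and the value of the dual objective (10).

    X_list[m]     : D_m x N data matrix of modality m (columns are instances)
    Q_list[m]     : d x D_m projection matrix of modality m
    alpha_list[m] : length-N vector of multipliers of modality m
    """
    # Project every modality to the shared subspace and stack the columns.
    Y = np.hstack([Q @ X for Q, X in zip(Q_list, X_list)])  # d x (M*N)
    alpha = np.concatenate(alpha_list)                      # length M*N
    # Center: weighted sum of all projected instances.
    a = Y @ alpha
    # Dual: sum_i alpha_i y_i^T y_i minus the squared norm of the weighted sum.
    dual = alpha @ np.sum(Y * Y, axis=0) - a @ a
    return a, dual

# Toy example: two modalities (D_1 = 3, D_2 = 2), N = 2 instances, d = 1.
X1 = np.array([[1.0, 3.0], [0.0, 0.0], [0.0, 0.0]])
X2 = np.array([[5.0, 7.0], [0.0, 0.0]])
Q1 = np.array([[1.0, 0.0, 0.0]])
Q2 = np.array([[1.0, 0.0]])
alphas = [np.array([0.25, 0.25]), np.array([0.25, 0.25])]  # sums to one
a, dual = center_and_dual([X1, X2], [Q1, Q2], alphas)
```

With uniform multipliers the dual value equals the variance of the projected points around their mean, which is why the toy example gives a center of 4 and a dual value of 5.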
The value of $\alpha_{m,i}$ for the corresponding sample defines its position with respect to the hypersphere:
• Samples with $0 < \alpha_{m,i} < C$ define the data description and lie on the boundary of the hypersphere; they are referred to as support vectors.
• Samples with $\alpha_{m,i} = C$ are outside the boundary.
• Samples with $\alpha_{m,i} = 0$ lie inside the boundary.
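The three cases above can be written as a small helper. The function name `sample_role` and the numerical tolerance `tol` are illustrative assumptions; a tolerance is typically needed because solvers return multipliers that are only approximately equal to 0 or C.

```python
def sample_role(alpha, C, tol=1e-8):
    """Map a multiplier value to the position of its sample w.r.t. the hypersphere."""
    if alpha <= tol:
        return "inside"          # alpha = 0: strictly inside the boundary
    if alpha >= C - tol:
        return "outlier"         # alpha = C: outside the boundary
    return "support_vector"      # 0 < alpha < C: on the boundary

roles = [sample_role(a, C=0.2) for a in (0.0, 0.05, 0.2)]
```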
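In the second step, we fix $\alpha$ and update $Q_m$ for each modality. For this step, we add a regularization term $\omega$:
$$L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} x_{m,i}^T Q_m^T Q_m x_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} x_{m,i}^T Q_m^T Q_n x_{n,j} + \beta \omega. \tag{11}$$
The regularization term $\omega$ expresses the covariance of data from different modalities in the new low-dimensional space, and $\beta$ is a regularization parameter for controlling the significance of $\omega$. We propose different settings for $\omega$ as
$$\omega_0 = 0, \tag{12}$$
$$\omega_1 = \mathrm{tr}\left( Q_m X_m X_m^T Q_m^T \right), \tag{13}$$
$$\omega_2 = \mathrm{tr}\left( Q_m X_m \alpha_m \alpha_m^T X_m^T Q_m^T \right), \tag{14}$$
$$\omega_3 = \mathrm{tr}\left( Q_m X_m \lambda_m \lambda_m^T X_m^T Q_m^T \right), \tag{15}$$
$$\omega_4 = \mathrm{tr}\left( \sum_{m=1}^{M} \sum_{n=1}^{M} Q_m X_m X_n^T Q_n^T \right), \tag{16}$$
$$\omega_5 = \mathrm{tr}\left( \sum_{m=1}^{M} \sum_{n=1}^{M} Q_m X_m \alpha_m \alpha_n^T X_n^T Q_n^T \right), \tag{17}$$
$$\omega_6 = \mathrm{tr}\left( \sum_{m=1}^{M} \sum_{n=1}^{M} Q_m X_m \lambda_m \lambda_n^T X_n^T Q_n^T \right), \tag{18}$$
where $\alpha_m \in \mathbb{R}^N$ in (14) and (17) is a vector having the elements $\alpha_{m,1}, \dots, \alpha_{m,N}$. Thus, $\alpha_m$ has non-zero values for support vectors and outliers. $\lambda_m \in \mathbb{R}^N$ in (15) and (18) is a vector having the elements of $\alpha_m$ that are smaller than $C$.
Values of $\alpha_m$ corresponding to outliers (i.e., $\alpha_{m,i} = C$) are replaced with zeros in $\lambda_m$. Thus, $\lambda_m$ has non-zero values only for the support vectors. For $\omega_0$, the regularization term vanishes and is not used in the optimization process.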
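In $\omega_1$, the regularization term only uses representations coming from the respective modality, and no representations from the other modalities are used to describe the variance of the positive class.
In $\omega_2$, all support vectors, i.e., representations at the hypersphere boundary, and outliers are used to describe the class variance for the update of the corresponding $Q_m$. In $\omega_3$, only the support vectors of the respective modality are used to describe the variance of the class to be modelled. In $\omega_4$, data from all the modalities are used to describe the covariance and regularize the update of $Q_m$. In $\omega_5$, the instances belonging to the hypersphere boundary and the outliers from all modalities are used to describe the covariance. In $\omega_6$, only the support vectors belonging to the class boundary from all modalities are used to update $Q_m$ and describe the covariance of the positive class. We update $Q_m$ by using the gradient of $L$ with respect to $Q_m$,
$$Q_m \leftarrow Q_m - \eta \nabla L, \tag{19}$$
where $\eta$ is the learning rate parameter and the gradient of $L$ is calculated as
$$\nabla L = 2 \sum_{i=1}^{N} \alpha_{m,i} Q_m x_{m,i} x_{m,i}^T - 2 \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} Q_n x_{n,j} x_{m,i}^T + \beta \Delta\omega, \tag{20}$$
where $\Delta\omega$ is the derivative of the regularization term with respect to $Q_m$:
$$\Delta\omega_0 = 0, \tag{21}$$
$$\Delta\omega_1 = 2 Q_m X_m X_m^T, \tag{22}$$
$$\Delta\omega_2 = 2 Q_m X_m \alpha_m \alpha_m^T X_m^T, \tag{23}$$
$$\Delta\omega_3 = 2 Q_m X_m \lambda_m \lambda_m^T X_m^T, \tag{24}$$
$$\Delta\omega_4 = 2 \sum_{n=1}^{M} Q_n X_n X_m^T, \tag{25}$$
$$\Delta\omega_5 = 2 \sum_{n=1}^{M} Q_n X_n \alpha_n \alpha_m^T X_m^T, \tag{26}$$
$$\Delta\omega_6 = 2 \sum_{n=1}^{M} Q_n X_n \lambda_n \lambda_m^T X_m^T. \tag{27}$$
We initialize $Q_m$ using PCA. At every iteration, the projection matrix is orthogonalized and normalized so that
$$Q_m Q_m^T = I, \tag{28}$$
where $I$ is an identity matrix. We use QR decomposition for orthogonalizing and normalizing the projection matrix $Q_m$. Algorithm 1 describes the overall MS-SVDD algorithm.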
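The projection update can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the multipliers are taken as given, only the regularizer that uses all data of the respective modality is included (controlled by `beta`), and the names `grad_Qm` and `update_Qm` are hypothetical. The gradient is that of the dual objective (10) with the center written as the weighted sum of all projected instances.

```python
import numpy as np

def grad_Qm(m, X_list, Q_list, alpha_list, beta=0.0):
    """Gradient of the regularized dual objective with respect to Q_m."""
    Xm, Qm, am = X_list[m], Q_list[m], alpha_list[m]
    # Center: sum over all modalities and instances of alpha * Q x.
    a = sum(Q @ X @ al for Q, X, al in zip(Q_list, X_list, alpha_list))
    # 2 * sum_i alpha_{m,i} Q_m x_{m,i} x_{m,i}^T
    term1 = 2.0 * Qm @ (Xm * am) @ Xm.T
    # 2 * sum_i alpha_{m,i} a x_{m,i}^T  (the triple sum collapses to a)
    term2 = 2.0 * np.outer(a, Xm @ am)
    # Derivative of tr(Q_m X_m X_m^T Q_m^T), the own-modality regularizer.
    reg = 2.0 * Qm @ Xm @ Xm.T
    return term1 - term2 + beta * reg

def update_Qm(Qm, grad, eta):
    """One gradient step followed by QR-based row orthonormalization."""
    Qm = Qm - eta * grad
    # QR of Q_m^T yields orthonormal columns; transposing gives Q_m Q_m^T = I.
    Qt, _ = np.linalg.qr(Qm.T)
    return Qt.T

rng = np.random.default_rng(0)
X_list = [rng.normal(size=(4, 5)), rng.normal(size=(3, 5))]  # D_1 = 4, D_2 = 3
Q_list = [rng.normal(size=(2, 4)), rng.normal(size=(2, 3))]  # d = 2
alpha_list = [np.full(5, 0.1), np.full(5, 0.1)]              # sums to one
g = grad_Qm(0, X_list, Q_list, alpha_list, beta=0.1)
Q_new = update_Qm(Q_list[0], g, eta=0.01)
```

The QR step requires the subspace dimensionality to be at most the modality dimensionality, so that the rows of the projection matrix can be made orthonormal.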

B. Non-linear MS-SVDD
For non-linear mapping from the original feature spaces to a new shared feature space, we use two approaches. The first approach is based on the standard kernel trick [35] and the second on the Nonlinear Projection Trick (NPT) [53], which is used as a computationally lighter alternative to the kernel trick.
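1) Standard kernel trick: In the non-linear data description, the original data is mapped to a kernel space $\mathcal{F}$ using a nonlinear function $\phi(\cdot)$ such that $x_{m,i} \in \mathbb{R}^{D_m} \rightarrow \phi(x_{m,i}) \in \mathcal{F}$. The kernel space dimensionality can possibly be infinite. Then the data is projected from the kernel space to $\mathbb{R}^d$ as
$$y_{m,i} = Q_m \phi(x_{m,i}). \tag{29}$$
In order to calculate $y_{m,i}$, we use the so-called kernel trick by expressing the projection matrix $Q_m$ as a linear combination of the training data representations of the respective modality in the kernel space $\mathcal{F}$, leading to
$$y_{m,i} = W_m \Phi_m^T \phi(x_{m,i}) = W_m k_{m,i}, \tag{30}$$
where $\Phi_m \in \mathbb{R}^{|\mathcal{F}| \times N}$ is a matrix formed in $\mathcal{F}$ containing the training data representations of modality $m$, $W_m \in \mathbb{R}^{d \times N}$ is a matrix containing the weights for $\Phi_m$ needed to form $Q_m$, and $k_{m,i}$ is the $i$-th column of the Gramian matrix, also called the kernel matrix, $K_m \in \mathbb{R}^{N \times N}$, having elements equal to $K_{m,ij} = \phi(x_{m,i})^T \phi(x_{m,j})$. In our experiments, we use the Radial Basis Function (RBF) kernel, given by
$$K_{m,ij} = \exp\left( \frac{-\lVert x_{m,i} - x_{m,j} \rVert_2^2}{2\sigma^2} \right), \tag{31}$$
where $\sigma > 0$ is a hyperparameter that determines the width of the kernel. The augmented version of the Lagrangian function now takes the following form:
$$L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} k_{m,i}^T W_m^T W_m k_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} k_{m,i}^T W_m^T W_n k_{n,j} + \beta \omega. \tag{32}$$
The $\alpha$'s are calculated by optimizing (10) with the $W_m$'s fixed, i.e., by applying SVDD in the subspace. In the second step, the $\alpha$'s are fixed and the $W_m$'s are updated with gradient descent:
$$W_m \leftarrow W_m - \eta \nabla L, \tag{33}$$
where the gradient is calculated as
$$\nabla L = 2 \sum_{i=1}^{N} \alpha_{m,i} W_m k_{m,i} k_{m,i}^T - 2 \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} W_n k_{n,j} k_{m,i}^T + \beta \Delta\omega. \tag{34}$$
The gradient of the regularization term, $\Delta\omega$, now takes the following forms:
$$\Delta\omega_0 = 0, \tag{35}$$
$$\Delta\omega_1 = 2 W_m K_m K_m^T, \tag{36}$$
$$\Delta\omega_2 = 2 W_m K_m \alpha_m \alpha_m^T K_m^T, \tag{37}$$
$$\Delta\omega_3 = 2 W_m K_m \lambda_m \lambda_m^T K_m^T, \tag{38}$$
$$\Delta\omega_4 = 2 \sum_{n=1}^{M} W_n K_n K_m^T, \tag{39}$$
$$\Delta\omega_5 = 2 \sum_{n=1}^{M} W_n K_n \alpha_n \alpha_m^T K_m^T, \tag{40}$$
$$\Delta\omega_6 = 2 \sum_{n=1}^{M} W_n K_n \lambda_n \lambda_m^T K_m^T. \tag{41}$$
We initialize the matrix $W_m$ for each mode using kernel-PCA. We orthogonalize and normalize $W_m$ at every iteration so that
$$W_m K_m W_m^T = I. \tag{42}$$
We decompose (42) using eigendecomposition as
$$W_m K_m W_m^T = V_m \Lambda_m V_m^T, \qquad \hat{W}_m = \left( \Lambda_m^{1/2} \right)^{+} V_m^T W_m, \tag{43}$$
where $V_m$ and $\Lambda_m$ contain the eigenvectors and eigenvalues, respectively, the $+$ sign denotes the pseudo-inverse, and $\hat{W}_m$ is the normalized projection matrix. For notational simplicity, we set $W_m = \hat{W}_m$.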
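The kernel-based projection can be sketched as follows. The sketch assumes the common RBF parameterization with a factor of two in the denominator of the exponent; the function names `rbf_kernel` and `project_kernel` are hypothetical, and the weight matrix is random here rather than learned.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """RBF kernel between the columns of X (D x N) and Z (D x P); returns N x P.

    Assumes K_ij = exp(-||x_i - z_j||^2 / (2 * sigma^2)).
    """
    sq = (np.sum(X**2, axis=0)[:, None]
          + np.sum(Z**2, axis=0)[None, :]
          - 2.0 * X.T @ Z)
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * sigma**2))

def project_kernel(W, K):
    """Project kernel vectors (columns of K) to the subspace: y = W k."""
    return W @ K

X = np.array([[0.0, 1.0]])            # D = 1, N = 2 training instances
K = rbf_kernel(X, X, sigma=1.0)       # 2 x 2 kernel matrix
W = np.random.default_rng(0).normal(size=(1, 2))  # d = 1, placeholder weights
Y = project_kernel(W, K)              # 1 x 2 projected training data
```

Because the projection only ever touches the data through the kernel matrix, the possibly infinite-dimensional kernel space never has to be formed explicitly.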
2) Nonlinear Projection Trick: The non-linear MS-SVDD using the kernel trick requires computing the eigendecomposition (43) at every iteration. This is computationally expensive and, therefore, we propose an alternative non-linear approach using NPT [53]. Here, a non-linear mapping is applied only at the beginning of the process, while the optimization follows the linear MS-SVDD. In the NPT-based MS-SVDD, we first compute the kernel matrix $K_m$ using (31). In the next step, the computed kernel matrix is centralized as
$$\hat{K}_m = (I - E_N) K_m (I - E_N), \tag{44}$$
where $\hat{K}_m$ is the centralized kernel matrix and $E_N$ is an $N \times N$ matrix defined as
$$E_N = \frac{1}{N} 1_N 1_N^T, \tag{45}$$
where $1_N \in \mathbb{R}^N$ is a vector with each element having the value 1. The centralized matrix $\hat{K}_m$ is decomposed by using eigendecomposition,
$$\hat{K}_m = U_m A_m U_m^T, \tag{46}$$
where $A_m$ contains the non-negative eigenvalues of the centered kernel matrix and $U_m$ contains the corresponding eigenvectors. The data in the reduced dimensional kernel space is obtained as
$$\Phi_m = \left( A_m^{1/2} \right)^{+} U_m^{+} \hat{K}_m. \tag{47}$$
Since we consider NPT as a pure preprocessing step, we continue by considering $\Phi_m$ as our input data, i.e., we set $X_m = \Phi_m$. Then we follow the linear MS-SVDD.
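The NPT preprocessing described above (centering, eigendecomposition, and mapping to an explicit reduced kernel space) can be sketched as follows. The function name `npt_features` and the eigenvalue threshold `tol` are illustrative assumptions; the mapped data is computed in the equivalent form of scaled eigenvectors, i.e., the square roots of the retained eigenvalues multiplied by the transposed eigenvectors.

```python
import numpy as np

def npt_features(K, tol=1e-10):
    """Return explicit features Phi (r x N) such that Phi^T Phi ~ centered K."""
    N = K.shape[0]
    E = np.full((N, N), 1.0 / N)       # E_N = (1/N) * 1 1^T
    I = np.eye(N)
    K_hat = (I - E) @ K @ (I - E)      # centered kernel matrix
    eigvals, U = np.linalg.eigh(K_hat) # symmetric eigendecomposition
    keep = eigvals > tol               # keep strictly positive eigenvalues only
    Phi = np.sqrt(eigvals[keep])[:, None] * U[:, keep].T
    return Phi, K_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))            # D = 3, N = 6
K = X.T @ X                            # a linear kernel keeps the check simple
Phi, K_hat = npt_features(K)
```

After this step, `Phi` plays the role of the input data matrix and the linear method can be applied unchanged, which is what makes NPT cheaper than repeating an eigendecomposition at every iteration.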

C. Test Phase
During the test phase, an instance $x_{m*} \in \mathbb{R}^{D_m}$ (the $*$ in the subscript denotes a test instance) coming from modality $m$ is projected to the common $d$-dimensional subspace using (2) for the linear case. For the kernel case, first the kernel vector is computed as
$$k_{m*} = \Phi_m^T \phi(x_{m*}), \tag{48}$$
and then projected to the common $d$-dimensional subspace using (30). For NPT, first the kernel vector $k_{m*}$ is computed and then centralized as
$$\hat{k}_{m*} = (I - E_N) \left[ k_{m*} - \frac{1}{N} K_m 1_N \right]. \tag{49}$$
The centralized kernel vector is then mapped to
$$\phi_{m*} = \left( \Phi_m^T \right)^{+} \hat{k}_{m*}, \tag{50}$$
and then to the $d$-dimensional subspace using (2) (for notational simplicity, $\phi_{m*}$ is considered as $x_{m*}$). The decision to classify the test instance $y_{m*}$ as positive or negative is taken on the basis of its distance from the center of the hypersphere, i.e.,
$$\lVert y_{m*} - a \rVert_2^2. \tag{51}$$
The representation $y_{m*}$ is assigned to the positive class when $\lVert y_{m*} - a \rVert_2^2 \le R^2$ and to the negative class if $\lVert y_{m*} - a \rVert_2^2 > R^2$, where $R^2$ is the squared distance from the center $a$ to any support vector on the boundary,
$$R^2 = \lVert v - a \rVert_2^2, \tag{52}$$
where $v$ is the projected representation of any support vector in the training set whose corresponding $\alpha$ has a value $0 < \alpha < C$. Since the items are represented by $M$ different modalities, the final decision for assigning the item to a particular class (either positive or negative) is taken using the different strategies explained in Section IV-C.
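For the linear case, the per-modality test decision can be sketched as follows. The function name `classify` and the toy projection, center, and radius are hypothetical; in practice these quantities would come from training.

```python
import numpy as np

def classify(x, Q, a, R2):
    """Project a test instance and compare its squared distance to the center with R^2."""
    y = Q @ x                            # projection to the shared subspace
    dist2 = float(np.sum((y - a) ** 2))  # squared Euclidean distance to the center
    return "positive" if dist2 <= R2 else "negative"

Q = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])          # d = 2, D_m = 3 (toy projection)
a = np.zeros(2)                          # toy center
R2 = 1.0                                 # toy squared radius
inside = classify(np.array([0.5, 0.5, 9.0]), Q, a, R2)
outside = classify(np.array([2.0, 0.0, 0.0]), Q, a, R2)
```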

IV. EXPERIMENTS

A. Datasets and preprocessing
To evaluate the proposed method, we used two different datasets downloaded from the UC Irvine (UCI) machine learning repository [50]. The first set of experiments was performed on the Robot Execution Failures dataset [50], [54]. In the Robot Execution Failures dataset, force and torque measurements are collected at regular intervals of time after a task failure is detected. The dataset is divided into five different learning problems (LP) corresponding to different triggering events (Table I). All instances are given as 15 samples collected at regular 315 ms time intervals for each sensor. For this dataset, we consider all the instances belonging to the normal class as the target class and the remaining classes as the non-target data. Hence, we have two modalities (torque and force measurements) and we consider it as a one-class classification problem.
The second set of experiments was performed on the Single Photon Emission Computed Tomography (SPECTF) heart dataset [55]. The SPECTF heart dataset consists of two sets of features corresponding to rest and stress condition SPECTF images of different subjects. The training set consists of 40 examples diagnosed as healthy heart muscle perfusions and 40 diagnosed as pathological perfusions. The test set consists of 15 instances of healthy heart muscle perfusions and 172 instances diagnosed as pathological perfusions. We convert this to a multimodal one-class classification problem by considering the rest and stress conditions as different modalities and by selecting the healthy heart muscle perfusions as our target class.

B. Experimental setup
For the Robot Execution Failures dataset, we performed our experiments on a 70-30% split into training and testing sets. We selected the 70-30% split randomly 5 times, keeping the distribution of classes similar to the original data. To tune the hyperparameters for final testing, we did a 5-fold cross-validation on the training set, where the (70%) training data is divided into 5 different sets and each time one set is used for testing while all the others are used for training. The process was repeated 5 times until all the sets had been used as test sets. For the SPECTF heart dataset, the train and test sets are given with the dataset. Therefore, we did a 5-fold cross-validation on the training set to optimize the hyperparameters.
With both datasets, the models were trained by using samples from the positive class only, while testing was carried out using all the classes. The hyperparameters were selected from predefined ranges. For the proposed method, we restricted the dimensionality $d$ of the shared subspace to $d < \min\{D_1, \dots, D_M\}$ for a given dataset, where $D_m$ is the dimensionality of modality $m$. For competing methods, the features from different modalities were concatenated before training the model. We also report the results of the competing methods by considering data from one modality at a time for training and testing.

C. Decision strategies
During testing, after the common compact representation of all modalities is formed, each representation (modality) of an instance is mapped to the lower-dimensional subspace via the corresponding projection matrix and classified as described in Section III-C. The following four strategies are used to decide the final class for the instance:
• Decision strategy 1 (also called the AND gate): The test instance is assigned the target label if the representations from all modalities for that particular instance are classified to the target class, and the non-target label otherwise.
• Decision strategy 2 (also called the OR gate): The final decision is taken on the basis of the OR gate principle, i.e., if a representation of an instance from any of the modalities is classified to the target class, the overall decision for that particular instance is taken in favor of the target class.
• Decision strategy 3: The final classification decision is made on the basis of the first modality, i.e., the overall classification follows the class assigned to the representation from the first modality.
• Decision strategy 4: The overall decision is taken on the basis of the label assigned to the representation from the second modality.
It should be noted that for more than two modalities, different decision strategies, such as majority vote, might be more suitable.
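The four strategies above can be summarized in a short sketch. The function name `final_decision` and the boolean encoding of per-modality labels (True for the target class) are illustrative assumptions.

```python
def final_decision(labels, strategy):
    """Combine per-modality labels (True = target class) into one decision.

    labels   : list of booleans, one per modality, in modality order
    strategy : 1 (AND gate), 2 (OR gate), 3 (first modality), 4 (second modality)
    """
    if strategy == 1:
        return all(labels)   # target only if every modality says target
    if strategy == 2:
        return any(labels)   # target if at least one modality says target
    if strategy == 3:
        return labels[0]     # follow the first modality
    if strategy == 4:
        return labels[1]     # follow the second modality
    raise ValueError("strategy must be 1, 2, 3, or 4")

# An item accepted by the first modality but rejected by the second.
decisions = [final_decision([True, False], s) for s in (1, 2, 3, 4)]
```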

D. Evaluation criteria
One-class classification models can be evaluated using different metrics. These metrics are decided on the basis of the goals of a given application. For example, in outlier detection, the focus is on detecting negative instances accurately. The most common metrics in one-class classification are the true positive rate (tpr) and the true negative rate (tnr). The former, also called recall, sensitivity, or hit rate, is the proportion of positive instances that are correctly classified as positive by the trained model:
$$tpr = \frac{tp}{p}, \tag{53}$$
where $tp$ is the number of positive samples classified correctly and $p$ is the total number of positive samples in the test set. The latter, tnr, also called specificity, is defined as
$$tnr = \frac{tn}{n}, \tag{54}$$
where $tn$ is the number of negative samples classified correctly and $n$ is the total number of negative samples in the test set. Accuracy (accu) is measured as the ratio of the number of correctly classified instances to the total number of instances:
$$accu = \frac{tp + tn}{p + n}. \tag{55}$$
Precision (pre) measures the proportion of instances classified as positive which really are positive:
$$pre = \frac{tp}{tp + fp}, \tag{56}$$
where $fp$ is the number of false positives. Another useful measure is the $F_1$ measure, which is the harmonic mean of pre and tpr:
$$F_1 = 2 \times \frac{pre \times tpr}{pre + tpr}. \tag{57}$$
The geometric mean (Gmean) is defined as the square root of the product of sensitivity and specificity:
$$Gmean = \sqrt{tpr \times tnr}. \tag{58}$$
Gmean has been used by many researchers for imbalanced datasets. Since it takes into consideration both sensitivity and specificity, we opted to fine-tune the hyperparameters based on the Gmean score on the validation data.
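The metrics defined above can be computed with a few lines of plain Python. The function name `one_class_metrics`, the 1/0 label encoding (1 for the positive class), and the convention of returning 0.0 when a denominator is zero are illustrative assumptions.

```python
from math import sqrt

def one_class_metrics(y_true, y_pred):
    """Compute tpr, tnr, accu, pre, F1, and Gmean from 1/0 labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(1 for t in y_true if t == 1)   # total positives in the test set
    neg = len(y_true) - pos                  # total negatives in the test set

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    tpr = ratio(tp, pos)
    tnr = ratio(tn, neg)
    accu = ratio(tp + tn, pos + neg)
    pre = ratio(tp, tp + fp)
    f1 = ratio(2 * pre * tpr, pre + tpr)
    gmean = sqrt(tpr * tnr)
    return {"tpr": tpr, "tnr": tnr, "accu": accu,
            "pre": pre, "f1": f1, "gmean": gmean}

scores = one_class_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Because Gmean is a product of the two class-wise rates, it drops to zero whenever a model rejects everything or accepts everything, which makes it a stricter model-selection criterion than accuracy on imbalanced test sets.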

E. Experimental results and discussion
In Table II, we report the average of different evaluating metrics over the five data splits for Robot Execution Failures dataset for both linear and non-linear (kernel) versions of the applied methods. In Table III, we report the results on the test set for SPECTF heart dataset. In these tables, we only show the best performing versions of the proposed method along with all competing methods. We compare our results with OC-SVM [39], SVDD [38] and recently proposed S-SVDD [52]. In S-SVDD, different regularization terms (ψ's) were proposed and, hence, we compare MS-SVDD with all proposed regularization terms of S-SVDD. In order to analyze the different regularization terms and decision strategies for  the proposed method, we also report the exhaustive results obtained by different settings in Tables IV and V. For Robot Execution Failures dataset (Table II), our proposed method outperforms all the competing methods in the linear case. The results achieved by the linear version of the proposed MS-SVDD method are overall best also compared to the non-linear methods. Table IV shows that using decision strategy 3 with constraints ω 1 (all representations from the corresponding modality considered) or ω 2 (all support vectors and outliers from the corresponding modality considered) for the update of the corresponding Q m yields the best overall results for the robot dataset. In the non-linear case, the best performance for the proposed method is achieved by using the kernel trick with either constraint type ω 2 or ω 5 , both with decision strategy 3. Even though the non-linear versions using the kernel trick achieved better results for few constraints and decision types as compared to the non-linear version with NPT, NPT was more robust overall. The worst results drop only to 0.86 in terms of Gmean compared to 0.17 occurring for the kernel version.
It is also evident that the first modality (force measurements) is vital for taking the final decision, as in both the linear and non-linear cases the best results are obtained when the decision is based on the first modality (decision strategy 3). The results of the competing methods also show that the best results are obtained when using the force measurements only. The results on the concatenated features are slightly worse, and the results using the torque measurements are clearly worse. Nevertheless, the proposed multimodal approach manages to boost the results by combining information from both modalities.
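To make the role of the decision strategy concrete, the sketch below combines per-modality decisions for one test sample. It assumes that each modality's projected representation is accepted when it lies inside the shared hypersphere (a center and a radius), and it contrasts a single-modality decision, as in decision strategy 3, with AND and OR gate rules. All function names, the strategy labels, and the toy values are hypothetical and chosen only for illustration.

```python
import math


def inside_sphere(z, center, radius):
    """Accept a projected sample z if it lies within the hypersphere.

    Assumption of this sketch: the data description is a sphere given by
    `center` and `radius` in the shared subspace.
    """
    return math.dist(z, center) <= radius


def combine_decisions(decisions, strategy, modality=0):
    """Combine per-modality accept/reject decisions for one sample.

    decisions: list of booleans, one per modality.
    strategy:  "and"    -> accept only if every modality accepts,
               "or"     -> accept if at least one modality accepts,
               "single" -> use only the modality at index `modality`.
    These labels are hypothetical names, not the paper's numbering.
    """
    if strategy == "and":
        return all(decisions)
    if strategy == "or":
        return any(decisions)
    if strategy == "single":
        return decisions[modality]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    center, radius = (0.0, 0.0), 1.0       # toy description in a 2-D subspace
    z_force = (0.5, 0.0)                   # toy projection of modality 1
    z_torque = (3.0, 3.0)                  # toy projection of modality 2
    decisions = [inside_sphere(z_force, center, radius),
                 inside_sphere(z_torque, center, radius)]
    for strategy in ("and", "or", "single"):
        print(strategy, combine_decisions(decisions, strategy))
```

When one modality is far more informative than the other, the single-modality rule avoids letting the weaker modality veto (AND) or override (OR) the stronger one.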
For the SPECTF heart dataset, the best results in the linear case are achieved by MS-SVDD and S-SVDD. In the kernel case, the best result is achieved by MS-SVDD. Even though the best performances of linear MS-SVDD, kernel MS-SVDD, and linear S-SVDD are the same in terms of Gmean, MS-SVDD is superior in terms of other important metrics such as precision and F1 score. Moreover, through the choice of regularization term and decision strategy, MS-SVDD gives more freedom to select a model that suits the metrics desired in a specific application.
Overall, on both datasets, NPT is found to be more robust than the kernel version. The linear MS-SVDD performed best overall or as well as kernel MS-SVDD. When the constraint type and decision strategy are fixed, there are a few cases where the non-linear methods yield much better results than the linear methods. For example, with constraint ω 2 and decision strategy 1 on the SPECTF heart dataset, the overall performance of linear MS-SVDD drops to 0.42 (Gmean), whereas the kernel version with the same setup yields the overall best result.

V. CONCLUSION
In this paper, a new multimodal one-class classification method was proposed. The proposed method iteratively transforms data from all the modalities to a new shared subspace optimized for data description in multimodal one-class classification tasks. We derived a linear version and two different non-linear versions, along with a selection of different regularization terms. To the best of our knowledge, this is the first work in the field of subspace learning for multimodal one-class classification. We conducted experiments comparing the different versions of MS-SVDD and performed comparisons against other one-class classification methods using either concatenated representations or a single modality at a time.
In most cases, linear MS-SVDD outperformed all the non-linear methods in our experiments. NPT turned out to be more stable than the kernel version. We noticed that the optimal decision strategy depends on the usefulness of the different modalities. If a certain modality is more informative than the other(s), then it is useful to base the final decision on that particular modality. Nevertheless, MS-SVDD can improve the results compared to using a single modality only. If the modalities are more balanced, the AND or OR gate strategies are better. In the former case, ω 2 (all support vectors and outliers are used to describe the class variance for the update of the corresponding projection matrix) seems to be a good constraint. In the latter case, the constraints make a smaller difference.
MS-SVDD can be interpreted and used in many ways for different one-class multimodal problems. It can be used for anomaly detection and for detecting a specific class, as in speaker verification and face recognition. In the future, we intend to try different kernels and model-based decision strategies for the proposed method. We also intend to propose changes in the boundary shape (other than spherical) for enclosing the target data in the subspace. There is also room for research on other one-class classification techniques for multimodal subspace learning.