Supervised projection pursuit - A dimensionality reduction technique optimized for probabilistic classification

An important step in multivariate analysis is the dimensionality reduction, which allows for a better classi ﬁ cation and easier visualization of the class structures in the data. Techniques like PCA, PLS-DA and LDA are most often used to explore the patterns in the data and to reduce the dimensions. Yet the data does not always reveal properly the structures wen these techniques are applied. To this end, a supervised projection pursuit (SuPP) is proposed in this article, based on Jensen-Shannon divergence. The combination of this metric with powerful Monte Carlo based optimization algorithm, yielded a versatile dimensionality reduction technique capable of working with highly dimensional data and missing observations. Combined with Naïve Bayes (NB) classi ﬁ er, SuPP proved to be a powerful preprocessing tool for classi ﬁ cation. Namely, on the Iris data set, the prediction accuracy of SuPP-NB is signi ﬁ cantly higher than the prediction accuracy of PCA-NB, (p-value (cid:1) 4.02E-05 in a 2D latent space, p-value (cid:1) 3.00E-03 in a 3D latent space) and signi ﬁ cantly higher than the prediction accuracy of PLS-DA (p-value (cid:1) 1.17E-05 in a 2D latent space and p-value (cid:1) 3.08E-03 in a 3D latent space). The signi ﬁ cantly higher accuracy for this particular data set is a strong evidence of a better class separation in the latent spaces obtained with SuPP.


Introduction
Dimensionality reduction (DR) techniques, also referred to as projection methods, are perhaps the most used exploratory tools for applications in various fields, from image analysis and information retrieval to bioinformatics and chemometrics. The main reason for such an extensive use of these techniques is the set of benefits that these are bringing for data analysis: (i) possibility to plot and visualize data and potential structures in the data in lower dimensions, (ii) possibility to apply stochastic models, (iii) capacity to solve the "curse of dimensionality" and (iv) facilitation of prediction and classification of the new data sets (i.e. query data sets with unknown class labels). The projection techniques can be classified into three major groups according to the way the latent components are obtained: supervised (i.e. considers class labels for the deduction of the latent components and for further classification), semisupervised (i.e. uses both labeled and unlabeled samples to infer class structures in the latent space) and unsupervised (i.e. class labels are not available and are yet to be found from the structural patterns of the projected data or the class labels are simply not used). Each of these three major types of DR methods can be further divided into "linear" and "nonlinear" methods. A myriad of methods emerged in the past two decades, and listing all of them would be a task for a detailed review. We ought to mention a few however from each category that are more recent or more used across different domains.
From the unsupervised category and the subcategory of linear methods, the most commonly used are principal component analysis (PCA) [1,2], projection pursuit (PP) [3][4][5] and independent component analysis (ICA) [6]. In the non-linear subcategory, it is worth mentioning the Kernel PCA (KPCA), local linear embedding (LLE) and isometric mapping (Isomap) [7]. The semi-supervised methods are a relatively new form of DR. Hou et al. described a multiple view semi-supervised DR (MVSSDR) [8] and more recently, Mikalsen et al. introduced a noisy multi-label semi-supervised DR (NMLSDR), which solves the problems of missing labels and noisy labels [9]. From the supervised methods, the most used are partial least squares (PLS) [10,11], orthogonal projection to latent structures (OPLS), including the combination with discriminant analysis of both (i.e. PLS-DA and OPLS-DA), Fisher's discriminant analysis or linear discriminant analysis (LDA) [12], heteroscedastic discriminant analysis (HDA) [13], regularized discriminant analysis (RDA) [14] and the regularized coplanar discriminant analysis (RCDA) introduced by Huang et al. [15]. In the non-linear subcategory of the supervised DR, € Ornek and Vural [16] recently introduced the smooth regular embedding (SRE) and Raducanu and Dornika proposed a supervised version of the linear embedding DR (SLE) [17]. Pires et al. [18,19] and Lee et al. [20] described a combination between PP and LDA thus including the PP technique also in the group of the supervised projection techniques.
There is a considerable amount of research directed on comparison between non-linear DR methods and the linear ones [21][22][23][24]. Most of them outlined that the non-linear DR techniques are much slower than the linear ones and that the non-linear methods are having a better performance than the linear ones on synthetic data sets and poorer performance on real data sets. It must be added here that one of the major concerns with the non-linear methods is the interpretability of the latent space (i.e. the latent manifold). In the linear DR methods, a component can express the percentage of variability for example between various species of flowers and/or the variance within one specie (e.g. as the principal components in PCA and OPLS). Sometimes the latent component can express the direction of the maximal effect size, as in the case of LDA. It is the interpretability that gives us the possibility to extract certain important variables by the degree to which these are contributing to the interpreted property. Such variable selection procedures are the magnitude of the loadings in the PCA and so-called variable importance on projection (VIP) in the case of PLS (and OPLS). On the other hand, the eigenvalues of a manifold or the components obtained with a non-linear DR are much harder to interpret for a real world example. However, the linear methods have their own shortcomings. Namely, in order for the LDA to be properly applied the projected data have to comply with homoscedasticity and normality. This may be a non-realistic requirement especially for the biological data where outliers and biological variability are often present. Several options emerged to solve these limitations of LDA among which are the HDA and Multimodal Oriented Discriminant Analysis (MODA). The latter is based on the maximization of the pairwise, symmetric version of Kullback-Leibler (KL) divergence between the distributions of the classes. Abou-Mustafa pointed to the limitation of the total pairwise KL divergence maximization approach and further extended MODA to Pareto discriminant analysis (PARDA) [25]. PARDA is based on Pareto multi-objective optimization considering all pairwise KL divergences at the same time. Although an elegant solution, the Pareto optimization strategy is not necessary when a different divergence is employed, namely Jensen-Shannon divergence, which is the core of the method proposed in this article. Another limitation of LDA is the dependency between the number of classes and the number of latent components, i.e. the latter has to be strictly smaller than the number of the classes, unlike PCA or PLS, where maximum number of components is equal to the number of attributes. Recently, Gromski et al. published a review pointing to the weaknesses of PLS-DA, suggesting the use of alternative methods such as Principal Component-Discriminative Feature Analysis (PC-DFA) [26]. Several serious disadvantages of PLS-DA were also previously reported in Ref. [27]. For example, when the number of sample points is considerably smaller than the number of attributes (i.e. variables), PLS-DA can give a good classification just by chance (i.e. overfitting).
In this article is presented an adaptation of projection pursuit to a supervised way of projection analysis by means of entropic divergence measurement, namely Jensen-Shannon divergence. The technique will be referred to as SuPP (i.e. the acronym for Supervised Projection Pursuit) for convenience. The benefits of the SuPP are: (i) maximization of the distances between all the classes simultaneously on each projection, (ii) determination of latent components using a non-parametric approach, (iii) minimization of the Bayes classification error and (iv) capability to handle missing data.
The performance of the proposed strategy is assessed using real data sets at different training-to-test set ratios. Comparative assessment is made using the most applied dimensionality reduction techniques from the same category (i.e. linear DR): PCA, PLS-DA and LDA. The assessment also includes the capability of working with missing data.

Entropic projection index
In 1973, Friedman and Tukey [3] explored the idea of Kruskal [28] and outlined a strategy for PP. The core of their method is the search of the projections based on a so-called projection index (PI) that would quantify the "usefulness" or "interestingness" of the projections. Later, Huber [29] generalized the PI and was the first to propose standardized negative Shannon entropy as a PI. Huber's definition of "interestingness" of a projection is linked to the concept of normality of the projection. Huber stated that the more convoluted are the projected clusters, the more normal is the projection and thus less interesting. However, Huber also mentioned that the entropy is by far not the only possibility for a PI and listed a number of possible PIs including Fisher's information and kurtosis. Jones and Sibson [30] proposed to use kernel density estimation of the projected data and, using R enyi generalization of the α-order entropy [31], further shaped the choice of entropic index into an even more rigorous mathematical form. To be more specific, let's consider the projection of the vector X ¼ ½X 1 ; X 2 ; …; X M onto a unit vector b k ¼ ½k 1 ; k 2 ; …; k M : Here χ is the set of projected points onto direction vector b k. Huber's entropic index, according to Jones and Sibson, takes the following form: where b f ðχÞ is the kernel density estimation of the projected data χ. The PI from eq. (2) can take positive and negative values and as such cannot be considered an information metric. Shannon entropy quantifies the information of a system, and is calculated using discrete probability distribution: where p i ¼ p i ðχÞ is the probability distribution of a discrete random variable χ. The notation HðχÞ can be sometimes written as HðpÞ.
The aim of SuPP is to find the projection that would yield the maximum information from the projected data with the use of group labels and Shannon entropy (SE).

Supervised adaptation using Jensen-Shannon divergence
In a supervised method, the labels of the classes are available and the "interesting projection" is defined as the projection that yields the best class separation. In other words, the distances between the classes are maximized for a maximal PI. This can be achieved using Jensen-Shannon divergence defined as follows. Let's consider the class labels Y C ¼ ½y 11 ; y 21 ; …; y jc ; …; y NC ; c ¼ 1; 2; …; C coming from a total of C classes and associated to each data point in the vector X. In this case, eq. (2) takes the form of the SE of the mixture of C class distributions. Subtracting the weighted summation of the entropies for each class distribution from the entropy of the mixture distribution, we obtain the Jensen-Shannon divergence (JSD): Here, π ¼ ½π 1 ; π 2 ; …; π C is the vector of weights associated to each class distribution. pðχjcÞ is the discrete probability distribution of the projected data associated to the class c and is calculated using the histograms of the projected data corresponding to a class c. Unlike the differential entropy that can take both positive and negative values, SE can take only positive values and consequently has convenient properties (e.g. is a metric of information, Jensen's inequality can be applied etc). However, as shown in eq. (3), SE is only applicable on discrete variables. In order to accommodate for both continuous and discrete variables, in SuPP a discretized kernel density estimation (KDE) is used. More specifically, a kernel density is obtained using Gaussian KDE and further thinly discretized and normalized to the summation of all the discrete values. Employing the discretized KDE would result in a smoother objective function (Fig. 1). In the example from Fig. 1, two simulated classes were used to illustrate the effect of the KDE of the probability density function. The centroids of the sample groups were spread apart from one another at a smaller distance (the upper subplots) and at a higher distances (the lower plots). The two subplots on the right represent the rough JSD surface obtained using only a histogram for the estimation of the probability density function. The left subplots show the smoothing effect of the application of the KDE for the calculation of JSD. JSD is a distance metric that measures the divergence between distributions and therefore is a good choice for a PI [32] when the maximal separation between the projected class distributions is the defining criterion of the latent vector's orientation. This order-1 divergence [33], is particularly interesting when the main aim is to preprocess the data for further probabilistic classification. In fact, Lin showed in Ref. [34] the link between JSD and Bayes error of classification PðeÞ. Lin pointed to the fact that the upper bound of the Bayes error has the following expression: Here, HðπÞ is the entropy of the prior probabilities for each class. For equal weights and uniform prior probability assumption pðcÞeUð0; CÞ, the weights are equal to the prior probabilities. For this work, only the equal weights or equal priors on the classes are used. The goal of SuPP is to find the projection that maximizes the divergence between the probability density functions of the projected data associated to each class (i.e.pðχjcÞ) while minimizing the upper bound of the Bayes error. From eq. (5) it fallows that both objectives can be achieved by  The KDE estimation has another beneficial effect on the projected data χsmoothing of an incomplete histogram. This effect is notable when there are missing values in the data set. Fig. 2 illustrates the histograms and the estimated densities at different rates of missing values (i.e. 0%, 1%, 5%, 10% and 30%). The values were extracted from random locations in the data set and the pursued projection is the one that maximizes the JSD. Note that below 30% the estimated densities do not vary significantly from the 0% missing value rate. At the 30% rate of missing values, the distributions of the two simulated classes appear closer to each other. For a good separation between the classes, this effect plays a minor role on the JSD calculation and the latent component would be close to the one obtained in the rest of the missing value cases.

The optimization function and the SuPP algorithm
The latent components in SuPP are found sequentially, a strategy suggested also by Friedman and Tukey in the unsupervised PP. By using the PI from eq. (4) in an objective function, one can employ an optimization algorithm to find the latent component that would maximize the distances between the class distributions. Jones and Sibson mentioned in their work that the optimization criteria for the PP are: (1) Achieving maximum PI (in this case maximizing the divergence D JS ) (2) Finding the most orthogonal projection space Previous publications on PP suggested that it is worth looking for several interesting projections. This aspect of PP is considered, in fact, its strength as opposed to other dimensionality reduction techniques which have only one solution per component. Furthermore, Huber stated in his publication that "orthogonal directions do not suffice, the interesting directions may be oblique to each other" [29]. This suggests, on one hand, that a sequential direction-wise pursuit is preferred to the projection on a multidimensional orthogonal space. On the other hand, the same statement suggests that a penalty coefficient may be used on the second objective listed above, namely on the orthogonality (i.e. to allow the exploration of oblique components). Using these criteria, the optimization function proposed here, is: Here, f ðD JS Þ represents a function of variable D JS described in eq. (4), the f ðO β Þ is a function of the average absolute pairwise orthogonality O: with d < z D, d 2 ½1; D À1; z 2 ½2; D are the indices of the latent vectors corresponding to one dimension and D is the dimension of the final latent space on which the data is projected. The coefficient γ is meant to weigh the importance of the orthogonality. If this parameter would be γ ¼ 0 for D > 1, the optimization function may return, as the most optimal solution, the same latent component as in the case of D ¼ 1.
In higher dimensions, γ allows for more oblique components (for lower values of γ) or more orthogonal components (for higher values of γ).
Intuitively, there is no need for the orthogonality term when applied in unidimensional space, thus for D ¼ 1, γf ðO β Þ ¼ 0. Note that D JS and O both depend on the unit vector b k through the projected vector χ.
The optimal f ðD α JS Þ is achieved when D JS approaches its upper limit ε (i.e. achieving maximum divergence between the class distributions).
Following Lin's work [34], for equal class weights, ε ¼ log 2 C. For the case where the entropy terms in eq. (4) are calculated using natural logarithm, the logarithm in the expression of ε changes into a natural logarithm.
Accounting for all the aspects mention above, a good choice for the optimization function is the following: 8 > > > > < > > > > : For an extensive exploration of different local minima, algorithms like genetic algorithm and global search are proposed in Ref. [5] over a simple gradient descent optimization. For the case of large variables-to-samples ratio, Hou and Wentzell suggested in Ref. [35] a regularized objective function for an unsupervised PP regression. For SuPP, a Monte-Carlo based global optimization algorithm developed by Li and Scheraga (also known as "basin-hopping algorithm" or BH) was used [36]. A succinct description of BH algorithm is as follows.
Step 1. Generate a set of points at random in the variable space (Xi) Step 2. Apply a local search algorithm to generate a local minimum (Yi) Step 3. Apply a perturbation to Yi to get a new Xiþ1 and re-apply the local search on the Xiþ1 Step 4. If Objective (Yiþ1) < Objective (Yi), retain Yiþ1 and increment i Steps 3 and 4 are applied in a while loop until a stopping criterion is met (e.g. reaching a limited number of iteration).
The local search method or local minimization algorithm used in the steps listed above is based on linear interpolation modelling of the objective function (i.e. COBYLA) developed by Powell in 1994 [37]. The Objective mentioned in the 4th step is the objective function expressed in eq. (8). The reason behind the choice of these two algorithms is, for one part, the extensive (local and global) search capability of the BH algorithm and for the other part, the capacity of COBYLA to work without derivatives. The latter property is required for SuPP due to non-parametric nature of kernel-estimated densities.
For python 3.0 and higher, both algorithms, COBYLA and basinhopping, are available in the SciPy (https://www.scipy.org/) framework [38]. The pseudo-code of the SuPP algorithm is shown in Fig. 3. Examples of usage and detailed explanations of the SuPP class are provided in section 3 of the supporting information. The script is available at https://github.com/ABarcaru/SuPP under the Apache v2 license.

Starting point of the optimization algorithm
As many other optimization procedures that have more than one solution, SuPP is, to some extent, sensitive to the starting point. Friedman and Tukey and later Jones and Sibson suggested that using the first principal component can be a good option for a starting vector. However, this may lead to the convergence to the same local minimum which, in case of PP technique, is not always desired as was mentioned previously. Starting with a random vector k increases the chances of exploring new "interesting" projections. Different values of γ can also lead to different and potentially interesting projections in the higher dimensions.

Objectives
For assessment of the SuPP several objectives must be analyzed: the visualization and exploratory potential, optimization of the parameters used by SuPP and the classifier attached, capacity to handle missing data and the assessment of the quality of separation between the classes (i.e. as pre-processing for classification).
Different data sets are used for the purposes aforementioned.

Data
A. Synthetic data sampled from normal distribution consisting of two groups "Control" and "Case" having 200 variables. The data was duplicated and a difference in mean values was added to 20 randomly chosen variables to simulate difference between the groups. The structures of the synthetic data sets are listed in Table 1. The data set A.I is only used to generate data A.II and A.III. B. "Iris data set" or "Fisher's Iris data set" collected and first published in 1935 by Edgar Anderson [39] and later used by Fisher in application of linear discriminant analysis and published in Annals of Eugenics [12]. This data set is made available in scikit-learn library [40]. Iris data set has 150 sample points, consisting of 4 attributes (i.e. sepal length, sepal width, petal length and petal width) and 3 cluster labels (i.e. Iris virginica, Iris setosa and Iris versicolor). This data set is typically used by the machine learning community in testing classification, variable selection, projection and dimensionality reduction algorithms. C. Wine data set, also available in the scikit-learn library. This data set has 178 sample points and 13 attributes and 3 class labels [41].

Data exploration capacity
The exploratory data analysis is based on the visual inspection of the projected data in lower dimensions. The potential of a projection method as an exploratory tool lies in the quality of separation of the classes on the A.Ino difference between the mean of the groups (Δμ), with the same withingroup standard deviation (σ w ) and with the same number of training and validation points (i.e. N T and N V respectively). A.II -added difference between the groups, increased the within-group standard deviation of the group "case".
A.IIIsame as A.II with the exception of the number of training points (i.e. reduced to 25). latent components. Visual estimation of the separation of the classes in the projection space is done typically by plotting the projected data onto a 2D and 3D spaces. However, a good visual estimation of the separation of the classes can be achieved through the plotting using "parallel coordinates". In case of good separation, the classes are well stratified at least on one axis from the parallel coordinates plot. For a linear relationship between the classes and the latent components, the stratification must be observed on all the axes. For the visualization and exploratory potential evaluation of SuPP, data set C is chosen. To quantify the quality of separation of classes, the following metrics were employed: Mahalanobis distance, Bhattacharyya distance and Hellinger distance. Detailed description of each metric is available in the SI, section 1. Each distance was calculated pairwise: Group 1 -Group 2 and Group 2 -Group 3. Statistical significance of the pairwise separation between the groups was estimated using Hotelling's T 2 and F-statistics.

Parameter optimization
Prior to the comparison of SuPP-SVM and SuPP-NB with other techniques, a few parameters must be optimized. More specifically, for SuPP, the penalty on the orthogonality, i.e. γ (eq. (6) and eq. (8)), and for SVM, the influence of a sole training sample (i.e. Γ) and the maximum decision function's margin, i.e. C DFM . The data sets A.II and A.III are used for the optimization of the parameters (i.e. γ, Γ and C DFM ). The penalty γ is drawn from the powers of the number of classes (C). The optimality of γ is indicated by the accuracy of the Naïve Bayes classifier at different levels of γ. In order to optimize the number of Monte Carlo iterations (i.e. niter) of the basin-hopping algorithm, this parameter is increased with different levels (i.e. 100, 500, 700 and 1000) and the accuracy is evaluated 5 times at each increment. For the optimization of the SVM classifier, the accuracy of the prediction of this classifier is assessed at Γ ¼ [0.1, 1, 10] and for each value of Γ, C DFM ¼ [0.01, 1, 100].
Another parameter could be optimized, namely bandwidth of the KDE. However, for this work, the default method of the kernel density bandwidth calculation is used, i.e. bw ¼ nef À 1 dþ4 with nef representing the number of effective points and d ¼ 1 the dimension of the projected data.

Evaluation of the capacity to handle missing data
To illustrate the capacity of SuPP to work with missing values, the synthetic data A.II was used. In the training set, a percentage of the total values was replaced with "NaN" (i.e. 1%, 5%, 10% and 30% missing values of the entire training data set) at different random locations, similar to the case indicated in Fig. 2.

Assessment of SuPP as pre-processing for classification
In order to assess the performance of SuPP as a pre-processing technique aimed for further classification, the data is split into two sets: training set and test set. The training set is used to obtain the latent components and classification models. The test set is projected on the latent components obtained with the training set and the classification model is applied on the projected test set. The accuracy of the classification of the test data set reflects the potential of the projection method. As classifiers, Support Vector Machine (SVM) and Naïve Bayes (NB) are employed due to their superiority over DA or logistic regression classification as mentioned in Ref. [43]. The Radial Basis Function kernel is chosen for the SVM for its robustness to non-linearity in the data.
For this purpose, data sets B, C and D are used. The SuPP-SVM and SuPP-NB are compared with PCA-SVM, PCA-NB, PLS-DA and LDA by comparing the accuracies of the prediction of the test set. To assess how significant is the prediction accuracy, we must first obtain distributions of accuracies. To this end, the method used for testing the quality of separation between the classes is based on cross-validation. More specifically, the algorithm iterates over 3 variables: (i) the number of components (ii) the fraction of data set splitting (i.e. ratio of the test sample size to the training sample size), and (iii) the random state (Fig. S1). For the number of components, maximum of 3 were selected thus exhausting the maximum number of dimensions for visual representation of the complex data. If needed, this number can be further optimized (i.e. increased or decreased) with a double cross validation technique reported in Ref. [44]. The second loop iterates over the ratio between the test set size and the training set size. The algorithm iterates over the range 0.16-0.75 with 5 equidistant values in total. The limits were selected to ensure that the data set containing the smallest number of sample points, in this case adenocarcinoma, has at least 3 points for the test and training respectively. The last loop, having 20 iterations (minimal number recommended for Mann-Whitney U test), ensures the random selection of the data points (i.e. different topology of the points in each random sampling) yielding a distribution of accuracies for each classifier as opposed to a constant accuracy that would originate from the same topology of the data points.
In Fig. S1 is indicated that LDA is applied only for a number of components lower than the number of classes. Also, for a large variableto-sample ratio, the shrinkage parameter, available in scikit LDA class, was set to "auto" to ensure a higher performance of LDA and thus a more fair comparison.
To express the statistical significance of the differences between accuracy distributions obtained with the algorithm outlined in Fig. S1, the Mann-Whitney U Test was applied. The null hypothesis (i.e. H0) states that the accuracy is not different from SuPP-X (X denotes SVM or NB, depending on the case) to the classifier Y (Y denotes any of the following: PCA-SVM, PCA-NB, PLS-DA and LDA) used for comparison. Alternative hypotheses are the following: H1. The SuPP-X accuracy is higher than the accuracy of Y (right-tailed test) H2. The SuPP-X accuracy is different than the accuracy of Y (two-tailed test) H3. The SuPP-X accuracy is lower than the accuracy of Y (left-tailed test) For the cases with the p-value bellow the significance level, the effect sizes (ES) are estimated following the work of Ruscio [45].

Exploratory data analysis
For the assessment of the data visualization in lower dimensions using SuPP, a comparison with the separations obtained using PCA and PLS-DA was made. All three techniques were applied to the data set C described in the previous section. The results are illustrated in Fig. 4, where 2, 3 and 4 latent components of SuPP, PCA and PLS-DA were used to plot the data. From the 4D representation (Fig. 4 g, h and i) it is clear that, in case of the SuPP latent space, the groups are well separated in each of the 4 dimensions. This indicates a linear relationship between the latent components of SuPP and the classes in the data. The 2D representation in Fig. 4 a,b and c, indicates that there is a better separation of the classes in the case of SuPP (Fig. 4 a) while the distribution of the groups in the latent space is similar to those in the cases of PCA (Fig. 4 b) and PLS-DA (Fig. 4 c).
The pairwise distances and the statistical significance of the distances between pairs of classes represented in Fig. 4, are listed in Table 2. The Table 2 Distances between pairs of groups of the projected "Wine" data set onto latent latent SuPP, PCA and PLS-DA spaces. The p-value is obtained from F-statistics and df1, df2 of freedom. MD -Mahalanobis distance, BD -Bhattacharyya Distance, HD -Hellinger Distance, T2 -Hotelling's T 2 , df1 -1st parameter for degrees of freedom and latent components, df2 -2nd parameter for degrees of freedom.
dimension of the latent space is equivalent to the first parameter for the degrees of freedom, i.e. df1. The distances between the classes are significant for all three projection methods (SuPP, PCA and PLS-DA). However, the largest distances were obtained with SuPP (Table 2).

Optimization of the parameters
Detailed discussion on the optimization of the parameters of the SuPP and SVM is outlined in section 2.1 of the SI. From the results listed in Table S2.1 from SI, the penalty seems to play a small role in the accuracy of the prediction for both data sets when D < 3. Increasing D however, makes the effect of γ more evident. Thus, a recommendation is to explore the performance at γ values such as γ ¼ C À4 for a less orthogonal space and γ ¼ C 6 for a more orthogonal space.
In Table S2.2 from SI, are listed the accuracy values of NB classifier and the average standard deviation of the latent components (i.e. "loadings"). The results are indicating a possible convergence of a solution for b k 1 and existence of multiple solutions for b k d with d > 1. In other words, the more iterations are used for higher dimensions, the more probable is to find different interesting projections. Additionally, Fig. S2.1 in SI, where the distributions of the correlation coefficient are plotted, supports this argument. The average accuracy of NB classifier however does not improve with the increased values of niter. Thus, a lower value, between 100 and 500, is recommended for a faster performance.
The optimization of the SVM classifier showed that for a low parsimony model and high accuracy, the combination of C ¼ 1 and Γ ¼ 0.1 is an optimal choice for both data sets A.II and A.III (Table S2.3).

Handling missing data
As mentioned in the theory section using KDE of the projected data is an advantage when working with missing values (i.e. NA or NaN values). More specifically, the projection of the data is made without considering the missing data which will affect the histogram of the projection more or less significantly, depending on the amount and the location distribution of the missing data (Fig. 2). The estimation of the density in this case, plays a corrective (i.e. smoothing) role. The training data set included missing values in random location as described in section 3.3.3. The latent components were further used to project the test data set (Fig. 5). The effect of the missing data on the separation of the classes, in the latent space, was insignificant. This conclusion however must not be generalized to all data sets. The censoring of the data varies from case to case and as such, the effect of the missing values on the projection can sometimes be overwhelming (e.g. when the missing values are not uniformly distributed across the data set and moreover when the important features are missing in large proportions). These however are extreme cases and solutions for such cases are yet to be found.

SuPP as pre-processing for classification
The accuracy distributions obtained using CV algorithm outlined in Fig. S1, are represented in Figs. S4, S5 and S6 from the SI. The median of the accuracies, obtained with each of the methods used for comparison, are plotted in Fig. 6. The continuous lines correspond to SuPP.
For Iris data set the, SuPP is only superseded by LDA, for both 2 and 3 components. For Wine data set, the performance of SuP, in median accuracy, is at least as good as PLS-DA, for D ¼ 2. While LDA is limited in the number of latent components, SuPP and PLS are not and this is reflected in the median accuracy trend for D ¼ 3 in the case of Wine data set. In this case the median accuracy significantly increase for SuPP and PLS.
The p-values (of the alternative hypotheses) resulting from the hypotheses tests described in section 3.3.4 are presented in Table S4 to  Table S7 from the SI. A simplified interpretation of Tables S4-S7 is the  following: if there is a number on the H1 row, there is sufficient evidence that the classifier X, applied in the projected space obtained with SuPP, performs better than the corresponding classifier on column Y. If H2 row contains a number corresponding to a column Y, there is not sufficient evidence to reject the null hypothesis (i.e. the accuracy is about the same as for the classifier Y). Note that the p-values of this test are higher than the 5% significance level. And lastly, if there is a value on H3 row, corresponding to a column Y, the data provides sufficient evidence that the classifier Y performs better than SuPP-X. The assessment of the significance using p-values must be interpreted however with caution as this is not the most objective measure on its own. For p-values indicating significant improvement in accuracies from one method to another, an effect size (i.e. ES) value is attached (Table 3).
For the Wine data set (data set C), when the projection is made on two latent components (i.e. Table S4 in SI), SuPP-SVM does not perform better than any of the classifiers. In this case SuPP-SVM is less accurate than LDA. The same is observed in the case of SuPP-NB (Table S6 in SI). The performance changes when D ¼ 3 for the same data set (Table S5 and  Table S7) being at least as good or, as in the case of SuPP vs PCA, even better (p-value 2.95E-03 and ES ! 0.63). Small exceptions for D ¼ 3 can be observed in Table S7 and Table S8 from the SI for test/ training ¼ 0.75 (Wine data set rows), where PLS-DA and LDA perform better than SuPP-NB. For the Wine data set the ES values are shown in Table S8.
For the Iris data set (data set B), with D ¼ 2, SuPP performs better than PCA and PLS (p-value 4.02E-05 and p-value 1.17E-05 respectively) in all the increments of the test/training ratio. LDA however, in this case, performs better than SuPP. Table 3 indicates the p-values and Table 3 The p-values and the effect sizes (ES) obtained by testing the hypothesis H1 for the Iris data set. the ES values for the Iris data set when testing the hypothesis H1 defined previously. A significantly large effect size in this cases (ES ! 0.84 and ES ! 0.89 for PCA and PLS respectively) indicates a better separation of the classes in latent spaces. For D ¼ 3 the application on Iris data set shows better performance than PLS-DA (p-value 3.08E-03 and ES ! 0.74) and, with small exceptions (Table S5 test/training ¼ 0.16, 0.3 and 0.58), better than PCA (p-value 3.00E-03). The ES values however for this data set, for D ¼ 3, are lower than the correspondent ones for a 2D latent space. For data set D (adenocarcinoma), SuPP-SVM performed better with respect to PCA-SVM (Table S4). Compared to PLS-DA and LDA, for D ¼ 2 the accuracy of SuPP-SVM is at least as good or lower. Increasing the dimensions of the latent space did not improve the performance of SuPP-SVM on the adenocarcinoma data set. For SuPP-NB, the accuracy is at least as good as for the other classifiers and in some cases (test/ training ! 0.44) SuPP-NB performs better than PCA-NB (Table S2.6 and  Table S7). The results from Tables S6 and S7 point to the optimality of SuPP as a preprocessing or dimensionality reduction step applied prior to a probabilistic classifier (in this case NB) as indicated in eq. (5). The ES values for the data set D are included in Table S10.

Conclusions
The SuPP strategy described here is a versatile dimensionality reduction technique that offers a new perspective on supervised exploratory data analysis. SuPP proved to be able to work for low samples-tovariables ratio as well as with missing values from the data set. The main theoretical aspect of this method indicates that JSD is an important metric for a comprehensive representation in lower dimensions. The method, although slower (i.e. 13 min for A.II, D ¼ 2, niter ¼ 100 and 23 min for A.II, D ¼ 3, niter ¼ 100) than the classical dimensionality reduction techniques like PCA and PLS, is more objective (i.e. less prone to overfitting) and in some cases a better choice for preliminary step in classification process. The superiority of SuPP was observed especially on the Iris data set, where the accuracy of SuPP-NB and SuPP-SVM were above PCA-NB, PCA-SVM and PLS-DA. In the case of the Wine data set, SuPP performs at least as good as LDA for 50% training-to-test sample ratio. The capacity of SuPP to reduce the dimension independently of the parameters of the projected distribution, could, potentially show better result than LDA on a non-normally distributed projected distributions.
SuPP is an alternative to the classical DR methods and the choice to use SuPP must be determined by the data itself and the desire of the user to discover new perspectives on the projected data. Future work will include a feature selection approach based on SuPP, applications on categorical data and a more extensive application on the biological data sets.

Fundings
This study was supported by the Human Nutrition & Health initiative of the University of Groningen.

Declaration of competing interest
None declared.