The Generalized Matrix Decomposition Biplot and Its Application to Microbiome Data

Biplots that simultaneously display the sample clustering and the important taxa have gained popularity in the exploratory analysis of human microbiome data. Traditional biplots, assuming Euclidean distances between samples, are not appropriate for microbiome data, when non-Euclidean distances are used to characterize dissimilarities among microbial communities. Thus, incorporating information from non-Euclidean distances into a biplot becomes useful for graphical displays of microbiome data. The proposed GMD-biplot accounts for any arbitrary non-Euclidean distances and provides a robust and computationally efficient approach for graphical visualization of microbiome data. In addition, the proposed GMD-biplot displays both the samples and taxa with respect to the same coordinate system, which further allows the configuration of future samples.

Second, the AMD may result in nondecreasing "singular values." While these seem like minor technical issues, the second drawback can have important practical implications: which of the left and right singular vectors should be displayed in the resulting biplot? The authors of reference 11 suggest constructing the AMD-biplot based on the two left and right singular vectors that correspond to the two largest singular values. This AMD-biplot assures that the arrows for variables are as meaningful as possible, but they may fail to reveal meaningful sample clusters if the information of sample clusters is associated only with the first several left singular vectors. An alternative approach may be to simply display the first and second left and right singular vectors of the AMD (as done for the SVD). Unfortunately, this strategy does not solve the problem either: although we may observe meaningful sample clusters, the arrows may not be meaningful due to the small singular values. There is thus a lack of clarity regarding which singular vectors should be used to construct the AMD-biplot.
The drawbacks of the AMD-biplot motivate our proposal which is based on the generalized matrix decomposition biplot (GMD-biplot) (12). The GMD-biplot is a direct generalization of the SVD-biplot that accounts for structural dependencies among the samples and/or variables. This approach has several advantages. First, as with the AMD, it directly handles any non-Euclidean distance matrix. Specifically, the full information from that distance matrix is used. Second, unlike the AMD, which provides an approximate decomposition of the data matrix, the GMD provides an exact decomposition of the original data matrix without losing any information. Third, the GMD restores the matrix duality in a mathematically rigorous manner, unlike the approximate matrix duality obtained with the AMD; it naturally extends the duality inherent in the SVD and allows one to plot both the configuration of samples and the contribution of individual variables with respect to a new coordinate system. Fourth, the GMD gives nonincreasing GMD values, so the resulting GMD-biplot can be directly constructed based on the first two left and right GMD vectors. Last, unlike the AMD-biplot whose sample clusters depend only on distance, the GMD-biplot uses both the non-Euclidean distance and the data matrix for classifying samples, which more directly connects the contribution of the individual variables to the configuration of samples. Additionally, besides accounting for the non-Euclidean distances between samples, the GMD can also account for auxiliary information on (dis)similarities between the variables.
The remainder of this paper is organized as follows. We first illustrate the GMD-, AMD-, and SVD-biplots in three numerical studies. We then discuss advantages of the proposed GMD-biplot and further extensions. In Materials and Methods, we present detailed description of the GMD-biplot framework.

RESULTS
In the results below, we compare the GMD-, AMD-, and SVD-biplots on three data sets in the manner that each has been proposed recently for microbiome data. In particular, in reference 11, the AMD-biplot is advocated specifically for relative abundance data, while in reference 13, the SVD-biplot is advocated for data that have been scaled by the centered log ratio (CLR) transformation. The GMD-biplot is constructed using the CLR-transformed data. We first examine the performance of all biplots using the smokeless tobacco data set explored in reference 11. In the second study, we compare their performances using the human gut microbiome data from reference 14.
In the third analysis, we simulate a data set based on the smokeless tobacco data to illustrate a dilemma that the AMD-biplot may face.
Analysis of the smokeless tobacco data. This data set includes 15 smokeless tobacco products: 6 dry snuff samples, 7 moist snuff samples, and 2 toombak samples from Sudan. Three separate (replicate) observations (starting with sample preparation) were made of each product, so that a total of 45 observations are available. Each observation has a 271 ϫ 1 vector of taxon counts, and thus, the data set can be formed into a 45 ϫ 271 matrix, denoted by X. The squared weighted UniFrac distance, denoted by ⌬ ʦ ‫ޒ‬ 45ϫ45 , was used to measure the distance between samples. The corresponding similarity kernel H was calculated as H ϭ Ϫ 1 2 ⌬, where ϭ I n Ϫ 1 n 1 n 1 n T is the centering matrix and 1 n is an n ϫ 1 vector. Since H is not positively semidefinite, we forced it to be positive semidefinite by removing its negative eigenvalues and corresponding eigenvectors. The resulting similarity kernel, denoted H ‫ء‬ , has rank 27. For the GMD-biplot, we consider the CLR transformation of X. Specifically, denoting the geometric mean of a vector z by g͑z͒ ϭ ͑ kϭ1 p z k ͒ 1⁄p , the CLR transformation of x i ; i ϭ 1, . . . , 45 is given by We denote the resulting data matrix by X ϭ ͑x 1 , . . . , x 45 ͒ T . For the AMD-biplot, we converted each row of X into the empirical frequencies and further centered the rows and columns to have mean 0, as done in reference 11. We denote the resulting data matrix by X.
We constructed the GMD-biplot and the AMD-biplot based on H ‫ء‬ using X and X, respectively. Figure 1d displays the proportion of variance captured by each GMD component. It can be seen that the first two GMD components capture more than 80% of the total variance of X, which assures that the resulting GMD-biplot (Fig. 1a) visualizes the data well. As shown in Fig. 1a, the GMD-biplot is perfectly successful at separating the different tobacco products (dry, moist, and toombak). Furthermore, the replicates corresponding to the same product are tightly clustered. By examining the arrows for taxa in Fig. 1a, we see that moist samples may be characterized by elevated levels of Alloiococcus and Halophilus, while Aerococcaceae appears elevated in toomback samples. Figure 1e, which is the same as the right bottom panel of Fig. 1 in reference 11, shows that the AMD singular values are not necessarily decreasing. It should be noted that Fig. 1b is slightly different from Fig. 3 in reference 11; this difference may be due to the use of H ‫ء‬ here as opposed to H in reference 11. This is because we wanted the AMD-biplot to be directly comparable to the GMD-biplot, since the GMD requires both H and R to be positive semidefinite. From Fig. 1b, it can be seen that the AMD-biplot successfully separates toombak samples (purple points) from dry (blue) and moist (orange) snuff samples, although the separation between dry and moist snuff samples in the AMD-biplot is not as definitive as that in the GMD-biplot (Fig. 1a).
Additionally, we included the SVD-biplot and its corresponding scree plot in Fig. 1c and f, respectively. As the SVD-biplot assumes Euclidean distances between samples, it is more appropriate to construct the SVD-biplot using the CLR-transformed data X than the relative abundance data X (13). It can be seen from Fig. 1c that although the SVD-biplot successfully separates dry snuff from moist and toombak samples, it does not give a clear separation between moist snuff and toombak samples.
It is worth noting that the three biplots identify different top taxa, i.e., the taxa with the longest arrows. Although a biplot is not a rigorous statistical method to detect important taxa, it may shed light on which taxa are important to the observed sample clustering. To see this, we performed a univariate linear regression of each taxon (each column of X) on the tobacco groups (dry, moist, and toombak) and obtained P values representing the strength of association between each taxon and the tobacco groups. We then sorted these P values in a nondecreasing order and obtained the rank of each taxon based on the sorted P values. Hence, it is desirable that the taxa with the lowest ranks can be identified by the biplots. Table S1 in the supplemental material summarizes the ranks of the top 10 taxa identified by each biplot. It can be seen that the top 10 taxa identified by the GMD-biplot have lower ranks on average than those identified by the AMD and SVD biplots, indicating that the GMD-biplot may identify more meaningful taxa with respect to the separation of the samples than the AMD and SVD biplots.
Analysis of human gut microbiome data. We consider the human gut microbiome data collected in a study of healthy children and adults from the Amazonas of Venezuela, rural Malawi, and U.S. metropolitan areas (14). The original data set X consists of counts for 149 taxa for 100 samples. The squared unweighted UniFrac distance matrix ⌬ ʦ ‫ޒ‬ 100ϫ100 , computed using the R package phyloseq (15), was used to measure the distance between samples. Here, the distance between two samples is based entirely on the number of branches they share on a phylogenetic tree. The distance hence accounts only for the presence/absence of each taxon (not its abundance). The corresponding similarity kernel H was then derived as H ϭ Ϫ 1 2 ⌬, which is a positive semidefinite matrix with rank 99. Let X and X, respectively, denote the CLR-transformed data and the relative abundance data. Similar to the first study, the GMD-biplot and the AMD-biplot were constructed based on the similarity kernel H using X and X respectively, and the SVD-biplot was constructed based on the SVD of X.
As concluded in reference 14, shared features of the functional maturation of the gut microbiome are identified during the first 3 years of life. We thus define a binary outcome h i based on the age of the individual (in years) when each sample was taken as: for i ϭ 1, . . . , 100. Approximately 70% of the samples are assigned to group 0, and the remaining 30% are assigned to group 1.
In all biplots, the ith sample is colored by age i and symbolized by h i . Figure 2d indicates that more than 80% of the total variance is explained by the GMD-biplot in Fig. 2a, which provides a good visualization of sample clusters across age. By examining the relationship between the arrows and the color of the sample points in Fig. 2a, we see that Prevotella may be elevated in adults, while Parabacteroides appears to be elevated in infants. In contrast, Fig. 2e shows that less than 15% of the total variance is explained by the AMD-biplot in Fig. 2b and the AMD values are not decreasing. As shown in Fig. 2b, the AMD-biplot also displays potential clusters across age, but the sample points are not as tightly clustered as those in Fig. 2a. Odoribacter appears to be elevated in adults in Fig. 2b, while Lactobacillus appears associated with infants. As a reference, Fig. 2c shows the SVD-biplot of X, which looks very similar to Fig. 2a.
To further quantify the classification accuracy, for each biplot, we predicted the probability that each sample belongs to group 1 based on leave-one-out cross validation using the binary logistic regression of the group index h i on the two selected components. We then plotted a receiver operating characteristic (ROC) curve for each biplot based on the predicted probabilities (see Fig. S1 in the supplemental material) and calculated the area under the ROC curve (AUC): the GMD-, AMD-, and SVD-biplots yield an AUC of 0.989, 0.976, and 0.990, respectively. The AUC results indicate that the GMD-biplot provides a better separation of age groups than the AMD-biplot, but there is not a clear difference between the GMD-biplot and the SVD-biplot. This may be because, for the CLR-transformed data X, the unweighted UniFrac distance is not as informative with respect to age as the weighted UniFrac distance was in the tobacco data with respect to product groups.
We emphasize that both the GMD-biplot and the SVD-biplot identify Prevotella and Parabacteroides as top taxa, while the AMD-biplot identifies completely different ones. As reference 14 confirms that the trade-off between Prevotella and Bacteroides (including Parabacteroides) considerably drives the variation of microbiome abundance in adults and children between 0.6 and 1 year of age in all studied populations, the GMDand SVD-biplots may thus identify more biologically meaningful taxa than the AMDbiplot. It should, however, be noted that these bacterial are "identified" based on circumstantial, not statistical, evidence, and more work is needed to examine statistical associations in this context. Incorporating a kernel for variables into the GMD-biplot. The GMD problem (see equation 3 in Materials and Methods) allows not only the similarity kernel for samples but also a kernel for the variables. Including both kernels may further improve the accuracy of sample classification as well as the identification of important variables. We illustrate this advantage by incorporating a kernel for variables in the analysis of the human gut microbiome data. More specifically, we first calculate a matrix ⌬ R ʦ‫ޒ‬ 149ϫ149 of squared patristic distances between the tips of the phylogenetic tree for each pair of taxa and then derive a similarity matrix R as R ϭ Ϫ 1 2 ⌬ R . Figure 3a shows the GMD-biplot with the additional kernel R incorporated. The ROC analysis based on the leave-one-out cross validation for Fig. 3a yields an AUC of 0.984, which is higher than that of the AMD-biplot (Fig. 2b) but slightly lower than Fig. 2a and Fig. 2c. This may be because both H and R highly depend on the phylogenetic tree. Thus, incorporating R may be redundant and may reduce the accuracy of the sample clustering in this case. The top three taxa identified in Fig. 3a include Prevotella but not Parabacteroides, which may explain the lower clustering accuracy.
Including an additional kernel for variables in the GMD-biplot is related to the method of double-principal-coordinate analysis (DPCoA) (16). DPCoA, as shown in reference 17, is equivalent to a generalized PCoA which essentially incorporates an additional similarity kernel for variables into the analysis, as described in Proposition 1, but for H ϭ I n . As suggested in reference 18, DPCoA can provide a biplot representation of both samples and meaningful taxonomic categories. Hence, the GMD-biplot can also be viewed as an extension of DPCoA biplots because the GMD allows kernels for both samples and variables, while DPCoA allows a kernel only for variables.

Simulation.
In this section, we conduct a simulation study based on the smokeless tobacco data to illustrate a scenario in which the AMD-biplot may fail to separate the samples, whereas the GMD-biplot performs well. Let H ‫ء‬ and X be the similarity kernel and data matrix from the smokeless tobacco data, respectively. We consider the eigen-decomposition of H ‫ء‬ as H ‫ء‬ ϭ B⌳B T : B is a 45 ϫ 27 matrix whose columns are eigenvectors of H ‫ء‬ and ⌳ ϭ diag͑ 1 , . . . , 27 ͒ is a diagonal matrix whose elements are the eigenvalue of H ‫ء‬ . Then, the AMD-biplot is based on the following approximated orthogonal decomposition of X: where D ϭ diag͑d 1 , . . . , d 27 ͒ and V is a 271 ϫ 27 matrix with orthonormal columns. As shown in Fig. 2d, d 1 The GMD-biplot and the AMD-biplot of X S with similarity measure H ‫ء‬ are presented in Fig. 4a and b, respectively. It can be seen that the two groups are completely mixed up in the AMD-biplot because the first column of B is not selected for visualization. In contrast, the GMD-biplot successfully visualizes the sample groups by displaying the first and second GMD components.
To see why this occurs, we summarize the first three diagonal elements of ⌳, D S , and D S 2 ⌳ in Table 1 and notice that d 1,S Ͻ d 2,S Ͻ d 3,S . Consequently, the AMD-biplot displays  the second and third columns of BD S , and hence, it completely fails to classify the samples because the group index w i depends only on the first column of B. In contrast, Proposition 1a (see Materials and Methods) shows that the GMD-biplot is based on the two largest eigenvalues and the corresponding eigenvectors of X S X S T H ‫ء‬ . It can be further seen that Equation 2 implies that the diagonal elements of D S 2 ⌳ are the eigenvalues of X S X S T H ‫ء‬ and columns of B are the corresponding eigenvectors. Hence, it can be seen from Table 1 that d 1,S 2 1 Ͼd 2,S 2 2 Ͼd 3,S 2 3 , even though d 1,S Ͻd 2,S Ͻd 3,S . Therefore, the GMD-biplot displays the first and second column of BD S ⌳ 1⁄2 as sample points, which successfully captures sample classifications.

DISCUSSION
Biplots have gained popularity in the exploratory analysis of high-dimensional microbiome data. The traditional SVD-biplot is based on Euclidean distances between samples and cannot be directly applied when more general dissimilarities are used. Since Euclidean distances may not lead to an optimal low-dimensional representation of the samples, we have extended the concept of the SVD-biplot to allow for more general similarity kernels. The phylogenetically informed UniFrac distance, used in our examples, defines one such kernel. In settings where a general (possibly nonlinear) distance matrix is appropriate, our approach provides a mathematically rigorous and computationally efficient method, based on the GMD, that allows for plotting both the samples and variables with respect to the same coordinate system.
Our first data example with the smokeless tobacco data set from reference 11 demonstrates the merits of the proposed GMD-biplot. We found that the GMD-biplot successfully displays different types of products, while the AMD-biplot is not able to completely separate dry and moist snuff samples, and the SVD-biplot fails to separate moist and toombak samples. As shown in Table S1 in the supplemental material, the GMD-biplot is also able to identify biologically more meaningful taxa that are related to the different types of products, compared to the AMD-biplot and the SVD-biplot.
In our second example, the GMD-biplot also outperforms the AMD-biplot in terms of both the sample clustering and the identification of important taxa. However, there is not a clear advantage of the GMD-biplot over the SVD-biplot in this example. This difference between the two examples may be attributed to the relation between the Euclidean kernel and the non-Euclidean similarity measure. Denoting the Euclidean kernel and the non-Euclidean similarity measure by XX T and H, respectively, it can be seen that the sample configuration in the AMD-biplot and the SVD-biplot depend solely on either H or XX T , whereas the GMD-biplot uses the top two eigenvectors of XX T H, the matrix product of the Euclidean kernel XX T and H. Hence, if XX T contains substantially more information about sample clustering than H, then taking H T into consideration may not further improve the accuracy of sample clustering. Indeed, this may be the case in our second example, where the clustering of samples using the Euclidean distance between samples of the CLR-transformed data is highly successful because the difference of the microbial profiles between infants and adults is obvious even without the help of the UniFrac distance. However, a possibly more common scenario is when both H and XX T contain some, but different, information on sample clustering. In such cases, taking both XX T and H into consideration may improve the sample clustering and provide better biological interpretation.
In practice, we typically do not know what the true configuration of samples looks like, so it is impossible to determine whether H or XX T contains more information about sample clusters. Also, it is sensible to assume that XX T and H are "coinformative" in the sense that they exhibit a shared eigenstructure; for instance, both may be informative for clustering samples. The coinformativeness can be quantified precisely using the Hilbert-Schmidt information criteria (HSIC) (19). For any two kernels K 1 and K 2 , the empirical HSIC is proportional to tr͑K 1 K 2 ͒. Hence, by definition, the GMD problem in equation 3 (see Materials and Methods) is equivalent to minimizing the HSIC between ͑X Ϫ USV T ͒͑X Ϫ USV T ͒ T and H over U, S and V. In other words, if we consider X Ϫ USV T as the residual matrix of X, then the GMD solutions can be interpreted as the best approximation to X in the sense that the HSIC between H and the Euclidean kernel of the residual matrix is minimized. Thus, the GMD-biplot considers the coinformativeness of XX T and H. Therefore, in many cases, it would be a more robust way to display the sample points compared to the AMD-biplot or the SVD-biplot. Another advantage of the GMD-biplot over the AMD-biplot is illustrated in our simulation study. Since AMD may not give decreasing singular values, the AMD-biplot may not be able to display the most informative eigenvectors of H, and may thus fail to cluster the samples. In contrast, GMD assures that the resulting singular values are nonincreasing.
Our discussion in this paper has focused on the form biplot, which aims to visualize the relationship between variables and the sample clustering. In other scenarios, where the variation of the data matrix explained by each variable is of particular interest, the covariance biplot may be more appropriate. This biplot considers the GMD of X with respect to H; i.e., X ϭ USV T , where U T HU ϭ I q and V T V ϭ I q . Note that where S ϭ diag͑s 1 , . . . , s q ͒. Furthermore, since V has orthogonal columns, it can be seen that mϭ1 q s m 2 ϭ jϭ1 p ͑ mϭ1 q v jm 2 s m 2 ͒. Thus, the value of ͑ mϭ1 q v jm 2 s m 2 ͒ ⁄ ʈXʈ H,I 2 gives the proportion of the variability in ʈXʈ H,I 2 explained by the jth variable. Note that when q ϭ 2, the length of the arrow of the jth variable in the covariance biplot is given by ͱ ͑ mϭ1 q v jm 2 s m 2 ͒. Therefore, in a covariance biplot, the arrows shed light on how the total variance of the data is partitioned into parts explained by each variable.

MATERIALS AND METHODS
We denote the data matrix by Xʦ‫ޒ‬ nϫp , where n is the number of samples and p is the number of variables (taxa). We assume that the columns of X are centered to have mean 0 and rank͑X͒ ϭ K Յ min͑n,p͒. For any matrix M, we denote its ith row (sample) and its (i, j) entry by m i and m ij , respectively. We denote the transpose of M by M T .
Biplot, distance measure, and AMD. A biplot is a graphical method to simultaneously represent, in two dimensions, both the rows (as points) and columns (as arrows) of the matrix X on the same coordinate axes. Given a decomposition of X as X ϭ AB T , a biplot displays two selected columns of A and B. The SVD-biplot is based on the singular value decomposition (SVD) of X, i.e., X ϭ USV T , where U T U ϭ I K ,V T V ϭ I K , and S ϭ diag͑ 1 , . . . , K ͒ with 1 , . . . , K being a sequence of nonincreasing and positive singular values. Here I K is a rank K identity matrix. Based on the SVD, A and B can take various forms; examples include form and covariance biplots (7). Since our primary interest is to visualize the clustering of samples, we focus on the form biplot in this paper and comment on the covariance biplot in the Discussion.
The SVD-biplot displays the first two columns of US and V, which can explain ͑ 1 2 ϩ 2 2 ͒⁄tr͑XX T ͒ of the total variance of X. The SVD of X is closely related to the eigen-decomposition of the similarity kernel XX T , as we can write XX T ϭ US 2 U T . Thus, the eigen-decomposition of XX T provides a way to calculate U and S. Once U and S are calculated, one can calculate V from the duality between U and V; that is, VS ϭ X T U. The similarity kernel XX T characterizes the Euclidean distance between samples. To see this, we define the Euclidean squared distance between the ith and jth sample as d ij E ϭ ʈx i Ϫ x j ʈ ‫ޒ‬ p 2 . Let ϭ I n Ϫ 1 n 1 n 1 n T be the centering matrix where 1 n is an n ϫ 1 vector of ones. It can then be seen that Now, if ⌬ E is replaced by a matrix D of non-Euclidean squared dissimilarities, one can still define a similarity kernel by H ϭ Ϫ 1 2 D. One such example is when D arises from distances between sample vectors of microbial abundances (or presence/absence) which account for a phylogenetic tree, as in a weighted (respectively, unweighted) UniFrac distance matrix. In this case, one can construct a principalcoordinate analysis (PCoA) plot of the samples using H ϭ U H S H 2 U H T . However, an SVD-biplot cannot be constructed, since there is no V that corresponds to the variables. The approximate matrix duality (AMD) addresses this problem by fixing U H and then seeking a matrix V H with orthonormal columns and a diagonal matrix D H with nonnegative elements that minimize the objective function Here, ʈMʈ F 2 ϭ tr͑M T M͒ is the Frobenius norm of M, and for any square matrix M ʦ ‫ޒ‬ dϫd , tr͑M͒ ϭ jϭ1 d m jj . The resulting AMD-biplot can be constructed by plotting the two columns of U H D H as sample points and plotting V H as arrows for variables; the selected two columns/rows correspond to the two largest elements of D H .