A Molecular Classification of Human Mesenchymal Stromal Cells (distributed under Creative Commons CC BY 4.0)

Mesenchymal stromal cells (MSC) are widely used for the study of mesenchymal tissue repair, and are increasingly adopted for cell therapy, despite the lack of consensus on the identity of these cells. In part this is due to the lack of specificity of MSC markers. Distinguishing MSC from other stromal cells such as fibroblasts is particularly difficult using standard analysis of surface proteins, and there is an urgent need for improved classification approaches. Transcriptome profiling is commonly used to describe and compare different cell types; however, efforts to identify specific markers of rare cellular subsets may be confounded by the small sample sizes of most studies, making it difficult to derive reproducible, and therefore useful, markers. We addressed the question of MSC classification with a large integrative analysis of many public MSC datasets. We derived a sparse classifier (the Rohart MSC test) that distinguished MSC from non-MSC samples with >97% accuracy on an internal training set of 635 samples from 41 studies derived on 10 different microarray platforms. The classifier was validated on an external test set of 1,291 samples from 65 studies derived on 15 different platforms, with >95% accuracy. The genes that contribute to the MSC classifier formed a protein-interaction network that included known MSC markers. Further evidence of the relevance of this new MSC panel came from the high number of Mendelian disorders associated with mutations in more than 65% of the network; these disorders result in mesenchymal defects, particularly affecting skeletal growth and function. The Rohart MSC test is a simple in silico test that accurately discriminates MSC from fibroblasts, other adult stem/progenitor cell types and differentiated stromal cells. It has been implemented in the www.stemformatics.org resource to assist researchers wishing to benchmark their own MSC datasets or data from the public domain.
The code is available from the CRAN repository, and all data used to generate the MSC test are available to download via the Gene Expression Omnibus or the Stemformatics resource. How to cite this article: Rohart et al. (2016), A molecular classification of human mesenchymal stromal cells.


Contents of Supplemental Files

Extended methods:
• Design of test and training datasets
• Pre-processing of the data

Table S1. Markers used in the common MSC immunophenotyping panel. Accompanies Figure 1 and Table 1.

Table S4. Members of the MSC signature network, seeded from component 1 genes and first-order interaction partners (BioGRID). Column 1 provides the gene symbol (and ENSEMBL ID for seed members). Column 2 indicates component or network membership and the probability of selection derived from the sPLS-DA analysis. Disease annotations are taken from the Mendelian Inheritance in Man (MIM) database. Accompanies Figure 2.

Table S5 (auxiliary spreadsheet). Datasets that contributed to the testing set and results of the Rohart test. MSC status is unknown in the absence of phenotype data provided in the manuscript. Accompanies Figure 3.

Figure S1. Box-whisker plots of the average expression of 16 non-differentially expressed common cell-surface MSC markers across samples in the training dataset, by cell type. Box-whisker plots show median expression and the 25th and 75th percentiles. MSC (n=125), non-MSC (n=510). Accompanies Figure 1 and Table 1.

Implementation in Stemformatics
All primary data are available from the Stemformatics stem cell resource. The underlying code for the statistical test is available as BootsPLS in the CRAN repository, and we have also made available the d3 code for the interactive MSC graph implemented in Stemformatics via the BioJS framework at http://biojs.io/d/biojs-visrohart-msc-test

Gene annotations (from the signature-gene table):
• Required for the blood-brain barrier (Armulik et al., 2010; Lindblom et al., 2003). Binds heparan sulfate proteoglycans to limit PDGF diffusion (Abramsson et al., 2007). Marks the mesenchymal stem cell fraction in bone marrow (Koide et al., 2007; Song et al., 2013).
• CCDC80: Regulates lens formation in the developing eye (Jarrin, Pandit & Gunhaga, 2012). Binds growth factors including FGF in the extracellular matrix (Jarrin, Pandit & Gunhaga, 2012; Song et al., 2012). An important component of the bone marrow stroma; its loss prevents HSC niche development (Charbord et al., 2015). The mouse homolog Dro-1 is a tumour suppressor: CCDC80 (Dro-1) KO mice develop thyroid adenomas and ovarian carcinomas (Leone et al., 2015).

• DDR2 (seed): Transmembrane receptor tyrosine kinase that interacts with fibrillar collagen and is required for the proliferation of mesodermal cells. Mice lacking Ddr2 have dwarfism and are infertile (Kawai et al., 2012). DDR2 regulates the size of long bones and the amount of body fat (Kawai et al., 2014); reviewed in Leitinger (2014).
• FAP: Plasma membrane serine protease/gelatinase. FAP has been associated with a mesenchymal phenotype in stromal tumours, and depletion of FAP+ cells leads to tumour regression (Kraman et al., 2010). It is expressed on reactive fibroblasts in remodelled tissues and in cultured fibroblasts (Rettig et al., 1993), whereas expression of ABI3BP in fibroblasts leads to replicative senescence (Uekawa et al., 2005).

Notations
We used the following notation: X denotes the gene expression matrix of n = 635 samples and p = 27,901 genes, and H denotes the number of PLS components. Y is a class vector indicating the class of each sample, categorized as 'MSC' for 125 samples or 'non-MSC' for 510 samples.
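As described in the next section, PLS-DA works with the class vector Y coded as a dummy (indicator) matrix. A minimal Python sketch of this coding (illustrative only; the original analysis was carried out in R):

```python
def dummy_matrix(y, classes=("MSC", "non-MSC")):
    """Return the indicator matrix: row i has a 1 in the column of y[i],
    one column per class, as used by PLS-DA to code the class vector."""
    return [[1 if label == c else 0 for c in classes] for label in y]

print(dummy_matrix(["MSC", "non-MSC", "MSC"]))
# [[1, 0], [0, 1], [1, 0]]
```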

A sparse multivariate procedure for discriminant analysis and gene selection
The MSC signature was identified using a novel implementation of the sparse variant of Partial Least Squares Discriminant Analysis (sPLS-DA) (Barker & Rayens, 2003; Wold, 1966), implemented for multiple microarray studies using the mixOmics package (Lê Cao et al., 2009; Lê Cao et al., 2011). sPLS-DA enables the selection of a small subset of genes, amongst thousands, that best classifies the samples into their respective groups (MSC vs. non-MSC). It is an iterative multivariate analysis method that seeks a linear combination of the most discriminant genes to maximize the covariance between the gene expression matrix X and the sample class vector Y, which is coded as a dummy matrix. The sparsity in the linear combination is achieved through a lasso-type penalization (L1; Tibshirani, 1996) that shrinks the coefficients associated with non-discriminant genes to zero (Figures 1 and 2).
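The L1 soft-thresholding step behind this sparsity can be sketched in a few lines of Python (the actual analysis used the mixOmics R implementation; the gene names, loading values and penalty level below are invented for illustration):

```python
def soft_threshold(w, lam):
    """L1 (lasso-type) shrinkage: move every loading toward zero by lam;
    loadings whose magnitude is below lam become exactly zero, so the
    corresponding genes drop out of the PLS component."""
    return [max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0) for v in w]

# Toy loading vector for five hypothetical genes.
loadings = {"GENE_A": 0.90, "GENE_B": -0.75, "GENE_C": 0.10,
            "GENE_D": -0.05, "GENE_E": 0.40}
shrunk = dict(zip(loadings, soft_threshold(list(loadings.values()), 0.3)))
selected = [g for g, v in shrunk.items() if v != 0.0]
print(selected)  # ['GENE_A', 'GENE_B', 'GENE_E']
```

Only the genes whose loadings survive the penalty contribute to the component; the rest are excluded from the signature.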

Constrained sPLS-DA
The standard sPLS-DA method only uses the most discriminative genes to define the PLS components, and needed to be further developed for a cross-platform integrative analysis seeking a robust signature agnostic to microarray platform. We therefore developed a constrained extension of sPLS-DA, which defines PLS components based on specified subsets of stable genes identified using an internal bootstrap method. The code for this modified implementation is available as a function in the BootsPLS R package (CRAN repository).
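The "constrained" step amounts to fixing the candidate gene set before a component is fitted. A Python sketch of that restriction (the gene names and data are hypothetical; the actual function lives in the BootsPLS R package):

```python
def restrict_to_stable_genes(X_rows, gene_index, stable_genes):
    """Keep only the columns of the specified stable genes, so that the
    next PLS component is defined on this fixed subset rather than being
    free to re-select genes from the full expression matrix."""
    cols = [gene_index[g] for g in stable_genes]
    return [[row[c] for c in cols] for row in X_rows]

# Two samples by four hypothetical genes.
X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
idx = {"GENE_A": 0, "GENE_B": 1, "GENE_C": 2, "GENE_D": 3}
print(restrict_to_stable_genes(X, idx, ["GENE_B", "GENE_D"]))
# [[2.0, 4.0], [6.0, 8.0]]
```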

Random subsampling
Each random subsampling split the training set (125 MSCs and 510 non-MSCs) into an internal learning set (69% of the data) and an internal test set (31% of the data); the split was stratified to preserve the proportion of MSCs to non-MSCs. This process was repeated 200 times to assess the contribution of individual datasets to the stability of the chosen sPLS-DA variables (see Figure 2A of the manuscript).
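One such stratified subsampling can be sketched as follows (a Python illustration of the scheme, not the BootsPLS code; sample identifiers are placeholders):

```python
import random

def stratified_split(msc_ids, non_msc_ids, learn_frac=0.69, seed=0):
    """One random subsampling: split each class separately so the internal
    learning set (~69%) and internal test set (~31%) keep the same
    MSC / non-MSC proportions as the full training set."""
    rng = random.Random(seed)
    learn, test = [], []
    for ids in (msc_ids, non_msc_ids):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        k = round(len(shuffled) * learn_frac)
        learn += shuffled[:k]
        test += shuffled[k:]
    return learn, test

# Toy identifiers mirroring the training set sizes: 125 MSC, 510 non-MSC.
msc = [f"msc_{i}" for i in range(125)]
non = [f"non_{i}" for i in range(510)]
learn, test = stratified_split(msc, non)
print(len(learn), len(test))  # 438 197
```

Repeating this with 200 different seeds yields the 200 subsamplings used to assess gene-selection stability.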

Cross-Validation
The internal training set was further divided for cross-validation. Following the 200 random subsamplings, each of the resulting 200 internal learning sets was further divided using 10-fold cross-validation (CV). An sPLS-DA model was trained on each CV subset and tested on the remaining group of samples to determine the optimal number of genes to select on each component. We used the Balanced Error Rate (BER), defined as the average of the error rates on each class, to compensate for the low number of MSC in our training dataset.
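The BER definition above can be written directly in Python (the class labels and predictions below are a toy example):

```python
def balanced_error_rate(y_true, y_pred):
    """Balanced Error Rate: compute the error rate of each class
    separately, then average them, so the abundant non-MSC class cannot
    dominate the rare MSC class."""
    classes = sorted(set(y_true))
    errs = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        errs.append(wrong / len(idx))
    return sum(errs) / len(errs)

# Toy example: 2 of 4 MSCs misclassified (50% class error) but only
# 2 of 8 non-MSCs (25% class error) -> BER = (0.5 + 0.25) / 2 = 0.375.
y_true = ["MSC"] * 4 + ["non-MSC"] * 8
y_pred = (["MSC", "MSC", "non-MSC", "non-MSC"]
          + ["non-MSC"] * 6 + ["MSC"] * 2)
print(balanced_error_rate(y_true, y_pred))  # 0.375
```

An unbalanced overall error rate on the same data would be 4/12 = 0.33, hiding the fact that half the MSCs were missed.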
Since sPLS-DA is an iterative method that defines each component one at a time, the optimal number of genes selected on a given PLS component is highly dependent on the genes selected on the previous components. To address this component dependency, we performed a sequential tuning: to tune component h, a constrained sPLS-DA model was built with the genes selected on the (h-1) previous components (h ≤ H).

Choosing the number of sPLS-DA components
We used the internal classification accuracy as a criterion to evaluate the performance of the sPLS-DA model for a sufficiently large number of components (H = 6). A naïve criterion would be to choose the number of components that minimizes the overall classification error rate. However, such a criterion would not accommodate the high variability in the classification error rate across the 200 subsamplings. Instead, we proposed a new criterion that evaluates the gain in classification accuracy when an extra PLS component is added to the sPLS-DA model.
We developed a multiple testing procedure using one-sided t-tests with a null hypothesis H0(N): 'N PLS components give better classification accuracy than N+1 PLS components' (N ≥ 1), against two alternatives: H1A, 'the overall classification accuracy is higher for N+1 PLS components than for N PLS components', and H1B, 'the classification accuracy of non-MSC samples only is higher for N+1 PLS components than for N PLS components'. The null hypothesis was rejected when both alternatives H1A and H1B were supported. If H0(N) was rejected, i.e. N+1 components were retained in the sPLS-DA model, the null hypothesis was tested for the next component added to the model. We tested H0(N) for 1 ≤ N ≤ 5, or until H0(N) was not rejected. Using one-sided t-tests (the t.test function in R) resulted in two p-values, one for each tested alternative hypothesis. The two p-values were added and the sum was compared to a significance level α = 0.001 (similar to a Bonferroni correction on each p-value). The reason for this very conservative test was to ensure that the addition of a PLS component would dramatically improve the classification accuracy of the model. We tested N = 1, …, 5 components, with associated p-values 1.282936e-58, 7.188488e-25, 3.394169e-10, 3.935352e-03 and 2.816028e-07 respectively; as the fourth sum exceeded α, our final sPLS-DA model included 4 components (Figure 2 of the manuscript).
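This decision rule can be sketched in Python. Note two stated simplifications: the paper used R's t.test, whereas this sketch substitutes a Welch-type statistic with a normal approximation (defensible only because each accuracy vector has 200 replicates), and the accuracy vectors below are synthetic:

```python
from statistics import NormalDist, mean, stdev

def one_sided_p(greater, lesser):
    """Approximate p-value for the one-sided alternative
    mean(greater) > mean(lesser): Welch-type statistic with a normal
    approximation in place of the t distribution."""
    se = (stdev(greater) ** 2 / len(greater)
          + stdev(lesser) ** 2 / len(lesser)) ** 0.5
    z = (mean(greater) - mean(lesser)) / se
    return 1.0 - NormalDist().cdf(z)

def keep_extra_component(overall_n1, overall_n, nonmsc_n1, nonmsc_n,
                         alpha=0.001):
    """Reject H0(N) -- i.e. keep the extra component -- only when the sum
    of the two one-sided p-values (overall accuracy and non-MSC accuracy)
    falls below alpha."""
    p_sum = (one_sided_p(overall_n1, overall_n)
             + one_sided_p(nonmsc_n1, nonmsc_n))
    return p_sum < alpha

# Synthetic accuracies over 200 subsamplings: a clear gain vs. no gain.
gain = [0.96] * 100 + [0.97] * 100
base = [0.90] * 100 + [0.91] * 100
print(keep_extra_component(gain, base, gain, base))  # True
print(keep_extra_component(base, base, base, base))  # False
```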

Choosing the number of stable genes per component
The combination of the 200 subsamplings from the sPLS-DA models enabled us to assess the frequency of selection of each gene (Figure 2A of the manuscript). For each of the four components, we reported the number of times each gene was selected on a given component across the 200 replications, leading to a frequency of selection for each of the 27,901 genes present in the dataset (Figure 2B of the manuscript). We developed an iterative process to accommodate the gene-selection dependency between components and to select the best stable gene subset on each PLS component.
Firstly, the genes were ranked according to decreasing frequency of selection per component, with the same rank given to ties. The top stable genes were identified as the most discriminative for differentiating MSC from non-MSC samples. For each rank k from 1 to K, the prediction accuracy of the genes with rank at most k was assessed with an extra 200 random subsamplings, fitting a constrained sPLS-DA on the most stable genes on the internal learning set and calculating the classification accuracy on the internal test set. We chose K = 40, a value large enough to contain our final gene signature.
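The tie-aware ranking step can be sketched as follows (a Python illustration; the gene names and selection counts are hypothetical):

```python
def rank_by_frequency(freqs):
    """Rank genes by decreasing selection frequency across the 200
    subsamplings, giving tied frequencies the same rank."""
    distinct = sorted(set(freqs.values()), reverse=True)
    rank_of = {f: r for r, f in enumerate(distinct, start=1)}
    return {gene: rank_of[f] for gene, f in freqs.items()}

# Hypothetical selection counts out of 200 subsamplings.
freqs = {"GENE_A": 198, "GENE_B": 198, "GENE_C": 150, "GENE_D": 72}
ranks = rank_by_frequency(freqs)
print(ranks)  # {'GENE_A': 1, 'GENE_B': 1, 'GENE_C': 2, 'GENE_D': 3}

# Genes with rank <= k form the candidate stable subset assessed at step k.
top2 = sorted(g for g, r in ranks.items() if r <= 2)
print(top2)  # ['GENE_A', 'GENE_B', 'GENE_C']
```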
Secondly, similarly to the method presented to choose the number of components, we developed a multiple testing procedure using one sided t-tests to test the gain in classification accuracy when adding more stable genes to our subset of top ranked stable genes.
Assuming N < M, we tested the null hypothesis H0(N,M): 'N top stable genes give better classification accuracy than M top stable genes' against two alternatives: H1A, 'the overall classification accuracy is higher for M genes than for N genes', and H1B, 'the classification accuracy of non-MSC samples is higher for M genes than for N genes'. The null hypothesis H0(N,M) was tested against each alternative H1A and H1B with a one-sided t-test, resulting in two p-values. The two p-values were added and the sum was compared to a significance level α = 0.01 (similar to a Bonferroni correction on each p-value). Note that the alternative H1C, 'the classification accuracy of MSC samples is higher for M genes than for N genes', was not considered due to the low number of misclassified MSC. Figure 2C of the manuscript displays the results of assessing the top stable genes with respect to our statistical tests for each of the 4 components. Our statistical model selected 5 genes on component 1, 4 genes on component 2, 4 genes on component 3 and 3 genes on component 4, resulting in our final 16-gene signature sPLS-DA model with 4 components.

Definitive assignation of a test sample to the MSC or non-MSC class
To fit our final statistical model, we applied a constrained sPLS-DA on the 16 signature genes identified in our workflow on the full training set (635 samples). Our model led to a prediction score for each external test sample. We derived two quality measures from the prediction scores, namely an uncertainty area and a confidence interval; we combined both to make a definitive prediction call.
From Figure 2E of the manuscript, we determined two cut-offs: 99% of the MSC samples had a prediction score greater than 0.4337, and 99% of the non-MSC samples had a prediction score lower than 0.5169. We defined an 'unsure' area between these two values.
We further assessed the stability of a prediction score by constructing a 95% Confidence Interval (CI) for each test sample. To that end, we recorded a series of prediction scores for each test sample by fitting a constrained sPLS-DA with the 16 signature genes on 200 subsamplings of the training set; the 95% CI was then derived from these 200 scores. In all figures where the classification of individual samples is shown, the CI bounds are represented as error bars around each test sample.
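The text does not spell out how the 95% CI is derived from the 200 scores; an empirical percentile interval is one plausible construction, sketched here in Python with integer placeholder scores:

```python
def percentile_ci(scores, level=0.95):
    """Empirical CI from the sorted subsampling scores: drop the lowest
    and highest (1 - level)/2 fraction and return the remaining bounds."""
    s = sorted(scores)
    lo = int(len(s) * (1.0 - level) / 2.0)
    return s[lo], s[len(s) - 1 - lo]

# With 200 placeholder scores, the bounds sit at positions 5 and 194.
print(percentile_ci(list(range(200))))  # (5, 194)
```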
From the two quality measures described above, a rule was set to assign each test sample to the MSC, non-MSC or unsure predicted class. A definitive call was made when the 95% CI of a score did not overlap the unsure area. A test sample with a prediction score higher than 0.5169, and for which the lower bound of the 95% CI was also higher than 0.5169, was predicted to be MSC. Similarly, a test sample with a prediction score lower than 0.4337, and for which the upper bound of the 95% CI was also lower than 0.4337, was predicted to be non-MSC. All other test samples were classified as unsure.
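This assignment rule can be written as a short function; the cut-offs come from the text, but this is a sketch of the rule rather than the Stemformatics implementation itself:

```python
UNSURE_LOW, UNSURE_HIGH = 0.4337, 0.5169  # cut-offs from Figure 2E

def rohart_call(score, ci_lower, ci_upper):
    """Definitive MSC / non-MSC call only when both the prediction score
    and the relevant 95% CI bound clear the unsure area; otherwise the
    sample is left as 'unsure'."""
    if score > UNSURE_HIGH and ci_lower > UNSURE_HIGH:
        return "MSC"
    if score < UNSURE_LOW and ci_upper < UNSURE_LOW:
        return "non-MSC"
    return "unsure"

print(rohart_call(0.80, 0.70, 0.90))  # MSC
print(rohart_call(0.20, 0.10, 0.30))  # non-MSC
print(rohart_call(0.60, 0.45, 0.75))  # unsure: CI overlaps the unsure area
```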

Benchmark of the signature

The MSC signature outperforms common MSC surface markers
We investigated the ability of each of the 32 markers defined in Table S1, taken individually, to discriminate MSC from non-MSC samples. A PLS-DA model was fitted with each of these 32 single markers on all 635 samples from the training set, and the classification accuracy was recorded using 200 subsamplings (internal learning/test sets). Most of these markers (25 out of 32) gave a classification error rate of 100%, showing their complete inability to classify the MSCs using our PLS-DA method; the exceptions were the seven markers CD73, CD105, VCAM1, PDGFRB, KIT, ITGA5 and ANPEP.

Batch effect and platform effect
Our 16-gene signature was not confounded by dataset or platform: the sample plot representation from the constrained sPLS-DA in Figure S2 did not show any clusters related to platform or study.