Automated biomarker candidate discovery in imaging mass spectrometry data through spatially localized Shapley additive explanations

The search for molecular species that are differentially expressed between biological states is an important step towards discovering promising biomarker candidates. In imaging mass spectrometry (IMS), performing this search manually is often impractical due to the large size and high-dimensionality of IMS datasets. Instead, we propose an interpretable machine learning workflow that automatically identifies biomarker candidates by their mass-to-charge ratios, and that quantitatively estimates their relevance to recognizing a given biological class using Shapley additive explanations (SHAP). The task of biomarker candidate discovery is translated into a feature ranking problem: given a classification model that assigns pixels to different biological classes on the basis of their mass spectra, the molecular species that the model uses as features are ranked in descending order of relative predictive importance such that the top-ranking features have a higher likelihood of being useful biomarkers. Besides providing the user with an experiment-wide measure of a molecular species’ biomarker potential, our workflow delivers spatially localized explanations of the classification model’s decision-making process in the form of a novel representation called SHAP maps. SHAP maps deliver insight into the spatial specificity of biomarker candidates by highlighting in which regions of the tissue sample each feature provides discriminative information and in which regions it does not. SHAP maps also enable one to determine whether the relationship between a biomarker candidate and a biological state of interest is correlative or anticorrelative. Our automated approach to estimating a molecular species’ potential for characterizing a user-provided biological class, combined with the untargeted and multiplexed nature of IMS, allows for the rapid screening of thousands of molecular species and the obtention of a broader biomarker candidate shortlist than would be possible through targeted manual assessment. Our biomarker candidate discovery workflow is demonstrated on mouse-pup and rat kidney case studies.

• Dataset n o 2: Murine kidney images were generated in Q-TOF mode of operation. Tissue imaging data (591,534 pixels) were collected at 15 µm pixel size, using 400 shots per pixel, and 35% laser power. Data were collected in positive ionization mode from m/z 200 to 1,500.

FT-ICR IMS
To aid in the identification of detected lipid species of the mouse pup dataset, we used data from parallel analysis performed on a 15T solariX FT-ICR mass spectrometer (Bruker Daltonics, Bremen, Germany) coupled to an Apollo II ESI-MALDI dual ion source. The data was collected for the m/z range of 460-1,500 with a transient length of 4.2 s, resulting in a resolving power of 600,000 at m/z 760.

Histology
Following MALDI IMS, the matrix was removed from tissue using 100% ethanol and rehydrated with graded ethanol and double distilled H2O. The tissues were then stained using a hematoxylin and eosin (H&E) stain. Brightfield microscopy of stained tissues was obtained at 20× magnification using a Leica SCN400 Brightfield Slide Scanner (Leica Microsystems, Bu alo Grove, IL, USA).
Data preprocessing • Dataset n o 1: The mouse-pup data was exported into a custom binary format optimized for storage and speed of analysis of the ion mobility-IMS data. Each frame/pixel contains between 10,000 and 100,000 centroid peaks that span the acquisition range of m/z 300-1,200 and 1/K 0 0.4-1.9 (V·s)/cm 2 with 221,888 and 4,001 bins along the mass spectrometry and ion mobility dimensions, respectively. The processing pipeline requires common m/z and 1/K 0 axes, hence individual centroid peaks were inserted at their correct bin positions along the mass spectrometry and ion mobility dimensions; missing values were set to zero. Following the conversion process, the mass spectra were aligned and calibrated along the m/z axis to reduce peak dri and permit identification. The m/z alignment was driven by four reference peaks that were present throughout the dataset: 703.5748, 734.5694, 760.5851, 810.6001. A er calibration, a mean mass spectrum of the entire dataset was generated and peak picked. A total of 879 features were selected and extracted to generate ion mobility-rich ion images. Since our biomarker candidate discovery workflow is optimized for non-ion mobility data, the ion mobility information is removed by summation of all ion mobility bins of the ion mobility-rich images to a single vector, resulting in a standard ion image.
• Dataset n o 2: The murine kidney data was exported into a custom binary format as described above. Data were acquired in the Q-TOF only mode, hence the ion mobility dimension is not present. We introduced a trivial second spectral dimension by enforcing the dataset to contain one ion mobility bin, in order to be able to lean on recent developments for e icient storage and data processing, but this process does not have any impact on the actual data. Following the conversion process, a mean mass spectrum of the entire dataset was generated, which was furthermore peak picked resulting in a total of 1,428 ion images for distinct features in the data.
Low variance noise was removed from both datasets by principal component analysis: principal component analysis is a matrix factorization technique that can be used to reduce the dimensionality of an IMS dataset while retaining most of its original variance, and hence most of its latent molecular information [16,77]. Feature centering, which consists in subtracting the mean from each column of the IMS data matrix, is performed without feature scaling. We argue that feature scaling prior to classifier training is not necessary because large di erences in high intensity peaks are biologically informative, and high intensity peaks should therefore have more influence on the classifier than low intensity peaks [16].

Data labeling
The supervised ML algorithms used to learn classifiers from IMS data require the pixels in the training set to be labeled. In our work, the pixels belonging to di erent tissue functional units are labeled by user-defined masks. The lowdimensional latent patterns extracted by non-negative matrix factorization from datasets n o 1 and n o 2 delineated several tissue functional units within the tissue sections and therefore facilitated the manual localization and annotation of the target regions. The mouse-pup liver and brain masks of Figure 4 were defined by using MATLAB's Image Processing Toolbox to draw polygonal regions of interest, guided by the latent features displayed in Figure S1. The quality of these user-defined masks was confirmed by collaborators at Vanderbilt University. The renal inner medulla, outer medulla, and cortex masks of Figure 7 were defined using the same approach. Note that our biomarker candidate discovery workflow would work equally well if provided with automatically-generated class annotations.
(a) Latent feature extracted using non-negative matrix factorization that highlights the mouse-pup brain.
(b) Latent feature extracted using non-negative matrix factorization that highlights the mouse-pup liver. Figure S1. Latent features obtained from dataset n o 1 by non-negative matrix factorization in order to generate the brain and liver masks displayed in Figure 4. The masks are polygons that were drawn by hand with guidance from these latent feature plots.

Feature identification
The top features discovered using our workflow were run against the LipidMaps database (lipidmaps.org) to produce tentative annotations [78,79]. We selected [M+H] + , [M+Na] + , and [M+K] + as the adduct ions, selected the "even chains only", and specified a narrow 0.005 m/z mass tolerance window. We calculated the ppm error associated with each assignment and removed assignments with errors larger than 3 ppm. Annotations are only provided for the mouse pup dataset because many of the top selected features for the rat kidney dataset were Additional case studies

Molecular marker discovery for the mouse-pup liver (dataset n o 1)
Our liver molecular marker discovery workflow starts with building a classifier from IMS dataset n o 1 and the userprovided liver mask shown in Figure 4b. Figure S2a shows the result of applying the classifier to all pixels of dataset n o 1: the mouse-pup's liver is successfully di erentiated from the other organs, demonstrating that the classifier was able to capture a liver-specific mass spectral signature in the data. Regarding generalization performance, the XGBoost classifier trained to recognize pixels belonging to the mouse-pup liver achieves a balanced accuracy of 0.9960, a precision of 0.9984, and a recall of 0.9989. Figure S2b shows the top ten molecular markers out of the global ranking of 879 features (i.e. m/z values) as obtained by TreeSHAP. The features are ranked in descending order of global SHAP predictive importance score, and thus in descending order of relevance to liver tissue recognition. Figure S2b therefore provides insight into a feature's global relevance to the recognition task of Figure S2a. The feature with the highest influence on the XGBoost classifier used to assign pixels to the liver (or not) has a m/z value of 746.607 and a global SHAP score of 1.595. The le column of Figure  S3 provides information about the spatial distribution and relative abundance of the three top-ranking molecular features for recognizing the mouse liver in this dataset. Figures S3a, S3c, and S3e are the ion images of the features ranked n o 1, o 2, and o 3 respectively. The right column of Figure S3 provides information on the signs and magnitudes of the local SHAP scores, or Shapley values, across the sample for each top-ranking feature. Figures S3b, S3d, and S3f are the SHAP maps of the three top-ranking features. These SHAP maps provide information on where and how a given ion intensity signal relates to the task of liver tissue recognition. Studying the le and the right columns of Figure S3 enables one to understand which features are positively or negatively correlated with a pixel being assigned to the liver.
The ion image of the feature ranked n o 1 ( Figure S3a) indicates that it has a very low abundance in the liver region. Yet the SHAP map of that feature ( Figure S3b) shows that the local SHAP scores are positive, with a high magnitude, in the liver region. The feature ranked n o 1 can therefore be considered negatively correlated with a pixel being assigned to the liver. The feature ranked n o 2 is also negatively correlated with the XGBoost classifier assigning a pixel to the liver. In short, measuring a low intensity for features ranked n o 1 and n o 2 in a given pixel increases the log-odds (and probability) of the classifier assigning that pixel to the liver. Given the high predictive performance of the classifier, we can assume that measuring a low intensity of features ranked n o 1 and n o 2 in a given pixel also increases the probability of that pixel actually belonging to the liver. Conversely, the feature ranked n o 3 is positively correlated with a pixel being assigned to the liver: its ion image ( Figure S3e) indicates that the feature ranked n o 3 has a high abundance in the liver, and its SHAP map ( Figure S3f) indicates that its local SHAP scores are high in the liver region. Measuring a high intensity of the feature ranked n o 3 in a given pixel increases the log-odds (and probability) of the classifier assigning that pixel to the liver. Furthermore, given the high predictive performance of the classifier, we can assume that measuring a high intensity of the feature ranked n o 3 in a given pixel also increases the probability of that pixel actually belonging to the liver. The absence of features ranked n o 1 and n o 2, and the presence of the feature ranked n o 3 are therefore characteristic of the mouse-pup liver. If one measures a low abundance of m/z 746.607 and m/z 757.622 and a high abundance of m/z 558.292 in a pixel, the probability of that pixel belonging to the liver is very high.
(a) Liver class prediction in the mouse-pup whole-body section. The class labels are predicted by the classifier designed to di erentiate the pixels belonging to the mouse-pup's liver from the pixels not belonging to the mouse-pup's liver.
(b) Global feature importance of the ten m/z values most relevant to recognizing liver tissue in the mouse-pup, as per the SHAP interpretability method. The features of the classifier trained to recognize the mouse-pup's liver tissue are ranked in descending order of global feature importance. The global SHAP score of a feature is obtained by averaging the magnitude of the feature's Shapley values, or local SHAP scores, over all labeled and unlabeled pixels in dataset n o 1. The ten m/z values shown here score the highest in terms of di erentiating relevance, and automatically provide a shortlist of molecular markers for mouse-pup liver tissue.  Figure S3. Three promising molecular markers for the mouse-pup's liver. The ion images (le ) and SHAP maps (right) of three features (i.e. m/z values) with the most influence on the decision-making process of the classifier trained to recognize the mouse-pup's liver are shown. The ion images plot the spatial distribution and measured intensity of each feature across the sample, and are not specifically tied to the task of recognizing the liver. The SHAP maps plot the spatial distribution of Shapley values, or local SHAP predictive importance scores, of each feature across the sample, and provide information on where and how the feature is relevant to the task of recognizing the liver. Figure S4a presents the classification result: the outer medulla has successfully been di erentiated from the cortex and inner medulla. Regarding generalization performance, the XGBoost classifier trained to recognize pixels belonging to the outer renal medulla achieves a balanced accuracy of 0.9970, a precision of 0.9972, and a recall of 0.9958. Figure S4b shows the top ten molecular markers out of the global ranking of 1,428 features (i.e. m/z values) as obtained by TreeSHAP. The features are ranked in descending order of global SHAP score, and thus in descending order of relevance to outer medulla tissue recognition. The classifier used to assign pixels to the outer medulla (or not) relies heavily on the feature ranked n o 1, whose m/z value is 588.468 and whose global SHAP score of 2.311. The le column of Figure S5 provides information about the spatial distribution and relative abundance of the three top-ranking molecular features for recognizing the outer medulla. The right column of Figure S5 provides information on the signs and magnitudes of the local SHAP scores, or Shapley values, of each top-ranking feature across the sample. These SHAP maps provide information on where and how a given ion intensity signal relates to the task of outer medulla tissue recognition.

Biomarker candidate discovery for the outer renal medulla (dataset n o 2)
The feature ranked n o 1 is strongly positively correlated with assigning a pixel to the outer medulla because its ion image ( Figure S5a) shows a high intensity in the outer medulla and because its SHAP map ( Figure S5b) shows positive Shapley values in the outer medulla and negative Shapley values in the inner medulla and cortex. Given the high predictive performance of the classifier, we can assume that measuring a high intensity of the feature ranked n o 1 in a given pixel also increases the probability of that pixel actually belonging to the outer medulla. Features ranked n o 2 and n o 3 (m/z 1444.101 and m/z 1543.214 respectively) are negatively correlated with the classifier assigning a pixel to the outer medulla. Given the high predictive performance of the classifier, we can assume that measuring a low intensity of the features ranked n o 2 and n o 3 in a given pixel also increases the probability of that pixel actually belonging to the outer medulla. The SHAP maps of features ranked n o 2 and n o 3 ( Figures S5d and S5f respectively) show that the XGBoost classifier uses the feature ranked n o 2 to di erentiate the outer medulla from the inner medulla and cortex, and that it uses the feature ranked n o 3 to di erentiate the outer medulla and inner medulla from the cortex.
(a) Outer medulla class prediction in the rat kidney section. The class labels are predicted by the classifier designed to di erentiate the pixels belonging to the outer medulla from the pixels not belonging to the outer medulla.
(b) Global feature importance of the ten m/z values most relevant to recognizing the renal outer medulla, as per the SHAP interpretability method. The features of the classifier trained to recognize the outer medulla tissue are ranked in descending order of global feature importance. The global SHAP score of a feature is obtained by averaging the magnitude of the feature's Shapley values, or local SHAP scores, over all labeled and unlabeled pixels in dataset n o 2. The ten m/z values shown here score the highest in terms of di erentiating relevance, and automatically provide a shortlist of biomarker candidates for rat renal outer medulla tissue. Biomarker candidate discovery for the renal cortex (dataset n o 2) Figure S6a presents the classification result. The cortex has successfully been di erentiated from the inner and outer medulla. Regarding generalization performance, the XGBoost classifier trained to recognize pixels belonging to the renal cortex achieves a balanced accuracy of 0.9974, a precision of 0.9976, and a recall of 0.9973.
(a) Cortex class prediction in the rat kidney section. The class labels are predicted by the classifier designed to di erentiate the pixels belonging to the cortex from the pixels not belonging to the cortex.
(b) Global feature importance of the ten m/z values most relevant to recognizing the renal cortex, as per the SHAP interpretability method. The features of the classifier trained to recognize the cortex tissue are ranked in descending order of global feature importance. The global SHAP score of a feature is obtained by averaging the magnitude of the feature's Shapley values, or local SHAP scores, over all labeled and unlabeled pixels in dataset n o 2. The ten m/z values shown here score the highest in terms of di erentiating relevance, and automatically provide a shortlist of biomarker candidates for rat renal cortex tissue. Figure S6. Renal cortex recognition and global feature ranking. Figure S6b shows the top ten molecular markers out of the global ranking of 1,428 features (i.e. m/z values) as obtained by TreeSHAP. The features are ranked in descending order of global SHAP score, and thus in descending order of relevance to cortex tissue recognition. The XGBoost model used to assign pixels to the cortex (or not) relies heavily upon one feature, whose m/z value is 1540.238 and whose global SHAP score is 2.507. The le column of Figure S7 presents the ion images of the three top-ranking features. The right column of Figure S7 presents the SHAP maps of the three top-ranking features. The ion image of the feature ranked n o 1 ( Figure S7a) shows that feature ranked n o 1 has high intensity in the cortex; and the SHAP map of feature ranked n o 1 ( Figure S7b) shows that feature ranked n o 1 has positive Shapley values in the cortex, and negative Shapley values in the inner and outer medulla. Therefore, feature ranked n o 1 (i.e. m/z 1540.238) is positively correlated to a pixel being assigned to the cortex. Conversely, features ranked n o 2 and n o 3 (i.e. m/z 816.652 and m/z 1546.226 respectively) are negatively correlated with the classifier assigning a pixel to the cortex. Given the high predictive performance of the classifier, we can assume that measuring a high intensity of feature ranked n o 1, and a low intensity of features ranked n o 2 and n o 3, increases the probability of that pixel actually belonging to the cortex.  Figure S7. Three promising molecular markers for the renal cortex. The ion images (le ) and SHAP maps (right) of three features (i.e. m/z values) with the most influence on the decision-making process of the classifier trained to recognize the rat's renal cortex are shown. The ion images plot the spatial distribution and measured intensity of each feature across the sample, and are not specifically tied to the task of recognizing the cortex. The SHAP maps plot the spatial distribution of Shapley values, or local SHAP predictive importance scores, of each feature across the sample, and provide information on where and how the feature is relevant to the task of recognizing the cortex.