Probabilistic Segmentation of Mass Spectrometry (MS) Images Helps Select Important Ions and Characterize Confidence in the Resulting Segments

Mass spectrometry imaging is a powerful tool for investigating the spatial distribution of chemical compounds in a biological sample such as tissue. Two common goals of these experiments are unsupervised segmentation of images into newly discovered homogeneous segments and supervised classification of images into predefined classes. In both cases, important secondary goals are to characterize the uncertainty associated with the segmentation and with the classification and to characterize the spectral features that define each segment or class. Recent analysis methods have focused on the spatial structure of the data to improve results. However, they either do not address these secondary goals or do so with separate post hoc procedures. We introduce spatial shrunken centroids, a statistical model-based framework for both supervised classification and unsupervised segmentation. It takes as input sets of previously detected, aligned, quantified, and normalized spectral features and expresses both the spatial and the multivariate nature of the data using probabilistic modeling. It selects informative subsets of spectral features that define each unsupervised segment or supervised class and quantifies and visualizes the uncertainty in spatial segmentations and in tissue classification. In the unsupervised setting, it also guides the choice of an appropriate number of segments. We demonstrate the usefulness of this framework in a supervised human renal cell carcinoma experimental dataset and several unsupervised experimental datasets, including a pig fetus cross-section, three rodent brains, and a controlled image with known ground truth. This framework is available for use within the open-source R package Cardinal as part of a full pipeline for the processing, visualization, and statistical analysis of mass spectrometry imaging experiments.

Note that some of the normal tissues appear to have regions of cancerous tissue, such as the left edge of sample E, UH9812 03.

Algorithm and implementation
All the referenced equations can be found in the main text. Both algorithms are implemented in the R package Cardinal (cardinalmsi.org) [2], available from Bioconductor. Source code is available at http://github.com/kbemis/Cardinal.

Procedure for spatial shrunken centroids classification (supervised)
The following describes the algorithm for a single set of parameters for a single fold of cross-validation. Parameters should be selected as described in Section 5.2.6.
2. Output the class assignments and class probabilities p_k(x_ijm).
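
As a rough illustration of the kind of computation behind the class assignments and class probabilities, the following Python sketch implements classic nearest shrunken centroids: soft-thresholded t-statistics, shrunken centroids, and class probabilities derived from a Gaussian discriminant score. It deliberately omits the spatial weighting that spatial shrunken centroids adds and does not reproduce the equations of the main text. The function names, the small constant s0, and the toy data are assumptions made for illustration only.

```python
import math

def soft_threshold(t, s):
    """Shrink t toward zero by s; any |t| <= s becomes zero."""
    return math.copysign(max(abs(t) - s, 0.0), t)

def fit_shrunken_centroids(X, y, s, s0=1e-6):
    """X: list of feature vectors, y: class labels, s: shrinkage parameter.
    s0 is a hypothetical small constant that keeps the denominators positive."""
    n, P = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(x[p] for x in X) / n for p in range(P)]
    members = {k: [x for x, lab in zip(X, y) if lab == k] for k in classes}
    centroids = {k: [sum(x[p] for x in rows) / len(rows) for p in range(P)]
                 for k, rows in members.items()}
    # Pooled within-class standard deviation of each feature.
    sd = []
    for p in range(P):
        ss = sum((x[p] - centroids[lab][p]) ** 2 for x, lab in zip(X, y))
        sd.append(math.sqrt(ss / (n - len(classes))) + s0)
    model = {"sd": sd, "prior": {}, "t": {}, "centroid": {}}
    for k, rows in members.items():
        m_k = math.sqrt(1.0 / len(rows) - 1.0 / n)
        # t-statistic of each feature against the overall mean, then shrunk by s.
        t = [soft_threshold((centroids[k][p] - overall[p]) / (m_k * sd[p]), s)
             for p in range(P)]
        model["t"][k] = t
        # Features whose shrunken t is zero fall back to the overall mean.
        model["centroid"][k] = [overall[p] + m_k * sd[p] * t[p] for p in range(P)]
        model["prior"][k] = len(rows) / n
    return model

def predict_proba(model, x):
    """Class probabilities from standardized distances to the shrunken centroids."""
    scores = {}
    for k, c in model["centroid"].items():
        d = sum(((x[p] - c[p]) / model["sd"][p]) ** 2 for p in range(len(x)))
        scores[k] = -0.5 * d + math.log(model["prior"][k])
    top = max(scores.values())  # subtract the maximum for numerical stability
    w = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

# Hypothetical toy data: only the first feature separates the two classes.
X = [[5.0, 1.0, 2.0], [6.0, 1.2, 2.1], [5.5, 0.9, 1.9],
     [1.0, 1.1, 2.0], [1.5, 1.0, 2.1], [0.5, 1.0, 1.9]]
y = ["cancer", "cancer", "cancer", "normal", "normal", "normal"]
model = fit_shrunken_centroids(X, y, s=2.0)
probs = predict_proba(model, [5.2, 1.0, 2.0])
assigned = max(probs, key=probs.get)
```

In this toy example the shrunken t-statistics of the two uninformative features are zero, so only the first feature contributes to the comparison between classes. This is the sense in which shrinkage selects informative spectral features.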

Procedure for spatial shrunken centroids segmentation (unsupervised)
The following describes the algorithm for a single set of parameters. Parameters should be selected as described in Section 5.2.6.
ii. If a segment has N_k = 0, define the distance to it as d(
iv. Calculate the segment membership probability p_k(x_ijm) [Equation 11].
(b) Assign the pixel to the segment with the highest probability p_k(x_ijm).
5. Update the segments with the pixel assignments from step 4b.
6. Repeat steps 3-5 until no segments change, or at most iter.max times.
7. Output the shrunken t-statistics t_kp, shrunken centroids x_k, and probabilities p_k(x_ijm).

Evaluation
Supplementary Figure 5B shows k-means clustering applied to the first five principal components of the peak-picked spectra, which also results in a noisy segmentation, but with all parts of the painting represented as segments. Supplementary Figure 5C and Supplementary Figure 5D show the spatially-aware clustering and spatially-aware structurally-adaptive clustering of Alexandrov and Kobarg [1], which both result in cleaner segmentations with clearer edges between segments. The methods above, which require a predetermined number of segments, were set to 8 segments. Supplementary Figure 5E and Supplementary Figure 5F show the proposed spatial shrunken centroids segmentation method with SA and SASA distances, which produce clean segmentations comparable to Supplementary Figure 5C and Supplementary Figure 5D. The proposed method was initialized to 10 segments, resulting in 8 segments in the final segmentations.

Supplementary Figure 5 (caption, partial): E, Spatial shrunken centroids with SA distance. F, Spatial shrunken centroids with SASA distance.
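
The drop from 10 initial segments to 8 final segments is a consequence of the iterative procedure: a segment that loses all of its pixels cannot win any pixel back. The Python sketch below shows this behavior with a deliberately simplified assign-and-update loop on one-dimensional intensities. It uses plain squared distance to the segment mean, with no spatial weighting, shrinkage, or membership probabilities, and the distance to an empty segment is assumed to be infinite here. The function name, the iter_max argument, and the toy data are hypothetical.

```python
import math

def segment(values, labels, n_segments, iter_max=10):
    """Alternate between updating segment means and reassigning pixels."""
    labels = list(labels)
    for _ in range(iter_max):
        # Update step: mean intensity of every non-empty segment.
        sums = [0.0] * n_segments
        counts = [0] * n_segments
        for v, k in zip(values, labels):
            sums[k] += v
            counts[k] += 1
        centroids = [sums[k] / counts[k] if counts[k] else None
                     for k in range(n_segments)]

        def dist(v, k):
            # Assumption: an empty segment is infinitely far from every pixel.
            return math.inf if centroids[k] is None else (v - centroids[k]) ** 2

        # Assignment step: each pixel goes to its closest segment.
        new_labels = [min(range(n_segments), key=lambda k: dist(v, k))
                      for v in values]
        if new_labels == labels:  # stop when no segment changes
            break
        labels = new_labels
    return labels

# Hypothetical toy image: two intensity groups, initialized with three segments.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
initial = [0, 1, 2, 0, 1, 2]
final = segment(values, initial, n_segments=3)
n_remaining = len(set(final))
```

Starting from three segments, the loop ends with two non-empty segments because the middle segment is never the closest one to any pixel after the first reassignment.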

Statistical regularization enables data-driven selection of the number of segments for unsupervised experiments
The selection of the number of segments for the pig fetus cross-section dataset is illustrated in Supplementary Figure 6. Supplementary Figure 6A shows the predicted number of segments for increasing shrinkage parameter s for spatial shrunken centroids with the spatially-aware (SA) distance. Supplementary Figure 6B shows the same for spatial shrunken centroids with the spatially-aware structurally-adaptive (SASA) distance. The method was initialized for spatial smoothing radii r = 1 and r = 2, and for starting number of segments K = 15 and K = 20. The shrinkage parameter s was increased from 0 to 9 in increments of 3.
To identify segmentations with the most appropriate number of segments, we first look for where the predicted number of segments becomes similar across different numbers of starting segments K.
When this happens, only meaningful segments should remain. This occurs around s = 3. Next, we look for where the predicted number of segments stabilizes, which should correspond with an "elbow" in the graph, similar to a scree plot. For Supplementary Figure 6A, this occurs at s = 6, but for Supplementary Figure 6B, this may occur earlier.

Figure caption (partial): A-C, The predicted segment membership probabilities from spatial shrunken centroids with SA distance. A, the text segment, B, the body segment, and C, the wing segment. D-F, The shrunken t-statistics of the spectral features. D, the text segment, E, the body segment, and F, the wing segment. G-I, The single ion images corresponding with the top-ranked spectral features by shrunken t-statistic. G, the text segment, H, the body segment, and I, the wing segment.

The highest cross-validated accuracy rate was for r = 3, s = 20 with 88.9% accuracy, defined as correctly classifying a pixel as cancer or normal. Each slide was treated as its own fold in 8-fold cross-validation, i.e., leave-one-sample-out cross-validation.
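
The leave-one-sample-out scheme can be made concrete with a short Python sketch: every fold holds out all pixels of one slide, and accuracy is the fraction of held-out pixels whose predicted label matches the true one. The nearest-mean classifier below is only a stand-in for the actual method, and all names and data values are hypothetical.

```python
def leave_one_sample_out(sample_ids):
    """Yield (held_out_sample, train_indices, test_indices), one fold per sample."""
    for held_out in sorted(set(sample_ids)):
        train = [i for i, s in enumerate(sample_ids) if s != held_out]
        test = [i for i, s in enumerate(sample_ids) if s == held_out]
        yield held_out, train, test

def cross_validated_accuracy(features, labels, sample_ids, fit, predict):
    """Pixel-level accuracy pooled over all leave-one-sample-out folds."""
    correct = 0
    for _, train, test in leave_one_sample_out(sample_ids):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, features[i]) == labels[i] for i in test)
    return correct / len(labels)

# Stand-in classifier (an assumption, not the method described here):
# assign each pixel to the class whose mean intensity is closest.
def fit_nearest_mean(xs, ys):
    means = {}
    for label in set(ys):
        vals = [x for x, lab in zip(xs, ys) if lab == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(x - means[label]))

# Hypothetical data: four slides, each with one cancer and one normal pixel.
sample_ids = ["A", "A", "B", "B", "C", "C", "D", "D"]
labels = ["cancer", "normal"] * 4
features = [5.0, 1.0, 5.5, 1.2, 4.8, 0.9, 1.1, 1.0]

n_folds = len(list(leave_one_sample_out(sample_ids)))
accuracy = cross_validated_accuracy(
    features, labels, sample_ids, fit_nearest_mean, predict_nearest_mean)
```

Holding out whole slides keeps pixels from the same tissue section out of both the training and test sets of a fold, so the accuracy estimate reflects performance on an unseen sample rather than on neighboring pixels of a sample already seen.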