Generalizable biomarker prediction from cancer pathology slides with self-supervised deep learning: A retrospective multi-centric study

Summary
Deep learning (DL) can predict microsatellite instability (MSI) from routine histopathology slides of colorectal cancer (CRC). However, it is unclear whether DL can also predict other biomarkers with high performance and whether DL predictions generalize to external patient populations. Here, we acquire CRC tissue samples from two large multi-centric studies. We systematically compare six different state-of-the-art DL architectures to predict biomarkers from pathology slides, including MSI and mutations in BRAF, KRAS, NRAS, and PIK3CA. Using a large external validation cohort to provide a realistic evaluation setting, we show that models using self-supervised, attention-based multiple-instance learning consistently outperform previous approaches while offering explainable visualizations of the indicative regions and morphologies. While the prediction of MSI and BRAF mutations reaches clinical-grade performance, the prediction of PIK3CA, KRAS, and NRAS mutations remains clinically insufficient.


In brief
Niehues et al. evaluate deep-learning-based prediction for MSI, BRAF, KRAS, NRAS, and PIK3CA biomarker status in colorectal cancer from histopathology slides. They evaluate the performances of trained models in a realistic setting on a large independent patient cohort and find that attention-based multiple-instance learning outperforms all other approaches.

INTRODUCTION
Digitized histopathological slides with hematoxylin and eosin (H&E) staining offer a wealth of information that can be quantified and made usable by artificial intelligence (AI), in particular by deep learning (DL) neural networks. 1 DL networks have been developed to predict clinically relevant biomarkers directly from H&E-stained tumor tissue sections. [2][3][4][5] The application of DL for such complex tasks represents a major part of ''computational pathology.'' 3,4,6 In colorectal cancer (CRC), DL-based predictability of biomarkers from H&E-stained tissue sections has been reported for microsatellite instability (MSI) [7][8][9][10][11][12][13][14] and, in smaller studies, for mutations in BRAF, 10,13 TP53, KRAS, SMAD4, PIK3CA, and other genes. 4,15,16 Prediction of MSI or mismatch repair deficiency (dMMR) in CRC is one of the most widely studied tasks 17 due to its high clinical relevance: first, the MSI status may point to hereditary causes of CRC. 18 Second, MSI is the strongest predictor of response to cancer immunotherapy. 19 Third, MSI has an important role in the management of patients with CRC, for example in the decision whether to prescribe adjuvant chemotherapy. 20 Building on evidence provided in multiple studies, 7,9,14,17,21,22 the first DL algorithm for MSI prediction received regulatory approval in Europe in 2022 (''MSIntuit CRC'' by Owkin, France/USA). However, various questions remain open, which is even more relevant now that this method can be used in routine diagnostics. The most important issue of existing MSI detection algorithms is their generalizability. 23 Usually, a pronounced performance drop is observed when deploying trained models on external patient cohorts. 21 Validation on external cohorts is therefore crucial for testing whether a model's prediction performance translates, and hence generalizes, to independent datasets.
The second issue is explainability, i.e., identifying which tissue patterns are associated with which genetic alterations. The third issue is the scope of the methods, i.e., their application to other biomarkers beyond MSI. Many genetic alterations are related to morphological features in tumor tissue. This is known for MSI 24 and BRAF mutations 25 in CRC and several mutations in other tumor types. 26,27 However, few studies have investigated alterations beyond MSI in CRC in large patient cohorts. While recent studies investigating the DL-based prediction of MSI status included thousands of patients, 17 studies investigating other biomarkers such as BRAF, KRAS, NRAS, and PIK3CA mutations are often limited to smaller cohorts with suboptimal data quality. 28 From a technical point of view, the most widely used method for biomarker prediction in computational pathology is to train DL networks on image tiles obtained from histological whole-slide images (WSI). 4,29 Mutation labels, however, only exist for the entire WSI, and it is unclear which regions on the WSI express morphologies that reflect underlying mutations. Therefore, tile predictions must be aggregated to slide predictions. A common approach is to apply transfer learning to models pre-trained on ImageNet and to use mean pooling for tile-to-slide aggregation. 7,[29][30][31] This method, the ImageNet pre-trained (INPT) approach, was first applied in histopathology by Coudray et al. in 2018. 30 Recent proof-of-concept studies have suggested that the attention-based multiple-instance learning (attMIL) 32 approach is superior to the INPT approach. 12 The image feature extractor (encoder) in attMIL can be pre-trained via self-supervised learning (SSL). Schirris et al. used SSL-attMIL in a pilot study on a public dataset with 360 patients. 12 On this relatively small dataset, they reported a performance gain compared with the INPT approach. However, this performance gain has not been validated in larger cohorts.
Similarly, other works have applied the attMIL approach with and without SSL to predict biomarkers but have only provided external validation in small datasets, if at all. 5,33,34 In summary, previous evidence suggests that both SSL and attMIL are useful components in weakly supervised computational pathology pipelines, but this has not been systematically tested in a clinically relevant task with large-scale external validation. Such a lack of large-scale validation is a risk for the ultimate generalizability of any biomarker prediction model. 23,35 In this light, we aimed to fill two knowledge gaps by answering two questions: first, do attMIL and SSL really provide a performance gain compared with the INPT approach? Second, is MSI the only predictable biomarker in CRC, or is the mutational status of BRAF, KRAS, NRAS, and PIK3CA similarly predictable?
To this end, we implemented the INPT approach as a baseline and trained models for the prediction of multiple biomarkers in CRC. We tested the generalization on a test dataset and saw a performance drop, as expected. Subsequently, we implemented attMIL and applied it using two different SSL-trained feature extractors. We showed that one encoder outperformed the other by a large margin. The better encoder generalized well to the second dataset and consistently outperformed all other tested models. Finally, we extended attMIL by including clinical patient data and showed that, although there was no synergy on the training dataset, performance on the test dataset increased.

RESULTS
attMIL outperforms the INPT approach for biomarker prediction
First, we investigated the predictability of MSI, BRAF, KRAS, NRAS, and PIK3CA directly from H&E histopathology images in the QUASAR cohort (Tables 1 and 2). We compared the INPT approach with SSL-attMIL using the SSL encoders by Ciga or Wang (Figures 1A-1C). We found that the best performances were obtained using image-only Wang-attMIL. For prediction of MSI, BRAF, KRAS, NRAS, and PIK3CA, areas under the receiver operating characteristic curve (AUROCs) of 0.94 ± 0.02, 0.82 ± 0.05, 0.67 ± 0.04, 0.52 ± 0.12, and 0.57 ± 0.07 were obtained, respectively (Figures 2A-2E). Previous studies have discussed that AUROCs close to 0.9 with good generalization have a high discriminative power, which may be clinically relevant. 9,29,36,37 In this sense, only MSI and BRAF mutation prediction reached a potentially clinically relevant level, but the prediction of the other investigated biomarkers did not. Because using the AUROC as the sole metric is suboptimal, 38 we evaluated the performance of the image-only Wang-attMIL model in QUASAR at pre-defined threshold values (Figure 3). For MSI prediction, the 95% in-domain sensitivity threshold of 0.21 yielded 236 true positive, 639 false positive, 9 false negative, and 890 true negative predictions across the five internal datasets. Globally, this corresponds to a sensitivity of 96.3%, a specificity of 58.2%, a positive predictive value (PPV) of 27%, and a negative predictive value (NPV) of 99%. At a threshold value of 0.5, BRAF status was globally predicted with a sensitivity of 73.3%, a specificity of 73.5%, a PPV of 19.7%, and an NPV of 96.9% across the five internal test sets. Notably, for BRAF status prediction, the requirement of 95% in-domain sensitivity comes at a high cost in specificity.
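As a sanity check, the operating-point metrics quoted above follow directly from the four confusion-matrix counts; a minimal sketch (the function name is illustrative, not part of the study's pipeline):

```python
def operating_point_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)  # positive predictive value
    npv = tn / (tn + fn)  # negative predictive value
    return sensitivity, specificity, ppv, npv

# MSI prediction at the 95% in-domain sensitivity threshold (QUASAR internal test sets)
sens, spec, ppv, npv = operating_point_metrics(tp=236, fp=639, fn=9, tn=890)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}, PPV={ppv:.0%}, NPV={npv:.0%}")
# → sensitivity=96.3%, specificity=58.2%, PPV=27%, NPV=99%
```

The low PPV despite the high NPV reflects the class imbalance: MSI-positive patients are a small minority, so a rule-out use of the classifier is the more natural operating mode.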
Together, these data show that the DL methods presented in this article have the potential to reach clinical-grade performance for the prediction of MSI, and near-clinical-grade performance for the prediction of BRAF, but that they do not reach a high performance for KRAS, NRAS, and PIK3CA, despite using the best-performing image-only Wang-attMIL models in a large patient cohort.
There is no direct synergy between clinical data and image data in biomarker prediction
Further, we investigated whether adding baseline clinical data (gender, age, tumor location) as additional inputs improves the internal prediction performance of the best model. Wang-attMIL with clinical data (multi-input model) achieved the following AUROCs: MSI 0.94 ± 0.02, BRAF 0.82 ± 0.07 (Figures 2F and 2G), KRAS 0.66 ± 0.04, NRAS 0.49 ± 0.18, and PIK3CA 0.52 ± 0.17 (Table 1), yielding statistical compatibility with the image-only Wang-attMIL model for MSI and BRAF prediction. The solely clinical-data-based model achieved good prediction results as well (AUROCs: MSI 0.80 ± 0.03, BRAF 0.77 ± 0.08, Figures 2F and 2G; KRAS 0.50 ± 0.06, NRAS 0.54 ± 0.13, PIK3CA 0.59 ± 0.06, Table 1). In particular, the solely clinical-data-based results for BRAF mutation prediction were close to those obtained with the image-only Wang-attMIL or the multi-input model and statistically compatible with all other DL approaches. This indicates that the visual features on H&E-stained tissue sections that are predictive of BRAF status are by themselves only slightly superior to the clinical variables. The same applies to the prediction of NRAS and PIK3CA mutation status. For KRAS and MSI status prediction, the image-based models outperformed the solely clinical-data-based model, indicating better predictability of biomarker status from image features than from clinical variables for these two biomarkers.
Image-only and multi-input attMIL generalize better than the state of the art
Next, we assessed the generalizability of QUASAR-trained models on the DACHS cohort (Table 2; Figures S1 and S2). One set of tiles was color normalized using the Macenko method, while another set contained the same tiles without any color normalization. Here, we restricted the analysis to MSI and BRAF biomarker prediction, as the other biomarkers had already been shown to perform poorly during internal validation.

The image-only Wang-attMIL models and the multi-input models yielded a high performance for the prediction of MSI and BRAF status (Figures 2F and 2G). For MSI and BRAF prediction on color-normalized tiles in the external validation cohort, AUROCs of 0.92 ± 0.01 and 0.81 ± 0.01 were obtained by image-only Wang-attMIL, and AUROCs of 0.92 ± 0.01 and 0.85 ± 0.01 by the multi-input models, respectively (Figures 2F-2I). For BRAF mutation prediction, this shows a better generalization of the multi-input models compared with the image-only Wang-attMIL models. These high AUROCs correspond to high areas under the precision-recall curve (AUPRCs) (Table 1; Figure S3), pointing to potential clinical applicability. For MSI prediction in DACHS with the 95% in-domain sensitivity threshold value of 0.21, the averaged models' scores achieved a sensitivity of 90.5%, a specificity of 79.6%, a PPV of 33.7%, and an NPV of 98.6%. At a threshold value of 0.5, BRAF status was predicted with a sensitivity of 73.3%, a specificity of 73.5%, a PPV of 19.7%, and an NPV of 96.9% (Figure 3). Clinical statistics for correctly classified and misclassified patients in QUASAR and DACHS at a threshold value of 0.5 are given in Tables S1 and S2. The models had difficulties in correctly predicting MSI-positive patients with rectal cancer in the DACHS cohort. In the case of rectal carcinomas, the odds ratio for the correct classification of an MSI-positive patient in QUASAR compared with DACHS was 11.7, suggesting that more data from patients with rectal carcinoma are required in future datasets.
Notably, when using the Wang encoder, the performance in the validation cohort was not dependent on the presence of color normalization. In contrast, the INPT models trained on QUASAR showed a marked performance drop on color-normalized DACHS images (AUROCs: MSI 0.86 ± 0.02, BRAF 0.78 ± 0.02) and dropped further in performance for the non-normalized images (AUROCs: MSI 0.80 ± 0.04, BRAF 0.75 ± 0.03). This shows that the INPT approach is less stable and generalizes less well than the image-only Wang-attMIL or multi-input models. The robustness of the Wang-attMIL approach seemed to be due to the particular encoder, since the Ciga-attMIL model generalized poorly (AUROCs color normalized: MSI 0.72 ± 0.03, BRAF 0.73 ± 0.02; AUROCs non-normalized: MSI 0.71 ± 0.05, BRAF 0.68 ± 0.06; Table 1). Results of the analysis of variance (ANOVA) for AUROCs obtained with trained models in internal validation and in external validation on DACHS for MSI and BRAF status prediction are listed in Tables S3-S8.

(Figure 1. Schematic workflow of this study. (A) Schematic summary of attMIL and the multi-input DL architecture: a WSI is tessellated into smaller tiles that are subsequently pre-processed and passed through the encoder to give image feature vectors. In the multi-input case, each image feature vector is concatenated with a vector representing the patient's clinical data.)

SSL-attMIL is domain-shift invariant
Domain shifts can still hide behind high AUROC values and can severely limit the real-world performance of DL models. 38 We investigated the distribution of the image-only Wang-attMIL model prediction scores for MSI and BRAF in the training and test cohort. We found that the prediction scores were similarly distributed in the training and test set for the image-only Wang-attMIL (Figure S4) as well as for the multi-input models (Figure S5). In summary, these data show that Wang-attMIL yields classifiers with high generalizability across the two datasets, which are independent of Macenko normalization and do not display domain shifts. Furthermore, adding clinical data to the models leads to even better generalization.

Attention-based models attend to relevant tissue regions
To comprehend the decision-making processes of trained DL models, we investigated the visual patterns in their spatial context on WSIs. We separately visualized attention and prediction heatmaps for typical patients for the image-only Wang-attMIL models (Figures 4A and 4B). For MSI prediction, high-attention regions were confined to the tumor tissue, while fibromuscular tissue and non-tumor epithelium received less attention from the model (Figures 4A and 4B). In BRAF prediction, however, the attention was more spread out: tumor tissue was still attended to more than non-tumor tissue, but to a lesser extent (Figures 4A and 4B). This indicates that either the BRAF prediction model did not learn to focus sufficiently on the tumor tissue or it learned that visual features outside of the tumor region are somewhat relevant to making predictions. In particular, lymphocyte-infiltrated muscle tissue was assigned a high BRAF score and a high attention score. Confounding factors in the images for BRAF status prediction are yet another possibility. Further high-resolution heatmaps for MSI and BRAF status for typical patients are available at Zenodo: https://doi.org/10.5281/zenodo.7454743. Interestingly, the presence of pen marks on some slides did not confuse the models: pen marks were assigned a very low attention score, showing that the image-only Wang-attMIL model is robust even to the presence of artifacts.
Distinct visual features drive MSI and BRAF prediction
MSI and BRAF mutant status are highly correlated; therefore, we addressed whether the models recognize different sets of visual features for either target. First, we investigated whether BRAF mutations can be predicted in the MSI and microsatellite stable (MSS) subgroups of the QUASAR trial dataset. Using image-only Wang-attMIL models, the DL system was able to detect BRAF mutational status in the MSI subgroup, reaching a cross-validated AUROC of 0.73 ± 0.06 (Figure 5A). However, BRAF status was not predictable in the MSS subgroup, reaching an AUROC of 0.66 ± 0.10 (Figure 5B). Second, we analogously repeated the analysis for MSI status prediction in BRAF-mutated and wild-type subgroups: MSI status was predictable in BRAF wt patients (AUROC 0.89 ± 0.06, Figure 5C) and BRAF mut patients (AUROC 0.78 ± 0.15, Figure 5D). We further investigated the visual features present in image tiles that were simultaneously assigned a high attention score and a high class prediction score.
We found that MSI (Figure 5E) and MSS (Figure 5F) tiles showed similar patterns to those described previously: poorly differentiated tumor glands with immune-infiltrated stroma in MSI versus well-differentiated, stroma-rich tissue areas for MSS. 17,24 BRAF mut (Figure 5G) and BRAF wt (Figure 5H) top tiles showed different prominent patterns than MSI and MSS tiles, with mucinous differentiation dominating BRAF mut tiles and well-differentiated, stroma-rich patterns dominating BRAF wt tiles. Using gradient-weighted class activation mapping (Grad-CAM) to highlight relevant subregions in these top tiles, we found that the models indeed focused on these tissue structures (Figures 5E-5H). MSI and BRAF prediction scores were correlated in all patient subgroups (Figure S6). Taken together, these data show that MSI and BRAF prediction models detect distinct visual features that are compatible with previous knowledge; however, MSI features appear to be more distinct, as MSI status is more easily detectable in the BRAF mut/BRAF wt subgroups than BRAF status is in the MSI/MSS subgroups.

DISCUSSION
MSI prediction from histopathology with DL has been investigated since 2019. 7,14,17,22 Earlier works used the INPT approach with mean pooling for slide-level aggregation. 7 Recent studies have investigated attention-based MIL approaches in the hope of less noisy supervision and of models able to learn to combine global features. [39][40][41][42] Most recently, SSL methods have been adopted in the histopathology domain. 12 In a smaller pilot study, the attMIL approach has shown superior performance compared with the INPT approach. 12 The main limitations of many of these works, however, are that (1) they focus on only a few clinically relevant tasks and (2) they are not validated on external cohorts, thus lacking performance evaluation in realistic scenarios. First, we tested the performance of two attMIL models with different pre-trained encoders on multiple clinically relevant biomarkers. Second, we investigated their external validation performance on a large dataset for internally well-predictable biomarkers. For the attMIL approach, this degree of large-scale validation is required for clinical translation but was missing from previous studies. 23 This study evaluates current state-of-the-art methods for biomarker prediction in CRC from pathology slides in a realistic evaluation setting: SSL-attMIL with the Wang encoder outperformed all other approaches. This confirms the superiority of the attMIL approach when combined with an appropriate encoder on a large external dataset. Our Wang-attMIL models were generalizable and invariant to the color normalization in the test set. In contrast, this was not the case for our Ciga-attMIL models, where the encoder was trained on a similar, but much smaller, dataset compared with the Wang encoder.
This provides empirical evidence that Wang's encoder, trained via the clustering-guided contrastive learning (CCL) algorithm, is superior to Ciga's encoder, trained via SimCLR, for the biomarker prediction tasks investigated in this article. Thus, the Wang encoder provides an ideal backbone for the attMIL approach to biomarker prediction at hand. Using the image-only Wang-attMIL models, our approach improves the AUROC for MSI prediction from 0.68 to 0.92 for training on QUASAR and testing on DACHS compared with Echle et al. 9 These results are in line with previous studies, which demonstrated the superiority of the attMIL approach for biomarker prediction. 12,22,40,43 Further, we demonstrated that the morphological features most relevant for a prediction made by our best image-only MSI and BRAF models are in line with previous findings and pathological knowledge. 17,24,25 In addition, the current study extends these previous findings by (1) showing the superiority of the Wang-attMIL models using large cohorts with thousands of patients and (2) investigating multiple biomarkers beyond MSI.
Finally, we tested extensions of the image-only Wang-attMIL model by concatenating image vectors with vectors representing clinical patient data. Here, we did not see direct synergy in performance on the QUASAR cohort, but we did see enhanced prediction performance for patients in the DACHS cohort. This is true in particular for the prediction of BRAF mutation status, which shows a weaker morphological phenotype compared with MSI. In this case, multi-input models stabilized predictions across different datasets.
Prediction of genetic alterations such as MSI and BRAF mutation is regarded as one of the most relevant applications of computational pathology. 2 Exceeding pure research applications, the prediction of MSI status has enormous commercial potential. This is evident in the multiple companies that have developed solutions for MSI status prediction. [43][44][45]

(Figure 5 legend excerpt: In (E)-(G), top tiles are the highest, top 5%, and top 10% scoring tiles in terms of the product of the tile's attention and the tile's classification score (left to right) for the patients with the highest overall classification score for the target mutation (top to bottom). High-resolution images can be found at Zenodo: https://doi.org/10.5281/zenodo.7454743. Correlation of prediction scores for MSI and BRAF status for the best image-only model can be found in Figure S6.)
Limitations of the study
However, we also identified limitations of DL-based biomarker prediction. While previous studies have suggested that mutations in KRAS, NRAS, and PIK3CA might be predictable from pathology images, 4,22,46 we show that this performance is not in a clinically relevant range with the methods described in this article. Although prediction of these biomarkers was possible with non-random AUROCs above 0.5, this is far from suitable for clinical application. Also, we show that a trivial model that uses only age, gender, organ, and sidedness as input reaches similar performance for the prediction of NRAS and PIK3CA mutations (Table 1). Our study thereby provides suggestive evidence that, despite the use of large, multi-centric patient cohorts and powerful DL models, it is not possible to predict the mutational status of KRAS, NRAS, and PIK3CA from CRC histopathology slides with current methods.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

DECLARATION OF INTERESTS
For transparency, we provide the following information: J.

Materials availability
This study did not generate new unique reagents.

Data and code availability
The DACHS and QUASAR data used in this study cannot be deposited in a public repository because of local ethical prohibitions. All source code is available at GitHub: https://github.com/KatherLab/marugoto. Heatmaps for typical patients and high-resolution images of top tiles have been deposited at Zenodo: https://doi.org/10.5281/zenodo.7454743. Models trained in this study have been deposited at GitHub: https://github.com/KatherLab/crc-models-2022. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Ethics statement
This study was performed in accordance with the Declaration of Helsinki. This study is a retrospective analysis of digital images of anonymized archival tissue samples of multiple cohorts of CRC patients. Data were collected and anonymized, and ethical approval was obtained. In subsets of this multicenter study, the methods used were the single-stranded conformational polymorphism technique and immunohistochemical analyses, 55 respectively, or Sanger sequencing. 56 CONSORT charts with details on missing data and preprocessing dropout for the QUASAR and DACHS cohorts can be found in Figures S1 and S2.

METHOD DETAILS
Image preprocessing
All images from H&E-stained resection tissue slides were preprocessed according to the ''Aachen protocol for deep learning histopathology''. 57 WSIs were tessellated into 512 × 512 pixel image tiles of 256 μm edge length. Tissue regions were automatically selected using RGB thresholding (summed median brightness across RGB channels < 660) and Canny edge detection, requiring at least four edges per image tile. 40 All remaining tiles were included in the analysis. The fraction of blurry or homogeneous tiles was estimated using the variation-of-the-Laplacian method, 58 which showed that 9.2% and 3.4% of the tiles stayed below a score value of 80 in the QUASAR and DACHS cohorts, respectively. Tiles were processed at 224 px edge length (effective resolution of 1.14 μm per pixel) using bilinear interpolation as implemented in PyTorch's ''Resize'' function and normalized with ImageNet's mean and standard deviation of RGB pixel values. Tiles in the training set were color-normalized with Macenko's method using a reference image tile. 7,59 In the test set, model performance was assessed on both color-normalized and native tiles.
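The brightness-based tissue filter and the ImageNet normalization described above can be sketched as follows. This is a minimal numpy illustration: the helper names are ours, and the Canny edge check and Macenko normalization are omitted for brevity:

```python
import numpy as np

# Standard ImageNet channel statistics used for normalization
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def tile_passes_brightness(tile: np.ndarray) -> bool:
    """Tissue check: summed median brightness across RGB channels must be < 660.

    `tile` is an (H, W, 3) uint8 array; near-white background tiles
    (median close to 255 per channel, sum close to 765) are rejected.
    """
    return float(np.median(tile.reshape(-1, 3), axis=0).sum()) < 660

def normalize_imagenet(tile: np.ndarray) -> np.ndarray:
    """Scale pixel values to [0, 1] and standardize per channel."""
    return (tile.astype(np.float32) / 255.0 - IMAGENET_MEAN) / IMAGENET_STD

background = np.full((224, 224, 3), 250, dtype=np.uint8)  # near-white slide glass
tissue = np.full((224, 224, 3), 120, dtype=np.uint8)      # darker, stained region
print(tile_passes_brightness(background), tile_passes_brightness(tissue))
# → False True
```

The filter only discards background; blur scoring (variation of the Laplacian) and the edge-count criterion would be applied on top of it in the full pipeline.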

Biomarker prediction from whole slide images
We compare results obtained with two different DL approaches: the INPT approach and the attMIL approach. Both approaches address a classification problem in which the objective is to predict a slide label from a collection of individual tiles.
In the INPT approach, 7,30 a DL network pre-trained on ImageNet is fine-tuned using the WSI-level label assigned to each tumor tile. Slide-level predictions are then obtained by averaging (mean pooling) the tile-level predictions. This has resulted in high-performance models, 9 but imperfect generalization to external cohorts. 21 The attMIL approach is a two-stage process: first, tile images are compressed to image feature vectors using a pre-trained encoder network. Second, the image feature vectors are used as input to a network that uses an attention mechanism to aggregate predictions from tile to slide level. In short, this network computes an attention-weighted average of the input feature vectors, which is then classified; it can thus learn which parts of the input image should be down-weighted or discarded for the final prediction. We trained and tested models on top of two publicly available frozen encoders trained with self-supervised learning (SSL), referring to the generic pipeline as ''SSL-attMIL''. Ciga et al. applied SimCLR 60 to train a ResNet-18 on 400,000 pathology images selected from 57 datasets. 61 Wang et al. trained a ResNet-50 on a total of 15 million pathology images retrieved from 32,000 WSIs from the full TCGA and PAIP datasets via a clustering-guided contrastive learning (CCL) SSL algorithm. 62 In CCL, the learning objective is to minimize the contrastive loss between any two tiles from the same WSI and to maximize the loss for any two tiles from different WSIs. 62 In SimCLR, the contrastive loss is minimized for the same tile and maximized between any two different tiles. 60 We used both pre-trained models to extract 1024 (''Ciga-attMIL'') and 2048 (''Wang-attMIL'') features per tile. The set of features from all or a large subset of tiles from a WSI (we randomly sampled 512 tiles per WSI every epoch) was then used as input to the basic attMIL model 32 that learns to predict a single label for a WSI.
Finally, we extended the basic attMIL approach by adding basic clinicopathological data as an additional input to the model. These input data are known to be associated with MSI status: 24 gender, age, tumor sidedness (left/right), and organ (colon/rectum) (Table 2). To this end, each patient's clinical data were embedded into a vector representation. For each tile, this clinical data vector was concatenated with the image feature vector.
Setting all values of the image feature vectors to zero results in yet another model that solely depends on clinical data. We call the two described model architectures the ''multi-input'' and ''solely clinical-data-based'' models. The multi-input and solely clinical-data-based models were trained using the same hyperparameters as in the image-only approach. Detailed information on the training procedure and model details are available in the STAR Methods.

Visualization and explainability
Visualization of important morphological features relevant to the decision-making processes of DL models is important for (1) finding whether there are distinct morphologies for various mutations and (2) better comprehension of model internals. For visualization, we used three approaches. We showed the highest-scoring tiles from patients that were correctly classified with the highest scores. 63 Additionally, we applied Grad-CAM, 64 a generalization of the class activation mapping (CAM) algorithm. 65 Finally, WSI heatmaps display the separate spatial distributions of the attention and prediction scores.

Implementation of the INPT approach
In our implementation, tiles were used directly as inputs for transfer learning: a convolutional neural network (ResNet-18) pre-trained on ImageNet, with an appropriate substitution for the fully connected classification head. First, the new head's weights are trained with all other layers' weights frozen; subsequently, the remaining layers' weights are unfrozen and fine-tuned. Thus, the network learns to predict the biomarker status for a single tile, and the patient score is calculated by averaging across all tiles for a given patient. We used our in-house open-source pipeline DeepMed 66 with a batch size of 92, the Adam optimizer (β1 = 0.9, β2 = 0.99, ε = 1e-5), a learning rate of 2e-3, and 1% weight decay. 67 The cross-entropy loss function was weighted by the inverse of class frequencies to account for class imbalances. After fine-tuning the model's head for one epoch, the full model was trained for 32 epochs, during which the learning rate was scheduled by a modified ''1 cycle policy'' as made available by fastAI. [68][69][70] Maximum learning rates were set in equally spaced slices from lr_max = 1e-3 for the deepest layer to lr_max/100 for the shallowest layer. The learning rates sinusoidally increased from 1/5 of the maxima to the maxima over ten epochs. Then, the learning rates sinusoidally decreased from the maxima to 1/10,000 of the maxima over the remaining epochs. At the same time, β1 was sinusoidally varied from 0.95 to 0.85 over the first ten epochs and back to 0.95 over the remaining epochs. During training, tiles in the training dataset were augmented by combined operations of random rotations up to 360° with 75% probability and vertical flips with 50% probability.
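The sinusoidal learning-rate ramp described above can be sketched as a simple function of the epoch. This is an illustrative reimplementation of the described schedule, not the fastAI library code, and it ignores the per-layer slicing:

```python
import math

def one_cycle_lr(epoch: int, lr_max: float = 1e-3,
                 warmup: int = 10, total: int = 32) -> float:
    """Sinusoidal '1-cycle'-style schedule as described in the text:
    rise from lr_max/5 to lr_max over `warmup` epochs, then decay
    from lr_max down to lr_max/10,000 over the remaining epochs."""
    if epoch < warmup:
        t = epoch / warmup                 # half-cosine ramp up
        lo, hi = lr_max / 5, lr_max
    else:
        t = (epoch - warmup) / (total - warmup)  # half-cosine decay
        lo, hi = lr_max, lr_max / 10_000
    return lo + (hi - lo) * (1 - math.cos(math.pi * t)) / 2

print(round(one_cycle_lr(0), 6), round(one_cycle_lr(10), 6))
# → 0.0002 0.001
```

The same half-cosine shape, run in reverse between 0.95 and 0.85, gives the β1 momentum schedule mentioned in the text.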

Implementation of attention-based multiple instance learning
In both SSL-attMIL approaches, a fully connected layer followed by a ReLU embeds the features in a 256-dimensional space. This embedded vector is then passed through a linear layer that outputs another 256-dimensional vector h_k for tile k. The attention score a_k for the k-th tile is then calculated via

a_k = exp(w^T tanh(V h_k)) / Σ_{j=1}^{K} exp(w^T tanh(V h_j)),

where h_k ∈ R^256, V ∈ R^{128×256}, w ∈ R^128, and K is the maximal number of tiles randomly resampled every epoch for each patient. The MIL pooling operation is then applied via

h_sum = Σ_{i=1}^{K} a_i h_i,

where h_i is the i-th tile's embedding; a maximum of K = 512 tiles were used per patient. To obtain the final probability score for each patient, the batch of h_sum vectors is passed through a BatchNorm1D layer, followed by a Dropout layer with p = 50%. Then, h_sum is passed through a fully connected layer with two output dimensions, and finally a softmax layer is applied to obtain the scores. The batch size was 32 patients, the number of epochs was 32, the maximal learning rate was sinusoidally varied from lr_max/25 to lr_max = 1e-4 over eight epochs and back to lr_max/10,000 over the remaining epochs, no learning rate slicing was applied, β1 was varied with the same periodicity, and other hyperparameters were the same as in the INPT approach.
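The attention aggregation of the basic attMIL model 32 (a_k = softmax_k(w^T tanh(V h_k)), followed by the weighted sum over tile embeddings) can be sketched in a few lines of numpy; the dimensions follow the text, while the random inputs and the helper name are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: embeddings h_k in R^256, V in R^{128x256}, w in R^128
D, L, K = 256, 128, 512
H = rng.normal(size=(K, D))          # K tile embeddings for one patient
V = rng.normal(size=(L, D)) * 0.01   # attention weight matrix
w = rng.normal(size=L) * 0.01        # attention weight vector

def attention_pool(H, V, w):
    """Basic attMIL pooling: a_k = softmax_k(w^T tanh(V h_k)),
    h_sum = sum_k a_k h_k. Sketch of the aggregation step only."""
    scores = np.tanh(H @ V.T) @ w              # (K,) unnormalized attention
    scores -= scores.max()                     # numerical stability for softmax
    a = np.exp(scores) / np.exp(scores).sum()  # softmax over tiles
    return a, a @ H                            # attention weights, pooled embedding

a, h_sum = attention_pool(H, V, w)
print(a.shape, h_sum.shape, round(a.sum(), 6))
# → (512,) (256,) 1.0
```

In the full model, h_sum would then pass through BatchNorm, Dropout, a two-output linear layer, and a softmax, and all weights would be learned end-to-end rather than drawn at random.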

Implementation of multi-input prediction models
We one-hot encoded the patient's gender and tumor location and added the age (in years) as an integer variable. All variables were standardized to be zero-centered with unit variance. Missing values were filled using mean imputation. These features were concatenated with each tile's image feature vector before training. This extended vector was then used as input to the attMIL approach. As an ablation study, we set the image features to zero to separately test the performance of a solely clinical-data-based model.
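The multi-input feature construction can be sketched as follows; the category encodings and standardization constants (e.g., `age_mean`) are illustrative assumptions for this sketch, not the study's exact values:

```python
import numpy as np

def clinical_vector(gender, sidedness, location, age,
                    age_mean=65.0, age_std=10.0):
    """Illustrative clinical-data embedding: one-hot gender, tumor sidedness,
    and organ, plus standardized age with mean imputation for missing values."""
    g = {"female": [1, 0], "male": [0, 1]}[gender]
    s = {"left": [1, 0], "right": [0, 1]}[sidedness]
    o = {"colon": [1, 0], "rectum": [0, 1]}[location]
    if age is None:            # mean imputation -> exactly zero after centering
        age = age_mean
    return np.array(g + s + o + [(age - age_mean) / age_std], dtype=np.float32)

def multi_input_features(tile_features, clin):
    """Concatenate the same clinical vector onto every tile's feature vector."""
    clin_tiled = np.broadcast_to(clin, (tile_features.shape[0], clin.shape[0]))
    return np.concatenate([tile_features, clin_tiled], axis=1)

tiles = np.random.default_rng(1).normal(size=(512, 2048))  # Wang encoder features
clin = clinical_vector("female", "left", "rectum", age=None)
x = multi_input_features(tiles, clin)
print(x.shape)
# → (512, 2055)
```

Replacing `tiles` with an all-zero array reproduces the ablation used for the solely clinical-data-based model.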

Experimental design and statistics
We trained all neural network models on QUASAR via stratified five-fold cross-validation on the level of patients (''within-cohort experiment'', for MSI, BRAF, KRAS, NRAS, and PIK3CA). Subsequently, we applied all five models to the external validation cohort DACHS (only for MSI and BRAF). During cross-validation, a validation subset (25% of the training data) was randomly split off every training set to check for overfitting. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) served as statistical endpoints in our analysis, the latter being more robust to class imbalance. For clarity, we numbered all of our experiments and summarized the results in Table 1. AUROCs of trained models for internal and for external validation for MSI and BRAF status prediction on DACHS are compared using the analysis of variance (ANOVA) test, and p-values are listed in Tables S3-S8. In addition to the AUROC, we evaluated the sensitivity and specificity of our models at thresholds of 0.25, 0.5, 0.75, and a threshold giving 95% in-domain sensitivity. The 95% in-domain sensitivity threshold was obtained by taking the average of each model's 95% sensitivity threshold on its respective internal test dataset.
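The two evaluation statistics can be sketched in plain numpy; `auroc` (via the Mann-Whitney formulation) and `sensitivity_threshold` are illustrative helpers, not the study's pipeline code:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney formulation: the probability that a random
    positive scores higher than a random negative (ties counted as 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def sensitivity_threshold(scores, labels, sensitivity=0.95):
    """Threshold such that calling score >= threshold positive yields at least
    the requested sensitivity, i.e. a low quantile of the positive scores."""
    pos = np.sort(np.asarray(scores, dtype=float)[np.asarray(labels, dtype=bool)])
    return pos[int(np.floor((1 - sensitivity) * len(pos)))]

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(round(auroc(scores, labels), 4), sensitivity_threshold(scores, labels))
# → 0.9375 0.4
```

In the study, the per-fold thresholds computed this way on each internal test set were then averaged to give the single 95% in-domain sensitivity threshold (0.21 for MSI).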