NaroNet: Discovery of tumor microenvironment elements from highly multiplexed images

Understanding the spatial interactions between the elements of the tumor microenvironment -i.e. tumor cells. fibroblasts, immune cells- and how these interactions relate to the diagnosis or prognosis of a tumor is one of the goals of computational pathology. We present NaroNet, a deep learning framework that models the multi-scale tumor microenvironment from multiplex-stained cancer tissue images and provides patient-level interpretable predictions using a seamless end-to-end learning pipeline. Trained only with multiplex-stained tissue images and their corresponding patient-level clinical labels, NaroNet unsupervisedly learns which cell phenotypes, cell neighborhoods, and neighborhood interactions have the highest influence to predict the correct label. To this end, NaroNet incorporates several novel and state-of-the-art deep learning techniques, such as patch-level contrastive learning, multi-level graph embeddings, a novel max-sum pooling operation, or a metric that quantifies the relevance that each microenvironment element has in the individual predictions. We validate NaroNet using synthetic data simulating multiplex-immunostained images where a patient label is artificially associated to the -adjustable-probabilistic incidence of different microenvironment elements. We then apply our model to two sets of images of human cancer tissues: 336 seven-color multiplex-immunostained images from 12 high-grade endometrial cancer patients; and 382 35-plex mass cytometry images from 215 breast cancer patients. In both synthetic and real datasets, NaroNet provides outstanding predictions of relevant clinical information while associating those predictions to the presence of specific microenvironment elements.


Introduction
The histopathology and phenotype of a tumor guide its diagnosis, prognosis, and help to predict its response to conventional or immune-based anticancer treatments. Indeed, cancers are graded based on tumor architecture and cellular morphology (histopathology), while the expression of relevant cancer biomarkers (phenotype) is used to stratify patients, predict their prognosis and customize their treatment. Automating these tasks using machine learning (ML) is the goal of a novel field known as computational pathology.

Computational pathology
Two main computational pathology strategies exist to automate the analysis of the histopathology of a tumor or its phenotype: weakly supervised deep learning (WSDL) and single cell analysis (SCA). WSDL builds on the widespread availability of whole slide imaging (WSI) to blindly extract prominent histopathological tumor features from large amounts of raw or weakly annotated images of H&E stained tissue sections. Trained only with patient-level labels, WSDL automatically associates these architectural tumor features with clinical labels (Veta et al., 2019;Srinidhi et al., 2021;van der Laak et al., 2021), eliminating the need for manual and extensive pixel-level annotations (Bulten et al., 2020). WSDL uses patches containing several cells as the basic unit of interpretability, to saliently localize tumor-specific regions. This strategy has been shown very effective, often outperforming human experts' predictive ability. For instance, WSDL has been effectively used for tumor subtyping, patient grade classification, or lymph node metastasis detection without pathologist's intervention (Campanella et al., 2019;Boehm et al., 2021;Bilal et al., 2021;Lu et al., 2021;Pinckaers et al., 2020;Cheng et al., 2021;Diao et al., 2021;Longo et al., 2021).
SCA emerged in the context of the research for novel cancer biomarkers, i.e., specific proteins that are expressed by tumor cells, defining their particular phenotype. This laborious task requires selecting potential targets from in silico data and validating these targets in situ to confirm that they are reliably related to a specific biological effect.
Traditionally, this has been done one or a few markers at a time. Recently, the development of highly multiplexed tissue imaging technologies, such as imaging mass cytometry (IMC) or multiplex immunofluorescence (MI), allows simultaneous staining of tissue sections with a high number (>20) of biomarkers (Hao et al., 2021;Stopsack et al., 2020). These complex biomarker signatures provide a comprehensive visualization of the tumor microenvironment and the spatial relationship between its elements, which could be related to the biology and prognosis of the tumor (Rendeiro et al., 2021;Ji et al., 2020). However, the complexity of the patterns of expression and the spatial relationships between multiple markers exceeds the capabilities of the human brain. SCA methods (Schapiro et al., 2017) approach this task by first segmenting the cells in the tissue and quantifying their morphology and intensity of marker expression. This information is then used to find clusters of cells with similar phenotypes, as well as higher-order interactions or 'neighborhoods' between phenotypes (Schürch et al., 2020). To this end, SCA methods build topological networks containing cell phenotype interactions, and apply graph-based clustering (Blondel et al., 2008) to assign groups of cells to different neighborhoods. Since SCA methods use the cell as the basic unit of tissue representation, they provide a high level of interpretability. However, SCA methods are sequential and not learning-based, meaning that the phenotypes and neighborhoods extracted are inferred ignoring which were the clinical questions at hand, and therefore the quantified microenvironmental features are not necessarily optimal to differentiate patient types (McQuin et al., 2018). This is especially true when analyzing highly heterogeneous data, affected by technical non-linear variabilities caused by autofluorescence and/or low expression of some antigens Schürch et al., 2020).

Aim of the study
The aim of this study is to combine SCA (cell-level interpretable quantification of the tumor microenvironment) and WSDL (patch-based end-to-end learning of tumor histopathology) to automate the in situ discovery of tumor microenvironment elements (TMEs) that are relevant for a specific clinical predictive task. To this end, we have developed NaroNet, a multilevel, interpretable deep learning ensemble, which learns the most relevant TMEs from multiplex immunostained tissue sections while performing a classification task, using only patient-level labels. NaroNet assigns patches to TMEs at three levels of spatial complexity: local cell phenotypes, cellular neighborhoods, and interactions between neighborhoods that we name as tissue areas. The concept and main elements of NaroNet are illustrated in Fig. 1.
To validate NaroNet in a controlled way, i.e. having a ground truth, we first analyzed synthetic sets of multiplex images that simulate situations that can be found in real samples. Then we applied our model to learn relevant TMEs while predicting clinically relevant parameters from two real datasets: 336 7-plex images from 12 patients with high-grade endometrial cancer and a publicly available imaging mass cytometry dataset  consisting of images from 215 breast cancer patients.

Methodological contributions
Besides the main conceptual novelty of discovering TMEs while performing clinical predictions from multiplex immunostained cancer tissues, NaroNet integrates novel and state-of-the-art ML approaches. In particular, the main methodological contributions of NaroNet are:

•
The development of patch contrastive learning (PCL), a self-supervised learning algorithm that encodes high-dimensional pixel information into enriched patchembeddings.

•
The modelling of the tumor microenvironment in three levels of increasing complexity: local phenotypes, cellular neighborhoods and areas of interaction between cellular neighborhoods.
• A novel max-sum pooling operation that transforms TME learned annotations into patient-level vectors where each value specifies the incidence of TMEs.
• Two novel regularization loss terms that prevent NaroNet from producing spurious local minima: patch entropy loss and patient entropy loss.

•
The optimal selection between several dataset-dependent architectural variations (e.g. multiple TME assignment, leveraging patch relevance, global reasoning unit, etc.). This is computationally feasible thanks to the dimensionality reduction provided by our PCL module.
• A BioInsights interpretability module that automates the association between patient types and TMEs, based on a novel predictive influence ratio (PIR) metric that quantifies the relevance that TMEs have in individual predictions.
The structure of the rest of this paper is as follows: Section 2 describes the synthetic and real datasets used, and describes the proposed methodology. Section 3 contains the experiments used to test the performance of NaroNet and reports the results obtained. Section 4 provides in-depth analysis of the proposed methods. Finally, we discuss the results in Section 5, and end with our conclusions in Section 6.

Datasets
A. Synthetic patient cohorts.-An in-house developed multiplex immunostained tissue simulator (Jiménez-Sánchez et al., 2021), was used to create patient cohorts. Each patient of the cohort was represented by a 80 0×80 0 multiplex image that contained 8 cell phenotypes (Ph1-Ph8), defined by the (tunable) probabilistic level of expression of 6 fluorescently labeled markers (Mk1-Mk6) (Fig. 2a), the cell size ( Fig. 2b), and shape (Fig. 2c). Four types of cell neighborhoods (Nb1-Nb4) were also defined based on the (adjustable) probabilistic abundance of the 8 cell phenotypes (Fig. 2d), and the (adjustable) interactions between them (Fig. 2e). Each neighborhood had a predefined prevalence in the tissue (Fig.  2f) and could interact with other neighborhoods (Fig. 2g) defining one area of interaction.
We simulated 7 patient cohorts. Each cohort contained 240 patients, distributed in 3 groups (type I, II, and III) of 80 patients each, defined by the variation of the default configuration parameters shown in (Fig. 2), simulating different disease paradigms in-spired on real scenarios: Phenotype Marker Intensity (PMI).: In these patient cohorts, the cells of phenotype Ph6, located in neighborhood Nb3, displayed different relative intensity of Mk6 marker expression in each group of patients: 25% (type I), 50% (type II), or 75% (type III) ( Supplementary Fig. 1). Two cohorts were created with different levels of complexity. In cohort PMI1, the relative abundance of Ph6 cells in Nb3 was set to 15% (moderately present), whereas in PMI2 the relative abundance of Ph6 was set to 0.25% (rarely present).

Phenotype Frequency (PF).:
We simulated two patient cohorts where each group of patients displayed different abundance of cell phenotype Ph6. In PF1 (moderate presence) the relative abundance of Ph6 cells in neighborhood Nb3 was set to 0% (type I), 30% (type II), and 60% (type III) ( Supplementary Fig. 2). In PF2 (rare presence), the relative abundance of Ph6 in Nb3 was set to 0% (type I), 0.12% (type II), and 0.25 (type III)%.

Cell-Cell Interactions (CCI).:
We simulated two patient cohorts where cell phenotypes Ph4 and Ph5 that belong to neighborhood Nb2 repel (type I), show no interaction (type II), or attract (type III) ( Supplementary Fig. 3). In cohort CCI1 (moderate presence) the relative abundance of both Ph4 and Ph5 in Nb2 was set to 5%; in CCI2 (rare presence), the relative abundance of both Ph4 and Ph5 was set to 1%.

Neighborhood-Neighborhood Interactions (NNI).:
We simulated one patient cohort displaying different interactions between cellular neighborhoods, related to patient type. In this cohort (NNI1), Nb2 and Nb3 repel (type I), show no interaction (type II), or attract (type III). The relative abundance of both Nb2 and Nb3 was set to 15% ( Supplementary Fig.  4).
B. Endometrial carcinomas.-Tissue sections from twelve Formalin-fixed, paraffinembedded (FFPE) high-grade endometrial carcinomas were stained with a seven-color multiplex panel targeting key elements of the immune environment: CD4 and CD8 T cell membrane receptors, the transcription factor FOXP3, the bona fide T cell activation marker CD137 (4-1BB), the programmed cell death-1 (PD-1), cytokeratin (CK), and DAPI (counterstaining). 336 1876×1404×7 pixel images were obtained from the 12 tissue sections, using a Vectra-Polaris Automated Quantitative Pathology Imaging System (Perkin Elmer Inc., Waltham, MA, USA). Clinicopathological patient-level information was available for these tumors (León-Castillo et al., 2020), including the microsatellite instability (MSI) subclass, genomic copy number variation, and POLE mutation variants. A detailed description of this dataset, including the staining protocol and relevant clinicopathological information is included in Supplementary Materials: Appendix A. High grade Endometrial carcinomas.
C. Breast cancer.-A publicly available image dataset  was used, consisting in 381 images from the same number of IMC (CyTOF) stained tissue sections, obtained from 215 breast cancer tumor biopsies. Tissue sections were stained with a 35-plex antibody panel staining clinically established breast cancer targets like oestrogen receptor (ER), progesterone receptor (PR), and HER2, as well as relevant oncogenes, signalling, and epigenetic related proteins.
2.2. Methodology: Patch Contrastive Learning (PCL) (Fig. 1b) The goal of the first step of our pipeline is to convert each high-dimensional multiplex image of the cohort into a more manage-able list of low-dimensional embedding vectors.
To this end, each image is divided into patches-our basic units of representation of the local tissue microenvironment, or phenotype-, and each patch is converted by the PCL module -a properly trained CNN-into a low-dimensional vector that embeds both the morphological and spectral information of the patch.
The PCL module is trained iteratively. In each iteration, illustrated in Fig. 3, the PCL module is unsupervisedly trained to learn embeddings of a random set of patches, by maximizing the agreement between augmented views of highly overlapping patches, and minimizing the agreement between augmented views of distant patches, using a contrastive loss function. The choice of the image patch size S L is critical as it determines the extent to which biological structures can be captured, and their context. It also determines the size of the graph that is used to predict the outcome of the patient (see next section). This value was chosen considering that: i. a patch should be large enough to contain between zero and two cells, thus guaranteeing the interpretability of the model at the level of single cell or small cell environments; ii. the entire set of patches extracted from the images of one patient should fit in a single GPU, to efficiently generate patient predictions.
The steps required in each training iteration are described next: being i ≠ q)-possibly corresponding to dissimilar biological structures- (Fig. 3g). Let sim(u, v) = u T v/ ∥ u ∥ ∥ v ∥ denote the cosine similarity between two vectors u and v. The loss function applied to any given pair of patches that belong to the same image crop is defined as: ℓ j1, j2 = − log exp sim z j1 , z j2 /τ ∑ q = 1, q ≠ j, l = 1, 2 B L − 1 exp sim z jk , z ql /τ where τ is a temperature parameter set to 0.5.
This iterative (A-D) process is repeated until convergence to train the PCL module, which is next used to create vector embeddings of all the images of the cohort. To this end, each high dimensional multiplex image i ∈ ℝ i x × i y × B is divided in patches of size S L × S L × B, and each image patch is then introduced into the PCL module to obtain a patch embedding h j (Fig. 3e). This way the PCL module converts each image into a list of patch embeddings ℎ 1.L . The resulting embedded image has reduced dimensionality, i.e.

Methodology: Patch-graph generation (Fig 1c)
A graph is then created that contains all the embedded patches of each tissue/image capturing cellular neighborhoods, i.e., local phenotypes that are spatially associated (Fig.   1c). This graph is G = (Z, A), where Z ∈ ℝ Lxg is a matrix that contains all the embeddings of the image ℎ 1.L , and A ∈ 0, 1 LxL is an adjacency matrix that contains the connectivity between patches. To reduce the expensive memory burden of storing complete adjacency matrices, we 'sparsify' A as A′ ∈ ℤ Ex2 , being A′ a list of edges (i.e., connections) between patches, extracted from the non-zero values of the original A, where E is the number of edges present in the graph. Therefore, graph G = (Z, A), is converted into graph G′ = Z, A′ . Since we connect each patch to its 4 adjacent neighbors, i.e., E = L × 4, the memory required to store A′ increases linearly with L, as opposed to A, which increases exponentially.

Methodology: NaroNet (Fig 1d)
Being D = G 1 , y 1 , G 2 , y 2 , …, G M , y M , a cohort of patients, where M is the number of patients, and each patient is represented by a graph G m ∈ G, and a patient-level label y m ∈ Y, the goal of NaroNet is to learn a mapping G f Y that relates patient information with patient labels, or predictions (Fig. 1d). The architecture of NaroNet is divided in two consecutive networks G f 1 (P, N, A) f 2 Y, trained end-to-end using the patient labels, where P ∈ ℝ P is the abundance of local phenotypes (Fig. 1e), N ∈ ℝ N is the abundance of neighborhoods (or phenotype interactions) (Fig. 1f), and A ∈ ℝ A is the abundance of areas (or neighborhood interactions) (Fig. 1g). Note that P, N, and A are the number of phenotypes, neighborhoods or areas, respectively. We therefore model the tissue microenvironment using three levels of increasing spatial complexity. For the sake of consistency, we refer globally to P, N, A as tumor microenvironment elements (TMEs). The first section of NaroNet, f 1 ( Fig. 1e-h), is an ensemble of three parallel networks that assigns nodes to distinct P, N, A values. The second section, f 2 , assigns patient's predictions from the learned TMEs (Fig. 1i). To learn the tumor microenvironment, the three neural networks f 1 = f 1P , f 1N , f 1A are trained in parallel from individual patient data and later pooled to obtain the abundance of each TME, as described in the following paragraphs: A. Phenotype learning.-Each image patch, ℎ l ∈ Z m , is assigned to a phenotype vector using f 1P , i.e.: where f 1P is an 8-layer MLP with skip connections, with P phenotypes in the last layer. Therefore f 1P takes the patch representations of image Z m and generates a patch assignment matrix S P , whose values represent the probability that each patch is assigned to P phenotypes.
B. Neighborhood learning.-Likewise, each image patch, ℎ l ∈ Z m is assigned to a neighborhood f 1N , i.e.: where f 1N is a Graph Neural Network (GNN) followed by a 1-layer MLP which has an output dimensionality N. Therefore, f 1N uses Z m patch representations and the adjacency matrix A′ to produce a patch assignment matrix S N . Here, the GNN captures relationships between connected patches of a graph. To that end, it performs K iterations of a trainable weighted sum of each graph node (in our case patch ℎ l ∈ Z m ) and its connected neighboring nodes, generating a new feature vector at the next hidden layer of the network Kipf and Welling, 2017;Jimenez-Sanchez et al., 2020;Pati et al., 2020).
C. Area learning.-Each neighborhood that resulted from the previous GNN, ℎ l ∈ Z m (K) is assigned to areas using a second GNN f 1A . To this end the following trainable assignment matrix is used: This GNN learns the higher order interactions between the N neighborhoods of the original graph. For this purpose, f 1A is fed with the embeddings from N neighborhoods accumulates feature vectors of neighborhoods that are close to each other. As in the previous section, the GNN is applied K iterations or hops, this number indicating the extent to which the patch embeddings can capture information of their neighbors.
-After applying f 1P , f 1N , f 1A , each row of S P contains the probability that each patch of the image contain each of the P phenotypes, each row of S N contains the probability that a patch of the image contain each of the N neighborhoods, and each row of S A contains the probability that a neighborhood of the image contain each of the possible A areas. The final step of f 1 is a max-sum pooling operation that captures the abundance of each TME: where S P , S N , S A (eqs. 5, 6, 7) are the assignment matrices whose values correspond to neuron activations, where the softmax activation function transforms them into probabilities in a row-wise fashion. The max operator function is applied row-wise so that only the maximum values of each row are kept, while the others are set to zero. The sum operator is applied column-wise to obtain the abundance of each TME. The resulting (P, N, A) are the TME abundances that represent each patient.
The TME abundance vector (P, N, A) ∈ ℝ P + N + A is fed to the second's network section (f 2 ), consisting in a 1 layer MLP, i.e., y′ = f 2 (P, N, A) ∈ ℝ 0 , where y′ is the prediction between O possible patient-outcomes. A cross entropy loss is used to train the parameters of both f 1 and f 2 . The strategy used to implement f 1 can produce spurious local minima where all patches are assigned to a single microenvironment element. This local optimal solution traps the gradient-based optimization, and reduces NaroNet's performance. To prevent this, we use two regularization loss functions.
E. Patch entropy loss.-Patch entropy loss is used to regularize the probabilities given by eqs. 5, 6, 7. After initialization, the assignment of patches to TMEs is uncertain and the entropy of the patches is high. During the training process, we aim at knowing the assignment of patches to TMEs, obtaining a sparse matrix assignment. To this end, we propose to reduce patch entropy for each TME using a loss function: where S is any of the matrices S P , S N , S A , and the function generates ℓ ep , ℓ en , ℓ ea losses, respectively. The final loss is restricted to ℝ ∩ [0, 1] where the lower the value the most certain it is that a patch belongs to a specific TME. The final, combined loss is regularized by a λ parameter.
This is a specific learning rate that is chosen based on the tumor microenvironment complexity.
E. Patient entropy loss.-Patient entropy loss is used to avoid graph pooling collapse in (P, N, A) TMEs.: where (P, N, A) are the TME abundances and the vector ℓ pp , ℓ pn , ℓ pa contains the calculated losses, their values being restricted to ℝ ∩ [ − 1, 0]. As done for the patch entropy loss, the final loss is also regularized using a λ parameter, ℓ p = lambda pp * ℓ pp + λ pn * ℓ pn + λ pa * ℓ pa /3. Notice that the lower the value of l p the most spread out is the abundance of the TMEs. This strategy is less restrictive than the orthogonal loss (Bianchi et al., 2020) since the regularization term does not force clusters to have the same size.
In order to provide the highest predictive and interpretability performance, NaroNet's parameters and architecture variations (Supplementary Materials: Appendix C. Architectural variations) are optimally selected by an architecture search algorithm (Supplementary Materials: Appendix B. Architecture search).

Methodology: BioInsights interpretability module
Besides generating predictions, NaroNet identifies the elements of the tumor landscape that relate to a specific predictive task. This can be done a posteriori through the analysis of the TMEs (P, N, A), obtained by NaroNet while classifying patients. That is the goal of the BioInsights module, that is done through the identification of global cohort-differentiating features (differential TME analysis), and relevant TMEs in individual predictions (predictive influence ratio).
A. Differential TME analysis.-NaroNet's f 2 network maps TME abundances to patient-outcomes, i.e., Y = f 2 (P, N, A) 1.M . Therefore, (P, N, A) 1.M are the coefficients or covariates of the model, and the patient's predictions are made solely using the relative abundance of specific TMEs. We use regression analysis to interrogate which TMEs were more important to perform patient predictions. Specifically, to evaluate whether a specific TME is significant to perform patient predictions, a leave-one-out strategy is used, where a TME t is extracted from the set of all patient TME abundances (P, N, A) 1.M obtaining a new set of TMEs defined as (P, N, A) 1.M t . The model is evaluated with the entire patient cohort, and new prediction probabilities are obtained. Then, a Kruskal-wallis test is used to compare the prediction performance of the original TMEs with that of the leave-one-out model. If the null hypothesis is accepted, the extracted TME is considered to have predictive value.
B. Predictive influence ratio (PIR).-The differential TME analysis finds global patterns in patient cohorts but ignores the heterogeneity existing between patients/tissues. To address this, we introduce the predictive influence ratio (PIR), which quantifies the influence that each TME has on the prediction accuracy of a patient m ∈ 1 − M: where P IR m, t is the predictive influence ratio for a patient m and a TME t, and f 2 (P, N, I) m t is the leave-one-out model performance. The higher the value of P IR m, t the most important the TME t is for the classification of patient m.

Synthetic experiments
Seven patient cohorts, 240 patients each, were simulated (see Section 2.1.A), representing four disease paradigms (PM, PF, CCI and NN) in which either the moderate (1) or rare (2) presence of a specific TME differs between each of the three patient types (I, II or III). Seven different experiments were carried out, in which NaroNet was trained to predict the correct patient type for all patients of a cohort, while learning the TMEs that were relevant for that prediction. With these experiments we wanted to validate the ability of NaroNet to correctly classify each patient, and also to identify the relevant TME that defines each paradigm, in a properly controlled, quantifiable fashion (Supplementary Materials: Appendix D. Interpretability performance measure). In all 7 experiments, 120 patients of the cohort (40 patients of each type) were used for training and validation of NaroNet, including PCL and 500 runs of architecture search, and the remaining 120 patients were used for testing. Three train-test runs were made and averaged to report the final performance values.
The results obtained in each of these 7 experiments, in terms of NaroNet's predictive accuracy, i.e. how accurately NaroNet predicted the patient type, and interpretability, i.e. the correspondence between the TME found more relevant by NaroNet and the TME that actually defined the disease paradigm, are shown in Table 1. Overall, the model predicts remarkably well all disease paradigms, even in those experiments involving rare cell populations.
Illustrative example: CCI1.-Now we illustrate NaroNet's methodology, results and interpretability using one of the synthetic experiments (CCI1) consisting of a patient cohort where cell phenotypes Ph4 and Ph5, coexisting in neighborhood Nb2, repel, show no interaction, or attract each other in patient types I, II, and III, respectively (Fig. 4a,b). The PCL module was trained to generate 256-long vector embeddings of 10×10 pixel patches with 52.8% contrastive accuracy, which was comparable to state-of-the-art semisupervised learning setups (Chen et al., 2020b) (Table 1). Next, we used the training image set (120 patients) to calculate the optimal architecture (Supplementary Tables 1 and 2, Supplementary Figs. 5 and 6) and train the model. Then the test image set (120 patients) was used to calculate the classification performance. The receiver operating characteristic (ROC) curves, confusion matrix, and training and test accuracy curves obtained are shown in Supplementary Fig. 7. As shown in Table 1, the overall accuracy achieved for experiment CCI1 was 98.6% with a 95% confidence interval (CI) of [97.7,99.5] Regarding the interpretability of the results, our global differential TME analysis revealed that, amongst all the neighborhoods detected by NaroNet ( Supplementary Figs. 8,9), four neighborhoods -in order of statistical significance: N3, N7, N9, and N1-were most responsible for NaroNet's predictions ( Supplementary Fig. 10a). We also found that N3 and N7 are the most abundant neighborhoods in type I patients ( Fig. 4a). Therefore, NaroNet has correctly identified, and weighted in the classification, the tissue regions where the patient-defining TMEs are located.
We can next analyze the content of these four neighborhoods to confirm this finding: N7 contains high expression of markers Mk3 and Mk5, corresponding to cell phenotypes Ph3 and Ph5 (Fig. 4b,c). In type I patients, the abundance of N7 is statistically higher than in type III patients (Fig. 4d). If we look at N7 in type I tissues (Fig. 4e), we can confirm that it contains Ph3 and Ph5 cells, but not Ph4, meaning that there is physical repulsion between Ph4 and Ph5 as expected for this disease paradigm. The behaviour of N3 is similar to that of N7. N9 contains high expression of markers Mk3, Mk4 and Mk5, which correspond to cell phenotypes Ph3, Ph4, and Ph5 (Fig. 4b,c) and is significantly more abundant in type III patients, compared to patient types I and II (Fig. 4f). If we go back to the tissues of type III (Fig. 4g), it can be confirmed that N9 contains spatially related cells with phenotypes Ph4 and Ph5, as expected in this disease paradigm. Therefore, we have shown that the TMEs learned by NaroNet capture the specifics of the underlying disease paradigm and lead to a successful classification.
To interpret why an individual image/patient was classified as a certain patient type, we calculated the predictive influence ratio (PIR) value for each TME. This strategy, applied to CCI1 shows ( Supplementary Fig. 11a) that for most type I patients, the abundance of neighborhood N7 was the most determinant classification factor. Conversely, N9 was highly predictive for type III patients, and N1 was highly relevant to successfully classify type II patients. We illustrate this with examples of individual predictions: a patient classified as type I with prediction confidence of 97.75% and a PIR value of 2.28 for N7 (Fig. 4h), and a patient classified as type III with prediction confidence of 94.36% and a PIR value of 1.61 for N9 (Fig. 4i).

Endometrial carcinomas
We first asked NaroNet to learn TMEs associated to four patient-level labels: the somatic POLE mutation, copy number variation (CNV), DNA mismatch repair (MMR) deficiency, and two tumor histology types (endometrial carcinoma or serous-like carcinoma) from 382 images of 12 high-grade endometrial carcinomas. The PCL module was trained to generate 256-dimensional embeddings of 15×15 pixel image patches, obtaining a high contrast accuracy of 81.11%. A 10-fold nested cross validation strategy was then used to optimize NaroNet's parameters and hyperparameters. Using this strategy, the architecture search is repeated ten times (outer loop) using a 10-fold partition of the data. In each inner loop, 50 runs of the architecture search are implemented from 90% of the image dataset (344 images). The best architecture configuration (Supplementary  . 5a and Supplementary  Fig. 12) for the four patient-level labels.
As an example of the global interpretability of the results, we next analyze the interpretability of the model while predicting the POLE mutation status: NaroNet unsupervisedly learned 26 TMEs ( Supplementary Figs. 13 and 14). Our differential TME composition analysis revealed that area A1 (p-value: 2.56×10 −9 ) is the most predictive TME when making patient predictions. Particularly, A1 is significantly more associated to tumors harboring POLE mutations than to POLE wild type (WT) tumors (Fig. 5c). Area A1 contains neighborhoods N2 and N7 ( Fig. 5b and Supplementary Fig. 14d) which in turn contain local phenotype interactions between P2-P7 and P4-P9, respectively (Fig. 5d). N7 contains CK+ tumor cells (P4) and intratumoral cells expressing CD4 and CD8 (P9), and by itself is not associated with POLE mutation (Fig. 5e). By contrast, N2 contains non-infiltrating cells that express CD8, PD1 and FoxP3 (P2) associated to tumor CK+ cells (P7), and is associated with POLE mutated patients (Fig. 5f). Furthermore, P2 by itself was significantly more abundant in POLE mutated patients (p-value: 2.80×10 −10 ) compared to patients carrying the wild type version of the gene (Fig. 5g-i). All these findings are consistent with the literature as CD4, CD8, FoxP3, and PD1 are inflammation markers, and POLE-mutated endometrial carcinomas, usually with a better prognosis than POLE WT, with higher abundance of A1 areas, are described to have large lymphocyte populations (Li et al., 2019b). In summary, area A1 contains cellular neighborhoods related to high immunological activity, and points at the existence of interactions between specific immune phenotypes in POLE vs. non POLE mutated cancers that could be further explored, as could be done with other TMEs selected by NaroNet.
To illustrate the individual interpretability of our results we provide two examples of images in which phenotype P2 was the most relevant TME selected by NaroNet. The first image was correctly classified as POLE WT with a prediction confidence of 95.45%, and PIR value of 1.49. It shows a cold tumor landscape -low P2 abundance -that is associated with POLE WT patients (Fig. 5j). This is consistent with the global cohort-level findings (Fig. 5h). The second image was correctly classified as a POLE mutated with a prediction confidence of 99.24%, and a PIR value of 1.38. It shows a hot tumor landscape with high P2 abundance (Fig. 5k), being also consistent with global cohort-level findings.
Comparison with single cell analysis: Qupath.-To further validate NaroNet, we quantified the phenotype that NaroNet identified as the most discriminative between patient types (P2, i.e. high expression of CD8, PD1, and FoxP3), using QuPath (Bankhead et al., 2017), a widely used open source software for computational pathology (Supplementary Materials: Appendix E. Image analysis with Qupath). For each image of the cohort, we first quantified the level of expression of CD8, PD1, and FoxP3 from the cell segmentation masks obtained using the DAPI -counterstained-channel. Then we calculated the number of CD8+PD1+FoxP3+ cells and correlated this number with the number of patches that NaroNet assigned to P2, obtaining positive correlation (R 2 = 0.63) (Supplementary Fig.  15b). Moreover, the respective violin-plots ( Supplementary Fig. 15c-d) showed both QuPath and NaroNet are able to robustly distinguish patient-types based on CD8+PD1+FoxP3+ phenotype abundance, the outstanding difference being that NaroNet infers it without human supervision.
Patient-wise quantification.-Finally, to test NaroNet's predictive power classifying subjects, i.e. patients and not individual images, based on the POLE mutation, we performed a leave-one-out experiment: iteratively, 11 patients (represented by all their images) were used to train the model and one patient was used for testing. Patient-wise predictions were calculated as the mean prediction value of all images that correspond to the test patient, achieving an overall accuracy of 83.33% with a 95% CI [63.02-10 0.0 0%], and an AUC of 0.67 with a 95% CI [0.32-1].

Breast cancer cohort
NaroNet was trained to associate TMEs with patient survival risk . 215 patients were clustered by k-means into three risk groups (RI, RII, RIII) based on their long-term survival (Fig. 6a): RI contained 48 patients that survived more than 120 months, RII contained 107 patients that survived between 54 and 119 months, and RIII contained 60 patients that survived less than 53 months. The PCL module produced 18×18 pixel patch embeddings for all the images of the cohort, with a high contrast accuracy of 82.50%. As more than one image was acquired per patient, images from the same patient were combined in one single data structure (i.e., graph) and fed to NaroNet. A 10-fold nested cross validation was used to optimize NaroNet's parameters and hyperparameters (Supplementary  Fig. 16).
As an example of how to make use of its global interpretability, NaroNet learned 57 distinct spatial patterns of TMEs able to predict the patient risk group (Fig. 6c and Supplementary  Fig. 17). Using our differential TME composition analysis we found that a combination of two neighborhoods (N8 and N16) was significantly predictive (p-value < 0.05) when distinguishing between RI and RIII patients. N8 is a neighborhood that contains tumor cells (cytokeratin AE1/AE3 and cytokeratin 7 positive), and high presence of fibronectin. This neighborhood is more abundant in risk III patients (Fig. 6d). Fibronectin is a key component of the extracellular matrix. In particular, as seen in the literature, fibronectin is highly present in the remodeled tumor extracellular matrix, forming a barrier for the infiltration of immune cells. Consistent with our findings, fibronectin was associated with poorer patient survival (Fernandez-Garcia et al., 2014). Moreover, the abundance of neighborhood N16, consisting of tumor cells expressing p53, was also associated with bad prognosis (Fig. 6d). The tumor suppressor gene p53 is one of the most commonly mutated gene in human cancers. TP53 gene mutation is generally associated with a strong and diffuse immunoexpression of p53. Consistent with our findings, TP53 mutation has been shown to be a poor prognostic factor in various cancer types (Li et al., 2019a).
Besides N8 and N16 there are other TMEs whose abundance is significantly different across patient types (Supplementary Fig. 17c). Working in 'discovery' mode, these TMEs could be used to obtain insights on cohort-differentiating microenvironment features. For instance, neighborhood N4 was associated with poor survival (p-value < 0.0 0 01) and contained Sox9 positive cells, Sox9 having been previously described as an oncogene (Aguilar-Medina et al., 2019).
We next evaluate NaroNet's ability to capture the heterogeneity of the 35-plex breast cancer cohort by showing the individual interpretable prediction of two patients. The first patient, who survived 33 months, was correctly classified as high risk (RIII) with a prediction confidence of 98.98%. Such prediction was mainly driven by the presence of N4 (i.e., Sox9+, PIR value 1.91) that is highly abundant in this patient (9.07% of the tissue) compared to the average presence of N4 found in the whole patient cohort (1.49%). This is consistent with our global findings that indicate that high risk patients are associated with Sox9 oncogene expression (i.e., neighborhood N4). The second patient, who survived 174 months, was correctly classified as low risk (RI) with a prediction confidence of 93.15%, being the prediction mainly driven by the absence of N8 (i.e., extracellular fibronectin, PIR value 1.84) as its relative abundance is low (0.01%) compared to the average N8 mean abundance (2.21%). Therefore, for this patient, NaroNet correctly quantified a low presence of N8, and associated it to high survival, as it was previously observed cohort-wise (Fig. 6d).
Effect of the data input format.-We finally analyzed NaroNet's ability to predict patient risk subtypes as a function of the input used. On the one hand, as cell segmentation masks are provided along with the public image dataset, cell features were extracted as done in the original reference paper . Briefly, a graph of interconnected cells (37-element vectors) was fed to NaroNet, where each cell vector consists of the average expression of the 35 markers plus the cell size and eccentricity. This approach is compared to our proposed strategy based on the use of graphs of patch embeddings. On the other hand, as more than one image was acquired for some patients, we compared the strategy of feeding NaroNet using one graph per image or feeding it with a graph that combines all the images of the patient. We used the same hyperparameters for all the experiments (Supplementary Table 4). Fig. 6b shows the area under the curve (AUC) for all experiments. As shown, NaroNet achieves the highest prediction performance using a graph containing PCL patches instead of cell masks, and works better when all images of the same patient are combined into a single graph.

In depth analysis
In this section, we describe additional experiments that were carried out to provide an even more comprehensive understanding of the proposed methods (i.e., NaroNet and PCL). All experiments were conducted using the training protocols introduced in the Results section.

Ablation studies
We examined how NaroNet's performance varies when selected modules are removed from the original network, specifically phenotype, neighborhood, or area learning. Therefore, we repeated all experiments done with real and synthetic datasets, removing sequentially the phenotype, neighborhood, or area learning modules. The results of the ablation studies are shown in Table 2. Overall, NaroNet performs best when using the three modules. As expected, its performance varies greatly depending on which element of the tumor microenvironment is driving the disease paradigm at hand. For instance, in CCI1, where patient types show distinct cell to cell interactions, the neighborhood learning module is crucial, and when removed, NaroNet is unable to capture cellular interactions, its performance dropping dramatically.

Comparison with other methods
NaroNet is the first WSDL method fully adapted to multiplex imaging. In contrast with other imaging modalities like H&E staining, where WSDL methods can be evaluated in public datasets, e.g., Camelyon16 challenge (Bandi et al., 2018), there is a lack of public multiplex image datasets to objectively evaluate multiplex image analysis frameworks. However, in order to compare NaroNet with other existing approaches that could be applied to multiplex imaging, we adopted two state-of-the-art WSDL methods used to classify H&E tissue sections, adapting them for the analysis of multiplex images from our real cohorts (i.e., Endometrial carcinomas and Breast Cancer cohort).
A. CLAM- (Lu et al., 2021) As most WSDL methods, CLAM is based on a two-step strategy. In the first step, the image is divided into image patches (i.e., hundreds of cells) which are fed to a ResNet50 pretrained on ImageNet. In the second step, attention scores are assigned to patch representations considering their relevance in the patient-level classification task at hand. To adapt this method to multiplex imaging, it is necessary to use an alternative patch feature extraction strategy because it is not possible to input multiplex image patches to a RGB-based ResNet50 pretrained on ImageNet. Instead, we used our proposed PCL strategy. To this end, output patch representations from our PCL module were input to CLAM. To choose the size of the image patch we took into consideration how CLAM models patches: CLAM does not try to capture physical interactions between patch representation, but instead models the tumor microenvironment from the information existing within each patch. Therefore, it requires relatively large patch sizes. For this reason, we evaluated CLAM's performance using two patch sizes, the first following our strategy where one image patch contains one or two cells, and the second, more similar to the original CLAM's strategy, following the strategy used in (Lu et al., 2021) where one image patch contains dozens of cells.

B. Neural Image Compression (NIC)-(Tellez et al., 2019)
NIC is also a two-step strategy. In the first step, the image is divided into image patches that are used to train a CNN unsupervisedly. In the second step, feature vectors are arranged to create a compressed image, which is then fed to another CNN (ResNet50) that is trained supervisedly to predict patient-level labels. As done before, we used our PCL strategy to extract features at the cellular level. As NIC is based on a CNN to make patient predictions, thus capturing interactions between patches, we used a small image patch containing one or two cells.
Performance results are provided in Table 3. Compared to CLAM and NIC, NaroNet achieves the best performance in terms of AUC scores. Please note that, besides achieving higher prediction values, our method is inherently interpretable at three levels of complexity (i.e., phenotypes, neighborhoods, and areas).

PCL parameter evaluation
The PCL module learns cellular features from two augmented views of one image crop.
Here, as the image crop is ∝ L times bigger than the augmented views, the subsequent generated image patches (or views) do not necessarily contain the same pixel information, but can capture information from neighboring pixels. We evaluated NaroNet's performance using different ∝ L values. From the result (Table 4) we can conclude that NaroNet performs better when using a value of ∝ L higher than 1. This means that introducing in the learning pipeline information from neighboring pixels is beneficial to extract cellular features.

Discussion
Our working hypothesis is that relevant elements of the tumor microenvironment can be blindly identified and associated with patient-level tumor information from multiplex imaging data. To this end, we have developed NaroNet, an end-to-end deep learning framework that proves this hypothesis true, as it accurately performs patient predictions from local phenotypes, neighborhoods, and areas that were blindly identified from multiplex immunostained histological data. NaroNet takes advantage of, and improves elements of the two main state-of-the-art computational pathology approaches. From SCA methods, NaroNet inherits the use of graphs to capture phenotype interactions, extending this idea by using GNNs to actually 'learn' the most relevant interactions between elements of the tumor microenvironment (Kipf and Welling, 2017). From WSDL, NaroNet uses the concept of learning deep features from patches without the need of manual annotations using only patient-level labels, and applies it to multiplex immunostained sections instead of H&E histological images. Furthermore, instead of being a black-box approach (Rudin, 2019), NaroNet is inherently interpretable as it makes predictions based on the abundances of discovered phenotypes, neighborhoods and areas, thanks to the use of a novel max-sum pooling operation. During the learning process, NaroNet's parameters are trained to assign patches into never seen TMEs, whose abundances would eventually differentiate patient types. For this reason, NaroNet can be used in 'discovery mode' to research new biomarker signatures of the tumor biology, or to answer clinically relevant questions, e.g. which tumor features are more predictive of the tumor type or the outcome of the patient. Furthermore, using validated biomarker signatures, NaroNet can be trained to provide clinicians with interpretable clinical decisions, since predictions are based on TME annotations which can be mapped back onto the original images. To facilitate individual interpretable predictions, we developed a new metric called predictive influence ratio (PIR) that measures how each tumor microenvironment element contributes to the final prediction.
One of the major bottlenecks in developing high-performance machine learning classifiers for computational pathology is the low number of available labeled tissue images. This is even a greater problem in the case of multi-spectral images, as the use of multiple markers dramatically increases the complexity of the annotation. To address this, we propose a data-efficient contrastive learning loss preprocessing step (PCL). This is a similar strategy to the one followed in state-of-the-art semi-supervised learning frame-works (Chen et al., 2020a;2020c). These methods learn enriched image representations from large numbers of unlabeled images using an unsupervised deep neural network. Later, a supervised classifier can be trained to obtain outstanding image predictions from small numbers of these enriched, labeled image representations. In our case, all available patient tissue information is divided in patches, i.e., tiles, containing up to two cells, and are introduced in a convolutional neural network to create self-supervised low-dimensional enriched embeddings of these patches. These embeddings allow for comprehensive discrimination of pixel-level features such as, cell morphology, marker intensity, marker colocalization, etc. thus reducing the inherent heterogeneity existing within and between tissues. We hypothesized that the use of these prototypical enriched representations extracted from the images might help with the classification of low number of patients represented by them. Furthermore, the volume of data space is decreased so that NaroNet's computational time is reduced allowing the use of architecture search algorithms that would ultimately increase predictive performance.
We have validated NaroNet using both synthetic and real data. Using a novel multiplex tissue image simulator we created realistic patient cohorts with tunable presence of specific TMEs, providing an ideal objective benchmark to test the performance of the system. Indeed, our extensive validation using synthetic data successfully confirms that NaroNet can learn relevant TMEs -local phenotypes, cell-interaction neighborhoods, and neighborhoodinteraction areas -, even when their presence in the tissue is rare. Using a high-grade endometrial carcinoma patient cohort, NaroNet found, among other TMEs that could be explored, a local phenotype expressing CD8, PD1, and FOXP3 whose high abundance was associated with the POLE mutation, while achieving a prediction accuracy of 93.75%. This finding is in accordance with what has already been described in the literature. Moreover, we confirmed using a semi-automated computational pathology software (QuPath), that the abundance of this specific phenotype correlates positively with the one found by NaroNet. This nicely shows that NaroNet can be a useful tool in research environments, as it can help to blindly identify novel TMEs that are related to the biology of the tumor. Using a public breast cancer dataset, NaroNet found TMEs that were associated with the patient's survival achieving an AUC of 0.73. Strikingly, NaroNet did not require human supervision to learn, from a pool of millions of cells stained by 35 markers, decisive neighborhoods consisting of cells expressing Sox9 and extracellular fibronectin, respectively, that were related to the survival of the patient. In addition, we show that NaroNet performs better when fed with an enriched graph created from image patches than when using cell features obtained from cell segmentation masks. This shows that NaroNet is able to learn relevant tumor microenvironmental information without the highly demanding task of segmenting all cells in the tissue. We also prove that using graphs to represent patients is more advantageous than using images, as patient's stained tissue sections can be combined together into a single, disjoint graph providing NaroNet with more information to make better predictions.
Finally, we have presented an ablation study that shows that the three levels of spatial complexity used by NaroNet to model the tumor microenvironment (i.e., local phenotypes, neighborhoods and areas) contribute individually to achieve better predictions. Moreover, a comparison with two state-of-the-art WSDL methods shows that NaroNet is able to achieve more accurate predictions while providing an inherent interpretability of the reason behind those predictions that the rest of the methods lack.

Conclusion
We have presented NaroNet, and ensemble of networks that unsupervisedly identifies and annotates relevant TMEs that drive patient outcomes. Since we have shown that NaroNet is able learn in situ highly predictive TMEs that confirm the existing literature, it is possible to affirm that the analysis of new predictive TMEs discovered by NaroNet could provide novel insights into the mechanisms of disease progression. This could be used in clinical settings, and more importantly, it makes NaroNet a valuable research tool for the discovery of novel biomarkers. Furthermore, the fact that NaroNet's clinical predictions are directly based on the annotations of TMEs results in an important breakthrough in computational pathology, as it contributes to the whitening of DL black-boxes. Indeed, our model allows clinicians to understand which TMEs drive the prediction of each patient safely and reliably since DL neuron activations are related to specific biological structures that can be mapped back into the original images. Therefore, NaroNet could be an optimal solution for the rapid clinical translation of biomarker discovery signatures, where DL models trained to quantify relevant TMEs are then applied to new incoming patients by providing clinicians with interpretable predictions.

Implementation details
Patch contrastive learning is implemented in Python 3.7.3 using Tensorflow 1.14.0. NaroNet is implemented in Python 3.7.3 using PyTorch 1.4.0. Architecture search was performed using ray 1.0.0 (Liaw et al., 2018)   Scheme of NaroNet's learning and discovery protocol. a. The input data consists of multiplex cancer tissue images with associated clinical and pathological information. b. The patch contrastive learning module divides images into patches and embeds each patch in a 256-dimensional vector using a CNN unsupervisedly trained to assign similar vectors to patches containing similar biological structures. c. An enriched graph of patches is generated that contains the spatial interactions between tissue patches. d. The graph of patches is fed to NaroNet: an interpretable ensemble of neural networks that learns phenotypes (e), phenotype neighborhoods (f), and areas of interaction between neighborhoods (g) to classify patients (i) based on the abundance of those tumor microenvironment elements (h). Legend. CNN: convolutional neural network; MLP: multilayer perceptron; GNN: graph neural network. Synplex (synthetic simulator) default configuration. Eight cell phenotypes are defined by the expression of 6 markers (a), each of them having a specific cell size (b) and eccentricity (c). Four neighborhoods are defined based on the relative abundance of the phenotypes -please note that the color of neighborhood sections refer to the different phenotypes, as in pannels b and c-(d) and interaction, i.e. attraction/repulsion rules between phenotypes (e). These neighborhoods in turn are organized based on their frequency and interaction rules (f-g.).

Page 24
Med Image Anal. Author manuscript; available in PMC 2023 February 28.

Fig. 3.
Visualization of Patch Contrastive Learning method description.
Step-by-step illustration of Patch Contrastive Learning strategy.  Graphical description of synthetic experiment CCI1. a. Ground truth: Schematic description of the interactions between cell phenotypes Ph3, Ph4, and Ph5 (located in neighborhood Nb2) that define each patient type (I-III). b. Marker expression levels for the three relevant cell phenotypes. c. Z-scored mean expression of all markers in the neighborhoods learned by NaroNet. d. Relative abundance of learned neighborhood N7 in the three patient groups. e. Representative patches assigned to N7. f. Relative abundance of learned neighborhood N9 in the three patient groups. e. Representative patches assigned to N9. h. Example of patient correctly classified as Type I (i.e. displaying Ph4-Ph5 repulsion), with squares showing patches assigned to learned neighborhood N7, located in ground truth neighborhood Nb2 (marked in red). i. Example of patient correctly classified as Type III (Ph4-Ph5 attraction), with squares showing patches assigned to learned neighborhood N9, located in ground truth neighborhood Nb2. (***p < 0.001; **** p < 0.0001).  Association of high-grade endometrial carcinomas with patient-level labels. a. ROC curves showing the classification performance of NaroNet for the four tissue characteristics learned. b. Neighborhood composition of learned area A1. c. Violin-plot showing relative abundance of learned area A1 as a function of POLE mutation status. d. Heatmap showing interactions between the local phenotypes learned by NaroNet. e,f. Patches assigned to neighborhoods N7 and N2, and their corresponding abundance across patient-types. g. Heatmap showing the mean marker expression, for the phenotypes learned by NaroNet. h,i Patches assigned to phenotype P2 and its corresponding abundance across patient types. j,k. Images of WT and POLE mutated patients that were classified due to phenotype P2 abundance. White squares represent patches assigned to P2.  Association of spatially-resolved 35-plex of breast cancer tissues with patient long term survival. a. Histogram plot of patient overall survival in months, colored by risk classes. b. AUC prediction performance of NaroNet for four training strategies: image-wise or patientwise, and using the cell segmentation provided with the image dataset or using the proposed PCL method. c. Heatmap showing the mean marker expression, for all neighborhoods learned by NaroNet. d. Violin-plot showing the relative abundance of learned neighborhoods N8 and N16 as a function of risk group, RI and RIII (p-values were adjusted with Bonferroni correction). NaroNet performance: synthetic experiments. NaroNet's classification accuracy (and 95% confidence interval) and interpretability calculated as the intersection of the most relevant extracted TME and the ground-truth of each synthetic experiment. Legend. PMI: phenotype marker expression; PF: phenotype frequency; CCI: cell-cell interaction, and NNI for neighborhood-neighborhood interaction. Index 1 refers to moderate presence, and index 2 to rare presence.   Table 3 Comparison of NaroNet with other WSDL methods. AUC scores obtained using three weakly-supervised methods over a 10 fold cross validation, for the endometrial carcinomas and breast cancer datasets. A small patch size corresponds to 15×15 and 18×18 pixels, and a large patch size is 90×90 and 100×100 pixels in the endometrial carcinomas and breast cancer datasets, respectively.