Machine learning reveals genetic modifiers of the immune microenvironment of cancer

Summary Heritability in the immune tumor microenvironment (iTME) has been widely observed yet remains largely uncharacterized. Here, we developed a machine learning approach to map iTME modifiers within loci from genome-wide association studies (GWASs) for breast cancer (BrCa) incidence. A random forest model was trained on a positive set of immune-oncology (I-O) targets, and then used to assign I-O target probability scores to 1,362 candidate genes in linkage disequilibrium with 155 BrCa GWAS loci. Cluster analysis of the most probable candidates revealed two subfamilies of genes related to effector functions and adaptive immune responses, suggesting that iTME modifiers impact multiple aspects of anticancer immunity. Two of the top ranking BrCa candidates, LSP1 and TLR1, were orthogonally validated as iTME modifiers using BrCa patient biopsies and comparative mapping studies, respectively. Collectively, these data demonstrate a robust and flexible framework for functionally fine-mapping GWAS risk loci to identify translatable therapeutic targets.


INTRODUCTION
0][11] Genome-wide association studies (GWASs) have identified >170 genetic loci to date 12 and hundreds of genes have been prioritized as candidates by genetic fine-mapping strategies, including functionally validated modifiers of tumor intrinsic phenotypes. 13In contrast, germline genetic modifiers of BrCa that function through the tumor microenvironment (TME) remain poorly understood. 55][16][17][18] A GWAS of The Cancer Genome Atlas (TCGA) pan-cancer cohort revealed heritability ($15-20%) across 33 traits related to tumor infiltrating leukocytes, including key modulators of antitumor immunity (e.g., IFN, STING, and cytotoxic T-lymphocytes). 15ikewise, a separate TCGA pan-cancer analysis found multiple gene-level associations with 16 immune-related traits, including overlapping GWAS candidates for autoimmune diseases. 16These data, combined with evidence that similar immune traits predict response to anticancer immunotherapies and overall patient survival, 14,17,18 suggest that germline genetic modifiers of the immune TME (iTME) likely impact cancer incidence, progression, and outcome.
Despite the evidence that heritable variation in antitumor immunity impacts BrCa, 5 attempts to interrogate germline genetic modifiers of the iTME remain challenging for several reasons.5][16][17][18] Second, genome-wide genetic perturbation datasets for functionally annotating candidates have only recently become available 19,20 and functional genomics data have yet to be systematically incorporated into fine-mapping strategies for GWAS loci.Finally, in some cases, characterization of iTME modifiers will require comparative analysis using models with a fully intact iTME and yet there is a dearth of genetically diverse syngeneic cancer models for testing iTME modifiers.
The objective of this study was to develop an analytical framework for using human BrCa GWAS as a roadmap for translating iTME modifiers with the greatest therapeutic potential for antitumor immunity.To achieve this objective, a random forest model was trained on (B) Overview of the I-O target probability predictions using a positive training set of genes with preclinical and clinical evidence as I-O therapies.A size-matched negative set was generated from random sampling without replacement of all other genes and after excluding the positive set.Random forest models were then built (n = 10,000 models) using each of the random negative sets to generate predictions, followed by averaging across the 10,000 models to assign an I-O target probability score to 1,362 candidate genes in linkage disequilibrium (LD, r 2 > 0.6) with 155 human BrCa GWAS loci.(C) A correlation matrix of the quantitative variables revealed limited overlap between features.(D) A 10-fold cross validation (CV) was performed using the positive set and each of the negative sets, followed by a receiver operator characteristic (ROC) area under the curve (AUC) analysis that was averaged across all 10,000 random forest models (AUC = 0.71) (Figure 1D, hashed line).Because the randomly sampled negative sets potentially included some unknown I-O targets, a second 10-fold CV was performed using a negative set of 185 genes identified by taking the union of non-essential genes from Hart et al. 21and those not expressed in lymph node, bone marrow, thymus, whole blood, lymphocytes, and mammary tissues, based on the Human Protein Atlas and GTEx.Cross-validation with the defined negative set resulted in an AUC of 0.79 (Figure 1D, solid line), indicating that the negative sets likely contain previously undescribed I-O targets.Receiver operating characteristic (ROC) from leave-one-out cross validation using an independent negative a positive set of immune-oncology (I-O) targets using BrCa and immune phenotypes from genetic perturbation studies, comparative genomics, Mendelian genetics, and colocalization with autoimmunity and inflammatory disease risk loci.When applied to 1,362 candidate genes in linkage disequilibrium (LD, r 2 > 0.6) with 155 human BrCa GWAS loci, several I-O targets were identified, including LSP1 and TLR1.These data, combined with genomic, transcriptomic, proteomic, and phenotypic evidence, collectively demonstrate that fine-mapping with systematic integration of functional data (i.e., functional fine-mapping) is a powerful approach for translating GWAS risk loci to novel therapeutic targets.

Random forest modeling of I-O target probability for breast cancer GWAS candidates
5][16][17][18] Although these GWAS are relatively small and underpowered for candidate gene discovery, the observations suggest that larger cancer GWAS might provide a roadmap for I-O targets with increased translational potential due to the existing association with human malignancy.To this end, colocalization analysis of loci was performed using multiple large publicly available GWAS (N = 809,871 unique individuals) for BrCa (n = 155 loci with 452 lead SNPs curated in this study) and autoimmunity (n = 4,076 loci), which confirmed that heritable immunity is a prevalent component of BrCa risk (48% of BrCa loci, p < 0.0001) (Figure 1A; Tables S1, S2, and S3).However, due to the paucity of immunophenotypic data from these BrCa GWAS, identifying iTME modifiers is not possible without integrating immune phenotypes from other sources, such as Mendelian genetics, comparative genomics, and genetic perturbation studies.The objective of this study was to develop a machine learning approach that would identify the most probable antitumor immune modifiers from BrCa GWAS loci by learning the shared properties of known I-O targets (i.e., the positive set) compared to a negative set of proteins.
We hypothesized that a random forest model that implemented ''easy ensemble'' and bagging 22 would be well-suited to address the challenge of a limited set of true positives (n = 45 curated I-O targets) and a larger set of unknown potential targets (i.e., all other genes).The resulting unbalanced positive-only learning task can be addressed by using negative training sets of equivalent size to the positive set and randomly sampling unique sets that exclude the positive set 23 (Figure 1B).Random forest models were then built using the averaged predictions over the 10,000 models to assign an I-O target probability score to the 1,362 BrCa candidates in LD (r 2 > 0.6) with 155 human BrCa GWAS loci.To capture unique aspects of iTME biology, the random forest model integrated 22 feature sets that included complex genetics (Tables S1, S2, and S3), Mendelian genetics (Table S4), genetic perturbation studies (Tables S5, S6, S7, S8, S9, S10, and S11), comparative genomics (Table S12), cancer drivers (Table S13), and eQTLs (Table S14).A Pearson's correlation of the model features revealed that there was low covariance between the quantitative variables that were included as features in the random forest model (Figure 1C), suggesting that feature contributions to the model were largely independent.
To assess model performance, a 10-fold cross validation (CV) was performed using the positive set and each of the negative sets, followed by a receiver operator characteristic (ROC) area under the curve (AUC) analysis that was averaged across all 10,000 random forest models (AUC = 0.71) (Figure 1D, hashed line).Because the randomly sampled negative sets potentially included some unknown I-O targets, a second 10-fold CV was performed using a negative set of 185 genes identified by taking the union of non-essential genes from Hart et al. 23 and those not expressed in lymph node, bone marrow, thymus, whole blood, lymphocytes, and mammary tissues, based on the Human Protein Atlas and GTEx.Cross-validation with the defined negative set resulted in an AUC of 0.79 (Figure 1D, solid line), indicating that the negative sets likely contain previously undescribed I-O targets.The mean decrease in the Gini metric was used to assess the relative importance of the features in the models, which revealed that Mendelian disease feature was the most predictive and followed closely by genetic perturbations and other features (Figure 1E).Combined these features performed expectedly well in predicting the training set (median score = 0.83) compared with the negative training (median score = 0.43) and the BrCa candidates (median score = 0.44) (Figure 1F).Further, the random forest model performed comparably with an elastic-net logistic regression using the 22 feature sets (R = 0.6, p value 10 À92 ) (Table S16; Figure S1), suggesting that the performance of the random forest model was not due to overfitting of the limited set of true positives.Collectively, these data demonstrate that a random forest model implementing ''easy ensemble'' and bagging 22 performed well in learning the shared properties of known I-O targets (i.e., the positive set) compared to a negative set of proteins.
Hierarchical clustering of the top ranked BrCa candidates with I-O target probability scores was performed to test whether these plausible iTME modifiers function through similar or distinct biological pathways.STRING analysis was performed on the two major clusters of iTME modifiers, which revealed significant connectivity (p < 1 3 10 À10 ) and suggested that highly probable iTME modifiers converge on common biological signaling pathways.Gene ontology (GO) enrichment analysis revealed that Cluster 1 (n = 54 genes) was comprised of more general immune modulators (Table S15A), whereas Cluster 2 (n = 51 genes) was most significantly enriched for modifiers of T cell activation (Table S15B).Since the majority of ICB therapies already focus on the T cell compartment, we chose instead to focus on orthogonal validation of the more broadly acting immune modulators in Cluster 1, which included the highly ranked immune modulators, LSP1 and TLR1 (Figure 1G; Table S16).Notably, both candidates are widely associated with BrCa risk 6,12,13,24-27 and ranked within the top 25 of genes with the highest I-O Figure 1.Continued set curated from the non-essential gene list and filtered on genes without reported expression in immune or breast cells.The prediction performance was reported as AUC (area under the curve) of 0.79.(E) The relative importance of the 22 features in the Random Forest model as measured by the average decrease in the Gini metric.(F and G) Ranking of I-O target probability scores across the training set (blue), BrCa candidates (gray), and the independent negative set (red).Note the high probability rankings of the BrCa candidates, LSP1 and TLR1.target probability scores, yet little is known of the roles of LSP1 and TLR1 as genetic modifiers of the iTME.Thus, we sought in subsequent studies to assess the orthogonal evidence of LSP1 and TLR1 modifying BrCa risk through the iTME, which would provide further rationale as potential I-O therapeutic targets.
LSP1 is an I-O target associated with BrCa risk and outcome LSP1 is a negative regulator of leukocyte trafficking and activation, 28 which has been replicated across multiple GWAS for BrCa incidence 6,24-27 and outcome, 29,30 as well as inflammatory and autoimmune diseases, [31][32][33][34][35] yet the therapeutic potential of LSP1 is unknown.Based on the random forest analysis, the probability of LSP1 as an I-O therapeutic ranked in the top 1% of BrCa GWAS candidates (8 th out of 1,362 candidates) (Table S16), suggesting that LSP1 is an iTME modifier that shares the characteristics of successful I-O targets.Further, seven LD blocks (r 2 > 0.6) localized to LSP1 have been associated with BrCa incidence or outcome, including 5 LD blocks that correlated with altered LSP1 expression in leukocytes and 29 out of 78 SNPs (37%) within the LD blocks that were predicted to alter a functional motif (Table S17).Notably, 19 of the 29 functional SNPs (fSNPs) (66%) were predicted to disrupt canonical binding sites for transcriptional regulators of immune and inflammatory signaling (Table S17).For example, the most significantly BrCa-associated LD block (led by rs1973765) 26 contained three fSNPs that were predicted to disrupt LSP1 regulatory regions, including binding sites for immune regulatory transcription factors, ASCL2 (rs4980389) and PLAG1 (rs588321) (Figure 2A), and eight SNPs correlated with pan-leukocyte expression of LSP1 (Figure 2B).Finally, LSP1 expression in treatment-naı ¨ve tumor specimens from BrCa patients was largely restricted to the infiltrating leukocytes 36,37 (Figures 2C and 2D) and high leukocyte-specific expression of LSP1 correlated with shorter progression-free survival in univariate analysis of BrCa (HR = 1.73, 95% CI 1.29-2.33;p = 0.01) (Figures 2E-2K; Table S18).Combined, these data establish that LSP1 is a heritable iTME modifier that correlates with BrCa patient survival and closely resembles the characteristics of known I-O targets, which collectively suggest that LSP1 is a potential I-O therapeutic target with high translational potential.

Orthogonal validation of TLR1 as an I-O target using comparative mapping
We recently demonstrated that genetic modifiers in the TME can be mapped using cancer xenograft host strains with distinct genetic backgrounds that differ cancer risk profiles. 38,39Because the xenograft host strain backgrounds vary, whereas the malignant tumor cells do not, any observed changes in tumor progression are due to genetic differences in the nonmalignant TME. 38,39However, a challenge in using human xenografts for TME mapping studies is that the iTME compartment is inaccessible due to the use of immunocompromised xenograft strains.Thus, we developed a strategy for mapping iTME modifiers using syngeneic mouse models with different susceptibility to mammary cancer.

DISCUSSION
Germline genetics contribute to variability in the tumor-infiltrating immune system (referred to as iTME), 5,[14][15][16][17][18] which is clinically relevant to disease progression and response to anticancer immunotherapies. 17,18However, most germline genetic modifiers of the iTME remain uncharacterized due to the lack of well-powered GWAS that capture iTME phenotypes.Further compounding the challenges with mapping iTME modifiers is the limited integration of functional and comparative data for iTME-relevant phenotypes into current fine-mapping strategies for GWAS.To overcome these challenges, we developed a machine learning approach that used genomic, transcriptomic, proteomic, and phenotypic evidence to identify the BrCa GWAS candidate genes that most closely resemble features of a positive set of I-O therapeutic targets.This approach prioritized multiple BrCa candidate genes that resembled I-O therapeutics, such as LSP1 and TLR1, which were orthogonally validated by human BrCa disease association and comparative modeling, respectively.Collectively, these data demonstrate that germline genetic iTME modifiers within BrCa GWAS loci are pervasive and fine-mapping these loci with systematic integration of functional data is a powerful approach for translating GWAS risk loci to therapeutic targets.

Approaches to mapping heritability in the iTME
5][16][17][18] The estimated heritability of iTME modifiers from these studies approached 20% of variance for some traits, 15,17 excluding highly polymorphic regions with much higher estimates of heritability, such as the HLA and KIR regions. 17The collective estimates of iTME heritability from TCGA 15,17 match the estimated heritability of immune traits in healthy individuals by larger GWAS 32,46 and twin studies, 47 suggesting that the heritability of immunity is consistent in normal and malignant settings.However, despite comparable levels of heritable immunity in normal 46 and malignant settings, 15,17 only a handful of iTME candidates were detected by GWAS in TCGA pan-cancer cohort ($9,000 patients) as compared with >2,700 modifier loci of immune traits in a much larger cohort of >500,000 healthy individuals. 32Together, these observations indicate that GWAS in TCGA is currently underpowered to detect all but the most penetrant iTME modifier loci, and larger cohorts with matched germline and RNA-seq data are needed to replicate and expand the catalog of germline iTME modifiers. 16The inadequate power to detect most iTME modifiers by GWAS in TCGA is further compounded by the limitations of deconvoluting cell type-specific signals within bulk transcriptomics data. 48Thus, until well-powered cancer patient datasets expand with corresponding immune phenotypes, identifying iTME modifiers is not possible without integrating immune phenotypes from other sources, such as Mendelian genetics, comparative genomics, and genetic perturbation studies.
To our knowledge, this study was the first to develop a machine learning approach to map the most probable antitumor immune modifiers from BrCa GWAS loci by learning the shared properties of known I-O targets (i.e., the positive set) without a labeled set of negative controls.In this setting, the learner only has access to a small set of positive examples and a large set of unlabeled data, because outside the known I-O targets there are many genes with unknown potential as I-O therapeutic targets.This type of machine learning problem is referred to as Positive Unlabeled (PU) 49,50 and it is compounded by the risk of overfitting when the sizes of the positive and unlabeled sets are unbalanced. 51To overcome this problem, we developed a machine learning approach that combines easy ensemble 51 and bagging 22 to balance training sets by randomly sampling negative sets that are size-matched to the positive set. 52Random forest models (n = 10,000) were then generated using the positive set and each non-overlapping negative set from the unlabeled data to generate an averaged probability score across all models. 52By using known I-O targets as the positive training set, we hypothesized that this approach could be used to prioritize germline genetic modifiers of the iTME with the greatest probability as I-O therapeutic targets.Because this machine learning approach prioritizes candidates associated a priori with a disease of interest, the candidates that most closely resemble successful therapeutic targets are also likely to have high translational potential.Finally, this machine learning framework is flexible in that it requires only a positive set for training (i.e., known targets for any disease or phenotype) and can integrate genotypic, phenotypic, and functional data from independent sources to overcome the current limitations of disease association studies.

BrCa iTME candidates with potential as I-O targets
LSP1 and TLR1 were highly ranked within the top 25 genes with the highest I-O probability scores among BrCa candidate genes with orthogonal evidence supporting their role in BrCa.LSP1 was also distinct from other iTME candidates in that it has been associated with BrCa outcome 29 in addition to incidence of BrCa 6,24-27 and autoimmunity, 31,35,53,54 suggesting that LSP1 is a germline iTME modifier of BrCa incidence and disease progression.Moreover, cell type-specific detection of LSP1 in BrCa tumor specimens revealed that LSP1 expression is predominantly restricted to infiltrating leukocytes and was significantly correlated with better outcome.Based on the observation that Lsp1-deficient mice exhibit increased leukocyte motility 28 and enhanced response to immune checkpoint blockade (ICB), 55 we hypothesize that LSP1 is a negative regulator of antitumor immunity.7][58] These data, combined with genetic evidence that the LSP1 locus is associated with altered LSP1 expression in leukocytes and BrCa-specific outcomes, suggest that LSP1 is a iTME modifier of BrCa incidence and disease progression.TLR1 was also a notable example of a GWAS candidate for both BrCa 13,26 and immune phenotypes, 41,[59][60][61][62][63][64] which were confirmed by comparative mapping and GEMM phenotyping. 65TLR1 is a key regulator of innate immune response to DAMPs released by cancer cells 66 and TLR1/2 agonists have been shown to enhance efficacy of ICB via activation of CTLs. 67,68Combined with the evidence that LSP1 and TLR1 both modulate ICB therapies in experimental models, 55,67 our data suggesting that LSP1 and TLR1 are likely to be patient-relevant targets for modulating antitumor immunity.

Conclusions
This study systematically mapped germline genetic modifiers of the iTME of human BrCa that most closely resembled the characteristics of known I-O targets, which enabled the discovery of LSP1 and TLR1 as iTME modifiers with therapeutic potential.This machine learning framework is flexible in that it required only a positive set for training and could integrate genotypic, phenotypic, and functional data from independent sources to overcome the current limitations of disease association studies.We orthogonally replicated the iTME-specific associations of LSP1 and TLR1 to BrCa risk using patient tumor biopsies and comparative mapping, respectively.Importantly, because this machine learning approach prioritizes candidates that were first identified by human BrCa risk association, we postulate that there is greater likelihood that LSP1 and TLR1 will be translatable and clinically relevant to BrCa.Combined with our findings, these targets were also independently shown to modify responses to ICB in experimental cancer models. 55,67Thus, we conclude that LSP1 and TLR1 are likely to be patient-relevant targets for modulating antitumor immunity.Finally, these data demonstrate a robust and flexible analytical framework for functionally fine-mapping GWAS risk loci to identify the most translatable therapeutic targets for the associated disease.

Limitations of the study
Despite the evidence of iTME heritability, analytical approaches to identify germline genetic modifiers of the iTME are limited and there are many unresolved questions.One challenge is that iTME candidates are likely to have complex interactions across multiple molecular pathways, cell types, and physiological functions. 5Moreover, these interactions might differ across different cancer lineages and subtypes, including differences in the iTME of BrCa subtypes that have recently been reported. 69It is also highly plausible that some pleiotropic iTME modifiers impact both BrCa cells and infiltrating leukocytes.Further confounding the ability to characterize iTME candidates is the observation that multiple causative genes can co-segregate at the same GWAS locus. 70A single genetic locus could therefore elicit complex biological changes across multiple cell types and the combined effects are ultimately manifested at the phenotypic level.Likewise, seemingly unrelated iTME modifiers that are not connected at the molecular level might interact at the cellular or tissue levels by modifying the density or physiological efficacy of cellular mediators within the tumor.For example, the phenotypic effects of a genetic modifier of infiltrating leukocyte function (e.g., TLR1) might be dampened or amplified in a patient that has co-inherited a modifier of leukocyte trafficking (e.g., LSP1).To an extent, we demonstrated that potential interactions could be predicted through hierarchical clustering of iTME candidates to identify common biological functions and pathways.However, observation biases potentially exist in this approach due to limitations in the available data.Thus, it is expected that the GOs within iTME modifier subgroups will shift as additional datasets become available that capture different aspects of iTME biology.
The challenges to disentangling the complexities of the host iTME will likely require further development of robust experimental and analytical tools for assessing germline genetic heritability in the iTME.Here, we took the initial step toward developing a machine learning approach to identify the most likely iTME candidates within 1,362 genes in LD (r 2 > 0.6) with 155 human BrCa GWAs loci that resemble known I-O targets.To our knowledge, this was the first systematic analysis of BrCa GWAS loci to characterize iTME modifiers using the growing catalog of functional genomics datasets for both oncology and immunology.Intriguingly, this analysis revealed a prevalence of BrCa GWAS candidates (39%) with phenotypic evidence in leukocyte cell types, which also agrees with the significant overlap of 48% (p < 0.0001) of BrCa GWAS risk loci with autoimmune disease loci.Notably, not all candidate phenotypes were observable in ex vivo phenotypic screens with isolated cell types, yet perturbation of some of these candidates in the in vivo and patient settings revealed strong phenotypic evidence in the iTME.For example, text mining patient data in OMIM was the most predictive feature in the random forest model, which could be attributed to disease phenotypes only observed in patients, such as the association of LSP1 with the rare neutrophil actin dysfunction disorder (NAD, OMIM 257150).Experimental mapping using the first genetically diverse in vivo preclinical models for studying iTME heritability (C57BL/6J and PWD/PhJ), also enabled the independent replication of TLR1 as a plausible germline genetic modifier of the iTME.While this publication focused on LSP1 and TLR1, many probable BrCA I-O candidates were highly ranked and should be further evaluated.Thus, future studies to expand the repertoire of genotypic, phenotypic, and functional data from preclinical and clinical datasets are likely needed to facilitate the discovery and characterization of additional germline genetic modifiers of the iTME that impact antitumor immunity and response to immunotherapies.
There also remains a need to develop approaches to directly measure the germline genetic contribution to the iTME in patients, which will further enable discovery of additional I-O targets with high translational potential.One possible approach would be to combine germline genotyping with single-cell RNA expression profiling (i.e., scRNA-seq) of dissociated patient tumor biopsies to identify iTME eQTLs, as has been demonstrated elsewhere in circulating leukocytes isolated from healthy individuals. 71,72However, a major drawback of scRNAseq profiling using dissociated tissues is the inability to capture spatial orientation of profiled cells within the tissue.Spatial transcriptomics was recently developed to enable spatially resolved genome-wide gene expression at the near single-cell level, 73 yet both scRNA-seq and spatial transcriptomics are cost prohibitive and not scalable to population level studies.Based on recent advances in tissue microarray technologies 74 and quantitative immunofluorescent imaging, 75 we propose an alternative framework for iTME QTL mapping at the protein level, using multiplex immunofluorescent assays to correlate cell type-specific protein expression with patient genotypes.The capacity of this approach could be expanded using high density tissue microarrays and quantitative immunofluorescent imaging, which offers highly sensitive and spatially resolved detection of protein expression at the cellular and subcellular levels.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Breast cancer candidate gene mapping
Candidate gene mapping of each set of summary statistics was performed using FUMA, 95 followed by positional mapping of all SNPs in LD (r 2 > 0.6) with the reported lead SNP and mapping genes to SNPs within 10Kb of the nearest gene using ANNOVAR.Annotated loci from each set of summary statistics were clumped by chromosome and position, then annotated from locus 1 to 155 (Table S2).eQTL mapping was performed for all SNPs using the eQTL Catalogue 96 DICE, 97 GTEx (v8), 98 and eQTLGen. 99LD blocks within select loci (e.g., LSP1 and TLR1) were further characterized using Ensembl Variant Effect Predictor (VEP), 100 ImmuNexUT, 100 GWAS Catalog, 101 and RegulomeDB 102 to annotate SNPs overlapping function motifs (fSNPs).

Colocalization of human breast cancer and autoimmunity risk loci
Autoimmunity risk variants were obtained from Open Targets Genetics version 6. 21 Studies of autoimmune traits were identified using mapped Experimental Factor Ontology terms belonging to the class 'autoimmune disease'. 103Independent lead SNPs from BrCa loci were obtained as described below and are found in Table S3.Locus boundaries shown in the figure, plotted using the chromoMap R package, 79 were constructed using 250 kb windows on either side of a lead SNP.To test for significant overlap between BrCa loci and autoimmune loci, each of the 155 BrCa loci was randomly assigned a uniformly distributed location on the genome with 10,000 permutations.The distance between the BrCa locus and the nearest autoimmune lead SNP was computed for each set of randomized BrCa locus, and this was used to compute the number of BrCa loci within 500 kb of an autoimmune lead SNP.Results are insensitive to the choice of summary statistic: the mean and median distance, the proportion BrCa loci containing a lead SNP for an autoimmune disease, the proportion of BrCa loci having an autoimmune SNP within varying distances from the locus up to 10 Mb similarly resulted in random distributions that were significantly different from observed values, in most cases without overlap.

LSP1 protein expression in human breast cancer tissue biopsy microarrays
A validated multiplexed immunohistochemistry protocol was used to detect LSP protein in tumor-associated leukocytes in 1,109 de-identified cases of invasive BrCa diagnosed at the Jefferson University Hospital from 1988 to 2005 and made available under institutional IRB approved protocols in tissue microarray format (0.6 mm core diameter) as previously described. 73,74In addition to LSP1, multiplex fluorescence staining also included CD45 to detect leukocytes and pan-cytokeratin to detect cancer cells.Immunohistochemistry was performed on an autostainer (Omnis; Agilent/Dako).After antigen retrieval at pH 6, TMA sections were incubated with HRP-conjugated anti-mouse secondary antibody polymer (EnVision+; Dako-Cat# K400311-2); 1:8000 dilution), followed by HRP-conjugated anti-rabbit secondary antibody polymer (EnVision+; Dako-Cat#K4003), and visualized using Cy5-tyramide as substrate.The antigen retrieval step was repeated followed by second round staining for pan-CK using anti-pan-cytokeratin antibody (AE1/AE3 mouse monoclonal, Dako-Cat# M351501-2) followed by HRP-conjugated anti-mouse secondary antibody polymer (EnVision+; Dako-Cat# K400311-2) and visualized by Cy7-tyramide as substrate.CD45 was stained using biotin-conjugated anti-CD45 (Biolegend; cat#304004; 1:50 dilution) and visualized by incubation with Alexa555-labeled streptavidin.DAPI counterstain was used to visualize cell nuclei.Stained slides were digitized at 20x magnification on a slide scanner (3DHistech Pannoramic Flash II) capturing fluorescent images captured in four channels (DAPI, Alexa555, Cy5, ad Cy7.Digitized images were analyzed by Tissue Studio (Definiens) and mean cytoplasmic expression signal for LSP1 immunoreactivity was computed for CD45-positive cells in each tumor core.All immunohistochemistry, slide scanning, and quantitative analyses of digitized images were performed by investigators that were blinded to outcome.

RNAseq library preparation and sequencing
Total RNA was collected from EO771.LMB tumors that were xenografted into B6 and PWD.B6 mice (n = 4 per group) isolated using TRIzol Reagent (ThermoFisher), poly-A purified, transcribed, and chemically fragmented using Illumina's TruSeq RNA library kit using the manufacturer's protocol.Individual libraries were prepared for each sample, indexed for multiplexing, and then sequenced on an Illumina HiSeq 2500 (Illumina, Inc., San Diego, CA).

Spontaneous mammary tumor model
The F 1 generations of the PWD/C57-PyVT and C57/C57-PyVT crosses were genotyped for the PyVT transgene using an established protocol from the Jackson Laboratory.Beginning at 6 weeks-of-age, the transgene-positive PWD/C57-PyVT and C57/C57-PyVT female mice were palpated weekly for mammary tumors, and the initial date of detection and anatomical location were recorded.Prior to tissue collection, tumor-positive mice were then aged for 40 days post-diagnosis to permit development of metastases.To localize the mammary tumor modifiers in the PWD/PhJ and C57BL/6J genomic backgrounds, a [C57BL/6J x (PWD/PhJ x B6.FVB-Tg(MMTV-PyVT)634Mul/LellJ)]N 2 backcross (n = 104) was performed.Genome-wide genotyping was performed using 3,753 informative markers from the MUGA array. 104UANTIFICATION AND STATISTICAL ANALYSIS

Random forest modeling of I-O target probability
The random forest models and predictions were performed using the ''randomForest (v4.6-3)'' and ''caret (v6.0-93)'' libraries in R statistical software.The number of trees in each model was fixed at 1000, as any increase in trees above this value did not yield any significant changes in the prediction probability.The tuneRF function in the random forest package was used to determine the optimal number of variables (''mtry'') sampled at each split.The stepfactor was set at 0.01 and the ''improve'' parameter at 0.01 value.A positive training set of 45 preclinical and clinical I-O therapeutic targets were identified from the literature. 105To balance the training set for the model, size-matched negative training sets were generated by random sampling without replacement of genes and excluding the positive training set.Genes with more than 6 missing values were omitted from the predictions.A total of 10,000 random forest models using each of the random negative sets were trained and predictions were made based on each model, followed by averaging the predictions over 10,000 models to assign the IO target probability score to each gene.Two methods were used to cross-validate the random forest model.In the first approach, a 10-fold cross-validation (CV) was performed using the positive set (n = 45) and a negative set of 185 genes identified by taking the union of non-essential genes from Hart et al. 23 and those not expressed in lymph node, bone marrow, thymus, whole blood, lymphocytes, and mammary tissues, based on the Human Protein Atlas and GTEx.This resulted in a CV of AUC = 0.79.In the second approach, a 10-fold cross-validation (CV) was performed for each of the 10,000 random forest models that were generated using the positive set (n = 45) and the 10,000 size-matched and randomly selected negative sets, followed by averaging of the 10,000 AUCs.This resulted in a CV of AUC = 0.71.The slight drop in the AUC of the second approach is likely due to the random sampling of negative sets that were used for the 10,000 random forest models, which very likely included some models built from negative sets that included unknown I-O targets.The random forest analysis was further benchmarked against an elastic-net logistic regression model with L1 and L2 regularization to minimize overfitting.
For each elastic-net model, the alpha and lambda parameters were tuned to maximize accuracy based on a 10-fold cross-validation using the ''train'' function from the ''caret'' R package.Identical to the random forest analysis, the positive set (n = 45) was sized-matched with randomlysampled negative sets without replacement of genes and excluding the positive training set.A total of 10,000 elastic-net models using each of the random negative sets were trained and predictions were made based on each model, followed by averaging the predictions over 10,000 models to assign the IO target probability score to each gene.The elastic-net model was then cross-validated using the same two approaches that were applied to the random forest analysis: (a) the positive set (n = 45) and a fixed negative set of genes (n = 185) that yielded an AUC of 0.77 and (b) the positive (n = 45) and the 10,000 randomly sampled negative sets (n = 45) that yielded an average AUC of 0.69.Finally, the scores from the random forest and the elastic net-analyses were compared by Pearson's correlation, which revealed significant agreement between the approaches (R = 0.6, p-value 10 -92 ) (Table S16; Figure S1).Combined, the results of the two cross-validation approaches suggest that the probability scores assigned to the BrCa candidate genes are accurate and not a result of overfitting.Twenty-two features were included in the random forest model: OMIM Mendelian genetics -The Online Mendelian Inheritance in Manâ (OMIMâ) database (https://omim.org/) was mined using Natural Language Processing (NLP) and ontology-based Text and Data Mining (TDM) to extract immunology and cancer phenotype relevant concepts.These data were then converted to a numerical matrix that was compatible for random forest modeling.The full OMIM was obtained via FTP (Morbid Map, Gene Map) and API (full records) respectively (retrieval date: 2022-09-22) and pre-processed with Python for TDM indexing.The information was mined from specific sections (depending on the concept) in the disease and gene monographs, as well as clinical synopses using IQVIA/Linguamatics I2E platform. 106We further utilized internally built knowledge extraction pipelines and disambiguation routines. 107TDM queries were focused on paragraphs with clinical / human focus.Linguistic context (sentiment analysis) was used to filter out hits related to normal phenotype (''normal leukocyte count'') or absence of abnormality (''no autoantibodies detected'').TDM identified the most specific concept in the text as hit.In post-processing, ontological hierarchies are considered, and matches propagated to higher level concepts to balance the curator-dependent level of granularity found in OMIM text.For TDM, a list of concepts for normal and pathological immunity were mapped, including those related to immune cells, immunoglobulin types, autoantibodies, anatomical parts of the immune system like spleen, autoimmunity, immunodeficiency, recurrent infection, hypersensitivity, atopy, inflammatory reaction, immune system diseases, leukocyte disorders, and lymphatic diseases (see column headers in Table S4 for full list of concepts).Likewise, TDM was used to generate a list of concepts for cancer relevant processes (e.g., proliferation, apoptosis) for solid and hematological tumors (Table S4).The concepts were mapped to public standard life science ontologies, 93,94,108,109 and internally enriched with additional synonyms and custom-built vocabularies for concepts that could not be mapped to external standards.Disease concept associations resulting from mining the disease monographs and clinical synopses were turned into indirect Gene -I-O concept associations using enriched OMIM Gene -Disease (G-D) relationships.These relationships use Morbid Map data (6862 G-D relationships) as a basis and are enriched with TDM (5895 additional G-D associations) by mining 'Molecular Basis' in clinical synopses and linguistic relationship mining in relevant disease paragraphs.TDM enrichment is crucial as, for example, the association between LSP1 and Neutrophil Actin Dysfunction is only described in the text but not covered in Morbid Map and no indirect LSP1 and I-O associations could have been derived.Finally, all individual direct or indirect gene I-O concept associations were summarized into the binary per gene I-O association matrix (see Table S4).The final OMIM input feature to the random forest model is the sum of concept hits for each gene.

Figure 1 .
Figure 1.Random forest modeling of I-O target probability for breast cancer GWAS candidates (A) Global alignment of GWAS risk loci for BrCa (n = 155 loci with 452 lead SNPs curated in this study) and autoimmunity (n = 4,076 loci from Open Targets Genetics) revealed significant colocalization (48% of BrCa loci, p < 0.0001).(B)Overview of the I-O target probability predictions using a positive training set of genes with preclinical and clinical evidence as I-O therapies.A size-matched negative set was generated from random sampling without replacement of all other genes and after excluding the positive set.Random forest models were then built (n = 10,000 models) using each of the random negative sets to generate predictions, followed by averaging across the 10,000 models to assign an I-O target probability score to 1,362 candidate genes in linkage disequilibrium (LD, r 2 > 0.6) with 155 human BrCa GWAS loci.(C) A correlation matrix of the quantitative variables revealed limited overlap between features.(D) A 10-fold cross validation (CV) was performed using the positive set and each of the negative sets, followed by a receiver operator characteristic (ROC) area under the curve (AUC) analysis that was averaged across all 10,000 random forest models (AUC = 0.71) (Figure1D, hashed line).Because the randomly sampled negative sets potentially included some unknown I-O targets, a second 10-fold CV was performed using a negative set of 185 genes identified by taking the union of non-essential genes from Hart et al.21 and those not expressed in lymph node, bone marrow, thymus, whole blood, lymphocytes, and mammary tissues, based on the Human Protein Atlas and GTEx.Cross-validation with the defined negative set resulted in an AUC of 0.79 (Figure1D, solid line), indicating that the negative sets likely contain previously undescribed I-O targets.Receiver operating characteristic (ROC) from leave-one-out cross validation using an independent negative

Figure 3 .
Figure 3. Comparative mapping of iTME modifiers in mouse and human BrCa (A) Schematic representation of syngeneic tumor transplant experiments in homozygous C57BL/6J (B6) mice and PWD/PhJ x C57BL/6J (PWD.B6) F 1 heterozygous mice.(B and C) E0771.LMB tumor volumes (B) and weights (C) at 17 days post-implantation in B6 homozygous female mice (n = 22) and PWD.B6 F 1 heterozygous female mice (n = 10).Data are represented as mean G SEM. ***p < 0.001, as determined by Student's t test.(D and E) Enrichment of differentially expressed genes in inflammatory pathways (D) and immune cell infiltrates (E) assessed by GSEA of bulk transcriptomic data from E0771.LMB tumors grown in a B6 and PWD.B6 F 1 mice (n = 4 per group).

Figure 5 .
Figure 5. Evidence of TLR1 as an iTME modifier of BrCa (A) Pathway activation (Z score) predicted from bulk transcriptomic analysis of E0771.LMB tumors grown in B6 and PWD.B6 F 1 mice (n = 4 per group).The highlighted pathways consist of differentially expressed upstream regulators (FDR <0.1) with observable pathway deviations (Z score) that passed the accepted statistical threshold (p < 0.01) for the Upstream Regulator Analysis function of IPA.Significant activation of the top iTME candidate pathway, Tlr1/2, is highlighted in red.(B) Distribution of intrinsic BrCa phenotypes and extrinsic iTME phenotypes across the candidates overlapping with human GWAS loci.(C) Heatmap of functional fine-mapping scores using CRISPR screening data related to intrinsic BrCa phenotypes and extrinsic iTME phenotypes.

TABLE
94ANTIFICATION AND STATISTICAL ANALYSISB Random forest modeling of I-O target probability calipers and tumor volumes were calculated according to the formula: volume = Dd 2 p/6, where D and d equal larger and smaller diameters, respectively94 d RESOURCE AVAILABILITY B Lead contact B Materials availability B Data and code availability d EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS B Human breast cancer GWAS B Animal models B Mammary tumor transplant model d METHOD DETAILS B Breast cancer candidate gene mapping B Colocalization of human breast cancer and autoimmunity risk loci B LSP1 protein expression in human breast cancer tissue biopsy microarrays B RNAseq library preparation and sequencing B Spontaneous mammary tumor model d