Putatively cancer-specific alternative splicing is shared across patients and present in developmental and other non-cancer cells

We compared cancer and non-cancer RNA sequencing (RNA-seq) data from The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) Project, and the Sequence Read Archive (SRA). We found that: 1) averaging across cancer types, 80.6% of exon-exon junctions thought to be cancer-specific based on comparison with tissue-matched samples are in fact present in other adult non-cancer tissues throughout the body; 2) 30.8% of junctions not present in any GTEx or TCGA normal tissues are shared by multiple samples within at least one cancer type cohort, and 87.4% of these distinguish between different cancer types; and 3) many of these junctions not found in GTEx or TCGA normal tissues (15.4% on average) are also found in embryological and other developmentally associated cells. This study probes the distribution of putatively cancer-specific junctions across a broad set of publicly available non-cancer human RNA-seq datasets. Overall, we identify a subset of shared cancer-specific junctions that could represent novel sources of cancer neoantigens. We further describe a framework for characterizing possible origins of these junctions, including potential developmental and embryological sources, as well as cell type-specific markers particularly related to cell types of cancer origin. These findings refine the meaning of RNA splicing event novelty, particularly with respect to the human neoepitope repertoire. Ultimately, cancer-specific exon-exon junctions may affect the anti-cancer immune response and may have a substantial causal relationship with the biology of disease.


INTRODUCTION
), although we find that the set of junctions defined as "novel" is highly sensitive to the filtering criteria used (see Figure S1A, Table S2, and "Selection of cancer-specific junctions" in Methods). We use lack of occurrence of a junction in core normals as a baseline definition of cancer specificity.
We next assessed the extent to which a given junction not found in core normals might be shared among multiple samples of the same cancer type. We observed that over half (52.8%) of these junctions are confined to individual samples, although a small but significant subset (0.41%) is shared across at least 5% of samples in at least one cancer-type cohort ( Figure S1B). We also noted that 40.6% of novel junctions are shared between multiple cancer types, with a total of 1,609 junctions present in at least 5% of samples each across two or more TCGA cancer cohorts ( Figure S1C). Sharedness was significantly higher among junctions that were also present in normal tissues (Figures 1C and 1D). We observed that the number of junctions not found in core normals per patient was comparable for patients with and without splicing factor-associated mutations across all cancer types, with the exception of breast adenocarcinoma (Figures S1E and S1F). We also observed that splicing-associated mutations had minimal effect on the sharedness within a cancer-type cohort of junctions not found in core normals ( Figure S1G and S1H).
We finally assessed whether these junctions were also shared among independent cancer cohorts, using publicly available RNA-seq data in the SRA. Many TCGA cancer junctions not found in core normals were found to occur in cancer-type matched SRA samples: 11 of 14 cancer types had more than 50 junctions in common between the matched cohorts. Moreover, we found that junctions also present in matched SRA cancer cohorts were associated with significantly higher levels of sharedness in the TCGA cohort (H statistic = 3.85-2,803 and p = <0.0001-0.0495; Figure S1I).  (Table S1); green (center) bars give the percentage of the remaining junctions that are found in other core normals; and yellow (right) bars give the percentage of cohort junctions found in no core normals; cancer types are ordered by relative abundance of junctions in this last set. Cancer types with no blue (left) bar have no tissue-matched normal samples (Table S1). (B) Log-scale sorted strip plots representing the number of non-core normals junctions per sample for each of 33 TCGA cancer types. Each point represents a single TCGA tumor sample and the width of each strip is proportional to the size of the cancer type cohort (Kahles et al., 2018). Figure S1A shows analogous data with additional filters applied. (C) Log-scale box plots representing the prevalences within each cancertype cohort of junctions occurring in at least 1% of cancer-type samples, summarized across all TCGA cancer types. Junction prevalences are shown in blue (left) for those found in GTEx or TCGA tissue-matched normal samples (Table S1); junctions not present in tissue-matched normals but found in other core normals are shown in green (center); and junctions found in no core normals are shown in yellow (right). Note that any junction found in multiple cancer types is represented by multiple data points, one for each cancer type in which it is found. A detailed breakdown by TCGA cancer type is available in Figure S1D.

Shared novel junctions in cancer distinguish cancer identity and subtype
We hypothesized that a high level of exon-exon junction sharedness across samples is likely to be reflective of underlying conserved biological processes (e.g. among normal tissues). We therefore investigated the sharedness of novel junctions present in different cancer types. Interestingly, these novel junctions can readily distinguish disparate cancer types and show similarities among cancer types with shared biology, such as cutaneous and uveal melanomas ( Figure 2A). These novel junctions also reflect shared biology among additional cancer types with similar anatomic origins: colon and rectal adenocarcinoma, clear cell, chromophobe, and papillary renal cell carcinomas, low and high grade gliomas, and stomach and esophageal adenocarcinomas ( Figure 2A). Shared junctions from several cancer types also demonstrate similarities by histological subtype despite their differing anatomical origins, for instance squamous cell carcinomas of the lung, cervix, and head and neck ( Figure 2A, Figure S2A), consistent with previously published work (Lin et al., 2017). Moreover, shared novel junctions are readily able to distinguish distinct histological subtypes of sarcoma and cervical cancer, among other diseases ( Figure 2B). Using non-cancer cell types from the Sequence Read Archive (SRA) (Leinonen et al., 2011) we found that "novel" junctions from cancers arising from cell and tissue types poorly represented in GTEx normal tissue samples (e.g. melanocytes), or not present in GTEx at all (e.g. thymus tissue), can be found in many samples of the corresponding cell or tissue types of origin ( Figure 2C, Table S1). Sample-to-sample comparisons of all junctions from these rare-cell type cancers also show more similarity with cell type-matched normal samples from the SRA than with bulk tissue from GTEx ( Figure S2B). Heatmap showing junction prevalences across every TCGA cohort for each cancer type's top 200 shared junctions that are at least 1% prevalent in that cancer type and are not found in any core normal samples. (B) Heatmap showing shared junction prevalences across selected TCGA cancer types and their assigned histological subtypes for each subtype's top 200 shared junctions that are at least 1% prevalent in that subtype and are not found in any core normal samples. See Table S1 for TCGA subtype abbreviations. (C) Heatmap showing shared junction prevalences across selected TCGA cancer types and a set of their matched SRA tissue and cell types of origin, for each cancer type's top 200 shared junctions that are at least 1% prevalent in that cancer cohort and are not found in any core normal samples. See Table S3 for SRA sample type abbreviations.
Novel junctions in cancer are found among developmental and known cancer-related pathways As many cancers are thought to recapitulate normal developmental pathways (Borczuk et al., 2003;Huang et al., 2009;Naxerova et al., 2008), we further hypothesized that a subset of cancer-specific junctions may reflect embryological and developmental splicing patterns. We therefore compared cancer junctions not found in core normals with those from SRA samples pertaining to zygotic, placental, embryological, and fetal developmental processes (see Methods). On average, per cancer type, 15.4% of these junctions occur in SRA developmental cell or tissue samples, and 26.5% and 2.7% occur in samples from selected SRA normal adult tissues and cell types and SRA normal stem cell samples, respectively ( Figures 3A and S3A). The remaining significant majority of these cancer junctions not found in core normals were also not present in any non-cancer SRA tissue or cell type studied (64.9% on average per cancer type cohort, Figures 3A and S3A). Many of these novel "unexplained" junctions still exhibit high levels of sharedness both within (Figures S3B and S3C) and between ( Figure S3D) different cancer types. At the upper end, 16 of these shared junctions were found in more than 10% of samples in each of two or more cancer types (Table S4). We note that the liberal set inclusion criterion we employed may reduce our ability to identify robust cancerspecific biology among unexplained junctions. For instance, the well-described deletion causing a splicing of exons 1 and 8 (EGFRvIII) occurs in 29.4% of TCGA patients with glioblastoma multiforme (GBM) and in no core normals, but is also present in a single read from a single human epithelial cell line sample on SRA, and therefore is classified not as an unexplained cancer-specific junction but as "adult non-cancer." However, this set inclusion condition does allow for the identification of some cancer-specific biology of interest. For instance, rarer alternative EGFR splicing events were detected in the unexplained set, such as EGFRvIII with an alternate exon 1 joined to exon 8 (chr7:55161631-55172981), detected in 2 patients with GBM and 1 patient with low grade glioma; the same alternate exon 1 joined with two alternate exon 16s (chr7:55161631-55168521 and chr7:55161631-55170305) (detected in 1 and 2 GBM patients, respectively); and the same alternate exon 1 joined with exon 20 (chr7:55161631-55191717) in 2 GBM patients. An alternative filtering approach that instead requires two samples per SRA category to define junction set membership yields a greater number of unexplained junctions (Table S1D and Figures S3E and S3F).
We observed a number of unexplained junctions shared by unusually large proportions of ovarian cancer (OV) samples in TCGA, including one cancer-specific junction (chr16:766903-768491 on the minus strand) present in the highest proportion of samples in any TCGA cohort (81.3%, or 350 of 430 samples in OV). This junction occurs in an antisense transcript of MSLN, which codes for a protein known to bind to the well-known ovarian cancer biomarker MUC16 (CA125) (Felder et al., 2014;Kaneko et al., 2009). Another unexplained junction (chr19:8865972-8876532 on the minus strand) is in the MUC16 region itself and is present in 42.8%, or 184 of 430 samples in OV. In all, we identified 34 cancer-specific junctions present in >40% of OV samples. We further identified several novel pan-cancer splice variants (chr16:11851406 with chr16:11820297, chr16:11821755, and chr16:11828391, each present across up to 8 different cancers) in RSL1D1 and its neighboring BCAR4, a long noncoding RNA known to promote breast cancer progression (Godinho et al., 2011;Li et al., 2016).
Among all otherwise unexplained junctions, an average of 4.78% across cancer types are associated with known cancer-predisposing or cancer-relevant loci. Further, an elevated proportion of otherwise unexplained junctions (on average, 40.9%) occur in likely antisense transcripts and may therefore be of reduced interest as candidate neoantigens, but sustained interest in terms of cancer biology ( Figure 3B, Table S5). Shown junctions are absent from all core normals. Unexplained junctions (red highlights) comprise junctions not present in any set categories studied (see also expanded set assignments in Figure S3A). The developmental set comprises human development-related junctions not present in the category placenta. Scale is log10 of percent of junctions not found in core normals, calculated for each cancer. (B) Box plots showing the ratio of antisense-gene junctions to all junctions for (green) junctions found in core normals; (aqua) junctions not found in core normals but found in other selected non-cancer adult tissue and cell samples from the SRA; (lavender) junctions not found in core normals or SRA non-cancer adult samples but found in selected developmental samples on the SRA; (apricot) junctions not found in core normals or SRA non-cancer adult samples but found in selected stem cell samples on the SRA; and (red) junctions not found in core normals or selected non-cancer adult, developmental, or stem cell samples from the SRA.

Discussion
Previous studies have established the importance of alternative and aberrant splicing in cancer prognosis (Bjørklund et al., 2017;Li et al., 2017;Marcelino Meliso et al., 2017;Robertson et al., 2018;Zhu et al., 2018;Zong et al., 2018) and have begun to explore its potential relevance in cancer immunotherapy (Kahles et al., 2018;Wood et al., 2019). In this study, we explore "novel" exon-exon junction use among cancers with respect to a broad collection of normal tissues and cells. This is the largest such study to-date, integrating RNA-seq data from 10,549 tumor samples across 33 TCGA cancer types, 788 paired normal samples across 25 TCGA cancer types, 9,555 normal samples across 30 GTEx tissue types, and 12,231 human samples from the SRA (10,827 samples from 33 normal tissue and cell types and 1,404 samples from 14 cancer types) (Tables S1 and S2). To the best of our knowledge, this is also the first study to examine the novelty of cancer junctions from the perspective of immune tolerance, considering all adult normal tissue types as potential sources of tolerogenic peptides rather than only the closest matched normal tissues. Moreover, this is the first study to quantitatively interrogate the sharedness of novel exon-exon junctions both within and across cancer types, demonstrating that these junctions can distinguish cancers and their subtypes. We finally demonstrate that there is no one-size-fits-all definition of "novel" splicing, noting that purportedly cancer-specific junctions may in fact be present among, and perhaps biologically consistent with, a repertoire of embryological, developmentally-associated, and other cell types.
This study also has several limitations. We focus on the importance of exon-exon junctions as the predominant metric of alternative splicing. However, there are other sources of RNA variation (e.g. intron retention events (Smart et al., 2018) and RNA editing) that we do not explicitly study here, but which could be equally good sources of novel, cancer-specific protein sequence for immunotherapeutic and other applications. Additionally, there is substantial variability among analytical methods for identifying these exon-exon junctions. We note significant discordance between results of analyses of the same data using different junction filtering methods. While the same phenomena and general results appear to hold true independent of analytical technique, the identity and relative novelty of individual "cancer-specific" junctions vary between our results and those previously published (Kahles et al., 2018). We also acknowledge that GTEx and the SRA combined do not account for all sources of normal tissue(s) in the human body, and further acknowledge that the sample metadata used to search the SRA may be an imperfect surrogate for actual tissue/sample identities. Our assessment of embryological and developmentally-associated junctions is also limited by a relatively small number of relevant RNA-seq samples available on the SRA. Finally, due to the short-read nature of these RNA-seq data, we make no attempt to predict putative neoepitopes from cancer-specific junctions as we cannot confidently recapitulate reading frame or broader sequence context from isolated exon-exon junctions.
While cancer-specific exon-exon junctions may indeed be a source of neoepitopes, their sharedness across individuals and occurrence in cancer-relevant loci (e.g. EGFR, MUC16) are suggestive of underlying but as-ofyet unexplored biology. This sharedness does not appear to be related to variants in splicing factor or splicingassociated proteins, and is not wholly explained by recapitulation of embryological/developmental transcriptional profiles. As such, we see this work as opening a broad area of future research into the role and relevance of these novel recurring exon-exon junctions. Naxerova, K., Bult, C.J., Peaston, A., Fancher, K., Knowles, B.B., Kasif, S., and Kohane, I.S. (2008). Analysis of gene expression in a developmental context emphasizes distinct biological leitmotifs in human cancers. Genome Biology 9, R108.

CONTACT FOR REAGENT AND RESOURCE SHARING
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Abhinav Nellore (nellore@ohsu.edu).

METHOD DETAILS Data Download
Previously called exon-exon junction data including phenotype table, bed and coverage files for both TCGA and GTEx v6 were downloaded from the recount2 service at https://jhubiostatistics.shinyapps.io/recount/ (Collado-Torres et al., 2017). The metaSRA (Bernstein et al., 2017) web query form at http://metasra.biostat.wisc.edu/ was queried for experiment accession numbers for 1) non-cancer cell and tissue type samples (see Table S1 for cancer-matched samples and Table S3 for non-cancer samples, and "Selection of SRA tissue and cell types" for a description of how these samples were chosen) and 2) TCGAmatched cancer types (see Table S1). For the non-cancer samples, the term "cancer" was explicitly added as an excluded ontology term in the query, and the resulting files were filtered to remove any samples with "tumor" in the sample_name field. The resulting accession numbers were queried against the Snaptron junction database using the query snaptron tool (

Indexing of GTEx and TCGA junctions
The GENCODE gene transfer format (.gtf) file was parsed to collect full coordinates and left and right splice sites of junctions from annotated transcripts and a searchable tree of protein-coding gene boundaries. The GTEx phenotype file was parsed to collect tissue of origin information and donor gender; bone marrow samples derived from leukemia cell line cells were eliminated. The TCGA phenotype file was parsed to collect information on cancer type, cancer stage at diagnosis, patient gender, vital status, and sample type (primary tumor, matched normal sample, recurrent tumor, or metastatic tumor). Cancer subtype classifications were collected for five cancer types beyond their TCGA designations ( Figure 2B, Table S1): cervical squamous cell carcinoma and endocervical adenocarcinoma was separated into cervical squamous cell carcinoma, endocervical adenocarcinoma, and cervical adenosquamous; esophageal carcinoma was separated into esophagus adenocarcinoma and esophagus squamous cell carcinoma; brain lower grade glioma was separated into astrocytoma, oligoastrocytoma, and oligodendroglioma; sarcoma was separated into leiomyosarcoma, myxofibrosarcoma, malignant peripheral nerve sheath tumors, desmoid tumors, dedifferentiated liposarcoma, synovial sarcoma, and undifferentiated pleomorphic sarcoma; and pheochromocytoma and paraganglioma were separated. A new SQLite3 database was created to index all GTEx and TCGA junctions, with linked tables containing 1) sample ids and associated junction ids; 2) sample ids and phenotype information for each sample; and 3) junction ids and junction information including 0-based closed junction coordinates, GENCODE annotation status, and location within protein coding gene boundaries. SQL indexes were created on junction ID and sample ID columns for fast and flexible querying.
Selection of cancer-specific junctions For all analyses we apply a light filter, requiring a junction to have at least a two-read coverage across GTEx, TCGA, and the selected cancer and non-cancer SRA samples, to exclude false positive junctions but allow for the existence of splicing noise. To characterize junction novelty in cancer with respect to normal cells, we defined a hierarchical filter that specifies inclusion and exclusion of junctions in different RNA-seq datasets (Table 1). In order from most to least permissive, these filters are: 1) junctions not found in tissue-matched GTEx or TCGA normal samples, 2) junctions not found in any GTEx or TCGA normal ("core normal") samples, and 3) junctions not found in any core normal samples or in selected SRA tissue and cell type non-cancer samples. For our analyses, we do not explicitly filter on whether a junction is annotated in GENCODE. We do not set a limit on presence in the core normal sample cohorts: any junction present at any coverage level in only one sample is counted as "in" these cohorts. This yields a more stringent filter on normality than that used by the TCGA splicing paper, which uses the term "neojunctions" to refer to junctions not found in tissuematched GTEx or TCGA normal samples, with a 10-read coverage requirement in TCGA, and allowing through the filter lowly expressed junctions in GTEx tissue-matched samples (Kahles et al., 2018).
We queried the junction database to extract junctions of interest, specifically 1) all junctions for all tumor samples of each cancer type and 2) all junctions not present in any core normal samples for each cancer type cohort, with their cohort prevalence levels. All junctions are presented in a 0-based closed coordinate system. Protein coding region presence was determined for all junctions, with location assessment as follows: the junction is categorized as protein-coding if it is present in a protein-coding gene region (with at least one junction splice site within the gene boundaries) and antisense if it is present on the reverse strand of a proteincoding gene region, based on gene regions described in GENCODE v.28 (Frankish et al., 2019). Cancerassociated genes were collected from the OncoKB and the COSMIC cancer gene census; any gene listed in one or both lists was categorized as a cancer-associated gene. Any junction assigned to a protein-coding gene region corresponding to one of these genes was categorized as associated with cancer-relevant loci.
For comparison between cancer-sample junctions found vs. not found in core normal samples, we performed a Kruskal-Wallis H-test to determine the significance of the decreased sharedness levels, since the junction prevalence data is not normally distributed and there are many fewer cancer-specific junctions than junctions found in core normal samples. Junctions not found in any GTEx or TCGA normal ("core normal") samples

3+
Junctions not found in any core normal samples or in selected SRA non-cancer samples Comparison with SRA tissue and cell types Non-cancer sample types from the SRA were chosen via manual curation informed by a clustering of junctions according to ontology term prevalence, with commonly occurring terms that do not meaningfully distinguish junctions eliminated. The selected sample types in Table S3 comprise all non-cancer data from the SRA analyzed. All junctions for samples associated with these cell and tissue types but not with "cancer" were downloaded via Snaptron, translated to a 0-based closed coordinate system, and compared with those found in TCGA cancer samples. Junctions present in a TCGA cancer-type cohort and SRA samples from a specific assigned category determined set assignments, which were used for subsequent data analysis. To exclude false positive junctions but allow for the existence of splicing noise, only junctions with at least two reads across GTEx, TCGA, and the selected cancer and non-cancer SRA samples are considered true junctions. All SRA junctions not found in TCGA cancer samples were ignored. For the supplementary 2-sample minimum filter analysis, we retained all junctions that are present in only 1 SRA sample, but required at least two samples across the broad SRA category (adult, developmental, or stem cell) for inclusion in that set. (For developmental subsets, only one sample within a subset category was required, as long as the 2-sample criterion across the full developmental category was met.) For comparison between TCGA cancer-sample junctions not found in core normal samples with SRA junctions from matched cancer type samples, we performed a Kruskal-Wallis H-test to determine the significance of the increased sharedness levels, since the junction prevalence data is not normally distributed and the difference in junction counts between the two cohorts (TCGA junctions in or not-in the SRA matched cohort) is large.
Splicing Factor Mutation Analysis Patient somatic mutation call files were downloaded from the GDAC firehose (http://gdac.broadinstitute.org/) and all silent mutations were removed. Using the remaining mutation calls, patients were classified based on two-different separation criteria: 1) whether or not they had at least one mutation in a gene that codes for a protein annotated as involved in mRNA splicing, based on the UniProt protein annotation database, and 2) whether or not they had at least 1 mutation in a gene previously identified as sQTL associated (U2AF1, SF3B1, TADA1, PPP2R1A, and/or IDH1) in the TCGA cohort by the TCGA splicing paper (Kahles et al., 2018). For each cancer type, and each stratification method, the number of cancer-specific junctions per patient was compared for patients with and without at least one mutation in the defined set ( Figures S1E and S1F). Differences in the number of novel junctions across cancer types and stratification groups was assessed via two-way ANOVA with a Benjamini-Hochberg p-value correction.
In addition to comparing the levels of cancer specific junctions between patients with and without splicing associated mutations, we also compared junction sharedness based on the same two stratification criteria used above. For each cancer type, all junctions identified in two or more patients were selected. For each, the number of junction occurrences in patients with mutations in splicing associated genes was calculated and compared to the overall number of occurrences in the corresponding cancer cohort, using a Fisher's exact test ( Figures S1G and S1H).

DATA AND SOFTWARE AVAILABILITY
All data is publicly available and accessible online. Python code and corresponding descriptors for the implementation of methods as described is publicly available on GitHub (https://github.com/JulianneDavid/shared-cancer-splicing).