Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector

Summary Here, we present FusionInspector for in silico characterization and interpretation of candidate fusion transcripts from RNA sequencing (RNA-seq) and exploration of their sequence and expression characteristics. We applied FusionInspector to thousands of tumor and normal transcriptomes and identified statistical and experimental features enriched among biologically impactful fusions. Through clustering and machine learning, we identified large collections of fusions potentially relevant to tumor and normal biological processes. We show that biologically relevant fusions are enriched for relatively high expression of the fusion transcript, imbalanced fusion allelic ratios, and canonical splicing patterns, and are deficient in sequence microhomologies between partner genes. We demonstrate that FusionInspector accurately validates fusion transcripts in silico and helps characterize numerous understudied fusions in tumor and normal tissue samples. FusionInspector is freely available as open source for screening, characterization, and visualization of candidate fusions via RNA-seq, and facilitates transparent explanation and interpretation of machine-learning predictions and their experimental sources.


Correspondence bhaas@broadinstitute.org
In brief Haas et al. present FusionInspector, a method for in silico characterization and interpretation of candidate fusion transcripts identified via RNA sequencing. Applied to thousands of tumor and normal transcriptomes, FusionInspector identified numerous likely biologically relevant fusions with potential roles in cancer or normal cellular biology.

INTRODUCTION
Gene fusions are intensely studied for their relevance to disease and normal cellular biology. In cancer, gene fusions typically result from chromosomal rearrangements, including well-known drivers of cancer, such as BCR::ABL1 in chronic myelogenous leukemia (CML), 1,2 TMPRSS2::ERG in prostate cancer, 3,4 and SS18::SSX1 or SS18::SSX2 in synovial sarcoma. 5,6 Charting the diversity of fusion transcripts present in tumor and normal tissue is important for our basic understanding of the complexity and biological function of the transcriptome in normal and disease states, molecular diagnostics of cancer patients, and neoantigen discovery for targeting in personalized immunotherapy with cancer vaccines or T cell therapy. 7,8 The structural rearrangements leading to gene fusions can be detected or inferred through whole-genome sequencing (WGS) or from the presence of ''fusion transcripts'' in whole-transcriptome RNA sequencing (RNA-seq). 9,10 Given the easier and less costly nature of RNA-seq compared with WGS, and the effective methods for transcript assembly, RNA-seq has emerged as a leading experimental method for fusion transcript discovery and detection in both cancer research and molecular diagnostics. Dozens of computational tools have been developed to mine fusion transcripts from RNA-seq data (as referenced in Haas et al. 11 ), and there have been multiple efforts to build catalogs of fusions across tumor and normal tissues. [12][13][14][15][16][17] In general, tumor-specific fusion transcripts are presumed to derive from chromosomal rearrangements, whereas fusions identified in normal samples are more likely to be derived from normal karyotypes, thus reflecting other underlying causes, such as read-through transcription and cis-or trans-spliced products.
Nevertheless, predicting fusions from RNA-seq data is challenging, and the various methods developed to predict fusion products from RNA-seq vary tremendously in their accuracy for fusion detection, leading to both false positives and false negatives. 11,18,19 False positives can be due to experimental artifacts that arise during reverse transcription or PCR amplification or computational mismapping of reads to target gene sequences, 20 as well as specific differences in prediction tools. Moreover, as sequencing depth increases, the probability of detecting rare reads that support a fusion transcript prediction increases, due to either lab artifacts or real, low-rate trans-splicing of questionable functional relevance. Thus, there is an urgent need to understand the features that drive fusion detection and to generate high-quality catalogs of well-supported fusions.
Here we describe FusionInspector (Figure 1), a method to assess and document the evidence for fusions. While other methods have previously been developed for fusion visualization, [21][22][23][24][25] FusionInspector includes modules for supervised detection of fusion transcripts and comparing fusion transcripts with corresponding unfused fusion partners, and aims to differentiate likely biologically relevant fusions from likely experimental or bioinformatic artifacts. FusionInspector reassesses the read alignment evidence supporting targeted candidate fusion transcripts, comparing the relative alignment evidence for a fusion transcript with counter-evidence for its unfused partner transcripts. FusionInspector further evaluates the fusion transcript breakpoints in relation to sequence features represen-tative of likely experimental and bioinformatic artifacts, 26,27 including canonical splicing sequences, reference exon gene structures, and regions of microhomology between partner genes. Through reports, interactive visualizations, and classification, FusionInspector aims to in silico assess fusions for their likely validity and assist researchers in reasoning about the quantity and quality of the evidence supporting predicted fusions, to differentiate likely artifacts from fusions with characteristics similar to biologically relevant fusions known to occur in tumor and normal tissues. While potential oncogenicity and structural rearrangements may often be implied by tumor vs. normal sample prevalence and chromosomal loci for fusion gene partners, respectively, FusionInspector makes no assertions regarding oncogenicity or mechanism (genome rearrangement vs. cis-or trans-splicing) by which biologically relevant fusions may be derived.
We applied FusionInspector to assess recurrently predicted fusions in tumor and normal tissues, gathering insights into fusion transcript diversity and devising machine-learning methods for fusion classification ( Figure S1A). We first applied FusionInspector to examine recurrently predicted fusions in tumor and normal tissues to compute sequence and expression features, from which we next generated clusters of fusion variants with similar characteristics. We then identified fusion clusters that are enriched for known biologically relevant fusions, and others as likely artifacts. From these fusion clusters, we trained a classifier to automatically predict fusion instances as likely biologically relevant, likely artifactual, or other type. Finally, we applied FusionInspector on additional predicted instances of fusion transcripts that were clustered with biologically relevant fusions and explored, classified, and prioritized these transcripts to identify additional relevant candidates in tumor and normal transcriptomes. Our application discovered a cluster of fusion transcripts heavily enriched for known cancer fusions and other fusions of interest. FusionInspector is freely available as open-source software at https://github.com/FusionInspector/FusionInspector/wiki.

RESULTS
Development of FusionInspector for in silico evaluation of predicted fusion transcripts FusionInspector ( Figure 1) performs a supervised in silico evaluation of a specified set of candidate fusions, predicted from either RNA-seq data or from a user-defined panel. FusionInspector captures all read alignments in the RNA-seq that provide evidence for the specified fusions or for the unfused partner genes, and further explores the candidate fusion genes for regions of microhomology (defined as short identical sequence matches of length k [here k = 10]), and the proximity of microhomologies to putative fusion breakpoints.
To capture evidence supporting candidate fusions, FusionInspector identifies those reads that align concordantly between fusion genes as juxtaposed in their fused orientation and provide concordant alignments that span the two genes in this rearranged context. There are two types of fusion-supporting alignments: (1) split reads that define the fusion breakpoint, and (2) spanning fragments, where each paired-end read aligns to an opposite partner gene and the fragment bridges the fusion  Figure 1). FusionInspector leverages the STAR aligner, 28 which we enhanced here to support FusionInspector's mode of action. As input to STAR, we provide the entire reference genome along with a set of fusion contigs constructed by FusionInspector (based on the list of specified fusion candidates), and STAR aligns reads to the combined genome targets and reports those aligned to the fusion contigs for further evaluation by FusionInspector (STAR Methods).
Next, FusionInspector computes several features that are associated with the fusion based on these alignments to assist Top: lists of fusion candidates derived from predictions of one or multiple fusion detection methods, or from a screening panel, are provided to FusionInspector as input along with RNA-seq in fastq format. For each candidate fusion gene, fusion contigs are generated by fusing the full-length gene candidates as collinear on a single contig. Intronic regions are by default each shrunk to 1 kb. RNA-seq reads are then aligned to a reference consisting of the entire genome supplemented with fusion contigs. Fusion-derived reads that would normally align discordantly as chimeric alignments in the reference genome (top example) instead align concordantly in the fusion contig context (bottom example). Middle: FusionInspector identifies split-read alignments (light blue) and spanning pairs (purple) supporting the gene fusion in addition to read alignments that overlap the breakpoints and instead support the unfused fusion partners (fusion counter-reads). Bottom: for those fusions where FusionInspector captures RNA-seq read support (''in silico validation''), it reports fusion sequence and expression characteristics, interactive visualizations ( Figure S1B), and predicted classifications as COSMIC-like, potential artifact, or other category (STAR Methods). Article ll OPEN ACCESS in their evaluation. First, it uses the normalized number of reads exclusively supporting each fusion variant as a proxy for the expression of the fusion transcript (similarly, read alignments overlapping the fusion breakpoint and exclusively supporting the unfused partner genes are a proxy for the expression levels of the unfused partner genes). Second, it computes the fusion allelic ratio (FAR) for the fusion with respect to each (5 0 or 3 0 ) partner transcript (5 0 -FAR and 3 0 -FAR) as the ratio of mutually exclusive reads supporting the fusion variant vs. each unfused partner gene (Figures 1 and S1C). Third, it examines fusion breakpoints inferred from the read alignments for canonical dinucleotide splice sites at boundaries of the breakpoints in each partner gene and for agreement with available reference gene structure annotations. When there is evidence that supports multiple fusion transcript variants for a given fusion gene occurrence, FusionInspector uses an expectation-maximization-based algorithm to fractionally assign mutually compatible spanning fragments to the corresponding variants (STAR Methods). It then filters fusion variant candidates according to minimum evidence requirements (default settings require at least one split read to define the junction breakpoint, and at least 25 aligned bases supported by at least one read on both sides of the fusion breakpoint; STAR Methods). Finally, it captures microhomologies between putative fusion genes and determines the proximity of a fusion breakpoint to the nearest site of microhomology.
For clarity, we will use the following definitions when reporting on fusions cataloged among samples. ''Fusion'' refers to two genes, geneA and geneB, where there exists a fusion transcript involving exons from geneA and geneB, ordered respectively, and written as geneA::geneB according to the Human Genome Organization's (HUGO) Gene Nomenclature Committee (HGNC) nomenclature for fusion genes. 29 A ''fusion occurrence'' corresponds to a fusion identified in an individual sample, defined by the tuple (fusion, sample). A ''fusion transcript'' or ''fusion variant'' (used interchangeably) represents a fusion occurrence with a unique breakpoint (e.g., splicing isoform), defined by the unique tuple (fusion, sample, breakpoint coordinates). The primary fusion variant is defined as the variant with the highest expression level within an occurrence. Fusion occurrences and primary fusion variants are considered equivalent, and these terms are used interchangeably.

Assessment of fusions using FusionInspector
By evaluating alignments to the modeled fusion contigs, FusionInspector demonstrates high sensitivity and specificity in supervised fusion detection, as we show for validated fusions in four well-studied breast cancer cell lines: BT474, KPL4, MCF7, and SKBR3. In our earlier study, 46 experimentally validated fusions were correctly predicted by at least one of 24 different fusion prediction methods. 11 When FusionInspector was applied to the same data, it correctly identified each of these fusions in the relevant sample ( Figure 2 and Table S1), with no false positives.
Using sequence and expression attributes of fusions and known characteristics of biologically relevant fusions (see below), FusionInspector further predicts whether each in silico validated fusion variant candidate (i.e., a fusion candidate evaluated by FusionInspector as having fusion read support evidence) is likely to be biologically relevant or alternatively has features consistent with experimental or bioinformatic fusion artifacts.
We illustrate these features in the context of two contrasting examples of fusion types ( Figure 3). Fusion EML4:ALK, a known cancer driver prevalent in lung adenocarcinoma, 30,31 has evidence of multiple transcript variant structures, and while microhomologies are found between the EML4 and ALK genes they tend to be distal from the fusion variant breakpoints ( Figure 3A). The EML4::ALK fusion variant breakpoints are all found at consensus dinucleotide splice sites that coincide with exon boundaries of reference gene structure annotations. In contrast, FusionInspector captures many reads supporting a putative fusion KRT13::KRT4, but the breakpoints inferred from splitread alignments mostly have non-consensus dinucleotide splice sites and coincide with sites of microhomology; additionally, split reads with consensus dinucleotide splice sites mostly do not coincide with reference exon boundaries ( Figure 3B). Because KRT13 and KRT4 are only distantly related with no easily detected nucleotide-level sequence conservation, their fusion may not be discarded by many fusion transcript predictors. However, given that most fusion evidence coincided with sites of microhomology and the lack of consensus splicing at breakpoints, FusionInspector infers most putative KRT13::KRT4 fusion variants to be artifactual. Another particularly compelling example of a similarly misleading and likely artifactual fusion is COL1A1::FN1, which is detected as prevalent in cancer-associated fibroblast cell lines ( Figure S2). Further consideration of fusion and partner gene expression levels can aid in evaluating and prioritizing fusion candidates for further study, as we pursue below.

Clustering of recurrent fusion transcripts via FusionInspector attributes resolves COSMIC-like and artifact-like fusions
We first applied FusionInspector to examine recurrent fusion transcripts based on poly(A)-stranded RNA-seq from tumors (from The Cancer Genome Atlas [TCGA] 32 ) and corresponding healthy tissue (from TCGA and Genotype-Tissue Expression [GTEx] 33 ). To generate an initial comprehensive catalog of fusion variants in tumor and normal tissues, we predicted fusion transcripts with STAR-Fusion (v1.7) across 9,426 tumor and 707 normal samples from TCGA, and 8,375 normal samples from GTEx (Table S2). We initially applied lenient fusion evidence requirements to maximize sensitivity (STAR Methods). As a result, putative fusion transcripts were detected in nearly all tumor and normal samples. After applying a minimum expression level threshold (0.1 fusion fragments per million [FFPM]), we detected a significantly higher number of fusion occurrences in tumors vs. paired normal samples in several TCGA tumor types ( Figure S3A), although there were similar median numbers of predicted fusion occurrences per sample type in TCGA tumor and GTEx normal samples (t test, p = 0.5, Figure S3B). We readily identified known cancer fusions included in the COSMIC fusion collection 34,35 (''COSMIC fusions,'' Figure 4A) according to known disease associations and prevalence, such as TMPRSS2::ERG identified in roughly half of prostate cancers, 3 FGFR3::TACC3 in glioblastoma, 36 and PML::RARA in the acute promyelocytic leukemia subtype of acute myeloid leukemia (AML). 37 COSMIC fusions were more highly expressed than most predicted fusions, which had low estimated expression levels and few supporting reads ( Figures 4B-4D).
Next, we used FusionInspector to examine the sequence and expression features of fusion transcripts that were recurrently detected across tumor and/or normal samples in order to distinguish biologically impactful fusions (akin to the COSMIC fusions) from experimental or computational artifacts, or from low levels of cis-or trans-splicing from highly expressed genes. To this end, we analyzed all 53,240 predicted fusion variants (38,591 fusion occurrences plus 14,649 alternative fusion variants) from 628 TCGA and 530 GTEx representative samples (STAR Methods). For each fusion candidate, FusionInspector identified the number of reads supporting the fusion variant and those supporting the unfused partner genes at putative breakpoints, identified regions of microhomology between partner genes, and determined the following features: inferred fusion expression level (FFPM), 5 0 and 3 0 fusion allelic ratios (5 0 -FAR, 3 0 -FAR), 5 0 and 3 0 unfused gene expression levels (5 0 -counter-FFPM and 3 0 -counter-FFPM), presence of consensus vs. non-consensus dinucleotide splice sites at fusion breakpoints, agreement or disagreement with reference gene structure exon boundaries at splice junctions, number of microhomologies observed between the two partner genes, and the distance of each inferred fusion breakpoint to the nearest site of microhomology.
To distinguish fusion artifacts from those with features consistent with biologically impactful fusions, we clustered fusion   Table S3; STAR Methods). Clustering produced 61 high-granularity clusters, each containing an abundance of fusions identified in both TCGA and GTEx samples and distinguishable by feature attributes (Figures S5A-S5L). We further grouped these fusion variant clusters by hierarchical clustering according to median fusion attribute values in each fine cluster ( Figure 5B). We then focused on examining clusters enriched for COSMIC fusions as a proxy for biologically impactful fusions, exploring those fusions with sequence and expression characteristics similar to those of COSMIC fusions, irrespective of potential genome rearrangement characteristics or known oncogenic roles.
One fusion cluster (C4) was significantly enriched with COSMIC fusions, harboring 57% of our detected instances of COSMIC fusion occurrences among these samples, but only 4% of all fusion occurrences (p < 10 À90 , Fisher's exact one-sided test) ( Figure 5B). Fusion variants in this cluster had splice  47  41  20  27  17  14  57  56  60  49  15  31  25  10  61  33  24  22  30  18  44  32  35  40  26  29  19  34  28  53  5  52  59  58  8  45  42  3  7  16  4  51  2  21  48  13  6  23  12  11  54  1  9  43  46  36  39  55  37    Conversely, other fusion clusters, spanning 3% of all fusion occurrences and no COSMIC fusion occurrences, had features indicative of experimental or computational artifacts, especially enrichment in microhomologies that could confound alignment or contribute to reverse transcriptase mispriming. We thus consider those fusion variants as putative artifacts. Of the 3% of all fusion occurrences encompassed by these clusters, twothirds (2% of all fusion occurrences) had moderately to highly expressed partner genes, suggesting origination from reverse transcriptase mispriming, and the remainder had little evidence for partner gene expression, suggesting read misalignment artifacts. The low portion of such presumed artifacts is a testament to STAR-Fusion's rigorous filtering. 11 A further 1% of fusions involved highly expressed partner genes, where the detected fusion represented a small fraction of the total expression from these loci. These fusions may result from low levels of cis-or trans-splicing from the highly expressed partner genes.
A fusion classifier allows targeted screening of predicted COSMIC-like fusions We reasoned that the set of 1,511 predicted fusion occurrences (835 distinct fusions) that were members of the COSMIC-enriched cluster C4 are likely enriched for fusions of functional significance and should be prioritized for further study. Some are already known to be relevant to cancer but not yet included in the COSMIC database, such as EGFR::SEPT14, 39 , PVT1::MYC, [40][41][42] and TPM3::NTRK1. 43 Others are reciprocal fusions for COSMIC fusions that could result from balanced translocations, including reciprocal ABL1::BCR1 of COSMIC BCR1::ABL1, BRAF::SND1 of COSMIC SND1::BRAF, and PPARG::PAX8 of COSMIC PAX8::PPARG. This fusion cluster is also enriched for fusions exclusively identified in pancreatic tissue (explored below).
To gain further insights into the characteristics of the COSMIClike fusions in C4, we first screened additional TCGA and GTEx samples to characterize additional occurrences of C4 representative fusions. We refocused FusionInspector on 236 C4+ key fusions (231 C4 recurrent fusion gene pairs including 26 recurrent COSMIC fusions, plus another five recurrent COSMIC fusions having all occurrences localized outside of C4; STAR Methods and Table S4 , Table S4), and ranked fusions by the difference in their initial detected prevalence in tumors vs. normal samples.
To determine whether additional occurrences of these fusions have characteristics consistent with COSMIC-like C4 fusions, are artifact-like, or belong to another category, we trained a random forest classifier to predict the labels of each of the earlier-defined 61 Leiden clusters and applied it to predict the cluster labels of fusion variants examined in this expanded targeted survey. We then categorized each fusion occurrence by the overall category of the hierarchical cluster to which the Leiden cluster label it was classified to belongs (e.g., ''COSMIC-like'' for C4 predictions) ( Figure 6A and STAR Methods). Evaluation of this random forest classifier using 5-fold cross-validation and application to biological replicates demonstrates high prediction accuracy, with most clusters yielding >90% prediction accuracy and mispredicted fusion variants or biological replicate fusion variants assigned to fusion clusters with highly similar features, effectively discriminating artifact-like fusions from others (Figure S7). This fusion classification was further incorporated into FusionInspector for routine application in discriminating COSMIC-like fusions from artifacts or other types.
FusionInspector-screened occurrences of known COSMIC fusions were mostly tumor enriched with few to no normal samples identified with evidence (noting that all 1,009 GTEx normal samples were screened by FusionInspector for an identical list of COSMIC fusions). All but 31 of the 236 fusions had occurrences classified as C4. Only seven fusions had at least 10% of their occurrences predicted to clusters with high counter-evidence (C49 or C51), suggesting the fusion transcripts may reflect low levels of cis-or trans-splicing of more highly expressed normal fusion partner genes. Only nine fusions had any occurrences predicted as artifacts, found in both TCGA and GTEx as pancreas-specific fusions (further discussed below).
This analysis highlighted intriguing, well-supported fusions for further study. For example, while the top-ranked tumor-enriched fusion, FGFR3::TACC3 (rank 1, 70 tumor samples, 0 normal) is a known oncogenic driver, 36 other top-ranking fusions, such as CCAT1:CASC8 (rank 2, 42 tumors-mostly lung and stomach cancers, 0 normal) and VCL::ADK (rank 3, 36 tumors-also mostly lung and stomach cancers, 0 normal) have not yet been extensively studied. CCAT1::CASC8 was only recently reported in the fusion catalog generated by DEEPEST fusion, 17 and VCL::ADK was only previously reported in a study of cancer cell lines. 12 Because our initial selection of samples for FusionInspector exploration of recurrent fusion transcripts did not strictly require COSMIC fusions, only 31 of the 58 known COSMIC fusions initially discovered by our comprehensive survey with STAR-Fusion were further characterized above. We thus further applied FusionInspector to 46 additional samples with instances of the remaining 27 COSMIC fusions. Over 80% of these COSMIC fusion occurrences were predicted to other COSMIC fusion-containing clusters, none as likely artifacts, and 11 (41%) of these had occurrences predicted to the COSMIC-enriched cluster C4 (Figure S8A and Table S5) Some fusion transcripts are prevalent in normal tissues and may not be oncogenic Approximately one-third (86) of the 236 C4+ targeted fusions in our analysis were robustly detected in normal tissues (found in at least five normal samples); these may not be particularly relevant to cancer biology but may play a role in normal biological processes. Of these, 61 fusions are broadly expressed across at least five tissues, involve intrachromosomal pairs of genes, and can be largely explained by read-through transcription, local rearrangements, or trans-splicing of neighboring transcripts.
Some other putative fusions that are prevalent in normal tissues may in fact represent normal structural variation in the Article ll OPEN ACCESS human genome, which is not accounted for when performing read alignment to a single human reference. For example, Fusion KANSL1::ARL17, which would require a local rearrangement in the human reference genome, is prevalent across both tumor and normal tissues (median of 31% of individuals, Figure S8B), and is known to correspond to a common haplotype involving a locally rearranged genomic region observed in populations of European descent. 45 An earlier report identified KANSL1::ARL17 in diverse tumor samples and proposed that it may be a cancer predisposition germline fusion specific to Europeans. 46 Note, however, that no specific human genetic association evidence was shown for predisposition thus far, and we observe slightly higher prevalence of KANSL::ARL17 among GTEx normal samples than in tumors from TCGA ( Figure S8B). Another normal fusion due to a rarer germline structural variation is TFG::GPR128, previously associated with a copy number variation and a haplotype frequency estimated at around 2% of individuals of European descent. 47 Consistently, we find TFG::GPR128 broadly expressed across tumor and normal tissues and represented similarly at a median of 2% of all tissues examined ( Figure S8C). As more evidence of common structural variation becomes available, other prevalent fusions found in normal tissues may be more easily explained.
Another set of fusions that are less easily explained involve those we found only in normal pancreas and pancreatic carcinoma (Figures 6 and S6), involving various pairwise combination of CPA1, CPA2, CLPS, CELA2A, CELA3A, CTRB1, CTRB2, and CTRC (e.g., CELA3A::CPA2, CELA3B::CELA2A, and CELA3A::CELA2A) fused to generate in-frame fusion products. These genes are among the highest expressed in pancreas and mostly reside on different chromosomes, suggesting that trans-splicing may be the predominant underlying mechanism. While the initially detected transcripts were within the COSMIC-peak-enriched fusion cluster, the random-forest-based fusion classifier did not predict the additional instances as COSMIC-like, and some of them (e.g., CELA3A::CTRC) have high fractions of occurrences predicted to ''high-counter-evidence'' clusters or ''artifact-like'' types ( Figures 6 and S6E; Table S4).
Some well-established oncogenic fusions are also reliably detected in normal samples Several of the COSMIC fusions or other tumor-enriched fusions with known ties to cancer were surprisingly identified in both tumor and normal samples. For example, the prostate cancer fusion TMPRSS2::ERG, identified as our fourth most tumor-enriched fusion (182 of 465 TCGA prostate tumor samples), is also detected in six normal prostate samples (five TCGA and one GTEx) ( Figure S6A). Each of the five TCGA prostate normal samples containing TMPRSS2::ERG were found with TMPRSS2::ERG in their matching prostate tumor samples, likely due to tumor-innormal contamination. TMPRSS2::ERG was only identified by FusionInspector in prostate tumors or normal prostate, reflecting both this fusion's high tissue specificity and FusionInspector's high specificity of fusion calling.
In another example, COSMIC fusion PVT1:MYC was originally identified by STAR-Fusion in 20 samples (14 TCGA tumor, 1 TCGA normal, and 5 GTEx samples). Interactions between PVT1 and MYC including their fusion are well-known contributors to tumorigenesis. [40][41][42] Through subsequent screening of PVT1::MYC with FusionInspector, we identify a total of 32 samples (+9 TCGA tumor, +1 TCGA normal, and +3, À1 GTEx). Most (21/32) are expressed at low levels (below 0.1 FFPM), and we do not find strong evidence for expression to be generally higher in tumor samples than in normals (p < 0.07, Wilcoxon rank-sum test). However, 5 of the 32 PVT1::MYC occurrences were identified in cervical cancer tumors, and all were significantly more highly expressed than the other samples (p < 0.001), with the most highly expressed at 19 FFPM ( Figure S8D). PVT1 and MYC are co-localized to a proximal region in the bottom arm of chromosome 8, and a PVT1::MYC fusion would likely involve local restructuring at the locus in tumors to generate the fusion product. Interestingly, this chromosome 8 region is a known hotspot for insertion of human papilloma virus (HPV), 48 the leading cause of cervical cancer (>90% of cases). Most of the TCGA cervical cancer samples we identified with PVT1::MYC have HPV insertions at this hotspot (see Table S3 of the study by TCGA Research Network 48 ). Thus, we hypothesize that HPV insertion contributes to the formation of the PVT1::MYC fusions. We did not find evidence for HPV insertion in the breast cancer sample with similar levels of PVT1::MYC expression (data not shown).
COSMIC fusion VTI1A::TCF7L2, originally identified as an oncogenic fusion in colorectal cancer, 49 was most abundant in stomach, colon, and esophageal carcinoma samples but was also detected in seven individual GTEx normal samples (brain, whole blood, tibial nerve, tibial artery, prostate, and breast) ( Table S4). While VTI1A::TCF7L2 was not enriched for detection in tumors vs. normal, only those fusions in colon cancer were highly expressed (>0.15 FFPM), whereas other tumor and normal instances were lowly expressed (<0.05 FFPM; many at the limit of detection; Figure S8E), supported by a single split read defining the fusion breakpoint (Table S4). This could be consistent with a very low proportion of cells in the normal tissue expressing the fusion, compared with a large clone in the tumor.
Interestingly, COSMIC fusion BCR::ABL1 was not tumor enriched in our analysis, likely due to paucity of the relevant tumors in TCGA. In particular, BCR::ABL1 occurs in >95% of CML cases, 1,2 but TCGA lacks CML samples. Indeed, the four TCGA tumors with BCR::ABL1 likely correspond to a subtype of AML defined with this fusion. 50 Three of these AML tumors also have evidence of the reciprocal ABL1::BCR fusion, and the oncogenic BCR::ABL1 is expressed at higher levels than the reciprocal counterpart in each sample. Interestingly, we detected eight instances of the oncogenic BCR::ABL1 fusion (and none of the reciprocal) in six different GTEx normal tissues (one each of adipose, breast, nerve, prostate, and thyroid, and two pancreas) and one TCGA normal kidney sample. We observed no sequence or expression features distinguishing these fusions from those we identified in AML, and fusion transcript breakpoints for those in GTEx normal tissues are identical to those found in AML. In general, when we find COSMIC fusions in GTEx normal samples, they are found at low frequencies (<1% prevalence in a tissue type).
We examined C21 TARGET fusions to identify other fusions potentially relevant to pediatric cancer. Of 910 fusions, 108 recurrent fusion candidates were present in at least three samples and absent from normal samples ( Figure 7B and Table S6). These include fusions not yet reported in the COSMIC oncogenic fusion database but otherwise known as drivers or highly relevant to pediatric oncology, 51,52 such as TCF3::HLF, 53-55 ZNF384::EP300, 56 and CBFB::MYH11. 57 Many of the 108 fusions (51 fusions = 47%) involve a known ALL driver as one or both partners, 52 such as KMT2A (eight fusions); PAX5, ZNF384, and MLLT10 (five fusions each); and NUP98, TCF3, and ETV6 (four fusions each). Most driver-gene-containing fusions (28/51 = 55%) involve reciprocal fusion arrangements. Other fusion partners frequently found among this set that may be potentially relevant to pediatric cancer include PTMA (seven fusions), EIF4A1 and the EIF4A1 locusderived RP11-186B7.4 (seven unique fusions), and SEPT14 (three fusions).
While our version of FusionInspector was not compatible with exploration of immunoglobulin heavy chain gene (IGH) fusions given the large IGH locus size (1.3 Mb vs. our 100 kb maximum intron length; STAR Methods), 12 (1.8%) TARGET ALL samples had evidence of IGH::CRLF2 fusions, a common oncogenic IGH fusion. 58 We also found the oncogenic ALL fusion P2RY8::CRLF2 59 in 25 (3.8%) ALL participants. These were members of C10 TARGET , likely due to consensus splice breakpoints that did not coincide with reference gene structure exon boundaries, highlighting the relevance of other clusters as sources of biologically relevant fusions with alternative non-artifactual attribute profiles.

DISCUSSION
We developed FusionInspector to enable exploration of the evidence supporting candidate fusions, flag likely artifacts, and identify those fusions with sequence and expression features similar to known biologically relevant fusion transcripts irrespective of oncogenicity or potential genome structural rearrangements. Given a list of candidate fusions, FusionInspector captures RNA-seq read alignments that support either the fused genes or the unfused partner genes. From the fusion and partner gene expression evidence coupled with sequence features relating to the fusion breakpoint, FusionInspector helps the user to reason about the nature and quality of any target fusion transcript.
Clustering TCGA and GTEx fusions by shared sequence and expression features identified a cluster of fusions highly enriched for COSMIC fusions. Fusions in the COSMIC-enriched cluster had relatively high fusion expression with 3 0 -FAR generally exceeding 5 0 -FAR, suggesting oncogenic activity from the 3 0 -fused transcript. Analysis initiated by fusions in the COSMICenriched cluster highlighted several putative or less appreciated oncogenic fusions, including CCAT1::CASC8 and VCL::ADK, based on their feature similarity to other well-known tumor-enriched fusions. Only $3% of initially predicted fusions were members of clusters likely enriched for artifacts based on features such as high partner gene expression or sites of microhomology at or near the fusion breakpoint. The low artifact rate is likely due to the strong filtering of the initial input catalog from STAR-Fusion.
We obtained similar results when applying FusionInspector to the TARGET pediatric cancer cohorts, with a single COSMIC-enriched cluster of fusion occurrences with relatively high fusion expression and 3 0 -FAR, containing known cancer drivers and yielding a source of additional potential fusion genes for future study. These additional fusions included gene partners PTMA, EIF4A1, and SEPT14, each with known roles in cancer (e.g., Frattini et al., 39,60,61 Kumar et al.,39,60,61 and Wu et al. 39,60,61 ) but yet to be further characterized as pediatric cancer fusion transcripts.
While we focused on fusions identified in COSMIC-enriched clusters, other fusion clusters also harbor important oncogenic fusions. For example, the COSMIC fusions SS18::SSX1 and SS18::SSX2, known drivers of synovial sarcoma, 5,62 are in other clusters (C38 and C39), due in part to their higher 5 0 -FAR. Another fusion of interest, FSIP1::RP11-624L4.1, is present in 240 (22%) of breast tumors analyzed and in 16 normal breast tissues samples, where it is expressed at significantly lower levels in normal tissues ( Figure S10). While the individual fusion partners have cancer associations, 63,64 any role for this likely read-through/cis-spliced fusion transcript deserves consideration in further exploring the roles of both genes in disease.
In some cases, fusions that were found in both tumor and normal tissues might reflect a low level of oncogenic events. For example, hallmark driver fusions including TMPRSS2::ERG and BCR::ABL1 were also detected in GTEx normal tissue samples, which may reflect a low proportion of premalignant or transformed cells. [65][66][67] We also detected COSMIC fusion VTI1A::TCF7L2 across multiple tumor and normal tissue types (consistent with Nome et al. 68 ), but only highly expressed in colon cancer samples where it is a postulated oncogenic driver. 49,69 Whether such a fusion could contribute to tumorigenesis in a different tissue with different cellular circuitry remains unknown. While we are intrigued by evidence of oncogenic fusions in normal tissue RNA-seq data, we have not ruled out explanations such as tumor-in-normal contamination 70 or potential cross-sample sequencing read contamination. 71 Fusions that were prevalent among normal tissues can mostly be explained by read-through transcription and cis-splicing of co-linear genes, but some may simply reflect natural germline structural variations that may exist in the population. With ongoing advancements in methods for detecting and cataloging of structural variants, 72,73 we may soon better understand the structural basis for many naturally occurring fusion transcripts. Access to matched RNA-seq and WGS of the same samples across individuals would greatly facilitate such efforts.
Pancreas stood out as a clear outlier among all normal tissues explored for fusions. While we suspect some of the putative fusions detected in pancreas are derived from reverse transcription or alignment artifacts, several did have features consistent with trans-splicing of highly expressed partner genes, with trans-spliced products yielding in-frame proteins. In general, these fusion occurrences do not have COSMIC-like sequence and expression features. Trans-spliced in-frame fusion transcripts have the potential to expand functional diversity from our otherwise linear genomes, 74 and even if these pancreas-specific candidates failed to ultimately reach our COSMIC-like prioritization status, they may be worth additional studies.
FusionInspector opens the way to further explore the biological impact of the predicted fusions and the tissues and gene expression networks in which they are phenotypically relevant. FusionInspector helps illuminate the evidence supporting fusions in RNA-seq or to screen for relevant fusions sensitively and accurately in samples of interest. While expression attribute analysis and fusion class prediction by FusionInspector should require poly(A)-stranded RNA-seq, the supervised fusion variant identification could leverage alternative RNA-seq methodologies.
Because short reads remain limited in their capacity to represent full-length fusion transcripts, FusionInspector further integrates Trinity 75,76 for de novo reconstruction to optionally reconstruct more full-length fusion transcripts from RNA-seq data aligned to each fusion contig. FusionInspector is available as a stand-alone application for screening lists of candidate fusion transcripts and is also incorporated into STAR-Fusion for in silico validation or visualization of STAR-Fusion-predicted fusion transcripts. This facilitates analysis of fusions from both bulk and single-cell RNAseq, as we have recently demonstrated. 77 Long-read transcriptome sequencing may eventually obviate short-read sequencing for fusion detection, thus removing the need for de novo reconstruction of full-length fusion transcripts. 78,79 Full-length single-molecule direct RNA-seq 80 should also avoid reverse transcription amplification artifacts. Conversely, other features scored by FusionInspector, such as expression characteristics of fusion transcripts with respect to partner genes, will remain relevant and easily adapted for long-read RNA-seq.
Machine learning is likely to play an increasingly important role in biomedical science and its clinical applications. In this paper we emphasize an important companion direction to machine learning, namely generating transparent and interpretable predictions, loosely referred to as explanations. The area of explanations and causal interpretation is growing rapidly in artificial intelligence. 81 We need to keep a reproducible trace of facts, predictions, and hypotheses from gene to function in the era of big data.
We hope that the practical applicability of FusionInspector will help drive transparency and other explanatory efforts in predictive areas in genomics and personalized medicine more generally, including screening and reassessment of evidence supporting fusion predictions, and visualization of the evidence via interactive reports (Figures 1 and S1B). FusionInspector can be easily leveraged as an add-on component to any fusion transcript prediction pipeline, and is directly incorporated into STAR-Fusion to facilitate execution as part of the Trinity Cancer Transcriptome Analysis toolkit. 82 The FusionInspector software is freely available as open source on GitHub (key resources table), provided in container form via Docker and Singularity, and accessible on the Terra cloud computing framework for secure and scalable application across large compendia of sample collections or patient-derived RNA-seq data. Limitations of the study FusionInspector as implemented here requires read alignment overlap with exons of reference gene structure annotations, and hence fusions restricted to intronic regions of either fusion partner or neighboring intergenic regions presently go undetected. Because genome-level fusion events often occur outside of exons, FusionInspector is largely ineffective with whole-genome or whole-exome sequencing data, instead necessitating RNA-seq.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
We thank Joshua Gould for assisting with software pipeline development and Terra integration, and Leslie Gaffney for help with generating publication quality figures. Special thanks to Jinghui Zhang at St. Jude for insights and assistance in navigating the TARGET data. This work has been supported by Na- The results published here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/ccg/ research/genome-sequencing/tcga. TCGA v11 data were obtained through Terra/AnVIL via dbGaP accession number phs000178.v11.p8. The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The GTEx v8 data used for the analyses described in this paper were obtained from Terra/AnVIL via dbGaP accession number phs000424.v8.p2. The results published here are in whole or part based upon data generated by the Therapeutically Applicable Research to Generate Effective Treatments (https://www.cancer.gov/ccg/ research/genome-sequencing/target) initiative, phs000218. The data used for this analysis are available at https://portal.gdc.cancer.gov/projects.

DECLARATION OF INTERESTS
A.R. is a co-founder and equity holder of Celsius Therapeutics, an equity holder in Immunitas, and was a scientific advisory board member of Thermo Fisher Scientific, Syros Pharmaceuticals, Neogene Therapeutics, and Asimov until July 31, 2020. From August 1, 2020, A.R. has been an employee of Genentech. M.G. is a current employee and stockholder at Monte Rosa Therapeutics.
The parameters used here eliminated any filtering of fusions according to fusion expression levels or based on fusion annotations, to retain any fusions known to frequently occur in normal samples in both the normal and the tumor samples for further study. All STAR-Fusion predictions are provided in Table S2.

FusionInspector method and implementation
FusionInspector takes as input a list of candidate fusions and RNA-seq files in fastq format, with each fusion formatted as "geneA:: geneB" indicating a candidate fusion between geneA (5') and geneB (3). Leveraging the companion CTAT genome library set of genomic resources (identical to that used with STAR-Fusion, including the human reference genome, gene structure annotations, and STAR genome index), FusionInspector constructs fusion contigs by extracting the genomic sequences for each geneA and geneB, and concatenating each geneA and geneB pair into a single contig in collinear transcribed orientation. Gene structure annotations for fusion genes are similarly restructured to match the position and orientation of the corresponding genes in the fusion contigs. By default, long introns are shrunk to 1 kb in length by removing central regions of intron sequences, reducing the alignment search space and simplifying downstream visualizations.
RNA-seq reads are aligned to the fusion contigs along with the whole reference genome by running STAR with both inputs, including the pre-indexed whole genome and a fasta file containing the fusion contigs. STAR first loads the whole reference genome index into RAM, then builds an index for the fusion contigs, and incorporates the fusion contig index into the whole genome index. Only those reads that align concordantly to the fusion contigs, while considering all alignments to the combined targets, are reported. Note, that in the fusion context, all fusion-supporting reads are aligned concordantly, but will align partially to one gene and partially to the adjacent gene. This functionality was implemented in STAR since version 2.5.0a to support FusionInspector functionality. STAR-Fusion directly executes STAR to align reads like so " STAR -runThreadN 4 -genomeDir ctat_genome_lib_build_dir/ref_genome.fa.star.idx -outSAMtype BAM SortedByCoordinate -twopassMode Basic -alignSJDBoverhangMin 10 -genomeSuffixLengthMax 10000 -limitBAMsortRAM 47271261705 -alignInsertionFlush Right -alignMatesGapMax 100000 -alignIntronMax 100000 -readFilesIn reads_1.fastq.gz reads_2.fastq.gz -genomeFastaFiles finspector.fa -outSAMfilter KeepAllAddedReferences -sjdbGTFfile finspector.gtf -alignSJstitchMismatchNmax 5 -1 5 5 -scoreGapNoncan À6 -readFilesCommand 'gunzip -c' ", where 'finspector.fa' and 'finspector.gtf' correspond to the fusion contigs sequence and structure annotation files.
FusionInspector examines the aligned reads output by STAR and identifies read alignments supporting fusions between gene pairs represented by the fusion contigs. Candidate fusion breakpoints are identified by split read alignments having partial alignments that anchor to exons of the neighboring fusion genes. Spanning fragments are identified as paired-end reads having each read mapping entirely on opposite sides of the breakpoint. Alignments must meet minimum criteria to be counted as evidence, requiring at least 98% sequence identity and no more than 10 bases unaligned at their ends (soft-or hard-clipped bases). For split reads, at least 10 bases must align adjacent to each breakpoint (anchor), and each anchor region must have sufficient sequence complexity, requiring entropy R 1.2. For spanning fragments, each paired-end read sequence must have sufficient complexity, requiring entropy R 1.2. Preliminary fusion predictions are defined based on candidate fusion breakpoints and sets of compatible spanning fragments. RNA-seq fragments that span a candidate breakpoint but support transcription from an unfused partner gene are captured, stored as counter-evidence, and used to compute the partner gene counter FFPM and fusion allelic ratio.
There is often evidence for multiple fusion variants, and while the split reads are unique to and define each breakpoint, the spanning fragments are often compatible with multiple breakpoints and assigned ambiguously. We implemented an expectation maximization (EM) algorithm based on that described in kallisto 83 to fractionally assign RNA-seq evidence fragments to fusion variants according to maximum likelihood. Fusion expression values (FFPM) are then computed based on estimated RNA-seq fragment counts resulting from the EM.
Fusion candidates are then filtered according to defined minimum evidence requirement, with defaults set as requiring at least one split read to define the junction breakpoint, and at least 25 aligned bases supported by at least one read on both sides of the fusion breakpoint (termed 'long double anchor support' or 'LDAS'). If the breakpoint involves non-consensus dinucleotide splice sites, then at least three split reads are required to support the breakpoint. A final filter of fusion predictions to exclude those containing overly promiscuous fusion partners (maximum 10) or those involving paralogs of more dominantly supported fusions is applied identically as previously described. 84