Unscrambling cancer genomes via integrated analysis of structural variation and copy number

Summary Complex somatic genomic rearrangements and copy number alterations are hallmarks of nearly all cancers. We have developed an algorithm, LINX, to aid interpretation of structural variant and copy number data derived from short-read, whole-genome sequencing. LINX classifies raw structural variant calls into distinct events and predicts their effect on the local structure of the derivative chromosome and the functional impact on affected genes. Visualizations facilitate further investigation of complex rearrangements. LINX allows insights into a diverse range of structural variation events and can reliably detect pathogenic rearrangements, including gene fusions, immunoglobulin enhancer rearrangements, intragenic deletions, and duplications. Uniquely, LINX also predicts chained fusions that we demonstrate account for 13% of clinically relevant oncogenic fusions. LINX also reports a class of inactivation events that we term homozygous disruptions that may be a driver mutation in up to 9% of tumors and may frequently affect PTEN, TP53, and RB1.


In brief
Shale et al. present a comprehensive analysis of somatic structural variation in a large cohort of cancer genomes. They have developed an algorithm, LINX, that reveals insights into complex genomic events and demonstrates the utility of whole-genome sequencing in detection of diverse clinically relevant gene fusions and disruptions.

INTRODUCTION
Somatic structural variation (SV) and associated copy number alterations (CNAs) are key mechanisms in tumorigenesis. 1 However, both the mechanisms driving and the consequences of genomic rearrangements in cancer are less well understood than for point mutation events. This is due both to the relative paucity of whole-genome sequencing (WGS) data that are required for comprehensive SV analysis and also to the fact that genomic rearrangements have significant diversity. Many rearrangements involve a high degree of complexity, with individual events resulting in multiple or even hundreds of breaks. 2,3 Interpretation of these highly rearranged genomes is challenging but simultaneously highly relevant for the identification of driver events that may function as biomarkers or druggable targets.
LINX is an SV interpretation tool, which integrates CNA and SV calling derived from WGS data and comprehensively clusters, chains, and classifies genomic rearrangements. The motivation for this is twofold: first, from a biological perspective, to allow better insight into distinct mechanisms of rearrangements in tumorigenesis and second, from a clinical perspective, to allow prediction of the functional impact of structural rearrangements, including gene fusions and disruptions. A number of previous tools have been developed to analyze the roles of certain rearrangement event types in tumorigenesis, such as chromothripsis, 2 chromoplexy, 4 long interspersed nuclear element (LINE) insertions, 5 and amplification mechanisms. 6 Clustering methodologies have also been used previously to propose signatures of structural rearrangement. 1,7 LINX goes further than just integrating the functionality of each of these previous tools, both by classifying all classes of rearrangements in each tumor genome and by predicting the local chained structure of the derivative chromosome as well as the functional impact of the rearrangement in a single application.

LINX algorithm
The input for LINX is a base-pair-consistent segmented copy number and SV callset from the previously described tools PURPLE 8 and GRIDSS. 9 The base pair consistency means that each and every copy number change in the genome is matched precisely to an SV junction, which is represented either as a breakpoint when the partner location is known or as a single breakend when the partner location cannot be unambiguously determined.
There are four key steps in the LINX algorithm ( Figure 1; Methods S1). First, LINX annotates each breakpoint and breakend with several basic geometric and genomic properties that are important to the clustering and chaining algorithm. This includes whether each breakend is part of a foldback inversion, flanks a region of loss of heterozygosity (LOH), or is in a wellknown fragile site region. 8,10 LINX also annotates well-known line element source locations 5 and identifies additional suspected mobile LINE source elements based on both the local breakpoint structure and signals of poly-A sequence insertions.
Second, LINX performs a clustering routine to group raw structural variants into distinct rearrangement ''events.'' LINX defines a rearrangement event as one or more junctions that likely occurred proximately in time and transformed the genome from one stable configuration to another. Events can range from a simple deletion or tandem duplication to complex events, including chromothripsis or breakage fusion bridge 11 cascades. The fundamental principle for clustering in LINX is to join breakpoints where it is highly unlikely that they would have occurred independently. Rather than a single rule, such as clustering variants into events based solely on proximity 12 or variants that form a ''deletion bridge,'' 4 LINX employs a set of 11 independent rules in its clustering routine (Methods S1). These include clustering variants that are very close in proximity (<5 kb between breakends); clustering breakends that together delimit an LOH event, homozygous deletion, or region of high major allele copy number; clustering translocations that share common arms at both ends; clustering inversions, long deletion, and long tandem duplication variants that directly overlap each other; and clustering all foldback inversions that occur on the same chromosome arm.
Third, after resolving all variants into clusters, LINX predicts the derivative chromosome structure via a chaining algorithm. A B Figure 1. LINX schematic and visualizations (A) The LINX algorithm works in four steps to annotate, cluster, chain, and determine the functional impact of an integrated copy number. The Circos on the left represents the input of LINX and shows three structural variants (purple lines) affecting two chromosomes (outer track in green and blue) with consistent copy number breakpoints (middle track showing green for gain and red for loss). The Circos on the right shows example output of LINX, including the chaining of the variants into two continuous predicted derivative chromosomes (lines in brown and purple) and a canonical TMPRSS2_ERG fusion (genes depicted in blue and light brown on second outer circle with fused exons showing darker shading) on one of the two predicted chromosomes. (B) A detailed guide to the visualizations produced by LINX.
To do this, LINX considers all pairs of facing breakends on each chromosomal arm within each cluster and iteratively prioritizes which pair is most likely to be joined. The chaining logic imposes allele specific copy number constraints at all points on each chromosome and also the biological constraint that chromosomes are not permitted without a centromere unless strict criteria relating to detection of extrachromosomal DNA are met. Foldback inversions are also explicitly modeled to allow chaining of clusters of variable junction copy number and high amplification. Overall, the chaining prioritization scheme is designed to be error tolerant and aims to maximize the chance that each individual breakend is linked correctly to the next breakend on the derivative chromosome. However, due to multiple possible paths, upstream sources of error, and missing information, the prediction is representative only and, in the case of highly complex clusters, unlikely to be correct across all break junctions.
The fourth and final step in LINX is to annotate the gene impact of SV junctions to predict gene disruptions and fusions. Any breakend overlapping or in the upstream region of an Ensembl transcript 13 is annotated with its position and orientation relative to the strand of the gene and the nearest splice acceptor or donor. Gene fusions are called by searching for correctly oriented splice acceptor and donor pairs on the predicted derivative chromosome, including chained fusions that may span multiple break junctions. 14 To meet the fusion calling criteria, the breakends must also connect to viable contexts in each gene and not be terminated by further breakends in the chain on either 5 0 or 3 0 partner end (Methods S1). Since complex rearrangements may result in many candidate gene fusions, LINX streamlines clinical interpretation by providing a curated list of known pathogenic fusion gene pairs, as well as known promiscuous 5ʹ and 3ʹ fusion gene partners, and marks matching fusions as reportable. Finally, LINX also matches amplification, deletion, and LOH drivers called by PURPLE across a panel of well-known cancer genes (Table S1) 8 to specific SV clusters and calls additional disruption driver events in tumor suppressor genes.

Pan-cancer landscape of genomic rearrangements
To demonstrate the functionality of LINX, we ran it on a pan-cancer cohort of 4,358 paired tumor-normal, whole-genomesequenced (median of 1063 and 383 paired-end sequencing coverage, respectively) adult metastatic cancer samples from Hartwig Medical Foundation (referred to as Hartwig cohort; Table S2). 8 Of these samples, 1,924 had matched whole transcriptome sequencing data, which were used for orthogonal validation where appropriate. Overall, we found a mean of 324 rearrangement junctions per sample with the highest rates in esophagus (mean = 753) and stomach (mean = 647) tumors and lowest rates in thyroid (mean = 102) and neuroendocrine (mean = 109) tumors ( Figure 2A; Table S3). Event classification by LINX highlighted the diversity and tumor type specificity of rearrangement mechanisms with deletions, tandem duplications, LINE insertions, and complex events (defined as events with three or more junctions) found to be the largest classes of rearrangements in agreement with previous pan-cancer wholegenome analysis. 1 We examined each of these event classifications in detail as follows. Classification of simple and complex rearrangement events Classification of event types in LINX can considerably simplify interpretation of a cancer genome. An important use case is to distinguish simple events driven by a single break resulting in deletions and duplications ( Figure S1) from variants that are notionally called deletions and duplications by an SV caller but may be part of a more complex event. Clean mutational profiles for sim-  Figure S4 and Table S3. ple deletions and duplications are important for downstream applications, such as signature analysis 1 and in particular homologous recombination (HR) deficiency classification, 15,16 which is associated with both short deletions and tandem duplications and may be relevant to cancer treatment.
In the Hartwig cohort, we find that lengths of deletions and duplications classified as simple events are notably shorter than those clustered in complex events ( Figure 2B). Moreover, the simple deletions and duplications show distinct characteristic length peaks, which have been previously shown to be associated with BRCA1, BRCA2, and CDK12 inactivation or CCNE1 amplification, 17 as well as a short DUP signature that we have recently shown to be associated with colorectal tumors. 9 On the other hand, the deletions and duplications involved in complex events have length distributions closely resembling that of inversions clustered in complex events. We also find that the per-sample counts of deletions, duplications, and inversions in complex events closely follows a 1:1:2 ratio as expected from random rearrangements following a catastrophic event (Figure 2C). However, the counts of simple deletion and duplication junctions per sample were only very weakly positively correlated with those for deletions or duplications that are categorized as part of complex events (deletions r = 0.156; duplications r = 0.13; Figure 2D). Taken together, these observations suggest LINX is able to accurately distinguish between simple and complex rearrangements.
LINX annotates every cluster involving two break junctions (further referred to as two-break junction events) with a resolved type where they can be consistently chained ( Figure S2 Figure S3). Consistent two-break junction clusters fall into two major categories-reciprocal events (e.g., reciprocal inversions or translocations) or events with insertions of a templated sequence either in a chain or cycle. 1 We observe that two-break junction events with insertion sequences frequently involve very-short-templated sequences <1 kb in length, referred to as ''genomic shards,'' 18 which we find to be pervasive in cancer, constituting 14% of somatic breakpoints. Genomic shards can confound classification of otherwise simple variant types, because a short-templated insertion from another chromosome appears notionally as two translocations and can easily be misinterpreted as a reciprocal translocation or more complex event.
LINX classifies events that can be resolved as a simple deletion, tandem duplication, or translocation event with one or more inserted shards as a ''synthetic'' event, under the assumption that the structure is likely created by the disruption of a simple event with the insertion of the templated sequence during repair without affecting the shard donor locus. In support of this hypothesis, we find that samples with high counts of simple deletion and duplications have significantly higher (p < 1 3 10 À60 for both) counts of synthetic deletion and duplications, respectively ( Figures S4A and S4B), and furthermore, we observe the lengths of synthetic deletions and duplications to be highly consistent with the respective lengths of simple deletions and duplications ( Figure S4C). Synthetic deletion and duplication events can have many different breakend topological rearrangements, depending on the source and orientation of the inserted shard ( Figure S1). Insertion of genomic shards is by no means unique to simple deletion and duplication events, as we also see frequent short-templated insertion sequences in breaks of more complex events, including foldback inversion and chromothripsis events. Synthetic foldback inversions also show the same length distribution as simple foldbacks ( Figure S4C).
Reciprocal events are the other major category of two-break junction events. These arise from the crossover of multiple concurrent double-stranded breaks forming either a reciprocal inversion if both breaks occur on a single chromosome (with the segment in between the two breaks repaired inverted) or a reciprocal translocation if the repair is interchromosomal. Although reciprocal inversions and translocations are found in 65% of samples in the Hartwig cohort, they are infrequent relative to other events in cancer, making up 0.8% and 0.5% of all clusters, respectively. In addition to these classical reciprocal events, we also find other configurations of reciprocal events involving two break junctions ( Figure S2). One prominent configuration that we term ''reciprocal duplication'' involves a pair of reciprocal translocations or inversions but with breakends facing each other at both ends with substantial overlap, often multiple kilobase or even megabase in length ( Figure S4D). Reciprocal duplications are significantly enriched (p < 1 3 10 À60 ) in samples with strong tandem duplication signatures ( Figure S4E). Furthermore, the length distribution of reciprocal duplications matches the length distribution of the signature for samples with drivers known to cause tandem duplication phenotypes, i.e., BRCA1, CCNE1, or CDK12 drivers ( Figure S4F). This suggests that these reciprocal duplication events may arise from the same process that forms tandem duplication events, likely when multiple tan-dem duplications occur simultaneously in a cell and, instead of repairing locally, they may cross over and create a reciprocal duplication. This observation places constraints on the mechanism by which tandem duplications may form, because it requires duplication of DNA at both loci prior to breakage and is consistent with a replication restart-bypass model, 19 but not a microhomology-mediated, break-induced replication model. 20 Mobile element and pseudogene insertion detection Somatic integration of LINEs is a common feature in many types of cancer, particularly esophagus and head and neck cancers. 5 A LINE insertion may involve either the transposition of a full or partial LINE source element or the transduction of a partnered or orphaned genomic region within 5 kb downstream of the LINE element. While LINE insertions are typically simple events in themselves, correct classification of these break junctions is important to accurate interpretation of the genome, as they can otherwise be mistaken as translocations and other complex events.
LINE integrations can be difficult to resolve with short read technology, because the inserted sequence is often not uniquely mappable in the genome and typically includes a Poly-A tail, 21 making assembly difficult. LINX circumvents both these issues by leveraging GRIDSS's single breakend-calling capability 9 to identify LINE insertion sites with breakend evidence for either repetitive LINE sequence, PolyA sequence, or a list of known recurrently active LINE source elements. To validate LINX's detection of mobile element insertions, we ran LINX on 75 samples from the pan-cancer analysis of whole genomes (PCAWG) pan-cancer cohort and compared LINX's LINE insertion calls with those from TraFiC-mem. 5 Overall, 339 of 564 (60%) LINX LINE insertions calls were also detected by TraFiC-mem, with TraFiC-mem calling an additional 270 insertions not found by LINX. The concordance in total LINE insertion count was very strong on a per-sample basis ( Figure S5A; Table S4), with most of the private calls in both pipelines being found in the high LINE mutational burden samples ( Figure S5B), suggesting that many of the private calls from both pipelines may be genuine LINE insertions.
Across the full Hartwig cohort, LINX found 76% of tumors have at least one LINE insertion event. Some tumors suffer extreme deregulation, with 6.7% of tumors having over 100 insertions and 2,241 insertions found in a single esophagus tumor sample (Figures 3A and 3B). The five most frequently inserted LINE source elements in the Hartwig cohort were all among the top six reported previously in the PCAWG pan-cancer cohort: 5 chr22:29,059,272-29,065,304, chrX:11,725,366-11,731,400, chr14:59,220,385-59,220,402, chr9:115,560,408-115,566,440, and chr6:29,920,213-29,920,223. Analysis of the precise breakend locations at these five sites reveals highly recurrent site-specific patterns of transduction ( Figures 3C  and S5C), where the 3 0 ends of the transduced sequences are normally sourced from a handful of specific downstream sites (presumably polyadenylation sequences of alternative transcription endpoints for the LINE source element), whereas the location of 5 0 side of the transduction appears to be relatively randomly distributed.
At the LINE insertion site, accurate breakpoint determination can also give insight into potential biological mechanisms. LINX finds frequent target-site duplication 22 but intriguingly finds two peaks in the distance between the insertion breakends, one at an overlap of 16 bases but also a second peak with no overlap, suggesting the possibility of two distinct breakage mechanisms for the second strand after LINE invasion ( Figure S5D). Furthermore, for the 20% of insertions where LINX observed a 5 0 inversion in the insertion sequence (due to twin priming), 22 only a single peak with target site duplication of 16 bases is found.
LINX can also detect somatic pseudogene insertions resulting from the activated reverse transcriptase activity associated with deregulated LINE activity in tumors. 5 LINX annotates any group of deletions that matches the exact boundaries of annotated introns as pseudogene insertions ( Figure 3D). We find 577 pseudogene insertions in the Hartwig cohort, exclusively in samples with somatically activated LINE mechanisms and enriched in the samples with the most deregulated LINE activity ( Figure 3E).
Complex events LINX classifies any cluster that has three or more junctions and is not resolved as a LINE source element as ''complex.'' Previous tools, notably ChainFinder, 4 have been developed to systematically search for complex rearrangement patterns in tumors. We compared LINX and ChainFinder across 1,479 Hartwig cohort samples and found that, while in 22% of cases, LINX and Chain-Finder produced near-identical clusters, the majority of junctions clustered by LINX are left unclustered by ChainFinder, while few SVs were exclusively clustered by ChainFinder (Figures S6A and  S6B). We found this to be because of two main reasons: first, ChainFinder fails to cluster a large number of junctions that are highly proximate (<5 kb between breakends; Figure S6C) and, second, LINX employs a variety of clustering techniques to link distant junctions on the same chromosome arm that are not  Figure S6D). The additional variants clustered by LINX compared with ChainFinder share a strikingly similar length distribution to the variants clustered by both tools (Figure S6E), including deletions, duplications, and inversions with lengths greater than 1 Mb, which are not normally found in simple events.
Conversely, in a small proportion (1.8%) of cases, junctions are clustered by ChainFinder and not by LINX. Ninety-five percent of these are deletions and tandem duplications <1 Mb in length that may also have occurred as independent events and be inadvertently clustered by ChainFinder ( Figure S6E). In line with this hypothesis, we find that 20% of the deletions clustered by ChainFinder, but not by LINX, are in known fragile sites ( Figure S6F) and often are phased in trans, suggesting that they likely occurred in different events. 12 Across the Hartwig cohort, we found at least one complex event in 95% of tumors and at least one event of 20 or more junctions in 60% of tumors ( Figure 4A). While there are relatively few complex events in any given tumor, they account for more than half of junctions overall. Complex clusters with >100 junctions were found in all cancer types, with breast cancer having the highest median maximum complex cluster size of 62 ( Figure 4B). We observe that complex events with a higher number of junctions are more likely to disrupt or amplify a putative cancer driver gene. Overall, 12.7% of all complex clusters in the cohort contributed to a LOH, amplification, deletion, or disruption driver, but this rises to 39.1% for events with 20 or more junctions and 77% for events with more than 20 junctions and high amplification (junction copy number R8; Figure 4C). LINX goes further than other clustering tools in that it allows not only for complex clusters to be identified but in many of cases is able to completely resolve such events into a consistent set of derivative chromosomes, including in chains with up to 33 junctions ( Figure 4D). Uniquely, and critically for accurate chaining in these complex structures, LINX utilizes phased assembly output from GRIDSS to determine whether proximate facing breakends are cis or trans phased. We observe that transphased facing breakends, causing local duplication, are common in complex events and can often extend up to several hundred bases but only rarely extend beyond 30 bases in reciprocal events and mobile insertions ( Figures 4E and 4F), suggesting a fundamentally different breaking mechanism in complex events, which may cause double-stranded breaks with hundreds of bases overlap. Proximate cis-phased breakends are even more common than trans-phased and resemble in length distribution the shards detected in simple events but with much higher frequency in complex clusters ( Figures 4G and 4H). We frequently observe localized regions of scarring with multiple distinct shards sourced from the same location, sometimes with overlapping template sequences. Amplification mechanisms Regions of high amplification are among the most complex events in tumors, as they require iterative and repeated cycles of synthesis or unequal segregation to form. There are two well-known key distinct biological mechanisms that create highly amplified rearrangements: repeated cycles of breakage fusion bridge (BFB) and stochastic amplification of circular extrachromosomal DNA by asymmetric segregation during cell division (ecDNA). ecDNA (Figures 4I and S7A) may arise from any event that creates simultaneous multiple double-stranded breaks on the same chromosomal arm, with one or more chromosomal segments repairing to form a circular structure without a centromere. BFB ( Figure S7B), on the other hand, is triggered by the formation via translocation or foldback inversion of a chromosome with two centromeres, arising from either multiple concurrent double-stranded breaks or telomere erosion, and leads to duplication of chromosomal segments within a linear chromosome.
Despite these significant differences in mechanism, distinguishing between ecDNA and BFB is non-trivial based on short-read sequencing data but is essential in order to understand the diversity of amplification drivers in tumors and may be relevant to the prognosis or treatment of certain tumors. 23 The key difficulties in discrimination are that both mechanisms can leave a similar footprint, as both may arise out of complex shattering events and are highly shaped by the same selection processes, both positive (amplification of key oncogenes) and negative (constraints on amplifications of other proximate genes).
LINX employs a set of heuristics to identify subsets of clusters as likely ecDNA. The key principle used to identify ecDNA is to look for high junction copy number (JCN) structural variants adjacent to low-copy-number regions that can be chained into a closed or predominantly closed loop. LINX also checks that the high JCN cannot be explained by compounding linear amplification mechanisms, by comparing the JCN of the candidate ecDNA junctions with the maximal amplification impact of fold-back inversions (hallmarks of BFB) and other junctions that link closed segments of the ecDNA to other regions of the genome (Methods S1). To validate the ecDNA heuristic, we ran LINX on a set of 13 WGS neurosphere-cultured glioblastoma samples that had been previously analyzed 24 for ecDNA with Amplicon Architect. 6 LINX and Amplicon Architect called ecDNA for an identical set of 19 oncogenes across the 13 samples (Table  S5), including the 11 samples that were orthogonally validated by fluorescence in situ hybridization (FISH).
Applying the heuristic to the Hartwig cohort, we found ecDNA to be a relatively uncommon event present in 9.9% of all tumors, with the highest frequency in CNS tumors (51%; Figure 4J). This is lower than found in a large recent pan-cancer cohort analysis of WGS using AmpliconArchitect, 23 which found a pan-cancer prevalence of 14%. We observe that, overall, 12% of putative amplification drivers identified in the Hartwig cohort are associated with ecDNA events ( Figure S7C) but that this rate increases for more highly amplified events to greater than 40% for events with maximum JCN > 32. The relative rate of ecDNA is the highest for EGFR ( Figure S7D), but this appears to be highly specific to CNS tumors (where 87% of EGFR amplifications are associated with ecDNA), whereas for lung tumors (where epidermal growth factor receptor [EGFR] amplification is also common) and other cancer types, the rates of ecDNA are only 11% and 21%, respectively, similar to that of other well-known oncogenes ( Figure S7E).
The high-amplification events that do not meet the ecDNA criteria are assumed to be formed via linear amplification. While we find that 76% of these events have at least one foldback inversion, suggesting a BFB process, in many events, the foldback JCN cannot explain the full amplification, and in the remaining events, LINX identifies no foldback events at all ( Figure S7F). The majority of these are unlikely to be ecDNA, however, because there is no obvious set of junctions and segments that can be closed into a circle with a consistent copy number. Some events, such as the exceptionally complex amplifications of MDM2 and CDK4 common in liposarcoma, 3 may not fall neatly into either an ecDNA or BFB classification ( Figure S7G) and have recently been proposed to be a novel rearrangement class termed ''tyfonas.'' 12 Detection of clinically relevant pathogenic rearrangements LINX calls a diverse and comprehensive range of fusions and pathogenic rearrangements (Figures 5A and S8A-S8D). We orthogonally validated LINX's pathogenic fusion predictions by comparing them with fusions predicted from RNA sequencing (RNA-seq) data taken from the same samples. For the RNA comparison, we used Arriba, one of the best performing RNA fusion callers, 25 using a curated list of 391 known pathogenic fusion pairs (Table S6). Across 1,924 Hartwig cohort samples with matched RNA, 148/173 in-frame fusions (86%) predicted by LINX were also found by Arriba ( Figure 5B; Table S7). Of the 25 fusions not identified in RNA, 13 matched the characteristic tumor type of the known fusion pair (nine of which were TMPRSS2-ERG fusions in prostate cancer) and are likely to be pathogenic but with insufficient expression to be detected in the RNA. A further two cases predicted by LINX were found by Arriba but only in out-of-frame transcripts. Thirteen known pair Cell Genomics 2, 100112, April 13, 2022 7 Technology ll OPEN ACCESS fusions were predicted by Arriba, but not by LINX, seven of which involve gene pairs less than one million bases apart on the same chromosome and may be caused by readthrough transcripts 26 or circularized RNA 27 unrelated to structural rearrangements in the DNA.
In addition to known pathogenic fusion pairs, 63 cancerrelated fusion genes were curated as promiscuous 5 0 and 3ʹ fusion partners. Among these, LINX identified a further 152 candidate in-frame fusions, 74 (49%) of which were also detected in RNA. Arriba detected 397 additional promiscuous candidates, but 86% of these were proximate on the same chromosome and are likely readthrough transcripts with no genomic rearrangement. Altogether, 43 of the 325 (13%) known and promiscuous fusion predictions were chained fusions involving multiple junctions, 26 (60%) of which were validated in the RNA-seq data ( Figure 5C), highlighting the utility of chaining of derivative chromosomes for DNA fusion calling. TMPRSS2-ERG was the only fusion that LINX found to be recurrently chained in the cohort, accounting for 14 of the 43 predicted chained fusions, all in prostate cancers.
Immunoglobulin enhancer rearrangements are a distinct class of pathogenic rearrangements, common in B cell tumors where errors in VDJ recombination and/or isoform switching in the IGH, IGK, and IGL regions may lead to pathogenic rearrangements, driving high expression of known oncogenes through regulatory element repositioning. 28 Although these typically do not make a novel protein fusion product, LINX predicts these pathogenic rearrangements based on the breakend in the IGH, IGK, and IGL regions with orientation and position matching locations commonly observed in B cell tumors. 28 Among 10 lymphoid samples with matching RNA in the cohort, LINX found six such rearrangements, including five cases of IGH-BCL2 and one case of IGH-MYC. The five identified samples with IGH-BCL2 rearrangements have significantly higher expression (p = 0.008) of BCL2 than the five lymphoid samples with no BCL2 rearrangement detected ( Figure 5D).
LINX also identifies disruptive intragenic rearrangements that may cause exonic deletions and duplications. Our knowledge base includes nine such rearrangements known to be pathogenic and two that we have deemed likely pathogenic due to high recurrence in the Hartwig cohort. Three of the known pathogenic exon rearrangements were detected by LINX in at least five samples with paired RNA in our cohort: EGFRvII (n = 6), EGFRvIII (n = 14), and CTNNB1 exon three deletion (n = 6). In all cases with an event detected by LINX in the DNA, we found RNA fragments that supported novel splice junctions in the matched RNA ( Figure S8E). Only one other sample in the complete cohort (n = 1,924) had more than one fragment supporting any of these alternative splice junctions (a gastrointestinal stromal tumor with three fragments supporting EGFRvII but with no evidence of rearrangement in EGFR), suggesting a low false-negative rate in LINX.
In addition to producing novel oncogenic proteins and overexpression of well-known oncogenes, rearrangements may also lead to tumorigenesis by disrupting the function of tumorsuppressor genes. To capture this, LINX annotates every breakend that overlaps a gene, determines whether it is disruptive to the gene, and reports the number of undisrupted copies. In cases of reciprocal translocations ( Figure 5E), reciprocal inversions ( Figure S9A), complex break events ( Figure S9B), or tandem duplications that overlap at least one exon ( Figure S9C), a gene may be disrupted on all remaining copies, even though the copy number is greater than zero for all exonic bases. 29 We term this type of genomic rearrangement a ''homozygous disruption.'' Homozygous disruptions cannot readily be detected by standard panel or whole-exome sequencing, since intronic sequences are typically not included in such panels and they are copy neutral in exonic regions.
We find homozygous disruptions to be a common driver in the Hartwig cohort, with 9% of samples containing at least one homozygous disruption in a panel of 448 curated cancer-related genes (Table S8). Three well-known tumor-suppressor genes had homozygous disruptions in more than 1% of the cohort: TP53 (n = 69), PTEN (n = 56), and RB1 (n = 55; Figure 5F). Supporting the functional impact of these events, we found significantly lower expression for each of these genes (TP53: p = 2 3 10 À16 ; PTEN: p = 2 3 10 À6 ; RB1: p = 2 3 10 À3 ) in samples with predicted homozygous disruptions compared with samples with at least one intact copy ( Figure 5G) and similar mean fold change in expression compared with samples with homozygous deletions (TP53: 0.30 versus 0.40; PTEN: 0.47 versus 0.37; RB1: 0.68 versus 0.60 for disruptions and deletions, respectively). We also performed a genome-wide search for genes with enrichment of homozygous disruptions and found 35 significantly enriched genes, including 16 well-known tumor suppressors, 14 genes immediately adjacent to tumor-suppressor genes, and three highly recurrent oncogenic fusion partners (Table S9). Intriguingly, we found an additional two genes also enriched in homozygous disruptions, but not widely characterized as tumor-suppressor genes: PSIP1 (five observations; q = 0.006), which has previously also shown to be significantly enriched in truncating point mutations, 30 and USP43 (six observations; q = 0.01), a recently proposed tumor suppressor. 31 Visualization LINX produces detailed visualizations of the rearrangements in the tumor genome that allow further insights into complex rearrangements. LINX supports either drawing all rearrangements in a cluster or all the rearrangements on a chromosome, creating an integrated Circos output 32 showing copy number changes, clustered SVs, the derivative chromosome predictions, and impacted genes, including protein domain annotation for gene fusions, all on the same diagram. The visualizations use a log-based position scaling between events so that small-and large-scale structures can both be inspected on a single chart. Combined with the circular representation, these features allow unprecedented resolution of complex structures across a broad array of event types, including chromoplexy ( Figure 6A) and complex BFB amplification events ( Figure 6B). Methods S1 includes a walkthrough and explanation of all LINX figures, covering the complete SV landscape of the COLO829T melanoma cancer cell line, which has been proposed as a somatic reference standard for cancer-genome sequencing. 33,34 Evaluation on an independent cohort. To assess broader utility of the tool set and the reproducibility of our results, we compared the findings on the Hartwig cohort with a subset of 1,541 samples from the independently sequenced PCAWG pan-cancer cohort (Table S10). 35 The PCAWG samples analyzed also cover a diverse range of tumor types but, unlike the Hartwig cohort, contain almost exclusively primary cancer samples and are sequenced to a lower average coverage of depth (383-603 PCAWG compared with median 1063 for HMF).
We observed largely the same structural variant patterns across the two cohorts ( Figure S10A). The length distributions of deletions, duplications, and inversions were highly similar for both simple and complex events across the two cohorts (Figure S10B). We also observed a very similar preponderance and length distribution of genomic shards across all event types (Figure S10C). Furthermore, we found that the length distributions of the synthetic events in the PCAWG cohort closely replicated the results found in the Hartwig cohort ( Figure S10D). Likewise, the reciprocal duplication events we identified in the Hartwig cohort were also present in PCAWG, with the same length patterns of tandem duplication signatures for samples with BRCA1, CDK12, and CCNE1 drivers ( Figure S10E). Driver-related rearrangement patterns were also similar between the PCAWG and Hartwig cohorts. While the overall rates of samples with high-amplification events were lower in the primary cancers (22% PCAWG; 41% HMF), the proportion accounted for by ecDNA was similar (28% PCAWG; 24% HMF; Figure S10F). We also found homozygous disruptions events impacting tumor-suppressor genes (TSGs) in the PCAWG cohort. Indeed, the top four driver genes with putative homozygous disruption drivers were the same in both datasets ( Figure S10G).
Overall, the high reproducibility of these results in the independently sequenced PCAWG cohort lends weight both to the utility of LINX and the universality of the observed patterns across both metastatic and primary cancers.

DISCUSSION
We have shown that LINX can help understand highly rearranged cancer genomes in multiple ways. Other recent publications on complex somatic rearrangements 1,12 have developed tools, such as ClusterSV and JaBbA, that have significant feature overlap with LINX. Each of these approaches also utilizes a basepair-consistent SV/CN call set to cluster SVs, classify certain types of rearrangements, and assess downstream impact.
However, LINX differs from existing approaches in several key aspects. First, LINX clusters use copy number consistency constraints in addition to SV proximity. Second, LINX chains SVs to reconstruct the derivative chromosomes caused by each rearrangement event, including partial reconstruction for incomplete events. Third, LINX performs comprehensive classification. Every SV is classified, including mobile element translocations. Fourth, LINX utilizes single-breakend SV calls. The single-breakend repeat annotations provided by GRIDSS enable LINX to classify mobile element translocations as well as cluster complex events overlapping centromeric repeats. Fifth, LINX's rearrangement model allows for genomic shards to be inserted in any event type. The size distribution of sharded events indicates this approach is sound, at least for simple events, and this approach considerably simplifies the classification scheme. Sixth, LINX utilizes a nonlinear Circos-style visualization format that enables even quite complex rearrangements to be visually interpretable. Finally, LINX provides the most comprehensive genomic rearrangement functional impact analysis currently available. To the best of our knowledge, LINX is the only tool that reports homozygous disruptions and the only tool that can identify chained fusions from DNA-seq data alone, both of which can lead to clinically relevant rearrangements in tumors.
The challenges in understanding the complexity of rearrangements in tumor genomes can be daunting. The diversity of overlapping or converging biological mechanisms that may cause similar rearrangement patterns means that it may be perilous to analyze any one rearrangement as a standalone analysis. By exhaustively classifying all rearrangements, LINX is a robust foundation for more detailed analysis of specific rearrangement patterns, including structural variant signatures, complex shattering events, and high-amplification drivers as well as dissection of the underlying molecular mechanisms, DNA replication, Figure 6. Complex event visualization (A) Chromoplexy-like cluster formed from 19 break junctions across seven chromosomes in HMF001596B, a prostate tumor. The rearrangement leads to three distinct putative drivers in a single event, including a chained TMPRSS2-ERG fusion with two hops; a loss of heterozygosity for PPP2R2A, which also has a stopgained point mutation (not shown); and an intronic homozygous disruption of PTEN. (B) Breakage fusion bridge event affecting the P arm of chromosome 3 in the melanoma cell line COLO829T. The predicted derivative chromosome has a copy number of two and can be traced outwards starting from the centromere on chromosome 3, traversing two simple foldbacks and two chained foldbacks and finishing on a single breakend at chr3:25.3M, which from the insert sequence can be inferred to be connected to a centromeric satellite region (likely chromosome 1, which has a copy number gain of two over the centromere from P to Q arm and which appears to be connected to chromosome 3 in unpublished SKY karyotype figures; http://www.pawefish.path.cam.ac.uk/OtherCellLineDescriptions/COLO829.html). One chained foldback at chr3:26.4M includes a genomic shard from chr6 of approximately 400 bases, which has itself been replicated and internally disrupted by the foldback event. The other chained foldback at chr3:25.4M includes two consecutive genomic shards inserted from chromosome 10 and 12 of approximately 200 bases each.
10 Cell Genomics 2, 100112, April 13, 2022 Technology ll OPEN ACCESS and repair components involved. The full LINX analysis results on the Hartwig cohort are available via data request and can be paired with clinical data and other whole-genome analyses for further in-depth research.
WGS offers the promise of a single comprehensive test for all genomic alterations for both routine diagnostics and future biomarker discovery. LINX takes a step toward that goal by both comprehensively calling clinically relevant fusions from DNA with similar precision and sensitivity to gold standard RNA-seq methods and by identifying homozygous disruptions, an important class of drivers of tumorigenesis that cannot readily be detected by standard-of-care methods.
Limitations of the study There are many potential sources of error that can confound correct interpretation of complex genomic rearrangements, including sample preparation, sequencing errors and coverage biases (such as GC bias), inaccurate fitting of sample purity and ploidy, false-positive or false-negative structural variant calls, and inaccurate local copy number measurement. Depth of coverage and sequencing quality are important considerations here. While we have shown that LINX can find highly similar results on the PCAWG dataset, which has, on average, half the sequencing coverage of the Hartwig cohort, lower depth coverage and/or lower quality sequencing is associated with higher false-negative rates of structural variants 9 and will result in less complete reconstructions.
Furthermore, while LINX has been optimized for short-read technology, the short-read length is ultimately the key limitation in interpretation, because it limits the phasing of proximate variants and accurate identification of events in long repetitive regions. Nevertheless, in practice, LINX is able to resolve many structures via various chaining and clustering heuristics, but for more complex events, particularly highly rearranged focal regions, errors are inevitable and the chaining is only partial and representative. While we have performed extensive comparison of LINX against other tools and can validate some of LINX's chaining predictions orthogonally via RNA evidence for chained fusions, there are, as yet, no representative tumor genomes with a fully resolved chromosomal structure for comparison as a truth set. Long-read sequencing technologies 36 can phase more distant breakpoints and are likely better suited for resolving complex events, although those technologies typically perform less well for small variant detection. Pairing short-and long-read technologies will no doubt lead to further advances in our understanding of the mechanisms and role of genomic rearrangements in tumorigenesis.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
This publication and the underlying study have been made possible partly on the basis of the data that Hartwig Medical Foundation and the Center of Personalised Cancer Treatment (CPCT) have made available to the study. Wholegenome sequencing data were analyzed from glioblastoma neurosphere lines, which were established at Henry Ford Healthcare System. Funding was provided from Hartwig Medical Foundation.