Optical genome mapping and revisiting short-read genome sequencing data reveal previously overlooked structural variants disrupting retinal disease − associated genes

Purpose: Structural variants (SVs) play an important role in inherited retinal diseases (IRD). Although the identi ﬁ cation of SVs signi ﬁ cantly improved upon the availability of genome sequencing, it is expected that involvement of SVs in IRDs is higher than anticipated. We revisited short-read genome sequencing data to enhance the identi ﬁ cation of gene-disruptive SVs. Methods: Optical genome mapping was performed to improve SV detection in short-read genome sequencing − negative cases. In addition, reanalysis of short-read genome sequencing data was performed to improve the interpretation of SVs and to re-establish SV prioritization criteria. Results: In a monoallelic USH2A case, optical genome mapping identi ﬁ ed a pericentric inversion (173 megabase), with 1 breakpoint disrupting USH2A . Retrospectively, the variant could be observed in genome sequencing data but was previously deemed false positive. Reanalysis of short-read genome sequencing data (427 IRD cases) was performed which yielded 30 pathogenic SVs affecting, among other genes, USH2A ( n = 15), PRPF31 ( n = 3)


Introduction
Structural variants (SVs) are increasingly recognized as important causes of inherited diseases, and pathogenic variants have been described to be implicated in many diseases, including developmental disorders and sensory disorders. 1-3 SVs are defined as large (>1 kb) genomic aberrations and can be subdivided into unbalanced (eg, deletions and duplications) and balanced (eg, inversions and translocations) rearrangements. 4 The number of identified pathogenic SVs has been growing rapidly, and SV identification has significantly improved with the arrival of genome sequencing technologies (Reurink et al unpublished). 5,6 Inherited retinal diseases (IRDs) are a group of genetically heterogeneous disorders, and pathogenic variants in IRDs have been described in >270 genes (RetNet, https:// sph.uth.edu/retnet/). The contribution of SVs to the mutational landscape of IRDs is currently being estimated to range between 5% and 15%. 5,[7][8][9] Despite extensive sequencing efforts, in approximately one-third of IRD cases, no conclusive genetic diagnosis could be established. 10 Although literature reports suggest that this percentage can be improved by the implementation of (short-read) genome sequencing, there is still a significant degree of missing heritability (Reurink et al [unpublished]). 5,6 One of the main hypotheses for this missing heritability is the presence of pathogenic SVs that cannot be detected using short-read sequencing approaches. Several studies have reported the additive value of long-read sequencing or cytogenetic approaches that make use of (ultra)long DNA molecules for the identification of SVs. [11][12][13] In this study, we showed that the power of SV detection from short-read data, and therefore the prevalence of pathogenic SVs, is still being underestimated in IRDs. Optical genome mapping (OGM) was performed to improve SV detection in genetically unexplained IRD cases and, surprisingly, revealed that several pathogenic SVs were overlooked during our previously performed short-read genome analyses. By revisiting genome sequencing data generated from established IRD cohorts and performing a focused SV reanalysis, several previously overlooked pathogenic SVs could be identified, including, but not limited to, large (pericentric) inversions and small intragenic deletions. Several lessons were learned during the process of data reanalysis, which allowed us to re-establish and optimize our SV prioritization protocols. We therefore advocate that more attention should be paid to SV interpretation during genome data analyses, and we believe that this will facilitate (partial) explanation for the missing heritability in IRDs and possibly in other inherited disorders as well.

Patient cohort
Genome sequencing data were collected from 427 IRD probands. This IRD study cohort included both genetically explained and unexplained samples that were incorporated in recent studies with 100 IRD cases described by Fadaie et al 6 and 100 Usher syndrome and monoallelic USH2Aassociated recessive retinitis pigmentosa cases by Reurink et al (unpublished). Other samples included in the cohort were part of unpublished studies (n = 96) focused on the analyses of genomic sequences of genetically unexplained IRD cases and a portion of samples (n = 131) were not analyzed previously (Supplemental Figure 1). Before participation in these studies, all probands were prescreened using either exome sequencing or targeted gene panel sequencing, which yielded no conclusive genetic diagnosis. Informed consent was obtained from all participants or their legal representatives.

Genome sequencing
Genomic DNA was isolated from peripheral blood lymphocytes following standard procedures and analyzed through genome sequencing as described previously. 6 Sequencing was performed by BGI on a BGISeq500 using a 2x 100 basepair (bp) or 2x 150 bp paired-end module, with a minimal median coverage per genome of 30 fold. Read mapping to the Human Reference Genome build GRCh38/ hg38 and single-nucleotide variant (SNV) calling were performed using Burrows-Wheeler Aligner V.0.78 14 and Genome Analysis Toolkit HaplotypeCaller (Broad Institute), respectively. SVs were called using Manta structural variant caller, 15 which is based on read-pair signals (split reads and discordant read pairs) and read-depth signals (copy number changes). In addition, copy number variant (CNV) detection was performed based on read-depth evidence using Canvas Copy Number Variant Caller. 16 OGM For a single case, OGM (Bionano Genomics) was performed as previously described. 12,13,17 In brief, ultrahigh molecular weight DNA was isolated from whole peripheral blood (EDTA) using the SP Blood & Cell Culture DNA Isolation Kit (Bionano Genomics). DNA labeling was performed using the Direct Label and Stain (DLS) DNA Labeling Kit (Bionano Genomics), and the labeled sample was loaded on a 3×1300 Gb Saphyr chip (G2.3) on a Saphyr instrument (Bionano Genomics). Annotated de novo assembly using the genome build hg19 was performed using Bionano Solve version 3.6.1, which includes 2 separate algorithms for SV and CNV detection as described previously. 12 SV calls that were absent in a control OGM data set (>200 human population control samples) were prioritized. Identified candidate variants overlapping with an IRDassociated gene were visualized and investigated in Bionano Access version 1.6.1.

Reanalysis of genome sequencing data and variant selection
All 278 IRD-associated genes listed on the RetNet webpage (https://sph.uth.edu/retnet/, accessed October 1, 2022) were investigated in this study, and genomic positions were extracted using the Ensembl genome browser. 18 All SVs and CNVs with at least 1 breakpoint within one of the listed IRD-associated genes (±1 megabase [Mb] flanking regions) were extracted. Extracted SVs and CNVs were combined for all samples followed by in-depth variant assessment.
Coding SVs and CNVs were filtered and selected based on a minor allele frequency (MAF) of <1% in 1000 Genomes, 19 DECIPHER, 20 and our in-house SV database (consisting of 920 genomes of presumably healthy unrelated individuals). Inversion events were only considered when at least one of the breakpoints was located within an IRD-associated gene and therefore disrupt the gene.

Variant validation
Potentially pathogenic SNVs that were not previously identified were validated using Sanger sequencing. Identified SVs were validated by visualization using the Integrative Genomics Viewer (IGV) software V2.4 27 (Broad Institute), and breakpoints were confirmed using polymerase chain reaction (PCR) amplification and Sanger sequencing. Primer sequences for SV validations are listed in Supplemental Table 1, and PCR conditions used are available upon request. Segregation analysis was performed when DNA of family members was available (Supplemental Table 2).
For one variant (a full-gene deletion of MERTK), validation was performed using quantitative real-time PCR (qPCR). qPCR was performed on the genomic DNA from the affected individual and unaffected unrelated controls (n = 2). The experiment was performed using GoTaq qPCR Master Mix (Promega) on a Quantstudio 6 Flex Real-Time PCR System (Applied Biosystems). Primer pairs were designed to amplify parts of the genome in the suspected deleted region as well as regions outside the affected genome region as a reference for standard quantity. Primer sequences are listed in Supplemental Table 1.

OGM reveals a pathogenic 173 Mb pericentric inversion disrupting USH2A
The implementation of short-read genome sequencing technologies has significantly improved the diagnostic yield for inherited disorders; nonetheless, many affected individuals remain genetically unexplained. For IRDs, about 30% of individuals lack a genetic diagnosis after genome sequencing has been performed, which indicates that a large diagnostic gap exists (Reurink et al [unpublished]). 6 One of the cases that remained genetically unexplained after genome sequencing was individual USH-44 (Reurink et al [unpublished]). This individual has been previously diagnosed with Usher syndrome type-II (OMIM 276901), a recessively inherited disorder characterized by retinitis pigmentosa and congenital hearing loss and associated with variants in several genes, the most important one being USH2A. Genome sequencing did reveal a heterozygous intragenic USH2A deletion (c.9258+2601_9371+1539del, p.(?), NM_206933.2), spanning exon 46, which is predicted to result in a frameshift, but the second pathogenic allele remained elusive (Reurink et al [unpublished]). As part of this study, we performed OGM: an innovative cytogenetics approach, which allows the efficient detection of SVs using ultralong DNA fragments. 12,13 OGM revealed a total of 5929 SV calls, of which 143 were absent in the OGM control cohort of 204 unrelated individuals. Two of the identified SVs overlapped with the USH2A gene, of which one corresponded to the previously identified heterozygous deletion spanning USH2A exon 46. The other SV supports a large approximately 173 Mb pericentric inversion event, which involves most of chromosome 1. The 3′ breakpoint was predicted to be located within the USH2A gene ( Figure 1A) and the 5′ breakpoint within FOXJ3. The inversion was not present in any of the control samples, and none of the other SV calls were overlapping with any IRDassociated gene. Inspection of the inversion event using the Bionano Access software confirmed the presence of the USH2A inversion. In addition, the variant was confirmed using traditional karyotyping in patient-derived cells ( Figure 1).
To determine the exact breakpoints of the inversion event, we interrogated the implicated breakpoint regions in the available short-read genome sequencing data. Retrospectively, we noticed a 173.1 Mb inversion variant call that corresponded to the SV calls obtained from OGM. In addition, split reads spanning the breakpoints of the inversion event could be observed in IGV ( Figure 1B). The variant was previously deemed to be false positive and overlooked, mainly because of the large size of the inverted region and the high number of large inversion events called in this sample. Using the genomic positions derived from the short-read data, breakpoints of the inversion event could be confirmed through PCR and Sanger sequencing (chr1:42320825-215677220delins42320846-215677215inv, hg38). The 3′ breakpoint of the inversion is located in intron 62 of USH2A and thus disrupts the coding sequence of USH2A. Most likely, none or only truncated USH2A protein will be produced that lacks several protein domains, among which is the essential transmembrane domain. Based on these results, the variant was classified as pathogenic, and this individual with Usher syndrome was considered genetically solved.

Revisiting genome sequencing data reveals previously overlooked SVs
The identification of the previously overlooked USH2A inversion event prompted us to reassess our available shortread genome sequencing data sets and optimize our variant prioritization protocol. We decided to merge SV and CNV data of all 427 genomes that were collected previously from unrelated individuals diagnosed with IRD and performed a comprehensive variant (re)analysis. SV and CNV data of genetically explained samples remained included in the analyses for control purposes. We refrained from filtering SVs based on size or quality criteria to allow the establishment of correct filtering criteria. Considering the large volume of data and the feasibility of these analyses, we decided to focus on rare coding SVs and CNVs only. After variant filtering, 334 SV calls and 472 CNV calls overlapping with an IRD-associated gene and a MAF of <1% were selected and subjected to a detailed examination ( Figure 2).
A total of 5 homozygous and 21 heterozygous candidate SVs, potentially in trans with a second heterozygous pathogenic allele, were identified in 25 samples. This included the 173 Mb USH2A inversion (variant 30) and an in-frame deletion of exons 22 to 24 in USH2A, which was identified in 2 samples (variant 18 and variant 22). Moreover, 1 hemizygous and 3 heterozygous potentially pathogenic variants were identified in IRD cases with an X-linked or a dominant mode of inheritance, respectively (Table 1,  Supplemental Tables 2 and 3). All previously identified SVs (n = 15, Fadaie et al, 6 Reurink et al [unpublished], Velde et al, 28 unpublished data) were detected through this approach, which confirms the suitability of our method. Breakpoints of all SVs except one were validated, and segregation analysis was performed if possible (Supplemental Table 2). For an SV affecting MERTK, breakpoints could not be amplified, but the deletion event was validated through genomic qPCR (Supplemental Figure 2). Sizes of identified SVs ranged from 72 bp to 173 Mb, and variants included inversions (n = 3), duplications (n = 3), and deletions (n = 24) (Figure 3). In concordance with previous studies of CNVs and SVs in IRDs, 9 variants in USH2A are the main contributor to the variants identified in this study (15/30 variants).
Surprisingly, >25% of the newly identified variants (n = 8) were identified in previously analyzed samples, indicating that these variants were previously overlooked. One of these variants is another large inversion event that disrupts the USH2A gene (variant 28). The variant was identified in sample USH-42 (Reurink et al [unpublished]), an individual diagnosed with Usher syndrome type-II and heterozygous for a known pathogenic variant in USH2A. This individual was previously included in the genome sequencing study performed by Reurink et al (unpublished) but remained genetically unexplained after genome analysis was performed. Only by reassessing the (size) filtering criteria, we were able to pick up this large inversion that affects USH2A. To find a possible explanation why several other variants were misinterpreted during previous analyses, an overview of the respective SV and CNV calls and corresponding quality scores are summarized in Supplemental Table 4.
For all 29 cases in which a candidate SV was identified, in-depth genome sequencing (re)analysis, including SNV analysis, was performed to exclude the presence of other (likely) pathogenic variants in IRD-associated genes that could have been misinterpreted in earlier analyses as well. No additional potentially pathogenic homozygous or compound heterozygous (recessive inheritance), hemizygous (X-linked inheritance), or heterozygous variants (dominant inheritance) were revealed in the affected individuals. All 30 variants are classified as (likely) pathogenic according to the American College of Medical Genetics and Genomics classification system 29,30 and expected to be part of the underlying causative genetic defects responsible for the IRD phenotype in the respective individuals. A detailed summary of the number of genetically explained samples in this and previous studies can be found in Supplemental Figure 3.

Discussion
SVs are considered as important contributors to the mutational landscape of inherited disorders, including IRDs (Reurink et al [unpublished]). 5,6 Although SV detection did improve upon the arrival of short-read genome sequencing, the process of SV detection is still not trivial and SV interpretation is a complex process. On average, 10,000 SVs are called in a sample, of which >50 SVs overlap with the coding regions of an IRD-associated gene including several false-positive calls. There is an increased need for using standard prioritization protocols for the interpretation of SV data. In addition, SV algorithms are still continuously developing to improve the accurate detection of SVs. Despite these efforts, it is expected that many pathogenic SVs escape detection through short-read sequencing or are misinterpreted, which would suggest that these types of variants might well be one of the most important sources of the reported missing heritability for IRDs.
In this study, OGM was performed to identify large pathogenic SVs in a case that was heterozygous for a pathogenic USH2A variant. A large USH2A-disruptive pericentric inversion on chromosome 1 was identified, which genetically explained this case after decades of research. In hindsight, interrogation of the short-read genome sequencing data did reveal the presence of a large Source column represents whether the samples were previously analyzed in published or unpublished studies or the samples are novel and included in this study. More detailed variant data can be found in Supplemental Tables 2 and 3. chr, chromosome; del, deletion; dup, duplication; fs, frameshift; hem, hemizygous; het, heterozygous; hom, homozygous; inv, inversion; SV, structural variant; Zyg, zygosity. a Genome sequencing data were previously analyzed, and variant was not identified (published or unpublished studies). b Breakpoints could not be confirmed through Sanger sequencing; SV has been validated using genomic quantitative polymerase chain reaction only.
inversion event as well as a corresponding variant call. Split reads could be observed at the predicted inversion breakpoint junctions. This suggests that the USH2A inversion was overlooked and misinterpreted during previous analysis.
This finding prompted us to reanalyze our genome sequencing data sets that were collected over time and to perform a comprehensive reanalysis focused on the identification of pathogenic coding SVs overlapping with and disrupting an IRD-associated gene. This approach was very successful because 30 (likely) pathogenic SVs in 29 IRD probands were identified. Most remarkably, 8 of the identified pathogenic variants (>25%) were overlooked during initial analysis. We have critically evaluated our findings, and we found several explanations why these variants were not recognized previously. This allowed us to re-establish our SV prioritization protocols and analysis pipeline. Based on this, we would like to share the lessons that we learned from this reanalysis to facilitate and improve future analyses and interpretation of SV data sets ( Figure 4).
First, it is generally accepted that SV calling from shortread sequencing data is not optimal yet and it is still difficult to distinguish true variant calls from false-positive calls. Especially large SVs (>2 Mb) are often considered as noise and deemed false positive and are frequently excluded from analyses to simplify the prioritization process. 31 However, in this study, 2 of the identified variants exceed this size threshold. Both of these variants were not identified in previously performed genome analysis because of this reason. The identified large inversion events encompass hundreds of genes but exhibit an apparent breakpoint in USH2A and were validated through breakpoint PCR. This demonstrates that it is warranted to critically evaluate breakpoints of all variants that are called, especially when an SV has a breakpoint in a disease gene of interest. In our updated SV annotation pipeline, SV breakpoint information has now been implemented for all variant calls. In addition, we have noticed that the usage of multiple variant databases (both global and local) is crucial to allow a better discrimination between true and false-positive calls. Especially, the  Tables 2 and 3. implementation of an in-house SV database allowed us to exclude a significant number of false-positive calls from analysis that were introduced by technical errors in our sequencing pipeline.
Second, an important step in the SV calling pipeline is the quality assessment of a variant. Quality assessment is based on different aspects, such as the number of reads that are supporting the structural event or lack of paired Figure 4 Proposed flowchart for structural variant analysis from short-read genome sequencing data. A recommended workflow for the identification of potential causal structural variants (SVs) from short-read genome sequencing data. Initially, a stringent structural variant analysis of patient samples should be performed. In this stringent analysis, standard quality and filtering criteria (eg, a minor allele frequency of <1%) should be applied to allow fast and efficient identification of causal variants and a genome-wide analysis should be performed. Potential candidate variants should be validated by analyzing the sequencing reads as well as confirmed through breakpoint PCR and Sanger sequencing. Duplications and deletions can also be confirmed through a genomic qPCR. An extended analysis of the data should be performed in case when no causal variants were identified during the first stringent analyses. Nonstringent variant prioritization can be performed, focusing on variants identified in disease-associated genes (gene panel analysis) minimizing the number of variants to be interrogated in detail. In addition, a manual inspection of strong candidate genes should be performed to identify indications for possible structural variations (eg, read-depth changes or split reads) in regions of interest. When nonstringent SV analysis does not yield any potential candidate variants, genome-wide follow-up studies based on other DNA technologies such as long-read genome sequencing or optical genome mapping should be considered. AF, allele frequency; CNV, copy number variant; PCR, polymerase chain reaction; qPCR, quantitative PCR; seq, sequencing; SMRT, single molecule real time; SV, structural variant. sequencing reads supporting the event. A variant is granted a quality score, which indicates whether a variant has failed or passed this assessment. Quality scores are considered easy tools for variant filtering and an efficient way to reduce potential candidate variants. Nevertheless, because SVs usually are complex variants resulting from recombination in homologous or repetitive regions, coverage of these events are generally low, which could consequently result in a low quality score. Therefore, low quality scores without the filter, pass, should also be considered during SV analyses. In our study, 6 of 30 identified SVs received a quality warning by at least 1 variant caller. Five of these variants were not picked up in earlier genome analysis and previously overlooked, suggesting that filtering on quality scores is an important issue and could possibly explain why pathogenic variants are still missed during analysis. All of the variants that received a quality warning could be validated using Sanger sequencing or genomic qPCR. An important eye-opener based on this result is that quality scoring should not be used for prioritization or exclusion for SVs but could be used as supportive evidence only.
Third, our results show that it is of utmost importance that multiple SV caller tools, based on different lines of evidence, are combined to improve data interpretation. SV detection can be based on either read-pair evidence (split reads and discordant read pairs; eg, Manta structural variant caller) or readdepth evidence (copy number changes; eg, Manta structural variant caller and Canvas Copy Number Variant Caller). Generally, read-pair evidence allows the detection of shorter (>50 bp) variants both balanced and unbalanced, whereas readdepth evidence allows the detection of longer (>1 kb) CNVs and can also be applied to identify terminal deletions or duplications. To allow efficient SV detection, it is crucial that both types of evidence are combined for SV calling. In this study, we identified 27 unbalanced rearrangements, ie, duplication and deletion events. Only 15 of 27 variants were identified by both SV caller (Manta structural variant caller 15 ) and CNV caller (Canvas Copy Number Variant Caller 16 ). Although the importance of combining SV callers has been recognized for some time, there are still reports available in which only a single SV caller is employed in SV pipelines. Moreover, SV caller algorithms are still continuously improving, and therefore SV detection pipelines should be frequently evaluated or updated. A striking example is variant 1 described in this study: a partial heterozygous deletion of ARSG identified in an individual diagnosed with Usher syndrome type-IV. This individual was previously described in a recent study by Velde et al, 28 in which the heterozygous ARSG deletion was identified only after visual inspection of the sequencing reads in IGV. In the study described by Velde et al, 28 CNV calling was performed using Control-FREEC 32 and the ARSG deletion was not called. In this study, SV and CNV analysis was performed using an updated pipeline, in which the Canvas Copy Number Variant caller 16 has been implemented for CNV calling. This time, the ARSG deletion was called and recognized by Canvas.
Finally, although our findings encourage the use of shortread sequencing for the identification of SVs, results and variant calls should always be treated with caution. Not all variant calls prioritized in our reanalysis pipeline could be observed in the raw sequencing reads or validated using either breakpoint PCR or genomic qPCR. For one of these variants, an inversion event disrupting the EYS gene, the presence of split reads could be clearly observed in the short-read sequencing data, whereas the variant could not be validated using long-read genome sequencing yet (data not shown). Therefore, variant validation remains a crucial step of the prioritization process.
With these data, we have shown that short-read sequencing data in terms of SV detection are more powerful than assumed, however interpretation of SVs remains challenging. Also, SVs affecting repetitive regions or regions of high complexity most likely still remain undetected. There is ample evidence of SVs that could be detected using long-read sequencing approaches only because they are based on de novo assembly and improved mapping of highly homologues regions. 33,34 It was shown that long-read sequencing detects about 2.5-fold more SVs than multiple short-read SV detection algorithms combined. 11 Nevertheless, the results of our study indicate that it is worthwhile to invest additional effort and time in SV analyses and interpretation using short-read data. First, a genome-wide stringent SV analysis should be performed, which allows the rapid identification of known pathogenic variants but also allows the identification of possibly pathogenic SVs in genes that have not been associated with disease before. After an initial stringent analysis is performed, we would like to advocate that a nonstringent SV analysis should be applied that prioritizes SVs that affect known candidate disease genes. In addition, manual inspection of sequencing reads overlapping with strong candidate genes (eg, in cases with a monoallelic pathogenic variant or a strong genotype−phenotype correlation) should be performed to identify hints of possible structural variation (eg, coverage changes or presence of split reads) in the region of interest. In this way, we hypothesize that previously hidden pathogenic SVs can still be exposed from existing data sets before engaging expensive long-read methods.
In conclusion, we identified likely pathogenic SVs in 29 of 427 (6.8%) probands (genetically explained and unexplained samples). The current contribution of SVs to the mutational landscape of IRDs is estimated to be 5% to 15%, 5,7-9 which is in line with this percentage. However, samples included in this study were prescreened using exome sequencing or targeted gene panel sequencing, which already included CNV analysis. This strongly suggests that the true contribution of SVs is higher than the currently anticipated percentages. In addition, >25% of the pathogenic SVs identified in this study were overlooked and/or misinterpreted during initial sequencing analysis. Through an initial discovery using OGM, we have successfully explained a portion of missing heritability in our short-read genome sequencing cohort through reestablishing our protocols and pipelines. For the feasibility of this study, it was decided to focus on coding SVs only. It is clearly established that both SNVs and SVs can also have pathogenic consequences via noncoding mechanisms by affecting regulatory elements, topologically associated domain organization, splicing mechanisms, or untranslated region disruption. 2,35,36 It is likely that more pathogenic SVs are present in our data set, and follow-up analyses are warranted. This study highlights that SVs are an underestimated cause of IRDs and demand a sophisticated approach and more attention to facilitate detection during genome analyses.

Data Availability
Data are available upon request. All pathogenic and likely pathogenic variants identified in this study have been submitted to the Leiden Open (source) Variation Database (LOVD) (http://www.lovd.nl). All other genome sequencing data are subject to controlled access because they may compromise the privacy of research participants. These data may become available upon a data transfer agreement approved by the local ethics committee and can be obtained after contacting the corresponding author (S.E.d.B.) upon request.