Era of gapless plant genomes: innovations in sequencing and mapping technologies revolutionize genomics and breeding

Whole-genome sequencing and assembly have revolutionized plant genetics and molecular biology over the last two decades. However, significant shortcomings in first- and second-generation technology resulted in imperfect reference genomes: numerous and large gaps of low quality or undeterminable sequence in areas of highly repetitive DNA along with limited chromosomal phasing restricted the ability of researchers to characterize regulatory noncoding elements and genic regions that underwent recent duplication events. Recently, advances in long-read sequencing have resulted in the first gapless, telomere-to-telomere (T2T) assemblies of plant genomes. This leap forward has the potential to increase the speed and confidence of genomics and molecular experimentation while reducing costs for the research community.


Introduction
Plant genomes represent a unique space within genome research. The breadth and depth of genetic architectural diversity among model and crop systems, even among strains of the same species, have constantly presented challenges for generating high-quality assemblies without sequence gaps between telomeres. This concept of gapless genomes represents a compounding improvement for plant genomics that has recently been achieved through hybrid approaches that utilize short- and long-read single-molecule sequencing instruments to finally generate a telomere-to-telomere (T2T) reference for all chromosomes within simple or complex genomes (Figure 1). Telomeres, centromeres, and ribosomal repeats are common chromosomal features that have been recalcitrant regions for both human and plant genomes, but the fluid nature of plant genomes due to high levels of transposable elements (TEs) and segmental duplications (SDs) poses additional challenges. Although there is an arms race between genome integrity and stability, plants develop novel advantageous genetic variation for species adaptation through TEs; however, TEs have been difficult to characterize until recently. SDs are genomic repeats > 1 kb that are generally around 90% similar and often contain many variations of the same gene, contributing to functional evolution [1-3]. Because these regions are very similar, they are often incorrectly assembled or collapsed into a single sequence, obscuring important functional variation. A recent gapless assembly of indica rice found that 24% of the genome was composed of SDs and that 150 duplicated genes showed tissue-specific differential expression between the copies, suggesting sub- or neofunctionalization [4].
TEs often make up a substantial portion of plant genomes and are key components of regulatory architecture [5]. Long-read sequencing studies of many plant species frequently reveal notably more TE regions than were observed in earlier short-read or Sanger-based studies because long reads are able to sequence completely through the element. A Pacific Biosciences (PacBio)-based resequencing of maize captured > 100 000 copies of intact TEs and aided in defining a lineage-specific expansion of long terminal repeats in maize since its divergence from sorghum [6].
Ribosomal repeats resist coherent assembly via older sequencing methods. Cytogenetic studies in banana revealed clusters of 5S rDNA repeats on one arm of chr8 and in the centromeric regions of chr1 and chr3 [7], but these clusters were not seen in the v2 assembly, in which the repeats appeared to account for only 7.5 kb of the total sequence [8]. When the banana genome was resequenced with long nanopore reads, 2.2 Mb of rDNA repeats were included in the new T2T assembly, validating the original cytogenetic results [9] and further showing the power and potential of gapless genome strategies.
Regulatory element discovery can also be improved with T2T or near-gapless genome assemblies through the incorporation of epigenetic sequencing; long-read platforms can enable partial DNA methylation profiling, which can be combined with improved genomic coverage to find cis-regulatory motifs by scanning hypomethylated regions across the assembly. This was done recently with long-read assemblies of 26 maize germplasms [10], and also shows how gapless genomes can augment pan-genome analyses.
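The scan for hypomethylated regions described above can be illustrated with a minimal sketch. It assumes that per-site methylation fractions (0 to 1) have already been called along one assembled sequence; the function name, the 0.2 threshold, and the minimum run length of three sites are hypothetical choices for illustration, not values taken from the cited maize study.

```python
from typing import List, Tuple


def hypomethylated_regions(
    methylation: List[float],
    threshold: float = 0.2,  # assumed cutoff for "hypomethylated"
    min_sites: int = 3,      # assumed minimum number of consecutive sites
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of consecutive sites
    whose methylation fraction stays below `threshold`."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, level in enumerate(methylation):
        if level < threshold:
            if start is None:
                start = i  # open a new candidate run
        else:
            # close the current run and keep it only if it is long enough
            if start is not None and i - start >= min_sites:
                regions.append((start, i))
            start = None
    # handle a run that extends to the end of the sequence
    if start is not None and len(methylation) - start >= min_sites:
        regions.append((start, len(methylation)))
    return regions


levels = [0.9, 0.8, 0.1, 0.05, 0.1, 0.85, 0.1, 0.9, 0.0, 0.1, 0.15, 0.1]
print(hypomethylated_regions(levels))  # [(2, 5), (8, 12)]
```

In practice the candidate regions would then be passed to a motif-discovery step; a gapless assembly matters here because sites that fall in unassembled sequence cannot be scanned at all.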

Short-read sequencing
Next-generation sequencing surpassed Sanger sequencing as the favored method for primary genome assembly in the late 2000s. Short-read methods were used in combination with Sanger sequencing to generate the first hybrid read-type whole-genome sequence of Vitis vinifera [11]. Pyrosequencing was rapidly eclipsed by the reversible terminator chemistry developed by Illumina Inc. (formerly Solexa) [12], which reduced sequencing costs and increased both throughput and accuracy, despite initially providing contiguous reads of only 35 bp, and heralded a new generation of genome science.
Figure 1. Gapless genomes resolve recalcitrant regions. T2T chromosome assembly resolves all previously undeterminable sequences. (a) Whole chromosomes are made up of many recalcitrant regions that are difficult for short-read sequencing approaches to resolve, such as TEs, highly repetitive regions, tandem duplications, ribosomal repeats, and so on. (b) These problematic regions result in uncallable sections that form gaps in the entire assembly, (c) but single-molecule technologies and hybrid approaches can create reads that completely cover these once unresolvable genomic sequences. (d) These long reads are then combined with hybrid assemblers to ultimately create the (e) gapless genome. Created with BioRender.com.
Next-generation short-read sequencing strategies provide highly accurate sequence and high depth of coverage (> 60x) for nonrepetitive regions, and the resulting reads can be unambiguously aligned into distinct contigs for assembly. As of publication, Illumina-based technologies account for more than 90% of all sequencing done in the world, owing to their low cost and high capacity. Short-read approaches were initially designed for resequencing rather than de novo sequencing, and they are still applied to diversity and breeding panels as well as polyploid progenitors to create molecular breeding tools and support evolutionary analysis [13,14]. These methods have also been employed in large pan-genome projects that have sequenced tens to thousands of individuals [15-17], giving an unparalleled view into species diversity and the genomics that drive important phenotypes.
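To make the depth-of-coverage figure concrete, the sketch below applies the standard expected-coverage relation (total sequenced bases divided by genome size). The 400 Mb genome size and the 150 bp read length are assumptions chosen for illustration; only the 60x target and the 35 bp early read length come from the text above.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: int, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target depth."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


GENOME_SIZE = 400_000_000  # hypothetical 400 Mb plant genome

print(reads_needed(60, 150, GENOME_SIZE))  # 160000000 reads at an assumed 150 bp
print(reads_needed(60, 35, GENOME_SIZE))   # 685714286 reads at 35 bp
```

This is an average over the whole genome. It says nothing about repetitive regions, where reads shorter than the repeat cannot be placed unambiguously however deep the sequencing.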

Long-read sequencing
The subsequent, or third, generation of sequencing technologies tackled long DNA fragments (> 1 kb) through PacBio Single Molecule, Real-Time (SMRT) cell technology [18] and the Oxford Nanopore MinION [19]. Where short reads use clonal amplification to overcome sequencing error and increase signal, these methods sequence a single molecule at a time. Skipping amplification allows sequencing to continue far longer along the molecule than short-read methods, but leads to a reduction in accuracy. Thus, long reads often require either higher coverage [20] or accurate short reads for error correction [21] to create a highly contiguous assembly. Read length and accuracy have continued to improve since these initial platforms, yielding reads of 10-25 kb through PacBio High Fidelity (HiFi) circular consensus sequencing [22] and reads of > 300 kb for Oxford Nanopore sequencing [23]. Continued improvement of HiFi sequencing has pushed read accuracy above 99% [22,24], and newer plant-specific base-calling models have pushed nanopore accuracy above 95%, further strengthening genome accuracy. Long-read incorporation has resolved significant issues plaguing larger, complex genomes, namely by generating unambiguous reads and contigs that are long enough to span structural variants, repetitive regions, and TEs (Figure 2). For example, a gapless watermelon assembly discovered an additional 173 genes located in gap regions of the prior assembly, including two LRR-RLK genes, which play crucial roles in plant development [25]. Long-read platforms also have the ability to partially sequence the epigenome and epitranscriptome [10,26].
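The idea behind circular consensus sequencing, in which repeated noisy passes over the same molecule are combined into one accurate read, can be sketched with a toy majority vote. This is a deliberate simplification and not PacBio's actual algorithm: it assumes the passes are already aligned, have equal length, and contain only substitution errors, whereas real single-molecule reads also contain insertions and deletions. The function name and example sequences are hypothetical.

```python
from collections import Counter
from typing import List


def majority_consensus(passes: List[str]) -> str:
    """Column-wise majority vote over pre-aligned, equal-length passes."""
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")
    consensus = []
    for column in zip(*passes):
        # the most frequent base in this column wins
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five passes over the same hypothetical molecule (true sequence ACGTACGT);
# three passes each carry a single substitution error.
passes = ["ACGTACGT", "ACGAACGT", "ACGTACCT", "TCGTACGT", "ACGTACGT"]
print(majority_consensus(passes))  # ACGTACGT
```

Because independent errors rarely hit the same position in most passes, the consensus is more accurate than any single pass. This is the intuition behind pairing long reads with either higher coverage or a correction step.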

Optical maps
Initially developed in the 1990s, optical mapping used microscopy to visualize a restriction enzyme map over ranges from 0.2 to 1 Mb [27]. Since then, genome quality has been bolstered by using optical maps to greatly reduce the total number of assembly contigs by expanding the effective length of the chromosome-wide pseudomolecule [28-30]. Some long-read and most short-read hybrid assemblies that incorporate optical maps see the number of contigs fall by one or two orders of magnitude relative to sequencing-based contig construction alone [31]. Zea mays is a paragon of this approach; optical mapping generated 63 contigs [32] compared with the > 100 000 contigs from the initial Sanger sequencing assembly a decade prior [33].

Hi-C
High-throughput chromosome conformation capture (Hi-C) is a chromatin conformation capture technique [34] that relies on cross-linking regions of chromatin before cleavage and ligation of sequencing-compatible adapters. The resulting fragments are sequenced on a short-read sequencing instrument, yielding unbiased genome-wide chromatin interaction maps that reveal global patterns of interactions and rearrangements such as translocations and inversions. Such long-range information is essential to achieve phasing in polyploid plant samples. While long reads alone can effectively phase diploid samples, as seen in Z. mays [35], H. lupulus [36], and D. alata [37], phasing of polyploid samples has remained a major hurdle for crop science. Overcoming it matters because properly phased chromosomes can serve to identify regions of heterozygosity in hybrid breeding programs or within heavily heterozygous species, such as obligate outcrossers. Phasing polyploids required developing new computational strategies such as those used to phase hexaploid bamboo [38] and autotetraploid alfalfa [39]. A Hi-C variant called Pore-C, which uses long nanopore reads in place of short reads, is now being developed; it was recently used to assess chromatin interactions in Arabidopsis, preserving methylation data in addition to long-range DNA interactions [40]. These combined strategies revealed significant differences in haplotypes, which can contribute to plant phenotypes and ultimately the success or failure of breeding programs.

Assembly technologies
Sequencing methods and longer DNA fragment extraction have improved the quality of assemblies significantly, but those advances rely on parallel improvements in assembly algorithms. Assemblers such as ALLPATHS (named for finding 'all paths' in a graph-based approach) [41] use de Bruijn graphs that integrate typical sequencing methods with longer-range data via mate-pair libraries. However, they still rely on short-read sequencing technology and are not capable of effectively assembling long, noisy reads. Others, such as Celera [42], were based on overlap-layout-consensus approaches and used Sanger sequencing to assemble early plant genomes such as papaya [43]. Eventually, hybrid approaches employed both accurate short reads and long noisy reads, as seen in PacBio Corrected Reads (PBcR) [44] and ALLPATHS-LG [45]. As sequencing throughput has increased, the need for less computationally intensive assembly methods has become critical. For perspective, the first long-read D. melanogaster assembly was estimated to take 600 000 CPU hours using the Celera/PBcR assembler [46], in contrast to a Canu + early PacBio-based assembly of A. thaliana that required 925 CPU hours [47], while hexaploid wheat using MaSuRCA + PacBio took 470 000 CPU hours.
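The de Bruijn graph construction mentioned above can be shown with a minimal sketch. It is not the ALLPATHS implementation: the function name, the toy reads, and k = 3 are assumptions for illustration, and real assemblers add error correction, graph simplification, and paired-end information. Each k-mer in a read contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_de_bruijn(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping reads from a repeat-free toy sequence give a single path.
print(build_de_bruijn(["ACGTC", "GTCAG"], k=3))
# {'AC': ['CG'], 'CG': ['GT'], 'GT': ['TC'], 'TC': ['CA'], 'CA': ['AG']}

# A read containing the repeated 3-mer ATG creates a branch at node TG.
print(build_de_bruijn(["ATGCATGG"], k=3))
# {'AT': ['TG'], 'TG': ['GC', 'GG'], 'GC': ['CA'], 'CA': ['AT']}
```

The branch in the second graph is the small-scale version of the problem described throughout this review: when a repeat is longer than k, the graph alone cannot tell which path is correct, which is why longer-range data or longer reads are needed.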

Extraction
While advances in sequencing technology have been crucial to genome assembly improvements, acquiring high-quality plant DNA remains quite challenging. Unlike humans or mammals in general, plants have incredibly diverse metabolites that impact yield, quality, and performance of extracted DNA. While nanograms of DNA are required for short-read libraries, current single-molecule approaches often need more than 10 µg of DNA that is > 10 kb in length [48]. Interestingly, as the capacity to sequence increasingly longer fragments of DNA emerged alongside the resurgence of optical mapping technologies, kits for high-molecular-weight DNA extraction are now being eschewed for older methods: cetyltrimethylammonium bromide (CTAB)-based nuclei preparations were used for preparing DNA from the coast redwood [49], and gel-plus-lysis methods were applied for nanopore sequencing and optical mapping to generate a chromosome-scale assembly of Sorghum [50]. Nevertheless, these methods remain difficult and time-consuming, limiting the adoption of high-throughput sequencing for diverse plant samples.

Applications and examples
Strategies employed by T2T consortiums can be adapted to complex plant genomes to achieve truly contiguous, phased assemblies critical for annotating genes and noncoding regulatory elements (such as TEs and distal enhancers), genomic and marker-assisted breeding, transgenic and genome editing approaches, and genome-wide association studies (GWAS). Trait discovery can be enhanced via gapless genomes through improved annotation of structural variants or highly duplicated gene families that confer environmental response phenotypes, such as disease resistance alleles [4,9,25,32]. Gapless genomes can also help in the projection of conserved regions and gene models to different species [51], thus improving overall annotation quality and evolutionary understanding of priority traits. This applies equally to more complex genomes, including those of highly repetitive and polyploid organisms. These genome assemblies are more stable and require less resequencing and fewer version releases for the community, which lowers the time cost for researchers who rely on genomics to underpin experiments and breeding programs. Additionally, current long-read platforms can detect certain forms of DNA methylation, conferring the benefit of partial epigenome sequencing as well [52]. A recent A. thaliana assembly showed distinct methylation patterns between pericentromeric and centromeric regions, and this approach has applications toward dissecting important regulatory epigenomic sections in other plants and crop systems [53].
With continually decreasing costs, the ability to generate a high-quality, phased reference genome for any organism lowers the barriers to improving germplasm stock and to broadening biotechnological approaches to more and varied types of crop systems that previously would have gone undervalued due to their complex genomes. This is the case for the watermelon gapless assembly, which resulted in more confident gene identification and annotation and allowed the authors to more accurately quantify and locate ethyl methanesulfonate (EMS)-induced SNPs within the Citrullus lanatus G42 lineage [25], providing a stronger understanding of mutagenic SNP saturation while mapping a male-sterile trait. This T2T genome also yielded an example of accelerated functional gene identification in other cultivars, specifically by filling in the unsequenced upstream region of the Cla97C10G197910 gene (which confers rind hardness) within the gapped genome of the 97103v2 cultivar.

Summary and concluding remarks
These technologies have significant benefits, but each has limitations in achieving a fully contiguous T2T assembly. Short reads are highly accurate, but fragment length curtails assembly and misses large amounts of the genome. Conversely, long reads are able to sequence through complex regions of the genome and can provide unambiguous alignments to repetitive regions, but their overall lower accuracy and throughput can lead to misassemblies, and isolating adequate high-molecular-weight DNA remains difficult. While long-read technologies are generally thought to have very high costs, this is rapidly changing as methods and technologies are developed (Figure 3). These continuing improvements and cost reductions lower the entry barrier for researchers studying plants with complex genomes and ultimately remove barriers to applying conventional and molecular breeding approaches to a wider array of crop systems globally. The future of gapless genomes will continue to rely upon the iterative improvement of long-molecule sequencing, DNA isolation, assembly tools, and community cooperation for creating quality reference assemblies that inform pangenomes and are translatable across related species.

Conflict of interest statement
The authors declare the following financial interests/personal relationships that may be considered as potential competing interests: W. Richard McCombie is a founder and shareholder in Orion Genomics that works in plant genomics. Orion Genomics had no part in the preparation of this paper.

Data Availability
The authors used publicly available data from NCBI, Ensembl Plants, and Gramene.