Introduction

During the two decades following publication of the first larger eukaryotic genomes (i.e., Drosophila melanogaster1 and Homo sapiens2), considerable progress has been made in sequencing technology and assembly methods, improving our basic knowledge of genome complexity across the tree of life. For example, we now understand that genome composition (e.g., gene complement, the extent of intergenic space, and the landscape of transposable elements (TEs)) varies substantially at both the inter- and intra-specific levels.

Throughout the genomic era, continual improvements have been made to both the data underlying assemblies and assembly algorithms. The first reference genomes employed bac-by-bac approaches with long-read Sanger sequencing generated at great expense over several years2,3. Next-generation assemblies initially relied on short-read data due to cost and technological limitations. These data have often been used for assembly in a combination of paired-end and mate pair formats with local assembly of contigs relying on paired-end sequences, which are then scaffolded using mate pair information4,5,6. While these assemblies represented genes reasonably well, repetitive regions containing TEs and tandem repeats were either omitted or highly fragmented7. Newly developed long-read sequencing technologies now enable contiguous assembly of even the repetitive fraction of eukaryotic genomes8 with, e.g., a complete telomere-to-telomere human X chromosome recently being assembled9. Highly contiguous long-read assemblies have now been published for a wide range of species10,11,12,13. Recent long-read assemblies in maize also show considerable improvement in both completeness and contiguity relative to previous efforts13,14,15,16, suggesting these data are particularly useful for plant species like maize with genomes that are large (2.3 Gb), complex (paleopolyploid and comprised primarily of TEs), and highly repetitive (extensive tandem sequence arrays in heterochromatic knobs and centromeres). Assembly algorithms are also continually being benchmarked and improved to better utilize long-read data with substantial variability in performance of particular methods observed across species10,17,18,19.

The cost of long-read sequence data can still be prohibitive for species with larger genomes, and the critical target for average read length and read depth remains unclear. A full assessment of the impacts of varying sequence read length and depth on the contiguity and completeness of assemblies in genic and repetitive regions is therefore essential for informed allocation of finite resources. Here we conduct a comprehensive assembly experiment using subsets of a high-depth, long-read (PacBio) dataset for the maize inbred line NC358 to evaluate critical inflection points of quality during the assembly of a complex, repeat-rich genome.

Results

Genome sequence subsampling and assembly statistics

We sequenced the NC358 genome to 75× depth (based on a ~2.27 Gb genome size20) using the PacBio Sequel platform, which generated a subread N50 of 21.2 kb (Table 1, Supplementary Table 1, and Supplementary Fig. 1). To identify an optimal assembly approach for this study, the complete data from NC358 were each assembled using Falcon21, Canu22, WTDBG223, and hybrid approaches in which Falcon was used for error correction and Canu, Flye24, and Peregrine25 were used for assembly (Supplementary Table 2). A subset of methods (Supplementary Table 2) were confirmed to be robust using data from the B73 v4 genome assembly (68× depth)13. All assembled contigs were superscaffolded with a de novo Bionano optical map (Supplementary Fig. 2) and pseudomolecules were constructed based on maize GoldenGate genetic markers26 and high-density maize pan-genome markers27 (Methods). The Falcon–Canu hybrid assemblies of both the NC358 and B73 genomes showed consistently higher quality in terms of contig length, Bionano conflicts, Benchmarking Universal Single-Copy Orthologs (BUSCOs)28, and LTR Assembly Index (LAI)8 (Supplementary Table 2); thus, this method was used for all subsequent assemblies performed on subsets of the data.

Table 1 Summary statistics for NC358 assemblies.

NC358 subreads were downsampled from 75× to 60×, 50×, 40×, 30×, and 20×, while maintaining a 21 kb subread N50 and to 50× depth with a subread N50 of 11 and 16 kb. For these latter assemblies, we based read-length distributions on empirical datasets generated for the human HG00229 and maize B73 v413 genome assemblies (Supplementary Fig. 3). NC358 read subsets were error-corrected and assembled independently using the hybrid assembly approach described above (Methods and Supplementary Note 1). These processes were resource-intensive and were accelerated through cloud computing. The central processing unit (CPU) time required for both Falcon error correction and Canu assembly increased substantially as read depth increased, whereas the required maximum memory was fairly similar (Fig. 1h and Table 1).

Fig. 1: Assembly of NC358 using various read lengths and coverage.
figure 1

a Hybrid scaffolding using the Bionano optical map. A 199 Mb scaffold from chromosome 5 is shown. Gray areas on the chromosome cartoon represent the 199 Mb scaffold; the white area is the remaining 23 Mb scaffold in chromosome 5; the red dot is the centromere. Green tracts represent scaffolded sequences and blue tracts show the contigs that comprise this scaffold with contigs jittered across three levels. b Contig NG(x). c Scaffold NG(x). d Benchmarking Universal Single-Copy Orthologs (BUSCO) of Pilon-polished assemblies. e The number of assembly misjoins revealed by DLE-1 conflicts and the number of contigs of each assembly. f Regional LTR Assembly Index (LAI) values estimated based on 3 Mb windows with 300 kb steps. The box shows the median, upper, and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. Outliers are plotted as dots. g Unique mapping rate of RNA-seq libraries. A total of ten tissues with two biological replicates were sequenced. Each dot represents an RNA-seq library. The box shows the median, upper, and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. h Central processing unit (CPU) core hours required for Falcon correction and Canu assembly. i Bionano optical map inconsistency. Deletions and insertions are cases where sequences are shorter or longer than the size estimated by the optical map, respectively. Structural variations (SVs) are plotted as dots. The box shows the median, upper and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. Source data underlying Fig. 1b, c, f, g are provided as a Source Data file.

Most assemblies had a total contig size covering >92% of the flow-cytometry estimated genome size of NC358 (2.27 Gb20), with the notable exception of the 21k_20x assembly (70.4% covered; Table 1). Contig length metrics were positively correlated with both read length and sequence coverage (Fig. 1b), with the highest contig N50 (24.54 Mb) and the longest contig (79.68 Mb) observed in the 21k_75x and 21k_60x assembly, respectively (Table 1). A dramatic drop in contiguity was observed for both the lowest depth (21k_20×) and shortest sequence length (11k_50×) assemblies, where the number of contigs was 17×–32× more than the complete 21k_75x dataset (Table 1 and Fig. 1e).

For each assembly, superscaffolds were generated from the contigs using a common Bionano optical map. Even the most fragmented Falcon–Canu assembly could be scaffolded to high contiguity using this optical map due to the high density of labels in the map (Fig. 1a–c). The resulting assemblies all had scaffold N50s at ~98 Mb (Table 1), highlighting the utility of a high-quality optical map when sequencing quality is suboptimal. In fact, chromosome 3 (~237 Mb) consisted of a single scaffold in five out of eight assemblies (Table 1). However, conflicts vs. the Bionano map were much higher in the assemblies with 20× coverage and a subread N50 of 11 kb (Table 1 and Fig. 1e), suggesting assembly error increased with lower coverage and shorter read length. Assemblies with shorter read length contained many more deletions relative to the optical map (Fig. 1i; Supplementary Table 3), which may be due to the collapse of repetitive sequences. We did not observe a clear pattern between read length and deletion size (Fig. 1i). Assembly misjoins were reduced with both longer reads and higher coverage, as shown by the relative number of insertions (Fig. 1i and Supplementary Table 3).

For each of the assemblies, pseudomolecules were constructed using the GoldenGate and pan-genome genetic markers, which placed >99% of the total assembled bases into pseudomolecules (Supplementary Table 4 and Supplementary Fig. 4). The resulting NC358 pseudomolecules were highly syntenic across assemblies and to the B73 v4 genome (Supplementary Fig. 5).

Evaluation of the gene assembly space

We evaluated the completeness of gene-rich regions in each of the assemblies using BUSCO28. The percentage of complete BUSCO genes increased from 68.0% to 96.3% from the 21k_20x to the 21k_75x assembly (Table 1, Fig. 1d, and Supplementary Table 5). Minimal improvement in BUSCO scores was achieved at depths higher than 30× (95.5% complete BUSCO genes), indicating this depth provides satisfactory gene-space assembly.

To further evaluate the assembly of genic regions, we annotated gene models in the 21k_20x and the 21k_75x assemblies (Methods) and obtained a total of 28,275 and 39,578 genes, respectively (Supplementary Table 6). More than 99% of the 21k_75x genes could be mapped to all but the 21k_20x assembly, to which only 90% were mapped (Supplementary Table 7). Exon and intron lengths of the annotated genes were similar across the assemblies (Supplementary Table 6). However, genes in the 21k_20x assembly contained 50 times more gaps than those in the 21k_75x assembly (Supplementary Table 7), suggesting higher coverage contributed to the contiguity of genic regions. Longer reads similarly improved the gene-space contiguity (Supplementary Table 7).

In addition, we sequenced RNA libraries from ten tissues with two biological replicates (Methods). On average, 80% of reads in these libraries could be uniquely mapped to the various NC358 assemblies (Fig. 1g). The 21k_20x assembly was a notable exception with only 63% of reads uniquely mapped (Fig. 1g and Supplementary Fig. 6). We extracted the reads that did not map to the 21k_20x assembly and remapped them to the 21k_75x assembly, obtaining a unique mapping rate of 36% (Supplementary Table 8). These reads mapped to 3184 genes in the 21k_75x assembly (Supplementary Table 9). Of these genes, 20% are present in the 21k_20x assembly but had assembly errors that prevented the RNA sequencing (RNA-seq) reads from mapping, while the remaining 80% were within sequence gaps (Supplementary Table 9).

In addition to metrics of gene completeness, we also examined each assembly for its ability to capture two notable maize tandem gene arrays, Rp1-D30 and zein31. The total length of these gene arrays was estimated at 536 kb and 62 kb in NC358, respectively, based on the optical map. Both the Rp1-D and zein loci were completely assembled in all, except for the 21k_20x assembly, where only 70% and 91% of the loci were assembled, respectively (Fig. 2g and Supplementary Table 10).

Fig. 2: Assembly of repetitive components in the NC358 genome.
figure 2

a The assembled size of the 180-bp knob repeat, the knob TR-1 element, the chromosome 6 nucleolus organizer region (NOR) region, CentC arrays, and subtelomere arrays in each of the NC358 assemblies. b Length distribution of long terminal repeat (LTR) retrotransposons longer than 26 kb. Each dot represents an annotated sequence. The box shows the median, upper and lower quartiles. Whiskers indicate values ≤ 1.5× interquartile range. d Telomere 7-mer counts in telomere regions of NC358 assemblies. Assembly of (c) LTR retrotransposons, (e) CentC arrays, (f) the chromosome 6 NOR region, (g) the Rp1-D and zein tandem gene arrays, and (h) two example knobs in each of the NC358 assemblies. The NC358 Bionano optical map was used to estimate the size of these components. Ngap, estimated gap size. Source data underlying Fig. 2a, b, i are provided as a Source Data file.

Evaluation of the TE assembly space

The completeness of transposon-rich regions of the genome was assessed through the assembly index of LTR retrotransposons, called LAI8. A higher LAI score is indicative of a more complete assembly in TE-rich regions. The 21k_20x assembly had a substantially lower LAI score compared to other assemblies (LAI = 12.2; Table 1). As sequence depth increased a substantial improvement in LAI was observed, while the effect of sequence length on LAI was minimal (Fig. 1f). This is likely due to the fact that the length of LTR retrotransposons is 10 kb on average (Supplementary Fig. 7), which could be spanned by even the 11 kb reads. The assemblies that were generated from ≥40× genomic depth achieved gold quality (LAI ≥ 20 (ref. 8)) (Table 1 and Fig. 1f), which was comparable to the B73 v4 genome and much higher than many previously published maize genome assemblies generated with short-read data (Supplementary Fig. 8).

The insertion time of each LTR retrotransposon can be dated based on sequence divergence between terminal repeats8. We identified 36% fewer intact LTR retrotransposons in the highly fragmented 21k_20x assembly (Supplementary Fig. 9) and significantly older LTR elements in the 11k_50x assembly (p < 10−5, one-way analysis of variance followed by Tukey’s test), suggesting fragmentation of assemblies could bias conclusions of transposon studies. LTR retrotransposons shorter than 26 kb were assembled well across the assemblies (Supplementary Figs. 10 and 11). However, a substantial effect of longer reads and higher depth was observed in the assembly of LTR sequences longer than 26 kb (Fig. 2b). We examined the assemblies of the longest LTR sequence clusters using the Bionano optical map and found most assemblies contained no gaps and were virtually complete (Fig. 2c), with the notable exception that the 11k_50×, 16k_50×, and 21k_20x assemblies, which contained large gaps in one of the LTR clusters (Supplementary Table 11). We also inspected the bz locus32, which has highly nested transposon insertions and an estimated size of 303.5 kb in NC358. The bz locus was well assembled in all but the 21k_20x assembly, in which only 56.3% of the sequence was included (Supplementary Table 12). In summary, with ≥40× of sequence coverage, long-read sequencing and assembly can traverse most transposon-rich genomic regions including relatively long LTR sequences, although with shorter reads (i.e., read N50 of 11–16 kb) this sequencing depth may not be sufficient.

Evaluation of the non-TE repeat assembly space

The assembly of non-TE tandem repeat space was also evaluated, including telomeres (7 bp repeats), subtelomeres (300–1300 bp repeats), CentC arrays (156 bp repeats), nucleolus organizer region (NOR, ~11 kb repeats), and the two major knob repeats (mixture of 180 and 350 bp repeats) (Fig. 2a and Supplementary Table 13). The effects of sequence read depth and sequence read length were far more pronounced across many of these tandemly duplicated portions of the genome (Fig. 2a).

Telomeres are characterized by 7 bp tandem repeats at the end of each chromosome. Our results showed a substantial increase in the assembled length of telomere sequence with the increase of both read length and sequence coverage (Fig. 2d and Supplementary Table 14). However, a precise estimate of telomere length was not possible with our optical map due to the lack of Bionano DLE-1 sites in these highly repetitive regions. Using the full dataset (21k_75×), only 10 of 20 telomere–subtelomere combined regions were assembled to >90% of the Bionano estimated size (Supplementary Table 15), suggesting even longer reads and higher coverage are required for the full assembly of these regions.

The centromere is one of the most repetitive regions of many species’ genomes including maize. We characterized NC358 centromeres based on CentC arrays33, which are abundant in functional centromeric regions34. Even with the full dataset (21k_75×), only half of CentC arrays were assembled (Fig. 2e, Supplementary Fig. 12, and Supplementary Table 16). Hybrid scaffolded assemblies with sequence coverage ≥60× yielded a better approximation to the Bionano estimated size, even though these regions largely consisted of gaps (Fig. 2e). Although assembled sequences were not significantly increased, higher sequence depth resulted in better anchoring of sequences with the Bionano optical map. Only three centromeres, which contained a mixture of CentC arrays, transposons, and intergenic sequences, could be traversed by Bionano DLE-1 labeling due to having a comparatively higher content of low-copy sequence34. The size of the remaining centromeres was likely underestimated (Supplementary Fig. 13) and further improvements in scaffolding technology are required to traversing these regions.

The NOR is enriched with ribosomal DNA (rDNA) and spans approximately 9 Mb on chromosome 6 of NC358 (Supplementary Table 17). Longer read length improved the assembly of this region, but substantial differences were not observed with coverage ≥30× (Fig. 2f). Approximately 72% of the NOR was included in the 21k_30x assembly and this improved by just 9% to 81% in the 21k_75x assembly (Supplementary Table 17 and Fig. 2f).

Finally, maize knobs are heterochromatic regions consisting of 180 bp (knob180) and 350 bp (TR-1) repeats35. We used the Bionano optical map to assess the assembly of two knobs that together spanned a total of 5 Mb. With longer reads and higher coverage, more knob sequences were assembled, with 6.5% of the two knobs present in the 21k_20x assembly and up to 65% in the 21k_75x assembly (Supplementary Table 18 and Fig. 2h).

Discussion

Recent innovations in long-read and scaffolding technology have made highly contiguous assembly possible across a wide range of species. We have documented how both the completeness and contiguity of assemblies improve with increasing depth and read length. The biological aims of an investigation must be considered when determining the level of investment in depth of sequence. With long-read sequencing, the low-copy gene space (including tandem gene arrays) can be well assembled with as low as 30× genomic coverage across a range of read lengths. Complete characterization of TEs in complex genomes such as maize will require a greater depth of sequence (~40×) and should employ library preparation protocols that maximize read-length N50. Finally, complete assembly of highly repetitive genomic features such as heterochromatic knobs, telomeres, and centromeres will require substantially more data36. In fact, complete assembly of these latter highly repetitive sequences will likely require innovations beyond current sequencing technology.

Methods

Sample preparation

Seeds for the maize NC358 inbred line were obtained from GRIN Global (seed stock ID Ames 27175), grown, and self-pollinated at Iowa State University in 2017. A total of 144 seedlings derived from a single selfed ear were grown in the greenhouse. Leaf tissues from the seedlings at the Vegetative 2 (V2) growth stage were sampled after a 48-hour dark treatment to reduce carbohydrates. A total of 35 g of tissue was collected and flash-frozen. Tissue was sent to the Arizona Genomics Institute (AGI) for high-molecular-weight DNA isolation using a CTAB protocol37.

Illumina and PacBio sequencing

Pacific BioSciences long-read data for NC358 were generated at AGI using the Sequel platform. Libraries were prepared using the manufacturer’s suggested protocol (https://www.pacb.com/). The subreads that were generated covered the genome at an estimated 75-fold depth (75×) with a subread N50 of 21,166 bp. Reads from each SMRT cell were inspected and quality metrics were calculated using SequelQC38. After validating the PSR (polymerase to subread ratio) and ZOR (ZMW occupancy ratio) were satisfactory, all subreads were used for subsequent steps.

Paired-end Illumina data for NC358 were generated at the Georgia Genomics and Bioinformatics Core from the same DNA extraction as was used for the long-read sequencing. Quality control of DNA was conducted using Qubit and Fragment Analyzer to determine the concentration and size distribution of the DNA. The library was constructed using the KAPA Hyper Prep Kit (catalog number KK8504). During library preparation, DNA was fragmented by acoustic shearing with Covaris before end repair and A-tailing. Barcoded adaptors were ligated to DNA fragments to form the final sequencing library. Libraries were purified and cleaned with solid phase reversible immobilization (SPRI) beads before being amplified with PCR. Final libraries underwent another bead cleanup before being evaluated by Qubit, quantitative PCR (qPCR) (KAPA Library Quantification Kit catalog number KK4854), and Fragment Analyzer. The final pool undergoing Illumina’s Dilute and Denature Libraries protocol was diluted to 2.2 pM for loading onto the sequencer and then sequenced with 1% PhiX by volume. Libraries were sequenced on the NextSeq 500 instrument using PE150 cycles. The demultiplexing was done on Illumina’s BaseSpace.

PacBio SMRT subreads for the maize inbred line B73 (sequenced to 68× depth) were retrieved from the NCBI SRA database with accession ID SRX1472849 (ref. 13). PacBio SMRT subreads for the human HG002 sample (sequenced to 147× depth) were retrieved with accession IDs SRX1033793 and SRX1033794 (ref. 29).

Downsampling full dataset

The 75× SMRT Sequel full data from maize NC358 was downsampled to 60×, 50×, 40×, 30×, and 20x data using seqtk (v1.2) (https://github.com/lh3/seqtk). Downsampling was performed as serial titration, in which each dataset was the superset of the next smaller dataset, and was sampled to have similar length distributions (Supplementary Fig. 3). The N50 of the downsampled datasets were almost identical to the N50 of the full 75× data (Table 1). There is a chance that the subsampling process could pick a disproportionate number of poor reads and result in a poor assembly. There is also a chance that the subsampling process could pick, on average, better quality reads and result in a higher quality assembly. However, evidence suggests it is more likely that the small sample size (i.e., 20×) could not provide enough sequence information to reconstruct the original genome, as has been observed in a study based on simulated and empirical long-read data to test the effect of different depths of Prokaryotic genome assemblies19.

Shifting read-length distribution of subreads

Two more NC358 datasets were downsampled and trimmed from the original 75× SMRT dataset to match the read-length distribution of the maize B73 data13 and the human HG002 data29, which had read N50 lengths of ~16 kb and ~11 kb, respectively (Supplementary Fig. 3). To do this, first, the read lengths of the maize B73 and human HG002 data were each sorted in descending order. For each read-length value, all subreads from NC358 that were longer than said value were randomly sampled without replacement and clipped to have matched read length. The unused clipped part of the read was put back in the pool for further use with short read length. This distribution-shifting approach was chosen to achieve a realistic distribution of read length rather than trimming all reads by fixed lengths. These datasets were labeled as 16k and 11k based on their N50 of subread data of 16,765, and 11,092, respectively.

RNA tissue sampling and sequencing

Samples from ten tissues throughout development were collected to generate expression evidence for gene annotation. Two biological replicates were collected for each tissue type and each replicate consisted of three individual plants. The tissues that were sampled were as follows: (1) primary root at 6 days after planting; (2) shoot and coleoptile at 6 days after planting; (3) base of the tenth leaf at the Vegetative 11 (V11) growth stage; (4) middle of the tenth leaf at the V11 growth stage; (5) tip of the tenth leaf at the V11 growth stage; (6) meiotic tassel at the Vegetative 18 (V18) growth stage; (7) immature ear at the V18 growth stage; (8) anthers at the Reproductive 1 (R1) growth stage; (9) endosperm at 16 days after pollination; and (10) embryo at 16 days after pollination. Tissue from developmental stage V11 and older were taken from field-grown plants, while all younger tissue samples were taken from greenhouse-grown plants. For the endosperm and embryo samples, tissue from 50 kernels per plant (150 total per biological replicate) were sampled. Greenhouse-grown plants were planted in Metro-Mix300 (Sun Gro Horticulture) with no additional fertilizer and grown under greenhouse conditions (27 °C/24 °C day/night and 16 h/8 h light/dark) at the University of Minnesota Plant Growth Facilities. Field-grown plants were planted at the Minnesota Agricultural Experiment Station located in Saint Paul, MN with 30 inch row spacing at ~52,000 plants per hectare. RNA was extracted using the Qiagen RNeasy plant mini kit following the manufacturer’s suggested protocol.

The quality of the total RNA was assessed by Bioanalyzer or Fragment analyzer to determine RNA concentration and integrity. The sample concentration was normalized in 25 µL of nuclease-free H2O before library preparation. Libraries were prepared using KAPA’s stranded mRNA-seq kit with halved reaction volumes. During library preparations, mRNA was selected using oligo-dT beads, the RNA was fragmented, and cDNA was generated using random hexamer priming. Single or dual indices were ligated depending on the desired level of multiplexing. The number of cycles for library PCR was determined based on kit recommendations for the amount of total RNA used during library preparation. Libraries were quality control checked using Qubit or plate reader, depending on the number of samples in the batch for library concentration, and fragment analyzer for the size distribution of the library. The pooling of samples was based on qPCR. The pooled libraries were then checked by Qubit, Fragment Analyzer, and qPCR.

RNA libraries were prepared for sequencing on Illumina instruments using Illumina’s Dilute and Denature protocol. Pooled libraries were diluted to 4 nM, then denatured using NaOH. The denatured library was further diluted to 2.2 pM and PhiX was added at 1% of the library volume. RNA pools were sequenced on a NextSeq 550 to generate 75 bp pair-end reads. On average, 24.5 million pair-end reads were generated per replicate per tissue type, for a total of 489 million reads across all samples. Data were demultiplexed and trimmed of adapter and barcode sequences on BaseSpace (Supplementary Fig. 14).

Bionano data generation

The DNA extraction was performed using the Bionano Prep™ Plant Tissue DNA Isolation Kit according to a modified version of the Plant Tissue DNA Isolation Base Protocol. Approximately 0.5 g leaf tissue was collected from young etiolated seedlings germinated in soil-free conditions and grown in the dark for 2 weeks after germination. Freshly cut leaves were treated with a 2% formaldehyde fixing solution and then washed, cut into small pieces and homogenized using a Qiagen TissueRuptor probe. Free nuclei were concentrated by centrifugation at 2000 × g, washed, isolated by gradient centrifugation, and embedded into a low-melting-point agarose plug. After proteinase K and RNase A treatments, the agarose plug was washed four times in Wash Buffer and five times in TE (Tris and EDTA) buffer. Finally, purified ultra-high-molecular weight nuclear DNA (uHMW nDNA) was recovered by melting the plug, digesting it with agarase and subjecting the resulting sample to drop dialysis against TE.

The Bionano Saphyr platform, in combination with the Direct Label and Stain (DLS) process, was used to generate chromosome-level sequence scaffolds and validate PacBio sequence contigs. Direct labeling was performed using the Direct Labeling and Staining Kit (Bionano Genomics Catalog 80005) according to the manufacturer’s recommendations, with some modifications39. In total, 1 μg uHMW nDNA was incubated for 2:20 h at 37 °C, followed by 20 min at 70 °C in the presence of DLE-1 Enzyme, DL-Green, and DLE-1 Buffer. Following proteinase K digestion and cleanup of the unincorporated DL-Green label, the labeled DNA was combined with Flow Buffer, DL-Dithiothreitol (DTT), and incubated overnight at 4 °C. DNA was quantified and stained by adding Bionano DNA Stain to a final concentration of 1 μL per 0.1 μg of final DNA. The labeled sample was loaded onto a Bionano chip flow cell and molecules separated, imaged and digitized in a Bionano Genomics Saphyr System and server according to the manufacturer’s recommendations (https://bionanogenomics.com/support-page/saphyr-system/).

Data visualization, processing, DLS map assembly, and hybrid scaffold construction were all performed using the Bionano Genomics software Access, Solve, and Tools. A filtered subset of 1,282,746 molecules (353,596 Mb total length) with a minimum size of 150 kb and a maximum size of 3 Mb were assembled without pre-assembly using the non-haplotype parameters with no CMPR cut and without extend-split.

Genome assembly

To determine the assembly approach to apply to each of the datasets, six different methods were tested on the complete dataset, including Falcon-only, Canu-only, WTDBG2-only, a Falcon–Canu hybrid approach, a Falcon–Peregrine hybrid approach (the longest 23× corrected reads were used), and a Falcon–Flye hybrid approach. We also downloaded PacBio sequencing data for the B73 v4 genome (68× subreads) for comparison of the different approaches.

The Falcon genome assemblies were performed using the falcon_kit pipeline v0.7 (ref. 21) with some modifications. TANmask and REPmask were not used due to their extensive masking for the maize genome. Error correction for subreads was performed on the longest 50× coverage, with the average read correction rate set to 75% (-e 0.75) and local alignments for at least 3000 bp (-l 3000). The usage of -l 3000 instead of -l 2500 was done because of the omitted repeat masking, which works better for highly repetitive genome species like maize. A minimum of two reads and a maximum of 200 reads were used for error corrections (--min_cov 2 --max_n_read 200). For sequence assembly, the exact matching k-mers between two reads was set to 24 bp (-k 24) with read correction rate as 95% (-e 0.95) and local alignments at least 1000 bp (-l 1000). The longest 20x coverage reads were used for assembly with a minimum coverage of two and maximum coverage of 80 (--min_cov 2 --max_cov 80). Full parameter sets are included in the Supplementary Note 1.

For Canu read correction and assembly, Canu v1.7 (ref. 22) was used. K-mers more frequent than 500 were not used to seed overlaps (ovlMerThreshold = 500). The genome size of 2,272,400,000 bp and 2,500,000,000 bp for NC358 and B73, respectively, were used in this study20. Other parameters were used as default. Due to a bug in the Canu v1.7 program, truncations of large contigs would occur during the consensus process (https://github.com/marbl/canu/releases/tag/v1.8). As the program was not expecting the superlong contigs that were being generated for our NC358 assemblies, we found a total of nine large contigs that suffered from consensus truncations. To fix these truncation gaps, consensus-free contigs were generated using Canu v1.7 (cnsConsensus = quick), then blastn was used to search for 5 kb boundaries of truncation gaps in consensus-free assemblies. Truncated sequences were retrieved and patched to the truncated contigs.

For the Falcon–Canu hybrid approach, the error correction was performed by Falcon, and the trimming and assembly were performed by Canu using the versions and parameters described above. All the assemblies were performed on the DNAnexus cloud platform. CPU core hour and maximum memory usage were recorded every 10 minutes for each Falcon error correction and Canu assembly job. For Falcon error correction of the 21k datasets, the CPU core hour (y) could be predicted by subread depth (m) with

$${\it{y}} = 20603100000 + (3136.685 - 20603100000)/\left( {1 + \left( {m/1932.377} \right){\hat{\phantom{a}}}4.148144} \right)$$
(1)

For Canu assembly of the 21k datasets, the CPU core hour (y) could be predicted by corrected read depth (n) with

$$y = 6438752000 + (1284.689 - 6438752000)/(1 + (n/56334.74){\hat{\phantom{a}}}1.872455)$$
(2)

These curves were fit using the https://mycurvefit.com/ website and plotted in R.

For the Falcon–Flye hybrid approach, the Falcon-corrected reads (44×) were assembled by Flye (v2.6)24 with the genome size parameter set to 2,272,000,000. For the Falcon–Peregrine hybrid approach, the longest 23× Falcon-corrected reads were assembled by Peregrine (pg0.1.6.1)25 with parameters 24 24 24 12 24 24 24 24 24 --with-consensus --shimmer-w 80 --shimmer-r 3 --best_n_ovlp 24 --mc_upper 640. The WTDBG2 genome assembly was performed using the WTDBG2 pipeline (v2.5)23 with default parameters using the 75× uncorrected subreads. The estimated genome size was set to 2,272,400,000.

Three of the assembly approaches were tested using both maize NC358 and B73 reads. For both inbred lines, a similar assembly size was generated by the Falcon-only, Canu-only, and Falcon–Canu hybrid approaches. However, the Falcon–Canu hybrid approach yielded the longest contig length (78.4 Mb and 19.7 Mb, respectively), the highest contig NG50 (23.0 Mb and 3.0 Mb, respectively), and the lowest number of assembly errors based on Bionano DLE-1 conflict (21 and 64, respectively; Supplementary Table 2). The gene-space completeness evaluated using BUSCOs v3.0.228 and the repeat space continuity evaluated using the LAI (vbeta3.2)8 were similar between the Canu and the hybrid approach and higher than those assemblies that were created using the Falcon assembler (Supplementary Table 2). This was likely due to the consensus approach used at the end of the Canu program, which was missing in the Falcon program.

The remaining three approaches were tested using only NC358 reads. For assemblies generated by the Falcon–Flye hybrid, Falcon–Peregrine hybrid, and WTDBG2-only approaches, the assembled sizes were 16–51% larger than the estimated genome size with 28–202 times more contigs compared to the assembly generated by the Falcon–Canu hybrid approach (Supplementary Table 2). Other quality metrics of these assemblies, such as the longest contig, contig N50, Bionano DLE-1 conflict, and LAI were all inferior compared to those of the Falcon–Canu assembly (Supplementary Table 2). BUSCO of the Falcon–Peregrine assembly was higher than that of the Falcon–Canu raw assembly, but lower than that of the Pilon or Arrow-polished assembly. The correction-free WTDBG2 assembly was the most fragmented probably due to the lack of error correction that hindered the assembly of repetitive sequences, which is demonstrated in the low LAI value (LAI = 2.5).

Due to the consistently high quality of the assemblies generated from the Falcon–Canu hybrid approach, we used this approach to assemble each of the NC358 datasets with varying sequence depth and read length. Full parameter sets of all assembly approaches are included in the Supplementary Note 1.

Genome polishing

Two polishing approaches were tested on the 21k_75x assembly. The first was done using Arrow with PacBio subreads (75× coverage). Read mapping to the assembly was done using BLASR40 with default parameters (--minMatch 12 --bestn 10 --minPctSimilarity 70.0 --refineConcordantAlignments). The Arrow tool in the SMRT Link (v5.1.0) software package was then applied to correct for sequencing errors with default parameters. A second approach for polishing was done using Pilon with Illumina pair-end reads (30.7× coverage). Read mapping to the assembly was done using Minimap2 (v2.16)41 with the short read option (-ax sr). Pilon (v1.23-0)42 was then applied to correct for sequencing errors including SNPs and small indels (--fix bases) on sites with a minimum depth of 10 and a minimum mapping quality of 30 (--mindepth 10 --minmq 30).

With both approaches, minimal differences were observed in the contiguity statistics (Supplementary Table 2) or the repeat content for the 21k_75x assembly (Supplementary Fig. 15), and it is expected that this minimal impact would be observed across all of the NC358 assemblies. A more substantial difference in BUSCO scores was observed with both the Arrow-polished and the Pilon-polished 21k_75x assemblies (Supplementary Table 2). As the polishing had a substantial impact on this metric, the other NC358 assemblies were also polished using Pilon with the same parameter settings and similar improvement of BUSCO scores was observed (Table 1; Supplementary Table 4).

Generation of pseudomolecules

Hybrid scaffolds for the assemblies were generated with Bionano Direct Label and Stain data using Bionano Solve (v3.2.1_04122018). Overlaps of contigs within Bionano map space were resolved by placing 13 bp of Ns (13 N gaps) at the overlap site. In addition to arranging contigs into scaffolds, the hybrid scaffold was also used to detect misassembly and to assess the completeness of the assembled genome and repeat elements.

The pseudomolecules were constructed from the hybrid scaffolds using ALLMAPS (v0.8.12)43. Both pan-genome anchor markers27 and GoldenGate markers26 were used with equal weights for ordering and orientating the scaffolds. For pan-genome anchor markers, data were downloaded from the CyVerse Data Commons (http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/genotypes/GBS/v27/Lu_2015_NatCommun_panGenomeAnchors20150219.txt.gz) and a bed file with 50 bp upstream and downstream of the B73 v3 coordinates were generated. A text file with marker name and predicted distance was also constructed from the same file. The extracted markers were mapped to HiSat2 (v2.1.0)44 indexed assemblies of NC358 by disabling splicing (--no-spliced-alignment) and forcing global alignment (--end-to-end). Very high read and reference gap open and extension penalties (--rdg 10000,10000 and --rfg 10000,10000) were also used to ensure full-length mapping of marker sequence. The final alignment was then filtered for mapping quality of greater than 30 and tag XM:0 (unique mapping) to retain only high-quality uniquely mapped marker sequences. The mapped markers were merged with the predicted distance information to generate a CSV input file for ALLMAPS. Only scaffolds with more than 20 uniquely mapped markers, with a maximum of 100 markers per scaffold, were used for pseudomolecule construction.

The GoldenGate markers were downloaded from MaizeGDB (https://www.maizegdb.org/data_center/map?id=1203673). For the markers with coordinates, 50 bp flanking regions were extracted from the B73 v4 genome. For markers without coordinates, marker sequences were used as-is, and those missing both coordinates and sequences were discarded. Mapping of the markers was done similar to the method described above for the pan-genome anchor markers, with all uniquely mapped markers retained. The genetic distance information for these markers was converted to a CSV file before using it in ALLMAPS. ALLMAPS was run with default options, and the pseudomolecules were finalized after inspecting the marker placement plot and the scaffold directions. Synteny dotplots were generated using the scaffolds as well as pseudomolecule assemblies against the B73 genome by following the ISUgenomics Bioinformatics Workbook (https://bioinformaticsworkbook.org/dataWrangling/genome-dotplots.html)45. Briefly, the repeats were masked using RepeatMasker (v4.0.9)46 and the Maize TE Consortium (MTEC) curated library3. RepeatMasker was configured to use the NCBI engine (rmblastn) with a quick search option (-q) and GFF as a preferred output. The repeat-masked genomes were then aligned using Minimap2 (v2.2)41 and set to break at 5% divergence (-x asm5). The paf files were filtered to eliminate alignments less than 1 kb and dotplots were generated using the R package dotPlotly (https://github.com/tpoorten/dotPlotly).

Gene annotation and RNA-seq mapping

The MAKER-P pipeline47 was used to annotate protein-coding genes for Pilon-polished NC358 21k_20x and 21k_75x genome assemblies. The baseline evidence used in annotating the B73 v4 genome13 was applied. Before gene annotation, the MTEC curated TE library3 and RepeatMasker was used to mask repetitive sequences. For gene prediction, we used Augustus48 and FGENESH49 (http://www.softberry.com/berry.phtml) with training sets based on maize and monocots, respectively. To identify genes that were missing in the 21k_20x assembly, total coding sequences (CDS) from the 21k_75x annotation was masked by total CDS from the 21k_20x annotation using RepeatMasker (-div 2 -cutoff 1000 -q -no_is -norna -nolow). The 21k_75x CDS that were masked less than 20% were determined missing in the 21k_20x annotation. These missing CDS were blast against the 21k_20x assembly and those that had less than 20% similarity were also determined to be missing in the 21k_20x assembly.

A total of 20 RNA-seq libraries were sequenced from NC358 tissue samples. Each library was sequenced to 21.9× ± 0.7× coverage with a mapping rate of 86.4% ± 1.0% to the B73 v4 using STAR (v2.5.2b)50 (Supplementary Fig. 16 and Supplementary Table 19). To benchmark the gene-space assembly, STAR (v2.5.2b)50 was used to map the RNA-seq reads against the Pilon-polished NC358 assemblies. Unmapped reads from the 21k_20x assembly were extracted using SAMtools51 and remapped to the 21k_75x assembly with STAR. Genes with read coverage ≥20% were extracted using BEDtools52, and blast against the 21k_20x assembly for the identification of full-length copies. The NC358 TE library (see next section for details on library generation) was used to identify TE fragments in genes with aligned reads (Supplementary Table 9). In addition, TEsorter (v1.1.4)53 (https://github.com/zhangrengang/TEsorter) was used to identify TE-related protein domains in genes with default parameters (Supplementary Table 9).

Assessment of genome assembly quality

The quality of the different NC358 assemblies was assessed on the unpolished assemblies unless noted. For continuity, N50, NG50, NG(x), the number of contigs, and maximum contig length were estimated. NG(x) values were the length of the contig at the top x percent of the estimated genome size (2.2724 Gb) consisting of the longest contigs. NG50 is a commonly used case of NG(x) values. NG(x) values were calculated using GenomeQC (https://github.com/HuffordLab/GenomeQC)54. The gene-space completeness was estimated using BUSCO (v3.0.2)28 with the Embryophyta odb9 dataset (n = 1440) and BLAST (v2.6)55, Augustus (v3.3)48, EMBOSS (v6.6.0)56, and HMMER (v3.1b2)57.

The repeat space contiguity was accessed using the LAI (vbeta3.2)8. To annotate LTR retrotransposons, LTR_retriever (v2.6)58 was used to identify intact LTR retrotransposons and construct LTR libraries for each NC358 assembly with default parameters. To generate a high-quality LTR library for NC358, assembly-specific LTR libraries were aggregated and masked by the MTEC curated LTR library using RepeatMasker (v4.0.7)46. Library sequences masked over 90% were removed and redundant sequences were also removed using utility scripts (cleanup_tandem.pl and cleanup_tandem.pl) from the EDTA package59. Non-redundant NC358-specific LTR sequences were added to the MTEC curated LTR library to form the final LTR library for NC358. The final library was then used to mask the 21k_75x assembly for the estimation of total LTR content. The total LTR content of 76.34% and LTR identity of 94.854% was used to estimate LAI values of all NC358 assemblies (-totLTR 76.34 -iden 94.854). The LAI of the other maize line genomes, including PH207 (GeneBank Accession: GCA_002237485.1)60, CML247 (GeneBank Accession: GCA_002682915.2)27, Mo1761, and GeneBank Accession: GCA_003185045.162), W22 (GeneBank Accession: GCA_001644905.2)63, and B73 v4 (GeneBank Accession: GCA_000005005.6)13 were also evaluated for context.

Effective assembly size, which is the length of the uniquely mappable sequences of an assembly, was estimated using unique 150-mers in each sequence assembly and quantified using Jellyfish (v2.0)64 with default parameters.

Misassembly identification with optical maps

The Bionano optical mapping was used as an orthogonal method to identify misassemblies in genomes. Bionano de novo assembled optical maps were aligned to the sequence pseudomolecules to characterize structural inconsistencies using the structural variant calling pipeline of BionanoSolve 3.4. Default parameters were employed from the nonhaplotype_noES_DLE file. Homozygous calls with a confidence of 0.1, a size of 500 bp, and non-overlaps with gap regions were regarded as insertions and deletions in sequence assemblies.

Assembly quality evaluation in repeat space

The coordinates of CentC arrays, knob180, TR-1 knobs, and NOR in the assemblies were identified by blasting CentC, knob180, TR-1 knob consensus sequences34, and the rDNA intergenic spacer (AF013103.1) against each assembly. An individual repeat array was defined as clusters of repetitive sequences that had less than 100 kb interspace between repeated elements. The level of repeats and gaps were then quantified in each defined repeat array. Respective sizes of each repeat array in the Bionano maps were estimated using the Bionano labels closest to the start and end coordinates in the assemblies.

To identify the telomere–subtelomere boundaries of the NC358 assemblies, seven maize subtelomere repeat sequences were downloaded from NCBI (EU253568.1, S46927.1, S46926.1, S46925.1, CL569186.1, AF020266.1, and AF020265.1) and used as queries to blast against the NC358 21k_75x assembly. Subtelomere boundaries were first identified at the start and end of chromosomes where blast hits were clustering then cross-checked with subtelomere-specific fluorescence in-situ hybridization (FISH) data65. The blast results were concordant with FISH results, showing the beginning of chromosomes 7, 8, 9, and 10 lack subtelomeres (Supplementary Table 15). Telomeres were defined as the distance between the boundary of subtelomeres to the end of pseudomolecules of the 21k_75x assembly, which were used as the basis for estimating the telomere size and count of the telomeric repeat sequences (5′-TTTAGGG-3′ and 5′-CCCTAAA-3′ in reverse complementation) in all other NC358 assemblies.

To identify the bz locus in the NC358 assemblies, the sequence of the maize W22 bz locus was first downloaded from NCBI (EU338354.1)32. The starting and ending 2 kb of the W22 bz locus were used to blast against the NC358 21k_75x assembly and the longest matches on chromosome 9 were used as the location of the bz locus in the NC358 21k_75x assembly. The obtained NC358 bz locus is 289,103 bp in length (chr9:11625031.11914133), which is 50 kb longer than that of the W22 bz locus (238,141 bp). Similarly, the 2 kb flanking sequences of the NC358 21k_75x bz locus were used to locate the bz locus coordinates in the other NC358 assemblies.

The zein sequence was downloaded from NCBI (AF031569.1) and the Rp1-D from MaizeGDB (AC152495.1_FG002). The same method as described for the bz locus was used to identify coordinates in the NC358 assemblies based on blast results using 2 kb flanking sequences.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.