High-quality Arabidopsis thaliana Genome Assembly with Nanopore and HiFi Long Reads

Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of reference genome still contains a significant number of missing segments. Here, we reported a high-quality and almost complete Col-0 genome assembly with two gaps (named Col-XJTU) by combining the Oxford Nanopore Technologies ultra-long reads, Pacific Biosciences high-fidelity long reads, and Hi-C data. The total genome assembly size is 133,725,193 bp, introducing 14.6 Mb of novel sequences compared to the TAIR10.1 reference genome. All five chromosomes of the Col-XJTU assembly are highly accurate with consensus quality (QV) scores > 60 (ranging from 62 to 68), which are higher than those of the TAIR10.1 reference (ranging from 45 to 52). We completely resolved chromosome (Chr) 3 and Chr5 in a telomere-to-telomere manner. Chr4 was completely resolved except the nucleolar organizing regions, which comprise long repetitive DNA fragments. The Chr1 centromere (CEN1), reportedly around 9 Mb in length, is particularly challenging to assemble due to the presence of tens of thousands of CEN180 satellite repeats. Using the cutting-edge sequencing data and novel computational approaches, we assembled a 3.8-Mb-long CEN1 and a 3.5-Mb-long CEN2. We also investigated the structure and epigenetics of centromeres. Four clusters of CEN180 monomers were detected, and the centromere-specific histone H3-like protein (CENH3) exhibited a strong preference for CEN180 Cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. We believe that this high-quality genome assembly, Col-XJTU, would serve as a valuable reference to better understand the global pattern of centromeric polymorphisms, as well as the genetic and epigenetic features in plants.


Introduction
The Arabidopsis thaliana Col-0 genome sequence was published in 2000 [1], and after decades of work, this reference genome has become the ''gold standard" for A. thaliana. However, centromeres, telomeres, and nucleolar organizing regions (NORs) have been either misassembled or not even been sequenced yet due to the enrichment of highly repetitive elements in these regions [2,3]. Long-read sequencing technologies, such as Oxford Nanopore Technologies (ONT) sequencing and Pacific Biosciences (PacBio) single molecule real-time (SMRT) sequencing, generate single molecular reads longer than 10 kb, which exceeds the length of most simple repeats in many genomes, making it possible to achieve highly contiguous genome assemblies [4]. Highly repetitive regions, e.g., centromere or telomere regions, however, remain mostly unassembled due to the limitations in read length and the error rate associated with sequencing of long reads. Although ONT sequencing has overcome read length limitation and can generate ultra-long reads (longest > 4 Mb) (https://nanoporetech. com/products/promethion), the associated 5%-15% per base error rate [5] leads to misassemblies or inaccurate assemblies. Naish et al. [6] used ONT-generated ultra-long reads to produce a highly contiguous A. thaliana Col-0 genome, but the consensus quality (QV) scores of all five chromosomes, ranging from 41 to 43, were lower than those of the reference TAIR10.1 (ranging from 45 to 52) [6]. High-fidelity (HiFi) data generated from a circular consensus sequencing [7] are a promising strategy for repeat characterization and centromeric satellite assembly. The combination of ONT long reads and HiFi reads has been demonstrated to overcome the issues of sequencing centromere and telomere regions in the human genome, and generated the telomere-to-telomere (T2T) assembly of human chromosome (Chr) X [8] and Chr8 [9].
Centromeres mainly consist of satellite DNAs and long terminal repeat (LTR) retrotransposons [10] that attract microtubule attachment and play an important role in maintaining the integrity of chromosomes during cell division [11]. In plant species, centromeric satellite DNA repeats range from 150 bp to 180 bp in size [12]. It has been reported that A. thaliana centromeres contain megabase-sized islands of 178-bp tandem satellite DNA repeats (CEN180) [13] that bind to centromere-specific histone H3-like protein (CENH3) [14,15]. Unfortunately, centromere sequences are largely absent from previously generated A. thaliana reference genome assemblies [15], hindering the investigation of CEN180 distribution and its genetic and epigenetic impacts on the five chromosomes.
To obtain T2T A. thaliana genome assembly, we introduced a bacterial artificial chromosome (BAC)-anchor replacement strategy to our assembly pipeline and generated the Col-XJTU genome assembly of A. thaliana. We completely resolved the centromeres of Chr3, Chr4, and Chr5, and partially resolved the centromeres of Chr1 and Chr2. The Col-XJTU assembly of A. thaliana genome was found to be highly accurate with QV scores greater than 60, which were obviously higher than those of TAIR10.1 and another recently deposited assembly [6]. Due to the unprecedented high quality of the Col-XJTU genome assembly, we were able to observe intriguing genetic and epigenetic patterns in the five centromere regions.

Results
Assembly of a high-quality genome of A. thaliana We assembled ONT long reads using NextDenovo v. 2.0, and initially generated 14 contigs (contig N50 = 15.39 Mb) ( Figure 1A, Figure S1A). Of these, eight contigs contained the Arabidopsis-type telomeric repeat unit (CCCTAAA/ TTTAGGG) on one end, while two contigs had the 45S rDNA units on one end ( Figure 1A). Contig 13 (935 kb) and Contig 14 (717 kb) composed of CEN180 sequences were neither ordered nor oriented, and thus were removed from the assembly ( Figure S1A). We polished the remaining 12 contigs with HiFi data using Nextpolish and scaffolded them using 3D-DNA derived from Hi-C data. Consequently, we obtained five scaffolds with seven gaps located at centromere regions ( Figure 1A). To further improve the genome assembly, we assembled HiFi reads using hifiasm [16,17] and identified the centromeric flanking BAC sequences [18][19][20] on both the five ONT scaffolds and HiFi contig pairs ( Figure 1A, Figure S1B and C). We first filled the gaps on centromeres using the BAC-anchor strategy ( Figure S1B). To guarantee the highest base-pair accuracy, we replaced the low-accuracy ONT genome assemblies with the PacBio HiFi contigs and kept the HiFi contigs as long as possible ( Figure 1A, Figure S1C). , and the QV scores of all five chromosomes are greater than 60 (ranging from 62 to 68), which are obviously higher than those of the TAIR10.1 reference genome (ranging from 45 to 52) ( Table 1) and a recently deposited genome (ranging from 41 to 43) [6], suggesting that our Col-XJTU assembly is highly accurate. The completeness evaluation showed a k-mer completeness score of 98.6%, suggesting that the Col-XJTU assembly is highly complete as well. The Col-XJTU assembly was composed of 97% HiFi contigs, with only 4,098,671 bp from ONT contigs which contain highly repetitive elements (Table S1). The heterozygosity of A. thaliana Col-XJTU is very low (0.0865%), which was estimated using GenomeScope v. 1.0 [21] from the k-mer 17 histogram computed by Jellyfish v. 2.3.0 [22]. The base accuracy and structure correctness of the Col-XJTU assembly were also estimated from the sequenced BACs. Firstly, 1465 Figure 1 High-quality T2T genome assembly A. Assembly of ONT ultra-long reads. We obtained 14 contigs, and two of them are composed of CEN180 repeat sequences. Then, the 12 contigs were ordered and oriented to five scaffolds using Hi-C data. After scaffolding, the assembly was represented in five pseudomolecules corresponding to the five chromosomes of the TAIR10.1 assembly. Gaps were filled using HiFi contigs based on BAC anchors. Whole ONT-HiC assemblies were replaced with HiFi contigs based on BAC anchors. The gray bars represent the centromere regions; the red lines represent the gap locations; the numbers and corresponding lines in blue indicate the numbers and loci of new gene annotations, respectively. B. Col-XJTU genome assembly corrected a misassembly region of TAIR10.1 genome assembly. Grey bands connect corresponding collinear regions. Duplicated segments that were misassembled are connected with bands in blue. C. Circos plot of Col-XJTU genome assembly. The tracks from outside to inside: distribution of karyotypes of assembled chromosomes, GC density, density of transposable elements, and gene density calculated in 50-kb windows. Syntenic blocks and different colored lines represent different chromosomes. Chromosomes are labeled at the outmost circle with centromeres shown in dark gray. T2T, telomere-to-telomere; ONT, Oxford Nanopore Technologies; HiFi, high-fidelity; BAC, bacterial artificial chromosome.
BACs were aligned to the Col-XJTU assembly via Win-nowmap2, and the mapping results calculated using the CIGAR string revealed good agreement with high sequence identity (99.87%). We validated the structure of our assembly using bacValidation, and the Col-XJTU assembly resolved 1427 out of 1465 validation BACs (97.41%), which is higher than BAC resolving rate of humans [23]. In addition, Col-XJTU genome assembly corrected one misassembled region with 1816 bp in length, containing two protein-coding genes, in the TAIR10.1 genome ( Figure 1B; Table S2).
The assembly sizes of Col-XJTU centromere 1 (CEN1), CEN2, CEN3, CEN4, and CEN5 were 3.8 Mb, 3.5 Mb, 4.0 Mb, 5.5 Mb, and 4.9 Mb, respectively (Table S3). The sizes of gap-free CEN3, CEN4, and CEN5 were consistent with the physical map-based centromeric sizes [18][19][20]; however, the 3.8-Mb-long CEN1 had a gap and was smaller than the estimated size of 9 Mb based on the physical map [20], and the 3.5-Mb-long CEN2 with a gap was assembled, accounting for 88% of the 4-Mb-long physical map [20]. All five centromeric CEN180 arrays did not contain large structural errors ( Figure S2). Upon the annotation of the five centromere regions, we found that all five A. thaliana centromeres were surrounded by transposon-enriched sequences rather than protein-coding gene-enriched sequences ( Figure 1C).
The Col-XJTU assembly (contig N50 = 22.25 Mb) improved the contiguity of the A. thaliana genome compared to TAIR10.1 (contig N50 = 11.19 Mb) ( Table 1), and we had filled 36 gaps apart from two gaps in CEN1 and CEN2 (Table S4). Benchmarking Universal Single-Copy Orthologs (BUSCO) evaluation revealed higher genome completeness of Col-XJTU than that of TAIR10.1 ( Table 1). The synteny plot showed that Col-XJTU genome is highly concordant with TAIR10.1 ( Figure S3) but with three additional completely resolved centromere regions and partly resolved NORs. Novel sequences (a set of regions not covered by TAIR10.1) equivalent to a total of 14.6 Mb were introduced in the Col-XJTU genome; of these, 94.8% belong to the centromeric regions, with 3.7% of them located in the NORs and telomeres (Table S5). The QV score of the novel sequences (> 10 kb) is 67.43, and the base accuracy is 99.999982%. The assembly sizes of 45S rDNA units in Chr2 and Chr4 were 300,270 bp and 343,661 bp, respectively. The telomeres of the eight chromosome arms ranged from 1862 bp to 3563 bp in length (Table S6), which are consistent with the reported lengths [24]. The read depths of these telomeres did not differ obviously compared to the average coverage of the genome (Table S6). Moreover, no telomeric motif was found in the unmapped HiFi reads, probably indicating completely resolved telomeres. The repeat content of Col-XJTU genome (24%) is much higher than that of the current reference genome (16%) ( Table 1), largely due to the higher number of LTR elements assembled and annotated in Col-XJTU genome (Table S7).
A total of 27,418 protein-coding genes (99.9%) were liftedover from TAIR10.1 (27,444) using Liftoff (Table 1). We then masked repeat elements and annotated protein-coding genes in the novel sequences in Col-XJTU genome. Finally, we obtained 27,583 protein-coding genes in Col-XJTU genome with 165 newly-annotated genes. Of the newly-annotated genes, 41 and 89 genes were located in the NORs of Chr2 and Chr4, respectively ( Figure S4), while 35 newly-annotated genes were located in the centromeres (n = 33) and telomeres (n = 2) ( Figure 1A). Only 14 of the 165 newly-annotated genes contain functional domains, whereas the remaining 151 ones have unknown functions (Table S8). Interestingly, 96% of the newly-annotated genes were found to be actively transcribed across different tissues (Table S9), especially in leaves ( Figure S5). The highly expressed leaf-specific novel genes encode protein domains such as ATP synthase subunit C and NADH dehydrogenase (Table S8), indicating that these genes may be involved in photosynthesis.

Global view of centromere architecture
Previously, the centromere composition of A. thaliana was estimated using physical mapping and cytogenetic assays; however, such estimation resulted in the generation of incorrectly annotated and unknown regions, such as 5S rDNA and CEN180 repeat regions [1]. The complete assembly of CEN3, CEN4, and CEN5 in this study revealed $ 0.5-kb-long repeats in the 5S rDNA array regions ( Figure 2), which is consistent with the previous findings obtained by fluorescence in situ hybridization and physical mapping [25,26]. The 5S rDNA regions in CEN4 and CEN5 exhibited high similarity with 95% sequence identity. However, this region in CEN3 was interrupted by LTRs, resulting in a low sequence identity. All 5S rDNA regions presented GC-rich and hypermethylation patterns ( Figure 2). We detected 3666 5S rDNA monomers, which approximately doubled the previously reported amount of $ 2000 5S rDNA gene copies in the Col-0 genome [27]. The 5S rDNA arrays were divided into four clusters ( Figure S6A), wherein the 5S rDNA sequences in CEN4 and CEN5 formed independent clusters labeled as 5S Cluster 1 and Cluster 2, respectively ( Figure 2). The 5S rDNA sequences in CEN3 were divided into two 5S clusters, Clusters 3 and 4 ( Figure 2), which contained obviously more polymorphic sites than Clusters 1 and 2 in CEN4 and CEN5 ( Figure S7). We observed that CEN1, CEN2, CEN3, and CEN4 contained highly similar CEN180 arrays (Figure 2), and the reduced internal similarity in CEN5 was likely due to the disruption by LTR/Gypsy elements (Figure 2). We found one CEN180 array in CEN1, CEN2, CEN3, and CEN5 but two distinct CEN180 arrays in CEN4. Except for the downstream one in CEN4, all others CEN180 arrays showed higher than 90% sequence identity either with inter-or intrachromosomal regions (Figure 2). The downstream CEN180 array in CEN4 showed a higher internal sequence identity (> 90%) and a lower external sequence identity (< 90%) than the other CEN180 arrays ( Figure 2). Moreover, the downstream CEN180 array in CEN4 showed lower GC content and methylation frequency than other CEN180 arrays ( Figure 2). We performed LASTZ search for tandem repeats to construct the CEN180 satellite library and identified 60,563 CEN180 monomers in the five centromeres. The phylogenetic clustering analysis revealed four distinct CEN180 clusters with single-nucleotide variants and small indels ( Figures  S6B and S8). Almost all the downstream CEN180 monomers of CEN4 belonged to CEN180 Cluster 1 (Figure 2, Figure S9), while the upstream CEN180 monomers of CEN4 belonged to the remaining three CEN180 clusters.
A functional region of centromere is defined by the binding of epigenetic modifications with CENH3 [28,29]. We observed that CENH3 was obviously enriched in the interior of the centromere but depleted at the LTR region ( Figure 2). The five centromeres showed higher DNA methylation than pericentromeres ( Figure S10); however, the CEN180 arrays presented hypomethylation patterns (Figure 2, Figure S10). Interestingly, we found that the CENH3-binding signal exhibited a strong preference for CEN180 Cluster 3 on all five centromeres ( Figure S11). Such preference was not observed in CEN180 Cluster 1 in CEN4 and other four centromeres ( Figure S11). The CENH3 signal enrichment presented the opposite tendency with the methylation frequency in 60% CEN180 clusters of the five centromeres ( Figure 2, Figure S12).

Discussion
Traditionally, long-read sequencing technologies commonly suffer from high error rates [30]. However, the recently developed HiFi reads by PacBio have both the advantages of long read lengths and low error rates, enabling the assembly of complex and highly repetitive regions in the new era of T2T genomics [31,32]. HiFi reads have been used to assemble the T2T sequence of human ChrX and Chr8 [8,9], aiding in the completion of the human genome [33]. Recently, two complete rice reference genomes have also been assembled using HiFi reads [32].
The size of A. thaliana centromeres is 2-5 folds larger than that of the rice centromeres (0.6-1.8 Mb) [32], and hence, a sophisticated approach is required to complete the assembly of the A. thaliana centromeres. We combined the dual longread platforms of ONT ultra-long and PacBio HiFi to produce the high-quality A. thaliana Col-XJTU genome with only two gaps in CEN1 and CEN2. We assembled a 3.8-Mb-long CEN1, which is smaller than the 9-Mb region estimated by physical mapping [20]. We also assembled a 3.5-Mb-long sequence (88% of the physical map [20]) of CEN2 using hifiasm. Recently, a version of A. thaliana genome was deposited with a $ 5-Mb-long CEN1 sequence, which is still smaller than the physical map size [20], indicating the difficulty in assembling long centromere regions even with long-read technologies [6]. We are optimizing a singly unique nucleotide k-mers (SUNKs) assembly method [9] for plant genomes, aiming to eventually produce the completely resolved long centromere regions.
Diverse methylation patterns have been observed in the centromere sequences of two human chromosomes upon completion of the human genome [8,9]. The centromeres of Chr8 and ChrX in the human genome contain a hypomethylation pocket, wherein the centromeric histone CENP-A for kinetochore binding is located [8,9,34,35]. This phenomenon has also been observed experimentally in A. thaliana [36]. Our highquality centromere assembly of A. thaliana reveals that the CEN180 arrays enriched with CENH3 occupancy are hypomethylated compared to the pericentromeric regions. Although the primary function of centromeres is conserved between animal and plant kingdoms, the centromeric repeat monomers are highly variable in terms of sequence composition and length, and little sequence conservation is observed between species [37]. Extensive experimental evidence has confirmed that convergent evolution of centromere structure, rather than the sequence composition, is the key to maintaining the function of centromeres [38]. Furthermore, we have observed clusters with irregular patterns of methylation and CENH3 binding, indicating that centromeres may contain regions with unknown functions or still-evolving components. We would need to complete the assembly of centromere sequences for more related species to gain insight into the evolution of centromere structure and function.
In conclusion, our novel assembly strategy involving the combination of ONT long reads and HiFi reads leads to the assembly of a high-quality genome of the model plant A. thaliana. This genome will serve as the foundation for further understanding molecular biology, genetics, epigenetics, and genome architecture in plants.

Plant growth condition and data sources
The A. thaliana accession Col-0 was obtained from the Shandong Agricultural University, China as a gift. The A. thaliana seeds were placed in a potting soil and then maintained in a growth chamber at 22°C with a 16 h light/8 h dark photoperiod and a light intensity of 100-120 mmolÁm À2 Ás À1 . Young true leaves taken from 4-week-old healthy seedlings were used for sequencing.

Genomic DNA preparation
DNA was extracted using the Qiagen Genomic DNA Kit (Catalog No. 13323, Qiagen, Valencia, CA) following the manufacturer's guidelines. Quality and quantity of total DNA were evaluated using a NanoDrop One UV-Vis spectrophotometer (ThermoFisher Scientific, Waltham, MA) and Qubit 3.0 Fluorometer (Invitrogen life Technologies, Carlsbad, CA), respectively. The Blue Pippin system (Sage Science, Beverly, MA) was used to retrieve large DNA fragments by gel cutting.

Oxford Nanopore PromethION library preparation and sequencing
For the ultra-long Nanopore library, approximately 8-10 mg of genomic DNA was selected (> 50 kb) with the SageHLS HMW library system (Sage Science), and then processed using the Ligation sequencing 1D Kit (Catalog No. SQK-LSK109, Oxford Nanopore Technologies, Oxford, UK) according the manufacturer's instructions. DNA libraries (approximately 800 ng) were constructed and sequenced on the PromethION (Oxford Nanopore Technologies) at the Genome Center of Grandomics (Wuhan, China). A total of 56.54 Gb of ONT long reads with $ 388Â coverage were generated including $ 177Â coverage of ultra-long (> 50 kb) reads. The N50 of ONT long reads was 46,452 bp, and the longest reads were 495,032 bp.

HiFi sequencing and assembly
SMRTbell libraries were constructed according to the standard protocol of PacBio using 15 kb preparation solutions (Pacific Biosciences, CA). The main steps for library preparation include: 1) genomic DNA shearing; 2) DNA damage repair, end repair, and A-tailing; 3) ligation with hairpin adapters from the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences); 4) nuclease treatment of SMRTbell library with SMRTbell Enzyme Cleanup Kit; and 5) size selection and binding to polymerase. In brief, the 15 mg genomic DNA sample was sheared by gTUBEs. Single-strand overhangs were then removed, and DNA fragments were damage repaired, end repaired, and A-tailed. Then, the fragments were ligated with the hairpin adapters for PacBio sequencing. The library was treated with the nuclease provided in the SMRTbell Enzyme Cleanup Kit and purified by AMPure PB Beads. Target fragments were screened by BluePippin (Sage Science). The SMRTbell library was then purified by AMPure PB beads, and Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA) was used to detect the size of the library fragments. Sequencing was performed on a PacBio Sequel II instrument with Sequencing Primer V2 and Sequel II Binding Kit 2.0 at the Genome Center of Grandomics. A total of 22.90 Gb of HiFi reads with $ 157Â coverage were generated, and N50 of the reads was 15,424 bp. HiFi reads were assembled using hifiasm v. 0.14-r312 [17] with default parameters, and the gfatools (https://github.com/lh3/gfatools) was used to convert sequence graphs in the GFA to FASTA format.

Hi-C sequencing and scaffolding
Hi-C library was prepared from cross-linked chromatins of plant cells using a standard Hi-C protocol; the library was then sequenced using Illumina NovaSeq 6000. A total of 21.14 Gb of Hi-C reads with $ 158Â coverage were generated. The Hi-C sequencing data were used to anchor all contigs using Juicer v. 1.5 [44], followed by a 3D-DNA scaffolding pipeline [45]. Scaffolds were then manually checked and refined with Juicebox v. 1.11.08 [46].

Replacing ONT-HiC assemblies with HiFi contigs
We introduced a BAC-anchor strategy to fill the remaining gaps in ONT-HiC assemblies. Briefly, for each gap, we first identified two BAC sequences flanking the gap locus that were aligned concordantly (identity > 99.9%) to both the ONT-HiC assembly and a HiFi contig, and then replaced the gapcontaining contigs with corresponding HiFi contigs. We used the same method to polish ONT-HiC assemblies with HiFi contigs. The BAC sequences we used as anchors are list in Figure S1 and Table S10.

Genome comparisons
Two genome assemblies were aligned against each other using nucmer v. 4.0.0 (-c 100 -b 500 -l 50) [47], and the output delta file was filtered using a delta-filter (-i 95 -l 50). The alignment regions between two genomes were extracted using showcoords (-c -d -l -I 95 -L 10,000), and the novel region of our genome was extracted using 'complement' in BEDTools v. 2.30.0 [48]. The synteny relationships among the five chromosomes were estimated using BLASTN v. 2.9.0 with 'all vs. all' strategy and visualized using Circos v. 0.69-8 [49]. Genomic alignment dot plot between Col-XJTU and TAIR10.1 assemblies was generated using D-GENIES [50]. QV and completeness scores were estimated using Merqury [51] from Illumina sequencing data generated on the same material in this study. The assembly accuracy for five chromosomes was estimated from QV as follows: accuracy percentage = 100 À (10 (QV/À10) ) Â 100 [9]. To assess genome completeness, we also applied BUSCO v. 3.0.2 analysis using the plant early release database v. 1.1b [52]. Pairwise sequence identity heatmaps of five centromeres were calculated and visualized using the aln_plot (https://github.com/mrvollger/ aln_plot) command: bash cmds.sh CEN CEN.fa 5000.

BAC validation
We validated the assemblies using bacValidation (https:// github.com/skoren/bacValidation) with default parameters, which recognizes a BAC as 'resolve' within the assembly with 99.5% of the BAC length to be aligned to a single contig. BAC libraries were downloaded from European Nucleotide Archive (ENA), and the BACs used to validate five chromosomes are listed in Table S10.

Assembly validation of CEN180 arrays
We applied TandemTools [53] to assess the structure of the centromeric CEN180 arrays. We first aligned ONT reads (> 50 kb) to the Col-XJTU assembly with Winnowmap2 and extracted reads aligned to the centromeric CEN180 arrays (Chr1: 14

Misassembly evaluation
We first used QUAST v. 5.0.2 [57] to assess the structure accuracy of new assemblies. QUAST parameters were set to 'quast. py <asm> -o quast_results/<asm> -r<reference> --largemin-alignment 20,000 --extensive-mis-size 500,000 --minidentity 90' according to a previous report [23]. Based on QUAST evaluation, we did not detect any misassembly between Col-XJTU and TAIR10.1 genomes at noncentromeric regions. Furthermore, we detected and labeled one potential misassembly due to segmental duplications for Chr5 when mapping the protein-coding gene sequences of TAIR10.1 to Col-XJTU using Liftoff. We aligned the BAC sequences (K3M16 and K10A8) to the different regions between TAIR10.1 and Col-XJTU using BLASTN, supporting that the Col-XJTU assembly is correct.

Methylation analysis
Nanopolish v. 0.13.2 with the parameters 'call-methylation --methylation cpg' was used to measure the frequency of CpG methylation in raw ONT reads. The ONT reads were aligned to whole-genome assemblies via Winnowmap v. 2.0 [67]. The script 'calculate_methylation_frequency.py' provided in the methplotlib package [68] was then used to generate the methylation frequency.

Data availability
The whole-genome sequence data reported in this study have been deposited in the Genome Warehouse [69] at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation (GWH: GWHBDNP00000000.1), and are publicly accessible at https://ngdc.cncb.ac.cn/gwh. The genome annotation has been deposited in https://dx.doi.org/10.6084/m9.figshare. 14913045. The raw sequencing data for the PacBio HiFi reads, ONT long-reads, Illumina short reads, and Hi-C Illumina reads have been deposited in the Genome Sequence Archive [70] at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation (GSA: CRA004538), and are publicly accessible at https://ngdc.cncb.ac.cn/gsa.