Chromosome-Scale Assembly and Annotation of Eight Arabidopsis thaliana Ecotypes

Abstract The plant Arabidopsis thaliana is a model system used by researchers through much of plant research. Recent efforts have focused on discovering the genomic variation found in naturally occurring ecotypes isolated from around the world. These ecotypes have come from diverse climates and therefore have faced and adapted to a variety of abiotic and biotic stressors. The sequencing and comparative analysis of these genomes can offer insight into the adaptive strategies of plants. While there are a large number of ecotype genome sequences available, the majority were created using short-read technology. Mapping of short-reads containing structural variation to a reference genome bereft of that variation leads to incorrect mapping of those reads, resulting in a loss of genetic information and introduction of false heterozygosity. For this reason, long-read de novo sequencing of genomes is required to resolve structural variation events. In this article, we sequenced the genomes of eight natural variants of A. thaliana using nanopore sequencing. This resulted in highly contiguous assemblies with >95% of the genome contained within five contigs. The sequencing results from this study include five ecotypes from relict and African populations, an area of untapped genetic diversity. With this study, we increase the knowledge of diversity we have across A. thaliana ecotypes and contribute to ongoing production of an A. thaliana pan-genome.


Introduction
Arabidopsis thaliana is a model organism used as the basis for studies of many aspects of plant biology.A. thaliana is part of the brassicales family, which includes a variety of oilseed crops of agronomic importance (Raza et al. 2020).A. thaliana is a predominantly self-fertilizing plant, with an estimated 3% out-crossing (Platt et al. 2010).In addition to its roles as a general plant model system and a tool for the study of brassicas, due to the depth of genomic resources available A. thaliana is also used as a reference for genome assembly and comparison (Lijavetzky et al. 2003;Long et al. 2007;Provart et al. 2016).A. thaliana is an excellent genomic system as it is diploid and has a relatively small ∼135 mb genome (Hays 2002;Hou et al. 2022).This makes it very amenable to sequencing, genetic analysis, and computational benchmarking (Hays 2002;Hou et al. 2022).
Naturally occurring variants of A. thaliana, also known as ecotypes, have been isolated from many regions including Europe, Africa, North America, Japan, and South Korea (Koornneef et al. 2004;The 1001Genomes Consortium 2016).The diverse environmental conditions of these geographic regions have led to natural genetic variation across these ecotypes, providing insights into plant adaptation to abiotic and biotic stress (Coolen et al. 2019).Sequencing efforts by the 1001 genomes consortium have identified extensive heterozygosity and polymorphisms across A. thaliana world-wide, highlighting the natural variation inherent to this plant species (The 1001 Genomes Consortium 2016).Recent work has brought into question how much of this diversity has been properly captured.Performing de novo longread sequencing of several ecotypes isolated from Sweden found that a significant portion of the heterozygosity observed was the result of spurious single-nucleotide polymorphisms caused by the use of short-read sequencing (Jaegle et al. 2023).In place of the heterozygosity, the longread sequencing identified significant levels of structural variation that could not be identified using short-read sequencing (Jaegle et al. 2023).This work highlights that the mismapping of reads derived from loci of structural variation in new ecotypes will simultaneously lead to a loss of genetic information and the introduction of false heterozygosity (Jaegle et al. 2023).Long-read sequencing and de novo genome assembly is the only way to properly identify structural variation and novel genetic information.
With the increasing availability of long-read sequencing, many recent studies have performed resequencing of a variety of A. thaliana ecotypes to capture the structural variation and natural diversity of the species (Jiao and Schneeberger 2020;Kang et al. 2023;Wlodzimierz et al. 2023;Lian et al. 2024).These studies have indicated that Eurasian ecotypes are over-sampled, with the discovery of new genetic diversity reaching saturation.In contrast, the same studies found that African and relict populations are under-sampled and that the discovery of new genetic diversity has not yet reached saturation in these populations.Consistent with this, genetic analysis of A. thaliana ecotypes also showed greater genetic divergence across African and relict populations than Eurasian populations (Durvasula et al. 2017).It is therefore clear that additional long-read sequencing of relict and African populations is likely to uncover novel genetic diversity in A. thaliana.
To this end, we sequenced the genomes of eight different ecotypes of A. thaliana from diverse geographic regions, including two African and three Eurasian relict ecotypes.The resulting genomes are of high quality and continuity, comparable to the TAIR10.1 reference.These results expand the overall cache of long-read sequenced ecotype genomes available to researchers, offering additional resources for the discovery of novel genetic variation and improving comparative analyses across A. thaliana.

Assembly and Gene Prediction Assessment
The eight Arabidopsis ecotypes were selected from a collection of Eurasian, Eurasian relict, and African accessions (Fig. 1a).Ecotype genomes were sequenced to a coverage of 16 to 100× with long nanopore reads and polished using short Illumina reads with a coverage >40×.Sequences were assembled into five pseudo-chromosomes covering 119-121 Mb (Table 1, Fig. 1).This is a slightly shorter total length compared to the A. thaliana ColPEK and ColCEN references, with sizes of ∼135 Mb (Naish et al. 2021;Hou et al. 2022), but is comparable to the TAIR10.1 assembly (Lamesch et al. 2012).Chromosome lengths of the five main chromosomes were all similar to the TAIR10.1 reference (Table 1).Genomes contained high contiguity, with contig N50s > 8.9 Mb, scaffold N50s > 23 Mb, and L50s of 2-4 (Table 1).These benchmarks are also slightly lower than ColCEN and ColPEK, but comparable to the TAIR10.1 assembly and other recent genome assemblies (Jiao and Schneeberger 2020).All genomes contain a GC% of ∼36%, consistent with the TAIR10.1 reference (Table 1).On average, over 88% of Illumina short-reads mapped to the draft genomes during the polishing steps (Table 1).
BUSCO and k-mer counting were used to assess genome completeness.BUSCO scores were comparable to TAIR10.1, showing completeness of 99% or greater for all genomes (Table 1).K-mer counting using Merqury showed genome completion >90% and QV >30 for all genomes except Mt-0, which showed a completeness of 89% and QV of 24, and Ped-0, which showed a QV of 29 (Table 1).Assembly quality was similar to that of the TAIR10.1 assembly and recent genome assemblies (Jiao and Schneeberger 2020).
Protein sequences were inferred using ab initio, homology, and transcript data.There were ∼32,000-36,000 protein-coding sequences predicted using the combined methods (Table 1).Using homology methods from liftoff, we also predicted over 26,000 protein-coding genes using the Araport11 reference.The large disparity most likely comes from the filtering methods used.Our protein predictions likely overestimate the true number of protein-coding genes, but are more amenable for downstream filtering and analysis.The liftoff predictions instead are only proteincoding genes with strong homology to the existing Araport11 reference set.Using homology-based inference methods (liftoff and blastp), the predicted protein sets showed 96% to 98% and 95% to 98% completeness,  respectively.In agreement with this observation, we assessed the protein-set completeness with BUSCO and found completeness percentages of >98.7 across all ecotypes (Table 1).Taken together, these analyses indicate that a large portion of the protein-coding sequences have been captured.

Orthogroup Prediction
The protein sequences were then used to construct orthogroups using Orthofinder.Across the eight ecotypes, the proteins clustered into ∼44,000 orthogroups (Fig. 1b).

Transposable Element Analysis
Transposon elements (TEs) were predicted de novo using EDTA and AnnoSINE_v2.EDTA was used for annotations of most TEs, while AnnoSINE_v2 was used to predict SINEs that were not captured in the EDTA analysis.
Repetitive elements indicative of TEs were predicted to comprise ∼13%-15% of the genomes (Table 1, Fig. 1, supplementary table S1, Supplementary Material online).Helitrons were the most common type, with TIR class being the second most abundant (Fig. 1c).SINE occurrence was similar across all ecotypes except for Con-0, which had much fewer (Fig. 1c).The occurrence of LINEs was variable across ecotypes, with Yeg-8 having the fewest and Mt-0 having the most (Fig. 1c).

Synteny and Structural Variation
Syntenic relationships between each of the ecotypes and ColPEK were inferred using SYRI.Syntenic relationships showed that the chromosomes are largely conserved across the ecotypes (Fig. 1).The largest area of variation occurred across and around the centromeres, with increased duplication and inversion across the ecotypes (Fig. 1).Interestingly, it appears pericentromeric regions experienced the highest diversity across ecotypes (Fig. 1), though this finding may be confounded by a lack of quality assembly surrounding the centromeres in the ecotypes sequenced here.There also appear to be regions where structural variation is more likely.For example, the region around position 20 to 25 Mb in chromosome 1 appears to be a hot spot for duplication and translocation (Fig. 1).Interestingly, this region has been reported to contain a high number of biotic stress resistance genes (Van de Weyer et al. 2019).It may be that the observed structural variation is indicative of stress adaptation.

Plant Material and Growth Conditions
Arabidopsis seeds from ecotypes Aitba-2, Baa-1, Con-0, DraIV 6-22, Moj-0, Mt-0, Ped-0, and Yeg-8 were acquired from the Arabidopsis Biological Resource Center.Natural variant geographic location and stock information are outlined in supplementary table S2, Supplementary Material online.Seeds were vernalized at 4 °C for 1 to 4 weeks on 2.2 g/L Murashige and Skoog basal medium (Sigma).The plates were then transferred into controlled growth chambers at 22 °C under long-day light conditions (16 h light and 8 h dark).After 7 days, seedlings were transferred to soil supplemented with 1 g/L fertilizer with a ratio 20P:20N:20K (Plant-pro).Plants were watered as needed and every 2 weeks the watering was supplemented with 0.5 g/L additional fertilizer.After ∼3 weeks, plants were transferred to short-day conditions (12 h light and 12 h dark) for 1 week to reduce stress and starch levels.All growth chambers provided light intensities of 120 µmol/ m 2 s and relative humidity of 40% to 60%.At 48 h prior to DNA extraction, plants were transferred into 24-h dark conditions to reduce levels of contaminating polysaccharides and polyphenols.

DNA Extraction and Quality Assessment
High molecular weight DNA was extracted using the Macherey-Nagel HMW Nucleobond isolation kit as per manufacturer's instructions.DNA purity was estimated by measuring OD A260/280 and A260/230 ratios using a nanodrop spectrophotometer (Nanodrop 2000, Thermo-Fischer).DNA quality and fragmentation were visualized by gel electrophoresis.

Library Construction and Genome Sequencing
Genomic DNAs were quantified with the Qubit instrument using the dsDNA BR Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA).Approximately 2 mg of each gDNA was used to prepare sequencing libraries with the SQK-LSK109 Ligation Sequencing Kit and the EXP-NBD104 Native Barcoding Expansion Kit (Nanopore, Oxford, UK).
Libraries were prepared and barcoded following the manufacturer's instructions.The barcoded libraries were pooled equally and sequenced on a PromethION R9.4.1 flow cell.
Sequencing was performed for 68 h.
To estimate protein-coding set completeness, homology was tested using liftoff and blastp v2.12.0 (Camacho et al. 2009) compared to the Araport11 annotations.Proteins were considered homologous if blastp bit scores were ≥50, and sequence similarity was ≥50% following the approach of Pearson (Pearson 2013).Protein-set completeness was also measured with BUSCO v5.5.0 using the embryophyte-odb10 set under mode "proteins."

Gene Orthology Analysis
Orthogroups were determined using Orthofinder v2.5.5 (Emms and Kelly 2019).Orthogroups were split into core (all eight ecotypes contain an orthologous gene), almost core (only seven ecotypes contain an orthologous gene), accessory (two to six ecotypes contain an orthologous gene), and ecotype specific (gene found in a single ecotype).

Transposon Analysis
Repetitive elements were inferred de novo using the Extensive de novo TE Annotator v2.2 (Ou et al. 2019) with default settings.Transposon class was inferred for LTRs, TIRs, non-LTRs, non-TIRs, and repeat regions.For SINE inference, AnnoSINE v2 (Li et al. 2022) was used with default settings.

Fig. 1 .
Fig. 1.Analysis of eight A. thaliana ecotype genomes.a) Location of isolation shown for chosen ecotypes labeled with coordinates listed in parentheses.b) Orthogroup assignment for predicted protein-coding genes across the eight ecotypes are labeled for core (dark gray; all eight ecotypes contain an orthologue), almost core (light gray; seven ecotypes contain an orthologue), accessory (purple; two to six ecotypes contain an orthologue), and ecotype specific (yellow; one ecotype contains a given gene).c) De novo transposon identification.Each ecotype is a single bar; each type of repeat element is shown in a unique color.The total number of predicted transposable elements of each class is shown under each bar.d) Synteny analysis for the five main chromosomes across the eight A. thaliana ecotypes compared to ColPEK.Major structural rearrangements are denoted by colored lines connecting involved chromosomes.Orange = inversion event, green = translocation event, and blue = duplication event.Dark green bars represent the location of the centromere.Map adapted from Wikimedia Commons user STyx.

Table 1
Genome assembly and annotation quality for 8 A. thaliana ecotypes