Chromosome-scale genome assembly of a Japanese chili pepper landrace, Capsicum annuum ‘Takanotsume’

Abstract Here, we report the genome sequence of a popular Japanese chili pepper landrace, Capsicum annuum ‘Takanotsume’. We used long-read sequencing and optical mapping, together with genetic mapping, to obtain the chromosome-scale genome assembly of ‘Takanotsume’. The assembly consists of 12 pseudomolecules, which corresponds to the basic chromosome number of C. annuum, and is 3,058.5 Mb in size, spanning 96.5% of the estimated genome size. A total of 34,324 high-confidence genes were predicted in the genome, and 83.4% of the genome assembly was occupied by repetitive sequences. Comparative genomics of linked-read sequencing-derived de novo genome assemblies of two Capsicum chinense lines and whole-genome resequencing analysis of Capsicum species revealed not only nucleotide sequence variations but also genome structure variations (i.e. chromosomal rearrangements and transposon-insertion polymorphisms) between ‘Takanotsume’ and its relatives. Overall, the genome sequence data generated in this study will accelerate the pan-genomics and breeding of Capsicum, and facilitate the dissection of genetic mechanisms underlying the agronomically important traits of ‘Takanotsume’.


Introduction
The genus Capsicum includes four major species, C. annuum, C. baccatum, C. chinense, and C. frutescens, all of which are used as vegetables and spices. 1 Because of partial cross-compatibility among Capsicum species, attractive cultivars have been bred worldwide through both inter- and intraspecific crossing. 1 Therefore, pedigrees of interspecific hybrids are complicated and error-prone during the breeding process. The availability of interspecific hybrids depends on the combinations of parental lines used for their generation. 2 Some combinations generate morphologically abnormal F1 hybrids, which fail to survive. 3 This phenomenon is caused by a negative interaction between two independent genetic loci, a hypothesis also known as the Bateson-Dobzhansky-Muller (BDM) model, which has been observed in wide interspecific crosses in animals and plants, including pepper. 4
To the best of our knowledge, the genomes of four C. annuum lines, one C. baccatum line, and one C. chinense line have been sequenced to date. [5][6][7][8][9][10] These sequences were constructed using two next-generation sequencing technologies: short-read sequencing and error-prone long-read sequencing. Since the genomes of Capsicum species are larger and more complex than those of their relatives, for example, Solanum species, [11][12][13] complete and high-quality genome sequencing of Capsicum might be difficult with the existing technologies. Therefore, the available sequence data have gaps, even though the sequences are assembled at the chromosome level. [5][6][7][8][9][10] Recent advances in sequencing technologies enable the generation of high-quality long reads, also known as HiFi reads. 14 Furthermore, techniques such as chromosome conformation capture, 15 which generates chromatin contact maps, and optical mapping, 16 which outputs high-resolution genome-wide restriction maps, are also available.
These technologies could be used to assemble the genomes of multiple lines of different Capsicum species, generating the Capsicum pan-genome, 17,18 which would enhance our understanding of its genetic mechanisms and provide insights into Capsicum evolution.
'Takanotsume' (which in Japanese literally translates to 'The Claw of the Hawk') is a Japanese C. annuum landrace named after the shape of its fruit, which is similar to that of the nails of hawks. 'Takanotsume' plants exhibit indeterminate growth, with a spread-out branching habit, and are cultivated for the thin-fleshed fruits, 19 which are used as a spice with a pungency level of approximately 11,900 on the Scoville scale. 20 Because of the rapid water loss from its fruits post-harvest, 'Takanotsume' has become a popular cultivar for spice purposes, and its derivative lines, such as 'Hontaka' and 'Daruma', have been distributed all over Japan. 19 However, the pedigree of 'Takanotsume' is unclear. 'Takanotsume' also possesses some unique characteristics, including two independent genes, which confer interspecific cross-compatibility explained by the BDM model, 3 and high ribonuclease activity in leaves, which could combat chrysanthemum stunt viroid in vivo. 21 To reveal the genetic mechanisms underlying the attractive traits of 'Takanotsume', a high-quality genome assembly is required. In this study, we employed the HiFi sequencing technology, together with optical mapping and genetic mapping methods, to generate a chromosome-scale genome sequence assembly of 'Takanotsume'. Comparative genomics revealed nucleotide sequence variations, chromosome structural rearrangements, and transposon-insertion polymorphisms within the Capsicum species. The genome sequence and variant information obtained in this study would be helpful for elucidating the genetic mechanisms controlling the unique traits of 'Takanotsume'.

Genome sequencing and data analysis
A short-read sequence library of 'Takanotsume' was prepared using the TruSeq DNA PCR-Free Sample Preparation Kit (Illumina) and sequenced on the NextSeq500 instrument (Illumina) in paired-end 151 bp mode. After removing low-quality bases (quality value of <10) with PRINSEQ 22 and adaptor sequences (AGATCGGAAGAGC) with fastx_clipper in the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit), the genome size of 'Takanotsume' was estimated using Jellyfish (k-mer size = 17). 23
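The principle behind a k-mer-based genome size estimate can be illustrated with a toy calculation. The Python sketch below is not Jellyfish and rests on simplifying assumptions: k-mers are counted in memory on the forward strand only, the most frequent depth at or above a hypothetical error cutoff (`min_depth`) is taken as the homozygous peak, and the genome size is approximated as the total number of k-mer occurrences divided by that peak depth. The function names `count_kmers` and `estimate_genome_size` are illustrative and do not come from the study.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across an iterable of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Return (estimated size, peak depth) from a k-mer count table.

    Assumption: k-mers seen fewer than `min_depth` times are sequencing
    errors and are ignored. The depth shared by the largest number of
    distinct k-mers is treated as the single-copy (homozygous) peak.
    """
    # Histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by the peak depth approximates the
    # number of k-mer positions in the genome, i.e. its size.
    total_occurrences = sum(depth * n for depth, n in usable.items())
    return total_occurrences / peak_depth, peak_depth
```

With real data the depth histogram would be produced by a dedicated k-mer counter (Jellyfish with k = 17 in this study) rather than by an in-memory dictionary, and the peak would normally be inspected on a plot such as Supplementary Figure S1.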

Linked-read sequencing and assembly
Genomic DNA was extracted from the young leaves of 'Takanotsume', '3686', and 'Sy-2' plants using Genomic Tip (Qiagen, Hilden, Germany), and high-molecular-weight DNA (fragment length >40 kb) was selected with BluePippin (Sage Science, Beverly, MA, USA). A genomic DNA library was prepared using the Chromium Genome Library Kit v2 (10X Genomics, Pleasanton, CA, USA) and sequenced on the NovaSeq 6000 platform (Illumina, San Diego, CA, USA) to generate paired-end 150 bp reads. The obtained sequence reads were assembled with Supernova (10X Genomics), with a maximum of 2 billion reads subsampled (maxreads = 2,000,000,000).

Long-read sequencing and assembly
The genomic DNA of 'Takanotsume' used for linked-read sequencing was also used for long-read sequencing. Briefly, the genomic DNA of 'Takanotsume' was sheared in a DNA Shearing Tube g-TUBE (Covaris, Woburn, MA, USA) by centrifugation at 1,600 × g. The sheared DNA was used for HiFi SMRTbell library preparation with the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA). The resultant library was separated on BluePippin (Sage Science) to remove short DNA fragments (<20 kb), and sequenced with SMRT cell 8 M on Sequel II and Sequel IIe systems (PacBio). The obtained HiFi reads were assembled with Hifiasm 24 (version 0.15.2) with the default parameters.

Optical mapping
Genomic DNA was extracted from young 'Takanotsume' leaves using the Plant DNA Isolation Kit (Bionano Genomics, San Diego, CA, USA), in accordance with the Bionano Prep Plant Tissue DNA Isolation Base Protocol. The isolated genomic DNA was treated with DLE-1 nickase, and labelled with a fluorescent dye supplied in the DLS DNA Labeling Kit (Bionano Genomics). The labelled DNA was run on the Saphyr Optical Genome Mapping Instrument (Bionano Genomics). The output reads were assembled and then merged with the HiFi assembly to generate hybrid scaffold sequences using Bionano Solve (Bionano Genomics) with the default parameters.

Genetic mapping and chromosome-level assembly
Genomic DNA was extracted from all F2 individuals (n = 118) and their parental lines using the DNeasy Plant Mini Kit (Qiagen). The obtained DNA samples were digested with PstI and MspI to construct a double-digest restriction-site-associated DNA sequencing (ddRAD-Seq) library, 25 which was sequenced on HiSeq 4000 (Illumina) in paired-end mode. The obtained sequence reads were subjected to quality control (as described above), and mapped onto the hybrid scaffold sequences with Bowtie 2. 26 High-confidence biallelic single nucleotide polymorphisms (SNPs) were identified using the mpileup option of SAMtools, 27 and filtered with VCFtools 28 using the following criteria: read depth ≥5; SNP quality = 10; proportion of missing data <50%. The identified SNPs were subjected to linkage analysis using Lep-Map3. 29 Contig sequences were anchored to the genetic map, and pseudomolecule sequences were established with ALLMAPS. 30 Using D-genies, 31

Gene and repeat prediction
Gene prediction was performed with BRAKER2, 32 based on the peptide sequences of the predicted genes of C. annuum line 'CM334' and RNA-Seq reads obtained from the Sequence Read Archive (SRA) database of the National Center for Biotechnology Information (accession nos.: SRR17837286-SRR17837292 and SRR17837303-SRR17837315). Simultaneously, gene sequences reported in the genome assemblies of C. annuum lines 'CM334' (v.1.55: 30,242 genes) and 'Zunla-1' (v2.0: 35,336 genes) were mapped onto the 'Takanotsume' genome assembly with Liftoff. 33 Genome positions of the predicted and mapped genes were compared with the intersect command of BEDtools. 34 Functional annotation of the genes was performed with Hayai-Annotation Plants. 35 Repetitive sequences in the genome assembly of 'Takanotsume' were identified with RepeatMasker (https://www.repeatmasker.org) using repeat sequences registered in Repbase and a de novo repeat library built with RepeatModeler (https://www.repeatmasker.org).

Genetic diversity analysis
Whole-genome shotgun libraries of 13 Capsicum lines were prepared with the TruSeq DNA PCR-Free Sample Prep Kit (Illumina), in accordance with the manufacturer's protocol. The resultant libraries were sequenced either on HiSeq 2500 (Illumina) to generate paired-end 250 bp reads or on NextSeq500 (Illumina) and NovaSeq 6000 (Illumina) platforms to generate paired-end 151 bp reads. The reads were subjected to quality control (as described above) and mapped onto the pseudomolecule sequences of 'Takanotsume' with Bowtie 2. 26 Sequence variants were detected using the mpileup and call commands of BCFtools, 27 and high-confidence biallelic SNPs were identified with VCFtools 28 using the following parameters: minimum read depth ≥8 (--minDP 8); minimum variant quality = 20 (--minQ 20); maximum missing data <0.5 (--max-missing 0.5); and minor allele frequency ≥0.05 (--maf 0.05). Effects of SNPs on gene function were estimated with SnpEff. 36 The population structure of the 13 Capsicum lines and 'Takanotsume' was evaluated by maximum-likelihood estimation of individual ancestries with ADMIXTURE 37 and by principal component analysis with TASSEL. 38 Genetic distances among the 13 Capsicum lines and 'Takanotsume' were calculated with the neighbour-joining method implemented in TASSEL, 38 and a dendrogram was drawn with iTOL. 39 Insertion polymorphisms of Tcc transposons, which have been reported to affect pungency level in chili pepper, 40 were investigated across the 13 Capsicum lines with PTEMD. 41
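The SNP filtering criteria listed above can be restated as a small decision function. The Python sketch below does not call VCFtools. It is a simplified reading of the four thresholds (--minDP 8, --minQ 20, --max-missing 0.5, --maf 0.05) applied to one biallelic site, assuming that genotypes below the depth threshold are treated as missing and that the missing-data threshold is a minimum call rate. The record layout (a site quality plus a list of `(alleles, depth)` pairs) and the function name `passes_filters` are hypothetical.

```python
def passes_filters(qual, genotypes, min_dp=8, min_qual=20.0,
                   min_call_rate=0.5, min_maf=0.05):
    """Decide whether one biallelic site passes the four thresholds.

    `genotypes` is a list of (alleles, depth) pairs, one per sample, where
    `alleles` is a tuple of 0/1 allele codes, or None for a missing call.
    """
    # Site-level quality threshold (corresponds to --minQ).
    if qual < min_qual:
        return False
    # Genotypes with insufficient depth are treated as missing (--minDP).
    called = [alleles for alleles, depth in genotypes
              if alleles is not None and depth >= min_dp]
    # Require a minimum proportion of called genotypes (--max-missing).
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False
    observed = [allele for alleles in called for allele in alleles]
    if not observed:
        return False
    # Minor allele frequency among the called alleles (--maf).
    alt_freq = sum(observed) / len(observed)
    return min(alt_freq, 1.0 - alt_freq) >= min_maf


# A polymorphic site with one missing sample out of four.
example_site = [((0, 0), 10), ((1, 1), 12), ((0, 1), 9), (None, 0)]
print(passes_filters(30.0, example_site))
```

Keeping the thresholds as keyword arguments makes it easy to reproduce the less stringent settings used for the ddRAD-Seq data in the genetic mapping step.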

Assembly of Capsicum genomes
Genome size estimation with 74.7 Gb short-read data indicated that 'Takanotsume' has a homozygous genome, with an estimated haploid genome size of 3,168.4 Mb (Supplementary Figure S1).
To extend the sequence contiguity, optical mapping was performed. Data amounting to 1,201.0 Gb (read length ≥150 kb) were generated, a subset (600 Gb) of which was employed for further analysis. Of the 600 Gb data, 563.8 Gb data (number of reads = 1,355,894; N50 length = 407.5 kb) were used for de novo assembly, generating 40 molecule maps (total length = 3,078.9 Mb; N50 = 247.1 Mb). In the subsequent hybrid scaffolding process, 2 and 16 conflicts in the 40 molecule maps and CAN_r1.0, respectively, were resolved. Then, a hybrid scaffold comprising 23 sequences (total length = 3,074.2 Mb; N50 = 253.3 Mb) was obtained (Table 1), which was designated as CAN_r1.1.
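The N50 values quoted for reads, molecule maps, and scaffolds follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. A minimal Python sketch of that definition is given below. The function name `n50` and the example lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half of the total
    # length has been accumulated; the current length is the N50.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with arbitrary lengths (not real scaffold sizes).
print(n50([5, 4, 3, 2, 1]))
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about whether the sequences are correctly ordered or oriented.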
To anchor the CAN_r1.1 scaffold sequences to chromosomes, genetic mapping was performed. The DNA of the mapping population and their parental lines was subjected to ddRAD-Seq analysis, which generated 1.0 M reads per sample. After quality filtering, high-quality reads were mapped onto the CAN_r1.1 assembly, with a mapping rate of 92.5%. This resulted in the detection of 1,836 high-confidence SNPs. A linkage analysis of these SNPs resulted in a genetic map, with a total of 12 linkage groups and 1,736 SNPs, and a total genetic distance of 748.8 cM (Table 2). Eighteen CAN_r1.1 scaffolds were anchored to the genetic map (Supplementary Figure S2). Nine of these scaffolds were anchored to nine chromosomes (ch01, ch02, ch03, ch04, ch05, ch06, ch08, ch09, and ch10; one per chromosome), while the remaining nine scaffolds were anchored to ch11 (two scaffolds), ch12 (three scaffolds), and ch07 (four scaffolds) (Figure 1). This final assembly was designated as CAN_r1.2.pmol.
The complete BUSCO score of the high-confidence genes was 95.0% (Supplementary Table S2). Functional annotation analysis showed that of the 34,324 high-confidence genes, 7,609, 15,746, and 10,581 sequences were assigned to Gene Ontology slim terms in the biological process, molecular function, and cellular component categories, respectively, and 1,959 genes had enzyme commission numbers (Supplementary Table S3).
Repetitive sequences occupied a total physical distance of 2,549.4 Mb (83.4%) in the CAN_r1.2.pmol genome assembly (3,058.5 Mb). Nine major types of repeats were identified in varying proportions (Table 3, Figure 1). The dominant repeat types in the chromosome sequences were long-terminal repeats (63.1%, 1,928.4 Mb), including Gypsy-type (54.0%, 1,651.9 Mb) and Copia-type retroelements (5.9%, 178.9 Mb). Repeat sequences unavailable in public databases totalled 273.0 Mb.

Sequence and structural variations within the genus Capsicum
First, genome structure variants between C. annuum and C. chinense were investigated. Genome sequences of two C. chinense lines, '3686' and 'Sy-2', were constructed with the linked-read technology. The genome of '3686' was 3,211.8 Mb in size, according to the k-mer frequency analysis (Supplementary Figure S1), and the resultant assembly was 3,019.6 Mb in size, with 31,863 sequences and a contig N50 length of 9.0 Mb (Table 1, Supplementary Table S1). On the other hand, the genome size of 'Sy-2' was estimated as 3,303.2 Mb (Supplementary Figure S1), and the assembly size was 3,000.5 Mb, including 30,812 sequences with a contig N50 length of 12.0 Mb (Table 1, Supplementary Table S1). The complete BUSCO scores of the '3686' and 'Sy-2' genomes were 96.0% and 96.6%, respectively (Supplementary Table S2). Finally, alignment analysis revealed that the '3686', 'Sy-2', and CAN_r0.1 sequences covered 85.6%, 85.2%, and 96.3% of the CAN_r1.2.pmol reference sequence, respectively.
Next, sequence variants were detected in six C. annuum, two C. baccatum, and five C. chinense lines. On average, 84.5 Gb short-read data were obtained from the 13 lines and mapped onto CAN_r1.2.pmol, with mapping rates of 96.4% for C. annuum, 80.2% for C. baccatum, and 87.3% for C. chinense. Totals of 5.2, 32.9, and 43.8 million high-confidence SNPs were found in C. annuum, C. baccatum, and C. chinense, respectively. In the C. annuum lines, the SNP distribution pattern was biased (Figure 1, Supplementary Figure S3), with a high density on ch09, ch10, and ch11 of '106', '110', 'California Wonder', and 'Sweet Banana'. According to the SnpEff results, the most prominent SNP type was modifier impact (98.5%) in intergenic regions and introns, followed by moderate impact (0.9%; leading to missense mutations), low impact (0.5%; synonymous mutations), and high impact (0.1%; nonsense mutations and mutations at splice junctions) (Supplementary Table S4). The admixture analysis indicated that the 13 lines and 'Takanotsume' were grouped into three clusters (K = 3).
A total of 263 polymorphic sites of transposon insertions were found across the 13 Capsicum lines. Interestingly, the number of polymorphic sites was biased in accordance with species. In the two C. baccatum and five C. chinense lines, the number of polymorphic sites ranged from 6 in pun1 to 83 in Sy-2 (Figure 3); however, no polymorphic sites were observed in any of the C. annuum lines investigated. Of the 263 sites, 20 insertions were detected in gene sequences, while the remaining 243 insertions were found in intergenic regions (Supplementary Table S5).
Comparative genomics revealed that the genome structures of 'Takanotsume' and 'CA59' were well conserved; however, the chromosomes of five Capsicum lines ('CM334', 'Zunla-1', 'UCD-10X-F1', 'PBC81', and 'PI159236') were disrupted in the middle (Figure 4). Moreover, five potential translocations were detected in the 'Takanotsume' genome, including one on ch01 (compared with the ch08 of 'PBC81' and 'PI159236'), two on ch03 (one compared with the ch05 of 'PBC81' and another relative to the ch09 of 'PBC81'), one on ch05 (compared with the ch03 of 'PBC81'), and one on ch09 (compared with the ch03 of 'PBC81').

Discussion
Here, we present the chromosome-scale genome assembly (CAN_r1.2.pmol) of a popular Japanese chili pepper C. annuum landrace, 'Takanotsume' (Figure 1). The assembly spanned a total length of 3,058 Mb, which corresponded to 96.5% of the estimated genome size (Supplementary Figure S1, Table 1). Sequence gaps (total length = 8.2 Mb) were observed at 171 locations on 12 chromosomes (Table 1). The contiguity of this chromosome-level assembly was much greater than that obtained using the 10X Genomics Chromium technology (Table 1). The genome coverage of the 'Takanotsume' assembly was comparable with that of 'CA59' and higher than those of 'CM334', 'UCD-10X-F1', 'Zunla-1', and the relatives 'PBC81' and 'PI159236'. Moreover, in these five lines, sequence orders and orientations in the middle of the chromosomes were disrupted relative to 'Takanotsume' (Figure 4). This suggested that the genome structures varied within the Capsicum genus and/or there were misassembly points in the genomes of the five above-mentioned lines, probably because of the short-read sequencing technologies employed. To validate this assumption, further karyotyping studies with fluorescence in situ hybridization are required. In addition, genic regions in the CAN_r1.2.pmol assembly were also well annotated (Supplementary Tables S2 and S3). A total of 34,324 high-confidence genes in CAN_r1.2.pmol (Table 2) were supported by those in 'CM334' and/or 'Zunla-1'.
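The proportions quoted in this paragraph follow directly from the reported sizes. As a quick arithmetic check, the Python sketch below recomputes the fraction of the estimated genome covered by the assembly and the fraction of the assembly occupied by gaps, using the values reported in this article (3,058.5 Mb assembly, 3,168.4 Mb estimated genome size, 8.2 Mb of gaps). The constant and function names are illustrative.

```python
# Values reported in the text, in megabases.
ASSEMBLY_MB = 3058.5          # CAN_r1.2.pmol total length
ESTIMATED_GENOME_MB = 3168.4  # k-mer-based estimate of the haploid genome
GAP_MB = 8.2                  # total length of sequence gaps


def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


span_pct = percent(ASSEMBLY_MB, ESTIMATED_GENOME_MB)
gap_pct = percent(GAP_MB, ASSEMBLY_MB)
print(f"assembly / estimated genome: {span_pct:.1f}%")
print(f"gaps / assembly: {gap_pct:.2f}%")
```

On these inputs the gaps amount to well under 1% of the assembled length.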
The population structure analyses indicated that the three Capsicum species could be discriminated by genetic variation across the genome (Figure 2, Supplementary Figure S4), except for 'Aji Rojo', which is a C. baccatum line but was grouped in the C. chinense cluster. In our previous studies, 10,42 192 Capsicum lines, mainly comprising four species (C. annuum, C. baccatum, C. chinense, and C. frutescens), were roughly classified into four groups representing the species; however, there were mismatches between the classifications based on morphological traits and those based on DNA sequence, probably due to misclassification of species based on morphological traits and/or genome introgression between different species. 42 These observations suggest that the concept of species might need to be reconsidered, as has long been discussed, 43,44 especially for crops including Capsicum because of the cross-compatibility between species. Indeed, in accordance with nuclear and plastid genotypes, 'Takanotsume' is suggested to be a derivative of the hybridization between C. annuum as a paternal parent and either C. chinense or C. frutescens as a maternal parent. 42 The 'Takanotsume' genome assembly from this study might help to clarify this unclear pedigree.
'Takanotsume' exhibits attractive, agriculturally important phenotypes. 19 One of them is the restoration of hybrid breakdown in the progeny derived from crosses between C. annuum and either C. baccatum or C. chinense. 3 This phenomenon could be explained by the BDM model, 4 which was originally proposed more than 100 years ago; however, the molecular mechanisms remain unclear. Owing to the high-quality genome assembly and high coverage of the gene-rich regions, a map-based cloning strategy, together with gene editing and/or virus-induced gene silencing, would identify the genes capable of restoring hybrid breakdown in pepper. This would provide new insights into the molecular mechanisms underlying the long-unresolved BDM model. Another important characteristic of 'Takanotsume' is high ribonuclease activity in leaves. 21 This trait would be agronomically useful for the development of biopesticides to combat RNA viruses around the world. Identification of the genes responsible for the RNase activity in 'Takanotsume' would enable the regulation of enzyme activity and specificity.
In addition to nucleotide sequence polymorphisms, structural variations including copy-number variations (also known as presence-absence variations) and chromosomal rearrangements (such as translocations and inversions) can also contribute to phenotypic variation. Transposon-insertion polymorphisms (Supplementary Table S5) could also affect phenotypic variation, even within a species. 40 Therefore, a single reference genome sequence of a species is insufficient for gaining insights into its genomics and genetics. 45 A genome sequence established by sequencing the genomes of multiple lines of a species is called the pan-genome. 46 A pan-genome study of Capsicum recently conducted 17,18 will likely accelerate the pace of Capsicum genomics. The chromosome-level genome assembly of 'Takanotsume' constructed in this study is expected to contribute to the pan-genome study of Capsicum.