Chromosome-scale genome sequence assemblies of the ‘Autumn Bliss’ and ‘Malling Jewel’ cultivars of the highly heterozygous red raspberry (Rubus idaeus L.) derived from long-read Oxford Nanopore sequence data

Red raspberry (Rubus idaeus L.) is an economically valuable soft-fruit species with a relatively small (~300 Mb) but highly heterozygous diploid (2n = 2x = 14) genome. Chromosome-scale genome sequences are a vital tool in unravelling the genetic complexity controlling traits of interest in crop plants such as red raspberry, as well as for functional genomics, evolutionary studies, and pan-genomics diversity studies. In this study, we developed genome sequences of a primocane fruiting variety (‘Autumn Bliss’) and a floricane variety (‘Malling Jewel’). The use of long-read Oxford Nanopore Technologies sequencing data yielded long read lengths that permitted well resolved genome sequences for the two cultivars to be assembled. The de novo assemblies of ‘Malling Jewel’ and ‘Autumn Bliss’ contained 79 and 136 contigs respectively, and 263.0 Mb of the ‘Autumn Bliss’ and 265.5 Mb of the ‘Malling Jewel’ assembly could be anchored unambiguously to a previously published red raspberry genome sequence of the cultivar ‘Anitra’. Single copy ortholog analysis (BUSCO) revealed high levels of completeness in both genomes sequenced, with 97.4% of sequences identified in ‘Autumn Bliss’ and 97.7% in ‘Malling Jewel’. The density of repetitive sequence contained in the ‘Autumn Bliss’ and ‘Malling Jewel’ assemblies was significantly higher than in the previously published assembly and centromeric and telomeric regions were identified in both assemblies. A total of 42,823 protein coding regions were identified in the ‘Autumn Bliss’ assembly, whilst 43,027 were identified in the ‘Malling Jewel’ assembly. These chromosome-scale genome sequences represent an excellent genomics resource for red raspberry, particularly around the highly repetitive centromeric and telomeric regions of the genome that are less complete in the previously published ‘Anitra’ genome sequence.


Introduction
Red raspberry (Rubus idaeus L.) is a popular, highly heterozygous diploid (2n = 2x = 14) perennial crop plant, with 822,493 tonnes of raspberries harvested annually throughout the world (http://www.fao.org/faostat/en/#data/QC). As such, red raspberry is one of the most economically valuable soft-fruit species, and interest in the development of new varieties through breeding and selection has led to many breeding programmes for the species being established globally. There are two main flowering habits in commercial red raspberry germplasm: floricane fruiting (such as the cultivar 'Malling Jewel') in which canes grown the previous season produce lateral shoots in the second year of growth that bear the flowers and fruits; and primocane fruiting (such as the cultivar 'Autumn Bliss'), in which the first-year canes bear a limited number of flowers and fruits in the late summer or early autumn. Conventional breeding of red raspberry germplasm has led to many commercially-successful varieties being released, however genetic improvement in this highly heterozygous outbreeding species is slow [1] with several factors, including inbreeding depression, deleterious recessive alleles [2] and loss of fertility, amongst others, making the breeding process challenging. Plant genomics has the potential to significantly increase the precision and accuracy of breeding and selection of crop plants through the development and application of molecular markers, marker-assisted breeding and genomic selection. As such, significant molecular resources have been developed to date to support the breeding effort in red raspberry [3].
Chromosome-scale genome sequences are a vital tool in unravelling the genetic complexity controlling traits of interest in crop plants, as well as for functional genomics, evolutionary studies, and pan-genomics diversity studies. However, there are significant challenges in assembling genomes of highly heterozygous species using short-read sequence data because heterozygosity significantly increases the complexity of the de Bruijn graph structure predominantly used in short-read assemblers. Additionally, sequence length may make correct resolution of haplotigs in highly heterozygous genomes intractable, leading to fragmented assemblies containing many small contigs. Despite a relatively small haploid genome size of~300 Mb, a diploid genome structure, and good progress in early sequencing efforts for red raspberry [4], the highly heterozygous nature of the genome, and the relatively high cost of early short-read sequencing data, meant that initial assemblies were highly fragmented and consisted of many thousands of scaffolds [5]. Thus, the development of chromosome-scale genomics resources for red raspberry has lagged behind those of closely-related species with genomes of a similar size such as F. vesca [6,7], Potentilla micrantha [8] and Rubus occidentalis [9].
The advent of long-read sequencing technologies such as the platforms offered by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) has permitted high-quality, chromosome-scale genome sequence assemblies of many plant species, including those with relatively complex genomes such as the allo-octoploid strawberry F. × ananassa [7]. Long-read PacBio sequence data was recently used to construct chromosome-scale genome sequences of R. idaeus cultivars 'Joan J' [10] and 'Anitra' [11]. In the study of Davik et al., [11] the genome sequence covered 291.7 Mb, more than 99% of the estimated genome size of R. idaeus, with 85% of the sequence contigs resolved incorporated into seven chromosome-scale scaffolds, and 98% of Benchmarking Universal Single-Copy Orthologs (BUSCO) genes present in the assembly. The sequence of 'Joan J' [10] covered a total of~300 Mb with a BUSCO completeness of 95.3%, but summary statistics for the genome were not available, and to date, the sequence from that publication has not been deposited in public sequence data repositories. These chromosome-scale genome sequences represent an excellent genomics resource for red raspberry, however additional sequence data and assemblies will help improve overall genome coverage and completeness, particularly around the highly repetitive centromeric and telomeric regions of the genome. More complete sequences will increase the value of these genomics resources for breeding and selection, as well as for more fundamental studies within the species.
Here we present the genome sequences for two red raspberry cultivars, 'Autumn Bliss' and 'Malling Jewel', along with associated gene predictions. 'Malling Jewel' ['Preussen' x (S1'Lloyd George' x S1'Pyne's Royal' F2)] is a floricane red raspberry variety that was released in 1949 and represents a 'pure' R. idaeus genotype with 25% genetic contribution from R. idaeus strigosus (from 'Superlative') in a background of R. idaeus vulgatus. In contrast, 'Autumn Bliss', released in 1983 is a strict primocane variety with a complex hybrid genetic background [12] which has been used extensively as a parent in breeding worldwide for its early and productive primocane season, its aphid resistance, and its tolerance to soil-borne pathogens. The genome sequences produced in this investigation were compared to the recently-published genome sequence of the cultivar 'Anitra' [11]. The use of long-read ONT sequencing data yielded significantly longer read lengths than the PacBio sequence employed in the assembly presented by Davik et al. [11], which permitted more complete genome sequence assemblies to be resolved and in which the majority of the centromeric regions were defined.

DNA extraction and sequencing
DNA was extracted from fresh, young leaf material collected from a single plant of the R. idaeus cultivars 'Autumn Bliss' and 'Malling Jewel' using a high molecular weight genomic DNA extraction protocol [13]. Long-read sequencing libraries were prepared using the SQK-LSK108 Ligation Sequencing Kit (Oxford Nanopore Technologies) from approximately 1 μg of high molecular weight genomic DNA, following the manufacturer's protocol. Longread libraries were sequenced on R9.4.1 Spot-On Flow cells (FLO-MIN106) using the Grid-ION X5 platform (Oxford Nanopore Technologies) set to high accuracy base calling. A PCR free short read Illumina sequencing library was prepared for each of the two cultivars using an insert size of 450 bp. The 'Malling Jewel' libraries were sequenced with 250 bp paired end reads on the Illumina HiSeq 2500 platform at the Earlham Institute (Norwich, UK) and the 'Autumn Bliss' libraries were sequenced with 300 bp paired end reads on the Illumina MiSeq platform at NIAB East Malling.

RNA extraction and sequencing
Developing fruits of the red raspberry cultivar 'Anitra' were collected at three maturity stages; unripe, turning, and fully mature. Fruit samples from each maturity stage were collected at Graminor Njøs (Norway), where they were divided into three biological replicates, snap frozen in liquid nitrogen and stored at -80˚C until the samples were processed further. Tissue samples were ground to a fine powder under liquid nitrogen, and total RNA was extracted using the RNeasy Plant Mini Kit (Qiagen, Germany) following the manufacturer's instructions. The concentration and purity of the resultant RNA was measured using a QIAxpert spectrophotometer (Qiagen, Germany) and the integrity of the RNA was determined using a Qubit 4.0 fluorimeter (Thermo Fisher Scientific, UK). Samples with an RNA integrity number (RIN) value above 7.0 were submitted for subsequent RNA-Seq. Library preparations were performed using the NEB Next1 ultra RNA Library Prep Kit (Biolabs, Inc., Beijing, China) and 150 bp paired-end sequencing was performed by Norwegian Sequencing Centre (Oslo University Hospital, Norway) using the HiSeq2500 platform (Illumina Inc., Beijing, China) to yield between 12.6 and 19.4 Gb of data per sample.

Gene prediction and annotation
Repetitive and low complexity sequences in the reference-guided assemblies for 'Autumn Bliss' and 'Malling Jewel' were identified and soft masked using Red version: 05/22/2015 [27]. The R. idaeus cv. 'Anitra' RNA-seq reads were quality controlled using FastQC v0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and adapters and low-quality regions were trimmed using Trimmomatic v0.39 [21]. Assemblies were indexed and RNA-seq reads were aligned using HISAT2 v2.2.1 [28] using default settings. Gene prediction was performed using BRAKER2 v2.1.6 [29] under the-etpmode setting with the RNA-seq and the eudicots protein database as evidence. Annotation completeness of the genome was assessed using BUSCO v5.2.2 [26], using default parameters and the gene families set defined for the eudicots_odb10 database. Gene predictions for the purged alternative contigs in each assembly were performed and gene predictions unique to the these contigs were identified using blastn. The repeat co-ordinates file output from Red and the 'gene' co-ordinates from the annotations file output from BRAKER2 were plotted using pyCircos (https://github.com/ponnhide/ pyCircos) to visualise the distribution and density of repeat and coding regions.

Functional annotation
Pairwise sequence comparison of the predicted proteomes of each genome, along with the genes found uniquely in the alternative purged contigs was performed using the BLAST + blastp-fast algorithm [31] through the Galaxy platform [32] against the NCBI nr, SwissProt, RefSeq, TrEMBL and Araport11 protein databases using an expectation value cutoff of 1e-6. InterProScan v5 [33] was used to assign InterPro domains and Gene Ontology (GO) terms, whilst BlastKOALA v2.2 [34] and eggNOG-mapper v2 [35] were used to map Kyoto Encyclopaedia of Genes and Genomes (KEGG) orthologs and KEGG pathways respectively. GO enrichment analyses were carried out using Cytoscape 3.9.1 [36] and the BinGO 3.0.3 plug-in [37] performing a hypergeometric test with False Discovery Rate (FDR) correction, a 0.05 significance level and GO Full ontology.

Comparative genomics
Syntenic alignments to the 'Anitra' genome sequence [11] were generated and plotted using D-genies [38] implementing Minimap2 v2.17-r941 [19] for the main assemblies for both 'Autumn Bliss' and 'Malling Jewel'. Additionally, the alternative contigs purged from heterozygous regions of the genome during assembly were plotted against the primary assemblies of each cultivar to determine their genomic positions. OrthoFinder [39] was used to identify gene families in the 'Autumn Bliss', 'Malling Jewel' and 'Anitra' genome sequences [11], along with the genome sequences of the related species R. chingii v1.0 [40], R. occidentalis v3.0 [41] and Fragaria vesca v4.0.a2 [42]. Presence or absence of genes in orthogroups was used to determine species-and cultivar-specific gene families.

Long-and short-read sequencing data
Sequencing of high molecular weight genomic DNA from 'Autumn Bliss' and 'Malling Jewel' using long read ONT libraries yielded 15.8 Gb and 12.0 Gb of data for the two cultivars respectively. Following filtering, the sequencing datasets contained 833,623 reads for 'Malling Jewel' and 991,952 reads for 'Autumn Bliss', representing 55× and 42× coverage, respectively. The read length N 50 was 37.5 Kb for 'Malling Jewel' and 18.4 Kb for 'Autumn Bliss', with maximum read lengths of 456,177 bp and 366,503 bp, respectively for the two cultivars (Table 1). After adapter trimming and the removal of low-quality sequence data, a total of 18.7 Gb of 250 bp paired-end Illumina sequencing data was produced for 'Malling Jewel' and 5.0 Gb of 300 bp paired-end Illumina sequencing data was produced for 'Autumn Bliss', representing 69× and 19× genome coverage, respectively.

Genome sequence assembly
The k-mer analysis performed for 'Autumn Bliss' returned a predicted heterozygosity of 1.54% and a total predicted genome length of 196. 8-198.5 Mbp. The data for 'Malling Jewel' returned a predicted heterozygosity of 0.45% and a total predicted genome length of 294.9-298.6 Mbp

Genome sequence scaffolding
The de novo assemblies of 'Autumn Bliss' and 'Malling Jewel' were scaffolded against the seven pseudochromosomes of the 'Anitra' genome sequence. Following scaffolding, 136 contigs of 'Autumn Bliss' and 79 contigs of 'Malling Jewel' were anchored to the seven 'Anitra' pseudochromosomes (Table 3). In total, 263.0 Mb (97.9%) of the 'Autumn Bliss' assembly was anchored to the 'Anitra' chromosomes, with a largest scaffold of 50.7 Mb and an N 50 of 35.7 Mb, whilst 265.5 Mb (99.9%) of the 'Malling Jewel' assembly was anchored to the 'Anitra' chromosomes, with a largest scaffold of 44.5 Mb and an N 50 of 36.1 Mb (Table 4). Of the ten contigs that were not anchored from the 'Autumn Bliss' assembly, nine were included in other contigs, whilst the longest contig (4.6 Mb) contained highly repetitive sequences. The four contigs from the 'Malling Jewel' assembly that were not anchored to the 'Anitra' chromosomes represented fragmented sequences of the chloroplast and mitochondrial genomes. The scaffolded, anchored assemblies were 15.5 Mb (6.3%) and 18 Mb (7.3%) larger than the previously published 'Anitra' genome sequence for 'Autumn Bliss' and 'Malling Jewel' respectively. Single copy ortholog analysis (BUSCO) revealed high levels of completeness in both genomes, with 97.4% of the sequences identified in 'Autumn Bliss' and 97.7% of the sequences identified in 'Malling Jewel' ( Table 4). The seven pseudochromosomes assembled for each genome were numbered according to the 'Anitra' chromosomes which in turn are concordant with the chromosomes of other Rosoideae genomes such as F. vesca [6] and Rosa chinensis [43]. All raw sequencing data and genome assemblies presented here are available at the NCBI under the Bioproject IDs PRJNA886864, PRJNA886865 and PRJNA886875. The assemblies of the 'Autumn Bliss' and 'Malling Jewel' genomes, along with the transcriptome annotations of the 'Autumn Bliss', 'Malling Jewel' and 'Anitra' have also been deposited on the Genome Database for Rosaceae [44].

Assembly completeness
The assembled 'Autumn Bliss' and 'Malling Jewel' contigs were highly syntenic and concordant with the previously published R. idaeus cv. 'Anitra' genome sequence of Davik et al. [11], despite being significantly larger than the 'Anitra' pseudochromosome assembly (Fig 1). S2 Fig contains the names and positions of the 'Autumn Bliss' and 'Malling Jewel' scaffolds along the 'Anitra' assembly, along with their lengths in base pairs. S1 File shows the relative positions of the alternative contigs purged from the primary assembly from heterozygous regions of the genome in relation to the associated primary assembly of each cultivar. The additional length of the primary assemblies presented was partly due to 15% of the total 'Anitra' assembly length not being scaffolded into the seven main pseudochromosomes, however there were also differences in the abundance of repetitive sequence within the genomes. Extensive repetitive regions were identified throughout the pseudochromosomes of all three assemblies (Fig 2) and the relative abundance of sequence within highly repetitive regions in the assemblies was compared to assess genome completeness. Regions with a higher density of repetitive sequence and lower gene density were observed near the centre of each of the seven 'Anitra' pseudomolecules, which we inferred correspond to the centromeres of each the seven chromosomes. These same regions were clearly identifiable in the 'Autumn Bliss' and 'Malling Jewel' pseudomolecules, and the density of repetitive sequence they contained was significantly higher than in the 'Anitra' assembly (Fig 3). The 'Autumn Bliss' and 'Malling Jewel' assemblies contained a greater total repetitive sequence within the centromeric regions, and more tandem repeat sequences overall on all seven pseudochromosomes than the 'Anitra' assembly, except for chromosome 4 of 'Autumn Bliss' (Fig 3). Additionally, telomeric repeat sequences ([CCCTAAA]n) were identified at both ends of five pseudochromosomes and at one end of the remaining two pseudochromosomes in 'Malling Jewel', whilst one pseudochromosome of the 'Autumn Bliss' assembly contained identifiable telomeric repeats at both ends, and a further three contained identifiable repeats at a single end.

Gene prediction and annotation
A total of 41,800 protein coding regions were identified in the 'Autumn Bliss' assembly, whilst 40,491 were identified in the 'Malling Jewel' assembly. The distribution of these protein coding regions across the seven pseudochromosomes for each cultivar is shown in Fig 2. A further 1,023 genes in 'Autumn Bliss' and 1,227 genes in 'Malling Jewel' not present in the primary assemblies were found in the purged contigs. A BUSCO analysis of the proteome of the two assemblies identified 96.9% queried proteins in the 'Autumn Bliss' proteome, and 97.7% in the 'Malling Jewel' proteome ( Table 5). The 'Anitra' genome sequence was re-annotated using the same methodology to ensure consistency in downstream orthology analyses [45]. The reannotation of the 'Anitra' assembly resulted in 41,509 protein coding regions, with a BUSCO completeness score of 97.3% compared to 39,448 (91.3%) in the previous study of [11] indicating a largely complete gene-space characterisation in all three genomes. The statistics for the number of protein-coding gene predictions for each cultivar that returned �1 positive hits after the BlastP analysis with nr, Araport11, RefSeq, SwissProt and TrEMBL databases as subjects, are given in Table 6, along with the number of protein-coding gene regions assigned Interpro, GO, KEGG orthology and KEGG pathway terms. Best matches resulting from BlastP homology searches for each dataset are detailed in S2 File Gene prediction BlastP summary, whilst the functional annotation results are given in S3 File Gene prediction functional annotation.

Discussion
Plant genomes are often highly heterozygous and are composed of a significant proportion of highly repetitive DNA [46]. As such, large and highly contiguous assemblies are often intractable with short-read sequence data alone. In this report, we present two new chromosome-scale genome sequence assemblies for red raspberry (R. idaeus) derived from the cultivars 'Autumn   Bliss' and 'Malling Jewel' which were constructed using long-read ONT sequence data and were compared to the previously published genome sequence assembly of the R. idaeus cultivar 'Anitra' [11]. Prior to scaffolding with long-range structural information, both genomes sequenced in this investigation returned a high level of contiguity, with just 143 and 83 contigs and a contig L 50 of 25 and 7 returned for the 'Autumn Bliss' and 'Malling Jewel' assemblies respectively. This is in contrast to the 2,350 contigs in the 'Anitra' assembly, with both new assemblies containing at least 15 Mb more scaffolded sequence data than the 'Anitra' sequence [11]. The lower degree of contiguity within the 'Autumn Bliss' assembly is likely due to the differences in the long-read N 50 values (37.5 Kb vs 18.4 Kb in 'Malling Jewel' and 'Autumn Bliss') of the sequence data returned for each cultivar. In contrast to the ONT data used in this investigation, Davik et al. [11] used PacBio long-read sequence data for the chromosome-scale assembly of the 'Anitra' genome. The PacBio reads used in that assembly had a mean sub-read length of 9.5 Kb and an N 50 length of 13.6 Kb, which is significantly shorter than the N 50 read length of the 'Autumn Bliss' and 'Malling Jewel' datasets presented here. In the final assembly, the highly repetitive centromeric regions of the 'Anitra' assembly were less well resolved than the generich regions of the genome and significantly less well resolved than the same regions in the 'Autumn Bliss' and 'Malling Jewel' genomes presented here. The greater fragmentation in the pre-scaffolded 'Anitra' contigs than in the 'Autumn Bliss' and 'Malling Jewel' assemblies resulted in a significantly shorter final scaffolded sequence for 'Anitra', demonstrating the value of very long-read sequence data in resolving highly heterozygous genome sequences. The two new genome assemblies for red raspberry presented here displayed a very high degree of synteny with the genome sequence of 'Anitra'. Analysis of the tandem repeats in the genome sequences of 'Autumn Bliss', 'Malling Jewel' and 'Anitra' demonstrated that much of the additional sequence data incorporated into the new assemblies was contained in the centromeric regions, which were more accurately assembled in the 'Autumn Bliss' and 'Malling Jewel' genomes due to the longer read length of the ONT data used compared to the shorter PacBio data used in the 'Anitra' assembly.
Cultivated red raspberry varieties are derived from the interbreeding of both the European red raspberry (R. idaeus subsp. idaeus L.) and the North American red raspberry (R. idaeus subsp. strigosus L.) as well as other species and subspecies. However, the global genetic structure of red raspberry populations has yet to be extensively studied. Modern red raspberry varieties are derived from a relatively narrow genetic base [47], with just 20 clones accounting for the majority of the genetic diversity in 137 varieties of known pedigree that have been released worldwide between 1960 and 1992. However, R. idaeus is naturally an out-crossing species, and as such, high levels of genome differentiation and heterozygosity are a feature of red raspberry germplasm. The two varieties for which genome sequence assemblies are presented in this paper are from very different backgrounds within the red raspberry germplasm base. 'Malling Jewel' is a floricane with a relatively pure R. idaeus genetic background, containing several important founder clones in its pedigree, whilst 'Autumn Bliss' in contrast is a strict primocane variety, with a complex hybrid pedigree including contributions from R. arcticus and R. occidentalis, meaning these two genomes span a significant representative portion of cultivated red raspberry diversity.
Many traits that have been selected for during the domestication and breeding of red raspberry including large fruit size, flavour, colour and pest and disease resistance, and alleles controlling these traits are likely to have been through genetic bottlenecks, with not all favourable alleles passed to all breeding populations of the crop globally. The genes contained within the 5,286 R. idaeus-specific orthogroups identified in this study were enriched for metabolic process, DNA binding, peptidase activity and endopeptidase activity, however, there is likely to be further differentiation within the cultivated and wild red raspberry germplasm that exists globally. The genetic resources for the red raspberry cultivars 'Autumn Bliss' and 'Malling Jewel' presented here, along with the previously published sequence of the cultivar 'Anitra' are valuable resources to begin to understand the structure and function of the red raspberry pangenome and provide a route to unlock the potential of red raspberry germplasm through precise identification and characterisation of genetic loci controlling traits of agronomic importance.