The genome sequence of the apple, Malus domestica (Suckow) Borkh., 1803 [version 1; peer review: awaiting peer review]

We present genome assemblies from four Malus domestica cultivars (the apple; Streptophyta; Magnoliopsida; Rosales; Rosaceae). The genome sequences are 643–653 megabases in span. Most of each assembly (99.24–99.74% of its length) is scaffolded into 17 chromosomal pseudomolecules. The mitochondrial and plastid genomes were also assembled and are 400 kilobases and 167 kilobases in length, respectively.


Background
Malus domestica (Suckow) Borkh., the cultivated or sweet apple (Figure 1), belongs to the rose family (Rosaceae) and is one of the most important fruit crops in temperate regions (FAO). It is a hybrid species with a complex history of domestication and hybridisation that is gradually being unravelled using genetic methods (Cornille et al., 2014; Duan et al., 2017; Migicovsky et al., 2021; Sun et al., 2020). The progenitor is M. sieversii (Ledeb.) M.Roem., a native of Central Asia, which hybridised with the Caucasian apple M. orientalis Uglitz, the European wild apple M. sylvestris Mill. and possibly the Siberian apple M. baccata (L.) Borkh. as it was traded along the Silk Road (Cornille et al., 2014).
Apples are largely self-incompatible, and offspring grown from seed often do not resemble the mother tree. To preserve desirable characteristics of a variety, such as taste and disease resistance, vegetative propagation by grafting has been employed for centuries. Selection and cloning of superior genotypes derived from chance open-pollinated progenies have created the several thousand sweet apple varieties, often of unknown parentage, that we know today (Morgan & Richards, 2002). While the majority of cultivated apples are diploid (2n = 2x = 34), triploid varieties (i.e., those with three whole genomes in each nucleus, 2n = 3x = 51) are not uncommon, including some of the widely available cultivars such as 'Bramley's Seedling', 'Newton's apple', and 'Blenheim Orange'. While the impact of triploidy on traits relevant to apple breeding is variable, it typically results in more vigorous growth, larger fruits, better disease resistance (e.g. resistance to apple scab caused by the ascomycete fungus Venturia inaequalis) and/or enhanced tolerance to abiotic stress (Sattler et al., 2016; Sedov et al., 2014).
The increasing availability of full genome sequences of different apple cultivars and wild relatives contributes to (i) shedding light on their parentage, (ii) helping to further untangle their often enigmatic origin, and (iii) accelerating the use of wild species and genetically distinct genotypes of M. domestica for apple improvement, including increasing the content of some of the key phytochemicals present in apples (e.g. flozins, flavan-3-ols and dihydrochalcones), which have been shown to have important health-promoting benefits (Anastasiadi et al., 2017; Howes et al., 2020; Newman & Cragg, 2020; Simmonds & Howes, 2016). The genome of the apple, M. domestica, was sequenced as part of the Darwin Tree of Life Project, a collaborative effort to sequence all named eukaryotic species in the Atlantic Archipelago of Britain and Ireland. Here we present a chromosomally complete genome sequence for M. domestica, based on four cultivars: 'Costard' from the Royal Botanic Gardens, Kew, 'Brown Snout' and 'Bardsey' from the Royal Horticultural Society Garden Wisley, and a Newton's apple cultivar from Woolsthorpe-by-Colsterworth.

Genome sequence report
The genomes were sequenced from material harvested from four M. domestica cultivars as described in the Methods. The assemblies are based on PacBio data, 10X Genomics Chromium data and Arima Hi-C data generated for each specimen. The fold-coverage in Pacific Biosciences single-molecule circular consensus (HiFi) long reads and 10X Genomics read clouds per specimen is shown in Table 1. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. The corrections made during manual curation and the final assembly lengths are shown in Table 1.
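As a minimal sketch of how a fold-coverage figure of the kind reported in Table 1 can be derived, the total number of sequenced bases is divided by the assembly span. The function name and the input values below are hypothetical and are not taken from Table 1.

```python
def fold_coverage(total_sequenced_bases: int, assembly_span_bases: int) -> float:
    """Return mean fold-coverage: sequenced bases divided by assembly span."""
    if assembly_span_bases <= 0:
        raise ValueError("assembly span must be positive")
    return total_sequenced_bases / assembly_span_bases


# Hypothetical inputs: 26 Gb of HiFi reads over a 650 Mb assembly.
hifi_bases = 26_000_000_000
assembly_span = 650_000_000
print(f"{fold_coverage(hifi_bases, assembly_span):.1f}x")
```

The same calculation applies to the 10X Genomics read clouds, using the total bases in that data set.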
The chromosome-scale scaffolds are named by synteny based on the assembly of Malus domestica (apple) GCA_004115385.1 (Table 2, Figure 2 to Figure 5).
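The summary statistics quoted for each assembly (the percentage of the length placed in chromosomal pseudomolecules, and contiguity measures such as scaffold N50) can be computed from scaffold lengths alone. The sketch below assumes a toy assembly with 17 equal-length chromosomes and four unplaced scaffolds; these lengths are illustrative only and do not correspond to any of the four assemblies.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def percent_in_chromosomes(chromosome_lengths, unplaced_lengths):
    """Return the percentage of assembly length in chromosomal scaffolds."""
    placed = sum(chromosome_lengths)
    total = placed + sum(unplaced_lengths)
    return 100.0 * placed / total


# Toy assembly (hypothetical lengths in base pairs).
chromosomes = [38_000_000] * 17
unplaced = [1_000_000] * 4

print(f"{percent_in_chromosomes(chromosomes, unplaced):.2f}% in chromosomes")
print(f"scaffold N50 = {n50(chromosomes + unplaced):,} bp")
```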
The following were noted during curation: drMalDome5.3: From the Hi-C data, inversions between haplotypes can be seen on chromosome 4 (11.13-14. [...] Table 3).
The fourth specimen (drMalDome58) was taken from a Newton's apple cultivar by Jennifer Johns of the National Trust at Woolsthorpe-by-Colsterworth, Lincolnshire, UK (latitude 52.81, longitude -0.63). The leaves were snap-frozen in liquid nitrogen.

Sample acquisition and nucleic acid extraction
DNA was extracted at the Tree of Life laboratory, Wellcome Sanger Institute (WSI). The samples were weighed and dissected on dry ice, with tissue set aside for Hi-C sequencing. The plant tissue was cryogenically disrupted to a fine powder using a Covaris cryoPREP Automated Dry Pulveriser, receiving multiple impacts. High molecular weight (HMW) DNA was extracted using the Qiagen MagAttract HMW DNA extraction kit. Fragment size analysis of 0.01-0.5 ng of DNA was then performed using an Agilent FemtoPulse. Low molecular weight DNA was removed from a 200 ng aliquot of extracted DNA using the 0.8X AMPure XP purification kit prior to 10X Chromium sequencing; a minimum of 50 ng DNA was submitted for 10X sequencing. HMW DNA was sheared to an average fragment size of 12-20 kb in a Megaruptor 3 system with speed setting 30. Sheared DNA was purified by solid-phase reversible immobilisation using AMPure PB beads with a 1.8X ratio of beads to sample to remove the shorter fragments and concentrate the DNA sample. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
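The bead-to-sample ratios quoted above (0.8X and 1.8X) are volume ratios, so the bead volume to add follows directly from the sample volume. A minimal sketch of that arithmetic is given below; the 50 µl sample volume is a hypothetical value chosen for illustration and is not stated in the protocol.

```python
def bead_volume_ul(sample_volume_ul: float, bead_ratio: float) -> float:
    """Return the bead volume (in microlitres) for a given bead:sample ratio."""
    if sample_volume_ul <= 0 or bead_ratio <= 0:
        raise ValueError("sample volume and bead ratio must be positive")
    return sample_volume_ul * bead_ratio


sample_ul = 50.0  # hypothetical sample volume, not from the protocol

# 1.8X clean-up of sheared DNA; 0.8X removal of low molecular weight DNA.
for ratio in (1.8, 0.8):
    print(f"{ratio}X: add {bead_volume_ul(sample_ul, ratio):.1f} ul of beads")
```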
RNA was extracted from specimens drMalDome5, drMalDome10 and drMalDome11 in the Tree of Life Laboratory at the WSI using TRIzol, according to the manufacturer's instructions. RNA was then eluted in 50 μl RNase-free water and its concentration assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit RNA Broad-Range (BR) Assay kit. RNA integrity was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.

Sequencing
Pacific Biosciences HiFi circular consensus and 10X Genomics read cloud DNA sequencing libraries were constructed according to the manufacturers' instructions. DNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi) and Illumina NovaSeq 6000 (10X) instruments and RNA-Seq on an Illumina HiSeq 4000 instrument. Hi-C data were generated using the Arima v2 kit and sequenced on the Illumina NovaSeq 6000 instrument.

Genome assembly
Assembly was carried out with Hifiasm (Cheng et al., 2021), and haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). One round of polishing was performed by aligning 10X Genomics read data to the assembly with longranger align and calling variants with freebayes (Garrison & Marth, 2012). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using SALSA2 (Ghurye et al., 2019). The assembly was checked for contamination as described previously (Howe et al., 2021). Manual curation (Howe et al., 2021) was performed using HiGlass (Kerpedjiev et al., 2018) and Pretext (Harry, 2022). The mitochondrial and plastid genomes were assembled from PacBio HiFi reads mapping to related genomes using MBG (Rautiainen & Marschall, 2021). A representative circular sequence was selected for each organelle from the graph based on read coverage.
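The selection of a representative circular organelle sequence by read coverage can be illustrated with a short sketch. The exact selection rule is not specified here, so the code below assumes one plausible rule: among candidate paths that are circular, take the one with the highest mean read coverage. The `Candidate` class, its fields and the example values are all hypothetical and are not part of MBG or its output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate organelle sequence traced through the assembly graph."""
    name: str
    length_bp: int
    circular: bool
    mean_coverage: float


def pick_representative(candidates):
    """Return the circular candidate with the highest mean read coverage.

    Assumed rule for illustration only; the published selection criterion
    is stated simply as being based on read coverage.
    """
    circular = [c for c in candidates if c.circular]
    if not circular:
        raise ValueError("no circular candidate available")
    return max(circular, key=lambda c: c.mean_coverage)


# Hypothetical candidates for a plastid genome.
candidates = [
    Candidate("path_a", 160_000, True, 850.0),
    Candidate("path_b", 160_100, True, 1200.0),
    Candidate("path_c", 95_000, False, 3000.0),  # linear, so excluded
]
print(pick_representative(candidates).name)
```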
The genome was analysed and BUSCO scores generated within the BlobToolKit environment (Challis et al., 2020). Table 4 contains a list of all software tool versions used, where appropriate. The genome sequence is released openly for reuse. The Malus domestica genome sequencing initiative is part of the Darwin Tree of Life (DToL) project. All raw sequence data and the assembly have been deposited in INSDC databases. The genome will be annotated using available RNA-Seq data and presented through the Ensembl pipeline at the European Bioinformatics Institute. Raw data and assembly accession identifiers are reported in Table 3.