Chromosome-scale genome assembly of kiwifruit Actinidia eriantha with single-molecule sequencing and chromatin interaction mapping

Abstract
Background: Kiwifruit (Actinidia spp.) is a dioecious plant with fruits containing abundant vitamin C and minerals. A handful of kiwifruit species have been domesticated, among which Actinidia eriantha is increasingly favored in breeding owing to its superior commercial traits. Recently, elite cultivars of A. eriantha have been successfully selected, and further studies on their biology and breeding potential require genomic information, which is currently unavailable.
Findings: We assembled a chromosome-scale genome sequence of A. eriantha cultivar White using single-molecule sequencing and chromatin interaction map-based scaffolding. The assembly has a total size of 690.6 megabases and an N50 of 21.7 megabases. Approximately 99% of the assembly was anchored to 29 pseudomolecules corresponding to the 29 kiwifruit chromosomes. Forty-three percent of the A. eriantha genome is repetitive sequence, and the non-repetitive part encodes 42,988 protein-coding genes, of which 39,075 have homologues from other plant species or protein domains. The divergence time between A. eriantha and its close relative Actinidia chinensis is estimated to be 3.3 million years, and after diversification, 1,727 and 1,506 gene families are expanded and contracted in A. eriantha, respectively.
Conclusions: We provide a high-quality reference genome for kiwifruit A. eriantha. This chromosome-scale genome assembly is substantially better than 2 published kiwifruit assemblies from A. chinensis in terms of genome contiguity and completeness. The availability of the A. eriantha genome provides a valuable resource for facilitating kiwifruit breeding and studies of kiwifruit biology.


Introduction
Kiwifruit is well known as the king of fruits due to its remarkably high vitamin C content and abundant minerals [1,2]. Native to China, kiwifruit belongs to the genus Actinidia, which contains 54 species and 75 taxa [3]. All species in this genus are perennial, deciduous, and dioecious with a climbing or scrambling growth habit, and they share many morphological features, including the characteristic radiating arrangement of the styles of the female flower and the structure of the fruit [4]. Despite the rich germplasm resources of kiwifruit, only a few Actinidia species have been domesticated, such as A. chinensis var. chinensis, A. chinensis var. deliciosa, and A. eriantha, whose fruit sizes are close to the commercial standard [5][6][7].
Actinidia eriantha has also been used for genetic and genomic studies thanks to its high efficiency in genetic transformation and relatively short juvenile phase [13]. Flowering and fruiting of A. eriantha can be accomplished within two years under greenhouse conditions with a low requirement for winter chilling [13]. In addition, roots of A. eriantha, which contain many bioactive compounds such as triterpenes and polysaccharides, are employed in traditional Chinese medicine for the treatment of gastric carcinoma, nasopharyngeal carcinoma, breast carcinoma, and hepatitis [12,14].
Previously, two kiwifruit genomes were published, both from A. chinensis ('Hongyang' and 'Red 5') [15,16]. These short-read-based assemblies are highly fragmented, possibly due to the high complexity and heterozygosity of kiwifruit genomes as well as technical limitations. Here, we used single-molecule sequencing combined with high-throughput chromosome conformation capture (Hi-C) technology to assemble the genome of the elite kiwifruit cultivar 'White' of A. eriantha. The availability of this high-quality chromosome-scale genome sequence not only provides fundamental knowledge regarding kiwifruit biology but also presents a valuable resource for kiwifruit breeding programs.

Sample collection and genome sequencing
Fresh young leaves were collected from a female individual of A. eriantha cv. White. High molecular weight (HMW) genomic DNA was extracted using the CTAB method as described in  (Table S1).
To construct the Hi-C library, 'White' plants were grown in a greenhouse, and approximately 4-6 g of young leaves were harvested and fixed in 1% (v/v) formaldehyde for 10 min at room temperature. The fixation was terminated by adding glycine to a final concentration of 0.125 M. The fixed samples were ground into powder in liquid nitrogen and then lysed by adding Triton X-100 to a concentration of 1% (v/v). The nuclei were isolated and prepared for Hi-C library construction according to a previously published protocol [20]. The library was sequenced on an Illumina HiSeq 2500 system in paired-end mode, which yielded a total of approximately 118 million read pairs.

Transcriptome sequencing
To improve gene prediction, we generated transcriptome sequences from a pool of mixed tissues of 'White' including root, stem, leaf, flower, and fruits at 7, 30, 60, 90 and 120 days after anthesis.

Chromosome-scale assembly of the A. eriantha genome
We employed a strategy that exploits the complementary strengths of different assemblers to construct the 'White' genome from PacBio long reads. First, PacBio long reads were corrected and assembled using the Canu program [24] (v1.7), a modularized pipeline consisting of three primary stages: read correction, trimming, and assembly. The Canu-corrected reads were also assembled independently with the wtdbg program (https://github.com/ruanjue/wtdbg), a fast assembler for long noisy reads. Subsequently, the two independent assemblies (one from Canu and the other from wtdbg) were merged with Quickmerge [25] (v0.2) to improve contiguity. The merged assembly was further processed to correct errors using Pilon [26].
To scaffold the contigs based on chromatin interaction maps inferred from the Hi-C data, we first used HiC-Pro [27] to evaluate and filter the cleaned Hi-C reads. Hi-C data usually contain a considerable proportion of invalid interaction read pairs, which are non-informative and need to be filtered out beforehand. Among the 51 million read pairs that were uniquely aligned to the A. eriantha assembly, 33 million (64.1%) were valid interaction pairs, and their insert sizes spanned predominantly from dozens to hundreds of kilobases, thereby providing useful information for scaffolding. As part of the error correction of the assembly, we also used the valid Hi-C reads to identify potentially misassembled contigs. In principle, a genuine contig should display a continuous Hi-C interaction map, whereas a discontinuous interaction map likely indicates a misassembly. We examined the interaction map of each contig and broke 51 contigs that were possibly misassembled. Subsequently, the corrected PacBio assembly was scaffolded using the LACHESIS program [28] with the parameters "CLUSTER_MIN_RE_SITES=48, CLUSTER_MAX_LINK_DENSITY=2, CLUSTER_NONINFORMATIVE_RATIO=2, ORDER_MIN_N_RES_IN_TRUNK=14, ORDER_MIN_N_RES_IN_SHREDS=15".
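The continuity criterion described above can be illustrated with a minimal sketch. It is not the procedure used in this study, which relied on examining the interaction map of each contig; it only shows, under assumed inputs, how a gap in Hi-C contact continuity could be flagged automatically. The binned contact matrix, the window size, and the threshold ratio are all hypothetical.

```python
from statistics import median


def spanning_contacts(matrix, window):
    """For each boundary between bin i-1 and bin i, sum the contacts that
    link the `window` bins on its left with the `window` bins on its right."""
    n = len(matrix)
    scores = []
    for boundary in range(1, n):
        left = range(max(0, boundary - window), boundary)
        right = range(boundary, min(n, boundary + window))
        scores.append(sum(matrix[i][j] for i in left for j in right))
    return scores


def candidate_breaks(matrix, window=2, ratio=0.2):
    """Return boundaries whose spanning-contact score falls below
    `ratio` times the median score (a hypothetical threshold)."""
    scores = spanning_contacts(matrix, window)
    cutoff = ratio * median(scores)
    return [b for b, s in enumerate(scores, start=1) if s < cutoff]


# Toy contig of six bins: bins 0-2 and bins 3-5 interact strongly within
# each block but almost never across the boundary between bins 2 and 3.
toy = [
    [50, 20, 10, 0, 0, 0],
    [20, 50, 20, 0, 0, 0],
    [10, 20, 50, 1, 0, 0],
    [0, 0, 1, 50, 20, 10],
    [0, 0, 0, 20, 50, 20],
    [0, 0, 0, 10, 20, 50],
]
breaks = candidate_breaks(toy)
print(breaks)
```

In this toy example the only boundary with almost no spanning contacts is the one between bins 2 and 3, so it is the only candidate break point. A real analysis would also need to account for coverage differences and repeats, which this sketch ignores.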
LACHESIS assigned 3,666 contigs with a total size of 682,355,494 bp (98.84% of the assembly) into 29 groups corresponding to the 29 kiwifruit chromosomes (Fig. 2 and 3a). We then identified synteny between the A. eriantha 'White' assembly and the assembly of A. chinensis 'red5' using MUMmer [29] (version 4.0.0beta2). In general, the two assemblies showed high macro-collinearity, with only a few inconsistencies (Fig. 3b). A detailed check of the inconsistent regions using mate-pair read alignments supported the assembly of the A. eriantha 'White' genome in these regions; the inconsistencies could therefore be due to errors in the 'red5' assembly or structural variations between 'White' and 'red5' (Fig. S1).
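For reference, the N50 reported for this assembly (21.7 megabases) follows the conventional definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly size. A minimal sketch of that calculation follows; the sequence lengths are made-up toy values, not data from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths (arbitrary units), total = 300.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Here the two longest sequences (80 + 70 = 150) already cover half of the total, so the N50 of the toy set is 70.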

Repeat annotation
Repeats were annotated following a protocol described in Campbell et al. [30]. The repeat content of A. eriantha 'White' was much higher than that of A. chinensis (e.g. 36% in Hongyang [15]), and this difference could be largely due to the improved assembly of repeat regions with PacBio long reads. In addition, variation between the two kiwifruit species could also contribute to this difference.

Prediction and functional annotation of protein-coding genes
Protein-coding genes were predicted from the repeat-masked A. eriantha 'White' genome with the MAKER-P program [30]. RNA-Seq data generated in this study were assembled de novo with Trinity, and the assembled contigs were aligned to the 'White' genome assembly to provide transcript evidence. Predictions supported by the three different sources of evidence were finally integrated by MAKER-P, which resulted in a total of 52,514 preliminary gene models. We then filtered and polished these gene models in two steps. First, we combined our RNA-Seq data with data collected in a previous study [36] and mapped the reads to the 'White' genome using the STAR program [22]; a total of 266 million read pairs were mapped. Based on the mapping, a raw count for each predicted gene model was derived and then normalized to CPM (counts per million mapped read pairs). Gene models with ultra-low expression (CPM < 0.1) were considered less likely to be real genes. Furthermore, we found that these lowly expressed genes had relatively high annotation edit distance (AED) scores, an indication of low confidence as defined by the MAKER-P program. Therefore, for gene models with CPM < 0.1, we kept only those containing both Pfam domains and homologous sequences in the NCBI nr protein database. After this filtering process, 42,613 gene models were kept.
Second, the predicted protein-coding genes of kiwifruit A. chinensis 'red5' have been manually curated [16]; these gene models should therefore have relatively higher accuracy and could be used to modify A. eriantha 'White' gene models whose predictions were not consistently supported by the different types of evidence. To this end, we performed two additional ab initio predictions using BRAKER2 [37] and GeMoMa [38] (version 1.5.2) with the 'red5' proteome as the sole evidence. These two predictions were compared with the gene models predicted by MAKER-P. Consequently, a total of 237 gene models not predicted by MAKER-P were added, and another 415 gene models that had better predictions by BRAKER2 or GeMoMa were used to replace the corresponding gene models predicted by MAKER-P. Finally, we obtained a total of 42,850 protein-coding genes in the A. eriantha 'White' genome, with a mean coding sequence (CDS) size of 1,004 bp and an average of five exons.
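The two-step filtering and polishing procedure above can be summarized in a short sketch. The data structures, field names, gene identifiers, and toy values below are hypothetical and are not taken from the study's pipeline. Only the CPM < 0.1 threshold, the requirement for both a Pfam domain and an nr homolog, and the add-or-replace logic come from the text.

```python
CPM_CUTOFF = 0.1  # threshold stated in the text


def cpm(raw_count, total_mapped_pairs):
    """Counts per million mapped read pairs."""
    return raw_count * 1_000_000 / total_mapped_pairs


def keep_model(model, total_mapped_pairs):
    """Step 1: keep expressed models; keep lowly expressed ones only if
    they have both a Pfam domain and an nr homolog."""
    if cpm(model["count"], total_mapped_pairs) >= CPM_CUTOFF:
        return True
    return model["has_pfam"] and model["has_nr_hit"]


def polish(kept, added, replacements):
    """Step 2: swap in better predictions for existing models and add
    models that were missing. Replacing changes a model, not the count."""
    merged = dict(kept)
    for gene_id, better in replacements.items():
        if gene_id in merged:
            merged[gene_id] = better
    merged.update(added)
    return merged


# Hypothetical toy input: four preliminary models, 10 million mapped pairs.
total_pairs = 10_000_000
models = {
    "g1": {"count": 500, "has_pfam": False, "has_nr_hit": False, "source": "MAKER-P"},
    "g2": {"count": 0, "has_pfam": True, "has_nr_hit": True, "source": "MAKER-P"},
    "g3": {"count": 0, "has_pfam": True, "has_nr_hit": False, "source": "MAKER-P"},
    "g4": {"count": 2, "has_pfam": False, "has_nr_hit": False, "source": "MAKER-P"},
}
kept = {gid: m for gid, m in models.items() if keep_model(m, total_pairs)}

final = polish(
    kept,
    added={"g5": {"count": 0, "has_pfam": True, "has_nr_hit": True, "source": "BRAKER2"}},
    replacements={"g1": {"count": 500, "has_pfam": False, "has_nr_hit": False, "source": "GeMoMa"}},
)
print(sorted(kept), len(final))
```

In the toy run, g3 is dropped because it is unexpressed and lacks an nr homolog, g1 is replaced in place, and g5 is added. The same bookkeeping is consistent with the reported totals: 42,613 + 237 = 42,850, since the 415 replaced models do not change the count.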
The predicted genes were functionally annotated by blasting their protein sequences against (N=39,075) of the predicted genes contain at least one annotation from the above databases (Table S4).