Genome sequence data from 17 accessions of Ensete ventricosum, a staple food crop for millions in Ethiopia

We present raw sequence reads and genome assemblies from 17 accessions of the Ethiopian orphan crop plant enset (Ensete ventricosum (Welw.) Cheesman), generated using the Illumina HiSeq and MiSeq platforms. Also presented is a catalogue of single-nucleotide polymorphisms inferred from the sequence data at an average density of approximately one per kilobase of genomic DNA.


Value of the data
Here we present the first genome-wide sequence data available for enset accessions cultivated or growing wild in Ethiopia.
There is potential to exploit genetic diversity (e.g. large numbers of single-nucleotide polymorphisms) to generate markers to assist enset selection for key agronomic traits.
Given the long lifespan of enset, patterns of genetic variation can be used to classify germplasm and to prioritise and select germplasm for use in breeding.

Data
The data presented here include enset genomic resequencing data, in the form of sequence reads generated using the Illumina massively parallel deoxyribonucleic acid (DNA) sequencing platform. Also included are draft genome assemblies, a catalogue of single-nucleotide polymorphisms (SNPs) inferred from the sequence data, and images of agarose gels containing results of genotyping assays for several SNPs. Enset (Ensete ventricosum (Welw.) Cheesman) is a perennial, herbaceous plant belonging to the same botanical family as bananas and plantains, namely the Musaceae [1]. Although it does not yield edible fruits, it is the most important cultivated staple food crop in the highlands of central, south and southwestern Ethiopia, where it has cultural significance [2] as well as a key role in food security [3,4]. The main food value is in the large starch-rich corm, which can be boiled and consumed in a similar manner to tubers such as potato or can be used to generate a fermented product known as kocho [3,5–9].
Enset varieties display a great range of genetic and phenotypic variation [7,10–16] (Fig. 1), and 15 phenotypic traits have been assayed for a collection of 387 enset accessions [17]. Integration of phenotypic measurements with genetic markers could be of great value in breeding improved varieties with enhanced resistance to abiotic and biotic stresses. Despite its importance for the food security of millions in Ethiopia, enset has been relatively neglected in molecular research and few genomic resources are available. We previously published a first draft genome sequence of E. ventricosum [18], but the sequenced individual was obtained from the nursery trade (from the UK-based company Jungle Seeds); its provenance is unknown and its relevance to Ethiopian agriculture is therefore uncertain. It is phylogenetically rather distant from Ethiopian varieties (Fig. 2) and clusters much more closely with E. ventricosum e4 (GenBank: FJ428156.1) [19], whose provenance is also unknown. In contrast, the data presented here originate from enset accessions collected in Ethiopia. Most of these accessions were sourced from the germplasm collection of the Southern Agricultural Research Institute (SARI), with the exception of Bedadeti, which originated from the collection of the International Institute of Tropical Agriculture (IITA). The data presented here complement previously published genomic resequencing data from Ensete species: targeted sequencing of repeats in Ensete gilletii [20] and E. ventricosum variety Gena [21] and exon sequencing of Ensete superbum and E. ventricosum [22].

Fig. 2. Phylogenetic positions of the enset accessions sequenced here compared to that of the previously sequenced enset genome, based on sequences of the trnF-trnT barcode voucher region of the chloroplast DNA. This locus has previously been used as a barcode and phylogenetic indicator, and sequence data for this locus are available from previously published studies (Bekele and Shigeta [36]; Li et al. [19]; Harrison et al. [18]). There was no sequence variation at this locus among the 17 genomes presented here, as judged by BWA alignments of raw sequence reads against the trnF-trnT sequence. Thus, the branch indicated by the black circle represents the phylogenetic position of all 17 sequenced accessions. A black triangle highlights the position of the "Jungle Seeds" individual whose genome was previously sequenced. The Maximum Likelihood tree presented here is based on a multiple sequence alignment of trnF-trnT sequences generated using MUSCLE (Edgar, 2004). Evolutionary history was inferred by using the Maximum Likelihood method based on the Tamura-Nei model (Tamura and Nei [37]). The tree with the highest log likelihood (-1249.11) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 32 nucleotide sequences. All positions containing gaps and missing data were eliminated. There were a total of 666 positions in the final dataset. Evolutionary analyses were conducted in MEGA7 (Kumar et al. [38]).

Experimental design, materials and methods
Genomic DNA was extracted from the young emerging (cigar) leaves using a previously published mini-prep protocol [23]. Between 0.2 and 0.5 g of young and clean leaf was collected per plant and dried in silica gel. From these dried leaves 0.2 g was taken from each sample and ground with a sterile pestle and mortar. Genomic DNA was isolated from about 0.2 g of pulverized leaf sample using a modified triple cetyltrimethyl ammonium bromide (CTAB) extraction technique [24]. The yield and quality of DNA were assessed by agarose gel electrophoresis and by a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, Delaware) and quantified by Qubit broad range assay (Thermo Fisher Scientific). Illumina sequencing libraries were prepared, after fragmenting 500 ng of DNA to an average size of 500 bp, using Nextflex Rapid DNAseq kit for Illumina sequencing (Bioo Scientific) with adapters containing indexes and 5-8 cycles of polymerase chain reaction (PCR) [25]. Library quality was determined using D1000 screen-tapes (Agilent) and libraries were either sequenced individually or combined in equimolar pools. We sequenced the enset genomic DNA using a combination of Illumina [26,27] MiSeq and/or Illumina HiSeq 2500 in either normal or rapid-run modes, as detailed in Table 1. The 17 sequenced accessions included 15 distinct named varieties. We sequenced two different accessions for cultivar Mazia and two different accessions for cultivar Lochingie (a result of complex vernacular naming systems for enset landraces arising from multiple ethno-linguistic communities); one accession was sequenced for each of the other varieties. Raw sequence reads were submitted to the Sequence Read Archive (SRA) [28] under the accession numbers listed in Table 1.

Table 1 Illumina sequencing of E. ventricosum accessions. Pairs of 100-bp reads were generated using the Illumina HiSeq 2500 in normal mode except where indicated. A single asterisk (*) indicates use of the Illumina HiSeq 2500 in rapid-run mode to generate pairs of 300-bp reads and two asterisks (**) indicate use of the Illumina MiSeq to generate pairs of 300-bp reads.

Prior to further analysis, sequence reads were trimmed and filtered using TrimGalore with options "-q 30 --paired". We performed de novo sequence assembly for sequence reads from Bedadeti, Derea and Onjamo (Table 2). For Bedadeti, we used St. Petersburg genome assembler (SPAdes) v. 3.6.1 [29] to assemble contigs and then scaffolded these using Short Sequence Assembly by progressive K-mer search and 3′ read Extension (SSAKE)-based Scaffolding of Pre-Assembled Contigs after Extension (SSPACE) v. 3.0 [30]. For Onjamo, we generated contigs and scaffolds using SPAdes v. 3.9.0; for Derea, we generated contigs only, again using SPAdes v. 3.9.0. SPAdes assemblies were performed using the "--careful" option.
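To make the trimming threshold concrete: the value 30 passed to TrimGalore is a Phred-scaled quality score, which corresponds to an estimated base-calling error probability of 10^(-30/10) = 0.001. The Python sketch below illustrates this conversion together with a running-sum 3′-end quality trim of the kind used by BWA-style trimmers. It is an illustration only, not the TrimGalore implementation; the function names and the assumption of Phred+33-encoded quality strings are choices made for this example.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def trim_3prime(seq, qual, cutoff=30, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Assumes `qual` is a Phred+33-encoded string of the same length as `seq`.
    Walking in from the 3' end, (cutoff - quality) is accumulated; the read is
    cut at the position where this running sum is largest, and the walk stops
    as soon as the sum becomes negative.
    """
    quals = [ord(c) - offset for c in qual]
    running = 0
    best = 0
    cut = len(quals)  # default: keep the whole read
    for i in range(len(quals) - 1, -1, -1):
        running += cutoff - quals[i]
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return seq[:cut], qual[:cut]


# Hypothetical read: five high-quality bases ('I' = Q40), three poor ones ('#' = Q2).
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
```

In this hypothetical read the three low-quality bases at the 3′ end are removed, while a read consisting only of Q40 bases is returned unchanged.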
We identified single-nucleotide polymorphisms by alignment against the reference genome sequence, according to the following procedure. After trimming and filtering with TrimGalore, sequence reads were aligned against the Bedadeti reference genome sequence (GenBank: GCA_000818735.2) using Burrows-Wheeler Aligner (BWA) mem [31,32] version 0.7.15-r1140 with default options and parameter values.
bcftools call -m -v -Ov alignment.bcf > alignment.vcf

The candidate variants were then filtered using the following command line:

This filtering step eliminates indels and low-confidence single-nucleotide variant (SNV) calls. It also eliminates candidate SNVs within 10 base pairs of an indel, since alignment artefacts are relatively common in the close vicinity of indels.

Fig. 3. Each row represents one of the sequenced genomes. Colour indicates the relative frequency of aligned sequence reads with the variant nucleotide at that site in that genome, on a yellow-orange-red palette. Thus, heterozygous sites would be expected to be orange, while homozygous sites would be yellow (same as Bedadeti reference genome sequence) or red (variant from the Bedadeti reference genome sequence). These frequency values were inferred from mpileup-formatted files, generated by aligning genomic sequence reads against the Bedadeti reference genome sequence. The Perl script used to extract these from the mpileup files is included in the Supplementary Material.
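The proximity rule applied during variant filtering, namely discarding candidate SNVs that lie within 10 base pairs of an indel, can be sketched in Python as follows. This is a simplified illustration and not the command line used in this study: each indel is represented by a single coordinate on the same contig as the SNVs, a distance of exactly 10 bp is treated as "within", and the function name is hypothetical.

```python
from bisect import bisect_left


def filter_snvs_near_indels(snv_positions, indel_positions, gap=10):
    """Return the SNV positions that lie more than `gap` bp from every indel.

    Assumes all positions are integer coordinates on the same contig and that
    each indel is summarised by a single coordinate (a simplification).
    """
    indels = sorted(indel_positions)
    kept = []
    for pos in snv_positions:
        # Index of the first indel at or after (pos - gap).
        i = bisect_left(indels, pos - gap)
        # If that indel is also at or before (pos + gap), the SNV is too close.
        if i < len(indels) and indels[i] <= pos + gap:
            continue
        kept.append(pos)
    return kept


# Hypothetical candidate SNVs around a single indel at position 25.
kept_snvs = filter_snvs_near_indels([5, 20, 31, 35, 36, 100], [25])
```

Sorting the indel coordinates once and using a binary search keeps the check efficient when many candidate sites are screened.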
Allele frequencies at each SNP site were estimated from frequencies of each base (adenine (A), cytosine (C), guanine (G) or thymine (T)) among the aligned reads. Thus, we would expect an allele frequency of close to zero or one for homozygous sites and approximately 0.5 for heterozygous sites in a diploid genome. The binary alignment/map (BAM)-formatted BWA-mem alignments were converted to pileup format using the samtools mpileup command in SAMtools [33] version 1.6 with default options and parameter values. From the resulting pileup files, we used a custom Perl script (included in Supplementary material) to detect SNPs. For SNP detection, we considered only sites where depth of coverage by aligned reads was at least 5× for all 17 datasets. The distribution of a random sample of variants across the 17 accessions is summarized in Fig. 3.
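As an illustration of how base frequencies can be derived from pileup data, the Python sketch below parses one line of single-sample samtools mpileup output and returns the frequency of each base, or nothing if fewer than five bases are aligned at that site. It is not the custom Perl script from the Supplementary material; the function names are hypothetical, and the example handles one sample at a time, whereas the analysis described above required at least 5× coverage in all 17 datasets.

```python
import re
from collections import Counter


def count_bases(ref, bases):
    """Count A/C/G/T calls in the read-base column of a pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":  # read start marker; the next character is a mapping quality
            i += 2
            continue
        if c in "+-":  # indel: sign, length, then that many inserted/deleted bases
            length = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
            continue
        if c in ".,":  # match to the reference on the forward/reverse strand
            counts[ref.upper()] += 1
        elif c.upper() in "ACGT":  # mismatch on the forward/reverse strand
            counts[c.upper()] += 1
        # Other symbols ('$', '*', 'N', '<', '>') are not counted.
        i += 1
    return counts


def allele_frequencies(pileup_line, min_depth=5):
    """Return {base: frequency} for one pileup line, or None if depth is too low."""
    fields = pileup_line.rstrip("\n").split("\t")
    ref, bases = fields[2], fields[4]
    counts = count_bases(ref, bases)
    total = sum(counts.values())
    if total < min_depth:
        return None
    return {base: counts[base] / total for base in "ACGT"}


# Hypothetical site: three reads match the reference (A) and three carry G.
freqs = allele_frequencies("scaffold1\t100\tA\t6\t.,.GgG\tIIIIII")
```

For this hypothetical site, the frequencies of A and G are both 0.5, as expected for a heterozygous site in a diploid genome.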
The identification of relatively high-confidence SNPs, distributed throughout the genome at a density of approximately one SNP per kilobase, makes it possible to develop markers for genotyping large numbers of plant accessions without the need for large-scale sequencing. One straightforward approach is polymerase chain reaction restriction fragment length polymorphism (PCR-RFLP) [34]. Another is cleaved amplified polymorphic sequences (CAPS) [35]. In the PCR-RFLP assay, oligonucleotide primers are designed to amplify a PCR product spanning a SNP that falls within the recognition site for a restriction enzyme, such that one variant is cleavable by the restriction enzyme whilst the other is not. Thus, by examining the pattern of bands in agarose gel electrophoresis of the product after restriction digestion, it is possible to assess the genotype at that SNP location. As a proof of principle, we designed 22 pairs of oligonucleotide primers targeting SNPs identified from the genome sequencing data; these are listed in Table 3. We applied 5 of these assays to several hundred E. ventricosum accessions; agarose gels showing the products of digesting the PCR products can be found in the Supplementary material.
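The principle behind such an assay, that a SNP is useful when exactly one of its alleles completes a restriction enzyme recognition site, can be sketched in Python as below. This is a hedged illustration rather than the primer and assay design procedure used here: the function name, the short flanking sequences and the three-enzyme table are assumptions made for the example, and a real design would also need to check for additional sites elsewhere in the amplicon.

```python
# Recognition sites of three commonly used enzymes. All three sites are
# palindromic, so scanning one strand is sufficient.
ENZYME_SITES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def discriminating_enzymes(flank5, ref_allele, alt_allele, flank3, enzymes=None):
    """Find enzymes whose site is present with one allele but not the other.

    Returns a dict mapping enzyme name to "ref" or "alt", naming the allele
    that would be cleaved. Only the supplied flanking sequence is examined.
    """
    if enzymes is None:
        enzymes = ENZYME_SITES
    ref_seq = (flank5 + ref_allele + flank3).upper()
    alt_seq = (flank5 + alt_allele + flank3).upper()
    hits = {}
    for name, site in enzymes.items():
        in_ref = site in ref_seq
        in_alt = site in alt_seq
        if in_ref != in_alt:
            hits[name] = "ref" if in_ref else "alt"
    return hits


# Hypothetical T/C SNP: the T allele completes an EcoRI site (GAATTC).
result = discriminating_enzymes("TTGAA", "T", "C", "TCAA")
```

In this hypothetical case only the reference (T) allele would be cut by EcoRI, so a homozygous reference sample would show cleaved bands, a homozygous variant sample an uncut band, and a heterozygote both.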