Genome sequencing data for wild and cultivated bananas, plantains and abacá

We performed shotgun genome sequencing on a total of 19 different Musa genotypes including representatives of wild banana species Musa acuminata and M. balbisiana, allopolyploid bananas and plantains, Fe'i banana, pink banana (also known as hairy banana) and abacá (also known as hemp banana). We aligned sequence reads against a previously sequenced reference genome and assessed ploidy and, in the case of allopolyploids, the contributions of the A and B genomes; this provides important quality-assurance data about the taxonomic identities of the sequenced plant material. These data will be useful for phylogenetics, crop improvement and studies of the complex story of intergenomic recombination in AAB and ABB allotriploid bananas and plantains, and can be integrated into resources such as the Banana Genome Hub.


Value of the Data
• These genomic resequencing data will inform studies of Musa evolution, biodiversity, speciation and allopolyploidy.
• This is a useful resource for breeders and researchers, as well as for science communicators engaging with the general public about the germplasm collection at the Eden Project.
• The data can be mined for polymorphisms with value as markers for breeding strategies.
• These data can be integrated into banana genomics resources such as the Banana Genome Hub [1].
• Since some samples were sequenced using more than one method, the data can be used to compare the performance of alternative sequencing platforms [2].

Data Description
Genomic shotgun sequencing data were generated using BGIseq-500 (Table 1), Illumina HiSeq 2500 using libraries of two different sizes (Tables 2 and 3) and Illumina NovaSeq 6000 (Table 4). This generated a total of 505.69 GB and 120.95 GB of raw read data for the Eden Project and IITA accessions respectively. Raw data are available at NCBI's Sequence Read Archive [3] via BioProjects PRJNA540118 and PRJNA413600.
An important quality control step is to check whether the sequence data are consistent with the botanical identifications of the source material. Therefore, we assessed observed against expected levels of ploidy. For allopolyploids purported to originate from interspecific hybrids between Musa acuminata and Musa balbisiana, we assessed the relative contributions of these respective "A" and "B" genomes compared against the expected characteristics of each sample as described under Experimental Design, Materials, and Methods. The resulting quality-control metrics are summarised in Table 5 and in Fig. 1. Accessions 2012-1152 (SAMN11522021), 1999-2846 (SAMN11522023) and 2011-0950 (SAMN11522017) were expected to be allopolyploids containing contributions from both the A and B genomes, but sequence data appeared to be exclusively from the A genome, suggesting that these three plants had been mis-identified. Further, there were discrepancies between the expected ploidy levels and the empirically inferred levels in several accessions.

Table 1 Genomic sequencing data generated using BGIseq-500 (2 × 150 bp reads, 300-bp insert size).

Table 3 Genomic sequencing data generated using Illumina HiSeq (2 × 125 bp reads, 300-bp insert).

Experimental Design, Materials and Methods
Fresh leaf material was obtained from five accessions from the IITA (International Institute of Tropical Agriculture) [4] and 14 from the Eden Project. DNA was extracted from the leaf material and sequenced using a combination of Illumina HiSeq 2500, Illumina NovaSeq 6000 and BGIseq-500 platforms. This yielded at least 20× coverage of each genome and was sufficient for calling single-nucleotide polymorphisms, detecting presence/absence polymorphisms and cataloguing patterns of heterozygosity.
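As a rough illustration of what a depth-of-coverage figure such as 20× means, mean depth can be approximated as the total number of sequenced bases divided by the genome size. The sketch below is only a back-of-the-envelope check: the genome size and read-pair count are hypothetical round numbers chosen for illustration, not values taken from this dataset, and the function names are invented for this example.

```python
def bases_from_read_pairs(n_pairs: int, read_length: int) -> int:
    """Total bases in a paired-end run: two reads per pair."""
    return 2 * n_pairs * read_length


def depth_of_coverage(total_bases: int, genome_size_bp: int) -> float:
    """Approximate mean depth as sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


# Hypothetical inputs, for illustration only (not from the article).
ASSUMED_GENOME_SIZE_BP = 500_000_000
ASSUMED_READ_PAIRS = 40_000_000

depth = depth_of_coverage(
    bases_from_read_pairs(ASSUMED_READ_PAIRS, read_length=125),
    ASSUMED_GENOME_SIZE_BP,
)
print(f"{depth:.1f}x")
```

This estimate ignores trimming, unmapped reads and uneven coverage, so depth measured from alignments will generally be lower.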
From the 14 plant accessions from the Eden Project, cigar leaves were cut from the plant and lyophilised in a freeze dryer before being sent to BGI Tech Solutions (Hong Kong) Co., Limited, where DNA extraction and sequencing were performed. For the five accessions from the IITA, genomic DNA was isolated using a modified CTAB (hexadecyltrimethylammonium bromide) extraction method [5]. The University of Exeter's Sequencing Service prepared Illumina sequencing libraries after fragmenting 500 ng of DNA to an average size of 500 bp, using the NEXTflex 8-barcode Rapid DNAseq kit (Perkin Elmer) with adapters containing indexes and 5-8 cycles of polymerase chain reaction (PCR) [6]. Library quality was determined using D1000 screen-tapes (Agilent) and libraries were either sequenced individually or combined in equimolar pools. Sequencing was performed on a single lane of a high-output v4 flow-cell on the Illumina HiSeq 2500 at the University of Exeter, yielding pairs of 125-bp reads.
Reads were also generated with longer inserts using the Illumina HiSeq (2 × 150 bp reads, 800-bp insert size) for two of the samples, which potentially aids resolution of sequence repeats if data are used in de novo assembly of genomes.
The quality of the sequencing reads was evaluated using FASTQC [7]. Before further analyses, reads were trimmed and adapters removed using TrimGalore [8] with command-line options "-q 30 --paired". Trimmed and filtered reads were aligned to the M. acuminata reference genome [9] using BWA [10] to generate binary alignment map (BAM) files [11].
As a prerequisite for plotting the relative contributions of the A and B genomes in allopolyploids, we first identified a set of informative SNPs that distinguish A (M. acuminata) from B (M. balbisiana) using SAMtools' mpileup function, BCFtools [11,12] and custom scripts available at https://github.com/davidjstudholme/SNPsFromPileups. First, the relevant BAM alignment files were converted into uncompressed VCF format using SAMtools v1.6 (mpileup function), selecting for variant sites only (-v) using the alternative model for multiallelic and rare-variant calling (-m). Candidate SNPs were then filtered using the filter function of BCFtools (v1.6): SNPs within 100 base pairs of an indel were excluded (--SnpGap 100), and only SNPs with a quality score of at least 35 (QUAL >= 35) and a depth of 5 or more reads (MIN(DP) >= 5) were retained. The minimum number of reads supporting an indel was set to two (MIN(IDV) >= 2). Variants that were flagged as indels were excluded (INDEL = 0). The resulting filtered VCF files contained the positions of candidate SNPs that distinguished the B genome [13] from the A reference genome [14]. At each of these informative SNPs, we quantified the relative abundance of the A- and B-alleles, considering only sites where the depth was between 10 and 50. For plotting, the resulting percentage of the B allele was smoothed in R using LOESS [15]. The percentages of the B alleles at each SNP were visualised using Circos [16] (Fig. 1).
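The per-site quantification step can be sketched as follows. This is a minimal illustration rather than the custom scripts referenced above: the SiteCounts record and the function name are hypothetical, depth is assumed to be the sum of A- and B-supporting reads, and the 10 to 50 depth window is assumed to be inclusive. The LOESS smoothing and Circos plotting are not reproduced here.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class SiteCounts(NamedTuple):
    """Read counts at one informative SNP (hypothetical record layout)."""
    chrom: str
    pos: int
    a_reads: int  # reads carrying the A (reference) allele
    b_reads: int  # reads carrying the B allele


def b_allele_percentages(
    sites: Iterable[SiteCounts],
    min_depth: int = 10,
    max_depth: int = 50,
) -> Iterator[Tuple[str, int, float]]:
    """Yield (chrom, pos, percent B) for sites within the depth window."""
    for site in sites:
        # Assumption: depth is the number of reads supporting either allele.
        depth = site.a_reads + site.b_reads
        if min_depth <= depth <= max_depth:
            yield site.chrom, site.pos, 100.0 * site.b_reads / depth


# Made-up counts for illustration only.
example_sites = [
    SiteCounts("chr01", 1000, 20, 10),  # depth 30: kept
    SiteCounts("chr01", 2000, 3, 2),    # depth 5: too shallow, skipped
    SiteCounts("chr01", 3000, 0, 12),   # depth 12: kept, all B
    SiteCounts("chr02", 500, 40, 40),   # depth 80: too deep, skipped
]

for chrom, pos, pct in b_allele_percentages(example_sites):
    print(f"{chrom}\t{pos}\t{pct:.1f}")
```

Excluding very deep sites is a common guard against collapsed repeats, where reads from several genomic copies pile up at one reference position and distort allele ratios.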
We used nQuire [17] to estimate ploidy from the BAM files of genomic reads aligned against the M. acuminata reference genome with BWA. After de-noising to remove noise arising from mis-mapping in highly repetitive regions, we assessed ploidy level using the lrdmodel command of nQuire to produce delta log-likelihoods for diploidy, triploidy and tetraploidy. The model with the lowest delta log-likelihood was taken to indicate the most likely ploidy level (Table 5). The command lines used were as follows:

nQuire create -b example.bam -o example
for i in *.bin; do echo $i; nQuire denoise $i -o ${i}_denoised; done
for i in *_denoised.bin; do echo $i; nQuire lrdmodel -t 8 $i; done
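The decision rule described above, choosing the ploidy model with the lowest delta log-likelihood, can be expressed in a few lines. The sketch below assumes the three delta values have already been read from the lrdmodel output into a mapping; the function name and the numbers are hypothetical and do not come from Table 5.

```python
from typing import Mapping


def infer_ploidy(delta_loglik: Mapping[str, float]) -> str:
    """Return the ploidy label whose delta log-likelihood is smallest."""
    if not delta_loglik:
        raise ValueError("no delta log-likelihoods supplied")
    return min(delta_loglik, key=lambda label: delta_loglik[label])


# Made-up delta log-likelihoods for one sample, for illustration only.
example_deltas = {
    "diploid": 412.7,
    "triploid": 35.2,
    "tetraploid": 198.4,
}

print(infer_ploidy(example_deltas))
```

When two models give similar delta values the call is ambiguous, so it is worth inspecting the values themselves and not only the winning label.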

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.