Insights into population structure of East African sweetpotato cultivars from hybrid assembly of chloroplast genomes

Background: The chloroplast (cp) genome is an important resource for studying plant diversity and phylogeny. Assembly of the cp genomes from next-generation sequencing data is complicated by the presence of two large inverted repeats contained in the cp DNA. Methods: We constructed a complete circular cp genome assembly for the hexaploid sweetpotato using extremely low coverage (<1×) Oxford Nanopore whole-genome sequencing (WGS) data coupled with Illumina sequencing data for polishing. Results: The sweetpotato cp genome of 161,274 bp contains 152 genes, of which there are 96 protein coding genes, 8 rRNA genes and 48 tRNA genes. Using the cp genome assembly as a reference, we constructed complete cp genome assemblies for a further 17 sweetpotato cultivars from East Africa and an I. triloba line using Illumina WGS data. Analysis of the sweetpotato cp genomes demonstrated the presence of two distinct subpopulations in East Africa. Phylogenetic analysis of the cp genomes of the species from the Convolvulaceae Ipomoea section Batatas revealed that the most closely related diploid wild species of the hexaploid sweetpotato is I. trifida. Conclusions: Nanopore long reads are helpful in construction of cp genome assemblies, especially in solving the two long inverted repeats. We are generally able to extract cp sequences from WGS data of sufficiently high coverage for assembly of cp genomes. The cp genomes can be used to investigate the population structure and the phylogenetic relationship for the sweetpotato.


Introduction
The chloroplast (cp) genome has been widely used to study the phylogeography, molecular systematics and the population genetics for plants 1,2 . The chloroplast DNA (cpDNA) usually displays uniparental inheritance and represents a relatively high degree of conservation in genome structure and gene content 2 . There are over 800 complete cp sequences available for a wide variety of plants from National Center for Biotechnology Information (NCBI) repository ranging in size from 107 to 218 Kb 3 . The cp genomes usually contain 110-130 protein encoding genes (PEGs), about 30 transfer RNA (tRNA) genes and four ribosomal RNA (rRNA) genes, primarily participating in the process of photosynthesis 3,4 . The cpDNA typically forms a circular quadripartite structure with two inverted repeats (IRs), IRA and IRB, separated by one large single-copy section (LSC) and one small single-copy section (SSC) 5 .
The first cpDNA was sequenced from tobacco (Nicotiana tabacum) using the bacterial artificial chromosome (BAC) sequencing method in 1986 6 . The two IRs were cloned separately in order to distinguish between them. A plethora of cpDNAs have since been sequenced with similar methods 7-9 . Besides BAC sequencing, an alternative strategy used to sequence cpDNA is whole-cp-genome amplification by rolling-circle amplification (RCA) technology 10-12 . However, both approaches require complicated library preparation.
The development of next-generation sequencing (NGS) technologies such as Illumina and Roche 454 has facilitated faster and cheaper methods to sequence cp genomes 13-15 . The output of the NGS technologies is short reads of up to a few hundred base pairs. It is difficult to assemble a cp genome with short reads only, especially because of the two large IRs of tens of kilobase pairs. In order to solve this problem, a reference cp genome, normally from a related species, is usually used to anchor the contigs assembled from the short reads 4,16 . The long reads generated from the third-generation sequencing (TGS) technologies, such as the single-molecule real-time (SMRT) PacBio sequencing and Oxford Nanopore sequencing, can also be used to anchor the contigs and resolve the repetitive regions. It is even possible to assemble cp genomes directly from long reads 17 . However, as the sequencing error rate of the long reads from the TGS is typically higher than 10%, it is important to introduce an error correction step to guarantee an accurate genome assembly 18 . The high-quality NGS short reads can be integrated for error correction to improve accuracy 19,20 .
The aforementioned methods to construct cp genomes from NGS or TGS data assume that pure cpDNA was sequenced. More precisely, the cpDNA was isolated from the nuclear DNA and other organelle DNA before sequencing 4,13-16 . However, whole-genome sequencing (WGS) data generated from NGS or TGS technologies always contain cp sequences at various levels determined by the tissue type and library preparation. Normally we are able to gain enough coverage of the cp genome for assembly even from low coverage WGS data. There have since been several studies describing assembly of cp genomes from WGS data 21-27 . Extraction of cp sequences from the WGS data plays a key role in these methods. The most straightforward idea is to use a reference cp genome. The cp sequences can be extracted by examining the mapping results of the WGS data to the reference cp genome 21,22 . An alternative strategy relies upon the fact that there are many more copies of the cpDNA than of the nuclear DNA or the DNA from other organelles. The entire WGS data is assembled to construct contigs. Contigs with significantly higher coverage are treated as cp contigs 23-25 . NOVOPlasty adopted a seed-and-extend paradigm, where the seed could be a cp read sequence, a conserved gene or a cp genome from a related species 26 . The start and the end of a given seed sequence are iteratively extended with reads that overlap the seed until the circular genome is formed. Izan et al. proposed a k-mer frequency-based selection of cpDNA sequences from WGS data, which was integrated into a reference-free cp genome assembler for non-model species 27 .
Sweetpotato (Ipomoea batatas) ranks among the ten most important food crops worldwide 28 . In 2016, the total annual production was more than 100 million metric tonnes, grown on about 8.6 million hectares around the world 29 . Understanding the sweetpotato genome is of significant importance to achieve the full potential of the sweetpotato 30 . Sweetpotato is a hexaploid (2n=6x=90) with a genome size estimated to be between 2,200 and 3,000 Mb 28 . Due to the complex genome structure, sweetpotato genomic resources are lacking. Under these circumstances, the cp genome provides researchers with an easy and efficient way to study sweetpotato 4,16,31,32 . A number of cp genomes from the genus Ipomoea have been sequenced 4,16,33,34 . Most of them are from diploid wild relatives of the sweetpotato. The genome size is around 161 Kb, and the structure is a standard circular quadripartite structure with an LSC of 87 Kb, an SSC of 12 Kb and two IRs of 31 Kb 4 . The cp genomes were mainly used to perform phylogenetic analyses 4,16,34 .
In the present study, we constructed a complete cp genome assembly for the hexaploid sweetpotato cultivar Tanzania 35 using long reads produced by the Oxford Nanopore sequencing technology. Despite the <1× genome coverage, we obtained approximately 270× data coverage for the cp genome. Illumina sequencing data was integrated to improve the accuracy of the genome assembly. Using the Tanzania cp genome assembly as a reference, we constructed 19 cp genomes for a further 17 sweetpotato cultivars (including a duplicate for one cultivar) and an I. triloba line from paired-end whole genome Illumina sequence data. The assembled sweetpotato cp genomes were combined to perform phylogenetic analysis to investigate the population structure of 18 East African sweetpotato cultivars. Putting together the assembled cp genomes and nine publicly available cp genomes of the sweetpotato and its wild relatives, we performed a phylogenetic analysis to investigate the phylogenetic relationship for species in Convolvulaceae Ipomoea section Batatas.

Amendments from Version 1
A number of misspellings and incorrect claims were revised following the reviewers' comments. Some misleading and confusing sentences were corrected in the new version. Figure 1d has been slightly modified to remove red dots indicating SNPs. The figure legend has been modified accordingly.
Any further responses from the reviewers can be found at the end of the article.

Results
Extraction of cp genome sequence from whole genome sequencing data
We generated high-coverage (60×) 150 bp paired-end Illumina WGS data and low-coverage (<1×) Oxford Nanopore WGS data on a single cultivar, referred to as Tanzania 35 (Methods). The cultivar Tanzania was used as one of the parents to develop an F1 outcrossing mapping population (B×T) in the Genomic Tools for Sweetpotato (GT4SP) Improvement Project 30 . Approximately 162,000 Nanopore reads and 1.46 billion Illumina reads were generated (Supplementary Table 1). A total of 6,710 Nanopore reads were identified as cp sequences by mapping to 30 publicly available cp genomes of species from the genus Ipomoea (family Convolvulaceae) 4,16,33,36 (Methods, Supplementary Table 2). Their total size is ~43.9 Mb, which represents ~270× data coverage for the cp genome. The longest read is ~30 Kb, and the average size is ~6.5 Kb (Supplementary Figure 1). By mapping to the publicly available cp genomes, we identified approximately 45 million Illumina reads as cp sequences, summing to ~6.2 Gb, which were used for error correction of the Nanopore reads and the genome assembly. The other parent of the B×T F1 outcrossing mapping population, Beauregard, was subjected to whole genome sequencing at 60× coverage (Methods). A total of approximately 1.3 billion 150 bp Illumina reads were generated summing to ~164 Gb, of which approximately 52 million reads were identified as cp sequences with a total size of ~7.2 Gb (Supplementary Table 1).
We performed Illumina WGS at 30× coverage for a further 16 sweetpotato cultivars: Wagabolige and New Kawogo 35 , Ejumula and SPK004 37 , NASPOT 1 and NASPOT 5 38 , NASPOT 7 and NASPOT 10 O 39 , NK259L and NASPOT 11 40 , Huarmeyano, Dimbuka-Bukulula and NASPOT 5/58 41 , Resisto 42 , Magabali 43 and Mugande 44 . These cultivars were used as the parental genotypes in the Mwanga Diversity Panel (MDP), which is an 8×8 diallel diversity mating panel constructed by the GT4SP project for genomic selection of the sweetpotato. While the great majority of these sweetpotato cultivars were from East African countries including Uganda and Kenya, Resisto was from the USA and Huarmeyano was from Peru (Supplementary Table 3). We have duplicate samples for the cultivar NASPOT 10 O: one was from the screen-house while the other was from the field. These two NASPOT 10 O samples were analysed separately in this research (Methods). On average, approximately 75 million 251 bp reads were generated for each sample. The number and the total size of the cp reads extracted from the whole genome sequence data are, on average, ~4.4 million and ~1 Gb, respectively, for each sample (Supplementary Table 1).
We performed Illumina whole genome sequencing at 50× coverage for the I. triloba line, NCNSP-0323 30 (Methods). The raw whole genome sequence data consists of approximately 196 million 150 bp reads summing to ~29 Gb. We extracted approximately 13 million reads for the cp genome from the raw sequence data, summing to ~2 Gb (Supplementary Table 1).
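The cp data coverage quoted above for the Tanzania Nanopore reads follows from simple arithmetic: total cp bases divided by the cp genome length. The sketch below only re-derives the rounded figures given in the text (~43.9 Mb over 6,710 reads, and the final 161,274 bp assembly); the function and variable names are illustrative and are not taken from the authors' pipeline.

```python
def fold_coverage(total_bases: float, genome_size: int) -> float:
    """Average depth of coverage = total sequenced bases / genome length."""
    return total_bases / genome_size


# Values quoted in the text (rounded in the paper, so results are approximate).
CP_GENOME_BP = 161_274        # final Tanzania cp genome assembly
nanopore_cp_bases = 43.9e6    # ~43.9 Mb of Nanopore reads identified as cp
n_cp_reads = 6_710            # number of Nanopore cp reads

depth = fold_coverage(nanopore_cp_bases, CP_GENOME_BP)
mean_read_len = nanopore_cp_bases / n_cp_reads

print(f"cp depth ~ {depth:.0f}x, mean read length ~ {mean_read_len / 1000:.1f} Kb")
```

With these inputs the depth comes out slightly above 270× and the mean read length at about 6.5 Kb, in line with the values reported in the paragraph.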

Cp genome assembly for the sweetpotato cultivar Tanzania
We combined the Nanopore long reads with Illumina short reads to construct a cp genome assembly for the sweetpotato cultivar Tanzania (Methods). After trimming off the low-quality bases, approximately 2.2 Gb of Illumina sequence data remained, which was used for error correction of the Nanopore reads with Nanocorr (Supplementary Table 1). A total of 70 low quality Nanopore reads were removed after error correction and the total size was reduced to approximately 43.2 Mb (Figure 1a). These reads were used to construct a draft genome assembly using Canu. The resulting genome assembly of approximately 218 Kb consists of three contigs of size 46 Kb, 39 Kb and 132 Kb, respectively. Compared to the published sweetpotato cp genome, the assembly is split at the boundaries of the two IRs (Figure 1b). Utilizing the overlap information between the contigs, AMOS minimus combined the three contigs and generated a single contig of ~183 Kb (Figure 1c) (Methods). The contig contains a ~20 Kb redundancy at the ends, which was removed after circularization (Figure 1d). The circularized contig is ~161 Kb, and is highly collinear with the reference cp genome assembly (Figure 1d). Application of Pilon further identified and corrected 42 single-nucleotide polymorphisms (SNPs) and small indels. To follow the paradigm of the published cp genomes, we restructured the genome assembly so that it starts from the LSC (Methods). The final genome assembly consists of a single circular contig of 161,274 bp (Figure 1e).
Cp genome assembly for the other 17 sweetpotato cultivars and the I. triloba line NCNSP-0323
The cp sequence data was subjected to quality control before being assembled with SPAdes (Methods). After trimming off the low-quality regions, the total sizes of the sequence data of the 19 samples range from approximately 267 Mb to 2.67 Gb (Supplementary Table 1). The contigs generated from SPAdes for the 19 samples vary in number and size: the minimum number of contigs is 76 for the cultivar NASPOT 7, while the maximum number is 197 for the cultivar Beauregard; and the total sizes of the genome assemblies range from ~169 Kb (cultivars Ejumula and NASPOT 7) to ~229 Kb (cultivar NK259L) (Supplementary Table 4). The SPAdes contigs were then mapped to the Tanzania cp genome assembly for anchoring (Methods). The resulting genome assemblies for the 19 samples are very similar. The largest and the smallest genome assemblies are 161,509 bp and 161,198 bp, derived from the cultivars NASPOT 5 and Beauregard, respectively (Supplementary Table 4).

Molecular structure and gene content of the sweetpotato cp genome
The gene annotation of the cp genome assembly of the sweetpotato cultivar Tanzania was generated with the web tool DOGMA and further refined with MUSCLE (Methods). The circular plot of the gene annotation is depicted in Figure 2. The sweetpotato cp genome represents a common circular structure with two IRs (IRA and IRB) separating one LSC and one SSC 5 . The sizes of the IRA, IRB, LSC and SSC are 30,874, 30,835, 87,489 and 12,076 bp, respectively. The overall GC content of the sweetpotato cp genome is 37.54%. The GC contents in different regions are highly variable. The two IRs have a significantly higher GC content than the single-copy regions: for the LSC and SSC, the GC content is 36.14% and 32.20%, respectively, whereas for the two IRs, the GC content is 40.57%. This is mainly caused by the GC-rich ribosomal RNA genes in the IR regions, including rrn16, rrn23, rrn4.5 and rrn5 (Figure 2). We identified 152 genes in the cp genome, of which there are 96 protein encoding genes (PEGs), eight rRNA genes and 48 tRNA genes. Table 1 shows a full list of the functional genes. The genes can be divided into 16 functional systems. The number of single-copy and double-copy genes is 71 and 11, respectively, and there is one triple-copy gene (rps12). The results are highly similar to what has been reported for the cultivar Xushu 18 cp genome 4 ; the only difference is that the psbZ gene is not found in the cultivar Xushu 18 cpDNA while the ihbA gene is not found in the cultivar Tanzania cpDNA. It should be noted that the double-copy gene ycf1 was not reported for the cultivar Xushu 18 cp genome 4 , but this was actually a misannotation.
(Figure caption fragment) Table 2). The alignment was performed with BWA MEM 45 . The alignment identities were calculated from the CIGAR string. Purple and yellow represent the reads before and after error correction with Illumina reads using Nanocorr 20 , respectively.
Table 2). The resulting phylogenetic tree is depicted in Figure 3.
The 18 sweetpotato cultivars used as the parental genotypes for mapping populations in the GT4SP project represent two distinct clades, consisting of 12 and six cultivars, respectively. Here, the length of any branch in a clade is no greater than 2×10⁻⁴ substitutions per bp. The detailed phylogenetic relationship of the 18 sweetpotato cultivars is shown in Figure 4. The distance between the two clades is approximately 5×10⁻⁴ substitutions per bp. In the larger clade, the cultivar Tanzania shows a relatively larger distance (2×10⁻⁴ substitutions per bp) compared to the other cultivars. The population structure discovered here is similar to the one revealed using simple sequence repeat primers by David et al., with the exception of the classification of the sweetpotato cultivars NK259L, Resisto and Mugande 47 (Supplementary Table 3). For the publicly available sweetpotato cp genomes, PI 561258 and Xushu 18 are closely related to the larger clade, while PI 518474 and PI 508520 have a closer relationship with the smaller clade (Figure 3). The diploid wild relative of the hexaploid sweetpotato, I. trifida (REM 753), displays a significantly closer relationship to I. batatas than the other species in the Convolvulaceae Ipomoea section Batatas. The other I. trifida accession, PI 618966, however, is much more divergent from I. batatas and shows a close relationship to the I. triloba line NCNSP-0323 assembled in this research. Interestingly, the accession PI 618966 was originally identified as I. triloba and was recently reidentified

Discussion
There have been a few studies exploring the possibility of performing de novo assembly of organelle genomes with long reads, especially with SMRT PacBio sequencing reads 21,26,51 . In this study, we constructed a complete sweetpotato cp genome assembly using the long reads generated from Oxford Nanopore sequencing. Nanopore reads proved to be extremely powerful in assembling the cp genome, especially in resolving long repetitive regions. The sweetpotato cp genome contains two ~31 Kb IRs, which are very difficult for short-read de novo assemblers to resolve. With the overlap information from long reads, however, the problem can be easily resolved. Canu 17 provides a useful tool set for assembling Nanopore reads, which was used in this research. It is worth noting that although the average depth of coverage of the whole sweetpotato genome is less than 1×, we obtained enough coverage of the cp genome for assembly.
Although long reads are powerful in resolving complex genome structures, the error-prone nature of the raw reads necessitates an extra error-correction step. Illumina reads have been widely used to assist long read error-correction 19,20 . The Illumina read-based correction can be performed either on the raw long reads before assembly 20 or on the draft genome assembly constructed from the raw reads 19 . In the current study, we did both. Before assembling with Canu, the Nanopore reads were corrected with Illumina reads using Nanocorr 20 . After assembling, the draft genome assembly was polished with Illumina reads using Pilon 19 (Methods). Having examined several pipelines, we found that performing error correction both before and after assembly was the best practice for constructing the sweetpotato cp genome.
Assembling the cp genome from the short Illumina reads is challenging owing to the two large IRs. Since the structure of the cp genome is generally stable, reference genomes from closely related species are usually used to perform reference-based assembly 4,16 . In this study, we used the genome assembly constructed from the Nanopore reads as the reference to assemble a further 19 cp genomes, covering 17 sweetpotato cultivars (including a duplicate for one sample) and the I. triloba line NCNSP-0323. SPAdes 53 was used as the de novo short-read assembler. The contigs generated by SPAdes were fragmented as expected. Among the 19 genome assemblies, the minimum number of contigs was 76. As the two IRs are highly homologous, there was generally only one copy of the repetitive regions being assembled. In order to solve this problem, for reference-based scaffolding, we reused some single-copy contigs from the two IR regions to construct complete cp genome assemblies.
The molecular structure and gene content of the cpDNA are relatively conserved in land plants 2 . Many cpDNAs form a circular quadripartite structure with two IRs separated by one large and one small single-copy section 2,5 . All 20 cp genome assemblies constructed in this research represent this common structure. The size of the two IRs of the sweetpotato cpDNA is approximately 31 Kb each, which is much larger than in other plants such as potato 10 , rice 54 , wheat 55 , and maize 56 , in which the IRs are usually smaller than 26 Kb. This is highly likely due to gene losses in these species. By comparing the gene annotation of the sweetpotato cpDNA in this study (Figure 2) to the potato cpDNA 10 , we can see that, in the potato cpDNA, the boundary region of the IRA and SSC harbors a deletion of approximately 6 Kb involving the genes ycf1, rps15 and ndhH. Meanwhile, these three genes are present in the symmetric boundary region of the IRB and SSC, which explains why the IRs of the potato cpDNA are approximately 6 Kb smaller than those of the sweetpotato cpDNA.
The cpDNA usually has uniparental inheritance and undergoes low rates of substitution and recombination, which makes it well suited for phylogenetic analysis. The cp genome has been widely used to perform phylogenetic or comparative analysis in previous studies 2,10,16 . In this research, we used the complete cp genome assemblies to study the phylogenetic relationship of the 18 sweetpotato cultivars used as the parental genotypes for mapping populations in the GT4SP project, as well as the species from the Convolvulaceae Ipomoea section Batatas. The sweetpotato genotypes from the GT4SP project were classified into two distinct clusters, which supports the diversity of the mapping populations derived from them. The phylogenetic analysis clearly revealed that I. trifida is the most closely related diploid wild relative of the hexaploid sweetpotato, I. batatas, which is consistent with conclusions from previous studies 32,57 .
Almost all whole genome sequencing data contains cp sequences, from which we are usually able to obtain cp genome sequences of enough data coverage for de novo assembly. As we can see, all the cp genome assemblies described in this research were constructed using whole genome sequencing data. Given that the cp genome is an important resource for studying plant genomes and whole genome data has gradually become indispensable in modern genome projects, it will be a good practice to construct the cp genome assembly to gain a first insight into the plant genome we are trying to understand before moving to the complex nuclear genome.

Methods
Genome sequencing of the MDP parental genotypes
The 16 sweetpotato cultivars used as the parental genotypes for the MDP diversity panel were subjected to whole genome sequencing.
10x Genomics' Chromium sequencing of the sweetpotato cultivar Tanzania and Beauregard
The genomic DNA of Tanzania and Beauregard was extracted using the cetyltrimethylammonium bromide (CTAB) method and purified with 1× Agencourt AMPure XP beads (Beckman Coulter), according to the manufacturer's instructions. Before the library preparation, 1.5 µg of purified gDNA was size selected using the BluePippin instrument (Sage Science) with the 0.75% Agarose Dye free, Marker U1 High-pass 30-40 kb vs3 protocol, followed by a purification step with 0.4× AMPure XP beads. The library preparations for these two samples were done following the Chromium™ Genome Reagent Kits user guide (CG00022, Rev C). In summary, 10 ng of sample DNA was used to generate Gel Bead-In-Emulsions (GEM) in the Chromium™ Controller (10× Genomics) followed by isothermal incubation, post GEM incubation cleanup and quality control (QC). Libraries were constructed with end-repair and A-tailing, adaptor ligation, post ligation cleanup using SPRIselect Reagent (Beckman Coulter, USA), sample index PCR, post PCR cleanup, and QC. We modified the protocol by increasing the number of PCR cycles to nine and adding 105 µl of SPRIselect reagent for the Post Sample Index PCR Cleanup, which resulted in the recovery of shorter fragments than expected. The libraries were sequenced using the HiSeq X Ten platform (Illumina, San Diego, CA).
Oxford Nanopore sequencing of the sweetpotato cultivar Tanzania
Before the MinION library preparation, 5.7 µg of pure Tanzania DNA was size selected (start selection size: 8 Kb) with the same protocol used in 10x Genomics' Chromium sequencing. The size selected gDNA was purified with 1× AMPure XP beads. The resulting 950 ng of Tanzania gDNA was used in MinION sequencing library preparation with the SQK-LSK108 1D ligation Sequencing kit (May 2017 version). We modified the protocol as follows: 30 min incubation for each of the end-repair and adapter ligation steps; 10 min incubation at room temperature in the end-repair purification step; 0.7× AMPure XP beads used after adapter ligation; and ELB buffer (Oxford Nanopore Technologies) pre-warmed to 50°C before use, with the eluted solution incubated at 50°C. A library of 348 ng was loaded into a FLO-MIN106 (R.9.4 version) flowcell used in a MK1B MinION. We ran the 1D protocol in the MinKNOW software (version 1.5.18) and basecalled the raw data using Albacore (version 1.1.0).
Cp genome sequence extraction
WGS data were aligned to 30 publicly available cp genome assemblies of species from the genus Ipomoea 4,16,33,36 (Supplementary Table 2) to extract cp genome reads, using BWA MEM 45 (version 0.7.15). We used the option '-x ont2d' for Nanopore reads, and default options for Illumina reads. For each Nanopore read, the alignment records with at least 500 bp of sequence aligned were selected to calculate the total length of the alignment. A Nanopore read was considered a cp sequence if at least 1 Kb and 80% of the read aligned. A similar strategy was employed for Illumina read extraction. Both reads of a read pair were required to be aligned. The minimum size of the alignment block was set to 100 bp.
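The Nanopore read selection rule described above can be sketched as a small filter over the CIGAR strings of a read's alignment records. This is a minimal illustration under stated assumptions, not the authors' script: the function names are hypothetical, the read's aligned length is taken as the sum of M, I, = and X operations, and blocks are simply summed, so overlapping alignment records on the same read would be double-counted.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_read_bases(cigar: str) -> int:
    """Read bases covered by an alignment record (M, I, = and X operations)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MI=X")


def is_cp_read(read_length, cigars, min_block=500, min_total=1000, min_fraction=0.8):
    """Apply the thresholds from the text to one read.

    cigars: CIGAR strings of all alignment records of the read.
    Records shorter than min_block are ignored; the remaining aligned
    length must reach both min_total bases and min_fraction of the read.
    """
    blocks = [aligned_read_bases(c) for c in cigars]
    total = sum(b for b in blocks if b >= min_block)
    return total >= min_total and total >= min_fraction * read_length


# Example: a 6 Kb read with 5 Kb aligned passes; short alignment blocks do not count.
print(is_cp_read(6000, ["5000M1000S"]))
print(is_cp_read(6000, ["400M5600S", "450M5550S"]))
```

For Illumina read pairs, the same idea would apply with a 100 bp minimum block and the additional requirement that both mates align.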
Cp genome assembly from Nanopore data
We used Nanocorr 20 (version 0.01) to perform error correction for Nanopore reads using the Illumina reads. In order to guarantee the quality of the Illumina reads, Trimmomatic 60 (version 0.36) was used to remove the low quality regions. We required a quality score of no less than 20 for each base pair and a read length of no less than 100. The corrected Nanopore reads were then used to construct a draft genome assembly with Canu 17 (version 1.5). As the resulting draft genome assembly contained more than one contig, AMOS minimus 46 (version 3.1.0) was used to remove the redundancy and concatenate contigs using the overlap information. AMOS minimus was also used to circularize the contig. We aligned the Illumina reads to the circularized contig and corrected the SNPs and small indels with Pilon 19 (version 1.22). In order to follow the paradigm of the published cp genomes, we aligned the genome assembly to the published cp genomes with MUMMER 61 (version 3.23) to find homologous regions, and let the genome assembly start from the LSC.
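Two of the steps above, removing the redundant ends of a contig during circularization and re-anchoring the circular sequence so that it starts at the LSC, can be illustrated on plain strings. The sketch below is a toy version under strong assumptions: it uses exact suffix/prefix matching, whereas AMOS minimus works from overlap alignments that tolerate errors, and the start coordinate is assumed to be known already (in the paper it was found by aligning to published cp genomes with MUMMER). Function names are hypothetical.

```python
def trim_terminal_overlap(contig: str, min_overlap: int = 3) -> str:
    """Drop the contig's suffix if it exactly repeats the prefix.

    A linear contig assembled from a circular molecule typically carries
    the same sequence at both ends; removing one copy yields a single
    circular unit. Only exact matches are detected in this sketch.
    """
    for k in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:k] == contig[-k:]:
            return contig[:-k]
    return contig


def rotate_circular(seq: str, new_start: int) -> str:
    """Rotate a circular sequence so that position new_start becomes index 0."""
    if not seq:
        return seq
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]


# Toy example: the last four bases repeat the first four.
unit = trim_terminal_overlap("ACGTTGCAACGT")
print(unit)                      # circular unit without the redundant end
print(rotate_circular(unit, 3))  # same circle, different start position
```

Rotation does not change the circular molecule being described; it only changes which coordinate is reported as position 1, which is why published cp genomes can be made directly comparable this way.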

Cp genome assembly from Illumina Hiseq data
The low quality regions of the extracted cp sequences were removed with Trimmomatic 60 (version 0.36). The minimum quality score of each base pair was set to 20 and the minimum length of the reads was set to 100. SPAdes 53 (version 3.10.1) was used to construct contigs from Illumina reads. We excluded the repeat resolve module from SPAdes and used the contigs before repeat resolution as it consistently missed one of the two IRs. The resulting genome assembly contains tens to hundreds of contigs. The size of the contigs ranged from several hundred base pairs to tens of kilobase pairs. Since we know the structure of cp genome is generally stable, the syntenic relationship was used for scaffolding. We mapped the SPAdes contigs to the genome assembly resulting from the Nanopore reads using BWA MEM 45 . The alignments were used to order the contigs. The overlap information between the neighbouring contigs was used to concatenate them.
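The reference-guided scaffolding described above (order the contigs by their mapped position, then join neighbours using their overlap) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the reference start positions are assumed to be given (in the paper they came from BWA MEM alignments), overlaps are detected by exact string matching, contig orientation is ignored, and all names are hypothetical.

```python
def merge_with_overlap(left: str, right: str, min_overlap: int = 3) -> str:
    """Join two neighbouring contigs, collapsing the longest exact overlap
    between the end of `left` and the start of `right`."""
    max_k = min(len(left), len(right))
    for k in range(max_k, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    # No overlap found: plain concatenation (in practice this would be a gap).
    return left + right


def scaffold(contigs):
    """contigs: list of (reference_start, sequence) tuples.

    Contigs are sorted by where they map on the reference and merged
    left to right.
    """
    ordered = sorted(contigs, key=lambda c: c[0])
    assembly = ""
    for _, seq in ordered:
        assembly = merge_with_overlap(assembly, seq) if assembly else seq
    return assembly


# Toy example: two contigs given out of order, overlapping by "GGT".
print(scaffold([(6, "GGTTCAGG"), (0, "ACGTACGGT")]))
```

Because SPAdes tends to assemble only one copy of the IR, a real pipeline would also need to place the IR contig twice (once in reverse complement), as the Discussion notes; that step is not shown here.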

Cp genome annotation
We used the web tool Dual Organellar GenoMe Annotator (DOGMA) 48 to generate the preliminary gene annotations. For each gene, we used MUSCLE 49 (version 3.8.31) to align the genuine protein sequences of the gene obtained from the NCBI GenBank to the genome assembly to decide the exact boundary positions. The web tool Organellar Genome DRAW (OGDRAW) 50 was used to generate the circular annotation plot of the genome assembly. The hypothetical cp open-reading frame ycf1 was not identified by DOGMA initially. It was added to the annotation on the basis of the MUSCLE alignment results.

Phylogenetic analysis
Phylogenetic analysis was performed on the 18 sweetpotato cultivars used as the parental genotypes for construction of mapping populations in the GT4SP project as well as on the Convolvulaceae Ipomoea section Batatas, including the cp genome assemblies constructed in this research and nine publicly available cp genome assemblies. MAFFT 62 (version 7.310) was employed to perform the multiple sequence alignment (MSA) for cp genomes. The phylogenetic structure was constructed with PhyML 63 (version 3.1). Branch certainty was evaluated with 1000 replications of bootstrap resampling. The phylogenetic tree depicted in this research was constructed with the web tool iTOL (version 4) 52 .
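Branch lengths in this study are reported in substitutions per bp. As a rough aid to reading that unit, the sketch below computes the uncorrected proportion of differing sites (p-distance) between two rows of a multiple sequence alignment. It is not the method used here: PhyML estimates branch lengths by maximum likelihood under a substitution model, and the function name and the choice to skip gap and N columns are assumptions made for illustration.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') or an ambiguous base
    ('N') are skipped, so the result is differences per compared site.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


# Toy example: one difference over ten aligned sites.
print(p_distance("ACGTACGTAC", "ACGTACGTAA"))
```

For closely related sequences such as the cultivar cp genomes, the p-distance and a model-corrected distance are expected to be very close, which is why it serves as a reasonable intuition for the scale of the reported values.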

Data availability
Underlying data
Nanopore and Illumina reads and the cp genome assemblies are deposited at NCBI BioProject repository, accession number PRJNA438020: http://identifiers.org/bioproject/PRJNA438020.

I thought one of the highlights of this paper was the assembly of 18 sweetpotato cp genomes. The authors demonstrated the presence of the two distinct subpopulations in East Africa using these cp genomes. However, no other more detailed analysis and discussion about these 18 cp genomes. Would it be better to add more detailed sequences analysis? For example, the factors which impact the genome size, some specific loci 3.
genome data, and to validate the method. Others: "The only difference is that the psbZ gene is not found in the cultivar Xushu 18 cpDNA while the ihbA gene is not found in the cultivar Tanzania cpDNA. It should be noted that the double-copy gene ycf1 was not reported for the cultivar Xushu 18 cp genome4, but this was actually a miss-annotation." The difference between Xushu 18 and Tanzania are psbZ and ihbA, why the ycf1 was actually a miss-annotation. You should give more information about it. 1.
Just a suggestion from Dr. Yang, the important references should be added. 2.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Yuki Monden
Graduate School of Environmental and Life Science, Okayama University, Okayama, Japan
I think this paper is well-written, and the authors have properly revised the manuscript by referring to the comments of the reviewer. This paper performed a complete circular cp genome assembly using NGS and TGS technologies in the hexaploid sweetpotato. Phylogenetic analysis using the cp genomes revealed that there are two distinct clusters of sweetpotato in East Africa and I. trifida is the most closely related diploid wild species to the I. batatas hexaploid sweetpotato. The results of this paper provide insights into the genetic relationships and the population structure of the species from the Convolvulaceae Ipomoea section Batatas. Besides, despite the complexity of the cp genomes by the presence of two large inverted repeats, this research demonstrates the possibility of building the cp genomes using extremely low coverage (<1x) Oxford Nanopore WGS data combined with Illumina short reads. Other comments are shown below.
Table 1: The copy number of rps12 gene should be three, but the bracketed superscript of this gene is two. Please make sure.

Phylogenetic analysis using nuclear genomes: Is it possible to compare the results of phylogenetic analyses based on the cp genomes and the nuclear genomes using the same materials? I think such comparative analysis should provide new insight into evolutionary dynamics on cp and nuclear genomes of Ipomoea species. Do you have any plans for such work?

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Response:
The primary focus of this study was to investigate the population structure of the East African sweetpotato cultivars used in the GT4SP project. We strongly agree that these suggested analyses will largely increase the impact of the manuscript. However, it is difficult to integrate these suggested analyses in the current study within the constraints of the data. We will include these suggestions in the future directions of the project.

Reviewers Comment:
In terms of the manuscript organization and writing, I found some parts confusing. For example, the material and methods are not aligned with the results presented in the manuscript. The section "Extraction of cp genome sequence from whole genome sequencing data" describes the chloroplast data mining from ON and Illumina for the Tanzania accession and then for the Beauregard accession, but the material and methods also describe the use of 10X Genomics Chromium, and I am not sure where that comes from. Did the authors also use 10X Genomics? For the genome assembly, a comparison of the Canu assembly with Organelle_PBA (Soorni et al. 2017) could be interesting, to see whether the authors obtain only one contig representing the whole chloroplast.

Response:
We have amended the manuscript to address these concerns. 10X Genomics was indeed used to perform the whole genome sequencing for the sweetpotato accessions Tanzania and Beauregard. However, the linked-read information was not utilized in the construction of their cp genome assemblies. Instead, the sequence data were simply used as paired-end reads to create contigs. The contigs were then used to construct a whole cp genome assembly with a cp reference genome. The mentioned Current Biology paper has provided hundreds of cp genome sequences of sweet potato and its wild relatives.
○ "The circularized contig is ~161 Kb, and is highly collinear with the published sweetpotato cp genome assembly (Figure 1d)." In Figure 1d, it is an I. trifida cp genome, not a published sweet potato cp genome.
○ "The sweetpotato cp genome represents a common circular structure with two IRs (IRA and IRB) separating one LSC and one SSC2." Where does the 2 in SSC2 come from? Convert into right format if it is a citation.
○ "The red dots represent SNPs between the two cp genomes. The green bars on the x-axis indicate positions of the two IRs" No red dots there, only black dots.
○ "It should be noted that the doublecopy gene ycf1 was not reported for the cultivar Xushu 18 cp genome4" Convert into right format if it is a citation.
○ "Interestingly, the accession PI 618966 was originally identified as I. triloba and was recently reidentified as I. trifida by the GRIN National Genetic Resources Program." The identification of PI 618966 needs to be checked carefully. All individuals of I. trifida formed a monophyletic clade closely related to I. batatas according to the Current Biology paper. As the progenitor of sweet potato, it's quite strange that I. trifida is much closer to other species in Series Batatas than to I. batatas.
○ Figure 3 & 4: It would be much clearer to add the tip labels rather than collapsed clades on the tree. Figure 4 will be no more informative in this case. If the tree is not that complicated, collapsing the two clades is not advisable, since information about the relationship between within-clade and out-of-clade samples is not visible when a clade is collapsed. This information will not be illustrated in Figure 4. Clades can be labeled in different colors if one wants to highlight them. Furthermore, it is not clear to me where each sample is nested in Figure 4.
○ "In this study, we used the genome assembly constructed from the Nanopore reads as reference to assemble cp genomes for a further 19 cp genomes including…" Misleading sentence, authors do rely on published cp genome rather than de novo Nanopore assembly.
○ "In order to solve this problem, for reference-based scaffolding, we reused some singlecopy contigs from the two IR regions to construct complete cp genome assemblies." In which cultivar(s) did the authors investigate the influence on the tree structure?
○ I agree the population structure of East African sweet potato cultivars is important for the GT4SP project. Also, the data organization and visualization could obviously be much improved to meet the indexing standards.
in Current Biology (Muñoz-Rodríguez, Pablo, et al., 2018). Due to the lack of awareness of this publication, the claims in the manuscript are incorrect and need to be revised. Response: This publication was cited in the revised version. The incorrect claim was revised (see reviewer's comment 7).
Comment 2. "In recent years, the development of next-generation sequencing (NGS) technologies such as Illumina and Roche 454 facilitate faster and cheaper methods to sequence cp genomes 13-15 ." To my knowledge, the Roche 454 already left the market. Response: In the revised version, "In recent years" was deleted to make it more precise.
Comment 3. "By examining the mapping results of the WGS data, we are able to extract cp sequences 21,22 ." We? Who are we? Response: In the revised version, this sentence was rewritten to "The cp sequences could be extracted by examining the mapping results of the WGS data to the reference cp genome 21, 22 ."
Comment 4. "Sweetpotato is a hexaploid (2n=6x=90) with genome size estimated to be between 2,200 to 3,000 Mb 28 ." How about the C-values? Response: The nuclear genome size is not the key point of this study. The C-value was not investigated.
Comment 5. "Due to the complex genome structure, the availability of sweetpotato genomic resources is lacking." We do have a published genome, right? Response: Even though there is a sweetpotato reference genome published recently (Yang, Jun, et al., 2017), we think the availability of the sweetpotato genome resources is still lacking.
Comment 6. "A number of cp genomes from the Ipomoea family have been sequenced 16,33 ." Does Ipomoea family mean genus Ipomoea? Or genus Ipomoea Series Batatas? Response: "Ipomoea family" means "genus Ipomoea". This was made clear in the revised version.
Comment 7. "Most of them are diploid wild relatives of the sweetpotato. To the best of our knowledge, to date, four cp genomes have been completely sequenced for the hexaploid sweetpotato 4, 16 ; the genome size is around 161 Kb, and the structure represents a standard quadripartite circular with a LSC of 87 Kb, a SSC of 12 Kb and two IRs of 31 Kb 4 . The cp genomes were mainly used to perform phylogenetic analyses 4, 16 . " The mentioned Current Biology paper has provided hundreds of cp genome sequences of sweet potato and its wild relatives.

Response:
The claim "To the best of our knowledge, to date, four cp genomes have been completely sequenced for the hexaploid sweetpotato 4, 16 ;" was removed in the revised version.