On the (Im)possibility to Reconstruct Plasmids from Whole Genome Short-Read Sequencing Data

Plasmids are autonomous extra-chromosomal elements in bacterial cells that can carry genes that are important for bacterial survival. To benchmark algorithms for automated plasmid sequence reconstruction from short read sequencing data, we selected 42 publicly available complete bacterial genome sequences which were assembled by a combination of long- and short-read data. The selected bacterial genome sequence projects span 12 genera, containing 148 plasmids. We predicted plasmids from short-read data with four different programs (PlasmidSPAdes, Recycler, cBar and PlasmidFinder) and compared the outcome to the reference sequences. PlasmidSPAdes reconstructs plasmids based on coverage differences in the assembly graph. It reconstructed most of the reference plasmids (recall = 0.82) but approximately a quarter of the predicted plasmid contigs were false positives (precision = 0.76). PlasmidSPAdes merged 83 % of the predictions from genomes with multiple plasmids in a single bin. Recycler searches the assembly graph for sub-graphs corresponding to circular sequences and correctly predicted small plasmids but failed with long plasmids (recall = 0.12, precision = 0.30). cBar, which applies pentamer frequency composition analysis to detect plasmid-derived contigs, showed an overall recall and precision of 0.78 and 0.64. However, cBar only categorizes contigs as plasmid-derived and does not bin the different plasmids correctly within a bacterial isolate. PlasmidFinder, which searches for matches in a replicon database, had the highest precision (1.0) but was restricted by the contents of its database and the contig length obtained from de novo assembly (recall = 0.36). Surprisingly, PlasmidSPAdes and Recycler detected single isolated components corresponding to putative novel small plasmids (<10 kbp) which were also predicted as plasmids by cBar. This study shows that it is possible to automatically predict plasmid sequences, but only for small plasmids. The reconstruction of large plasmids (>50 kbp) containing repeated sequences remains challenging and limits the high-throughput analysis of WGS data. Author Summary Short read sequencing of the DNA of bacteria is often used to understand characteristics such as antibiotic resistance. However the assembly of short read sequencing data with the goal of reconstructing a complete genome is often fragmented and leaves gaps. Therefore independently replicating DNA fragments called plasmids cannot easily be identified from an assembly. Lately a number of programs have been developed to enable the automated prediction of the sequences of plasmids. Here we tested these programs by comparing their outcomes with complete genome sequences. None of the tested programs were able to fully and unambiguously predict distinct plasmid sequences. All programs performed best with the prediction of plasmids smaller than 50 kbp. Larger plasmids were only correctly predicted if they were present as a single contig in the assembly. While predictions by PlasmidSPAdes and cBar contained most of the plasmids, they were merged with or indistinguishable from other plasmids and sometimes chromosome sequences. PlasmidFinder missed most plasmids but all its predictions were correct. Without manual steps or long-read sequencing information, plasmid reconstruction from short read sequencing data remains challenging.

Plasmids are a major driver of variation and adaptation in bacterial populations. The 2 dissemination of multidrug resistance via transfer of plasmids leads to new antibiotic 3 resistant bacteria such as Escherichia coli producing extended-spectrum 4 beta-lactamases [1] or vancomycin resistant Enterococcus faecium causing nosocomial 5 outbreaks [2]. The prevalence of a plasmid in a bacterial population can increase due to 6 environmental pressures that include the presence of an antibiotic, but may cause a 7 decrease in bacterial fitness in absence of selective pressure [3]. 8 A bacterial cell can hold no, one or multiple plasmids with varying sizes and copy 9 numbers. Traditionally, plasmid sequencing involved the extraction of plasmids using 10 methods to specifically purify plasmid DNA, followed by shot-gun sequencing, which 11 frequently necessitated closing of gaps by PCR or primer-walking [4]. Plasmid DNA 12 purification is exceedingly difficult if it involves plasmids longer than 50 kbp [4,5]. 13 Alternatively, plasmid sequences can be assembled from whole genome sequencing data 14 (WGS) generated by high-throughput short-read sequencing platforms. However, 15 plasmids often contain repeated sequences shared between the different physical DNA 16 units of the genome, which prohibits complete assembly from short read data. Assembly 17 often results in many fragmented contigs per genome and their origin (plasmid or 18 chromosome) thus remains unclear [6]. Assembly alone is insufficient to determine the 19 origin of a contig and to differentiate contigs belonging to different plasmids. Recently, 20 attempts to reconstruct plasmids from WGS data were automated in a number of 21 programmes. 22 Currently available plasmid reconstruction programmes either aim to determine 23 whether a previously assembled contig is obtained from a plasmid (PlasmidFinder, 24 cBar), or try to reconstruct whole plasmid sequences from the (mapped) sequencing 25 reads or the assembly graph (Recycler, PlasmidSPAdes, PLACNET) ( Table 1). 26 One of the most widely used tools for plasmid detection and classification is a web 27 tool called PlasmidFinder, developed to detect replicon sequences particularly 28 originating from the family Enterobacteriaceae [7]. Two plasmids sharing the same 29 replication mechanism cannot coexist in the long term within the same cell thus replicon 30 sequences are used to classify plasmids into different incompatibility groups [12]. 31 Unsupervised binning using differences in k-mer composition has been widely used in 32 shotgun metagenomic algorithms [13][14][15]. Composition-based classification methods 33 allow the clustering of contigs into distinct genomes and perform a species-level 34 classification. However, most of these methods are not designed for application to 35 isolated strains and do not report a classification between plasmid or chromosomal 36 contigs. cBar was specifically designed to predict plasmid-derived sequences based on 37 differences in k-mer composition [8]. It relies on differences in pentamer frequencies 38 2/18 Table 1. Overview of the programmes to reconstruct or predict plasmids from short read sequencing data.

Input
Paired-end information Coverage k-mer composition de Bruijn graph Similarity to replicons Similarity to relaxases Similarity to plasmids Web-tool Command-line interface

Included in study
PlasmidFinder [7] Contigs cBar [8] Contigs Recycler [9] BAM+assembly graph PlasmidSPAdes [10] Reads PLACNET [11] BAM/SAM+contigs from 881 complete prokaryotic sequences and gives a binary classification of 39 chromosome-or plasmid-derived contigs. non-redundant plasmid sequences from NCBI [11]. PLACNET merges all the 45 information into a single network where each component corresponds to a physical DNA 46 unit. Repetitive sequences such as transposases or insertion sequences (IS) with a higher 47 coverage are shared between components. Manual pruning in Cytoscape is necessary to 48 duplicate and split the graph to obtain disjoint components in the final network [16,17]. 49 Prediction reproducibility rates is thus highly dependent on the expertise of the 50 researcher. As we aimed to test fully automated methods for plasmid reconstruction, we 51 excluded PLACNET from the comparison.

52
More recently, two algorithms that reconstruct plasmids on basis of the information 53 contained in the de Bruijn graph were developed: Recycler [9] and PlasmidSPAdes [10]. 54 Recycler extracts the information from the de Bruijn graph searching for sub-graphs 55 (cycles) corresponding to plasmids. Selection of the cycles is based on the following 56 assumptions: (i) nodes forming a plasmid have a uniform coverage, (ii) a minimal path 57 must be selected between edges because of repetitive sequences, (iii) contigs belonging 58 to the same cycle have concordant read-end paired information and (iv) plasmid cycles 59 exceed a minimum length.  Here, we benchmarked currently available programmes to detect and reconstruct 70 plasmid sequences from short read sequencing data, starting either from the reads or 71 Recycler was created by alignment of the trimmed reads against the resulting 105 contigs using Bwa 0.7.12 [20] and samtools 1.3.1 [21]. Cycles reported in the 106 assembly graph were considered as Recycler prediction.

110
Measures for the evaluation 111 We evaluated the performance of each programme regarding accuracy and completeness. 112 Quast (version 4.1) [22] was used to map plasmid predictions against i) each reference 113 plasmid separately or ii) the reference genome (containing chromosomes and plasmids) 114 using Nucmer alignments. Total contig length was used to estimate the following terms: 115 • Plasmid fraction. Fraction of the prediction that matched the reference

123
• Fraction of novel sequences. Fraction of the prediction not mapping to either 124 the reference plasmid or the chromosome, thus corresponding to contigs absent 125 from the reference assembly. 126

5/18
Novel contigs were further analyzed and annotated using Prokka (version 1.12-beta) [23]. 127 To identify potential novel plasmids we compared these sequences to the non-redundant 128 nucleotide database of the NCBI using BLAST. The best blast hit was extracted 129 selecting minimum e-value and highest bit-score as previously described [10]. 130 Furthermore, the completeness of the potential novel mobile genetic elements was 131 corroborated by generating a dot-plot aligning the sequence to itself [24]. The presence 132 of the same repeated sequence at the ends of the contig suggested a potential 133 circularization signature. This analysis was summarized in Table 3.

134
The programs were further evaluated using the following metrics.

135
• Recall was defined as the percentage of the reference plasmid(s) covered by the 136 prediction. On the individual plasmid level, a recall of 100% indicates that the full 137 sequence of the reference plasmid was present among the predicted plasmids. On 138 the whole genome level, a recall of 100% indicates all reference plasmids were fully 139 present among the predicted plasmids. However, recall does not take prediction of 140 plasmid synteny or plasmid boundaries into account.

141
Recall value was estimated using the genome fraction reported in Quast.

142
• Precision. We defined precision as: 143 P lasmid f raction P lasmid f raction + Chromosome f raction (1) The fraction of novel sequences was ignored when calculating precision.

144
The negative control, the plasmid-less B. cenocepacia strain 22E-1, was excluded 145 from recall and precision calculations. Icarus [25] (packaged in Quast 4.1) was used to 146 visualize the alignments between the reference genomes and the predicted sequences.

147
Scaffold linkage of specific contigs in the PlasmidSPAdes assembly graph of a selection 148 of genomes was visualized with Bandage (version 0.7.1) [26].

149
The workflow (S1 Plasmid length (kbp) Fig 1. Overview of reference plasmids. The size of the reference plasmids is shown for each bacterial isolate. Strains were sorted based on their total plasmid length. B. cenocepacia DDS 22E-1 was considered as a negative control because no reference plasmids are present. K. oxytoca strain CAV1374 was the most complex isolate with eleven plasmids ranging from 1.9 to 332.9 kbp Reconstruction per plasmid 168 The performance of the programs was first evaluated on a single plasmid level. We 169 defined a minimum recall value of 0.9 to classify a plasmid as correctly predicted. Out 170 of 148 reference plasmids included in this study, 133 (89.9 %) were reconstructed by 171 either PlasmidFinder, cBar, Recycler or PlasmidSPAdes (Figures 2 and 3).  validate Recycler and/or PlasmidSPAdes [9,10]. Recall values obtained for each of the 181 reference plasmids in this study were concordant with previous findings (S1 Table and 182 S1 Appendix).

183
Of all 148 plasmids, five plasmids were consistently fully predicted by all of the Reference plasmids were classified into small (less than 10 kbp), medium (from 10 to 50 kbp) and large plasmids (greater than 50 kbp) depending on their size. The number of reference plasmids correctly predicted by the programs is represented in the three categories.  Performance of the programmes on genome level. The prediction of each program was mapped against the reference genomes of each bacterial isolate. Contigs mapping to the reference plasmids were depicted as plasmid fraction (green bar), to the reference chromosome as chromosome fraction (white bar) or to neither as novel sequences fraction (purple bar). On the right y-axis the total length (in kbp) of reconstructed plasmid contigs is indicated. cBar was the only program predicting contigs as plasmids in the genome that was used as negative control (B. cenocepacia DDS 22E-1).

9/18
per programme. We assigned each as plasmid predicted contig to one of the following  Figure 4). Surprisingly, a fraction of 0.06 corresponding to 208 contigs not mapping to the reference genomes was detected. This resulted in an overall 209 precision of 0.76. The majority of plasmids were present in the prediction (overall recall 210 = 0.82).

211
The completeness of the prediction was high even in the bacterial isolates with an  contigs into a single bin with a size of 870.8 kbp. In total, 14 contigs with a size ranging 217 from 0.6 to 4.7 kbp matched to two or more reference plasmid sequences from the K.

218
oxytoca strain CAV1374, among which a transposase with a length of 3.6 kbp present in 219 six reference plasmids.

220
PlasmidSPAdes predicted a total contig length of 18.18 Mbp of which 3.78 Mbp were 221 mapping to the reference chromosome. These chromosomal contigs were analyzed and 222 annotated by Prokka to search for phage-related genes. We applied the same keywords 223 as reported in Phaster [27,28] to assess the presence of a phage sequence. A total of 224 1.36 Mbp showed evidence for presence of phage-related genes. These sequences were 225 possibly predicted as plasmids because their coverage differed from the host genome. 226 We found contigs not mapping to the reference genomes in 20 cases (Figure 4). With 227 the exception of E. coli JJ1886 and E. coli JJ1887, most of the contigs present in the 228 fraction of novel sequences were detected as isolated components by PlasmidSPAdes.

229
Copy number of those components was inferred from their k-mer coverage ratio.

230
Isolated components including a single contig are highlighted in Table 3. Contig size, 231 best blast hit, inferred copy number, gene annotation and circularity are reported.   Recycler obtained a low overall recall of 0.12 ( Figure 5). This value can partly be 245 explained by the fact that the algorithm only reports unique circular sequences.

246
Therefore plasmids sharing highly similar sequences with each other were not present in 247 the prediction. In B. subtilis subsp. natto BEST195 and E. aerogenes CAV1320, which carry single 287 plasmids, precision and recall value were 0.0 ( Figure 5). Both these plasmids were 288 assembled into single contigs but the algorithm erroneously predicted these as 289 chromosome-derived. 290 Notably, in the negative control B. cenocepacia DDS 22E-1, cBar predicted a total 291 size of 1369 kbp wrongly as plasmid-derived contigs 4.

292
Using cBar, the detection of novel plasmids is more difficult compared to Recycler or 293 PlasmidSPAdes because graph component information is not available. However, only 294 with the exception of two putative small cryptic plasmids in A. veronii AVNIH1 and K. 295 oxytoca KONIH1, all components highlighted in Table 3 were also classified as plasmids 296 by cBar. We compared four different programmes to reconstruct or predict plasmid sequences 322 from WGS data. The large majority of the sequences of the plasmids (89.9 %) could be 323 reconstructed by one of the programmes when compared to the reference plasmids.

324
However, in many cases, the reconstructions were fragmented (all programmes), 325 contaminated by chromosome sequences (cBar, Recycler, PlasmidSPAdes), boundaries 326 of the plasmids were unclear (cBar, PlasmidSPAdes) and plasmids incomplete (all 327 programmes). In absence of reference plasmid sequences, disentangling or binning the 328 reconstructions into separate plasmids is a challenging step that still has to be solved. 329 The overall recall obtained by PlasmidSPAdes (0.82) showed that most of the 330 reference plasmids were fully or partially present in the plasmid prediction. The major 331 drawback in using PlasmidSPAdes was the lack of boundaries when reference plasmids 332 shared a high number of similar sequences. This limitation can be overcome by applying 333 the same methodology as already established in PLACNET [11]. Recycler applies an innovative and general approach to plasmid reconstruction and 344 successfully extracted complete plasmid sequences if they had circular features present 345 in the assembly graph. Most large plasmids, however, tend to be assembled into several 346 14/18 contigs due to the presence of repeated sequences with high coverage. Recycler failed to 347 extract these types of plasmids and in many cases only extracted non-plasmid mobile 348 elements. 349 cBar was originally designed to categorize chromosome and plasmids in metagenomic 350 sequences by comparing pentamer frequencies of a plasmid database. The accuracy of 351 this approach is known to be lower for long plasmids because the nucleotide composition 352 of these plasmids is similar to the host chromosome [29]. However, the overall recall of 353 cBar is high (0.78) and it might be well-suited to confirm if a sequence is predicted to 354 be plasmid-derived by another method.

355
The results of PlasmidFinder showed an outstanding 1.0 true positive rate indicating 356 a high reliability of the prediction. Being initially designed for Enterobacteriaceae, it (fraction of novel sequences: 0.06 and 0.14, respectively) that were not present in the 363 complete reference genomes and which were also predicted as plasmids by cBar. These 364 sequences could originate from sequence reads that were filtered in the reference 365 assembly because they were considered to be contaminant sequences, but could also 366 represent small replicons. As described elsewhere, hierarchical genome assembly process 367 (HGAP) of PacBio reads can lead to missing small plasmids in the main assembly when 368 using a seed read length cut-off greater than actual plasmid size [30,31]. To include 369 small plasmids in a genome assembly we suggest to perform a subsequent de novo 370 assembly using short-reads not mapping to the PacBio assembly or to perform a HGAP 371 iteratively reducing the seed read length when assembling whole genomes.

372
Small cryptic plasmids are mostly composed of genes involved in plasmid replication 373 and were previously described in ESBL-producing E.coli [32]. We analyzed a total of 27 374 putative small cryptic plasmids extracted either by Recycler or PlasmidSPAdes 375 corresponding to isolated components with a single contig. However experimental 376 validation is required to confirm these plasmids as stable residents.

377
To obtain the full sequences of plasmids, long read sequencing data can be a 378 solution [33]. Nonetheless, the relatively high error rate of long read sequencing by 379 Pacific Biosystems PacBio RS II or Oxford Nanopore Technologies Ltd makes desirable 380 the combination of long and short-read sequencing technologies for accurate plasmid 381 sequencing. Moreover, we showed the importance of checking the presence of small 382 replicons not present in the reference assembly. This may be crucial to identify the 383 entirety of the plasmids repertoire and, with that, obtain complete genome sequences.

384
In this study, plasmid reference sequences were present for comparison, something