mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species

Assembly of bacterial short-read whole-genome sequencing data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Complete genomes resolved by long-read sequencing can be used to generate and label short-read contigs. These were used to train several popular machine learning methods to classify the origin of contigs from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. We selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium=0.92, F1-score K. pneumoniae=0.90, F1-score E. coli=0.76), which outperformed other existing plasmid prediction tools using a benchmarking set of isolates. We demonstrated the scalability of our models by accurately predicting the plasmidome of a large collection of 1644 E. faecium isolates and illustrate its applicability by predicting the location of antibiotic-resistance genes in all three species. The SVM classifiers are publicly available as an R package and graphical-user interface called ‘mlplasmids’. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.

To cover all plasmid replication genes not present in the first selection, 12 additional isolates were selected for ONT sequencing. This second selection was based on a reciprocal blast of the predicted plasmid orthologous genes against 76 previously described plasmid replication amino-acid sequences from the genus Enterococcus [6]. Reciprocal blast allowed to identify miss-annotated genes corresponding to plasmid replication sequences. Isolates bearing plasmid replication genes not present in the first selection were sorted and selected based on highest number of orthologous genes.
Hybrid assembly was performed using Unicycler (version 0.4.1), specifying "bold" mode [7]. Briefly, Unicycler uses SPAdes (version 3.6.2) to create different assembly graphs based on different k-mer size only considering Illumina reads [8]. The best assembly graph was selected by Unicycler based on number of dead-ends and contiguity. Next, all ONT reads were used to scaffold and solve the assembly graph. Additionally, we specified the same file as described above (Isolate selection for ONT sequencing) containing 76 known plasmid replication sequences to rotate and change the 0coordinate of replicons resulting from hybrid assembly [6]. Finally, Unicycler conducted several rounds of Pilon (version 1.22) to polish genome sequences using Illumina reads [9].

Categorization of Unicycler contigs
Unicycler contigs were labeled either as chromosome or plasmids based on size and circularity. Contigs were categorized as chromosome if they were larger than 350 kbp, regardless of circularity.
However, only contigs were categorized as plasmids if they were circular and smaller than 350 kbp. Putative plasmids smaller than 350 kbp and lacking circularization signatures were not categorized. Draft annotation (Prokka -version 1.12) of plasmid sequences allowed to identify and discard four putative complete phage sequences present as circular contigs.
Supplementary Methods S2 (Building a machine-learning model) For each bacterial species, we tuned and compared five different supervised algorithms provided in mlr R package (version 2.11): logistic regression, Bayesian classifier, decision trees, random forest (RF) and support-vector machine (SVM) [10]. We defined a two-class classification problem using the category 'plasmid' as positive-class. To train and test the resulting classifiers we considered pentamer frequencies (n=1024) which were calculated using oligonucleotideFrequency function available in R package biostrings (version 2.42.1). Mlr package was used to split SPAdes labeled contigs into training (80%) and test set (20%), preserving the frequencies of each class in both sets (Supplementary Table S4).
Decision trees, random forest and support-vector machines hyperparameters were optimized using random search in a predefined search space (Supplementary Table S5). We performed 10-fold crossvalidation to assess the quality of hyperparameters combination, using error rate as performance measure. For each object, posterior probabilities were generated and the class with a highest posterior probability was assigned.

Supplementary Methods S3
In this study, we used Illumina NextSeq/MiSeq data for 1,644 E. faecium isolates that are available under the ENA project PRJEB28495. A fraction (n = 62) of these 1,644 E. faecium isolates was completed using ONT MinION reads which are publicly available under the figshare projects: 10.6084/m9.figshare.7046804 ; 10.6084/m9.figshare.7047686 From these 62 ONT isolates, 5 were not used to label short-read contigs to train and test mlplasmids models. These 5 isolates (E2079, E2364, E4457, E7591, E8172 and E9101) were used to benchmark E. faecium mlplasmids models against other plasmid tools. A complete overview of the different datasets used in this study is available at Supplementary Table S6. Supplementary Results S1

Comparison of mlplasmids against other plasmid prediction tools
Only with the purpose of comparing mlplasmids and plasflow prediction, we created an artificial and third category for mlplasmids named 'unclassified' in which we included all contigs first assigned as plasmid-or chromosome-derived but with a posterior probability lower than 0.7. We only defined a mlplasmids 'unclassified' category for this particular analysis, since mlplasmids prediction only consists of two classes: plasmid or chromosome, and users can decide whether filter out predicted contigs based on their associated posterior probabilities.
For the three single species datasets, the frequency of this category was lower for mlplasmids (E. faecium = 0.03 ; K. pneumoniae = 0.10 ; E. coli = 0.05) compared to PlasFlow (E. faecium = 0.16; K. pneumoniae = 0.17 ; E. coli = 0.21) (Fig. S6). These results showed that most of predicted contigs had an associated high posterior probability of belonging to either plasmid-or chromosome-class with mlplasmids compared to PlasFlow. To show the potential of mlplasmids predicting unclassified contigs from PlasFlow, we considered unclassified contigs by plasflow and observed the posterior probabilities given by mlplasmids. For E. faecium and K. pneumoniae datasets, we observed that unclassified contigs from PlasFlow derived from the chromosome-class and plasmid-class were mostly correctly predicted by mlplasmids ( Fig. S7a and S7b). For E. coli, unclassified contigs from the plasmid-class showed a non-uniform distribution whereas contigs from the chromosome-class were in general correctly predicted (Fig. S7c).

Supplementary Results S2
Applicability for predicting sequences derived from incomplete long-read assemblies For all bacterial species, mlplasmids did not recover any false positive sequences (Specificity = 1). For E. coli, only a single plasmid sequence (NC_022662.1) was wrongly predicted as chromosomederived but with a low posterior probability associated to that class (0.53). In the case of K. pneumoniae, mlplasmids misclassified a plasmid sequence with a length of 26.45 kbp (NZ_CP015133.1) from K. pneumoniae strain KPN555. For E. faecium two sequences were misclassified as chromosomal (NZ_LT598665.1 and NZ_CP019991.1) and the last sequence (NZ_CP019991.1) could correspond to a phage since its NCBI annotation showed two phage-related genes. This demonstrates the flexibility of mlplasmids to predict sequences with different lengths compared to average contig length used to train and test resulting classifiers and discarded misclassifications due to a correlation between pentamer frequencies and contig length. This may facilitate the classification of contigs generated from incomplete hybrid or long-read assemblies as exemplified for isolate E. faecium E7070. This isolate was selected for ONT sequencing and after hybrid assembly, 16 contigs were reported. Contigs predicted as plasmid by mlplasmids (n = 6) contained circularization signatures whereas the rest of the contigs (n = 10) were predicted as chromosome-derived (Fig. S9). This facilitated the design of appropriate PCR reactions to complete the genome sequence for E7070. Supplementary Table S4 Figure S1. Ward hierarchical clustering of computed pairwise mash distances (k = 21 ; s = 1,000) from E. faecium isolates. Based on dendrogram branch lengths, we defined three clusters (black, blue and grey) and visualized mash distances using heatmap based on their genome content similarity. At the bottom y-axis, we coloured in red E. faecium isolates (n = 60) that were selected and completed using ONT sequencing and Illumina sequencing. Rest of the isolates corresponded to publicly available NCBI complete genomes from E. faecium (n = 24).     and E. coli (c) which were predicted as 'unclassified' by plasflow were interrogated using mlplasmids. Each predicted contig was grouped into chromosome-or and plasmid-derived (x-axis), coloured based on prediction evaluation and associated probability plasmid-class (y-axis) represented. Figure S8. Estimating mlplasmids potential to predict plasmid sequences transferred by HGT events. We used all the three species models available in mlplasmids to predict contigs belonging to E. faecium (a), K. pneumoniae (b) and E. coli (c) validation sets. Each plasmid-derived contig was coloured as false-negative (orange) or true-positive (green) based on evaluation of mlplasmids prediction. Figure S9. mlplasmids applicability to predict contigs derived from incomplete hybrid or long-read assemblies. Bandage visualization of the hybrid assembly obtained for the E. faecium isolate E7070. For this isolate, hybrid assembly using Unicycler did not result in a complete assembly (chromosome and plasmids in single and circular components). Resulting contigs were labeled based on mlplasmids prediction. Figure S10. Enterococcus faecium resistome. Draft genomes available in NCBI Genomes FTP ( n = 369) were downloaded and screened using Abricate and ResFinder for the presence of antibiotic resistance genes. Each contig containing a resistance gene was predicted with mlplasmids to predict plasmid-or chromosome-origin. For visualization purposes, only antibiotic resistance genes present more than five times are shown. Figure S12. Escherichia coli resistome. Draft genomes available in NCBI Genomes FTP (n = 5,234) were downloaded and screened using Abricate and ResFinder for the presence of antibiotic resistance genes. Each contig containing a resistance gene was predicted with mlplasmids to predict plasmid-or chromosome-origin. For visualization purposes, only antibiotic resistance genes present more than five times are shown.