PlaScope: a targeted approach to assess the plasmidome of Escherichia coli strains

Plasmid prediction may be of great interest when studying bacteria such as Enterobacteriaceae. Indeed, many resistance and virulence genes are located on such replicons and can have a major impact on pathogenicity and spreading capacity. Beyond strain outbreaks, plasmid outbreaks have been reported, especially for some extended-spectrum beta-lactamase- or carbapenemase-producing Enterobacteriaceae. Several tools are now available to explore the “plasmidome” from whole-genome sequence data, using a variety of approaches. However, recent benchmarks have highlighted that none of them succeeds in combining high sensitivity and specificity. With this in mind, we developed PlaScope, a targeted approach to recover plasmidic sequences in Escherichia coli. Based on Centrifuge, a metagenomic classifier, and a custom database containing complete sequences of chromosomes and plasmids from various curated databases, it classifies the contigs of an assembly according to their predicted location. Compared to other plasmid classifiers, Plasflow and cBar, it achieves better recall (0.87), specificity (0.99), precision (0.96) and accuracy (0.98) on a dataset of 70 genomes containing plasmids. Finally, we tested our method on a dataset of E. coli strains exhibiting an elevated rate of chromosomal integration of extended-spectrum beta-lactamase coding genes, and we were able to identify 20/21 of these events. Moreover, the predicted locations of virulence genes and operons were also in agreement with the literature. Similar approaches could also be developed for other well-characterized bacteria such as Klebsiella pneumoniae.

Data summary
All the genomes were downloaded from the National Center for Biotechnology Information Sequence Read Archive and Genome database (Supplementary tables 1 and 2). The source code of PlaScope is available on GitHub (https://github.com/GuilhemRoyer/PlaScope).
Importance
Plasmid exploration is of great interest since these replicons are pivotal in the adaptation of bacteria to their environment. They are involved in the exchange of many genes within and between species, with a significant impact on antibiotic resistance and virulence in particular. However, plasmid characterization has been a laborious task for many years, requiring, for example, complex conjugation or electroporation experiments. With the advent of whole-genome sequencing, access to these sequences is now potentially easier, provided that appropriate tools are available. Many software tools have been developed to explore the plasmidome of a large variety of bacteria, but they rarely manage to combine sensitivity and specificity. Here, we focus on a single species, E. coli, and we use the large amount of available data to overcome this problem. With our tool, called PlaScope, we achieve high performance compared with two other classifiers, Plasflow and cBar, and we demonstrate the utility of such an approach to determine the location of virulence or resistance genes. We think that PlaScope could be very useful in the analysis of specific and well-known bacteria.


Introduction
Recently, several studies have evaluated the effectiveness of in silico plasmid prediction tools (1,2).
Indeed, many bioinformatics methods are now available to detect such mobile elements, using different approaches such as read coverage analysis, k-mer based classification and replicon detection; some are fully automated (3-7), others are not (8). Some of them achieve high sensitivity: for example, PlasmidSPAdes and cBar reach a plasmid recall of 0.82 and 0.76, respectively, on a dataset of 42 genomes (1). On the other hand, some tools display very high precision, such as PlasmidFinder, which reaches 100% (1). Unfortunately, none of them finds a good trade-off between sensitivity and specificity, and users therefore need to combine different methods to get correct predictions.
Concomitantly, more and more sequences are available in public databases, with various levels of completeness, from large sets of contigs to fully circularized genomes and plasmids. Some groups have made an effort to curate these databases and have proposed high-quality datasets. Carattoli et al. and Orlek et al., for example, have published exhaustive plasmid datasets for Enterobacteriaceae (4, 9).
With this in mind, we propose here a method, called PlaScope, to assess the plasmidome of genome assemblies. We took advantage of available genomic data to create a custom database of plasmids and chromosomes, which is used as input to Centrifuge, a tool originally developed as a metagenomic classifier (10). We compared it with other plasmid classifiers, cBar and Plasflow, and showed that with our specific knowledge-based approach we were able to recover nearly all plasmids of various Escherichia coli strains without compromising on specificity. Finally, the usefulness of our approach is illustrated on a set of E. coli whole genomes in which we sought to identify the location of specific genes involved in virulence or antibiotic resistance.

Theory and implementation

Workflow description
The PlaScope workflow is illustrated in Fig. 1. First, users provide paired-end fastq files. An assembly is then run using SPAdes 3.10.1 (11) with the "careful" option to obtain contigs. Subsequently, Centrifuge (10) predicts the location of these contigs using a custom database and sorts the sequences into 3 classes: plasmid, chromosome and unclassified. The last class includes sequences shared by both categories (i.e. plasmid and chromosome), which are therefore indistinguishable, and sequences without any hit. Finally, results are sorted according to these three classes and extracted using awk. The complete workflow is available as a single bash script called PlaScope.sh on GitHub (https://github.com/GuilhemRoyer/PlaScope).

Centrifuge custom database construction
We gathered all the complete genome sequences (chromosomes and plasmids) of E. coli from the NCBI on 10/01/2018. We also added the plasmid sequences that were used to create the PlasmidFinder database (4) and those proposed by Orlek et al. (9). Finally, we added a very specific dataset containing E. coli plasmids involved in antibiotic resistance (www.agence-nationalerecherche.fr/en/anr-funded-project/) (https://www.ebi.ac.uk/ena/data/view/PRJEB24625). Then, we pooled plasmid and chromosome sequences separately to create a custom database for Centrifuge 1.0.3 (10) with an artificial taxonomy containing only three nodes: "chromosome", "plasmid", and "unclassified" (see README on https://github.com/GuilhemRoyer/PlaScope).

PlaScope classification method
PlaScope classifies contigs as "chromosome", "plasmid" or "unclassified" with Centrifuge using our custom database (centrifuge -f --threads 2 -x custom_database -U example.fasta -k 1 --report-file summary.txt -S extendedresult.txt), with the option "k" set to 1 in order to get only one taxonomic assignment. Only contigs longer than 500 bp, with a Centrifuge hit longer than 100 bp and with a SPAdes contig coverage higher than 2 are classified as plasmid or chromosome-related.
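The three thresholds above can be sketched as a small filter. This is only an illustration, not the code of PlaScope.sh (which relies on awk). It assumes the usual SPAdes contig naming scheme (NODE_<id>_length_<bp>_cov_<coverage>), a hypothetical function name `classify`, and that contigs failing any threshold end up in the "unclassified" class, which the text does not state explicitly.

```python
import re

# Thresholds quoted in the text: contig > 500 bp, Centrifuge hit > 100 bp, coverage > 2.
MIN_CONTIG_LEN = 500
MIN_HIT_LEN = 100
MIN_COVERAGE = 2.0

# SPAdes contig names look like NODE_1_length_5230_cov_14.7
SPADES_NAME = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


def classify(contig_name, label, hit_length):
    """Return 'plasmid', 'chromosome' or 'unclassified' for one contig.

    label      -- name of the node assigned by Centrifuge (hypothetical input,
                  i.e. the taxonomic ID already translated to its node name)
    hit_length -- length of the Centrifuge hit in bp
    """
    match = SPADES_NAME.match(contig_name)
    if match is None or label not in ("plasmid", "chromosome"):
        return "unclassified"
    contig_len = int(match.group(1))
    coverage = float(match.group(2))
    # Assumption: a contig failing any threshold is reported as unclassified.
    if (contig_len > MIN_CONTIG_LEN
            and hit_length > MIN_HIT_LEN
            and coverage > MIN_COVERAGE):
        return label
    return "unclassified"
```

For instance, under these assumptions `classify("NODE_1_length_5230_cov_14.7", "plasmid", 4800)` keeps the plasmid label, whereas a 450 bp contig is left unclassified whatever its hit.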

Reference dataset for method evaluation
To evaluate our tool, we searched for completely finished genomes of E. coli with Illumina reads available in the National Center for Biotechnology Information (NCBI) database. All corresponding chromosome and plasmid sequences and Illumina short reads were downloaded from the NCBI on 10/01/2018, and converted into fastq files with fastq-dump from sra-toolkit (fastq-dump --split-files). For evaluation purposes, these genomes were not included in the Centrifuge custom database.
The short reads were assembled with SPAdes 3.10.1 (11) with standard parameters and the "careful" option (spades.py --careful -t 8 -1 read_1.fastq.gz -2 read_2.fastq.gz -o output_directory). After assembly, 16S rapid identification was performed on the fasta files using ident-16s (12). Twelve assemblies that did not contain an Escherichia 16S sequence, or that contained multiple 16S sequences from various organisms, were excluded from the subsequent analyses. Finally, we kept 70 genomes containing 183 plasmids and 7 genomes with no plasmid according to the NCBI database (Supplementary table 2).
We filtered the assemblies based on contig length (> 500 bp) and SPAdes coverage (≥ 2). Then, each assembly was mapped against the corresponding complete chromosome and plasmid sequences from the NCBI database using Quast 4.6 with standard parameters (13). Contigs that did not align to any sequence (chromosome or plasmid), or that aligned over less than 50% of their length, were not considered, nor were contigs that aligned to both types of sequence.
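The exclusion rules used to build the reference labels can be written as a short decision function. The sketch below is an assumption-laden illustration: the function name `truth_label` and its inputs (the number of contig bases aligned to the chromosome and to plasmids, as one might derive from the Quast alignments) are hypothetical, and the preceding length and coverage filter is not repeated here.

```python
def truth_label(contig_len, aligned_chrom, aligned_plasmid, min_fraction=0.5):
    """Return 'chromosome', 'plasmid', or None when the contig is excluded.

    contig_len      -- contig length in bp
    aligned_chrom   -- contig bases aligned to the reference chromosome (hypothetical input)
    aligned_plasmid -- contig bases aligned to the reference plasmids (hypothetical input)
    """
    hits_chrom = aligned_chrom > 0
    hits_plasmid = aligned_plasmid > 0
    # Contigs aligning to both replicon types, or to none, are not considered.
    if hits_chrom == hits_plasmid:
        return None
    aligned = aligned_chrom if hits_chrom else aligned_plasmid
    # Contigs aligned over less than 50% of their length are not considered.
    if aligned / contig_len < min_fraction:
        return None
    return "chromosome" if hits_chrom else "plasmid"
```

Returning None for every excluded contig keeps ambiguous sequences out of the evaluation entirely, so that they count neither for nor against any classifier.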

PlaScope, Plasflow and cBar benchmark
PlaScope, Plasflow (5) and cBar (7) were run on the reference dataset of 70 genomes containing plasmids. These methods use different databases and classification approaches to sort contigs as plasmidic or chromosomal. In addition, PlaScope and Plasflow may label contigs as unclassified when results are ambiguous.
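The four criteria used to compare the tools (recall, specificity, precision and accuracy) follow the standard confusion-matrix definitions. The sketch below takes plasmid as the positive class; the function name `metrics` is hypothetical, and how unclassified contigs were counted in the benchmark is not described here, so the sketch leaves that choice to the caller.

```python
def metrics(tp, fp, tn, fn):
    """Standard binary classification metrics, with plasmid as the positive class.

    tp -- plasmid contigs predicted as plasmid
    fp -- chromosome contigs predicted as plasmid
    tn -- chromosome contigs predicted as chromosome
    fn -- plasmid contigs not predicted as plasmid
    """
    return {
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Illustrative counts only, not results from this study.
example = metrics(tp=8, fp=2, tn=88, fn=2)
```

Because chromosomal contigs usually far outnumber plasmid contigs in an assembly, accuracy and specificity can remain high even when recall is modest, which is why the four values are reported together.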

Application to resistance, virulence gene and operon locations
In a second step, we evaluated our method on Extended-Spectrum Beta-lactamase (ESBL)-carrying E. coli strains from a published dataset exhibiting an elevated rate of chromosomal integration of ESBL-coding genes (14). Using this approach, we accurately identified 20 chromosomally integrated and 5 plasmid-related CTX-M genes (Fig. 3) when compared with the results of that publication. The only discrepancies concerned the two isolates of Clade E (strains RS254 and RS371). We found a plasmid location for the CTX-M coding gene in strain RS254, whereas it was described as chromosome-related, probably because of an uncommon structure formed by the gene and its adjacent sequences. For the second strain, RS371, the location was not predicted by PlaScope (unclassified), whereas it was reported as plasmid-located.
The short length of the contig carrying the CTX-M gene in this strain (3274 bp) very likely contributes to this undetermined result.
In the same publication, the authors also searched for virulence genes and iron metabolism operons.
To go further, we used PlaScope results to determine the location of these genes (Fig. 3). Some of them are exclusively carried by chromosomes (lpfA, mcmA, astA) or plasmids (f17G, cma, senB).
Interestingly, iss can be found on either type of replicon. For example, iss is chromosomal in the Clade A isolates (V161 and V210), whereas it is located on plasmids in 4 of the 5 Clade C isolates (E003488, E006910, R107, R208 and V177). This illustrates the differences in genetic background even between closely related strains. Similarly, the gene f17A has different locations: on plasmids in 3 strains (R299, R56, R61a) and on the chromosome in only one (370B15-13-2A, not described in the original publication). These two possible locations of iss and f17A have been observed previously (15,16). Concerning the operons, 5 of them (i.e. the enterobactin, fec, feo, fhu and yersiniabactin operons) were predicted as chromosome-related, whereas the others (i.e. aerobactin, salmochelin, sit and the iron transport pEC14_114) were predicted as plasmidic. These results are in agreement with the literature. Indeed, the first five are known to be chromosome-encoded (17-21), whereas iron transport pEC14_114 is plasmidic (22). Aerobactin, salmochelin and sit have been found on both types of replicons (23).

Conclusion
Here, we propose a method, called PlaScope, for plasmid and chromosome classification of E. coli contigs. It is based on Centrifuge (10), a fast metagenomic classifier that uses exact matches and small-sized databases. PlaScope offers high specificity by selecting a unique assignment of contigs to plasmid, chromosome or unclassified. We took advantage of the ever-growing number of sequences in databases to build a custom database, which combines many high-quality sequences of Enterobacteriaceae plasmids and chromosome sequences of E. coli. We compared the performance of our tool with cBar and Plasflow, as these tools also enable the segregation of plasmid and chromosome contigs. These two programs rely on genomic signatures and were developed to predict plasmid sequences in metagenomic samples.
Compared to PlaScope, Plasflow achieved roughly the same recall on our dataset, whereas cBar performed slightly worse. However, for the other criteria (precision, specificity and accuracy), PlaScope outperformed both tools owing to its highly specific database. cBar and Plasflow are in principle able to identify mobile elements in many bacterial species thanks to their very diverse taxonomic databases. But when focusing on a single species, the targeted approach of PlaScope gave clearly better results in terms of both recall and precision.
Using PlaScope, we were able to recover almost all plasmids from the analysed strains, with very high precision, specificity and accuracy. Furthermore, in 1 of the 7 strains described as plasmid-free in the NCBI database, we were able to identify a mobile element: a typical F plasmid in an E. coli K-12 strain.
In a second analysis, we challenged our approach on more concrete data by looking at chromosomal integrations of ESBL-coding genes in E. coli strains (14). Using PlaScope, we accurately identified 20/21 of these chromosomal insertions. In addition, the locations we predicted for virulence genes and iron metabolism operons were in agreement with the literature. This demonstrates that PlaScope may be useful to locate operons such as aerobactin or salmochelin, which can be carried on plasmids as well as on the chromosome and which, like other iron metabolism-related systems, have a major impact on virulence and/or fitness (21,24).
We think that our approach can be very useful when focusing on a well-described species as it makes it possible to decipher the plasmid content of the genomes without an excess of over