PlasmidTron: assembling the cause of phenotypes and genotypes from NGS data

Increasingly rich metadata are now being linked to samples that have been whole-genome sequenced. However, much of this information is ignored. This is because linking this metadata to genes, or regions of the genome, usually relies on knowing the gene sequence(s) responsible for the particular trait being measured and looking for its presence or absence in that genome. Examples of this would be the spread of antimicrobial resistance genes carried on mobile genetic elements (MGEs). However, although it is possible to routinely identify the resistance gene, identifying the unknown MGE upon which it is carried can be much more difficult if the starting point is short-read whole-genome sequence data. The reason for this is that MGEs are often full of repeats and so assemble poorly, leading to fragmented consensus sequences. Since mobile DNA, which can carry many clinically and ecologically important genes, has a different evolutionary history from the host, its distribution across the host population will, by definition, be independent of the host phylogeny. It is possible to use this phenomenon in a genome-wide association study to identify both the genes associated with the specific trait and also the DNA linked to that gene, for example the flanking sequence of the plasmid vector on which it is encoded, which follows the same patterns of distribution as the marker gene/sequence itself. We present PlasmidTron, which utilizes the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographical information, to identify traits that are likely to be present on DNA that can randomly reassort across defined bacterial populations. It is also possible to use this methodology to associate unknown genes/sequences (e.g. plasmid backbones) with a specific molecular signature or marker (e.g. resistance gene presence or absence) using PlasmidTron. PlasmidTron uses a k-mer-based approach to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs in a fast and scalable-to-large manner. PlasmidTron is written in Python 3 and is available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron.

We have reanalysed a dataset (Holt et al. 2015) on the genomic diversity, population structure, virulence, and antimicrobial resistance in K. pneumoniae, to evaluate the effectiveness of PlasmidTron where the underlying genomes are known to harbour multiple plasmids.
Invasive K. pneumoniae versus carriage Two samples of K. pneumoniae, 1612 (accession number ERS012041) and EW-85-R-MAN (accession number ERS011857), with near identical core genes both from ST 36 and with capsule 149, display very different phenotypes (Holt et al. 2015). One sample (1612) caused invasive disease, the other (EW-85-R-MAN) did not cause disease (carriage). PlasmidTron was used to assemble sequences which were only found in the invasive sample to attempt to understand the underlying mechanisms of the pathogen. Comparing both samples with PlasmidTron using parameters "-k 31 -l 5000" gave an assembly of 37 contigs and 586,427 bases. The contigs were compared using blastn (Camacho et al. 2009) (version 2.7.0) to the non-redundant nucleotide (NT) GenBank database, with the top 20 largest contigs summarised in Supplementary Figure 1. Two previously identified K. pneumoniae plasmids, pSGH10 (accession number CP025081) and AR_0129_p2 (accession number CP021715.1), were present. As the plasmids both contained repetitive transposon sequences, full reconstruction of the underlying plasmid sequences was not possible using short reads alone. There are an additional 8 contigs which are associated with plasmids but their origin was ambiguous. This is most likely due to the recombined variants of a plasmid not being present in GenBank. Interestingly there are 3 complete novel sequences with no matches, even distant, in GenBank. Some chromosomal material is present in 7 contigs which can be depleted through the use of more control samples.  PlasmidTron was run with the parameters "--action intersection -k 61 -l 1000 -r 0.4" which is a strict mode, requiring a kmer to be present in all invasive samples with at least 10 times kmer coverage and absent from all carriage samples. The resulting assemblies had a mean of 649 kbases and a mean of 307 contigs. As a comparison a pan-genome was created of the same sets of samples. The raw reads were first assembled de novo using SPAdes (v3.11.1) (Bankevich et al. 2012) and annotated with PROKKA (version 1.10) (Seemann 2014). These annotated assemblies were then used as input to Roary with parameters "-e -n" (version 3.11.0) (Page et al. 2015). The cumulative running time for the assemblies, annotation and pan-genome generation was 146.45 hours. This compares to a running time for PlasmidTron of just 1.16 hours. The peak memory usage in both cases was under 1.5 GB. Roary identified 225 genes, equating to 224,221 coding bases, which were present in all invasive samples and absent from all carriage samples.
Comparing the Roary accessory genes to the assemblies produced by PlasmidTron using blastn (version 2.7.0) (Camacho et al. 2009) revealed that 4 invasive samples (66%) had blast matches to all 225 genes, the remaining samples had 224 and 223 genes with blast matches. When validated against the SGH10 reference genome and found a total of 83 (36.8%) of these genes were found to reside on plasmid pSGH10 and 142 (63.1%) on the chromosome. Supplementary Figure 4 shows the regions of pSGH10 which are unique to the invasive samples as identified by PlasmidTron. These regions (in blue) cover the iro (salmochelin), iuc (aerobactin) and rmpA/rmpA2 virulence genes (upregulators of capsule expression). A full discussion can be found in (MC Lam et al. 2017). PlasmidTron was able to identify these regions using phenotypic data alone, which is often the case when no complete reference genome is available. A limitation of the PlasmidTron method is that large segments of the pSGH10 plasmid are also present in 9 of the carriage strains, although they do not contain the virulence genes. This reduced the amount of the plasmid that could be assembled.

Supplementary Figure 4: A circular plot of plasmid pSGH10 with the track (from inner to outer) consisting of GC skew (G-C/G+C), G+C content, genes identified by PlasmidTron as being unique to invasive samples (blue), coding regions on the reverse and forward strands (grey).
Multiple plasmids in K. pneumoniae K. pneumoniae samples can often have multiple plasmids. This makes it an ideal test case for PlasmidTron. The K. pneumoniae dataset from Holt et al. (Holt et al. 2015) was analysed using ARIBA (version 2.10.3) (Hunt et al. 2017) with the PlasmidFinder database (Carattoli et al. 2014(Carattoli et al. ) (accessed 05-12-2017 to identify all of the Incompatibility (Inc) groups in each sample. Sample AJ049 (accession number ERS005743) contained the most plasmid typing sequences (6) in the dataset so was chosen as a worst case scenario for PlasmidTron. All complete K. pneumoniae reference genomes (162) were downloaded from RefSeq (accessed 05-12-2017 and listed in Supplementary Table 4) and the chromosomes from each reference were extracted using fasta_grep (version 1) (https://github.com/sanger-pathogens/fasta_grep) to use as controls. PlasmidTron was run with the parameter "-k 61 -d -l 2000" and the resulting assembly contained 520,314 bases with an N50 of 14,182, in 52 contigs. The assembly was compared to the PlasmidFinder database and all 6 plasmid typing sequences were recovered.
A blastn (version 2.7.0) (Camacho et al. 2009) with a minimum identity of 80% and a query coverage of 80% was performed on each of the 52 contigs against the nucleotide nonredundant (NT) database. Full details are listed in Supplementary Table 3. Of the 52 contigs, 41 (78.8%) were associated with plasmid sequences, all from Enterobacteriaceae encompassing Klebsiella, Citrobacter, Enterobacter, Raoultella and Salmonella. A further 2 (3.8%) were associated with chromosomal sequences from Enterobacter xiangfangensis and Klebsiella michiganensis. There was ambiguity which could not be resolved for 4 contigs (7.6%) which had a top hit in the Raoultella ornithinolytica (formerly Klebsiella ornithinolytica) chromosome (accession number CP017802.1), but every other blast hit matched to K. pneumoniae plasmids. This may be due to an assembly error with the deposited genome or an integration of a plasmid into the chromosome, however it is not possible to resolve this using short reads alone. Novel sequences, where there were no matches of any description in GenBank, accounted for 96Kb in 5 contigs (9.6%). Plasmid pKPN-332 (accession number CP014763.1) accounted for 6 contigs and pAUSMDU8141-1 (accession number CP022696.1) accounted for 14 contigs, so it is likely that both of these plasmids are present. The remaining 21 contigs were associated with a variety of plasmids but there was no clear pattern. As PlasmidTron does not use a database of reference plasmid sequences, it is not possible to unambiguously assign all contigs to particular plasmids. Whilst further post processing is required, PlasmidTron does rapidly identify the interesting parts of the accessory genome, substantially reducing the size of the data for the researcher.