Distributed under Creative Commons Cc-by 4.0 Most: a Modified Mlst Typing Tool Based on Short Read Sequencing

Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper reports the comparison of the reliability of MLST results derived from WGS data, comparing mapping and assembly-based approaches to conventional methods using 323 bacterial genomes of diverse species. The sensitivity of the two WGS based methods were further investigated with 26 mixed and 29 low coverage genomic data sets from Salmonella enteridis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly-and mapping-based approaches, respectively. The concordance between samples that were typed by conventional (92.9%) and both WGS methods was 100%. From the 55 mixed and low coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping and assembly based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS based methods, the mapping based approach was the most sensitive. In addition, the mapping based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS-assembly based approaches.


INTRODUCTION
The process of whole genome sequencing (WGS) has benefited from recent advances collectively known as next generation sequencing, allowing high throughput sequencing of bacterial genomes at low financial cost. This results in WGS becoming a viable alternative to some traditional typing methods for public health infectious disease surveillance.
MLST can be derived from WGS using de novo assembly/BLAST based (Larsen et al., 2012;Jolley & Maiden, 2013) and mapping based (Inouye et al., 2012;Inouye et al., 2014) approaches. De novo assembly/BLAST based approaches work by assembling short reads into longer contiguous sequences and then comparing these contigs to a reference allele database using BLAST to assign a MLST type. Mapping based approaches align short reads to reference (allele) sequences representing all alleles from MLST loci using mapping tools such as BWA (Inouye et al., 2012) or Bowtie2 (Inouye et al., 2014). Subsequently, SNP/INDELs are called using a variant-calling algorithm such as Samtools mpileup (Li et al., 2009) to determine the most likely allele at each locus. An allele is assigned if the reads have 100% coverage and 100% nucleotide identity to the locus alleles sequence without any INDELs. Mapping based approaches allow the calculation of metrics for each designated allele to assess the quality of the match (Inouye et al., 2012;Inouye et al., 2014).
Public Health England provides diagnostic, specialist and reference microbiology services to healthcare providers in England. Implementation of whole genome sequence (WGS) technology for public health microbiology requires quality controlled results that are at least as accurate as conventional 'gold standard' methods. In order to make an informed decision regarding the software that is most capable of accurately determining the MLST profile from WGS data, this paper systematically compared the performance of WGS-based MLST software to conventional methods using genomes from 323 samples. The software was evaluated based on the ability to: (a) Derive a full MLST profile, (b) demonstrate concordance to the MLST results derived from conventional sequencing and (c) assign quality metrics that allow results to be reported quantitatively.

Isolates
Reference isolates and samples with mixed species assembly in vitro were prepared in order to compare the reliability of MLST results.  From these, representatives of many different STs were included.

Salmonella isolates, including those mixed with other species and isolates with low coverage
Isolates submitted to the above reference laboratories are most often pure cultures, but a small proportion of samples do contain a mixture of organisms. Kmer ID software (https://github.com/phe-bioinformatics/kmerid) was used to identify samples containing mixed species. Samples containing Salmonella mixed with other species were used to test the sensitivity of WGS-based MLST methods (Table 1). Samples with lower than the expected coverage of genomic data can be revealed from the ''minimum consensus depth'' value. To test how sensitive WGS-based MLST methods were when processing low coverage samples we used Salmonella samples with minimum read depth values of 1-10 (Table S1).

Intra-species mixed samples (Strepococcus pneumoniae)
In order to determine how sensitive WGS-based MLST methods are when processing intra-species mixed samples, we assembled artificial mixes of different S. pneumonaie types from previously extracted genomic DNA, at different ratios (Table 2).

WGS, quality assessment and species identification
Samples for WGS sequencing were submitted to the Genomic Sequencing Unit at PHE. Illumina Nextera DNA libraries were constructed and sequenced using the Illumina HiSeq 2500. Afterwards, the samples were deplexed using the Casava 1.8.2 (Illumina inc., San Diego, CA, USA) and the FASTQ reads were quality trimmed using Trimmomatic (Bolger, Lohse & Usadel, 2014) to remove bases with a quality PHRED score below 30 from both ends. K-mer ID software was used to compare the sequence reads with a panel of curated NCBI Refseq genomes to identify the species.

MLST determination
To extract MLST from Campylobacter sp., Staphylococcus aureu, Streptococcus pneumoniae and Salmonella sp.,the respective allele and ST/profile definitions were downloaded from  Table 3.

MLST via conventional sequencing
Alleles were initially amplified by PCR and DNA sequenced using Sanger sequencing. Sanger sequencing was carried out using Applied Biosystems 3720X DNA analyser. Bionumerics version 6.1 was then used to determine the alleles and ST. Bionumerics assigned an allele if the assembled reads matched 100% to the locus variant sequence with zero SNP/INDELs using BLAST. STs were determined using this methodology from set of pure isolates (Campylobacter sp., Staphylococcus aureus and Streptococcus pneumoniae samples).

MLST via WGS based mapping
At the time that this validation study took place the only available mapping-based approach was SRST (version 1) (Inouye et al., 2012). Following initial testing, SRST was modified and the resulting software called ''Metric Oriented Sequence Typer'' (MOST). Bowtie2 was chosen as the global aligner (rather than BWA) due to the greater sensitivity that we have observed with Bowtie2. MOST uses the output from the Bowtie2 mapping to report percentage coverage across the allele length and the ''maximum percentage of nonconsensus bases'' at any position. The latter value enables the user to identify potentially mixed samples and stop mixed samples not to be reported as novel allele.
The ''Max percentage non-consensus bases'' value is calculated for each position by using the following formula: Percentage non-consensus bases = (Number of reads mapped to reference sequence with non-consensus base/Total number of reads aligned to reference sequence) * 100.
Once the percentage non-consensus bases are calculated, the maximum percentage non-consensus base value is determined and reported.
Finally, MOST was adjusted to infer Salmonella serotype from the ST value using a PHE Salmonella serotype database (Ashton et al., 2016). For a full list of other modifications please refer to Supplemental Information 1. MOST is available as open-source software (https://github.com/phe-bioinformatics/MOST).
In addition to the samples used for the conventional ST methodology, STs were also determined from samples with intra and inter species mixes and those with low coverage from the genomic data in order to determine the sensitivity of MOST.
Also included in Table S1 is the time taken to derive a 7 locus MLST result using MOST.

MLST using BIGSdb-a WGS-assembly based approach
Sequence reads from the same samples described in the previous section were assembled using Spades (version 2.5.1) de novo assembly software with the following parameters 'spades.py-careful-1 strain.

Conventional MLST vs WGS-based MLST-for pure cultures
WGS based MLST yielded via MOST returned full MLST profiles from 99.7% (322) of the 323 isolates tested. This compared to 97.5% (315) via assembly and BIGSdb, and 92.9% (300) by conventional MLST (Table 3). The concordance between samples that return a full MLST profile by conventional MLST and both WGS methods was 100% (Table 3). For 21 Campylobacter sp and 2 Streptococcus pneumoniae samples, a full MLST profile was not returned via the conventional method due to poor sequence quality.

WGS-mapping based MLST vs WGS-assembly based MLST
Having established the superiority of WGS based MLST over conventional MLST for sensitivity of ST determination from pure cultures, we investigated the accuracy (including the assessment of quality) of different WGS analyses for samples with low coverage and for samples with more than one organism. From 29 samples that yielded low coverage Salmonella genomic data, the WGS-mapping approach (MOST) and WGS-assembly approach (assembly and BIGSdb) returned 100% (29) and 93.1% (27) full MLST profiles, respectively (Table 3). The WGS-assembly based approach did not return full profiles for 2 samples due to truncation of a contig that contained a MLST locus and for the other sample BIGSdb returned two variant matches for the thrA allele. BIGSdb identified two variant matches for thrA (thrA67 and thrA489) alleles due to the present of duplicated allele within the BIGSdb database. thrA489 was removed from curator's database (http://mlst.warwick.ac.uk/mlst/dbs/Senterica/ Downloads_HTML) but this was not addressed in BIGSdb at the time.
From 14 Salmonella isolates mixed with other bacterial species (Table 1), the WGSmapping approach (MOST) returned 92.9% (13/14) full MLST profiles whereas the WGS-assembly approach returned full profiles for only half of the samples (50% or 7/14) ( Table 3). The WGS-assembly approach did not return full profiles for 7 samples. Three of these were due to contigs that were truncated in a target region (MLST allele), a further three returned two thrA allele variants via BIGSdb. The remaining sample had an 'N' introduced in the aroC allele. The WGS-mapping approach (MOST) reported ambiguous result (Novel) for 1 sample with ''max percentage non-consensus bases'' of 50%. This indicates that the sample is mixed at ratio around 50:50. Samples that are mixed might be assigned a novel allele due to the fact the percentage of mixed bases at each position can vary according due to stochastic processes during library preparation and sequencing. For example at one position a mixed base can be 52:48 and at another position the ratio may be 48:52 and therefore it is possible to link allelic calls at each position across the entire sequence with high confidence.
The ''max percentage non-consensus bases'' quality metric is useful to identify mixed samples and also stops mixed samples to be reported as Novel. The WGS-assembly approach returned 11 allele matches and an ST.
From 12 samples constructed in vitro to contain more than one ST of Streptococcus pneumoniae we found the WGS-mapping approach (MOST) returned the expected MLST results for 58% (7/12), whilst the WGS-assembly approach (via BIGSdb) returned the expected MLST results for only 25% (3/12) of samples (Table 3). Thus, the mapping based software, MOST, was more sensitive than the assembly based approach. Of the four samples that returned full profiles via MOST, three returned two allele variants using BIGSdb. It reported the following alleles: ddl (ddl14 and ddl339) and spi (spi2 and spi17) rather than the correct designations: ddl (ddl1 and ddl339) and spi (spi2 and spi6). The fourth sample has a contig that was truncated in the gki allele region. For four S.pneumoniae isolates that were mixed at 50%:50% ratio, both WGS based methods did not return correct profile (Table 2).

MOST quality metrics accurately informed mixed and low coverage samples
Unlike assembly based approaches, the MOST mapping based approach provided a ''minimum consensus depth'' quality metric that informed low coverage, as well as the ''max percentage non-consensus base value'' which was informative for identifying mixed samples.
For the 29 samples that yielded low coverage Salmonella genomic data, the ''minimum consensus depth'' values reported by MOST did demonstrate that the samples have low sequence depth (Table S1).
For the mixed samples containing more than one ST of S. pneumoniae the ''max percentage non-consensus base'' values reported by MOST demonstrated the presence of a mixture but also returned the ST of the majority strain within the mixture. However the ratios of the mixtures detected by MOST were consistently higher than the ratios provided by the laboratory. For example samples mixed at ratio 50:50%, 40:60%, 30:70%, 20:80%, 10:90% gave ''max percentage non consensus base'' values of 50%, 49%, 40%, 31% and 17%, respectively (Table 2), and probably reflects a combination of laboratory pipetting inaccuracies during construction of the mixes, and randomness in the distribution of reads across the MLST loci.

DISCUSSION
This study revised, tested and validated mapping-based and assembly-based software whose purpose was to extract STs from short-read WGS data by comparing the results with those from the conventional (PCR amplification and Sanger sequencing) MLST methods. Having established the superiority of WGS based methods, we then went on to compare the performance of two WGS data analysis approaches (assembly and mapping) to determine their accuracy against samples that contained more than one organism and low coverage data.
The superiority of WGS based methods was evidenced by the greater number of full MLST profiles as compared to the conventional method. Additional evidence was provided by the complete concordance between the results of conventional and WGS based methods, as well as no instances where only the conventional method returned a full MLST profile. Between the two WGS approaches our comparison indicated that MOST returned, 5% (19) more full MLST profiles than an assembly based approach (Table 3). MOST was particularly effective when handling data from samples with intra-and inter-species mixes. Moreover the quality metric values that it assigns flag mixtures such as these as well as low coverage data. In this respect as well determining the ST from pure samples, it is also suitable for determining the ST from a contaminated or impure sample. The importance of this benefit in the environment of a routine microbiology laboratory cannot be understated, for example we found that 1.5% (n = 335) of the cultures of Salmonella referred for typing were mixed with other species and 4.9% (n = 1,060) contained more than one strain. PHE National Infections Service reference laboratories have selected and used MOST to extract the MLST profile as part of its bioinformatics pipelines. To date (18th March 2016), our reference laboratories have extracted MLST data from over 37,000 samples (21,237 Salmonella, 4,256 Streptococcus pyogenes, 1,579 Campylobacter, 2,920 Streptococcus pneumoniae, 3,936 Escherichia coli, 1,887 Staphylococcus aureus, 1,200 Listeria monocytogenes and 700 Streptococcus agalactiae) via MOST.
As part of the MOST development we included additional utility to infer serotypes for Salmonella. This functionality inferred the serotype from the MLST profile based on a database of previously determined conventional serotyping results and showed 96% (n = 6,616) concordance between the MOST and conventional results (Ashton et al., 2016). Six months after our implementation of MOST an updated version of SRST (version 2) was released (Inouye et al., 2014). Whilst this update included the addition of local mapping alignment, it did not include the additional database analysis component we used for inferring serotype, otherwise our tests indicated agreement with MOST results, except for one sample for which SRSTv2 returned a different type to the conventional type (Table S1).