Comparison of classical multi-locus sequence typing software for next-generation sequencing data

Multi-locus sequence typing (MLST) is a widely used method for categorizing bacteria. Increasingly, MLST is being performed using next-generation sequencing (NGS) data by reference laboratories and for clinical diagnostics. Many software applications have been developed to calculate sequence types from NGS data; however, there has been no comprehensive review to date on these methods. We have compared eight of these applications against real and simulated data, and present results on: (1) the accuracy of each method against traditional typing methods, (2) the performance on real outbreak datasets, (3) the impact of contamination and varying depth of coverage, and (4) the computational resource requirements.


INTRODUCTION
A small number of bacterial foodborne pathogens, such as Salmonella, Campylobacter, Listeria and Escherichia, cause a huge burden of disease in humans and animals. With Listeria monocytogenes, although the case count is small, the case-fatality rate is high at approximately 21 to 38 % [1,2] and a high economic burden [3]. In the US, each foodborne illness can cost anywhere from hundreds to millions of US dollars depending on the organism. Therefore, investigating potential foodborne outbreaks and preventing any illness is advantageous from both economic and public health standpoints. In order to understand these bacteria in more depth, there have been many studies to describe their population structure using phylogenetic methods based on multi-locus sequence typing (MLST) [4,5].
Additionally, there have been many large-scale surveillance efforts for these pathogens. One of the most successful programs has been PulseNet International [6], which aids in the detection of common source outbreaks. Recently, large numbers of isolates have been subjected to whole-genome sequencing (WGS) through an initiative between the Centers for Disease Control and Prevention (CDC), the US Food and Drug Administration (FDA), the US Department of Agriculture (USDA), and the National Center for Biotechnology Information (NCBI). Through this collaboration, every L. monocytogenes genome that is discovered in the food supply, or in clinical samples, is being sequenced and uploaded to the NCBI Sequence Read Archive (SRA) database. This collaboration has since started sequencing a large percentage of Escherichia coli, Salmonella enterica, Campylobacter coli, Campylobacter jejuni and many others, with the eventual goal of completely switching from pulsedfield gel electrophoresis to WGS. In Europe, Public Health England sequences every Salmonella and Mycobacterium tuberculosis isolate submitted to them and deposits the data in the SRA. Perceiving a future need for worldwide collaboration on these new methods, the Global Microbial Identifier (GMI) [7] partnership was initiated in 2011 to encourage data sharing among all nations for many purposes, including public health and research.
To aid in population structure studies and in epidemiological investigations, MLST has been used for nearly two decades [8] to categorize different clonal expansions of these pathogens into broad categories, based on allelic variation amongst seven highly conserved housekeeping genes. Sequence typing can be performed using both nextgeneration sequencing (NGS) and classical sequencing techniques. Whilst MLST is a low-resolution classification compared to what is possible from NGS data, the nomenclature is in common usage by microbiologists and clinicians. A number of software applications have been developed using a variety of fundamentally different techniques to calculate sequence types (STs) from NGS data. However, there has been no comprehensive review to date on the accuracy, computational performance, robustness and ease of use of these methods. In this paper, we have evaluated multiple MLST software applications on a variety of datasets, both real and simulated, such as: (1) standard sets of outbreak data from the Gen-FS WGS Standards and Analysis Working Group (available from https://github.com/WGS-standards-and-analysis/datasets) [9], which includes C. jejuni, E. coli, L. monocytogenes and S. enterica; (2) Salmonella isolates that have been typed using both traditional capillary sequencing and NGS; (3) simulated reads of varying coverage; and (4) simulated mixed strains. Here, we describe a comprehensive list of command-line tools for MLST analysis and benchmark them with these standardized datasets in terms of accuracy and computer resources required.

SOFTWARE OVERVIEW
MLST software can be categorized according to the input data they accept; there are tools that use raw sequence reads and tools that use de novo assemblies. Calling MLST from raw reads avoids the need to fully reconstruct the whole genome, theoretically allowing for a lower running time. However, in practice de novo assembly is routinely performed for bacteria [10] and assemblies may already be available for any given MLST analysis leading to faster sequence typing. The process of de novo assembly can introduce artefacts, particularly from short reads. For example, a gene may be fragmented over multiple contigs. A full overview is given in Table 1. In general, the desired characteristics of MLST software include: (1) high specificity of calling STs, (2) resilience in the face of mixed samples, (3) tolerance with low sequencing coverage, (4) efficient usage of computational and disk resources,

IMPACT STATEMENT
Sequence typing is rapidly transitioning from traditional sequencing methods to using whole-genome sequencing. A number of in silico prediction methods have been developed on an ad hoc basis and aim to replicate classical multi-locus sequence typing (MLST). This is believed to be the first study to comprehensively evaluate multiple MLST software applications on real validated datasets and on common simulated difficult cases. It will give researchers a clearer understanding of the accuracy, limitations and computational performance of the methods they use, and will assist future researchers to choose the most appropriate method for their experimental goals. Antibiotic Resistance Identification By Assembly (ARIBA) (https://github.com/sanger-pathogens/ariba) takes raw reads as input on the command line, and uses a combination of mapping and local de novo assembly to calculate alleles. Like SRST2, it can be used more generally for gene detection and classification, allowing for antibiotic-resistance prediction, virulence-gene detection and plasmid replication gene classification. It is open source, has extensive unit tests and is packaged for easy installation.
Bacterial Isolate Genome Sequence Database (BigsDB) [11] is a web service whose primarily purpose is the management of sequence typing databases, as opposed to querying them. It is used by the majority of schemes as the backend for storing their typing data. The database can be queried in two ways, via a web interface or programmatically through a REST API. There is no described command line interface for queries; however, the mechanisms are in place to allow for it in the future. BigsDB can be used to create new MLST schemes.
BioNumerics (http://www.applied-maths.com/bionumerics) from Applied Maths is a commercial application that is widely used by public-health laboratories to calculate STs. Due to its proprietary nature, a full review is not possible; however, the authors described a reads-based k-mer sequence typing method in a patent [12] and do assemblybased sequence typing using BLASTN.
EnteroBase (http://enterobase.warwick.ac.uk) is a web resource that incorporates sequencing data from both public databases and directly from users for four genera (Salmonella, Escherichia, Yersinia and Moraxella), and assembles it de novo with an adjusted pipeline using SPAdes [13]. Enter-oBase succeeds the University College Cork/Warwick MLST database (http://www.mlst.net/databases/), and maintains the database and assigns new alleles of MLST schemes for these genera. These data are mirrored through PubMLST via the EnteroBase API, which is available for all EnteroBase users. Alleles are called using nucleotide and amino acid sequence with uSEARCH/uBLAST, which allows for high sensitivity for divergent allele variants. However, the source code is not publicly available.
Metric-Oriented Sequence Typer (MOST) [14] builds upon SRST (version 1) [15] and uses a mapping-based approach to align alleles to reads, with a traffic light system indicating the confidence in the ST calling. One major difference to SRST2 is that it takes a 100 base flanking region around the locus from a reference genome, reducing the impact of coverage drop off at the ends of the sequences. Additionally, it can assign predicted serovars to Salmonella isolates. It is used by Public Health England on clinical isolates and has strict, well-defined conservative criteria for calling STs to ensure accuracy. mlst (https://github.com/tseemann/mlst) takes de novo assemblies as input on the command line and uses BLASTN to align sequences to alleles. It is very fast and searches all databases on pubMLST to automatically detect the organism, then calculates the ST. Installation is very easy using brew.
MLST from the Center for Genomic Epidemiology (MLST-CGE) [16] is a web-based method for calculating MLST. It can take assembled genomes or raw sequencing reads. If raw sequencing reads are provided, it performs a de novo assembly. Alleles are called using a BLAST-based method.
MLSTcheck [17] takes de novo assemblies as input on the command line and uses BLASTn to align sequences to alleles. It is packaged for easy dependency installation, and has unit test coverage. It produces a multi-FASTA alignment of concatenated allele sequences for each sample, which allows for phylogenetic trees to be easily reconstructed. Novel allele sequences are saved to allow for them to be submitted to the MLST curators.
SeqSphere+ [18] from Ridom is a commercial application that is widely used by public-health laboratories. It uses assembled sequences to call STs. It is packaged for easy installation and consists of a large suite of analysis pipelines for automated sequence analysis. Due to its proprietary nature, a full review is not possible.
Short Read Sequence Typing 2 (SRST2) [19] takes raw reads as input on the command line and uses a mapping-based approach to align reads to the alleles. It is packaged for easy dependency installation and is widely used for a variety of applications in addition to MLST including: antibiotic-resistance prediction, virulence-gene detection and serotyping [20]. The software licence is free for both commercial and non-commercial use, and it has unit tests.
stringMLST [21] takes raw reads as input on the command line and uses k-mers to detect MLST alleles. Instead of detecting allele coverage or parsing for potential SNPs, an allele call is made by identifying the allele with the most number of matching k-mers. The use of k-mers gives a substantial speed advantage, but at the expense of accuracy. This method is fast enough to detect STs in real time during sequencing, so it holds much promise for the future. It is free for non-commercial purposes and it has no automated tests.
The described applications were optimized to work with MLST. Their performance on higher resolution schemes, such as ribosomal MLST, core genome MLST, and whole genome MLST, is quite different, with most scaling poorly to schemes with hundreds or thousands of genes, as this was a case the applications were never fundamentally designed to handle. Alternative methods are required to cater for these cases; thus, extended schemes are not covered in this paper.

DATABASE AVAILABILITY
The availability of databases containing alleles and ST profiles for different species is an important aspect of any MLST software application as outlined in Table 2, since this dictates how easy it is to use the software. These databases also need to be kept up to date, as the underlying schemes are constantly being extended as new isolates are sequenced. Out of date databases can mean that rapidly emerging clonal expansions may be missed, impairing epidemiological investigations. ARIBA, BioNumerics, mlst, MLSTcheck, stringMLST, SeqSphere+ and SRST2 all provide automated scripts/methods to download all of the latest databases from pubMLST [11], which are immediately ready to use. This provides immediate access to schemes for over 125 species.
mlst and stringMLST go one step further and additionally bundle all available databases in their software repository, which are regularly updated. MOST [22] as the base platform for the evaluations. Each of the applications was run in their own Docker container [23] available from GitHub (https://github.com/andrewjpage/docker_ mlst). The Debian Testing distribution was used as the base operating system for all containers as it provides access to a large range of up-to-date bioinformatics software. The host VM had four cores and 32 GB of RAM running Ubuntu 16.04 (LTS); however, only a single core was used for the evaluations. All datasets used for this analysis are available for download as described in the data bibliography or from the public archives using the accession numbers in the Supplementary Material. Where assemblies were required as input to MLST applications, the raw reads were de novo assembled with SPAdes (v3.9.0) [13] using the default parameters. SPAdes was chosen as it is widely used and consistently produces high-quality results on bacterial data [24]. All experiments using the two commercial applications, BioNumerics and SeqSphere+, were performed using the CDC compute infrastructure with default options and the SPAdes assemblies as described above.

REAL OUTBREAK DATASETS
Standard datasets (https://github.com/WGS-standards-andanalysis/datasets), covering L. monocytogenes from stone fruit [25], E. coli from sprouts [26], C. jejuni from raw milk (http://www.outbreakdatabase.com/details/hendricks-farmand-dairy-raw-milk-2008/) and S. enterica from spicy tuna [27], comprising 85 samples, were analysed by each of the software applications. These are real outbreak datasets where there were substantive epidemiological investigations and full details are available [9]. No false positives were reported by any application, they made either the correct call, a low-confidence call or no call. A summary of the overall performance is provided in Table 3, with extended details available in Table S1 (available with the online Supplementary Material). There was a wide variation in the results, with only three applications (stringMLST, BioNumerics and MLSTcheck) correctly calling all of the STs. MOST failed to confidently call any of the spicy tuna Salmonella samples, but did identify the correct STs, flagged as low confidence (amber). There was a 29-fold variation in the running times between the applications (stringMLST vs SRST2) using raw reads as input (Table 1). This extra computation imposes financial costs and increases the analysis time after sequencing.  [28], with calls sometimes made using a single read, leaving little resilience to sequencing errors. As the NGS data had very high depth of coverage (over 30Â) of this allele, it is likely that the NGS results were correct. Nearly all of the calls from MOST were low confidence (rated amber); however, they correlated with the results from the other applications, and it is just that MOST has very stringent, validated, criteria for calling an ST. Three samples were flagged by multiple applications as problematic; however, in every case the capillary sequencing data, stringMLST, EnteroBase, SeqSphere+ and BioNumerics confidently called an ST, indicating a contaminant has been missed. Eight applications flagged sample 139K as problematic; however, stringMLST confidently called an ST, indicating overconfidence in ST calling. MLSTcheck and BioNumerics called a different ST for 2 samples; however, this appears to be due to duplicate allele profiles in the underlying database at pubMLST. Overall, we conclude that whilst the MLST results between capillary sequencing data and NGS data are nearly identical, the MLST based on NGS data is more accurate and reliable when presented with edge cases.

IMPACT OF DEPTH OF COVERAGE
The impact of depth of coverage over the MLST genes was assessed by artificially generating perfect paired-ended reads with a length of 125 bases and a median insert size of 400 bases with varying levels of coverage using FASTAQ (v3.14.0). Values in bold indicate the best results in each column.
*The time to assemble with SPAdes before running the applications was 2873 min and is included separately.
†MOST identified the correct ST in 97.6 % of cases, but flagged 48.2 % of these calls as low confidence.
The allele sequences plus 500 base flanking regions were extracted from S. enterica Typhi CT18 [29], accession number AL513382, and artificial paired-end reads were generated with mean depths of coverage from 1Â to 30Â. The simulated reads were free from sequencing errors to allow for the effect of coverage alone to be measured. Therefore, the minimum effective depth of coverage for each application could be tested. All applications could accurately call STs when the coverage was greater than 12Â; however, below this the minimum depth of coverage applications required varied greatly, as shown in Fig. 1. stringMLST correctly called the ST with just 3Â coverage; however, it gave false-positive results for lower coverage alleles. ARIBA correctly called the ST from 5Â with no false-positive results. SRST2 correctly called the ST from 12Â coverage with no false-positive results; however, it did correctly identify the ST from 6Â with low confidence.
The computational resources required varied greatly with stringMLST taking just 10 s to call an ST with 30Â coverage, as shown in Fig. 2, and the final disk space requirements were negligible, as shown in Fig. 3. Whilst minimizing the disk space resources needed for the application is generally positive, stringMLST does not output enough information about the allele calls to allow for further analysis, for example, to interrogate a false-positive result. The time to call an ST at 30Â with ARIBA was 40 s with 0.1 Mbytes output data. The disk-space requirement is higher than stringMLST, but provides the allele assemblies used to call the ST, which is useful for further analysis. SRST2 is an order of magnitude slower, taking over 500 s to call an ST at 30Â. The disk space required for the final output is also very substantial at  confidently correctly called each individual allele from 4Â, the overall ST call was flagged as low confidence below 10Â due to its inherently conservative nature. The running time given for mlst and MLSTcheck includes the de novo assembly time with SPAdes, which accounts for most of the running time. MLSTcheck takes on average four times longer (25 s per sample) to return a result than mlst (5.9 s per sample), with the final results between the two being identical.

IMPACT OF MIXED SAMPLES
Contamination and mixed colonies are a standard complexity in microbiology [30]. To understand the behaviour of the different MLST software applications in the presence of more than one strain, we constructed a simulated dataset consisting of two Salmonella samples with different alleles in varying ratios. This allowed us to see at what point contamination/mixed strains becomes detectable. Once detected, we would expect an MLST application to flag the results as low confidence or provide no result at all to avoid false positives. The flip side of this is that if algorithms are too sensitive to low level contamination and sequencing errors, they become less useful on real world applications, so need to be tolerant to some low-level noise.
The allele sequences plus 500 base flanking regions were extracted from S. enterica Typhi CT18 [29], accession number AL513382, and S. enterica Weltevreden 10 259 [31], accession number LN890518. Artificial paired-end reads were generated using FASTAQ to give a total coverage of 50Â, beginning with CT18 at 1Â and 10 259 at 49Â in a single FASTQ file. The coverage of each sample was varied in steps of 1Â to generate a dataset of 49 FASTQ files. Fig. 4 shows that the accuracy of the software varies, but follows a general pattern, calling the sample with the highest coverage at the highest levels, with uncertainty in the middle as the proportion of the two samples becomes similar. The worst case is where a software application calls an ST with high confidence that is not in the underlying data (false positive), and only occured with stringMLST. MOST and ARIBA are highly conservative, detecting that there are mixed samples when the samples are at very low levels of coverage (at 4-5Â). MLSTcheck, mlst, SeqSphere+ and BioNumerics all performed identically, with the performance linked to how well SPAdes assembled the underlying genomes. There was no clear boundary with SRST2 and it varied between high-quality calls and low-confidence calls as the mixing of the samples changed.

CONCLUSION
It is clear that not all MLST calling applications function as expected. Problems with some software include: out of date databases, computationally inefficient methods, false-positive results, inability to call alleles at low coverage and variable performance in the presence of mixed samples. Therefore, there is scope for improvement. Overall though, these software applications' ST calls using NGS data are concordant with traditional MLST calling methods based on capillary sequencing data, perform moderately well with low mean genome coverage, and are sometimes able to report low confidence when faced with contamination.  Fig. 4. STs called by each software application when given data containing two different Salmonella samples in varying ratios of abundance. Where there is no ST called, or where the ST has any ambiguity at all, it is marked as low confidence. A false positive is where an ST is called with high confidence and is not one of the two samples in the raw data.