Whole genome sequencing of Streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline

Streptococcus pneumoniae typically express one of 92 serologically distinct capsule polysaccharide (cps) types (serotypes). Some of these serotypes are closely related to each other; using the commercially available typing antisera, these are assigned to common serogroups containing types that show cross-reactivity. In this serotyping scheme, factor antisera are used to allocate serotypes within a serogroup, based on patterns of reactions. This serotyping method is technically demanding, requires considerable experience and the reading of the results can be subjective. This study describes the analysis of the S. pneumoniae capsular operon genetic sequence to determine serotype distinguishing features and the development, evaluation and verification of an automated whole genome sequence (WGS)-based serotyping bioinformatics tool, PneumoCaT (Pneumococcal Capsule Typing). Initially, WGS data from 871 S. pneumoniae isolates were mapped to reference cps locus sequences for the 92 serotypes. Thirty-two of 92 serotypes could be unambiguously identified based on sequence similarities within the cps operon. The remaining 60 were allocated to one of 20 ‘genogroups’ that broadly correspond to the immunologically defined serogroups. By comparing the cps reference sequences for each genogroup, unique molecular differences were determined for serotypes within 18 of the 20 genogroups and verified using the set of 871 isolates. This information was used to design a decision-tree style algorithm within the PneumoCaT bioinformatics tool to predict to serotype level for 89/94 (92 + 2 molecular types/subtypes) from WGS data and to serogroup level for serogroups 24 and 32, which currently comprise 2.1% of UK referred, invasive isolates submitted to the National Reference Laboratory (NRL), Public Health England (June 2014–July 2015). PneumoCaT was evaluated with an internal validation set of 2065 UK isolates covering 72/92 serotypes, including 19 non-typeable isolates and an external validation set of 2964 isolates from Thailand (n = 2,531), USA (n = 181) and Iceland (n = 252). PneumoCaT was able to predict serotype in 99.1% of the typeable UK isolates and in 99.0% of the non-UK isolates. Concordance was evaluated in UK isolates where further investigation was possible; in 91.5% of the cases the predicted capsular type was concordant with the serologically derived serotype. Following retesting, concordance increased to 99.3% and in most resolved cases (97.8%; 135/138) discordance was shown to be caused by errors in original serotyping. Replicate testing demonstrated that PneumoCaT gave 100% reproducibility of the predicted serotype result. In summary, we have developed a WGS-based serotyping method that can predict capsular type to serotype level for 89/94 serotypes and to serogroup level for the remaining four. This approach could be integrated into routine typing workflows in reference laboratories, reducing the need for phenotypic immunological testing.


INTRODUCTION
Streptococcus pneumoniae is a human respiratory tract pathogen that represents a leading cause of invasive bacterial disease in children under five years of age and the elderly. Isolates of S. pneumoniae are traditionally characterised in terms of the antigenicity of their capsular polysaccharides (CPS) and currently there are 92 serologically distinct serotypes as defined by Statens Serum Institut (SSI), Copenhagen, Denmark sera using the Neufeld reaction for capsular swelling (referred to as Quellung) (Austrian, 1976;Statens Serum Institut, 2013). Additional 'serotypes' (6E, 6F, 6G, 6H, 11E, 20A and 20B) have been identified by molecular methods and/or monoclonal antibodies, but these are not distinguishable using the current commercial serotyping sera and some may not be phenotypically distinct types, but simply genetic variants producing the same polysaccharide structures (Calix & Nahm, 2010;Calix et al., 2012;Ko, Baek & Song, 2013;Oliver et al., 2013a;Park et al., 2015;Burton et al., 2016). The polysaccharide capsule is also the target of all currently licensed vaccines and therefore, due to selective pressure on circulating strains, accurate identification of pneumococcal serotypes is essential for disease surveillance, evaluation of the efficacy of the pneumococcal vaccines and to inform national vaccine policy.
In the UK, the initial reduction of invasive pneumococcal disease (IPD) and carriage, due to the introduction of PCV7 and PCV13 polysaccharide-conjugate vaccines, was followed by increases in non-vaccine type IPD (Miller et al., 2011;Waight et al., 2015). Specifically, following implementation of pneumococcal conjugate vaccine PCV13 (Centers for Disease Control and Prevention (CDC), 2010) in 2010, changes in disease-causing serotypes included a significant reduction in PCV13 serotypes 1, 6A, 7F and 19A, followed by a substantial increase in non-vaccine serotypes 8, 12F, 15A, 15B/C, 22F, 23B and 24F observed in 2013/14 in children under 5 years of age, when compared with 2012/13 (Waight et al., 2015) (Fig. 1). Surveillance studies from other countries revealed a similar pattern of PCV13serotype reduction and non-vaccine serotype increase in IPD cases. These studies underline the need for continuous surveillance to monitor the emergence of serotypes due to the clonal expansion of non-vaccine serotypes (Richter et al., 2013;Regev-Yochay et al., 2015). . Serotypes with less than 10 isolates in all periods were removed.
The Quellung reaction, in conjunction with the antisera from SSI, is currently considered the gold standard for pneumococcal capsular typing and can identify all 92 serotypes (Austrian, 1976;Habib, Porter & Satzke, 2014). Agglutination tests were developed later to reduce the time required to serotype a batch of isolates by eliminating the need for microscopy. In these tests SSI antisera reacts with the pneumococcal capsule and produces visible clumping (Kronvall, 1973;Porter, Ortika & Satzke, 2014). Based on the SSI serotype scheme, there are 25 distinct serotypes and 21 serogroups, each consisting of between two and five immunologically-related serotypes, accounting for a further 67 serotypes (Statens Serum Institut, 2013). Despite the wide use of the SSI sera, the serological methods can be difficult to interpret since, in certain cases, serotypes within a serogroup can cross-react with some of the antisera making it impossible to resolve to serotype level. In these cases, the serotypes are reported to serogroup level (Jauneikaite et al., 2015). Bentley et al. (2006) reported the sequences of the capsular biosynthetic loci of 90 pneumococcal serotypes (serotypes 6C and 6D were not included) and linked the capsular gene sequences to known structural and immunological patterns, thus paving the way for molecular capsular typing. Since then, new molecular serotyping methods have been described (microarray;Turner et al., 2011), multiplex PCR (Brito, Ramirez & Lencastre, 2003), in silico inference from WGS data (Croucher et al., 2011;Everett et al., 2012;Metcalf et al., 2016)), utilising the molecular differences between the capsular loci to determine serotype (Jauneikaite et al., 2015). A review of methodologies for capsular typing of S. pneumoniae including serological assays, semi-automated molecular tests, and one directly from WGS data (Liyanapathirana et al., 2014) was published by Jauneikaite et al. (2015). These molecular/WGS assays have some advantages compared to the serological methods described above: easier interpretation, multiplex capability and in some cases no requirement for culture. However, they still have some limitations, especially if they were to be used for surveillance. Firstly, due to the genetic similarities of certain serotypes, not all serogroups can be resolved. Secondly, most assays described to date only include the most common serotypes so currently not all serotypes can be detected (Magomani et al., 2014). Here we describe genetic capsular locus differences for all but four serotypes and demonstrate the use of these differences in predicting capsular type from WGS data using an automated bioinformatics tool that can be incorporated into routine workflows.

Isolate selection
Reference strains (n = 91) for all serotypes (excluding 6D) were acquired from Statens Serum Institut. Reference strain for 6D was kindly provided by The National Institute for Health and Welfare (THL), Finland. A total of 926 clinical isolates were selected from the archives of the Public Health England (PHE) National Reference Lab as a test cohort; for serotypes found to belong to a genogroup, at least 10 isolates were selected where available. Post genomic-sequence data cleansing (to remove repeat isolates from the same patient, mixed cultures, other species and MLST partial profiles or failures) resulted in 871 isolates (Development Set in Table 1). In addition, 2079 prospective or research-related isolates were sequenced as part of the UK validation cohort. This cohort covers 72 of the commonly circulating serotypes (including all vaccine serotypes), and includes prospective isolates received by PHE during 2015, isolates selected as part of research projects and epidemiological investigations (15A (n = 196) and 19A (n = 249), respectively) and archived isolates for rarer serotypes. Post genomic-sequence data cleansing of this dataset resulted in a total of 2065 isolates (Validation Set in Table 1). All isolates were serotyped on receipt as part of the PHE enhanced surveillance programme using slide agglutination with Statens Serum Institut typing sera.
Genomic data for non-UK isolates were obtained from Streptococcus pneumoniae isolate database hosted in BIGSdb (http://pubmlst.org/software/database/bigsdb/) (Jolley & Maiden, 2010) and the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena). Specifically, three collections were used; a set of 2531 isolates from Thailand initially described by Chewapreecha et al. (2014), an Icelandic panel of 252 serogroup 6 isolates described in Van Tonder et al. (2015) and a USA panel of 181 invasive isolates available in ENA as study SRP059723.

DNA extraction and sequencing
Isolates were grown overnight on horse blood agar (PHE Media Services) with 5% CO 2 . DNA was extracted from an entire plate of growth for each isolate using the QIAsymphony SP automated instrument (Qiagen) and QIAsymphony DSP DNA Mini Kit, using the manufacturer's recommended tissue extraction protocol for Gram negative bacteria (including a 1 h pre-incubation with proteinase K in ATL buffer and RNAse A treatment). DNA concentrations were measured using the Quant-iT dsDNA Broad-Range Assay Kit (Life Technologies, Paisley, UK) and GloMax R 96 Microplate Luminometer (Promega, Southampton, UK). DNA was sent for whole genome sequencing (WGS) by Illumina sequencing using the PHE Genomic Services and Development Unit (Colindale, London,  et al., 2006). For serotypes 6C and 6D accession numbers JF911515.1 and HV580364.1 were used, respectively. Capsular locus sequences for 6E and 23B1, a new 23B genetic subtype identified during the course of this project, were derived in-house using the assembly-based approach described below. Isolates for both types (6E n = 12; 23B1 n = 27) were assembled using SPAdes (Bankevich et al., 2012) and the capsular locus sequences were extracted using the sequence for 6B and 23B, respectively. The extracted sequences were aligned using ClustalW (Thompson, Higgins & Gibson, 1994) and the alignment was visualized with MEGA6 (Tamura et al., 2013).

Capsular typing tool implementation
PneumoCaT (Pneumococcal Capsule Typing; https://github.com/phe-bioinformatics/ PneumoCaT), written in Python (version 2.7.6), utilises a two-step method to assign capsular type (Fig. 2). In the first step, reads from each readset are mapped to a multi-fasta file containing 94 capsular locus sequences (92 serologically distinct serotypes + 2 molecular types/subtypes -6E and 23B1) using bowtie2 (version 2.1.0; following options used: -fr -minins 300 -maxins 1100 -k 99999 -D 20 -R 3 -N 0 -L 20 -I S,1,0.50) (Langmead & Salzberg, 2012). Thresholds of coverage of >90% of the length of the sequence, minimum depth of 5 reads per bp and a mean depth of >20 reads over the entire length of the sequence are implemented during this step. This step is considered successful if one or more capsular locus sequences are returned. If a match to a single capsular locus is returned and the capsular type predicted does not belong to one of our defined genogroups (Table 2), then the software terminates here and reports this as the predicted capsular type (Fig. 3). If the match belongs to a genogroup or more than one locus is matched then the software moves to the second step; a variant-based approach. If multiple serotypes are matched in step 1 and they do not correspond to the same genogroup then a 'Mixed sample' flag is called. Finally, if no serotypes are called due to low coverage (<90%) then a 'Failed' flag is called.  The capsular types matched in step 1 (genogroup) are fed into step 2 and the relevant information for this specific genogroup is retrieved, if genogroup information is available. This information is organised in a multi-fasta reference file, with complete gene sequences for all genes that display genetic diversity within the selected genogroup, and a database containing all sequence variants pertaining to the same genogroup. Paired reads are then mapped to the loaded reference file using bowtie2 and the tool tests for the variants loaded from the capsular type variant (CTV) database. Specific functions were designed to detect the different types of variants, such as presence/absence of genes, different alleles, single nucleotide polymorphisms (SNPs) and loss-of-function mutations. Base calling and insertion/deletion detection were made with SAMtools mpileup (version 0.1.19; following options used: -B -A) , requiring a minimum-read depth of 5 and using reads with minimum base mapping quality of 30. These base calls were only included in the downstream analysis if they constituted >80% of read depth (ie, a nonmixed base call). A capsular type is assigned if a 100% match to all variants tested is achieved, else a 'Failed' flag is returned. In the case of serogroups 24 and 32, where no distinguishing variants are available, the software returns the serogroup.

Phylogenetic tree generation
Reads were mapped to selected reference sequence using bwa (version 0.7.12) . Variants were called using GATK 2.6.5 (McKenna et al., 2010). Variants were then parsed to retain high quality SNPs based on the following conditions: depth of coverage (DP) ≥ 5, AD ratio (ratio between variant base and alternative bases) ≥ 0.8, Mapping Quality (MQ) ≥ 30, ratio of reads with MQ0 to total number of reads ≤ 0.05. All positions that fulfilled the filtering criteria in >0.9 of the samples were joined to produce a multiple fasta format file where the sequence for each strain consists of the concatenated variants. This file was used as an input to generate a maximum likelihood (ML) tree using RAxML (Stamatakis, 2014) with the following parameters -m (substitutionModel) ASC_GTRCAT -asc_corr=lewis (Paul Lewis's model correction for ascertainment bias when analysing variant-only data) -b (bootstrapRandomNumberSeed) 12345 -# (numberOfRuns) 1000.

Genogroups
Genetic clustering of S. pneumoniae strains based on the capsular locus sequence was previously described by Mavroidi et al. (2007). Based on their analysis, 88 pneumococcal serotypes were grouped in eight major clusters and 20 subclusters. All but six serogroups fell within the same genetic subcluster as previously described  and 9 of the subclusters contained serotypes from different serogroups. In this study, we define genogroups as groups of serotypes with high (>90%) sequence identity within the capsular locus. Based on these criteria, 21 genogroups were identified by comparing published capsular locus sequences (Bentley et al., 2006). All genogroups contain serotypes grouped within the same genetic subcluster as defined by Mavroidi et al. (2007) (Table 2); 11 of the genogroups correspond to serogroups as defined by the SSI sera, six of which also form genetic subclusters described by Mavroidi et al. (2007). Of the remaining 10 genogroups, six are identical to previously described subclusters and three correspond to incomplete serogroups. For example, serogroup 10 assigned previously to a single genetic subcluster is split into two genogroups, 10A-10B and 10C-10F. Similarly, serogroup 19 was assigned to a single subcluster, whereas in our analysis only serotypes 19B and 19C form a genogroup. The remaining two serotypes are too genetically dissimilar to be included; 19A shares only 50% of the capsular locus with 19F, its closest genetic neighbour within the serogroup, whereas 19F shares 65% with 19B. The last genogroup is 33A, 33F and 37; serotypes 33A and 33F belong to a genetic subcluster; serotype 37 was not included in the previous analysis since it does not use the Wzx/Wzy-dependent pathway. However, previous analysis revealed that serotype 37 isolates use the tts synthase gene which is located elsewhere in the genome to define serotype, but still contain a defective 33F-like capsular locus (Llull, Munoz & Lopez, 1999).

Serotype-defining sequence variants
Serotype-defining variants were detected following gene-based alignment of published capsular locus sequences (Bentley et al., 2006). Sequences were extracted in GenBank format and for each genogroup, gene profiling and sequence variant analysis was performed to determine variant profiles that could predict to serotype level. Types of sequence variation investigated include presence/absence of capsular biosynthesis-related genes, detection of inactivating mutations such as early stop codons and frameshifts (insertions, deletions, truncations), detection of differing alleles of genes that can be distinguished by coverage alone due to low sequence similarity (<80%) and SNP differences that can be detected by variant calling. Reported variants were then evaluated with our test cohort (n = 871; Table  2) and variants with 100% incidence rate (found in all isolates of serotype tested) were incorporated into capsular type variant (CTV) database (Table 3). Many of the variants detected during this study correspond to structural and genetic differences reported previously (Kolkman, Van der Zeijst & Nuijten, 1998;Bentley et al., 2006;Mavroidi et al., 2007;Song, Baek & Ko, 2011;Yang et al., 2011;Oliver et al., 2013b).
In two cases of previously reported frameshift mutations (7A and 15C), the mutations were confirmed in our test cohort, but no consensus of the inactivated sequence was detected (20-80% of reads carry frameshift mutation). In 7As, an insertion of a thymine (T) at residue 587 of wcwD inactivates the glycosyltransferase (Table 3) leading to loss of the side branch D-Galp-(β1 − 2)-α-D-Galp . During evaluation with a cohort of 34 isolates (7A, n = 9; 7F, n = 25- Table 2), all 7F isolates carried a wild type (wt) wcwD gene sequence. In comparison, 7/9 of the 7A isolates also carried a wt wcwD sequence, whereas 2/9 (including the SSI 7A type strain) exhibited a mixed profile at residue 587, with Table 3 Genetic differentiation of serotypes within serogroups. Description of variants present in the CTV database.

7A/7F
Frameshift mutation insT 587 in 7A wcwD gene Loss of function of glycosyltransferase leading to loss of side branch for 7A   (Table S2) Amino acid changes-wcjA and wcjB encode for glycosyltranferases (GT) and changes in these are consistent with presence of glucose in 9N instead of galactose present in residue 3 of the polysaccharide repeat unit of the other three serotypes Genes wcwC and wcjE are present in 11A, 11D and 11F whereas gene wcwR is present in 11B and 11C  wcwC, wcjE and wcwR are acetyltransferase genesdifferences in acetylation Presence of Gro-1P correlates with an intact gct gene in types 11A and 11C; gct is frameshifted in types 11F and 11B, and Rib-ol is present in the CPS instead of Gro (Mavroidi et al., 2007) 11 11A/11D/11F wcrL pos 334: codon AAT (Asn) in 11A; codon ACT (Ser) in 11D (Oliver et al., 2013) and codon GCT (Ala) in 11F wcrL encodes for a GT-donor sugar for WcrL is GlcpNAc in types 11F, 11B, and 11C but Glcp in type 11A     (Table S4) wcyV , wcyD and wcyC encode for GTs -no structures available for 25A, 25F or 38 28 28A/28F SNPs in wciU (Table S5) wciU encodes for a GT-no structures available (continued on next page)   wcrK encodes for a GT-consistent with differences in structure 41 41A/41F Frameshift mutation delG 23 in 41A wcrX  wcrX encodes for a acetyltranferase-differences in acetylation 47 47A, 47F Mapping only only a percentage of the reads carrying the insertion (60-70%). All 7A isolates were retested in the laboratory using the SSI sera and only the two isolates with the mixed profile were re-typed as 7A whereas the 7/9 with a wt wcwD were typed as 7F, suggesting that this mixed profile is indicative of 7As and can still lead to phenotypic changes. In 15Cs, differences in the length of the TA tandem repeat region of the wciZ gene lead to loss of function for the O-acetyltransferase in 15C and are responsible for the differences in acetylation between 15B and 15C isolates (Bentley et al., 2006). However, unlike the 7A/7F scenario where the two types can be easily distinguished, distinguishing 15B and 15C isolates on the bench using SSI sera in the slide agglutination method is extremely challenging. Reversible switching between the 15B and 15C serotypes was previously described in the laboratory and in natural infection (Venkateswaran, Stanton & Austrian, 1983;Van Selm et al., 2003). Thus, in practice it may not be possible to clearly distinguish between these two serotypes in the laboratory using serological methods and for the reasons above such isolates appear as 15B/C in many publications (Laufer et al., 2010). Therefore, it was unsurprising that the capsular typing tool had problems matching the serological results where isolates had previously been reported as either 15B or 15C prior to a change to the reporting; the previous result mostly likely due to the dominance of one type over the other in the mix. During evaluation with 30 isolates (15B, n = 16; 15C n = 14), we encountered 15B and mixed 15B/C isolates but no pure 15C isolates. A genetically pure 15C was found later during validation but a different frameshift mutation was responsible for the inactivation of wciZ. Overall 11/16 15B isolates were correctly predicted as 15B and 5/16 were mixed 15B/C whereas 10/14 15C isolates were predicted as 15B and 4/14 as 15B/C. Since any 15B-15C-15B/C discordances cannot be resolved in the laboratory the 15B-15B/C-15C isolates will always be reported as 15B/C.

Molecular type 6E
Serotype 6E is not a serologically distinct serotype, but has been shown to be a distinct molecular subtype (Ko, Baek & Song, 2013). Due to the low genetic similarity between the 6E capsular locus sequence and the other serogroup 6 capsular locus sequences, PneumoCaT was unable to determine a serotype for isolates with 6E capsular locus in the first step of the capsular typing tool due to coverage <90% (n = 12; 10 6B and 2 6A). In order to take account of this variation, the genomic sequences of these 12 isolates were used to determine a consensus 6E capsular locus sequence and this was introduced into the capsular locus typing tool. With this in place the tool can determine whether an isolate has a 6E molecular type and using the variant profiles described above can then determine the serological type (i.e., 6E(6B)).

Novel serogroup 7 subtype
Genogroup 7B, 7C and 40 can be distinguished using a panel of 9 SNPs (Table S3) all found within glycosyltransferase gene wcwK. Nucleotide differences at residues 46 and 385 are considered as more significant, as both result in distinct amino acids in the three serotypes. During evaluation using a cohort of 38 isolates (7B, n = 15; 7C, n = 21; 40, n = 2- Table 1 -Development Set), one isolate originally typed as serotype 7C in the laboratory, exhibited a novel codon at residue 385 which encoded for a distinct amino acid (CTT (Leu) compared to ACT (Thr) for serotype 40, TTT (Phe) for 7B and TGT (Cys) for 7C). The rest of the SNP profile matched the 7B profile. Further investigation revealed a distinct combination of factor sera reactions (Table 4) for this isolate suggesting that it might correspond to a novel serogroup 7 serotype.

Serogroup 22
Biochemical or genetic differences between serotypes 22A and 22F have not been previously described. Initial genetic analysis revealed 99% identity over 100% of the capsular locus sequences of the two types (22A: CR931681; 22F: CR931682) suggesting a possible error in submission or serotyping. To investigate this, genomic data from the two SSI type strains (SSI-22A, SSI-22F) were assembled and using the reference sequence for 22A, the capsular locus sequences were extracted. The extracted sequences were compared to the reference sequences and the capsular locus sequence of SSI-22A was identical to the two reference sequences whereas the SSI-22F had 98% identity over 91% of the capsular locus of all three of the other sequences. Specifically, the region covering genes wcwA and wcwC shares no similarity between SSI-22F and 22A. The SSI-22F capsular locus sequence was submitted to ENA as assembly under PHE Pathogens BioProject PRJEB14267 (Accession number LT594600.1).

Novel molecular subtype 23B1
Serotypes 23A, 23B and 23F have distinct capsular locus sequences which allows for serotype prediction based on mapping coverage alone. A cohort of 72 isolates (23A, n = 25; 23B, n = 25; 23F, n = 22) was used to test the coverage-based approach. All isolates typed as 23A or 23F and 21/25 isolates typed as 23B were correctly predicted; however, 4 × 23B isolates gave a failed result in the capsular typing tool with highest coverage of approximately 70%. These isolates were re-tested in the laboratory using SSI sera and the previous results were confirmed. In order to investigate this further, more 23B isolates were sequenced and analysed using the capsular typing tool. In total, 24/46 23B isolates failed the pipeline with 70% coverage. Following further investigation, these isolates were identified as a novel molecular subtype (23B1) with ∼70% homology to the 23B capsular locus. The 23B1 capsular locus sequence was introduced into PneumoCaT assuring 100% concordance with serological type.

Serogroup 24
Genetic differences between serotypes in serogroup 24 have previously been described  and were also confirmed following genetic analysis of the reference sequences (Accession numbers CR931686, CR931687, CR931688 for 24A, B and F, respectively). Serotype 24A lacks wzy, a polymerase gene and rbsF, a putative ribofuranose biosynthetic gene. In addition, the gene sequence of the flippase gene wzx is unique to 24A as indicated by using a blast nucleotide query against the NCBI database. This could only be confirmed by the SSI type strain (SSI-24A) since no other 24A isolates were available in the PHE archive. According to Mavroidi et al. (2007) serotypes 24B and 24F are distinguished by loss of function of genes abp1 and rmlC in 24B. This was confirmed when comparing the reference sequences. However, following evaluation with a cohort of 25 isolates (6, 24B; 19, 24F) the frameshift in rmlC was not seen in any of the 24B isolates, including the SSI-24B type strain whereas the frameshift in abp1 was only seen in SSI-24B strain, but no other 24B strains. This suggested that either all 24B isolates are actually 24F or that the reference sequences are not representative of the circulating strains; the serological typing was repeated for all isolates and original serotypes were confirmed, therefore we surmised that the latter is more likely. Genomic SNP analysis was used to investigate this further by mapping to either (a) the 24F capsular locus sequence (CR931688) or (b) the whole genome of the non-capsular S. pneumoniae R6 strain (NC_003098). Subsequent variant calling and filtering produced 145 and 19,123 high quality SNPs, respectively. Evolutionary history was inferred using the Maximum-likelihood method and phylogenetic trees were drawn (Fig.  S1). No serotype specific clustering was observed in either of the trees suggesting that if there is a molecular basis for the observed serological differences, it occurs using a mechanism that is not vertically inherited e.g rapid strand-slippage variation in a microsatellite within a promoter region. Further investigation is underway but in the meantime, PneumoCaT can predict isolates to serogroup level and serological analysis can distinguish further to serotype level. Although genetic differences have been described above for differentiating 24A from 24B/24F, they are not currently implemented in PneumoCaT due to incomplete evaluation.

Serogroup 32
Structures for the capsular polysaccharide repeat unit of both serotype 32A and 32F are available and differences based on the structure have previously been described (Bentley et al., 2006;Mavroidi et al., 2007). However, when the two reference capsular locus sequences (32A: CR931696; 32F: CR931697) were aligned, 99% identity was observed over 99% of the length of the sequence. The only difference was a 5 bp gap at the intergenic region between wcrN and the HG272/3 pseudogene observed in the serotype 32 capsular locus. Gene-based analysis returned no nucleotide differences within coding regions. To investigate whether there was a submission error, as described for 22F, the capsular locus sequences (lacking the tnp regions at the beginning of the capsular locus) of the 32A and 32F SSI type strains were extracted from the assembled genomes of SSI-32A and SSI-32F strains and compared to the reference strains. Both type sequences showed 100% identity with the respective reference sequences over the length of the alignment. The serogroup 32 evaluation cohort only includes 5 isolates (32A, n = 2; 32F, n = 3- Table 1; Development Set) and following analysis with the capsular typing tool all isolates were assigned to serogroup 32. Further investigation into the BAM files revealed that the 5 bp gap was seen only in the SSI-32F type strain. More serogroup 32 isolates need to be analysed before any conclusions can be reached, but no serogroup 32 isolates have been observed in the UK since 2007. At present, the predicted serogroup 32 isolates can only be differentiated into serotypes using serological analyses.

Validation of the capsular typing tool
Following development, the PneumoCaT tool was evaluated using two panels: (a) 2046 UK isolates retrieved from PHE archive covering 72/92 serotypes, including all serotypes contained in commercial vaccines (Table 5; excluding non-typeables) and (b) genomic data from 2964 non-UK isolates from three distinct, publicly available datasets (Thailand; n = 2531-65 serotypes, USA; n = 181-serotype information not available-and Iceland; n = 252-serotypes 6A, 6B, 6C and 6E). Both panels were used to evaluate the typability of PneumoCaT based on the proportion of isolates that were assigned a type. In the UK cohort, serotype was assigned initially to 98.2% of the typeable isolates (2010/2046). Of the 36 failed isolates, 19 were called as mixed and following retesting using a culture from a single colony, 18/19 were resolved, raising the typability to 99.1%. In the non-UK cohort, a serotype was assigned to 99% of the isolates (2934/2964) and 8/20 failed ones were called as mixed, suggesting that if retesting was possible these could be resolved.
Concordance with currently used methodology (slide agglutination using SSI sera) was evaluated using the UK cohort for which further laboratory testing was possible. Concordance was estimated on the 2028 typeable isolates for which a serotype was called with both methods (2010 + 18 mixed resolved). Overall, 92.4% concordance was observed between the predicted serotypes and the serologically-derived serotypes (n = 1873). Discordance was observed in 155 isolates (Table 6) and in most cases (77%, n = 119), the   discordance involved serotypes using the same factor sera set for serotype allocation (i.e., 12B/12F, 7A/7F discordances) or serotypes using the same pool sera (i.e., serotype 31 and 7F both react with pool C serum) (Statens Serum Institut, 2013) (n = 22). Repeating the serological serotyping for these two sets resolved the initial discordance in 88.7% of the cases (125/141). This suggests that discordance in these cases may be attributed to difficulty in reading slide agglutination reactions during the serological analysis with the SSI sera (i.e., auto-agglutination with all group 12 factor sera in the case of 12B/12F typing). In one case, a novel serogroup 9 serotype was predicted based on both the SNP-pattern (Table S4) and sera reaction pattern that differed from the expected 9L pattern, showing a weak reaction with factor sera 9 g. In the remaining 14/155 cases, there are no shared sera between the serotypes predicted by WGS and the serologically-derived serotypes. These isolates were retested with both methods and all but one (unable to serotype due to autoagglutination) were concordant suggesting a possible laboratory sample translocation during archiving and/or retrieval of the isolate. Overall, following retesting, we observed a final concordance of 99.3% (2013/2028), with only 18 discordant. In some of these cases, serotyping could not resolve the discordance because auto-agglutination leads to inconclusive results (n = 10), whereas other cases (n = 8) exhibit a persistent discordance even after retesting with both methods (Table S9). In these cases, the capsular operon sequence was investigated further using an assembly and blast approach to query for both capsular operons. In all cases, the capsular operon sequence detected did not match the serologically derived serotype. However, it must be noted that for each of these serotypes, serotype was accurately predicted for the majority of cases and these discordant cases are outliers ( Table 5). The serogroup 6 isolates from the Icelandic dataset (n = 252) were originally used to demonstrate the distribution of the new 6E type and molecular typing was used to characterized all isolates (Van Tonder et al., 2015). Although these isolates were not available for further investigation, they were the only complete dataset with confirmed 6E isolates (molecular confirmation since serological confirmation is not possible), so they were used to demonstrate the ability of PneumoCaT to differentiate 6E from the other serogroup 6 types. Following analysis, 100% typability and concordance was observed (n = 252). Reproducibility of the WGS method was investigated using a subset of 292 UK isolates covering 82/94 serotypes that were randomly selected for inclusion in the reproducibility study. For this study each isolate was cultured twice and DNA extracted and sequenced at different times. Concordance was seen in all cases.

Non typeable isolates
Virulence of S. pneumoniae is classically associated with the capsule, however some pathogenic strains, often associated with non-invasive sites (i.e., respiratory isolates), are serologically non-typeable (NT) (Park et al., 2012). Nineteen NT strains were selected from the PHE archive and analysed using WGS and PneumoCaT to predict serotype. In 13/19 cases PneumoCaT failed to predict serotype with a highest coverage <50% (Fig. 4 and Table 5). This is consistent with lack of a capsular operon and these isolates would be considered non-typeable. In addition, three isolates that originally had been typed in the lab (19A and 2 × 38) but failed with <50% coverage during PneumoCaT analysis were retested and found non-typeable in the lab, suggesting that this cut off could be used an indicative of non-typeable isolates following further validation. In the remaining 6/19 cases a serotype was predicted. In three of the cases, retesting with SSI sera gave the same serotype as predicted by PneumoCaT. In three other cases, repeat serotyping by slide agglutination consistently failed to give an acceptable result. In all of these cases, the colonies had an abnormal morphology (i.e., rough).

DISCUSSION
Molecular capsular typing methods, such as multiplex PCR, have been readily used in reference laboratories across the globe in place of the gold standard Quellung method (''WHO, 2011-Laboratory methods for the diagnosis of meningitis caused by Neisseria meningitidis, Streptococcus pneumoniae, and Haemophilus influenzae''; European Centre for Diseases Prevention and Control (ECDC), 2013). The lower cost and increased turnaround times associated with such methods have made them the preferred serogrouping/typing method for a number of laboratories. Many conventional and real-time PCR schemes cover the vaccine-related and most commonly observed serotypes (Jauneikaite et al., 2015). However, these methods do not cover all serotypes and many types cannot be distinguished from others with genetically closely related capsular operons. In this study, we present serotype-specific variant profiles that can be used to distinguish 87/92 serologicallydistinct serotypes and 2 molecular subtypes, and PneumoCaT, a bioinformatics tool that uses the CTV (Capsular Typing Variant) database to predict capsular type from WGS data.
A gene-based approach was used to identify variants that allow differentiation between closely-related serotypes. During its development stage, a test cohort of 871 clinical isolates, including the SSI-type strains, was used to evaluate the relevance of these variants in the population and establish the CTV database. In many cases, these sequence variants were previously published (e.g., serogroup 6) (Song, Baek & Ko, 2011), whereas in other cases structural differences have been observed but no molecular information was provided (e.g., serogroup 9). One of the main advantages of this approach is the ability to detect molecular subtypes (and potential new phenotypes) even if no serological differences are observed using standard typing sera. For example, isolates from serogroups 6 and 23 initially failed the development pipeline due to low capsular locus coverage. Further investigation into the reason for the failures was able to identify serotype 6E, a previously described molecular type (Ko, Baek & Song, 2013) and a novel 23B molecular subtype. Another example is serotype 19A, where two subgroups were identified based on the mapping coverage of the 19A capsular locus. In this case, all isolates were still predicted as 19A by PneumoCaT, since more than 90% coverage was observed for both subgroups, but gene profile analysis revealed variability in the presence of cps19aO gene and further functional and epidemiological analysis is currently underway to investigate whether this has any phenotypic relevance. In terms of surveillance, the presence of these new molecular subtypes could indicate introduction of new clones that may influence new pneumococcal vaccine development. In some rare cases, a novel molecular type can also have corresponding serological differences; for example, during the test phase of the variant profiles for genogroup 7B-7C-40, an isolate exhibited a distinct codon at position 385 and a distinct factor sera reaction pattern (Table  4) upon retesting in our laboratory. Following further investigation by SSI this isolate has now been confirmed as a novel serotype (authors' unpublished data). In addition, during the validation phase of this study, SNP analysis revealed a novel serogroup 9 type with a distinct sera reaction pattern, but this has not been confirmed.
Following the development stage, PneumoCaT was evaluated for typability, accuracy (concordance with serological results) and reproducibility. Typability was evaluated as the percentage of isolates for which a serotype was assigned by PneumoCaT, using two panels of isolates (UK, n = 2,046; non-UK, n = 2,964), that combined, cover 82/94 capsular types, including clones that cover 4 different countries (UK, USA, Iceland and Thailand). Using both panels, a combined typability of 99% (99.1% for UK and 99% for non-UK) was achieved, demonstrating that PneumoCaT can be used to analyse isolates from diverse geographical lineages. The panel of 2,028 typeable UK isolates (encompassing 72 serotypes) was then used to evaluate the accuracy of the tool. When discordances between phenotypic serotyping results and PneumoCaT serotype were investigated, the majority (119/155) were shown to be due to problems with determining serotype within serogroup, when using the factor sera in the slide agglutination method. Particular serogroups such as 12, 33 and 35 proved especially problematic (Table 5). This could be due to the way the results are interpreted for the factor sera reactions when used in slide agglutination. For example, in 24/69 Serogroup 12 cases a 12B/12F discordance was observed; 18 serotype 12B isolates were assigned 12F capsular type by PneumoCaT whereas 6 12F isolates were assigned 12B. Based on the SSI typing scheme a reaction with all three serogroup 12 factor sera (12b, 12c, 12e) is required to identify 12B, whereas a reaction with factor serum 12b only is indicative of 12F isolates. Unfortunately, some factor sera agglutination reactions can be weaker than others. Therefore, a weak reaction in all factor sera could be mistaken for auto-agglutination or vice versa, leading to incorrect scoring. In most discordant cases, repeating the phenotypic testing resolved the discordance. In some cases (n = 10) a weak reaction or auto-agglutination made interpretation of the serotyping result difficult, thus the serotype could not be confidently called using the slide agglutination method with the standard SSI sera. The standard factor sera obtained from Statens Serum Institut are optimised for use in the Quellung reaction rather than slide agglutination and therefore, it is highly likely that if the serotyping was performed using the gold-standard Quellung method by technicians with the necessary experience, this could resolve the discrepancies in many cases and increase the accuracy of the serotyping. The use of latex absorbed sera specifically designed for agglutination reactions is also expected to improve the accuracy of the serotyping results for the more difficult to distinguish serotypes. Late in this study, a limited number of latex-absorbed subtyping sera were kindly provided by SSI for use in agglutination reactions, and were used to re-serotype all discordant serogroup 12 isolates. The latex sera was able to resolve 22/24 discordances and in 2/24 cases a serotype could not be confidently called due to autoagglutination.
In addition, some of the observed original discordances clearly came from simple record keeping errors due to the nomenclature of the factor sera being very similar to the nomenclature for the serotypes themselves, such as serotype 15A which is reacts with 15c factor sera, which accounted for several discordant cases where 15A isolates had been originally recorded as 15C due to 15c factor positive result. Similar discordances were also observed in serogroup 6 where 6A reacts with 6b factor sera, 6B with 6c and 6C with 6d factor sera resulting in mis-reporting. This kind of error could be avoided by use of automated reporting mechanisms which take electronically derived results (such as the PneumoCaT output) and enter them directly into laboratory information systems, avoiding manual transcription.
Where occasional discordance persisted even after retesting, it involved pairs of serotypes within the same serogroup or with known cross-reactions (Table S9). In these persistent cases, further molecular investigation is underway to determine whether any regions outside of the capsular operon might be involved in this discordance between phenotype and serotype. However, a repeated serotyping error with these particular isolates cannot be excluded since other examples of the same serotypes have concordance between WGS and serology. For example, one of the isolates is serologically 29 where PneumoCaT predicted 35B. The two capsular operon sequences share approximately 70% of the capsular locus sequence with 87% similarity and can be distinguished serologically based on the reaction of a single factor sera (i.e., 35a gives positive reaction if 35B and negative if 29, and there are known cross-reactions between 29 and some of the factor sera for group 35). The presence of a complete serotype 35B capsular operon in this discrepant isolate was confirmed by assembly as well as mapping and following analysis of the capsular locus sequence no differences were found between this isolate and other 35Bs, including two that were originally mistyped as 29. At this point, further analysis for external factors is underway.
Three of the persistent discordances involved serotype 18C; 2 assigned 18B and one 18A serotype by PneumoCaT. Following retesting by both WGS and sera; 2 persisted with 18C/18B discordance, whereas the third isolate was re-serotyped as 18B and retyped as 18A by WGS. The 18C/B discordance is not unexpected as both serotypes react with 18e factor sera whereas 18c sera reacts only with 18C serotypes, suggesting that a weak reaction or autoagglutination could lead to false prediction. However, 18B/A discordance is unexpected as the serotypes react with two different factor sera (18d for 18A and 18e for 18B). Interestingly, the molecular difference distinguishing 18A and 18B is the presence of a non-functional wciX gene in 18B; this difference should have no impact on phenotype, but has successfully differentiated 19 18A/B isolates (18A, n = 11 and 18B, n = 8). We further examined the capsular sequence to determine additional differences that could impact the phenotype. Specifically, we looked into glf gene, which based on previous studies (Bentley et al., 2006) was present as a pseudogene in 18B, 18C, and 18F. However, when these gene sequences were translated it was evident that a large part of the UDP-galactopyranase mutase domain was still encoded; specifically within the translated glf sequence in 18B and 18C a 167 amino acid ORF was present and a 204 amino acid ORF in 18F. Both ORFs matched to published UDP-pyranase mutases when compared to the NCBI protein database using BLAST. The genomic reads of the 19 18A/B and the discordant 18B/A isolate were mapped to the 502 nucleotide coding region of 18B glf and as expected glf was present in all 18Bs and none of the 18As. However, glf was also found in the 18B/A isolate which also lacks wciX, a genetic marker for 18A. This finding can potentially explain the discordance between phenotype and genotype and incorporation of the glf marker in CTV will prevent any future 18A/B discordances and instead flag them for further investigation.
Another of the discordant isolates involved a rare serotype (41A/41F) and was difficult to resolve due to the lack of representative isolates of serogroup 41. If implemented into the routine laboratory, isolates of these rare serotypes would be subjected to traditional serotyping in addition to WGS analysis until the pipeline could be further developed or confidence in the results was assured.
The high typability, concordance and reproducibility rates (99.1, 99.2 and 100%, respectively) observed with PneumoCaT suggest the method is highly robust and reliable, with less subjectivity than traditional serotyping and full traceability of results. The sensitivity of the method for serotype mixtures is greater than that of the standard method which means that careful preparation and handling of the isolates for DNA extraction and WGS is necessary to avoid cross-contamination, but also means that potentially genuinely mixed serotype cultures can be recognised. For example, an isolate was initially typed as 19A and predicted as mixed 19A/3 with WGS when retrieved from the archives for testing. It was found to have colonies with two different morphologies; one very mucoid and growing over a smaller non-mucoid colony form. The mucoid colony was sampled and following retesting serotype 3 was assigned using both traditional serotyping and PneumoCaT. Unfortunately, the smaller colonies could not be sampled in a pure form but were presumably the original serotype 19A, given the PneumoCaT result.
A further advantage of the PneumoCaT method is the extra information obtained in terms of the percentage mapping to the capsular reference and SNPs identified, which could lead to the discovery of novel serotypes. Trends in the mapping percentage (6E and 23B1) and differences in SNP patterns for those entering the CTV detection pathway could be recorded (novel serogroup 7 and serogroup 9 subtypes). These novel types may be missed by standard serotyping methods due to the way the typing sera are cross-absorbed to avoid reactions with known types; this, however, does not avoid cross-reactions with novel types e.g., 6C which initially cross reacted with 6A serotyping factor sera prior to production of specific factor sera for this type (Lambertsen & Kerrn, 2010).
A small number of isolates (n = 11) failed during the first stage (coverage-based) of the capsular typing pipeline and, upon further examination, repeat serotyping was also unable to confidently call a type in five cases suggesting that they might be true acapsular, nontypeable isolates. In 2/5 cases, % coverage was between 75-90% suggesting that a capsular operon was present but not expressed. The remaining six isolates fall just short of the 90% coverage threshold (Fig. 4), but the top hit corresponded to the expected serotype/serogroup suggesting that a capsule could still be expressed (Table S10).
Non-typeable isolates, lacking a functional capsular operon, also get a 'Failed' tag by PneumoCaT but can easily be distinguished, based on coverage levels, from unusual isolates where serotype prediction failed. Based on our analysis of 19 isolates, serologically non-typeable isolates have a highest coverage value of less than 50% whereas for unusual isolates whereas some reactions with serotyping sera are seen when the coverage falls within 60-90%. With further data to confirm this observation, a 'Non-typeable' flag could be introduced to PneumoCaT (Fig. 4). During our analysis of these non-typeable isolates, in 32% of the cases a serotype was predicted by PneumoCaT, indicating a functional operon was present. Upon retesting, half of these gave the serotype predicted and half were still non-typeable, indicating possible external regulatory elements that inhibit expression of the capsule in these organisms.
In summary, PneumoCaT is a robust, accurate, sensitive and expandable tool that could revolutionise pneumococcal reference microbiology. This tool has the sensitivity to enable the recognition of mixed serotypes or new subtypes that could be masked by the use of other methods. The flexibility to introduce new types to the CTV database with relative ease means the system can evolve to suit future challenges. The CTV database itself has the potential to inform the further development of other molecular methods, for example PCR or microarray analysis for determination of capsular type and it could be particularly useful for developing non-culture typing schemes. PneumoCaT will enable more detailed surveillance of serotype drift and could be used to target further phenotypic analysis of potential new serotypes. The automated nature of the tool means that it can be incorporated into routine pipelines and results can be populated into laboratory information systems using custom scripts, thus avoiding some of the potential errors associated with manual result recording and entry and suiting a role in reference microbiology.