Kaptive Web: User-Friendly Capsule and Lipopolysaccharide Serotype Prediction for Klebsiella Genomes

ABSTRACT As whole-genome sequencing becomes an established component of the microbiologist's toolbox, it is imperative that researchers, clinical microbiologists, and public health professionals have access to genomic analysis tools for the rapid extraction of epidemiologically and clinically relevant information. For the Gram-negative hospital pathogens such as Klebsiella pneumoniae, initial efforts have focused on the detection and surveillance of antimicrobial resistance genes and clones. However, with the resurgence of interest in alternative infection control strategies targeting Klebsiella surface polysaccharides, the ability to extract information about these antigens is increasingly important. Here we present Kaptive Web, an online tool for the rapid typing of Klebsiella K and O loci, which encode the polysaccharide capsule and lipopolysaccharide O antigen, respectively. Kaptive Web enables users to upload and analyze genome assemblies in a web browser. The results can be downloaded in tabular format or explored in detail via the graphical interface, making it accessible for users at all levels of computational expertise. We demonstrate Kaptive Web's utility by analyzing >500 K. pneumoniae genomes. We identify extensive K and O locus diversity among 201 genomes belonging to the carbapenemase-associated clonal group 258 (25 K and 6 O loci). The characterization of a further 309 genomes indicated that such diversity is common among the multidrug-resistant clones and that these loci represent useful epidemiological markers for strain subtyping. These findings reinforce the need for rapid, reliable, and accessible typing methods such as Kaptive Web. Kaptive Web is available for use at http://kaptive.holtlab.net/, and the source code is available at https://github.com/kelwyres/Kaptive-Web.

Whole-genome sequencing (WGS) represents a powerful tool for the characterization and public health surveillance of bacterial pathogens. This technology is now routinely used by a number of public health laboratories (1-3), and there is increasing interest in its use in clinical labs (4)(5)(6). While there are several well-developed protocols which use WGS data for the determination of multilocus sequence types (STs), resistance gene profiling, and phylogenetic investigations (7)(8)(9), there remain gaps in the repertoire; e.g., the characterization of species-specific antigens is currently restricted to a small number of species (10)(11)(12). Furthermore, many WGS analyses rely on software via a command line interface that requires bioinformatics skills to install, execute, and interpret, thereby limiting their accessibility. Instead, we need tools that can extract information and present it in an easily interpretable manner to bioinformaticians, public health professionals, and clinicians alike (5,7,8).
Klebsiella pneumoniae is a major cause of health care-associated infections with high rates of multidrug resistance (MDR) (13). In particular, the emergence and global dissemination of extended-spectrum beta-lactamase (ESBL) and carbapenemase-producing (CP) clones are major concerns and have led to the recognition of K. pneumoniae as an urgent public health threat (14,15). With the lack of new antimicrobial therapies, there has been a resurgence of interest in alternative strategies, such as phage therapy (16)(17)(18)(19), monoclonal antibody therapy (20)(21)(22)(23), and vaccination (24)(25)(26). Several therapeutic targets have been suggested, and the polysaccharide capsule (K antigen) and lipopolysaccharide (O antigen) are among the most frequent. Both are also considered key virulence determinants that are necessary to establish infection, primarily owing to their serum resistance and antiphagocytic properties (27)(28)(29)(30)(31). Of note, capsular serotypes vary substantially in the degrees of serum resistance they provide. For example, K1, K2, and K5 are highly serum resistant and are associated with hypervirulent strains that differ from classical K. pneumoniae in that they commonly cause community-acquired disease (32)(33)(34). Despite the importance of these loci, K. pneumoniae serotyping is not widely available, even in large central public health laboratories, and the most practical option for most laboratories is genotyping the loci involved in antigen biosynthesis via multiplex PCR (35,36) or WGS (37,38).
Lipopolysaccharide comprises three subunits: lipid A, the core oligosaccharide, and the O antigenic polysaccharide (39). With only one exception, the key determinants of the O antigenic polysaccharide are colocated at the O locus (previously known as the rfb locus) (36,40-42). While 10 serologically distinct O antigens have been recognized, many isolates are nontypeable (23,43), and investigations have identified 12 distinct O loci (25,36).
Interestingly, the O1 and O2 antigens, which are by far the most common (23,25,43), are both associated with the same two loci, O1/O2v1 and O1/O2v2 (25). The expression of the v1 locus results in the production of D-galactan I, characteristic of a subset of O2 antigens (41,44). The expression of the v2 locus results in the production of D-galactan III, associated with the remaining O2 subtypes (21,45). Regardless of the subtype, any O2 antigen can be converted to O1 by the addition of D-galactan II, which requires the products of wbbY and wbbZ that are located outside the O locus, i.e., elsewhere in the genome (44).
The Klebsiella polysaccharide capsule is produced through a Wzy-dependent process (46), for which the synthesis and export machinery are encoded in a single 10 to 30-kbp region of the genome known as the K locus (47,48). Seventy-seven distinct capsule phenotypes have been recognized by serological typing (49), but many isolates are serologically nontypeable. We recently explored the K loci among a large diverse K. pneumoniae WGS collection and were able to define 134 distinct loci on the basis of protein coding gene content, suggesting there are at least this many distinct capsule types circulating in the population (38).
Given the interest in targeting these diverse surface polysaccharides and the lack of accessible serotyping assays for K. pneumoniae, tools for WGS-based K and O locus typing will be essential for researchers, clinicians, and public health microbiologists. We previously developed Kaptive for K locus typing from WGS assemblies (38), which has become a key component of the Klebsiella bioinformatics tool kit (50)(51)(52), but it requires command-line skills and a degree of bioinformatics expertise to use. Here we present Kaptive Web, an easy-to-use web-based implementation of the Kaptive algorithm which has been extended to type both K and O loci. We demonstrate its utility (i) for the identification of K and O loci for serotype prediction and as epidemiological markers and (ii) to inform the design and implementation of control strategies targeting the capsules or lipopolysaccharides of K. pneumoniae.

MATERIALS AND METHODS
O locus definitions. Unlike K loci, which were defined on the basis of gene content, O loci have been defined by sequence identity in the conserved wzm and wzt genes (25). For example, two O loci can have the same genes, but modestly divergent sequences (>5% divergence). Kaptive is compatible with these definitions because it first chooses a best locus on the basis of a nucleotide search. Only then does it tally the gene content of the locus.
A complication comes from the fact that O antigens O1 and O2 are encoded by the same two O loci. It is the presence or absence of two other genes elsewhere in the genome, wbbY and wbbZ, which determines the specific antigen. When both genes are present, D-galactan II is produced, leading to the O1 antigen. When they are absent, the O2 antigen is the result. We have added the appropriate logic to Kaptive (both command-line Kaptive and Kaptive Web), so it will report the locus as O1 or O2 on the basis of the presence/absence of these genes as determined by a tBLASTn search with coverage and identity thresholds of 90% and 80%, respectively. If Kaptive finds only one of the two genes, it will report the locus as O1/O2.
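The O1/O2 decision described above can be sketched in a few lines of Python. This is a minimal illustration, not the Kaptive source code: the function names are hypothetical, and treating the 90% coverage and 80% identity thresholds as inclusive is an assumption.

```python
def gene_present(coverage: float, identity: float,
                 min_coverage: float = 90.0, min_identity: float = 80.0) -> bool:
    """Decide whether one tBLASTn hit counts as "gene found".

    Coverage and identity are percentages. Inclusive thresholds are an
    assumption made for this sketch.
    """
    return coverage >= min_coverage and identity >= min_identity


def o1_o2_label(has_wbbY: bool, has_wbbZ: bool) -> str:
    """Label a genome whose best-matching O locus is O1/O2v1 or O1/O2v2."""
    if has_wbbY and has_wbbZ:
        return "O1"       # both genes found: D-galactan II can be produced
    if not has_wbbY and not has_wbbZ:
        return "O2"       # neither gene found
    return "O1/O2"        # only one gene found: the call is left ambiguous
```

For example, `o1_o2_label(gene_present(98.0, 95.0), gene_present(40.0, 99.0))` gives "O1/O2", because the second hit fails the coverage threshold.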
Kaptive Web. Kaptive Web is available for use at http://kaptive.holtlab.net/. It was developed using the web2py framework (53). The source code for the web implementation is available on GitHub, so users can host their own copy of the software (https://github.com/kelwyres/Kaptive-Web). Kaptive Web automatically populates the "reference database" selection with the contents of command-line Kaptive's database directory, enabling automatic compatibility with any new locus databases for Klebsiella or other bacteria. Kaptive Web utilizes a 16-core 64-GB RAM server hosted by the Australian NeCTAR cloud. A single genome analysis requires approximately 4 min and 20 s to complete for the K and O locus databases, respectively. The server can run up to 15 analyses simultaneously, enabling large data sets to be processed relatively quickly.
Genome data for K and O locus characterization. Sequence read data for 309 K. pneumoniae organisms were obtained as part of the global diversity study (54), and 13 O3 antigen-producing isolates (20) were assembled de novo using Unicycler v0.4.1 (55). Genome assemblies were uploaded to Kaptive Web in a single compressed data directory and analyzed with the Klebsiella primary K locus and the Klebsiella O locus databases. The total Kaptive Web analysis times for the global data set were 52 min (K locus) and 12 min (O locus). The results were inspected via the Kaptive Web graphical interface and downloaded in tabular format (see Data Set S1 in the supplemental material).
The same protocol was used for characterization of 201 publicly available CG258 genome assemblies (see Data Set S3). These genomes were identified among the complete set of Klebsiella genomes (downloaded from GenBank on 12 October 2017) on the basis of ST information generated using Kleborate (https://github.com/katholt/Kleborate). STs 11, 258, 340, 395, 437, 512, 855, and 895 were included in the analyses.
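Uploading many assemblies at once requires a single compressed archive. The following sketch shows one way to build such an archive with the Python standard library. The function name, the list of FASTA file extensions, and the flat archive layout are assumptions made for illustration; they are not requirements stated by Kaptive Web.

```python
import tarfile
from pathlib import Path


def bundle_assemblies(assembly_dir, archive_path,
                      suffixes=(".fasta", ".fa", ".fna")):
    """Write every FASTA file found directly in assembly_dir into one tar.gz.

    Returns the list of file names that were added, in sorted order.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(Path(assembly_dir).iterdir()):
            if path.is_file() and path.suffix.lower() in suffixes:
                # arcname drops the parent directories so the archive is flat
                archive.add(str(path), arcname=path.name)
                added.append(path.name)
    return added
```

Calling `bundle_assemblies("assemblies", "assemblies.tar.gz")` would then produce one file suitable for a single upload.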

RESULTS AND DISCUSSION
Introducing Kaptive Web. Kaptive Web is a browser-based method for running Kaptive and visualizing the results. Users upload one or more assemblies and select their preferred typing database (Fig. 1A). There is no limit to the number of assemblies that can be uploaded for a single run, though multiple assemblies must be uploaded as a tar.gz or zip file. After the upload is complete, command-line Kaptive is automatically run on the remote server (Fig. 1B). Results appear in a table with one row per genome assembly showing key details: the best-matching locus from the reference database, the match confidence, nucleotide identity, and coverage compared to the reference (Fig. 1C). The rows are colored on the basis of the match confidence, with six possible levels.
(i) Perfect. The locus was found in a single piece (one alignment within a single contig) with 100% coverage and 100% nucleotide identity to the reference.
(ii) Very high. The locus was found in a single piece with ≥99% coverage and ≥95% nucleotide identity to the reference, with no truncated/missing genes and no extra genes compared to the reference.
(iii) High. The locus was found in a single piece or with ≥99% coverage, with ≤3 truncated/missing genes and no extra genes compared to the reference.
(iv) Good. The locus was found in a single piece or with ≥95% coverage, with ≤3 truncated/missing genes and ≤1 extra gene compared to the reference.
(v) Low. The locus was found in a single piece or with ≥90% coverage, with ≤3 truncated/missing genes and ≤2 extra genes compared to the reference.
(vi) None. Did not qualify for any of the above.
The top two confidence levels, "very high" and "perfect," require the locus to be found uninterrupted in a single contig with the expected gene content. If the locus has truncated/missing genes or was found in multiple discontiguous pieces, a lower confidence level will result. Insertion sequence (IS) integrations are one possible cause of truncated/missing genes. When an IS interrupts a locus gene, that gene's function is likely lost (38), and for short-read Illumina sequencing, IS integration also typically causes assembly fragmentation (56). However, truncated/missing genes and assembly fragmentation can also result from poor read coverage or indel sequencing errors, in which case, the gene is likely still functional. This is why Kaptive is somewhat tolerant of truncated/missing genes and discontiguity: a locus with these issues can still achieve a confidence of "high." Extra genes in a locus (genes that are not in the best-matching locus reference but are in another locus reference) are strongly indicative of a biological change and thus reduce confidence more than missing genes do: a locus found with an extra gene can achieve a confidence of no greater than "good." When a locus has genuinely different gene content relative to the reference, the resulting serotype may be affected, but the Kaptive match can still be useful for phylogenetic typing purposes.
Clicking on an assembly row expands the view to show more detail, including a diagram of the best-matching locus with genes colored by tBLASTn coverage and identity (Fig. 1C). Beneath the locus diagram are two expandable lists of additional K or O locus genes identified inside and outside the locus region of the query genome (genes that are not usually present in the reference locus). It is common to see matches to a small number of additional genes outside the locus region of the query due to sequence homology (see reference 38 for further details).
An additional gene within the locus region of the query genome may indicate that the genome has a novel locus type and will likely correspond to a large length discrepancy from the reference (shown on the right side of the display). In such a case, users may wish to perform further analyses outside Kaptive Web. To facilitate this, Kaptive Web lists the position of the locus in the query genome (shown on the left side of the display), along with a link that allows these assembly regions to be downloaded in FASTA format. For K locus typing, Kaptive Web will also report the alleles for the conserved wzc and wzi K locus genes, for compatibility with earlier schemes that focused on these genes (37,57,58). Figure 1C shows Kaptive Web K locus results for four isolates from a global Klebsiella diversity study data set (54), with various degrees of data quality. The assembly for strain Pus_15987 has a "perfect" match for KL1. Strain D-026-I-b-1 has a best matching locus of KL107, though poor assembly quality resulted in very low identity and coverage and, consequently, a confidence level of "none." Strain QMP_M1-200 has a "high" match for KL11. It contains the entire KL11 sequence, but with a moderate amount of divergence (92% nucleotide sequence identity). In most cases, minor nucleotide divergence likely does not affect the capsule phenotype. However, it should be noted that even a single nonsense or frameshift mutation can have important implications; e.g., the key distinction between the K22 and K37 capsules is not gene content but rather a nonsense mutation in the acetyltransferase gene (48). Kaptive Web clearly identifies potential nonsense or frameshift mutations by marking such loci with "missing genes." The results for strain AJ170 are expanded in Fig. 1C, showing the full Kaptive Web visualization. It has a very good coverage and identity match to KL38, yet the locus was not found in a single contiguous piece of the assembly. 
Kaptive was also unable to find a translated protein sequence for one of the KL38 genes, wckR, illustrated by the gray coloring in the locus diagram. In this instance, both issues (discontiguous locus sequence and missing gene) were caused by a break in the assembly, splitting the locus over two contigs. This may have resulted from poor read coverage, in which case wckR may have been intact and functional in the original isolate. Alternatively, the assembly break may have resulted from an insertion sequence interrupting wckR, in which case, gene function is likely lost; such interruptions have been characterized in a number of Klebsiella K loci (38). This uncertainty is why AJ170 only achieved a "good" confidence score for KL38.
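The six confidence levels can be read as a cascade of checks, tried from the strictest to the most lenient. The sketch below is an illustration under stated assumptions rather than the Kaptive source code: the function and parameter names are hypothetical, coverage and identity are percentages, and the "high" level is written as "single piece or ≥99% coverage," which follows the statement that a discontiguous locus can still reach "high."

```python
def match_confidence(single_piece: bool, coverage: float, identity: float,
                     missing_genes: int, extra_genes: int) -> str:
    """Return the first confidence level whose criteria are all satisfied."""
    if single_piece and coverage == 100.0 and identity == 100.0:
        return "Perfect"
    if (single_piece and coverage >= 99.0 and identity >= 95.0
            and missing_genes == 0 and extra_genes == 0):
        return "Very high"
    # Assumption: "single piece OR >=99% coverage" for this level.
    if (single_piece or coverage >= 99.0) and missing_genes <= 3 and extra_genes == 0:
        return "High"
    if (single_piece or coverage >= 95.0) and missing_genes <= 3 and extra_genes <= 1:
        return "Good"
    if (single_piece or coverage >= 90.0) and missing_genes <= 3 and extra_genes <= 2:
        return "Low"
    return "None"
```

Under these rules, a complete single-piece locus with 92% nucleotide identity and the expected gene content, as described for strain QMP_M1-200, falls through the "very high" check on identity and is reported as "High."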
O locus database and typing. The Kaptive algorithm was originally developed and validated for typing the K locus of K. pneumoniae (38), but it can in principle be used to type any variable locus that occurs no more than once per genome. In Kaptive Web, we apply it to O locus typing, which follows mostly the same procedure as K locus typing but with two unique aspects. First, serological types O1 and O2 are distinguished not by the O locus but by two genes, wbbY and wbbZ, elsewhere in the chromosome. Second, the O locus shared by the O1 and O2 serotypes comes in two varieties (v1 and v2), which are distinguishable using genomic data (they differ in terms of gene content) but are serologically cross-reactive. These aspects are incorporated into Kaptive as follows: (i) the relevant O locus variant is reported as v1 or v2, and (ii) an additional search is conducted for wbbY and wbbZ to decide whether the locus should be reported as O1 or O2. If only one of wbbY or wbbZ is found, Kaptive will give a label of O1/O2 (i.e., possibly either).
A recent study of O3 antigens identified several subtypes that can be distinguished serologically and genotypically (O3, O3a, and O3b [20]). The O3 and O3b loci correspond to the O3l and O3s loci previously described in Follador et al. (25) and are distinguished by divergent sequences of the wbdA and wbdD genes (25). The O3 and O3a loci are distinguished by a single point mutation in wbdA (C80R) (20). While Kaptive does not aim to distinguish antigen subtypes, the O3b subtype has sufficient nucleotide divergence to necessitate a separate reference sequence. Kaptive therefore designates O3 loci as either O3/O3a (covered by the same reference sequence) or O3b. As more phenotypic data for antigen subtypes become available, we will consider adding broader antigen subtyping capabilities to Kaptive.
We assessed the accuracy of Kaptive O locus typing by applying it to the Klebsiella global diversity study genomes (54), for which O locus types were previously inferred on the basis of nucleotide variation in the universally conserved wzm and wzt genes (25). Of the 309 WGS assemblies, the numbers which matched each of the confidence levels were as follows: 3, perfect; 212, very high; 28, high; 56, good; 3, low; and 7, none (see Data Set S1 in the supplemental material). The "low" and "none" calls were possibly due to low assembly quality: seven of these assemblies had the O locus split over multiple contigs and three had very low coverage. There was very good agreement between the O locus types defined previously (25) and the Kaptive results: only 8/309 assemblies had discrepancies. Of those, four were cases where the previous type was O1 and Kaptive assigned O1/O2 (i.e., it only found one of wbbY and wbbZ) and one was where the previous type was O1 and Kaptive assigned O2. The remaining three discrepancies were all between O3 and OL104, which are distinguished by their wbdD genes (25). Isolates AJ031 and D-026-I-b-1 were mistyped due to poor assembly; not all of the O locus was represented. Isolate U_13792_2 was typed as O3b by Kaptive but previously assigned OL104, and manual inspection of the predicted WbdD amino acid sequence suggested that this strain produces a hybrid WbdD protein. Given the more subtle distinctions between the O3/O3a, O3b, and OL104 loci, we further tested the accuracy of Kaptive's O locus typing using 13 additional O3 K. pneumoniae for which genome data and O antigen phenotypes were previously determined (20) and found Kaptive correctly typed all 13 genomes (see Data Set S2).
Application of Kaptive Web to track K and O locus diversity in MDR clones. The increasing rates of antimicrobial resistance, particularly against last-line drugs such as carbapenems, have led to a resurgence of interest in phage and monoclonal antibody therapies and vaccinations targeting K. pneumoniae (16)(17)(18)(19)(20)(21)(22)(23)(24)(25). There is particular interest in targeting the globally distributed MDR clones, including CG258 (21)(22)(23)26). ST258, the most well-known member of CG258, has rapidly become the most common cause of CP Klebsiella infections in the United States (59). Recent studies suggest low lipopolysaccharide diversity in this clone, with the majority of ST258 isolates expressing the O2 antigen (21,23). Similarly, early studies reported that ST258 harbored just two distinct capsule types (60). However, subsequent work has shown greater K locus variation (61), particularly among other members of the clonal group, e.g., ST11, ST340, and ST437, which are frequent causes of CP infections outside the United States (62).
The large number of publicly available CG258 assemblies makes this group an ideal case for exploring the broader diversity of K and O loci within a single clonal group. To this end, we downloaded all CG258 genome assemblies available in GenBank (n = 201 as of 12 October 2017) (see Data Set S3) and analyzed them using Kaptive Web. "Good" or better K and O locus calls were obtained for 173 (86%) and 186 (93%), respectively (Data Set S3). Importantly, while there were dominant types (KL107 and O2v2, the combination of which accounted for 65 [32%] of the isolate genomes), there was also much diversity, with 25 K loci and 6 O loci in total (not distinguishing O1 and O2) (Fig. 2).
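A summary of this kind can be produced from the downloaded results table with a short script. The sketch below is hedged in several ways: it assumes a tab-delimited table, the column names "Best match locus" and "Match confidence" are assumptions about the export format, and the example rows are invented for illustration and are not taken from Data Set S3. Calls below "good" are grouped as "unknown," as in the figure captions.

```python
import csv
import io
from collections import Counter

GOOD_OR_BETTER = {"Perfect", "Very high", "High", "Good"}

# Invented example rows; the column names are assumptions about the export.
EXAMPLE_TABLE = (
    "Assembly\tBest match locus\tMatch confidence\n"
    "g1\tKL107\tVery high\n"
    "g2\tKL107\tGood\n"
    "g3\tKL106\tPerfect\n"
    "g4\tKL15\tLow\n"
    "g5\tKL107\tNone\n"
)


def summarize_calls(table_text, locus_col="Best match locus",
                    conf_col="Match confidence"):
    """Count locus calls, grouping those below "Good" as "unknown"."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    counts = Counter()
    total = 0
    for row in reader:
        total += 1
        if row[conf_col] in GOOD_OR_BETTER:
            counts[row[locus_col]] += 1
        else:
            counts["unknown"] += 1
    return total, counts


total, counts = summarize_calls(EXAMPLE_TABLE)
confident = total - counts["unknown"]
distinct_loci = len([locus for locus in counts if locus != "unknown"])
print(f"{confident}/{total} calls were Good or better; {distinct_loci} distinct loci")
```

The same function can be applied to a real table by reading the file contents into a string first.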
There is now emerging evidence that like CG258, other globally distributed MDR clones also harbor diverse K and O loci (50,63). For example, Fig. 3 shows the core chromosomal phylogeny of the 309 global genomes (54), highlighting Kaptive Web's K and O locus calls for three additional globally distributed MDR lineages (54,62). Locus exchange tends to result from recombination and demarcates diverging sublineages within the expanding MDR clones (60,61); hence, the locus calls can also serve as epidemiological markers for the subtyping of MDR strains (58,60). Therapies specific to the dominant K or O antigens may apply further selective pressure that shifts the population towards different types, as is well documented following the introduction of protein-conjugate vaccines targeting the Streptococcus pneumoniae polysaccharide capsule (64,65). The success of new control measures directed at K and O loci will therefore depend on reliable tracking of K and O loci in the K. pneumoniae population. Kaptive and Kaptive Web provide simple WGS-based solutions to monitor these trends and ensure that therapies are well targeted and keep up with K. pneumoniae evolution. With its graphical interface and remote computation, Kaptive Web also makes these analyses accessible to the wider public health community.
Only Kaptive locus calls with a confidence of "good" or better were included in this figure. Lower confidence matches were labeled "unknown."

ACKNOWLEDGMENT
Web development services were provided by the eResearch group at the University of Melbourne.