EcOH: In silico serotyping of E. coli from short read data

The lipopolysaccharide (O) and flagellar (H) surface antigens of Escherichia coli are targets for serotyping that have traditionally been used to identify pathogenic lineages of E. coli. As serotyping has several limitations, public health reference laboratories are increasingly moving towards whole genome sequencing (WGS) for the rapid characterisation of bacterial isolates. Here we present a method to rapidly and accurately serotype E. coli isolates from raw, short read sequence data, leveraging the known genetic basis for the biosynthesis of O- and H-antigens. Our approach bypasses the need for de novo genome assembly by directly screening WGS reads against a curated database of alleles linked to known E. coli O-groups and H-types (the EcOH database) using the software package SRST2. We validated our approach by comparing in silico results with those obtained via serological phenotyping of 197 enteropathogenic (EPEC) isolates. We also demonstrated the utility of our method to characterise enterotoxigenic E. coli (ETEC) and the uropathogenic E. coli (UPEC) epidemic clone ST131, and for in silico serotyping of foodborne outbreak-related isolates in the public GenomeTrakr database.


47"
agglutination reactions with panels of antisera, and is expensive in terms of both labour and 48" reagent costs (Achtman et al., 2012;Fratamico et al., 2009). In addition, the interpretation of 49" these assays is subjective and relies on antisera that vary in titre and specificity according to 50" the source and integrity of the serum. Further, a significant proportion of E. coli isolates

51"
(approximately one quarter) are serologically 'untypeable' due to cross-reactivity or a lack of 52" reaction with available antisera (DebRoy et al., 2011). For these reasons there has been a 53" shift away from serological phenotyping of E. coli, towards inference of O-and H-genotypes 54" using molecular methods (Jenkins, 2015).

56"
O-antigen biosynthesis in E. coli is encoded in gene clusters that are typically located

64"
with structural variation in the carbohydrate residues that make up each O-antigen (DebRoy 65" et al., 2011;Wang et al., 2003). Because of this, the sequences of these genes can be used

67"
genotype-phenotype relationships for some O-groups are unexpectedly complex. For

72"
H-antigen specificity is determined by flagellin, which is the protein subunit of flagella. This

73"
protein is encoded by fliC in 43 of the 53 serologically defined H-types (Wang et al., 2003).

74"
PCR detection of fliC alleles has been used for molecular H-typing for some time (Wang et 75" al., 2003). However, some E. coli isolates have an alternative flagellar phase, due to the

76"
presence of an additional flagellin gene (flnaA, fllA, fmlA or flkA), similar to those found in

81"
laboratories are increasingly moving away from phenotyping and towards whole genome 82" sequence (WGS) based typing of bacteria including E. coli (Joensen et al., 2015;Kwong et 83" al., 2015). Given the strong genetic basis for O-and H-antigenic variation in E. coli, the 84" availability of genome data provides a valuable opportunity to infer serotypes at little or no 85" additional cost. Here we present a method to rapidly and accurately serotype E. coli isolates 86" from raw, short read sequence data, by screening reads directly against a curated database

87"
of alleles linked to known E. coli O-groups and H-types (the EcOH database, presented

89"
also be used to infer serotypes from assembled genome data using BLAST or other 90" sequence comparison tools, which will become increasingly useful as long-read sequence 91" data become more common. We validated our approach by comparing in silico predicted

92"
serotypes to those determined phenotypically in a public health reference laboratory, and

93"
demonstrated the utility of in silico serotyping to characterise more than 1,000 E. coli isolates

96"
E. coli deposited in the public GenomeTrakr database.

101"
The EcOH database of O-and H-type encoding sequences was initially constructed in 2014

102"
from publically available sequences identified in GenBank by reviewing the literature on the

106"
resulting EcOH database includes sequences of alleles for wzm and wzt, or wzx and wzy,

108"
known H-types. Details of all sequences in the EcOH database are provided in Table S1.

111"
Publically available sequence data used in this study

112"
Details of the short read Illumina data used in this study are provided in Table 1

126"
and assemblies are available in the European Nucleotide Archive (ENA) under ERP001141

133"
Both sets of assemblies were screened against the EcOH database using BLAST+ (blastn).

134"
A genotype call was made where a hit was identified with ≥90% coverage of a query

135"
sequence at ≥90% nucleotide identity. Note that as the SPAdes assemblies yielded fewer

136"
genomes with BLAST+ hits to O-antigen loci, these assemblies were discarded and all

137"
results were reported as comparisons of SRST2 data to assembly-based analysis using the

138"
Velvet Optimiser assemblies. This makes the comparison as generous as possible towards

139"
the competing method of assembly-based analysis.

142"
SRST2 was run with default parameter settings, such that a genotype call reflects detection

143"
of reads covering ≥90% of the length of a query locus at ≥90% nucleotide identity. Where

144"
multiple alleles of the same locus appears in the database, SRST2 reports the best-scoring

145"
allele as the genotype call (Inouye et al., 2014). A confident genotype call is defined as one

146"
exceeding the minimum depth cut-offs (Inouye et al., 2014). Here we used the SRST2

147"
default values of ≥5x mean read depth across the query locus to define a confident call.

150"
Isolates were subcultured on Luria-Bertani agar and incubated overnight at 37°C before

151"
being submitted to the National E. coli Reference Laboratory at the Microbiological

152"
Diagnostic Unit Public Health Laboratory (MDU PHL) in Melbourne, Australia, for serotyping.

5"
O-and H-serotyping utilised the standard tube agglutination tests, adapted for U-bottomed

157"
Where an O-group was determined via serological testing of an isolate, but no wzx/wzy or

158"
wzm/wzt genes were detected in the corresponding isolate's genome, the de novo genome

159"
assembly was interrogated to identify potential novel O-antigen loci. For each such isolate 160" the assembled contig containing the genes galF and gnd, which typically flank the O-antigen

161"
locus, was identified using BLAST and extracted using EMBOSS (Rice et al., 2000). The

163"
translated protein sequences from the EcOH database as the preliminary annotation source.

164"
We then used ACT (Carver et al., 2008) to visually compare the annotated sequences with

165"
full-length reference sequences for the corresponding O-group that had been identified by

166"
serology. Putative wzx and wzy alleles for these O-groups were identified based on (i) the

167"
annotation, (ii) sequence homology with the reference O-group sequences, and (iii) the

169"
putative wzx and wzy gene sequences were added to the EcOH database with the suffix

175"
For the EPEC and ETEC pathotypes, population structure was determined by constructing

176"
neighbour-joining trees based on Hamming distances between MLST allele profiles inferred 177" from the genomes using SRST2. O-and H-types were plotted against these trees using R

180"
and rarefaction plots, were performed using the vegan package for R (Oksanen et al., 2015).

183"
Illumina reads from a total of 169 isolates (accession numbers in Table 1

188"
consensus alleles at all SNP sites identified in the isolate collection are then extracted from 189" " 6" each read set using SAMtools (Li et al., 2009) (Phred score ≥20 and unambiguous

192"
The resulting SNPs were filtered to include only those located within common genes

194"
total of 38,213 SNPs. The resulting SNP alignment was used as input to infer a maximum 195" likelihood (ML) tree using RAxML (yielding 100% bootstrap support for all major nodes). The

196"
phylogeny was outgroup-rooted using the group comprising four closely related ST95

197"
isolates (these had originally been identified as ST131 in PCR analysis for rfb and pabB

201"
GenomeTrakr (NCBI BioProject: PRJNA183844) is a public repository of genome data from

202"
foodborne pathogens submitted by various laboratories including the US Food and Drug

203"
Administration and the Centers for Disease Control and Prevention. It includes raw Illumina

204"
reads and a kmer-based phylogeny of E. coli read sets, which is updated daily to incorporate

218"
The two exceptions were O57 and O14, as isolates with these O-groups lack any of these 219" genes and harbour only small O-antigen gene clusters, with no known polymerase or 220" flippase genes, and only the housekeeping genes galF, gnd and hisI together with ugd and

221"
wzz which is not sufficient to delineate these O-groups. The EcOH database also includes

222"
sequences for all 53 known H-types, allowing for the detection of both fliC and non-fliC

230"
which are reported in the literature as O152 and O150, respectively (Oshima et al., 2008; 231" Toh et al., 2010). In silico analysis of these genomes identified wzx and wzy alleles for O16

232"
and O173, respectively, and no BLAST+ hits to the alleles corresponding to the reported 233" serogroups O152 and O150. Reported H-types were available for 20 of the reference

234"
genomes and we identified the expected H-alleles in all of these (Table S2)

239"
We compared in silico serotyping (i.e., O-and H-genotyping) to serological phenotyping of

240"
197 EPEC isolates. All isolates were serotyped by a national reference laboratory which

245"
different H-types). The remaining 69 isolates were identified as H-(n=67, indicating that the

249"
The 197 isolates were previously subjected to whole genome sequencing using the Illumina

250"
HiSeq platform (Ingle et al, 2015, in press). We compared two different strategies for in silico 251" assignment of O-and H-types using the EcOH database: (i) typing direct from reads using

252"
SRST2, and (ii) de novo assembly (using Velvet Optimiser) followed by identification of

253"
alleles via BLAST+ (see Methods). Results are summarised in Figure 1 with the full results

256"
isolates (85%), and at one O-determining locus for a further 15 isolates. Thus, a total of 182

265"
assemblies were not analysed further.

267"
Of the 15 isolates for which SRST2 analysis did not yield a serotype call, 7 isolates had 268" serological O-groups but no high confidence calls for any wzx/wzy or wzm/wzt genes 269"

279"
agreed at the two O loci but did not match the serological phenotype, assembly analysis

280"
identified the same O-group as SRST2 in 14/15 cases, and the serological O-type in 1/15

281"
cases. There was only one isolate for which assembly-based analysis identified the same O-

282"
group as phenotyping when SRST2 had no result, and there were also two cases where

283"
SRST2 analysis identified the serological O-group and BLAST+ did not.

285"
The possible reasons for mismatches between O-antigen phenotype and genotype include 286" multiple genetic variants manifesting in the same phenotype (for example O45, see 287" (Plainvert et al., 2007)) and/or atypical genetic variation such as multi-copy genes or novel

288"
genes. To explore these possibilities we manually inspected the genome assemblies of

289"
isolates yielding conflicting genotype/phenotype calls and identified twelve novel O-antigen

293"
alleles were detected at high depth (~100x) and were highly divergent from the reference

294"
O116 wzx allele (~10% nucleotide divergence, the maximum limit of detection we used for 295" genotyping). We hypothesised that these isolates may carry wzy genes that are genetically 296" distant from the prototypical alleles that were included in our database, but which

297"
nonetheless result in similar phenotypic agglutination patterns to isolates carrying more 298" prototypical genes. Investigation of the corresponding genome assemblies revealed a novel

301"
wzy sequences were labelled O116var1 and added to the EcOH database to facilitate

302"
identification of this novel type in future. In the genomes of four isolates genotyped as O8

304"
but also identified a putative wzx homolog (with 76% identity to O83) outside the O-antigen

306"
Interestingly, an isolate which displayed the O153 phenotype but had no O-locus hits in the

308"
and 57% to O170, respectively), one of which was similar to the additional wzx gene

311"
H-typing using the EcOH database yielded similar results to O-typing. SRST2 analysis

312"
returned 127 confident calls that matched the phenotype in 119 of 128 (93%) of the

317"
amongst phenotypically H-non-motile isolates is likely due to a lack of flagellin expression 318" during serotyping, which does not affect genotyping. Only two isolates had no flagellin genes 319" detected from the sequence data. These were non-motile and may be the only isolates that 320" genuinely lack the ability to express flagella.

323"
The data above show that the use of SRST2 and the EcOH database to type raw Illumina

324"
read sets can provide rapid in silico serotyping, that outperforms assembly-dependent 325" analysis (especially for H-typing) and is largely predictive of results obtained from serological

326"
typing while yielding fewer 'untypeable' results. In addition to the EcOH database, other

327"
databases, such as those used for multi-locus sequence typing (MLST) and antibiotic

329"
single SRST2 analysis returning MLST, serotype and antimicrobial resistance gene results in

330"
a few minutes (approximately 5-10 minutes for paired Illumina data at mean read depth 50-

331"
100x, see (Inouye et al., 2014)). We therefore sought to demonstrate the utility of this 332" approach for the rapid characterisation of E. coli genomes, including serotyping, MLST and

333"
antibiotic resistance gene profiling, in a variety of contexts including the investigation of

341"
that H-types are more stably maintained within E. coli clones than are O-groups (Fig. 3),

349"
high diversity within ST10 is consistent with the fact that it was one of the first E. coli

353"
Next we used SRST2 and the EcOH database to analyse Illumina read sets for 169 UPEC

354"
isolates previously reported as belonging to the epidemic UPEC clone ST131 (Petty et al.,

357"
report on these genomes (Petty et al., 2014)). In silico serotyping identified the majority of

361"
clone of ST131 in which a change of serotype has occurred (Fig. 4). The O16:H5 sub-clone

362"
carried fewer resistance genes than other ST131 genomes and corresponds to ST131 Clade

363"
A, which has been identified as an ancestral sub-lineage of ST131 that is distinct from the

365"
within ST131 was detected in the original genome reports (Petty et al., 2014;Price et al.,

366"
2013), but was not explored in detail. Our data highlight the utility of in silico serotyping to

367"
illuminate on-going microevolution in important epidemic clones of E. coli, including change

368"
of serotype, which could confound serological identification of outbreak-related isolates.

370"
Finally, we performed in silico serotyping of 300 foodborne outbreak-associated E. coli

371"
genomes recently deposited by public health laboratories into the GenomeTrakr project

373"
overlaid on the GenomeTrakr kmer-based tree. Environmental isolates displayed a diversity 374" " 11" of ST, O-and H-types, whereas most clinical isolates belonged to one of six clonal lineages,

381"
antigen genes, whilst 272 isolates (91%) had a confident call for at least one allele. In most

382"
cases where a low-confidence O-antigen genotype call was made (due to low read depth),

383"
the call was for O157 alleles; the position of these genomes within the ST11 O157:H7

384"
lineage of the kmer tree suggests that these low-confidence calls of these genomes are

385"
likely to be correct. Only 7 isolates (2%) yielded no genotype calls for the H-locus, indicating

386"
they are likely to be non-motile. These data demonstrate the utility of our method for in silico 387" serotype prediction of E. coli sequenced for the investigation of foodborne outbreaks or other

392"
This study has demonstrated that E. coli O-and H-genotypes can be rapidly and accurately

393"
extracted from whole genome data using the free SRST2 software and a new publicly

394"
available database, EcOH. The method improves on both (i) serological phenotyping, which

395"
is resource intensive in terms of time, labour and reagent costs, and fails to type up to one

396"
third of E. coli isolates, and (ii) assembly-based approaches for in silico genotyping of

397"
Illumina data, particularly for H-typing, which are more computationally expensive and are

398"
highly dependent on the quality of the sequence data. Importantly, since SRST2 works on

399"
raw reads and can readily be used to extract other useful genotyping information in addition 400" to serotype, including MLST, antimicrobial resistance and virulence genes (Inouye et al.,

401"
2014), it lends itself to integration with robust assembly-free pathogen genome 402" characterisation pipelines. Our data demonstrate that this approach can be used to readily

403"
infer serotypes from genome data currently being produced by GenomeTrakr and other

404"
public health networks as part of routine investigation of foodborne E. coli outbreaks. This

405"
could be useful in identifying the emergence of novel serotypes within outbreak clades that 406" may signal a shift in the pathogen population during its dissemination (as we identified in 407" ST131 UPEC), and importantly will provide backwards compatibility with the wealth of 408" serotype data that is currently available from historical outbreak investigations.

580"
antigen locus or at H locus; a-b), or vs. BLAST+ analysis of de novo assemblies (hit to ≥1 O-

581"
antigen locus or at H locus; c-d).

598"
Core genome SNP tree for 170 isolates, outgroup-rooted using ST95 isolates. Serotype and 599" acquired antimicrobial resistance gene profiles, detected using SRST2, are demarcated on 600" the rings surrounding the phylogeny; note low-confidence serotype calls are shown in paler 601" colours.

602"
603" Figure 5. In silico prediction of serotype and sequence type from GenomeTrakr data

604"
Kmer-based tree obtained from the GenomeTrakr project; common sequence types (STs) or