In Silico Genotyping of Escherichia coli Isolates for Extraintestinal Virulence Genes by Use of Whole-Genome Sequencing Data

Extraintestinal pathogenic Escherichia coli (ExPEC) is the leading cause in humans of urinary tract infection and bacteremia. The previously published web tool VirulenceFinder (http://cge.cbs.dtu.dk/services/VirulenceFinder/) uses whole-genome sequencing (WGS) data for in silico characterization of E. coli isolates and enables researchers and clinical health personnel to quickly extract and interpret virulence-relevant information from WGS data.


Exclusion of gene variants from the database:
One cvaC allele (JHDL0100011) was removed as it was <61% identical to the other cvaC alleles.
Three ompT alleles (CYBK01000048, UDAJ01000003 and DQ381420) were removed as BLASTx identified them as GlcNAc transferases.
Four sitA alleles were 64-73% similar to the 56 unique sitA alleles and were removed from the database, as BLASTx confirmed that they originated from Klebsiella (FLWH01000008 and FLXF01000002) and Citrobacter (UFZE01000001 and UGBZ01000001).
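
The identity-based exclusions listed above can be illustrated with a small filter. This is a minimal Python sketch and not the procedure used in this study: the function name flag_outlier_alleles, the input layout, and the toy identity values are all hypothetical, and the pairwise identities themselves are assumed to come from an alignment tool such as BLASTn.

```python
def flag_outlier_alleles(pairwise_identity, threshold):
    """Return alleles whose best identity to any other allele is below threshold.

    pairwise_identity maps frozenset({allele_a, allele_b}) -> percent identity.
    (Hypothetical input layout, chosen for illustration.)
    """
    best = {}
    for pair, identity in pairwise_identity.items():
        for allele in pair:
            # Track the highest identity each allele reaches with any partner.
            best[allele] = max(best.get(allele, 0.0), identity)
    return sorted(a for a, ident in best.items() if ident < threshold)


# Toy identities for illustration only (not data from this study).
toy_identities = {
    frozenset({"allele_1", "allele_2"}): 99.2,
    frozenset({"allele_1", "allele_3"}): 58.0,
    frozenset({"allele_2", "allele_3"}): 60.5,
}

print(flag_outlier_alleles(toy_identities, threshold=61.0))
```

A flagged allele would still need a confirmatory step, such as the BLASTx checks described above, before removal.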

Sfa/focDE
Eighty unique putative focC alleles were downloaded from NCBI. Of these, only four contained the sfa/focDE reverse primer, in each instance with >99% identity. Of the remaining 76 alleles, 39 shared <60% identity with the four focC alleles, and 37 proved to be fimC by BLASTx analysis against the nonredundant protein sequence database (nr). Therefore, these 76 alleles were not included in the final ExPEC database (see Supplementary material 1 for details of the analysis). The reverse sfa/focDE primer was located in all three unique sfaE alleles, with >98% identity. The sfaE and focC gene alleles had identities >98%, and two sfaE alleles were 100% identical to two focC alleles. A total of 10 unique sfaD alleles and one focI allele were downloaded from NCBI. The forward sfa/focDE primer was found in the focI allele and in the 10 sfaD alleles. The sfaD and focI alleles were >98% identical. In summary, two focC, two focI, 10 sfaD, one sfaE, and two focC/sfaE alleles were added to the ExPEC database.
The original VirulenceFinder database consisted of one sfaS allele. One additional unique allele was identified for sfaS (similarity of 99.8%) and two alleles for focG (similarity of 99.8%). The sfaS primer pair was found in both sfaS alleles, and the focG primer pair was found in both focG alleles. The focG forward primer and the sfaS forward primer were also found in both sfaS alleles and both focG alleles, respectively. The three new alleles were added to the ExPEC database. The focG and sfaS genes are ~65% identical.
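
The primer checks described above (whether a PCR primer is found in an allele, and with what identity) can be sketched as an ungapped scan of both strands. This Python sketch is illustrative only and is not the method used in this study: the function names and the toy sequences are hypothetical, and a real analysis would more likely rely on BLASTn, which also handles gaps.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_primer_identity(primer, allele):
    """Best ungapped percent identity of primer against either strand of allele."""
    primer = primer.upper()
    allele = allele.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    best = 0.0
    for strand in (allele, reverse_complement(allele)):
        # Slide the primer along the strand and count matching positions.
        for start in range(len(strand) - len(primer) + 1):
            window = strand[start:start + len(primer)]
            matches = sum(p == w for p, w in zip(primer, window))
            best = max(best, 100.0 * matches / len(primer))
    return best


# Toy sequences for illustration only (not real primers or alleles).
toy_primer = "GATTACAG"
print(best_primer_identity(toy_primer, "CCCGATTACAGTTT"))  # exact hit, forward strand
print(best_primer_identity(toy_primer, "AAACTGTAATCGGG"))  # exact hit, reverse strand
print(best_primer_identity(toy_primer, "CCCGATTTCAGTTT"))  # one mismatch
```

Scanning the reverse complement matters because a reverse primer anneals to the opposite strand from the one usually stored in the database.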
However, three of the afaB alleles were 100% identical to three nfaE alleles already present in the E. coli VirulenceFinder database. This resulted in the addition of only 3 afaB alleles to the ExPEC database. Eleven afaC alleles were added to the ExPEC database, including three unique alleles each for afaC-1, afaC/draC, and afaC-8, and two for afaC-3.
Thus, a total of 37 unique afaD alleles were added to the ExPEC database, including alleles classified as either Agg3B, Agg4/HdaB and/or afaD. All draD labels were changed to afaD.

kpsM:
When searching for the kpsM alleles, preference was given to sequences for which the serotype information included the capsular K antigen and/or the original strain number was provided. Eighty-two unique alleles were found for kpsM. Twelve sequences were excluded because their similarity to any other kpsM sequence was lower than 74.3% and because they matched none of the kpsM primers.
Sixty-eight alleles with a similarity higher than 90.5% were found in sequences related to group 2 capsules. Two further unique alleles, with a similarity of 65.4-68.5% to the 68 group 2 alleles, were found in strains indicating the presence of group 2 capsules K94 and K97; the similarity between these two was 77.9%. Nine group 3 capsule alleles were found: five sequences shared a similarity higher than 99.5%, the sequences indicating K19/K23 and K11 capsules had similarities of only 70.9% and 78.5% to these five group 3 alleles, and two sequences indicating group 3 capsule K19 had similarities of 46.3-48.9%.
Three alleles for kpsM-K15 with similarities higher than 99.7% were included. For the kpsM genes, seven, six, five, and three alleles matched 100% for primer pairs kii-kpsII-F/K1-kii-R, kii-kpsII-F/kii-R, kpsIII-F/kpsIII-R, and kpsM15-F/R, respectively. Twenty-four, 7 and 16 alleles matched primer pair kii-kpsII-F/K1-kii-R 95.7%, 91.3% and 82.6-87% respectively, while 2 alleles had 59.1-63.6% together in several groups, where the internal identity is >99%. BLASTx confirms them all to be UDP-N-acetyl glucosamine 2-epimerases.
Four tcpC gene alleles were found with a similarity of >99%. One allele (GQ902993) was removed as it was shorter than the other three alleles (720 bp vs. 924 bp) and was reported as being truncated.
Twenty-five unique terC alleles were found. Of these, 19 were >96% similar to each other; three were >98% similar to each other and >80% similar to the aforementioned 19 terC alleles; and the remaining three differed more from the first 22 alleles, being >52% similar to them. BLASTx confirmed them all to be terC alleles.

papA genes not covered by the search string and inclusion criteria
The search string for papA did not return the expected allele results for the serotype-specific P fimbriae variants F8, F10, F12, F15, and F40. Therefore, an additional BLASTn search at NCBI GenBank was performed using the corresponding reverse primers and original accession numbers (3). Table 1 lists the number of alleles for each of the papA genes.

Concordance of PCR and WGS typing of the evaluation sequences
The sfaS (5 strains), focG (10), fyuA (10), iutA (21), and ibeA (7) genes and the sfa/focDE operon (8) were identified by WGS in PCR-negative strains (number of strains in parentheses). The alleles identified by WGS in these strains contained both PCR primer sequences.
In four strains, only one of the sfa/focDE operon genes (sfaD, sfaE, focI, or focC) was identified by WGS, and these strains were classified as sfa/focDE negative.
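
The classification rule for the sfa/focDE operon can be sketched as a set check. The source states only that a strain with a single operon gene was called negative. The stricter rule below, which requires at least one gene carrying the forward primer (sfaD or focI) and at least one carrying the reverse primer (sfaE or focC), is an assumption inferred from the primer locations reported in the Sfa/focDE section, and the function name is hypothetical.

```python
# Genes reported in the Sfa/focDE section to contain each primer.
FORWARD_PRIMER_GENES = {"sfaD", "focI"}
REVERSE_PRIMER_GENES = {"sfaE", "focC"}


def is_sfa_focde_positive(genes_found):
    """Assumed rule: positive only if both primer-bearing sides are present."""
    genes = set(genes_found)
    has_forward = bool(genes & FORWARD_PRIMER_GENES)
    has_reverse = bool(genes & REVERSE_PRIMER_GENES)
    return has_forward and has_reverse


print(is_sfa_focde_positive(["sfaD"]))          # single gene: negative
print(is_sfa_focde_positive(["sfaD", "sfaE"]))  # both sides present: positive
```

Under this assumed rule a strain carrying only sfaE and focC would also be called negative, because no forward-primer gene is present. The source does not say how such a case was handled.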
Thirty-three strains were positive for papA by WGS, all with an allele in which the forward PCR primer was identified. Twenty strains negative for papC by PCR were papC positive by WGS. For 9 of these 20 strains, the papC PCR primer sequences were not identified in the papC gene sequence. The hlyF (1 strain), usp (7), traT (12), and ompT (19) genes were identified by WGS in PCR-negative strains, with alleles in which only one or neither of the PCR primers was identified in the gene sequence.
Repeat PCR was performed for seven strains and resulted in the reclassification of six strains as non-ExPEC instead of ExPEC, in agreement with VirulenceFinder. The seventh strain (PUTI_374) was found to be ExPEC by VirulenceFinder, but not by PCR in either the initial or repeat testing.