Comparative Pathogenomics of Escherichia coli: Polyvalent Vaccine Target Identification through Virulome Analysis

ABSTRACT Comparative genomics of bacterial pathogens has been useful for revealing potential virulence factors. Escherichia coli is a significant cause of human morbidity and mortality worldwide but can also exist as a commensal in the human gastrointestinal tract. With many sequenced genomes, it has served as a model organism for comparative genomic studies to understand the link between genetic content and potential for virulence. To date, however, no comprehensive analysis of its complete “virulome” has been performed for the purpose of identifying universal or pathotype-specific targets for vaccine development. Here, we describe the construction of a pathotype database of 107 well-characterized completely sequenced pathogenic and nonpathogenic E. coli strains, which we annotated for major virulence factors (VFs). The data are cross referenced for patterns against pathotype, phylogroup, and sequence type, and the results were verified against all 1,348 complete E. coli chromosomes in the NCBI RefSeq database. Our results demonstrate that phylogroup drives many of the “pathotype-associated” VFs, and ExPEC-associated VFs are found predominantly within the B2/D/F/G phylogenetic clade, suggesting that these phylogroups are better adapted to infect human hosts. Finally, we used this information to propose polyvalent vaccine targets with specificity toward extraintestinal strains, targeting key invasive strategies, including immune evasion (group 2 capsule), iron acquisition (FyuA, IutA, and Sit), adherence (SinH, Afa, Pap, Sfa, and Iha), and toxins (Usp, Sat, Vat, Cdt, Cnf1, and HlyA). While many of these targets have been proposed before, this work is the first to examine their pathotype and phylogroup distribution and how they may be targeted together to prevent disease.

major virulence factor in E. coli into a holistic picture that allows comparison across pathotype, phylogroup, and VF category. This organization reveals vaccine targets that speciate by phylogroup specifically but also highlight some unexpected entry points that may disarm this pathogen from multiple angles. Many of the vaccine targets proposed here have been proposed before as monovalent targets. However, to our knowledge, there has been no work looking at their pathotype and phylogroup distribution or how this could be leveraged to identify novel vaccine targets or polyvalent strategies. Furthermore, we reveal here several novel findings that provide insight into E. coli evolution, pathogen-versus-commensal delineation, and diagnostic classification.

RESULTS AND DISCUSSION
Curation of the E. coli virulome and visualization of results. The first step in our analysis was to curate a database of E. coli virulence factors by known strains. This included retrieving known virulence factors from the Virulence Factor Database (VFDB), VICTORs, and PATRIC, cross-referencing and confirming their function from hundreds of literature sources, and using a tiered approach to analyze strains (56)(57)(58). On the first tier, we used a strain database of 107 strains which had complete chromosome sequences (sequences of high quality) and had published evidence for their pathotype assignment. Two incomplete strains were grandfathered in from preliminary analyses (NC101 and REL606), but these nonetheless had published pathotype evidence. These strains were organized by their pathotype and visualized in detail. This is referred to here as our "pathotype database." The next tier database contained 1,348 complete E. coli chromosomes that were organized into phylogroups using an in silico method based on Clermont phylotyping developed in-house (see Materials and Methods; see also Appendix S1 in the supplemental material) (59)(60)(61). This is referred to as our "phylogroup database." Genetic insights gleaned from the first tier (pathotype database) were tested against this larger phylogroup database, mainly in the form of gene distribution. Using both methods, we found many apparently novel associations between virulence factors, phylogroups, and pathotypes. Figure S1 details the pipeline for separation of the E. coli genomes into phylogroups.
To visualize the relationship between pathotype, phylogroup, type of virulence factor, and any polymorphisms in genes associated with virulence factors, a heatmap template was developed. These heatmaps are divided into two panels: nonpathogenic strains and ExPECs (general ExPECs, UPECs, NMECs, and APECs) (see, for example, Fig. 1A) and InPECs (AIECs, EHECs/STECs, EAHECs, EAECs, ETECs, and EPECs [Table 1]) (see, for example, Fig. 1B). Each column represents a single strain that is listed at the top of the heatmap. These strains are organized first by pathotype, then by phylogroup, and finally by sequence type. Each row represents a single gene, which is listed to the left of the heatmap. Genes are generally organized by class, operon, or otherwise related function.
Each cell in the heatmap is colored based on percent nucleotide identity compared to the reference used to generate the alignments. The range of colors for each figure is based on the lowest value (which indicates divergence from the reference gene) and so varies from figure to figure. Coloring cells this way allows the reader to use color as a proxy for both conservation and allelic distribution, and our organization allows the reader to investigate this distribution by pathotype, phylogroup, or sequence type. It is important to note that matching colors do not necessarily mean identical alleles. Instead, they indicate that the gene in those strains has the same number of mutations relative to the reference gene. However, we did not come across a case where matching colors were from separate alleles.
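The heatmap layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plotting code used in the study: the strain records, their names, and their sequence types are hypothetical placeholders, the alphabetical ordering of phylogroups is an assumption, and `scale_identity` is a hypothetical helper that rescales percent identity between a figure's lowest value and 100%.

```python
# Minimal sketch of the heatmap layout: column ordering and per-figure color scaling.
# All strain records below are hypothetical placeholders, not database entries.

# Assumed display order of pathotypes for the nonpathogenic/ExPEC panel.
PATHOTYPE_ORDER = ["nonpathogenic", "ExPEC", "UPEC", "NMEC", "APEC"]

strains = [
    {"name": "strain_c", "pathotype": "UPEC", "phylogroup": "B2", "st": 95},
    {"name": "strain_a", "pathotype": "nonpathogenic", "phylogroup": "B2", "st": 73},
    {"name": "strain_d", "pathotype": "UPEC", "phylogroup": "B2", "st": 73},
    {"name": "strain_b", "pathotype": "nonpathogenic", "phylogroup": "A", "st": 10},
    {"name": "strain_e", "pathotype": "APEC", "phylogroup": "C", "st": 23},
]


def column_key(record):
    """Sort by pathotype first, then phylogroup, then sequence type."""
    return (
        PATHOTYPE_ORDER.index(record["pathotype"]),
        record["phylogroup"],  # alphabetical phylogroup order is an assumption
        record["st"],
    )


def scale_identity(percent_identity, figure_min):
    """Rescale percent identity to 0..1 using the figure's lowest value as the floor."""
    if figure_min >= 100:
        return 1.0  # every hit is identical to the reference
    return (percent_identity - figure_min) / (100.0 - figure_min)


ordered = sorted(strains, key=column_key)
column_names = [record["name"] for record in ordered]
print(column_names)
print(scale_identity(90.0, figure_min=80.0))
```

Because the color floor is set per figure, the same color in two different figures need not correspond to the same percent identity.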
K-capsule group types are strongly associated with phylogroup. Surface sugars are an important virulence factor in many pathogenic bacteria as they generally act as protectins, hiding the bacterium from the host's immune system (62). Three U.S. Food and Drug Administration-approved vaccines (for N. meningitidis, H. influenzae, and S. pneumoniae) have capsule as a component and thus may guide E. coli vaccine development since E. coli also produces capsule. In E. coli there are three major types of surface polysaccharides: lipopolysaccharide (LPS), capsular polysaccharide (CPS or K-antigen), and exopolysaccharide (EPS) (63)(64)(65)(66). Of these, LPS and CPS are serospecific surface polysaccharides (64,66). E. coli K-antigens are characterized into groups 1, 2, 3, and 4 (64). K-antigens that belong to groups 1 (G1C) and 4 (G4C) are related to LPS O-antigens and use similar biosynthetic machinery. G1C and G4C are found in intestinal pathogenic E. coli, including EPECs, ETECs, and EHECs (64). Group 2 (G2C) and group 3 (G3C) capsules are found in extraintestinal pathogenic E. coli and are the groups of interest here (64). K-antigens from these groups utilize a separate assembly and transport system (the kps operon) from that of the group 1 and 4 K-types. The structures of G2C and G3C are similar to those found in N. meningitidis and H. influenzae, which strengthens the case for using them as vaccine targets. G2C polysaccharides often appear similar to polysaccharides found on the surface of eukaryotic cells, and these include two of the most extensively studied K-antigens, K1 and K5 (64).
Figure caption (fragment): Columns are organized first by pathotype, then by phylogroup, and finally by sequence type (sequence type not shown). The wzi gene used as a reference is specific for G1C. G2C and G3C both use the kpsFEDUCS and kpsTM operons for export. The biosynthetic operons for the most widely distributed and studied K-types, K1 (neu) and K5 (kfi), are also shown. G4C is synthesized by yccZ, ept, etk, and ymcABCD genes. The percent identity was determined using megaBLAST with reference genes found in Data Set S1 in the supplemental material.
G1C was found in only one of the strains examined in our pathotype database, the commensal strain HS (nonpathogenic; A) (Fig. 1). G3C was found in none of the strains. While initially puzzling, the G1C and G3C distribution in our RefSeq phylogroup database indicates that G1C and G3C are not widespread and are found predominantly in phylogroups C and D, respectively. These two phylogroups are underrepresented in our pathotype database (Fig. 1). Given the rarity of G1C and G3C in both databases, they are not recommended candidates for vaccine development.
G2C and G4C were more common, being found mostly in ExPECs and InPECs of our pathotype database, respectively (Fig. 1). These results suggested that G2C is by far the most common capsule type found in ExPECs and is almost completely absent from InPECs, with the exception of E. coli 042 (EAEC), a member of the D phylogroup. This trend agreed with the literature that G2C capsules are virulence factors of ExPECs, and it further strengthens the notion that capsular polysaccharide from group 2 should be considered a vaccine target. The G2C distribution in our RefSeq phylogroup database showed that G2C is associated with the B2, D, and F phylogroups, rather than with the ExPEC pathotype, which explains G2C in the 042 (EAEC; D) strain (Fig. 1B). However, this strong phylogroup association does not exclude a pathotype-based selection. Instead, this may partly explain why ExPEC strains tend to be from these three phylogroups. This is highlighted by exceptions to the rule in strain VR50, an asymptomatic bacteriuria (ABU) strain of the A phylogroup, which contains both G2C and G4C, and the E2348/69 strain, which is an EPEC strain and the only member of the B2 phylogroup in our pathotype database that lacks G2C and contains G4C (Fig. 1). In our phylogroup database, only 8% (15/195) of B2 strains carried the genes to produce G4C (Fig. 2). Of these, 87% (13/15) also carried the locus of enterocyte effacement (LEE), a pathogenicity island associated with diarrheal E. coli (namely, EHECs and EPECs) (67). None of the 180 B2 strains lacking the G4C genes carried the LEE pathogenicity island. This, as well as the overall pathotype distribution of G2C and G4C, suggests that G4C may be important for survival within the intestines, while G2C is important for survival in other parts of the host.
Thus, it would seem that targeting G2C capsule might be protective against strains that disseminate from the gastrointestinal tract into extraintestinal tissues. This has interesting implications since the phylogroup distribution of these capsule groups suggests that the B2, D, and F phylogroups are adapted to infections outside the intestines and opens up the range of potential vaccine targets from known virulence factors and surface antigens to those that are associated specifically with these phylogroups.
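As a worked check of the B2 counts reported above (15 of 195 B2 strains with G4C genes, 13 of those 15 with LEE, and no LEE among the remaining 180), the sketch below recomputes the percentages and adds a one-sided Fisher's exact test on the resulting 2x2 table. The test is an illustrative addition and is not reported in the text; `fisher_one_sided` is a hypothetical helper written with the standard library only.

```python
from math import comb


def percent(part, whole):
    """Percentage rounded to the nearest integer, as reported in the text."""
    return round(100 * part / whole)


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    seeing at least `a` co-occurrences by chance given the table margins.
    """
    row1 = a + b          # e.g. strains carrying G4C genes
    col1 = a + c          # e.g. strains carrying LEE
    total = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(total, row1)


# Counts taken from the text: B2 strains in the phylogroup database.
g4c_lee, g4c_no_lee = 13, 2        # 15 B2 strains with G4C genes
no_g4c_lee, no_g4c_no_lee = 0, 180  # 180 B2 strains without G4C genes

print(percent(15, 195))  # share of B2 strains with G4C genes
print(percent(13, 15))   # share of G4C-positive B2 strains with LEE
p_value = fisher_one_sided(g4c_lee, g4c_no_lee, no_g4c_lee, no_g4c_no_lee)
print(p_value)
```

A very small p-value here only says that G4C and LEE co-occur in B2 far more often than the margins would predict; it does not account for shared ancestry among the strains, so it should be read as descriptive rather than as evidence of independent acquisition.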
Interestingly, all AIEC strains, which are found in the intestines, also carried G2C (Fig. 1B) (68). This appears to be because the AIEC strains in this data set are from the B2 phylogroup; however, the majority of AIEC strains isolated thus far belong to the B2 phylogroup, so our strains are not necessarily unrepresentative. This deviation from the phylogroups of other InPECs may be explained by the association of AIECs with inflammatory bowel disease. In such a dysregulated and immune-factor-heavy environment, a capsule type that is associated with avoidance of the immune system provides obvious benefits. This dovetails with our findings above: if G2C acts as a protectin against the immune system outside the intestines and G2C is predominantly found in B2 strains, this could explain why most AIEC strains belong to the B2 phylogroup. They may simply be more resistant to the immune system and possibly survive better intracellularly (69)(70)(71)(72).
Lastly, another interesting finding is that strains from the recently described phylogroup G, which is a sister phylogroup to the B2 phylogroup, apparently carry G4C ( Fig. 2) (55). This may suggest that phylogroup G, like B2 EPEC strains, diverged from phylogroups B2, D, and F and is more adapted to an intestinal niche. This also has interesting implications for the evolution and acquisition of capsular genes, since it suggests that these divergent G and B2 strains may have exchanged G2C for G4C, which has been an important step in their evolutionary development.
kpsT is predictive for group 2 capsule K-types. One of the most exciting things about the work presented here is the novel patterns that emerge from this layout. A good example of this was a striking result from the capsule comparisons, where there was an apparent 1:1 association of the K-type with specific alleles of kpsT, a gene which encodes an ATP-binding protein member of the KpsTM ABC transport complex (Fig. 1; see also Fig. S2 in the supplemental material). While other genes in the kps operon showed either significant differences within each K-type or high conservation across K-types, kpsT showed a high degree of difference between K-types but little-to-no change within a given K-type (Fig. 3; see also Fig. S3 in the supplemental material). This can clearly be seen in the amino acid alignment of all the KpsT sequences in our pathotype database (Fig. 3A). To test whether this trend holds up in a larger data set, we took advantage of the fact that the biosynthetic operons for two well-known G2C K-types are known: neu for K1 and kfi for K5. We then used megaBLAST to search our phylogroup database for 100% identical matches for the kpsT allele found in strains known to have K1 (kpsT K1) or K5 (kpsT K5) and cross-referenced them against strains that contained the neu or kfi operons, respectively. Overall, our results showed that 89% (58/65) of all strains carrying the neu operon also contain kpsT K1, while 94% of all strains carrying the kfi operon contain kpsT K5. However, if we lower the stringency and include kpsT alleles that were >99% identical (<5 single nucleotide polymorphisms) to kpsT K1 or kpsT K5, 99% (64/65) of strains carrying the neu operon carry a kpsT K1-like allele, and 100% (50/50) of strains carrying the kfi operon carry a kpsT K5-like allele. These results show that kpsT is tightly linked to K-type and suggest that it may interact directly with variable regions of the exported polysaccharide.
If true, it is likely that this kpsT-to-K-type association will hold for other K-types. Due to its location inside the cell, KpsT is unlikely to be a candidate antigen for vaccine consideration (see Fig. S2). However, these results do have important implications outside vaccine development. For one, K-typing can be achieved by simply sequencing this gene, thereby allowing for K-types to be reported alongside phylogroup and virulence factors in clinical studies. This will be especially important for rarer K-types. It could also allow for rapid bacteriophage therapeutic selection, since several types of phages specifically target capsule (73,74). Finally, our results suggest that KpsT could be an attractive drug target, since disruption of this gene should be sufficient to prevent capsule export; targeting could thus be focused on K-types that are associated with extraintestinal pathogenesis while sparing commensals.
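The idea of calling a K-type from a kpsT sequence can be illustrated with a minimal sketch. It assumes the query and reference alleles are already aligned and of equal length, which is a simplification of the megaBLAST search used in the study, and it uses toy 200-nucleotide placeholder strings rather than real kpsT alleles. The function names and the strict ">99% identical" cutoff parameter are hypothetical choices that mirror the relaxed threshold described above.

```python
def percent_identity(a, b):
    """Percent identity between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def call_k_type(query, references, min_identity=99.0):
    """Return (k_type, identity) for the best-matching reference allele.

    Returns None when the best match is not strictly above `min_identity`,
    i.e. when no reference allele is >99% identical by default.
    """
    best = max(
        ((name, percent_identity(query, seq)) for name, seq in references.items()),
        key=lambda item: item[1],
    )
    return best if best[1] > min_identity else None


# Toy 200-nt placeholders, NOT real kpsT alleles.
REFERENCES = {"K1": "ACGT" * 50, "K5": "AGCT" * 50}

near_k1 = "T" + REFERENCES["K1"][1:]  # one mismatch relative to the K1 placeholder
unrelated = "TTTT" * 50               # far from both placeholders

print(call_k_type(near_k1, REFERENCES))
print(call_k_type(unrelated, REFERENCES))
```

Setting `min_identity` just below 100 would approximate the strict "100% identical" search only for sequences of this length; an exact-match search is better expressed as a direct string comparison.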
Fimbriae and adhesins are associated with phylogroup rather than pathotype. Fimbriae and adhesins are proteins found on the surface of all E. coli. Many are thought to be associated with adhesion to certain molecules, environments, or cell types (75). They are highly immunogenic and, if their distribution and mechanisms can be understood, may make excellent vaccine targets (76). Our results showed that fimbriae tend to be associated with phylogroup rather than pathotype (Fig. 4).
Type I fimbriae are encoded by the fim operon, target α-mannose, and are known to be in both commensal and pathogenic E. coli (77)(78)(79)(80)(81). They have been reported to bind several other molecules and cell types, including collagen, fibronectin, laminin, and macrophages (82)(83)(84)(85). AIEC adherence to intestinal epithelial cells of Crohn's disease (CD) patients is dependent on type I fimbriae, apparently because of the overexpression of CEACAM6 in CD patients (86). These fimbriae have been shown to promote urinary tract colonization and persistence, as well as cellular invasion (87,88). However, there are some conflicting results: complementing fim in the ABU strain 83972 (which lacks the complete fim operon) did not promote adherence in a murine urinary tract (89). Our results show that type I fimbriae are indeed widely distributed and well conserved in nonpathogenic, ExPEC, and InPEC strains (Fig. 4). Surprisingly, EAEC and EAHEC appear to completely lack the fim operon, and this absence does not correlate with phylogroup, sequence type, or serotype (Fig. 4B). This is despite the fact that the fim operon of the prototypical EAEC 042 strain has been studied (90). It is possible that the F9 or sfm fimbriae that are found in 042 were mistaken for the type I fimbriae given their similarity and the fact that they are often mistakenly annotated as fim. In each case for EAEC and EAHEC, the only hit found for the fim locus is a partial 171-bp hit for fimH. This partial hit is found in the correct location on the chromosome (between the gntP and nanC genes), but it is interrupted by an IS1-family transposable element. This supports the notion that the B1 EAEC and EAHEC strains studied here may share an ancestor. Interestingly, of the eight non-EAEC/EAHEC strains lacking the full fim operon, this partial 171-bp hit and transposable element were also found in the ST648 (ExPEC; F phylogroup) strain and 90-9281 (ETEC; B1 phylogroup).
It should be noted that there are two E. coli 042 strains found in the NCBI database. The one analyzed here has a 5,241,977-bp chromosome with accession number FN554766 (RefSeq NC_017626) and has had its analysis as an EAEC strain published (91). The other has a 4,692,707 bp chromosome with accession number CP042934. This entry does not have an accompanying article, and the title gives no indication that it is an EAEC or EAHEC strain.
The afimbrial adhesin Afa is encoded by the afa genes and is generally associated with ExPEC, UPEC, and diffusely adhering E. coli (DAEC) strains (81,92). The vast majority of strains containing afa genes belonged to the ST131 clonal group (Fig. 4). The full set of genes to make this adhesin protein was found only in the VR50 strain (A; ABU/UPEC), which is an asymptomatic bacteriuria strain (Fig. 4A) (93). However, this is because there are multiple divergent alleles of afaE, and a single reference (afaE-I was used here) will not cover them all using our method (94). The limited distribution of afa genes makes it a poor vaccine target by itself, but combining it with other antigens in a polyvalent vaccine is a viable option.
Colonization factor antigen I (CFA/I) fimbria is a class 5 fimbria generally considered associated with human colonization by ETEC strains (95). The CFA/I fimbriae found in ETEC strains are encoded on a plasmid, but there is apparently a divergent form (~30% identical to the ETEC plasmid version) on the chromosome that was widely found in strains belonging to the B1, C, and F phylogroups (Fig. 4 and 5). This trend is verified by our phylogroup database, where 100% of B1, C, and F strains contain hits for CFA/I (Fig. 5A). The most surprising result here is that all strains from the F phylogroup in both our databases contained this type of fimbria, even though the F phylogroup is more closely related to B2 strains (Fig. 4 and 5). This finding may be a clue to the lifestyle of the understudied F phylogroup. Interestingly, the only B2 strains that contained CFA/I were those from ST127 (Fig. 4A). Given that B1 strains are most often found as commensals in domesticated animals, it is possible that this fimbria promotes colonization in nonhuman hosts. Our finding also suggests that CFA/I as a vaccine target against ETECs will need to be carefully studied, since there is a possibility of cross-reactivity with nonpathogenic strains.
E. coli laminin-binding fimbriae (ELF) were first described in EDL933, an O157:H7 EHEC/STEC strain, where they were found to contribute to the ability to bind Hep-2 cells (96). That group also found that antibodies against ELF were able to partially block adherence of EDL933. Our results suggested that ELF is more generally found among strains that do not belong to the B2 or D phylogroup (Fig. 4 and 5). In fact, it appears that no strain from the B2 phylogroup in either our pathotype database (0/39; 0%) or our phylogroup database (0/195; 0%) harbors ELF on its chromosome, but it is found in all other phylogroups in our pathotype database (Fig. 4 and 5). It was also found in more than 97% of strains from phylogroups A, B1, C, E, and, surprisingly, F in our phylogroup database (Fig. 5B). It is interesting that, like CFA/I fimbriae, ELF is found in phylogroup F, which is more closely related to phylogroups B2 and D, which are generally considered to be more like ExPECs. This all makes ELF unlikely to be a viable vaccine target for either ExPECs or InPECs, but again may offer clues to the niche of phylogroup F strains.
Figure caption (fragment): Distributions of UPEC-associated P and S fimbriae, respectively. Note that only papCDEFH genes were used to differentiate strains that contained a full array of pap genes and those that contain a disrupted pap operon. Distributions were determined using megaBLAST to bin strains from each phylogroup into hit versus no hit.
Vaccine Targets from Comparative Genomics of E. coli Infection and Immunity
Fimbriae of serotype 1C (F1C) are encoded by the focA, focC, focD, focF, focG, focH, and focI genes (97). These fimbriae are associated with uropathogenic strains and selectively bind to glycosphingolipids found on bladder, urethra, and kidney cells (98)(99)(100)(101). They have also been shown to play a role in intestinal colonization by Nissle 1917 (nonpathogenic) (102). Despite the association with uropathogenicity, our search located true F1C fimbria hits in only three strains, Nissle 1917 (nonpathogenic), ABU 83972 (UPEC), and CFT073 (UPEC), although several cross-hits with sfa genes did occur. The three strains with F1C were all members of the ST73 sequence type, which may indicate that F1C is specific to certain sequence types.
F1C fimbriae are closely related to S-fimbrial adhesins, which are encoded by sfa genes and found on PAI III 536 (103). However, they have distinct receptors, with S fimbriae binding sialyl galactosides (104,105). This matches our results, where the foc reference genes from CFT073 (UPEC) showed significant homology with sfa reference genes from the UTI89 (UPEC) strain. The major differences were found in the major subunits focA and sfaA, which shared only 75.7% identity, and the adhesins focH and sfaH, which shared 84.8% identity. The other genes of each locus could be considered alleles of the same gene: the first minor subunits (focI and sfaD) shared 98.7% identity, the second minor subunits (focF and sfaG) shared 99.2% identity, the periplasmic chaperones (focC and sfaE) shared 98.7% identity, and the outer membrane ushers (focD and sfaF) shared 99.6% identity. The second gene downstream of the sfa locus is a previously uncharacterized bona fide papX regulator (100% coverage, 96% identical to reference papX). The sfa locus also contains two other regulatory genes that are like pap genes: sfaB, which shares high similarity with papB, and sfaC, which shares high similarity with papI. This all suggests a connected evolutionary history between these three fimbriae and may link them with uropathogenicity. On a practical level, this relatedness made it difficult to determine whether to score hits as sfa or foc. In the present study, we considered the hits to be hits for the F1C fimbriae if a full-length focG was present and for S fimbriae if sfaS was present. In general, this method agreed with the alternative of assigning the fimbria type according to whether the adhesin was a better hit for focH or sfaH.
However, strain 789 (APEC) presents an interesting problem for either approach: this strain contains what appears to be a focH adhesin in an operon that is predominantly made up of S-fimbrial genes (>99% identical), with the exception of the minor subunit sfaS, which diverged and had only a minor hit (Fig. 4A). This may be an uncharacterized hybrid F1C/S fimbria. For strain 789, hits were included under both the F1C and S fimbria subsections to highlight the relatedness. The divergence between the traditional targets of fimbria vaccines, the major subunits focA and sfaA and the adhesins focH and sfaH, signifies a high probability that F1C and S fimbriae could only be targeted together with a polyvalent vaccine. However, the relatedness of the minor subunits means that they should be explored as potential targets.
P fimbriae are encoded by pap (pyelonephritis-associated pili) genes, are associated with uropathogenic strains, and target glycosphingolipids (101,106). They are part of both the PAI I CFT073 and PAI II CFT073 pathogenicity islands (107). Some studies have found that P fimbriae are associated with 90% of acute pyelonephritis but less than 20% of ABU strains (80). Many ExPEC and NMEC strains contained only a few of the genes required for P-fimbria production, including the major repeating subunit, papA (Fig. 4). In some cases, such as in strains belonging to the ST131 group, full true hits for papA, papB, papI, and papX were found with nearby mobile elements that may explain the absence of the rest of the locus. In others, there were hits for papB and papI that appeared to be cross-hits with some S-fimbria genes (sfaB and sfaC, respectively). Hits for papX were returned in any strain with P or S fimbriae because of the papX homologue found near the sfa operon; in fact, papX was found in over 70% of B2, D, and F strains in our phylogroup database, indicating that it may be ancestral to this lineage.
It is important to note that it is possible that the bias toward certain sequence types (i.e., ST131 and ST95) in our pathotype database may skew these percentages, but the papB/sfaB hit does cover the majority of sequence types found in the ExPEC and NMEC category (Fig. 4). Interestingly, the only ST131 with a full complement of pap genes was NA114, a UPEC strain isolated in India ( Fig. 4) (108). On the other hand, 70% (7/10) of ST95 strains, 67% (2/3) of ST73 strains, and 100% (3/3) of ST127 strains carried the full complement of pap genes (Fig. 4). P-fimbrial genes are only rarely found outside the B2/D/F/G clade in our phylogroup database (Fig. 5).
The afa, foc, pap, and sfa genes are considered markers for ExPEC potential. However, the results from our pathotype and phylogroup databases suggest that these genes are found almost exclusively in the B2, D, F, and G cluster and are found in less than 2% of strains from the A, B1, C, and E cluster (Fig. 5). Their presence in the latter cluster could be explained by horizontal gene transfer (HGT), but their presence in the former could be more complicated. The leading hypothesis is that these genes give a competitive advantage in intestinal colonization (2,49,50). This may be related to the fact that ELF and CFA/I fimbriae are well conserved in phylogroups associated with commensalism or intestinal pathogenesis but lacking in B2 and D strains. This again makes the F phylogroup intriguing because it contains both of those fimbriae but can also carry the ExPEC-associated fimbriae. No matter which proposed evolutionary route for phylogroup diversification turns out to be correct, the presence of ELF and CFA/I in F strains indicates these fimbriae were acquired or lost in the population multiple times in other phylogroups.
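The phylogroup distributions discussed above come from binning strains in each phylogroup into hit versus no hit. A minimal sketch of that tabulation is shown below; the records are hypothetical placeholders rather than results from the phylogroup database, and `prevalence_by_phylogroup` is an illustrative helper name.

```python
from collections import defaultdict


def prevalence_by_phylogroup(records):
    """Percent of strains with a hit, per phylogroup.

    `records` is an iterable of (phylogroup, has_hit) pairs, one per strain.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for phylogroup, has_hit in records:
        totals[phylogroup] += 1
        if has_hit:
            hits[phylogroup] += 1
    return {pg: 100.0 * hits[pg] / totals[pg] for pg in totals}


# Hypothetical per-strain results for one gene set (placeholder data only).
toy_records = [
    ("B2", True), ("B2", True), ("B2", False), ("B2", True),
    ("A", False), ("A", False),
]

prevalence = prevalence_by_phylogroup(toy_records)
print(prevalence)
```

In practice each `has_hit` value would be derived from a BLAST result under some coverage and identity cutoff, and that cutoff determines whether a disrupted operon counts as a hit.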
A polyvalent fimbria vaccine targeting afa, foc, pap, and sfa could potentially target most strains responsible for ExPEC infections. Although these fimbriae are mostly found in the B2, D, F, and G clade, this clade is responsible for the majority of extraintestinal infections. Importantly, such a vaccine could also conceivably prevent long-term colonization by ExPEC, since these adhesins appear to contribute to colonization of the intestines, and many B2 and D strains seem to lack fimbriae found in other intestinal E. coli, such as CFA/I and ELF. This vaccine would also target ExPEC strains from other phylogroups that have acquired them through HGT, such as VR50 (A; ExPEC) and 789 (C; APEC). The most logical targets of such a vaccine would be the adhesin or major subunit, but the overlap between minor subunits of F1C and S fimbriae should be investigated as well.
One major question remains with P fimbriae: which protein to target. Of the highly conserved genes, papB, papI, and papX encode regulatory proteins, excluding them as potential targets (109)(110)(111). The only other highly conserved gene is papA, which encodes the major subunit. However, some studies have shown it to be dispensable for binding when papE is present, unlike papF and papG, which are required for binding (75,112). This presents more questions than answers: conservation of papB, papI, or papX can be explained by trans-regulatory functions (109)(110)(111). The conservation of papA is, at first glance, a mystery. However, upon closer inspection, it appears that papA may be susceptible to transposon insertion, since it appears to have been disrupted multiple separate times based on alignments of papA (see Fig. S4 in the supplemental material). Given this, papH appears to be the best target for vaccine intervention.
Strains from the B2 phylogroup are enriched for iron acquisition genes. Iron acquisition proteins have long been known to be associated with virulence because iron is a limiting essential nutrient for pathogenic bacteria in the host (24, 113-116).
Hypothesizing that the pathogenic pathotypes may harbor a preponderance of such genes, we compared genes encoding siderophores, hemophores, and iron transporters between pathogenic and commensal strains. Our results show that pathogenic strains are enriched for iron acquisition genes (Fig. 6). The three exceptions to this were the nonpathogenic phylogroup B2 strains ED1a, Nissle 1917, and SE15 (Fig. 6A). This general trend of increased iron acquisition may be another explanation for why members of the B2 and D phylogroups are overrepresented in ExPECs, since recent work has shown that iron acquisition genes increase intrinsic virulence (117). Because several pathogenic bacteria lacked these virulence-linked iron acquisition genes on their chromosomes, we also searched their plasmids (Fig. 7).
Figure caption (fragment): The chu operon is responsible for heme uptake. Yersiniabactin, aerobactin, and salmochelin are virulence-associated iron-binding siderophores, whereas enterobactin is a ubiquitous iron-binding siderophore. The percent identity was determined using megaBLAST with the reference genes in Data Set S1 in the supplemental material.
ExPECs tended to have more iron acquisition loci than InPECs (Fig. 6). Among non-APEC ExPECs, only 5 (G749, CI5, MS6198, ST648, and VR50) of 35 (14.3%) strains carry fewer than three of these iron-uptake loci (Fig. 6A). InPECs tended to have few iron acquisition types, with only AIECs and strain 042 (EAEC; D) having more than two of the loci examined (Fig. 6B). All of these trends can be explained by phylogeny: of the five ExPECs deficient in virulence-associated iron acquisition, only one was a B2 strain (Fig. 6A). EHECs/STECs tended to have either the chu heme operon or the yersiniabactin siderophore system, depending on whether they are from the E phylogroup (chu) or the B1 phylogroup (yersiniabactin) (Fig. 6B). This may suggest that EHEC/STEC strains from different phylogroups have different primary sources of iron during infection, since the chu operon is responsible for the uptake of heme, whereas yersiniabactin targets ferric iron (Fe3+) with an affinity that would allow it to steal iron from host iron-bound proteins, including transferrin and lactoferrin (118,119). The A, B1, and C phylogroups lacked chu because, by definition of the phylogroups, they lack chuA. Interestingly, the ETEC pathotype seems to be the most devoid of iron acquisition genes, possibly indicating a close relationship with commensals, which has been noted for some ETEC strains (120).
Vaccine Targets from Comparative Genomics of E. coli Infection and Immunity
[Figure legend] Plasmids for each strain in the pathotype database were obtained and analyzed using megaBLAST against the iron acquisition data set. The percent identity was determined using megaBLAST with the reference genes in Data Set S1 in the supplemental material.
In the phylogroup database, our results for chu show that it is found in 100% of B2, D, E, F, and G strains (Fig. 8A). This is an important control for our method since these phylogroups are characterized by the presence of chuA. Still, the chu system has been shown to play an important role in uropathogenesis (121,122). The distribution of yersiniabactin, encoded by fyuA, also agrees with the results from our pathotype database and highlights the trends more clearly. Over 90% of B2 strains carry yersiniabactin (fyuA), a siderophore found on PAI IV 536 (Fig. 8B) (117). It is surprising to find that such an overwhelming majority of B2 strains carried fyuA because it is a well-known ExPEC-associated gene. In fact, it has been proposed as a gene to differentiate UPEC strains from commensals and other pathotypes (123). While fyuA is only found in 22% (68/303) of B1 strains, 42 of those strains carry the stx toxin, making them EHEC or STEC strains (Fig. 8B). Finding yersiniabactin in many B1 EHEC/STEC strains does make sense, however, since unlike members of the E phylogroup (including the well-known O157 serotype), B1 strains do not have the chu system to extract iron from blood. This could also explain why less than 1% of E-phylogroup strains contain yersiniabactin: the majority of those strains are members of the EHEC/STEC pathotype, and they all have the chu system to extract heme-iron after inducing bloody diarrhea (Fig. 8A and B). This does lead to some obvious questions about selection in cattle, where EHEC/STEC strains are found as asymptomatic intestinal residents. Yersiniabactin does not appear to be highly conserved, but it is enriched in both the B2/D/F/G clade and the C phylogroup. This pattern could be explained by extensive horizontal transfer coupled with the lack of strong positive selection during intestinal colonization and commensalism.
The presence of yersiniabactin in the C phylogroup is probably related to APEC strains from that phylogroup: three of the four strains matching those criteria in our pathotype database contained the siderophore (Fig. 6A).
Aerobactin is a siderophore found on PAI I CFT073 that has been shown to be important for uropathogenesis (124,125). The aerobactin locus is found throughout pathotypes and phylogroups, which is not surprising considering it is found on a mobile element (Fig. 6 and 8). There do appear to be two distinct forms of the operon, with one form being found predominantly in InPEC strains of the B1 phylogroup and the other being found mostly in ExPEC strains of the B2 phylogroup, which raises questions about how or whether this locus is transferred between phylogroups or just within them (Fig. 6). Between these forms of aerobactin, the biggest divergence was found in iutA (89.6% identical to the ExPEC ST131 reference), the gene encoding the aerobactin receptor, which may make it difficult to target with a vaccine. This divergent iutA is predominantly found in strains from the B1 and E phylogroups, but there are instances of it being found in the B2 phylogroup, such as with CFT073 (Fig. 6). This is interesting because CFT073 belongs to ST73, which also contains the ABU 83972 (UPEC) strain and the Nissle 1917 (nonpathogenic) strain, two strains that carry the less-divergent iutA gene (Fig. 6A). There is less divergence in the chromosomal versions of iucC and iucD, which are aerobactin biosynthesis genes (Fig. 6). Of the 61 B1 strains that carry iutA, 89% (54/61) belong to the EHEC/STEC pathotype (stx+). Outside these EHEC/STEC strains, iutA is again found predominantly in the B2, D, and F cluster (Fig. 8C).
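The allele comparison described above rests on percent identity values from megaBLAST alignments against reference genes. The following is a minimal Python sketch of how such hits could be summarized, not the pipeline used in this study. It assumes BLAST tabular output (`-outfmt 6`); the function names `best_identity` and `label`, the example rows, the strain names, and the 95% cutoff used to call a hit divergent are all illustrative assumptions.

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_identity(tabular_text):
    """Return {(reference gene, strain): highest percent identity seen}."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        key = (row["qseqid"], row["sseqid"])
        pident = float(row["pident"])
        if pident > best.get(key, 0.0):
            best[key] = pident
    return best


def label(pident, cutoff=95.0):
    """Label a hit; the 95% cutoff is a hypothetical choice for illustration."""
    return "reference-like" if pident >= cutoff else "divergent"


# Made-up rows: reference gene iutA queried against two hypothetical strains.
example = "\n".join([
    "iutA\tstrain_X\t99.8\t2202\t4\t0\t1\t2202\t100\t2301\t0.0\t4000",
    "iutA\tstrain_Y\t89.6\t2202\t229\t0\t1\t2202\t500\t2701\t0.0\t2800",
    "iutA\tstrain_Y\t85.0\t300\t45\t0\t1\t300\t9000\t9299\t1e-80\t300",
])

hits = best_identity(example)
for (gene, strain), pident in sorted(hits.items()):
    print(gene, strain, pident, label(pident))
```

Keeping only the best hit per gene and strain mirrors the one-value-per-cell layout of an identity heat map; a real analysis would also need to account for alignment coverage, which this sketch ignores.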
Genes encoding aerobactin were also found on numerous plasmids from the B2, F, and C phylogroups and were present in 63% (5/8) of the APEC strains (Fig. 7). The plasmid version of aerobactin appeared to be nearly identical across plasmids, regardless of pathotype or phylogroup, indicating that this plasmid is probably widespread and contributes to virulence (Fig. 7). The plasmid-encoded version was different from the chromosomally encoded version (Fig. 6 and 7). The plasmid-encoded operons all contained the ST131-like iutA gene but differed significantly from the chromosomal version in their biosynthetic genes. This indicates that they probably diverged long ago, but the actual receptor was conserved. The plasmid carrying the aerobactin operon also contained nearly identical operons for the sit (iron-manganese transporter) and iro (salmochelin) iron acquisition genes (Fig. 7). This plasmid is the pS88/pColV plasmid associated with APEC virulence and NMEC strains (126,127).
Iron-regulated element A (ireA) is a siderophore receptor-like protein that is associated with ExPEC strains, implicated in adherence, and found on PAI CFT073 II (128). It appears to be present only sporadically in members of the B2 phylogroup and in strain O2-211 (APEC), which is a member of the recently characterized G phylogroup (Fig. 6). This gene is relatively rare in our phylogroup database, being found in less than 3% of strains in phylogroups A, B1, C, D, and E (Fig. 8D). Even in the phylogroups where it is more common, it is found in only 20% of phylogroup B2 strains and 12% of phylogroup F strains. Surprisingly, it is found in 78% (14/18) of phylogroup G strains in our database (Fig. 8D).
The iron-manganese transporter system encoded by the sit locus was found almost exclusively in the chromosome of the B2/D/F/G cluster and follows a pattern similar to yersiniabactin, except for being less common in the C phylogroup (Fig. 6). Only 4 of the 39 strains (10%) belonging to the B2 phylogroup in our pathotype database lacked this transporter: G749, SF-166, and ZH063 (all ExPECs) and E2348-69 (EPEC), though G749 carried it on a plasmid (Fig. 6 and 7). Distribution of sit in our phylogroup database agrees with this, with sit being present in 88% of B2 strains, 72% of D strains, and 34% of F strains (Fig. 8E).
The salmochelin siderophore is encoded by the iro genes on PAI III 536 and has been implicated in the adherence and invasion of urothelial cells and in virulence in an animal model, and it is upregulated in the presence of human urine (103, 129-133). This locus is found only on the chromosome of a few strains from the B2 phylogroup, mainly from the UPEC (46%; 5/11), NMEC (50%; 3/6), and AIEC (50%; 2/4) pathotypes, from the ST73, ST95, and ST127 sequence types (Fig. 6). Our phylogroup database showed a similar distribution: iroN was found in only 26% of B2 strains and less than 2% of strains from other phylogroups (Fig. 8F).
Iron acquisition genes have a well-known and extensively studied association with uropathogenesis (116, 121, 122, 124, 134-137). Indeed, vaccines targeting siderophores have been proven as a concept in animal models, including mouse models of both E. coli urinary tract infection (UTI) and intestinal colonization by Salmonella (138-145). While some of these studies showed that protective antibodies against siderophores can be generated, the results are not as efficacious as one would hope. This is potentially due to the high level of redundancy in iron acquisition genes seen in uropathogenic strains (135). Still, our work supports previous results suggesting that iron acquisition genes are good targets for an ExPEC vaccine. One obvious bonus to targeting iron acquisition genes is that the risk of targeting commensals may be lower with such a vaccine. This is also supported by evidence that different iron acquisition mechanisms have different levels of importance depending on the type and location of the infection (121,122,129,135). Of course, there is also the risk of redundancy previously mentioned that could make it easy for E. coli to evolve resistance. A solution to this is a polyvalent vaccine. Targeting yersiniabactin, aerobactin, and sit would target many members of the B2, D, and F clade, while targeting ireA and salmochelin would target rarer sequence types and apparently UPECs specifically (Fig. 6 and 8). Targeting aerobactin, sit, and salmochelin would also vaccinate against the pColV plasmid that appears to be associated with ExPECs (Fig. 7).
The B2 phylogroup is enriched for ExPEC-associated toxins. Bacterial protein toxins have long been studied as critical virulence factors driving pathogenesis. In some cases, such as EHEC/STEC and ETEC, pathotype assignment can be made based solely on the presence of certain toxins. While there are toxins known to contribute to ExPEC virulence, no single toxin can ensure an ExPEC phenotype. As in our iron acquisition gene analysis, plasmids were included in the toxin analysis because toxin genes are often carried on plasmids. However, unlike the iron acquisition graphs, there were no overlaps between genes on the chromosome and those on plasmids, so they are incorporated into a single graph (Fig. 9). Our analysis of E. coli for well-known toxins produced some surprising findings.
Cytolethal distending toxin (CDT) is a heterotrimeric genotoxin encoded by the cdtABC locus (146). After the CdtABC complex binds to a host cell, CdtB is delivered intracellularly where it causes double-stranded breaks and death of the host cell (146). In our pathotype database, cdtABC is only found in three B2 phylogroup strains, though two of them were from the ST95 sequence type (Fig. 9). This represented only 20% of ST95 strains in our database. In our phylogroup database, cdtA was also rare; only 19 of 1,348 strains carried it on their chromosome (Fig. 10A). The majority (n = 11) of these were strains from the B2 phylogroup, whereas 5 were from the G phylogroup, 2 were from the D phylogroup, and 1 was from the A phylogroup (Fig. 10A). Of these, the G phylogroup had the highest percentage (28%) of strains containing cdtAB, though there are only 18 strains in this phylogroup in our database (Fig. 10A).
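The contrast between raw hit counts and within-phylogroup prevalence can be made explicit with a short calculation. The sketch below uses the cdtA counts given above together with the phylogroup totals quoted elsewhere in this section (195 B2, 18 G, 81 D, and 457 A strains); the function name `prevalence` is an illustrative assumption.

```python
# cdtA chromosomal hits per phylogroup, as reported in the text.
hits = {"B2": 11, "G": 5, "D": 2, "A": 1}

# Phylogroup sizes in the phylogroup database, as quoted in this section.
totals = {"B2": 195, "G": 18, "D": 81, "A": 457}


def prevalence(hits, totals):
    """Percentage of strains in each phylogroup carrying the gene."""
    return {group: 100.0 * hits[group] / totals[group] for group in hits}


prev = prevalence(hits, totals)
for group, pct in sorted(prev.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {hits[group]}/{totals[group]} = {pct:.1f}%")
```

Although B2 contributes the most hits in absolute terms, the G phylogroup ranks first by prevalence (5/18, about 28%), which is why the small size of that phylogroup matters when interpreting the percentage.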
Cytotoxic necrotizing factor 1 (CNF-1) is a chromosome-encoded deamidase toxin that is associated with UPEC and NMEC strains and is a virulence factor reportedly found in 30 to 40% of ExPEC and diarrheal strains (147). The cnf1 locus was not widely spread in our pathotype database, but it does appear to be associated with ExPEC strains and was found only in strains from the B2 phylogroup (Fig. 9). These strains were from the ST73, ST95, ST127, and ST643 sequence types, though not all strains from these sequence types contained the cnf1 gene. In our phylogroup database, there were only 43 hits for cnf1, and 40 of these were in the B2 phylogroup, but only in a low percentage of the total B2 strains (21%; 40/195) (Fig. 10B). The other three hits were in phylogroups E (n = 2) and C (n = 1). That cnf1 has been reported in 30 to 40% of pathogenic isolates suggests that a disproportionate number of ExPEC and diarrheal infections are caused by a relatively small number of strains, particularly from the B2 phylogroup.
Hemolysin E, also known as cytolysin A or silent hemolysin, is a pore-forming toxin encoded by hlyE that under certain conditions can confer a hemolytic phenotype to E. coli carrying it (148,149). It is found on PAI I 536 and PAI II 536 (103). It belongs to a family of toxins that is also found in Salmonella typhi and Shigella flexneri and is reported to be a virulence factor in ETEC strains (32). It appears to have three chromosomal forms, all of which are more than 95% identical to the reference (Fig. 9). In over half of phylogroup B1 (15/27) and C (3/4) strains the hlyE gene is truncated by a frameshift, with both the 5′ and 3′ ends present and possibly translated (Fig. 9). This includes phylogroup B1 strains in the EAEC/EAHEC pathotype. In 100% of B2 strains (39/39), hlyE is truncated as well, but only the C terminus is present and is annotated as a hemolysin-activating protein, HecB, and it is more than 97% identical to reference hlyE. The full hlyE gene is found in the remaining (A, D, F, and G) phylogroups, as well as the other members of the B1 and C phylogroups. This truncation has been published before (148), but to our knowledge, no connection to phylogroup has been made before the present study, and the phylogroup distribution can explain the results showing a pathotype association.
[Figure legend] The percent identity was determined using megaBLAST with the reference genes in Data Set S1 in the supplemental material.
α-hemolysin, also known as hemolysin A, is a pore-forming cytolytic toxin encoded by genes found on the PAI I 536 and II 536 pathogenicity islands and is only found on the chromosome of a few phylogroup B2 ExPECs and the plasmid of UMNK88 (ETEC; pUMNK88_Hly) (103,150,151). The chromosomal version is specifically found in strains from the ST73 (2/3; not found in Nissle 1917 commensal), ST95 (3/10), and ST127 (2/2) sequence types (Fig. 9). There is, however, another form that is found on the plasmids of many EHEC/STEC strains. This form is divergent enough to not be hit by megaBLAST alignments except for part of hlyB and is found on a plasmid of 82% of the EHEC/STEC strains examined: 78% (7/9) of phylogroup B1 strains and 84% (16/19) of phylogroup E strains. Distribution of chromosomal hlyA in our phylogroup database shows that it occurs almost exclusively in strains from the B2 phylogroup, where it is found in 26% of strains (50/195) (Fig. 10C).
There has been some work on the distribution of hlyA and hlyE, such as by Kerenyi et al. (150), who looked at hundreds of clinical isolates for the presence of hemolytic activity, hlyA, and hlyE (referred to in that study as shaE). These researchers very reasonably concluded that hlyA and hlyE never occur together. Our results suggest an explanation: hlyA occurs almost exclusively in phylogroup B2, and no B2 strain in our pathotype database contains the N terminus of hlyE.
Shigella enterotoxin 1 (ShET1) is encoded by the setA1 and setB1 genes, which are found on the antisense strand of the pic gene in the SHI-1 (she) pathogenicity island. In Shigella, it has been found to induce intestinal fluid accumulation (152). It has been associated with EAEC infections but is also found in many ExPECs (125, 153-155). In our pathotype database, this toxin was found to be highly associated with EAEC strains and found on the chromosome of a few B2 phylogroup members from very diverse pathotypes: Nissle 1917 (nonpathogenic; B2), NC101 (AIEC; B2), ABU 83972 (ABU-UPEC; B2), CFT073 (UPEC; B2), and O2-211 (APEC; G) (Fig. 9). However, three of these five strains (Nissle 1917, ABU 83972, and CFT073) belong to the ST73 sequence type. In B1 EAEC and EAHEC strains, the SHI-1 PAI was duplicated, leading to two identical copies of each gene. It is interesting to note that all EAEC and EAHEC strains in our pathotype database contained very similar alleles of setA1 and setB1 (<0.06 and 0% divergence from the reference, respectively), despite being from diverse phylogroups (B1 and D). This is compared to setA1 and setB1 genes in B2 and G strains, which have roughly 5.6 and 1.6% divergence from the reference, respectively. This implicates genes carried on the SHI-1 PAI in EAEC pathogenesis, since it is likely that these divergent phylogroups acquired it independently, but recently. In our phylogroup database, setA and setB were found at low abundance, but predominantly in phylogroup B2 (Fig. 10D). The most surprising result is that 50% (9/18) of strains from the G phylogroup contained set genes (Fig. 10D).
Uropathogenic-specific protein (usp) is a colicin-like bacteriocin toxin that is associated with UPEC strains and increases virulence in a mouse model of UTIs (156). In our pathotype database, this supposedly UPEC-specific protein is actually highly associated with the B2 phylogroup, being found in all but one B2 strain: E2348/69 (B2; EPEC) (Fig. 9). Our phylogroup database confirms this association: 94% of B2 strains and 10% of F and G strains contain this protein. It is only found in a single phylogroup D strain and none of the strains from the E phylogroup or the A/B1/C cluster (Fig. 10E). This distribution could explain the propensity for B2 strains to cause UTIs, and being so widespread suggests it entered the B2 phylogroup early in its diversification. Its lack of representation in other phylogroups suggests that it is not very mobile or is highly selected against in some contexts. However, if it is not mobile, its presence in 10% (4/41) of the F-phylogroup strains is curious, especially since two of the three sequence types that are known to cause UTIs (ST62 and ST648) are represented in our pathotype database and yet do not contain usp (Fig. 9A) (54).
It is difficult to say whether or not some of these toxins can be used as a vaccine target, since many of them seem to be dispensable for ExPEC virulence. There has been some early success targeting α-hemolysin (hemolysin A) (157). Targeting α-hemolysin and/or CNF-1 could hit a subset of ExPEC strains, particularly some of the more common ExPEC strains from the ST73 and ST127 sequence types (Fig. 9). A vaccine targeting a combination of these proteins has been shown to significantly reduce instances of cystitis in a mouse model and bacterial loads in urine, but not colonization of the kidneys or bladder (158). However, these are both still found in only a small set of B2 strains and appear absent in ST131s. The setAB toxin could potentially be a target against EAECs and some virulent ExPEC sequence types (ST73). Targeting this toxin would also present a unique situation where the protein coded on the opposite strand (pic, an autotransporter) could also be a potential target. The so-called uropathogenic-specific protein (usp) could potentially act as a target for vaccines targeting B2 strains, but our results suggest that commensal B2 strains would also be targeted. Hemolysin E/cytolysin A also may present a B2 target, but it remains to be seen whether the truncated version found in B2 strains (often annotated as hecB) is functional and exported. Of all of these, we believe that α-hemolysin is the most attractive toxin target.
One thing our work does highlight is that the B2 phylogroup is by far the most likely phylogroup to contain these toxins, which does make targeting B2-associated antigens an attractive possibility.
Type Va and Vc secretion systems (autotransporters) are associated with ExPEC strains. The type V secretion system is composed of the autotransporters (Va or AT-1), the two-partner secretion pathway (Vb), and trimeric autotransporter adhesins (Vc or AT-2) (159). This secretion system is made up of secreted and outer membrane proteins involved in adherence and virulence. They consist of an N-terminal signal peptide, a passenger domain, an autochaperone domain, and a C-terminal transmembrane β-barrel domain (160). Other domains are possible, including a lectin-like domain found on the end of the invasion protein, Inv (161,162). Autotransporters are classified based on their domain architecture as AIDA-I, serine protease autotransporters of Enterobacteriaceae (SPATEs), or trimeric autotransporter adhesin (TAA) (160). Overall, autotransporters in our data set were generally associated with ExPEC strains.
Antigen 43 (ang43), an AIDA-I member of the Va pathway, is one of the most abundant phase-varying outer membrane proteins and is encoded on PAI III 536 and PAI I CFT073, along with the genes encoding the Vat or Sat autotransporter, respectively (103,125,163). This gene is also known as flu for fluffing because it promotes aggregation between cells and colonization of mouse bladders. Some evidence suggests it may be beneficial to UPEC strains (164). Our results suggest that ang43 is widespread, with no apparent phylogroup or pathotype association (Fig. 11). There does appear to be an EAEC- and EAHEC-specific allele. This allele is found in all EAEC and EAHEC strains, regardless of whether they are from the D or the B1 phylogroup (Fig. 11), and appears to be on the same genomic island (GI 3 in strain 042) that carries pic, set1A, and set1B and two type VI secretion systems (see Fig. S4 in the supplemental material).
EhaA is an autotransporter identified as being associated with the O157:H7 serotype and important for adhesion and biofilm formation (165). However, our results suggest that ehaA is more generally associated with EHEC/STEC, EAEC, and EAHEC (Fig. 11). It is also possibly associated with EPEC strains but appears to be absent in the B2 EPEC strain (E2348/69).
EhaB, also called UpaC, is found in two forms: one (labeled upaC; ST131 is used as a reference) is found predominantly in ExPEC strains and B2 commensal strains (Nissle 1917 and SE15) (Fig. 11). However, it is absent in most AIEC strains (Fig. 11B). The other form, designated ehaB and using the Sakai (EHEC) strain as a reference, is found in 100% of O157 (15/15; all EHEC/STECs) and O55 (3/3; 1 EHEC/STEC and 2 EPECs) serotypes (Fig. 11). It is also found in the B2 phylogroup EPEC strain but in none of the B2 members of the ExPEC pathotype (Fig. 11). Interestingly, it is also found in three of the four (75%) AIEC strains (Fig. 11B). This means that four of the five InPEC members of the B2 phylogroup carry this allele, while it is found in none of the 3 B2 commensals or 32 B2 ExPEC strains. These results suggest that ehaA and ehaB are important virulence factors for intestinal pathogenesis and that ehaB may be an AIEC-associated virulence factor.
Another AIEC- or InPEC-specific finding in our results is the profound changes in upaB, which encodes an autotransporter that binds to fibronectin and glycosaminoglycans (166). It has been shown to promote uropathogenesis and colonization of the bladder in a mouse model (167). Our results suggest upaB is found widely throughout ExPEC and nonpathogenic B2 strains, with only 2 of 35 of these strains missing the gene (Fig. 11). However, the two commensal strains that carry this gene are from the ST73 and ST131 sequence types and appear to have lost virulence factors on their way to becoming nonpathogenic. In contrast to ExPEC B2 strains, all five B2 InPEC strains (four AIEC and one EPEC) lacked a similar upaB (Fig. 11B). In the AIEC strains, the upaB gene contained either a premature stop codon (LF82 and UM146) or a 189-bp insertion (NC101 and NRG 857C), explaining why this potential connection has not been investigated. The B2 EPEC strain E2348/69 did not elicit even a partial hit.
Several autotransporters appear to be tightly linked in the ST131 group: espC ST131, tsh ST131, upaG ST131, and upaH ST131 (Fig. 11). Outside ST131 strains, these autotransporters only appear sporadically (Fig. 11). In fact, espC ST131 and tsh ST131 are only found in one strain outside ST131 strains in our pathotype database: CI5 (UPEC; B1) (Fig. 11). However, closer examination shows that this appears to be caused by remarkable diversity in the upaG ST131 and upaH ST131 genes. UpaH is a Va AIDA-I autotransporter that is involved in biofilm formation and colonization (168,169). UpaG is a Vc trimeric autotransporter adhesin that has been shown to promote biofilm formation and adherence to host matrix and abiotic surfaces (170). These genes are highly conserved, but alignments show a large degree of variation from sequence type to sequence type, mainly in the middle of the sequence. This has been noted in both genes (169,171). To reflect this, upaG and upaH are annotated to indicate they used ST131 references. This could mean that UpaG and UpaH could be useful for quickly determining whether an isolate belongs to virulent sequence types (such as ST131 or ST73) using multiplex PCR. As a vaccine target, UpaG and UpaH may lack specificity; however, for UpaH, conserved regions have been identified throughout phylogroups and work on a vaccine is under way (168, 169). The question remains as to whether UpaH is involved in colonization of other areas, since it is found in pathotypes (e.g., EHEC) that do not colonize the bladder.
Two SPATEs, EspC ST131 and Tsh ST131, are found almost exclusively in ST131 strains, with the interesting exception of also being found in the CI5 strain (B1; UPEC) (172, 173) (Fig. 11A). However, while these were annotated as espC and tsh by the VFDB and some annotation software, it appears that this annotation may be incorrect. Using these references, we saw no hits in any of our EPEC strains or APEC strains, where research on EspC and Tsh has been done, respectively (27, 173-175) (Fig. 11). Closer examination revealed that espC from E2348/69 was only 60% identical to this reference, so a second espC (espC E2348/69) was included, while tsh from APEC strains was only ~41% identical to tsh ST131. In fact, tsh ST131 appears to be related to adcA (CBG90828; 70% identical; 72% coverage) and espC ST131 is related to a putative autotransporter gene (CBG91787; 87% identical; 100% coverage) from Citrobacter rodentium (176-178). The espC E2348/69 gene is found on integrative element 5 (IE5), and the surrounding genes do not match those of espC ST131, while tsh ST131 is found only two genes upstream of espC ST131. The only strain carrying espC E2348/69 was the E2348/69 strain (Fig. 11). This is probably because E2348/69 is the only B2 EPEC strain in our database, and this gene appears to be isolated to LEE-containing B2 strains in our phylogroup database. This would explain the results by Mellies et al. (173), who only found these genes in a subset of EPEC strains. The fact that tsh ST131 and espC ST131 are specific for ST131s does make them potential vaccine targets, but more work must be done to determine whether the proteins they encode contribute to virulence and induce an antibody response.
FIG 11 Pathotype distribution of autotransporters. (A and B) Percent identity results from megaBLAST alignments for nonpathogenic and ExPEC strains (A) and InPEC strains (B). SPATEs, serine protease autotransporters of Enterobacteriaceae. The percent identity was determined using megaBLAST with the reference genes in Data Set S1 in the supplemental material.
Vacuolating autotransporter toxin (Vat) has been shown to contribute to uropathogenesis (179). Our data show that it is only found in the B2 and G phylogroups. It is found in the well-known ST73 and ST95, as well as lesser-known sequence types, but not ST131 strains. It should also be noted that sometimes this vat gene is mistakenly annotated as temperature-sensitive hemagglutinin, tsh (or the closely related hemoglobin-binding protease, hbp) or sepA, which encodes a Shigella virulence factor (180,181). Of these genes, only sepA was found on the chromosome of one strain examined in our database, though tsh/hbp was found on the plasmid of 50% (4/8) of APEC strains, and sepA was found on the plasmid of all B1 EAEC and EAHEC strains. To verify each vat hit, we examined each hit for a marA/papX regulator (vatX) immediately downstream and the yag operon roughly 4 kb upstream (179). In our phylogroup database, vat is found exclusively in phylogroups B2 (53%; 103/195) and G (78%; 14/18) (Fig. 12A). Vat is a potential target, as it is found in many of the common ExPEC sequence types (ST73, ST95, and ST127), and it elicits an antibody response in UTI patients infected with vat+ E. coli (179). The largest downside is that targeting Vat would not target ST131, a major cause of ExPEC infections worldwide.
Secreted autotransporter toxin (Sat) is another vacuolating cytotoxin implicated in uropathogenesis (182,183). Like Vat, it also elicits a strong antibody response, but unlike vat, it is also found in ST131 and ExPEC strains from the D and F phylogroups in our pathotype database (Fig. 11) (183). It is also more widely distributed in our phylogroup database, where it is found in 47% of B2 strains (92/195), 27% of D strains (22/81), and 12% of F strains (5/41) (Fig. 12B). One interesting note is that sat is found in only one of the 457 phylogroup A strains in our phylogroup database: VR50. While sat and vat are each found in only about 50% of B2 strains, 87% (170/195) of B2 strains carry one or the other (or both), making a polyvalent vaccine targeting both an intriguing prospect (Fig. 12B). Such a hypothetical vaccine would provide protection against 86% (30/35) of the ExPEC, UPEC, and NMEC strains in our pathotype database, while only targeting 11% (1/9) of the nonpathogenic strains.
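The coverage of a two-antigen vaccine is the union of the strains carrying either target, so the reported counts also imply how many B2 strains carry both genes. The sketch below works through that arithmetic using the counts reported for sat (92/195), vat (103/195), and either gene (170/195). The overlap it derives is implied by those counts rather than stated in the text, and the function name `union_coverage` and the five-strain toy table are illustrative assumptions, not data from this study.

```python
def union_coverage(carriers_by_target, all_strains):
    """Fraction of strains carrying at least one of the targeted genes."""
    covered = set().union(*carriers_by_target.values())
    return len(covered & set(all_strains)) / len(all_strains)


# Inclusion-exclusion on the reported B2 counts:
# |sat or vat| = |sat| + |vat| - |sat and vat|
n_b2, n_sat, n_vat, n_either = 195, 92, 103, 170
n_both = n_sat + n_vat - n_either
print(f"B2 strains implied to carry both sat and vat: {n_both}")
print(f"B2 coverage of a sat+vat vaccine: {100 * n_either / n_b2:.0f}%")

# Hypothetical presence/absence table showing the same idea with sets.
strains = ["s1", "s2", "s3", "s4", "s5"]
carriers = {"sat": {"s1", "s2"}, "vat": {"s2", "s3", "s4"}}
toy_coverage = union_coverage(carriers, strains)
print(f"Toy coverage: {toy_coverage:.0%}")
```

The small implied overlap is what makes the pair attractive: because few strains carry both genes, adding the second antigen extends coverage rather than duplicating it.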
The invasion-like autotransporter fdeC binds to human epithelial cells and contributes to colonization of kidneys and bladder in an animal model, and it may be protective as a vaccine target (184). In our pathotype database, the gene encoding the FdeC adhesin is found in nearly every strain (Fig. 11). It appears that certain alleles of fdeC may be associated with intestinal lifestyles, although these allelic trends appear to fall along A/B1/C/E and B2/D/F/G clusters (Fig. 11). However, the fact that fdeC genes are found throughout almost all strains makes the protein it encodes a less ideal vaccine target.
The accessory colonization factor encoded by sslE is a zinc-metalloprotease with mucinase activity (185). The sslE gene is found across all phylogroups but is mostly lacking in the EHEC/STEC pathotype. This is even seen in EHEC/STEC strains that belong to the B1 phylogroup, despite B1 strains in other pathotypes still having this gene (Fig. 11). This trend is also seen in our phylogroup database (Fig. 12C). Roughly 75% of stx+ B1 strains carry sslE compared to 96% of B1 strains lacking stx (see Fig. S6). In the E phylogroup, only 1% of stx+ strains carry sslE compared to 56% of strains lacking stx (see Fig. S6). This may indicate that sslE is either not required or selected against during EHEC infections.
Interestingly, inoculation with FdeC and SslE has been shown to be protective against UPECs (184,186). However, our results show that both of these targets could produce significant off-target effects considering how broadly they are conserved in commensals (Fig. 11).
The pet and pic autotransporters are normally associated with EAEC strains (32). In our data set, pet is found almost exclusively on the chromosomes of EAEC and EAHEC strains from the B1 phylogroup and, strangely, on the chromosome of ED1a (B2; commensal). However, it is truncated and predicted to be expressed in two parts (Fig. 11). The 042 (D; EAEC) strain lacks pet on its chromosome, but a divergent form annotated as pet and picked up by our reference pet is found on its plasmid (data not shown). The pic gene is associated with EAEC as well, and all six EAEC or EAHEC strains share a nearly identical allele: EAECs are 97% identical to the reference, while EAHECs are 96.9% identical. There is an allele of pic that is found in B2 strains, specifically those belonging to ST73 (Fig. 11). The only other instance is also the most divergent allele, which is found in NC101 (AIEC; B2).
The invasin-like protein SinH appears to be strongly associated with ExPEC strains, but the distribution results from our phylogroup database suggest this association is more likely with the B2/D/F/G cluster than specifically with ExPEC strains (Fig. 11 and 12), although this does not rule out a contribution to virulence. Outside of ExPEC strains, B2 commensal strains also carry this gene, but in ED1a the beta-barrel domain that links the protein to the outer membrane is missing due to a premature stop codon (Fig. 11). There also appears to be an ST131-specific allele that is relatively divergent (ca. 88 to 90% identical) from the sinH found in other ExPECs (see Fig. S7). There are two significant hits in the RM13514 (EHEC/STEC) and RM13516 (EHEC/STEC) strains, but the SinH in RM13514 is missing the beta-barrel domain, and the SinH in RM13516 is very divergent and aligns poorly with the other SinHs on the protein level (see Fig. S7). Our phylogroup database verifies the association with the B2 phylogroup, where it is found in 98% of strains, and shows that sinH is also strongly associated with the F and G phylogroups (100%) and, to a lesser extent, with the D phylogroup (48%) (Fig. 12).
Other adhesins and miscellaneous virulence genes are either ubiquitous or enriched in B2 strains. The locus of enterocyte effacement (LEE) carries a type 3 secretion system related to virulence, is known to be associated with EPEC and EHEC/STEC strains, and is responsible for their characteristic attaching-and-effacing (A/E) phenotype (187). The LEE-encoded attachment protein intimin is a product of eae (also called eaeA) and is responsible for the characteristic attachment phenotype of EHEC and EPEC, and our results are in agreement (Fig. 13). Another adhesin, EaeH, was first identified in ETEC. It is upregulated when ETEC strains interact with epithelial cells and promotes adhesion and toxin delivery (188,189). However, it seems this adhesin is ubiquitous, since it is found in all but five strains, though there may be some allelic differences in eaeH (Fig. 13). Unlike most molecules that show allelic differences roughly along the A/B1/C/E and B2/D/F/G clades, eaeH appears to have a B2-specific allele. The percent identity was determined using megaBLAST with the reference genes found in Data Set S1 in the supplemental material.
Porcine A/E-associated protein (encoded by paa) is an LEE-encoded adhesin that is required for EHEC infections (190). It has been found to be more immunogenic than intimin, and it confers a slight protective effect against colonization of EHEC strains in mice when used as a vaccine (191). Our results show that it is predominantly found in LEE-carrying pathotypes, though it appears to be missing in some B1 EHEC/STEC strains and the B2 EPEC strain (Fig. 13).
The invasion of brain endothelium A (ibeA) gene encodes a protein that has been shown to be important for invasion of the blood-brain barrier (192). It has also been shown to be an important virulence factor in APEC strains (192). Our pathotype results largely support these assertions, but the gene appears to be found only in B2 strains of these pathotypes: 3/5 B2 strains in NMEC and 3/3 B2 strains in APEC (Fig. 13). The F phylogroup strain (CE10) and two of the B2 NMEC strains lacked ibeA, and none of the phylogroup C APEC strains carried the gene (Fig. 13). Interestingly, all four AIEC strains in our database contained ibeA, which supports recent work showing that this gene is important for intestinal colonization, cellular invasion, and macrophage survival of the NRG857 (AIEC) strain (Fig. 13B) (33). In our phylogroup database, ibeA is only found in phylogroups B2 and F, which may further explain why most AIEC strains are from the B2 phylogroup (Fig. 14A).
The malX gene encodes a phosphotransferase system II enzyme and is associated with ExPEC strains (193)(194)(195). It may also increase persistence in the intestines (156). Our results show that it is highly conserved in B2 strains and, to a lesser degree, in pathogenic B1 strains from the UPEC and EHEC/STEC pathotypes (Fig. 13).
The iron-regulated homologous adhesin encoded by the iha gene is part of the PAI CFT073 I and PAI 536 II pathogenicity islands and, unlike the P fimbriae, appears to be more widespread (107,196). This suggests that iha has weaker selection against it or greater selection for it. Iha has been shown to be important for colonization of the kidneys and bladder in a mouse model of infection (196). In our work, it is found widely in B2 strains and pathogenic strains in general (Fig. 13). An interesting exception is that it is not found in either ST95 or ST127 B2 strains, two common sources of ExPEC infections. In EHEC/STEC strains, which are all from the B1 and E phylogroups, it appears highly conserved (Fig. 13). However, it is not found in any of the EPEC strains, which may suggest that the acquisition of an iha PAI in addition to stx is important for EHEC/STEC pathogenesis and differentiates EHEC/STEC strains from EPEC strains (Fig. 13B). It is also lacking in B2 strains from the AIEC, NMEC, and APEC pathotypes, which may be related to their lifestyle (Fig. 13A). In our phylogroup database, it was found in a relatively diverse set of phylogroups and especially enriched in phylogroup E (Fig. 14B).
One intriguing thing about targeting Iha is that, outside EHEC/STEC strains, the gene that encodes it is often found on the same pathogenicity island (e.g., PAI 536 II) as sat and iutA (aerobactin), which have already been mentioned as potential vaccine targets. Targeting any of these alone has a serious drawback in that they are not present in the majority of the ST95 strains examined in our pathotype database. However, as with our speculations on combining Sat and Vat to cover the majority of ExPEC strains, Vat combined with either Iha or IutA appears to confer similar or greater coverage.
The Hek adhesin is encoded by a gene that is part of the PAI 536 II pathogenicity island and is involved in autoaggregation, hemagglutination, heparin binding, adherence, and invasion of host cells in NMEC strains (103,197,198). Despite being found on the same pathogenicity island as iutA, sat, and iha, it has apparently been lost in most of those strains in our pathotype database (Fig. 13). It is only found sporadically in B2 strains, specifically ST127, and occasionally ST73 and ST95 strains. In our phylogroup database, it is found in low abundance in all phylogroups examined (Fig. 14C). The relatively low number of strains carrying this protein makes it less suitable as a vaccine target.
The prophage-encoded increased serum survival gene (iss) and bor genes have been associated with ExPEC infections for at least four decades. The proteins encoded by these genes are known to be surface exposed and have been shown to help E. coli resist the host complement system and to elicit an immune response in avian hosts (199)(200)(201). Since the proteins encoded by iss and bor are more than 90% identical, we have not distinguished between the two here (202). Both were first found on plasmids, but more recent work has shown that they are also carried on the chromosome (203). Our results suggest that iss/bor are widespread and found in all pathotypes (Fig. 13). There appear to be no phylogroup or sequence type trends, since these genes are found sporadically across different sequence types and are found often in all phylogroups (Fig. 13 and 14D). This is most likely because they are encoded by prophages, and it means that they are unlikely to be suitable vaccine targets.
Commensalism and vaccine targets. Our results show that the majority of virulence factors (VFs) and markers for extraintestinal pathogenesis are associated with the B2 phylogroup or the B2/D/F/G cluster. In many cases, these VFs are found exclusively (or nearly exclusively) in these phylogroups. In other cases, the VFs are found predominantly in the B2/D/F/G cluster but also in a minority of strains from other phylogroups. This suggests that the B2/D/F/G cluster is the dominant source of pathogenic strains, which is known, although estimates of the extent to which they are predominant vary. A similar conclusion with B2 and D strains was reached over 20 years ago in an experimental determination of the lethality of 82 strains from the E. coli reference (ECOR) collection cross-referenced against the presence of seven virulence determinants (204). There, researchers found that these virulence factors and lethality correlated with the B2 phylogroup (this was before phylogroups F and G were described). Our work not only verifies this but expands it to a wider range of virulence factors and, importantly, also shows that the B2/D/F/G cluster is the dominant reservoir for extraintestinal pathogenic virulence factors in general.
A major implication of this is that B2/D/F/G isolates that contain virulence factors should be looked at with suspicion but, without a comprehensive and in-depth overview of virulence factors, it is hard to say whether or not a strain from this cluster can cause disease. This is easy to see with the three nonpathogenic B2 strains in our pathotype database: Nissle 1917, SE15, and ED1a. Nissle 1917 is an ST73 strain, like ABU 83972 (ABU) and CFT073 (UPEC), and SE15 belongs to ST131, of which we have numerous examples in our pathotype database. In most cases, these commensal strains contain the same virulence factors as pathogenic B2 strains, with some notable exceptions. Compared to other ST73 strains, Nissle 1917 is missing hlyA (toxin), full-length pap (fimbriae), cdiAB (autotransporter), fdeC (autotransporter), upaB (autotransporter), and salmochelin (iron acquisition). With ST131 strains of the fimH41 subclass, SE15 has lost the partial pap (fimbriae) genes seen in other ST131 strains, as well as agn43 (autotransporter), sat (autotransporter), aerobactin/iutA (iron acquisition), and iha (adhesion/iron acquisition), all proteins found on PAI CFT073 I (Fig. 15A). The loss of these genes appears to abolish pathogenicity in a strain from a highly pathogenic sequence type. It is important to note that most cursory overviews of these B2 commensals would conclude that they have high pathogenic potential, especially if they were compared to strains outside their phylogroup or sequence type. For example, Nissle 1917 still carries kpsTM, K5 biosynthetic genes, pic, sat, vat, foc fimbriae, papA, fyuA (yersiniabactin), iutA (aerobactin), ireA, iha, malX, and iss. It also still carries the pks genotoxin island (205). It could reasonably be concluded that this virulence fingerprint comes from a strain with a high probability of causing disease, and yet Nissle 1917 has been used as a probiotic to successfully treat intestinal diseases for a century (206).
Another implication from our work is that most of the ExPEC-associated VFs are found widely and predominantly in B2 and D strains (Fig. 15B). This suggests that strains from the A/B1/C cluster should be viewed with greater suspicion when clinical isolates are shown to contain virulence factors that are generally found in the B2/D/F/G cluster. This is because for these strains to have these genes, they almost certainly had to acquire them through horizontal gene transfer, and the gene then most likely had to be selected for in order to become fixed in the population. With a gene that confers an advantage during infection, increased virulence is the simplest mechanism of fixation.
However, pathotype associations from this work need to be interpreted with care, since several pathotypes we catalogued have few representative strains and are therefore at greater risk of selection bias. There is, of course, the same risk of bias with our phylogroup associations, but we feel it is much less likely because of the much larger sample size and because our phylogroup database comprises all complete E. coli genomes currently available in NCBI's RefSeq database.
Our work also suggests that vaccines targeting VFs will likely target many strains in the B2/D/F/G cluster, even nonpathogenic ones. Although it has been postulated that an ideal vaccine candidate would only target pathogenic strains, targeting all B2/D/F/G strains is not necessarily a dead end (13). B2 strains are now common commensals in the developed world, but much rarer in older data sets and in the developing world, and there is no indication that the lack of B2 strains in the gut microbiota has been detrimental to health (207,208). In fact, the opposite may be true since microbiota in developed countries tend to be less diverse (209). Accepting this fact leads to the obvious conclusion that specifically targeting B2 strains may open new avenues for vaccine research because while we have struggled to clearly define the ExPEC pathotypes, phylogroups can be clearly delineated. A successful B2 or B2/D/F/G cluster vaccine could potentially eliminate a large source of both virulent strains and a virulence factor reservoir.
Along these lines, the phylogroup distribution of what we believe are the vaccine targets with the most potential, given their prevalence in different pathotypes and phylogroups, is shown in Fig. 16. These potential targets are predominantly found on the surface, but it should be noted that the proposed toxin targets are secreted. While many successful vaccines against other organisms have targeted secreted toxins, the multifaceted nature of E. coli virulence means that a monovalent toxin vaccine may reduce the severity of the disease rather than completely prevent an infection. However, a polyvalent vaccine targeting several of these proposed proteins, including toxins, would target multiple mechanisms of virulence such as adherence or immune evasion. This would increase the chances that an infection could be completely prevented rather than simply abated. Such an approach would also make developing resistance while maintaining virulence difficult (Fig. 16A). Targeting uropathogenic-specific protein (Usp), yersiniabactin receptor (FyuA), group 2 capsule (KpsTM), vacuolating toxins (Sat/Vat), and iron/manganese transporter (SitA) would target the vast majority of strains from the B2 and D phylogroups, which are more intrinsically virulent. They would do this while avoiding strains from phylogroups more common among commensals unless those strains have acquired these virulence factors (e.g., strain VR50) (Fig. 16B). Other, less widespread targets such as α-hemolysin (HlyA), afimbrial adhesin (Afa), F1C fimbriae (Foc), P fimbriae (Pap), S fimbriae (Sfa), and aerobactin receptor (IutA) make attractive auxiliary targets because even though they are rare, they have been shown to play a major role in pathogenicity (Fig. 16C).
This information is being used to guide polyvalent vaccine development in our lab, but these genes and their distribution should also be useful when determining the potential pathogenicity of isolates, as well as in further elucidating the pathogenicity of E. coli.
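The coverage argument behind combining targets (for example, Sat with Vat, or Vat with Iha or IutA) can be illustrated with a minimal sketch. The strain names and presence/absence calls below are hypothetical toy data, not values from our databases, and the helper names `coverage` and `best_combinations` are introduced here only for illustration. The sketch scores a set of targets by the fraction of strains that carry at least one of them.

```python
from itertools import combinations

def coverage(presence, targets):
    """Fraction of strains carrying at least one of the given targets."""
    if not presence:
        return 0.0
    wanted = set(targets)
    covered = sum(1 for genes in presence.values() if genes & wanted)
    return covered / len(presence)

def best_combinations(presence, candidates, size):
    """Rank every target combination of a given size by coverage (highest first)."""
    scored = [(coverage(presence, combo), combo)
              for combo in combinations(sorted(candidates), size)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored

# Hypothetical presence/absence calls: strain -> set of target genes detected.
TOY_PRESENCE = {
    "strain1": {"sat", "iha", "iutA"},
    "strain2": {"vat"},
    "strain3": {"vat", "iutA"},
    "strain4": {"sat"},
    "strain5": set(),
}

if __name__ == "__main__":
    for score, combo in best_combinations(TOY_PRESENCE, ["sat", "vat", "iha", "iutA"], 2):
        print(f"{'+'.join(combo):12s} {score:.0%}")
```

In this toy set, neither sat nor vat alone covers more than two of the five strains, whereas the pair covers four, which mirrors the rationale for a polyvalent design. Real presence/absence calls would come from the percent identity tables, after choosing an identity threshold for calling a gene present.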

MATERIALS AND METHODS
Comparative genomics. Alignments of virulence factors and strains were created using the megaBLAST plugin for Geneious Prime 2019.2.3 with a 1/-2 match/mismatch scoring matrix, a linear gap score, and a maximum E value of 1E-10. Where necessary, a culling limit of 1 was used to prevent too many off-target hits. Percent identity outputs were exported as a spreadsheet and used to produce heatmaps in GraphPad Prism version 9.0.0 for Windows (GraphPad Software, San Diego, CA).
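As an illustration of how exported hits can be reduced to a strain-by-gene percent identity matrix for a heatmap, the following minimal sketch filters hits by the stated E value cutoff and keeps the best identity per strain and gene. The tab-separated layout with `strain`, `gene`, `pident`, and `evalue` columns is an assumption made for this example and is not the actual Geneious export format; the strain names and values are hypothetical.

```python
import csv
import io
from collections import defaultdict

MAX_EVALUE = 1e-10  # maximum E value stated in the methods

def best_identity(rows, max_evalue=MAX_EVALUE):
    """Return {strain: {gene: best percent identity}} for hits passing the cutoff."""
    best = defaultdict(dict)
    for row in rows:
        if float(row["evalue"]) > max_evalue:
            continue  # discard hits weaker than the cutoff
        pident = float(row["pident"])
        strain, gene = row["strain"], row["gene"]
        if pident > best[strain].get(gene, 0.0):
            best[strain][gene] = pident
    return {strain: dict(genes) for strain, genes in best.items()}

def to_matrix(best, strains, genes):
    """Rows are strains, columns are genes; 0.0 marks 'no qualifying hit'."""
    return [[best.get(s, {}).get(g, 0.0) for g in genes] for s in strains]

# Hypothetical hit table (assumed column layout, tab separated).
EXAMPLE_TSV = (
    "strain\tgene\tpident\tevalue\n"
    "strainA\tsinH\t98.2\t1e-50\n"
    "strainA\tsinH\t88.0\t1e-30\n"
    "strainA\tpic\t75.0\t1e-3\n"
    "strainB\tsinH\t89.5\t0.0\n"
)

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t")
    table = best_identity(reader)
    for row in to_matrix(table, ["strainA", "strainB"], ["sinH", "pic"]):
        print(row)
```

Keeping only the top-scoring hit per strain and gene loosely parallels the effect of a culling limit of 1, although the culling option in BLAST operates on overlapping query ranges rather than on this simplified table.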
The virulence factor database was created using the Virulence Factor Database, Victors, and PATRIC virulence databases as a backbone and expanded upon with literature (56)(57)(58). Strain and phylogroup databases were constructed as described below. Virulence factors and sequences can be found in Data Set S1 in the supplemental material.
RefSeq phylogroup database construction. Escherichia coli strains for the phylogroup database were retrieved from NCBI's RefSeq database (accessed 1 September 2020) with the following query string: Search all[filter] AND "Escherichia coli"[ORGN] AND latest[filter] AND "complete genome"[filter] AND (all[filter] NOT "derived from surveillance project"[filter] AND all[filter] NOT anomalous[filter]) Sort by: SORTORDER. The 4,105 strains were trimmed to 1,351 strains above 3.5 megabases. These strains were classified by phylogroup using in-house methods (see Appendix S1 in the supplemental material). Briefly, megaBLAST was used to align short sequences derived from Clermont phylotyping fragments. Strains were binned based on hits or no hits to each fragment (61). The resulting pattern of "hit/no hit" across the four main fragments was used to determine phylogroup (see the Fig. S1 legend for more information). Phylogroup E, C, and G strains were classified next based on the presence of the arpA_gpE, trpA_gpC, and ygbD_G PCR fragments, respectively (55,61,210). For more information, see Appendix S1 in the supplemental material. A breakdown of strains in our phylogroup data can be found in Fig. S8. Accession numbers for strains divided by phylogroup can be found in Data Set S2 in the supplemental material.
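The binning step can be sketched as a lookup on the hit/no-hit pattern. The sketch below is not the in-house procedure described in Appendix S1. It assumes the four main fragments are arpA, chuA, yjaA, and TspE4.C2 and uses the published Clermont quadruplex genotype table as the lookup, followed by refinement with the arpA_gpE, trpA_gpC, and ygbD_G fragments. Applying the ygbD_G refinement to strains that would otherwise be called F or B2 is a further assumption of this example, as is the function name `classify`.

```python
# Preliminary calls keyed by (arpA, chuA, yjaA, TspE4.C2) hit/no-hit.
# Assumption: lookup follows the published Clermont quadruplex table.
QUADRUPLEX = {
    (True, False, False, False): "A",
    (True, False, False, True): "B1",
    (False, True, False, False): "F",
    (False, True, True, False): "B2",
    (False, True, True, True): "B2",
    (False, True, False, True): "B2",
    (True, False, True, False): "A/C",
    (True, True, False, False): "D/E",
    (True, True, False, True): "D/E",
    (True, True, True, False): "E/clade I",
}

MAIN_FRAGMENTS = ("arpA", "chuA", "yjaA", "TspE4.C2")

def classify(hits):
    """Assign a phylogroup from a {fragment name: hit?} mapping.

    Fragments missing from the mapping are treated as 'no hit'.
    """
    pattern = tuple(bool(hits.get(name, False)) for name in MAIN_FRAGMENTS)
    call = QUADRUPLEX.get(pattern, "unknown")

    # Refinement with the group-specific fragments named in the methods.
    if call == "A/C":
        return "C" if hits.get("trpA_gpC") else "A"
    if call == "D/E":
        return "E" if hits.get("arpA_gpE") else "D"
    if call == "E/clade I":
        return "E" if hits.get("arpA_gpE") else "clade I"
    if call in ("F", "B2") and hits.get("ygbD_G"):
        return "G"  # assumption: G strains otherwise type as F or B2
    return call

if __name__ == "__main__":
    example = {"arpA": True, "chuA": True, "arpA_gpE": True}
    print(classify(example))
```

Patterns that are absent from the lookup are reported as "unknown" rather than forced into a phylogroup, so that such strains can be inspected manually.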
Strain curation. Phylogroup and pathotype breakdown of strains in our pathotype database can be found in Fig. S9 in the supplemental material.
The extraintestinal pathogenic E. coli strains (ExPECs) refer to several pathotypes that cause disease outside the intestines; the major pathotypes are uropathogenic E. coli (UPEC), neonatal meningitis E. coli (NMEC), and avian pathogenic E. coli (APEC) (25). There are dozens of virulence factors said to be associated with ExPEC strains, though our results suggest that most of these may actually be found in the majority of strains (25). In the present study, ExPEC strains are labeled as ExPECs unless there is published evidence for the strains to be sorted into the UPEC, NMEC, or APEC pathotypes, even if a strain was collected from the urine of an infected patient (Table 2). This may mostly affect UPEC strains, since the terms ExPEC and UPEC are often used interchangeably. Of the 18 general ExPEC strains, 16 belong to the B2 phylogroup, 1 belongs to the D phylogroup, and 1 belongs to the F phylogroup. Of these strains, 12 belong to the ST131 sequence type. Given the highly virulent nature of ST131 strains, they were all considered ExPECs, with three exceptions: SE15, EC958, and NA114. SE15 was isolated from a healthy adult, has been shown to lack virulence, and lacks many virulence genes seen in other ExPEC strains, so it is classified as nonpathogenic (Table 1) (211). EC958 and NA114 were classified as UPEC strains because they were isolated from urine and carried UPEC-associated virulence factors not seen in other ExPECs: the selC genomic island in EC958 and the full-length pap fimbriae in NA114 (212,213).
UPECs are characterized by their ability to multiply intracellularly in the urinary tract (25,214).
Strains that belong to the NMEC pathotype can survive in the bloodstream and invade the meninges of infants (25). NMEC strains are generally considered to be difficult to distinguish from commensals (215).
Avian pathogenic E. coli (APEC) causes colibacillosis in poultry and other avian species and has many of the same virulence factors that are found in ExPEC strains (26).
Four strains belong to the adherent-invasive E. coli (AIEC) pathotype. This pathotype was first described in 1999 with the LF82 strain isolated from the ileal mucosa of a Crohn's disease (CD) patient (216). AIECs have since been associated with inflammatory bowel disease, are characterized by the ability to invade intestinal epithelial cells and replicate in macrophages, and have been found to bind receptors overexpressed in CD patients using type I fimbriae (29,86,217). There have been no unique genetic factors described for this pathotype, other than a weak association with certain alleles of chiA chitinase, a gene encoding a protein shown to promote AIEC adherence to intestinal epithelial cells (1,216,218). The four complete AIEC genomes here all belong to the B2 phylogroup, the phylogroup to which the majority of isolated AIEC strains belong.
Shiga-toxin-producing E. coli (STEC) strains have acquired the phage-associated Stx toxin (34,39,219). This pathotype causes foodborne illness with presentations ranging from asymptomatic carriage to life-threatening hemorrhagic colitis and hemolytic-uremic syndrome (HUS). Technically, pathogenic STEC strains that have the ability to cause disease in humans are called enterohemorrhagic E. coli (EHEC) and belong predominantly to seven serotypes, including the well-known O157:H7 serotype. In practice, however, the terms STEC and EHEC are used somewhat interchangeably, along with verotoxigenic E. coli (VTEC). In addition to the Stx toxin, the most prominent virulence factor found in these strains is the LEE pathogenicity island, which encodes intimin among other things (34). Other described virulence genes in the literature include espP, lpf, efa, toxB, eibG, ehaA, ompA, iha, and paa (32).
Enteroaggregative E. coli (EAEC) is an emerging diarrheagenic pathotype that is characterized by its "stacked-brick" adherence pattern on HEp-2 cells. This pathotype is a serious cause of childhood diarrhea in developing countries and of traveler's diarrhea (35). EAEC infection can be asymptomatic, but during symptomatic infections the most common symptoms are watery diarrhea, abdominal pain, nausea, and vomiting (220).
During an outbreak of bloody diarrhea in Europe in 2011, a new hybrid pathotype called enteroaggregative hemorrhagic E. coli (EAHEC) was described (37). All known members of this pathotype are EAEC strains of the O104:H4 serotype that have acquired the Stx phage (221).
Enterotoxigenic E. coli (ETEC) is a pathotype that gets its name from the plasmid-transmitted heat-labile enterotoxin (LT) and heat-stable enterotoxin (ST). These strains can carry one or both toxins. LT is similar to cholera toxin and has a similar mechanism of action. ST is a much smaller peptide that binds to guanylate cyclase to increase cyclic GMP in the small intestine. ETEC is the most common cause of E. coli diarrheagenic disease and the leading cause of traveler's diarrhea. It is also a major cause of acute infectious diarrhea, which accounts for an estimated 20% of all childhood deaths (222).
Enteropathogenic E. coli (EPEC) is another major diarrheagenic E. coli. This pathotype is characterized by the formation of A/E lesions. The attachment to intestinal cells is mediated by intimin, encoded by eae, which is found in the LEE pathogenicity island (40).
The sequence type of each strain was determined using Center for Genomic Epidemiology (CGE) MLST 2.0 software (223,224). For strains where the O and H serotypes were not known, CGE SeroTypeFinder software was used (225). K-types were listed if they had been experimentally determined or if neu (for K1) or kfi (for K5) operons were present. When the strain contained kps genes without additional information, "K1" is listed in Table 2. If the kps genes were identical to those of strains that had experimentally determined K-types or if the K-type was described with some uncertainty, the K-type was listed followed by a "?" (for example "K100?"). FimH type was determined using CGE's FimTyper (226).