Genomic diversity of Escherichia coli isolates from healthy children in rural Gambia

Little is known about the genomic diversity of Escherichia coli in healthy children from sub-Saharan Africa, even though this is pertinent to understanding bacterial evolution and ecology and their role in infection. We isolated and whole-genome sequenced up to five colonies of faecal E. coli from 66 asymptomatic children aged three to five years in rural Gambia (n=88 isolates from 21 positive stools). We identified 56 genotypes, with an average of 2.7 genotypes per host. These were spread over 37 seven-allele sequence types and the E. coli phylogroups A, B1, B2, C, D, E, F and Escherichia cryptic clade I. Immigration events accounted for three-quarters of the diversity within our study population, while one-quarter of variants appeared to have arisen from within-host evolution. Several study strains were closely related to isolates that caused disease in humans or originated from livestock. Our results suggest that within-host evolution plays a smaller role in the generation of diversity than independent immigration and the establishment of strains among our study population. This study also adds significantly to the number of commensal E. coli genomes, a group that has been traditionally underrepresented in the sequencing of this species.


Introduction
Ease of culture and genetic tractability account for the unparalleled status of Escherichia coli as "the biological rock star", driving advances in biotechnology (1), while also providing critical insights into biology and evolution (2). However, E. coli is also a widespread commensal, as well as a versatile pathogen, linked to diarrhoea (particularly in the under-fives), urinary tract infection, neonatal sepsis, bacteraemia and multi-drug resistant infection in hospitals (3)(4)(5). Yet, most of what we know about E. coli stems from the investigation of laboratory strains, which fail to capture the ecology and evolution of this key organism "in the wild" (6). What is more, most studies of non-lab strains have focused on pathogenic strains or have been hampered by low-resolution PCR methods, so we have relatively few genomic sequences from commensal isolates, particularly from low- to middle-income countries (7-13).

We have a broad understanding of the population structure of E. coli, with eight significant phylogroups loosely linked to ecological niche and pathogenic potential (B2, D and F linked to extraintestinal infection; A and B1 linked to severe intestinal infections such as haemolytic-uraemic syndrome) (14-17). All phylogroups can colonise the human gut, but it remains unclear how far commensals and pathogenic strains compete or collaborate-or [...]

[...] the eight main phylogroups of E. coli (Table 2) [...]

We observed thirteen cases where a single host harboured two or more variants within the same SNP cloud (

Accessory gene content and relationships with other strains
A quarter of our isolates were most closely related to commensal strains from humans, with smaller numbers most closely related to human pathogenic strains or strains from livestock, poultry or the environment (Table 4). One isolate was most closely related to a canine isolate from the UK. Three STs (ST38, ST10 and ST58) were shared by our study isolates and diarrhoeal isolates from the GEMS study (Supplementary Figure 2), with just eight alleles separating our commensal ST38 strain from a diarrhoeal isolate from the GEMS study (Figure 5).

We detected 130 genes encoding putative virulence factors across the 88 study isolates (Figure 2; Supplementary File 3). More than half of the isolates encoded resistance to three or more clinically relevant classes of antibiotics (Figure 3; Supplementary Figure 1). The most common resistance gene network was -aph(6)-Id_1-sul2 (41% of the isolates), followed by aph(3'')-Ib_5-sul2 (27%) and bla-TEM-aph(3'')-Ib_5 (24%). Most isolates (67%) harboured two or more plasmid types (Figure 4). Of the 24 plasmid types detected, IncFIB was the most common (41%), followed by col156 (19%) and IncI_1-Alpha (15%). Nearly three-quarters of the multi-drug resistant isolates carried IncFIB (AP001918) plasmids, suggesting that these large plasmids disseminate resistance genes within our study population.
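The "three or more classes" criterion used above can be illustrated with a minimal sketch. It assumes each isolate is represented by a list of detected resistance gene names; the gene-to-class mapping covers only the genes named in this section and reflects general knowledge of these genes rather than the authors' annotation pipeline, and the function names and example isolates are hypothetical.

```python
# Assumed mapping from resistance gene (as named in the text) to antimicrobial class.
# This is an illustrative subset, not the database used in the study.
GENE_CLASS = {
    "aph(6)-Id_1": "aminoglycoside",
    "aph(3'')-Ib_5": "aminoglycoside",
    "sul2": "sulphonamide",
    "bla-TEM": "beta-lactam",
}


def resistance_classes(genes):
    """Return the set of antimicrobial classes covered by the detected genes."""
    return {GENE_CLASS[g] for g in genes if g in GENE_CLASS}


def is_multidrug_resistant(genes, threshold=3):
    """Flag an isolate as MDR if its genes span at least `threshold` classes."""
    return len(resistance_classes(genes)) >= threshold


# Hypothetical isolates, for illustration only.
isolate_a = ["aph(6)-Id_1", "sul2", "bla-TEM"]        # three classes
isolate_b = ["aph(6)-Id_1", "aph(3'')-Ib_5", "sul2"]  # two classes

print(is_multidrug_resistant(isolate_a))  # True
print(is_multidrug_resistant(isolate_b))  # False
```

Counting classes rather than genes matters here: two aminoglycoside genes in the same isolate contribute only one class.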

Discussion
This study provides an overview of the within-host genomic diversity of E. coli in healthy children from a rural setting in the Gambia, West Africa. Surprisingly, we recovered a lower rate of colonisation than reported elsewhere among children of similar age groups (42), with only a third of our study samples yielding growth of E. coli. This may reflect geographical variation but might also be some hard-to-identify effect of the way the samples were handled, even though they were kept frozen and thawed only just before culture.

Several studies have shown that sampling a single colony is insufficient to capture E. coli strain diversity in stools (20,21,23 [...]

We found that within-host evolution plays a minor role in the generation of diversity, in line with Dixit et al. (20), who reported that 83% of diversity originates from immigration events, and with epidemiological data suggesting that recurrent immigration events account for the high faecal diversity of E. coli in the tropics (47). Co-colonising variants belonging to the same ST tended to share an identical virulence, AMR and plasmid profile, signalling similarities in their accessory gene content. The estimated mutation rate for E. coli lineages is around one SNP per genome per year (48), so that two genomes with a most recent common ancestor in the last five years would be expected to be around ten SNPs apart. However, in two subjects, pairwise distances between genomes from the same ST (ST59 and ST5148) were large enough (14 and 18 SNPs, respectively) to suggest that they might have arisen from independent immigration events, as insufficient time had elapsed in the child's life for such divergence to occur within the host. It remains possible that the mutation rate was higher than expected in these lineages, although we found no evidence of damage to DNA repair genes.
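The divergence argument above can be made concrete with a small sketch. It assumes a strict molecular clock at the cited rate of roughly one SNP per genome per year, with mutations accumulating independently on both lineages since their most recent common ancestor, and it ignores stochastic variation around the expectation. The function names are hypothetical and the simple threshold is an illustration, not the authors' analysis.

```python
def expected_pairwise_snps(years_since_mrca, rate_per_genome_per_year=1.0):
    """Expected SNP distance between two genomes sharing a common ancestor.

    Both lineages accumulate mutations after the split, hence the factor of two.
    """
    return 2 * rate_per_genome_per_year * years_since_mrca


def consistent_with_within_host(observed_snps, host_age_years, rate=1.0):
    """True if the observed distance does not exceed the expectation for a
    common ancestor no older than the host (a deliberately crude cut-off)."""
    return observed_snps <= expected_pairwise_snps(host_age_years, rate)


# Upper age bound of the study children is five years.
print(expected_pairwise_snps(5))           # 10.0
print(consistent_with_within_host(14, 5))  # False (ST59 pair)
print(consistent_with_within_host(18, 5))  # False (ST5148 pair)
```

Under these assumptions, distances of 14 and 18 SNPs exceed the ten expected for a five-year-old host, which matches the reasoning given for treating those pairs as independent immigration events.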
More than half of our isolates encode resistance to three or more classes of antimicrobials, echoing the high rate of MDR (65%; confirmed by phenotypic testing) in the GEMS study. IncFIB (AP001918) was the most common plasmid Inc type from our study, in line with the observation that IncF plasmids are frequently associated with the dissemination of resistance (49). However, a limitation of our study is that we did not perform phenotypic antimicrobial resistance testing, although Doyle et al. (50) [...]

Figure 1
The study sample processing flow diagram.

Figure 2
A maximum-likelihood tree depicting the phylogenetic relationships among the study isolates. The tree was reconstructed with RAxML, using a general time-reversible nucleotide substitution model and 1,000 bootstrap replicates. The genome assembly of E. coli str. K12 substr. MG1655 was used as the reference, and the tree was rooted using the genomic assembly of E. fergusonii as an outgroup. The sample names are indicated at the tips, with the respective Achtman sequence types (ST) indicated beside the sample names. The phylogroups to which the isolates belong are indicated with colour codes as displayed in the legend.
The E. coli reference genome is denoted in black. Asterisks (*) are used to indicate novel STs.
The predicted antimicrobial resistance genes and putative virulence factors for each isolate are displayed next to the tree, with the virulence genes clustered according to their function.
Multiple copies of the same strain (ST) isolated from a single host are not shown. Instead, we have shown only one representative isolate from each strain. Virulence and resistance factors were not detected in the reference strain. A summary of the identified virulence factors and their known functions is provided in Supplementary File 3.

Figure 3
A: The prevalence of antimicrobial-associated genes detected in the isolates. The y-axis shows the detected AMR-associated genes in the genomes, grouped by antimicrobial class.
B: A histogram depicting the number of antimicrobial classes for which resistance genes were detected in the corresponding strains.

Figure 4
A: Plasmid replicons detected in the study isolates. B: A histogram depicting the number of plasmids co-harboured in a single strain.

Figure 5
A: A NINJA neighbour-joining tree showing the population structure of E. coli ST38, drawn using the genomes found in the core-genome MLST hierarchical cluster at HC1100, which corresponds to the ST38 clonal complex. B: The closest neighbour to a pathogenic strain reported in GEMS (4) is shown to be a commensal isolate recovered from a healthy individual.
C: The closest relatives to the commensal ST38 strain recovered from this study are shown (red highlights), with the number of core-genome MLST alleles separating the two genomes displayed. D: A maximum-likelihood phylogenetic tree reconstructed using the genomes found in the cluster in C above, comprising both pathogenic and commensal ST38 strains, depicting the genetic relationship between strain 100415 (pathogenic) and 103709 (commensal) (red highlights). The nodes are coloured to depict the status of the strains as pathogenic (red) or commensal (blue). The geographical locations where isolates were recovered are displayed in panels A-C, with the genome counts shown in square brackets.

Supplementary Figure 1
A co-occurrence matrix of acquired antimicrobial resistance genes detected in the study isolates. The diagonal values show how many isolates each individual gene was found in, while the intersections between the columns represent the number of isolates in which the corresponding antimicrobial resistance genes co-occurred.
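A matrix of the kind described in this caption can be computed from per-isolate gene lists. The following is a minimal sketch in plain Python, assuming a dictionary that maps isolate names to detected resistance genes; the isolate names and gene sets are invented for illustration and the function name is hypothetical.

```python
def cooccurrence_matrix(isolate_genes):
    """Build a gene-by-gene co-occurrence matrix.

    Diagonal cells count the isolates carrying each gene; off-diagonal cells
    count the isolates carrying both genes of the pair.
    """
    genes = sorted({g for gs in isolate_genes.values() for g in gs})
    index = {g: i for i, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genes]
    for gs in isolate_genes.values():
        present = set(gs)  # guard against a gene being listed twice
        for a in present:
            for b in present:
                matrix[index[a]][index[b]] += 1
    return genes, matrix


# Hypothetical input, for illustration only.
isolates = {
    "iso1": ["sul2", "bla-TEM"],
    "iso2": ["sul2"],
    "iso3": ["sul2", "bla-TEM", "aph(6)-Id_1"],
}
genes, matrix = cooccurrence_matrix(isolates)
print(genes)  # ['aph(6)-Id_1', 'bla-TEM', 'sul2']
for row in matrix:
    print(row)
# [1, 1, 1]
# [1, 2, 2]
# [1, 2, 3]
```

The matrix is symmetric by construction, so only one triangle needs to be displayed.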

Supplementary Figure 2
A neighbour-joining phylogenetic tree depicting the genetic relationships among twenty-four strains isolated from diarrhoeal cases in the GEMS study (4). The sequence types identified in these isolates are shown in the legend, with the genome count displayed in square brackets next to the respective sequence types. Three STs (ST38, ST58 and ST10) overlapped with those found among commensal strains from this study (see Figure 2).

Supplementary File 1
Sequencing statistics and characteristics of twenty-four previously sequenced GEMS cases included in this study.

Supplementary File 2
A summary of the sequencing statistics of the isolates reported in this study.

Supplementary File 3
A summary of the virulence factors detected among the study isolates and their known functions.

Supplementary File 4
A pairwise single nucleotide polymorphism matrix showing the SNP distances between the study genomes.
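For readers unfamiliar with such matrices, the sketch below shows one simple way to derive pairwise SNP distances from an alignment. It assumes equal-length aligned sequences and skips positions where either genome has a gap or an ambiguous base; this is an illustrative simplification, not the tool or settings used to produce the supplementary file, and the toy sequences are invented.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ, skipping gaps and Ns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_snp_matrix(alignment):
    """Return a symmetric nested dict of SNP distances between all genomes."""
    names = list(alignment)
    dist = {n: {m: 0 for m in names} for n in names}
    for x, y in combinations(names, 2):
        dist[x][y] = dist[y][x] = snp_distance(alignment[x], alignment[y])
    return dist


# Toy alignment, for illustration only.
alignment = {"a": "ACGTAC", "b": "ACGTTC", "c": "NCGAAC"}
matrix = pairwise_snp_matrix(alignment)
print(matrix["a"]["b"])  # 1
print(matrix["a"]["c"])  # 1 (the N at the first position is ignored)
print(matrix["b"]["c"])  # 2
```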

Supplementary File 5
List of the sample clones for which two independent cultures were obtained and sequenced, to determine the SNP differences between cultures of the same clone.