A comprehensive and high-quality collection of E. coli genomes and their genes

Escherichia coli is a highly diverse organism which includes a range of commensal and pathogenic variants found across a range of niches and worldwide. In addition to causing severe intestinal and extraintestinal disease, E. coli is considered a priority pathogen due to high levels of observed drug resistance. The diversity in the E. coli population is driven by high genome plasticity and a very large gene pool. All these have made E. coli one of the most well-studied organisms, as well as a commonly used laboratory strain. Today, there are thousands of sequenced E. coli genomes stored in public databases. While data is widely available, accessing the information in order to perform analyses can still be a challenge. Collecting relevant available data requires accessing different sources, where data may be stored in a range of formats, and often requires further manipulation, and processing to apply various analyses and extract useful information. In this study, we collated and intensely curated a collection of over 10,000 E. coli and Shigella genomes to provide a single, uniform, high-quality dataset. Shigella were included as they are considered specialised pathovars of E. coli. We provide these data in a number of easily accessible formats which can be used as the foundation for future studies addressing the biological differences between E. coli lineages and the distribution and flow of genes in the E. coli population at a high resolution. The analysis we present emphasises our lack of understanding of the true diversity of the E. coli species, and the biased nature of our current understanding of the genetic diversity of such a key pathogen. Author Notes All supporting data have been provided within the article or through supplementary data files. All supporting code is provided in the git repository https://github.com/ghoresh11/ecoli_genome_collection. Significance as a BioResource to the community As of today, there are more than 140,000 E. coli genomes available on public databases. While data is widely available, collating the data and extracting meaningful information from it often requires multiple steps, computational resources and expert knowledge. Here, we collate a high quality and comprehensive set of over 10,000 E. coli genomes, isolated from human hosts, into a set of manageable files that offer an accessible and usable snapshot of the currently available genome data, linked to a minimal data quality standard. The data provided includes a detailed synopsis of the main lineages present, including their antimicrobial and virulence profiles, their complete gene content, and all the associated metadata for each genome. This includes a database which enables the user to compare newly sequenced isolates against the assembled genomes. Additionally, we provide a searchable index which allows the user to query any DNA sequence against the assemblies of the collection. This collection paves the path for many future studies, including those investigating the differences between E. coli lineages, following the evolution of different genes in the E. coli pan-genome and exploring the dynamics of horizontal gene transfer in this important organism. Data Summary The complete aggregated metadata of 10,146 high quality genomes isolated from human hosts (doi.org/10.6084/m9.figshare.12514883, File F1). A PopPUNK database which can be used to query any genome and examine its context relative to this collection (Deposited to doi.org/10.6084/m9.figshare.12650834). A BIGSI index of all the genomes which can be used to easily and quickly query the genomes for any DNA sequence of 61 bp or longer (Deposited to doi.org/10.6084/m9.figshare.12666497). Description and complete profiling the 50 largest lineages which represent the majority of publicly available human-isolated E. coli genomes (doi.org/10.6084/m9.figshare.12514883, File F2). Phylogenetic trees of representative genomes of these lineages, presented in this manuscript, are also provided (doi.org/10.6084/m9.figshare.12514883, Files tree_500.nwk and tree_50.nwk). The complete pan-genome of the 50 largest lineages which includes: A FASTA file containing a single representative sequence of each gene of the gene pool (doi.org/10.6084/m9.figshare.12514883, File F3). Complete gene presence-absence across all isolates (doi.org/10.6084/m9.figshare.12514883, File F4). The frequency of each gene within each of the lineages (doi.org/10.6084/m9.figshare.12514883, File F5). The representative sequences from each lineage for all the genes (doi.org/10.6084/m9.figshare.12514883, File F6).

Introduction E. coli is a globally distributed, highly diverse organism with a very large gene pool [1][2][3]. While some variants of E. coli are found in the guts of healthy individuals, in animals and in the environment, others cause severe intestinal and extraintestinal lifethreatening disease [4]. The diversity between E. coli strains is driven by high genome plasticity; genes are regularly gained and lost, leading to high variability in gene content between lineages and isolates [2,[5][6][7]. The combination of these factors: a large gene pool, genome plasticity, global distribution and ubiquity across niches, make E. coli an important genetic storehouse for the spread and wider dissemination of genes, including those that confer resistance and virulence. Indeed, E. coli has been designated a priority pathogen by the World Health Organisation due to its high levels of drug resistance [8]. Therefore, E. coli is a highly relevant organism to study in today's world, with the increasing spread of antimicrobial resistance, and for understanding the emergence of new, globally disseminated, bacterial pathogens of relevance to human and animal health.
Eight pathogenic variants of E. coli, termed "pathotypes", have been defined based on their site of infection and by distinguishing phenotypic and molecular markers [4]. Based on molecular definitions, they can be considered diarrheagenic E. coli [9,10] with Shigella often classified as an EIEC, as they are clinically and diagnostically similar [4]. EPECs, ETECs and Shigella are prevalent in the developing world where they cause fatal diarrhea among infants and children [11,12]. ETECs, EAECs and Shigella are the most common causes for travellers' diarrhea [13]. EHECs are the only diarrheagenic E. coli that are cause for concern in developed countries as their major reservoir is in the gastrointestinal tracts of cattle [14,15] Figure S1).

Assembly
Reads were assembled by VELVET (v1.2.09) [22] using the prokaryotic assembly pipeline (v2.0.1) with default setting [23]. Assembled genomes were filtered to remove those with more than 600 contigs or those that had a total combined contig length of less than 4 Mb or larger than 6 Mb (1,152 genomes, based on a distribution of these values, Supplementary Figure S1).
Mash distances were calculated between all the assemblies [24]. Mash uses a minimised database of k-mers, i.e. words of size k, to represent each genome (based on the Minhash sketch). Mash returns the proportion of shared k-mers, the Jaccard distance, between every two genomes as a measure of their genomic distance. A network was constructed so that every genome is represented in a node and two genomes were connected only if their Mash distance was smaller than 0.04 (equivalent to 96% Average Nucleotide Identity (ANI)) [24]. Isolates from the same species should have an ANI of approximately 95-96%, i.e. Mash distance smaller than 0.04 [25]. Therefore, genomes were removed (189 genomes) if they were disconnected from the largest connected component which should represent the E.
coli and Shigella species.

Coding sequences
Predicted coding sequences (CDSs) were predicted using Prokka with a custom training file (v1.5, available at doi.org/10.6084/m9.figshare.12514883). Prodigal (v2.6) was trained using a random selected set of 100 genomes from the entire dataset using the "prodigal.py" script available in Panaroo [26,27]. The training file was used as the input for Prokka to predict the CDSs in the entire dataset. All the genomes were then annotated using the same standardised training properties defined in the training file.
There was a linear relationship between the size of the genome and the number of genes called. Genomes which deviated from linear correlation by 500 genes were removed (Supplementary Figure S1).

Constructing the BIGSI index
Each assembly was converted to a non-redundant list of k-mers through the construction of De Bruijn graphs (k=31) using mccortex v1.0 [28]. All assemblies had between 10 5 and 10 6 unique k-mers. The parameters chosen for the BIGSI index were h=1 and m=28000000, as detailed in the berkeleyDB config file (available at doi.org/10.6084/m9.figshare.12666497, file config_10K_00.yaml) and following steps were performed using BIGSI (https://github.com/iqbal-lab-org/BIGSI, checkout:12966cacefc14354ee3c42dc2852917475567a0d) [29]. A single hash function (h=1) was applied to each k-mer and each assembly was stored as a fixed length (m=28000000) Bloom filter (bit-vector). To reduce the overall build time of the index, individual Bloom filters were merged in batches of 500 into matrices using `bigsi merge_blooms` command, where the input `--from_files` was a tab separated file where the first column provides the absolute path to the bloom filter and the second is the assembly name. These merged blooms files were then used to build the BIGSI index using `bigsi large_build` command where the provided `from_file` input was a file that contains two columns, separated by tab, where the first column details the absolute path to the merged bloom matrices and the second contains all the corresponding assemblies in that merged bloom file, separated by commas. The BIGSI index of the assemblies in this resource, index10k, can be found at doi.org/10.6084/m9.figshare.12666497.

Defining lineages using PopPUNK
Population Partitioning Using Nucleotide K-mers (PopPUNK) (v. 1.1.3) was used to group the assemblies into PopPUNK clusters or lineages [31]. PopPUNK uses Mash, a k-mer based whole genome comparison approach, to infer the pairwise core and accessory distances between every two assemblies. The database was constructed with parameters k-min=18, k-max=30 and step_size=3, as these values produced the correct line fit for estimating the core and accessory distances, as detailed in https://poppunk.readthedocs.io/en/latest/troubleshooting.html#kmer-length. The estimated core and accessory distances between the assemblies were clustered using a two-dimensional Gaussian mixture model (GMM) to identify cut-offs for the within lineage core and accessory distances. The model fitting was applied using six different values of total number of clusters for the GMM ( K= 5, 8, 11, 14, 17 and 20). The scores generated by PopPUNK for all these values were compared. A value of K=11 was chosen as it had the overall lowest entropy, i.e. highest confidence in assigning each distance to a cluster, and comparably high overall score. PopPUNK then constructs a network between all assemblies where each node is an assembly and two assemblies are connected only if their core and accessory distance is below the "within lineage core" and "within lineage accessory" distances. All assemblies which are connected to each other in this network are defined as a lineage.

Phylogenetic analysis
The core gene phylogeny was inferred from the core gene alignment generated using Roary for each lineage [32], and a tree from the SNPs in the core gene alignment, extracted using SNP-sites [33] (v2.3.2), was constructed using FastTree [34]. Treemer (v0.3) [35] was used to select ten genomes from each lineage as representatives of that lineage (Supplementary Table S1). Similarly, Treemer was used to choose a single representative genome from each of the 50 lineages to generate a tree containing only 50 genomes. In both cases, the core gene phylogeny was inferred from the SNPs of the core gene alignment generated using Roary on the representative genomes [32]. A maximum likelihood tree from the informative SNPs, chosen using SNP-sites [ representative E. coli genomes [37]. ClermonTyping uses an in-silico PCR approach of marker genes, following the Clermont phylotyping scheme presented in [38]. This is supplemented by a Mash-based mapping to a curated collection of E. coli genomes, for which the phylogroup is known. A lineage was assigned to the phylogroup according to the most common phylogroup assignment of the ten representative strains. The exception was Lineage 10 which was assigned to phylogroup D by ClermonTyping as the marker gene arpA was not detected in the in-silico PCR using primer ArpAgpE, however, the assignment did not correspond with the phylogeny and this was corrected to phylogroup E.

Pan-genome analysis
A pan-genome analysis using Roary [32] was applied on each lineage separately using the default identity cutoff of 0.95, with paralog splitting disabled [32]. The outputs of the pan-genome analysis of each lineage were combined to generate a final collection of gene clusters of the entire dataset in the following steps: 1. Gene cluster definitions, from the Roary analysis within each lineage, were assumed to be the best approximation of the representation of the genes that are well-defined within a closely related group of genomes. Note that each gene cluster has multiple members (sequences) from that lineage (Supplementary Figure S2, Step 1). A representative sequence was chosen for each gene cluster as the sequence that had the modal length within that gene cluster. If there was no mode, a sequence with the median length was chosen.   Figure 1. Shigella, which are phylogenetically part of the E. coli species, were also included and are referred to as E. coli throughout. In short, genome identifiers from publications where complete metadata was available were collected, and combined with identifiers of genomic data from public databases for which only limited metadata was available. Genomes were downloaded, assembled and their CDSs were predicted and annotated. Importantly, to ensure the accuracy of the data, multiple QC measures were applied, reducing the initial dataset and thereby ensuring a final collection of high-quality genomes ( Figure 1B). Only genomes for which we received explicit approval to be used by the submitter were kept, removing any doubts regarding the ability to use this data for high resolution analyses. The curated high-  Figure S3). Where available, the pathotype description was as described in the original publication. Within these isolates, the representation of diarrheal disease-causing E. coli pathotypes, EPECs and ETECs, was very low with only 3% and 2% of the genomes belonging to these pathotypes, respectively.
Six STs represent more than 50% of the genomes in the collection.
Multi-locus sequence typing is based on the variation of seven housekeeping genes, the combination of which define a sequence type or ST. A total of 993 different known STs were identified in the collection. 87 STs (9%) alone accounted for 80% of the isolates (Supplementary Figure S4) The dataset can be divided into lineages of closely related isolates.
As E. coli is a highly diverse organism, relying on MLST for subtyping can lead to new ST definitions within a group of closely related isolates due to variation in one of these genes, or otherwise, to connections between unrelated isolates due to recombination.
We therefore grouped the genomes into lineages of closely related isolates using a whole-genome based approach. PopPUNK extracts and compares words of size k, named "k-mers", from whole genomes to measure the deviation in core-gene sequence termed as the "core distance", and the deviation in gene content, termed as the "accessory distance", between two genomes [31]. In E. coli, the core distance, as estimated by PopPUNK, correlates with the pairwise SNP distance between all the core genes of the two genomes being compared, and the accessory distance correlates with the proportion of shared accessory genes between every two genomes (the Jaccard distance) [31]. Genomes which had both low core and accessory distances were considered to be in the same PopPUNK cluster, defined here as a "lineage", as they were highly similar in both their core and accessory genomes.
Based on the rules described above this grouping produced 1,154 lineages. As expected the distribution of lineage sizes was similar to that defined by MLST with a few large lineages representing most of the population (Supplementary Figure S4) 50 PopPUNK lineages represent more than 75% of the genomes, and are representative of the currently known E. coli population structure.
We focused the further analysis of the dataset on the 50 lineages which had at least twenty isolates. Together these lineages included 7,693 genomes (76% of the collection) and 271 different STs (27% of those described by this collection).
To examine the population structure and diversity of the 50 lineages largest lineages, the phylogeny was constructed by selecting ten genomes from each lineage that captured most of the diversity of that lineage (See Methods, Supplementary Table S1), leading to a total of 500 genomes representing the dataset. Their core genome was extracted and the phylogenetic tree from the core gene alignment was inferred. The phylogenetic analysis confirmed that PopPUNK separated the genomes into clearly distinct lineages based on their core genome ( Figure 2). The exception to this was Lineage 12 which was split into two closely related groups. One group was more closely related to Lineage 28 whereas the other to Lineage 35. The core and accessory distances estimated by PopPUNK showed that indeed, the core distance between PopPUNK Clusters 12, 28 and 35 was low, however, they sufficiently deviated in their accessory gene content to be defined as three distinct PopPUNK lineages.
Population genetics studies on E. coli have defined the existence of eight deepbranching phylogenetic groups, termed "phylogroups" (A, B1, B2, D, E, F, C and G) [52][53][54][55]. While the collection assembled here is biased towards particular STs and we only included lineages with 20 genomes or more, it is evident from Figure 2 that the collection of genomes spans all E. coli phylogroups (18 from B1, 13 from B2, 4 from A, 5 from D, 4 from F, 3 from E, 1 from C and 2 of Shigella representing S. sonnei (45) and S. flexneri (30) and therefore is representative of the known species diversity [38,56]. Treemer [35]. Solid coloured outer ring indicates the phylogroup assignment of the representatives of that lineage. The tree was plotted using ITOL [51]. Colours on tips used to distinguish between the PopPUNK lineages. include STECs or EHECs were collected in the high income settings [12]. Lineage 12, which consisted of 78% isolates from ST10, was the only lineage that spanned all continents and consisted of all sample types (faecal, blood, urine or unknown).
Where sampling date was available, 39% of the genomes in the collection were Multidrug resistance was predicted for more than half of the isolates in 16 of 50 lineages.
A total of 153 known resistance gene alleles were identified in the collection. The number of known resistance genes within each isolate ranged from none to a maximum of 18 in a single isolate, predicted to confer resistance to up to ten different antimicrobial classes ( Figure 3A, File F1).      Examining the distribution of a gene across the species phylogeny.
A gene of interest can be identified in the pan-genome presented by using alignment tools like Blast+ [62] or DIAMOND [63] against the pan genome reference file provided (File F3). The distribution of the gene named "intA_1", a prophage integrase, in this genome collection can be plotted across the phylogeny of the 47 lineages using the ( Figure 5B).

Discussion
We have created a high-quality, extensively curated dataset of over 10,000 E. coli and Shigella genomes, linked this to resources which enable this dataset to be queried as a single dataset, and have provided several usage examples. Additionally, we have described in detail the properties of the main lineages present in the collection and their gene (predicted CDS) content. We hope that the data provided in this manuscript will make future studies on E. coli more accessible to a wider audience and will facilitate the investigation of some of the pressing questions in E. coli genetics and evolution.
Aggregating data from diverse sources along with their associated metadata is not trivial but, given the increasing number of data sources and data types, essential.
Genome identifiers and data formats across publications and databases do not always match leading to many conversions which are error prone and require knowledge of programming. In addition, computational resources are required in order to apply thousands of assembly and annotation calculations. These are all limiting factors to research. This emphasises the need to build new resources which maintain high quality genome collections where users would more easily be able to both retrieve and apply analyses on large collections. Without such resources, information is widely available, but it is practically only usable for a small proportion of scientists with large resources and computational expertise. Enterobase is a valuable resource which overcomes data accessibility issues by integrating, assembling and analysing the genomic data of specific enteric pathogens from the Sequence Read Archive, while providing researchers with relevant metadata and software [3]. However, as metadata is often associated with a publication, and is not directly linked to the database from which the genome was downloaded, this information is often missing. Even more, describing the gene content by comparing whole genomic datasets is a much harder problem, which cannot realistically be provided in a high quality in an automated manner across increasing dataset sizes. Therefore, studies on E. coli in recent years have either been detailed and focused only on a single pathotype [60,[64][65][66] or, when utilising a very large number of genomes, the analyses were limited in their resolution due to the complexity of extracting the information from such large collections [3,67].
Taken together, the collection presented here represents a detailed, high-quality and accessible dataset which will enable researchers to apply comprehensive comparisons in future investigations on E. coli. This includes the PopPUNK and BIGSI databases which can be used to query newly sequenced isolates or DNA sequences of interest and examine their diversity relative to this collection.
The analysis presented in this manuscript emphasises our lack of knowledge on the true diversity of this important species, and that we should redirect our efforts towards sampling to understand the diversity which has yet to be studied. The collection we obtained is biased towards E. coli lineages which have clinical significance. The vast majority of genomes were available from Europe and North America, such that the pathotypes comprising the dataset are those which predominantly affect these areas.