Construction of the ETECFinder database for the characterization of enterotoxigenic Escherichia coli (ETEC) and revision of the VirulenceFinder web tool at the CGE website

ABSTRACT The identification of pathogens is essential for effective surveillance and outbreak detection, which lately has been facilitated by the decreasing cost of whole-genome sequencing (WGS). However, extracting relevant virulence genes from WGS data remains a challenge. In this study, we developed a web-based tool to predict virulence-associated genes in enterotoxigenic Escherichia coli (ETEC), which is a major concern for human and animal health. The database includes genes encoding the heat-labile toxin (LT) (eltA and eltB), heat-stable toxin (ST) (est), colonization factors CS1 through 30, F4, F5, F6, F17, F18, and F41, as well as toxigenic invasion and adherence loci (tia, tibAC, etpBAC, eatA, yghJ, and tleA). To construct the database, we revised the existing ETEC nomenclature and used the VirulenceFinder webtool at the CGE website [VirulenceFinder 2.0 (dtu.dk)]. The database was tested on 1,083 preassembled ETEC genomes, two BioProjects (PRJNA421191 with 305 and PRJNA416134 with 134 sequences), and the ETEC reference genome H10407. In total, 455 new virulence gene alleles were added, 50 alleles were replaced or renamed, and two were removed. Overall, our tool has the potential to greatly facilitate ETEC identification and improve the accuracy of WGS analysis. It can also help identify potential new virulence genes in ETEC. The revised nomenclature and expanded gene repertoire provide a better understanding of the genetic diversity of ETEC. Additionally, the user-friendly interface makes it accessible to users with limited bioinformatics experience. IMPORTANCE Detecting colonization factors in enterotoxigenic Escherichia coli (ETEC) is challenging due to their large number, heterogeneity, and lack of standardized tests. Therefore, it is important to include these ETEC-related genes in a more comprehensive VirulenceFinder database in order to obtain a more complete coverage of the virulence gene repertoire of pathogenic types of E. coli. ETEC vaccines are of great importance due to the severity of the infections, primarily in children. A tool such as this could assist in the surveillance of ETEC in order to determine the prevalence of relevant types in different parts of the world, allowing vaccine developers to target the most prevalent types and, thus, a more effective vaccine.

on the patient's or animal's history and symptoms, thus making monitoring of ETEC very difficult.In surveillance studies, PCR targeting ETEC toxins is mainly used; however, an up-to-date database of virulence factor alleles is important for wider coverage.By expanding the existing VirulenceFinder tool and database, we hope to contribute to a more refined and accurate detection of virulence factors related to ETEC infections in both humans and animals.

Construction of the VirulenceFinder database
The relevant genes and their sequences were compared using BLASTn against the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nuccore/) to identify potential matches.Sequences with open reading frames were curated and validated for the presence of only the four nucleotides ATCG.The validated sequences were compared using BioNumerics software (v8.1), and a reference sequence was selected for each allele.The first validated sequence to enter the database was designated as the reference sequence.
To collect alleles, sequences from various papers were compared using BLASTn, and alleles that matched 80% sequence identity and total length were included in the database.Partial genes were excluded from the database to ensure completeness of sequences.The database is stored as a text file in FASTA format with a unique identifier for each reference sequence followed by its sequence, which can be downloaded from the CGE website (genomicepidemiology / virulencefinder_db -Bitbucket).
To account for the variations in the porcine variant of STa1, we replaced the two sta1 alleles with 15 estap alleles.For the human variant of STah2 and STah3, we added four estah alleles.Furthermore, we renamed three stb alleles for the porcine variant STb1 as estb and added one variant of ST2b to the database (see Table S1).
We used a beta version of the VirulenceFinder database, which includes the ETECFinder database (ETEC-related genes), to search for virulence genes in 1,083 preassembled E. coli genomes.The ETECFinder database is designed to perform a genotypic detection of ETEC virulence genes, and all genes and alleles in the database are identified with a GenBank accession number or identifier (ID).To simplify the LT typing, we concatenated the eltA and eltB sequences and removed the overlapping four bases present in a subset of the LT types.We then analyzed the nucleotide sequences that encode the LT holotoxin.Additionally, we ran the 1,083 genomes through the CGE web tool SerotypeFinder, CGE Server (dtu.dk).
Genomes were defined as ETEC if they were positive for any combination of the genes estah, estap, estb, eltIAB, or eltIIAB.Additionally, genomes were classified as ExPEC JJ if they were positive for two or more of papAH and/or papC (P fimbriae), sfa/focDE (S and F1C fimbriae), afa/draBC (Dr-binding adhesins), iutA (aerobactin siderophore system), and kpsM II (group 2 capsules) (23), and as UPEC HM if they were positive for two or more  13), where concatenated protein sequences were presented.Four nucleotides "TGAA" were removed from the NCBI-submitted sequences as described in the text.Only branch lengths larger than 1.00 are shown.

Data set and data used for the validation of the ETECFinder database
A collection of 1,083 ETEC isolates, representing the global diversity of ETEC, was selected for sequencing as part of various studies conducted between 1980 and 2011.These isolates were sourced from 57 different countries across Asia, Africa, and North, Central, and South America.Among the 1,083 ETEC genomes sequenced, 362 genomes had been previously described, while the remaining genomes were included to expand the existing collection and fill knowledge gaps regarding CF profiles.The selection of isolates was based on their host and virulence profiles, ensuring representation across the spectrum of ETEC diversity.These encompassed isolates obtained from adults (both indigenous populations and travelers), children under 5 years of age, and farm animals infected with ETEC.Although the data set may be biased toward ETEC collected from humans and pigs, the ETECFinder database will be continuously updated with contributions from the ETEC research community.As more research is conducted and new isolates are collected from different hosts, the database is expected to expand and include a broader range of host-associated ETEC virulence factors.The ongoing input from the research environment will ensure that the ETECFinder database remains comprehensive and reflective of the evolving understanding of ETEC diversity across various hosts.Additionally, the data set covers the 20 known ETEC lineages associated with human disease previously described in von Mentzer et al. (25) and, together with additional genomes, it aims to cover ETEC diversity on a global scale as well as multiple hosts (humans and animals).(Fig. S1-S3; Data Set S1).A subset of the clinical samples was collected from asymptomatic individuals.The isolates were screened for toxins using PCR and a subset of CFs using dot blot with available antibodies.A subset (n = 362) of the genomes was manually analyzed for virulence profiles using comparative genomics (25).The additional 721 genomes followed the same protocols for DNA extraction, sequencing, and assembly as described in von Mentzer et al. (25).An additional level of quality control of the reads and assemblies was performed to ensure high-quality sequencing data.FASTQC v0.11.8 (Babraham Bioinformatics -FastQC A Quality Control tool for High Throughput Sequence Data) and MultiQC (26) were used to investigate read quality and GC content (between 49% and 51%).Kraken/bracken was used to identify potential contamination in combination with assembly statistics, such as species abundance (>65% E. coli), the total number of bases (4.5-6 MB), and the total number of contigs (<300), and the N 50 value (>30 kb) was used to further assess the quality of the assemblies.The two BioProjects (PRJNA421191 with 305 and PRJNA416134 with 134 sequences) plus the reference genome of ETEC TW11681 (accession no.AELD00000000) analyzed by Hazen et al. (27) were also tested using the revised VirulenceFinder database.

RESULTS
Including original, renamed, replaced, and new genes, the beta ETECFinder database contains 524 alleles representing 38 loci; the gene name, the number of new alleles, and the CFs or protein names are listed in Table S3.These alleles were added to and/or changed in the existing database at the CGE website.

Updated nomenclature for enterotoxins
When constructing the ETECFinder database, a search for the various ST and LT types was conducted, which revealed identical genes but with different nomenclature.To ensure that identical gene sequences do not have multiple names, a revised, new nomenclature is proposed and listed in Table 1, along with the present nomenclature and accession numbers and/or references of the listed proteins and genes.STa (18) STa1-STa6 ( 28) STah2 and STah3 sta2 [WP_023485648. 1 (27)] As a result of the new nomenclature, the existing VirulenceFinder web tool had to be revised.The revision led to changes in the nomenclature of est and elt genes, encoding ST and LT, respectively, and the number of alleles.Some ETEC virulence genes were previously included in the VirulenceFinder database.However, the database did not contain genes from the entire fimbriae but merely one or two genes.As a result of this study, genes from the entire CF loci have been added to the database.Changes in the VirulenceFinder database are listed in Table S1 and new alleles in Table S3.

Implementation of ETECFinder
The ETECFinder database comprising all known ETEC virulence genes is now a part of the VirulenceFinder tool, which can either take reads from Illumina, Ion Torrent, Roche 454, SOLiD, Oxford Nanopore, and PacBio or assembled sequences as input.Because the CGE web tools only allow for one sequence at a time, a script for batch analyses was kindly provided by bioinformatician Maja Weiss (CGE) so that batch analyses could be performed on the SSI server.First, raw data generated from sequenced and assembled bacterial genomes were used as input.By performing a BLAST search of the genome against the database, the closest matching alleles were identified, and a virulence profile was determined.The CF profile is based on the various alleles found.The short output format includes the identified best-matching ETEC allele in the database.An additional extended output includes the nucleotide sequence of the ETEC alleles identified.For a few selected outputs, both types of input data-reads and assembled genomeswere analyzed and compared.Here, we discovered different outputs/results from using assembled genomes compared to read data.To showcase this, the genome of the ETEC strain positive for CS30 as well as LT and STp was analyzed using the ETECFinder/Viru lenceFinder database using both types of data.VirulenceFinder identified the whole CS30 gene cluster (csmS, T, A, B, C, D, E, F, and G) and both toxin genes but did not identify genes fdeC and hha (Fig. 3) when the reads were used as input.Conversely, when analyzing the assembled genome, VirulenceFinder failed to identify the full CS30 gene cluster, missing csmC and csmD.However, both fdeC and hha were identified (Fig. 4).We therefore contacted the CGE helpdesk in order to inquire about these differences in output.Hence, it was discovered that the old version of VirulenceFinder did not allow for nucleotide overlap in the assembled contigs.The genes csmC and csmB are slightly overlapping with 20 nucleotides, and the genes csmD and csmE overlap with four nucleotides.This was amended, and the results presented in this paper reflect the results of the revised VirulenceFinder, which now allows overlaps of up to 30 nucleo tides, as is also the default for ResFinder at the CGE website.An option was added to the command line tool, allowing the users to tweak this value.Thus, an additional 1,051 genes-primarily colonization factor-were detected, as were tibA and tibC, which overlap by eight nucleotides (Data Set S1 row CB1114:UL1114, summarized in Table S9).
When the KMA read aligner sees too many mismatches within a certain area, it will not align anything to that area (R. S. Kaas, DTU, personal communication), and the end of fdeC has a lot of mismatches, whereas the beginning of the gene aligns perfectly.The contig alignment indicating 93.05% identity in the assembled contig is a bit misleading as all the mismatches are found toward the end of the gene.The actual identity is a Followed by the specific toxin subtype designation, i.e., estap-STap1, estah-STah2, etc. b Followed by the specific toxin subtype designation, i.e., estb-STb1 and estb-STb2.c Three new eltIAB subtypes and four new eltIAB variants were identified in this study.d eltIIAB sequences were 48.6%-54.3%identical to eltIAB sequences (Fig. 2).e REGION: 20190-21333.
significantly lower at 71.89%.Likewise, for the hha gene, the identity is 76.73%.The cutoff value on the KMA web tool using reads is at least 85%, which is the reason why there are no hits for fdeC and hha when using reads as input.

Validation of the ETECFinder database
The revised database was tested on a collection of 1,083 preassembled genomes (sequenced by Astrid von Mentzer) to predict the virulence profile and compared to previous bioinformatics analyses.The threshold for a positive hit was set at 90% sequence identity and minimum length at 60%.This resulted in 890 ETEC matches (i.e., enterotoxin-positive E. coli) and 193 non-ETEC matches (i.e., enterotoxin-negative E. coli).In addition to identifying ETEC virulence genes, the VirulenceFinder searches for genes used for E. coli pathotype diagnostics.Interestingly, 18 genomes were predicted as hybrid pathotypes, 13 ExPEC JJ -UPEC HM , 4 ETEC/ExPEC JJ , and 1 ETEC-UPEC HM .A total of 2 genomes were predicted as UPEC HM , 15 as ExPEC JJ , and 7 as enteroaggregative E. coli (EAEC), two of which were EAEC-ExPEC JJ .One hundred fifty-six were negative for genes used for defining any E. coli pathotypes (EAEC, EIEC; EPEC, ETEC, ExPEC JJ , and UPEC HM ).

Enterotoxins
To further investigate the different ETEC toxins identified in the data set, we report that the most prevalent toxin profiles (here using the newly proposed nomenclature)  S4).Three new LT types were discovered, where LT32 was identified in 94 genomes (51 alone and 15 together with STap1), followed by LT33 (12 alone and 15 with STap1) and LT31 with STap1 (1).Additional new variants designated LT15b (11/1,083), LT12b (2/1,083), LT18b with STap1 (6/1,083), and LT18c with STap1 (1/1,083) were found in the analyzed genomes (Fig. 1).Not all genes were found with 100% identity, indicating alterations in the sequence compared to the reference sequence found in the database.This could be a result of point mutations or poor sequencing.

Human-specific ETEC colonization factors
Five hundred twenty-nine (48%) of the genomes used in this study were positive for at least one CF gene in two to five different CF types (regulator, major subunit, minor subunit, chaperone, usher, and adhesin).The specific allele combination findings are described in detail in Table S5.One hundred ninety-seven genomes were positive for only one CF gene (specific single allele findings are described in detail in Table S6).CFA/I encoded by cfaA, cfaB, cfaC, cfaD, and cfaE, of which 80 genomes contained the full CFA/I locus, was found in combination with other CFs in 327 genomes (Table S5).In three ETEC-UPEC HM genomes of serotypes O128ac:H45 (two genomes) and O153:H46 (one genome), CFA/I was the only CF (Table S6).The specific allele combination findings are listed in Table S5 and Data Set S1.In summary, 529 genomes contained 33 different combinations of colonization factors, with CFA/I being the most commonly found in 338/529 (63%) genomes, and 182/338 (54%) of these were also CS21 positive.Further more, CS21 was also found with other combination CFs such as CS6, 39/529 (7%) and alone in 32/197 (16%), followed by CS12 in combinations in 64/529 (12%) and alone in 57/197 (29%).

Additional ETEC non-canonical virulence genes
The genes for enterotoxigenic E. coli autotransporter A (eatA, three alleles), extracellu lar serine protease EspC (also called EPEC enterotoxin) (espC, three alleles), and the plasmid-encoded type II secretion pathway-related protein EtpD (etpD three alleles) were already included in the original VirulenceFinder database (48).The gene eatA was found once in 392 and twice in 2 genomes, and espAC was found once in an eae-positive sequence of in silico serotype O71:H49.The genes etpA, etpB, and etpC were found, respectively, in 31, 340, and 341 sequences.Of note, none of these were also positive for etpD found in eight sequences.The adhesin gene tia was found once in 182 genomes and twice in 1 genome.Additionally, genes such as tibA and tibB were found in 113 and 131 sequences, tleA was found once, and yghJ was found once in 869 and twice in 79 genomes.

Validation of the ETECFinder database on the ETEC collection in BioProjects PRJNA421191 and PRJNA416134
The revised database was used on a total of 440 genomes from BioProjects (PRJNA421191 with 305 and PRJNA416134 with 134 genomes) plus the reference genome of ETEC TW11681 (accession no.AELD00000000) listed by Hazen et al. (27).The toxin profile for the 269 genomes listed in Hazen et al. 's (4) Table S1 was initially determined using BLASTN large-scale BLAST score ratio (LS-BSR) analysis and represen ted a total of 346 accession numbers (S1, RAW data on the 269 genomes).The results on each of these accession numbers were "Moved to match" each of the genomes in Hazen et al. 's Table S1 (4).We then compared the published toxin and CF profiles from Hazen et al. with the output from analyzing the same genomes with the VirulenceFinder.The toxin profile covering eltIAB and est matched between the LS-BSR and VirulenceFinder approach for all of the 269 analyzed genomes.Furthermore, using the VirulenceFinder allowed us to determine the toxin subtype for each of the genomes: 145 genomes were estah-STah2, 82 estah-STah3, one each of estap-STap1, eltIAB-30; estah-STah2, eltIAB-15; estah-STah3, eltIAB-22; estah-STah3, eltIAB-29; and estah-STah2.Combinations of two toxins estah-STah2 and estah-STah3 were found in 11, estap-STap1 and estah-STah2 in 1, and estap-STap1 and estah-STah3 in 1 genome.Twenty-four genomes were toxin negative.
Two sequences that had been found negative for ST and CFs by LS-BSR fulfilled the criteria for both ExPEC and UPEC (PNRS00000000 and PNVD00000000).Four sequences (PNRS00000000, PNVC00000000, PNVD00000000, and PNYQ00000000) were enteroag gregative E. coli (EAEC) being positive for aggR and one of four different AAFs (AAFI, AAFIII, AAFIV, and AAFV).Additional results for the analyzed 269 sequences as well as the remaining 93 sequences not analyzed by LS-BSR can be found in the tables in Data Set S1.

DISCUSSION
The decreasing costs of whole-genome sequencing have led to an increase in bacte rial pathogen sequencing.However, extracting the correct data for analysis from WGS data can be challenging, despite its increased availability to routine diagnostic laborato ries.To address this issue, we developed, implemented, and evaluated an ETECFinder database to predict ETEC-associated virulence genes based on our own WGS data and on two published BioProjects PRJNA421191 with 305 and PRJNA416134 with 134 sequen ces.This database has been integrated into the pre-existing VirulenceFinder, which is accessible at www.cge.cbs.dtu.dk/services/VirulenceFinder/.Users can upload preassem bled bacterial genomes or short sequence reads, and the CGE web tools have been designed to facilitate use and output for users with limited bioinformatics experience.
Using the revised VirulenceFinder database to analyze 1,083 preassembled genomes, we identified 890 ETEC sequences, while 193 sequences were negative for enterotoxin genes.All sequences had originally been identified as ETEC-positive by PCR, and the failure to detect ETEC genes in 193 genomes could be explained by loss of toxin genes (which are often plasmid-encoded and therefore can be lost during storage) or by poor sequencing and assembly quality.A similar failure to reproduce PCR results with WGS was also observed by Hazen et al., where 26 PCR-positive results could neither be confirmed by LS-BSR nor by VirulenceFinder.However, VirulenceFinder did identify CFA/I in four genomes that were CFA/I-negative by LS-BSR and four cfaD-positive genomes that were CS6-positive by PCR but CS6-negative by LS-BSR.Additionally, VirulenceFinder found genes for five different CFs that had not been found by LS-BSR.The enterotoxin profile covering eltIAB and est matched between the LS-BSR and VirulenceFinder for all of the 269 analyzed genomes.Furthermore, using the VirulenceFinder allowed us to determine the toxin subtype for each of the genomes and identify 13 genomes with more than one copy of an enterotoxin gene.
Detecting colonization factors in ETEC is challenging due to their large number, heterogeneity, and lack of standardized tests.However, the detection of LT and ST defines an ETEC isolate, although many such isolates may express colonization factors specific to either animals or humans.In this study, we found 138 sequences with different combinations of human and animal colonization factor genes primarily involving F4 and F18.Further investigation of eight selected genomes harboring a combination of different CF genes revealed that the minor fimbrial subunit encoded by aalH (CS23) was located on the same contigs as F4 genes faeH and faeI.As aalH has been described as similar to faeJ (97% identity) (49), it could be speculated that some ETEC strains have acquired supplementary genes for the full expression of colonization factors by horizontal gene transfer.Furthermore, VirulenceFinder found csvA (CS7) in 24 genomes of which 23 were also positive for csfBCD but negative for csfA.Considering that csvA (CS7) and csfA (CS5) are 91.1% identical, it could be speculated that csvA has replaced csfA in these strains.While examining such gene exchange in the tested 1,083 sequen ces is beyond the scope of this study, expanding the VirulenceFinder tool with these ETEC-associated gene alleles will hopefully encourage users to conduct closer analyses.The issue regarding overlapping genes and the differences in output using assembled genomes and/or reads also highlights the importance of detailed sequence analyses.In summary, our findings illustrate that it may be difficult to estimate the CF profiles due to multiple gene combinations because some are identical or near identical to each other.Finally, we observed a low level (13%) of human-and animal-specific CFs in the same genomes.
Apart from 30 non-ETEC genomes that could be categorized as either ExPEC JJ , UPEC HM , or both, this study identified four ETEC-ExPEC JJ and one ETEC-UPEC HM genomes.Several studies have identified hybrid STEC-ETEC strains from both humans and animals (1)(2)(3)(4)(5)(6)(7).EPEC expressing LT of ETEC has also been described (50).Therefore, it is important to include these ETEC-related genes in a more comprehensive Virulen ceFinder tool at the CGE website in order to obtain more complete coverage of the virulence gene repertoire of pathogenic types of E. coli.
The standard nomenclature of est and elt genes is critical to minimize mistakes when analyzing these genes.Current nomenclature is insufficient to classify whether an isolate contains porcine or human-associated est genes.Additionally, it is impossible to distinguish between LT types based on the elt genes, as the gene name does not refer to the individual LT type.However, the new nomenclature meets these criteria, thus easing analyses.
We observed minor differences in the identification of genes when results obtained by assembled sequences were compared with those obtained by short-read FASTQ files.We would therefore recommend that users examine both kinds of sequence data with the VirulenceFinder web tool.Matches with an identity of 95% or higher are likely due to point mutations or poor sequencing.Depending on the length of the gene, an identity score of between 95% and 92% could represent new alleles.However, specific guidelines on how to define gene alleles, subtypes, and variants are generally problematic and invite international collaboration between dedicated researchers and experts.

Future perspectives
ETEC vaccines are of great importance due to the severity of the infections, primarily in children.A tool such as this could assist in the surveillance of ETEC in order to deter mine the prevalence of relevant types in different parts of the world, allowing vaccine developers to target the most prevalent types and, thus, a more effective vaccine.

ACKNOWLEDGMENTS
The script for batch analyses was kindly provided by bioinformatician Maja Weiss (CGE, DTU), allowing batch analyses on the SSI server.The download of the genomes from the two BioProjects PRJNA421191 and PRJNA416134 and the installation of the revised VirulenceFinder script were performed by Karen Loaiza Conza (SSI).Anne Sophie Majgaard Uldall (SSI) provided scripts that were used to assemble the individual results into the results sheets in Data Set S1.Finally, Rolf Sommer Kaas, DTU Food, CGE identified the bug in the old version of VirulenceFinder, which did not allow overlapping nucleotide sequences in assembled contigs.
F.S. was partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement no.773830.A.V.M. was supported by the Swedish Research Council grant no.2022-01449 and the Svenska Sällskapet för Medicinsk Forskning grant no.P18-0140.

TABLE 1
Nomenclature of LT and ST toxins and genes

TABLE 1
Nomenclature of LT and ST toxins and genes (Continued)

synonym for toxin type (accession no.; reference) New synonym for toxin [adapted from reference (28) and this study] Previous synonym for toxin gene (accession no.; reference) New synonym for toxin gene (this study) Length of amino acid sequence(s)
e B subunit: 123