Introduction

Erwinia amylovora is the first phytopathogenic bacterium ever described (Vanneste 2000) and still one of the top ten most dangerous plant pathogens (Mansfield et al. 2012). It is one of the main species of the genus Erwinia, and the causal agent of fire blight, a devastating disease affecting a wide range of host species within Rosaceae (apple, pear, hawthorn, cotoneaster, rubus, etc.), with a potentially disastrous economic impact on pome fruit commercial production (Griffith et al. 2003; Heitefuss 2012). The systemic infection of E. amylovora usually occurs in spring, and it starts when the bacterial cells enter the plants through the nectarthodes of flowers or other natural openings and slowly spreads to the vascular system, mainly in the xylem. Via the rapid movement within the plant, E. amylovora causes extensive wilting symptoms and sometimes complete dieback (Thomson et al. 1986). The mechanisms by which E. amylovora triggers this relevant necrotic plant disease involve the migration in the apoplast, and amylovoran, an exopolysaccharide (EPS) essential in the formation of a protective biofilm, is associated with the motility phenotype (Yuan et al. 2022). Along with the type III secretion system (T3SS), the amylovoran biosynthetic pathway is one of the main virulence factors required for successful infection in E. amylovora as well as in other Erwinia species (Malnoy et al. 2012; Piqué et al. 2015). Mutants defective in the exopolysaccharide production are unable to form the required protective biofilms, which are fundamental to prevent recognition by the host plant defence system, obstruct the vascular system and facilitate movement through the plant tissue. In addition, amylovoran has also been shown to promote bacterial survival outside the host plant by preventing drying (Bellemann et al. 1994; Kharadi et al. 2021; Koczan et al. 2009; Vanneste 2000).

Besides the proteins involved in the galactose metabolism (Benini et al. 2017; Metzger et al. 1994), the ams operon determines the biosynthesis, chain elongation, transport and secretion of the amylovoran repeating unit, which is composed of four galactose, one glucuronic acid residue and final pyruvate. This operon was firstly identified in 1995 as a single transcript of 16 kb with 12 collinear open reading frames, represented by genes from amsA to amsL (Bugert and Geider 1995). Six of the ams-encoded proteins are essential to the synthesis of an amylovoran subunit (Fig. 1, upper part): the carbohydrate-active enzymes AmsG, AmsB, AmsD, AmsE, AmsK, which have a distinct role in glycosyl transfer, and the pyruvyl transferase AmsJ, play a key part in annealing the different sugar moieties to form a new polysaccharide unit. The consequent polymerization is instead achieved by AmsC and AmsF. While the amsH and amsL gene products are involved in oligosaccharide transport and assembly, the last two Ams proteins possess a tyrosine kinase (AmsA) and an acid phosphatase (AmsI) activity and regulate amylovoran synthesis (Langlotz et al. 2011; Wang et al. 2012a).

Fig. 1
figure 1

Schematic representation of the amylovoran biosynthetic pathway. All protein structures are models obtained from AlphaFold analysis, except for AmsI, solved and deposited as PDB entry 4D74 (Salomone-Stagni et al. 2016). Proteins involved in the steps of synthesis, polymerisation, transport and regulation have been grouped into different sets. The synthesis steps governed by each protein are colour coded (and underlined in the same colour). At the beginning of the synthesis, a molecule of a membrane-bound lipidic carrier P-Und (undecaprenylphosphate) binds a molecule of UDP-galactose thanks to the glycosyltransferase AmsG (yellow). AmsD in blue and AmsE in red, add a second and a third molecule of UDP-Gal. To this trisaccharide, a molecule of UDP-Glucuronic acid is added by AmsB (pink). The light blue protein AmsK adds a further UDP-Gal. AmsJ (orange) adds the pyruvate to end the growth of the side chain. Another yet unidentified glycosyltransferase is responsible for adding the glucose moiety in 50% of the repeating units. Proteins AmsC, AmsH, and AmsL (represented respectively in violet, black and turquoise) were suggested to be involved in oligosaccharide transport and assembly, as they contain transmembrane domains. While AmsA (beige) is a tyrosine kinase and activates amylovoran, AmsI (light green) appears to have a distinct function in the recycling of the lipid carrier. Finally, the enzyme AmsF (grey) processes the newly synthesized repeating units and is involved in their polymerization, adding them to another amylovoran chain

Extensive research has been dedicated to investigating the regulation of the biosynthetic pathway of amylovoran in E. amylovora (e.g., Benini 2021; Smits et al. 2017; Wang et al. 2011). Over the past years, there has been thorough examination of this complex process: the roles of the rcsA gene product and the RcsB protein in transcriptional activation and regulation of the ams operon has been extensively studied (Bugert and Geider 1995; Kelm et al. 1997). Furthermore, the orphan gene amyR has been identified as a novel negative regulator of amylovoran production (Wang et al. 2012b). The second messenger cyclic-di-GMP (c-di-GMP) has been recently reported to positively regulate the transcription of amsG, which encodes the first protein present in the amylovoran biosynthetic pathway (Kharadi and Sundin 2022). Moreover, current researches have shed light on the AraC-type transcription regulator YqhC, which has also been found to influence amsG expression and influence E. amylovora pathogenicity (Sahebi et al. 2022). However, the research on the exopolysaccharide biosynthesis in Erwinia spp. has been relatively limited. There are a few comparative studies investigating the responsible operon across a small range of Erwinia and Pantoea genomes (e.g., Langlotz et al. 2011; Kamber et al. 2012) but none of them exploits the wealth of genomic data available nowadays.

The significance of exopolysaccharide secretion in the pathogenic phenotype of Erwinia species is well recognized. Building upon earlier research, the present study aims to contribute by outlining the differences in the chromosomal organization of the EPS biosynthetic operon (ams, cps, wce, etc.) within the Erwinia genus. We used a comparative genomic approach on a dataset constituted of eighteen genomes belonging to the Erwinia spp. A genome-wide analysis was conducted including phylogenetic inference, orthology network analysis using BLAST and the Bidirectional-Best-Hit approach. To finalize the examination of the ams gene, analysis on the predicted 3D structure of the proteins involved in the amylovoran biosynthetic pathway were conducted.

Materials and methods

Download of the genomes and phylogenetic tree

In order to have a dataset covering most of the Erwinia species, we selected forty-six genomes including pathogens and non-pathogens. We selected the genomes with the assembly level complete as of 29 May 2023. A phylogenetic tree, including all selected Erwinia species, Duffyella gerundensis and Pantoea stewartii DC283 as an outgroup, was built using PhyloPhlAn3 (Asnicar et al. 2020). Bootstrap support was calculated after 999 random resampling of the characters. The tree was drawn and annotated (i.e. presence or absence of genes was reported on each leaf of the tree) using IToL server 6.7.1 (Letunic and Bork 2021). A comprehensive list of all the genomes downloaded and the complete generated tree are available as Supplementary material (Fig. S1 and Table S1). Average Nucleotide Identity was computed using the software Pyani (Pritchard et al. 2016), and the output was plotted using DiMHepy (Esposito et al. 2019).

Identification of ams genes

The search for homologous genes was done using BLAST (Altschul et al. 1990). Since the focus was on the among-species variation, and the identity among the different E. amylovora genomes was over 97%, only one genome was chosen per species (Table 1). The coding and protein sequences of the ams operon in each genome were identified using as a query the genes and proteins, respectively, of E. amylovora CFBP 1430 (Smits et al. 2010a, b), and as a database for each of the other seventeen Erwinia (Table 2). We considered as valid orthologs inference the hits having over 80% of coverage and at least 80% of nucleotide identity or 60% of protein identity. To account for the increasing phylogenetic distance between the query (i.e. the E. amylovora genes) and the other species, we repeated the search using as query the orthologs of E. oleae. This species was the most early diverging genome in the tree (therefore having the higher divergence from the E. amylovora) in which all orthologs were found. To further minimize the false negative detection due to low sequence identity, we also used the agnostic tool Prodigal (Hyatt et al. 2010), to detect additional coding sequences in the region of interest. To create the gene trees, we relied on the MEGAX program package (Tamura et al. 2021) using the neighbour-joining method (Fig. S2). The structure of the operon was depicted using the R-package GenoPlotR (Guy et al. 2010), using the coordinates found by the BLAST analysis, making sure that the operon was found on the same contig and including the genes at the 5´ and 3´ ends (gene yegH and galG, respectively). For simplicity, the orthologous genes detected in other species were named as "Ams" or “ams”, but it is essential to recognize that individual names are designated for each protein or gene involved in exopolysaccharide synthesis across the different species.

Table 1 List of the eighteen genomes selected for the comparative genomic study. The accession numbers under which each genome sequences was deposited (if sequence is fragmented, the assembly number was preferred), the associated size, habitat, related pathology and references to each genome sequence are given
Table 2 For each protein of Erwinia amylovora CFBP 1430 considered as reference, locus tags, predicted function and gene name were listed

Protein similarity networks

Networks of orthologous proteins were predicted by using an in-house python software developed according to the pipeline described in two previous works (Ambrosino and Chiusano 2017; Ambrosino et al. 2018). In details orthologs, inferred by all-versus-all BLAST similarity searches (Camacho et al. 2009) together with a Bidirectional Best Hit (BBH) approach (Overbeek et al. 1999) were grouped into networks by using the NetworkX package (Hagberg et al. 2008).

The identification of Ams networks was obtained by using each Ams protein as a probe to scan the results and select the networks of interest. The graphical visualisation of these relationships was obtained via Cytoscape software (Shannon et al. 2003), generating networks in which nodes are proteins and a connection between each node (edges) were established upon inference of are homology.

Homology modelling and structural superposition

Three-dimensional (3D) homology models of the Ams proteins were built exploiting the AlphaFold server (Jumper et al. 2021). The top-score models generated were visualized using PyMOL v.2.4 (Schrödinger and DeLano 2020) and UCSF Chimera software (Pettersen et al. 2004). The MatchMaker tool of the suite UCSF Chimera was required for structural comparison. The search for conserved motifs (a recurring and ungapped sequence pattern that is approximately repeated within a set of related sequences) in some of the Ams proteins and YqhC was done using the MEME suite 5.5.1 (Bailey et al. 2015).

Results

Phylogenetic tree

To determine the phylogenetic relationships among the different Erwinia species, we built a phylogenetic tree using PhyloPhlAn3, a software that extracts the protein sequence of over 4000 markers and uses them as an input for the multiple sequence alignment and phylogenetic inference. The phylogenetic tree was built from a concatenated alignment of 5866 marker genes (total informative sites 49264), using RAxML (Fig. S1). The likelihood tree was created using the GTR substitution model with gamma-distributed rates. The final tree log-likelihood is -938871.63. All the conspecific genomes formed well-distinct monophyletic clusters, however within the same species, the genome of the isolates infecting different plants (i.e. E. amylovora RM1 and Ea644, that infect Rubus spp.) and the ones from isolates taken from geographically distant locations (Erwinia sp. Ejp617, reclassified as E. pyrifoliae), cluster as an outgroup to the other conspecific genomes. The earliest diverging species are Duffyella gerundensis (Soutar and Stavrinides 2022), E. iniecta, and E. toletana (Fig. 2). The species known to be pathogenic are not monophyletic, nevertheless the topology of the Erwinia spp. mirrors the one of the host plants (Cole and Hilger 2013). More specifically, Rosaceae and Fabaceae infecting species are monophyletic, whereas the ones infecting Myrtaceae, Cucurbitaceae and Oleaceae belong to distinct clades. The ANI ranged from 0.75 to 0.90, coherently with the values found among congeneric bacterial species, and the clustering pattern determined by the ANI is consistent with the one of the phylogenetic tree (Fig. S3).

Fig. 2
figure 2

Phylogenetic tree of the eighteen genomes considered for the study. On the right side of the picture, there is a paraphyletic representation of the angiosperm families including only the hosts of the Erwinia strains. In the middle, the genes are represented as dots, full dots represent the presence of a complete CDS and empty dots represent the proteins found with an identity minor of 60%. Asterisks at the end of species name indicates the type strains.

The ams operon displays a variable degree of conservation

The search for orthologs using the sequences of the ams genes from E. amylovora revealed that in E. piriflorinigrans, E. pyrifoliae and E. oleae orthologs for the whole operon were found with a complete sequence respecting all set parameters, whereas in the other species, the number of the proteins of interest ranged from 1 to 11 (Fig. 2). The most prevalent protein is AmsH (putative amylovoran export outer membrane protein), and the low molecular weight protein-tyrosine-phosphatase AmsI was found in all strains except E. toletana. The amylovoran biosynthesis membrane-associated protein AmsA, and the two proteins with roles in glycosyl transfer AmsB and AmsG were found in 16 genomes out of 18 studied ones, followed by the two proteins with transport and polymerization functions AmsC and AmsF in 15 and 14 genomes, respectively. AmsJ (with a Pyruvyl transferase function), AmsK (another galactose transferase) and AmsL (a transmembrane protein) in 13. The least present are the glycosyltransferases AmsD and AmsE, both found only in E. amylovora, E. oleae, E. piriflorinigrans and E. pyrifoliae. In all but one strains it was found a paralog of amsG located approximately 200 Kbp upstream of the operon. The only genome in which amsG was found as a single copy gene was in E. tracheiphila.

The regions from all genomes were extracted and compared in Fig. 3 where the main rearrangements of the ams genes and their homologs (cps/wce cluster) are visualised. Genes associated with the synthesis of exopolysaccharides in each species were always collinear. As pointed out in Kamber et al. 2012, there is a reciprocal translocation of the genes amsC and amsB between the genomes of Pantoea spp. and the Erwinia spp. In E. billingiae the homologous region to amsC contains a CDS that did not pass the threshold to be defined an orthologous protein. In E. iniecta and E. toletana a deletion of amsG occurred. In D. gerundensis, E. phyllosphaerae and E. typographi, where an homolog of amsJ, amsK and amsL is consistently missing, the homologous region presented a high sequence divergence. The pairwise identity matrix revealed different degrees of molecular divergence, with amsG and amsL being more conserved (with an id always over 75%), and amsF the most variable gene (Fig. 4).

Fig. 3
figure 3

Representation of the organisation of the EPS gene cluster in the different Erwinia spp. considered. Arrows represent the genes, fully coloured arrows represent the genes with the percentage of identity above the threshold (80%), empty arrows with coloured outline represent genes with the threshold in the range 50–80%, empty arrows with black outline represent other CDS. As they are similar in organisation, the species E. pyrifoliae, E. piriflorinigrans and E. oleae have been merged by taking E. amylovora as representative. The same was done for the species E. aphidicola, E. endophytica, E. mallotivora, E. psidii, E. rhapontici, E. tasmaniensis and E. tracheiphila,, taking E. persicina as representative

Fig. 4
figure 4

Boxplots displaying the pairwise identity (y-axis) among all the combinations of each Ams protein (x-axis). Each boxplot represents the range of similarity score obtained by comparing the Ams protein sequence from any given species of each of the selected 18 Erwinia species. Each dot represents a comparison and it is coloured according to the species used as a query

Prediction of orthologs and protein similarity networks

The BBH homology inference (Figs. 5 and S5) confirmed the one-to-one relationships for the proteins AmsA, AmsI and AmsH, which formed fully connected cliques (i.e. it was found in one copy for each genome). AmsB was also fully connected except for AmsB in E. toletana, which had a BBH only with E. iniecta. The same was found for AmsL, where AmsL in E. toletana had an homology inference with the one in D. gerundensis. The proteins AmsC, AmsF and AmsJ were located in almost all genomes with a one-to-one relationship. More complex patterns were discovered for the other Ams proteins. More specifically, the proteins AmsD and AmsK belonged to a homology network that was connected by a protein of E. iniecta annotated as a GTB-type-glycosyltransferase (WP_052897582.1). The gene amsG was found with two paralogs in many species, although it was absent in E. iniecta. AmsE displays a one-to-one relationship with only seven other proteins and shows an orthology relationship with a protein in E. persicina, a glycosyltransferase family 2 protein (WP_062742809.1), through a homology relationship inferred with D. gerundensis.

Fig. 5
figure 5

Panel of the most representative networks containing Ams orthologs in the considered Erwinia species. Circles, squares triangles and diamonds represent proteins per species, according to the legend. Each species was represented with a symbol of the same colour code used throughout the manuscript and each protein was labelled with its ID. Grey lines represent orthology relationships defined as Bidirectional Best Hits. A. Network containing AmsD and AmsK orthologs in the considered Erwinia species. B. Network containing AmsB orthologs. C. Network containing AmsE orthologs. D. Network containing AmsG orthologs. (See Supplementary materials for more networks)

Structure comparison and motifs conservation

From the superimposition of the various structures obtained with AlphaFold, it can be seen that the glycosyl transfer proteins AmsB and AmsE have a similar structure, overlapping almost perfectly, as do the pyruvyl transferase AmsJ and the two glycosyltransferases AmsD and AmsK (Fig. 6). Although some motifs have been found nearly replicated among the collection of interconnected proteins, the repetitive amino acid sequence do not show the same predicted secondary structure.

Fig. 6
figure 6

Superposition of the various Ams structures obtained with AlphaFold and MEME analysis in which are reported the conserved motifs (recurring, fixed-length patterns) between the considered proteins. On the top left is the overlay of AmsD (blue), AmsJ (orange) and AmsK (light blue) structures. Beside, in red is the AmsE structure and in pink AmsB. Below we find in yellow AmsG and its “paralog” (par_AmsG) with the corresponding MEME analysis

Even between different Erwinia species, the main motifs of any given protein are retained, although they do not have the same protein sequence (e.g. in E. amylovora identity of 62%, 282 amino acids retained out of a total of 454), each AmsG protein and its paralog also overlap structurally.

Discussion

The genus Erwinia includes both pathogenic and non-pathogenic species. The pathogenicity of those species is connected with the presence and expression of genes involved in the biosynthesis of siderophores and extracellular polysaccharides (EPS). A previous survey on the genes involved in the hydroxamate siderophore-mediated iron uptake revealed that the Rosaceae-infecting strains form a distinct cluster according to specific genomic features, which include a high number of conserved core genes and the presence of complete coding sequences for specific genes (dfoJAC, fhuA, foxR, fhuD, fhuB, fhuC and sidE) involved in the biosynthesis and utilization of siderophores (Polsinelli et al. 2019). In the present study, we extended the number of species included and moved the focus from the iron uptake system to the EPS.

According to Mann et al. 2013, Erwinia amylovora can be divided into host-specific genetically homogeneous strains colonizing different Rosaceae subfamilies (Mann et al. 2013). The analysis of the ANI among species suggests that this host-specific genomic homogeneity can also be detected at a higher taxonomic level. Indeed, the cluster of species infecting Rosaceae (E. amylovora, E. pyrifoliae, E. pirifolinigrans, E. tasmaniensis), is characterized by a higher similarity among them. The cluster of genomes infecting Fabaceae (E. persicina, E. aphidicola, E. rhapontici) also displays high similarity but to a minor extent in comparison with the Rosaceae infecting species. Furthermore, the high-resolution Erwinia phylogenetic tree topology was congruent with the phylogenetic tree of the Angiosperm families, suggesting that long-term interaction between host and pathogen has reciprocally influenced the evolution at the genomic level (Fig. 2). This phenomenon leads us back to the well-known “red queen hypothesis” (Papkou et al. 2019). Another evidence of this is the fact that Rubus infecting strains form a distinct group within the monophyletic clade of E. amylovora (Fig. S1). The species Duffyella gerundensis (formerly known as Erwinia gerundensis, Rezzonico et al. 2016) was also isolated from pome fruit, however it is placed very distant from the Rosaceae-infecting strains. This should be not interpreted as a contradiction because D. gerundensis changed the genus name, therefore it should be seen as an outgroup rather than an early diverging Erwinia lineage; and because this species is not pathogenic, it is rather commensal, therefore there is no strict co-evolution as in other host–pathogen relationships. Despite E. tasmaniensis is regarded as "pathoadapted", housing the majority of pathogenicity factors for Rosaceae, no pathogenicity has been conclusively demonstrated thus far. While we have not specifically investigated this aspect in our research, it is plausible that regulatory mechanisms contribute to the pathogenicity observed. The same occurs with E. toletana and E. oleae, which are grouped in the same cluster that infects the Oleaceae because although not strictly pathogenic, E. toletana co-occurs in infections caused by Pseudomonas savastanoi and modulates the disease severity (Buonaurio et al. 2015). Hence, the production of an EPS by E. toletana remains uncertain, yet there might exist other pathogenesis factors that exacerbate the condition caused by P. savastanoi.

Since the pathogenicity of Erwinia amylovora is connected to amylovoran secretion, we focused our attention on the biosynthetic operon ams. The rearrangements of genes in the operon could indicate that the gene products and/or the EPS structures are different (Moreno-Hagelsieb 2015). A previous study compared the E. amylovora ams operon to the cps region in Pantoea stewartii, which encodes for the biosynthetic pathway of stewartan, a related polysaccharide involved in the pathogenesis of this species (Langlotz et al. 2011). It was found that the pair of genes amsB-C and amsD-E underwent a translocation event between the two genomes. In the present study, we found that the order of the genes is conserved within the whole Erwinia genus. The genes in the first part of the operon (amsG, amsH and amsI) are well conserved in their organization among all species in which they were detected (Fig. 3), additionally, they display a relatively lower level of divergence compared to the other genes (Fig. 4). Similarly, also amsA and amsB are well conserved, but their sequence similarity is not higher compared to the one of the rest of the genes. Other genes, such as amsD and amsE show re-arrangement such as translocations (amsD in E. endophitica) or were not found in many species. Indeed, the amsE gene is relatively rare among various Erwinia spp.. Yet, as demonstrated by previous research conducted by Kamber et al. 2012, there exists a notable similarity between other proteins located at the same position (such as WbdN), which, in turn, share only a 25% identity with AmsE. Furthermore, the same earlier research validates the observation that in the genomic regions corresponding to the location of amsCDE, the operon of E. billingiae has two other distinct genes, identified as an acyltransferase and a SGNH/GDSL hydrolase family protein, respectively. Those genes (particularly amsD) display a lower similarity among all species. However the network approach could detect amsD in different species through a BBH approach, the wide range of sequence divergence among the amsD genes imply that the molecular clock runs faster for those genes than it does for the other genes in the operon. Probably due to a weaker or disruptive selective pressure. A similar situation was found also for the genes amsJ-K and amsL in the three non-pathogens D. gerundensis, E. phyllosphaerae, and E typographi, suggesting that the biosynthetic pathway results in a polysaccharide with different secondary chains.

As demonstrated for the prokaryotic isomerases (Álvarez-Lugo and Becerra 2023), the duplication of genes is often associated with an increased expression (i.e. “gene dosage effect”). A paralog of the amsG gene was consistently found in all except one Erwinia species (that is E. tracheiphila). Being the first protein, AmsG catalyses a critical step in the biosynthetic pathways: it triggers the amylovoran unit synthesis by adding the first molecule of UDP-galactose to the membrane-bound lipid transporter. The presence of a paralog, which for example in E. amylovora is catalogued as another undecaprenyl-phosphate galactose phosphotransferase (wbaP gene, EAMY_1987), highlights its crucial role over the adjacent genes in the ams operon. Indeed, the downregulation of amsG itself significantly hampered amylovoran secretion and motility (Yuan et al. 2022). The gene tree based on the sequences of both amsG paralogs in all species showed a clustering pattern that grouped together the two paralogs in different species rather than both the gene copies in each species, suggesting that the duplication occurred before the differentiation of the Erwinia spp. (Fig. S2). In E. toletana and E. iniecta the first gene of the operon amsG was deleted, implying that in those species the exopolysaccharides biosynthetic pathway starts with the catalysis of an alternative protein or potentially not start at all, consequently leading to a significant change in their virulence potential.

Other exopolysaccharides have been characterized in the Erwinia species, one example is the pyrifolian, secreted by E. pyrifoliae that, as it was observed by Kim et al. 2002, displays a remarkable similarity with amylovoran in its biochemical structure. Pyrifolian is encoded by the cps operon, which, as we also observed, is orthologous to the ams operon in E. amylovora. Moreover, the cluster discovered in both E. tasmaniensis and E. billingiae indicates that the structure of EPS of those species may bear a closer resemblance to stewartan, the EPS secreted by Pantoea stewartii (Schollmeyer et al. 2012). A particular case encountered is that of E. tracheiphila, which, unlike other species, uses insect vectors for disease transmission (Mitchell and Hanks 2009). amsD and amsE were found at a low identity or not at all, and the gene amsG was found as a single copy. Evidence suggests that E. tracheiphila exhibits decreased EPS secretion (Mason 2013), implying that the different mechanisms of exopolysaccharide production and secretion could be connected to environmental adaptation. We screened this species also for the yqhC gene, and the Arac-like transcriptional regulator was not found, whereas it was found in all other Erwinia spp. (See supplementary material).

The use of the orthology network confirmed for most of the proteins a one-to-one relationship (Fig. S5). For AmsB and AmsL in E. toletana, it was possible to define an orthology relationship only with E. iniecta or D. gerundensis, respectively (Fig. 5B). This could be due to the higher sequence divergence of the ams genes in those species. The glycosyltransferases AmsD and AmsK belonged to the same network, however, they display a fourth-degree connection through an orthologous protein of E. iniecta that was the only node connecting the network of AmsD with the one of AmsK (Fig. 5A). These two proteins, as well as AmsE, belong to the large family of glycosyltransferase, which explains the presence of multiple proteins detected as paralogs to the ones of interest.

The relationships between protein sequence and 3D structure are complex, given the involvement of numerous factors in its determination (Krissinel et al. 2007). It could be expected that proteins with a more similar primary sequence display a high degree of overlap in their 3D structure. Indeed, the 3D structures of AmsD and AmsK overlapped, confirming their relatedness inferred also through the network analysis (Figs. 5 and 6). Conversely, proteins such as AmsB and AmsE had a similar 3D structure despite a higher divergence at the primary sequence level. This suggests a strict selection acting on the 3D structure yet tolerating a higher amino acid substitution rate. Indeed, it is also likely that the various glycosyltransferases resemble each other in structure because they catalyse the same reaction on slightly different substrates. Further structural analysis eventually revealed that AmsG and its duplicate, have similar sequence and the same secondary structure, emphasising how the function of the pathway's first glycosyltransferase may be crucial in the production of amylovoran. However, the structural analysis cannot be considered as predictive of a similar function. Future studies, aimed at understanding the substrate range of the two paralogues would be more reliable.

Conclusion

The availability of genome sequences for many bacterial species provides a valuable resource for comparative genomics. The study of the ams operon organisation in different Erwinia strains allows a deeper understanding of the evolution of the exopolysaccharide biosynthesis gene clusters in those species. Generally, the operon gene order and the protein sequence exhibit resemblance to one another among the species, suggesting the preservation of their functions. Minor differences have been detected in the genomes from Erwinia species distantly related to E. amylovora.

The main findings regard the presence of a paralog of the glycosyltransferase amsG in most of the species, and the progressive sequence divergence of the genes at the 5`- end of the operon. This confirms the key role of the AmsG protein in the production of EPS and suggests the possibility of modifications to the chemical structure of the exopolysaccharide produced by each species, especially in the secondary chain. Moreover, the deletion of AmsG in E. iniecta, and E. toletana and the yqhC gene in E. tracheiphila suggests that in those species the exopolysaccharides production follows different biosynthetic pathways. Protein similarity networks revealed at least a one-to-one relationship between most of the Ams proteins, although in some species there were more complex orthology patterns among the ams genes. The 3D structure analysis allowed us to put forward hypotheses about the selective pressures acting at different levels on some of the proteins. Future research is needed in determining the effect of the genomic rearrangements found in the different species on the virulence and the chemical composition of the EPS, and in the last instance on the virulence of the species.