Bioinformatics comparisons of RNA-binding proteins of pathogenic and non-pathogenic Escherichia coli strains reveal novel virulence factors

Pathogenic bacteria have evolved various strategies to counteract host defences. They are also exposed to environments that are undergoing constant changes. Hence, in order to survive, bacteria must adapt themselves to the changing environmental conditions by performing regulations at the transcriptional and/or post-transcriptional levels. Roles of RNA-binding proteins (RBPs) as virulence factors have been very well studied. Here, we have used a sequence search-based method to compare and contrast the proteomes of 16 pathogenic and three non-pathogenic E. coli strains as well as to obtain a global picture of the RBP landscape (RBPome) in E. coli. Our results show that there are no significant differences in the percentage of RBPs encoded by the pathogenic and the non-pathogenic E. coli strains. The differences in the types of Pfam domains as well as Pfam RNA-binding domains, encoded by these two classes of E. coli strains, are also insignificant. The complete and distinct RBPome of E. coli has been established by studying all known E. coli strains till date. We have also identified RBPs that are exclusive to pathogenic strains, and most of them can be exploited as drug targets since they appear to be non-homologous to their human host proteins. Many of these pathogen-specific proteins were uncharacterised and their identities could be resolved on the basis of sequence homology searches with known proteins. Detailed structural modelling, molecular dynamics simulations and sequence comparisons have been pursued for selected examples to understand differences in stability and RNA-binding. The approach used in this paper to cross-compare proteomes of pathogenic and non-pathogenic strains may also be extended to other bacterial or even eukaryotic proteomes to understand interesting differences in their RBPomes. The pathogen-specific RBPs reported in this study, may also be taken up further for clinical trials and/or experimental validations.


Background
Escherichia coli is one of the most abundant, facultative anaerobic gram-negative bacterium of the intestinal microflora and colonises the mucus layer of the colon. The core genomic structure is common among the commensal strains and the various pathogenic E. coli strains that cause intestinal and extra-intestinal diseases in humans [1]. In the pathogenic strains, novel genetic islands and small clusters of genes are present in addition to the core genomic framework and provide the bacteria with increased virulence [2][3][4]. The extracellular intestinal pathogen, enterohemorrhagic E. coli (EHEC), which cause diarrhea, hemorrhagic colitis and the haemolytic uremic syndrome, is the most devastating of the pathogenic E. coli strains [5,6].
Pathogenic bacteria have evolved various strategies to counteract host defences. They are also exposed to environments that are undergoing constant changes. Hence, in order to survive, bacteria must adapt themselves to the changing environmental conditions by altering gene expression levels and in turn adjusting protein levels according to the need of the cell. Such regulations may occur at the transcriptional and/or posttranscriptional levels [7].
Here, we describe the employment of mathematical profiles of RBP families to study the RBP repertoire, henceforth referred to as the 'RBPome' , in E. coli strains. The proteomes of 19 E. coli strains (16 pathogenic and three non-pathogenic strains) have been studied to compare and contrast the RBPomes of pathogenic and non-pathogenic E. coli. More than 40 different kinds of proteins have been found to be present in two or more pathogenic strains, but absent from all the three nonpathogenic ones. Many of these proteins are previously uncharacterised and may be novel virulence factors and probable candidates for further experimental validations.
We have also extended our search method to probe to all available E. coli complete proteomes (till the date of the study) for RBPs, and thus obtain a bigger picture of the RBP landscape in all known E. coli strains. The search method can also be adapted in future for comparing the RBPomes of other species of bacteria as well. In addition, our work also discusses case studies on a few interesting RBPs. The first of them is an attempt to provide a structural basis for the inactivity of the Ribonuclease PH (RNase PH) protein from E. coli strain K12, the second study deals with the structural modelling and characterisation of RNA substrates of an 'uncharacterised' protein that is exclusively found in the pathogenic E. coli strains, whereas the third one involves the analysis of pathogen-specific Cas6 proteins and comparison with their non-pathogenic counterparts.

Dataset
Protein families were grouped on the basis of either structural homology (structure-centric families) or sequence homology (sequence-centric families). A dataset of 1285 RNA-protein and 14 DNA/RNA hybrid-protein complexes were collected from the Protein Data Bank (PDB) (May 2015) and were split into protein and RNA chains. The RNA-interacting protein chains in this dataset were classified into 182 Structural Classification of Proteins (SCOP) families, 135 clustered families and 127 orphan families (a total of 437 structure-centric families), on the basis of structural homology with each other. Sequence-centric RNA-binding families were retrieved from Pfam, using an initial keyword search of 'RNA' , followed by manual curation to generate a dataset of 746 families. The structure-centric classification scheme, the generation of structure-centric family Hidden Markov Models (HMMs) and retrieval of sequence-centric family HMMs from the Pfam database (v 28) were as adapted from our previous study [43].
Proteomes of 19 E. coli strains were retrieved from UniProt Proteomes (May 2016) [44] for the comparative study of pathogenic and non-pathogenic strains. The names and organism IDs of the E. coli strains, their corresponding UniProt proteome IDs and the total number of proteins in each proteome have been listed in Table 1.
All complete E. coli proteomes were retrieved from RefSeq (May 2016) [45] to study the overall RBP landscape in E. coli. The names of the E. coli strains, their corresponding assembly IDs and the total number of proteins in each proteome and have been listed in Table 2.

Search method
The search method was described in our previous study [43] and is represented schematically in Fig. 1. A library of 1183 RBP family HMMs (437 structure-centric families and 746 sequence-centric families) were used as start points to survey the E. coli proteomes for the presence of putative RBPs. The genome-wide survey (GWS) for each E. coli proteome was performed with a sequence E-value cut-off of 10 −3 and the hits were filtered with a domain i-Evalue cut-off of 0.5. i-Evalue (independent E-value) is the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a stringent measure of how reliable this particular domain may be. The independent E-value uses the total number of targets in the target database. We have now mentioned this definition in the revised manuscript. The Pfam (v 28) domain architectures (DAs) were also resolved at the same sequence E-value and domain i-Evalue cut-offs.

Comparison of RNA-binding proteins across strains
The RBPs identified from 19 different strains of E. coli, were compared by performing all-against-all protein sequence homology searches using the BLASTP module of the NCBI BLAST 2.2.30 + suite [46] with a sequence E-value cut-off of 10 −5 . The hits were clustered on the basis of 30% sequence identity and 70% query coverage cut-offs to identify similar proteins i.e., proteins that had a sequence identity of greater than or equal to 30%, as well as a query coverage of greater than or equal to 70%, were considered to homologous in terms of sequence and hence clustered. These parameters were standardised on the basis of previous work from our lab to identify true positive sequence homologues [47].
Associations for proteins that were annotated as 'hypothetical' or 'uncharacterised' , were obtained by sequence homology searches against the NCBI non-redundant (NR) protein database (February 2016) with a sequence E-value cut-off of 10 −5 . The BLASTP hits were also clustered on the basis of 100% sequence identity, 100% query coverage and equal length cut-offs to identify identical proteins.
Clusters that consist of proteins from two or more of the pathogenic strains, but not from any of the nonpathogenic ones, will henceforth be referred as 'pathogenspecific clusters' and the proteins in such clusters as 'pathogen-specific proteins'. Sequence homology searches were performed for these proteins against the reference human proteome (UP000005640) retrieved from Swiss-Prot (June 2016) [44] at a sequence E-value cut-off of 10 −5 . The hits were filtered on the basis of 30 percentage sequence identity and 70 percentage query coverage cut-offs.

Modelling and dynamics studies of RNase PH protein
The structures of the active and inactive monomers of the tRNA processing enzyme Ribonuclease PH (RNase PH) from strains O26:H11 (UniProt ID: C8TLI5) and     [48]. The active and inactive RNase PH monomers are 238 and 228 amino acids in length, respectively and are 69% and 70% identical to the template, respectively. Twenty models were generated for each of the active and inactive RNase PH monomers and validated using PROCHECK [49], VERIFY3D [50], ProSA [51] and HARMONY [52]. The best model for each of the active and inactive RNase PH monomers were selected on the basis of Discrete Optimized Protein Energy (DOPE) score and other validation parameters obtained from the above-mentioned programs. The best models for the active and inactive RNase PH monomers were subjected to 100 iterations of the Powell energy minimisation method in the Tripos Force Field (in absence of any electrostatics) using SYBYL7.2 (Tripos Inc.). These were subjected to 100 ns (ns) molecular dynamics (MD) simulations (three replicates each) in the AMBER99SB protein, nucleic AMBER94 force field [53] using the Groningen Machine for Chemical Simulations (GROMACS 4.5.5) program [54]. The biological assembly (hexamer) of RNase PH from Pseudomonas aeruginosa (PDB code: 1R6M) served as the template and was obtained using the online tool (PISA) (http://www.ebi.ac.uk/pdbe/prot_int/pistart.html) [55]. The structures of the active and inactive hexamers of RNase PH from strains O26:H11 and K12, respectively were modelled and the 20 models generated for each of the active and inactive RNase PH hexamers were validated using the same set of tools, as mentioned above. The best models were selected and subjected to energy minimisations, as described above. Electrostatic potential on the solvent accessible surfaces of the proteins were calculated using PDB2PQR [56] (in the AMBER force field) and Adaptive Poisson-Boltzmann Solver (APBS) [57]. The head-to-head dimers were randomly selected from both the active and the inactive hexamers of the protein for performing MD simulations, to save computational time. Various energy components of the dimer interface were measured using the in-house algorithm, PPCheck [58]. This algorithm identifies interface residues in proteinprotein interactions on the basis of simple distance criteria, following which the strength of interactions at the interface are quantified. 100 ns MD simulations (three replicates each) were performed with the same set of parameters as mentioned above for the monomeric proteins.
Modelling and dynamics studies of an 'uncharacterised' pathogen-specific protein The structure of the PELOTA_1 domain (Pfam ID: PF15608) of an 'uncharacterised' pathogen-specific protein from strain O103:H2 (UniProt ID: C8TX32) (371 amino acids) was modelled on the basis of the L7Ae protein from Methanocaldococcus jannaschii (PDB code: 1XBI: A) (117 amino acids) and validated, as described earlier. The 64 amino acids long PELOTA_1 domain of the uncharacterised protein, has 36% sequence identity with the corresponding 75 amino acids domain of the template. The best model was selected as described in the case study on RNase PH. This model was subjected to 100 iterations of the Powell energy minimisation method in the Tripos Force Field (in absence of any electrostatics) using SYBYL7.2 (Tripos Inc.). Structural alignment of the modelled PELOTA_1 domain and the L7Ae K-turn binding domain from Archaeoglobus fulgidus (PDB code: 4BW0: B) was performed using Multiple Alignment with Translations and Twists (Matt) [59]. The same kink-turn RNA from H. marismortui, found in complex with the L7Ae K-turn binding domain from A. fulgidus, was docked onto the model, guided by the equivalents of the RNAinteracting residues (at a 5 Å cut-off distance from the protein) in the A. fulgidus L7Ae protein (highlighted in yellow in the upper panel of Fig. 7c) using the molecular docking program HADDOCK [60]. The model and the L7Ae protein from A. fulgidus, in complex with kink-turn RNA from H. marismortui, were subjected to 100 ns MD simulations (three replicates each) in the AMBER99SB protein, nucleic AMBER94 force field using the GRO-MACS 4.5.5 program.

Sequence analysis of pathogen-specific Cas6-like proteins
The sequences of all the proteins in Cluster 308 were aligned to the Cas6 protein sequence in E. coli strain K12 (UniProt ID: Q46897), using MUSCLE [61] and subjected to molecular phylogeny analysis using the Maximum Likelihood (ML) method and a bootstrap value of 1000 in MEGA7 (CC) [62,63]. All reviewed CRISPR-associated Cas6 protein sequences were also retrieved from Swiss-Prot (March 2017) [44], followed by manual curation to retain 18 Cas6 proteins. Sequences of two uncharacterised proteins (UniProt IDs: C8U9I8 and C8TG04) from Cluster 308, known to be homologous to known CRISPR-associated Cas6 proteins (on the basis of sequence homology searches against the NR database, as described earlier) were aligned to those of the 18 reviewed Cas6 proteins using MUSCLE. The sequences were then subjected to molecular phylogeny analysis using the above-mentioned parameters. Secondary structure predictions for all the proteins were performed using PSIPRED [64].
The structures of Cas6 proteins from E. coli strain K12 (PDB codes: 4QYZ: K, 5H9E: K and 5H9F: K) were retrieved from the PDB. The RNA-binding and proteininteracting residues in the Cas6 protein structures were calculated on the basis of 5 Å and 8 Å distance cut-off criteria, from the associated crRNAs (PDB codes: 4QYZ: L, 5H9E: L and 5H9F: L, respectively) and the protein chains (PDB codes: 4QYZ: A-J, 5H9E: A-J and 5H9F: A-J, respectively), respectively.

Results
Genome-wide survey (GWS) of RNA-binding proteins in pathogenic and non-pathogenic E. coli strains The GWS of RBPs was performed in 19 different E. coli strains (16 pathogenic and three non-pathogenic strains) and a total of 7902 proteins were identified (Additional file 1: Table S1). Figure 2a shows the number of RBPs found in each of the strains studied here. The pathogenic strains have a larger RBPome, as compared to the non-pathogenic ones -with strain O26:H11 encoding the greatest (441). The pathogenic strains also have bigger proteome sizes (in terms of the number of proteins in the proteome), as compared to their non-pathogenic counterparts, by virtue of maintaining plasmids in them. Hence, to normalise for proteome size, the number of RBPs in each of these strains were expressed as a function of their respective number of proteins in the proteome (Fig. 2b). We observed that the difference in the percentage of RBPs in the proteome among the pathogenic and the non-pathogenic strains are insignificant (Welch Two Sample t-test: t = 3.2384, df = 2.474, p-value = 0.06272).
To compare the differential abundance of domains, if any, among the pathogens and the non-pathogens, the Pfam DAs of all the RBPs were resolved (to strengthen the results in this section, this study has been extended to all known E. coli proteomes and will be discussed in a later section). The number of different types of Pfam domains and that of Pfam RNA-binding domains (RBDs) found in each strain have been represented in Fig. 2c Fig. 2d and also been listed in Table 3.
We found that E. coli encodes 185 different types of Pfam RBDs in their proteomes and the DEAD domain was found to be the most abundant, constituting approximately 4% of the total number of Pfam RBD domains in E. coli. The DEAD box family of proteins are RNA helicases that are required for RNA metabolism and thus are important players in gene expression [65]. These proteins use ATP to unwind short RNA duplexes in an unusual fashion and also help in the remodelling of RNA-protein complexes.

Comparison of RNA-binding proteins across strains reveals novel pathogen-specific factors
The proteins were clustered on the basis of sequence homology searches in order to compare and contrast the RBPs across the E. coli strains studied here. The 7902 proteins identified from all the strains were grouped into 384 clusters, on the basis of sequence homology with other members of the cluster (Additional file 2: Table S2). Greater than 99% of the proteins could cluster with one or more RBPs and formed 336 multi-member clusters (MMCs), whereas the rest of the proteins failed to cluster with other RBPs and formed 48 single-member clusters (SMCs). The distribution of members among all the 384 clusters has been depicted in Fig. 3.
The largest of the MMCs, consists of 1459 RBPs which are ATP-binding subunit of transporters. The E. coli genome sequence had revealed that the largest family of paralogous proteins were composed of ATP-binding cassette (ABC) transporters [66]. The ATP-binding subunit of ABC transporters share common features with other nucleotide-binding proteins [67] like, the E. coli RecA [68] and the F1-ATPase from bovine heart [69]. GCN20, YEF3 and RLI1 are examples of soluble ABC proteins that interact with ribosomes and regulate translation and ribosome biogenesis [70][71][72].
The other large MMCs were those of small toxic polypeptides that are components of the bacterial toxinantitoxin (TA) systems [73][74][75][76][77], RNA helicases that are involved in various aspects of RNA metabolism [78,79] and Pseudouridine synthases that are enzymes responsible for pseudouridylation, which is the most abundant post-transcriptional modification in RNAs [80]. Cold shock proteins bind mRNAs and regulate translation, rate of mRNA degradation etc. [81,82]. These proteins are induced during the response of the bacterial cell towards temperature rise.
The majority of the SMCs (38 out of 48 SMCs) are RBPs from pathogenic strains and lack homologues in any of the other strains considered here. These include proteins like putative helicases, serine proteases, and various endonucleases. Likewise, members of the small toxic Ibs protein family (IbsA, IbsB, IbsC, IbsD and IbsE that form Clusters 362, 363, 364, 365 and 366 respectively) from strain K12 are noteworthy examples of SMCs that are in non-pathogenic strains only. These Ibs proteins cause the cessation of growth when overexpressed [83].

Pathogen-specific proteins
In this study, the 226 pathogen-specific proteins that formed 43 pathogen-specific clusters are of special interest. Sixty-three of these proteins were previously uncharacterised and associations for all of these proteins were obtained on the basis of sequence homology searches against the NCBI-NR database. The function annotation of each of these clusters were transferred on the basis of homology. The biological functions and the number of RBPs constituting these pathogen-specific clusters have been listed in Table 4.
If these pathogen-specific proteins are exclusive to the pathogenic strains, then they may be exploited for drug design purposes. To test this hypothesis, we surveyed the human (host) proteome for the presence of sequence homologues of these proteins. It was found that, barring the protein kinases that were members of Cluster 98 (marked in asterisk in Table 4), none of the pathogenspecific proteins were homologous to any human protein within the thresholds employed in the search strategy (please see Methods section for details). Few of the pathogen-specific protein clusters are described in the following section.  The DEAD/DEAH box helicases that use ATP to unwind short duplex RNA [65], formed three different clusters. In two of the clusters, the DEAD domains (Pfam ID: PF00270) were associated with C-terminal Helicase_C (Pfam ID: PF00271) and DUF1998 (Pfam ID: PF09369) domains. On the other hand, in a bigger cluster, the DEAD/DEAH box helicases were composed of DNA_primase_S (Pfam ID: PF01896), ResIII (Pfam ID: PF04851) and Helicase_C domains. Four of the pathogen-specific clusters were those of Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) sequence-associated proteins, consisting of RBPs from 10 pathogenic strains each. Recent literature reports also support the role of CRISPR-associated proteins as virulence factors in pathogenic bacteria [84]. The KilA-N domains are found in a wide range of proteins and may share a common fold with the nucleic acid-binding modules of certain nucleases and the N-terminal domain of the tRNA endonuclease [85]. Fertility inhibition (FinO) protein and the anti-sense FinP RNA are members of the FinOP fertility inhibition complex which regulates the expression of the genes in the transfer operon [86][87][88][89]. tRNA (fMet)specific endonucleases are the toxic components of a TA system. This site-specific tRNA-(fMet) endonuclease acts as a virulence factor by cleaving both charged and uncharged tRNA-(fMet) and inhibiting translation. The Activating Signal Cointergrator-1 homology (ASCH) domain is also a putative RBD due to the presence of an RNA-binding cleft associated with a conserved sequence motif characteristic of the ASC-1 superfamily [90].

Identification of the distinct RNA-binding protein repertoire in E. coli
We identified identical RBPs across E. coli strains, on the basis of sequence homology searches and other filtering criteria (as mentioned in the Methods section). Out of the 7902 RBPs identified in our GWS, 6236 had one or more identical partners from one or more strains and formed 1227 clusters, whereas 1666 proteins had no identical counterparts. Hence, our study identified 2893 RBPs from 19 E. coli strains that were distinct from each other. Identification of such a distinct pool of RBPs will help to provide an insight to the possible range of functions performed by this class of proteins in E. coli, and hence compare and contrast with the possible functions performed by RBPs in other organisms.

GWS of RNA-binding proteins in all known E. coli strains
We extended the above-mentioned study, by performing GWS of RBPs in 166 complete E. coli proteomes available in the RefSeq database (May 2016) and a total of 8464 proteins were identified (Additional file 3). It should be noted that, unlike the nomenclature system of UniProt, where the same protein occurring in different strains are denoted with different UniProt accession IDs, RefSeq assigns same or at times different accession IDs to the same protein occurring in different strains. Thus, on the basis of unique accession IDs, 8464 RBPs were identified. The 8464 RBPs were grouped into 401 clusters on the basis of sequence homology with other members of the cluster. We found that greater than 99% of   Fig. 4a. The number of different Pfam RBDs found across all complete E. coli proteomes has been shown in Fig. 4b. Similar to the afore-mentioned results, seen from the dataset of 19 E. coli proteomes, it was found that E. coli encodes 188 different types of Pfam RBDs in their proteomes and the DEAD domain was still observed to be the most abundant, constituting approximately 6% of the total number of Pfam RBD domains in E. coli. The length distribution of RBPs from E. coli have been plotted in Fig. 4c and RBPs of the length 201-300 amino acids were found to be the most prevalent.

Identification of the complete distinct RBPome in 166 proteomes of E. coli
These 8464 RBPs (please see previous section) formed 1285 clusters of two or more identical proteins, accounting for 3532 RBPs, whereas the remaining 4932 RBPs were distinct from the others. Hence, 6217 RBPs, distinct from each other, were identified from all known E. coli strains, which is much greater than the number (2893) found from 19 E. coli proteomes.
It should be noted that the pathogenicity annotations are not very clear for few of the 166 E. coli strains for which complete proteome information are available. Hence, we have performed the analysis for the pathogen-specific proteins using the smaller dataset of 19 proteomes, whereas all the 166 complete proteomes have been considered for the analysis for the complete E. coli RBPome.

Case studies
Three case studies on interesting RBPs were performed to answer some outstanding questions and have been described in the following sections. The first of the three examples, deals with a RNase PH protein that does not cluster with those from any of the other 165 E. coli proteomes considered in this study. This protein, which forms a SMC, is interesting in the biological context due to its difference with the other RNase PH proteins, both at the level of sequence as well as biological activity. The second case study deals with a protein that is a part of a pathogen-specific cluster, in which none of the proteins are well-annotated. This protein was found to encode a bacterial homologue of a well-known archaeo-eukaryotic RBD, whose RNA-binding properties are not as well studied as its homologues. The final study involves a sequence-based approach to analyse the pathogen-specific CRISPR-associated Cas6 proteins, and compare the same with similar proteins from the non-pathogenic strains.
Case study 1: RNase PH from strain K12 is inactive due to a possible loss of stability of the protein RNase PH is a phosphorolytic exoribonuclease involved in the maturation of the 3′-end of transfer RNAs (tRNAs) containing the CCA motif [91][92][93]. The RNase PH protein from strain K12 was found to be distinct from all other known RNase PH proteins from E. coli and has a truncated C-terminus. In 1993, DNA sequencing studies had revealed that a GC base pair (bp) was missing in this strain from a block of five GC bps found 43-47 upstream of the rph stop codon [94]. This one base pair deletion leads to a translation frame shift over the last 15 codons, resulting in a premature stop codon (five codons after the deletion). This premature stop codon, in turn, leads to the observed reduction in size of the RNase PH protein by 10 residues. It was also shown by Jensen [94] that this protein lacks RNase PH activity. Figure 5a shows a schematic representation of the DAs of the active (up) and inactive (down) RNase PH proteins, with the five residues that have undergone mutations and the ten residues that are missing from the inactive RNase PH protein depicted in orange and yellow, respectively. These are the residues of interest in our study. The same colour coding has been used both in Fig. 5a and b.
To provide a structural basis for this possible loss of activity of the RNase PH protein from strain K12, we modelled the structures of the RNase PH protein monomer as well as the hexamer from strains O26:H11 and K12 (Fig. 5b and c). It is known in the literature that the hexamer (trimer of dimers) is the biological unit of the RNase PH protein and that the hexameric assembly is mandatory for the activity of the protein [95,96]. The stability of both the monomer and the hexamer were found to be affected in strain K12, as compared to that in strain O26:H11. The energy values have been plotted in Fig. 6a. In both monomer and hexamer, there is a reduction in stability, suggesting that the absence of C-terminal residues affects the stability of the protein, perhaps more than a cumulative contribution to the stability of the protein. It should be noted that since the monomeric form of the inactive protein is less stable than that of its active counterpart, the hexameric assembly of the inactive RNase PH protein is only a putative one. Hence, the putative and/or unstable hexameric assembly of the RNase PH protein, leads to the loss of activity of the protein. Figure 5b shows that the residues marked in cyan (left) are at an interacting distance of 8 Å from the residues of interest (left). These residues marked in cyan are a subset of the RNase PH domain, which is marked in magenta (right). Hence, the loss of possible interactions (between the residues marked in cyan and the residues of interest) and subsequently stability of the threedimensional structure of the RNase PH domain might explain the inactive nature of the protein from strain K12. Figure 5d shows differences in the electrostatic potential on the solvent accessible surfaces of the active (left) and inactive (right) RNase PH proteins.
To test this hypothesis for the possible loss of function of the RNase PH protein due to loss of stability of the monomer and/or the hexamer, we performed MD simulations to understand distortions, if any, of the monomer and a randomly selected head-to-head dimer (from the hexameric assembly) of both the active and the inactive proteins. The dimers have been marked in black boxes in Fig. 5c. Various energy components of the dimer interface, as calculated by PPCheck, have been plotted in Fig. 6b. The results show that the inactive RNase PH dimer interface is less stabilised as compared to that of the active protein. The trajectories of the MD runs have been shown in additional movie files (Additional file 4, Additional file 5, Additional file 6 and Additional file 7, for the active monomer, inactive monomer, active dimer and inactive dimer, respectively). Analyses of Additional file 4, and Additional file 5 shows a slight distortion in the short helix (pink) in the absence of residues of Fig. 5 Modelling of the RNase PH proteins from two different E. coli strains. The structural modelling of the RNase PH protein has been represented in this figure. a Schematic diagram of the active (above) and the inactive (below) RNase PH proteins. The RNase PH and the RNase_PH_C domains, as defined by Pfam (v.28), have been represented in magenta and pink, respectively. The five residues that have undergone mutations due to a point deletion and the ten residues that are missing from the inactive RNase PH protein from strain K12 have been depicted in orange and yellow, respectively. These two sets of residues are the ones of interest in this study. b Model of the RNase PH monomer from strain O26:H11. The residues with the same colour codes as mentioned in panel (a), have been represented on the structure of the model. The residues that are within an 8 Å cut-off distance from the residues of interest have been highlighted in cyan (left). c Structure of the RNase PH hexamer from strain O26:H11 (left) and the probable structure of the inactive RNase PH hexamer from strain K12 (right). The dimers marked in black boxes are the ones that were randomly selected for MD simulations. d Electrostatic potential on the solvent accessible surface of the RNase PH hexamer from strain O26:H11 (left) and that of the inactive RNase PH hexamer from strain K12 (right) interest (orange and yellow), which might lead to overall loss of stability of the monomer. Further analyses (Additional file 6 and Additional file 7) show the floppy nature of the terminal part of the helices that are interacting in the dimer. This is probably due to the loss of the residues of interest, which have been seen to be structured and less floppy in the active RNase PH dimer (Additional file 6).
For each of the systems, the H-bond traces for three replicates (represented in different colours) have been depicted. From these figures, we can observe that the replicates are showing similar H-bonding patterns. Analyses of the number of hydrogen bonds (H-bonds) formed in the system over each picosecond of the MD simulations of the active monomer, inactive monomer, active dimer and inactive dimer have been represented in Fig. 8a, b, c and d, respectively. Comparison of panels a and b of this figure shows a greater number of H-bonds being formed in the active monomer, as compared to that of the inactive monomer, over the entire time period of the simulation. Similarly, comparison of panels c and d of this figure shows a greater number of H-bonds being formed in the active dimer as compared to that of the inactive dimer, over the entire time period of the simulation. These losses of H-bonding interactions might lead to overall loss of stability of the dimer and subsequently that of the hexamer.
Case study 2: Uncharacterised pathogen-specific protein and its homologues show subtly different RNA-binding properties In our study, we observed that Cluster 60 was composed of 10 proteins, each from a different pathogenic strain studied here. All the proteins in this cluster were either annotated as 'putative' , 'uncharacterised' , 'hypothetical' or 'predicted'.
To understand the RNA-binding properties of these orthologous pathogen-specific proteins, we resolved the Pfam DA of this protein. In particular, such an association to Pfam domains provide function annotation to a hitherto uncharacterised protein, from strain O103:H2, to RBD PELOTA_1. Hence, the structure of the RNA-binding PELOTA_1 domain of this protein was modelled on the basis of the L7Ae protein from M. jannaschii (Fig. 7a).
Domains that are involved in core processes, such as RNA maturation, e.g. the tRNA endonucleases, and translation and with an archaeo-eukaryotic phyletic pattern includes the PIWI, PELOTA and SUI1 domains [97]. In 2014, Anantharaman and co-workers had shown associations of the conserved C-terminus of a phosphoribosyltransferase (PRTase) in the Tellurium resistance (Ter) operon to a PELOTA or Ribosomal_L7Ae domain (Pfam ID: PF01248) [98]. These domains are homologues of the eukaryotic release factor 1 (eRF1), which is involved in translation termination. Unlike the wellstudied PELOTA domain, the species distribution of the PELOTA_1 domain is solely bacterial and not much is known in literature regarding the specific function of this domain.
Structure of this modelled PELOTA_1 domain from the uncharacterised protein was aligned with that of the L7Ae kink-turn (K-turn) binding domain from an archaeon (A. fulgidus) (Fig. 7b). The model also retained the same basic structural unit as the eRF1 protein (data not shown). The L7Ae is a member of a family of proteins that binds K-turns in many functional RNA species [99]. The K-turn RNA was docked onto the model, guided by the equivalents of the known RNA-interacting residues from the archaeal L7Ae K-turning binding  Fig. 5c). The results show that the dimer interface of the inactive RNase PH protein is less stabilised as compared to that of the active RNase PH protein domain. Both the complexes have been shown in Fig. 7c with the RNA-interacting residues highlighted in yellow. MD simulations of both these complexes were performed and the trajectories have been shown in additional movie files Additional file 8 (PELOTA_1 domain model-k-turn RNA complex) and Additional file 9 (L7Ae K-turn binding domain-k-turn RNA complex).
For each of the systems, the H-bond traces for three replicates (represented in different colours) have been depicted. From these figures, one can observe that the replicates are showing similar H-bonding patterns. Analyses of the number of H-bonds formed between the protein and the RNA over each picosecond of the MD simulations of the PELOTA_1 domain-RNA complex and the L7Ae K-turn binding domain-RNA complex have been represented in Fig. 8e and f, respectively. Comparison of panels e and f of this figure shows a greater number of H-bonds being formed in the L7Ae K-turn binding domain-RNA complex as compared to that of the PELOTA_1 domain-RNA complex over the entire time period of the simulation. These results show that the two proteins have differential affinity towards the same RNA molecule. This hints at the fact that these proteins might perform subtly different functions by the virtue of having differential RNA-binding properties.
Case study 3: Pathogen specific Cas6-like proteins might be functional variants of the well-characterised nonpathogenic protein In many bacteria, as well archaea, CRISPR associated Cas proteins and short CRISPR-derived RNA (crRNA) assemble into large RNP complexes and provide surveillance towards invasion of genetic parasites [100][101][102]. The role of CRISPR-associated proteins as virulence factors in pathogenic bacteria has also been reported in recent literature [84]. We found that Cluster 308 consists of 10 pathogen-specific proteins, of which half of them were already annotated as Cas6 proteins, whereas the other half constituted of 'uncharacterised' or 'hypothetical' proteins. As mentioned in the Methods section, the latter proteins were annotated on the basis of sequence homology to known proteins in the NR database, as Cas6 proteins.
Molecular phylogeny analysis of all the proteins from Cluster 308 and Cas6 from E. coli strain K12 has been depicted in Additional file 10a: Figure S1, which reinstates the fact that the pathogen-specific proteins are more similar to each other, in terms of sequence, than they are to the Cas6 protein from the non-pathogenic strain K12. Furthermore, a similar analysis of two previously uncharacterised proteins (UniProt IDs: C8U9I8 and C8TG04) (red) from this pathogen-specific Cas6 proteins cluster (Cluster 308), with other known Cas6 proteins has been shown Additional file 10b: Figure S1. From the phylogenetic tree, one can infer that the pathogen-specific Cas6 proteins are more similar in terms of sequence to the Cas6 from E. coli strain K12 (blue) than that from other organisms.
Multiple sequence alignment (MSA) of all the proteins from Cluster 308 and Cas6 from strain K12 has been shown in Fig. 9. The RNA-binding residues in E. coli strain K12 Cas6 protein (union set of RNA-binding residues inferred from each of the three known PDB structures (see Methods section)) have been highlighted in yellow on its sequence (CAS6_ECOLI) on the MSA. The corresponding residues in the other proteins on the MSA, which are same as that in CAS6_ECOLI, have also been highlighted in yellow, whereas those which differ have been highlighted in red. From Fig. 9a, we can conclude that the majority of the RNA-binding residues in CAS6_ECOLI are not conserved in the pathogenspecific Cas6 proteins, and can be defined as 'class-specific residues'. A similar colouring scheme has been followed in Fig. 9b, to analyse the conservation of protein-interacting residues in these proteins. From these analyses, we can speculate that due to the presence of a large proportion of 'class-specific residues' , the RNA-binding properties, as well as protein-protein interactions, might be substantially different among the Cas6 proteins from non-pathogenic and pathogenic E. coli strains, which might lead to functional divergence. Secondary structures of each of these proteins, mapped on their sequence (α-helices highlighted in cyan and β-strands in green) in Fig. 9c, also hint at slight structural variation among these proteins.

Discussion
We have employed a sequence search-based method to compare and contrast the proteomes of 16 pathogenic and three non-pathogenic E. coli strains as well as to obtain a global picture of the RBP landscape in E. coli. The results obtained from this study showed that the pathogenic strains encode a greater number of RBPs in their proteomes, as compared to the non-pathogenic ones. The DEAD domain, involved in RNA metabolism, was found to be the most abundant of all identified RBDs. The complete and distinct RBPome of E. coli was also identified by studying all known E. coli strains till date. In this study, we identified RBPs that were exclusive to pathogenic strains, and most of them can be exploited as drug targets by virtue of being non-homologous to their human host proteins. Many of these pathogenspecific proteins were uncharacterised and their identities could be resolved on the basis of sequence homology searches with known proteins.
Further, in this study, we performed three case studies on interesting RBPs. In the first of the three studies, a tRNA processing RNase PH enzyme from strain K12 was investigated that is different from that in all other E. coli strains in having a truncated C-terminus and being functionally inactive. Structural modelling and molecular dynamics studies showed that the loss of stability of the monomeric and/or the hexameric (biological unit) forms of this protein from E. coli strain K12, might be the possible reason for the lack of its functional activity. In the second study, a previously uncharacterised pathogenspecific protein was studied and was found to possess subtly different RNA-binding affinities towards the same RNA stretch as compared to its well characterised homologues in archaea and eukaryotes. This might hint at different functions of these proteins. In the third case study, pathogen-specific CRISPR-associated Cas6 proteins were analysed and found to have diverged functionally from the known prototypical Cas6 proteins.

Conclusions
The approach used in our study to cross-compare proteomes of pathogenic and non-pathogenic strains may also be extended to other bacterial or even eukaryotic proteomes to understand interesting differences in their RBPomes. The pathogen-specific RBPs reported in this study, may also be taken up further for clinical trials and/or experimental validations.
The effect of the absence of a functional RNase PH in E. coli strain K12 is not clear. The role of the PELOTA_1 domain-containing protein may also be reinforced performing knockdown and rescue experiments. These might help to understand the functional overlap of this protein with its archaeal or eukaryotic homologues. Introduction Fig. 9 Sequence analysis of pathogen-specific Cas6-like proteins. Comparison of sequence features of Cas6 proteins from pathogenic (Cluster 308) and non-pathogenic K12 strains. a Comparison of RNA-binding residues. The RNA-binding residues in E. coli strain K12 Cas6 protein have been highlighted in yellow on its sequence (CAS6_ECOLI) on the MSA. The corresponding residues in the other proteins on the MSA, which are same as that in CAS6_ECOLI, have also been highlighted in yellow, whereas those which differ have been highlighted in red. b Comparison of protein-interacting residues. The protein-interacting residues in E. coli strain K12 Cas6 protein have been highlighted in yellow on its sequence (CAS6_ECOLI). A similar colour scheme has also been followed here. c Secondary structure prediction. The α-helices have been highlighted in cyan and the β-strands in green