PDZ Domains in Microorganisms: Link Between Stress Response and Protein Synthesis

The PSD-95/Dlg-A/ZO-1 (PDZ) domain is highly expanded and diversified in metazoans, where it is known to assemble diverse signalling components by virtue of interactions with other proteins in a sequence-specific manner. In contrast, in bacteria it monitors protein quality control during stress response. The distribution, functions and origin of PDZ domain-containing proteins in prokaryotes are largely unknown. We analyzed 7,852 PDZ domain-containing proteins in 1,474 prokaryotes and fungi. PDZ domains are abundant in eubacteria, and this study confirms their occurrence also in archaea and fungi. Of all eubacterial PDZ domain-containing proteins, 89% are predicted to be membrane and periplasmic, explaining the depletion of bacterial domain forms in metazoans. Planctomycetes, myxobacteria and other eubacteria occupying terrestrial and aquatic niches encode more domain copies, which may have contributed towards multi-cellularity and the prokaryotic-eukaryotic transition. Over 93% of the 7,852 PDZ-containing proteins were classified into 12 families, including 6 novel families. Of these, 88% harbour eight different protease domains, suggesting that their substrate specificity is guided by PDZ domains. The genomic context provides tantalizing insight into the functions associated with PDZ domains and reinforces their involvement in protein synthesis. We propose that the highly variable PDZ domain of the uncharacterized Fe-S oxidoreductase superfamily, exclusively found in gladobacteria and several anaerobes and acetogens, may have preceded all existing PDZ domains.



Introduction
Proteins exhibiting both signaling and protein interaction domains are prevalent in eukaryotic signal transduction systems (Nourry et al. 2003). This domain architecture provides an elegant solution to rewire and regulate complex biological networks by sensing signals through signaling domains while protein interaction domains serve to amplify the signal. The PDZ domain is one such protein interaction domain. It was first identified in the context of signaling proteins, which are referred to as GLGF repeat proteins or DHR (Discs large homology repeat) proteins (Cho et al. 1992; Ponting and Phillips 1995). The abbreviation PDZ is derived from the three metazoan proteins in which this domain was first identified: PSD-95, DLG and ZO-1 (Kennedy 1995).

Metazoan PDZ domains are referred to as canonical PDZ domains and typically comprise 80-100 amino acid residues harboring a highly conserved fold (Doyle et al. 1996; Kennedy 1995). However, the length of the secondary structures, which comprise six β-strands with a short and a long α-helix, may vary (Doyle et al. 1996; Morais Cabral et al. 1996). In contrast, eubacterial PDZ domains fold similarly to metazoan domains but with a distinct topology of secondary structural elements and are referred to as non-canonical (Harris and Lim 2001; Lee and Zheng 2010). Non-canonical PDZ domains consist of a circularly permuted structural fold.

PDZ domains are inherently variable at the primary sequence level, and show diversity in functional roles and binding specificities in metazoans (Belotti et [...] fungal PDZ domains are largely unknown. The genome-wide analysis of non-metazoan PDZ domains dates back to 1997, when these domains were shown to occur in bacteria (in abundance), plants and fungi [13]. However, their presence in fungi was considered doubtful due to low sequence similarity with known PDZ domains.
Therefore, it was assumed either that the primordial PDZ domain arose prior to the divergence of bacteria and eukaryotes, or that horizontal gene transfer led to the acquisition of these domains by bacteria [13]. Even now, some researchers believe that this domain occurs in fungi and archaea [12], while others do not [7,14], indicating how little we know about non-metazoan PDZ domains. It [...] Lipinska et al. 1990). Interestingly, the domain is absent in 4% of species (55 genomes), mainly cell wall-lacking mollicutes and Candidatus Phytoplasma species, many of which are adapted to a parasitic lifestyle (Fig. 1). The domain is also absent in two Buchnera aphidicola species, which are symbionts of Cinara (a conifer aphid), and in the archaeon Nanoarchaeum equitans, a parasite of Ignicoccus hospitalis on which it depends for lipids (Jahn et al. 2008 [...]

Eubacterial PDZ proteins are primarily involved in stress signaling. The number of signaling proteins in prokaryotes is proportional to their genome size (Galperin 2005). PDZ domains also follow this trend. Figure 2b shows that the number of PDZ domain-containing proteins and their multi-domain architectures expanded with increasing genome size. However, this trend is significant only for eubacterial genomes and not for archaea and fungi (Supplementary Fig. S2).
Because PDZ is the second most abundant domain in metazoa, it was hypothesized that the domain might have co-evolved with multi-cellularity and complexity (Harris and Lim 2001; Kim et al. 2012). Interestingly, many eubacterial species encode more than 10 PDZ genes, particularly planctomycetes and myxobacteria (Figs. 1 and 2b). Notably, besides methylotrophic proteobacteria, only planctomycetes and myxobacterial members can synthesize C30 sterols such as lanosterol, which are found primarily in eukaryotes (Desmond and Gribaldo 2009; Fuerst and Sagulenko 2011; Pearson et al. 2003). Planctomycetes are also considered a probable host for an endosymbiont which gave rise to a eukaryotic cenancestor cell, as they exhibit many features found only in eukaryotes, such as internal membranes, a primitive form of endocytosis, growth by budding and lack of peptidoglycans (Fuerst and Sagulenko 2011; Godde 2012). Myxobacteria are known to form fruiting bodies which behave in many respects like a multi-cellular organism (Rokas 2008), and thus may demarcate the origin of multi-cellularity. Ser/Thr/Tyr protein kinases, which together comprise a major class of regulatory proteins in eukaryotes, were ubiquitously found in myxobacteria and planctomycetes (Perez et al. 2008). Taken together with these findings, our results raise the question of whether the expansion of PDZ domains in planctomycetes and myxobacteria to a metazoan extent is an evolutionary contribution towards the origin of eukaryotes and multi-cellularity. We speculate that planctomycetes and myxobacteria may have contributed to the evolution of a eukaryotic cenancestor cell, which diverged into fungi that lost PDZ domains and a probable metazoan progenitor that retained PDZ domains, which subsequently expanded during metazoan evolution.
Nonetheless, our results strengthen the hypothesis that the expansion of PDZ domains indeed correlates with multi-cellularity and complexity, even in eubacteria.

Metazoan life flourished in limited ecological niches compared to the extremely diverse habitats of prokaryotes. Remarkably, prokaryotes that are most successful in terrestrial and aquatic ecological niches encode significantly higher numbers of PDZ domains than those inhabiting multiple environments, those that are obligatorily host-associated, and those with a specialized habitat, i.e. environments such as marine thermal vents (Fig. 3a). Moreover, prokaryotes encoding a higher number of PDZ domains are aerobic rather than anaerobic or facultative (Wilcoxon rank sum test, p-value 2.1e-05; Fig. 3b). These results indicate that the expansion of PDZ domain-coding genes might have a role in adapting to aerobic aquatic and terrestrial habitats. [...] at either the N- or C-terminus of other domains, with no apparent preference for one of the termini (Fig. 4a). Across the twelve classified protein families, the PDZ domain is combined with eight different protease domains in 88% of the classified proteins. The remaining proteins were classified into four non-protease families, referred to as Fe-S oxidoreductase, General Secretory Protein C, Haem-binding uptake, and Sensor histidine kinase ComP, based on the functions of the domains therein (Fig. 4b). Figure 4c shows the distribution of proteins among the eight protease families: proteins of the HtrA, Carboxy-terminal protease (Ctp) and Regulator of Sigma-E Protease (RseP) families predominate, whereas fewer proteins belong to the Aminopeptidase N, Lon Protease, Sporulation protein IV B, Aspartate protease, and Zn-dependent exopeptidase families.
In spite of its sequence and structural diversity, the PDZ domain is preferentially combined with a protease domain in 88% of classified proteins, suggesting that its function is directly related to providing substrates for proteolysis. Our assumption is supported by the function of the PDZ domains of the well-characterized proteases HtrA, Ctp, and RseP in [...]

The phyletic distribution of classified proteins is shown in Table 1. All families are represented in eubacteria; three also occur in archaea; and the HtrA family is common even in fungi. Notably, HtrA and Ctp family proteins are present in multiple copies in eubacteria, suggesting their expansion by gene duplication during evolution (Table 1). Fungi inherited only HtrA genes, which encode non-canonical PDZ domains. This is consistent with their presence in metazoans and inheritance from the last common ancestor of eukaryotes. Some of the interesting families are discussed in detail in the subsequent sections.

The HtrA protein family is part of the elaborate high temperature stress response system for protein quality control, which monitors protein homeostasis to prevent accumulation of unwanted and damaged proteins in the cell (Clausen et al. 2011). A characteristic feature of HtrA proteins is one or two PDZ domains together with a trypsin-like serine protease domain (Fig. 4a). Previous studies have reported its presence in higher eukaryotes (Pallen and Wren 1997 [...] Peptidoglycan binding domain, Tricorn and Tricorn_C1 domains (Fig. 4a). This suggests that the Ctp family is more amenable to evolutionary change than the others, and that the associated domains provided new functional aspects to it in a lineage-specific manner.

The canonical PDZ domain was detected based on Pfam as well as Superfamily profiles in 309 (out of 316) DUF3340 domain-containing family proteins in 298 species (Supplementary Table S1).
Interestingly, the Ctp DUF3340 domain shows sequence similarity with the mammalian interphotoreceptor retinoid binding protein (IRBP) (Keiler and Sauer 1995) and might be ancestral to it (Ponting 1997 [...] uncharacterized Fe-S oxidoreductase proteins with a conserved CxxxCxxC motif in 148 species (Supplementary Fig. 4). The amino acids responsible for structural integrity seem to be conserved in these two families. Their sequence divergence could be a result of mutations accumulated over time, which would explain why these domains were detected only by the structure-based HMM model of PDZ domains. Thus, we hypothesized that the PDZ domain of one of these two families could be ancestral to all other PDZ domains.

The GspC family is a part of the general secretory system, which transports many substrates. This system is not specific towards a particular substrate and hence naturally imposes fewer constraints on sequence conservation. GspC family proteins are mostly present in proteobacteria, which have not been counted among the old bacterial phyla so far. On the other hand, the radical SAM domain of the Fe-S oxidoreductase family is considered to be among the oldest domains. The highly conserved CxxxCxxC motif is likely to form an iron-sulphur (Fe-S) cluster that cleaves SAM reductively to produce a radical (usually a 5′-deoxyadenosyl 5′-radical) (Frey et al. 2008). The radical intermediate allows a wide variety of unusual chemical transformations. These proteins are conserved in all cyanobacterial genomes investigated in this study, and in half of the chlorobacteria and hadobacteria (Table 1). These phyla constitute the oldest bacterial clade, gladobacteria (Cavalier-Smith 2006), and a probable root of life is placed in chlorobacteria (Cavalier-Smith 2006). These proteins are also present in firmicutes, δ-proteobacteria and a few actinobacterial species, which predominantly use an anaerobic mode of respiration (Table 1).
The presence of this family's proteins in photosynthetic and anaerobic eubacteria suggests their functional role in the early Earth environment (Lane and Martin 2012) and is also consistent with their high sequence divergence due to a long evolutionary time span. Furthermore, the family is present in several acetogens, which generate acetate as a product of their anaerobic respiration and whose modern lifestyles resemble that of the last universal common ancestor (Weiss et al. 2016). MSA analysis confirms the authenticity of these PDZ domains following structural analysis (Fig. 5a). We used the protein sequences of slr2030 and GSU1997, hypothetical proteins from Synechocystis PCC 6803 and Geobacter sulfurreducens PCA, to predict three-dimensional structures using the Phyre2 protein modelling web server. The N-terminal 61% of the amino acids of both sequences, which includes the PDZ and radical SAM domains, could be modelled with more than 90% confidence. The predicted structures were compared with the known ligand-bound PDZ domain structure of the HtrA2 protein isolated from Mycobacterium tuberculosis (Mohamedmohaideen et al. 2008) (Fig. 5b). The predicted domain structure is composed of two helices and three to four β-strands, as opposed to the approximately six that often occur in known structures (Fig. 5c, d). This indicates the insertion of additional β-strands into the PDZ domains of other families later in evolution. Even with only three β-strands the peptide-binding cavity in the domain is maintained, suggesting its functional equivalence to other PDZ domains (Fig. 5c, d) [...] the tip, suggesting its recent divergence (Fig. 5e, Supplementary Figs. 5, 6, 7). On the other hand, GspC family domains may have diverged later from HtrA family domains.

Collectively, phylogenetic analysis strongly indicates that the PDZ domain of radical SAM proteins is likely to be ancestral and gave rise to the other domains, of which the Ctp family domain diverged recently.
Phyletic distribution also supports the presence of PDZ domains of radical SAM proteins in the last universal common ancestor.

[...] (in Bacillus subtilis known as YlbL) and radical SAM family is conserved in phylogenetically diverse species (Fig. 6). RseP is an inner membrane metalloprotease that induces the stress response via [...] proteins (Table 1). A characteristic feature of this family's proteins is two highly conserved motifs, HExGH and NxxPxxxLDG, at their N- and C-termini, which are similar to the potential zinc binding site found in a variety of metalloproteases (Brown et al. 2000) and to the motif found in the human S2P protease, respectively (Supplementary Fig. S8) [...] (Fig. 6a). Supplementary Figure S9 depicts the complex regulation of this gene cluster in Escherichia coli by many σ24 (a stress response sigma factor) promoters (Erickson and Gross 1989); by σ70, the primary sigma factor during exponential cell growth; and also by rho-independent termination sites. The presence of the frr gene is especially interesting since some of its mutants rapidly decrease protein synthesis followed by inhibition of RNA synthesis [...] rseP and frr genes has a σ24 promoter, suggesting their co-regulation (Supplementary Fig. S9). Therefore, we propose a possible link between translation and membrane biogenesis regulated by the stress response factor σ24, mediated by the induction of rseP gene expression along with the HtrA family members degP and degQ during heat shock. σ24 is positively regulated by the starvation alarmone ppGpp (guanosine 3′-diphosphate 5′-diphosphate) during entry into stationary phase, suggesting that σ24 can respond to internal signals as well as stress signals originating in the cell envelope (Costanzo and Ades 2006).
We propose that the internal signal through ppGpp might be responsible for σ24 activation, to handle misfolded and damaged proteins during stationary phase, and for inhibition of tff-rpsB-tsf gene expression, to cease translation under nutrient limitation, since ppGpp has been shown to inhibit expression of this transcription unit during amino acid starvation (Aseev et al. 2014). These observations associate RseP family proteins directly with translation and membrane biogenesis.

The genomic context of Fe-S oxidoreductase family members contains gpsA (which encodes a glycerol-3-phosphate dehydrogenase enzyme) and engA (Fig. 6b). The engA gene encodes a GTPase, also called Der, required for ribosome assembly and stability. It is co-transcribed with the gene coding for the outer membrane protein BamB from a σ24 promoter in Escherichia coli (Rhodius et al. 2006). The product of bamB is part of the large multi-protein BAM complex responsible for outer membrane biogenesis, which includes the product of bamA, a gene sharing genomic context with rseP (Fig. 6a). Interestingly, the radical SAM domain is predicted to have a three-dimensional fold similar to that of a SAM methyltransferase involved in translation. The genomic neighbourhood of this family with Der further strengthens our hypothesis that the radical SAM family is likely to be involved in translation-related functions.

The PDZ domain is combined with an ATP-dependent lon protease domain in 292 uncharacterized proteins of actinobacteria and firmicutes (Table 1). The prototype example of such a domain organization is the Bacillus subtilis membrane protein YlbL, which has a structural fold similar to ribosomal protein S5. The genes encoding these proteins share a genomic neighbourhood with rsmD (Fig. 6c). The rsmD product plays a critical role in the methylation of G966 of 16S rRNA (Weitzmann et al. 1991 [...]

The present study proposes a concrete classification of 97% PDZ proteins into twelve families.
Of these, six families have at least one characterized protein, while the remaining six are uncharacterized to date. A combination of PDZ and protease domains is a common theme observed in 88% of classified proteins. The PDZ-protease domain combination seems to have appeared later in evolution. Prior to this event the function of PDZ domains was presumably to bind iron, available in large quantity in the early Earth atmosphere (Fig. 5e) [...] first diverged from non-protease PDZ domain-containing proteins, and shared a last common ancestor with the remaining protease families (Fig. 5e). This is consistent with their heat resistance and their requirement at the higher temperatures of early Earth. The DUF3340 domain of Ctp protease family proteins is homologous to many metazoan IRBPs, suggesting a link between them. Furthermore, the PDZ domains of many Ctp family proteins were predicted to be of the canonical/metazoan form. This is consistent with their recent divergence, evident from the phylogenetic analysis (Fig. 5e). Proteins harbouring Aminopeptidase N, ATP-dependent lon protease, Aspartyl protease, and Zn-dependent exopeptidase domains are yet to be functionally characterized, along with the non-protease Fe-S oxidoreductase and Haem-binding uptake, Tiki superfamily families.

In search of their putative functions, we analyzed the genomic neighbourhood of all protein-coding genes. Conserved neighbourhoods were observed for three families, of which the ATP-dependent lon proteases (structural similarity with a tRNA or rRNA modification domain) and Fe-S oxidoreductases are often placed with genes coding for proteins involved in translation and membrane biogenesis, indicating their functions in these processes.
RseP intra-membrane [...]

The expansion of PDZ domains in planctomycetes (a debatable host in the endosymbiosis theory of first eukaryotic cell formation) and myxococcus, which have many metazoan/eukaryote features, supports the hypothesis that PDZ domains might have co-evolved with multi-cellularity/complexity (Harris and Lim 2001). However, the origin of PDZ domains was unclear in metazoa as well as in prokaryotes. We hypothesized that the PDZ domain of the Fe-S oxidoreductase family is the probable ancestor of all PDZ domains and might have helped the ancestral eubacterial species to withstand the anaerobic atmosphere of early Earth. Their radical SAM domains might have provided the means of reductive energy generation or translation stability under the anaerobic atmosphere of early Earth. Oxygen availability might have negatively selected these proteins in the species that diverged from ancestral gladobacteria, but retained them in some extant obligate anaerobes. On the other hand, proteases might have expanded with the availability of oxygen and helped in adapting to terrestrial niches (Fig. 3).

Though we provide evidence of an ancestral relationship between bacterial and metazoan PDZ domains, as well as the presence of hundreds of canonical PDZ domains (in the Ctp protease family) in eubacteria, their depletion or absence in archaea and fungi hinders a solid explanation for this transition. Based on the analysis presented here, we argue that archaea presumably never had canonical PDZ domains. The bacterial endosymbiont might have contributed these domains to the eukaryotic phylogeny, where they were lost recurrently only in fungi and ecdysozoa. Phylogenetic analysis confirms the recent divergence of the PDZ domains of the Ctp family, which might have contributed the canonical PDZ form to the metazoan phylogeny. Collectively, our comprehensive analysis provides insights into the emergence of the PDZ domains and their functional divergence during evolution.

Methods
Identification and analysis of PDZ domain-containing proteins in completely sequenced genomes

Protein sequence and annotation files were obtained from the National Centre for Biotechnology Information (NCBI) for completely sequenced prokaryotic and fungal species (Sayers et al. 2012). Out of a total of 2,057 genomes, we selected for further analyses the 1,474 representative genomes for which phenotypic information was available. [...] localization of these proteins was predicted using the Phobius web server (Kall et al. 2007). The data were processed using in-house Perl scripts and visualized over a pruned version of the NCBI taxonomy tree, which was created using the interactive Tree Of Life (iTOL) web server's API tool by providing NCBI taxonomy identifiers for the investigated organisms (Letunic and Bork 2007).

The classification of PDZ domain-containing proteins is challenging, owing to their sequence and structural variations. In several instances we were unable to find correspondence between hits identified by the Superfamily and Pfam models, due to the different classification strategies adopted by them. To overcome this problem, the Pfam classification was used as a reference and always cross-checked with the Superfamily classification for consistency. First, we grouped proteins based on conserved Pfam domain architectures using in-house Perl scripts. The remaining sequences were manually checked and assigned to a group. Second, the clustalo program was used with default settings to construct a multiple sequence alignment for each group, which was manually analyzed to exclude highly divergent sequences (Sievers et al. 2011). In multiple instances a prototype motif of a specific family was considered for assigning proteins to their respective group (Supplementary Figs. 6 and 7). This semi-automatic sequence analysis led to the classification of 7,318 proteins out of 7,852 into 12 families. We were unable to classify 3% proteins due to their presence in fewer than 20 species and their highly variable domain combinations.

Sequence, Structure and Phylogenetic analysis

PDZ domains are inherently diverged at the sequence and structure level. This weakens the phylogenetic signal, as does the small length of the domain, which leaves few informative sites for phylogenetic reconstruction. Therefore, we selected PDZ domains only from the delta-proteobacteria group to reconstruct the phylogeny. The selection was based on the presence of both GspC and radical SAM family proteins in this group. The clustalo program was used to align sequences using the Superfamily HMM model, which is based on the alignment of PDZ domain structures. Positions that were conserved in more than 70% of the sequences were retained for analysis. We manually inspected the MSA to remove sequences that were highly diverged.
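The 70% conservation filter described above can be illustrated with a minimal Python sketch. It assumes that a position counts as conserved when its most frequent residue (gaps excluded) occurs in more than 70% of the aligned sequences; the text does not define the criterion more precisely. The function names and the toy alignment are hypothetical and are not part of the original analysis, which relied on clustalo output and manual inspection.

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.70):
    """Return indices of columns whose most common residue (gaps excluded)
    occurs in more than `threshold` of all aligned sequences.

    Assumption: this is one possible reading of "conserved in more than
    70% sequences"; the source does not spell out the exact rule.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    keep = []
    for i in range(length):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(seq[i] for seq in alignment if seq[i] not in "-.")
        if counts and counts.most_common(1)[0][1] / n_seqs > threshold:
            keep.append(i)
    return keep


def trim_alignment(alignment, threshold=0.70):
    """Keep only the conserved columns of every aligned sequence."""
    cols = conserved_columns(alignment, threshold)
    return ["".join(seq[i] for i in cols) for seq in alignment]


# Hypothetical toy alignment (not real PDZ sequences).
toy = ["GLGF-K", "GLGFAK", "GIGF-R", "GLGYAK"]
kept = conserved_columns(toy)
trimmed = trim_alignment(toy)
print(kept)
print(trimmed)
```

In the toy alignment the fifth column is dropped because no single residue reaches the threshold there, whereas columns with one deviating sequence out of four (75% agreement) are retained.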
Phylogenetic trees were reconstructed with the Fitch-Margoliash and parsimony algorithms available through the fitch and protpars programs in the Phylip package, respectively (Felsenstein 1989 [...]

The difference between two distributions of PDZ-containing proteins/domains was assessed using the Wilcoxon rank sum test (Wilcoxon 1945). The null hypothesis was either that there is no difference between the two distributions or that one distribution is greater than the other. The null hypothesis was rejected and the difference considered significant if the Wilcoxon test p-value was lower than 0.05.

[...] Bean-width corresponds to the proportion of genomes in it and bean-line shows the average number of proteins in the kingdom. Archaeal and fungal genomes encode fewer than 5 proteins on average, whereas many eubacteria encode a higher number of proteins. b) The scatter plot illustrates the relationship between genome size, number of proteins and multi-domain proteins. The number of proteins and their multi-domain architectures expanded with increasing genome size in eubacteria. Proteins with repetitive PDZ domains alone were not counted as multi-domain. The species with complex processes and/or the ability to form cell aggregates are annotated in the plot.

[...] Test of significance p-values using the Wilcoxon test are shown above the two compared groups; the null hypothesis was that the number of PDZ domains in organisms belonging to the right-side group is higher than in the left-side group. NS stands for non-significant p-value.
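The rank sum comparison described in the statistical analysis can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: it assumes a one-sided test, uses the normal approximation with tie correction and no continuity correction, and the function names and the per-genome counts are hypothetical.

```python
import math


def rank_with_ties(values):
    """Assign 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """One-sided Wilcoxon rank sum test via the normal approximation.

    Returns (U, p); a small p supports "values in x tend to be greater
    than values in y".
    """
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = rank_with_ties(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return u, p


# Hypothetical PDZ domain counts per genome (illustrative values only).
aerobic = [12, 9, 15, 11, 8, 14]
anaerobic = [4, 6, 3, 7, 5, 2]
u_stat, p_value = rank_sum_test(aerobic, anaerobic)
print(u_stat, p_value, p_value < 0.05)
```

For small samples an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch free of external dependencies.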

Prokaryotes encoding a higher number of PDZ domains favor aquatic, terrestrial and aerobic niches. In panel b) "Multiple" stands for species with a wide host range and variety of habitats; "Host-associated" for species obligatorily associated with a host; and "Specialized" for those with a specialized habitat, i.e. environments such as marine thermal vents.

[...] Domain of unknown function.

[...] from Mycobacteria is shown without and with the tetra-peptide ligand in cartoon and space-fill models. The predicted structure of the PDZ domain from Geobacter (c) and Synechocystis (d). Their overlap with HtrA2 is shown along with its ligand. HtrA2 is shown in green and the predicted structure in red. The predicted structure bound to the HtrA2 ligand is shown as a space-fill model in the third panel. The phylogenetic tree shown in panel (e) is constructed by Fitch neighbor-joining algorithm. The phylogenetic tree shows the ancestry of the PDZ domain of Fe-S oxidoreductases to other PDZ domains, whereas the Ctp family domain appears as a recent divergence.

[...] cdsA, uppS (ispU), dxr, frr, rpsB and bamA genes. An intergenic distance greater than 50 nucleotides is indicated by a slash (/). b) Fe-S oxidoreductases of Radical SAM family members occur in the vicinity of the engA (der) and gpsA genes. The gene between engA and der is conserved in many species but no function is defined for it. c) ATP-dependent lon protease family members are well conserved with the coaD and rsmD genes. The rsmD gene is not well annotated in many genomes. Proteins encoded by dxr, rpsB, frr, der, and rsmD are associated with translation-related functions. Genes are included in this analysis if they are placed within a distance of 50 nucleotides from the PDZ-coding gene, with the exception of Escherichia coli, where the distance was 300 bases (a).

Locus identifiers were used when gene names were not available. Red is used to represent PDZ domain-containing proteins in each panel. Homologous genes are represented in the same color. Grey genes are not conserved in the neighborhood. Arrows pointing towards the right represent the plus strand and those pointing towards the left represent the minus strand of the genome.

Table 1

A protein family is shown by its total number of members/total number of organisms in a particular taxonomic group. Protein families are abbreviated as: Htr, High temperature requirement proteases; Ctp, C-terminal processing proteases; RseP, RseP intramembrane metalloproteases; APN, Aminopeptidase protein N; Lon, ATP-dependent lon proteases; SpoIVB, Sporulation factor IV B proteases; AP, Aspartate proteases; ZEP, Zn-dependent exopeptidases; GspC, General secretion pathway protein C; FeSO, Fe-S oxidoreductases; ComP, Competence protein family; ChaN, EreA-Chan-like family. $ Uncharacterized families. * Rare incidences of family proteins in the respective group/kingdom, which were considered as horizontal transfer events from eubacteria.
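The distance criterion stated in the genomic context figure legend (genes within 50 nucleotides of the PDZ-coding gene, 300 bases for Escherichia coli) can be sketched in Python as below. This is an illustrative reading, not the authors' in-house script: it assumes that the neighbourhood is extended gene by gene for as long as each consecutive intergenic distance stays within the cutoff, that coordinates are 1-based and inclusive, and the function name, gene names and coordinates are hypothetical.

```python
def neighbourhood(genes, anchor, max_gap=50):
    """Return names of genes in the contiguous run around `anchor` in which
    every consecutive intergenic distance is at most `max_gap` nucleotides.

    genes: list of (name, start, end) tuples on a single replicon,
    with 1-based inclusive coordinates (an assumption of this sketch).
    """
    ordered = sorted(genes, key=lambda g: g[1])
    names = [g[0] for g in ordered]
    if anchor not in names:
        raise ValueError(f"anchor gene {anchor!r} not found")
    idx = names.index(anchor)

    def gap(left, right):
        # Number of nucleotides between two genes; negative if they overlap.
        return right[1] - left[2] - 1

    lo = idx
    while lo > 0 and gap(ordered[lo - 1], ordered[lo]) <= max_gap:
        lo -= 1
    hi = idx
    while hi < len(ordered) - 1 and gap(ordered[hi], ordered[hi + 1]) <= max_gap:
        hi += 1
    return names[lo:hi + 1]


# Hypothetical gene coordinates (not taken from any real genome).
toy_genes = [
    ("geneA", 100, 900),
    ("geneB", 1200, 1800),
    ("pdz", 1830, 3100),
    ("geneC", 3120, 3900),
    ("geneD", 4100, 5000),
]
print(neighbourhood(toy_genes, "pdz"))               # 50-nucleotide cutoff
print(neighbourhood(toy_genes, "pdz", max_gap=300))  # relaxed cutoff
```

With the default cutoff only the two genes flanking the anchor are retained, while the relaxed cutoff pulls in all five genes, which shows how sensitive the reported neighbourhood is to the chosen distance.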