Local Flexibility in Molecular Function Paradigm

It is generally accepted that the functional activity of biological macromolecules requires tightly packed three-dimensional structures. Recent theoretical and experimental evidence indicates, however, the importance of molecular flexibility for the proper functioning of some proteins. We examined high resolution structures of proteins in various functional categories with respect to the secondary structure assessment. The latter was considered as a characteristic of the inherent flexibility of a polypeptide chain. We found that the proteins in functionally competent conformational states might be comprised of 20–70% flexible residues. For instance, proteins involved in gene regulation, e.g. transcription factors, are on average largely disordered molecules with over 60% of amino acids residing in “coiled” configurations. In contrast, oxygen transporters constitute a class of relatively rigid molecules with only 30% of residues being locally flexible. Phylogenic comparison of a large number of protein families with respect to the propagation of secondary structure illuminates the growing role of the local flexibility in organisms of greater complexity. Furthermore the local flexibility in protein molecules appears to be dependent on the molecular confinement and is essentially larger in extracellular proteins.


Molecular & Cellular Proteomics 5:1212-1223, 2006.
Over the last 2 decades, structural research has produced a large number of three-dimensional (3D) structures of biologically active macromolecules and their complexes. An enormous body of structural information (over 34,000 entries) is currently available from the Protein Data Bank (PDB), including details of protein organization, of protein interactions with nucleic acids and ligands, and of protein conformational behavior. Structural data provide the essential framework for characterizing molecular mechanisms of biological activity, for analyzing evolutionary relationships, and for illuminating our understanding of biological function (Ref. 1 and references therein).
Although proteins often are pictured as rigid entities corresponding to some average structure (immersed in a featureless solvent continuum), it has long been known that they have a rather fluid, dynamic structure with rapid conformational fluctuations (2). Subnanosecond dynamics of proteins studied by NMR (3), nitroxide spin labeling (4), dielectric relaxation (5), and fluorescence experiments (1) have advanced such descriptive terms as "breathing" (6), "relaxation," "segmental motion," and "mobile defect" (7) to portray the conformational mobility of proteins in functionally competent states. The presence of substantial >1-Å breathing motions was recognized in early NMR studies on the flipping of the buried aromatic residues in the pancreatic trypsin inhibitor (8). Hemoglobin and myoglobin offer another striking example of the dynamic nature of biological activity where small structural fluctuations of the protein matrix allow O2 molecules to move to and from the heme pocket (9-11). Buried water molecules, which are observed in most proteins, were shown to exchange with surface water molecules at the microsecond timescale, and that process necessitates large and correlated fluctuations in the host protein (12,13). Furthermore the reduced activity of protein mutants in some cases might be a consequence of fluctuations and flexibility reduced below the level that has evolved for optimal functioning (1). Indeed the conformational lability in proteins, coordinated with the chemical requirements at each stage of their reactions, is a major component in enzyme catalysis, allosteric regulation, antigen-antibody interactions, and protein-DNA binding (1). The concept of inherent and correlated protein motions has become a landmark in biophysics and structural biology that underlies our understanding of molecular recognition (14,15).
Although the importance of protein flexibility has been widely invoked in the literature, it has been difficult to characterize experimentally. Proteins are composed of discrete atoms, which are constantly undergoing thermal fluctuations from rapid (picosecond) vibrations, through slower (multinanosecond) global reorientations and side chain isomerization, to long time scale (microsecond to second) conformational changes (16). The reality of these fluctuations is evident in the PDB, which reports not only a set of fixed coordinates but also the temperature B-factors (Debye-Waller factors). The latter reflect the thermal fluctuations of the protein and provide information about the mobility of each atom in the structure (17-19). The crystallographic parameters have been successfully used to derive overall and intrinsic motions (20), to identify higher atomic mobility at the active site, and even to isolate the component of the atomic vibration amplitudes that derives from the overall global motion of the protein (21). Furthermore the analysis of the 3D structures of wild type proteins and their synthetic analogs (22) as well as of proteins that crystallize in different space groups (23) promotes the idea that the B-factors reveal the effect of different packing constraints on the protein flexibility. Further statistical-mechanical study of a large group of protein structures clearly demonstrated that the B-profile is, in fact, essentially determined by spatial variations in local packing density (24). Note that NMR data can also be used to characterize the flexibility of a protein (25), but in practice the number of atoms within a molecule is so large that drawing conclusions from the data is difficult. 
A simple method to predict protein flexibility using secondary chemical shifts has been developed recently that allows quantitative, site-specific mapping of protein backbone mobility without the need of a 3D structure or NMR relaxation experiments (26).
Numerous studies have attempted to identify flexible regions in proteins as well as to understand their role in overall protein dynamics and functionality. However, different groups use different definitions and various experimental approaches to identify this fold characteristic. One class of "intrinsically disordered" flexible regions was defined as the regions that are invisible in electron density maps of x-ray diffraction (27-31). Other researchers focus on extended (>70 consecutive residues) regions of a very low regular secondary structure that are particularly abundant in eukaryotic proteomes, conserved during evolution, and over-represented in regulatory and promiscuously interacting proteins (32-34). In fact, many proteins contain recognizable small "modules" that recur in other proteins in various combinations and in some cases can fold independently. They can be covalently linked to generate multimodular proteins and serve as self-directed structural units (35). Such domains can function independently, can be expressed in genomes, and are often rearranged through alternative splicing. These structural units are inferred to be a good evolutionary unit and are often used instead of whole proteins for annotations of the protein space (36). Yet if misplaced they can trigger dramatic biological consequences: oncoproteins comprising DNA-binding domains are capable of initiating transcription albeit being a small part of a largely unfolded chimeric polypeptide chain (37). In this regard, the crystallographic B-factors when considered over the length of a protein chain show that some segments undergo movements on a much larger scale than the rest of the protein, suggesting that the analysis of the B-distributions can be used to identify and predict flexible regions (38-41). Moreover a scheme was proposed to discriminate the amino acid residues according to their flexibility based on the B-factors of their Cα atoms (38,39,41). 
Altogether four categories with distinct flexibility were recognized that include low B-factor ordered regions, high B-factor ordered regions, and short and long disordered regions with the last two categories being the regions of missing electron density (31). The amino acid compositions of these categories differ significantly, whereas the biophysical properties of high B-factor ordered regions are relatively close to those of the short disordered regions. They provide a higher flexibility, hydrophilicity, and absolute net and total charge. The low B-factor ordered regions are enriched in hydrophobic residues and depleted in the total number of charged residues compared with the other categories (31). The "significantly-greater-than-chance" predictability of these categories from sequence suggests that they are, most likely, encoded at the primary structure level (31). The impact of local flexibility in proteins on biological activity remains unclear. It is clear, however, that even such a coarse grained aspect of protein structure as the secondary structure assigned from x-ray crystals captures flexibility relevant for protein function (42).
Structural complexity of biological macromolecules allows for a large variety of mechanisms to regulate the molecular recognition, including key-lock binding, template-assisted folding, folding by association, conformational selection, etc. Whether the recognition involves inducible or constitutive binding, the interaction per se depends on and affects the secondary structure of the individual protein (43-47). It seems reasonable therefore to analyze the deviation of the secondary structure (assigned by crystallography and NMR data) in various protein families with specific emphasis on the residues embedded in configurations with high flexibility. Such residues do not necessarily represent the class of natively disordered residues with high inherent flexibility but constitute a more flexible class than rigid helical or stranded regions. Mining conventional biophysical data might facilitate the discovery of underlying trends in structure-function relationships that were missed previously (48,49). Here we examine the impact of local flexibility in proteins on molecular recognition and structure-function relationships utilizing gene-regulating molecules as a starting point.
Transcription factors (TFs) are modular molecules with separate domains mediating DNA binding, transcriptional activation, and repression. They are regulators of cell cycle progression, differentiation, and survival and are often altered in human diseases. Eukaryotic transcription factors contain a minimum of two domains that are responsible for the sequence-specific DNA recognition and transcriptional activation (50,51). Most DNA-binding domains adopt folded structures in the absence of DNA, and conformational transitions induced by DNA binding are rather local (43-45). Yet transcriptional activation domains might be both non-globular and unstructured and become partly ordered upon binding. Thus, local disorder appears to be an important facet of proteins involved in gene regulation, allowing for a variety of mechanisms in molecular networking (43,45).
Despite the successes in protein domain assignment and comparison (50), there have been only a few breakthroughs in the area of quantifying the structure-function relationship. In fact, most of the databases classify the protein space by occurrence of a particular type of secondary structure (e.g. all α, all β, etc.) or global fold (e.g. Rossmann, ferredoxin, α/β-cylinder, α+β-plaits, etc.). Analyzing the secondary structure in proteins with various biological activities will aid greatly in understanding the role of structure-function relationships in evolutionary biology. In this report, we introduce the parameter of the effective flexibility in proteins and analyze its variation through the functional space. We elucidate the impact of molecular confinement on inherited conformational lability of proteins and demonstrate the growing role of intrinsic flexibility in organisms of higher complexity that require intricate molecular networking (52).

EXPERIMENTAL PROCEDURES
Data Procurement-Structural data were collected using the annotations reported in the PDB and, later, utilizing an automated search with cross-referencing of protein structure with protein function, taxonomy, and gene ontology. Initially complete entries for high resolution structures (containing B-values, secondary structure indexes, experimental conditions, resolution characteristics, etc.) were taken from an unbiased selection of the PDB entries. In the case of TFs, this procedure resulted in 353 structures that included individual TF chains and multisubunit complexes with DNA. Given the significant redundancy of the PDB, the recovered data set was revised manually. Structures determined with resolution >3 Å and R-factor ≥0.25 were excluded from analysis as were NMR data averaged over fewer than 20 models, synthetic constructs, mutated proteins, and short polypeptides (<50 amino acids). Multiple entries were substituted by the average values of the appropriate parameters. Of the remaining 112 structures of TF chains and single chain-DNA complexes, 98 structures (98 chains) were complete in all respects and were found to be acceptable (Table I). A similar approach was utilized to generate the structural set for oxygen transporters. In this case, a PDB query provided the original set of 497 entries (1,275 chains). Exclusion of structures with poor resolution, synthetic constructs, mutated proteins, duplicates, and multimeric complexes (>2 chains) left only 41 structures (57 chains) that were found acceptable for the analysis (Table II).
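The exclusion criteria listed above can be expressed as a simple filter. The following is a minimal sketch, not the procedure actually used for the manual revision: the `Entry` record and its field names are hypothetical, and the sketch assumes that either a resolution above 3 Å or an R-factor of 0.25 or more is sufficient to exclude a crystal structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical record for one structure; field names are illustrative."""
    entry_id: str
    method: str                         # "xray" or "nmr"
    n_residues: int
    resolution: Optional[float] = None  # in angstroms, x-ray only
    r_factor: Optional[float] = None    # x-ray only
    n_models: Optional[int] = None      # NMR only
    synthetic: bool = False
    mutant: bool = False


def is_acceptable(e: Entry) -> bool:
    """Apply the exclusion criteria described in the text (assumed reading)."""
    # Synthetic constructs, mutants, and short polypeptides are dropped.
    if e.synthetic or e.mutant or e.n_residues < 50:
        return False
    if e.method == "xray":
        # Incomplete entries are rejected; poor resolution or R-factor excludes.
        if e.resolution is None or e.r_factor is None:
            return False
        return e.resolution <= 3.0 and e.r_factor < 0.25
    if e.method == "nmr":
        # NMR data averaged over fewer than 20 models are dropped.
        return e.n_models is not None and e.n_models >= 20
    return False
```

Averaging over multiple entries of the same protein and removing duplicates would be separate steps applied to the entries that pass this filter.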
An automated data search was performed utilizing Macromolecular Structure Database Search Database (MSDSD) provided by the Macromolecular Structural Database Group at the European Bioinformatics Institute (www.ebi.ac.uk/msd). This database uses a generic search interface with a Structured Query Language (SQL) server that allows data exchange with the Oracle platform (Oracle RDBMS Version 9.2.0.1). Oracle9i software is widely used to manage large amounts of data, to import and export new data in the generated databases, to create tables, to select specific sets of data, and to combine data from different sources. SQL offers a powerful tool to connect relational databases. Data-housing capabilities of Oracle9i together with SQL allow users to efficiently access and organize information reported in PDB: surfing 34,000 structures to reveal target correlations only takes a few minutes. Applying this approach, we examined the secondary structure composition in various protein families and in evolutionarily distinct organisms. Note that the MSDSD is derived from PDB entries and contains an extensive set of links to other biological databases including Swiss-Prot, SCOP (Structural Classification of Proteins), PROSITE, GO, and many others. These built-in, tightly integrated databases were used as 1) a source of input data, 2) an independent instrument for calculating various structural characteristics (e.g. Bα-factors, secondary structure composition, frequency functions, etc.), and 3) a tool for data analysis. Importantly MSDSD allows integration of PDB-derived data with the Gene Ontology database at the European Bioinformatics Institute, UK (www.geneontology.org) and the taxonomy database at the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). We took advantage of MSDSD in our study, focusing on biophysical parameters of proteins with respect to their functional ability.
[Table II, automated search. Search input for MSDSD: GO, oxygen transporters; taxonomy: human; search output: 49 entries/182 chains. Mutants and synthetic constructs were both removed from the data set.]
Altogether comparing the manual and automated searches revealed the independence of the recovered characteristics on the algorithm in use: launching the MSDSD for oxygen transporters containing single chain or protein dimers yielded an experimental set similar to that from manual exploration of the PDB. Likewise searching for human proteins (taxonomy index 9606) involved in oxygen transport (GO:0005344) provided 49 entries (182 chains) associated mainly with tetrameric complexes. Table II gathers both sets of the recovered samples. Furthermore using the MSDSD/Oracle9i allowed us to perform automatic assignment of proteins and to avoid the manual curation of the generated databases.
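The kind of relational query described here, selecting chains by GO function and taxonomy index, can be illustrated with a small in-memory database. This is a sketch only: the table names, column names, and toy rows are hypothetical and do not reproduce the MSDSD schema, and SQLite stands in for the Oracle platform. The identifiers GO:0005344, GO:0004725, and taxonomy index 9606 are taken from the text; 562 is the NCBI taxonomy index for E. coli.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy database; the schema is hypothetical, not that of MSDSD."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chain (chain_id TEXT PRIMARY KEY, entry_id TEXT, tax_id INTEGER);
        CREATE TABLE go_annotation (chain_id TEXT, go_id TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO chain VALUES (?, ?, ?)",
        [("e1_A", "e1", 9606), ("e1_B", "e1", 9606),
         ("e2_A", "e2", 562), ("e3_A", "e3", 9606)],
    )
    conn.executemany(
        "INSERT INTO go_annotation VALUES (?, ?)",
        [("e1_A", "GO:0005344"), ("e1_B", "GO:0005344"),
         ("e2_A", "GO:0005344"), ("e3_A", "GO:0004725")],
    )
    return conn


def chains_for(conn: sqlite3.Connection, go_id: str, tax_id: int):
    """Return (entry_id, chain_id) pairs matching a GO function and a taxonomy index."""
    query = """
        SELECT c.entry_id, c.chain_id
        FROM chain AS c
        JOIN go_annotation AS g ON g.chain_id = c.chain_id
        WHERE g.go_id = ? AND c.tax_id = ?
        ORDER BY c.chain_id
    """
    return conn.execute(query, (go_id, tax_id)).fetchall()
```

Calling `chains_for(build_demo_db(), "GO:0005344", 9606)` selects the two toy human chains annotated with oxygen transporter activity.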
Secondary Structure Assignment-Each protein chain in PDB is indexed with respect to the secondary structure composition based on appropriate crystallography or NMR data as assigned by the DSSP (53). Specifically eight secondary structure classifications are recognized, i.e. α-helix, 3₁₀-helix, β-strand, β-turn, bend, bulge, π-helix, and coil, and grouped into three categories: α-helix, β-strand, and coil. Three types, α-helix, 3₁₀-helix, and π-helix, are classified as α-helix. The second category encompasses residues in extended β-strand configurations, whereas remaining residues are regarded as "coil." In the PDB annotation, all helices shorter than five residues and all strands shorter than three residues are reassigned to coil (54). The secondary structure assignment provided by the MSDSD ignores the requirement for the minimum size of the secondary structure element, thereby yielding an ~7% smaller fraction of coiled residues.
Rationale for Data Analysis-Manual and automated searches were performed for a large number of protein families with respect to the secondary structure composition. Specifically each protein chain within the distinct family was assigned the parameter of effective flexibility (see below), and the frequency distribution of this parameter as computed for a bin of 10 units (unless otherwise stated) was generated for the entire ensemble. In addition, for each recovered distribution normalized by a constituency number, we calculated the ensemble average and applied it as a flexibility index to a protein family. The distributions of the frequency function and the average flexibility parameters were used to analyze the effect of molecular interactions on the protein flexibility and the effective flexibility in proteins with various biological activities and/or in various organisms, e.g. prokaryotes and eukaryotes.
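The reduction from eight secondary structure types to three categories, with the optional minimum-length rule, can be sketched as follows. The type labels, the one-letter category codes, and the function names are illustrative choices, not DSSP or PDB identifiers. The sketch also assumes that adjacent helical types (for example a 3₁₀-helix followed directly by an α-helix) are counted as one helical run when the length rule is applied.

```python
from itertools import groupby

# Eight types named in the text mapped to three categories:
# H = helix, E = strand, C = coil (labels are illustrative).
THREE_STATE = {
    "alpha_helix": "H", "3_10_helix": "H", "pi_helix": "H",
    "beta_strand": "E",
    "beta_turn": "C", "bend": "C", "bulge": "C", "coil": "C",
}


def reduce_states(types, min_helix=5, min_strand=3):
    """Map per-residue types to H/E/C and reassign short elements to coil."""
    reduced = [THREE_STATE[t] for t in types]
    out = []
    for state, run in groupby(reduced):
        n = len(list(run))
        too_short = (state == "H" and n < min_helix) or (state == "E" and n < min_strand)
        out.append(("C" if too_short else state) * n)
    return "".join(out)


def coil_fraction(types, **kwargs):
    """Percentage of residues in the coil category (non-empty chain assumed)."""
    states = reduce_states(types, **kwargs)
    return 100.0 * states.count("C") / len(states)
```

Setting `min_helix=0` and `min_strand=0` disables the length rule, which corresponds to the assignment without a minimum element size and gives a smaller coil fraction for the same chain.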
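The binning and averaging described above can be sketched in a few lines. The function names are illustrative, the flexibility values are assumed to be percentages between 0 and 100, and the "constituency number" is read here as the total number of chains in the ensemble.

```python
def frequency_function(flex_values, bin_width=10):
    """Normalized frequency of chains per bin of the flexibility parameter."""
    n_bins = 100 // bin_width
    counts = [0] * n_bins
    for value in flex_values:
        # A value of exactly 100% is placed in the last bin.
        index = min(int(value // bin_width), n_bins - 1)
        counts[index] += 1
    total = len(flex_values)
    return [count / total for count in counts]


def ensemble_average(flex_values):
    """Plain average of the per-chain flexibility parameter."""
    return sum(flex_values) / len(flex_values)
```

For a family of four chains with 55, 62, 68, and 71% of residues in coil, the frequency function is nonzero only in the 50-60, 60-70, and 70-80 bins, and the ensemble average is 64%.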

RESULTS AND DISCUSSION
Parameter of Local Flexibility-Intrinsic flexibility in molecular systems allows for a number of definitions that all imply a freedom of internal rotations. The mechanism of the molecular flexibility is usually rather complicated, involving both small scale isomeric rotations of atomic groups within a particular sequence fragment and large scale rotations of individual structured domains. Given the structural hierarchy in biological molecules, one might consider the lack of secondary structure as an indicator of the local flexibility. The latter can be measured as the fraction of residues approaching "coiled" configurations. To justify the idea that some secondary structures (recognized by DSSP) can be considered as flexible joints, we examined 36 high resolution structures of TF molecules (Table I) determined by x-ray crystallography with respect to the deviation of B-factors. This parameter represents smearing of atomic electron densities around their equilibrium positions, and a larger B-value corresponds to a larger uncertainty in the location of the appropriate residue or atom (55,56). Altogether the entire set of data contained 36 structures, 6,481 residues associated with seven secondary structure types (π-helix was missing), and appropriate B-values. For each ith chain comprising N_i residues we computed the entire single chain average Bα-factor, ⟨B_ch,i⟩, with α-carbons being used as representative sites and the Bα_k values identifying every kth residue.
The standard deviation of the Bα-distribution was also calculated for every chain.
In addition, all residues of the ith chain were sorted by secondary structure type, and the average Bα-values, ⟨B_i,j⟩, were computed for each secondary structure subcategory.
In Equation 3, index N_i,j denotes the number of residues involved in the jth type of secondary structure. The deviation of ⟨B_i,j⟩ around the chain average ⟨B_ch,i⟩ was expressed in units of standard deviation and used in further analysis (41,55,56).
Positive ΔB_i,j values imply a greater uncertainty in the positioning of the residues associated with a particular secondary structure type in comparison with the chain average value and vice versa. Finally for the ensemble of 36 structures we computed the average of ΔB_i,j values using the frequency of the jth type, f_j.
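The numbered equations referred to in this passage are not reproduced in the present text. A reconstruction consistent with the verbal definitions might read as follows. It is a sketch under stated assumptions rather than the original notation: the symbol σ_i for the per-chain standard deviation and the population form of Equation 2 are assumed, and Equation 5 assumes that f_j counts the chains in which the jth type occurs, with the sum running over those chains.

```latex
\begin{align}
\langle B_{ch,i}\rangle &= \frac{1}{N_i}\sum_{k=1}^{N_i} B^{\alpha}_{k} \tag{1}\\
\sigma_i &= \sqrt{\frac{1}{N_i}\sum_{k=1}^{N_i}\left(B^{\alpha}_{k}-\langle B_{ch,i}\rangle\right)^{2}} \tag{2}\\
\langle B_{i,j}\rangle &= \frac{1}{N_{i,j}}\sum_{k\in j} B^{\alpha}_{k} \tag{3}\\
\Delta B_{i,j} &= \frac{\langle B_{i,j}\rangle-\langle B_{ch,i}\rangle}{\sigma_i} \tag{4}\\
\Delta B_{j} &= \frac{1}{f_j}\sum_{i}\Delta B_{i,j} \tag{5}
\end{align}
```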
Note that comparing B-factors among different proteins necessitates protein-by-protein normalizations to avoid the systematic differences in the experimental data (38-41, 57-59). Indeed the B-distributions are highly irregular when viewed protein by protein, mainly because of differences in the refinement methods used (60) and the degree of care taken to determine accurate B-factor values (61). These normalizations are usually made assuming the Gaussian distribution of B-factors (38,39), although other statistical models such as Gumbel distributions (extreme value distributions) also can be used (41). Given that Gaussian distributions of B-factors were previously used for the development of successful flexibility predictors (38,39), we believe that the overall differences between the Gaussian and Gumbel distributions are relatively small, and the results of our studies are not quantitatively affected by using the Gaussian rather than Gumbel distributions.
The entire set of ΔB_j values is shown in Table III together with the pertinent secondary structure type and the fraction of residues involved. As can be seen, the coiled configurations provide the largest positive ΔB_j deviation of 0.34, whereas α-helical and β-stranded conformations yield the largest negative ΔB_j deviation of −0.11. The other secondary structure type with a negative ΔB_j value was found to be bulges representing only 1.2% of the entire ensemble. Unexpectedly the 3₁₀-helical configuration also provides a positive ΔB_j value of 0.06, thereby indicating notable uncertainty in the positioning of the residues involved. Thus, the residues located in α-helical or β-stranded segments consistently provide smaller Bα-factors than the ensemble average, thereby demonstrating superior structural rigidity in comparison with the others. Hence α-helical and β-stranded configurations could be considered relatively rigid, whereas all other configurations could be considered flexible.
To examine overall positional ambiguity associated with intrinsic flexibility, we analyzed the scattering of ΔB_j values around the ensemble average as computed for flexible, ΔB_fl, and rigid, ΔB_r, residues. In this case, all secondary structure types that provided positive values of scaled B-factor, ΔB_j, were assigned as flexible, and those yielding negative values were assigned as rigid (see Table III). In addition, for every ith chain the average Bα-values were computed utilizing the set of flexible or rigid residues.
In Equation 6, index d indicates rigid or flexible residues, and the other parameters are as denoted above. The deviation of the mean Bα-factors associated with flexible and/or rigid configurations around the chain average ⟨B_ch,i⟩ was expressed in units of standard deviation (41,55,56). These data corroborate the idea that residues in protein molecules present two independent groups that consistently exhibit quite different uncertainties in location and, therefore, can be regarded as relatively rigid and flexible populations.
Local Flexibility in Protein Family-Each selected protein was indexed by a parameter of flexibility, d, that gives the fraction of residues comprised in coiled configurations. The entire group was divided into subcategories in accord with the value of the d parameter (<10, <20, etc.), and the frequency of occurrence of proteins in each 10-unit bin, F(d), was counted. In addition, the frequency function was normalized to the total number of protein chains selected for the analysis. As a characteristic of a frequency function distribution, F(d), we utilized the weighted average, ⟨d⟩, where F_i and d_i are the frequency index and the appropriate flexibility parameter, respectively.
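The protein-by-protein normalization under the Gaussian assumption amounts to a per-chain z-score. The following minimal sketch shows how per-type deviations could be computed for one chain and then averaged over an ensemble. The function names are illustrative, the population standard deviation is an assumption, and the ensemble average is taken over the chains in which a given type occurs, which is one possible reading of the averaging step.

```python
from collections import defaultdict
from statistics import mean, pstdev


def chain_deviations(b_values, ss_types):
    """Per-type mean B-factor of one chain, as a z-score against the chain."""
    chain_mean = mean(b_values)
    chain_sd = pstdev(b_values)  # population SD (assumption); must be nonzero
    by_type = defaultdict(list)
    for b, ss in zip(b_values, ss_types):
        by_type[ss].append(b)
    return {ss: (mean(v) - chain_mean) / chain_sd for ss, v in by_type.items()}


def ensemble_deviations(chains):
    """Average the per-chain deviations over chains containing each type."""
    collected = defaultdict(list)
    for b_values, ss_types in chains:
        for ss, dev in chain_deviations(b_values, ss_types).items():
            collected[ss].append(dev)
    return {ss: mean(v) for ss, v in collected.items()}
```

Because each chain is scaled by its own mean and standard deviation, two chains whose raw B-factors differ by a constant offset contribute identical deviations, which is the purpose of the normalization.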
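Equation 6 and the scaled deviation that follows it are not reproduced in the present text. By analogy with the per-type averages defined earlier, they might be written as below. The symbols N_{i,d} (number of residues of chain i in group d) and σ_i (standard deviation of the Bα-values of chain i) are assumed notation, and the second expression is left unnumbered because its original numbering is not recoverable.

```latex
\begin{align}
\langle B_{i,d}\rangle &= \frac{1}{N_{i,d}}\sum_{k\in d} B^{\alpha}_{k},
  \qquad d \in \{\text{flexible},\ \text{rigid}\} \tag{6}\\
\Delta B_{i,d} &= \frac{\langle B_{i,d}\rangle-\langle B_{ch,i}\rangle}{\sigma_i} \notag
\end{align}
```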
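The expression for the weighted average is not reproduced in the present text. A form consistent with the definitions of F_i and d_i given above would be the following, where the sum runs over the bins of the frequency function and d_i is assumed to be the representative value of the ith bin. When F is normalized to the total number of chains, the denominator equals 1.

```latex
\begin{equation}
\langle d\rangle = \frac{\sum_{i} F_i\, d_i}{\sum_{i} F_i}
\end{equation}
```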
To begin with, we explored the local flexibility in the TF family using readily available data reported in the PDB. An exhaustive cleansing (see "Data Procurement") left 73 single TF chains and 25 complexes with DNA fragments. Among them 36 structures were resolved by x-ray crystallography, and 62 structures were determined by NMR spectroscopy. The selected set of 98 structures was sufficient to permit reasonable statistical tendencies to be determined. For each protein chain we retrieved the secondary structure indexes and calculated the flexibility index as a fraction of residues involved in flexible configurations. Fig. 2A shows the resulting distributions for the individual chains (gray circles) and the complexes with DNA (black circles). One sees that the flexible residues constitute over half of the TF chains, providing the ensemble average ⟨d⟩ value of 61%. Formation of intermolecular complexes with DNA induces only a slight ordering of the TF proteins given that the fraction of flexible residues decreases from 63 to 58%. The recovered distributions are nearly symmetrical with a half-width of the frequency function of ~20%.
When searching for structural determinants that are indicative for protein families, it is imperative to elucidate inherent limitations of generated databases and the experimental techniques utilized for the data acquisition. The proteins analyzed here are those for which structures have been determined at atomic resolution. This selection inevitably introduces a bias, principally toward smaller proteins (given that the mean chain length in the PDB is about 370 residues) and those with rigid structures (35). Clearly crystallography reports mostly on proteins that are inherently ordered and therefore can be crystallized, whereas a much wider structural diversity can be studied by NMR. In agreement with this supposition, we found (see Fig. 2B) that the average flexibility (as characterized by the content of coiled residues) in structures determined by NMR (black circles) is larger by 15% than that in crystallized samples (gray circles). Both the NMR and the x-ray sets, however, yielded relatively large ⟨d⟩ values of 66 and 51%, respectively, thereby pointing to a large portion of the TF residues that exhibit substantial uncertainty in their locations. Whether this finding indicates the significant amplitude of thermal motions (dynamic disorder) allowed in packed structures or reflects quenched disorder, i.e. trapped coil-like configurations, requires further investigation.
Local Flexibility and Biological Activity-Toward understanding the role of fold specifics in structure-function relationships, the correlation between the protein flexibility and the protein function needs to be addressed using samples with distinct functional activity. The proteins involved in the transport of oxygen in biological systems present the best candidates for this study given a large number of structures determined at high resolutions. Our initial PDB search provided 454 entries (1,345 chains), which included individual proteins and multimeric complexes. The manual refinement left only 41 structures (57 chains) suitable for the analysis, which included individual chains and protein dimers (Table II). The inset in Fig. 3 displays the d-distributions for the representative 57 chains (open triangles) and for the entire structural set (filled triangles) as computed for a bin of five units. We found that this functional category can tolerate only a small portion of the flexible residues given that ⟨d⟩ of about 34% was obtained for both data sets. Furthermore the d-distributions generated for oxygen transporters exhibit narrowly defined functions with a half-width of about 5% for the revised set and 10% for the entire set retrieved from the PDB. Once again, we have established that the incorporation of protein molecules into intermolecular complexes has very little effect on such fold specifics as intrinsic flexibility.

FIG. 2. Effective flexibility in the family of the transcription factors (open circles). A shows the contributions from individual chains (gray circles) and complexes with DNA (black circles). B displays the contributions from the structures resolved by x-ray crystallography (gray) and those by NMR spectroscopy (black). The PDB codes of the selected samples are listed in Table I.
Using the MSDSD allows for automated analysis of the large data domain with assignment of gene ontology and taxonomy indexes to the protein samples. We utilized this approach to generate the database comprising human proteins (taxonomy index 9606) that are involved in the oxygen transport as identified by GO function (GO:0005344). The inset in Fig. 3 displays the d-distributions generated for the originally identified 428 chains (open circles) and for the 182 chains that were left after cleansing (filled circles). The average flexibility parameter was found to be 32 and 34%, respectively. It appears that secondary structure specificity might be robust enough to survive the coarse grained screening of protein families with respect to the functional activity.
To provide further support to this notion, we computed the d-distributions for a number of human protein families with distinct functions as identified by the GO database. Fig. 3 displays the distributions of flexibility parameter computed for oxygen transporters (curve 1, GO:0005344), proteins with tyrosine phosphatase activity (curve 2, GO:0004725), and proteins with serine type endopeptidase activity (curve 3, GO:0004252). A data search provided 116, 86, and 307 PDB entries that contained 428, 111, and 454 protein chains, respectively. As can be appreciated, each protein family provides rather specific d-distributions in terms of position of the maximum and the bandwidth: increasing flexibility correlates with the increasing width of the distribution. Overall these findings point to a great potential of coarse grained analysis of fold specifics to reveal the structure-function correlations and their diversity in evolutionarily distinct organisms.
Phylogenic Comparison of Functional Categories-To elucidate the effect of evolution on protein structure we analyzed the proteins with similar functional ability in organisms of different complexity. Our initial set of the TFs included 20 structures from prokaryotic proteins and 68 from eukaryotes selected manually from the PDB (Table I). The decomposition of the d-distribution into the contributions from eukaryotic and prokaryotic proteins revealed that average flexibility in eukaryotes is larger by 11% (data are not shown), pointing thereby to the increasing role of the conformational lability in the functioning of organisms with larger complexity. At this point, a serious question arises: to what extent does the PDB data provide a comprehensive set suitable for phylogenic comparison of functional categories? To address this issue we analyzed the TF family using the PROSITE database to explore the molecular compositions of the similar functional categories. A search provided 58 entries with various PROSITE documentation indexes, PDOC. For each PDOC entry we generated a taxonomic tree view of all Swiss-Prot/TrEMBL entries matching the given PDOC code and counted the samples associated with specific taxonomy. The representative sets of TFs in eukaryotic and prokaryotic proteins are shown in Fig. 4. One can see that the composition of the TF family associated with evolutionarily distinct organisms differs significantly as follows from the uneven distribution of appropriate sequences in the functional space. For instance, zinc fingers, which are largely unfolded macromolecules, comprise almost 17% of the human proteins, and they are lost in simple organisms (Fig. 4). (All zinc fingers are italic in Fig. 4.) Whether this reflects the effect of evolution on a protein family intended to perform a specific function or reflects the bias of experimental data remains unclear. 
It is clear, however, that structural comparisons lacking the taxonomy of samples might be misleading toward understanding structure-function relationships.
To approach this problem we used the MSDSD provided by the European Bioinformatics Institute (62). Briefly, MSDSD is a relational database that offers a single access point for protein and nucleic acid structures and related information. Importantly, the MSDSD allows combining the PDB-derived structural data with ontology and taxonomy databases. Using the GO database, we identified sets of proteins with resolved 3D structure using human and Escherichia coli proteins as targets. To minimize sampling errors, only GO functions that were supported by >50 PDB entries were used. The search with a larger cutoff provided similar results. In addition, for a given organism we calculated the flexibility parameter for proteins with identical GO functions. Altogether we examined a set of 70 functional categories associated with human proteins and 21 categories relevant for E. coli, comprising a total of 12,107 structures (22,109 chains). For each functional category, we calculated the average flexibility parameter and arranged the data in ascending order of flexibility with respect to the identified function (Fig. 5).
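The selection and ranking steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSDSD query actually used: the record layout (GO term, PDB identifier, per-chain flexibility in percent) and the function name `rank_categories` are hypothetical, and averaging over chains rather than over structures is an assumption.

```python
from collections import defaultdict
from statistics import mean


def rank_categories(records, min_entries=50):
    """Rank GO categories by average flexibility, in ascending order.

    records: iterable of (go_term, pdb_id, flexibility_percent), one per chain
             (hypothetical layout chosen for this sketch).
    Only categories supported by more than min_entries distinct PDB entries
    are kept.
    """
    entries = defaultdict(set)   # distinct PDB entries per GO term
    values = defaultdict(list)   # per-chain flexibility per GO term
    for go_term, pdb_id, d in records:
        entries[go_term].add(pdb_id)
        values[go_term].append(d)
    kept = [
        (term, mean(values[term]))
        for term in values
        if len(entries[term]) > min_entries
    ]
    return sorted(kept, key=lambda item: item[1])
```

Running the same function on the human and E. coli records separately would give two ranked lists of the kind compared in Fig. 5. Repeating the call with a larger `min_entries` corresponds to the check with a larger cutoff mentioned in the text.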
We found that the fraction of residues with higher flexibility in human proteins ranges from 25% (e.g. oxygen transporters) to 65% (e.g. trypsin and plasminogen activators), whereas for E. coli proteins it scatters around 35-45%. It appears that in organisms of greater complexity (human versus E. coli), molecular flexibility becomes more significant. As expected, "function" defined as a physical process (e.g. binding) is not distinguished by molecular flexibility. Rather, it implies a response to the presence of a counterpart: DNA-binding proteins exhibit an average flexibility parameter of <40%, whereas heparin and zinc binding require 50 and 60% flexible residues, respectively. It seems likely that protein fold specifics are sensitive to the relevant biological process, that is, a series of events involving proteins (a series of functions), and/or to the molecular localization indicating where the interactions occur. To this end, Fig. 5B displays the effective flexibility in proteins that are identified by their localization. Two examples, human proteins (curve 1) and E. coli proteins (curve 2), are considered. As can be appreciated, there is a clear correlation between the confinement of human proteins and the content of coiled residues: local flexibility increases as proteins move out from the interior (Golgi apparatus, nucleus, mitochondria, cytoplasm, etc.) to the exterior of the cell. In contrast, the E. coli proteins show no significant changes in these fold specifics with cellular location. Once again, we observe a striking correlation between the cell complexity and the flexibility of the relevant molecular forms.

CONCLUSIONS
To date, most of our knowledge about proteins is derived from sequence-related databases. Sequence comparisons, however, fail to identify many relationships that emerge from known protein structures. Here we address correlations between protein fold characteristics and protein functional activities. Several families were investigated with an approach relying on readily available structural data. The primary observation was that natively folded proteins might have a significant fraction of residues that exhibit uncertainty in residue positions and, therefore, can be regarded as flexible. The diversity of this structural parameter within a protein family can be significant. Some examples are provided in Fig. 6. Given that many PDB entries are truncated chains and that crystallized proteins are biased toward a higher content of rigid regions, our assessment most likely underestimates the local flexibility in protein families. One should also consider the less ordered conformational states (e.g. molten globules) that might be involved in protein functioning. At this juncture, a question arises as to the location of flexible regions within the 3D structures. Our analysis of transcription factors (see Table I) indicated that over 30% of flexible residues are, in fact, buried in the interior of the molecule and have no direct exposure to the solvent (Fig. 6B).
The presence of significant coiled regions in protein molecules challenges the conventional views of molecular recognition and protein function as well as our understanding of structure-function relationships. The higher flexibility in eukaryotic proteins points to an increasing role of conformational lability in biological systems of greater complexity with respect to molecular recognition, molecular assembly, protein modification, etc. Clearly certain biological activities, such as enzyme catalysis, immunological recognition, or molecular discrimination by receptors, demand exquisite control of 3D structure. Other functions such as signaling can be achieved by linear sequences. Locally disordered/flexible segments, which are induced to fold by interactions with other molecules, offer several important advantages for molecular signaling and cellular regulation. For instance, inherently labile segments can easily be shaped by their environment and might be able to recognize a large number of biological targets without sacrificing specificity. Different sequences might provide comparable binding sites (63), and similarly different sequences can fold into similar structures but have different functions (64-67). In fact, efforts to annotate function based on structure and sequence homology alone often lead to misannotations (68, 69). Most importantly, complex organisms might retain their intricate physiologies without dramatically increasing their genome size (note that the number of genes in the human genome proved to be deceptively low). It may be the case that the conformational lability of small sequences provides the capacity for various binding sites, recognition templates, and signaling pathways and points to multidimensional structure-function relationships.
Using available structural data, we also recognize that protein structures are closely tied to the location of the protein: intracellular proteins on average seem to be more rigid than those located closer to the extracellular milieu. Another direction yet to be explored is to consider function as a biological process that assumes locations, interactions, and counterparts. Indeed incorporation of fold-specific correlations into evolutionary models would provide a new tool for the organization of biological data, thereby expanding the conventional views of biological function.