Take home lessons from studies of related proteins

Highlights
► We review the recent advances made from the study of related proteins.
► We relate pathway malleability to the balance between foldons and helical propensity.
► We speculate why different topologies respond differently to mutation.
► We discuss the role of kinetic intermediates in folding pathways.
► We explain why it is important to study several members from each protein fold.


Introduction
In the fifty years since the protein-folding field was first established, there have been thousands of papers detailing the thermodynamic or kinetic characterization of hundreds of different proteins. One particularly useful approach is 'The Fold Approach' [1], which involves a detailed analysis of the folding of several topologically, structurally and/or evolutionarily related proteins in order to discern patterns and trends in folding (stability, pathways and mechanisms).
In this manuscript, we describe a number of studies that highlight how comparisons within and between related protein families have affected our understanding of protein folding. This article builds on our recent review [2], incorporating significant results from the last few years. Here, we focus on the folding of isolated domains and do not discuss multidomain proteins, misfolding or aggregation.
The malleability of protein folding pathways
A unifying folding mechanism
In the early days of the 'protein-folding problem', three competing mechanisms were proposed that described how a polypeptide chain might fold to the native state: nucleation [3], hydrophobic-collapse [4] and diffusion-collision (framework) [5]. However, an early Φ-value analysis of the small protein chymotrypsin inhibitor 2 (CI2) demonstrated that none of these mechanisms was appropriate, since secondary and tertiary structure formed concomitantly [6]. Thus the nucleation-condensation mechanism was introduced [7], in which long-range contacts set up the initial topology of the protein (incurring a substantial entropic loss with minimal enthalpic gain), followed by a rapid collapse to the native state (with minimal entropic loss but substantial enthalpic gain). Under these conditions, the transition state is usually an expanded form of the native state [8], which helps to explain the strong correlation between native topological complexity (Contact Order) and folding rates, as noted by Plaxco and Baker in the late 1990s [9].
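The contact order measure mentioned above can be illustrated with a minimal sketch. It assumes the commonly used definition of relative contact order (the mean sequence separation between contacting residue pairs, divided by chain length); the function name and the toy contact lists are hypothetical and are not taken from any of the studies reviewed here.

```python
def relative_contact_order(contacts, chain_length):
    """Mean sequence separation of native contacts, normalised by chain length.

    contacts: iterable of (i, j) residue-number pairs that are in contact
              in the native structure.
    chain_length: total number of residues in the chain.
    """
    contacts = list(contacts)
    if not contacts or chain_length <= 0:
        raise ValueError("need at least one contact and a positive chain length")
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (len(contacts) * chain_length)


# Hypothetical toy contact maps for a 20-residue chain (illustration only).
local_contacts = [(1, 4), (5, 8), (9, 12), (13, 16)]      # short-range, helix-like
nonlocal_contacts = [(1, 20), (2, 19), (3, 18), (4, 17)]  # long-range, sheet-like

rco_local = relative_contact_order(local_contacts, 20)
rco_nonlocal = relative_contact_order(nonlocal_contacts, 20)
print(rco_local, rco_nonlocal)
```

In this toy case the long-range contact map gives the larger value, which is the sense in which a topologically complex native state is expected to fold more slowly.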
Although the nucleation-condensation mechanism is observed to be widely applicable, several proteins have been shown to fold in a more hierarchical manner. In particular, the engrailed homeodomain (En-HD) was seen to fold via a classical framework mechanism [10]. To investigate whether this result was owing to the simple architecture of the protein, Fersht and co-workers studied four other members of the homeodomain-like superfamily: c-Myb, hRAP1, Pit1 and hTRF1. They observed a slide in mechanism from hTRF1 (pure nucleation-condensation) to En-HD (pure framework) through c-Myb, hRAP1 and Pit1 (mixed mechanisms), which correlated with the innate secondary structural propensity of each domain [11,12]. The authors used this result to conclude that nucleation-condensation and diffusion-collision are thus 'different manifestations of a common unifying mechanism' for protein folding. This variation is not unique, and a continuum of mechanisms has also been seen for different members of the PSBD superfamily, where it is again linked to secondary structural propensity [13].

The foldon concept
Further reconciliations between apparently different folding pathways have also been proposed using the concept of 'foldons'. This term was initially used to describe the C-terminal domain of bacteriophage T4 fibritin [14], but was quickly adopted by Wolynes and co-workers to describe independently folding units of a protein chain [15]. Although originally referring solely to contiguous regions of polypeptide sequence, Englander [16] and Oliveberg [17,18] redefined the term 'foldon' to describe any kinetically competent submotif within a protein (i.e. any subset of residues that can fold cooperatively to a defined structural state).
Perhaps the most successful application of the foldon hypothesis comes from studies of the ferredoxin-like family of proteins including U1A and the small ribosomal protein S6 from Thermus thermophilus (S6T). Here, Oliveberg and co-workers observed that, while the wild-type S6T protein folded through a globally diffuse transition state that typified nucleation-condensation, a circular permutant (with conjoined wild-type termini and a different backbone cleavage site) exhibited an extremely polarized transition state [19]. Moreover, two alternate circular permutants demonstrated that entropy mutations could be used to shift the position of the nucleus within the topology of the S6T protein [20]. This finding was particularly interesting, since it reconciled the folding of S6T and U1A with that of S6A and ADA2h: two other homologous ferredoxin-like proteins that appeared to fold through a different pathway (although still by nucleation condensation). Oliveberg explained these results by suggesting that all ferredoxin-like proteins comprise two overlapping foldons, but that the specific folding pathway is determined by the primary sequence of each domain [18].
It is, perhaps, easiest to compare these foldons to tandem repeat proteins. In these proteins, each repeat is unstable in isolation, and yet each repeat has a defined native structure to which it will fold [21,22]. Interactions between these repeats can provide sufficient stabilization to produce a globally stable native state, and a cooperatively folding protein [23]. In the same way, isolated foldons are unstable, but the combination of several foldons will lead to a stable, structured protein domain. In the ankyrin repeat protein myotrophin, it is the C-terminal repeat that is most stable (least unstable) in isolation, and hence folding begins in this region of the protein. However, when this repeat is destabilized by mutation, it is now the N-terminal repeat that is most stable, and the protein will fold from the opposite end over a different pathway [24], similar to that of Internalin B [25]. A similar rerouting of the folding pathway has also been achieved by mutations in the Notch ankyrin domain [26]. In an analogous manner, the folding of the ferredoxin-like proteins is controlled by which of the two component foldons is the most stable (least unstable), hence the differences in transition state structure between U1A/S6T and S6A/ADA2h [18].
How do folding pathways respond to sequence changes?
Both experiment [27] and theory [28] suggest that the protein-folding nucleus can be subdivided into two distinct sections (Figure 1). The obligate nucleus comprises those few interactions that commit the polypeptide chain to fold to the correct native state topology. Such residues pack early (with high Φ-values) and incur a substantial entropy cost with little enthalpic gain. They are surrounded by the critical nucleus, which is a shell of additional interactions that are necessary to turn the free-energy profile downhill (i.e. additional interactions that are accumulated up to the global transition state). These interactions are more plastic, and each folding event may use a different subset of residues within the critical nucleus to effect a barrier crossing. The foldon idea can be combined with that of the obligate and critical folding nucleus to explain the many types of pathway malleability: this is described in Figure 2, and exemplified by members of the immunoglobulin-like (Ig-like) fold.
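For readers unfamiliar with how a Φ-value is obtained, the following is a minimal sketch under the standard definition, which this review does not spell out: Φ is the change in the free energy of the transition state relative to the denatured state on mutation, divided by the change in native-state stability. The function name, argument names and the numbers in the example are assumptions made for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def phi_value(kf_wt, kf_mut, ddg_dn, temperature=298.0):
    """Phi = ddG(D-TS) / ddG(D-N) for a single mutation.

    kf_wt, kf_mut: folding rate constants of wild type and mutant (s^-1),
                   measured under the same conditions.
    ddg_dn: loss of native-state stability on mutation (kcal/mol,
            positive for a destabilising mutation).
    """
    if ddg_dn == 0:
        raise ValueError("Phi is undefined when the mutation does not change stability")
    # Change in the folding barrier, from the ratio of folding rate constants.
    ddg_dts = R_KCAL * temperature * math.log(kf_wt / kf_mut)
    return ddg_dts / ddg_dn


# Hypothetical mutation that destabilises the native state by 1.5 kcal/mol.
ddg = 1.5
phi_unstructured = phi_value(100.0, 100.0, ddg)  # folding rate unchanged
slowed_rate = 100.0 * math.exp(-ddg / (R_KCAL * 298.0))
phi_native_like = phi_value(100.0, slowed_rate, ddg)  # full effect on the barrier
print(phi_unstructured, phi_native_like)
```

A value near 1 is conventionally read as the mutated side chain making native-like interactions in the transition state, and a value near 0 as the site being unstructured there; fractional values are harder to interpret.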
When considering the folding of related proteins, perhaps the most thoroughly studied fold is that of the Ig-like domains. These all-β proteins have a complex Greek-key architecture, and are extremely common in eukaryotes with over 40 000 distinct domains identified to date [29]. They were chosen for study because, despite their complex topology, there is low sequence identity within each superfamily, and virtually no sequence identity between different superfamilies. Early studies on fibronectin type III (fnIII) domains (TNfn3 and FNfn10) revealed the presence of four key hydrophobic residues in the B, C, E and F strands that constituted the obligate nucleus: interactions of these residues was necessary, but sufficient, to set up the correct topology of the protein [30-32]. Interestingly, the size of the critical nucleus was very different in these two proteins: it is far more extensive in FNfn10 than in TNfn3 (Figure 2B). Moreover, in FNfn10, a few mutations resulted in a small change in the unfolding m-value that could indicate a shift in the critical nucleus (Figure 2C). Most importantly, the obligate nucleus of the evolutionarily unrelated Ig domain titin I27 comprised residues that were structurally equivalent to those in the fnIII domains [33]. Thus, these proteins share an obligate nucleus, which is required to set up the correct topology of these complex Greek-key domains and allow folding to proceed. Indeed, the hydrophobic residues of this obligate nucleus were so well conserved that a search of the Protein Data Bank (PDB) was undertaken to find an Ig-like domain that did not contain this nucleation motif. The resultant domain, CAfn2, was subject to a detailed Φ-value analysis that produced a gratifying result: the folding nucleus had simply 'slipped' down the core to use an adjacent pair of hydrophobic residues [34], so that both the obligate and critical nuclei have moved in response to sequence changes (Figure 2D).
A final surprise in this analysis of pathway malleability in Ig-like domains came from a more detailed analysis of I27. This domain exhibited unusual anti-Hammond behaviour at high concentrations of denaturant and upon mutation. These data were used to infer the presence of an alternate folding pathway that nucleated at the E-F loop: both the critical and the obligate nucleus have moved entirely (Figure 2E) [35]. Thus we find that Ig-like domains contain at least two potential nucleation motifs, with one foldon comprising the B, C, E and F strands and one foldon centred on the E-F loop. Note that we are not implying that every immunoglobulin-like domain can display all types of pathway malleability, merely that the topology of the immunoglobulin fold allows for each. We speculate that this robustness to sequence changes might account for the success of this fold in Nature.
Are all protein folds as malleable? Using a stringent definition for transition state inflexibility (no shift in the position or size of the folding nucleus), the classic two-state folder CI2 and the small three-helix bundle BdpA are the only domains for which no experimental perturbation has resulted in an altered transition state structure (Figure 2A). In the case of CI2, this inflexibility extends to point mutation, circularization, circular permutation [36] and even bisection [37], and it appears that this protein really does have only one energetically accessible nucleation motif. However, since no other members of this fold have been studied, it is not yet known if this is a general feature of this protein topology. The BdpA protein has been less ruthlessly perturbed and, while the transition state is not affected by point mutation or by temperature [38,39], a more serious structural perturbation may yet have an effect. An interesting case is demonstrated by the LysM domain, which shows an identical pattern of Φ-values after circularization [40], albeit with a global decrease in magnitude. A detailed Eyring analysis suggests that the lower entropy cost of transition state formation is compensated for by a lower enthalpy of contacts: the protein still folds through the same pathway, with a structurally identical but spatially expanded nucleus (Figure 2B).

Figure 1. [...] [88] form early, and are associated with a high entropy cost and little enthalpic gain. The critical nucleus forms a shell around the obligate nucleus, and provides sufficient extra interactions to turn the free-energy profile downhill (lower entropic cost, larger enthalpic gain). These interactions are more plastic, and only a subset of these interactions may be required to complete the folding nucleus.

Figure 2. How folding mechanisms or pathways might change when the sequence of a protein changes. Top: Protein folding has been described as occurring by a sliding mechanism between a framework mechanism, F [5], and nucleation condensation, NC [7]. (F1) If the secondary structure (helical) propensity of the protein is high (dark grey) then secondary structure formation may precede the formation of a tertiary folding nucleus and the protein folds through the framework mechanism. If the secondary structure weakens then a nucleation-condensation mechanism may become more favourable. (F2) If the secondary structure propensity is weak (light grey), but there is no strong nucleus, the protein may still fold by a framework-like, diffusion-collision mechanism, where folding proceeds through collision of partly formed secondary structure elements. Changes in sequence may lead to stronger, earlier formation of secondary structure, or a move to nucleation condensation. Bottom: Within nucleation condensation (NC) mechanisms there may be shifts in the folding nucleus. The malleability of a protein-folding pathway is determined by its component foldons and by redundancy in the nucleating residues. The obligate nucleus is shown in blue and the critical nucleus is shown in cyan. (a) Where a protein contains only one potential set of nucleating residues, the folding pathway is robust. Such proteins can be described as 'ideal' two-state folders, and exhibit V-shaped chevron plots with a single free-energy barrier. Mutation of the nucleating residues will not change the structure of the transition state, but may result in a protein that cannot fold. (b and c) If the obligate nucleus is surrounded by many favourable interactions, then a detrimental mutation within the critical nucleus can lead to the recruitment of other interactions to compensate. This will result either in expansion of the critical nucleus, b, or a shift in the position of the nucleus, c. Such mutations can lead to Hammond effects. (d and e) If a protein can use degenerate residues to set up its native state topology, then mutations within the obligate nucleus can lead to minor shifts in both the obligate nucleus and the critical nucleus; however, if the topology provides alternate foldons, then disruption of the obligate nucleus may result in a complete shift in the position of the folding nucleus. These latter shifts are often linked to anti-Hammond behaviour. Alternatively, in the absence of an alternative set of nucleating residues, destruction of the folding nucleus may lead to a protein that can only fold when transient secondary structure is stabilized by long-range tertiary interactions (F2). Such a protein would be said to fold through the diffusion-collision mechanism.
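The V-shaped chevron plot expected for an ideal two-state folder can be sketched with the standard two-state model, in which the logarithms of the folding and unfolding rate constants each vary linearly with denaturant concentration. This is a generic textbook model rather than the analysis used in any particular study cited here, and all parameter values below are hypothetical.

```python
import math


def ln_kobs(denaturant, kf_water, ku_water, m_f, m_u):
    """ln of the observed relaxation rate constant for a two-state folder.

    denaturant: denaturant concentration (M).
    kf_water, ku_water: folding and unfolding rate constants in water (s^-1).
    m_f, m_u: kinetic m-values (M^-1), i.e. the slopes of ln k_f and ln k_u
              against denaturant concentration, both given as positive numbers.
    """
    k_fold = kf_water * math.exp(-m_f * denaturant)   # folding slows with denaturant
    k_unfold = ku_water * math.exp(m_u * denaturant)  # unfolding speeds up
    return math.log(k_fold + k_unfold)


def midpoint(kf_water, ku_water, m_f, m_u):
    """Denaturant concentration at which folding and unfolding rates are equal."""
    return math.log(kf_water / ku_water) / (m_f + m_u)


# Hypothetical parameters, for illustration only.
params = dict(kf_water=100.0, ku_water=1e-3, m_f=2.0, m_u=1.0)
d_mid = midpoint(**params)
for conc in (0.0, 2.0, d_mid, 6.0, 8.0):
    print(f"{conc:5.2f} M  ln k_obs = {ln_kobs(conc, **params):6.2f}")
```

In this picture a shift in the unfolding m-value on mutation changes the slope of the unfolding arm, which is why such a change can be read as movement of the transition state; curvature in either arm is a departure from the simple model sketched here.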
The apparent malleability of the transition state ensemble can be strongly dependent on the imposed perturbation, as demonstrated by the β-sandwich domain α-spectrin SH3. The wild-type transition state is formed from the packing of two out of the three native state β-hairpins (RT loop and distal loop). A circular permutant that cleaved the RT loop resulted in an unchanged folding pathway (Figure 2B), but an alternate permutant that cut the distal loop resulted in a completely different transition state structure involving the n-Src loop and the WT termini (Figure 2E) [41]. Other large-scale shifts in the obligate nucleus are not uncommon, especially where the folds exhibit symmetry. The symmetrical, ubiquitin-like Protein G, which comprises a central helix packing on two terminal hairpins, is a good example of such a large change. The wild-type protein nucleates using the C-terminal hairpin and helix, as determined by Φ-value analysis [42]. However, a computationally redesigned version of the protein was successfully engineered to fold via the N-terminal hairpin [43], with a transition state reminiscent of the homologous Protein L [44]. In both of these cases, SH3 and ubiquitin-like domains, the protein topology provides at least two foldons, either of which is able to nucleate under the right conditions. As with the S6 proteins, these foldons are overlapping.

The role of intermediates in folding
As mentioned previously, the engrailed homeodomain has been shown to fold through a framework mechanism [11]. In fact, the secondary structural propensity of En-HD is so high that individual helices are stable in isolation (Figure 2, F1). This leads to three-state folding behaviour where kinetic intermediates accumulate. Reducing the secondary structural propensity results in a domain where no helix is stable in isolation. Now, the transiently formed helices are only stabilized once they have accumulated sufficient long-range interactions, and this interdependency results in global folding cooperativity, as seen with c-Myb. This behaviour is shown in Figure 2 as the slide from framework (F1) to nucleation condensation (NC). Nevertheless, c-Myb can be specifically mutated to increase the helical propensity, and convert the folding kinetics to three-state [45]. A similar effect is seen with the immunity proteins, Im7 and Im9 [46,47], which share a common transition state structure despite the fact that Im9 folds in a two-state manner (no independently stable submotifs) while Im7 exhibits three-state kinetics (with at least one independently stable submotif). By stabilizing the nucleating foldon, Im9 was rationally engineered to fold through a [...]. This is an extremely interesting example where one of the component foldons has mutated so as to be the most stable species under certain solution conditions, as shown by the presence of an equilibrium intermediate. We infer that the PDZ domain contains at least two nucleation-competent motifs within its structure. If the protein nucleates using the first (stable) foldon, then the second energy barrier is larger than the first and an intermediate accumulates (Figure 3A). If, however, the protein nucleates using the second (unstable) foldon, then the second energy barrier is smaller than the first and the whole folding process is cooperative.
Under certain experimental conditions, it is easier for the intermediate to fully unfold and follow the alternate nucleation pathway than it is for the intermediate to progress directly to the native state (Figure 3B). In these cases, the intermediate appears to be off-pathway. The PDZ behaviour was modeled on that of lysozyme, which contains a stable α-domain, an unstable β-domain, and folds with a 'triangular' scheme of two parallel pathways, only one of which exhibits a kinetic intermediate [57]. Alternative folding pathways and kinetic traps have also been observed, and analysed, for homologous members of the flavodoxin-like fold [58,59], the β-trefoil family [60,61] and the caspase recruitment domains [62], amongst others.
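The statement that an intermediate accumulates only when the second barrier is larger than the first can be illustrated with a deliberately simplified sketch. It assumes an irreversible sequential scheme D → I → N with rate constants k1 and k2 and uses the textbook analytical solution for consecutive first-order reactions; it does not model the parallel pathways of the triangular scheme, and the function names and rate constants are hypothetical.

```python
import math


def intermediate_population(t, k1, k2):
    """Fraction of molecules in I at time t for D -> I -> N, starting from D."""
    if math.isclose(k1, k2):
        return k1 * t * math.exp(-k1 * t)
    return k1 / (k2 - k1) * (math.exp(-k1 * t) - math.exp(-k2 * t))


def peak_intermediate_population(k1, k2):
    """Maximum fraction of molecules that ever populate I."""
    if k1 <= 0 or k2 <= 0:
        raise ValueError("rate constants must be positive")
    if math.isclose(k1, k2):
        return math.exp(-1.0)  # limiting value when both steps are equally fast
    return (k1 / k2) ** (k2 / (k2 - k1))


# Second barrier larger than the first (slow second step): I accumulates.
peak_stable_foldon_first = peak_intermediate_population(k1=100.0, k2=1.0)
# Second barrier smaller than the first (fast second step): I is barely populated.
peak_unstable_foldon_first = peak_intermediate_population(k1=1.0, k2=100.0)
print(peak_stable_foldon_first, peak_unstable_foldon_first)
```

With these assumed rate constants the intermediate is almost fully populated in the first case and close to undetectable in the second, so the same pair of foldons can give apparently three-state or apparently two-state kinetics depending only on the order of the barriers.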

Comparisons between folds
Both spectrin domains and homeodomains are three-helix bundle proteins. Three spectrin domains have been investigated in detail (R15, R16 and R17), all from chicken brain α-spectrin. As seen for the homeodomains, there is no common folding mechanism, with R16 (and R17) folding by the collision of partly pre-formed helices [63,64], while R15 folds by classical nucleation-condensation [65]. In the spectrin case, however, it is not increased helical propensity in R16 that favours the framework-like mechanism: rather, it is the lack of a competent folding nucleus (Figure 2, F2). Addition of a nucleus results in a change in the folding mechanism from framework towards nucleation condensation, as shown in Figure 2 with a slide from F2 to NC [66,67]. Interestingly, in contrast to the homeodomains where the framework mechanism leads to faster folding, in spectrin it is the proteins that fold by nucleation condensation that fold faster. This difference is probably related to the difference in size of these two folds. The helices in spectrin are long (8-10 turns per helix) unlike the short 2-3 turn helices in the homeodomains. We have speculated that there is a frustrated search for the correct docking of the helices in the spectrin domains, manifested as 'internal friction', that explains this observation [66,68,69]. Remarkably, it has not been possible to alter the folding pathway of R15, either to move towards a framework-like mechanism, or to induce a change in the position of the nucleus: radical destabilization of the folding nucleus in R15, which causes significantly slower folding and unfolding, still results in a protein with Φ-values that are identical to the wild-type protein (unpublished data). This protein therefore shows no signs of pathway malleability (Figure 2A), unlike its homologues R16 and R17.

Combining experiment and computational studies
Knotted proteins
One of the more surprising results in recent years is the finding that knotted proteins are able to fold spontaneously, without chaperones or enzymatic help, to the native knotted state. Mallam and Jackson investigated two members of the α/β knot family and observed that both YbeA and YibK folded with similar rates and through comparable kinetic pathways, from knotted denatured states [70]. In an elegant recent follow-up study [71], the authors followed the folding of these proteins in a cell-free translation system and demonstrated that the newly synthesized proteins have to knot before they can fold: a rate-limiting process that is accelerated by chaperonins. Nevertheless, this knotting process must be controlled by the primary sequence of the protein and thus it is very interesting to investigate homologous proteins where some are knotted and some are not. Faccioli and co-workers used coarse-grained protein models to study the folding of the natively knotted N-acetylornithine carbamoyltransferase (AOTCase) and a homologous unknotted ornithine carbamoyltransferase (OTCase). They found that, when non-native interactions were ignored, neither protein was able to form a trefoil knot. By contrast, when non-native interactions were added to the model, the AOTCase was able to spontaneously knot in a substantial proportion of the simulations [72]. This kind of study is particularly useful, since it can be used to highlight important folding contacts that cannot be deduced from the native, denatured or transition states. In this case, the simulations predict contacts that can be added/removed in vitro to make a knotted form of OTCase or a non-knotted mutant of AOTCase.

Current Opinion in Structural Biology
Figure 3. A protein with more than one foldon has access to multiple folding pathways and may exhibit both on-pathway and off-pathway intermediates. Lowercase letters denote unstructured foldons (p, q) and uppercase letters denote structured foldons (P, Q). The double dagger (‡) denotes the foldon that is (un)folding at each transition state. (a) Both the PDZ domains and lysozyme have been shown to fold through a triangular folding scheme under certain experimental conditions. This can be explained by considering a protein with two component foldons (p, q) either of which can fold first. Importantly, one foldon is stable in isolation (P) but the other is unstable in isolation (q). In the blue pathway, the second energy barrier (q folding) is larger than the first energy barrier (p folding) and therefore an on-pathway intermediate accumulates.

Nearly the same sequence but a different fold
As a contrast to the fold approach, several groups have been working towards designing proteins with highly similar amino acid sequences, but which cooperatively fold to different native state topologies. This quest, known as the Paracelsus Challenge, was first achieved in 1997 when Regan and co-workers designed two proteins that were more than 50% identical yet adopted different native folds (ROP-like and ubiquitin-like) [73]. This design was surpassed in 2005, and again in 2008, when Bryan and co-workers developed two polypeptide chains that are 88% identical and yet adopt very different tertiary structures [74]. These proteins have been studied both by experiment and computationally, and the conclusion is that the final native topology is determined by the structure of the denatured state and the very earliest folding events [75,76]. In the case of the GX88 proteins, the early development of a β-hairpin in one sequence prevents α-helical formation in that region, and leads to the ubiquitin-like fold [75,76]. The alternate sequence retains significant helical structure in the denatured state, which leads to the all-α helical bundle. Residual structure in the denatured state has also recently been shown to be important for the folding of the ribonuclease domains [77] and the SUMO proteins [78].
In a more recent extensive study of the designed system, Gianni and co-workers have shown that GA88 folds using a robust transition state to a three-helix bundle, while GB88 folds over a very malleable energy landscape to a ubiquitin-like (mostly β-sheet) topology [79]. This malleability is assigned to the presence of multiple, competing foldons. In contrast to most natural proteins, where the component foldons work in unison to provide a cooperatively folded protein, the GX88 designed proteins provide an example where two structurally overlapping foldons work in opposition. By fine-tuning the energy cost of each nucleating foldon, the overall topology of the whole protein can be adjusted. This result should be directly applicable to the study of aggregation-prone polypeptides, where minimal perturbations in structure and/or solution conditions are able to change the resulting topology of the folded state from native to the universal cross-β amyloid structure.

Summary
What is clear from many of these studies is that researchers should be wary of characterising the folding of a particular protein topology based on a single member of the fold. While it may be informative to study a wide cross-section of the proteome [80], gross comparisons between different folds are unable to inform as to how and why a polypeptide chain folds to its specific native state. These answers mostly come from more intricate studies, looking for differences in the folding of closely related proteins (the so-called 'Fold Approach'). For example, such studies have taught us that a folding pathway should not be defined by its kinetic intermediates, since these species can easily be introduced into, or removed from, the energy landscape (e.g. En-HD/c-Myb, Im7/Im9, PDZ). In addition, while some proteins appear to be very restricted in their response to mutation (CI2, LysM), other folds exhibit a high degree of pathway malleability. This latter group includes the immunoglobulin-like domains, which are able to change their folding nucleus in response to deletions in the hydrophobic core [34], changes in solvent conditions [35], and even under mechanical stress [81,82]. This plasticity in the energy landscape may confer an evolutionary advantage over more restricted folds, and may explain why the topologically complex Ig-like domains are so prevalent when compared to more simple folds: changes in sequence that are required for functional reasons can be easily compensated for by a shift in the folding nucleus. It is also observed that symmetric proteins, such as the ubiquitin-like domains [42,44], show more pathway malleability than similarly sized asymmetric proteins, presumably owing to the comparable entropic cost of topologically symmetric foldons [83,84].
The idea that protein domains comprise several foldons (individually cooperative submotifs) is particularly appealing, since it is able to simplify the folding of complex topologies by introducing the concept of a 'funnel of funnels' [85]. This would also have the advantage that de novo proteins could be systematically built using a toolbox of smaller components. Indeed, Baker and co-workers recently emphasized that it is easy to rationally stabilize the native state of a protein, but it is much harder to disfavour the plethora of non-native states that are also possible. Their phenomenal success in designing five new stable, monomeric proteins from scratch was based on the structural overlap of several defined motifs with a known topological bias, specifically chosen to favour funnel-shaped energy landscapes [86].
While it is certainly true that the ferredoxin-like proteins comprise two overlapping foldons, whether or not this is a general feature of all complex protein folds remains to be seen. Nevertheless, one interesting observation is that the size of the dominant foldon may be related to topological complexity. The spectrin repeats [67] and homeodomain-like bundles [12]

59. Hills RD, Kathuria SV, Wallace LA, Day IJ, Brooks CL, Matthews CR: Topological frustration in beta alpha-repeat proteins: sequence diversity modulates the conserved folding mechanisms of alpha/beta/alpha sandwich proteins. J Mol Biol 2010, 398:332-350.
A combined experimental and computational approach is used to demonstrate that the hydrophobic residues that are required for the fast folding of the flavodoxin-like proteins are also responsible for the prematurely folded unproductive intermediates.