Mechanical Strength of 17 134 Model Proteins and Cysteine Slipknots

A new theoretical survey of proteins' resistance to constant speed stretching is performed for a set of 17 134 proteins as described by a structure-based model. The proteins selected have no gaps in their structure determination and consist of no more than 250 amino acids. Our previous studies have dealt with 7510 proteins of no more than 150 amino acids. The proteins are ranked according to the strength of the resistance. Most of the predicted top-strength proteins have not yet been studied experimentally. Architectures and folds which are likely to yield large forces are identified. New types of potent force clamps are discovered. They involve disulphide bridges and, in particular, cysteine slipknots. An effective energy parameter of the model is estimated by comparing the theoretical data on characteristic forces to the corresponding experimental values combined with an extrapolation of the theoretical data to the experimental pulling speeds. These studies provide guidance for future experiments on single molecule manipulation and should lead to selection of proteins for applications. A new class of proteins, involving cystein slipknots, is identified as one that is expected to lead to the strongest force clamps known. This class is characterized through molecular dynamics simulations.


Introduction
Atomic force microscopy, optical tweezers, and other tools of nanotechnology have enabled induction and monitoring of large conformational changes in biomolecules. Such studies are performed to assess structure of the biomolecules, their elastic properties, and ability to act as nanomachines in a cell. Stretching studies of proteins [1] are of a particular current interest and they have been performed for under a hundred of systems. Interpretation of some of these experiments has been helped by all-atom simulations, such as reported in refs. [2,3]. They are limited by of order 100 ns time scales and thus require using unrealistically large constant pulling speeds. However, they often elucidate the nature of the force clamp -the region responsible for the largest force of resistance to pulling, F max . All of the experimental and all-atom simulational studies address merely a tiny fraction of proteins that are stored in the Protein Data Bank (PDB) [4]. Thus it appears worthwhile to consider a large set of proteins and determine their F max within an approximate model that allows for fast and yet reasonably accurate calculations. Structure-based models of proteins, as pioneered by Go and his collaborators [5] and used in several implementations [6][7][8][9][10][11][12][13], seem to be suited to this task especially well since they are defined in terms of the native structures away from which stretching is imposed.
There are many ways, all phenomenological, to construct a structure-based model of a protein. 504 of possible variants are enumerated and 62 are studied in details in ref. [14]. The variants differ by the choice of effective potentials, nature of the local backbone stiffness, energy-related parameters, and of the coarse-grained degrees of freedom. The most crucial choice relates to making a decision about which interactions between amino acids count as native contacts. Comparing F max to the corresponding experimental values in 36 available cases selects several optimal models [14]. Among them, there is one which is very simple and which describes a protein in terms of its C α atoms, as labeled by the sequential index i. This model is denoted by LJ3 = 6 − 12, C, M 3, E 0 which stands for, respectively, the Lennard-Jones native contact potentials, local backbone stiffness represented by harmonic terms that favor the native values of local chiralities, the contact map in which there are no i, i + 2 contacts, and the amplitude of the Lennard-Jones potential, ǫ, is uniform. The contact map is determined by assigning the van der Waals spheres to the heavy atoms (enlarged by a factor to account for attraction) and by checking whether spheres belonging to different amino acids overlap in the native state [15,16]. If they do, a contact is declared as native. Non-native contacts are considered repulsive. Application of this criterion frequently selects the i, i + 2 contacts as native. If the contact map includes these contacts the resulting model will be denoted here as LJ2. On average, it performs worse than LJ3 because the i, i + 2 contacts usually correspond to the weak van der Waals couplings as can be demonstrated in a sample of proteins by using a software [17] which analyses atomic configurations from the chemical perspective on molecular bonds. Thus the i, i + 2 couplings should better be removed from the contact map (in most cases).
The survey to determine F max in 7510 model proteins with the number of amino acids, N , not exceeding 150 and 239 longer proteins (with N up to 851) has been accomplished twice. First within the LJ2 model [18] and soon afterwords within the LJ3 model [19]. The first survey also comes with many details of the methodology whereas the second just presents the outcomes. The two surveys are compared in more details in refs. [14,20]. The results differ, particularly when it comes to ranking of the proteins according to the value of F max , but they mutually provide the error bars on the findings. They both agree, however, on predicting that there are many proteins whose strength should be considerably larger than the frequently studied benchmark -the sarcomere protein titin (F max of order 204 pN [21,22]). Near the top of the list, there is the scaffoldin protein c7A (the PDB code 1aoh) which has been recently measured to have F max of about 480 pN [23]. Other findings include establishing correlations with the CATH hierarchical classification scheme [24,25], such as that there are no strong α proteins, and identification of several types of the force clamps. The large forces most commonly originate in parallel β − strands that are sheared [26]. However, there are also clamps with antiparallel β − strands, unstructured strands, and other kinds.
The two surveys have been based on the structure download made on July 26, 2005 when the PDB comprised 29 385 entries. Many of them correspond to nucleic acids, complexes with nucleic acids and with other proteins, carbohydrates, or come with incomplete files and hence the much smaller number of proteins that could be used in the molecular dynamics studies. Here, we present results of still another survey which is based on a download of December 18, 2008 which contains 54 807 structure files and leads to 17 134 acceptable structures with N not exceeding 250 (instead of 150). These structures are then analyzed through simulations based on the LJ3 model. The numerical code has been improved to allow for acceleration of calculations by a factor of 2.
The 190 structures (or 1.1 % of all structure considered) with the top values of F max in units of ǫ/Å are shown in Table 1 (the first 81 entries for which F max ≥ 3.9 ǫ/Å) and Table S1 of the SI (proteins ranked 82 through 190), together with the values of titin (1tit) and ubiquitin (1ubq) to provide a scale. As argued in the Materials and Methods section section, the unit of force, ǫ/Å, is now estimated to be of order 110 pN. All of the corresponding proteins are predicted to be much stronger than titin and none but two of them (1aho, 1g1k [23]) have been studied experimentally yet. In addition to the types of force clamps identified before, we have discovered two new mechanisms of sturdiness. One of them involves a cysteine slipknot (CSK) and is found to be operational in all of the 13 top strength proteins. In this motif, a slip-loop is pulled out of a cysteine knot-loop. Another involves dragging of a single fragment of the main chain across a cysteine knot-loop. The two mechanisms are similar in spirit since both involve dragging of the backbone. However, in the CSK case, two fragments of the backbone are participating.
We make a more systematic identification of the CATH-classified architectures that are linked to mechanical strength and then analyze correlations of the data to the SCOP-based grouping (version 1.73) [27][28][29]. The previous surveys did not relate to the SCOP scheme.
We identify the CATH-based architectures and SCOP-based folds that are associated with the occurrence of a strong resistance to pulling. A general observation, however, is that each such group of structures may also include examples of proteins that unravel easily. The dynamics of a protein are very sensitive to mechanical details that are largely captured by the contact map and not just by the appearance of a structure. On the other hand, if one were to look for mechanically strong proteins then the architectures and folds identified by us should provide a good starting point. We also study the dependence of F max on the pulling velocity and characterize the dependence on N through distributions of the forces.
The current third survey has been performed within the same LJ3 model as the second survey [19]. However, we reuse and extend it here because the editors of Biophysical Journal retracted the second survey [30]. All of the values of F max are deposited at the website www.ifpan.edu.pl/BSDB (Biomolecule Stretching Database) and can by accessed by through the PDB structure code.

Distribution of Forces
The distribution of all values of F max for the full set of proteins is shown in Figure 1. Despite the larger limit on N now allowed, the distribution is rather similar to that obtained in ref. [19] for the smaller number of proteins (and with the smaller sizes). The similarity is primarily due to the fact that the size related effects, discussed below, are countered by new types of proteins that are now incorporated into the survey. The distribution is peaked around F max of 1.2 ǫ/Å which constitutes about 60% of the strength associated with titin. The distribution is non-Gaussian: it has a zero-force peak and a long force tail. The zero-force peak arises in some proteins with the covalent disulphide bonds. In the model, such bonds are represented by strong harmonic bonds. Stretching of such a protein may not result in any force peak before a disulphide bond gets stretched indefinitely and hence F max is considered to be vanishing then. The tail, on the other hand, corresponds to the strong proteins. The top strongest 1.1% of all proteins are listed in Tables 1 (in the main text) and S1 (in the SI).
The insets of Figure 1 show similar distributions for proteins belonging to the particular CATH-based classes. There are four such classes: α, β, α − β and proteins with no apparent secondary structures. It is seen that none of the 3240 α proteins exceeds the peak force obtained for titin within our model. This observation is in agreement with experiments on several α proteins that are listed in the Materials and Methods section. All strong proteins are seen to involve the β − strands. The peak in the probability distribution for the α − β proteins is observed to be shifted towards the bigger values of F max compared to the one for the β proteins. At the same time, the high force tail of the distribution for the β proteins is substantially more populated than the corresponding tail for the α proteins. Figure 2 is similar to Figure 1 in spirit, but now the structures are split into particular ranges of the protein sizes: N between 40 and 100 (the dotted line), between 100 and 150 (thin solid line), and between 200 and 250 (the thick solid line). The curve for the range from 150 to 200 is in-between the curves corresponding to neighboring ranges and is not shown in order not to crowd the Figure. The distributions are seen to be shifting to the right when increasing the range of the values of N indicating, that the bigger the number of amino acids, the more likely a protein is to have a large value of F max . This observation holds for all classes of the proteins, as evidenced by the insets in Figure 2.
In most cases, the major force peak arises at the begining of stretching where the Go-like model should be applicable most adequately. One can characterize the location of F max during the stretching process by a dimensionless parameter λ which is defined in terms of the end-to-end distance, as spelled out in the caption of Table 1. This parameter is equal to 0 in the native state and to 1 in the fully extended state. In 25 % of the proteins studied in this survey, λ was less than 0.25 and in 52 % -les than 0.5. There are very few proteins with λ exceeding 0.8. Table 1 does not include any (non-cysteine-based) knotted proteins. The full list of 17 134 proteins contains 42 such proteins but they come with moderate values of F max . However, knotted proteins with N > 250 may turn out to have different properties.

Biological properties of the strongest proteins
A convenient way to learn about the biological properties listed in Tables 1 and S1 is through the Gene Ontology data base [31] which links such properties with the PDB structure codes. The properties are divided into three domains. The first of these is "molecular function" which describes a molecular function of a gene product. The second is "biological processes" and it covers sets of molecular events that have well defined initial and final stages. The third is "cellular component" and it specifies a place where a given gene product is most likely to act.
The results of our findings are summarised in Table 2. It can be seen, that most of the 190 strongest proteins are likely to be found in an extracellular space where conditions are much more reducing than within cells. Larger mechanical stability is advantageous under such conditions. 90 out of the strongest proteins exhibit hydrolase activity. 39 of these 90 are serine-type endopeptidases. These findings seem to be consistent with expectations regarding proteins endowed with high mechanical stability. For instance, proteases, which are well represented in Table 2 should be more stable to prevent self-cleavage.

CATH-based architectures
The classification of proteins within the CATH (Class, Architecture, Topology, Homology) data base is done semi-automatically by applying numerical algorithms to structures that are resolved better than within 4Å [24,25]. The four classes of proteins in the CATH system are split into architectures, depending on the overall spatial arrangement of the secondary structures, the numbers of β −sheets in various motifs, and the like. The next finer step in this hierarchical scheme is into topologies and it involves counting contacts between amino acids which are sequentially separated by more than a treshold. The further divisions into homologous superfamilies and then sequence family levels involve studies of the sequential identity.
Examples of architectures that are dominant contributors to a low force behavior are the α orthogonal bundle (the right bottom panel of Figure 3), the α up-down bundle, and the β − roll (the left bottom panel of Figure 3).

SCOP-based classes and folds
The SCOP (Structural Classification of Proteins) data base [27][28][29] is curated manually and it relies on making comparisons to other structures through a visual inspection. This classification scheme is also hierarchical and the broadest division is into seven classes and three quasi-classes. The classes are labelled a through g and these are as follows: mainly α (a), mainly β (b), α/β which groups proteins in which helices and β − sheets are interlaced (c), α + β with the helices and β − sheets grouped into clusters that are separated spatially (d), multidomain proteins (e), membrane and cell-surface proteins (f ), and small proteins that are dominated by disulphide bridges or the heme metal ligands (g). The quasi-classes are labelled h through j and they comprise coiled-coil proteins (h), structures with low resolution (i), and peptides and short fragments (j). The classes are then partitioned into folds that share spatial arrangement of secondary structures and the nature of their topological interlinking. Folds are then divided into superfamilies (same fold but small sequence identity) and then families (two proteins are said to belong to the same family if their sequence identity is at least 30%). Families are then divided into proteins -a category that groups similar structures that are linked to a similar function. Proteins comprise various protein species.
Each structure assignment comes with an alphanumeric label, as shown in Tables 1, S1, and 4 which reflects the placement in the hierarchy. At the time of our download, there have been 92 972 entries in the SCOP data base that are assigned to 34 495 PDB structures. These entries are divided into 3464 families, 1777 superfamilies and 1086 unique folds. A given structure may have several entry labels but the dominant assignment is listed first. We use the primary assignment in our studies. The same rule is also applied to the CATH-based codes. Figure 4 shows the distributions of forces for the SCOP-based classes of proteins. The results are consistent with the CATH-based classes since the α − β class of CATH basically encompasses the α/β and α + β classes of SCOP. However, there are proteins which are classified only according to one of the two schemes. Thus there are 4431 α − β proteins out of which only the total of 3368 is SCOP-classified as belonging to the α + β and α/β classes. At the same time, the total of the proteins in the α + β and α/β classes we have is 4795.
It should be noted that the peak in the distribution for α + β is shifted to higher forces by about 0.7 ǫ/Å from the peak for α/β. At the same time, the zero-force peak is virtually absent in α + β. The SCOP-based classification also reveals that its class g contributes across the full range of forces and, in particular, it may lead to large values of F max . It should be noted, as also evidenced by Table 1, that there is a substantial number of strong proteins that has no class assignment. Figures 5 and 6 refer to the distributions of F max across specific folds. The first of these presents results for the folds that give rise to the largest forces. The names of such folds are specified in Figure  5. The percentage-wise assessment of the folds contributing to big forces is presented in Table 4. The top contributor is found to be the b.47 fold (SMAD/FHA domain). Figure 6 gives examples of folds that typically yield low forces.
It is interesting to note that distributions corresponding to some folds are distinctively bimodal, as in the case of the SMAD/FHA fold (b.47). This particular fold is dominated by SMAD3 MH2 domain (b.47.1.2; 352 structures) which contributes both to the high and low force peaks in the distribution. The remaining domains (b.47.1.1, b47.1.3, and b47.1.4) contribute only to the low force peak. The dynamical bimodality of the b.47.1.2 fold can be ascribed to the fact that the strong subset comes with one extra disulphide bond relative to the weak subset. This extra bond provides substantial additional mechanical stability when stretching is accomplished by the termini. We illustrate sources of this bimodality in the SI ( Figure S1) for two proteins from this fold: 1bra which is strong and 1elc which is weak. In ref. [18], we have noted that various sets of proteins with identical CATH codes (e.g., 3.10.10) may give rise to bimodal distributions without any dynamical involvement of the disulphide bonds. The reason for this is that even though the contact maps for the two modes are similar, the weaker subset misses certain longer ranged contacts which pin the structure. Mechanical stability is more sensitive to structural and dynamical details than are not provided by standard structural descriptors.

Force clamps
Shearing motif. The most common type of the force clamp identified in the literature is illustrated in the top left panel of Figure 7 corresponding to the 14th-ranked protein 1c4p. In this case, the strong resistance to pulling is due to a simultaneous shearing of two β − strands which are additionally immobilised by short β − strands that adhere to the two strands. Similar motifs appears in 1qqr (15), 1j8s (17), 1j8r (19), 1f3y (20), 2pbt(29), 2fzl(15), 1aoh (19), where the number in brackets indicate ranking as shown in Table  1. It is interesting to note that the β − strands responsible for the mechanical clamp in 1j8s and 1j8r display an additional twist. Undoing the twist enhances F max . (There is a similar mechanism that seems to be operational in the case of a horseshoe conformation found in ankyrin [32,33]). The force clamps are identified by investigating the effect of removal of various groups of contacts on the value of F max [12,18].
There are, however, new types of the force clamps that we observe in the proteins listed in Tables 1 and S1. They arise from entanglements resulting from the presence of the disulphide bonds which cannot be ruptured by forces accessible in the atomic force microscopy. We note that about 2/3 of the proteins listed in Tables 1 and S1 contain the disulphide bonds. Many of these bonds do not carry much of dynamical relevance when pulling by the termini. However, in certain situations they are the essence of the force clamp. The disulphide bonds have been already identified as leading to formation of the cystein knot (CK) motifs [34,35] (such proteins are found in the toxins of spiders and scorpions) and the cyclic CK motifs [36,37]. Here, we find still another motif -that of the CSK which is similar to that found in slipknotted proteins [38][39][40] which do not conatin the disulphide bonds. This motif is found in the top 13 proteins. The cysteine loop, knot, and slipknot motifs are shown schematically in the remaining panels of Figure 7. It is convenient to divide these motifs into two categories: shallow (S) and deep (D) (according to the classification used for knotted proteins [41,42]), depending on whether the motif is spanning most of the sequence or is instead localized in its small fraction.
Shearing connected with a cysteine loop. In this case, the mechanical clamp arises from shearing between a β−strand belonging to a deep cysteine loop and another strand located outside the loop (the left bottom panel of Figure 7). Existence of the disulphide bond before the shearing motif allows to decompose direct tension onto the β − strands making the protein resist stretching much more effectively than what would be expected from a simple shearing motif. Additionally, the disulphide bonds prevent an onset of any rotation in the protein conformation which otherwise might form an opportunity for unzipping. This motif appears in 1dzj(40,D) 1vsc(37,D), 1dzk(35,D), 1i04 (81,D), 1hqp(83,D), 1oxm(98,D), 2a2g (175,D), 2boc(179,D), and many other proteins. The middle panel of Figure 8 gives an example of the corresponding force (F ) -displacement (d) pattern as obtained for 1dzj.
Shearing and dragging out of a cysteine loop. This motif consists of two parts. The first is formed by a rather small and deep cysteine loop which is located very close to one terminus with the second terminus located across the cysteine loop. The motif arises when almost all of the protein backbone is dragged across the cysteine loop on stretching. A protein structure also contains a few β − strands which get sheared before dragging takes place. This motif is seen in 1kdm(23,D), 1q56(24,D), 1qu0(33,D), 1f5f (34,D) and this geometry of pulling we call geometry I. It should be pointed out that, in all such cases, pulling by the N terminus takes place within (or very near) the plane formed by the cysteine loop. A small change in such a geometry, e.g. the one arising from pulling not by the last amino acid but by the penultimate bead, may cause getting out of the cystein loop and result in a very different unfolding pathway with a distinctly different value of F max . In this other kind of pulling set up, denoted as geometry II, the loop is bypassed and the resistance to pulling is provided only by the shearing mechanism.
Dragging arises from overcoming steric constraints and generates an additional contribution to the strength of the standard shearing mechanical clamp. By using geometry II and also by eliminating the native contacts between the sheared β − strands we can estimate the topological contribution of the dragging effect on the value of F max . For proteins 1kdm, 1q56, 1qu0, 1f5f, it comes out to be around 25 %. The force F -d patterns corresponding to these two geometries of pulling are shown in top panel of Figure 9.
In the survey, there are other proteins which also have disulphide bonds and belong to the 2.60.120.200 category. These proteins have a cysteine which is either very shallow or deep, but is located in the middle of the protein backbone so that there is no possibility to form a long β − strand. In this case, the dragging effects are much smaller. For instance, for 1pz7(D) and 1cpm(S), F max is close to 1 ǫ/Å.
Shearing inside of a cysteine knot . This motif is created by a loosely packed CK (two or more spliced cysteine loops) with at least two parallel β strands that are present within the knot. Pulling protein by termini exerts tension on the entire CK and thus produces an indirect shearing force on the β − strands inside the entangled part of the protein. In this case, elimination of the native contacts between the β − strands reduces F max only partially indicating that the mechanical clamp is created also by the CK. A simple CK is also found in 2bzm (42)  Cysteine slipknot force-clamp is observed in the strongest 13 proteins. The top strength protein is 1bmp (bone morphogenic protein) with the predicted F max of 10.2 ǫ/Å, which should correspond to about 1100 pN (see Materials nad Methods). This strength should be accessible to standard experiments as the atomic force microscopy has been already used to rupture covalent N-C and C-C bonds by forces of 1500 and 4500 pN respectively [43].
In our discussion, we focus on the 13-ranked 1vpf (a vascular endothelial growth factor) with the predicted F max of 5.3 ǫ/Å. The CSK motif arises from two loops [40]: the knot-loop and the slip-loop, where the slip-loop can be threaded across the knot-loop. One needs at least three disulphide bonds for this motif to arise.
In the case of the 1vpf, the knot-loop is created by the disulphide bonds between amino acids 57 and 102, 61 and 104, and the protein backbone between amino acids 57-61 (GLY,GLY,CYS) and 102-104 (GLU). The slip-loop is created by the protein backbone between sites 61-102 and is stabilized by 12 hydrogen bonds between two parallel β −strands. In the CSK motif, the force peak is due to dragging of a slip-loop through the knot-loop making the native hydrogen contacts only marginally responsible for the mechanical resistance. Thus the force peak arises, to a large extent, from overcoming steric constraints, i.e. it is due to repulsion resulting from the excluded volume. The F − d pattern for this novel type of a force clamp is shown in the top panel of Figure 8. Another example of such a pattern for a CSK is shown in the bottom panel of Figure 9 for the 22nd ranked 2h64 (a human transforming growth factor). The leading role of the steric constraints is verified by checking the reduction of the F max when all the slipknot-related contacts (inside the slip-loop and between the slip-loop and the knot-loop) are converted to be purely repulsive. As a result of this bond removal, the force peak persists, though it gets shifted and becomes smaller. This is summarized in Table S2 in the SI. It is a new and unexpected result.
Another way to establish the role of the CSK motif is to create the disulphide-deficient mutants, as accomplished experimentally [44] for 1vpf. The two mutants, 1mkk (C61A and C104A) and 1mkg (C57A and C102A), have structures similar to 1vpf but contain no knot-loops and thus there is no slipknot. Muller et al. [44] show that the mutants' thermodynamic stability is not reduced but their folding capacity is. Our work shows that the mutants have a reduced resistance to pulling compared to 1vpf: F max drops from 5.3 ǫ/Å to 1.49 and 2.01 ǫ/Å for 1mkk and 1mkg respectively.
We note that the CSK topology is a subgroup inside the CK class (represented mostly by 2.10.90.10) and the CSK force clamp need arise for a particular way of pulling. For instance, proteins 1afk(68), 1afl(100) or 1aqp(118) have up to four disulphide bonds and yet the CSK motif does not play any dynamical role in pulling by the terminal amino acids. In the case of the CSK, we observe a formidable dispersion in the values of F max . For example, it ranges between 4.8-5.9, 4.1-4.8, and 4.1-5.2 ǫ/Å for various trajectories in 1vpf, 2h64, and 2c7w respectively. We now examine the CSK geometry in more details.
Cysteine slipknot motif is distinct from the slipknot motif in several ways. The left-most panel of Figure 10 shows a slipknot with three intersections at sequential locations k 1 , k 2 , and k 3 . This geometry is topologically trivial since when one pulls by the termini, the apparent entanglement may untie and become a simple line. The entanglement would form the trefoil knot if the k 3 intersection was removed by redirecting the corresponding segment of the chain (thin line) away from the k 1 − k 3 loop. Such slipknot motifs have been observed in native states of several proteins [38][39][40]. In contrast, the CSKs are not present in the native state but arise as a result of mechanical manipulation. The middle panel of Figure  10 shows a schematic representation of a native conformation with three cysteine bonds: between i 1 and j 1 , between i 2 and j 2 , and between i 3 and j 3 . The i-ends of the bonds are counted as being closer to the N-terminus. The three bonds are in a specific arrangement as shown in the panel. In particular, the i 3 − j 3 bond must cross the loop i 1 − i 2 − j 2 − j 1 . This loop consists of two pieces of the backbone (i 1 − i 2 and j 2 − j 1 ) that are linked to form a closed path by the two remaining cysteine bonds -it is the cysteine knot-loop. The average radius of this loop is denoted by R ck .
The arrangement shown in the middle panel has no entanglements that could be considered as knots in the topolgical sense. However, on pulling by the termini, the chain segment adjacent to i 3 gets threaded through the knot-loop since i 3 is rigidly attached to j 3 , as illustrated in the rightmost panel of Figure  10. Pulling by i 3 − j 3 also results in generating another loop -the cysteine slip-loop -since the segment around i 3 gets bent strongly to form a cigar like shape with the radius of curvature at the i 3 -tip denoted by R cs . This loop extends between i 2 and j 1 . It should be pointed out that the cysteine knot-loop in the CSK is stiff whereas in a slipknotted protein (such as the thymine kinase) its size is variable (as it can be tightened on the protein backbone [40] in analogy to tightening a knot [45] by pulling).
The dynamics of pulling depends of the relationship between R ck and R cs as the "cigar" may either go through or get stuck. In the former case a related force peak would arise. If the system was a homogeneous polymer, dragging would be successful when R ck was bigger than R cs . The corresponding force would be related to the work against the elasticity that was needed to bend the slip-loop to the appropriate curvature. This work is proportional to the square of the curvature. Thus the total elastic energy involved in bending the segment i 2 − j 1 is of order dsR −2 ∼ R −1 cs [46], where s is the arc distance. Dividing this energy by the distance of pulling would yield an estimate of the force measured if thermal fluctuations were neglected. The geometrical condition for dragging in proteins is more complicated because of the presence of the side groups and the related non-homogeneities and variability across the hydrophobicity scale. The diameter of the "rope" that the knot loop is made of should not exceed the maximum a linear extension, t k of amino acids. Thus the effective inner radius of the knot-loop is R ck − t k . Similarly, the size of the outer circle that is tangential to the tightest slip-loop is R cs +t s , where t s is the thickness of the slip-loop. (Both thicknesses can be considered as being site dependent and including possible hydration layer effects near polar amino acids.) Thus the slip-knot can be driven through the cystein knot-loop provided In our simulations, the successful threading situations correspond to R ck and R cs of around 7 and 3Å. The amino acids in the knot-loop are mostly Gly, Ala, or Cys with their side groups pointing outside of the loop. One may then estimate t k to be about 1.5Å. On the other hand, the linear size of the amino acids in the slip-loop can be determined to be close to 2.5Å. These estimates indicate that R cs + t s can be very close to R ck − t k so the possibility of slipping through the knot-loop is borderline. In fact, slipping might be forbidden within the framework of the tube-picture of proteins [47,48] in which the effective thickness of the tube is considered to be 2.7Å. The CSK motifs give rise to a force peak in 1vpf, 2h64(22,S), 1rv6(25,S), 1waq(26,S), 1reu(27,S), 1tgj(28), 2h62(30,S), 1tgk(31), 2c7w(38,D), 2gyr(39,S), 1lx5(95,D), and many other proteins. In these cases, the typical value of R ck is about 7Å. However, specificity may result in somewhat smaller values of R ck which may cause only smaller segments of the slip-loop to be threaded. If the passage is blocked, there will be no isolated force peak as happens in 1tgj and 1vpp.
Types of the force-displacement patterns for proteins with the disulphide bonds. In the case of proteins with very shallow cystein knot, loop or slipknot motifs, F increases very rapidly with d and isolated force peak does not arise (F max = 0). Such cases are represented, e.g., by 1bmp, 1rnr, 1ld5, and 1wzn where the slipknots are either very tight or the cystein loop is very shallow. In the case of a shallow motif, however, a force peak can sometimes be isolated as in the case of the 13th-ranked protein 1vpf ( Figure 8) and in several other proteins, like 1xzg and 1dzk. In this case, the value of F max takes into account tension on the cystein bonds and it is not obvious whether such a strong elastic background should be subtracted from the value of F when determining F max or not. In this survey, we do not subtract the backgrounds. It should be noted that in our previous surveys we missed the CSK-related force peaks because we attributed the rapid force rises at the end of pulling just to stretching of the backbone without realizing existence of structure in some such rises.
For a deep motif, the F − d pattern may have several small force peaks before the final rise of the force, as observed for 2g4s and 1bj7. When the CSK motif is very deep, it usually does not have any influence on the shape of the F − d pattern apart from a much steeper final rising force. Such a situation is seen in the case of, e.g., 1j8r and 1j8s.

Concluding remarks
This surveys identifies a host of proteins that are likely to be sturdy mechanically. Many of them involve disulphide bridges which bring about entanglements that are complicated topologically such as CSKs and CKs. The distinction between the two is that the former can depart from its native conformation and the latter cannot.
Our survey made use of a coarse grained model so it would be interesting to reinvestigate some of the proteins identified here by all-atom simulations, especially in situations when the CSK is involved. The CSK motifs may reveal different mechanical properties when studied in a more realistic model. Of course, a decisive judgment should be provided by experiment.
The very high mechanical resistance of the CSK proteins should help one to understand their biological function. The superfamily of cysteine-knot cytokines (in class small proteins and fold cystein-knot cytokines) includes families of the transfroming growth factor (TGF)-β and the polypeptide vascular endothelial growth factors (VEGFs) [49,50]. The various members of this superfamily, listed in Table  5, have distinct biological functions. For instance, VEGF-B proteins which regulate the blood vessel and limphatic angiogenesis bind only to one receptor of tyrosine kinase VEGFR-1. On the other hand, VEGF-A proteins bind to two receptors VEGFR-1 and VEGFR-2. All of these proteins form a dimer structure. The members of this familly are endowed with remarkably similar monomer structures but differ in their mode of dimerisation and thus in their propensity to bind ligands. Additionally, all dimers posses almost the same a cyclic arrangement of cysteine residues which are involved in both intra-and inter-chain disulphide bonds. These inter-chain disulphide bonds create the knot and slip-loops, where the intra-chain disulphide bonds give rise to a CSK motif when the slip-loop is gets dragged acrros the knot-loop upon pulling.
It has been shown experimentally [51] that such cysteine related connectivities bring the key residues involved in receptor recognition into close proximity of each other. They also provide a primary source of stability of the monomers due to the lack of other hydrogen bonds between two beta strands at the dimer interface.
The non trvial topologial connection between the monomers allow for mechanical separation of two monomers by a distance of about half of the size of the slip-loop. Our results suggest, however, that the force needed for the separation may be too high to arise in the cell.

Materials and Methods
The input to the dynamical modeling is provided by a PDB-based structures. The structure files may often contain several chains. In this case, we consider only the first chain that is present in the PDB file. Likewise, the first NMR determined structure is considered. If a protein consists of several domains, we consider only the first of them.
The modeling cannot be accomplished if a structure has regions or strings of residues which are not sufficiently resolved experimentally. Essentially all structure-disjoint proteins have been excluded for our studies. Exceptions were made for the experimentally studied scaffoldin 1aoh and for proteins in which small defects in the established structure (such as missing side groups) were confined within cystein loops and were thus irrelevant dynamically. In these situations, the missing contacts have been added by a distance based criterion [23] in which the treshold was set at 7.5Å. Among the test used to weed out inadequate structures involved determining distances between the consecutive C α atoms. A structure was rejected if these distances were found to be outside of the range of 3.6-3.95Å. The exception was made for prolines, which in its native state can accommodate the cis conformation. In that case, the distance between a proline C α and its subsequent amino acid usually falls in the range between 2.8 and 3.85Å. For a small group of proteins which slipped through our structure quality checking procedure, but were found to be easily fixed (e.g. 1f5f, 1fy8, and 2f3c), we used publicly avialable software BBQ [52] to rebuild locations of the missing residues. A limited accuracy of this prediction procedure seems to be adequate for our model due to its the coarse-grained nature.
The modeling of dynamics follows our previous implementations [11,12,18] within model LJ2 except that the contact map is as in ref. [19], i.e. with the i, i + 2 contacts excluded. There is also a difference in description of the disulphide bonds. In refs. [14,19] they were treated as an order-of-magnitude enhancement of the Lennard-Jones contacts in all proteins. In ref. [18] the different treatment of the disulphide bonds was applied to the proteins that were found to be strong mechanically without any enhancements. Here, on the other hand, we consider such bonds as harmonic in all proteins, in analogy to the backbone links between the consecutive C α s. The native contacts are described by the Lennard-Jones potential V 6−12 = 4 ǫ[ ], where r ij is the distance between the C α 's in amino acids i and j whereas σ ij is determined pair-by-pair so that the minimum in the potential is located at the experimentally established native distance. The non-native contacts are repulsive below r ij of 4Å.
The implicit solvent is described by the Langevin noise and damping terms. The amplitude of the noise is controlled by the temperature, T . All simulations were done at k B T = 0.3 ǫ, where k B is the Boltzmann constant. Newton's equations of motion are solved by the fifth order predictor-corrector algorithm. The model is considered in the overdamped limit so that the characteristic time scale, τ , is of order 1 ns as argued in refs. [6,53]. Stretching is implemented by attaching an elastic spring to two amino acids. The spring constant used has a value of 0.12 ǫ/Å 2 which is close to the elasticity of experimental cantilevers. One of the springs is anchored and the other spring is moving with a constant speed, v p . Choices in the value of the spring constant have been found to affect the look of the force-displacements patterns and thus the location of the transition state [54,55], but not the values of F max [10,12,18]. The dependence on v p is protein-dependent and it is approximately logarithmic in v p as evidenced by Figure 11 for several strong proteins. The logarithmic dependence has been demonstrated experimentally, for instance, for polyubiquitin [56,57]. F max = p ln(v/v 0 ) + q. The approximate validity of this relationship is demonstrated in Figure 11 for three proteins with big values of F max . We observe that the larger the value of F max , the bigger probability that the dependence on v p is large. When we make a fit to F max = p ln(v/v 0 ) + q for 1vpf, 1c4p, and 1j8s, we get the parameter p to be equal to 0.39 ± 0.11, 0.17±0.03, and 0.04±0.02ǫ/Å respectively (the values of q are 7.42±0.63, 5.85±0.16, and 4.96±0.08ǫ/Å correspondingly). However, some strong proteins may have p to be as low as 0.04.
When making the survey, we have used v p of 0.005Å/τ and stretching was accomplished by attaching the springs to the terminal amino acids (there is an astronomical number of other choices of the attachment points).
In order to estimate an effective experimental value of the energy parameter ǫ, we have correlated the theoretical values of F max with those obtained experimentally. The experimental data points used in ref. [14] have been augmented by entries pertaining to 1emb (117-182), 1emb (182-212) [58] (where the numbers in brackets indicate the amino acids that are pulled) and 1aoh, 1g1k, and 1amu [23]. The full list of the experimental entries is provided by Table 6. Unlike the previous plots [14] that cross correlate the experimental and theoretical values of F max , we now extrapolate the theoretical forces to the values that should be measured at the pulling speeds that are used experimentally. We assume that the unit of speed, v 0 = 1Å/τ , is of order 1Å/ns and consider 10 speeds to make a fit to the logarithmic relationship. The values of parameters p and q for the proteins studied experimentally are listed in Table 6.
The main panel of Figure 12 demonstrates the relationship between the extrapolated theoretical and experimental values of F max . The best slope, indicated by the solid line, corresponds to the slope of 0.0091. The inverse of this slope yields 110 pN as an effective equivalent of the theoretical force unit of ǫ/Å. The Pearson correlation coefficient, R 2 is 0.832, the rms percent error, r e , is 1.02, and the Theil U coefficient (discussed in ref. [14]) is 0.281. The inset show a similar plot obtained when the extrapolation to the experimental speeds is not done. The resulting unit of the force would be equivalent to 110 pN which differs form the previous estimate of 71 pN (shown by the dotted line in the main panel) because of the inclusion of the newly measured proteins and implementation of the extrapolation procedure. The statistical measures of error here are R 2 = 0.851, r e = 0.37, and U = 0.251. These measures are better compared to the case with the extrapolation because the extrapolation procedure itself brings in additional uncertainties. Nevertheless, implementing the procedure seems sounder physically. The spread between these various effective units of the force suggests an error bar of order 30 pN on the currently best value of 110 pN.  Table 1. F max is obtained within the LJ3 model at the pulling velocity of 0.005Å/τ . The first column indicates the ranking of a model protein, the second -the PDB code, and the third -the number of the amino acids that are present in the structure used. L max denotes the end-to-end distance at which the maximum force arises. λ is the corresponding dimensionless location defined as λ = (L max − L n )/(L f − L n ), where L n is the native end-to-end distance and L f corresponds to full extension. The last two columns give the leading CATH and SCOP codes. The survey is performed based strictly on the PDB-assigned structure codes. It may happen that the structure of a protein has been determined several times and then each of these determinations leads to its own value of F max . In this case, one may derive the best estimate either by picking the best resolved structure or by making (weighted) averages over all related structures.      Table 6. F e max denotes the experimentally measured value of F max as reported in the reference stated in the last column. v p denotes the experimental pulling speed used. F t max is the value of the maximal force obtained in our simulation within the LJ3 model. They were performed at v p = 0.005Å/τ . F te max corresponds to the theoretical estimate of F max when extrapolated to the experimental speeds. The extrapolation assumes the approximate logarithmic dependence F max = p ln(v/v o ) + q, where v 0 is 1Å/τ . 10 speeds were used to determine the values of p and q in analogy to the procedure illustrated in Figure 11 The values of p and q are provided in columns 7 and 8 of the Table respectively. The first column indicates the corresponding symbol that is used in Figure 12.              [14] where no exptrapolation was made. The inset shows a similar plot in which the extrapolation is not implemented (denoted as F t max in Table 6). The list of the proteins used is provided by Table 6. It comprises almost all cases considered in ref. [14] but it also includes the recent data points obtained for the scaffoldin proteins [23] and the GFP [58]. The numerical symbols used in the Figure match the listing number in Table 6. Tables   TABLE 1S: The predicted list of the strongest proteins, ctd.