Learning structural bioinformatics and evolution with a snake puzzle

We propose here a working unit for teaching basic concepts of structural bioinformatics and evolution through the example of a wooden snake puzzle, strikingly similar to toy models widely used in the literature of protein folding. In our experience, developed at a Master course at the Universidad Autónoma de Madrid (Spain), the concreteness of this example helps to overcome difficulties caused by the interdisciplinary nature of this field and its high level of abstraction, in particular for students coming from traditional disciplines. The puzzle will allow us discussing a simple algorithm for finding folded solutions, through which we will introduce the concept of the configuration space and the contact matrix representation. This is a central tool for comparing protein structures, for studying simple models of protein energetics, and even for a qualitative discussion of folding kinetics, through the concept of the Contact Order. It also allows a simple representation of misfolded conformations and their free energy. These concepts will motivate evolutionary questions, which we will address by simulating a structurally constrained model of protein evolution, again modelled on the snake puzzle. In this way, we can discuss the analogy between evolutionary concepts and statistical mechanics that facilitates the understanding of both concepts. The proposed examples and literature are accessible, and we provide supplementary material to reproduce the numerical experiments. We also suggest possible directions to expand the unit. We hope that this work will further stimulate the adoption of games in teaching practice. Abstract We propose here a working unit for teaching basic concepts of structural bioinformatics and evolution through the example of a wooden snake puzzle, strikingly similar to toy models widely used in the literature of protein folding. In our experience, developed at a Master course at the Universidad Aut´onoma de Madrid (Spain), the concreteness of this example helps to overcome difficulties caused by the interdisciplinary nature of this field and its high level of abstraction, in particular for students coming from traditional disciplines. The puzzle will allow us discussing a simple algorithm for finding folded solutions, through which we will introduce the concept of the configuration space and the contact matrix representation. This is a central tool for comparing protein structures, for studying simple models of protein energetics, and even for a qualitative discussion of folding kinetics, through the concept of the Contact Order. It also allows a simple representation of misfolded conformations and their free energy. These concepts will motivate evolutionary questions, which we will address by simulating a structurally constrained model of protein evolution, again modelled on the snake puzzle. In this way, we can discuss the analogy between evolutionary concepts and statistical mechanics that facilitates the understanding of both concepts. The proposed examples and literature are accessible, and we provide supplementary material to reproduce the numerical experiments. We also suggest possible directions to expand the unit. We hope that this work will further stimulate the adoption of games in teaching practice.

integrative environment. Together with the limited number of teaching hours, this situation constitutes 48 a serious bottleneck to learning. Hence, it is of great importance to find tools that help to bridge the 49 gap between different backgrounds and favour a learning convergence [5]. Games can help students to 50 assimilate abstract concepts and to address complex problems. There is a growing number of games 51 related with topics as diverse as protein folding [6], spin glasses [7], ecological networks [8], biological 52 data integration [9], geometry [10] or scientific induction [11], to name a few. 53 Here we present an illustrative example where we employed a wooden snake puzzle in a structural 54 bioinformatics course at the Universidad Autónoma de Madrid (Spain). This puzzle can be regarded as 55 a coarse grained (toy) model of a polymer structure and it is strikingly similar to simplified 56 mathematical models proposed in the protein folding literature (see, for instance, [12]). We propose 57 several exercises accessible to students with a graduate-level background in either biology or physics 58 and notions in programming. 59 Our goal consists in giving concreteness to the subjects presented in the course through a physical 60 object, and our experience in this sense is very positive. This example allows us to discuss the first 61 steps in the modeling process, i.e. the definition of the system and its epistemological and practical 62 consequences, a discussion that is often neglected. Relying on a physical object allow us to provide a 63 first intuitive contact with the different subjects that we treat, ranging from computational techniques, 64 such as protein structure alignment algorithms, to theoretical concepts, such as protein folding and we will refer to the former as 27-mer model and as 64-mer to the latter. The exercises proposed in the 92 following sections are based on the 64-mer puzzle, which is more complex while remaining 93 computationally tractable. Our proposal is inspired by the similarity between this puzzle and the 94 lattice models of protein structures studied in the literature (the relevant references will be presented 95 all throughout the text). 96 The number of possible polymer conformations on the cubic lattice is huge. Taking into account the 97 self-avoiding condition that two units cannot occupy the same cell, each new unit can be placed in at 98 most 5 different cells. Each time that the length increases the number of conformations is multiplied 99 by a roughly constant factor, leading to an exponential increase. Numerical computations show that 100 the number of self-avoiding walks on the cubic lattice scales as N γ µ N for large N , with µ ≈ 4.68 [15], 101 close to the maximum possible value, so that the number of conformations of a 64-mer is of the order 102 of ∼ 10 42 . The enormousness of these numbers offers the opportunity to introduce the well known 103 Levinthal's paradox [16] as a starting point for a course focused on protein folding. maximally compact conformations. In the case of the 64-mer, these are conformations that occupy the 106 4 × 4 × 4 cube. This number also increases exponentially, although with a smaller exponent whose 107 asymptotic value can be computed analytically, for instance it is µ = 5/e ≈ 1.84 on the cubic 108 lattice [17]. Thus the number is huge (of the order of 10 16 for the 64-mer ). An algorithm to generate 109 exhaustively all maximally compact conformations of the 27-mer model was proposed in Ref. [18]. 110 With respect to these numbers, the puzzle presents a huge reduction of the conformation space due 111 to the constraints on the direction that each step can take. Indeed, linearly arranged consecutive units 112 can be regarded as a single rigid fragment of size 2, or 3 (27-mer ) and 2, 3, or 4 (64-mer ). These rigid 113 blocks can be easily seen in Fig. 1c. The number of fragments is much smaller than the number of 114 units. In addition, two consecutive rigid fragments have only four self-avoiding conformations (two 115 consecutive fragments cannot extend in the same dimension). Consequently, due to these constraints 116 the 64-mer puzzle has access to only ∼ 10 12 conformations. Nevertheless, it is still remarkably difficult 117 to solve it without computational help. 118 This reduction in the order of magnitude of the number of conformations offers an opportunity to 119 discuss how the secondary structure of proteins limits the number of different folded structures, and 120 how secondary structure prediction helps predicting protein structure from sequence. Furthermore, 121 different fragments of the puzzle are sometimes assembled into higher order structures reminiscent of 122 structures found in real proteins. For instance, the 64-mer has two consecutive blocks of size 4 that 123 cannot be folded unless they are placed in parallel next to each other. This configuration is remarkably 124 similar to the naturally occurring beta sheet secondary structure. The constraints that this folding  It has to be noted that the time taken by our search algorithm grows exponentially with the 137 number of fragments, in accordance with a recent work that shows that the snake puzzle is an 138 NP-complete problem [19], i.e., loosely speaking, a problem that cannot be solved exactly by any 139 algorithm whose computation time grows only polynomially with system size. Thus, while it would be 140 impossible to apply this algorithm to much larger systems, modern computers are readily capable of 141 handling the calculations for the 64-mer. Stochastic algorithms can find solutions in shorter times at 142 the expense of an exhaustive exploration, and they are of interest in the context of protein folding.

143
To characterize the bottlenecks of the search algorithm, we monitor in which regions of the puzzle 144 the exhaustive search spends more time. In Fig. 2 we plot the histogram of the number of times a 145 fragment is visited during two exhaustive searches that start from the two ends of the chain. To 146 understand the relationship between the search and the constraints imposed by the fragments, we 147 depict in the histogram the position of the rigid fragments of size larger than two. We observe that 148 there is an intense search around the regions where many consecutive small fragments accumulate. On 149 the other hand, the search is drastically reduced when the algorithm finds large consecutive fragments 150 (see, for instance, the two rigid fragments of size 4 separated by a fragment of size 2) because there are 151 few possible conformations compatible with valid spatial coordinates. It is interesting to observe that 152 the number of steps needed to find the solutions varies depending on the end of the chain from which 153 the algorithm starts. This observation suggests that the puzzle may be solved faster if we start the 154 algorithm from regions with more restrictive constraints. The search starts either from the first (blue histogram) or from the last fragment of the chain (red) and explores all maximally compact conformations. Positions with blocks of length larger than two are depicted over the histograms.
Through the exhaustive search, we find eight solutions for the 64-mer puzzle that we show in Fig. 3. 156 These solutions will be labelled hereafter as S1, S2, . . . , S8. The students can be requested to visually 157 inspect the solutions through standard visualization software such as PyMOL [20] or VMD [21]. For a 158 review on software for protein structure visualization see [22]. For this exercise, it is necessary to 159 transform the format of the files with spatial coordinates. This gives us the opportunity to discuss the 160 different file formats with their flaws and advantages, and to stress the importance of format 161 standardization in bioinformatics [23].

162
The contact matrix 163 At this point, it is convenient to introduce a reduced representation of protein structures that arises 164 naturally in the context of lattice polymers, and that will play an important role in the following: the 165 contact matrix. This binary matrix has value C ij = 1 if two units i and j contact each other in space 166 in the polymer conformation, and C ij = 0 otherwise: Bonded units (j = i ± 1) are excluded from this count because they contact each other in all 168 conformations. In lattice polymers, the condition for contact is that two units are nearest-neighbours 169 in the lattice, i.e. d 0 is the lattice space. We provide the contact matrices of the eight solutions of the 170 snake puzzle in the Supplementary Material since they are needed to perform the exercises that we 171 describe in the following. Fig. 4 shows the four different types of location that a monomer can occupy 172 within the maximally compact structure and the number of contacts that it has in each case. The same 173 figure also shows an intermediate conformation of two similar solutions, S1 and S2, depicting only their 174 Although the solutions cannot be superimposed, they represent four pairs of mirror images, analogous to naturally occurring enantiomers of chemical substances. In the figure, each pair of related solutions are depicted with the same color. The figure was generated using coordinates files in PDB format, whose documentation can be obtained in http://www.wwpdb.org. The source code provided in the Supplementary Material outputs PDB files that can be explored using any standard protein structure visualization software. common structure and the first different fragment. The associated contact matrices are represented 175 below, highlighting the differences determined by the alternative positions of the last fragment.

176
Contact matrices are also used as a simplified representation of real protein structures. Two 177 residues i and j are considered in contact if their closest not-hydrogen atoms are closer than a 178 threshold, typically d 0 = 4.5Å (the main reason to exclude hydrogen atoms is that they are often not 179 reported in PDB files obtained with X-ray crystallography). Pairs are considered in contact only if 180 |i − j| > 2, since neighbours along the chain trivially fulfil the contact condition in all conformations.

181
The contact matrix provides a simple visual representation that makes evident secondary structure 182 elements (alpha-helices, appearing as lines of contacts (i, i + 3) and (i, i + 4) parallel to the diagonal; 183 parallel beta sheets, (i, i + l) and anti parallel beta sheets, i + l = const, appearing as lines 184 perpendicular to the diagonal). Importantly, the contact matrix is independent of the reference frame 185 used to represent the coordinates.

186
This point gives a good opportunity for a general discussion on the modelling process in biology 187 and its epistemological and practical implications. Protein molecules are extremely complex. need arbitrary choices (the definition of the distance between residues) and parameters (the threshold 197 distance). This is an unavoidable (and indeed desirable) feature of the modelling process. where a monomer can be found in a maximally compact structure and the contacts that it will has in each case. A given monomer may have from one to four contacts, with the exception of the first and last monomers, which can present an additional contact up to a maximum of five. (B) Conformations and contact matrices of two similar solutions S1 and S2. In the conformations, for clarity we show only the part that is common to both solutions and the first fragment that is different. The upper triangular matrix shows the contacts of S1 and S2, highlighting in red the contacts present in S1 that are absent in S2 (shown in the conformation S1 in red as well). The grey shaded area in the matrix corresponds to the monomers not shown in the above conformations. The distance in sequence between the contacts can be seen in the arcs representation depicted in the left of the matrix. The lower triangular matrix represents the number of shared contacts among the four non-redundant solutions.
Nevertheless, although the modelling choices may seem plausible, it is important that they are tested a 200 posteriori for their predicting ability, and that parameters are optimized by comparison with 201 experimental data, if it is possible. In the case of protein contact matrices, predictions are obtained 202 from statistical mechanical models that present a simple contact energy function: where C ij is the contact matrix and U ij is the contact interaction energy between residues i and j that 204 are in contact, which may be imagined as the result of averaging out all other degrees of freedom in a folding stability of proteins, such as for instance the Miazawa and Jernigan potential [24]. The second 213 type belongs to the category of structure-based energy functions, which are determined from each 214 experimentally determined protein native structure in such that the native structure has minimum free 215 energy and that the native state is minimally frustrated, i.e. all native interactions (and only them) are 216 stabilizing, and all pairs of atoms that interact in the native state minimize their interaction energy: where C nat is the native contact matrix.

218
Some structure-based models are constructed as an explicit function of the inter-residue distance Elastic Network Model) [26,27] or to predict the thermodynamics and kinetics of the folding 223 process [28] and, despite their simplicity, they provide accurate predictions that are favourably 224 compared to experimental observations, as it can be seen in reviews on these subjects [29,30].

225
Contact order 226 An important quantity that can be computed from the contact matrix is the Absolute Contact Order 227 (ACO), defined as: i.e. the average distance in sequence between pairs of residues spatially in contact. It has been 229 observed that the ACO is negatively correlated with the folding rate of the protein, in such a way that 230 proteins with larger ACO tend to fold slower [31]. compact representation shows the number of contacts per residue. We find the following ACO values: 233 18, 17, 15 and 14 for S1, S2, S3, and S7, respectively. The analogy with real proteins suggests that 234 structures S1 and S2 fold more slowly, as they present larger contact order. Conversely, the folding 235 dynamics of S7 is expected to be faster and that of S3 would be intermediate. We also note the 236 presence of common contacts in all four solutions, which are related with more constrained regions, e.g. 237 the contacts involving the two consecutive blocks of length 4 mentioned above. Here we assume that this is the case, and we will discuss alignment 245 algorithms below. We can measure the similarity between any two polymer structures S x and S y 246 through the Contact Overlap: algorithm guarantees that all solutions are different, in the sense that they cannot be superimposed 252 through rigid body rotations or translations. We can ask the students to explain this fact. The answer 253 is that structures with overlap equal to one are related through a mirror reflection -they correspond to 254 chemical enantiomers. It can be shown (see Material and Methods) that mirror images can be excluded 255 by the search algorithm through a suitable control of the axes. In the following, we will only consider 256 one of each pair of enantiomers (S1, S2, S3, and S7) to avoid redundancy. Fig. 4 compares the contact 257 matrices of the structures S1 and S7. A summary of the contacts found for all four non-redundant 258 solutions is also shown.
where r indicates the coordinates of atom i in structure x, |·| 2 is the Euclidean distance, R denotes a 263 rotation matrix that has to be optimized to find the optimal superimposition, and both r are translated in such a way that their centers of mass stay at the origin. The above formula for the 265 RMSD helps to avoid the confusion between alignment a(i) and superposition R, which is frequent 266 among students.

267
Instead of comparing interatomic distances between pairs of atoms as the Contact Overlap does, 268 the RMSD compares the distances of individual aligned atoms after optimal superimposition, which is 269 apparently simpler. However, this simplification is obtained at the price to determine the optimal 270 rotation matrix R that minimizes the RMSD. This minimization can be performed analytically After discussing the notion of spatial superposition, we can now discuss the concept of structure 284 alignment. This is easier adopting the Contact Overlap. In this context, the optimal pairwise 285 alignment a(i) can be defined as the alignment that maximizes the Contact Overlap Eq.(4). Since 286 protein structure is conserved through evolution, we expect that pairs of aligned residues associated 287 through the correspondence a(i) are likely to be evolutionarily related (homologous).

288
It is clear that the algorithm for finding the optimal structure alignment given a scoring scheme will 289 be more complex than the algorithm for finding the optimal sequence alignment that depends only on 290 the similarity between the individual amino acids at positions i and a(i). It can be shown that the 291 structure alignment based on the Contact Overlap is an NP-complete problem, which, loosely speaking, 292 means that no algorithms that runs in polynomial time can guarantee to find the optimal solution in 293 all cases, while for pairwise sequence alignments the optimal solution can be found in polynomial time 294 through dynamic programming. Nevertheless, good heuristic algorithms exist, such as the Monte Carlo 295 algorithm implemented in the structural alignment algorithm Dali [33], which is conceptually related to 296 the optimization of the Contact Overlap.

297
On the other hand, there is no structure alignment method based on the minimization of the RMSD. 298 In fact, when we superimpose two aligned proteins, the optimal superimposition is strongly affected by 299 pairs of residues i, a(i) with large distance, typically located in flexible regions. Consequently, the 300 optimal superimposition R may locate far apart atoms in the structural cores of the two proteins for 301 the sake of improving the alignment of outliers. Therefore, for distantly related proteins, it is 302 preferable to superimpose only the structural cores, constituted by pairs of residues whose interatomic 303 distance is smaller than a threshold. This introduces a trade-off between the alignment a, the 304 structural superimposition R and the core definition, that must be sorted out through the choice of the 305 threshold and some kind of heuristic, such as in the structure alignment algorithm Mammoth [34].

306
Instead of applying a fixed threshold to define the core, the program TMalign computes the average 307 distance between unrelated residues as a function of the protein length, and adopts this function in 308 such a way that only pairs that are closer than expected by chance give a positive contribution to the 309 structural similarity [35]. TM-align is one of the most reliable structure alignment programs based on

313
Protein structure comparison may also be a convenient point for introducing protein structure 315 evolution. We will base our analysis on protein superfamilies, groups of bona fide evolutionarily related 316 protein domains that are structurally similar and can be found in structural classification databases 317 such as SCOP [37] and CATH [38]. 318 Comparing proteins within a superfamily, it was found that protein structure and protein sequence 319 divergence are linearly related. This result has been established based on the RMSD of aligned and 320 superimposed protein cores [39] and later extended to other measures of structural change, allowing to 321 quantify the statement that protein structure is more conserved than protein sequence [40]. Poissonian distance between the corresponding sequences, we found that structure evolution is slower 327 than sequence evolution by a factor that ranges from 0.24 to 0.37 for different superfamilies [41]. 328 Importantly, proteins that conserve exactly the same molecular function appear to be limited in their 329 CD, which suggests that homology modelling can be rather successful in the case of function 330 conservation, while structure evolution is more irregular and the CD explodes for proteins that change 331 their molecular function.

332
The notion that protein structures may be largely different within the same superfamily lead to 333 challenge the concept that protein structure space is organized in discrete regions called "folds" that 334 represent well-defined equivalence classes of protein structures, an idea that underlies most structural 335 classifications of proteins [37, 38,42]. More recently, it is gaining force an alternative view that sees 336 protein structure space as constituted by discrete folds only at very high similarity, while at low 337 similarity structure space is continuous and should be described as a network rather than as a 338 tree. [43][44][45][46].

339
Sequence-structure relationship and protein designability 340 Key to protein evolution is the relationship between the protein sequence, which evolves through time, 341 and protein structure and function. The very simple models presented in this unit can constitute a 342 suitable introduction for such a subject. 343 We can define the sequence structure relationship within the contact energy model of Eq.(2) where a 344 contact matrix C ij represents the mesoscopic state made of all structures with contact matrix C ij , and 345 Eq.(2) represents its effective free energy, with all other degrees of freedom averaged out. In principle 346 the effective energy depends on the temperature and the state of the solvent, but these dependencies 347 will be kept implicit. We assume that the contact interaction energies depend on the protein sequence 348 as U ij = U (A i , A j ), where A i denotes the amino acid type at position i. To keep things simplest, we 349 consider the HP model that groups the amino acid in two types, hydrophobic (H) and polar (P). More 350 realistic models consider the twenty natural amino acid types and contact interaction energies with 210 351 parameters. This simple HP model is sufficient to reproduce many qualitative features of protein 352 folding. Indeed, hydrophobicity is considered as the main force that stabilizes folded proteins. We   Figure 6. Designability of selected structures for an increasing number of random sequences. We observe significant differences between the structures that tends to a well defined limit when the number of sequences increases. matrices of compact self avoiding walks on the cubic lattice or the contact matrices of real protein 364 structures. Here, for the sake of simplicity, we will limit our computations to the solutions of the snake 365 puzzle. This is analogous to the toy model of protein folding based on the maximally compact 366 structures on the cubic lattice introduced by Shakhnovich and coworkers in the 90's [18].

367
As suggested by several works, we identify the native state of a protein sequence with the ground 368 state of its effective energy (more realistic models also impose conditions on the stability with respect 369 to alternative compact conformations, as we will see in the following). This allows us to define the 370 designability of a given structure as the number of sequences for which this structure is the ground 371 state.

372
More designable structures are expected to be more resistant to sequence mutations. Mutational 373 robustness has been proposed to be important to reduce the deleterious effects of erroneous protein 374 translation in the ribosomes [48] and to mitigate the mutation load determined by unviable offsprings 375 generated under strong mutation rate [49,50]. Furthermore, it has been shown that robustness favours 376 the evolution of new protein functions [51]. Similar considerations motivated the computational study 377 by Li et al. [47], who considered as feasible states the contact matrices of the maximally compact self 378 avoiding walks of 27 steps on the cubic lattice, whose number is ∼ 10 5 [18]) and evaluated their 379 designability in the space of the 2 27 sequences of 27 H and P. 380 Here we propose that the students reproduce this computation, adopting as feasible structures three 381 of the solutions of the snake puzzle: S1, S3 or S7. We discard the solution S2 because it is very similar 382 to S1 (see the Contact Overlap values in Table 1) and it should be considered part of its native valley 383 rather than an alternative conformation.  This exercise allows us to emphasize the importance to assess the statistical significance and 391 statistical errors in computational studies. The differences that we observe are relevant only if they are 392 statistically significant, i.e. if the probability that they arise by chance is lower than a predetermined 393 threshold (typically, 0.05). We propose two methods to verify that they are indeed significant. The 394 first one, more rigorous, is based on the binomial distribution: given any structure a, we can compute 395 the probability that the number of sequences assigned to a, S a over a total of S that were tested, is the 396 result of S tests with success probability 1/3, as if all structures have the same probability. If all these 397 probabilities are small, then the designabilities are significantly different. The second method, easier to 398 apply in general, consists in computing the standard error of the mean of the designability p a , 399 s a = p a (1 − p a )/S, and performing an unpaired Z-test with Z = (p a − p b )/ s 2 a + s 2 b for all pairs of 400 structures. With 0.05 as significance threshold, the binomial test finds the difference from equal 401 frequencies not significant for S ≤ 100 (except for S1) but significant for S ≥ 1000 for all structures.

402
The Z test yields the same result (in this case, S1 is not significant for S = 100).

403
Misfolding stability and energy gap 404 It has been noted in several works that the analogy between the polymer model and protein folding 405 only makes sense if the ground state structure, identified as the native state, is much more stable than 406 alternative structures, which collectively represent the misfolded ensemble.

407
One of the first parameters used to assess the stability of the putative native state has been the have a larger energy gap, and that the energy gap averaged over the sequences assigned to a given 416 structure clearly separates the maximally designable structures from the rest. In our toy model there is 417 one "native" and only two "misfolded" conformations. We denote by E(C i , A k ) the effective energy of 418 sequence A k in structure i, and we define the energy gap as the effective energy difference between the 419 "native" structure and the alternative conformation with lowest energy This quantity has then to be averaged over all sequences A k assigned to structure C i , which we denote 421 with an overline: The values of the average energy gap are shown in Fig. 7. We can see that, consistently with the 423 results reported by Li et al., the structure S7 has both the highest designability and the largest energy 424 gap, and that these differences are significant. We can then ask students whether this correlation 425 between designability and energy gap is surprising or it should be expected based on the definitions.

426
Another measure of the stability of the misfolded ensemble, more standard under the point of view 427 of statistical mechanics, is its free energy, which can be evaluated through some simplified statistical 428 mechanical model. 429 We assume that the protein can exist in three macroscopic states: the native state (the attraction 430 basin of the contact matrix with minimum effective energy, which can be described for instance 431 through the structure-based elastic network model mentioned above), the unfolded state (analogous to 432 the self-avoiding-walk model, with large conformation entropy and effective contact energy close to 433 zero), and the misfolded state (alternative compact conformations dissimilar enough from the native 434 structure). We describe each of these states (native, misfolded and unfolded) by a free energy which, at 435 constant pressure and volume, is given by the difference between the total energy E and the

445
We approximate the free energy of the native state as its effective contact energy, where C nat ij indicates the contact matrix of the native structure. Note 447 that we should also take into account the conformational entropy of the native state, derived from the 448 volume in phase space of the structures that share the native contact matrix, but this conformational 449 entropy is expected to be similar to that of misfolded contact matrices [53], so that it contributes little 450 to ∆G.

451
The free energy of the unfolded state is approximated as its conformational entropy, which is an empirical value that mainly depends on its number of chi angles.

459
The term that is most difficult to estimate, and that is often neglected in computational models where C ij denotes the average value of the contact between residues i and j in the misfolded 469 ensemble, and we adopt the approximation that all such average contacts are equal and the average 470 number of contacts of misfolded structures is equal to the number of contacts of the native structure, 471 i<j C ij = N C . In real polymers, C ij decreases with the sequence distance between the two 472 residues in contact, |i − j|, as predicted by polymer physics, and as it can be verified through a 473 statistical analysis of the PDB, but we neglect this dependence for simplicity.

474
Next, we combine the misfolded and unfolded states into a single non-native state containing both 475 states separated by an energy barrier, which we describe with the partition function 476 Z nonat = e −βG misf + e −βG unf . We define the folding free energy ∆G as the difference between the 477 native and non-native free energy states, where β = 1/k B T is the Boltzmann factor. Finally, we assume that the free energy of the misfolded 479 ensemble is much more negative than that of the unfolded ensemble, which will be neglected. Thus, we 480 estimate the folding free energy as We will adopt the above free energy for the model of protein evolution described in next section.

482
In the context of evolution, we call positive design the selective forces that make the first term of (and at a faster pace than the native energy, since the REM free energy has a term proportional to the 492 mean square energy of alternative conformations). Evolution has to finely balance these two selective 493 forces, and the balance depends on the mutation bias, i.e. on whether mutation favour hydrophobic or 494 polar amino acids [58]. We will illustrate these issues and their interplay with other evolutionary forces 495 such as mutation bias and population size with the computational exercise proposed in next section.

496
Structurally constrained sequence evolution

497
The main point we address here is that the thermodynamic properties of folding observed in natural 498 proteins are a consequence of the evolutionary process, consisting of mutation and selection. With the 499 aim to "bring molecules back into molecular evolution" [59], we will show how protein evolution is 500 jointly constrained by evolutionary parameters (mutation bias, population size, temperature) and the 501 requirement to fold into the target native structure.  Our evolutionary model considers a genetically homogeneous population, i.e. we assume that the 507 mutation rate is very small. It is important to distinguish between a mutation, which is a microscopic 508 event that affects individuals, and a mutation that becomes fixed in the population (often called 509 substitution, although this term would only be used for amino acid or nucleotide changes and not for 510 insertions and deletions), which is the macroscopic event that interests us here. Every time a mutation 511 occurs in a sequence A, it may either disappear or get fixed in the population with a probability P fix 512 that depends on its fitness relative to the wild type and on the population size N , where A ′ is the mutant sequence. It is often assumed that the fitness depends on the stability of the  (12) where ∆G(A) is modelled for instance as in Eq.(10) and β = 1/k B T is the inverse of temperature.

517
Although other properties -in particular the dynamics of the protein, its capacity to interact with 518 other proteins, its catalytic rate and so on-, are arguably more important than its stability, they are 519 more difficult to model, and the stability is a necessary prerequisite at least for the large number of 520 proteins that must be folded in order to function (i.e., excluding natively unfolded proteins).

521
If the mutation is disadvantageous, i.e. f (A ′ ) < f (A), corresponding to a lower stability, the 522 probability that it becomes fixed is exponentially small with population size, but it is non-zero if is of the order of 1/N . In other words, the smaller is the population, the more 524 tolerant it is to mutations that decrease the fitness. Analogously, a thermodynamic system is more 525 tolerant to changes that increase its energy the higher is its temperature.

526
This analogy can be made more meaningful if we consider the evolutionary trajectory as a random 527 walk in the space of the possible sequences [14]. More precisely, it is a Markov process, since the  This situation has a parallel in molecular evolution. In this case, our starting point is the transition 545 probability of the Markov process (in the limit of a homogeneous population), from which we can easily 546 compute the limit distribution in sequence space exploiting the detailed balance. Sella and Hirsch [14]  1. Mutation One of the positions of the sequence is randomly chosen and mutated from H to P or 564 vice versa. We consider a mutation bias, i.e. the rate of mutation from H to P and from P to H 565 may be different. This bias is parametrized with a parameter p that expresses the ratio between 566 the rate at which an amino acid of type H mutates to P and the rate at which P mutates to H 567 (this parameter suffices since the total mutation rate only affects the time scale of the problem). 568 We extract the mutant site in two steps. Firstly, we extract a random number to decide whether 569 the mutation is from H to P (probability P HP = pn H /(pn H + (1 − p)n P ), where n H is the number 570 of positions occupied by a H) or viceversa. Then, we extract with uniform probability one of the 571 n H sites if the mutation is from H to P, or one of the n P sites otherwise.   Starting from a random sequence, stability and fitness tend to increase on the average, although 581 with large fluctuations. We are interested in the stationary properties at large time, when the limit 582 distribution is reached and the memory of the starting random sequence is lost. We propose to use the 583 following method to numerically estimate the stationary value of the fitness through the following 584 method. We define the average fitness at time t, F (t) as We now assume that the average fitness F tends to its stationary value exponentially, This assumption allows us to derive the parameters in which we are interested, F ∞ (stationary fitness) 587 and τ (time scale) through the linear fit of y = F (t + 1) − F (0) versus x = F (t) − F (0): The slope of the fit gives τ and the intercept gives F ∞ . This computation allows us to discuss the 589 advantages of analytical methods such as linear fit with respect to other alternatives. An alternative 590 method for computing F ∞ consists in discarding the first t 0 steps of the simulation and using only the 591 last steps to compute the average. However, it is difficult to be sure that the transient that we discard 592 is large enough to guarantee convergence and not too large to reduce statistics. Moreover, we would 593 miss the interesting information on the time scale τ . The second possibility is to perform a nonlinear 594 fit of Eq. (14). However, non-linear fits require an heuristic optimization that does not guarantee 595 convergence to optimal parameters. The linear fit allows an analytic solution that is preferable, in 596 particular for complex fits involving many parameters.

597
Armed with these tools, we investigate the dependence of fitness (hence, stability) on the properties 598 of the evolving population: the physical temperature T at which evolution takes place, the population 599 size N and the mutation bias p. We will finally test the influence of the native structure.
600 Temperature Fig. 9 shows some typical simulations reaching a stationary value under the proposed 601 evolutionary model. Each step represents one fixed mutation. The first evolutionary simulation that 602 we propose has the objective to study the effect of temperature, which enters the definition of fitness 603 Eq.(12) through the factor β = 1/k B T . In Fig. 8 we compare results obtained with T = 1, T = 10 604 (β = 0.1) and T = 100 (β = 0.01). For the sake of comparison, we also show results for random 605 sequences. Simulations were made with population size N = 50. We will analyse the effect of 606 modifying population size in the next section.

607
When T is low (T 1) the fitness function Eq.(12) tends to a binary function of stability: f ≈ 1 if 608 ∆G < 0 and f ≈ 0 otherwise. This corresponds to a neutral fitness landscape in which all proteins that 609 are sufficiently stable are selectively equivalent and all proteins that are unstable are strongly rejected. 610 Even random sequences show a bimodal fitness distribution with peaks at f = 0 and f = 1, as 611 expected for an almost neutral fitness landscape, and fitness close to one can be achieved even by 612 random sequences. For evolved sequences the most likely value of ∆G is only slightly below zero, 613 where the fitness is close to one and the entropy in sequence space is large [62].

614
If T increases (β = 0.1), the fitness becomes a smooth function of stability and it is more difficult to 615 accept mutations that decrease stability. As a consequence, the mean value is significantly lower than 616 for β = 1 (−12.64 ± 0.04 versus −6.48 ± 0.04 for β = 1). For the same temperature, random sequences 617 have free energies that are on the average zero and, correspondingly, fitness distributed around f = 0.5, 618 which is the value attained when ∆G = 0. Under this point of view, the inverse of the physical 619 temperature 1/T can be regarded as an evolutionary temperature, in that, the larger is 1/T , the more 620 tolerant is the evolution with respect to mutations that decrease protein stability.

621
In the high temperature limit (β = 0.01) all sequences have the same fitness f ≈ 0.5 independent of 622 ∆G and evolution becomes an almost random process. Indeed, it is difficult to obtain higher values for 623 evolving sequences and the mean value of the free energy for evolved sequences, although still negative, 624 approaches zero -with a mean value of −3.24 ± 0.5.  Figure 8. Comparison of the distributions of fitness (above) and free energy (below) for randomly drawn and evolved sequences. The evolutionary parameters are N = 10; p = 0.5. For T = 1 (right) we obtain a neutral landscape, and the variation of free energy after the evolutionary process is smaller than for T = 10 (middle). Increasing temperature until T = 100 (left) all sequences have almost the same fitness, and evolution becomes an almost random process.
that the closer F is to the neutral regime in which fitness is a step function of stability, the less the 627 evolutionary process depends on population size. In particular, as we have seen above for low This may be a good opportunity to discuss this concept, explaining that the effective population size is 638 not just the number of individuals in the population but it is a number that recapitulates its 639 demographic history, in particular bottlenecks. We plot in Fig. 10 Table 2 the equilibrium values of fitness and stability obtained for different population sizes keeping all 651 the other parameters fixed, within a non-neutral temperature regime (β = 0.5). We chose this value in 652 such a way that free energy values are clearly distinguishable between any pair of population sizes (see 653 Fig. 10). We can see that smaller populations attain significantly smaller equilibrium fitness, since the 654 fixation probability of disadvantageous mutations is larger for smaller population, in agreement with 655 the statistical mechanical analogy [14]. Moreover, larger populations also attain significantly larger 656 stability, which shows that the system is far from the neutral regime in which fitness is a stepwise 657 function of stability.

658
For these simulations we observe an exponential trend towards the stationary state and, therefore, 659 we can compute the asymptotic time τ through a linear fit, as explained above. Nevertheless, the 660 differences in time scales between the different populations, although systematic, are not significant. In 661 fact, τ depends on the initial value of the fitness, which is a variable that fluctuates strongly, so that a 662 very large number of independent simulations would be necessary to reduce the variability and improve 663 the significance.  Table 2. Average and standard deviation for the fitted parameters and free energy obtained in 10 independent evolutionary processes with balanced mutations (p = 0.5) and inverse temperature β = 0.5.
Mutation Bias Finally, we propose to explore the role of the mutation bias p (see Materials and 665 Methods). We perform these simulations for population size N = 10 and high temperature β = 0. values of the free energy ∆G display a significant change. As shown in Table 3, the stability against 670 unfolding i.e. the free energy of the native state) decreases when the mutation bias favours polar 671 sequences and the stability against misfolding (i.e. the free energy of misfolded conformations) has the 672 opposite behaviour. This data also suggest that the folding free energy resulting from both unfolding 673 and misfolding is minimum when the mutation bias is close to the mutation bias p = 0.5 at which polar 674 and hydrophobic mutations balance, at least for the chosen parameters (see the quadratic fit in Fig. 11, 675 which has illustrative purposes only).

676
The study of the mutation bias raises two important points from the perspectives of physics and properties such as the equilibrium fitness and the stability of proteins are influenced by the mutation 683 bias, as it was shown in Ref. [58].

684
This fact reflects an important difference between the equilibrium distribution in the phase space of 685 statistical physics, i.e. the Boltzmann distribution, and the equilibrium distribution in the space of 686 evolving protein sequences. While the Boltzmann distribution can be defined as the probability 687 distribution with maximum entropy given a constraint on the average energy, the equilibrium 688 distribution in sequence space can be defined as the distribution with minimum Kullback-Leibler 689 divergence from the distribution that would be attained by mutation alone, given the constraint on the 690 average fitness [64]. The correspondence between the two definitions can be appreciated noting that 691 the entropy is equal to the Kullback-Leibler divergence from the equiprobable distribution, i.e. in the 692 case of statistical physics the reference distribution is the equiprobable distribution, while in the case 693 of evolution the reference distribution is the distribution that would be attained by mutation alone.

694
From the point of view of evolution, the fact that the equilibrium fitness depends on the mutation 695 process creates the possibility of an interesting feedback between selection and mutation, which in turn 696 depends on the replicative machinery of the organism and is under genetic control. In other words, 697 mutation and selection should not be considered as completely independent processes, but there is the 698 possibility that a population selects the mutation process that is more convenient to it under its 699 ecological and evolutionary circumstances [58].  Table 3. Averages and standard deviations for the fitted parameters and free energy obtained in 10 independent evolutionary simulations. The parameters used are N = 50 and β = 0.5. The target structure used is S7. Different mutation bias are represented.  Table 4. Averages and standard deviations for the fitted parameters and free energies obtained in 10 independent evolutionary simulations considering N = 50 and β = 0.5 using the structures S7 (high designability) and S1 (low designability) as target. See Methods for further details.
Structural effects Finally, we explore the influence of the target structure in the evolutionary 701 process by comparing the least designable structure S1 and the most designable structure, S7 (see  Figure 11. Dependence of the free energy on the mutation bias. A quadratic fit is shown for illustrative purposes only. Table 4). The structure S7 reaches higher free energy on the average, but with a much smaller value of 703 the time scale τ , which indicates that it reaches equilibrium faster, consistent with the fact that there 704 are more sequences for which this structure is lower in energy.

705
Sequence-structure relationship 706 We conclude our analysis by relating the statistical properties of the evolved sequences and their 707 corresponding structures. This is particularly simple in the HP model, in which each position along the 708 sequence can be characterized by a single parameter representing the frequency at which the position is 709 occupied by a hydrophobic residue in the given ensemble of sequences. For real proteins, each position 710 must be characterized by a 19 dimension vector of the frequency of the different amino acid types (the 711 20-th frequency is the result of the normalization condition), called profile in the bioinformatics 712 literature. We plot in Fig. 12 the hydrophobicity profiles of random sequences and sequences evolved 713 to fold into structure S7 for mutation bias p = 0.5 (see Materials and Methods for further details). The 714 profiles of random sequences only depend on the mutation bias but, by definition, they do not differ 715 significantly between one position and the other. In contrast, we observe marked differences between 716 positions for the profiles of evolved sequences. To rationalize these differences, we also report in Fig. 12 717 the number of contacts at each position of the target structure. The predictive correlation between this 718 structural profile and the sequence profile is readily apparent, and implies that in this model the property of real proteins that buried positions tend to be hydrophobic and surface positions tend to be 722 polar, and it shows that this tendency is sufficient to completely determine the hydrophobicity profile 723 in the simple case of the HP model.

724
It can be interesting to suggest further readings that show that the contact energy model Eq.(10) 725 together with a statistical mechanics approach allows us to analytically predict the observed 726 correlation between the number of contacts and the hydrophobicity profile [64,66]. This point can be 727 proposed as an exercise, and it can be an interesting way to introduce the Boltzmann distribution in 728 statistical physics and to deepen the analogy between statistical mechanics and evolution.

729
One consequence of the structural propensities discussed above is that the positions of evolved 730 sequences are more conserved than positions in the random ensemble. We can quantify this through 731 the sequence entropy of each position i, defined as S i = − a p i (a)ln (p i (a)) where a denotes one of the 732 2 amino acid types and p i (a) is the profile at position i. The sequence entropy is smaller for evolved 733 than for random sequences. This reduction of entropy is the consequence of the constraints that 734 enforce the stability of the native structure. We can see that the number of contacts is a reliable predictor of the frequency of hydrophobic residues of evolved sequences. This is also consistent with the common observation that deeply buried positions, which have a larger number of contacts, tend to be more hydrophobic.

736
In this paper we presented a teaching unit dedicated to the computational study of protein structures, 737 stability and evolution. This area of research is difficult to present to students, since it requires a 738 highly multidisciplinary background that includes statistical mechanics, evolution and computational 739 skills. Our experience teaching this subject to students at the Master of Biophysics of the Universidad 740 Autónoma of Madrid (Spain) has shown us that these concepts are easier to present using as a toy 741 model a real toy such as the snake puzzle. This allowed us making abstract concepts more concrete, 742 and opened several avenues towards more advanced subjects of evolutionary and structural 743 bioinformatics. We have selected exercises that, in our opinion, establish simple parallelisms with 744 interesting and simple enough publications that we suggest to the students for further reading. 745 We want to remark that these exercises are not intended to replace other educational tasks that 746 deal with real biological entities. Rather, we believe that the proposed approach aids students to 747 establish a fundamental theoretical framework through a computationally and conceptually tractable 748 model. Ultimately, these exercises are to provide a foundation upon which students can build 749 increasingly complex biophysical models and design strategies to tackle real biological problems.

750
In our teaching experience, we have realized that students that lack a background in physics face, in 751 general, the most difficult conceptual challenges. For these students, the evolutionary model that we 752 propose may constitute an intuitive introduction to statistical mechanics concepts, and the simulations 753 proposed constitute a practical introduction to scientific computation. In general, we find that students 754 quickly build an intuition on the problem. On the other hand, students with a background in physics 755 usually enjoy this new application of statistical mechanics, but they tend to have severe difficulties to 756 interpret the results from a biological point of view. In our experience, a fruitful way to take advantage 757 of these differences consists in forming working teams of students with different backgrounds.

758
In summary, the increasingly interdisciplinary setting of scientific research requires efforts to 759 overcome traditional boundaries between established academic disciplines. These efforts are yielding 760 promising results through the application of interesting alternatives [67]. A suitable scientific academic 761 curriculum should also be evaluated by assessing how students are getting on with their early scientific 762 academic curricula is of key importance, not only for computational biology [5]. It is our responsibility 764 to claim for these changes.

766
Algorithm to solve the snake puzzle 767 To solve the snake puzzle, we adopted a straightforward algorithm that builds the conformations of the 768 snake iteratively and implements an exhaustive search of all maximally compact conformations, i.e.

769
conformations that can be fitted into a cube of side 3 (for the 27-mer) or 4 (for the 64-mer). The 770 search is performed on a decision tree whose nodes correspond to spatial arrangements of the 771 consecutive rigid fragments. From each node, there are four possible directions where the next 772 fragment can be placed, since two consecutive fragments cannot be extended in the same direction. For 773 example if fragment i is placed along the x-axis, fragment i + 1 can be placed only in the +/ − y or 774 +/ − z directions. The first two rigid fragments are placed together at the root of the tree and define 775 the oriented x and y-axes. The third fragment can be placed in any direction perpendicular to y, i.e. 776 +x, −x, +z and −z. The last two options are related by mirror symmetry (see also Main Text), which 777 we can reduce allowing only the placement in the +z direction the first time that the z axis is visited.

778
At each step, the algorithm tests whether the new fragment occupies positions already occupied by 779 other fragments (self-avoidance condition) and whether the partial conformation extends outside the 780 boundaries of the cube. If any of these requirements is not fulfilled, the node is discarded and the 781 algorithm goes back to the parent node, moving in the remaining directions. Otherwise, we create the 782 node corresponding to the new accepted conformations and proceed forward. When all four directions 783 have been tested, the algorithm goes back to the parent node. Note that the tree can be built starting 784 with any fragment of the puzzle, in particular the initial and final fragments, but also some in between 785 (in this case, it has the choice to start moving forward and then backward or vice versa). 786 We represent the final solution as an ordered string containing, for every fragment, its direction 787 with sign, e.g. (+x, +y, −x, . . . , +z). This format can be converted to explicit coordinates that can be 788 input to the visualization software used in structural bioinformatics (see Main Text) to visually inspect 789 the solutions, and used to solve the puzzle manually.

791
To compute the designability of each of the non-redundant solutions (we exclude S2 since it is very 792 similar to S1, see Main Text), we randomly draw m sequences of the HP model (i.e., only two amino 793 acid types H and P are considered) with probability p = 0.5 that the amino acid is P. For each sequence 794 we compute the effective energies of the target structures using Eq. 2, with the same parameters used 795 by [47] (U (H, H) = −2.3, U (H, P ) = −1 and U (P, P ) = 0) and assign the sequence to the structure 796 with lowest energy. In this way, we compute the fraction of sequences assigned to each structure. The 797 experiment is replicated 100 times to evaluate the statistical error, with set sizes m = 10, 100, 1000 798 and 10000. The average and standard error of the mean are plotted as a function of m in Fig. 6.

799
Structurally constrained sequence evolution 800 We simulate protein sequence evolution with structural constraints using a Monte Carlo algorithm 801 illustrated in the Main Text. We extract the initial sequence A(0) of length L = 64 and compute its free 802 energy Eq.(10) and its fitness, Eq. (12). At each step t of the simulation we mutate a random position 803 of the sequence with bias p for polar replacement. We then compute the free energy and fitness of the 804 new sequence and obtain the probability of fixation P f ix , Eq. (11). We then extract a random number 805 0 ≤ r ≤ 1 and we accept the new sequence (i.e. A(t + 1) = A ′ ) if r < P f ix . Otherwise, the old sequence 806 is kept (A(t + 1) = A(t)). It is important to note that, when the old sequence is kept, its associated 807 evolutionary values are recorded one more time. Considering values associated to rejected mutations is 808 needed to ensure that the underlying distribution sampled within the stationary state is indeed a 809 Boltzmann distribution, from which we next compute the average of the fitness and free energies. 810 We perform simulations changing some key parameters: nine different temperature values 811 distributed between β = 0.0001 to β = 10 (see Fig. 10), three different values for the mutation bias 812 (p = 0.25, 0.5 0.75) that the mutation is from a hydrophobic to a polar amino acid, four different 813 population sizes (N = 10, 50, 250, 1000), and two different target structures (S1 and S7). The 814 combinations of parameters selected for the simulations are described in the main text. For each set of 815 parameters, 10 independent simulations were run until the stationary was reached.

816
From the fitness values recorded, the average fitness F is computed, and the stationary fitness F ∞ 817 and the evolutionary time scale τ obtained through the linear fit described in the main text, Eq. 15.

818
The average of the free energy ∆G and its standard error were computed considering a sufficiently  The sequence of the rigid elements of the snake cube puzzle, the contact matrices and the script used 824 to solve the puzzle can be found in the URL https://github.com/insectopalo/snake-puzzle 825