T HE R OSENBLUTH SAMPLING C ALCULATION OF H YDROPHOBIC -P OLAR M ODEL

Lattice proteins are models resembling real proteins. They comprise an energy function and a set of conditions specifying the interaction between elements occupying adjacent lattice sites. In this paper we present an approach examining the behavior of chains of a large number of molecules. We investigate this by solving a restricted random walk problem on a cubic lattice and square lattice. More specifically, we apply the hydrophobic-polar model to examine the spatial characteristics of protein folds using the Monte Carlo method. This technique is the so-called Rosenbluth sampling method for solving restricted random walk problems. Specifically, by solving such walks we obtain plausible folds. In addition, this method can be extended to solve the hydrophobic-polar model. In this paper, we describe this method as an algorithm that calculates the energy spectrum for the hydrophobic-polar model, and the related formula for estimating the number of folds. Moreover, we estimate the number of folds for each sequence using hydrophobic-polar model energy estimation. On test sequences the predicted protein folds were obtained with a mismatch of one unit according to the energy. We also observe that the estimated number of folds depends only on the length and not on the type of sequence. This promising strategy can be extended to quantify other proteins in nature.


INTRODUCTION
The search for a more efficient algorithm of protein folding in the hydrophobic-polar (HP) model is an important aspiration in many disciplines (Sali et al. (1994), Pande (2010)).Knowing how proteins fold can help elucidate their three-dimensional structure-function relationship, which is crucial to the understanding of enzymes and to the treatment of misfolded-protein diseases such as Alzheimer's, Huntington's, and Parkinson's disease.The numerical simulation focused on those proteins is particularly useful for drug design, as it allows to test different physical characteristics using models of various complexities.Indeed, if high-resolution chemical structure is used, leading to precise molecule representations, dynamical simulation showing atomic interactions can be reached.This might ultimately provide more effective and personalized drugs.
It has been shown that the HP protein folding model is NP-Hard (Berger & Leighton (1998)), which means it is difficult to solve efficiently for longer protein sequences.In order to overcome this obstacle, many heuristic algorithms have been proposed (Jiang et al. (2003), Yanev et al. (2017)).Besides heuristics mostly based on optimization, other approaches are based on the idea that cooperativity of folding occurs, as local conformational choices which constraints the optimization space in which solutions are searched.Those assumption-based methods include hydrophobic zipper method Dill et al. (1993), which assumes that once a hydrophobic contact is created it cannot be broken.And the core-directed chain growth method Beutler & Dill (1996) which constrains the optmization search within the space of solutions having a hydrophobic core with a square (in 2D) or a cube (in 3D).
In this context, there is theoretical and experimental evidence of the advantage of solving a restricted random walk problem (RRW) on cubic and square lattices.One of the earliest proposed numerical algorithms which apply the RRW paradigm is the one designed by M. Rosenbluth and A. Rosenbluth (Rosenbluth & Rosenbluth (1955)).In this report, we present a benchmark implementation of Rosenbluth methods for the HP model with an additional extension to estimate the number of possible sequence configurations.

HYDROPHOBIC-POLAR MODEL
In the hydrophobic-polar model, the set of twenty standard amino acids is reduced to two: H (hydrophobic amino acid) and P (hydrophilic amino acid).More formally, the model relies on embedding a given finite polypeptide sequence s = (s 1 , . . ., s i , . . ., s k ) where s i ∈ {H, S}, into a given infinite graph G.In this article, the graph G will primarily be the three-dimensional cubic lattice G = Z 3 and square cubic lattice G = Z 2 over integer numbers Z.A fold of length k for s in G is an injective mapping f : [1, . . ., k] → G such that adjacent integers map to adjacent points of G.
The set of all folds of length k is denoted as Z k .In addition, each point f (i) is assigned one letter from the polypeptide sequence s i .Such neighboring points form a bond.Each point of Z 3 has six neighbors (x ± 1, y ± 1, z ± 1).The energy of the fold of s is expressed as (1) ∆(p, q) = 1 if p and q are adjacent but do not connect amino acids, 0 otherwise with energy equation: The above equation for calculating the energy of fold s in G can also be expressed as negation of the number of H − H bonds in the fold, where a bond is a pair of symbols corresponding to adjacent points, except for those H's which are adjacent to pairs of sequences s.The goal of the HP model is to minimize the energy min Note that for a given number of adjacent points k in the fold, any configuration consisting of k adjacent points laid out joined in succession on a cubic lattice Z 3 is considered.The method proposed by Rosenbluth & Rosenbluth (1955) involves drawing successive steps of a random walk only from among acceptable points, which are points previously not visited.In this section, we describe the random procedure in more detail.We will focus on the 3D case, i.e G = Z 3 , but the method is easily transferable to the 2D case.
Regarded as a random walk problem, for any walk consisting of m adjacent points and ending at position (x, y, z) m , all six positions are a priori equally likely at iteration m.The excluded volume effect is simulated by the requirement that a fold is not allowed to cross itself or back up on itself at any iteration.Consequently, at any iteration, there are at most five possible positions to move to.For simplicity, we assume that the first link originates from (0, 0, 0).Any satisfactory set of m adjacent points start from the origin (x, y, z) m is associated with a weighting function w m of possible positions calculated at each step according to the procedure described below.At any iteration m where the most recent link terminates at (x, y, z) m and 5 potential positions (x ± 1, y ± 1, z ± 1) m must be considered, while position (x, y, z) m−1 is ruled out immediately.All five remaining potential positions at m + 1 may be associated with values (x, y, z) i for i = m − p where p is an odd number greater than 1.If the comparison reveals this to be the case, a modification of the weight W m must be made, obtaining W m+1 .
Below we present all possible cases at iteration m: 1.All six new position (x ± 1, y ± 1, z ± 1) m are occupied.The process is then terminated with weight W m = 0.
2. Only w m new positions are unoccupied, with 0 < w m ≤ 5. Then During this process, an embedding is generated.If the embedding is equal to the length of the sequence, we can calculate energy according to the presented formula.

ESTIMATION OF FOLD NUMBERS
In this section, we present a mathematical justification for the estimation of folds.
Let us assume in general terms that when constructing fold f i of length k with partial weights 0 < w i ≤ 5: It is not excluded that at a certain step we may have no further possibilities for continuation, i.e., w m+1 = 0. We then say that a non-extendable fold of length m has been formed.Let Y denote the set of all folds of length m = k and non-extendable folds of length m < k.Recall that set of all folds of length m = k is denoted as The probability of picking a random fold f i ∈ Y of length m is equal to: with a weight function for the specific fold One can interpret W (f i ) as the weight of fold f i .Let us now repeat the draw using the growth method n times.There are n random folds f 1 , f 2 , . . ., f n from set Y .
Let n s denote the number of drawn elements f i for which W (f i ) = s and the set of these elements Then, based on the large numbers law: Therefore, the above expression can be written as the average weight of the drawn folds.We note that: Finally, the expression for W can be written as: We introduce the following notation for fold estimators of length k: To validate this fold estimator we test sequences of different length and type and the results are reported in the following section.

ESTIMATION STANDARD ERROR FOR Ẑk
For estimation of standard error for Ẑk , we used the batch means method.We will briefly describe how this method was applied in this setup.
Let us divide the sequence of weights: of length n into j "blocks" of length l each (so n = jl): Let us denote by μb the mean calculated from the block The estimator for the variance is defined as: Then the standard error is the root of the estimator for the variance: This method significantly speeds up the variance calculation by the standard method of generating estimators and calculating the standard deviation.

RESULTS
The experiments were run on Google's Colab platform on Intel(R) Xeon(R) CPU @ 2.20GHz with 13GB RAM.We investigated 2 different datasets one 2D and one 3D.The algorithm code, written in Python, can be found at the following website: Wie (2022).The method was run for n = 10 5 of suitable configurations folds with a specific sequence s of length k.We used j = 10 3 blocks in the batch means method for estimation standard error.

BENCHMARK FOR 2D
Using the method proposed above, we calculate Ẑk with a statistical error.The algorithm was initially tested for several sequences in dimension 2 (for Z 2 from site LABORATORY (2011).(2011).We can conclude that our energy is minimally different.Having computed an estimation for all folds Z 2 for each sequence, we can conclude that the number of folds Ẑk does not depend on the tested sequence s.We can observe that the estimated number of folds depends only on length k and not on the form of s.

ESTIMATION FOR 3D
The experiments were performed using n = 10 5 and j = 10 3 blocks in batch means method.The code can be found in Wie (2022).Estimated energy is equal to 0 for all k.This is because it is difficult to fold for Z 3 so that the number of H − H bonds is minimised.The second experiment lasted 4 minutes.Results for this dataset are reported on Referring to experiment 1 we see that there is a significant increase in the number Ẑk of these folds for each sequence.For values of k = 24, 25 we observe particularly large differences.
The experiment itself shows how difficult it is to wrap sequences in 3 dimensions.Estimations for sequences 24 and 25 alone show that the number of folds is on the order of 10 16 and 10 17 , as shown in Table 1 and 2. However, the energy of the fold is still zero.Therefore, the 3D model is significantly more difficult than the 2D model.For the initial cases k = 1, 2, the results are obvious.If the length of the sequence is 1 (k = 1), then we have only 6 possible points.For k = 2 we have 36 possible point assignments, 6 of which are forbidden.We have also prepared special sequences for cases k ∈ {1, 2, 3, 4} that will always have zero energy.It is not possible to wrap a sequence in such a way that points with H − H labels touch each other.

DISCUSSION
Correctly predicting protein conformations based on the amino acid sequence is of pivotal importance for drug design and other relevant computational chemistry tasks.In this paper, we report our computational experiments, where we use HP sequences corresponding to published benchmarks LABORATORY (2011) with a 2D lattice in the HP model.Our model successfully estimates the number of folds for a particular sequence; regardless of the type of sequence but only on its length.For small sequences, the method accurately estimates the number of folds.Our experiments show that for sequences of size k = 24, 25 the 3D model becomes significantly more complex than the 2D model.It has been observed that adding one dimension significantly affects the solution base.In 2D, the energies are −6 and −7, respectively, while in 3D the energy is zero for both cases.There are too many degrees of freedom to draw consecutive points.Therefore, it is difficult to find a wrap that has non-zero energy even for shorter sequences.However, the Rosenbluth sampling method can be successfully used to estimate the number of all folds, especially those with energy 0. This can help design heuristic algorithms based on this hindsight.The estimation itself, according to our mathematical justification, increases in accuracy as we increase the number of iterative executions of the method.The described approach is effective for identifying and sampling configurations on a lattice geometry.This kind of representation can be useful in the context of ab initio protein structure prediction Rashid et al. (2016).Expansions as implementations on quantum devices have been proposed, but those have been limited to the 2D case so far Micheletti et al. (2021).Conversion of the proposed tool into quadratic unconstrained binary optimization (QUBO) Kochenberger et al. (2014) using 3D lattices on quantum devices will be investigated in future work.

Figure 1 :
Figure 1: In each figure we embed a given finite polypeptide sequence in the square lattice G = Z 2 .The energy in the presented diagrams can be easily deduced.Each red line indicates a bond.The number of these edges corresponds to energy Êmin .
Figure 2: In the graph above, we can observe that the fold did not wrap in such a way that the two H − Hs are next to each other.Therefore, energy is equal to 0. This is because we have significantly more degrees of freedom in 3D space.

Table 2 :
Table 2, with 2 examples of resulting predicted folding depicted in Figure 2. In the accompanying table we count the fold estimation values for dimension 3 for Z 3 .