Mechanisms of Protein Search for Targets on DNA: Theoretical Insights

Protein-DNA interactions are critical for the successful functioning of all natural systems. The key role in these interactions is played by processes of protein search for specific sites on DNA. Although these processes have been studied for many years, their microscopic aspects have become clearer only recently. In this work, we present a review of the current theoretical understanding of the molecular mechanisms of the protein target search. A comprehensive discrete-state stochastic method to explain the dynamics of the protein search phenomena is introduced and explained. Our theoretical approach utilizes a first-passage analysis and takes into account the most relevant physical-chemical processes. It is able to describe many fascinating features of the protein search, including unusually high effective association rates, high selectivity and specificity, and robustness in the presence of crowders and sequence heterogeneity.


Introduction
The dynamic nature of the underlying processes is what distinguishes living systems from other systems [1,2]. Biological processes constantly involve time-dependent fluxes of energy and materials, which makes them deviate strongly from equilibrium as long as organisms are alive. This implies that the concepts of equilibrium thermodynamics have limited applications for biological systems, while the role of methods that study dynamical transformations is much more important [3]. In this review, we present our theoretical views on dynamic aspects of the protein-DNA interactions, which dominate in biological systems. Our approach is based on explicit calculations of dynamic properties via a first-passage probabilities analysis. The first-passage ideas have already been widely utilized in studies of various complex processes in Chemistry, Physics and Biology [4,5]. We employ these ideas in developing a discrete-state stochastic framework for analyzing the dynamics of protein search for specific targets on DNA.
It is known that the beginning of most biological processes is associated with specific protein molecules binding to specific target sequences on DNA because these events initiate the cascades of corresponding biochemical and biophysical processes [1][2][3]. For example, to activate or to repress a gene the corresponding transcription factor proteins must first bind to the promoter region of the gene [1,2]. This fundamental aspect of protein-DNA interactions has been studied extensively by various experimental and theoretical methods. A special attention was devoted to understanding

Simplest Discrete-State Stochastic Model of the Protein Target Search
Experiments clearly indicate that during the search the protein molecule alternates between freely diffusing in the solution around the DNA chain and non-specific associations to DNA, which also include scanning the DNA chain [10][11][12]. The process is completed when the protein molecule reaches the specific target sequence on DNA for the first time. Stimulated by these observations, we start with the simplest minimal model of the protein search, as presented in Figure 1. It is important to note that, in contrast to other theoretical approaches [10,11,15,32], this method is based on a discrete-state stochastic description of the system. This is a more realistic view of the early stages of protein-DNA interactions because of the intrinsically discrete nature of molecular interactions in these systems.
In this simple model, we consider a single protein molecule and a single DNA molecule with a single target site: see Figure 1. The DNA chain is viewed as having $L$ discrete binding sites, and one of them, at the position $m$, is considered to be the target for the protein molecule. Because the diffusion of the proteins in the bulk is usually fast, all solution states of the protein are combined into one state that we label as state 0 (Figure 1). It is assumed that from the bulk solution the protein molecule can bind with equal probability to any site on DNA, and the total association rate to DNA is equal to $k_{on}$, while the dissociation rate from DNA is $k_{off}$. The non-specifically bound proteins can diffuse without bias along the DNA contour in any direction with a rate $u$ (see Figure 1). Note that the actual diffusion coefficient of the protein molecule translocating on DNA has units of bp$^2$ s$^{-1}$, while the rate $u$ is given in units of s$^{-1}$ because it describes the rate of hopping to the neighboring sites. Since the search process ends as soon as the protein molecule arrives at the specific site for the first time, we introduce a function $F_n(t)$, which is defined as the probability density function of reaching the site $m$ (the target site) for the first time at time $t$ if at $t = 0$ the protein started in the state $n$ ($n = 0$ is the bulk solution, and $n = 1, \ldots, L$ are the protein-DNA bound states). This function is also known as a first-passage probability density function [4,5]. To compute these first-passage probabilities, we utilize backward master equations that describe the temporal evolution of these quantities [4,5,17],
$$\frac{dF_n(t)}{dt} = u\left[F_{n+1}(t) + F_{n-1}(t)\right] + k_{off}F_0(t) - (2u + k_{off})F_n(t), \tag{1}$$
for $2 \le n \le L - 1$, while at the boundaries ($n = 1$ or $n = L$) we have
$$\frac{dF_1(t)}{dt} = uF_2(t) + k_{off}F_0(t) - (u + k_{off})F_1(t), \tag{2}$$
and
$$\frac{dF_L(t)}{dt} = uF_{L-1}(t) + k_{off}F_0(t) - (u + k_{off})F_L(t). \tag{3}$$
For the state $n = 0$, the backward master equation is different,
$$\frac{dF_0(t)}{dt} = \frac{k_{on}}{L}\sum_{n=1}^{L}F_n(t) - k_{on}F_0(t). \tag{4}$$
Here we used the fact that the rate to bind to any site on DNA is $k_{on}/L$, so that the total association rate is equal to $k_{on}$.
In addition, the initial conditions require that $F_m(t) = \delta(t)$ and $F_{n \neq m}(t = 0) = 0$. This means that if the protein molecule starts at the target site $m$ the search is immediately accomplished. It is important to explain the physical meaning of the backward master equations because they differ from the classical forward master equations widely employed in Chemical Kinetics. All trajectories that start at the state $n$ and finish at the target site $m$ can be divided into several groups. For example, for $2 \le n \le L - 1$ all trajectories starting at $n$ can be divided into three groups: those passing in the next time step (1) via the state $n - 1$; (2) via the state $n + 1$; or (3) via the state 0. The fractions of those trajectories are given by $u/(2u + k_{off})$, $u/(2u + k_{off})$ and $k_{off}/(2u + k_{off})$, respectively. Equation (1) describes this partition of the trajectories in a time-dependent manner because the first-passage probability flux to the target is determined by these trajectories. Thus, the backward master equations reflect the temporal evolution of the first-passage probabilities.
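The same partition of trajectories also gives a linear system for the mean first-passage times $T_n = \int_0^\infty t F_n(t)\,dt$: for an internal site, $(2u + k_{off})T_n = 1 + u(T_{n+1} + T_{n-1}) + k_{off}T_0$, with $T_m = 0$ at the target and $k_{on}T_0 = 1 + (k_{on}/L)\sum_n T_n$ for the bulk state. The following is a minimal numerical sketch of that system for small $L$, not the authors' implementation; the function names and the plain Gaussian-elimination solver are choices made here for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def mean_first_passage_times(L, m, u, k_on, k_off):
    """Mean times T_n to reach target m; index 0 is the bulk, 1..L are DNA sites.

    Hypothetical helper: assumes the model of the text (uniform binding with
    total rate k_on, unbiased hopping with rate u, dissociation with k_off).
    """
    size = L + 1
    A = [[0.0] * size for _ in range(size)]
    b = [0.0] * size

    # Bulk state: k_on * T_0 - (k_on / L) * sum_n T_n = 1
    A[0][0] = k_on
    for n in range(1, L + 1):
        A[0][n] -= k_on / L
    b[0] = 1.0

    for n in range(1, L + 1):
        if n == m:
            A[n][n] = 1.0  # target site: T_m = 0
            continue
        # End sites have a single neighbor (reflecting boundaries).
        neighbors = [j for j in (n - 1, n + 1) if 1 <= j <= L]
        A[n][n] = u * len(neighbors) + k_off
        for j in neighbors:
            A[n][j] -= u
        A[n][0] -= k_off
        b[n] = 1.0

    return solve_linear(A, b)
```

The cost of the dense solver grows as $L^3$, so this sketch is only meant as an independent check of analytical results on short chains.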
The most convenient way to analyze the dynamics in the system is to use Laplace representations of the first-passage probability functions, $F_n(s) \equiv \int_0^\infty e^{-st}F_n(t)\,dt$. Then Equations (1)-(4) can be written as simpler algebraic expressions:
$$(s + 2u + k_{off})F_n(s) = u\left[F_{n+1}(s) + F_{n-1}(s)\right] + k_{off}F_0(s), \tag{5}$$
$$(s + u + k_{off})F_1(s) = uF_2(s) + k_{off}F_0(s), \tag{6}$$
$$(s + u + k_{off})F_L(s) = uF_{L-1}(s) + k_{off}F_0(s), \tag{7}$$
$$(s + k_{on})F_0(s) = \frac{k_{on}}{L}\sum_{n=1}^{L}F_n(s). \tag{8}$$
In addition, from the initial conditions we have $F_m(s) = 1$. These equations are solved assuming that the general form of the solution is $F_n(s) = Ay^n + B$, where the unknown coefficients $A$, $y$ and $B$ are determined from the initial and boundary conditions [17]. One could argue that the target site $m$ divides the DNA molecule into two homogeneous segments ($1 \le n \le m$ and $m \le n \le L$), which can be considered separately. It was shown [17] that this approach leads to explicit expressions for the first-passage probability functions. Specifically, one obtains
$$F_0(s) = \frac{k_{on}(s + k_{off})S_1(s)}{Ls(s + k_{on} + k_{off}) + k_{on}k_{off}S_1(s)}, \tag{9}$$
with an auxiliary function $S_1(s)$ defined as
$$S_1(s) = \frac{(1 + y)(1 - y^{2L})}{(1 - y)(1 + y^{2m-1})(1 + y^{1+2(L-m)})}, \tag{10}$$
and with the parameters $y$ and $B$ given by
$$y = \frac{s + 2u + k_{off} - \sqrt{(s + 2u + k_{off})^2 - 4u^2}}{2u}, \tag{11}$$
$$B = \frac{k_{off}F_0(s)}{s + k_{off}}. \tag{12}$$
Explicit expressions for the first-passage probabilities provide a full dynamic description of the protein search processes, and any relevant quantities can be easily computed. For example, the mean search time from the bulk solution, which is inversely proportional to the chemical association rate for the specific target site, can be found from [17],
$$T_0 \equiv -\left.\frac{\partial F_0(s)}{\partial s}\right|_{s=0} = \frac{k_{off}L + k_{on}\left[L - S_1(0)\right]}{k_{on}k_{off}S_1(0)}. \tag{13}$$
This result has a very clear physical meaning. Here the parameter $S_1(0)$ describes the average number of distinct sites that the protein molecule scans during each visit to DNA while searching for the single specific site. Then, on average, to find the target the protein must make $L/S_1(0)$ visits to DNA because during every association $S_1(0)$ DNA sites are checked. Before each of these visits the protein spends, on average, a time $1/k_{on}$ in the bulk solution. The protein also makes $L/S_1(0) - 1$ dissociations back into the solution, each after an average residence time of $1/k_{off}$ on DNA, during which the protein scans for the target by diffusing along the chain. The number of dissociation events is smaller by one than the number of association events because the last binding to DNA leads to finding the specific site.
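The mean search time can be evaluated directly once $S_1(0)$ is known. The sketch below assumes the closed forms that follow from the model with reflecting chain ends, namely $y = [2u + k_{off} - \sqrt{k_{off}(k_{off} + 4u)}]/(2u)$ at $s = 0$, $S_1(0) = (1+y)(1-y^{2L})/[(1-y)(1+y^{2m-1})(1+y^{1+2(L-m)})]$ and $T_0 = [k_{off}L + k_{on}(L - S_1(0))]/[k_{on}k_{off}S_1(0)]$. The function names are illustrative, and the code requires $k_{off} > 0$.

```python
import math


def y_at_zero(u, k_off):
    """Root y (0 < y < 1) of u*y**2 - (2u + k_off)*y + u = 0, i.e. y at s = 0."""
    return (2.0 * u + k_off - math.sqrt(k_off * (k_off + 4.0 * u))) / (2.0 * u)


def scanned_sites(L, m, u, k_off):
    """S_1(0): average number of distinct sites scanned per visit to DNA."""
    y = y_at_zero(u, k_off)
    numerator = (1.0 + y) * (1.0 - y ** (2 * L))
    denominator = (1.0 - y) * (1.0 + y ** (2 * m - 1)) * (1.0 + y ** (2 * (L - m) + 1))
    return numerator / denominator


def mean_search_time(L, m, u, k_on, k_off):
    """Mean time to find the target at site m, starting from the bulk solution."""
    S = scanned_sites(L, m, u, k_off)
    return (k_off * L + k_on * (L - S)) / (k_on * k_off * S)
```

For a chain that consists of the target alone ($L = m = 1$) the function returns $1/k_{on}$, the mean waiting time for the first binding event, as expected.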
The results of our calculations for the mean search times are presented in Figure 2. Our main finding here is that there are three dynamic search regimes depending on the values of the kinetic parameters. It is convenient to introduce here a scanning length $\lambda = \sqrt{u/k_{off}}$, which gives the average distance that the protein molecule travels on DNA during each search cycle. This quantity is related to the parameter $S_1(0)$, but it is not the same because the protein might visit the same sites several times. If the protein molecule has a strong affinity to bind non-specifically to the DNA molecule (small $k_{off}$, $\lambda > L$), then there will be only one searching cycle. After binding to DNA the protein will not dissociate until it finds the target. In this case, the mean search time scales as $\sim L^2$ because the DNA-bound protein performs a simple unbiased random walk. We call this dynamic phase a random-walk regime. Because of the redundancy of the random walk, the search in this regime should be generally slow: many sites are repeatedly visited. In the opposite limit of weak attractions between DNA and protein molecules (large $k_{off}$, $\lambda < 1$), the protein can bind to DNA but it cannot slide because it quickly dissociates back into the solution. The protein on average makes $L$ searching cycles ($T_0 \sim L$). This dynamic regime is called a jumping regime. The search in this regime is generally fast as long as the associations are also fast. The most interesting behavior is observed for the intermediate interactions, which we label as a sliding regime. Here the scanning length $\lambda$ is larger than one but smaller than the length of DNA $L$, and the number of searching cycles is also proportional to $L$. But in this regime the system can reach the optimal dynamic behavior with the smallest search times. This search facilitation is achieved because the fluxes to the target now come from both the bulk solution and the DNA chain.
This is one of the main mechanisms of the facilitated diffusion of proteins during the target search, but other processes like inter-segment transfer might also contribute significantly to the facilitated diffusion [27].
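The three regimes can be told apart by comparing the scanning length with the two other length scales, one site and the chain length $L$. The snippet below is a small sketch of that classification; it takes the scanning length as the diffusive estimate $\lambda = \sqrt{u/k_{off}}$ (an assumption of this sketch), and the function names and regime labels are illustrative.

```python
import math


def scanning_length(u, k_off):
    """Diffusive estimate of the distance (in sites) covered per visit to DNA."""
    return math.sqrt(u / k_off)


def search_regime(L, u, k_off):
    """Classify the search regime from the scanning length and the chain length."""
    lam = scanning_length(u, k_off)
    if lam < 1.0:
        return "jumping"      # 3D search: the protein dissociates before sliding
    if lam > L:
        return "random-walk"  # 1D search: a single visit covers the whole chain
    return "sliding"          # combined 3D + 1D search


# Example: the same chain and hopping rate with three dissociation rates.
for k_off in (1e-6, 10.0, 1e7):
    print(k_off, search_regime(10_000, 1e5, k_off))
```

The boundaries between regimes are crossovers rather than sharp transitions, so the thresholds used here should be read as order-of-magnitude guides.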

The Effect of Multiple Targets and Traps
The advantage of the discrete-state stochastic framework with the first-passage analysis presented above is that it can be extended and generalized to more realistic biological situations. This allows us to investigate important questions related to the mechanisms of the protein target search on DNA. Let us present several specific examples, although many more results have been obtained [17][18][19][20][21][22][23][24][25][26][27][28][29]. We start with the problem of how the presence of multiple target sites or multiple semi-specific trap sites affects the dynamics of the protein search.
It is known that in eukaryotic cells multiple target sites are available on the accessible DNA fragments [1][2][3]44]. The protein search is accomplished in these systems when the protein molecule finds any of the target sites for the first time. It has been argued that the mean search time in this system might not decrease proportionally to the number of targets, as one would naively expect from simple-minded applications of chemical kinetics [18]. This is due to the complex mechanism of the protein search that involves both 3D and 1D motions [18]. Applying our discrete-state stochastic framework to this problem, we consider a model with multiple targets at arbitrary locations as presented in Figure 3. To describe the search dynamics in this system, we again introduce the first-passage probability function $F_n(t)$ of finding any of the targets at time $t$ if the process started at $t = 0$ at the site $n$. The targets divide the DNA chain into several homogeneous segments, and this allows us to solve the corresponding backward master equations as explained in Section 2. This leads to the following explicit expression for the mean search time for any number of targets [18],
$$T_0 = \frac{k_{off}L + k_{on}\left[L - S_i(0)\right]}{k_{on}k_{off}S_i(0)}, \tag{14}$$
with a function $S_i(0)$ describing the average number of distinct sites scanned by the protein on DNA with $i$ targets. This formula is a generalization of Equation (13), which is recovered when there is only one target ($i = 1$). Specific expressions for $S_i(0)$ for various numbers of randomly distributed targets have been obtained [18]. For example, for $i = 2$ with the targets at the sites $m_1 < m_2$ it was shown that
$$S_2(s) = \frac{1}{1 - y}\left[\frac{1 - y^{2m_1}}{1 + y^{2m_1 - 1}} + \frac{(1 + y)(1 - y^{m_2 - m_1})}{1 + y^{m_2 - m_1}} + \frac{y\left(1 - y^{2(L - m_2)}\right)}{1 + y^{1 + 2(L - m_2)}}\right], \tag{15}$$
where the parameter $y$ is given in Equation (11). To understand the effect of multiple targets on the protein search dynamics, we analyze the results of explicit calculations for mean search times as presented in Figure 4.
It is found that the presence of multiple targets does not affect the overall dynamic phase diagram as compared with the single-target case: three search regimes are again observed depending on the size of the scanning length, the target size and the size of the DNA segment. Generally, the search is faster in the multiple-target systems. However, surprisingly, increasing the number of specific sites might not always accelerate the search. To quantify this effect, we introduced an acceleration parameter, $a_n = T_0(1)/T_0(n)$, where $T_0(n)$ is the mean search time for the system with $n$ targets. This ratio gives a numerical value of how much faster the search is in the presence of $n$ targets in comparison with the single-target system. It is illustrated in Figure 5. One can see that there is a range of parameters where the search dynamics in the system with two targets can be slower than the dynamics in the system with one target. This happens in the effectively 1D search regime (random-walk dynamic phase) when the single target is located in the middle of the DNA chain, while the two targets are close to each other and located near one of the ends of the DNA segment. In this case, the protein molecule effectively sees the two targets as a single target site (with the size equal to two target sites) because they are so close to each other. But it is faster to find a target located in the middle of the chain than a target positioned near the ends [17]. This is the main reason why having multiple targets does not always lead to a decrease in the search times. Thus, our theoretical analysis predicts that the degree of acceleration due to the presence of multiple targets depends on the nature of the dynamic search phase and on the location of the specific sites with respect to each other and with respect to the middle point of DNA [18].

Figure 5. Ratio of the mean search times as a function of the normalized distance between the targets for single-target and two-target systems ($l$ is the distance between the targets, $L$ is the DNA length). The single target is in the middle of the chain. In the two-target system, one of the specific sites is fixed at the end and the position of the second one is varied. The parameters used in the calculations are: $u = k_{on} = 10^6$ s$^{-1}$; $k_{off} = 10^{-4}$ s$^{-1}$; and $L = 10{,}000$. Adapted with permission from Ref. [18].
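The acceleration parameter can be explored numerically. The following sketch evaluates $a_2 = T_0(1)/T_0(2)$ from the mean search time $T_0 = [k_{off}L + k_{on}(L - S)]/(k_{on}k_{off}S)$, with $S$ the number of scanned sites for one or two targets. The closed forms for $S$ used in the code were derived for this illustration from the model with reflecting chain ends; Ref. [18] should be consulted for the published expressions. All function names are hypothetical, and the parameter values in the usage note are not those of Figure 5.

```python
import math


def y_at_zero(u, k_off):
    """Root y (0 < y < 1) of u*y**2 - (2u + k_off)*y + u = 0 (requires k_off > 0)."""
    return (2.0 * u + k_off - math.sqrt(k_off * (k_off + 4.0 * u))) / (2.0 * u)


def scanned_one_target(L, m, y):
    """Scanned sites per visit for a single target at site m."""
    return ((1.0 + y) * (1.0 - y ** (2 * L))
            / ((1.0 - y) * (1.0 + y ** (2 * m - 1)) * (1.0 + y ** (2 * (L - m) + 1))))


def scanned_two_targets(L, m1, m2, y):
    """Scanned sites per visit for two targets at sites m1 < m2."""
    d = m2 - m1
    left = (1.0 - y ** (2 * m1)) / (1.0 + y ** (2 * m1 - 1))        # sites 1..m1
    middle = (1.0 + y) * (1.0 - y ** d) / (1.0 + y ** d)            # sites m1+1..m2
    right = y * (1.0 - y ** (2 * (L - m2))) / (1.0 + y ** (2 * (L - m2) + 1))
    return (left + middle + right) / (1.0 - y)


def mean_time(L, S, k_on, k_off):
    """Mean search time from the bulk for a given number of scanned sites S."""
    return (k_off * L + k_on * (L - S)) / (k_on * k_off * S)


def acceleration(L, m, m1, m2, u, k_on, k_off):
    """a_2: single target at m versus two targets at m1 < m2."""
    y = y_at_zero(u, k_off)
    t_one = mean_time(L, scanned_one_target(L, m, y), k_on, k_off)
    t_two = mean_time(L, scanned_two_targets(L, m1, m2, y), k_on, k_off)
    return t_one / t_two
```

With, for instance, `L = 100`, `u = k_on = 1e3` and `k_off = 1e-3` (random-walk regime), `acceleration(100, 50, 1, 2, ...)` is below one, while moving the second target to the middle of the chain, `acceleration(100, 50, 1, 50, ...)`, gives a value above one, in line with the qualitative behavior described above.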
Another important factor that might affect the protein search dynamics is the existence of so-called semi-specific sites, or decoys, on DNA. These sites have a chemical composition very similar to that of the specific targets, with differences in only one or a few nucleotides. The protein molecule can be trapped in these sites, and this should influence the search for real targets. To analyze this effect, we can extend the simplest model to include the possibility of traps, assuming that associations to these semi-specific sites are effectively irreversible [19]. This assumption is reasonable because the search times in many systems are relatively short and the experimental observations are also limited in time. Thus, the bindings to decoys can be viewed as effectively irreversible. But even if the bindings are reversible, the theoretical method can be extended to take this into account. The first-passage analysis can be applied for the case of irreversible associations, but we have to note that only a fraction of trajectories will reach the correct target site. Then the main quantity of our calculations, the first-passage probability function $F_n(t)$, is now a conditional probability for the protein molecules not captured by the trap to find the target site.
Let us consider a system consisting of a single target at the site $m_1$ and a single trap at the site $m_2$ on the DNA molecule with $L$ sites [19]. The scheme presented in Figure 3 is also a correct representation of this system, with the correction that instead of the second target there is a trap at the site $m_2$, and the successful search corresponds to the protein molecule finding the specific site $m_1$. Following our theoretical method, the corresponding backward master equations can be solved, and they yield the Laplace transform of the first-passage probability function to find the target if the protein starts from the bulk solution (the explicit expression is given in Ref. [19]), with the parameters $y$ and $S_2$ given in Equations (11) and (15), respectively. This allows us to evaluate all dynamic properties in the system and to test the effect of traps.
The probability to reach the target (i.e., the fraction of the successful trajectories) is now given by a so-called splitting probability function [4,5], which is obtained by evaluating the Laplace transform of the first-passage probability function at $s = 0$. The mean search time, which is the conditional mean first-passage time to reach the target, can be estimated by averaging over the successful trajectories, producing an expression of the form $T_0 = -\left.\frac{\partial F_0(s)}{\partial s}\right|_{s=0} \big/ F_0(s = 0)$, whose explicit right side consists of three terms [19]. Let us analyze this expression. On the left side, the division by the splitting probability emphasizes the fact that this is the conditional mean search time. It is also interesting to note that the first two terms on the right side of the equation are exactly the mean search time for the system with two targets and no traps (at the sites $m_1$ and $m_2$) as we discussed above [18], while the third term is a correction which accounts for the fact that the site at $m_2$ is actually the trap. The main reason for this is the observation that the sites $m_1$ and $m_2$ are special locations where all trajectories end in both systems, the one with two targets and the one with the target and the trap. For the two-target case the mean search times are averaged over all trajectories to both sites, while for the target and the trap system the mean search times are obtained only by considering the trajectories finishing at the target [19].
The results of calculations for the dynamic properties of the protein search in the presence of traps are presented in Figures 4 and 6. Again, three dynamic search phases are observed, but adding the trap generally facilitates the search dynamics, which is a counter-intuitive result: see Figure 4. However, this acceleration (in comparison with the single-target system) is always associated with lowering of the probability of reaching the specific target, as shown in Figure 6. This means that the protein molecules might reach the target faster in the presence of the traps, but the fraction of such events is decreasing. In addition, the search dynamics is sensitive to the nature of the dynamic phase. The strongest effect due to the presence of the trap is observed in the effective 1D random-walk regime (because it has only one searching cycle) where the locations of the target and the trap strongly influence the search. In other dynamic regimes, the effect is smaller.
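The trade-off between speed and yield can also be checked by direct stochastic simulation. The sketch below is a simple continuous-time Monte Carlo (Gillespie-type) simulation of the target-and-trap model, assuming irreversible capture at the trap, uniform binding from the bulk with total rate $k_{on}$, and reflecting chain ends. It is an illustration written for this text, not the procedure used to produce Figures 4 and 6, and the function names are hypothetical.

```python
import random


def simulate_search(L, target, trap, u, k_on, k_off, rng):
    """Run one trajectory from the bulk; return (reached_target, elapsed_time)."""
    t = 0.0
    state = 0  # 0 is the bulk solution, 1..L are DNA sites
    while True:
        if state == 0:
            t += rng.expovariate(k_on)   # waiting time before binding
            state = rng.randint(1, L)    # bind to a uniformly chosen site
        else:
            moves = [j for j in (state - 1, state + 1) if 1 <= j <= L]
            total_rate = u * len(moves) + k_off
            t += rng.expovariate(total_rate)
            if rng.random() < k_off / total_rate:
                state = 0                # dissociate back into the bulk
            else:
                state = rng.choice(moves)  # hop to a neighboring site
        if state == target:
            return True, t
        if state == trap:
            return False, t              # captured irreversibly


def estimate_search(L, target, trap, u, k_on, k_off, n_runs=2000, seed=0):
    """Estimate the splitting probability and the conditional mean search time."""
    rng = random.Random(seed)
    successful_times = []
    for _ in range(n_runs):
        reached, t = simulate_search(L, target, trap, u, k_on, k_off, rng)
        if reached:
            successful_times.append(t)
    probability = len(successful_times) / n_runs
    mean_time = (sum(successful_times) / len(successful_times)
                 if successful_times else float("nan"))
    return probability, mean_time
```

When the target and the trap sit at mirror-symmetric positions, for example at the two ends of the chain, the estimated splitting probability should be close to 1/2, which provides a quick sanity check of the simulation.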

Sequence Heterogeneity
Real DNA molecules are heterogeneous polymers consisting of several types of subunits. This means that the interactions between protein and DNA molecules depend on the DNA sequence at the location where they meet. It is reasonable to expect that this sequence dependence of the interaction strength should affect the protein search dynamics because the diffusion rate for the non-specifically bound proteins will be position-dependent [3,11,45]. This has also been shown experimentally [46]. Similarly, association and dissociation rates should also depend on the location of the protein molecule on DNA. In addition, recent theoretical investigations suggested that different DNA sequence symmetries might lead to additional effective interactions between protein and DNA molecules [47][48][49][50]. The discrete-state stochastic framework with the first-passage analysis is a convenient tool to investigate the effect of DNA sequence heterogeneity and symmetry on the protein search dynamics [20].
Our goal here is to clarify the molecular origin of how the sequence heterogeneity influences the protein target search. We assume here a simplified picture of DNA, in which each monomer can be one of two chemical species, A or B, as presented in Figure 7 [20]. When the protein is bound to the subunit A (B), it interacts with energy $\varepsilon_A$ ($\varepsilon_B$), and the difference between the interaction energies is given by a parameter $\varepsilon = \varepsilon_A - \varepsilon_B \ge 0$. This means that the protein is attracted more strongly to the B sites than to the A sites. The protein molecule can diffuse along DNA with a rate $u_A \equiv u$ or $u_B = ue^{-\varepsilon}$, where $\varepsilon$ is measured in $k_BT$ units. This reflects the assumption that if the protein interacts more strongly with the DNA at a given location then it will move out of this site more slowly. In addition, we assume that, independently of the chemical nature of the neighboring sites, sliding out of the sites A is characterized by the rate $u_A$, while the diffusion out of the sites B is given by $u_B$. From the bulk solution the protein might associate to any site A or B on DNA with the corresponding rates $k_{on}^A = k_{on}$ or $k_{on}^B = k_{on}e^{\theta\varepsilon}$. Note that for convenience the on-rates are defined here as the rates per unit site, in contrast to our definitions in the previous sections. Similarly, the dissociations from the DNA chain are described by the rates $k_{off}^A = k_{off}$ and $k_{off}^B = k_{off}e^{(\theta - 1)\varepsilon}$. Here, the parameter $0 \le \theta \le 1$ specifies how the protein-DNA interaction energy is distributed between the association and dissociation transitions [20]. The physical meaning of this parameter is that the protein molecule tends to bind faster to, and to dissociate more slowly from, the more strongly attracting sites B, as compared with the more weakly attracting A sites. The parameter $\theta$ accounts for these effects.
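The site-dependent rates can be collected in a small helper. The sketch below assumes the sign convention in which the more attractive B sites bind faster and release the protein more slowly, $k_{on}^B = k_{on}e^{\theta\varepsilon}$ and $k_{off}^B = k_{off}e^{(\theta-1)\varepsilon}$, so that the ratio of binding constants of B and A sites equals $e^{\varepsilon}$ for any $\theta$ (detailed balance). This convention is an assumption of the sketch, chosen to match the physical description in the text, and the function name is hypothetical.

```python
import math


def site_rates(kind, u, k_on, k_off, eps, theta):
    """Return hopping, association and dissociation rates for an A or B site.

    eps is the energy difference eps_A - eps_B >= 0 in units of k_B T, and
    0 <= theta <= 1 splits it between association and dissociation.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie between 0 and 1")
    if kind == "A":
        return {"u": u, "k_on": k_on, "k_off": k_off}
    if kind == "B":
        return {
            "u": u * math.exp(-eps),                      # slower escape by sliding
            "k_on": k_on * math.exp(theta * eps),         # faster binding
            "k_off": k_off * math.exp((theta - 1.0) * eps),  # slower dissociation
        }
    raise ValueError("kind must be 'A' or 'B'")
```

For $\theta = 0$ the whole energy difference enters the dissociation rate, while for $\theta = 1$ it enters the association rate only.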
To quantify the role of sequence heterogeneity, we consider the DNA molecule with a fixed chemical composition (the fractions of A and B monomers are the same), but with different arrangements of subunits. Two limiting cases are specifically analyzed. One of them views the DNA molecule as two homogeneous segments of only A and only B subunits separated by the target in the middle of the chain (Figure 7). Another one is the DNA chain with the alternating A and B sites. The block copolymer has two homogeneous sequence segments, while the alternating polymers are more heterogeneous. It is important to note that in both cases, the overall interaction between the protein and DNA is the same (because the overall chemical composition in both cases is identical), and thus our analysis probes only the effect of the heterogeneity and symmetry in the subunit positions, in contrast to other theoretical treatments [51].
Applying again the first-passage approach and solving the corresponding equations leads to explicit expressions for the mean search times for all situations shown in Figure 7 [20]. For example, for the block copolymer DNA sequences the mean search time is expressed through auxiliary functions defined separately for the segments $i$ = A and $i$ = B [20]. The expressions for the mean search time for alternating sequences are quite bulky and can also be found in Ref. [20]. The results of our calculations are presented in Figure 8, where the ratio of the mean search times for the block copolymer and alternating sequences is plotted. The analysis of this figure produces several interesting observations. First, we see that three dynamic search regimes are also found in this system, and the effect of sequence heterogeneity on the protein search dynamics depends on the nature of the dynamic phase. In the jumping regime, when the protein does not slide along the DNA contour ($\lambda < 1$), the symmetry of the sequence does not play any role. This is because in this case the process takes place only via associations and dissociations (3D search), and the structure of the DNA chain is not important. The situation is different for the intermediate sliding regime (3D + 1D search, $1 < \lambda < L$) where, in most cases, the search on alternating sequences is faster. This can be explained by noticing that the search time in this dynamic phase is proportional to $L/\lambda$, which gives the average number of cycles before the protein can find the target. In the block copolymer sequence, the protein mostly comes to the target from the B segment because of stronger interactions with these sites, i.e., it comes from one side of the DNA molecule. In the alternating sequences, the protein can reach the target from both sides of DNA, and this lowers the overall search time. It can be shown analytically that the scanning length on the alternating segment is larger than the scanning length for the B segment, i.e., $\lambda_{AB} > \lambda_B$ [20].
Then the search is faster for the alternating sequences because $L/\lambda_{AB} < L/\lambda_B$, i.e., the number of searching cycles is lower for the alternating sequences, which helps to find the target faster. Note that three different states are possible for the alternating systems, depending on the chemical composition of the sites surrounding the target; they are labeled as ATA, ATB and BTB. The only deviation from the picture described above is found for ATA sequences, which correspond to having two A sites around the target site, where for a small range of parameters the search is slower than in the block copolymer sequence. This effect can be explained by the fact that the protein does not stay at A sites for a long time and moves away quickly, effectively increasing the barrier to enter the target via DNA [20]. Thus, our theory predicts that the composition of the DNA flanking sites around the target sequences might affect the dynamics of reaching them. It is interesting to note that recent experiments are consistent with our theoretical predictions [52].
In the random-walk regime (1D search, $\lambda > L$), the effect of the sequence heterogeneity is even stronger. The protein molecule finds the specific binding site up to 2 times faster for the more heterogeneous alternating DNA sequences. To understand this behavior, we note that in this case the mean first-passage time to reach the target is a sum of residence times on the DNA sites, since the protein will not dissociate until the target is located, so that all trajectories to the target are one-dimensional. Because the target is in the middle of the chain, the mean time to reach the target for the block copolymer sequence can be approximated as $T_0 \simeq (L/4)\tau_B$, where $\tau_B$ is the average residence time on any site B. The protein prefers to start the search at any position on the B segment with equal probability, i.e., the distance to the target varies from 0 to $L/2$. Then, the average starting position of the protein is $L/4$ sites away from the target. For the alternating sequences, the average distance to the target is approximately the same ($L/4$), but the chemical composition of the intermediate sites on the path to the target is different, yielding $T_0 \simeq (L/8)\tau_A + (L/8)\tau_B$ ($\tau_A$ is the residence time on A sites). The protein spends much less time on A subunits, and this leads to a faster search for the alternating DNA sequences. For $\tau_A \ll \tau_B$, this also explains the factor of 2 in the search speed. In this case, the B subunits can be viewed as effective traps that slow down the search dynamics. Thus, our theoretical calculations make the surprising prediction that sequence heterogeneity almost always leads to a faster protein search for targets on DNA despite the fact that it lowers the effective protein-DNA binding affinity [47][48][49][50]. And the stronger the contribution of the 1D search modes, the more relevant the effect of sequence heterogeneity will be.
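The factor of 2 follows from a short estimate based on the two approximate search times quoted above. The last step below additionally assumes that the residence times scale inversely with the sliding rates, $\tau_i \propto 1/u_i$ with $u_B = u e^{-\varepsilon}$; this is an assumption made here for illustration.

```latex
\begin{align}
\frac{T_0^{\mathrm{block}}}{T_0^{\mathrm{alt}}}
  &\simeq \frac{(L/4)\,\tau_B}{(L/8)\,\tau_A + (L/8)\,\tau_B}
   = \frac{2\,\tau_B}{\tau_A + \tau_B}
   = \frac{2}{1 + \tau_A/\tau_B}, \\
\frac{\tau_A}{\tau_B} &= \frac{u_B}{u_A} = e^{-\varepsilon}
  \quad\Longrightarrow\quad
\frac{T_0^{\mathrm{block}}}{T_0^{\mathrm{alt}}} \simeq \frac{2}{1 + e^{-\varepsilon}} .
\end{align}
```

The ratio equals 1 for a homogeneous chain ($\varepsilon = 0$) and approaches 2 for $\varepsilon \gg 1$, that is, for $\tau_A \ll \tau_B$.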

The Effect of Crowding on DNA in the Protein Target Search
Living cells are typically crowded with a large number of molecules, and many of them are attached to the DNA chains [1,2]. This should prevent the fast protein search for targets on DNA, and earlier theoretical studies supported this prediction [53]. However, surprisingly, experiments show that crowding on DNA does not affect much the effectiveness of the protein target search [33,34], and this was also found in Molecular Dynamics (MD) simulations [54]. By applying the discrete-state stochastic approach, we were able to clarify the role of the crowding on DNA in the protein target search.
To analyze this problem, the model illustrated in Figure 9 is considered. There is a single DNA molecule with $L + 1$ binding sites, and one of them is the target (at the site $m$). On the DNA chain there is also a crowding particle that can diffuse with a rate $u_{ob}$, but it cannot leave DNA. A single protein molecule starts from the solution (state 0), and it can bind to any site on DNA that is not occupied by the crowder with a rate $k_{on}$ (rate per site). The bound protein molecule can diffuse with a rate $u$, and there is an exclusion interaction between the protein and the crowder. Finally, the protein molecule can dissociate from DNA to the bulk solution with a rate $k_{off}$: see Figure 9. The model with the mobile crowding particle on DNA was first investigated using Monte Carlo computer simulations, which show that there are three search regimes depending on the main length scales in the system. This is shown in Figure 10 for the mean search times to find the target as a function of the scanning length $\lambda$. We can understand the complex dynamics in this system using the following arguments. If the diffusion rate of the crowder is much smaller than the other rates ($u_{ob} \ll u$, $k_{on}$ and $k_{off}$), then the protein molecule will find the target before the crowding particle can move away from its original location. But we already explicitly solved the problem of the protein target search with static obstacles using the same discrete-state stochastic approach with the first-passage analysis [23]. Then the mean search time in the system with a movable crowder can be approximated as the average, over all possible static locations of the crowding particle, of the mean search time with the static obstacle located at a distance $l_{ob}$ from the target [21]. An auxiliary function $S_{ob}$ is given by [23]
$$S_{ob}(s) = \frac{y\left(y^{-m} - y^{m}\right)}{(1 - y)\left(y^{m} + y^{1+m}\right)},$$
with the parameter $y$ specified in Equation (11).
This simple approximate theory works quite well in the dynamic regimes where 3D pathways are important for the search ($\lambda < L$). However, these theoretical arguments fail in the random-walk regime where 1D dynamics dominate the search. These results are expected. The protein molecule that collides with the crowding particle on DNA in dynamic regimes with 3D pathways will have the opportunity to dissociate into the bulk solution and to avoid the blocking effect. But in the random-walk regime (1D search) there is no such opportunity, and the search times will definitely increase. Computer simulations also indicate that the search times in this regime depend on the diffusivity of the crowding particle. The search is faster for more mobile crowders: see Figure 10. The dynamics in the random-walk regime can be explained using the following arguments. The overall search time can be viewed as a sum of two terms, $T_0 + T_{bl}$, where $T_0$ is the search time in the random-walk regime without any crowders, and it is given in Equation (13). The second term is the average time it takes for the crowder to diffuse away and clear the path for the protein to reach the target without interference [21]. It was shown that this blocking time $T_{bl}$ depends on the location of the target and on the diffusion rate of the crowding particle $u_{ob}$ [21]. These simple theoretical arguments show excellent agreement with Monte Carlo computer simulations: see dashed lines in Figure 10. But more importantly, they provide a clear molecular picture of the role of the crowding on DNA in the protein target search. If the protein search is dominated by 1D pathways and the mobility of the crowder is low, the search dynamics will be significantly slowed down. But if the search involves mostly 3D pathways and the crowder is mobile, the mean search times will not be affected much. It seems that real biological systems operate in the 3D + 1D regime, and crowding particles diffuse with rates comparable to those of the searching proteins ($u \sim u_{ob}$) [3].
Then one might conclude that the effect of the crowders on DNA should be minimal. This fully agrees with experimental observations and with results from MD simulations [34,54].
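The qualitative picture above can be explored with a small kinetic Monte Carlo (Gillespie) sketch. This is a toy model written for illustration only, not the simulation code used for Figure 10: the function names, the single-crowder setup, the reflecting boundaries, the hard-core exclusion rule, and all numerical parameter values are assumptions. The protein binds to any free site with rate `k_on` per site, dissociates with rate `k_off`, and hops to each neighboring site with rate `u`; the crowder hops with rate `u_ob` and is allowed to sit on the target site, which blocks it temporarily.

```python
import random


def search_time(L, target, u, k_on, k_off, u_ob=None, rng=random):
    """Simulate one trajectory and return the first-passage time to the target.

    Assumptions of this toy model: sites are 0..L-1 with reflecting ends,
    the protein starts in the bulk solution, k_on > 0 and k_off > 0, and
    protein and crowder cannot occupy the same site. If u_ob is None there
    is no crowder; otherwise one crowder starts on a random non-target site.
    """
    protein = None  # None means "in the bulk solution"
    crowder = None
    if u_ob is not None:
        crowder = rng.choice([s for s in range(L) if s != target])
    t = 0.0
    while protein != target:
        events = []  # entries are (rate, kind, new_position)
        if protein is None:
            n_free = L - (1 if crowder is not None else 0)
            events.append((k_on * n_free, "bind", None))
        else:
            events.append((k_off, "unbind", None))
            for nxt in (protein - 1, protein + 1):
                if 0 <= nxt < L and nxt != crowder:
                    events.append((u, "protein", nxt))
        if crowder is not None and u_ob > 0:
            for nxt in (crowder - 1, crowder + 1):
                if 0 <= nxt < L and nxt != protein:
                    events.append((u_ob, "crowder", nxt))
        total = sum(rate for rate, _, _ in events)
        t += rng.expovariate(total)  # exponential waiting time
        pick = rng.random() * total  # choose an event proportionally to its rate
        for rate, kind, pos in events:
            pick -= rate
            if pick <= 0:
                break
        if kind == "bind":
            protein = rng.choice([s for s in range(L) if s != crowder])
        elif kind == "unbind":
            protein = None
        elif kind == "protein":
            protein = pos
        else:
            crowder = pos
    return t


def mean_search_time(n_runs, seed, **params):
    """Average first-passage time over n_runs independent trajectories."""
    rng = random.Random(seed)
    return sum(search_time(rng=rng, **params) for _ in range(n_runs)) / n_runs


if __name__ == "__main__":
    # Illustrative values only: sqrt(u / k_off) is about 22 > L = 20,
    # so 1D motion dominates once the protein is bound.
    base = dict(L=20, target=10, u=1.0, k_on=0.1, k_off=0.002)
    for label, u_ob in [("no crowder", None), ("slow crowder", 0.05), ("fast crowder", 1.0)]:
        print(label, mean_search_time(100, seed=1, u_ob=u_ob, **base))
```

Varying `k_off` (and hence the balance between 3D and 1D pathways) and `u_ob` in this sketch gives a rough way to examine how the mean search time responds to the dynamic regime and to the crowder mobility. The numbers it produces depend on the assumed parameters and should not be compared quantitatively with Figure 10.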
It is also important to note that the first-passage probabilities method is a useful tool to explain and analyze other features of the protein search such as the effect of conformations [24,25], the surface-enhanced search [26], and the role of DNA topological structures [23].

Conclusions and Future Directions
Although protein search for targets on DNA is a very complex phenomenon that involves multiple biochemical and biophysical processes, significant advances in our understanding of the underlying molecular mechanisms have been achieved in recent years. Much of this success is due to analysis of these systems using the discrete-state stochastic framework, supplemented by explicit calculations via the first-passage probabilities method. In this review, we presented and explained this theoretical approach by considering the protein target search in various systems. The main advantage of the approach is the ability to obtain analytical results that clarify the physics of the underlying processes. In addition, the method can be easily extended in many directions, as shown in this work, as well as to other cases that we did not discuss here, such as the role of conformational transitions [24], the effect of inter-segment transfer [27], and the influence of DNA loop formation during the protein target search [23]. Furthermore, calculations within this framework were successful in explaining experimental observations on homology search by RecA protein filaments [28], inter-segment protein transfer [27], and the dynamics of CRISPR genome interrogation [29].
Several important dynamic features of the protein search for targets on DNA have been identified from theoretical analysis. The dynamic phase diagram of the protein target search always shows three dynamic regimes, which are determined by the three relevant length scales in the system: the size of DNA, the average scanning length of the non-specifically bound proteins, and the size of the target sequence. Depending on the dynamic phase, the search is dominated by 3D motions (jumping regime), by 1D motions (random-walk regime), or by a combination of 3D and 1D motions (sliding regime). The analysis shows that the optimal search dynamics is achieved in the regime where the protein molecules explore both 1D and 3D pathways during the search. In this case, the protein can reach the target by sliding along the DNA chain or by binding directly from the solution. Theoretical calculations also indicate that the presence of several target sites influences the search dynamics differently depending on the locations of the targets on DNA and the distances between them. Surprising observations are found in the system with semi-specific sites, which are viewed as effective traps. The search dynamics can be faster in this case, but this comes at the price of lowering the yield of protein molecules reaching the target. We also investigated the effect of sequence heterogeneity and symmetry on the protein search dynamics. Our calculations indicate that the search is faster for more heterogeneous sequences, and that the chemical composition around the target is also an important factor in this process. Furthermore, our method allowed us to probe the effect of crowding on DNA in the protein target search. This effect depends on the dynamic phase and on the mobility of the crowding particles. The crowders influence the protein search more strongly when 1D pathways dominate and when the diffusivity of the crowding particle is small enough that the protein is frequently blocked during the process. Increasing the mobility of the crowders and/or increasing the contribution of 3D search pathways lowers the effect of the crowding. These theoretical arguments fully agree with experimental observations and MD computer simulations.
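The way the three length scales select a dynamic regime can be summarized in a few lines of code. The sketch below assumes that the scanning length is estimated as λ = sqrt(u/k_off), where u is the 1D hopping rate and k_off the dissociation rate, and that lengths are measured in units of one DNA site; the function names and the sharp thresholds are illustrative assumptions, since the actual transitions between regimes are smooth crossovers.

```python
import math


def scanning_length(u, k_off):
    """Estimated 1D scanning length, assuming lambda = sqrt(u / k_off)."""
    return math.sqrt(u / k_off)


def search_regime(L, u, k_off, target_size=1):
    """Classify the regime by comparing lambda with the target size and DNA length L."""
    lam = scanning_length(u, k_off)
    if lam < target_size:
        return "jumping"      # 3D motions dominate
    if lam < L:
        return "sliding"      # combination of 3D and 1D motions
    return "random-walk"      # 1D motions dominate


if __name__ == "__main__":
    # Hypothetical rates, chosen only to land in each regime for L = 1000 sites.
    for u, k_off in [(1.0, 100.0), (100.0, 1.0), (1.0e8, 1.0)]:
        print(u, k_off, search_regime(1000, u, k_off))
```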
Despite tremendous progress in the theoretical understanding of protein target search phenomena, many questions remain about the molecular mechanisms of these processes. It is still unclear what the nature of protein-DNA interactions is in the regions surrounding the target sequences. Does the effective potential created by these interactions drive the protein molecule to the target like a funnel, or is it completely random? How large are the flanking segments that affect the finding of the target? What is the role of DNA geometry and topology in the protein target search? There are proposals that DNA supercoiling and the formation of complex DNA topologies can strongly influence the dynamics of protein-DNA interactions [55]. This is especially important for proteins that have several binding sites for DNA, which can form DNA loops and other complex structures. Another interesting question is the role of various DNA and protein conformations in these processes. It is clear that further progress in understanding protein target search phenomena depends on combining theoretical, computational and experimental methods.