Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem

Mihaela E. Sardiu, Gelio Alves, and Yi-Kuo Yu

Phys. Rev. E 72, 061917 – Published 23 December 2005

Abstract

Sequence alignment is one of the most important bioinformatics tools for modern molecular biology. The statistical characterization of gapped alignment scores has been a long-standing problem in sequence alignment research. Using a variant of the directed path in random media model, we investigate the score statistics of global sequence alignment taking into account, in particular, the compositional bias of the sequences compared. Such statistics are used to distinguish accidental similarity due to compositional similarity from biologically significant similarity. To accommodate the compositional bias, we introduce an extra parameter $p$ indicating the probability for positive matching scores to occur. When $p$ is small, a high scoring alignment obviously cannot come from compositional similarity. When $p$ is large, the highest scoring point within a global alignment tends to be close to the end of both sequences, in which case we say the system percolates. By applying finite-size scaling theory on percolating probability functions of various sizes (sequence lengths), the critical $p$ at infinite size is obtained. For alignment of length $t$, the fact that the score fluctuation grows as $χ t^{1/3}$ is confirmed upon investigating the scaling form of the alignment score. Using the Kolmogorov-Smirnov statistics test, we show that the random variable $χ$, if properly scaled, follows the Tracy-Widom distributions: Gaussian orthogonal ensemble for $p$ slightly larger than $p_{c}$ and Gaussian unitary ensemble for larger $p$. Although these results deepen our understanding of the distribution of alignment scores, the use of these results in practical applications remains somewhat heuristic and needs to be further developed. Nevertheless, the possibility of characterizing score statistics for modest system size (sequence lengths), via proper reparametrization of alignment scores, is illustrated.

Received 3 June 2005
Revised 21 October 2005

DOI:https://doi.org/10.1103/PhysRevE.72.061917

Authors & Affiliations

Mihaela E. Sardiu¹,², Gelio Alves¹,², and Yi-Kuo Yu²

¹Department of Physics, Florida Atlantic University, Boca Raton, Florida 33431, USA
²National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

Issue

Vol. 72, Iss. 6 — December 2005

Images

Figure 1
An example of global alignment between sequences $a$ and $b$.
Figure 2
The alignment lattice. Upon laying sequence $a$ along the horizontal axis and sequence $b$ along the vertical axis, we note that the directed path here uniquely represents the alignment shown in Fig. 1. The new coordinate system ($x = m - n$, $t = m + n$) is also shown to illustrate the connection between the recursion relation (5) used in sequence alignment and the corresponding one (7) used in DPRM.
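As a rough illustration of the kind of lattice recursion this caption refers to, the following sketch fills a global alignment score table with a standard Needleman-Wunsch-type update and maps lattice points to the $(x, t)$ coordinates. The function names, the match and mismatch scores, and the linear gap penalty are assumptions chosen for illustration; the paper's Eq. (5) and its scoring scheme may differ in detail.

```python
def global_alignment_score(a, b, match=1.0, mismatch=-1.0, gap=0.4):
    """Global alignment score on the (m, n) lattice (illustrative scoring only).

    match, mismatch, and gap are hypothetical values, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0.0] * cols for _ in range(rows)]
    # Boundary: aligning a prefix against nothing costs one gap per letter.
    for m in range(1, rows):
        S[m][0] = -gap * m
    for n in range(1, cols):
        S[0][n] = -gap * n
    for m in range(1, rows):
        for n in range(1, cols):
            pair = match if a[m - 1] == b[n - 1] else mismatch
            S[m][n] = max(
                S[m - 1][n - 1] + pair,  # diagonal step: pair a[m-1] with b[n-1]
                S[m - 1][n] - gap,       # step along a only: gap in b
                S[m][n - 1] - gap,       # step along b only: gap in a
            )
    return S[len(a)][len(b)]


def to_xt(m, n):
    """Map lattice point (m, n) to the rotated coordinates (x, t) = (m - n, m + n)."""
    return m - n, m + n


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(to_xt(4, 3))
```

In the rotated coordinates every step of the directed path increases $t$, which is why $t$ can play the role of a time-like path length.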
Figure 3
An example of a polynuclear growth event and its corresponding permutation. A region of height $h$ is labeled by $h$ in a circle. A nucleation point $i$, labeled by its projection on the $m$ axis, will have a different (permuted) label $π (i)$ on the $n$ axis. Therefore the sequence [1,2,3,4,5,6,7,8] is permuted into $[π (1), π (2), π (3), π (4), π (5), π (6), π (7), π (8)] = [5, 2, 8, 3, 7, 1, 4, 6]$ with LIS [2,3,4,6] of length 4, the maximum height of the PNG profile.
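The length of the longest increasing subsequence (LIS) quoted in this caption can be checked with a few lines of code. The sketch below uses patience sorting with binary search, which is one common way to compute the LIS length and is not necessarily what the authors used; the function name is hypothetical.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest possible tail of an increasing run of length k + 1
    for value in seq:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)   # value extends the longest run found so far
        else:
            tails[k] = value      # value gives a smaller tail for runs of length k + 1
    return len(tails)


if __name__ == "__main__":
    permutation = [5, 2, 8, 3, 7, 1, 4, 6]
    print(lis_length(permutation))  # 4, matching the LIS [2, 3, 4, 6] in the caption
```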
Figure 4
(Color online) The extrapolation method for obtaining $p_{c}$. In this example, the gap penalty is $δ = 0.4$ and the $p_{c}$ obtained is 0.107.
Figure 5
(Color online) The determination of exponent $ν$: using the $p_{c}$ value obtained from Fig. 4 [or equivalently Eq. (22)], one may then use Eq. (21) to obtain the exponent $ν$ from the inverse of the slope of $\ln [p_{c} (L) - p_{c}]$ vs $\ln (L)$.
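The slope-based estimate described in this caption can be sketched as a straight-line fit in log-log coordinates. The code assumes a finite-size relation of the form $p_{c}(L) - p_{c} = A L^{-1/ν}$, so that $ν$ is minus the inverse of the fitted slope; this form, the function name, and the amplitude and exponent used to generate the test data are all assumptions, and the data are synthetic rather than taken from the paper.

```python
import math


def fit_nu(sizes, pc_of_L, pc_inf):
    """Estimate nu from the slope of ln[p_c(L) - p_c] versus ln(L).

    Assumes p_c(L) - p_c = A * L**(-1/nu), which gives slope = -1/nu.
    """
    xs = [math.log(L) for L in sizes]
    ys = [math.log(p - pc_inf) for p in pc_of_L]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Ordinary least-squares slope of y against x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return -1.0 / slope


if __name__ == "__main__":
    pc_inf = 0.107               # value quoted in Fig. 4 for delta = 0.4
    nu_true, amp = 1.5, 1.0      # hypothetical values used only to build synthetic data
    sizes = [100, 200, 400, 800]
    pc_of_L = [pc_inf + amp * L ** (-1.0 / nu_true) for L in sizes]
    print(fit_nu(sizes, pc_of_L, pc_inf))  # recovers a value close to nu_true
```

With real data the fit is only as good as the estimate of $p_{c}$ itself, which is why the caption ties this step to the extrapolation in Fig. 4.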
Figure 6
(Color online) A typical data collapse for $⟨ S ⟩$ of various sizes based on the scaling function proposed. The positive abscissa indicates the quantity $L {(p - p_{c})}^{3 γ ∕ 2}$; the negative abscissa records $- L {(p_{c} - p)}^{3 γ ∕ 2}$. The gap penalty $δ$ is 0.4. By varying the exponent $γ$, we find the best $γ$ to be around 0.825.
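The signed abscissa described in this caption can be written as a single function of $L$, $p$, $p_{c}$, and $γ$. The sketch below is only a direct transcription of the two expressions in the caption; the function name is hypothetical, and the default values are the $p_{c} = 0.107$ and $γ = 0.825$ quoted for $δ = 0.4$.

```python
def collapse_abscissa(L, p, p_c=0.107, gamma=0.825):
    """Signed scaling variable: +L (p - p_c)^(3 gamma / 2) above p_c,
    -L (p_c - p)^(3 gamma / 2) below p_c."""
    distance = p - p_c
    magnitude = L * abs(distance) ** (1.5 * gamma)
    return magnitude if distance >= 0 else -magnitude


if __name__ == "__main__":
    for size in (300, 600, 1200):
        print(size, collapse_abscissa(size, 0.15), collapse_abscissa(size, 0.08))
```

Plotting $⟨ S ⟩$ for several sizes against this variable and adjusting $γ$ until the curves overlap is the usual way such a collapse is tuned.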
Figure 7
(Color online) The pdf’s of $χ$ and the cumulative deviations between the numerical and theoretical distributions for $p > p_{c}$ but $p$ still close to $p_{c}$. The relevant parameters are as follows: lattice size $L = 600$, $p = 0.0975$, and gap penalty $δ = 0.3$. With $F_{0}$ being the distribution assumed, (a) displays a histogram of $λ χ$ (with $λ = 3.4552$) and a theoretical curve of the $F_{0}$ distribution, while (d) displays the cumulative difference between the numerical distribution and $F_{0}$ with the largest absolute deviation being $1.05 \times 10^{-2}$; according to the KS statistics test, the likelihood for $F_{0}$ to be the correct distribution is $5.69 \times 10^{-10}$. With $F_{GOE}$ being the distribution assumed, (b) displays a histogram of $λ χ$ (with $λ = 2.5680$) and a theoretical curve of the $F_{GOE}$ distribution, while (e) displays the cumulative difference between the numerical distribution and $F_{GOE}$ with the largest absolute deviation being $6.85 \times 10^{-4}$; according to the KS statistics test, the likelihood for $F_{GOE}$ to be the correct distribution is 1.0. With $F_{GUE}$ being the distribution assumed, (c) displays a histogram of $λ χ$ (with $λ = 2.8914$) and a theoretical curve of the $F_{GUE}$ distribution, while (f) displays the cumulative difference between the numerical distribution and $F_{GUE}$ with the largest absolute deviation being $4.14 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GUE}$ to be the correct distribution is $6.50 \times 10^{-2}$. The pdf’s are obtained by normalizing the histogram properly. In (d)–(f), regions with theoretical pdf values larger than $10^{-3}$ are sandwiched by two vertical dashed lines.
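The two numbers reported per panel in this caption, the largest absolute deviation and the KS likelihood, can be sketched as follows. The code computes the one-sample Kolmogorov-Smirnov statistic against a supplied CDF and converts it to a significance level with the asymptotic Kolmogorov series and a commonly used small-sample correction factor. Whether the authors used exactly this formula is not stated here, so treat it as an assumption; the uniform CDF in the usage example is a stand-in, since reproducing the figure would require a Tracy-Widom CDF.

```python
import math


def ks_statistic(sample, cdf):
    """Largest absolute deviation between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d


def ks_pvalue(d, n):
    """Asymptotic KS significance, Q(lam) = 2 * sum_j (-1)^(j-1) exp(-2 j^2 lam^2)."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # Q differs from 1 by less than about 1e-12 in this range
    total = 0.0
    for j in range(1, 101):
        term = 2.0 * (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        total += term
        if abs(term) < 1e-16:
            break
    return min(1.0, max(0.0, total))


if __name__ == "__main__":
    def uniform_cdf(x):
        return min(1.0, max(0.0, x))

    sample = [(i + 0.5) / 100 for i in range(100)]  # stand-in data, not from the paper
    d = ks_statistic(sample, uniform_cdf)
    print(d, ks_pvalue(d, len(sample)))
```

A likelihood of 1.0, as quoted for the best-fitting ensemble, corresponds to a deviation so small that the series saturates at its upper bound.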
Figure 8
(Color online) The pdf’s of $χ$ and the cumulative deviations between the numerical and theoretical distributions for $p > p_{c}$ but $p$ still close to $p_{c}$. The relevant parameters are as follows: lattice size $L = 600$, $p = 0.2093$, and gap penalty $δ = 0.6$. With $F_{0}$ being the distribution assumed, (a) displays a histogram of $λ χ$ (with $λ = 2.8373$) and a theoretical curve of the $F_{0}$ distribution, while (d) displays the cumulative difference between the numerical distribution and $F_{0}$ with the largest absolute deviation being $1.14 \times 10^{-2}$; according to the KS statistics test, the likelihood for $F_{0}$ to be the correct distribution is $1.03 \times 10^{-11}$. With $F_{GOE}$ being the distribution assumed, (b) displays a histogram of $λ χ$ (with $λ = 2.1109$) and a theoretical curve of the $F_{GOE}$ distribution, while (e) displays the cumulative difference between the numerical distribution and $F_{GOE}$ with the largest absolute deviation being $6.47 \times 10^{-4}$; according to the KS statistics test, the likelihood for $F_{GOE}$ to be the correct distribution is 1.0. With $F_{GUE}$ being the distribution assumed, (c) displays a histogram of $λ χ$ (with $λ = 2.3798$) and a theoretical curve of the $F_{GUE}$ distribution, while (f) displays the cumulative difference between the numerical distribution and $F_{GUE}$ with the largest absolute deviation being $5.23 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GUE}$ to be the correct distribution is $8.37 \times 10^{-3}$. The pdf’s are obtained by normalizing the histogram properly. In (d)–(f), regions with theoretical pdf values larger than $10^{-3}$ are sandwiched by two vertical dashed lines.
Figure 9
(Color online) The pdf’s of $χ$ and the cumulative deviations between the numerical and theoretical distributions for $p > p_{c}$ with $p = 0.5$. Lattice size 600 and gap penalty $δ = 0.3$ are used. With $F_{0}$ being the distribution assumed, (a) displays a histogram of $λ χ$ (with $λ = 3.3867$) and a theoretical curve of the $F_{0}$ distribution, while (d) displays the cumulative difference between the numerical distribution and $F_{0}$ with the largest absolute deviation being $6.87 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{0}$ to be the correct distribution is $1.56 \times 10^{-4}$. With $F_{GOE}$ being the distribution assumed, (b) displays a histogram of $λ χ$ (with $λ = 2.5208$) and a theoretical curve of the $F_{GOE}$ distribution, while (e) displays the cumulative difference between the numerical distribution and $F_{GOE}$ with the largest absolute deviation being $4.40 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GOE}$ to be the correct distribution is $4.18 \times 10^{-2}$. With $F_{GUE}$ being the distribution assumed, (c) displays a histogram of $λ χ$ (with $λ = 2.8441$) and a theoretical curve of the $F_{GUE}$ distribution, while (f) displays the cumulative difference between the numerical distribution and $F_{GUE}$ with the largest absolute deviation being $9.32 \times 10^{-4}$; according to the KS statistics test, the likelihood for $F_{GUE}$ to be the correct distribution is 1.0. The pdf’s are obtained by normalizing the histogram properly. In (d)–(f), regions with theoretical pdf values larger than $10^{-3}$ are sandwiched by two vertical dashed lines.
Figure 10
(Color online) The pdf’s of $χ$ and the cumulative deviations between the numerical and theoretical distributions for $p > p_{c}$ with $p = 0.5$. Lattice size 600 and gap penalty $δ = 0.6$ are used. With $F_{0}$ being the distribution assumed, (a) displays a histogram of $λ χ$ (with $λ = 2.8373$) and a theoretical curve of the $F_{0}$ distribution, while (d) displays the cumulative difference between the numerical distribution and $F_{0}$ with the largest absolute deviation being $7.48 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{0}$ to be the correct distribution is $2.75 \times 10^{-5}$. With $F_{GOE}$ being the distribution assumed, (b) displays a histogram of $λ χ$ (with $λ = 2.1110$) and a theoretical curve of the $F_{GOE}$ distribution, while (e) displays the cumulative difference between the numerical distribution and $F_{GOE}$ with the largest absolute deviation being $4.26 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GOE}$ to be the correct distribution is $5.29 \times 10^{-2}$. With $F_{GUE}$ being the distribution assumed, (c) displays a histogram of $λ χ$ (with $λ = 2.3801$) and a theoretical curve of the $F_{GUE}$ distribution, while (f) displays the cumulative difference between the numerical distribution and $F_{GUE}$ with the largest absolute deviation being $1.22 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GUE}$ to be the correct distribution is $9.98 \times 10^{-1}$. The pdf’s are obtained by normalizing the histogram properly. In (d)–(f), regions with theoretical pdf values larger than $10^{-3}$ are sandwiched by two vertical dashed lines.
Figure 11
(Color online) The pdf’s of $χ$ and the cumulative deviations between the numerical and theoretical distributions for $p > p_{c}$ with $p = 0.8$. Lattice size 600 and gap penalty $δ = 0.3$ are used. With $F_{0}$ being the distribution assumed, (a) displays a histogram of $λ χ$ (with $λ = 3.9420$) and a theoretical curve of the $F_{0}$ distribution, while (d) displays the cumulative difference between the numerical distribution and $F_{0}$ with the largest absolute deviation being $6.84 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{0}$ to be the correct distribution is $1.72 \times 10^{-4}$. With $F_{GOE}$ being the distribution assumed, (b) displays a histogram of $λ χ$ (with $λ = 2.9347$) and a theoretical curve of the $F_{GOE}$ distribution, while (e) displays the cumulative difference between the numerical distribution and $F_{GOE}$ with the largest absolute deviation being $4.84 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GOE}$ to be the correct distribution is $1.83 \times 10^{-2}$. With $F_{GUE}$ being the distribution assumed, (c) displays a histogram of $λ χ$ (with $λ = 3.3119$) and a theoretical curve of the $F_{GUE}$ distribution, while (f) displays the cumulative difference between the numerical distribution and $F_{GUE}$ with the largest absolute deviation being $5.87 \times 10^{-4}$; according to the KS statistics test, the likelihood for $F_{GUE}$ to be the correct distribution is 1.0. The pdf’s are obtained by normalizing the histogram properly. In (d)–(f), regions with theoretical pdf values larger than $10^{-3}$ are sandwiched by two vertical dashed lines.
Figure 12
(Color online) The pdf’s of $χ$ and the cumulative deviations between the numerical and theoretical distributions for $p > p_{c}$ with $p = 0.8$. Lattice size 600 and gap penalty $δ = 0.6$ are used. With $F_{0}$ being the distribution assumed, (a) displays a histogram of $λ χ$ (with $λ = 3.2516$) and a theoretical curve of the $F_{0}$ distribution, while (d) displays the cumulative difference between the numerical distribution and $F_{0}$ with the largest absolute deviation being $7.07 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{0}$ to be the correct distribution is $9.09 \times 10^{-5}$. With $F_{GOE}$ being the distribution assumed, (b) displays a histogram of $λ χ$ (with $λ = 2.4750$) and a theoretical curve of the $F_{GOE}$ distribution, while (e) displays the cumulative difference between the numerical distribution and $F_{GOE}$ with the largest absolute deviation being $4.36 \times 10^{-3}$; according to the KS statistics test, the likelihood for $F_{GOE}$ to be the correct distribution is $4.42 \times 10^{-2}$. With $F_{GUE}$ being the distribution assumed, (c) displays a histogram of $λ χ$ (with $λ = 2.7923$) and a theoretical curve of the $F_{GUE}$ distribution, while (f) displays the cumulative difference between the numerical distribution and $F_{GUE}$ with the largest absolute deviation being $8.70 \times 10^{-4}$; according to the KS statistics test, the likelihood for $F_{GUE}$ to be the correct distribution is 1.0. The pdf’s are obtained by normalizing the histogram properly. In (d)–(f), regions with theoretical pdf values larger than $10^{-3}$ are sandwiched by two vertical dashed lines.
Figure 13
(Color online) The gradual degradation (improvement) of the agreement between the pdf of $χ$ and $F_{GOE}$ $(F_{GUE})$ for $δ = 0.3$. System sizes of $L = 1000$ (solid line), 1600 (dashed line), 2560 (dot-dashed line), and 4096 (long-dashed line) are studied with $[p - p_{c} (L)] ∕ p_{c} (L) \approx 0.1$. Part (a) displays how the amplitude of cumulative difference between the numerical pdf and $F_{GOE}$ gradually increases with size; part (b) displays how this amplitude decreases with size for $F_{GUE}$. In part (a), the gradual degradation leads to a decrease of the likelihood value (from 100.0% to 87.3%); in part (b), the gradual improvement leads to an increase of the likelihood value (from 2.7% to 34.0%).
Figure 14
(Color online) The gradual degradation (improvement) of the agreement between the pdf of $χ$ and $F_{GOE}$ $(F_{GUE})$ for $δ = 0.6$. System sizes of $L = 1000$ (solid line), 1600 (dashed line), 2560 (dot-dashed line), and 4096 (long-dashed line) are studied with $[p - p_{c} (L)] ∕ p_{c} (L) \approx 0.1$. Part (a) displays how the amplitude of cumulative difference between the numerical pdf and $F_{GOE}$ gradually increases with size; part (b) displays how this amplitude decreases with size for $F_{GUE}$. In part (a), the gradual degradation leads to a decrease of the likelihood value (from 100.0% to 92.0%); in part (b), the gradual improvement leads to an increase of the likelihood value (from 7.80% to 27.0%).
Figure 15
(Color online) The score pdf for (a) $p < p_{c}$, (b) $p > p_{c}$ but close to $p_{c}$, and (c) $p$ much larger than $p_{c}$. The lattice size used here is $L = 600$ and the gap penalty used is $δ = 0.4$. The $p$ values are (a) 0.077, (b) 0.1678, and (c) 0.80. For $δ = 0.4$, the critical $p$ value at infinite size is $p_{c} = 0.107$ while the finite size $p_{c} (L = 600) = 0.12$. The effective path lengths $t_{eff}$ are 1168.2 for $p = 0.1678$ (b), and 1196.6 for $p = 0.8$ (c). The parameter $v$ takes the values 0.04688 and 0.26985, respectively, for (b) and (c).
Figure 16
(Color online) The score pdf for $p = 0.1204$ [greater than but very close to $p_{c} (L = 600) = 0.12$]. The lattice size used here is $L = 600$ and the gap penalty used is $δ = 0.4$. As one may see, the score pdf is well fitted by the Poisson distribution with parameters $A_{1} = 0.96$, $A_{2} = 0.5712$, and $μ = 11.46$.
Figure 17
(Color online) Numerical solution of the Painlevé II equation.
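The caption does not say how the Painlevé II equation was solved. One common approach, sketched below under that caveat, integrates $q'' = s q + 2 q^{3}$ backward from a moderately large $s$ where the Hastings-McLeod solution is well approximated by the Airy function, and accumulates $I(s) = \int_{s}^{\infty} (x - s) q(x)^{2} dx$ so that the Tracy-Widom GUE distribution follows as $F_{GUE}(s) = e^{-I(s)}$. The function name, the integration range, and the tolerances are assumptions, and the sketch relies on NumPy and SciPy.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.special import airy


def hastings_mcleod(s_start=5.0, s_end=-6.0, n_points=221):
    """Integrate Painleve II backward from s_start with q ~ Ai(s) as s -> +infinity.

    Returns ascending s, q(s), and F_GUE(s) = exp(-I(s)).
    """
    ai, aip, _, _ = airy(s_start)
    # State y = [q, q', I, I'] with I'' = q^2 and I' = -int_s^inf q^2 dx.
    # For q = Ai, int_s^inf Ai^2 dx = Ai'(s)^2 - s Ai(s)^2; I(s_start) is tiny
    # at s_start = 5 and is set to zero here (an approximation).
    y0 = [ai, aip, 0.0, -(aip ** 2 - s_start * ai ** 2)]

    def rhs(s, y):
        q, dq, _, d_int = y
        return [dq, s * q + 2.0 * q ** 3, d_int, q * q]

    grid = np.linspace(s_start, s_end, n_points)  # decreasing, matching the direction
    sol = solve_ivp(rhs, (s_start, s_end), y0, method="DOP853",
                    t_eval=grid, rtol=1e-10, atol=1e-14)
    s = sol.t[::-1]
    q = sol.y[0][::-1]
    f_gue = np.exp(-sol.y[2][::-1])
    return s, q, f_gue


if __name__ == "__main__":
    s, q, f_gue = hastings_mcleod()
    i0 = int(np.argmin(np.abs(s)))
    print(s[i0], q[i0], f_gue[i0])
```

Backward integration of this equation amplifies small errors as $s$ becomes more negative, so the range is kept modest here; extending it much further would call for a more careful method.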

Physical Review E

covering statistical, nonlinear, biological, and soft matter physics