Hostname: page-component-76fb5796d-zzh7m Total loading time: 0 Render date: 2024-04-26T08:57:22.951Z Has data issue: false hasContentIssue false

Exact distribution of word occurrences in a random sequence of letters

Published online by Cambridge University Press:  14 July 2016

S. Robin*
Affiliation:
INA-PG, Paris
J. J. Daudin*
Affiliation:
INA-PG, Paris
*
Postal address: INA-PG – INRA, 16, rue Claude Bernard, 75005 Paris, France.
Postal address: INA-PG – INRA, 16, rue Claude Bernard, 75005 Paris, France.

Abstract

The study of the distribution of the distance between words in a random sequence of letters is interesting in view of application in genome sequence analysis. In this paper we give the exact distribution probability and cumulative distribution function of the distances between two successive occurrences of a given word and between the nth and the (n+m)th occurrences under three models of generation of the letters: i.i.d. with the same probability for each letter, i.i.d. with different probabilities and Markov process. The generating function and the first two moments are also given. The point of studying the distances instead of the counting process is that we get some knowledge not only about the frequency of a word but also about its longitudinal distribution in the sequence.

Type
Research Papers
Copyright
Copyright © Applied Probability Trust 1999 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Aki, S., and Hirano, K. (1993). Discrete distributions related to succession events in a two state Markov chain. In Statistical Science and Data Analysis, ed. Matusita, K., Puri, M. L. and Ahayakawa, T. WSP International Science Publishers, Zeist, pp. 467474.CrossRefGoogle Scholar
Blom, G. (1982). On the mean number of random digits until a given sequence occurs. J. Appl. Prob. 19, 136143.Google Scholar
Blom, G., and Thorburn, D. (1982). How many random digits are required until given sequences are obtained? J. Appl. Prob. 19, 518531.Google Scholar
Chrysaphinou, O., and Papastavridis, S. (1990). The occurrence of sequence patterns in repeated dependent experiments. Theory Prob. Appl. 35, 145152.Google Scholar
Cowan, R. (1991). Expected frequencies of DNA patterns using Whittle's formula. J. Appl. Prob. 28, 886892.CrossRefGoogle Scholar
Feller, W. (1968). An Introduction to Probability Theory and its Applications. Vol I, 3rd edn. Wiley, New York.Google Scholar
Geske, M. X., Godbole, A. P., Schaffner, A. A., Skolnick, A. M., and Wallstrom, G. L. (1995). Compound Poisson approximation for word patterns under Markovian hypothesis. J. Appl. Prob. 32, 877892.Google Scholar
Godbole, A. P. (1991). Poisson approximations for runs and patterns of rare events. Adv. Appl. Prob. 23, 851865.Google Scholar
Henaut, A., Rouxel, T., Gleizes, A., Moszer, I., and Danchin, A. (1996). Uneven distribution of GATC motifs in the Escherichia coli chromosome, its plasmids and its phages. J. Mol. Biol. 257, 574585.Google Scholar
Johnson, N.L., Kotz, S., and Kemp, A.W. (1992). Univariate Discrete Distributions. Wiley, New York.Google Scholar
Karlin, S., and Macken, C. (1991). Some statistical problems in the assessment of inhomogeneities of DNA sequence data. J. Amer. Statist. Soc. 86, 2735.Google Scholar
Koutras, M. V. (1997). Waiting time distributions associated with runs of fixed length in two state Markov chains. Ann. Inst. Statist. Math. 49, 123139.Google Scholar
Mori, T. F. (1989). On the number of different patterns preceding a given one. Studia. Sci. Math. Hung. 24, 355364.Google Scholar
Pekoz, E. A. (1996). Stein's method for geometric approximation. J. Appl. Prob. 33, 707713.CrossRefGoogle Scholar
Philippou, A. N., Georghiou, C., and Philippou, G. N. (1983). A generalized geometric distribution and some of its properties. Statist. Prob. Lett. 1, 171175.Google Scholar
Prum, B., Rodolphe, F., and de Turkheim, E. (1995). Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J. R. Statist. Soc. B 57, 205220.Google Scholar
Rudander, J. (1996). On the first occurrence of a given pattern in a semi-Markov process. Uppsala Dissertations in Math. 2, Uppsala University, Sweden.Google Scholar
Schbath, S. (1995). Compound Poisson approximation of word counts in DNA sequences. ESAIM Prob. Statist., 1, 116.CrossRefGoogle Scholar
Schwager, S. (1983). Run probabilities in sequences of Markov dependent trials. J. Amer. Statist. Assoc., 78, 168175.Google Scholar
Uchida, M., and Aki, S. (1995). Sooner or later waiting time problems in a two-state Markov chain. Ann. Inst. Statist. Math. 47, 415433.Google Scholar