Exact distribution of word occurrences in a random sequence of letters

S. Robin; J. J. Daudin

doi:10.1239/jap/1032374240

Published online by Cambridge University Press: 14 July 2016

S. Robin*, INA-PG, Paris
J. J. Daudin*, INA-PG, Paris
* Postal address: INA-PG – INRA, 16, rue Claude Bernard, 75005 Paris, France.

Abstract

The distribution of the distance between occurrences of a word in a random sequence of letters is of interest for genome sequence analysis. In this paper we give the exact probability distribution and cumulative distribution function of the distance between two successive occurrences of a given word, and of the distance between the nth and the (n+m)th occurrences, under three models for generating the letters: i.i.d. with the same probability for each letter, i.i.d. with different probabilities, and a Markov process. The generating function and the first two moments are also given. The point of studying distances rather than the counting process is that they give information not only about the frequency of a word but also about its longitudinal distribution along the sequence.
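
As an illustration of the quantity studied in the abstract, the following is a minimal sketch that computes the exact distribution of the distance between two successive occurrences of a word when letters are i.i.d. It is not the authors' derivation: it swaps in a generic pattern-matching automaton with dynamic programming, which is an assumption made here for illustration. The function names `build_automaton` and `distance_pmf` and the example word and probabilities are hypothetical.

```python
from fractions import Fraction


def build_automaton(word, alphabet):
    """Transition table of a pattern-matching automaton for `word`.

    State k means: the longest prefix of `word` that is a suffix of the
    text read so far has length k. State len(word) marks an occurrence.
    """
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for letter in alphabet:
            seen = word[:state] + letter
            k = min(m, len(seen))
            # Shrink k until the last k letters read match a prefix of word.
            while k > 0 and seen[-k:] != word[:k]:
                k -= 1
            delta[state][letter] = k
    return delta


def distance_pmf(word, probs, max_d):
    """Return [P(D = 1), ..., P(D = max_d)] for i.i.d. letters.

    D is the distance (in letters) from one occurrence of `word` to the
    next one; overlapping occurrences are counted. `probs` maps each
    letter to its probability (illustrative input, not from the paper).
    """
    m = len(word)
    delta = build_automaton(word, list(probs))
    # Start just after an occurrence: the automaton is in state m.
    current = {m: Fraction(1)}
    pmf = []
    for _ in range(max_d):
        nxt = {}
        hit = Fraction(0)
        for state, p in current.items():
            for letter, p_letter in probs.items():
                target = delta[state][letter]
                if target == m:
                    hit += p * p_letter  # next occurrence ends here
                else:
                    nxt[target] = nxt.get(target, Fraction(0)) + p * p_letter
        pmf.append(hit)
        current = nxt  # only paths with no new occurrence yet
    return pmf


if __name__ == "__main__":
    half = Fraction(1, 2)
    # Two-letter alphabet with equal probabilities, self-overlapping word.
    print(distance_pmf("AA", {"A": half, "C": half}, 3))
```

Using exact fractions keeps the probabilities free of rounding error. Letters with unequal probabilities are handled by changing `probs`; a Markov model would additionally require tracking the last letter, which this sketch does not do.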

Keywords

Distance between occurrences; genome sequence analysis; Markov chain; patterns; waiting time

MSC classification

Primary: 60C05: Combinatorial probability; 60E05: Distributions

Secondary: 60G50: Sums of independent random variables; random walks. 60J20: Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.). 60J10: Markov chains (discrete-time Markov processes on discrete state spaces).

Type: Research Papers
Information: Journal of Applied Probability, Volume 36, Issue 1, March 1999, pp. 179-193

DOI: https://doi.org/10.1239/jap/1032374240
Copyright: Copyright © Applied Probability Trust 1999
