Stochastic Analysis of Minimal Automata Growth for Generalized Strings

Char, Ian G.; Lladser, Manuel E.

doi:10.1007/s11009-019-09706-8

Published: 14 March 2019

Volume 22, pages 329–347, (2020)

Methodology and Computing in Applied Probability

Abstract

Generalized strings describe various biological motifs that arise in molecular and computational biology. In this manuscript, we introduce an alternative but efficient algorithm to construct the minimal deterministic finite automaton (DFA) associated with any generalized string. We exploit this construction to characterize the typical growth of the minimal DFA (i.e., with the least number of states) associated with a random generalized string of increasing length. Even though the worst-case growth may be exponential, we characterize a point in the construction of the minimal DFA when it starts to grow linearly and conclude it has at most a polynomial number of states with asymptotically certain probability. We conjecture that this number is linear.

References

Aho AV, Corasick MJ (1975) Efficient string matching: an aid to bibliographic search. Commun ACM 18(6):333–340
AitMous O, Bassino F, Nicaud C (2012) An efficient linear pseudo-minimization algorithm for Aho-Corasick automata. In: Annual symposium on combinatorial pattern matching. Springer, pp 110–123
Apostolico A, Szpankowski W (1992) Self-alignments in words and their applications. J Algor 13(3):446–467
Aston JAD, Martin DEK (2005) Waiting time distributions of competing patterns in higher-order Markovian sequences. J Appl Prob 42(4):977–988
Bender EA, Kochman F (1993) The distribution of subword counts is usually Normal. Eur J Comb 14(4):265–275
Brookner E (1966) Recurrent events in a Markov chain. Inf Control 9(3):215–229
Char IG (2018) Algorithmic construction and stochastic analysis of optimal automata for generalized strings. University of Colorado, the United States, Master’s thesis
Chestnut SR, Lladser ME (2010) Occupancy distributions in Markov chains via Doeblin’s ergodicity coefficient. Discrete Mathematics and Theoretical Computer Science Proceedings. Vienna, pp 79–92
Cristianini N, Hahn MW (2007) Introduction to computational genomics: a case studies approach, 1st edn. Cambridge University Press
Erhardsson T (1999) Compound Poisson approximation for Markov chains using Stein’s method. Ann Prob 27:565–596
Flajolet P, Szpankowski W, Vallée B (2006) Hidden word statistics. J ACM 53(1):147–183
Flames N, Hobert O (2009) Gene regulatory logic of dopamine neuron differentiation. Nature 16:885–889
Fu JC, Chang YM (2002) On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. J Appl Prob 39(1):70–80
Fu JC, Koutras MV (1994) Distribution theory of runs: a Markov chain approach. J Amer Statist Assoc 89(427):1050–1058
Fu JC, Lou WYW (2003) Distribution theory of runs and patterns and its applications. A finite Markov chain imbedding approach. World Scientific Publishing Co. Inc
Gani J, Irle A (1999) On patterns in sequences of random events. Mh Math 127:295–309
Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and computation, 2nd edn. Addison–Wesley
Lladser ME (2007) Minimal Markov chain embeddings of pattern problems. In: Proceedings of the 2007 information theory and applications workshop. University of California, San Diego
Lladser ME (2008) Markovian embeddings of general random strings. In: 2008 Proceedings of the fifth workshop on analytic algorithmics and combinatorics. SIAM, San Francisco, pp 183–190
Lladser ME, Chestnut SR (2014) Approximation of sojourn-times via maximal couplings: motif frequency distributions. J Math Biol 69(1):147–182
Lladser ME, Betterton MD, Knight R (2008) Multiple pattern matching: a Markov chain approach. J Math Biol 56(1-2):51–92
Marschall T (2011) Construction of minimal deterministic finite automata from biological motifs. Theor Comput Sci 412(8):922–930
Marschall T, Herms I, Kaltenbach HM, Rahmann S (2012) Probabilistic arithmetic automata and their applications. IEEE/ACM Trans Comput Biol Bioinform 9(6):1737–50
Martin DEK (2018) Minimal auxiliary Markov chains through sequential elimination of states. Commun Statist Simul Comput 0(0):1–15
Mojica FJM, Díez-Villaseñor C, García-Martínez J, Almendros C (2009) Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155(3):733–740
Nicodème P, Salvy B, Flajolet P (2002) Motif statistics. Theor Comput Sci 287(2):593–617
Régnier M, Szpankowski W (1998) On pattern frequency occurrences in a Markovian sequence. Algorithmica 22(4):631–649
Reinert G, Schbath S (1998) Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains. J Comput Biol 5(2):223–253
Robin S, Rodolphe F, Schbath S (2005) DNA, words and models: statistics of exceptional words, 1st edn. Cambridge University Press
Robin S, Daudin JJ, Richard H, Sagot MF, Schbath S (2002) Occurrence probability of structured motifs in random sequences. J Comput Biol 9:761–73
Roquain E, Schbath S (2007) Improved compound Poisson approximation for the number of occurrences of any rare word family in a stationary Markov chain. Adv Appl Probab 39(1):128–140

Acknowledgements

We are thankful to two anonymous referees for their careful reading of this paper and valuable suggestions. We are also very thankful to Dr. Dougherty for partially funding this research through her NSF EXTREEMS training grant.

Author information

Authors and Affiliations

Department of Applied Mathematics, University of Colorado, Boulder, CO, 80309-0526, USA
Ian G. Char & Manuel E. Lladser

Corresponding author

Correspondence to Manuel E. Lladser.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work has been partially funded by the NSF EXTREEMS Grant No. DMS 1407340, and the NSF Graduate Research Fellowship Program under Grant No. DGE 1252522. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Appendix

Here we show that our random generalized string model, G = G[1],…,G[n] with G[1], G[2], G[3], … i.i.d. uniform non-empty subsets of {0, 1}, does not fit in the low correlation framework in AitMous et al. (2012).
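As a minimal illustrative sketch (not the authors' code), the random model just described can be simulated in Python. The names `NONEMPTY_SUBSETS`, `sample_generalized_string`, `words` and `size` are hypothetical, and `size` assumes that the "size" of G is the total length of its words, which is consistent with a maximum of n2ⁿ when every position equals {0, 1}.

```python
import itertools
import random

# The three non-empty subsets of {0, 1}; under the model each one is
# drawn with probability 1/3, independently at every position.
NONEMPTY_SUBSETS = (frozenset({0}), frozenset({1}), frozenset({0, 1}))


def sample_generalized_string(n, rng):
    """Draw G[1..n] i.i.d. uniform over the non-empty subsets of {0, 1}."""
    return [rng.choice(NONEMPTY_SUBSETS) for _ in range(n)]


def words(G):
    """Return every binary word matched by G, in lexicographic order."""
    return sorted(itertools.product(*(sorted(s) for s in G)))


def size(G):
    """Total length of all words of G (assumed meaning of 'size')."""
    return len(G) * len(words(G))
```

For instance, `words([frozenset({0}), frozenset({0, 1})])` lists the two words 00 and 01 as tuples, and a string of length n made only of {0, 1} has `size` equal to n2ⁿ. Enumerating words is exponential in the number of two-element positions, so this is only practical for short strings.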

Following the notation of AitMous et al. (2012), let \(\mathbb{P}_{N}\) with \(N = n2^{n}\) (the largest possible “size” of G) denote the probability mass function of G. Condition (1) in Definition 1 is then satisfied with C = 1.

Next, sort the words in G lexicographically so that \(u_{1}\) is its smallest word, \(u_{2}\) is the second smallest (when G contains at least two words), and so on. Condition (2) in the Definition requires that \(\mathbb{P}_{N}(u_{1}[1,\ell]=u_{2}[1,\ell])=O(\beta^{-\ell})\) for some \(\beta > 1\). Since the event \(G[1] = \cdots = G[\ell] = \{0\}\) and \(G[\ell+1] = \{0,1\}\) is contained in the event \(u_{1}[1,\ell]=u_{2}[1,\ell]\), and the probability of the former is \(3^{-(\ell+1)}\), we must have \(\beta \le 3\). On the other hand, condition (3) in the Definition requires that \(n\ge \frac{8\log(n2^{n})}{\log\beta}\) asymptotically. Because \(\log(n2^{n}) = n\log 2 + \log n\), this is possible only if \(\log\beta \ge 8\log 2\), that is, \(\beta \ge 256\). Conditions (2) and (3) in Definition 1 of AitMous et al. (2012) are therefore incompatible under our i.i.d. model.
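The two bounds in this argument can be checked numerically for small cases. The following Python sketch is an illustration under stated assumptions, not part of the paper: `event_probabilities` and `condition3_holds` are hypothetical helper names, subsets are encoded as sorted tuples, and the enumeration over all 3ⁿ strings is feasible only for small n (with n greater than ℓ).

```python
from fractions import Fraction
from itertools import product
import math

# Non-empty subsets of {0, 1}, encoded as sorted tuples (an encoding
# chosen for this sketch); each has probability 1/3 per position.
SUBSETS = ((0,), (1,), (0, 1))


def two_smallest_words(G):
    """Two lexicographically smallest words of G, or None if G has one word."""
    ws = sorted(product(*G))
    return (ws[0], ws[1]) if len(ws) >= 2 else None


def event_probabilities(n, ell):
    """Exact probabilities of the two events in the argument, for n > ell.

    A = {G[1] = ... = G[ell] = {0} and G[ell+1] = {0, 1}}
    B = {u1[1, ell] = u2[1, ell]}
    Computed by enumerating all 3**n equally likely generalized strings.
    """
    count_a = count_b = 0
    for G in product(SUBSETS, repeat=n):
        in_a = all(s == (0,) for s in G[:ell]) and G[ell] == (0, 1)
        pair = two_smallest_words(G)
        in_b = pair is not None and pair[0][:ell] == pair[1][:ell]
        assert not in_a or in_b  # A is contained in B
        count_a += in_a
        count_b += in_b
    total = 3 ** n
    return Fraction(count_a, total), Fraction(count_b, total)


def condition3_holds(n, beta):
    """Check n >= 8 * log(n * 2**n) / log(beta) for a single n and beta > 1."""
    return n * math.log(beta) >= 8 * (n * math.log(2) + math.log(n))
```

Calling `event_probabilities(4, 2)` returns the probability of the smaller event, which equals 3^{−3}, together with the probability of the event in condition (2), which is at least as large. Calling `condition3_holds(n, 3)` for growing n shows the inequality in condition (3) failing at the largest β that condition (2) allows.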

About this article

Cite this article

Char, I.G., Lladser, M.E. Stochastic Analysis of Minimal Automata Growth for Generalized Strings. Methodol Comput Appl Probab 22, 329–347 (2020). https://doi.org/10.1007/s11009-019-09706-8

Received: 28 August 2018
Revised: 27 February 2019
Accepted: 04 March 2019
Published: 14 March 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s11009-019-09706-8
