Learning the Language of Biological Sequences

Abstract

Applying Grammatical Inference to biological sequences is an appealing challenge. Some early successes have been recorded, such as the inference of profile Hidden Markov Models and stochastic Context-Free Grammars, which are now part of the classical Bioinformatics toolbox. The field nevertheless remains an open source of problems and inspiration for our research, with the opportunity to apply our ideas to real, fundamental applications. In this chapter, we survey the main specificities of biological sequences and how they are handled in Pattern/Motif Discovery, in order to introduce the important concepts and techniques used, and we present the latest successful Grammatical Inference approaches in that field.

Notes

  1. This corresponds to the preservation constraint from [104], which forbids merging together the states resulting from the merge of a diagonal, so that the identified conserved words are not damaged.

References

  1. Beadle, G.W., Beadle, M.: The language of life: an introduction to the science of genetics. American Institute of Biological Sciences (1966)

  2. Clancy, S., Brown, W.: Translation: DNA to mRNA to protein. Nature Education (2008)

  3. Chomsky, N.: Syntactic Structures. Mouton (1957)

  4. Searls, D.B.: The computational linguistics of biological sequences. In Hunter, L., ed.: Artificial Intelligence and Molecular Biology. AAAI Press (1993) 47–120

  5. Searls, D.B.: Linguistic approaches to biological sequences. Computer Applications in the Biosciences 13 (1997) 333–344

  6. Searls, D.B.: The language of genes. Nature 420 (2002) 211–217

  7. Chiang, D., Joshi, A.K., Searls, D.B.: Grammatical representations of macromolecular structure. Journal of Computational Biology 13 (2006) 1077–1100

  8. Searls, D.B.: A primer in macromolecular linguistics. Biopolymers 99 (2013) 203–17

  9. Joshi, A.K., Weir, D.J., Vijay-Shanker, K.: The convergence of mildly context-sensitive grammar formalisms. Technical Report MS-CIS-90-01, University of Pennsylvania (1990)

  10. Dong, S., Searls, D.B.: Gene structure prediction by linguistic methods. Genomics 23 (1994) 540–551

  11. Nicolas, F., Rivals, E.: Hardness results for the center and median string problems under the weighted and unweighted edit distances. J. Discrete Algorithms 3 (2005) 390–415

  12. Dsouza, M., Larsen, N., Overbeek, R.: Searching for patterns in genomic data. Trends in Genetics 13 (1997) 497–498

  13. Pesole, G., Liuni, S., D’Souza, M.: PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance. Bioinformatics 16 (2000) 439–450

  14. Belleannée, C., Sallou, O., Nicolas, J.: Logol: Expressive Pattern Matching in Sequences. Application to Ribosomal Frameshift Modeling. In Comin, M., Kall, L., Marchiori, E., Ngom, A., Rajapakse, J., eds.: PRIB2014 - Pattern Recognition in Bioinformatics, 9th IAPR International Conference. Volume 8626 of Lecture Notes in Computer Science, Stockholm, Springer (2014) 34–47

  15. Macke, T.J., Ecker, D.J., Gutell, R.R., Gautheret, D., Case, D.A., Sampath, R.: RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Research 29 (2001) 4724–4735

  16. Eddy, S.: RNABOB: a program to search for RNA secondary structure motifs in sequence databases (1996)

  17. Graf, S., Strothmann, D., Kurtz, S., Steger, G.: HyPaLib: a database of RNAs and RNA structural elements defined by hybrid patterns. Nucleic Acids Res. 29 (2001) 196–198

  18. Strothmann, D., Gräf, S.A., Kurtz, S., Steger, G.: The syntax and semantics of a language for describing complex patterns in biological sequences. Technical report, Universität Bielefeld, Technische Fakultät, Arbeitsgruppe Praktische Informatik (2000)

  19. Billoud, B., Kontic, M., Viari, A.: Palingol: a declarative programming language to describe nucleic acids’ secondary structures and to scan sequence database. Nucleic Acids Res 24 (1996) 395–403

  20. Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M.: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinformatics 12 (2011) 214

  21. Pribnow, D.: Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proceedings of the National Academy of Sciences of the United States of America 72 (1975) 784–8

  22. van Helden, J.: The Analysis of Regulatory Sequences. In: Multiple Aspects of DNA and RNA: from Biophysics to Bioinformatics: Lecture Notes of the Les Houches Summer School 2004. Gulf Professional Publishing (2005)

  23. Parida, L.: Pattern Discovery in Bioinformatics: Theory & Algorithms. Chapman & Hall/CRC (2007)

  24. Stormo, G.D., Schneider, T.D., Gold, L., Ehrenfeucht, A.: Use of the "perceptron" algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10 (1982) 2997–3011

  25. Schneider, T.D., Stormo, G.D., Gold, L., Ehrenfeucht, A.: Information content of binding sites on nucleotide sequences. Journal of molecular biology 188 (1986) 415–31

  26. Schneider, T.: Information theory primer (1995)

  27. Crooks, G.E., Hon, G., Chandonia, J.M., Brenner, S.E.: WebLogo: a sequence logo generator. Genome Res 14 (2004) 1188–1190

  28. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Statistics 22 (1951) 79–86

  29. Hertz, G.Z., Hartzell, 3rd, G., Stormo, G.D.: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci 6 (1990) 81–92

  30. Hertz, G.Z., Stormo, G.D.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15 (1999) 563–577

  31. Stormo, G.D., Hartzell, 3rd, G.: Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 86 (1989) 1183–1187

  32. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2 (1994) 28–36

  33. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262 (1993) 208–214

  34. Neuwald, A.F., Liu, J.S., Lawrence, C.E.: Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci 4 (1995) 1618–1632

  35. Neuwald, A.F., Liu, J.S., Lipman, D.J., Lawrence, C.E.: Extracting protein alignment models from the sequence database. Nucleic Acids Res 25 (1997) 1665–1677

  36. Roth, F.P., Hughes, J.D., Estep, P.W., Church, G.M.: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16 (1998) 939–945

  37. Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouzé, P., Moreau, Y.: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17 (2001) 1113–1122

  38. Liu, X., Brutlag, D.L., Liu, J.S.: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput (2001) 127–138

  39. Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34 (2006) D108–D110

  40. Sandelin, A., Alkema, W., Engström, P., Wasserman, W.W., Lenhard, B.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32 (2004) D91–D94

  41. Taylor, W.R.: The classification of amino acid conservation. J Theor Biol 119 (1986) 205–218

  42. Eddy, S.R.: Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 22 (2004) 1035–1036

  43. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48 (1970) 443–453

  44. Smith, T., Waterman, M.: Identification of common molecular subsequences. Journal of Molecular Biology 147 (1981) 195–197

  45. Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85 (1988) 2444–2448

  46. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: A basic local alignment search tool. J. Mol. Biol. 215 (1990) 403–410

  47. Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22 (1994) 4673–4680

  48. Notredame, C., Higgins, D.G., Heringa, J.: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302 (2000) 205–217

  49. Do, C.B., Mahabhashyam, M.S.P., Brudno, M., Batzoglou, S.: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15 (2005) 330–340

  50. Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32 (2004) 1792–1797

  51. Katoh, K., Misawa, K., Kuma, K.i., Miyata, T.: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30 (2002) 3059–3066

  52. Morgenstern, B., Frech, K., Dress, A., Werner, T.: DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14 (1998) 290–294

  53. Morgenstern, B.: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15 (1999) 211–218

  54. Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14 (1998) 755–763

  55. Gribskov, M., McLachlan, A.D., Eisenberg, D.: Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences of the United States of America 84 (1987) 4355–8

  56. Krogh, A., Brown, M., Mian, I.S., Sjölander, K., Haussler, D.: Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology 235 (1994) 1501–31

  57. Baldi, P., Chauvin, Y., Hunkapiller, T., McClure, M.A.: Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences of the United States of America 91 (1994) 1059–63

  58. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. In: Proceedings of the IEEE. (1989) 257–286

  59. Henikoff, J.G., Henikoff, S.: Using substitution probabilities to improve position-specific scoring matrices. Computer applications in the biosciences : CABIOS 12 (1996) 135–43

  60. Claverie, J.M.: Some useful statistical properties of position-weight matrices. Comput Chem 18 (1994) 287–294

  61. Sjölander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I., Haussler, D.: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer applications in the biosciences : CABIOS 12 (1996) 327–345

  62. Brown, M., Hughey, R., Krogh, A., Mian, I.S., Sjölander, K., Haussler, D.: Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter, L., Searls, D.B., Shavlik, J.W., eds.: Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology, Bethesda, MD, USA, July 1993, AAAI (1993) 47–55

  63. Hughey, R., Krogh, A.: Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12 (1996) 95–107

  64. Sonnhammer, E.L., Eddy, S.R., Durbin, R.: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28 (1997) 405–420

  65. Finn, R.D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E.L.L., Tate, J., Punta, M.: Pfam: the protein families database. Nucleic Acids Res (2013)

  66. Haft, D.H., Selengut, J.D., Richter, R.A., Harkins, D., Basu, M.K., Beck, E.: TIGRFAMS and genome properties in 2013. Nucleic Acids Res 41 (2013) D387–D395

  67. Moult, J.: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15 (2005) 285–289

  68. Gough, J., Karplus, K., Hughey, R., Chothia, C.: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313 (2001) 903–919

  69. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25 (1997) 3389–3402

  70. UniProt: Update on activities at the universal protein resource (UniProt) in 2013. Nucleic Acids Res 41 (2013) D43–D47

  71. Pruitt, K.D., Tatusova, T., Maglott, D.R.: NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33 (2005) D501–D504

  72. Karplus, K.: Hidden Markov models for detecting remote protein homologies. Bioinformatics 14 (1998) 846–865

  73. Karplus, K., Karchin, R., Barrett, C., Tu, S., Cline, M., Diekhans, M., Grate, L., Casper, J., Hughey, R.: What is the value added by human intervention in protein structure prediction? Proteins Suppl 5 (2001) 86–91

  74. Karplus, K., Karchin, R., Draper, J., Casper, J., Mandel-Gutfreund, Y., Diekhans, M., Hughey, R.: Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins 53 Suppl 6 (2003) 491–496

  75. Eddy, S.R.: Accelerated profile HMM searches. PLoS Comput Biol 7 (2011) e1002195

  76. Söding, J.: Protein homology detection by HMM-HMM comparison. Bioinformatics 21 (2005) 951–960

  77. Remmert, M., Biegert, A., Hauser, A., Söding, J.: HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9 (2012) 173–175

  78. Wheeler, T.J., Eddy, S.R.: nhmmer: DNA homology search with profile HMMs. Bioinformatics 29 (2013) 2487–2489

  79. Wheeler, T.J., Clements, J., Eddy, S.R., Hubley, R., Jones, T.A., Jurka, J., Smit, A.F.A., Finn, R.D.: Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res 41 (2013) D70–D82

  80. Eddy, S.R.: A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics 3 (2002) 18

  81. Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjölander, K., Underwood, R.C., Haussler, D.: Recent methods for RNA modeling using stochastic context-free grammars. In: Proceedings of the Asilomar Conference on Combinatorial Pattern Matching, New York, NY, Springer-Verlag (1994) 289–306

  82. Eddy, S.R., Durbin, R.: RNA sequence analysis using covariance models. Nucleic Acids Res 22 (1994) 2079–2088

  83. Burge, S.W., Daub, J., Eberhardt, R., Tate, J., Barquist, L., Nawrocki, E.P., Eddy, S.R., Gardner, P.P., Bateman, A.: Rfam 11.0: 10 years of RNA families. Nucleic Acids Res 41 (2013) D226–D232

  84. Nawrocki, E.P., Eddy, S.R.: Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29 (2013) 2933–2935

  85. Uemura, Y., Hasegawa, A., Kobayashi, S., Yokomori, T.: Tree adjoining grammars for RNA structure prediction. Theoretical Computer Science 210 (1999) 277–303

  86. Rivas, E., Eddy, S.: The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics 16 (2000) 334

  87. Cai, L., Malmberg, R.L., Wu, Y.: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics 19 Suppl 1 (2003) i66–i73

  88. Matsui, H., Sato, K., Sakakibara, Y.: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Proc IEEE Comput Syst Bioinform Conf (2004) 290–299

  89. Grundy, W.N., Bailey, T.L., Elkan, C.P., Baker, M.E.: Meta-MEME: motif-based hidden Markov models of protein families. Comput Appl Biosci 13 (1997) 397–406

  90. Jonassen, I., Collins, J., Higgins, D.: Finding flexible patterns in unaligned protein sequences. Protein Science 4 (1995) 1587–1595

  91. Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Cuche, B.A., de Castro, E., Lachaize, C., Langendijk-Genevaux, P.S., Sigrist, C.J.A.: The 20 years of PROSITE. Nucleic Acids Res 36 (2008) D245–D249

  92. Yokomori, T., Ishida, N., Kobayashi, S.: Learning local languages and its application to protein \(\alpha \)-chain identification. In: 27th Annual Hawaii International Conference on System Sciences (HICSS-27), January 4-7, 1994, Maui, Hawaii, USA, IEEE Computer Society (1994) 113–122

  93. Yokomori, T., Kobayashi, S.: Learning local languages and their application to DNA sequence analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1067–1079

  94. Garcia, P., Vidal, E., Oncina, J.: Learning locally testable languages in the strict sense. In: Proceedings of the International Conference on Algorithmic Learning Theory. (1990) 325–338

  95. Garcia, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 920–925

  96. Peris, P., López, D., Campos, M., Sempere, J.M.: Protein motif prediction by grammatical inference. In Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E., eds.: ICGI. Volume 4201 of Lecture Notes in Computer Science, Springer (2006) 175–187

  97. Peris, P., López, D., Campos, M.: IGTM: An algorithm to predict transmembrane domains and topology in proteins. BMC Bioinformatics 9 (2008)

  98. Garcia, P., Vidal, E., Casacuberta, F.: Local languages, the successor method, and a step towards a general methodology for the inference of regular grammars. IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 841–845

  99. Oncina, J., Garcia, P.: Inferring regular languages in polynomial update time. In: Pattern Recognition and Image Analysis. (1992) 49–61

  100. Lang, K.J.: Random DFA’s can be approximately learned from sparse uniform examples. Association for Computing Machinery (1992) 45–52

  101. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Proceedings of the 4th International Colloquium on Grammatical Inference. ICGI ’98, London, UK, Springer-Verlag (1998) 1–12

  102. Coste, F., Kerbellec, G., Idmont, B., Fredouille, D., Delamarche, C.: Apprentissage d’automates par fusions de paires de fragments significativement similaires et premières expérimentations sur les protéines MIP. In: JOBIM. (2004)

  103. Coste, F., Kerbellec, G.: A similar fragments merging approach to learn automata on proteins. In Gama, J., Camacho, R., Brazdil, P., Jorge, A., Torgo, L., eds.: ECML. Volume 3720 of Lecture Notes in Computer Science., Springer (2005) 522–529

  104. Coste, F., Kerbellec, G.: Learning Automata on Protein Sequences. In Denise, A., Durrens, P., Robin, S., Rocha, E., de Daruvar, A., Groppi, A., eds.: JOBIM, Bordeaux, France (2006) 199–210

  105. Kerbellec, G.: Apprentissage d’automates modélisant des familles de séquences protéiques. PhD thesis, Université de Rennes 1 (2008)

  106. Bretaudeau, A., Coste, F., Humily, F., Garczarek, L., Corguillé, G.L., Six, C., Ratin, M., Collin, O., Schluchter, W.M., Partensky, F.: Cyanolyase: a database of phycobilin lyase sequences, motifs and functions. Nucleic Acids Research 41 (2013) 396–401

  107. Burgos, A., Coste, F., Kerbellec, G.: Learning automata on protein sequences by partial multiple sequence alignment. (in preparation)

  108. Coste, F., Fredouille, D.: What is the Search Space for the Inference of Non Deterministic, Unambiguous and Deterministic Automata? Rapport de recherche RR-4907, INRIA (2003)

  109. Dyrka, W., Nebel, J.C.: A stochastic context free grammar based framework for analysis of protein sequences. BMC Bioinformatics 10 (2009) 323

  110. Coste, F., Garet, G., Nicolas, J.: Local Substitutability for Sequence Generalization. In Heinz, J., de la Higuera, C., Oates, T., eds.: ICGI 2012. Volume 21 of JMLR Workshop and Conference Proceedings, University of Maryland, MIT Press (2012) 97–111

  111. Clark, A., Eyraud, R.: Identification in the limit of substitutable context free languages. In Jain, S., Simon, H.U., Tomita, E., eds.: Proceedings of the 16th International Conference on Algorithmic Learning Theory, Springer-Verlag (2005) 283–296

  112. Clark, A., Eyraud, R.: Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research 8 (2007) 1725–1745

  113. Yoshinaka, R.: Identification in the limit of k, l-substitutable context-free languages. In Clark, A., Coste, F., Miclet, L., eds.: ICGI. Volume 5278 of Lecture Notes in Computer Science., Springer (2008) 266–279

  114. Harris, Z.: Distributional structure. Word 10 (1954) 146–162

  115. Coste, F., Garet, G., Nicolas, J.: A bottom-up efficient algorithm learning substitutable languages from positive examples. In Clark, A., Kanazawa, M., Yoshinaka, R., eds.: ICGI 2014. Volume 34 of JMLR Workshop and Conference Proceedings. (2014) 49–63

  116. Nevill-Manning, C.G., Witten, I.H.: Compression and explanation using hierarchical grammars. The Computer Journal 40 (1997) 103–116

  117. Cherniavsky, N., Lander, R.: Grammar-based compression of DNA sequences. In: DIMACS Working Group on the Burrows-Wheeler Transform. (2004) 21

  118. Lanctot, J.K., Li, M., Yang, E.H.: Estimating DNA sequence entropy. In: ACM-SIAM Symposium on Discrete Algorithms. (2000) 409–418

  119. Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proceedings of the IEEE 88 (2000) 1733–1744

  120. Apostolico, A., Lonardi, S.: Compression of biological sequences by greedy off-line textual substitution. In: Data Compression Conference. (2000) 143–153

  121. Nevill-Manning, C., Witten, I.: On-line and off-line heuristics for inferring hierarchies of repetitions in sequences. In: Data Compression Conference, IEEE (2000) 1745–1755

  122. Carrascosa, R., Coste, F., Gallé, M., López, G.G.I.: The smallest grammar problem as constituents choice and minimal grammar parsing. Algorithms 4 (2011) 262–284

  123. Carrascosa, R., Coste, F., Gallé, M., López, G.G.I.: Searching for smallest grammars on large sequences and application to DNA. J. Discrete Algorithms 11 (2012) 62–72

  124. Brejova, B., Vinar, T., Li, M.: Pattern Discovery: Methods and Software. In Krawetz, S.A., Womble, D.D., eds.: Introduction to Bioinformatics. Humana Press (2003) 491–522

  125. Sakakibara, Y.: Grammatical inference in bioinformatics. IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1051–1062

  126. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press (1999)

  127. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach. 2nd edn. Cambridge: MIT Press (2001)

  128. de la Higuera, C.: Grammatical Inference: Learning Automata and Grammars. Cambridge University Press (2010)

Author information

Correspondence to François Coste.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Coste, F. (2016). Learning the Language of Biological Sequences. In: Heinz, J., Sempere, J. (eds) Topics in Grammatical Inference. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48395-4_8

  • DOI: https://doi.org/10.1007/978-3-662-48395-4_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48393-0

  • Online ISBN: 978-3-662-48395-4

  • eBook Packages: Computer Science, Computer Science (R0)
