Genes, information and sense: complexity and knowledge retrieval

  • Original Paper
Theory in Biosciences

Abstract

The information capacity of a nucleotide sequence measures the unexpectedness of a continuation of a given string of nucleotides, and thus bears on a variety of biological issues. A continuation is defined so as to maximize the entropy of the ensemble of such continuations. The capacity is defined as the mutual entropy of the real frequency dictionary of a sequence with respect to the dictionary bearing the most expected continuations; it does not depend on the length of the strings contained in a dictionary. Various genomes exhibit a multi-minima pattern in the dependence of information capacity on string length, which reflects an order within a sequence. Strings whose expected frequency deviates significantly from the real one are the words of increased information value. Such words are distributed non-randomly along a sequence, which makes it possible to retrieve the correlation between structure and the function encoded within a sequence.
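
As a rough illustration of the quantities named in the abstract, the following Python sketch computes a frequency dictionary, an expected frequency for each string, and the resulting capacity. The abstract does not state the formulas, so the sketch rests on assumptions: the expected frequency of a string of length q is taken to be the product of the frequencies of its two overlapping substrings of length q-1 divided by the frequency of their common part of length q-2 (a common maximum-entropy reconstruction), and the capacity is taken to be the relative entropy of the real dictionary with respect to the reconstructed one. All function names are hypothetical.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Frequencies of all length-q strings, reading seq as a ring."""
    n = len(seq)
    ring = seq + seq[:q - 1]  # wrap around so every position starts a string
    counts = Counter(ring[i:i + q] for i in range(n))
    return {w: c / n for w, c in counts.items()}


def expected_frequency(word, f_short, f_mid):
    """Assumed maximum-entropy estimate from the two overlapping (q-1)-strings."""
    left, right, mid = word[:-1], word[1:], word[1:-1]
    denom = f_mid[mid] if mid else 1.0  # the overlap is empty when q == 2
    return f_short[left] * f_short[right] / denom


def log_ratios(seq, q):
    """ln(real / expected) for every length-q string present in seq (q >= 2)."""
    f_q = frequency_dictionary(seq, q)
    f_short = frequency_dictionary(seq, q - 1)
    f_mid = frequency_dictionary(seq, q - 2) if q > 2 else {}
    return {w: log(f / expected_frequency(w, f_short, f_mid))
            for w, f in f_q.items()}


def information_capacity(seq, q):
    """Relative entropy of the real q-dictionary vs. the reconstructed one."""
    f_q = frequency_dictionary(seq, q)
    ratios = log_ratios(seq, q)
    return sum(f_q[w] * ratios[w] for w in f_q)


if __name__ == "__main__":
    seq = "ACGT" * 5
    print(information_capacity(seq, 2))  # ln 4: pairs are far from independent
    print(information_capacity(seq, 3))  # 0.0: triplets follow from the pairs
```

Under these assumptions, strings with a large absolute value in `log_ratios` would be the candidates for words of increased information value; what counts as a significant deviation is not specified in the abstract and is left open here.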

Notes

  1. The theory and methodology described below are applicable to a sequence from an arbitrary (finite) alphabet ℵ, for example, to amino acid sequences.

  2. The equality of these two sums is what motivates connecting a sequence into a ring.

  3. Strictly speaking, information capacity is defined for a frequency dictionary, not for a sequence; we shall not distinguish between the two unless doing so would cause a misrepresentation.
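
Note 2 refers to two sums that are not shown in this excerpt. A plausible reading, offered here only as an assumption, is that they are the sums of the frequencies of length-q strings taken over the last symbol and over the first symbol. The Python sketch below (hypothetical function names) shows that these two sums agree for every shorter string when the sequence is closed into a ring, and can differ at the ends when it is not.

```python
from collections import Counter


def string_counts(seq, q, ring=True):
    """Counts of length-q strings; with ring=True the sequence wraps around."""
    text = seq + seq[:q - 1] if ring else seq
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def marginals(seq, q, ring=True):
    """Sum the q-string counts over the last symbol and over the first symbol."""
    over_last, over_first = Counter(), Counter()
    for word, count in string_counts(seq, q, ring).items():
        over_last[word[:-1]] += count   # sum over the final symbol
        over_first[word[1:]] += count   # sum over the initial symbol
    return over_last, over_first


if __name__ == "__main__":
    seq = "AACGTTG"
    left, right = marginals(seq, 3, ring=True)
    print(left == right == string_counts(seq, 2))  # both sums match on a ring
    left, right = marginals(seq, 3, ring=False)
    print(left == right)  # the ends break the equality on a linear sequence
```

On a ring, both sums also coincide with the counts of the strings of length q-1, so dictionaries of different lengths stay mutually consistent.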

Acknowledgments

We thank Prof. Alexander N. Gorban of Leicester University for valuable discussions and inspiring ideas, and Dr. Tatyana G. Popova of the Institute of Computational Modelling of RAS for stimulating interest in this work.

Author information

Corresponding author

Correspondence to Michael G. Sadovsky.

Additional information

The results presented here were obtained in part with support from the Krasnoyarsk Science Foundation.

About this article

Cite this article

Sadovsky, M.G., Putintseva, J.A. & Shchepanovsky, A.S. Genes, information and sense: complexity and knowledge retrieval. Theory Biosci. 127, 69–78 (2008). https://doi.org/10.1007/s12064-008-0032-1

  • DOI: https://doi.org/10.1007/s12064-008-0032-1
