
Physics Letters A

Volume 274, Issues 5–6, 25 September 2000, Pages 247-253

Entropy concepts and DNA investigations

https://doi.org/10.1016/S0375-9601(00)00557-0

Abstract

Topological and metric entropies of DNA sequences from different organisms were calculated. The results were compared with each other and with those of corresponding artificial sequences. For all DNA sequences considered there is a maximum of heterogeneity, which falls in the block-length interval [5,7]. The maximum distinction between natural and artificial sequences is shifted 1–3 positions to the right of the heterogeneity maximum, for the metric as well as the topological entropy. This points to the specificity of real DNA sequences in that interval.

Introduction

One of the first conceptions of entropy was proposed by C. Shannon in application to the theory of information transmission [1]. Entropy in that context is a measure of the heterogeneity of a set of symbols. In mathematical form it can be written as
$$H = -\sum_i p_i \log p_i,$$
where $p_i$ is the probability of appearance of the ith symbol. The notion of entropy then spread into different fields of science: statistical mechanics, probability theory, computer science, etc. Many new conceptions of entropy, such as metric, thermodynamic, topological, generalized, Kolmogorov–Sinai, and structural spectral entropy, have been proposed [2], [3], [4], [5], [6]. But all of them pursue a single aim: the description of uncertainty in a large set of objects.
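As an illustration, here is a minimal Python sketch of Shannon's formula applied to a sequence of symbols; the function name shannon_entropy and the choice of base 2 (bits) are our assumptions, not specifications from the paper:

```python
import math
from collections import Counter

def shannon_entropy(sequence, base=2):
    """H = -sum_i p_i * log(p_i) over the symbol frequencies of `sequence`."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())

# A uniform four-letter alphabet attains the maximum log2(4) = 2 bits.
print(shannon_entropy("ACGT" * 1000))  # -> 2.0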

One of the main directions in DNA investigations is the search for and elaboration of methods that extract robust structural properties from genome texts. This allows one to understand the general principles by which genetic sequences form and to draw reasonable conclusions in evolution theories [7], [8]. It is not surprising that the notions of entropy and information find wide application in DNA investigations [9], [10], [11], [12], [13]. Indeed, in some sense DNA is a long sequence of symbols from an alphabet consisting of just four letters. Letter analysis eventually appears insufficient to explain DNA properties or structure; the analysis of letter groups, or so-called words, is more interesting. But because the number of possible words of length n grows exponentially with n, it is considerably more difficult. As a rule, such investigations adopt some additional assumptions, for example the equidistribution of words [14], [15], and the entropy estimation is performed on that basis [16], [17], [18]. But in reality neither the distribution of words nor even the distribution of letters is equiprobable. This is an essential point if we want to gain information from and about real DNA. Moreover, very often only short parts of a genome or chromosome have been used in the analysis [15]. The data and computer methods available today allow one to investigate complete genomes and chromosomes. (For example, the length of human chromosome 22 is 33476902 bp.) This allows one to take the maximum likelihood estimate $p_i = q_i/N$ (where $q_i$ is the number of occurrences of the ith word in an investigated sample and N is the sample size) as a good enough approximation for the calculation of the entropy in Shannon's sense. At least in this respect the investigated object is not replaced by any artificial set.
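A sketch of this maximum likelihood estimate over words of length n might look as follows; counting overlapping words, and the helper name word_probabilities, are our assumptions:

```python
from collections import Counter

def word_probabilities(seq, n):
    """Maximum likelihood estimates p_i = q_i / N for words of length n.

    q_i is the number of occurrences of the ith word; N is the total
    number of (overlapping) word occurrences in the sample.
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {word: q / total for word, q in counts.items()}

probs = word_probabilities("ACGTACGTTGCA", 3)
```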

So we use the metric and topological definitions of entropy to estimate the heterogeneity of DNA texts. Some remarks on the possibility of representing a DNA sequence by a Markov chain for the calculation of the entropy estimate are also given.


Metric entropy

In Shannon's definition of entropy given above, $p_i$ may denote the probability of a single symbol as well as the probability of a group or block of symbols. In other words, Shannon's formula can be rewritten as
$$H_n = -\sum_{i=1}^{a^n} p(C_i) \log p(C_i),$$
where $C_i$ is a block of symbols or 'word' of length n, a is the number of letters in the language, and $a^n$ is the number of all possible combinations of length n of those letters. Obviously $H_n$ is a non-decreasing function of n. The ratio $H_n/n$ tends to a certain limit as n → ∞.
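A direct finite-n estimate of $H_n$ and of the ratio $H_n/n$ can be sketched as below, reusing the maximum likelihood word probabilities from the Introduction; the function names are our assumptions:

```python
import math
from collections import Counter

def block_entropy(seq, n, base=2):
    """H_n = -sum_i p(C_i) log p(C_i) over observed words C_i of length n."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return -sum((q / total) * math.log(q / total, base)
                for q in counts.values())

def metric_entropy_estimate(seq, n):
    """Finite-n estimate h(n) = H_n / n of the metric entropy."""
    return block_entropy(seq, n) / n
```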

Topological entropy

The topological entropy is defined as
$$k(n) = \frac{\log_2 N(n)}{n},$$
where N(n) is the number of distinct blocks of length n in the sequence under consideration [2]. We calculated k(n) for n ≤ 12.
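Since k(n) depends only on which blocks occur, not on their frequencies, N(n) can be counted with a set; this short sketch (function name assumed) follows the definition above:

```python
import math

def topological_entropy(seq, n):
    """k(n) = log2(N(n)) / n, with N(n) the number of distinct length-n blocks."""
    distinct_blocks = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    return math.log2(len(distinct_blocks)) / n
```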

It was found that for n ≤ 4, k(n) is maximal and equals 2 for all genomes/chromosomes, i.e. all possible n-letter combinations are present in every one of the investigated DNA texts. On the fifth level only the shortest bacterial genome, mgen (of length 580074 bp), lacks some words. On the sixth level half of the bacteria…

A DNA sequence and Markov processes

One of the first attempts at modelling DNA sequences and obtaining appropriate statistical characteristics was based on the application of Markov processes [9], [19]. The term Markov process, or Markov chain, usually denotes a stochastic process in which the state of a system depends on its previous state. This is the so-called one-step Markov chain (a process with memory equal to 1). In the general case the memory (dependence) may be greater than 1, but it is necessarily finite. In the critical review [20] Li…
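For a one-step Markov chain, the transition probabilities can be fitted simply by counting adjacent symbol pairs; a minimal sketch under that assumption (the helper name transition_probabilities is ours):

```python
from collections import Counter

def transition_probabilities(seq):
    """Estimated one-step transition probabilities P(b | a) of a
    first-order Markov chain, from counts of adjacent symbol pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))
    left_counts = Counter(seq[:-1])
    table = {}
    for (a, b), c in pair_counts.items():
        table.setdefault(a, {})[b] = c / left_counts[a]
    return table

P = transition_probabilities("ACGTACGGTTACGAACGT")
# P["A"]["C"] is the estimated probability that C follows A.
```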

Conclusion

The effect of the finiteness of a sequence seems considerable for large n in entropy estimation. On average, the reductions of both h(n) and k(n) correlate with the length of a sequence. For n of order 11–12 the differences between h(n) and k(n) are very small (for the yeast chromosomes and the bacterial genomes), as are the distinctions between the natural and artificial sequences. For the human chromosome these distinctions are greater.

The reduction rate of h(n) differs between organisms. In general it does not follow…

References (21)

  • L. Gatlin, J. Theor. Biol. (1966)
  • G.W. Rowe, J. Theor. Biol. (1983)
  • H. Herzel et al., Chaos, Solitons & Fractals (1994)
  • A.O. Schmitt et al., J. Theor. Biol. (1997)
  • R.A. Elton, J. Theor. Biol. (1974)
  • W. Li, Comput. & Chem. (1997)
  • C.E. Shannon, Bell Syst. Tech. J. (1948)
  • R. Badii, A. Politi, Complexity – Hierarchical Structures and Scaling in Physics, Cambridge Univ. Press, Cambridge, ...
  • Ya.G. Sinai, Topics in Ergodic Theory, Princeton Univ. Press, Princeton, NJ, ...
  • L.D. Landau, E.M. Lifshits, Stat. Phys. 5, Nauka, Moscow, ...