Entropic fluctuations in DNA sequences

https://doi.org/10.1016/j.physa.2017.11.119

Highlights

  • Fluctuations of Local Shannon Entropy in DNA are measures of local complexity.

  • LSE extracts information related to the presence and structure of repetitive units.

  • Graphical LSE method for prediction of regions with high concentration of repeats.

  • LSE covariance function measures evolutionary distance between related organisms.

Abstract

The Local Shannon Entropy (LSE) in blocks is used as a complexity measure to study the information fluctuations along DNA sequences. The LSE of a DNA block maps the local base arrangement information to a single numerical value. It is shown that, despite this reduction of information, LSE allows the extraction of meaningful information related to the detection of repetitive sequences in whole chromosomes and is useful in finding evolutionary differences between organisms. More specifically, large regions of tandem repeats, such as centromeres, can be detected based on their low LSE fluctuations along the chromosome. Furthermore, an empirical investigation of the appropriate block sizes is provided, and the relationship of LSE properties with the structure of the underlying repetitive units is revealed by using both computational and mathematical methods. Sequence similarity between the genomic DNA of closely related species also leads to similar LSE values at the orthologous regions. As an application, the LSE covariance function is used to measure the evolutionary distance between several primate genomes.

Introduction

The human genome contains DNA sequences of great size (of the order of $10^{9}$ base pairs) and of complex structure. The sequences contain regions that serve different functions and have accumulated a high content of erratic DNA during evolution. Therefore, recent genomic studies focus not only on the global statistical properties of DNA but also on the local properties of the information contained in these molecules. These approaches can reveal local properties of DNA in connection with functionality, or discern correlations between regions with specific roles and their local statistics.

The complexity of present-day genomes has been shaped during the process of evolution. During evolution the genome has increased in size and complexity via multiple repetitions of elements, nucleotide mutations, and insertions or deletions of segments. Early investigations showed that the succession of bases along noncoding regions exhibits long-range correlations, whereas coding regions in higher organisms present short-range correlations [[1], [2], [3]]. Many other interesting statistical properties have been revealed and studied since then [[4], [5], [6], [7], [8], [9], [10]].

In this study we focus on the variations of Local Shannon Entropy (LSE) on genomic sequences. Shannon Entropy (SE) was originally introduced in 1948 by C. Shannon [11] as a measure of the information produced when one message is chosen from the set of possible messages. The basic idea behind SE is to measure the predictability of a chosen message. A message with high predictability has lower information content than a less predictable message. Shannon Entropy is a measure with a wide range of applications. Earlier entropic studies on DNA sequences using SE have shown that information theory methods are suitable for DNA sequence analysis [12] and that DNA can be viewed as an out-of-equilibrium structure [13]. Studies on DNA segmentation using the Jensen–Shannon entropic measure have succeeded in discriminating between compositionally homogeneous patches along a sequence. Notable works along these lines are segmentation algorithms based on 4 nucleotides, which detect patches distributed in power-law fashion [see Ref. [14]], and algorithms based on 12 symbols, which detect coding and non-coding segments [see Ref. [15]]. These entropic studies on DNA sequences were motivated by similar studies on symbol sequences in general, such as alphabetic or musical symbols [[16], [17], [18]].

The LSE analysis we propose here is in line with previous studies aiming to extract information from DNA sequences by using local statistical measures and collapsing the local information into a single numerical value. Straightforward approaches at the base level can indeed give a more detailed picture of the underlying sequences, but require long computations for base-by-base comparison between organisms. Here, we collapse the information to a minimal number of parameters and use these parameters to obtain, for example, phylogenetic trees or other evolutionary features. The GC-content is another reductive approach, and it has been shown that many biological features (e.g. gene density) can be correlated with this coarse-grained quantity. The k-mer distribution is a further reductive approach which is known to be quite powerful. From this perspective, we propose the Local Shannon Entropy as an alternative reduction approach which can lead to the extraction of meaningful biological information without the need to go through the difficult and time-consuming detailed analysis of the base sequence. Next, we briefly recapitulate the basic definitions of Shannon Entropy in its local and global form and its application to symbolic DNA sequences.

For a sequence containing $q$ symbols, each of them with probability $P(i)$, $i=1,\dots,q$, the global Shannon entropy is defined as

$$H = -\sum_{i=1}^{q} P(i)\,\log P(i) \qquad (1)$$
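The global entropy definition can be illustrated with a minimal Python sketch. The function name `global_entropy`, the `alphabet` parameter, and the choice to skip symbols outside the alphabet are assumptions made for this illustration, not the authors' implementation. Natural logarithms are used so that the maximum for four symbols is log 4.

```python
import math
from collections import Counter


def global_entropy(seq, alphabet="ACGT"):
    """Shannon entropy (natural log) of the symbol frequencies in seq.

    Symbols outside `alphabet` are skipped; this is an assumption of
    the sketch, not a rule taken from the source.
    """
    counts = Counter(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    # H = -sum_i P(i) log P(i), with P(i) estimated by relative frequency
    return -sum((c / total) * math.log(c / total) for c in counts.values())


h_uniform = global_entropy("ACGT" * 10)  # four equiprobable symbols
h_single = global_entropy("AAAAAAAA")    # one symbol only, fully predictable
```

In this sketch the equiprobable string reaches the maximum value log 4, while the single-symbol string carries zero entropy.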

To measure the entropic fluctuations we introduce the Local Shannon Entropy (LSE) in blocks. We use non-overlapping consecutive blocks of length $l$ that cover each sequence entirely. For a sequence of length $L$ the total number of non-overlapping blocks is $B=\lfloor L/l \rfloor$. Each block has a numerical identifier $j=1,\dots,B$, given by its position in the sequence of blocks. The local Shannon entropy $H_l(n)$ of the block with index $n$ and size $l$ is one way to summarize the base frequency distribution in a block and is defined as

$$H_l(n) = -\sum_{i=1}^{q} P^{(n)}(i)\,\log P^{(n)}(i) \qquad (2)$$

where $P^{(n)}(i)$ is the probability of appearance of character $i$ inside block $n$. In this study the entropy of each block is measured independently, so the probability equals the number of appearances $Q^{(n)}(i)$ of character $i$ in the block divided by the block length, i.e. $P^{(n)}(i)=Q^{(n)}(i)/l$. A long DNA sequence can thus be coarse-grained into a series of LSEs whose mean $\bar{H}_l$ and variance depend on the block size. This mean can differ from the global entropy calculated from the sequence at the base level. The average Shannon entropy $\bar{H}_l$ tends to the global $H$ when the block size $l$ approaches the sequence size, $l \to L$, or if there are no fluctuations of the local entropy across blocks.
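The block construction described above can be sketched in a few lines of Python. The helper names (`block_entropy`, `local_entropies`, `mean_and_variance`) are hypothetical, the input is assumed to contain only the symbols A, C, G and T, and any trailing symbols that do not fill a whole block are dropped, which is one plausible reading of the block count given in the text.

```python
import math
from collections import Counter


def block_entropy(block):
    """Entropy of one block, with P(i) = count(i) / block length."""
    n = len(block)
    return -sum((c / n) * math.log(c / n) for c in Counter(block).values())


def local_entropies(seq, l):
    """LSE of each of the floor(L / l) non-overlapping blocks of length l."""
    n_blocks = len(seq) // l
    return [block_entropy(seq[j * l:(j + 1) * l]) for j in range(n_blocks)]


def mean_and_variance(values):
    """Mean and (population) variance of a series of LSE values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


# Toy sequence: three full blocks of length 4, plus two leftover symbols.
toy = "AAAA" + "ACGT" + "AACC" + "AC"
lse = local_entropies(toy, 4)
lse_mean, lse_var = mean_and_variance(lse)
```

For the toy sequence the three blocks give entropies 0, log 4 and log 2, which shows how strongly the local value can deviate from a single global figure.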

The global and local entropies take the same maximum value in the case of equiprobable symbol representation, i.e. when each character constitutes a fraction $1/q$ of the overall character appearances:

$$H_{eq} = -\sum_{i=1}^{q} \frac{1}{q}\,\log\frac{1}{q} = \log q .$$

Besides the LSE measure, other equivalent quantities may be used, such as the windowed GC-content or various moments of the probability distribution. Which measure to use depends on the particular application. In the current study we employ the LSE measure because it represents the content of information found locally in the DNA sequence.

To apply the LSE method to DNA, each genomic segment is considered as a string composed of $q=4$ different characters, A, C, G, T for Adenine, Cytosine, Guanine and Thymine, respectively. As for any string of symbols, it is possible to measure its global SE using Eq. (1) and its local counterparts using Eq. (2). The magnitude of information in each region could be characteristic of its role, especially when we try to separate functional regions. Note that the maximum value of the global and local Shannon Entropies is $H_{eq}=\log 4=1.386294\ldots$ and, as will be seen in the following, it is very rarely met in natural DNA sequences. For example, in the human genome the base frequency for G and C is around 0.2 and that for A and T is around 0.3 [19], deviating from the equal base probability distribution. In the current international genomic databases, DNA sequences contain a percentage of unidentified base pairs (bps). These unidentified bases are denoted by the letter N and are ignored in the current study, since their frequency is very small (below 1%) in all human chromosomes except chromosome Y.
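As a rough numerical illustration of how far this composition sits from the equiprobable maximum, the approximate frequencies quoted above (0.2 for G and C, 0.3 for A and T) can be inserted into the global entropy definition with natural logarithms. The rounded inputs are taken at face value here, so the result is only indicative:

$$H \approx -\left(2\times 0.2\,\ln 0.2 + 2\times 0.3\,\ln 0.3\right) \approx 0.6438 + 0.7224 = 1.3662$$

This is about 1.5% below $H_{eq}=\ln 4\approx 1.3863$. The global value is therefore close to, but distinct from, the maximum.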

In the next section we apply the LSE to specific sequences using different block sizes and calculate the LSE fluctuations along the sequence. For specific values of the block length $l$ we find regions where the LSE fluctuations, not the LSE itself, take distinctly lower values than in the rest of the sequence. The largest such regions are identified as the centromeric regions of chromosomes. We verify that the centromeric positions coincide with low LSE fluctuations in all human chromosomes. In Section 3 and Appendix A we show that the reason behind the low fluctuations in centromeric regions is the repetitive structure of centromeres. In Section 4 we propose a graphical method based on the LSE fluctuations for the prediction of centromeric regions and regions with a high concentration of repetitive strings. In Section 5 we further analyze centromeric and noncentromeric regions using Fourier transforms to recover the size of the repetitive strings (repeats). In Section 6 we compare the LSE results from the human genome with genomes from other primates and measure their evolutionary distance by comparing the LSE of their largest chromosome, chromosome 1. Finally, we recapitulate our main results and discuss open problems.

Section snippets

Local Shannon entropy of the human genome

LSEs were computed for the 22 pairs of autosomal chromosomes and the pair (X–Y) of allosomes (sex chromosomes) of the human genome. Chromosome Y was excluded from the current study because in the international genomic databases it still contains large unidentified regions, while its size is small for statistical studies. The LSE values of human chromosome 1 are presented in Fig. 1 for various block sizes. We notice a region with low fluctuations which becomes clearer when the

Repetitive elements cause low entropic fluctuations

First we would like to know why the LSE fluctuations are low around centromeres. The centromere is known for being highly repetitive, with various sizes of repeated segments and repetition multiplicities. The centromere’s repetitions are characterized as tandem repeats, consecutively placed in the genome. Such segments are: α-satellite (171 base pairs, bps), β-satellite (61 bps), satellite 1 (25–48 bps) etc. [26]. Knowing this information we can state that the fluctuations in
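The link between tandem repeats and low LSE fluctuations can be illustrated with a toy experiment. This is a simplified sketch, not the argument of Section 3 or Appendix A. It assumes a perfect tandem repeat of a randomly generated 171-symbol unit (a made-up string, not the real α-satellite sequence) and a block length that is an exact multiple of the unit. All function and variable names are hypothetical.

```python
import math
import random
from collections import Counter


def block_entropy(block):
    """Entropy of one block from its symbol frequencies (natural log)."""
    n = len(block)
    return -sum((c / n) * math.log(c / n) for c in Counter(block).values())


def lse_variance(seq, l):
    """Variance of the LSE over non-overlapping blocks of length l."""
    values = [block_entropy(seq[j:j + l]) for j in range(0, len(seq) - l + 1, l)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical 171-symbol repeat unit, repeated 200 times in tandem.
unit = "".join(rng.choice("ACGT") for _ in range(171))
tandem = unit * 200

# Same symbols in shuffled order: identical global composition, no repeats.
symbols = list(tandem)
rng.shuffle(symbols)
shuffled = "".join(symbols)

block_len = 171 * 3  # block length chosen as a multiple of the unit length
var_tandem = lse_variance(tandem, block_len)
var_shuffled = lse_variance(shuffled, block_len)
```

Under these assumptions every block of the tandem sequence has the same composition, so its LSE variance vanishes up to rounding. The shuffled sequence has the same global entropy but a visibly non-zero block-to-block variance.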

Repetitive sequence predicting function

Based on the LSE we now propose a graphical method to predict the centromere position in human chromosomes, or, in general, to predict the position of repetitive elements along any symbol sequence. In order to correctly predict the position of the repetitive areas along the DNA sequence, the analysis needs to become more detailed at the base level. For this reason sliding blocks (shifted by 1 bp) will be used from now on, so that each block of size $l$ overlaps with its next-in-line at $l-1$
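The sliding-block scheme can be sketched as follows. This illustrates only the one-symbol shift between overlapping blocks, not the predicting function itself, which is not reproduced in this excerpt. The function name `sliding_entropies` is hypothetical. Updating the counts incrementally, instead of recounting every block, is a design choice of this sketch.

```python
import math
from collections import Counter


def sliding_entropies(seq, l):
    """LSE of every length-l block, each shifted by one symbol from the last."""
    if l <= 0 or l > len(seq):
        return []
    counts = Counter(seq[:l])

    def entropy():
        # Symbols whose count has dropped to zero contribute nothing.
        return -sum((c / l) * math.log(c / l) for c in counts.values() if c > 0)

    out = [entropy()]
    for start in range(1, len(seq) - l + 1):
        counts[seq[start - 1]] -= 1       # symbol leaving the block
        counts[seq[start + l - 1]] += 1   # symbol entering the block
        out.append(entropy())
    return out


# Blocks of length 4 over a 7-symbol toy string: AAAA, AAAC, AACG, ACGT.
profile = sliding_entropies("AAAACGT", 4)
```

A sequence of length L yields L − l + 1 overlapping blocks, and each update touches only two counters, so the cost per position does not grow with the block size.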

Analysis of high and low LSE fluctuation regions

In Section 2 we took the first step in analyzing the DNA structure using LSE in theory. In this section we probe specific low-LSE regions, centromeric and others. In the plots made with the LSE method described in Section 2, level-like patterns can emerge when the centromere regions are magnified. Parts of the centromere appear on different levels of entropy values. This characteristic is common, and we observed it at the centromeres of almost all human chromosomes. The next two plots in

Evolutionary distance

There have been many attempts to compare genomes since DNA sequencing started. Many of them are based only on protein-coding regions, or on other approaches that may be controversial since they make use of partial genomic information [[35], [36], [37], [38], [39]]. Complexity measures quantifying characteristic properties of different genomes can potentially be used as taxonomical criteria [[40], [28], [29]]. Other approaches use topology concepts: DNA sequences can be represented as curves

Results

To recapitulate, the main objective of our work was to investigate the entropy content of DNA sequences. We unraveled the mathematical reason behind the LSE behavior of DNA regions containing repetitive units and now we are able to get more information about them by using LSE. By taking advantage of the LSE properties of the human centromere, we constructed a function that can predict the position of centromeres by the use of only symbolic sequence data. We show that “low-complexity” regions do

Acknowledgments

D.T. would like to thank Prof. Stratos Prassidis for helpful discussions. The authors would like to thank an anonymous Reviewer for pointing out the connection between Fig. 1d and the isochore structure of the human DNA. This work was supported by computational time granted from the Greek Research & Technology Network (GRNET) in the National HPC facility – ARIS – under project ID pr003017.

References (60)

  • Nadeau J.H. et al., Counting on comparative maps, TIG (1998)
  • Provata A. et al., Complexity measures for the evolutionary categorization of organisms, Comput. Biol. Chem. (2014)
  • Voss R.F., Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett. (1992)
  • Li W. et al., DNA correlations, Nature (1992)
  • Peng C.-K. et al., Long-range correlations in nucleotide sequences, Nature (1992)
  • Román-Roldán R. et al., Sequence compositional complexity of DNA through an entropic segmentation method, Phys. Rev. Lett. (1998)
  • Messer P.W. et al., The majority of recent short DNA insertions in the human genome are tandem duplications, Mol. Biol. Evol. (2007)
  • Carpena P. et al., High-level organization of isochores into gigantic superstructures in the human genome, Phys. Rev. E (2011)
  • Polak P. et al., Long-range bidirectional strand asymmetries originate at CpG islands in the human genome, Genome Biol. Evol. (2009)
  • Shannon C.E., A mathematical theory of communication, SIGMOBILE Mob. Comput. Commun. Rev. (2001)
  • Provata A. et al., DNA viewed as an out-of-equilibrium structure, Phys. Rev. E (2014)
  • Bernaola-Galván P. et al., Compositional segmentation and long-range fractal correlations in DNA sequences, Phys. Rev. E (1996)
  • Bernaola-Galván P. et al., Finding borders between coding and noncoding DNA regions by an entropic segmentation method, Phys. Rev. Lett. (2000)
  • Ebeling W. et al., Entropy, transinformation and word distribution of information-carrying sequences, Int. J. Bifurcation Chaos (1995)
  • Ebeling W. et al., Entropy of symbolic sequences: The role of correlations, Europhys. Lett. (1991)
  • Nicolis J.S., Chaos and Information Processing (1991)
  • Li W., G+C content evolution in the human genome
  • A.F.A. Smit, R. Hubley, P. Green, Repeatmasker open-4.0, 2013-2015,...
  • Benson G., Tandem repeats finder: a program to analyze DNA sequences, Nucl. Acids Res. (1999)
  • Frenkel F.E. et al., Database of periodic DNA regions in major genomes, BioMed. Res. Int. (2017)