Entropic fluctuations in DNA sequences
Introduction
The human genome contains DNA sequences of great size and complex structure. These sequences contain regions that serve different functions and have accumulated a high content of erratic DNA during evolution. Therefore, recent genomic studies focus not only on the global statistical properties of DNA but also on the local properties of the information contained in these molecules. These approaches can reveal local properties of DNA in connection with functionality, or discern correlations between regions with specific roles and their local statistics.
The complexity of the present-day genome has been shaped during the process of evolution. During evolution the genome has increased in size and complexity via multiple repetitions of elements, nucleotide mutations, and insertion or deletion of segments. Early investigations have shown that the succession of bases along noncoding regions exhibits long-range correlations, whereas coding regions in higher organisms present short-range correlations [[1], [2], [3]]. Many other interesting statistical properties have been revealed and studied since then [[4], [5], [6], [7], [8], [9], [10]].
In this study we focus on the variations of the Local Shannon Entropy (LSE) along genomic sequences. Shannon Entropy (SE) was originally introduced in 1948 by C. Shannon [11] as a measure of the information produced when one message is chosen from the set of possible messages. The basic idea behind SE is to measure the predictability of a chosen message: a message with high predictability has lower information content than a less predictable one. Shannon Entropy is a measure with a wide range of applications. Earlier entropic studies on DNA sequences using SE have shown that information theory methods are suitable for DNA sequence analysis [12] and that DNA can be viewed as an out-of-equilibrium structure [13]. Studies on DNA segmentation using the Jensen–Shannon entropic measure have succeeded in discriminating between compositionally homogeneous patches along a sequence. Notable works along these lines are segmentation algorithms based on 4 nucleotides, which detect patches distributed in a power-law fashion (see Ref. [14]), and algorithms based on 12 symbols, which detect coding and non-coding segments (see Ref. [15]). These entropic studies on DNA sequences were motivated by similar studies on symbol sequences in general, such as alphabetic or musical symbols [[16], [17], [18]].
The LSE analysis we propose here is in line with previous studies aiming to extract information from DNA sequences by using local statistical measures and collapsing the local information into a single numerical value. Straightforward approaches at the base level can indeed give a more detailed picture of the underlying sequences, but require long computations for base-by-base comparison between organisms. Here, we collapse the information into a minimal number of parameters and use these parameters to obtain, for example, phylogenetic trees or other evolutionary features. The GC-content is another reductive approach, and it has been shown that many biological features (e.g. gene density) can be correlated with this coarse-grained quantity. The k-mer distribution is another reductive approach which is known to be quite powerful. From this perspective, we propose the Local Shannon Entropy as an alternative reduction approach which can lead to the extraction of meaningful biological information, without the need to go through the difficult and time-consuming detailed analysis of the base sequence. Next, we briefly recapitulate the basic definitions of Shannon Entropy in its local and global form and its application to symbolic DNA sequences.
For a sequence containing $k$ symbols, each of them with probability $p_i$, $i = 1, \dots, k$, the global Shannon entropy is defined as
$$S = -\sum_{i=1}^{k} p_i \log p_i. \qquad (1)$$
To measure the entropic fluctuations we introduce the Local Shannon Entropy (LSE) in blocks. We use non-overlapping consecutive blocks of length $b$, covering each sequence entirely. For a sequence of length $L$ the total number of non-overlapping blocks is $L/b$. Each block has a numerical identifier $j$, determined by its position in the sequence of blocks. The local Shannon entropy of the block with index $j$ and size $b$ is one way to summarize the base frequency distribution in a block and is defined as
$$S_j(b) = -\sum_{i=1}^{k} p_i^{(j)} \log p_i^{(j)}, \qquad (2)$$
where $p_i^{(j)}$ is the appearance probability of the $i$-th character inside block $j$. In this study the entropy of each block is measured independently, so the probability $p_i^{(j)}$ equals the number of appearances $n_i^{(j)}$ of character $i$ divided by the block’s length, i.e. $p_i^{(j)} = n_i^{(j)}/b$. A long DNA sequence can be coarse-grained into a series of LSEs, whose mean and variance depend on the block size. The mean can differ from the global entropy of Eq. (1), calculated from the sequence at the base level. The average local entropy tends to the global $S$ when the block size approaches the sequence size, $b \to L$, or if there are no fluctuations of the local entropy in each block.
The global and local entropies take the same maximum value, $\log k$, in the case of equiprobable symbol representation, i.e. when each character constitutes the ratio $1/k$ of the overall character appearances: $p_i = 1/k$.
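The global and local definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `shannon_entropy` and `local_shannon_entropy`, the use of base-2 logarithms (entropy in bits), and the choice to drop a trailing incomplete block are all assumptions made here.

```python
import math
from collections import Counter


def shannon_entropy(seq, base=2.0):
    """Shannon entropy of a symbol sequence, using empirical frequencies.

    The logarithm base is an assumption of this sketch (base 2 gives bits).
    """
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def local_shannon_entropy(seq, block_size, base=2.0):
    """Entropy of each consecutive non-overlapping block.

    A trailing block shorter than block_size is dropped (an assumption).
    """
    n_blocks = len(seq) // block_size
    return [
        shannon_entropy(seq[j * block_size:(j + 1) * block_size], base)
        for j in range(n_blocks)
    ]


# Toy sequence: two homogeneous blocks followed by two maximally mixed blocks.
seq = "AAAACCCCACGTACGT"
global_se = shannon_entropy(seq)        # entropy of the whole sequence
lse = local_shannon_entropy(seq, 4)     # one value per block of 4 symbols
mean_lse = sum(lse) / len(lse)          # differs from global_se here
```

In this toy case the block entropies are 0, 0, 2 and 2 bits, so their mean (1 bit) is lower than the global entropy (about 1.81 bits), while a single block spanning the whole sequence reproduces the global value.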
Besides the LSE measure, other equivalent quantities may be used, such as the windowed GC-content or various moments of the probability distribution. The choice among these measures depends on the particular application. In the current study we employ the LSE measure because it represents the information content found locally in the DNA sequence.
To apply the LSE method to DNA, each genomic segment is considered as a string composed of four different characters, A, C, G, T for Adenine, Cytosine, Guanine and Thymine, respectively. As with any string of symbols, it is possible to measure its global SE using Eq. (1), and its local counterparts using Eq. (2). The magnitude of information in each region could be characteristic of its role, especially when we try to separate functional regions. Note that the maximum value of the global and local Shannon entropies is $\log 4$ and, as will be seen in the following, it is very rarely met in natural DNA sequences. For example, in the human genome the base frequency of G and C is around 0.2 and that of A and T is around 0.3 [19], deviating from the equal base probability distribution. In the current international genomic databases, DNA sequences contain a percentage of unidentified base pairs (bps). These unidentified bases are denoted by the letter N and are ignored in the current study since their frequency is very small in all human chromosomes, except chromosome Y.
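As a rough numerical check of the statement above, the following Python sketch compares the maximum entropy of a four-letter alphabet with the entropy of the approximate human base frequencies quoted in the text, and shows one simple way of discarding N symbols. The helper names `entropy_bits` and `base_frequencies` are hypothetical, and base-2 logarithms are an assumption of this sketch.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a probability vector (zero entries skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def base_frequencies(dna):
    """Frequencies of A, C, G, T after discarding any other symbol (e.g. N)."""
    kept = [b for b in dna.upper() if b in "ACGT"]
    if not kept:
        return {b: 0.0 for b in "ACGT"}
    return {b: kept.count(b) / len(kept) for b in "ACGT"}


# Equiprobable bases give the maximum entropy, log2(4) = 2 bits.
max_entropy = entropy_bits([0.25, 0.25, 0.25, 0.25])

# Approximate human frequencies from the text: A, T around 0.3; C, G around 0.2.
human_like = entropy_bits([0.3, 0.2, 0.2, 0.3])

# Unidentified bases (N) are ignored when counting.
freqs = base_frequencies("ACGTNNACGT")
```

With these rounded frequencies the entropy is about 1.97 bits, slightly below the 2-bit maximum, which is consistent with the remark that the maximum is rarely reached in natural DNA.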
In the next section we apply the LSE to specific sequences using different block sizes and calculate the LSE fluctuations along the sequence. For specific values of the block length we find regions where the LSE fluctuations, not the LSE itself, take distinctly lower values than in the rest of the sequence. The largest such regions are identified as the centromeric regions of chromosomes. We verify that the centromeric positions coincide with low LSE fluctuations in all human chromosomes. In Section 3 and Appendix A we show that the reason behind the low fluctuations in centromeric regions is the repetitive structure of centromeres. In Section 4 we propose a graphical method based on the LSE fluctuations for the prediction of centromeric regions and regions with a high concentration of repetitive strings. In Section 5 we further analyze centromeric and noncentromeric regions using Fourier transforms to recover the size of the repetitive strings (repeats). In Section 6 we compare the LSE results from the human genome with genomes from other primates and measure their evolutionary distance by comparing the LSE of their largest chromosome, chromosome 1. Finally, we recapitulate our main results and discuss open problems.
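The link between repetitive structure and low LSE fluctuations can be illustrated with a toy Python experiment. This is only a sketch under stated assumptions: the repeat unit `AATGGC` is hypothetical, the fluctuation measure (population standard deviation of the block entropies) is a choice made here and not necessarily the one used in the paper, and the block size is deliberately taken as a multiple of the repeat length so that every block of the tandem array has identical composition.

```python
import math
import random
import statistics
from collections import Counter


def block_entropies(seq, block_size):
    """Entropy (bits) of consecutive non-overlapping blocks of block_size."""
    values = []
    for start in range(0, len(seq) - block_size + 1, block_size):
        counts = Counter(seq[start:start + block_size])
        values.append(
            -sum((c / block_size) * math.log2(c / block_size)
                 for c in counts.values())
        )
    return values


rng = random.Random(0)                 # fixed seed for reproducibility

unit = "AATGGC"                        # hypothetical repeat unit (6 bp)
tandem = unit * 500                    # tandem array, 3000 bp
iid_random = "".join(rng.choice("ACGT") for _ in range(3000))

block_size = 60                        # multiple of the unit length (assumption)

# Fluctuation measured as the standard deviation of the LSE series.
tandem_fluct = statistics.pstdev(block_entropies(tandem, block_size))
random_fluct = statistics.pstdev(block_entropies(iid_random, block_size))
```

Here every block of the tandem array contains the same ten copies of the unit, so its LSE series is constant and its fluctuation vanishes, whereas the random sequence shows nonzero block-to-block variation. For block sizes that are not multiples of the repeat length one would expect small but not exactly zero fluctuations.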
Section snippets
Local Shannon entropy of the human genome
LSEs were computed for the 22 pairs of autosomal chromosomes and the pair (X–Y) of allosomes (sex chromosomes) of the human genome. Chromosome Y was excluded from the current study because in the international genomic databases it still contains large unidentified regions, while its size is small for statistical studies. The LSE values of human chromosome 1 are presented in Fig. 1 for various block sizes. We notice a region with low fluctuations which becomes clearer when the
Repetitive elements cause low entropic fluctuations
First, we would like to know why the LSE fluctuations are low around centromeres. The centromere is known for being highly repetitive, with various sizes of repeated segments and repetition multiplicities. The centromere’s repetitions are characterized as tandem repeats, placed consecutively in the genome. Such segments are: α-satellite (171 base pairs, bps), β-satellite (61 bps), satellite 1 (25–48 bps) etc. [26]. Knowing this information we can state that the fluctuations in
Repetitive sequence predicting function
Based on the LSE we now propose a graphical method to predict the centromere position in human chromosomes, or, in general, to predict the position of repetitive elements along any symbol sequence. In order to correctly predict the position of the repetitive areas along the DNA sequence, the analysis needs to become more detailed at the base level. For this reason sliding blocks (shifted by 1 bp) will be used from now on, so that each block of size overlaps with its next-in-line at
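A sliding-block LSE profile of the kind mentioned above, with the block shifted by one symbol at a time, could be computed as in the Python sketch below. It is not the authors' predicting function, which is not reproduced in this excerpt; the function name `sliding_lse`, the incremental count update, and the base-2 logarithm are assumptions of this illustration.

```python
import math
from collections import Counter


def sliding_lse(seq, block_size):
    """Entropy (bits) of every window of length block_size, shifted by 1 symbol.

    Symbol counts are updated incrementally instead of being recounted
    for each window.
    """
    if block_size <= 0 or block_size > len(seq):
        return []

    counts = Counter(seq[:block_size])

    def entropy():
        return -sum(
            (c / block_size) * math.log2(c / block_size)
            for c in counts.values() if c > 0
        )

    values = [entropy()]
    for i in range(block_size, len(seq)):
        counts[seq[i - block_size]] -= 1   # symbol leaving the window
        counts[seq[i]] += 1                # symbol entering the window
        values.append(entropy())
    return values


# Windows: AAAA, AAAC, AACG, ACGT
profile = sliding_lse("AAAACGT", 4)
```

A sequence of length n yields n - block_size + 1 values, one per starting position, so features of the profile can be located at single-base resolution.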
Analysis of high and low LSE fluctuation regions
In Section 2 we took the first step in analyzing the DNA structure using LSE in theory. In this section we examine specific low-LSE regions, centromeric and others. In the plots made with the LSE method described in Section 2, level-like patterns can emerge when the centromere regions are magnified. Parts of the centromere appear at different levels of entropy values. This characteristic is common and we observed it at the centromeres of almost all human chromosomes. The next two plots in
Evolutionary distance
There have been many attempts to compare genomes since DNA sequencing started. Many of them are based only on protein-coding regions, or on other approaches that may be controversial, since they make use of partial genomic information [[35], [36], [37], [38], [39]]. Complexity measures quantifying characteristic properties of different genomes can potentially be used as taxonomical criteria [[40], [28], [29]]. Other approaches use topology concepts: DNA sequences can be represented as curves
Results
To recapitulate, the main objective of our work was to investigate the entropy content of DNA sequences. We unraveled the mathematical reason behind the LSE behavior of DNA regions containing repetitive units, and we are now able to obtain more information about them by using LSE. By taking advantage of the LSE properties of the human centromere, we constructed a function that can predict the position of centromeres using only symbolic sequence data. We show that “low-complexity” regions do
Acknowledgments
D.T. would like to thank Prof. Stratos Prassidis for helpful discussions. The authors would like to thank an anonymous reviewer for pointing out the connection between Fig. 1d and the isochore structure of the human DNA. This work was supported by computational time granted from the Greek Research & Technology Network (GRNET) in the National HPC facility – ARIS – under project ID pr003017.
References (60)
- et al., Multi-scale coding of genomic information: from DNA sequence to genome structure and function, Phys. Rep. (2011)
- et al., Understanding long-range correlations in DNA sequences, Physica D (1994)
- The study of correlation structures of DNA sequences: a critical review, Comput. Chem. (1997)
- et al., Application of information theory to DNA sequence analysis: A review, Pattern Recognit. (1996)
- et al., Statistics of local complexity in amino acid sequences and sequence databases, Comput. Chem. (1993)
- et al., Scaling properties of coding and noncoding DNA sequences, Physica A (1997)
- et al., A novel signal processing measure to identify exact and inexact tandem repeat patterns in DNA sequences, EURASIP J. Bioinform. Syst. Biol. (2007)
- et al., Criteria for confirming sequence periodicity identified by Fourier transform analysis: application to GCR2, a candidate plant GPCR?, Biophys. Chem. (2008)
- et al., Information decomposition method to analyze symbolical sequences, Phys. Lett. A (2003)
- et al., Big trees from little genomes: mitochondrial gene order as a phylogenetic tool, Curr. Opin. Genetics Dev. (1998)
- Counting on comparative maps, TIG
- Complexity measures for the evolutionary categorization of organisms, Comput. Biol. Chem.
- Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett.
- DNA correlations, Nature
- Long-range correlations in nucleotide sequences, Nature
- Sequence compositional complexity of DNA through an entropic segmentation method, Phys. Rev. Lett.
- The majority of recent short DNA insertions in the human genome are tandem duplications, Mol. Biol. Evol.
- High-level organization of isochores into gigantic superstructures in the human genome, Phys. Rev. E
- Long-range bidirectional strand asymmetries originate at CpG islands in the human genome, Genome Biol. Evol.
- A mathematical theory of communication, SIGMOBILE Mob. Comput. Commun. Rev.
- DNA viewed as an out-of-equilibrium structure, Phys. Rev. E
- Compositional segmentation and long-range fractal correlations in DNA sequences, Phys. Rev. E
- Finding borders between coding and noncoding DNA regions by an entropic segmentation method, Phys. Rev. Lett.
- Entropy, transinformation and word distribution of information-carrying sequences, Int. J. Bifurcation Chaos
- Entropy of symbolic sequences: The role of correlations, Europhys. Lett.
- Chaos and Information Processing
- G+C content evolution in the human genome
- Tandem repeats finder: a program to analyze DNA sequences, Nucl. Acids Res.
- Database of periodic DNA regions in major genomes, BioMed. Res. Int.
Cited by (12)
A new method to study genome mutations using the information entropy
2021, Physica A: Statistical Mechanics and its Applications
Spatial constrains and information content of sub-genomic regions of the human genome
2021, iScience
Citation Excerpt: It should be mentioned, however, that although different methods have been used to analyze the DNA and its information content, they show some commonalities in their general findings. As a general trend, they distinguish between different structural regions of the genome, and differentiate between coding and non-coding regions of DNA (Karakatsanis et al., 2018, Thanos et al., 2018). The projection of the dynamics to the statistics in the phase space develops a complete picture that integrated to the variations of the complexity metrics.
Statistical physics approaches to the complex Earth system
2021, Physics Reports
Citation Excerpt: There exist also many other types of entropy, such as Gibbs, Residual, Approximate, Sinai–Kolmogorov, Sample, Multiscale. Entropy has been proven useful in many real-world systems, including analysis of DNA sequences [233], cosmology and astrophysics [234–236], economics [237,238], and climate systems [239–241]. Each definition of entropy could give better results for some systems but fails for others.
Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome
2019, Gene
Citation Excerpt: Here we plot the reciprocal of JS over chromosome positions, such that peaks indicate nuclear regions that are similar to the mitochondrial sequences. On the other hand, when JS is plotted instead of 1/JS, the peaks usually indicate centromeres, telomeres, and other low-complexity regions (plot not shown), as these contain distinct repeating patterns (e.g., Thanos et al., 2018). Comparing to other graphic illustration of NUMTs (e.g., Fig. 2 of Woischnik and Moraes, 2002), our approach displays not only the chromosomal location, but also strength of the signal. The overlap of our novel NUMT predictions with LTR annotations requires some specific discussion.
Quantifying local randomness in human DNA and RNA sequences using Erdös motifs
2019, Journal of Theoretical Biology
Citation Excerpt: We first use the same file to bracket the centromere region (when the band is labeled as “acen”). Then we further fine-tune the boundary by an observation made in Thanos et al. (2018) that windowed statistical qualities (e.g. entropy) have extremely low variations in the centromere region. For R/Y binarization, out of 2.911 billions overlapping 10-mers in the human genome (chromosomes 1-22,X, excluding any 10-mers which contain unsequenced bases) there are 6,161,338 counts of R/Y E1, 10, or 0.21% of all 10-mer counts.
Automated detection of colon cancer using genomic signal processing
2021, Egyptian Journal of Medical Human Genetics