Abstract
We apply methods from statistical physics (histograms, correlation functions, fractal dimensions, and singularity spectra) to characterize the large-scale structure of the distribution of nucleotides along genomic sequences. We discuss the role that the extent of noncoding segments (“junk DNA”) plays in genomic organization, and the connection between the distribution of coding segments and chromatin condensation in higher eukaryotes. The following sequences, taken from GenBank, were analyzed: the complete genome of Xanthomonas campestris, the complete genome of yeast, chromosome V of Caenorhabditis elegans, and the region of human chromosome XVII around the BRCA1 gene. The results are compared with those for random and periodic sequences and for sequences generated by simple and generalized fractal Cantor sets.
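As a rough illustration of the kind of reference comparison mentioned in the abstract, the sketch below builds a binary sequence from the classic middle-third Cantor construction and estimates its box-counting dimension. It is not the authors' implementation: the function names `cantor_sequence` and `box_counting_dimension`, the choice of box sizes (powers of 3), and the least-squares fit are all assumptions made for this example.

```python
import math


def cantor_sequence(level):
    """Binary sequence of length 3**level; 1 marks a site kept by the
    middle-third Cantor construction, 0 marks a removed site."""
    seq = [1]
    for _ in range(level):
        # Each kept site becomes kept-gap-kept; each gap becomes three gaps.
        seq = [bit for s in seq for bit in ((1, 0, 1) if s else (0, 0, 0))]
    return seq


def box_counting_dimension(seq, level):
    """Least-squares slope of log(occupied boxes) against log(1 / relative
    box size), using aligned boxes of size 3**k for k = 0 .. level - 1.
    Assumes len(seq) == 3**level and level >= 2."""
    n = len(seq)
    xs, ys = [], []
    for k in range(level):
        size = 3 ** k
        # A box is occupied if it contains at least one marked site.
        occupied = sum(1 for i in range(0, n, size) if any(seq[i:i + size]))
        xs.append(math.log(n / size))
        ys.append(math.log(occupied))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    level = 6
    print(box_counting_dimension(cantor_sequence(level), level))
    # A fully occupied sequence of the same length, for comparison.
    print(box_counting_dimension([1] * 3 ** level, level))
```

For the Cantor sequence the fitted slope equals log 2 / log 3 (about 0.63), while a fully occupied sequence gives 1. In a genomic setting the marked sites could stand for, say, positions inside coding segments, but that mapping is an assumption of this sketch rather than something specified in the abstract.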
Cite this article
Oiwa, N.N., Goldman, C. On the analysis of large-scale genomic structures. Cell Biochem Biophys 42, 145–165 (2005). https://doi.org/10.1385/CBB:42:2:145
DOI: https://doi.org/10.1385/CBB:42:2:145