Word organization in coding DNA: A mathematical model

Mukhopadhyay, Indranil; Som, Anup; Sahoo, Satyabrata

doi:10.1016/j.thbio.2006.03.002

Published: August 2006

Volume 125, pages 1–17 (2006)

Theory in Biosciences

Indranil Mukhopadhyay¹,
Anup Som² &
Satyabrata Sahoo³

Abstract

This article examines the relationship between vocabulary (the number of distinct oligomers, or “words”) and text-length (the total number of oligomers, or “words”) in a coding DNA sequence (CDS). For natural human languages, Heaps established a mathematical formula, known as Heaps' law, that relates vocabulary to text-length. Our analysis shows that Heaps' law fails to model this relationship for CDSs. We therefore develop a mathematical model relating the number of distinct word types (vocabulary) to the number of words sampled (text-length) for CDSs, where non-overlapping nucleotide strings of equal length are treated as words. The model uses the hyperbolic tangent function, which captures the saturation of vocabulary. From the parameters of the model we formulate an equation, termed the “equation of word organization”, whose parameters indicate that the nucleotide organization of coding sequences differs from one sequence to another. We also compare the word organization of CDSs with a random word distribution and conclude that a CDS resembles neither a natural human language nor a random sequence. Moreover, each sequence has its own nucleotide organization, which is fully structured for a specific biological function.
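The contrast drawn in the abstract can be illustrated with a minimal sketch. It counts vocabulary against text-length for non-overlapping words and sets the unbounded power law of Heaps against a saturating hyperbolic tangent curve. The abstract does not state the authors' fitted equation, so the form V(n) = V_max · tanh(b·n), the function names, and the parameters K, beta and b are illustrative assumptions only; the one fixed fact is that words of length k over four nucleotides allow at most 4^k distinct types.

```python
import math


def vocabulary_growth(seq, word_len=3):
    """Return (text_length, vocabulary) pairs for non-overlapping words.

    The sequence is cut into consecutive words of `word_len` nucleotides;
    a trailing fragment shorter than `word_len` is ignored.
    """
    seen = set()
    growth = []
    usable = len(seq) - len(seq) % word_len
    for start in range(0, usable, word_len):
        seen.add(seq[start:start + word_len])
        growth.append((start // word_len + 1, len(seen)))
    return growth


def heaps_law(n, K, beta):
    """Heaps' law: vocabulary grows as K * n**beta, without an upper bound."""
    return K * n ** beta


def tanh_model(n, word_len=3, b=0.05):
    """Assumed saturating form (not the paper's fitted equation).

    Vocabulary approaches the ceiling 4**word_len as n grows.
    """
    v_max = 4 ** word_len
    return v_max * math.tanh(b * n)


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call pattern.
    toy = "ATGGCCATGGCGCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA"
    for n, v in vocabulary_growth(toy):
        print(n, v, round(heaps_law(n, 1.0, 0.9), 2), round(tanh_model(n), 2))
```

For triplet words the ceiling is 64, so any power law with a positive exponent must eventually overshoot the observed vocabulary, whereas the saturating curve cannot.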

References

Blaisdell, E.B., Campbell, A.M., Karlin, S., 1996. Similarities and dissimilarities of phage genomes. Proc. Natl. Acad. Sci. USA 93, 5854–5859.
Bulyk, M.L., Gentalen, E., Lockhart, D.J., Church, G.M., 1999. Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat. Biotechnol. 17, 573–577.
Dyer, B.D., LeBlanc, M.D., Benz, S., Cahalan, P., Donorfi, B., Sagui, P., Villa, A., Williams, G., 2004. A DNA motif lexicon: cataloguing and annotating sequences. In Silico Biol. 4.
Heaps, H.S., 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York.
Li, W., 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inform. Theory 38 (6), 1842–1845.
Li, W., Kaneko, K., 1992. Long-range correlation and partial 1/f spectrum in a non-coding DNA sequence. Europhys. Lett. 17, 655–660.
Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.-K., Simons, M., Stanley, H.E., 1994. Linguistic features of non-coding DNA sequences. Phys. Rev. Lett. 73, 3169–3172.
Peng, C.-K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simons, M., Stanley, H.E., 1992. Long-range correlations in nucleotide sequences. Nature 356, 168–170.
Rao, C.R., Toutenburg, H., 1999. Linear Models: Least Squares and Alternatives, second ed. Springer, New York.
Searle, S.R., Casella, G., McCulloch, C.E., 1992. Variance Components. Wiley, New York.
Shields, D.C., Sharp, P.M., Higgins, D.G., Wright, F., 1988. “Silent” sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5, 704–716.
Som, A., Chattopadhyay, S., Chakrabarti, J., Bandyopadhyay, D., 2001. Codon distributions in DNA. Phys. Rev. E 63, 0519081.
Som, A., Sahoo, S., Chakrabarti, J., 2003a. Coding DNA sequences: statistical distributions. Math. Biosci. 183, 49–61.
Som, A., Sahoo, S., Mukhopadhyay, I., Chakrabarti, J., Chaudhury, R., 2003b. Scaling violations in coding DNA. Europhys. Lett. 62 (2), 271–277.
Trifonov, E.N., Bettecken, T., 1997. Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205 (1–2), 1–6.

Author information

Authors and Affiliations

Department of Human Genetics, University of Pittsburgh, 15261, Pittsburgh, PA, USA
Indranil Mukhopadhyay
Center for Evolutionary Functional Genomics, The Biodesign Institute, Arizona State University, 85287-5301, Tempe, AZ, USA
Anup Som
Department of Physics, Raidighi College, WB-743383, Raidighi, India
Satyabrata Sahoo

Corresponding author

Correspondence to Indranil Mukhopadhyay.

Additional information

IM and AS contributed equally to this work.

About this article

Cite this article

Mukhopadhyay, I., Som, A. & Sahoo, S. Word organization in coding DNA: A mathematical model. Theory Biosci. 125, 1–17 (2006). https://doi.org/10.1016/j.thbio.2006.03.002

Received: 17 February 2006
Accepted: 07 March 2006
Issue Date: August 2006
DOI: https://doi.org/10.1016/j.thbio.2006.03.002
