Word organization in coding DNA: A mathematical model

Theory in Biosciences

Abstract

This article examines the relationship between vocabulary (the total number of distinct oligomers, or “words”) and text length (the total number of oligomers, or “words”) in a coding DNA sequence (CDS). For natural human languages, Heaps established a mathematical formula, known as Heaps' law, that relates vocabulary to text length. Our analysis shows that Heaps' law fails to model this relationship for CDSs. Here we develop a mathematical model of the relationship between the number of word types (vocabulary) and the number of words sampled (text length) for CDSs, when non-overlapping nucleotide strings of equal length are treated as words. We use the hyperbolic tangent function, which captures the saturation property of vocabulary. Based on the parameters of the model, we formulate a mathematical equation, termed the “equation of word organization”, whose parameters indicate that the nucleotide organization of coding sequences differs from one sequence to another. We also compare the word organization of CDSs with a random word distribution and conclude that a CDS resembles neither a natural human language nor a random word distribution. Moreover, each sequence has its own nucleotide organization, which is fully structured for specific biological functioning.
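
As an illustration of the quantities discussed above, the following minimal Python sketch counts vocabulary against text length when a sequence is split into non-overlapping words of equal length, and defines two candidate growth curves. Heaps' law is written in its usual form V = K·n^β. The saturating curve `tanh_saturation`, its parameters `v_max` and `b`, all function names, and the toy sequence are assumptions made for illustration only; the abstract states that a hyperbolic tangent function is used but does not give the model's exact form.

```python
import math


def vocabulary_growth(sequence, k=3):
    """Return [(text_length, vocabulary), ...] for non-overlapping k-letter words.

    text_length is the number of words read so far; vocabulary is the number
    of distinct words among them. A trailing partial word is discarded.
    """
    seen = set()
    curve = []
    n_words = len(sequence) // k
    for i in range(n_words):
        word = sequence[i * k:(i + 1) * k]
        seen.add(word)
        curve.append((i + 1, len(seen)))
    return curve


def heaps_law(n, K, beta):
    """Heaps' law in its usual form: vocabulary grows as K * n**beta."""
    return K * n ** beta


def tanh_saturation(n, v_max, b):
    """Hypothetical saturating curve (an assumption, not the paper's equation).

    The value approaches v_max as n grows, since tanh tends to 1. For words of
    length k over four nucleotides, vocabulary cannot exceed 4**k.
    """
    return v_max * math.tanh(b * n)


# Toy sequence (made up for illustration): ATG GCC ATG GCC TAA
toy_cds = "ATGGCCATGGCCTAA"
curve = vocabulary_growth(toy_cds, k=3)
```

For the toy sequence, the third and fourth words repeat earlier ones, so vocabulary stays at 2 until the final word is read. A power law such as Heaps' law grows without bound, whereas a curve built on the hyperbolic tangent levels off, which is the saturation property the abstract refers to.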

References

  • Blaisdell, E.B., Campbell, A.M., Karlin, S., 1996. Similarities and dissimilarities of phage genomes. Proc. Natl. Acad. Sci. USA 93, 5854–5859.

  • Bulyk, M.L., Gentalen, E., Lockhart, D.J., Church, G.M., 1999. Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat. Biotechnol. 17, 573–577.

  • Dyer, B.D., LeBlanc, M.D., Benz, S., Cahalan, P., Donorfi, B., Sagui, P., Villa, A., Williams, G., 2004. A DNA motif lexicon: cataloguing and annotating sequences. In Silico Bio. 4.

  • Heaps, H.S., 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York.

  • Li, W., 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inform. Theory 38 (6), 1842–1845.

  • Li, W., Kaneko, K., 1992. Long-range correlation and partial 1/f spectrum in a non-coding DNA sequence. Europhys. Lett. 17, 655–660.

  • Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.-K., Simons, M., Stanley, H.E., 1994. Linguistic features of non-coding DNA sequences. Phys. Rev. Lett. 73, 3169–3172.

  • Peng, C.-K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simons, M., Stanley, H.E., 1992. Long-range correlations in nucleotide sequences. Nature 356, 168–170.

  • Rao, C.R., Toutenburg, H., 1999. Linear Models: Least Squares and Alternatives, second ed., Springer, New York.

  • Searle, S.R., Casella, G., McCulloch, C.E., 1992. Variance Components. Wiley, New York.

  • Shields, D.C., Sharp, P.M., Higgins, D.G., Wright, F., 1988. “Silent” sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5, 704–716.

  • Som, A., Chattopadhyay, S., Chakrabarti, J., Bandyopadhyay, D., 2001. Codon distributions in DNA. Phys. Rev. E 63, 0519081.

  • Som, A., Sahoo, S., Chakrabarti, J., 2003a. Coding DNA sequences: statistical distributions. Math. Biosci. 183, 49–61.

  • Som, A., Sahoo, S., Mukhopadhyay, I., Chakrabarti, J., Chaudhury, R., 2003b. Scaling violations in coding DNA. Europhys. Lett. 62 (2), 271–277.

  • Trifonov, E.N., Bettecken, T., 1997. Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205 (1–2), 1–6.

Author information

Correspondence to Indranil Mukhopadhyay.

Additional information

IM and AS contributed equally to this work.

About this article

Cite this article

Mukhopadhyay, I., Som, A. & Sahoo, S. Word organization in coding DNA: A mathematical model. Theory Biosci. 125, 1–17 (2006). https://doi.org/10.1016/j.thbio.2006.03.002
