Abstract
This article examines the relationship between vocabulary (the number of distinct oligomers, or “words”) and text length (the total number of oligomers, or “words”) in a coding DNA sequence (CDS). For natural human languages, Heaps established a mathematical formula, known as Heaps' law, that relates vocabulary to text length. Our analysis shows that Heaps' law fails to model this relationship for CDSs. We therefore develop a mathematical model relating the number of word types (vocabulary) to the number of words sampled (text length) for CDSs, where non-overlapping nucleotide strings of equal length are treated as words. The model uses the hyperbolic tangent function, which captures the saturation of vocabulary. From the parameters of the model we formulate an equation, referred to as the “equation of word organization”, whose parameters indicate that the nucleotide organization of coding sequences differs from one sequence to another. We also compare the word organization of CDSs with a random word distribution and conclude that a CDS resembles neither a natural human language nor a random sequence. Moreover, each sequence has its own nucleotide organization, which is fully structured for specific biological functioning.
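To make the two quantities concrete, the following minimal sketch counts vocabulary against text length for non-overlapping words of a fixed length. It also gives two candidate growth curves: the standard form of Heaps' law, V = K·n^β, and a generic saturating tanh curve. The function names, the toy sequence, and the two-parameter tanh form `v_max * tanh(rate * n)` are assumptions made for illustration; the abstract does not state the exact parametrization used in the article.

```python
import math


def vocabulary_growth(sequence, word_length=3):
    """Return (text_length, vocabulary) pairs for non-overlapping words.

    text_length is the number of words read so far; vocabulary is the
    number of distinct words among them. Trailing nucleotides that do not
    fill a whole word are ignored.
    """
    seen = set()
    growth = []
    usable = len(sequence) - len(sequence) % word_length
    for start in range(0, usable, word_length):
        seen.add(sequence[start:start + word_length])
        growth.append((start // word_length + 1, len(seen)))
    return growth


def heaps_law(n, K, beta):
    """Heaps' law in its standard form: vocabulary grows as K * n**beta."""
    return K * n ** beta


def tanh_saturation(n, v_max, rate):
    """Hypothetical saturating curve: approaches v_max as n grows.

    Illustrative only; not necessarily the article's own equation.
    """
    return v_max * math.tanh(rate * n)


if __name__ == "__main__":
    toy_cds = "ATGGCCATGGCCTAA"  # made-up sequence, not real data
    for n, v in vocabulary_growth(toy_cds):
        print(n, v)
```

With an alphabet of four nucleotides, words of length k can take at most 4^k distinct values (64 for triplets), so the vocabulary must level off as more words are sampled. A power law such as Heaps' law grows without bound, whereas a tanh-shaped curve has a finite ceiling.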
References
Blaisdell, E.B., Campbell, A.M., Karlin, S., 1996. Similarities and dissimilarities of phage genomes. Proc. Natl. Acad. Sci. USA 93, 5854–5859.
Bulyk, M.L., Gentalen, E., Lockhart, D.J., Church, G.M., 1999. Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat. Biotechnol. 17, 573–577.
Dyer, B.D., LeBlanc, M.D., Benz, S., Cahalan, P., Donorfi, B., Sagui, P., Villa, A., Williams, G., 2004. A DNA motif lexicon: cataloguing and annotating sequences. In Silico Biol. 4.
Heaps, H.S., 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York.
Li, W., 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inform. Theory 38 (6), 1842–1845.
Li, W., Kaneko, K., 1992. Long-range correlation and partial 1/f spectrum in a non-coding DNA sequence. Europhys. Lett. 17, 655–660.
Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.-K., Simons, M., Stanley, H.E., 1994. Linguistic features of non-coding DNA sequences. Phys. Rev. Lett. 73, 3169–3172.
Peng, C.-K., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Sciortino, F., Simons, M., Stanley, H.E., 1992. Long-range correlations in nucleotide sequences. Nature 356, 168–170.
Rao, C.R., Toutenburg, H., 1999. Linear Models: Least Squares and Alternatives, second ed., Springer, New York.
Searle, S.R., Casella, G., McCulloch, C.E., 1992. Variance Components. Wiley, New York.
Shields, D.C., Sharp, P.M., Higgins, D.G., Wright, F., 1988. “Silent” sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5, 704–716.
Som, A., Chattopadhyay, S., Chakrabarti, J., Bandyopadhyay, D., 2001. Codon distributions in DNA. Phys. Rev. E 63, 0519081.
Som, A., Sahoo, S., Chakrabarti, J., 2003a. Coding DNA sequences: statistical distributions. Math. Biosci. 183, 49–61.
Som, A., Sahoo, S., Mukhopadhyay, I., Chakrabarti, J., Chaudhury, R., 2003b. Scaling violations in coding DNA. Europhys. Lett. 62 (2), 271–277.
Trifonov, E.N., Bettecken, T., 1997. Sequence fossils, triplet expansion, and reconstruction of earliest codons. Gene 205 (1–2), 1–6.
Additional information
IM and AS contributed equally to this work.
About this article
Cite this article
Mukhopadhyay, I., Som, A. & Sahoo, S. Word organization in coding DNA: A mathematical model. Theory Biosci. 125, 1–17 (2006). https://doi.org/10.1016/j.thbio.2006.03.002
DOI: https://doi.org/10.1016/j.thbio.2006.03.002