Abstract
In total, 472 288 regions of triplet periodicity were found in 578 868 genes from the KEGG databank (version 29) and classified. A new concept, the triplet periodicity class, and a measure of similarity between periodicity classes were introduced. Overall, 2520 classes were created; they contained 94% of the triplet periodicity cases found. For 92% of the triplet periodicity regions contained in the classes, the correlation between triplet periodicity and reading frame was the same within a class. The remaining triplet periodicity regions displayed a shift of the reading frame relative to that common for the majority of genes belonging to the same triplet periodicity class. Hypothetical amino acid sequences were deduced from the periodicity regions according to the reading frame characteristic of the given triplet periodicity class. BLAST analysis demonstrated that 2660 hypothetical amino acid sequences display statistically significant similarity to proteins from the UniProt databank. It was suggested that 8% of the triplet periodicity regions contained in the classes have frameshift mutations. The triplet periodicity classes can be used to identify coding regions in genes and to search for frameshift mutations.
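As a rough illustration of the idea, triplet periodicity can be summarized by a 3 x 4 matrix that counts each base at each position modulo 3, and a reading-frame shift then appears as a cyclic shift of the matrix rows. The Python sketch below is a minimal example under these assumptions; it is not the procedure used in the article, and the function names (periodicity_matrix, mutual_information, best_frame_shift) as well as the squared-difference comparison are hypothetical choices made for illustration only.

```python
import math

BASES = "ACGT"


def periodicity_matrix(seq):
    """Count bases by position modulo 3: rows are positions 0-2, columns A, C, G, T."""
    counts = [[0] * 4 for _ in range(3)]
    for i, base in enumerate(seq.upper()):
        if base in BASES:  # skip ambiguous symbols such as N
            counts[i % 3][BASES.index(base)] += 1
    return counts


def mutual_information(matrix):
    """Mutual information (bits) between position modulo 3 and base identity."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(3)) for c in range(4)]
    mi = 0.0
    for r in range(3):
        for c in range(4):
            n = matrix[r][c]
            if n:
                mi += (n / total) * math.log2(n * total / (row_sums[r] * col_sums[c]))
    return mi


def best_frame_shift(matrix, reference):
    """Return the cyclic row shift (0, 1 or 2) that best aligns matrix with reference.

    Assumption: similarity is measured by the sum of squared differences
    between the normalized matrices, chosen here only for simplicity.
    """
    def normalize(m):
        total = sum(sum(row) for row in m) or 1
        return [[v / total for v in row] for row in m]

    def distance(a, b):
        return sum((a[r][c] - b[r][c]) ** 2 for r in range(3) for c in range(4))

    m, ref = normalize(matrix), normalize(reference)
    return min(range(3), key=lambda s: distance([m[(r + s) % 3] for r in range(3)], ref))


# Toy data: a perfectly periodic sequence and the same sequence read one base later.
reference = periodicity_matrix("ATG" * 30)
shifted = periodicity_matrix("TGA" * 30)
print(mutual_information(reference))
print(best_frame_shift(shifted, reference))
```

A nonzero shift returned for a region, relative to the matrix typical of its class, would correspond to the frame-shifted cases mentioned in the abstract; real data would also require a statistical significance estimate, which this sketch omits.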
Additional information
Original Russian Text © F.E. Frenkel, E.V. Korotkov, 2008, published in Molekulyarnaya Biologiya, 2008, Vol. 42, No. 4, pp. 707–720.
About this article
Cite this article
Frenkel, F.E., Korotkov, E.V. Classification of triplet periodicity in the DNA sequences of genes from KEGG databank. Mol Biol 42, 629–640 (2008). https://doi.org/10.1134/S0026893308040201
DOI: https://doi.org/10.1134/S0026893308040201