Large scale bacterial gene discovery by similarity search

Abstract

DNA sequencing efforts frequently uncover genes other than the targeted ones. We have used rapid database scanning methods to search for undescribed eubacterial and archaean protein coding frames in regions flanking known genes. By searching all prokaryotic DNA sequences not marked as coding for proteins or stable RNAs against the protein databases, we have identified more than 450 new examples of bacterial proteins, as well as a smaller number of possible revisions to known proteins, at a surprisingly high rate of one new protein or revision for every 24 initial DNA sequences or 8,300 nucleotides examined. Seven proteins are members of families which have not been described in prokaryotic sequences. We also describe 49 re-interpretations of existing sequence data of particular biological significance.
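The search strategy summarized in the abstract (take the regions of a DNA entry that carry no protein or stable RNA annotation, then compare them with protein databases) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0-based half-open interval convention, the use of the standard genetic code, and the `min_len` threshold are all assumptions made for illustration, and the database comparison itself is left out. The sketch only shows how unannotated regions might be extracted and translated in all six reading frames before being handed to a similarity search tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
# Assumption: some prokaryotes use variant codes, which this sketch ignores.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def unannotated_regions(seq_len, features):
    """Return 0-based half-open intervals not covered by any annotated feature.

    `features` is a hypothetical list of (start, end) pairs for annotated
    protein-coding or stable RNA regions.
    """
    gaps, cursor = [], 0
    for start, end in sorted(features):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < seq_len:
        gaps.append((cursor, seq_len))
    return gaps


def reverse_complement(dna):
    """Reverse-complement an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate both strands in all three frames, keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def open_frames(protein, min_len=50):
    """Split a frame translation at stop codons and keep the long segments."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]
```

In use, each interval returned by `unannotated_regions` would be sliced out of the entry, passed through `six_frame_translations`, and the resulting peptide segments compared against a protein database. Translating first and comparing at the protein level is what lets such a search detect distant relatives that nucleotide comparison would miss.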




Cite this article

Robison, K., Gilbert, W. & Church, G. Large scale bacterial gene discovery by similarity search. Nat Genet 7, 205–214 (1994). https://doi.org/10.1038/ng0694-205


  • DOI: https://doi.org/10.1038/ng0694-205
