A semiautomated approach to gene discovery through expressed sequence tag data mining: Discovery of new human transporter genes

AAPS PharmSci

Abstract

Identification and functional characterization of the genes in the human genome remain a major challenge. A principal source of publicly available information used for this purpose is the National Center for Biotechnology Information database of expressed sequence tags (dbEST), which contains over 4 million human ESTs. To extract the information buried in this data more effectively, we have developed a semiautomated method to mine dbEST for uncharacterized human genes. Starting with a single protein input sequence, a family of related proteins from all species is compiled. This entire family is then used to mine the human EST database for new gene candidates. Evaluation of putative new gene candidates in the context of a family of characterized proteins provides a framework for inference of the structure and function of the new genes. When applied to a test data set of 28 families within the major facilitator superfamily (MFS) of membrane transporters, our protocol found 73 previously characterized human MFS genes and 43 new MFS gene candidates. Development of this approach provided insights into the problems and pitfalls of automated data mining using public databases.
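The two-stage workflow described in the abstract (compile a protein family from one input sequence, then use the whole family to search human ESTs) can be illustrated with a minimal sketch. It is not the authors' implementation. The k-mer containment score is a toy stand-in for a real alignment search, and all sequences, names (`fam_A_yeast`, `EST_1`), and thresholds are hypothetical. The sketch also does not separate previously characterized genes from new candidates.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Translate a nucleotide sequence in all six reading frames."""
    dna = dna.upper()
    strands = (dna, dna.translate(COMPLEMENT)[::-1])  # forward, reverse complement
    frames = []
    for strand in strands:
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases are translated as "X".
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def kmers(protein, k=3):
    """Set of overlapping k-residue words in a protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def containment(query, target, k=3):
    """Fraction of the query's k-mers found in the target (toy similarity score)."""
    q = kmers(query, k)
    return len(q & kmers(target, k)) / len(q) if q else 0.0


def compile_family(seed, proteins, threshold=0.3):
    """Stage 1: collect proteins from any species that resemble the input sequence."""
    return {name: seq for name, seq in proteins.items()
            if containment(seed, seq) >= threshold}


def mine_ests(family, ests, threshold=0.5):
    """Stage 2: report ESTs whose best translated frame matches any family member."""
    hits = {}
    for est_id, dna in ests.items():
        best = max(
            ((containment(frame, member), name)
             for frame in six_frame_translations(dna)
             for name, member in family.items()),
            default=(0.0, None),  # empty family: nothing can match
        )
        if best[0] >= threshold:
            hits[est_id] = best  # (score, best-matching family member)
    return hits


# Hypothetical toy data, invented for illustration only.
seed = "LSKTWFPAVL"
proteins = {
    "fam_A_yeast": "MGLSKTWFPAVLG",
    "fam_B_mouse": "MALSKTWYPAVLA",
    "unrelated": "MDDEEQRHNCC",
}
ests = {
    "EST_1": "TCTAAAACTTGGTTTCCTGCTGTT",  # encodes the fragment SKTWFPAV
    "EST_2": "GGGGGGGGGGGGGGGGGGGGGGGG",  # no similarity to the family
}

family = compile_family(seed, proteins)
hits = mine_ests(family, ests)
print(sorted(family), hits)
```

Searching with every family member rather than the input sequence alone is the point of the second stage: an EST that resembles only a distant relative of the input can still be recovered. In practice, a translated alignment search would replace the k-mer score, and candidate ESTs would need assembly and manual evaluation, which is why the approach is described as semiautomated.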

Author information

Corresponding author

Correspondence to Patricia C. Babbitt.

Additional information

Published: January 14, 2003

Cite this article

Brown, S., Chang, J.L., Sadee, W. et al. A semiautomated approach to gene discovery through expressed sequence tag data mining: Discovery of new human transporter genes. AAPS PharmSci 5, 1 (2003). https://doi.org/10.1208/ps050101

  • DOI: https://doi.org/10.1208/ps050101
