Abstract
Identification and functional characterization of the genes in the human genome remain a major challenge. A principal source of publicly available information used for this purpose is the National Center for Biotechnology Information database of expressed sequence tags (dbEST), which contains over 4 million human ESTs. To extract the information buried in this data more effectively, we have developed a semiautomated method to mine dbEST for uncharacterized human genes. Starting with a single protein input sequence, a family of related proteins from all species is compiled. This entire family is then used to mine the human EST database for new gene candidates. Evaluation of putative new gene candidates in the context of a family of characterized proteins provides a framework for inference of the structure and function of the new genes. When applied to a test data set of 28 families within the major facilitator superfamily (MFS) of membrane transporters, our protocol found 73 previously characterized human MFS genes and 43 new MFS gene candidates. Development of this approach provided insights into the problems and pitfalls of automated data mining using public databases.
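The family-based mining workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation: a shared k-mer fraction stands in for a real sequence-similarity search, the iterative family expansion is an assumption about how the family is compiled, and all sequences, identifiers, thresholds, and function names (`compile_family`, `mine_ests`) are hypothetical toy values chosen for illustration.

```python
from itertools import product

# Standard genetic code in the canonical TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame(dna):
    """Translate a DNA read in all six reading frames (unknown codons become 'X')."""
    dna = dna.upper()
    strands = (dna, dna.translate(COMPLEMENT)[::-1])  # forward, reverse complement
    return [
        "".join(CODON_TABLE.get(s[i:i + 3], "X") for i in range(f, len(s) - 2, 3))
        for s in strands
        for f in range(3)
    ]


def kmers(seq, k=3):
    """Set of peptide k-mers, skipping any that contain a stop or unknown residue."""
    found = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {w for w in found if "*" not in w and "X" not in w}


def similarity(query, target, k=3):
    """Fraction of the query's k-mers also present in the target (toy score)."""
    q = kmers(query, k)
    return len(q & kmers(target, k)) / len(q) if q else 0.0


def compile_family(seed_id, proteins, threshold=0.5):
    """Grow a family from one seed: add any protein similar to a current member."""
    family, frontier = {seed_id}, [seed_id]
    while frontier:
        member = frontier.pop()
        for pid, seq in proteins.items():
            if pid not in family and similarity(proteins[member], seq) >= threshold:
                family.add(pid)
                frontier.append(pid)
    return family


def mine_ests(family_seqs, ests, threshold=0.8):
    """Return ESTs whose best six-frame translation matches any family member."""
    hits = {}
    for est_id, dna in ests.items():
        best = max(similarity(pep, fam)
                   for pep in six_frame(dna) for fam in family_seqs)
        if best >= threshold:
            hits[est_id] = best
    return hits


# Hypothetical toy data: a seed protein, one homolog, one unrelated protein.
PROTEINS = {
    "seed": "MKTLLVAGGSWQ",
    "homolog": "MKTLLVAGGTWQ",
    "unrelated": "PPPPHHHHDDDD",
}
ESTS = {
    "est_hit": "CTGCTGGTGGCCGGCGGCACCTGG",    # encodes a fragment of the homolog
    "est_noise": "ATATATATATATATATATATATAT",  # no resemblance to the family
}
# The same read sequenced from the opposite strand.
ESTS["est_rev"] = ESTS["est_hit"].translate(COMPLEMENT)[::-1]

family = compile_family("seed", PROTEINS)
hits = mine_ests([PROTEINS[p] for p in family], ESTS)
print(sorted(family), sorted(hits))
```

Searching with the whole family rather than the seed alone is the point of the design: in this toy data the EST matches the homolog more closely than the seed, so a seed-only search at the same threshold would miss it.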
Published: January 14, 2003
Cite this article
Brown, S., Chang, J.L., Sadee, W. et al. A semiautomated approach to gene discovery through expressed sequence tag data mining: Discovery of new human transporter genes. AAPS PharmSci 5, 1 (2003). https://doi.org/10.1208/ps050101
DOI: https://doi.org/10.1208/ps050101