User's Guide
Published: September 2003

Question 10 For a given protein, how can one determine whether it contains any functional domains of interest? What other proteins contain the same functional domains as this protein? How can one determine whether there is a similarity to other proteins, not only at the sequence level, but also at the structural level?

Nature Genetics volume 35, pages 57–62 (2003)


To demonstrate how to find functional domains within a protein, the human testis-determining factor TDF, also known as the sex-determining protein SRY, will be used as an example.

Although the search could be started from the Entrez search box on the NCBI home page, a better starting point is LocusLink¹⁰. LocusLink standardizes gene and protein names and cross-references them appropriately, making it more likely that the correct protein will be found on the first attempt. From the NCBI home page at http://www.ncbi.nlm.nih.gov/, choose LocusLink from the pull-down menu in the upper left corner, type the gene name, 'TDF', into the query box, and click Go. Four loci are returned (Fig. 10.1). The first column gives the Locus ID, a stable identifier associated with that gene locus. Clicking on the Locus ID produces a LocusLink report view; more detailed information on the report view can be found in the LocusLink Help feature and in the literature¹⁵. The second column, marked Org, gives a shorthand version of the organism name. Here, there is one entry each from Drosophila (Dm), mouse (Mm), human (Hs) and rat (Rn). A series of alphabet blocks to the right of each entry provides jumping-off points to other database resources. The locus of interest here is the fourth entry in the list, because that is the one for the human form of TDF/SRY. To find additional information on the protein, click on the second P (in green) on that line. This takes the user to the protein entries corresponding to that particular LocusLink entry (Fig. 10.2). At this point, the user can click on any of the hyperlinks to look at the raw database information available on any of the proteins listed.
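For readers who prefer to script this lookup, the following is a minimal sketch rather than part of the original walkthrough. It assumes NCBI's E-utilities `esearch` endpoint and the Entrez `gene` database, which later took over the role that LocusLink plays here; the helper name `build_gene_search_url` and the query wording are illustrative choices. The code only builds the request URL, so nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(symbol: str, organism: str, retmax: int = 20) -> str:
    """Return an esearch URL that looks up a gene symbol in one organism."""
    # Field tags restrict each word to the gene-name and organism indexes.
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# SRY is the other name given in the text for TDF.
url = build_gene_search_url("SRY", "Homo sapiens")
query = parse_qs(urlsplit(url).query)
print(query["term"][0])
```

Restricting the query by organism plays the same role as picking the human (Hs) row out of the four loci returned in the browser.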

Consider the last entry in the list, an NCBI Reference Protein sequence with accession number NP_003131. To the right of the accession number is a series of hyperlinks. Clicking on the link labeled BLink will take the user to the BLink page for the protein of interest (Fig. 10.3). BLink stands for 'BLAST Link' and provides the graphical results of pre-computed BLAST searches that have been performed not just for this protein sequence, but for every protein sequence within the Entrez Proteins data domain. The pre-computed BLAST results for TDF/SRY are shown in the section beginning with the label '204 aa'. Across the top are a number of buttons that allow the user to ask a series of questions regarding their protein of interest. As the object of this question is to find the protein domains present within the TDF/SRY protein, the user can click on CDD-Search (Conserved Domain Database Search¹⁸). Doing this will produce a graphical overview of any domains present within the protein, as well as a sequence alignment of those domains with the query sequence (Fig. 10.4). In this case, one functional domain is found: an HMG box, which is a DNA-binding domain found in many nuclear proteins. The domain was found in all of the databases comprising CDD (Pfam, SMART, and COG), as can be seen by looking at the accession numbers in the hit list.
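A domain search starts from the protein sequence itself, so it can be useful to retrieve the record and confirm its length before going further. The sketch below is an illustration under stated assumptions: it uses the E-utilities `efetch` endpoint with `rettype=fasta`, and the helper names `build_fasta_url` and `parse_fasta` are invented for this example. The URL is only constructed, not requested, and the parser is exercised on a made-up toy record rather than on the real NP_003131 sequence.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities fetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession: str) -> str:
    """Return an efetch URL asking for a protein record in FASTA format."""
    params = {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def parse_fasta(text: str) -> dict[str, str]:
    """Map each record identifier (first word of the header) to its sequence."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


fasta_url = build_fasta_url("NP_003131")

# Made-up toy record, used only to show the parser; not a real protein.
toy = ">toy_protein made-up example\nMKTAYIAK\nQRQISF\n"
records = parse_fasta(toy)
print(len(records["toy_protein"]))
```

Applied to the real record, the reported length could be compared with the '204 aa' label shown on the BLink page.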

To determine which other proteins contain this same HMG-box domain, click on the box labeled Show, right under the graphical view near the top of the page. This invokes the domain architecture retrieval tool (DART). DART shows the functional domains within a protein and, more importantly, other proteins with a similar domain architecture (Fig. 10.5). The query (the HMG box) is shown at the top of the page in red. Every other protein in the NCBI's non-redundant sequence database that has the same domain is shown below the query, with the HMG box again colored red. Other domains within the retrieved proteins are also shown, in various colors and shapes, with a key appearing at the bottom of the web page. Clicking on any of the links to the left provides additional information about these proteins.
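The idea behind a domain-architecture lookup can be sketched in a few lines. This is a conceptual illustration only and makes no claim about how DART is implemented: the protein names, domain names other than the HMG box, and all coordinates below are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    name: str   # domain label
    start: int  # first residue of the hit (1-based)
    end: int    # last residue of the hit


# Hypothetical annotations; every entry here is made up for illustration.
ARCHITECTURES = {
    "toy_query": [DomainHit("HMG_box", 58, 127)],
    "toy_protein_A": [DomainHit("HMG_box", 95, 165), DomainHit("HMG_box", 10, 80)],
    "toy_protein_B": [DomainHit("kinase", 5, 250)],
    "toy_protein_C": [DomainHit("zinc_finger", 20, 45), DomainHit("HMG_box", 100, 170)],
}


def proteins_with_domain(architectures, domain):
    """Return the names of all proteins that carry the given domain."""
    return sorted(
        name
        for name, hits in architectures.items()
        if any(hit.name == domain for hit in hits)
    )


def architecture_string(hits):
    """Describe a protein as its domains in N- to C-terminal order."""
    return "-".join(hit.name for hit in sorted(hits, key=lambda h: h.start))


for name in proteins_with_domain(ARCHITECTURES, "HMG_box"):
    print(name, architecture_string(ARCHITECTURES[name]))
```

Listing the full architecture next to each match mirrors what the graphical view shows: the shared domain plus whatever other domains each retrieved protein carries.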

Although a protein domain has now been identified within the query protein, no in-depth information has yet been provided about the function of that domain. This information could be reached by a circuitous path from the DART page, but an easier method is to use another web-based resource, called InterPro. InterPro is an integrated resource for information about protein families, domains and functional sites, bringing together information from a number of protein domain-based resources, such as PROSITE, PRINTS, Pfam and ProDom¹⁹. The InterPro Simple Search engine can be accessed from the InterPro home page, at http://www.ebi.ac.uk/interpro. Clicking on Text Search, on the left, brings the user to the search page; for this search, type “HMG Box” (with quotes) into the text box and click Search. Two hits are returned (Fig. 10.6). For purposes of this example, follow the link from the first hit, for high mobility group proteins HMG1 and HMG2 (IPR000135). The resulting InterPro summary page (Fig. 10.7) provides information on the function, intracellular location and, most importantly, metabolic role of this particular protein within the cell, in an executive summary format. References are provided at the bottom of the web page for users who want more in-depth information about the domain. Users can also retrieve all of the full-length sequences containing the domain; the reader is referred to the InterPro documentation for more details.
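When hits such as these are copied into notes or spreadsheets, the InterPro accessions can be pulled back out automatically. The sketch below is a small illustration that relies on one fact only, namely that InterPro accessions consist of 'IPR' followed by six digits; the function name `find_interpro_accessions` is invented for this example.

```python
import re

# 'IPR' followed by exactly six digits, not embedded in a longer token.
INTERPRO_ACCESSION = re.compile(r"\bIPR\d{6}\b")


def find_interpro_accessions(text: str) -> list[str]:
    """Return the distinct InterPro accessions in text, in order of appearance."""
    found: list[str] = []
    for accession in INTERPRO_ACCESSION.findall(text):
        if accession not in found:
            found.append(accession)
    return found


note = "high mobility group proteins HMG1 and HMG2 (IPR000135)"
accessions = find_interpro_accessions(note)
print(accessions)
```

The word boundaries keep the pattern from matching malformed identifiers that have too few or too many digits.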

The final part of this question asks whether similarity to the query protein can be found at the structural as well as the sequence level. Answering this question requires a new search against NCBI Structures. From the NCBI home page, change the pull-down menu in the query box at the top of the page to Structure, type 'SRY' in the box and click Go. Seven three-dimensional structures are returned, one of which is 1HRY, the structure of the human SRY–DNA complex solved by NMR. Clicking on the 1HRY hyperlink takes the user to the Structure Summary page for 1HRY. The summary links to more detailed information about chain A (the protein component of the structure), chain B (the nucleotide component of the structure) and the conserved domain (CD) in the protein, obtained through a CDD search. Click on the chain A graphic to get a list of proteins whose known structures have been deemed similar to that of the original SRY protein by a method called VAST; more information on the method and on interpreting the data within the tables can be found elsewhere¹⁵. Here, the SRY protein is shown to have some structural similarity to a fasciculin 2–mouse acetylcholinesterase complex (1MAH), a protein named V-1 Nef (1AVZ), a heat-shock protein of 70 kD (1QQN) and a myosin motor-domain complex (1BR1) (Fig. 10.8). The VAST program quite often reveals similarities between proteins that are not evident from simple BLAST or FASTA searches, so readers are encouraged to employ this and similar tools when trying to answer questions related to protein families.
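The structure identifiers named above can be collected for local inspection in a molecular viewer. The sketch below is an illustration under assumptions not made in the original text: it treats each identifier as a classic four-character PDB code (a digit followed by three letters or digits) and builds download links in the `files.rcsb.org` pattern used by the RCSB Protein Data Bank. The helper names are invented, and no files are actually downloaded.

```python
import re

# Classic PDB identifier: one digit followed by three alphanumeric characters.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def normalize_pdb_ids(raw_ids):
    """Upper-case, validate and de-duplicate identifiers, preserving order."""
    cleaned = []
    for raw in raw_ids:
        candidate = raw.strip().upper()
        if not PDB_ID.fullmatch(candidate):
            raise ValueError(f"not a four-character PDB identifier: {raw!r}")
        if candidate not in cleaned:
            cleaned.append(candidate)
    return cleaned


def pdb_file_url(pdb_id: str) -> str:
    """Return the assumed RCSB download link for one structure."""
    return f"https://files.rcsb.org/download/{pdb_id}.pdb"


# The query structure followed by the four structural neighbors named in the text.
structure_ids = normalize_pdb_ids(["1HRY", "1MAH", "1AVZ", "1QQN", "1BR1"])
for pdb_id in structure_ids:
    print(pdb_file_url(pdb_id))
```

Validating the identifiers first catches a common slip, which is passing a gene or protein name such as 'SRY' where a structure code is expected.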

Accession codes

Accessions

GenBank/EMBL/DDBJ

NP_003131

References

1. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Collins, F.S. & McKusick, V.A. Implications of the Human Genome Project for medical science. J. Am. Med. Assoc. 285, 540–544 (2001).
3. Watson, J.D. & Crick, F.H.C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737–738 (1953).
4. Green, E.D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573–583 (2001).
5. Ouellette, B.F.F. & Boguski, M.S. Database divisions and homology search files: a guide for the perplexed. Genome Res. 7, 952–955 (1997).
6. Bairoch, A. & Apweiler, R. The SWISS-PROT Protein Sequence Database and its supplement TREMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
7. Hubbard, T. et al. The Ensembl Genome Database Project. Nucleic Acids Res. 30, 38–41 (2002).
8. Kent, W.J. BLAT—the BLAST-like Alignment Tool. Genome Res. 12, 656–664 (2002).
9. Stein, L. Genome annotation: from sequence to biology. Nature Rev. Genet. 2, 493–503 (2001).
10. Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001).
11. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
12. Schuler, G.D. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol. 16, 456–459 (1998).
13. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
14. Hamosh, A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30, 52–55 (2002).
15. Baxevanis, A.D. & Ouellette, B.F.F. (eds.) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (John Wiley & Sons, New York, 2001).
16. Solovyev, V.V., Salamov, A.A. & Lawrence, C.B. Identification of human gene structure using linear discriminant functions and dynamic programming. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 367–375 (1995).
17. Yeh, R.F., Lim, L.P. & Burge, C.B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
18. Marchler-Bauer, A. et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 30, 281–283 (2002).
19. Apweiler, R. et al. InterPro—an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16, 1145–1150 (2000).
20. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. & Lancet, D. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656–664 (1998).
21. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A. & Eppig, J.T. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res. 30, 113–115 (2002).
22. Hudson, T.J. et al. A radiation hybrid map of mouse genes. Nature Genet. 29, 201–205 (2001).
23. Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 30, 276–280 (2002).
24. Letunic, I. et al. Recent improvements to the SMART domain–based sequence annotation resource. Nucleic Acids Res. 30, 242–244 (2002).
25. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
26. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, 1998).
27. Peri, S., Ibarrola, N., Blagoev, B., Mann, M. & Pandey, A. Common pitfalls in bioinformatics-based analyses: look before you leap. Trends Genet. 17, 541–545 (2001) [erratum Trends Genet. 18, 218 (2002)].
28. Ponting, C. Issues in predicting protein function from sequence. Brief. Bioinform. 2, 19–29 (2001).
29. Aparicio, S.A.J.R. How to count ... human genes. Nature Genet. 25, 129–130 (2000).
30. Beadle, G.W. & Tatum, E.L. Genetic control of biochemical reactions in Neurospora. Proc. Natl Acad. Sci. USA 27, 499–506 (1941).
31. Jeffery, C.J., Bahnson, B.J., Chien, W., Ringe, D. & Petsko, G.A. Crystal structure of rabbit phosphoglucose isomerase, a glycolytic enzyme that moonlights as neuroleukin, autocrine motility factor, and differentiation mediator. Biochemistry 39, 955–964 (2000).
32. Wistow, G. & Piatigorsky, J. Recruitment of enzymes as lens structural proteins. Science 236, 1554–1556 (1987).
33. Jeffery, C.J. Moonlighting proteins. Trends Biochem. Sci. 24, 8–11 (1999).
34. Chothia, C. Proteins. One thousand families for the molecular biologist. Nature 357, 543–544 (1992).
35. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol. 288, 147–164 (1999).
36. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28, 1481–1488 (2000).
37. Brenner, S.E. Errors in genome annotation. Trends Genet. 15, 132–133 (1999).
38. Smith, R.F. Perspectives: sequence data base searching in the era of large-scale genomic sequencing. Genome Res. 6, 653–660 (1996).


Question 10 For a given protein, how can one determine whether it contains any functional domains of interest? What other proteins contain the same functional domains as this protein? How can one determine whether there is a similarity to other proteins, not only at the sequence level, but also at the structural level? Nat Genet 35 (Suppl 1), 57–62 (2003). https://doi.org/10.1038/ng1198


Issue Date: September 2003
DOI: https://doi.org/10.1038/ng1198

This article is cited by

Old mice, new tricks

Nature Genetics (2005)
