Biological Sequences Integrated: A Relational Database Approach

Acta Biotheoretica

Abstract

Over the last decade, the modeling and storage of biological data have been a topic of wide interest for scientists working in biological and biomedical research. Currently, most data are still stored in text files, which leads to data redundancy and file chaos.

In this paper we show how to use relational modeling techniques and relational database technology to model and store biological sequence data, i.e. data maintained in collections like EMBL or SWISS-PROT, to better serve the needs of these application domains.

For this purpose we propose a two-step approach. First, we model the structure (and therefore the meaning) of the data using an Entity-Relationship (ER) approach. Second, the ER model leads to a clean design of a relational database schema for storing and retrieving the DNA and protein data extracted from various sources. Our approach provides a clean basis for building complex biological applications that are more amenable to changes and software ports than their file-based counterparts.
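As a rough illustration of what moving from an ER model to a relational schema can look like, the following minimal sketch uses Python's built-in sqlite3 module. The three entities (source collection, sequence, entry), all table and column names, and the accession values are assumptions made for this example; they are not the schema described in the paper.

```python
import sqlite3

# Hypothetical schema derived from an assumed ER model with three entities.
# Table and column names are illustrative only, not taken from the paper.
SCHEMA = """
CREATE TABLE source (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    kind        TEXT NOT NULL CHECK (kind IN ('DNA', 'protein')),
    residues    TEXT NOT NULL UNIQUE
);
CREATE TABLE entry (
    entry_id    INTEGER PRIMARY KEY,
    source_id   INTEGER NOT NULL REFERENCES source(source_id),
    accession   TEXT NOT NULL,
    sequence_id INTEGER NOT NULL REFERENCES sequence(sequence_id),
    UNIQUE (source_id, accession)
);
"""


def build_demo_db():
    """Create an in-memory database holding the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn, source, accession, kind, residues):
    """Register an entry; source and sequence rows are reused if they already exist."""
    conn.execute("INSERT OR IGNORE INTO source (name) VALUES (?)", (source,))
    conn.execute(
        "INSERT OR IGNORE INTO sequence (kind, residues) VALUES (?, ?)",
        (kind, residues),
    )
    conn.execute(
        """
        INSERT INTO entry (source_id, accession, sequence_id)
        SELECT s.source_id, ?, q.sequence_id
        FROM source AS s, sequence AS q
        WHERE s.name = ? AND q.residues = ?
        """,
        (accession, source, residues),
    )


def sources_for_sequence(conn, residues):
    """Return (source name, accession) pairs for every entry sharing a sequence."""
    return conn.execute(
        """
        SELECT s.name, e.accession
        FROM entry AS e
        JOIN source AS s ON s.source_id = e.source_id
        JOIN sequence AS q ON q.sequence_id = e.sequence_id
        WHERE q.residues = ?
        ORDER BY s.name
        """,
        (residues,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    # Made-up accessions and a made-up sequence, for demonstration only.
    add_entry(db, "EMBL", "DEMO0001", "protein", "MKTAYIAK")
    add_entry(db, "SWISS-PROT", "DEMO0002", "protein", "MKTAYIAK")
    print(sources_for_sequence(db, "MKTAYIAK"))
```

In this sketch a sequence that appears in two collections is stored only once, and each entry refers to it by key. This is one way a relational design can avoid the redundancy of duplicated text files.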

Bergholz, A., Heymann, S., Schenk, J.A. et al. Biological Sequences Integrated: A Relational Database Approach. Acta Biotheor 49, 145–159 (2001). https://doi.org/10.1023/A:1011958524279

  • DOI: https://doi.org/10.1023/A:1011958524279
