Protein structure and evolutionary history determine sequence space topology

Boris E. Shakhnovich (1), Eric Deeds (2), Charles DeLisi (1), and Eugene Shakhnovich (3,4)

  1. Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA
  2. Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA
  3. Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, USA

Abstract

Understanding the observed variability in the number of homologs of a gene is an important unsolved problem with broad implications for research into the coevolution of structure and function, gene duplication, pseudogene formation, and possibly emerging diseases. Here, we attempt to define and elucidate some possible causes of the observed irregularity in sequence space. We present evidence that the sequence variability and functional diversity of a gene or fold family are influenced by quantifiable characteristics of the protein structure. These characteristics reflect the structural potential for sequence plasticity, i.e., the ability to accept mutations without losing thermodynamic stability. We identify a structural feature of a protein domain, contact density, that serves as a determinant of entropy in sequence space, i.e., the ability of a protein to accept mutations without destroying the fold (also known as fold designability). We show that the logarithm of the average gene family size is statistically correlated (R² > 0.9) with the contact density of the corresponding three-dimensional structure. We present evidence that the sizes of individual gene families are influenced not only by the designability of the structure but also by evolutionary history, e.g., the length of time the gene family has been in existence. We further show that the observed statistical correlation between gene family size and contact density holds at many levels of evolutionary divergence, i.e., not only for closely related sequences but also at the more distantly related fold and superfamily levels of homology.
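
To make the two quantities in this abstract concrete, the following minimal Python sketch computes a contact density from residue coordinates and a Pearson correlation coefficient that could relate contact density to the logarithm of family size. The abstract does not define how contact density is calculated, so everything specific here is an assumption made for illustration: the definition (residue pairs within a distance cutoff, divided by chain length), the 7.5 Å cutoff, the sequence-separation filter, and the function names `contact_density` and `pearson_r`.

```python
import math
from itertools import combinations


def contact_density(coords, cutoff=7.5, min_separation=2):
    """Return the number of residue-residue contacts per residue.

    coords: list of (x, y, z) tuples, one per residue (for example
        C-alpha positions), in chain order.
    cutoff: distance threshold in angstroms (illustrative assumption).
    min_separation: minimum difference in chain index for a pair to
        count, so that directly bonded neighbours are skipped
        (illustrative assumption).
    """
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one residue")
    contacts = 0
    for i, j in combinations(range(n), 2):
        if j - i < min_separation:
            continue  # skip residues adjacent along the chain
        if math.dist(coords[i], coords[j]) <= cutoff:
            contacts += 1
    return contacts / n


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Under these assumptions, one would compute `contact_density` for a representative structure of each family, take `math.log` of each family's average size, and pass the two lists to `pearson_r`; squaring the result gives an R² comparable in kind, though not necessarily in value, to the one reported above.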

Footnotes

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3133605.

  • 4 Corresponding author. E-mail eugene{at}belok.harvard.edu; fax (617) 384-9228.

    • Accepted November 23, 2004.
    • Received August 10, 2004.