Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions

  1. Zhengdong D. Zhang (1),
  2. Alberto Paccanaro (2),
  3. Yutao Fu (3),
  4. Sherman Weissman (5),
  5. Zhiping Weng (3, 4),
  6. Joseph Chang (6),
  7. Michael Snyder (7), and
  8. Mark B. Gerstein (1, 8, 9)
  1. Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA;
  2. Department of Computer Science, Royal Holloway, University of London, Egham Hill, TW20 0EX, United Kingdom;
  3. Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA;
  4. Biomedical Engineering Department, Boston University, Boston, Massachusetts 02215, USA;
  5. Department of Genetics, Yale University, New Haven, Connecticut 06510, USA;
  6. Department of Statistics, Yale University, New Haven, Connecticut 06520, USA;
  7. Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA;
  8. Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA

Abstract

The comprehensive inventory of functional elements in 44 human genomic regions carried out by the ENCODE Project Consortium enables, for the first time, a global analysis of the genomic distribution of transcriptional regulatory elements. In this study we developed an intuitive yet powerful approach to analyze the distribution of regulatory elements found in many different ChIP–chip experiments on a 10- to 100-kb scale. First, we focus on the overall chromosomal distribution of regulatory elements in the ENCODE regions and show that it is highly nonuniform. We demonstrate, in fact, that regulatory elements are associated with the location of known genes. Further examination on a local, single-gene scale shows an enrichment of regulatory elements near both transcription start and end sites. Our results indicate that, overall, these elements are clustered into regulatory-rich “islands” and regulatory-poor “deserts.” Next, we examine how consistent the nonuniform distribution is between different transcription factors. We perform a multivariate analysis of all the factors in the framework of a biplot, which enhances biological signals in the experiments. This analysis groups transcription factors into sequence-specific and sequence-nonspecific clusters. Moreover, with experimental variation carefully controlled, detailed correlations show that the distribution of sites is generally reproducible for a specific factor between different laboratories and microarray platforms. Data sets associated with histone modifications have particularly strong correlations. Finally, we show how the correlations between factors change when only regulatory elements far from the transcription start sites are considered.
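
As a rough illustration of what testing for a nonuniform distribution can look like, the sketch below bins site positions into fixed-width windows and computes a chi-square goodness-of-fit statistic against a uniform expectation. This is not the method of the paper, which the abstract does not specify in detail. The function names (`bin_counts`, `chi_square_uniform`), the bin size, and the example positions are all hypothetical.

```python
def bin_counts(positions, region_length, bin_size):
    """Count sites in consecutive fixed-width bins along one region.

    positions are 0-based offsets within the region (hypothetical input).
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        if not 0 <= pos < region_length:
            raise ValueError(f"position {pos} lies outside the region")
        counts[pos // bin_size] += 1
    return counts


def chi_square_uniform(counts, region_length, bin_size):
    """Chi-square statistic of observed bin counts against a uniform model.

    Under uniformity, the expected count of a bin is proportional to its
    width (the last bin may be shorter than bin_size).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no sites to test")
    statistic = 0.0
    for i, observed in enumerate(counts):
        width = min(bin_size, region_length - i * bin_size)
        expected = total * width / region_length
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Illustrative data only: six made-up site positions in a 100-kb region.
example_positions = [100, 200, 300, 400, 50_100, 90_500]
example_counts = bin_counts(example_positions, 100_000, 50_000)
example_stat = chi_square_uniform(example_counts, 100_000, 50_000)
```

A large statistic relative to a chi-square distribution with (number of bins - 1) degrees of freedom would suggest clustering into site-rich and site-poor windows. The choice of bin size strongly affects the result, so any real analysis would need to justify it.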

Footnotes

  • 9 Corresponding author. E-mail mark.gerstein{at}yale.edu; fax (360) 838-7861.

  • [Supplemental material is available online at www.genome.org.]

  • Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5573107

    • Received June 1, 2006.
    • Accepted October 18, 2006.
  • Freely available online through the Genome Research Open Access option.
