Under construction: building a safer industry.

A revolution in the building industry over the past decade has spawned a new generation of safer materials and practices, decreasing some health risks for construction workers. Concerned consumers, builders, materials manufacturers, and government regulatory agencies have all contributed to a turn toward "green" building materials and practices, meaning that homeowners and office workers now are better able to live and work in healthier environments, and many construction workers are handling and installing less-toxic materials.


INTRODUCTION
The goal of the ENCODE project is to identify all functional elements in the human genome sequence (1). The pilot phase of the project is focused on a specific 30 megabases (~1%) of the human genome, with an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. UC Santa Cruz is the main repository for sequence-based data, with microarray data being held at GEO and ArrayExpress. The roles of UC Santa Cruz are (i) to collect the experimental data and analyses, (ii) to perform basic quality assurance (QA) on the submitted data, (iii) to publicly release the data with comprehensive descriptions, (iv) to provide interactive displays for integrating the ENCODE data with existing genome-wide data and (v) to provide interactive tools for analysis. General details of the Genome Browser have been described previously (2,3), and are briefly reviewed here for clarity.
Within the Genome Browser, each dataset is represented as a track, which is a horizontal, graphical representation of the underlying data table. A complete description of each dataset is available on the description page for each track. The Table Browser has been previously described as a general purpose tool for analyzing data in the UCSC Genome Browser (4), possibly with integrated user-supplied data. Several features have been added to this platform in the context of the ENCODE project. The interactive browsing and analysis tools are available only at the UCSC ENCODE site (http://genome.ucsc.edu/ENCODE); in addition, the data are available for public download (http://hgdownload.cse.ucsc.edu/goldenPath/encode/). Data from this project are made publicly available as quickly as possible after submission. All data on the UCSC Browsers, including the ENCODE data, pass through an extensive QA and documentation process before release. Biological validation criteria have been defined for each of the datasets, and it is the responsibility of the submitters to confirm that these criteria are met before submission. Our developers and QA staff work with the data to provide fast, clear display and to confirm that the file formats and genomic coordinates are consistent.
It is expected that the ENCODE project will transition from the May 2004 human genome assembly (hg17; NCBI Build 35) to the newly released human genome assembly (hg18; NCBI Build 36) in early 2007. Following this, as the ENCODE project expands from the current 1% to the whole genome, UCSC is poised to support this growth. This paper describes the site and the tools that have been developed for viewing, retrieving and analyzing the data from the ENCODE project.

Portal and data
We have extended the UCSC Genome Browser (5) to include specialized support for the ENCODE project and its data. The ENCODE portal is accessible both through a link on the main Genome Browser site and directly at http://genome.ucsc.edu/ENCODE. This portal provides access to the ENCODE data and serves as a starting point for the computational analyses that are possible with the new data and analysis tools. It also contains announcements of new data releases and tool deployments, terms of use for the ENCODE data, and information about the contributors. The 'Regions' link opens a frames page allowing the user to quickly scan all ENCODE regions by selecting one region in the navigator frame, which opens a customizable view of that region in a display frame.

Track groups
We have added two levels of organization to reduce the complexity of accessing the data. Tracks of a similar type are collected into track groups, which provide high-level organization to the datasets. The six ENCODE-specific track groups roughly parallel the analysis working groups: Regions and Genes; Transcript Levels; Chromatin Immunoprecipitation; Chromosome, Chromatin and DNA Structure; Comparative Genomics; and Variation. The individual tracks are too numerous to list here and are frequently updated with new results from the Consortium. The track status page at http://genome.cse.ucsc.edu/ENCODE/trackStatus.html provides a current snapshot of the data, including new datasets that are being developed, those that are in QA, and the fully released datasets.

Composite tracks
Sometimes one experiment will be run repeatedly with many different experimental conditions, producing the same data type but many parallel datasets, such as with the many combinations of cell lines, antibodies and stimulation conditions used in chromatin immunoprecipitation. For organizational simplicity, these composite tracks allow a set of similar data, usually from a single data provider, to be controlled through a single interface. On the track's user interface page, parameters that are common to all sub-tracks (e.g. visibility mode, track height, display range limits) are presented once. Just below those controls, a checkbox for each sub-track allows it to be individually included or excluded from the display. Experiments can be grouped into logical categories (e.g. cell type, transcription factor) with shared controls. Figure 1 shows the Yale RNA transcriptionally active regions (TARs) (6,7) track as an example of the streamlined interface and the resulting display of the composite tracks.

Multiple sequence alignment display
In addition to our data display and repository role, UCSC and collaborators have been developing algorithms for sequence alignment (8) and conservation analysis (9). As this produces extremely rich datasets and parallels the efforts of several other consortium members, we have created a special display that combines multiple species alignments and conservation scores in the same track, as shown in Figure 2. Alignments are projected onto a reference species for display in the browser by removing alignment columns in which the reference species is a gap. Additional enhancements include annotation of alignment gaps to indicate missing sequence and syntenic breaks, and translation in coding regions with user-selectable reading frames based on available gene annotations. When even more detail is necessary, full unprojected alignments are available on the details page for this track.
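The projection step described above can be illustrated with a minimal sketch. It assumes the alignment is held as equal-length gapped strings keyed by species name, with `-` as the gap character; the function name `project_onto_reference` and the toy sequences are hypothetical and do not reflect the browser's internal implementation.

```python
def project_onto_reference(rows, reference, gap="-"):
    """Drop every alignment column in which the reference row is a gap."""
    ref = rows[reference]
    if any(len(seq) != len(ref) for seq in rows.values()):
        raise ValueError("all alignment rows must have the same length")
    # Indices of the columns where the reference species has a base.
    keep = [i for i, base in enumerate(ref) if base != gap]
    return {name: "".join(seq[i] for i in keep) for name, seq in rows.items()}


# Toy alignment (invented for illustration only).
alignment = {
    "human":   "AC-GT--A",
    "mouse":   "ACTGTCCA",
    "chicken": "A--GTC-A",
}
projected = project_onto_reference(alignment, "human")
for name, seq in projected.items():
    print(f"{name:8s} {seq}")
```

After projection every row has the length of the ungapped reference sequence, so each column maps to exactly one reference base. Insertions relative to the reference are lost in this view, which is why the full unprojected alignment remains useful.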
The comparative genomics efforts within the ENCODE Consortium are also receiving special attention. The group is producing a common dataset of sequences from 23 mammals and 5 other vertebrates, which provides a rich dataset for the development and comparison of algorithms for multiple sequence alignment and detection of evolutionary constraint. Four separate alignment algorithms are being developed [MAVID (10), MLAGAN (11), PECAN (B. Paten and E. Birney, submitted for publication) and TBA (8)], and three separate conservation scoring methods [binCons (12), GERP (13,14) and phastCons (9)] are being applied to each of these alignments. Each alignment is presented in its own Alignment track, with two composite tracks to represent the real-valued Conservation scores and the predicted Elements.
Figure 1 (caption fragment). ... The latter two are composite tracks, each containing multiple datasets. The Placenta RNA checkbox is deselected above, so that the data are not displayed in the image below.

Tools for analysis
The Table Browser has always provided summary statistics on a single dataset, and we have added tools for exploring correlation between genomic datasets. Data within composite tracks can be treated as a single set for simplified comparison against other tracks. An example of this is available in Supplementary Data, where promoters that are active in at least one cell line are joined to create a set of 'functional' promoters.
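One way to picture treating the sub-tracks of a composite track as a single set is an interval union. The sketch below is illustrative only: the tuple layout `(chrom, start, end)` with half-open coordinates, the function name `merge_subtracks` and the example cell-line labels are assumptions, and the Table Browser's own implementation is not described in this paper.

```python
def merge_subtracks(subtracks):
    """Union half-open (chrom, start, end) intervals from several sub-tracks."""
    intervals = sorted(iv for track in subtracks.values() for iv in track)
    merged = []
    for chrom, start, end in intervals:
        # Extend the previous interval when this one overlaps or touches it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical promoter calls from two cell lines.
subtracks = {
    "cellLineA": [("chr7", 100, 200), ("chr7", 500, 600)],
    "cellLineB": [("chr7", 150, 250), ("chrX", 10, 20)],
}
functional = merge_subtracks(subtracks)
print(functional)
```

A region is kept if it is covered in at least one sub-track, which matches the "active in at least one cell line" criterion used for the 'functional' promoter set.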
The correlation function calculates correlation coefficients, covariance, scatter plots, residuals and histograms on the fly for the selected datasets. Briefly, the data points from each table are projected down to the base level. The two datasets are intersected and only bases that contain values in both datasets are retained, resulting in datasets of equal length n. These two datasets (X,Y) are then used in a standard linear correlation function, computing the correlation coefficient:
$$ r = \frac{s_{XY}}{s_X \, s_Y} $$
where $s_X$ and $s_Y$ are the standard deviations of the datasets X and Y, and $s_{XY}$ is the covariance, computed as follows:
$$ s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) $$
Figure 3 shows such correlation between the Boston University OH Radical Cleavage Intensity Database (ORChID) (15-17) and the CpG Island and GC Percent tracks. The CpG Island histogram shows significant skew in the data due to many zero values, which obscures the correlation of ORChID values within CpG Islands. The correlation of ORChID values with GC Percent is very strong at r = 0.89, which reveals a potential confounding factor when comparing the ORChID values with other datasets. This method is further described in Supplementary Data.
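The base-level projection, intersection and correlation steps described above can be sketched as follows. This is a minimal illustration, not the Table Browser's code: it assumes each dataset is a list of half-open `(start, end, value)` intervals on a single chromosome, uses the sample (n - 1) normalization for the covariance and standard deviations (an assumption; the choice cancels in r as long as it is applied consistently), and all names and values are hypothetical.

```python
import math


def to_bases(intervals):
    """Expand half-open (start, end, value) intervals to {position: value}."""
    return {pos: value
            for start, end, value in intervals
            for pos in range(start, end)}


def correlate(track_x, track_y):
    """Return (n, r) over the bases that carry a value in both tracks."""
    x_by_base, y_by_base = to_bases(track_x), to_bases(track_y)
    shared = sorted(x_by_base.keys() & y_by_base.keys())  # intersection
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared bases")
    xs = [x_by_base[p] for p in shared]
    ys = [y_by_base[p] for p in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant dataset")
    return n, s_xy / (s_x * s_y)


# Toy tracks (invented values); they share bases 1 to 4.
track_x = [(0, 2, 1.0), (2, 4, 3.0), (4, 6, 5.0)]
track_y = [(1, 3, 2.0), (3, 5, 6.0)]
n, r = correlate(track_x, track_y)
print(n, round(r, 4))
```

Expanding every interval to individual bases keeps the sketch short but is memory-hungry; a production implementation would more likely walk the two sorted interval lists in parallel.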
The hgLiftOver tool, accessible via the Genome Browser's 'Utilities' link, translates genomic coordinates within a species from one assembly version to another and also retrieves putative orthologous regions between species using UCSC's chained and netted alignments. These tools have been used to migrate the ENCODE regions from one assembly to another, and have also been used in the Multiple Species Alignment working group to provide orthology predictions for the preparation of the sequence datasets as described above.
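The idea behind coordinate conversion between assemblies can be shown with a deliberately simplified sketch. It assumes the mapping is given as ungapped aligned blocks `(old_start, new_start, size)` on the forward strand of a single chromosome; this is not the chain file format or the hgLiftOver implementation, and the function name and block coordinates are hypothetical.

```python
def lift_position(pos, blocks):
    """Map a 0-based position through (old_start, new_start, size) blocks.

    Returns None when the position falls between blocks, i.e. it has no
    aligned counterpart in the target assembly.
    """
    for old_start, new_start, size in blocks:
        if old_start <= pos < old_start + size:
            return new_start + (pos - old_start)
    return None


# Two hypothetical aligned blocks separated by an unaligned gap.
blocks = [(1000, 1200, 500), (1600, 1700, 300)]
print(lift_position(1100, blocks))  # inside the first block
print(lift_position(1550, blocks))  # in the gap between blocks
print(lift_position(1650, blocks))  # inside the second block
```

Positions that fall in a gap return `None` rather than a guessed coordinate, mirroring the fact that some regions cannot be carried over between assemblies or species.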

DISCUSSION
The ENCODE project at UC Santa Cruz extends the powerful Genome Browser with datasets and tools to aid researchers in their quest to understand the functional elements in the genome. This extension of the Browser brings datasets on DNA replication, chromatin regulation, promoter function, gene models and multiple species comparisons together and makes them available for visualization, analysis and download. Integration of the datasets generated by the ENCODE Consortium with other genome-wide data provides a rich source for addressing questions about functional elements in 1% of the human genome, and the resource is poised to expand with the needs of the ENCODE project.
Extensions have been made to the display, providing capabilities such as composite tracks for better organization and increased customization. Analysis tools have been built into the Table Browser to simplify merging of related tables and to assess correlation between datasets. These build on the general usability, integration with genome-wide resources, ability to do online analyses and simplicity of exporting data for external analyses that have made the data analysis more accessible to biologists. Newer additions such as the Gene Sorter, In-Silico PCR and VisiGene (2,3) continue to add value by bringing resources together so that detailed analysis can proceed rapidly.