ArrayTrack--supporting toxicogenomic research at the U.S. Food and Drug Administration National Center for Toxicological Research.

The mapping of the human genome and the determination of corresponding gene functions, pathways, and biological mechanisms are driving the emergence of the new research fields of toxicogenomics and systems toxicology. Many technological advances such as microarrays are enabling this paradigm shift that indicates an unprecedented advancement in the methods of understanding the expression of toxicity at the molecular level. At the National Center for Toxicological Research (NCTR) of the U.S. Food and Drug Administration, core facilities for genomic, proteomic, and metabonomic technologies have been established that use standardized experimental procedures to support centerwide toxicogenomic research. Collectively, these facilities are continuously producing an unprecedented volume of data. NCTR plans to develop a toxicoinformatics integrated system (TIS) for the purpose of fully integrating genomic, proteomic, and metabonomic data with the data in public repositories as well as conventional in vitro and in vivo toxicology data. The TIS will enable data curation in accordance with standard ontology and provide or interface with a rich collection of tools for data analysis and knowledge mining. In this article the design, practical issues, and functions of the TIS are discussed through presenting its prototype version, ArrayTrack, for the management and analysis of DNA microarray data. ArrayTrack is logically constructed of three linked components: a) a library (LIB) that mirrors critical data in public databases; b) a database (MicroarrayDB) that stores microarray experiment information that is Minimum Information About a Microarray Experiment (MIAME) compliant; and c) tools (TOOL) that operate on experimental and public data for knowledge discovery. Using ArrayTrack, we can select an analysis method from the TOOL and apply the method to selected microarray data stored in the MicroarrayDB; the analysis results can be linked directly to gene information in the LIB.

While modern toxicology has focused on understanding biological mechanisms involved in the expression of toxicity at the molecular level, a technological revolution has occurred enabling researchers to perform experiments on an unprecedented scale (Marshall and Hodgson 1998; Ramsay 1998). High-throughput experimentation is producing large amounts of data that are impossible to analyze without informatics support (Bellenson 1999; Spengler 2000). We see a paradigm shift in toxicology research, where hypothesis-driven research is complemented by data-driven experimentation designed to be hypothesis generating. Although toxicogenomics, the study of toxicology using high-throughput "omics" technologies (Aardema and MacGregor 2002; Hamadeh et al. 2002; Nuwaysir et al. 1999; Schmidt 2002; Ulrich and Friend 2002), and systems toxicology, the study of toxicology through data integration (Waters et al. 2003), have advanced rapidly and are likely to continue to advance, development of software infrastructures to manage, analyze, and integrate the diverse data has lagged behind. Recently, Waters et al. (2003) proposed a conceptual framework, the Chemical Effects in Biological Systems (CEBS) knowledge base, to meet the expanding toxicogenomic research needs at the National Center for Toxicogenomics (NCT) (Tennant 2002), including both NCT intramural research and research within the Toxicogenomics Research Consortium (TRC) (Medlin 2002). Both the NCT and the TRC are located at the National Institute of Environmental Health Sciences (NIEHS) in Research Triangle Park, North Carolina.
Implementing toxicogenomic technologies is a high-priority initiative at the U.S. Food and Drug Administration (U.S. FDA) National Center for Toxicological Research (NCTR). A microarray core facility using validated and standardized protocols has been established. Similar facilities for proteomics and metabonomics are at an advanced stage of development and are preparing for validation of protocols. A toxicoinformatics integrated system (TIS) is concurrently being developed to meet the data management and analysis challenges associated with these efforts. The TIS is designed to aggregate data from toxicogenomic research with traditional toxicological end points and chemical data, along with sequence, gene function, and pathway data in public repositories. Through integration of different data types with analysis capabilities, the TIS will enable extraction of a tailored data set for data interpretation and hypothesis generation and testing.
In this article, the prototype of TIS, ArrayTrack, is presented in the context of meeting the following bioinformatics challenges associated with DNA microarray experiments in toxicology:
• How to manage the massive information associated with a microarray experiment and determine what relevant toxicology-specific experimental information or ontology needs to be acquired for the database.
• What visualization and analysis capabilities are required to efficiently extract knowledge from the microarray data.
• How the microarray experimental data should be linked with data from public databases to make the germane information on gene annotation, protein function, and pathways readily available for data interpretation.
ArrayTrack comprises three integrated components: a) MicroarrayDB, which stores the information associated with a microarray experiment, including information on slide samples, treatment, and experimental results; b) TOOL, which provides analysis capabilities for data visualization, normalization, significance analysis, clustering, and classification; and c) LIB, which contains information from public repositories (e.g., gene annotation, protein function, and pathways). MicroarrayDB and LIB are used to store in-house experimental results and public data, respectively, whereas TOOL provides various algorithms for data visualization and analysis. At the time of this writing, ArrayTrack is not open-source software but can be accessed through the World Wide Web (http://edkb.fda.gov/webstart/arraytrack/). Prospective users can also acquire the software free of charge by contacting the authors. Both MicroarrayDB and LIB were developed on the Oracle relational database management system (Oracle Corp., Redwood Shores, CA). The database structure of MicroarrayDB and LIB was designed to accommodate the essential data associated with a microarray experiment as well as the data from the public repositories on genes, proteins, and pathways (database schema available upon request). The robust design allows data entities (tables holding data of the same type) and their relationships in the databases to be conveniently added and modified to accommodate the needs of ever-evolving microarray technology and public databases. The diverse data in MicroarrayDB and LIB are stored on an IBM storage area network (SAN) and backed up daily using the Tivoli Storage Manager (TSM) system.
User interface components providing query, analysis, and visualization capabilities are programmed in the Java language, ensuring portability to most computer operating systems as well as enabling easy Web deployment. Interfaces have been built for several data exchange formats, including flat text files and Microsoft Office Excel spreadsheets (Microsoft Corp., Redmond, WA). The "data drilling" capabilities were developed to allow the user to lock down the results of a query and then requery the database across other data within the realm of that previous query.
Controlling access to experimental data is a sensitive issue for many organizations and researchers. ArrayTrack allows only the owner of the data and members of groups approved by the owner to read or write the data. Figure 1 depicts ArrayTrack, which comprises three integrated components: a) MicroarrayDB, b) TOOL, and c) LIB.
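The access rule described above (the owner, plus members of owner-approved groups) can be illustrated with a minimal sketch. This is not ArrayTrack's implementation, which is not described in the text; the `Dataset` class, the function names, and the user and group names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Dataset:
    """Hypothetical record for one experiment's data and its access list."""
    owner: str
    approved_groups: Set[str] = field(default_factory=set)


def can_access(user: str, user_groups: Set[str], dataset: Dataset) -> bool:
    """The owner always has access; others need an owner-approved group."""
    if user == dataset.owner:
        return True
    return bool(user_groups & dataset.approved_groups)


def approve_group(acting_user: str, dataset: Dataset, group: str) -> None:
    """Only the owner may extend access to another group."""
    if acting_user != dataset.owner:
        raise PermissionError("only the data owner can approve a group")
    dataset.approved_groups.add(group)


# Usage with made-up users and groups.
experiment = Dataset(owner="alice")
approve_group("alice", experiment, "liver_tox_team")
```

The sketch treats read and write access identically, because the text does not distinguish between them at the group level.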

Results
Through a user-friendly interface, the user can select an analysis method from the TOOL, apply the method to selected microarray data stored in the MicroarrayDB, and link the analysis results directly to gene information in the LIB. ArrayTrack also allows data to be linked directly with other public databases.
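The select, apply, and link workflow described above can be sketched in a few lines. The in-memory dictionaries below are hypothetical stand-ins for MicroarrayDB, TOOL, and LIB (the real components are Oracle databases and Java tools), and the gene identifiers, values, and the `fold_change_filter` method are invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for MicroarrayDB: array id -> {gene id -> log2 ratio}.
MICROARRAY_DB: Dict[str, Dict[str, float]] = {
    "array_001": {"geneA": 2.1, "geneB": -0.2, "geneC": -1.8},
}

# Hypothetical stand-in for LIB: gene id -> annotation fields.
LIB: Dict[str, Dict[str, str]] = {
    "geneA": {"pathway": "pathway_1"},
    "geneC": {"pathway": "pathway_2"},
}


def fold_change_filter(log_ratios: Dict[str, float], threshold: float = 1.0) -> List[str]:
    """Return gene ids whose absolute log2 ratio meets the threshold."""
    return sorted(g for g, r in log_ratios.items() if abs(r) >= threshold)


# Hypothetical stand-in for TOOL: method name -> analysis function.
TOOL: Dict[str, Callable[[Dict[str, float]], List[str]]] = {
    "fold_change_filter": fold_change_filter,
}


def analyze(array_id: str, method_name: str) -> Dict[str, Dict[str, str]]:
    """Apply a TOOL method to one array and attach LIB annotation to each hit."""
    data = MICROARRAY_DB[array_id]   # select the stored experimental data
    method = TOOL[method_name]       # select the analysis method
    selected = method(data)          # run the analysis
    # Genes absent from LIB get an empty annotation rather than an error.
    return {gene: LIB.get(gene, {}) for gene in selected}


result = analyze("array_001", "fold_change_filter")
```

Keeping the methods in a name-to-function registry means a new analysis can be added without touching the data store or the library lookup, which is the kind of decoupling the three-component design suggests.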

MicroarrayDB
Microarray experimentation is one of the fastest-growing methods used in genomic research and has led to a broad diversity of microarray databases in both the public domain and commercial domains (Gardiner-Garden and Littlejohn 2001 (Brazma et al. 2001), the MIAME/Tox document outlines the minimum information required for a toxicogenomic experiment to ensure that the results are interpretable and the experiment is replicable.
Our goal is to develop a validated microarray database as a rich resource for cross-experiment and cross-platform comparison to derive toxicity-specific signatures. By validated, we mean that data are stored if and only if they meet prescribed standards for completeness and accuracy as well as conformance to the applicable ontology. MicroarrayDB was designed to support toxicogenomic studies adhering to the MIAME guidelines. Currently, a number of journals, including Nature, the Nature group of journals, Cell, The Lancet, EMBO, and Toxicologic Pathology, require an accession number from the public microarray databases developed based on the MIAME guidelines, which must be supplied on or before acceptance of publication (Ball et al. 2002). In implementing the MIAME guidelines, software developers, bioinformaticians, and toxicologists worked closely together to understand both the structure of the database and the structure of the data to be stored in it, and specifically discussed the following practical issues:
• MIAME versus database: MIAME specifies the content of the information to be available, whereas the database addresses how the content should be managed and, most importantly, queried. In other words, there is a distinction between all the information the database holds and the subset of it that is searchable. Technically, available and searchable information can be treated in the same way. In practice, however, such an approach burdens end users with tediously entering all information into the database, which might hinder their participation. Therefore, it is critical to define a balance point that can be accepted by both experimentalists and bioinformaticians.
• Local versus global repository: The MIAME guidelines broadly specify required data with the goal of a truly global repository for public data deposition and data exchange that would evolve as needs change. However, most databases similar to ArrayTrack are intended primarily, at least initially, for local use within an institution. For local institutional use, the extensive MIAME format can be simplified while still retaining the information essential for interpreting and replicating toxicogenomics experiments.
Thus, ArrayTrack is MIAME-compliant, with additional parameters related to toxicogenomics captured using controlled vocabularies. Figure 2 gives the data submission requirements for essential information from both the microarray and toxicology perspectives. Currently, MicroarrayDB contains data from more than 650 arrays. We are closely following the current development of MicroArray Gene Expression Markup Language (MAGE-ML) standards (Spellman et al. 2002), which represent microarray data using a markup language. We will develop a means, using MAGE-ML (an XML-based data exchange format), to allow data in MicroarrayDB to be exchanged with other microarray data repositories such as ArrayExpress (http://www.ebi.ac.uk/arrayexpress; Brazma et al. 2003) and the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo; Edgar et al. 2002).
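The "stored if and only if" rule and the use of controlled vocabularies can be illustrated with a small submission check. This is a sketch under stated assumptions: the field names and vocabulary terms below are hypothetical and are not taken from MIAME, MIAME/Tox, or Figure 2.

```python
from typing import Dict, List, Optional

# Hypothetical required fields; the actual submission requirements are in Figure 2.
REQUIRED_FIELDS = ("species", "tissue", "compound", "dose", "array_platform")

# Hypothetical controlled vocabularies for two of the fields.
CONTROLLED_VOCAB = {
    "species": {"rat", "mouse", "human"},
    "tissue": {"liver", "kidney"},
}


def validate_submission(record: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means the record may be stored."""
    problems: List[str] = []
    # Completeness: every required field must be present and non-blank.
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {name}")
    # Conformance: vocabulary-controlled fields must use an allowed term.
    for name, allowed in CONTROLLED_VOCAB.items():
        value = record.get(name)
        if value and value not in allowed:
            problems.append(f"{name}={value!r} is not in the controlled vocabulary")
    return problems


good = {"species": "rat", "tissue": "liver", "compound": "compound_X",
        "dose": "10 mg/kg", "array_platform": "platform_1"}
bad = {"species": "Rat", "tissue": "liver", "compound": "compound_X",
       "array_platform": "platform_1"}
```

Keeping the required list short while allowing optional free-text fields is one way to strike the balance discussed above between what experimentalists must enter and what bioinformaticians need to query.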

LIB
The public domain has a rich and diverse collection of biological databases that greatly facilitates microarray experiment interpretation and associated knowledge discovery (Baxevanis 2003    2 weeks using scripts. The LIB was the selected aggregation of the information in the mirrored databases that was relevant for interpretation of microarray results. Currently, the LIB comprises three sublibraries, GeneLib, ProteinLib, and PathwayLib, which concentrate public data on genes, proteins, and pathways, respectively. Each contains only the most relevant selected information from UniGene (http://www.ncbi.nlm.nih.gov/UniGene/), LocusLink, SWISS-PROT, KEGG, and GO. The three libraries (GeneLib, ProteinLib, and PathwayLib) have the same design and functional interface. A screen shot of the GeneLib is displayed in Figure 3. The gene information is displayed in an Excel-like spreadsheet. Each row is associated with a gene, and each column is a particular functional annotation, such as chromosomal location, pathway, or functional assignment (molecular function, biological process, and cellular component) defined by GO (Ashburner et al. 2000). The spreadsheet can be customized by including/excluding a specific functional annotation. The common functions such as sorting, ranking, and querying are available for comparison across the entire gene list. The genes can also be categorized on the basis of their common pathways (Figure 4). In addition, detailed information on each gene is available, including synonym, sequence, chromosomal map, and reference. Information for genes not contained in the GeneLib is readily available by hot link to a wide range of public data repositories.
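The spreadsheet view described above (one row per gene, one column per annotation, with column selection and grouping by pathway) can be sketched as follows. The rows, gene identifiers, and pathway names are invented for illustration and do not come from GeneLib.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical rows: one gene per row, one annotation per key.
gene_table: List[Dict[str, object]] = [
    {"gene": "geneA", "chromosome": "1", "pathways": ["pathway_1", "pathway_2"]},
    {"gene": "geneB", "chromosome": "7", "pathways": ["pathway_1"]},
    {"gene": "geneC", "chromosome": "X", "pathways": []},
]


def select_columns(rows: List[Dict[str, object]], columns: List[str]) -> List[Dict[str, object]]:
    """Customize the view by keeping only the requested annotation columns."""
    return [{c: row[c] for c in columns} for row in rows]


def group_by_pathway(rows: List[Dict[str, object]]) -> Dict[str, List[str]]:
    """Categorize genes by pathway; a gene may appear under several pathways."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        for pathway in row["pathways"]:
            groups[pathway].append(row["gene"])
    return {p: sorted(genes) for p, genes in sorted(groups.items())}


view = select_columns(gene_table, ["gene", "chromosome"])
by_pathway = group_by_pathway(gene_table)
```

Genes with no pathway annotation (here `geneC`) simply do not appear in any group.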

TOOL
The TOOL was designed to provide a spectrum of algorithmic tools for microarray data visualization, quality control, normalization, significant gene identification, pattern discovery, and class prediction.
A quality assurance/quality control tool was developed to assist quality control of slide array results (Figure 5). The tool summarizes the most relevant information in one interface to facilitate the process of quality control. The user can determine the quality of individual microarray results through visualizing data, applying statistical measures, and viewing experimental annotation. Statistical measures are provided to assess the quality of a hybridization result based on the raw expression data, including signal-to-noise ratio, the percentage of nonhybridized spots, etc. The experimental annotations associated with the processes of hybridization, RNA extraction, and labeling are also available to the end user. Additionally, a scatterplot of Cy3 versus Cy5, together with the original image, is available for visual inspection for quality control purposes.
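The two statistical measures named above can be sketched as follows. The text does not give ArrayTrack's formulas, so the definitions here are assumptions: signal-to-noise is taken as background-subtracted signal divided by the background standard deviation, and a spot is counted as nonhybridized when that ratio falls below a chosen cutoff. The `Spot` class and the cutoff value are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Spot:
    """Hypothetical raw measurements for one spot in one channel."""
    foreground: float
    background: float
    background_sd: float


def signal_to_noise(spot: Spot) -> float:
    """Assumed definition: (foreground - background) / background SD."""
    if spot.background_sd <= 0:
        # No measurable noise: any signal above background is unbounded.
        return float("inf") if spot.foreground > spot.background else 0.0
    return (spot.foreground - spot.background) / spot.background_sd


def percent_nonhybridized(spots: List[Spot], min_snr: float = 2.0) -> float:
    """Percentage of spots whose signal-to-noise is below the assumed cutoff."""
    if not spots:
        raise ValueError("no spots to assess")
    weak = sum(1 for s in spots if signal_to_noise(s) < min_snr)
    return 100.0 * weak / len(spots)


spots = [
    Spot(1000.0, 100.0, 50.0),
    Spot(120.0, 100.0, 50.0),
    Spot(300.0, 100.0, 50.0),
    Spot(150.0, 100.0, 50.0),
]
array_snr = median(signal_to_noise(s) for s in spots)
array_weak = percent_nonhybridized(spots)
```

Summarizing an array by the median rather than the mean keeps a handful of saturated spots from dominating the quality score; that choice is also an assumption of this sketch.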
Two data visualization methods are currently provided: the ScatterPlot viewer and the VirtualImage viewer. The ScatterPlot viewer plots gene expression profiles of one sample versus another sample (Figure 6), whereas the VirtualImage viewer displays the expression pattern in an array image format (Figure 7). Both functions permit visual identification of significant genes and hyperlink directly from the graph to additional detailed library information on any particular gene.
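The Figure 7 caption describes two sliding controls in the array image viewer: one removes spots whose fold change is below a chosen value, and the other removes spots for which both the Cy3 and Cy5 intensities fall below a threshold. A minimal sketch of that two-criterion filter follows. The symmetric fold-change definition, the `Spot` class, and all values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Spot:
    """Hypothetical two-channel intensities for one spot."""
    gene: str
    cy3: float
    cy5: float


def fold_change(cy3: float, cy5: float) -> float:
    """Symmetric fold change (always >= 1); an assumed definition."""
    if cy3 <= 0 or cy5 <= 0:
        raise ValueError("intensities must be positive")
    ratio = cy5 / cy3
    return ratio if ratio >= 1.0 else 1.0 / ratio


def filter_spots(spots: List[Spot], min_fold: float, min_intensity: float) -> List[Spot]:
    """Keep spots that pass both the ratio and the intensity criteria."""
    kept = []
    for spot in spots:
        # A spot fails the intensity criterion only if BOTH channels are dim.
        too_dim = spot.cy3 < min_intensity and spot.cy5 < min_intensity
        if fold_change(spot.cy3, spot.cy5) >= min_fold and not too_dim:
            kept.append(spot)
    return kept


spots = [
    Spot("geneA", 200.0, 800.0),   # 4-fold change, bright: kept
    Spot("geneB", 500.0, 550.0),   # 1.1-fold change: removed by the ratio control
    Spot("geneC", 20.0, 90.0),     # 4.5-fold change but both channels dim: removed
    Spot("geneD", 900.0, 50.0),    # 18-fold change, Cy3 bright: kept
]
kept_genes = [s.gene for s in filter_spots(spots, min_fold=2.0, min_intensity=100.0)]
```

The intensity criterion matters because ratios computed from two near-background intensities (such as `geneC`) are dominated by noise even when they look large.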

Discussion
The GeneLib, ProteinLib, and PathwayLib components of ArrayTrack contain general but essential information for functional genomics research. These libraries also provide a basis for linking and integrating various omics data. For example, lists of genes, proteins, and metabolites derived from various omics platforms could be cross-linked based on their common identifiers through these three libraries. An additional library, ToxicantLib, is being developed for ArrayTrack and will similarly provide linkage between toxicological data and the different types of omics data. The ToxicantLib contains the chemical name and structure together with toxicological end points.
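The cross-linking idea above (lists from different omics platforms joined through shared identifiers) can be sketched with simple lookup tables. The mapping tables, identifiers, and the choice of pathway as the common key are all hypothetical; the actual library schemas are not given in the text.

```python
from typing import Dict, Iterable, Set

# Hypothetical identifier mappings of the kind the libraries would supply.
GENE_TO_PROTEIN: Dict[str, str] = {"geneA": "protA", "geneB": "protB"}
PROTEIN_TO_PATHWAYS: Dict[str, Set[str]] = {
    "protA": {"pathway_1"},
    "protB": {"pathway_1", "pathway_2"},
}
METABOLITE_TO_PATHWAYS: Dict[str, Set[str]] = {
    "metX": {"pathway_2"},
    "metY": {"pathway_3"},
}


def cross_link(genes: Iterable[str], proteins: Iterable[str],
               metabolites: Iterable[str]) -> Dict[str, Dict[str, Set[str]]]:
    """Group hits from three omics lists under the pathways they map to."""
    hits: Dict[str, Dict[str, Set[str]]] = {}

    def add(pathway: str, kind: str, item: str) -> None:
        entry = hits.setdefault(
            pathway, {"genes": set(), "proteins": set(), "metabolites": set()})
        entry[kind].add(item)

    for gene in genes:
        # Genes reach pathways through their protein product.
        protein = GENE_TO_PROTEIN.get(gene)
        for pathway in PROTEIN_TO_PATHWAYS.get(protein, ()):
            add(pathway, "genes", gene)
    for protein in proteins:
        for pathway in PROTEIN_TO_PATHWAYS.get(protein, ()):
            add(pathway, "proteins", protein)
    for metabolite in metabolites:
        for pathway in METABOLITE_TO_PATHWAYS.get(metabolite, ()):
            add(pathway, "metabolites", metabolite)
    return hits


linked = cross_link(["geneA"], ["protB"], ["metX"])
```

Identifiers with no mapping are silently skipped here; a production system would more likely report them so that curation gaps become visible.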
Through the similarity comparison of the chemical structure of a toxicant with the structures of the metabolites in the PathwayLib, we might be able to examine the toxicity effect of a particular toxicant at the molecular level. The first toxicological data in ToxicantLib are data from our endocrine disruptor knowledge base (EDKB; http://edkb.fda.gov/; Tong et al. 2002) and the carcinogenicity potency database (CPDB) (Gold and Zeiger 1997). Other specific toxicology libraries will be added in the near future, including LiverLib (gene/protein associated with liver toxicity) and SNPsLib (containing information on single-nucleotide polymorphism).
Development of commercial software for visualizing and analyzing microarray data is currently an area of vigorous effort by bioinformatics-oriented companies. Representative software providers for microarray data analysis include Spotfire, Silicon Genetics, BioDiscovery, and Partek. Similarly, a diversity of software is available in the public domain, some of which can be accessed through a website at Stanford University (http://genome-www.stanford.edu/). Collectively, commercial and public software provide many redundant capabilities, though particular software may have unique features or other attributes or familiarity that appeal to end users. Consequently, we have developed and will develop more interfaces (as part of TOOL) to provide interoperability between ArrayTrack and other analysis software.
Figure 7. VirtualArray viewer. The function shows a reconstruction of the original array image from the expression data derived from the array (A). This virtual array image provides a visual representation of data in the format of the original image. There are several functions on the top of the image that allow browsing the contents of the array, identifying significant spots and information about their corresponding genes. For example, there are two sliding controls for filtering out unwanted spots. The upper sliding control is used to eliminate spots whose expression fold change is less than the predefined criteria. The other sliding control is used to eliminate spots for which the intensity of both Cy3 and Cy5 channels falls below the selected threshold. The resulting image (B) contains only those genes that meet both ratio and intensity criteria. Those genes can be directly linked to the GeneLib.
Toxicogenomics | ArrayTrack-supporting toxicogenomic research at the U.S. FDA NCTR. Environmental Health Perspectives • VOLUME 111 | NUMBER 15 | November 2003
ArrayTrack includes some tools common to other bioinformatics software, but future development will focus on novel analysis approaches and tools for toxicology-specific problems. For example, we developed a novel class prediction method, heterogeneous decision forest (Tong et al. 2003), that could be useful for omics data analysis generally and for development of predictive models in particular.
ArrayTrack has been developed and programmed in a modular manner and uses a Java library such that the code is readily extensible for other omics data, such as proteomics and metabonomics, as well as for conventional toxicology data. The extended system, supporting the diversity of data types, is the TIS, which will be under further development and evolution for several more years. The ultimate goal is for TIS to serve as a general, broad repository for diverse data sources (e.g., omics, toxicology, and chemical structure data), supporting broad data mining and meta-analysis activities as well as development of robust and validated predictive systems. The TIS will facilitate scientific discovery and productivity via effective management of diverse data and knowledge and by integration of toxicological information at different levels of biological complexity. Through cross-linking gene, protein, and pathway information available in public databases, and experimental data from multiple experiments, protocols, and labs, systems toxicology will allow a fuller understanding of toxicological mechanisms.