Exploration, Visualization, and Preprocessing of High–Dimensional Data

Wu, Zhijin; Wu, Zhiqiang

doi:10.1007/978-1-60761-580-4_8

Part of the book series: Methods in Molecular Biology ((MIMB,volume 620))

Abstract

Rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. As a result, the quantity of interest is not observed directly, and a number of preprocessing procedures are needed to convert the raw data into a biologically meaningful form. For the same reason, exploratory data analysis and visualization are essential steps for detecting possible defects, anomalies, or distortions in the data, testing underlying assumptions, and thus ensuring data quality. The characteristics of the data structure revealed in exploratory analysis often motivate the choice of preprocessing procedures that produce data suitable for downstream analysis. In this chapter we review common techniques for exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.

Author information

Authors and Affiliations

Center for Statistical Sciences, Brown University, Providence, RI, USA
Zhijin Wu
Department of Electrical Engineering, Wright State University, Dayton, OH, USA
Zhiqiang Wu

Editor information

Editors and Affiliations

Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Heejung Bang
Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Xi Kathy Zhou
Journal of Experimental Medicine, Rockefeller University Press, First Ave. 1114, New York, 10021, New York, USA
Heather L. van Epps
Weill Medical College, Dept. Public Health, Cornell University, East 69th St. 411, New York, 10021, New York, USA
Madhu Mazumdar

Cite this protocol

Wu, Z., Wu, Z. (2010). Exploration, Visualization, and Preprocessing of High–Dimensional Data. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_8

DOI: https://doi.org/10.1007/978-1-60761-580-4_8
Published: 15 December 2009
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-60761-578-1
Online ISBN: 978-1-60761-580-4
eBook Packages: Springer Protocols
