Exploration, Visualization, and Preprocessing of High–Dimensional Data

Statistical Methods in Molecular Biology

Part of the book series: Methods in Molecular Biology (MIMB, volume 620)

Abstract

Rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. As a result, the quantity of interest is not observed directly, and a number of preprocessing procedures are needed to convert the raw data into a biologically relevant format. This also makes exploratory data analysis and visualization essential for detecting possible defects, anomalies, or distortions in the data and for testing underlying assumptions, and thus for ensuring data quality. The characteristics of the data structure revealed in exploratory analysis often motivate the preprocessing decisions that produce data suitable for downstream analysis. In this chapter we review common techniques for exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.

Copyright information

© 2010 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Wu, Z., Wu, Z. (2010). Exploration, Visualization, and Preprocessing of High–Dimensional Data. In: Bang, H., Zhou, X., van Epps, H., Mazumdar, M. (eds) Statistical Methods in Molecular Biology. Methods in Molecular Biology, vol 620. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-580-4_8

  • DOI: https://doi.org/10.1007/978-1-60761-580-4_8

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-60761-578-1

  • Online ISBN: 978-1-60761-580-4

  • eBook Packages: Springer Protocols
