Data mining of vector–item patterns using neighborhood histograms

  • Regular Paper
Knowledge and Information Systems

Abstract

The representation of multiple continuous attributes as dimensions in a vector space has been among the most influential concepts in machine learning and data mining. We consider sets of related continuous attributes as vector data and search for patterns that relate a vector attribute to one or more items. The presence of an item set defines a subset of vectors that may or may not show unexpected density fluctuations. We test for fluctuations by studying density histograms. A vector–item pattern is considered significant if its density histogram significantly differs from what is expected for a random subset of transactions. Using two different density measures, we evaluate the algorithm on two real data sets and one that was artificially constructed from time series data.
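The test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the density measure (number of neighbors within a fixed radius), the histogram binning, the L1 distance between histograms, and the resampling-based p-value are all assumptions chosen for illustration, as are the function names `neighbor_histogram`, `l1_distance`, and `vector_item_p_value`. The idea is to compare the neighborhood histogram of the vectors that carry an item against histograms of random subsets of the same size.

```python
import math
import random


def neighbor_histogram(points, radius, n_bins):
    """Histogram of per-point neighbor counts within `radius`.

    Neighbor count is an assumed density measure, not necessarily the paper's.
    """
    k = len(points)
    hist = [0] * n_bins
    for i, p in enumerate(points):
        count = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(p, q) <= radius)
        # count lies in 0..k-1, so the bin index lies in 0..n_bins-1
        hist[count * n_bins // k] += 1
    return hist


def l1_distance(hist, expected):
    """Sum of absolute bin differences (assumed test statistic)."""
    return sum(abs(a - b) for a, b in zip(hist, expected))


def vector_item_p_value(vectors, has_item, radius=1.0, n_bins=4,
                        n_random=200, seed=0):
    """Empirical p-value: how unusual is the item subset's density histogram
    compared with random subsets of the same size?"""
    rng = random.Random(seed)
    subset = [v for v, flag in zip(vectors, has_item) if flag]
    k = len(subset)
    observed = neighbor_histogram(subset, radius, n_bins)
    random_hists = [neighbor_histogram(rng.sample(vectors, k), radius, n_bins)
                    for _ in range(n_random)]
    # Expected histogram: bin-wise mean over the random subsets
    expected = [sum(h[b] for h in random_hists) / n_random
                for b in range(n_bins)]
    obs_stat = l1_distance(observed, expected)
    n_extreme = sum(1 for h in random_hists
                    if l1_distance(h, expected) >= obs_stat)
    return (1 + n_extreme) / (1 + n_random)


# Synthetic demo data (hypothetical): 60 scattered vectors without the item,
# plus 20 tightly clustered vectors that all carry the item.
data_rng = random.Random(1)
background = [(data_rng.uniform(0, 10), data_rng.uniform(0, 10))
              for _ in range(60)]
cluster = [(5 + data_rng.uniform(-0.3, 0.3), 5 + data_rng.uniform(-0.3, 0.3))
           for _ in range(20)]
vectors = background + cluster
has_item = [False] * 60 + [True] * 20
p = vector_item_p_value(vectors, has_item)
print(p)
```

In this synthetic setup the item-carrying vectors are much denser than a random subset, so the p-value comes out small. When items are assigned independently of the vectors, the observed histogram should look like those of the random subsets and the p-value should typically be unremarkable.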

Author information

Corresponding author

Correspondence to Anne M. Denton.

About this article

Cite this article

Denton, A.M., Wu, J. Data mining of vector–item patterns using neighborhood histograms. Knowl Inf Syst 21, 173–199 (2009). https://doi.org/10.1007/s10115-009-0201-7

  • DOI: https://doi.org/10.1007/s10115-009-0201-7
