Abstract
The representation of multiple continuous attributes as dimensions in a vector space has been among the most influential concepts in machine learning and data mining. We consider sets of related continuous attributes as vector data and search for patterns that relate a vector attribute to one or more items. The presence of an item set defines a subset of vectors that may or may not show unexpected density fluctuations. We test for such fluctuations by studying density histograms. A vector–item pattern is considered significant if its density histogram differs significantly from the histogram expected for a random subset of transactions. Using two different density measures, we evaluate the algorithm on two real data sets and on one that was artificially constructed from time series data.
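The test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the density measure (a count of subset neighbors within a fixed radius), the L1 distance between histograms, the resampling-based p-value, and all names (`neighbor_counts`, `histogram_pvalue`, `radius`, `n_random`) are assumptions chosen for illustration only.

```python
import numpy as np


def neighbor_counts(vectors, members, radius):
    """For each selected vector, count the other selected vectors within `radius`.

    `members` may be a boolean mask or an array of row indices.
    """
    pts = vectors[members]
    # Pairwise Euclidean distances within the selected subset.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return (dists <= radius).sum(axis=1) - 1  # subtract 1 to exclude the point itself


def histogram_pvalue(vectors, has_item, radius, n_random=200, seed=0):
    """Estimate how unusual the density histogram of an item's vectors is.

    Assumption: the reference is the mean histogram over random subsets of the
    same size, and histograms are compared with an L1 distance.
    """
    rng = np.random.default_rng(seed)
    n, k = len(vectors), int(np.sum(has_item))
    bins = np.arange(k + 1) - 0.5  # one bin per possible neighbor count 0..k-1

    observed = np.histogram(neighbor_counts(vectors, has_item, radius), bins=bins)[0]

    # Density histograms of random subsets with the same number of vectors.
    random_hists = np.empty((n_random, k))
    for i in range(n_random):
        idx = rng.choice(n, size=k, replace=False)
        random_hists[i] = np.histogram(neighbor_counts(vectors, idx, radius), bins=bins)[0]

    expected = random_hists.mean(axis=0)
    observed_stat = np.abs(observed - expected).sum()
    null_stats = np.abs(random_hists - expected).sum(axis=1)

    # Fraction of random subsets at least as far from the expectation as the item.
    return (1 + np.sum(null_stats >= observed_stat)) / (1 + n_random)


# Hypothetical usage: 300 vectors in 2-D, with the item present on a tight cluster.
rng = np.random.default_rng(1)
vectors = rng.uniform(size=(300, 2))
has_item = np.zeros(300, dtype=bool)
has_item[:30] = True
vectors[:30] = 0.5 + 0.01 * rng.standard_normal((30, 2))
print(histogram_pvalue(vectors, has_item, radius=0.05))
```

In this sketch the same random subsets are used both to estimate the expected histogram and to form the null distribution, which keeps the example short but is slightly optimistic; a more careful version would use separate draws or an analytical test on the histogram bins.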
Cite this article
Denton, A.M., Wu, J. Data mining of vector–item patterns using neighborhood histograms. Knowl Inf Syst 21, 173–199 (2009). https://doi.org/10.1007/s10115-009-0201-7
DOI: https://doi.org/10.1007/s10115-009-0201-7