Data mining of vector–item patterns using neighborhood histograms

Denton, Anne M.; Wu, Jianfei

doi:10.1007/s10115-009-0201-7

Regular Paper
Published: 19 March 2009

Volume 21, pages 173–199 (2009)
Knowledge and Information Systems

Anne M. Denton¹ &
Jianfei Wu¹

Abstract

The representation of multiple continuous attributes as dimensions in a vector space has been among the most influential concepts in machine learning and data mining. We consider sets of related continuous attributes as vector data and search for patterns that relate a vector attribute to one or more items. The presence of an item set defines a subset of vectors that may or may not show unexpected density fluctuations. We test for fluctuations by studying density histograms. A vector–item pattern is considered significant if its density histogram significantly differs from what is expected for a random subset of transactions. Using two different density measures, we evaluate the algorithm on two real data sets and one that was artificially constructed from time series data.
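
The test described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the density measures or the significance test, so everything below is an assumption made for illustration: density is measured as the number of neighbors within a Euclidean radius, histograms are compared with a plain L1 distance, and significance is estimated by Monte Carlo sampling of random subsets of the same size. The names `neighbor_histogram`, `l1_distance`, and `pattern_p_value` are hypothetical and do not come from the paper.

```python
import math
import random


def neighbor_histogram(subset, radius, max_count):
    """Histogram of within-subset neighbor counts.

    Bin c holds the number of vectors that have exactly c other subset
    vectors within Euclidean distance `radius`; counts above `max_count`
    are collected in the last bin.
    """
    hist = [0] * (max_count + 1)
    for i, v in enumerate(subset):
        count = sum(
            1 for j, w in enumerate(subset)
            if i != j and math.dist(v, w) <= radius
        )
        hist[min(count, max_count)] += 1
    return hist


def l1_distance(hist, reference):
    """Sum of absolute bin differences (an assumed, illustrative statistic)."""
    return sum(abs(h - r) for h, r in zip(hist, reference))


def pattern_p_value(vectors, has_item, radius, max_count=5, n_random=200, seed=0):
    """Monte Carlo p-value for one vector-item pattern (illustrative only)."""
    rng = random.Random(seed)
    subset = [v for v, flag in zip(vectors, has_item) if flag]
    observed = neighbor_histogram(subset, radius, max_count)

    # Null model: random subsets of the same size drawn from all transactions.
    null_hists = [
        neighbor_histogram(rng.sample(vectors, len(subset)), radius, max_count)
        for _ in range(n_random)
    ]
    # Expected histogram: bin-wise mean over the random subsets.
    expected = [sum(column) / n_random for column in zip(*null_hists)]

    observed_stat = l1_distance(observed, expected)
    exceed = sum(
        1 for h in null_hists if l1_distance(h, expected) >= observed_stat
    )
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_random + 1)


# Synthetic example: the item occurs only on a tight cluster of vectors.
rng = random.Random(1)
background = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(60)]
cluster = [
    (5 + rng.uniform(-0.1, 0.1), 5 + rng.uniform(-0.1, 0.1)) for _ in range(12)
]
vectors = background + cluster
has_item = [False] * 60 + [True] * 12
p_value = pattern_p_value(vectors, has_item, radius=0.5)
print(p_value)
```

In this synthetic setup every vector carrying the item has many close neighbors within the subset, whereas a random subset of the same size consists mostly of isolated vectors, so the estimated p-value comes out small. The radius, the bin cap, and the number of random subsets are arbitrary choices for the example.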

Author information

Authors and Affiliations

Department of Computer Science and Operations Research, North Dakota State University, Fargo, ND, 58108-6050, USA
Anne M. Denton & Jianfei Wu

Authors

Anne M. Denton
Jianfei Wu

Corresponding author

Correspondence to Anne M. Denton.

About this article

Cite this article

Denton, A.M., Wu, J. Data mining of vector–item patterns using neighborhood histograms. Knowl Inf Syst 21, 173–199 (2009). https://doi.org/10.1007/s10115-009-0201-7

Received: 14 March 2008
Revised: 25 September 2008
Accepted: 08 February 2009
Published: 19 March 2009
Issue Date: November 2009
DOI: https://doi.org/10.1007/s10115-009-0201-7
