Data quality awareness: a case study for cost optimal association rule mining

Berti-Équille, Laure

doi:10.1007/s10115-006-0006-x

Regular Paper
Published: 28 March 2006

Volume 11, pages 191–215, (2007)
Knowledge and Information Systems

Abstract

The quality of discovered association rules is commonly evaluated with interestingness measures (typically support and confidence), which give the user indicators for understanding and using the newly discovered knowledge. Low-quality datasets severely degrade the quality of the discovered rules, and one might legitimately wonder whether a so-called “interesting” rule, denoted LHS → RHS, is meaningful when 30% of the LHS data are no longer up to date, 20% of the RHS data are inaccurate, and 15% of the LHS data come from a source well known for its poor credibility. This paper presents an overview of data quality characterization and management techniques that can be used to improve the quality awareness of knowledge discovery and data mining processes. We propose integrating data quality indicators into quality-aware association rule mining, together with a cost-based probabilistic model for selecting legitimately interesting rules. Experiments on the challenging KDD-Cup-98 datasets show that variations in data quality have a great impact on the cost and quality of discovered association rules, and they confirm our approach of integrating the management of data quality indicators into the KDD process to ensure the quality of data mining results.
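
For readers unfamiliar with the two interestingness measures named in the abstract, the following minimal sketch computes support and confidence for a rule LHS → RHS using their standard textbook definitions. The function names and the toy transactions are assumptions made for illustration; the sketch does not reproduce the paper's cost-based probabilistic model or its data quality indicators.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Transaction], lhs: FrozenSet[str], rhs: FrozenSet[str]
) -> float:
    """support(LHS and RHS together) divided by support(LHS)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0.0:
        return 0.0
    return support(transactions, lhs | rhs) / lhs_support


# Hypothetical toy data, not taken from the KDD-Cup-98 datasets.
transactions = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
lhs = frozenset({"bread"})
rhs = frozenset({"butter"})

rule_support = support(transactions, lhs | rhs)        # 2 of 4 transactions
rule_confidence = confidence(transactions, lhs, rhs)   # 2 of the 3 with bread
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Both measures are computed from the recorded values alone, so neither reflects whether those values are accurate, current, or credible, which is the gap the abstract points to.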

Author information

Authors and Affiliations

IRISA, University of Rennes I, Campus Universitaire de Beaulieu, 35042, Rennes, France
Laure Berti-Équille

Corresponding author

Correspondence to Laure Berti-Équille.

About this article

Cite this article

Berti-Équille, L. Data quality awareness: a case study for cost optimal association rule mining. Knowl Inf Syst 11, 191–215 (2007). https://doi.org/10.1007/s10115-006-0006-x

Received: 09 May 2005
Revised: 01 November 2005
Accepted: 14 January 2006
Published: 28 March 2006
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10115-006-0006-x
