Abstract
Constraint discovery in relational databases aims to find constraints that express dependency relationships among a set of attributes. It has been applied successfully to data cleaning, data error detection, and police and security operations. In this paper, we propose a new type of constraint, called distributional constraints (DCs), which leverages the distribution of attribute values for intelligent auditing and security analysis. A DC specifies the range of attribute values that most data follow. It enables financial auditors, law enforcement, and security analysts to identify data with anomalous distributions and to explain the reasons for such anomalies. In police and security applications, DCs can help detect potential criminal activities, fraud, and other security threats by identifying unusual patterns in data. To discover DCs efficiently, we propose an inference system that finds a minimal cover of a set of DCs. We also propose an optimization technique, BitVector indexing, to further speed up DC discovery. We conduct experiments on 12 real datasets, including medical bills and credit card statements, to validate the efficiency and effectiveness of our solution. The experiments evaluate the performance of DC discovery and the effectiveness of DCs for detecting abnormal data in different audit datasets.
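The idea of a constraint that "specifies the range of attribute values that most data follow" can be illustrated with a small sketch. This is not the paper's formal definition or discovery algorithm, neither of which appears in the abstract. It assumes a single numeric attribute and a hypothetical coverage threshold, and the names discover_range, violations, and coverage are introduced here purely for illustration.

```python
import math

def discover_range(values, coverage=0.9):
    """Return the narrowest (lo, hi) interval holding at least `coverage` of values.

    Illustrative assumption: a distributional constraint on one numeric
    attribute is modelled as the tightest interval covering most of the data.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Number of values the interval has to contain.
    k = max(1, math.ceil(coverage * len(ordered)))
    # Slide a window of k consecutive sorted values and keep the narrowest one.
    best = min(
        range(len(ordered) - k + 1),
        key=lambda i: ordered[i + k - 1] - ordered[i],
    )
    return ordered[best], ordered[best + k - 1]

def violations(rows, attr, lo, hi):
    """Return the rows whose `attr` value falls outside the closed range [lo, hi]."""
    return [row for row in rows if not lo <= row[attr] <= hi]

if __name__ == "__main__":
    # Hypothetical medical-bill amounts; the last one is deliberately unusual.
    amounts = [120, 135, 110, 128, 140, 125, 131, 118, 122, 9800]
    bills = [{"id": i + 1, "amount": a} for i, a in enumerate(amounts)]
    lo, hi = discover_range(amounts, coverage=0.9)
    print("range followed by most bills:", (lo, hi))
    print("bills to audit:", violations(bills, "amount", lo, hi))
```

In this toy data, nine of the ten amounts fall within a narrow interval, so the one bill outside it is flagged for review, and the interval itself explains why. A real audit setting would likely condition the range on other attributes, which this sketch does not attempt.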
Funding
Funding was provided by the National Natural Science Foundation of China (Grant No. 61702449), the Key Research and Development Program of Zhejiang Province of China (Grant No. 2020C01024), and the Natural Science Foundation of Zhejiang Province of China (Grant No. LY18F020005).
About this article
Cite this article
Hu, W., Jiang, D., Wu, S. et al. Distributional constraint discovery for intelligent auditing. Knowl Inf Syst 65, 5195–5229 (2023). https://doi.org/10.1007/s10115-023-01929-z
DOI: https://doi.org/10.1007/s10115-023-01929-z