Distributional constraint discovery for intelligent auditing

Hu, Wentao; Jiang, Dawei; Wu, Sai; Chen, Ke; Chen, Gang

doi:10.1007/s10115-023-01929-z

Regular Paper
Published: 07 August 2023

Volume 65, pages 5195–5229, (2023)

Knowledge and Information Systems

Wentao Hu ORCID: orcid.org/0000-0003-0930-7810¹,
Dawei Jiang²,
Sai Wu²,
Ke Chen² &
…
Gang Chen²

Abstract

Constraint discovery in relational databases aims to find constraints that express dependency relationships among a set of attributes. It has been applied successfully to data cleaning, data error detection, and police and security operations. In this paper, we propose a new type of constraint, called distributional constraints (DCs), which leverages the distribution of attribute values for intelligent auditing and security analysis. A DC specifies the range of attribute values that most data follow, which enables financial auditors, law enforcement, and security analysts to identify data with anomalous distributions and to explain the reasons for such anomalies. In police and security applications, distributional constraints can help detect potential criminal activities, fraud, and other security threats by identifying unusual patterns in data. To discover distributional constraints efficiently, we propose an inference system to find the minimum coverage of a set of DCs. We also propose an optimization technique, BitVector indexing, to further speed up distributional constraint discovery. We conduct experiments on 12 real datasets, including medical bills and credit card statements, to validate the efficiency and effectiveness of our solution. We report the performance of DC discovery and the effectiveness of using DCs to detect abnormal data in different audit datasets.
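As a rough illustration of the idea of a range that most data follow, the sketch below derives a range for a numeric attribute among the rows that match a condition, then flags rows outside it. It is not the paper's algorithm: the robust range (median plus or minus `k` times the median absolute deviation), the bit-vector layout (one Python integer per attribute value), and all function and parameter names are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def build_bitvectors(rows, attr):
    """Map each value of `attr` to an int whose bit i is set when row i has that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[attr]] |= 1 << i
    return dict(index)


def robust_range(values, k=5.0):
    """Return (lo, hi) = median -/+ k * MAD. An assumed stand-in for 'the range most data follow'."""
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    return mid - k * mad, mid + k * mad


def violations(rows, cond_attr, cond_value, target_attr, k=5.0):
    """Return ((lo, hi), indices of matching rows whose target value falls outside the range)."""
    # Select the rows that satisfy the condition by reading the bit vector.
    bits = build_bitvectors(rows, cond_attr).get(cond_value, 0)
    selected = [i for i in range(len(rows)) if (bits >> i) & 1]
    if not selected:
        return None, []
    lo, hi = robust_range([rows[i][target_attr] for i in selected], k)
    outside = [i for i in selected if not lo <= rows[i][target_attr] <= hi]
    return (lo, hi), outside
```

In this sketch a row is reported together with the range it violates, which is the kind of explanation the abstract attributes to DCs. A real implementation would build the bit vectors once and reuse them across many candidate constraints rather than rebuilding them per call.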

Funding

Funding was provided by the National Natural Science Foundation of China (Grant No. 61702449), the Key Research and Development Program of Zhejiang Province of China (Grant No. 2020C01024), and the Natural Science Foundation of Zhejiang Province of China (Grant No. LY18F020005).

Author information

Authors and Affiliations

Zhejiang Police College, Hangzhou, China
Wentao Hu
The Key Laboratory of Big Data Intelligent Computing of Zhejiang Province, Zhejiang University, Hangzhou, China
Dawei Jiang, Sai Wu, Ke Chen & Gang Chen

Corresponding author

Correspondence to Wentao Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, W., Jiang, D., Wu, S. et al. Distributional constraint discovery for intelligent auditing. Knowl Inf Syst 65, 5195–5229 (2023). https://doi.org/10.1007/s10115-023-01929-z

Received: 22 February 2022
Revised: 22 June 2023
Accepted: 05 July 2023
Published: 07 August 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10115-023-01929-z
