Distributional constraint discovery for intelligent auditing

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Constraint discovery in relational databases aims to find constraints that express dependency relationships among a set of attributes. It has been applied successfully to data cleaning, data error detection, and police and security operations. In this paper, we propose a new type of constraint, called distributional constraints (DCs), which exploits the distribution of attribute values for intelligent auditing and security analysis. A DC specifies the range of attribute values that most data follow, and it enables financial auditors, law enforcement, and security analysts to identify data with anomalous distributions and to explain the reasons for such anomalies. In police and security applications, DCs can help detect potential criminal activity, fraud, and other security threats by identifying unusual patterns in data. To discover DCs efficiently, we propose an inference system for finding the minimum coverage of a set of DCs. We also propose an optimization technique, BitVector indexing, to further speed up DC discovery. We conduct experiments on 12 real datasets, such as medical bills and credit card statements, to validate the efficiency and effectiveness of our solution. We report the performance of DC discovery and the effectiveness of using DCs to detect abnormal data in different audit datasets.
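
To make the idea of "the range of attribute values that most data follow" concrete, the following is a minimal Python sketch and not the paper's algorithm. It assumes that a constraint on one numeric attribute can be modelled as a closed interval covering a chosen fraction of the values. The function names `quantile_range` and `violations`, the `coverage` parameter, and the sample amounts are all hypothetical and do not come from the article.

```python
from typing import List, Sequence, Tuple


def quantile_range(values: Sequence[float], coverage: float = 0.9) -> Tuple[float, float]:
    """Return (lo, hi) bounding the central `coverage` share of the values.

    Assumption: a range constraint is approximated by trimming an equal
    number of values from each tail of the sorted attribute column.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    keep = max(1, round(coverage * n))  # number of values the range should cover
    drop = (n - keep) // 2              # values trimmed from each tail
    return ordered[drop], ordered[n - 1 - drop]


def violations(values: Sequence[float], bounds: Tuple[float, float]) -> List[int]:
    """Return the row indices whose value falls outside the range."""
    lo, hi = bounds
    return [i for i, v in enumerate(values) if not lo <= v <= hi]


# Hypothetical billing amounts with one obviously unusual entry.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 101, 5000]
bounds = quantile_range(amounts, coverage=0.8)
print(bounds, violations(amounts, bounds))
```

Rows reported by `violations` are only candidates for review. A symmetric trim also flags the smallest values, so an auditor would still need to judge which flagged rows are genuinely anomalous.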

Notes

  1. https://www.data.gov/.

  2. https://archive.ics.uci.edu/ml/datasets.

Funding

Funding was provided by the National Natural Science Foundation of China (Grant No. 61702449), the Key Research and Development Program of Zhejiang Province of China (Grant No. 2020C01024), and the Natural Science Foundation of Zhejiang Province of China (Grant No. LY18F020005).

Author information

Corresponding author

Correspondence to Wentao Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, W., Jiang, D., Wu, S. et al. Distributional constraint discovery for intelligent auditing. Knowl Inf Syst 65, 5195–5229 (2023). https://doi.org/10.1007/s10115-023-01929-z

  • DOI: https://doi.org/10.1007/s10115-023-01929-z
