Abstract
The paradigm shift towards data-driven science is transforming the scientific process. Scientists use exploratory data analysis to arrive at new insights, which requires them to specify complex data analysis workflows composed of data analysis functions. These functions encapsulate information extraction, integration, and model building through operations specified in linear algebra and relational algebra, with iterative control flow among them. A key challenge is to understand and act upon irregularities in these workflows, such as outliers in aggregations. Regardless of whether irregularities stem from errors or point to new insights, they must be localized and rationalized to ensure the correctness and overall trustworthiness of the workflow. We propose to automatically reduce a workflow’s input data while still observing some outcome of interest, thereby computing a minimal reproducible example to support workflow debugging. In essence, we reduce the problem to determining the input relevant to reproducing the irregularity. To that end, we present a portfolio of strategies tailored to data analysis workflows that operate on tabular data. We investigate their feasibility in terms of input reduction and compare their effectiveness and efficiency in three characteristic cases.
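To make the idea of outcome-preserving input reduction concrete, the following is a minimal sketch in the spirit of delta debugging. It is not one of the strategies evaluated in the paper. The names `workflow`, `has_outlier`, and `reduce_rows`, the toy table, and the threshold of 100 are all assumptions chosen for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def workflow(rows: Sequence[Row]) -> Dict[str, float]:
    """Hypothetical workflow: group rows by 'group' and average 'value'."""
    groups: Dict[str, List[float]] = {}
    for row in rows:
        groups.setdefault(row["group"], []).append(row["value"])
    return {g: mean(values) for g, values in groups.items()}


def has_outlier(rows: Sequence[Row]) -> bool:
    """Outcome of interest (assumed): some group mean exceeds 100."""
    return any(m > 100 for m in workflow(rows).values())


def reduce_rows(rows: Sequence[Row],
                outcome: Callable[[Sequence[Row]], bool]) -> List[Row]:
    """Greedy, delta-debugging-style reduction over table rows.

    Repeatedly removes chunks of rows as long as the outcome is still
    observed, refining the chunk size down to single rows.
    """
    current = list(rows)
    assert outcome(current), "the outcome must hold on the full input"
    n = 2  # number of chunks the current input is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        reduced = False
        for start in range(0, len(current), chunk):
            candidate = current[:start] + current[start + chunk:]
            if candidate and outcome(candidate):
                current = candidate      # keep the smaller input
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunk == 1:
                break                    # no single row can be dropped
            n = min(n * 2, len(current))  # retry with finer chunks
    return current


rows = [
    {"group": "a", "value": 1.0}, {"group": "a", "value": 2.0},
    {"group": "a", "value": 3.0}, {"group": "a", "value": 4.0},
    {"group": "b", "value": 10.0}, {"group": "b", "value": 12.0},
    {"group": "b", "value": 1000.0}, {"group": "b", "value": 11.0},
]
minimal = reduce_rows(rows, has_outlier)
print(minimal)
```

On this toy table the reduction ends with the single row that carries the extreme value, which is the kind of minimal reproducible example the abstract refers to. Each candidate costs one full re-execution of the workflow, so the number of oracle calls is the main efficiency concern for a sketch like this.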
Notes
- 3. Our replication package can be found at https://osf.io/fk2x4/?view_only=442434edaec94c2b8172a759699d0886.
Acknowledgements
Funded in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - SFB 1404 FONDA.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vu, A.D., Tsigkanos, C., Quiané-Ruiz, JA., Markl, V., Kehrer, T. (2023). On Irregularity Localization for Scientific Data Analysis Workflows. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14073. Springer, Cham. https://doi.org/10.1007/978-3-031-35995-8_24
DOI: https://doi.org/10.1007/978-3-031-35995-8_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35994-1
Online ISBN: 978-3-031-35995-8
eBook Packages: Computer Science, Computer Science (R0)