On Irregularity Localization for Scientific Data Analysis Workflows

  • Conference paper
Computational Science – ICCS 2023 (ICCS 2023)

Abstract

The paradigm shift towards data-driven science is massively transforming the scientific process. Scientists use exploratory data analysis to arrive at new insights. This requires them to specify complex data analysis workflows, which consist of compositions of data analysis functions. These functions encapsulate information extraction, integration, and model building through operations specified in linear algebra, relational algebra, and iterative control flow among them. A key challenge in such workflows is to understand and act upon irregularities, such as outliers in aggregations. Regardless of whether irregularities stem from errors or point to new insights, they must be localized and rationalized in order to ensure the correctness and overall trustworthiness of the workflow. We propose to automatically reduce a workflow’s input data while still observing some outcome of interest, thereby computing a minimal reproducible example to support workflow debugging. In essence, we reduce the problem to determining the input relevant to reproducing the irregularity. To that end, we present a portfolio of strategies tailored to data analysis workflows that operate on tabular data. We investigate their feasibility in terms of input reduction, and compare their effectiveness and efficiency in three characteristic cases.
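To make the idea of outcome-preserving input reduction concrete, the following is a minimal sketch of one well-known reduction strategy, delta debugging in the style of Zeller and Hildebrandt, applied to the rows of a small table. It is an illustration only and is not claimed to be one of the strategies in the paper's portfolio. The names `reduce_rows` and `has_outlier_mean`, the toy rows, and the threshold of 1000 are all hypothetical and chosen for the example.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, TypeVar

Row = TypeVar("Row")


def reduce_rows(rows: Sequence[Row],
                still_irregular: Callable[[List[Row]], bool]) -> List[Row]:
    """Shrink `rows` while `still_irregular` keeps holding (ddmin-style).

    The result still shows the irregularity, and removing any single
    remaining row makes it disappear.
    """
    current = list(rows)
    if not still_irregular(current):
        raise ValueError("the full input does not show the irregularity")
    n = 2  # number of chunks the current input is split into
    while len(current) >= 2:
        size = -(-len(current) // n)  # ceiling division
        chunks = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False
        # First try to keep only one chunk.
        for chunk in chunks:
            if still_irregular(chunk):
                current, n, reduced = chunk, 2, True
                break
        # Otherwise try to drop one chunk and keep the rest.
        if not reduced:
            for i in range(len(chunks)):
                rest = [r for j, c in enumerate(chunks) if j != i for r in c]
                if still_irregular(rest):
                    current, n, reduced = rest, max(n - 1, 2), True
                    break
        # If nothing worked, split more finely or stop at single rows.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), 2 * n)
    return current


def has_outlier_mean(rows: List[tuple]) -> bool:
    """Hypothetical outcome of interest: some group mean exceeds 1000."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return any(sum(v) / len(v) > 1000 for v in groups.values())


# Toy tabular input: (group, measurement) pairs with one suspicious value.
rows = [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5000.0),
        ("c", 4.0), ("c", 6.0), ("a", 2.5), ("b", 1.5)]

minimal = reduce_rows(rows, has_outlier_mean)
print(minimal)
```

In this toy case the reduction ends with the single row that carries the extreme value, which is the kind of minimal reproducible example the abstract refers to. Each candidate subset requires re-running the workflow, so the number of predicate calls is the main cost of such a strategy.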

Notes

  1. kaggle.com/mysarahmadbhat/shark-attacks.

  2. https://pandas.pydata.org/.

  3. Our replication package can be found at https://osf.io/fk2x4/?view_only=442434edaec94c2b8172a759699d0886.

  4. wesmckinney.com/book/data-analysis-examples.html.

Acknowledgements

Funded in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - SFB 1404 FONDA.

Author information

Corresponding author

Correspondence to Anh Duc Vu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vu, A.D., Tsigkanos, C., Quiané-Ruiz, JA., Markl, V., Kehrer, T. (2023). On Irregularity Localization for Scientific Data Analysis Workflows. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14073. Springer, Cham. https://doi.org/10.1007/978-3-031-35995-8_24

  • DOI: https://doi.org/10.1007/978-3-031-35995-8_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35994-1

  • Online ISBN: 978-3-031-35995-8

  • eBook Packages: Computer Science, Computer Science (R0)
