research-article
DOI: 10.1145/3624062.3624192

The I/O Trace Initiative: Building a Collaborative I/O Archive to Advance HPC

Published: 12 November 2023

ABSTRACT

HPC application developers and administrators need to understand the complex interplay between compute clusters and storage systems to make effective optimization decisions. Ad hoc investigations of this interplay based on isolated case studies can lead to conclusions that are incorrect or difficult to generalize. The I/O Trace Initiative aims to improve the scientific community’s understanding of I/O operations by building a searchable collaborative archive of I/O traces from a wide range of applications and machines, with a focus on high-performance computing and scalable AI/ML. This initiative advances the accessibility of I/O trace data by enabling users to locate and compare traces based on user-specified criteria. It also provides a visual analytics platform for in-depth analysis, paving the way for the development of advanced performance optimization techniques. By acting as a hub for trace data, the initiative fosters collaborative research by encouraging data sharing and collective learning.
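The abstract describes locating and comparing traces "based on user-specified criteria" but does not specify the archive's query interface or metadata schema. As a minimal sketch of the idea, assuming a hypothetical metadata record with fields like application name, machine, and I/O API, a criteria-based lookup might look like:

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    # Hypothetical metadata fields; the archive's actual schema is not given here.
    app: str
    machine: str
    api: str      # e.g. "POSIX", "MPI-IO", "HDF5"
    ranks: int


def search(archive, **criteria):
    """Return records whose metadata match every user-specified criterion."""
    return [r for r in archive
            if all(getattr(r, key) == value for key, value in criteria.items())]


# Illustrative archive contents (invented for this sketch).
archive = [
    TraceRecord("hacc", "theta", "POSIX", 1024),
    TraceRecord("vpic", "cori", "MPI-IO", 2048),
    TraceRecord("hacc", "cori", "MPI-IO", 512),
]

# Locate traces by user-specified criteria, e.g. all HACC runs that used MPI-IO.
hits = search(archive, app="hacc", api="MPI-IO")
```

A production archive would back such queries with a search engine rather than a linear scan, but the user-facing contract is the same: filter trace metadata on arbitrary criteria, then fetch the matching traces for comparison.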


Published in

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN: 9798400707858
DOI: 10.1145/3624062
Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
