
Title: LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs

Technical Report ·
DOI: https://doi.org/10.2172/1813665 · OSTI ID: 1813665
 [1];  [1];  [2];  [2]
  1. New Mexico State Univ., Las Cruces, NM (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

GPUs are now a fundamental accelerator for many high-performance computing (HPC) applications. Many view them as a key enabler of the surge in fields such as machine learning and convolutional neural networks. To get the best performance and efficiency out of a GPU, we need monitoring tools that guide code optimization. Since NVIDIA GPUs are currently the most commonly deployed in HPC applications and systems, NVIDIA tools are the usual solution for performance monitoring. The Lightweight Distributed Metric Service (LDMS) at Sandia is an infrastructure widely adopted for monitoring large-scale systems and applications. Sandia has already developed a CPU application monitoring capability within LDMS, so we chose to develop a GPU monitoring capability within the same framework. In this report, we discuss the current limitations of the NVIDIA monitoring tools, explain how we overcame them, and present an overview of the tool we built to monitor GPU performance in LDMS and its capabilities. We also discuss our current validation results: most performance counter results from the vendor tools match those from our tool when LDMS is used to collect them. Furthermore, our tool provides these statistics as a time series throughout the entire runtime, not just as aggregate statistics at the end of the application run. This allows the user to see how an application's behavior evolves over its lifetime.

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
NA0003525
OSTI ID:
1813665
Report Number(s):
SAND2020-13137; 697729
Country of Publication:
United States
Language:
English

Similar Records

Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs
Conference · Jul 1, 2023 · OSTI ID:1813665

Performance Analysis of PIConGPU: Particle-in-Cell on GPUs using NVIDIA’s NSight Systems and NSight Compute
Technical Report · Jan 19, 2021 · OSTI ID:1813665

Production Application Performance Data Streaming for System Monitoring
Journal Article · Jun 1, 2019 · ACM Transactions on Modeling and Performance Evaluation of Computing Systems · OSTI ID:1813665
