Tools for GPU Computing – Debugging and Performance Analysis of Heterogeneous HPC Applications

General purpose GPUs are now ubiquitous in high-end supercomputing. All but one of the announced (pre-)exascale systems (the exception being the Japanese Fugaku system, which is based on ARM processors) contain vast numbers of GPUs that deliver the majority of these systems' performance. Thus, GPU programming will be a necessity for application developers using high-end HPC systems. However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs, as orchestrating all the resources of such a system poses a tremendous challenge to developers. Luckily, a rich ecosystem of tools exists to assist developers in every step of GPU application development at all scales. In this paper we present an overview of these tools and discuss their capabilities. We start with an overview of different GPU programming models, from low-level ones like CUDA, over pragma-based models like OpenACC, to high-level approaches like Kokkos. We discuss their respective tool interfaces as the main method for tools to obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools: debuggers and performance analysis tools. Debuggers help the developer to identify problems on the CPU side, on the GPU side, and in the interplay of both. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help to increase the overall performance.


Introduction
General purpose GPUs are now ubiquitous in high-end supercomputing. With the rise of deep learning and the convergence of simulation-based HPC and AI, GPU computing took a major leap forward. All but one of the announced (pre-)exascale systems (the exception being the Japanese Fugaku system, which is based solely on ARM processors) contain vast numbers of GPUs that deliver the majority of these systems' performance. Thus, GPU programming will be a necessity for application developers using high-end HPC systems. However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs: orchestrating all the resources of such a system poses a tremendous challenge to developers. Besides GPUs, other accelerators have been tried, the most prominent being Intel's Xeon Phi as a many-core architecture and FPGAs. However, the Xeon Phi has been discontinued, and FPGAs remain a niche solution for very specific workloads or research projects, not (yet) ready for production use in general HPC.
NVIDIA GPUs power most of today's GPU-enabled supercomputers. 136 systems in the TOP500 list of November 2019 are equipped with NVIDIA GPUs, including the number one and two systems, the U.S.-based Summit [33] and Sierra [25] supercomputers. Thus, we put a strong focus on NVIDIA GPUs in this paper.
Tools have always been an integral part of the HPC software stack. Debuggers and correctness checkers help application developers to write bug-free and efficient code. Code efficiency can be improved by pinpointing bottlenecks with performance analysis tools. The tools community is working hard to provide tools that master the complexity of modern HPC systems [30], facing the same challenges when scaling up as the application developers themselves. Today, a rich ecosystem of tools exists to assist developers in every step of GPU application development at all scales, from a workstation to a supercomputer.
In this paper we present an overview of these tools and discuss their capabilities. We present the currently dominant programming models for GPU computing and discuss their tool interfaces, the main method for tools to obtain information on the execution of a kernel on the GPU, in section 1. Then we look into debuggers in section 2, which help to develop correct heterogeneous applications that scale to hundreds or thousands of GPUs. Performance analysis tools, which help to use these resources efficiently, are discussed in section 3. Finally, we conclude the paper and give an outlook on future developments in heterogeneous supercomputing.

GPU Programming Models
For decades, two programming paradigms dominated the HPC landscape: distributed-memory programming (inter-node) and shared-memory programming (intra-node). The main programming model for distributed-memory programming is MPI, the Message Passing Interface [28], which is used in virtually all HPC applications. MPI is a rather low-level interface, i.e. the user has to express communication patterns and data transfers explicitly. Shared-memory programming is mostly done via OpenMP [36], a directive-based API. Alternatives exist for both MPI and OpenMP, like the PGAS (Partitioned Global Address Space) model for distributed memory or Pthreads and TBB (Threading Building Blocks) for shared memory, but none come close to MPI and OpenMP in popularity.
With the advent of general purpose GPUs, things changed significantly. A new, very powerful but also very complex architecture was thrown into the mix, while the old programming paradigms remain valid for creating scaling HPC applications. Several programming models exist for GPUs: some are low-level like MPI, others are pragma-based like OpenMP. Some support only certain languages or specific vendor architectures, others are more open. So it is a challenge for application developers to choose the right programming model for their application, and for tools developers to choose which models to support. In this section we present various GPU programming models that suit different needs: CUDA and OpenCL as high-performance low-level interfaces, OpenACC and OpenMP as easy-to-use yet efficient directive-based approaches, and Kokkos and RAJA, which aim for performance portability on a wide range of architectures. Where applicable, we also give an introduction to the respective tools interface as the main source for tools to get information on the kernels running on the accelerator and the data transfers to and from the device.

CUDA
CUDA [32] is a parallel computing platform and programming model developed by NVIDIA for general computing on NVIDIA GPUs. It is a very low-level interface, i.e. the programmer has to specify every data movement and kernel launch explicitly. Providing access to all hardware features of modern GPUs, like Unified Memory, CUDA can yield the highest performance achievable on GPUs. However, this comes at the cost of rather high development effort and non-portability. A rich set of libraries, both from NVIDIA directly and from third parties, is available for CUDA, enabling developers to harness the power of CUDA without dealing with all the low-level details of the architecture. So far, CUDA is the most popular programming model for GPU programming, thus most tools support CUDA to some extent. While CUDA itself is C++, CUDA bindings exist for many programming languages like C, Fortran (currently only for PGI compilers), Python, and MATLAB.

CUPTI -The CUDA Performance Tools Interface
The NVIDIA CUDA Profiling Tools Interface (CUPTI) provides performance analysis tools with detailed information about how applications are using the GPUs in a system. CUPTI provides two simple yet powerful mechanisms that allow performance analysis tools to understand the inner workings of an application and deliver valuable insights to developers. The first mechanism is a callback API that allows tools to inject analysis code into the entry and exit points of each CUDA C Runtime (CUDART) and CUDA Driver API function. Using this callback API, tools can monitor an application's interactions with the CUDA Runtime and driver. The second mechanism allows performance analysis tools to query and configure hardware event counters designed into the GPU and software event counters in the CUDA driver. These event counters record activity such as instruction counts, memory transactions, cache hits/misses, divergent branches, and more. This enables automated bottleneck identification based on metrics such as instruction throughput, memory throughput, and more.

OpenCL, SYCL and oneAPI
The aim of OpenCL, the Open Computing Language, is to provide a vendor-independent programming interface for all kinds of computing devices, from CPUs over GPUs to FPGAs. OpenCL is developed by the Khronos Group, an open industry consortium of over 100 leading hardware and software companies. OpenCL, like CUDA, is a low-level API where the kernels are written in the OpenCL C++ kernel language, a static subset of C++14.
To ease the development of heterogeneous applications, the Khronos Group developed SYCL as an abstraction layer built on the concepts, portability, and efficiency of OpenCL. SYCL allows the developer to program at a higher level than OpenCL, while still having access to lower-level code. SYCL removes a lot of OpenCL's boilerplate code and enables single-source programming, where host and device code are contained in the same source file.
The newest member in the OpenCL language space is Intel's oneAPI with DPC++ (Data Parallel C++), which in turn is built upon SYCL. Due to its recent beta release and the limited availability of hardware at the time of writing, tool support for oneAPI could not be evaluated for this paper. However, it is clear that the well-known Intel tools VTune and Advisor will have rich support for oneAPI. The most interesting and unique feature of the Intel Advisor will be an analysis of the potential gain of offloading a sequential code path to an accelerator.
It will be interesting to see how oneAPI will be adopted by the HPC community and how the tools support for SYCL and oneAPI develops. Codeplay, a compiler vendor and active part of the SYCL community, recently announced SYCL support for NVIDIA GPUs [38], which could dramatically increase the interest in SYCL as a portable API, as it significantly increases the potential user base.

The OpenCL Profiling Interface
OpenCL provides a very basic interface to get profiling information on memory operations and kernel launches. If profiling is enabled, the function clGetEventProfilingInfo returns timing information for OpenCL functions that are enqueued as commands to a command queue. Most interesting for performance analysis are the begin and end timestamps of kernel launches. The SYCL specification defines a similar profiling interface. However, most tools with OpenCL support use some form of library wrapping to obtain information on the OpenCL execution.

OpenACC
The OpenACC (Open ACCelerator) API [34] describes a collection of compiler directives to specify loops and regions of code to be executed in parallel on a multicore CPU, or to be offloaded and executed in parallel on an attached accelerator device, providing portability across operating systems, CPUs, and accelerators. With directives for C/C++ and Fortran, OpenACC covers the most important programming languages for HPC.
OpenACC eases the development of heterogeneous applications as it relieves the user from explicit accelerator management as well as data management and data transfers to and from the device. Data management is handled with the data construct, where enter data and exit data directives can be used to control data transfers between host and device. Two fundamental compute constructs, kernels and parallel, can be used to offload the execution of code blocks to an accelerator.
While OpenMP is a prescriptive programming model, i.e. the developer explicitly states how to split the execution of loops, code regions, and tasks among well-defined teams of threads, OpenACC is a more descriptive model, telling the compiler where it is safe to parallelize loops or offload kernels and what data has to be transferred. This enables the compiler to perform more optimizations and generate faster code [43].

OpenACC Profiling Interface
OpenACC provides a profiling interface for both profile and trace data collection. This interface provides callbacks that are triggered during runtime if specific events occur. Three types of events are supported: data events, launch events and other events. Data events cover the allocation/deallocation of memory on the accelerator as well as data transfers. Launch events trigger before and after a kernel launch operation. Other events include device initialization and shutdown as well as wait operations [10]. However, these events only give host-side information. For information on the device the respective tools interface of the backend has to be used.

OpenMP
OpenMP is a directive-based API that is easy to learn and already well known for shared-memory parallelization on CPUs. It also offers a path to more portable GPU-accelerated applications. Like OpenACC, one of the goals of the OpenMP standard is to minimize the need for applications to contain vendor-specific statements. Thus, codes are portable across all supported GPU architectures.
Pragmas to offload work to general purpose GPUs have been introduced in OpenMP 4 [35] as the OpenMP device constructs. The target construct is required to specify a region to be launched on the device. The target data construct maps variables to the device. The teams pragma inside a target region spawns a set of teams with multiple OpenMP threads each. The distribute construct partitions the loop iterations and maps them to the teams.

The OpenMP Tools Interfaces
Unlike the other programming interfaces, OpenMP, since version 5 [36], provides two tools interfaces: OMPT for performance analysis tools and OMPD for debuggers [12].
OMPT [13] is a portable interface for both sampling-based and instrumentation-based performance analysis tools. Like the other tool interfaces, OMPT provides callbacks for defined OpenMP events, like the beginning of a parallel region or the start of an offloaded kernel. It also maintains the tool's data for OpenMP scopes, and it provides signal-safe inquiry functions to obtain OpenMP runtime information. OMPT is intended for first-party tools, i.e. tools that are linked into or loaded from the OpenMP application.
OMPD, the OpenMP debugging interface, on the other hand, is an interface for third-party tools, i.e. tools that live in a different process from the OpenMP application. This interface allows external tools to inspect the OpenMP state of a running program via callbacks. The debugger has no direct access to the OpenMP runtime; it interacts with it through the OMPD architecture, and the OMPD interface is transparent to the OpenMP application. The OMPD library can be used to debug a running program as well as core files generated when the application aborted due to an error.

Kokkos and RAJA
As stated above, HPC programming models did not change for a long time, which gave application developers some confidence that their applications would perform well on the next generation of machines. With the increased variability in architectures and programming models, that no longer holds. An application tuned for a specific platform could perform badly on the next system, which could be completely different from the current one. Furthermore, applications and libraries that are used universally need some assurance that they will perform well on a wide range of architectures.
In the scope of the Exascale Computing Project [29], two projects emerged that strive for performance portability by providing an abstraction layer over the existing programming models. Both originate from US national laboratories, one is Kokkos [11], developed at Sandia, and the other one RAJA [2] from LLNL. The abstraction layers include memory and execution spaces, data layout (i.e. the data layout might change depending on the architecture the application is compiled for) and parallel execution.
Both Kokkos and RAJA currently provide only C++ interfaces and only have a CUDA backend for offloading work to a GPU, though support for other programming models is likely to follow.

The Kokkos Profiling Interface
Kokkos provides a set of hooks for profiling libraries to interface with the Kokkos runtime [19]. These hooks can be implemented in the form of callbacks within a shared library. Upon start of the application, the Kokkos runtime loads the library, checks for implemented callbacks, and invokes the performance monitor via the corresponding callbacks. Currently Kokkos supports callbacks for initialization and finalization of the runtime, deep data copies, and the three parallel execution models parallel_for, parallel_reduce, and parallel_scan. Similar to the OpenACC profiling interface, only events on the host are triggered, though device events can be captured with CUPTI. RAJA unfortunately does not provide a profiling interface at this time.

Debuggers
Developing correct parallel programs is already a daunting task; adding the complexity of GPUs to the mix makes that endeavour even harder. This holds especially when using low-level programming paradigms, where the user is responsible for correct memory management and data movement. Luckily, several debugging solutions exist to assist the application developer in finding and fixing bugs, both at small and large scale. Table 1 shows the supported GPU programming models of each of these debuggers. There is very good support for CUDA and OpenACC (where the NVIDIA tools support debugging of the generated CUDA kernels), but nearly no support for the other programming models. TotalView showed a prototype with support for OpenMP offloading using an experimental OMPD-enabled OpenMP runtime. A couple of debuggers exist for OpenCL, but none proved usable for complex HPC applications.

NVIDIA Debugging Solutions
NVIDIA realized the importance of debugging for novel programming paradigms right from the beginning and has shipped debugging tools with the CUDA toolkit [17] ever since. These tools can be used standalone from the command line, but they are also integrated in the Nsight IDE [20], NVIDIA's development platform for CUDA and OpenACC applications. An example debugging session is shown in Fig. 1.

CUDA-MEMCHECK
CUDA-MEMCHECK is to GPUs what Valgrind is to CPUs: a very powerful memory tracking and analysis tool. It can monitor hundreds of thousands of threads running concurrently on each GPU. It reports detailed information about global, local, and shared memory access errors (e.g. out-of-bounds indices or misaligned memory accesses) and runtime execution errors (e.g. stack overflows and illegal instructions). Potential race conditions can also be detected with CUDA-MEMCHECK. In case of an error, CUDA-MEMCHECK displays stack back-traces on host and device.

CUDA-GDB
CUDA-GDB is, as the name indicates, an extension of GDB, the GNU debugger. Simultaneous debugging on the CPU and multiple GPUs is possible. The user can set conditional breakpoints or break automatically on every kernel launch. It is possible to examine variables, read and write memory and registers, and inspect the GPU state when the application is suspended. Memory access violations can be analyzed by running CUDA-MEMCHECK in an integrated mode to detect their precise causes.

TotalView
TotalView is a symbolic debugger specifically designed for HPC applications written in C/C++, Fortran, or Python. Noteworthy are its analysis capabilities for heavily templated C++ codes with complex data types. Advanced Memory Debugging keeps track of all memory accesses, allocations, and deallocations to find memory leaks and corrupted memory. Another feature that sets TotalView apart from the competition is reverse debugging, i.e. the program execution is recorded and the user can step back from the point where the error occurred. This is especially helpful in fixing non-deterministic bugs. TotalView features full control over processes and threads, with the ability to stop and debug individual threads or groups of threads or processes. Debugging of CUDA [18] and OpenACC applications is supported, with the possibility to debug multiple GPUs on a single node or multiple nodes across a cluster.
Here it is possible to seamlessly set breakpoints in host and device code. Figure 2 shows a screenshot of a CUDA debugging session using the new TotalView GUI, which greatly improves usability.

Arm DDT
DDT is another commercial debugger with a modern interface and features very similar to TotalView's. It supports all major HPC programming languages, with a special focus on complex C++ applications. Multi-process and multi-thread support is a given. DDT also features advanced memory debugging and visualizations of huge data sets. Like TotalView, DDT supports debugging of CUDA and OpenACC applications with fine-grained thread control, as shown in Fig. 3. DDT is available standalone or together with the Arm profiling tools in the Arm Forge suite.

Performance Analysis Tools
Performance analysis tools have been an integral component of the HPC software stack for decades, and many application developers have been exposed to profilers to a certain degree. There are many tools for all kinds of analyses: some are vendor-provided and thus tied to a specific platform, some are commercial, and several are open source. The latter are usually developed at universities or national research laboratories operating larger supercomputers. The tools community, which has a long history of collaboration, started adding GPU support relatively early [26], though the programming models and the number of features supported vary significantly between tools.
Though we commonly refer to performance analysis tools as profilers, we distinguish between trace-based tools, which store all events with timestamps, and profile-based tools, which only store statistical information like the number of calls to a specific routine and the total time spent in that routine. Several tools can generate both profiles and traces and are thus universally applicable. Tool support for the various GPU programming models varies significantly. The tool compatibility matrix for some of the most popular and widespread performance analysis tools is shown in Tab. 2. CUDA is supported by all the tools we consider, partly because CUDA was the first programming model for GPUs, but also because NVIDIA provides a very powerful and easy-to-use tools interface with CUPTI. Half of the tools support OpenACC or OpenCL, so there are options for all application developers. Several tools are working on supporting OpenMP offload to GPUs, but there is currently no public OpenMP runtime that implements OMPT for target directives. However, both Score-P and TAU already support OMPT on the host side. HPCToolkit showed a prototype with OpenMP offload support using an internal experimental OpenMP runtime that implements OMPT for target directives.

NVIDIA Tools
NVIDIA realized early on that good tools (and good documentation) are a necessity for a new platform to gain traction, so NVIDIA began shipping its own profiler nvvp, the NVIDIA Visual Profiler, shortly after the release of CUDA. It has been an integral feature of the CUDA toolkit since then, so it is available on all CUDA-enabled platforms without the need for a third-party tool. After several years, nvvp began to show scalability (and maintenance) issues and will be deprecated in a future CUDA release. Luckily, two new tools, Nsight Compute and Nsight Systems, are ready to fill that gap.

NVIDIA Visual Profiler
For many years, nvvp [5] was the de-facto standard profiler for CUDA applications. It presents a unified CPU and GPU timeline including CUDA API calls, memory transfers, and kernel launches. For a more detailed analysis of CPU activities, users can annotate the source code using the NVIDIA Tools Extension (NVTX) [24]. It supports all the advanced features of recent CUDA versions like Unified Memory, with CPU and GPU page faults and data migrations shown in the timeline. Upon selection of a specific kernel, nvvp shows a detailed low-level kernel analysis with performance metrics collected directly from GPU hardware counters and software instrumentation. Nvvp can compare results across multiple sessions to verify improvements from tuning actions. Another unique feature is an automated or guided application analysis with graphical visualizations to help identify optimization opportunities. The Guided Analysis provides step-by-step analysis and optimization guidance. The Visual Profiler is available as part of the CUDA toolkit.

Nsight Compute
NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides similar features to nvvp's low-level kernel analysis, i.e. detailed performance metrics and the guided performance analysis. Nsight Compute provides a customizable and data-driven user interface (as shown in Fig. 5). Further, it has a command-line mode for manual and automated profiling and can be extended with analysis scripts for post-processing of results. Additionally, its baseline feature allows users to compare results directly within the tool, very much like in the Visual Profiler.

Nsight Systems
NVIDIA Nsight Systems is a system-wide, timeline-based performance analysis tool. It is designed to visualize the complete application execution, to help identify the largest optimization opportunities, and to tune applications to scale efficiently across any quantity or size of CPUs and GPUs. Users can identify issues such as GPU starvation, unnecessary GPU synchronization, and insufficient overlap with CPU computation. An example timeline is shown in Fig. 6. It is possible to zoom in to any level of detail. Kernels showing unexpected behavior can be analyzed in detail with Nsight Compute, launched directly from the Nsight Systems GUI. NVTX is supported to get a more detailed picture of the CPU utilization. Currently, Nsight Systems is focused on a single process; more advanced support for MPI and OpenMP is planned for a future release.

ARM Tools
Arm, since the acquisition of Allinea in 2016, provides several commercial cross-platform performance analysis tools, which can be obtained standalone or together with DDT in the Arm Forge suite.

Performance Reports
Arm Performance Reports is a gateway to the world of performance analysis. It is a very low-overhead tool working on unmodified optimized binaries that generates a one-page report characterizing the application performance at a rather high level. For each of the analyzed categories it presents three to four sub-metrics that give more detailed information, e.g. the ratio of scalar and vector operations. For issues found, Performance Reports gives hints on how to proceed with more sophisticated analysis tools. An example of the accelerator breakdown of Performance Reports is shown in Fig. 7. It only gives a very brief overview of the GPU utilization, but in this case it indicates that a thorough analysis with more advanced tools might be beneficial.

MAP
Arm MAP [21] is a fully featured cross-platform source-level performance analysis tool. It supports low-overhead sampling-based profiling of parallel multi-threaded C/C++, Fortran, and Python codes. MAP provides in-depth analysis and bottleneck pinpointing down to the source line, as well as an analysis of communication and workload imbalance issues for MPI and multi-process codes. For accelerators, MAP offers a detailed kernel analysis with data obtained via CUPTI. This includes a line-level breakdown of warp stalls. Possible reasons for warp stalls include execution and memory dependencies or barriers; knowing the reason for warp stalls can help the developer tune the code accordingly. However, MAP currently supports only kernels generated from CUDA. Figure 8 shows an example of MAP analyzing a CUDA application.

The Score-P Ecosystem
Score-P [23] is a community instrumentation and measurement infrastructure developed by a consortium of performance tool groups. It is the next-generation measurement system of several tools, including Vampir [22], Scalasca [16], TAU [42], and Periscope [3]. Common data formats for profiling (CUBE4) and tracing (OTF2 [14]) enable tool interoperability. Figure 9 gives an overview of the Score-P ecosystem. On the bottom are the various supported programming paradigms, which are implemented as independent adapters interacting with the measurement system core; this eases adding support for new paradigms. The measurement data can be enriched with hardware counter information from PAPI [31], perf, or rusage. Score-P supports all major GPU programming models with CUDA [26], OpenACC [8], and OpenCL [9]. OMPT support for host-side measurement was recently added [15], and there is ongoing work to support OpenMP target directives [7]. Score-P also features a sampling mode for low-overhead measurements. It supports both profiling and tracing for all adapters. Profiles are generated in the CUBE4 format, which can be analyzed by TAU or Cube [39].
Cube is the performance report explorer for Score-P profiles as well as for the Scalasca trace analysis. The CUBE data model consists of a three-dimensional performance space with the dimensions (i) performance metric, (ii) call-path, and (iii) system location. Each dimension is represented in the GUI as a tree and shown in one of three coupled tree browsers, i.e. upon selection of one tree item the other trees are updated. Non-leaf nodes in each tree can be collapsed or expanded to achieve the desired level of granularity. Figure 10 shows a profile of a simple OpenACC application in the Cube GUI. On the left (Fig. 10a), the results of a pure OpenACC measurement are shown. Due to restrictions of the OpenACC tools interface, only the host-side calls are visible. However, if Score-P's CUDA support is enabled as well, the kernels generated by OpenACC are recorded too (Fig. 10b). OTF2 traces generated by Score-P can be analyzed automatically with Scalasca, which determines patterns indicating performance bottlenecks, and manually with Vampir. Unfortunately, Scalasca currently does not support the analysis of traces containing GPU locations, but it can be used to analyze the communication of multi-node heterogeneous programs if the corresponding adapter for the GPU programming model is disabled, i.e. only host-side events are recorded. In contrast to traditional profile viewers, which only present aggregated values of performance metrics, Vampir allows the investigation of the whole application flow. The main view is the Master Timeline, which shows the program activity over time on all processes, threads, and accelerators. An example is shown in Fig. 11.
The Master Timeline is complemented by several other views, timelines, and tables, e.g. the Process Timeline, which displays the application call stack of a process over time, or a Communication Matrix to analyze the communication between processes. Any counter metrics, e.g. from PAPI or counter plugins, can be analyzed across processes and time, either in a timeline or as a heatmap in the Performance Radar. It is possible to zoom into any level of detail; all views are updated automatically to show the information from the selected part of the trace.

TAU
TAU [42] is a very portable toolset for instrumentation, measurement, and analysis of parallel multi-threaded applications. It features various profiling modes as well as tracing, and supports several forms of code instrumentation as well as event-based sampling. All major HPC programming languages (C/C++, Fortran, Python) and programming models (MPI, OpenMP, Pthreads) are supported by TAU. TAU offers the widest accelerator support: it allows measurement of CUDA [27], OpenACC, OpenCL, Kokkos [41], and also AMD's ROCm+HIP. For the analysis of three-dimensional profile data, TAU includes ParaProf, which, like Cube, shows performance metric, call-path, and location for an easy and quick investigation of bottlenecks. Figure 12 shows the visualization of a TAU trace file with Jumpshot.

Extrae/Paraver
Extrae is a measurement system that generates Paraver trace files for post-mortem analysis. It supports C/C++, Fortran, and Python programs on all major HPC platforms, i.e. Intel x86, NVIDIA GPUs, Arm, and OpenPOWER. Extrae features several measurement techniques, which are configured through an XML file. The main source of information in Extrae is the preloading of shared libraries that substitute symbols of many parallel runtimes, e.g. MPI, OpenMP, and CUDA. Extrae also supports dynamic instrumentation of the application binary and parallel runtimes via Dyninst [4]. Further, Extrae supports sampling via signal timers and hardware performance counters. Since the Paraver trace format has no predefined semantics, adding support for new paradigms is relatively straightforward.
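The symbol-substitution idea behind this preloading can be sketched in miniature: on Linux, an LD_PRELOAD library exports wrappers with the same names as the runtime's functions, records an event, and forwards to the real symbol. The following pure-Python analogue (the mpi_send stand-in is invented for illustration) shows the same enter/record/forward/record pattern:

```python
# Interposition in miniature: a wrapper records events around the
# call and then delegates to the original function, analogous to an
# LD_PRELOAD wrapper forwarding to the real runtime symbol.
import time
import functools

trace = []  # collected (event, timestamp) records

def mpi_send(buf):            # stands in for the real runtime function
    return len(buf)

def traced(fn):
    """Plays the role of the interposed symbol."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace.append((fn.__name__ + "_enter", time.time()))
        try:
            return fn(*args, **kwargs)   # forward to the real symbol
        finally:
            trace.append((fn.__name__ + "_leave", time.time()))
    return wrapper

mpi_send = traced(mpi_send)   # "preload": substitute the symbol

mpi_send(b"abc")
print([name for name, _ in trace])  # ['mpi_send_enter', 'mpi_send_leave']
```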
Paraver [37,40] is a very flexible data browser working on the trace files generated by Extrae. Flexible means that there is no fixed set of metrics; instead, metrics can be programmed in the tool itself. Paraver offers a large selection of views, e.g. timelines, histograms, and tables, that can be combined to show virtually all the information that is present in the data. A view (or a set of views) can be saved as a Paraver configuration and recalculated with another trace file. CUDA streams are displayed in Paraver like any other data source, e.g. MPI processes or OpenMP threads.

Figure 12. Jumpshot screenshot of a TAU trace measurement of a CUDA application

HPCToolkit
HPCToolkit [1] is an integrated suite of tools for measurement and performance analysis of applications at all scales. Working on unmodified, fully optimized executables, it uses sampling to generate both call-path profiles and traces, independent of the language used. It supports multi-process (MPI) and multi-threaded (OpenMP, Pthreads) applications, but features no collection or analysis of communication and I/O metrics. For GPU analysis, it supports CUDA [6] and, as a currently unique feature, OpenMP offload, by shipping an experimental OpenMP runtime that implements OMPT for target constructs. It measures the execution time of each GPU kernel as well as explicit and implicit data movements. For CUDA codes it uses program counter sampling to pinpoint hotspots and to calculate the utilization of the GPU. A powerful analysis feature of HPCToolkit is blame shifting from symptoms to causes, so the user can quickly identify the real bottlenecks.
HPCToolkit consists of multiple programs that work together to generate a complete picture. hpcrun collects calling-context-sensitive performance data via sampling. A binary analysis to associate these measurements with the source-code structure is performed by hpcstruct. The performance data and the structure information are combined by hpcprof and finally visualized by hpcviewer for profiles and hpctraceviewer for traces.
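The calling-context-sensitive data collected by sampling can be pictured as call stacks captured at interrupts and aggregated into a calling context tree. A small sketch of that aggregation step (function names are invented, not real hpcrun output):

```python
# Aggregate call-path samples into a calling context tree (CCT):
# every prefix of a sampled stack gets inclusive credit for the sample.
from collections import Counter

samples = [                      # innermost frame last
    ("main", "solve", "kernel_launch"),
    ("main", "solve", "kernel_launch"),
    ("main", "solve", "memcpy_h2d"),
    ("main", "io", "write_output"),
]

cct = Counter()
for stack in samples:
    for depth in range(1, len(stack) + 1):
        cct[stack[:depth]] += 1

# Print the tree with indentation and inclusive sample counts.
for path, count in sorted(cct.items()):
    print("  " * (len(path) - 1) + path[-1], count)
```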

Conclusion
In this paper we showed that tools can support developers in programming heterogeneous codes on current supercomputers, both in writing correct, bug-free applications (debuggers) and efficient ones (performance analysis tools).
There is one point in GPU programming where tools cannot help: the decision which programming model to use. However, regardless of the choice, there is at least some tool support for each of the programming models. Due to the dominance of NVIDIA GPUs in today's data centers, most developers currently choose CUDA or OpenACC, which are also the models with the best tool support. To use the tools as efficiently as possible, we recommend in that case to use the NVIDIA tools on a single node (possibly with multiple GPUs) when developing or porting the application to GPUs. When scaling up, i.e. when inter-node data distribution and communication become an issue, we recommend more sophisticated tools like Score-P or TAU, which offer dedicated communication analysis. Errors occurring at scale can be debugged efficiently using TotalView or DDT.
Most supercomputing centers offer support to their users in porting and tuning applications for GPU architectures, sometimes in dedicated labs like the JSC/NVIDIA Application Lab 12. The tools community also actively supports users via mailing lists and trainings, e.g. the VI-HPS Tuning Workshops 13, which offer multi-day hands-on sessions covering a range of different tools.
So far the dominant player in GPU-enabled supercomputing is NVIDIA, but with the announced (pre-)exascale systems like Aurora 14, which will be based on Intel's Xe architecture, and Frontier 15 with purpose-built AMD GPUs, a wider variety of architectures becomes available. We will see systems combining NVIDIA GPUs with Intel, AMD, IBM POWER, and even Arm-based CPUs, Intel Xeons with Xe accelerators, and completely AMD-based systems with EPYC CPUs and Radeon GPUs. Portability and maintainability of GPU applications will thus become more important, so developers might switch to more portable programming models like OpenMP or SYCL, or even to a higher-level abstraction model like Kokkos, to ensure performance portability. Tools will have to adapt to this increased variability and provide better support for more architectures and programming models.

12 https://fz-juelich.de/ias/jsc/EN/Research/HPCTechnology/ExaScaleLabs/NVLAB/node.html
13 https://www.vi-hps.org/training/tws/tuning-workshop-series.html
14 https://press3.mcs.anl.gov/aurora/
15 https://www.olcf.ornl.gov/frontier/