Performance feature identification by comparative trace analysis

https://doi.org/10.1016/j.future.2004.11.022

Abstract

This work introduces a method for instrumenting applications, producing execution traces, and visualizing multiple trace instances to identify performance features. The approach provides information on the execution behavior of each process within a parallel application and allows differences across processes to be readily identified. Trace events are directly related to the source code and call-chain that produced them, making it straightforward to identify the causes of events. The approach is particularly suited to aiding the understanding of achieved performance from an application-centric viewpoint. In particular, it can assist in the formation of analytical performance models, which can be a time-consuming task for large, complex applications. The approach is one of human-effort reduction: it focuses the attention of the performance specialist on performance-critical code regions rather than attempting to automate model formulation completely. A supporting implementation analyzes trace files from different runs of an application to determine the relative performance characteristics of critical regions of code and communication functions.

Introduction

It is generally accepted that the achieved performance of high-performance systems results from a complex interplay between the hardware architecture, the communication system, and the applied workload. Knowledge of the processor design, memory hierarchy, inter-processor communication network, and workload arrangement is necessary to understand the achievable performance.

As high-end computing facilities increase in scale and complexity, it is essential to consider system performance throughout an architecture's life cycle: starting at the design stage, where no system is available for measurement; through comparison of systems and procurement; to implementation, installation, and verification; and finally to the examination of the effects of system updates over time.

A key approach that can be used at each stage of the life-cycle is to utilize detailed models that provide information on the expected performance of a workload, given a particular architectural configuration. Depending on the complexity of the model, an expectation of the achievable performance of the workload can be obtained with reasonable fidelity.

The accuracy of a model, and hence its effectiveness, lies in its ability to capture an application's performance characteristics. It is considered advantageous to parameterize a model in terms of system configuration and calculation behavior, as this allows the performance space to be explored without being tied to a particular ‘performance point’. Anticipated scenarios, such as the scaling effect of increasing the processor count, altering the workload size, or modifying the communication strategy, can then be evaluated.
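As a purely illustrative example of such a parameterization (this formula does not appear in the paper), a simple iterative code might be modeled as

    T(N, P) = S * ( t_comp(N / P) + t_comm(P) )

where N is the problem size, P the processor count, S the number of iterations, t_comp(n) the measured single-processor time for a subdomain of size n, and t_comm(P) the per-iteration communication cost at scale P. Evaluating T at unmeasured combinations of N and P is what allows the performance space to be explored.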

Performance models are widely used: from large-scale, tightly coupled systems through to dynamic and distributed Grid-based systems. For instance, performance modeling was used to validate performance during the installation of ASCI Q at Los Alamos National Laboratory (LANL) [1], which ultimately led to a system optimization resulting in a factor-of-two performance improvement [2]; to compare the performance of large-scale systems such as the Earth Simulator [3]; and in the procurement of many systems, including ASCI Purple (expected to be a 100 Tflop machine). Performance models are also extensively applied in ‘Grid’ environments to consider service-oriented metrics in the provision of resource management services [4], and in the mapping of financial applications to available resources [5].

It is, however, generally acknowledged that the formation of a performance model is a complex task. It typically involves a thorough code analysis, inspection of important data structures, and analysis of profile and trace data. Generating a detailed model can therefore be time-consuming, given the large size of many scientific applications and the relative complexity of advanced data structures and highly optimized communication strategies. Historically, much of this work has been performed “by hand” using profilers and tracing libraries typically designed for fault analysis. As applications become more complex, tool support for the modeling process has become increasingly desirable.

Several semi-automated approaches have been proposed that aim to make the formation of a performance model a simpler task using ‘black-box’ techniques in which individual performance aspects are observed but not necessarily understood. Examples include modeling the scaling behavior of basic-block performance [6] and modeling the memory behavior of basic-blocks and extrapolating to other systems [7]. These approaches tend to be specific to a particular processor configuration and/or application problem size.

In this work, we consider an approach that aims to simplify the process of generating a performance model, but not to automate it entirely. The purpose is not to simplify the resultant performance model, or to detract from the skill-set required by the modeler, but rather to remove unnecessary steps during formulation. A central question is which regions of a code must be modeled in detail. While the answer lies, in part, with the experience of the performance modeler, we believe that tools can be developed (or adapted) to help locate and focus on the performance-critical regions. Such regions are typically those whose execution behavior changes when the system configuration or application input data is varied. Should a region of code not change (in the performance sense) under such configuration and input variations, it may be possible to approximate its behavior by a single timing for a given architecture. An essential part of the performance modeler's skill is distinguishing between the elements that require parameterization and those where a single timing will suffice.

While there are a number of post-analysis and diagnosis tools that can assist with locating performance constraints, such as identifying bottlenecks [8] or visualizing communication patterns [9], many are aimed at resolving problems with the application rather than characterizing and understanding its performance behavior. This subtle difference in purpose differentiates this work from standard post-analysis tools. Rather than highlighting hotspots of communication in a particular application, for example, this work might focus on the fact that communication in a particular code region is proportionally lower than in a previous run of the application. In a sense, the focus is on identifying key differences and, in particular, idiosyncratic behaviors.

The approach described in this paper uses a combination of static and dynamic call-graph analysis to identify regions of code that are sensitive to data-set and scalability variations, in order to considerably reduce the time-to-model. It provides a compact view of multiple executions of an application, using color cues to draw attention to “areas of interest”. It can also generate graphical illustrations of differences between execution traces to rapidly identify code blocks that require attention. In addition, observed communication events can be matched against a library of predefined templates that attempt to identify common parallelization strategies. The current implementation is a prototype: it is envisaged that other methods of visualization could be employed to summarize more detailed performance characteristics “at a glance”. Reducing the physical screen real estate that represents key data could permit rapid large-scale analysis. Similar visualization techniques have been applied to code maintenance [10].
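As a minimal sketch of the kind of comparison involved (the region names, timings, and threshold below are invented; this is not the authors' implementation), the following C program compares per-region timing summaries from two runs and flags regions whose share of total runtime shifts:

    /* Compare per-region timing summaries from two runs and flag regions
     * whose share of total runtime shifts -- candidates for detailed
     * parameterization rather than a single fixed timing. */
    #include <stdio.h>

    typedef struct { const char *region; double t_run_a, t_run_b; } region_time;

    int main(void) {
        /* Hypothetical per-region inclusive times (seconds) for two configurations. */
        region_time r[] = {
            {"sweep_kernel", 120.0, 415.0},
            {"mpi_exchange",  18.0,  95.0},
            {"setup_mesh",     4.1,  14.5},
        };
        int n = (int)(sizeof r / sizeof r[0]);
        double total_a = 0.0, total_b = 0.0;
        for (int i = 0; i < n; i++) { total_a += r[i].t_run_a; total_b += r[i].t_run_b; }

        const double threshold = 0.02;  /* flag a shift of more than 2 points */
        for (int i = 0; i < n; i++) {
            double share_a = r[i].t_run_a / total_a;
            double share_b = r[i].t_run_b / total_b;
            double shift   = share_b - share_a;
            printf("%-13s %5.1f%% -> %5.1f%%  %s\n", r[i].region,
                   100.0 * share_a, 100.0 * share_b,
                   (shift > threshold || shift < -threshold)
                       ? "sensitive: candidate for parameterization"
                       : "stable: a single timing may suffice");
        }
        return 0;
    }

Applied across a whole call-graph, the same test separates regions whose behavior must be parameterized from those that can reasonably be approximated by a single timing.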

The paper is organized as follows: Section 2 describes an approach that leads to a performance model and identifies areas where tools can assist. Section 3 introduces a tracing tool that can create call-graphs for post-analysis using source-code instrumentation. In Section 4, methods that use multi-trace visualizations to highlight key application differences are described. Their use is illustrated with examples run on large-scale parallel codes. Section 5 summarizes the features of the approach. Conclusions and future work are discussed in Section 6.

Section snippets

Identification of performance characteristics from multiple executions

The performance modeling work at Los Alamos National Laboratory (LANL) has been primarily focused on applications representative of the ASCI (Accelerated Strategic Computing Initiative) workload, where analytical techniques are employed to develop entire-application models for the large-scale ASCI computing resources. This differs from a number of other performance initiatives that tend to focus on smaller applications in distributed computing environments such as [11]. The Los Alamos models are …

Dynamic call-graph collection

To assist with instrumentation, an automatic source-to-source translation tool is used to insert profiling statements into the code where a subroutine is entered and exited. The tool currently supports Fortran, parsing the source and inserting the statements, but could be extended for further language coverage. The application is linked against a lightweight profiling library that stores the “events” in a dynamic, page-based list whenever a subroutine is called. To minimize memory …
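As a hedged sketch of what such a library might look like (the names trace_enter/trace_exit, the event layout, and the page size are assumptions, not the actual implementation described here), entry and exit events could be appended to a page-based list so that tracing performs one allocation per page rather than one per event:

    /* Hypothetical lightweight profiling library: events are appended to a
     * page-based list so recording an event almost never allocates memory. */
    #include <stdlib.h>
    #include <time.h>

    #define PAGE_EVENTS 4096

    typedef struct {
        int    routine_id;  /* index assigned by the source-to-source tool */
        char   kind;        /* 'E' = entry, 'X' = exit */
        double timestamp;   /* seconds since an arbitrary epoch */
    } trace_event;

    typedef struct trace_page {
        trace_event        events[PAGE_EVENTS];
        int                used;
        struct trace_page *next;
    } trace_page;

    static trace_page *head = NULL, *tail = NULL;

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
    }

    static void record(int id, char kind) {
        if (tail == NULL || tail->used == PAGE_EVENTS) {
            trace_page *p = calloc(1, sizeof *p);  /* new page only when full */
            if (p == NULL) return;                 /* drop events rather than abort */
            if (head == NULL) head = p; else tail->next = p;
            tail = p;
        }
        tail->events[tail->used] = (trace_event){ id, kind, now() };
        tail->used++;
    }

    /* Calls inserted by the translation tool at each subroutine entry/exit.
     * A Fortran compiler would typically bind to these through an
     * underscore-suffixed name and pass arguments by reference. */
    void trace_enter(const int *id) { record(*id, 'E'); }
    void trace_exit(const int *id)  { record(*id, 'X'); }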

Multi-trace visualization

There are a number of visualization tools that can assist with understanding an application's runtime behavior. For MPI applications, Vampir offers a large suite of views with a good level of detail. It is essentially possible to “play back” the application to determine where communication occurs, which processors are involved, and where blocks of computation occur. The tools developed as part of this work provide similar capabilities, although they are geared towards the accompanying …

Performance feature summary

Together, the tools allow a performance modeler to rapidly profile a code, run it under different configurations (such as on a 4 × 4 processor network and on an 8 × 8 network), and then analyze the traces for particular performance characteristics (see Table 2). In addition, the traces can be analyzed to determine the data decomposition by comparison with a set of example communication templates, as described in Section 4.
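As an illustration of this kind of template matching (the function name, grid size, and partner list below are invented for the example), a 2D nearest-neighbor template can be tested by checking each rank's observed communication partners against the partners a 2D decomposition would imply:

    /* Illustrative template match (not the authors' code): does a rank's
     * observed set of communication partners equal the neighbors implied
     * by a 2D nearest-neighbor decomposition on a px-by-py logical grid? */
    #include <stdbool.h>
    #include <stdio.h>

    static bool matches_2d_stencil(int rank, const int *partners, int n,
                                   int px, int py) {
        int x = rank % px, y = rank / px;
        int expected[4], m = 0;
        if (x > 0)      expected[m++] = rank - 1;   /* west  */
        if (x < px - 1) expected[m++] = rank + 1;   /* east  */
        if (y > 0)      expected[m++] = rank - px;  /* south */
        if (y < py - 1) expected[m++] = rank + px;  /* north */
        if (m != n) return false;
        for (int i = 0; i < m; i++) {               /* every expected partner observed? */
            bool found = false;
            for (int j = 0; j < n; j++)
                if (partners[j] == expected[i]) { found = true; break; }
            if (!found) return false;
        }
        return true;
    }

    int main(void) {
        /* Partners extracted from a trace for rank 5 of a 4 x 4 grid
         * (invented data for the example). */
        int partners[] = { 4, 6, 1, 9 };
        printf("rank 5 %s the 4 x 4 nearest-neighbor template\n",
               matches_2d_stencil(5, partners, 4, 4, 4) ? "matches" : "does not match");
        return 0;
    }

Running the same test over every rank, and over each template in the library, identifies which decomposition strategy (if any) the trace is consistent with.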

Effective visualizations are used to illustrate key performance differences so …

Conclusions and future work

The approach described in this paper utilizes dynamic trace files and a multi-trace visualization tool to highlight areas of interest when input parameters, data sets, and resource configurations are modified. By focusing on the areas of a code that are sensitive to configuration and input data, the effort required to isolate the critical regions that govern an application's performance characteristics is reduced.

Analytical modeling is an increasingly important activity in understanding and …

Acknowledgement

Los Alamos National Laboratory is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36.

References (17)

  • S.C. Perry et al., Performance optimisation of financial option calculations, Parallel Computing (2000)
  • D.J. Kerbyson, A. Hoisie, H.J. Wasserman, Verifying large-scale system performance during installation using modeling, ...
  • F. Petrini et al., The case of the missing supercomputer performance: achieving optimal performance on the 8192 processors of ASCI Q
  • D.J. Kerbyson et al., A comparison between the Earth Simulator and AlphaServer systems using predictive application performance models
  • D.P. Spooner et al., Local grid scheduling techniques using performance prediction
  • G. Marin et al., Cross-architecture performance predictions for scientific applications using parameterized models
  • A. Snavely et al., A framework for application performance modeling and prediction
  • H.W. Cain et al., A callgraph-based search strategy for automated performance diagnosis

Daniel Spooner is a lecturer in the Computer Science Department at the University of Warwick. His main research interests are in the areas of high performance systems, analytical performance models and their application in grid and distributed computing. He is currently involved in the UK e-Science programme developing performance-aware middleware services.

Darren Kerbyson is currently a researcher in the Performance and Architecture Lab at Los Alamos. Prior to this he was a Senior Lecturer in Computer Systems at the University of Warwick in the UK. He has been active in the areas of performance modeling, parallel and distributed processing systems, and image analysis for the last 15 years. He has worked on many performance-oriented projects funded by the European Esprit program, UK Government, ONR, and DARPA. He has published over 70 papers in these areas and has taught courses at undergraduate and postgraduate levels as well as supervising numerous PhD students. He is currently involved in the modeling of large-scale applications on current and future supercomputers at Los Alamos. He is a member of the ACM and the IEEE.
