An Approach for Realistically Simulating the Performance of Scientific Applications on High Performance Computing Systems

Scientific applications often contain large, computationally-intensive, and irregular parallel loops or tasks that exhibit stochastic characteristics. Applications may suffer from load imbalance during their execution on high-performance computing (HPC) systems due to such characteristics. Dynamic loop self-scheduling (DLS) techniques are instrumental in improving the performance of scientific applications on HPC systems via load balancing. Selecting a DLS technique that results in the best performance for different problems and system sizes requires a large number of exploratory experiments. A theoretical model that can be used to predict the scheduling technique that yields the best performance for a given problem and system has not yet been identified. Therefore, simulation is the most appropriate approach for conducting such exploratory experiments with reasonable costs. This work devises an approach to realistically simulate computationally-intensive scientific applications that employ DLS and execute on HPC systems. Several approaches to represent the application tasks (or loop iterations) are compared to establish their influence on the simulative application performance. A novel simulation strategy is introduced, which transforms a native application code into a simulative code. The native and simulative performance of two computationally-intensive scientific applications are compared to evaluate the realism of the proposed simulation approach. The comparison of the performance characteristics extracted from the native and simulative performance shows that the proposed simulation approach fully captured most of the performance characteristics of interest. This work shows and establishes the importance of simulations that realistically predict the performance of DLS techniques for different applications and system configurations.


Introduction
Scientific applications are complex, large, and contain irregular parallel loops (or tasks) that often exhibit stochastic behavior. The use of efficient loop scheduling techniques, from fully static to fully dynamic, in computationally-intensive applications is crucial for improving their performance on high performance computing (HPC) systems, which is often degraded by load imbalance. Dynamic loop self-scheduling (DLS) is an effective scheduling approach employed to improve the performance of computationally-intensive scientific applications via dynamic load balancing. The goal of using DLS is to optimize the performance of scientific applications in the presence of load imbalance caused by problem, algorithmic, and systemic characteristics. HPC systems are becoming larger on the road to Exascale computing. Therefore, scheduling and load balancing become crucial, as increasing the number of processing elements (PEs) leads to an increase in load imbalance and, consequently, to a loss in performance.
Scheduling and load balancing, from operating system level to HPC batch scheduling level, in addition to minimizing the management overhead, are among the most important challenges on the road to Exascale systems [1]. Static and dynamic loop self-scheduling (DLS) techniques play an essential role in improving the performance of scientific applications. These techniques balance the assignment and the execution of independent tasks or loop iterations across the available processing elements (PEs). Identifying the best scheduling strategy among the available DLS techniques for a given application requires intensive assessment and a large number of exploratory native experiments. Such a large number of experiments may not always be feasible or practical, due to the associated time and costs. Simulation mitigates such costs and has therefore been shown to be more appropriate for studying and improving the performance of scientific applications [2]. An important source of uncertainty in the performance results obtained via simulation is the degree of trustworthiness of the simulation, understood as the close quantitative and qualitative agreement with the native measured performance. Attaining a high degree of trustworthiness eliminates such uncertainty for present and future, more complex experiments.
Simulation allows the study of application performance in controlled and reproducible environments [2]. Realistic predictions based on trustworthy simulations can be used to design targeted native experiments with the ultimate goal of achieving optimized application performance. Realistically simulating application performance is, however, nontrivial. Several studies addressed the topic of application performance simulation for specific purposes, such as evaluating the performance of scheduling techniques under variable task execution times with a specific runtime system [3], or focusing on improving communications in large and distributed applications [4].
The present work gathers the authors' in-depth expertise in simulating scientific applications' performance to enable research studies on the effects and benefits of employing dynamic load balancing in computationally-intensive applications via self-scheduling. Several details of representing the application and the computing system characteristics in the simulation are presented and discussed, such as capturing the variability of native execution performance over multiple repetitions as well as calibrating and fine-tuning the simulated system representation for the execution of a specific application. The coupling between the application and the computing system representation has been shown to yield a very close agreement between the native and the simulative experimental results, and to achieve realistic simulative performance predictions [5].
The proposed realistic simulation approach is built upon three perspectives of comparison of the results of native and simulative experiments, which are also illustrated in Figure 1: (1) native-to-simulative (or the past), (2) native-to-simulative (or the present), and (3) simulative-to-native.
Through the first perspective, the performance reported in the original publications, which introduced the most well-known, successful, and currently used DLS techniques from the past, is presently reproduced via simulation to verify the similarity in performance results between the current implementations of the DLS techniques and their original implementations [6].
In the second perspective, the performance of the present native scheduling experiments on HPC systems is compared against that of the simulative experiments. This comparison typically enables one to verify and justify the level of the agreement between the results of the native and the simulative experiments, and to answer the question of "How realistic are the simulations of applications performance on HPC systems?" [5].
In the third comparison perspective, different representations of the same application or of the computing system characteristics are used in different simulations. The simulative performance of the application obtained when employing different DLS techniques is compared among different simulative experiments. Given that different simulations are expected to represent the same application and platform characteristics, this comparison allows a better assessment of the influence of the application and/or system representation on the obtained simulative performance and the degree of agreement between the native and the simulative performance.
Figure 1 (caption, item 3): Different simulation approaches are compared to achieve close agreement in terms of simulation of application performance to that of the native performance.
The present work makes the following contributions: (1) An approach for simulating application performance with a high degree of trustworthiness while considering different sources of variability in application and computing system representations. (2) A novel simulation strategy for computationally-intensive applications that combines two interfaces of the SimGrid [7] simulation toolkit (SMPI and MSG) to achieve fast and accurate performance simulation with minimal code changes to the native application. (3) A realistic simulation of the performance of two scientific applications with several dynamic load balancing techniques. The application performance is analyzed based on native and simulative performance results. The performance comparison shows that the simulations realistically captured key application performance features. (4) An experimental verification and validation of the use of the different SimGrid interfaces for representing the characteristics of the application's tasks to develop and test DLS techniques in simulation.
The present work builds upon and extends the authors' prior work [5,6], which focused on the experimental verification of DLS implementations via reproduction [6] and the experimental verification of application performance simulation on HPC systems [5], respectively. In the present work, a new method to represent the computational effort in tasks is explored and tested (cf. Section 4.1). Methods to evaluate and represent variability in the system are also considered in the present work (cf. Section 4.3). An additional scientific application is also included herein (cf. Section 5). The performance of the two scientific applications is examined with four additional adaptive DLS techniques and four additional nonadaptive DLS techniques by employing an MPI-based load balancing library in both native and simulative experiments (cf. Section 4.2). A novel strategy for simulating applications is also evaluated in this work (cf. Section 5.2). A full version of this manuscript is under publication in the Future Generations Computer Systems Journal, "On The Road to Exascale II Special Issue: Advances in High Performance Computing and Simulations".
The remainder of this manuscript is structured as follows. Section 2 presents the relevant background on dynamic load balancing via self-scheduling and the simulation toolkit used. Section 3 reviews recent related work and the various simulation approaches adopted therein. The proposed simulation approach is introduced and discussed in Section 4. The design of the evaluation experiments, the practical steps of representing the scientific applications in simulation, the results of the native and simulative experiments with various DLS techniques, as well as their comparison are discussed in Section 5. Section 6 presents conclusions and an outline of the work envisioned for the future.

Background
This section presents and organizes the relevant background of the present work in three dimensions. The first dimension covers the relevant information concerning dynamic load balancing via dynamic loop self-scheduling techniques, specifically, the selected DLS techniques of the present work. The second dimension discusses specific research efforts from the literature where DLS techniques enhanced the performance of various scientific applications. The last dimension introduces the simulation toolkit used in the present work.
Dynamic load balancing via dynamic loop self-scheduling. There are two main categories of loop scheduling techniques: static and dynamic. The essential difference between static and dynamic loop scheduling is the time when the scheduling decisions are taken. Static scheduling techniques, such as block, cyclic, and block-cyclic [8], divide and assign the loop iterations (or tasks) across the processing elements (PEs) before the application executes. The task division and assignment do not change during execution. In the present work, block scheduling is considered and is denoted as STATIC.
Dynamic loop self-scheduling (DLS) techniques divide and self-schedule the tasks during the execution of the application. As a result, DLS techniques balance the execution of the loop iterations at the cost of increased overhead compared to the static techniques. Self-scheduling differs from work sharing, another related scheduling approach, wherein tasks are assigned onto PEs in predetermined sizes and order [9]. Self-scheduling is also different from work stealing [10] in that PEs request work from a central work queue as opposed to distributed work queues. The former has the advantage of global scheduling information while the latter is more scalable at the cost of identifying overloaded PEs from which to steal work. DLS techniques consider independent tasks or loop iterations of applications [11,12,13,14,15,16]. For dependent tasks, several loop transformations, such as loop peeling, loop fission, loop fusion, and loop unrolling can be used to eliminate loop dependencies [17]. DLS techniques can be categorized as nonadaptive and adaptive [18]. During the application execution, the nonadaptive techniques calculate the number of iterations comprising a chunk based on certain parameters that can be obtained prior to the application execution. The nonadaptive DLS techniques considered in this work include: modified fixed-size chunk [19] (mFSC), guided self-scheduling [13] (GSS), and factoring [14] (FAC). mFSC [19] groups iterations into chunks at each scheduling round to avoid the large overhead of single loop iterations being assigned at a time. In mFSC, the chunk size is fixed and plays a critical role in determining the performance of this technique. mFSC assigns a chunk size that results in a number of chunks that is similar to that of FAC (explained below).
GSS [13] assigns chunks of decreasing sizes to reduce scheduling overhead and improve load balancing. Upon a work request, the remaining loop iterations are divided by the total number of PEs.
FAC [14] improves GSS by scheduling the loop iterations in batches of equal-sized chunks. The initial chunk size of GSS is usually larger than the size of the initial chunk using FAC. If more time-consuming loop iterations are at the beginning of the loop, FAC balances the execution better than GSS. The chunk calculation in FAC is based on probabilistic analyses to balance the load among the processes, depending on prior knowledge of the mean µ and the standard deviation σ of the execution times of the loop iterations. Since loop characteristics are often not known a priori, and to cover typical loop characteristics across many probability distributions, a practical implementation of FAC was suggested [14] that assigns half of the remaining work in a batch. This work considers this practical implementation. Compared to STATIC and mFSC, GSS and FAC provide better trade-offs between load balancing and scheduling overhead.
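The chunk-size rules described above for GSS and the practical FAC variant can be illustrated with a minimal sketch. The function names and the use of ceiling rounding are assumptions made for illustration only; they are not taken from the DLS implementations used in this work.

```python
import math

def gss_chunks(n_iterations, n_pes):
    """Chunk sizes under GSS: each work request receives
    ceil(remaining / P) iterations (rounding is an assumption)."""
    remaining, chunks = n_iterations, []
    while remaining > 0:
        chunk = math.ceil(remaining / n_pes)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

def fac_chunks(n_iterations, n_pes):
    """Chunk sizes under the practical FAC variant: each batch covers
    about half of the remaining iterations, split into P equal chunks."""
    remaining, chunks = n_iterations, []
    while remaining > 0:
        chunk = math.ceil(remaining / (2 * n_pes))
        for _ in range(n_pes):
            if remaining == 0:
                break
            size = min(chunk, remaining)  # never assign more than is left
            chunks.append(size)
            remaining -= size
    return chunks

if __name__ == "__main__":
    print(gss_chunks(100, 4))
    print(fac_chunks(100, 4))
```

For 100 iterations on 4 PEs, the first GSS chunk in this sketch (25 iterations) is larger than the first FAC chunk (13 iterations), in line with the observation above about the initial chunk sizes of the two techniques.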
The adaptive DLS techniques exploit, during execution, the latest information on the state of both the application and the system to predict the next sizes of the chunks of the iterations to be executed. In highly irregular environments, the adaptive DLS techniques balance the execution of the loop iterations significantly better than the nonadaptive techniques. However, the adaptive techniques may result in significant scheduling overhead compared to the nonadaptive techniques and are, therefore, recommended in cases characterized by highly imbalanced execution. The adaptive DLS techniques include adaptive weighted factoring [15] (AWF) and its variants [16] AWF-B, AWF-C, AWF-D, and AWF-E.
AWF [15] assigns a weight to each PE that represents its computing speed and adapts the relative PE weights during execution according to their performance. It is designed for time-stepping applications. Therefore, it measures the performance of PEs during previous time-steps and updates the PEs relative weights after each time-step to balance the load according to the computing system's present state.
AWF-B [16] relaxes the time-stepping requirement to learn the PE weights. It learns the PE weights from their performance in previous batches instead of time-steps. AWF-C [16] is similar to AWF-B; however, the PE weights are updated after the execution of each chunk, instead of each batch.
AWF-D [16] is similar to AWF-B, except that the scheduling overhead (time taken to assign a chunk of loop iterations) is taken into account in the weight calculation.
AWF-E [16] is similar to AWF-C, and also takes into account the scheduling overhead, similar to AWF-D.
DLS in scientific applications. The DLS techniques have been used in several studies to improve the performance of computationally-intensive scientific applications. They are mostly used at the process-level to balance the load between processes running on different PEs. For example, AWF [15] and FAC [14] were used to balance the load of a heat conduction application on an unstructured grid [20]. Nonadaptive and adaptive DLS techniques such as self-scheduling (SS) [11], GSS [13], FAC [14], AWF [15], and its variants, were used over the years to enhance applications, such as simulations of wave packet dynamics, automatic quadrature routines [16], N-Body simulations [21], solar map generation [22], an image denoising model, the simulation of a vector functional coefficient autoregressive (VFCAR) model for multivariate nonlinear time series [23], and a parallel spin-image algorithm from computer vision (PSIA) [24].
With the increase in processor core counts per compute node, advanced scheduling techniques, such as the class of self-scheduling mentioned earlier, are also needed at the thread-level. To this end, the GNU OpenMP runtime library was extended [25] (LaPeSD libGOMP) to support four additional DLS techniques, namely: fixed-size chunk [12] (FSC), trapezoid self-scheduling [26] (TSS), FAC, and RANDOM (in terms of chunk sizes), in addition to the original OpenMP scheduling techniques: STATIC, Dynamic, and Guided (equivalent to GSS [13]). The extended GNU runtime library that implements DLS was used to schedule loop iterations in computational benchmarks, such as the NAS parallel [27] and RODINIA [28] benchmark suites.
The selected simulation toolkit. SimGrid [7] is a scientific simulation framework for the study of the behavior of large-scale distributed computing systems, such as the Grid, the Cloud, and peer-to-peer (P2P) systems. It provides application programming interfaces (APIs) to simulate various distributed computing systems. SimGrid (hereafter, SG) provides four different APIs for different simulation purposes. MetaSimGrid (hereafter, SG-MSG) and SimDag (hereafter, SG-SD) provide APIs for the simulation of computational problems expressed as independent tasks or task graphs, respectively.
The SimGrid-SMPI interface (hereafter, SG-SMPI) provides the functionality for the simulation of programs written using the message passing interface (MPI) and targets developers interested in the simulation and debugging of their parallel MPI codes.
The newly introduced SimGrid-S4U interface (hereafter, SG-S4U) currently supports most of the functionality of the SG-MSG interface with the purpose of also incorporating the functionality of the SG-SD interface over time.
The present work proposes a novel simulation approach of computationally-intensive applications by combining SG-SMPI and SG-MSG to achieve fast and accurate performance simulation with minimal code changes to the native application.
The weak scalability of these DLS techniques was assessed in the presence of certain load imbalance sources (algorithmic and systemic). The flexibility, understood as the robustness against perturbations in the PE computing speed, of the same DLS techniques implemented using SG-MSG was also studied [32]. Moreover, the resilience, understood as the robustness against PE failure, of these DLS techniques on a heterogeneous computing system was studied using the SG-MSG interface [33].
Another research effort used the SG-MSG interface to reproduce certain experiments of DLS techniques [34]. Therein, a successful reproduction of the past DLS experiments was presented. The results were compared to experiments from the past available in the literature to verify the implementation of the DLS techniques. A similar approach of verifying the implementation of certain DLS techniques via reproduction was proposed using the SG-SD interface [35].
The relation between batch and application level scheduling was studied in simulation [36], using Alea [37] for the batch level scheduling and SG-SD for the application level scheduling. The two simulators were connected and used together to simulate the execution of multiple applications with various scheduling techniques at the batch level and the application level. It was shown that a holistic solution resulted in a better performance than focusing on improving the performance at each level solely.
SG was also used for the study of file management in large distributed systems [38] to improve application performance. The effect of variability in task execution times on the makespan of applications scheduled using StarPU [39] on heterogeneous CPU/GPU systems was also studied in simulation [3]. The results showed that the dynamic scheduling of StarPU improves the performance even with irregular task execution times.
Realistic simulation approaches. A combination of simulation and trace replay was used to guide the choice of the scheduling technique and the granularity of problem decomposition for a geophysics application to tune its performance [40]. SG-SMPI was used to generate a time independent trace (TiT), a special type of execution trace, of the application with the finest problem decomposition. This trace was then modified to represent different granularities of problem decomposition. Traces that represent different decompositions were replayed with different scheduling techniques to identify the combination of decomposition granularity and scheduling technique that results in improved application performance. The scheduling techniques were extracted from the Charm++ runtime to be used in the simulation. However, the process of trace modification to represent different decompositions is complex, limits the number of explored decompositions, and may result in inaccurate simulation results.
A compiler-assisted transformation of native application source code into a code skeleton suitable for the structural simulation toolkit [41] (SST) was introduced [4]. Special pragmas need to be inserted in the source code to simulate computations as certain delays, eliminate large unnecessary memory allocations in simulation, and handle global variables correctly. This approach focused on simulation for the study of communications and networks in large computing systems. Therefore, the variability of task execution times was not considered explicitly.
StarPU [39] was ported to SG-MSG for the study of the scheduling of task graphs on heterogeneous CPU/GPU systems. Task execution times were estimated based on the average execution time benchmarked by StarPU. Both using the average task execution time directly and generating pseudo-random numbers with the same average as the task execution time were explored. However, depending on time measurements only may not be adequate for fine-grained tasks. In addition, porting the StarPU runtime to a simulator interface is challenging and requires significant effort.
The Monte-Carlo method [42] was used to improve the simulation of workloads in cloud computing [43]. To capture the variation in applications execution time in simulation, the variability in cloud computing systems was quantified and added to task execution times as a probability. The simulation was repeated 500 times, each with different seeds to obtain a similar effect of the dynamic native execution on the clouds. However, the variation in the application execution time has two components: (1) the variability in a task execution time due to application characteristics or system characteristics such as nonuniform memory access; (2) the variability that stems from the computing system resources being perturbed by operating system interference, other applications that share resources, or transient malfunctions. Considering both components of application performance variability is important for obtaining realistic simulation results.
In this work, a novel simulation approach is presented that considers the different factors that affect application performance. Guidelines are proposed in Section 4 on how to estimate the task execution times and the system characteristics. Fine-tuning the system representation to closely reflect the system performance for the execution of a certain application is essential. Using the same scheduling library in both native and simulative experiments reduces the differences between them and ensures the same scheduling behavior in both types of experiments. A novel simulation method that combines the use of two SimGrid interfaces, namely SG-SMPI and SG-MSG, is introduced in Section 5.2, which enables the simulation of application performance with minimal code changes.

A realistic performance simulation means that conclusions drawn from the simulative performance results are close to those drawn from the native performance results. The close agreement between both conclusions does not necessarily mean a close agreement between native and simulative application execution times. For the study of dynamic load balancing and task self-scheduling, the performance of different scheduling techniques relative to others is expected to be preserved between native and simulative experiments. Preserving the expected behavior suffices to draw similar conclusions on the performance of DLS techniques between native and simulative experiments.
Figure 2: Illustration of the proposed generic approach for realistic simulations. Scientific application and computing system characteristics are abstracted for use in simulation. A single scheduling library is used which is called both by the native and simulative executions.
Preserving identical performance characteristics between native and simulation experiments is challenging due to the dynamic interactions between the three main components that affect the performance: (1) Application characteristics, (2) Dynamic load balancing, and (3) Computing system characteristics.

Representing applications for realistic simulations
Two important aspects need to be clear to enable the representation of an application in simulation via abstraction: (1) The main application flow, i.e., initializations, branches, and communications between its parallel processes/threads; (2) The computational effort associated with each scheduled task.
For simple applications with one or two large loops or parallel blocks of tasks that dominate their performance, inspecting the application code is sufficient to understand the program flow. If this is insufficient, tracing the application execution can reveal the main computation and communication blocks in the application. In addition, the SG-SMPI simulation produces a special type of text-based execution trace called time independent trace (TiT) [44]. The TiT contains a trace of the application execution as a series of computation and communication events, with their corresponding amounts specified in terms of floating-point operations (FLOP) and bytes, respectively. Therefore, the TiT can be used to understand the application flow and to represent the application in simulations.
To obtain the amount of work per task, either a measurement of the task execution time or the FLOP count can be used. Measuring short task execution times can introduce inaccuracies, as such measurements are affected by the measurement overhead, which is known as the probing effect. In addition, the execution time per task is not guaranteed to be constant between different executions of the same application. Instead of time measurements, the FLOP count per task can be measured using hardware counters, such as those exposed via PAPI [45]. The FLOP count obtained with PAPI is used to represent the amount of work in each task in the simulation. The FLOP count per task was found to represent the computational effort per task more accurately than time measurements and to yield constant values across different application executions [5]. However, feeding the simulator the exact FLOP count per task might misrepresent the dynamic behavior of native executions, in which the execution time of a task varies among different execution instances. To address this, a probability distribution is fitted to the measured task FLOP counts. The simulator then draws samples from this distribution to represent the task FLOP counts during simulation, as shown in the upper part of Figure 2.
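The fit-and-sample step described above could look roughly as follows. This is a minimal sketch under stated assumptions: a normal distribution is assumed purely for illustration (the text does not prescribe a distribution family), the function names are hypothetical, and the FLOP counts in the usage example are made-up placeholders rather than measurements.

```python
import random
import statistics

def fit_flop_distribution(flop_counts):
    """Fit a normal distribution to measured per-task FLOP counts.
    The normal family is an assumption made for this sketch."""
    return statistics.NormalDist.from_samples(flop_counts)

def sample_task_flops(dist, n_tasks, seed=0, floor=1.0):
    """Draw one FLOP count per simulated task from the fitted
    distribution, clamped to a small positive floor."""
    rng = random.Random(seed)  # fixed seed for reproducible simulations
    return [max(floor, rng.gauss(dist.mean, dist.stdev))
            for _ in range(n_tasks)]

if __name__ == "__main__":
    # Placeholder values, not measurements from any real application.
    measured = [1.00e6, 1.20e6, 0.90e6, 1.10e6, 1.05e6]
    dist = fit_flop_distribution(measured)
    print(sample_task_flops(dist, n_tasks=3))
```

Fixing the seed keeps a single simulation reproducible, while varying it across repetitions mimics the run-to-run variation of native executions.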

Implementing scheduling techniques for native and simulative experiments
A number of dynamic loop self-scheduling (DLS) techniques were proposed between the late 1980s and early 2000s, and have been used efficiently in scientific applications [18]. Dynamic nonadaptive techniques have previously been verified [6] by reproduction of the original experiments that introduced them [14] using the experimental verification approach illustrated by step 1 in Figure 1. In this work, the range of studied DLS techniques is extended with four adaptive DLS techniques in addition to the nonadaptive ones. To ensure that the implementation of the adaptive techniques adheres to their specification, the DLB tool [23], a dynamic load balancing library developed by the authors of the adaptive techniques, is used in this work. To minimize the differences between native and simulative executions, the DLB tool is used to schedule the application tasks in both native and simulative executions. Connecting the DLB tool to the simulation framework required minimal effort, as detailed in Section 5.2.

Representing native computing systems in simulation
Representing HPC systems in simulation involves representing different system components that contribute to the application performance in simulation. As previously investigated [5], the application and computing system representation cannot be seen as completely decoupled activities, i.e., representing a computing system must take into account the application characteristics as current simulators cannot simulate precisely all the complex characteristics of HPC systems to create a general, application-independent system representation. For the simulation of the performance of computationally-intensive applications with different DLS, two main components of systems need to be represented: (1) The PEs, their number, their computational speed; (2) The interconnection network between the PEs, the network bandwidth, the network latency, and the topology. The PEs representation in simulation, needs to reflect the native configuration in terms of number of compute nodes and number of PEs per node. Communication links connect different PEs (cores and nodes) needs to reflect the native network topology, bandwidth and latency. Nominal values for the PE computing speeds, the network bandwidth, and the network latency are added in the simulated HPC representation to obtain an initial representation. The second step is to fine tune this initial representation to reflect the "real" HPC performance in executing a certain application. To this end, core speeds are estimated to obtain more accurate simulation results due to the fact that applications do not execute at the theoretical peak performance. The core speed is calculated by measuring the loop execution time in a sequential run to avoid any parallelization or communication overhead. The sum of the total number of FLOP in all iterations is divided by the measured loop execution time to estimate the core processing speed. 
This core speed is used in the simulated HPC representation to reflect the native core speed in processing the application tasks [5]. The above procedure is applicable to homogeneous and heterogeneous systems; in the latter case, the core speed estimation needs to be performed for each core type [46]. Similarly, a simple network benchmark, such as a ping-pong test, is used to estimate the real bandwidth and latency of the network links, and these values are inserted into the simulation. Section 5.2 offers details about the actual steps required for the calibration procedure described above.
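The core speed calibration described above amounts to a single division. The following Python lines are a minimal sketch of it; the function name and the example numbers are illustrative assumptions and are not measurements from this work.

```python
def estimate_core_speed(flop_per_task, sequential_loop_time_s):
    """Return the effective core speed in FLOP/s.

    flop_per_task: FLOP count of every loop iteration (task).
    sequential_loop_time_s: measured loop time of a sequential run, in seconds.
    """
    if sequential_loop_time_s <= 0:
        raise ValueError("sequential loop time must be positive")
    return sum(flop_per_task) / sequential_loop_time_s


# Hypothetical measurement: four tasks and a 2-second sequential loop.
flop_per_task = [1.0e9, 0.5e9, 0.3e9, 0.2e9]
speed = estimate_core_speed(flop_per_task, 2.0)
print(f"{speed / 1e9:.2f} GFLOP/s")  # 1.00 GFLOP/s
```

For a heterogeneous system, the same function would be called once per core type with the measurements taken on that core type.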
Quantifying system variability is essential for achieving realistic simulations of parallel applications. However, it involves significant challenges due to the variety of the factors that cause the variability, e.g., system failures, operating system kernel interrupts, memory and network contentions [47]. The present work models the effect of the system variability on application performance by exploiting a backlog of application execution times [43]. Two factors called maximum perturbation level, $PL_{\max}$, and minimum perturbation level, $PL_{\min}$, are used to determine the upper and the lower bounds of a uniform distribution, $U$, used to estimate the perturbation level, $PL$, induced by the system. These factors are calculated as in Equations 1 and 2, where $E_i$ denotes the application execution time at the $i$-th execution instance and $\bar{E}$ is the average application execution time of $n$ execution instances.
The estimated $PL$ is calculated as in Equation 3 and is used to disturb the processor availability during simulation, i.e., whenever a chunk is scheduled on a certain processor, a sample $PL$ is drawn from the uniform distribution $U$. The value is then used to determine the speed of the processor by multiplying the original speed by $(1 - PL)$.
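Given the two bounds, the perturbation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the nominal speed, and the bounds below are hypothetical, and Python's standard random module stands in for whatever random engine a simulator uses.

```python
import random


def perturbed_speed(nominal_speed, pl_min, pl_max, rng):
    """Draw PL from U(pl_min, pl_max) and return nominal_speed * (1 - PL)."""
    pl = rng.uniform(pl_min, pl_max)
    return nominal_speed * (1.0 - pl)


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
# Hypothetical nominal speed (1 GFLOP/s) and perturbation bounds.
speeds = [perturbed_speed(1.0e9, 0.01, 0.15, rng) for _ in range(5)]
for s in speeds:
    print(f"{s / 1e9:.3f} GFLOP/s")
```

A new sample is drawn for every scheduled chunk, so the same processor may run at a slightly different speed for each chunk it receives.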

Steps for Realistic Simulations
To achieve realistic performance simulation, three factors that affect application performance need to be well represented. In this section, we summarize the steps of the proposed realistic simulation approach and different methods to represent each factor.
Step 1 Application characteristics

Experimental Evaluation and Results
To evaluate the usefulness and effectiveness of the proposed approach, a large number of native and simulative experiments is performed. These experiments have been designed as a factorial set of experiments, which is described below and summarized in Table 1. In addition, details are provided on how the performance simulation is created using SimGrid (SG) and its interfaces and on how the approach proposed in Section 4 is applied to realistically simulate the performance of two computationally-intensive scientific applications. Subsequently, the native and simulative performance results are compared using the second and the third step of the comparison approach illustrated in Figure 1, and the results are discussed.

Design of native and simulative experiments
Applications. The first application considered in this work is the parallel spin-image algorithm (PSIA), a computationally-intensive application from computer vision [48]. The core computation of the sequential version of the algorithm (SIA) is the generation of the 2D spin-images. Figure 3 shows the process of generating a spin-image for a 3D object. The PSIA exploits the fact that the spin-image generations are independent of each other. The size of a single spin-image is small (200 bytes) and fits in the lower level (L1) cache. Therefore, the memory subsystem has no impact on the application performance, as data are always available for computation at the highest speed. The PSIA pseudocode is available online [49]. The amount of computation required to generate the spin-images is data-dependent and is not identical over all the spin-images generated from the same object. This introduces an algorithmic source of load imbalance among the parallel processes generating the spin-images. The performance of PSIA has previously been enhanced by using nonadaptive DLS techniques to balance the load between the parallel processes [24]. Using DLS improved the performance of the PSIA by a factor of 1.2 and 2 for homogeneous and heterogeneous computing systems, respectively.
Figure 3: Illustration of the spin-image calculation for a 3D object (from literature [50]). A flat sheet is rotated around each point of the 3D object to describe the object from this point of view.
The second application of interest is the computation of the Mandelbrot set [51] and the generation of its corresponding image. The application is parallelized such that the calculation of the value at every single pixel of a 2D image is a loop iteration that is performed in parallel. The application computes the function $f_c(z) = z^4 + c$ instead of $f_c(z) = z^2 + c$ to increase the number of computations per task. The size of the generated image is 512 × 512 pixels, resulting in $2^{18}$ parallel loop iterations. To increase the variability between task execution times, the calculation is focused on the center of the image, i.e., the seahorse valley, where the computation is intensive. Figure 4 shows the calculated image. Mandelbrot is often used to evaluate the performance of dynamic scheduling techniques due to the high variation between the execution times of its loop iterations.
Dynamic load balancing. The DLB tool is an MPI-based dynamic load balancing library [23]. The DLB tool has been used to balance the load of scientific applications, such as image denoising and the statistical analysis of vector nonlinear time series [23]. The DLB tool is used for the self-scheduling of the parallel tasks of PSIA and Mandelbrot in both native and simulative executions. The DLB tool employs a master-worker execution model, where the master also acts as a worker when it is not serving worker requests. Workers request work from the master whenever they become idle, i.e., the work distribution is self-scheduled. Upon receiving a work request, the master calculates a chunk size based on the DLS technique in use. Then, the master sends the chunk size and the start index of the chunk to the requesting worker. This cycle of workers requesting work and the master assigning work to the requesting workers repeats until all the work is finished. The two applications of interest are scheduled using the DLB tool with eight different loop scheduling techniques ranging from static to dynamic, nonadaptive and adaptive, as shown in Table 1.
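The master-side chunk calculation can be illustrated with one of the nonadaptive techniques. The Python sketch below is not the DLB tool's code or API; it only assumes the commonly cited guided self-scheduling (GSS) rule, in which each chunk is the number of remaining iterations divided by the number of workers, rounded up. The function name and the problem size are hypothetical.

```python
import math


def gss_chunks(n_tasks, n_workers):
    """Yield (start_index, chunk_size) pairs in the order a master hands them out.

    Assumes the GSS rule: chunk = ceil(remaining / n_workers).
    """
    start = 0
    remaining = n_tasks
    while remaining > 0:
        size = math.ceil(remaining / n_workers)
        yield start, size  # what the master would send to the requesting worker
        start += size
        remaining -= size


chunks = list(gss_chunks(100, 4))
print(chunks[:4])  # [(0, 25), (25, 19), (44, 14), (58, 11)]
```

The chunk sizes decrease as the loop progresses, which keeps the scheduling overhead low at the beginning and leaves small chunks to even out the finishing times at the end.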

Realistic simulations of scientific applications
Extracting the computational effort in an application. To obtain the computational effort per task of the applications of interest, the FLOP count approach described in Section 4.1 is used. The native application code is instrumented and the number of FLOP per task is counted using the PAPI performance API [45]. In the FLOP counting experiment, the application was executed sequentially on a single dedicated node to avoid interference between cores on the hardware counters and to ensure a correct FLOP count. The experiment was repeated 20 times for each application to ensure that the FLOP count is constant in all repetitions. The FLOP count can also be inferred from the application source code [6] in the case of simple dense linear algebra kernels. The resulting FLOP count per task is written to a file that is read by the simulator to account for task execution times. Whenever inferring or counting the FLOP per task is not possible and the tasks are of large granularity, the task execution time can be used instead of the FLOP count, as the measurement overhead will not dominate the task execution time, as is the case for short tasks.
To simulate the dynamic behavior of the task execution times, a probability distribution is fitted to the measured FLOP count. To obtain this probability distribution, the linear piecewise approximation of the empirical cumulative distribution function (eCDF) is used [3]. The eCDF values are split over the y-axis into 100 linear segments (pieces). To draw a sample from this distribution, a segment is randomly selected, and a value is randomly selected along this linear segment. Figure 5 shows the results of approximating the measured FLOP counts of the tasks of both PSIA and Mandelbrot with the linear piecewise approximation of the eCDF, computed using MATLAB. To ensure that the simulator draws samples from the approximated distribution with a fast random engine that has a long period and low serial correlation, the random number generator of the GNU Scientific Library (GSL) is used in the simulator. It generates the uniformly distributed random numbers that select one of the 100 linear segments and a value along that segment with low overhead during simulation.
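The sampling procedure described above can be sketched as follows. This Python illustration makes several assumptions that are not taken from this work: the segment boundaries are placed at equally spaced quantiles of the sorted measurements, Python's standard random module replaces the GSL generator, and the FLOP values are synthetic.

```python
import random


def build_knots(samples, n_segments=100):
    """Return n_segments + 1 quantile knots of the piecewise-linear eCDF."""
    data = sorted(samples)
    if len(data) < 2:
        raise ValueError("need at least two samples")
    knots = []
    for k in range(n_segments + 1):
        pos = k / n_segments * (len(data) - 1)  # fractional index into data
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        frac = pos - lo
        knots.append(data[lo] + frac * (data[hi] - data[lo]))
    return knots


def draw_sample(knots, rng):
    """Pick a segment uniformly, then a point uniformly along that segment."""
    k = rng.randrange(len(knots) - 1)
    u = rng.random()
    return knots[k] + u * (knots[k + 1] - knots[k])


rng = random.Random(1)
# Synthetic stand-in for the measured FLOP count per task.
measured_flop = [rng.gauss(6.0e6, 2.0e5) for _ in range(5000)]
knots = build_knots(measured_flop)
simulated_flop = [draw_sample(knots, rng) for _ in range(10)]
```

Because every segment covers the same share of probability, selecting a segment uniformly and then a point uniformly within it reproduces the piecewise-linear eCDF. The order of the original tasks is not preserved by this sampling.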
Figure 5: Linear piecewise approximation of the eCDF of the measured FLOP counts per task for PSIA and Mandelbrot (x-axis: FLOPs per task).
The SMPI+MSG simulation approach. A novel simulation approach is employed in this work. Two interfaces of the SimGrid toolkit are leveraged to realistically simulate the application performance with minimal effort. Algorithm 1 shows the changes needed in the native application code to transform it into the simulative application code using SG-SMPI+SG-MSG, following the approach illustrated in Figure 2. Lines in mint font color in Algorithm 1 show additions to simulate the application, lines in grey font color show the lines that need to be uncommented to revert to the native application code, and black lines denote unchanged code. The SG-SMPI interface is used to execute the native application code. To speed up the SG-SMPI simulation, the computational tasks in the application are replaced with SG-MSG tasks. The amount of work per SG-MSG task is either read from a file or drawn from a probability distribution, according to the simulation type under experiment. Memory allocations of results and data in the native code are removed or commented out in the simulation, as they are not needed. This reduces the memory footprint of the simulation and enables the simulation of a large number of ranks on a single compute node. No modifications are needed for the DLB tool in this approach. The scheduling overhead of the different techniques is accounted for by SG-SMPI, whereas the task execution time is accounted for in simulation by SG-MSG. The proposed approach results in a fast and accurate simulation of the application with minimal modifications to the native application source code. Hundreds to thousands of MPI ranks can be simulated using a single core on a single compute node.

Computing system representation.
To represent the miniHPC in SimGrid, the system characteristics need to be entered in a specially formatted XML file denoted as the platform file. Each core of a compute node of miniHPC is represented as a host in the platform file. Hosts that represent the cores of the same node are connected by links with high bandwidth and low latency to represent the communication of cores of the same node through the memory. The bandwidth and the latency of these links are set to 500 Mbit/s and 15 µs, respectively, to represent the memory access bandwidth and latency. Every 16 hosts represent a node of miniHPC. Another set of links is used to connect the hosts to represent network communication in a two-level fat-tree topology. The properties of these links represent the properties of the Intel Omni-Path interconnect used in miniHPC, and their bandwidth and latency are set to 100 Gbit/s and 100 ns, respectively.
To reflect the fact that network communications are nonblocking in the native miniHPC system, the FATPIPE sharing policy is used to tell SimGrid that the communications on these links are nonblocking and that the links are not shared, i.e., each host has the full network bandwidth and the shortest latency available at all times, even when all hosts communicate at the same time. For the links that represent the memory communication, the sharing policy is set to SHARED to represent possible delays that can occur if multiple cores try to access the memory at the same time.
To estimate the core speed, each application is executed sequentially on a single core to measure the total execution time without any scheduling or parallelization overhead. The core speed is calculated as the total number of FLOP in all tasks of the application divided by the total sequential execution time of the application. Using the above approach, the core speed is found to be 0.95 GFLOP/s and 1.85 GFLOP/s for the execution of PSIA and Mandelbrot, respectively. This requires the creation of two platform files to represent the miniHPC in the execution of PSIA and Mandelbrot. This illustrates the strong coupling between application and system representation in simulation, as discussed in Section 4.3.
SimGrid uses a flow-level network modeling approach that realistically approximates the behavior of TCP and InfiniBand (IB) networks and is specifically tailored for HPC settings. This approach accurately models contention in such networks [52] and accurately captures the network behavior for messages larger than 100 KB on highly contended networks [53]. The SimGrid network model can further be configured to precisely capture characteristics such as the slow start of MPI messages, cross-traffic, and asynchronous send calls. To fine-tune the network representation in the simulation to the native miniHPC system, the SG-based calibration procedure [54] is used to calibrate the network model parameters and thereby better adjust the network bandwidth and latency in both platform files.
Using the approach introduced in earlier work [55], the representation of the computing system can be verified separately from the application representation by using the SG-SMPI interface. The SG-SMPI interface simulates the execution of native MPI codes on a simulated computing platform. Both the native and the simulative executions using SG-SMPI share the application's native code. The only difference between the native execution and the simulative SG-SMPI-based execution is the computing system representation component. The representation of the computing system can therefore be verified by comparing the native and the SG-SMPI simulative performance results.
To quantify the effect of system variability, both applications, PSIA and Mandelbrot, were executed 20 times using STATIC on 256 PEs. For PSIA, $\bar{E}$, $PL_{\max}$, and $PL_{\min}$ were 111.5792 seconds, 0.1539, and 0.0113, respectively. For Mandelbrot, $\bar{E}$, $PL_{\max}$, and $PL_{\min}$ were 139.9814 seconds, 0.0088, and 0.0009, respectively. These results indicate a low system variability in miniHPC during the execution of both applications. This variation is not considered in the simulative experiments.
Figure 6 shows the native performance of both PSIA and Mandelbrot with eight static and dynamic (nonadaptive and adaptive) self-scheduling techniques. To measure application performance, the parallel loop execution time $T_{loop}^{par}$ is reported for both applications. Each native experiment is submitted for execution as a single job to the Slurm [56] batch scheduler on dedicated miniHPC nodes. Slurm exclusively allocates nodes to each job. The nonblocking fat-tree network topology of miniHPC guarantees that nodes use the full bandwidth of the links, even if
Two metrics are used to measure the load imbalance in both applications: (1) the coefficient of variation (c.o.v.) of the processes finishing times [14] and (2) the max/mean of the processes finishing times.

Experimental results
The c.o.v. is calculated as the standard deviation of the processes finishing times divided by their mean. It indicates load imbalance as the overall variation between the processes finishing times. A high c.o.v. value represents high load imbalance, and a low value (near zero) represents a nearly perfectly balanced load execution.
The max/mean is calculated as the maximum of processes finishing times divided by their mean. Max/mean indicates how long the processes of an application had to wait for the slowest process due to load imbalance. A max/mean value of 1 represents a balanced load execution (lower bound), and a higher value indicates that execution time is prolonged due to a process that lags all the other processes at the end.
When all processes, except for one, have similar finishing times, the c.o.v. is very low and hides the fact that the slowest process lags behind in execution, while the finishing time of this process is visible as a large value in max/mean metric.
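A small numerical example makes the difference between the two metrics concrete. The Python sketch below uses hypothetical finishing times and assumes the population standard deviation for the c.o.v., since the text does not state which estimator is used.

```python
import statistics


def cov(finish_times):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.pstdev(finish_times) / statistics.mean(finish_times)


def max_over_mean(finish_times):
    """Maximum finishing time divided by the mean finishing time."""
    return max(finish_times) / statistics.mean(finish_times)


# Hypothetical run: 255 processes finish at 100 s, one lags until 150 s.
times = [100.0] * 255 + [150.0]
print(f"c.o.v.   = {cov(times):.3f}")            # c.o.v.   = 0.031
print(f"max/mean = {max_over_mean(times):.3f}")  # max/mean = 1.497
```

The c.o.v. of about 0.03 suggests an almost balanced execution, whereas the max/mean of about 1.5 reveals that the loop takes roughly 50% longer than the average process needs.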
Inspecting the native application results in Figure 6, one observes that STATIC degraded the performance of both PSIA and Mandelbrot due to load imbalance. The high values of c.o.v. and max/mean in both applications indicate the load imbalance with STATIC, as shown in subfigures (c) and (d).
Although the value of c.o.v. for GSS is lower than that of mFSC for PSIA, one can see that the performance of GSS is worse than that of mFSC. Figure 6(e) shows, however, that the value of max/mean for GSS is higher than that of mFSC, which explains the long execution time in subfigure (a). This is an example where the c.o.v. alone hides the load imbalance resulting from a single process lagging the application execution, as explained above. The FAC technique improves the performance of both applications and results in the lowest execution time and the lowest load imbalance metrics.
The adaptive DLS techniques improve the performance of PSIA and result in low load imbalance metrics as well. For Mandelbrot, however, due to the high variability of its task execution times and its short execution time, the adaptive techniques did not have enough time to estimate the relative PE weights correctly. This resulted in high execution times and high load imbalance metric values, both with high variability.
Two application representation approaches are employed for the experiments using SG-SMPI+SG-MSG. The first approach is denoted as FLOP file and is shown in Figure 7. The FLOP count per task was measured with PAPI counters and written into a file containing the task id and the FLOP count of each task. This file is read by the simulator during the execution to account for the computational effort in each task. Inspecting the first simulative performance results (FLOP file) in Figure 7 reveals that STATIC degrades the performance of the applications due to load imbalance, as can be inferred from the load imbalance metrics in subfigures (c)-(f). However, for STATIC with PSIA, the c.o.v. and max/mean values are smaller than those of mFSC and GSS. The GSS performance is worse than that of mFSC, even though GSS has a lower c.o.v. than mFSC for PSIA. This is due to a single process lagging the execution of the PSIA, as captured in subfigure (e). The FAC technique results in improved performance for both applications. The c.o.v. and max/mean values with FAC are almost the minimum in both applications. The adaptive techniques AWF-C and AWF-E improve the performance of PSIA and result in a low parallel loop execution time, c.o.v., and max/mean, almost similar to those of FAC (the minimum). AWF-B and AWF-D also improve the performance of PSIA compared to mFSC and GSS. However, the PSIA execution time with these techniques is slightly longer than the best (FAC, AWF-C, AWF-E). The performance of Mandelbrot with the adaptive techniques is degraded in general compared to STATIC and the dynamic nonadaptive DLS techniques. This poor performance of Mandelbrot with the adaptive techniques is due to the high load imbalance, as indicated by the c.o.v. and max/mean metrics in subfigures (d) and (f). The high variability and the rather short execution time of Mandelbrot left no room for the adaptive techniques to learn the correct relative PE weights.
The second simulation approach is denoted as FLOP dist. and is shown in Figure 8. The FLOP counts measured with PAPI are used to fit a probability distribution to the measured FLOP data, as described in Section 5 above. In this case, the simulation is repeated 20 times with different seeds, similar to the native execution, to capture the variability of the performance of the native application. Inspecting the second simulative performance results (FLOP dist.) in Figure 8 reveals that the application performance with STATIC is better than with the mFSC and GSS techniques, and almost similar to the best performance achieved by FAC. This is confirmed by the low values of the load imbalance metrics for both applications with STATIC. GSS degrades the PSIA performance because a single process lags the application execution, as indicated by a high max/mean value. mFSC also failed to balance the load of PSIA, as indicated by a high c.o.v. value, and results in a long parallel loop execution time. FAC results in the shortest parallel loop execution times and the lowest c.o.v. and max/mean values for both applications under test. The adaptive techniques in general, and AWF-C and AWF-E specifically, improve the PSIA performance with a balanced load execution and result in the shortest execution time (similar to FAC). However, the adaptive techniques failed to adapt correctly to the high variability of the task execution times of Mandelbrot, due to its short execution time, and resulted in poor performance.

Strong scaling results
In Figure 9, the strong scaling behavior of the PSIA and Mandelbrot applications is shown for the native executions (subfigures (a) and (b)) and the simulative executions (subfigures (c)-(f)). Considering the native executions of PSIA, all DLS techniques scale very well. FAC and the adaptive DLS techniques show a constant parallel cost, while the parallel cost increases slightly with the increasing number of processing elements for mFSC and STATIC. The largest slope is induced by the execution using the GSS technique. By contrast, an almost constant parallel cost of the Mandelbrot performance is obtained with mFSC, GSS, and FAC. The parallel cost of using STATIC is also almost constant but higher than that of using mFSC, GSS, and FAC. Using the adaptive DLS techniques results in poorer strong scaling, characterized by at least one outlier per adaptive technique.
The strong scaling results for the first simulation approach, denoted as FLOP file, are shown in Figure 9(c)-(d) for the PSIA and Mandelbrot applications, respectively. While the parallel costs of the PSIA simulations are almost equal to the parallel costs of the native executions of PSIA, this is not the case for the Mandelbrot application. The Mandelbrot simulations show almost constant parallel costs for mFSC, GSS, FAC, and STATIC. These results are identical to those of the native executions. Considering the adaptive DLS techniques, the parallel costs are not characterized by outliers as observed for the native executions. However, in contrast to the nonadaptive DLS techniques, the parallel cost increases with the number of processing elements.
Considering the simulative executions using the second simulation approach, denoted as FLOP dist., a rather different strong scaling behavior is observed. For the PSIA application, the parallel costs are equal to the parallel costs of the native executions only for 256 processing elements. For a lower number of processing elements, the parallel costs are approximately half of those of the native executions. The parallel costs of the simulative executions of the Mandelbrot application are almost constant. Only using the adaptive DLS techniques results in a slight increase of the parallel costs with increasing number of processing elements.
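The parallel cost used in this discussion can be illustrated with a short sketch. It assumes the common definition of parallel cost as the number of processing elements multiplied by the parallel execution time, which is not spelled out in the text; the timings below are hypothetical.

```python
def parallel_cost(n_pes, t_loop_par_s):
    """Parallel cost under the assumed definition: PEs times parallel loop time."""
    return n_pes * t_loop_par_s


# Hypothetical ideal strong scaling: doubling the PEs halves the loop time.
runs = [(64, 400.0), (128, 200.0), (256, 100.0)]
costs = [parallel_cost(p, t) for p, t in runs]
print(costs)  # [25600.0, 25600.0, 25600.0]
```

Under this definition, a constant parallel cost corresponds to ideal strong scaling, and a cost that grows with the number of processing elements indicates increasing overhead or load imbalance.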

Discussion
To evaluate how realistic the performed simulations are, the native and simulative performance of PSIA and Mandelbrot is analyzed in terms of the $T_{loop}^{par}$, c.o.v., and max/mean metrics. Realistic simulation results are expected to lead to a similar analysis and to conclusions similar to those drawn from the analysis of the native results. Table 2 summarizes seven performance features from the analysis of the applications' performance with various scheduling techniques performed above in Section 5.3. The comparison between the native and simulative performance analysis results shows that the simulations with FLOP file captured almost all the performance features that characterize the performance of the two applications under test. The simulator overestimated only the performance of AWF-B and AWF-D.
Both simulations predicted correctly that the FAC technique achieves a balanced load execution for both applications and improves performance. Simulations with the FLOP dist. approach failed to capture the load imbalance with STATIC in both applications. The performance with STATIC is significantly affected by the order of the tasks or loop iterations assigned to each PE. As the order of tasks is not preserved by drawing random samples from FLOP distributions, the load imbalance with STATIC was dissolved between the PEs, because they are assigned different tasks in the simulative execution than in the native one. Interestingly, both simulations were able to capture the most devious performance feature: the high $T_{loop}^{par}$, low c.o.v., and high max/mean values of GSS with PSIA. Neither simulation captured the high variability of the adaptive techniques. The adaptive techniques depend on time measurements to estimate PE performance. If the granularity of the tasks is highly variable and some tasks are very fine-grained, the time measurement of their execution will be inaccurate due to the overhead of the time measurement itself. The inaccurate time measurement leads to incorrect weight estimation and high variability between different native executions. This probing effect does not exist in the simulative execution and, therefore, was not fully captured. However, both simulations correctly predicted the high performance of the adaptive techniques with PSIA and their low performance with Mandelbrot. The simulation with FLOP dist. was able to capture the small variability in performance with various DLS techniques, which was not captured by reading the FLOP counts from a file in the first simulation.

Conclusion and Future Work
In this work, we show that it is possible to realistically simulate the performance of scientific applications on HPC systems. The approach proposed for this purpose considers various factors that affect the application performance on HPC systems: application representation, scheduling, computing system representation, and systemic variations. The proposed realistic simulation approach has been exemplified on two computationally-intensive scientific applications. A set of guidelines is also introduced and discussed for how to represent application and system characteristics. These guidelines help to achieve realistic simulations irrespective of the application type (e.g., communication- or computationally-intensive) and the simulation toolkit (e.g., Alea or GridSim [37]).
Based on the proposed approach, a novel simulation method is also introduced for the accurate and fast simulation of MPI-based applications. This method jointly employs SimGrid's SMPI and MSG interfaces to simulate application performance with minimal changes to the original application source code. We used this method to realistically simulate two computationally-intensive scientific applications using different scheduling techniques. The comparison of the performance characteristics extracted from the native and simulative results shows that the proposed simulation approach captured very closely most of the performance characteristics of interest, such as strong scaling properties and load imbalance.
We believe that factors such as the application representation, scheduling, the computing system representation, and system variations, affect the realism of the simulations and deserve further investigation. Future work is planned to apply the proposed simulation approach to large and well-known performance benchmarks, such as the NAS suite, the SPEC suites, the RODINIA suite, and other scientific applications. The development of a tool to automatically transform the native application code into a simulative one is also envisioned in the future.