Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with a regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.


Introduction and related work
Highly parallel MPI programs with no or weak global synchronization points show interesting dynamics that go beyond what is expected from their usually regular compute-communicate structure. Initiated by what is typically called "noise," a plethora of patterns can emerge: Propagating delays emanating from strong one-off disturbances, so-called idle waves [9], can interact [3] and eventually decay [1,2,3] via various mechanisms. Triggered by idle waves, but also by natural, fine-grained system noise, some applications are unstable and leave their initial lock-step mode (Figure 2 (left)), in which all processes either compute or communicate at the same time. It was shown [2] that a hardware bottleneck such as main memory bandwidth is a prerequisite for this bottleneck evasion to occur. As a consequence, such programs settle in a metastable state, a computational wavefront, where neighboring processes are shifted in time with respect to each other (Figure 2 (right)). It was also shown [4] that this desynchronization can lead to substantial speedups via automatic overlap of communication and code execution.
Investigating these dynamics typically requires the analysis of MPI traces taken by tools such as Intel Trace Analyzer/Collector or Vampir. Apart from the often prohibitive amount of data contained in such traces, the relevant patterns are often hidden in the data and not readily visible to the human eye. Furthermore, it is hard, if not impossible, to obtain this data in a production environment without adverse effects on application performance. For applications that have natural, regular compute-communicate cycles, we propose to use the MPI waiting time per process and time step (i.e., the time spent in the MPI library, regardless of whether communication takes place or not) as a starting point and input metric for data analysis methods that can identify the structural processes described above. The performance per process and time step can serve as a supplemental metric to track the impact of automatic communication overlap. This paper makes the following contributions:
- We scan a wide spectrum of metrics beyond plain timelines of MPI processes to support the analysis of a set of parallel programs.
- We compare the analysis for two observables, MPI waiting time and performance, which enables us to explore the effectiveness of each metric.
Overview This paper is organized as follows: We first provide details about our experimental environment and methodology in Sect. 2. To discover the dynamics of large-scale parallel programs, simple metrics such as histograms and timelines for all or individual MPI processes are studied in Sect. 3. Sect. 4 covers advanced methods, namely the correlation coefficient matrix and phase space plots, while Sect. 5 addresses machine learning techniques such as principal component analysis and k-means clustering. Finally, Sect. 6 concludes the paper and gives an outlook to future work.
2 Case studies, test bed and experimental methods

Test systems and methodology
We conduct various experiments on two different clusters. The details of the hardware and software environments of the Omni-Path cluster "Meggie" and the InfiniBand cluster "Fritz" can be found in Table 1. By default, SMT is disabled on both systems and Sub-NUMA Clustering (SNC) is enabled on the Fritz system. We used only physical cores, and the optimization flag -O3 was utilized with the Intel compiler. Process-core affinity was enforced using the I_MPI_PIN_PROCESSOR_LIST environment variable. The clock frequency was always fixed to the base value of the respective CPUs. Working sets were chosen large enough to not fit into any cache, i.e., at least ten times the last-level cache size on both systems (non-inclusive victim L3 cache in the Ice Lake processors of Fritz). All floating-point computations and storage are done in double precision. All experiments were performed for 500 k iterations (compute-communicate cycles), except 100 k iterations for the LBM program, and repeated at least five times to ensure that the run time analysis is stable. The scripts and additional material are available online.


Synthetic microbenchmarks
We ran pure-MPI versions of the McCalpin STREAM Triad [10] (A(:)=B(:)+s*C(:)) and a "slow" Schönauer vector Triad (A(:)=B(:)+cos(C(:)/D(:))) with bidirectional next-neighbor communication. The saturation characteristics of these streaming kernels on one socket (ccNUMA domain) of Meggie are shown in Fig. 1. Each MPI rank i sends and receives messages to and from each of its direct neighbors i + 1 and i − 1 after each full loop traversal. Each array had 400 M elements, which yields a 9.6 GB working set for the STREAM Triad and a 12.8 GB working set for the Schönauer Triad. To mimic scalable applications, we set up a PISOLVER code which calculates the value of π by numerically evaluating the integral of 4/(1+x²) from 0 to 1. These four cases were run in two scenarios on the Meggie system: (A) open-chain process topology with 40 MPI processes on four ccNUMA domains (sockets), and (B) closed-ring process topology with 400 MPI processes on 40 sockets. Later these scenarios will be denoted "Bench iA" and "Bench iB", respectively, where i is the number of the respective case.
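The two triad kernels can be sketched in serial NumPy form (an illustration only; the actual benchmarks are compiled MPI codes with next-neighbor halo communication, and the array length here is reduced from the 400 M elements used in the real runs):

```python
import numpy as np

N = 1_000_000  # per-process array length (400 M elements in the real runs)
B, C, D = (np.random.rand(N) for _ in range(3))
s = 3.0

# STREAM Triad: pure streaming kernel, strongly memory bound,
# saturates the memory bandwidth of a ccNUMA domain
A_stream = B + s * C

# "Slow" Schoenauer Triad: the cos() evaluation adds in-core work,
# lowering the bandwidth demand and weakening the saturation
A_slow = B + np.cos(C / D)
```

In the MPI versions, each rank i exchanges messages with ranks i − 1 and i + 1 after every traversal of these loops.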

Proxy memory-bound parallel applications
We experiment with the following two MPI-parallel proxy applications and run them with 1440 MPI processes on 40 sockets of the Fritz system.

MPI-parallel LBM solver
This is a prototype application based on the Lattice Boltzmann Method (LBM) from computational fluid dynamics, using the Bhatnagar-Gross-Krook collision operator [5] and implementing a 3D lid-driven cavity scenario. It is purely memory bound on a single ccNUMA domain, but the halo exchange makes it communication dominated in strong scaling scenarios. The double-precision implementation employs a three-dimensional D3Q19 space discretization [11]. The domain decomposition is performed by cutting slices in the z direction. For halo exchange, five PDFs per boundary cell must be communicated. The MPI communication is done with non-blocking point-to-point calls, but no explicit overlapping of communication with useful work is implemented. We use an overall problem size of n_x × n_y × n_z = 1440³ lattice cells, which amounts to a working set of 908 GB plus halo layers. Due to the one-dimensional domain decomposition, the communication volume per halo depends on n_x and n_y only and is independent of the number of processes.
MPI-parallel SpMVM solver
The sparse matrix-vector multiplication (SpMVM) b = Ax is one of the most relevant, time-consuming building blocks of numerous applications in science and engineering. Here, A is an n × n sparse matrix, and b, x are n-dimensional vectors. SpMVM plays a central role in the iterative solution of sparse linear systems, eigenvalue problems, and Krylov subspace solvers. Due to its low computational intensity, the SpMVM kernel is mostly limited by the main memory bandwidth. The required nonlocal elements of the input vector x are communicated first; after that, the whole SpMVM kernel is executed. The communication volume is crucially dependent on the structure of the matrix; the distribution of the nonzero entries plays a decisive role. In this paper, we use a matrix that arises from strongly correlated electron-phonon systems in solid state physics. It describes a Holstein-Hubbard model [6] comprising 3 electrons on 8 lattice sites coupled to 10 phonons. The sparse matrix A has 60,988,928 rows and columns and 889,816,368 non-zero entries, which leads to an average of 13 nonzeros per row and an overall data set size of 10.9 GB using four-byte indices in the Compressed Row Storage format (one-dimensional arrays for values, column indices, and row pointers).

Observables for analysis
We instrument all codes to collect the time stamps of entering and leaving MPI calls (yielding the MPI waiting time per process) at each iteration of each MPI process across the full run. From this data we construct a non-square matrix of size N_p × N_it, where N_p is the number of MPI processes and N_it is the number of iterations. Each row (column) of the observable matrix represents the observable value, i.e., the time spent in MPI, for each process (iteration). There is a choice as to how this data can be used in the analysis: One can either use the full timeline per process, which takes the end-to-end evolution of execution characteristics (such as desynchronization) into account, or cut out a number of consecutive iterations from different execution phases, which makes it possible to investigate the development of interesting patterns in more detail. In addition, for some experiments we collect the performance per MPI process, averaged over 1000 time steps.
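The construction of the observable matrix can be sketched as follows (a sketch with synthetic stand-in data; the variable names are our own, not those of the instrumentation code):

```python
import numpy as np

n_procs, n_iters = 8, 1000  # N_p x N_it

# Synthetic stand-in for the instrumentation data: time stamps of
# entering and leaving the MPI call in each iteration of each process
enter = np.cumsum(np.random.rand(n_procs, n_iters), axis=1)
leave = enter + 0.1 * np.random.rand(n_procs, n_iters)

# Observable matrix: row = process, column = iteration,
# entry = MPI waiting time of that process in that iteration
mpi_time = leave - enter          # shape (N_p, N_it)

# Full timelines (end-to-end analysis) vs. a snippet of
# consecutive iterations cut from one execution phase
full = mpi_time
snippet = mpi_time[:, 500:600]    # 100 consecutive iterations
```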
3 Simple timeline metrics for analysis

Rank/ccNUMA-wise timelines and histogram of MPI time and performance
The histograms in Figure 3 sort the MPI time values of end-to-end (500 k iterations) runs of Bench1A-Bench4A into 35 bins. For memory-bound code, idle times are lower for desynchronized processes if the bandwidth saturation on a ccNUMA domain is weaker [2] (see Figure 3(c)). In the compute-bound PISOLVER case (Figure 3(d)), all processes are synchronized because of the absence of any contention on the memory interface or on the network. Open-chain boundary conditions and strong memory contention ((a) and (b)) lead to a single synchronized socket. In the other cases, all sockets desynchronize gradually over 500 k iterations, which causes a spread in the histogram because processes evolve from lower to higher idle times. We have observed that this spread is more prominent for codes with stronger saturation and more communication (Bench1B, LBM, and SpMVM; data not shown for brevity). We first investigate the open-chain, high-communication-overhead benchmark (Bench1A). Figure 4(a) shows histograms at different stages of the evolution of a single MPI process (rank 20 on the third ccNUMA domain) through the whole execution. Each histogram encompasses 1000 iterations. Initially (e.g., up to 50 k iterations), the distributions are multimodal, which indicates different phases. On closer inspection it can be observed that the peak snaps from left to right as the process goes out of sync with its neighbors. This corroborates that the MPI waiting time is a good observable in our context. Since desynchronization cannot yield significant speedup if communication is insignificant, we show plots of performance vs. time step only for cases with significant communication (Bench1A in Figure 4(b) and Bench1B in Figure 4(c)). These plots show the initial 1 k iterations. With open boundary conditions (b), one observes fluctuating performance as processes get desynchronized on all but one socket.
However, this slow synchronized socket does not permit a global performance increase since the desynchronized processes on the other sockets cannot lag behind indefinitely. With closed boundary conditions (c), as the simulation progresses, performance (along with MPI waiting time) increases by about 15% and stays constant at the higher level until the end of the run.
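The per-rank histograms above can be produced directly from rows of the observable matrix (a sketch with synthetic data; 35 bins as in Figure 3):

```python
import numpy as np

# One row of the observable matrix: MPI times of one rank over the run
rng = np.random.default_rng(0)
timeline = rng.gamma(shape=2.0, scale=0.5, size=500_000)

# Sort the MPI time values into 35 bins
counts, edges = np.histogram(timeline, bins=35)

# A multimodal histogram (several peaks) indicates distinct execution
# phases; a peak snapping to the right means the process is
# accumulating more waiting time as it falls out of sync
```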

Timeline in compact representation
MPI waiting times facilitate a compact representation of the timeline of a parallel program. Figure 5 shows ranks on the x-axis and time steps on the y-axis, with the normalized MPI waiting time for each rank and time step color coded. In this representation, the mean value of MPI times across all processes and time steps is shown in white, while red (positive values) and blue (negative values) represent values above and below the mean, respectively. This makes it possible to distinguish between synchronized and desynchronized groups of processes: Strongly desynchronized processes spend more time in MPI (red), while white marks synchronized processes (ranks 1-10 and 20-30 in Figure 5(b) and (c), respectively). This visualization is similar to what tracing tools like ITAC or Vampir display; however, these tools often encompass too much information, and depending on the chosen resolution one can easily get lost in the data. In contrast, compact timelines of the waiting time per time step deliver a condensed view of this information and help to better visualize certain phenomena. For instance, the weaker-saturation cases accumulate lower idle times, which can be seen when comparing Figures 5(c)-(e) and (h)-(j). Asymptotic behavior in longer runs can be observed at the top part of the plot in all cases. Idle waves are prominently visible as dark-blue stripes in the LBM benchmark (Figure 5(a)).
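The color-coded deviation matrix behind this compact view can be computed in a few lines (a sketch; the plotting itself, e.g. with a diverging colormap, is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
mpi_time = rng.random((40, 2000))        # N_p x N_it observable matrix

# Deviation from the global mean: ~0 is shown white, positive values
# (more waiting, desynchronized) red, negative values blue
deviation = mpi_time - mpi_time.mean()

# e.g., plt.pcolormesh(deviation.T, cmap="bwr") would then show ranks
# on the x-axis and time steps on the y-axis
```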

4 Advanced metrics for analysis
Beyond timeline visualization and statistics, a plethora of advanced data analysis methods exist which can lead to deeper insights into the desynchronization process. Here we pick the correlation coefficient and the phase space plot.

Correlation coefficient
The correlation coefficient function [13] provides a simple way to uncover correlations between the timelines of two MPI processes. Figure 6 shows the color-coded correlation coefficients of all rank pairs for all benchmarks, using the full end-to-end timelines. The matrices are obviously symmetric, and the diagonal entries (dark red) are set to one by convention. The correlation coefficients range from −1 to 1, with −1 representing a direct negative correlation, 0 representing no correlation, and 1 representing a direct positive correlation. For the memory-bound applications, the ccNUMA domain structure is clearly visible in Figures 6(a-c, f-h). This implies that processes within a ccNUMA domain are strongly correlated, while they are less (or not) correlated across sockets. The data shows strong correlations within desynchronized sockets, which exhibit a bimodal distribution of MPI times since these sockets have already started to lose the sync pattern. In the open-chain scenarios, processes on the last socket show a weaker correlation. In the SpMVM application, the sparse matrix structure is reflected in the correlation coefficients since the desynchronization process is strongly influenced by the communication structure (Figure 6(f)). In weakly or non-saturated applications (Figures 6(d-e, i-j)), correlations are generally weaker, as expected.
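Such a matrix corresponds to the pairwise Pearson correlation of the rows of the observable matrix, e.g., via NumPy's np.corrcoef. A minimal sketch with two synthetic "ccNUMA domains" of four ranks each:

```python
import numpy as np

rng = np.random.default_rng(2)
# Ranks within a domain share a common signal component (saturation
# dynamics of the shared memory interface), so they correlate strongly
base = np.repeat(rng.standard_normal((2, 1000)), 4, axis=0)
mpi_time = base + 0.3 * rng.standard_normal((8, 1000))

# Correlation coefficient matrix of the per-rank timelines:
# symmetric, diagonal = 1, block structure along ccNUMA domains
corr = np.corrcoef(mpi_time)   # shape (N_p, N_p)
```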

Phase space plots
In order to capture the temporal evolution of the MPI waiting time, we set up a scatter plot where each data point has coordinates (MPItime(t_i, r), MPItime(t_{i+1}, r)). For each process, a fixed point in this "phase space" is a point on the slope-1 line through the origin. If the waiting time evolves, a process will move through the first quadrant; if the waiting time increases over time (e.g., due to desynchronization), the path of a process will rise along the diagonal, moving further up and to the right. Color coding in the point cloud, from early (blue) to late (yellow), helps to visualize how processes move. We choose two different types of analysis.
In the snippet view, only a small part (e.g., 1000 iterations) of the data is visualized per plot; separate plots are used to show the long-term temporal evolution (initial, mid, and end phases in Figure 7). In Figures 7(a)-(c), after the initial in-sync phase, the cloud gets spread out. Asymptotically, we identify multiple weak and strong clusters (smaller and bigger basins of attraction for the observable). Stronger or weaker clustering along the diagonal expresses how much the observable fluctuates around a "steady-state" value. In the example shown, all but one socket (blue points) get desynchronized. This separation of sockets should go away as time progresses in the closed-ring scenario (see Figures 7(f)-(h)), but obviously the progression is too slow to be discernible in this visualization. For PISOLVER (Figures 7(d)-(e)), the point cloud starts around the origin and remains there since this scalable code is self-synchronizing.
In the overall view (see Figure 8), the full timeline is shown for one process per plot (plotting all processes would not allow useful conclusions). Here the gradual evolution of the waiting time is hard to see since it is drowned in fluctuations. However, especially in the open-chain scenarios (Figures 8(b)-(d)) we observe structures parallel to the axes, indicating singular long delays of a few preferred lengths. These are signatures of traveling idle waves, which for a single process manifest themselves as a singular high waiting time in one time step.
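The phase-space point cloud for one rank is constructed from consecutive pairs of waiting times; a sketch with a synthetic, slowly desynchronizing timeline:

```python
import numpy as np

rng = np.random.default_rng(3)
# Timeline of one rank whose waiting time grows as it desynchronizes
timeline = np.linspace(0.1, 1.0, 5000) + 0.02 * rng.standard_normal(5000)

# Phase space points: (MPItime(t_i, r), MPItime(t_{i+1}, r))
x, y = timeline[:-1], timeline[1:]

# A steady process clusters on the slope-1 diagonal; a desynchronizing
# process moves up and to the right over time (early blue, late yellow)
early = np.column_stack((x[:100], y[:100]))
late = np.column_stack((x[-100:], y[-100:]))
```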

5 Machine learning techniques for analysis
In order to prepare the timeline data for machine learning techniques, we subtract the mean value of the MPI times across all time steps and processes of each experiment from the value at each step. This is one of many possible options for data normalization; better ones might exist. We then apply PCA [7] to the timelines of each run, using the MPI times of each process as feature vectors, and classify the projections of the feature vectors on the first two principal component vectors using clustering techniques. Finally, we validate the quality of the analysis by looking at the reconstruction error that results from using only an essential number of principal components.
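The normalization-plus-PCA pipeline can be sketched with NumPy (synthetic data with two dominant temporal patterns; in the real analysis the rows are the per-process MPI-time timelines):

```python
import numpy as np

rng = np.random.default_rng(4)
n_procs, n_iters = 16, 400

# Observable matrix: each rank's timeline mixes two temporal patterns
t = np.arange(n_iters)
pattern1 = np.sin(2 * np.pi * t / 200)
pattern2 = np.sign(np.sin(2 * np.pi * t / 50))
weights = rng.standard_normal((n_procs, 2))
X = weights @ np.vstack((pattern1, pattern2)) \
    + 0.05 * rng.standard_normal((n_procs, n_iters))

# Normalization: subtract the global mean of the experiment
Xc = X - X.mean()

# PCA via SVD: rows = observations (process timelines),
# columns = variables (iterations); PCA itself centers per variable
U, S, Vt = np.linalg.svd(Xc - Xc.mean(axis=0), full_matrices=False)
scores = U[:, :2] * S[:2]        # projections on the first two PCs
pcs = Vt[:2]                     # first two principal components
explained = S**2 / np.sum(S**2)  # explained variance per component
```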

Principal Component Analysis (PCA)
Principal component analysis projects high-dimensional data onto a lower-dimensional subspace while retaining most of the information. Ideally, the low-dimensional manifolds still retain almost all of the variance of the data set needed to identify and interpret generic behavior. Coarse features are captured by the first principal components, while the highest-frequency features are captured by the highest principal components. PCA centers the data and uses the singular value decomposition algorithm on the non-square observable matrix. Rows and columns of the input matrix correspond to observations and variables, respectively. Each row vector is the timeline of MPI times of a process; the observable values in different iterations are the coordinates in a high-dimensional space.
Principal components (eigenvectors) In order to get better insight into the governing characteristics of desynchronized execution, we analyze the essential eigenvectors. Figures 9(c, d) show the eigenvectors and how ranks contribute to the reduced number of principal components for the full run and the last 1000 iterations. In the full-run case (Fig. 9(c)), the PC1 eigenvector characterizes desynchronizing processes and thus shows large waiting times with in-between downward spikes. The PC2 eigenvector characterizes in-sync processes and shows almost no waiting time, but upward spikes (idle periods) in between. It must be noted that the PCs for end-to-end runs encompass the entire evolution of the program, including initial in-sync phases, transient states, and final, stable states. Looking at the final 1000 iterations (Fig. 9(d)), the signatures are much clearer; PC1 characterizes stable desynchronization while PC2 maps transient behavior, where a noise-induced event between iteration 600 and 700 causes processes to change from a state with small waiting times to a state with large waiting times. One can interpret this as processes within a ccNUMA domain "snapping" out of sync.
Figure 9 also indicates for Bench1B how many PCs are required to reconstruct the original data sets using only the projections. In this particular case, one component is sufficient near the end of the run, but many are required (with the first one still dominating) over the full run. Overall, the results show that more PCs are needed to explain the data variance for a more pronounced memory-bandwidth saturation on the ccNUMA domain. In contrast, the compute-bound PISOLVER has much less variance, as no typical structure exists beyond natural, noise-induced fluctuations. Further, the more revealing socket behavior in short runs is captured by a higher number of PCs compared to the asymptotic behavior in long runs.
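Reconstruction from a limited number of PCs, and the resulting error, can be sketched as follows (synthetic low-rank data; the helper name is our own):

```python
import numpy as np

rng = np.random.default_rng(5)
# Rank-3 synthetic observable matrix: three PCs suffice to explain it
X = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 400))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruction_error(k):
    # Reconstruct using only the first k PCs and their scores
    Xrec = (U[:, :k] * S[:k]) @ Vt[:k]
    return np.linalg.norm(Xc - Xrec)

# The error shrinks as more PCs are kept; rank-3 data is exact at k = 3
errors = [reconstruction_error(k) for k in (1, 2, 3)]
```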

K-means clustering
While PCA delivers insight into typical patterns, it does not allow for automatic grouping (clustering) of processes. This can be accomplished by partitioning the projections of the observations on the principal components into k clusters using k-means. Rows of the PC score matrix correspond to points, and columns correspond to variables. We use the k-means++ algorithm [12] for cluster center initialization; the clustering result is strongly dependent on the distance metric used.
Distance types Clustering quality was studied for four distance metrics (in the following, ' denotes transposition). For the squared Euclidean distance, d(x, c) = (x − c)(x − c)', each centroid c is the mean of the points in its cluster; for the city-block distance, d(x, c) = Σ_{j=1..p} |x_j − c_j|, it is the component-wise median. Here, x is a row of the PC score matrix. For the cosine distance, d(x, c) = 1 − xc'/√((xx')(cc')), each centroid is the mean of the points after normalization to unit Euclidean length; for the correlation distance, it is the component-wise mean of the points after centering and normalization to zero mean and unit standard deviation. The result is a matrix containing the k cluster centroid locations and a vector containing the cluster indices. Figure 9(b) shows a scatter plot of the essential PC scores grouped by the cluster indices of each observation in the Bench1B case; k-means uses the squared Euclidean distance here. We expect one cluster if all processes are in a fully evolved desynchronized state.
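A minimal NumPy-only sketch of k-means with k-means++ seeding and the squared Euclidean distance (the study compared several distance metrics; the function name and the synthetic two-group data are our own):

```python
import numpy as np

def kmeans_pp(points, k, n_iter=100, seed=0):
    """k-means with k-means++ seeding and squared Euclidean distance;
    each centroid is the mean of its assigned points (minimal sketch)."""
    rng = np.random.default_rng(seed)
    # k-means++ initialization: pick centers with probability
    # proportional to the squared distance from existing centers
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers],
                    axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each point (row of PC scores) to its nearest centroid
        d2 = ((points[:, None, :] - centers[None]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        # Update centroids (keep the old one if a cluster runs empty)
        centers = np.array([points[idx == j].mean(axis=0)
                            if np.any(idx == j) else centers[j]
                            for j in range(k)])
    return centers, idx

# Cluster 2D PC scores of the ranks (synthetic example: two groups)
rng = np.random.default_rng(6)
scores = np.vstack((rng.normal(0, 0.1, (20, 2)),
                    rng.normal(3, 0.1, (20, 2))))
centers, idx = kmeans_pp(scores, k=2)
```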
Number of observables per cluster The number of clusters k is chosen such that all dissimilar groups are assigned to distinct clusters, while the number of observables belonging to each cluster can differ significantly. Figures 9(f, h) show bar charts of the cluster indices, sorted into k bins. We choose k equal to the number of ccNUMA domains. The x-axis indicates the cluster IDs and the y-axis shows the number of samples.
Validation of clustering quality The quality of the PCA can be assessed by calculating the error between the original signal and a reconstruction from fewer PCs. To this end, one reconstructs the signal by multiplying the scores with the first two PCs and summing up the contributions. The reconstruction should be very close to the original signal if the reconstruction error (using the Euclidean norm) is less than some threshold value. We get a higher reconstruction error for the high-frequency signal of the PISOLVER case, as expected. In Figures 9(g, i), we performed a silhouette analysis [8] to quantify the quality of the clustering. A highly representative clustering is associated with a large positive coefficient close to one, indicating that a point is well matched to the other points in its own cluster and poorly matched to other clusters. On the other hand, a negative coefficient indicates a bad clustering. While exploring the influence of distance metrics, it turned out that cosine is the best-suited and city-block the worst-suited distance metric.
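The silhouette coefficient can be computed directly from its definition (a NumPy-only sketch on synthetic data; production code would typically use a library implementation):

```python
import numpy as np

def silhouette(points, idx):
    """Mean silhouette coefficient: for each point, a = mean distance
    to its own cluster, b = smallest mean distance to another cluster,
    s = (b - a) / max(a, b), which lies in [-1, 1]."""
    d = np.linalg.norm(points[:, None] - points[None], axis=2)
    s = []
    for i in range(len(points)):
        own = idx == idx[i]
        own[i] = False  # exclude the point itself
        a = d[i, own].mean() if own.any() else 0.0
        b = min(d[i, idx == j].mean() for j in set(idx) - {idx[i]})
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

# Two well-separated groups -> silhouette close to 1
rng = np.random.default_rng(7)
pts = np.vstack((rng.normal(0, 0.1, (15, 2)),
                 rng.normal(5, 0.1, (15, 2))))
labels = np.array([0] * 15 + [1] * 15)
score = silhouette(pts, labels)
```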

6 Summary and future work
Key takeaways We have presented expressive data analytics techniques for investigating the dynamics of MPI-parallel programs with regular compute-communicate cycles. We consider MPI waiting time per time step and process as a good observable metric since it encompasses much of the relevant dynamics in a very condensed format. Our new "phase space" analysis based on this data provides an efficient, visual way to observe the evolution of a program from its initial, synchronized state into a desynchronized state. However, it is not strictly a data analytics technique since it involves manual inspection of the data (moving dot clouds). PCA and subsequent k-means clustering allow for a more automated analysis, providing feature extraction, i.e., typical timeline behavior, as well as grouping of MPI processes into clusters with similar features. Hence, these methods could pave the way towards advanced job-specific monitoring of production jobs on clusters. We have also found that the analysis is more expressive when applied to snippets of the timeline in order to avoid mixing different characteristics. If one is interested in an evolved state only, the final iterations of a run are most relevant.
Future work We are convinced that PCA applied to MPI waiting time data allows the investigation of unknown applications by mapping their temporal evolution to principal components found in prototypical benchmark runs. It is still an open question how to choose these benchmarks to extract relevant, distinguishable patterns that real applications can be tested against. It will also be necessary to investigate how the waiting time metric should be normalized to be as generic as possible. Furthermore, we plan to apply the demonstrated techniques to a wider spectrum of real applications in order to fathom their true scope of applicability.