Determination of the optimal shape of matrix elements partitioning on three abstract heterogeneous processors

Abstract
The paper presents the results of a study to find the optimal shapes for partitioning matrix elements among three abstract heterogeneous processors when performing matrix multiplication. An abstract processor model makes the research results applicable to systems with different heterogeneous architectures. To determine the optimal partitioning shape, the work uses the candidate shapes, including non-rectangular ones, that Ashley DeFlumere identified in her work by applying the "Push" technique, which redistributes matrix elements between the processors: Square Corner, Rectangle Corner, Square Rectangle, Block Rectangle, L-Rectangle and Traditional 1D Rectangular. The optimality of the shapes is determined for four classes of matrix multiplication algorithms: Serial Communication with Barrier (SCB), Parallel Communication with Barrier (PCB), Serial Communication with Bulk Overlap (SCO) and Parallel Communication with Overlap (PCO). The Hockney model was used to evaluate the communication complexity of the algorithms. Mathematical models of the algorithm execution time are introduced for each candidate shape under each algorithm. Based on these models, software was developed that selects the shape for partitioning elements among the processors depending on the ratio of their speeds and the latency of the transmission medium.


PUBLIC INTEREST STATEMENT
The paper presents the results of a study to find the optimal shapes for partitioning matrix elements among three abstract heterogeneous processors when performing matrix multiplication. An abstract processor model makes the research results applicable to systems with different heterogeneous architectures. To determine the optimal partitioning shape, the work uses the candidate shapes, including non-rectangular ones, that Ashley DeFlumere identified by applying the "Push" technique, which redistributes matrix elements between the processors: Square Corner, Rectangle Corner, Square Rectangle, Block Rectangle, L-Rectangle and Traditional 1D Rectangular. The optimality of the shapes is determined for four classes of matrix multiplication algorithms: Serial Communication with Barrier, Parallel Communication with Barrier, Serial Communication with Bulk Overlap and Parallel Communication with Overlap. The Hockney model was used to evaluate the communication complexity of the algorithms. Mathematical models of the algorithm execution time are introduced for each candidate shape under each algorithm.

Introduction
The amount of processed information and the complexity of computational tasks grow every year. As a result, research on parallel data processing has become increasingly popular all over the world.
Parallel matrix multiplication is one of the most common tasks in various fields of science. In particular, such tasks include working with connectivity and correspondence matrices of information, engineering and transport networks, which hold the main information about objects and the continuity of flows and, at the same time, are large and contain dynamically changing elements.
According to the Top500 list of November 2019, the maximum computing speed of 200 petaflops was reached by the Summit supercomputer developed by IBM for Oak Ridge National Laboratory. Summit is used in astrophysics, cancer research and systems biology; it consists of 4,608 servers (nodes) and 2,397,824 cores.
While in the early days of high-performance computing the top lines of the rating were mainly occupied by supercomputers built from identical elements, there is now a steady trend towards more efficient heterogeneous cluster solutions.
While the models and methods for distributing computing resources when multiplying matrices on homogeneous systems have been studied well enough, the search for optimal solutions for heterogeneous systems remains an open research problem.
The partitioning of matrix elements among computational elements when performing linear algebra operations is performed for the purpose of reducing the total computation time.
At the first stages of scientific research in this area, rectangular data partitioning shapes were defined and studied in detail (Beaumont et al., 2001; Clarke et al., 2012; Lastovetsky, 2007). Later works (DeFlumere, 2014; DeFlumere & Lastovetsky, 2014) showed that non-rectangular partitioning shapes can also be optimal for certain ratios of the computational power of the processors. However, these works did not take into account differences in the latency of the data transmission networks between computing elements.
The objective of this study is to determine the optimal shapes for matrix partitioning in heterogeneous systems with different latencies between computational elements. A mathematical model is built for four classes of parallel matrix multiplication algorithms: Serial Communication with Barrier (SCB), Parallel Communication with Barrier (PCB), Serial Communication with Bulk Overlap (SCO) and Parallel Communication with Overlap (PCO). To construct a mathematical model of computation, the notion of an "abstract processor", considered in (Zhong et al., 2011, 2012), is used. Earlier, DeFlumere & Lastovetsky (2014) showed that an abstract processor model based mainly on the volumes of communication and computation accurately predicts the experimental performance of many processors, and even entire clusters, for matrix computations.
The work was done as part of research on the project АР05133699 "Research and development of innovative telecommunication technologies using modern cyber-technical means for the city's intelligent transport system" of the Institute of ICT of the MES RK.

Materials and research methods
In constructing the models, several assumptions were made: square source matrices A and B and the resulting matrix C, each of dimension N × N, are used; the matrix multiplication is simulated on three abstract heterogeneous processors with a fully connected topology; the computing power of the abstract processors P, R and S is given by the ratio $P_r : R_r : S_r$, where P is the most powerful processor and $S_r = 1$, so that the total processing power of the three-processor system is $T = P_r + R_r + S_r$; matrix elements are divided among the processors in proportion to their computing power; the transmission network latency is $\beta_1$ between processors P and S, $\beta_2$ between processors P and R, and $\beta_3$ between processors S and R; optimality is evaluated for the following candidate shapes: Square Corner (SC), Rectangle Corner, Square Rectangle (SR), Block Rectangle (BR), L-Rectangle (LR) and Traditional 1D Rectangular.
The communication complexity of the considered algorithms is estimated using the Hockney model:
$$T_{comm} = \alpha + \beta M \quad (1)$$
where $\alpha$ is the latency, or message delay, measured in seconds; $\beta$ is the reciprocal of the bandwidth, that is, the time to transmit one element, measured in seconds; and $M$ is the message size, or number of elements to send.
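As a minimal illustration of two ingredients of this setup, the sketch below evaluates the Hockney cost of a single message and splits the N × N elements in proportion to the power ratio. The function names `hockney_time` and `element_shares` are hypothetical and are not taken from the authors' software.

```python
def hockney_time(alpha, beta, message_size):
    """Hockney cost of one message: latency plus per-element time times size."""
    return alpha + beta * message_size


def element_shares(n, p_r, r_r, s_r=1.0):
    """Split the n*n matrix elements in proportion to the power ratio.

    Returns the (possibly fractional) element count for processors P, R and S.
    """
    total = p_r + r_r + s_r  # total processing power T = P_r + R_r + S_r
    area = n * n
    return {
        "P": area * p_r / total,
        "R": area * r_r / total,
        "S": area * s_r / total,
    }
```

For example, with a power ratio of 2:1:1 and N = 4, `element_shares(4, 2, 1)` assigns 8, 4 and 4 elements to P, R and S.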

Finding of the optimal data partitioning for the serial communication with barrier algorithm
The Serial Communication with Barrier (SCB) algorithm is a simple matrix multiplication algorithm in which the processors send all data serially, and the calculations, which run in parallel, begin only after all processors have completed their data transfers (Figure 2).
The execution time of the algorithm consists of the communication time between the processors and the time required to perform the calculations, as expressed by formula (2), where $V$ is the total number of communications and $c_{P_X}$ is the time required by processor X to calculate its assigned part of matrix C.
Because the area of the matrix and the number of elements allocated to each processor are proportional to its computing power, the parameter $c_{P_X}$ is the same for all candidate shapes considered and can be ignored when comparing the execution times of the algorithm.
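The barrier structure described here can be sketched as follows. The sketch assumes that the latency term of the Hockney model is neglected and that each transfer uses one of the three links with its own per-element time. The per-link volumes are placeholders supplied by the caller, since they depend on the shape, and `scb_time` is a hypothetical name.

```python
def scb_time(link_volumes, link_betas, compute_times):
    """Serial Communication with Barrier (illustrative sketch).

    All transfers happen one after another, so their times add up;
    afterwards every processor computes in parallel, so the slowest
    processor determines the computation phase.
    """
    t_comm = sum(v * b for v, b in zip(link_volumes, link_betas))
    return t_comm + max(compute_times)
```

Because `max(compute_times)` is the same for every shape under proportional partitioning, comparing shapes reduces to comparing the `t_comm` term.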
The next step is to define the communication time $T_{comm}$ for each candidate shape according to formula (2).
Square Corner: where $r$ is the side of the square assigned to processor R and $s$ is the side of the square assigned to processor S.
Block Rectangle:
Traditional 1D Rectangular:
For the Rectangle Corner shape, the optimal sizes of R and S would give a combined width of N, which is not permitted by the classification of candidate shapes (DeFlumere & Lastovetsky, 2014, p. 78). Instead, we set $R_w + S_w = N - 1$. Then,
Rectangle Corner: where $R_h$, $R_w$ are the height and width of the rectangle assigned to processor R, and $S_h$, $S_w$ are the height and width of the rectangle assigned to processor S.
To facilitate the analysis of the mathematical model, we eliminate the coefficient $\beta_3$ by working with the ratios $\beta_1/\beta_3$ and $\beta_2/\beta_3$ of the throughputs between the computational elements. Thus, we get four variables: $P_r$, $R_r$, $\beta_1/\beta_3$ and $\beta_2/\beta_3$.
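One way to see why removing $\beta_3$ does not change the comparison is to factor it out of the serial communication time. The symbols $V_1$, $V_2$, $V_3$ below denote the volumes sent over the links with costs $\beta_1$, $\beta_2$, $\beta_3$; this notation is introduced only for this illustration and latency is neglected.

```latex
T_{comm} = V_1\beta_1 + V_2\beta_2 + V_3\beta_3
         = \beta_3\left(V_1\frac{\beta_1}{\beta_3} + V_2\frac{\beta_2}{\beta_3} + V_3\right)
```

Since $\beta_3 > 0$ is a common factor for every shape, the ordering of the shapes by communication time depends only on the two ratios.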

Finding of the optimal data partitioning for the parallel communication with barrier algorithm
The Parallel Communication with Barrier (PCB) algorithm assumes that data are transferred between the processors in parallel and that the processors begin their parallel calculations only after all communication has completed (Figure 3).
The execution time of the algorithm is evaluated using formula (3), where $V_{P_X}$ is the amount of data to be transferred by processor X.
As with the SCB algorithm, the parameter $c_{P_X}$ can be left out of the formulas; the optimality of the shapes is determined based on the communication time $T_{comm}$.
Square Corner:
Figure 3. The parallel communication with barrier algorithm for three-processor systems.
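A minimal sketch of the parallel-communication barrier model follows. It assumes that each processor sends its own messages one after another while the three processors communicate concurrently, and that latency is neglected. The name `pcb_time` and the input layout are hypothetical.

```python
def pcb_time(sends, compute_times):
    """Parallel Communication with Barrier (illustrative sketch).

    sends:         maps a processor name to a list of (volume, beta) pairs,
                   one pair per outgoing message
    compute_times: maps a processor name to its computation time
    """
    # Each processor's sending time; processors send concurrently,
    # so the longest one bounds the communication phase.
    per_processor = [sum(v * b for v, b in msgs) for msgs in sends.values()]
    return max(per_processor) + max(compute_times.values())
```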

Finding of the optimal data partitioning for the serial communication with overlap algorithm
In the Serial Communication with Overlap (SCO) algorithm, the processors transmit all data serially while, in parallel, the elements that do not take part in the exchange are computed.
The remaining operations are performed only after both the communication and the overlapped computations have completed (Figure 4).
The algorithm execution time is calculated by formula (4), where $O_{P_X}$ is the time spent by processor X computing the elements that do not require communication, and $C_{P_X}$ is the time spent by processor X computing the remaining elements.
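One common way to express this kind of overlap is sketched below. It assumes that each processor starts its remaining work once both the communication and its own overlapped computation have finished, and that the slowest processor determines the total. This is an assumption made for illustration, not a reproduction of formula (4), and `overlap_time` is a hypothetical name.

```python
def overlap_time(t_comm, overlap, remainder):
    """Execution time with communication overlapped by computation (sketch).

    t_comm:       total communication time (serial for SCO)
    overlap[x]:   time processor x needs for elements requiring no communication
    remainder[x]: time processor x needs for the remaining elements
    """
    return max(max(t_comm, overlap[x]) + remainder[x] for x in overlap)
```

With `t_comm = 10`, overlap times of 12, 4 and 2 and remainder times of 3, 6 and 1 for P, R and S, the per-processor finishing times are 15, 16 and 11, so the modelled execution time is 16.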
Let us determine the execution time of the algorithm $T_{exe}$ for each shape considered according to formula (4).
The algorithm execution time for the Square Corner shape is determined as follows:
The algorithm execution time for the Square Rectangle shape equals:
Since matrix elements are distributed among the processors in proportion to their computational power, the values $\frac{P_r N^3}{T S_p}$, $\frac{R_r N^3}{T S_r}$ and $\frac{N^3}{T S_s}$ are the same for all partitioning shapes, and in the remaining formulas they are denoted $C_P$, $C_R$ and $C_S$, respectively.
The algorithm execution time for the Block Rectangle shape equals:
The algorithm execution time for the L-Rectangle shape is determined as:
The algorithm execution time for the Traditional 1D Rectangular shape is as follows:
For the Rectangle Corner shape, the optimal sizes of R and S would give a combined width of N, which is not permitted by the classification of candidate shapes (DeFlumere, 2014, p. 78). Instead, let us set $R_w + S_w = N - 1$. Then,
To simplify the analysis of the resulting mathematical model, let us eliminate the coefficient $\beta_3$ by working with the ratios $\beta_1/\beta_3$ and $\beta_2/\beta_3$ of the bandwidths between computational elements. Thus, we get four variables: $P_r$, $R_r$, $\beta_1/\beta_3$ and $\beta_2/\beta_3$.

Finding of the optimal data partitioning for the parallel communication with overlap algorithm
The Parallel Communication with Overlap (PCO) algorithm performs all communications in parallel and simultaneously calculates any sections of matrix C that do not require interprocessor communication. As soon as the communication is completed, the remaining computations are performed (Figure 5).
The execution time of the algorithm is evaluated using formula (5), where $T_{comm}$ is equal to the communication time of the Parallel Communication with Barrier (PCB) algorithm; $O_{P_X}$ is the time spent by processor X computing the elements that do not require communication; and $c_{P_X}$ is the time spent by processor X computing the remaining elements.
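The combination of parallel communication and overlap can be sketched as follows. The sketch assumes that each processor sends its messages one after another while processors communicate concurrently, that latency is neglected, and that a processor starts its remaining work once communication and its own overlapped computation are both done. The name `pco_time` and the input layout are hypothetical.

```python
def pco_time(sends, overlap, remainder):
    """Parallel Communication with Overlap (illustrative sketch).

    sends:        maps a processor name to a list of (volume, beta) pairs
    overlap[x]:   time processor x needs for elements requiring no communication
    remainder[x]: time processor x needs for the remaining elements
    """
    # Communication phase as in the parallel barrier case.
    t_comm = max(sum(v * b for v, b in msgs) for msgs in sends.values())
    # The remaining work starts after both communication and overlap finish.
    return max(max(t_comm, overlap[x]) + remainder[x] for x in overlap)
```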
Let us construct computation models for each candidate shape according to formula (5).
Square Corner:
Square Rectangle:
Figure 5. The parallel communication with overlap algorithm for three-processor systems.
Block Rectangle:
L-Rectangle:
For the Traditional 1D Rectangular and Rectangle Corner shapes, only the communication time between the elements is needed to evaluate the total execution time.
Traditional 1D Rectangular:
Rectangle Corner:
By analogy with the SCO algorithm, let us eliminate the coefficient $\beta_3$ by working with the ratios $\beta_1/\beta_3$ and $\beta_2/\beta_3$ of the bandwidths between computational elements.

Research results
Based on the constructed mathematical model, software was developed in the Java programming language that accepts the initial parameters (P, R, $\beta_1/\beta_3$, $\beta_2/\beta_3$) in text form or via a web interface and determines the theoretically optimal data partitioning shape (Figures 6-8).
The analysis of the calculations obtained from the constructed mathematical model leads to the following conclusions. The BR, LR, SC and SR partition shapes are optimal for the Serial Communication with Barrier algorithm. The LR shape is optimal for $\beta_1/\beta_3 < 1$ and roughly equal powers of processors P and R that significantly exceed the power of processor S. The SR shape is optimal for small values of the coefficients $\beta_1/\beta_3$ and $\beta_2/\beta_3$ and powers of processors P and R that significantly exceed the power of processor S. In all other cases, the BR and SC shapes are optimal.
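The selection step of such a tool can be sketched as picking the shape with the smallest modelled time. The sketch below is in Python rather than Java, and the two cost functions are arbitrary placeholders that only show the expected call signature; they are not the models from the paper. The names `select_shape`, `toy_models`, `b13` and `b23` (standing for the two bandwidth ratios) are hypothetical.

```python
def select_shape(models, **params):
    """Return the shape with the smallest modelled time, plus all times."""
    times = {shape: model(**params) for shape, model in models.items()}
    best = min(times, key=times.get)
    return best, times


# Placeholder cost functions, NOT the paper's models.
toy_models = {
    "BR": lambda p, r, b13, b23: b13 + b23,
    "SC": lambda p, r, b13, b23: 2.0 * b13,
}

best, times = select_shape(toy_models, p=5.0, r=3.0, b13=0.5, b23=0.25)
```

In a real tool each entry of the dictionary would be the execution-time model of one candidate shape for the chosen algorithm.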
As with the SCB algorithm, only the BR, LR, SC and SR shapes can be optimal for the Parallel Communication with Barrier algorithm. LR is optimal only for $\beta_1/\beta_3 = 0.1$, $\beta_2/\beta_3 < 1$ and roughly equal powers of processors P and R that are much larger than the power of S. SR is optimal for $\beta_1/\beta_3$ 2 and computing powers of processors P and R that are larger than that of processor S. In other cases, depending on the selected parameters, the BR and SC shapes are optimal.
For the Serial Communication with Overlap algorithm, as for the previous algorithms, the optimal matrix partitioning shapes may be BR, LR, SC and SR. The SR and SC shapes may be optimal when the power of processor P significantly exceeds the power of processors R and S. For the SR shape, this holds only for $\beta_1/\beta_3 \geq 2$. The LR shape is optimal for $\beta_1/\beta_3 < 1$ and approximately equal powers of processors P and R that significantly exceed the power of processor S. In all other cases, the optimal shape is BR.
For the Parallel Communication with Overlap algorithm, the optimal partitioning shapes may likewise be only BR, LR, SC and SR. The LR shape is optimal only for $\beta_1/\beta_3 < 0.6$, $\beta_2/\beta_3 \leq 1$ and approximately equal powers of processors P and R. The SR shape may be optimal for $\beta_1/\beta_3$ 2 and powers of processors P and R that significantly exceed the power of processor S. The SC shape is optimal when the power of processor P significantly exceeds the power of processors R and S and the values $\beta_1/\beta_3$, $\beta_2/\beta_3$ ! 1. In other cases, depending on the chosen parameters, the optimal shape is BR.

Conclusion
Based on the results of the calculations, it can be concluded that, as in the case of three-processor systems with the same bandwidth between computational elements, the Rectangle Corner and Traditional 1D Rectangular data partitioning shapes are not optimal for any set of parameters.
In contrast to the results obtained in (DeFlumere, 2014, p. 85, 88), the L-Rectangle shape is optimal when the powers of processors P and R greatly exceed the power of processor S and are approximately equal to each other, for $\beta_1/\beta_3 < 1$ in the SCB algorithm and $\beta_1/\beta_3 = 0.1$ in the PCB algorithm. However, in these cases there is no sense in using the low-power processor S, and the task can be solved on two processors. The Square Rectangle shape is optimal for small values of the coefficients $\beta_1/\beta_3$ and $\beta_2/\beta_3$ when the powers of processors P and R significantly exceed the power of processor S. With such initial data, the problem can also be solved using two powerful processors.
In other cases, the Square Corner and Block Rectangle shapes are optimal for both algorithms, depending on the values of the coefficients $P_r$, $R_r$, $\beta_1/\beta_3$ and $\beta_2/\beta_3$. The L-Rectangle shape may be optimal with approximately equal powers of processors P and R that significantly exceed the power of processor S, for $\beta_1/\beta_3 < 1$ in the SCO algorithm and for $\beta_1/\beta_3 < 0.6$, $\beta_2/\beta_3 \leq 1$ in the PCO algorithm. But in this case the problem may be solved by two processors, without engaging processor S.
The Square Rectangle shape may be optimal in both cases when the powers of processors P and R significantly exceed the power of processor S. With such input data, the problem may also be solved using two powerful processors.
In most cases for both algorithm classes, it is preferable to use the Block Rectangle shape.
The developed software can be used to determine the optimal shape for partitioning matrix elements given the parameters of a real computing system. The results can be used to implement domain-specific algorithms, for example, for the analysis of large volumes of data presented in the form of matrices, as well as for the analysis of urban structures in modern metropolitan areas, in particular for determining the reachability between city districts.