Comparison among Performance Measures for Parallel Matrix Multiplication Algorithms

In this study we analyze how to make a proper selection for the given matrix-matrix multiplication operation and to decide which is the best suitable algorithm that generates a high throughput with a minimum time, a comparison analysis and a performance evaluation for some algorithms is carried out using the identical performance parameters.


INTRODUCTION
Most of the parallel algorithms for matrix multiplication use matrix decomposition that is based on the number of processors available.This includes the systolic algorithm (Choi et al., 1992), Cannon's algorithm (Alpatov et al., 1997), Fox's and Otto's Algorithm (Agarwal et al., 1995), PUMMA (Parallel Universal Matrix Multiplication) (Choi et al., 1994), SUMMA (Scalable Universal Matrix Multiplication) (Cannon, 1969) and DIMMA (Distribution Independent Matrix Multiplication) (Chtchelkanova et al., 1995).The standard method for n×n matrix multiplication uses O (n 3 ) operations (multiplications).Most of the parallel algorithms are parallelization of this method.For instance, matrix multiplication can be performed in O (n) time on a mesh with wraparound connections and n×n processors (Cannon, 1969); in O (log n) time on a three dimensional mesh of trees with n 3 processors (Leighton, 1992); in O (log n) time on a hypercube or shuffle exchange network with n 3 processors (Dekel et al., 1983).On the other hand, all implementations have a cost, i.e., time processor product of at least O (n 3 ).Therefore, the aim is to develop highly parallel algorithms that have the cost less than O (n 3 ).
Extensive work has been done for constructing algorithms that require O (n β ) operations, such that β<3.The research world was stunned with O (n 2.81 ) algorithm for matrix multiplication (Strassen, 1969).A recursive application gives an algorithm that runs in time (Strassen, 1969).Following Strassen's work, a series of studies published among the years 1970 s and 80s Achieved better upper bounds on the exponent.Currently, the best matrix multiplication algorithm takes O (n β ) operations, where β<2.375477 (Coppersmith and Winograd, 1990).However, these algorithms are highly sophisticated even for sequential execution; parallelization of these methods are only of theoretical interest and far from practical (Leighton, 1992;Robinson, 2005;Coppersmith and Winograd, 1990).Therefore, from practical point of view, parallelization of standard matrix multiplication algorithm with O (n 3 ) multiplications is more appropriate.The objective of this study is to investigate which parallel method is more appropriate and how to optimize the existing methods in order to obtain more efficient parallelization.

METHODOLOGY
Parallel computing paradigms: Parallel computing process depends on how the processors are connected to the memory.There are two different ways how this can be done: shared memory system and distributed memory.
In a shared memory system, a single address space exists; within it, to every memory location is given a unique address and the data stored in the memory are accessible to each processor.In such a system the processor P i reads the data written by the processor P and it is necessary to use synchronization.
In a distributed memory system, each processor can only access to its own memory (local memory).The connection between processors is done by the so called "high-speed communication network" and they exchange information using send and receive operations.
MPI (Message Passing Interface) is useful for a distributed memory systems since it provides a widely used standard of message passing program.In MPI, data is distributed among processors, where no data is shared and data is communicated by message passing.
Matrix multiplication algorithms as a nested loop algorithms are suitable for such kind of systems.

Systolic array for matrix multiplication:
A systolic array is a computing network possessing with the features of synchrony (data are rhythmically computed and passed through the network), modularity and locality, special locality (a locally communicative interconnection structure), pipeline ability, repeatability (usually the repetition of one single type of cell and interconnection occurs).By the systolic arrays it can be achieve a high parallelism.Systolic arrays are very suitable for nested-loop algorithms; such is the problem of parallel solution of matrix-matrix multiplication (Bekakos, 2001;Snopce and Elmazi, 2008a;Gusev and Evans, 1994).
The simplest systolic arrays are the chain and the ring, which in fact are one dimensional array.In the m×n two dimensional systolic array the nodes are defined by integer points in a rectangle {(i, j), 0≤i≤m, 0≤ j≤n}.A node (i, j) is connected with nodes (i±1, j) and (i, j±1), if they are in the rectangle.If we add diagonal edges ((i, j), (i±1, j+1)) to the two dimensional array, then we get the so called hexagonal systolic array.By adding the edges ((0, j), (m-1, j)) and ((i, 0), (i, n-1)) to an m×n array, we get a two dimensional torus.There are given some models of systolic arrays in Fig. 1.
The simplest case of systolic array which can be used for parallel matrix multiplication is similar as the array shown in the Fig. 1c.The systolic array is quadratic 2-D such that the elements of a matrix A are fed into the array by the rows and the elements of the matrix B are fed into the array by the columns.The array consists of n×n PEs.Each PE indexed by i and j, in each step k, adds a partial product a ik b kj to the accumulated sum for every c ij .Hence, we have a movement in two directions.The elements of the The matrices A and B are taken to be quadratic of order 3.

Why systolic array?
The MPI technique needs two kinds of time to complete the multiplication process, t c and t f , where t c represents the time it takes to communicate one data between processors and t f is the time needed to multiply or add elements of two matrices.It is assumed that matrices are of type n×n.
The other assumption is that the number of processors is p.Each processor holds n 2 /p elements and it was assumed that n 2 /p is set to a new variable m 2 .The number of arithmetic operations units will be  denoted by f.The number of communication units will be denoted by c.The quotient q = f/c will represent the average number of flops per communication access.The speedup is S = q.(t f /t c ) (the definitions of some special performance characteristics are given in the next subchapter).In order to apply the analytical variables, we need to make some assumptions.Firstly we assume that the number of processors is p = 4.The dimension of the matrix is n = 600.The last assumption is that t f /t c = 0.1.Also we need to use the Efficiency formula E = S/p.In the Table 1 we record the results that we obtain for all algorithms, under the assumptions that we made (Alpatov et al., 1997;Agarwal et al., 1995;Choi et al., 1994;Alqadi et al., 2008).
From Table 1, after analyzing the theoretical results, one can conclude that the systolic algorithm will give the best performance (i.e., efficiency).The efficiency increases as the parameter S approaches the number of processors.
The experimental results (Alqadi and Amjad, 2005) for the execution time using 1 processor and using 4 processors are shown in the Table 2.The table is obtained by implementing each algorithm 100 times and then, ten fastest five ones are taken and averaged.On the other hand, taking into the consideration that S = T (1) /T (p) (which means that speedup is equal to the time using 1 processor dividing with the time using p processors), we give the Table 3 with the values of S and E.
The analysis explained above has shown that the systolic algorithm may be considered as the best algorithm that produces a high efficiency.The theoretical analysis matches with the experimental results performed in Alqadi et al. (2008).This analysis is useful for making a conclusion and giving a recommendation for selecting the best algorithm among others which is more effective for parallel solving of some linear algebra problems, including the problem of matrix-matrix multiplication.

MATHEMATICAL TOOLS
Definition 1: The array size (Ω) is the number of PEs in the array.
Definition 2: The computation Time (T) is the sum of the time for the date input in the array-T in , the time for the algorithm executing-T exe and the time necessary for dates leaving the array-T out , i.e.: (2) Theorem 1: Zhang et al. (1994): The execution time is given by the relation: Definition 4: The Pipelining period (α): The time interval between two successive computations in a PE.If λ is a scheduling vector and u is a projection direction, then the pipelining period is given by the relation: In some places (for the three nested loops algorithms), for the pipeline period it is used the formula: ( ) ( ) where, T is a transformation matrix and T li , i = 1, 2, 3 is (1, i) co-factor (minor) of matrix T and gcd (T 11 , T 12 , T 13 ) is the greatest common divisor of T 11 , T 12 and T 13 .
Definition 5: The geometric area (g a ) of a twodimensional systolic array is the area of the smallest convex polygon which bounds the PEs in the (x, y)plane.The geometric area is given by the formula: Definition 6: The Speedup (S) of a systolic array is the ratio of the processing time in the SA to the processing time in a single processor (T 1 ), i.e.: Definition 7: The Efficiency (E) is defined as the ratio of the speedup to the number of PEs in the array i.e.: Theorem 2: Bekakos (2001) and Zhang et al. (1994): The number of processors on SHSA array is (the array where we have no using the linear transformation): Theorem 3 (Snopce and Elmazi, 2008b): The number of processing elements in 2-dimensional systolic array for the algorithm of matrix-matrix multiplication for which is used the projection direction u = [1 1 1] T , could be reduced and given with Ω = N 2 .
Theorem 4 (Gusev and Evans, 1994): The number of PEs for the systolic array which is constructed by using the nonlinear transformation the number of PEs is given by the relation: Let P ind = {(i, j, k) /1≤ i, j, k≤N} be the index space of the used and computed data for matrix multiplication.The purpose is to implement the mapping ( ) ( ) , using an appropriate form for the matrix T.There are different options of such matrix T (Bekakos, 2001).
Definition 8: The linear transformation matrix T is the matrix: Definition 9: A matrix T is associated to the projection direction u = [u 1 u 2 u 3 ] T (there are some possible allowable projection vectors, Bekakos (2001), so that the following conditions must be satisfied: • The entries have to be from the set {-1, 0, 1} Definition 10: The transformation matrix T maps the index point (i, j, k) ∈ P ind into the point (t, x, y) ∈ T. P ind , where, P ind is the set of index points such that: In this case t is the time where calculations are performed and (x, y) are the coordinates of processor's elements on the 2-dimensional systolic array.
Definition 11: For the algorithm of matrix-matrix multiplication the transformation which is used for obtaining the new index space is defined with H = T○L 1 where,

RESULTS AND DISCUSSION
The advantage of using the linear transformation in designing the systolic array for matrix multiplication: If one uses theorem 2, then it is possible to find the number of PEs in the corresponding systolic array.For example, if N = 4 then Ω = 37 (Fig. 3).In the case of the systolic array constructed using the properties of theorem 3, the number of processors (Fig. 4) is Ω = 4 2 = 16.In Fig. 3 and 4 these two arrays are represented.The second array has less number of PEs and it is one of the advantages of using the linear transformation in constructing the systolic arrays.So, we can conclude how the number of processors on the array can be reduced using the linear transformation.For N = 4 is 16 vis-à-vis 37 without using the transformation.On the Table 4 we give the comparison for number of PE for different values of N.This information is taken from Bekakos (2001).
For the pipeline period, in the case of the array in Fig. 3, using relation (4) we get: This means that every computation in each processor is performed after the time interval of three units (after every third step).
If one uses the relation ( 5), then the same result will be obtained.To confirm this, we give the transformation matrix used in this case: For this matrix we have the following results: det (T) = 3, T 11 = 1, T 12 = -1, T 13 = 1.Thus, using the relation ( 5), the same result as in (12) will be obtained.
If this formula is used for the systolic array which is constructed by using the linear transformation matrix (the array in Fig. 4), we get: From the relations (2) and ( 14) the execution time may be obtained: If one uses the fact that if N 1 = N 2 = N 3 , then T exe = 3N-2.On the other hand T in = T out = N-1, therefore the computation time using (1) will be: In the case of the array in Fig. 4 the execution time will be ordered by using the theorem 1 which is associated by the relation (3): If N 2 = N 3 then T exe = 2N-1.In this case T in = N-1 and T out = 0, therefore the computational time is: In the Fig. 5 the systolic array using nonlinear transformation is presented (Gusev and Evans, 1994).For this array (Fig. 5), we have that T exe = 3N-1, T in = N-1 and T out = 0. Therefore the computation time is: For the geometric area in the case of the array in Fig. 3 we have: In the case of the array with nonlinear transformation, one can calculate the geometric area in a similar way as above:   Since the duration of matrix multiplication on a system with only one processor is T 1 = N 3 , the speedup and efficiency in the case of the array in Fig. 3, using relations ( 7) and (8), will be respectively: Using the results obtained by the relations (16-28), as well as theorems 1, 2 and 3, one can construct the corresponding table, where all the results can be compared.In Table 5 it is given a comparison of performance characteristics for some values of N.

CONCLUSION
In this study we make the comparison concerning some performance measures for parallel matrix multiplication.We emphasized the systolic approach as most efficient.We can conclude that using the identical performance parameters, for each parameter, the array which is constructed using linear transformation matrix has better performances.Especially for the efficiency when N tends to the infinity we have that it is approximately five times better than the array without using the linear transformation.From the Table 5 we can deduce the advantage of using the linear transformation.The results from the Table 5 are done for N = 4, N = 10 and N = 100.The table can be enlarged for other values of N as well.In all the cases, the results by using L are better.

Fig. 1 :
Fig. 1: Typical models of systolic arrays, (a) linear systolic array, (b) 1-D ring, (c) 2-D square array, (d) 2-D hexagonal array, (e) 2-D torus resulting matrix C are stationary.In Fig. 2 are shown just the first two cycles of such quadratic systolic array.The matrices A and B are taken to be quadratic of order 3.

Fig. 2 :
Fig.2: The first two cycles (7 in total) for the 2-D systolic computation of 3×3 matrix product The execution time, T exe , is defined as:

Table 1 :
Theoretical results T 1 = [t 11 , t 12 , t 13 ] is the scheduling vector (time schedule of the computations; in the case of matrix multiplication this is always [1 1 1]) and: This happens because from the case 2) of definition 9, it is easy to conclude that t 21 + t 22 + t 23 = 0 and t 31 + t 32 + t 33 = 0.

Table 5 :
Comparison of performance characteristics Without using L