Scalability Evaluation of Cimmino Algorithm for Solving Linear Inequality Systems on Multiprocessors with Distributed Memory

The paper is devoted to a scalability study of the Cimmino algorithm for linear inequality systems. This algorithm belongs to the class of iterative projection algorithms. For the analytical study of scalability, the BSF (Bulk Synchronous Farm) parallel computation model is used. An implementation of the Cimmino algorithm in the form of operations on lists using the higher-order functions Map and Reduce is presented. An analytical estimation of the scalability boundary of the algorithm for cluster computing systems is derived. Information about the implementation of the Cimmino algorithm on lists in the C++ language using the BSF program skeleton and the MPI parallel programming library is given. The results of large-scale computational experiments performed on a cluster computing system are demonstrated. A conclusion about the adequacy of the analytical estimations is made by comparing them with the results of computational experiments.


Introduction
The problem of solving systems of linear inequalities arises in numerous fields. As examples, we can mention linear programming [1,2], image reconstruction from projections [3], image processing in magnetic resonance imaging [4], and intensity-modulated radiation therapy (IMRT) [5]. At present, many methods for solving systems of linear inequalities are known, among which we can single out the class of self-correcting iterative methods that allow efficient parallelization. The pioneering works in this field are papers [6,7], in which the Agmon-Motzkin-Schoenberg relaxation method for solving systems of linear inequalities was proposed. The relaxation method belongs to the class of projection methods, which use the operation of orthogonal projection onto a hyperplane in Euclidean space. One of the first iterative algorithms of projection type was the Cimmino algorithm [8], intended for solving systems of linear equations and inequalities. The Cimmino algorithm has had a great influence on the development of computational mathematics [9]. A considerable number of papers have been devoted to generalizations and extensions of the Cimmino algorithm (see, for example, [3, 10-13]).
In many cases, systems of linear inequalities arising in the solution of practical problems can involve up to tens of millions of inequalities and up to hundreds of millions of variables [2]. In this case, the issue of developing scalable parallel algorithms for solving large-scale systems of linear inequalities on multiprocessor systems with distributed memory becomes highly relevant. When one creates parallel algorithms for large multiprocessor systems, it is important to obtain an analytical estimation of scalability at an early stage of the algorithm design, before coding. For this purpose, one can use various models of parallel computation [14]. Nowadays, a large number of different parallel computation models are known. The most famous among them are PRAM [15], BSP [16] and LogP [17]. Each of these models has generated a large family of parallel computation models that extend and generalize the parent model (see, e.g., [18-20]). The problem of developing new parallel computation models is still relevant today, because it is impossible to create a parallel computation model that is good in all respects. To create a good parallel computation model, the designer must restrict the set of target multiprocessor architectures and the class of algorithms. In [21], the parallel computation model BSF (Bulk Synchronous Farm), intended for cluster computing systems and iterative algorithms, was proposed. The BSF model makes it possible to predict the scalability boundary of an iterative algorithm with great accuracy before coding. An example of using the BSF model is given in [22].
The purpose of this article is to investigate the scalability of the Cimmino algorithm for solving large-scale systems of linear inequalities on multiprocessor systems with distributed memory by using the BSF parallel computation model. The rest of the article is organized as follows. Section 1 gives a formal description of the Cimmino algorithm. In Section 2, the representation of the Cimmino algorithm in the form of operations on lists using higher-order functions Map and Reduce defined in the Bird-Meertens formalism is constructed. Section 3 is dedicated to an analytical investigation of the scalability of the Cimmino algorithm on lists using the BSF model cost metrics; the equations for estimating the speedup and parallel efficiency are given; the boundary of the algorithm scalability depending on the problem size is calculated. In Section 4, a description of the implementation of the Cimmino algorithm on lists in C++ language using the BSF algorithmic skeleton and the MPI parallel programming library is presented; a comparison of the results obtained analytically and experimentally is given. In conclusion, the obtained results are summarized and directions for further research are outlined.

Cimmino Algorithm for Inequalities
Let us consider the system of linear inequalities

⟨a_i, x⟩ ≤ b_i, i = 1, . . . , m, (1)

where ⟨a_i, x⟩ is the Euclidean inner product of a_i and x in R^n, and b_i ∈ R. To avoid triviality, we assume m ≥ 2. We also assume that system (1) is consistent. It is necessary to find a solution of the system of linear inequalities (1). To solve this problem, it is convenient to use a geometric language. Thus, we look upon x = (x_1, . . . , x_n) as a point in the n-dimensional Euclidean space R^n, and upon each inequality l_i(x) ≥ 0, where l_i(x) = b_i − ⟨a_i, x⟩, as a half-space P_i. Therefore, the set of solutions of system (1) is the polytope

M = P_1 ∩ · · · ∩ P_m. (2)

Let H_i = {x ∈ R^n | ⟨a_i, x⟩ = b_i} be the hyperplane bounding the half-space P_i, and let the orthogonal projection of x ∈ R^n onto the hyperplane H_i ⊂ R^n be denoted by π_{H_i}(x). The orthogonal projection π_{H_i}(x) can be calculated by the following equation:

π_{H_i}(x) = x + ((b_i − ⟨a_i, x⟩) / ‖a_i‖²) a_i, (3)

where ‖·‖ is the Euclidean norm. Let us define the orthogonal reflection of x with respect to the hyperplane H_i as follows:

ρ_{H_i}(x) = 2π_{H_i}(x) − x = x + 2((b_i − ⟨a_i, x⟩) / ‖a_i‖²) a_i. (4)

The Cimmino algorithm for equally weighted inequalities consists of the following steps:

Step 1: k := 0; x_0 := 0.
Cimmino's method starts with an arbitrary point x_0 in R^n as an initial approximation and then, at each step, calculates the centroid of a system of masses placed at the reflections of the previous iterate with respect to the hyperplanes H_1, . . . , H_m defined by the system of inequalities. This centroid is taken as the new iterate:

x_{k+1} = x_k + (λ/m) Σ_{i=1}^{m} (ρ_{H_i}(x_k) − x_k). (5)

In equation (5), λ is a relaxation parameter. It is known [10] that for 0 < λ < 2 the iteration process (5) converges to a point belonging to the polytope M.

Cimmino Algorithm in the Form of Operations on Lists
In order to obtain analytical estimations of an algorithm using the cost metrics of the BSF model, the algorithm must be represented in the form of operations on lists using the higher-order functions Map and Reduce defined in the Bird-Meertens formalism [23]. The higher-order function Map applies the given function F : A → B to each element of the given list [a_1, . . . , a_m] and returns the list of results in the same order:

Map(F, [a_1, . . . , a_m]) = [F(a_1), . . . , F(a_m)]. (6)

The higher-order function Reduce reduces the given list [b_1, . . . , b_m] to a single value by iteratively applying the given binary associative operation ⊕ : B × B → B to each pair of elements:

Reduce(⊕, [b_1, . . . , b_m]) = b_1 ⊕ b_2 ⊕ · · · ⊕ b_m. (7)

In the context of the Cimmino algorithm, we define the list L_map as follows:

L_map = [i_1, . . . , i_m], (8)

where i_k ∈ {1, . . . , m} and i_k ≠ i_l for k ≠ l (k, l = 1, . . . , m). In other words, L_map is the list of the numbers of the inequalities of system (1) ordered in an arbitrary way. For an arbitrary point x ∈ R^n, let us define the function F_x : {1, . . . , m} → R^n as follows:

F_x(i) = 2((b_i − ⟨a_i, x⟩) / ‖a_i‖²) a_i (9)

for all i ∈ {1, . . . , m}. In other words, the function F_x(i) calculates the vector ρ_{H_i}(x) − x connecting the point x with its orthogonal reflection with respect to the hyperplane H_i. For an arbitrary point x ∈ R^n, the list L_reduce^(x) ⊂ R^n is obtained from the list L_map by applying the higher-order function Map with the function F_x as a parameter:

L_reduce^(x) = Map(F_x, L_map) = [F_x(1), . . . , F_x(m)]. (10)

Let us define the binary associative operation ⊕ : R^n × R^n → R^n as follows:

x ⊕ y = x + y (11)

for all x, y ∈ R^n. In this case, the ⊕ operator performs the conventional addition of vectors. Then the sum s of the vectors F_x(1), . . . , F_x(m) can be obtained by applying the higher-order function Reduce with the vector addition ⊕ as a parameter to the list L_reduce^(x):

s = Reduce(⊕, L_reduce^(x)) = F_x(1) ⊕ · · · ⊕ F_x(m). (12)

In that way, the iterative process (5) takes the form

x_{k+1} = x_k + (λ/m) Reduce(⊕, Map(F_{x_k}, L_map)). (13)

Now we can write the Cimmino algorithm in the form of operations on lists:

Step 1: k := 0; x_0 := 0; L_map := [1, . . . , m].
Step 2 (Map): L_reduce := Map(F_{x_k}, L_map).
Step 3 (Reduce): s := Reduce(⊕, L_reduce).
Step 4: x_{k+1} := x_k + (λ/m) s.
Step 5: if ‖x_{k+1} − x_k‖² < ε then stop.
Step 6: k := k + 1; go to Step 2.
The BSF model assumes that the algorithm is executed by a computing system consisting of one master-node and K worker-nodes (K > 0). Step 1 of the algorithm is performed by both the master and the workers during the initialization of the iterative process.
Step 2 (Map) is performed only on the worker-nodes.
Step 3 (Reduce) is performed on the worker-nodes and partially on the master-node. Steps 4-6 are performed only on the master-node. The BSF model assumes that all arithmetic operations (addition and multiplication) as well as comparison operations on floating-point numbers take the same time τ_op.

Analytical Evaluation of Scalability
Let us introduce the following notation for the scalability evaluation of the Cimmino algorithm:

c_s — the quantity of float numbers sent from the master to one worker at each iteration;
c_map — the quantity of arithmetic operations performed in the Map step;
c_a — the quantity of arithmetic operations performed by one application of the operation ⊕ in the Reduce step;
c_r — the quantity of float numbers sent from one worker to the master at each iteration;
c_p — the quantity of operations performed by the master in Steps 4 and 5.

Let us calculate the indicated values. At the beginning of each iteration, the master sends to all the workers the current approximation x_k, which is a vector of dimension n. Hence:

c_s = n. (14)

Let us calculate the number of arithmetic operations performed in the Map step. For each element of the list L_map, one vector F_{x_k}(i) = 2((b_i − ⟨a_i, x_k⟩)/‖a_i‖²) a_i is calculated. Note that the values ‖a_i‖² (i = 1, . . . , m) do not depend on x_k and therefore can be calculated in advance at the initialization stage. Taking this into account, the quantity of operations for calculating one such vector is 3n + 1. Multiplying this value by the number of inequalities, we obtain

c_map = m(3n + 1). (15)
During the execution of the Reduce step, the list L_reduce consisting of m vectors is divided into equal parts, each of them assigned to a single worker. Everywhere below we assume that K ≤ m. For simplicity, we also assume that m is a multiple of the number of workers K. The addition of two vectors of dimension n requires n arithmetic operations. Hence:

c_a = n. (16)

After the execution of the Reduce step, each worker sends the resulting vector of dimension n to the master. Thus:

c_r = n. (17)

The execution of Step 4 requires 2n operations (we assume the constant value λ/m to be computed in advance). The execution of Step 5 requires 3n − 1 arithmetic operations and one comparison operation. It follows that

c_p = 2n + (3n − 1) + 1 = 5n. (18)

Let us designate the time spent by a worker to perform one arithmetic operation as τ_op, and the time spent for transferring a single float number across the network, excluding latency, as τ_tr. In that way, we get the following values for the cost parameters of the BSF model [21] in the case of the Cimmino algorithm:

t_s = nτ_tr; (19)
t_map = m(3n + 1)τ_op; (20)
t_a = nτ_op; (21)
t_r = nτ_tr; (22)
t_p = 5nτ_op. (23)

Equation (19), obtained on the basis of (14), gives an estimation of the time t_s spent by the master to transfer a message to one worker, excluding latency. Equation (20), obtained using equation (15), gives the total time t_map spent by a single worker to process the entire Map list, in accordance with the BSF model cost metric. Equation (21), obtained using equation (16), gives the time t_a spent by a processor node on adding two vectors of dimension n. Equation (22), obtained on the basis of (17), gives an estimation of the time t_r spent by a worker to transfer its resulting vector to the master, excluding latency. Equation (23), obtained using equation (18), gives the time t_p spent by the master on calculating the next approximation and checking the stopping criterion.
In accordance with the BSF cost metric, let T(1) denote the time of solving the problem by a system consisting of one master and one worker (K = 1), and let T(K) denote the time of solving the problem by a system composed of one master and K workers. For m → ∞, the speedup a(K) = T(1)/T(K) is asymptotically estimated by the following equation, where L denotes the communication latency:

a(K) = K (2(L + τ_tr n) + τ_op(5n + m(3n + 1) + mn)) / (2K²(L + τ_tr n + τ_op n) + τ_op(m(3n + 1) + mn + 4nK)). (28)

To determine the scalability boundary of the Cimmino algorithm in accordance with the procedure described in [21], let us take the derivative a′(K) and solve the equation a′(K) = 0 with respect to K. The positive root of this equation is

K_max = √(τ_op m(4n + 1) / (2(L + τ_tr n + τ_op n))). (35)

In that way, the scalability boundary of the Cimmino algorithm on lists increases in proportion to the square root of the number m of inequalities. To conclude this section, let us write the equation for estimating the parallel efficiency e as a function of K. Considering equation (28), we have

e(K) = a(K)/K = (2(L + τ_tr n) + τ_op(5n + m(3n + 1) + mn)) / (2K²(L + τ_tr n + τ_op n) + τ_op(m(3n + 1) + mn + 4nK)). (36)

Numerical Experiments
In order to verify the analytical results, we implemented the Cimmino algorithm in the C++ language using the BSF algorithmic skeleton and the MPI parallel programming library. The source code of this program is freely available on GitHub at https://github.com/leonid-sokolinsky/BSF-Cimmino. The system of inequalities was taken from the scalable model linear programming problem Model-n given in [24]. In this system, the number of inequalities is m = 2n + 2, where n is the dimension of the space. We investigated the speedup and parallel efficiency of the Cimmino algorithm on the supercomputer "Tornado SUSU" [25]. The calculations were performed for the dimensions n = 1 500, 5 000, 10 000 and 16 000. At the same time, we plotted the curves of speedup and parallel efficiency for these dimensions using equations (28) and (36). For this, the following values were determined experimentally (in seconds): L = 1.5 · 10^−5, τ_op = 2.9 · 10^−8 and τ_tr = 1.9 · 10^−7. The results are presented in Fig. 1-4. In all cases, the analytical estimations were very close to the experimental ones. Moreover, the performed experiments show that the scalability boundary of the BSF-program increases in proportion to the square root of the number m of inequalities, as analytically predicted by equation (35).

Conclusion
In this paper, the scalability and parallel efficiency of the iterative Cimmino algorithm used to solve large-scale linear inequality systems on multiprocessor systems with distributed memory were investigated. To do this, we used the BSF (Bulk Synchronous Farm) parallel computation model based on the "master-slave" paradigm. The BSF-implementation of the Cimmino algorithm in the form of operations on lists using the higher-order functions Map and Reduce was described. A scalability boundary of the BSF-implementation of the Cimmino algorithm was obtained. This estimation tells us the following: if the space dimension n is greater than or equal to the number m of inequalities, then the scalability boundary of the Cimmino algorithm on lists increases in proportion to the square root of the number m of inequalities. So, we may conclude that the Cimmino algorithm on lists scales well. The equations for estimating the speedup and parallel efficiency of the Cimmino algorithm on lists were also obtained. The implementation of the Cimmino algorithm in the C++ language using the BSF algorithmic skeleton and the MPI parallel programming library was performed. This implementation is freely available on GitHub at https://github.com/leonid-sokolinsky/BSF-Cimmino. On a cluster computing system, large-scale experiments were conducted to obtain the actual speedup and parallel efficiency curves for systems having 1 500, 5 000, 10 000 and 16 000 variables and 3 002, 10 002, 20 002 and 32 002 inequalities, respectively. The results of the experiments showed that the BSF model predicts the scalability boundary of the Cimmino algorithm on lists with high accuracy.
As future research directions, we intend to do the following:
1) apply the Cimmino algorithm to implement the Quest phase of the NSLP algorithm [2], designed to solve large-scale non-stationary linear programming problems;
2) carry out computational experiments on a cluster computing system to solve large-scale linear programming problems under conditions of dynamically changing input data.