Experimental Evaluation of Events Distribution in Markov Based Computer Performance Models

The continuous-time Markov process is a widely used abstract tool for constructing high-level models of complex computer systems in order to evaluate either the performance or reliability parameters of a system. Utilization of the continuous-time Markov process is based on the assumption of exponential distribution of the time between random events influencing the behaviour of the modelled system. A different kind of probability distribution of this time requires adaptation (extension) of the original model. This article uses a representative example to evaluate the precision of the modelled system parameters using a simple Markov model based on the exponential distribution assumption instead of a more complex model which respects another (i.e. more realistic) probability distribution.


INTRODUCTION
A powerful tool for analysing selected probability-based systems and problems in computer science is the model based on the mathematical theory of (stochastic) Markov processes. For a thorough review of the basic theory, see e.g. [2]. From a computer researcher's point of view, the Markov process is a kind of finite automaton in which a transition between two states is caused by a random event, i.e. the time spent in every state is a random variable. When an automaton state is reached, all outgoing transitions are fired. Every transition (an edge in the graph describing the process) is assigned a transition rate. This value can be interpreted in two ways: (i) it is the frequency of the (subsequent) transitions, conditional on the transition having been fired, and (ii) it is the parameter of the exponential probability distribution of the time interval between the firing of the transition and its occurrence (denoted here as the time interval of transition burning). The model (i.e. its state-transition graph) can be easily transformed into a set of linear differential equations from which the time-dependent state probabilities $p_0(t)$, $p_1(t)$, … can be computed using conventional methods.
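As a brief illustration of the transformation mentioned above, the differential equations can be written in the following generic form. The symbol $q_{ij}$ (the transition rate of the edge from state i to state j, zero when there is no such edge) is notation introduced here for the sketch only; it does not appear in the original text.

```latex
\frac{dp_i(t)}{dt} \;=\; \sum_{j \ne i} q_{ji}\, p_j(t) \;-\; p_i(t) \sum_{j \ne i} q_{ij},
\qquad \sum_{i} p_i(t) = 1 .
```

The first sum is the probability flow into state i and the second term is the flow out of it. Setting all derivatives to zero gives the linear algebraic equations for the asymptotic probabilities.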
Markov models are used in two basic categories. The first category contains models with one or more absorbing states, i.e. states without an output edge. It is apparent that these models have a "limited time of life". First, the time-dependent probabilities of the model states are computed directly from the corresponding set of differential equations; then, the target parameters can be determined, usually as a linear combination of some state probabilities. Markov models of the second category have "infinite life" (i.e. no absorbing states), and here the asymptotic probabilities of the model states (i.e. the time-independent limits $p_0 = p_0(\infty)$, $p_1 = p_1(\infty)$, …) can be computed from a set of linear algebraic equations. Subsequently, significant parameters can be determined using the known values of these asymptotic probabilities. The case analysed in this article falls into the second category. Utilization of Markov models is described e.g. in [1], [3], [4], [8].
Application of Markov processes is limited by the assumption of an exponential probability distribution of the duration of any transition burning. The exponential distribution is quite "irregular", i.e. its standard deviation has the same value as its mean (both are 1/λ, where λ is the single parameter of the distribution). In some applications (see e.g. [7]), the Markov model is constructed and utilized even when the time behaviour of the modelled system is influenced by more regular events, e.g. by built-in tests with a relatively regular period. The aim of this article is to use a representative example to evaluate numerically the deviation which occurs when the Markov model is used although the assumption of an exponential distribution of the event burning time is not quite valid.

SYSTEM TO BE EVALUATED
For the analysis we have chosen a classical model of cooperating parallel processes. The assumed computational environment can be e.g. a symmetrical multiprocessor system consisting of n processors and a shared memory. Let us assume n computational processes, each of which has its own processor. The processes cooperate using one critical section containing all the shared process data. All processes have the same program describing their cyclic behaviour: local computation without any interaction with another process, followed by computation within the critical section, etc. The computation inside the critical section needs to be locked, i.e. only one process can be in this part of the shared program at one time. All processes have the same time behaviour with the following parameters:
- λ … mean frequency of the repeated process local computation without the critical section access, i.e. the reciprocal of the mean time of the local computation,
- μ … mean frequency of the repeated computation of the critical section without concurrency, i.e. the reciprocal of the mean time of this computation.
The described example can serve as a model of parallel computation based on data parallelism: every process works on a separate (local) piece of a large set of data, then updates the global result of the computation. This activity is performed periodically until the whole set of data is exhausted. The updating operation represents the critical section of the computation and, in an implementation, it has to be locked, e.g. using the locking operation provided by the operating system.
The time intervals of the process behaviour need to be taken as random variables, evidently because the data values differ between cycles of the process activity. When the "good parallelization" condition (1) holds, the ideal (linear) speedup of parallel computation $s_{max} = n$ can be reached assuming deterministic time behaviour. When condition (1) is valid but the modelled processes have random time behaviour, their conflicts (i.e. the necessary synchronization at the input of the critical section, implemented within the locking operation) decrease the reachable value of the speedup. Under condition (1), the maximum frequency (i.e. the frequency without conflicts) of every process computation is as follows:

$$f_{max} = \frac{1}{1/\lambda + 1/\mu} = \frac{\lambda\mu}{\lambda + \mu} \qquad (2)$$

Due to these conflicts, the real frequency of computation f is decreased compared to $f_{max}$, so we can define the speedup degradation coefficient d as the ratio:

$$d = \frac{f_{max}}{f}$$

The corrected value of the speedup can then be expressed as

$$s = \frac{s_{max}}{d} = \frac{n}{d} \qquad (3)$$

When the time intervals of local computation and the time intervals within the critical section have the exponential distribution (i.e. they are quite irregular), the analytic solution for the degradation coefficient d can be found using the Markov model (see the basic model in the next section). In this case the variables λ and μ can be taken as the parameters of the corresponding exponential probability distributions, and the reciprocals 1/λ and 1/μ then represent the mean values of these distributions.
But the assumption of exponential distribution (i.e. total irregularity) of the processing time intervals can be questionable, especially in the case of the computation time spent inside the critical section. In the analysed example (a model of parallel computing with data parallelism), this time corresponds to a global result update, which can be quite a regular operation. That is why in Section 4 the influence of increased regularity of the time spent inside the critical section will be taken into account, still using the (extended) Markov model. In Section 5 the influence of combined regularity of both processing times will be computed by means of a discrete-time simulation model.
The basic Markov model (see below, Section 3) is general enough to be used as an abstract model of many systems or problems in applied computer science. Below, we present two more examples that lead to the same abstract (Markov) model.
The first example is a closed queuing network with one server (here μ is the serving rate) and n clients which continuously generate (with the rate λ) requests to be processed by the server. This network can be e.g. a model of a large information system containing a database server and n workstations. A request generated by a workstation is e.g. a query to obtain a piece of information from the database. The query is translated into a database transaction; single transactions are served using the FIFO scheduling discipline. Here the highest possible throughput $X_{max}$ of the system (i.e. the maximum number of database transactions processed per time unit) corresponds to $n f_{max}$, where $f_{max}$ is expressed by (2). The degradation coefficient d then reflects the time lost while the clients wait unproductively for the server to start dealing with their requests. The corrected value of the throughput can be computed similarly to formula (3) as $X = X_{max}/d$.
The second example is from the area of fault-tolerant (FT) systems. A highly available information system uses n (identical) servers. The fault rate of a server is λ. The rate of repairs is μ; the repairs are performed in sequence (assuming there is one repairman). Here we are able to use the Markov model from Fig. 1 in order to evaluate the MTBF (Mean Time Between the system Failures) from the frequency of system failures $p_{n-1}\lambda$ (see Fig. 1), i.e. MTBF $= 1/(p_{n-1}\lambda)$. Other system parameters like MTTF, MTTR and the coefficient of availability a can be computed simply as well.
In both examples, the assumption that λ is the parameter of an exponential probability distribution is acceptable (the corresponding time intervals are quite irregular), but a similar assumption for μ is questionable (the times of single services and single repairs are more regular). In the second example, the realistic value of the λ/μ ratio is very low, so it is possible to expect (see the results in Table 3 and Table 4 in the next sections) a very small influence of the (more regular) time of repair on the resultant values of the evaluated system reliability parameters.

BASIC MARKOV MODEL
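The Markov model of the example described above can be represented by a simple state diagram (see Fig. 1), in which state i (i = 0, 1, …, n) corresponds to i processes being inside the critical section or waiting for it. The asymptotic probabilities $p_0, p_1, \dots, p_n$ of the model states can be computed from a homogeneous set of linear algebraic equations:

$$p_0\, n\lambda - p_1\,\mu = 0$$
$$p_i\,[(n-i)\lambda + \mu] - p_{i-1}\,(n-i+1)\lambda - p_{i+1}\,\mu = 0, \qquad i = 1, 2, \dots, n-1$$
$$p_n\,\mu - p_{n-1}\,\lambda = 0$$

This set of equations can be expressed using a single matrix equation:

$$A \cdot p = 0 \qquad (6)$$

where $p = (p_0, p_1, \dots, p_{n-1}, p_n)^T$ is a column vector of asymptotic state probabilities and A is the coefficient matrix of the system:

$$A = \begin{pmatrix}
n\lambda & -\mu & & & \\
-n\lambda & (n-1)\lambda + \mu & -\mu & & \\
 & -(n-1)\lambda & (n-2)\lambda + \mu & -\mu & \\
 & & \ddots & \ddots & \ddots \\
 & & & -\lambda & \mu
\end{pmatrix}$$

Generally, any system of linear equations representing a Markov process in this way is linearly dependent. The rank of matrix A is n (the number of states minus 1). This rank deficiency is eliminated by replacing any one equation with the equation

$$\sum_{i=0}^{n} p_i = 1 \qquad (8)$$

(the system will always be in one of the states with probability 1).

The form of matrix A clears the way for deriving the analytical solution of vector p. Suppose the value of $p_0$ is known. Due to the first row of matrix A, $p_1$ can be expressed directly in terms of $p_0$; due to the second row, $p_2$ can be expressed directly in terms of $p_0$ and $p_1$, etc. After a sequence of algebraic transformations, all the probabilities $p_i$ can be expressed in terms of $p_0$ (in this special case, not in general):

$$p_i = \frac{n!}{(n-i)!}\left(\frac{\lambda}{\mu}\right)^{i} p_0, \qquad i = 1, 2, \dots, n$$

Then it follows from (8):

$$p_0 = \left[\,\sum_{i=0}^{n} \frac{n!}{(n-i)!}\left(\frac{\lambda}{\mu}\right)^{i}\right]^{-1}$$

The real frequency of computation (i.e. the frequency under the influence of conflicts) is:

$$f = \frac{\mu\,(1 - p_0)}{n}$$

Table 1 shows the numerical values of the speedup degradation coefficient d based on the analytical model presented above. These results were computed for a representative set of parameters. The meaning of the parameters is explained in the previous text. For example, the ratio λ/μ = 0.1 means that the local computation is on average ten times longer than the critical section. The value d = 1.0424 for n = 5 processes means about 4% longer computation due to conflicts when accessing the critical section. Not all the positions in this table satisfy the "good parallelization" condition (1), so the corresponding results are far from the ideal value d = 1.0 (see e.g. the value in the bottom right corner).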
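The analytical solution can be checked with a few lines of code. The following is a minimal sketch, not the authors' implementation. It assumes the birth-death structure of the basic model (state i = number of processes in or waiting for the critical section, rate (n-i)λ from state i to i+1, rate μ from state i to i-1), which gives the product form $p_i = p_0\, n!/(n-i)!\,(\lambda/\mu)^i$. The function name `basic_model_d` and the normalization μ = 1 are choices made here for illustration.

```python
from math import factorial


def basic_model_d(n: int, rho: float) -> float:
    """Speedup degradation coefficient d of the basic Markov model.

    n   ... number of processes
    rho ... ratio lambda/mu (mu is normalized to 1, an assumption of this sketch)
    """
    # Unnormalized state probabilities p_i / p_0 = n!/(n-i)! * rho**i
    weights = [factorial(n) // factorial(n - i) * rho**i for i in range(n + 1)]
    p0 = 1.0 / sum(weights)

    # Frequency without conflicts: f_max = lambda*mu / (lambda + mu), with mu = 1
    f_max = rho / (1.0 + rho)
    # Real frequency per process: the critical section completes at rate mu
    # whenever the system is not in state 0
    f = (1.0 - p0) / n
    return f_max / f


print(round(basic_model_d(5, 0.1), 4))  # 1.0424
```

For n = 5 and λ/μ = 0.1 the sketch reproduces the value d = 1.0424 quoted in the text; for n = 1 it returns d = 1, since a single process never waits.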

EXTENDED MARKOV MODEL
In this section, the time intervals inside the critical section will be regarded as a serial connection of k stages, each of them with exponential distribution. The mean conditional frequency of any stage is kμ. Therefore, the aggregate time within the critical section has the Erlang-k distribution in this case and the same mean value as before. The method of stages is discussed in [5]. In general, a state with non-exponential distribution can be split into a serial-parallel cluster of two or more exponentially distributed stages.
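The effect of the serial stages on regularity can be made explicit with a short derivation. It assumes only that the k stage durations $T_1, \dots, T_k$ are independent and exponentially distributed with rate kμ; the symbols T, $T_j$ and C are notation used here for illustration.

```latex
T = \sum_{j=1}^{k} T_j, \qquad
E[T] = k \cdot \frac{1}{k\mu} = \frac{1}{\mu}, \qquad
\mathrm{Var}[T] = k \cdot \frac{1}{(k\mu)^2} = \frac{1}{k\mu^2}, \qquad
C = \frac{\sqrt{\mathrm{Var}[T]}}{E[T]} = \frac{1}{\sqrt{k}} .
```

The mean stays at 1/μ, while the coefficient of variation drops from 1 (k = 1, the exponential case) to about 0.32 for k = 10.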
The following state diagram (see Fig. 2) represents the Markov model where computation inside the critical section is divided into k stages with identical mean times of all computation stages.

Fig. 2 Extended Markov model

Asymptotic probabilities of the model states can be computed from a system of n·k + 1 linear algebraic equations. Let us define

$$p_{0,1} = p_0, \qquad p_{0,2} = p_{0,3} = \dots = p_{0,k} = 0, \qquad p_{1,0} = p_{2,0} = \dots = p_{n,0} = 0 \qquad (12)$$

Then the state diagram presented above can be simply represented by the following equations:

$$p_0\, n\lambda - p_{1,k}\, k\mu = 0 \qquad (13)$$
$$p_{i,j}\,(n-i)\lambda + p_{i,j}\, k\mu - p_{i-1,j}\,(n-i+1)\lambda - p_{i,j-1}\, k\mu = 0$$

(for i = 1, 2, …, n and j = 1, 2, 3, …, k).

Let us define the following indexing of the set of states (see Table 2). Then, the coefficient matrix A is a banded matrix due to this indexing. Generally, the bandwidth of matrix A is k + 2, as can be illustrated by the example for n = 3 and k = 2 (14). The rows and columns of A are indexed from 0 to n·k, too.
After substituting (14) for A in (6), and replacing the last equation of the system by (8), the following system of linear equations is obtained.
This matrix equation is not suitable for deriving the analytical solution. In general, the probabilities $p_i$ cannot be simply expressed in terms of $p_0$. Therefore, a numerical solution of this system seems to be the best way to obtain the vector of asymptotic probabilities p.
A standard spreadsheet tool was used to automatically construct the coefficient matrix A, to solve the system of linear equations, and to compute the real frequency of computation f and the speedup degradation coefficient d.
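The same numerical procedure can be sketched in a general-purpose language. The following Python sketch is not the spreadsheet used by the authors. It assumes a particular transition structure: state 0 goes to (1,1) with rate nλ, state (i,j) goes to (i+1,j) with rate (n-i)λ and to (i,j+1) with rate kμ, and the last stage (i,k) returns to (i-1,1), or to state 0 when i = 1, with rate kμ. The last of these transitions is an assumption about Fig. 2. The names `solve` and `extended_model_d`, the state indexing and the normalization μ = 1 are illustrative.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def extended_model_d(n: int, rho: float, k: int) -> float:
    """Degradation coefficient d with an Erlang-k critical section (mu = 1)."""
    lam, mu = rho, 1.0
    size = n * k + 1

    def idx(i, j):
        # State 0 has index 0; state (i, j) has index (i-1)*k + j
        return 0 if i == 0 else (i - 1) * k + j

    q = [[0.0] * size for _ in range(size)]  # q[a][b]: rate from a to b

    def add(a, b, rate):
        q[a][b] += rate
        q[a][a] -= rate

    add(0, idx(1, 1), n * lam)
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            s = idx(i, j)
            if i < n:
                add(s, idx(i + 1, j), (n - i) * lam)  # another process arrives
            if j < k:
                add(s, idx(i, j + 1), k * mu)         # next stage
            else:
                add(s, idx(i - 1, 1), k * mu)         # critical section left

    # Balance equations (transposed rates); the last one is replaced
    # by the normalization sum(p) = 1
    a = [[q[c][r] for c in range(size)] for r in range(size)]
    a[-1] = [1.0] * size
    p = solve(a, [0.0] * (size - 1) + [1.0])

    # Real frequency per process: completions of the last stage
    f = k * mu * sum(p[idx(i, k)] for i in range(1, n + 1)) / n
    f_max = lam * mu / (lam + mu)
    return f_max / f


print(extended_model_d(5, 0.1, 1), extended_model_d(5, 0.1, 10))
```

For k = 1 the sketch reduces to the basic model, which gives a simple consistency check against the analytical solution.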
Table 3 shows the numerical values of the speedup degradation coefficient d based on the numerical model presented above. These results were computed for the same set of parameters as in Section 3. The lower row presents the relative deviation of the exact analytical solution (i.e. for exponential distribution of the time intervals spent inside the critical section) from the numerical solution for k = 10 stages with exponential distribution.

SIMULATION MODEL
The last part of our analysis aims to evaluate the combined influence of increased regularity of both probability distributions: the distribution of the local computation time of a process as well as the distribution of the time spent inside the critical section. The method of stages explained in the previous section is applicable here, too.
The resulting Markov model is, however, considerably more complex, and its complexity (i.e. the number of states) grows significantly. Generally, the complexity of the extended Markov model grows exponentially with the number of events with non-exponentially distributed burning time. Here we would need a 3D space to create the extended model. When the number of state space dimensions is given (3 in this case), the complexity of the extended model grows linearly both with the number of processes and with the number of assumed stages of both activities (i.e. events finishing the activity).
For the analysed case we decided to use a simulation model. The model is discrete-time and Monte Carlo based, i.e. it uses random numbers to determine the single values of the duration of the modelled process activities. The C-Sim library [6] was used as the implementation tool. The simulation model can be easily verified by applying it to the cases described above in Section 3 and Section 4 and comparing the results obtained from both models. If we let the simulation program run for $10^6$ cycles of the modelled process activity, the relative error of the computed d is about $10^{-3}$. Table 4 was computed for the same set of parameters as the previous two tables. It shows the results obtained when both time intervals were divided into k = 10 stages, i.e. the modelled probability distribution was the Erlang distribution of the k-th degree. For this case, the coefficient of variation C (a measure of the time interval regularity) for both distributions is $C = 1/\sqrt{k} = 0.32$. It corresponds approximately to the Gaussian distribution with a standard deviation of about one third of its mean value. When comparing the results in Table 4 with the previous tables, we can see the expected influence of the increased regularity of the modelled process: the computed d has better (i.e. decreased) values.
We also used the simulation model with fully regular (i.e. deterministic) values for both time intervals of activity. These values were the same as the mean values of the probability distributions (exponential or Erlang) used before. In this case the modelled parallel computation processes were fully synchronized and no conflicts occurred. We obtained the expected result d = 1.0 (i.e. no degradation of the parallel computation speed) with sufficient precision. This experiment served mainly to confirm the correctness of the simulation model used.
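The simulation experiments described above can be illustrated with a small Monte Carlo sketch. It is written in plain Python as an event-driven simulation and is not the C-Sim model used for Table 4. The function names `erlang` and `simulate_d`, the normalization μ = 1, the default number of cycles and the convention k = 0 for deterministic durations are all assumptions made here for illustration.

```python
import heapq
import random


def erlang(rng: random.Random, k: int, mean: float) -> float:
    """Erlang-k sample: sum of k exponential stages, each with mean mean/k."""
    return sum(rng.expovariate(k / mean) for _ in range(k))


def simulate_d(n: int, rho: float, k: int,
               cycles: int = 200_000, seed: int = 1) -> float:
    """Estimate d for n processes, lambda/mu = rho, Erlang-k durations.

    k = 0 selects deterministic durations (a convention of this sketch).
    """
    rng = random.Random(seed)
    lam, mu = rho, 1.0

    def draw(mean: float) -> float:
        return mean if k == 0 else erlang(rng, k, mean)

    # Each process starts with a local computation, then requests the lock.
    # The heap holds (request time, process id); popping the earliest
    # request first implements the FIFO discipline.
    requests = [(draw(1.0 / lam), p) for p in range(n)]
    heapq.heapify(requests)

    free_at = 0.0  # time at which the critical section becomes free
    for _ in range(cycles):
        t, p = heapq.heappop(requests)
        start = max(t, free_at)            # wait if the section is occupied
        free_at = start + draw(1.0 / mu)   # time inside the critical section
        # Next request of this process follows its next local computation
        heapq.heappush(requests, (free_at + draw(1.0 / lam), p))

    f = cycles / (n * free_at)             # measured frequency per process
    f_max = lam * mu / (lam + mu)
    return f_max / f
```

Calling `simulate_d(5, 0.1, 1)` should approach the analytical value of the basic model, and `simulate_d(5, 0.1, 0)` should approach d = 1.0, mirroring the two verification steps described in the text; the estimates carry statistical error that shrinks as the number of cycles grows.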

CONCLUSION AND FUTURE WORK
This article uses a representative example to evaluate the precision of the system parameters computed using the Markov model, when the assumed exponential distribution of the burning time of events is replaced with another probability distribution.
The chosen example is from the area of parallel processing, but the results are applicable to other parts of computer science as well, e.g. queuing networks or fault-tolerant systems.
The results show that for an integral parameter such as the evaluated degradation coefficient d, and for realistic values of the modelled system parameters, the deviation of the result caused by replacing the exponential distribution of the event burning time with another distribution is not too large. In our analysis this deviation did not exceed 20%. One positive conclusion follows from this analysis: in a realistic case like the analysed one, the basic (i.e. simple) Markov model can be used instead of the (much more complex) extended one. A relatively small error in results like d can be expected due to this simplification.
In fact, the evaluated degradation coefficient d is a combination of the probabilities of many states of the Markov model, where deviations in the evaluated probabilities of single states can cancel each other out. It can be expected that the probability values (or time functions) of individual states of the model are influenced much more, but that is a matter of our future work.

Fig. 1 Basic Markov model. The asymptotic probabilities $p_0, p_1, \dots, p_{n-1}, p_n$ of the model states can be computed from a homogeneous set of linear equations.

Table 1
Resultant values of d obtained from the basic model

Table 2
Indexing of the states

Table 3
Resultant d values obtained using the extended model

Table 4
Resultant d values obtained using the simulation model