Queueing models of RAID systems with maxima of waiting times

doi:10.1016/j.peva.2006.11.002

Performance Evaluation

Volume 64, Issues 7–8, August 2007, Pages 664-689

Abstract

A queueing model is developed that approximates the effect of synchronizations at parallel service completion instants. Exact results are first obtained for the maxima of independent exponential random variables with arbitrary parameters, and this is followed by a corresponding approximation for general random variables, which reduces to the exact result in the exponential case. This approximation is then used in a queueing model of RAID (Redundant Array of Independent Disks) systems, in which accesses to multiple disks occur concurrently and complete only when every disk involved has completed. We consider the two most common RAID variants, RAID0-1 and RAID5, as well as a multi-RAID system in which they coexist. This can be used to model adaptive multi-level RAID systems in which the RAID level appropriate to an application is selected dynamically. The random variables whose maximum has to be computed in these applications are disk response times, which are modelled by the waiting times in $M / G / 1$ queues. To compute the mean value of their maximum requires the second moment of queueing time and we obtain this in terms of the third moment of disk service time, itself a function of seek time, rotational latency and block transfer time. Sub-models for these quantities are investigated and calibrated individually in detail. Validation against a hardware simulator shows good agreement at all traffic intensity levels, including the threshold for practical operation above which performance deteriorates sharply.

Introduction

Traditional queueing networks, e.g. product-form networks, cannot model synchronizations at parallel service completion instants. We approximate this effect in a queueing model of RAID (Redundant Array of Independent Disks) systems, derived by considering the explicit flow of control in the physical architecture. The contention in each parallel phase of processing is represented using an approach based on the $M / G / 1$ queue. The synchronization time is then the maximum of a collection of $M / G / 1$ queue sojourn times (also called waiting times or response times). We assume these sojourn times to be independent, taking them first to be exponential random variables, as in an $M / M / 1$ queue, and then generally distributed.

Based on initial work in [12], Section 2 derives an exact recurrence formula for the Laplace transform of the probability density function of the maximum of a set of independent exponential random variables, from which the mean and higher moments follow. In the special case that all the constituent exponential distributions are identical, the well-known result for the mean value of the maximum in terms of harmonic numbers follows immediately. The recurrence is then generalized to approximate the mean of the maximum of independent, generally distributed random variables. This simplifies to the previous exact result when the constituent distributions are exponential but in general requires their second moments. The accuracy of the approximation is assessed by comparison with simulation results obtained for Erlang and Pareto constituent distributions, which typify the cases of small and large variances respectively.
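As a small numerical illustration of the exponential case, the sketch below computes the mean of the maximum of independent exponential random variables with arbitrary rates. It uses the standard inclusion-exclusion identity rather than the recurrence of Section 2, which is not reproduced here, and the function names `mean_max_exponentials` and `harmonic` are hypothetical. For identical rates the result should agree with the harmonic-number formula $H_n / \mu$.

```python
from itertools import combinations


def mean_max_exponentials(rates):
    """Mean of the maximum of independent exponentials with the given rates.

    Inclusion-exclusion over non-empty subsets S of the rates:
        E[max] = sum_S (-1)^(|S|+1) / sum(S)
    This is a textbook identity, not the paper's recurrence, and its cost
    grows as 2^n, so it is only meant for small n.
    """
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


if __name__ == "__main__":
    mu = 2.0
    # Identical rates: both values should coincide (H_4 / mu).
    print(mean_max_exponentials([mu] * 4), harmonic(4) / mu)
    # Two different rates: 1/1 + 1/2 - 1/(1 + 2).
    print(mean_max_exponentials([1.0, 2.0]))
```

Enumerating subsets keeps the sketch transparent for small arrays; a recurrence such as the one derived in Section 2 would be the natural choice for larger $n$.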

RAID storage systems and existing analytical models are briefly reviewed in Section 3 and the results of Section 2 are then used in our new multi-level RAID performance model in Section 4. This model assumes Poisson external requests but allows general disk seek, latency and transfer times. We determine the higher moments of the queueing time in the $M / G / 1$ queue by differentiating its Laplace–Stieltjes transform at the origin. The second moment is then given in terms of the third moment of the service time, which is obtained in turn from the assumed distributions of seek time, rotational latency and block transfer time. Detailed studies of their principles of operation show that RAID levels 0–1 and 5 produce quite different demands on the disks in the array for each type of input-output access. This difference is amplified in the corresponding queueing times; it is seen in both the explicit simulation of the physical systems’ operation and in the calculation of mean and variance of queueing time in the analytical model.
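The dependence of the second moment on the third moment of service time can be sketched with the standard first-come-first-served $M / G / 1$ moment formulas obtained from the Pollaczek-Khinchine transform. This is a minimal illustration under stated assumptions, not the paper's derivation: the function names are hypothetical, "waiting time" here means the delay before service starts, and the response (sojourn) time adds an independent service time.

```python
def mg1_waiting_moments(lam, s1, s2, s3):
    """First two moments of the delay before service in an FCFS M/G/1 queue.

    lam        : Poisson arrival rate
    s1, s2, s3 : first three raw moments of the service time
    """
    rho = lam * s1
    if not 0.0 <= rho < 1.0:
        raise ValueError("unstable queue: need 0 <= lam * s1 < 1")
    # Pollaczek-Khinchine mean delay.
    w1 = lam * s2 / (2.0 * (1.0 - rho))
    # Second moment of delay: this is where the third service moment enters.
    w2 = 2.0 * w1 ** 2 + lam * s3 / (3.0 * (1.0 - rho))
    return w1, w2


def mg1_response_moments(lam, s1, s2, s3):
    """Mean and second moment of response time T = W + S.

    Under FCFS the delay W and the customer's own service time S are
    independent, so E[T^2] = E[W^2] + 2 E[W] E[S] + E[S^2].
    """
    w1, w2 = mg1_waiting_moments(lam, s1, s2, s3)
    return w1 + s1, w2 + 2.0 * w1 * s1 + s2


if __name__ == "__main__":
    # Sanity check against M/M/1 with lam = 1, mu = 2, where the response
    # time is exponential with rate mu - lam.
    mu = 2.0
    print(mg1_response_moments(1.0, 1 / mu, 2 / mu ** 2, 6 / mu ** 3))
```

In the exponential check the response time has rate $\mu - \lambda = 1$, so its mean and second moment should be 1 and 2 respectively, which gives a quick consistency test before supplying service moments from a disk sub-model.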

The accuracy of the model is assessed in Section 5 by comparing the analytical predictions with a simulation of the actual system at the operational level. The quantitative results are presented as graphs of mean system response time against traffic intensity, showing generally good agreement and hence providing justification for our approach. The validity of the assumption of Poisson arrivals was tested numerically by comparing with simulation models with non-Poisson input; the simulation output (mean response time) shows little change. This is consistent with the commonly observed robustness of the Poisson assumption for external arrivals. In addition, in Section 6 we further investigate possible causes of inaccuracy in the model’s approximations. We isolate two possible sources, apart from the precision of the mean–max algorithm assessed in Section 2.4: (a) the representation of the delay at a single disk as the response time in an $M / G / 1$ queue; and (b) the effect of assuming such response times are independent when arrivals actually occur simultaneously. The paper concludes in Section 7 with a summary of the present contribution, open questions and suggestions for further research.

Section snippets

Maximum of random variables

Suppose a task forks into a number of subtasks that are processed in parallel independently. The task’s completion instant is that of the last subtask to complete processing, whereupon the subtasks combine (join) to re-form the original task. The fork-join time of the task, i.e. the time elapsed between the fork instant and the join instant, is therefore the maximum of the subtasks’ processing times. In a Markovian environment, we derive the following:

Proposition 1

The maximum of $n$ independent, negative

RAID storage system

A RAID storage system consists of a disk system manager and a collection (array) of independent disks. The disk system manager is a software component; it receives requests from the multiple system users. These requests are considered logical because they are independent of the physical configuration of the storage system. Requests may arrive from different users at various rates $\lambda_{j}'$. The disk system manager subdivides the data into blocks called stripe units and distributes them across the

The multi-level RAID analytical model

Our aim is to determine the mean logical request response time for data stored according to RAID0-1 and RAID5 patterns in a single multi-level RAID storage system. We consider relevant hardware parameters and requests’ execution schedules, for which we give task graphs to highlight the one or two synchronization points. We then determine the mean logical request response time using the fork-join model of Section 2 in an $M / G / 1$ queueing context.

Results and discussion

In order to validate our model and assess its accuracy, we developed a detailed event-driven simulator. This simulator is written in C and is composed of three main parts. The first part is a logical request generator, which uses standard random number generation functions to produce inter-arrival times for the logical requests with arbitrary probability distributions. The second part is a logical to physical mapping, which contains all the physical request generation functions. This part deals

Sources of approximation

We have already investigated in Section 2.4 one possible source of inaccuracy in our model, namely the mean–max approximation of Section 2.3, which is exact only for parallel exponential delays. We concluded that the approximation is likely to be poor only for coefficients of variation (ratio of standard deviation to mean) much less than one. Fortunately, this is the least likely scenario: file access times are notoriously variable and sometimes even have heavy-tailed distributions.

However,

Conclusion

We have developed a new, efficient approximation to compute the mean duration of certain synchronized fork-join operations. The approximation is exact in the case of exponentially distributed constituent delays, where exact results were also obtained for higher moments and the Laplace transform of the duration’s probability density function itself. Using these results, quite intricate analytical models, based on simple queueing theory, were derived, which take into account the detailed

References (23)

S. Chen et al.
The design and evaluation of RAID5 and parity striping disk array architecture
Journal of Parallel and Distributed Computing
(1993)
A. Merchant et al.
An analytical model of reconstruction time in mirrored disks
Performance Evaluation
(1994)
E. Bachmat, J. Schindler, Analysis of methods for scheduling low priority disk drive tasks, in: Proc. ACM Sigmetrics,...
The RAID Advisory Board
The RAIDBOOK: A Source Book for RAID Technology
(1993)
H. Bohnenkamp et al.
The mean value of the maximum
S. Chen et al.
A performance evaluation of RAID architecture
IEEE Transactions on Computers
(1997)
S. Chen, Design, modeling and evaluation of high performance, Ph.D. Thesis, University of Massachusetts, September...
G. Gibson, D.A. Patterson, R.H. Katz, A case for redundant arrays of inexpensive disks (RAID), in: Proc. SIGMOD...
G. Gibson, D.A. Patterson, P.M. Chen, R.H. Katz, Introduction to redundant arrays of inexpensive disks (RAID), in: IEEE...
J. Xu, E. Varki, A. Merchant, X. Qiu, An integrated performance model of disk arrays, in: Proc. International Symposium...
J. Xu et al.
Issues and challenges in the performance analysis of real disk arrays
IEEE Transactions on Parallel and Distributed Systems
(2004)


Peter Harrison is currently a professor of computing science at Imperial College, London where he became a lecturer in 1983. He graduated at Christ’s College Cambridge as a Wrangler in Mathematics in 1972 and went on to gain Distinction in Part III of the Mathematical Tripos in 1973, winning the Mayhew prize for Applied Mathematics. He obtained his Ph.D. in Computing Science at Imperial College in 1979. He has researched into stochastic performance modelling and algebraic program transformation for some twenty years, visiting IBM Research Centres during two summers. He has written two books, had over 150 research papers published and held a series of research grants, both national and international. The results of his research have been exploited extensively in industry, forming an integral part of commercial products such as Metron’s Athene Client–Server capacity planning tool. Currently, his main research interests are stochastic process algebra, where he has developed the RCAT methodology for finding separable solutions, response time analysis and optimization of fluid-based models. He has taught a range of subjects at undergraduate and graduate level, including Operating Systems: Theory and Practice, Functional Programming, Parallel Algorithms and Performance Analysis.

Soraya Zertal is a Lecturer in the Architecture and Parallelism research group in the PRiSM Laboratory at the University of Versailles, France. She obtained her Ph.D. in Computing Science from the University of Versailles in 2000, a Master's degree from the University of Versailles in 1996 and an engineering degree from the University of Constantine in 1993. Her research interests include parallel architecture and storage systems, specifically algorithms for data placement, and performance modelling using both simulation and analytical methods.
