Multilevel Monte Carlo for Reliability Theory

As the size of engineered systems grows, problems in reliability theory can become computationally challenging, often due to the combinatorial growth in the cut sets. In this paper we demonstrate how Multilevel Monte Carlo (MLMC) - a simulation approach which is typically used for stochastic differential equation models - can be applied in reliability problems by carefully controlling the bias-variance tradeoff in approximating large system behaviour. In this first exposition of MLMC methods in reliability problems we address the canonical problem of estimating the expectation of a functional of system lifetime and show the computational advantages compared to classical Monte Carlo methods. The difference in computational complexity can be orders of magnitude for very large or complicated system structures.


Introduction
It can prove to be computationally intractable to perform classical reliability analysis of very large engineered systems when the number of cut (path) sets grows combinatorially. It is well understood that working instead with subsets of the cut (path) sets or bounding structural designs can provide probability bounds in many reliability problems [4], but such bounds can be crude or may not be well characterised at all.
Evaluation of the reliability of engineered systems is a crucial part of system design and often scenario planning may involve repeated evaluation of the reliability for changing system configurations or component types meaning rapid simulation is highly desirable. For simplicity of exposition we herein consider the canonical problem of estimating the expectation of a functional of system lifetime both with and without a component repair process, showing the approach developed is easily generalised to other reliability problems which depend on cut (path) sets for the analysis.
In the case of static reliability analysis, there are many methods aside from Monte Carlo simulation using the cut (path) sets, including Sum of Disjoint Products (SDP) methods [22,30,27] and methods based on Binary Decision Diagrams (BDD) [25] or multistate BDD extensions [29]. On the other hand, these approaches are less prevalent in dynamic reliability problems where there are component dependencies, for example through system shocks, repair or maintenance programmes, and cascading failures among others. There have been recent developments in dynamic fault trees [20,26,21] which apply where event sequence ordering influences the reliability, including repairable systems [19]. When there are arbitrary dependencies, the most generally applicable approach is direct Monte Carlo simulation (e.g. [7]), so that acceleration of Monte Carlo techniques is important to address a broad range of the most complex reliability scenarios. Monte Carlo acceleration through importance sampling [15], or the use of control variates [28] have been suggested in the context of reliability estimation, but they are either restricted to the static case and require regular updates and sorting of all the cut sets (as for [15]), or could be combined with the MLMC paradigm (as for [28]).
Indeed, also note that interest may not be in the reliability at a particular fixed mission time, but instead in: some expectation of a functional of system lifetime; or in ascertaining a quantile of system lifetime (i.e. the time to which one is 99.9% certain the system will survive); or in estimation of the entire system lifetime distribution. In these situations Monte Carlo methods are typically the only tractable approach.
Multilevel Monte Carlo (MLMC) methods -pioneered by Heinrich [14] and Giles [10] -are now standard for estimation of expectations of functionals of processes defined by stochastic differential equations (SDEs). However, the MLMC approach is in fact a general paradigm for accelerating any Monte Carlo based method (whether standard, importance sampling, etc), if one can link the accuracy of the estimator with the complexity of generating a sample, while at the same time controlling the variance of the difference for approximations with different accuracy. The main contribution in this paper is development of a Multilevel Monte Carlo (MLMC) approach to reliability problems. In this way, we show how any reliability problem using Monte Carlo simulation over cut (path) sets can be substantially accelerated, extending the size of systems and complexity of dependencies which are within reach for reliability evaluation. In Section 2, we recap the traditional cut set method of simulating system lifetimes which does not scale well to large systems even when the cut sets are known. This motivates the approach taken in this work. In Section 3 we recap standard Monte Carlo theory and set out the error and computational cost metrics which will enable comparison with our MLMC based approach. The fundamental MLMC methodology and our adaptation to the reliability setting then follow in Section 4, before numerical results demonstrating the kind of substantial computational improvements which can be achieved are covered in Sections 5 and 6.

Simulating system lifetimes
Consider a coherent system with n components. Let x 1 (t), . . . , x n (t) denote the operational status (1 = working, 0 = failed) of the components at time t and consider the random variable for the lifetime of component c to be T c ∼ F c (t), where F c (·) are positively supported lifetime distributions which are not necessarily independent or identical. We will depict a system as an undirected network comprising a set of nodes (vertices) S, and a set of edges E, where nodes are considered unreliable and edges are perfectly reliable (note that any setting with imperfect edge reliability can be easily transformed to a corresponding representation where they are perfectly reliable [2]). The system is considered to be functional as long as there is a path from left to right which passes only though functioning nodes, see Figure 1. This is usually represented mathematically by the structure function, φ : {0, 1} n → {0, 1}, which maps component status to system status.
Herein, our focus is on an equivalent means of evaluation based on cut sets. A set of components, C, is said to be a cut set of the system if the system is failed whenever all the components in C are failed. A cut set is said to be a minimal cut set if no subset of it is also a cut set. Then, the set of all minimal cut sets, C, characterises the operational state of a system completely and is equivalent to knowledge of the structure function [6]. In addition to the cut sets characterising the operational state of the system given the binary operational state of the components, they also immediately provide the system failure time if the individual component failure times are known [5]: Thus, the failure time for the system depends on the system structure and the failure time distributions for each node.
The traditional approach to estimating the expectation of a functional of the lifetime of a system given the lifetime distributions of the components is to perform a simple Monte Carlo simulation. That is, The overall runtime for this approach depends on three quantities: 1. Variance of the estimator. Due to the random nature of component failure times, the estimator is a random variable: higher variance estimators will require more iterations to achieve an accurate estimate; 2. Target accuracy of the estimate. Naturally, the higher the desired accuracy, the longer the algorithm will take due to more iterations being required; 3. Number of cut sets. As the system size grows, the number of cut sets has a combinatorial growth, making the approach impractical for very large systems.
Less brute force approaches are possible with the restrictive assumption of iid components by making use of the system signature [18,23]. More recent work on the survival signature [8] has generalised the signature to multiple types of component, with the weaker assumption of exchangeability between components. However, if a large number of the components are of different types or there are highly dependent failures, then the survival signature will also grow exponentially in complexity. It can also accommodate a repair process [9] through expression as a new component type, though this increases the complexity if too many repairs occur. Hence, in this work, we first address the most general possible setting in which any form of component lifetime and dependence structure is allowed, requiring only knowledge of component lifetimes and the cut sets of the system. However, note that it should be possible to specialise this approach to work with the survival signature which we hope to address in future research.

Monte Carlo algorithms
To simplify presentation, hereinafter we only consider estimating expected failure time directly, rather than some functional of the failure time, though this is mostly without loss of generality (see Section 4 for details). Therefore, assume that for a given system S, we want to estimate the expected failure time.
. There are many approaches to simulation which may differ in terms of convergence to the true value as well as computational characteristics. In order to compare them, we present some useful cost and error expressions in the following subsection.

Performance measures: error and cost definitions
We start by defining the two main quantities, which will be used throughout this paper to compare methodologies. Given an estimatorT S of the quantity ET S , the Mean Squared Error (MSE) of any Monte Carlo based method is: The classical decomposition of the MSE yields: is the squared bias error, while E T S − ET S 2 is the error due to Monte Carlo variance. The first is a systematic error arising from the fact that we might not sample our random variable exactly, but rather use a suitable approximation, while the second error comes from the randomised nature of any Monte Carlo algorithm. For example, in traditional Monte Carlo applications, one samples exactly so that the first error is zero and only the Monte Carlo variance needs to be treated carefully.
The cost of any Monte Carlo based algorithm is typically taken to be the expected runtime in order to achieve a prescribed accuracy. A more convenient approach for theoretical comparison between different methods is to define cost = E(#random number generations and operations).
We now recap traditional Monte Carlo and then introduce Multilevel Monte Carlo, in each case highlighting their corresponding results for these two measures of performance.

Traditional (or single-level) Monte Carlo algorithm
The traditional Monte Carlo estimator is based on N replications of simulating the system lifetime, via the minimal cut sets, by simulating the component lifetimes. That is, given system simulations For reasons that will become clear in the sequel, it is useful to refer to this as the single-level Monte Carlo algorithm because it emphasises the relationship to Multilevel Monte Carlo. This single-level Monte Carlo estimate has variance proportional to N −1 , The estimator (2) is clearly unbiased, because there is no approximation involved in estimating the failure time, so the error measure introduced earlier only has this second variance term, Indeed, more generally the well known central limit result for standard Monte Carlo means that: for Z ∼ N(0, 1). Thus, for a desired level of accuracy ε > 0 with confidence level 1 − α, we require Monte Carlo simulations, where the quantile z α/2 is chosen to ensure that P(Z > z α/2 ) = α/2. Naturally z α/2 is a constant for any fixed level of confidence, so the variable compute costs in simulation are where #C denotes the number of minimal cut sets for the system.

Multilevel Monte Carlo
To simplify presentation we again only consider estimating expected failure time directly, rather than some functional of the failure time. Note that there is no loss of generality, so long as the functional of interest is Lipschitz continuous (or bounded for discrete measures). The most common functional of interest that this would exclude is computing expectations of quantiles. However, this problem can be treated with a smoothing approach, as discussed for the MLMC setting in [13]. In all other cases, the presentation below carries over in the natural fashion.

General MLMC
We first introduce MLMC in generality before specialising this to the reliability problem considered herein. Consider a sequence of estimators T 0 , T 1 , . . . , which approximates T L with increasing accuracy, but also increasing cost. By linearity of expectation, we have and therefore we can use the following unbiased estimator for E[T L ], The inclusion of the level in the superscript ( , n) indicates that the samples used at each level of correction are independent, but crucially note that the differences themselves use common samples. Note the terminology 'correction' arises from the fact that each T is generally not an unbiased estimate any more. Let V 0 and cost 0 be the variance and the expected cost of one sample of T 0 , and let V , cost be the variance and expected cost of one sample of T − T −1 . Then the overall expected cost and variance of the multilevel estimator are L =0 N · cost and L =0 N −1 · V , respectively. More generally, this means that provided the product V · cost decreases with (i.e. the cost increases with level slower than the variance decreases), then one can achieve significant computational savings, which can be formalised in Theorem 1 from [10]. Theorem 1. Let T S denote a random variable, and let T denote the corresponding level numerical approximation.
If there exist independent estimators Y based on N Monte Carlo samples, and positive constants α, β, γ, c 1 , c 2 , c 3 such that α ≥ 1 2 min(β, γ) and then there exists a positive constant c 4 such that for any ε < e −1 there are values L and N for which the multilevel estimator has a mean-square-error with bound Remark 2. We will informally illustrate the idea behind MLMC on a simple example with just two levels. Consider just two approximations T k and T L , where k < L, with sample costs cost k < cost L . It is clear, that the cost of one sample for T L − T k is roughly cost L . Now assume, that We see that the overall cost of Monte Carlo estimators, according to (3), can be expressed as In other words, provided there exists a good coupling between estimators T k and T L , we have reduced computational cost even for two levels. The twolevel Monte Carlo method in the context of Monte Carlo path simulation has been suggested and analysed in [16].
Remark 3. Multilevel provides the greatest benefit when β ≥ γ, because this is the case for which we get the best asymptotic performance. γ represents the parameter for the exponential increase of the cost of producing a sample, while β corresponds to the parameter for the exponential decay in the variance of the sample at a given level. There are 3 cases: β < γ. The number of samples required by the MLMC estimator decays at a slower rate than the increase of the sampling cost at each level. In this case the overall cost is proportional to the cost of the last level. β = γ. This is most common in practice [12]. The decay of the variance is balanced with the increase of the cost, therefore the contribution to the overall cost is the same from all the levels.
β > γ. In this most desirable case, the overall cost is dominated by level l = 0, since consecutive levels will have a decaying contribution to the cost.
When not available analytically, estimation of α, β and γ is usually done by empirically regressing using diagnostic quantities in the manner demonstrated in Sections 5 and 6 for our examples.  [10] for estimating expectations of functionals E(f (X t )), where X t is the solution of a stochastic differential equation. In the general Multilevel Monte Carlo path simulation setting, T from Theorem 1 is the functional value, evaluated via an approximation arising from a discretisation method, e.g. the Euler-Maruyama method [17].

MLMC for system reliability
Theorem 1 suggests that one may want to try getting a coarser Monte Carlo estimate of the system lifetime, perhaps by considering only a subset of the collection of minimal cut sets.
On its own T S is a biased estimator, so although a traditional single-level Monte Carlo estimator based on it may have lower computational cost, it will have increased MSE because the first error term in (1) can no longer be ignored. However, by using this coarse estimate inside an MLMC approach, we aim to improve the overall performance. To this end we introduce the sequence of estimators T 0 , . . . , T L based on a nested sequence of minimal cut sets, C 0 ⊂ · · · ⊂ C L = C. Note that here T L ≡ T S , which is not typically true in a general MLMC setting.
The crucial ingredient is the finite telescopic sum As described above, we independently estimate each term, and within each term, T and T −1 use the same random component simulations: with each level having cost being bounded from above by c · Var(Y ) · #C .
Here c is a constant independent of and the desired target accuracy. We choose #C -the number of minimal cut sets at level -to be a proxy for the upper bound on the cost of each level, because for a fixed system the number of elements in each minimal cut set is independent of the target accuracy. In other words, as we double the number of minimal cut sets in each level, their number is a straightforward way to construct a meaningful and easy upper bound for the cost of one sample.
Thus, the overall MLMC variance is at a cost of L =0 N · #C . Therefore, given a fixed target accuracy (variance), if we choose a sample size N ≈ Var(Y )/#C l on each level, this will minimise the computational cost. That is, for a desired accuracy ε > 0, the overall cost is: This means that: 1. If we have a good coupling between the approximations, or equivalently Var(Y ) decays rapidly, then we can achieve considerable savings compared to the single-level Monte Carlo algorithm.

2.
Additional savings are possible if we do not calculate all the levels Y , but rather stop the algorithm early. This introduces a (small) bias, but substantially decreases the overall computational cost. As long as the bias is quantified -and when combined with the estimator variance gives a MSE (1) below our target accuracy -then we can still solve the original problem at much lower cost.
Our proposed application to reliability involves a nested sequence of minimal cut sets providing improving accuracy: so the possible gain from stopping early depends on the way the minimal cut sets are chosen at each level.

Level selection algorithm
The first point to note is that existing Multilevel Monte Carlo literature has shown that anything less than geometric decay in the cost of computation at each level leads to suboptimal gains, see [12]. Therefore, we pre-specify that level contain #C/2 L− minimal cut sets. The levels will be grown from = 0 up, adding in those minimal cut sets which are in some sense optimal for the next level. Thus, we specify level selection in an inductive fashion. Note, we initially ignore the possibility of repair for simplicity of presentation and address the changes involved to accommodate repair in Section 6.2, when we demonstrate MLMC for repairable components.

Selection of level = 0
Level 0 will be simulated most frequently since it is the lowest cost. Therefore, optimal choice of this level is straightforward: it should contain the minimal cut sets which provide the best approximation to T S . That is, we wish to assign to level 0 the minimal cut sets which have smallest expected failure time, since these will most frequently be the causes of system failure. To achieve this, we propose an initial highly crude estimate by performing a pilot standard Monte Carlo simulation of N lifetimes of each component in the system, using these to generate N realisations of the failure time associated with each minimal cut set, The cut sets corresponding to the smallest #C/2 L of these η i are then chosen to form C 0 .
Selection of level > 0 Given that we have chosen levels 0, . . . , − 1 already (that is C 0 ⊂ · · · ⊂ C −1 are now fixed), we need to select which cut sets to add from C trial = C\C −1 . To maximise the performance of MLMC we would like to select the cut sets such that E[T −1 − T ] → max . In other words, choose so that the contribution from each level is as large as possible in the smallest levels, leading to a rapid decay in the size of the contribution in each level and hence the possibility of terminating the algorithm early. In particular, note that if L =k E(T − T −1 ) ε, then levels k, . . . , L need not be simulated at all, so that ensuring all large differences occur in early levels is highly desirable.
Notice that: for any C ∈ C trial . So our choice for sorting cut sets is motivated by the minimisation of the upper bound for the increments at each level, which we can implement for any level in a simple way: • use the N samples and calculated failure times for all cut sets used in selecting = 0, • calculate the following estimates: for each C k ∈ C trial .
• Sort δ k in a descending order and add cut sets corresponding to the largest values for δ k to C until #C = #C/2 L− .
This choice of number of minimal cut sets on each level guarantees an exponential increase in the cost, corresponding to γ = 1 in Theorem 1.

Full Multilevel Monte Carlo algorithm for reliability
One of the key features of the Multilevel Monte Carlo algorithm is its ability to naturally provide stopping criteria for an optimal selection of the number of levelsL to actually simulate, which we illustrate now along with the full description of the algorithm. For more advanced approaches to implementation of Multilevel Monte Carlo we refer to [11,3,12].
According to Theorem 1 and the first assumption in it, we have asymp- so that a natural stopping criteria is to chooseL minimal such that |YL| ≤ ε/2.
Input: Required accuracy ε and level specification as per §4.2.
1. Set the initial number of levels toL := 2. In order to define the optimal number of samples at each level we need to estimate the variances on levels = 0, 1, 2 2. Compute N := 100 samples on levels = 0, 1, 2 3. Estimate Var(Y ) and update N for each level = 0, . . . ,L: We take the maximum as it may happen that the numerical variance was initially overestimated and so more simulations were performed than necessary. If N has increased less than 1% on all levels, then skip to step 5.
4. Compute the additional number of samples on each level = 0, . . . ,L and return to step 3.

5.
Upon reaching this step we have converged in terms of MSE due to the variance. Next we test whether the bias error term is sufficiently small to terminate, or whether more levels and simulations are required. If |Y | ≥ ε: Then: setL :=L + 1 , Var(ŶL) = Var(YL −1 )/2 and return to step 3; Else: Terminate algorithm returning L =0 Y as the estimate.
There are two extreme cases to bear in mind. In the first case, we have only a few minimal cut sets (or even only one), which influence the failure time. This case is treated with the initial choice of the cut sets and selection as prescribed in §4.2 should ensure the minimal number of levels is simulated. The second case is when all the cut sets have similar 'weight' in determining the failure time, such as with a fully connected system with independent and identically distributed failure times for all components. This case is treated with the doubling of the number of cut sets with respect to the previous level, which again assures the mean and variance decay between the levels.
Remark 5. The Multilevel paradigm, with some slight modifications, also allows construction of efficient numerical algorithms (see [13]) for estimating the distribution function on a compact interval in a uniform norm. More specifically, one can costruct an algorithm which allows estimates in the norm , which guarantees the uniformity in the error for numerical estimates. This is the only Monte Carlo based approach whose estimates are functions, rather than finite dimensional entities, which has uniform norm as a measure of accuracy.

Systems and component reliability distributions
We generated many random systems to test the MLMC reliability method proposed hereinbefore. These random systems are generated by starting from the trivial one component system and with fixed probabilities either: • replacing a component with two components in series; • replacing a component with two components in parallel; • selecting two edges and inserting a 'bridging' component.
This allowed us to generate a wide range of different systems and in particular an increasing sequence of related systems with varying numbers of component.
For all systems we consider three test cases, where the components have Weibull distributed lifetimes with shape parameter k = 0.5, 1 or 3 and where the scale is chosen uniformly at random on an interval [2,10]. Variety in shape parameters corresponds to different applications in industry (see e.g. [24]). The shape parameter has a substantial effect on the corresponding density function.

Numerical results
We ran our algorithm 100 times for systems of different sizes with independent but differently distributed components, whose reliability is described above. In each case we considered fixed target accuracies of ε = 2 −4 and ε = 2 −7 , and computed the cost gains achieved for these fixed accuracies.

Shape parameter
The left top and bottom plots on Figure 2 show the result of diagnostic runs, where we tested the variance and mean decay, which correspond to assumptions (3) and (1) from Theorem 1 with β = 1 and α = 1 respectively (i.e. the slope of decay on the log-scale is −1). This indicates, that Multilevel Monte Carlo achieves the same convergence rate as traditional Monte Carlo in terms of accuracy ε, but can offer computational savings, due to the fact that most of the samples are calculated for very small subset of minimal cut sets. The right plot compares the differences in averaged costs for Multilevel Monte Carlo and standard Monte Carlo algorithms, which shows good savings even including the costs for initially selecting the cut sets for each level.

Shape parameter
The test case with k = 1 (Figure 3) gives us almost the same mean and variance decays along with computational gains as in the case with k = 0.5. One can see that in the last levels we see even super linear decay for the mean and variance, which indicates that the cut sets being added at those levels have very weak impact on system lifetime compared to those already chosen, which indicates good performance for the level selection algorithm.

Shape parameter k = 3.
The case with k = 3 ( Figure 4) shows substantial savings for ε = 2 −7 , as also seen in the previous examples, while still showing competitive results for ε = 2 −4 . The reason the gains are more modest here is that, as we had before, there is very small variance in the standard Monte Carlo and overall Multilevel Monte Carlo estimators, which puts more emphasis on the initial level selection costs which are not ε dependent, but are size dependent.

Numerical experiments, with repair process
To demonstrate the generality of the method and the substantial computational benefits available in more interesting scenarios, we consider the same 70 component system as generated in Section 5 with a repair process included. The components are again taken to have shape parameter k = 0.5 and uniformly distributed scale on [2,10], but now failed components are repaired according to an Exponentially distributed clock with rate λ = 0.05. Note that the final failure time of all components which lead to system failure cannot be sampled simultaneously any more, because repairs may change the state of the system en-route to ultimate failure. Indeed, the computational complexity of sampling is substantially greater due to the need to simulate the stochastic process of failure and repair, repeatedly testing after each system state change whether the system is still functional. Consequently, obtaining a single Monte Carlo sample may now require many passes over the collection of cut sets and moreover there is additional randomness in the runtime to simulate a single system failure time.
However, the Multilevel paradigm still applies and, as will be seen, even performs marginally better in this more complex scenario.

Repair process
The repair process is taken to be Exponential(λ = 0.05). Some full standard Monte Carlo runs show that this leads to a highly random number of repairs over the lifetime of the system as depicted in Figure 5. Note that this is not so much chosen to be representative of any real system, but is in fact faster repair than might be expected in order to increase the difficulty of the problem.

Level selection for repairable systems
The procedure described for non-repairable systems can be adapted to this case. The only minor adjustment required is to the level selection algorithm as described in Section 4.2.1.
Recall that the level selection procedure first involves determining a failure time for all cut sets. In the non-repairable case, this was simply a case of computing eq. (5) for a fixed collection of component simulations t (j) c . However, the stochastic process of failure and repair means that t (j) c depends on the cut set C i which causes failure, so that in principle for each j the stochastic process should be simulated continuously until all cut sets have failed at least once, with the first failure time being recorded as η i for each C i ∈ C. As such, very rare failure modes may result in essentially unbounded compute cost.
To address this, we propose simulation of the full stochastic process of failure and repair until the first instance of failure due to a cut set. We then simulate the conditional failure time of the still functioning components given survival to this time without further repairs taking place. Note that the exact behaviour beyond the initial cut set failure is therefore deemed less important: our primary goal is to establish cut sets which fail early, so exhaustive simulation is redundant. Clearly, these simulations cannot be used in the final estimate in the way they could for the non-repairable case.
In every other way, the level selection algorithm is the exact analogue of that for the non-repairable case, with the objective being to ensure rapid decay of the mean of each level. Figure 6 (left) shows a mean decay of order 1, which implies that the expected contribution to the system lifetime estimate halves compared to the contribution from the previous level. This strong decay, corresponding to α = 1 in Theorem 1, means that our sorting approach does capture the influence of the cut sets, thus allowing estimation of the failure time with reasonable accuracy, without spending computational effort on evaluating 'non-contributing' cut sets. Recall that the stronger the mean decay (i.e. the larger is the value α), the fewer levels we will need, thus the higher will be the computational gain. Figure 6 (right) shows the variance decay and indicates the quality of the coupling. As each level is more expensive to sample than the previous one, it is desirable to sample it less often, without destroying the overall variance. The parameter β quantifies this in a rigorous way, in our case having β ≈ 1, which implies that the variance on each level is half that on the previous level, hence halving the number of samples required.

Computational cost
Recall that the number of cut sets in a level was an accurate proxy for the computational cost in the non-repairable example (see p.10) and that γ = 1 could then be targeted by doubling the number of cut sets in each level.
However, in the repairable case this is no longer so, because the stochastic process of failure and repair adds a random element to the simulation runtime before 'system' failure, with it also depending on the cut set collection under consideration. Therefore, Figure 7 shows the distribution of empirical wall-clock runtimes to produce a single simulation for each level on a log scale, showing the desired growth in compute cost. Regressing log 2 (mean runtime) = a + γ results in γ ≈ 0.94. Crucially, this means that an exponential improvement in accuracy is achieved (α = β = 1), but with a little below exponential increase in cost (γ ≈ 0.94). This means MLMC actually provides marginally better performance gain in the repairable case than it did in the non-repairable case (where α = β = γ = 1).
When computing the cost in the repairable case, we can use the empirical mean compute time for level , κ say, instead of the number of cut sets in eq. (4). Then, for a target accuracy ε, the speedup provided by MLMC is characterised by: where L ε is the earliest level with mean less than ε. For varying ε this is depicted in Figure 8. For coarse estimates, speedups of upto 1, 000 times can be achieved. Note that the expected system lifetime is ≈ 205 with high  Figure 7: Actual runtimes to perform a single sample on each level for the repairable example. Note the log-scaled time axis. Density estimates composed from 108 replicates. Timings are for a single core of a c4.8xlarge Amazon EC2 compute instance using the AMI from [1].
variance ≈ 69, 000 -this was chosen as an extreme example to test MLMCs ability to handle very difficult estimation problems.
To put this in perspective, the time to achieve a Monte Carlo estimate to accuracy ε = 1.5 would take about 141 days on a single CPU core (or nearly 4 days using all cores of the c4.8xlarge Amazon EC2 instance used for testing). Multilevel Monte Carlo would achieve the same accuracy in 1 day, 4 hours on a single CPU core (or just 45 minutes using all cores of a c4.8xlarge).

Conclusion and future work
We have presented an exciting new application for the Multilevel paradigm for estimating the reliability of systems, which speeds up traditional Monte Carlo estimation of system lifetimes and provides a approach which can easily generalise to other reliability problems which involve cut (path) sets. We have demonstrated that the proposed approach to using MLMC in reliability problems achieves the strong mean and variance decay required to enable truncation of the number of levels which must be simulated. This is a very desirable feature for MLMC, as the standard approach has to go through all the cut sets for each sample, regardless of target accuracy. Moreover, we have demonstrated that harder problems (such as repairable systems) achieve slightly greater gains through just below exponential growth in the cost of simulating levels (γ ≈ 0.94) while still having exponential mean and variance decay (α = β = 1) Unlike classic MLMC implementations, where one considers different approximations of a certain stochastic process wherein all of them are biased, here we introduce approximations based on sorting the minimal cut sets in a special way, which are biased, but less costly to simulate. The numerical experiments show substantial savings for large systems and are promising for further study of reliability and structural optimisation. Indeed, the ease of extending from non-repairable to a repairable setting shows the flexibility of the approach and we anticipate it would be likewise straight-forward to incorporate shock and stress processes in a similar manner, benefitting ever more greatly from the acceleration offered by MLMC. Our own future work will include the extension of our approach in the spirit of [13], and expanding the applicability of the Multilevel paradigm to other algorithms established in the reliability community.