Integration of cost-risk assessment of denial of service within an intelligent maintenance system

As organisations become richer in data, the asset management function will increasingly have to use intelligent systems to control condition monitoring systems and organise maintenance. The UK rail industry anticipates having to optimize capacity in future by running trains closer to each other. In this situation maintenance becomes extremely problematic, because within such a high-performance network a relatively minor fault will affect more trains and passengers; such denial of service causes reputational damage for the industry and leads to fines being levied against the infrastructure owner, Network Rail. Intelligent systems used to control condition monitoring systems will need to optimize for several factors, and minimizing denial of service will be one of them. With schedules anticipated to become increasingly complicated, detailed estimation methods will be extremely difficult to implement. Cost prediction for maintenance activities tends to be expert driven and to require extensive detail, which makes such an activity difficult to automate. A stochastic process will therefore be needed to approach the problem of predicting the denial of service arising from any required maintenance. Good uncertainty modelling will help to increase confidence in the estimates. This paper details the challenges that the UK railway industry faces with regard to cost modelling of maintenance activities and outlines an example of a suitable cost model for quantifying cost uncertainty. The proposed uncertainty quantification is based on historical cost data and interpretation of its statistical distributions. These estimates are then integrated into a cost model to obtain accurate uncertainty measurements of the outputs through Monte Carlo simulation. An additional criterion for the model was that it be suitable for integration into an existing prototype integrated intelligent maintenance system.
It is anticipated that an integrated maintenance management system will exert significant downward pressure on maintenance budgets and reduce denial of service. Accurate cost estimation is therefore of great importance if the anticipated cost efficiencies are to be achieved. While the rail industry has been the focus of this work, other industries have been considered, and it is anticipated that the approach will be applicable to many other organisations across several asset-management-intensive industries.

Introduction
The UK rail industry is under intense pressure in terms of network capacity, maintenance budgets and asset reliability. The UK faces a particularly difficult challenge in modernizing the rail network because much of the infrastructure is of significant age. Anticipated rises in passenger numbers and operating services will increase the pressure on network capacity [1]. Against this background of rising asset usage, Network Rail is hoping to reduce maintenance costs [2]. To achieve both targets, asset down-time incidents will have to be reduced through the very best practice in asset management. It is expected that increasing use of autonomous systems will make a significant contribution towards reducing denial of service. While many view autonomous systems in terms of UAV drones or robotic systems, much of the impact from the widespread application of autonomous systems will be in the area of software-based decision support or decision making. The AUTONOM project is hoping to deliver much of the framework for such a system. AUTONOM is funded by EPSRC [3] under the autonomous and intelligent systems program (AISP), and is also supported by key UK industrialists, including Network Rail. AUTONOM seeks to enable effective, autonomous asset and maintenance decision-making in data-rich scenarios.
Uncertainty and risk are an integral part of cost engineering. Uncertainty and risk assessment is used in industry to show more clearly the possible ranges of values. Single point estimates (where a single value is presented as the estimate) can be misleading and give decision makers a false sense of certainty about the estimate. A three-point estimate is a popular method for presenting the least-costly, most-likely and most-costly estimates; however, while it gives information on the range of an estimate, it gives limited information on the shape of the probability distribution function.
If an organization has sufficiently detailed information, then variables can be assigned ranges and, through a process of curve fitting, probability distribution functions. The organization can then perform a Monte Carlo simulation to produce a more accurate estimate. Monte Carlo simulation is considered the industry best practice for dealing with uncertainty in data.
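As a minimal illustration of the step from three-point estimates to a Monte Carlo estimate, the sketch below samples each cost element from a triangular distribution and sums the draws. The triangular form, the function name and all figures are assumptions made for illustration only; they are not taken from the AUTONOM demonstrator or from Network Rail data.

```python
import random
import statistics

def simulate_total_cost(three_point_inputs, trials=10_000, seed=0):
    """Sample every cost element once per trial and return the summed totals."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for low, mode, high in three_point_inputs.values():
            # random.triangular takes its arguments as (low, high, mode)
            total += rng.triangular(low, high, mode)
        totals.append(total)
    return totals

# Hypothetical three-point estimates: (least costly, most likely, most costly)
inputs = {
    "labour": (800.0, 1000.0, 1600.0),
    "materials": (300.0, 400.0, 900.0),
    "denial_of_service": (0.0, 500.0, 5000.0),
}

totals = simulate_total_cost(inputs)
deciles = statistics.quantiles(totals, n=10)  # nine cut points, P10 to P90
print(f"mean total cost: {statistics.mean(totals):.0f}")
print(f"P20: {deciles[1]:.0f}  P80: {deciles[7]:.0f}")
```

Unlike a single point or three-point estimate, the list of simulated totals carries the full shape of the output distribution, so any percentile or range can be read from it.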
Cost estimation is not performed in isolation within the AUTONOM project; there is a focus on three technical areas: Data Fusion, Planning and Scheduling, and Cost Analysis. All are part of an integrated strategy that could lead to better decision support for maintenance activities. The Data Fusion approach consists of gathering suitable data from multiple sources so that the data can be merged to supply inputs to an automated planning and scheduling model. The scheduling model uses a genetic algorithm approach to generate many different solutions and search towards a more optimal solution. This process is made more complicated by the need to schedule multiple maintenance tasks into an ordered list. The schedule is then used by the cost analysis, to which suitable cost engineering best practices are applied. These combined approaches will enable decision making within an integrated framework.
A challenge of this work has been to formulate cost models that can work with limited information. The data flows within the demonstrator limit the information available for use in the model calculation, as shown in figure 2.

Figure 2 Cost analysis module I/O analysis
In the reported state of the demonstrator [5], the cost estimates generated were broken down into material costs, labour costs and denial of service costs. However, each of these cost estimates was a single point estimate.
While the data transitions and initial state of the prototype have been described elsewhere [5], the planned future developments are worth further consideration.
The planned architecture calls for internal feedback of results; specifically, cost estimates will need to be fed back to planning/scheduling and used in that process. Initial versions of the demonstrator calculated single point estimates; for accuracy's sake this should be expanded into a full Monte Carlo simulation. This work outlines the contributions towards improving the estimate for denial of service through Monte Carlo simulation.
Network Rail is keen to improve its current practices with regard to maintenance of its high-value assets throughout its entire rail network. In order to build frameworks and models, case studies are being built that highlight the challenges Network Rail faces and provide outline solutions. When these can be validated, it is anticipated that the continuation work will apply the lessons learnt to other project partners' maintenance challenges in the oil and nuclear sectors.

Uncertainty analysis
This case study is based on specific chosen maintenance activities related to points failures, track circuit failures, track defects, condition of track and many other infrastructure causes. A major gap in this research area is the identification of potential dependencies between these activities, their costs and the uncertainties in cost.
One aim of this work is to develop a framework to quantify uncertainty and integrate the risk of maintenance activities within a cost model. The cost model on which the study is based generates the cost of labour, materials and denial of service related to maintenance activities. Uncertainty measurements should be applied to all of these input parameters. A methodology is proposed for both uncertainty quantification and integration of the uncertainty model into the general cost model. Figure 3 shows the chosen methodology for this paper. This paper relates relevant points from the literature on the research subject to the identified research gap. The identified challenges were explored using a mind-map approach.

Uncertainty and Risk
Uncertainty and risk are two related terms which are often confused, and it is worthwhile to clarify their meaning. According to ISO 31000, "risk is often expressed in terms of the consequences of an event (including changes in circumstances) and the associated likelihood of occurrence" while "Uncertainty is the state, even partial, of deficiency of information related to, understanding or knowledge of an event, its consequence, or likelihood" [6].

Uncertainty classification and quantification
According to Miliken's work [7], the first step in analysing uncertainty is to identify its nature. It can be: state uncertainty, which is related to the impossibility of assigning probabilities to the likelihood of future events; response uncertainty, which is related to the outcome of a decision; or effect uncertainty, which reflects a lack of understanding of the impact of events or changes on the studied environment [8].
The work performed by Erkoyunku [9] categorises the different kinds of uncertainty measurement techniques as deterministic, qualitative or quantitative.
Figure 4: Uncertainty assessment methods [9]
It is stated that all these techniques are very efficient, with a particular focus on quantitative techniques such as probability theory, mathematical and statistical techniques, and Monte Carlo simulation, as well as deterministic methods such as sensitivity analysis. According to the literature [9], quantitative approaches are known as the best means of providing suitable information to facilitate decision making.
Variables can be expressed using probability distributions. The Monte Carlo method is based on probability theory and is used to explore complex probabilistic situations; it yields a suitable approximation of the studied system's probability distribution.
The data used in this study was taken from the TRUST system, which logs delays on the network. Each delay is attributed to a particular fault and to an owner. Typically this results in fines being passed between Network Rail and the train operating companies (TOCs). The available data records to whom the fault was attributed, the scale of the delay, the resulting fine and the cause code of the delay. The event of interest is therefore a delayed train, which can be further broken down into different events for each code; a cracked rail event can be differentiated from a flooding type of event. Analysis of some of the codes used in the integrated demonstrator, and quantification of the associated uncertainty, is therefore achievable. This work is a rudimentary analysis of risk, as different failure events can be quantified in terms of consequences and impact [6], but likelihood cannot be estimated from the provided data.

Methodology
A methodology for quantification of uncertainty and integration within the demonstrator's cost models is presented in Figure 5.

Figure 5: Methodology for uncertainty quantification and integration
Generally the data forming the histograms had high numbers of low-value faults, but a long "tail" of low-probability but high-cost events. This makes deciding bin size a more complex proposition. Generally, bin size started small and was increased until the number of unpopulated bins situated between occupied bins was minimised. This desire to minimise gaps could result in bins of too great a scale, and therefore this process currently relies on human judgement. However, a process based on mathematical rules would be simple to code when the framework is ready for full integration within the AUTONOM project framework.
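One way the bin-widening rule described above might be coded is sketched below. It is an assumed formalisation, not the procedure used in the study: the function names, the fixed step size and the `max_gaps` tolerance (a stand-in for the human judgement that stops bins growing too coarse) are all hypothetical, as are the delay values.

```python
def interior_gaps(values, width):
    """Count empty bins lying between the lowest and highest occupied bins."""
    occupied = {int(v // width) for v in values}
    return (max(occupied) - min(occupied) + 1) - len(occupied)

def choose_bin_width(values, start=1.0, step=1.0, max_gaps=0, max_width=None):
    """Widen bins from `start` until at most `max_gaps` interior bins are empty."""
    if max_width is None:
        # Never widen beyond the spread of the data itself
        max_width = (max(values) - min(values)) or start
    width = start
    while width < max_width and interior_gaps(values, width) > max_gaps:
        width += step
    return width

# Hypothetical delay minutes: many small delays and a long, sparse tail
delays = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 12.0, 30.0]

strict_width = choose_bin_width(delays)               # tolerate no interior gaps
relaxed_width = choose_bin_width(delays, max_gaps=2)  # tolerate two empty bins
print(strict_width, relaxed_width)
```

Insisting on zero gaps forces a much wider bin than tolerating a couple of empty bins in the tail, which is the over-coarsening risk noted above; the tolerance is where a coded rule would replace the analyst's judgement.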
Once histograms of the delay minutes data have been created, the methodology directs users to examine the data first for symmetrical distributions and second, if asymmetry is found, for correlation with known (and often used) distributions. If neither check gives a satisfactory fit, the process indicates that more than one distribution may be needed to explain the data. This would correspond to a situation where a hidden factor is at work. In this situation, knowing that the data contains a hidden factor could inspire further analysis, and using two curves to describe the data gives closer correlation between the Monte Carlo output and the available data.
The coefficient of determination (R²) is a number between 0 and 1 that measures the degree of correlation between two curves. It is used as the metric of choice for determining which curve form fits the data. The input data is the Denial of Service (DoS) delays, which are correlated with the cost. The fault types examined are those related to the maintenance activities studied throughout the demonstrator [5] created for the AUTONOM project.
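The selection of a curve form by coefficient of determination can be sketched as follows. The paper does not state which form of R² was used; this sketch assumes the common definition 1 - SS_res/SS_tot, which can fall below 0 for a very poor fit. The histogram values and the candidate parameters are invented for illustration.

```python
import math

def r_squared(observed, fitted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def lognormal_pdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def exponential_pdf(x, rate):
    return rate * math.exp(-rate * x)

# Hypothetical histogram of delay minutes: bin centres and frequency densities
centres = [2.5, 7.5, 12.5, 17.5, 22.5, 27.5]
density = [0.052, 0.089, 0.036, 0.014, 0.005, 0.002]

# Candidate curve forms with assumed (not fitted) parameters
candidates = {
    "lognormal": lambda x: lognormal_pdf(x, mu=2.0, sigma=0.6),
    "exponential": lambda x: exponential_pdf(x, rate=0.1),
}

scores = {
    name: r_squared(density, [pdf(x) for x in centres])
    for name, pdf in candidates.items()
}
best = max(scores, key=scores.get)
print(best, {name: round(score, 3) for name, score in scores.items()})
```

In practice the parameters of each candidate would themselves be fitted to the histogram before the scores are compared; they are fixed here to keep the sketch short.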

Results
Distributions of costs and delay minutes are analysed using the described methodology and modelled using the probability distribution(s) found to fit best. The work consists of finding the best fit between statistical histograms and known probability distributions. The closer the coefficient of determination is to 1, the more closely the curves are correlated. Figure 6 shows the results of such a method, with a blue curve representing a frequency distribution based on the data and a red curve representing the chosen best-fit probability distribution.

Figure 6: Correlation between distribution of delay data and best fit probability distribution
The coefficient of determination between these two curves is used as the metric for curve suitability. The resulting best-fitting probability distributions can be applied as inputs to a Monte Carlo simulation.

Figure 7: Histogram of data and calculated Probability Distribution Function
While some of the analysed faults were suitably approximated by a single Probability Distribution Function (see figure 7), the framework was open to the possibility of using two Probability Distribution Functions to fit the available data more accurately. Using the same data as in figure 8, the resulting cumulative probability curve in figure 9 reveals that the data has a significant range when compared to the cumulative probability curve generated from a log-normal probability distribution (see figure 10).

Figure 10: cumulative probability of log-normal function
Figure 10 shows the P20-P80 range (the range bounded by the values at cumulative probabilities of 20% and 80% respectively). Using P20 to P80 as indicative of the range of values reduces the influence of the long tails of some of the probability distributions and makes range values more sensible as a metric of interest. We see that the cumulative probability from multiple probability distribution functions can result in a wider P20-P80 range.
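The effect on the P20-P80 range can be illustrated with a small sampling sketch that compares a single log-normal distribution with a two-component mixture of Weibull and PERT distributions. All parameters and the mixing weight are assumptions chosen for illustration, not the fitted values behind figures 8 to 10, and the PERT sampler uses the standard scaled-beta form with a shape factor of 4.

```python
import random
import statistics

def sample_pert(rng, low, mode, high):
    """Draw from a PERT distribution expressed as a scaled beta distribution."""
    span = high - low
    alpha = 1.0 + 4.0 * (mode - low) / span
    beta = 1.0 + 4.0 * (high - mode) / span
    return low + span * rng.betavariate(alpha, beta)

def sample_mixture(rng, weibull_weight, scale, shape, pert_args):
    """Draw from the Weibull component with probability `weibull_weight`."""
    if rng.random() < weibull_weight:
        return rng.weibullvariate(scale, shape)  # arguments are (scale, shape)
    return sample_pert(rng, *pert_args)

def p20_p80(samples):
    deciles = statistics.quantiles(samples, n=10)  # nine cut points, P10 to P90
    return deciles[1], deciles[7]

rng = random.Random(42)
trials = 20_000
# Hypothetical delay minutes from one distribution and from a mixture of two
single = [rng.lognormvariate(2.0, 0.6) for _ in range(trials)]
mixed = [sample_mixture(rng, 0.7, 8.0, 1.5, (30.0, 60.0, 120.0))
         for _ in range(trials)]

lo_s, hi_s = p20_p80(single)
lo_m, hi_m = p20_p80(mixed)
print(f"single log-normal P20-P80: {lo_s:.1f} to {hi_s:.1f}")
print(f"Weibull + PERT   P20-P80: {lo_m:.1f} to {hi_m:.1f}")
```

With these assumed parameters the second component sits well above the first, so the 80th percentile of the mixture falls inside it and the range widens accordingly.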

Integration
Integration of the cost-risk information within the broader AUTONOM framework will be challenging. Prior to this work the existing cost model generated single point estimates. The integration architecture demanded feedback of cost information to the schedule modeller. Using a single-figure cost estimate suits the Genetic Algorithm approach, but as the real cost information is often more complex, deciding what information to pass back to the schedule model becomes challenging.

Integrated
Genetic Algorithm and Monte Carlo methodologies are mentioned within the available literature and applied to a range of problems. The work of Marseguerra and Zio [10] provides an example of an integrated Monte Carlo and Genetic Algorithm approach to maintenance. This very relevant example provides illustrative case studies of a chemical process plant and the optimisation of maintenance costs. The claim that a very modest number of simulation runs (at most 1000, but fewer in many cases) is adequate is not clearly demonstrated. Subsequent work, Marseguerra et al. (2002) [11], used only 100 repetitions for the Monte Carlo model. Such small sampling is said to be justified by the fact that the GA provides more repetition; the individual solutions might not carry much confidence, but the number of times the solution is found helps add confidence. This team released similar results in which the three best solutions found were re-examined by repeating the Monte Carlo simulation with 1x10^6 trials, and were able to satisfy themselves that the solution "seems to actually deserve the title of being 'optimal' or 'near optimal'" [12].
While much of this type of work was limited by the computational power available in 1999 and 2002 [10][11][12], this constraint will not be as strong now. We could reasonably evaluate each GA solution using an MC simulation with many more trials. As the number of trials will be strongly linked to the accuracy of the overall GA-MC optimisation, we will have to select it carefully to balance trial accuracy against speed of result delivery.
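The trade-off between trial count and accuracy can be made concrete with the standard error of a Monte Carlo mean, which shrinks in proportion to one over the square root of the number of trials. The sketch below is illustrative only: the long-tailed cost distribution and its parameters are assumptions, not demonstrator values.

```python
import math
import random
import statistics

def mc_mean_and_stderr(sample_cost, trials, seed=0):
    """Estimate a mean cost and the standard error of that estimate."""
    rng = random.Random(seed)
    draws = [sample_cost(rng) for _ in range(trials)]
    mean = statistics.fmean(draws)
    stderr = statistics.stdev(draws) / math.sqrt(trials)
    return mean, stderr

def downtime_cost(rng):
    # Hypothetical long-tailed denial of service cost for one schedule
    return rng.lognormvariate(6.0, 1.0)

results = {}
for n in (100, 1_000, 10_000, 100_000):
    results[n] = mc_mean_and_stderr(downtime_cost, n)
    mean, stderr = results[n]
    print(f"{n:>7} trials: mean = {mean:8.1f}, standard error = {stderr:6.1f}")
```

Each hundredfold increase in trials buys roughly a tenfold reduction in standard error, so the trial count per GA candidate can be chosen from the precision needed to tell competing schedules apart.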
Our work will have one significant additional complication not faced by previous GA-MC attempts: our denial of service costs (or "downtime penalty", as the previous GA-MC literature describes it [11]) cannot easily be fixed at a single value. Additionally, a Monte Carlo cost model will generate a lot of information, such as the most likely cost, the cost range and the highest possible cost. Depending on the application, the user may wish to take an option that is more expensive but has a lower cost range: a lower cost-risk option.
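One possible way to condense the Monte Carlo output into the single figure that a Genetic Algorithm expects is to combine the mean cost with the P20-P80 spread through a risk-aversion weight. This is a hypothetical scheme offered for illustration, not the metric adopted by the project; the schedule names, distributions and weight values are invented.

```python
import random
import statistics

def summarise(samples):
    """Reduce Monte Carlo cost samples to a mean and a P20-P80 spread."""
    deciles = statistics.quantiles(samples, n=10)
    return {"mean": statistics.fmean(samples), "spread": deciles[7] - deciles[1]}

def fitness(summary, risk_aversion):
    """Single-figure score for the GA (lower is better); the weight is assumed."""
    return summary["mean"] + risk_aversion * summary["spread"]

def pick(summaries, risk_aversion):
    return min(summaries, key=lambda name: fitness(summaries[name], risk_aversion))

rng = random.Random(1)
# Hypothetical cost samples for two candidate maintenance schedules
schedules = {
    "A_cheaper_but_uncertain": [rng.lognormvariate(7.0, 0.9) for _ in range(5_000)],
    "B_dearer_but_steady": [rng.gauss(1900.0, 100.0) for _ in range(5_000)],
}
summaries = {name: summarise(costs) for name, costs in schedules.items()}

print("risk-neutral choice:", pick(summaries, risk_aversion=0.0))
print("risk-averse choice: ", pick(summaries, risk_aversion=0.5))
```

A weight of zero reproduces a mean-only comparison, while a positive weight lets the more expensive but narrower option win, which is the lower cost-risk choice described above.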

Conclusions
This paper has provided details of the methodology to apply in order to find suitable cost uncertainty quantification, and the steps towards integration into an existing model. Many cost estimating techniques can be applied, such as probability theory, regression analysis or Monte Carlo simulation. Many of them are useful when designing a framework for uncertainty quantification and integration, and when applying it to maintenance cost models.
Due to the integration aspect of this project, the difficulty of integrating Monte Carlo simulations with Genetic Algorithms has been discussed. There remain several issues to resolve: both the architecture of the GA/MC integration and the technical difficulties with the integration. Selection of the metric of choice (either the range or the mean value of the cost estimate) will have to be completed after validation trials.
Issues of computational efficiency of the integrated Genetic Algorithm and Monte Carlo method are also of interest, as both are considered computationally expensive. We might find that the historical reliance on very small trial numbers for MC simulations is no longer needed. If combinatorial explosion remains a challenge, future work will be directed towards examining what processes can be used to make the Monte Carlo method more efficient, thereby reducing the computational effort required by the AUTONOM system as it deals with the many maintenance tasks required to keep the rail network performing.

Figure 3: Methodology of the uncertainty integration project

Figure 8: Combined Probability Distribution Functions
The Weibull and PERT distributions used to fit the behaviour of the available data more accurately are shown in figure 8.