Modeling Information System Availability by Using Bayesian Belief Network Approach

Modern information systems are expected to be always on, providing services to end users regardless of time and location. This is particularly important for organizations and industries where information systems support real-time operations and mission-critical applications that need to be available on a 24 × 7 × 365 basis. Examples of such entities include process industries, telecommunications, healthcare, energy, banking, electronic commerce and a variety of cloud services. This article presents a modified Bayesian Belief Network model for predicting information system availability, introduced initially by Franke, U. and Johnson, P. (in the article "Availability of enterprise IT systems – an expert-based Bayesian model", Software Quality Journal 20(2), 369-394, 2012). Based on a thorough review of several dimensions of information system availability, we propose a modified set of determinants. The model is parameterized using a probability elicitation process with the participation of experts from the financial sector of Bosnia and Herzegovina. The model was validated using Monte-Carlo simulation.


INTRODUCTION
In today's world, most business activities are associated with the use of information technology (IT). Information technologies enable and facilitate business, while, at the same time, the success of an organization is becoming increasingly dependent on the proper use of information technologies and on managing the risks associated with this dependency. The availability of the information system (IS) is an essential requirement that business presents to IT departments. Forrester pointed out that across all industries, there is less and less tolerance for any kind of downtime [1]. According to the Aberdeen Report, the average cost of an hour of downtime is 686 250 US$ for large companies, 215 638 US$ for medium companies and 8 581 US$ for small companies [2]. Gartner noted that 5 600 US$ per minute is an average cost of downtime [3]. Butler reported that a 49-minute failure of Amazon's services on 31st January 2013 resulted in close to 5 million US$ in missed revenue [4].
According to ITIL, availability is the characteristic of the IS to perform its agreed action at the request of an authorized user [5]. Availability in the broader sense implies that the information system is ready to serve end users, even in the event of unforeseen and catastrophic events. At the same time, the IS has to be protected from various security threats.
In this study, an information system is defined in a broader sense, as a combination of interrelated components that, through their interactions, deliver the desired output. That means that an information system, besides hardware and software, includes people, processes, culture and environment, which are all crucial for understanding, explaining and modeling availability and risk [6]. From the availability perspective, in this study, the system was regarded as a set of services that are available to end users. The total availability of the IS is obtained by aggregating the availability of the individual services that are components of the system. It is assumed that each service has an agreed-upon operation time (whether the service is to be continuously available 24 × 7 or only 8 × 5). The percentage of users affected by the outage of individual systems is also considered. This article contains a thorough review of the literature in the IS availability field, addressing the problem from various perspectives. The primary goal of this research is to determine the set of factors that have the greatest influence on IS availability in BiH financial institutions. The second aim of the research is to compare the results of model parameterization with the results of Franke and Johnson [7]. As IS availability depends on local factors, especially telecommunication and power networks, climate and seismic factors, this study aims to determine whether the factors that affect the availability of IS in Bosnia and Herzegovina are the same as for the US and Western Europe and whether they have equal 'weights'.

AVAILABILITY OF INFORMATION SYSTEMS
IEEE defined the IS availability as "The degree to which a system or component is operational and accessible when required for use. Often expressed as a probability." [8]. Rauscher defines availability as a measure of the readiness of the system to be used for the purpose for which it was designed, when needed [9]. The ISO 27000 series of standards tied availability to the concept of organizational assets. The asset is available if it is accessible and ready for use at the request of an authorized person. In the context of this standard, the assets include information system components, facilities, networks and computers [10]. Singh gave a more quantitative definition: "Ps-system availability of the observed system S, is the probability that a system is operational and ready to provide services. The Ps number should be as close to 100 % as possible. The usual way to represent the IS availability is counting nines. So 99,999 % availability is called five nines." [11]. Availability is also defined as a combination of three concepts: reliability, accessibility and timeliness [12].
Most commonly, IS availability is referred to as part of the CIA (confidentiality, integrity, availability) information security triangle [13][14][15][16][17][18][19][20]. Also, in the literature one can find the term 'resilience' of an IS, where this term implies that the system "must remain available and maintain an acceptable level of performance when faced with various types of errors affecting the normal operation" [21], which is very close to the original definition of availability. Bajgorić used this term as a synonym for business continuity [22]. Gaddum discussed resilience as an IT, organizational and business issue, and introduced a model with six layers of resistance: strategy, organization, processes, data and applications, technologies and facilities [23]. Schiesser viewed availability as the process of optimizing the readiness of production systems by accurately measuring, analyzing and reducing system downtime [24].
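The 'counting nines' convention can be made concrete by converting an availability percentage into the downtime it permits per year. The following is a minimal sketch; the function name and the assumption of a continuously operated (24 × 7 × 365) service are illustrative and not taken from the cited sources.

```python
def annual_downtime_minutes(availability_percent: float,
                            hours_per_year: float = 24 * 365) -> float:
    """Downtime per year (in minutes) permitted by an availability percentage.

    Assumes a service that is expected to run continuously all year.
    """
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * hours_per_year * 60.0


# Print the permitted yearly downtime for a few common availability classes.
for label, percent in [("two nines", 99.0),
                       ("three nines", 99.9),
                       ("five nines", 99.999)]:
    print(f"{label} ({percent} %): {annual_downtime_minutes(percent):.1f} min/year")
```

For a service with a shorter agreed operating window, `hours_per_year` would be replaced by the contractual service hours.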
Availability is expressed (measured) as the ratio of the time in which the system was available to the total time. The basic formula for the IS availability calculation is

$$A = \frac{\text{uptime}}{\text{uptime} + \text{downtime}} \qquad (1)$$

For a complex system, the uptime is obtained as the sum of the uptimes of all subsystems. For example, if a system provides the following services with the agreed times of availability: core banking 24 × 7, SWIFT 8 × 5, e-banking 24 × 7, m-banking 24 × 7, then the contractual annual system uptime is 52 × (24 × 7 + 8 × 5 + 24 × 7 + 24 × 7) = 28 288 hours. System downtime is collected for each service so that it takes into account the percentage of affected users. For example, the unavailability of a core banking service, affecting 5 % of users, for 20 minutes is equivalent to 1 minute of total unavailability of the core system.
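The arithmetic of this example can be checked with a short script. This is only an illustrative sketch: the dictionary layout and function names are assumptions, while the service windows and the 5 % / 20 minute outage are the figures quoted in the text.

```python
WEEKS_PER_YEAR = 52

# Agreed service windows from the example: (hours per day, days per week).
service_windows = {
    "core banking": (24, 7),
    "SWIFT": (8, 5),
    "e-banking": (24, 7),
    "m-banking": (24, 7),
}


def contractual_uptime_hours(windows: dict) -> int:
    """Sum the weekly agreed hours of all services and scale to one year."""
    weekly_hours = sum(hours * days for hours, days in windows.values())
    return WEEKS_PER_YEAR * weekly_hours


def equivalent_downtime_minutes(outage_minutes: float, affected_share: float) -> float:
    """Scale an outage by the share of users affected (0.05 means 5 %)."""
    return outage_minutes * affected_share


print(contractual_uptime_hours(service_windows))   # contractual hours per year
print(equivalent_downtime_minutes(20, 0.05))       # equivalent full-outage minutes
```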
Martin identified six major determinants of IS availability: physical security, audit and evaluation of the system efficiency, security policy, system monitoring and control of operations, business continuity management and backup management [25]. Franke and Johnson in their model [7] used 16 determinants that affect the availability of the system, based on the 'Index of availability' introduced by Marcus et al. [26]. Bajgoric identified eight kinds of threats to the IT infrastructure that affect business continuity [22]. Rauscher et al. proposed a model for the reliability of communications infrastructure and identified the following components that affect reliability: the human factor, policies, hardware, software, network, load, environment, and power. In 2001, EMC conducted an extensive study on 250 European companies from different industries and of various sizes, to identify causes of system disruptions. The study found that the main reasons for disruption are: failures in hardware, interruptions in electric power supply, software errors, downtime of reserve power supply, data errors, errors of external service network), operating system, physical environment and disasters, 40 % were application errors, and 40 % were due to human mistakes [28].
There are different recommendations for raising IS availability: Liu et al. [29], Raderius et al. [30], and Franke and Johnson [31] suggested improving the IS architecture, Martin recommended improving security policies [13], Gay suggested virtualization [32], Calzolari recommended clustering and virtualization [33], and Bajgoric et al. recommended the application of standards in IS governance [34]. Chen et al. suggested a strategy of diversification as a possible solution for reducing the IS unavailability caused by attacks on network resources [19]. Bell proposed the use of best practices in data center design to improve IS availability [35]. In a study conducted in 2009, IBM recommended the following technologies and processes to achieve high availability of the system: application management, availability management, capacity management, change management, measurement management, network management, performance management, service management level and service recovery management [36].

INFORMATION SYSTEMS AVAILABILITY MODELING
Raderius et al. [30] cited reliability block diagrams and Monte-Carlo simulations as the most frequently used reliability modeling tools. They identified an inability to express uncertainty and a high dependency of the model on the architecture of the modeled system as the major problems of these methods. Malek et al. classified availability modeling methodologies into analytical, quantitative and qualitative [37]. Quantitative models are based on measurements and are most often used to model the availability of hardware components of the IS. Research based on qualitative models is conducted less formally and utilizes questionnaires and interviews as primary modeling tools. As a result, such models assign an availability class to the IS. Trivadi et al. [38] distinguish qualitative and quantitative availability models. They defined qualitative models as models based on verbal descriptions and checklists, and quantitative models as stochastic models based on the hardware and software structure of the IS. Unlike most IS availability models, which represent availability as a binary variable (system available or not available), Tokuno et al. [39] modeled the availability of software-intensive systems in a way that recognizes declines in system performance as a condition that affects availability. As a modeling tool, they used the Markov process. [42]. A method for availability analysis based on Fault Tree Analysis is presented by Narman et al. [43]. Torabi, Soufi, and Sahebjamnia proposed a framework for conducting business impact analysis by using MADM techniques [44].

METHODOLOGY
BAYESIAN NETWORKS
In this research, we used Bayesian Belief Networks (BBN) as a tool for analyzing the factors influencing IS availability. Neil et al. wrote about the application of BBN to modeling the operational risk of IT in financial institutions [45]. As the main advantages of BBN, they noted that BBN enable a combination of statistical and qualitative data and map the causal structure of the process, thus making the model easier to understand and to communicate to business users. Using BBN one can: a) combine proactive indicators of losses with reactive results of measurements, b) take into account experts' judgments, c) work with incomplete data and still get a reasonable prediction, d) implement a robust scenario analysis, e) test the robustness of the results, f) have a tool for visual reasoning and help in documenting, g) carry out a comparative analysis of alternative scenarios and robustness testing, h) assess changes in the design of the IT infrastructure.
BBN are graphical models that combine graph theory and probability theory. Each BBN has two elements: a directed acyclic graph (DAG), which represents the structure, and a set of conditional probability tables (CPT). The nodes in the structure correspond to the observed variables, and the edges are formally interpreted as probabilistic dependencies between them; the absence of an edge encodes conditional independence.
CPT quantifies the relationship between the variable and its 'parent' in the graph [46]. Bayes' theorem is used for inference propagation so that the probability distribution can be quantified for each node, given the likelihood of an initial node and the CPT for all nodes. For two events A and B, Bayes' theorem states:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (2)$$

The interpretation of formula (2) is as follows: it is possible to calculate the conditional probability of event A, given event B, using the conditional probability of event B given event A, the probability of event A and the probability of event B.
Although Bayesian networks significantly reduce the number of parameters that need to be determined when specifying the joint probability distribution, the number of parameters in the model remains one of the major bottlenecks of this framework. One way to reduce this number is to assume a functional relationship that defines the interaction between all the parents of a node. The most widely accepted and applied solution to this problem is the Noisy-OR model [47]. The Noisy-OR model gives a causal interpretation to the interaction between the parent nodes and the child node. It assumes that all causes (parents) are independent of each other regarding their ability to influence the effect variable (the child). Given these assumptions, the Noisy-OR model provides a logarithmic reduction in the number of parameters required for the construction of the CPT (from 2^n to n for a node with n binary parents), which effectively makes the building of large models for real-life problems feasible. The Noisy-OR model assumes that the presence of any of the causes Xi is sufficient to produce the effect Y. The second assumption of the Noisy-OR model is that the ability of a cause Xi to produce the effect is independent of the presence of other causes. However, the presence of the cause Xi in the Noisy-OR model does not guarantee that the effect Y will happen. In practical models, a situation where the absence of all modeled causes ensures the absence of the effect almost never happens. To address that weakness of the Noisy-OR model, Henrion introduced the concept of a leak, or background probability, that allows modeling the impact of a combination of factors that are not included in the model [48].
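The parameter saving of the Noisy-OR assumption can be illustrated with a short sketch that expands n cause probabilities and one leak into the full conditional probability table with 2^n rows. The code uses the common product form of the leaky Noisy-OR; the function names and the numeric values in the usage lines are hypothetical, and the leak parameterization is an assumption (other conventions exist in the literature).

```python
from itertools import product


def leaky_noisy_or(cause_probs, present, leak=0.0):
    """P(Y = true | parent states) under the leaky Noisy-OR assumption.

    cause_probs[i]: probability that cause X_i alone produces the effect Y.
    present[i]:     whether cause X_i is present.
    leak:           background probability of Y when no modeled cause is present.
    """
    p_no_effect = 1.0 - leak
    for p_i, is_present in zip(cause_probs, present):
        if is_present:
            # Each present cause independently fails to produce Y with 1 - p_i.
            p_no_effect *= 1.0 - p_i
    return 1.0 - p_no_effect


def build_cpt(cause_probs, leak=0.0):
    """Expand n cause probabilities and a leak into all 2**n CPT rows."""
    n = len(cause_probs)
    return {states: leaky_noisy_or(cause_probs, states, leak)
            for states in product((False, True), repeat=n)}


# Hypothetical example: two causes and a 10 % leak yield a four-row table.
cpt = build_cpt([0.2, 0.5], leak=0.1)
for states, probability in cpt.items():
    print(states, round(probability, 4))
```

With thirteen parent nodes, the same function would generate 8 192 rows from only fourteen numbers, which is what makes expert elicitation practical.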
BBN have been widely applied in OpRisk, INFOSEC and availability modeling. Raderius et al. presented a case study where the availability of the information system was estimated using 'extended influence diagrams' combined with an architectural metamodel [30]. Hinz et al. presented a BBN model for assessing the risk of IT infrastructure. The parameters of this model were obtained using interviews with experts [49]. Weber et al. used influence diagrams for the economic analysis of IS availability [50]. Zhang et al. presented an innovative BBN-based model to improve the availability of the system, in which the data for the CPT were obtained from the system logs [56]. Bonafede reviewed statistical methods that can be used to model business continuity and gave an example of BBN use for that purpose [57]. Different BBN-based models were made in the area of software reliability [58][59][60] and the management of software development projects [61][62][63].
Franke and Johnson presented a model for decision support in the area of IS availability based on a Leaky Noisy-OR BBN [7]. The model parameters were obtained through probability elicitation from 50 experts in the IS availability field. That model, with modifications based on the theoretical part of the study, has been applied in this research. We propose a model that consists of thirteen variables representing thirteen domains affecting information system availability. Those variables are: the physical environment, availability requirements management, operations management, change management, backup management, storage redundancy, avoiding errors in internal applications, avoiding errors in external services, network management, equipment and location of the DR data centre, resistant client/server systems, monitoring of relevant components, and human resources management. If best practices are implemented in one or more of those domains, IS unavailability would be reduced.
The probability elicitation used for determining the model parameters was done by interviewing 23 experts dealing with IT systems availability in the financial sector in BiH. The research focused on information systems in the banking industry which, due to the presence of international and local regulations and regular audit reviews, have the necessary maturity level of IS governance to be suitable for modeling. During the elicitation, most experts agreed that the selection of variables in the model is adequate and that the model is comprehensive. Elicitation was conducted through structured interviews. In the first part of the meetings, experts were trained and calibrated, while in the second part experts filled in the questionnaire. The questionnaire consisted of three sets of questions. Experts were first asked to estimate the impact of individual variables on system availability. In the second question, experts gave their assessment of the situation in the areas described by the variables in the financial sector in BiH. To answer the third question, they estimated the investments necessary to bring the field represented by the variable to the level of best practices. As the system consists of several services provided to internal and external customers, overall system availability is defined as the average availability of each service weighted by a factor of importance of the service (for example, a different weight is given to a payment card authorization service and to a service that calculates fixed assets depreciation). The following equation was used for the availability calculation:

$$A = \frac{\sum_{i=1}^{m} k_i A_i}{\sum_{i=1}^{m} k_i} \qquad (3)$$

In formula (3), A represents the overall system availability, A_i represents the availability of service s_i, k_i represents the significance coefficient of service s_i, and m is the number of services.
When calculating the availability of a particular service, one should take into account the service operating time, defined in the service level agreement, as well as the number of clients affected by the service interruption. The availability of a particular service is calculated according to the following formula:

$$A_i = \frac{t_i\, n_i - ut_i\, un_i}{t_i\, n_i} \qquad (4)$$

In formula (4), t_i is the total time that service s_i was to be available under the service level agreement, ut_i is the total time for which the service was unavailable, n_i is the total number of service users, and un_i is the number of service users who experienced the service interruption during time ut_i.
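The service-level and system-level calculations described for formulas (3) and (4) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the outage of each service is scaled by the share of affected users, the overall availability is a weighted mean normalized by the sum of the weights, and the class, field names and sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    agreed_minutes: float   # t_i: agreed service time under the SLA
    outage_minutes: float   # ut_i: time during which the service was unavailable
    users: int              # n_i: total number of service users
    affected_users: int     # un_i: users who experienced the interruption
    weight: float           # k_i: importance coefficient of the service


def service_availability(s: Service) -> float:
    """Availability of one service, with the outage scaled by affected users."""
    lost_minutes = s.outage_minutes * s.affected_users / s.users
    return (s.agreed_minutes - lost_minutes) / s.agreed_minutes


def overall_availability(services) -> float:
    """Importance-weighted mean of the individual service availabilities."""
    total_weight = sum(s.weight for s in services)
    weighted_sum = sum(s.weight * service_availability(s) for s in services)
    return weighted_sum / total_weight


# Hypothetical figures for two services over one year.
services = [
    Service("card authorization", 52 * 24 * 7 * 60, 120, 10_000, 2_500, weight=5.0),
    Service("asset depreciation", 52 * 8 * 5 * 60, 300, 40, 40, weight=1.0),
]
print(f"{overall_availability(services):.5%}")
```

Normalizing by the sum of the weights keeps the result in the range 0 to 1 even when the coefficients k_i are not scaled to add up to one.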
According to the Leaky Noisy-OR model presented in Figure 1, the following formula applies to calculate the probability of IS availability.
In this formula, n represents the number of variables in the model, V_i represents the percentage improvement of the system's availability if best practices are applied, B_i represents the state of implementation of best practices in the different system components, k represents the transformation coefficient, and p_0 represents the leak, that is, the probability that the system is unavailable when best practices are applied in all domains included in the model.

RESULTS AND DISCUSSIONS
The research has shown that 'availability requirements management' has the greatest impact on availability (23,20 %), followed by 'operations management' (20,54 %) and the 'equipment and location of the DR data center' (19,52 %). The reduction of IS unavailability is least impacted by 'the physical environment' (10,53 %), followed by 'backup management' (11,05 %) and 'resistant client/server systems' (11,81 %). The research results showed that the state of implementation of best practices in the areas described by the variables ranges from 4,60 to 6,85 on a scale from 1 to 10, depending on the area. The worst situation is in the fields of 'monitoring of relevant components' (4,6) and 'availability requirements management' (4,94). The best state in the IS of financial institutions in BiH is in the basic infrastructure areas: 'backup management' (6,85), 'network management' (6,54), 'resistant client/server systems' (6,39) and 'the physical environment' (6,05). According to the results of this research, the perception of experts is that the state of the essential IS infrastructure elements, including the server room, server and network infrastructure, data redundancy and backup management, is much better than that of the process part, which includes change management, operations management, monitoring and requirements management. The assessed maturity level of backup management may explain why the experts estimated that implementing best practices in that area would have a small impact on reducing unavailability, as the situation in that field has been assessed as the best compared to all other areas that were part of the model. A similar explanation applies to the physical environment and server infrastructure. This was the main reason to include the assessed states of implementation of best practices as the prior probability for each parent node in the BBN-based 'Leaky Noisy-OR' model. The conditional probability table for the node that represents availability is filled based on a linear transformation of the elicited impact values. The model is set up assuming an initial system availability of 99 % and a leak of 0,01 %, which represents the unavailability of the system. Both of these parameters can be subsequently changed.
As part of the research, we compared the results with the study made by Franke and Johnson [7].
To be able to compare the results, it was necessary to transform the research results, since different methods of calculating the impact of variables on IS availability were used. Research findings and the comparison are shown in Table 1. The first column represents the effect of each variable on IS availability, where the resulting percentage is calculated as the mean of the experts' opinions. The second column represents the experts' opinions about the maturity level in the financial sector in BiH, using a scale of 1-10. The third column represents the research results where the resulting impact was calculated as the mode of the experts' opinions modified with adjacent intervals. The fourth column represents Franke's results. The fifth, sixth and seventh columns represent the ranking of the data presented in the first, third and fourth columns respectively. One of the disadvantages of the proposed model is the deterministic determination of parameters. In other words, each parameter in the network is set using the weighted mean values obtained in the elicitation process, which does not reflect the diversity of the experts' opinions. For this reason, the same mathematical model was implemented using Microsoft Excel and Oracle Crystal Ball software. The base values of the input variables were set using values from the CPT tables of the BBN model. However, each input variable was represented not only by the mean value but also by the entire distribution obtained in the elicitation process.
Figure 2 shows the distribution of the impact of the 'availability requirements management' variable on IS availability, as an illustration of how the distribution for each variable was modeled. In this way, we obtained the stochastic equivalent of the BBN model. We used this model to run Monte-Carlo simulations. The first simulation was run without optimization, just applying the distributions obtained by elicitation. Each simulation had a total of 10 000 trials. The resulting availability probability distribution and certainty intervals are shown in Figure 3 and Figure 4.
The diagrams in Figure 3 and Figure 4 show the stochastic nature of availability prediction. If there are 13 variables which can affect the availability and which are not at the best practices level, it is not possible to determine precisely the time and the effect that this weakness may cause. Thus it is not possible to predict the IS availability percentage accurately; rather, it is possible to predict that availability will be inside the predicted range with a particular certainty level.
According to the results of the simulation, the IS availability ranges from 98,33 % to 99,76 % with 90 % confidence for the case in which best practices are not applied. The mean and median values were 98,93 % and 98,97 % respectively, which is close to the initial assumption of 99 %.
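The general idea of such a simulation can be sketched in a few lines of Python. This is a stand-in for illustration only, not the Excel and Crystal Ball model used in the study: the triangular impact distributions, the maturity probabilities, the leak value and all function names are hypothetical assumptions, and the domains are combined with the product form of the leaky Noisy-OR.

```python
import random
import statistics


def simulate_availability(impact_ranges, maturity, leak, trials=10_000, seed=42):
    """Monte-Carlo sketch of availability under a leaky Noisy-OR model.

    impact_ranges: (low, mode, high) of each domain's contribution to
                   unavailability when best practice is NOT in place (assumed).
    maturity:      probability that each domain is at the best practice level.
    leak:          unavailability that remains when all domains are at best practice.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        p_available = 1.0 - leak
        for (low, mode, high), p_best_practice in zip(impact_ranges, maturity):
            if rng.random() >= p_best_practice:
                # Domain below best practice: sample its impact for this trial.
                p_available *= 1.0 - rng.triangular(low, high, mode)
        results.append(p_available)
    return results


def ninety_percent_interval(results):
    """Return the 5th and 95th percentiles of the simulated availabilities."""
    cuts = statistics.quantiles(results, n=20)  # cut points at 5 %, 10 %, ..., 95 %
    return cuts[0], cuts[-1]


# Hypothetical inputs for thirteen domains.
impact_ranges = [(0.0005, 0.001, 0.002)] * 13
maturity = [0.5] * 13
results = simulate_availability(impact_ranges, maturity, leak=0.0001)
low, high = ninety_percent_interval(results)
print(f"mean={statistics.mean(results):.4%} median={statistics.median(results):.4%}")
print(f"90 % interval: {low:.4%} to {high:.4%}")
```

Fixing the seed makes a run reproducible; in practice the elicited distributions would replace the triangular placeholders.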

CONCLUSION
The study presented an extensive literature review in the areas of availability and applications of BBN-based methods to decision-making in the fields of operational risk and IS. Based on the theoretical part of the research, we adapted the model constructed by Franke et al. in two aspects. We changed the input variables of the model and incorporated information on the previous states of the variables, which improved the predictability of the model.

LIMITATIONS AND DIRECTIONS FOR FURTHER RESEARCH
The fundamental assumption built into this model is the independence of the variables, which enabled the application of the Leaky Noisy-OR approach. Another limitation is the binary representation of variables. Investment in a domain does not always result in bringing the domain to the level of best practice, but it can improve the situation in the field, thus reducing the impact on IS unavailability. Further study could lead to a model that would overcome this limitation by using continuous variables instead of binary ones and a Noisy-MAX node instead of Noisy-OR.
The empirical research was done on the IS of banks in Bosnia and Herzegovina, as the most mature in the field of availability management. In a situation where other industries are becoming increasingly dependent on IT and are raising their standards of IS governance, it would be interesting to conduct a similar survey of the general population, not limited to one industry. This would verify the applicability of the model to other industries and enable comparative cross-industry analysis.
Also, more empirical research based on real data on IS unavailability incidents and their causes would lead to empirical validation of the model.
Neil et al. presented the methodology for developing a BBN model representing the operational risk of IT infrastructure in financial institutions [45]. Wei et al. developed an integrated process, based on BBN, for efficient IT services management [51]. Sommestad et al. made a model based on the extended influence diagram, which enables the analysis of the cyber security of different architectural solutions [52]. Cemerlic et al. proposed an intrusion detection system (IDS) based on BBN [53]. Simonsson et al. proposed a BBN-based model for measuring IT governance efficiency [54]. Lande et al. modeled critical information systems using BBN [55].

Figure 4. Availability certainty intervals.
We also conducted extensive field research, performing the probability elicitation on the entire population of InfoSec, IS audit and IS management experts from the BiH banking sector. This research resulted in a picture of the state of implementation of best practices in the fields that affect IS availability in the BiH banking sector. We performed a comparative analysis of the research results with the results that Franke and Johnson obtained in a study conducted in Western Europe. From a practical point of view, this work identified, taking into account local and regional specificities, the most influential factors for the IS availability of BiH banks.

Table 1. Elicitation results compared with Franke and Johnson [7].
Comparing our data with the data from Franke and Johnson's research, we can notice that, according to that research, there are more IS availability determinants with an impact greater than 20 %. In our study, only two variables have an impact greater than 20 %, while the other variables impact IS availability by 10 % or less. Among the four most significant areas, both studies identified the same three areas: 'availability requirements management', 'change management' and 'operations management'. Also, both studies have shown that the least influential variables are 'resistant client/server systems' and 'DR equipment and location'. The biggest differences are in the areas of 'monitoring of relevant components' and 'avoiding errors in internal applications'. In the research by Franke et al., they have a greater impact (2nd and 5th position) compared with our research (9th and 10th). In contrast, 'physical environment' has a significant impact according to our results (3rd place) in comparison with the research by Franke et al. (8th place). When interpreting these outcomes, one should consider that we gathered data from the BiH banking sector, while the research by Franke et al. did not have industry boundaries.