Risk assessment for the cascading failure of electric cyber-physical system considering multiple information factors

A risk assessment approach for the cascading failure of electric cyber-physical system (ECPS) considering multiple information factors is proposed. First, considering hardware, software and personnel factors existing in the information system, the reliability model of the control function of the information system is established. Second, the reliability model of the physical component considering the control function of the information system is established. Finally, the risk assessment approach for the cascading failure of ECPS is introduced, and the probability of the cascading failure and the expected energy not supplied are used as the risk indices. Test results on an improved RTS-79 system show the effectiveness and significance of the method for choosing the optimal power communication network topology and the measures to reduce the risk.


Introduction
With the application of information and communication technology (ICT), the power grid has developed into an electric cyber-physical system (ECPS) [1]. The openness and compatibility of the information system introduce inevitable risk factors to the operation of the power system [2]. The Ukraine blackout in 2015 confirmed the severe consequences of risk from the information system [3]. In this blackout, a virus was implanted into the energy management system, which put the substation server out of operation and usurped the control capability of the related power equipment [4,5]. A cascading failure of ECPS proceeds as follows: risk from the information system causes an unexpected state transition of some power components, the power flow then transfers, and the overloaded lines trip, resulting in a large-scale blackout. Hence, it is of great significance to establish a risk assessment approach for the cascading failure of ECPS considering the information factors.
It can be seen that the failure of the control function of the information system was the main reason for the Ukraine blackout. Under this scenario, the lack of a reliability model of the control function of the information system and of a reliability model of the physical component considering that control function is the key obstacle to implementing risk assessment of the cascading failure of ECPS. Graph theory and complex network theory were used to analyse the interdependencies between the cyber system and the physical system, especially from the view of topology [6,7]. A matrix was used to describe the relationship between the information system and the physical system, and the probability of different states of the information system and the corresponding outage area of the physical system were analysed [8]. The model of system state transition integrating the effect of the information system was established considering direct cyber-power interdependencies [9,10] and indirect cyber-power interdependencies [11]. The reliability model of the physical component considering the reliability of the hardware in the information system was established [12,13]. However, the dispatcher makes decisions based on the results of computing and analysis, and the dispatching command is then passed on by ICT. The reliability of the control function is therefore influenced by the personnel factor (dispatcher) and by the software in the information system. In the existing research on the risk assessment of ECPS, the software and personnel factors have been ignored. In fact, the failure of software leads to invalid control of the system by the dispatchers, which was one of the main reasons for the USA and Canada blackout in 2003 [14]; moreover, errors of dispatchers bring unpredictable consequences to system operation. For example, in 2000 a grid dispatcher gave dispatching instructions to the wrong substation interval, which resulted in lines being disconnected [15]. Therefore, it is essential to consider the effect of the software and personnel factors when establishing the reliability model of the information system.
This paper proposes a risk assessment method for the cascading failure of ECPS considering multiple information factors. The reliability model of the control function of the information system is obtained based on the state space graph and the reliability models of the hardware, software and personnel factors. Then, the reliability model of the physical component considering the control function of the information system is established and the process of the risk assessment is proposed. The method is illustrated and verified on the improved RTS-79 test system. The results show the significance of considering the information system and the effectiveness of the method.

Architecture of the information system
The typical architecture of an IEC 61850-based substation communication system [16] is shown in Fig. 1, including switches, intelligent electronic devices (IEDs), communication lines, servers and so on. The star topology is adopted for the communication network inside the substation, and the ring topology is also used for digital substations.
The topology of the power communication network among the dispatching centre and substations can be star-type, ring-type or bus-type, as shown in Fig. 2. The information system includes the power communication network among the dispatching centre and substations as well as the communication network inside the substation. In this paper, the star-type topology of the communication network inside the substation shown in Fig. 1 and the three topologies of the power communication network among the dispatching centre and substations shown in Fig. 2 are considered.
The implementation of the control function of the information system depends on the normal operation of the related hardware, software and personnel. Reliability models of hardware, software and personnel are established first, and then the reliability model of the control function of the information system is obtained based on the state space graph.

Reliability model of the hardware of information system
For some equipment in the information system, such as workstations, switches and IEDs, normal operation relies on the proper operation of the corresponding application software. Hence, the application software and the equipment itself are treated as a series-connected failure model.
According to Figs. 1 and 2, the reliability block diagram (RBD) for the hardware of the information system is shown in Fig. 3, where workstations, switches and IEDs are represented by equivalent failure models that include the application software. The corresponding equivalent failure rate λ_h of the hardware of the information system is calculated by analysing the series-parallel network.
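The series-parallel reduction can be illustrated with a minimal Python sketch. It works with availabilities rather than failure rates, the block layout is hypothetical (it is not the RBD of Fig. 3), and all numerical values are made up for illustration. It relies only on the standard rules that a series chain is available when every block is available and that a parallel group is unavailable only when every branch is unavailable.

```python
from functools import reduce


def series(avails):
    """Availability of blocks in series: product of the availabilities."""
    return reduce(lambda acc, a: acc * a, avails, 1.0)


def parallel(avails):
    """Availability of redundant blocks: 1 - product of the unavailabilities."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), avails, 1.0)


def with_app_software(a_equipment, a_software):
    """Equipment and its application software treated as a series pair."""
    return series([a_equipment, a_software])


# Hypothetical values, for illustration only (not taken from Table 2).
ied = with_app_software(0.9995, 0.9998)
switch = with_app_software(0.9997, 0.9999)
dual_link = parallel([0.999, 0.999])      # two redundant communication lines
path = series([ied, switch, dual_link])   # one assumed control path
print(round(path, 6))
```

The same two helpers can be nested to reduce a larger block diagram step by step, from the innermost parallel groups outwards.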

Software factor of information system and corresponding reliability model
The software in the information system includes the operating system, the control system software and the application software of the equipment; the last of these is already considered in the reliability model of the equipment. Factors affecting the reliability of the software mainly include: (i) inaccurate demand analysis; (ii) unreasonable software design; (iii) failures in the coding implementation; and (iv) non-standardised software testing. The logarithmic exponential distribution [17] is used to model the reliability of the operating system and the control system software, and the failure rate λ_s0 is given by (1), where λ_0 is the initial failure probability; k is the coefficient of failure reduction rate; and ε is the number of errors found in operation.
The premise for realising the control function is that both the operating system and the control system software operate properly. Therefore, the two are treated as a series-connected failure model, from which the corresponding failure rate λ_s of the software of the information system can be calculated.
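A minimal sketch of the software model is given here under two assumptions that are not confirmed by the text: that the logarithmic exponential model takes the common form λ_s0 = λ_0 · exp(−k·ε), and that constant failure rates of series-connected elements add. The function names and the parameter values are hypothetical.

```python
import math


def software_failure_rate(lambda0, k, eps):
    """Assumed form of the model: lambda_s0 = lambda0 * exp(-k * eps)."""
    return lambda0 * math.exp(-k * eps)


def series_failure_rate(rates):
    """Assumption: constant failure rates of series-connected elements add."""
    return sum(rates)


# Hypothetical parameters for the operating system and the control software.
lam_os = software_failure_rate(lambda0=0.02, k=0.05, eps=20)
lam_cs = software_failure_rate(lambda0=0.05, k=0.05, eps=10)
lam_s = series_failure_rate([lam_os, lam_cs])
print(lam_os, lam_cs, lam_s)
```

Under this assumed form, every error found and removed in operation lowers the failure rate by the same factor exp(−k).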

Reliability model of the personnel factor of information system
A two-parameter Weibull distribution is used to model the reliability of the personnel factor [18] and to obtain the failure rate λ_p, where t is the response time; T_0.5 is the median time for the operator to complete a task; and α and β are the scale and shape parameters of the cognitive behavioural model, respectively.
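As a rough illustration only, the sketch below evaluates a two-parameter Weibull expression in the normalised time t/T_0.5. The exact expression used for λ_p is not stated in the text, so the form exp(−((t/T_0.5)/α)^β), read as the probability that the operator has not yet completed the task at time t, is an assumption, as are the function name and the values.

```python
import math


def non_response_probability(t, t_median, alpha, beta):
    """Assumed form: exp(-((t / t_median) / alpha) ** beta).

    t        : time available to respond
    t_median : median time for the operator to complete the task (T_0.5)
    alpha    : scale parameter of the cognitive behavioural model
    beta     : shape parameter of the cognitive behavioural model
    """
    if t <= 0:
        return 1.0  # no time available, so the task cannot have been completed
    return math.exp(-((t / t_median) / alpha) ** beta)


# Hypothetical values: 30 s available, 10 s median completion time.
print(non_response_probability(t=30.0, t_median=10.0, alpha=1.0, beta=1.2))
```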

Reliability model of the control function of information system based on the state space graph
Since the working process of the information system has the Markov characteristics, the reliability model of the control function of the information system can be obtained by the state space graph.
The specific solving process is based on the following assumptions: (i) the failures of the hardware, software and personnel factors in the information system are independent; (ii) the hardware, software and personnel of the information system operate at a stable stage, and the corresponding failure rates are regarded as constant; and (iii) some failures of the hardware can be self-tested, and the probability of self-test is C.
The state space graph of the control function of the information system is shown in Fig. 4, where μ_1 is the repair rate of hardware that can be self-tested; μ_2 is the repair rate of hardware that cannot be self-tested; μ_3 is the repair rate of software; and μ_4 is the repair rate of personnel.
The steady-state probability of the control function of the information system in each state is defined first. According to Fig. 4, the state transition matrix P is obtained as (5). Combining (6) and (7), the availability probability of the control function is obtained, as (8) shows.
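The steady-state solution can be sketched in Python under an assumed structure for the state space graph, which may differ from Fig. 4: one normal state 0 and four failed states (1: self-tested hardware failure, 2: non-self-tested hardware failure, 3: software failure, 4: personnel failure), each entered from state 0 and repaired directly back to state 0. The balance equation π_i · μ_i = π_0 · r_i for each failed state, together with the normalisation Σπ_i = 1, then gives a closed form. All names and values are hypothetical.

```python
def control_function_steady_state(lam_h, lam_s, lam_p, c, mu):
    """Steady-state probabilities [pi_0, ..., pi_4] of the assumed model.

    lam_h, lam_s, lam_p : failure rates of hardware, software and personnel
    c                   : probability that a hardware failure is self-tested
    mu                  : repair rates (mu_1, mu_2, mu_3, mu_4)
    """
    # Transition rates from the normal state into each failed state.
    rates_out = [c * lam_h, (1.0 - c) * lam_h, lam_s, lam_p]
    # Balance equations: pi_i * mu_i = pi_0 * rate_i for each failed state.
    ratios = [r / m for r, m in zip(rates_out, mu)]
    # Normalisation: the probabilities sum to one.
    pi0 = 1.0 / (1.0 + sum(ratios))
    return [pi0] + [pi0 * x for x in ratios]


# Hypothetical rates, all in the same time unit.
pi = control_function_steady_state(0.01, 0.002, 0.001, 0.8, (1.0, 0.5, 2.0, 4.0))
availability = pi[0]  # probability that the control function is available
print(availability)
```

In this assumed model the availability of the control function is simply the probability of state 0, so a higher C helps only to the extent that μ_1 exceeds μ_2.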

Reliability model of the physical component considering the control function of information system
The breaker is taken as the interface for analysing the relationship between the cyber and physical systems. Assuming that the protection function of the system is completely reliable, the failure of the control function can be divided into refusal and malfunction of the breaker [13]. The refusal caused by the failure of the control function does not directly lead to a cascading failure, although it can promote the process of cascading failure under specific operation modes. Therefore, the failure of the control function in this paper refers specifically to malfunction of the breaker, which leads to the disconnection of the corresponding line. Hence, the reliability model of the physical transmission line considering the control function of the information system is established in terms of the following quantities: U′ is the availability rate of the line considering the control function of the information system; U is the availability rate of the line without considering the control function of the information system; U_c,f and U_c,t are the availability rates of the breakers at the two ends of the line; and P_0,f and P_0,t are the availability probabilities of the control function of the breakers at the two ends of the line.
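One plausible reading of these definitions, offered only as an assumption, is a series model in which the line stays in service when the line itself, both breakers and both breaker control functions are available, i.e. U′ = U · U_c,f · U_c,t · P_0,f · P_0,t. The sketch below encodes that reading; the function name and the numbers are hypothetical.

```python
def line_availability(u_line, u_brk_from, u_brk_to, p0_from, p0_to):
    """Assumed series model: U' = U * U_c,f * U_c,t * P_0,f * P_0,t."""
    return u_line * u_brk_from * u_brk_to * p0_from * p0_to


# Hypothetical values, for illustration only.
u_prime = line_availability(0.9990, 0.9998, 0.9998, 0.99932, 0.99932)
fault_probability = 1.0 - u_prime  # probability that the line is out of service
print(u_prime, fault_probability)
```

Under this reading, a lower availability of the control function at either end directly raises the outage probability of the line.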

Risk assessment of the cascading failures of ECPS
The probability of the cascading failure, P_cf, is defined as
$$P_{cf} = \sum_{i \in S} p_i F_i \tag{10}$$
where S is the set of N − 1 faults; p_i is the occurrence probability of fault i; and F_i is an indicator of whether fault i triggers a cascading failure. If the number of faulted lines is larger than 3, F_i = 1, indicating that a cascading failure has occurred; otherwise F_i = 0.

Expected energy not supplied (EENS) (MWh/year) [19]
$$EENS = \sum_{j=1}^{NL} T_j \sum_{i \in S} p_i C_i \tag{11}$$
where NL is the number of load levels; T_j is the duration (in hours) of load level j; and C_i is the load loss caused by fault i.
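The two indices can be evaluated with a short Python sketch, reading P_cf as the sum of p_i over the faults with F_i = 1 and EENS as the sum over load levels of T_j · Σ p_i C_i. The data layout (dictionaries with the keys `p`, `tripped_lines` and `load_loss_mw`) and all values are hypothetical.

```python
def cascading_failure_probability(faults, threshold=3):
    """P_cf: sum of p_i over faults with more than `threshold` faulted lines."""
    return sum(f["p"] for f in faults if f["tripped_lines"] > threshold)


def expected_energy_not_supplied(load_levels):
    """EENS: sum over load levels j of T_j * sum_i(p_i * C_i).

    load_levels is a list of (hours, faults) pairs, one pair per load level.
    """
    return sum(
        hours * sum(f["p"] * f["load_loss_mw"] for f in faults)
        for hours, faults in load_levels
    )


# Hypothetical search results for a single load level lasting the whole year.
faults = [
    {"p": 0.01, "tripped_lines": 5, "load_loss_mw": 100.0},  # cascading failure
    {"p": 0.02, "tripped_lines": 1, "load_loss_mw": 0.0},    # no cascade
]
p_cf = cascading_failure_probability(faults)
eens = expected_energy_not_supplied([(8760.0, faults)])
print(p_cf, eens)
```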

General process of risk assessment of the cascading failures of ECPS
Given the load level, the general process of the risk assessment of the cascading failures of ECPS is proposed in Fig. 5. First, establish the reliability models of the hardware, software and personnel. Second, obtain the availability rate of the control function based on the state space graph. Third, obtain the availability rate of each line considering the control function of the information system. Finally, search the cascading failures and calculate the risk indices of the system. T_j × Σ p_i C_i for the given load level j is obtained according to Fig. 5, and then EENS is obtained according to (11). The method of searching cascading failures is:
(1) Generate the initial faulty line.
(2) Disconnect the faulty line, and calculate the power flow of the system.
(3) Check whether there is an overloaded line. If so, disconnect the overloaded line and go to step (4); if not, end the search.
(4) Determine whether the power flow is divergent or whether the system is separated. If not, return to step (3); if so, end the search, and calculate the minimum load loss of the system based on optimal power flow.
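The search steps above can be sketched as a loop in Python. The power flow solver is replaced by a hypothetical callback `solve_flow`, which returns the line flows or `None` when the power flow diverges or the system is separated; the sketch also assumes that the power flow is recalculated after each round of disconnections. The optimal power flow used for the minimum load loss is not included, and the three-line toy network is purely illustrative.

```python
def search_cascade(initial_line, lines, solve_flow, limits, max_rounds=100):
    """Follow steps (1)-(4) for one initial fault.

    Returns (tripped_lines, collapsed), where collapsed is True when the
    callback reports divergence or system separation.
    """
    in_service = set(lines) - {initial_line}   # steps (1)-(2): remove faulty line
    tripped = [initial_line]
    for _ in range(max_rounds):
        flows = solve_flow(in_service)         # assumed: recomputed every round
        if flows is None:                      # step (4): divergence or separation
            return tripped, True
        overloaded = [l for l in sorted(in_service) if abs(flows[l]) > limits[l]]
        if not overloaded:                     # step (3): no overload, end search
            return tripped, False
        in_service -= set(overloaded)          # step (3): disconnect overloads
        tripped += overloaded
    return tripped, False


def toy_flow(in_service, transfer=90.0):
    """Toy stand-in for a power flow: parallel lines share a fixed transfer."""
    if not in_service:
        return None                            # no path left: system separated
    share = transfer / len(in_service)
    return {line: share for line in in_service}


lines = ["A", "B", "C"]
tight = search_cascade("A", lines, toy_flow, {l: 40.0 for l in lines})
loose = search_cascade("A", lines, toy_flow, {l: 50.0 for l in lines})
print(tight, loose)
```

The number of tripped lines returned by such a loop is what the indicator F_i would be evaluated on, while the load loss C_i would come from the separate optimal power flow step.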

Case study
The physical power system is the RTS-79 test system [20]. Assume that the system is controlled by four dispatching centres, each of which takes charge of six substations (as shown in Fig. 6 and Table 1). For each dispatching centre, the switches of the substations listed in columns A-F in Table 1 correspond to the switches marked A-F in the communication network topology in Fig. 2. The reliability parameters of the information system are drawn from the reliability data of a practical provincial power grid in China, the statistical data of nationwide grid operation and [21][22][23].
Meanwhile, the model in Section 3 is used for the detailed calculation: (1) The reliability parameters of the equipment in the information system are shown in Table 2; the same values are taken as the reliability parameters of the application software of the equipment. The failure rate of hardware λ_h corresponding to each communication network topology is shown in Table 3.

Scenario A
The proposed method and its application in selecting the optimal communication network topology are first verified.
The power flow limit of each line is assumed to be reduced to 80% of the original. The system risk corresponding to the three communication network topologies is shown in Table 5. Case 0 refers to the situation in which the control function is completely reliable. Case 1 refers to the situation in which the star-type topology is adopted and only the hardware factor is considered. Case 2, Case 3 and Case 4 refer to the situations adopting the star-type, ring-type and bus-type topologies, respectively, considering the hardware, software and personnel factors. Compared with Case 0, P_cf increases by 238.1%, 235.2% and 245.1% and EENS increases by 208.8%, 206.7% and 220.6% in Case 2, Case 3 and Case 4, respectively. Comparing Case 2 with Case 1, P_cf and EENS increase by 22.48% and 21.39%, respectively. Conclusions can be drawn as follows:
(1) The system risk increases when the reliability of the control function of the information system is considered, which further illustrates that it is necessary to consider the effect of the information system in the risk assessment.
(2) Compared with the situation considering the hardware factor only, the system risk increases when the software and personnel factors are also considered. This shows that the software factor and the personnel factor cannot be neglected.
(3) The system risk with the bus-type topology is the largest, followed by the star-type topology and then the ring-type topology. In the ring-type topology, the communication channels back each other up, so the reliability is comparatively high. As Table 4 shows, the average availabilities of the control function under the different topologies are 99.933% (ring-type), 99.932% (star-type) and 99.926% (bus-type), from high to low. Therefore, the ring-type topology should be given priority.

Scenario B
Scenario B analyses methods of reducing the system risk arising from the information system. The star-type topology is adopted in both Scenario B and the following Scenario C.
Redundant configuration of the equipment of the information system is applied, and the resulting risk level of the system is shown in Table 6. When the redundant configuration of the communication line is adopted, P_cf and EENS are reduced by 0.34% and 0.33%, respectively. When the redundant configuration of the IED is adopted, the system risk is reduced the most: P_cf and EENS are reduced by 16.11% and 15.47%, respectively. Redundant configuration of the IED can therefore be taken as an effective measure to reduce the system risk.
Fig. 7 shows the impact of the hardware self-test probability on the system risk. The system risk decreases approximately linearly as the hardware self-test probability increases. P_cf and EENS at C = 0.9 are 15.64% and 15.10% lower, respectively, than at C = 0.8. Compared with the redundant configuration of hardware, improving the self-test probability is more effective but difficult and costly. Therefore, applying the redundant configuration of hardware while improving the hardware self-test probability as much as possible should be considered comprehensively.
Given that the reliability of software and personnel cannot be improved by redundant configuration, optimising and standardising software design and testing and enhancing personnel training can be used to improve the reliability of software and personnel, respectively.
The system risk when reducing the failure rate of the software and personnel is given in Table 7. Case 5 refers to the situation in which the failure rate of the control system software is reduced to half of the original, while Case 6 refers to the situation in which the failure rate of personnel is reduced to half of the original. Compared with Case 2, P_cf and EENS of Case 5 decrease by 1.43% and 1.37%, respectively, and P_cf and EENS of Case 6 decrease by 7.64% and 7.33%, respectively. It can be seen that improving the reliability of the software and personnel is important for reducing the system risk; moreover, personnel training should be strengthened, since improving the personnel reliability is more effective in reducing the system risk.

Scenario C
Scenario C analyses methods of reducing the system risk from the power system side. The risk level of the system under different line power flow limits is calculated, as shown in Fig. 8.
As can be seen in Fig. 8, P_cf shows an exponentially decreasing trend when the power flow limit is between 60% and 100%. When the power flow limit is below 60%, P_cf = 0.081218, which equals the sum of the fault probabilities of all lines. This means that the fault of any line will lead to a cascading failure. EENS is almost constant when the power flow limit is between 50% and 100%, and increases exponentially when the power flow limit is below 50%. It can be seen that properly increasing the power flow limit is an effective measure to reduce the system risk.

Conclusion
For the cascading failure caused by the failure of the control function of the information system, a risk assessment method considering the hardware, software and personnel factors was proposed. It provides a basic procedure for quantitatively assessing the system risk level while integrating the effect of the information system. The resulting risk indices can further help operators and planners choose the optimal power communication network topology and the measures to reduce the risk level of the system. Application results on the test system show that: ① The failure of the control function of the information system increases the risk level of the system, and the software and personnel factors cannot be ignored. ② The risk level of the system is the smallest with the ring-type topology of the power communication network, intermediate with the star-type topology and the largest with the bus-type topology. ③ Redundant configuration of the equipment, improving the hardware self-test probability, enhancing personnel training and increasing the power flow limit are effective measures to reduce the risk level of the system. Considering the other functions of the information system in the comprehensive risk assessment of ECPS will be addressed in our future work.

Fig. 1 Typical architecture of an IEC 61850-based substation communication system

5.1 Risk indices
5.1.1 Probability of the cascading failure P_cf:

Fig. 4 State space graph of the control function of information system

Fig. 5 Process of the risk assessment of the cascading failures of ECPS

3 Reliability model of the control function of information system considering multiple information factors
This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/) IET Cyber-Phys.Syst., Theory Appl.

Table 1 Substations controlled by dispatching centre

Table 2

Table 4 Availability rate of the control function

Table 6
Fig. 7 Impact of hardware self-test probability on system risk

Table 7 System risk when reducing the failure rate of software and personnel (columns: Case; P_cf; EENS, MWh/year)
Fig. 8 Impact of line power flow limit on system risk