Building resilient cyber-physical power systems: An approach using vulnerability assessment and resilience management

[...] to be captured in order to define suitable prevention mechanisms. The proposed resilience management measures aim at coping better with crises rather than relying on higher barriers. The resilience of cyber-physical energy systems is possible.


Introduction
Power systems are evolving through an extended convergence with information and communication technologies (ICT), leading to complex cyber-physical power systems (CPPS). This has brought opportunities to enhance the systems' performance and provide solutions to cope with the associated challenges of energy supply based on distributed and fluctuating renewable energies. However, cyber-attacks targeting power systems have been growing in number and sophistication in recent years. For instance, the attacks against the Ukrainian power grid in 2015 and 2016 resulted in power outages (Dragos Inc. 2017). Another incident against a utility in the United States was reported in March 2019 (Sobzak 2019). Several risk and vulnerability assessments for power systems have been published in recent years (e.g. NIST 2014; Rossebo et al. 2017). In these studies, potential impacts and mitigation options were evaluated based on lists of potential threats and their likelihood of occurrence. We argue that due to the dynamic nature of ICT and its complex interdependency with the power infrastructure, we have to expect surprises. It will no longer be possible to identify a comprehensive inventory of potential threats, as is the case in classic risk management.
A reliable power supply is of great importance for almost all areas of life; it is therefore necessary to develop strategies that enable the power system to be prepared for expected and unexpected stressors. In other words, it is essential to apply a resilience management strategy. Many definitions of resilience exist in the scientific community (e.g. Jesse et al. 2019). For this study, we describe resilience as a (socio-technical) system's ability to maintain its services under stress and in turbulent conditions (Brand et al. 2017; Gleich et al. 2010). The advantage of this definition is that it focuses on the system services, which must be outlined together with the stakeholders/users. In this way, changes and evolutions of the system are possible, which are core aspects of transitions. The focus lies on the complex nature of interconnectedness and interdependency, and the capability of the system to maintain its services.
This article presents the results of an empirical and interdisciplinary base study that involved actors from the energy and ICT sectors through interviews and workshops, in order to gain better insights into the vulnerabilities of CPPS. The study consists of two parts. First, a vulnerability assessment (VA) was performed to identify critical points coming from the ICT infrastructure. Second, a resilience strategy was developed by using a resilience management approach to identify how CPPS can be better prepared for any stressor.

Vulnerability Assessment Approach
The event-based and structural VA methods (Fig. 1) carried out in Gleich et al. (2010) and Gößling-Reisemann et al. (2013) were used as reference for this study.
The potential impacts were evaluated based on their effect on the system services, which were defined in this case according to parameters for both the electric and ICT infrastructures. Regarding the electric infrastructure, the quantity criteria are determined by the system's ability to supply the connected load. The quality criteria are defined by direct technical parameters, such as power quality or reliability indices, and by indirect parameters, such as socio-economic and socio-ecological impacts. Regarding the ICT infrastructure, the approach considers the effect on the security requirements, i.e. confidentiality, integrity, availability and non-repudiation of data in transit or at rest (e.g. control commands, firmware, software).
The study focused on the German and European power system, covering the complete electrical energy conversion chain, and was limited to evaluating stressors from the ICT infrastructure. The component layer of the Smart Grid Architecture Model was used as a reference architecture model. Two workshops and 19 semi-structured interviews were conducted between June 2016 and March 2017 with experts from the energy, industrial automation and ICT sectors and from public bodies. The expert statements were evaluated by means of a comprehensive qualitative content analysis methodology based on Mayring (2014).
Combining the experts' opinions, relevant literature, and our own judgement, the potential impacts were qualitatively rated as high, medium or low according to the effects of stressors and structural weaknesses on the quality and quantity criteria of the system services. In order to determine the adaptive capacity, inputs from experts and literature were considered regarding existing or foreseen adaptation mechanisms and the readiness of the concerned actors to implement them. They were also qualitatively rated as high, medium or low. Consequently, the vulnerability level was the result of combining potential impacts and adaptive capacity according to the matrix shown in Fig. 2. A more detailed description of the VA methodology can be found in Tapia et al. (in press).

Source (Fig. 1): Authors' own compilation based on Gleich et al. (2010) and Gößling-Reisemann et al. (2013).

Resilience Management Approach
Resilient CPPS should have a diverse set of capabilities such as resistance/robustness, adaptation, innovation and improvisation to overcome known and unknown stressors. These capabilities help the systems to maintain their system services (see definition above). In this study, the resilience management approach described in Acatech et al. (2017) and Goessling-Reisemann and Thier (2019) was used as reference. It comprises a four-phase approach: (1) Prepare and prevent, (2) Implement robust and precautionary design, (3) Manage and recover from crises, and (4) Learn for the future. The suggested measures for each phase were developed based on the VA results, the resilience design principles/elements described in Brand et al. (2017) and Goessling-Reisemann and Thier (2019), the statements of the interviewed experts, and our own judgements (Fig. 3).

Vulnerability Assessment Results
The VA identified critical properties, structures and elements contributing to the vulnerability of the CPPS. Based on the qualitative content analysis results, the findings were sorted into the following four categories: (a) technology, (b) organizational security policies and procedures, (c) the human factor, and (d) regulations. Each category included subcategories, which were assessed individually using the VA methodology described above. All subcategories resulted in high vulnerability ratings, following from the combination of medium to high potential impacts with medium or low adaptive capacities (Tab. 1). The list of categories and subcategories is not intended to be comprehensive. Rather, it reflects the fact that the interviewees were asked which points they considered critical, which led to a list of high vulnerabilities. In the following subsections, the findings for each category are briefly described.
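
The rating step of the VA, in which a potential-impact level and an adaptive-capacity level are combined into a vulnerability level through a matrix (Fig. 2), can be sketched in a few lines of Python. This is a minimal illustration and not the authors' tooling: the names `Level`, `ASSUMED_MATRIX` and `vulnerability` are hypothetical, and all cell values are assumptions, because the matrix itself is only given in the figure. The four cells for medium or high impact combined with low or medium capacity are set to high so that the sketch agrees with the ratings reported in the text.

```python
from enum import IntEnum


class Level(IntEnum):
    """Qualitative three-step rating scale (L, M, H)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


L, M, H = Level.LOW, Level.MEDIUM, Level.HIGH

# ASSUMPTION: illustrative cell values, not the values of Fig. 2.
# Key: (potential impact, adaptive capacity) -> vulnerability level.
ASSUMED_MATRIX = {
    (H, L): H, (H, M): H, (H, H): M,
    (M, L): H, (M, M): H, (M, H): M,
    (L, L): M, (L, M): L, (L, H): L,
}


def vulnerability(potential_impact, adaptive_capacity, matrix=ASSUMED_MATRIX):
    """Look up the vulnerability level for one subcategory."""
    return matrix[(potential_impact, adaptive_capacity)]


# Example: a high potential impact that meets a low adaptive capacity.
rating = vulnerability(Level.HIGH, Level.LOW)
print(rating.name)
```

Keeping the matrix as a plain lookup table makes the qualitative judgement explicit and easy to replace once the actual cell values are known.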

Technology
The increased number of systems, endpoints and actors involved in the CPPS leads to a higher number of interconnections and communications. If these communications use unencrypted or weakly encrypted network protocols, authentication keys and data payload are exposed (NIST 2014).

Fig. 3: Four phases of the resilience management approach (prepare and prevent; implement robust and precautionary design; manage and recover from crises; learn for the future) and the sources for determining the suggested measures for each phase, including the statements of interviewees and our own judgements.

On the other hand, experts also stated that the more distributed and closer to the end-consumer the communication occurs, the more vulnerable it gets. The reason is that devices located at the customer premises (e.g. Internet-of-Things devices) are deployed with poor security features and, furthermore, are not regulated. In most cases, they do not have capabilities for secure key management, access control, or patch management. Security challenges and threats of smart home devices are discussed in Lee et al. (2014).

Organizational Security Policies and Procedures
Experts agreed that due to the increasing complexity and interdependencies between IT and Operational Technology (OT) infrastructures, the knowledge needed to address the new challenges has changed. In most cases, interdisciplinary knowledge is missing or limited, and it is therefore difficult to properly understand, design, implement and operate the complete complex system. Normally, OT assets are maintained by Industrial Control System (ICS) operators and engineers rather than experienced IT professionals, which can result in common mistakes in maintenance and configuration and in a lack of hardening (Bodungen et al. 2017). Moreover, typical IT security measures cannot be directly applied in ICS environments, because process stability or availability could be affected. Therefore, specific and tailored security measures are needed.
As experts stated, ICS tend to be outdated, either because vendors do not provide security patches or because the particular system is time-critical. As a consequence, attackers are able to gain access to different system components by exploiting known security gaps that have not yet been patched. Moreover, even if all patches and mitigations are kept up to date, attacks are becoming more sophisticated and adversaries use unknown zero-day exploits (McLaughlin et al. 2015), i.e. attacks based on previously unidentified and unpatched cybersecurity gaps.

The Human Factor
The lack of effective security training and awareness programs in power sector organizations can lead to personnel who are insufficiently trained or engaged in cyber-security aspects (NIST 2014). Applying social engineering, threat agents are exploring new attack mechanisms targeting different levels in the organization. This is one of the fastest growing security problems according to the experts. In the Ukrainian blackout in 2015, attackers developed the BlackEnergy 3 malware and performed a phishing campaign targeting employees of the electricity distributor (Styzcynski and Beach-Westmoreland 2019).
Disgruntled employees, or ex-employees who are not properly managed when leaving the company, may represent further potential threat actors. They could have detailed knowledge of the systems and access to critical data, allowing them to identify weak internal structures and methods to compromise the systems. Furthermore, critical information about the system configuration could even be publicly available through vendors' or asset owners' websites, employees' social media sites, or other sources. Attackers can leverage this information for planning the attack.
Additionally, experts also mentioned that end users represent another vulnerable point because of their lack of awareness or understanding of the consequences of the potentially low security of their smart devices. A more complex problem derives from end users being prosumers, who may not have the expert knowledge to implement and maintain appropriate security measures for Distributed Energy Resource (DER) systems (e.g. smart inverters).

Regulations
The lack of an effective implementation of security standards and regulations represents another critical point for CPPS. Experts considered that the absence of mandatory regulations obliging power system operators to implement minimum required security standards, or vendors to meet the necessary security requirements in their products, exposes the system to possible cyber-attacks, for instance man-in-the-middle attacks on non-upgraded ICS running the IEC 60870-5 protocol (Maynard et al. 2014).
Different technical and organizational standards have been developed to address cyber-security requirements in smart grids (ENISA 2012; NIST 2014). Nevertheless, as experts stated, in most cases these are only recommendations, and compliance with a minimum security level is not enforced by regulations. Furthermore, the experts mentioned that there are no economic incentives for grid operators to invest in cyber-security enhancements. The decision to upgrade legacy ICS in order to implement the security measures could be delayed until the next planned lifecycle equipment replacement, not only because of the processes' criticality, but also due to the additional associated costs. Another critical point, as experts remarked, is the lack of effective coordination to improve security for the overall system.
The critical points discussed in this subsection are related to all categories mentioned above. They manifest as a lack of readiness of the involved actors to implement existing adaptation strategies, which increases the vulnerability level of each category.

Resilience Management Strategy
The VA unveiled the critical vulnerable points. Security measures, if applied, have great potential to reduce some vulnerabilities. However, they focus mainly on trying to keep malicious attackers outside of the system. Therefore, one of the biggest challenges is to find a way to broaden the horizon in handling known and unknown stressors by including recovery, adaptation and learning mechanisms after successful attacks, instead of only focusing on prevention and detection. This is the objective of the second part of the study. Our main concern is how to increase the resilience of CPPS. This requires the understanding that resilience is more than just eliminating identified vulnerabilities. The applied resilience management approach consists of four phases (Fig. 3).
During the preparation and prevention phase, weak points in the CPPS are identified and effective prevention measures must be derived. The focus here is on known stressors; thus, a holistic security approach across IT and OT (IEC 2016) and energy-focused risk analysis and management strategies (Fischer et al. 2018) are needed. Experts also stressed the importance of scalable and regularly tested security measures at endpoints (e.g. encryption, authentication, authorization), intrusion detection systems, patch management, network segmentation, as well as more effective and engaging security training and awareness programs. Technology-wise, implementing additional measures for data storage and preserving unused resources (operational slack) to better deal with surprises is helpful (Fischer and Lehnhoff 2018).
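
As one illustration of the endpoint authentication measures named above, the following minimal Python sketch attaches a keyed hash (HMAC-SHA256) and a message counter to a control command, so that a receiver can detect messages that were modified, forged or replayed in transit. It is a sketch under stated assumptions, not a description of any protocol used in the study: the function names `sign_command` and `verify_command`, the message layout and the example command are hypothetical, the key is assumed to be shared in advance over a secure channel, and the scheme provides integrity and authenticity only, not confidentiality.

```python
import hashlib
import hmac
import json
import secrets


def sign_command(key: bytes, command: dict, counter: int) -> dict:
    """Attach a counter and an HMAC-SHA256 tag to a control command."""
    payload = json.dumps({"cmd": command, "ctr": counter}, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_command(key: bytes, message: dict, last_counter: int):
    """Return the command if tag and counter are valid, otherwise None."""
    payload = message["payload"]
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, message["tag"]):
        return None  # payload was modified or the tag was forged
    body = json.loads(payload)
    if body["ctr"] <= last_counter:
        return None  # counter not fresh: treat as a replayed message
    return body["cmd"]


# ASSUMPTION: hypothetical pre-shared key and example command.
key = secrets.token_bytes(32)
msg = sign_command(key, {"breaker": "Q1", "action": "open"}, counter=1)
accepted = verify_command(key, msg, last_counter=0)

# A man-in-the-middle changing the payload cannot produce a matching tag.
tampered = dict(msg, payload=msg["payload"].replace("open", "close"))
rejected = verify_command(key, tampered, last_counter=0)
```

Such a scheme is only as strong as its key handling, which is why the missing key management capabilities of customer-side devices noted by the experts matter.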
In order to enhance resilience, a robust and precautionary system design should be implemented from the beginning. This will empower the system to maintain its services even under stress or disturbances. The system should have a high diversity of IT components and redundancy in communication channels and devices (BNetzA 2019). Maintaining the ability to rely only on physical parameters for operation, as well as hardware-based security, is helpful. Furthermore, implementing a cellular structure in order to secure a minimum and stable power supply in case of a failing central ICT infrastructure appears beneficial (VDE 2015). Other suggestions supported by the experts are the implementation of real-time monitoring, intrusion and bad data detection schemes (Iturbe et al. 2016; [...] et al. 2018), as well as periodic backups, and reducing services and functionalities in terms of data, ports, libraries, etc. (Fischer and Lehnhoff 2018).

Fig. 4: Selection of resilience-enhancing measures and elements, sorted by the categories Technology (blue), Organizational Security Policies and Procedures (green), the Human Factor (orange) and Regulations (grey), according to the phases of the resilience management approach. Source: Authors' own compilation based on Tapia et al. (in press).

A resilient power system is able to ride through failures in order to manage and recover from crises. While the stability and security in this phase could be enhanced by multi-agent based control with decentralized consensus finding (Lehnhoff and Krause 2013), attention should also be paid to the ability to operate the system without ICT, i.e. manually, or to at least secure a soft landing, as experts stated. In addition, the provision of business continuity and emergency plans on a regional and local level, e.g. through supplying islands at least in and around public properties/buildings, and the preparation for active emergency planning and exercises based on realistic cyber-attacks have a high priority (Arghandeh et al. 2016).
Past and avoided disasters should be used in phase four to learn for the future in order to improve the adaptive capacity of the system. In this sense, digital forensics would make it possible to investigate incidents and near incidents in depth and identify lessons. This should include the documentation of weaknesses that led to failures (vulnerability store) (Gößling-Reisemann 2016). Furthermore, strengths that avoided crises in the past or enhanced recovery are equally worth identifying, as they form the basis for planning strategies and emergency scenarios (solution store) (Gößling-Reisemann 2016). This documentation must be mandatory and publicly available.
Fig. 4 shows the summary of selected resilience-enhancing measures and elements for each phase of the resilience management approach. More details on the specific resilience management strategy described here can be found in Tapia et al. (in press).

Conclusions
In this study, critical properties, structures and elements contributing to the vulnerability of CPPS were identified. On the one hand, insecure communications or insecure end points, especially at the customer premises, resulted in a high vulnerability due to poor security features on the devices. On the other hand, social engineering is a quickly growing security problem that enables threat agents to exploit one of the weaknesses present in every organization: the human factor. In spite of the existence of adaptation mechanisms that could minimize the impact, it was found that their implementation could be hindered by the lack of policy enforcement or the lack of readiness of the involved actors to implement these measures. To address cybersecurity challenges, an integrated assessment considering physical, cyber and social perspectives is necessary. The aim is not only to try to keep attackers outside the system, but to design the system in a way that enables it to transform and adapt in order to cope with any kind of stressor. In other words, a resilience management strategy is needed that considers that resilience is more than just eliminating identified vulnerabilities. This article illustrated resilience-enhancing measures assigned to the four phases of the resilience management cycle. One important measure is to establish an adequate cyber-security regulation framework and monitor its effective implementation. Regarding the system architecture, a cellular structure and physical backup would build resilience in case of successful attacks. We conclude that introducing resilience principles/elements to the system and using a resilience management approach is a suitable way to prepare systems for the unexpected.

Fig. 2: Vulnerability assessment matrix that considers the level of potential impacts on system services and adaptive capacity (H: High, M: Medium, L: Low). Source: Authors' own compilation based on Gleich et al. (2010) and Gößling-Reisemann et al. (2013).

[...] weakly encrypted network protocols, authentication keys and data payload are exposed (NIST 2014). Using Man-in-the-Middle attacks, threat agents will be able to listen to, inject or manipulate messages between nodes. On the one hand, legacy communication protocols used in Industrial Control Systems (ICS) in the generation, transmission and distribution domains have evolved from proprietary point-to-point links isolated from external networks to open and standard protocols. According to the experts, this represents a major security problem. The 'Crashoverride' malware, which seems to have been used in the Ukraine blackout in 2016, is a good illustration of advanced malware that leverages the weaknesses of certain ICS protocols (Dragos Inc. 2017).
Tab. 1: Categories and subcategories that reflect critical properties, structures and elements of CPPS and the corresponding ratings of Potential Impacts, Adaptive Capacity and Vulnerability on the scale L: Low, M: Medium, H: High. Source: Authors' own compilation based on Tapia et al. (in press); McCarthy