Improving the Robustness of Industrial Cyber-Physical Systems through Machine Learning-Based Performance Anomaly Identification

We propose a versatile and fully data-centric methodology for anomaly detection and identification in modern industrial Cyber-Physical Systems (CPS). Our motivation is the ever-growing computerisation of these systems, in the form of complex distributed computing nodes running complex distributed software. Industrial CPS also feature heavy deployment of hardware sensors, as well as an increasing role for software. We observe that design-time measures are insufficient and costly in the prevention of anomalies. As our main contribution, our methodology takes advantage of this data-rich environment by means of Extra-Functional Behaviour (EFB) monitoring, analytics pipelines and Artificial Intelligence (AI). Specifically, we demonstrate the compartmentalisation of execution timelines into distinct units, i.e., execution phases. We introduce the generation of representations for these phases, i.e., behavioural signatures and behavioural passports, as our way of behavioural fingerprinting. Composed using regression modelling techniques, signatures, representing ongoing behaviour, are compared to passports, representing reference behaviour. The comparison is done by means of goodness-of-fit scores, yielding quantifiable measures of deviation between recorded behaviours. We have used both partially synthetic and real-world traces in our experiments, depending on the use-case. We have also followed both white box and black box approaches for our use-cases, with discussions on the pros and cons of each. The effectiveness of our data-centric methodology is demonstrated by two proofs-of-concept from the industry, representing the two ends of the industrial CPS complexity spectrum: one is a large semiconductor photolithography machine, while the other is an image analysis platform.
Each use-case comes with its own characteristics and limitations, confirming the flexibility of our methodology and the relevance of its integral steps, particularly the initial analysis and data transformations. The results of anomaly classification show overall high accuracies, as high as 99% in certain set-ups. These results demonstrate the capability of our data-centric methodology and its suitability for modern industrial CPS designs.


Fig. 1. High-level approach towards performance anomaly detection, identification and countering their effects, with this paper's focus highlighted in yellow

surrounding environment. With the goal of covering this vast spectrum, we have opted for cases from the extremities.
Our first demonstrator is a semiconductor photolithography machine from ASML, for which the main computing node has been studied. The opposite case is an embedded platform used for the detection of cars in images taken with the system's camera. Although both cases follow the same methodology, the exact workflows and data manipulations along the way are, to a limited extent, platform-specific.

Contribution and structure
This paper elaborates a complete view of our data-centric approach towards performance anomaly detection and identification in industrial CPS. In part, we consolidate our previously published papers [34,30,33,29,35] into a single point of reference, as each previous individual paper addressed only a specific part of the approach.
The extra novelties complementing earlier material include: extended descriptions of the employed concepts and techniques, revised and extended diagrams, and, most importantly, new experimentation with anomaly classifiers for the semiconductor photolithography machine. We also provide extended results for our image analysis use-case, especially where classification algorithms are applied. These results reveal the most effective combinations of metric and execution phase for such systems. Accordingly, we present:
• A data-centric methodology towards performance anomaly detection and classification in industrial CPS,
• Compartmentalisation of industrial CPS runtime activity into executional units, i.e., behavioural phases,
• Generation of behavioural signatures and passports as constructs reflecting the behaviour of the system during unclassified and reference executions, respectively,
• And the demonstration of our workflow with proof-of-concept set-ups for two cases from the industry.
In the following section, we briefly elaborate background concepts necessary for our data-centric approach. In Section 3, we go through our methodology and provide a relatively high-level view of the major techniques employed, while Section 4 details the implementation of our solution and data-centric pipeline. Section 5 brings forth the specifics of our experiments with proof-of-concept set-ups, while the classification results are provided in Section 6. Further discussion on using our methodology's potential and increasing its effectiveness is carried out in Section 7, followed by related literature and our conclusions in Sections 8 and 9, respectively.

BACKGROUND
The bulk of concepts and techniques that we use in our methodology and experiments are provided here. What follows can be seen as the tooling to take advantage of the repetitive behaviour observed in industrial CPS.

Extra-functional behaviour
Extra-Functional Behaviour (EFB) conveys a computing system's behaviour that, as the name suggests, is not directly derived from the functional aspects of the system. Examples are execution time, different latencies, throughput, and power and energy consumption, amongst others. Metrics reflecting EFB depend not only on functional behaviour, but also on environmental circumstances, such as the platform itself, the input to the system and operational conditions. Environmental circumstances, being important variables for CPS, necessitate the role of EFB metrics in their monitoring and analysis.
Looking at EFB from a different perspective, one can observe that they reflect the effects of environmental circumstances, the platform and the input, which may be in physical form depending on the application, all directly related to physical aspects of a CPS. As such, we can consider EFB and the information provided by EFB as an interface, binding the digital and physical realms within the operational domain of a CPS, together.

Execution phases
Execution phases are, in essence, repeated units of execution, which are especially noticeable in industrial CPS operations.
Repeated tasks can be broken down into their subtasks, achieving a higher granularity in describing repeated units of execution. We call the smallest of these units an atomic execution phase. There may be (and usually are) combinations of atomic phases that are also repetitive. Such repetitions involving two or more consecutive atomic phases are called combo execution phases. Figure 2 depicts these concepts under normal and anomalous execution conditions. For any particular system, analysis will reveal the best choice(s) from the available atomic and combo phases. Formally, an arbitrary phase p_i is denoted as a tuple of its start, ts_i, and end, te_i, times within the execution timeline, i.e., p_i = (ts_i, te_i). Each execution phase category, atomic and combo, is considered as a partially ordered set, where the generic ordering condition is te_i < ts_{i+1}, making the ordering strict. This means that phases within one category should not overlap. As such, we can describe atomic and combo phases as P_atomic = {p_i : p_i ∈ T_atomic, i ≤ n, i ∈ N*} and P_combo = {p_i : p_i ∈ T_combo, i ≤ m, i ∈ N*}, respectively. Here, T is the set of available phase types, n is the number of atomic phases and m is the number of combo phases. Considering the simple example from Figure 2, we have T_atomic = {A, B} and T_combo = {AB}.
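The phase notation above can be captured in a few lines of code. The following is a minimal Python sketch (names and timings are illustrative, not taken from our tooling) representing phases as (start, end) tuples and checking the strict ordering condition within one category:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    """An execution phase: a (start, end) tuple on the execution timeline."""
    kind: str     # phase type, e.g. "A", "B", or the combo "AB"
    start: float
    end: float

def strictly_ordered(phases):
    """Strict ordering condition: each phase ends before the next starts,
    i.e. phases within one category never overlap."""
    ordered = sorted(phases, key=lambda p: p.start)
    return all(prev.end < nxt.start for prev, nxt in zip(ordered, ordered[1:]))

# Atomic phases A and B, and the combo phase AB spanning one A-B repetition.
atomic = [Phase("A", 0.0, 1.0), Phase("B", 1.5, 2.5),
          Phase("A", 3.0, 4.0), Phase("B", 4.5, 5.5)]
combo = [Phase("AB", 0.0, 2.5), Phase("AB", 3.0, 5.5)]
```

Note that atomic and combo phases may overlap each other (every AB contains an A and a B); the non-overlap condition only applies within each category.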

Performance anomalies
Anomalies in general can manifest themselves as unreliable or subpar performance behaviour. The manifestation might also take other forms of unexpected and harmful system behaviour, not necessarily performance-related. The two important terms to consider here are performance anomaly and fault [33]. A performance anomaly is any readily detectable deviation in the system's performance behaviour. For instance, if a computational job takes significantly longer than it should to be fulfilled, a performance anomaly has occurred. Detecting the actual fault causing the performance anomaly, however, requires insight into and analysis of the internal interactions of the system. As such, a performance anomaly is the result of a fault, but at the same time, not all faults will necessarily lead to a performance anomaly. Following the notation of execution phases and considering ts_i and te_i as expected start and end times vs. ts′_i and te′_i as realised ones, we can expect the following anomalous situations:

te_i < te′_i < ts_{i+1} and ts′_{i+1} = ts_{i+1} (1)
ts_{i+1} ≤ te′_i and ts′_{i+1} = ts_{i+1} (2)
ts_{i+1} ≤ te′_i and ts′_{i+1} > ts_{i+1} (3)

Considering the relation between te′_i and ts_{i+1}, we see that Equations (2) and (3) break the ordering condition, i.e., te′_i ≥ ts_{i+1}. These instances will be further discussed below.
Transient anomalies. A transient or localised anomaly is visible for a short period of time, or occurs only a few times.
In such cases, either the fault is short-lived, or its impact on the overall execution is rather contained. For instance, with regards to timeliness, transient anomalies could be seen as delayed tasks in a contained part of the execution timeline, meaning that the natural slack within the overall execution timeline can absorb such localised extended phase executions. We can also think of tasks without any dependency that do overlap and end up sharing available resources, as visualised in Figure 2 for the localised anomaly with independent consecutive phase As. In this context, Equations (1) and (2) point to localised and overtaking delays, respectively.
Persistent anomalies. In contrast to transient anomalies, a persistent or repeating anomaly is of the type that either keeps reoccurring, or creates cascading delays for all subsequent tasks. This could be the result of, for instance, dependencies between tasks, or an excessive amount of delay. Another factor is the placement (in time) of the anomaly. There will be no room for compensation for an anomaly occurring towards the end of a phase. The overall cost to the execution timeline is higher for such anomalies, as depicted in Figure 2 for cascading and repeating anomalies. The cascading effect is the result of phase B being dependent on phase A, as well as each phase AB being dependent on its predecessor.
Both situations result in a shifting effect, as formalised in Equation (3).
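The localised / overtaking / shifting distinction can be expressed as a simple decision over the expected and realised phase boundaries. The sketch below is illustrative (the function name and argument order are our own, not part of the published notation): a delay is shifting if the next phase starts late, overtaking if the realised end reaches into the next phase while that phase still starts on time, and localised if the phase merely ends late but before the next one starts.

```python
def classify_delay(t_end, t_end_real, t_start_next, t_start_next_real):
    """Classify a delayed phase by comparing expected (t_*) and realised
    (t_*_real) phase boundaries."""
    if t_start_next_real > t_start_next:
        return "shifting"       # delay cascades into the next phase
    if t_end_real >= t_start_next:
        return "overtaking"     # phases overlap, next still starts on time
    if t_end_real > t_end:
        return "localised"      # delay absorbed before the next phase
    return "normal"
```

Shifting delays correspond to persistent anomalies, while localised and overtaking delays correspond to the transient cases discussed above.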

Signatures and passports
Signatures and passports [33] are data-centric representations of EFB, for they are composed using the time-series data collected during the execution timeline of an application. A signature is the representation of an execution phase, modelling the trend of EFB metrics collected during that phase. Our technique of choice for this modelling is regression.
A passport is a reference signature, collected under normal, reference execution conditions, and used in comparisons. As such, the collected values for a metric of choice during an execution phase, for instance CPU time, are transformed into cumulative data points in time, resulting in a time series collection. Using regression modelling, the closest function interpolating these data points is generated and used as the representation for that particular execution phase. Note that signatures are calculated per metric and per phase. For applications with multiple processes, different phases will potentially include activity data from multiple processes. This means that there will be signatures not only per metric and per phase, but also per process. Process identifiers can be used to differentiate between signatures based on the same metric and within the same phase.
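As a minimal illustration of this idea (a sketch, not our actual pipeline: polynomial regression and a hand-rolled R² stand in for the exact modelling choices), a signature can be the coefficients of a regression fit over the cumulative metric trend, and the comparison to a passport can be a goodness-of-fit score:

```python
import numpy as np

def make_signature(timestamps, metric_values, degree=2):
    """Fit a polynomial regression to the cumulative metric trend of one
    execution phase; the coefficients act as the behavioural signature."""
    cumulative = np.cumsum(metric_values)        # e.g. cumulative CPU time
    return np.polyfit(timestamps, cumulative, degree)

def goodness_of_fit(passport_coeffs, timestamps, metric_values):
    """R^2 of a stored passport against newly observed data: a
    quantifiable measure of behavioural deviation."""
    cumulative = np.cumsum(metric_values)
    predicted = np.polyval(passport_coeffs, timestamps)
    ss_res = np.sum((cumulative - predicted) ** 2)
    ss_tot = np.sum((cumulative - np.mean(cumulative)) ** 2)
    return 1.0 - ss_res / ss_tot

t = np.linspace(0.0, 10.0, 50)
normal_cpu = np.full(50, 0.2)          # steady per-sample CPU-time increments
passport = make_signature(t, normal_cpu)
slow_cpu = np.full(50, 0.3)            # anomalously slow phase
```

A signature built from normal data scores close to 1.0 against its passport, while the anomalously slow phase scores markedly lower, giving the quantifiable deviation used for detection.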
Considering our use-cases included in this paper, we have opted for both internal metrics, e.g., CPU time, and external metrics, e.g., electrical current, resulting in software signatures and power signatures, respectively.

Communication-centric monitoring, modelling and simulation
Most multi-node and multi-process, i.e., distributed, industrial CPS include a message passing subsystem, predominantly in the form of a message broker based on the publish-subscribe software architecture pattern. To reduce the observed behavioural surface in a complex CPS, one can limit the observation to this message passing subsystem and, specifically, to the message broker process. Although the observed behaviour will be a subset of the full system behaviour, it will be a valid representation, sufficiently reflecting the trends and changes [34,30]. Such communication-centric monitoring is achieved by collecting traces from purposefully planted probes in the code of the target communication subsystem. Such invasive probing provides us with enough data to support modelling and simulation efforts.
Collected information consists of EFB metrics that are read and processed. In its final form, the tracing data consists of communication (read and write) and computation events. Communication events contain information such as senders, receivers, message sizes, and timing information. Computation events contain information such as CPU utilisation, CPU waiting times, process IDs and timing information.
We employ high-level modelling and trace-driven simulation of the CPS with the purpose of reasoning about the system behaviour. Accordingly, we have a detailed view of how different resources in the system are being used at different stages of the execution of a workload, e.g., CPU status and buffer utilisation. Moreover, the usage of high-level simulation models allows us to explore the effects of possible actuation mechanisms or countermeasures against performance anomalies in a safe environment, before applying these actuations to the real system.
High-level modelling and simulation are rather intertwined notions, as a simulation is the execution of a calibrated model. In Sections 4.1.1 and 4.1.2, we describe how trace data is used to automatically synthesise and calibrate our high-level model. The model is a simplified, but sufficiently descriptive representation of the studied system, hence high-level.
Having the knowledge of the involved actor types, i.e., data producers (writers), data consumers (readers), and data brokers (communication libraries), complemented with interconnection information, i.e., topology, is sufficient to construct the high-level model. Calibrating this model for simulation is done by considering metric recordings collected during the system's operation. The aforementioned tracing format, i.e., read, write and compute events, is especially suitable for a discrete-event simulation.

Time series data classification
As we follow a data-centric approach in our methodology, we employ machine learning and classification algorithms [29,35]. When it comes to the classification of time series data, both deep learning algorithms, i.e., neural networks, and traditional classification algorithms, e.g., decision trees, are viable options. We are especially interested in explainable results and the ability to trace results back to the original data. This level of understanding is necessary when developing a methodology. Traditional classification algorithms are much more suitable for this goal, but, on the other hand, they do require feature engineering. Accordingly, we have experimented with Decision Tree (DT) [40], Random Forest (RF) [3], Gaussian Naïve Bayes (GaussianNB) [15], k-Nearest Neighbours (k-NN) [10], Linear Support Vector Classification (LinearSVC) [2] and Kernel Support Vector Machine (KernelSVM) [9] classifiers for our demonstrators.
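All six classifiers are available in Scikit-learn, the library our pipeline relies on. The following sketch trains and scores each of them; the synthetic feature set generated here is a stand-in for the engineered per-phase, per-metric EFB features described later:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Stand-in feature set; in our workflow the features are engineered
# from per-phase, per-metric EFB data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "LinearSVC": LinearSVC(),
    "KernelSVM": SVC(kernel="rbf"),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
```

Each entry of `scores` is the held-out accuracy of one classifier, which is the comparison performed in our experiments.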

METHODOLOGY
Considering the high-level view of our performance anomaly detection and actuation approach given in Figure 1, let us present an elaborated version here. The focus has been to take advantage of the current data-rich industrial CPS ecosystem.

Performance anomaly identification
Our methodology can be divided into two main workflows, the first of which addresses the requirement for performance anomaly detection and identification. Depicted in Figure 3, this anomaly detection and identification workflow is completely covered in this paper.

Fig. 3. High-level approach towards feature engineering, performance anomaly detection and identification

There are numerous sources of data collection in a modern industrial CPS. These include hardware-based sensors, as well as metric collection tools within the software running the system. Additional low-overhead software probes and serial hardware sensors can also be introduced, as is the case in both of our demonstrators. Considering the approach given in Figure 3, this online monitoring, happening during the operation of the system, will produce large amounts of data. The feature engineering that follows involves different data manipulation steps. Examples are parsing, extraction of important pieces of data and, especially fundamental for us, compartmentalisation. The latter is where we divide the data according to execution phases. Following feature extraction, this data set is transformed into a feature set. Some features are readily taken from the data set, while others are to be calculated, e.g., through the generation of signatures and passports.
Detection is the step where the deviation of the behaviour at hand from the reference behaviour is considered. In practice, there is a close relation between the "Detection", "Known behaviour" and "Anomaly identification" blocks. We leave implementation details for the sections ahead. Upon suspicion of new and unseen anomalous behaviour, the "Analysis" block comes into play. At this point, such an analysis is considered to be a manual effort, but we do foresee that certain analyses could be performed automatically. The goal is to update the feature set and retrain the employed classifier, allowing it to automatically detect this newly introduced behaviour. Depending on the maturity of the feature set for this specific behaviour, there may or may not be a need to compose new features from the data set.

Behaviour improvements
Detection and identification of anomalous behaviour will be followed by another workflow aimed at countering the anomaly itself, or its effects, keeping the industrial CPS fulfilling its purpose. This workflow, given in Figure 4, is the subject of our ongoing research. However, certain pieces of the puzzle are already in place, namely, a digital twin. Not every anomaly can be countered though.

Fig. 4. High-level approach towards countering the effects of anomalies, involving the actuation policy, its verification and its activation on the industrial CPS

We envision well-defined actuation policies for the anomalies that can be acted upon. The actuation policy chosen on the basis of the identified anomaly will first be applied to a twin of the system. For the most part, the anomaly detection and identification workflow from Figure 3 is also applicable to the twin. We evaluate whether the behaviour of the twin is identified as normal, or whether the deviation from normal is reduced, in comparison to the anomaly itself. Upon an acceptable improvement, the actuation policy will be activated on the anomalous industrial CPS. The "Actuation policy dispatcher" may also come up with novel actuations, beyond what is previously known and available. Such policies, if they result in improvements, will be added to the database. Whether an actuation policy results in an improvement or not, it could contribute to the base design of the system and fundamentally prevent the anomaly from happening. For instance, when a software bug is discovered, or a corner case related to the order of calls in different processes is revealed, the base design can be improved. However, analysis has to be performed to discover such causes behind a policy's failure, correlate these causes with the design and develop rectifications for them. Similar to the anomaly detection and identification workflow, we expect the analysis following an unsuccessful actuation policy to be a manual one.

White box approach
A white box approach means that there is interaction with the software running the industrial CPS to collect behavioural metrics during its runtime. Whether the interaction is with the Operating System (OS), through standard/custom tooling, or through probing of the workload handling application on top of the OS, it will render the monitoring approach as white box. In such an approach, the application is analysed to discover its internal operations, e.g., I/O and computation operations. For a multi-process application, the analysis is performed on a per process basis. Note that these internal operations may or may not match the execution phases. There could very well be multiple processes that are active through different phases and different sets of operations belonging to these processes are involved in each phase.
By implementing probes collecting system-specific and application-specific EFB metrics at the beginning and end of each operation, we will be able to collect the footprint of these metrics for that specific operation. For instance, as shown in Pseudocode 1, by collecting the amount of CPU time consumed by a process at the beginning and the end of a computation operation, we will be able to calculate the amount of CPU time consumed by that operation.
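Pseudocode 1 is not reproduced here, but the probing principle it describes can be sketched in a few lines of Python (names such as `probed` and the `trace` list are illustrative, not the actual tooling): read the process's consumed CPU time at the beginning and end of an operation, and record the difference as that operation's footprint.

```python
import os
import time

trace = []  # collected footprints, one record per probed operation

def probed(operation):
    """Wrap an operation with begin/end probes: the CPU-time footprint of
    the operation is the difference between the two readings."""
    def wrapper(*args, **kwargs):
        pid = os.getpid()
        cpu_begin = time.process_time()          # CPU time consumed so far
        result = operation(*args, **kwargs)
        cpu_used = time.process_time() - cpu_begin
        trace.append({"pid": pid, "op": operation.__name__, "cpu": cpu_used})
        return result
    return wrapper

@probed
def computation():
    return sum(i * i for i in range(100_000))

computation()
```

The same pattern applies to any system-specific or application-specific EFB metric for which a "consumed so far" reading is available at both probe points.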

Black/grey box approach
The same principle as the white box approach applies here, with the difference that no internal probing is possible. This could be due to various reasons, such as a lack of access, or strict controls over monitoring overhead. For such cases, the EFB metrics of choice are external ones. These metrics are usually collected via hardware probes in a passive fashion. The lack of interaction with cases that are being treated as black/grey box brings forth another limitation: behaviour improvement is harder to achieve when following a black/grey box approach. We will show an example using electrical EFB metrics in the coming sections.
Hardware-based sensors are the category of choice for probing during a black/grey box approach. Especially when it comes to electrical metrics, one can use external power meters and power data loggers. Nowadays, more advanced embedded platforms have electrical metric collecting probes included in their design, reducing the need for additional hardware.

Digital and physical twins
Collecting traces under anomalous behaviour might not be a straightforward task when it comes to real, production-grade industrial CPS. The causes behind such a limitation can be numerous, but regardless, there will be a need for synthetically generated anomalous traces. To be able to resemble anomalous behaviour and generate synthetic anomalous traces, the role of a digital twin is a crucial one. This digital twin should sufficiently replicate the behaviour of its physical exemplar. Additionally, using a digital twin dictates the application of a harmonisation procedure on the data coming from the real system. The argument here is that since both types of traces, normal and anomalous, are going to be the basis for comparisons, they have to be harmonious. The effects of synthetically introduced anomalies will not be local and there is great potential for cascading effects. As a naive example, delaying a process that should send a message to another will certainly delay the reception at the latter, affecting its behaviour. We can propagate such effects by using the very same digital twin as a reference platform. As such, all traces are harmonised by passing through the digital twin. More details on this process will be given in Section 4.1.
Another important purpose of having a digital twin is its role in the actuation workflow. As seen in Figure 4, any chosen actuation policy will first be applied to a digital twin, in order to evaluate its effects on the representative twin's behaviour. Only if there is an improvement in a desirable direction will the policy be activated in the real system. It is not a strict requirement for the twin to be a digital one. As discussed earlier, for platforms approached in a black/grey box fashion with a smaller footprint, a physical twin can be a fair and even more suitable option. A physical twin is especially advantageous when interacting with an industrial CPS in the form of a distributed embedded system composed of numerous identical nodes. In such set-ups, a single physical twin can verify actuations aimed at any of the anomalous nodes.

Use-cases
Following the discussion of the industrial CPS complexity spectrum from Section 1, we apply our methodology to, and illustrate it for, two distinct use-cases, both taken from the industry. For the case of the semiconductor photolithography machine, we follow a white box approach, in which we add probes within the software to collect EFB metrics. Thus, the resulting representations are software signatures and software passports. For the case of the image analysis platform, the system has been treated as a grey box, with behaviour monitoring based on electrical metrics, which are collected via external probing. The resulting representations for this case are power signatures and power passports. Though there are subtle differences between the cases in the early data manipulation steps, the generated constructs for both have been treated with the same classifiers.

IMPLEMENTATION
Although the implementations of behavioural signatures and their generation in both of our proofs-of-concept follow the previously elaborated methodology, differences remain. We will describe each case separately. The software passport workflow and the power passport workflow refer to the implementations of our methodology for the semiconductor photolithography machine and the image analysis platform, respectively.

Software passports workflow
As noted in Section 3, our software passport workflow has been applied to ASML's semiconductor photolithography machines, one of this paper's real-world demonstrators. The workflow is depicted in Figure 5 and follows a white box approach. This complex industrial CPS includes communication subsystems based on the publish-subscribe architectural pattern. The subsystem connects application processes from different software components, which in turn run on distributed computing nodes. The depicted software pipeline has been created to extract per-process, per-metric and per-execution-phase information from the simulated traces. Regression modelling techniques are used to transform the extracted data into regression functions that most accurately approximate the performance of processes over time.

For this use-case, we are working with the main computing node in the photolithography machine, which is in charge of the wafer processing and wafer exposure tasks. Thus, we have considered the wafer processing operation (wafer op.) as a combo phase for the full processing and exposure of a single wafer. We consider this operation as combo, since it can be broken down into different sub-operations, which are more fine-grained and could be considered as atomic phases.
4.1.1 Effective sensing. We strive to capture a partial view of the system's behavioural trend, small enough for monitoring and data manipulation, but also complete enough to give a reasonably accurate view of the behaviour. The sensory data required for this goal is gathered online, through invasive probing, i.e., by adding probes in the code, in a communication-centric fashion. Communication-centric monitoring allows us to limit the introduced overhead, as the software probes are implemented at the system's communication module, i.e., the message broker. When communication takes place, the involved processes are traced indirectly from within the broker calls. The overhead is much less compared to a fully invasive tracing of each and every software process, at the cost of a reduction in the amount of information captured. More precisely, we will miss information on processes that are not using the communication module. The current information captured by our tracing tool includes per process data such as event start and end timestamps, CPU utilisation, message sizes, message IDs, message sources and destinations, and memory utilisation. Following the description of software probes for a white box approach given in Section 3, Pseudocode 2 elaborates our probing for communication and computation events.
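As a rough sketch of what such broker-side probes record (the `trace_comp_start` name follows the description in the text, while `trace_comm` and the exact record fields are our own illustrative choices, not the actual tracing tool):

```python
import time

TRACE = []  # event records collected by the broker-side probes

def trace_comp_start(pid):
    """Computation-event probe: record only the reference point; the
    CPU-time footprint is derived later, at the matching end call."""
    TRACE.append({"pid": pid, "event": "comp_start", "ts": time.time()})

def trace_comm(pid, kind, src, dst, size):
    """Communication-event probe (read/write) inside a broker call,
    recording source, destination, message size and timestamp."""
    TRACE.append({"pid": pid, "event": kind, "src": src, "dst": dst,
                  "size": size, "ts": time.time()})

# Example: process 23 starts computing, then writes a 512-byte message to 41.
trace_comp_start(23)
trace_comm(23, "write", src=23, dst=41, size=512)
```

Records of this shape are later parsed into the write, read and compute events consumed by the trace-driven simulation.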
When a process starts its execution, a trace_comp_start call saves the id of the process, the type of event and the start timestamp. Notice that this is the initial reference point, ergo the CPU time is not recorded.

4.1.2 Digital twin for industrial CPS. As mentioned in Section 3.5, there is a need for a digital twin when the generation of anomalous traces is not straightforward. For our semiconductor photolithography machine use-case, we have to rely on a digital twin to generate anomalous traces. Accordingly, a digital twin in the form of a trace-driven simulator comes into play. The discrete-event simulation environment, including real processes, simulated processes, the interaction between simulated processes and the resource manager model, as well as the topology of the communication subsystem model, is depicted in Figure 6. We are using the OMNEST simulation framework for this purpose [44].
Note that only processes that are using the communication module are being traced, e.g., Real process i will not have a contribution in the collected behavioural metrics. The outcome of tracing to be used for the modelling and simulation contains information such as event ID, process ID, event initiation timestamp, event end timestamp, event type and CPU utilisation.

Fig. 6. A discrete-event simulation environment as the digital twin of the semiconductor photolithography machine node, considering the topology from the broker's perspective and parsed traces that are converted into write, read and compute events

Trace-driven virtual processes are generic simulated processes, automatically generated based on the number of executed application processes in the real system, i.e., processes observed using communication-centric monitoring.
Simulated processes are trace-driven and the simulation engine executes one event at a time. Process models are automatically inferred from the monitoring traces, avoiding the need to manually derive these models, which is typically very complex, or even infeasible to do for industrial CPS. Traces are parsed and converted to three types of events, namely, write, read and compute. These events are consumed by simulated processes and resource requests are created depending on the content of the trace, e.g., used CPU, type of event and timestamps. For instance, a process requiring a certain number of CPU cycles to perform the operation specified in the event, will create a CPU request. This request will arrive at the ready queue of the simulated CPU and will be served according to the CPU scheduling policy implemented in the resource manager model.
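The core of such a trace-driven simulation can be illustrated without a full framework. The sketch below (a deliberate simplification of the OMNEST-based twin: a single CPU, a FIFO scheduling policy, and instantaneous read/write events) shows how compute events from the trace become CPU requests served in order:

```python
def simulate(trace_events):
    """Minimal trace-driven sketch: compute events become CPU requests
    served by one simulated CPU with a FIFO ready queue; read/write
    events are treated as instantaneous here for brevity."""
    cpu_free_at = 0.0
    finish_times = {}
    for ev in sorted(trace_events, key=lambda e: e["t_init"]):
        if ev["type"] == "compute":
            start = max(ev["t_init"], cpu_free_at)   # wait if the CPU is busy
            cpu_free_at = start + ev["cpu"]
            finish_times[ev["id"]] = cpu_free_at
        else:                                        # read / write
            finish_times[ev["id"]] = ev["t_init"]
    return finish_times

trace = [{"id": 1, "type": "compute", "t_init": 0.0, "cpu": 2.0},
         {"id": 2, "type": "compute", "t_init": 1.0, "cpu": 1.0},
         {"id": 3, "type": "read", "t_init": 1.5}]
```

Here the second compute event arrives at t=1.0 but must wait until the CPU is free at t=2.0, finishing at t=3.0, which is exactly the kind of resource contention the resource manager model captures.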

Hiccup injection.
Our fault injection implementation focuses on affecting the duration of events by modifying the end-time of events [33]. Our software allows us to completely configure the percentage of processes to be affected, the percentage of events to be affected, the type of event, i.e., communication or computation, the type of performance anomaly to be introduced, i.e., transient or persistent, as well as the overlap between the processes selected to be modified while introducing transient or persistent anomalies.
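A reduced version of such a configurable injector can be sketched as follows (function and field names are illustrative, not our actual software): extend the end times of a configurable fraction of events, and, for persistent anomalies, also shift every subsequent event to model cascading delays.

```python
import random

def inject_hiccup(events, fraction=0.2, delay=1.0, persistent=False, seed=0):
    """Extend the end time of a configurable fraction of events; a
    persistent anomaly additionally shifts all subsequent events."""
    rng = random.Random(seed)
    out = [dict(e) for e in events]     # leave the original trace intact
    shift = 0.0
    for e in out:
        e["t_init"] += shift
        e["t_end"] += shift
        if rng.random() < fraction:
            e["t_end"] += delay         # the hiccup itself
            if persistent:
                shift += delay          # cascades into later events
    return out

base = [{"t_init": float(i), "t_end": i + 0.5} for i in range(5)]
```

With `persistent=False` only the affected events end later, while with `persistent=True` each hiccup pushes all following events back, mirroring the transient vs. persistent distinction from Section 2.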
The three aforementioned types of events involve process idleness and CPU access delay. The duration of an event (d), which is the elapsed time from its initialisation (t_ini) till its end (t_end), is made up of CPU access delay (delay), CPU time (comp) and idleness (idle), such that d = t_end − t_ini = comp + (delay + idle). We are considering synthetic increases (delay_synth) to the original duration of the event (d), which means that we are increasing the combined duration of CPU access delay (delay) and idleness (idle), resulting in an increased event duration (d_synth), such that d_synth = comp + (delay + idle + delay_synth). Note that CPU access delay and idleness result from different conditions and we are not able to distinguish between them using the deployed tracing mechanism. While CPU access delay is simply a wait before CPU availability, idleness could be a result of I/O waits, or functional and data dependencies to other processes. Nevertheless, our synthetic manipulation of traces does not depend on the distinction between the two, since we will be adding to the overall delay of an event, i.e., anything other than comp. For classification, the goodness-of-fit comparison results are organised into feature sets; the descriptions of these features are as follows:
• Operation: Considered phase, i.e., wafer op.
• Metric: Considered metric, i.e., CPU time, cumulative reads (reads count) and cumulative writes (writes count) in our experiments for this use-case
• pid, phase, event (R²): Percentage difference between the signature's R² value and the R² value of the reference passport per PID and per event, e.g., for a single wafer op. phase, we will have pid_event columns such as, 23_compute, 23_read and 23_write for PID 23, corresponding to deviations of computation, read and write behaviour, respectively.
• pid, phase, event (RMSE): Percentage difference between the signature's RMSE value and the RMSE value of the reference passport per PID and per event, following the same pid_event column convention.
Note that if a PID is strictly a producer of data, it will not have a read column, or for the case of a strict consumer, a write column will not be present. A total of 66 different processes are involved in this use-case's phase data. We split the data into 70% training and 30% test data. Our data analysis pipeline has been written in Python 3.7 and for our regression and classification needs, we rely on the Scikit-learn 0.23.1 machine learning library [36].
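To illustrate the resulting feature layout and the 70/30 split, a minimal sketch with hypothetical values (the column names follow the pid_event convention from the text; the numbers are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def pct_diff(value, reference):
    """Percentage difference of a signature statistic w.r.t. the passport."""
    return 100.0 * (value - reference) / reference

# Illustrative feature rows: per-PID, per-event goodness-of-fit deviations.
rows = [
    {"operation": "wafer_op", "metric": "cpu_time",
     "23_compute": pct_diff(0.97, 0.99), "23_write": pct_diff(0.88, 0.95),
     "label": "Normal"},
    {"operation": "wafer_op", "metric": "cpu_time",
     "23_compute": pct_diff(0.60, 0.99), "23_write": pct_diff(0.55, 0.95),
     "label": "Persistent Harmful"},
]
df = pd.DataFrame(rows)
X = df.drop(columns=["label", "operation", "metric"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, random_state=0)
```

In the actual data set, one row per scenario carries such deviation columns for all 66 involved processes.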

Power passports workflow
Our set-up for a proof-of-concept, specific to the image analysis use-case, alongside the feature engineering and feature extraction pipeline is given in Figure 7. This set-up consists of an ODROID-XU4 computing device, implementing the ARM big.LITTLE computing architecture. We are running a stripped-down Linux distribution and the main running application is a neural network-based image analysis software, detecting if cars are present in images. The platform is capable of receiving images from either a camera, or a storage device and in our case, images are provided via a storage device. Note that for this use-case, the combination of the proof-of-concept set-up and the data processing pipeline has less complexity, mainly due to the absence of a digital twin, i.e., a simulator. Here, there is no need for such a twin, for we are able to include real anomalous conditions, eliminating the requirement for synthesising anomalous traces. There will still be a need for a digital or a physical twin when it comes to evaluating actuation policies though.  For our electrical EFB data collection, we rely on the Otii Arc power data logger unit [37]. The data logger collects electric potential in Volts and electric current in Amps with the sampling rates of 1 kHz and 4 kHz, respectively.

Signature/passport generation flow
Timestamps for each data collection are also recorded alongside these metric values. As shown in Figure 7, data collection is followed by compartmentalisation (cutting) and the application of regression modelling, resulting in power signatures/passports. The goodness-of-fit values are acquired by comparing a signature to a passport.
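As a sketch of this comparison, with a fitted polynomial standing in for a passport (an assumption for illustration; the concrete regression family is use-case dependent):

```python
import numpy as np
from sklearn.metrics import r2_score

# A passport is a fitted regression model of a reference phase; here a
# quadratic fitted with NumPy stands in for it.
t_ref = np.linspace(0.0, 1.0, 50)
ref_metric = 2.0 * t_ref**2 + 0.5 * t_ref + 1.0
passport = np.polynomial.Polynomial.fit(t_ref, ref_metric, deg=2)

# Ongoing behaviour: sample points from a new execution of the phase,
# deviating slightly from the reference.
t_new = np.linspace(0.0, 1.0, 50)
new_metric = 2.0 * t_new**2 + 0.5 * t_new + 1.0 + 0.01

# Goodness of fit of the new samples against the passport's predictions:
# a score close to 1 indicates behaviour matching the reference.
score = r2_score(new_metric, passport(t_new))
print(round(score, 4))
```

Larger deviations push the score down, yielding the quantifiable measure of behavioural deviation used as classifier input.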
Given that we have electrical potential, electrical current and time readings, we consider the three metrics, current in Amps, power in Watts and energy in Milliwatt-hours, for generation of signatures. The image analysis running on the target platform has two main operations, i.e., reading images from a storage and applying a neural network detection algorithm on them. Thus, we have considered the following atomic and combo phases for current, power and energy readings: • Image op.: An atomic phase for the image loading operation, • Neural op.: An atomic phase for the neural network operation, • Cycle op.: A combo phase for a full image cycle, including image loading and neural network operations.
As depicted in Figure 8a, we compartmentalise the processing of an image into execution units, based on the mentioned phases. The parsing is followed by a cutting step, breaking the parsed data per phase.
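The derivation of the three metrics from the raw voltage and current readings can be sketched as follows (a minimal sketch; alignment of the logger's two sampling rates is omitted and the function name is illustrative):

```python
import numpy as np

def electrical_metrics(t_s, volts, amps):
    """Derive power (W) and cumulative energy (mWh) from voltage and
    current samples; t_s are sample timestamps in seconds. Assumes
    already-aligned V and I samples (the Otii unit samples them at
    different rates, so resampling is needed first)."""
    power_w = volts * amps                        # P = V * I
    dt_h = np.diff(t_s, prepend=t_s[0]) / 3600.0  # step sizes in hours
    energy_mwh = np.cumsum(power_w * dt_h) * 1000.0
    return power_w, energy_mwh

t = np.array([0.0, 1.0, 2.0])
v = np.array([5.0, 5.0, 5.0])
i = np.array([0.2, 0.2, 0.2])
p, e = electrical_metrics(t, v, i)
# 1 W sustained for 2 s = 2 Ws = 2/3600 Wh ≈ 0.556 mWh
```

Current, power and energy are then cut per phase and fed to the regression step.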

Mean passports.
In addition to individual power passports, we are also generating mean power passports as a unified representation for many reference executions with different input data. Mean passports are generated per metric and per phase type. The generation of mean power passports is not a straightforward task, as there need to be matching timestamps. Basically, we are calculating the mean of many regression functions, which can be written as f_mean(t) = (1/n) * (f_1(t) + ... + f_n(t)), with t being the independent variable, time, and n the number of functions.
Either we have to do listwise deletion and remove independent variable readings which do not exist in all phases, or we have to perform data imputation. The latter is much more preferable, for there are executions that take slightly longer and we do not wish to disregard valid data collection points residing at the end of longer executions. In principle, this applies to other unmatched points as well, regardless of their location within the execution time frame. We have already generated regression models for data collections as their representations and as such, we can perform regression-based imputation, evaluating the fitted models at the missing timestamps. The features used for classification and their descriptions are as follows:
• Operation: Considered phase out of image op., neural op. and cycle op.
• Core type: Utilised CPU core, big or little
• Δ(R²): Absolute difference between the signature's R² value and the R² value of the reference mean passport
• R²: Goodness-of-fit value for sample points from one image against the reference mean passport
• Δ(RMSE): Absolute difference between the signature's RMSE value and the RMSE value of the reference mean passport
• Label: Normal, Anomaly 1, Anomaly 2, etc. (NoFan and UnderVolt for this use-case, as will be discussed in the experimental set-up)
Here, we also split the data into 70% training and 30% test data.
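The mean-passport construction described above can be sketched as follows, with simple linear models standing in for the actual per-execution regression models (illustrative data; NumPy's polynomial fit plays the role of the regression step):

```python
import numpy as np

# Mean passport: f_mean(t) = (1/n) * sum_i f_i(t). Instead of deleting
# unmatched timestamps, evaluate each execution's fitted regression
# model on a common time grid (regression-based imputation) and average.
executions = [
    (np.linspace(0.0, 1.00, 40), lambda t: 1.0 + 0.5 * t),
    (np.linspace(0.0, 1.05, 40), lambda t: 1.1 + 0.5 * t),  # longer run
]
models = [np.polynomial.Polynomial.fit(t, f(t), deg=1)
          for t, f in executions]

# Common grid covering the longest execution, so the valid points at
# the end of longer runs are kept rather than discarded.
grid = np.linspace(0.0, max(t[-1] for t, _ in executions), 50)
mean_passport = np.mean([m(grid) for m in models], axis=0)
```

The shorter executions' models are simply evaluated past their last sample, which is the regression-based alternative to listwise deletion.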

EXPERIMENTAL SET-UP
Let us go through the actual experiments performed for each use-case. Considering that we are following a data-centric approach, the main goal is to come up with multiple scenarios and relevant data sets to have enough variation for training and testing in our AI-based solution.

Semiconductor photolithography use-case
In the set of experiments conducted with the ASML semiconductor photolithography machine, we have considered one execution phase corresponding to a repetitive task performed by the machine, i.e., the wafer processing operation. Photolithography machines are always used for exposing batches of wafers with the same chip design, hence the repetitive task. The simulation's output traces corresponding to the baseline execution were used to generate software passports for each process involved in the phase. In this fashion, the simulator, i.e., the digital twin of the system, is considered as a reference platform.
In order to generate a proper data set containing different types of performance anomalies, we have composed several hiccup injection scenarios. The hiccup injection protocol, resulting in these scenarios is elaborated in the following steps: (1) From the total number of available processes, 15%, 30% and 45% of them were selected randomly to introduce modifications in their traces.
(2) From the processes selected in the previous step, two main performance anomaly groups, persistent and transient, were created. Software processes in the traces were randomly assigned to each group, while a 10% overlap between the two groups was ensured. For example, with twenty hiccup-eligible processes and a 10% overlap, the two resulting groups would share two processes. The overlap percentage is configurable and its purpose is to create a more realistic scenario, where there is no clear separation between the set of processes that could cause persistent anomalies and the ones that could cause transient anomalies.
(3) Computation, read, write, or all types of events are selected from the designated group of processes in the previous step. For each of the event types selected, either 10%, 15%, 30%, or 45% of them are modified, adding some overhead to the event's elapsed time.
(4) The overhead applied to each event is either of 5%, 15%, 30%, or 45% of their total event elapsed time.
(5) Lastly, the above steps were repeated five times to create sufficient data for different categories.
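Step (2) of the protocol, the overlapping group creation, can be sketched as follows (group sizing is an illustrative choice, not the exact implementation):

```python
import random

def make_anomaly_groups(pids, overlap=0.10, seed=0):
    """Split hiccup-eligible processes into persistent and transient
    groups with a configurable overlap between the two."""
    rng = random.Random(seed)
    pids = list(pids)
    rng.shuffle(pids)
    n_overlap = max(1, round(overlap * len(pids)))
    half = len(pids) // 2
    persistent = set(pids[:half + n_overlap])  # extends into the other half
    transient = set(pids[half:])
    return persistent, transient

persistent, transient = make_anomaly_groups(range(20), overlap=0.10)
# With 20 eligible processes and 10% overlap, the groups share 2 processes.
```

Processes in the shared subset may thus receive both persistent and transient hiccups, mirroring the lack of a clean separation in real systems.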
Using the protocol described above, we created 1920 different scenarios, where each scenario contains traces corresponding to the monitored software processes. These modified traces were simulated, using the digital twin environment described in Section 4.1.2, to determine the impact on the phase execution time. Since our potentially anomalous execution scenarios and the relevant data are synthetically generated, we need an indicator to detect if the injected hiccup has turned the execution into an anomalous one. This indicator is the execution time of the phase. Considering the original phase execution time as the baseline, for increases of 0-2%, 2-4% and above 4% in the phase execution time, we consider the behaviour to be Normal, Benign and Harmful, respectively. Combined with the groups of processes that are used for hiccup injection, we end up with the categories Normal, Persistent Benign, Persistent Harmful, Transient Benign and Transient Harmful.
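The labelling rule can be summarised in a small helper (a sketch of the thresholds described above):

```python
def label_scenario(phase_time, baseline_time, group):
    """Derive the scenario label from the relative increase in phase
    execution time: 0-2% Normal, 2-4% Benign, above 4% Harmful,
    combined with the injected anomaly group (persistent/transient)."""
    increase = (phase_time - baseline_time) / baseline_time * 100.0
    if increase <= 2.0:
        return "Normal"
    severity = "Benign" if increase <= 4.0 else "Harmful"
    return f"{group.capitalize()} {severity}"

label_scenario(103.0, 100.0, "transient")  # → "Transient Benign"
```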

Image analysis use-case
The input data for the image analysis application are provided on a storage device. We have considered two different sets of images, the first being proper images with meaningful scenery. This batch includes images with and without a car depicted in them. The second batch includes images that do not depict any particular shape and have purely randomised pixels, introducing variation in the input. In this fashion, we could evaluate whether the composition of an image is a factor for our workflow. Each batch is used in two different executions, one involving a single round of image analysis and the second involving ten rounds of image analysis, meaning the same batch is processed ten times, sequentially.
We have chosen the number of rounds arbitrarily and with the aim to have a long enough execution, reflecting the effects of anomalies. Each batch includes 30 images, making the workload for ten rounds of processing as 300 images.
We have also executed every combination of conditions twice, by assigning the application to either a big, or a little core on the platform. The list of performed data collections is as follows:
• Case 1: 1 round of execution, regular images, little core (30 total images)
• Case 2: 10 rounds of execution, regular images, little core (300 total images)
• Case 3: 1 round of execution, regular images, big core (30 total images)
• Case 4: 10 rounds of execution, regular images, big core (300 total images)
• Case 5: 1 round of execution, randomised images, little core (30 total images)
• Case 6: 10 rounds of execution, randomised images, little core (300 total images)
• Case 7: 1 round of execution, randomised images, big core (30 total images)
• Case 8: 10 rounds of execution, randomised images, big core (300 total images)
The structuring of our workloads for the aforementioned data collection cases fits the principle of repetitive task execution for industrial CPS rather well. Keep in mind that although these tasks are repetitive, the underlying non-determinism is still present, as the behaviour of the system shows subtle variations per execution, even with the exact same input. Now that the different cases are defined, we have considered two different anomalies, affecting the performance and the reliability of the system.
Malfunction of the cooling system. For this anomaly, henceforth called NoFan, we have disabled the cooling fan of the platform's CPU block. Keep in mind that in our set-up, the fan has a separate supply of power to begin with and will not directly affect electrical EFB metric readings of the platform, as depicted in Figure 7.
Unstable power delivery. For this scenario, henceforth called UnderVolt, we have reduced the voltage supply to a level below the required amount for the platform, while still keeping the device functional. This was a reduction from 5.0 Volts to 4.7 Volts. We have also made sure that the reduced voltage supply remains at a sufficient level, such that it will not result in glitches during the execution of tasks.
Mirroring the arrangement for data collection under normal circumstances, reflecting the normal behaviour, we have considered the exact same cases with different batches of images, different numbers of processing rounds and different cores. Accordingly, our experiments resulted in eight cases for each scenario, representing Normal, NoFan and UnderVolt situations. It must be mentioned that having an equal number of cases, and essentially an equal number of data fields, for all scenarios is advantageous, as it results in a balanced data set for classifiers. Having a balanced data set eliminates the need for imbalance countering techniques, e.g., undersampling and oversampling.

CLASSIFICATION RESULTS
Our results mostly focus on performance anomaly prediction accuracies of various classification algorithms. We provide these separately for our two industrial use-cases. Accurate classification, as the ultimate goal of the workflow from Figure 3, can be achieved with the right choice of metric, execution phase and a suitable classification algorithm.

Semiconductor photolithography use-case
We have tried Decision Tree (DT), Random Forest (RF), Gaussian Naïve Bayes (GaussianNB), k-Nearest Neighbours (k-NN), Linear Support Vector Classification (LinearSVC) and Kernel Support Vector Machine (KernelSVM) classifiers with the data set representing a single wafer's processing, alongside CPU time, reads count and writes count as our metrics of choice, as well as the combination of all three. Table 1 presents overall prediction accuracies for different classifiers. The corresponding confusion matrices are depicted in Figure 9, showing the higher prediction performance for DT and RF. As a result of randomness in our synthetic delay injection and scenario generation, the data set has considerable imbalance, leading to different accuracies for different label predictions. With an increased number of scenarios representing less frequent labels, the relevant accuracies will improve. This is demonstrated in the image analysis use-case.
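An out-of-the-box comparison of these six classifiers can be reproduced along the following lines, with synthetic stand-in data replacing our actual feature set (the five classes mimic our five labels):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC

# Stand-in data: 5 classes, as in our label set for this use-case.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.70,
                                          random_state=0)
classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "LinearSVC": LinearSVC(max_iter=5000),
    "KernelSVM": SVC(kernel="rbf"),
}
# Overall accuracy per classifier, with default hyperparameters.
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
```

All classifiers are used with scikit-learn defaults, matching our no-tuning, out-of-the-box evaluation.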

Image analysis use-case
For this use-case, we have also tried DT, RF, GaussianNB, k-NN, LinearSVC and KernelSVM classifiers with different subsets of our data set. We observe that the choice of phase and metric as sources of data has a considerable effect on the prediction accuracy of the classification. We have tried classifications using data from all three metrics at the same time, as well as every metric individually. For these trials, we have considered either the atomic image op., the atomic neural op., the combo cycle op., or the combination of the data from both atomic image and atomic neural operations. Table 2 presents these choices alongside their resulting overall prediction accuracies. The confusion matrices for all classifiers with this combination are depicted in Figure 10, again showing the higher prediction performance for DT and RF classifiers. High prediction accuracy is also observable across the board for all labels, as a result of having a balanced data set.

ANALYSIS AND DISCUSSION
Our methodology's outcome is based on feature engineering and traditional classification algorithms. While the methodology itself is applicable to any industrial CPS, implementations are use-case specific. Accordingly, it is important to fine-tune the workflow based on the platform at hand and its characteristics. Such a fine-tuning is necessary to maximise the classification accuracy and bring out the full potential of the method. Fine-tuning in this context is the process of selecting the best combination of EFB metrics, execution phases and classification algorithms. While EFB metrics and execution phases of choice are highly use-case dependent, we can see that DT and RF classifiers are the best performing ones. In the case of the DT classifier, a simple but powerful analysis is to visualise the DT graph. The graph can guide us to choose the right execution phase, for the more compact and contained the graph, the easier it is for the classifier to distinguish between labels. For our use-cases, we see that the combination of all metrics and the wafer processing phase, as well as the choice of electrical current as the metric and the cycle operation phase, lead to rather compact DT graphs, shown in Figures 11 and 12, respectively.
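With scikit-learn, such a DT inspection can also be done textually, here on stand-in data (export_text prints the decision rules, the textual counterpart of the graphs in Figures 11 and 12):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris data stands in for the signature-deviation features; in our
# setting, each feature corresponds to a PID/metric/phase deviation,
# so each split directly names a candidate root cause.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))
```

A compact printout, i.e., shallow depth and few leaves, indicates easily separable labels, which is exactly the criterion we apply when judging phase and metric choices.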
Without diving too deep into the analytical comparison between the different classifiers we have experimented with, a few statements can be made. Our motivation behind the use of common traditional ML algorithms is to see which classifiers perform best out-of-the-box, without any hyperparameter tuning effort. As DT and RF are visibly superior for this type of data, we chose to consider these for our complete solution. Out of these two, DT provides a certain level of root cause analysis as well, which is another added value, favouring DT. With moderate effort, one can expect to improve the prediction performance of the other classifiers, but to what extent remains to be seen. From an analysis point of view, however, considering the type of data we deal with and the features we have developed out of it, we can see why some classifiers perform poorly. Let us focus on the GaussianNB and KernelSVM classifiers. The NB classifier is a parametric one. Although our features are generated from samples of metric data and regressors were developed based on these collections, the specifics of these regression functions (coefficients and intercept) and goodness-of-fit measurements are not expected to follow a clear distribution. As such, it is expected that parametric algorithms will not be a suitable choice. In the case of the SVM classifier, SVM is known to be very sensitive towards the choice of kernel and hyperparameter tuning, hence the out-of-the-box disadvantage.
One might argue that anomaly specific monitoring is also capable of detection, especially for metrics external to the system, e.g., voltage supply. While this is a perfectly valid proposition to detect voltage supply instabilities, such detection lacks versatility, as it is single-anomaly specific and assumes prior knowledge of the anomaly type.
In comparison, our methodology can detect and differentiate between multiple anomalies. This versatility is a direct result of data-centricity and comparisons based on behavioural signatures. Our methodology also foresees detection of unknown anomalies through described comparison tools, namely, goodness-of-fit values.

Explainable output
While treating the system under scrutiny as a black box can be advantageous and a requirement in certain use-cases, having a black/grey box solution, i.e., methodology, might not be so favourable. It is important to be able to figure out which changes at every different step of the methodology will lead to a certain result. Since our methodology, as mentioned in Section 2, is based on the combination of extensive feature engineering and traditional classification algorithms, we can reap the benefits of explainable output. More precisely, we can backtrack our choices at different data manipulating steps and highlight the accuracy-increasing ones. While such a capability can help with optimising the workflow itself, e.g., by choosing better metrics and phases as the sources of data, it could also help in narrowing down the root causes of anomalies. For instance, the specific PIDs responsible for deviations in a phase with multiple active PIDs can be detected, as signatures and passports are generated per PID. As an example, if we look at the graph from Figure 11 and traverse the tree from its root, we can reach a leaf with an anomalous label through decision nodes involving processes 41 and 54, in combination with the metric reads count for process 41, twice, and the metric CPU time for process 54. Note that there are more leaves with this label present.
The same argument holds for metrics as well, i.e., upon presence of multiple metrics in the training data set and the inference data, we can narrow down the impact of anomaly at hand to a specific metric or a subset of considered metrics.
Such knowledge is rather informative for any expert, looking for the actual root cause and it is a strong motivation for having passports and signatures per PID, per phase and per metric. Although the ideal root cause analysis workflow is a fully automatic one, the discussed examples demonstrate the usefulness of semi-automatic or methodically gathered inherent insights, which is a form of Decision Support System (DSS). The use of DSS is prevalent in a wide range of contexts nowadays [1,12,42,41].
In our context and considering the use of our data processing pipeline in combination with a DT classifier, explainability is inherent to the solution. A solution such as ours can be considered as a DSS for experts analysing the system behaviour for root cause analysis. Though the identified anomaly can be backtracked, revealing the subset of PIDs, metrics and phases relevant to the label, the expert still has to utilise system knowledge to figure out the actual fault, e.g., lack of a certain resource resulting in a delay, or an incorrect input resulting in a spike in CPU time measurement.
The reach of our solution's visibility of the system ends at the boundary of data collection, generating the raw sensory input to our data processing pipelines.
Besides root cause detection, other design-improving intelligence is available too. For instance, the methodology can show if a specific batch of inputs to the system results in unexpected anomalous behaviour, revealing design deficiencies in handling specific categories of inputs.

Limitations of the study
We have had a few limitations when dealing with our use-cases. Some of these were imposed by the industrial platform under scrutiny, such as the absence of real anomalous scenarios in the case of the semiconductor photolithography machine. Others were self-imposed with the goal of having better control over experimentation and development of the methodology. For instance, in the case of the image analysis platform, we have opted for a single-threaded, single-process application. The choice is still in line with real-world conditions, as such set-ups are the preferred choice in low-power use-cases.

RELATED WORK
Anomaly detection and identification have been on the agenda of the research community for a long period. The relevant body of knowledge is fairly large, but is also covered rather well in two extensive survey papers. Both Chandola et al. [6] and Ibidunmoye et al. [20] give collections of papers, diving into different aspects of anomaly detection.
The challenging nature of decision support, leading to actuations and better designs for industrial systems, is also attested by the grand challenges presented by Fowler [14]. Most of the body of work given in [20] focuses on performance anomalies in distributed systems [19], cloud environments [16] and web applications [8], whereas our methodology is specifically tailored towards industrial CPS. We are of the opinion that our approach helps in simplifying the task for repetitive systems by considering execution phases, regression-based representations and, more importantly, white box or black/grey box views of industrial CPS, where each fits. We have also made use of a digital twin for our first use-case, which is in line with the ever-growing importance of such virtual representations when it comes to complex CPS [38].
The very recent publication by Luo et al. [28] provides a handy listing of surveys on anomaly detection. Out of the given list, surveys by Giraldo et al. [17], Mitchell et al. [31] and Lun et al. [46], specifically focus on anomaly detection for CPS. Interestingly enough, these surveys all focus on anomalies to address security and attack detection. This has been a common trend, to see anomalies as consequences of security incidents, which is a valid motivation. Industrial CPS such as photolithography machines on the other hand, are totally isolated and there is more added value in tackling performance anomalies to improve the machine yield, which our work tries to address.
Electrical metrics, especially electrical power analysis, have a profound role in cybersecurity research. The famous and now classic paper by Kocher et al. [24] is a great example. What such publications have in common is their view towards electrical power in particular and electrical metrics in general, which is seen as a source of side-channel information. In principle, our view is rather similar and we see such metrics, external to the system, as reflections of its functional behaviour, which is the definition of EFB. Further uses of power signals for anomaly detection in the scope of cybersecurity are given in [4,23,27,45].
Considering our behavioural signature/passport generation technique, the common use of regression models as a statistical tool for data estimation and inference is well argued in the literature [21,7]. This is especially true for different performance parameters of application processes, as Lee and Brooks suggest [25]. Lee et al. also employ the notion of piecewise polynomial regression [26]. We have conducted a comparable strategy by dividing a timeline into meaningful phases, based on drastic changes in the values of CPU utilisation. Our division criteria, though, aim at having meaningful and repeatable phases. We are not using all points from an execution, i.e., certain unsuitable parts are being omitted.
The survey by Chandola et al. [6] includes classification algorithms from other works. These algorithms are applicable to repetitive data such as ours. We have specifically focused on DT [40], RF [3], GaussianNB [15], k-NN [10], LinearSVC [2] and KernelSVM [9] as traditional classifiers. On the other hand, the complexity of modern CPS has driven researchers towards Deep Learning (DL) techniques more than ever. In particular, more recent publications [18,28,43,39] demonstrate this tendency. The recent survey by Chalapathy and Chawla [5], as well as the paper by Ratasich et al. [39], do point out the differences between traditional Machine Learning (ML) and sophisticated DL techniques by mentioning the black box nature of DL models. One advantage of our methodology, which is based on traditional ML, is the ability to traverse the resulting models and backtrack the data pipeline, as mentioned in Section 7.1. When dealing with anomalies, root cause analysis is arguably an integral requirement. Traditional ML in general, and our method in particular, are rather capable in this aspect.

CONCLUSION AND FUTURE WORK
We have shown that a well-devised data-centric solution, as we have presented, is capable of anomaly detection and classification for industrial CPS, with high accuracy. Such a data-centric solution is much more flexible than self-adaptivity based on traditional control mechanisms, as it is capable of consuming monitoring data from a multitude of sources. More precisely, we have shown a behaviour classification methodology composed of Extra-Functional Behaviour (EFB) monitoring at runtime, compartmentalisation of execution timeline into repetitive execution phases, generation of representative behavioural signatures and passports, deviation quantification based on goodness-of-fit tests, and traditional classification algorithms.
Our behavioural signature construct is an especially convenient tool, accommodating a variety of metrics as input and representing arbitrary lengths of the execution timeline of a system in a compact fashion. This compactness results in easy-to-store representations of reference behaviour, i.e., passports, as well as efficient-to-compare representations of ongoing behaviour, i.e., signatures. Using signatures, behaviour can be represented per execution phase, per metric and per process, if the application consists of multiple processes. We have also shown that the resulting comparison data, as a feature set, is most suitable for decision tree and random forest classifiers, achieving accuracies as high as 99% in certain set-ups. The feature set is complete enough to facilitate classification of anomalies early on, while in most anomalous cases it would otherwise take a while before visible deviations are present, e.g., under NoFan conditions.
In view of our proof-of-concept use-cases, we have demonstrated that our methodology is valid throughout the industrial CPS complexity spectrum. We have implemented our methodology for a semiconductor photolithography machine as a large and complex industrial CPS, as well as an image analysis platform as a low-power and less complex industrial CPS. We have also described different monitoring techniques: techniques using system-level metrics in a white box, invasive and communication-centric approach, and techniques using external metrics in a grey box, passive and side-channel-oriented approach. We have presented the role of a digital twin for industrial CPS and elaborated its creation using high-level modelling and event-based simulation of traces captured with communication-centric monitoring.
Future work. There are a few leads in our study that can be followed as part of future work. The application of more advanced AI algorithms, i.e., neural networks, and the implications of their deployment are an interesting aspect.
We would like to see how they compare to our current method, as they will drastically reduce the explainable output characteristic. Neural networks, e.g., Convolutional Neural Networks (CNN), are capable of extracting features on their own. However, these features are not the same as what we are generating within our pipelines. The input data to a neural network still has to be prepared and formatted, which may require moderate computational effort. One of the important requirements for us is the knowledge of execution phase start and end times. A further step in the automation of our workflow can be achieved by using Change Point Detection (CPD) for this goal. Development of multidimensional behavioural signatures and passports to represent the behaviour even more uniquely is also on our agenda. In this fashion, multiple metrics are to be used in construction of our regression-based representations. We also would like to experiment with hardware performance counters as EFB metrics in our methodology.
Last but not least, complementing our workflow with the ability to counter the effects of anomalies in industrial CPS will elevate our methodology to an EFB managing solution. This goal is part of our ongoing research. We hope that our contributions will be a step towards the integration of EFB managing data nodes as common building blocks in industrial CPS, for we believe such data-centric nodes provide multifaceted benefits. These benefits span from anomaly detection and prediction to design-improving intelligence, as discussed throughout this paper.