Clustering of Similar Historical Alarm Subsequences in Industrial Control Systems Using Alarm Series and Characteristic Coactivations

Alarm flood similarity analysis (AFSA) methods are frequently used as a primary step for root-cause analysis, alarm flood pattern mining, and online operator support. AFSA methods have been promoted in several research activities in recent years. However, addressing an often-observed ambiguity of the order of alarms and the annunciation of irrelevant alarms in otherwise similar alarm subsequences remains a challenging task. To address these limitations, this paper presents a novel AFSA method that uses alarm series as input to two extended term frequency-inverse document frequency (TF-IDF)-based clustering approaches, a dimensionality reduction technique, and a novel outlier validation. The method proposed here utilizes both characteristic alarm variables and their coactivations, thus emphasizing the dynamic properties of alarms to a greater extent. Our method is compared to three relevant methods from the literature. The effectiveness and performance of the examined methods are illustrated by means of an openly accessible dataset based on the "Tennessee-Eastman-Process". It is shown that the integration of alarm series data improves the overall performance and robustness of the AFSA. Furthermore, the clustering results are less influenced by the ambiguity of the order of alarms and irrelevant alarms, thus overcoming a persistent challenge in alarm management research.


I. INTRODUCTION
Driven by the advances in automation technologies, industrial process plants have become data-intensive. The amount of data being processed and stored, e.g., time series readings from sensors and alarm logs, can sum up to hundreds of gigabytes every year [17]. These data offer potential for data mining (DM) and machine learning (ML) to better understand plant behavior and thereby support better operator decisions.
In process control systems, alarms are raised to warn operators about critical process deviations when a predefined critical threshold value at a field sensor is exceeded. Ideally, the number of alarms raised at a time should be as low as possible. However, in more anomalous situations there can be a high number of alarms that becomes difficult to handle, a known challenge in industry and the literature, typically referred to as alarm floods [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Yiqi Liu.
In situations of alarm floods, a simple sequential handling of alarms may not be the most practical approach, due to the limited time available to resolve the critical plant situation, but also because the various alarms cannot be handled in isolation but have dependencies. In many cases, the alarms of an alarm flood were triggered by a common root-cause [13]. Here, an interesting use case for DM and ML is to extract the implicit knowledge about historical alarm situations. Such automatic learning offers the opportunity to save operators some of the years of experience otherwise needed to learn non-obvious rules and patterns. DM- and ML-based operator support functions could be imagined as part of the process control system that explain a recurrent alarm flood to the operator and thereby save time in the decision-making process, where a manual assessment of complex alarm floods can be time consuming.

VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article a novel approach is presented for the analysis and clustering of similar and recurrent alarm floods that makes use of insights about the dynamic properties of alarms and their coactivations. The proposed approach was compared to existing methods using an openly accessible alarm dataset. It is observed that the proposed approach leads to more accurate and meaningful clusters than methods that leave the intrinsic knowledge about the dynamic structure of the alarm sequences unconsidered.
This paper is organized as follows: Section II analyzes the related work. Section III describes the development of a novel approach. In Section IV an in-depth evaluation and comparison of the methods in Sections II and III is conducted. Finally, this paper concludes with a discussion of the evaluation results and an outlook on potential future work in Section V.
In reference [20] we proposed a preliminary method for the analysis and clustering of similar and recurrent alarm floods. This method was presented and discussed at the ''32nd International Workshop on Principles of Diagnosis'' in Hamburg, Germany, in September 2021. The approach proposed in this paper considerably advances our previously presented research. By applying a suitable and carefully selected dimensionality reduction technique, we were able to successfully solve a major limitation of our preliminary method, i.e., a high dimensionality and computational complexity when clustering similar alarm floods. Moreover, our additional processing step is shown to have a distinct positive effect on the accuracy of the found clustering solution. These improvements make our novel method more feasible for industrial practitioners and ML researchers from academia.

II. RELATED WORK
A comprehensive overview of the existing alarm data analysis approaches is given in [18]. One major branch is alarm flood similarity analysis (AFSA) methods, which detect and group recurrent historical alarm flood situations or, more generally, alarm subsequences (ASs) [18]. Here, ASs are smaller partitions of an original alarm sequence [2], [32]. The unsupervised task of grouping or clustering similar historical ASs aims at finding ASs that are associated with similar abnormal situations. In this context, the alarm data of historical ASs is processed using a suitable similarity measure. AS clusters are then formed by finding those groups of ASs that are more similar to each other than to ASs in other clusters [18]. AFSA methods thus allow for the collection of different variants of otherwise similar abnormal situations, which can improve further analysis steps [7].
Most commonly, AFSA methods are used for alarm rationalization or to generate the input for advanced alarm analysis methods [18]. For example, in [7], [10], and [27], clusters of similar ASs are subject to a causal analysis to detect common root-cause disturbances. This information can then be used online to support the operator with suggestions regarding the most likely root-cause disturbance of a recurring AS [10]. Reference [5] defined two requirements (R1 and R2) regarding the similarity analysis of ASs: 1) A suitable method should tolerate irrelevant alarms annunciated in some ASs. 2) A suitable method should tolerate a swapped order of alarm activations (ACTs) in otherwise similar ASs.
One category of AFSA approaches applies ''frequent pattern mining'' (FPM) methods to sequences of ACTs. For example, [8] and [32] use FPM to detect the most relevant combinations of alarms in historical alarm data. However, these methods are restricted to alarm clusters that have minimum support in the data, i.e., either the absolute or relative frequency, and thus, they show limitations when an abnormal situation is uncommon.
Another category that is promoted in several research activities is the pairwise alignment of ASs. For this purpose, [2] proposed a global sequence alignment method using the dynamic time warping (DTW) algorithm to detect common alarm patterns. Prior to that, a prefiltering step groups potentially similar ASs according to the Jaccard-distance of AS pairs (s. (7)). However, DTW does not tolerate any ambiguity of order in otherwise similar ASs. This challenging task was to some extent solved by [5], in which a local sequence alignment was used that allows for a certain ambiguity of order if the alarms are close in time. It introduced the modified Smith-Waterman (MSW) algorithm, which is considered a prevailing benchmark in the AFSA literature [18]. The MSW algorithm generates a similarity matrix, which is used as the input for an agglomerative hierarchical clustering approach with a single-linkage (AHC-SL) to cluster similar ASs [5]. One limitation arises from the penalization of alarms in one AS that could not possibly be aligned with a matching counterpart in another AS. A disagreement on the number of ACTs in two ASs therefore negatively affects their similarity, thus making the MSW approach less robust to irrelevant alarms. Reference [27] proposed an improved version of the MSW algorithm by applying a filtering step based on the Jaccard-distance, as described in [2]. Henceforth, this method is referred to as MSW-J. Further alignment approaches were presented that aimed at reducing the computational effort required to carry out the MSW approach [11] and that applied alarm priority information as a primary similarity indicator [14].
A third category of AFSA methods is string metrics, which are based on distance or similarity measures [18]. For example, in addition to its utilization in the pre- or postprocessing of AS pairs, the Jaccard-distance was also used in [7] and [9] as a primary measure for the clustering of similar ASs. It considers only the binary activity of alarm variables (AVs), which are the unique identifiers of configured alarms, and not the number or order of ACTs and is therefore robust to any ambiguity in both. However, the Jaccard-distance overrates the similarity between two ASs that share common alarms but have considerable disagreement in their respective dynamics.
Henceforth, this method is referred to as J. Another string metric is the Levenshtein-distance, which uses the number of edits, i.e., insertion, deletion, and substitution of ACTs, that are needed for the transformation of one AS into another AS [9]. It shares some properties with the DTW in [2] and therefore has limitations if ACTs are annunciated in a swapped order. Another promising AFSA string metric, proposed in [9], uses the term frequency-inverse document frequency (TF-IDF) for the pairwise comparison of ASs. The TF-IDF is a frequently utilized measure in natural language processing that applies a bag-of-words model, i.e., a simplified representation of the alarms in an AS that does not consider their order but rather their quantity. Moreover, a unique feature of the TF-IDF is its weighting of the relevance of AVs according to their probability of occurrence with regard to all ASs. Eventually, similar ASs are clustered using the ''density-based spatial clustering of applications with noise'' (DBSCAN) [28]. Reference [9] demonstrated that this method generates robust and meaningful results compared to other methods, especially when Jaccard-distance-based postprocessing is applied. Henceforth, this method is referred to as T-A-J. It was also used in [10] as a primary step for the causal analysis of ASs. However, it is less robust to irrelevant ACTs of AVs with a high weight.
In conclusion, the data-driven AFSA approaches described here show some deficits in fulfilling both requirements R1 and R2. Moreover, most of these approaches use fixed alarm rates and time windows to detect ASs in historical data, e.g., in [2], [5], [9], [10], and [27], which could result in important alarms or ASs being missed [19]. This deficiency justifies the proposal of a novel method that is robust against both order ambiguity and some irrelevant ACTs while still considering relevant aspects of the AS's dynamic structure.
It was further shown in [18] that all of the existing AFSA methods share the common property of using an alarm sequence representation as input, i.e., a sequence of alarm instances ordered by their ACT times. However, [18] also examined two research areas that are similar to the idea of AFSA, namely, alarm similarity analysis and online alarm flood classification. The former examines the correlation between AVs, and the latter identifies known AS patterns in incoming alarm floods [18]. In both areas, several approaches have demonstrated that using alarm series, i.e., alarm data represented as time series, can be beneficial and produce more meaningful results, e.g., in [18] and [33], than when using only alarm activations. Moreover, [18] illustrated the advantages of using alarm coactivations for alarm analysis, i.e., two or more AVs that are active at the same time.

III. PROPOSED APPROACH
A. OVERVIEW OF THE PROPOSED APPROACH
Based on the findings in Section II, this paper proposes an improvement to the promising T-A-J approach in [9] that aims at meeting the requirements R1 and R2. The improvement is achieved by using two novel TF-IDF-based AS clustering methods that utilize alarm series data for the analysis of individual AVs (T-S-J) and their coactivations (T-C-P-J). Here, each configured alarm, e.g., a high- or low-alarm, is denoted by an individual AV. Finally, the postprocessed clustering results from T-S-J and T-C-P-J are merged by a novel validation step that focuses on the detected AS outliers. Fig. 1 shows the general structure of the proposed ''alarm series similarity analysis method'' (ASSAM) using the ''formalized process description'' given in [31]. The process operators (green rectangles) and generated and processed information (blue hexagons) are described in detail below. T-S-J is specified by process operators O1.1, O1.2, O1.5, O1.6, O1.7, and O1.8 and results in I1.11, whereas T-C-P-J is defined by O1.1, O1.3, O1.4, O1.5, O1.6, O1.7, and O1.8 and generates the output I1.12.

B. DETAILS OF THE PROPOSED APPROACH
The ASSAM starts with O1.1 and a set of historical ASs (I1.1), which were obtained using the ''alarm coactivation and event detection method'' (ACEDM) proposed in [19]. The ACEDM uses a ''median absolute deviation''-based outlier detection in the time distances between alarm events to find ASs. It was shown that the ACEDM is more precise and robust in detecting coherent abnormal situations than methods that use arbitrary alarm rate-thresholds. The ASSAM uses a time series representation of alarm data, i.e., a binary alarm series S_i(t) for each AV α_i [15]:

S_i(t) = 1 if t ∈ T_i, and S_i(t) = 0 otherwise, (1)

where T_i is the set of times t in which α_i is active. Trivial ASs with only one active AV are eliminated. Moreover, to reduce the computational effort in the following steps, only those AVs that are active at least once in any of the subsequences are selected (I1.2). The time series for the coactivation of two AVs α_i and α_j can be represented as follows (following [18]):

S_ij(t) = S_i(t) · S_j(t), (2)

i.e., S_ij(t) = 1 at exactly those times where both AVs are active. To calculate S_ij for all possible α_i and α_j in an AS, the AVs must have an identical sampling rate and an identical number of samples. Here, only those AV pairs that are coactive at least once in any of the analyzed ASs are selected (I1.2).
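As an illustration, the representations in (1) and (2) can be sketched in a few lines of Python. The variable names and the 10-sample grid are hypothetical; in practice the series would be derived from the alarm log of the analyzed dataset.

```python
import numpy as np

def binary_alarm_series(active_times, t_grid):
    """Binary alarm series per (1): 1 while the AV is active, 0 otherwise.
    `active_times` corresponds to the set T_i of sample times of AV alpha_i."""
    return np.array([1 if t in active_times else 0 for t in t_grid])

def coactivation_series(s_i, s_j):
    """Coactivation series per (2): 1 only at samples where both AVs are
    active. Both series must share the same sampling grid."""
    return s_i * s_j

# Hypothetical example: two AVs on a common 10-sample grid
t_grid = range(10)
s_1 = binary_alarm_series({2, 3, 4, 5}, t_grid)
s_2 = binary_alarm_series({4, 5, 6}, t_grid)
s_12 = coactivation_series(s_1, s_2)  # active at samples 4 and 5 only
```
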
In O1.2 (T-S-J) and O1.3 (T-C-P-J), the TF-IDF is then computed to weight AVs and their pairwise coactivations, respectively, for each alarm subsequence AS (following [9]):

TF-IDF_AS(a) = TF_AS(a) · IDF(a), (3)

with the ''term frequency''

TF_AS(a) = |S_a|_AS / n_AS, (4)

and the ''inverse document frequency''

IDF(a) = log(|A| / |{AS ∈ A : a is active in AS}|), (5)

where a is either an AV (T-S-J) or a pair of AVs (T-C-P-J), |S_a|_AS is the number of samples in which a is active in AS, n_AS is the number of samples in AS, and A is the set of all ASs.

The pairwise consideration of AVs in the TF-IDF vectors of T-C-P-J (I1.4) implies a potentially high dimensionality. Furthermore, a single AV can have an excessive impact on the TF-IDF representation of an AS; i.e., it is considered in numerous elements of the TF-IDF vector. In fact, this kind of high-dimensional and redundant data representation increases the computational effort necessary for clustering similar ASs and can negatively affect any found clustering solution [23], [25]. In the related research area of clustering similar textual documents, this limitation was addressed by applying a suitable dimensionality reduction technique to the TF-IDF vectors, i.e., a transformation into a relatively low-dimensional and less redundant representation.
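A minimal sketch of the weighting in (3)-(5), assuming the TF is normalized by the number of samples in the AS; the exact normalization used in [9] may differ, and all numeric inputs below are toy values.

```python
import numpy as np

def tf(n_active_samples, n_samples):
    """Term frequency per (4): share of samples of the AS in which the term
    a (an AV in T-S-J, an AV pair in T-C-P-J) is active."""
    return n_active_samples / n_samples

def idf(n_as_total, n_as_with_a):
    """Inverse document frequency per (5): terms occurring in few ASs
    receive a higher weight; ubiquitous terms get a weight of 0."""
    return np.log(n_as_total / n_as_with_a)

def tf_idf(n_active_samples, n_samples, n_as_total, n_as_with_a):
    """TF-IDF weight per (3)."""
    return tf(n_active_samples, n_samples) * idf(n_as_total, n_as_with_a)

# Toy example: a is active in 5 of the 10 samples of this AS and
# occurs in 2 of the 4 analyzed ASs
w = tf_idf(5, 10, 4, 2)
```
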
For example, two frequently applied linear dimensionality reduction techniques are the ''singular value decomposition'' (SVD) and the ''principal component analysis'' (PCA) [16], [22], [23]. An in-depth evaluation and comparison of both was conducted in [23]. It was shown that the PCA has some advantages over the SVD in cases where the target dimensionality of the TF-IDF vectors is relatively small. Thus, for the ASSAM presented here, we propose to use the PCA for dimensionality reduction.
The desired transformation from the n-dimensional TF-IDF vectors into a k-dimensional target projection, where k < n, is achieved by using the top k eigenvectors of the covariance matrix. These eigenvectors correspond to the largest eigenvalues and account for a descending proportion of the variance of the original TF-IDF vectors. To estimate a suitable value for k, a ''cumulative proportion of variance'' threshold τ_CPV can be used, which retains just enough eigenvectors to maintain a variance of at least τ_CPV of the original TF-IDF vectors [1], [30]. A detailed description of the PCA can be found in either [1] or [30].
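A minimal numpy sketch of this reduction step, computing the eigendecomposition of the covariance matrix directly (a library PCA, e.g., scikit-learn's, behaves equivalently). The data X with three intrinsic dimensions hidden in 20 features is synthetic.

```python
import numpy as np

def pca_reduce(X, tau_cpv=0.98):
    """Project the TF-IDF vectors in X (one row per AS) onto the top k
    eigenvectors of the covariance matrix, with k chosen as the smallest
    number of components whose cumulative proportion of variance (CPV)
    reaches tau_cpv."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    eigvals = eigvals[::-1]                      # sort descending
    eigvecs = eigvecs[:, ::-1]
    cpv = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cpv, tau_cpv) + 1)   # smallest k with CPV >= tau
    return Xc @ eigvecs[:, :k], k

# Synthetic data with only 3 intrinsic dimensions hidden in 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))
X_reduced, k = pca_reduce(X)   # k is at most 3 here
```
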
Subsequently, a suitable distance measure is used to calculate the distances between any two alarm subsequences AS_i and AS_j in O1.5. According to [28], the DBSCAN clustering algorithm can be combined with any distance measure that is consistent with the analyzed domain and data. For example, [25] used the cosine distance measure for clustering similar textual documents. Here, we follow the proposal given in [9], where the Euclidean distance measure was applied to the AFSA domain and showed promising results. It can be calculated as follows [9]:

d(AS_i, AS_j) = sqrt( Σ_{k=1}^{m} (v_{i,k} − v_{j,k})² ), (6)

where v_{i,k} denotes the kth feature of the TF-IDF vector of AS_i and m is the total number of features in the TF-IDF vectors. Finally, both resulting distance matrices I1.6 (T-S-J) and I1.7 (T-C-P-J) are normalized to the range 0 to 1.
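The distance computation in (6) and the subsequent normalization can be sketched as follows; the three two-dimensional TF-IDF vectors are hypothetical toy values.

```python
import numpy as np

def normalized_distance_matrix(V):
    """Pairwise Euclidean distances per (6) between the TF-IDF vectors in V
    (one row per AS), normalized to the range [0, 1]."""
    diff = V[:, None, :] - V[None, :, :]          # all pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return D / D.max() if D.max() > 0 else D

V = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = normalized_distance_matrix(V)   # raw distances 5 and 10 become 0.5 and 1.0
```
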
Identical to T-A-J, the AS distance matrices are postprocessed here. This step aims to reduce spurious low distances between ASs that share only a small number of active AVs [2]. In O1.6, the Jaccard distances for all AS pairs are calculated using the following formula (following [9]):

d_Jac(AS_i, AS_j) = n^xor_ij / n^or_ij, (7)

where n^xor_ij is the number of AVs that are exclusively active in either AS_i or AS_j and n^or_ij is the number of AVs that are active in any of the two ASs. The resulting Jaccard distance matrix (I1.8) is then used in O1.7 for the postprocessing of I1.6 and I1.7. Each distance value in the postprocessed distance matrices I1.9 and I1.10 can be calculated as follows [9]:

d̂_ij = d_ij if d_Jac(AS_i, AS_j) ≤ τ_Jac, and d̂_ij = 1 otherwise, (8)

where τ_Jac is the Jaccard-distance threshold that determines whether an AS pair is considered potentially similar.

In O1.8, both I1.9 and I1.10 are used to generate two partitions of the ASs using DBSCAN. Reference [9] demonstrated the feasibility of utilizing DBSCAN for the clustering of ASs. It identifies regions of high density, i.e., ASs that are close to each other in terms of the distance. Clusters are identified by core points, where an AS is considered as such if at least (minPts − 1) other ASs are within a distance less than or equal to a threshold ε. ASs with no neighboring ASs in proximity are considered outliers. Two advantages of DBSCAN are its distinct outlier label and the absence of a manual specification of the number of clusters [28]. The resulting clustering solution can be represented as C = {c_-1, c_0, c_1, ..., c_n}, where c_i depicts the ith cluster and c_-1 groups all detected outliers. Here, T-S-J and T-C-P-J generate C^S (I1.11) and C^C (I1.12), respectively.
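A sketch of the postprocessing in (7) and (8); the sets of active AVs and the distance matrix below are toy values.

```python
import numpy as np
from itertools import combinations

def jaccard_distance(avs_i, avs_j):
    """Jaccard distance per (7), computed from the sets of active AVs."""
    return len(avs_i ^ avs_j) / len(avs_i | avs_j)

def postprocess(D, av_sets, tau_jac=0.4):
    """Postprocessing per (8): AS pairs whose Jaccard distance exceeds
    tau_jac are considered dissimilar and receive the maximum distance 1."""
    D_hat = D.copy()
    for i, j in combinations(range(len(av_sets)), 2):
        if jaccard_distance(av_sets[i], av_sets[j]) > tau_jac:
            D_hat[i, j] = D_hat[j, i] = 1.0
    return D_hat

av_sets = [{1, 2}, {1, 2, 3}, {7, 8}]     # active AVs per AS (toy values)
D = np.array([[0.0, 0.2, 0.3],
              [0.2, 0.0, 0.4],
              [0.3, 0.4, 0.0]])
D_hat = postprocess(D, av_sets)
```

A matrix postprocessed in this way can then be clustered with a DBSCAN implementation that accepts precomputed distances, e.g., scikit-learn's DBSCAN with metric='precomputed'.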
It can be assumed that C^S and C^C differ to some extent. In fact, preliminary tests have suggested that for some situations, one of the two chosen criteria can have advantages over the other and result in more meaningful clusters. To benefit from both, we propose a novel step (O1.9) that aims at validating the outliers in T-S-J (I1.11) by using T-C-P-J (I1.12). The former is used as the basis here since preliminary performance results have indicated that it is more robust to different settings of ε. The concept of the proposed approach is the following: for each outlier in c^S_-1, the corresponding label in C^C is analyzed. If T-C-P-J considers this AS an outlier as well, it is labeled as such in the validated clustering solution Ĉ^SC (I1.13). If, however, the AS is part of c^C_i with i ≥ 0, the outlier label in T-S-J is considered potentially erroneous. Next, we try to find the best match for c^C_i in C^S. One way to achieve this is to compare c^C_i to each regular cluster in C^S using a similarity measure. Here, we propose using the Braun-Blanquet formula for the calculation of the similarity s^BB_ij between two clusters c_i and c_j. It can be calculated as follows [26]:

s^BB_ij = n_ij / max(|c_i|, |c_j|), (9)

where n_ij denotes the number of shared ASs in both clusters and |c_i| and |c_j| represent the number of ASs in c_i and c_j, respectively. Of all clusters in C^S with a similarity greater than or equal to a validation threshold τ_Val, the one with the highest similarity to c^C_i is considered the best match, i.e., c^S_j. Eventually, the former outlier is clustered in ĉ^SC_j. Otherwise, it remains an outlier and is grouped in ĉ^SC_-1. Moreover, all nonoutlier cluster labels in Ĉ^SC are assigned according to the cluster labels in C^S.
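The validation logic of O1.9 might be sketched as follows, assuming the clustering solutions are given as dictionaries mapping AS identifiers to cluster labels (with -1 for outliers). This is an illustrative reading of the step, not the authors' implementation.

```python
def braun_blanquet(c_i, c_j):
    """Braun-Blanquet similarity per (9): shared ASs over the larger cluster."""
    return len(c_i & c_j) / max(len(c_i), len(c_j))

def validate_outliers(labels_s, labels_c, tau_val=0.5):
    """Sketch of O1.9: an outlier in C^S (label -1) keeps its label if C^C
    also marks it as an outlier; otherwise it is re-assigned to the best
    matching regular C^S cluster, provided the similarity reaches tau_val."""
    validated = dict(labels_s)
    clusters_s, clusters_c = {}, {}
    for as_id, lab in labels_s.items():
        clusters_s.setdefault(lab, set()).add(as_id)
    for as_id, lab in labels_c.items():
        clusters_c.setdefault(lab, set()).add(as_id)
    for as_id in clusters_s.get(-1, set()):
        lab_c = labels_c[as_id]
        if lab_c == -1:
            continue  # both methods agree: keep the outlier label
        best_lab, best_sim = -1, 0.0
        for lab_s, members in clusters_s.items():
            if lab_s == -1:
                continue
            sim = braun_blanquet(clusters_c[lab_c], members)
            if sim >= tau_val and sim > best_sim:
                best_lab, best_sim = lab_s, sim
        validated[as_id] = best_lab
    return validated

# Toy example: T-S-J labels AS 3 an outlier, T-C-P-J groups it with ASs 0-2
labels_s = {0: 0, 1: 0, 2: 0, 3: -1}
labels_c = {0: 5, 1: 5, 2: 5, 3: 5}
validated = validate_outliers(labels_s, labels_c)   # AS 3 joins cluster 0
```
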

C. DISCUSSION OF THE LIMITATIONS AND ADVANTAGES OF THE PROPOSED APPROACH
One limitation of the ASSAM arises from the computational effort necessary for the calculation of T-C-P-J; i.e., the coactivation of each AV pair needs to be determined for each sample and AS. Furthermore, as T-C-P-J considers only AV pairs, the implicit knowledge of more complex alarm coactivation dynamics possibly remains undiscovered.
Nevertheless, the ASSAM shows relevant advantages compared to state-of-the-art methods. Swapped alarm orders and a varying number of ACTs in similar abnormal situations can be characteristic of real-world industrial processes [5]. The proposed utilization of time series data in AFSA expands the view to the dynamic properties of activated AVs and the dynamic structure of the underlying ASs instead of focusing on a point-to-point examination of sequenced ACTs. In fact, the calculation of the TF in (4) is not affected by the order or number of ACTs. Moreover, randomly activated short alarms that are irrelevant for the situation have only a small impact due to the consideration of the number of active samples in (4). Hence, the proposed ASSAM and its components T-S-J and T-C-P-J fully satisfy the requirements R1 and R2.

IV. EVALUATION
This section evaluates and compares the performances and characteristics of three relevant AFSA methods described in Section II and the method proposed in Section III. Subsection IV.A gives a brief overview of the evaluation dataset used. Subsection IV.B deals with choosing a suitable evaluation measure. Subsection IV.C describes the experimental setup. The obtained evaluation results are presented in Subsection IV.D.

A. EVALUATION DATASET
The examined clustering methods are applied to the openly accessible simulation dataset 1 introduced in [19]. It is based on a simulation model of the ''Tennessee-Eastman-Process'' (TEP), a frequently used benchmark in process automation [4], [6]. The process can be separated into five modules: a two-phase chemical reactor, a condenser, a vapor-liquid separator, a stripper, and a reboiler. Furthermore, the TEP includes 11 automatic pneumatic control valves, two pumps, and one compressor [6]. The alarm system of the TEP defines 81 low-alarm and 81 high-alarm thresholds as well as five high-high-alarm and three low-low-alarm thresholds [19].
The dataset includes 100 simulation runs with 300 specified abnormal situations. These situations were designed using eight different root-cause disturbances with variations in their respective durations, disturbance scaling, and combinations. These variations as well as random influences affect the number of activated AVs, the order of alarm instances, and their dynamic behavior. The alarm system generates a total of 7343 alarm instances over all 300 situations [19]. Fig. 2 illustrates an example subset of 18 AV time trends for a typical simulation run with three consecutive abnormal situations, where the third abnormal situation rapidly escalates into an emergency shutdown of the TEP. The alarm data in this dataset is represented using a single multivalued alarm series for each process variable (XMEAS), i.e., time series readings from a specific sensor, and each manipulated variable (XMV), i.e., time series readings from a specific pneumatic valve. For each AV, the effective alarm state at a time is constituted using one out of five unambiguous integer values, e.g., high- and low-alarms are represented using the values ''1'' and ''-1'', respectively [18], [19]. To render the application of the proposed ASSAM possible, we need to transform the multivalued alarm series into a binary representation according to (1).
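This transformation into binary series per (1) can be sketched as follows. The integer coding ''1'' for high and ''-1'' for low alarms follows the description above, while the remaining codes (here ''2'' and ''-2'' for high-high and low-low alarms) are assumptions.

```python
import numpy as np

def multivalued_to_binary(series, alarm_values=(-2, -1, 1, 2)):
    """Expand one multivalued alarm series (0 = no alarm) into one binary
    series per configured alarm, per (1). The value coding beyond 1/-1 is
    an assumption for illustration."""
    series = np.asarray(series)
    return {v: (series == v).astype(int) for v in alarm_values}

# Toy multivalued series of one XMEAS variable over eight samples
xmeas = [0, 1, 1, 0, -1, -1, 2, 0]
binary = multivalued_to_binary(xmeas)
# binary[1] flags the high-alarm samples, binary[-1] the low-alarm samples
```
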
The application of the ACEDM on the TEP dataset results in 358 detected ASs, of which 310 ASs show more than one alarm instance. The latter are used as the preprocessed input for all methods examined here, thus allowing a specific comparison of the performances of the selected AFSA methods. One advantage of the TEP simulation dataset is that all induced abnormal situations are explicitly known [19], thus making it possible to use an external validity index, which compares the computed clusters to a given ground-truth partition [26]. The 310 preprocessed ASs are therefore manually assigned to 21 ground-truth clusters according to the details described in [19] and the technical report of the dataset. Each cluster includes 4 to 30 similar ASs. Furthermore, 14 ASs are labeled outliers, as they contain only random parts of the respective underlying abnormal situation and show no similarities to any other ASs.

FIGURE 2. Three example consecutive abnormal situations (abn. sit.). Solid blue lines represent the time trends of alarm variables. The lower level for each alarm variable represents a low alarm, and the higher level represents a high alarm. Red dotted lines represent the initiation of a root-cause disturbance. Green dashed-dotted lines represent the return to a normal operation (following [19]).

B. EXTERNAL VALIDITY INDEX
For evaluation, a suitable external validity index needs to be chosen. A frequently used index in cluster evaluation [29], which evaluates the agreement of a ground-truth partition C_0 and a computed trial partition C_1 [26], is the adjusted Rand index (ARI) [15]:

ARI = 2(ad − bc) / ((a + b)(b + d) + (a + c)(c + d)), (10)

where a (d) is the number of AS pairs that are in the same (different) cluster in both partitions and b (c) is the number of AS pairs that are in the same cluster in C_0 (C_1) but in different clusters in C_1 (C_0). If C_0 and C_1 are identical, the ARI yields a value of 1. A value of 0 arises in the case where C_0 and C_1 are statistically independent [29]. A detailed analysis of the ARI can be found in [29].
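The ARI in (10) can be computed directly from the four pair counts; a short sketch follows (for large datasets, a contingency-table implementation such as scikit-learn's adjusted_rand_score is more efficient).

```python
from itertools import combinations

def adjusted_rand_index(gt, trial):
    """ARI per (10) from pair counts: a and d count AS pairs grouped
    consistently in both partitions, b and c count the disagreeing pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(gt)), 2):
        same_gt = gt[i] == gt[j]
        same_tr = trial[i] == trial[j]
        if same_gt and same_tr:
            a += 1
        elif same_gt:
            b += 1
        elif same_tr:
            c += 1
        else:
            d += 1
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

# Identical cluster structure (labels merely renamed) yields an ARI of 1
ari_identical = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```
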

C. EXPERIMENTAL SETUP
An overview of the methods examined here is given in Table 1. Two methods, J and MSW-J, are used as benchmarks for the evaluation of the TF-IDF-based methods, namely, T-A-J, the proposed ASSAM, and its components T-S-J and T-C-P-J. In addition, some of these methods are compared to versions of them that do not use the dimensionality reduction step in process operator O1.4 (s. Fig. 1), namely, T-C, or the postprocessing step in process operators O1.6 and O1.7, namely, T-A, T-S, and T-C-P. This evaluation approach allows for a systematic and in-depth examination of the effectiveness of the ASSAM and its components. Except for MSW-J, the examined methods utilize the DBSCAN clustering algorithm. For MSW-J, the algorithm parameters were set according to [5], i.e., δ = −0.4, µ = −0.6, and σ² = 4. The τ_Jac for MSW-J, T-A-J, T-S-J, and T-C-P-J was set to 0.4, as suggested in [2]. Based on preliminary tests, the τ_Val of the ASSAM was set to 0.5. For the minPts parameter of DBSCAN, integer values between 3 and 30 were examined; the upper bound corresponds to the size of the largest ground-truth cluster. The chosen range includes the default value of minPts = 4 as recommended in [28]. For the distance threshold of the AHC-SL and the ε of DBSCAN, values between 0.001 and 1.000 with a step size of 0.001 were assessed, since all resulting distance matrices are normalized to the range 0 to 1. For the evaluation of the ASSAM, both components T-S-J and T-C-P-J used the same minPts and ε due to the assumption that the individual tuning of two parameter settings would be cumbersome in an industrial application. In addition, the τ_CPV of the PCA was set to 98%. This setting follows the findings given in [30].
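The described exhaustive parameter sweep can be sketched generically; cluster_fn and score_fn are hypothetical placeholders for a clustering run and the computation of the external validity index.

```python
def grid_search_eps(cluster_fn, score_fn, eps_values, minpts_values):
    """Exhaustive sweep of the DBSCAN parameters, as in the experimental
    setup: every (eps, minPts) combination is clustered and scored, and the
    setting with the highest score (here the ARI) is returned."""
    best = (-1.0, None)
    for minpts in minpts_values:
        for eps in eps_values:
            score = score_fn(cluster_fn(eps, minpts))
            if score > best[0]:
                best = (score, (eps, minpts))
    return best

# Toy sweep: the "clustering" just returns eps and the score peaks at 0.3
best = grid_search_eps(lambda eps, m: eps,
                       lambda labels: -abs(labels - 0.3),
                       eps_values=[0.1, 0.2, 0.3, 0.4],
                       minpts_values=[3, 4])
```
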
All methods examined here were implemented in Python (Version 3.8.5). Additional software libraries that were used are NumPy [12] (Version 1.21.3), Pandas [21] (Version 1.3.4), and Scikit-learn [24] (Version 0.24.2). The executable code of the ASSAM as well as the reported evaluation results and the used ground-truth partition are openly accessible. 2

D. EVALUATION RESULTS
For each method, the highest ARI value obtained over all considered parameter settings is shown in Fig. 3. J, T-A, and T-A-J, which do not consider ACT durations or their order, show the lowest ARI values of all examined methods. Indeed, in some cases, these three methods detected similarities between ASs that are in different ground-truth clusters and arose from different root-causes, thus resulting in fewer, though larger, computed clusters, i.e., 13 clusters for J with an optimal minPts of 4 and an optimal ε of 0.191. Using an optimal minPts of 3 and an optimal ε of 0.081, both T-A and T-A-J labeled 32 outliers, the highest number of all examined methods. The corresponding ASs were characterized by random variations in the number of ACTs of those AVs with a high IDF value. In contrast, the consideration of the order of ACTs with tolerance of short-term variations in MSW-J resulted in a higher ARI value. MSW-J detected 20 clusters and 24 outliers using an optimal distance threshold of 0.276. An in-depth inspection of the obtained results revealed that MSW-J was not always able to distinguish between significant variations for the same root-causes. Moreover, the detected outliers differed considerably from those in the given ground-truth; i.e., MSW-J was not always able to find similarities between two ASs with identical ground-truth cluster labels in cases where both disagreed on the number of ACTs.
The proposed methods T-S-J and T-C-P-J, as well as the alternative versions T-S, T-C, and T-C-P, showed an improved performance compared to that of the existing methods. All proposed methods presented an optimal minPts value of 3. The proposed application of the dimensionality reduction using the PCA in T-C-P and T-C-P-J yields a reduced TF-IDF vector with 17 features; i.e., a reduction of more than 99% compared to T-C. Moreover, the optimal ARI values of T-C and T-C-P in Fig. 3 reveal that this additional processing step improved the overall clustering performance by more than 5%.
Both T-S-J and T-C-P-J were able to detect 23 clusters as well as 12 and 11 outliers with as few as 17 and 18 mislabeled ASs, respectively. An in-depth inspection of the cluster labels resulting from T-S-J and T-C-P-J revealed that they are essentially identical except for five ASs, which mainly stem from two different abnormal situations. Interestingly, in both cases, one of the methods classified two of the subsequences as outliers, whereas the other method classified them correctly according to the ground-truth cluster labels.
The application of both T-S-J and T-C-P-J and the subsequent validation of outliers in the ASSAM were shown to result in more meaningful clusters; i.e., only 16 ASs were mislabeled, which resembles the ground-truth best. This finding was also supported by the ASSAM yielding an ARI value superior to that of all other examined methods. The optimal values for minPts and ε for the ASSAM were 3 and 0.095, respectively. Another significant phenomenon revealed in Fig. 3 is that the postprocessing of the TF-IDF-based methods was beneficial regarding the optimal ARI value. This phenomenon is further analyzed in Fig. 4 and 5.   4 illustrates the heatmaps of the distance matrices for the TF-IDF-based methods. The ASs in the columns and rows are ordered by the ground-truth cluster labels. This allows for the visual evaluation of the distance measures used. A trial partition identical to the ground-truth is characterized by dark colored blocks along the diagonal of the distance matrix. In contrast, undesired similarities between different VOLUME 9, 2021 ground-truth clusters appear as dark colored off-diagonal blocks. Fig. 4 (a), (b), and (c) show the distances without the application of the postprocessing step. The distance matrix of T-A in Fig. 4 (a) contains some erroneously high distances between ASs that have the same ground-truth cluster label and numerous spuriously high similarities in terms of the offdiagonal blocks. Only one cluster presents a desirable high visual contrast. The corresponding ASs are characterized by only two continuously active AVs. The distance matrices of T-S and T-C-P in Fig. 4 (b) and (c) show a substantially higher visual contrast between blocks along the diagonal and in the off-diagonal areas than shown in Fig. 4 (a). The highest contrast can be found in Fig. 4 (b), which is reflected by T-S having the highest ARI value of all TF-IDF-based approaches without postprocessing. The lower performance and lower visual contrast of T-C-P in Fig. 
4 (c) can possibly be explained by different abnormal situations showing a similar dynamic propagation behavior in terms of coactive AVs but relevant differences in their respective initial phases, where no or only a few coactivations occur; i.e., T-C-P is not able to distinguish between such abnormal situations using only coactivations. Fig. 4 (d), (e), and (f) show the computed distance matrices after the application of the postprocessing step. By assigning the highest distance value to most of the erroneous AS pairs, the resulting visual contrast shows high agreement with the cluster structure of the ground-truth. However, Fig. 4 (d) demonstrates that T-A-J yields low distance values for most of the remaining AS pairs, thus impeding the detection of the correct ground-truth clusters. In contrast, Fig. 4 (e) and (f) depict overall higher distances for the remaining off-diagonal pairs for T-S-J and T-C-P-J. This advantageous characteristic resulted in higher ARI values for both proposed components of the ASSAM.
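The two operations discussed above, i.e., reordering the distance matrix by ground-truth labels for the heatmap visualization and the postprocessing that assigns the maximum distance to erroneously similar AS pairs, can be sketched as follows. The toy matrix, labels, and flagged pairs are illustrative assumptions; the actual criterion by which the examined methods flag a pair is not reproduced here.

```python
def order_by_labels(dist, labels):
    """Reorder rows and columns by ground-truth cluster labels so that
    clusters appear as blocks along the diagonal (as in the heatmaps
    of Fig. 4)."""
    idx = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[dist[i][j] for j in idx] for i in idx]

def clamp_flagged_pairs(dist, flagged, max_dist=1.0):
    """Postprocessing sketch: set every flagged AS pair to the maximum
    distance, sharpening the contrast between the diagonal blocks and
    the off-diagonal areas."""
    out = [row[:] for row in dist]
    for i, j in flagged:
        out[i][j] = out[j][i] = max_dist
    return out

# Illustrative 4-AS distance matrix; ASs 0/2 and 1/3 form the two
# ground-truth clusters, and the pair (0, 3) is spuriously similar.
dist = [
    [0.0, 0.9, 0.1, 0.3],
    [0.9, 0.0, 0.85, 0.2],
    [0.1, 0.85, 0.0, 0.7],
    [0.3, 0.2, 0.7, 0.0],
]
labels = [0, 1, 0, 1]
ordered = order_by_labels(dist, labels)        # within-cluster blocks
cleaned = clamp_flagged_pairs(dist, [(0, 3)])  # spurious pair removed
```

After clamping, the spuriously low off-diagonal entry takes the maximum distance while the genuine within-cluster distances are untouched, mirroring the improved contrast seen in Fig. 4 (e) and (f).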
The performance and the number of resulting clusters for the TF-IDF-based methods with a minPts value of 3 and over all considered settings of the DBSCAN parameter ε are illustrated in Fig. 5. The corresponding diagram for the ASSAM is similar to that of T-S-J and is therefore not depicted here. Moreover, a close inspection of Fig. 5 (a), (b), and (c) indicates that the range of suitable values for ε, which results in ARI values close to the maximum, is approximately twice as wide for T-S-J and T-C-P-J as for T-S and T-C-P. In conclusion, the postprocessing step makes the proposed methods more robust to changes in ε and the clustering results more reliable in cases where an optimal ε cannot be determined using a ground-truth partition.
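The sensitivity to ε just described can be reproduced on a toy example with a compact DBSCAN over a precomputed distance matrix. The distances below are invented for illustration only; in practice one would cluster the actual AS distance matrices, e.g., with scikit-learn's DBSCAN and metric='precomputed'.

```python
from collections import deque

def dbscan_precomputed(dist, eps, min_pts):
    """Plain DBSCAN over a precomputed distance matrix.
    Returns one label per AS, with -1 marking outliers (noise)."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neigh = [q for q in range(n) if dist[p][q] <= eps]
        if len(neigh) < min_pts:
            labels[p] = -1            # noise (may later become border)
            continue
        cluster += 1
        labels[p] = cluster
        queue = deque(neigh)
        while queue:
            q = queue.popleft()
            if labels[q] == -1:       # former noise: border point
                labels[q] = cluster
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neigh = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neigh) >= min_pts:
                queue.extend(q_neigh)
    return labels

# Seven toy ASs placed on a line: two dense groups and one far outlier.
points = [0.00, 0.02, 0.04, 0.50, 0.52, 0.54, 2.00]
dist = [[abs(a - b) for b in points] for a in points]

# Sweeping eps as in Fig. 5: too small a value yields only noise,
# while a suitable range yields two clusters plus one outlier.
labels_by_eps = {eps: dbscan_precomputed(dist, eps, 3)
                 for eps in (0.01, 0.095, 1.0)}
```

With minPts = 3 and a suitable ε, the two dense groups are recovered and the isolated point is labeled as an outlier; shrinking ε below the within-cluster distances turns every point into noise, which is the qualitative behavior visible at the left edge of Fig. 5.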

V. DISCUSSION AND CONCLUSION
The evaluation in Section IV showed that the existing AFSA methods are not able to meet the requirements defined in [5] (see Section II) to the fullest extent. In fact, the in-depth examination revealed that the methods J, T-A, T-A-J, and MSW-J can handle a certain ambiguity of the order of alarms in two compared ASs (R1), whereas none of them could suitably tolerate irrelevant alarms occurring in one or both ASs (R2). These methods are therefore not able to correctly detect all underlying AS similarities. Despite this distinct limitation, the clustering results obtained by MSW-J showed a relatively high agreement with the given ground-truth of the TEP dataset used here. However, the MSW necessitates the cumbersome tuning of four interrelated parameters, i.e., δ, µ, σ², and the distance threshold of the AHC-SL.
It was further demonstrated that the proposed TF-IDF-based method ASSAM as well as its components T-S-J and T-C-P-J can fulfill all given requirements. Moreover, the ASSAM achieves the best performance among all considered AS clustering methods. This result confirms the assumption that the clustering results can be improved when using alarm series data and alarm coactivations as input. Overall, the evaluation showed that clustering methods that consider the dynamic properties of activated AVs and the dynamic structure of the ASs consistently demonstrate a higher performance than that of methods that utilize a less extensive data input.
One limitation of the ASSAM results from the relatively high computational effort of T-C-P-J; i.e., each sample in a subsequence needs to be analyzed for occurring pairwise alarm coactivations. In contrast, T-S-J incurs a relatively low computational burden. Another limitation results from the necessity of tuning the DBSCAN parameter ε. In this context, it was shown that the postprocessing step of T-S-J and T-C-P-J makes them and the ASSAM more robust to changes in the parameter settings than their counterparts without postprocessing and than T-A-J. It is noteworthy that this beneficial characteristic makes the ASSAM more suitable for industrial applications, where a priori knowledge for parameter tuning can be limited. Moreover, this finding substantiates the viability of the postprocessing step, as hypothesized in [9]. In addition, it was shown that the application of a suitable dimensionality reduction technique to the otherwise high-dimensional TF-IDF vectors of T-C-P-J significantly reduces the computational effort necessary for the AS clustering and considerably improves the quality of the clustering results.
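The per-sample effort mentioned above can be illustrated with a small sketch: for T samples with up to k simultaneously active AVs, enumerating all unordered coactive pairs costs O(T·k²), which explains the higher burden of T-C-P-J compared to the per-AV processing of T-S-J. The alarm series and AV names below are invented for illustration.

```python
from itertools import combinations
from collections import Counter

def coactivation_counts(alarm_series):
    """Count pairwise coactivations over a binary alarm series, where
    alarm_series[t] is the set of AVs active at sample t. Every sample
    contributes O(k^2) pair updates for k active AVs."""
    counts = Counter()
    for active in alarm_series:
        # every unordered pair of simultaneously active AVs
        counts.update(combinations(sorted(active), 2))
    return counts

# Illustrative alarm series (AV identifiers are made up):
series = [
    {"A1", "A2"},
    {"A1", "A2", "A3"},
    {"A2", "A3"},
    set(),
]
pairs = coactivation_counts(series)
```

Each resulting pair count would then feed the coactivation-based TF-IDF representation; storing pairs in sorted order keeps (A1, A2) and (A2, A1) from being counted separately.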
Furthermore, the evaluation indicated a high agreement between the clustering results of T-S-J and T-C-P-J. However, the data also showed that the proposed combined approach ASSAM has advantages over the individual methods. For industrial practitioners, we recommend using T-S-J in cases where a low computational burden is of relevance. In other cases, we propose using the ASSAM as intended. It is reasonable to assume that in processes similar to the TEP used here, this approach can produce more meaningful clustering results. Future studies should apply the proposed ASSAM and its components T-S-J and T-C-P-J to further industrial and experimental datasets. Moreover, it should be investigated whether modern machine learning methods, e.g., representation learning, can improve the analysis of similar historical ASs. For this purpose, further research efforts could evaluate whether a more extensive consideration of the alarm dynamics and an examination of the most significant subsets of coactive AVs are beneficial for the performance of the AFSA.