Monitors That Learn From Failures: Pairing STL and Genetic Programming

In several domains, systems generate continuous streams of data during their execution, including meaningful telemetry information, that can be used to perform tasks like preemptive failure detection. Deep learning models have been exploited for these tasks with increasing success, but they provide hardly any guarantees over their execution, a problem which is exacerbated by their lack of interpretability. In many critical contexts, formal methods, which ensure the correct behavior of a system, are thus necessary. However, specifying in advance all the relevant properties and building a complete model of the system against which to check them is often out of reach in real-world scenarios. To overcome these limitations, we design a framework that resorts to monitoring, a lightweight runtime verification technique that does not require an explicit model specification, and pairs it with machine learning. Its goal is to automatically derive relevant properties, related to a bad behavior of the considered system, encoded by means of formulas of Signal Temporal Logic (STL). Results based on experiments performed on well-known benchmark datasets show that the proposed framework is able to effectively anticipate critical system behaviors in an online setting, providing human-interpretable results.


I. INTRODUCTION
In many application domains, a system produces several data streams during its operation, which may contain valuable telemetry information. This is the case, for instance, with the logs generated by web servers, smart sensors, and industrial machinery. Such data can be used for tasks like predictive maintenance and early failure detection, typically carried out, due to their complexity, by means of machine or deep learning approaches. Despite their success, these methods provide hardly any guarantees over their execution, a problem which is exacerbated by their lack of interpretability, an essential requirement in many critical domains such as healthcare and avionics. In those scenarios, formal methods can, in principle, be used to automatically verify software and hardware systems. However, the presence of different operating conditions, combined with the complexity of the system components and their interactions, makes it quite difficult to define in advance all the relevant conditions that must be guaranteed (or avoided) during execution; moreover, the specification of a complete system model against which to check these properties may be simply impossible [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiqi Liu.
To overcome these limitations, various methods, that combine classic exhaustive formal verification techniques with model-based testing and monitoring, have recently been proposed in the literature (see, e.g., [2], [3]). Here, we focus on the latter. Monitoring [1] is a runtime verification technique which is receiving more and more attention from the formal verification community. It allows one to detect the fulfilment or violation of a property (usually expressed by a temporal logic formula) by evaluating a single system run, without requiring a model of the system being considered. This makes it naturally applicable to data streaming scenarios.
In this paper, we present a novel online system verification framework that combines monitoring with supervised machine learning and can be used for tasks like preemptive failure detection over streams of data. The framework starts its operation by considering a limited set of properties encoding bad behaviors, to be monitored against the system, learnt during a warmup phase and/or specified with the help of domain experts. Then, during the runtime phase, by means of an iterative refinement process, the framework autonomously discovers new relevant properties, becoming able, over time, to identify undesired behaviors in advance, and with a significantly higher level of detail and coverage than the initial specifications. The process of property discovery and extraction is carried out by means of an original bi-objective evolutionary algorithm.
The distinctive features of the proposed solution are the following:
• the framework serves as a monitoring-based tool to perform preemptive failure detection;
• its operation relies on the seamless and automatic interaction between formal methods and machine learning approaches;
• thanks to its modularity and flexibility, the framework can be adapted to different application domains and contexts; in particular, Signal Temporal Logic (STL) can possibly be replaced by other logical formalisms for property specification;
• interpretability is a distinguishing feature of the framework, as the produced responses can be easily read by domain experts; this allows people to validate the overall behavior of the framework, and to gain insights about the causes that led to a failure;
• the framework works in an online fashion, and it can adapt to changes in the behavior of the system, due, for instance, to updates or upgrades.
The framework has been evaluated against three public datasets, and it has been shown to be able to predict system failures in advance. Results are on par with those obtained from other state-of-the-art solutions that, however, suffer from a lack of interpretability.
The rest of the paper is organized as follows. Section II analyses related work. Section III provides background knowledge about monitoring, STL, and evolutionary algorithms. Section IV describes the implementation of the evolutionary algorithm used in the property extraction phase. Section V shows how such an algorithm has been incorporated in the proposed framework. The experimental evaluation of the framework is reported in Section VI. Section VII provides a critical assessment of the work done and discusses its strengths and limitations. The last section summarizes the main contributions and outlines future research directions.

II. RELATED WORK
Learning techniques for the real-time detection of undesired behaviors (failures) of complex systems are getting increasingly popular. A significant line of research makes use of machine learning and deep learning, which realize failure detection via black-box models rather than by providing explicit properties capable of characterizing bad behaviors of a system. Despite their lack of interpretability, which makes it difficult to understand and validate the resulting verdicts, these approaches have been employed in several domains due to their effectiveness. For instance, machine learning strategies based on Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbours (KNN) applied to the domains of aircraft components post-flight reports, gearbox failures in industrial robots, high-performance computing, and cloud systems are described in [4], [5], and [6]. Deep learning solutions are typically exploited to extract temporal relations in time series data, as witnessed in [7], [8], [9], and [10], where Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNNs) are applied to the domains of job failures in large-scale cloud data centers, turbofan engine degradation, hard drive telemetry data, and heart atrial fibrillation detection on routine screening electrocardiogram (ECG) signals. A common limitation of all these solutions is that historical data used for the predictions are often defined through a time window of fixed size, which may be inadequate when heterogeneous failure behaviors have to be captured.
In the attempt to cope with the interpretability requirement, which is fundamental in many critical domains, some approaches for the extraction of properties that distinguish between failure and normal execution traces of a system have been recently proposed in the literature [11], [12], [13], [14], [15], [16], [17]. However, learning temporal properties from the observed system traces is a challenging task that involves intractable optimization problems [11]. To overcome them, heuristics were suggested [11], [12], [13], [14]. In this spirit, some ad hoc, domain-specific solutions have been devised to assess the condition of electrical rotating machines through real-time vibration measurement and analysis instruments [18], to discover temporally-constrained alarm sets from dynamic systems' logs [19], to diagnose rolling bearings faults through a hardware architecture with a reconfigurable logic based on field programmable gate arrays (FPGAs) [20], to detect system intrusions through temporal logic specifications [21], and to ensure the safety of synthesized policies for robotics through model predictive shielding [22]. Nonetheless, a general technique, applicable to different domains and contexts, is still missing. A step towards such a goal was taken in [23], where an STL-based solution to the problem of detecting ineffective respiratory effort in intensive care patients, which includes a learning phase supporting some adaptive behaviors, is outlined. Still, the generative data models employed in the learning phase are tailor-made, thus limiting the flexibility and generality of the proposed solution.
With the goal of generalizability in mind, approaches explicitly aimed at combining the points of strength of machine learning and formal methods have been recently proposed. Specifically, in [15], [16], and [17], techniques for the mining of STL properties that distinguish between two different sets of time series data are presented. The proposal in [15] relies on a genetic algorithm combined with parameter learning through Gaussian process upper confidence bound. The solution originally presented in [13], and then extended with online learning in [16], exploits a decision tree learner based on STL primitives. Finally, the approach in [17] relies on a reinforcement learning-based property extractor that combines data- and knowledge-driven methodologies. Still, the aforementioned proposals significantly differ from what is presented in this work, as they are not designed to work iteratively, managing a pool of properties in real-time. Beyond STL, in [24], a failure prediction method for cloud data centers, based on message pattern recognition via Bayesian probability, is described. As for failure detection in cyber-physical systems, a Ripple down rule-based (RDR) framework was proposed in [25], which exploits a machine learning technique based on the algorithm InductRDR [26]; the result is then maintained by domain experts, who refine the RDR knowledge base.
A related field is that of specification mining, whose goal is to generate/integrate the formal specification of a system by analyzing its execution traces. Various approaches to this problem have been proposed in the literature. In [27], a Linear Temporal Logic (LTL) property template miner, based on support and confidence thresholds, is devised. A Bayesian inference-based probabilistic model that generates LTL task specifications from examples, by exploiting a Markov Chain Monte Carlo algorithm, is outlined in [28]. In [29], an algorithm to infer LTL specifications by combining the representational power and interpretability of temporal logic with the generalizability of inverse reinforcement learning is proposed. The problem of mining finite state automata to generate formal specifications in the context of software applications and libraries is dealt with in [30]. To this end, the authors make use of prefix tree acceptors, language models based on recurrent neural networks, and clustering algorithms to merge similar automaton states. Despite the relevance of specification mining, the proposed solutions extract non-contrastive properties from the observed executions of a system, and thus they cannot be naturally applied to failure detection, where properties able to discriminate between good and bad behaviors are needed.
As a final remark, we would like to mention that the solution described in detail in this paper was first outlined in two short preliminary contributions [31], [32]. In this paper, we fully work out the proposed framework, revise and extend it in various ways with respect to its original formulation, and largely improve its experimental evaluation.

III. BACKGROUND KNOWLEDGE
In this section, we recall some basic notions about monitoring, STL, and evolutionary algorithms.

A. MONITORING
While classic verification techniques, like, for instance, model checking [33], perform an exhaustive analysis of the behaviors of a system, monitoring [1] aims at establishing satisfaction or violation of a property by analyzing a finite prefix of a single behavior (trace/run), and then issuing an irrevocable verdict [1]. It is thus a lightweight technique, but the gain in efficiency is paid in terms of expressivity: monitorable properties are a subset of those expressible in temporal logics commonly used for automated verification.
We say that a property is positively (resp., negatively) monitorable if every trace satisfying (resp., violating) it features a finite prefix that witnesses the satisfaction (resp., violation). A monitorable property is a property that is either positively or negatively monitorable. Safety properties, informally requiring that ''something bad will never happen'', are negatively monitorable, as their violation is witnessed by a finite prefix exhibiting a violation; dually, co-safety properties, stating that ''something required will eventually happen'', are positively monitorable. Notably, there are meaningful properties, like, e.g., ''a good state is accessed infinitely often'' (an essential ingredient of strong fairness requirements [34]), which are clearly neither positively nor negatively monitorable.
As we will see in Section V, the online nature of monitoring makes it a natural candidate for the proposed framework.

B. SIGNAL TEMPORAL LOGIC (STL)
Signal Temporal Logic (STL) [35] extends propositional logic with future modalities that allow one to express temporal properties over linear structures. It can be directly applied to time series data characterized by continuous values.
Let $\mathbb{N}_{>0}$ (resp., $\mathbb{N}_{[t,t']}$) be the set of positive naturals (resp., naturals in between $t$ and $t'$, for all $t, t' \in \mathbb{N}$) and $\mathbb{R}^n$ be the n-dimensional Euclidean space over reals. A discrete-time STL signal (or trace) is a function $x : \mathbb{N} \to \mathbb{R}^n$, for some $n \in \mathbb{N}_{>0}$; a partial signal is a function $x : \mathbb{N}_{[0,t]} \to \mathbb{R}^n$, for some $n \in \mathbb{N}_{>0}$ and $t \in \mathbb{N}$.¹ We denote the length of a (partial) signal $x$ by len(x). For a signal $x$, it holds that len(x) = ∞, whereas len(x) = t + 1 for a partial signal $x : \mathbb{N}_{[0,t]} \to \mathbb{R}^n$. Let $X$ (resp., $\tilde{X}$) be the set of signals (resp., partial signals). If the codomain of a (partial) signal $x$ is $\mathbb{R}^n$, then $n$ is the dimension of $x$, denoted by $|x|$. Let $x \in X \cup \tilde{X}$. We denote by $x_i$, with $1 \le i \le |x|$, the function from the domain of $x$ to $\mathbb{R}$ such that $x_i(t)$ is equal to the i-th component of $x(t)$, for all $t$. Moreover, we denote by $x[j, k]$, with $0 \le j \le k < \mathrm{len}(x)$, the restriction of the function $x$ to the domain $\mathbb{N}_{[j,k]}$.
¹ As a matter of fact, STL allows one to deal with continuous-time signals by simply redefining $x$ as $x : \mathbb{R}_{\ge 0} \to \mathbb{R}^n$, where $\mathbb{R}_{\ge 0}$ is the set of nonnegative reals. Here, we restrict ourselves to the discrete-time case given that a sampling is required to represent time series data within a dataset.
STL formulas are built from constraints of the form $x_i \ge c$, Boolean connectives, and the temporal modality $\mathsf{U}_I$, where $x$ is a signal, $1 \le i \le |x|$, $c \in \mathbb{R}$, and $I$ is an interval of the form $(a, b)$, $(a, b]$, $[a, b)$, or $[a, b]$, with $a \in \mathbb{N}$, $b \in \mathbb{N} \cup \{\infty\}$, and $a \le b$. Modality $\mathsf{U}$ (until) is paired with an interval $I$ which defines its validity scope. For every $t \in \mathbb{N}$ and interval $I = (a, b)$, we denote by $t + I$ the interval $(t + a, t + b)$ (the same for intervals of the forms $(a, b]$, $[a, b)$, and $[a, b]$). Derived modalities are defined as usual. As an example, modalities eventually and globally are defined as $\mathsf{F}_I \varphi = \top\, \mathsf{U}_I\, \varphi$ and $\mathsf{G}_I \varphi = \neg \mathsf{F}_I \neg\varphi$, respectively.
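Under the usual STL conventions, a grammar consistent with the definitions above can be written as follows. Treating $\top$ and $\wedge$ as the primitive constant and Boolean connective is an assumption made here, chosen so that the derived modalities $\mathsf{F}_I$ and $\mathsf{G}_I$ come out exactly as stated in the text; disjunction is then derived in the standard way.

$$
\varphi ::= \top \;\mid\; x_i \geq c \;\mid\; \neg\varphi \;\mid\; \varphi_1 \wedge \varphi_2 \;\mid\; \varphi_1\, \mathsf{U}_I\, \varphi_2
$$

$$
\mathsf{F}_I \varphi \equiv \top\, \mathsf{U}_I\, \varphi, \qquad
\mathsf{G}_I \varphi \equiv \neg \mathsf{F}_I \neg \varphi, \qquad
\varphi_1 \vee \varphi_2 \equiv \neg(\neg\varphi_1 \wedge \neg\varphi_2)
$$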
STL pairs the standard Boolean semantics with a quantitative one, which measures the robustness of the satisfaction of φ by a signal x at a time t ∈ N.
The quantitative semantics of STL is defined by induction on the structure of $\varphi$ and yields a robustness value $\rho(\varphi, x, t)$. The Boolean semantics of STL is then defined on the basis of the sign of $\rho(\varphi, x, t)$.
Finally, given a partial signal $x \in \tilde{X}$ over $\mathbb{N}_{[0,t]}$, the set of completions of $x$ is defined as $C(x) = \{x' \in X \mid x'(t') = x(t') \text{ for all } t' \in \mathbb{N}_{[0,t]}\}$. STL monitoring is formally defined by a function mon, which evaluates a formula against a partial signal by taking its completions into account.
The choice of STL as the formalism for the specification of the properties to monitor has various advantages. First of all, STL allows one to directly deal with real values, while still featuring quite compact and interpretable formulas. Moreover, its quantitative semantics provides one with an additional tool to evaluate the behavior of the extracted formulas, a feature that will be described in detail in Section IV.
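To make the notion of robustness concrete, the following is a minimal sketch of a recursive evaluator for discrete-time STL over closed, bounded intervals. It follows the textbook quantitative semantics, which is assumed here to match the one used in the paper. The tuple encoding of formulas, the 0-based component index, the value +∞ for ⊤, and the convention that the left operand of until must hold on the time points strictly before the witness are all illustrative assumptions.

```python
import math

# Hypothetical encoding of formulas as nested tuples:
#   ("true",)               the constant top
#   ("ge", i, c)            constraint x_i >= c (i is 0-based here)
#   ("not", f)              negation
#   ("and", f, g)           conjunction
#   ("until", a, b, f, g)   f U_[a,b] g, with a <= b both finite

def eventually(a, b, f):
    return ("until", a, b, ("true",), f)

def globally(a, b, f):
    return ("not", eventually(a, b, ("not", f)))

def rho(phi, x, t):
    """Robustness of phi on trace x (a list of tuples) at time t.

    The trace must be long enough to cover every time point the
    formula looks at; otherwise an IndexError is raised.
    """
    op = phi[0]
    if op == "true":
        return math.inf
    if op == "ge":
        _, i, c = phi
        return x[t][i] - c
    if op == "not":
        return -rho(phi[1], x, t)
    if op == "and":
        return min(rho(phi[1], x, t), rho(phi[2], x, t))
    if op == "until":
        _, a, b, f, g = phi
        best = -math.inf
        for t2 in range(t + a, t + b + 1):
            # f must hold on [t, t2); an empty range imposes no constraint
            left = min((rho(f, x, t1) for t1 in range(t, t2)), default=math.inf)
            best = max(best, min(rho(g, x, t2), left))
        return best
    raise ValueError(f"unknown operator: {op}")

trace = [(1.0,), (2.0,), (6.0,), (4.0,)]
print(rho(eventually(0, 3, ("ge", 0, 5.0)), trace, 0))
print(rho(globally(0, 3, ("ge", 0, 0.5)), trace, 0))
```

A positive value indicates that the formula holds with some margin, and a negative one that it is violated with some margin.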

C. MONITORING BOUNDED STL FORMULAS
Monitoring properties that refer to both the current and the future behavior of a system is a challenging task since their evaluation at a given time t may also depend on the observed inputs at some time t ′ > t. In Section III-A and Section III-B, we introduced the notion of monitorable properties and provided syntax and semantics of STL, respectively. To the best of our knowledge, no tool supporting the monitoring of arbitrary STL formulas is available. Luckily, in most application domains, the properties to monitor can be expressed by means of bounded-time STL (bSTL) formulas, and a tool to deal with such a class of formulas has been developed in [36]. Basically, the fragment bSTL constrains the interval I associated with modality U to be finite, that is, I = [a, b], with both a and b belonging to N (b = ∞ is excluded).
Let φ be a bSTL formula. By analyzing its syntactic structure, one can compute a temporal horizon H (φ), that intuitively represents the maximum number of (future) time points that one must take into consideration to establish whether or not φ is true. In the following, when evaluating a bSTL formula φ, with temporal horizon H (φ), at a given time t, the monitor will wait until time t + H (φ) is reached, since at that time all the data necessary for the quantitative evaluation of the formula and the possible formulation of a ⊤ or ⊥ verdict have surely been observed. As an example, the horizon of the bSTL formula φ = x ≥ 3 U [0,3] x ≥ 5 is 3, and thus only after 3 time units we can complete its (quantitative) evaluation.
Formally, the temporal horizon H(φ) of a bSTL formula φ is defined by induction on the syntactic structure of φ. Notice that the monitor, when applied to a bSTL formula φ, may output the truth values ⊤ or ⊥, as well as the undefined verdict ? when the horizon of φ still has to be reached. As previously mentioned, a ⊤ or ⊥ verdict can always be reached once the horizon is met.
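The inductive computation of the horizon can be sketched as follows. The tuple encoding of formulas is hypothetical, and the rule used for until (upper bound of the interval plus the larger horizon of the two operands) is the common textbook one, assumed here rather than taken from the paper. It does reproduce the value 3 for the example x ≥ 3 U[0,3] x ≥ 5 discussed above.

```python
# Hypothetical encoding of bSTL formulas as nested tuples:
#   ("true",), ("ge", i, c), ("not", f), ("and", f, g), ("until", a, b, f, g)

def horizon(phi):
    """Number of future time points needed to evaluate phi at a given time."""
    op = phi[0]
    if op in ("true", "ge"):
        return 0
    if op == "not":
        return horizon(phi[1])
    if op == "and":
        return max(horizon(phi[1]), horizon(phi[2]))
    if op == "until":
        _, a, b, f, g = phi
        # Assumed rule: b plus the larger horizon of the two operands
        return b + max(horizon(f), horizon(g))
    raise ValueError(f"unknown operator: {op}")

# The example from the text: x >= 3 U_[0,3] x >= 5
example = ("until", 0, 3, ("ge", 0, 3.0), ("ge", 0, 5.0))
print(horizon(example))

# Nested temporal operators add up: F_[0,3] (true U_[0,2] x >= 1)
nested = ("until", 0, 3, ("true",), ("until", 0, 2, ("true",), ("ge", 0, 1.0)))
print(horizon(nested))
```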
Formally, bSTL monitoring is defined by a function b-mon which, given a partial signal and a bSTL formula, returns one of the verdicts ⊤, ⊥, or ?. Now, we observe that, by the definition of monitoring and the nature of bSTL formulas, monitors evaluate bSTL formulas only based on prefixes of signals of bounded length, the bound depending on the temporal horizon of the formulas. As an example, formula ϕ = F[0,3] (temperature ≥ 3) states that there must exist at least one time point where temperature ≥ 3 among the first 4 (i.e., H(ϕ) + 1) time points of the signal, that is, in the set of time points {0, 1, 2, 3}. This limits the applicability of monitoring in real-world scenarios where one is interested in detecting the possible occurrence of a given condition at any time point of a signal. This is the case, for instance, with the property: ''in 25 time units from now the temperature will exceed 30 degrees''. To accommodate that, we extend the notion of monitoring by making it possible to apply it to any time point, that is, to any suffix of a signal. To this end, building upon function b-mon, we define a function eb-mon.
We are interested in identifying signals exhibiting bad behaviors, which are encoded by means of bSTL properties. As noted before, a conclusive verdict can be issued for a signal only if it is longer than the horizon of the bSTL property under consideration. Therefore, given a partial signal x and a bSTL property ϕ, function eb-mon(x, ϕ) returns ? whenever x is no longer than the horizon of ϕ. Otherwise, we monitor (through b-mon) ϕ against all suffixes y of x longer than H(ϕ): if at least one such monitoring procedure returns ⊤, then so does eb-mon(x, ϕ), meaning that the signal is considered a bad-behaving one (see Figure 1).
The monitoring tool we used to realize the above-described approach is rtamt, a Python library for monitoring discrete- and dense-time bSTL properties [36].
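The relationship between b-mon and eb-mon can be illustrated with a small sketch that does not depend on rtamt. A property is represented here, purely for illustration, by its horizon `h` and a function `rob` returning its robustness on a window of h + 1 points. Two points are assumptions rather than statements from the text: a non-negative robustness is read as ⊤, and eb-mon returns ⊥ when no suffix yields ⊤.

```python
TOP, BOTTOM, UNKNOWN = "top", "bottom", "?"

def b_mon(x, h, rob):
    """Verdict of a bounded property on the first h + 1 points of x."""
    if len(x) <= h:
        return UNKNOWN  # the horizon has not been reached yet
    # Assumption: non-negative robustness counts as satisfaction
    return TOP if rob(x[: h + 1]) >= 0 else BOTTOM

def eb_mon(x, h, rob):
    """Verdict obtained by monitoring every suffix of x longer than h."""
    if len(x) <= h:
        return UNKNOWN
    for j in range(len(x) - h):  # exactly the suffixes x[j:] with len > h
        if b_mon(x[j:], h, rob) == TOP:
            return TOP
    return BOTTOM  # assumption: no witnessing suffix means bottom

# F[0,3] (temperature >= 3): horizon 3, robustness max(v - 3) over the window
h = 3
rob = lambda window: max(v - 3 for v in window)

temps = [1, 1, 1, 1, 1, 1, 5]
print(b_mon(temps, h, rob))   # only looks at time points 0..3
print(eb_mon(temps, h, rob))  # also sees the late spike
```

On this trace the plain bounded monitor misses the spike at the end, whereas the extended one catches it through the suffix starting at time point 3.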

D. EVOLUTIONARY ALGORITHMS
Evolutionary Algorithms (EAs) are population-based metaheuristics inspired by the process of biological evolution and genetics, that excel in the solution of combinatorial optimization problems [37]. Unlike classic random search, EAs make use of historical information to direct the search into the most promising regions of the search space.
In nature, a population of individuals tends to evolve to adapt to its environment. Similarly, EAs are characterized by a population, where each individual represents a candidate solution to a given optimization problem; each solution is evaluated with respect to its degree of ''adaptation'' to the problem through a single- or multi-objective fitness function.
The EA population iteratively goes through a series of generations. At each generation, individuals chosen by a selection strategy undergo a process of reproduction. Such a selection strategy is the fundamental factor that distinguishes one evolutionary based approach from another, although, typically, individuals with a high degree of adaptation are more likely to be chosen (elitism). In this way, the elements of the population progressively evolve toward better solutions. Reproduction involves the application, with a certain degree of probability, of suitable crossover and mutation operators. As a result, an offspring is generated, which is finally merged with the previous population, and the cycle repeats until a stopping condition is met, e.g., a condition based on a given fitness threshold.
Crossover is the EA counterpart of natural reproduction, by which the characteristics of two individuals are combined by generating one or two offspring. As a general rule, a high crossover probability tends to pull the population towards a local minimum or maximum. Mutation applies random changes to the encoding of the selected solution, with the goal of maintaining genetic diversity in the individuals; it prevents premature convergence of the algorithm to a local optimum, thus allowing it to explore the search space more broadly.
In this work, we deal with a specific kind of optimization task, that is, genetic programming. Such a technique evolves programs starting from a population of random solutions [38]. Each individual is encoded by means of a computation tree, where each leaf represents an input value (either a variable or a constant) and internal nodes encode operators. The output value is generated by the primitive encoded in the root. Typical crossover/mutation operations applied on computation trees are subtree exchange and node/leaf addition/removal/replacement.
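As an illustration of subtree exchange on computation trees, the sketch below represents a tree as a nested list whose first element is the operator and whose remaining elements are its children; leaves are plain values. This encoding and all function names are hypothetical. Unlike strongly typed genetic programming, the sketch does not check that the exchanged subtrees have compatible types, so it may well produce ill-formed formulas.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield prefix
    if isinstance(tree, list):
        for k, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (k,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for k in path:
        tree = tree[k]
    return tree

def put(tree, path, sub):
    """Return a copy of tree in which the node at path is replaced by sub."""
    if not path:
        return sub
    copy = list(tree)
    copy[path[0]] = put(tree[path[0]], path[1:], sub)
    return copy

def size(tree):
    return sum(1 for _ in paths(tree))

def subtree_crossover(p1, p2, rng):
    """Pick one random node per parent and exchange the subtrees rooted there."""
    a = rng.choice(list(paths(p1)))
    b = rng.choice(list(paths(p2)))
    return put(p1, a, get(p2, b)), put(p2, b, get(p1, a))

p1 = ["and", ["ge", "x1", 3.0], ["not", ["ge", "x2", 1.0]]]
p2 = ["until", 0, 3, ["ge", "x1", 3.0], ["ge", "x1", 5.0]]
c1, c2 = subtree_crossover(p1, p2, random.Random(0))
print(c1)
print(c2)
```

Since the two subtrees are only swapped, the total number of nodes across the two children equals that of the two parents.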
As for the task of property extraction, we will rely on a multi-objective evolutionary algorithm. The reason is twofold. First, such an EA is able to simultaneously follow different optimization goals, producing a set of Pareto-optimal solutions as a result which one can subsequently combine. Second, it is a flexible approach, as it allows one to customize the syntax of the generated formulas by constraining the computation trees, e.g., enabling/disabling specific operators or allowing only some kinds of combinations among them.

IV. THE EVOLUTIONARY ALGORITHM
The Evolutionary Algorithm we are going to exploit relies on DEAP (Distributed Evolutionary Algorithms in Python) [39], a framework that provides practical tools for the prototyping of custom evolutionary algorithms. In this section, we will illustrate how the different components of the optimizer have been developed.
The algorithm receives a set of finite traces X (all of the same length) as input; it then partitions each trace into a normal behavior prefix and a failure behavior suffix and, finally, generates a bSTL formula which is able to discern between the two cases.

A. POPULATION AND ITS INITIALIZATION
Each individual belonging to the population consists of a pair (ϕ, w), where ϕ encodes a computation tree representing a syntactically correct bSTL formula and w is its associated normal behavior window length.
As for ϕ, it is generated following DEAP's genHalfAndHalf method, which outputs a random tree with a maximum height of 6, as suggested by Koza in his seminal work [40]. More precisely, half the time a tree whose leaves have all the same depth is returned; in the remaining cases, different leaves may have different depths.
The window length w is used to partition each trace x ∈ X into a normal behavior prefix of length w and failure suffix of length len(x) − w (see the definition of the fitness function below). Note that, in the generation process of the individual, a formula ϕ with a horizon H (ϕ) ≥ len(x) − w might be obtained. In such a case, the individual is discarded and another one is generated. The process is iterated until a valid individual is obtained.
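The discard-and-regenerate step amounts to rejection sampling, which can be sketched as follows. The formula generator below is a stand-in, not DEAP's tree generator: it only draws formulas of the shape F[a,b] (x_i ≥ c), whose horizon is b, and every name in it is hypothetical.

```python
import random

def horizon(phi):
    # phi = ("F", a, b, i, c) stands for F_[a,b] (x_i >= c); its horizon is b
    return phi[2]

def sample_formula(rng, max_b=10):
    """Stand-in for the random tree generator (illustrative only)."""
    a = rng.randint(0, max_b)
    b = rng.randint(a, max_b)
    return ("F", a, b, 0, rng.random())

def random_individual(trace_len, rng):
    """Draw (phi, w) pairs until the horizon fits in the failure suffix."""
    while True:
        phi = sample_formula(rng)
        w = rng.randint(1, trace_len - 1)   # 0 < w < len(x)
        if horizon(phi) < trace_len - w:    # valid: H(phi) < len(x) - w
            return phi, w

rng = random.Random(42)
phi, w = random_individual(30, rng)
print(phi, w)
```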

B. NODES OF THE COMPUTATION TREE
A node of the computation tree may represent a constraint, e.g., x_i ≥ c, a bSTL formula whose outermost operator is a temporal one, e.g., ϕ U[a,b] ψ, or a Boolean formula like, for instance, ϕ ∨ ψ, where ϕ and ψ are bSTL sub-formulas which are represented, in their turn, as trees. A node may also encode the following terminal values: (i) interval bounds of a temporal operator, i.e., [a, b], with a, b ∈ N and a ≤ b, (ii) signal identifiers x_i, with 1 ≤ i ≤ |x|, and (iii) constants c occurring in formulas, with c ∈ Dom(x_i) for some i. All these terminals are implemented by means of DEAP's EphemeralConstants. As for the length of the normal behavior window, it is implemented as an EphemeralConstant w, with 0 < w < len(x), where len(x) is the length of the traces x ∈ X (they are all of the same length).

C. FITNESS FUNCTION
In order to evaluate an individual of the population, each trace x ∈ X is logically partitioned into a good behavior prefix x[0, w − 1] and a failure suffix x[w, len(x) − 1], following a windowing approach which takes w as the length of the normal behavior window. A bi-objective fitness function is then defined by making use of the rtamt monitoring algorithm for bSTL.
Formally, the first objective measures how good a formula ϕ is at discriminating between the normal behavior prefixes and the failure suffixes. For each trace x and each formula ϕ, we first define a numerical counterpart of eb-mon(x, ϕ), by cases on the returned verdict.
The first objective, Acc(X, ϕ), is then defined on top of it. It is worth noticing that, to maximize Acc(X, ϕ), a formula ϕ should evaluate to ⊥ on the normal behavior prefixes and to ⊤ on the failure suffixes. In this respect, it is very important to be able to evaluate a formula ϕ to ⊤ or ⊥ up to the last instant of the good behavior prefix of a trace x. To this end, we simply extend the prefix with the first H(ϕ) points taken from the failure suffix. Intuitively, the reason is that, otherwise, there may be some failure patterns beginning on the prefix and ending on the suffix, that would not be captured (? verdict). The second objective measures the robustness of the formula (normalized in the [0, 1] interval) by means of bSTL quantitative semantics. As a preliminary step, at the beginning of the execution of the genetic algorithm, every signal in X is normalized in the [0, 1] interval so that ρ ranges between −1 and 1. This step is handled implicitly and it does not alter the constant value c of constraints x_i ≥ c in the generated output formula, which are still represented with their raw, non-normalized value.
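The windowing and normalization steps described above can be sketched as follows. The function names are hypothetical, and min-max scaling is an assumption about how the normalization to [0, 1] is carried out. The prefix is extended with the first H(ϕ) points of the suffix, so that a conclusive verdict is available up to the last instant of the normal behavior window.

```python
def normalize(values):
    """Min-max scale a one-dimensional signal to [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def split_trace(x, w, h):
    """Split x into an extended normal prefix and a failure suffix.

    w is the normal behavior window length and h the horizon of the
    formula under evaluation; h < len(x) - w is required.
    """
    if not 0 < w < len(x) or h >= len(x) - w:
        raise ValueError("invalid window length or horizon")
    prefix = x[: w + h]   # x[0, w - 1] plus the first h points of the suffix
    suffix = x[w:]        # x[w, len(x) - 1]
    return prefix, suffix

trace = list(range(10))
prefix, suffix = split_trace(trace, w=6, h=2)
print(prefix)
print(suffix)
print(normalize([2.0, 4.0, 6.0]))
```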
This second objective is formally defined in terms of ρ. Since two objectives are taken into consideration, no single best-performing solution can be directly selected from a given population by means of the fitness function. Rather, a Pareto front of optimal solutions can be identified, containing all non-dominated solutions.² As a final note, observe that including the window length w in each individual allows each bSTL formula to define its own (optimal) way of splitting the traces: we may indeed expect different kinds of failure, captured by different formulas, to be characterized by different temporal extensions.
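A minimal sketch of how a Pareto front of non-dominated solutions can be extracted from a set of objective vectors is given below. It assumes that both objectives are to be maximized; the fitness values in the example are made up for illustration.

```python
def dominates(p, q):
    """True if p is no worse than q on every objective and better on one."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def pareto_front(points):
    """Keep the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (accuracy, robustness) pairs
fitness = [(0.9, 0.1), (0.8, 0.3), (0.7, 0.2), (0.5, 0.5), (0.5, 0.4)]
print(pareto_front(fitness))
```

In this example, (0.7, 0.2) is dominated by (0.8, 0.3) and (0.5, 0.4) by (0.5, 0.5), so three points remain on the front.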

D. CROSSOVER
Given two parent solutions, the crossover operation generates two new individuals. As for their computation trees, they are generated by one-point crossover (DEAP's cxOnePoint). The operator randomly chooses a node in each individual and exchanges the subtrees rooted at it. To avoid bloat, that is, an excessive increase in mean program size without a corresponding improvement in fitness, we placed a static limit of 17 on the children's height (DEAP's staticLimit), following once more a suggestion from Koza [40]. When an invalid (over the height limit) child is generated, it is simply replaced by one of its parents, randomly selected. As for the associated window lengths, they are randomly chosen from the parents. Observe that, in performing the crossover operation, non-valid individuals can be generated concerning the relationship between their horizon and normal behavior window w; given an individual with formula ϕ, if H (ϕ) ≥ len(x) − w, we replace it by one of the parents, randomly chosen.

E. MUTATION
As for the mutation operation, two operators have been used among those available in DEAP, each one chosen with uniform probability: mutNodeReplacement, which replaces a randomly chosen node in the individual, and mutEphemeral, which changes the value of a single constant used within an individual (including, possibly, the window length). As we did for crossover, in order to control bloat we impose a staticLimit constraint equal to 17 on the height of the tree. Moreover, it must be checked whether the resulting individual is valid, with reference to its horizon and window length. If this is not the case, the original individual is returned.
² A set S of solutions for an n-objective problem with fitness function f = ⟨f_1, . . . , f_n⟩ is said to be non-dominated if and only if for each x ∈ S, there exists no y ∈ S such that (i) f_i(y) improves f_i(x) for some i, with 1 ≤ i ≤ n, and (ii) for all j, with 1 ≤ j ≤ n and j ≠ i, f_j(x) does not improve f_j(y).

F. SELECTION
To promote population diversity, we rely on the elitist selection strategy implemented in NSGA-III [41], based on the concepts of reference points and niche preservation (we refer the reader to [41] for details).

G. TERMINATION CRITERIA AND EXTRACTION OF FINAL SOLUTIONS
Let us now focus on the termination criteria of the algorithm and on the extraction of the final solution. As it is commonly done, we impose an upper bound on the number of generations. In addition, we define an early stopping strategy, based on the hypervolume measure. According to it, the execution of the algorithm is interrupted when no improvement over the hypervolume is observed for a given number of generations. Intuitively, the hypervolume of a Pareto front measures the volume of the search space, bounded by a given reference point, that is weakly dominated by the points on the Pareto front [42]. The assumption is that populations of heterogeneous and well-performing solutions are characterized by a high hypervolume.
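For two objectives to be maximized, the hypervolume reduces to the area of a union of rectangles, and the early stopping rule becomes a patience check on its history. The sketch below assumes a reference point at (0, 0) and uses hypothetical function names; it is not the routine used in the paper's implementation.

```python
def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area weakly dominated by the front and bounded below by ref."""
    pts = sorted((p for p in front if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: -p[0])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # Add the horizontal strip that this point newly covers
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def should_stop(history, patience):
    """True if the last `patience` generations brought no improvement."""
    if len(history) <= patience:
        return False
    return max(history[-patience:]) <= max(history[:-patience])

front = [(0.9, 0.1), (0.8, 0.3), (0.5, 0.5)]
print(hypervolume_2d(front))
print(should_stop([0.1, 0.2, 0.3, 0.3, 0.3], patience=2))
```

Dominated points contribute nothing to the area, so the function can be applied to a whole population as well as to its front.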
Since the EA provides a Pareto front of optimal individuals (ϕ, w) as its result, to determine the final solution to output we first filter the last population's front, keeping all individuals whose formula ϕ has an accuracy greater than 0.5, that is, better than a random classifier. Then, among such individuals, we return the formula ϕ of the individual with the highest hypervolume. If no formula with accuracy greater than 0.5 is present in the final front, we return null.
[...] [10,25,50]). Note that mutation probability starts rather high to ensure an effective exploration of the search space; then, it rapidly decays with the number of generations to foster the exploitation of the most promising solutions that have been found. Although we recognize that, in principle, each dataset has a different and optimal set of hyperparameters, as we will see, the above values still provide a solid basis when it comes to the overall framework performance, and can thus be considered default choices.
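The extraction of the final solution from the last front can be sketched as follows. Individuals are represented as hypothetical tuples (formula, window length, accuracy, robustness). The hypervolume of a single individual is taken to be the area of the rectangle it dominates with respect to a reference point at (0, 0); this reading is an assumption, as is the use of Python's `None` for the null result.

```python
def select_final(front, ref=(0.0, 0.0)):
    """Return the formula of the best individual, or None if none qualifies.

    front is a list of (phi, w, accuracy, robustness) tuples.
    """
    # Keep only the formulas that beat a random classifier
    good = [ind for ind in front if ind[2] > 0.5]
    if not good:
        return None
    # Assumed per-individual hypervolume: area dominated w.r.t. ref
    best = max(good, key=lambda ind: (ind[2] - ref[0]) * (ind[3] - ref[1]))
    return best[0]

last_front = [
    ("phi_a", 10, 0.9, 0.1),
    ("phi_b", 12, 0.8, 0.3),
    ("phi_c", 8, 0.5, 0.5),
]
print(select_final(last_front))
```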
Another hyperparameter, used by the EA in this specific implementation, based on bSTL and rtamt, is max horizon, whose meaning is fairly natural. Intuitively, it sets an upper bound on the horizon of the formulas that can be explored within the EA. Enforcing a max horizon h has three effects: first, formulas can capture phenomena that extend over at most h + 1 time points (in terms of the sampling rate of the considered time series); second, given the way rtamt (equivalently, eb-mon) works, at run time, when evaluating the truth of a formula, the verdict will be ? for the first h time points; third, it has been experimentally observed that the execution time of rtamt grows more than linearly with the horizon of a formula. As we will see in Section VI, multiple runs of the framework have been taken into consideration in order to collect statistically relevant data. Thus, to allow for faster experimentation, we set a small value of 20 for max horizon on all datasets. Although this might seem restrictive, it still allows us to extract meaningful and well-performing properties. In a general usage scenario, max horizon should be set by domain experts considering the three aspects mentioned above. The impact of the horizon length on the framework performance is studied in Section VI-C.
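To make the role of max horizon concrete, the following sketch computes the horizon of a formula under the usual convention for bounded temporal operators: the horizon of F[a,b] φ or G[a,b] φ is b plus the horizon of φ, and that of a Boolean combination is the maximum over its operands. The tuple encoding of formulas and the function names are hypothetical; they are not part of rtamt, and whether bSTL admits nested temporal operators is not assumed either way.

```python
# Hypothetical formula encoding (not an rtamt data structure):
#   ("atom", text)            atomic predicate, e.g. "y < 3"
#   ("not", f)                negation
#   ("and", f, g), ("or", f, g)
#   ("F", a, b, f), ("G", a, b, f)   bounded eventually / globally


def horizon(formula: tuple) -> int:
    """Number of future time points needed to evaluate `formula` at index 0."""
    kind = formula[0]
    if kind == "atom":
        return 0
    if kind == "not":
        return horizon(formula[1])
    if kind in ("and", "or"):
        return max(horizon(formula[1]), horizon(formula[2]))
    if kind in ("F", "G"):
        _, a, b, sub = formula
        if not 0 <= a <= b:
            raise ValueError("interval bounds must satisfy 0 <= a <= b")
        return b + horizon(sub)
    raise ValueError(f"unknown operator: {kind}")


def within_max_horizon(formula: tuple, max_horizon: int) -> bool:
    """Check used to discard candidate formulas that look too far ahead."""
    return horizon(formula) <= max_horizon


if __name__ == "__main__":
    phi = ("F", 0, 19, ("atom", "SMART_198 > 2.59"))
    print(horizon(phi), within_max_horizon(phi, 20))
```

Under this convention, a formula with horizon h needs h + 1 time points before a definite verdict can be given, which matches the ? verdicts reported for the first h time points.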

V. THE GENERAL FRAMEWORK
In the following, we describe the proposed framework for preemptive failure detection. As already pointed out, it works in an online fashion and it uses the rtamt monitoring algorithm to check the incoming system trace for undesired behaviors. As we will see, in terms of binary classification, the occurrence of a bad behavior is considered as a positive event. Thus, a false positive corresponds to an erroneous indication of a bad situation, while a false negative corresponds to a missed detection. Bad behaviors are encoded by bSTL formulas, which are collected in a monitoring pool P.
Operationally, we distinguish between two execution phases of the framework: an optional warmup phase and a runtime phase. In the first one, the pool P is populated with an initial set of formulas encoding bad behaviors, following a teacher forcing-like approach [43] on supervised training data. In the second one, the framework monitors the system online, starting with a non-empty pool P.
During both phases, P is iteratively refined by (i) adding new formulas which are able to predict bad behaviors earlier and with increased reliability and coverage, and (ii) removing formulas that are ill-behaved or redundant. In addition to this refinement process autonomously operated by the framework, at any time, domain experts can, in principle, make changes to the pool P, e.g., by manually specifying a new formula encoding a bad behavior.

Algorithm 2 UPDATEPOOLINFORMATION
input: pool P of formulas, set F of failure formulas, trace x
1: for φ ∈ F do
2:     far_φ ← (1 − α) · newFAR(φ, x) + α · far_φ
3:     if far_φ > far_thr then
4:         remove(φ, P)
5:     end if
6: end for
7: handleRedundancy(P)

and l is its corresponding label (⊤, if x is a trace ending with a failure; ⊥ otherwise). The overall idea is to monitor, one after the other, all available system traces and, for each of them, to simulate its point-by-point arrival.
The warmup phase is dealt with by Algorithm 1. The procedure gets, as input, a pool P of bSTL formulas and the set X of training system traces. P may possibly be empty. This is the case when no formula is inserted into it by domain experts. For each training trace x, with label l, two variables are set: has_triggered, which keeps track of whether the framework has correctly identified the trace x as a failure one (when l = ⊤); and a set S of suspended formulas. S includes all formulas that, at some point, erroneously signalled a bad behavior in x (when l = ⊥), and are thus ignored by the framework for its operation on the remainder of trace x.
Next, the framework starts the iterative part of its execution, during which the trace x is monitored sequentially, point-by-point. At each iteration i, with 0 ≤ i ≤ len(x) − 1, the system restricts its attention to the prefix y = x[0, i] of trace x, and it computes the set F of formulas leading to a violation (Algorithm 1, line 6). To this end, it executes the monitoring algorithm rtamt, which verifies each (non-suspended) formula in P \ S against the current trace y. Since all formulas are meant to encode bad behaviors, we say that a formula ψ leads to a violation if eb-mon(y, ψ) returns ⊤ (eb-mon is defined in Section III).
If at least one violation is detected, procedure UPDATEPOOLINFORMATION is executed (Algorithm 2) to detect and remove redundant or non-reliable formulas from the pool, the latter being formulas issuing several false positives. The procedure will be described in detail later.
Then, if x is an actual failure trace, P is updated (Algorithm 1, lines 10-17) as follows. Training data to be used for the extraction of a new formula are generated by the function generateTrainData (Algorithm 1, line 11). The latter perturbs the execution trace y by adding random Gaussian noise as a counter-overfitting measure, thus producing a set of augmented traces Y of size n_aug (a global parameter of the system). Next, function extractDiscrFormula (Algorithm 1, line 12) extracts a (bSTL) formula φ that discriminates between normal and failure (sub)traces obtained from those in Y, by exploiting the evolutionary algorithm as described in Section IV. Notice that φ may be null, an event that, according to the proposed definition of EA, occurs if none of the formulas in the final front has an accuracy greater than 0.5. If φ is not null, the initial false alarm rate (FAR) of φ is set to 0, and the formula is added to P (Algorithm 1, lines 13-16). At this point, since the trace x was recognized as a failure one by the framework, the execution on x is halted (Algorithm 1, line 17), and the framework is applied to the next trace in X.
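The augmentation step can be sketched as follows. This is a minimal illustration rather than the framework's implementation: the noise standard deviation `sigma`, the `seed` argument, and the representation of a trace as a list of feature vectors are assumptions; only the idea of producing n_aug Gaussian-perturbed copies of the prefix y comes from the description above.

```python
import random
from typing import List

Trace = List[List[float]]  # one feature vector per time point


def generate_train_data(y: Trace, n_aug: int,
                        sigma: float = 0.01, seed: int = 0) -> List[Trace]:
    """Return n_aug copies of y, each perturbed with zero-mean Gaussian noise."""
    if n_aug < 1:
        raise ValueError("n_aug must be positive")
    rng = random.Random(seed)  # seeded for reproducible experiments
    augmented: List[Trace] = []
    for _ in range(n_aug):
        # Perturb every feature of every time point independently.
        noisy = [[v + rng.gauss(0.0, sigma) for v in point] for point in y]
        augmented.append(noisy)
    return augmented


if __name__ == "__main__":
    y = [[45.0, 0.0], [46.1, 0.0], [47.3, 1.0]]
    Y = generate_train_data(y, n_aug=3, sigma=0.05)
    print(len(Y), len(Y[0]), len(Y[0][0]))
```

In practice the noise scale would likely have to be chosen per feature, since sensor readings may have very different ranges.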
On the contrary, if the framework detected a violation and trace x was not a failure one, all involved formulas f ∈ F are suspended, meaning that they are not going to be considered by the framework for its execution on the remainder of trace x (Algorithm 1, line 19). This prevents them from repeatedly triggering the extraction of other ill-behaved formulas. Note how formulas in F are not immediately removed from P, as such an approach would be too aggressive: their false positive detection might not be a generalized behavior, but something caused by random characteristics of trace x itself. As we will see, false positive detections are still considered by the procedure UPDATEPOOLINFORMATION for the maintenance of the pool P. Suspended formulas are reactivated when the next training trace is taken into account by the framework.
The iterative phase of the framework on trace x ends when either x is correctly recognized as a failure trace by a formula in P, or x has run out of points without any failure detection. In the latter case, if trace x was a failure one, we force the formula extraction process (Algorithm 1, lines 23-30). As the last operation of the framework (Algorithm 1, line 32), after being run on every training system trace, the obtained monitoring pool P is returned.
We would like to conclude this account of the operation of Algorithm 1 by observing that the warmup mode draws inspiration from the teacher forcing technique employed in deep learning [43]. Such an approach is used here to correct both false positive (Algorithm 1, line 9) and false negative (Algorithm 1, line 19 and line 23) framework predictions. As an example, as we already pointed out, the framework starts its execution with a possibly empty pool P of properties. Thus, in the most extreme case (P = ∅), it cannot identify any bad behavior of the system. In such a case, the failure is "detected" by observing the training label associated with the training execution trace, an event that forcibly triggers the pool update process. Intuitively, the whole scenario can be thought of as having an oracle assisting and instructing the framework. It is to be expected that, over time, the pool P becomes large enough to allow for the effective detection of bad behaviors of the system, progressively replacing the oracle in its role of correcting false positive and false negative predictions.
Let us focus now on the procedure UPDATEPOOLINFORMATION(P, F, x) (Algorithm 2). Operationally, for each formula φ ∈ F that leads to a violation, the corresponding FAR far_φ ∈ [0, 1] is updated. Formulas whose FAR crosses a given threshold far_thr (a global parameter of the framework) are removed from the monitoring pool (Algorithm 2, lines 3-4). As already pointed out, a FAR equal to 0 is associated with every formula when added to P. Then, the value of FAR is suitably updated according to the exponential moving average with smoothing constant α, which takes into account the "historical" FAR and the new FAR of the formula (Algorithm 2, line 2); the latter is considered to be 0 if the triggered formula actually anticipated a failure, 1 otherwise (false positive case). In the absence of detailed historical data, assigning an initial FAR equal to 0 to the formulas in the pool is a sensible choice, as they are either defined by a domain expert or generated by the evolutionary algorithm, which in turn optimizes accuracy and robustness measures.
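The FAR update of Algorithm 2 (lines 1-6) can be sketched in a few lines. The dictionary-based pool, the function name, and the single `anticipated_failure` flag shared by all triggered formulas are simplifying assumptions; the redundancy handling of line 7 is deliberately left out.

```python
from typing import Dict, Iterable, Set


def update_pool_information(far: Dict[str, float], triggered: Iterable[str],
                            anticipated_failure: bool,
                            alpha: float, far_thr: float) -> Set[str]:
    """Update the FAR of each triggered formula and drop unreliable ones.

    `far` maps each formula in the pool to its current FAR and is modified
    in place; the set of removed formulas is returned.
    """
    # New FAR observation: 0 for a correct anticipation, 1 for a false alarm.
    new_far = 0.0 if anticipated_failure else 1.0
    removed: Set[str] = set()
    for phi in triggered:
        # Exponential moving average: alpha weighs the historical value.
        far[phi] = (1.0 - alpha) * new_far + alpha * far[phi]
        if far[phi] > far_thr:
            removed.add(phi)
    for phi in removed:
        del far[phi]
    return removed


if __name__ == "__main__":
    pool = {"F[0,2] y < 5": 0.0}
    # Two consecutive false alarms with alpha = 0.5 and far_thr = 0.6.
    update_pool_information(pool, ["F[0,2] y < 5"], False, 0.5, 0.6)
    print(pool)
    print(update_pool_information(pool, ["F[0,2] y < 5"], False, 0.5, 0.6))
```

With α = 0.5, a formula starting from FAR 0 moves to 0.5 after one false alarm and to 0.75 after a second one, so a threshold of 0.6 tolerates a single isolated false positive but not two in a row.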
The reason for relying on FAR for the pool maintenance, instead of on other "symmetric" performance metrics such as the F1-score, is twofold. First, formulas providing several false detections may cause a degradation of the monitoring pool, where other ill-founded formulas are added as a result of their triggering. Thus, they should be avoided at all costs. On the contrary, formulas leading to false negatives do not bring any adverse effect on the monitoring pool, except for increasing its size. The second reason pertains to the very nature of F1-score and similar metrics. To calculate it, it is necessary to establish when a formula experiences both false positives and false negatives. False positives, that is, false detections of bad behaviors, can be easily recognized: if a formula triggers and the forecasted bad event does not occur, that can be unequivocally considered as a false positive. The detection of false negatives is, instead, more subtle, and not well-defined. Indeed, it is perfectly admissible for the system to encounter a total failure not anticipated by any formula in the pool, since they may correctly model completely different failure scenarios. In that case, formulas should not be penalized for the missed detection.

Algorithm 3 Framework Execution (runtime phase)
input: initial non-empty pool P of formulas, non-empty set G of good behavior training traces, incoming system trace x
1: while true do
2:     F ← {ψ ∈ P | eb-mon(x, ψ) returns ⊤}
3:     if F ≠ ∅ then
4:         handleRedundancy(P)
5:         X ← generateTrainData(x)
6:         φ ← extractDiscrFormula(X)
7:         if φ ≠ null then
8:             far_φ ← FAR of φ computed over the good traces in G
9:             if far_φ ≤ far_thr then
10:                 P ← P ∪ {φ}
11:             end if
12:         end if
13:     end if
14: end while

Finally, procedure handleRedundancy(P) (Algorithm 2, line 7) removes redundant formulas, i.e., it detects groups of formulas with similar behavior and keeps a single representative for the entire group (the formula with the lowest FAR or the newest one, if the FAR is the same). To detect the similarity of two formulas, we rely on the Jaccard/Tanimoto test [44] that compares the histories of failures flagged by the formulas along the framework execution.
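A possible reading of the redundancy check is sketched below. The similarity threshold `sim_thr`, the representation of a formula's history as the set of trace identifiers on which it flagged a failure, and the greedy grouping strategy are all assumptions made for illustration; only the use of the Jaccard index and the rule for choosing the representative (lowest FAR, newest on ties) come from the description above.

```python
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index of two histories of flagged failures."""
    union = a | b
    if not union:
        return 0.0  # two empty histories give no evidence of similarity
    return len(a & b) / len(union)


def handle_redundancy(pool: List[str], history: Dict[str, Set[int]],
                      far: Dict[str, float], sim_thr: float = 0.9) -> List[str]:
    """Keep one representative per group of similarly behaving formulas.

    `pool` is ordered from the oldest to the newest formula.
    """
    # Visit the preferred formulas first: lowest FAR, then newest.
    order = sorted(range(len(pool)), key=lambda i: (far[pool[i]], -i))
    kept: List[int] = []
    for i in order:
        similar = any(jaccard(history[pool[i]], history[pool[j]]) >= sim_thr
                      for j in kept)
        if not similar:
            kept.append(i)
    # Preserve the original insertion order of the survivors.
    return [pool[i] for i in sorted(kept)]


if __name__ == "__main__":
    pool = ["f_old", "f_new", "g"]
    history = {"f_old": {1, 2, 3}, "f_new": {1, 2, 3}, "g": {7}}
    far = {"f_old": 0.1, "f_new": 0.1, "g": 0.0}
    print(handle_redundancy(pool, history, far))
```

In the usage example, `f_old` and `f_new` flag exactly the same failures and have the same FAR, so only the newer one survives, while `g` is kept because its history is disjoint from theirs.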
As a last remark, note how procedure UPDATEPOOLINFORMATION, in the way it is used by Algorithm 1, allows us to continuously update the monitoring pool P as the training instances are processed, ensuring its quality.

B. RUNTIME EXECUTION PHASE
Let us now concentrate on the runtime phase of the framework, which is implemented by Algorithm 3. Here, the framework is used to continuously monitor an incoming trace x, generated by a system during its execution. Other than the trace, the procedure takes as input a pool P of properties, which can be assumed to be non-empty, either because it is returned by Algorithm 1 or because it has been filled by hand by domain experts. In addition, it takes into consideration a non-empty set G of past good execution traces of the system. The latter can be either extracted from the training warmup data, if available, or derived directly from the execution history of the system, restricting to those portions that are sufficiently far from failure events, following the suggestions of domain experts.
Algorithm 3 behaves as follows. At each time step, the set F of formulas leading to a violation is computed (Algorithm 3, line 2) by executing the monitoring algorithm rtamt, which checks each formula in P against the incoming system trace x. If at least one formula is triggered, P is updated (Algorithm 3, lines 3-13). First, procedure handleRedundancy is called, to identify and remove from the pool P possible redundant formulas, exactly in the same way as in the warmup phase. Next, training data to be used for the extraction of a new formula are generated by the function generateTrainData(x). Finally, function extractDiscrFormula(X) extracts a (bSTL) formula φ that discriminates between normal and failure (sub)traces by using the EA. If the formula φ generated by the EA is not null, its FAR is computed with respect to the reference set of good traces G and, if such a value is less than or equal to the threshold far_thr, the formula is added to the pool P.
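The admission test applied to a newly extracted formula can be sketched as follows. The sketch assumes that the FAR over G is the fraction of good traces on which the formula triggers at least once, and it abstracts the monitor as a plain Python callable; both choices, as well as the function names, are hypothetical and stand in for the actual rtamt-based evaluation.

```python
from typing import Callable, List, Sequence

Trace = Sequence[float]
# A monitor returns True if its formula triggers somewhere on the trace.
Monitor = Callable[[Trace], bool]


def far_on_good_traces(monitor: Monitor, good_traces: Sequence[Trace]) -> float:
    """Fraction of good traces on which the formula raises a (false) alarm."""
    if not good_traces:
        raise ValueError("the reference set G must be non-empty")
    false_alarms = sum(1 for g in good_traces if monitor(g))
    return false_alarms / len(good_traces)


def maybe_add(pool: List[Monitor], candidate: Monitor,
              good_traces: Sequence[Trace], far_thr: float) -> bool:
    """Add `candidate` to the pool only if its FAR on G is at most far_thr."""
    if far_on_good_traces(candidate, good_traces) <= far_thr:
        pool.append(candidate)
        return True
    return False


if __name__ == "__main__":
    G = [[5.0, 6.0], [4.0, 2.0], [7.0, 8.0], [9.0, 9.0]]

    def below_three(trace: Trace) -> bool:
        return any(v < 3 for v in trace)

    pool: List[Monitor] = []
    print(far_on_good_traces(below_three, G), maybe_add(pool, below_three, G, 0.3))
```

Since G is fixed, this value is computed once per candidate formula, in contrast with the moving average maintained during the warmup phase.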
As can be noticed, the main differences between the warmup and runtime phases are that, in the latter, there is no teacher forcing, and thus the entire failure detection task is carried out by means of monitoring; moreover, the FAR of a formula is established only once by considering all traces in the reference set G, since the latter is fixed.
As a final remark, note that Algorithm 3 (runtime) can, in principle, be run independently from Algorithm 1 (warmup), if there is at least one property in P and a set of good traces G of the considered system is available. This allows one to use the framework in a runtime setting even in the absence of supervised training data, as long as at least one failure property has been provided by domain experts and some (portions of) unlabeled past execution traces of the system, that express good/normal behaviors, are accessible.
An intuitive account of the operation of the framework is depicted in Figure 2. The framework is first attached to a trace x generated by the system for its runtime monitoring, with a pool P containing just the formula φ = F[7,9] y < 3 (left picture). Function eb-mon(x, φ) is run against the incoming trace and, specifically, b-mon(x[0, 9], φ) identifies a failure occurring at time point 7. This leads to the extraction of a new formula. To this end, trace x[0, 7] is augmented, and then the EA is run on the set of resulting traces. In the middle picture, just for illustrative purposes, an exemplary splitting of the augmented traces based on a window length w = 4 is reported. Each trace is partitioned into a good behavior prefix and a failure suffix. For formula evaluation purposes, in the EA each subtrace is considered to start from index 0. As a result, the formula ψ = F[0,2] y < 5, which is able to distinguish between the augmented prefixes and suffixes, is generated and added to the pool P. Finally, a subsequent part of the operation of the framework is described (right picture). Here, the recently discovered formula ψ identifies a failure occurring at time point 53 (b-mon(x[51, 53], ψ) = ⊤). Without such a formula, φ would have detected a violation only with respect to time point 58.
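The splitting step of this example can be mimicked with a small sketch. The signal values are invented for illustration, and the helper names, the choice of taking the last w points as the failure suffix, and the use of `None` for the undefined verdict ? are assumptions; the formula ψ = F[0,2] y < 5 and the window length w = 4 are those of the example.

```python
from typing import List, Optional, Sequence, Tuple


def split_trace(trace: Sequence[float], w: int) -> Tuple[List[float], List[float]]:
    """Partition a trace into a good prefix and a failure suffix of length w."""
    if not 0 < w < len(trace):
        raise ValueError("w must be positive and smaller than the trace length")
    return list(trace[:-w]), list(trace[-w:])


def eventually_below(subtrace: Sequence[float], a: int, b: int,
                     threshold: float) -> Optional[bool]:
    """Verdict of F[a,b] (y < threshold) at index 0 of `subtrace`.

    Returns None (the ? verdict) when fewer than b + 1 points are available.
    """
    if len(subtrace) < b + 1:
        return None
    return any(v < threshold for v in subtrace[a:b + 1])


if __name__ == "__main__":
    # Hypothetical values for x[0, 7]; each subtrace is re-indexed from 0.
    x = [9.0, 8.0, 9.0, 7.0, 6.0, 4.0, 3.0, 2.0]
    prefix, suffix = split_trace(x, w=4)
    print(eventually_below(prefix, 0, 2, 5.0))  # good behavior prefix
    print(eventually_below(suffix, 0, 2, 5.0))  # failure suffix
```

On these values ψ is false on the prefix and true on the suffix, which is the kind of separation the EA looks for across the whole set of augmented traces.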
For the sake of convenience, all the global parameters of the framework are listed in Table 1, with an intuitive account and a short description of their expected behavior.

VI. EXPERIMENTAL EVALUATION
In this section, we give a detailed account of the experimental evaluation of the framework on three public datasets. In addition, we make a comparison with previous results from the literature. First, we introduce the datasets; then, we describe the experimental workflow; finally, we present the obtained results. We pay particular attention to interpretability issues.

A. DATASETS
We considered the datasets Backblaze Hard Drive, Tennessee Eastman Process, and NASA C-MAPSS. The Backblaze Hard Drive dataset (also referred to as SMART dataset hereafter) contains continuously updated information on the "health" status of hard drives in the Backblaze data center. Here, we focus on Self-Monitoring, Analysis and Reporting Technology (SMART) attributes of the ST4000DM000 hard drive model recorded daily from 2015 to 2017. Each trace is described by the following features: the date of the report, the serial number of the drive, a label indicating a drive failure, and 21 SMART parameters with both discrete and real values. To compare the framework with the literature, two training/test set splits are considered:
The NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset includes run-to-failure simulated data of turbofan jet engines. Specifically, in the considered dataset FD001, engines are simulated according to a single operating condition (called Sea level) and their failures are attributable to one possible cause (HPC degradation). Each engine simulation is represented by a multivariate time series obtained from 21 engine sensors. Although each trace represents the simulation of a different engine, the data can be considered to be from a fleet of engines of the same type. Data are sampled at one value per second, and the trace length distributions are depicted in Figure 3. The dataset includes 100 training traces, each ending with a failure, and 100 test traces, each ending an arbitrary and known number of time steps before the failure (gap). In order to compare our framework with the literature [45], failure traces are generated by considering the 30% suffix of each engine observation as faulty, and the remaining 70% prefix as normal behavior. This leads to 200 training traces. On the other hand, 43 failure and 100 normal behavior traces are generated for the test set. The reason is that, given a test set trace, the 30% suffix is computed over the trace length including the gap and, thus, it may turn out to be empty.

B. EXPERIMENT SETUP
For each dataset, we performed the initial warmup phase by running Algorithm 1 on a sample of training execution traces related to both malfunctions (failure traces) and good executions. The traces were considered by the framework one after the other, according to a random ordering.
Once the warmup phase ended, the framework was evaluated on test set traces (Algorithm 3) in two modes: the online mode, where the framework continues to learn new properties from the execution traces, and the offline mode, where the properties in the monitoring pool are not updated, so that only the properties learnt in the warmup phase are taken into account when predicting failures on test set traces. This latter mode was useful to compare the proposed solution with those from the literature, while the former let us determine how the values of the considered metrics evolved over time. The two test set evaluation modes were carried out on a random ordering of both good and failure traces.
The performance of the two test phases was evaluated in terms of precision (P), recall (R), FAR, and F1-score (F1). Let TP be the number of true positives, that is, bad behaviors identified as such, FP be the number of false positives, TN be the number of true negatives, and FN be the number of false negatives. The metrics are defined as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad FAR = \frac{FP}{FP + TN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}.$$
All experiments were run 10 times varying the random seed governing the order in which execution traces are presented to the framework during the warmup and the online test phases, so as to collect statistical data regarding the considered metrics.
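For reference, the four metrics can be computed from the confusion counts as in the sketch below. It assumes the conventional definitions, in particular FAR taken as the false positive rate FP / (FP + TN); the function name and the convention of returning 0 when a denominator vanishes are choices made here for illustration.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, FAR and F1-score from the confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # FAR as the fraction of normal traces erroneously flagged as failures.
    far = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"P": precision, "R": recall, "FAR": far, "F1": f1}


if __name__ == "__main__":
    print(detection_metrics(tp=40, fp=10, tn=90, fn=10))
```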

C. RESULTS
To begin with, the global parameters of the framework, chosen through grid search optimization on training set data, are shown in Table 1. In the remainder of the section, we will assess the framework performance in several respects. First, we will present the results of the offline and the online evaluation, as described in Section VI-B. Then, we will focus on the impact of teacher forcing and formula max horizon.

1) OFFLINE EVALUATION
As for the offline evaluation mode, we compared the proposed solution with other state-of-the-art approaches to failure detection on the three previously described datasets. Given the continuously updating nature of the Backblaze dataset, we focused our analysis on two studies that take into account the specific versions we consider, namely, those reported in [46] and [47]. The first one [46] makes use of a feed-forward neural network model on split S1, while the second one [47] evaluates a Long Short-Term Memory (LSTM) recurrent neural network on split S2. In addition, we took into account a third proposal [9], that is, a model obtained by combining a convolutional neural network (CNN) and an LSTM recurrent neural network, applying it to both S1 and S2 splits, following the setup outlined by the authors for the SMART features group. As for the case of fault detection on the TEP dataset, we considered an approach based on image processing techniques along with feed-forward and radial basis function neural networks [48], and a solution based on a nonlinear support vector machine [49]. Finally, as for the C-MAPSS case study, we compared our framework with a solution based on a CNN model presented in [45].
Results achieved by the above solutions are reported in Table 2, together with those of the proposed framework (label our). Our solution exhibits an average performance on par with the considered state-of-the-art ones. This is even more relevant if we also bear in mind that all the contenders provide no explanation for the predicted failures. On the contrary, a distinguishing feature of the proposed approach, compared to previous ones, is that it is interpretable: it relies on the extraction of properties expressed as temporal logic formulas, that provide an understandable explanation of the undesired behaviors of the system. Moreover, they can be subsequently exploited for tasks such as root cause analysis, diagnosis, and preemptive failure detection.

2) ONLINE EVALUATION
As for the online evaluation mode, results for the SMART, TEP, and C-MAPSS datasets are shown in Figure 4. Note that, as the number of traces seen by the framework increases, a slight but consistent improvement of the metrics occurs. This is not obvious, since in such a case maintaining a good performance requires the ability to discover new properties able to reflect the evolution of the behavior of the monitored system over time. Figure 5 illustrates, for each considered dataset, the average number of formulas in the monitoring pool at each warmup and runtime iteration. Note that, at certain iterations, there is a decrease in the pool size. This happens when formulas are removed because they are redundant.
Let us now consider some examples of the bSTL formulas used within the framework. An example for the Backblaze dataset is formula (G[0,2] SMART_194 > 45.6) ∧ (F[2,3] SMART_198 > 0.32). Such a formula makes evident a bad behavior where the hard drive maintains a temperature exceeding 45.6 °C in the first 3 days, and then, in the following 2 days, its uncorrectable sector count becomes greater than 0. As another example of framework execution, consider the formula f_1 = F[0,19] SMART_198 > 2.59, extracted (and added to the monitoring pool) during an iteration of the framework. According to the definitions of the SMART attributes, sensor SMART_198 is a critical one and f_1 expresses the fact that the threshold 2.59 of sector read/write errors is exceeded. During a later iteration of the warmup phase, a failure prediction is issued thanks to the triggering of f_1. As a consequence, f_2 = F[1,16] SMART_189 > 8.28 is extracted, meaning that a certain number (8.28) of unsafe fly height conditions is reached before the critical number of sector read/write errors is exceeded. This pattern is quite reasonable, as it describes a case in which the disk head is operating at an unsafe height, ultimately damaging a disk sector and consequently causing read and write errors. Notice that the framework allows us to predict a failure based on sensor SMART_189, which is not considered to be critical in the SMART specification, by uncovering a pattern linking it to the critical sensor SMART_198.
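To show how such a formula reads operationally, the sketch below evaluates the first Backblaze formula, (G[0,2] SMART_194 > 45.6) ∧ (F[2,3] SMART_198 > 0.32), at index 0 of a short daily trace. The trace values are invented, and the plain-Python evaluation with the hypothetical helpers `globally` and `eventually` only illustrates the bounded semantics; it does not reproduce the rtamt-based monitor.

```python
from typing import Callable, Dict, Sequence

Predicate = Callable[[float], bool]


def globally(series: Sequence[float], a: int, b: int, pred: Predicate) -> bool:
    """G[a,b] pred: the predicate holds at every index in [a, b]."""
    return all(pred(v) for v in series[a:b + 1])


def eventually(series: Sequence[float], a: int, b: int, pred: Predicate) -> bool:
    """F[a,b] pred: the predicate holds at some index in [a, b]."""
    return any(pred(v) for v in series[a:b + 1])


def backblaze_formula(trace: Dict[str, Sequence[float]]) -> bool:
    """(G[0,2] SMART_194 > 45.6) AND (F[2,3] SMART_198 > 0.32) at index 0."""
    if min(len(trace["SMART_194"]), len(trace["SMART_198"])) < 4:
        raise ValueError("at least 4 daily samples are needed (horizon 3)")
    hot = globally(trace["SMART_194"], 0, 2, lambda v: v > 45.6)
    errors = eventually(trace["SMART_198"], 2, 3, lambda v: v > 0.32)
    return hot and errors


if __name__ == "__main__":
    # Hypothetical four-day window of a drive.
    window = {"SMART_194": [46.0, 47.2, 46.5, 44.0],
              "SMART_198": [0.0, 0.0, 0.0, 1.0]}
    print(backblaze_formula(window))
```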
Turning to the TEP dataset, an extracted formula is (G[1,4] XMEAS_21 > 94.6) ∧ (G[2,4] XMEAS_20 > 341). It reveals a bad behavior where the compressor is operating with a power greater than 341 kW, while the temperature of the reactor of the plant exceeds 94.6 °C.
As for the C-MAPSS dataset, the formula (SENSOR_10 < 1.3) ∧ (F[4,6] SENSOR_11 < 47.62) was generated, which signals a bad behavior where a loss of pressure in the high-pressure compressor outlet follows a loss of pressure in the engine. Notice that the subformula SENSOR_10 < 1.3 does not contain any time operator, meaning that it is evaluated at the currently observed time point.
Figure 6 reports the number of teacher forcing interventions during the warmup phase, as more and more failure traces are encountered. Specifically, for each amount of encountered failure traces, the sum over multiple (10) framework executions is reported. As expected, teacher forcing triggers mainly at the beginning of the warmup phase, when the monitoring pool is empty. As formulas are learned over time, teacher forcing interventions decrease until a stationary behavior is reached. Of course, the latter depends on the specific dataset, and it confirms what was to be expected from the performance reported in Table 2. As an example, on the TEP dataset, where an F1-score of 1.0 is achieved, the number of teacher forcing interventions rapidly approaches 0.
Figure 7 reports, for a single execution of the framework, the offline mode performance on each dataset, as obtained by varying the max horizon value. Once more, different datasets exhibit different behaviors. Although results might appear rather counterintuitive (setting a large max horizon does not prevent, in principle, the discovery of formulas with shorter horizons), following a preliminary analysis, they are likely due to an overfitting effect. Indeed, formulas with a larger horizon have the capability of capturing more detailed and extended phenomena, which could be highly trace-dependent. Moreover, we would like to recall that the concept of horizon does not apply to the framework in general, but is tied to the particular kind of logic and monitoring tool employed here.

VII. STRENGTHS AND LIMITATIONS
The proposed framework relies on approaches originating from the two fields of machine learning and formal methods, combining their strengths in an effective way. More precisely, the former domain provided us with tools and techniques for the extraction of properties from temporal data, while the latter allowed us to formalize such properties by means of logic formulas and to monitor a given system online against them in a principled manner. The key feature of the proposed approach is its interpretability: as shown in Section VI-C, by means of the extracted logical formulas, the framework gives an understandable account of settings leading to future failures, allowing domain experts to take appropriate action and enriching their overall knowledge. While contributions from the literature show that interpretability is often achieved at the expense of prediction accuracy, e.g., by relying on a simple white-box model instead of a more complex black-box one, quantitative results showed that the performance of the proposed approach is on par with previous, non-interpretable solutions.
As a final note, while in this work we applied the proposed framework to the domain of failure detection, similar ideas can in principle be employed to detect and predict any type of event or anomaly, whether positive or negative in nature. Among the former, we mention a spike in the history of sales of a retail store, or a generalized increase in the grade point average of students enrolled in the latest edition of a course; among the latter, the detection of seizures in hospitalized patients based on their continuously recorded vital signs, or the identification of violations of a service level agreement in the context of a contract between a service provider and a customer.
We would like to conclude this section by pointing out some limitations of the framework. First, in the considered datasets, all traces come from the same plant (resp., hard disk, jet engine model) operating under the same conditions. To deal with more than one type of system, separate monitoring pools have to be employed to prevent conflicts among formulas. Second, the considered datasets only deal with numerical data. It is worth evaluating the proposed approach on datasets encompassing categorical data, naturally leading to the usage of other logics, like, for instance, LTL. Third, Algorithm 3 operates in a sequential fashion: (i) the system is monitored until a formula is satisfied by the incoming data; (ii) such an event triggers the property extraction phase, which results in the addition of a new formula to the pool; (iii) then, the monitoring of the system resumes. Although this behavior is perfectly acceptable for a prototype implementation applied to benchmark datasets, such as the one described here, a multithreaded version, able to update the property pool asynchronously while monitoring the system, remains to be developed. Finally, Algorithm 3 makes use of a fixed reference set of good behavior traces to prevent formulas with a high FAR from being added to the monitoring pool. This is a reasonable approach, but it does not take into account changes in the behavior of the monitored system, which may happen due to, for instance, updates, upgrades, or degradation phenomena. To overcome this limitation, we may think of extracting new normal behavior traces from runtime data and adding them to the reference set.

VIII. CONCLUSION AND FUTURE WORK
In this paper, we proposed a novel general framework for runtime system verification that combines monitoring with machine learning, to be used for early failure detection over streams of data. Experimental results showed that it is able to issue failure warnings in an anticipatory and effective manner and to incrementally learn new specifications to be monitored against the considered system.
As for future work, we would like to underline the following directions: (i) the application of the framework to other datasets; (ii) user tests to assess the quality of interpretability; (iii) the experimentation with other logics, including an extension of STL dealing with categorical data [50]; and (iv) the development of a multithreaded version of the framework, able to asynchronously deal with the update of the property pool while monitoring the system.