Rule-based Out-Of-Distribution Detection

Out-of-distribution detection is one of the most critical issues in the deployment of machine learning. The data analyst must ensure that data in operation comply with the conditions seen during training, and must understand whether the environment has changed in a way that makes autonomous decisions no longer safe. The method of the paper is based on eXplainable Artificial Intelligence (XAI); it takes into account different metrics to identify the resemblance between in-distribution and out-of-distribution data, as seen by the XAI model. The approach is non-parametric and free of distributional assumptions. Validation over complex scenarios (predictive maintenance, vehicle platooning, covert channels in cybersecurity) corroborates both precision in detection and the evaluation of the proximity between training and operational conditions. Results are available as open source and open data at the following link: https://github.com/giacomo97cnr/Rule-based-ODD.

Impact Statement-Many sectors now address safe AI: automotive (SOTIF), avionics (SAE G-34/EUROCAE WG-114), ISO/IEC (JTC 1/SC 42) and healthcare. Safe AI means understanding under which conditions autonomous actuations may lead to hazards. The impact of this research is to make AI aware of those conditions, so that it operates without detrimental effects on humans or the environment. Examples include the prevention of dangerous manoeuvres by autonomous cars, inaccurate clinical diagnoses by artificial doctors, and wrong decision making in cyberwarfare, as well as in many other sectors (energy, finance). The theoretical analysis is complemented by computational and incremental groupwise analysis in order to increase the readiness level of the proposed approach.
Index Terms-Out-of-distribution detection, eXplainable AI, mutual information, open data.

[Fig. 1 caption, from [2]: from the central green bar to the side yellow/orange/red bars, the nominal domain shifts and the severity of the OoD increases in parallel.]

II. INTRODUCTION
The problem of out-of-distribution (OoD) detection (ODD) deals with comparing the working conditions of a machine learning model with those considered during the training process. The comparison is performed at the operational level to understand whether the new data belong to a probability distribution different from the one driving the data collection of the training phase. In case of divergence between training and operation, the system must generate an alarm, because the performance of the model may no longer conform to what was measured at the training stage (even when generalization tests were successfully passed). The problem represents a very important challenge for the secure application of machine learning. The recent standards in avionics [2], [3], automotive [4], [5] and ISO/IEC, as well as other regulatory initiatives in medical informatics [6], pose the problem of identifying all those operating conditions that can have an impact on safety.
The EASA figure (Fig. 1) shows the different levels of severity of the OoD on operational data. The green bar denotes compliance with training data (in-distribution); the yellow color reflects an OoD zone where the autonomous function still produces accurate indications; orange indicates an OoD area where the autonomous function is fallacious, but the system does not degenerate into dangerous conditions (the surrounding conditions of the environment are still compatible with safe actuations); red signals that the system may fall into dangerous conditions (if driven by the autonomous function). The tests of autonomous safety-critical actuations should include all the conditions in the mentioned color gradations, at least through simulation analysis. Although the literature in the field of OoD already offers solutions based on labelled data or on anomaly detection, as evidenced by [7], [8], ODD according to distributional-assumption-free and OoD-agnostic criteria is still an open problem.

A. Contribution
The proposed method is designed under these criteria, with the added advantage of avoiding any parameter tuning. It is based on the evaluation of the histogram generated by the frequency with which the data themselves validate a rule-based model. The histogram generated during the training phase represents a fingerprint to be verified at runtime. If the data at runtime generate a histogram "significantly different" from the training one, the data are OoD. Unlike K-NN [8] and neural network distances [9], where a single distance criterion is defined, the similarity measure can here be derived through multiple metrics. This offers support to the tests mentioned by EASA, since the proposed method measures incremental cases of departure from in-distribution.

III. RELATED WORK
ODD has become an important theme in the ML field, since failing to recognise whether unseen data are "similar" (in-distribution) or not (out-of-distribution) to the data the ML system has been trained on may lead to potentially fatal consequences; indeed, a system should not only correctly classify what is known, but also, and most importantly, recognise what is not known, so that action can be taken.
Most of the solutions proposed to address the OoD problem make strong distributional assumptions on the feature space [10] or suppose that the in- and out-of-distribution probability density functions (pdfs) are given at training time [11], but this does not always hold in practice. What is more, many statistical tests fail to estimate the real distribution of training data (the data are not enough and the pdfs are too coarse) [8]. Some methods widely used for OoD detection on images are the Maximum Softmax Probability [12], ODIN [13] and Energy-based OoD Detection [14], while others use outlier detection methods such as Isolation Forest [15]. Label shift in deep learning is also considered in [16].
Under the distributional-assumption-free hypothesis, ODD still needs the right tuning of some parameters [8]. Our solution maintains the former property without relying on any critical parameter setting.

A. Rule hits histograms
The reference ML model is based on XAI obtained using the Logic Learning Machine (LLM), a global, transparent-by-design model developed by Rulex as a computational improvement of Switching Neural Networks [17]. The LLM builds a classifier described by rules in the IF-THEN format [18]. The approach can however be applied to any rule-based model derived, for example, from decision trees or coordinated groups of decision trees, such as random forests or Skope-Rules [19].
Let us denote with R_tr a set of rules generated from a training set and let N_r be the number of rules composing it. Let N_tr and N_op be the numbers of splits of the training domain and the operational one, respectively, and let N_h = N_tr + N_op be the total number of splits. Let n_s be the number of data samples present in a split. For each split, samples may (or may not) satisfy each rule a certain number of times; we refer to this number as the number of hits for that rule. Therefore, denoting with c_{i,j} the number of hits of rule i on split j, we define N_h vectors scaled by the split size n_s:

h_j = (c_{1,j}, ..., c_{N_r,j}) / n_s,  j = 1, ..., N_h   (1)

Each vector h_j can be thought of as a histogram.
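The hits histogram of Eq. 1 can be sketched in a few lines. The rule representation below (a conjunction of (feature, operator, threshold) conditions) is an assumption for illustration, not the LLM's actual API:

```python
# Minimal sketch of Eq. 1, assuming a rule is a list of
# (feature, operator, threshold) conditions; a sample "hits" a rule
# when it satisfies every condition.
import operator

OPS = {"<=": operator.le, ">": operator.gt, "==": operator.eq}

def rule_hits(rule, sample):
    """Return True if the sample satisfies every condition of the rule."""
    return all(OPS[op](sample[feat], thr) for feat, op, thr in rule)

def hits_histogram(ruleset, split):
    """Per-rule hit counts on a split, scaled by the split size n_s."""
    n_s = len(split)
    return [sum(rule_hits(r, x) for x in split) / n_s for r in ruleset]

# Toy ruleset with two rules on features "a" and "b".
ruleset = [
    [("a", "<=", 0.5)],                    # rule 1
    [("a", ">", 0.5), ("b", "<=", 1.0)],   # rule 2
]
split = [{"a": 0.2, "b": 0.3}, {"a": 0.7, "b": 0.4}, {"a": 0.9, "b": 2.0}]
h = hits_histogram(ruleset, split)
```

In this toy case each rule is hit by exactly one of the three samples, so the histogram is [1/3, 1/3].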

B. Data splits
At the training stage, we exploit N_tr splits of the dataset TR = {tr_1, ..., tr_{N_tr}}. These splits become the baseline for building the in-distribution histograms, as per Eq. 1, representing the numbers of hits obtained by testing the rules in R_tr on each considered split. Two different algorithms are then studied, depending on the data organization in operation: N_op = 1 when only one split is available and N_op > 1 when more than one split is available. The first case suits data scarcity in operation and simplifies the calculations.
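The baseline construction can be sketched as follows. The shuffle-then-chunk splitting policy is an illustrative assumption (the paper does not prescribe one), and rules are modelled as plain predicates:

```python
# Sketch of the training baseline: N_tr splits, one hits histogram each.
# Splitting policy (shuffle + equal chunks) is an assumption.
import random

def make_splits(dataset, n_tr, seed=0):
    """Partition a dataset into n_tr disjoint, equal-size splits."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    size = len(data) // n_tr
    return [data[i * size:(i + 1) * size] for i in range(n_tr)]

def baseline_histograms(rules, splits):
    """One hits histogram per split: fraction of samples satisfying each rule."""
    return [[sum(r(x) for x in split) / len(split) for r in rules]
            for split in splits]

# Toy example: 100 scalar samples, two predicate rules.
data = [i / 100 for i in range(100)]
splits = make_splits(data, n_tr=4)
rules = [lambda x: x <= 0.5, lambda x: x > 0.8]
base = baseline_histograms(rules, splits)   # 4 histograms of length 2
```

Each of the four histograms is an in-distribution fingerprint; the collection of them spans the baseline ranges used later by the metrics.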

C. Adopted Metrics
The metrics driving ODD are as follows:
• Weighted mutual information WμI, used when only one operational split is available (N_op = 1), as described in Sec. IV-D
• Rule-based information RBI, used when the operational data are sufficient to perform multiple splits (N_op > 1), as per Sec. IV-E
• l_p norm, with p = 1, 2; its computation is performed in the same way for both scenarios (N_op ≥ 1)
For all metrics, the idea is to compare values computed in operation with the ranges achieved in training (baseline). An ODD occurs whenever the value in operation falls outside the baseline. The order of the hits with respect to the rules drives the modification of canonical mutual information into WμI and RBI, as explained in Appendix A.
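The l_p norms and the baseline-range decision rule can be sketched directly. The pairwise construction of the training baseline and the min-distance comparison in operation are illustrative assumptions:

```python
# Sketch of the l_p metric and the range-based decision rule:
# an ODD is flagged whenever the operational value exits the
# range spanned by the training baseline.
def lp_norm(h1, h2, p):
    """l_p distance between two hits histograms."""
    return sum(abs(a - b) ** p for a, b in zip(h1, h2)) ** (1 / p)

def in_baseline(value, baseline_values, tol=0.0):
    """True when the value falls inside the training baseline range."""
    lo, hi = min(baseline_values), max(baseline_values)
    return lo - tol <= value <= hi + tol

# Pairwise l1 distances among training histograms give the baseline range
# (an assumed construction); an operational histogram is then compared
# against every training one.
train = [[0.30, 0.60], [0.32, 0.58], [0.29, 0.61]]
op = [0.80, 0.10]
base_l1 = [lp_norm(a, b, 1) for i, a in enumerate(train)
           for b in train[i + 1:]]
op_l1 = min(lp_norm(op, h, 1) for h in train)
is_ood = not in_baseline(op_l1, base_l1)
```

Here the training distances span roughly [0.02, 0.06] while the operational histogram sits about 0.96 away, so the split is flagged as OoD.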
D. First scenario: N_op = 1
1) Training setting: The first scenario deals with a single operational split op_1. We first present the procedure for the training domain and then for the operational one. As per Eq. 1, the training matrix-like structure shown in Table I arises.
Based on that table, weighted mutual information and norms are computed as described in Algorithm 1, whose first steps are: (1a) define the weight α_{i,j} ∈ (0, 1) associated with tr_i and tr_j, ∀i, ∀j; (1b) compute the weighted entropies H(tr_i), H(tr_j), H(tr_i, tr_j), ∀i, ∀j.
2) Operational setting: We now present the procedure when an operational set is considered. As per Eq. 1, we can build the training-operational matrix as in Table II.
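Step 1b can be illustrated with a minimal entropy sketch. The exact definition of the pairwise weight α of step 1a is given in the paper's Algorithm 1; here α is just a free parameter of the sketch:

```python
# Minimal sketch of a weighted entropy over a hits histogram.
# The weight alpha stands in for the paper's alpha_{i,j} (step 1a),
# whose exact definition is in Algorithm 1; here it is a free parameter.
import math

def weighted_entropy(h, alpha=1.0):
    """Shannon entropy of a normalized hits histogram, scaled by alpha."""
    s = sum(h)
    p = [x / s for x in h if x > 0]
    return -alpha * sum(q * math.log2(q) for q in p)

h = [0.5, 0.25, 0.25]
ent = weighted_entropy(h)   # 1.5 bits for this histogram
```

The weighted entropies of the split pairs are then combined into WμI; the weighting is what inverts the canonical mutual-information trend, as discussed in Appendix A.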
Weighted mutual information and norms are then computed as described in Algorithm 2.
E. Second scenario: N_op > 1
Multiple operational splits drive the computation of the baseline according to the rule-based information (RBI) (Algorithm 3, with Table I as a reference). Algorithm 4 defines the inherent ODD by taking Table III as a reference. The estimation of the Gaussian distributions follows the maximum likelihood principle (see 2.5.1 of [21]). Building N_r separate Gaussian distributions (one for each row of the table) tackles the curse of dimensionality in parameter estimation, in place of building a single, multidimensional Gaussian distribution for the entire table (see, e.g., 2.5.7 of [21]). We remark that this methodology still complies with the distributional-assumption-free setting (as stated in Sec. II) because the Gaussian estimation concerns the rule hits and not the data; it is just a way to capture hits variations analytically, even if they do not follow a Gaussian behaviour.
[Algorithm 3 (Rule-based Information at Training Stage), summarized: ∀j and ∀tr_i, compute the probabilities and the entropy; compute the weighted conditional entropy ∀tr_i; compute the average entropies; measure the similarity between TR2m and TR1 through RBI_{TR1−TR2m} ∈ [0, 1]; construct the baseline range RBI_base; compute the norm baselines lp_base as done in Algorithm 1.]
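The per-rule Gaussian estimation described above can be sketched as follows. The layout (rows are rules, columns are training splits) follows the text, while the 3σ decision rule is an illustrative choice, not the paper's exact test:

```python
# Sketch of the per-row maximum-likelihood Gaussian fit: one univariate
# Gaussian per rule, instead of one multivariate Gaussian for the table.
# The 3-sigma outlier rule below is an illustrative assumption.
import math

def fit_row_gaussians(table):
    """ML fit (sample mean, biased variance) of one Gaussian per row."""
    params = []
    for row in table:
        n = len(row)
        mu = sum(row) / n
        var = sum((x - mu) ** 2 for x in row) / n
        params.append((mu, var))
    return params

def row_outlier_flags(params, op_row_values, k=3.0):
    """Flag an operational value when it falls outside mu +/- k*sigma."""
    return [abs(x - mu) > k * math.sqrt(var) if var > 0 else x != mu
            for (mu, var), x in zip(params, op_row_values)]

# Rows = rules, columns = training splits (hit fractions).
table = [[0.30, 0.32, 0.29, 0.31],   # rule 1
         [0.60, 0.58, 0.61, 0.59]]   # rule 2
params = fit_row_gaussians(table)
flags = row_outlier_flags(params, [0.80, 0.10])
```

Fitting N_r univariate Gaussians needs only 2·N_r parameters, which is the point of avoiding the full multivariate estimate.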

A. Incremental technique
The method collects a bunch of operational data before processing and classifying them as in- or out-of-distribution. For this reason, it falls into the category of groupwise methods [22]. Differently from pointwise methods, groupwise methods confirm a type of situation (in or out) without relying on a single point that could be a spike in a steady trend.

[Algorithm 4 (Rule-based Information at Operational Stage), summarized: inputs TR1 and OP (Table III), baseline ranges RBI_base and lp_base; outputs ODD through RBI and l_p, i = 1, ..., N_tr, p = 1, 2, j = 1, ..., N_r, op_i ∈ OP. ∀j and ∀op_i, compute the probabilities and the entropy; compute the weighted conditional entropy ∀op_i; compute the average entropies; measure the similarity between TR1 and OP through RBI_{TR1−OP} ∈ [0, 1]; OoD detection: a flag is raised when a metric exits its baseline for the majority of i and j; IF at least one flag is on THEN OP is OoD.]

The collection phase in operation does not imply that one would wait for n_s new samples to register a new split and provide the ODD. Splits are generated continuously, as soon as new samples are collected. Incremental techniques may also be used to accelerate the computation of statistically-based features (mean, variance, skewness and kurtosis), as in the RUL and DNS problems detailed later on [23]. As in incremental techniques, once a new sample is available, a new (operational) bunch of n_s samples is built, by adding the new sample and by discarding the oldest point (n_s positions in the past). In turn, the bunch leads to the split collection, by computing the inherent hits on the ruleset. The process assumes a sample-by-sample incremental time window, over which the following operations are performed: a new data bunch is first registered, a new split is calculated and a new ODD is then derived.
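The sample-by-sample window can be sketched with a fixed-size deque; the class name and interface are illustrative assumptions:

```python
# Sketch of the incremental groupwise collection: keep the latest n_s
# samples; each new sample evicts the oldest one and yields a fresh
# split on which the ruleset hits (and then the metrics) are recomputed.
from collections import deque

class IncrementalSplitter:
    def __init__(self, n_s, rules):
        self.window = deque(maxlen=n_s)
        self.rules = rules

    def push(self, sample):
        """Add one sample; return the new hits histogram, or None while
        the window is still filling up."""
        self.window.append(sample)
        if len(self.window) < self.window.maxlen:
            return None
        n_s = len(self.window)
        return [sum(r(x) for x in self.window) / n_s for r in self.rules]

rules = [lambda x: x <= 0.5]
splitter = IncrementalSplitter(n_s=4, rules=rules)
hists = [h for x in [0.1, 0.2, 0.9, 0.3, 0.8] if (h := splitter.push(x))]
```

After the fourth sample a histogram is produced at every step, so the ODD can track the drift continuously rather than once per n_s samples.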

B. Computational issues
The computational speed of the bunch building process depends on how fast the data samples are collected by the system (the quantity is denoted by δt 0 ). The speed of the split building process depends on the time required to compute the hits of the ruleset on the bunch, namely, on the latest n s data samples (δt 1 ). The speed of the ODD depends on the computational time of algorithms 2 and 4 above (δt a2 and δt a4 , respectively).
The computational times of the baselines in Algorithms 1 and 3 are of less interest, as these algorithms work at design time, when enough computational resources are assumed to be available; they however exhibit the same O(·) behaviour as their respective operational versions. On the other hand, δt_a2 and δt_a4 are of interest, as Algorithms 2 and 4 work over the deployed ML infrastructure, for which limited computational resources may be assumed. The following considerations hold for the δt quantities. δt_0 is outside the scope of the paper, as it depends on the environmental conditions and on the sensing architecture of the system. δt_1 is O(n_s) (by assuming the time to verify a rule on a data sample constant, independently of the complexity of the rule). By referring to the computations inherent to the metrics involved in Algorithm 2, δt_a2 is O(N_r · N_tr). Analogously, δt_a4 is O(N_r · N_op).

A. Datasets
Three application scenarios are considered with the inherent datasets and relevance of the ODD problem.
1) RUL: The first dataset concerns damage propagation modeling for aircraft engines and is taken from the NASA repository [24]. It is an important benchmark in predictive maintenance and includes four different subsets of data (tr_1, op_a, op_b, op_c), corresponding to different machines of the same factory family. The problem is interesting from the ODD perspective because one may expect a model trained on a machine (e.g., tr_1) to be applicable (with limited error) to another machine (e.g., op_a). The features are: mean (m), variance (v), kurtosis (k) and skewness (s) of the original 23 physical quantities over time. A preliminary analysis is performed with LLM feature ranking [25]. The target variable is the Remaining Useful Life (RUL), which represents the time before the occurrence of a fault and is binarized to assume either value '0 healthy' (RUL > 150) or '1 fault' (RUL ≤ 150). An ML classifier through LLM predicts whether the engine will come into a fault state or not; tr_1 constitutes the in-distribution domain and R_{tr1} the reference ruleset.
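The feature extraction and target binarization described above can be sketched as follows (the moment formulas are the standard population moments; whether the paper uses biased or unbiased estimators is not stated, so this is an assumption):

```python
# Sketch of the RUL preprocessing: four statistical moments per physical
# quantity, and binarization of the RUL target at the 150 threshold.
import math

def moments(series):
    """Mean, variance, skewness and kurtosis of a time series
    (population/ML estimators, an assumed choice)."""
    n = len(series)
    m = sum(series) / n
    v = sum((x - m) ** 2 for x in series) / n
    sd = math.sqrt(v)
    s = sum(((x - m) / sd) ** 3 for x in series) / n if sd else 0.0
    k = sum(((x - m) / sd) ** 4 for x in series) / n if sd else 0.0
    return m, v, s, k

def binarize_rul(rul, threshold=150):
    """Paper's target encoding: 0 'healthy' (RUL > 150), 1 'fault' (RUL <= 150)."""
    return 0 if rul > threshold else 1
```

Applying `moments` to each of the 23 physical quantities yields the 92 candidate features on which the LLM ranking operates.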
2) Platooning: The second dataset (platooning [26]) addresses collision avoidance in vehicle platooning, which is one of the most celebrated applications in autonomous driving. A group of vehicles is interconnected via wireless, based on Cooperative Adaptive Cruise Control [27]. The behavior of the platooning system is synthesised by the physical quantities listed in Table V. These quantities correspond to the features of the problem, which consists of identifying potential collisions in advance after a sudden brake. We consider two datasets: in the first one (LOW) the communication delay parameter d is bounded by 0.4 s; in the second one (HIGH), d is larger than that threshold. As in the RUL case, we set a training domain: tr_LOW (in-distribution) as well as the reference ruleset R_{trLOW}. A typical ODD problem is thus posed (between LOW and HIGH), as d has a significant impact on performance. The ODD has here a safety-preserving role, as it recognizes whether the delay in operation is larger than the one in training. The algorithms are however not aware that the delay is the key to the datasets' differentiation and infer the ODD through the operational hits on R_{trLOW}.
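The LOW/HIGH partition can be sketched in a few lines; the record layout (a dict with a delay field `d`) is an illustrative assumption:

```python
# Sketch of the LOW/HIGH platooning partition, keyed on the
# communication delay d (seconds).  Record layout is assumed.
def split_by_delay(records, threshold=0.4):
    """LOW: d <= threshold (in-distribution training domain);
    HIGH: d > threshold (operational OoD domain)."""
    low = [r for r in records if r["d"] <= threshold]
    high = [r for r in records if r["d"] > threshold]
    return low, high

records = [{"d": 0.1}, {"d": 0.4}, {"d": 0.5}]
low, high = split_by_delay(records)
```

Only the LOW side feeds the LLM training; the HIGH side is seen exclusively through its hits on R_{trLOW}.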
3) DNS: The third dataset (DNS) deals with a DNS tunneling detection problem [28]. The aim is detecting the presence of Domain Name Server intruders by an aggregation-based traffic monitoring. Silent intruders and quick statistical fingerprints generation make the tunneling detection a hard task. Table VI shows the physical quantities of the problem.

Symbol  Description
q       Size of a query packet
a       Size of an answer packet
Δt      Time interval between query and answer

Again, as in the RUL case, mean (m), variance (v), kurtosis (k) and skewness (s) are extracted over the time series of the system, thus leading to 12 features. The target variable is a binary label denoting the 'presence' or 'absence' of a tunneling attack. Two reference datasets are considered: the first one covers a tunneled peer-to-peer (p2p) application, which is the training (in-distribution) domain tr_p2p (with R_{trp2p} as the reference ruleset), and the second refers to the tunneled secure shell (ssh) application, which is the operational setting (op_ssh). The ODD is here of interest once ssh is used in operation under the trained p2p model. It is a quite realistic situation in cybersecurity, as not all attack configurations may be anticipated at design time.

VII. RESULTS
The first two subsections deal with understanding the ranges of the metrics in OoD conditions. The baseline ranges are reported in the first row of all the tables and represent the reference to infer possible OoD in operation. An even partial overlap between ranges in training and operation leads to a missed detection, i.e., a false negative (FN). Example code and data for the experiments are available at the following link: https://github.com/giacomo97cnr/Rule-based-ODD.
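The overlap criterion for a missed detection can be stated as a one-line interval check:

```python
# Sketch of the FN criterion: even a partial overlap between the training
# baseline range and the operational range yields a missed detection.
def ranges_overlap(train_range, op_range):
    """True when the two closed intervals share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = train_range, op_range
    return a_lo <= b_hi and b_lo <= a_hi

fn_case = ranges_overlap((0.1, 0.3), (0.25, 0.5))   # partial overlap -> FN
detected = not ranges_overlap((0.1, 0.3), (0.4, 0.6))  # disjoint -> detection
```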

A. N op = 1
Tables VII and VIII show the norms, μI and WμI ranges over the RUL datasets. The robustness of Algorithm 2 is validated by the fact that all OoD ranges are significantly far away from the training baselines. The values of the norms are closer to the respective training baselines with op_b than with op_a and op_c. This is an important indication about the similarity of the in (tr_1) and out (op_b) distributions. Coming back to the introductory EASA figure, op_b lies in the yellow zone. Namely, the model trained on tr_1 is good on op_b data, with FPR = 18% and FNR = 27%, which is quite close to the in-distribution performance (training and test on separate tr_1 samples): FPR = 18% and FNR = 22%. On the other hand, op_a and op_c lie in the orange zone (i.e., the tr_1 model is no longer good on op_a and op_c, with FPR = 0.03%, FNR = 99.86% and FPR = 0.04%, FNR = 99.88%, respectively). When the tr_1 model is tested on op_b, a good balance between FPR and FNR is still achieved; the same does not hold for op_a and op_c, which are far away from the tr_1 baseline.
The rationale behind the tr_1 and op_b proximity is beyond the knowledge of the authors (one may argue about some mechanical similarity of the respective engines), but inferring such proximity through ODD is quite an important achievement. From this perspective, the method should use all the metrics jointly, to provide both ODD and a measure of the distance between the in and out distributions. As far as platooning and DNS are concerned, good performance is registered, except with μI in platooning (the topic is discussed later through groupwise analysis and in the Appendix).

B. N op > 1
This section outlines the performance of Algorithms 3 and 4, whose results are shown in Tables XIII, XIV and XV. N_op > 1 makes it possible to exploit more information at the operational level and thus achieve a finer separation of the OoD from the baseline. As expected, in the RUL case (Table XIII), RBI appreciably measures some proximity between tr_1 and op_b.

C. Comparison with canonical methods
This section outlines a comparison with canonical supervised algorithms, such as K-Nearest Neighbours (KNN), Logistic Regression, Support Vector Machine (SVM) and Random Forest, as well as unsupervised KNN (u-KNN). Supervised algorithms exploit information about OoD data: a mix of the in and out data is used for training the supervised approaches, and a testing phase then yields the FPR and FNR values presented in Table XVI. For u-KNN we followed [8], revisiting it to account for the fact that we are not using images; hence, we first split the training domain into a training set and a test set, and then tuned two parameters: the number of neighbours (K) and a distance threshold (λ) used to determine whether test data are in-distribution or not; λ was set so as to have a true positive rate of 95% in the training domain. Despite including information about OoD data in their training, supervised algorithms fail at ODD. This may denote that training and operational data are confused in the original feature space. The proposed algorithms achieve better performance by virtue of looking at the in/out separation in a different space, namely, through the hits on the training ruleset. This is corroborated by additional experiments: repeating the supervised approaches with the ruleset hits as input features (in place of the original features) yields a perfect separation between in and out distributions. Algorithms 2 and 4 are however still preferable, due to their unsupervised derivation.
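The u-KNN calibration described above can be sketched without any ML library. The Euclidean k-th-neighbour score and the percentile-based threshold are the natural reading of the text, but the exact details in [8] may differ:

```python
# Sketch of u-KNN: score a point by the distance to its K-th nearest
# training neighbour, and set lambda so that ~95% of held-out
# in-distribution points score below it.
def kth_nn_distance(point, train, k):
    """Euclidean distance from a point to its k-th nearest neighbour."""
    dists = sorted(sum((a - b) ** 2 for a, b in zip(point, x)) ** 0.5
                   for x in train)
    return dists[k - 1]

def fit_lambda(train, test_in, k, tpr=0.95):
    """Threshold such that tpr of in-distribution test points fall below it."""
    scores = sorted(kth_nn_distance(p, train, k) for p in test_in)
    return scores[min(len(scores) - 1, int(tpr * len(scores)))]

def is_in_distribution(point, train, k, lam):
    return kth_nn_distance(point, train, k) <= lam

# Toy 2-D data: training points on a line, held-out points slightly shifted.
train = [(float(i), 0.0) for i in range(20)]
held = [(i + 0.1, 0.0) for i in range(20)]
lam = fit_lambda(train, held, k=1)
near_ok = is_in_distribution((5.05, 0.0), train, 1, lam)
far_ood = not is_in_distribution((100.0, 0.0), train, 1, lam)
```

With real data, K and λ would be tuned on the training-domain split exactly as described in the text.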
In Table XVI, Algorithm 2 still experiences some FPR, as the weighted version of the mutual information is applied to a smaller portion of operational data than with Algorithm 4. Another rationale behind the noticeable level of FPR comes from declaring OoD if at least one of the metrics registers an OoD: this minimizes FNR, but may increase FPR. Additional results (not reported here for the sake of brevity) confirm that the algorithm is even more prone to FPR with values of N_tr < 50. On the other hand, Algorithm 4 also decreases FPR (with respect to Algorithm 2), by virtue of the (Gaussian) statistical filter applied to several splits of operational data.

D. Incremental groupwise in operation
By referring to Section V, the following experiments highlight the ODD when replacing in-distribution data with out-of-distribution data in a sample-by-sample, incremental way. The analysis is relevant to the tracking of the OoD drift with both precision and measurement of the distributions' proximity. Every figure in the following contains the baseline derived at design time; the curves represent the behaviour of the metrics in operation. Increasing time windows with n_s = 5·10^3 and 10^4 samples are used to emphasize the speed of the drift inference over time. The time size of the windows depends on the time granularity of the arrival of the points in operation; for this reason, the x-axis is not time, but refers to the progressive identifier of the operational samples. The drift starts at time zero, meaning that the first operational sample derives from the OoD while the previous points (of the window) comply with training conditions. As the window collects more data (over the last n_s points), it senses more information about the OoD. As to the WμI metric, the results confirm that the shorter the window, the faster the detection. On the other hand, the μI metric experiences a noise that can have different meanings, as detailed later on.
The following evidence arises for the case studies. In RUL, WμI (Fig. 2) needs at least 200 samples to exit the baseline; this happens with the shortest window (n_s = 5000) and with the most divergent OoD (op_a with respect to op_b). The l_1 norm (Fig. 4) outlines a similar behaviour. μI (Fig. 3) does not trigger the expected ODD; this seems in contrast with the previous results in Table VIII, where ODD was successful. This is however due to the limited horizon of the figure; the curves under n_s = 5·10^3 are actually approaching the baseline and, as expected, op_a turns out to be faster than op_b, being more divergent from tr_1. The groupwise progression thus suggests the joint adoption of the metrics, to achieve both precision (WμI) and a measure of the distributions' similarity (μI).
In platooning, W µI matches the ODD and, coherently with previous results (table IX), µI is stuck in the baseline. Finally, DNS has good performance with the two metrics as well.
The difference between RUL and platooning in μI is as remarkable as it is subtle. In the former case, μI is sensitive to the distributions' similarity, still being able to slowly proceed in the ODD direction. In the latter, it experiences imprecise calculations (as shown in the Appendix), thus complicating the ODD task.
It is finally worth noting that the window of the incremental groupwise analysis should be coherent with the design setting n_s = 5000. Other results show several counterexamples in RUL with n_s = 1000 and tr_1 − tr_1, in which many false positives take place, even though only points in the baseline would have been expected.

VIII. CONCLUSION AND FUTURE WORK
The paper deals with the identification of out-of-distribution data through a distributional-assumption-free rule-based model. The approach also measures the proximity of the in and out distributions and is validated in challenging case studies. Future extensions comprise further testing on additional longitudinal datasets, as well as on image data. Alternative ways (other than the hits of the ruleset) to infer in-distribution behaviour are of interest, as well as additional metrics to measure the in/out distribution divergence.

APPENDIX A
With N_op = 1, the correction consists of weighting the probabilities (later used in entropy calculations) through the average of the hits differences in each rule/row; this leads to the α_{i,j} quantities in Algorithm 1. An indirect effect of those weights is the inversion of the canonical trend of mutual information with respect to the dependency of the vectors: the more the vectors are dependent, the more WμI goes towards zero, and vice versa. With N_op > 1, the weights (the γ_j quantities) are the fractions of the compared probabilities.

Enrico Cambiaso, Ph.D. in Computer Science, has a background working experience as a computer scientist, for both small and big enterprises. He is currently employed at the IEIIT institute of Consiglio Nazionale delle Ricerche (CNR) as a technologist working on cyber-security topics and focusing on the design of last-generation threats. He is the author of more than 50 scientific papers on cybersecurity and has been involved in several funded research projects, at the national and European level. During his doctorate and in the following years, he worked on the quality of service for military networks with Selex. He was the CNIT technical coordinator of a research project concerning satellite emulation systems, funded by the European Space Agency, and he spent three months working on the project at the German Aerospace Center in Munich.
He is now a researcher at the Institute of Electronics, Computer and Telecommunication Engineering (IEIIT) of the National Research Council (CNR), where he deals with machine learning applied to bioinformatics and cyber-physical systems. He is co-author of over 100 international scientific papers, 2 patents and is participating in the SAE G-34/EUROCAE WG-114 AI in Aviation Committee.