Detecting and interpreting faults in vulnerable power grids with machine learning

Unscheduled power disturbances cause severe consequences both for customers and grid operators. To defend against such events, it is necessary to identify the causes of interruptions in the power distribution network. In this work, we focus on the power grid of a Norwegian community in the Arctic that experiences several faults whose sources are unknown. First, we construct a dataset consisting of relevant meteorological data and information about the current power quality logged by power-quality meters. Then, we adopt machine-learning techniques to predict the occurrence of faults. Experimental results show that both linear and non-linear classifiers achieve good classification performance, indicating that the considered power-quality and weather variables explain the power disturbances well. Interpreting the decision process of the classifiers provides valuable insights into the main causes of disturbances. Traditional feature selection methods can only indicate which variables, on average, best explain the fault occurrences in the dataset. Besides such a global interpretation, it is also important to identify the specific set of variables that explains each individual fault. To address this challenge, we adopt Integrated Gradients, a recent technique to interpret the decision process of a deep learning model. The proposed approach provides detailed insights into the occurrence of a specific fault, which are valuable for the distribution system operators to implement strategies to prevent and mitigate power disturbances.


Introduction
Unscheduled power disturbances cause problems for customers and grid operators as they affect all customers connected to the power network, from single households to large industries [10,27,32,48]. Power failures might have complex and adverse socio-economic consequences in communities that are heavily reliant on the electricity supply [19,56]. The distribution system operator (DSO) is contractually obliged to provide a reliable power supply and to compensate customers affected by power interruptions [28]. To meet the expected energy demand, the DSOs must implement management plans that account for the underlying infrastructure.
In this study, we focus on disturbances on a power grid in an Arctic region in Northern Norway, where the energy demand from local food industries has increased greatly. The growth in energy demand has resulted in more frequent power disturbances, as the current power grid is operating close to its maximum capacity. One way to improve the reliability of the power supply is to build a new distribution grid that can handle a larger power demand. However, this is costly, time-consuming, has a huge environmental impact, and contradicts the vision of better utilizing the current electricity grid infrastructure [40]. An alternative solution is to limit the failures and strengthen only the most vulnerable parts of the grid, but this requires first identifying the factors that trigger power disturbances. This identification is complicated by two issues:
1. The machines of the local industries connected to the power grid are so sensitive to the power quality that they experience failures that are not registered in the failure-reporting system of the DSO.
2. The resolution of the data collected in 2020 was too low (1-hour intervals) to understand how power consumption truly affects power quality.
To address these issues, new power quality meters were installed on 19 February 2021 in the power grid under analysis. These meters log data every minute and register every small voltage variation. In addition, they provide detailed information about the power quality in the grid, such as the specific phase where the fault is registered, the magnitude of voltage variation, frequency imbalance, and the amount of flicker.
Contributions First, we build a power-fault classification dataset in collaboration with the DSO by collecting the variables considered most relevant in explaining power disturbances. Then, we train different classifiers, including linear classifiers and a deep learning architecture, to detect an incoming fault from the weather and power-quality variables registered one minute before the specific fault occurs. As shown in the experimental results, the classifiers manage to detect most of the power disturbances before their onset, demonstrating that high-resolution data from power-quality meters, in conjunction with weather data, are highly informative.
To gain a better understanding of the relationships between the different variables and the power disturbances, we analyze the decision process of the classifiers. First, we consider traditional feature selection methods, which identify the most important variables in the dataset for explaining fault occurrence. While such an approach gives a global overview of the variables that are, on average, the most informative in the dataset, it does not allow reasoning about specific cases.
To address this challenge, we adopt Integrated Gradients (IG), a recent technique to interpret the decision process of a deep learning model. For each individual sample, IG assigns to each feature a score whose magnitude indicates how much the value of that feature contributes to determining the class of the sample. The proposed methodology shows that the classifiers focus on heterogeneous sets of features when processing different samples. This indicates that the occurrence of faults can be explained by multiple different patterns in the weather and power-quality variables. Our findings are valuable to the DSO for implementing strategies to prevent and mitigate power disturbances.

Related work and studies
There exists a vast body of literature on the detection of different classes of power-quality disturbances, such as deviations in voltage, current, and frequency signals. For example, Ref. [29] provides a comprehensive review of more than 150 research studies, published between 1986 and 2014, on the detection and classification of power-quality disturbances.
In another comprehensive and more recent survey, [34] reviewed 242 papers on Power Quality Disturbance detection and Classification (PQD&C) techniques based on digital signal processing and ML. The survey performed a comparative assessment of various PQD&C techniques by considering several criteria, such as the type of data used, the type of PQ disturbance, and the classification accuracy.
However, fault detection and classification is a reactive process, where models try to classify a fault after it has occurred. It is often more interesting to identify the causing factors and predict the onset of a power fault. A fault prediction model should quantify the likelihood of observing a fault in the next period, given a set of conditions described by the explanatory variables in the model. Additionally, identifying the causing factors of faults helps the DSO to implement strategies to prevent and mitigate incoming faults.
There exists some relevant prior work on identifying causing factors for faults in the power grid. The causing factors are often divided into two categories: i) weather conditions, and ii) other factors, such as human-related activities (energy consumption).

Weather-related faults
Harsh and severe weather events are considered to be an important source of faults, and several studies have been conducted to address the impact of such events on power quality.
Owerko et al. predicted power faults in New York City by monitoring weather conditions [36]. The authors deployed a Graph Neural Network to model the spatial relationships between weather stations and improve the prediction performance.
The impact of seasonal weather on forecasting power disturbances was investigated in [33]. The authors tested the performance of the proposed models by using two different training sets: seasonal or all-year data. It was shown that, in some cases, the prediction performance of the models improved when the training data were limited to a subset corresponding to a particular meteorological season.
The impact of weather variations and extreme weather events on the resilience of energy systems was investigated in [38]. The authors developed a stochastic-robust optimization method to consider both low-impact variations and extreme events. The method was applied to 30 cities in Sweden. The results indicated that extreme weather events cause a 16% drop in power supply reliability.
Other examples of relevant work on weather-related faults can be found in Refs. [12,37,42,55]. In addition, several risk assessment studies on the impacts of extreme weather hazards such as earthquakes, thunderstorms, and hurricanes can be found in Refs. [14,35,43,44,45,58].
The works mentioned so far consider only severe weather events and disregard other factors, such as heavy energy load caused by human-related activities. Additionally, many methodologies are tested on synthetic data or on public benchmark datasets, which limits the scope of the evaluation and poses constraints on the data acquisition procedure.

Alternative approaches for fault detection
A methodology to predict power faults by analyzing data from advanced measurement equipment, such as Power Quality Analyzers (PQAs) and Phasor Measurement Units (PMUs), has been proposed in [20]. The study used real-world measurements from nine PQA nodes in the Norwegian grid to predict incipient interruptions, voltage dips, and earth faults. The authors found incipient interruptions the easiest to predict, while earth faults and voltage dips were more challenging.
The authors in [21] compared several ML methods to predict power-disturbance events such as voltage dips, ground faults, rapid voltage changes, and power interruptions. The Random Forest models achieved the highest performance, and the results indicated that voltage dips and rapid voltage changes were the easiest to predict.
The challenge of detecting back-fed ground-faults has been recently addressed in [1]. The authors show that faults can be detected by integrating advanced metering infrastructure with a distribution management system. However, the proposed solution is relevant only for DSOs that adopt the OpenDSS software.
The study in [57] investigated the possibility of predicting voltage anomalies minutes in advance by using an ML model trained on historical power quality analyzer (PQA) data. The voltage data were collected from 49 measuring locations in the Norwegian power grid. The model attempted to predict voltage anomalies 10 minutes in advance based on the presence of early warning signs in the preceding 50 minutes. It was found that the time passed since the previous fault is a major factor affecting the probability of a new imminent fault.
In [39], the application of clustering and dimensionality reduction techniques to predict unscheduled events was investigated. First, the authors used several techniques to reduce the dimensionality of the data and to cluster events based on analytical features. Then, the fault events were separated from the normal operating conditions. The findings show promising results on balanced datasets, while the predictive capability is significantly reduced on unbalanced datasets, which, however, often appear in real-world case studies.
Other relevant works on fault detection based on ML techniques can be found in Refs. [15,23,30,49,60,61]. In addition, there are relevant works that adopt novel ML techniques for detecting and localizing faults in the power distribution network [8,17,25,46].
This section presented several relevant works on predicting faults by assessing either weather effects or human activities.
One of the goals of our work is to consider, at the same time, a larger set of weather variables and electricity-related measures as potential causes of power disturbances. A close collaboration with the local DSO has provided us with valuable insights about the relevant variables that should be monitored to construct a new classification dataset. More importantly, none of the previous works has focused on interpreting the decision process of the classifier, which is key to understanding the causes of faults and can provide valuable information to improve power grid reliability.

Power faults dataset
In this study, we focus on a power grid with a radial structure located in the Arctic. A detailed description of the grid configuration is deferred to Sect. A in the Supplementary Material. The grid is subject to frequent power faults, which could be caused by weather factors or by the strain of the infrastructure from a local industry, which dominates the load consumption in the power grid.
We prepared a classification dataset where each sample refers either to a period when the grid is operating in normal conditions or to a period preceding a fault. Each sample is associated with a feature vector x ∈ R^12 and a label y ∈ {0, 1}, indicating the normal condition or the imminent fault, respectively. The feature vector consists of 6 energy-related variables and 6 weather variables, summarized in Tab. 1. The dataset contains 90 samples representing reported faults (y = 1), which occurred in the period from 19.02.2021 to 30.04.2021. Naturally, the number of samples associated with normal operating conditions is much larger. In addition, in normal operating conditions the values x from neighboring hours are very similar to each other. To limit the class imbalance in the dataset and the redundancy in the over-represented class, we arbitrarily subsampled the non-fault class (y = 0) by taking 1 sample every 60. In the final dataset, there are 90 samples representing a reported fault and 1,647 samples representing normal operating conditions without any power disturbance.
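The subsampling of the over-represented class can be sketched as follows (a minimal sketch: the function name and the toy data are illustrative, not the actual dataset):

```python
import numpy as np

def subsample_normal_class(X, y, step=60):
    """Keep all fault samples (y == 1) and every `step`-th non-fault sample.

    Neighboring non-fault samples are nearly identical, so thinning them
    reduces both class imbalance and redundancy.
    """
    fault_idx = np.flatnonzero(y == 1)
    normal_idx = np.flatnonzero(y == 0)[::step]   # 1 sample every `step`
    keep = np.sort(np.concatenate([fault_idx, normal_idx]))
    return X[keep], y[keep]

# toy example: 90 faults scattered in ~100k minutes of normal operation
rng = np.random.default_rng(0)
y = np.zeros(100_000, dtype=int)
y[rng.choice(100_000, size=90, replace=False)] = 1
X = rng.normal(size=(100_000, 12))

X_sub, y_sub = subsample_normal_class(X, y)
print(int(y_sub.sum()), int((y_sub == 0).sum()))  # all faults kept, non-faults thinned ~60x
```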

Methodology
Our approach consists of two steps. First, we train a classifier to predict the onset of power faults given the value of the electricity and weather variables. If we obtain a high classification accuracy, we can conclude that there are strong relationships between the variables, x, and the occurrence of faults, y. Then, we use two different techniques to highlight the most informative features identified by the classifiers to solve the task.
In Sects. 4.1 and 4.2, we describe the classifiers considered in this study. In Sect. 4.3, we present an approach for interpreting the decision process of a neural network classifier.

Linear classifiers
We consider three different linear classifiers. The first is a Ridge regression classifier, which first converts the target values into {−1, 1} and then treats the problem as a regression task [6]. The second is Logistic regression, which uses a logistic function to approximate the probability of the binary class variable [6]. The third is the Linear Support Vector Classification model (LinearSVC), a Support Vector Machine (SVM) [7] endowed with a linear kernel.
Due to the strong class imbalance, we configure each model to assign a class weight that is inversely proportional to the number of samples in each class. In this way, errors on the underrepresented class (faults, y = 1) are penalized much more than errors on the larger class (nominal condition, y = 0).
One advantage of linear classifiers is that they construct a decision boundary directly in the input space, which makes the decision process easy to interpret. In particular, the linear models assign a weight w_i to each feature x_i in the input space: the higher |w_i|, the more the values of x_i impact the classification outcome. Therefore, looking at the magnitude of the weights w_i is a simple strategy to estimate the average importance of the features for the classification task.
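As an illustration, the class weighting and the weight-magnitude inspection can be sketched with scikit-learn (a minimal sketch on synthetic data; the dataset, feature count, and seed are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Imbalanced toy problem standing in for the fault dataset (12 features).
X, y = make_classification(n_samples=1_737, n_features=12,
                           weights=[0.95, 0.05], random_state=0)
X = StandardScaler().fit_transform(X)  # weights are comparable only on scaled features

# class_weight="balanced" penalizes errors on the rare fault class more heavily:
# each class is weighted by n_samples / (n_classes * n_samples_in_class).
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Global feature importance: rank features by the coefficient magnitude |w_i|.
importance = np.abs(clf.coef_.ravel())
ranking = np.argsort(importance)[::-1]
print("most important feature index:", ranking[0])
```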

Non-linear classifiers
We consider two non-linear classifiers. The first is a non-linear SVC equipped with a radial basis function kernel (RBFSVC). As for the linear models, also in this case we used class weights inversely proportional to the class frequencies.
The second non-linear classifier is a Multi-Layer Perceptron (MLP) [18]. The MLP consists of an input layer that takes input vectors x ∈ R^12, L hidden blocks, an output layer that generates a 2-dimensional output o ∈ R^2, and a softmax activation that gives the vector of class probabilities y. Each block l consists of a dense layer with n_l units, a Batch Normalization layer [22], a non-linear activation function, and a Dropout layer [53] with dropout probability p. All trainable weights in the MLP, except the biases, are regularized with an L2-norm penalty with strength λ. The MLP is trained by minimizing a cross-entropy loss, using batches of size b and the Adam optimization algorithm [26] with initial learning rate r. Due to the strong class imbalance in the dataset, we initially trained the MLP by weighting the loss of each sample with a value inversely proportional to the class frequency, as we did for the other classifiers. However, we found that the MLP achieved better performance when re-sampling the minority class during training, which achieves class balance at the expense of introducing redundancy, since the same samples are presented multiple times. We also tried to achieve class balance by subsampling the majority class but, due to the small number of samples in the fault class, the total number of inputs in each training epoch became too small and the samples from the majority class were shown too few times during training.
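The architecture described above can be sketched in Keras as follows (a minimal sketch: the number of blocks, the units, and the remaining hyperparameter values are placeholders, not the tuned values used in the experiments):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_mlp(n_features=12, units=(64, 32), p=0.2, lam=1e-4, r=1e-3,
              activation="relu"):
    """MLP as described in the text: hidden blocks of
    Dense -> BatchNorm -> activation -> Dropout, an L2 penalty on the
    weights, 2 output logits with softmax, trained with Adam."""
    inputs = tf.keras.Input(shape=(n_features,))
    h = inputs
    for n_l in units:  # one iteration per hidden block
        h = layers.Dense(n_l, kernel_regularizer=regularizers.l2(lam))(h)
        h = layers.BatchNormalization()(h)
        h = layers.Activation(activation)(h)
        h = layers.Dropout(p)(h)
    outputs = layers.Dense(2, activation="softmax",
                           kernel_regularizer=regularizers.l2(lam))(h)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=r),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_mlp()
```

During training, class balance can then be obtained by oversampling the fault class, e.g., drawing minority-class indices with replacement until the two classes have the same size in each epoch.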

Interpretation of the MLP results with Integrated Gradients
In the following, we introduce the technique adopted to interpret the decision process of the MLP. A short review of important approaches for interpretability in deep learning, which have been proposed over the past few years (and briefly mentioned hereafter), is deferred to Sect. C in the Supplementary Material.
Integrated Gradients (IG) [54] is a technique proposed to satisfy two axioms, which are not jointly enforced by other existing attribution schemes (see Sect. C for details). According to the first axiom, sensitivity, if the input and an uninformative baseline differ in exactly one feature, such a feature should be given non-zero attribution. While interpretability approaches such as LRP [2] and DeepLiFT [47] ensure sensitivity due to the conservation of total relevance, gradient based methods [50,51,52,59] do not guarantee the sensitivity axiom because the saturation at ReLU or MaxPool makes the score function locally "flat" with respect to some input features.
The second axiom, implementation invariance, states that when two models are functionally equivalent, they must have identical attributions to input features. While implementation invariance is mathematically guaranteed by vanilla gradient approaches, the coarse approximation to gradients in LRP and DeepLiFT might break this assumption.
The attribution to feature i given by IG is

IG_i(x) = (x_i − x'_i) ∫_0^1 [∂F(x' + α(x − x')) / ∂x_i] dα,    (1)

where i is an input feature, x is a sample in the dataset, x' is the uninformative baseline, and α is an interpolation constant used to perturb the features of the input sample. The above definition ensures both desirable axioms: • by the Fundamental Theorem of Calculus, the IG attributions sum up to the difference in prediction scores and, thus, follow sensitivity; • since the IG attribution is completely defined in terms of gradients, it ensures implementation invariance.
IG has become a popular interpretability technique due to its broad applicability to any differentiable neural network model, ease of implementation, theoretical justifications, and computational efficiency.
Implementation IG is a post-hoc explanatory technique that works with any differentiable model F(·), regardless of its implementation. In this paper, F(·) is the MLP model described in Sect. 4.2, which takes as input the feature vector x ∈ R^12 and generates an output prediction tensor o = F(x), called logit. In our case, o ∈ R^2 and softmax(o) gives the probabilities of x being "fault" and "non-fault". The baseline x' in (1) is an uninformative input used as a starting point to compute the IG attributions. The baseline is essential to interpret the IG attributions as a function of the individual input features, and it should encode as much as possible the lack of information about the target class c. In a classification task with multiple classes, we want softmax[F(x')]_c ≈ 0. In a binary classification task, like ours, we can choose a baseline that gives equal probability to both classes, i.e., softmax[F(x')] = [0.5, 0.5]. In computer vision tasks, a black image (all pixels at 0) is commonly used as a baseline. However, in our dataset the value 0 might actually be informative, because the absence of some specific features can increase the probability of belonging to a specific class (e.g., in the absence of wind it is less likely to observe a fault). Fig. 2 (left) shows that the MLP assigns the zero-baseline x_z to class 0 (non-fault) with high confidence. Therefore, different alternatives should be considered as the baseline. One option is to cast the binary classification problem into a 3-class problem and re-train the model so that it assigns a vector of zeros to a third, dummy class; in this way, the zero-baseline x_z would carry no information about the two original classes. Other alternatives are to use a mean-baseline, x_m, which is a vector computed as a weighted average of the features across the two classes, or a random baseline, x_r.
In the latter case, the final result is given by averaging the IG attributions computed from several random baselines. As shown in Fig. 2, the mean baseline gives almost the same probability to classes 0 and 1, while the random baseline tends to assign a strong probability to one of the two classes. Therefore, we used the mean baseline in all our experiments.
The default path used by the integral in (1) is a straight line in the feature space from the baseline to the actual input. Since the choice of path is inconsequential with respect to the above axioms, we use the straight-line path, which has the desirable property of being symmetric with respect to both x and x'. The exact computation of a definite integral is often not tractable, and it is necessary to resort to numerical approximations. The Riemann trapezoidal sum offers a good trade-off between accuracy and convergence, and changes (1) into:

IG_i(x) ≈ (x_i − x'_i) (1/m) Σ_{k=1}^m ∂F(x' + (k/m)(x − x'))/∂x_i,    (2)

where m is the number of finite steps used to approximate the integral and α is discretized as k/m. The m samples represent the linear interpolation between the baseline and the input. Fig. 3 depicts such an interpolation path from the mean-baseline to a specific sample of class "fault" in our dataset. After generating the set of interpolated samples, we can compute the gradients ∂F(x)/∂x_i, which quantify the relationship between changes in the input features and changes in the predictions of the MLP F. Important features have gradients with steep local slopes with respect to the probability predicted by the model for the target class. Interestingly, the largest gradient magnitudes generally occur during the first interpolation steps. This happens because the neural network can saturate: the magnitude of the local feature gradients can become extremely small and go toward zero, so that important features end up with small gradients. Saturation can result in discontinuous feature importances, and important features can be missed. This is the key reason why, rather than simply using the gradients at the actual input, IG accumulates the gradients along the whole interpolation path. This concept is exemplified in Fig. 4 (left), showing that the model prediction quickly converges to the correct class in the beginning and then flattens out.
There could still be less relevant features, differing from the baseline, that the model relies on for a correct prediction; however, the magnitudes of their gradients become very small, as shown in Fig. 4 (right). The figure is obtained using the same data as Fig. 3.
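The accumulation of gradients along the interpolation path can be sketched with TensorFlow automatic differentiation (a minimal sketch, not the implementation used in the experiments; the function name is illustrative and the demo model is an untrained toy classifier):

```python
import tensorflow as tf

def integrated_gradients(model, x, baseline, target_class, m=100):
    """IG with a trapezoidal approximation of the path integral in Eq. (1)-(2).

    model        : differentiable Keras model returning class probabilities
    x, baseline  : 1-D feature vectors
    m            : number of interpolation steps
    """
    alphas = tf.linspace(0.0, 1.0, m + 1)                  # k/m, k = 0..m
    path = baseline + alphas[:, None] * (x - baseline)     # (m+1, d) interpolated inputs
    with tf.GradientTape() as tape:
        tape.watch(path)
        probs = model(path)[:, target_class]               # F(x' + α(x - x'))
    grads = tape.gradient(probs, path)                     # ∂F/∂x_i along the path
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)  # trapezoids
    return (x - baseline) * avg_grads                      # IG_i(x)

# demo on a toy 2-class model with 4 features
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(2, activation="softmax")])
x = tf.constant([1.0, -0.5, 2.0, 0.3])
baseline = tf.zeros(4)   # stand-in for the mean-baseline x_m
ig = integrated_gradients(model, x, baseline, target_class=1, m=300)
```

By completeness, the attributions should sum (up to the discretization error) to the difference between the model scores at x and at the baseline.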

Experimental evaluation
After introducing the experimental setting, in Sect. 5.1 we compare the classification performance of the different classifiers on our dataset. Then, in Sect. 5.2 we first analyze the specific samples of class "fault" that are missed by the classifiers and, then, we consider two techniques to interpret the decision process of the classifiers.
Model selection and performance evaluation. The linear and the SVM classifiers are implemented with the scikit-learn library, while the MLP is implemented in Tensorflow. To evaluate the model performance, we first shuffle the data and then perform a stratified k-fold cross-validation, with k = 5. In each fold, 80% of the data is used as training set and the remaining 20% as test set. The training set is further divided in two parts: 80% is used to fit the model coefficients and 20% is used as validation set to find the optimal hyperparameters.
The hyperparameters of the linear models and the SVM are optimized with a grid search. In particular, we optimize the regularization strength in the Ridge regression classifier, Logistic regression, and LinearSVC. For the non-linear SVM classifier, we also optimize the width of the radial basis function.
For the MLP, due to the larger number of hyperparameters and the longer training time, we used the Bayesian optimization strategy implemented in Keras Tuner and evaluated a total of 5,000 configurations. In particular, we optimized the number of layers L, the number of units n_l in each layer, the L2 regularization coefficient λ, the dropout probability p, the learning rate r, and the type of activation function (ReLU, tanh, or ELU). We used a fixed batch size b = 32, early stopping with a patience of 30 epochs, and we halved the learning rate when the validation loss did not decrease for 10 epochs.
Before training the models, the input values x are normalized feature-wise by subtracting the mean and dividing by the standard deviation computed on the training set. The overall performance of each classification model is the average performance obtained on each test set of the 5 folds.
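The evaluation protocol described above can be sketched with scikit-learn (a minimal sketch on toy data; the data, sizes, and seeds are placeholders, and the normalization statistics are computed on the training data only, as in the text):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fault dataset: 12 features, strong class imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = np.array([1] * 20 + [0] * 180)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # inner split: 80% to fit the coefficients, 20% for hyperparameter validation
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_tr, y_tr, test_size=0.2, stratify=y_tr, random_state=0)
    # feature-wise normalization with statistics from the training data only
    scaler = StandardScaler().fit(X_fit)
    X_fit, X_val, X_te = (scaler.transform(a) for a in (X_fit, X_val, X_te))
```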
Performance measures. The classification performance is measured by looking at the confusion matrix, which reports the following quantities: True Negatives (TN), the correctly identified non-faults; False Positives (FP), the non-faults predicted as faults; False Negatives (FN), the missed faults; and True Positives (TP), the faults correctly identified. To quantify the performance with a single value, we use the F1 score:

F1 = 2TP / (2TP + FP + FN).

Due to the strong class imbalance in the dataset, we compute a weighted F1 score, i.e., we weight the F1 score obtained for each class by the number of samples in that class and then compute the average:

F1_weighted = (n_faults · F1_faults + n_non-faults · F1_non-faults) / (n_faults + n_non-faults),

where n_c and F1_c indicate the number of samples and the F1 score of class c, respectively.
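The weighted F1 computation can be cross-checked against scikit-learn (a small sketch; the confusion-matrix counts below are illustrative, loosely inspired by the 18-fault/330-non-fault test fold analyzed later, with invented false-alarm counts):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0] * 330 + [1] * 18)   # 330 non-faults, 18 faults
y_pred = y_true.copy()
y_pred[-5:] = 0                            # 5 missed faults (FN)
y_pred[:3] = 1                             # 3 false alarms (FP), illustrative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
f1_faults = 2 * tp / (2 * tp + fp + fn)    # per-class F1, as in the formula above
f1_non = 2 * tn / (2 * tn + fn + fp)
n1, n0 = int(y_true.sum()), int((y_true == 0).sum())
f1_weighted = (n1 * f1_faults + n0 * f1_non) / (n1 + n0)

# the same value is returned by sklearn's weighted average
print(round(f1_weighted, 4), round(f1_score(y_true, y_pred, average="weighted"), 4))
```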
Selecting the number of interpolation steps in IG The result of the IG attribution depends on the number of steps m (see Eq. 2). One of the properties of IG is completeness, meaning that the feature attributions account for the entire prediction of the model. As a consequence, the importance scores capture the individual contribution of each feature to the prediction and, by adding together all the importance scores, it is possible to recover the entire prediction value for a given sample x. In particular, the residual in the variation of the classification score (e.g., the probability of being a fault) is

δ = F(x)_c − F(x')_c − Σ_i IG_i(x),

where F(x)_c and F(x')_c are the prediction scores for class c when the model takes as input x and x', respectively. Since we want Σ_i IG_i(x) to explain the whole difference in the class scores, the number of integration steps m should be increased until δ becomes as close as possible to zero. Following this principle, we found m = 100 to be sufficiently large for our experiments, as it gives δ < 10⁻².
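The principle of increasing m until the completeness residual δ falls below a tolerance can be sketched on a one-dimensional saturating function (a self-contained numerical sketch, not the MLP itself; the function F and the tolerance are illustrative):

```python
import numpy as np

def ig_1d(F, dF, x, x0, m):
    """Trapezoidal IG approximation for a scalar function of one variable."""
    alphas = np.linspace(0.0, 1.0, m + 1)
    grads = dF(x0 + alphas * (x - x0))               # gradient along the path
    avg = np.mean((grads[:-1] + grads[1:]) / 2.0)    # trapezoidal average
    return (x - x0) * avg

F = lambda z: 1.0 / (1.0 + np.exp(-3.0 * z))         # a saturating score function
dF = lambda z: 3.0 * F(z) * (1.0 - F(z))             # its derivative
x, x0 = 2.0, -2.0                                    # input and baseline

# double m until the completeness residual is below the tolerance
m = 5
while True:
    delta = abs(F(x) - F(x0) - ig_1d(F, dF, x, x0, m))
    if delta < 1e-2:
        break
    m *= 2
print(m, delta)
```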

Classification performance of the different methods
Here, we compare the classification performance obtained by the linear methods, the SVM classifier, and the MLP. The classification performance of each model is reported in Tab. 2 in terms of the average weighted F1 score and the average numbers of TN, FP, FN, and TP obtained across the 5 folds. Note that the TN, FP, FN, and TP are rounded to the closest integer. It is interesting to notice that the linear and non-linear models achieve similar performance. This suggests that the two classes are almost linearly separable, i.e., most of the data samples can be separated reasonably well by a hyperplane in the input feature space. On the other hand, the misclassified data samples are very entangled, and it is difficult to find a decision boundary, even a non-linear one, that can correctly separate them. The good performance of the classification models motivates the feature interpretation procedure discussed in the next section.

Analysis and interpretation of the results
For the next analyses, we generate a fixed random train/validation/test split and use the same fold for each model. This allows us to analyze in detail the solutions obtained by the different methods on a single test set, which contains 18 faults and 330 non-faults. Interestingly, all models fail to correctly classify the same 5 fault samples. A closer manual investigation of these 5 samples shows the following. The first FN could have been caused by some type of error, such as a calibration error, in the measurement instruments.
In the case of a ground fault, an electrical transformer connected to the grid breaks, and the power that flows through the transformer is diverted to the ground. A ground fault occurs when the end of the electrical transformer station that contacts the ground level is on the downstream side [1]. The ground fault is detected as a reduction of only one of the three phase voltages. Fig. 5 depicts the phase voltages when the first ground fault occurred: Phase A decreases significantly, while the other two stay above the nominal voltage value. It is difficult to explicitly detect ground faults from only the weather and electricity-load measures considered as input variables, and it is therefore reasonable that the models miss faults 2 and 3.
Similarly to the ground faults, the 4 th FN could be caused by a factor not described in the weather and electricity variables. For example, it could have been caused by vegetation or animals interacting with the power lines.
Finally, the 5th FN is a fault lasting 200 seconds, while the usual duration of the faults is approximately 25-30 seconds. This suggests that this fault is an anomaly that is not well represented in the dataset and is, therefore, difficult to classify accurately.
To identify the most important variables that explain the faults, we try to interpret the decision process of the different models. First, we analyze the coefficients of the linear models, which give a "global" interpretation of the variables importance. Then, we use the IG technique for a "local" interpretability of the features that explain the class of a specific data sample.
Global interpretability. As discussed in Sect. 4.1, when using linear models we can interpret the magnitude of the weights assigned to the input features as the global importance of the features for the classification problem. Fig. 6 reports the feature weights learned by the three different linear classifiers. We observe that in each model the Wind speed of gust variable is always associated with a weight of large magnitude. The LinearSVC and the Logistic regression classifiers attribute a large importance also to the Flicker variable, while the Ridge regression classifier weights the other features more uniformly and assigns weights to Temperature and Humidity that are slightly larger than the weight assigned to Flicker.

Figure 6: Coefficients' magnitude assigned to each feature by the different linear models. A high magnitude indicates that the corresponding feature is important.
This analysis suggests that both industry activity and weather effects are important in discriminating between the fault and non-fault classes. According to the linear models, the most important power-related variable is Flicker, while Wind speed of gust is consistently the most explanatory weather-related variable. These observations align with the experience of the DSO and the local customers, as more faults seem to occur when there is high activity at the industries and the machines operate at full load. In addition, it has been noted that faults are more likely to occur during strong wind, which can cause collisions between the cables of the power lines.
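As a sketch of this global approach, one can fit the three linear classifiers and rank features by coefficient magnitude. The data, feature names, and hyperparameters below are illustrative stand-ins, not the paper's dataset:

```python
# Sketch: global feature importance from linear-model coefficients.
# Synthetic stand-in data: faults correlate with the first two features.
import numpy as np
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
feature_names = ["Wind speed of gust", "Flicker", "Temperature", "Humidity"]

X = rng.normal(size=(500, 4))
y = (0.9 * X[:, 0] + 0.7 * X[:, 1] + 0.2 * rng.normal(size=500) > 0.8).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
Xs = StandardScaler().fit_transform(X)

for model in (LogisticRegression(), RidgeClassifier(), LinearSVC(dual=False)):
    model.fit(Xs, y)
    importance = np.abs(model.coef_).ravel()      # |weight| = global importance
    ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
    print(type(model).__name__, ranking)
```

On this synthetic data all three models rank the first two features on top, mirroring how the paper reads off Wind speed of gust and Flicker from Fig. 6.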
Local interpretability. The faults correctly classified by the different models are reported in Tab. 3, together with the confidence score of the MLP classifier. The confidence score can be interpreted as the probability that the MLP assigns to a sample being a fault. The MLP correctly classifies most fault samples with high confidence and assigns a probability greater than 90% to 5 out of 13 samples. As a side note, the faults do not appear to be clustered around specific days or periods, but seem to be uniformly distributed over time.

Figure 7: The green bars denote that a feature is important for the classification result. The higher the green bar, the more the feature value in the sample (blue bar) explains the classification result, compared to the value in the baseline (black bar). The red bars mean that the feature values in the sample decrease the confidence of the classifier that the sample is actually a fault.
We focus on the samples 52, 140, 227, 304, and 316 in Tab. 3, which are those classified with the highest confidence, and we use IG to identify the variables that are most important for the MLP in determining the correct fault class. The results are reported in Fig. 7. The top-left plot depicts the uninformative baseline, which corresponds to what an "average" sample in the dataset looks like. The blue bar plots represent the values of the 12 features in the 5 selected samples. Finally, the green and red bar plots are the output of the IG procedure.
The green bars indicate that a feature is important for the classification result. The higher the i-th green bar, the more the feature value x_i in the sample (blue bar) explains the classification result, compared to the value x_i in the baseline (black bar). For example, in Sample 227 the value of Flicker is much greater than in the baseline. IG assigns a high score (tall green bar) to this difference, meaning that the MLP considered the increment in Flicker over the baseline level important for deciding that Sample 227 is a fault. Similarly, the MLP considered the decrement in Minimum Power Factor with respect to the baseline level important for classifying Sample 227 as a fault.
A red bar, instead, indicates that a value x_i decreases the confidence of the classifier that the sample is actually a fault, compared to having the baseline value x_i. For example, the MLP would have been even more confident that Sample 52 and Sample 140 are faults if their Difference in Frequency values had been as in the baseline. In other words, for these two samples the increment in Difference in Frequency decreases the confidence of the classifier that they are faults.
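The IG attributions can be illustrated with a minimal numpy sketch: a tiny hand-built ReLU network stands in for the trained MLP, and the path integral from the baseline to the sample is approximated with a midpoint Riemann sum. All weights and sample values are illustrative, not the study's model:

```python
# Minimal sketch of Integrated Gradients for a tiny hand-built ReLU MLP.
import numpy as np

W1 = np.array([[1.0, -0.5, 0.3], [0.2, 0.8, -0.4]])  # hidden layer (2 x 3)
b1 = np.array([0.1, -0.2])
w2 = np.array([1.5, -1.0])                            # output layer (fault score)

def forward(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0)          # scalar class logit

def grad(x):
    z = W1 @ x + b1
    return W1.T @ (w2 * (z > 0))                      # gradient through ReLU

def integrated_gradients(x, baseline, steps=300):
    alphas = (np.arange(steps) + 0.5) / steps          # midpoint Riemann sum
    avg_grad = np.mean([grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([0.9, 1.4, -0.3])   # a "fault" sample
baseline = np.zeros(3)           # uninformative baseline (dataset average in the paper)
attr = integrated_gradients(x, baseline)

# Completeness axiom: attributions sum to F(x) - F(baseline).
print(attr, attr.sum(), forward(x) - forward(baseline))
```

The printed sums illustrate the completeness property that makes IG attributions interpretable as green (positive) and red (negative) contributions: they add up to the difference between the model output at the sample and at the baseline.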
This analysis shows that, for each sample, the MLP finds different features important for the classification. For example, Sample 227 is classified as a fault mainly because of the above-average value of Flicker; Sample 52 is a fault due to the high value of Wind speed of gust and the low value of Minimum Power Factor; for Sample 304, what matters is that the Difference in Reactive Power is higher than average.
The Minimum Power Factor and Reactive Power are important variables that help explain the current power quality in a power grid. The Power Factor is the ratio of the working power to the apparent power and quantifies energy efficiency: the lower the power factor, the less efficient the power usage of the end-customer. The Reactive Power is the amount of power dissipated in the system. A high amount of reactive power in the system can affect the power quality negatively, as less active power is available to the end-customer [31]. Therefore, it is reasonable to observe a relationship between the low value of the Minimum Power Factor and the high Difference in Reactive Power for the fault samples 52 and 304.
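The relationship between active, reactive, and apparent power described above can be made concrete in a few lines; the numerical values are illustrative:

```python
# Sketch: active power P, reactive power Q, apparent power S, power factor.
import math

def apparent_power(p_kw, q_kvar):
    return math.hypot(p_kw, q_kvar)             # S = sqrt(P^2 + Q^2)

def power_factor(p_kw, q_kvar):
    return p_kw / apparent_power(p_kw, q_kvar)  # PF = P / S

# More reactive power at constant active power lowers the power factor:
print(power_factor(100.0, 10.0))   # high PF, efficient power usage
print(power_factor(100.0, 80.0))   # low PF, much reactive power in the system
```

This is exactly the pattern IG highlights for samples 52 and 304: a low Minimum Power Factor goes hand in hand with a large amount of reactive power.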
Interestingly, the Minimum Power Factor and Difference in Reactive Power did not emerge as important features with the global interpretability approach, which is based on the weight magnitudes of the linear models. Indeed, an approach that averages the contribution of the different factors across the whole dataset is likely to conceal the importance of configurations of feature values that appear only in a few samples. On the other hand, by analyzing samples individually, IG can reveal new patterns in the data and allows us to gain deeper insights into the true causes underlying specific faults.

Conclusions
In this work, we tackled the problem of detecting unscheduled faults in the power grid, which have major consequences for customers, such as industries relying on a stable power supply. In collaboration with the DSO, we built a data set consisting of meteorological and power variables, which monitor factors potentially responsible for power faults. Once the dataset was constructed, we trained different classifiers to detect imminent faults from the values of the meteorological and power variables.
The classification performance was compared in terms of F1 score; the MLP classifier achieved the top performance, followed by the Ridge Classifier. The good classification results motivated interpreting the decision process learned by the models, as a tool to identify the variables that best explain the onset of power faults. We explored two different interpretability techniques. First, we considered the magnitude of the coefficients of the linear models to quantify the importance that, on average, the different features have in determining whether a sample in the dataset is a fault. The results indicated that the amount of Flicker and the Wind speed of gust are the most important variables in explaining the power disturbances. Such a global interpretability approach averages the contribution of the different factors across the whole dataset and, therefore, might fail to reveal interesting configurations of feature values that appear only in a few samples.
As a second interpretability technique, we used Integrated Gradients to interpret the decisions taken by the MLP classifier on individual samples. This second approach allowed us to understand which features were considered important to classify a specific sample as a fault. Interestingly, some samples were classified as faults not only for having high values of Flicker and Wind speed of gust: the IG technique showed that the MLP also classified as faults samples where the Minimum Power Factor was below average or where the Difference in Reactive Power was above average.
The proposed interpretability techniques revealed important patterns in the data, which allow us to gain deeper insights into the underlying causes of power faults. This type of knowledge is fundamental for the DSO operating the grid in our study, which is currently developing strategies for preventing and mitigating incoming faults. In particular, the local power company is installing a large battery system that should be activated right before an incoming power fault, to supply additional power and avoid instability in the power supply. Understanding which variables should be monitored to detect an incoming power fault is, therefore, fundamental to implementing prevention and mitigation strategies.

Supplementary material

A The investigated power grid
The power grid analyzed in this study is a radial distribution system serving an Arctic community located approximately at (69.257°N, 17.589°E). Arva Power Company, the DSO of the power grid, refers to this specific grid as SVAN22LY1. Fig. 8 shows an overview of the whole SVAN22LY1 grid, indicated by green dots. The SVAN22LY1 grid spans over 60 kilometers from the southernmost to the northernmost point and has several branches towards various communities in the north. There are 978 unique utility poles (marked by green dots in Fig. 8) that support the power lines. The black boxes in Fig. 8 represent the electric transformer stations connected to the power grids. The red lines represent power lines with an operating voltage of 66 kV, while the blue lines represent power lines with an operating voltage of 22 kV. The SVAN22LY1 radial grid, covered by green dots, operates at 22 kV. The largest customers connected to the SVAN22LY1 grid are located at the northernmost end of the radial.
The total energy demand in the SVAN22LY1 grid is dominated by the load consumption of the local industry. The industry performs fish-processing activities that are highly seasonal and uses many electrical machines in the production line that require stable power quality. Even minor disturbances in the power supply trigger significantly long interruptions, since the automated production line needs to be reset. In particular, for every short-term power interruption, it is necessary to wait from 40 minutes to 1 hour before resuming production. The consequences of the power disturbances are exacerbated by the topology of the power grid, which has a radial distribution with no alternative power supply in periods with disturbances.

B Dataset construction
Fault reports The reported faults used in this study are logged by a power-quality (PQ) metering system, which was installed in February 2021 in the proximity of the local industries to continuously measure the power quality. The PQ metering system reports all incidents with a voltage variation of ±10% from the nominal values on each phase of a three-phase system with phases A, B, C. According to the standard definition, all variations of ±10% from normal conditions are defined as voltage variations, and a drop larger than 10% is referred to as a voltage dip [11]. Voltage dips can provoke tripping of sensitive components such as industrial machines.
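The ±10% rule can be sketched as a simple classification of per-phase voltage readings. The nominal voltage and the readings below are assumptions for illustration, not values from the PQ meter:

```python
# Sketch of the ±10% voltage-variation rule: a phase below 90% of nominal is
# a voltage dip, above 110% a voltage swell, otherwise normal.
NOMINAL_V = 230.0  # assumed nominal phase voltage

def classify_phase(v, nominal=NOMINAL_V):
    if v < 0.9 * nominal:
        return "voltage dip"
    if v > 1.1 * nominal:
        return "voltage swell"
    return "normal"

# Illustrative three-phase reading during a ground fault: only phase A drops.
phases = {"A": 180.0, "B": 233.0, "C": 231.0}
print({p: classify_phase(v) for p, v in phases.items()})
```

The single-phase dip in this toy reading matches the ground-fault signature described in the main text, where only Phase A falls below nominal.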
Weather measurements The weather variables considered relevant in causing power faults are: wind speed of gust, wind direction, temperature, pressure, humidity, and precipitation. The weather data are collected from areas that are more exposed to harsh weather conditions, such as hills and cliffs near the sea coast. To collect the weather data in the Arctic region of interest, we used the AROME-Arctic weather model. This model is developed by the Norwegian Meteorological Institute (MET) and provides a reanalysis of historical weather data since November 2015 with a spatial resolution of 2.5 kilometers and a temporal resolution of 1 hour.
To collect the weather variables, the geographical coordinates from the weather-exposed areas in the power grid are used as inputs to the AROME-Arctic model. The output from the AROME-Arctic model is a dataset of 6 weather variables from the weather-exposed areas that are analyzed.
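Extracting the model output at a site's coordinates amounts to a nearest-grid-cell lookup. The sketch below uses a toy regular latitude/longitude grid; the actual AROME-Arctic product is distributed on its own projected grid with dedicated access tools:

```python
# Sketch: nearest-grid-cell lookup for a site coordinate on a gridded
# weather product (toy axes, roughly 2.5 km spacing in latitude).
import numpy as np

lat_grid = np.arange(68.0, 70.5, 0.025)   # illustrative latitude axis
lon_grid = np.arange(16.0, 19.0, 0.05)    # illustrative longitude axis

def nearest_cell(lat, lon):
    i = int(np.argmin(np.abs(lat_grid - lat)))
    j = int(np.argmin(np.abs(lon_grid - lon)))
    return i, j, lat_grid[i], lon_grid[j]

# Site coordinates taken from the study's grid location.
print(nearest_cell(69.257, 17.589))
```

In practice, one such lookup per weather-exposed location yields the time series of the 6 weather variables used in the dataset.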
Electricity load measurements It is reasonable to assume that some types of faults are not caused by weather phenomena but originate from external factors that influence the power flows in the grid. To capture these effects, 6 power-related variables from the largest industry connected to SVAN22LY1 are collected. The variables selected as relevant to explain power faults are: difference in frequency, voltage imbalance, difference in active power, difference in reactive power, minimum power factor, and, finally, the amount of flicker in the system. All variables are metered on the three phases (A, B, and C).
A change in power frequency can be caused by an imbalance between energy production and consumption in the system. If the power frequency deviates from its nominal value (50 Hz), the imbalance can cause power disturbances for the end-use customers.
Voltage imbalance is a voltage variation in the power system in which the voltage magnitudes or the phase angles of the different phases are not equal. Rapid changes (large changes within seconds or minutes) in power consumption at large industries are believed to affect the power quality. Therefore, the difference in active and reactive power for each phase within each minute is computed. A large difference indicates high activity at the industries, which the locals report to coincide with a larger probability of faults.
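A minimal sketch of this per-minute difference computation, on an illustrative series of 1-minute active-power readings for one phase:

```python
# Sketch: per-minute change in metered power as a first difference.
import numpy as np

# Illustrative 1-minute active-power readings for phase A (kW).
p_phase_a = np.array([410.0, 412.0, 415.0, 620.0, 618.0, 240.0, 245.0])

diff = np.abs(np.diff(p_phase_a))   # |change| between consecutive minutes
print(diff)                         # large jumps signal rapid load changes
print(diff.max())
```

The same first-difference operation applied to the reactive-power series yields the Difference in Reactive Power feature discussed in the interpretability analysis.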
The minimum power factor represents the relationship between the amounts of active and reactive power in the system: if the minimum power factor is low, there is an increased amount of reactive power in the system. Finally, the amount of flicker in the system is collected.
Flicker is a phenomenon in the power system that is closely connected to voltage fluctuations over a certain time frame [41]. A voltage fluctuation is a regular change in voltage that happens when machinery requiring a high load starts up. In addition, rapid changes in load demand can cause voltage fluctuations. If there are several start-up situations, or if the load varies significantly during a given time frame, a high amount of flicker will be measured in the system. The amount of flicker is particularly relevant for the industry considered in this study, as it operates several large machines that require high loads and have a cyclically varying load pattern. In this study, the flicker time frame is 10 minutes, which is the standard for measuring short-term flicker [28].

The PQ metering system has a 1-minute resolution, while the weather data have a 1-hour resolution. To align the temporal resolution of the different types of variables, the power consumption data are sub-sampled by taking one sample every 60. As an alternative sub-sampling technique, we tested taking the average of the values within each batch of 60 consecutive samples of power measurements. However, the results did not change significantly and, therefore, the former sub-sampling method was adopted.
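The two sub-sampling options can be sketched on a synthetic 1-minute series:

```python
# Sketch: aligning 1-minute power data to 1-hour weather data.
import numpy as np

power_1min = np.arange(180, dtype=float)       # 3 hours of 1-minute samples

subsampled = power_1min[::60]                  # keep one sample per hour (adopted)
averaged = power_1min.reshape(-1, 60).mean(1)  # alternative: 60-sample average

print(subsampled)   # [0., 60., 120.]
print(averaged)     # [29.5, 89.5, 149.5]
```

Both variants produce one value per hour; the paper reports that the choice between them did not significantly affect the results.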

C A brief history of explainability in deep learning
Due to the presence of many non-linear transformations, it is difficult to interpret the decision process of a neural network. During the last decade, considerable research effort has been devoted to developing insights into what a neural network learns and how it makes its decisions. While most of the explanatory techniques were originally developed in the field of computer vision, some of them can also be applied to neural networks that process sequential or vectorial data. Gradient-based approaches aim at identifying which inputs have the most influence on the model scoring function for a given class. The pioneering work of Simonyan et al. [50] proposed computing a saliency map by taking the gradient of the class activation score (usually, the input to the last softmax) with respect to each input feature. The visualization of the saliency maps was subsequently improved by tricks such as clipping the gradients, averaging the gradients after adding Gaussian noise to the original images, and taking the absolute value of the gradients [51].
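The idea of a gradient saliency map can be illustrated on a toy scoring function, using central finite differences in place of backpropagation. The scoring function below is an illustrative stand-in for a network's pre-softmax class activation:

```python
# Sketch: gradient-based saliency = |d(class score)/d(input feature)|.
import numpy as np

def class_score(x):
    # Toy stand-in for a network's class activation score.
    return 2.0 * x[0] ** 2 + 0.5 * x[1] - 0.01 * x[2]

def saliency(x, eps=1e-5):
    g = np.zeros_like(x)
    for i in range(len(x)):              # central finite differences
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (class_score(x + e) - class_score(x - e)) / (2 * eps)
    return np.abs(g)                     # saliency map = absolute gradient

x = np.array([1.0, 1.0, 1.0])
print(saliency(x))                       # feature 0 dominates the score locally
```

In a real network the gradient is obtained by backpropagation rather than finite differences, but the interpretation of the resulting map is the same.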
In [59], the authors propose a method to project the activations of an intermediate hidden layer back to the input space. The procedure consists in approximately inverting the operations of a CNN (affine transformations, ReLU activations, MaxPooling) from the hidden layer to the input layer. The result gives an insight into which details the hidden layer has captured from the input image.
The Guided Back Propagation approach performs the standard gradient back propagation but, when a ReLU is encountered, the gradient is back-propagated only if both the gradient and the ReLU activation in the forward pass are positive [52].
As a drawback, gradient-based methods attribute zero contribution to inputs that saturate the ReLU or MaxPool. To address such shortcomings, a formal notion of explainability (or relevance) was introduced in [2]. In particular, the authors introduced an axiom on the conservation of total relevance, which states that the sum of the relevance of all pixels must equal the class score of the model. The authors propose to distribute the total relevance of the class score to the input features with a method called Layer-wise Relevance Propagation (LRP). The class score is computed as the difference between the score obtained by the actual input and the score obtained by an uninformative input, called baseline. Each time the relevance is passed down from a neuron to the contributing neurons in the layer below, the total relevance of the contributing neurons is preserved. All incoming relevances to a neuron from the layer above are collected and summed before being passed down further. By doing this recursively from layer to layer, the input layer is eventually reached, which yields the relevance of each input feature. The relevance of a neuron can be distributed to its contributing inputs based on the magnitude of the weights of the neural network layers.
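This redistribution and the conservation axiom can be checked on a single linear layer. The weights and inputs below are illustrative; a small epsilon stabilizes the denominators, as in the epsilon variant of the LRP propagation rule:

```python
# Sketch of LRP relevance redistribution for one linear layer, with a check
# of the conservation axiom: relevance into the inputs equals relevance of
# the outputs.
import numpy as np

x = np.array([1.0, 2.0, -0.5])           # layer input
W = np.array([[0.5, -1.0],
              [0.3, 0.4],
              [0.8, 0.2]])               # weights (inputs x outputs)

z = x[:, None] * W                        # contributions z_ij = x_i * w_ij
out = z.sum(axis=0)                       # layer pre-activations

R_out = out.copy()                        # relevance of the output neurons
eps = 1e-9                                # stabilizer (epsilon rule)
R_in = (z / (out + eps * np.sign(out))) @ R_out   # redistribute proportionally

print(R_in, R_in.sum(), R_out.sum())      # the two sums match (conservation)
```

Each output neuron's relevance is split among its inputs in proportion to their contributions z_ij, so summing the incoming shares preserves the total relevance, layer by layer.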
While LRP follows the conservation axiom, it does not formalize how to distribute the relevance among the input features. To address this problem, DeepLIFT [47] enforces an additional axiom on how to propagate the relevance down, following the chain rule like gradients.