Addressing Data Deficiencies in Outage Reports: A Qualitative and Machine Learning Approach

Abstract—This study investigates outage statistics in the Swedish power system. More specifically, this paper delves into the critical analysis and enhancement of data quality, focusing on inconsistencies and missing values, i.e. unknown outage causes and unidentified faulty equipment. By carefully examining the data, noticeable gaps and deficiencies are revealed. Thus, a format for improving outage reporting using a database with 3 relations (outage summary, outage breakdown and customer breakdown) is proposed. In addition to a qualitative analysis of the data, various machine learning algorithms are explored and tested for their capability to predict the unknown values within the dataset, thereby offering a twofold solution: enhancing the accuracy of outage data and facilitating deeper, more accurate analytical capabilities. The findings and proposals within this work not only illuminate the current challenges within outage data management but also pave the way for more robust, data-driven decision-making in outage management and policy formation.


I. INTRODUCTION
The reporting of reliability indices by Distribution System Operators (DSOs) to regulatory bodies is necessary both to ensure fair compensation to end-users and to provide oversight on the vulnerability of the power system. Reliability data is also an enabler for grid development plans, such as prioritizing which sections of the grid should be reinforced or evaluating bottlenecks in the energy transition towards renewables. Consequently, the quantity, accuracy and detail with which outages are reported, and from which reliability indices can be deduced, limit how useful the data is.
Nevertheless, real-world statistical data offers a strong opportunity to form comprehensive models that are directly driven by this data [1], [2]. Moreover, processing real outage data has been shown to be crucial for understanding different phenomena in the power system. Some examples include obtaining the empirical probability distribution of transmission line restoration times from 14 years of field data from a large utility [3], or using historical transmission line outage data to obtain the network topology in such a way that cascades of line outages can be easily located in the network [4]. Historical outage data can be used to estimate the effect of weather on cascading failures and to understand which historical initial line outages are likely to lead to further cascading failures [5]. In [6], standard distribution system data is used to extract resilience metrics and to decompose resilience curves into outage and restoration processes. In [7], [8], transmission outage and inventory data collected in Transmission Availability Data Systems are used to identify and analyze weather-related transmission events and quantify their impact on the North American Bulk Electric System.
Data collection at a distribution level was introduced in Sweden in early 2000 [9]. Initially, the Energy Market Inspectorate (EI) collected aggregated statistics on power outages from DSOs. From 2010 onward, it expanded its scope, gathering more granular data at the customer level to monitor the continuity of supply and formulate incentive schemes [9]. Complementary to EI, the Swedish Energy Companies (Energiföretagen Sverige) annually publishes outage statistics derived from voluntary reports from the DSOs. This dataset is commonly known as DARWin [10]. DARWin statistics have been a valuable source for research, being utilized in various analyses as evidenced by [11]-[14].
Previous analysis addressing outage statistics and trends in Sweden [14] revealed a substantial number of incidents in DARWin reports characterized by unknown outage causes and unidentified faulty equipment. More specifically, over a period of 10 years (2009-2019), 35% of cases had an unknown outage cause, while in 33% of cases faulty equipment could not be identified. This can be seen as a major impediment to obtaining an overall view of the power system's performance. The same issue is highlighted in [12], which compares Swedish and Finnish reliability performance. While the Finnish power system has mainly been suffering from natural events, where the unknown interruption reasons are negligible, in Sweden there is an alarming increase in outages with unknown causes.
Besides missing information in the reports, i.e. unknown values, there is also a problem with inconsistencies in reporting or data accuracy [14]. Discrepancies within outage reports significantly impede meticulous analysis and, consequently, compromise the ability to make well-informed decisions. Such an environment of uncertainty can potentially have a cascade effect, where the initial data inaccuracies compound over time, thereby magnifying the margin of error in subsequent analyses and decision-making processes.
This work, therefore, analyzes both voluntarily reported outage events and obligatorily reported reliability statistics in Sweden to identify anomalous values and missing data using transparent heuristics. Besides the qualitative analysis of the data, we also propose a Machine Learning (ML) based imputation method that suggests the most likely causes and component locations of unlabelled outages based on exogenous technical and financial data. Regulators are therefore able to flag anomalous reporting automatically based on an interpretable rule-based system, while DSOs can speed up part of the reporting process by picking the correct suggestion from the ML model.
The paper is structured as follows: Section II gives an overview of the data used in the analysis and the proposed methodology. Section III gives a qualitative analysis of deficiencies in outage reports, such as inconsistencies and "unknown" values. Moreover, it proposes recommendations for improving outage reporting. Section IV presents machine learning approaches for the classification of unknown outage causes and faulty equipment. The last section discusses and concludes the work.

II. DATA COLLECTION AND METHODOLOGY
The data analyzed includes voluntarily submitted outage reports from the Swedish energy companies (Energiföretagen Sverige, DARWin) available at [10], as well as obligatory reliability statistics, financial and technical data submitted to EI at [15]. The outage events reported to DARWin include a total of 786 026 unplanned outages reported in the period of 2007-2019. According to Energiföretagen's reports in [16], the DARWin dataset covers approximately 80% of all customers in Sweden. The EI data covers all DSOs and thus all customers in Sweden, but only provides data at an annual resolution. This study focuses primarily on the DARWin dataset, while the EI datasets are used both to validate the accuracy of numbers reported to DARWin and to act as covariates for the considered ML models.
In [14], outage data from the DARWin dataset was categorized and examined based on specific criteria, including the voltage level of the breaking device, the outage cause, and the type of faulty equipment. Building on that work, this paper specifically addresses the data deficiencies identified in [14], provides a qualitative analysis, and introduces ML algorithms along with recommendations to enhance outage reporting.

A. Inconsistencies
Upon a detailed review of the outage reports, discrepancies in the customer interruption times were identified. A subset of these interruptions is illustrated in Table I. The methodology used to calculate customer interruption time involves multiplying the number of customers (both low-voltage, LV, and high-voltage, HV) by the total duration of the outage. Notably, only the top entry in Table I reflects the accurate customer interruption time. By analyzing the reported values and aggregating them on an annual basis, the resulting disparity is quite pronounced, as depicted in Figure 1. Customer interruption time is directly proportional to the reliability index SAIDI (System Average Interruption Duration Index). Consequently, for the year 2019, the imputed value of SAIDI is over 12 times greater than the value that was initially reported. What causes such disparity? One primary factor is the lack of documentation regarding the progressive developments of an outage. Take, for instance, an outage affecting 1,000 customers. While the outage might initially impact 1,000 customers, the power could be progressively restored: 200 customers might regain power after 15 minutes, followed by an additional 300 customers in the subsequent 15 minutes, and so on. Unfortunately, such incremental restorations are often overlooked in standard reports.
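The effect of overlooking progressive restoration can be sketched numerically. The figures below are illustrative, mirroring the 1,000-customer example above rather than any actual DARWin entry:

```python
# Sketch: why ignoring progressive restoration inflates customer
# interruption time. All numbers are illustrative, not from DARWin.

def interruption_time_naive(customers, duration_h):
    """Number of customers multiplied by the total outage duration (hours)."""
    return customers * duration_h

def interruption_time_stepwise(restoration_steps):
    """Sum over restoration steps of customers restored times the hours
    they actually spent without power.
    restoration_steps: list of (customers_restored, hours_without_power)."""
    return sum(c * h for c, h in restoration_steps)

# 1,000 customers; power restored progressively over one hour.
steps = [(200, 0.25), (300, 0.50), (500, 1.00)]
naive = interruption_time_naive(1000, 1.00)   # assumes all out for 1 h
actual = interruption_time_stepwise(steps)    # accounts for restoration
print(naive, actual)  # 1000.0 700.0 (customer-hours)
```

Aggregated over a year of outages, systematically reporting the naive value instead of the stepwise one inflates SAIDI in exactly the way Figure 1 depicts.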
Another reason could be erroneous calculations or even duplicated entries.Within a single utility company, tracking and managing outages is a collaborative effort, involving multiple departments and roles.Occasionally, outages may be reported both at regional and local levels, leading to duplication.Reporting structures vary not only within individual utilities but also across all DSOs, adding significant complexity to the process.

B. Unknowns
Data analysis reveals that, over the period from 2007 to 2019, the outage cause remains unidentified in 32.94% of cases, constituting nearly one-third of the reported data (Fig. 2). Similarly, in 32.43% of cases the faulty equipment is also unknown (Fig. 3). The predominant category in both instances is 'unknown', posing significant challenges to determining overall system performance.
Further granulation of the data reveals that for incidents involving unknown faulty equipment, the outage cause cannot be determined in 60.43% of cases (Fig. 4). Similarly, in instances of an undetermined outage cause, 59.48% also feature unidentified faulty equipment (Fig. 5).
A potential contributor to the high percentage of unknowns may be weather-related outages. Research indicates that 21% of outages in Sweden stem from weather-related events [13].
Such events can initiate cascading outages, complicating the identification of the cause and damaged equipment. Additionally, as previously noted, a single utility may utilize multiple reporting systems. It is not uncommon for the initial cause of a failure to be unidentified, and even when subsequently determined, it is sometimes only corrected in one system and not universally updated across all platforms.

C. Addressing the Deficiencies
A recommendation for improving outage reporting using a database with 3 relations (outage summary, outage breakdown and customer breakdown) is shown in Fig. 7. This format builds on the one suggested by Energiforsk's reference group in [16]. By providing 3 levels of detail, this format enables researchers and regulators to perform more detailed reliability analysis while preserving the anonymity of DSO customers.
Since DSOs can merge or go bankrupt, EI's organization number is used to uniquely identify a DSO in Fig. 7a. This has the added benefit of enabling comparisons between DARWin data and EI data. Together with the 3-letter grid area codes at [17], this allows the ownership of different grid sections to be traced. Additional bidding zone and municipality information allows for alternative groupings, such as comparing outage statistics between counties.
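A minimal sketch of the proposed three-relation structure, expressed as SQLite tables created from Python. The table and column names below are illustrative assumptions based on the description in the text (organization number, grid area code, restoration steps, customer groups), not the exact schema of Fig. 7:

```python
import sqlite3

# In-memory database holding the three proposed relations; all
# names are assumptions for illustration, not the Fig. 7 schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outage_summary (
    outage_id        INTEGER PRIMARY KEY,
    org_number       TEXT NOT NULL,  -- EI organization number of the DSO
    grid_area_code   TEXT,           -- 3-letter grid area code
    bidding_zone     TEXT,
    municipality     TEXT,
    cause            TEXT,           -- e.g. 'Under Investigation'
    faulty_equipment TEXT
);
CREATE TABLE outage_breakdown (
    outage_id          INTEGER REFERENCES outage_summary(outage_id),
    step_start         TEXT,         -- ISO timestamp of restoration step
    step_end           TEXT,
    customers_restored INTEGER
);
CREATE TABLE customer_breakdown (
    outage_id      INTEGER REFERENCES outage_summary(outage_id),
    customer_group TEXT,             -- LV / HV, anonymized grouping
    customers      INTEGER,
    interruption_h REAL
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['customer_breakdown', 'outage_breakdown', 'outage_summary']
```

The foreign keys from the breakdown relations back to the summary relation are what let a regulator drill down from an annual aggregate to individual restoration steps without exposing individual customers.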
Some of the irregularities in the data identified in this study are addressed in the new format. The lack of information embedded in 'unknown' outage causes and faulty equipment is reduced by splitting this label into the more explicit "Under Investigation" and "Could not be determined" categories. The update history then indicates whether this investigation was followed up on and can thus be used to flag outages for data quality inspections. An inferred cause could also be included for 'unknown' labels using the ML model described in Section IV. However, since the ML tool's inferred labels are not guaranteed to be correct, this is best kept as a separate analysis tool from the database.
The discrepancies in customer interruption time are addressed by providing a breakdown of the outage event, as shown in Fig. 7b. However, this does not capture that different customers may experience different outage developments after a fault. Fig. 7c therefore provides a further breakdown by customer group. If detailed Geographic Information System polygons can be provided, this would also enable DSOs and researchers to try to predict weather-related outages with greater granularity.

IV. MACHINE LEARNING APPROACHES

A. Methodology
As mentioned previously in Section III, there is a non-negligible amount of missing data in the DARWin dataset. Among these are the 'unknown' category for the cause of an outage (e.g. a lightning strike) and the type of faulty equipment (e.g. a substation) at which the fault originated. To assist field engineers in making these classifications, we evaluate several ML models that can suggest the most likely cause and type of faulty equipment for an outage given a number of covariates. The chosen methods include K Nearest Neighbour (KNN), Random Forest (RF) and Bagging classifiers. These were introduced in [18], [19]-[20], and [21]-[22], respectively. RF and Bagging classifiers both use decision trees, but the former is built by projecting onto random sub-spaces of the input data, while the latter draws samples randomly with replacement. Given the value of interpretability provided by these well-established models, a deep learning approach was not considered.
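The three model families can be sketched on synthetic data as follows. scikit-learn is assumed; the hyperparameters here are library defaults, not the Optuna-tuned values of Table III, and the dataset is a stand-in for the DARWin covariates:

```python
# Sketch: the three classifier families named above, fit on synthetic
# multi-class data. Default hyperparameters, illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for outage covariates and (merged) class labels.
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=6, n_classes=3,
                           random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # Bagging bootstraps samples; RF additionally randomizes features.
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```

Both ensemble methods default to decision trees as their base estimator, which is why, as noted in the results, their performance tends to be similar.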
The proposed classification, based on financial and technical covariates, assumes that the missing data has a similar statistical distribution to the known data. In the field such an assumption is safe, since engineers can correct the suggestion. However, filling in historical missing data using this method would require further validation, for instance by cross-referencing with photographs of each incident.
To regularize the training process, some pre-processing is performed on the input data. To allow the models to potentially pick up on temporal patterns, the input data is augmented with sine and cosine encodings of the year, month and day when the outage started. To prevent a specific feature from dominating simply because it has larger numeric values, the features are scaled to have a mean of 0 and a standard deviation of 1. To allow the models to interpret categorical values, quantile encodings are used for all categorical features (such as company ID) based on the customer outage time (in hours).
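The three pre-processing steps can be sketched as below. This is a simplified stand-in for the actual pipeline: the quantile encoding here uses the per-category median of the target as one illustrative choice of quantile:

```python
import numpy as np

def cyclical_encoding(value, period):
    """Sine/cosine encoding so that, e.g., month 12 lands next to month 1."""
    angle = 2 * np.pi * value / period
    return np.sin(angle), np.cos(angle)

def standardize(x):
    """Scale a feature to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def quantile_encode(categories, target):
    """Replace each category by the median target value observed for it
    (an illustrative stand-in for the quantile encoding described above,
    with customer outage time in hours as the target)."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    medians = {c: np.median(target[categories == c])
               for c in np.unique(categories)}
    return np.array([medians[c] for c in categories])

# December (12/12) wraps around to meet January on the unit circle.
month_sin, month_cos = cyclical_encoding(12, 12)
```

The cyclical encoding matters because a plain integer month would tell the model that December and January are maximally far apart, hiding seasonal weather patterns.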
Outages across all years and participating DSOs are split into a training set (80%) and a test set (the remaining 20%) with random shuffling. Bayesian hyperparameter optimization provided by [23] is performed with 5-fold stratified cross-validation (on the training set) to determine a reasonable set of hyperparameters. A separate model is trained to predict the outage cause class and the equipment type class. The classes are merged from the original labels according to the mapping in [14]. Figures 2 and 3 clearly indicate that the DARWin dataset contains a varied number of examples across different classes. Consequently, since the classes are not balanced, a balanced accuracy score is used to evaluate the performance:

balanced accuracy(y, ŷ, ω) = (1 / Σ_i ω̂_i) Σ_i 1(ŷ_i = y_i) ω̂_i,  with  ω̂_i = ω_i / Σ_j 1(y_j = y_i) ω_j,

where ω_i represents the sample weight and y_i is the true class of sample i. ω̂_i and ŷ_i denote, respectively, the weight adjusted to account for imbalances between the classes and the predicted class.
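With uniform sample weights (ω_i = 1), the balanced accuracy score reduces to the mean per-class recall, which is why it penalizes a model that ignores rare classes. A minimal sketch:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Accuracy with each sample re-weighted by the inverse frequency of
    its true class, i.e. the formula above with all sample weights w_i = 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, counts = np.unique(y_true, return_counts=True)
    freq = dict(zip(classes, counts))
    # Adjusted weight: 1 / (number of samples in the sample's true class).
    w_hat = np.array([1.0 / freq[y] for y in y_true])
    return np.sum(w_hat * (y_true == y_pred)) / np.sum(w_hat)

# A classifier that always predicts the majority class 0 scores 0.75 in
# plain accuracy here, but only 0.5 in balanced accuracy.
print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```

This matches scikit-learn's definition of the balanced accuracy score, which is one way to obtain it in practice instead of implementing it by hand.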

B. Results
Based on the model performances shown in Table II, using the Optuna-optimized hyperparameters shown in Table III, the RF classifier has the highest precision, recall and f1-score. The similar performance of the RF and Bagging classifiers is likely because both ensemble methods rely on decision trees as their base estimator. The RF classifier is about 10-40% faster to execute but requires 2-3x more RAM to store. The performance of the models across each class varies, as shown in the confusion matrices in Table IV and Table V. For instance, higher performance is seen for predicting weather-related outages than testing-related ones. This is most likely due to the relatively few samples available for some classes, such as testing, and the difficulty of predicting some classes based on the provided covariates, such as digging. Overall, it appears predicting the cause of an outage is easier than predicting the type of faulty equipment.

In order to validate the assumption that the distribution of unknown values is similar to the known ones, the trained model can be used to predict the unknown cause and equipment type for outages in the DARWin dataset. However, this still comes with the caveat that any distribution built from imperfect models will itself be imperfect, as shown by the differing number of true labels and predicted labels in Table IV and Table V. Nevertheless, Figure 8 appears to indicate that there is some difference between the distribution of the known labels ('known only' in the figures) and the predicted labels for samples with 'unknown' outage cause or equipment type. This may suggest there are confounding variables at play. For example, it may be more difficult to attribute a cause to certain types of outages.
Submitted to the 23rd Power Systems Computation Conference (PSCC 2024).

Fig. 3: Number of unplanned outages according to the type of faulty equipment.
(a) Outage cause by model. (b) Equipment type by model (categories: Underground Cable, Uninsulated Overhead Line, Fuse Box, Branch/Cable Cabinet, Enclosed Transformer, Secondary Substation, Insulated Overhead Line, Other Line, Primary Substation, Underwater Cable, Other Substation Type, Regional Substation).

Fig. 8: Distribution of class label counts (as a percentage of the total number of samples) based on model predictions on data with unknown labels (orange, green and red), compared to the distribution of label counts for all known samples (blue).

TABLE I: Snippet from a DARWin report; the rightmost column does not always match the reported customer interruption time.
Fig. 7: Suggested new format for outage reporting (in Sweden). All data is fictitious. Summary of Outages: * = Experienced at least one interruption during the outage event. ** = Energy Market Inspectorate.

TABLE II :
Performance of the trained models on predicting outage causes on an unseen and stratified testing set.

TABLE III :
Hyperparameters of each model, after 50 Optuna trials.

TABLE IV :
Outage cause confusion matrices for all 3 trained models, normalized by the number of true labels.