A Snapshot-Stacked Ensemble and Optimization Approach for Vehicle Breakdown Prediction

Predicting breakdowns is becoming one of the main goals of vehicle manufacturers, enabling better resource allocation and reducing costs and safety issues. The utilization of vehicle sensors rests on the fact that early detection of anomalies facilitates the prediction of potential breakdown issues which, if otherwise undetected, could lead to breakdowns and warranty claims. However, making such predictions is too complex a challenge to solve using simple predictive models. The strength of heuristic optimization techniques in solving NP-hard problems, and the recent success of ensemble approaches on various modeling problems, motivated us to investigate a hybrid optimization- and ensemble-based approach to tackle this complex task. In this study, we propose a snapshot-stacked ensemble deep neural network (SSED) approach to predict vehicle claims (in this study, a claim refers to a breakdown or a fault) by considering vehicle operational life records. The approach includes three main modules: Data Pre-processing, Dimensionality Reduction, and Ensemble Learning. The first module runs a set of practices to integrate various sources of data, extract hidden information and segment the data into different time windows. In the second module, the most informative measurements for representing vehicle usage are selected through an adapted heuristic optimization approach. Finally, in the last module, the ensemble machine learning approach utilizes the selected measurements to map vehicle usage to breakdowns for the prediction. The proposed approach integrates and uses two sources of data collected from thousands of heavy-duty trucks: Logged Vehicle Data (LVD) and Warranty Claim Data (WCD). The experimental results confirm the proposed system's effectiveness in predicting vehicle breakdowns.
By adapting the optimization and snapshot-stacked ensemble deep networks, we demonstrate how sensor data, in the form of vehicle usage history, contributes to claim predictions. The experimental evaluation of the system on other application domains also indicated the generality of the proposed approach.


Introduction
Machine Learning-based approaches for predictive modeling have become one of the main tools in the Predictive Maintenance (PdM) strategy of the automotive industry, to improve the overall maintenance and reliability of operations, and to reduce costs and safety issues [1]. Such predictive models attempt to model and predict the breakdown of an asset (e.g., truck, bus, or car) which, if it occurred within the agreed warranty period, could lead to a warranty claim. Such systems are valuable software technology that minimizes unexpected down-time which, in turn, improves safety and asset reliability. PdM is highly valuable for reducing costs, as it means maintenance is performed only when necessary. Based on the figures reported in [2], maintenance is one of the leading vehicle operational expenses, accounting for around 59% of overall costs. Furthermore, under the PdM umbrella, claim or breakdown prediction of upcoming down-time significantly decreases the risk to human life in, for example, heavy-duty vehicles (in this study, the data source for the target labels is a warranty claim database; we therefore use down-time, claim, breakdown and fault interchangeably, all pointing to the concept of a component defect leading to a claim in the dataset).
Heavy-duty vehicles are complex systems, with numerous possible specifications, operations and driver behaviors, wherein component down-times can originate from multiple sub-components failing from various causes. The correct prediction of breakdowns is critical and essential, because failing to precisely recognize such breakdowns leads to increased maintenance costs, increased customer safety risks, decreased customer satisfaction, and lowered brand value. All of these factors add up to a significant loss for companies. Thus, correctly predicting faults that could cause customer or company loss has become an important goal in today's modern industrial strategy. Under this vision, many studies have investigated the prediction of warranty claims or component breakdowns, reliability, and degradation in the context of the vehicle industry, utilizing machine learning and statistical approaches, such as deep neural networks, recurrent neural networks, support vector machines or stochastic approaches [3][4][5]. Diagnostic models were developed, based on the data collected from machines (e.g., industrial machines, turbines, vehicles, trains, etc.), to estimate the performance and condition of a component. Many of the studies utilized stochastic approaches to estimate component failures [6,7]. Many studies have focused on estimating the remaining useful life (RUL) of a component, utilizing optimization or neural network algorithms [7][8][9][10]. For example, Yang et al. [11] studied claim forecasting, where they focused on product usage rate and introduced two predictive models; the authors considered a Weibull-distributed time assumption to forecast the number of breakdowns. More recently, an interesting work by Chehade et al. [12] studied the problem of data maturation in predicting warranty claims. They utilized a Gaussian mixture distribution of claim patterns from similar vehicles to forecast vehicle warranty claims as early as possible [12].
From the literature, we observed that considerable progress has been made in the context of predicting upcoming breakdowns in the automotive industry by adopting machine learning and statistical solutions. We also noticed that most of these approaches rely on the age and lifetime of a particular component to estimate its breakdown. However, several recent studies in such areas [13,14] partially use vehicle usage to predict the need for maintenance to avoid breakdowns. Complexity, both in formulating the problem and in the high dimensionality of the usage data, makes the task challenging, requiring further investigation.
Although the works mentioned vary widely in terms of the application context, problem to solve, and technique developed to tackle the challenge, machine learning has proved to be successful, particularly in the automotive industry [15]. However, there is still room for improvement (in terms of prediction accuracy and explainability) in existing claim prognostic systems. We believe that such forecasting operations can be improved by incorporating AI and advanced deep neural network approaches.
This motivated us to develop an ensemble solution that takes advantage of hundreds of sensor signals to predict breakdowns (which often lead to warranty claims, the source of our target variable). However, the challenge of high-dimensional data needs to be dealt with. Thus, in this study, we propose an ensemble method to extract and select the best representation of the data (the most informative features) to build the predictive models for warranty claim prediction (WCP). The approach includes three modules. In the first module, that of "Data pre-processing", we developed a set of practices to integrate various sources of data, to extract relevant information and to segment the data into various time windows. In the second module, that of "Dimensionality Reduction", we designed an optimization procedure, using a Genetic Algorithm (GA) coupled with an Elastic Net, to translate high-dimensional data into a representation with lower dimensionality without compromising prediction performance. The GA is a method inspired by the evolutionary process and is used for solving constrained or unconstrained optimization problems [16]. The method runs several GA operators that repeatedly modify individuals over several generations to reach the best solutions.
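As a rough illustration of the GA loop just described (selection, crossover and mutation over several generations), the following stand-alone sketch evolves binary feature masks. The fitness function, operators and hyper-parameters here are our own toy choices, not the paper's configuration:

```python
import random

N_FEATURES = 10
POP_SIZE = 20
N_GENERATIONS = 30
random.seed(42)

def fitness(mask):
    # Placeholder objective: reward "useful" features (here, even indices
    # stand in for informative sensors) and lightly penalize mask size.
    return sum(1 for i, bit in enumerate(mask) if bit and i % 2 == 0) - 0.1 * sum(mask)

def crossover(a, b):
    # Single-point crossover of two binary masks.
    point = random.randrange(1, N_FEATURES)
    return a[:point] + b[point:]

def mutate(mask, rate=0.05):
    # Flip each bit with a small probability.
    return [1 - bit if random.random() < rate else bit for bit in mask]

# Random initial population of feature masks.
population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(POP_SIZE)]

for _ in range(N_GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]  # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=fitness)
```

In the actual module, `fitness` would be replaced by the predictive performance of a model trained on the selected measurements.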
In the third module, that of "Snapshot-stacked Ensemble", which contains two phases, we developed an ensemble deep neural network that uses the measurements (predictors, measurements, and features all refer to the same concept of sensor data) to map vehicle usage to claims. This module was developed by means of a snapshot-stacked ensemble method to increase the performance of the predictive model. In this fashion, several training learners are built from different perspectives to solve the complex task of claim prediction from vehicle usage. This is achieved by horizontally aggregating the outputs of several snapshot models, utilizing a Cyclic Cosine Annealing Schedule (CCAS) [17] in accordance with the principle of diversity, to improve predictive model performance. Indeed, each snapshot refers to a separate deep neural network, generated during one training process and saved, which is then used to build an ensemble model. These snapshots yield a meta-learner capable of learning from various models (deep neural networks) to increase the accuracy of the claim prediction. In other words, we exploit the outcome of each snapshot to build an ensemble of networks for better prediction.
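The cyclic cosine annealing schedule that drives the snapshots can be sketched as follows; the learning-rate maximum, epoch count and number of cycles below are illustrative values, not the paper's settings:

```python
import math

def cyclic_cosine_lr(epoch, total_epochs, n_cycles, lr_max):
    """Learning rate at `epoch` under cosine annealing with warm restarts:
    the rate decays from lr_max toward 0 within each cycle, then restarts."""
    epochs_per_cycle = total_epochs // n_cycles
    pos = epoch % epochs_per_cycle  # position within the current cycle
    return lr_max / 2 * (math.cos(math.pi * pos / epochs_per_cycle) + 1)

total_epochs, n_cycles, lr_max = 60, 3, 0.1
schedule = [cyclic_cosine_lr(e, total_epochs, n_cycles, lr_max)
            for e in range(total_epochs)]

# A snapshot of the network is saved at the end of each cycle, where the
# learning rate is at its lowest and the model sits in a local minimum.
snapshot_epochs = [c * (total_epochs // n_cycles) - 1
                   for c in range(1, n_cycles + 1)]
```

Each warm restart pushes the network out of its current minimum, so the saved snapshots are diverse models that can later be stacked.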
To the best of our knowledge, this study is the first to investigate the performance of ensemble optimization and snapshot-stacked deep neural networks for translating higher dimensional data into a lower dimension to be utilized for claim prediction. Earlier, fault estimation was performed through manual inspection and statistical approaches. Applying artificial intelligence techniques to sensor data automates fault detection, requiring less effort and achieving greater accuracy in a timely fashion [18].
The following research questions (RQs) further elaborate the investigative objectives of our proposed approach:
• RQ1-Measurement selection: How could vehicle usage best be extracted for use in claim prediction, and to what extent could the extracted features contribute to the predictive model?
• RQ2-Claim prediction based on vehicle usage: To what extent could ensemble deep neural networks contribute to, and improve, claim prediction in the automotive industry?
Our study aimed to address claim prediction by mapping vehicle usage to future breakdowns. First, we concentrated on selecting a subset of features that would best represent vehicle usage. Since heuristic and optimization solutions have demonstrated their capability to tackle NP-hard problems in various domains [19,20], we hypothesized that the genetic method would be able to address this issue. Thus, to answer RQ1, we adapted the evolutionary approach to feature selection so as to select the sub-optimal features. The evaluation was performed over several generations and compared with other standard deep neural network and ensemble approaches. The contribution of the extracted features to the predictive model was evaluated by means of an A/B test, guided by the aforementioned RQ1. We report on the results of that optimization and A/B test to show how the selected and extracted features positively affect the performance of the model. Second, we maintained that the optimization could perform better when coupled with an ensemble learning approach. Hence, to answer RQ2, we developed a snapshot-stacked ensemble learning system with two phases, generating multiple diverse models that were used to build a meta-learner. The meta-learner was then utilized for the final claim or breakdown prediction. The reported figures confirm that this development improves claim prediction performance compared with other ML approaches. The integration of these two methods into a predictive maintenance strategy, which our research outlines, constitutes this paper's main contribution. Our contributions in this study are as follows:
• Whereas previous studies mostly used limited features in their prognostic systems, we used data from hundreds of real sensors (LVD) on heavy-duty trucks to model and map vehicle usage to component breakdowns.
• Considering the high dimensionality of sensor data, an optimization approach is introduced to extract the best representation of the data characterizing vehicle usage, in order to build an accurate predictive model for warranty claim forecasting.
• Considering the low classification accuracy of claim prediction using vehicle usage, a snapshot-stacked ensemble deep learning approach is proposed in this paper. Compared with existing algorithms, our method effectively improves classification performance.
The rest of the paper is organized as follows. Section 2 presents related studies. In Section 3, we describe the data used. In Section 4 we explain the proposed approach. Section 5 covers experimental evaluation and results, followed by a discussion and summary of the work in Section 6.

Related Works
Predictive maintenance techniques have previously been investigated and developed to examine and monitor the condition of tools and equipment to estimate when maintenance needs to be performed. In this regard, identifying component failures proffers opportunities for companies to take preemptive action, in the form of predictive maintenance, to minimize breakdowns and prevent more significant issues in the long term. This mitigates safety risks, lowers costs and even supports sustainable production in smart cities [21,22]. Many attempts have been made [23][24][25] to develop various statistical and machine learning solutions to predict component failures in a wide range of applications such as the following: manufacturing [26,27], automotive [28][29][30][31], and energy [32]. Prognostic models have been based on data collected from machines, social networks, and other sources to estimate the performance and condition of components or the number of upcoming warranty claims [33]. These models widely used statistical approaches to estimate component failures [6,7,34,35]. For example, in the early 1990s, in [34], a log-linear Poisson model was utilized to build a forecasting system based on the dates of warranty claims to predict failures. Later, Fredette et al. [5] developed a mixed Non-homogeneous Poisson Process (NHPP) to tackle a similar problem, that of component down-time prediction. An interesting remaining useful life (RUL) forecasting approach was introduced in [7], in which the Wiener process [36] with drift was utilized to model degradation in component sensors. Recently, in [33], social media data was used to improve performance in daily warranty claim prediction. To tackle the challenge of warranty data maturation, an intriguing study was developed by Chehade et al. [12]. This investigation used warranty claim data to devise a Gaussian mixture distribution of claim patterns at current and future maturation levels. 
They took advantage of Bayesian theories to assess the conditional posterior distribution of warranty claims at future maturation based on the available data at the current maturation level.
To the best of our knowledge, most studies focusing on forecasting problems use neural networks to estimate the RUL of components and the state of complex machines [37]. One example is in the context of energy production [38], wherein a self-evolving maintenance scheduler technique was proposed to detect damage in the gearbox bearings of wind turbines. The technique uses an ANN to establish a condition monitoring system (CMS) based on supervisory control and acquisition data. To tackle a similar problem, in [39] a Sparse Autoencoder-based Deep Neural Network (SAE-DNN) approach was developed to detect bearing breakdowns in multi-component systems.
A typical optimization approach was introduced in [9], using a Relevance Vector Machine with the support of ant colony optimization (ACO) for gearbox fault detection.
To tackle a similar problem, a neural network-based forecasting system was introduced in [40], based on a dynamic Cuckoo search optimization algorithm. Such forecasting is crucial, since it can help manufacturers lower maintenance costs and increase customer satisfaction and safety.
Lifetime distribution estimation as a practical solution has gone a long way in component fault estimation. For instance, a probabilistic model was developed by Singpurwalla et al. in [41], wherein time, and time-dependent quantity, were utilized to build a forecasting model. Later, Kaminskiy et al. [42] proposed a component failure prediction approach, based on piece-wise application of the Weibull and exponential distributions, to predict intrinsic failure period and early failure period. Similarly, in [43] two variables of age and mileage of the vehicle were considered to estimate the mean cumulative number of breakdowns.
Among the vast number of approaches introduced and developed for various prediction tasks in many applications, ensemble techniques have been quite successful. Ensemble machine learning approaches are techniques that integrate several base ML models to provide improved performance [44][45][46][47][48]. As illustrated in Figure 1, these techniques can be divided into three categories: Stacking, Bagging, and Boosting. The Stacking, or stacked generalization, technique [49] uses and combines predictions from multiple predictive models to build a new model on the training data. This model is then utilized for the final prediction on the test data. Bagging is a technique that seeks a mixed group of ensemble predictive models built from various parts of the training data [50]. Statistical approaches, such as averaging or voting, are then used to combine the predictions of the members. Boosting is another ensemble approach that creates a strong predictive model from several weak classifiers [51]. In this approach, the training data is used to build a predictive model; each subsequent model then tries to correct the mistakes of the previous classifier. This process continues until the maximum number of predictive models is reached or a perfect prediction is acquired. Researchers working on applications in different sectors have also taken advantage of ensemble methods [52]. In the context of claim and fault detection, we observed interesting studies in different sectors [53,54]. For example, a bagging ensemble-based approach was developed in [55] to forecast faults in a single-shaft industrial gas turbine. In this approach, the fault prognosis was based on weighted voting to improve the robustness of the fault detection technique. To tackle the same problem in a similar domain, in [56] a deep neural network ensemble approach was developed and applied to a wind turbine data series.
This work trained the network offline on a training set built from an imagination matrix that transformed the data series through segmentation technology. The network was then retrained via transfer learning, based on the renewed training set, to acquire enough knowledge to forecast the fault. In [57], an ensemble approach was introduced to predict the degradation of track geometry, wherein different models were developed by means of regression, deterioration, and classification. In the context of electrical systems, ref. [54] introduced an ensemble arc series fault detection approach, wherein a boost converter constant power load (CPL) and a buck converter CPL were developed to study different arc fault behaviors.
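The stacking idea reviewed above (base models' predictions on the training data become the inputs of a meta-learner) can be illustrated with a deliberately trivial sketch. The base "models" here are threshold rules and the meta-learner is a simple accuracy-weighted vote, not the deep networks used in the cited works:

```python
def base_a(x):  # toy base model: rule on the first input dimension
    return 1 if x[0] > 0.5 else 0

def base_b(x):  # toy base model: rule on the second input dimension
    return 1 if x[1] > 0.5 else 0

def meta_fit(meta_features, labels):
    # Trivial meta-learner: weight each base model by its training
    # accuracy on the stacked (prediction) features.
    n = len(labels)
    weights = []
    for j in range(len(meta_features[0])):
        acc = sum(1 for i in range(n) if meta_features[i][j] == labels[i]) / n
        weights.append(acc)
    return weights

def meta_predict(weights, preds):
    # Weighted vote over the base models' predictions.
    score = sum(w * (1 if p else -1) for w, p in zip(weights, preds))
    return 1 if score > 0 else 0

X = [(0.9, 0.1), (0.8, 0.9), (0.2, 0.2), (0.1, 0.9)]
y = [1, 1, 0, 0]  # labels happen to follow the first dimension

meta_X = [[base_a(x), base_b(x)] for x in X]   # stacked features
weights = meta_fit(meta_X, y)
pred = meta_predict(weights, [base_a((0.7, 0.3)), base_b((0.7, 0.3))])
```

Here `base_a` agrees with the labels everywhere, so it receives full weight, while the weaker `base_b` is down-weighted by the meta-learner.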

All in all, two main limitations can be observed in the related works presented: (1) most are limited to models based on only a few input parameters to predict warranty claims; (2) there is a lack of use of complex real-world operational data from thousands of deployed products, which comes with the challenges of noisy and missing data. Our study addresses these issues by deploying a complex ensemble-based system on high-dimensional data collected from thousands of heavy-duty trucks.
To our knowledge, the combination of an evolutionary technique and an ensemble deep neural network as an approach to increase warranty claim prediction accuracy is novel. Our approach addresses the fault detection problem in the automotive industry, but it is generic and applicable to a variety of fault and claim prediction problems in different sectors. Borrowing from the taxonomy in [56], our warranty claim prediction can be characterized as a stacked ensemble approach, since it takes into consideration the predictions of several diverse snapshot models to build a meta-learner, which is then used for the final prediction. In the remainder of this paper, we describe the data, our proposed solution and the evaluation of its potential, both on warranty claim data and on datasets from other applications.

Data Presentation
In this section we present the two data sets used to test the proposed forecasting method: Logged Vehicle Data (LVD), comprising the specifications and usage of vehicles aggregated in a cumulative fashion; and Warranty Claim Data (WCD), consisting of claim information reported over the vehicles' life-times during their warranty periods (24 months).

Logged Vehicle Data
The LVD used in this study were collected from commercial trucks over a four-year period, from 2016 to 2019. The LVD consisted of the aggregated usage information for a fleet of heavy-duty trucks operating in Europe. The parameters were collected using telematics, as well as through manual inspection when a vehicle visited an authorized workshop for repairs and service. In general, two types of parameters are logged in the LVD. The first type describes the configuration of the vehicles, e.g., engine type, gearbox specification, or type of pumps. The second type logs the usage of the vehicle during its operation. This data is logged in histogram and scalar formats, continuously aggregated, and contains a number of different parameters, such as fuel consumption, compressor usage, gear usage, cargo load, engine ambient temperature vs. ambient pressure log, coolant temperature vs. oil temperature log, etc. An example of cumulative scalar features and their structure over one year is illustrated in Figure 2.

Warranty Claim Data
Claim data contains information regarding a vehicle's warranty claims logged during its operation, collected by OEM-authorized workshops in Europe. In particular, the claim database stores information about which part or component of a vehicle was repaired or changed (due to a failure), along with the date. The parts and components are identified by normalized identification codes at four different levels of detail. This claim dataset contains various parameters, such as component names, codes, descriptions, and dates.

Problem Formulation
This section presents the formulations devised to tackle the claim forecasting problem. Our approach considers the WCD and LVD as the system's inputs and employs optimization and ensemble-based deep neural networks. This study attempted to predict, one month beforehand, defects related to the power train in heavy-duty vehicles during their warranty period of 24 months from when they entered service. Thus, our intention in this claim prediction investigation can be divided into two problem formulations, as follows:
• First, we formulated the process of dimensionality reduction as an optimization task and developed an optimization approach to extract a better set of features characterizing vehicle usage over time. The approach was carried out in a time-series fashion, based on the hypothesis that vehicle behavior changes over time, so different predictors might have different impacts on claims/failures and, consequently, on the performance of predictive models.
• Second, we investigated the design of an ensemble machine learning approach, formulating the task as a classification problem to predict claims/defects related to the power train. Given the selected predictors from the optimization step, the ensemble approach mapped the usage to the upcoming claims one month ahead during the warranty period.
Figure 3 shows the high-level structure of the proposed claim prediction approach for the automotive industry. It contains three main modules: Data Pre-processing, Feature Selection Optimization and Ensemble Modeling. The data pre-processing module consists of three processes: data integration, feature extraction, and data segmentation. The optimization module is then employed to choose the measurements in a time-series fashion. Here, the best representation of the data in each segment is transferred to the ensemble modeling module. Subsequently, the ensemble method builds the predictive model over two phases to forecast warranty claims, given the vehicle usage.
These modules and sub-modules are described as follows:

Module 1: Data Pre-Processing
Data preparation is an integral part of any machine learning problem and our fault diagnosis task is not an exception. In this section, we describe how we prepared the data to use in our optimization process.

Data Integration
This module aimed to merge the LVD and WCD to create an integrated dossier with the usage and failure information. We combined the two data sets based on the vehicle's "Chassis id", the logged "Date of readout", and the "Date of failure report". To merge these two datasets, we borrowed the method that we used in our previous study [18], where we took advantage of a Volvo expert's knowledge to determine a time window of one month as the interval in which the symptoms of failures are most likely to be visible. Therefore, the prediction horizon was set to 30 days for the developed model. In this way, the merged dataset included a new attribute named "Target" as the target feature T. This parameter has a value of 1 for a given sample (r_i) if, and only if, a failure of the specific component of interest is reported. Each vehicle may have multiple failures (of the same component) during its operational life; however, only one failure (of the same component) is considered during the desired time window. More formally, each time window/time span is assigned a binary label according to Equation (1), where tw refers to the length of the time window that has the highest impact on failures in trucks and τ is the current time (end of the time window). Note: tw is determined based on a Volvo expert's knowledge and differs from one type of failure to another. In this study, tw was set to one month, which was considered large enough to overcome the effects of choosing a window that is too small, e.g., having a prediction horizon that is too tight or not being able to cover the whole chain of events leading to a failure. Had a shorter time window been chosen, it would have made more sense for the target feature to be continuous, due to the gradual transitioning into the chain of events.
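The labeling rule of Equation (1) can be sketched as follows, assuming the one-month window described above; the field names, chassis ids and dates are invented for illustration:

```python
from datetime import date, timedelta

TW = timedelta(days=30)  # prediction horizon set with expert knowledge

readouts = [
    {"chassis_id": "A", "readout_date": date(2018, 3, 1)},
    {"chassis_id": "A", "readout_date": date(2018, 5, 1)},
    {"chassis_id": "B", "readout_date": date(2018, 3, 1)},
]
claims = [
    # One claim for vehicle A, within 30 days of its first readout.
    {"chassis_id": "A", "failure_date": date(2018, 3, 20)},
]

def label(readout, claims, tw=TW):
    """Target = 1 iff a claim for the same vehicle falls within tw
    after the readout date (at most one failure counted per window)."""
    return int(any(
        c["chassis_id"] == readout["chassis_id"]
        and readout["readout_date"] <= c["failure_date"] <= readout["readout_date"] + tw
        for c in claims
    ))

for r in readouts:
    r["Target"] = label(r, claims)
```

Only the first readout of vehicle A is labeled positive here; the later readout and vehicle B fall outside any claim window.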

Feature Extraction
This sub-module computes the differences between consecutive measurements of each sensor signal accumulator and quantizes the resulting features. This process is necessary due to the cumulative nature of the data: it is essential to extract the actual usage at each timestamp, so that an abnormality indicating a potential breakdown in the near future can be detected.
Equation (2) shows how a new feature (from each original feature) is calculated:

f_i^new = r_(i+1) - r_i,  i = 1, ..., n - 1,  (2)

where f_i^new denotes the extracted feature, r_i and r_(i+1) represent readouts collected over time, and n is the number of samples or readouts. Basically, in the feature extraction module, we calculate the delta between subsequent readouts and exploit these deltas as new features in training the models. These extracted features demonstrate the actual usage of the vehicles (and its changes) in a bi-weekly fashion; the original LVD were logged every two weeks and collected by the authorized OEM workshops when the vehicles visited a workshop anywhere in Europe. The anomalous behavior of a vehicle hidden in the data is revealed from these changes, which might indicate potential faults in the near future. In addition, information related to each season (spring, summer, autumn and winter) is added as a new feature to the data. Figure 4 depicts the usage change of 10 vehicles considering 8 new features (f_i^new) out of more than 600 features in total. Furthermore, we quantized the level of usage change f_i^new into low, medium, and high changes. This quantization was done with the support of expert knowledge from Volvo engineers in the field. Equations (3) and (4) explain how the quartiles, and accordingly the levels of change in vehicle usage, are defined. In this way, we assume that a substantial change (reduction or increase, f_i^new > Q3) in the usage of a vehicle leads to performance degradation and component failure in the future.
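A minimal sketch of this delta-and-quantize step, using an invented cumulative accumulator and plain quartile thresholds (the expert-tuned cut-offs from Equations (3) and (4) are not reproduced):

```python
import statistics

def deltas(readouts):
    """Equation (2): f_i^new = r_(i+1) - r_i for a cumulative accumulator."""
    return [readouts[i + 1] - readouts[i] for i in range(len(readouts) - 1)]

def quantize(changes):
    """Bin each delta into low/medium/high change by quartile."""
    q1, _, q3 = statistics.quantiles(changes, n=4)
    levels = []
    for c in changes:
        if c <= q1:
            levels.append("low")
        elif c <= q3:
            levels.append("medium")
        else:
            levels.append("high")
    return levels

# Invented cumulative fuel-consumption-style accumulator for one vehicle:
raw = [0, 10, 22, 30, 70, 75]
changes = deltas(raw)   # bi-weekly usage deltas
levels = quantize(changes)
```

The jump of 40 between the fourth and fifth readouts lands above Q3 and is flagged as a high change, i.e., the kind of substantial usage deviation assumed to precede degradation.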
To show how the changes might relate to failures, we grouped the vehicles into the two categories of healthy and unhealthy vehicles. A healthy vehicle is one that has no faults during its operational life, while an unhealthy vehicle has at least one reported claim in its history. Figure 5 demonstrates how the changes are related to the groups of vehicles, where the y-axis shows the relative frequency of usage change in four different categories. The vehicles with high, medium, and low or no numbers of significant changes are shown on the x-axis. It can be observed from the usage patterns, plotted through the sub-figures, that the proportion of significant changes in unhealthy vehicles was higher than in healthy vehicles during their life histories (except the one depicted in Figure 4f). Likewise, the proportion of healthy vehicles exceeded that of unhealthy vehicles when we considered no-changes to evaluate the association between them. Through this experiment, we concluded that healthy vehicles had less usage deviation than unhealthy vehicles; exploiting this information contributed to building a better predictive model with more accurate predictions. Note: we used more than 600 features in our study; the eight features plotted here, related to the health of the vehicles, were randomly selected for the purpose of plotting.
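The healthy/unhealthy comparison above amounts to computing, per group, the relative frequency of each usage-change level. A small sketch with made-up vehicle records:

```python
from collections import Counter

# Invented records: claim count plus the quantized usage-change levels
# observed over each vehicle's history.
vehicles = {
    "V1": {"claims": 0, "changes": ["low", "low", "medium"]},
    "V2": {"claims": 2, "changes": ["high", "high", "medium"]},
    "V3": {"claims": 1, "changes": ["high", "medium", "low"]},
}

def relative_frequencies(vehicles, healthy):
    """Relative frequency of each change level within one health group."""
    counts = Counter()
    for v in vehicles.values():
        if (v["claims"] == 0) == healthy:
            counts.update(v["changes"])
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

healthy_freq = relative_frequencies(vehicles, healthy=True)
unhealthy_freq = relative_frequencies(vehicles, healthy=False)
```

In this toy data, high changes occur only among the unhealthy vehicles, mirroring the pattern reported for Figure 5.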

Data Segmentation
The segmentation was conducted based on the hypothesis that vehicle behavior differs from context to context, and that the context has a certain impact on the vehicle's performance. In the automotive domain, context can be characterized by location, application domain, time, specific road conditions, etc. Furthermore, this segmentation might help increase the performance of the predictive model. Thus, in this study, with this low-resolution data, the operational time of the vehicles over the different seasons was used to segment the data. Basically, we partitioned the whole time-series data set into four segments representing the four seasons of a year.
This time-series data set contained 337,573 samples with 375 original features (parameters/measurements), characterizing the usage style of 2412 unique vehicles over four years of operation. In addition, information related to each season was added as a new feature in the data. In this way, in each segment of time we defined three months as a season over a year, and each time segment contained the vehicles which operated in that period of time in different years (regardless of their ages or production months). Since vehicles have different ages and operational patterns over the year, each segment contained various numbers of samples/readouts as follows: Segment1 contained 86,768 readouts; Segment2 included 90,709 readouts; Segment3 held 61,364 readouts, and, finally, Segment4 contained 98,732 readouts.
Bearing in mind the assumption raised in the integration section (Section 4.2.1) for labeling the LVD (one month of operation may have the highest impact on component down-times), we took into account the rare case of a failure occurring on the first day of a season, so each segment also contained one month of operation from the previous season. This allowed the model to map the usage to a failure even if the failure happened on the first day of the season.
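The season assignment with the one-month overlap can be sketched as follows; the month boundaries (Dec-Feb as winter, etc.) are our illustrative choice, since the paper does not state them:

```python
from datetime import date

SEASONS = {
    "winter": (12, 1, 2),
    "spring": (3, 4, 5),
    "summer": (6, 7, 8),
    "autumn": (9, 10, 11),
}
NEXT_SEASON = {"winter": "spring", "spring": "summer",
               "summer": "autumn", "autumn": "winter"}

def segments_for(readout_date):
    """Return the season segment(s) a readout belongs to. The last month
    of each season is also included in the following season's segment, so
    a failure on a season's first day can still be mapped to prior usage."""
    month = readout_date.month
    segs = [name for name, months in SEASONS.items() if month in months]
    for name, months in SEASONS.items():
        if month == months[-1]:  # overlap rule: carry into the next season
            segs.append(NEXT_SEASON[name])
    return segs
```

A mid-season readout belongs to one segment, while a readout from a season's last month appears in two.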

Module 2: Optimization: Measurement Selection
This section describes the module developed to conduct the measurement-selection process. We formulated this process as an optimization task in which we aimed to find the best representation of vehicle usage (a subset of measurements/features) to increase the predictive model's performance in forecasting component failure. In other words, our objective was to discover the optimal subset of measurements that could characterize vehicle usage and map it to failures. To this end, we constructed this module by modifying the original Genetic Algorithm [58], which has been successfully applied in a wide range of applications [59,60], to cite just a few. As the main contribution in this module, we utilized the Elastic Net technique [61], adding simple penalties (a combination of L1 and L2 regularization) to the optimization technique (we call the result ELSGA); the Elastic Net acts as a generation component that initiates a decent population in the first step of the optimization process. Although dimensionality-reduction approaches such as PCA [61,62] or LDA [63] could have been applied to this problem, we avoided them in this step in order to maintain interpretability. The optimization module itself consists of three main sub-modules: the Initialization Component (IC), the Evaluation Component (objective function), and the Criterion. These sub-modules are described as follows. Given the desired segment (Seg i) from the segmentation module, the IC initiates the first generation of the population (rather than randomly selecting the first generation of features, as is fundamental in the GA algorithm, we aimed to start the optimization with a better population of measurements) using a convex combination of the lasso [64] and ridge [65] methods called the Elastic Net (LS) [66]. We took advantage of the LS capability of using L1 and L2 penalties to shrink and select features in order to generate the first population of predictors.
Let us suppose we have $R = (r_1, \dots, r_n)$ readouts in the given time-series data-set $Seg_i$ with $F = (f_1, \dots, f_m)$ features/predictors (measurements), and $T = (t_1, \dots, t_n)$ as the target variables. The value $a_{i,j}$ refers to the value of feature $f_j$ in readout $r_i$:

$$ R = \begin{pmatrix} r_1 \\ \vdots \\ r_n \end{pmatrix} = \begin{pmatrix} a_{1,1} & \cdots & a_{1,m} \\ \vdots & \ddots & \vdots \\ a_{n,1} & \cdots & a_{n,m} \end{pmatrix} $$

Thus, $R$ and $T$ form our model matrix, and the prediction of the target variable $\hat{t}$ is described by an input function that is the dot product of the predictors and the coefficients $\beta$. Since our initialization task is defined as a classification problem, $\beta$ can be calculated by Equation (7):

$$ \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \ell\!\left(y_i,\, \beta^{\top} r_i\right) + \lambda \left( \alpha \lVert \beta \rVert_1 + (1 - \alpha) \lVert \beta \rVert_2^2 \right) \quad (7) $$

where $\ell$ is a margin-based loss function and $y \in \{0, 1\}$. The norms $\lVert \beta \rVert_1$ and $\lVert \beta \rVert_2$ (the $\ell_1$ norm and the Euclidean norm) measure the largeness of the coefficients. $\lambda$ and $\alpha$ are the two parameters to tune: $\alpha$ controls the mix of the two penalties and $\lambda$ manages the amount of penalization, where $\alpha$ takes values in $[0, 1]$ and $\lambda$ is a positive number.
There are two limiting cases for the LS: it turns into the ridge method as $\alpha$ approaches 0, and serves as the lasso when $\alpha$ reaches 1. Thus, to find the best $\alpha$ and $\lambda$, we tuned these parameters through walk-forward cross-validation during the learning process: we trained different models with various values, in an optimization manner, to find the best setting. In this way, the algorithm overcame the drawbacks of the lasso and ridge functions [67] in selecting the best predictors. Thus, in the IC we used an agent that performs Equation (11) to initialize the first population delivered to the Evaluation Component, and kept generating such decent populations to inject into the GA process.
$$ Pop_1 = \mathrm{Rand}\bigl(LS(Seg_i),\, pop\bigr) \quad (11) $$

where Rand() is a random selection among the decent predictors initiated by $LS(Seg_i)$, designed to keep diversity within the selected predictors. The value pop denotes the size of the population to be initialized in the first generation; it was tuned over the above five candidate values to find the optimal population size. The reasons for developing such an initialization component, rather than random generation, are as follows: (a) we hypothesized that good parents most likely generate better children than bad (or random) parents, meaning that randomly initializing the parents may lead to unsatisfactory predictive performance in the fitness function, skipping the cross-over step and, accordingly, terminating the genetic process prematurely; (b) in any optimization process, it is crucial to approach the optimal solution in the early stages (generations), particularly when the system needs to deal with a huge amount of streaming data, and random selection may postpone convergence to the late generations; (c) the nature of the GA approach is based on random generation and selection of a population to obtain the best solution, so the approach does not consider the correlation between the features and the predictive performance, yet there might be highly correlated features that contribute to performance together. Hence, the design and integration of the IC supports the GA operators by injecting such information to generate a better population over the optimization process.
For this reason, we employed LS in this initialization component: since there might be highly correlated predictors in the high-dimensional data, LS automatically includes all the highly related predictors in the population if one of them is selected. However, it should be remarked that, among these highly related features, there might be cases in which only one is sufficient and contributes to the predictive performance, and the GA resolves such cases. To summarize, this design with LS and GA allows the inclusion of the many correlations expected, since features coming from sensors measure correlated processes; the GA-enabled part of the algorithm then tries to select the best subset out of such correlations and does not necessarily keep all of them. The combination of these two algorithms addresses the concerns raised above, particularly as the automotive sector deals with high-dimensional streaming data over time. This enables the system to decrease dimensionality and, at the same time, increase the performance of the predictive model.
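A minimal sketch of the IC, using scikit-learn's elastic-net logistic regression as the LS step; the function name `init_population`, the population and subset sizes, and the default penalty values are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def init_population(X, y, pop_size=10, subset_size=50, alpha=0.1, lam=1.0, seed=0):
    """Sketch of the IC: elastic-net logistic regression picks the 'decent'
    predictors (non-zero coefficients), and Rand() draws random subsets of
    them to seed the first GA generation. Sizes are hypothetical; the paper
    tunes both the population size and the penalty parameters."""
    rng = np.random.default_rng(seed)
    # l1_ratio plays the role of alpha in Equation (7); C = 1 / lambda.
    model = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=alpha, C=1.0 / lam, max_iter=5000)
    model.fit(X, y)
    decent = np.flatnonzero(model.coef_[0])  # predictors the LS kept
    k = min(subset_size, len(decent))
    # Rand(): random subsets of the decent predictors keep diversity.
    return [sorted(rng.choice(decent, size=k, replace=False))
            for _ in range(pop_size)]
```

Because the elastic net keeps groups of correlated predictors together, every chromosome drawn here starts from the "decent" pool rather than from the full feature space, which is exactly the motivation given for the IC.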

Evaluation Component (Fitness Function)
At the core of this component, and of the GA process, a decision-tree-based ensemble method [68] (XGBoost) was used as a classifier to evaluate the importance of the individuals and their impact on the predictive model. The objective function of claim prediction, $L(\theta)$, can be expressed as follows [68]:

$$ L(\theta) = \sum_{i=1}^{n} \ell\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(r_i)\right) + \Omega(f_t) $$

where $\ell$ is the log-likelihood loss, $f_t$ denotes the $t$-th tree, and $\hat{y}_i^{(t-1)}$ is the prediction for sample $r_i$ at iteration $t-1$; thus, $\mathrm{logistic}(\hat{y}_i^{(t-1)})$ is the predicted probability. In this experiment we used the logistic loss, as our problem was formulated as a binary classification. The merit of each individual in the population, represented as a solution, is evaluated as a fitness value, so the best chromosomes are ranked and selected for the next generation. We formulated our fitness function as an optimization in which individuals with higher performance are considered for the next generation. In a parameterized fashion, the top four populations were selected in the selection process and given to the GA operators for cross-over to generate new solutions (offspring).
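The fitness evaluation can be sketched as follows; scikit-learn's GradientBoostingClassifier stands in for XGBoost here to keep the example self-contained, and the paper's walk-forward evaluation is simplified to a single ordered split:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def fitness(X, y, feature_idx):
    """Fitness of one chromosome (a feature subset): AUC of a gradient
    boosting classifier trained on those features only. The last quarter
    of the (time-ordered) samples is held out, mimicking a single
    walk-forward step rather than the full WFCV used in the paper."""
    Xs = X[:, list(feature_idx)]
    split = int(len(y) * 0.75)                       # keep time order
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
    clf.fit(Xs[:split], y[:split])
    return roc_auc_score(y[split:], clf.predict_proba(Xs[split:])[:, 1])
```

In the actual system, this score is what ranks the chromosomes; the top four then pass to the GA operators for cross-over.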
A typical crossover method, Partially Mapped Crossover (PMX) [69], was used in this optimization problem. The main motivation for this decision was to support the GA process with a better convergence rate and diversity, and to reduce stagnation during the optimization process [70].
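A textbook PMX implementation, under the assumption that the two chromosomes are equal-length sequences over the same set of gene values (here, feature indices):

```python
import random

def pmx(parent1, parent2, rng=None):
    """Partially Mapped Crossover (PMX): copy a section from parent1 and
    repair the conflicts through the position mapping, so the child is a
    valid rearrangement of the same genes."""
    rng = rng or random.Random()
    n = len(parent1)
    a, b = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = parent1[a:b + 1]            # the mapping section
    for i in range(a, b + 1):
        gene = parent2[i]
        if gene in child[a:b + 1]:
            continue
        pos = i
        # Follow the mapping until a slot outside the section is free.
        while a <= pos <= b:
            pos = parent2.index(parent1[pos])
        child[pos] = gene
    for i in range(n):                           # fill the rest from parent2
        if child[i] is None:
            child[i] = parent2[i]
    return child
```

The repair step is what distinguishes PMX from one-point crossover: no gene is duplicated or lost, which keeps every offspring a valid feature subset.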
The first criterion concerns the poorness of the predictors generated by the IC component in the first generation: the IC is called again if, and only if, the fitness value does not pass the minimum threshold. This criterion was established to ensure the first population has the potential for further generations during the optimization process. The second criterion indicates that the fitness output has been reached sufficiently for the optimization process to terminate once it is met. As the third criterion, Tr 3, we defined a maximum number of generations. Taking into account the first criterion Tr 1, once one of Tr 2 or Tr 3 is met, the selected features are injected into the ensemble modeling (the third module) to build a complex and general predictive model for the final claim forecasting.

Module 3: Snapshot-Stacked Ensemble Modeling (Network Construction and Training)
In this section, we describe the third module of the proposed approach that contains two main phases. In the first phase, we constructed an ANN, which was used to generate several snapshot models from a single training process. Then, we took those snapshots (models) for validation, and, in the second phase, the prediction output (class probabilities) of each snapshot model was added to the dataset for final training and prediction. Indeed, the second phase acts as a meta-learning process, where it attempts to learn from the output of the snapshots to reduce the classification errors obtained from the first network.

Snapshot Ensemble on Vehicle Usage
Ensemble modeling is a process in which several models are generated to forecast the outcome, either by utilizing different predictive models or different training data sets; the ensemble then combines each model's output to obtain the final prediction for the test data. This strongly helps increase the predictive model's performance, compared to an individual model, by decreasing the generalization error. Since training multiple deep networks is computationally expensive, we utilized the Snapshot Ensemble technique to train multiple models in a single training process [71], adapting it and integrating it into the optimization technique for breakdown prediction. However, this technique risks generating similar models with similar prediction performances, which does not contribute to the predictive model at the output level unless the network generates diverse models during the training process. One way to overcome this challenge and build diverse models is to change the learning rate during the learning process. To this end, we exploited the Cyclic Cosine Annealing Schedule (CCAS) [17], which splits the training process into N cycles, each starting with a large learning rate that decreases relatively rapidly to a minimum value before increasing quickly again. Equation (13) shows the learning rate over the iterations.
$$ \alpha(t) = \frac{\alpha_0}{2}\left(\cos\!\left(\frac{\pi \,\bigl(t \bmod \lceil T/N \rceil\bigr)}{\lceil T/N \rceil}\right) + 1\right) \quad (13) $$

where $\alpha$ denotes the learning rate at iteration $t$, $\alpha_0$ is the initial (maximum) learning rate, and $T$ refers to the total number of training iterations. The function thus anneals the learning rate from $\alpha_0$ down to $\approx 0$ over the course of each cycle [72]. $N$ is the number of cycles over which the network is trained, and the best weights at the end of each cycle are saved as a snapshot of the model, resulting in N model snapshots. In this process, the learning rate is raised periodically, encouraging the model to converge toward a global minimum, rather than a local one, over the snapshot ensemble; this also eliminates the need to manually find the optimal maximum learning rate. The overall snapshot process is illustrated in Figure 6, where at the end of each cycle the model with the maximum performance is saved as the snapshot. All the models are then used to predict the outcome in the first phase. The use of CCAS does not, by itself, ensure diversity of the produced models, due to the stochastic nature of the algorithm; therefore, the diversity of the snapshot models was carefully studied by computing and investigating the dissimilarities between their prediction outputs. In short, in this step we generated multiple diverse deep neural networks to build an ensemble of predictive models, which was then used to construct the meta-learning (in the second phase) for the final breakdown prediction.
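A sketch of the schedule, with the paper's 400 iterations and N = 20 cycles as defaults; the function name is hypothetical:

```python
import math

def ccas_lr(t, alpha0=0.1, total_iters=400, cycles=20):
    """Cyclic cosine annealing: the learning rate restarts at alpha0 at the
    start of each cycle and anneals toward ~0 within it. t is the 0-based
    iteration index; alpha0 = 0.1 is an illustrative value."""
    cycle_len = math.ceil(total_iters / cycles)
    return (alpha0 / 2.0) * (math.cos(math.pi * (t % cycle_len) / cycle_len) + 1.0)
```

With these defaults each cycle lasts 20 iterations: the rate is alpha0 at iterations 0, 20, 40, ..., and approaches 0 just before each restart, which is the moment each snapshot is saved.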

Training the Network
Table 1 shows the construction of the network in the first phase, in which we trained the ANN by incrementally creating a Sequential model. The network parameters were initialized using the "He" initializer [73] (Equation (14)):

$$ W \sim G\!\left(0,\, \sqrt{2/n_l}\right) \quad (14) $$

where G is a zero-centered Gaussian probability distribution with standard deviation $\sqrt{2/n_l}$, $n_l$ is the number of inputs to a node in layer $l$, and the biases are initialized at 0.
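A minimal NumPy sketch of this initializer, assuming a fully connected layer with `n_in` inputs; the function name is hypothetical:

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """'He' initialization: weights drawn from a zero-mean Gaussian with
    standard deviation sqrt(2 / n_in); biases start at 0."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```

The sqrt(2/n) scaling compensates for ReLU zeroing half of the activations, which is why this initializer is the usual pairing for ReLU networks such as the one described next.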
To train the deep neural network with stochastic gradient descent, the rectified linear unit (ReLU) [74] was utilized as the activation function, transforming the summed weighted input of each node into its output.
$$ f(x) = \max(0, x) \quad (15) $$

In Equation (15), f(x) is the ReLU function: it outputs 0 for all x < 0 and returns the input itself, f(x) = x, for x ≥ 0. The Batch Normalization technique [75] is used to normalize the input of each layer, reducing the internal covariate shift problem and thereby stabilizing the learning process. This is followed by a stochastic gradient descent variant called "Adam", which iteratively updates the network weights during the training process [76].
We used a regularization method called "Dropout" to reduce over-fitting; in effect, this provides different neural network architectures while training the networks. In the output layer, we utilized the softmax activation function, depicted in Equation (16), to produce the probability distribution over the target variables.
$$ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K \quad (16) $$

where $z_i$ are the elements of the input vector to the softmax function, K is the number of classes, and $\sum_{j=1}^{K} e^{z_j}$ is the normalization term, which ensures that all output values of the function sum to 1 (each class probability lies in the range (0, 1)). Finally, the output of Equation (16) is passed to the argmax function, defined in Equation (17), to predict the hard label.
$$ hp_{r_i} = \arg\max\, \sigma(z_i) \quad (17) $$

where $hp_{r_i}$ carries the hard-label prediction of readout $r_i$ in the test set, expressing whether the usage corresponds to a claim/breakdown (1) or not (0).
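Equations (16) and (17) can be sketched together; the shift by max(z) is a standard numerical-stability trick, not part of the paper's formulation:

```python
import numpy as np

def softmax(z):
    """Equation (16): class probabilities that sum to 1. The shift by
    max(z) avoids overflow and does not change the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hard_label(z):
    """Equation (17): the argmax over the class probabilities."""
    return int(np.argmax(softmax(z)))
```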

Stacking and Horizontal Ensemble
Each snapshot model provides two types of output: a soft prediction (Equation (16)) and a hard label (Equation (17)). The former returns two values: the predicted probability that the readout belongs to a vehicle without a claim, and the predicted probability that the usage corresponds to a claim/fault. The latter is the hard label representing whether the component is healthy (0), given the vehicle usage, or faulty (1). Accordingly, Equation (18) gives the number of new features generated by the snapshot models.
$$ F_{new} = f + N_{Sn} \times \bigl(|C_k| + 1\bigr) \quad (18) $$

where $N_{Sn}$ refers to the number of snapshots and f is the number of features. $C_k$ denotes the binary class of the vehicle status (healthy, 0, or unhealthy, 1), to which "1" is added for the hard label of that prediction. Table 2 illustrates the snapshot and stacking horizontal ensemble calculations, whose symbols are described in detail as follows:
- $r_1, \dots, r_n$: the vehicle non-stationary readouts/samples, collected over time.
- $f_1, \dots, f_m$: the LVD parameters characterizing a vehicle's behavior in $r_i$.
- $P_{r_1}(sp_1), \dots, P_{r_1}(sp_m)$: the outputs of the snapshot models, in the first phase, for the given readouts.
- $p_{c_1}, p_{c_2}$: the predicted probabilities of class 0 (healthy) and class 1 (unhealthy) for the given readout $r_i$.
- $hp_{r_i}$: the hard prediction of the given readout $r_i$ (0 or 1).
The outputs of the soft predictions and the hard labels are horizontally combined with the original features (validation set, see Figure 7). This means that, by Equation (18), each snapshot model adds three extra parameters carrying knowledge of the previous model's performance on that specific readout. The new data, as the new training set, is then passed to the second deep network to build the meta-learning model that reduces the forecasting errors obtained from the previous phase. Indeed, the meta-learning is built by stacking the outputs of the snapshots, which are diverse deep neural networks able to capture the expected complex signal relations, compared to learning algorithms with less capacity (see Tables 5 and 6). As described above, CCAS is utilized to construct 20 different predictive deep network models in this stacking process. To obtain the final fault prediction, we injected the held-out test set into the same generated snapshots to translate its structure into the new, higher-dimensional one matching the new training set (see Figure 3, bottom-left). Finally, the performance of the meta-learning model is assessed on the test set. Table 1 (right side) describes the architecture of the second deep network, and Algorithm 1 shows the pseudo-code of the whole claim-prediction process. Figure 6. Visualization of the standard loss (red) and the cosine annealing learning rate and cycles (blue). In this practice, N = 20 snapshot models were generated and saved to be used in the first and second phases of the networks.
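The horizontal stacking step can be sketched as follows, assuming each snapshot's soft prediction is an (n_samples, 2) probability array; with 20 snapshots this yields the 60 extra columns discussed above (the function name is hypothetical):

```python
import numpy as np

def stack_snapshot_outputs(X_val, snapshot_probs):
    """Horizontally append, for each snapshot, its two class probabilities
    plus the derived hard label to the original features. With N snapshots
    this adds N * 3 columns (20 snapshots -> 60 new features).

    snapshot_probs: list of (n_samples, 2) soft-prediction arrays.
    """
    extras = []
    for probs in snapshot_probs:
        hard = np.argmax(probs, axis=1).reshape(-1, 1)  # hard label 0/1
        extras.append(np.hstack([probs, hard]))
    return np.hstack([X_val] + extras)
```

The same transformation must be applied to the held-out test set (using the same snapshots) so that its dimensionality matches the meta-learner's training set.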

Figure 7. Schematic representation of the third module. It starts by partitioning the LVD data (injected from Module 2) into three parts: training data, validation data, and test data. The training and validation data are used in the first phase to generate the snapshot (SP) models. In this process, CCAS is utilized to construct the snapshots, with the function periodically decreasing the learning rate from α 0 to ≈ 0 within each cycle; in this study, 20 snapshots were produced (see Figure 6). At the end of each cycle, the model with the maximum performance is saved as the snapshot. In the second phase, the outputs of the SPs are horizontally added to the original set (validation data) to make a new dataset, which is then used to build the meta-learning model (with the training part) for the final prediction. Finally, the test part is utilized (with the new dimensions) to validate the meta-learning performance.

Experimental Evaluation and Results
To structure the evaluation of the approach, we recall the two research questions, introduced in Section 1, on which we based the evaluation, as follows:

RQ1-Feature Selection with Optimization Results
Bearing in mind RQ1, we constructed two evaluation tracks. Firstly, the proposed optimization approach (ELSGA) was evaluated over several generations and compared with the standard genetic algorithm (without the IC component). Secondly, the selected features (using ELSGA) were used to build models through a type of time-series k-fold cross validation called "walk-forward cross validation (WFCV)" [77] and compared with other ML, deep learning and ensemble based approaches.
The optimization process was carried out as feature selection, choosing the most informative parameters to predict a vehicle's component failures (a component related to the turbocharger). We built a forecasting system in which a supervised ML algorithm examined the contribution of the parameters to the claims. Considering results from our previous study [18], in which vehicle usage changed over time, we segmented the data into different parts, assuming that the distribution of vehicle usage logged in the same season, across the years, was relatively similar compared to the other seasons. Thus, in each segment there might be a set of parameters that had more impact on the failures and were, accordingly, more pertinent to failure prediction. We designed the optimization process to run on each segment in order to identify the most informative features. In the second tier of the evaluation, however, the union of the selected features (from all segments) was used to build the predictive models and compare them with other algorithms.
Since there is no universal configuration of such a heuristic optimization algorithm that is considered the best setting, the GA parameters, including population size, mutation rate, number of parents in the mating pool, and number of elements to mutate, were selected in an optimization manner; the implementation was parameterized to find the best configuration of the GA settings. Table 3 lists the parameters that our optimization system tuned to obtain the best performance. Table 4 describes the output of the IC when initiating the first population on the whole and the segmented data. The values of α and the l1-norm were selected through the WFCV (as the optimal values) to train the elastic model. The whole data-set, with the lowest feature reduction (FR = 0.29%), resulted in the highest number of predictors in the first population, at 410. Among the segments, 313 predictors were selected with coefficients larger than 0 in Seg4, which was also the lowest reduction proportion within the segments. Likewise, the IC initiated 80 predictors out of 577 as decent parents for further generations in Seg1. The figures indicate that the IC delivered different numbers of predictors to the GA operators, meaning that, in each segment/season, different parameters had different impacts on failures. It was observed that the value of α was close to zero in all cases, meaning the LS almost turned into the ridge method to initiate the first generation.
To evaluate the performance of the selected features, we formulated the objective function as a classification task, wherein the function seeks the optimal performance of the breakdown forecasting, taking vehicle usage into account. Thus, given the vehicle usage with n readout samples and the m selected predictors/features injected by the GA operators, $D = \{(r_i, t_i)\}$ with $|D| = n$, $r_i = (f_1, \dots, f_m)$ and $t_i \in \{0, 1\}$, in each generation the function predicted whether the usage led to a failure or not.
In this stage of the experiment, the data was partitioned into training and test data. More specifically, in each segment of the data, throughout the optimization process, we used three months of vehicle operation, plus one month from the previous season, to train the model, and the last month to test it. The statistical information of the training and test data in each segment is listed as follows: As mentioned in the Segmentation section, within each segment the readout time-series samples were kept in time order, so that past data built the model and future data tested it. Since these data were cumulative, and the scale of the readouts differed from year to year, data normalization, using Equation (19), was conducted to bring the data onto the same scale.
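Since Equation (19) itself is not reproduced here, the sketch below assumes min-max scaling fitted on the (past) training part only, consistent with the walk-forward setup; the function name is hypothetical:

```python
import numpy as np

def walk_forward_scale(train, test):
    """Min-max scaling fitted on the (past) training part only and applied
    to the (future) test part, so no future information leaks into the
    model. The exact formula of Equation (19) may differ."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (train - lo) / span, (test - lo) / span
```

Fitting the scaler on the training window alone matters for cumulative readouts: the test month's larger values would otherwise shift the scale and leak information backwards in time.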
The three criteria, tr 1, tr 2 and tr 3, were set to 0.5, 0.95 and 100, respectively, to recall the IC or to terminate the optimization process and deliver the best predictors.
The plots in Figure 8 illustrate the superiority of the ELSGA approach in all segments, compared with the GA approach without the IC component, over the whole optimization process. We noticed a considerable jump in the performance provided by ELSGA, which describes how well the IC adapted the optimization process to generate decent individuals in the first population. The increase at the early stages ensured that the whole optimization process could be terminated before criterion tr 3 was met (maximum 100 generations). However, tr 2 ≥ 95% was never passed, due to the complexity of the problem. ELSGA performed better, reporting around 80% AUC in all segments, about 3% higher than GA over the process. It was also noticeable that the AUC value obtained by ELSGA in Segment4 was the highest among the segments, at 82% vs. the 79% obtained by the basic optimization. The statistical assessment of the data revealed that Segment4 contained vehicles with more claims than the vehicles in the other segments of this highly unbalanced data; this might be why our model performed better there, since more data and positive samples were available to build the predictive model.
The overall results suggest the segmentation formulation is compatible and positively contributes to better claim prediction performance with optimization approaches, particularly the ELSGA approach.
In the second track of the evaluation, the output of ELSGA was used to build the model with the XGBoost classifier [68]. In this experiment, the predictors selected by ELSGA were exploited to build the predictive models and compared with models trained with several other classifiers on the same data. In fact, the results of multiple classifiers (including deep learning and ensemble approaches) were selected as the baseline for comparison with ELSGA. The same WFCV method was used to build and evaluate the models on various portions of the data in a time-series fashion (this decision was made because our data was time-series data, so the folds could not be randomly selected with normal k-fold cross-validation). Table 5 shows the comparison of our proposed ELSGA method and multiple classifiers. The figures obtained from the different predictive models demonstrate how complex the task of predicting claims from vehicle usage is. As can be seen, most of the classifiers performed poorly, providing low values of auc ≈ 0.50. Practically, these results showed that most of the linear classifiers had no discriminatory capacity on this complex problem and could not map the usage to breakdowns. In contrast, deep learning models (such as CNN, LSTM and biLSTM) showed much better performance than the linear classifiers, with auc = 0.70 ± 2%.
Among the examined predictive models, Boosting and Stacking performed close to the proposed approach by auc = 0.75 and 0.74 vs. auc = 0.76, respectively, when total data was considered. Concerning the segmentation, we can clearly observe that the proposed approach (Module2-ELSGA) significantly outperformed the other classifiers, except for Bagging. It is necessary to remark that, within deep learning approaches, CNN worked well on the last segment by providing auc = 0.77. To go beyond this performance assessment, we applied a 5 × 2cv paired statistical t-test to evaluate how significant the results of the classifiers were. In almost all cases, the p-values were smaller than the critical value α = 0.05, so this rejected the null hypothesis, by stating that there was a significant difference between the outcomes. The statistical tests showed that only the performance of the ensemble-based approaches were relatively close to that of the proposed approach in most cases. It is also fair to remark that the statistical test on the performance of the CNN model, in the last segment, indicated that the difference was not significant. It can be observed that in one case even Bagging performed slightly better than the proposed approach in Segment 3; however, the difference was not significant. In Segment 1, the figures indicate Stacking performed quite similarly to our proposed ELSGA approach by reporting auc = 0.81. Table 5. The comparison between Module2 (ELSGA (XGB)) and different approaches (standard machine learning, Gaussian Process (GP), deep neural networks, and ensemble-based approaches, such as Bagging, Boosting and Stacking approaches) for RQ1. We used a 5 × 2cv paired t-test to examine how significant the differences were between the performances of the two models. This meant that every time we compared only two models, one ELSGA vs. one classifier. 
In addition, we conducted an A/B test by comparing the performance of the models trained by the original predictors (MORGP) vs. the models built by the extracted predictors (MEXP). Both groups of predictors (original and extracted) were selected by the optimization approach. Indeed, we aimed to reveal the contribution of the extracted features and original features to predictive models.

From Figure 9, we can observe that more than 50% of the selected predictors (except in the last segment) came from the extraction process, which speaks to the importance of such predictors in the prediction task (orange vs. purple bars). In fact, we considered this proportion comparison mainly as a sanity check. To quantify the impact of the extracted features, several models were built and evaluated through WFCV. The figures suggest the MEXP outperformed the MORGP by 3%, 1%, 7%, and 10% in segments one, two, three, and four, respectively (green vs. pink bars). The statistical tests also indicated that the differences in segments two, three, and four were significant, rejecting the null hypothesis (with α set to 0.05 as the critical value). Although the test failed to reject the null hypothesis in the first segment (p-value = 3.32), concluding that the difference between the two models there was not significant (green and pink bars), the extracted predictors (green) showed their valuable impact by providing almost the same performance (0.66) as the models trained with the mixed predictors (blue bar). Overall, the information extracted and used as extracted predictors had a greater, and more significant, impact on the predictive models' performance than the original predictors. Figure 9. The outcome of the models built with different predictors. "ORGP" denotes the original predictors' percentage, and "EXP" shows the proportion of the extracted predictors obtained after the optimization process.

RQ2-Snapshot-Stacked Ensemble Results
To answer RQ2, we took the output of the optimization process as the input of the ensemble module. We aimed to build a general model that could be used to predict claims over the year, given vehicle usage. Thus, the identical features from all segments were combined and injected into the first deep network to build and generate several diverse snapshot models. In addition, the season was added to these features as one additional feature to support generalization.
Given Equation (13), the number of cycles was set to N = 20 to generate 20 different snapshot models over the 400 epochs. Of the data, 60% was used to train the snapshots and 30% formed the validation set used to obtain the hard labels and soft predictions; the remaining 10% was held out to test the meta-learning model at the final stage (see Figure 7 for visualization). To quantify the diversity of the generated models, we used the disagreement measure defined in Equation (20). This metric is the ratio between the number of instances on which two base classifiers predict differently (one correctly, one incorrectly) and the total number of instances [78].
$$ div(Sn_i, Sn_j) = \frac{1}{n} \sum_{k=1}^{n} fn\bigl(Sn_i(r_k),\, Sn_j(r_k)\bigr) \quad (20) $$

where $Sn_i(r_k)$ refers to the label assigned by snapshot $Sn_i$ to readout $r_k$, and $fn()$ is a truth-predicate function that counts the cases where $Sn_i$ was correct and $Sn_j$ was wrong, and vice versa. Figure 10 illustrates the disagreement between the 20 snapshots.
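A sketch of the disagreement measure, following the description of fn() above; the function name is hypothetical:

```python
import numpy as np

def disagreement(pred_i, pred_j, y_true):
    """Fraction of instances on which exactly one of the two snapshots is
    correct: 0 means identical behaviour, values near 1 mean highly
    diverse models."""
    ok_i = np.asarray(pred_i) == np.asarray(y_true)
    ok_j = np.asarray(pred_j) == np.asarray(y_true)
    return float(np.mean(ok_i != ok_j))
```

Computing this for every pair of the 20 snapshots yields the 20 x 20 heat-maps shown in Figure 10.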
In the first phase, the hard labels predicted by the 20 snapshot models (we took the average of the models' performances) were compared with the performance of the same algorithms used in the evaluation of the second module (Module 2), depicted in Table 6. The AUC values obtained over the 5-fold WFCV showed that, in almost all cases, our approach was superior in the first phase (Module 3, first phase: Snapshots; see Table 6). This motivated us to construct the meta-learning with the aim of learning from the errors made in the first phase and improving the prediction in the second phase, which gives the final prediction. The results of the snapshot models (20 × 3) were horizontally added to the data-set (the validation set of the first phase) to be trained and tested again in the second phase. This resulted in 60 extra new features in the data set; that is, the data considered as the validation set for the first deep network (first phase/layer) became a new data set carrying 60 extra features (in the second phase). Accordingly, the held-out test set (10% of the whole data) was used to assess the model. It should be remarked that, in each phase, we trained the deep neural network separately in one training run: in the first phase, we constructed 20 models in one training process using CCAS, and, in the second phase, we built the meta-learning model through one training process.
The result of the second phase is shown in Table 6 (last row) and compared with the other approaches. The figures in the table confirm a significant improvement due to the stacking of the snapshot predictions generated in the first phase. We obtained AUC > 0.80 in almost all experiments, with the exception of one in Segment 2 (AUC = 0.75). The meta-learning model built on top of the snapshot models could learn from the errors of the previous training and prediction phases, resulting in a general predictive model that could potentially be used for claim/fault prediction in all seasons. Indeed, the diversity of the snapshots, illustrated in Figure 10, strongly supported the second phase in building a generalized meta-learning model for the final prediction. Taking into account the results of the above experiments, the ensemble approaches performed well compared with the other classifiers, including linear models and deep networks; the superiority of our approach on this complex problem, however, was very evident. This led us to examine these approaches in a different context. Thus, we ran the proposed approach, alongside the other three ensemble approaches, on several different datasets to assess its generality and to ascertain whether the SSED performed equally well in other application domains. This is an important consideration, since it assesses the generality of the approach in dealing with data from different contexts (see Table 7 for detailed information on the data sets).

Figure 10. The disagreement of the 20 snapshot models generated by cosine annealing in different segments. The heat-maps show how diverse the generated snapshots were in the first phase of Module 3. In these plots, each cell below the diagonal represents the disagreement between two snapshots, e.g., SP1 vs. SP2. The closer the div value (Equation (20)) is to 1, the more diverse the models.
From the heat-maps we observed that the snapshots in Segment 2 and Segment 3 were more diverse than those in Segment 1 and Segment 4. The observations from Table 8 show that SSED, in most cases, ranked first in terms of accuracy and outperformed the other ensemble approaches. An individual comparison of SSED with Bagging and Boosting indicated that SSED consistently outperformed both approaches on almost all datasets. This consistency was, however, violated on dataset 7, where Boosting provided slightly better results (2%). In contrast, the comparison between SSED and Stacking suggests that Stacking generalized better on some of the datasets. However, the statistical tests on those results (datasets 3, 7, 9, and 10) showed that the differences were not statistically significant at the critical value α = 0.05. For the same comparison, the t-tests on the figures obtained on datasets 1, 4, 6, 10, and 11, where SSED performed better, confirmed that the differences were statistically significant, from which we conclude that SSED outperformed the Stacking approach.

Table 6. The comparison between Module 3 (SSED), first phase, Module 3 (SSED), second phase, and different approaches (standard machine learning, Gaussian Process (GP), deep neural networks, and ensemble-based approaches, such as Bagging, Boosting and Stacking) for RQ2. We used the 5 × 2cv paired t-test to examine how significant the differences between the performances of the approaches were. In this statistical test, only "Module 3 (SSED), second phase" was compared with the other algorithms' performances, and each comparison involved only two models at a time; e.g., Module 3 (SSED), second phase vs. Boosting, or vs. Module 3 (SSED), first phase, or vs. Module 2, ELSGA (XGB).
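The significance test referred to above is Dietterich's 5 × 2cv paired t-test. A minimal numpy sketch is given below; the AUC differences in the example are hypothetical, not values from Table 6.

```python
import numpy as np

def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs: (5, 2) array of per-fold differences in the evaluation metric
    (e.g., AUC) between the two compared models, one pair of folds per
    replication. The statistic has 5 degrees of freedom; at alpha = 0.05
    (two-sided) the critical value is about 2.571.
    """
    diffs = np.asarray(diffs, dtype=float)
    p_bar = diffs.mean(axis=1)                          # per-replication mean difference
    s2 = ((diffs - p_bar[:, None]) ** 2).sum(axis=1)    # per-replication variance
    return diffs[0, 0] / np.sqrt(s2.mean())

# Hypothetical AUC differences (model A minus model B) over the 5 replications:
d = [[0.04, 0.05], [0.03, 0.06], [0.05, 0.04], [0.06, 0.03], [0.04, 0.05]]
t = five_by_two_cv_t(d)
significant = abs(t) > 2.571   # reject H0 of equal performance at alpha = 0.05
```

Using the first-fold difference of the first replication in the numerator, rather than a pooled mean, is what gives the statistic its approximate t distribution with 5 degrees of freedom.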

Discussion, Conclusions and Future Work
This paper presents a two-phase snapshot-stacked ensemble approach to predict warranty claims/breakdowns in the automotive sector. The experimental evaluation, using logged data from thousands of heavy-duty trucks, showed that the proposed approach could properly predict failures based on vehicle usage. Regarding the research questions introduced in Section 1, we evaluated the key aspects that express our approach's efficiency, effectiveness, stability and generality. Considering the first objective (RQ1), the figures obtained in the model-performance experiments showed a significant difference between the optimization approach and the other classifiers. The results in terms of feature contribution were also promising. Our A/B test experiments, illustrated in Figure 9, explain how effectively the extracted features contributed to the predictive model. Statistical tests further indicated that the improvement induced by the extracted features was significantly different from that of the original features. It was also observed that the model constructed with the extracted features in the first segment provided performance similar to that of the model trained on both the original and extracted predictors. Therefore, we answer RQ1 by stating that our proposed adapted heuristic approach outperforms the other approaches in extracting the best representation of vehicle usage for building a complex ensemble deep network for final claim prediction.
Concerning the second objective (RQ2), the results explained how meta-learning could learn from the errors produced by the snapshot models in the first phase of the ensemble approach to predict warranty claims. Indeed, the results in the second phase confirmed the stability of the meta-learning performance over the course of the year, taking into account different distributions of vehicle usage. The SSED produced a generic predictive model that performed well on data from different seasons for claim prediction, and was also promising on other, less complex, datasets, with performance comparable to that of Stacking. Therefore, we answer RQ2 by stating that the snapshot-stacked ensemble deep neural networks approach can successfully map usage to claims and outperforms the other compared algorithms. The figures obtained in the last phase also showed the potential for building a general forecasting model for this complex claim-prediction problem, and the results obtained on the various datasets confirmed this claim.
The findings of this work also reveal limitations of the proposed approach, which suggest new directions for future research. The first limitation pertains to modeling vehicle usage while taking various contexts into account. In this work, we considered time and vehicle usage when formulating the claim prediction. However, we did not include other contexts, such as location (where the vehicles were driven), which highly affect usage and the predictive models. This limitation strongly motivates us to extend our approach by incorporating contextual knowledge (e.g., location) for a better mapping between usage and breakdowns. An interesting extension of our work could focus on vehicle profiling: we could utilize segmentation to model vehicles' behaviors in different contexts, leading to the identification of good and bad driving styles as an important factor in vehicle performance. In this regard, driver behavior could serve as an extra knowledge input to characterize the style of vehicle usage in different contexts. Regarding contexts, we also hypothesize that climate could impact snapshot diversity, which might explain the dissimilarities shown in the results and could be the subject of further investigation.