Feature Selection for Improving Failure Detection in Hard Disk Drives Using a Genetic Algorithm and Signiﬁcance Scores

Abstract: Hard disk drives (HDD) are used for data storage in personal computing platforms as well as commercial datacenters. An abrupt failure of these devices may result in an irreversible loss of critical data. Most HDD use self-monitoring, analysis, and reporting technology (SMART), and record different performance parameters to assess their own health. However, not all SMART attributes are effective at detecting a failing HDD. In this paper, a two-tier approach is presented to select the most effective precursors for a failing HDD. In the first tier, a genetic algorithm (GA) is used to select a subset of SMART attributes that lead to easily distinguishable and well clustered feature vectors in the selected subset. The GA finds the optimal feature subset by evaluating only combinations of SMART attributes, while ignoring their individual fitness. A second tier is proposed to filter the features selected using the GA by evaluating each feature independently, using a significance score that measures the statistical contribution of a feature towards disk failures. The resultant subset of selected SMART attributes is used to train a generative classifier, the naïve Bayes classifier. The proposed method is tested on a SMART dataset from a commercial datacenter, and the results are compared with state-of-the-art methods, indicating that the proposed method has a better failure detection rate and a reasonable false alarm rate. It uses fewer SMART attributes, which reduces the required training time for the classifier and does not require tuning any parameters or thresholds.


Introduction
The annual shipments of tablet computing devices surpassed those of personal computers in 2013, making flash memory the predominant mode of data storage for digital consumers [1]. However, the size of the digital universe is expected to reach around 40 trillion gigabytes, which is roughly 5200 gigabytes of data for every living person on earth [2]. For a long time to come, consumers of digital data will therefore rely on relatively cheaper storage options for backup and archival purposes, i.e., local or cloud storage afforded by magnetic hard disk drives (HDD) [1]. HDD are generally regarded as reliable components, as their mean time to failure (MTTF) typically lies between 1 million and 1.5 million hours, as per manufacturer specifications. This would suggest a maximum annual failure rate (AFR) of 0.88%; however, field observations indicate that annual disk replacement rates (ARR) of 2-4% are common and, in some systems, replacement rates of up to 13% have been reported [3][4][5]. Moreover, HDD are the most frequently replaced hardware components in large-scale information technology (IT) infrastructure [6], and, in some large datacenters, 70-80% of the reported failures have in fact involved otherwise healthy disk drives. Thus, even a small improvement in the false alarm rate (FAR) would save the time, effort, and bandwidth required to perform unwarranted disk replacements and the associated data transfers.
In this paper, a two-tier approach is proposed to select the most indicative SMART attributes as precursors to a failing hard disk. In the first tier, a genetic algorithm (GA) is used for feature (or attribute) subset selection, whereas, in the second tier, a feature significance function is used to examine the selected subset of features and find those SMART attributes that would result in the best detection accuracy while minimizing the FAR. The GA selects the optimal SMART attributes using a mechanism that is analogous to biological evolution, where the principles of natural selection prune the sub-optimal characteristics in living organisms, over successive generations. In this study, the GA uses the ratio of the compactness of a class to the separability between classes as the objective function, and finds the optimal subset of SMART attributes that minimizes the objective function over successive generations. While the GA evaluates the SMART attributes in different combinations, the feature significance function considers individual attributes and discards those that are not significantly related to drive failures. A naïve Bayes (NB) classifier is then used to construct a model for the selected SMART attributes, which can differentiate healthy drives from failure-prone drives. Moreover, because of the inherent imbalance between the number of failed and healthy drives, random under-sampling of the majority class is used to mitigate the effects of class imbalance on the performance of the classifier.
The remainder of this paper is organized as follows. Section 2 presents the details of the proposed methodology, while Section 3 describes the SMART attributes dataset used to test the proposed methodology. Section 4 provides a discussion of the results obtained, and Section 5 presents the conclusions of this work.

The Proposed Methodology
The proposed methodology for detecting failing hard disks is illustrated in Figure 1. It utilizes SMART attributes to determine whether a disk is healthy or failing. SMART attributes are collected by modern HDD for self-monitoring purposes. These attributes are viewed as feature vectors in a high-dimensional space; the dimensionality of the feature space is determined by the number of selected SMART attributes. It is hypothesized that, for healthy and failing drives, these feature vectors will tend to form different clusters, which would be distinguishable in the high-dimensional feature space. A classifier could then be trained to differentiate between these two clusters, and hence determine whether a disk is failing or not using SMART attributes. Nevertheless, not all SMART attributes would be equally helpful in leading to easily separable clusters in the feature space; this would adversely affect the classifier's performance [25,26]. Hence, feature selection is used to select the optimal subset of features that would result in more compact and clearly separable clusters of vectors of the selected features. In addition to improving the predictive performance of the classifier, feature selection reduces the measurement and storage requirements, as well as the training and prediction times [27]. This study proposes a two-tier approach for feature selection using a GA and feature significance function.
While the GA evaluates different combinations of features to select the optimum subset, the feature significance function is used to individually evaluate each member of this optimal feature subset using significance scores. It discards those features that are determined to be statistically insignificant in contributing towards disk failure. The final set of selected features is then used to train a Bayes classifier that can detect a failing HDD.

Feature Selection
Machine learning techniques construct models of labeled objects in order to distinguish them from each other. These objects are described by feature vectors, and the accuracy of the models depends upon the quality of these features, i.e., the more discriminative the features, the more accurate the machine learning model. However, not all the features used to describe a given object may be useful in distinguishing it from other objects. That is, some features may be irrelevant or redundant and hence may adversely affect both the classification accuracy of the machine learning model and the time required to construct it [25,28,29]. Therefore, feature selection algorithms are an essential part of many machine learning applications, as they help remove irrelevant and redundant features to improve the classification accuracy, reduce the training time required to construct the model, and reduce the number of training samples required for good generalization [28][29][30][31][32][33].
Feature selection methods can be broadly divided into the following three categories [28]:
1. Filter Based Methods: These methods use a fitness function to first evaluate and rank different features, and then select the subset of features whose fitness values lie above a certain threshold. In essence, these methods filter out the bad features first and then construct the machine learning model. This approach is usually more efficient [28,29], but its performance depends upon the quality of the fitness function. The feature selection method used in this study can be categorized as a filter based method.
2. Wrapper Based Methods: These methods do not filter out the bad features before constructing the machine learning model; rather, they use the classifier itself to filter out the bad features. For example, different combinations of features may be used by the classifier, and the combination that yields the highest classification accuracy may be selected as the best set of features. This approach can be very time consuming and may only result in a sub-optimal solution [28,29].
3. Embedded or Hybrid Methods: As the name suggests, these methods use an embedded or hybrid approach. Unlike wrapper based methods, which iterate through different combinations of features and select the best subset on the basis of classifier accuracy, these methods do not involve such iterative use of the classifier, which improves their speed. Similarly, unlike filter based approaches, they do not use a separate fitness function to rank the features. Rather, they may use the output of the classifier itself to select the best subset; for example, the weights assigned to different inputs (features) in logistic regression or neural networks may be used to rank them and select the best subset among them.

Feature Selection Using a Genetic Algorithm
A GA is grounded in the concepts of biological evolution and natural selection. The principles of natural selection are believed to govern the evolution of living species: over generations, living organisms develop characteristics that enable them to thrive amid the adversities of their environments. A GA works in much the same way, as it iteratively refines a given solution by progressively selecting better candidate solutions while discarding inferior ones. In doing so, it mimics the mechanisms of biological evolution, namely crossover and mutation. A fitness or objective function is used to estimate the quality of each solution. The GA takes as input a set of N m-dimensional vectors in R^m, as given in Equation (1):

X^{(m)} = \{ X^{(m)}_1, X^{(m)}_2, \ldots, X^{(m)}_N \}, \quad X^{(m)}_i \in \mathbb{R}^m    (1)

The GA generates a corresponding set of n-dimensional vectors in R^n, as given in Equation (3):

X^{(n)} = \{ X^{(n)}_1, X^{(n)}_2, \ldots, X^{(n)}_N \}, \quad X^{(n)}_i \in \mathbb{R}^n    (3)

The GA thus reduces the dimensionality of each vector in the set given in Equation (1) without affecting the cardinality of the set, i.e., n \leq m. The dimensions selected by the GA are those SMART attributes that minimize the fitness function, and X^{(n)}_i represents a vector in the optimal subset of selected feature vectors. Equation (5) shows the fitness function proposed in this study:

F = C / S    (5)

Here, C denotes the average compactness of the classes, as given in Equation (6):

C = \frac{1}{L} \sum_{i=1}^{L} C_i    (6)

while S, as given in Equation (7), represents the average value of the separability among the different classes:

S = \frac{2}{L(L-1)} \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} S_{ij}    (7)

where L is the total number of classes in a given problem. In this study, the number of classes is two, i.e., healthy and failed hard disks. The notions of the compactness of a class and the separability of two classes are illustrated in Figure 2 in a three-dimensional space. The compactness of a class measures how well different instances of that class are clustered together, whereas the separability between two classes estimates how easy it is to separate the clusters formed by the instances of those two classes.
The proposed fitness function is intuitively similar to the Fisher ratio and, in some studies, has been used in combination with the Fisher ratio for feature selection [34]. The GA iterates to find a subset of features that minimizes the fitness function given in Equation (5), which is the ratio of the average values of these two quantities. The compactness of a given class is estimated by calculating the distance, usually the Euclidean distance, between the feature vectors of the class and the class mean or centroid. The mean or centroid, \mu^{(i)}, of a class i that has a total of N instances can be determined using Equation (8):

\mu^{(i)} = \frac{1}{N} \sum_{k=1}^{N} X_k    (8)

The mean value of the Euclidean norm given in Equation (9), which measures the distance of an instance of the class from the class centroid, can be used as an estimate of the compactness C_i of class i:

C_i = \frac{1}{N} \sum_{k=1}^{N} \| X_k - \mu^{(i)} \|    (9)

The Euclidean distance between the centroids of classes i and j, as given in Equation (10), can be used as an estimate of the separability S_{ij} between these two classes:

S_{ij} = \| \mu^{(i)} - \mu^{(j)} \|    (10)

The proposed two-tier approach for feature selection using a GA and feature significance scores is illustrated in Figure 3. The GA requires the SMART attributes, or features, to be encoded into chromosomes. New chromosomes are generated from old ones through mutation and crossover. The newly generated chromosomes replace their parents, provided that they perform better in terms of the fitness or objective function, as discussed earlier. This process of generating new chromosomes and selecting the best among them to replace those in the previous generation is repeated until the fitness function reaches an asymptotic value, with further iterations yielding no improvement [27]. The SMART attributes are encoded into chromosomes using a binary encoding scheme.
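As a concrete illustration, the fitness function of Equations (5)-(10) can be sketched in a few lines of Python. This is a minimal sketch, not code from the study; the function name and the use of NumPy are our own choices.

```python
import numpy as np

def fitness(X, y):
    """Ratio of average intra-class compactness to average inter-class
    separability, as in Equations (5)-(10); lower values are better."""
    classes = np.unique(y)
    # Eq. (8): per-class centroids
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    # Eq. (9): mean Euclidean distance of each sample to its class centroid,
    # averaged over classes (Eq. (6))
    compactness = np.mean([
        np.linalg.norm(X[y == c] - centroids[c], axis=1).mean()
        for c in classes
    ])
    # Eq. (10): Euclidean distance between class centroids,
    # averaged over class pairs (Eq. (7))
    separations = [np.linalg.norm(centroids[a] - centroids[b])
                   for i, a in enumerate(classes) for b in classes[i + 1:]]
    return compactness / np.mean(separations)   # Eq. (5)
```

A subset of SMART attributes that forms tight, well-separated clusters for healthy and failed drives yields a small fitness value, which is what the GA minimizes.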
The resulting chromosomes are simply strings of 1s and 0s, where a 0 implies that a certain SMART attribute has not been selected and a 1 implies that it has. A binary encoding scheme is used because each feature is either selected or left out: only when a feature is selected is it used in the calculation of the fitness function; otherwise, it is omitted entirely. Binary encoding is preferred here over, say, value encoding because the goal is not to determine weights for individual features, but to determine the best subset of features that optimizes the fitness function. Individual SMART attributes are identified by the index of each 1 and 0 in a chromosome. Random strings of 1s and 0s, i.e., randomly selected subsets of SMART attributes, serve as the initial population for the GA. New populations are generated from these chromosomes through mutation and crossover. A crossover involves the exchange of information, or swapping of fragments, between two parent chromosomes at randomly selected points. In contrast, a mutation involves flipping bits of a single chromosome at randomly selected positions. To select the chromosomes that replace the old ones, the fitness function is evaluated for each chromosome in the new generation. The 100 chromosomes with the smallest fitness function values (100 being the size of the chromosome population in this study) are selected to produce the next generation. This process of creation and selection continues for multiple generations until the proposed fitness function reaches an asymptotic value and sees no further reduction.
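The binary-encoded GA loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: one-point crossover, independent bit-flip mutation, and elitist survivor selection over a pooled parent/child population are assumed, and the crossover and mutation rates are our own choices (the population size of 100 and the cap of 80 generations follow the study).

```python
import numpy as np

def ga_select(fitness, n_features, pop_size=100, n_gen=80,
              p_cross=0.8, p_mut=0.02, seed=0):
    """Binary-encoded GA: each chromosome is a 0/1 mask over the
    features; the pop_size fittest chromosomes survive each generation."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, (pop_size, n_features))
    for _ in range(n_gen):
        children = pop.copy()
        rng.shuffle(children)
        for i in range(0, pop_size - 1, 2):        # one-point crossover
            if rng.random() < p_cross:
                pt = rng.integers(1, n_features)
                children[i, pt:], children[i + 1, pt:] = (
                    children[i + 1, pt:].copy(), children[i, pt:].copy())
        flips = rng.random(children.shape) < p_mut  # bit-flip mutation
        children = np.where(flips, 1 - children, children)
        # pool parents and children, keep the pop_size smallest fitnesses
        pool = np.vstack([pop, children])
        scores = np.array([fitness(c) for c in pool])
        pop = pool[np.argsort(scores)[:pop_size]]
    return pop[0]   # best chromosome: 1 = attribute selected
```

In practice, `fitness` would evaluate the compactness-to-separability ratio on the SMART columns selected by the chromosome; any minimizable function of a 0/1 mask works.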

Feature Selection Using Significance Scores
The GA evaluates subsets of features, or SMART attributes, in high-dimensional spaces. Given the conclusions drawn in previous studies, such as [16], which is summarized in Section 1, it is plausible to assume that these features might behave well as a group, i.e., lead to easily distinguishable clusters in the feature space when considered together. However, individually, they might not hold considerable significance in determining a drive's failure. Hence, a simple mechanism is proposed to individually evaluate each feature selected by the GA by calculating its significance score, which provides a crude measure of the contribution of that feature towards the failure of disk drives. For a given feature, x_i, first its mean values over the healthy and failed disk drives are determined, denoted as \bar{x}_i^{(h)} and \bar{x}_i^{(f)}, respectively. The significance score \Psi is then determined by calculating the frequentist probability of the event that the distance of a sample of feature x_i from \bar{x}_i^{(f)} is smaller than its distance from \bar{x}_i^{(h)}, as given in Equation (11):

\Psi(x_i) = P\left( | x_i - \bar{x}_i^{(f)} | < | x_i - \bar{x}_i^{(h)} | \right)    (11)

According to Equation (11), a feature is considered significant for predicting a disk failure if most of its values for failed drives lie closer to the mean value of that feature for the failed disk drives.
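A sketch of the significance score of Equation (11), computed as a frequentist probability over the failed-drive samples of a single feature (the function name and interface are ours, not from the study):

```python
import numpy as np

def significance_score(x_failed, x_healthy):
    """Eq. (11): fraction of failed-drive samples of a feature that lie
    closer to the failed-class mean than to the healthy-class mean."""
    mu_f = np.mean(x_failed)    # \bar{x}_i^{(f)}
    mu_h = np.mean(x_healthy)   # \bar{x}_i^{(h)}
    return np.mean(np.abs(x_failed - mu_f) < np.abs(x_failed - mu_h))
```

A feature whose failed-drive values cluster tightly around the failed-class mean scores near 1, while a feature whose two class means nearly coincide scores near 0 and would be discarded.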

Classification Using the Naive Bayes Classifier
In this study, the NB classifier is used to differentiate a healthy HDD from a failing drive using the features selected by the proposed two-tier feature selection method, which employs a GA and feature significance scores. The NB classifier classifies a given feature vector, X_n, by applying Bayes' rule to a generative classifier of the form given in Equation (12) [35]:

P(y \mid X_n, \theta) \propto P(X_n \mid y, \theta) \, P(y \mid \theta)    (12)

In Equation (12), y ∈ {H, F} is a class label and \theta is the unknown parameter of the conditional density of each class, while H and F represent healthy and failing hard disks, respectively. Unlike discriminative classifiers such as support vector machines (SVMs), which directly model the posterior P(y|X_n), generative classifiers such as NB solve a more general problem, i.e., they learn a model of the joint probability, P(X, y), of the feature vectors X and class label y, and then use Bayes' rule to predict the label for an unknown feature vector [36]. The NB classifier assumes conditional independence among the features, given the class labels, which reduces the number of unknown parameters that need to be estimated [35]. This is a reasonable assumption, given that, in the SMART dataset [5], most of the features are independent and do not generally exhibit any correlation. The NB classifier predicts categorical class labels, i.e., Healthy and Failing, for unknown vectors of the selected SMART attributes by using Bayes' rule and estimating the joint probability of the SMART attributes and the class labels. Given the nature of SMART data, a direct mapping of the feature vectors and the output labels by a discriminative classifier, such as an SVM, would not be very useful in predicting a failing disk drive. Moreover, because of the inherent class imbalance in the available data, i.e., the ratio of failed to healthy drives being a little more than 1:100, discriminative learners can lead to overfitting, and may have difficulty learning the minority class distribution [37][38][39][40].
This is demonstrated by the experimental results as discussed in Section 4. The model is constructed using a multivariate multinomial distribution (MVMN), which fits an appropriate probability distribution model for each attribute. A three-fold cross-validation scheme is used to improve the generalization performance of the NB classifier by training and testing it on different subsets of the data.
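The classification stage can be sketched with scikit-learn. Note that the study fits a multivariate multinomial (MVMN) model per attribute; the sketch below substitutes a Gaussian NB on synthetic data purely to illustrate the three-fold cross-validation setup, so the distributions, sample counts, and library choice are stand-ins rather than the study's configuration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the selected SMART attributes: 9 features,
# healthy (label 0) and failing (label 1) drives drawn from shifted
# distributions. Real SMART data would replace X and y.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (1500, 9)),   # healthy drives
               rng.normal(2, 1, (565, 9))])   # failing drives
y = np.array([0] * 1500 + [1] * 565)

clf = GaussianNB()
scores = cross_val_score(clf, X, y, cv=3)     # three-fold cross-validation
print(scores.mean())
```

`cross_val_score` uses stratified folds for classifiers, so each fold preserves the healthy-to-failed ratio, mirroring the per-fold class counts described in Section 4.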

The SMART Dataset
The proposed methodology for detecting failing hard disks was tested on a public dataset of SMART attributes collected by a commercial datacenter [5]. The dataset used in this study was collected for more than 40,000 HDD, comprising 26 different models with storage capacities ranging from 1.0 terabyte to 8.0 terabytes, over a span of 273 days. As disks that were determined to have failed on a given day were removed the next day, and new disks were constantly being added to the datacenter, the number of disks was almost always changing, though it remained in excess of 40,000 at all times. However, the proposed methodology was not tested on the entire dataset because of several caveats in the data. First, although values of 45 SMART attributes were to be recorded for more than 40,000 HDD each day, not every HDD reported values for all the SMART attributes every day, and a few attributes were not reported at all. The data were therefore rife with blank fields, requiring the removal of records for disks with blank fields. Moreover, for some SMART attributes, the reported values were far out of bounds; e.g., for some drives, the raw value of SMART attribute 9 (the ID of the SMART attribute that stores the total power-on hours of an HDD) suggested a drive life of more than 10 years, which was not correct. Such records had to be removed from the dataset to ensure that the results and conclusions are based on reliable data. Furthermore, an average ARR of around 2-4% implies that the available SMART datasets will almost always be imbalanced, i.e., more data are available for healthy drives than for failed drives; class imbalance skews a classifier in favor of the majority class, i.e., the healthy drives in this case, which leads to overfitting. Different approaches have been proposed to mitigate the effects of class imbalance.
The simplest approaches involve either over-sampling the minority class, i.e., the failed hard disks in this case, or under-sampling the majority class, i.e., the healthy disks [37][38][39][40]. In this study, the majority class is under-sampled, i.e., data for an appropriate number of randomly selected healthy hard disks are removed to make the two classes more balanced. The SMART attribute values for all the failed drives are selected for training the NB classifier, i.e., a total of 565 drives failed during the period for which the data were considered. In order to have a reasonable balance between the two classes, SMART data for only 1500 randomly selected samples of healthy hard drives were added to the final dataset, which reduced training time and helped avoid overfitting. Hence, the final dataset that was used for training and testing the classifier contained values of 42 SMART attributes for 2065 hard disks, for a period of approximately nine months.
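The random under-sampling of the majority class described above can be sketched as follows (the function name, signature, and defaults are ours; the counts of 1500 healthy and 565 failed drives follow the text):

```python
import numpy as np

def undersample(X, y, majority_label=0, n_keep=1500, seed=0):
    """Randomly under-sample the majority class (healthy drives),
    keeping every minority-class (failed-drive) record."""
    rng = np.random.default_rng(seed)
    majority = np.flatnonzero(y == majority_label)
    keep = np.concatenate([
        rng.choice(majority, size=n_keep, replace=False),
        np.flatnonzero(y != majority_label),
    ])
    keep.sort()                      # preserve the original record order
    return X[keep], y[keep]
```

Applied to the full dataset, this reduces more than 40,000 healthy drives to 1500 while retaining all 565 failed drives, yielding the 2065-drive training set used in this study.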

Results and Discussion
This section presents the results of this study, and provides a discussion in the context of existing research. After the necessary preprocessing, as discussed in Section 3, the SMART data are used by the GA to select the optimal features that will minimize the proposed fitness function. The subset of features selected by the GA is given in Table 1. The GA reduces the dimensionality of the original feature space from 42 to 12. It utilizes a population of 100 chromosomes over a maximum of 80 generations to optimize the fitness function discussed in Section 2. The GA selects a subset of features, which might work well as a combination but might contain sub-optimal features. Those sub-optimal features might not be significant contributors in determining whether an HDD is failing or not. This is why, in the proposed two-tier feature selection process, all the features selected by the GA are subsequently evaluated by calculating their feature significance scores, as discussed in Section 2.3.
The second tier of the proposed feature selection process discards the features at serial numbers 1, 4, and 6, which are the SMART attributes recording the spin-up time, spin retry count, and reported uncorrectable errors, respectively. The final list of the nine features selected by the proposed two-tier feature selection process is given in Table 2. The effectiveness of the proposed two-tier feature selection process is demonstrated by the results in Table 3. When an NB classifier is trained using the features selected by the proposed two-tier approach, given in Table 2, it achieves an average classification accuracy of 99.01% with a false positive rate (FPR) of 0.24%. However, when the NB classifier is trained using different sets of SMART attributes, the results are not as promising. For example, when all 42 attributes are used to train the classifier, the average classification accuracy drops to 86.98%, while the FPR rises to 1.03%. The features selected by the GA alone are effective, but contain certain SMART attributes that are not significant contributors in determining a failing disk drive, as evident from the average accuracy of 92.0% and an FAR of 0.92%. Table 3 also provides a comparison of NB with a discriminative classifier, the SVM. The SVM achieves an average accuracy of 83.30% and an FAR of 0.26% when it is trained on the features selected by the GA alone. It achieves a marginal improvement in average accuracy when it is trained on the nine features selected by the proposed two-tier approach. However, it degrades on both counts: the TPR drops to 44%, whereas the FPR exhibits a small increase of 0.6%. The results in Table 3 indicate that the NB classifier performs better than the SVM in predicting failing disk drives.
Moreover, the results obtained using two different types of classifiers indicate that the SMART attributes selected by the proposed two-tier feature selection method yield better diagnostic performance in detecting failing and healthy HDD. As discussed earlier, the SVM, being a discriminative classifier, models the direct relationship between the feature vectors and the class labels. This approach may not be effective in the case of SMART data, where a direct relation between the two is not always conclusive. The performances of the SVM and NB classifiers are also illustrated in Figure 4, which compares the receiver operating characteristic (ROC) curves for these classifiers. The ROC curves in Figure 4 clearly indicate that the NB classifier trained on the nine features selected by the proposed two-tier feature selection process yields the best results in terms of both the TPR and FPR. When compared to existing methods for detecting failing hard disks, the proposed method offers distinct advantages. Wang et al. [18] presented a two-step parametric method, which utilizes the 47 critical features identified in [17] for failure prediction in HDD, as opposed to the nine SMART attributes determined in this study. They tested their method on a dataset that was collected for 369 hard disks of a single model and contained data only for the last 600 h of operation, with each sample taken 2 h apart from the next. Thus, for each drive, a maximum of 300 values is available for each of the 47 features. Among the 369 drives, 178 are healthy, while 191 are failed drives. Given the variation in failure rates across different models, and across different storage capacities for certain models, as observed in commercial datacenters [41], this dataset cannot be considered representative. In contrast, as discussed in Section 3, the proposed method is tested on a more extensive and more representative dataset.
The method proposed in [18] could achieve a failure detection rate (FDR) or TPR of 68.42% at a FAR of 0%, and an FDR of around 95% at a FAR of around 4.2%. However, the FAR is highly sensitive to the failure threshold and no mechanism has been provided to set an appropriate failure threshold. This is an important concern because the ineffectiveness of the simple thresholding technique put in place by drive manufacturers to detect a failing hard disk was a major reason why considerable interest was generated among researchers to devise better methods for detecting failing disk drives. The proposed method uses a two-tier feature selection process, and determines the nine SMART attributes given in Table 2, to be the most effective precursors to a failing HDD. This reduces the training time of the classifier. Using the values of the nine selected SMART attributes for 2065 hard disks, the proposed method yields an FDR of 98.40% at a FAR of 0.24%, without using any arbitrary parameters. Among the 2065 hard drives, 1500 are healthy and 565 are failed drives. These are divided into three folds, with 688, 688, and 689 hard drives, respectively. Each fold contains 500 healthy drives, whereas the remainder are failed drives. The proposed method correctly detects 185 of the 188 failed drives with only 1.2 false alarms on average. Moreover, the proposed algorithm can also be used for the online diagnosis of HDD. A trained instance of a NB or SVM classifier can be provided with the values of the nine SMART attributes of an HDD, as listed in Table 2. The output of the classifier can then be interpreted as either a healthy HDD or one with an impending failure.
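For reference, the FDR (equivalently, the TPR) and FAR (equivalently, the FPR) figures discussed above can be computed from true and predicted labels as follows. This is a generic sketch, not code from the study; the function name and label convention (1 = failing, 0 = healthy) are ours.

```python
import numpy as np

def detection_metrics(y_true, y_pred):
    """Return (FDR, FAR) for binary labels with 1 = failing, 0 = healthy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # failing drives detected
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    fdr = tp / np.sum(y_true == 1)               # failure detection rate (TPR)
    far = fp / np.sum(y_true == 0)               # false alarm rate (FPR)
    return fdr, far
```

For example, detecting 185 of 188 failed drives corresponds to an FDR of about 98.4%, matching the figure reported above.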

Conclusions
In this paper, a novel two-tier approach was presented to select the most effective precursors to a failing HDD. These precursors were selected from an initial list of 42 SMART attributes, which were recorded in a commercial datacenter over a period of nine months for 21 different models with storage capacities ranging from 1.0 TB to 8.0 TB. The proposed two-tier approach evaluated the SMART attributes, both in combinations and individually. First, a GA was used to explore different feature subspaces in order to determine the best combination of features or SMART attributes. The quality of a feature subset was measured by determining how well clustered the samples of those features are for each of the two classes, i.e., healthy and failed HDD, and how well separated those clusters are from one another. This was done by calculating the ratio of intra-class compactness to inter-class separation for each subset of features, and finding the subset that minimized this ratio. The compactness of a class and the separation between two classes were measured using Euclidean distances. In the second tier, a new measure, i.e., significance score, was proposed to individually evaluate the features selected by the GA. The significance score measured the statistical contribution of a given feature towards disk failures. Features with statistical scores lower than a certain value were discarded. The final list of features selected by the proposed two-tier feature selection process comprised nine SMART attributes as opposed to the original 42, resulting in a shorter training time for the classifiers. These nine attributes were then used to train a generative classifier, the NB. The NB gave an FDR of 98.40% compared to 40.0% by the SVM, which is a discriminative classifier. To avoid overfitting by the classifier on the majority class data, the inherent class-imbalance problem in the failure data for HDD was addressed by under-sampling the majority class of healthy HDD. 
The proposed method correctly detected 185 of the 188 failed drives with only 1.2 false alarms on average.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: