A comparative study of feature selection methods for stress hotspot classification in materials

The first step in constructing a machine learning model is defining the features of the data set that can be used for optimal learning. In this work we discuss feature selection methods, which can be used to build better models as well as to achieve model interpretability. We applied these methods in the context of the stress hotspot classification problem, to determine which microstructural characteristics can cause stress to build up in certain grains during uniaxial tensile deformation. The results show how some feature selection techniques are biased and demonstrate a preferred technique for obtaining feature rankings for physical interpretation.

Feature selection methods have been used extensively in the fields of bioinformatics [15], psychiatry [16] and cheminformatics [17]. There are multiple feature selection methods, broadly categorized into Filter, Wrapper and Embedded methods based on their interaction with the predictor during the selection process. Filter methods rank the variables as a preprocessing step, so feature selection is done before the model is chosen. In the wrapper approach, nested subsets of variables are tested to select the subset that works best for the model during the learning process. Embedded methods incorporate variable selection into the training algorithm itself.
We have used random forest models to study stress hotspot classification in FCC [3] and HCP [4] materials. In this paper, we review some feature selection techniques applied to the stress hotspot prediction problem in hexagonal close packed materials, and compare them with respect to future data prediction. We focus on two commonly used techniques from each category: (1) Filter Methods: Correlation based feature selection (CFS) [18] and Pearson Correlation [19]; (2) Wrapper Methods: FeaLect [20] and Recursive feature elimination (RFE) [13]; and (3) Embedded Methods: Random Forest Permutation accuracy importance (RF-PAI) [21] and Least Absolute Shrinkage and Selection Operator (LASSO) [22]. The main contribution of this article is to raise awareness in the materials data science community of how different feature selection techniques can lead to misguided model interpretations, and of how to avoid them. We point out some of the inadequacies of popular feature selection methods and, finally, extract data driven insights with a better understanding of the methods used.

Methods
An applied stress is distributed heterogeneously within the grains in a microstructure [23]. Under an applied deformation, some grains are prone to accumulating stress due to their orientation, geometry and placement with respect to the neighboring grains. These regions of high stress, so-called stress hotspots, are related to void nucleation under ductile fracture [24]. Stress hotspot formation has been studied in face centered cubic (FCC) [3] and hexagonal close packed (HCP) [4] materials using a machine learning approach. A set of microstructural descriptors was designed to be used as features in a random forest model for predicting stress hotspots. To achieve data driven insights into the problem, it is essential to rank the microstructural descriptors (features). In this paper, we review different feature selection techniques applied to the stress hotspot classification problem in HCP materials, which have a complex plasticity landscape due to anisotropic slip system activity.
Let $(x_i, y_i)$, for $i = 1, \ldots, N$, be $N$ independent identically distributed (i.i.d.) observations, where $x_i \in \mathbb{R}^p$ is a $p$-dimensional vector of grain features and the response variable $y_i \in \{0, 1\}$ denotes the truth value of a grain being a stress hotspot. The input matrix is denoted by $X = (x_1, \ldots, x_N) \in \mathbb{R}^{N \times p}$, and $y \in \{0, 1\}^N$ is the binary outcome. We will use small letters to refer to the samples $x_1, \ldots, x_N$ and capital letters to refer to the features $X_1, \ldots, X_p$ of the input matrix $X$. Feature importance refers to the metrics used by various feature selection methods to rank features, such as feature weights in linear models or variable importance in random forest models.

Dataset Studied
A dataset of HCP microstructures with different textures was generated using Dream.3D in [4]. Uniaxial tensile deformation was simulated in these microstructures using EVPFFT [25] with different constitutive parameters, resulting in a dataset representing a titanium-like HCP material with an anisotropic critical resolved shear stress ratio [4]. This dataset contains grain-wise values for the equivalent von Mises stress, and the corresponding Euler angles and grain connectivity parameters.
The grains having stress greater than the 90th percentile of the stress distribution were designated as stress hotspots, a binary target. Thirty-four variables were developed to be used as features in machine learning. These features (X) describe the grain texture and geometry and are summarized in Table 1. We rank these features using different feature selection techniques, and observe the improvement in the models, as well as seek to understand the physics behind stress hotspot formation. Model performance is measured by the AUC (area under curve), a metric for binary classification which is insensitive to imbalance in the classes. An AUC of 100% denotes perfect classification and 50% denotes performance no better than random guessing [26].
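As a rough illustration of the labeling rule and the AUC metric described above, the sketch below thresholds a synthetic stress array at its 90th percentile and computes the AUC as the probability that a randomly chosen hotspot grain scores higher than a randomly chosen non-hotspot grain. The function names and the lognormal stress values are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def label_hotspots(stress, percentile=90.0):
    """Return 1 for grains whose stress exceeds the given percentile, else 0."""
    threshold = np.percentile(stress, percentile)
    return (np.asarray(stress) > threshold).astype(int)

def auc_score(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count as 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

# Synthetic grain-averaged stresses (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
stress = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
y = label_hotspots(stress)

print("hotspot fraction:", y.mean())
print("AUC of a perfect score:", auc_score(y, stress))
print("AUC of an uninformative score:", auc_score(y, np.zeros_like(stress)))
```

Because only about 10% of the grains are hotspots, plain accuracy would reward a model that always predicts "not a hotspot", whereas the pairwise definition used here does not depend on the class proportions.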

Filter Methods
Filter methods are based on preprocessing the dataset to extract the features $X_1, \ldots, X_p$ that most impact the target $Y$. Some of these methods are:
Pearson Correlation [19]: This method provides a straightforward way of filtering features according to their correlation coefficient. The Pearson correlation coefficient between a feature $X_i$ and the target $Y$ is:
$$\rho_i = \frac{\mathrm{cov}(X_i, Y)}{\sigma_{X_i} \sigma_Y}$$
where $\mathrm{cov}$ is the covariance and $\sigma$ is the standard deviation [19]. It ranges over $[-1, 1]$, from negative to positive correlation, and can be used for binary classification and regression problems. It is a quick metric by which the features are ranked in order of their absolute correlation coefficient with the target.
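The Pearson ranking described above can be sketched in a few lines of NumPy. The function name, the feature names and the synthetic data are assumptions for illustration, and the sketch assumes that no feature column is constant (which would make its standard deviation zero).

```python
import numpy as np

def pearson_ranking(X, y, names):
    """Rank features by absolute Pearson correlation with the target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)                     # center each feature
    yc = y - y.mean()                           # center the target
    cov = (Xc * yc[:, None]).sum(axis=0)        # unnormalized covariance
    r = cov / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    order = np.argsort(-np.abs(r))              # largest |r| first
    return [(names[j], float(r[j])) for j in order]

# Synthetic example: f0 drives the binary target, f1 helps a little, f2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=500) > 1.0).astype(float)

ranking = pearson_ranking(X, y, ["f0", "f1", "f2"])
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Each feature is scored on its own, so two redundant features with a strong linear relation to the target would both rank highly.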
Correlation based feature selection (CFS) [18]: CFS was developed to select a subset of features with high correlation to the target and low intercorrelation among themselves, thus reducing redundancy and selecting a diverse feature set. CFS assigns a heuristic merit to a feature subset rather than to individual features. It uses the symmetrical uncertainty correlation coefficient given by:
$$SU(X, Y) = \frac{2\, IG(X|Y)}{H(X) + H(Y)}$$
where $IG(X|Y)$ is the information gain of feature $X$ for the class attribute $Y$, and $H(X)$ is the entropy of variable $X$. The following merit metric was used to rank each subset $S$ containing $k$ features:
$$\mathrm{Merit}_S = \frac{k\, \overline{r}_{cf}}{\sqrt{k + k(k-1)\, \overline{r}_{ff}}}$$
where $\overline{r}_{cf}$ is the mean symmetrical uncertainty correlation between the features ($f \in S$) and the target, and $\overline{r}_{ff}$ is the average feature-feature intercorrelation. Because evaluating all possible feature subsets is computationally prohibitive, CFS is often combined with search strategies such as forward selection, backward elimination and bi-directional search. In this work we have used the scikit-learn implementation of CFS [27], which uses symmetrical uncertainty [18] as the correlation metric and explores the subset space using best first search [28], stopping when it encounters five consecutive fully expanded non-improving subsets.
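To make the CFS merit concrete, the following sketch evaluates the symmetrical uncertainty and the subset merit on a tiny hand-made dataset. It assumes discrete (already binned) features and omits the best first search; the helper names are hypothetical and this is not the implementation cited as [27].

```python
from itertools import combinations
import numpy as np

def entropy(*columns):
    """Shannon entropy in bits of one discrete variable, or joint entropy of several."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 0.0
    info_gain = hx + hy - entropy(x, y)
    return 2.0 * info_gain / (hx + hy)

def cfs_merit(X, y, subset):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = np.mean([symmetrical_uncertainty(X[:, j], y) for j in subset])
    pairs = list(combinations(subset, 2))
    r_ff = np.mean([symmetrical_uncertainty(X[:, a], X[:, b]) for a, b in pairs]) if pairs else 0.0
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))

# Column 0 equals the target, column 1 duplicates column 0, column 2 is unrelated.
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
X = np.column_stack([y, y, [0, 1, 0, 1, 0, 1, 0, 1]])

for subset in ([0], [0, 1], [0, 2]):
    print(subset, round(cfs_merit(X, y, subset), 4))
```

In this toy case, adding the duplicate column leaves the merit unchanged and adding the unrelated column lowers it, which is the redundancy penalty the merit is designed to express.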

Embedded Methods
These methods are popular because they perform feature selection while constructing the classifier, removing the preprocessing feature selection step. Some popular algorithms are support vector machines (SVM) using recursive feature elimination (RFE) [29], random forests (RF) [21] and Least absolute shrinkage and selection operator (LASSO) [22]. We compare LASSO and RF methods for feature selection on the stress hotspot dataset.
Least Absolute Shrinkage and Selection Operator (LASSO) [22]: LASSO is linear regression with $L_1$ regularization [22]. A linear model $L$ is constructed on the training data $(x_i, y_i)$, $i = 1, \ldots, N$:
$$L = \arg\min_{w} \left\{ \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \lambda \|w\|_1 \right\}$$
where $w$ is a $p$-dimensional vector of weights, one for each feature dimension, and $\lambda$ is the regularization strength. The $L_1$ regularization term ($\lambda \|w\|_1$) helps in feature selection by pushing the weights of correlated features to zero, thus preventing overfitting and improving model performance. Model interpretation is possible by ranking the features according to the LASSO feature weights. However, it has been shown that for a given regularization strength $\lambda$, if the features have redundancy, inconsistent subsets can be selected [30]. Nonetheless, LASSO has been shown to provide good prediction accuracy by reducing model variance without substantially increasing the bias, while providing better model interpretability. We used the scikit-learn implementation to compute our results [31].
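A minimal sketch of ranking features by LASSO weights with scikit-learn follows. The synthetic data, the feature names and the value `alpha=0.02` are assumptions for illustration. Note that scikit-learn's `Lasso` scales the squared error term by 1/(2N), so its `alpha` plays the role of the regularization strength but is not numerically identical to a penalty defined on an unscaled sum of squares.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic binary target driven by f0 (positively) and f1 (negatively); f2..f5 are noise.
rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)
names = [f"f{j}" for j in range(p)]

# Fit LASSO and rank the features by the absolute value of their weights.
model = Lasso(alpha=0.02).fit(X, y)
order = np.argsort(-np.abs(model.coef_))
ranking = [(names[j], float(model.coef_[j])) for j in order]
selected = [name for name, w in ranking if w != 0.0]

for name, w in ranking:
    print(f"{name}: weight = {w:+.4f}")
print("features with non-zero weight:", selected)
```

Rerunning the fit with a larger `alpha` shrinks more weights to exactly zero, which is why the selected subset depends on the regularization strength.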
Random Forest Permutation Accuracy Importance (RF-PAI) [21]: The random forest is a nonlinear multivariate model built on an ensemble of decision trees. It can be used to determine feature importance using the inbuilt feature importance measure [21]. For each of the trees in the model, the values of one feature are randomly permuted while all other features are kept unchanged. The resulting model will have a lower performance if the feature is important. When the permuted variable $X_j$, together with the remaining unchanged variables, is used to predict the response, the number of observations classified correctly decreases substantially if the original variable $X_j$ was associated with the response. Thus, a reasonable measure of feature importance is the difference in prediction accuracy before and after permuting $X_j$. The feature importance calculated this way is known as Permutation Accuracy Importance (PAI) and was computed using the scikit-learn package in Python [31].
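The accuracy difference before and after permuting a feature can be written out directly, as in the sketch below. It is an illustrative simplification: the importance is measured on a held-out split rather than on the out-of-bag samples of each tree, and the function name and synthetic data are assumptions, not the routine used to produce the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def permutation_accuracy_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)                # accuracy on unpermuted data
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link to y
            drops.append(baseline - model.score(X_perm, y))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data: the class depends mostly on feature 0, weakly on 1, not on 2 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pai = permutation_accuracy_importance(rf, X_test, y_test)
print(np.round(pai, 3))
```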

Wrapper Methods
Wrapper methods test feature subsets using a model hypothesis. They can detect feature dependencies, i.e. features that become important in the presence of each other. They are computationally expensive, and hence often use greedy search strategies (forward selection and backward elimination [32]), which are fast and avoid overfitting, to get the best nested subset of features.
FeaLect Algorithm [20]: The number of features selected by LASSO depends on the regularization parameter λ, and in the presence of highly correlated features, LASSO arbitrarily selects one feature from a group of correlated features [33]. The set of possible solutions for all LASSO regularization strengths is given by the regularization path, which can be recovered efficiently using the Least Angle Regression (LARS) algorithm [34]. It was shown that LASSO selects the relevant variables with probability one and all others with a positive probability [30]. Based on this property, the Bolasso feature selection algorithm was developed in 2008 as an improvement on LASSO [30]. In this method, the dataset is bootstrapped, and a LASSO model with a fixed regularization strength λ is fit to each bootstrap sample. Finally, the intersection of the LASSO selected features across the samples is chosen to get a consistent feature subset.
In 2013, the FeaLect algorithm, an improvement over the Bolasso algorithm, was developed based on the combinatorial analysis of regression coefficients estimated using LARS [20]. FeaLect considers the full regularization path, and computes the feature importance using a combinatorial scoring method, as opposed to simply taking the intersection with Bolasso. The FeaLect scoring scheme measures the quality of each feature in each bootstrapped sample, and averages them to select the most relevant features, providing a robust feature selection method. We used the R implementation of FeaLect to compute our results [35].
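The Bolasso step that FeaLect builds on (bootstrap the data, fit LASSO at a fixed regularization strength, intersect the selected supports) can be sketched as follows. This is not the FeaLect scoring scheme, which works on the full regularization path. The function name, the synthetic regression data and the values `alpha=0.1` and `n_bootstrap=32` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_support(X, y, alpha, n_bootstrap=32, seed=0):
    """Indices of features given a non-zero LASSO weight in every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    support = np.ones(X.shape[1], dtype=bool)
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)        # sample rows with replacement
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        support &= coef != 0                    # keep only consistently selected features
    return np.flatnonzero(support)

# Synthetic regression problem: only features 0 and 1 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=300)

selected = bolasso_support(X, y, alpha=0.1)
print("consistently selected features:", selected.tolist())
```

Taking the intersection is a hard yes/no rule per feature; FeaLect instead averages a per-sample score, which yields a graded ranking.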
Recursive Feature Elimination (RFE) [29]: A number of common ML techniques (such as linear regression, support vector machines (SVM), decision trees, Naive Bayes, perceptron, etc.) provide feature weights that consider multivariate interacting effects between features [13]. To interpret the relative importance of the variables from these model feature weights, RFE was introduced in the context of support vector machines (SVM) [29] for obtaining compact gene subsets from DNA-microarray data.
To find the best feature subset, instead of doing an exhaustive search over all feature combinations, RFE uses a greedy approach, which has been shown to reduce the effect of correlation bias in variable importance measures [36]. RFE uses backward elimination: it takes the given model (SVM, random forests, linear regression etc.), discards the worst feature (by absolute classifier weight or feature ranking), and repeats the process over increasingly smaller feature subsets until the best model hypothesis is achieved. The weights of this optimal model are used to rank features. Although this feature ranking might not be the optimal ranking for individual features, it is often used as a variable importance measure [36]. We used the scikit-learn implementation of RFE with a random forest classifier to obtain a feature ranking for our dataset.
Table 2 shows the feature importances calculated using filter based methods (Pearson correlation and CFS), embedded methods (random forest (RF), linear regression, ridge regression ($L_2$ regularization) and LASSO regression) and wrapper methods (RFE and FeaLect). The shaded cells denote the features that were finally selected to build RF models, and their corresponding performances are noted. The input data was scaled by minimum and maximum values to [0,1]. Figure 1 shows the correlation matrix for the features and the target.
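For reference, RFE with an underlying random forest, applied to min-max scaled inputs, can be run in scikit-learn roughly as follows. The synthetic data, the number of trees and the choice of keeping two features are assumptions for illustration and do not reproduce the settings behind Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales; only features 0 and 1 affect the class.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(400, 6)) * np.array([1.0, 10.0, 100.0, 1.0, 10.0, 100.0])
y = (X_raw[:, 0] + 0.05 * X_raw[:, 1] > 0).astype(int)

# Scale every feature to [0, 1] before fitting.
X = MinMaxScaler().fit_transform(X_raw)

# Backward elimination: drop the least important feature, refit, and repeat.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(estimator=forest, n_features_to_select=2, step=1).fit(X, y)

print("selected mask:", rfe.support_.tolist())
print("elimination ranking:", rfe.ranking_.tolist())
```

In `ranking_`, every retained feature receives rank 1 and only the eliminated features are ordered, so the output is a subset plus an elimination order rather than an individual ranking within the selected subset.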

Results and Discussion
Pearson correlation can be used for feature selection, resulting in a good model. However, this measure has implicit orthogonality assumptions between variables, and the coefficient does not take mutual information between features into account. Additionally, this method only looks for linear correlations, which might not capture many physical phenomena.
The feature subset selected by CFS contains features with higher class correlation and lower redundancy, which translate to a good predictive model. Although we know grain geometry and neighborhood are important to hotspot formation, CFS does not select any geometry based features and fails to provide an individual feature ranking.
Linear regression, ridge regression and LASSO are closely related linear models. A simple linear model results in huge weights for some features (NumCells, FeatureVolumes), likely due to overfitting, and hence is unsuitable for deducing variable importance. Ridge regression compensates for this problem by using $L_2$ regularization, but the weights are distributed among the redundant features, which might lead to incorrect conclusions. LASSO regression overcomes this problem by pushing the weights of correlated features to zero, resulting in a good feature subset. The top five features ranked by LASSO with a regularization strength of λ = 0.3 are: sinθ, AvgMisorientations, cosφ, sinφ and Schmid 1. The first geometry based feature ranks 10th on the list, which seems to underestimate the physical importance of such features. A drawback of deriving insights from LASSO selected features is that LASSO arbitrarily selects a few representatives from the correlated features, and the number of features selected depends heavily on the regularization strength λ.
Random forest models also provide an embedded feature ranking module. The RF-PAI importance seems to focus only on the features derived from the HCP c-axis orientation (cosφ, sinθ), the average misorientation and the prismatic <a> Schmid factor, while discounting most of the geometry derived features. RF-PAI suffers from correlation bias due to the preferential selection of correlated features during the tree building process [37]. As the number of correlated variables increases, the feature importance score for each variable decreases. Often the less relevant variables replace the predictive ones (due to correlation) and thus receive undeserved, boosted importance [38]. Random forest variable importance can also be biased in situations where the features vary in their scale of measurement or number of categories, because the underlying Gini gain splitting criterion is a biased estimator and can be affected by multiple testing effects [39].
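The dilution of random forest importance among correlated variables can be reproduced on synthetic data, as in the sketch below. For brevity it uses scikit-learn's impurity based `feature_importances_` rather than permutation accuracy importance, so it illustrates the general effect and not RF-PAI itself; the data and all variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
x0 = rng.normal(size=n)                                  # the informative feature
noise = rng.normal(size=(n, 3))                          # three irrelevant features
y = (x0 + 0.5 * rng.normal(size=n) > 0).astype(int)

# Four near-duplicates of x0, mimicking a group of highly correlated descriptors.
copies = x0[:, None] + 0.05 * rng.normal(size=(n, 4))

X_alone = np.column_stack([x0, noise])
X_redundant = np.column_stack([x0, copies, noise])

def forest_importances(X, y):
    """Impurity based importances of a random forest (normalized to sum to one)."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    return rf.feature_importances_

imp_alone = forest_importances(X_alone, y)
imp_redundant = forest_importances(X_redundant, y)

print("importance of x0 without copies:", round(float(imp_alone[0]), 3))
print("importance of x0 with four copies:", round(float(imp_redundant[0]), 3))
```

The informative variable is unchanged between the two fits; only the presence of correlated companions lowers its score, which is why such rankings can understate a physically relevant group of descriptors.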
From Figure 1, we found that all the geometry based features are highly correlated with each other; this ranking is therefore unsuitable for deducing physical insights. Hence, we move to wrapper based methods for feature importance. Recursive feature elimination (RFE) has been shown to reduce the effect of correlation on the importance measure [36]. RFE with an underlying random forest model selects a feature subset that includes two geometry based features (GBEuc and EquivalentDiameter); however, it fails to give an individual ranking among the features.
FeaLect provides a robust feature selection method by compensating for the uncertainty in LASSO due to arbitrary selection among correlated variables, and for the change in the number of selected variables with regularization strength. Table 2 lists the FeaLect selected variables in decreasing order of importance. We find that the top two important features are derived from the grain crystallography, and geometry derived features come next. This suggests that both texture and geometry based features are important. Using linear regression based methods such as these tells us which features are important by themselves, as opposed to RF-PAI, which indicates the features that become important due to interactions between them (via RF models) [13]. The FeaLect method provides the best estimate of the feature importance ranking, which can then be used to extract physical insights. This method also divides the features into three classes: informative features, irrelevant features that cause model overfitting, and redundant features [20]. The most informative features are: cosφ, Schmid 1, EquivalentDiameter, GBEuc, Schmid 4, Neighborhoods, sinθ and TJEuc.
The irrelevant features are sinφ and AvgMisorientations (which cause model overfitting). The remaining features are redundant.
A number of selected features directly or indirectly represent the HCP c-axis orientation, such as cosφ, sinθ and the basal Schmid factor (Schmid 1), which is proportional to cosθ. It is interesting that the pyramidal <c+a> Schmid factor (Schmid 4) is chosen as important. From Figure 1, we can see that hot grains form where θ and φ maximize sinθ and sinφ, i.e. θ ∼ 90°, φ ∼ 90°. This means that the HCP c-axis orientation of hot grains aligns with the sample Y axis, which means these grains have a low elastic modulus. Since the c-axis is perpendicular to the tensile axis (sample Z), the deformation along the tensile direction can be accommodated by prismatic slip in these grains, and if pyramidal slip is occurring, it means they have a very high stress [4]. This explains the high importance of the pyramidal <c+a> Schmid factor. From the Pearson correlation coefficients in Figure 1, we can observe that the stress hotspots form in grains with low basal and pyramidal <c+a> Schmid factors, a high prismatic <a> Schmid factor, and higher values of sinθ and sinφ.
From Figure 1, we can see that none of the grain geometry descriptors has a direct correlation with stress, yet they are still selected by FeaLect. This points to the fact that these variables become important in association with others. We analyzed these features in detail in [4] and found that the hotspots lie closer to grain boundaries (GBEuc), triple junctions (TJEuc) and quadruple points (QPEuc), and prefer to form in smaller grains.
There is a subtle distinction between the physical impact of a variable on the target and the variables that work best for a given model. From Table 2, we can see that a random forest model built on the entire feature set without feature selection has an AUC of 71.94%. All the feature selection techniques improve the performance of the random forest model to a validation AUC of about 81%. However, to draw physical interpretations, it is important to use a feature selection technique which: 1) keeps the original representation of the features, 2) is not biased by correlations or redundancies among features, 3) is insensitive to the scale of variable values, 4) is stable to changes in the training dataset, 5) takes multivariate dependencies between the features into account, and 6) provides an individual feature ranking measure.

Conclusions
We have used different feature selection techniques and demonstrated that while all techniques lead to an improvement in model performance, only the FeaLect method helps us to determine the underlying importance of the features by themselves.
- All feature selection techniques result in a ∼9% improvement in the AUC metric for stress hotspot classification.
- Correlation based feature selection and recursive feature elimination are computationally expensive to run, and give only a feature subset ranking.
- Random forest embedded feature ranking is biased against correlated features and hence should not be used to derive physical insights.
- Linear regression based feature selection techniques can objectively denote the most important features, but have their flaws. The FeaLect algorithm can compensate for the variability in LASSO regression, providing a robust feature ranking that can be used to derive insights.
- Stress hotspot formation under uniaxial tensile deformation is determined by a combination of crystallographic and geometric microstructural descriptors.
- It is essential to choose a feature selection method that can find this dependence even when features are redundant or correlated.