FSF-GA: A Feature Selection Framework for Phenotype Prediction Using Genetic Algorithms

(1) Background: Phenotype prediction is a pivotal task in genetics, as it seeks to identify how genetic factors contribute to phenotypic differences. This field has seen extensive research, with numerous methods proposed for predicting phenotypes. Nevertheless, the intricate relationship between genotypes and complex phenotypes, including common diseases, means that accurately deciphering the genetic contribution remains an ongoing challenge. (2) Results: In this study, we propose a novel feature selection framework for phenotype prediction utilizing a genetic algorithm (FSF-GA) that effectively reduces the feature space to identify genotypes contributing to phenotype prediction. We provide a comprehensive vignette of our method and conduct extensive experiments using a widely used yeast dataset. (3) Conclusions: Our experimental results show that the proposed FSF-GA method delivers phenotype prediction performance comparable to that of baseline methods, while additionally providing the sets of features selected for prediction. These selected feature sets can be used to interpret the underlying genetic architecture that contributes to phenotypic variation.


Introduction
Recent advancements in genetic sequencing techniques have led to a significant increase in the volume of available genomic data over the past decade [1]. Access to vast quantities of genomic data necessitates powerful data mining techniques and analytics to extract relevant information effectively. By uncovering key insights from such data, it becomes possible to identify genetic factors associated with human diseases, as well as to enhance breeding strategies in animals and plants [2]. Genomic and genetic data present challenges that hinder the efficacy of simple machine learning (ML) algorithms and statistical approaches. Genomic selection, or phenotype prediction, is one such challenge in genomic data analytics and involves predicting qualitative traits (e.g., eye color) or quantitative traits (e.g., height) based on genetic variations among individuals [3]. Along with non-genetic factors, trait values can also be influenced by genetic variants located at one or multiple locations in the genome [4]. These genetic variants can exhibit linear or non-linear interactions, referred to as additive and epistatic effects, respectively, in determining trait values [5]. Epistasis represents non-linear relationships among genetic variants and can extend beyond pairwise interactions to encompass higher-order interactions among multiple genetic variants [6]. Consequently, phenotype prediction in the presence of epistatic effects is a challenging combinatorial task, since modeling such interactions demands significantly more complex models.
Another factor that makes phenotype prediction a challenge is the dimensionality of genomic data; this curse of dimensionality refers to the vast number of features that greatly outnumber the available samples [7]. Traditionally, feature selection approaches used to combat this problem fall into four categories: filter, wrapper, embedded and hybrid approaches. Filters, such as information gain [8] and Fisher score [9], mainly rely on statistical measures of the relationships among genetic features.

In summary, FSF-GA consists of two primary stages that support feature selection in synergy with a regression model for phenotype prediction. In the first stage, we perform pre-processing in order to reduce the search space of the features and, in the second stage, we utilize a GA to find the best combination of SNPs for predicting the quantitative trait under investigation.

FSF-GA Implementation
FSF-GA is designed to address both the phenotype prediction and QTL detection problems while accounting for correlations among genetic features, such as Linkage Disequilibrium (LD) blocks. To evaluate FSF-GA in our experiments, the dataset (a yeast dataset in this study) is first shuffled and separated into training and test sets using a fixed random seed, as illustrated in Figure 1; this is standard practice for evaluating an ML approach. Next, we perform feature selection on both sets using Pearson's Correlation Coefficient (PCC) between SNPs and the target phenotype. Afterwards, the GA is applied to the training set to find the optimal set of features that yields the best $R^2_{\text{Adj}}$ between the predicted and observed phenotypes. Finally, the test set is used to evaluate the performance of the GA-based regression model in FSF-GA.

Figure 1. The overall pipeline of FSF-GA for a single run. The dataset is split into training and test sets. Pre-processing is then performed on both sets using the information extracted from the training set. Next, the GA is applied to the training set three times. The intersection of the results from these three executions is taken as the final set of QTLs and used for the test set evaluation.

LD Concordance
The LD cutoff used in the pre-processing step of Algorithm 1 affects the detected loci. More specifically, the higher we set the LD cutoff, the more variation, in terms of cardinality and concordance, we observe in the QTL sets generated as the final output of the GA. This outcome is presented in Table 1, which reports the statistics of the yeast dataset after pre-processing (third column) and after running the GA three times (last column). Therefore, in order to keep the results stable, we run the GA multiple times for each setting and consider the intersection of the resulting sets as the final set of loci associated with a quantitative trait. Likewise, the phenotype prediction performance of FSF-GA is evaluated using the intersection sets.

Table 1. SNP statistics after pre-processing and after running the GA three times on the yeast Saccharomyces cerevisiae dataset. LD cutoff refers to the LD threshold used in the pre-processing step. Intersection, in the last column, indicates the overlap between the three runs at each LD cutoff and I/U indicates the intersection-to-union ratio. Higher values of I/U are desirable and indicate that the results have more concordance.

To compare our findings with previous publications, we used LD concordance between our results and those of [5], which were acquired using a mixed model for repeated measures (MMRM). To calculate LD concordance, for each trait, we looped through the QTLs reported by [5] and saved the maximum LD score between each of those and the SNPs detected via FSF-GA for the same trait. The results of this analysis are presented in Figure 2 as violin plots. We observe that, for the majority of the traits, LD concordance is high, especially for LD cutoff thresholds of 0.4 and 0.5. A high LD concordance indicates that the FSF-GA pipeline can identify genetic loci that are either QTLs or in high LD with previously reported QTLs.

Figure 2. LD concordance between the QTLs detected using FSF-GA and the QTLs identified in [5]. Greater width indicates greater density. The internal lines of the violin plots depict the quartiles of the data, with the white circle marking the median. We observe that, generally, higher LD cutoff thresholds result in more concordance between the two sets across different traits.
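This per-trait concordance computation can be summarized in a short sketch. The following is a minimal illustration rather than the paper's exact code; `reported_qtls`, `detected_snps` and the scoring function `ld` are hypothetical names, with `ld` standing in for whichever LD measure is used (e.g., r²):

```python
import numpy as np

def ld_concordance(reported_qtls, detected_snps, ld):
    """For each QTL reported in [5], keep the maximum LD score against
    any SNP detected by FSF-GA for the same trait."""
    scores = []
    for q in reported_qtls:
        # Best LD match between this reported QTL and our detected set
        scores.append(max(ld(q, s) for s in detected_snps))
    return np.array(scores)  # one max-LD value per reported QTL
```

The distribution of these per-QTL maxima is what each violin in Figure 2 summarizes.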

QTL Detection
Though FSF-GA is not designed to mark exact QTLs at the current stage, we observe that a fraction of the SNPs reported in [5] directly overlap with our sets. For each trait, we took the overlap of the FSF-GA intersection sets produced under the six LD cutoff thresholds and then calculated the direct intersection between our method's output and that of [5]. Notably, as expected, many of these overlapping SNPs were reported as additive QTLs in [5]. These overlaps are listed, per phenotype, in Table 2.

Table 2. Shared QTL SNPs detected via FSF-GA and [5]. Items under overlap positions are represented as chrN_P, where N and P indicate the chromosome number and the position on the chromosome, respectively. These overlaps show that our method detects relevant features as QTLs.

(Table 2 columns: Trait; Overlapping SNPs.)

Phenotype Prediction
To evaluate the phenotype prediction of FSF-GA, we use the mean absolute error (MAE), MSE and PCC to benchmark FSF-GA against baseline methods, including classical ML models (RF and SVM) and genomic selection methods (rrBLUP and BGLR). The ML models, RF and SVM, are trained on the features remaining after the pre-processing step in our pipeline. The reported results for these models are the best across the six LD cutoffs (0.2, 0.3, 0.4, 0.5, 0.6 and 0.7) per metric, which are considerably better than the results obtained on the full set of features. For the RF regressor, after fine-tuning the model, the number of trees and the maximum depth of each tree are set to ceil(p/7) (where p is the number of features) and 7, respectively. For a fair comparison, we run RF using 10 random splits and report the average performance metrics on the test set.
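As an illustration of this baseline protocol, the sketch below averages RF performance over 10 random 90/10 splits with the stated hyper-parameters. It is our own minimal reconstruction, with illustrative function and variable names, not the actual benchmarking code:

```python
import numpy as np
from math import ceil
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

def rf_baseline(X, y, n_splits=10):
    """Average RF performance over random 90/10 splits, using the
    hyper-parameters reported in the text: ceil(p/7) trees, depth 7."""
    p = X.shape[1]
    maes, mses, pccs = [], [], []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.1, random_state=seed)
        rf = RandomForestRegressor(
            n_estimators=ceil(p / 7), max_depth=7, random_state=seed)
        rf.fit(X_tr, y_tr)
        y_hat = rf.predict(X_te)
        maes.append(mean_absolute_error(y_te, y_hat))
        mses.append(mean_squared_error(y_te, y_hat))
        pccs.append(pearsonr(y_te, y_hat)[0])
    # Report the average of each metric across the random splits
    return np.mean(maes), np.mean(mses), np.mean(pccs)
```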
The classical genomic selection models, rrBLUP and BGLR [35], are provided with the complete set of features. The results of this benchmark are illustrated in Figures 3 and 4. The outcomes suggest that FSF-GA performs as well as rrBLUP and BGLR, and exhibits performance comparable to SVM and RF. However, RF's predictive performance can be highly variable in certain scenarios compared to the other methods; for example, for Diamide and Indolacetic Acid, RF's predictions can be significantly better or worse than those of the other methods.
However, the performance of the models can be affected by the selected data partition. To verify that FSF-GA performs comparably to rrBLUP and BGLR in terms of MSE, we perform 5-fold cross-validation on the first three traits, namely Cobalt Chloride, Copper Sulfate and Diamide. The results of this experiment are presented in Tables 3 and 4. We observe that FSF-GA does not perform as well as the other two methods in the case of Diamide. However, for Cobalt Chloride and Copper Sulfate there is at least one LD cutoff threshold at which FSF-GA performs as well as rrBLUP and BGLR.

Figures 3 and 4. Phenotype prediction benchmark (PCC ranges from −1 to +1; best performance at +1). α in FSF-GA α corresponds to the LD cutoff threshold used in the pre-processing step. RF and SVM are applied to the features retained under each of the six LD cutoff thresholds in the pre-processing step and the best results among them, for each metric, are reported. We observe that, based on PCC, FSF-GA performs similarly to the baseline methods.

Discussion
We observe that the majority of SNPs reported by previous research show a high LD concordance with those found via FSF-GA. In this regard, since there is no considerable difference in performance metrics such as MSE, the problem turns into a trade-off between LD concordance and I/U. The former measures the consistency of FSF-GA's results with previous research, while the latter measures the robustness of each individual run. Increasing the LD cutoff threshold reduces I/U and, at the same time, increases LD concordance. Overall, LD concordance is higher at LD cutoff thresholds above 0.4. LD cutoff thresholds of 0.4 and 0.5 show a reasonable amount of both I/U and LD concordance, making them preferable to other cutoff thresholds for individual runs. Of note, since increasing the LD cutoff threshold beyond 0.5 does not yield tangible improvements in performance metrics, we abstain from checking LD cutoff values higher than 0.7. The discordance in results can be attributed to FSF-GA not using the full dataset and/or to the two methods using different LD blocks for prediction. While the latter needs further investigation, which can be a focus of our future studies, it is not unexpected. In stepwise methods used for QTL detection, such as the one used in [5], the order of SNP selection can affect the process; i.e., the first few detected QTLs can lead to, or prevent, the selection of specific QTLs, which can keep the model from selecting the optimal set of QTLs. This shortcoming, in theory, does not apply to metaheuristic models such as FSF-GA, since selected loci can be removed and replaced during the process by the modification operators, namely mutation and crossover in the GA.
In the case of phenotype prediction, SVM performs worst, but the difference between the methods is not substantial. RF does better for some traits than for others, by a narrow margin. Since RF uses the same set of features that is fed into our GA, we suspect that the difference between our results and those of RF lies in the difference in prediction power between Bayesian Ridge, which FSF-GA uses as its estimator, and RF. Some models, such as neural networks and tree-based models, have the capacity to capture non-linear interactions between SNPs. Specifically, neural networks are suitable tools for detecting epistasis since they utilize non-linear activation functions in each layer. In contrast, Bayesian Ridge does not offer the same functionality; however, using other regressor models in the objective function of our GA adds a significant computational load, so we avoid this in the present study due to hardware limitations. Regardless, FSF-GA delivers results comparable to those of rrBLUP and BGLR, which are considered formidable baselines because, in the yeast dataset, the majority of contributions to trait variation result from additive effects [5]. Therefore, we expect that using neural networks instead of Bayesian Ridge can boost prediction power and improve epistatic QTL detection in FSF-GA, subject to the availability of more computational power; this will be explored in future work.
The findings suggest that the LD cutoff used during the pre-processing step is a key factor in regulating the I/U ratio of the identified SNPs. It also impacts the stability of the SNP sets obtained for various traits and has a marginal influence on the accuracy of phenotype prediction. The reason is that a higher LD cutoff introduces more correlated features into the search space; the SNP selection process is influenced by this variation, while the presence of more SNPs increases the chance of finding a better QTL combination. In our current framework, Bayesian Ridge cannot capture complex QTL-QTL interactions; therefore, the performance gap between different cutoffs is insignificant. However, with models capable of capturing such interactions, there would be a trade-off between the precision of QTL detection and phenotype prediction power.

Conclusions
In this study, we propose FSF-GA as a bi-objective framework for quantitative trait prediction and QTL detection. To the best of our knowledge, it is the first method to make use of evolutionary algorithms for such a task. The advantage of metaheuristic approaches, such as FSF-GA, over stepwise models is their potential for breaking out of local optima and not being affected by the QTLs initially selected in the process. A practical application of FSF-GA is to combine it with CRISPR genetic editing and/or base editing tools for targeted SNP editing and investigating the resulting phenotypic changes. This approach could provide more precise and targeted SNP studies compared to traditional QTL mapping techniques.
For model evaluation, a widely used yeast dataset is employed to benchmark our proposed method against other well-known models for quantitative phenotype prediction. Experimental results indicate that FSF-GA performs comparably to the baselines for phenotype prediction, while it can detect trait-associated LD blocks using pre-processed SNPs. Furthermore, our analysis reveals that the QTLs detected by our framework show reasonable LD concordance with the QTLs identified in previous research. Our statistics show that the LD cutoff in the proposed pre-processing step affects both the phenotype predictive power and the convergence of the GA, and thus governs a trade-off between the two.
While FSF-GA shows reasonable performance in designated tasks, there is room for further improvement. We believe that using advanced models, such as neural networks, can lead to increased prediction performance and better QTL detection, given that powerful hardware is available. In our future work, we intend to expand this method in order to mark the exact QTLs in LD blocks that are responsible for predicting each quantitative trait, which, subsequently, improves phenotype prediction as well. Furthermore, our algorithm can be redesigned in a parallel fashion to reduce execution time, which can be another focus of our future work.

Dataset
In this study, we apply our method to a well-characterized dataset of the yeast Saccharomyces cerevisiae [5]. The dataset contains sequenced genotype profiles of 4390 samples with 28,220 unique genetic variants (SNPs). The samples are crosses between a laboratory strain (BY) and a vineyard isolate (RM), encoded as −1 and 1, respectively. In addition to the genotypes, the dataset contains phenotypes of the samples for 20 different growth traits. An important aspect of the dataset is the existence of epistasis, i.e., non-linear gene-gene interactions, which contributes to the complexity of model training for trait prediction [36].
In order to evaluate our proposed method, we compare its results with those of the baseline methods on ten phenotypes in this dataset. The dataset is split into two separate subsets, using a fixed random seed, for training and testing, consisting of 90% and 10% of the samples, respectively. The same steps are repeated identically for the baseline methods. Additionally, we perform 5-fold cross-validation on the first three traits in this dataset to ensure that the data split does not have a considerable effect on model performance.

Feature Selection Framework for Phenotype Prediction Using Genetic Algorithm
Phenotype prediction involves solving two problems, namely epistatic interactions among loci and the curse of dimensionality. To address the latter, we propose FSF-GA in order to reduce the search space of SNPs effective for phenotype prediction. Regarding the former, while our method does not directly address epistasis detection, it can be used as a prior step toward detecting high-order epistasis. The overall pipeline of the proposed method is illustrated in Figure 1. Henceforth, we use the terms loci and features interchangeably.
The dataset is first partitioned into training and test sets. Afterwards, pre-processing is applied to both sets, restricting the features that are available to the GA in the next step. In the GA, we aim to find the optimal set of features that maximizes our criterion, $R^2_{\text{Adj}}$, on the training set. A regression model is then fit on the training set using the selected subset of features and we evaluate its performance on the test set. The selected features can depend on the regression model used in the fitness function of the GA; however, the final output of the GA is only the set of features. Since the GA is a stochastic algorithm, we run it more than once and consider the overlap of the produced sets as the final output.
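The following sketch condenses this flow into a single routine. It is a schematic reconstruction under stated assumptions, not the released implementation: `preprocess` and `run_ga` are passed in as stand-ins for the pre-processing step (Algorithm 1) and the GA, and `run_ga` is assumed to return column indices of the reduced feature matrix:

```python
import numpy as np
from functools import reduce
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

def fsf_ga_pipeline(X, y, preprocess, run_ga, ld_cutoff, n_runs=3, seed=0):
    """One evaluation run: split, pre-process, run the GA n_runs times,
    intersect the selected features and score on the held-out test set."""
    # 1) Fixed-seed 90/10 train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed)
    # 2) Pre-processing restricts the GA to a set of valid SNP indices
    valid = preprocess(X_tr, y_tr, ld_cutoff)
    # 3) The GA is stochastic, so we intersect the outputs of n_runs runs
    runs = [run_ga(X_tr[:, valid], y_tr) for _ in range(n_runs)]
    selected = sorted(reduce(set.intersection, map(set, runs)))
    # 4) Fit the regressor on the final feature set, evaluate on the test set
    cols = np.asarray(valid)[selected]
    model = BayesianRidge().fit(X_tr[:, cols], y_tr)
    return selected, model.score(X_te[:, cols], y_te)
```

Running the GA `n_runs` times and intersecting the outputs reflects the stochastic nature of the search discussed above.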

The Optimization Problem
GA is a nature-inspired method and a major constituent of Computational Intelligence (CI), designed to solve real-world problems [37]. We use the GA to search for the optimal solution of the optimization problem defined below.
In this study, given a certain regression model M and a dataset D, we look for the minimal subset of SNPs in D that yields the best phenotype prediction results. Standard procedures for stopping a GA, in case the optimum is not reached, include (but are not limited to) setting a limit on the runtime of the algorithm or on the number of iterations. Here, we employ the latter and set the maximum number of iterations to 5000.
There are a limited number of metrics used for regression problems. Among them, MSE is commonly used to compare different methods. However, through empirical study, we found that $R^2_{\text{Adj}}$, which has been used in the literature for regression problems [38], serves as a better objective function for the task at hand. This metric is calculated as follows:

$$R^2_{\text{Adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where $p$ is the number of independent features selected for training the model, $n$ is the number of samples and $R^2$ is calculated as:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

where RSS is the sum of squares of residuals and TSS is the total sum of squares for a given trait. $R^2_{\text{Adj}}$ and $R^2$ range from 0 to 1, where 1 is the best and 0 is the worst value. We set maximizing $R^2_{\text{Adj}}$ and minimizing the number of features as the primary and secondary objectives, respectively. In other words, our optimization algorithm maximizes the $R^2_{\text{Adj}}$ of phenotype prediction using as few features as possible.
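In code, the objective amounts to only a few lines. The helper below is a direct transcription of the two formulas above:

```python
import numpy as np

def adjusted_r2(y_true, y_pred, p):
    """Adjusted R^2 for n samples and p selected features."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = y_true.size
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
```

Note how the penalty term grows with p, which is what lets a single scalar objective also press the GA toward smaller feature sets.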

Pre-Processing
The purpose of the GA in our approach is to find, for each trait, the most compact set of features that delivers the best predictive power. However, evolutionary algorithms alone cannot prioritize suitable features, resulting in extremely long run-times until convergence. Furthermore, genetic features are highly correlated due to LD structures or gene co-expression/regulation, which highlights the need for feature selection, similar to [21,39,40,41]. Therefore, in order to guide our GA, we first mark the valid SNPs for each trait, and the GA is only allowed to use these to form the output set of features. To do so, we use the PCC between each SNP and the target trait to sort the SNPs in decreasing order of importance. Next, we calculate the LD between SNP pairs in order to filter the sorted SNPs, as demonstrated in Algorithm 1. LD between SNP pairs is calculated using the scikit-allel package in Python.
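Algorithm 1 itself is not reproduced here; the sketch below is one plausible reading of the described pre-processing, compatible with the `preprocess` stand-in used in the earlier pipeline sketch. It uses squared Pearson correlation between SNP columns as the LD measure (the actual implementation computes LD with scikit-allel) and assumes a candidate SNP is kept only while its LD with every already-kept SNP stays at or below the cutoff:

```python
import numpy as np

def preprocess(X, y, ld_cutoff):
    """Rank SNPs by |PCC| with the trait, then greedily drop any SNP
    whose LD (r^2 here) with an already-kept SNP exceeds the cutoff."""
    # Pearson correlation of each SNP column with the phenotype
    pcc = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-np.abs(pcc))  # most important SNPs first
    kept = []
    for j in order:
        r2 = [np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 for k in kept]
        if not r2 or max(r2) <= ld_cutoff:  # keep if below the LD cutoff
            kept.append(j)
    return np.array(sorted(kept))  # indices of valid SNPs for the GA
```

Under this rule, raising the cutoff admits more correlated SNPs into the search space, consistent with the behaviour reported in Table 1.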

The Genetic Algorithm
In this section, we present the details of our GA. The inputs to our GA are the training set and the valid feature indices acquired in the pre-processing step. The output is the set of selected SNPs that gives the optimal result for predicting the phenotype of interest. Since randomness in evolutionary algorithms is inevitable, especially in this problem, we run the GA three times for each setting and use the intersection of the outputs as the final set.
The fundamental components of the GA are chromosomes and three functions named fitness, mutate and crossover. The overall process of the proposed GA is illustrated in Figure 5. In our algorithm, each chromosome contains a vector of binary values (1/0) called genes. In other words, genes refer to the parameters of the solutions to our problem, which are different from genes in genetics; throughout this paper, we use the term gene(s) only in the context of the GA. The length of each vector is equal to the number of loci in the genotypes. Setting an element to 1/0 indicates that the corresponding feature should be used/discarded in the respective data subset. In other words, these vectors mask the presence of features in the dataset, as demonstrated in Figure 6. The fitness function in the proposed GA simply calculates the fitness score, that is, $R^2_{\text{Adj}}$ in this study, on the training set, using the Bayesian Ridge regressor implementation from the Scikit-learn package [42]. The key consideration in selecting the model for the fitness function is that it should not have an inherent L1 penalty (as, e.g., Lasso does), so that redundant features degrade model performance and are consequently removed in the process. Tabu Search (TS) [43] is incorporated into our GA in order to improve local search and prohibit it from re-checking previously visited solutions. Furthermore, TS saves time by preventing redundant calculations in the fitness function.
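A minimal sketch of this mask-based fitness evaluation is given below, reusing the `adjusted_r2` helper defined earlier. The hash-map cache is a simplified stand-in for the TS bookkeeping, and the chromosome is assumed to switch on at least one SNP:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

tabu = {}  # maps visited gene vectors to their cached fitness

def fitness(genes, X_train, y_train):
    """Fitness of a binary chromosome (NumPy 0/1 vector): adjusted R^2
    of a Bayesian Ridge model trained on the SNPs the mask switches on."""
    key = genes.tobytes()
    if key in tabu:        # tabu memory: skip re-evaluating
        return tabu[key]   # previously visited solutions
    mask = genes.astype(bool)
    model = BayesianRidge().fit(X_train[:, mask], y_train)
    y_hat = model.predict(X_train[:, mask])
    score = adjusted_r2(y_train, y_hat, p=int(mask.sum()))
    tabu[key] = score
    return score
```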
The mutate function takes a chromosome and modifies its genes, exploring the search space for the global optimum. Algorithm 2 contains the pseudo-code for the mutate function. The inputs of the crossover function are two chromosomes, named parents ($G_P$, $G_D$), their respective fitness scores and the fitness function. Generally speaking, in a GA, a crossover combines two sets of genes into a new chromosome, named the child ($G_C$), in which genes are inherited from either of the parents, performing exploitation and driving convergence within a search subspace. The same applies in our crossover function. The pseudo-code of the crossover function is presented in Algorithm 3.
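Algorithms 2 and 3 are not reproduced verbatim here; the sketch below shows conventional bit-flip mutation and uniform crossover on the binary masks, a simplified reading of the operators (the actual versions additionally invoke the fitness function inside their loops):

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(genes, rate=0.01):
    """Bit-flip mutation: each gene toggles with a small probability,
    letting the search add or drop individual SNPs (exploration)."""
    flips = rng.random(genes.size) < rate
    return np.where(flips, 1 - genes, genes)

def crossover(parent_a, parent_b):
    """Uniform crossover: the child inherits each gene from either
    parent with equal probability (exploitation of shared structure)."""
    take_a = rng.random(parent_a.size) < 0.5
    return np.where(take_a, parent_a, parent_b)
```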
The base code of our GA is adapted from [44]; however, as mentioned above, it was heavily modified. The size of the parent pool in our GA is set to 10. The mutation and crossover rates are set dynamically according to the last three improvements; we designed the algorithm so that the mutation/crossover rate cannot fall below 20%, and in each turn, only one of these operations is performed on each chromosome. For example, if two of the last three improvements were achieved through crossover, then the operations applied to subsequent chromosomes, until the next improvement is obtained, are crossover/mutation with probabilities of 60%/40% (see the sketch below).

We estimate the complexity of a single fitness evaluation, which fits a Bayesian Ridge model, to be O(p²n). Mutation and crossover take O(Lnp²) since their inner loops (Lines 5 and 6 of Algorithms 2 and 3) involve up to L calls to the fitness function. The GA runs for K iterations and the parent pool holds S chromosomes in total; hence, the total computational complexity of our GA in this study is O(LKSnp²).
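The rate-adaptation rule can be reconstructed compactly. In the sketch below, the mapping 0.2 + 0.2 × (number of recent crossover improvements) is our own guess that reproduces both the 20% floor and the 60/40 example above; `last_three` is a hypothetical list recording which operator produced each of the last three improvements:

```python
import random

def pick_operator(last_three):
    """Choose the next operator from the last three improvements.
    The mapping 0.2 + 0.2*count matches the 60/40 example in the text
    and keeps both operators at or above 20% (our reconstruction)."""
    p_cross = 0.2 + 0.2 * last_three.count("crossover")
    return "crossover" if random.random() < p_cross else "mutation"
```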

Informed Consent Statement: Not applicable.

Data Availability Statement:
The dataset is obtained from [5] and the source code for the proposed framework is available at Github: https://github.com/shilab/FSF-GA/ (accessed on 16 April 2023).