Turning straw into gold: building robustness into gene signature inference
Introduction
Statistical feature selection of ‘omics’ data is a practical means of deriving signatures for predictive purposes. Although the exact conditions for deriving a successful signature are not easily defined, it is known that statistical significance can arise from a variety of confounders (e.g., sampling bias, the presence of hidden subpopulations, and batch effects) besides biological relevance [1]. This is known as the ‘Anna Karenina Principle’ [2,3].
Therefore, naïve reliance on basic statistics leads to a lack of signature reproducibility (obtaining a similar signature from a different data set) [4–6] and signature generalizability (the ability to correctly predict phenotype in a different data set) [7]. Addressing confounders is important but not necessarily practicable (assuming it is even possible to correctly identify every possible confounder). Some key points covered previously include developing more reasonable hypothesis statements and ensuring that the correct test statistics and reference distributions are used [1]. Broadly, these constitute good analytical practices (GAPs) in the context of general analysis. However, more robustness can be introduced for the purpose of signature inference. Using a re-examination of the data set of Venet et al. [7], we illustrate here the following GAPs: (i) the importance of meta-analysis; (ii) systematic evaluation of confounders; and (iii) generalizability tests.
The case study
In their study, Venet et al. evaluated 48 published breast cancer signatures using an independent data set [7]. A good signature is one that is associated significantly with outcome or phenotype. However, in this study, the authors found that most published signatures did not outperform randomly generated signatures, and even irrelevant signatures derived from other phenotypes did well; that is, statistical significance alone cannot prove relevance.
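The random-signature control at the heart of this finding can be sketched as a permutation-style test: score each patient by mean expression over the signature's genes, measure the score's association with outcome, and compare that against same-sized random gene sets. The sketch below uses synthetic data and a simple group-difference statistic rather than the survival analysis of Venet et al.; all sizes, effect magnitudes, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration (hypothetical data): 200 patients x 1000 genes,
# where the first 50 genes carry a genuine outcome-associated shift.
n_patients, n_genes = 200, 1000
outcome = rng.integers(0, 2, n_patients)      # 0 = good, 1 = poor prognosis
expr = rng.normal(size=(n_patients, n_genes))
expr[:, :50] += outcome[:, None] * 1.0        # genes 0-49 track outcome

def signature_score(expr, genes):
    """Per-patient score: mean expression over the signature's genes."""
    return expr[:, genes].mean(axis=1)

def association(expr, genes, outcome):
    """Absolute difference in mean score between the two outcome groups."""
    s = signature_score(expr, genes)
    return abs(s[outcome == 1].mean() - s[outcome == 0].mean())

# A published-style signature that overlaps the outcome-associated block.
signature = list(range(30))
obs = association(expr, signature, outcome)

# Null distribution from same-sized random gene sets (the key control).
null = np.array([
    association(expr, rng.choice(n_genes, size=len(signature), replace=False), outcome)
    for _ in range(1000)
])
p_random = (null >= obs).mean()
print(f"observed association: {obs:.2f}, empirical p vs random signatures: {p_random:.3f}")
```

A signature is only interesting if it clears this random-signature null, not merely a conventional significance threshold.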
Suspected confounders include: (i) use of an
The importance of meta-analysis
Meta-analysis is the comparative evaluation of independent studies covering the same subject matter (e.g., breast cancer versus normal patients). In their study, Venet et al. evaluated 48 independently published breast cancer signatures against the NKI benchmark data set (see the Supplemental information online) [7], which revealed that these signatures were not only very different from each other, but also performed variably on the benchmark.
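One simple meta-analytic use of independently derived signatures is to treat each as a vote and retain genes that recur across studies. A minimal sketch, using hypothetical gene symbols (not taken from the 48 published signatures):

```python
from collections import Counter

# Hypothetical published signatures; gene symbols are illustrative only.
signatures = [
    {"MKI67", "PCNA", "AURKA", "BRCA1"},
    {"MKI67", "PCNA", "CCNB1", "ESR1"},
    {"PCNA", "AURKA", "CCNB1", "TOP2A"},
]

# Each signature is an independent "vote": genes recurring across studies
# are stronger consensus candidates than genes appearing only once.
votes = Counter(g for sig in signatures for g in sig)
consensus = sorted(g for g, n in votes.items() if n >= 2)
print(consensus)  # → ['AURKA', 'CCNB1', 'MKI67', 'PCNA']
```

Recurrence across independent studies is a crude but useful prior: a gene selected repeatedly under different cohorts and pipelines is less likely to be a cohort-specific artifact.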
Each signature can be considered an independent
Systematic evaluation of confounders
Confounders are not homogeneous: although most proliferation genes are noncausal correlates, a subset is likely phenotypically relevant (Fig. 1a). To exemplify this point, SPS was compared with two proliferation gene sets (Prolif and meta-PCNA; see the Supplemental information online), revealing that almost all SPS genes were proliferation associated (Fig. 1b). Interestingly, only intersecting areas with SPS were strongly predictive, suggesting that the incorporation of SPS genes was why these
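A quick way to quantify this kind of overlap is a set intersection between a signature and candidate confounder gene sets. The sketch below uses hypothetical gene symbols; the actual SPS, Prolif, and meta-PCNA memberships are given in the Supplemental information online.

```python
# Hypothetical memberships, for illustration only.
sps = {"MKI67", "PCNA", "AURKA", "CCNB1", "TOP2A"}
prolif = {"MKI67", "PCNA", "CCNB1", "BUB1"}
meta_pcna = {"PCNA", "AURKA", "TOP2A", "RRM2"}

def overlap_fraction(sig, confounder_set):
    """Fraction of the signature's genes that fall inside a confounder set."""
    return len(sig & confounder_set) / len(sig)

covered = sps & (prolif | meta_pcna)
print(f"{overlap_fraction(sps, prolif | meta_pcna):.2f} of SPS is proliferation-associated")
```

Rather than discarding overlapping genes wholesale, the intersecting subsets can then be evaluated separately for predictive power, since a confounder-associated gene may still be phenotypically relevant.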
Generalizability tests
Gene signature inference should not stop at one benchmark data set because there is always the possibility that the signature is overfitted and, therefore, nongeneralizable (i.e., the signature only works on one data set). The minimum requirement should be at least one independent validation on a completely new data set (cross-validation is not good enough [11–13]). Given the wide availability of data, a good practice is to leverage existing published data (which are not used for determining
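The gap between in-sample performance and independent validation can be illustrated on synthetic data: infer a top-k signature on one cohort, then test its outcome association on a second, independently generated cohort. Everything below (cohort sizes, effect size, scoring statistic) is an assumption for illustration, not the procedure of any cited study.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_cohort(n_patients=150, n_genes=500, n_signal=60, effect=0.8):
    """Synthetic cohort: the first n_signal genes carry a real outcome effect."""
    y = rng.integers(0, 2, n_patients)
    x = rng.normal(size=(n_patients, n_genes))
    x[:, :n_signal] += effect * y[:, None]
    return x, y

def top_k_genes(x, y, k=20):
    """Rank genes by absolute mean difference between outcome groups."""
    diff = np.abs(x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0))
    return np.argsort(diff)[::-1][:k]

def group_diff(x, y, genes):
    """Association measure: outcome-group difference of the mean signature score."""
    s = x[:, genes].mean(axis=1)
    return s[y == 1].mean() - s[y == 0].mean()

x_a, y_a = make_cohort()        # discovery cohort
x_b, y_b = make_cohort()        # independent cohort, same underlying biology

sig = top_k_genes(x_a, y_a)     # signature inferred on cohort A only
print("association on A (in-sample, optimistic):", round(group_diff(x_a, y_a, sig), 2))
print("association on B (independent validation):", round(group_diff(x_b, y_b, sig), 2))
```

The in-sample estimate on cohort A is inflated by selection bias (genes were chosen to maximize exactly this statistic on A); only the cohort B figure estimates how the signature will behave on new data.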
Recommendations
Generally, it is good analytical practice to construct reasonable hypothesis statements and to check the appropriateness of the summary statistics and reference distributions. However, this does not exclude the existence of other sources of confounding. It is impracticable to exhaustively isolate and exclude all of these, especially because many will not be known a priori. Unfortunately, leaving them unaddressed would negatively impact gene signature inference, so additional safeguards are needed.
Concluding remarks
Inference of predictive signatures can be augmented with the use of prior knowledge (via meta-analysis); with the careful and systematic evaluation of gene sets, even if they overlap with known sources of confounding; and with rigorous testing of inferred signatures against as many published data sets as possible.
Author contributions
W.W.B.G. and L.W. co-designed the methodologies and co-wrote the manuscript.
Acknowledgments
W.W.B.G. and L.W. thank Vincent Detours and his colleagues for the code and data obtained from their publication. L.W. gratefully acknowledges support from a Kwan-Im-Thong-Hood-Cho-Temple chair professorship.
References
- Dealing with confounders in omics analysis. Trends Biotechnol. (2018)
- The application of principal component analysis to drug discovery and biomedical data. Drug Discov. Today (2017)
- Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. (2017)
- The Anna Karenina principle: a way of thinking about success in science. J. Am. Soc. Inf. Sci. Technol. (2012)
- Stress and stability: applying the Anna Karenina principle to animal microbiomes. Nat. Microbiol. (2017)
- Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. (2015)
- Test set bias affects reproducibility of gene signatures. Bioinformatics (2015)
- Feature selection in clinical proteomics: with great power comes great reproducibility. Drug Discov. Today (2016)