Turning straw into gold: building robustness into gene signature inference
Introduction
Statistical feature selection of ‘omics’ data is a practical means of deriving signatures for predictive purposes. Although the exact conditions for deriving a successful signature are not easily defined, it is known that statistical significance can arise from a variety of confounders (e.g., sampling bias, the presence of hidden subpopulations, and batch effects) besides biological relevance [1]. This is known as the ‘Anna Karenina Principle’ [2,3].
Therefore, naïve reliance on basic statistics leads to a lack of signature reproducibility (obtaining a similar signature from a different data set) [4–6] and signature generalizability (the ability to correctly predict phenotype in a different data set) [7]. Addressing confounders is important but not necessarily practicable (assuming it is even possible to correctly identify every possible confounder). Some key points covered previously include developing more reasonable hypothesis statements and ensuring that the correct test statistics and reference distributions are used [1]. Broadly, these constitute good analytical practices (GAPs) in the context of general analysis. However, more robustness can be introduced for the purpose of signature inference. Using a re-examination of the data set of Venet et al. [7], we illustrate here the following GAPs: (i) the importance of meta-analysis; (ii) systematic evaluation of confounders; and (iii) generalizability tests.
The case study
In their study, Venet et al. evaluated 48 published breast cancer signatures using an independent data set [7]. A good signature is one that is associated significantly with outcome or phenotype. However, in this study, the authors found that most published signatures did not outperform randomly generated signatures, and even irrelevant signatures derived from other phenotypes did well; that is, statistical significance alone cannot prove relevance.
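The random-signature control at the heart of this finding can be sketched as a permutation-style test: score each patient by mean expression over the signature's genes, measure the score's association with outcome, and compare that against same-sized random gene sets. The sketch below uses synthetic data and a simple group-difference statistic rather than the survival analysis of Venet et al.; all sizes, effect magnitudes, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration (hypothetical data): 200 patients x 1000 genes,
# where the first 50 genes carry a genuine outcome-associated shift.
n_patients, n_genes = 200, 1000
outcome = rng.integers(0, 2, n_patients)      # 0 = good, 1 = poor prognosis
expr = rng.normal(size=(n_patients, n_genes))
expr[:, :50] += outcome[:, None] * 1.0        # genes 0-49 track outcome

def signature_score(expr, genes):
    """Per-patient score: mean expression over the signature's genes."""
    return expr[:, genes].mean(axis=1)

def association(expr, genes, outcome):
    """Absolute difference in mean score between the two outcome groups."""
    s = signature_score(expr, genes)
    return abs(s[outcome == 1].mean() - s[outcome == 0].mean())

# A published-style signature that overlaps the outcome-associated block.
signature = list(range(30))
obs = association(expr, signature, outcome)

# Null distribution from same-sized random gene sets (the key control).
null = np.array([
    association(expr, rng.choice(n_genes, size=len(signature), replace=False), outcome)
    for _ in range(1000)
])
p_random = (null >= obs).mean()
print(f"observed association: {obs:.2f}, empirical p vs random signatures: {p_random:.3f}")
```

A signature is only interesting if it clears this random-signature null, not merely a conventional significance threshold.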
Suspected confounders include: (i) use of an
The importance of meta-analysis
Meta-analysis is the comparative evaluation of independent studies covering the same subject matter (e.g., breast cancer versus normal patients). In their study, Venet et al. evaluated 48 independently published breast cancer signatures against the NKI benchmark data set (see the Supplemental information online) [7], which revealed that these signatures were not only very different from each other, but also performed variably on the benchmark.
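One simple meta-analytic use of independently derived signatures is to treat each as a vote and retain genes that recur across studies. A minimal sketch, using hypothetical gene symbols (not taken from the 48 published signatures):

```python
from collections import Counter

# Hypothetical published signatures; gene symbols are illustrative only.
signatures = [
    {"MKI67", "PCNA", "AURKA", "BRCA1"},
    {"MKI67", "PCNA", "CCNB1", "ESR1"},
    {"PCNA", "AURKA", "CCNB1", "TOP2A"},
]

# Each signature is an independent "vote": genes recurring across studies
# are stronger consensus candidates than genes appearing only once.
votes = Counter(g for sig in signatures for g in sig)
consensus = sorted(g for g, n in votes.items() if n >= 2)
print(consensus)  # → ['AURKA', 'CCNB1', 'MKI67', 'PCNA']
```

Recurrence across independent studies is a crude but useful prior: a gene selected repeatedly under different cohorts and pipelines is less likely to be a cohort-specific artifact.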
Each signature can be considered an independent
Systematic evaluation of confounders
Confounders are not homogeneous: although most proliferation genes are noncausal correlates, a subset is likely phenotypically relevant (Fig. 1a). To exemplify this point, SPS was compared with two proliferation gene sets (Prolif and meta-PCNA; see the Supplemental information online), revealing that almost all SPS genes were proliferation associated (Fig. 1b). Interestingly, only intersecting areas with SPS were strongly predictive, suggesting that the incorporation of SPS genes was why these
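A quick way to quantify this kind of overlap is a set intersection between a signature and candidate confounder gene sets. The sketch below uses hypothetical gene symbols; the actual SPS, Prolif, and meta-PCNA memberships are given in the Supplemental information online.

```python
# Hypothetical memberships, for illustration only.
sps = {"MKI67", "PCNA", "AURKA", "CCNB1", "TOP2A"}
prolif = {"MKI67", "PCNA", "CCNB1", "BUB1"}
meta_pcna = {"PCNA", "AURKA", "TOP2A", "RRM2"}

def overlap_fraction(sig, confounder_set):
    """Fraction of the signature's genes that fall inside a confounder set."""
    return len(sig & confounder_set) / len(sig)

covered = sps & (prolif | meta_pcna)
print(f"{overlap_fraction(sps, prolif | meta_pcna):.2f} of SPS is proliferation-associated")
```

Rather than discarding overlapping genes wholesale, the intersecting subsets can then be evaluated separately for predictive power, since a confounder-associated gene may still be phenotypically relevant.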
Generalizability tests
Gene signature inference should not stop at one benchmark data set because there is always the possibility that the signature is overfitted and, therefore, nongeneralizable (i.e., the signature only works on one data set). The minimum requirement should be at least one independent validation on a completely new data set (cross-validation is not good enough [11–13]). Given the wide availability of data, a good practice is to leverage existing published data (which are not used for determining
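The gap between in-sample performance and independent validation can be illustrated on synthetic data: infer a top-k signature on one cohort, then test its outcome association on a second, independently generated cohort. Everything below (cohort sizes, effect size, scoring statistic) is an assumption for illustration, not the procedure of any cited study.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_cohort(n_patients=150, n_genes=500, n_signal=60, effect=0.8):
    """Synthetic cohort: the first n_signal genes carry a real outcome effect."""
    y = rng.integers(0, 2, n_patients)
    x = rng.normal(size=(n_patients, n_genes))
    x[:, :n_signal] += effect * y[:, None]
    return x, y

def top_k_genes(x, y, k=20):
    """Rank genes by absolute mean difference between outcome groups."""
    diff = np.abs(x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0))
    return np.argsort(diff)[::-1][:k]

def group_diff(x, y, genes):
    """Association measure: outcome-group difference of the mean signature score."""
    s = x[:, genes].mean(axis=1)
    return s[y == 1].mean() - s[y == 0].mean()

x_a, y_a = make_cohort()        # discovery cohort
x_b, y_b = make_cohort()        # independent cohort, same underlying biology

sig = top_k_genes(x_a, y_a)     # signature inferred on cohort A only
print("association on A (in-sample, optimistic):", round(group_diff(x_a, y_a, sig), 2))
print("association on B (independent validation):", round(group_diff(x_b, y_b, sig), 2))
```

The in-sample estimate on cohort A is inflated by selection bias (genes were chosen to maximize exactly this statistic on A); only the cohort B figure estimates how the signature will behave on new data.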
Recommendations
Generally, it is good analytical practice to construct reasonable hypothesis statements and to check the appropriateness of the summary statistics and reference distributions. However, this does not exclude the existence of other sources of confounding. It is impracticable to exhaustively isolate and exclude all of these, especially because many will not be known a priori. Unfortunately, leaving them unaddressed would negatively impact gene signature inference, so additional safeguards are needed.
Concluding remarks
Inference of predictive signatures can be augmented with the use of prior knowledge (via meta-analysis); with the careful and systematic evaluation of gene sets, even if they overlap with known sources of confounding; and with rigorous testing of inferred signatures against as many published data sets as possible.
Author contributions
W.W.B.G. and L.W. co-designed the methodologies and co-wrote the manuscript.
Acknowledgments
W.W.B.G. and L.W. thank Vincent Detours and his colleagues for the code and data obtained from their publication. L.W. gratefully acknowledges support from a Kwan-Im-Thong-Hood-Cho-Temple chair professorship.
References
- Dealing with confounders in omics analysis. Trends Biotechnol. (2018)
- The application of principal component analysis to drug discovery and biomedical data. Drug Discov. Today (2017)
- Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. (2017)
- The Anna Karenina principle: a way of thinking about success in science. J. Am. Soc. Inf. Sci. Technol. (2012)
- Stress and stability: applying the Anna Karenina principle to animal microbiomes. Nat. Microbiol. (2017)
- Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. (2015)
- Test set bias affects reproducibility of gene signatures. Bioinformatics (2015)
- Feature selection in clinical proteomics: with great power comes great reproducibility. Drug Discov. Today (2016)