Stable graphical model estimation with Random Forests for discrete, continuous, and mixed variables
Introduction
In many problems one is not confined to a single response and a set of predefined predictors. Instead, the interest often lies in the association structure of a whole set of variables, i.e. in asking whether two variables are independent conditional on the remaining variables. A conditional independence graph (CIG) is a concise representation of such pairwise conditional independence among many, possibly mixed (i.e. continuous and discrete), variables. In CIGs, variables appear as nodes, whereas the presence (absence) of an edge between two nodes represents their dependence (independence) conditional on all other variables. Applications include, among many others, the study of functional health (Strobl et al., 2009, Kalisch et al., 2010, Reinhardt et al., 2011).
We largely focus on the high-dimensional case where the number of variables $p$ (nodes in the graph) may be larger than the sample size $n$. A popular approach to graphical modeling is based on the Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996); see Meinshausen and Bühlmann (2006) or Friedman et al. (2008) for the Gaussian case and Ravikumar et al. (2010) for the binary case. However, empirical data often involve both discrete and continuous variables. Conditional Gaussian distributions were suggested to model such mixed-type data with maximum likelihood inference (Lauritzen and Wermuth, 1989), but no corresponding high-dimensional method has been suggested yet. Dichotomization, though always applicable, comes at the cost of lost information (MacCallum et al., 2002).
Tree-based methods are easy to use and accurate when dealing with mixed-type data (Breiman et al., 1984). Random Forests (Breiman, 2001, Hapfelmeier and Ulm, 2013) evaluate an ensemble of trees, which often results in notably improved performance compared to a single tree (see also Amit and Geman, 1997). Furthermore, permutation importance in Random Forests allows ranking the relevance of predictors for one specific response. However, Random Forests have also been criticized for potentially biased variable selection. We thus also consider Conditional Forests (Strobl et al., 2007) and conditional variable importance (Strobl et al., 2008), which have been suggested to overcome this behavior.
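The idea behind (marginal) permutation importance can be sketched as follows: permute one predictor, re-evaluate the prediction error, and record the increase. This is a minimal illustration, not the implementation used by any Random Forests package. In an actual forest the error is evaluated per tree on out-of-bag observations, and for discrete responses a misclassification rate would replace the squared error. The names `permutation_importance`, `predict`, and `mean_squared_error` are hypothetical helpers introduced only for this example.

```python
import random
from typing import Callable, List, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[List[float]], float],  # any fitted model (assumption)
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    column: int,
    n_repeats: int = 20,
    seed: int = 0,
) -> float:
    """Mean increase in squared error after permuting one predictor column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(list(row)) for row in X])
    increases = []
    for _ in range(n_repeats):
        # Shuffle the chosen column; this breaks its association with the response.
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [
            list(row[:column]) + [value] + list(row[column + 1:])
            for row, value in zip(X, shuffled)
        ]
        permuted_error = mean_squared_error(y, [predict(row) for row in X_perm])
        increases.append(permuted_error - baseline)
    return sum(increases) / n_repeats
```

A predictor that the model ignores receives an importance of zero, while a predictor that drives the predictions receives a positive value. Because the error scale depends on the type of the response, raw importances are not directly comparable across responses.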
In general, the definitions of both the conditional and the marginal permutation importance differ for discrete and continuous responses. Thus, ranking permutation importances across responses of mixed type is less obvious. However, such a ranking is essential to derive a network of the most relevant dependences. Stability Selection, proposed by Meinshausen and Bühlmann (2010), is one possible framework to rank the edges in the CIG across different types of variables. In addition, it allows specifying an upper bound on the expected number of false positives, i.e. the falsely selected edges, and thus provides a means of error control.
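The error bound of Stability Selection can be made concrete with a small calculation. The sketch below uses the bound from Meinshausen and Bühlmann (2010), E[V] ≤ q² / ((2π − 1) · m), where V is the number of falsely selected items, q the number of items selected per subsample, π ∈ (0.5, 1] the selection-frequency threshold, and m the number of candidates. Taking m to be the number of possible edges among the nodes is an assumption made here for illustration, as are all function names. The bound itself holds only under conditions such as exchangeability.

```python
import math


def n_possible_edges(n_nodes: int) -> int:
    """Number of undirected edges among n_nodes nodes."""
    return n_nodes * (n_nodes - 1) // 2


def expected_false_positive_bound(q: int, n_candidates: int, threshold: float) -> float:
    """Upper bound on E[V] for q selections per subsample."""
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0.5, 1]")
    return q ** 2 / ((2 * threshold - 1) * n_candidates)


def largest_q_for_bound(max_false_positives: float, n_candidates: int, threshold: float) -> int:
    """Largest q whose bound on E[V] does not exceed max_false_positives."""
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0.5, 1]")
    return math.floor(math.sqrt(max_false_positives * (2 * threshold - 1) * n_candidates))
```

For example, with 50 nodes there are 1225 possible edges; a threshold of 0.75 and a tolerated bound of 5 expected false positives would allow at most 55 edges to be selected per subsample.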
We combine Random Forests estimation with appropriate ranking among mixed-type variables and error control from Stability Selection. We refer to the new method as Graphical Random Forests (GRaFo). Our specific aims are (a) to evaluate and compare the performance of GRaFo with Stable LASSO (StabLASSO) and Stable Conditional Forests (StabcForests), which are LASSO- and Conditional Forest-based alternatives, and regular maximum likelihood (ML) estimation across various simulated settings comprising different distributions, interactions, and nonlinear associations for , and 200 possibly mixed-type variables while sample size is ( for ML), (b) to apply GRaFo to data from the Swiss Health Survey (SHS) to evaluate the interconnection of functional health components, personal, and environmental factors, as hypothesized by the World Health Organization’s (WHO) International Classification of Functioning, Disability and Health (ICF), and (c) to use GRaFo to identify risk factors associated with adverse neurodevelopment in children with trisomy 21 after open-heart surgery and more generally to assess the plausibility of the suggested associations.
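The combination of nodewise importance ranking and Stability Selection can be sketched roughly as below. This is an illustrative outline under stated assumptions, not the authors' implementation: each variable is treated as the response in turn, predictors are ranked within each response (which sidesteps the different importance scales of discrete and continuous responses), the two ranks available for an edge are combined by taking the worse one (an assumption), and the top `q` edges per subsample are counted. A toy absolute-correlation score stands in for the forest-based importance, and all names are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set, Tuple

Edge = Tuple[int, int]
ImportanceFn = Callable[[List[Sequence[float]], int], Dict[int, float]]


def abs_correlation_importance(rows: List[Sequence[float]], response: int) -> Dict[int, float]:
    """Toy stand-in for a forest importance: |Pearson correlation| with the response."""
    n = len(rows)
    y = [r[response] for r in rows]
    y_mean = sum(y) / n
    scores = {}
    for j in range(len(rows[0])):
        if j == response:
            continue
        x = [r[j] for r in rows]
        x_mean = sum(x) / n
        cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
        denom = math.sqrt(sum((a - x_mean) ** 2 for a in x) * sum((b - y_mean) ** 2 for b in y))
        scores[j] = abs(cov) / denom if denom > 0 else 0.0
    return scores


def stable_edges(
    data: List[Sequence[float]],
    importance: ImportanceFn,
    q: int,                    # edges selected per subsample
    threshold: float = 0.75,   # required selection frequency
    n_subsamples: int = 100,
    seed: int = 0,
) -> Set[Edge]:
    """Edges selected in at least `threshold` of the random half-samples."""
    rng = random.Random(seed)
    n_vars = len(data[0])
    counts: Counter = Counter()
    for _ in range(n_subsamples):
        subsample = rng.sample(data, len(data) // 2)
        worst_rank: Dict[Edge, int] = {}
        for response in range(n_vars):
            scores = importance(subsample, response)
            # Rank 1 = most important predictor for this response.
            ranked = sorted(scores, key=scores.get, reverse=True)
            for rank, predictor in enumerate(ranked, start=1):
                edge = (min(response, predictor), max(response, predictor))
                # Assumption: keep the worse of the two ranks an edge receives.
                worst_rank[edge] = max(worst_rank.get(edge, 0), rank)
        top = sorted(worst_rank, key=lambda e: (worst_rank[e], e))[:q]
        counts.update(top)
    return {edge for edge, c in counts.items() if c / n_subsamples >= threshold}
```

In practice the `importance` argument would wrap a Random Forest or Conditional Forest fit on the subsample, and `q` would be chosen from the desired bound on the expected number of false positives.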
Conditional independence graphs
Let $X = \{X_1, \dots, X_p\}$ be a set of (possibly) mixed-type random variables. The associated conditional independence graph of $X$ is the undirected graph $G = (V, E)$, where the nodes in $V = \{1, \dots, p\}$ correspond to the $p$ variables in $X$. The edges represent the pairwise Markov property, i.e. $i - j \notin E$ if and only if $X_i \perp X_j \mid X \setminus \{X_i, X_j\}$. For a rigorous introduction to graphical models, see, for example, the monographs by Whittaker (1990) or Lauritzen (1996).
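As a familiar special case, given here only for orientation and not taken from this paper, consider jointly Gaussian variables with covariance matrix $\Sigma$ and precision matrix $\Theta = \Sigma^{-1}$. The symbols $X_i$, $X_j$, $V$, $E$, and $\Theta$ are notation assumed for this illustration. In that case the pairwise Markov property reduces to a zero pattern in the precision matrix:

```latex
\[
  i - j \notin E
  \;\iff\;
  X_i \perp X_j \mid X_{V \setminus \{i,j\}}
  \;\iff\;
  \Theta_{ij} = 0,
  \qquad \Theta = \Sigma^{-1}.
\]
```

No such simple parametric characterization is available for general mixed-type variables, which is one motivation for nonparametric, regression-based approaches.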
We will now show that the pairwise Markov property can,
Random Forests
Random Forests have, to date, not been used to estimate CIGs. They perform a series of recursive binary partitions of the data and construct the predictions from terminal nodes. Based on classification and regression trees (Breiman et al., 1984) they allow convenient inference for mixed-type variables, also in the presence of interaction effects. Incorporating bootstrap (Efron, 1979, Breiman, 1996) and random feature selection (Amit and Geman, 1997), random subsets of both the observations and
Simulating data from directed acyclic graphs
We use a directed acyclic graph (DAG; cf., Whittaker, 1990) to embed conditional dependence statements among nodes representing the random variables. The associated CIG follows by moralization, i.e. connecting any two parents with a common child that are not already connected and removing all arrowheads (Lauritzen and Spiegelhalter, 1988).
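The moralization step described above can be sketched in a few lines. This is a minimal illustration assuming the DAG is given as a mapping from each child to its parents; the function name `moralize` and this input format are choices made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, Set


def moralize(parents: Dict[Hashable, Iterable[Hashable]]) -> Set[FrozenSet[Hashable]]:
    """Return the undirected edge set of the moral graph of a DAG.

    `parents` maps each child node to an iterable of its parent nodes.
    """
    edges: Set[FrozenSet[Hashable]] = set()
    for child, node_parents in parents.items():
        unique_parents = set(node_parents)
        # Keep every parent-child link, dropping its direction.
        for parent in unique_parents:
            edges.add(frozenset((parent, child)))
        # "Marry" all pairs of parents that share this child.
        for a, b in combinations(unique_parents, 2):
            edges.add(frozenset((a, b)))
    return edges
```

For a DAG with arrows a → c, b → c, and c → d, the moral graph contains the undirected edges a-c, b-c, and c-d plus the added edge a-b between the two parents of c.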
Let be a -dimensional weight matrix with entries if and otherwise. In addition, we sample to be sparse,
The importance of functional health
According to the World Health Organization’s (WHO) new framework of the International Classification of Functioning, Disability and Health (ICF; cf., WHO, 2001) the lived experience of health (Stucki et al., 2008) can be structured in experiences related to body functions and structures as well as to activity and participation in society. All of these are, in turn, influenced by a variety of so-called personal factors such as gender, income, or age and environmental factors including individual
Modeling neurodevelopment in children experiencing open-heart surgery
Here we demonstrate an application of GRaFo to a research question where the number of variables $p$ is much larger than the sample size $n$. It is thus of particular interest whether GRaFo can suggest meaningful associations or tends to produce seemingly spurious ones.
Conclusion
We propose GRaFo (Graphical Random Forests), which performed satisfactorily, mostly on par with or superior to StabLASSO, StabcForests, LASSO, Conditional Forests, Random Forests, and ML estimation. Error control of false positive edges could be achieved in all but the mixed-type simulation with and the nonlinear Gaussian setting with . Violation of assumption (A) in Theorem 1 and of the exchangeability condition might be responsible for this behavior. In contrast, in most of the other settings
Proof of Theorem 1
Proof We know that is equivalent to for all realizations of and of . Due to assumption (A) we can rewrite (4): for all and all . But (5) is equivalent to for all . This completes the proof. □
Acknowledgments
The authors would like to thank three anonymous reviewers, Gerold Stucki, Carolina Ballert, Markus Kalisch, Marloes Maathuis, Philipp Rütimann, and Holger Höfling for valuable feedback and discussion.
References (59)
- et al. Neurodevelopmental status at eight years in children with dextro-transposition of the great arteries: the Boston Circulatory Arrest Trial. J. Thorac. Cardiovasc. Surg. (2003)
- et al. Recursive partitioning on incomplete data using surrogate decisions and multiple imputation. Comput. Statist. Data Anal. (2012)
- et al. A new variable selection approach using Random Forests. Comput. Statist. Data Anal. (2013)
- et al. Long-term outcome of speech and language in children after corrective surgery for cyanotic or acyanotic cardiac defects in infancy. Eur. J. Paediatr. Neuro. (2008)
- et al. Long-term neurodevelopmental outcome and exercise capacity after corrective surgery for tetralogy of Fallot or ventricular septal defect in infancy. Ann. Thorac. Surg. (2006)
- et al. Graphical models illustrated complex associations between variables describing human functioning. J. Clin. Epidemiol. (2009)
- et al. Risk factors for neurodevelopmental impairments in school-age children after cardiac surgery with full-flow cardiopulmonary bypass. J. Thorac. Cardiovasc. Surg. (2012)
- et al. The cost of dichotomising continuous variables. Brit. Med. J. (2006)
- et al. Shape quantization and recognition with randomized trees. Neural Comput. (1997)
- A proposal for a new method of evaluation of the newborn infant. Curr. Res. Anesth. Analg. (1953)
- rpartOrdinal: an R package for deriving a classification tree for predicting an ordinal response. J. Stat. Softw.
- National Health Survey: Summary of Results
- Neurodevelopmental outcomes following congenital heart surgery. Pediatr. Cardiol.
- Manual for the Bayley Scales of Infant Development
- Bagging predictors. Mach. Learn.
- Random Forests. Mach. Learn.
- Classification and Regression Trees
- Analyzing bagging. Ann. Statist.
- Decomposition and model selection for large contingency tables. Biometrical J.
- Bootstrap methods: another look at the jackknife. Ann. Statist.
- Sparse inverse covariance estimation with the graphical Lasso. Biostatistics
- Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw.
- Computational Statistics
- Rapport de méthodes. Enquête suisse sur la santé 2007. Plan d’échantillonnage, pondérations et analyses pondérées des données
- Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. J. Mach. Learn. Res.
- Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Statist.
- Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res.
- Understanding human functioning using graphical models. BMC Med. Res. Methodol.