Stable graphical model estimation with Random Forests for discrete, continuous, and mixed variables

https://doi.org/10.1016/j.csda.2013.02.022

Abstract

Random Forests in combination with Stability Selection make it possible to estimate stable conditional independence graphs with an error control mechanism for false positive selection. This approach is applicable to graphs containing both continuous and discrete variables at the same time. Its performance is evaluated in various simulation settings and compared with alternative approaches. Finally, the approach is applied to two health-related data sets: first, to study the interconnection of functional health components, personal factors, and environmental factors; and second, to identify risk factors that may be associated with adverse neurodevelopment after open-heart surgery.

Introduction

In many problems one is not confined to one response and a set of predefined predictors. Instead, the interest is often in the association structure of a whole set of p variables, i.e. in asking whether two variables are independent conditional on the remaining p − 2 variables. A conditional independence graph (CIG) is a concise representation of such pairwise conditional independence among many possibly mixed, i.e. continuous and discrete, variables. In CIGs, variables appear as nodes, whereas the presence (absence) of an edge between two nodes represents their dependence (independence) conditional on all other variables. Applications include, among many others, the study of functional health (Strobl et al., 2009, Kalisch et al., 2010, Reinhardt et al., 2011).

We largely focus on the high-dimensional case where the number of variables (nodes in the graph) p may be larger than sample size n. A popular approach to graphical modeling is based on the Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996); see Meinshausen and Bühlmann (2006) or Friedman et al. (2008) for the Gaussian case and Ravikumar et al. (2010) for the binary case. However, empirical data often involve both discrete and continuous variables. Conditional Gaussian distributions were suggested to model such mixed-type data with maximum likelihood inference (Lauritzen and Wermuth, 1989), but no corresponding high-dimensional method has been suggested yet. Dichotomization, though always applicable, comes at the cost of lost information (MacCallum et al., 2002).

Tree-based methods are easy to use and accurate for dealing with mixed-type data (Breiman et al., 1984). Random Forests (Breiman, 2001, Hapfelmeier and Ulm, 2013) evaluate an ensemble of trees, often resulting in notably improved performance compared to a single tree (see also Amit and Geman, 1997). Furthermore, permutation importance in Random Forests makes it possible to rank the relevance of predictors for one specific response. However, Random Forests have also been criticized for possibly biased variable selection. We thus also consider Conditional Forests (Strobl et al., 2007) and conditional variable importance (Strobl et al., 2008), which have been suggested to overcome this behavior.

In general, the definitions of both the conditional and the marginal permutation importance differ for discrete and continuous responses. Thus, ranking permutation importances across responses of mixed type is less obvious. However, such a ranking is essential to derive a network of the most relevant dependences. Stability Selection, proposed by Meinshausen and Bühlmann (2010), is one possible framework to rank the edges in the CIG across different types of variables. In addition, it makes it possible to specify an upper bound on the expected number of false positives, i.e. the falsely selected edges, and thus provides a means of error control.
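The following is a minimal sketch of how Stability Selection ranks edges and bounds false positives. It is not the authors' implementation. It assumes the bound of Meinshausen and Bühlmann (2010), E[V] ≤ q² / ((2π_thr − 1) · m), where q is the number of edges kept per subsample, π_thr is the selection-frequency threshold, and m is the number of candidate edges (taken here, as an assumption, to be p(p − 1)/2). All function names are hypothetical, and `toy_rank_edges` is only a stand-in that ranks by marginal covariance rather than by a Random Forests importance measure.

```python
import random
from itertools import combinations


def expected_false_positive_bound(q, pi_thr, n_candidates):
    """Upper bound on the expected number of falsely selected candidates."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("pi_thr must lie in (0.5, 1]")
    return q ** 2 / ((2.0 * pi_thr - 1.0) * n_candidates)


def stability_selection(rows, rank_edges, q, pi_thr, n_subsamples=100, seed=0):
    """Return {edge: selection frequency} for edges with frequency >= pi_thr.

    rank_edges(subsample) is a hypothetical user-supplied function returning
    candidate edges ordered from most to least relevant.
    """
    rng = random.Random(seed)
    half = len(rows) // 2
    counts = {}
    for _ in range(n_subsamples):
        subsample = rng.sample(rows, half)      # n/2 rows, without replacement
        for edge in rank_edges(subsample)[:q]:  # keep the top-q edges
            counts[edge] = counts.get(edge, 0) + 1
    freqs = {e: c / n_subsamples for e, c in counts.items()}
    return {e: f for e, f in freqs.items() if f >= pi_thr}


def toy_rank_edges(subsample):
    """Toy stand-in ranking: absolute sample covariance of each column pair."""
    n, p = len(subsample), len(subsample[0])
    means = [sum(r[j] for r in subsample) / n for j in range(p)]

    def abs_cov(i, j):
        return abs(sum((r[i] - means[i]) * (r[j] - means[j]) for r in subsample)) / n

    return sorted(combinations(range(p), 2), key=lambda e: abs_cov(*e), reverse=True)


# Columns 0 and 1 are linearly related; column 2 is an unrelated pattern.
rows = [(i, 2 * i + 1, (7 * i) % 5) for i in range(40)]
stable = stability_selection(rows, toy_rank_edges, q=1, pi_thr=0.9)

# Bound for p = 50 nodes (1225 candidate edges), q = 35, pi_thr = 0.75.
bound = expected_false_positive_bound(q=35, pi_thr=0.75, n_candidates=50 * 49 // 2)
```

In this toy run only the edge between columns 0 and 1 is stable, and the bound for the second call evaluates to 2 expected false positive edges. Because selection frequencies are comparable across subsamples regardless of the response type, this thresholding step is what allows edges from discrete and continuous responses to be ranked together.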

We combine Random Forests estimation with appropriate ranking among mixed-type variables and error control from Stability Selection. We refer to the new method as Graphical Random Forests (GRaFo). Our specific aims are (a) to evaluate and compare the performance of GRaFo with Stable LASSO (StabLASSO) and Stable Conditional Forests (StabcForests), which are LASSO- and Conditional Forest-based alternatives, and with regular maximum likelihood (ML) estimation across various simulated settings comprising different distributions, interactions, and nonlinear associations for p = 50, 100, and 200 possibly mixed-type variables with sample size n = 100 (p = 50, n = 500 for ML); (b) to apply GRaFo to data from the Swiss Health Survey (SHS) to evaluate the interconnection of functional health components, personal factors, and environmental factors, as hypothesized by the World Health Organization’s (WHO) International Classification of Functioning, Disability and Health (ICF); and (c) to use GRaFo to identify risk factors associated with adverse neurodevelopment in children with trisomy 21 after open-heart surgery and, more generally, to assess the plausibility of the suggested associations.

Section snippets

Conditional independence graphs

Let $X = \{X_1, \ldots, X_p\}$ be a set of (possibly) mixed-type random variables. The associated conditional independence graph of $X$ is the undirected graph $G_{\mathrm{CIG}} = (V, E(G_{\mathrm{CIG}}))$, where the nodes in $V$ correspond to the $p$ variables in $X$. The edges represent the pairwise Markov property, i.e. $i - j \notin E(G_{\mathrm{CIG}})$ if and only if $X_j \perp\!\!\!\perp X_i \mid X \setminus \{X_j, X_i\}$. For a rigorous introduction to graphical models, see, for example, the monographs by Whittaker (1990) or Lauritzen (1996).

We will now show that the pairwise Markov property can,

Random Forests

Random Forests have, to date, not been used to estimate CIGs. They perform a series of recursive binary partitions of the data and construct the predictions from terminal nodes. Based on classification and regression trees (Breiman et al., 1984) they allow convenient inference for mixed-type variables, also in the presence of interaction effects. Incorporating bootstrap (Efron, 1979, Breiman, 1996) and random feature selection (Amit and Geman, 1997), random subsets of both the observations and

Simulating data from directed acyclic graphs

We use a directed acyclic graph (DAG; cf., Whittaker, 1990) to embed conditional dependence statements among nodes representing the p random variables. The associated CIG follows by moralization, i.e. connecting any two parents with a common child that are not already connected and removing all arrowheads (Lauritzen and Spiegelhalter, 1988).
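The moralization step can be sketched in a few lines. This is an illustrative sketch only: the input format (a dictionary mapping each node to the set of its parents) and the function name `moralize` are assumptions, not part of the original text.

```python
from itertools import combinations


def moralize(parents):
    """Return the undirected edge set of the moral graph of a DAG.

    parents maps each node to the set of its parents (hypothetical format).
    Each undirected edge is represented as a frozenset of two nodes.
    """
    edges = set()
    for child, pa in parents.items():
        for parent in pa:
            edges.add(frozenset((parent, child)))  # drop the arrowhead
        for u, v in combinations(sorted(pa), 2):
            edges.add(frozenset((u, v)))           # connect parents of a common child
    return edges


# Toy DAG: 1 -> 3 <- 2 and 3 -> 4
dag = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}
cig_edges = moralize(dag)
```

For the toy DAG, nodes 1 and 2 share the child 3, so the moral graph contains the added edge between 1 and 2 in addition to the three original edges with their directions removed.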

Let $A$ be a $(p \times p)$-dimensional weight matrix with entries $a_{ij} \in \{[-1, -0.1] \cup \{0\} \cup [0.1, 1]\}$ if $i < j$ and $a_{ij} = 0$ otherwise. In addition, we sample $A$ to be sparse,

The importance of functional health

According to the World Health Organization’s (WHO) new framework of the International Classification of Functioning, Disability and Health (ICF; cf., WHO, 2001) the lived experience of health (Stucki et al., 2008) can be structured in experiences related to body functions and structures as well as to activity and participation in society. All of these are, in turn, influenced by a variety of so-called personal factors such as gender, income, or age and environmental factors including individual

Modeling neurodevelopment in children experiencing open-heart surgery

Here we demonstrate an application of GRaFo to a research question where p is much larger than n. It is thus of particular interest whether GRaFo can suggest meaningful associations or tends to produce seemingly spurious associations.

Conclusion

The proposed method, GRaFo (Graphical Random Forests), performed satisfactorily, mostly on par with or superior to StabLASSO, StabcForests, LASSO, Conditional Forests, Random Forests, and ML estimation. Error control of false positive edges could be achieved in all but the mixed-type simulation with p = 200 and the nonlinear Gaussian setting with p ≥ 100. Violation of assumption (A) in Theorem 1 and of the exchangeability condition might be responsible for this behavior. In contrast, in most of the other settings

Proof of Theorem 1

Proof

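We know that $X_j \perp\!\!\!\perp X_i \mid X \setminus \{X_j, X_i\}$ is equivalent to

$$P[X_j \le x_j \mid \{x_h; h \ne j\}] = P[X_j \le x_j \mid \{x_h; h \ne j, i\}] \tag{4}$$

for all realizations $x_j$ of $X_j$ and $\{x_h; h \ne j\}$ of $X \setminus \{X_j\}$. Due to assumption (A) we can rewrite (4):

$$F_j(x_j \mid m_j(\{x_h; h \ne j\})) = F_j(x_j \mid m_j(\{x_h; h \ne j, i\})) \tag{5}$$

for all $x_j$ and all $\{x_h; h \ne j\}$. But (5) is equivalent to $m_j(\{x_h; h \ne j\}) = m_j(\{x_h; h \ne j, i\})$ for all $\{x_h; h \ne j\}$. This completes the proof.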

Acknowledgments

The authors would like to thank three anonymous reviewers, Gerold Stucki, Carolina Ballert, Markus Kalisch, Marloes Maathuis, Philipp Rütimann, and Holger Höfling for valuable feedback and discussion.

References (59)

  • K.J. Archer. rpartOrdinal: an R package for deriving a classification tree for predicting an ordinal response. J. Stat. Softw. (2010)
  • Australian Bureau of Statistics. National Health Survey: Summary of Results (2009)
  • J.A. Ballweg et al. Neurodevelopmental outcomes following congenital heart surgery. Pediatr. Cardiol. (2007)
  • N. Bayley. Manual for the Bayley Scales of Infant Development (1993)
  • L. Breiman. Bagging predictors. Mach. Learn. (1996)
  • L. Breiman. Random Forests. Mach. Learn. (2001)
  • Breiman, L., 2002. Setting up, using, and understanding Random Forests...
  • L. Breiman et al. Classification and Regression Trees (1984)
  • P. Bühlmann et al. Analyzing bagging. Ann. Statist. (2002)
  • C. Dahinden et al. Decomposition and model selection for large contingency tables. Biometrical J. (2010)
  • B. Efron. Bootstrap methods: another look at the jackknife. Ann. Statist. (1979)
  • J. Friedman et al. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics (2008)
  • J. Friedman et al. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. (2010)
  • G.H. Givens et al. Computational Statistics (2005)
  • E. Graf. Rapport de méthodes. Enquête suisse sur la santé 2007. Plan d’échantillonnage, pondérations et analyses pondérées des données (2010)
  • H. Höfling et al. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. J. Mach. Learn. Res. (2009)
  • T. Hothorn et al. Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Statist. (2006)
  • M. Kalisch et al. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. (2007)
  • M. Kalisch et al. Understanding human functioning using graphical models. BMC Med. Res. Methodol. (2010)