Model-based causal feature selection for general response types

Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings and exploits that causal models are invariant. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor type I error control), and additive noise models are not suitable for applications in which the response is not measured on a continuous scale but reflects categories or counts. Here, we develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (model classes which, in general, do not allow for identifiability in the absence of exogenous heterogeneity). As an invariance test, we propose TRAM-GCM, based on the expected conditional covariance between environments and score residuals, which comes with uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we also consider TRAM-Wald, which tests invariance based on the Wald statistic. We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients.


Motivation
Establishing causal relationships from observational data is a common goal in several scientific disciplines. However, systems are often too complex to allow for recovery of the full causal structure underlying the data-generating process. In this work, we consider the easier task of uncovering the causal drivers of a particular response variable of interest. We present methods, theoretical results, and user-friendly software for model-based causal feature selection, where the response may represent a binary, ordered, count, or continuous outcome and may additionally be uninformatively censored. We propose tramicp for causal feature selection, which is based on invariant causal prediction (ICP, Peters et al., 2016) and a flexible class of regression models called transformation models (trams, Hothorn et al., 2018). tramicp relies on data from heterogeneous environments and the assumption that the causal mechanism of the response given its direct causes (direct w.r.t. the considered sets of covariates) is correctly specified by a tram and does not change across those environments (Haavelmo, 1943; Frisch et al., 1948; Aldrich, 1989; Pearl, 2009; Schölkopf et al., 2012). The causal tram will then produce score residuals (residuals defined specifically for trams) that are invariant across the environments. We propose an invariance test based on the expected conditional covariance between the score residuals and the environments given a subset S of the covariates, called tram-GCM. With this invariance test, tramicp recovers a subset of the direct causes with high probability by fitting a tram for each subset of covariates, computing score residuals, testing whether those score residuals are uncorrelated with the residualized environments, and lastly intersecting all subsets for which the null hypothesis of invariance was not rejected.
For the special case of additive linear models, we propose another invariance test, tram-Wald, based on the Wald statistic for testing whether main and interaction effects involving the environments are zero.
We illustrate the core ideas of tramicp in the following example with a binary response and the logistic regression model (McCullagh and Nelder, 2019), which is a tram. We defer all details on how trams and score residuals are defined to Section 2 and describe the invariance tests in Section 3.
Example 1 (Invariance in binary generalized linear models). Consider a structural causal model (Pearl, 2009) over (Y, X^1, X^2, E) with jointly independent noise variables N_E ∼ Bernoulli(0.5), N^1 ∼ N(0, 1), N^2 ∼ N(0, 1), and N_Y following a standard logistic distribution. Here, E encodes two environments in which the distributions of X^1 and X^2 differ, but the causal mechanism of Y given its direct cause X^1 does not change.
Let us assume that both the above structural causal model and its implied structure are unknown and that we observe an i.i.d. sample {(e_i, x^1_i, x^2_i, y_i)}_{i=1}^{n} from the joint distribution of (E, X^1, X^2, Y). We further know that Y given its direct causes is correctly specified by a logistic regression. All remaining conditionals do not need to satisfy any model assumptions. Our task is now to infer (a subset of) the direct causes of Y. To do so, for each subset of the covariates X^S, S ⊆ {1, 2} (i.e., for ∅, {1}, {2} and {1, 2}), we (i) fit a binary logistic regression model, (ii) compute the score residuals y_i − P̂(Y = 1 | X^S = x^S_i) (from the logistic regression) and the residualized environments e_i − P̂(E = 1 | X^S = x^S_i) (via a random forest), and (iii) test whether the two residuals are correlated. Figure 1 shows the residuals obtained in step (ii) for each non-empty subset of the covariates.
In this example, even though the model using {X 1 , X 2 } achieves higher predictive accuracy than the model using the causal parent {X 1 }, only the model Y | X 1 is stable across the environments.
If more than one set is invariant, one can take the intersection of the invariant sets to obtain a subset of the direct causes of Y .
With our openly available R package tramicp (https://CRAN.R-project.org/package=tramicp), the analysis in this example can be reproduced with the following code, where df is a data frame with 500 independent observations from the structural causal model above.

Related work
In structural causal models (Pearl, 2009), several algorithms exist to tackle the problem of causal discovery, i.e., learning the causal graph from data, including constraint-based and score-based methods (Spirtes et al., 2000;Chickering, 2002;Pearl, 2009;Glymour et al., 2019). Without further restrictions on the structural assignments and faithfulness, one can hope to recover the causal graph up to the Markov equivalence class (Verma and Pearl, 1990;Andersson et al., 1997;Tian and Pearl, 2001), for which several algorithms have been proposed based on observational data, interventional data, or a combination of both (Spirtes et al., 2000;Chickering, 2002;Castelo and Kocka, 2003;He and Geng, 2008;Hauser and Bühlmann, 2015). However, in many real-world applications learning the full causal graph may be too ambitious or unnecessary for tackling the problem at hand. As opposed to causal discovery, causal feature selection aims to identify the direct causes of a given variable of interest (the response) from potentially many measured covariates, instead of the full graph (Guyon et al., 2007).
Invariant causal prediction (ICP, Peters et al., 2016) is an approach to causal feature selection which exploits invariance of the conditional distribution of a response given its direct causes under perturbations of the covariates. ICP can be formulated from a structural causal modeling perspective as well as from a potential outcomes perspective (Hernán and Robins, 2010). In contrast to constraint- and score-based algorithms, ICP requires a specific response variable and data from heterogeneous environments.
ICP builds on the concept of invariance and can generally be formulated as conditional independence between the response and the environments given a candidate set (Heinze-Deml et al., 2018). Thus, nonparametric conditional independence tests (e.g., Zhang et al., 2011; Candès et al., 2018; Strobl et al., 2019; Berrett et al., 2019) can, in principle, always be applied. However, conditional independence testing has been shown to be an arbitrarily hard problem: there is no test that simultaneously holds level for all distributions satisfying the null and has non-trivial power (Shah and Peters, 2020).
As an alternative to conditional independence testing, model-based formulations of ICP have been proposed for linear (Peters et al., 2016) and non-linear additive noise models (the "invariant residual distribution test" proposed in Heinze-Deml et al., 2018). Diaz et al. (2022) use an "invariant target prediction" test from Heinze-Deml et al. (2018) for testing invariance with a binary response by nonparametrically comparing the out-of-sample area under the receiver operating characteristic (ROC) curve (AUC). Under correct model specification, model-based ICP can have considerably higher power than its nonparametric alternatives. Model-based ICP has been extended to generalized linear models (GLMs, see the discussion in Peters et al., 2016) and to sequential data (Pfister et al., 2019).
ICP for GLMs and additive and multiplicative hazard models has been investigated in Laksafoss (2020).
Many applications feature complex response types, such as ordinal scales, survival times, or counts, and the data-generating mechanism can seldom be assumed to be additive in the noise. This is reflected in the most common model choices for such responses, namely proportional odds logistic (McCullagh, 1980; Tutz, 2011), Cox proportional hazards (Cox, 1972), and generalized linear models (McCullagh and Nelder, 2019), which, in general, do not assume additive noise. Together, non-continuous responses and non-additive noise render many causal feature selection algorithms inapplicable. Moreover, the proposed extensions to GLMs and hazard-based models rely on case-specific definitions of invariance; a unified view on linear, generalized linear, hazard-based, and general distributional regression is yet to be established.
In practice, a model-based approach can be desirable because it leads to interpretable effect estimates, such as odds or hazard ratios. However, there is a trade-off between model intelligibility and misspecification. Many commonly applied regression models are not closed under marginalization or under the inclusion or exclusion of covariates that are associated with the response (collapsibility, Greenland, 1996; Greenland et al., 1999; Didelez and Stensrud, 2022).
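Non-collapsibility can be made concrete with a small numerical check: in a logistic model, marginalizing out an independent covariate attenuates the conditional log-odds ratio. The coefficient values below are arbitrary illustration choices.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
logit = lambda p: np.log(p / (1.0 - p))

b1, b2 = 1.0, 2.0  # conditional log-odds ratios of X1 and X2 (arbitrary values)

# Marginalize an independent X2 ~ N(0, 1) out of
# P(Y = 1 | X1, X2) = sigmoid(b1*X1 + b2*X2) by numerical integration.
x2 = np.linspace(-8.0, 8.0, 20001)
w = np.exp(-x2**2 / 2.0)
w /= w.sum()
marginal_prob = lambda x1: float(np.sum(sigmoid(b1 * x1 + b2 * x2) * w))

# The marginal log-odds ratio for a one-unit change in X1 is attenuated
# relative to the conditional one, b1: the odds ratio is non-collapsible.
marginal_lor = logit(marginal_prob(1.0)) - logit(marginal_prob(0.0))
```

The marginal model is also no longer exactly logistic; the computed marginal log-odds ratio lies strictly between 0 and b1, even though X2 is independent of X1.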
(with jumps at points in Y) and F Z a user-specified continuous cumulative distribution function with log-concave density (log-concavity ensures uniqueness of the maximum likelihood estimator, Siegfried and Hothorn, 2020).
Example 4 (Parametric survival regression). Understanding the causal relationship between features and patient survival is sought after in many biomedical applications. Let Y be a (strictly) positive real-valued random variable, i.e., Y := (0, ∞). The Weibull proportional hazards model is given by F_{Y|X=x}(y) = F_minEV(h_Y(y) − x⊤β), where F_minEV(z) := 1 − exp(−exp(z)) denotes the standard minimum extreme value distribution, h_Y(·) := ϑ_1 + ϑ_2 log(·), and β is interpretable as a vector of log hazard ratios (Kleinbaum and Klein, 2012). The Cox proportional hazards model (Cox, 1972) is obtained as an extension of the Weibull model by allowing h_Y to be a step function (with jumps at the observed event times) instead of a log-linear function.
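The connection between the transformation-model form F_minEV(ϑ_1 + ϑ_2 log(y) − x⊤β) and the classical Weibull parameterization can be checked numerically; the parameter values below are arbitrary.

```python
import math

theta1, theta2, shift = -1.0, 1.5, 0.7   # h_Y(y) = theta1 + theta2*log(y); shift = x'beta

def F_minEV(z):
    """Standard minimum extreme value CDF."""
    return 1.0 - math.exp(-math.exp(z))

def tram_cdf(y):
    # Weibull proportional hazards model written as a linear shift tram
    return F_minEV(theta1 + theta2 * math.log(y) - shift)

def weibull_cdf(y, shape, scale):
    # Classical Weibull(shape, scale) parameterization
    return 1.0 - math.exp(-((y / scale) ** shape))

# Matching classical parameters: shape = theta2, scale = exp((shift - theta1) / theta2)
shape, scale = theta2, math.exp((shift - theta1) / theta2)
```

The two parameterizations agree exactly: expanding the exponent gives exp(ϑ_1 + ϑ_2 log(y) − x⊤β) = (y / scale)^{ϑ_2} with the shape and scale above, so larger x⊤β shifts survival times upward while ϑ_2 plays the role of the Weibull shape.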
In all of the above examples, we have assumed that the response given its causal parents is correctly specified by a linear shift tram (see Definition 7 for more details). If conditioning on a set other than S* always yielded a model misspecification, one could attempt to identify the set of causal parents by testing, for different sets X^S of covariates, whether the model for Y | X^S is correctly specified. However, in Proposition 15 below, we prove that, in general, such a procedure does not work. More precisely, there exists a pair of structural causal models such that both induce the same observational distribution and, in both, the response given its causal parents is correctly specified by a (linear shift) tram, but the parental sets differ.
In this work, following a line of work in causal discovery (Heinze-Deml et al., 2018; Christiansen and Peters, 2020), we instead assume to have access to data from heterogeneous environments. Given such data, we define invariance in trams and propose invariance tests based on the expected conditional covariance between the environments and score residuals (tram-GCM), as well as an invariance test based on the Wald statistic for linear shift trams in particular (tram-Wald). We prove that the tram-GCM test is uniformly asymptotically level α for any α ∈ (0, 1) (Theorem 20) and demonstrate empirically that it has power comparable to or higher than nonparametric conditional independence testing. In the context of the hardness result on assumption-free conditional independence testing for continuous distributions (Shah and Peters, 2020), our theoretical results show that, under mild assumptions on the relationship between E and X, the model class of trams can be sufficiently restrictive to allow for useful conditional independence tests.
The rest of this paper is structured as follows. Section 2.1 gives a technical introduction to transformation models which can be skipped at first reading. We introduce structural causal trams in Section 2.2 and show that in this class, the set of causal parents is, in general, not identified (Section 2.3). In Section 3, we present the proposed tram-GCM and tram-Wald invariance tests and their theoretical guarantees. We also describe our readily available R implementation of tramicp.
Afterwards, we present a simulation study featuring the examples from above and investigating the behaviour of tramicp under model misspecification (Section 4).

Using transformation models for causal inference
Transformation models, as introduced by Box and Cox (1964) in their earliest form, are models for the conditional cumulative distribution function of a response given covariates (Doksum, 1974; Bickel and Doksum, 1981; Cheng et al., 1995; Hothorn et al., 2014). trams transform the response conditional on covariates such that the transformed response can be modelled on a fixed, continuous latent scale. Given data and a finite parameterization, the transformation can be estimated via maximum likelihood (Hothorn et al., 2018). We formally define trams as a class of non-linear non-additive noise models depending on the sample space of both response and covariates. Our treatment of trams may appear overly mathematical; however, the formalism is needed to formulate and prove the identification result (see Proposition 15 in Section 2.3) and the uniform asymptotic level guarantee for the tram-GCM invariance test (Theorem 20). A more intuitive introduction to trams can be found in Hothorn et al. (2018), for example. We then embed trams into a causal modeling framework, using structural causal models (SCMs, Pearl, 2009; Bongers et al., 2021). We also adapt standard results from parametric (Hothorn et al., 2018) and semi-parametric (McLain and Ghosh, 2013) maximum likelihood estimation, which enable us to obtain consistency and asymptotic normality results that are exploited by the proposed invariance tests.

Transformation models
Let R̄ := R ∪ {−∞, +∞} denote the extended real line. Throughout the paper, let Z denote the set of functions F_Z : R̄ → [0, 1] that are (i) strictly increasing with lim_{x→−∞} F_Z(x) = 0 and lim_{x→∞} F_Z(x) = 1, (ii) three-times differentiable with a log-concave derivative f_Z = F_Z′ when restricted to R, and (iii) satisfy F_Z(−∞) = 0 and F_Z(+∞) = 1. We call Z the set of extended differentiable cumulative distribution functions. Given a CDF F : R → [0, 1] satisfying (i) and (ii), we may add (iii) and refer to the resulting function as an extended CDF. For instance, the extended standard logistic CDF is given by F_SL(z) = (1 + exp(−z))^{−1} for all z ∈ R, together with F_SL(−∞) = 0 and F_SL(+∞) = 1. Besides F_SL, in our applications, we consider the extended versions of the standard normal CDF Φ and the standard minimum extreme value CDF F_minEV : z ↦ 1 − exp(−exp(z)). By slight abuse of notation, we use the same letters Φ, F_SL, F_minEV for the extended CDFs. In general, specifying a transformation model requires choosing a particular F_Z ∈ Z. Further, for a symmetric positive semi-definite matrix A, let λ_min(A) denote its smallest eigenvalue and ∥A∥_op its operator norm. For all n ∈ N, we write [n] as shorthand for {1, . . . , n}.
We call a function h : R̄ → R̄ extended right-continuous and increasing (ERCI) on Y ⊆ R̄ if (i) it is right-continuous and strictly increasing on Y and fulfills h(min Y) > −∞ (if min Y exists), (ii) for all y < inf Y, we have h(y) = −∞, and (iii) for all y > sup Y, we have h(y) = +∞.
We are now ready to define the class of transformation models.
Definition 5 (Transformation model). Let Y ⊆ R̄ and X ⊆ R^d, and let H*_{Y,X} denote the set of all functions h : R̄ × X → R̄ such that h(· | x) is ERCI on Y for all x ∈ X. Then, for a fixed error distribution F_Z ∈ Z and a set of transformation functions H_{Y,X} ⊆ H*_{Y,X}, the family of trams M(F_Z, Y, X, H_{Y,X}) is defined as the following set of conditional cumulative distribution functions (see also Definition 2 in Hothorn et al., 2018):
M(F_Z, Y, X, H_{Y,X}) := {(y, x) ↦ F_Z(h(y | x)) : h ∈ H_{Y,X}}.¹
As such, a single tram is fully specified by the pair (F_Z, h). The inverse transformation function h^{−1}(· | x) at a given x can be interpreted analogously to a quantile function: given X = x, we can obtain an observation from F_{Y|X=x} by sampling an observation from F_Z and passing it through h^{−1}(· | x).
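The quantile-function analogy can be sketched directly: with F_Z the standard logistic CDF and an illustrative (not the paper's) transformation h(y | x) = y − x, sampling z from F_Z and applying h^{−1}(· | x) yields a draw whose conditional CDF is F_Z ∘ h(· | x).

```python
import math

F_Z = lambda z: 1.0 / (1.0 + math.exp(-z))       # standard logistic CDF F_SL
F_Z_inv = lambda u: math.log(u / (1.0 - u))      # its quantile function

h = lambda y, x: y - x        # illustrative transformation function
h_inv = lambda z, x: z + x    # its inverse in y

x = 2.0
for u in (0.1, 0.5, 0.9):
    z = F_Z_inv(u)            # inverse-transform sample from F_Z
    y = h_inv(z, x)           # push through h^{-1}(. | x)
    # By construction, the conditional CDF F_Z(h(. | x)) evaluated at y recovers u:
    assert abs(F_Z(h(y, x)) - u) < 1e-12
```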
In statistical modelling, it is common to additionally assume additivity of the effects of X on a specific scale. For instance, in linear regression the covariates enter as a linear predictor on the scale of the conditional mean. In this work, we restrict ourselves to the class of shift trams in which additivity is assumed on the scale of the transformation function.
Definition 6 (Shift trams). Let Y, X and F_Z ∈ Z be as in Definition 5. Further, let H_Y denote a set of functions that are ERCI on Y and let F denote a set of measurable shift functions f : X → R, and define H^shift_{Y,X} := {(y, x) ↦ h_Y(y) − f(x) : h_Y ∈ H_Y, f ∈ F}. Then, M(F_Z, Y, X, H^shift_{Y,X}) denotes the family of shift trams, and a tram F_Z ∘ h is called a shift tram iff h ∈ H^shift_{Y,X}. Further, any h_Y ∈ H_Y is referred to as a baseline transformation.
We next introduce the subset of linear shift trams in which the covariates enter as a linear predictor.
1 In Proposition 27 in Appendix E2, we show that M indeed only contains CDFs.
Definition 7 (Linear shift trams). Consider shift trams specified by F_Z, Y, X, F, H^shift_{Y,X} as in Definition 6. Let b : X → R^b be a finite collection of basis transformations and define H^b_{Y,X} := {(y, x) ↦ h_Y(y) − b(x)⊤β : h_Y ∈ H_Y, β ∈ R^b}. For the special case of b : x ↦ x, we write H^linear_{Y,X} and refer to the class and its members as linear shift trams.
Estimation and inference in trams can be based on the log-likelihood function, provided it exists. The following assumption ensures that this is the case.
Assumption 1. For all h ∈ H_{Y,X} and all x ∈ X, if Y is uncountable, the function h(· | x) is differentiable on Y, such that
f_{Y|X=x}(y; h) := f_Z(h(y | x)) h′(y | x),   (1)
where h′(y | x) := d/dυ h(υ | x)|_{υ=y}, is well-defined and a density (w.r.t. Lebesgue measure) of the conditional CDF induced by the tram.
Assumption 1 allows us to define (strictly positive) canonical conditional densities with respect to a fixed measure that we denote by μ: If Y is countable, we let μ denote the counting measure on Y and define the canonical conditional density by f_{Y|X=x}(y; h) := F_Z(h(y | x)) − F_Z(h(ȳ | x)), where ȳ := sup{υ ∈ Y : υ < y}. If Y is uncountable, we let μ denote the Lebesgue measure restricted to Y, and the canonical conditional density is then defined by (1). In either case, H_{Y,X} ⊆ H*_{Y,X} ensures that f_{Y|X=x}(y; h) > 0 for all x ∈ X and y ∈ Y. Thus, for (F_Z, Y, X, H_{Y,X}) satisfying Assumption 1, we can define the tram log-likelihood as ℓ : H_{Y,X} × Y × X → [−∞, ∞), ℓ(h; y, x) := log f_{Y|X=x}(y; h).

When applying ICP to linear additive noise models, invariance can be formulated as uncorrelatedness between residuals and environments. In trams, however, the response can be categorical, reducing the usefulness of classical residuals. Instead, score residuals (Lagakos, 1981; Korepanova et al., 2020; Kook et al., 2022) are a natural choice for testing invariance of trams. Score residuals were first introduced by Lagakos (1981) for multiplicative hazard models (see also Korepanova et al., 2020, for non-multiplicative hazard models) and extended to linear shift trams by Kook et al. (2022, Definition 2). Score residuals coincide with scaled least-squares residuals in linear regression with normal errors and with martingale residuals in the Cox proportional hazards model (Barlow and Prentice, 1988), and they directly extend to censored responses (Lagakos, 1981; Farrington, 2000). In this work, score residuals play a major role in formulating invariance tests (Section 3); they have also been used for causal regularization in a distributional version of anchor regression (Rothenhäusler et al., 2021; Kook et al., 2022). For defining score residuals, we require the following assumption (which, by definition, is satisfied for H^shift_{Y,X} and H^linear_{Y,X}).
Assumption 2. H_{Y,X} is closed under scalar addition, that is, for all h ∈ H_{Y,X} and α ∈ R, we have h + α ∈ H_{Y,X}.
Definition 8 (Score residuals; Lagakos, 1981; Kook et al., 2022). Let Y, X, F_Z ∈ Z and H_{Y,X} ⊆ H*_{Y,X} be as in Definition 5, and impose Assumptions 1 and 2. Then, the score residual R : H_{Y,X} × Y × X → R is defined as
R(h; y, x) := (∂/∂α) ℓ(h + α; y, x) |_{α=0}.
Our invariance tests (Definition 16 in Section 3) are based on score residuals and use the following property.
Lemma 9. Let Y, X, F_Z ∈ Z and H_{Y,X} ⊆ H*_{Y,X} be as in Definition 5, and impose Assumptions 1 and 2. Let X ∈ X follow P_X and let (F_Z, h_0), h_0 ∈ H_{Y,X}, be a tram such that, for P_X-almost all x, F_{Y|X=x} = F_Z ∘ h_0(· | x). Then, E[R(h_0; Y, X) | X] = 0 P_X-almost surely.
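For binary logistic regression, the score residual reduces (up to the sign convention of the shift) to y − P̂(Y = 1 | x), as used in Example 1. This can be checked by differentiating the log-likelihood with respect to an additional intercept shift; the values below are arbitrary.

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

def loglik(y, eta, alpha):
    """Bernoulli log-likelihood with an additional intercept shift alpha."""
    p = sigmoid(eta + alpha)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def score_residual(y, eta, step=1e-6):
    # d/d(alpha) of the log-likelihood at alpha = 0, by central finite differences
    return (loglik(y, eta, step) - loglik(y, eta, -step)) / (2.0 * step)

# The derivative equals y - sigmoid(eta), i.e. y - P(Y = 1 | x):
for y in (0, 1):
    for eta in (-1.0, 0.3, 2.0):
        assert abs(score_residual(y, eta) - (y - sigmoid(eta))) < 1e-6
```

Averaging y − sigmoid(eta) over Y ∼ Bernoulli(sigmoid(eta)) gives zero, matching the conditional-mean-zero property of Lemma 9 under correct specification.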
Example 11 (Count regression, cont'd). For count responses, we can define a family of linear shift trams with support Y := {0, 1, 2, . . .}, given by M(F_Z, Y, X, H^linear_{Y,X}). For all x ∈ X, the transformation function h(· | x) := h_Y(·) − x⊤β is a right-continuous step function with steps at the integers and linear shift effects. The log-likelihood contribution of a single observation (y, x) is given by log F_Z(h(0 | x)) if y = 0 and by log(F_Z(h(y | x)) − F_Z(h(y − 1 | x))) for y ≥ 1. The exact form of the score residual depends on the choice of F_Z, and the (generalized) inverse transformation function is obtained by inverting the step function h(· | x).
Example 12 (Parametric survival regression, cont'd). For the Weibull proportional hazards model, we fix F_Z(z) = 1 − exp(−exp(z)) and define a family of log-linear transformation functions H^log-lin_{Y,X} with baseline transformations of the form h_Y(·) = ϑ_1 + ϑ_2 log(·). The log-likelihood contribution of an exact response is given by the log-density, and an uninformatively right-censored observation at time t with covariates x ∈ X contributes the log-survivor function evaluated at t, i.e., log(1 − F_Z(h(t | x))), to the log-likelihood. The inverse transformation function is obtained by solving z = h_Y(y) − x⊤β for y.
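The likelihood contributions of Example 11 define a proper probability mass function; a quick check with an illustrative baseline transformation h_Y(y) = log(y + 1) and an arbitrary shift:

```python
import math

F_Z = lambda z: 1.0 / (1.0 + math.exp(-z))   # standard logistic CDF
beta_shift = 0.5                              # x'beta, arbitrary value

# Illustrative linear shift tram for counts: h(y | x) = h_Y(y) - x'beta,
# with baseline transformation h_Y(y) = log(y + 1) evaluated at the integers.
h = lambda y: math.log(y + 1.0) - beta_shift

def pmf(y):
    # P(Y = 0 | x) = F_Z(h(0 | x)); P(Y = y | x) = F_Z(h(y|x)) - F_Z(h(y-1|x)), y >= 1
    return F_Z(h(0)) if y == 0 else F_Z(h(y)) - F_Z(h(y - 1))

N = 10**5
total = sum(pmf(y) for y in range(N + 1))   # telescopes to F_Z(h(N)), which tends to 1
```

Since h_Y is strictly increasing on the support, each probability is positive and the partial sums telescope, so the probabilities sum to one as N grows.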

Structural causal transformation models
Next, we cast trams into a structural causal modelling framework (Pearl, 2009). We are now ready to define structural causal trams.
Definition 13 (Structural causal tram). Let Y, X, F_Z ∈ Z be as in Definition 5. Let H_{Y,X} ⊆ H*_{Y,X} be a class of transformation functions such that Assumption 1 holds. Let (Z, N_X) be jointly independent with Z ∼ F_Z. Then, a structural causal tram C over (Y, X) is defined as
C : Y := h^{−1}(Z | X^{S*}), X := g_X(Y, X, N_X),   (3)
where S* ⊆ [d], h ∈ H_{Y,X^{S*}} is the causal transformation function, pa_C(Y) := S* denotes the set of causal parents of Y in C, and g_X is an arbitrary measurable function. By P^C_{(Y,X)} we denote the observational distribution induced by C. We assume that the induced graph (obtained by drawing directed edges from variables on the right-hand side to variables on the left-hand side) is acyclic.
We denote by C(F Z , Y, X , H Y,X ) the collection of all structural causal trams with error distribution F Z and causal transformation function h ∈ H Y,X .

Non-identifiability of the causal parents in transformation models
We now show that performing causal feature selection in structural causal transformation models requires further assumptions. We consider a response variable Y and a set of covariates X and assume that (Y, X) are generated from an (unknown) structural causal tram (defined in (3)) with (known) H Y,X ⊊ H * Y,X . In our work, the problem of causal feature selection concerns learning the causal parents pa(Y ) given a sample of (Y, X) and knowledge of F Z , Y, X , H Y,X (which specifies the model class M(F Z , Y, X , H Y,X )).
In this work, we specify the model class for the conditional distribution of the response given its causal parents, Y | X^{pa(Y)}, by a tram; the remaining conditionals are unconstrained. Identifiability of causal structure has been studied for several model classes that constrain the joint distribution of (Y, X).
When considering the class of linear Gaussian SCMs, for example, the causal parents are in general not identifiable from the observational distribution (as there are linear Gaussian SCMs with a different structure inducing the same distribution). This is different for other model classes: when considering linear Gaussian SCMs with equal noise variances (Peters and Bühlmann, 2013), linear non-Gaussian SCMs (Shimizu, 2014), or nonlinear Gaussian SCMs (Hoyer et al., 2008; Peters et al., 2014), for example, the graph structure (and thus the set of causal parents of Y) is identifiable under weak assumptions (identification then becomes possible by using goodness-of-fit procedures). To the best of our knowledge, identifiability of such model classes has not been studied when constraining only the conditional distribution Y | X^{pa(Y)}.
trams are generally not closed under marginalization (see Appendix A1 for a detailed discussion of non-collapsibility), and one may hypothesize that this model class allows for identifiability of the parents (e.g., by considering different subsets of covariates and testing for goodness of fit). We now prove that this is not the case: in general, for trams (and even for linear shift trams), the causal parents are not identifiable from the observational distribution. Instead, additional assumptions are needed to facilitate causal feature selection in trams.
Definition 14 formally introduces the notion of identifiability of the causal parents and Proposition 15 provides the non-identifiability result.
Definition 14 (Subset-identifiability of the causal parents). Let C denote a collection of structural causal models. The set of causal parents is said to be C-subset-identifiable if, for all pairs C_1, C_2 ∈ C that induce the same observational distribution, the parental sets pa_{C_1}(Y) and pa_{C_2}(Y) overlap whenever both are non-empty.
Proposition 15 (Non-subset-identifiability). For all A ⊆ R̄ that are either an interval or countable, the set of causal parents is not subset-identifiable in the corresponding collection of structural causal trams with response space Y = A.
A proof is given in Appendix E1.2, where we construct a joint distribution over three random variables (Y, X^1, X^2) in which both conditionals Y | X^1 and Y | X^2 are trams. This implies that there are two structural causal trams that induce identical observational distributions, while Y has two different (non-empty) sets of causal parents that do not overlap. The proof in Appendix E1.2 characterizes how to construct such a joint distribution for shift trams. For illustrative purposes, we present a concrete example in Appendix E1.2 in which Y = A = {1, 2, 3} and Y | X^1 and Y | X^2 are proportional odds logistic regression models. We sample from the induced distributions of the two structural causal trams constructed in the proof and apply the naive goodness-of-fit method described above; it indeed fails to identify a non-empty subset of the parents in this example.
Instead of subset-identifiability, one can also consider a stronger notion of full identifiability, which states that the set of causal parents can be uniquely determined by the observed distribution (formally defined in Appendix A2). Proposition 15 immediately implies that the set of causal parents is not fully identifiable either.

Transformation model invariant causal prediction
Even if the observational distribution is insufficient to identify the causal parents, identifiability can become possible if we have access to data from multiple, heterogeneous environments. Invariant causal prediction (ICP, Peters et al., 2016) exploits the invariance of causal mechanisms (Haavelmo, 1943; Frisch et al., 1948; Aldrich, 1989; Pearl, 2009; Schölkopf et al., 2012) under interventions on variables other than the response. Depending on the response variable, multi-center clinical trials, data collected from different countries, or data from different points in time may fall into this category. We then show that under Setting 1, the set of causal parents is subset-identifiable (Proposition 17) and fully identifiable if the environments are sufficiently heterogeneous (Proposition 18).
Setting 1 (Data from multiple environments). Let Y, X, F_Z ∈ Z be as in Definition 5 and let H_{Y,X} ⊆ H*_{Y,X} be a class of transformation functions such that Assumptions 1 and 2 hold. Let C* be a structural causal tram over (Y, X, E) of the form
C* : E := N_E, Y := (h*)^{−1}(Z | X^{S*}), X := g_X(Y, X, E, N_X),
with S* := pa_{C*}(Y) ⊆ [d] denoting the parents of Y and (Z, N_X, N_E) denoting the jointly independent noise variables. In this setup, the random vector E encodes the environments and takes values in E ⊆ R^q; it may be discrete or continuous. By definition, the induced graph G* is acyclic.
An exemplary DAG contained in this setup is depicted on the right. By D_n := {(y_i, x_i, e_i)}_{i=1}^{n}, we denote a sample of independent observations from P^{C*}_{(Y,X,E)}.
As for ICP, invariance plays a key role for tramicp. A subset of covariates is considered invariant if the corresponding transformation model correctly describes the conditional distribution across the environments E.
If an invariant transformation function h S according to Definition 16 exists, it is P X S -almost surely unique (see Lemma 31 in Appendix E2). Proposition 17 shows that the parental set fulfills (F Z , H Y,X )-invariance, which is sufficient to establish coverage guarantees for invariant causal prediction in trams.
Proposition 17. Assume Setting 1. Then the set of causal parents S* satisfies (F_Z, H_{Y,X})-invariance. A proof is given in Appendix E1.3.
The set of causal parents S*, together with the causal transformation function h* in Setting 1, may not be the only set satisfying (F_Z, H_{Y,X})-invariance. In this vein, we define the set of identifiable causal predictors as S_I := ⋂ {S ⊆ [d] : S satisfies (F_Z, H_{Y,X})-invariance} and have, by Proposition 17, S_I ⊆ S*. This implies that the causal parents are subset-identifiable.
If the environments E are heterogeneous enough, the set of causal parents is fully identified under Setting 1 and the faithfulness assumption (see Spirtes et al., 2000, p. 31).
Proposition 18. Assume Setting 1. Let G be the DAG induced by C* and assume that P^{C*}_{(Y,X,E)} is faithful to G. Then, under an additional heterogeneity condition on the children ch(E) of E in G, we have S_I = S*. A proof is given in Appendix E1.4.
For simple model classes such as linear Gaussian SCMs, sufficient conditions for faithfulness are known (Spirtes et al., 2000). In our setting, analyzing the faithfulness assumption is particularly challenging due to non-collapsibility and non-closure under marginalization of trams (see Appendix A1). Nonetheless, we empirically show in our simulations (see Section 4) that faithfulness is not violated, for example, if the coefficients in linear shift trams are sampled from a continuous distribution (for details see Appendix B3).

Testing for invariance
We now translate (F Z , H Y,X )-invariance into testable conditions which are applicable to general trams and thus general response types. Here, we propose an invariance condition based on score residuals (Definition 8). The following proposition shows that the score residuals are uncorrelated with the environments (in Setting 1) when conditioning on an invariant set.
Proposition 19 (Score-residual invariance). Assume Setting 1 and that (2) holds. Then, we have the following implication: if a set S ⊆ [d] satisfies (F_Z, H_{Y,X})-invariance with invariant transformation function h_S, then
E[Cov(R(h_S; Y, X^S), E | X^S)] = 0,   (4)
where E[Cov(·, · | X^S)] denotes the expected conditional covariance between the residuals and the environments.
A proof is given in Appendix E1.5. In Appendix A3, we extend tramicp (in particular, Proposition 19) to uninformatively censored observations, where Y itself is unobserved.
We now turn to the problem of testing invariance from finite data.
All proposed invariance tests are embedded in a subset-search over the set of covariates, in which we return the intersection of all non-rejected sets at a given level α ∈ (0, 1) (ICP; Algorithm 1).
Algorithm 1 Invariant causal prediction
Require: Data D_n from Setting 1, significance level α ∈ (0, 1), and a family of invariance tests (p_{S,n})_{S⊆{1,...,d}} outputting a p-value (see Algorithms 2 and 3 and the comparators in Section 4).
1: for S ⊆ {1, . . . , d} do
2: Compute p_{S,n}(D_n) ▷ Compute p-value of invariance test
3: end for
4: return Ŝ_n := ⋂_{S : p_{S,n}(D_n) > α} S ▷ Intersection over all non-rejected sets
We refer to the combination of ICP (Algorithm 1) with the proposed tram-GCM invariance test (Algorithm 2) as tramicp-GCM, with the proposed tram-Wald invariance test (Algorithm 3) as tramicp-Wald, and with a nonparametric conditional independence test (see Section 4.1.1) as nonparametric ICP.
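In code, the defining intersection of Algorithm 1 is a one-liner over the non-rejected sets; the p-values below are made up for illustration.

```python
def icp(pvalues, alpha=0.05):
    """Intersect all subsets whose invariance test was not rejected at level alpha.

    pvalues maps frozensets of covariate indices to the p-value of the
    invariance test for that subset (e.g., from a tram-GCM test).
    """
    accepted = [S for S, p in pvalues.items() if p > alpha]
    if not accepted:            # every set rejected: return the empty set
        return frozenset()
    return frozenset.intersection(*accepted)

# Hypothetical p-values for d = 2 covariates, in the spirit of Example 1:
pvals = {
    frozenset(): 0.001,
    frozenset({1}): 0.41,      # invariant
    frozenset({2}): 0.0003,
    frozenset({1, 2}): 0.23,   # invariant
}
result = icp(pvals)            # frozenset({1})
```

Returning the empty set when every candidate set is rejected keeps the coverage guarantee trivially intact, since the empty set is a subset of the parents.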
Both tramicp-GCM and tramicp-Wald, under their respective additional assumptions, enjoy the same coverage result as ICP for linear additive noise models, that is, lim_{n→∞} P(Ŝ_n ⊆ pa_{C*}(Y)) ≥ 1 − α if the tests are level α (Theorem 1 in Peters et al., 2016).

Invariance tests based on score residuals
We can test the null hypothesis of (F_Z, H^{Y,X})-invariance by testing the implication in (4), i.e., uncorrelatedness between score residuals and residualized environments, in a GCM-type invariance test (Algorithm 2). This requires that the maximum likelihood estimator exists and is unique. Theorem 20 shows that the proposed test is uniformly asymptotically level α for any α ∈ (0, 1).

Algorithm 2 tram-GCM invariance test
1: Fit the tram of Y given X_S and compute score residuals R̂_i
2: Obtain μ̂ using data D_n
3: Compute residual product terms L_i := R̂_i (E_i − μ̂(X^S_i)), i = 1, . . . , n
4: Compute the residual covariance
5: Compute the test statistic T_n and the corresponding p-value

Theorem 20 (Uniform asymptotic level of the invariance test in Algorithm 2). Assume Setting 1 together with Assumption 3 for a fixed S ⊆ {1, . . . , d}. For all P in P, we denote by h_P the h_S ∈ H^{Y,X_S} in the definition of (F_Z, H^{Y,X})-invariance. Further, we require rate conditions on the estimators of μ and h_P. Then T_n converges to a standard q-variate normal distribution uniformly over P. As a consequence, lim_{n→∞} sup_{P∈P} P(p_{S,n}(D_n) ≤ α) ≤ α, where p_{S,n}(D_n) is the p-value computed by Algorithm 2.
A proof is given in Appendix E1.6. Conditions (a)-(c) are mild regularity conditions on the distributions of (Y, X S , E). Of the remaining conditions it is usually (iii) that is the strictest. In the case of a parametric linear shift tram, we would expect W = O P (n −1 ) and therefore would only need the regression of E on X S to be consistent. However, the tram-GCM invariance test can still be correctly calibrated even if the score residuals are learned at a slower-than-parametric rate.
Slower rates occur, for instance, in mixed-effects (Tamási and Hothorn, 2021), penalized linear shift (Kook and Hothorn, 2021), or conditional trams (Hothorn et al., 2014). The robustness property of the tram-GCM invariance test does not necessarily hold without residualization of the environments (Chernozhukov et al., 2017; Shah and Peters, 2020). We illustrate empirically in Appendix B4 how the naive correlation test may fail to hold level for shift and penalized linear shift trams.
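The statistic in Algorithm 2 can be sketched as follows for a univariate environment. This is an illustrative Python sketch of the generic GCM construction (Shah and Peters, 2020); in tram-GCM, `resid_y` would be the score residuals of the fitted tram and `resid_e` the residuals from regressing E on X_S:

```python
import numpy as np
from scipy.stats import norm

def gcm_p_value(resid_y, resid_e):
    """GCM-type test of zero expected conditional covariance, based on
    the normalized mean of residual products (univariate case)."""
    L = np.asarray(resid_y) * np.asarray(resid_e)  # residual products
    T = np.sqrt(len(L)) * L.mean() / L.std()       # asymptotically N(0, 1) under H0
    return 2 * norm.sf(abs(T))                     # two-sided p-value

rng = np.random.default_rng(1)
e = rng.binomial(1, 0.5, 1000)
r_indep = rng.normal(size=1000)        # invariant set: residuals independent of E
p = gcm_p_value(r_indep, e - e.mean())
```

Normalizing by the empirical standard deviation of the products is what makes the test robust to slow (slower-than-parametric) estimation of either regression, as long as the product of the two estimation errors vanishes fast enough.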

Invariance tests based on the Wald statistic
For linear shift trams, (F Z , H linear Y,X )-invariance implies the absence of main and interaction effects involving the environments if we include the environment variable as a covariate into the model (see Proposition 21 below); this can be tested for using a Wald test for both continuous and discrete responses (Algorithm 3).
We now introduce notation for including main and interaction effects involving the environments.
Let ⊗ denote the Kronecker product. For all S ⊆ {1, . . . , d}, we extend the linear shift tram of Y given X_S by main and interaction effects of the environments E with coefficient vector γ_S, and consider the hypothesis test H_0(S) : γ_S = 0.
Proposition 21. Let S ⊆ {1, . . . , d} be given and suppose that the canonical conditional CDF of Y given X_S and E is an element of the extended linear shift class. Then, (F_Z, H^{linear}_{Y,X})-invariance of S implies γ_S = 0. A proof is given in Appendix E1.7.
The Wald test uses the quadratic test statistic n (γ̂_S)^⊤ Σ̂^+ γ̂_S, which converges in distribution to a χ²-distribution with rank(Σ̂) degrees of freedom (under further regularity conditions); Algorithm 3 computes the variance-covariance matrix Σ̂ and its rank and returns the corresponding p-value. For j ∈ {1, . . . , d}, the predictor-specific p-value p̂_j (the maximum p-value over all tested sets not containing X_j) is a valid p-value for the null hypothesis H_0(j) : X_j ∉ pa(Y) (assuming that the true parents satisfy (F_Z, H^{Y,X})-invariance). We then refer to X_j with p̂_j ≤ α, j ∈ {1, . . . , d}, as plausible causal predictors.
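The quadratic-form construction can be sketched as follows. This is an illustrative Python sketch, not Algorithm 3 itself; `gamma_hat` and `Sigma_hat` are hypothetical estimates of the environment coefficients and of the asymptotic covariance of √n·γ̂_S:

```python
import numpy as np
from scipy.stats import chi2

def wald_p_value(gamma_hat, Sigma_hat, n):
    """Wald test of H0: gamma_S = 0 via the quadratic statistic
    n * gamma' Sigma^+ gamma, chi-square with rank(Sigma) dof."""
    stat = float(n * gamma_hat @ np.linalg.pinv(Sigma_hat) @ gamma_hat)
    dof = int(np.linalg.matrix_rank(Sigma_hat))
    return chi2.sf(stat, df=dof)

p_null = wald_p_value(np.array([0.01, -0.02]), np.eye(2), n=100)  # small effects
p_alt = wald_p_value(np.array([1.0, 0.0]), np.eye(2), n=100)      # large main effect
```

The pseudo-inverse and the rank-based degrees of freedom accommodate rank-deficient covariance estimates, which arise, e.g., when some environment interactions are not estimable.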

Practical aspects
Reducing computational complexity The computational complexity of ICP scales exponentially in the number of predictors due to the need to test all possible subsets. Peters et al. (2016) proposed to reduce this high computational burden by pre-screening the predictors using a variable selection algorithm. Pre-screening of covariates in trams can in principle be done via L1-penalized likelihood estimation (Kook and Hothorn, 2021) or nonparametric feature selection methods that can handle discrete and censored responses. Given a variable selection procedure with output S^VS_n which guarantees lim_{n→∞} P(S^VS_n ⊇ pa_{C*}(Y)) ≥ 1 − α at level α ∈ (0, 1), we can run tramicp with the potentially reduced set of covariates S^VS_n at level α and maintain the coverage guarantee of ICP at level 2α (Peters et al., 2016, Section 3.4).
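The 2α guarantee follows from a union bound over the two error events, screening failing to retain all parents and ICP's own error; in sketch form:

```latex
\begin{align*}
P\bigl(\hat S_n \not\subseteq \mathrm{pa}_{C^*}(Y)\bigr)
  &\le P\bigl(\mathrm{pa}_{C^*}(Y) \not\subseteq S^{\mathrm{VS}}_n\bigr)
   + P\bigl(\hat S_n \not\subseteq \mathrm{pa}_{C^*}(Y),\;
            \mathrm{pa}_{C^*}(Y) \subseteq S^{\mathrm{VS}}_n\bigr) \\
  &\le \alpha + \alpha \;=\; 2\alpha,
\end{align*}
```

since on the event that screening retains all parents, ICP run on S^VS_n keeps its level-α coverage.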
While pre-screening simplifies the application of tramicp in practice, it is difficult to ensure that the screening property holds. Although L 1 -penalized maximum likelihood is fast and asymptotically guaranteed to return the Markov boundary for linear additive Gaussian noise models (Meinshausen and Bühlmann, 2006), this no longer holds true for general linear additive noise models (Nandy et al., 2017, Example 1 in the supplement) or (linear shift) trams, since the parametric regression model of the response on all covariates can be misspecified if a child is included in the conditioning set.
Unmeasured confounding In Setting 1, we assume that all confounding variables between covariates and response and all parents of the response have been measured. This assumption can be weakened by instead assuming that there exists a subset of ancestors A ⊆ an(Y) that satisfies (F_Z, H^{Y,X})-invariance; in this case, the output of tramicp is still a subset of an(Y) with high probability.

Implementation: R package tramicp
With the R package tramicp, we provide a user-friendly implementation of the proposed methods, which we briefly outline in this section. For every model implementation listed in Table 2 in Section 4.2, there is a corresponding alias in tramicp which appends ICP to the model name (e.g., glmICP in Example 1).
As an example, in the below code snippet, we apply tramicp-GCM to data generated from a structural causal tram with DAG G (shown below) and a "Cotram" (cf. count tram, Siegfried and Hothorn, 2020) model for the count-valued response with F Z = F SL (see Table 2 and Example 3).
The corresponding output is shown below. tramicp correctly returns {1, 2} as the causal parents of the response. The reported p-value for a predictor of interest is computed as the maximum p-value over all tested sets not containing the predictor of interest (in case all sets are rejected, the p-value is set to 1, see Section 3.2). An illustration of the tram-GCM invariance test can be found in Figure 1. As described in Appendix D, further model implementations (e.g., shift trams from package tramME) can be integrated.

If prior knowledge of the form S m ⊊ S * is available, only super-sets of S m need to be tested. Thus, S m can be interpreted as a mandatory part of the conditioning set. In tramicp, such mandatory predictors can be supplied to the mandatory argument as a formula. In our example above, we could, for instance, specify mandatory = ∼ X1 in the call to cotramICP(). In Section 5, we illustrate how such prior knowledge can be used to reduce computation time by testing fewer sets. If non-parents are falsely included as mandatory, the output of tramicp may be misleading. Also, the robustness guarantees discussed in Section 3.2 'Unmeasured confounding' can break down.

Simulation Experiments
We now evaluate the proposed algorithms on simulated data based on randomly chosen graphs with restrictions on the number of possible descendants and children of the response, as well as on how the binary environment indicator affects non-response nodes. In the simulations, the conditional distribution of Y | pa(Y) is correctly specified by a tram; all other structural equations are linear with additive Gaussian noise.

Nonparametric ICP via conditional independence testing
Throughout, we report the oracle version of nonparametric ICP. Under Setting 1, let G denote the DAG implied by C*. Assuming the Markov property (see Spirtes et al., 2000, p. 29) and faithfulness of P_{C*}(Y, X, E) w.r.t. G, the oracle is defined as the intersection of all sets S for which Y is conditionally independent of E given X_S (Mogensen et al., 2022, Proposition 4.1); here, pa(·), ch(·), and an(·) denote the parents, children, and ancestors of a node in G, respectively.
A general-purpose approach to causal feature selection with access to data from heterogeneous environments is nonparametric conditional independence testing (Zhang et al., 2011; Strobl et al., 2019), i.e., Algorithm 1 with p_{S,n} being a conditional independence test of the hypothesis E ⊥⊥ Y | X_S.

Simulation setup
We outline the simulation setup for our experiments and give details in Appendix B1, which includes the explicit parameterizations of the transformation function for all models, details on the data generating process and simulation scenarios. We compare tramicp with tram-GCM and tram-Wald invariance tests against nonparametric ICP and oracle ICP.

Models and parameterization
In Section 4, we consider the models discussed in the examples in Section 1.3, namely, the binary GLM ("Binary"), a discrete-odds count transformation model ("Cotram"), and a parametric survival model ("Weibull"). We also show results for the normal linear model and other transformation models for ordered and continuous outcomes (see Tables 1 and 2) in Appendix B2.1. For our numerical experiments, we parameterize the transformation function h in terms of basis expansions depending on the type of response, h_Y(·) := a(·)^⊤ θ, with basis a : 𝒴 → R^b and appropriate constraints on θ ∈ Θ ⊆ R^b. Table 1 contains a summary of the bases used for continuous, bounded continuous, count, and binary/ordered responses.
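For the continuous case, a typical choice is a Bernstein polynomial basis. The sketch below (illustrative Python with hypothetical argument names, not the parameterization used by the tram package) shows such a basis, for which a(·)^⊤ θ is monotone whenever θ is non-decreasing:

```python
import numpy as np
from math import comb

def bernstein_basis(y, order, lo, hi):
    """Bernstein basis a(y) of the given order on [lo, hi]; each row
    evaluates all order+1 basis polynomials at one point."""
    t = (np.asarray(y, dtype=float) - lo) / (hi - lo)
    return np.column_stack(
        [comb(order, k) * t**k * (1 - t)**(order - k) for k in range(order + 1)]
    )

A = bernstein_basis(np.linspace(0, 1, 5), order=6, lo=0.0, hi=1.0)
h = A @ np.arange(7.0)  # non-decreasing theta -> monotone transformation
```

The basis polynomials form a partition of unity, so monotonicity of the coefficient vector translates directly into monotonicity of the transformation function, which is the constraint required of θ ∈ Θ.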

Table 1: trams and their corresponding bases.
Data-generating process In each iteration, the data are drawn from a random DAG with a pre-specified number of potential ancestors and descendants of the response. In the DAG, the response given its causal parents is correctly specified by one of the linear shift trams in Table 2.
All other structural equations are linear with additive noise. The environments are encoded by a single Bernoulli-distributed random variable. Algorithm B1 details the DGP. We generated data for 100 random DAGs and 20 repetitions per DAG. Sample sizes 100, 300, 1000, and 3000 are considered. The DAGs were generated with at most 3 ancestors and 2 descendants of the response (excluding the environment).

Results
Figure 2 summarizes the simulation results for the scenario described in Section 4.2. All ICP procedures are level at the nominal 5% for all considered models. Like ICP in linear additive noise models, tramicp is conservative in terms of FWER. For all models and invariance tests, the power (measured in terms of Jaccard similarity to the parents) increases with sample size. For all models, tramicp-Wald has more power than tramicp-GCM, which in turn slightly outperforms nonparametric ICP for the Cotram model. The ROC invariance test is only applied to binary responses and has comparable power to tram-GCM and GCM, but lower power than the Wald test.

Results under misspecification
We perform simulations for additional models, including some that lead to model misspecification, in Appendix B2.2 and briefly outline the results here. Hidden variables In the presence of hidden variables, tramicp can still output a subset of the ancestors, for example, if there is a set of ancestors that is invariant (see Section 3.2). Suppose we generate Y as a tram and omit one of its observed parents. If this parent is directly influenced by E, then we do not expect any set of ancestors to be invariant. More precisely, there is no set of observed covariates such that E is d-separated from Y given this set, which, assuming faithfulness, implies that there is no invariant set. When a test has power equal to one to reject non-invariant sets, the output of ICP is the empty set (and thus still a subset of the ancestors of Y, in fact, even of the parents of Y). We add a corresponding experiment in Appendix B2.2, where we show results for both tramicp-GCM and tramicp-Wald. Both tests appear anti-conservative at finite sample sizes, but the violation shrinks with growing sample size, indicating a finite-sample effect.
Error distribution Choosing an error distribution F_Z different from the one generating the data is another way to misspecify the model. In Appendix B2.3, we see in simulations that tramicp shows some robustness against misspecification of the error distribution. In all our simulations, misspecifying F_Z leads to reduced power. When the data were generated with F_Z = F_minEV but modelled using the misspecified F_Z = F_SL, and vice versa, there is some evidence that tram-Wald is anti-conservative, while tramicp-GCM still appears to hold level.

Case study: Causal drivers of survival in critically ill adults
We apply tramicp to the SUPPORT2 dataset (Knaus et al., 1995) with time-to-death in a population of critically ill hospitalized adults as the response variable. SUPPORT2 contains data from 9105 patients, of whom 68.1% died after a maximum follow-up of 5.55 years; the remaining 31.9% of observations were right-censored due to loss of follow-up. We consider the following predictors measured at baseline (determined at most three days after hospital admission): sex (male/female), race (white, black, asian, hispanic, other), number of comorbidities (0-9; num.co), coma score (0-100; scoma), cancer (no cancer, cancer, metastatic cancer; ca), age (years), diabetes (yes/no), dementia (yes/no), and disease group (nine groups, including colon and lung cancer; dzgroup). For our analysis, we treat num.co (0, 1, . . . , 5, 6 or more) and scoma (11 levels) as factors, square-root transform age, and omit 43 patients with missing values in any of the predictors listed above. We apply tramicp using both tram-GCM and tram-Wald. For tram-Wald, we only test for the presence of main effects of the environments (without additional first-order interaction effects) due to non-convergence when fitting the models with interaction effects.

Figure 3: Set-specific p-values of the tram-GCM and tram-Wald invariance tests in the SUPPORT2 case study described in Section 5. Sets containing both ca and age are depicted as circles, whereas sets containing neither are depicted as crosses. The blue and red shaded regions mark the rejection regions for tram-GCM and tram-Wald, respectively, and the dashed line is the identity function. Left: p-values on a linear scale. All invariant sets contain cancer and age, which is therefore the output of both tramicp-GCM and tramicp-Wald (see Table 3). Right: p-values on the log10 scale. tram-GCM is more conservative than tram-Wald, as all p-values fall below the identity (dashed line).

Results
The set of all predictors is not invariant In the model including all predictors the standard Wald test rejects the null hypothesis of no effect for all predictors except race. A Wald test for the main effect of num.co yields a p-value < 0.0001. This provides strong evidence that the purely predictive model using all predictors is not invariant across num.co and thus uses a set of features that is different from the set of causal parents.
Evidence of age and cancer being direct causes of time-to-death We now apply tramicp-GCM and tramicp-Wald to the SUPPORT2 dataset specifying the survival time as the response in a Cox proportional hazard model, using num.co as the environment and including all other predictors. Both algorithms output ca and age as plausible causal predictors (i.e., the intersection of all sets for which the invariance test was not rejected equals {ca, age}). This can be seen in Figure 3, where all non-rejected sets include both ca and age. The predictor-specific p-values (see Section 3.2) are given in Table 3 ('Evidence of age and cancer being direct causes of time-to-death').
Multiple environments tramicp allows several environments to be specified. When using both race and num.co as environments in the SUPPORT2 dataset, tramicp-GCM outputs ca, age, and diabetes as plausible causal predictors. tramicp-Wald additionally outputs dementia and sex. Sensitivity analysis: Informative censoring When applying tramicp in a survival context, the assumption of uninformative censoring plays an important role. In their original analysis of the SUPPORT dataset, Knaus et al. (1995) assumed that the censoring is uninformative. As a sensitivity analysis, we can introduce (possibly additional) informative censoring by treating the observed event times of a fraction of all patients who experienced the event as right-censored. For 10% (20%, . . . , 90%) of randomly chosen patients with exact event times, we repeat the analysis ten times. For low fractions of additionally informatively censored observations (10-20%), the output of tramicp-Wald remains stable (ca, age). For larger fractions (30-80%), age is contained in the output less often. For the largest considered fraction (90%), tramicp-Wald mostly outputs the empty set (see Table C1 in Appendix C).

Discussion
In this paper, we generalize invariant causal prediction to transformation models, which encompass many classical regression models and different types of responses, including categorical and discrete variables. We show that, despite most of these models being neither closed under marginalization nor collapsible (Appendix A1), tramicp retains the same theoretical guarantees in terms of identifying a subset of the causal parents of a response with high probability. We generalize the notion of invariance to discrete and categorical responses by considering score residuals, which are uncorrelated with the environment under the null hypothesis. Since score residuals remain well-defined for categorical responses, our proposal is one way to phrase invariance in classification settings. We demonstrate empirically that tramicp-GCM and tramicp-Wald hold level at the nominal α, but also have non-trivial power against the alternatives considered in the simulation setup.
The tram-Wald invariance test hinges critically on correct model specification, despite its high power in the simulation setups shown in Figure 2.

A1 Collapsibility and closure under marginalization
Linear shift trams, like the general additive noise model, are in general not closed under marginalization. That is, when omitting variables from a correctly specified tram, the reduced model is generally no longer a tram of the same family and linear shift structure. This is closely related to non-collapsibility, which has been studied for contingency tables (Whittemore, 1978), logistic-linear models (Guo and Geng, 1995), and rate/hazard ratio models (Greenland, 1996; Martinussen and Vansteelandt, 2013). In non-collapsible models, the marginal effect measure (obtained from regressing Y on X) is not the same as the conditional effect measure (obtained from the regression additionally adjusting for V), even if X and V are independent. For instance, in binary logistic regression, the conditional odds ratio given a binary exposure across a conditioning variable with two strata is in general not the same as the marginal odds ratio and may even flip sign (which is related to Simpson's paradox, Hernán et al., 2011). Even if the marginal model is correctly specified, the conditional model may not be correctly specified, and vice versa, as has been illustrated in binary GLMs (Robinson and Jewell, 1991) and the Cox proportional hazards model (Ford et al., 1995). Collapsibility depends on the chosen link function: in GLMs, the identity and log links give rise to collapsible effect measures (differences in expectation and log rate-ratios, respectively).
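Non-collapsibility of the odds ratio can be verified directly (an illustrative Python computation with hypothetical coefficients): with logit P(Y = 1 | X, V) = bX + cV and X ⊥⊥ V, the conditional odds ratio for X is exp(b) in every stratum of V, yet the marginal odds ratio is attenuated.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

b, c = 1.0, 2.0  # hypothetical conditional log-odds ratios for X and V

def p_marginal(x):
    """P(Y = 1 | X = x), marginalizing over V ~ Bernoulli(0.5), V independent of X."""
    return 0.5 * expit(b * x) + 0.5 * expit(b * x + c)

odds = lambda p: p / (1.0 - p)
marginal_or = odds(p_marginal(1)) / odds(p_marginal(0))
conditional_or = np.exp(b)  # identical in both strata of V
```

Here the marginal odds ratio is strictly smaller than exp(b), even though V is not a confounder; this is the attenuation effect described above.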
It is possible to connect collapsibility with conditional independence statements. For instance, the conditional odds ratio of a covariate X for a binary outcome Y is collapsible over another (discrete) covariate under suitable conditional independence conditions (Greenland et al., 1999). Recently, collapsibility has also been discussed in the context of causal parameters (Didelez and Stensrud, 2022).

A2 Full identifiability of the causal parents
Definition 22 (Full identifiability of the causal parents). Let C denote a class of structural causal models. Then, the set of causal parents is said to be C-fully identifiable if, for all pairs C_1, C_2 ∈ C that induce the same observational distribution, it holds that pa_{C_1}(Y) = pa_{C_2}(Y).
Setting 2 (Censoring). Assume Setting 1 in which Y is unobserved. Instead, we have access to additional variables L, U ∈ 𝒴 with joint density p_{(L,U)} w.r.t. µ ⊗ µ (see below Assumption 1 for the definition of µ), which fulfill L < U with probability 1. Further, we assume that for all S ⊆ {1, . . . , d}, (L, U) ⊥⊥ Y | X_S, i.e., censoring is uninformative (Kalbfleisch and Prentice, 2011). In addition to L and U, we observe the indicator random variables δ_L and δ_I, which encode the type of censoring. Under Setting 2, the likelihood contribution of an observation follows from considering the four cases of values for (δ_L, δ_I); the log-likelihood is given by the corresponding log conditional density (and is undefined for the case δ_L = δ_I = 1). The definition of (F_Z, H^{Y,X})-invariance and Proposition 17 still hold when considering the unobserved Y in Setting 2. However, when using tramicp, since Y is unobserved, we cannot test statements about Y directly; the censoring needs to be taken into account. For (F_Z, 𝒴, 𝒳, H^{Y,X}) satisfying Assumptions 1 and 2, uninformatively censored score residuals R̃ can be defined as functions of (l, u, x). The following proposition extends invariance based on score residuals (Proposition 19) to uninformatively censored responses.
Proposition 23 (Score-residual-invariance for uninformatively censored responses). Assume Setting 2. Then, the implication of Proposition 19 continues to hold with the score residuals replaced by their uninformatively censored counterparts. A proof is given in Appendix E1.8. Establishing theoretical results on invariance in the case of informative censoring is left for future work.

B1.1 Data generating process
Algorithm B1 summarizes the data generating process. First, a sample of a random DAG of d_1 potential ancestors of the response is generated and perturbed with environment effects B^1_E. Afterwards, the response is sampled from the given tram with baseline and shift coefficients θ and β, respectively (cf. Table 2). Lastly, the DAG of d_2 potential descendants is generated and likewise perturbed with environment effects B^2_E. Based on the adjacency matrix of the resulting full DAG, we can compute the oracle ICP output (cf. Section 4.1.1). We generate random DAGs using pcalg::randomDAG() and sample from them with pcalg::rmvDAG(). We simulate from trams using tram::simulate().
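The two pcalg routines correspond, roughly, to the following illustrative Python sketch, assuming a linear additive Gaussian SCM with an upper-triangular adjacency matrix (this is not the R code used in the experiments):

```python
import numpy as np

def random_dag(d, p, rng):
    """Random DAG as an upper-triangular weighted adjacency matrix:
    nodes are in topological order, edges kept with probability p."""
    W = np.triu(rng.uniform(0.5, 1.5, size=(d, d)), k=1)
    return W * (rng.random((d, d)) < p)

def sample_from_dag(W, n, rng):
    """Draw n i.i.d. samples from the linear additive Gaussian SCM
    X_j = sum_i W[i, j] X_i + N_j, following the topological order."""
    d = W.shape[0]
    X = np.zeros((n, d))
    for j in range(d):
        X[:, j] = X @ W[:, j] + rng.normal(size=n)
    return X

rng = np.random.default_rng(42)
W = random_dag(4, p=0.8, rng=rng)
X = sample_from_dag(W, n=500, rng=rng)
```

Because the adjacency matrix is strictly upper triangular, the column order is already a topological order, so each node can be sampled in a single pass given its parents.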

B1.2 Simulation scenario
We generate data from all models listed in Table 2, including the ones not discussed in Section 4.3; we run tramicp-GCM, tramicp-Wald, and nonparametric ICP for 100 random DAGs and 20 samples per DAG. We considered sample sizes 100, 300, 1000, 3000, and 10000. The random DAGs contain at most 3 ancestors and 2 descendants of the response. The probability of two vertices being connected is fixed at 0.8, which also applies to edges between the environment and covariates. Bernstein polynomial bases are fixed to be of order 6. Likewise, for the Polr model, which features an ordinal response, the number of classes is fixed at 6. Coefficients for the baseline transformation are fixed at

Algorithm B1 Data generation from random DAGs
Require: Number of ancestors d_1 and descendants d_2; a tram with parameters θ ∈ R^K, β ∈ R^{d_1} with pa(Y) = supp β; sample size n; edge probabilities p_A (ancestral graph) and p_D (descendental graph); coefficient matrices B^1_E (environment on ancestors), B^2_E (environment on descendants), B_Y (response on its children), and B^1 (ancestors on descendants); a routine randomDAG(d, p) which outputs a random DAG with d nodes with connection probability p; a routine rmvDAG(DAG, n) which draws an i.i.d. sample from the distribution implied by DAG; a routine simulate(tram, newdata) which draws i.i.d. observations from the conditional distribution implied by tram given newdata.
Figure B1 contains simulation results for the simulation scenario described in Appendix B1.2 for all models from Table 2 not included in the main text, i.e., BoxCox, Colr, Coxph, Lm, and Polr.

B2.1 Other models and response types
The conclusions drawn in the main text remain the same: ICP holds level at the nominal 5% for all considered invariance tests. The tram-Wald invariance test exhibits the highest power (measured in terms of Jaccard similarity to the true parents), followed by the tram-GCM invariance test and the nonparametric conditional independence test (GCM), which perform on par. Figure B2 shows the results of the simulation described in Appendix B1.2 with two modifications.

B2.2 Hidden parents
First, the edge probabilities for E → X and X → Y are fixed to one; thus X_1, X_2, X_3 are fixed as parents of Y in all graphs. Second, X_1 is omitted (treated as unobserved) when running all ICP algorithms. As stated in Section 4.3, this situation is adversarial in that there is no invariant ancestral set as described in Section 3.2 (because the path E → X_1 → Y is always unblocked).

Figure B1: Results for the experiment described in Appendix B2.1. We use the simulation scenario described in Appendix B1.2 and show the models that were not included in the main text.
There is evidence that, for the considered sample sizes, tramicp-GCM is not level at the nominal 5%. For smaller sample sizes, tramicp-Wald is also anti-conservative.
However, for large sample sizes the empirical probability of obtaining a set that is not a subset of the parents of Y decreases for all models. In addition, Figure B3 shows increased power for all invariance tests when the coefficient for X_1 → Y is multiplied by 2.5 (for the same DAGs used in Figure B2).

Figure B3: Results for the experiment described in Appendix B2.2. We use the same simulation scenario as in Appendix B1.2 with X_1 as a fixed, but treated-as-hidden parent and with coefficients for X_1 → Y multiplied by 2.5. The invariance tests show larger power to reject non-invariant sets.

In order for the trams to be well-behaved, the data D_1 in Algorithm B1 are standardized, so that each component has empirical variance 1/25.

B3 Faithfulness in linear shift transformation models
Studying faithfulness violations theoretically in trams is again hindered by the lack of collapsibility and marginalizability (discussed in Appendix A1). We now give an example illustrating that there are structural causal trams that induce distributions that are non-faithful w.r.t. the induced graph. Consider the following structural causal tram, where g := F^{-1}_{χ²(3)} ∘ Φ denotes the inverse baseline transformation and N_E ∼ Bernoulli(0.5), N_1 ∼ N(0, 1), N_2 ∼ N(0, 1), and Z ∼ N(0, 1) are jointly independent noise terms. The population version of tramicp returns {1, 2} as an invariant set, since E ⊥⊥ Y | X_1, X_2. We let β vary from 0.2 to 0.8; for β = 0.5, a cancellation occurs. Due to this cancellation, E ⊥⊥ Y, i.e., the empty set is invariant, too, and the output of a population version of tramicp equals the empty set. For each of the above values of β, we simulate a single sample from the implied joint distribution with sample size n = 10^4 and apply BoxCoxICP with the tram-GCM invariance test. Table B1 shows set-specific p-values and the output of tramicp. Such cancellations, however, seem to be avoided when sampling the coefficients from a continuous distribution, as done in our experiments in Section 4.3.

Figure B4: Results for the experiment described in Appendix B2.3. We use the same simulation scenario as in Appendix B1.2 but with misspecified F_Z. The data are generated according to the models in Table 2, but the error distribution used when running ICP is different from the one generating the data.

Table B1: Results for the experiment described in Appendix B3. Set-specific p-values and the output of tramicp are shown for different choices of the coefficient β of X_2 to illustrate faithfulness violations in linear shift trams. The p-values stem from a single run of tramicp on data generated according to the structural causal tram in the main text with a sample size of n = 10^4. The setting β = 0.5 implies E ⊥⊥ Y and thus corresponds to a faithfulness violation. The test indeed does not reject this hypothesis and, consequently, tramicp outputs the empty set. The seed for data generation is fixed to the same value for each β, such that the results are directly comparable across rows.

B4 Wider applicability of the TRAM-GCM invariance test
The tram-Wald invariance test requires regularity conditions and is restricted to linear shift trams.

Example 24 (tram-Wald may be anti-conservative). Consider a structural causal shift tram with the following assignments: E ∼ Bernoulli(0.5), where N_1, N_2, Z, N_3 are i.i.d. standard normally distributed and f_1 : x ↦ 0.5x + sin(πx) exp(−x²/2). There is strong evidence that the tram-Wald invariance test is not level at the nominal 5% (the binomial test described above yields a p-value less than 0.0001), and the correlation test seems overly conservative (the binomial test described above yields a p-value of 0.99).
Example 25 (The Pearson correlation invariance test may be anti-conservative). Consider a structural causal linear shift tram over (Y, X_1, X_2, . . . , X_12, E) with the same assignments as in Example 24, using f_1 : x ↦ 0.5x and additional covariates X_4, . . . , X_12 drawn i.i.d. from a standard normal distribution. We fit penalized linear shift trams with linear transformation function and an ℓ2-penalty using λ = 10. In Figure B6, we again display the ECDF of p-values obtained from the tram-GCM invariance test, the tram-Wald invariance test, and a naive correlation test between score residuals and environments for the set {1, 2, 4, 5, . . . , 12}. There is no evidence that the tram-GCM invariance test is not level (the binomial test described above yields a p-value equal to 0.78). However, there is strong evidence that the tram-Wald invariance test and the correlation test are not level at the nominal 5% (in both cases the binomial test described above yields a p-value less than 0.0001).

C Case study: Informative censoring
To study the sensitivity of tramicp to the assumption of uninformative censoring, we treat the observed event times of a randomly chosen fraction of patients as right-censored and apply tramicp-Wald to the resulting dataset. The sampling is repeated 10 times per fraction. Table C1 shows how often tramicp-Wald outputs the empty set, ca, or ca + age over the ten repetitions. tramicp-Wald is relatively robust for fractions up to 40%. For larger amounts of informative censoring, age is contained in the output less often. At fraction 0.9, the output is empty for most of the repetitions.

D R package tramicp
The tramicp package implements ICP for the models and invariance tests listed in Tables D1 and D2, respectively. tramicp is implemented in dicp(), which takes the argument modFUN; this can be one of the model implementations listed in Table D1, or any other model which possesses an S3 method corresponding to stats::residuals() (for type = "residual" invariance) or multcomp::glht() (for type = "wald" invariance) (below we give an example of how to use shift trams implemented in tramME). Package tramicp implements aliases for dicp(..., modFUN = <implementation>), which are listed in Table D1 and use some convenient defaults for arguments supplied to modFUN. For instance, MASS::polr() has no corresponding residuals() implementation, which has been added with tramicp. Implemented tests can be supplied as character strings (Table D2). For type = "residual", custom tests can be supplied as function(r, e, controls) {...}. Importantly, these tests should output a list containing an entry "p.value". Additional arguments can be supplied via the controls argument, which can be obtained and altered via dicp_controls(). Some tests (for instance, coin::independence_test()) can handle multivariable environments, i.e., formulae like env = ∼ E1 + E2. HSIC is a kernel-based independence measure which is zero if and only if the two arguments are independent (when using characteristic kernels); here, we use dHSIC (Pfister et al., 2018). By default, a Gaussian kernel for R and a discrete (for discrete environments) or Gaussian (for continuous environments) kernel for E is used (for details, see Pfister et al., 2018). Nonetheless, models implemented in tramME (such as BoxCoxME()) can still be used in tramicp.

Proof. Recall the definition of µ given beneath Assumption 1.
For the expected score residual, we have, for P_X-almost all x, where, for all j ∈ {1, 2}, (N^j_{X_1}, N^j_{X_2}, Z^j) are independent, N^1_{X_1} and N^2_{X_2} have density f_X with respect to ν, Z^1 and Z^2 have distribution function F_Z, N^1_{X_2} and N^2_{X_1} are uniformly distributed on (0, 1), and g is the generalized inverse of the conditional CDF corresponding to f_{X|Y}.
Explicit example for the construction in the proof of Proposition 15. For illustrative purposes, we now give an example with Y = A = {1, 2, 3}, in which we construct a structural causal tram over three (ordered) categorical random variables (Y, X_1, X_2) such that the two pairwise marginal conditionals Y | X_1 and Y | X_2 are proportional odds logistic regression models; we sample from this model and illustrate the non-identifiability empirically.
Now, let f_{Y|X_1} and f_{Y|X_2} be conditional densities corresponding to the proportional odds model F_Z ∘ ȟ. We then construct the joint distribution by fixing X_1 (and X_2) to have a uniform distribution, that is, f_{X_1} : j ↦ 1/3, and follow the argument in the proof above.
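To make the construction concrete, one can sample from such a proportional odds model as follows. This is a sketch with hypothetical cutpoints and shift coefficient, not the parameter values underlying Figure E1:

```python
import numpy as np

def sample_polr(x, cutpoints, beta, rng):
    """Draw an ordinal Y from {1, 2, 3} under a proportional odds
    logistic model: P(Y <= k | X = x) = expit(theta_k - beta * x)."""
    expit = lambda z: 1.0 / (1.0 + np.exp(-z))
    cdf = expit(np.asarray(cutpoints) - beta * x)         # P(Y<=1), P(Y<=2)
    probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))  # cell probabilities
    return rng.choice([1, 2, 3], p=probs)

rng = np.random.default_rng(0)
x1 = rng.integers(1, 4, size=5000)   # X_1 uniform on {1, 2, 3}
# hypothetical parameters: cutpoints (theta_1, theta_2) and shift beta
y = np.array([sample_polr(v, (-0.5, 1.0), 0.8, rng) for v in x1])
```

Repeating this for X_2 with its own hypothetical coefficients yields data from the joint construction described above.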
We now sample from the joint distribution above and perform goodness-of-fit tests to illustrate that feature selection based on goodness-of-fit tests yields the empty set in the case of non-identifiability. In Figure E1, we show the relative bias of a parameter estimate θ̂ with ground truth ϑ, i.e., θ̂/ϑ − 1, when estimating the parameters of the two proportional odds logistic regressions of Y | X_1 and Y | X_2 for sample size n = 10^4 over 10^3 repetitions. For all j ∈ {1, 2}, we refer to the parameters for Y | X_j as ϑ_{j1}, ϑ_{j2}, β_{j2}, and β_{j3}. In addition, we perform a Pulkstenis–Robinson goodness-of-fit χ²-test (Pulkstenis and Robinson, 2004), implemented in the R package generalhoslem (Jay, 2019), for both models at a sample size of 10^4 and 100 repetitions. For Y | X_1 and Y | X_2, the test rejects 4 times and 5 times at the 5% level, respectively. Thus, the intersection over accepted sets is empty in most cases.

E1.3 Proof of Proposition 17
Proof. Under Setting 1, we have Z ⊥⊥ (E, X_{S^*}), which implies, by weak union and contraction, that Z ⊥⊥ E | X_{S^*}. The desired result now follows, since the conditional CDF of Y given X_{S^*} = x_{S^*} is F_Z(h^*(· | x_{S^*})) by construction.

E1.4 Proof of Proposition 18
Proof. Let S := {S ⊆ [d] | S ⊉ S^*}. We now show that, for all S ∈ S, we have Y ̸⊥⊥_G E | X_S, where ⊥⊥_G denotes d-separation in G. Let S ∈ S. By definition of S, there exists j ∈ [d] such that j ∈ S^* and j ∉ S. Because S^* ⊆ ch(E) and S^* = pa(Y) by definition, there exists a directed path E → X_j → Y in G. Since j ∉ S, we therefore have that Y ̸⊥⊥_G E | X_S. We conclude, by the faithfulness assumption, that Y ̸⊥⊥ E | X_S and thus that S is not (F_Z, H_{Y,X})-invariant.
Thus, every (F_Z, H_{Y,X})-invariant set S satisfies S ⊇ S^*. Since Proposition 17 yields that S^* itself is invariant, we obtain S_I ⊆ S^*, and hence the result follows.

E1.5 Proof of Proposition 19
Proof. By Lemma 9, we have E[R(h_0; Y, X) | X] = 0, E[R(h_S; Y, X_S)] = 0, and E[X_S R(h_S; Y, X_S)] = 0. By the same argument as in the proof of Proposition 17, we have that E[R(h_S; Y, X_S) | E, X_S] = 0, and the result follows directly from the above.

E1.6 Proof of Theorem 20
Proof. For i ∈ [n], we let ξ_i be defined as above. We first argue that (6) holds. To that end, we write the quantity of interest as the sum of the terms Q1, Q2 and Q3 and bound each term. By the Cauchy–Schwarz inequality and Assumption (iii), we conclude that Q1 is o_P(1).
For the Q2 term, we note first that, for all P ∈ P, Y ⊥⊥ E | X_S. By the fact that, for i ≠ j, ξ_i and ξ_j are independent given (Y_i, X_{S,i})_{i∈[n]}, and by Assumptions (ii) and (c), together with Lemma 30, we conclude that Q2 = o_P(1).
Finally, for the Q3 term, a similar argument as for the Q2 term, using Proposition 9 together with Assumptions (ii) and (c) and Lemma 30, yields that Q3 = o_P(1). Thus, (6) holds.
We now argue that (7) holds. We begin by decomposing Σ̂ into the terms (I)–(XII) and handle each term in the following.
By equivalence of finite-dimensional norms, it suffices to show that, for all j, k ∈ [q], (I)_{jk} − (Σ_P)_{jk} = o_P(1).
This holds by Lemma 19 in Shah and Peters (2020). For the (II) term, by sub-multiplicativity of the operator norm and the Cauchy–Schwarz inequality, again by Lemma 19 in Shah and Peters (2020) and Assumption (b), together with Lemma 30 and our argument for the Q2 term, we obtain the required bound. For the (III) term, the first factor has already been handled in (II), whereas the second is o_P(1) by Assumption (iii). The same line of argument holds for (III)⊤.
For the (IV) term, the first factor was shown to be E_P[R_{P,i}² ∥ξ_i∥₂²]^{1/2} + o_P(1) in the argument for (II), while Lemma 30 and our argument for the Q3 term let us conclude that the second factor is o_P(1). For the (V) term, the first factor has been handled in the argument for the (II) term and the second in the argument for (III).
For the (VI) term, the relevant factor has been dealt with in the argument for (III).
For the (VII) term, the first factor is shown to be o_P(1) as part of the argument for (III), while the second factor is shown to be o_P(1) in our argument for (IV); thus, we conclude that (VII) = o_P(1).
For the (VIII) term, the relevant factor was shown to be o_P(1) in the argument for (IV); thus, (VIII) = o_P(1).
For the (IX) term, the relevant bound was part of the argument for the (II) term.
For the (X) term, the relevant factor is o_P(1) by (6).
For the (XI) term, we apply Markov's inequality for all ϵ > 0. For the (XII) term (and similarly for the (XII)⊤ term), the first factor has been dealt with in (X) and the second in (XI). We have thus shown (7).
The result now follows from (7).
We are now ready to prove the result. For any M > 0, Markov's inequality together with (6) and (8) bounds sup_{P∈P} P_P(∥n^{-1/2} Σ_{i=1}^n R_{P,i} ξ_i∥₂ ≥ M), so n^{-1/2} Σ_{i=1}^n R_{P,i} ξ_i is bounded in probability uniformly over P. Hence, by (8), and similarly by (6) and Assumption (a), the remaining terms are controlled. Combining the above with a uniform version of Slutsky's theorem (Bengs and Holzmann, 2019, Theorem 6.3), we obtain the desired result.

E1.7 Proof of Proposition 21
Proof. First, γ_S = 0 implies that S is (F_Z, H^{Wald}_{Y,X}(a))-invariant, because the canonical conditional CDF of Y given X_S and E can, for P_{(X_S,E)}-almost all (x_S, e), be written as F_Z(a(·)⊤θ + (x_S)⊤β_S), which implies that (Y | X_S = x_S, E = e) and (Y | X_S = x_S) are identical.
For the reverse direction, if S is (F_Z, H^{Wald}_{Y,X}(a))-invariant, then, for P_{(X_S,E)}-almost all (x_S, e), (Y | X_S = x_S, E = e) is identical to (Y | X_S = x_S), and there exists an invariant transformation function h_S ∈ H^{Wald}_{Y,X}(a) given by h_S : (y, x_S) ↦ a(y)⊤θ̃ − (x_S)⊤β̃_S. Then, let h_{ϑ_S} ∈ H^{Wald}_{Y,X_S×E}(a) be such that ϑ_S = (θ̃, β̃_S, 0). We then have, for P_{(X_S,E)}-almost all (x_S, e), h_S(· | x_S) = h_{ϑ_S}(· | x_S, e), which concludes the proof.
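Proposition 21 reduces testing invariance in linear shift trams to testing γ_S = 0, which is the basis of tram-Wald. A generic sketch of such a Wald test, given an estimate of γ_S and its covariance (names are hypothetical; the closed-form χ² survival function below covers only even degrees of freedom):

```python
import math
import numpy as np

def chi2_sf_even(x, df):
    """Survival function of the chi-square distribution for even df,
    via the closed form exp(-x/2) * sum_{j<df/2} (x/2)^j / j!."""
    assert df % 2 == 0 and df > 0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= (x / 2) / j
        total += term
    return math.exp(-x / 2) * total

def wald_test(gamma_hat, cov_hat):
    """Wald statistic for H0: gamma = 0; under H0 it is asymptotically
    chi-square with len(gamma_hat) degrees of freedom."""
    g = np.asarray(gamma_hat, dtype=float)
    V = np.asarray(cov_hat, dtype=float)
    stat = float(g @ np.linalg.solve(V, g))
    return stat, chi2_sf_even(stat, len(g))

# hypothetical estimates for two environment coefficients
stat, p = wald_test([0.1, -0.2], np.eye(2))
```

A large statistic (small p-value) leads to rejecting invariance of S.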
Then, for P_{X_S}-almost all x_S, we have for the expected score residual, where in the first equality we have used the expression for the conditional density computed below Setting 2. The result now follows analogously to the proof of Proposition 19, which is given in Appendix E1.5.

E2 Auxiliary results
Proposition 27. Let Y ⊆ R and X ⊆ R^d. Let F_Z : R → [0, 1] be the extended CDF of a continuous distribution and let h : R × X → R be such that, for all x ∈ X, h(· | x) is ERCI on Y. Then, for all x ∈ X, F_Z(h(· | x)) is a CDF.
Lemma 28. Let A, B ∈ R^{d×d} be symmetric and positive semidefinite. Assume further that λ_min(A) ≥ c > 0.

Proof. (i) Define C := A^{1/2} − B^{1/2} and let x ∈ R^d denote a unit eigenvector corresponding to the largest absolute eigenvalue ∥C∥_op of C. Then, since A − B = A^{1/2}C + CA^{1/2} − C(A^{1/2} − B^{1/2}), rearranging proves the desired result.
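The algebraic identity A − B = A^{1/2}C + CA^{1/2} − C(A^{1/2} − B^{1/2}) used in part (i) can be checked numerically; a minimal sketch, where sym_sqrt is a hypothetical helper computing the symmetric matrix square root:

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(0)
G, H = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A, B = G @ G.T, H @ H.T                       # symmetric positive semidefinite
C = sym_sqrt(A) - sym_sqrt(B)
lhs = A - B
rhs = sym_sqrt(A) @ C + C @ sym_sqrt(A) - C @ C   # note C(A^{1/2}-B^{1/2}) = C^2
err = np.max(np.abs(lhs - rhs))               # numerically zero
```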
Rearranging the inequality, and since ∥A^{-1}∥_op = λ_min(A)^{-1} ≤ c^{-1} by assumption, we conclude that 1 − ∥A^{-1}∥_op ∥A − B∥_op ≥ 1/2, so we can rearrange the inequality above further to obtain the claimed bound.

Lemma 29 (Uniform multivariate Lyapunov theorem). Let X_1, ..., X_n be i.i.d. copies of X ∈ R^d with distribution determined by P, such that, for all P ∈ P, E_P[X] = 0, Var_P[X] = Id, and there exists δ > 0 such that sup_{P∈P} E_P[∥X∥₂^{2+δ}] < ∞. Let S_n := n^{-1/2} Σ_{i=1}^n X_i. Then S_n converges in distribution to Φ_d uniformly over P, where Φ_d denotes the d-dimensional standard normal CDF.
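The multivariate CLT asserted by Lemma 29 can be illustrated by simulation; a sketch checking only pointwise normality for one fixed distribution, not uniformity over P:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, reps = 3, 500, 2000
# i.i.d. vectors with E[X] = 0 and Var[X] = Id: scaled uniforms
# (Uniform(-sqrt(3), sqrt(3)) has mean 0 and variance 1 per coordinate)
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n, d))
S = X.sum(axis=1) / np.sqrt(n)             # one realization of S_n per repetition
mean, sd = S.mean(axis=0), S.std(axis=0)   # approximately 0 and 1 per coordinate
```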
The following result is essentially Lundborg et al. (2022, Lemma 10); we include it here for completeness.
Lemma 30. Let X_1, ..., X_n be a sequence of real-valued random variables and W_1, ..., W_n a sequence of random variables taking values in a measurable space W, both sequences with distributions determined by P. If E_P[|X_n| | W_n] = o_P(1), then X_n = o_P(1).
Proof. By assumption, for any η > 0, we can choose N ∈ N such that, for all n ≥ N, sup_{P∈P} P_P(E_P[|X_n| | W_n] ≥ δ/2) < η. Hence, letting η := δ/(2ϵ), we have that sup_{P∈P} E_P[Y_n] < δ for n sufficiently large, which proves the desired result.
Lemma 31. Assume Setting 1. If S ⊆ [d] is (F_Z, H_{Y,X})-invariant, then the invariant transformation function h_S ∈ H_{Y,X_S} from Definition 16 is P_{X_S}-almost surely unique.
Proof. Assume there exists another invariant transformation function h ∈ H_{Y,X_S}. Then, for P_{X_S}-almost all x_S, the CDF of Y | X_S = x_S can be written as F_Z(h_S(· | x_S)) or F_Z(h(· | x_S)) by Definition 16. Hence, for P_{X_S}-almost all x_S, F_Z(h_S(· | x_S)) = F_Z(h(· | x_S)).
By strict monotonicity of F_Z restricted to R, together with F_Z(−∞) = 0 and F_Z(+∞) = 1, we have, for P_{X_S}-almost all x_S, h_S(· | x_S) = h(· | x_S), which concludes the proof.