Direction of dependence analysis for pre-post assessments using non-Gaussian methods: a tutorial

Abstract Objective We introduced methods for solving causal direction of dependence between variables observed in pre- and post-psychotherapy assessments, showing how to apply them and investigate their properties via simulations. In addition, we investigated whether changes in depressive symptoms drive changes in social and occupational functioning as suggested by the phase model of psychotherapy or vice versa, or neither. Method As a Gaussian (normal-distribution) model is unidentifiable here, we used an identifiable linear non-Gaussian structural vector autoregression model, conceptualizing instantaneous effects as during-psychotherapy causation and lagged effects as pre-treatment predictors of change. We tested six alternative estimators in six simulation settings that captured different real-world scenarios, and used real psychotherapy data from 1428 adult patients (Finnish Psychotherapy Quality Registry; assessments on Patient Health Questionnaire-9 and Social and Occupational Functioning Assessment Schedule). Results The methodology was successful in identifying causal directions in simulated data. The real-data results provided no evidence for single direction of dependence, suggesting shared or reciprocal causation. Conclusions A powerful new tool was presented to investigate the process of psychotherapy using observational data. Application to patient data suggested that depression symptoms and functioning may reciprocate or reflect third variables instead of one predominantly driving the other during psychotherapy.

Clinical or methodological significance of this article: Finding causally active ingredients in psychotherapy contents, or therapeutic mechanisms, has been a major challenge to contemporary psychotherapy research.We introduced a powerful statistical tool with potential for overcoming some of the obstacles to using commonly available observational data.The added potential derived from more comprehensive use of non-Gaussian statistical information compared to typical parametric models in the field.This first application of the tools suggested that depressive symptoms and functioning may reciprocate or reflect third variables instead of one predominantly driving the other during psychotherapy.

Purposes of the Present Study
Understanding the process and mechanisms of behavior change are central to improving psychotherapy practice.Therefore, psychotherapy researchers have put forth theories of change such as the phase model of psychotherapy (Howard et al., 1993), which builds upon a long tradition of theorizing about common and ordered stages behind effective treatments (see e.g.Bandura, 1977;Rogers, 1958;Sullivan, 1970;Whitehorn, 1959).Empirical evaluation of process models tends to require establishing of a cause and a consequence, however.In other words, one wishes to establish that a manipulation of the supposed cause results in the supposed consequence instead of them merely occurring together.Randomized controlled trials can be an effective way to ensure that a manipulation leads to a desired consequence, but for many questions of interest, they are expensive, unethical, or otherwise unfeasible in practice.This paper aims to present a recent advance helpful in some such cases.
For example, the phase model hypothesizes that constraints exist between aspects of behavior change such that three causally ordered phases are formed during psychotherapy: (i) increase in subjective well-being potentiates (ii) symptomatic improvement which further (iii) potentiates improvements in life functioning (causal flow is i → ii → iii).Focusing on the latter part, suppose we wish to establish empirically that symptomatic improvement is a prerequisite for functional improvement (i.e., ii → iii) rather than the other way around (iii → ii).What would even be a feasible intervention to manipulate e.g.depressive symptoms independently of daily functioning and functional disability?When devising an intervention is infeasible, costly, or unethical, researchers typically turn to either formal statistical inference or certain suggestive rules of thumb to distinguish a causal association from a mere association (e.g., Hernán & Robins, 2020;Hill, 1965).Here we present and discuss a statistical technique to infer direction of dependence.We build on the crosslagged design and use it to introduce the idea.
Cross-lagged models are often invoked as a natural design to test a hypothesized causal flow between two variables.They rely on the time-honored role of temporal precedence, or temporality, as a sign of causation (Hill, 1965).For example, one could test whether depressive symptoms at time t predict psychosocial functioning at time t + 1 when adjusting for functioning at time t, and vice versa.As this is done for multiple time points t, a generalizable (causal) process is estimated in which a variable may cause (drive changes in) another and/or be caused (driven) by it.However, already Hill (1965) realized that temporality is only an imperfect sign of causation.Even when temporality can be equated with causation, many shortcomings remain in the widely applied cross-lagged models.
First, at least nine distinct discrete-time longitudinal models are in frequent use and the model combining them all is not identified in any data (Usami et al., 2019).Often these models introduce assumptions about unobserved variables that could bias estimates of causal direction but few applied researchers have considered these various possibilities to any extensive length.Second, psychological processes may feasibly unravel in a continuous manner rather than in discrete time steps.However, cross-effects in discrete-time cross-lagged models are poor approximations to their continuous-time counterparts: Coefficient that are equal in discrete-time models can become different when transitioning to a continuous-time model, strength order of different coefficients can reverse, and nonzero coefficients can vanish or even change sign (Oud & Delsing, 2010).This is because the underlying continuous-time dynamics have time to evolve between the discrete-time observations and linear instantaneous dynamics do not imply linear trajectories over time.In a network of variables, discretetime cross-lagged effects reflect "total" effects of a variable on another over a given interval rather than "direct" (i.e., causal) effects (Ryan & Hamaker, 2022).In practice, a wrong lag for successive measurements could lead a researcher to conclude e.g. that functional changes cause changes in symptoms merely because an early-in-treatment change in symptoms occurs between measurements and causes functional improvements associated with post-treatment symptomatic outcome (cf.Crits-Christoph & Gibbons, 2021).Crits-Christoph and Gibbons (2021) also criticized cross-lagged models for not capturing what we typically want in psychotherapy: they estimate average effects from session t to t + 1 over all t.However, psychotherapies are designed to have different contents in different sessions.Empirically, desirable symptom trajectories frequently show "sudden gains" instead of steady progress (Shalom & Aderka, 2020).While specific fixes to the abovementioned issues tend to exist, here we introduce methodology that sidesteps many of the issues.
Before presenting our current proposal, we next reflect upon what other alternative strategies to cross-lagged designs we have at our disposal.Experimental control (randomization to an intervention group) is often considered a gold-standard procedure for causal inference, but it might be difficult to conceive practical interventions that reliably affect only symptoms or psychosocial functioning but not both.Quasi-experiments such as instrumental variables regression have sometimes been used in psychotherapy research (Crits-Christoph & Gibbons, 2021), but it is again difficult to conceive an instrument variable that is certain to influence only symptoms or functioning but not both.Recent advances in copula-based statistical methods have been introduced that could in principle unconfound a causal effect but they may require several thousand Psychotherapy Research 1059 observations, which could be impractical (Falkenström et al., 2021).Biometric information from twins could in principle be used to infer direction of causation but it is difficult to find twins undergoing similar psychotherapies and some designs are weakly powered in the bivariate case (Duffy & Martin, 1994;Heath et al., 1993;but see Rosenström, Czajkowski, et al., 2019).
By now, we hope to have made a convincing case that estimating direction of dependence in psychotherapy is a non-trivial research challenge that needs novel solutions.Luckily, methodological solutions for cross-sectional direction-of-dependence estimation have rapidly evolved in the twenty-first century.Noting how the common wisdom that correlation does not imply causation only holds for normally distributed (i.e., Gaussian) variables, Dodge and Rousson (2000) used skewness and later higher-order cumulants (Dodge & Rousson, 2001) to infer which one of two variables should be modeled as a cause and which a consequence in linear regression models.These ideas have been generalized to using any other distributional characteristics that deviate from the Gaussian (Hyvärinen & Smith, 2013;Shimizu et al., 2006Shimizu et al., , 2011)), and a book has been published on statistical direction dependence methods (Wiedermann et al., 2020).Within psychology, such methods have been previously applied to infer direction of dependence between specific depressive symptoms, and practical guides have been written for epidemiologists and prevention scientists (García-Velázquez et al., 2020;Rosenström et al., 2012;Rosenström & García-Velázquez, 2020;Shimizu, 2019).For example, the non-Gaussian cross-sectional methods have converged with traditional (Gaussian) cross-lagged analyses in a surprising finding that suicidal ideation is causally antecedent to many other depressive symptoms (Bringmann et al., 2015;García-Velázquez et al., 2020;Komulainen et al., 2021).Non-Gaussian distribution-based methods for estimating direction of dependence from cross-sectional data represent an unexplored opportunity to explore psychotherapy processes, such as the possible causal precedence of symptom improvement to functional improvement.
In the remainder of the paper, we first formalize the cross-lagged modeling problem for the simple case of pre-and post-therapy assessments.We then explain how causal direction of dependence can be inferred from cross-sectional data under non-Gaussian distributions, and with what statistical power and robustness.Only after having covered these necessary parts do we explain how the idea extends to pre-and post-therapy assessments.We then show how simulation studies can be used to test these methods and verify that the correct direction of dependence between variables can be detected from pre-and post-therapy assessments only.Finally, we apply the method to pre-and posttherapy assessments from a large naturalistic psychotherapy quality register to test the direction of dependence between during-therapy change in patient-reported depressive symptoms and change in their therapist-reported social and occupational functioning.We show how to investigate multiple estimators to better understand the data and the methodology.In the supplementary file, we provide R-language implementations of the methods.

The Cross-lagged Problem
We consider the simplest two-time-step cross-lagged model for centered variables.It models the four-byfour correlation matrix of two successive measurements on two variables using the linear relationships shown in the path diagram of Figure 1(a), excluding the dashed line to be discussed.In equations, this can be expressed as.
where the m's are regression coefficients, n x and n y are the residuals of x and y, respectively, x(t) is the variable x at time step t, and similarly for y (note: we retain certain familiar symbols for later use).To identify and estimate the regression coefficients, it is typical to assume that n x and n y are independent of each other.The assumption suggests that (i) there are no confounding variables to x(1) and y(1) given x(0) and y(0), and more importantly, (ii) the time step 0 (baseline, pre-treatment) is sufficient to explain the time step 1 (follow up, post-treatment) and no important dynamics occur in between these two measurements.The assumptions i and ii are related in that if ii fails, so will i.This is an obvious problem both in general and for therapy data specifically.In general, we frequently know discrete time steps to be only an approximation to continuously evolving phenomena.For pre-and post-assessments of therapy studies, interesting things most certainly ought to occur between the measurements.If only ii in above fails, we could model this case by positing that there is an instantaneous effect, e.g., such that n y = bn x + ε for some constant regression coefficient b and residual variable ε (cf.dashed line in Figure 1(a)).Obviously, neglecting b and ε will violate the model assumptions and therefore lead to biased causal estimates m under typical leastsquares estimation.The problem is that n x and n y are essentially cross-sectional and we typically lack a priori knowledge on the causal direction for such instantaneous effects, and on how estimating them should reflect upon the cross-lagged effects m.This is where we may benefit from an ability to estimate direction of causation from cross-sectional data.We will next provide the reader with an intuition on how cross-sectional causality estimation can be achieved and only then return to the issue of instantaneous vs. lagged effects (i.e., to the full modeling task here).

Understanding Distribution-based Inference for Direction of Dependence
Although understanding distribution-based inference for direction of dependence in its full generality requires advanced mathematical concepts, some special cases can be intuitively understood.We start from the intuitive and, for comprehensiveness, also shortly review some more advanced concepts.To understand cross-sectional inference on direction of dependence, we consider the simple linear regression model with the distinction that we lack the a priori knowledge on which one of the variables should be treated as the dependent variable (the outcome) and which one as the independent variable (the predictor).That is, y = bx + ε for random variables x and ε but we do not know it nor their truly pertinent units.Assume b = 1 for simplicity.Then, if x and y have Gaussian distributions and we standardize them, their joint distribution looks completely symmetric, or "spherical" (Figure 1(b)).It does not contain any information about x having caused y.This is a Psychotherapy Research 1061 property of Gaussian distribution only, however.Now, assume that x and ε instead had a rightskewed (Log-normal) distributions, also standardized to mean zero and unit variance, with y = x + ε similarly standardized.Then the joint distribution of x and y drastically differs from spherical, and not only in terms of skew (Figure 1(c)).It suffices to discuss right-skewed variables as leftskewed variables can be transformed to right-skewed with no loss of information by multiplying with a negative value.A right-skewed distribution means that small deviations occur frequently but when the infrequent large deviations do occur that tends to happen to the positive direction, or right-hand side of the distribution.Hence, in our example, large deviations in x are always paralleled in y because it was a sum of x and ε.Large deviations in y, however, need not be paralleled by those in x because they could have derived from ε (cf. Figure 1(c)).Expressed numerically, on average x 2 y > xy 2 , or T(x → y) ≡ E[x 2 yxy 2 ] > 0, where E is the expectation operator and "≡" defines the direction-dependence test statistic T. This occurs because large deviations in x 2 will not get zeroed by small y as they tend to imply a large y, whereas large deviations in y 2 are more often zeroed by small x.Because of this asymmetry, we say that direction of dependence runs from x to y, or shortly x → y.The opposite asymmetry would have been observed under y → x, or equivalently, under x = by + ε.
The above example provides us with intuition on how one might use non-Gaussian cross-sectional distributions to infer direction of dependence in one special case of right-skewed distributions.Are more general principles involved?Indeed.If y is a linear function of x and an independent error term ε, y = bx + ε, then x should be statistically independent of ε, not just uncorrelated with it.For standardized variables, ε is estimated as y -r xy x, where r xy = b is the Pearson correlation between x and y.The x and yr xy x are always uncorrelated by definition. 1 However, for all we know, it could instead hold that x = by + ξ for some ξ independent of y (i.e., direction y → x is the true dependence relation instead of x → y).Whereas x and ε [i.e., now by + ξ and y -b(by + ξ)] would remain uncorrelated by definition, they most certainly would be statistically dependent if not Gaussian (i.e., normally distributed; Kagan et al., 1973, p. 89). 2 Due to its exceptional symmetries, the multivariate normal distribution happens to be the one and only distribution for which uncorrelatedness and independence coincide.For any other distributions, when one of the linear causal hypotheses hold (y → x or x → y), the variable that is less dependent on its residual is the true independent variable.This is simply because under x → y the residual was assumed independent of x and because research in mathematical statistics has proven that y then cannot be independent of its residual, except for multivariate Gaussian data (Darmois, 1953;Kagan et al., 1973, p. 89;Shimizu et al., 2011;Skitovich, 1954).Although such an abstract proof cannot be readily made intuitive, we can get intuition on the phenomenon e.g. by plotting the residual data from our Figure 1 toy example (Figure 2).Clearly, the Gaussian residuals are not particularly informative but the non-Gaussian ones show a striking statistical dependence (heteroscedasticity, not correlation) when plotted for the wrong independent variable.
Although complete statistical independence was implicated in above discussion, in practice, some methods for direction dependence analysis work surprisingly well despite partial confounding, that is, despite some residual dependence (Rosenström et al., 2012;Rosenström & García-Velázquez, 2020).We later return to this phenomenon, but for now, it is the reason why it may make more conceptual sense to "measure relative (in)dependence" than test for perfect independence.Mutual information is one way to quantify statistical dependence irrespective of the distributions of random variables (Cover & Thomas, 2006;MacKay, 2003, for textbook introductions).Mutual information, I(X; Y ), between two random variables X and Y measures the informational price of encoding the joint distribution of X and Y using their marginal distributions only (i.e., assuming them independent). 3herefore, we can test the hypothesis x → y against the alternative causal direction (i.e., y → x) by testing 1 For standardized variables, cor(x, y -r xy , where E is expectation.With the typical estimators this holds in finite samples too as the ordinary least squares method solves a regression coefficient coinciding with r xy by setting the corresponding sample average to zero. 2 by + ξ and y -b(by + ξ) = (1b 2 )y + bξ are both linear statistics of the same random variables, with respective products of coefficients 1b ≠ 0 and b(1b 2 ) ≠ 0 by assumption.In this case, the Skitovich-Darmois theorem (e.g., Kagan et al., 1973, p. 89, for a proof) implies that x and ε must be Gaussian, but this contradicts our assumed non-Gaussianity.Hence, x and ε cannot be independent under the causation y → x.
whether the test statistic gets a positive value, i.e., by testing whether y is more dependent of its residuals than x.A negative value suggests the opposite direction dependence, and zero no evidence one way or the other.Mutual information is also a theoretically important basis for the statistic T because it coincides with the logarithm of the likelihood ratio comparing non-Gaussian structural equation models for x → y and y → x under unknown data distributions (Hyvärinen & Smith, 2013).The reason we needed x or y to be non-Gaussian when using mutual information was that Gaussian variables were the only exception wherein uncorrelatedness implies independence [i.e., this is the only case where both T(x → y) and T(y → x) will always equal to each other, and equal to zero, under an otherwise correct directional model].But, this computation (i.e., T ) turns out to be theoretically the same computation as non-Gaussian likelihood ratio that measures relative degree of fit to data (which is maximized to the direction of maximal non-Gaussianity).Hence, under non-Gaussian distributions, we have a general The same held for the opposite regression on yn.For skewed x, e, and y in place of the normal xn, en, and yn, the picture differed dramatically, with very strong dependence emerging when the "wrong" variable (here y) is used as a predictor variable, i.e., in the model lm(x ∼ y).In all the cases, the residual is always uncorrelated with the predictor variable by definition.For Gaussian data, it is also independent irrespective of regression direction.For any non-Gaussian data, the "wrong" (non-generative) direction always shows higher order dependencies (higher than second moments, i.e., correlations), although their exact form depends on the underlying distributions.
This quantity is always positive.Under independence, f X,Y = f X f Y , and hence, the integral over log(1) = 0 equals zero.Otherwise, it measures the informational price of encoding Psychotherapy Research 1063 direction-dependence statistic provided we can estimate mutual information from observed data, or equivalently the non-Gaussian likelihood ratio via maximizing non-Gaussianity (Hyvärinen & Smith, 2013).The connection of T with this measure of relative fit provides a theoretical explanation for why it works well despite minor violations of independence assumptions, and the below section reviews some empirical evidence.

Statistical Power and Robustness of Distribution-based Inference for Direction of Dependence
Several studies have tested properties of direction dependence statistics in extensive simulations (e.g., Hyvärinen & Smith, 2013;Rosenström et al., 2012;Rosenström & García-Velázquez, 2020;Shimizu et al., 2011).For example, Rosenström and García-Velázquez (2020) provided a simulation example for detection of pairwise causal direction, providing the following guidance for the method's scope of application in those data (they used the kgv estimator, further discussed below).Without confounding, a sample size of n = 200 observations led to above 90% correct detection and samples with n ≥ 500 to near certainty.When 20% of the association was due to a confounding variable (i.e., non-directional for the observed variables), then correct detection was obtained with ∼83% probability when n = 200 and with ≥ 95% probability when n ≥ 500.With 40% confounded association, samples of n = 200 approached chance performance (∼65% correct detections) but sample sizes n ≥ 1000 still led to a good (∼95%) performance.With 60% confounding, very large samples are needed to detect the pairwise causation (e.g., n = 5000 still achieved ≥ 90% correct detection).Under no confounding, fairly little skewness was needed to achieve above 95% detection performance (0.46 with n ≥ 500 and 0.78 with n = 200).Similarly, excess kurtosis of 0.37 was similarly sufficient for n ≥ 500 and that of 1.1 for n = 200, with the investigated distributions.For strong skewness (≥1.5) or excess kurtosis (≥2.35), even small datasets of n = 100 were enough.Thus, methods to estimate pairwise direction dependence can be quite powerful and quite robust to presence of confounding (technically, a violation of assumptions).Yet, it is difficult to give very general statements as the performance varies as a function of the exact non-Gaussian distributional forms and association strength present in the data (e.g., Bach & Jordan, 2002).While the cross-sectional direction dependence methods can have favorable power characteristics compared to quasi-experiments, such as biometric direction-of-causation models (Rosenström & García-Velázquez, 2020), they require some assumptions and have their own limitations.For example, approximating mutual information is a challenging statistical problem (Cover & Thomas, 2006;Hyvärinen, 1997).Yet, some form of an approximation is needed and the methods' performance can depend on the match between the data at hand and the chosen approximation, or even on a hyperparameter choice for the same approximation (García- Velázquez et al., 2020;Hyvärinen & Smith, 2013).There is a bright side to abovementioned issue: careful attention to multiple estimators and characteristics of the data can be informative, providing convergent and/or divergent evidence.Overall, however, simulations have indicated that the methods are remarkably robust against typical issues in psychological data, such as confounding, measurement errors, and ordinal-valued approximations to continuity (García-Velázquez et al., 2020;Hyvärinen & Smith, 2013;Rosenström et al., 2012;Rosenström & García-Velázquez, 2020).
Robustness to partial confounding is a particularly important property.We show below how to probe the extent of that robustness in specific situations and how to test for confounding.One should note that clinical psychological constructs exhibit pervasive inter-correlations and a tool non-robust to confounding might be next to useless in practice within psychotherapy research (e.g., Oltmanns et al., 2018;Rosenström, Gjerde, et al., 2019).Some papers give the impression that direction dependence methodology would be quite non-robust and dependent on strict requirements, such as normally distributed regression error and little confounding (Pornprasertmanit & Little, 2012).However, these are merely features of a particular estimator with very restrictive assumptions that we do not recommended using (see e.g.Hyvärinen & Smith, 2013).We have the general ideas we need now, as well as some evidence of their power.The next sections will discuss more specific direction dependence statistics and how these can be combined with the cross-lagged model to make it useful again.

Six Estimators for Cross-sectional Direction Dependence
Ideally, one would have just one perfect estimator for the direction dependence based on the same (linear) model.However, mutual information (and likelihood ratio) under unknown data distributions can only be approximated in a finite sample and performance of 1064 Tom H. Rosenström et al. different approximations may depend on the data distributions.Simulations indicate that different approximations may have comparable overall performance.Since we lack the ground truth for the data distribution, it may be helpful to compare different general estimators to estimators that utilize a non-Gaussianity of a known type, such as kurtosis or skewness.For example, if one of the general estimators happens to be more correlated with a kurtosis-based estimator than another, and kurtosis also happens to be more informative on direction dependence than skewness, then one might put more weight to inferences based on the kurtosis-linked general estimator.By correlated estimators, we mean estimators that correlate across bootstrap resamples of the data (Efron & Tibshirani, 1993).Supplementary Table lists the estimators we investigated, with their key properties and relevant references.We have implemented these methods in the supplementary R code (file "direction_dependence_measures.R").Given a causality statistic, we denote the bootstrap estimate of probability of the inference x → y by P(x → y).
We chose these measures both due to their differences to each other and because of the following properties: Kernel Generalized Variance (kgv) approximation to mutual information leads to T statistic based on a very general measure of dependence and a proven track record of good comparative performance from ∼20 years of research (Bach & Jordan, 2002;Gretton et al., 2005;Rosenström et al., 2012;Shimizu et al., 2011).Maximumentropy approximation of non-Gaussian likelihood ratio (mxnt) is a lot faster computational and possibly more outlier robust than "kgv" while providing nearly as general a basis for the T statistic (Hyvärinen & Smith, 2013).Similar approximations based on robust-skew (rskew) and hyperbolic tangent (tanh) approximations are outlier robust and, yet, concentrate specifically to skewness and kurtosis properties, respectively, which makes them useful for diagnostic purposes (Hyvärinen & Smith, 2013).Finally, cumulant-based skewness measure (skew) is simply intuitive and easy to understand and serves to show how far one can get merely with elementary equations (Hyvärinen & Smith, 2013). 4Recall that both residual dependence (kgv) and non-Gaussianity maximization (mxnt) were routes to causal discovery, making both approximations of mutual information (dependence) and those of non-Gaussianity (relative entropy) of interest, respectively.Finally, although a kernel-based tool like kgv, we investigated also Hilbert-Schmidt independence criterion (hsic) because it allows a statistical test of independence and has been used and considered in the previous literature (Gretton et al., 2007;Li & Wiedermann, 2020;Pfister et al., 2018;Shimizu et al., 2011).

Direction of Dependence in Pre-and Post-Treatment Data
Typical psychotherapy data includes at least pretherapy and post-therapy measurements.Such prepost measurements do not include observed changes that could precede later changes, but they are not strictly cross-sectional either.Therefore, we treat them as a single-step non-Gaussian structural vector auto-regression (SVAR) model, which is an extension of the above-discussed cross-sectional causal inference tools and the classic cross-lagged regression of Eq. 1 and Figure 1(a) (Hyvärinen et al., 2010).Namely, we both solve the value and direction of the instantaneous effect suggested by the dashed arrow in Figure 1(a) and then correct the classic autoregression coefficients accordingly.Notice how the dashed line implies that part of the effect of x(0) on y(1) is actually mediated by x(1) rather than direct, as would be implied by the classic cross-lagged model.Similarly, part of the autoregressive effect of y(0) on y(1) will be mediated by x(1) if a cross-lagged effect of y on x exists.Therefore, we need to correct the classic autoregressive estimates of Eq. 1 so that they take in account the instantaneous effect.This leads to the following algorithm which we give here in simple equations, in the online supplement as matrix notation, and a reader interested in more general expositions can consult Hyvärinen et al. (2010) and Hyvärinen and Smith (2013).
Our special case of the SVAR model can be expressed as with the constraint that either b 0 xy or b 0 yx must be zero.These are the instantaneous effects having time lag of 0. The coefficients with the superscript 1 are the lagged effects of time lag 1, but these now differ from the m's of Eq. 1 due to the presence of the instantaneous effects.We estimate the coefficients with a two-step algorithm that uses the m's, however.

4
The "skew" statistic defines T simply as T(x → y) = r xy E[x 2 yxy 2 ], where r xy is correlation and E expectation, or mean (note the resemblance to algebraic discussion in the section "Understanding Distribution-based Inference … ").

Psychotherapy Research 1065
The algorithm goes as follows.First, we use classic ordinary least squares to solve for estimates m of the m parameters of Eq. 1, i.e., the classic autoregressive coefficients and use these to estimate the residuals of the Eq. 1.Second, we solve the instantaneous direction of dependence using the statistic T ( n x n y ).Third, if n x n y , we estimate b 0 yx by solving n y = b 0 yx n x + 1 using ordinary least squares.In the case n y n x , we instead estimate b 0 xy from the opposite regression direction.Fourth, and finally, we estimate the (corrected) lagged effects.If n x n y , meaning that only b 0 yx of the instantaneous effects is non-zero, these are5 Namely, we are merely subtracting the mediational paths created by the dashed-line path in Figure 1(a) from the lagged paths to y(1) so as to not count them twice when transitioning from Eq. 1 model to Eq. 2 model.Obviously, under n y n x , we would have made similar adjustments to the two upper equations of Eq. 3 instead of the two lower equations.
We may now infer an instantaneous direction of dependence to be from variable x to variable y when T( nx ny ) .0, and corresponding dominating lagged direction of dependence when another test statistic L(x y) , when lagged effect of x on y is of greater absolute value than the opposite lagged effect).Note that nothing constrains the instantaneous and lagged effects from being to opposing directions.For comparison, we also computed in simulated data the "diluted" direction of dependence based on classic vector auto-regression matrix: where D is the "diluted" test statistic.Regarding interpretation, notice how the instantaneous effects capture the directionality between intervention-related changes in the variables, whereas the lagged effects capture the effects of pre-intervention state on the duringintervention change.Also the lagged effects could be relevant to the phase model of psychotherapy if the patients start the intervention in different phases that allow differential changes.However, the instantaneous effects are arguably more relevant to the question "how does psychotherapy work".
For purposes of simulation-based sensitivity analysis, we also invoked a new confounding variable z to create correlations between n x and n y (cf.dashed arc in Figure 1(a)) or between x(0) and y(0) to probe their effects on the T and L statistics and the ensuing inferences.It is also possible to statistically test such confounding (Li & Wiedermann, 2020).If neither n x is independent of the residual from regression of n y on n x nor n y is independent of the residual from regression of n x on n y , that suggests presence of a confounding variable that explains residual dependency (Li & Wiedermann, 2020) or possibly reciprocal instantaneous effects (cf. the unaccounted for dashed arc in Figure 1(a)).This does not mean that a dominant direction of dependence could not be found but suggests it being a partial explanation only.We tested residual effects to our model using Hilbert-Schmidt Independence Criterion (hsic) for which a statistical permutation test exists (Gretton et al., 2007;Pfister et al., 2018).

A Simulation Study
We illustrate that cross-sectional direction dependence measures can uncover causal direction with success depending on the estimator and the underlying data, providing the reader with means to interrogate own data-specific situations e.g. by modifying the supplementary scripts.We then go on to show how instantaneous and lagged directions of dependence can be unpacked using the methodology, wherein wrong inferences would ensue from classic, solely lagged-effects, estimation.For cross-sectional examples, we simulated the following instantaneous dynamics: y = βx + ε, x = ξ, wherein, for simplicity, we set β = 1.The simulated variables ε and ξ represent non-Gaussian exogenous variables.
In the first set of cross-sectional simulations, ε and ξ were Double exponential (Laplace) distributed random variables (location 0, scale 1, skewness 0, ex.kurtosis 3).Such variables are "sparse", or kurtotic, but not skewed.Sparse improvements or impairments could arise e.g. from sudden gains and losses in psychotherapy, wherein most changes in symptoms are small and only few are large, or comparatively "sudden" (Shalom & Aderka, 2020).Second, we simulated skewed log-Normally distributed variables (logarithm of the variable had s.d.0.25 and mean 0; the variable had skewness ∼0.78 and ex.kurtosis ∼1.1).Such variables could arise when sudden gains dominate over losses.we simulated bimodal variables from a mixture of two Gaussians, one with mean 3, s.d. 1, and weight 0.75 and the other with mean −4, s.d. 1, and weight 0.25 (skewness ∼−0.98, ex.kurtosis ∼−0.53), rescaling the simulated vectors to a variance of 1.Such variables could arise under latent population stratification to patients who benefit vs. are harmed or not benefitting from the intervention.In each set of simulations, we simulated data for 1000 patients in 2000 independent replications.
We simulated SVAR-model data by setting b 0 yx = 0.5 and b 0 xy = 0, that is, an instantaneous effect of x(1) on y(1) was modeled, with no effect of y( 1) That is, the variables had lagged effects on themselves (autocorrelation) and y(0) had a lagged effect on x(1), but x(0) did not have an effect on y(1).Hence, we modeled opposing instantaneous and lagged effects.Using Eq. 3, together these non-Gaussian SVAR effects imply classic vector autoregressive coefficients m xx = m xy = 0.50, m yx = 0.25, and m yy = 0.75.That is, the "diluted" cross-lagged coefficient for the direction x → y is only half of that for y → x, when in fact, equal but opposite-direction instantaneous and lagged effects were implemented.The supplementary online material expresses the same using matrix notation.
To summarize, the true values for the "instantaneous", "lagged", and "diluted" test statistics of interest had the key properties T(n x ny ) .0, L(x y) , 0, and D(x y) , 0 (true value for D(x y) is here obtained as 0.25-0.50= −0.25).For longitudinal simulations, we also investigated a smaller sample size n = 200 because that is sometimes used as a rule of thumb for the minimum sample size for structural equation modeling and has not been tested specifically for this non-Gaussian SVAR model (see Introduction for a general discussion on sample sizes).The complete technical details of the simulations are in the online supplement (file "simulate_dda.R").
Regarding applied estimators, we expected mxnt to perform the best overall because it uses information on both skewness and kurtosis and is efficient in small samples and robust to large "outlying" deviations.We expected kgv to perform the second best overall due to its generality and expected the robust skewness-based estimator to perform well for skewed data and tanh-based estimator for kurtotic data.We had no clear expectations for hsic.It could be more general and thus superior to kgv but may lack the theoretical rationale based on mutual information.
Real-world Example with the Finnish Psychotherapy Quality Register (FPQR) The Finnish Psychotherapy Quality Register (FPQR) was established in 2018 to systematically collect data on quality, availability, and efficiency of psychosocial treatments in Helsinki and Uusimaa hospital district, Finland.At least one previous article has used the sample (Rosenström et al., 2022), and another article comprehensively characterizes it (Saarni et al., 2022).Basically, all the patients referred to outsourced psychotherapies are included in the registry.Patient and psychotherapist fill in their own questionnaires at the beginning and at the end of the therapy.In case of long, i.e. more than one-year long treatments (>40 visits), also a midterm questionnaire is used, but this paper concentrates only on short psychotherapies (usually approx.20 visits) for adult patients.All patients fill in validated symptom measures, including one on depression (PHQ-9; Kroenke et al., 2001).The psychotherapist also evaluates the patients' social and occupational functioning with the SOFAS scale (American Psychiatric Association, 1994;Rybarczyk, 2011).
In this study, we used data from the first three years of the registry, i.e., from June 2018 to September 2021.We were granted with an access to these data in a pseudonymized form, based on a research permission, on a permission from the ethical review board of HUS (HUS/1861/2020 and HUS/3150/ 2020), and on the national regulations on the secondary use of healthcare registers.The need for informed consent was waived by the national regulations and the ethical review board.After screening for completed short adult psychotherapies, 1670 patients fit the category, but only the 1428 patients (86%) with full data on the below measurements were included to this study.Short psychotherapies in the registry have on average 14 sessions, with median 18, s.d. 7, and range 1-40 (Rosenström et al., 2022).
Measures.The severity of depression was evaluated with Patient Health Questionnaire PHQ-9.It is a self-report measure including nine items on depression symptoms, each rated on 0-3 scale, yielding a maximum of 27 points, where larger scores indicate greater severity of symptoms (Kroenke et al., 2001).Psychometric evidence for this sum score formulation seems relatively strong (Stochl et al., 2022).Social and occupational functioning was assessed by the psychotherapist on a 0-100 scale with SOFAS, Social and Occupational Functioning Assessment Scale (American Psychiatric Association, 1994;Rybarczyk, 2011).We also used Psychotherapy Research 1067 knowledge on whether a patient's primary diagnosis was for a disorder or something else, although most groups of psychiatric patients have some depressive symptoms.

Simulation Results
Table 1 shows the estimator performance in the simulated data over 2000 replications per each cross-sectional simulation condition we investigated.First rows (per condition) show the average direction dependence statistic over the replications, which is almost invariably positive.That is, almost all measures on average correctly recognized x → y as the underlying true direction of dependence for all the simulated distributions.The one exception was the biased direction-dependence inference in the robust skewness-based measure when applied to a bimodal data-generating distribution, but this sort of a mismatch between data and estimator properties would generally be easily detected in preliminary data explorations.Second rows show the confidence interval (CI) where the central 95% of the replication-specific estimates lie.Here, we begin to see variance across estimators and cases (discussed below).Third rows show the proportion of replications, P, where the estimator correctly picked up the true underlying causal direction (i.e., where T(x → y) > 0).In the simulation (where x → y holds), P(x → y) < 0.5 indicates bias or wrongly inferred causal direction, P(x → y) ≈ 0.5 indicates chance-level performance, and P(x → y) ≈ 1 indicates reliable correct detection.Bias rarely occurred here, but we next discuss the observed variation in P(x → y), or equivalently, CIs.
As expected, the skewness-based direction-dependence measures performed worse than the other measures in kurtotic data, with the robust version outperforming its non-robust counterpart.Both were excellent in skewed data.As also expected, the kurtosis-based (tanh) estimator performed well in kurtotic data and worse in skewed data, but still reasonably well for skewed data too (note: the characteristically skewed Log-normal distribution also has non-zero excess kurtosis).Unexpectedly, the kgv estimator was the worst estimator in these skewed data.The maximum entropy approximation of likelihood ratio (mxnt) and Hilbert-Schmidt independence criterion (hsic) were the only estimators that never failed in this simulation, correctly recognizing the underlying direction of dependence in 6000 out of 6000 simulated datasets.Note, though, that mxnt is orders of magnitude faster to compute than hsic (Hyvärinen & Smith, 2013).
We then investigated the ability of the same estimators to find both instantaneous and lagged direction of dependence from hypothetical (simulated) pre-and post-therapy assessment data.This was done for a case of independent initial values (column "Correct path model", Table 2), for a case where the initial x and y shared half of their variance (column "Correlated x(0) and y(0)", Table 2), and for a case where a confounding variable z(0), distributed like x(0) and y(0), was added to x(1) and y(1) but not modeled in any way (column "Correlated ε x and ε y ", Table 2).The idea in the two latter cases was to test robustness of the direction-dependence inferences to non-modeled initial correlations and residual confounding, respectively.Here the simulated data was characterized by kurtotic (i.e., Laplace) distributions.The estimators kgv, tanh, and hsic all recognized both the correct instantaneous causation from x to y and the correct lagged causation from y to x (Table 2).The estimators mxnt and tanh were the only estimators that never failed across all the bootstrap replications with sample size n = 1000, being also faster to compute than the kgv and hsic which also performed well otherwise.They were also the best estimators for small samples (n = 200; Table 2).With samples of 200 observations, we no longer had certainty for correct causal inference even under the optimal conditions but the performance was certainly satisfactory, and remained far better than chance even under the tested assumption violations.
For comparison, in line with theoretical predictions (see Methods), the classic lagged relationships would have indicated causation from y to x, even though we simulated equally strong, mutually opposing, instantaneous and lagged effects.Our simulations can be reproduced and extended using the supplementary "simulate_dda.R" file with appropriate R packages installed.
To check the preconditions for direction dependence inference, we verified that both PHQ-9 and SOFAS score distributions deviated from the Gaussian according to Kolmogorov-Smirnov tests (i.e., the data was non-Gaussian, as required; p < 0.001 for both).Supplementary Figure S1 shows estimates of the non-Gaussian distributions of PHQ-9 and SOFAS and scatterplots of their post-treatment values (cf.instantaneous effect), and that of PHQ-9 at pre-treatment vs. SOFAS at post-treatment (cf. the other lagged direction).A scatterplot smoother (local regression) indicated that the relationships were roughly linear, as assumed by our direction Note: The values are bootstrapped probabilities for effects from x to y vs. the opposite direction, i.e., proportions of estimators supporting the direction dependence across all 2000 bootstrap replications.Instantaneous, lagged, and diluted (confounded) effects refer to the respective statistics T, L, and D in the Methods section.The initial (time 0) x and y were independent in simulations pertaining the first column and shared half of their variance in the second column.
Psychotherapy Research 1069 dependence methods.We note that the quadratic terms were non-significant for the linear models corresponding to Figure S1c (p = 0.072) and Figure S1d (p = 0.391), although linearity this strict may not be needed for the applied direction dependence methodology.In other words, the data were well-suited to our modeling assumptions.Finally, Table 3 shows the direction dependence results using the FPQR data on PHQ-9 and SOFAS.No statistically significant evidence on direction dependence could be obtained from the real FPQR data.Comparatively stronger indication towards a symptoms' (PHQ-9) instantaneous effect on functioning (SOFAS) could be seen amongst the non-depressed patients, but this did not reach the conventional 95% confidence.Accordingly, confounding or reciprocity in instantaneous effect could be seen at familywise error rate 0.05 across all patients (direction PHQ-9 → SOFAS: dHSIC = 1.06, critical value = 0.60, p 1 < 0.001; direction SOFAS → PHQ-9: dHSIC = 1.03, critical value = 0.57, p 2 = 0.003; in total: max{p 1 , p 2 } = 0.003 < 0.05/2 = 0.025).The null hypothesis of independent instantaneous predictor and residual could not be rejected for the depressed patients (p-values 0.060 and 0.073) nor for the non-depressed patients (pvalues 0.018 and 0.029 > 0.05/2), but that could have been an issue of statistical power.
Looking at the correlations of the estimators across the bootstrap replications, kgv was not particularly correlated with any of the other measures but mxnt, tanh, and hsic exhibited a relatively strong dependence in these data (Supplementary Figure S2).This suggests that kgv might deliver partly independent information on causation in real psychotherapy data.Also, the robust skewness-based measure may offer better contrast to the other measures compared to the naïve skewness-based statistic.

Discussion
We showed how direction dependence statistics applied to pre-and post-therapy data can, in principle, be used to infer direction of causation between two patient variables interacting during the therapy.Provided non-Gaussian processes take place, these methods thus deliver what we want in the sense of being able to recover the direction of dependence in the total effects of therapy irrespective of them representing sudden change or more gradual dynamics.Applying these methods to the real registry data on depression symptoms and social and occupational functioning revealed no clear directions of dependence, however.This suggested that depression symptoms and functioning may Table 3. Direction dependence results in the short FPQR-registered psychotherapies reciprocally drive each other or be confounded by third variables.Such a does not align with the phase model's prediction that symptomatic improvement predominantly drives functional improvement rather than the other way around.Instead, the finding aligns e.g. with the viewpoint that changes in some general psychopathology process drive changes in disorder and behavioral statuses (Constantinou et al., 2019;Gluschkoff et al., 2019).
At the face of typical undergraduate textbook statistics, the direction dependence methods may feel almost magical in their ability to bootstrap causal information from readily available cross-sectional data.To what degree are they trustworthy?Based on the available simulation work presented in this paper and elsewhere, they perform surprisingly well relative to their modest data requirements (García-Velázquez et al., 2020;Hyvärinen & Smith, 2013;Rosenström et al., 2012;Rosenström & García-Velázquez, 2020;Shimizu et al., 2011).For comparison, "direction of causation" inference techniques based on genetic information from twins or genome scans have received attention in the epidemiologic literature, but often suffer from low statistical power and even bias (Burgess et al., 2017;Didelez & Sheehan, 2007;Duffy & Martin, 1994;Rosenström & García-Velázquez, 2020).Copula-based methods instead seem to require sample sizes in the order of thousands whereas the non-Gaussian SVAR may work with just a few hundreds (Falkenström et al., 2021).We consider direction dependence methods an important addition to the causal inference methodology, especially in psychotherapy research where finding twins or good instrumental variables is difficult.However, they require some modifications for studying the psychotherapy process.Namely, we had to partition the direction of dependence to instantaneous and lagged components.Whereas the lagged relationship from pre-to post-therapy assessment capture the effects of pre-therapy condition on the post-therapy condition, the "instantaneous" direction of dependence pertains to changes brought about during therapy.Direction of dependence in such "innovation effects" is only identifiable using non-Gaussian features of the data, i.e., statistical moments other than covariance (Hyvärinen et al., 2010;Moneta et al., 2013).Classic vector autoregression confounds the instantaneous and lagged effects.
Of course, the black-box nature of the methodology warrants cautious application, and all quasiexperimental methods warrant proper triangulation of evidence, wherein "triangulation" refers to replications using independent methods (Lawlor et al., 2017).This need not happen in a single paper, though, but typical cumulative scientific process should suffice.Here, we concentrated on a prevalent case of data in psychotherapy research, pre-and posttherapy assessments.However, the method readily extends to more heavily repeated measurements that are increasingly frequent in psychotherapy (Hyvärinen et al., 2010;Moneta et al., 2013;Trull & Ebner-Priemer, 2009), and cross-method comparisons are increasingly feasible with other emerging tools for causal inference (Crits-Christoph & Gibbons, 2021;Falkenström et al., 2021;Hernán & Robins, 2020;van der Laan & Rose, 2011).Furthermore, cross-comparisons across different direction-dependence estimators can be informative, as we demonstrated that they are not tautological.
In the bootstrap replications of the real data, the maximum entropy approximation to likelihood ratio estimator (mxnt) correlated strongly with the robust kurtosis-based (tanh) estimator and with the Hilbert-Schmidt independence criterion -based estimator (hsic), whereas the kgv estimator was largely uncorrelated with the other estimators.One possible reason for these differences might be that both mxnt and tanh -estimators, in a sense, make an assumption that the unknown data distributions are "not very far from a Gaussian distribution" (Hyvärinen, 1997)."Not very far" means that only some even (kurtosis-related) and odd (skewness-related) functional deviations from the Gaussian are implemented.Kernel generalized variance, on the other hand, estimates variable dependence based on a very large function space known as reproducing kernel Hilbert space (Bach & Jordan, 2002;Gretton et al., 2005).While the kgv estimator had some surprising caveats with skewed simulated data here, we consider it worthwhile to continue investigating fit of different estimators to common problems of psychology.Approximations of likelihood ratio (mxnt, tanh, rskew) and mutual information (kgv) provide theoretical rationale for why strict independence is not necessary for the direction dependence methods to find the dominant direction of causation, but also hsic performed very well despite confounding and might be interesting to analyze theoretically from this viewpoint.Additionally, many other tools to measure general dependence between random variables exist and could be further investigated in this context.
Another good question is how our results relate to findings of possible patient heterogeneity?For example, Stulz and Lutz (2007) analyzed 1128 psychotherapy patients and argued that they can be empirically classified into three patients groups of shared change patterns: "phase model consistent", "partial responders", and "symptomatically highly impaired", with the phase model consistency Psychotherapy Research 1071 representing the largest group (63%).We believe our predominantly reflect the majority group of the patients because the direction dependence statistics essentially are averages over patients (or other units) and show robustness to outlying observations (Hyvärinen & Smith, 2013).This is frequently also the foremost scientific question-what causation pertains to the majority of the patients?It seems likely that averages over patients hide important exceptions beneath the mass, but that does not make averages uninformative.Rather, one would like to deliver treatments helpful to as many patients as possible, making averages a good starting point.The patient heterogeneity may, however, reduce the power of the methods to detect direction of dependence and it may be possible to mitigate the problem in the future by conditioning to some available third variables responsible for patient heterogeneity (Li & Wiedermann, 2020).Related to heterogeneity, the field-permeating assumption of additive separability of independent variable and endogeneous error (or shock) is increasingly questioned (Su et al., 2015), but separability is not strictly essential to direction-dependence inference (Hyvärinen & Smith, 2013, p. 119-120).Future non-separable models might bring about some benefits, however.
As our methodology relied on non-Gaussian phenomena, some researchers may wonder if psychological constructs are inherently Gaussian even if their measures sometimes appear non-Gaussian due to being sensitive to only one end of the population variation (Falconer, 1965;Plomin et al., 2009;Reise & Rodriguez, 2016).For example, the disordered end.A latent-normality assumption is frequently used in psychometric and behavior-genetic modeling (Olsson, 1979;Reise & Rodriguez, 2016;Verhulst & Neale, 2021).While recent modeling efforts indicate the relevant latent psychopathology traits are indeed non-Gaussian (Rosenström et al., 2021), more generally, the jury is still out and pertinent methodology readily available (e.g., Woods, 2015;Woods & Lin, 2009).
Why did we not find clear indications of the predicted direction of dependence, from recovery of depressive symptoms to increased social and occupational functioning?One obvious reason would be that no dominant direction of dependence existed in the data, meaning that the phase model did not hold for these data.The alternatives to directional dependence would be either reciprocal dependence or that a latent third variable explains changes in both depressive symptoms and functioning.Partial evidence towards either or both of these hypotheses being true were found in terms of independence tests.Another possibility would be that a dominant direction of dependence did exist, but the effect was weak and thus too hard to detect well.
Phase model is but one example of hypothesized processes causing directional dependence between variables during psychotherapy.Hypotheses involving other variables could be investigated.Furthermore, methods of direction dependence analysis can be used to discover dependencies between variables with no a priori hypothesis involved.After all, the method treats the competing directional hypotheses symmetrically instead of trying to reject one at the expense of the other.We hope this tutorial exposition will help advance observational causal inference in clinical psychology, psychological medicine, and psychotherapy research-and why not elsewhere too.

Figure 1 .
Figure 1.Lagged versus instantaneous effects and non-Gaussian data distributions.(a) A path diagram of two-time point structural vector autoregression (SVAR).Whereas an ordinary SVAR model is only identified for the directed paths depicted with solid arrows, a non-Gaussian SVAR can further identify one more directed path (dashed arrow).It cannot identify residual confounding (dashed arc for correlation) but the directionality inference (dashed arrow) tolerates substantial amount of it.(b) Simulated Gaussian data points.Variables X and ε were independently Gaussian and y-axis given by Y = βX + ε, where β = 1.Both X and Y were standardized (i.e., unitless).The standardized data are rotationally symmetric because they are Gaussian, meaning order between X and Y cannot be inferred without a priori knowledge.(c) Simulated non-Gaussian data.Otherwise as panel b but the variables X and ε were independent right-skewed variables.Now observing high X values tend to imply high Y values due to the shared part βX.Due to the non-shared ε in Y, high Y values need not imply high X values as frequently as holds for the opposite implication.The rotational asymmetry means that the ordering between X and Y is inferable from the standardized data.

Figure 2 .
Figure 2. Data distributions and residual dependency illustrated.We simulated 1000 normally distributed observations xn and en, with en and xn independent, and generated yn as yn = z(xn + en), where is z is the standardizing z-score transformation.No obvious dependence ensued for xn and the linear model residual from regression yn on xn, denoted lm(yn ∼ xn) in R syntax.The same held for the opposite regression on yn.For skewed x, e, and y in place of the normal xn, en, and yn, the picture differed dramatically, with very strong dependence emerging when the "wrong" variable (here y) is used as a predictor variable, i.e., in the model lm(x ∼ y).In all the cases, the residual is always uncorrelated with the predictor variable by definition.For Gaussian data, it is also independent irrespective of regression direction.For any non-Gaussian data, the "wrong" (non-generative) direction always shows higher order dependencies (higher than second moments, i.e., correlations), although their exact form depends on the underlying distributions.

Table 1 .
Cross-sectional simulation results for the six estimators (in columns).
Note: Direction dependence x → y indicates the true simulated direction.T = estimated direction-dependence test statistic for the true direction; CI = bootstrap percentile confidence interval; P = probability, or proportion of bootstrap replications with T(x → y) > 0. The columns indicate the applied approximation of, or estimator for, T (see Methods or Supplementary Table for details).

Table 2 .
Longitudinal simulation results for the six estimators with independent and correlated initial data and with simultaneously independent initial and confounded lagged data.