DR-VIDAL - Doubly Robust Variational Information-theoretic Deep Adversarial Learning for Counterfactual Prediction and Treatment Effect Estimation on Real World Data

Determining causal effects of interventions on outcomes from real-world, observational (non-randomized) data, e.g., treatment repurposing using electronic health records, is challenging due to underlying bias. Causal deep learning has improved over traditional techniques for estimating individualized treatment effects (ITE). We present the Doubly Robust Variational Information-theoretic Deep Adversarial Learning (DR-VIDAL), a novel generative framework that combines two joint models of treatment and outcome, ensuring an unbiased ITE estimation even when one of the two is misspecified. DR-VIDAL integrates: (i) a variational autoencoder (VAE) to factorize confounders into latent variables according to causal assumptions; (ii) an information-theoretic generative adversarial network (Info-GAN) to generate counterfactuals; (iii) a doubly robust block incorporating treatment propensities for outcome predictions. On synthetic and real-world datasets (Infant Health and Development Program, Twin Birth Registry, and National Supported Work Program), DR-VIDAL achieves better performance than other non-generative and generative methods. In conclusion, DR-VIDAL uniquely fuses causal assumptions, VAE, Info-GAN, and double robustness into a comprehensive, performant framework. Code is available at: https://github.com/Shantanu48114860/DR-VIDAL-AMIA-22 under the MIT license.


Introduction
Understanding causal relationships and evaluating the effects of interventions to achieve desired outcomes is key to progress in many fields, especially in medicine and public health. A typical scenario is to determine whether a treatment (e.g., a lipid-lowering medication) is effective in reducing the risk of, or curing, an illness (e.g., cardiovascular disease). Randomized controlled trials (RCTs) are considered the best practice for evaluating causal effects 1. However, RCTs are not always feasible, due to ethical or operational constraints. For instance, if one wanted to evaluate whether college education is a cause of good salary, it would not be ethical to pick teenagers and randomize their admission to college. So, in many cases, the only usable data sources are observational data, i.e., real-world data collected retrospectively and not randomized. Unfortunately, observational data are often plagued with various biases (since the data generation processes are largely unknown), such as confounding (i.e., spurious causal effects on outcomes by features that are correlated with a true unmeasured cause) and colliders (i.e., mistakenly including effects of an outcome as predictors), making it difficult to infer causal claims 2. Another problem is that, in both RCTs and observational datasets, only factual outcomes are available, since clearly an individual cannot be treated and non-treated at the same time. Counterfactuals are alternative predictions that answer the question "what outcome would have been observed if a person had been given a different treatment?" If models are biased, counterfactual predictions can be wrong, and interventions can be ineffective or harmful 3. In both RCT-based and real-world studies, two types of treatment effects are usually considered: (i) the average treatment effect (ATE), which is population-based and represents the difference in average treatment outcomes between the treated and controls; and (ii) the individualized
treatment effect (ITE), which represents the difference in treatment outcomes for a single observational unit with the same background covariates 4. When there is suspected heterogeneity, stratified ATEs, or conditional ATEs, can be calculated. Traditional statistical approaches for estimating treatment effects, taking into account possible bias from pre-treatment characteristics, include propensity score matching (PSM) and inverse probability weighting (IPW) 5. The propensity score is a scalar estimate representing the conditional probability of receiving the treatment, given a set of measured pre-treatment covariates. By matching (or weighting) treated and control subjects according to their propensity score, balance in pre-treatment covariates is induced, mimicking randomization of the treatment assignment. However, the PSM approach only accounts for measured covariates, and latent bias may remain after matching 6. PSM has historically been implemented with logistic-linear regression, coupled with different feature selection methods in the presence of high-dimensional datasets 7. A problem with PSM is that it often decreases the sample size due to matching, while IPW can be affected by skewed, heavy-tailed weight distributions. Machine learning approaches have been introduced more recently, e.g., Bayesian additive regression trees 8 and counterfactual random forests 9. Big data have also led to the flourishing of causal deep learning 10. Notable examples include the Treatment-Agnostic Representation Network (TARNet) 11, Dragonnet 12, the Deep Counterfactual Network with Propensity-Dropout (DCN-PD) 13, Generative Adversarial Nets for inference of Individualized Treatment Effects (GANITE) 14, the Causal Effect Variational Autoencoder (CEVAE) 15, and the Treatment Effect by Disentangled Variational AutoEncoder (TEDVAE) 16.
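The propensity score described above can be estimated with a simple logistic model. The following sketch (illustrative, not tied to any of the cited implementations) fits P(T=1|X) by gradient descent using numpy only; in practice any GLM or machine learning classifier can be used:

```python
# Illustrative propensity score model: logistic regression P(T=1|X)
# fit by gradient descent on simulated data (numpy only).
import numpy as np

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 3))                 # pre-treatment covariates
true_w = np.array([1.0, -0.5, 0.25])        # assumed ground-truth coefficients
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_w))))

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p - t) / n                # gradient of the logistic log-loss
    w -= lr * grad

# Estimated propensity pi(x), usable for matching (PSM) or weighting (IPW).
propensity = 1.0 / (1.0 + np.exp(-(X @ w)))
```

The recovered coefficients approach the data-generating ones as the sample grows, which is all PSM/IPW needs: a well-calibrated conditional probability of treatment.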

Figure 1: Directed acyclic graph modeling the causal relationships among treatment t, outcome y and pretreatment covariates X, under the latent space Z.
Contribution This work introduces a novel deep learning approach for ITE estimation and counterfactual prediction on real-world observational data, named Doubly Robust Variational Information-theoretic Deep Adversarial Learning (DR-VIDAL). Motivated by Makhzani et al. 17, we use a lower-dimensional neural representation of the input covariates to generate counterfactuals and improve convergence. We assume a causal graph on top of the covariates, where the covariates X are generated from 4 independent latent variables Z_t, Z_ycf, Z_yf and Z_x, indicating latents for the treatment, counterfactual outcome, factual outcome and observed covariates, respectively, as shown in Figure 1. In generating the representations, we use a variational autoencoder (VAE) to infer the latent variables from the covariates in an unsupervised manner, and feed the learned lower-dimensional representation from the VAE to a generative adversarial network (GAN). Also, to counter the loss of predictive information while generating the counterfactuals, we maximize the mutual information between the learned representations and the output of the generator. We add this as a regularizer to the generator loss to obtain more robust counterfactuals. Finally, we incorporate a doubly robust network head to estimate the ITE, improving loss convergence. As DR-VIDAL generates the counterfactual outcomes, we minimize the supervised loss for both the factual and the counterfactual outcomes to estimate the ITE more accurately.
The main features of DR-VIDAL are, in summary:
• Incorporation of an underlying causal structure where the observed pre-treatment covariate set X is decomposed into four independent latent variables Z_t, Z_X, Z_yf, Z_ycf, inducing confounding on both the treatment and the outcome (Figure 1).
• Latent variables are inferred using a VAE 18.
• A GAN 19 with variational information maximization 20 generates (synthetic) complete tuples of covariates, treatment, factual and counterfactual outcomes.
• Individual treatment effects are estimated on the completed datasets with a downstream, four-headed deep learning block that is doubly robust 21,22.
To our knowledge, this is the first time that VAE, GAN, information theory and double robustness are amalgamated into a counterfactual prediction method. By performing test runs on synthetic and real-world datasets (Infant Health and Development Program, Twin Birth Registry, and National Supported Work Program), we show that DR-VIDAL can outperform a number of state-of-the-art tools for estimating ITE. DR-VIDAL is implemented in PyTorch and the code is available at: https://github.com/Shantanu48114860/DR-VIDAL-AMIA-22 under the MIT license. In the repository, we also provide an online technical supplement (OTS) with full details on the architectural design, derivation of equations, and additional experimental results.

Problem Formulation
We use the potential outcomes framework 23,24. Let us consider a treatment t (binary for ease of reading, but the theory can be extended to multiple treatments) that can be prescribed to a population sample of size N. The individuals are characterized by a set of pre-treatment background covariates X, and a health outcome Y is measured after treatment. We describe each subject i with the tuple {X, T, Y}_{i=1}^N, where Y_i^0 and Y_i^1 are the potential outcomes when applying treatments T_i = 0 and T_i = 1, respectively. The ITE τ(x) for subject i with pre-treatment covariates X_i = x is defined as the difference in the average potential outcomes under both treatment interventions (i.e., treated vs. not treated), conditional on x, i.e., τ(x) = E[Y_i^1 − Y_i^0 | X_i = x] (1). The ITE cannot be calculated directly, given the inaccessibility of both potential outcomes: only factual outcomes can be observed, while the others (counterfactuals) can be considered as missing values. However, when the potential outcomes are made independent of the treatment assignment, conditionally on the pre-treatment covariates, i.e., {Y^1, Y^0} ⊥ T | X, the ITE can then be estimated as τ(x) = E[Y | X = x, T = 1] − E[Y | X = x, T = 0]. Such an assumption is called the strongly ignorable treatment assignment (SITA) assumption 25,26. By further averaging over the distribution of X, the ATE τ_01 can be calculated as τ_01 = E_X[τ(x)]. ITE and ATE can be calculated with stratification matching of x in treatment and control groups, but the calculation becomes unfeasible as the covariate space increases in dimension. The propensity score π(x) represents the probability of receiving the treatment T = 1 conditioned on the pre-treatment covariates X = x, denoted as π(x) = P(T = 1 | X = x) 24. The propensity score can be calculated using a regression function, e.g., logistic. ITE/ATE can then be estimated by matching (PSM) or weighting (IPW) instances through π(x), in a doubly robust way 27, or through myriad other approaches 28,9,29,30,27,31,32,33. In the next section, we describe approaches based on deep learning.
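The IPW idea above can be made concrete on simulated data, where the true effect is known. This is an illustrative sketch (not the paper's code) assuming the SITA assumption holds and the true propensity π(x) is known:

```python
# Illustrative sketch: estimating the ATE from confounded observational
# data via inverse probability weighting (IPW), with known propensity.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

x = rng.normal(size=n)                      # single pre-treatment covariate
pi = 1.0 / (1.0 + np.exp(-x))               # propensity: assignment depends on x
t = rng.binomial(1, pi)                     # confounded treatment
y = 2.0 * t + 1.5 * x + rng.normal(size=n)  # true ATE = 2.0

# Naive difference of means is biased because x confounds t and y.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW reweights each group to mimic randomized assignment.
ipw = np.mean(t * y / pi) - np.mean((1 - t) * y / (1 - pi))

print(naive, ipw)  # naive is biased upward; ipw is close to the true ATE
```

With these data-generating values the naive estimate overshoots the true effect because treated subjects have systematically larger x, while the IPW estimate recovers it.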

Related Work
Alaa and van der Schaar 34 characterized the conditions and the limits of treatment effect estimation using deep learning. The sample size plays an important role: estimations on small sample sizes are affected by selection bias, while on large sample sizes they are affected by algorithmic design. Our work builds on the ITE estimation approaches of CEVAE 15, DCN-PD 13, Dragonnet 12, GANITE 14, TARNet 11, and TEDVAE 16. DCN-PD is a doubly robust, multitask network for counterfactual prediction, where propensity scores are used to determine a dropout probability of samples to regularize training, carried out in alternating phases using treated and control batches. CEVAE uses a VAE to identify latent variables from an observed pre-treatment vector and to generate counterfactuals. TARNet aims to provide an upper-bound effect estimation by balancing the distributions of treated and controls (with a weight indemnifying group imbalance) within a high-dimensional covariate space, but it does not exploit counterfactuals, and only minimizes the factual loss function. Dragonnet is a modified TARNet with targeted regularization based on propensity scores. GANITE generates proxies of counterfactual outcomes from covariates and random noise using a GAN, and feeds them to an ITE generator. For both GANITE and TARNet, in the presence of high-dimensional data, the loss can be hard to converge. TEDVAE 16 uses a variational autoencoder to infer hidden latent variables from proxies, using a causal graph similar to CEVAE's. In the next sections, we discuss in detail the novelty of DR-VIDAL and the differences in architectural design and training mechanisms with respect to the aforementioned approaches.

Proposed Methodology
The DR-VIDAL architecture can be summarized in three components: (1) a VAE inferring the latent space, (2) a GAN generating the counterfactual outcomes, and (3) a doubly robust module estimating the ITE. The architectural layout is schematized in Figure 2, while the algorithmic details are given in the OTS.
Latent variable inference with VAE. We assume that the observed covariates X = x, with treatment assignment T = t and factual and counterfactual outcomes Y_f = y_f and Y_cf = y_cf respectively, are generated from an independent latent space z, composed of z_x ∼ p(z_x), z_t ∼ p(z_t), z_yf ∼ p(z_yf), and z_ycf ∼ p(z_ycf), which denote the latent variables for the covariates x, the treatment indicator t, and the factual and counterfactual outcomes y_f and y_cf, respectively. This decomposition follows the causal structure shown in Figure 1. The goal is to infer the posterior distribution p(z_x, z_t, z_yf, z_ycf | x), which is intractable to optimize directly. We use the theory of variational inference 35 to learn the variational posteriors q_φx(z_x|x), q_φt(z_t|x), q_φyf(z_yf|x), q_φycf(z_ycf|x), using 4 different neural network encoders with parameters φ_x, φ_t, φ_yf, and φ_ycf, respectively. Using the latent factors sampled from the learned variational posteriors, we reconstruct x by estimating the likelihood p_φd(x | z_x, z_t, z_yf, z_ycf) via a single decoder parameterized by φ_d. The latent factors are assumed to be Gaussian, with dimensions D_zx, D_zt, D_zyf, D_zycf for z_x, z_t, z_yf, z_ycf, respectively. The variational posteriors of the inference models are Gaussians whose means μ_x, μ_t, μ_yf, μ_ycf and variances σ²_x, σ²_t, σ²_yf, σ²_ycf are parameterized by the encoders E_φx, E_φt, E_φyf, E_φycf with parameters φ_x, φ_t, φ_yf, φ_ycf, respectively. The overall evidence lower bound (ELBO) loss of the VAE, L_ELBO, combines the reconstruction likelihood with the Kullback-Leibler (KL) divergences between each variational posterior and its prior. We minimize the VAE objective L_VAE to obtain the optimal parameters of the encoders φ_x, φ_t, φ_yf, φ_ycf, and of the decoder φ_d.

Generation of counterfactuals via GAN. After learning the hidden latent codes z_x, z_t, z_yf, z_ycf from the VAE, we concatenate the latent codes to form z_c, passed to the generator of the GAN block G_θg, along with a random noise z_G ∼ N(0, I_d). G_θg is parameterized by θ_g, and it outputs the vector ȳ of the potential (factual and counterfactual) outcomes. We replace the corresponding component of the generated outcome vector ȳ with the factual outcome y_f to form ŷ_0 and ŷ_1, which are passed to the counterfactual discriminator D_θd, along with the true covariate vector x. D_θd is parameterized by θ_d, and is responsible for predicting the treatment variable, similarly to GANITE. The loss of the GAN block is defined over x ∼ p(x), z_G ∼ p(z_G), and the concatenated latent codes z_c, with z_x ∼ q_φx(z_x|x), z_t ∼ q_φt(z_t|x), z_yf ∼ q_φyf(z_yf|x) and z_ycf ∼ q_φycf(z_ycf|x). From ȳ, we also calculate the predicted factual outcome ŷ_f. As also done in GANITE, we include the supervised loss L_G^S(y_f, ŷ_f), which enforces the predicted factual outcome ŷ_f to be as close as possible to the true factual outcome y_f.
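The Gaussian KL terms inside the VAE's L_ELBO above have a well-known closed form; a minimal numpy sketch (illustrative, not the authors' implementation) for one diagonal-Gaussian posterior against a standard-normal prior:

```python
# Closed-form KL(q || p) between a diagonal Gaussian posterior
# q = N(mu, sigma^2) and the standard-normal prior p = N(0, I),
# as used in each of the four KL terms of the ELBO.
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# Two example posteriors: one equal to the prior, one shifted.
mu = np.array([[0.0, 0.0], [1.0, -1.0]])
log_var = np.array([[0.0, 0.0], [0.0, 0.0]])
kl = gaussian_kl(mu, log_var)
print(kl)  # [0.0, 1.0]: the KL vanishes only when q equals the prior
```

In DR-VIDAL this term would be computed once per encoder (for z_x, z_t, z_yf, z_ycf) and added to the reconstruction loss.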
The complete loss function of the counterfactual GAN combines the adversarial loss with the supervised loss L_G^S. We also employ an additional regularization term λ I(z_c; G(z_G, z_c)) to maximize the mutual information between the learned concatenated latent code z_c and the output of the generator G(z_G, z_c), as in 20. We thus propose to solve a minimax game between the generator and the discriminator. This objective is hard to solve directly because of the presence of the posterior p(z_c|x) 20, so we obtain a lower bound on it using an auxiliary distribution Q(z_c|x) to approximate p(z_c|x).
Finally, the optimization function of the counterfactual information-theoretic GAN (InfoGAN), incorporating the variational regularization of mutual information with hyperparameter λ, is min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) − λ L_I(G, Q), where L_I is the variational lower bound on the mutual information 20. The counterfactual InfoGAN is used to generate the missing counterfactual outcome y_cf to form the quadruples {x, t, y_f, y_cf}_{i=1}^N sent to the doubly robust block to estimate the ITE.
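The variational bound L_I can be illustrated numerically with a toy discrete code: for a code c with prior p(c) and any auxiliary model Q(c|x), L_I = E[log Q(c|x)] + H(c) never exceeds the true mutual information, which itself is at most H(c). A sketch under assumed, illustrative distributions (not the paper's networks):

```python
# Toy numpy sketch of the variational lower bound on mutual information
# used to regularize the generator (after InfoGAN):
#   L_I = E_{c, x}[log Q(c|x)] + H(c) <= I(c; x) <= H(c).
import numpy as np

rng = np.random.default_rng(1)
p_c = np.array([0.5, 0.5])                  # prior over a binary latent code
entropy_c = -np.sum(p_c * np.log(p_c))      # H(c) = log 2

# Simulated "generator outputs" x that leak the code c through noise.
c = rng.binomial(1, p_c[1], size=10_000)
x = c + rng.normal(scale=0.5, size=c.size)

# A fixed auxiliary posterior Q(c=1|x): logistic in x (an assumed form).
q1 = 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))
log_q = np.where(c == 1, np.log(q1), np.log(1.0 - q1))

mi_lower_bound = log_q.mean() + entropy_c
print(mi_lower_bound)  # positive here, and never exceeds H(c) = log 2
```

Maximizing this bound over Q (and over the generator) tightens it toward the true mutual information, which is what keeps the latent code predictive in the generated counterfactuals.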
Information-theoretic GAN optimization. The GAN generator G_θg works to fool the discriminator D_θd. The optimal discriminator D*_θd is obtained by maximizing the adversarial value function V(D, G) for a fixed generator; the optimal generator G*_θg is then obtained by optimizing the generator objective, including the supervised and mutual-information terms.

Doubly robust ITE estimation. As introduced above, the propensity score π(x) represents the probability of receiving the treatment T = 1 (over the alternative T = 0) conditioned on the pre-treatment covariates X = x. By combining IPW through π(x) with an outcome regression on both the treatment variable and the covariates, Jonsson Funk et al. defined the doubly robust estimator of the causal effect 21 as

Δ_DR = (1/N) Σ_i [ t_i y_i / π(x_i) − (t_i − π(x_i)) µ(x_i, 1) / π(x_i) ] − (1/N) Σ_i [ (1 − t_i) y_i / (1 − π(x_i)) + (t_i − π(x_i)) µ(x_i, 0) / (1 − π(x_i)) ],

where µ(x, t) = α_0 + α_1 x_1 + α_2 x_2 + ... + α_n x_n + δt is the outcome regression, and the term (t_i − π(x_i)) µ(x_i, t_i) augments the IPW estimator. The estimate remains consistent if either π or µ is correctly specified.
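The doubly robust property can be checked numerically. In the sketch below (illustrative data-generating values, written in the algebraically equivalent augmented-IPW form), the outcome model µ is deliberately misspecified, yet the estimate stays consistent because the propensity is correct:

```python
# Numerical sketch of the doubly robust property: a wrong outcome model
# mu(x, t) does not bias the estimate as long as pi(x) is correct.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-x))               # true propensity
t = rng.binomial(1, pi)
y = 2.0 * t + x ** 2 + rng.normal(size=n)   # true ATE = 2.0, nonlinear in x

def mu_bad(x, t):
    """A deliberately misspecified (constant-in-x) outcome model."""
    return np.zeros_like(x) + 2.0 * t

# Doubly robust (augmented IPW) estimate with correct pi but wrong mu.
aug1 = t * (y - mu_bad(x, 1)) / pi + mu_bad(x, 1)
aug0 = (1 - t) * (y - mu_bad(x, 0)) / (1 - pi) + mu_bad(x, 0)
ate_dr = np.mean(aug1 - aug0)
print(ate_dr)  # close to the true ATE of 2.0 despite the wrong mu
```

Swapping the roles (correct µ, wrong π) gives the same protection, which is the "doubly" in doubly robust.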
After obtaining the counterfactual outcome y_cf from the counterfactual GAN to form the quadruples {x, t, y_f, y_cf}_{i=1}^N, we pass them as input to the doubly robust multitask network to estimate the ITE, using the architecture shown in Figure 2 (green box). To predict the outcomes y^(0) and y^(1), we use a configuration similar to TARNet, which contains a number of shared layers, denoted by f_φ and parameterized by φ, and two outcome-specific heads f_θ0 and f_θ1, parameterized by θ_0 and θ_1.
To ensure double robustness, we introduce two more heads that predict the propensity score π(x) = P(T = 1|x) and the regressor µ(x, t). These are calculated using two neural networks, parameterized by θ_π and θ_µ respectively. The predicted potential outcomes of the i-th sample are calculated from the shared representation as ŷ_i^(0) = f_θ0(f_φ(x_i)) and ŷ_i^(1) = f_θ1(f_φ(x_i)). The predicted loss L_p_i(θ_1, θ_0, φ) then combines the factual and counterfactual prediction errors, weighted by a hyperparameter α. With the help of the propensity score π(x) and the regressor µ(x, T), the doubly robust outcomes are calculated following the estimator above, yielding the doubly robust loss L_DR_i(θ_1, θ_0, θ_π, θ_µ, φ). Finally, the ITE loss L_ITE(θ_1, θ_0, θ_π, θ_µ, φ) combines the predicted loss and the doubly robust loss through a hyperparameter β, and the whole network is trained with an end-to-end strategy.
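The four-headed layout can be sketched as a single forward pass. Layer sizes and weights below are placeholders (not the trained model); the point is the data flow from the shared representation into the two outcome heads, the propensity head, and the regressor head:

```python
# Schematic numpy forward pass of the four-headed doubly robust block:
# shared representation f_phi, outcome heads f_theta0 / f_theta1,
# propensity head pi, and regressor head mu(x, t).
import numpy as np

rng = np.random.default_rng(3)
d_in, d_rep = 5, 8

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W_shared = rng.normal(scale=0.1, size=(d_in, d_rep))   # f_phi
W_y0 = rng.normal(scale=0.1, size=(d_rep, 1))          # f_theta0 head
W_y1 = rng.normal(scale=0.1, size=(d_rep, 1))          # f_theta1 head
W_pi = rng.normal(scale=0.1, size=(d_rep, 1))          # propensity head
W_mu = rng.normal(scale=0.1, size=(d_rep + 1, 1))      # regressor head mu(x, t)

x = rng.normal(size=(4, d_in))                         # a mini-batch
t = np.array([0.0, 1.0, 1.0, 0.0])[:, None]

h = relu(x @ W_shared)                                 # shared representation
y0_hat = h @ W_y0                                      # predicted y^(0)
y1_hat = h @ W_y1                                      # predicted y^(1)
pi_hat = sigmoid(h @ W_pi)                             # predicted propensity
mu_hat = np.concatenate([h, t], axis=1) @ W_mu         # regressed outcome

# The factual prediction is selected by the treatment indicator.
y_factual_hat = t * y1_hat + (1 - t) * y0_hat
print(y_factual_hat.shape)  # (4, 1)
```

Because the GAN supplies both potential outcomes, losses can be computed on both y0_hat and y1_hat for every sample, rather than on the factual head only.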
Experimental Setup
Synthetic datasets. We conduct performance tests on two synthetic data experiments. The first uses the same data generation process devised for CEVAE 15. We generate a marginal distribution x as a mixture of Gaussians from the 5-dimensional latent variable z, indicating each mixture component. The details of the synthetic dataset using this process are discussed in the OTS. Datasets of sample size {1000, 3000, 5000, 10000, 30000} are generated, and divided into 80-20 % train-test splits. In the second experimental setting, we amalgamate the synthetic data generation process of CEVAE with that of GANITE 14, to model the more complex causal structure illustrated in Figure 1. We sample 7-, 1-, 1-, and 1-dimensional vectors for z_x, z_t, z_yf, and z_ycf from Bernoulli distributions, and then collate them into x. From the covariates x, we simulate the treatment assignment t and the potential outcomes y as described in the GANITE paper. We generate multiple synthetic datasets for sample sizes {1000, 3000, 5000, 10000, 30000}, also divided into 80-20 % splits. Equations for both data generating processes are provided in the OTS.
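The second process can be sketched schematically. The assignment and outcome mechanisms below are placeholders (the paper's exact equations are in the OTS and the GANITE paper); only the latent dimensions and the split follow the text:

```python
# Schematic version of the second synthetic data-generating process:
# Bernoulli latents z_x, z_t, z_yf, z_ycf are collated into x, from which
# treatment and outcomes are simulated. Mechanisms here are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n = 1000

z_x = rng.binomial(1, 0.5, size=(n, 7))
z_t = rng.binomial(1, 0.5, size=(n, 1))
z_yf = rng.binomial(1, 0.5, size=(n, 1))
z_ycf = rng.binomial(1, 0.5, size=(n, 1))
x = np.concatenate([z_x, z_t, z_yf, z_ycf], axis=1)   # 10-dim covariates

# Placeholder assignment and outcome mechanisms driven by the latents.
p_t = 1.0 / (1.0 + np.exp(-(2.0 * z_t[:, 0] - 1.0)))
t = rng.binomial(1, p_t)
y0 = z_yf[:, 0] + 0.1 * rng.normal(size=n)
y1 = z_yf[:, 0] + z_ycf[:, 0] + 0.1 * rng.normal(size=n)
y_f = np.where(t == 1, y1, y0)                        # observed factual outcome

# 80-20 % train-test split, as in the experiments.
idx = rng.permutation(n)
train, test = idx[:800], idx[800:]
```

Because both y0 and y1 are generated, PEHE can be evaluated exactly on such synthetic data.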
Real-world datasets. We use three popular real-world benchmark datasets: the Infant Health and Development Program (IHDP) dataset 8, the Twins dataset 36, and the Jobs dataset 37. IHDP and Twins are semi-synthetic, and simulated counterfactuals to the real factual data are available. These datasets have also been designed and collated to meet specific treatment overlap conditions, nonparallel treatment assignment, and nonlinear outcome surfaces 8,11,15,14.
In detail, IHDP collates data from a multi-site RCT evaluating early intervention in premature, low-birth-weight infants to decrease unfavorable health outcomes. The dataset is composed of 110 treated subjects and 487 controls, with 25 covariates. The Twins dataset is based on records of twin births in the USA from 1989-1991, where the outcome is mortality in the first year and the treatment is being the heavier twin at birth, comprising 4553 treated and 4567 controls, with 30 covariates. The Jobs study investigates whether a job training program intervention affects earnings after a two-year period, and comprises 237 treated and 2333 controls, with 17 covariates. For all the real-world datasets, we use the same experimental settings described in GANITE, where the datasets are divided into 56/24/20 % train-validation-test splits. We run 1000, 10 and 100 realizations of the IHDP, Jobs and Twins datasets, respectively.
Model fit and test details. Consistent with prior studies 8,11,14, we report the error on the ATE, ε_ATE, and the expected Precision in Estimation of Heterogeneous Effect (PEHE), ε_PEHE, for the IHDP and Twins datasets, since both factual and counterfactual outcomes are available. For the Jobs dataset, as the counterfactual outcome does not exist, we report the policy risk R_pol(π) and the error on the average treatment effect on the treated (ATT), ε_ATT, as indicated in 11,14. The training details and the hyperparameters of the individual networks are given in the OTS. We compared DR-VIDAL with TARNet, CEVAE, and GANITE. In addition, for the real-world datasets, we compare against: least squares regression with treatment as a covariate (OLS/LR1); separate least squares regressions for each treatment (OLS/LR2); balancing linear regression (BLR) 10; k-nearest neighbor (k-NN) 33; Bayesian additive regression trees (BART) 28; random and causal forests (R Forest, C Forest) 9; balancing neural network (BNN) 10; and counterfactual regression with Wasserstein distance (CFR_WASS) 11.

Results
Synthetic datasets. Figure 3 (a), (b) and (c) shows the ATE/PEHE results of DR-VIDAL vs. all other models under the two synthetic data generation processes. In the generative process of CEVAE, the doubly robust version of DR-VIDAL demonstrates a lower ATE error than all other models at all sample sizes. When comparing PEHE, DR-VIDAL (both with and without the doubly robust feature) largely outperforms GANITE. In the second synthetic dataset, generated under the more complex assumptions, DR-VIDAL (both with and without the doubly robust feature) also outperforms GANITE in terms of PEHE. It is worth noting the potential of DR-VIDAL to better infer hidden representations in comparison to GANITE, irrespective of the presence of the doubly robust module.

Real-world datasets. In all three IHDP, Jobs and Twins datasets, across all realizations, the information-theoretic, doubly robust configuration of DR-VIDAL yields the best results against all other configurations (with/without information-theoretic optimization and with/without the doubly robust loss). The doubly robust loss seems to be responsible for most of the improvement. The absolute gain is small, on the order of 1%, but the relative gain with respect to the non-doubly robust setup is significant: the doubly robust module always outperforms its non-doubly robust version, from 55-60% of the time in IHDP to over 80% in the Twins and Jobs datasets (Figure 5). Table 1 shows the comparison of the √PEHE and R_Pol values with the state-of-the-art methods on the three datasets. DR-VIDAL outperforms the other methods on all datasets. On the IHDP and Jobs datasets, DR-VIDAL is the best overall by a larger margin, while the performance increment on the Twins dataset is mild. Even if DR-VIDAL has a large number of parameters, the deconfounding of hidden factors and the adversarial training make it appropriate for datasets with relatively small sample sizes like IHDP. It is also worth noting that DR-VIDAL converges much faster than CEVAE and GANITE, possibly due to the double robustness.

Conclusions
DR-VIDAL is a new deep learning approach to causal effect estimation and counterfactual prediction that combines adversarial representation learning, information-theoretic optimization, and doubly robust regression.On the benchmark datasets, both the doubly robust property and information-theoretic optimization of DR-VIDAL improve performance over a basic adversarial setup.
The work has some limitations. First, the causal graph, even if more elaborate than CEVAE's, could be improved. For instance, connecting the latent variables Z to X and only to their respective treatment, factual and counterfactual outcome nodes would imply two adjustment sets. Another option could be to use the TEDVAE structure in conjunction with our doubly robust setup. Also, the encoded representation in the VAE does not employ any attention mechanism to identify the most important covariates for the propensity scores, which could be especially useful with high-dimensional datasets. Finally, it would be worth evaluating how Dragonnet would perform as a downstream module of DR-VIDAL, substituting it for our current four-headed doubly robust block.
In conclusion, the DR-VIDAL framework is a comprehensive approach to predicting counterfactuals and estimating ITE, and its flexibility (modifiable causal structure and modularity) allows for further expansion and improvement.
The algorithms to train the generative adversarial network and the doubly robust multitask network for counterfactual outcome calculation and ITE estimation are presented in Algorithms 1 and 2, respectively.

A2 Performance Metrics
The errors for PEHE, ATE, policy risk, and ATT are evaluated by estimating ε_PEHE, ε_ATE, R_pol(π), and ε_ATT, respectively, as follows:

ε_PEHE = (1/N) Σ_{i=1}^N ( (y_i^1 − y_i^0) − (ŷ_i^1 − ŷ_i^0) )²

ε_ATE = | (1/N) Σ_{i=1}^N (y_i^1 − y_i^0) − (1/N) Σ_{i=1}^N (ŷ_i^1 − ŷ_i^0) |

R_pol(π_f) = 1 − ( E[Y^1 | π_f(x) = 1] P(π_f = 1) + E[Y^0 | π_f(x) = 0] P(π_f = 0) ),

where π_f is the treatment policy induced by the estimated outcomes (π_f(x) = 1 if ŷ^1 − ŷ^0 > 0, and 0 otherwise), and E is the randomized sample.
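The PEHE and ATE-error metrics can be computed directly from potential-outcome arrays; an illustrative numpy sketch on toy values (standard definitions, as in Hill 2011 and Shalit et al. 2017):

```python
# Evaluation metrics on toy arrays: eps_PEHE is the mean squared error of
# the estimated individual effect; eps_ATE is the absolute error of the
# estimated average effect.
import numpy as np

def pehe(y1, y0, y1_hat, y0_hat):
    """Expected Precision in Estimation of Heterogeneous Effect."""
    return np.mean(((y1 - y0) - (y1_hat - y0_hat)) ** 2)

def ate_error(y1, y0, y1_hat, y0_hat):
    """Absolute error on the average treatment effect."""
    return np.abs(np.mean(y1 - y0) - np.mean(y1_hat - y0_hat))

y1 = np.array([3.0, 2.0, 4.0])       # true treated outcomes
y0 = np.array([1.0, 1.0, 1.0])       # true control outcomes
y1_hat = np.array([3.0, 2.5, 4.0])   # model estimates
y0_hat = np.array([1.0, 1.0, 1.5])

print(pehe(y1, y0, y1_hat, y0_hat))       # mean of [0, 0.25, 0.25] = 1/6
print(ate_error(y1, y0, y1_hat, y0_hat))  # 0.0: individual errors cancel on average
```

Note that the tables report √PEHE, i.e., the square root of the quantity computed by `pehe`; the example also shows why a small ATE error does not imply a small PEHE.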
The true average treatment effect on the treated (ATT) and its error ε_ATT are defined as follows:

ATT = (1/|T_1|) Σ_{i∈T_1} y_i − (1/|T_0 ∩ E|) Σ_{i∈T_0∩E} y_i

ε_ATT = | ATT − (1/|T_1 ∩ E|) Σ_{i∈T_1∩E} (ŷ_i^1 − ŷ_i^0) |,

where T_1, T_0 and E are the subsets corresponding to treated samples, control samples, and the randomized controlled trial, respectively.

A5 Datasets
IHDP and Twins are semi-synthetic, and simulated counterfactuals to the real factual data are available. These datasets have also been designed and collated to meet specific treatment overlap conditions, nonparallel treatment assignment, and nonlinear outcome surfaces 8,11,15,14. The IHDP dataset is composed of 110 treated subjects and 487 controls, with 25 covariates. The Twins dataset comprises 4553 treated and 4567 controls, with 30 covariates. The Jobs dataset comprises 237 treated and 2333 controls, with 17 covariates. For all the real-world datasets, we use the same experimental settings described in GANITE, where the datasets are divided into 56/24/20 % train-validation-test splits. We run 1000, 10 and 100 realizations of the IHDP, Jobs and Twins datasets, respectively.

A7 Differences with CEVAE and GANITE
The counterfactual outcome predictor of DR-VIDAL uses both a VAE and a GAN in the same framework, while only a VAE is used in CEVAE and only a GAN is used in GANITE. CEVAE also incorporates a causal graph, but it is simplistic, as it infers only the observed proxy X from Z. We instead consider multiple latent variables causally related to the treatment and the outcome, in addition to the direct links to the pre-treatment covariates. Furthermore, we use a GAN to generate counterfactual examples but, unlike GANITE, we first infer the multiple latent factors using a VAE, then optimize the GAN with the mutual information regularizer, and finally generate the entire potential outcome vector.

A8 Differences with TARNet and Dragonnet
The design of the doubly robust module of DR-VIDAL is closely related to that of TARNet and Dragonnet. However, TARNet uses a two-headed network, which is not doubly robust. Dragonnet includes a third head that incorporates the propensity score. DR-VIDAL achieves double robustness by adding two heads, i.e., the propensity score head and the regressor head, to the basic two-headed TARNet configuration. Further, in TARNet the weights corresponding to each sample are calculated as the crude probability of the treatment assignment, whereas DR-VIDAL accounts for the pre-treatment covariates. In Dragonnet, the targeted regularization is implemented without taking into account the regressed outcome, which is instead estimated by DR-VIDAL in the fourth head, as a function of treatment and pre-treatment covariates. Another major difference between TARNet/Dragonnet and DR-VIDAL is the training strategy. For both TARNet and Dragonnet, the counterfactual outcome does not exist, so for each sample the overall loss function has to be estimated with the factual outcome only, updating during training the parameters of the outcome head corresponding to the factual outcome. In contrast, DR-VIDAL provides the entire potential outcome vector, comprising both the factual and the counterfactual outcomes. For each training sample, the loss function is calculated for both outcomes, and the parameters of both outcome heads are updated.
Algorithm 1 Training of the generative adversarial network for counterfactual outcome calculation

Algorithm 2 Training of the doubly robust multitask network for ITE estimation
Calculate the predicted outcomes ŷ^(0) and ŷ^(1)
Calculate the predicted loss L_p_i(θ_1, θ_0, φ)
Calculate the doubly robust loss L_DR_i(θ_1, θ_0, θ_π, θ_µ, φ)
Calculate the final loss L_ITE(θ_1, θ_0, θ_π, θ_µ, φ)
Calculate the gradients of the loss L_ITE(θ_1, θ_0, θ_π, θ_µ, φ) and update the parameters

A9 Training Details and Hyperparameters
Adversarial module. The encoders producing q_φx(z_x|x), q_φt(z_t|x), q_φyf(z_yf|x), q_φycf(z_ycf|x) as outputs have a single layer with 5, 1, 1, 1 nodes, respectively. The decoder is a 4-layer neural network, each layer with 15 nodes, to calculate the data likelihood p_φd(x|z_x, z_t, z_yf, z_ycf). For the GAN, the generator network has 2 shared layers and 2 outcome-specific layers, each with 100 nodes. The discriminator and the network for information maximization (Q network in Figure 2) are 3-layered neural networks, with 30 nodes and 8 nodes per layer, respectively. All the layers of the VAE and GAN use Rectified Linear Unit (ReLU) activation functions, and the parameters are updated using the Adam optimizer 38. The random noise z_G is sampled from a 92-dimensional standardized Gaussian distribution N(0, 1). The hyperparameter γ is set to 1 for all datasets, while λ is set to 0.2, 0.01 and 10 for IHDP, Jobs and Twins, respectively. The batch sizes for IHDP, Jobs, and Twins are 64, 64, and 256, respectively. The learning rates of the VAE, generator and discriminator are 1e-3, 1e-4, and 5e-4, respectively.
Doubly robust module. For the doubly robust module, the shared network f_φ and the outcome-specific networks f_θ0 and f_θ1 are 3-layer neural networks, with 200 and 100 nodes per layer, respectively. The propensity network π has 2 layers, each with 200 nodes. The regressor network µ has 6 layers, with 200 nodes in the first 3 layers and 100 nodes in the last 3. All the layers use ReLU activation, and the parameters are updated with the Adam optimizer. The batch sizes are the same as for the adversarial module. We set the learning rate of all the networks to 1e-4, and the hyperparameters α and β are set to 1 for all 3 datasets.

A10 Performance of all the various DR-VIDAL configurations
The performance of all the various DR-VIDAL configurations is reported in Table 2.
A10 DR-VIDAL's in-sample performance on Jobs dataset for R_Pol values
The performance of various models on the Jobs dataset in terms of R_Pol is shown in Table 3.

A12 Performance comparison of doubly robust vs. non-doubly robust version of DR-VIDAL
The performance comparison of the doubly robust vs. non-doubly robust version of DR-VIDAL is shown in Figure 5. The bar plots show how many times one model setup is better than the other in terms of error on the factual outcome (y_f).

A13 t SNE of representations
The t-distributed stochastic neighbor embedding (t-SNE) of the representations learned by the VAE of the adversarial module of DR-VIDAL for the Twins and Jobs datasets, before and after training, is shown in Figure 6. For both datasets, the t-SNE shows reorganization and cluster tightness (i.e., the data reside in a smaller space) in the treatment, factual and counterfactual outcome spaces.

Figure 2 :
Figure 2: Architecture of DR-VIDAL, incorporating the variational autoencoder inferring the latent space (VAE), the generative adversarial network calculating the counterfactual outcomes (GAN), and the doubly robust module (green box) estimating the ITE.

Figure 3 :
Figure 3: Panel (a): performance (ATE) of DR-VIDAL vs. all other models on samples from the generative process of CEVAE. Panels (b) and (c): performance (PEHE) of DR-VIDAL with or without the doubly robust block (DR, w/o DR) vs. GANITE on samples from the generative process of CEVAE-GANITE.

Figure 4 :
Figure 4: Performance comparison of the doubly robust vs. non-doubly robust version of DR-VIDAL. The bar plots show how many times one model setup is better than the other in terms of error on the factual outcome (y_f). Panels, from left to right, show results on the IHDP, Jobs and Twins datasets (100, 10, 100 iterations), respectively.

Figure 5 :
Figure 5: Performance comparison of the doubly robust vs. non-doubly robust version of DR-VIDAL. Panels, from left to right, show results on the IHDP, Jobs and Twins datasets (100, 10, 100 iterations), respectively.

A11 DR-VIDAL's performance on IHDP and Twins datasets for √PEHE values
The performance of various models in terms of √PEHE on the IHDP and Twins datasets is shown in Table 4.

Figure 6 :
Figure 6: Visualization of the latent representations learned by the VAE module of DR-VIDAL for the Twins and Jobs datasets using t-SNE. The first and second panels show the t-SNE before and after training the network for the Twins dataset. The third and fourth panels show the same for the Jobs dataset. From left to right, the plots show the t-SNE of the treatment, factual and counterfactual outcomes.

Table 1 :
Performance in terms of √PEHE and R_Pol.

Table 2 :
Performance of all the different DR-VIDAL configurations on the IHDP, Jobs and Twins datasets (1000, 10, and 100 realizations, respectively). Results show the out-of-sample (mean ± st. dev.) error (PEHE) and policy risk (R_Pol).

Table 3 :
Performance of various models on the Jobs dataset for ε_ATT (mean ± st. dev.).

Table 4 :
Performance of various models on the IHDP and Twins datasets for ε_ATE (mean ± st. dev.).