GRouNdGAN: GRN-guided simulation of single-cell RNA-seq data using causal generative adversarial networks

We introduce GRouNdGAN, a gene regulatory network (GRN)-guided causal implicit generative model for simulating single-cell RNA-seq data, performing in-silico perturbation experiments, and benchmarking GRN inference methods. By imposing a user-defined GRN in its architecture, GRouNdGAN simulates steady-state and transient-state single-cell datasets in which genes are causally expressed under the control of their regulating transcription factors (TFs). Training on three experimental datasets, we show that our model captures non-linear TF-gene dependencies and preserves gene identities, cell trajectories, pseudo-time ordering, and technical and biological noise, with no user manipulation and only implicit parameterization. Despite imposing rigid causality constraints, it outperforms state-of-the-art simulators in generating realistic cells. GRouNdGAN learns meaningful causal regulatory dynamics, allowing sampling from both observational and interventional distributions. This enables it to synthesize cells at inference time under conditions that do not occur in the dataset, making in-silico TF knockout experiments possible. Our results show that in-silico knockout of cell type-specific TFs significantly reduces the generation of cells of that type. Interactions imposed through the GRN are emphasized in the simulated datasets, so GRN inference algorithms assign them much higher scores than interactions that were not imposed but were of equal importance in the experimental training dataset. Benchmarking various GRN inference algorithms reveals that GRouNdGAN effectively bridges the existing gap between simulated and biological data benchmarks of GRN inference algorithms, providing gold-standard ground truth GRNs and realistic cells corresponding to the biological system of interest. Our results show that GRouNdGAN is a stable, realistic, and effective simulator with various applications in single-cell RNA-seq analysis.


Unraveling gene regulatory interactions, often represented as a gene regulatory network (GRN), plays a crucial role in studying biological processes under different conditions 1-3, simulating knockdown and knockout experiments 4,5, and identifying therapeutic drug targets 6,7. Many algorithms have been proposed for GRN reconstruction using bulk or single-cell RNA sequencing (scRNA-seq) data, alone 8-14 or with other modalities 15-18. While these advances have provided great biological insights, evaluating the performance of GRN inference algorithms remains challenging 19,20 due to the lack of reliable ground truth for the biological processes under study. Existing evaluation approaches often resort to curated databases 21-23. However, the regulatory interactions in these databases are aggregated from a wide range of datasets and are not specific to a biological system; given the context specificity of gene regulation, they are not ideal benchmarks. Another strategy is to verify regulatory interactions by conducting perturbation experiments on the system under study 20; however, this approach is tedious, lengthy, and expensive. A third approach is to employ scRNA-seq simulators. Despite great progress in this domain, most simulators lack properties essential for this task, such as preservation of gene identities and simulation based on a user-provided ground truth GRN. For example, scGAN 24, cscGAN 24, scDESIGN2 25, and SPARSIM 26 capture inter-gene correlations in their datasets. However, since they do not explicitly impose a known GRN (that can act as the ground truth), they are not suitable for benchmarking GRN inference methods.
identities and make simplifying assumptions regarding cooperative regulation (see Discussion). These shortcomings demonstrate the need for simulators capable of generating realistic scRNA-seq data that retain the regulatory dynamics specified by a user-defined GRN. Importantly, with the growing interest in causal inference, simulators that impose causal GRNs are greatly needed.

GRouNdGAN generates scRNA-seq data using causal generative adversarial networks

GRouNdGAN is a deep learning model that generates scRNA-seq data while imposing a user-defined causal GRN to describe the regulatory relationships of the genes and TFs. Its architecture builds on the causal GAN framework; details are provided in Tables S1-S2, and an ablation study in File S1 and Table S3.

GRouNdGAN generates realistic scRNA-seq data

We trained GRouNdGAN on three datasets. The first dataset contained scRNA-seq profiles of 68,579 human peripheral blood mononuclear cells (PBMCs) from 10x Genomics 33, corresponding to eleven cell types ("PBMC-All"). We also formed a dataset containing only the most common cell type in PBMC-All ("PBMC-CTL").

We quantitatively assessed the similarity of simulated and experimental data using Euclidean distance, Cosine distance, maximum mean discrepancy (MMD) 35, mean integration local inverse Simpson's index (miLISI) 36, and the area under the receiver operating characteristic curve (AUROC) of a random forests (RF) classifier distinguishing simulated and experimental cells (Methods). As a "control", we calculated these metrics using two halves of the reference test set. Table 1 shows the performance of GRouNdGAN, the control, and three state-of-the-art simulators for the PBMC-CTL dataset (see Table S4 for training set performance). Note that to assess data quality systematically and fairly, we only included methods that automatically generate scRNA-seq data resembling experimental data, since the quality of data generated by methods that rely on the user's judgement for distribution matching (e.g., SERGIO and BoolODE) is user dependent and subjective. Although GRouNdGAN did not receive cell type/cluster information, we did not observe mode collapse (Figure S2), and it was able to generate cells from distinct clusters (Figure 2B, test set miLISI = 1.90).
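To make one of these similarity metrics concrete, the sketch below estimates the squared MMD between two sets of cells with an RBF kernel in plain numpy. The bandwidth `sigma` and the use of the simple biased estimator are our illustrative choices, not necessarily those used in the study:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X (n, d) and Y (m, d),
    using an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def k(A, B):
        # pairwise squared Euclidean distances between rows of A and B
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    # MMD^2 = mean k(X,X) + mean k(Y,Y) - 2 mean k(X,Y)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Identical samples give an MMD of (numerically) zero, while a distributional shift between experimental and simulated cells drives the statistic up.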
GRouNdGAN outperformed all simulators that did not use cell cluster information (Table 2), and compared to those that utilized this information, it was still the top performer based on MMD and RF AUROC and the second best according to the Euclidean distance (6% higher than the best) and miLISI (0.5% lower than the best).

The imposed GRN can be reconstructed from the simulated data

To test whether the imposed GRN can be reconstructed from the generated data, we first formed positive and negative control GRNs whose edges received similar scores when GRN inference was applied to the reference dataset (Table S6), reflecting that these edges are of comparable importance.

We simulated data using GRouNdGAN by imposing only the positive control GRN, and used GRNBoost2, PIDC 10, GENIE3 9, and PPCOR 37 to reconstruct the underlying GRN (Figure 3, Figure S4, Table S6). All methods reconstructed the imposed edges with a much higher AUPRC compared to the reference dataset, while they did not assign high scores to the unimposed edges, resulting in close-to-random AUPRC. Figure 3 shows the performance of GRNBoost2 applied to the reference, GRouNdGAN-simulated, and scGAN-simulated data. The performance for scGAN (both GRNs) was lower than on the reference dataset, showcasing that it had not preserved the TF-gene interactions. Similar results were observed when we repeated these analyses using the BoneMarrow dataset (Table S6, Figure S5).

These results show that the imposed edges were accentuated by GRouNdGAN, while the unimposed edges were disrupted and could not be found by GRN inference methods.

Next, we asked whether GRNBoost2 can reconstruct the imposed GRN if GRNs of different densities are used to simulate data. Using data for which each gene was regulated by 15, 10, 5, or 3 TFs, we observed that in all cases the GRN imposed by GRouNdGAN could be inferred from the simulated data (Table S6). Finally, we repeated the controlled analysis above, swapping the roles of the positive and negative control GRNs. Similar to the results of Figure 3, the imposed edges were inferred by GRNBoost2, while the unimposed edges were not accurately discovered (Table S6). These results show that 1) GRouNdGAN imposes the causal GRN, and 2) the GRN inference

GRouNdGAN-simulated data preserves trajectories and can be used for pseudo-time inference

In addition to deciphering discrete states, scRNA-seq data is often used to determine continuous trajectories and pseudo-time orderings of cells (Table S7) 38.

We annotated different cell types present in both datasets using these marker genes.

In many studies, one is interested in characterizing the relationship between TFs' expression and phenotypic labels (e.g., cell types). Differential expression (DE) analysis is a common approach to identify TFs (or other genes) most associated with a specific cell type. However, to directly test

Figure 6A shows the UMAP embedding of the reference dataset (miLISI = 1.94 when compared with simulated data) and Figure 6B shows the density plots of the experimental and simulated data, revealing a high degree of resemblance. This analysis shows that in-silico TF perturbation experiments using GRouNdGAN produce results concordant with those obtained directly from real biological data (Table S8), a direction that we will explore further in future studies.

We followed the pre-processing steps of scGAN and cscGAN 24 using scanpy version 1.8.2 46. In each dataset, cells with nonzero counts in fewer than ten genes were removed. Similarly, genes that had nonzero counts in fewer than three cells were discarded. Next, library-size normalization was performed on the counts per cell with a library size equal to 20,000, in order to be consistent with previous studies 24. Then, the top 1000 highly variable genes were selected using the dispersion-based method described by Satija et al. 48.

A GAN comprises a generative model (called the "generator") that produces new samples from noise, and its adversary, a discriminative model (called the "discriminator") that tries to distinguish between real and generated samples. The generator's goal is to generate samples so realistic that the discriminator cannot determine whether a sample is real or simulated (an accuracy value close to 0.5). Through adversarial training, the generator and discriminator receive feedback, allowing them to co-evolve in a symbiotic manner.
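The filtering and normalization steps above can be sketched in plain numpy (a simplified stand-in for the scanpy calls; the thresholds follow the text, while the dispersion-based HVG selection is reduced to a simple variance/mean ranking):

```python
import numpy as np

def preprocess(counts, min_genes=10, min_cells=3, library_size=20_000):
    """Filter cells/genes and library-size normalize a (cells x genes) matrix.

    Mirrors the described steps: drop cells with nonzero counts in fewer than
    `min_genes` genes, drop genes nonzero in fewer than `min_cells` cells,
    then scale each cell to a fixed total count of `library_size`.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[(counts > 0).sum(axis=1) >= min_genes]   # cell filter
    counts = counts[:, (counts > 0).sum(axis=0) >= min_cells]  # gene filter
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * library_size

def select_hvgs(norm, n_top=1000):
    """Rank genes by dispersion (variance / mean) and return the top indices.

    A crude proxy for the dispersion-based HVG method cited in the text.
    """
    mean = norm.mean(axis=0)
    disp = norm.var(axis=0) / np.maximum(mean, 1e-12)
    return np.argsort(disp)[::-1][:n_top]
```

A cell with nonzero counts in only two genes, for example, is removed before normalization, and every surviving cell sums to exactly 20,000 counts afterward.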

558
The main difference between a WGAN and a traditional GAN is that in the former, the Wasserstein distance is used to quantify the similarity between the probability distribution of the real data and the generator's produced data (instead of the Kullback-Leibler or Jensen-Shannon divergences).

The Wasserstein distance has been shown to stabilize the training of WGAN without causing mode collapse 32. The detailed formulation of the Wasserstein distance used as the loss function in this study is provided in File S1. In addition, instead of a discriminator, a WGAN uses a "critic" that estimates the Wasserstein distance between the real and generated data distributions. In our model, we added a gradient penalty term for the critic (proposed by Gulrajani et al. 50 as an alternative to the weight clipping used in the original WGAN) in order to overcome vanishing/exploding gradients and capacity underuse issues.
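To make the gradient penalty concrete, the toy sketch below computes the WGAN-GP term lam * (||grad_x D(x_hat)|| - 1)^2 at random interpolates between real and generated samples. We deliberately use a linear critic D(x) = w . x, whose gradient is available in closed form; in the actual model the critic is a neural network and the gradient comes from automatic differentiation:

```python
import numpy as np

def gradient_penalty_linear(w, x_real, x_fake, rng, lam=10.0):
    """Toy WGAN-GP penalty for a linear critic D(x) = w . x.

    Samples random interpolates x_hat between real and fake points and
    penalizes deviation of ||grad_x D(x_hat)|| from 1. For a linear critic,
    grad_x D(x) = w everywhere, so the penalty has a closed form.
    """
    eps = rng.uniform(size=(len(x_real), 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake  # random interpolates
    grad = np.broadcast_to(w, x_hat.shape)       # grad of w . x is w everywhere
    norms = np.linalg.norm(grad, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)
```

A unit-norm critic gradient incurs no penalty, while a critic with gradient norm 5 pays lam * (5 - 1)^2 = 160 per sample, pushing the critic toward the 1-Lipschitz functions required by the Wasserstein formulation.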

In the pretraining step, we trained a WGAN with a generator (containing an input layer, three

corresponding to the TFs that regulate the gene in the imposed GRN (i.e., its parents in the graph).

The expression values of TFs and the generated values of target genes are arranged into a vector, which is passed to an LSN layer for normalization. We used target generators with three hidden layers of equal width. The width of the hidden layers of a generator is dynamically set to twice the number of its inputs (the noise variable and the set of regulating TFs). If the imposed GRN is relatively dense and contains more than 5000 edges, we set the depth to 2 and the width multiplier to 1 to be able to train on a single GPU. The details of hyperparameters and architectural choices are provided in Table S2 (in File S1).
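The sizing rule described above can be sketched as a small helper (the function name and the `base_depth` parameter are our own illustrative choices; the thresholds follow the text):

```python
def target_generator_shape(num_tfs, num_grn_edges, base_depth=3):
    """Hidden-layer widths for one target gene's generator, per the rules above.

    Inputs per gene: one noise variable plus its regulating TFs. Width is
    twice the number of inputs; dense GRNs (> 5000 edges) fall back to
    depth 2 with a width multiplier of 1 to fit on a single GPU.
    """
    num_inputs = num_tfs + 1  # regulating TFs + the noise variable
    if num_grn_edges > 5000:
        depth, multiplier = 2, 1
    else:
        depth, multiplier = base_depth, 2
    return [num_inputs * multiplier] * depth
```

For example, a gene with 5 regulating TFs in a sparse GRN gets three hidden layers of width 12, while the same gene in a GRN with 6000 edges gets two layers of width 6.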

In practice, generating each target gene's expression using a separate neural network introduces excess overhead, because in every forward pass, all target genes' expressions must first be

To overcome this issue, we used auxiliary tasks and neural networks known as the "labeler" and "anti-labeler" 31. The task of these two networks is to estimate the causal controller's TF

GRouNdGAN solves a min-max game between the generator (G) and the critic (D), with the following objective function:

min_G max_D E_{x~P_data}[D(x)] - E_{z~P_z}[D(G(z))]    (1)

We alternated between minimizing the generator loss for one iteration and maximizing the critic loss. The hyperparameters were tuned using a validation set consisting of 1000 cells from the PBMC-CTL dataset (Tables S1-S2 in File S1), based on the Euclidean distance and the RF AUROC score. As input, GRouNdGAN requires a directed acyclic graph (DAG) representing the relationship between TFs and their target genes.

In this study, we created the causal graph using the 1000 most highly variable genes identified in the pre-processing step.

Evaluation of the resemblance of real and simulated cells

We evaluated all models using held-out test sets containing randomly selected cells from each dataset.

Baseline simulator models

We compared the performance of GRouNdGAN to scDESIGN2 25, SPARsim 26, and three GAN-based models.

For the PBMC-All and the BoneMarrow datasets, we trained all models above. Additionally, we simulated data using scDESIGN2 and SPARsim with and without cell cluster information, as they allow providing such side information in their training.

To train models that utilized cell cluster information, we performed Louvain clustering and provided the cluster information and ratio of cells per cluster during training. Clustering was done by following the Cell Ranger pipeline 33, based on the raw unprocessed dataset (and independent of the pre-processing steps described earlier for training simulators). First, genes with no UMI count in any of the cells were removed. Then the gene expression profile of each cell was normalized by the total UMI count of all (remaining) genes, and highly variable genes were identified.

745
The gene expression profile of each cell was then re-normalized by the total UMI count of the retained highly variable genes, and each gene vector (representing its expression across different cells) was z-score normalized. Given the normalized gene expression matrix above, we found the top 50 principal components (PCs) using PCA. These PCs were then used to compute a neighborhood graph with a local neighborhood size of 15, which was used in Louvain clustering. We ran the Louvain algorithm with a resolution of 0.15.
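The normalization and PCA steps of this clustering pipeline can be sketched in plain numpy (the Louvain step itself, which requires a graph-clustering library, is omitted; `cluster_embedding` is our illustrative name):

```python
import numpy as np

def cluster_embedding(expr, n_pcs=50):
    """Per-cell total-count normalization, per-gene z-scoring, then top PCs.

    expr: (cells, genes) expression matrix with highly variable genes
    already selected. Returns the (cells, n_pcs) PC coordinates that would
    feed the neighborhood graph used for Louvain clustering.
    """
    norm = expr / expr.sum(axis=1, keepdims=True)       # per-cell normalization
    z = (norm - norm.mean(axis=0)) / norm.std(axis=0)   # per-gene z-score
    z = np.nan_to_num(z)                                # guard zero-variance genes
    zc = z - z.mean(axis=0)
    # PCA via SVD of the centered matrix; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(zc, full_matrices=False)
    k = min(n_pcs, Vt.shape[0])
    return zc @ Vt[:k].T
```

The returned coordinates are ordered by explained variance, so the first PC captures the most variation, matching the usual input to a k-nearest-neighbor graph.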