Applying causal discovery to single-cell analyses using CausalCell

Correlation between objects is prone to occur coincidentally, and exploring correlation or association in most situations does not answer scientific questions rich in causality. Causal discovery (also called causal inference) infers causal interactions between objects from observational data. Reported causal discovery methods and single-cell datasets make applying causal discovery to single cells a promising direction. However, evaluating and choosing causal discovery methods and developing and performing proper workflow remain challenges. We report the workflow and platform CausalCell (http://www.gaemons.net/causalcell/causalDiscovery/) for performing single-cell causal discovery. The workflow/platform is developed upon benchmarking four kinds of causal discovery methods and is examined by analyzing multiple single-cell RNA-sequencing (scRNA-seq) datasets. Our results suggest that different situations need different methods and the constraint-based PC algorithm with kernel-based conditional independence tests work best in most situations. Related issues are discussed and tips for best practices are given. Inferred causal interactions in single cells provide valuable clues for investigating molecular interactions and gene regulations, identifying critical diagnostic and therapeutic targets, and designing experimental and clinical interventions.


Introduction
RNA-sequencing (RNA-seq) has been used to detect gene expression in a lump of cells for years. Many statistical methods have been developed to explore correlation/association between transcripts in RNA-seq data, including the 'weighted gene co-expression network analysis' that infers networks of correlated genes (Joehanes, 2018). Since a piece of tissue may contain many different cells and the properties and advantages/disadvantages of these algorithms are summarized, with '+++' and '+' indicating the most and least recommended ones ( Table 1; Appendix 2-figures 1-7).
Many causal discovery methods have been proposed. Constraint-based causal discovery identifies causal relationships between a set of variables in two steps: skeleton estimation (determining the skeleton of the causal network) and orientation (determining the direction of edges in the causal network). The PC algorithm is a classic and widely recognized algorithm (Glymour et al., 2019). Causal discovery using the PC algorithm is different in that PC can work with different CI tests to perform the first step. We combined the PC algorithm with 10 CI tests to form 10 constraint-based causal discovery algorithms. The properties and advantages/disadvantages of the 10 algorithms are summarized, with '+++' and '+' indicating the most and least recommended ones ( Table 2; Appendix 3-figures 1 and 2). In addition to constraint-based methods, there are other kinds of methods, including score-based methods that assign a score function to each directed acyclic graph (DAG) and optimize the score via  figure 1). Accuracy is estimated upon simulated and real data (Appendix 2figures 2-7). Scalability is estimated upon simulated data (Appendix 2- figure 2). Advantage/disadvantage is made upon accuracy together with algorithms' other properties. Performs well with complete models # Time consumption is estimated upon simulated data (Appendix 3-figure 1). Accuracy is estimated upon the lung cancer cell lines (Figure 2; Appendix 3-figure 2). Stability is estimated upon the relative structural Hamming distance (SHD, a standard distance to compare graphs by their adjacency matrix), which is used to measure the extent an algorithm produces the same results when running multiple times (Appendix 3-table 1). Advantage/ disadvantage is made upon accuracy. Developing the workflow/platform for causal discovery The CausalCell workflow/platform is implemented using the Docker technique and Shiny language and consists of feature selection, causal discovery, and several auxiliary functions (Figure 1). A parallel version of the PC algorithm is used to realize parallel multi-task causal discovery (Le et al., 2019).
In addition, the platform also includes the GES, GSP, and DAGMA-nonlinear methods. PC and GSP can work with 10 CI tests. Annotations of functions and parameters and the detailed description of a causal discovery process are available online.
Data input and pre-processing scRNA-seq and proteomics data generated by different protocols or methods (e.g. 10x Genomics, Smart-seq2, and flow cytometry) can be analyzed. CausalCell accepts log2-transformed data and z-score data and can turn raw data into either of the two forms. A dataset (i.e. the 'case') can be analyzed with or without a control dataset (i.e. the 'control'). Researchers often identify and analyze special genes, such as highly expressed or differentially expressed genes. For each gene in a case and control, three attributes (the averaged expression value, percentage of expressed cells, and variance) are computed. Fold changes of gene expression are also computed (using the FindMarkers function in the Seurat package) if a control is uploaded. Genes can be ordered upon any attribute and filtered upon a combination of five conditions (i.e. expression value, percentage of expressed cells, variance, fold change, and being a TF or not). Since performing feature selection transcriptome-wide is unreliable due to too many genes, filtering genes before feature selection is necessary, and different filtering conditions generate different candidates for feature selection. Batch effects may influence identifying differentially expressed genes. Since removing batch effects should be performed with raw data before integrating batches and there are varied batch effect removal methods (Tran et al., 2020), it should be performed by the user if necessary.

Feature selection
Feature selection selects a set of genes (i.e. features) from the candidate genes upon one or multiple genes of interest (i.e. response variables). As above-mentioned, candidate genes are extracted from the whole dataset upon specific conditions because performing feature selection transcriptome-wide is unreliable. Based on the accuracy, time consumption, and scalability of the nine feature selection algorithms (Table 1), BAHSIC is the most recommended algorithm. The joint use of two kinds of algorithms (e.g. Random Forest+BAHSIC) is also recommended to ensure reliability. Feature genes are usually 50-70, but the number also depends on the causal discovery algorithms. Genes can be manually added to or removed from the result of feature selection (i.e., the feature gene list) to address a biological question specifically. The input for causal discovery can also be manually selected without performing feature selection; for example, the user can examine a specific Gene ontology (GO) term.

Causal discovery
The PC and GSP algorithms can work with the 10 CI tests to provide varied options for causal discovery. In the inferred causal networks, direction of edges is determined by the meek rules (Meek, 1997), and each edge has a sign indicating activation or repression and a thickness indicating CI test's statistical significance. The sign of an edge from A to B is determined by computing a Pearson correlation coefficient between A and B, which is 'repression' if the coefficient is negative or 'activation' if the coefficient is positive. In most situations, 'A activating B' and 'A repressing B' correspond to up-regulated A in the case dataset, with up-and down-regulated B in the case dataset compared with in the control dataset.
There are two ways to construct a consensus network that is statistically more reliable. One way is to run multiple algorithms (i.e. multiple CI tests) and take the intersection of some or all inferred networks as the consensus network ( Figure 2). The other is to run an algorithm multiple times and take the intersection of all inferred networks as the consensus network (Figure 3).
If a scRNA-seq dataset is large, a subset of cells should be sampled to avoid excessive time consumption. We suggest that 300 and 600 cells are suitable for reliable inference if the input is Smart-seq2 and 10x Genomics data, respectively, the input contains about 50 genes, and genes are expressed in >50% cells. Here, reliable inference means that key interactions (those with high CI test significance) are inferred (Appendices 3, 4). More cells are needed if the input genes are expressed in Figure 2. The accuracy of PC+nine CI tests was evaluated with four steps. First, nine causal networks were inferred using the nine CI tests. Second, pairwise structural Hamming distances (SHD) between these networks were computed, and the matrix of SHD values was transformed into a matrix of similarity values (using the equation Similarity = exp(-Distance/2σ 2 ), where σ=5). The networks of DCC.gamma, DCC.perm, HSIC.gamma, and HSIC. fewer cells and if the input contains >50 genes. Larger sample sizes (more cells) may make more interactions be inferred, but the key interactions are stable (Appendix 3- figure 3). As HSIC. perm and DCC. perm employ permutation to perform CI test, the networks inferred each time may be somewhat different. Our data analyses suggest that interactions inferred by running distance covariance criteria (DCC) algorithms multiple times are quite stable (Figure 3). Four parameters influence causal discovery. First, 'set the alpha level' determines the statistical significance cut-off of the CI test, and large and small values make more and fewer interactions be inferred. Second, 'select the number of cells' controls sample size, and selecting more cells makes the inference more reliable but also more time-consuming. Third, 'select how a subset of cells is sampled' determines how a subset of cells is sampled. If a subset is sampled randomly, the inferred network is not exactly reproducible (but by running multiple times, the inferred edges may show high consistency, see Figure 3). Fourth, 'set the size of conditional set' controls the size of conditional set when performing CI tests; it influences both network topology and time consumption and should be set with care. Since some CI tests are time-consuming and running causal discovery with multiple algorithms are especially time-consuming, providing an email address is necessary to make the result sent to the user automatically.
The performance of different PC+CI tests was intensively evaluated. First, we evaluated the accuracy, time consumption, sample requirement, and stability of PC+nine CI tests using simulated data and the non-small cell lung cancer (NSCLC) cell line H2228 and the normal lung alveolar cells (as the case and control) Travaglini et al., 2020). Comparing inferred networks with the perm share the highest similarity. Third, a consensus network was built using the networks of the above four CI tests, which was assumed to be closer to the ground truth than the network inferred by any single algorithm. Fourth, each of the nine networks was compared with the consensus network. (A) The cluster map shows the similarity values (darker colors indicating higher similarity). (B) Shared and specific interactions in each algorithm's network and the consensus network. In each panel, the gray-, green-, and pink-circled areas and numbers indicate the overlapping interactions, interactions identified specifically by the algorithm, and interactions specifically in the consensus network. There are 73 overlapping interactions between DCC.gamma's network and the consensus network, and 33 interactions were identified specifically by DCC.gamma. Thus, the true positive rate (TPR) of DCC. gamma is 73/ (73+33)=68.9%. The TPRs of DCC.perm,HSIC.gamma,HSIC.perm,GaussCItest,HSIC.clust,cmiKnn,RCIT,and RCoT are 70.2%,67.6%,68.9%,29.5%,61.8%,47.9%,57.1%,and 56.6%, indicating that the two distance covariance criteria (DCC) CI tests perform better than others.

Verification of causal discovery
We used the five NSCLC cell lines (A549, H1975, H2228, H838, and HCC827), the normal alveolar cells, and genes in specific pathways to validate network inference by PC+ DCC. gamma Travaglini et al., 2020). First, upon the combined conditions of (a) gene expression value >0.1, (b) gene expression in >50% cells, and (c) fold change >0.3, we identified highly and differentially expressed genes in each cell line against the alveolar cells. Second, we applied gene set enrichment analysis to differentially expressed genes in each cell line using the g:Profiler and GSEA programs. g:Profiler identified 'Metabolic reprogramming in colon cancer' (WP4290), 'Pyrimidine metabolism' (WP4022), and 'Nucleotide metabolism' (hsa01232) as enriched pathways in all cancer cell lines, and GSEA identified 'Non-small cell lung cancer' (hsa05223) as an enriched pathway in cancer cell lines ('WP' and 'hsa' indicate WikiPathways and KEGG pathways). Many studies reveal that glucose metabolism is reprogrammed and nucleotide synthesis is increased in cancer cells. Key features of reprogrammed glucose metabolism in cancer cells include increased glucose intake, increased lactate generation, and using the glycolysis/TCA cycle intermediates to synthesize nucleotides. The networks inferred by PC+ DCC. gamma capture these features despite of the absence of metabolites in these datasets. The networks of WP4022 also capture the key features of pyrimidine metabolism. In the networks of hsa05223, over 50% inferred interactions agree with pathway annotations. These results support network inference (Appendix 4).

Evaluating and ensuring the reliability
Single-cell data vary in quality and sample sizes; thus, it is important to effectively evaluate and ensure the reliability of network inference. Inspired by using RNA spike-in to measure RNA-seq quality (Jiang et al., 2011), we developed a method to evaluate and ensure the reliability of causal discovery. This method includes three steps: extracting the data of several well-known genes and their interactions from certain dataset as the 'spike-in' data, integrating the spike-in data into the case dataset, and applying causal discovery to the integrated dataset (the latter two steps are performed automatically when a spike-in dataset is chosen or uploaded). The user can choose a spike-in dataset in the platform or design and upload a spike-in dataset. In the inferred network, a clear separation of genes and their interactions in the spike-in dataset from genes and interactions in the case dataset is an indicator of reliable inference (Appendix 4-figure 1). Some public databases (e.g. the STRING database, https:// string-db.org/) can also be used to evaluate inferred interactions (Appendix 4-figures 2 and 3).

Results
The analysis of lung cancer cell lines and alveolar epithelial cells Down-regulated MHC-II genes help cancer cells avoid being recognized by immune cells (Rooney et al., 2015); thus, identifying genes and interactions involved in MHC-II gene down-regulation is important. To assess if causal discovery helps identify the related interactions, we examined the five NSCLC cancer cell lines (A549, H1975, H2228, H838, and HCC827) and the normal alveolar epithelial cells Travaglini et al., 2020). For each of the six datasets, we took the five MHC-II genes (HLA-DPA1, HLA-DPB1, HLA-DRA, HLA-DRB1, HLA-DRB5) as the response variables (genes of interest, hereafter also called target genes) and selected 50 feature genes (using BAHSIC, unless otherwise stated) from all genes expressed in >50% cells. Then, we applied the nine causal discovery algorithms to the 50 genes in 300 cells sampled from each of the datasets. The two DCC algorithms performed the best when processing the H2228 cells and lung alveolar epithelial cells (Appendix 5figures 1 and 2).
The inferred networks also show that down-regulated genes weakly but up-regulated genes strongly regulate downstream targets and that activation and repression lead to up-and downregulation of target genes. These features are biologically reasonable. Many inferred interactions, including those between MHC-II genes and CD74, between CXCL genes, and between MHC-I genes and B2M, are supported by the STRING database (http://string-db.org) and experimental findings ( Figure 4; Appendix 4-figure 2; Castro et al., 2019;Karakikes et al., 2012;Szklarczyk et al., 2021). An interesting finding is the PRDX1→TALDO1→HSP90AA1→NQO1→PSMC4 cascade in H2228 cells. Interactions between PRDX1/TALDO1/HSP90AA1 and NQO1 were reported (Mathew et al., 2013;Yin et al., 2021), but the interaction between NQO1 and PSMC4 was not. Previous findings on NQO1 include that it determines cellular sensitivity to the antitumor agent napabucasin in many cancer cell lines (Guo et al., 2020), is a potential poor prognostic biomarker, and is a promising therapeutic target for patients with lung cancers (Cheng et al., 2018;Siegel et al., 2012), and that mutations in NQO1 are associated with susceptibility to various forms of cancer. Previous findings on dcc.gamma PSMC4 include that high levels of PSMC4 (and other PSMC) transcripts were positively correlated with poor breast cancer survival (Kao et al., 2021). Thus, the inferred NQO1→PSMC4 probably somewhat explains the mechanism behind these experimental findings.

The analysis of macrophages isolated from glioblastoma
Macrophages critically influence glioma formation, maintenance, and progression (Gutmann, 2020), and CD74 is the master regulator of macrophage functions in glioblastoma (Alban et al., 2020;Quail and Joyce, 2017;Zeiner et al., 2015). To examine the function of CD74 in macrophages in gliomas, we used CD74 as the target gene and selected 50 genes from genes expressed in >50% of macrophages isolated from glioblastoma patients (Neftel et al., 2019). In the networks of DCC algorithms (Appendix 5- figure 3), CD74 regulates MHC-II genes, agreeing with the finding that CD74 is an MHC-II chaperone and plays a role in the intracellular sorting of MHC class II molecules. The network includes interactions between C1QA/B/C, agreeing that they form the complement C1q complex. The identified TYROBP→TREM2→A2M→APOE→APOC1 cascade is supported by the reports that TREM2 is expressed in tumor macrophages in over 200 human cancer cases (Molgora et al., 2020) and that there are interactions between TREM2/A2M, TREM2/APOE, A2M/APOE, and APOE/APOC1 (Krasemann et al., 2017).

The analysis of tumor-infiltrating exhausted CD8 T cells
Tumor-infiltrating exhausted CD8 T cells are highly heterogeneous yet share common differentially expressed genes (McLane et al., 2019;Zhang et al., 2018), suggesting that CD8 T cells undergo different processes to reach exhaustion. We analyzed three exhausted CD8 T datasets isolated from human liver, colorectal, and lung cancers (Appendix 5-figure 4; Guo et al., 2018;Zhang et al., 2018;Zheng et al., 2017). A key feature of CD8 T cell exhaustion identified in mice is PDCD1 up-regulation by TOX (Khan et al., 2019;Scott et al., 2019;Seo et al., 2019). Using TOX and PDCD1 as the target gene, we selected 50 genes expressed in >50% exhausted CD8 T cells and 50 genes expressed in >50% non-exhausted CD8 T cells, respectively. Transcriptional regulation of PDCD1 by TOX was observed in LCMV-infected mice without mentioning any role of CXCL13 (Khan et al., 2019). Here, indirect TOX→PDCD1 (via genes such as CXCL13) was inferred in exhausted CD8 cells, and direct TOX→PDCD1 was inferred in non-exhausted CD8 T cells (although the expression of TOX and PDCD1 is low in these cells) (Appendix 5- figure 4). Recently, CXCL13 was found to play a critical role in T cells for effective responses to anti-PD-L1 therapies . The causal discovery results help reveal differences in CD8 T cell exhaustion between humans and mice and under different pathological conditions. The PDCD1→TOX inferred in exhausted and non-exhausted CD8 T cells may indicate some feedback between TOX and PDCD1, as on the proteome level, a study reported that the binding of PD1 to TOX in the cytoplasm facilitates the endocytic recycling of PD1 .

Identifying genes and inferring interactions that signify CD4 T cell aging
How immune cells age and whether some senescence signatures reflect the aging of all cell types draw wide attention (Gorgoulis et al., 2019). We analyzed gene expression in naive, TEM, rTreg, naive_ Isg15, cytotoxic, and exhausted CD4 T cells from young (2-3 months, n=4) and old (22-24 months, n=4) mice (Appendix 5-figures 5; Elyahu et al., 2019). For each cell type, we compared the combined data from all four young mice with the data from each old mouse to identify differentially expressed genes. If genes were expressed in >25% cells and consistently up/down-regulated (|fold change|>0) in most of the 24 comparisons, we assumed them as aging-related (Appendix 5table 1). Some of these identified genes play important roles in the aging of T cells or other cells, such as the mitochondrial genes encoding cytochrome c oxidases and the gene Sub1 in the mTOR pathway (Bektas et al., 2019;Gorgoulis et al., 2019;Goronzy and Weyand, 2019;Walters and Cox, 2021). We directly used these genes, plus one CD4-specific biomarker (Cd28) and two reported aging biomarkers (Cdkn1b, Cdkn2d) (Gorgoulis et al., 2019;Larbi and Fulop, 2014), as feature genes to infer their interactions in different CD4 T cells in young and old mice. The inferred causal networks unveil multiple findings (Appendix 5- figure 5). First, B2m→H2-Q7 (a mouse MHC class I gene), Gm9843→Rps27rt (Gm9846), and the interactions between the five mitochondrial genes (MT-ATP6, MT-CO1/2/3, and MT-Nd1) were inferred in nearly all CD4 T cells. Second, many interactions are supported by the STRING database (Appendix 4- figure 3). Third, some interactions agree with experimental findings, including Sub1-|Lamtor2  and the regulations of these mitochondrial genes by Lamtor2 (Morita et al., 2017). Fourth, Gm9843→Rps27rt→Junb were inferred in multiple CD4 T cells (both Gm9843 and Rps27rt are mouse-specific). Since JUNB belongs to the AP-1 family TFs that are increased in all immune cells during human aging , Gm9843→Rps27rt→Junb could highlight a counterpart regulation of JUNB in human immune cells.

Discussion
Single-cell causal discovery Various methods have been proposed to infer interactions between variables from observational data. As surveyed recently (Nguyen et al., 2021;Pratapa et al., 2020), many methods assume linear relationships between variables and the Gaussian distribution of data. These assumptions enable these methods to run fast, handle many genes and even perform transcriptome-wide prediction. However, our algorithm benchmarking results suggest that networks inferred by fast methods with these assumptions should be concerned. Causal discovery infers causal interactions directly upon observations of variables without assuming relationships between variables and the distribution of data. Because genes and molecules have varied relationships in different cells, causal discovery better satisfies inferring their interactions than other methods. Causal discovery methods have reviewed recently (Glymour et al., 2019;Yuan and Shou, 2022), but workflows and platforms integrating multiple methods for analyzing scRNA-seq data remain rare.
Our integration and benchmarking of multiple methods (note that these methods are not for inferring causal relationships from temporal data) and analysis of multiple datasets generate several conclusions. First, although kernel-based CI tests are time-consuming (Shah and Peters, 2020), applying them to a set of genes is feasible. A set of genes can be generated by feature selection, by gene set enrichment analysis, or by manual selection. Second, the cost of time consumption pays off in network accuracy, as the most time-consuming CI tests generate the most reliable results. Thus, trade-offs between time consumption, network size, and network accuracy should be made. Third, causal discovery can infer signaling networks or gene regulatory networks, depending on the input. If genes encoding TFs and their targets are the input, gene regulatory networks are inferred. Fourth, dropouts and noises in scRNA-seq data concern researchers and trouble correlation analysis Mohan and Pearl, 2018;Tu et al., 2019), but can be well tolerated by PC+kernel-based CI tests if samples are sufficiently large. Finally, using 'spike-in' data can effectively evaluate the reliability of causal discovery.

Challenges of data analysis
Single-cell causal discovery also faces several challenges. First, causal discovery assumes there are no unmeasured common causes (the causal sufficiency assumption), but in real data latent and unobserved variables are common and hard to identify. Specifically, inferring interactions between highly expressed or differentially expressed genes is a case of causal discovery with incomplete models (i.e. models with missing variables from the data-generating model). In this situation, what are inferred are indirect relationships instead of direct interactions between gene products. Second, constraintbased methods cannot differentiate networks belonging to a Markov equivalent class (the causal Markov assumption). This can be solved partly by combined use of PC and DAGMA-nonlinear (which can better determine the direction of edges). Third, the following examples indicate that the lack of relevant information makes judging inferred interactions and relationships difficult. (a) TOX is reported to activate PDCD1 in exhausted CD8 T cells in mice (Khan et al., 2019), but whether CXCL13 is involved in (or required for) the TOX-PDCD1 interaction in exhausted CD8 T cells in humans is unclear, until recently CXCL13 is reported to play critical roles in T cells for effective responses to anti-PD-L1 therapies . (b) The differences in inferred networks in exhausted CD8 T cells from different cancers are puzzling, until a recent study reports that exhausted CD8 T cells show high heterogeneity and exhaustion can follow different paths . (c) It is difficult to explain multiple genes encoding ribosomal proteins in the inferred networks in CD4 cells from old mice, until a recent study reports that aging impairs ribosomes' ability to synthesize proteins efficiently (Stein et al., 2022).

Limitations of the study
The time consumption of kernel-based CI tests disallows inferring large networks, and how this challenge can be solved remains unsolved. C codes may be developed to replace the most timeconsuming parts of the R functions, but this has not been done.

Tips for best practices
First, exploring different biological modules or processes needs careful selection of genes ( Figure 5). When it is unclear what genes are most relevant to one or several target genes, it is advisable to run multiple rounds of feature selection using different combination of target genes as response variable(s). Second, when feature genes are identified by gene set enrichment analysis or upon highly expressed genes, PC+kernel-based CI tests perform better than continuous optimization-based methods, and the inferred networks consist more likely of indirect causal relationships instead of direct causal interactions. Third, BAHSIC and SHS are the best feature selection algorithms. Since selecting feature genes from too many candidates is unreliable, filtering genes upon specific conditions (e.g. expression values, expressed cells, fold changes) is necessary. Fourth, DCC. gamma and DCC. perm are the best CI tests working with PC. When building consensus networks, it is advisable to use the results of just DCC CI tests. Fifth, trade-offs between scale, reliability, and accuracy are inevitable. When examining many genes, RCIT/RCoT may be proper, and when examining large datasets, sub-sampling is necessary. For Smart-seq2 and 10x Genomics datasets, 300 and 600 cells are recommended for analyzing 50-60 genes expressed in >50% of cells. More cells are needed if more genes are selected and/or selected genes are expressed in fewer cells (e.g. 25%). Sixth, when it is unclear if a sub-sampled dataset is large enough, repeat causal discovery several times using different sizes of sub-samples. If the inferred networks are similar, the sub-samples should be sufficient. Seventh, using "spike-in" datasets helps measure and ensure reliability. Eighth, carefully inspect the potential influence of cell heterogeneity on causal discovery, and caution is needed when interpreting the results of heterogeneous cells.

Data availability
Only public data were used. Links to all data are provided in the manuscript.
The following previously published datasets were used: Appendix 1 Overview of data and algorithms

Algorithms and datasets
We combined feature selection and causal discovery to infer causal interactions among a set of gene products in single cells. We used synthetic data, semi-synthetic data, real scRNA-seq data, and flow cytometry data to benchmark nine feature selection algorithms and nine causal discovery algorithms (Appendix 1-tables 1 and 2; Appendix 1- figure 1).
Appendix 1-figure 1. Overview of data and benchmarking. (A) The single-cell RNA-sequencing (scRNA-seq) data were generated by different protocols and from different cell types (Appendix 1-table 1). (B) Illustration of feature selection benchmarking using data of microglia from humans and mice. Steps: (i) choose a target gene from a list of microglia biomarkers; (ii) let each algorithm select 50 genes from 3000 candidates expressed in most cells; (iii) merge the nine sets of feature genes into a superset; (iv) compare each selected feature gene set with the superset. (C) Illustration of causal discovery benchmarking using a set of feature genes. Steps: (i) use nine Appendix 1-figure 1 continued on next page algorithms (PC+CI tests) to generate nine causal networks; (ii) generate a type 1 consensus network upon the networks of multiple algorithms (a type 2 consensus network is generated upon running an algorithm multiple times); (iii) compare each causal network with the consensus network.
Appendix 2. Synthetic data for feature selection Fully synthetic dataset N variables (indicating candidate genes) without a specific pattern were generated randomly from a (0, 2) uniform distribution, from which n variables (indicating feature genes) were selected randomly to synthesize a response variable. Each feature influences the response variable depending randomly on one of the five functions: y=x 2 , y=sin(x), y=cos(x), y=tanh(x), and y=e x . By combining different numbers of feature genes and candidate genes (4-50, 8-100, 8-200, 20-500, 20-1000, and 50-2000) and generating samples of different sizes (100, 200, 500, 1000, and 2000), we generated 30 schemes. For each algorithm, we ran each scheme 10 times, and each time the true positive rate (TPR) was calculated by: A TPR of 1.0 means that the feature selection algorithm completely correctly selects the features; a small TPR indicates poor performance.

Semi-synthetic dataset
First, we extracted genes from a benchmark scRNA-seq dataset (https://support.10xgenomics.com/ single-cell-gene-expression/datasets/4.0.0/Parent_NGSC3_DI_PBMC), sorted these genes based on the cells in which they were expressed (expression level >0), and obtained the top 5000 genes expressed in most cells. Then, we obtained the top 5000 cells that contained the most expressed genes. These genes and cells formed a 5000*5000 matrix. Different candidate gene sets were sampled from this matrix, and different feature gene sets were selected randomly from each candidate gene set. Next, the response variables (target genes) were synthesized using feature genes.

Synthetic data for causal discovery
We used the randomDAG function in the pcalg package (https://cran.r-project.org/web/packages/ pcalg/index.html) to generate DAGs with random topologies. Values of nodes (i.e. genes) in these DAGs were randomly generated using the following 10 functions that determined relationships between nodes: Appendix 2 Feature selection algorithms and benchmarking 1. Ensemble learning-based algorithms Random forests We used the RandomForestRegressor function (with default parameters) in the sklearn package (https://scikit-learn.org/stable/) to build random forest models (Breiman, 2001). Each model contained 200 decision trees. After regression based on the response variable(s), genes were sorted based on Gini importance, and the top genes were selected as feature genes.

Extremely randomized trees
We used the ExtraTreesRegressor function (with default parameters) in the sklearn package (https:// scikit-learn.org/stable/) to generate extremely randomized trees (Geurts et al., 2006). Each tree model contained 200 decision trees. After regression based on the response variable(s), genes were sorted based on Gini importance, and the top genes were selected as feature genes.

XGBoost
We used the XGBRegressor function (with default parameters) in the Scikit-Learn API (https:// xgboost.readthedocs.io/en/latest/python/python_api.html) to build the XGBoost models (Chen and Guestrin, 2016). Each XGBoost model contained 200 decision trees. After regression based on the response variable(s), genes were sorted based on Gini importance, and the top genes were selected as feature genes.

Regularization-based algorithms Lasso
We used the Lasso function (with default parameters) in the sklearn package (https://scikit-learn. org/stable/) to produce the regression models. In the Lasso (least absolute shrinkage and selection operator) regression equation (Tibshirani, 1997): N is the number of samples, p is the number of features, β j is the coefficient of the j th feature, and λ (by default λ=0.5) is a penalty coefficient controlling the shrinkage. Feature genes were selected based on the value of |β j | , which indicates the importance of the j th feature for the response variable(s).

Ridge regression
We used the Ridge function (with default parameters) in the sklearn package (https://scikit-learn. org/stable/) to build Ridge repression models. The equation of Ridge regression is similar to that of Lasso (Hoerl and Kennard, 2000): Feature genes were selected based on the value of |β j | , which indicates the importance of the j th feature for the response variable(s).

Elastic net
We used the ElasticNet function (with default parameters) in the sklearn package (https://scikitlearn.org/stable/) to build elastic net models. Elastic net linearly combines the L1 and L2 penalties of the Lasso and Ridge methods using the following equation (by default λ=1 and α=0.5) (Zou and Hastie, 2005). In the equation: feature genes are selected upon the value of |β j | , which indicates the importance of the j th feature for the response variable(s).

HSIC-based algorithms BAHSIC
Hilbert-Schmidt independence criterion (HSIC) is a measure of dependency between two variables (Gretton et al., 2005). After obtaining a measure between a response variable and a feature, a backward elimination process is used to extract a subset of features that are most relevant to the response variable (Song et al., 2007). We used the BAHSIC program (https://www.cc.gatech. edu/~lsong/code.html), together with the nonlinear radial basis function kernel, to evaluate the dependency between feature genes and response variable(s), and set the parameter flg3 =

SHS
Sparse HSIC (SHS), which combines HSIC with fast sparse decomposition of matrices, is an HSICbased feature selection algorithm without the backward elimination process to identify a sparse projection of all features (Gangeh et al., 2017). We translated the SHS program encoded in MATLAB (https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/shs.zip) into a Python program and used eigenvalue decomposition as the matrix halving procedure. The parameters γ=1.1 (a penalty parameter that controls the sparsity of the solution) and ρ = 0.1 .

Block HSIC Lasso
HSIC Lasso is a variant of the minimum redundancy maximum relevance feature selection algorithm and is suitable for high-dimensional small sample data. We used the pyHSICLasso program (https:// github.com/riken-aip/pyHSICLasso; Yamada et al., 2014), which is an approximation of HSIC Lasso but reduces memory usage dramatically while retaining the properties of HSIC Lasso (Climente-González et al., 2019). We used the function get_index_score() to compute feature importance and the function get_features() to return top feature genes.

Benchmarking results
We evaluated the time consumption, accuracy, and scalability of nine feature selection algorithms in three categories (Appendix 1-tables 1 and 2). First, tested using synthesized data, all algorithms showed moderate time consumption, which increased insignificantly when the sample size increased (Appendix 2- figure 1). Second, using synthesized data, multiple algorithms selected all features correctly if schemes were simple (e.g. selecting 4 features from 50 candidates). If features and/or candidates increased (e.g. selecting 50 features from 2000 candidates), BAHSIC showed the best performance, with accuracy decreasing more slowly than others (Appendix 2- figure 2). Third, using well-known microglial biomarkers in humans and mice (Butovsky et al., 2014;Patir et al., 2019) and using scRNA-seq data of microglia from the human and mouse brain (Geirsdottir et al., 2019), we further evaluated feature selection algorithms' accuracy. We merged feature genes generated by the nine algorithms into a superset (Appendix 1-figure 1B), identified a subset generated by the majority of algorithms, and examined how many feature genes of each algorithm overlap with the subset. When selecting 50 genes from 3000 candidates upon a target gene (e.g. Hexb), BAHSIC and SHS were the best and second-best algorithms, and they also selected most microglia biomarkers (Appendix 2-figures 3 and 4). Fourth, we used real scRNA-seq data in applications to evaluate algorithms' accuracy and found that BAHSIC also performs well. Finally, to evaluate algorithms' scalability, we let algorithms select feature genes from different numbers of candidate genes. When the number of candidate genes is large (>10,000), the accuracy of feature selection is somewhat decreased. BAHSIC's performance was examined further using macrophages isolated from human glioblastoma by checking whether the selected feature genes accurately characterize macrophages (Neftel et al., 2019). We used six macrophage biomarkers (CD14, AIF1, FCER1G, FCGR3A, TYROBP, and CSF1) exclusively expressed in these macrophages as the target genes (response variables) and used BAHSIC to select 50 feature genes from 3000 candidate genes expressed in >50% macrophages (Appendix 2- figure 5). Nearly all feature genes were expressed exclusively in the macrophages (Appendix 2- figure 6). Experimental findings support many feature genes. C1QA/B/C and C3 are supported by the finding that C1Q is produced and the complement cascade is up-regulated in cancer-infiltrated macrophages . CD74 and MHC-II genes are supported by the finding that CD74 is correlated with malignancies and the immune microenvironment in gliomas (Xu et al., 2021). TREM2 and APOE are supported by the finding that highly expressed TREM2 and APOC2 in macrophages contribute to immune checkpoint therapy resistance (Xiong et al., 2020). MS4A4A and MS4A6A are supported by the finding that APOE and TREM2 are up-regulated by MS4A (Deming et al., 2019). In contrast, feature genes selected by RidgeRegression upon the same six biomarkers were expressed in diverse cells (Appendix 2-figure 7). These confirm that BAHSIC can quite reliably select feature genes upon target genes.
Appendix 2-figure 1. Time consumption of feature selection algorithms increased mildly when the sample size increased from 1000 to 10,000 (the X-axis indicates the sample size; 1-10 indicate 1000-10,000, respectively Appendix 2- figure 3. This screenshot showed a feature selection result (the superset of feature genes selected by ≥3 algorithms from the mouse microglia dataset) when all 9 algorithms were used. The Hexb gene was the target gene, and each algorithm selected 50 feature genes from the 3000 candidate genes expressed in most cells. BAHSIC, SHS, RandomForest, ExtraTrees, ElasticNet selected highly overlapping feature genes, many of which are microglia biomarkers in mice (Appendix 2-figure 4). The numbers right side of algorithm names indicate genes overlapping with the superset. Appendix 2-figure 6. These tSNE plots show that nearly all of the 50 feature genes were exclusively expressed in macrophages. These feature genes were selected by BAHSIC using six macrophage biomarkers (CD14, AIF1, FCER1G, FCGR3A, TYROBP, and CSF1R) as the target genes. They include genes involved in macrophage activation (e.g. C1QA, CD74, TREM2) and multiple class II major histocompatibility complex (MHC) genes (e.g. HLA-DMA, HLA-DPA1). The interactions between CD74 and MHC-II genes (CD74 is an MHC class II chaperone) probably contribute to the co-selection of these genes. Appendix 3 Causal discovery algorithms and benchmarking 1. Causal discovery methods CausalCell integrates four causal discovery methods -PC, GES, GSP, and DAGMA-nonliearwhich are representative constraint-based, score-based, hybrid, and continuous optimizationbased methods. Constraint-based methods identify causal interactions in a set of variables in two steps: skeleton estimation and orientation. Score-based methods assign a score function (e.g. the Bayesian information criterion) to each potential causal network and optimize the score via greedy approaches. Hybrid methods combine score-based methods and CI tests. Continuous optimizationbased methods recast the combinatoric graph search problem as a continuous optimization problem.
The PC and GSP algorithms can be combined with different CI tests. To benchmark the performance of different CI tests, we combined 10 CI tests with the parallel version of the PC algorithm (i.e. the pc function in the R package pcalg, with the default setting skel. method="stable") (Le et al., 2019). The results show that kernel-based CI tests (especially the two DCC CI tests) outperform other CI tests (Appendix 3-table 1; Appendix 3-figures 1 and 2). To evaluate the score-based and hybrid methods GES (https://cran.r-project.org/web/packages/pcalg/ index.html) and GSP (https://github.com/uhlerlab/causaldag; Chickering, 2003;Solus et al., 2021;Squires, 2018), we compared PC+ DCC. gamma, GES, and GSP+ DCC. gamma. The results show that PC+ DCC. gamma and GSP+ DCC. gamma have comparable network accuracy and time consumption, and both are more accurate but more time-consuming than GES (Appendix 3-figures 3-6). Appendix 3-figure 1. Causal discovery performance that was tested using synthetic data. DCC. perm and HSIC. gamma are the most time-consuming algorithms, but all algorithms have similar accuracy. The sample size is 500 (AB) or 1000 (CD), and the variable number ranges from 20 to 80. (AC) show the time consumption (in second) of different algorithms, and (BD) show structural Hamming distance (SHD) values. '*' in (CD) indicates that the algorithm did not finish running in 6 weeks. These two cases, and the two cases in (A) where DCC. perm and DCC. gamma took more time when there were 60 variables than when there were 80 variables, were anomalies caused by synthetic data. When testing using real scRNA-seq data, no such anomalies occurred.  Appendix 3-figure 6. The causal relationships between genes in the WP4290 pathway in the cell line H838 inferred using 800 cells (alpha = 0.1). More cells make more relationships be inferred, but relationships with high significance (with thick arrows) are stable.

Partial correlation-based CI test GaussCItest
Gauss CI test examines CI using partial correlation, assuming that all variables are multivariate Gaussian. The partial correlation coefficient ρ XYZ is zero if and only if X is conditionally independent of Y given Z (Kunihiro et al., 2004). H 0 is ρ XYZ = 0 , H 1 is ρ XYZ ̸ = 0 , and a hypothesis test (p<0.05) decides whether two variables are conditionally independent given Z. We used the gaussCItest function in the R package pcalg with default parameters (https://cran.r-project.org/web/packages/ pcalg/index.html).

HSIC-based CI test
HSIC is a measure of dependency between two variables; HSIC ( X, Y ) = 0 if X and Y are unconditionally independent. Performing two extra transformations can determine if X and Y are conditionally independent given the conditioning set Z: first, performing nonlinear regressions for X and Z and for Y and Z, respectively, to generate the residuals X resid and Y resid based on Z; then, calculating HSIC ( X resid , Y resid ) that indicates whether X and Y are conditionally independent given the conditioning set Z ( X ⊥ ⊥ Y|Z ) (Verbyla et al., 2017). We used the gam() function in the R package mgcv to build the nonlinear regression model and used the three HSIC-based functions (with default parameters unless otherwise specified) in the R package kpcalg (https://cran.r-project.org/web/ packages/kpcalg/index.html) to perform CI test.

hsic. perm
In practice, HSIC ( X, Y ) may be slightly larger than 0.0 when X and Y are independent, making it hard to judge whether X and Y are independent. hsic. perm uses a permutation test to solve this problem by assuming that permuting Y removes any dependency between X and Y. We used the hsic. perm function to permute Y 100 times to calculate HSIC ( X, Yperm ) , then we compared them with HSIC ( X, Y ) . The p-value was the fraction of times HSIC ( X, Yperm ) was smaller than the HSIC ( X, Y ) .

hsic. gamma
We used the hsic. gamma function to fit a gamma distribution: Gamma(α, θ) of the HSIC under the null hypothesis. The shape parameter α and the scale parameter θ were calculated using the equation: A p-value was obtained as an upper-tail quantile of HSIC (X, Y).

hsic. clust
First, samples were clustered using the R function kmeans() by calculating the Euclidean distance between the Z coordinates of samples; then, Y was permutated based on the clustered Z. Within each Z, cluster Yperm was generated, ensuring that the permuted samples break the dependency between X and Y but retain the dependency between X and Y on Z. After permutation, a p-value was calculated to make a statistical decision.

Distance covariance-based CI test
Distance covariance is an alternative to HSIC for measuring independence (Székely and Rizzo, 2009;Székely et al., 2007). We used two DCC-based functions dcc. perm and dcc. gamma (with default parameters) in the R package kpcalg (https://cran.r-project.org/web/packages/kpcalg/index.html) to perform CI test. Similar to HSIC-based algorithms, the two functions directly calculate DCC ( X, Y ) for an UI test, then, the nonlinear regression is performed, next, DCC ( X resid , Y resid ) is calculated for a CI test and a statistical decision (Verbyla et al., 2017).

dcc. perm
This program is similar to hsic. perm and uses a permutation test to estimate a p-value. The DCC statistic is calculated in each permutation, and finally, a statistical decision is made based on the pvalue. We used the dcov. test function (with default parameters) in the R package energy to calculate the statistic DCC in the permutation test. The p-value was the fraction of times that DCC(X, Y perm ) was smaller than DCC(X,Y).

dcc. gamma
Similar to hsic. gamma, dcc. gamma uses the gamma distribution Gamma(α, θ) of the DCC under the null hypothesis. The two parameters were estimated by We used the dcov. gamma function (with default parameters) in the R package kpcalg to calculate the p-value. The p-value was obtained as an upper-tail quantile of DCC(X, Y).

Approximation of KCIT
The KCIT is another powerful CI test (Zhang and Peters, 2011), but it is time-consuming for large datasets. Based on random Fourier features (Rahimi and Recht, 2007), two approximation methods (randomized conditional independence test, RCIT, and randomized conditional correlation test, RCoT) were proposed . RCIT and RCoT approximate KCIT by sampling Fourier features, return p-values orders of magnitude faster than KCIT when the sample size is large, and may also estimate the null distribution more accurately than KCIT.

RCoT
RCoT often outperforms RCIT, especially when the size of the conditioning set is greater than or equal to 4. We used the RCoT function in the R package RCIT (with default parameters) (https:// github.com/ericstrobl/RCIT; Strobl, 2019) to implement the RCoT.

Conditional mutual information-based CI test
Mutual information is used to measure mutual dependence between two variables. Conditional mutual information (CMI) is a measure based on mutual information, which is zero if and only if X ⊥ ⊥ Y|Z .

CMIknn
CMIknn is a program that combines CMI with a local permutation scheme determined by the nearestneighbor approach (Runge, 2018). We used the Python package tigramite (with default parameters) (http://github.com/jakobrunge/tigramite; Runge, 2020) to perform the CI test.
7. CI test based on generalized covariance measure GCM GCM (https://cran.r-project.org/web/packages/GeneralisedCovarianceMeasure/index.html; Peters and Shah, 2022) is a CI test based on generalized covariance measure. It is also classified as a regression-based CI test because it is based on a suitably normalized version of the empirical covariance between the residual vectors from the regressions (Shah and Peters, 2020).

Benchmarking results
The time consumption, accuracy, sample requirement, and stability of the PC+ nine CI tests were evaluated (Appendix 3-table 1). First, we simulated eight datasets with known causal networks, whose variable numbers and sample sizes ranged from 20 to 80 (step = 20) and 500 to 1000 (step = 500), respectively, to evaluate causal discovery algorithms' time consumption, scalability, and accuracy. Algorithms based on the DCC kernel were more time-consuming than others (Appendix 3- figure  1A,C). Algorithms' accuracy was assessed based on the structural Hamming distance (SHD) between the inferred and the true networks (SHD = 0 indicates no difference). The networks of all algorithms showed similar SHD when the sample size was 500 (Appendix 3- figure 1B); the close performance was probably because synthetic data were generated using a few simple functions. When the sample size was increased from 500 to 1000, time consumption increased (but was not doubled), but SHD did not decrease (i.e. algorithms' performance did not increase) significantly, indicating that 500 cells may be adequate for causal discovery (Appendix 3-figure 1D).
Second, to further evaluate algorithms' accuracy, for each feature gene set, we merged causal networks generated by multiple good algorithms into a consensus network (multi-algorithm-based consensus network), then compared the network of each algorithm with the consensus network (Main text- Figure 2; Appendix 3-figure 2). We used the SHD to define the difference between two networks, and the network with the shortest SHD with the consensus network is assumed to be the most accurate.
Third, to evaluate the impact of sample size on algorithms' performance, we ran the nine algorithms using 200 (instead of 300) H2228 cells. The results of 200 cells were poorer than the results of 300 cells (compared with the consensus network in Main text- Figure 2 and Appendix 3- figure 2). Still, the two DCC algorithms performed the best and were less sensitive to the decreased sample size than the two HSIC algorithms. We also inferred interactions between genes in the 'Metabolic reprogramming in colon cancer' (WP4290) pathway using 200, 400, 600, and 800 cells in the H838 (Appendix 3-figures 3-6). We find that more cells make more interactions be inferred, but the interactions with high significance are quite stable.
Fourth, to evaluate algorithms' stability, we used the H2228 dataset to run the nine algorithms five times and estimated each algorithm's stability by computing the mean relative SHD of the five networks. The networks of gaussCItest have the smallest mean relative SHD and the networks of HSIC. perm, HSIC. clust, and DCC. perm have the largest mean relative SHD (Appendix 3-table 1). As DCC. perm and DCC. gamma are the most accurate algorithms, we examined whether their stability impairs their accuracy by checking the distribution of interactions in the five networks. DCC. gamma and DCC. perm inferred 127 and 143 interactions, 78% and 64.3% occurred stably in ≥4 networks, and many inconsistent interactions occurred in just one network (Main text- Figure 3), indicating that most interactions were stably inferred in multiple running. The networks of multiple running can be merged into a consensus network (multi-running-based consensus network), which can be used to examine which algorithm generated the most consistent networks.
where N is the number of nodes and deg is the maximal degree. † Time-consuming levels are estimated upon simulated data (Appendix 3-figure 1). ‡ Accuracy is estimated upon the lung cancer cell lines (Main text- Figure 2; Appendix 3-figure 2). We performed causal discovery using the nine algorithms five times for the H2228 cell line and obtained 9*5=45 causal networks. § We estimated the stability of each algorithm's performance by computing the mean relative SHD for the five causal networks the algorithm generated using the equation:

Evaluating the reliability and verifying causal discovery results
We evaluated the reliability of causal discovery by examining whether algorithms can differentiate interactions between genes in different cells. Inspired by using RNA spike-in to measure RNAseq quality, we extracted the data of six MHC-II-related genes (HLA-DRB1, HLA-DMA, HLA-DRA, HLA-DPA, CD74, C3, which have the suffix _si to mark them) from the macrophage dataset (generated by Smart-seq2 sequencing) and the alveolar epithelial cell dataset (generated by 10x Genomics) to form two spike-in datasets. We mixed the spike-in dataset with the dataset of exhausted CD8 T cells and examined if the causal discovery was able to separate MHC-II genes and their interactions in the spike-in dataset from feature genes and their interactions in the exhausted CD8 T dataset. When the datasets contain sufficient cells (usually >300), the two DCC algorithms can discriminate genes and interactions in the two datasets quite well (Appendix 4-figures 1 and 2), indicating the power of causal discovery based on kernel-based CI tests. The inferred causal interactions can be verified using annotated protein interactions in the STRING database (https://string-db.org/). The results of our application cases indicate that many inferred interactions are supported by annotated protein interactions in the STRING database (Appendix 4-figures 3 and 4). (Appendix 4- figure 5). We also performed GO analysis using the GSEA package, which identified the KEGG pathway 'Non-small cell lung cancer' (hsa05223) as an enriched pathway in cancer cell lines (note that these lung cancer cell lines were derived from NSCLC). We used the PC+ DCC. gamma to infer interactions among genes in the three pathways in the five cancer cell lines and the alveolar cells. First, we examined the 'Metabolic reprogramming in colon cancer' (WP4290) pathway (Appendix 4-figures 6-9). Numerous studies report that glucose metabolism is reprogrammed and nucleotides synthesis is increased in cancer cells. Thus, we first examined and compared the WP4290 pathway in the five lung cancer cell lines and lung alveolar cells. The key features of the reprogrammed glucose metabolism are that (a) glucose intake is increased, (b) the glycolysis/TCA cycle intermediates are used for synthesizing nucleotide, (c) lactate generation is increased. The inferred networks capture these features. (a) multiple activations of SLC2A1 (which encodes a major glucose transporter and controls glucose intake), PGD (which promotes glucose metabolism into the pentose phosphate shunt), PSAT1 (which encodes a phosphoserine aminotransferase that catalyzes the reversible conversion of 3-phosphohydroxypyruvate to phosphoserine), and LDHA (whose protein catalyzes the conversion of pyruvate to lactate) are inferred in all cancer cell lines but not in alveolar cells. (b) Many activations of genes by downstream genes are inferred, and this sort of feedback regulations is an intrinsic feature of metabolism. Especially, the controlling factor SLC2A1 is activated by multiple genes. (c) In contrast, none of these features occur in the alveolar cells (partly due to key genes such as SLC2A1 is not expressed). These inferred results are literaturesupported and biologically reasonable, despite that the causal inference is flawed by the absence of metabolites in the data. Appendix 4-figure 6. The causal relationships between genes in the WP4290 pathway in the cell line A549 inferred using PC+ DCC. gamma. The inference is flawed because the true network contains both gene Appendix 4- figure 6 continued on next page products and metabolites but single-cell RNA-sequencing (scRNA-seq) data do not contain metabolites. Nevertheless, Appendix 4-figures 6-8 show that multiple inferred interactions reasonably reveal key features of reprogrammed glucose metabolism. First, shared interactions in ≥2 datasets are 21.12%, 58.82%, 50.51%, 40.82%, 53.09%, and 50.0% in alveolar cells, H838 cells, H2228 cells, HCC827 cells, H1975 cells, and A549 cells, indicating that causal inference differentiates glucose metabolism in cancer cells from in alveolar cells. Second, the inferred networks reflect key features of reprogrammed glucose metabolism, especially the activation of SLC2A1 (which encodes a major glucose transporter and controls glucose intake), PGD (which promotes glucose metabolism toward nucleotide synthesis), PSAT1 (which promotes glucose metabolism toward nucleotide synthesis), and LDHA (which promotes glucose metabolism toward lactate generation) in cancer cell lines. Third, we examined the 'Non-small cell lung cancer' (hsa05223) pathway (Appendix 4-table 1; Appendix 4- figure 14). We used the 'graphite' R package to turn hsa05223 into an adjacency matrix and mapped inferred interactions to the matrix. If an interaction can be mapped to an edge or a path with any directions (forward, inverse, or undirected) in hsa05223, it was assumed mapped to the pathway. hsa05223 contains sub-pathways such as p53 signaling pathway and PI3K-AKT pathway, therefore there are considerable epistatic interactions that are not annotated in hsa05223. Also, synergistic interactions (e.g. CDKN1A->BAX and EGFR->MET, see Dong et al., 2019;Wang et al., 2014), and many of which are literature-supported but not annotated. We additionally examined hsa05223 and sub-pathways wherein manually and found that many inferred interactions can be mapped to epistatic and synergistic interactions. Taken together, in each cell line, about 50% of inferred interactions can be mapped to the pathway. Note that this is the result without considering feedback regulations by TFs. For example, many EGF1-related interactions were inferred (e.g. E2F1->EGFR and RB1->ERBB2), but these interactions were not accounted because they are not annotated in the KEGG database. Two extra notes here. First, unlike reprogrammed glucose metabolism, common interactions between genes in different cell lines are not impressive, probably because these cell lines are generated with different genetic basis despite being derived from NSCLC. Second, the annotation of hsa05223 has defects, because it is not in the list of enriched pathways identified by g:Profiler.

Appendix 5 Additional results of applications
This appendix file describes the additional results of five applications, including the analysis of lung cancer cell lines and alveolar epithelial cells, the analysis of macrophages isolated from glioblastoma, the analysis of tumor-infiltrating exhausted CD8 T cells, identifying genes and inferring interactions that signify CD4 T cell aging, and the analysis of a flow cytometry dataset. These examples were used to examine the applicability of causal discovery to varied cell types and sequencing protocols. To same running time and also examine algorithms' power, varied sample sizes were used. All of these data were analyzed using the PC+CI method. The results indicate that causal discovery can be applied flexibly to varied cells. The appendix text (including appendix tables and figures) is brief and divided into five subsections, with the first four corresponding to the four subsections in the Results section in the main text, following appendix figures that are ordered accordingly.
1. The analysis of lung cancer cell lines and lung alveolar epithelial cells Appendix 5-figure 1. The causal network of the 50 feature genes inferred by PC+cmiKnn from the H2228 dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed in cells were used). In these and the following figures, red and blue arrows indicate activation and inhibition, double arrows indicate undermined direction, arrows' thickness indicates the statistical significance of CI test, and node colors indicate fold changes of gene expression. MHC-II genes were significantly down-regulated compared with the control (the lung alveolar epithelial cells).  Appendix 5-figure 2. The causal network of the 50 feature genes inferred by PC+GaussCItest from the H2228 dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed in cells were used). MHC-II genes were significantly down-regulated compared with the control (the lung alveolar epithelial cells).  Appendix 5-figure 3. The causal network of the 50 feature genes inferred by PC+RCIT from the H2228 dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed in cells were used). MHC-II genes were significantly down-regulated compared with the control (the lung alveolar epithelial cells). Appendix 5-figure 4. The causal network of the 50 feature genes inferred by PC+ DCC. perm from the H2228 dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed in cells were used). MHC-II genes were significantly down-regulated compared with the control (the lung alveolar epithelial cells). Appendix 5-figure 5. The causal network of the 50 feature genes inferred by PC+cmiKnn from the alveolar epithelial cell dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed the cells were used). The control of the case was the H2228 cells. Compared with H2228 cells, MHC-II genes in alveolar epithelial cells were highly expressed. The relationships between MHC-II genes and the relationships between MHC-II genes and CD74 (which is a key regulator of MHC-II proteins) are supported by annotated interactions in the STRING database. −2 −1 0 1 2 3

Wen
Appendix 5-figure 6. The causal network of the 50 feature genes inferred by PC+GaussCItest from the alveolar epithelial cell dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed the cells were used). The control of the case was the H2228 cells. Compared with H2228 cells, MHC-II genes in alveolar epithelial cells were highly expressed. The relationships between MHC-II genes and between MHC-II genes and CD74 are much more dense than those inferred by PC+cmiknn. Appendix 5-figure 8. The causal network of the 50 feature genes inferred by PC+ DCC. perm from the alveolar epithelial cell dataset (settings: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed the cells were used). The control of the case was the H2228 cells. Compared with H2228 cells, MHC-II genes in alveolar epithelial cells were highly expressed.

The analysis of the macrophages from glioblastoma
After using the dataset of macrophage isolated from glioblastoma to examine feature selection algorithms, we also used it to examine causal discovery algorithms. Again, feature genes include HLA genes to examine whether reported interactions are inferred (Appendix 5-figures 9 and 10).  Appendix 5-figure 9. The causal network of the 50 feature genes inferred by PC+ DCC. perm from the macrophage dataset (macrophages isolated from human glioblastoma) (setting: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed were used). Because no control data was used, the differential expression of genes was not computed. Note the interactions between MHC-II genes and CD74, between C1QA/B/C, and the TYROBP→TREM2→A2M→ APOE→APOC1 cascade. Appendix 5- figure 10. The causal network of the 50 feature genes inferred by PC+ DCC. gamma from the macrophage dataset (macrophages isolated from human glioblastoma) (setting: feature genes were expressed in >50% cells, the alpha level for CI test was 0.1, and the 300 cells with more feature genes expressed were used). Because no control data was used, the differential expression of genes was not computed. Note the interactions between MHC-II genes and CD74, between C1QA/B/C, and the TYROBP→TREM2→A2M→ APOE→APOC1 cascade.

The analysis of exhausted CD8 T cells from multiple cancers
We used TOX and PDCD1 as the target gene, respectively, to select 50 genes from genes expressed in >50% exhausted CD8 T cells (from liver, colorectal, and lung cancers) and in >50% non-exhausted CD8 T cells (from the normal tissues neighboring these cancers). Networks with TOX and PDCD1 as the target gene are called TOX-network and PDCD1-network, respectively. In this application case, we demonstrate consensus networks; unless otherwise specified, all panels are consensus networks of the two DCC algorithms. Therefore, we use letters but not algorithms to label panels. Networks were inferred from 500 cells (the case of colorectal cancer) and 463 cells (the case of lung cancer). Exhausted and non-exhausted were mutually used as case and control. In panels, →→ and -|-| represent indirect activation and inhibition (Appendix 5-figures 11-17). Appendix 5-figure 11. The networks of 50 genes in exhausted CD8 T cells and non-exhausted CD8 T cells from colorectal and lung cancers and the normal tissues neighboring the cancers. (A) The TOX-network inferred from exhausted CD8 T cells from colorectal cancer. TOX→→PDCD1 (TOX→CXCL13→PDCD1), TNFRSF9→TRAF5, and TOX→→MIR155HG have related reports including (1) TOX up-regulates PDCD1 expression (Khan et al., 2019), (2) TNF receptors bind to TRAF2/5 to activate NF-kB signaling, (3) in mice up-regulated miR-155 represses Fosl2 by inhibiting Fosb and causes long-term persistence of exhausted CD8 T cells during chronic infection (Stelekati et al., 2018). We found that if more feature genes were selected (to include FOSB), the MIR155HG-|YPEL5→DNAJB1→FOSB were inferred, agreeing with inhibited FOXB by up-regulated MIR155. Appendix 5-figure 13. The networks of 50 genes in exhausted CD8 T cells and non-exhausted CD8 T cells from colorectal and lung cancers and the normal tissues neighboring the cancers. (C) The PDCD1-network inferred from exhausted CD8 T cells from colorectal cancer. PDCD1→→TOX (PDCD1→CXCL13→TOX), PDCD1→→MIR155HG (there were two routes: PDCD1→CXCL13→MIR155HG, PDCD1→CCL3→MIR155HG), MIR155HG-|YPEL5→DNAJB1, and HAVCR2-|PDCD4 were inferred. Related reports of these interactions include (1) TOX transcription factors cooperate with NR4A transcription factors to impose CD8+ T cell exhaustion (Seo et al., 2019), (2) CCL3 is one of the up-regulated chemokine genes in exhausted CD8 T cells (Wherry et al., 2007). Appendix 5-figure 16. The networks of 50 genes in exhausted CD8 T cells and non-exhausted CD8 T cells from colorectal and lung cancers and the normal tissues neighboring the cancers. (F) The PDCD1-network of HSIC algorithms inferred from non-exhausted CD8 T cells from the normal tissue neighboring colorectal cancer.