Multi-omics protein-coding units as massively parallel Bayesian networks: Empirical validation of causality structure

Summary In this article we use high-throughput epigenomics, transcriptomics, and proteomics data to construct fine-graded models of the “protein-coding units” gathering all transcript isoforms and chromatin accessibility peaks associated with more than 4000 genes in humans. Each protein-coding unit has the structure of a directed acyclic graph (DAG) and can be represented as a Bayesian network. The factorization of the joint probability distribution induced by the DAGs imposes a number of conditional independence relationships among the variables forming a protein-coding unit, corresponding to the missing edges in the DAGs. We show that a large fraction of these conditional independencies are indeed verified by the data. Factors driving this verification appear to be the structural and functional annotation of the transcript isoforms, as well as a notion of structural balance (or frustration-free) of the corresponding sample correlation graph, which naturally leads to reduction of correlation (and hence to independence) upon conditioning.


INTRODUCTION
The last 20 years have witnessed an explosion of the number and accuracy of omics techniques, and with them of our possibilities to monitor and understand in a systematic way the events that lead to the expression of genes and proteins (Dihazi et al., 2018;Hawe et al., 2019;Huang et al., 2017;Subramanian et al., 2020). These omics techniques call for integrative models to be developed, and in fact also the number of multi-omics computational solutions proposed in the literature has grown at a stable pace. See the surveys Berger et al. (2013); Bersanelli et al. (2016); Gomez-Cabrero et al. (2014); Meng et al. (2014);and Zeng and Lumley (2018). Most of this literature focuses on system-level methods in which genome-wide models are developed and investigated. Typical examples are gene regulatory networks (Duren et al., 2017;Greenfield et al., 2013;Huynh-Thu and Sanguinetti, 2019;Miraldi et al., 2019;Siahpirani and Roy, 2017), protein-protein interaction networks (DeLas Rivas and Fontanillo, 2010;Vogel and Marcotte, 2012), or combinations of these layers with genetic interactions, methylation, genome-wide association studies, signaling, and so on (Buetti-Dinh et al., 2020;Fu et al., 2017;Godsey, 2013).
The approach we take in this article is not aimed at developing novel system-level models, but rather at integrating various omics into fine-graded models of transcription and translation at the level of single gene/protein. The omics we consider are epigenomics (ATAC-seq), transcriptomics (RNA-seq), and proteomics (mass spectrometry), all performed on primary human naive CD4 + T-cells during early T-helper type 1 (T H 1) cell differentiation. ATAC-seq (Assay for Transposase-Accessible Chromatin sequencing) is one of the most popular high throughput techniques for chromatin accessibility profiling (Yan et al., 2020). The regions (called peaks) of open chromatin identified by the method can be associated with the nearest gene and play an important role in its regulation, as the binding of transcription factors to DNA is enabled (or facilitated) by chromatin accessibility. RNA-sequencing at sufficient depth (4 , 10 7 reads per sample in our case) allows us to quantify gene expression at the level of alternative splice variants of a gene. For this reason, we have decided to consider in our model the expression of these splice variants in place of a single lump value describing gene expression. Annotated ATAC-seq peaks can be assigned with reasonable accuracy also to the splice variants of a gene, by labeling them according to their transcription start site. To measure protein abundance we used mass spectrometry. As this technique is not sensitive enough to detect protein isoforms in a systematic way (Blencowe, 2017; Tress A class of models that is particularly suitable to deal with DAGs is Bayesian networks (Friedman, 2004;Huynh-Thu and Sanguinetti, 2019;Kalisch and Bü hlmann, 2007;Koller and Friedman, 2009;Lauritzen, 1996;Maathuis et al., 2010;Pearl, 1988). A Bayesian network is actually defined as a DAG with a random variable assigned to each node. When the topology of the graph is known, then the joint probability distribution of the Bayesian network factorizes according to the DAG. Full measurement of the variables means that we have the opportunity to verify if indeed the sample distribution respects the structure predicted by the DAG. In fact, in a Bayesian network, absence of an edge between two nodes can be expressed as a conditional independence relationship among the variables (Koller and Friedman, 2009;Lauritzen, 1996). Hence, checking the DAG factorization is equivalent to checking how many of the corresponding conditional independencies are verified by the data, test which can be carried out by computing partial correlations. We show in the article that the data indeed verify many of the predicted conditional independencies, meaning that the predicted structure of the protein-coding units is largely validated by our data. The verification of conditional independencies is particularly high for protein-coding units directly involved in the cellular process that our data are describing, T-cell differentiation, and even more for T H 1 cell differentiation.
Among the factors leading to verification of conditional independence, the functional and structural annotation of splice variants of a gene appear to play a major role, thereby confirming the idea that ''principal'' transcript isoforms are indeed an important factor behind protein synthesis, see Gonzalez-Porta et al. (2013); Ezkurdia et al. (2015). The relevance of such isoforms is here cross-validated by using conditional independence relationships to determine a dominant splice variant in each protein-coding unit. The method we propose provides a ranking of the splice variants of a gene even when standard criteria, like RNA abundance, fail to do so.
Apart from these results, a novel contribution of this article consists in identifying a property of sample data that most likely leads to a validation of a conditional independence relationship. To our DAGs we can associate a matrix of sample correlations R, and the corresponding signed sample correlation graph GðRÞ. We show in the article that when the subgraph of GðRÞ associated with a conditional independence is structurally balanced (i.e., all its cycles have an even number of negative edges), then it is much more likely that the corresponding conditional independence is validated by the sample data. Structural balance (hereafter referred to simply as ''balance'') is a well-known concept in graph theory (Harary et al., 1965). It is related to the notion of frustration-free graph in Ising spin glasses (Facchetti et al., 2011;Mezard and Montanari, 2009;Mezard et al., 1986), and to the idea of coherence in biological networks (Alon, 2007). Its interpretation in the present context is similar to that of a ''parity check'' in error correcting codes (Cover and Thomas, 2006;Gallager, 1963;Mezard and Montanari, 2009): it provides an internal check of consistency among the sample data involved in the graph. What we show in the article is that balance most often leads to contraction of the correlation upon conditioning, i.e., to a partial correlation that shrinks toward 0, and hence toward ''exact'' conditional independence. The observation that balance corresponds to consistent sample data is in our knowledge novel for the field of Bayesian networks. Somewhat related concepts of ''internal consistency'' have been used for graphical models in Lauritzen et al. (2019); Slawski and Hein (2015) in relationship to a property which we call inverse balance, but which is called signed multivariable total positivity of order 2 (signed MTP 2 ) in Lauritzen et al. (2019), and deals with the concentration matrix associated with R. The latter property is more restrictive than balance, however it can be shown that for it the aforementioned idea of contraction upon conditioning holds systematically. The impact of both balance and inverse balance on the empirical validation of the structure of the protein-coding units is analyzed in detail in the article.

RESULTS
The 4797 protein-coding units obtained from our data are represented as DAGs and treated as a parallel set of Bayesian networks, as described in the Methods and in Figure 1. Our aim is to check consistency The causal dependencies considered in a protein-coding unit can be represented as a DAG. A Bayesian network can be associated with the DAG, with joint probability distribution that factorizes according to the DAG. A number of conditional independence relationships follow (notice that some of the conditioning sets can be reduced).
(B) The sample correlation graph GðRÞ corresponding to the DAG. Blue (resp. red) edges represent positive (resp. negative) entries of R. In this case GðRÞ is not balanced. (C) Checking conditional independence on the data: examples. The variables involved in a conditional independence form a motif on the DAG, to which is associated a subgraph of GðRÞ. If this subgraph is balanced then typically the correlation contracts upon conditioning, often leading to an empirical validation of the predicted conditional independence.

OPEN ACCESS
iScience 25, 104048, April 15, 2022 3 iScience Article between the factorization of the joint probability distribution predicted by the structure of the DAGs and our sample data for ATAC-seq (variable a in our notation), splice-level RNA-seq (variable s) and mass-spec protein abundance (variable p), i.e., to check to what extent also the empirical sample distributions factorize according to the DAGs, and what are the implications for our protein-coding units. For that, it is convenient to express the factorized probability distribution of a DAG as a set of conditional independence relationships (also called Markov conditions) among the nodes of the DAG, as described in the Methods and summarized in Table 1. See also Figure 1 for an example.

Conditional independence verification via partial correlations
With the data we have available, a correlation-based verification test can be carried out exhaustively on the entire set of 1:5 , 10 5 predicted conditional independencies, using sample partial correlations. In formulas, a conditional independence between variables x and y with conditioning set C is expressed as x t yjC, where x and y are any of a, s, and p, localized on the same DAG. Verification of the conditional independence corresponds to the partial correlation R xy:C being small enough: R xy:C < q, where q is a threshold provided by a Fisher z-test, see STAR Methods for more details. As can be seen in Figure 2A, for a significant fraction of nodes the conditional independencies are indeed verified by the sample partial correlation test. In particular, the number of nodes with 100% and with 0% of successes is resp. 19.5% and 17.9%, while 44.3% of the nodes have at least 50% of confirmed conditional independencies. The verification is particularly successful for conditional independencies between different ''layers'' of the DAG, which are by far the most important types of independencies when it comes to understanding the structure of the DAGs, see the Discussion section for more details. For them the overall percentage of successes is 89.3%, see Table 1 and inset in Figure 2A. In practice, this result means that the vast majority of causal independencies among the a, s and p variables predicted by the structure of the Bayesian networks are indeed confirmed by the data.
It is also possible to aggregate the statistics of the Markov conditions according to the DAGs to which a node belongs. The size of the DAGs varies as shown in Figure 3A, and the total number of predicted and verified conditional independencies is shown in Figure 3C. The maximum number of predicted (resp. verified) conditional independencies on a DAG is 1201 (resp. 915). For 14.5% of the DAGs no conditional independence is verified, while for 4.9% all are verified. When focusing only on the a/ s and a/ p classes of conditional independencies the corresponding percentages are 24.7% and 51.4%, see inset in Figure 3B.

Validation based on null models
To evaluate whether the level of confirmation of the causal dependencies we obtain from the data is indeed meaningful for our protein-coding units, we compared the true distribution of the verified conditional independencies with two null models. The first null model is obtained by replacing y in an expression like x t yjC with a randomly chosen y r (not necessarily located on the same DAG). In the second null model, both y and C are instead replaced by randomly chosen y r and C r (of the same cardinality as C and also in this case not necessarily on the same DAG). The latter null model expresses the probability of obtaining a conditional independence by pure chance, while the former null model describes how much of the information on x is captured by C (recall that C is interpreted either as a set of parents of x for local Markov conditions or as a d-separating set for global Markov conditions), see STAR Methods. As can be seen in Figure 2C, the first null model (x t y r jC) shifts the histogram to the right with respect to the second (x t y r jC r ). Although conditional independence is detectable in average only in 36.8% of the trials against 29%, the iScience Article difference is significant (2-sample t-test, pvalue $ 0). The distribution for the true conditional independences is essentially bimodal, with an average of successes of 46.4%. Also this is a statistically significant difference against the 36.8% of the best null model.

Validation based on the ontology of the protein-coding units
As our experiments describe a CD4 + T-cell differentiation process toward T H 1 cells, we reasoned that the signal-to-noise ratio in our data should be particularly high for the protein-coding units directly involved in such process. To verify this hypothesis, we compiled a list of genes involved in T-cell differentiation at large, and a second list more specific to T H 1 differentiation (see STAR Methods) and checked to what extent the conditional independencies described above are verified in these subsets of protein-coding units. Quite remarkably, we observe that nearly all statistics indeed improve, as can be seen in Table 2. In particular, the verification of the a/p conditional independencies (the most significant class of conditional independencies, given the structure of our DAGs) passes from 74.1% of the whole set of DAGs to 83% for the T-cell related DAGs, to 89.4% for T H 1 related DAGs. All these improvements are statistically significant (Fisher exact test, see Table 2 for the pvalues). Analogous considerations hold also for other classes of conditional independencies (a À a and s À s), see Table 2. As an alternative analysis, we also checked the ontological enrichment of the two subsets of protein-coding units representing 0 and 100% of verified conditional independencies of type a/s and a/p shown in the inset of Figure 3B.
A B C Figure 2. Conditional independence verification via partial correlations. Verifying conditional independencies (A) Ratio between the number of verified and predicted conditional independencies for each node (a, s, or p) of a DAG. Inset: same ratio but when considering only the conditional independencies between different layers, a/s and a/p. (B) The histogram reports the n. of verified and predicted conditional independencies, this time binned according to the n. of predicted conditional independencies (inset: alternative view of the same quantities, with size of the dots proportional to the height of the bars).
(C) Distribution of successful conditional independences in the true model (i.e., x t yjC, blue), in a null model in which y is replaced by a randomly chosen y r (red), and in a null model in which both y and C are replaced by randomly chosen y r and C r (of the same cardinality of C, yellow).

OPEN ACCESS
iScience 25, 104048, April 15, 2022 5 iScience Article Also, in this case, the second subset is enriched in pathways related to T-cells and their differentiation, while the first is not, see Figure S1 in the Supplementary Information.

Determining the dominant splice variant behind protein synthesis
Another analysis that can be performed on our data using conditional independence relationships consists in computing the splice variant that can be considered the most important (we call it ''dominant'') in order to ''explain'' the profile of a protein, and how many splice variants can instead be considered ''redundant'', in the sense that their contribution can be explained by the presence of a dominant transcript isoform for that protein. For this, we follow the approach already used in the literature for reconstructing gene regulatory networks (de la Fuente et al., 2004;Schä fer and Strimmer, 2005;Soranzo et al., 2007;Zampieri et al., 2008), namely, on each DAG we search for a splice variant s i such that for all other s j on the same DAG it is s j t pjs i , i.e., R sj p:si < q. Of the 4111 DAGs having more than one splice variant, 29.6% have a single dominant splice isoform, and in 88.4% of the cases a single splice can explain 50% or more of the existing alternative splice variants for that protein, see Figure 4. Also, this test is significant against randomizations. Notice that when restricting to T H 1-related genes, the percentage of protein-coding units having a single dominant splice isoform grows to 36.4%.  (2017)), that many transcripts classified as splice variants of a gene are not translated, and that many more contribute to protein synthesis only with a minor probability. To this end, the database APPRIS (Rodriguez et al., 2018) provides a classification of all transcript isoforms of a gene based on annotation and conservation. One may then wonder to what extent the successfully verified conditional independencies we obtain reflect the importance of the functional annotation of the corresponding splice variant. A test can be carried out for some of the classes of conditional independencies listed in Table 1, using the APPRIS ranking of splice variants described in the Methods. For instance, for the a/p conditional independencies, from Figure 5B, it can be seen that of the 6613 validated conditional independencies 2018 (30.5%) correspond to a median APPRIS score which is the highest possible, i.e., the observed s for the conditional independence a t pjs are labeled PRINCIPAL:1 in APPRIS. The corresponding percentage for an equal number of randomly chosen s on the same DAG is 11%, meaning that the result is statistically significant. In other words, the transcript isoform that tends to explain most of the correlation between a and p happens often to be the splice variant that is labeled principal for that gene in APPRIS.
A similar test cannot be carried out for the a/s and s À s classes of conditional independencies, as for them the transcripts do not appear in the conditioning set C. It can be done, however, when we seek for the dominant splice of a protein-coding unit, as conditioning in this case is over s (i.e., over s i in the expression s j t pjs i ). When we compute the dominant splice variant according to the method described above, we see that in around 28% of the DAGs (42% if we restrict to the DAGs admitting a splice variant classified as PRINCIPAL:1) indeed the dominant splice has an associate score equal to PRINCIPAL:1 (green dots in Figure 5C), i.e., the splice able to alone explain the most of the profile of the protein corresponds to the most important splice according to the current structural and functional knowledge of the proteins. This result is highly significant (Fisher exact test, pvalue of 5 , 10 À13 ).
As an alternative test for the s À s class of conditional independencies (which alone gives more than 90% of the non-verified conditional independencies), it is possible to compute a set of Transcription Factors (TF) potentially associated with each transcript according to a footprinting analysis, as described in the STAR Methods. We can then observe that there is a noticeable difference between splice variants involved in s À s conditional independencies that are verified by the data and those that are not. Namely the former tend to have a higher number of putative TF than the latter, see Figure 5D. This difference is significant (2-sample t-test, p-val $ 0).

Coupling the protein coding units: shared peaks and protein complexes
Genes and proteins take part in regulatory networks characterized by a multitude of interactions, some of which involve multiple protein-coding units. For instance, protein-coding units may be coupled by the presence of shared chromatin peaks, intended as peaks that influence splices located on different protein coding units (i.e., splices associated with different genes, and hence to different proteins), see Figure S4A. In our data 876 (i.e. $3%) peaks belong to this category. To analyze the effects brought by a shared peak, we can consider the joint DAG containing the peaks, splices and proteins of the two different protein units, see the example in Figure S4A. This form of coupling leads to additional conditional independencies in the joint DAG. For example, in the case of the DAG in Figure S4A, we can iScience Article consider the conditional independencies s 12 t s 21 ja 2 , and p 1 t p 2 jfs 11 ; s 12 g; which do not exist when we consider the two protein-coding units separately. In particular, it is interesting to note that considering interactions between protein-coding units introduces a new class of conditional independencies, that of p À p.
We have systematically tested the set of conditional independencies that arise from the presence of shared peaks, which can be of two types: s 1i t s 2j ja comm ; p 1 t p 2 jS 1 ; where s ki indicates the i-th splice that belongs to the protein unit of p k ; a comm represent the peaks in common between s 1i and s 2j ; and S 1 = fs 11 ; s 12 ; .g is the set of splices associated with p 1 (the situation is specular if we take the set of splices associated with p 2 ).
Our analysis shows that $29% of these new conditional independencies in the joint DAG are verified by the data, see Table S1, which is a similar value to what was observed for the set of ''single-DAG'' sÀ s conditional independencies in Table 1. On the other hand, it is interesting to note that only around 20% of the conditional iScience Article independencies between proteins is validated by the data. We can interpret this as a sign that the causal links introduced by the shared ATAC peaks do not propagate in a significant way to protein co-regulation.
Another potential form of interactions among our protein-coding units is due to the existence of protein complexes. These introduce another type of interaction in the model, in the form of an undirected pÀ p edge, see Figure S4B. In terms of probabilistic graphical models, the resulting graph is no longer a A C D B Figure 5. Factors behind conditional independence: splice function and TF. Role of splice annotation and TF (A) To each splice isoform is associated a score obtained from the APPRIS database. The distribution of all scores for the $ 27K splices is shown in the yellow bars. For around 62% of all DAGs a splice labeled PRINCIPAL:1 is present (violet bars). (B) Checking APPRIS scores on the a/p conditional independencies. The median APPRIS score of the s that lead to a a/p conditional independence (green) is compare to the median APPRIS score of all the s on the DAG (yellow). Dot size is proportional to fraction (gray circle is 100%). As can be seen on the red dots, the median APPRIS score of the observed s is systematically higher than the median APPRIS score of the corresponding DAG, meaning that in general s with high APPRIS scores (and, in particular, PRINCIPAL:1 splice variants) are more reliable predictors of a conditional independence a/p. iScience Article Bayesian network but rather a mixed directed/undirected (acyclic) graph (Koller and Friedman, 2009), but for our purposes we can treat it in an analogous way.
To analyze this form of interaction, we have downloaded the list of protein complexes from the database CORUM (Giurgiu et al., 2019), which contains 4274 manually annotated protein complexes from mammalian organisms. Out of these, 1175 complexes correspond to at least a pair of proteins in our data. What we observe is a strong dependence at protein level. For example, proteins that belong to the same complex display a significantly higher correlation compared to random protein pairs, see Figure S4C. This is not unexpected, as subunits of a protein complex should be expressed in fixed stoichiometric ratios.
Moreover, proteins that belong to the same complex rarely satisfy the conditional independence of type p 1 t p 2 jS 1 : in only 14% of pairs the condition is satisfied (compared to 20% for random pairs). Even more, when we select only proteins that belong to a protein complex and also share a common peak, only two out of 28 pairs are conditionally independent.
However, the same dependence observed for subunits of a protein complex does not seem to exist at splice level, where we observe correlation values similar to random pairs of splices, see Figure S4C. This observation suggests that the abundance of proteins in the same complex may not always be regulated at the level of transcription, but could be rather due to post-translational modifications, as already described in Eisenberg et al. (2018); Ross et al. (2021).

Balance as a proxy for conditional independence
It is intriguing to try to understand if there is also any property of Bayesian networks that can be exploited to predict success or failure in our empirical verification of conditional independences. In order to do so, we need to introduce the concept of balance of the signed graph GðRÞ induced by the sample correlation R on a set of nodes, see STAR Methods.
If we classify the $ 4,10 5 partial correlations required for checking all Markov conditions into balanced and unbalanced according to the corresponding correlation graph, then we see a striking difference: the partial correlations R xy:C taken from unbalanced graphs tend systematically to stay away from 0, a pattern not visible in those taken from balanced correlation graphs, see Figure 6A. Consequently, unbalanced graphs seldom lead to verification of conditional independencies (i.e., to R xy:C <q), and in fact in the vast majority of cases (82.5%), verification of a Markov condition is associated with balance of the correlation graph formed by the corresponding variables, see Table 3.

Inverse balance as a more stringent proxy for conditional independence
If instead of the sample correlation matrix R we consider the corresponding sample concentration matrix J = R À1 , then an analogous notion of balance can be set up also for GðJÞ. We call this inverse balance, see STAR Methods for details. If we compare the statistics of verified conditional independencies for balance and inverse balance (Table 3), then we see that while balance is able to capture a larger fraction of verifications, and in fact the number of confirmed conditional independencies associated with unbalanced GðRÞ is low (''false negatives'', 17.5%), it is also producing a large number of ''false positives'' i.e., balanced motifs that do not lead to verifications of a conditional independence. On the contrary, inverse balance can verify around half of the conditional independencies confirmed by balance, but the number of concentration graphs that are inverse balanced and that are not verified is around a third of the balanced cases, see Table 3. Treating balanced + non-verified and unbalanced + verified as ''errors'', then inverse balance produces around 7% less errors than balance. This is confirmed by the ROC-like curves we obtain if instead of keeping a fixed threshold q for acceptance of a verification we vary it continuously (a more thorough explanation is in the STAR Methods), see Figure 6E. AUC for balance is 0.64, while for inverse balance it is 0.69. However, if we look at the Precision-Recall curve built in an analogous way, Figure 6F, we see that now the AUC for balance is 0.81, while that for inverse balance is only 0.45.
In short, balance is less restrictive but also less precise (it has lower positive predictive value but higher negative predictive value) than inverse balance in verifying the conditional independencies predicted by the DAGs structures associated with the protein-coding units. Both concepts are however extremely useful in screening large sets of dependencies, and in checking the presence of edges in the corresponding Bayesian networks.

DISCUSSION
The multiple high-throughput omics data we have available for this study allow us to characterize with reasonable precision several of the crucial steps of the complex process that leads to the synthesis of proteins in higher organisms: ATAC-seq data allow to estimate the chromatin accessibility state in the area of the DNA surrounding the gene of interest, RNA-seq data provide us a quantification of gene expression at the level of alternative splice variants, and mass spectrometry based proteomics can measure the abundance of the corresponding gene products. Combining together these measurements means looking at Figure 6. Balance as a proxy for conditional independence. Balance and inverse balance (A) Distribution of the sample partial correlations R xy:C , classified according to the balance or unbalance of the corresponding sample correlation graph. The vertical dotted lines correspond to Gq (significance threshold). Balanced R xy:C tend to be concentrated around 0, while unbalanced R xy:C tend to stay away from 0. (B) Sample density heatmap of the pair ðR xy ; R xy:C Þ for the balanced (left) and unbalanced (right) cases. In the balanced cases, normally it is R xy:C < R xy , while no such rule holds in the unbalanced cases.
(C) Same histogram as in (A), but now classified according to inverse balance. The pattern is similar to (A), but now the fraction of inverse balanced cases is lower and that of the inverse unbalanced cases higher. (D) Same density plot as in (B), but for inverse balance. The inequality R xy:C < R xy is now obeyed systematically in the inverse balanced cases, i.e., conditioning always leads to contraction. (E) ROC curves for balance and inverse balance, as the threshold q is varied. The square marker corresponds to the threshold for q chosen in the article. AUC for balance is 0.64, while for inverse balance it is 0.69. (F) Precision-recall curves for balance and inverse balance, as the threshold q is varied. AUC for balance is 0.81, while for inverse balance it is 0.45.

OPEN ACCESS
iScience 25, 104048, April 15, 2022 11 iScience Article a fine-graded description of the transcriptional/translational events that lead to the synthesis of a protein.
A key contribution of this article is to show that the combination of these multi-omics data can give rise also to novel high-resolution mathematical models.
Importantly, the physical interactions used in these models have a natural DAG structure, suited to be modeled using tools such as Bayesian networks. Even more importantly, the resulting Bayesian networks have a manageable size, and the regime in which they are placed is not the highly unconstrained one (more degrees of freedom than data) normally needed in gene regulatory network inference (Schä fer and Strimmer, 2005;Soranzo et al., 2007), but its opposite. In fact, in our case, each variable is measured and we do not need to learn the graph, as the topology of the Bayesian network itself is available through gene sequence annotation: the ATAC-seq peaks can be assigned with reasonable accuracy to specific splice variants of a gene based on the transcript start site of each splice, and the alternative splicings of a gene are also well characterized. The translation from alternative splices to proteins is perhaps the step that carries the most uncertainty, the limiting factor being the low precision and the limited coverage of current mass spectrometry proteomics. In particular, the limited sensitivity of this technique does not allow in general to identify protein isoforms (Blencowe, 2017;Tress et al., 2017), hence making it impossible to associate the protein signals we have with the specific transcript isoforms.
It is therefore quite remarkable that with no other information than conditional independence relationships among the alternative splice variants of a gene, in around 25-30% of the cases we can determine the splice variant which is classified as principal according to the current annotation of structure, functions and conservation of each transcript isoform. In other words, principal splice variants tend to explain a considerable fraction of the expression of the alternative splices, even though this explanation is not related to abundance, and it is not directly visible in the splice-splice correlation, nor in the splice-protein correlation, but rather at the partial correlation level (i.e., in the conditional dependencies among the variables). In fact, when we classify the correlations between s and p according to the APPRIS scores, then no particular trend emerges. In Figure S2B, the histogram for PRINCIPAL:1 splice variants is not distinguishable from the others. The splice variants classified as dominant according to our conditional information approach have some overlap with the most abundant splice in our sample, see Figure S2A. Since correlations and partial correlations are invariant to amplitude rescaling, this provides (at least to some extent) a cross-validation of the idea that the most abundant splice isoform is also the most relevant one for translation (Ezkurdia et al., 2015). This appears to be especially true for the protein-coding units containing a low number of splice variants, see Figure S2C. In 23.8% of the DAGs, the dominant splice isoform corresponds also to the splice having the highest correlation with the corresponding protein, see Figure S2B. However, a straightforward relationship (like the one provided by RNA-seq abundance) between a single splice isoform and the corresponding protein appears to be missing for the remaining 70-80% of the proteins we have analyzed. For them, no single feature of the transcript and of its annotation appears to be enough to determine which splice variant is playing the most important role in protein synthesis. Here our approach may provide a novel way to infer the importance of splicing events on translation and protein synthesis.
When we zoom in on the protein-coding units directly involved in T H 1 differentiation, then, apart from proteins having a trivial DAG (e.g., IFNG [Interferon-g, the main product of T H 1 cells] and IL27RA Both balance and inverse balance help in verifying conditional independencies. Balance is broader and less selective (many more ''true positives'' than ''false negatives'', but also many ''false positives''. Inverse balance is more selective, but has lower inference power (less ''false positives'' and many ''true negatives'' but also less ''true positives''). Overall, inverse balance is slightly more accurate: 7% less ''errors'' (''false positives + false negatives'') than balance. iScience Article [Interleukin-27 receptor a chain, involved in T H 1 differentiation], both having a single transcript isoform, which is of course also principal), we can find several cases of interest. For instance, of the 11 splice isoforms of STAT4 (a key regulator of T-helper cell differentiation), only four are retained in the DAG after filtering out transcripts that lack ATAC peaks and/or those labeled as non-coding in APPRIS. All of these four transcripts have similar expression, but only two are labeled as PRINCIPAL:1 in APPRIS. One of the two corresponds to the dominant splice we identify through conditional independence analysis.
In several examples the various classification criteria that we are discussing (RNA-seq abundance, APPRIS score and our test for dominance) are all consistent. Consider for instance the gene TBX21, a Tbox transcription factor thought to be the main driver of T H 1 differentiation. Of the two transcript isoforms of TBX21, one (ENST00000177694) has consistently higher expression than the other (ENST00000581328). The first also has a higher APPRIS score (PRINCIPAL:1) and is also selected as the dominant isoform by us. Similar considerations hold for (among other) IL18R1, JAK2, ZAP70, LCK (for the latter our analysis is able to pick out as dominant a PRINCIPAL:1 transcript out of 14 splice variants). Even when a gene lacks a principal isoform, abundance alone can be a proxy for dominance, as in the case of GATA3 (an early regulator of T-cell differentiation), where one of the two isoforms showing high expression, ENST00000379328 and ENST00000346208, is selected as dominant (the first, which happens to coincide with the highest APPRIS score for this gene, a PRINCIPAL:4). For other genes, instead, abundance is not a useful indicator. For instance, for the ITK gene (IL2 inducible tyrosine kinase, expressed in T-cells and playing a role in T H differentiation), none of the nine transcript isoforms that survive our filtering has an expression significantly higher than the others. Nevertheless our algorithm is able to identify the only isoform labeled PRINCIPAL:1 as the dominant isoform for that gene. Another example of the same type is HMGB1, whose (equally abundant) multiple transcript variants are said to encode for the same protein (see GeneCards PAGE, accessed April 2021).
Another key feature of our approach is that each Bayesian network can be treated independently, as it corresponds to a model of the protein-coding unit associated with a gene/protein. Hence our > 4000 Bayesian networks can be run in parallel, which breaks down the complexity to a manageable size, and enables the systematic in-depth analysis carried out in the article. Of course, in reality, the protein-coding units are coupled, for instance through the sharing of chromatin peaks or the association in protein complexes, as we have analyzed in the Results section, or through the action of transcription factors, through the concentration of polymerases, through the ribosomal machinery, etc. All of these latter factors must however be treated as unmodeled hidden variables, as we lack any specific measurement for their quantification. Some of these factors (most probably TF) are likely responsible for the limited validation percentage of certain types of conditional independencies, in particular for the s À s conditional independences. For this case, for instance, we have managed to show that verification of the s À s conditional independences grows with the number of TF potentially associated with the splices involved. This results may be reflecting the fact that a more (complex and hence) precise regulation is less prone to noisy measures of expression, and hence lead to more clear s À s interactions.
If for s À s conditional independencies the percentage of success is only 33.8%, it grows to 57.9% for aÀ a conditional independencies. It is worth noticing that none of these two classes lead to modifications of the DAG that are justifiable physically or mathematically. From a physical point of view, having a direct dependence between two s or between two a does not seem to be a realistic assumption. From a mathematical modeling point of view, s À s and a À a pairs are so-called ''moral edges'' (see STAR Methods for details on this terminology), and it is known that for Bayesian networks a probability distribution that factorizes over a DAG factorizes also over the graph obtained adding the moral edges to the DAG.
When instead we look at independence relationships between different ''layers'' (the most important types of conditional independencies when it comes to structure validation), then the percentages of confirmations are significantly higher: 74.1% for a/p conditional independencies, and 97.5% for a/s conditional independencies. The latter are the only independencies that may lead to a straightforward modification of the DAG, meaning that when a conditional independence a/s cannot be confirmed, then the presence of an extra edge a/s in the DAG cannot be ruled out. In summary, our validation analysis of the predicted conditional independencies largely confirms the factorization of the probability density functions determined by the DAGs, and suggests the addition of extra 415 causal dependencies of type a/ s, i.e., a mere 2.5% of the already existing edges.

OPEN ACCESS
iScience 25, 104048, April 15, 2022 13 iScience Article The idea that balance can be of importance in biological networks is not new, and was observed for instance in gene regulatory networks in Alon (2007) under the name of coherence, as well as in other classes of biological networks in Iacono and Altafini (2010); Sontag (2007); Torres and Altafini (2016). The role it plays for our Bayesian networks is however novel to the best of the authors' knowledge. It is related to the already mentioned facts of working in a over-constrained regime (more data than degrees of freedom) and of making use of a symmetric similarity measure, like correlation. In fact, R symmetric leads to the presence of cycles in GðRÞ, whose nodes are all experimentally sampled. When evaluated on a cycle, the sample correlations provide an intrinsic test of consistency for the values of the variables involved in the cycle.
To understand the origin of this consistency, it is useful to look at the smallest possible nontrivial correlation graph, a triangle, formed by the variables x;y, and z obtained from a fork or chain motif in a DAG (see Figure 1C), and yielding a 333 correlation matrix R. When GðRÞ is balanced, the patterns of signs of R xy , R xz and R yz are always compatible with each other, and never contradict each other. For instance, if x and y are negatively correlated and so are y and z, then x and z must be positively correlated (in structural balance theory: ''the enemy of my enemy is my friend'', see Facchetti et al. (2011)). Internal consistency in a triangle of correlations means that the empirical trends we have obtained from our data for x, y, and z confirm each other. On the other hand, on an unbalanced cycle the opposite happens: the data for x, y and z are guaranteed to contradict each other. One consequence of such inconsistency is that the data become unreliable when used for learning (or confirming) conditional independences, meaning that balance gives us a way to label certain interaction patterns as ''more plausible'' or as ''less plausible'' based on the samples. The generalization from a triangle to any arbitrary correlation graph (and to a concentration graph) can be done along similar conceptual lines.
To understand the mechanism by which balance and inverse balance lead to confirmation of conditional independencies in such a systematic way, it is useful to observe what happens to the correlation between x and y upon conditioning. If we compare R xy and R xy:C , then we see that for most balanced sample correlations graphs it is R xy:C % R xy ; (Equation 1) meaning that conditioning on a balanced graph normally leads to a contraction, that reduces the dependency between the variables x and y, see Figure 6A. The contrary often (but not always) happens on unbalanced GðRÞ. This contraction property can be made intuitively clear if we look at the formula for partial correlations on a triangle R xy:z = R xy À R xz R yz ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 À R 2 xz q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 À R 2 yz q : For all possible sign combinations in a balanced triplet, the second term of the numerator has opposite sign to the first term, meaning that in most cases R xy:z obeys to the heuristic (1). There are however exceptions, as can be seen in the density plots of Figure 6B. In the case of inverse balance, the contraction law (1) is obeyed systematically. For covariances, the equivalent of (1) can be shown to be rigorously true, see STAR Methods.
The consistency of signs in a balance graph can be interpreted as solving a system of equations over a binary field, with constraints given by the signs. In statistical physics, it is well-known that checking balance (i.e., checking the absence of frustration in a signed graph) can be mapped into a so-called XORSAT constraint satisfaction problem, see Facchetti et al. (2011); Mezard and Montanari (2009). Such problem either has a unique solution, when the graph is balanced, or has no solution at all, when the graph is unbalanced. The test is akin to a ''parity check'' used in error-correcting codes in information theory (Cover and Thomas, 2006). What is novel in our approach is the observation that balance and inverse balance provide good heuristics for reduction of correlation upon conditioning, and hence for verification of conditional independencies.
While such tests are in our knowledge novel for balance, somewhat related ideas have been developed in the field of Gaussian graphical models for inverse balance. In particular in Lauritzen et al. (2019); Slawski and Hein (2015), it is shown how a (signed) MTP 2 distribution (see STAR Methods) imposes constraints on the possible sign pattern of the correlation matrix, thereby allowing to reconstruct such matrix even ll OPEN ACCESS iScience 25, 104048, April 15, 2022 iScience Article when data on some of the variables are missing. The underlying principle in this literature is the following: a property of the multivariate joint probability distribution provides constraints that must be obeyed also by the empirical sample probability distribution if the latter is indeed drawn from the former.
The underlying principle in our approach is instead the following: a sample probability distribution provides constraint for its own consistency. These constraints do not require us to make any assumption on the ''true'' multivariate joint probability distribution from which the data are drawn. The constraints are nevertheless useful in practice in situations in which all variables are measured, as they provide a sort of ''parity check'' on the sampled data. We expect this type of reasoning to be applicable also in other contexts in which Bayesian networks are of relevance.

Limitations of the study
Several limitations have already been discussed above, notably the fact that the protein coding units are in reality not independent, but coupled by unmeasured quantities (TF, shared transcriptional/translational machinery etc.). Another limitation comes from the fact that the data we have available correspond to time series, for which the assumption of stationarity is not rigorously true. Using dynamical correlation partially, but not completely, eliminates the issue. Other potential limitations come from the data we are using, in particular the fact that mass-spec is not sensitive enough to detect protein isoforms is a significant bottleneck of the current study. Similarly, the uncertainty in how the RNA sequences are assigned to the various splice variants of a gene is of great relevance in our approach, as it may lead to a bias in the analysis.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

DECLARATION OF INTERESTS
The authors declare no competing interests.   (Rundquist et al., 2022).
The data and the code used in this article will be shared upon request to the corresponding author.

Isolation of CD4 + T helper (T H ) cells and T H 1 polarization
Peripheral blood mononuclear cells (PBMC) were isolated from blood donor derived buffy coats through gradient centrifugation (Lymphoprep, Axis shields diagnostics, Dundee, Scotland). Naive CD45RA + CD4 + T-cells were subsequently isolated with magnetic bead separation using the ''Naive CD4 + T-cell Isolation Kit II, human'' (Miltenyi Biotec, Bergisch Gladbach, Germany). The cells were then activated and polarized toward T H 1 using Dynabeadsä. Human T-Activator CD3/CD28 (1 bead/cell) (Dynal AS, Lillestøm, Norway), 5 ng/ml recombinant human IL-12p70, 10 ng/ml recombinant human IL-2 and 5 mg/ml anti-IL-4 antibodies (clone MAB204) (all three from, Bio-Techne, Minneapolis, USA), in RPMI 1640 media (Gibco, Paisley, United Kingdom). A portion of T-cells used for RNA and protein isolation was obtained at baseline and after 0.5 h (only RNA), 1 h, 2 h, 6 h, 24 h and 5 days (only protein; point not used in the analysis). The cells were washed twice in PBS, snap frozen in a dry ice ethanol bath and stored at À80 C until use. During the protein and RNA extractions, multiple samples were pooled from twelve different individuals to reach the necessary amount of material for the subsequent analysis steps.

RNA-seq sample preparation
RNA was isolated using a ZR-Duet DNA/RNA kit (Zymo Research, Irvine, USA) and stored at À80 C. RNA library preparation and the subsequent RNA-sequencing were carried out by the Beijing Genomics Institute (https://www.bgi.com/global/

Isolation of CD4 + T helper cells and T H 1 polarization
Naïve T-cells were isolated and differentiated towards T H 1 for ATAC-seq as described above and sampled at the time points of 0.5, 1, 2, 6 and 24 h (same as the RNA-seq). The naïve T-cells were differentiated on a 96-well plate with 50,000 cells per differentiation reaction (each time points was its own differentiated reaction), in biological triplicate. All time points were from the same individual for each replicate.

Nuclei extraction
For nuclear extraction, at each time point the cells were transfer to DNA low-binding tubes (Eppendorf, Hamburg, Germany) and washed with 1ml PBS by centrifugation at 500g for 5 min. After centrifugation the supernatant was discarded by pipette and the cells resuspended in 50 ml ATAC-seq RSB buffer (10 mM Tris-HCl pH 7.4, 10 mM NaCl and 3 mM MgCl2) containing 0.1% NP40, 0.1% Tween-20 (all from Sigma-Aldrich, St. Louis, MO) and 0.01% digitonin (Promega, Madison, WI) and incubated for 3 min. Then, 1 ml ATAC-seq RSB containing 0.1% Tween 20 was added and the nuclei were pelleted by centrifugation at 500g for 10 min and the supernatant discarded. All steps for the nuclei extraction were carried out on ice and all buffers were pre-cooled to 4 C.

Sequencing library preparation and sequencing
ATAC-seq library preparation was carried out immediately by resuspending the nuclei pellet in 50 ml transposition mix (25 ml 2X TD buffer, 2.5 ml transposase (both from Illumina, San Diego, CA), 16.5 ml PBS, 0.5 ml digitonin, 0.5 ml 10% Twenn-20 and 5 ml Milli-Q water). The reaction mix was then incubated for 30 min at 37 C in a Eppendorf themomixer comfort (Eppendorf, Hamburg, Germany) at 1000 rpm and the reaction subsequently stopped by the addition of 250 ml DNA binding buffer from a Zymo-clean and concentrator kit (Zymo-Research, Irvine, CA) and the DNA purified using said Zymo kit. The ATAC-seq DNA libraries were then amplified and indexed with a NEBNext index kit (NEB, Ipswich, MA) by PCR with the following cycling conditions: 98 C 30 s, 72 C 5 min, followed by 5 cycles of 98 C 10 s, 63 C 30 s, 72 C 1 min. Following PCR DNA concentration was assayed by qPCR as described in Buenrostro et al. (2015) and additional PCR cycles were added as necessary. After PCR the samples were once again purified using the Zymo-clean and concentrator kit. Library quality was assessed by fluorimetry with a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA) and gel electrophoresis using a Fragment Analyzer (Thermofisher Scientific). Sequencing of the libraries was carried out on an Illumina NovaSeq 6000 instrument by the National Genomics Infrastructure of Sweden (https://www.scilifelab.se/facilities/ngi/).

Cell lysis
The isolated cells from patients were resuspended in 100 ml of 8 M Urea in 40 mM Tris-HCl (pH 7.6) and pooled with a similar number of total cells between biological replicates. The suspension was sonicated using focus sonicator (Sonic Dismembrator 500, Fisher Scientific) for 3 cycles of 10 s pulse with 10 s intervals at 10% of power. After sonication, a magnetic rack was used to remove the T-Activator beads used for the polarization. Protein concentration was measured using BCA assay and 40 mg of each sample was digested with trypsin using In-solution digestion. iScience Article buffer (pH 8.0) until the urea concentration was 1 M. The proteins were digested with trypsin overnight at 37 C at an enzyme to protein ratio of 1:20. Finally, the peptides were acidified with 100% trifluoroacetic acid (TFA) to a final concentration of 1% TFA and then desalted using macro spin columns (Harvard apparatus, USA).

TMT labeling
Peptides were labeled with 6-plex TMT reagent using manufacturer's protocol with some modification. The six peptide samples from each time series was resuspended in 100 ml of 100 mM TEAB buffer (pH 8.0) and a unit of each TMT reagent was resuspended in 40 ml of acetonitrile. Subsequently, the prepared TMT reagent was transferred to the peptide sample and then vortexed. Next, the samples were incubated for 2 h at room temperatures. Finally, all of the labeled peptide samples from each time series were pooled and concentrated by vacuum centrifugation. The labeled sample was resuspended 100 ml with 10 mM ammonium formate in water (pH 10) and subjected to high-pH reverse-phase liquid chromatography fractionation.

High pH fractionation
The TMT labeled peptides were fractionated into 24 fractions. Peptide samples were separated using an analytical column (Xbridge, C18, 5 mm, 4.6 mm 3 250 mm) on the Agilent 1200 series HPLC system. Peptides were eluted using following gradient over 115

LC-MS analysis
The fractionated peptides were analyzed on an Orbitrap Fusion Lumos Tribrid Mass Spectrometer (source) coupled with the Easy-nLC 1200 nano-flow liquid chromatography system (Thermo Fisher Scientific). The peptides from each fraction were reconstituted in 0.1% formic acid and loaded on an Acclaim PepMap100 Nano-Trap Column (100 mm 3 2cm, Thermo Fisher Scientific) packed with 5 mm C18 particles at a flow rate of 5ml per minute. Peptides were resolved at 250-nl/min flow rate using a linear gradient of 10%-35% solvent B (0.1% formic acid in 95% acetonitrile) over 95 min on an EASY-Spray column (50 cm 3 75 mm ID), PepMap RSLC C18 and 2 mm C18 particles (Thermo Fisher Scientific), which was fitted with an EASY-Spray ion source that was operated at a voltage of 2.3 kV. Mass spectrometry analysis was carried out in a data-dependent manner with a full scan in the mass-to-charge ratio (m/z) range of 350 to 1,800 in the ''Top Speed'' setting, 3 s per cycle. MS1 and MS2 were acquired for the precursor ions and the peptide fragmentation ions, respectively. MS1 scans were measured at a resolution of 120,000 at an m/z of 200. MS2 scan was acquired by fragmenting precursor ions using the higher-energy collisional dissociation method and detected at a mass resolution of 30,000, at an m/z of 200. Automatic gain control for MS1 was set to one million ions and for MS2 was set to 0.1 million ions. A maximum ion injection time was set to 50 ms for MS1 and 100 ms for MS2. Higher-energy collisional dissociation was set to 35 for MS2. Precursor isolation window was set to 0.7 m/z. Dynamic exclusion was set to 35 s, and singly-charged ions were rejected. Internal calibration was carried out using the lock mass.

Peptide and protein identification
The obtained data were analyzed using MaxQuant (v 1.6.0.1). MS raw data were searched using Andromeda algorithm with matching to the Uniprot human reference (released in Nov, 2017). A specificity of trypsin was determined at up to 2 missed cleavages. In modification, carbamidomethylation, TMT 6-plex modification at lysine and N-termination were set as the fixed modifications, and oxidation of methionine was set as a variable modification. The false discovery rate (FDR) for peptide level was evaluated to 0.01 for removing false positive data. For highly confident quantifications of protein, protein ratios were calculated from two or more unique quantitative peptides in each replicate. Data was normalized and removed contaminant and razor peptide. To enrich differentially expressed proteins (DE-Ps), we analyzed the quantitative ratios (as the Log2 value). The fold-change ratio cut off was more than 2 or less than 0.5 based on intensity of unstimulated sample. Searched data went through statistical process with Perseus (v 1.5.1.6).

OPEN ACCESS
iScience 25, 104048, April 15, 2022 21 iScience Article RNA-seq data processing The RNA-seq data was processed using the following pipeline. Sample qualities were assessed with fastQC and the mRNA reads were subsequently aligned using STAR Dobin et al. (2013) to the ''Homo_ sapiens.GRCh38.dna.primary_assembly.fa'' from Ensembl. The resulting read alignment bam files were assembled into transcripts with StringTie Pertea et al. (2015) using the GRCh38.90 gtf annotation from Ensembl. mRNA Transcripts were mapped to the mass spectrometry signal of protein abundance using the Homo.sapiens and Mus.musculus package in R.

ATAC-seq data processing
General sample qualities were assessed as for the RNA-seq with fastQC. Transcriptional Start Site (TSS) enrichment for the samples were calculated with tssenrich (https://pypi.org/project/tssenrich/) and found to be acceptable at an average value of 5 (see https://www.encodeproject.org/atac-seq/ for ATAC-seq quality standards). Sequencing adapters were trimmed using Trim-Galore (https://github.com/ FelixKrueger/TrimGalore), and the reads aligned to the ''GCA_000001405.15_GRCh38_no_alt_analysis_ set.fna'' from Refseq using Bowtie2 (Langmead and Salzberg, 2012). ATAC-seq signal peaks were called using ''Genrich'' (https://github.com/jsh58/Genrich) and reads in peaks counted with ''featurecounts'' from the Rsubread package (Liao et al., 2019).Prior to counting of reads in peaks duplicate reads were dropped with ''Picard MarkDuplicates'' and all alignments shifted with ''alignmentSieve -ATACshift'' from deeptools (Ramirez et al., 2016). Peaks were then assigned to the promoter of transcripts with a flanking transcript distance of 5000bp, see Yu et al. (2015), by mapping the peaks to the promoter of transcripts in the GRCh38.90 gtf annotation from Ensembl, performed in python. Promoters were defined as G 3000bp from their TSS. This threshold is similar to the values found in the literature (Corces et al., 2016;Fullard et al., 2018;Wu et al., 2018).

Data filtering and analysis
Our datasets consist of 41466 ATAC-seq time series, 147119 RNA-seq time series (of which 49890 are labeled as ''protein coding''splice variants according to Ensembl Hg38) and 6909 proteins mass-spec time series. The time series for ATAC-seq and RNA-seq consist of the following time points (in hours): t a = t s = ½0 0:5 1 2 6 24 h , while for the protein datat p = ½0 1 2 6 24 120 h. The t = 120 h time point is disregarded in the following. All time series consist of 3 biological replicates. The data were interpolated using cubic splines, to a resolution of 0:5h, meaning that after interpolation we have time series of 49 time points. For 4797 of these proteins there exist an association with at least one ATAC-seq peak and a splice variant, hence we have 4797 'protein-coding units' represented as DAGs. These DAGs involve 8110 ATAC-seq peaks and 27642 alternative splice variants, the latter after filtering for the ''protein coding'' annotation in the Ensembl database, and after excluding time series missing too many points or not linked to any ATAC-seq peak.

Functional annotation for splice variants of a gene
In the APPRIS database (Rodriguez et al., 2018), splice variants of a gene are labelled as PRINCIPAL:x, where x is an integer running between 1 and 5 according to importance (1 = most important), ALTERNATIVE:x, with x = 1; 2, and MINOR. Other splice variants irrelevant for translation (retained introns, processed transcripts, nonsense mediated decay, non-coding transcripts, etc.) are here classified as NON-CODING. A few splice variants for which an APPRIS classification is missing are denoted as ABSENT. For the 27642 splice variants present in our DAGs, the distribution of APPRIS scores is shown in Figure 4A of the article, yellow bars. In particular, around 62% of the 4797 DAGs have one splice which is labeled as PRINCIPAL:1 in APPRIS (see violet bars in Figure 4A).

Compiling a TF -splice variant interaction map
A map of TF -binding sites was assembled, involving 656 TF and a total of around 6.5M potential interactions with our 27,642 transcript isoforms. This was achieved in three steps. A chromatin peak -splice variant interaction map results from the association procedure described above. ATAC-seq can also be used to infer which TFs regulate gene transcription. In fact, TFs binding events leave behind a characteristic mark: the sequence of DNA where the binding occurred presents a drop in the accessibility-these sequences are usually called footprints. By scanning each footprint for known TF motifs, using a TF binding site database (HOCOMOCO v.11, see Kulakovskiy et al. (2018)), we have associated each peak to a list of potential TFs. This constitutes the second interaction map, TF-peak. iScience Article Lastly, we have combined the two interaction maps (i.e. peak-splice variant and TF-peak) to obtain a TF-splice variant interaction map.

Sample dynamical correlation
Let the vector xðtÞ represent ATAC-seq, RNA-seq or mass-spec data, measured at a finite number of time points t 1 <t 2 <.<t N . The method to compute sample dynamical correlations described in Opgen- Rhein and Strimmer (2006) is based on computing a numerical approximation of the inner product wherex i ðtÞ and x j ðtÞ are components of xðtÞ. To calculate (2) we assume x i ðtÞ and x j ðtÞ follow a linear approximation between the sampling time points, yielding where t 0 = t N + 1 = 0. Consider the averages over time of x i and x j , indicated with x i = 1 N P N k = 1 x i ðt k Þ and x j = 1 N P N k = 1 x j ðt k Þ, and the centered functions x C i = x i À x i and x C j = x j À x j . The sample dynamical covariance can be obtained by taking the inner product between the centered functions x C i ðtÞ and x C j ðtÞ while the corresponding sample dynamical correlation is Var½x j ðtÞ p are the standardized functions and Var½x i ðtÞ = Cx C i ;x C i D, Var½x j ðtÞ = Cx C j ; x C j D the corresponding sample variances. Intuitively, this definition assigns a positive (negative) correlation coefficient to time-series that appear mostly on the same (opposite) side of their time average.
In general, experiments are repeated multiple times, obtaining M replicates of each realization. We will denote x ik ðtÞ, k˛1;.;M, the k-th replicate of x i ðtÞ. In order to compute the dynamical correlation, as before we start by calculating for each replicate x ik ðtÞ and x jk ðtÞ their time averages x ik and x jk . This leads to the centered functions x C ik ðtÞ = x ik ðtÞ À 1 M P M r = 1 x ik and x C jk ðtÞ = x jk ðtÞ À 1 M P M r = 1 x jk and the variance estimates The corresponding sample dynamical covariance is of course given by A normality test for detrended data As mentioned above, the data for ATAC-seq, RNA-seq and mass-spec correspond to time series, and do not directly obey an assumption of stationarity nor of Gaussianity. The situation is however different for the detrended data x S ik described in the previous section.

Protein-coding units as Bayesian networks
To each protein we can associate a protein-coding unit formed by the alternative splice variants of the gene coding for that protein (according to the Ensembl classification, database build Hg38, Cunningham et al. (2018)), and by the chromatin peaks associated to those splice variants. The protein-coding unit represents the largest possible graph associated with the synthesis of a given protein for the data we have available. Let a; s; and p denote the variables representing resp. chromatin site accessibility, splice variant expression, and protein abundance. The causal dependencies we consider are a/s and s/p. Physically, these edges are unambiguously directed, meaning that all paths of the graph are directed and of length at most 2 (a/ s/p). The graph is therefore a 3-layer DAG, and typically it looks like in Figure 1A. For each protein, the corresponding DAG can be modeled as a Bayesian network with continuous random variables associated to the nodes and joint probability distribution that factorizes into a product of conditional probabilities according to the DAG. Overall, from our datasets we have 4797 DAGs of size that varies from 3 to 52 nodes, see Figure 3A. Different protein-coding units involve different genes and proteins, while for chromatin accessibility only a small fraction of peaks (8%) is shared by two or more genes. Since the node sets of different DAGs are essentially non-overlapping, each DAG can be treated independently, leading to a massively parallel set of Bayesian networks.

Computing conditional independencies
Roughly speaking, in Bayesian networks missing edges on a DAG correspond to pairs of variables that are either independent or conditionally independent given some other variables, see e.g. Koller and Friedman (2009). Let x and y be two variables corresponding to nodes of a DAG (i.e., x and y are any of the a, s and p localized on a protein-coding unit). Independence between x and y is indicated x t y, while conditional independence given a set C of variables is denoted x t yjC. For instance, choosing C = paðxÞ, the set of parent nodes of x, Bayesian network theory predicts that x t yjpaðxÞ whenever y is a non-descendent of x. These conditions are called local conditional independencies (or local Markov conditions). In general, the local Markov conditions do not exhaust all conditional independencies encoded in a Bayesian network. Other, less straightforward, independencies are obtained by checking the so-called d-separation property, see Pearl (1988). The resulting conditional independencies are referred to as global Markov conditions: x t yjS, where C = S is a d-separation set for the pair x and y. For the class of DAGs discussed in this article, consisting of 3 clearly defined layers (a, s, and p) and having edges only between adjacent layers (a/s and s/p), it is always possible to order the nodes so that local and global Markov conditions coincide (Koller and Friedman, 2009;Lauritzen, 1996). These conditional independencies can be systematically computed from the topology of our DAGs. Overall in the 4797 DAGs considered in the article there are more than 1:5 , 10 5 conditional independencies, split into four4 categories, a/s, a/p, a À a, s À s, see Table 1. A fifth possible category, s/ p, is empty, i.e., s and p are never considered conditionally independent, as by construction all s in a DAG are connected with the corresponding protein p (i.e., all s are ''causing''p).

A correlation-based verification test for conditional independencies
The conditional independencies computed on the DAGs can be validated empirically via a correlation analysis. Since our time series represent a cellular dynamical process (cell differentiation), we decided to use the dynamical correlation of Opgen-Rhein and Strimmer (2006) in place of ordinary correlation, because it can account for longitudinal data that are not stationary, as observed above. In the article we always omit the adjective ''dynamical'' when mentioning correlations and partial correlations.
Given x and y colocalized on the same DAG, let R xy be the corresponding sample correlation. We say that the independence condition x t y is verified (or validated) by the data if R xy <q, where q is the threshold provided by a Fisher z-test. Similarly, the conditional independence condition x t yjC is verified by the data if R xy:C <q, where R xy:C is the partial correlation between x and y conditioned on C.

Bayesian networks and conditional independence
A Bayesian network is a probabilistic graphical model on a Directed Acyclic Graph (DAG) in which to each node is associated a random variable X i . The joint probability distribution that represents the DAG ll OPEN ACCESS iScience 25, 104048, April 15, 2022 iScience Article factorizes according to the structure of the DAG. For instance, for the DAG of Figure 1A of the article, of variables X i = fa 1 ; a 2 ; s 1 ; s 2 ; s 3 ; pg, we have P À a 1 ; a 2 ; s 1 ; s 2 ; s 3 ; p Á = Pða 1 ÞPða 2 ÞPðs 1 ja 1 ÞPðs 2 ja 2 ÞPðs 3 ja 2 ÞP À p s 1 ; s 2 ; s 3 Á : The topology of the DAG defines also a series of conditional independence relationships which are normally classified into local and global Markov independencies, see Koller and Friedman (2009). Local Markov conditions are expressed as X i t X j jpaðX i Þ wherepaðX i Þ are the parents of X i , and X j is a non-descendent node of X i . Global Markov conditions are instead obtained computing the so-called d-separation set (Pearl, 1988), S, i.e., the set isolating X i from the rest of the DAG: if X j ;S, then X i t X j jS: If the nodes of the DAG are ''well-ordered'', i.e., if a topological order is chosen so that X i˛p aðX j Þ implies i< j, then local and global Markov conditions coincide. For instance, for the DAG in Figure 1A of the article the Markov conditions are: a 1 t a 2 ; s 1 t s 2 ja 1 ; a 2 ; s 1 t s 3 ja 1 ; a 2 ; s 2 t s 3 ja 2 ; a 1 t pjs 1 ; a 2 t pjs 2 ; s 3 ; s 1 t a 2 ja 1 ; s 2 t a 1 ja 2 ; s 3 t a 1 ja 2 where some of the conditioning sets can be reduced (e.g., s 1 t s 2 ja 1 ; a 2 can be written as s 1 t s 2 , etc.).

Moral edges and moral graphs
Starting from a DAG D, one can construct the so-called moral graph D m obtained by adding to D the undirected edges connecting co-parents of a node (called moral edges), and dropping all edge arrows. By construction, for our DAGs, all s À s edges are moral edges because all splice variants are connected to the corresponding protein p, and the motif formed by s 1 and s 2 with p is always a so-called v-structure s 1 /p)s 2 . The corresponding d-separating set between s 1 and s 2 never involves p, but only upstream causes (i.e., one or more a, see example on the left in Figure 1C). Similar moral edges can occur also when two a are connected to the same splice variant s. From the topology of the DAGs, these can only be associated to marginal independencies, not to conditional independencies. Not all a À a independencies we consider are however moral edges, see e.g. Figure 1A. None of these s À s or aÀ a edges is part of the original DAG D, but they are of D m . However, if a probability distribution factorizes over D it factorizes also over D m . Hence the conditional independencies associated to moral edges sÀ s or aÀ a can be considered of ''secondary importance'' for what concerns the determination of the causal dependency structure among the variables of a DAG.

Gaussian graphical models
The probability density function of a n-dimensional Gaussian distribution over X = fX 1 ; .; X n g (denoted X in vector form) is given by where X $ N ðm;SÞ. From now on we assume that m = 0, and that the covariance matrix S is symmetric, positive definite (p.d.). K = S À1 is called the concentration matrix (or information matrix, or precision matrix) of the multivariable distribution and is also a p.d. matrix. Just like the correlation matrix R is the normalization of S by the diagonal matrix D 1 = diagð ffiffiffiffiffiffiffi ffi S 11 p ;.; ffiffiffiffiffiffiffi ffi S nn p Þ, i.e., R = D À1 1 P D À1 1 , so we can consider the normalization of K by D 2 = diagð ffiffiffiffiffiffiffi K 11 p ; .; ffiffiffiffiffiffiffi ffi K nn p Þ: H : = D À1 2 KD À1 2 . Straightforward calculations give that H = D À1 3 R À1 D À1 3 where D 3 = D 1 D 2 . Both R and H are obviously p.d. and have the same sign pattern as S resp. K. Notice that in the article, for simplicity of notation, we use a concentration matrix J = R À1 , which of course corresponds to J = D 3 HD 3 .
The partial correlation between the variables X i and X j given the remaining n À 2 variables X c = XyfX i ; X j g is defined as or, in matrix form, P = 2I À H. This formula shows that the concentration matrix H and the partial correlation matrix P have the same nonzero pattern but with opposite signs. P is in general not p.d.

Balance and inverse balance on Bayesian networks
Consider the graph GðRÞ formed by taking the correlation matrix R as adjacency matrix. Because, in general, R+0, GðRÞ is a signed graph. The signed graph GðRÞ is said to be balanced (or structurally balanced, see Facchetti et al. (2011), or frustration-free in statistical physics language, see Mezard et al. (1986); Mezard and Montanari (2009)), if all cycles of GðRÞ have positive sign, i.e., they contain an even number of negative edges. Denote V the node set of GðRÞ. An equivalent condition to balance is that there exists a partition of V into 2 sets V 1 , V 2 such that V 1 XV 2 = B and V 1 WV 2 = V, for whichR ij R0 for all v i ; v j˛Vk and R ij % 0 for all v iV k and v j˛V[ , ks[. Another equivalent condition is that there exists a diagonal signature matrix D = diagðd 1 ; .; d n Þ with d i = G1, such that, after the change of basis with D = D À1 (also called a gauge transformation), R D = DRD is nonnegative, and therefore GðR D Þ has all nonnegative edges. For Gaussian variables X i and X j , one says that X i and X j are positively associated if for all monotone functions f and g, R f ðXi ÞgðXj Þ R0, which in particular implies that R Xi Xj R0. If a vector of Gaussian random variables X has a balanced correlation graph GðRÞ, then the variables DX are positively associated (they have correlation R D ). A consequence of positive association is that X i and X j are independent iff R Xi Xj = 0 (Thm. 3.7 of Fallat et al. (2017)).
In Gaussian graphical models, a concept stronger than positive association is often used, that of Multivariate Total Positivity of order 2 (MTP 2 ). In terms of the concentration matrix K = S À1 (or of H = ðD 3 RD 3 Þ À1 ), X $ N ð0; SÞ is MTP 2 if and only if K is an M-matrix, i.e., K ij %0cisj (and K is p.d., hence K ii >0). X being MTP 2 is a sufficient but not necessary condition for X being positive associated (meaning that K M-matrix implies S[and R] nonnegative, see Karlin and Rinott (1983)). In this article, we call inverse balanced a Gaussian probability distribution X $ N ð0; SÞ such that K = S À1 becomes an M-matrix after a gauge transformation with a diagonal signature matrix D, i.e., K D = DKD is an M-matrix. It is straightforward to show that if D renders K D nonnegative, then it also renders DRD (and DSD) nonnegative, meaning that the same change of ''orthant order'' induced by D that renders X MTP 2 also renders X positively associated. The same D is also such that DPD becomes nonnegative, as well as all partial correlations d i R ij:C d j for any subset C3XyfX i ;X j g. The notion of inverse balance is referred to as signed MTP 2 in Lauritzen et al. (2019). Given a p.d. covariance matrix S and its concentration matrix K = S À1 , then checking inverse balance is a purely graphical condition, better tested on the graph of the partial correlation matrix P = 2I À H: X is inverse balanced if and only if GðPÞ is balanced. In fact by construction, the ''dominance'' of the diagonal part encoded in the definition of M-matrix is already ensured by the fact that H is p.d., hence, only the sign of the off-diagonal entries must be checked, and having ½DHD ij %0 for all isj for some D amounts to having DPD nonnegative, i.e., GðPÞ balanced (or ''non-frustrated'', as it is called in Malioutov et al. (2006)). Since it is also inverse balance 0 G balance: Using balance and inverse balance for checking sample data consistency Consider a sample correlation matrix R. By construction, correlation matrices are symmetric R ij = R ji , while the ''true'' interaction graph is a DAG, i.e., if an edge X i /X j exists, then X j /X i cannot be another edge of the DAG. Since a correlation matrix is symmetric, its graph GðRÞ is always undirected. Any pairwise samplebased measure of association between two variables (correlation, mutual information, entropy, etc.) is by construction symmetric, hence so is the graph built on such measure. R symmetric reflects the fact that correlation is not causation.
The sample correlation obtained from the data is also typically full, and so is GðRÞ. When applied to our protein-coding units, this implies that all variables pairs are connected by an undirected edge, even those that iScience Article are not so in the underlying Bayesian network, like a and p, or pairs of a, or pairs of s. In particular, the correlation graph GðRÞ does not normally coincide with the moral graph associated to the DAG, and does not respect the Markov conditions of the original Bayesian network. It is this passage from the ''true'' topology of the DAG to the undirected and fully connected topology of GðRÞ that provides an opportunity for checking consistency of the data, as it induces the cycles whose balance helps us discriminate the reliable data from the unreliable ones.
The simplest cycle on GðRÞ is a triangle of nodes X i , X j and X k . On this triangle, if we associate an empirical regulatory action to each edge of GðRÞ by taking the corresponding value of R, then it is easy to realize that if GðRÞ is balanced the 3 regulatory actions are compatible with each other, while if GðRÞ is not balanced, then at least one of these empirical regulations is incompatible with the others. For instance, if in the data X i and X j are positively correlated, but X j and X k are negatively correlated, then one expects that X i and X k should also be negatively correlated (balanced case) while if instead X i and X k are positively correlated (unbalanced case) then the 3 time series of X i , X j and X k are incompatible with each other and should not belong to the same ''true'' regulatory graph. This test can be extended to an arbitrary sample correlation graph, and gives us a way to label certain interaction patterns as ''more plausible'' (candidates to be true positives) or as ''less plausible'' (candidates to false positives) based on the samples. On the original DAG of our protein-coding units we cannot have cycles, but whenever we consider two or more adjacent edges on the DAG then the corresponding undirected graph GðRÞ has cycles, and a balance test naturally follows. For instance, if X i = a and X j = s and X k = p are linked by edges on a DAG, then it is possible to check if the regulatory pattern of the ''true'' path a/s/p, deduced by the correlations R as and R sp , is compatible with the corresponding ''nonphysical'' branch a/p deduced by the correlation R ap , see Figure 1C of the article for an illustration. The test is purely on the sample correlations and does not lead to any causality violation on the corresponding DAG. A similar argument can be set up for any DAG, or subset of a DAG, involving at least two adjacent edges.
Since inverse balance is computed from sample concentration matrices in an analogous way, inverse balance of a DAG can also be investigated using the same principles.

Sample correlation balance and verification of conditional independencies
The Markov independence conditions of a DAG can be checked for empirical validity using sample correlations and sample partial correlations. In particular, we can say that X i t X j jC 5 jR X i X j :C j< q whereC denotes the conditioning set and q is a significance threshold obtained by e.g. setting an hypothesis test using a Fisher z-transform (Tong, 1990) Z À X i ; X j C Á = 1 2 ln 1 + R X i X j :C 1 À R X i X j :C ! and then choosing q = argmax RX i X j :C subject to ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi N À cardðCÞ À 3 p , where F is the cumulative probability distribution of the multivariate Gaussian N ð0;IÞ, N is the sample size, and a = 0:05 is the significance level. See Kalisch and Bü hlmann (2007) for an identical test. Typical values we obtain for our sample size are q $ 0:16 which is a reasonable threshold for our partial correlations.
What is shown in the article is that balance of GðRÞ leads to verification of conditional independencies much more often than unbalance. This is a heuristic property, and indeed exceptions are visible in the data. Also for the more strict property of inverse balance the same happens. What can be made rigorous is that for inverse balance conditioning leads to an element-wise contraction towards 0 of the ''residual'' partial covariance.
Even if the property ''balance leads to decrease of correlation upon conditioning'' cannot be turned into a rigorous statement in the balanced case (not even for covariances), it is however empirically true in a large ll OPEN ACCESS iScience 25, 104048, April 15, 2022 27 iScience Article fraction of our sample data. This combines well with the previous observation that balance is an indication of data compatibility (i.e., an intrinsic test of ''true positivity'' of the interactions).

Binary classification for balance and inverse balance under varying acceptance thresholds
A natural way to read Table 3 is to think of the ''verified/non-verified'' condition as the true condition, and of ''balance / unbalance'' (or ''inverse balance / inverse unbalance'') as the prediction. This would associate ''False Positives'' (FP) to the combination ''non-verified + balanced'', and ''False-Negatives'' (FN) to ''verified + unbalanced''. However, if we want to turn Table 3 into a binary classificator depending on a threshold, then we have to invert the role of the two conditions, taking ''balance / unbalance'' (or ''inverse balance / inverse unbalance'') as truth and ''verified / non-verified'' as prediction. In fact, the latter pair is the only one depending on a threshold, here the value q associated to a Fisher z-test (verification 5 R Xi Xj :C <q). The only practical effect is to invert the roles of FP and FN, meaning that the total amount of errors is unchanged. In this way we can evaluate the effect of continuously varying q on the accuracy of the reconstruction, for instance through Receiver-Operating Characteristic (

List of T H 1 related genes
The following list of gene was compiled assembling the Gene Ontology annotation for the following categories:

List of T-cell related genes
The following list of gene was compiled combining the list of T H 1 related genes given above with the Gene Ontology annotation for the following categories: