Synthetic data generation with probabilistic Bayesian Networks

Bayesian Network (BN) modeling is a prominent and increasingly popular computational systems biology method. It aims to construct network graphs from the large heterogeneous biological datasets that reflect the underlying biological relationships. Currently, a variety of strategies exist for evaluating BN methodology performance, ranging from utilizing artificial benchmark datasets and models, to specialized biological benchmark datasets, to simulation studies that generate synthetic data from predefined network models. The last is arguably the most comprehensive approach; however, existing implementations often rely on explicit and implicit assumptions that may be unrealistic in a typical biological data analysis scenario, or are poorly equipped for automated arbitrary model generation. In this study, we develop a purely probabilistic simulation framework that addresses the demands of statistically sound simulations studies in an unbiased fashion. Additionally, we expand on our current understanding of the theoretical notions of causality and dependence / conditional independence in BNs and the Markov Blankets within.


Introduction
Dependency modeling is a descriptive / predictive modeling activity that is especially suited to a systems biology approach to data analysis (in particular, analysis of heterogeneous big data). During dependency modeling, a biological network, or a graphical representation of a biological model (including genotypes, phenotypes, metabolites, endpoints, and other variables and their interrelationships), is constructed from the "flat" data. (Subsequently, specific components of biological networks can be evaluated using conventional statistical criteria, and sub-networks can be selected for further biological hypothesis generation and testing). This activity is also known as causal discovery (inference) in graphical models.
Most of the recent work in the BN methodology field aims at improving the scalability and accuracy of BN modeling [28][29][30][31], handling mixed data types (including various types of biological data) [32][33][34], incorporating latent variables [35,36], causal discovery [26,27,37], and developing more robust software interfaces and visualization options [38]. In view of these developments, the ability to objectively assess the performance of BN modeling is of the utmost importance. Currently, there are three principal avenues for accomplishing this: (i) using well-established (in the machine learning community) predefined benchmark models and datasets, such as "Alarm" or "Asia", (ii) using various specialized biologicallyoriented benchmark datasets, both real and simulated, such as "LUCAS" or "DREAM" [9,30,39,40], and (iii) developing approximately realistic simulation frameworks [31,32,41,42]. The first approach is necessarily limited and does not generalize to modern highdimensional biological data. The latter two strategies have their pros and cons (primarily having to do with observability -generalizability trade offs); a useful discussion can be found in [9].
In this study, we concentrate on the third approach, namely constructing robust, generalizable, and mathematically rigorous simulation frameworks. Typically, these frameworks involve (i) specifying synthetic network structures, (ii) utilizing some interaction model to encode this structure, and (iii) the actual generation of synthetic data via a method that depends on the interaction model. The interaction model is highly contextual, but the most common approaches are either additive noise modeling [43], e.g. SEM (structural equation modeling) [31,41], or probabilistic modeling via specification of the distribution [5,44,45]. The latter requires sampling techniques to generate data, e.g. exact sampling, MCMC or Gibbs sampling.
In the broader context of heterogeneous biological data and networks, it is important to distinguish between the notions of causality, correlation, association and dependence. Establishing causality [21,24,27,40] is usually perceived as the ultimate goal in the biomedical network analyses [16,29,31]. However, establishing (and even defining) causality is often equivocal, for both theoretical [24,27] and practical/biological reasons (e.g., the ambiguity of the "causal" interpretation of the relationship between two correlated gene expression measurements, or two single nucleotide polymorphisms in linkage disequilibrium [41]). Of course, there are advantages to viewing certain biological network models (or their parts) through the lens of directional causality. First, directional causality often corresponds well to biological reality (to use an obvious example, phenotypes tend to be "downstream" of genotypes). Second, existing BN algorithmic machinery largely relies on the concept of directed acyclic graphs (DAGs). And yet, at the most basic level (e.g., physical -chemical interaction of two biological molecules), it is unclear if it is possible, or even desirable, to impose a "directional causation" label on what is essentially a dependence relationship that is inferred from observing the joint probability distribution. The concept of "cause -effect" may have more to do with rationalization in seeking implication chains rather than reflecting the totality of any given interaction. Therefore, we chose to frame this study predominantly around the concept of dependence, which is a measure of relevance that allows us to make inferences about biological reality without necessarily invoking considerations of causality (however, we do take into consideration ancestor-offspring relationships in DAGs).
Regardless of whether we focus on causation or dependence relationships, SEM, being informationally denser and, therefore, a less general model of interaction [43], might not be the most appropriate tool for generating synthetic data in the case of biological BNs. Leaving aside the issues of acyclicity (and directionality in general), and the assumptions of linearity and normality (which can be dealt with, if imperfectly), there is a fundamental issue of observability. Indeed, consider a typical biological dataset subject to BN modeling -"real" biological datasets include only comparatively small sets of variables that are observable under the confines of specific biomedical studies (i.e., experiments). These variables cannot always be expected to be in any particular categorizable relationship known or suspected a priori. The most one can postulate with certainty about a "real" dataset is that the variables in it may be dependent or independent. Even for the well-dissected biological systems, we cannot necessarily say that the captured variables, no matter how carefully selected, belong to a group of key components that adequately describe the structure and dynamics of a biological system in question. Consequently, we propose to use a fully probabilistic, unconstrained, framework for synthetic data generation purposes, equipped for the statistical study of model parameter space as well as for the statistical performance assessment of structure discovery methods.
It is built around an analytically defined distribution induced by a given graphical model and, coupled with random model structure generation, is immediately interpretable as the joint probability distribution, while simultaneously featuring inferential characteristics encoded in conditional independencies. Below we detail considerations involved in building such a framework in Methods, Proofs and Results: Sections 2-3; ascertain the congruence between the "forward" process (simulations) and "inverse" process (actual reconstruction from the data of the networks and Markov Blankets within) in Methods, Proofs and Results: Section 4; and discuss simulation sample size considerations in Results: Section 5.
From this perspective, the data synthesis algorithm should be able to generate samples from arbitrarily diverse models to cover as much of a model space as possible.
The analytically defined distribution induced by a given BN model, has a joint probability that is factorizable into an expression that is no more complex than the methodical application of the chain rule permits. However, the set of simplifications imposed by conditional independencies encoded in the BN may significantly lower the complexity of the factorization.
Hence, the joint probability distribution of a given model can be represented as a product of conditional and unconditional probabilities This product provides a way to generate sample components serially from local conditional probabilities instead of direct sampling from the n-dimensional joint distribution. This is an important computational consideration, because using direct sampling to obtain adequate representation for low probability events is difficult given a finite sample size.
In fact, this difficulty grows with the dimensionality of the model, so that as the number of variables increases, many joint events quickly become intractable even for binary variables. Therefore, being able to rely on local sampling of conditional probability distributions is a key factor in successfully synthesizing components of an adequate n-dimensional sample that actually reflects the fine structure of the joint distribution.
If a necessary level of resolution in the data sampled from a given joint distribution is not achieved, the data source uniqueness problem, typical for BN reconstruction setting, is likely to be amplified. That is, if the same data is obtainable from different source distributions, making the source distinction all but impossible, the problem of recovering a BN from such data will be more ill-posed.
In order to utilize the resolution provided by the expansion above, a strategy for methodically generating conditional distributions for the dependent variables needs to be specified: For n-variate variable X and m-variate variable Y the relationship between the two variables can be characterized via the law of total probability which, combined with the chain rule, reveals the role of conditional probability: where y i are individual disjoint events.
This marginalization procedure can be viewed as a matrix-vector product which defines the conditional probability P(X|Y) as an n × m matrix with entries P(X = x j |Y = y i ): where P(X) = (P(x 0 ), …, P(x n )) T , P(Y) = (P(y 0 ), …, P(y m )) T with the subscripted values representing individual disjoint events.
Note the fraction P(X = x j ∩ Y = y i )/P(Y = y i ) arising from the use of the chain rule cannot be used to define conditional probability, because this expression is not defined for zero probability events of Y.
A formal criterion for the conditional independence of X and Y is given by ∀y i P X | Y = y i = P (X) (2.4) Its negation provides the criterion for the conditional dependence ∃y i P X | Y = y i ≠ P (X) (2.5) i.e. the existence of at least one dependent event y i is sufficient to establish the dependence.
The above condition, however, reveals little about how to consistently construct appropriate objects, and how the properties of these objects reflect on the character of the dependence relationship between X and Y. This can be remedied by assessing the dependence relationship in a way that is agnostic about the unconditional distributions, as a separately defined transformation of the ancestor distribution q = P(Y) to the descendant distribution p = P(X), a matrix P(X|Y) relating X to Y via marginalization operation, inducing P(X ∩ Y) and P(X) along the way, i.e.
With the above goal in mind, consider the following observation Lemma 1. Let X be an n-variate random variable, then the columns of conditional distribution P(X|Y) are members of (n − 1)-simplex S ⊂ ℝ n .
Proof. Trivially, the elements of any column of a conditional distribution P(X|Y) sum up to 1. Hence, because the variable X has n outcomes, all columns are members of the set defined by which is an n − 1-simplex, and a subset of ℝ n □ To simplify the notation and to emphasize that the conditional distribution P(X|Y) is defined independently, let D be an n × m linear operator whose columns are members of the (n − 1)-simplex S, so that P(X) = D · P(Y) given some P(Y).

Lemma 2.
If the column rank of D defined above is at least two, then X defined by P(X) = D · P(Y) is conditionally dependent on Y.
Proof. Suppose the rank of D is at least two, then there is at least one linearly independent column, i.e.
where D i is the i-th column of D. It follows that if for every index k ≠ i the coefficients are given by α k = P y k 1 − P y i , then Rearranging the terms yields ∃i, D i ≠ D ⋅ P (Y ) (2.10) In other words, there is an i corresponding to y i such that P(X|Y = y i ) ≠ P(X), which is a restatement of the dependence criterion. Hence, there is at least one dependent event, making X dependent on Y. □ This lemma already guarantees that any matrix of appropriate size with rank of at least two should be sufficient to generate a conditionally dependent pair of variables. However, this does not elucidate additional constraints that an arbitrary conditional dependence may impose, and it does little to provide control over the structure of a synthetically constructed dependence relationship.
The following lemma addresses these shortcomings: Lemma 3. Let X be an n-variate random variable dependent on an m-variate random variable Y; then

•
the convex hull C formed by the columns of D has at least two vertices; • rank of D must be at least two; • P(X) lies in the convex hull C formed by the columns of D; • the number of vertices of convex hull C corresponds to the number of conditionally dependent events.
Proof. Suppose X is conditionally dependent on Y, then where D i is the i-th column of D, because P(X) = D · P(Y). Rearranging the terms gives for some arbitrary P(Y). Assuming that P(Y = y i ) ≠ 1, this yields with coefficients defined on the probability simplex characterized by In other words, for X to be conditionally dependent on Y for any D defined independently of P(Y), there must be at least one column in D not expressible as a convex combination of other columns. Hence, this column is a vertex of a convex hull C ⊂ S formed by all convex combinations of columns of D characterized by The above implies the following: • the convex hull contains more than one member, and, therefore, has at least two vertices; • because there are at least two vertices, this polytope must be embedded in at least a (2-1)-simplex, and, therefore, in at least ℝ 2 -hence the linear operator D must be at least of rank two; • the unconditional distribution P(X) always lies in this convex hull, because P(X) = D · P(Y) for any independently defined D and P(Y); • every conditionally dependent event of Y corresponds to a vertex of the convex hull that envelops the columns of D. □ Some of the immediately useful consequences of the above can be summarized in the following corollary: Corollary 1. Let X be an n-variate random variable, Y be an m-variate random variable and D a linear operator characterized by P(X) = D · P(Y), then • X is conditionally dependent on Y if and only if the rank of the linear operator D is at least two.
• Let d be the number of dependent ancestor events that the linear operator D of the size n × m and rank r encodes; then r ≤ d ≤ m.
Conditional distributions of higher ancestral complexity, i.e. P(X|∩ k Y k ), can be treated as a pairwise distribution P(X|Y) if we observe that ∩ k Y k can be viewed as a single variable over the combinatorically prescribed joint events. In this situation, the only specification that is different from the already considered scenario with the pairwise conditional distribution is that the joint events are not given a priori. Rather, the joint events have to be derived as a function of the number and arity of the ancestor variables in the DAG. For example, for two binary ancestor variables, there will be four possible events, six for a binary and a tertiary variable, etc.
After defining all the necessary parameters, the sample construction can be carried out via the exact sampling method, starting with the sampling of the set of root nodes, and proceeding sequentially forward following the rule that, given a particular ancestral sample y*, the joint probability distribution for the downstream node combined with the ancestral sample is given by P X ∩ Y = y* = P X | Y = y* P Y = y* (2.16) i.e. for each sample Y = y* the sample X = x* is obtained with probability P(X = x*|Y = y*). This serialized construction of samples will produce a single row of data with values corresponding to nodes in the probabilistic DAG. This row will then constitute a single sample from the given DAG. Repeating this procedure as necessary will generate a dataset of a prescribed size. We discuss sample size considerations below in Results: Section 5.
The above results suggest an explicit approach for constructing random conditionally dependent distributions (and generating synthetic data from them) in a formally sound way, while ensuring that all relevant relationship properties are accounted for. This framework is, in principle, extensible to mixed-type variable models for known distribution families via approximations, but arbitrary model generation in that scenario is highly non-trivial due to difficulties with the specification for arbitrary distriubtion families.
It should be noted that the numerical structure of the linear operator D is secondary to the properties of the convex hull formed by its columns, because the latter reflects the qualitative structure of the event space of the conditional dependence relationship. In particular, manipulating the properties of this convex polytope provides a way to choose the proportion of dependent and independent events in the event space.
More importantly, we observe that the columns of D must be sampled from the (n − 1)-simplex S, and if sampled uniformly, this is equivalent to sampling from a Dirichlet distribution Dir(α) with the appropriate concentration parameters. However, as will become evident below, constructing random dependence relationships that adequately cover the model parameter space is not as trivial a task as it may seem, and is something that uniform sampling of individual simplex elements is not able to address.

Proposition 1.
Let t, p, q be i.i.d. random variables obtained from Dir(α) with α = (1, 1), and let D be a 2 × 2 matrix with columns p and q. Then the distribution of z = D · t is non-uniform.
Proof. Note that all the variables including z are members of (2 − 1)-simplex, so all variables have two components, i.e. z = (z, 1 − z), t = (t, 1 − t), p = (p, 1 − p) and q = (q, 1 − q). Then z = D · t reduces to the single independent equation relating the components, namely z = tp+(1 − t)q. Using the substitution u = z − tp 1 − t to reparametrize the integral yields the following expression for the cumulative distribution The corresponding density is obtained by differentiating with respect to z and evaluating the resulting integrals with the appropriate integration bounds. Using the requirement that 0 ≤ Applying the above to the evaluation of the integral yields 1 − t ∫ Ω dp dt = ∫ δ 1 1 1 − t min 1, z t − max 0, z − 1 t + 1 dt = 2( − z log(z) − (1 − z)log(1 − z)) = 2H(z) (2.19) where H is the information-theoretic entropy. Hence, the resulting density is non-uniform. □ The above implies that the naively constructed D induces a non-uniform distribution, strongly favoring simplex elements with higher entropy. Furthermore, as the number of columns in D grows, z = D · t, viewed as a weighted sum of the columns of D, has an asymptotic tendency towards the normal distribution. The latter point is made evident by the following central limit theorem: , (2.20) while the components t j of t are β(1, n − 1) distriubted random variables with common mean and variance E t j = 1 n , σ 2 t j = n − 1 n 2 (n + 1) , (2.21) Hence, the components z i of z are randomly weighted sums z i = ∑ where μ = E(d ij ) and σ 2 = σ 2 (t j ). □ The implication of the above is that BN modeling simulations that construct their conditional probabilities D in a naive way cannot adequately cover model parameter space for simulation studies as the individual node parameter distributions downstream of the root node in the network will tend towards the maximum entropy states. This tendency may be further amplified down the ancestor/offspring chains, and since for a non-root node, n grows exponentially with the number of ancestors, we can expect this bias to be noticeable for practically every non-root node.
One way to rectify this is, for example, by appropriately varying concentration parameters α, associated with conditional probabilities. Another option may be to fix the unconditional distributions of all the nodes of a BN first (thus losing the ability to generate a BN serially), and attempting to obtain all the conditional distributions subject to the constraints imposed by the prescribed unconditional node distributions (possibly too computationally expensive for the setting where millions of BNs and datasets will have to be generated for a sufficiently robust simulation study).
On the other hand, this distributional phenomenon could be instrumental in identifying more realistic prior distributions for constructing scoring criteria. For instance, for the BD (Bayesian-Dirichlet) family of metrics, the prior directly depends on the choice of the concentration parameter α.
Finally, under certain assumptions about the nature of the conditional dependence relationship in a given model, the tendency of the descendant nodes towards higher entropy configuration can be also be used as a criterion for suggesting a partial topological order of variables in the underlying DAG for the purposes of BN reconstruction. That is, reasoning that variables with relatively low entropy are unlikely to be terminal nodes can substantially narrow the space of possible DAG configurations that could fit a given dataset.

Methods, proofs and results: random graph distribution
To estimate the error of the structure recovery for the inverse solver (i.e., the actual BN reconstruction algorithm), the forward solver (the synthetic data generator) needs to not only synthesize data from a given model, but also generate the random model structures (joint probability factorizations) that the data is sampled from. This can be realized by producing random DAG adjacency matrices, or, more specifically, triangular extractions from symmetric adjacency matrices to account for the acyclicity of structures.
In order for the error estimation to lead to reliable and meaningful results, synthesized random structures should be distributed in a well understood or prescribed manner. Therefore, we need to establish some of the basic properties of DAG structure distributions.
The discrete nature of graph configurations implies nonuniform distribution across various graph densities for combinatorical reasons. For a graph of n variables with a given topological ordering, the maximum density (maximum number of edges) possible is given by Then the total count C of possible graph configurations with a density d < D is given by the binomial coefficient (3.2) which implies that structures with density around Let δ be the density valued function over graph structures. Then the probability that a graph G has density δ(G) = d is The joint probability of obtaining a configuration G with density d is therefore Gogoshin et al. Page 11 where the conditional probability of G given the prescribed density is P (G | δ(G) = d) = 1 C(d) (3.6) Of particular interest is the expression for P (δ(G) = d) because it implies that structures of medium density are more likely than any other. This effect reflects another systematic bias in the model space and can severely hinder the performance of a naive stochastic structure recovery method that searches the model space by evaluation only, not accounting for the fact that medium density states will occur more frequently. The same, of course, applies to sampling of the model space for statistical simulation studies.
In principle, this could be remedied by a forced uniform sampling across various density groups, thereby increasing the sampling rate over lower and higher density groups relative to the medium density groups or, similarly, by decreasing the sampling rate over the medium density groups. However, this may be impractical, because even for a problem of moderate size, the number of possible configurations at the extremes of the distribution is sharply limited relative to the medium density graphs, which would make it virtually impossible to maintain a constant sampling error. For a 16-node structure the above translates to 120 configurations with one edge, 840261910995 configurations with 8 edges, and on the order of 10 34 medium-density configurations with 60 edges. Therefore, if a 10% sampling rate is to be maintained at any cost, then the number of samples across all density groups is in the order of 10 35 (with most of the computational load in the medium-density groups). We conclude that pursuing a uniform sampling is computationally impractical for all but the smallest models, and, in general, the problem can only be addressed by introducing additional considerations and background information.

Methods, proofs and results: forward inference over Markov Blankets
It is important to ensure the congruence of independence and probabilistic inference mechanisms in the simulation framework (as described above) and during the BN construction process. In this section, we will dissect the "localized" probabilistic inference within a BN by utilizing the concept of Markov Blanket.
Consider the fact that in a simple BN of the form A → X → D the state of the variable X is not completely determined by the state of its ancestor A given that it also has a descendant D. The reason is that the conditional probability distribution of X for the particular instantiation a of the ancestor is further constrained by an instantiation d of the descendant, i.e.  The situation quickly gets more complicated for larger BNs, but, fortunately, a complete specification of a variable requires only its Markov Blanket, conveniently factoring the graph.
By definition, a Markov Blanket M of a particular variable X renders such a "center" variable conditionally independent from the rest of the BN (i.e. BN sans the Markov Blanket), with the "periphery" variables of M completely determining the state of the center variable assuming their states are known. This naturally leads to the problem of estimating the conditional probability of X as a function of the states of periphery variables in M − X.
The computational difficulties here are the same as with the estimation of any sufficiently complex joint probability. Often, we would lose accuracy due to the sampling error before we can estimate most joint events, unless all the variables are defined analytically, and the dimensionality of the computation for such an estimation grows superexponentially with the number of variables. However, these difficulties can be mitigated by relying on the structural information encoded in the BN. If correct, the simplifications introduced into the factorization sufficiently reduce the computational burden and the sampling error, thereby making the desired estimation not only feasible, but also more reliable. And although some of the difficulties alleviated are traded for the multiplicative propagation of error, the end result is still preferable to the direct loss of resolution.
Let a(X) and d(X) denote the set of immediate ancestors and immediate descendants of X respectively. Also, let ad(X) = a(d(X)) − X − a(X) − d(X) (4.3) so that ad(X) represents the set of ancestors of members of d(X) excluding X itself that are neither direct ancestors or descendants of X themselves, i.e. X ∉ ad(X), ad(X) ∩ a(X) = ∅ , ad(X) ∩ d(X) = ∅ (4.4) Then the Markov Blanket M of X is M = X ∪ a(X) ∪ d(X) ∪ ad(X) (4.5) and the joint probability of M can be factored as follows P (M) = P (X, a(X), d(X), ad(X)) = P (a(X))P (X | a(X) Note that ad(X) ∩ d(X) = ∅ P (ad(X) | X, a(X)) = P (ad(X) | a(X)) (4.7) and that a(d(X)) ∩ d(X) ≠ ∅ ∃Y ∈ d(X)(a(Y ) ∩ d(X) ≠ ∅ ) (4.8) Equation (4.6) provides direct access to all joint probabilities for any instantiation of variables in M, allowing us to proceed with the estimation of the desired conditional probability P X = x i | M − X = P X = x i , M − X ∑ j P X = x j , M − X (4.9) marginalizing over X in the denominator for any given state x i of X in the numerator.
Undoubtedly, the division operation itself introduces further numerical error into the estimate, but the classical numerical techniques are applicable in this situation. Different types of inference are possible under the same assumptions, utilizing the same set of equations. For example, one could assess the probability of a periphery variable Z ∈ M being in a state z i in order for the center variable X ∈ M to be observed in a state x j : In essence, the above type of inquiry allows us to make a judgment about the most likely scenario that coincides with the X = x j event, and this inquiry could be extended to more than one periphery variable.
Furthermore, the idea that periphery variables can predict the state of the center variable with a prescribed accuracy can also be utilized in structural optimization during the BN structure search. Local prediction accuracy as an objective optimization criterion is inherently practical because it attempts to maximize the utility of the result rather than concentrating exclusively on a set of more abstract notions associated with informationtheoretic properties.

Results: effects of the central limit and sample size considerations
We first consider the tendency of terminal node distributions to concentrate in the region of the simplex corresponding to maximum entropy when the nodes of a BN are initialized with conditional probability distributions sampled from the naively configured Dirichlet distribution with unitary concentration parameters, as described in Section 2.
To simulate this behavior we construct a random BN structure with eight tertiary nodes and prescribed random conditional distributions with columns drawn from the Dir(α) with α = 1, initialize the root nodes with unconditional probabilities drawn from the same distribution and propagate marginalization consecutively from the root nodes towards the terminal nodes, thus obtaining unconditional probabilities for the whole network (see Figure  1).
As expected, the result is that the terminal nodes tend toward maximum entropy configuration (the central region of the associated simplex on the right, Figure 1) -a manifestation of the central limit condition restricting the admissible node parameters. Under these circumstances, models favoring lower entropy distributions (outer regions of the associated simplex) on the terminal nodes are practically inaccessible and would be unusually rare.
In practice, it is important to evaluate how many samples should be generated (as prescribed in Section 2) for the estimation of the joint probability to be sufficiently robust. As a representative example, let us consider a network of eight tertiary variables and an average edge density of 0.8. This network has at the most 3 8 = 6561 possible joint events, and it is straightforward to analytically compute their associated probabilities, provided the network has been fully defined. Subsequently, we can evaluate the effect of sample size in estimating the "true" distribution. Figure 2 shows the analytical and estimated distribution over all possible joint events in our example, for four (increasing) sample size values. To quantify the differences between the distributions, we used Jensen-Shannon divergence (JSD), a distributional divergence metric (Figure 3). These results confirm that our procedures converge to zero distributional divergence with the increasing sample size.
Although our simulation framework is not limited by the complexity of the model, and the above results generalize to any number of nodes, the 8-node configuration was specifically chosen for the sake of a clearer presentation and for the ability to easily generate the sample size that covers the event space more exhaustively. As can be seen in Figure 3, the sample count required for even a modestly sized network to be well-represented in the data can be substantial.
Given that the analytical joint distribution over the network is calculated during the simulation, the sample size necessary to achieve the desired representational accuracy can be estimated via the multinomial distribution. If p = (p 1 , …, p k ) is the density of the joint events, and x = (x 1 , …, x k ) are the sample counts of the observed joint events, then the probability of sampling a particular sequence of counts x is given by the multinomial f(x, p). This allows us to formulate a number of possible accuracy constraints. For example, controlling the maximum deviation from the idealized count Np, with N = ∑ i x i , can be achieved by estimating N necessary to bring the probability of such a maximum deviation below a certain threshold ϵ, i.e. P(max i |Np i − x i | > δ) < ϵ, so that for a defined accuracy level δ, all deviations beyond that have a small probability of occurrence.

Discussion and conclusions
Simulation studies based on synthetic data generation are ubiquitous in the machine learning and computational biology domains. They are instrumental in objectively assessing the performance of modeling methods, such as BNs. However, comprehensive and realistic simulation frameworks are comparatively underdeveloped in the context of network-centered systems biology methodology. In this study, we presented theoretical and methodological considerations for developing a realistic, fully probabilistic, synthetic data generation framework for biological BNs.
At this time, we have developed corresponding algorithms and software for generating synthetic data for randomly generated discrete-variables probabilistic DAGs. These are included as part of our "BNOmics" BN modeling software package [6], which is freely available from the authors, as well as from the bitbucket open source distributary (https:// bitbucket.org/77D/bnomics). In the future, we intend to expand our algorithms to take into account mixed data types (continuous-discrete variable mix, specifically).
Two outstanding issues, relevant to both the BN reconstruction and synthetic data generation / BN performance evaluation, are (i) whether the Markov Blankets (within the BNs) can in principle be recovered with consistency (provided that the "forward generation" and "backward reconstruction" processes are sufficiently aligned, as in our proposed framework), and (ii) whether it is possible to incorporate a strict, rigorous definition of dependence and conditional independence in the BN reconstruction process, as opposed to concentrating on establishing causation from correlation. We are cautiously optimistic on both counts, but more investigation is needed, and will be forthcoming.
Much of the work detailed in this communication was spurred by our ongoing collaborative immuno-oncology study that involves comparative BN analyses of multi-dimensional FACS (fluorescence-activated cell sorting) and other immuno-oncology datasets obtained from patients with gastrointestinal and breast cancers undergoing checkpoint blockade immunotherapy treatments [3]. In that study, we aim to construct and compare BNs representing immune network states in sickness and health, before and after therapy, in responders and non-responders. In the future, we plan to generate synthetic datasets that more closely align with the real, observed FACS and other immuno-oncology datasets -this will allow us to rigorously validate the reconstructed immune network BNs and corresponding biological results. Based on our experience working with such data, it is our opinion that each and every BN analysis should, ideally, be accompanied by a corresponding simulation study built around synthetic data that hews as closely as realistically possible to the actual biological data under consideration.  Jensen-Shannon divergence (JSD) between the analytical and estimated sampled distributions of all joint events of a given 8-nodes random graph with average variable connection density and arity being 0.8 and 3, respectively. JSD, shown as a function of sample size, is obtained by averaging 100 replicas for each sample size value.