The importance of temporal information in Bayesian network structure learning

Several algorithms have been proposed towards discovering the graphical structure of Bayesian networks. Most of these algorithms are restricted to observational data and some enable us to incorporate knowledge as constraints in terms of what can and cannot be discovered by an algorithm. A common type of such knowledge involves the temporal order of the variables in the data. For example, knowledge that event 𝐵 occurs after observing 𝐴 implies the constraint that 𝐵 cannot cause 𝐴. This paper investigates real-world case studies that incorporate interesting properties of objective temporal variable order, and the impact these temporal constraints have on the learnt graph. The results show that most of the learnt graphs are subject to major modifications after incorporating incomplete objective temporal information. Because temporal information is widely viewed as a form of knowledge that is subjective, rather than as a form of data that tends to be objective, it is generally disregarded and reduced to an optional piece of information that only a few of the structure learning algorithms may consider. The paper argues that objective temporal information should form part of observational data, to reduce the risk of disregarding such information when available and to encourage its reusability across related studies.


INTRODUCTION
A large part of scientific research is driven by interest in discovering causal relationships from data to be used as guides for intervention, to maximise utility of interest and to minimise undesirable risk. Much of this research is based on methods that focus on maximising the predictive accuracy of a target variable $Y$ from a set of observed predictors $X$. However, the best predictors of $Y$ are often not its causes; hence the motto that association does not imply causation. While the distinction between association and causation is nowadays better understood, what has changed over the decades is mostly the way the results are stated rather than the way they are produced. Pearl and Mackenzie's book (2018) has brought great attention to the importance of, and need for, causal models such as Causal Bayesian Networks (CBNs) as the basis for achieving true AI. Any model that captures cause-and-effect relationships must, by definition, adhere to the temporal order of the variables. For example, an effect at time $t$ can only have causes observed at a time prior to $t$. The question of how to most effectively develop such models to solve real-world problems is therefore a particularly current concern.
The field of research that appears to have made significant steps towards causal discovery involves the constraint-based algorithms that are typically used to construct Completed Partially Directed Acyclic Graphs (CPDAGs) that can be converted into a BN model. A CPDAG is a graph that incorporates both directed and undirected edges and represents a Markov equivalence class of Directed Acyclic Graphs (DAGs). Most of the constraint-based algorithms are based on conditional independence tests, amongst others, that generate causal graphs under the assumption that the direction of the edges represents causal or influential relationships between nodes. Undirected edges in a CPDAG represent dependencies whose directionality cannot be determined from observational data. This process is inherited from the Inductive Causation (IC) algorithm (Verma and Pearl, 1990). The Peter and Clark (PC) algorithm has had a major impact in this area of research due to its simplicity, learning strategies, computational speed, and performance (Glymour and Cooper, 1999; Spirtes et al., 2001).
Alternatives to the constraint-based methods are the score-based algorithms which can be viewed as a traditional machine learning approach. This is because score-based learning involves heuristics that explore the search space of graphs and return the graph that maximises an objective function. Well-established examples include the K2 (Cooper and Herskovits, 1992) and GES (Chickering, 2002) algorithms. Unlike constraint-based methods, score-based algorithms do not make claims about causation. Moreover, hybrid algorithms exist that share characteristics with both the constraint-based and score-based learning, such as the Max-Min Hill-Climbing (MMHC) algorithm (Tsamardinos et al., 2006) and the L1-Regularization paths (Schmidt et al., 2007).
While both the constraint-based and score-based algorithms work well in theory (i.e., with synthetic data), for various reasons they are generally less effective when applied to real-world data (Freedman and Humphreys, 1999; Zhang, 2008; Korb and Nicholson, 2011; Koski and Noble, 2012; Dawid et al., 2015). Because of this, BN models are often constructed manually with knowledge, instead of being automatically generated by structure learning algorithms, and this applies to various real-world domains with access to causal knowledge (Fenton and Neil, 2012). As a result, many of the algorithms are defined and developed in ways that enable us to incorporate knowledge about what can and cannot be discovered by the algorithm with reference to the input data. Perhaps the most common type of knowledge involves the temporal order of variables, such as specifying that event $B$ occurs after observing $A$ and hence, $B$ cannot cause $A$.
The paper is structured as follows: Section 2 provides a formal introduction to BN structure learning with temporal constraints, Section 3 presents the methodology used to perform the experiments, Section 4 describes the experiments and presents the results, and Section 5 discusses the results and provides the concluding remarks.

TEMPORAL CONSTRAINTS IN BAYESIAN NETWORK STRUCTURE LEARNING
This section focuses on the standard score-based and constraint-based classes of learning to describe the process of incorporating temporal constraints into the structure learning process.

Temporal constraints in score-based learning
Cooper and Herskovits' K2 algorithm (1991) represents the first important attempt at learning the graphical structure of BNs. K2 is a score-based algorithm that uses an objective function to score graphs. Specifically, it searches for the graph $G$, given data $D$ and a discrete variable set $V$, that maximises (Cooper and Herskovits, 1991)
$$P(G, D) = P(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \qquad (1)$$
where $n$ is the number of variables in $V$, $P(G)$ is a constant ignorant prior probability for each $G$, $\pi_i$ is the Candidate Parent Set (CPS) of variable $v_i \in V$, $q_i$ is the number of unique instantiations of $\pi_i$, $r_i$ is the number of unique instantiations of $v_i$, $N_{ijk}$ is the number of cases in $D$ in which variable $v_i$ is instantiated as its $k$-th value and CPS $\pi_i$ is instantiated as its $j$-th configuration, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
K2 is also an order-based algorithm which assumes that complete information about the temporal ordering of the variables is given. Full prior information of the ordering eliminates the need to assess the orientation of edges. This is because the temporal information is imposed as a directionality constraint in the search space of graphs. In equation (1), this constraint translates into pruning of the CPS for each variable $v_i \in V$. For example, if variable $v_1$ precedes variable $v_2$ in the ordering, then $v_2 \to v_1$ would violate the ordering. To ensure such a violation does not occur, the CPS of $v_1$ would need to be pruned by removing all the combinations of parents that include $v_2$.
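As a rough illustration of how an ordering prunes the CPS in equation (1), the sketch below enumerates only parent sets drawn from a variable's predecessors and scores each one with the log of the per-variable K2 term. It assumes the standard Cooper and Herskovits form of the score, in which each observed parent configuration contributes $(r - 1)!/(N_{j} + r - 1)!$ multiplied by the factorials of the per-state counts. The function names, the `max_parents` cap, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import lgamma


def candidate_parent_sets(order, child, max_parents):
    """Yield parent sets for `child` that respect the temporal `order`."""
    predecessors = order[:order.index(child)]
    for size in range(min(max_parents, len(predecessors)) + 1):
        yield from combinations(predecessors, size)


def k2_log_score(rows, child, parents, arity):
    """Log of the per-variable K2 term for `child` given `parents`."""
    # Count cases per (parent configuration, child state) pair.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in rows)
    per_config = Counter()
    for (config, _state), count in joint.items():
        per_config[config] += count
    r = arity[child]
    # lgamma(x + 1) == log(x!), so factorials never overflow.
    # Unobserved configurations contribute log(1) = 0 and can be skipped.
    score = sum(lgamma(r) - lgamma(n_ij + r) for n_ij in per_config.values())
    score += sum(lgamma(n_ijk + 1) for n_ijk in joint.values())
    return score


# Hypothetical toy data in which B always copies A.
rows = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
arity = {"A": 2, "B": 2}
order = ["A", "B"]  # A precedes B, so B can never be a parent of A

best_for_b = max(candidate_parent_sets(order, "B", max_parents=1),
                 key=lambda ps: k2_log_score(rows, "B", ps, arity))
```

Under this ordering the only candidate parent set for `A` is the empty set, so the arc from `B` to `A` is never scored at all.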
Complete information of the ordering represents a very strong, and often unrealistic, assumption that greatly reduces the search space of possible graphs. Specifically, full knowledge of the temporal ordering reduces the search space from super-exponential to $2^{(n^2 - n)/2}$, which remains exponential in $n$.
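To give a feel for the size of this reduction, the sketch below compares the number of labelled DAGs on $n$ nodes with the number of DAGs consistent with one complete ordering. The unconstrained count uses Robinson's recurrence, which is not mentioned in the paper and is brought in here only for illustration. The constrained count assumes that each of the $n(n-1)/2$ ordered pairs is independently either connected or not, giving $2^{n(n-1)/2}$.

```python
from math import comb


def count_dags(n):
    """Number of labelled DAGs on n nodes, via Robinson's recurrence."""
    counts = [1]  # counts[0]: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * counts[m - k]
        counts.append(total)
    return counts[n]


def count_dags_given_order(n):
    """Number of DAGs consistent with one complete node ordering."""
    return 2 ** (n * (n - 1) // 2)


for n in range(1, 8):
    print(n, count_dags(n), count_dags_given_order(n))
```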

Temporal constraints in constraint-based learning
The PC algorithm (Spirtes et al., 2001) is one of the oldest and most important constraint-based algorithms. Unlike score-based learning that relies on a search-and-score process, constraint-based learning involves constructing a graph that is consistent with the results obtained over a series of conditional independence tests. The PC algorithm is based on the following six main steps (Spirtes et al., 2001):
i. Forms a fully connected undirected graph $G$ where each variable is linked to all other variables that belong in $V$.
ii. Eliminates edges in $G$ with a marginal dependency score lower than a given significance threshold (the threshold is usually set to $\alpha = 0.01$ or $\alpha = 0.05$).
iii. Performs conditional independence tests for each remaining edge $A - B$ in $G$, where $A - B$ is removed if $A$ and $B$ are found to be independent conditional on a third variable $C$ that is connected to either $A$ or $B$; i.e., if $A ⫫ B \mid C$.
iv. Performs conditional independence tests for each remaining edge $A - B$ in $G$, where $A - B$ is removed if $A$ and $B$ are found to be independent conditional on a pair of variables $\{C, D\}$ with edges both connected to $A$ or both connected to $B$; i.e., if $A ⫫ B \mid \{C, D\}$.
v. For each triple of variables connected as $A - C - B$, it orientates the triple as a v-structure (also known as the causal class of common-effect) $A \to C \leftarrow B$ if $C$ did not appear in the conditioning set from which $A$ and $B$ had their edge eliminated.
vi. For each triple of variables connected as $A \to B - C$, it orientates edge $B - C$ as $B \to C$.
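The v-structure orientation described in step v can be sketched as follows. This is a minimal illustration under stated assumptions, not the PC implementation: the skeleton is given as a hypothetical adjacency dictionary, and `sepsets` is assumed to map each removed pair to the conditioning set that made the pair independent.

```python
from itertools import combinations


def orient_v_structures(adjacency, sepsets):
    """Return directed edges (parent, child) implied by unshielded colliders."""
    directed = set()
    for middle, neighbours in adjacency.items():
        for left, right in combinations(sorted(neighbours), 2):
            if right in adjacency[left]:
                continue  # left and right are adjacent: not an unshielded triple
            # A missing sepset entry is treated as an empty conditioning set.
            if middle not in sepsets.get(frozenset((left, right)), set()):
                directed.add((left, middle))
                directed.add((right, middle))
    return directed


# Hypothetical skeleton: A - C - B with C - D, and no other adjacencies.
adjacency = {"A": {"C"}, "B": {"C"}, "C": {"A", "B", "D"}, "D": {"C"}}
sepsets = {
    frozenset(("A", "B")): set(),   # A and B were marginally independent
    frozenset(("A", "D")): {"C"},   # A and D were separated by C
    frozenset(("B", "D")): {"C"},   # B and D were separated by C
}
arcs = orient_v_structures(adjacency, sepsets)
```

In this toy case only the triple formed by `A`, `C`, and `B` becomes a collider, because `C` separates the other two pairs and therefore appears in their conditioning sets.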
In a constraint-based learning process similar to PC, temporal constraints would influence learning steps v and vi. Specifically, partial temporal information would determine the orientation of some of the edges preserved at the end of step iv, thereby pruning any tests needed to determine the orientation of those edges. In the case of complete temporal information, the orientation of the edges would be determined exclusively by the temporal constraints. This would make steps v and vi redundant, and the output graph a DAG (rather than a CPDAG).
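The effect of partial temporal information on edge orientation can be sketched as below. The function is hypothetical and assumes temporal knowledge is supplied as an integer tier per variable, with lower tiers occurring earlier. Edges whose endpoints share a tier, or lack one, are left for the usual orientation rules.

```python
def orient_by_tiers(undirected_edges, tier):
    """Split skeleton edges into those oriented by tiers and those left open."""
    directed, undetermined = [], []
    for a, b in undirected_edges:
        tier_a, tier_b = tier.get(a), tier.get(b)
        if tier_a is None or tier_b is None or tier_a == tier_b:
            undetermined.append((a, b))  # no temporal information applies
        elif tier_a < tier_b:
            directed.append((a, b))      # a occurs first, so a -> b
        else:
            directed.append((b, a))      # b occurs first, so b -> a
    return directed, undetermined


# Hypothetical skeleton and a single piece of knowledge: P precedes the rest.
skeleton = [("S", "P"), ("S", "ST"), ("ST", "G")]
tier = {"P": 1, "S": 2, "ST": 2, "G": 2}
directed, undetermined = orient_by_tiers(skeleton, tier)
```

With complete information every variable sits in its own tier, so `undetermined` is empty and the result is a DAG.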
Since constraint-based learning focuses on the exploration of local structures in sets of triples, as opposed to iterating over global structures as in score-based learning, it is generally considered to have lower computational complexity than score-based learning. As a result, temporal constraints are likely to have less impact on the computational complexity of a constraint-based algorithm than on that of a score-based algorithm.

METHODOLOGY
We often have partial, and rarely complete, information about the temporal order of the variables in the data. The methodology is driven by interest in assessing a) the ability of some well-established structure learning algorithms to discover graphs that satisfy known undisputed temporal facts, and b) the benefit of incorporating such temporal information as constraints into the structure learning process of these algorithms. The subsections that follow provide details about the case studies, the data, and the structure learning algorithms considered.

Data and Case Studies
Three case studies are considered that come from applications of BN modelling in different real-world domains. All three case studies incorporate interesting properties of temporal variable order suitable for the purposes of this paper (discussed in Section 4). The properties of the datasets associated with each case study are depicted in Table 1.
The first case study, which represents the simplest of the three, is based on football (soccer) team performance statistics taken from the English Premier League season 2017/18. In football, teams aim to gain possession ($P$) of the ball so that they can create shots ($S$) on target ($ST$) to score a goal ($G$) when the keeper fails to save the shot. While there are various other factors that influence the outcome of the variables defined here, there is a transparent and objective temporal order, from $P$ to $G$, as illustrated in Figure 1.
Figure 1. The assumed 'true' BN model of the football performance case study, where $P$ is possession, $S$ is shots created, $ST$ is shots on target, and $G$ is goals scored, for both home ($H$) and away ($A$) teams.
A sample of the first 10 rows of the dataset is provided in Table 2. Note that for variable Possession we only need to know the possession of the home team, $P_H$, since the possession of the away team is $P_A = 1 - P_H$. The data for variables $S$, $ST$ and $G$ are collected from football-data.com, and the data for variable $P$ from whoscored.com.
Table 2. The first 10 rows, out of 380, of the football performance dataset, where $P$ is possession, $S$ is shots, $ST$ is shots on target, and $G$ is goals scored, for both home ($H$) and away ($A$) teams.
The second and relatively complex case study is based on the forensic psychiatry data taken from (Constantinou et al., 2015). These data capture information about released prisoners with a serious history of violence and mental health problems in the UK. The 56 variables that make up the data are listed in Table 3. Some of the variables are based on transparent and objective temporal observations and associate with events that occurred before serving a prison sentence, during prison, and after release from prison. Observations related to events that occurred before, during, and after serving a prison sentence are indicated as temporal tiers t1, t2 and t3 respectively, where t1 precedes t2 and t2 precedes t3. Observations not necessarily belonging to a particular temporal tier are indicated with 'n/a'.
Table 3. The data variables of the forensic psychiatry case study. Observations that associate with events that occurred before, during, and after the prison sentence are indicated with the respective temporal tiers of t1, t2, and t3. Observations not necessarily belonging to a particular temporal tier are indicated with 'n/a'.
The third case study is based on the property market BN model presented in (Constantinou and Fenton, 2017). This BN model was used to assess the impact of property investment tax reforms introduced in 2015 by the British government. The 27 variables that make up the model are listed in Table 4 and ordered by temporal tier.
The temporal order of the variables is based on clearly defined rules and regulating protocols that associate with the UK's Buy-To-Let property sector. Specifically, variables at temporal tier t1 involve features that associate with the purchase of the property, at t2 they involve features that associate with rental income and expenses for the year following the purchase of the property, at t3 they involve features that associate with tax expenses and net profit given t2, at t4 they involve features that associate with the future growth in property value, and at t5 they involve features that associate with the future growth in rental income and associated expenses.

It is important to highlight that while all the variables belong to a specified temporal tier, this information does not constitute complete temporal information. This is because edges between variables that fall within the same tier are not subject to temporal constraints. Moreover, unlike the previous two case studies which involve real data, this third case study involves synthetic data generated directly from the conditional distributions of the BN model. Since synthetic experiments tend to overestimate real-world performance, this third case study investigates whether the conclusions obtained from synthetic data are consistent with those obtained from real data.
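The distinction between tiered and complete temporal information can be made concrete with a small sketch. It assumes, as a simplification, that tier knowledge only forbids arcs pointing from a later tier to an earlier one. The variable names and tier assignments below are hypothetical and are not the entries of Table 4.

```python
from itertools import permutations


def forbidden_edges(tier):
    """Directed edges (cause, effect) ruled out because the cause is in a later tier."""
    return {(a, b) for a, b in permutations(tier, 2) if tier[a] > tier[b]}


def unconstrained_pairs(tier):
    """Unordered pairs within the same tier, whose direction stays open."""
    names = sorted(tier)
    return [(a, b) for i, a in enumerate(names)
            for b in names[i + 1:] if tier[a] == tier[b]]


# Hypothetical tier assignment: two purchase features, one rental, one tax.
tier = {"price": 1, "deposit": 1, "rent": 2, "tax": 3}
blocked = forbidden_edges(tier)
free = unconstrained_pairs(tier)
```

Every pair returned by `unconstrained_pairs` still has to be oriented from data, which is why assigning every variable to a tier is weaker than supplying a complete ordering.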

Structure learning algorithms
Since many of the experiments involve incorporating partial temporal information as constraints into the structure learning process, the selection of the algorithms is restricted to those that accept such partial constraints. Moreover, the algorithms would also need to work with both continuous and discrete data, as well as with datasets that incorporate missing values. The TETRAD freeware provides access to six well-established structure learning algorithms, spanning all three classes of learning (i.e., constraint-based, score-based and hybrid), that satisfy these requirements. While each algorithm comes with a set of parameters that could be manipulated by the user, such as the level of significance described in subsection 2.2, we shall investigate the algorithms with their parameter defaults as implemented in TETRAD v6.5.3. Note that these parameters are not intended for tuning on a given dataset; they represent optional thresholds that can be subjectively manipulated to produce denser, or less dense, graphs. The six algorithms considered are briefly discussed below.
Perhaps the most well-known constraint-based algorithm is the PC algorithm previously described in subsection 2.2. Here we consider the modern version of the PC algorithm, called PC-Stable, that solves PC's order dependency issue determined by the order of the variables as they appear in the data (Colombo and Maathuis, 2014). PC-Stable generates CPDAGs from a set of d-separation equivalence classes of DAGs under the assumption that no latent common causes exist (Spirtes and Glymour, 1991). A variant of the PC algorithm is also considered, called Fast Adjacency Search (FAS). This algorithm only performs the adjacency search of the PC algorithm and hence, it returns the skeleton of PC (Spirtes et al., 2001).
The Fast Causal Inference (FCI) algorithm is a constraint-based algorithm that, unlike other PC variants, accounts for the possibility of latent variables. Similar to the PC algorithm, it performs a series of conditional independence tests to determine which edges to eliminate, starting from a fully connected undirected network. It then proceeds to the orientation phase that uses the stored conditioning sets that had led to the removal of adjacencies at the previous step, to orientate as many of the preserved edges as possible (Spirtes et al., 2001;TETRAD, 2017). The Really Fast Causal Inference (RFCI) algorithm is also considered, which is a variant of the FCI algorithm that decreases runtime by performing fewer conditional independence tests that are conditioned on a smaller set of variables, at the expense of minor changes to the output graph (Colombo et al., 2012).
The fifth algorithm considered is the Fast Greedy Equivalence Search (FGES), which represents an optimised version of the Greedy Equivalence Search (GES) algorithm that was initially developed by Meek (1997) and later further developed by Chickering (2002). Unlike the four constraint-based algorithms discussed above, FGES is a score-based algorithm that returns the graph that maximises the Bayesian score via greedy search.
Lastly, the Greedy Fast Causal Inference (GFCI) algorithm is considered, which combines the FGES and FCI algorithms discussed above, thereby forming a hybrid structure learning process. This combination aims to improve both accuracy and efficiency by supplementing the initial set of nonadjacencies of FGES with a series of conditional independence tests of FCI to eliminate further adjacencies (Spirtes et al., 2001; Ogarrio et al., 2016).

EXPERIMENTS AND RESULTS
The results are presented per case study in the subsections that follow. A set of accuracy metrics is also used to quantify the accuracy of the learnt graphs with respect to the ground truth graphs. These metrics are based on the confusion matrix parameters where True Positives (TP) is the number of true edges discovered in the generated graph, False Positives (FP) is the number of false edges discovered in the generated graph, True Negatives (TN) is the number of true direct independencies in the generated graph, and False Negatives (FN) is the number of false direct independencies in the generated graph. The scoring metrics considered come from the relevant literature. These are:
i. the Precision (Pr) and Recall (Re), defined as $Pr = \frac{TP}{TP + FP}$ and $Re = \frac{TP}{TP + FN}$ respectively,
ii. the F1 score, defined as $F1 = 2 \frac{Pr \cdot Re}{Pr + Re}$,
iii. the SHD score (Tsamardinos et al., 2006), defined as $SHD = FN + FP$, and
iv. the BSF score (Constantinou, 2019), defined as $BSF = 0.5 \left( \frac{TP}{a} + \frac{TN}{i} - \frac{FP}{i} - \frac{FN}{a} \right)$,
where $a$ is the number of edges in the true graph and $i$ is the number of direct independencies in the true graph, defined as $i = \frac{|V|(|V| - 1)}{2} - a$, where $V$ is the variable set as defined in Section 2, and $|V|$ is the size of variable set $V$.
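The metrics listed above can be computed from the confusion matrix counts as in the following sketch. It assumes the usual definitions of precision, recall and F1, SHD as the sum of false positives and false negatives, and the BSF in the form $0.5\,(TP/a + TN/i - FP/i - FN/a)$. The function name and the example counts are hypothetical, and the sketch assumes a true graph with at least one edge and at least one direct independency.

```python
def graph_metrics(tp, fp, tn, fn, true_edges, n_variables):
    """Return Pr, Re, F1, SHD and BSF for a learnt graph."""
    a = true_edges                                  # edges in the true graph
    i = n_variables * (n_variables - 1) // 2 - a    # direct independencies in the true graph
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    shd = fp + fn
    bsf = 0.5 * (tp / a + tn / i - fp / i - fn / a)
    return {"Pr": precision, "Re": recall, "F1": f1, "SHD": shd, "BSF": bsf}


# Hypothetical counts for a 4-variable problem whose true graph has 3 edges.
scores = graph_metrics(tp=2, fp=1, tn=2, fn=1, true_edges=3, n_variables=4)
perfect = graph_metrics(tp=3, fp=0, tn=3, fn=0, true_edges=3, n_variables=4)
```

A perfectly recovered graph scores 1 on Pr, Re, F1 and BSF and 0 on SHD, which is why SHD is read in the opposite direction to the other metrics.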
In this study, the above metrics are used to measure the impact of temporal constraints, rather than to measure the accuracy of the different algorithms considered.

Case study 1: Football team performance
Table 5 presents the graphs generated by each of the six algorithms over the different temporal constraints. The position of the nodes depicted in each of the graphs in Table 5 is based on the position of the nodes as shown in Fig 1. The first column presents the graphs generated without any temporal constraints, whereas the remaining columns progressively increase the amount of temporal information provided as temporal constraints into the structure learning process of each algorithm. Specifically, with $P$ denoting possession, $S$ shots, $ST$ shots on target and $G$ goals, the constraint $P \to S, ST, G$ involves partial ordering of the nodes specifying that $P$ occurs first in the temporal space, the constraint $P \to S \to ST, G$ involves partial ordering where $S$ occurs after observing $P$ and $\{ST, G\}$ occur after observing $P$ and $S$, whereas the constraint $P \to S \to ST \to G$ involves complete ordering of the nodes (assuming the variables $S$, $ST$, and $G$ are duplicates; one for each team). Without temporal constraints, the results show that the four constraint-based algorithms PC-Stable, FAS, FCI, and RFCI are in agreement in determining the edges, although with some disagreements in the orientation of some of those edges. On the other hand, the score-based FGES and hybrid GFCI have produced a different set of edges on which the two agree, and which also agrees with the true graph shown in Fig 1. However, and excluding FAS which returns a skeleton, RFCI, GFCI and FGES failed to orientate any of the edges.
The partial ordering $P \to S, ST, G$ has led to improvements for most of the algorithms, and these are coloured in green. Interestingly, this single piece of temporal information enabled FGES and GFCI to correctly direct all the previously undirected edges and to successfully generate the true graph. RFCI is the only algorithm that demonstrated both corrections as well as some incorrect revisions, which are coloured in red. The extended partial ordering $P \to S \to ST, G$ and complete ordering $P \to S \to ST \to G$ have led to further graphical revisions that do not include any incorrect revisions. Interestingly, while the temporal constraints assisted the algorithms in determining the correct orientation of the edges, the constraints did not lead to any edge additions or deletions. As a result, only the FGES and GFCI algorithms, whose initial set of edges matched the edges in the true graph, managed to recover the true graph.
Table 6 provides edge statistics for each of the graphs depicted in Table 5 and with reference to each level of temporal constraints. The edge statistics are reported with reference to the true graph shown in Fig 1. Specifically, → is the number of directed edges in the learnt graph that are matched in the true graph (also equivalent to TP), − is the number of edges in the true graph that are undirected in the learnt graph, ← is the number of edges in the true graph that are reversed in the learnt graph, − is the number of undirected edges in the learnt graph that do not exist in the true graph, and → is the number of directed edges in the learnt graph that do not exist in the true graph (also equivalent to FP).
Table 6. Edge statistics for each algorithm and over each level of temporal constraints. The edge statistics are reported with reference to the true graph shown in Figure 1.
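The five edge statistics described above can be computed by comparing a learnt graph with the true graph, as in the sketch below. The dictionary keys are hypothetical labels standing in for the table's arrow symbols, and the example graphs are illustrative rather than taken from Table 5.

```python
def edge_statistics(true_arcs, learnt_arcs, learnt_undirected):
    """Classify learnt edges against the arcs (cause, effect) of the true graph."""
    true_arcs = set(true_arcs)
    true_pairs = {frozenset(arc) for arc in true_arcs}
    stats = {"matched": 0, "undirected_true": 0, "reversed": 0,
             "undirected_false": 0, "directed_false": 0}
    for a, b in learnt_arcs:
        if (a, b) in true_arcs:
            stats["matched"] += 1          # correct arc
        elif (b, a) in true_arcs:
            stats["reversed"] += 1         # true edge, wrong direction
        else:
            stats["directed_false"] += 1   # arc absent from the true graph
    for a, b in learnt_undirected:
        if frozenset((a, b)) in true_pairs:
            stats["undirected_true"] += 1  # true edge left undirected
        else:
            stats["undirected_false"] += 1
    return stats


# Hypothetical chain as the true graph, and a partly wrong learnt graph.
true_arcs = [("P", "S"), ("S", "ST"), ("ST", "G")]
stats = edge_statistics(true_arcs,
                        learnt_arcs=[("P", "S"), ("G", "ST")],
                        learnt_undirected=[("S", "ST"), ("P", "G")])
```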

The edge statistics in Table 6 show that, without temporal constraints, only one out of the five algorithms (excluding the adjacency algorithm FAS) managed to discover most of the true arcs (PC-Stable), whereas the four remaining algorithms failed to discover any of the true arcs, although they did discover most of the true dependencies. As previously mentioned, the single piece of temporal information $P \to S, ST, G$ enabled the algorithms to recover most of the true graph, and two of the algorithms, FGES and GFCI, to fully recover the true graph.
The graphs in Fig 2 illustrate how the graphical revisions translate in terms of accuracy, as determined by each of the metrics. Note that, in contrast to the metrics on the primary axis, a lower SHD score (i.e., error) on the secondary axis indicates a better performance. The results show that even incomplete temporal information would often increase the accuracy scores from less than 0.5 to 1 (or close to 1). The SHD error decreases at a similar rate over the incremental temporal constraints. Specifically, and excluding the adjacency search FAS, the accuracy scores (BSF, F1, Pr, Re) have improved on average by 79% and the SHD error has decreased on average by 67.1%, when comparing the graphs learnt without temporal information to the graphs learnt with complete temporal information. Overall, the scores generated by the metrics are consistent with the graphical revisions illustrated in Table 5 and the edge statistics in Table 6.
Fig 2. The accuracy scores derived from the scoring metrics Pr, Re, F1 and BSF (primary axis) and SHD (secondary axis) for the football team performance (first) case study. Each graph represents an algorithm and illustrates how the metric scores change given the temporal constraints.

Case study 2: Forensic psychiatry
While in the first case study we had complete information about the temporal order of the variables in the data, we have incomplete temporal information in this second case study. Specifically, we know three out of a possible 56 temporal tiers, with 17 out of the 56 variables assigned a temporal tier, as shown in Table 3. Further, and contrary to the first case study, the data now consist of discrete variables and incorporate missing values.
Figs 3 and 4 present the graphs generated by each of the algorithms given the temporal constraints specified in Table 3. Note that, contrary to the dashed coloured edges in Table 5 which indicate correct and incorrect graphical revisions, the coloured solid edges in Figs 3 and 4 indicate the type of graphical revision, as described in the figure captions.
Fig 3. The graphs generated by PC-Stable, FAS, and FCI algorithms, based on the forensic psychiatry case study and the temporal constraints specified in Table 3. New edges resulting from the temporal constraints are shown in blue colour, reoriented edges (including undirected) are shown in green colour, and edges deleted are shown in red colour.
Fig 4. The graphs generated by RFCI, FGES, and GFCI algorithms, based on the forensic psychiatry case study and the temporal constraints specified in Table 3. New edges resulting from the temporal constraints are shown in blue colour, reoriented edges (including undirected) are shown in green colour, and edges deleted are shown in red colour.
Unlike the first case study where the temporal constraints led to only edge reorientations, Figs 3 and 4 show that the graphical revisions in this second case study include edge additions and edge deletions, despite providing only partial information about the temporal order of the variables. However, according to Table 7 which compares the metric scores of the graphs learnt without constraints to the scores of the graphs learnt with constraints, the temporal constraints in this second case study have not led to the same level of improvement as in the first case study.
Overall, the temporal information in Table 3 has modestly improved the scores of the constraint-based algorithms PC-Stable, FCI, and RFCI, with no changes in the accuracy scores of the score-based FGES and hybrid GFCI algorithms despite the minor revisions illustrated in Fig 4. Moreover, the adjacency search FAS was negatively affected by the temporal constraints, suggesting that the improvements observed in the other algorithms are due to modifications in the directionality of the edges (from which FAS cannot benefit since it produces a skeleton graph), rather than edge additions and deletions. Specifically, and excluding the adjacency search FAS, the accuracy scores (BSF, F1, Pr, Re) have improved on average by 7.4%, whereas the SHD error has increased by an average of 0.5%. The conflicting conclusion between the SHD error and the other metrics is an observation documented in (Constantinou, 2019), explained by the fact that the SHD score represents classic classification accuracy whereas the other metrics are designed to offer a more balanced score.
Table 7. The accuracy scores produced by each of the metrics and for each of the algorithms, with and without the temporal constraints specified in Table 3. Green coloured scores indicate improvements and red coloured scores indicate that the temporal constraints have led to an inferior score.

Case study 3: Property market
As discussed in Section 3.1, the third case study differs from the first two case studies in that the structure learning process is based on synthetic, rather than real, data that have been sampled directly from the conditional distributions of the property market BN model. Moreover, the data are discrete and complete. The temporal information involves five temporal tiers, out of a possible 27 tiers, with all the 27 variables assigned to a temporal tier as shown in Table 4. Figs 6 and 7 present the graphs generated by each of the algorithms over the different levels of temporal constraints. As in subsection 4.2, new edges resulting from the temporal constraints are shown in blue colour, reoriented edges (including undirected) are shown in green colour, and edges deleted are shown in red colour. The number of revisions observed in this third case study is noticeably higher than the number of revisions observed in the second case study, despite the size of the network being approximately half in this case study; i.e., 27 variables in this case study versus 56 variables in the previous case study. The difference in the number of revisions can be explained by the difference in the number of temporal constraints. Specifically, in the second case study just 17 out of the 56 variables were assigned to one of the possible three temporal tiers, whereas in this case study all the 27 variables are assigned to one of the five temporal tiers.
Since in this case study all the variables associate with a temporal tier, we can illustrate how each additional temporal tier influences the previously learnt graph, as in the first case study. The graphs in Fig 5 illustrate this effect, as determined by each of the metrics. Similar to the first case study and contrary to the second case study, the results from PC-Stable, RFCI and GFCI suggest that partial temporal information, and in this case a single tier (out of a possible 27 tiers) of temporal information that includes five out of the 27 variables (i.e., t1 as defined in Table 4), will often lead to important corrections in the learnt graph. Conversely, the graphs learnt by FAS, FCI and FGES demonstrate improvements only after incorporating the first three temporal tiers (i.e., t1, t2, t3) as constraints. Overall, the results are rather consistent across algorithms and show that (excluding the adjacency search FAS) the accuracy scores (BSF, F1, Pr, Re) have improved on average by 45.6% and the SHD error has decreased on average by 43.3%, when comparing the graphs without temporal information to the graphs with the temporal constraints specified in Table 4.
Fig 6. The graphs generated by PC-Stable, FAS, and FCI algorithms, based on the property market case study and the temporal constraints specified in Table 4. New edges resulting from the temporal constraints are shown in blue colour, reoriented edges (including undirected) are shown in green colour, and edges deleted are shown in red colour.

Fig 7. The graphs generated by RFCI, FGES, and GFCI algorithms, based on the property market case study and the temporal constraints specified in Table 4. New edges resulting from the temporal constraints are shown in blue colour, reoriented edges (including undirected) are shown in green colour, and edges deleted are shown in red colour.

DISCUSSION AND CONCLUDING REMARKS
The aim of the paper is not to demonstrate that temporal constraints are beneficial for BN structure learning. This is because it is already widely accepted that BN structure learning algorithms benefit from temporal constraints. Yet, temporal information is still largely overlooked and only some of these algorithms are designed to consider it. This is partly because temporal information is generally viewed as a form of knowledge that is subjective, rather than as a form of data that tends to be objective.
The paper focused on real-world case studies that incorporate interesting properties of transparent and objective temporal information to highlight the potential gain in accuracy that is typically lost simply because this information is not recorded as hard evidence in data. Specifically:
i. The first case study is based on a simple and clean dataset with complete objective information about the temporal order of the variables. The structure learning algorithms failed to determine the correct direction of the edges between variables prior to incorporating temporal constraints. However, some algorithms needed only a single piece of temporal information to correctly determine the true graph, while others failed to do so even after complete temporal information was provided as constraints to their structure learning process. Complete temporal information has improved the scores produced by the various scoring metrics, which judge how well a learnt graph approximates the ground truth graph, by 67.1% to 79%, on average.
ii. The second case study is based on a relatively complex problem, with noisy and incomplete data as well as incomplete, although objective, information about the temporal order of some of the variables in the data. While the temporal information available for this case study amounted to just three (out of possible 56) temporal tiers with only 17 of the 56 variables assigned to one of those tiers, parts of some of the learnt graphs were still subject to major modifications given the temporal constraints. However, the results from the scoring metrics suggest that the graphical revisions have led to minor improvements (-0.5% to 7.4%, on average) relative to the improvements observed in the first case study.
iii. The third case study is based on synthetic data sampled from a real-world BN of moderate complexity. The temporal information used in this case study was also incomplete, although richer compared to that used in the second case study. The constraints involved five (out of a possible 27) tiers of temporal information with all 27 variables assigned to one of those tiers. The results from these experiments suggest that while synthetic data tends to overestimate real-world performance, incomplete temporal information still improved the scores of the learnt graphs by 43.3% to 45.6%, on average.
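
To make the comparison between a learnt graph and a ground truth graph concrete, the sketch below computes precision, recall, F1 and a simplified SHD over sets of directed edges. These are simplified definitions adopted here for illustration only: they assume fully directed graphs without self-loops and give no partial credit for undirected or partially oriented edges, so they need not match the exact scoring used in the case studies. The function names and the example graphs are hypothetical.

```python
def edge_metrics(learnt, true):
    """Precision, recall and F1 over directed edges (exact direction match)."""
    learnt, true = set(learnt), set(true)
    tp = len(learnt & true)
    precision = tp / len(learnt) if learnt else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def shd(learnt, true):
    """Simplified SHD: node pairs whose edge status differs.

    A pair counts once if the edge is missing, extra, or reversed.
    Assumes no self-loops.
    """
    learnt, true = set(learnt), set(true)
    pairs = {frozenset(edge) for edge in learnt | true}
    errors = 0
    for pair in pairs:
        a, b = tuple(pair)
        both_directions = {(a, b), (b, a)}
        if both_directions & learnt != both_directions & true:
            errors += 1
    return errors


# Hypothetical example: one edge is reversed before constraints are applied.
true_graph = {("A", "B"), ("B", "C")}
without_constraints = {("B", "A"), ("B", "C")}
with_constraints = {("A", "B"), ("B", "C")}
print(edge_metrics(without_constraints, true_graph), shd(without_constraints, true_graph))
print(edge_metrics(with_constraints, true_graph), shd(with_constraints, true_graph))
```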
We often have access to objective temporal information irrespective of the application domain. However, because classical statistics and machine learning are traditionally not concerned with causal inference, the sequence of events occurring in the real world is generally reduced to an insignificant piece of information. While temporal information is clearly useful in causal inference, it is still overlooked partly because it is considered part of the knowledge-based constraints that only some of the structure learning algorithms consider.
When Cooper and Herskovits (1991) first published the K2 algorithm three decades ago, they made it dependent on knowledge about the temporal order of the variables. However, such a strong restriction represents the other extreme of the argument, as well as an inconvenience given that objective temporal information is not generally available for all the variables in the data. Moreover, because temporal information was pitched as a knowledge-based constraint, the requirement for this information understandably raised comments similar to: "what artificial intelligence is after is the development of an agent which has some hope of overcoming problems on its own, rather than requiring engineers" (Korb and Nicholson, 2011).
Initially, Pearl and Verma (1994) stated that "we must still identify the clues that prompt people to perceive causal relations in the data, and we must find a computational model that emulates this perception". However, it is now understood that the development of human knowledge is not restricted to statistical observations (Pearl and Mackenzie, 2018). Much of our causal knowledge is established by observing chains of events that enable us to experience the perception of time. If we expect machines to become rational agents in a world that requires causal perception, then we may have to provide them with something more than a dataset consisting of mere static observations under the assumption that answers about causality can be retrieved from a static observational dataset.
A possible way forward is to extend observational data in ways that capture objective temporal information for some, or all if available, of the variables in the data. This will ensure that objective temporal information is viewed as part of available observational data that is generally assumed to be objective. Moreover, temporal information could be reused across similar studies without requiring access to expertise or knowledge. Lastly, the benefits of temporal constraints extend to aspects not covered in this paper, such as 'relaxing' the NP-hardness by reducing the search space of possible graphs that explain the data, as well as leading to more accurate causal models that enable the simulation of interventions for optimal decision making.
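
The reduction in search space can be illustrated by counting graphs. The sketch below uses Robinson's recurrence for the number of DAGs on n labelled nodes and compares it with the number of DAGs that remain once edges from later to earlier tiers are forbidden. It is a minimal sketch under stated assumptions: every variable is assigned to a tier, edges within a tier are unconstrained, and the count is over DAGs rather than over the equivalence classes that some algorithms search. The function names and tier sizes are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labelled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


def count_dags_with_tiers(tier_sizes):
    """Number of DAGs with no edge from a later tier to an earlier tier.

    Within each tier any DAG is allowed; each cross-tier pair of nodes
    either has a forward edge or no edge, giving two options per pair.
    """
    within = 1
    for size in tier_sizes:
        within *= count_dags(size)
    cross_pairs = sum(
        a * b
        for i, a in enumerate(tier_sizes)
        for b in tier_sizes[i + 1:]
    )
    return within * 2 ** cross_pairs


# Four variables: no tiers, two tiers of two, and a complete ordering.
print(count_dags(4))                        # 543
print(count_dags_with_tiers([2, 2]))        # 144
print(count_dags_with_tiers([1, 1, 1, 1]))  # 64
```

Even in this four-variable example, two tiers remove roughly three quarters of the candidate DAGs, and the gap widens rapidly as the number of variables grows.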