Worry less about the algorithm, more about the sequence of events

Background: Many algorithms exist for learning network structure and parameters from data. These include Grow Shrink, Incremental Association Markov Blanket (IAMB), Fast IAMB, Interleaved IAMB, Hill Climbing, Restricted Maximization, and Maximum-Minimum Hill Climbing. These algorithms optimize the fit to the data while ignoring the order in which the variables in the network occur. Objective: This paper examines whether sequence information (i.e., that one variable occurs before another) can make algorithms for learning directed acyclic graph networks more accurate. Methods: A 13-variable network was simulated, where the sequence of occurrence of some of the variables was assumed to be known. In each simulation, 10,000 observations were generated. These observations were used by 4 conditional independence and 4 search-and-score algorithms to discover the network from the simulated data. The partial sequence was used to prohibit a directed arc from a later variable to an earlier one. The Area under the Receiver Operating Curve (AROC) was used to compare the accuracy of the sequence-constrained and unconstrained algorithms in predicting the last node in the network. In addition, we examined the performance of sequence-constrained algorithms in a real data set: 1.3 million disability assessments done on 296,051 residents in Veterans Affairs nursing homes, where the sequence of occurrence of variables was inferred from the average age of occurrence of the disabilities. We constructed three networks using the Grow-Shrink algorithm, one without sequence constraints and the other two using two permutations of the observed sequence. The fit of these three models to the data was examined using the Bayesian Information Criterion (BIC). Results: In simulated data, algorithms that used sequence constraints (AROC = 0.94,


Introduction
This paper examines whether sequence information can make algorithms for learning directed acyclic graph networks more accurate. By sequence we mean whether one variable occurs before another. For example, we can assume that a patient's race is established before a history of infection, because race is fixed at birth and infection occurs later. Historically, sequence information is ignored, or used after the data have been modelled. For example, regression analysis fits a model to the data with no regard for sequence among the independent variables. This paper examines whether sequence information can improve the accuracy of the model's fit to the data.
There are three types of algorithms for learning network structures: search and score [1][2][3][4][5][6][7][8][9][10][11][12][13], conditional independence [14][15][16][17][18][19][20], and multivariate methods [21][22][23][24][25][26]. These algorithms must accomplish two tasks: determine the structure of the network (i.e., the arcs that connect the nodes) and orient the direction of the arcs (the sequence). With some exceptions [9,27], network learning algorithms determine the direction of the arcs after establishing the network structure. For example, Pearl suggested the collider test for directing the arcs in a network after the network structure has been established [28]. To do the reverse, to first establish the sequence and then estimate the network, goes against the practice of many established algorithms for learning networks. The aim of this study is to show the importance of changing this tradition.
In the published literature, there are sporadic reports of the use of sequence in learning network structures [9,29-31]. A good example is Shojaie and Michailidis' method, where a full order among the variables is used to learn network models by iteratively fitting multivariate regression models [27]. Elsewhere, we have also shown that sequence information can make some search-and-score algorithms more accurate [32,33]. These occasional reports raise the possibility that sequence information may improve not just the few algorithms where the effect has been reported, but all network-learning algorithms. In this paper, we check whether sequence information changes the accuracy of different network-learning algorithms.

Methods of learning sequence
There are many methods for establishing sequence among a pair of variables. Pearl's collider test is one; there are several others:
(a) Definition of variables. The easiest way to establish a sequence is to define the variables so that one occurs before another. Many statisticians already do so. They assume, or design studies so, that some variables measure events at baseline, other variables refer to treatment after baseline, and outcomes are measured last. These assumptions about the timing of events provide an easy way to establish a partial sequence. For example, by definition, race occurs at birth; it is established prior to medical history, which occurs before current conditions, which occur prior to treatment, which occurs prior to outcomes. These definitions create a partial sequence among the variables. In this paper, in the analysis of simulated data, we use the definition of variables to establish a partial sequence.
(b) Error reduction. Most statistical measures are symmetric and cannot be used to order a pair of variables. This is not the case for Goodman and Kruskal's error reduction [34]. Vang used this method to examine sequence among predictors of drug abuse [35].
(c) Strength of causal reasoning. In human judgment, causes have a bigger impact on their effects than the reverse. Zargoush and colleagues used this method to examine sequence among disabilities [32].
(d) Age of patients at the time of occurrence of the variables. For example, a number of cross-sectional studies show that feeding disabilities occur after walking disabilities because they occur at a later age [33,36-38]. In this paper, in the analysis of real data, we use this method to establish a sequence of events/variables.
(e) Longitudinal order. This method is the gold standard for defining sequence among variables. These data are widely available in electronic health records, where events are time-stamped, making it easy to examine which event occurred before another.
Sequence information is widely available, multiple methods exist to extract it, and Pearl's collider test is just one of them. The Bayesian network literature has focused mostly on the collider test, which requires knowledge of the network structure. Other methods of learning sequence do not require knowledge of the network structure and therefore can be applied before the network structure is discovered.
It is important to point out that the sequence of occurrence among a pair of variables is often not deterministic. Sequence between a pair of variables is an uncertain judgment that needs to be based on data, and these inferences can be erroneous. Sequences derived from data will almost never apply to all individuals in the sample: the majority of patients may experience one order of the variables, while a minority experience the reverse. For example, consider substance abuse disorders, heart attack, and Alzheimer's disease. Most patients report substance abuse at a younger age than heart attack, and the majority report Alzheimer's disease at a later age than heart attack. Despite these patterns, some patients experience these diseases in the reverse order; substance abuse can occur after heart attack, or Alzheimer's disease before heart attack. When we speak of sequence among a pair of variables, we are referring to the experience of the majority of patients, not all patients. We assume that such estimates can be erroneous and require sensitivity analysis to check whether a small perturbation of the order radically changes the network. For example, we may require that one variable occur, on average, at least 3 months before another. If two variables are less than 3 months apart on average, we may treat their sequence as unknown, leading to partial knowledge of the sequence of variables.

Sequence wrapper
A wrapper is code that repeatedly calls an algorithm and provides it with a subset of variables to analyze [19]. Many algorithms have a random starting point; in sequence wrappers, algorithms start with the variables in the order provided by the wrapper. In this method, only relationships that are found consistently across all time periods are used to establish the network structure. Figure 1 shows the procedure for sequence-wrapping an algorithm. One way to implement sequence constraints is through the blacklist feature of existing algorithms' software. The blacklist feature prohibits arc directions from being considered as candidates in learning the network structure. In this approach, arcs from later to earlier time periods are prohibited. Variables that occur in the first time period have no prohibited directions; variables that occur in the last time period have the maximum number of prohibited directions; variables in between have directed arcs prohibited in proportion to how late they occur. The later a variable occurs, the more restrictions it has. This procedure increases the probability that the network-learning algorithm analyzes the earlier variables first, de facto implementing the main feature of the sequence wrapper.
Note that Grow Shrink and the three variations of Incremental Association Markov Blanket are all based on conditional independence tests and focus on constructing a Markov Blanket. The remaining four are search and score algorithms and do not rely on search for a Markov Blanket. All eight ignore sequence information during the learning of network structure.
We used the implementation of these algorithms in BNLearn R Package (version 4.0). The sequence information was used to blacklist the impact of later variables on earlier ones. For example, if one variable was judged to occur later than another, then we blacklisted the impact of the later variable on the earlier variable.
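The blacklist construction described above can be sketched as follows. This is an illustrative Python sketch, not the R code we used; the variable names and time-period assignments are hypothetical examples, and in bnlearn the resulting pairs would be supplied as the from/to columns of the blacklist argument.

```python
# Sketch: build a blacklist of prohibited arc directions from a partial
# sequence. Variable names and period assignments are illustrative only.
from itertools import product

def sequence_blacklist(periods):
    """periods maps each variable to its time period (lower = earlier).
    Returns (from, to) pairs pointing from a later to an earlier variable,
    i.e., the arcs a sequence-constrained learner must never consider."""
    return [(a, b) for a, b in product(periods, repeat=2)
            if periods[a] > periods[b]]

periods = {"DME": 1, "P": 1, "H": 2, "SNF": 3, "BP": 5}
blacklist = sequence_blacklist(periods)

# Every prohibited arc runs from a later period to an earlier one:
assert all(periods[a] > periods[b] for a, b in blacklist)
# First-period variables (DME, P) are never the source of a prohibition...
assert not any(a in ("DME", "P") for a, _ in blacklist)
# ...while the last variable (BP) has the maximum number of prohibitions.
assert sum(a == "BP" for a, _ in blacklist) == 4
```

Note how the construction reproduces the property described above: the later a variable occurs, the more of its outgoing arc directions are prohibited.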

Methods of simulation
To test the performance of these algorithms, we simulated a 13-variable network using the binomial distribution. This network describes how various cost overruns in a 90-day episode of treatment of hip fracture affect CMS's bundled payment. Figure 2 shows the network used to simulate the data. In this network, Durable Medical Equipment cost (DME), Clinical Laboratory tests (CL), and Physician bills (P) are assumed to affect Hospital (H) costs. Likewise, hospital costs are assumed to affect Long Term Hospital cost (LTH), Rehabilitation Facility cost (RF), Skilled Nursing Facility cost (SNF), Hospice cost (HOS), and eventually CMS's Bundled Payment cost (BP). Physician billing is expected to affect Home Health Agency cost (HHA), SNF, RF, and LTH. LTH and Part B Drug cost (PBD) are expected to affect Hospital Outpatient (HO) cost, and PBD is expected to affect HO and Outpatient Therapy cost (OT). OT is also affected by RF, SNF, and HHA. Finally, BP is affected by seven variables: H, HOS, HO, OT, PBD, SNF, and HHA. The simulation was done in two steps. First, a network structure was organized; we call this network the original structure (Figure 2). In the original network, the odds for any variable with a single parent were set to 2.33:1, i.e., a conditional probability of 70%. For each additional parent, the odds were added; for example, the presence of 2 parents increased the odds to 4.66:1, or a conditional probability of 82%. Note also that the original network in Figure 2 includes variables that affect not only the next time period but several time periods later. These variations in the lag time of impact make learning the network structure more difficult, and thus make the simulated data a reasonable environment for testing model accuracy.
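The additive-odds rule described above can be sketched as follows; this Python fragment only reproduces the arithmetic of the parameterization and is not part of the simulation software.

```python
# Sketch of the simulation's additive-odds rule: each active parent
# contributes odds of 2.33:1, and odds from multiple parents are summed
# before converting back to a conditional probability.
def conditional_probability(n_parents, odds_per_parent=2.33):
    odds = odds_per_parent * n_parents
    return odds / (1 + odds)

# One parent: odds 2.33:1, probability ~70%
assert round(conditional_probability(1), 2) == 0.70
# Two parents: odds 4.66:1, probability ~82%
assert round(conditional_probability(2), 2) == 0.82
```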
The network structure in Figure 2 was used to randomly generate 10,000 cases using the Netica software (5.24). We chose a large number of cases so that sample size was not a factor in our estimate of accuracy. It is well known that small sample sizes degrade the performance of all algorithms; this is well established and not central to our paper, so we did not manipulate sample size in our simulations.
The sequence among the variables in this simulated data can be inferred from the definition of the variables and the assumed continuity of care across locations for treatment of hip-fractured patients. The starting time for all patients is on admission to the hospital. Variables that describe treatment during hospitalization occur prior to discharge (first row in Figure 2). At discharge (second row), patients are typically discharged to institutional care for post-acute recovery (third row). Later, these patients may receive care in outpatient settings (fourth row). Bundled payment (fifth row) is made 90 days after admission. Each row in Figure 2 describes one time period. In total there are 5 time periods.
The accuracy of the network model was examined in two ways. First, we used Kappa statistic to examine the extent to which the discovered and the original network agreed. Second, we examined predictive accuracy of the network by examining Area under Receiver Operating Curves (AROC) for the last node in the network: Bundled Payment Cost Overrun.
We did not distinguish between arc errors and direction errors. Thus, to be considered correct, the algorithm's prediction had to be accurate in both the arc and its direction. An algorithm's predictions may differ from the original network in two ways: it may indicate a directed arc where one does not exist (a commission error) or miss a directed arc where one does exist (an omission error). Based on omission and commission errors, an algorithm can have different levels of sensitivity and specificity, and the Receiver Operating Curve indicates the tradeoffs between the sensitivity and specificity achieved by the algorithm.

Results for simulated data
Figure 3 shows the performance of the algorithms before and after sequence constraints. For all algorithms, the AROC of the sequenced algorithms (dashed lines) was significantly higher than the AROC of the un-sequenced algorithms (solid lines). The space between the two lines shows how much the sequence information improved the accuracy of the algorithm. Note also that the improvement in the score-based algorithms (i.e., Tabu, Maximum-Minimum Hill Climbing, and Restricted Maximization) was not as large as in the algorithms that relied on conditional independence tests (i.e., Grow Shrink and the IAMB variations). Table 1 shows the AROC for the eight algorithms. The first column of data shows the performance of these algorithms before using sequence constraints; the last column shows the performance after using sequence constraints. Table 1 also presents the 95% confidence intervals for the AROCs [34]. The differences between two sets of performance are not statistically significant if the two confidence intervals overlap. For example, before the use of sequence information, the Grow Shrink algorithm had an AROC of 0.71; afterwards, it had an AROC of 0.98. The confidence intervals for these two estimates do not overlap, and therefore the difference is statistically significant.
In fact, for all eight algorithms, the sequence information significantly improved performance. The magnitude of improvement was large; in some instances, the improved AROC was near 1.
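The scoring of a discovered network against the original, with omission and commission errors counted as described above, can be sketched as follows. The arc sets here are illustrative toy examples, not the simulated network's.

```python
# Sketch: score a discovered network against the original one.
# A prediction counts as correct only if both the arc and its direction
# match; a reversed arc therefore counts as one omission plus one commission.
def arc_errors(original, discovered, n_nodes):
    commission = discovered - original      # arcs asserted that don't exist
    omission = original - discovered        # true arcs that were missed
    possible = n_nodes * (n_nodes - 1)      # all candidate directed arcs
    sensitivity = len(original & discovered) / len(original)
    specificity = 1 - len(commission) / (possible - len(original))
    return sensitivity, specificity, len(omission), len(commission)

original = {("A", "B"), ("B", "C"), ("A", "C")}
discovered = {("A", "B"), ("C", "B")}       # one hit, one reversed arc
sens, spec, n_om, n_com = arc_errors(original, discovered, n_nodes=3)

# Missed B->C and A->C (omissions); asserted the non-existent C->B (commission).
assert (n_om, n_com) == (2, 1)
```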

Methods for real data
We tested the value of sequence in establishing a model for predicting the prognosis of nursing home patients. In choosing whether to join a hospice program, clinicians, patients, and their caregivers would benefit from accurate information on when and how the patient will die. Often, how a person dies translates into knowing how disabled the patient will be before dying. A network model can predict both mortality and the disabilities that nursing home patients may experience. Nursing home clinicians and residents may use the model to identify which disability may occur next and take action to prevent it. They may also use the model to assess the value of continuing as is versus seeking hospice care. To model the transitions among disabilities and death, we examined data on 296,051 residents in Veterans Affairs nursing homes, called Community Living Centers (CLCs). The period of study was from 1/1/2000 through 9/10/2012. These data included a comprehensive assessment of residents' disabilities, including the ability to walk, bathe, groom, dress, sit (transfer), toilet, eat, and control bowel and bladder independently. The data also included the resident's age and whether they had died within 6 months of assessment. By policy, assessments were supposed to be done within 14 days of admission, at least quarterly, or sooner when there was an event such as hospitalization or when the nursing home staff identified changes in the resident's status. In our data, there were two peaks in the distribution of assessments: one for residents assessed monthly (75,994 residents) and the other for residents assessed every 3 months (42,904 residents). The average time between assessments was 115 days, with a standard deviation of 235 days. More than 1.3 million assessments were available for analysis.
We used the Grow-Shrink algorithm to learn the network structure from the data. We first learned the network without using sequence information and then applied the same algorithm with sequence information. We established the sequence among the variables by examining the age at which patients reported each disability. Two disabilities were judged to occur at different times if the average age at which one disability occurred was significantly different (at alpha levels less than 0.001) from the average age at which the other occurred. The use of age to determine the sequence among events is common in the health care literature [33,36-38]. Nevertheless, because of the large size of the data, small differences in age may be statistically significant but clinically meaningless. One solution is to require a minimum effect size, e.g., an average age difference of more than 3 months. We examined the sensitivity of our conclusions to requiring at least a 3-month difference in the average age of occurrence of the variables. To establish the networks, we used the implementation of the Grow-Shrink algorithm in the BNLearn R package. The sequence information was used to create a blacklist, where later variables were not allowed to point to earlier variables. Table 2 provides a summary of the findings. These data suggest that, on average, disabilities occur in the following order: Unable to Walk, Unable to Sit, Unable to Bathe, Unable to Eat, Unable to Groom, Unable to Toilet, Unable to Dress, Unable to Bowel, Unable to Urinate, and Death. The differences in the average age of these disabilities are all statistically significant at alpha levels less than 0.001. To test the sensitivity of the network structure to the estimated rank orders, we also required that two variables differ in average age by at least 3 months.
Some variables, such as bathing and eating disabilities, occur too close together in time, and therefore under the alternative arrangement the order of these disabilities was reversed. Walking disability occurs at least 3 months prior to sitting disability, so that order did not need to be reversed. Note that by definition death occurs last; on average, death occurred at age 73.71 years, although some patients survived beyond 85 years of age.
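The two ordering rules above (statistical significance at alpha < 0.001, plus a 3-month minimum gap in the sensitivity analysis) can be sketched as a simple two-sample z-test on summary statistics. The means, standard deviations, and sample sizes below are illustrative, not the study's actual values.

```python
# Sketch: decide whether two disabilities can be ordered by average age
# of onset. Two rules from the text: the age difference must be
# statistically significant (alpha < 0.001) and, in the sensitivity
# analysis, at least 3 months (0.25 years).
import math

def can_order(mean1, sd1, n1, mean2, sd2, n2,
              alpha_z=3.29, min_gap_years=0.25):
    """True if variable 1 reliably precedes variable 2.
    alpha_z ~ 3.29 is the two-sided z threshold for alpha = 0.001."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean2 - mean1) / se
    return z > alpha_z and (mean2 - mean1) >= min_gap_years

# One disability precedes the other by well over 3 months: orderable.
assert can_order(70.0, 8.0, 50000, 70.6, 8.0, 50000)
# Two disabilities less than 3 months apart: order treated as unknown.
assert not can_order(71.0, 8.0, 50000, 71.1, 8.0, 50000)
```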

Results for real data
Using the BNLearn package, we created a blacklist which prohibited arcs from diagnoses with later average ages to diagnoses with earlier average ages. For example, using the observed rank order, we prohibited arcs going from unable to eat to:
- unable to bathe
- unable to sit
- unable to walk
- and age of 65-74 years
In the alternative rank order, we prohibited arcs going from unable to eat to:
- unable to sit
- unable to walk
- and age of 65-74 years
Figure 4a shows the network structure without the use of sequence information. The age indicators did not affect the network, so they are absent from the networks in Figure 4. This is not surprising, as chronological age may not be as important in predicting mortality as the history of functional decline. Among the functional abilities, a complex set of relationships emerged. On average, each disability had the other 8 disabilities in its Markov Blanket, and there were 5.8 arcs to, or from, each node. There was a total of 29 arcs in the network, 3 of which were not directed. Undirected arcs are not allowed in DAG networks, and Pearl has suggested that when one cannot determine the direction of an arc, it can be set randomly, without creating cycles or new colliders in the network. Obviously, such random assignment of direction could be wrong and may affect network predictions.
Surprisingly, the variable "Dead in 6 months post assessment" in the network in Figure 4 pointed to the variables "Unable to Eat", "Unable to Urinate", and "Unable to Sit". This, of course, makes little sense, since by definition death cannot be followed by a change in the resident's functional disability.
We also used the observed sequence information in Table 2 to prohibit arcs from later variables to earlier ones. Again, 29 arcs were identified; this time, all arcs were directed. Furthermore, the directions of the arcs from death to other disabilities were reversed. In total, 18 of the 29 arcs (62%) changed. The arcs in which the sequenced network differed from the non-sequenced network are shown in red in Figure 4; each red line in Figure 4a shows how sequence information changed the network. The goodness of fit of the two networks to the data was assessed through the Bayesian Information Criterion score. For large data, the Bayesian Information Criterion is approximately −2 times the natural log of the maximum likelihood of observing the data given the model, plus a penalty for model complexity. The model with the lowest Bayesian Information Criterion should be preferred. The sequenced network had a score of −4,105,657 and the non-sequenced network had a higher score of −3,873,261. The sequenced network thus had a 6% better fit (lower Bayesian Information Criterion score) to the data.
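The 6% figure can be reproduced from the two reported scores; this fragment only restates the comparison arithmetic.

```python
# Sketch: compare the two reported Bayesian Information Criterion scores
# (lower is preferred, per the text) and compute the relative improvement.
bic_unsequenced = -3_873_261
bic_sequenced = -4_105_657

assert bic_sequenced < bic_unsequenced   # sequenced network fits better
improvement = (bic_unsequenced - bic_sequenced) / abs(bic_unsequenced)
assert round(improvement * 100) == 6     # ~6% better fit, as reported
```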
We repeated the analysis, this time using the alternative sequence in Table 2. Again, 29 arcs were identified, and the sequence information corrected the illogical arcs from death to disabilities. In total, 9 of the 29 arcs (31%) changed, and the Bayesian Information Criterion score of the model improved to −5,107,770. The alternative sequence improved the fit to the data by 24.40% compared to the un-sequenced network.
Figure 4. Changes due to sequence. N is no disability, W is unable to walk, B is unable to bathe, E is unable to eat, S is unable to sit, G is unable to groom, T is unable to toilet, R is unable to dress, L is unable to bowel, U is unable to urinate, and D is death in 6 months post assessment. Red arcs show changes between unsequenced and sequenced networks. Sequence is shown counter-clockwise starting at N and ending in D.

Limitations
This study was limited in several ways. First, we focused on eight algorithms available for learning network structures. Future researchers should examine the effect of sequence-wrappers, or sequence constraints, in improving other algorithms.
Second, in the simulation we assumed that a partial sequence can be established accurately. Obviously, errors in establishing the sequence could reduce the accuracy of sequence-constrained algorithms. In the real data, we made small changes in the order of the variables that were less than 3 months apart from each other. Both the original and the permuted order improved the fit of the model to the data. Study findings were not sensitive to small changes in the order of variables, suggesting that analysts need to consider only large, and perhaps obvious, orderings among the variables.
Finally, and perhaps most importantly, we simulated only one data structure. We randomly simulated many sets of data from this one original network, but it is possible that sequence information is not as helpful in other types of networks. We did not examine how the number of variables, the intercorrelation among the variables, or discrete versus continuous variables might affect the impact of sequence. Additional research is needed to test the performance of sequence constraints in multiple simulated environments. In the real data, we likewise examined the impact of sequence in only one data set; here again, results may differ in other data sets.

Discussion
This study has shown that sequence information improved the accuracy of the network structure for eight widely used network discovery algorithms. On average, predictive accuracy improved by 27%. The Area under the Receiver Operating Curve improved by 20 percentage points, from 0.75 to 0.95. The agreement between the discovered and the original network increased by 22%. These are large jumps in various measures of accuracy. In contrast, changing from one algorithm to another improved the accuracy of predictions by only 6.4%. Therefore, the advice to a scientist might be: never mind the algorithm, focus on getting accurate sequence information. Most of the differences among the 8 algorithms disappeared when sequence information was used.
In both simulated and real data, the sequence-constrained networks were more accurate than the unconstrained networks. There are several possible reasons for this finding. Sequence removes a large number of possible arcs from further consideration, and thus reduces the occasions on which the model can make an error. For example, in the simulated data, there were 156 possible directed arcs, and the sequence constraints removed 61 of them. Thus, the model had roughly 40% fewer occasions on which it could make an error. The algorithms were more accurate simply because they had fewer chances of making an error. While this explanation seems reasonable to us, it does not explain the differential impact of sequence on search-and-score as opposed to conditional independence algorithms.
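The arc-space arithmetic in the paragraph above can be sketched as follows; the reduction comes out to roughly 40%.

```python
# Sketch: with 13 variables there are 13 * 12 candidate directed arcs
# (each ordered pair of distinct nodes); the sequence blacklist removed
# 61 of them, shrinking the space in which errors can occur.
n_vars = 13
possible_arcs = n_vars * (n_vars - 1)
assert possible_arcs == 156

removed = 61
reduction = removed / possible_arcs
assert round(reduction * 100) == 39   # ~40% fewer occasions for error
```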
We also speculate that sequence information may have been more helpful to algorithms that use Markov Blankets because it reduced errors that occur from stratifying on common effects. When a common effect is stratified, two or more previously independent causes become dependent and show spurious, non-causal associations. For an intuition about how this might occur, consider that later variables are more likely to be effects than causes. An algorithm like Grow Shrink first assembles candidates for the Markov Blanket (grow phase) and then verifies them (shrink phase), starting randomly with one variable. The use of blacklists changed the order in which variables were considered: despite randomization, variables that occurred earlier in time were more likely to enter the grow phase, and variables that occurred later in time were more likely to be dropped in the shrink phase. De facto, these two steps reduce stratification on effects and thus improve the accuracy of the algorithm.
We hypothesize that sequence data improves the accuracy of algorithms because it removes the possibility of "reverse causality". In the real data, the sequence information corrected obvious errors in the network (death pointing to other disabilities), re-oriented 62% of the arcs, and improved the fit between the network and the data by 6%. For some time, it has been known that combining expert opinion with algorithms may make the resulting model more accurate [41]. Experts are aware of the causal direction of events and thus may correct models that exhibit reverse causality. To an expert, it is clear that "the sun does not rise because the rooster crows"; to the computer algorithm, this is not clear at all. Sequence may be playing a role similar to that of experts: it avoids reverse causal findings.
Finally, sequence data can radically improve the face validity of a model. In our effort to model the real data, the face validity of the network improved radically when obvious errors, such as death causing changes in functional disability, were avoided. In our experience, when a model makes an obvious error, no one will trust it any further. In most cases, reverse causality is an obvious error that any observer can point to. No statistic can compensate for the fall from grace of a model that misses obvious relationships. By reducing occasions of reverse causality, sequence improves face validity and increases trust in algorithms.