Continuous time causal structure induction with prevention and generation

A B S T R A C T

Most research into causal learning has focused on atemporal contingency data settings while fewer studies have examined learning and reasoning about systems exhibiting events that unfold in continuous time. Of these, none have yet explored learning about preventative causal influences. How do people use temporal information to infer which components of a causal system are generating or preventing activity of other components? In what ways do generative and preventative causes interact in shaping the behavior of causal mechanisms and their learnability? We explore human causal structure learning within a space of hypotheses that combine generative and preventative causal relationships. Participants observe the behavior of causal devices as they are perturbed by fixed interventions and subject to either regular or irregular spontaneous activations. We find that participants are capable learners in this setting, successfully identifying the large majority of generative, preventative and non-causal relationships but making certain attribution errors. We lay out a computational-level framework for normative inference in this setting and propose a family of more cognitively plausible algorithmic approximations. We find that participants' judgment patterns can be both qualitatively and quantitatively captured by a model that approximates normative inference via a simulation and summary statistics scheme based on structurally local computation using temporally local evidence.
We naturally think about the world in terms of a progression of events that cause and affect one another. When successful, causal reasoning helps us abstract from our real-time experience to recognize stable causal mechanisms that we can use to explain, predict and sometimes control our environment (Sloman, 2005). However, inferring causal structure in real environments is notoriously challenging, involving a complex interplay between incoming evidence, action, and intuitive theories of how causal influences manifest and link elements of experience like events, objects and variables (Goodman, Ullman, & Tenenbaum, 2011;Griffiths & Tenenbaum, 2009;Lagnado, 2011).
Two basic and well-studied notions of causality are generative and preventative relationships. In a generative relationship, we think of the occurrence of one event as bringing about the occurrence of another. A generative causal claim implies the counterfactual that, had the cause not occurred, the effect would not have occurred either. In probabilistic accounts of causal reasoning, generative causality is typically linked with an expectation of positive contingency: The presence of a generative causal variable is associated with an increase in the probability of its effect(s) being present compared to cases where the cause is absent or inactive. The reverse of this is the notion of a preventative causal relationship, where we think of the occurrence of a causal event as blocking another event from occurring. A preventative causal claim implies the counterfactual that, had the cause not occurred, the effect would have occurred. Probabilistically, we thus expect the presence of a preventative cause to decrease the probability of its effect(s) occurring, compared with cases where the cause is absent or inactive (Cheng, 1997; Griffiths & Tenenbaum, 2005; Sloman, 2005).

The majority of causal learning research has focused on inferences from atemporal evidence, which can be represented in tables of co-occurrence or contingency that reflect the statistical dependencies among a set of variables (Buehner, Cheng, & Clifford, 2003; Cheng, 1997; Griffiths & Tenenbaum, 2005; Lagnado & Sloman, 2004; Rottman & Hastie, 2014). This kind of data is central to scientific experimentation, which depends on the collection of multiple independent samples (Pearl, 2000; Pearl & Mackenzie, 2018; Zimmerman, 2007). However, an intriguing question about human cognition is how people learn causal relationships from temporal data, given that we experience the world as one continuous timeline, and that real-world causal mechanisms often take time to produce their effects. The temporal setting also allows events of the same type to occur multiple times for a single individual. This more closely resembles repeated-measures data from a single individual than reasoning from large independent samples. In this setting, people might rely more on ''soft'' cues (e.g. time, prior knowledge) than on the contingency principle (Lagnado, Waldmann, Hagmayer, & Sloman, 2007).

✩ Author Note: A preliminary analysis of a pilot experiment and Experiment 1 was presented at the 42nd Annual Meeting of the Cognitive Science Society (Gong and Bramley, 2020) and the Causal Inference & Machine Learning workshop at the 35th Neural Information Processing Systems conference. We thank Simon Stephan and an anonymous reviewer for many helpful comments. TG is supported by a University of Edinburgh PPLS Scholarship. NB is partly supported by an EPSRC New Investigator Grant (EP/T033967/1). * Corresponding author. E-mail address: tia.gong@ed.ac.uk (T. Gong).
Understanding how people learn from temporal data is crucial because it not only improves our understanding of the basic mechanisms of human learning, but also clarifies the differences between scientific practices and intuitive causal inference.
Besides this, studies of atemporal causal learning (Buehner et al., 2003; Cheng, 1997; Griffiths & Tenenbaum, 2005; Lagnado & Sloman, 2004; Rottman & Hastie, 2014) as well as recent studies of temporal causal learning (Bramley, Gerstenberg, Mayrhofer and Lagnado, 2018; Buehner & McGregor, 2006) typically focus on one type of causal relationship at a time. In contrast, this paper aims to investigate how one can learn preventative and generative relationships when both are in play at once. Can people identify what is causing and what is preventing an effect despite, and perhaps even because of, the ways that such causal influences intertwine and interact in time? Although this may sound like a ''niche'' scenario, it is actually very common. To illustrate such an everyday situation: Suppose you adopt a cat that, while adorable, frequently urinates outside its litter box. You would like to understand why, and to learn to prevent this behavior before she completely ruins your soft furnishings. 1 Identifying the causes of the problem peeing, not to mention an effective pee-prevention strategy, is nontrivial and might require considerable thought and experimentation. Perhaps you notice the cat rarely pees inappropriately when playing with its teaser. However, it is unclear whether the teaser is an effective preventer, because the times of day she is encouraged to play with it may differ from those when she pees. Intuitively, diagnosis becomes easier if you can exploit the moments when you know she tends to urinate to test whether the teaser is an effective preventer. For instance, if she often urinates around 7 a.m., you could try introducing her teaser around this time. Alternatively, you might encourage her to drink water to stimulate an additional need to urinate a little before the time she more habitually plays with her teaser.
In this way you might leverage either an established baseline expectation or an established generative cause (extra water) to facilitate your preventative investigation.
The example above shows, firstly, that temporal expectations are necessary to make sensible causal inferences (Bramley, Gerstenberg, Mayrhofer et al., 2018;Buehner & McGregor, 2006;Greville & Buehner, 2010;Lagnado & Sloman, 2006). In this case we need some sense of when the cat usually pees inappropriately, as well as an expectation of how long it takes for water to pass through its body. Secondly, it is likely that generative and preventative influences interact in terms of how they reveal or obscure one another (Lombrozo, 2010;Rottman, 2016). The existence of either a regular base rate occurrence of an effect, or of effects generated by a known generative cause with regular delays, makes it possible to form a strong expectation against which we can test preventative causes.
In this paper we distill these reasoning patterns into a task and a rational analysis that aim to examine: (1) whether people can use temporal knowledge to learn causal systems that include both generative and preventative causes, (2) how the regularity or predictability of the base rate occurrence of an effect of interest affects the learning process, and (3) whether there are interactions between learning different types of causes.
Apart from establishing what factors influence temporal causal learning, we also want to know how people learn, i.e. what kind of inference process can capture human judgments. Causal Bayesian Networks (CBNs) are an established mathematical framework for representing and reasoning about the causal structure giving rise to observations (Allan, 1980; Pearl, 2000; Rottman & Hastie, 2014). In the psychology of causal reasoning, they have served as a computational-level norm (Marr, 1982), allowing researchers to investigate how the cognitive processes of causal induction approximate or deviate from ideally reverse-engineering the generative causal mechanism most likely to be responsible for one's observations. Accordingly, a number of process-level models have been proposed (e.g. Bramley, Dayan, Griffiths and Lagnado, 2017) that each capture some of the ways human performance departs from this kind of Bayesian ideal. However, CBNs and extant process-level models do not describe the role of continuous-time information in human causal structure induction. This is surprising, since, as argued, time is a ubiquitous feature of human interactions with their environment, and the need to process rich temporal information in real time is a practical constraint on most of our basic causal inferences. In this paper we take a rational analysis approach (Anderson, 1990; Simon, 1982), starting with a normative account of inference from observations of real-time events to their underlying causal structure, and developing a process-level approximation family that can capture human deviations from this. For our normative account, we expand the CBN framework so that it incorporates representing and learning from causal delay information.

1 This is a real-life example for one of the authors of this paper.
Alongside this, we propose a process-level framework that exploits several tricks for approximating intractable probabilistic inference: mental simulation (Battaglia, Hamrick, & Tenenbaum, 2013;Ullman, Stuhlmüller, Goodman, & Tenenbaum, 2018), local computations (Bramley, Dayan et al., 2017;Fernbach & Sloman, 2009), and temporally local evidence (Bonawitz, Denison, Gopnik, & Griffiths, 2014;Bramley, Dayan et al., 2017;Bramley, Lagnado, & Speekenbrink, 2015).

Question 1: How do beliefs about causal orders and delays shape causal structure learning?
One of our main goals is to test whether people can use their knowledge about time and causality to learn causal structure. Previous studies have demonstrated the role of such temporal knowledge from three perspectives: order, delay expectation, and delay variation.
Foundational to the notion of causation is the principle that causes must precede their effects (Hume, 1740). Accordingly, people use the order of occurrence to constrain and sometimes fully attribute causal structure among components of a system (Bramley, Gerstenberg, & Lagnado, 2014). Indeed, event order appears to be a strong heuristic cue to causal order, having been shown to override contingency patterns even in settings where participants are instructed that order is an unreliable guide (Lagnado & Sloman, 2006) or even completely irrelevant to causal structure (Rottman & Keil, 2012).
As well as order, causal inferences are sensitive to delays between events. People make stronger or more confident (generative) causal attributions connecting events separated by short temporal delays than by long temporal delays (Shanks & Dickinson, 1991;Shanks, Pearson, & Dickinson, 1989;Tarpy & Sawabini, 1974). This reflects one of the most basic forms of learning, in which animals associate stimuli at a learning rate inversely related to their separation in time (Grice, 1948). However, going beyond automatic associations in time, human causal attributions are moderated by domain-specific delay expectations, with shorter-than-expected delays also reducing the causal judgment strength (Buehner & May, 2002;Buehner & McGregor, 2006;Hagmayer & Waldmann, 2002;Lagnado & Speekenbrink, 2010;Mendelson & Shultz, 1976). For example, Hagmayer and Waldmann (2002) found participants judged whether an insecticide prevents mosquitoes by comparing prevalence of mosquitoes in fields with and without the insecticide, but judged whether planting flowers prevents mosquitoes based on whether the prevalence of mosquitoes was affected the year after the flowers were planted, presumably expecting that flowers would take longer to influence the insect population than insecticide.
Besides the length of inter-event delays, people are also sensitive to delay variability when they are repeatedly exposed to putative cause–effect pairs. That is, people rate one kind of event as a weaker cause of another to the extent that the delay varies a lot across instances (Greville & Buehner, 2010; Lagnado & Speekenbrink, 2010).
Recently, several studies have proposed models to capture people's expectations for delay length and variation, spanning pairwise causal learning (Pacer & Griffiths, 2012), structure learning (Bramley, Gerstenberg, Mayrhofer et al., 2018; Pacer & Griffiths, 2015), imputing hidden causes (Valentin, Bramley, & Lucas, 2022), and judgments of actual causation given a known causal structure (Stephan, Mayrhofer, & Waldmann, 2020). Nevertheless, these studies have predominantly focused on cases of generative causal influence. Additionally, they have focused on inference from sets of independent clips, in which root components are usually activated at the start and effects follow from this. However, a more naturalistic and challenging setting is one where causes and effects intermingle, components can exhibit multiple activations, and both generative and preventative influences can occur within a single learning episode. This is the setting we will explore.

Question 2: How do generation, prevention, and background causes interact in affecting causal learning?
Early studies of causal cognition focused on elemental pairwise causal judgments based on contingency data. While not directly related to the current temporal setting, these studies reveal general principles of causal inference. For instance, the ΔP principle captures the change in the probability of an effect occurring with vs. without a putative cause, ΔP = P(e|c) − P(e|¬c), forming a basic metric for the strength and direction of a potential causal effect (Allan, 1980). However, researchers later found people are sensitive to the base rate of the effect, P(e|¬c), that is, how frequently the effect occurs in the absence of the cause. For a fixed ΔP, people infer stronger generative influences when base rates are high (because this implies the cause would have succeeded a greater proportion of the time if it had the chance to operate), and stronger preventative influences when base rates are low (Buehner et al., 2003; Cheng, 1997; Wu & Cheng, 1999).
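To make these contingency quantities concrete, the following is a minimal Python sketch using illustrative probability values; the generative and preventive power expressions follow Cheng's (1997) power PC theory rather than any model introduced in this paper.

```python
def delta_p(p_e_c, p_e_notc):
    """Contingency: deltaP = P(e|c) - P(e|not-c) (Allan, 1980)."""
    return p_e_c - p_e_notc

def generative_power(p_e_c, p_e_notc):
    """Cheng's (1997) generative causal power: deltaP / (1 - P(e|not-c))."""
    return delta_p(p_e_c, p_e_notc) / (1.0 - p_e_notc)

def preventive_power(p_e_c, p_e_notc):
    """Cheng's (1997) preventive causal power: -deltaP / P(e|not-c)."""
    return -delta_p(p_e_c, p_e_notc) / p_e_notc

# The same positive contingency (deltaP = +0.2) implies a stronger
# generative cause when the base rate P(e|not-c) is high ...
low_base = generative_power(0.4, 0.2)    # 0.2 / 0.8 = 0.25
high_base = generative_power(0.9, 0.7)   # 0.2 / 0.3 ~= 0.67
# ... while the same negative contingency implies a stronger
# preventative cause when the base rate is low.
strong_prev = preventive_power(0.2, 0.4)  # 0.2 / 0.4 = 0.5
weak_prev = preventive_power(0.5, 0.7)    # 0.2 / 0.7 ~= 0.29
```

This illustrates the asymmetry described above: generative power scales ΔP by the headroom left by the base rate, whereas preventive power scales it by the base rate itself.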
In addition to the size of the base rate, the regularity of the base rate also influences causal inference. In Rottman (2016), participants were asked to evaluate the effectiveness of two medications. In one context, the baseline pain level was random from case to case, whereas in another setting, it was autocorrelated (i.e. it tended to increase or decrease smoothly over time). Participants were found to focus more on the raw effect values in the random condition, while focusing more on the change of effect values in the autocorrelated condition. This indicates that people are sensitive to environmental stability, adapting how they accumulate and represent causal effect evidence when receiving information in different environments (Biele, Erev, & Ert, 2009;Whittle, 1988). We will explore whether people are sensitive to temporal regularity (periodic vs. unpredictable) and, if so, whether or not they adjust their inference strategy accordingly.
Finally, humans show some ability to condition on other variables when inferring the causal role of a target variable (Beckers, De Houwer, Pineno, & Miller, 2005; Gopnik, Sobel, Schulz, & Glymour, 2001; Rescorla & Wagner, 1972; Shanks, 1985). People can use information about known causes to better understand unknown causes, particularly preventative causes. The classic paradigm in prevention learning is to let learners build a generative impression of a cause (A+), and then expose them to negative outcomes under the combination of that generative cause and a putative preventative cause (AB−). People learn the preventative cause better in this case than when it is paired with the negative outcome alone (B−; Melchers, Wolff, & Lachnit, 2006; Rescorla & Wagner, 1972). However, the existence of temporal information may actually increase the difficulty of thinking about causal interactions: To use generative causes to learn about prevention, the learner must be confident that the generative causes would have produced effects during the particular time period in which the preventative causes were active.
Recent studies also demonstrate human limitations in dealing globally with joint probability, i.e. reasoning probabilistically about multiple interacting variables (Bonawitz et al., 2014; Davis, Bramley, & Rehder, 2020; Fernbach & Sloman, 2009; Griffiths, Lieder, & Goodman, 2015; Markant, Settles, & Gureckis, 2016). Outside of very simple learning problems, people may focus on local components of the system rather than maintain a global perspective. For example, people often infer an erroneous direct A → C link when reasoning about a generative chain with two links A → B → C, apparently failing to notice that B can explain C's dependence on A (Fernbach & Sloman, 2009). Through model comparison, we will explore to what extent people reason globally or locally about causal structure on the basis of real-time evidence, e.g. whether they can account for, and potentially bootstrap their inferences by considering, interactions between causal mechanisms, or whether they fail to make these accommodations.

Question 3: How do people process temporal dynamics to make causal inferences?
We build two models describing how temporal information could be processed in order to make causal inferences. We explain the models at a theoretical level in this section and refer readers to Appendices B and C for technical details. We first introduce the learning task, so that readers can get a concrete understanding of how the models work.

The learning task
In this study, participants must guess the structure of abstract causal ''devices'' (Bramley, Dayan et al., 2017;Bramley, Gerstenberg, Mayrhofer et al., 2018;Gong, Gerstenberg, Mayrhofer, & Bramley, 2023) composed of three components ( Fig. 1a-d): two ''control components'' (i.e. Cause , ) and one ''target component'' (i.e. Effect ) on the basis of observations of those structures being perturbed by interventions. To control the impact of interventions, our experiments focus on a learning setting wherein the interventions are part of the stimuli, meaning participants observe them taking place rather than selecting and performing them themselves. For each device, the connection between each control component and the target component could be generative, preventative, or they might be unconnected (non-causal). Thus, we focus on learning in a nominal hypothesis space of 9 possible structures including all combinations of generative, preventative and non-causal connections from and to (Fig. 1e). As a first foray into preventative causation in real-time causal structure induction, we focus on this restricted hypothesis space of causal structures which only contains the common effect topology. However, the experimental paradigm and computational models we introduce generalize directly to learning in arbitrarily broader causal hypothesis spaces, as well as under different prior expectations about plausible delays and relations.
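The size of this hypothesis space follows directly from crossing the two connections; a few lines of Python make this concrete (the link-type labels here are our own illustrative naming):

```python
from itertools import product

LINK_TYPES = ("generative", "preventative", "non-causal")

# Every combination of link types on the A -> E and B -> E connections
# yields the nominal 3 x 3 = 9 structure hypothesis space (Fig. 1e).
hypothesis_space = [
    {"A->E": a, "B->E": b} for a, b in product(LINK_TYPES, LINK_TYPES)
]
```

Each element of `hypothesis_space` corresponds to one candidate common-effect structure over which the learner maintains beliefs.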
We focus on relationships between point events (i.e. activations) occurring at a device's components at particular moments in time. We assume an activation of a generative component will always produce an ''extra'' activation of the target component (i.e. causal strength = 1, Cheng, 1997, see Fig. 2a). We use the gamma distribution to model and generate the delays between causes and effects (Bramley, Gerstenberg, Mayrhofer et al., 2018;Stephan et al., 2020;Valentin et al., 2022). See Appendix A for more details.
We assume an activation of a preventative component blocks any activations of the target component for a short stochastic time window (Fig. 2b). We assume that prevention occurs irrespective of whether activations would have been caused by a generative causal influence or would have occurred spontaneously. Preventative influences are thus conceived as having a broad preventative scope (Carroll & Cheng, 2009). 2 By definition, activations of non-causal components have no impact on the behavior of the target component.

Fig. 2. Using gamma density distributions to generate the delays between cause and effect and the blocking windows of preventative causes. Circles indicate cause events and diamonds indicate effect events. Each vertical line shows an actual sampled situation. (a) The distribution of delays between cause and effect. When a generative cause event occurs, it will produce an effect event after 1.5 ± 0.5 s. (b) The distribution for preventative window length. When a preventative cause event occurs, all effect events supposed to occur within 3 ± 0.5 s will be canceled, while effects outside the preventative window (the red box) are not affected. (c) The distribution of delays between base rate events in the regular condition. When a base rate effect occurs, the next base rate effect will occur after 5 ± 0.5 s. (d) The distribution of delays between base rate events in the irregular condition. When a base rate effect occurs, the next base rate effect will occur after 5 ± 5 s. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Two forms of background activation are considered. In the Regular base rate condition, the target component activates quasi-periodically (Fig. 2c). In the Irregular base rate condition it activates exactly as often overall, but it is completely unpredictable when the next activation will occur (Fig. 2d; see Appendix A).
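A rough simulation of this generative process might look as follows. This is a sketch under our reading of the paradigm: gamma-distributed delays with the means and spreads described in Fig. 2, causal strength 1, and a broad-scope preventative window; the conversion from mean/sd to gamma shape/scale parameters is our own, and the function names are illustrative.

```python
import random

def gamma_delay(mean, sd):
    """Sample a delay from a gamma distribution with the given mean and
    sd (shape = (mean/sd)^2, scale = sd^2/mean)."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return random.gammavariate(shape, scale)

def simulate_device(interventions, duration=20.0, regular=True):
    """interventions: list of (time, kind) pairs, with kind in
    {'generative', 'preventative', 'non-causal'}. Returns the times at
    which the target component E activates."""
    # 1. Base rate activations of E: inter-event delays of 5 +/- 0.5 s
    #    (regular condition) or 5 +/- 5 s (irregular condition).
    base_sd = 0.5 if regular else 5.0
    effects, t = [], gamma_delay(5.0, base_sd)
    while t < duration:
        effects.append(t)
        t += gamma_delay(5.0, base_sd)
    # 2. Each generative activation adds one extra effect ~1.5 +/- 0.5 s
    #    later (causal strength = 1).
    for t_c, kind in interventions:
        if kind == "generative":
            effects.append(t_c + gamma_delay(1.5, 0.5))
    # 3. Each preventative activation cancels ALL effects (generated or
    #    spontaneous) inside a ~3 +/- 0.5 s blocking window.
    for t_c, kind in interventions:
        if kind == "preventative":
            window = gamma_delay(3.0, 0.5)
            effects = [e for e in effects if not (t_c <= e <= t_c + window)]
    return sorted(e for e in effects if e < duration)

trace = simulate_device([(2.0, "generative"), (9.0, "preventative")])
```

Running `simulate_device` many times per hypothesis yields the kind of simulated evidence distributions the summary-statistic model described below relies on.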

Bayesian inference
We now lay out an ideal Bayesian model as a normative benchmark for this task. The ideal reasoner is presumed to take all activation events within the observation interval as the basis of their inference, and to use the relative likelihood of these under different structural hypotheses to update a distribution over causal structures. The calculation of likelihood here depends on an expensive enumerative actual-causal attribution step (Halpern, 2016). The basic idea is that accurate judgments about type-level causal relationships (i.e. about the underlying causal structure) depend on detailed consideration of the token-level causation giving rise to the observable evidence (i.e. which particular event actually caused which particular effect). There are often a very large number of possible ways that even a single causal hypothesis could have produced a particular pattern of observed events. For instance, suppose A activates at 0.1 s and B activates at 1.2 s ({a(1) = 0.1 s, b(1) = 1.2 s}), and the learner observes two subsequent effects ({e(1) = 1.5 s, e(2) = 2.8 s}). Even under the hypothesis that A and B are both generative causes, the data could have been produced in multiple ways: A could have caused the first effect and B the later one (a(1) → e(1), b(1) → e(2)), or A could have caused the later effect and B the earlier one (a(1) → e(2), b(1) → e(1)). Alternatively, one or both connections might not yet have revealed their effects, meaning either or both observed effects could simply be base rate activations. Therefore, in order to maintain rational beliefs about causal structure, the ideal reasoner considers all possible causal paths that could describe what actually happened under each possible structural hypothesis. Fig. 3a shows two examples of the tree of possible causal paths under two of the possible structural hypotheses.

2 We recognize that there are other ways in which one might operationalize prevention, and we consider several alternatives in the General Discussion.
Since one must consider possible causal paths exhaustively, the complexity of this inference scheme scales in a worse-than-polynomial manner as the number of events a learner observes increases.
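The combinatorics of this attribution step can be illustrated with a small Python sketch, using the two-cause example above. This is our own illustrative code, not the paper's implementation; it assumes each cause event explains at most one effect and that every effect has exactly one actual cause (a prior cause event or the base rate).

```python
from itertools import product

# Token-level events from the example: activations of causes A and B,
# and two observed activations of the target E (times in seconds).
cause_events = {"a1": 0.1, "b1": 1.2}
effect_events = [1.5, 2.8]

def causal_paths(cause_events, effect_events):
    """Enumerate the consistent token-level attributions under the
    hypothesis that every listed cause event is generative."""
    candidates = []
    for t_e in effect_events:
        # A cause event can only explain an effect that follows it.
        opts = [c for c, t_c in cause_events.items() if t_c < t_e]
        opts.append("base")  # spontaneous (base rate) activation
        candidates.append(opts)
    paths = []
    for assignment in product(*candidates):
        used = [c for c in assignment if c != "base"]
        if len(used) == len(set(used)):  # each cause explains <= 1 effect
            paths.append(assignment)
    return paths

paths = causal_paths(cause_events, effect_events)
```

Even this tiny example admits several distinct attribution paths; the number of paths multiplies with every additional event, which is what drives the worse-than-polynomial scaling noted above.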

Simulation-and-summary-statistic approximation
While the enumerative approach achieves benchmark performance by inverting the generative model, exhaustively considering pathways linking all observed events, it makes unrealistic demands on memory storage and computing power compared to what could plausibly be involved in human cognition. Therefore, we propose a process-level model that is more consistent with cognitive constraints. It is based on the simulation-and-summary-statistic idea (''summary-statistic'' for short), an important approach in Approximate Bayesian Computation in statistics (Blum, Nunes, Prangle, & Sisson, 2013; Lintusaari, Gutmann, Dutta, Kaski, & Corander, 2017; Sunnåker et al., 2013; Zhao et al., 2023). We explore this idea's cognitive plausibility as an explanation for human judgments in our setting.

Fig. 3. (a) ii–iii. The ideal observer sums over all possible pathways (branches) that explain all event evidence under each hypothetical structure. ii. E.g., under the structure where A and B are both generative causes, there are 13 ways to explain the evidence: one candidate cause for e0 (base rate), four candidate causes for e1, and 3–4 candidate causes for e2 depending on how e1 is explained. iii. Possible pathways under a different structure. (b) Summary-statistic approach: i. Intervention-window or fixed-window evidence segmentation. ii. Distributions for summary statistics given different connection types based on pre-simulated data. The model uses the likelihood of observed statistics under these distributions as a proxy for generative model likelihood. Distributions differ slightly between base rate conditions. (c) Example where the posterior over structures differs among models (assuming a regular base rate). Curved arrows indicate the true underlying generative process, unknown to the models. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Our model incorporates three features of bounded inference that are often highlighted in cognitive psychology: mental simulation, local computation, and temporally local evidence.

Mental simulation
The tendency to rely on simulation-based approximation to exact inference has been hypothesized to play an important role in model-based reasoning in many scenarios, including physical scene understanding (Battaglia et al., 2013;Hamrick, Battaglia, Griffiths, & Tenenbaum, 2016;Ullman et al., 2018), mechanical reasoning (Hegarty, 2004), and causal judgment (Gerstenberg, Goodman, Lagnado, & Tenenbaum, 2021;Gerstenberg, Peterson, Goodman, Lagnado, & Tenenbaum, 2017). The idea is that instead of computing the likelihood of a potential generative model producing observed data exactly, people instead compare their observations to mental simulations of what kind of pattern they expect to happen under different generative models.
Critical to this process is the identification of a useful set of easily tracked abstract cues or features with which to compare such simulations to observations. When a scenario of interest involves complex dynamics, direct surface-level (i.e. ''pixel-level'') comparison between simulated and observed evidence is generally inappropriate for measuring the likelihood of a hypothesis. Ullman et al. (2018) combined the ideas of simulation and abstraction to model inferences about the latent properties of physical objects (such as masses and forces) from observed dynamics. As a simple example, if imagined heavy objects tend to move more slowly than imagined light objects, this licenses the use of speed as a (fallible) cue to mass.
Concretely, we explore whether simple salient local features of event sequences that are diagnostic (if fallible) guides to local causal relationships can explain human judgments better than a fully Bayesian treatment. The implied cognitive process is that learners draw on (imagined) evidence under different causal ground-truth structures in order to develop statistical cues that can be directly applied to pairwise causal judgments. Here we simply investigate two straightforward and salient cues that people might be sensitive to in the current task:

1. Delay: The interval between a cause component's activation and the next subsequent effect activation.

2. Count: The number of activations of the effect after the cause activation within some time window.
These cues are hand-engineered, and far from exhaustive. However, they are simple to track and turn out to discriminate reasonably well between different types of causal connections. As shown in Fig. 3b, for the delay cue, we generally expect to see shorter intervals between a control component's activation and the target component's next activation if the control component is a generative cause, a medium and more variable interval if there is no connection or a longer interval if it is preventative. For the count cue, more effect activations are likely to follow the activation of the generative component on average because of the existence of base rate activations as well as generation. In contrast, fewer activations are likely to follow the activation of preventative components. The former cue considers concrete delay information but ignores the possibility of different causal pathways, while the latter cue ignores the exact temporal interval between events (cf. Bramley, Gerstenberg, Mayrhofer et al., 2018).
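Both cues are cheap to compute from an event stream. The following is a minimal sketch of how they might be extracted (our own illustration; the 3 s count window is an arbitrary choice, not a parameter from the paper):

```python
def delay_cue(t_cause, effect_times):
    """Interval from a cause activation to the next effect activation
    (None if no effect follows)."""
    later = [t for t in effect_times if t > t_cause]
    return min(later) - t_cause if later else None

def count_cue(t_cause, effect_times, window=3.0):
    """Number of effect activations within `window` seconds after the
    cause activation."""
    return sum(t_cause < t <= t_cause + window for t in effect_times)

effect_times = [1.5, 2.8, 7.0]     # observed target activations (s)
d = delay_cue(1.2, effect_times)   # short delay: suggestive of generation
c = count_cue(4.0, effect_times)   # effects in the window (4.0, 7.0]
```

A learner would then score observed cue values against the cue distributions simulated under each connection type (Fig. 3b), rather than computing exact likelihoods over attribution paths.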

(Structurally) local computation
Both the count and delay cues introduced above ignore surrounding structure and context, leading to the potential for interference. For example, in the presence of a known preventative cause that has just occurred, an ideal learner should reduce their expectation that a generative cause would produce a short delay to the next event, or a high subsequent effect count. Thus, this approach also captures a principle of local computation (Bonawitz et al., 2014; Bramley, Dayan et al., 2017; Fernbach & Sloman, 2009; Griffiths et al., 2015; Markant et al., 2016), predicting that learners will make causal attributions at the level of individual links without accommodating the global context and the full space of global causal models.
The other reason we apply local computation in this process-level model is that it greatly reduces the computational cost relative to the global computation approach. In the current continuous-time setting, interventions could happen at any time, making every context unique. This means that conditioning one's inference on even a single previous intervention requires learners to simulate a much larger number of one-off context-specific situations. Introducing more of this context sensitivity (i.e. constructing separate summary statistics for each possible combination of causes) would allow a summary-statistic approach to perform closer to normative inference, but at the cost of increasing computational demands and reducing generality beyond the set of contexts considered.

(Temporally) local evidence
The final cognitive feature we consider relates to how people parse and segment the evidence encountered across an extended observation of a causal system. Ullman et al. (2018) applied a summary-statistic approach to short observations (5 s) and allowed participants unlimited replay opportunities, and so assumed people could use cues based on the entire observation. In the scenario considered here, the learner observes causal dynamics for considerably longer (20 s, containing dozens of events) without recourse to replays. In general, we experience the world in a single ongoing timeline. Thus, with finite short-term memory storage and attention, it seems plausible that people abstract cues more locally than from the full observation. In other studies, people have often been found to use temporally local (i.e. recent) information to drive causal model learning (Bramley, Dayan et al., 2017; Rehder, Davis, & Bramley, 2022). Furthermore, people are often unable to recall older evidence exactly (Bramley et al., 2015; Harman, 1986), but rather remember whatever conclusions they have drawn on the basis of it.
In line with these ideas, we hypothesize that people segment their observations as they unfold, using recent events to update their beliefs and then discarding their memory of them. We consider two ways to segment continuous-time evidence. As shown in Fig. 3b, a unit of evidence under both approaches begins with an intervention (i.e. the activation of a control component), capturing the basic principle that causes can only influence what happens later. An Intervention-window segmentation approach treats one unit of observation as the interval between one intervention and the next. This removes the distraction of other interventions that might also influence the effect, but ignores the fact that these interventions might be performed irregularly or reactively, and also that actual effects may not have been revealed before the occurrence of the next intervention. A Fixed-window approach ends one unit of observation after a fixed amount of time. This has the advantage of stability in its odds of including all relevant effects, but instead opens the door to confounding influences when subsequent interventions occur within the preceding observation window. A fixed-window approach also implies some degree of parallel processing, since fixed-length attentional windows may easily overlap in a single timeline.
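The two segmentation schemes can be sketched as follows. This is an illustrative reading of the windowing just described, with hypothetical function names; the 4 s default for the fixed window is only an example.

```python
def intervention_windows(intervention_times, clip_end):
    """Each unit of evidence runs from one intervention to the next
    intervention (or to the end of the clip for the last one)."""
    times = sorted(intervention_times)
    return list(zip(times, times[1:] + [clip_end]))

def fixed_windows(intervention_times, clip_end, length=4.0):
    """Each unit of evidence is a fixed-length window after an intervention.
    Consecutive windows may overlap, implying some parallel processing."""
    return [(t, min(t + length, clip_end)) for t in sorted(intervention_times)]

# Three interventions in a 20 s clip:
ivs = [2.0, 5.0, 12.0]
# intervention_windows(ivs, 20.0) -> [(2.0, 5.0), (5.0, 12.0), (12.0, 20.0)]
# fixed_windows(ivs, 20.0)        -> [(2.0, 6.0), (5.0, 9.0), (12.0, 16.0)]
```

Note how the second fixed window, (5.0, 9.0), overlaps the first, while the intervention windows tile the clip without overlap.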

Summary of modeling frameworks
In sum, we have laid out two approaches to solving the current learning problem. The normative model utilizes the exact timing information of each event, considering all possible observation-consistent ways in which the effects might have been generated or prevented. The summary-statistic model compresses the information by abstracting useful cues and comparing cues summarized from the observation against those derived from mental simulation. We do not see the two accounts as fundamentally in tension. Rather, the summary-statistic approach embodies a set of algorithmically plausible steps to approximate the normative solution.
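The cue-matching step of the summary-statistic account can be sketched as scoring each connection type by how typical the observed cue value is among cues simulated under that type. The tolerance parameter, the smoothing constant, and the function name below are all our illustrative assumptions, not the paper's implementation.

```python
def cue_posterior(observed, simulated_cues_by_type, tolerance=0.5):
    """Crude density estimate: the fraction of simulated cue values falling
    within `tolerance` of the observed value, smoothed and normalized so the
    scores over connection types sum to one."""
    scores = {}
    for conn_type, sims in simulated_cues_by_type.items():
        near = sum(1 for s in sims if abs(s - observed) <= tolerance)
        scores[conn_type] = (near + 1e-6) / len(sims)  # smoothing avoids zeros
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Hypothetical simulated delay cues: short under generation, long under
# prevention, intermediate and variable when there is no connection.
sims = {'generative': [1.3, 1.5, 1.7, 2.0],
        'non-causal': [2.6, 3.1, 4.0, 5.2],
        'preventative': [4.5, 5.5, 6.0, 7.0]}
```

An observed delay of 1.5 s then yields a posterior concentrated on the generative type.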
Given the information compression and the local focus of the summary-statistic approach, its predictions diverge from the normative ones in some situations. One example is shown in Fig. 3c. When one control component activates and then the other activates, followed closely by two effects, the normative learner finds this most consistent with the structure in which only one of them is a generative cause, because the delay between that component's activation and the first effect is consistent with its delay expectation, while the other effect could easily be due to the base rate. For the summary-statistic models, the intervention-window approach suffers from a blocking effect, where the occurrence of the later intervention masks any potential link between the earlier component and the effects. The fixed-window approach suffers from a local computation error, where each effect is potentially attributed to both components, leading to a marginal preference for the model with both components as generative causes. We will show more similarities and differences between the two modeling approaches alongside human behavior in the Results sections.

Overview of experiments
We now report on three experiments that investigate how people infer preventative and generative causal structures in continuous time. Each experiment includes stimuli generated from each of the nine underlying structures we consider (Fig. 1e). Experiments 1a and 1b explored how overall structure and regular vs. irregular base rates influence causal judgments. Experiment 2 additionally includes stimuli designed to probe whether people make specific mistakes predicted by the summary-statistic model. All pre-registrations, materials, data, and analysis code are available at https://osf.io/q8n72/. Stimuli for all experiments can be viewed at https://github.com/tianweigong/causal_diamond.

Participants
One hundred and eighty-seven participants from Amazon Mechanical Turk were recruited and reported for Experiment 1a (81 female, 105 male, 1 non-binary, aged 37 ± 11, regular vs. irregular condition: 93 vs. 94) and another 123 participants were recruited and reported for Experiment 1b (45 female, 78 male, aged 39 ± 11, regular vs. irregular: 63 vs. 60). The sample size of Experiment 1a was determined by a power analysis comparing two between-subject groups anticipating a medium sized effect (d = 0.5) with a goal of .90 power at the standard .05 alpha. The sample size for Experiment 1b followed a pilot study (Gong & Bramley, 2020), given that both of them aimed to compare participants' performance with normative and heuristic models. Nine additional participants in Experiment 1a were recruited but excluded prior to analysis because they clicked (to respond) more than 300 times during the task (the average participant acted 113 ± 26 times). Hence, we suspected these respondents were either inattentive or nonhuman. Four additional participants in Experiment 1b were recruited but excluded prior to analysis because they clicked more than 300 times during the task (n = 2), or failed to pass at least one of two attention questions (n = 2). Participants were paid between $1.00 and $2.08 depending on their performance (see below) and experiments lasted around 20 min.

Design & procedure
Overview. In both Experiments 1a and 1b, participants judged the causal structure of 18 causal devices (Fig. 1e). When a generative cause event occurred, it would produce an effect event after 1.5 ± 0.5 s (see Fig. 2a). Whenever a preventative cause event occurred, any upcoming effect events in the subsequent 3 ± 0.5 s were canceled (see Fig. 2b). Each base rate event occurred 5 ± 0.5 s after the previous one in the regular base rate condition, or 5 ± 5 s in the irregular base rate condition. The choice of generative delay was based on past studies that suggest people only reliably attribute causal relations to delays of up to around 2 s in the absence of context information shaping delay expectations (Shanks & Dickinson, 1991; Shanks et al., 1989). We chose the size of the true preventative windows and base rates such that base rates are generally lower than causal influences (i.e. activity is relatively sparse without any generative events) and preventative influences last long enough to have a reasonable chance of preventing something. The true sampled causal delays are unknown to the learner (human or model), but for simplicity we pre-trained (Experiment 1a) or told (Experiment 1b) participants about typical patterns of base rate activations and about typical generative delays and preventative durations in an instruction phase, and so also assumed these parameters were available to all models.
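The generative process described above can be sketched as a small forward simulator. This is our reconstruction under stated assumptions: the component labels 'A' and 'B' are hypothetical placeholders, and we assume the "± " jitter denotes a uniform distribution, which the text does not specify.

```python
import random

def simulate_clip(structure, interventions, regular=True, end=20.0, rng=random):
    """Forward-simulate one clip of a two-control-component device.
    structure: dict mapping control labels to 'gen', 'prev', or 'none'.
    interventions: dict mapping the same labels to lists of activation times."""
    # Base-rate activations: the clip starts with one at t=0, then roughly
    # every 5 +/- 0.5 s (regular) or 5 +/- 5 s (irregular).
    base, t = [], 0.0
    while t <= end:
        base.append(t)
        t += rng.uniform(4.5, 5.5) if regular else rng.uniform(0.0, 10.0)
    # Generative causes add a candidate effect 1.5 +/- 0.5 s after activating.
    candidates = list(base)
    for comp, times in interventions.items():
        if structure[comp] == 'gen':
            candidates += [ti + rng.uniform(1.0, 2.0) for ti in times]
    # Preventative causes cancel any candidate in the next 3 +/- 0.5 s.
    blocked = []
    for comp, times in interventions.items():
        if structure[comp] == 'prev':
            blocked += [(ti, ti + rng.uniform(2.5, 3.5)) for ti in times]
    effects = [c for c in candidates
               if c <= end and not any(lo < c <= hi for lo, hi in blocked)]
    return sorted(effects)
```

With both connections set to 'none', the clip simply contains the base-rate activations; adding a 'prev' connection removes some of them.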
For each device, participants clicked a ''Start'' button to watch the clip. Each clip started with a base rate activation of the target component and included three pre-set interventions on each of the two control components, randomly spaced and intermingled over 20 s. After that, the clip would end and no further activations could be observed. Components' activations were displayed as the component ''lighting up'' by changing from gray to yellow for 350 ms. The activation of a control component was accompanied by a hand symbol (Fig. 1b), and participants were told that this showed that control components were being intervened on by someone or something external to the system, meaning that the interventions happened at random moments rather than following any informative pattern. Clips were selected to make sure that no activation was masked by another on the same component, and participants were also told about this rule.
Participants were invited to mark their guesses about the two connections during or after the clip by clicking the space between the components (Fig. 1d). Each clip could only be played once. The order of the 18 trials, the click pattern (whether they would have to click once, twice, or three times to select generative, preventative or non-causal), and the vertical positions of the two control components (above or below) were randomized independently between participants.
Participants were informed of the timing of the three types of connections as well as the target component's self-activation prior to the inference task. For the base rate specifically, participants in the regular condition were told that the target component would activate regularly about every five seconds and they saw an illustration with a circular arrow to create the impression of periodic activation (Fig. 1f). Participants in the irregular condition were told that the target component can activate by itself at completely random times and they saw an illustration with an exogenous link intended to imply that someone sometimes activates the target component directly but one cannot anticipate when it will happen (Fig. 1f). In order to similarly provide timing information, participants were told the base rate activation happens about 2-7 times per clip. Participants had to pass instruction check questions before starting the experiment. To properly incentivize accurate judgments, a 3-cent bonus was paid for each correctly identified connection and non-connection during the main task in addition to the basic $1 payment. Experiment 1a. In Experiment 1a, to make stimuli generated from different structures (e.g. both generative, one generative and one non-causal) and different conditions (i.e. regular vs. irregular) comparable, we used a Latin-square design. We first created 18 causal delay seeds independently. Each of these included a set of timings for interventions and base rate activations (which depended also on whether the base rate was regular or irregular), and specified what generative delays (or blocking windows) the control components would have if they were generative (or preventative) components. Under each seed, 18 stimuli (9 causal structures × 2 base rate settings) were generated by implementing generative or preventative influences according to the ground-truth structure (see Fig. 4a for an example of a single seed manifesting under each structure and base rate condition).
Across different seeds, the timing and order of interventions were randomly generated to capture the diversity of ways in which the interventions could be interleaved, ranging from perfectly interleaved to perfectly clustered (see Fig. 4b for an example of a single structure under different seeds and base rate conditions). All stimuli were finally divided into 18 sets (9 sets for each base rate setting) according to a Latin-square design that ensured participants would only see one structure under each seed (see https://osf.io/sqv6c for the counterbalancing matrix). Participants were randomly assigned to one of these 18 sets.
In the instructions, participants saw training videos that showed the patterns of the target component's base rate activations (corresponding to their condition) and also what happens after intervening on a causal system with a single (generative, non-causal, or preventative) connection. They completed a single practice trial in which the true causal device included one generative connection and one non-causal connection. Feedback was provided in the practice trial but not in the test trials. Experiment 1b. Experiment 1b differed from Experiment 1a in two respects. Firstly, although we assume the provenance of the summary-statistic approximation to be mental simulation, cues might also be derived from experience with the ''labeled data'' included in the instructions or practice trials. Therefore, Experiment 1b only kept the text instructions and removed the training videos and practice trials, to show that labeled data were not necessary for participants to complete the task.
Additionally, given that the stimuli in Experiment 1a were generated by one of the ground truth structures, the normative model and summary-statistic approximations often made similar predictions. To probe how participants react to situations with stronger discrepancies between the normative and summary-statistic predictions, we created some stimuli that were not generated by any particular causal device. We created two blocks of stimuli in Experiment 1b. Block 1 included nine stimuli for each participant, which replicated the procedure of Experiment 1a, and served to ensure participants were habituated to reacting to ''normal'' stimuli. In Block 2, we generated potential test stimuli by randomly distributing six interventions and between 1 and 9 effects across a 20 s trial. We selected sequences for which the structure predictions of the normative and summary-statistic models were strongly dissimilar, while ensuring that these stimuli were not too normatively improbable (i.e. that they could conceivably have been generated by one of the causal structures).4 There were 27 stimuli for each condition and each participant observed nine of them. Block 1 always preceded Block 2 so that the first half of the task would be identical to Experiment 1a. Participants completed 18 trials in sequence without any delineation between the blocks.
4 Specifically, we picked the stimuli where at least one (intervention-window) summary-statistic cue (Delay or Count) had a different dominant answer compared to the normative model, and rejected any for which the likelihood of the most probable structure producing the data was extremely low (<10 −40 ). The squared error between normative and summary-statistic predictions in Block 1 (trials with the ground truth) and Block 2 (trials without the ground truth) was 0.22 vs. 0.53 on average. The likelihood of the most probable structure according to the normative model in Block 1 and Block 2 was 0.07 vs. 0.004 on average.
All other experimental settings remained identical to Experiment 1a. The bonuses were, in reality, determined by doubling the bonuses participants gained in Block 1.

Results
We focus on analyzing participants' accuracy by comparing their judgments against the ground truth. We investigate whether participants' performance was influenced by the nature of the underlying causal mechanism, base rate regularity, or the observed intervention sequence (i.e. whether this involves interleaved interventions on the two components or clusters of interventions on one component then the other). Since these analyses require there to be a correct answer, for Experiment 1b we only include Block 1.
To compare our models' behavior qualitatively with participants', we simulate judgments of each model type after observing the same stimuli as the participants. We used one fitted softmax parameter for each model and repeated each simulation 200 times per participant to obtain stable and consistent distributions of simulated judgments (see Appendix D for model fitting details). For the summary-statistic models, we average predictions under the two proposed features with equal weights to form a combined prediction (cf. Ullman et al., 2018). Results of the intervention-window vs. fixed-window summary statistics were similar at the aggregate level, and hence we only visualize the intervention-window results in the figures.
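The two aggregation steps just described can be sketched as follows: a softmax mapping model scores onto choice probabilities, and an equal-weight average of the two cue-based predictions. The function names and temperature parameterization are our illustrative choices.

```python
import math

def softmax(scores, temperature=1.0):
    """Map model scores for the three answers (generative, non-causal,
    preventative) onto choice probabilities; lower temperature sharpens."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_cues(delay_probs, count_probs):
    """Equal-weight average of the Delay- and Count-based predictions
    (cf. Ullman et al., 2018)."""
    return [(d + c) / 2 for d, c in zip(delay_probs, count_probs)]
```

For example, `combine_cues([0.6, 0.3, 0.1], [0.4, 0.5, 0.1])` yields `[0.5, 0.4, 0.1]`: the combined prediction still favors the generative answer, but less sharply than the Delay cue alone.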

Overall performance
In Experiment 1a, participants in both regular and irregular conditions performed above chance at the connection level (chance = 33%; regular: 66% ± 22%). Performance in Experiment 1b was comparable despite the reduced instructions, meaning that labeled data in the form of video training and practice trials were not a necessary condition for participants' success in this task. We therefore combine stimuli from the two experiments in later analyses to obtain a larger sample size.

Focal and neighboring causes
To investigate participants' ability to identify generative, non-causal, and preventative connections, as well as whether the base rate regularity or the neighboring connections would influence performance, we performed a 3 (focal cause: generative, non-causal, preventative) × 3 (neighboring cause: generative, non-causal, preventative) × 2 (base rate regularity: regular, irregular) mixed ANOVA. Each trial provided two data points: one regarding the first control component as the focal cause and the second as the neighboring cause, and the other with the roles reversed.
For simulated model-based learners, the summary-statistic learner exhibited a similar tendency to participants, performing worse in identifying non-causal connections (Fig. 5a). The accuracy of both normative and summary-statistic learners was partly dependent on the neighboring cause. As shown in Fig. 3b, the summary-statistic distributions of the non-causal type, particularly the Delay distributions, frequently overlap with both other distributions, and furthermore, the other types (generative or preventative) typically have higher density in the overlapping region. Fig. 6 shows the proportion of participants' choices under different ground truths. We explored the frequency of choices when people made judgments inconsistent with the ground truth. Under the regular base rate, people were equally likely to judge a generative connection as a non-causal one or a preventative one (12% vs. 10%, chi-square goodness of fit: χ²(1) = 3.01, p = .082). They were equally likely to judge a non-causal connection as a generative or preventative one (22% vs. 25%, χ²(1) = 2.65, p = .103), while they more often judged a preventative connection as a non-causal one than a generative one (22% vs. 9%, χ²(1) = 83.41, p < .001). The results for the irregular base rate were similar (non-causal ground truth: 25% vs. 26%, χ²(1) = 0.70, p = .404; preventative ground truth: 30% vs. 12%, χ²(1) = 107.96, p < .001) except that now participants also more often judged a generative connection as non-causal than preventative (18% vs. 9%, χ²(1) = 29.55, p < .001). The summary-statistic learner exhibited a similar tendency to human participants, tending to mistake preventative or generative connections more often as non-causal, rather than mistaking one for the other.

Intervention order
We examined the influence of the intervention sequence. The intervention patterns in the experimental stimuli were randomly generated (albeit balanced to include 3 interventions per control component) and hence varied in terms of the sequence. In some trials, participants observed data in which interventions on the two components were ''interleaved'', in others they were fully ''clustered'', and in others they were partially clustered, which we called a ''medium'' level. We performed a 3 (focal cause: generative, non-causal, preventative) × 3 (intervention order: interleaved, medium, clustered) × 2 (base rate regularity: regular, irregular) mixed ANOVA. Each trial provided two data points, one regarding each control component as the focal cause. The effects regarding focal cause and base rate regularity were similar to previous analyses, and hence we only focus on the effects related to intervention order here.
T. Gong and N.R. Bramley
Fig. 7. Accuracy separated by intervention order in Experiment 1. Lines indicate the performance of simulated normative and summary-statistic learners each with a fitted softmax parameter based on participants' data in Experiment 1 (see Appendix D). Error bars indicate 95% confidence intervals.

Trials optimized for model discrimination
Block 2 in Experiment 1b contained stimuli that were not generated from a particular ground truth structure, but rather generated so as to distinguish strongly between the normative and summary-statistic models. Fig. 8 shows the choice proportions of human learners vs. simulated learners on each stimulus. The choices simulated from the summary-statistic model were better correlated with human judgments across generative, non-causal, and preventative answers. In particular, the summary-statistic model captured when people tended to judge a variable as non-causal (gray points and line), which often diverged from the normative prediction.

Model fitting
To check quantitatively how well the models we have considered capture participants' causal judgments, we fit all participant judgments with our normative and summary-statistic models at both aggregate and individual levels. The details of the model fitting procedure can be found in Appendix D.
Participants' choices were best captured by the summary-statistic approach, specifically by the variant that segments evidence according to the intervals between interventions (Table 1). This is corroborated by the individual-level fits, where the largest proportion of participants were best fit by the summary-statistic (intervention-window) model in both regular and irregular conditions across experiments (model fits separated by conditions are shown in Table E.1). We provide additional model fitting results in Appendix E. In Table E.2 we fit answers from Experiment 1b separated by blocks. The difference in cross-validation log-likelihood or BIC between normative and summary-statistic models was more pronounced in the no-ground-truth block than in the ground-truth block, reflecting that people's judgments were indeed more similar to the summary-statistic model. In Table E.3 we fit participants' answers with each cue separately to see whether they were dominated by Delay or Count rather than their combination. Results indicate that models with one or the other cue did not fit participants' judgments better than models that mixed the two cues. In Fig. E.1, we performed a grid search in [1, 7] s with a step of 0.5 s to test whether the fixed-window model fits were sensitive to our choice of a 4 s window. Models with different fixed-window lengths always had substantially larger BICs than the model with the inter-intervention window approach, meaning that, even had we fit window length as an additional parameter, it would not have outperformed by-intervention segmentation in describing participants. This was true despite the fact that the models' accuracy in causal identification is quite sensitive to the window length.
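For reference, the BIC comparison used here follows the standard definition, penalizing model complexity by the log of the number of observations. This snippet is a generic sketch, not the paper's fitting code.

```python
import math

def bic(log_likelihood, n_params, n_observations):
    """Bayesian Information Criterion: lower values indicate a better
    trade-off between fit and complexity."""
    return n_params * math.log(n_observations) - 2 * log_likelihood
```

Two models with the same number of parameters (here, one softmax parameter each) are then compared purely on their log-likelihoods, so a lower BIC directly reflects a better fit.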

Discussion
In Experiment 1, we showed that people are capable of using temporal information to learn causal structures that involve generative and preventative relationships. The experiment also showcased several interesting differences between generative and preventative causation, to which we return in the General Discussion. Human judgments were better aligned with the summary-statistic models' predictions in both the quantitative results and the aggregate qualitative results. Nevertheless, the stimuli in Experiment 1 were complex, meaning we can do more to distill simpler, more intelligible examples of how the normative and summary-statistic models diverge in their judgments. In Experiment 2, we examine judgments about minimal event sequences for which the summary-statistic and normative learners differ in their dominant answers.

Experiment 2
We designed two types of stimuli for which the two models have different dominant answers. They are based on the two locality principles driving the summary-statistic model: (1) Local computation, meaning summary-statistic learners fail to account for the influence of the other connections in the system, and (2) Local evidence, meaning summary-statistic learners fail to take into account whatever happened before their current observation window. For the first type of stimuli, we use scenarios where a learner needs to identify a generative target cause that is paired with a preventative cause. This presents a challenge for local computation because the preventative cause can block the generative cause's influence and mislead a local learner into believing the target connection is non-causal, because it is statistically associated with fewer events per window or longer delays than generative causes have on average across the task. The second type involves scenarios where a non-causal target is paired with a generative neighboring component. For a local learner who only focuses on a small time window after each intervention, the generative influences can easily spill over into the observation window during which the learner is focused on the target non-causal component, leading to statistics more typical of generative causation, because it is associated with more events and shorter delays than non-causal components exhibit on average across the task. Experiment 2 focused on the regular base rate condition, since this yields the larger predicted difference between normative and summary-statistic based judgments, though we also checked that the dominant answer for each model was the same under the irregular base rate parameters.

Participants
Sixty participants from Prolific were recruited and reported (32 female, 28 male, aged 41 ± 12). The sample size was determined by a power analysis assuming a medium sized effect (d = 0.5) in comparing within-subject judgments on the target cause and the goal of .90 power at the standard .05 alpha. No participants were excluded from this experiment based on the criteria we pre-registered.

Design & procedure
Participants' task was very similar to the regular condition in Experiment 1b, where they needed to judge the roles of two connections given a 20-second clip of evidence. No video training or feedback was provided. The hand-crafted stimuli are shown in Fig. 9. For each stimulus, we call one component the ''target'', and the other the ''lure'', which could affect participants' judgments about the target. Each clip contained two segments of evidence where the two components activated close together, so their influences on the system (if any) were misleading to the summary statistic model (gray shadows in Fig. 9), but also contained evidence where the components occurred far enough apart to make the true structure recoverable by the normative model.
We constructed four exemplars of each of the two stimulus types (Fig. 9). For the first type (a preventative lure and a generative target), the lure often cancels the influence of the target, and hence the summary statistics of the target are more aligned with the non-causal summary statistics. For the second type (a generative lure and a non-causal target), the lure's influence spills over into the observation window of the target, leading to summary statistics more consistent with a generative target component. Therefore, the summary-statistic approach predicts systematic errors in these cases that are not predicted by the normative model (Fig. 9).
Participants went through 6 practice trials sampled from Experiment 1 (spanning six different structures) before the 8 test trials, to ensure that they had some experience with different structures and edge types under more normal conditions. The vertical positions of the two control components (above or below) were randomized across trials. The order of trials was randomized within the practice and testing phases. Participants completed 14 trials in sequence without any delineation between the practice and critical trials. The bonuses were, in reality, administered proportional to the bonuses participants gained in the practice phase (given that we predicted participants would make systematic errors in the test phase). As shown in Fig. 9, the normative and summary-statistic models particularly differ in their judgments about the target components, with opaque bars used to highlight where the modal response shifts between the normative and summary-statistic models.

Results & discussion
For the first stimulus type (preventative lure, generative target), participants judged the targets as non-causal 1.8 ± 1.1 times on average out of 4 trials (above the 33% chance level, t(59) = 3.15, p = .003, d = 0.41, 95% CI of d = [0.14, 0.67]). More importantly, participants judged them more often as non-causal than generative. Model-fitting results are reported in Table 1. Similar to Experiment 1, participants' answers were better fit by the summary-statistic models than the normative model. In general, they were also better aligned with the intervention-window segmentation than the fixed-window segmentation. This is also supported by a qualitative result: for these stimuli, both the intervention-window model (Fig. 9) and participants (Fig. 10) regarded the lure as less likely to be a generative cause than the target component (t(59) = 5.56, p < .001, d = 1.04, 95% CI of d = [0.65, 1.41]), while the fixed-window model regarded the probabilities as more even (Fig. 9). At the individual level, participants split more evenly across the intervention-window and fixed-window models than in Experiment 1, which may imply that some participants do consider longer windows in situations where interventions were heavily interleaved and the intervention-based windows were therefore sometimes too short to rely on.

General discussion
This paper examined how people infer causal structure on the basis of observing events in continuous time. The project was motivated by the fact that classical causal structure induction research has largely focused on inferences from atemporal statistical information, essentially sidestepping the role of event timing and delay, or else reducing it to a simple sequence of equally spaced measurements. Meanwhile, empirical research (not to mention common sense) suggests people rely strongly on event timing for causal reasoning, using temporal information to guide causal attributions even when it is inappropriate to do so. It seems likely, therefore, that time is integral to our representation of causality and hence deserves careful formal and empirical treatment.
While the space of causal structures we explored was relatively restricted, our task was challenging due to the spontaneous activations of the effect component and potential interactions between generative and preventative cause components. There were always multiple competing explanations for any effect occurrence or surprising non-occurrence, and as such, normative reasoning about the structure behind the evidence required entertaining and marginalizing over many hypothetical mappings between events. Nevertheless, participants were able to correctly identify the majority of causal components well above chance even when base rate activations of the effect were unpredictable (Experiments 1a and 1b) and even without pretraining about the true causal delays (Experiment 1b). Our experiments thus provide an initial empirical demonstration that people can use real-time temporal information to disentangle the influences of generative and preventative causes and identify causal structures involving combinations thereof.

Empirical findings
By including both preventative and generative relationships in our task, we obtained empirical results showing how the identification of these two types of relationships differs in a continuous-time setting.
First, base rate regularity has a larger impact on identifying preventative relationships than generative relationships. Participants can better identify preventative connections when the effect otherwise activates regularly. This is aligned with the principle that detecting preventative causation relies heavily on one's expectation of what would otherwise have happened in the causal system (Buehner et al., 2003;Cheng, 1997;Griffiths & Tenenbaum, 2005).
Second, when judging a causal connection in the system, the type of neighboring connections matters. Experiment 1 showed that when the base rate is irregular, participants could better identify a connection when it was paired with a generative neighbor rather than a non-causal or preventative neighbor. This can be explained by the fact that a generative connection can increase the predictability of the effect, which is helpful in general but particularly when the base rate is unpredictable. Experiment 2 showed that a preventative neighbor can cancel out a generative influence and mislead people into judging a generative connection as non-causal.
Third, the timing and sequence of interventions matter when making causal judgments, and they affect the identification of generative and preventative connections in different ways. Participants identified generative and non-causal relationships better when the interventions were clustered, rather than interleaved. This makes sense given that the evidence under clustered interventions involves less interference from neighboring connections. We confirmed this in Experiment 2, where we showed that deliberately interleaved evidence leads participants to systematically mistake the roles of generative and non-causal connections. In contrast, the advantage of clustered interventions disappeared when it came to prevention. To identify preventative relationships, it makes sense to spread out interventions so their influence covers more of the timeline, and in particular to perform them ahead of whenever one has a strong expectation of the effect occurring (Melchers et al., 2006). To our knowledge, these findings represent the first systematic investigation of how human causal judgments engage with a setting where generative and preventative causal influences intertwine and interact in time.

Normative vs. summary-statistics
To better understand how participants made their judgments, we contrasted two learning models: An exhaustive normative account and a summary-statistic-based local approximation. Both accounts were able to identify generative and preventative influences well in our task, but only the summary statistic account could capture cases in which participants were worse at identifying the non-causal connections (Experiment 1) and misled by interleaved interventions (Experiments 1 and 2). Quantitatively, the summary-statistic account also fits participants' judgments across both experiments better.
Our normative model demonstrates that near-perfect inversion of the generative causal model is possible for a learner with exactly the correct delay assumptions and unlimited processing power. It works via reasoning at the token level of actual attribution (Halpern, 2016), suggesting this kind of reasoning is key for achieving benchmark performance in this small data setting. The summary-statistics account takes a different approach that is computationally much more frugal and scalable to more complex causal models, but at the cost of being less sensitive to precise event timing and more susceptible to interference between components. The approach combines several core principles of bounded cognitive processing: Use of simulation from generative mental models and comparison via summary statistics in place of an exact or intractable likelihood calculation (Battaglia et al., 2013;Blum et al., 2013;Lintusaari et al., 2017;Sunnåker et al., 2013;Ullman et al., 2018). It combines this with local (Bramley, Dayan et al., 2017;Fernbach & Sloman, 2009) and incremental (Bramley, Dayan et al., 2017;Rehder et al., 2022) processing to break up the global inference problem into a series of spatially and temporally local subproblems. The departures from the ideal of the global normative thinker allow it to explain several error patterns exhibited by participants. In general, the normative model serves to showcase the rapidly compounding challenge of maintaining a global perspective when processing evidence that includes multiple causal influences that intertwine and interact in real time (Bramley, Mayrhofer, Gerstenberg and Lagnado, 2017;Gong et al., 2023).
Imagined experiences are a core feature of our conscious experience and as such, mental simulation has been implicated by a number of theories of cognition as playing key roles in both model-based inference and planning (Battaglia et al., 2013;Bramley, Gerstenberg, Tenenbaum and Gureckis, 2018;Gerstenberg et al., 2021;Hamrick et al., 2016;Ludwin-Peery, Bramley, Davis, & Gureckis, 2020;Ullman et al., 2018). Mental simulation is thought to be key to offline learning (Hinton, Dayan, Frey, & Neal, 1995), and simulation phases are now a common part of the training regimen for large reinforcement learning models (Ellis et al., 2020;Mnih et al., 2015). Our experiments add one small piece to this line of research, showing how an inference mechanism grounded in simulation and the extraction of summary statistics may explain how people mitigate the computational costs involved in reverse engineering the causal mechanisms that explain the events we observe in real time.
The idea of combining sampling from a generative model with summary statistics stems from Approximate Bayesian Computation (Blum et al., 2013;Lintusaari et al., 2017;Sunnåker et al., 2013). The approach makes it possible to approximate an intractable Bayesian inference by using the similarity between data simulated from a hypothesized model or parameter setting and observed data as a proxy for the likelihood of that model or parameter setting. Choosing the best summary statistics or loss function for a domain is a research area in itself in machine learning (Csilléry, Blum, Gaggiotti, & François, 2010), while identifying what summary statistics might be used in cognition is another challenging and unsolved problem. We do not solve this problem here, but simply hand selected two basic summary statistics (cf. Ullman et al., 2018) on the grounds that they reflect the most basic and easily reported timing measurements people can make in online settings. We showed that the delay and count cues were reasonably diagnostic in our task (Experiment 1) but also unpacked the circumstances under which they can be misleading (Experiment 2).
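As a concrete illustration of this scheme, the following sketch approximates per-connection likelihoods by simulating effect delays under each hypothesized connection type and scoring how often the simulations land near the observed delay. The specific delay distributions, parameters, and tolerance here are illustrative stand-ins, not the values used in our experiments.

```python
import random

def simulate_effect_delays(structure, n_sims, rng):
    """Simulate delays from intervention to nearest effect under a
    hypothesized connection type. Distributions are illustrative
    stand-ins, not the paper's fitted values."""
    if structure == "generative":
        # cause produces the effect quickly (gamma delay, mean 2 s)
        return [rng.gammavariate(4.0, 0.5) for _ in range(n_sims)]
    elif structure == "preventative":
        # effect is suppressed for a while, so the nearest effect is late
        return [rng.uniform(4.0, 10.0) for _ in range(n_sims)]
    else:
        # non-causal: nearest effect arrives at the base rate only
        return [rng.expovariate(0.25) for _ in range(n_sims)]

def abc_scores(observed_delay, n_sims=5000, tolerance=0.5, seed=0):
    """ABC-style likelihood proxy: the fraction of simulated delays
    falling within `tolerance` of the observed delay."""
    rng = random.Random(seed)
    scores = {}
    for structure in ("generative", "non-causal", "preventative"):
        sims = simulate_effect_delays(structure, n_sims, rng)
        scores[structure] = sum(abs(d - observed_delay) <= tolerance
                                for d in sims) / n_sims
    return scores

scores = abc_scores(observed_delay=2.0)
```

Under these toy settings, a 2 s delay is most consistent with a generative connection and inconsistent with a preventative one, mirroring how match-to-simulation stands in for an exact likelihood.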
Within the summary-statistic framework, we considered two ways participants might segment the trials into counting windows. We proposed they might either track events within fixed-length windows after each intervention or use the gaps between interventions directly as count windows. The inter-intervention segmentation variant captured participants' behavior better despite the fact that the windows were of markedly different lengths, detracting from the reliability of the metric. A potential explanation for this is that people may be fundamentally unable to track events from multiple causal perspectives in parallel, thus being forced to rely on the uneven inter-event windows (Bramley, Gerstenberg, Mayrhofer et al., 2018). Of course, in an active learning context, the learner is free to perform interventions at their own pace. This research suggests that what learners are able to attend to and measure is likely to shape their approach to interventions in time. For instance, one way to make inter-intervention count statistics as powerful as possible is to intervene on a regular schedule, eliminating the confound of episode length, while leaving gaps between interventions that are as large as possible additionally minimizes spillover effects. Interestingly, these are cognitive rather than normative considerations, since the ideal observer is practically indifferent to the regularity of the intervention spacing.
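The two segmentation schemes can be sketched as follows. The 4 s fixed window matches the model description in Appendix D; the intervention and effect times are invented for illustration.

```python
def fixed_windows(intervention_times, clip_end, window=4.0):
    """Fixed-length counting window after each intervention,
    truncated at the end of the clip."""
    return [(t, min(t + window, clip_end)) for t in intervention_times]

def intervention_windows(intervention_times, clip_end):
    """Counting window runs from each intervention to the next one
    (or to the end of the clip for the last intervention)."""
    ends = list(intervention_times[1:]) + [clip_end]
    return [(t, e) for t, e in zip(intervention_times, ends)]

def count_in_window(event_times, window):
    """Count cue: number of effect events falling inside a window."""
    start, end = window
    return sum(start <= e < end for e in event_times)

# Illustrative timeline: interleaved interventions and effect events.
ivs = [1.0, 3.0, 9.0]
effects = [2.0, 4.5, 10.5]
fw = fixed_windows(ivs, clip_end=12.0)
iw = intervention_windows(ivs, clip_end=12.0)
```

Note how the first fixed window (1 s to 5 s) spans the second intervention and so counts its effect too, whereas the first inter-intervention window (1 s to 3 s) does not: this is the spillover that interleaved interventions induce in the fixed-window scheme.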

Alternative accounts
One popular recent idea in the causal cognition literature is that people form and adjust causal theories locally and incrementally (Bramley, Dayan et al., 2017;Bramley, Mayrhofer et al., 2017;Fernbach & Sloman, 2009;Markant et al., 2016). For instance, Bramley, Dayan et al. (2017) model causal structure learning (in discrete trial contexts) as a process of incremental adaptation of a single global hypothesis driven by the need to accommodate new evidence as it arrives. They argue that causal learners do focus locally when grappling with complex structures, but that many are able to condition on their current beliefs about neighboring connections rather than ignoring them altogether, leading to patterns of sequential local focus and anchoring that still tend to favor the correct global structure in the limit. We did not collect the interim judgments we would need to probe this account directly, but we think it is entirely plausible that people focused on the roles of the components not just separately but also serially, perhaps flipping their attention back and forth several times throughout a trial. For example, if participants focused on a generative component first and a preventative component second, they might have been able to take advantage of their expectation of events produced by the apparently generative component to supercharge their inferences about prevention.
The other idea is based on the ''smart initialization and short search'' algorithm in Ullman et al. (2018). Analogous to our findings, they showed that although human physical learning was better captured by a summary-statistic account than a noisily normative Bayesian model, responses could be even better fit by a mechanism that combines the two. Their best-fit model used the prediction of a summary-statistic approach as a starting hypothesis, and then made local adjustments to this by running a short Markov Chain Monte Carlo search chain. Such a smart initialization could play an important role here too. It is plausible that some participants may have performed similar steps, i.e. forming an impression of the role of a component due to the delays and counts but adjusting this when accommodating a belief about the neighboring connection or an understanding of the regularity of the base rate.

Future directions
To date, causal learning in continuous time has received little attention, meaning there are numerous basic research questions still to be addressed. In the current paper, we focus on just one of these, providing a close examination of the interplay between inference about generative and preventative causal relationships. However, for this we make specific assumptions about the scope with which preventative influences work. Concretely, we conceive of preventative influences as eliminating all expected effects for a short time no matter their cause. However, there are several alternatives that seem at least as salient and may be more appropriate depending on knowledge of the context and mechanisms involved. For example, prevention could work by blocking only the next event (or perhaps the next n events) rather than blocking everything for a fixed window. Prevention could also operate on ''links'' rather than ''nodes'' within the causal graph, for example blocking the action of a generative cause on an effect, but leaving the spontaneous activations of that effect intact, or vice versa (Carroll & Cheng, 2009;Chow, Lee, & Lovibond, 2023;Fraser & Holland, 2019).
In the current learning task, causal influences were represented as operating between point events. This is a major simplification of many real scenarios, in which variables involved in causal interactions are often able to take multiple, or even a continuum of, values. The cat in our motivating example might drink more or less water or hold different teaser toys in higher or lower regard, leading to faster, slower, more or less intense effects. Even though events are abstractions of continuous inputs, and many, such as state changes, are readily thought of as punctate, many everyday event concepts clearly have non-zero duration and often have internal structure such as a gradual or sudden onset or offset. For example, given enough time, many of the states referred to in causal learning scenarios are not permanent. ''Wet ground'' dries. ''Tanned skin'' fades. Many disequilibria will either dissipate or recover without external intervention. Other states, such as a light bulb being turned on, may tend to persist until canceled, e.g. by flipping the switch a second time. These could be seen as events with an infinitely long duration (i.e. permanent state changes). As event duration shrinks, it becomes less likely that events will overshadow one another. Point events are a limiting-case abstraction of this where the duration is reduced to zero, resulting in a setting where there is no true causal overshadowing (Paul & Hall, 2013). That is, a generative cause will always produce an observable effect even if it occurs close to another event. However, in settings with longer events it becomes increasingly important to consider such supersession situations, and perhaps to apply the noisy-or and noisy-and-not frameworks (Cheng, 1997;Griffiths & Tenenbaum, 2009) that capture how, in contingency settings, effects can easily be hidden by an already-occurring or already-prevented target.
Future research could study how people represent the duration of causal events as well as their influences and thus begin to form a richer theory of causal concepts in time that captures a wider range of relata, variables, influences, and events.
Finally, we focused on online causal learning here, where information flowed in rapidly and learners had no opportunity to replay and revise. However, it is possible that people are capable of reasoning more normatively in offline learning tasks when they are provided with information summarized in a timeline and can take as long as they like to consider the fit between the data and different causal hypotheses (Bramley, Gerstenberg, Mayrhofer et al., 2018). Furthermore, to the extent that summary-statistic based inference and normative inference deviate, it seems likely that people's judgments after additional thinking time could differ from their more instinctive or gut responses (Ludwin-Peery et al., 2020). Reflective thinking has been studied for decades in human reasoning and decision making (Kahneman, 2011;Sloman, 1996), while it is less studied in causal inference. The normative vs. summary-statistic contrast in this paper provides a potential paradigm for operationalizing the role of reflective thinking in causal inference.

Conclusions
In this paper, we showed that people can use information in continuous real time to learn about causal systems that potentially contain generative and preventative causal relationships. Their performance was influenced by multiple factors, including the nature of the causal influences (generative, non-causal, preventative), interactions with neighboring connections, base rate regularity, and intervention patterns. We laid out both a normative framework and a process-level model. Both qualitatively and quantitatively, human judgments were better captured by the process-level summary-statistic account, capturing the idea that people may infer causal structure via statistical cues, such as average delays and counts, that are much easier to track in real time than the exact generative model likelihoods. This work thus provides a quantitative account of how people manage to learn causal structure, in particular preventative influences, on the basis of continuous temporal dynamics. This contributes to our understanding of natural cognition and sheds light on the challenging question of how any cognitive agent can succeed in forming an internal causal model of a complex and continuous environment.

C.2. Likelihood calculation
We assume each connection is estimated independently as either generative, non-causal, or preventative, and these estimates are then combined to yield an overall probability for each candidate causal structure. For example, an intervention on a cause component with the nearest effect occurring 2.5 s later has a likelihood of [.2, .7, .1] of having been produced by a generative, non-causal, or preventative connection respectively under the regular base rate, and [.3, .6, .2] under the irregular base rate. When the next intervention on that component happens, the posterior is updated by taking the product of this new likelihood vector with the preceding ones.
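The incremental per-connection update described above can be sketched as follows, reusing the example likelihood vector from the text. The assumption that a second intervention yields the same likelihoods is purely for illustration.

```python
def update_posterior(prior, likelihood):
    """Element-wise product of prior and likelihood, renormalized.
    Entries are ordered [generative, non-causal, preventative]."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Uniform prior over the three connection types, then two interventions
# observed under the regular base rate (example likelihoods from the text;
# assuming, for illustration, both interventions yield the same evidence).
posterior = [1 / 3] * 3
posterior = update_posterior(posterior, [0.2, 0.7, 0.1])  # first intervention
posterior = update_posterior(posterior, [0.2, 0.7, 0.1])  # second intervention
```

After two such observations the non-causal hypothesis dominates, since the products compound: [.04, .49, .01] before normalization.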

C.3. Boundary situations
We consider boundary situations when observing evidence as follows: If no effect occurs within the observation window, then under both segmentation approaches the delay cue is marked as greater than the observation window and its probability is estimated from the cumulative distribution mass falling beyond it. If the observation window is shorter than the fixed window length in the fixed-window approach (which often happens near the end of the clip), or there is no next intervention in the intervention-window approach, the count cue is marked as greater than or equal to the observed count of effects and its probability is likewise estimated from the corresponding cumulative mass.
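A minimal sketch of the censoring logic for the delay cue, assuming (for illustration only) an exponential delay distribution rather than the distributions used in the experiments:

```python
import math

def delay_cue_probability(observed_delay, window_end, rate):
    """Score the delay cue under an assumed exponential delay
    distribution (illustrative stand-in). If no effect occurred inside
    the observation window, the delay is right-censored: we use the
    survival mass beyond the window instead of a density value."""
    if observed_delay is None:
        # no effect within the window: P(delay > window_end)
        return math.exp(-rate * window_end)
    # otherwise score the exact delay with the density
    return rate * math.exp(-rate * observed_delay)
```

The same censoring idea applies to the count cue, where a truncated window yields a "greater than or equal" observation scored via the cumulative mass of the count distribution.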

Appendix D. Model fitting procedure
We considered four models in total:
1. Fully normative inference based on marginalizing over all possible causal pathways.
2. Summary-statistic (SS) based inference, using a fixed 4 s window to count events following each intervention.
3. Summary-statistic based inference, using the interval until the next intervention to count events.
4. A parameter-free baseline that predicts all structure judgments to be selected with equal probability.
As in our comparison to simulations, we simply assume the delay and count cues are equally weighted and merged. We assume learners begin each problem with a uniform prior over causal structures. We feel this is a reasonable choice here since the relatively small hypothesis space, a balanced set of trials, and the abstract setting leave little for inductive biases to attach to. Nevertheless, we accept that we cannot rule out the possibility that some of the findings we attribute to evidence processing enter through prior preferences. To map models' posterior probabilities to judgments, we assumed participants' responses result from a softmax over the posterior probability vector p, with the probability of choosing structure i proportional to exp(p_i / τ). The ''temperature'' parameter τ ∈ (0, +∞) controls how reliably the participant selects the most probable answer (i.e. that with the largest p_i). Smaller τ connotes higher choice reliability, with τ → 0 corresponding to hard maximization and τ → ∞ approaching random responding.
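The softmax response mapping can be sketched as:

```python
import math

def softmax_choice_probs(posterior, tau):
    """Map a posterior probability vector to response probabilities via
    a softmax with temperature `tau`: small tau -> near-maximizing,
    large tau -> near-uniform responding."""
    logits = [p / tau for p in posterior]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, with posterior [0.7, 0.2, 0.1], a low temperature (τ = 0.1) concentrates nearly all response probability on the modal structure, while a high temperature (τ = 100) yields near-uniform responding.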
We evaluate model fit using cross-validation. At the aggregate level, we fit parameters to the judgments from N − 1 subsets of the complete dataset, and evaluate model performance in terms of its log-likelihood of predicting the left-out subset.
N was defined via the stimulus seeds in each experiment (i.e. N = 18 in Experiment 1a and N = 12 in Experiment 1b, including stimuli with and without a ground truth). This provides a rigorous and generalizable test of the models, since the actual sampled values of the stimuli (e.g. intervention timing, base rate activation timing, etc.) are always outside of the training sample for all test sets. At the individual level, we similarly applied hold-one-stimulus-out as our cross-validation scheme for all experiments. For familiarity and comparability with other model-based analyses of causal learning data, we also report Bayesian Information Criterion (BIC) penalized fits to the full dataset.
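The hold-one-stimulus-out procedure can be sketched as follows. The toy data, the grid search over τ, and the structure of each datum (a model posterior paired with a chosen-structure index) are illustrative simplifications of the actual fitting pipeline.

```python
import math

def softmax(posterior, tau):
    """Softmax with temperature tau (numerically stabilized)."""
    m = max(p / tau for p in posterior)
    exps = [math.exp(p / tau - m) for p in posterior]
    s = sum(exps)
    return [e / s for e in exps]

def log_likelihood(data, tau):
    """Summed log-probability of the chosen structures; each datum pairs
    a model posterior with the index of the participant's choice."""
    return sum(math.log(softmax(post, tau)[choice]) for post, choice in data)

def hold_one_out_cv(data_by_stimulus, tau_grid):
    """Fit tau on N-1 stimulus seeds (here via a simple grid search),
    then score the held-out seed; return the summed held-out score."""
    total = 0.0
    for held in range(len(data_by_stimulus)):
        train = [d for i, s in enumerate(data_by_stimulus)
                 if i != held for d in s]
        best_tau = max(tau_grid, key=lambda t: log_likelihood(train, t))
        total += log_likelihood(data_by_stimulus[held], best_tau)
    return total

# Toy data: three "stimulus seeds", each with (posterior, choice) pairs.
toy = [
    [([0.7, 0.2, 0.1], 0), ([0.6, 0.3, 0.1], 0)],
    [([0.2, 0.7, 0.1], 1), ([0.1, 0.8, 0.1], 1)],
    [([0.1, 0.2, 0.7], 2), ([0.2, 0.2, 0.6], 2)],
]
score = hold_one_out_cv(toy, tau_grid=[0.05, 0.1, 0.5, 1.0])
```

Because the training stimuli never share sampled event timings with the held-out stimulus, the held-out log-likelihood tests generalization rather than memorization of particular clips.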