Discovering Agents

Causal models of agents have been used to analyse the safety aspects of machine learning systems. But identifying agents is non-trivial -- often the causal model is just assumed by the modeler without much justification -- and modelling failures can lead to mistakes in the safety analysis. This paper proposes the first formal causal definition of agents -- roughly that agents are systems that would adapt their policy if their actions influenced the world in a different way. From this we derive the first causal discovery algorithm for discovering agents from empirical data, and give algorithms for translating between causal models and game-theoretic influence diagrams. We demonstrate our approach by resolving some previous confusions caused by incorrect causal modelling of agents.


Introduction
How can we recognise agents? In economics textbooks, certain entities are clearly delineated as choosing actions to maximise utility. In the real world, however, distinctions often blur. Humans may be almost perfectly agentic in some contexts, while manipulable like tools in others. Similarly, in advanced reinforcement learning (RL) architectures, systems can be composed of multiple non-agentic components, such as actors and learners, and trained in multiple distinct phases with different goals, from which an overall goal-directed agentic intelligence emerges.
It is important that we have tools to discover goal-directed agents. Artificially intelligent agents that competently pursue their goals might be dangerous depending on the nature of this pursuit, because goal-directed behaviour can become pathological outside of the regimes the designers anticipated (Bostrom, 2014;Yudkowsky et al., 2008), and because they may pursue convergent instrumental goals, such as resource acquisition and self-preservation (Omohundro, 2008). Such safety concerns motivate us to develop a formal theory of goal-directed agents, to facilitate our understanding of their properties, and avoid designs that pose a safety risk.
The central feature of agency for our purposes is that agents are systems whose outputs are moved by reasons (Dennett, 1987). In other words, the reason that an agent chooses a particular action is that it "expects it" to precipitate a certain outcome which the agent finds desirable. For example, a firm may set the price of its product to maximise profit. This feature distinguishes agents from other systems, whose output might accidentally be optimal for producing a certain outcome. For example, a rock that is the perfect size to block a pipe is accidentally optimal for reducing water flow through the pipe.
Systems whose actions are moved by reasons are systems that would act differently if they "knew" that the world worked differently. For example, the firm would be likely to adapt to set the price differently, if consumers were differently price sensitive (and the firm was made aware of this change to the world). In contrast, the rock would not adapt if the pipe was wider, and for this reason we don't consider the rock to be an agent.
Behavioural sensitivity to environment changes can be modelled formally with the language of causality and structural causal models (SCMs) (Pearl, 2009). To this end, our first contribution is to introduce mechanised SCMs (Sections 3.1 and 3.2), a variant of mechanised causal games (Hammond et al., forthcoming), and give an algorithm which produces its graph given the set of interventional distributions (Section 3.3). Building on this, our second contribution is an algorithm for determining which variables represent agent decisions and which represent the objectives those decisions optimise (i.e., the reasons that move the agent), see Section 3.4. This lets us convert a mechanised SCM into a (structural) causal game (Hammond et al., forthcoming) 1 . Combined, this means that under suitable assumptions, we can infer a game graph from a set of experiments, and in this sense discover agents.
Our third contribution is more philosophical, giving a novel formal definition of agents based on our method, see Section 1.2.
These contributions are important for several reasons. First, they ground game graph representations of agents in causal experiments. These experiments can be applied to real systems, or used in thought-experiments to determine the correct game graph and resolve confusions (see Section 4). With the correct game graph obtained, the researcher can then use it to understand the agent's incentives and safety properties (Everitt et al., 2021a;Halpern and Kleiman-Weiner, 2018), with an extra layer of assurance that a modelling mistake has not been made. Our algorithms also open a path to automatic inference of game graphs, especially in situations where experimentation is cheap, such as in software simulations.

Example
To illustrate our method in slightly more detail, consider the following minimal example, consisting of a gridworld with three squares, and with a mouse starting in the middle square (Fig. 1a). The mouse can go either left or right, represented by the binary variable D. There is some ice which may cause the mouse to slip: the mouse's position, X, follows its choice, D, with probability p = 0.75, and slips in the opposite direction with probability 1 − p. Cheese is in the right square with probability b = 0.9, and in the left square with probability 1 − b. The mouse gets a utility, U, of 1 for getting the cheese, and zero otherwise. The directed edges D → X and X → U represent direct causal influence.
The decision problem can be represented with the game graph in Fig. 1b: the agent makes a decision, D, which affects its position, X, which affects its utility, U. The intuition that the mouse would choose a different behaviour for other settings of the parameters p and b can be captured by a mechanised causal graph (Fig. 1c). This graph contains additional mechanism nodes, D̃, X̃, Ũ, in black, representing the mouse's decision rule and the parameters p and b. As usual, edges represent direct causal influence, and the mechanism edge Ũ → D̃ shows that if we intervene to change the cheese location, say from b = 0.9 to b = 0.1, and the mouse is aware [2] of this, then the mouse's decision rule changes (since it's now more likely to find the cheese in the leftmost spot). Experiments that change X̃ and Ũ in a way that the mouse is aware of generate interventional data that can be used to infer both the mechanised causal graph (Fig. 1c) and from there the game graph (Fig. 1b). The edge labels (colours) in Fig. 1c will be explained in Section 3.2.
[1] We can also reverse this, converting a causal game into a mechanised SCM (Section 3.5).
[2] The mouse could become aware of this through learning from repeated trials under soft interventions of X̃ and Ũ which occur on every iteration; see Section 3.1 for further discussion.
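The mouse's adaptation can be made concrete with a small numerical sketch. The helper names below (expected_utility, best_response) are our own illustration of the setup above, not notation from the paper; the point is only that the optimal action flips when the cheese mechanism is changed and the mouse is aware of it.

```python
# Sketch of the mouse's decision problem (Fig. 1a): action d in {left, right},
# slip probability p, cheese on the right with probability b, utility 1 for
# reaching the cheese. Function names here are illustrative.

def expected_utility(action: str, p: float, b: float) -> float:
    """Expected utility of an action, given slip probability p and cheese probability b."""
    # Probability of ending up in the right square.
    right = p if action == "right" else 1 - p
    # Cheese is on the right with probability b, on the left with 1 - b.
    return right * b + (1 - right) * (1 - b)

def best_response(p: float, b: float) -> str:
    """The mouse's adaptation: pick the utility-maximising action."""
    return max(["left", "right"], key=lambda a: expected_utility(a, p, b))

# With the stated parameters the mouse goes right; after a mechanism
# intervention moving the cheese (b = 0.9 -> 0.1) it adapts and goes left.
print(best_response(p=0.75, b=0.9))  # right
print(best_response(p=0.75, b=0.1))  # left
```

This dependence of the decision rule on b is exactly the edge Ũ → D̃ in Fig. 1c.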

Other Characterisations of Agents
To put our contribution in context, we here describe previous characterisations of agents:
• The intentional stance: an agent's behaviour can be usefully understood as trying to optimise an objective (Dennett, 1987).
• Cybernetics: an agent's behaviour adapts to achieve an objective (e.g. Ashby, 1956; Wiener, 1961).
• Decision theory / game theory / economics / AI: an agent selects a policy to optimise an objective.
• An agent is a system whose behaviour can be compressed with respect to an objective function (Orseau et al., 2018).
• "An optimising system is ... a part of the universe [that] moves predictably towards a small set of target configurations" (Flint, 2020).
• A goal-directed system has self-awareness, planning, consequentialism, scale, coherence, and flexibility (Ngo, 2020).
• Agents are ways for the future to influence the past (via the agent's model of the future) (Garrabrant, 2021; von Foerster et al., 1951).
Our proposal can be characterised as: agents are systems that would adapt their policy if their actions influenced the world in a different way. This may be read as an alternative to, or an elaboration of, the intentional stance and cybernetics definitions (depending on how you interpret them) couched in the language of causality and counterfactuals. Our definition is fully consistent with the decision theoretic view, as agents choose their behaviour differently depending on its expected consequences, but doesn't require us to know who is a decision maker in advance, nor what they are optimising.
The formal definition by Orseau et al. can be viewed as an alternative interpretation of the intentional stance: the behaviour of systems that choose their actions to optimise an objective function should be highly compressible with respect to that objective function. However, Orseau et al.'s definition suffers from two problems: First, in simple settings, where there is only a small and finite number of possible behaviours (e.g. the agent decides a single binary variable), it will not be possible to compress any policy beyond its already very short description. Second, the compression-based approach only considers what the system actually does. It may therefore incorrectly classify as agents systems with accidentally optimal input-output mappings, such as the water-blocking rock above. Our proposal avoids these issues, as even a simple policy may adapt, but the rock will not.
The insightful proposal by Flint leaves open the question of what part of an optimising system is the agent, and what part is its environment. He proposes the additional property of redirectability, but it's not immediately clear how it could be used to identify decision nodes in a causal graph (intervening on almost any node will change the outcome-distribution).
The goal-directed systems that Ngo has in mind are agentic in a much stronger sense than we are necessarily asking for here, and each of the properties contain room for interpretation. However, our definition is important for goal-directedness, as it distinguishes incidental influence that a decision might have on some variable, from more directed influence: only a system that counterfactually adapts can be said to be trying to influence the variable in a systematic way. Counterfactual adaptation can therefore be used as a test for goal-directed influence.
Our definition also matches closely the backwards causality definition of agency by Garrabrant (2021), as can be seen by the time-opposing direction of the edges → and → in Fig. 1c. It also fits nicely with formalisations of agent incentives (Everitt et al., 2021a;Halpern and Kleiman-Weiner, 2018), which effectively rely on behaviour in counterfactual scenarios of the form that we consider here. This is useful, as a key motivation for our work is to analyse the intent and incentives of artificial agents.

What do we consider an agent?
Before digging into the mathematical details of our proposal, let us make some brief remarks about what it considers an agent and not. From a pre-theoretic viewpoint, humans might be the most prototypical example of an agent. Our method reliably classifies humans as agents, because humans would usually adapt their behaviour if suitably informed about changes to the consequences of their actions. It's also easy to communicate the change in action-consequences to a human, e.g. using natural language. But what about border-line cases like thermostats or RL agents? Here, the answer of our definition depends on whether one considers the creation process of a system when looking for adaptation of the policy. Consider, for example, changing the mechanism for how a heater operates, so that it cools rather than heats a room. An existing thermostat will not adapt to this change, and is therefore not an agent by our account. However, if the designers were aware of the change to the heater, then they would likely have designed the thermostat differently. This adaptation means that the thermostat with its creation process is an agent under our definition. Similarly, most RL agents would only pursue a different policy if retrained in a different environment. Thus we consider the system of the RL training process to be an agent, but the learnt RL policy itself, in general, won't be an agent according to our definition (as after training, it won't adapt to a change in the way its actions influence the world, as the policy is frozen).
For the purpose of detecting goal-directed behaviour, the relevant notion of agency often includes the creation process. Being forced to consider the creation process of the system, rather than just the system itself, may seem inconvenient. However, we consider it an important insight that simple forms of agents often derive much of their goal-directedness from the process that creates them.

Outline
Our paper proceeds as follows: we give relevant technical background in Section 2; give our main contribution, algorithms for discovering agents, in Section 3; show some example applications of this in Section 4 followed by a discussion in Section 5.

Background
Before we get to our algorithms for discovering agents, we cover some necessary technical background. The mathematical details can be found in Appendix A. Throughout, random variables are represented with roman capital letters (e.g. V), and their outcomes with lower case letters (e.g. v). We use bold type to indicate vectors of variables and vectors of outcomes. For simplicity, each variable V only has a finite number of possible outcomes, denoted dom(V). For a set of variables, dom(V) is the Cartesian product of the domains of its elements. In structural causal models (SCMs; Pearl, 2009), randomness comes from exogenous (unobserved) variables, E, whilst deterministic structural equations relate endogenous variables, V, to each other and to the exogenous ones, i.e., V = f^V(Pa^V, E^V) (Definition A.5). An SCM induces a causal graph over the endogenous variables, in which there is an edge W → V if f^V depends on the value of W (Definition A.6). The SCM is cyclic if its induced graph is, and acyclic otherwise. We never permit self-loops V → V. Parents, children, ancestors and descendants of V in the graph are denoted Pa^V, Ch^V, Anc^V, and Desc^V, respectively (none of which include V itself). The family is denoted by Fa^V = Pa^V ∪ {V}. Interventions on W ⊆ V, denoted do(W = w), can be realised as replacements of a subset of structural equations, so that each W = f^W(Pa^W, E^W) gets replaced with W = w (Definition A.7). The joint distribution P(V | do(W = w)) is called the interventional distribution associated with the intervention do(W = w). A soft intervention instead replaces f^W with some other (potentially non-constant) functions g^W.
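As a concrete illustration of these definitions, here is a minimal discrete SCM in which hard interventions do(W = w) are implemented, as above, by replacing structural equations. The class name, encoding, and example variables are our own illustrative sketch, not notation from the paper.

```python
# A minimal discrete SCM sketch: exogenous noise E^V, deterministic structural
# functions f^V, and hard interventions do(W = w) realised by replacing W's
# structural equation with the constant w.
import random

class StructuralCausalModel:
    def __init__(self, functions, exogenous):
        self.functions = functions    # {var: (parents, f(parent_values, e))}
        self.exogenous = exogenous    # {var: sampler for E^V}
        self.order = list(functions)  # assumed to be a topological order

    def sample(self, do=None, rng=None):
        do, rng = do or {}, rng or random
        values = {}
        for v in self.order:
            if v in do:               # the intervention replaces the equation
                values[v] = do[v]
            else:
                parents, f = self.functions[v]
                values[v] = f([values[p] for p in parents], self.exogenous[v](rng))
        return values

# Example: X = E_X, Y = X XOR E_Y with E_Y fixed to 0, so Y copies X.
scm = StructuralCausalModel(
    functions={"X": ([], lambda ps, e: e), "Y": (["X"], lambda ps, e: ps[0] ^ e)},
    exogenous={"X": lambda r: r.randint(0, 1), "Y": lambda r: 0},
)
print(scm.sample(do={"X": 1}))  # {'X': 1, 'Y': 1}
```

Intervening on both X and Y (do(X = 1, Y = 0)) would override Y's equation as well, which is exactly the equation-replacement semantics above.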
A (structural) causal game (Everitt et al., 2021a; Hammond et al., forthcoming; Koller and Milch, 2003) is similar to an SCM, but where the endogenous variables are partitioned into chance, decision, and utility variables (denoted X, D and U respectively), and for which the decision variables have no structural equations specified. Instead a decision-maker is free to choose a probability distribution over actions d, given the information revealed by the outcome of the parents of D (Definition A.9). The decision variables belonging to agent i are denoted D^i ⊆ D, and the agent's utility is taken to be the sum of the agent's utility variables, U^i ⊆ U. A collection of decision rules for all of a player's decisions is called a policy. Policies for all players together are called a policy profile.
A causal game is associated with a game graph with square, round and diamond nodes for decision, chance and utility variables, respectively, with colours associating decision and utility nodes with different agents (Definition A.10). Edges into chance and utility nodes mirror those of an SCM, while edges into decision nodes represent what information is available, i.e. W → D is present if the outcome of W is available when making the decision D, with information edges displayed with dotted lines. An example of a game graph is shown in Fig. 1b.
Given a causal game, each agent can set a decision rule, π_D, for each of their decisions, D, which maps the information available at that decision to an outcome of the decision. Formally, the decision rule is a deterministic function of Pa^D and E^D, where E^D provides randomness to enable stochastic decisions. This means that the decision rules can be combined with the causal game to form an SCM, which can be used to compute each agent's expected utility. In the single-agent case, the decision problem represented by the causal game is to select an optimal decision rule to maximise the expected utility (Definition A.11). With multiple agents, solution concepts such as Nash Equilibrium (Definition A.12) or Subgame-Perfect Nash Equilibrium (Definition A.13) are needed, because in order to optimise their decision rules agents must also consider how other agents will optimise theirs.
Similar to SCMs, interventions in a causal game can be realised as replacements of a subset of structural equations. However, in contrast to an SCM, an intervention can be made before or after decision rules are selected. This motivates a distinction between pre-policy and post-policy interventions (Hammond et al., forthcoming). Pre-policy interventions are made before the policies are selected, and agents may adapt their policies (according to some rationality principle) to account for the intervention. In other words, agents are made aware of the intervention before selecting their policies. For post-policy interventions, the intervention is applied after the agents select their policies. The agents cannot adapt their policies, even if their selected policies are no longer rational under the intervention. In other words, the intervention is applied without the awareness of the agents.
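The pre-/post-policy distinction can be illustrated on the mouse example of Section 1.1, under the assumption that the mouse plays a best response whenever it is allowed to (re-)select its policy. The helper functions below are an illustrative sketch of that setup.

```python
# Pre- vs post-policy interventions, sketched on the mouse example. A
# pre-policy intervention lets the mouse re-select its policy; a post-policy
# intervention leaves the previously selected policy frozen.

def expected_utility(action, p, b):
    right = p if action == "right" else 1 - p      # P(end up in right square)
    return right * b + (1 - right) * (1 - b)

def best_response(p, b):
    return max(["left", "right"], key=lambda a: expected_utility(a, p, b))

p, b = 0.75, 0.9
policy = best_response(p, b)                # policy selected under original mechanisms

# Intervention on the cheese mechanism: b = 0.9 -> 0.1.
pre_policy_action = best_response(p, 0.1)   # mouse is aware, so it re-optimises
post_policy_action = policy                 # mouse is unaware, keeps the old policy

print(pre_policy_action, post_policy_action)  # left right
```

Under the pre-policy reading the mouse adapts (goes left); under the post-policy reading it keeps going right, even though that is no longer optimal.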

Algorithms for Discovering Agents
Having discussed some background material, we now begin our main contribution: providing algorithms to discover agents from causal experiments.
This can provide guidance on whether a proposed game graph is an accurate description of a system of agents, and gives researchers tools for building game graphs using experimental data.
Figure 2 | Overview of our three theorems. Each provides relations between a game-theoretic object (a mechanised causal game, M, with its interventional distributions, I, and its associated game graph, G) and a causal object (a mechanised causal graph, C). Our proposed algorithms (Algorithm 1, Mechanised Causal Graph Discovery; Algorithm 2, Agency Identification; and Algorithm 3, Mechanism Identification) detail how to transform from one representation to another.
We propose three algorithms:
• Algorithm 1, Mechanised Causal Graph Discovery, produces an edge-labelled mechanised causal graph based on interventional data.
• Algorithm 2, Agency Identification, takes an edge-labelled mechanised causal graph and produces the corresponding game graph.
• Algorithm 3, Mechanism Identification, takes a game graph and draws the corresponding edge-labelled mechanised causal graph.
Theorems 1 to 3 establish their correctness, and Fig. 2 visualises their relationships.

Mechanised Structural Causal Model
In this subsection we introduce mechanised SCMs, which we will later use in a procedure for discovering agents from experimental data. A mechanised SCM is similar to an ordinary SCM, but includes a distinction between two types of variables: object-level and mechanism variables. The intended interpretation is that the mechanism variables parameterise how the object-level variables depend on their object-level parents. Mechanism variables have been called regime indicators (Correa and Bareinboim, 2020) and parameter variables (Dawid, 2002). Mechanised SCMs are variants of mechanised causal games (Hammond et al., forthcoming) that lack explicitly labelled decision and utility nodes. Figure 1c draws the induced graph of a mechanised SCM.
Definition 1 (Mechanised SCM). A mechanised SCM is an SCM in which there is a partition of the endogenous variables V = O ∪ M into object-level variables, O (white nodes), and mechanism variables, M (black nodes), with |O| = |M|. Each object-level variable V has exactly one mechanism parent, denoted Ṽ, that specifies the relationship between V and the object-level parents of V.
We refer to edges between object-level nodes as object-level edges, E_obj; edges between mechanism nodes as mechanism edges, E_mech; and edges between a mechanism node and the object-level node it controls as functional edges, E_func. We only consider mechanised SCMs in which the object-level-only subgraph is acyclic, but we allow cycles in the mechanism-only subgraph (we follow the formalism of Bongers et al. (2021) when using cyclic models).
By connecting mechanism variables with causal links, we violate the commonly taken independent causal mechanism assumption (Schölkopf et al., 2021), though we introduce a weaker form of it in Assumption 4 (see further discussion in Section 5.3).
Interventions in a mechanised SCM are defined in the same way as in a standard SCM, via replacement of structural equations. An intervention on an object-level variable V changes the value of V without changing its mechanism, Ṽ. This can be interpreted as the intervention occurring after all mechanism variables have been determined/sampled. In a causal model, it is necessary to assume that the procedure for measuring and setting (intervening on) a variable is specified. Mechanised SCMs thereby assume a well-specified procedure for measuring and setting both object-level and mechanism variables. Pre- and post-policy interventions in games correspond to mechanism and object-level interventions in mechanised SCMs, respectively (Hammond et al., forthcoming).
The distinction between mechanism and object-level variables can be made more concrete by considering repeated interactions. In Section 1.1, assume that the mouse is repeatedly placed in the gridworld, and can adapt its decision rule based (only) on previous episodes. A mechanism intervention would correspond to a (soft) intervention that takes place across all time steps, so that the mouse is able to adapt to it. Similarly, the outcome of a mechanism can then be measured by observing a large number of outcomes of the game, after any learning dynamics has converged. Finally, object-level interventions correspond to intervening on variables in one particular (post-convergence) episode. Assuming the mouse is only able to adapt its behaviour based on previous episodes, it will have no way to adapt to such interventions. Appendix B has a more detailed example of marginalising and merging nodes in a repeated game to derive the mechanised causal graph and game graph.
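This repeated-interaction reading can be sketched as a toy learning loop. In the sketch below, a hypothetical mouse estimates the cheese probability b from past episodes only (ignoring slipping, for simplicity) and best-responds to its running estimate; a soft intervention on b applied across all episodes then shows up in the converged decision rule. The setup is our own illustrative assumption, not the paper's learning model.

```python
# A mechanism outcome measured via repeated episodes: the mouse adapts its
# decision rule using only data from previous episodes, so an intervention
# applied on every episode is eventually reflected in its converged policy.
import random

def run_episodes(true_b, n, rng):
    cheese_right_count = 0
    action = "right"                      # arbitrary initial policy
    for t in range(1, n + 1):
        cheese_right = rng.random() < true_b
        cheese_right_count += cheese_right
        b_hat = cheese_right_count / t    # empirical estimate from past episodes
        action = "right" if b_hat >= 0.5 else "left"
    return action

rng = random.Random(0)
print(run_episodes(true_b=0.9, n=2000, rng=rng))  # converges to "right"
print(run_episodes(true_b=0.1, n=2000, rng=rng))  # after the intervention: "left"
```

An object-level intervention in a single post-convergence episode, by contrast, would leave the returned policy untouched, since the mouse only learns from previous episodes.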

Edge-labelled mechanised causal graphs
We now introduce an edge-labelling on mechanised SCMs, aiming to capture two aspects of mechanised SCMs that we think are inherent to agents:
1. whether a variable is inherently valuable to an agent (i.e. is a utility node), rather than just instrumentally valuable for something downstream;
2. whether a variable's distribution adaptively responds for a downstream reason (i.e. is a decision node), rather than for no downstream consequence (e.g. its distribution is set mechanistically by some natural process).
For the first, to determine whether a variable, V, is inherently valuable to an agent, we can test whether the agent still changes its policy in response to a change in the mechanism for V if the children of V stop responding to V. For the second, to determine whether a variable, V, adapts for a downstream reason, we can test whether V's mechanism still responds even when the children of V stop responding to V (i.e. V has no downstream effect).
We can stop the children of a variable responding to it by performing hard interventions on each child. If an agent is present, we want it to be aware of these interventions, so they should be implemented via mechanism interventions -- we call this a structural mechanism intervention:

Definition 2 (Structural mechanism intervention). A structural mechanism intervention on a variable V is an intervention on its mechanism variable Ṽ such that V is made conditionally independent of its object-level parents. That is, under do(Ṽ = g̃), the following holds: V ⊥ Pa^V. (1)
We can record whether points 1. and 2. above hold in a label on the relevant mechanism edge, motivating the following definition:
Definition 3. A mechanised SCM is edge-labelled if it further identifies a subset E_term ⊆ E_mech of mechanism edges (dash-dotted blue) Ṽ → W̃, called terminal mechanism edges, such that:
1. W̃ responds to Ṽ even after any effects of V on its children, Ch^V, have been removed by means of any structural mechanism interventions on Ch^V; and
2. W̃ does not respond to Ṽ if the effects of W on its children, Ch^W, have been removed by means of structural mechanism interventions on all of Ch^W.
Non-terminal mechanism edges are drawn with dashed black lines.
Intuitively, the terminal edges designate the variables that an agent cares about for their own sake. For example, the mechanism edge Ũ → D̃ in Fig. 1c is terminal, because it remains when the children of the object-level variable U are cut (indeed, U has no children), and disappears if we cut D off from its children (since then D doesn't affect X, and hence doesn't affect U). In contrast, X̃ → D̃ is non-terminal, because if the object-level link X → U is cut (i.e., the agent's position is made independent of it finding cheese), then the agent will cease adapting its policy to changes in the slip probability p. The labelling of terminal links will be used in Section 3.4 to determine that X is only instrumentally valuable to the agent.
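The two tests in Definition 3 can be sketched on the mouse example, assuming best-response behaviour and a fixed tie-breaking order over actions. The cut_* flags below are our own stand-ins for structural mechanism interventions that remove a variable's effect on its children.

```python
# Sketch of the terminal-edge tests on the mouse example. cut_x_to_u removes
# X's effect on U (a structural mechanism intervention on U); cut_d_to_x
# removes D's effect on X. best_response stands in for the mechanism of D.

def expected_utility(action, p, b, cut_d_to_x=False, cut_x_to_u=False):
    if cut_x_to_u:
        return 0.5                          # U ignores X: every action ties
    # Probability of ending up in the right square.
    right = 0.5 if cut_d_to_x else (p if action == "right" else 1 - p)
    return right * b + (1 - right) * (1 - b)

def best_response(p, b, **cuts):
    order = ["left", "right"]               # fixed tie-breaking order
    return max(order, key=lambda a: expected_utility(a, p, b, **cuts))

def responds_to_b(**cuts):                  # does the decision rule respond to b?
    return best_response(0.75, 0.9, **cuts) != best_response(0.75, 0.1, **cuts)

def responds_to_p(**cuts):                  # does the decision rule respond to p?
    return best_response(0.75, 0.9, **cuts) != best_response(0.25, 0.9, **cuts)

# The edge from U's mechanism to D's mechanism is terminal: the decision rule
# responds to b (U has no children to cut), but stops responding once D is cut
# off from its children. The edge from X's mechanism is not terminal: the
# decision rule stops responding to p as soon as X is cut off from U.
print(responds_to_b(), responds_to_b(cut_d_to_x=True))   # True False
print(responds_to_p(), responds_to_p(cut_x_to_u=True))   # True False
```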

Discovering Edge-labelled, Mechanised Causal Graphs
We next describe how edge-labelled mechanised causal graphs can be inferred from interventional data. Intuitively, by the definition of a causal edge, if one applies interventions to all nodes except one node V, and varies these interventions at only one node W, then one can reliably discover whether there should be a causal edge from W to V (even in cyclic SCMs). This leave-one-out strategy is described below:
Lemma 1 (Leave-one-out causal discovery). Applied to the set of interventional distributions generated by a (potentially cyclic) SCM, leave-one-out causal discovery returns the correct causal graph.
Proof. Immediate from the definitions of SCM and causal graph, see Section 2.
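For a finite, deterministic SCM the leave-one-out test can be sketched as follows; the encoding of structural functions as Python callables over binary variables is our own illustration, not the paper's algorithm.

```python
# Sketch of the leave-one-out test from Lemma 1, for a deterministic SCM over
# binary variables. To test for an edge W -> V, intervene on every node except
# V, and check whether varying only the value assigned to W can change V.
from itertools import product

def leave_one_out_edges(functions, variables):
    """functions: {var: f(assignment_dict) -> value} over 0/1 variables."""
    edges = set()
    for v in variables:
        others = [w for w in variables if w != v]
        for w in others:
            rest = [r for r in others if r != w]
            for rest_vals in product([0, 1], repeat=len(rest)):
                base = dict(zip(rest, rest_vals))
                out = {functions[v]({**base, w: x}) for x in (0, 1)}
                if len(out) > 1:          # varying W alone changed V
                    edges.add((w, v))
                    break
    return edges

# Chain D -> X -> U (noise suppressed for the sketch): X copies D, U copies X.
functions = {
    "D": lambda a: 0,                     # exogenous choice, here a constant
    "X": lambda a: a["D"],
    "U": lambda a: a["X"],
}
print(sorted(leave_one_out_edges(functions, ["D", "X", "U"])))
# [('D', 'X'), ('X', 'U')]
```

Note that no edge (D, U) is found: once X is held fixed by an intervention, varying D cannot change U, which is exactly how the leave-one-out strategy separates direct from mediated influence.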
Algorithm 1 applies Leave-one-out causal discovery to the combined set of object-level and mechanism variables of a mechanised SCM, and then infers edge-labels using structural mechanism interventions on object-level children.
Algorithm 1 | Edge-labelled mechanised SCM discovery (pseudocode listing omitted).

Lemma 2 (Discovery of mechanised SCM). Applied to the set of interventional distributions generated by a mechanised SCM in which structural mechanism interventions are available for all nodes, Algorithm 1 returns the correct edge-labelled mechanised causal graph.
Proof. The algorithm checks the conditions in Definitions 1 and 3.
Applied to the mouse example of Fig. 1, Algorithm 1 would take interventional data from the system and draw the edge-labelled mechanised causal graph in Fig. 1c. For example, the edge (Ũ, D̃) will be discovered because the mouse's decision rule will change in response to a change in the distribution over the cheese location.

Discovering game graphs
To discover agents, we can convert an edge-labelled mechanised causal graph into a game graph as specified by Algorithm 2: decision nodes are identified by their mechanisms having incoming terminal edges (Line 4), while utility nodes are identified by their mechanisms having outgoing terminal edges (Line 5). Decisions and utilities that are in the same connected component of the terminal-edge graph receive the same colouring, which is distinct from that of the other components. On Line 10, Connected(Ṽ) is the set of mechanism nodes connected to Ṽ by an undirected path in the terminal-edge graph. This set could be found by a search algorithm, such as breadth-first search.

Algorithm 2 Agency Identification. Converts edge-labelled mechanised causal graph to game graph
Input: An edge-labelled mechanised causal graph C = (V, E), with nodes V = O ∪ M and edges E = E_obj ∪ E_func ∪ E_mech, with E_term ⊆ E_mech.
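The identification and colouring steps of Algorithm 2 can be sketched as follows, taking the set of terminal mechanism edges as input. The function name and the string encoding of mechanism nodes (e.g. "U~") are illustrative assumptions of the sketch, not the paper's notation.

```python
# Sketch of Algorithm 2's identification step. A node whose mechanism has an
# incoming terminal edge is a decision; one with an outgoing terminal edge is
# a utility. Weakly connected components of the terminal-edge graph share a
# colour, i.e. are attributed to the same agent.
from collections import defaultdict, deque

def identify_agents(terminal_edges):
    decisions = {w for (_, w) in terminal_edges}   # mechanisms of decisions
    utilities = {v for (v, _) in terminal_edges}   # mechanisms of utilities
    adj = defaultdict(set)                         # undirected adjacency
    for v, w in terminal_edges:
        adj[v].add(w)
        adj[w].add(v)
    colour, next_colour = {}, 0
    for node in sorted(adj):                       # breadth-first search per component
        if node in colour:
            continue
        colour[node] = next_colour
        queue = deque([node])
        while queue:
            for nbr in adj[queue.popleft()]:
                if nbr not in colour:
                    colour[nbr] = next_colour
                    queue.append(nbr)
        next_colour += 1
    return decisions, utilities, colour

# Mouse example: the single terminal edge from U's mechanism to D's mechanism
# makes D a decision and U a utility, coloured as one agent.
decisions, utilities, colour = identify_agents({("U~", "D~")})
print(decisions, utilities, colour)  # {'D~'} {'U~'} {'D~': 0, 'U~': 0}
```

With two disjoint terminal edges the sketch returns two colours, matching the intuition (Assumption 1 below) that disconnected components correspond to separate agents.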
Algorithm 2 will identify any decision node D under the following conditions (though it may generate false positives):
• A utility node U, or a mediator node M on a directed path from D to U, is included in the model.
• The utility/mediator node must be sufficiently important to the agent controlling D that its mechanism shapes the agent's behaviour.
• Mechanism interventions are available that change the agent's optimal policy for controlling U (or M).
• These mechanism interventions are operationalised in a way that the agent's policy can respond to the changes they imply.
Under the following stronger assumptions, Algorithm 2 is guaranteed to produce a fully correct game graph (without false positives). These assumptions are most easily stated using mechanised SCMs with labelled decision and utility nodes. Following Hammond et al. (forthcoming), we call such objects mechanised games.
For our first assumption, the following definition will be helpful.
Definition 4. For a game graph, G, we define the agent subgraph for agent i to be the graph G^i = (D^i ∪ U^i, E^i), where the edge (V, W) belongs to E^i if and only if there is a directed path from V to W in G that doesn't pass through any U ∈ U^i \ {W}. We define the decision-utility subgraph to be the graph G^DU = (D ∪ U, ⋃_i E^i).
For example, the decision-utility subgraph of Fig. 1b consists of two nodes, D and U, and an edge (D, U), as there is a directed path from D to U that is not mediated by other utility nodes. One further piece of terminology we use is that a DAG is weakly connected if replacing all of its directed edges with undirected edges produces a connected graph, i.e. one in which every pair of vertices is connected by some path. A weakly connected component is a maximal subgraph such that all nodes are weakly connected. For example, the decision-utility subgraph of Fig. 1b is connected, and consists of a single connected component (the agent subgraph for the mouse).
Our first assumption uses these definitions as follows:
Assumption 1. Each weakly connected component of the decision-utility subgraph is an agent subgraph, and contains at least one decision and one utility node.
The intuition behind this assumption is that if there was a disconnected component in the agent subgraph, then the decisions in that component could be reasoned about independently from the rest of the decisions, and there would be no way to experimentally distinguish whether those independent decisions were made by a separate agent. So we make the simplifying assumption that only separate agents reason about their decisions independently. An example of a game ruled out by this assumption is Fig. 8, in which a decision doesn't directly cause its utility.
Assumption 2. For any set of mechanism interventions, every agent optimises expected utility (plays best response) in every decision context, i.e. agents play a subgame perfect equilibrium.
Assumption 2 implies that mechanism interventions are operationalised in a way that agents can appropriately respond to them, that agents are trying to optimise their utility nodes, and that object-level links going into the decision can be interpreted as information links (since agents adapt appropriately to the outcomes of the decision parents).

Assumption 3. Agents have a preferred ordering over decision rules, so that if two or more decision rules obtain the same (optimal) expected utility in all decision contexts, the agent will always pick the first optimal decision rule according to this ordering.
This ensures that no unmotivated switches occur, i.e. that agents don't switch decision rule in response to mechanism interventions which have no effect on the optimality of that decision rule. This is a weak form of the popular independent causal mechanism assumption, discussed further in Section 5.3, preventing dependencies between certain mechanisms.
Assumption 5. For each node V, interventions on Ṽ can instantiate any deterministic function relating V to its parents (when V lacks parents, V can be set to any constant value in dom(V)).
This is to ensure that we can enact the necessary soft interventions, in a way that the agent is aware of. We are now ready to establish a correctness result for Algorithm 2.
Theorem 1 (Correctness of Algorithms 1 and 2). Let M_real be a mechanised causal game satisfying Assumptions 1 to 5. Let G_model be the game graph resulting from applying Algorithm 1 followed by Algorithm 2 to M_real. Then G_model = G_real.
Proof. We establish that the algorithm infers the correct object-level causal structure, the correct labelling of decision and utility nodes (and hence of chance nodes), and the correct colouring of the same.
Causal structure: The only structural difference between a game graph and an SCM is the presence of information links in the game. By Assumption 5, we can impute to any decision an arbitrary decision rule that makes it depend on all of its observations. Thereby all information links are causal links.

Decision:
We first show that all and only decisions get mapped to decisions. Let D ∈ D^i be a decision variable for agent i in M_real. By Assumption 1, there exists a utility variable U ∈ U^i such that there is a directed path, p, from D to U not passing through any other utility node of agent i. By means of mechanism interventions, we can ensure that U is either 0 or 1 depending on the value of D, by copying the value of D along p using deterministic functions (Assumption 5). All other nodes ignore D. Agent i chooses a decision rule setting U to 1 (Assumption 2). If we do a mechanism intervention to invert the function governing U, and cut off all of its effects on its children, then agent i will choose a different decision rule and Lines 10-13 will add the edge (Ũ, D̃) to E_term. Further, even if we intervene to cut the effect of U on its children, mechanism interventions on the function governing U will still cause agent i to choose a different decision rule, since U remains agent i's utility (and Assumption 3 rules out unmotivated switches). Thus, Lines 14-17 will not remove (Ũ, D̃) from E_term. Algorithm 2 then correctly identifies D as a decision.
Conversely, assume Y ∈ V \ D is a non-decision. It may be that Lines 10-13 add (Ṽ, Ỹ) to E_term, for some V ∈ V \ {Y}. But Lines 14-17 will remove (Ṽ, Ỹ) from E_term by Assumption 4, and Algorithm 2 then doesn't identify Y as a decision.
Utility: We next show that all and only utilities get mapped to utilities. Let U ∈ U^i be a utility variable for agent i in M_real. By Assumption 1, there exists a decision variable D ∈ D^i such that there is a directed path, p, from D to U not passing through any other utility node of agent i. By the same construction as for decision nodes above, Algorithm 1 will discover a terminal mechanism edge (Ũ, D̃). Therefore Algorithm 2 identifies U as a utility, as desired.
Conversely, consider a non-utility node, Y ∉ U, and some other node, V ∈ V \ {Y}, with structural interventions cutting off Ch_Y and interventions on all mechanisms except Ỹ. Suppose, for contradiction, that there exists a terminal edge (Ỹ, Ṽ). By Assumption 4, there will be a terminal edge (Ỹ, Ṽ) only if V is a decision. Further, by Assumptions 2 and 3, the expected utility must be affected by the change in Ỹ. But since we have intervened on all mechanisms except Ỹ, the only effect Ỹ can have on the expected utility is via Y. But Y ∉ U, and Ch_Y are not affected (since they have been cut off), so Ỹ cannot affect the expected utility. Therefore, only utility variables get outgoing edges in E_term from Algorithm 1, and Algorithm 2 does not assign Y to be a utility.
We have thus shown that all and only decision nodes get mapped to decisions, and similarly for utilities. All that remain are chance nodes, and these must be mapped to chance nodes (since only decisions/utilities get mapped to decisions/utilities).
Colouring: By Assumption 1, for any agent i and any decision D ∈ D^i, there exists U ∈ U^i with (D, U) an edge of agent i's subgraph. By the above paragraphs, E_term must contain the edge (Ũ, D̃), and further, by the converse arguments, the only edges in E_term are of the form (Ũ, D̃) with D ∈ D^i, U ∈ U^i and (D, U) an edge of agent i's subgraph for some i, which means E_term is a disjoint union over agents, in which each edge is the reverse of an edge in the corresponding agent subgraph. By Assumption 1, the weakly connected components of the decision-utility subgraph are the agent subgraphs, and so the agents' components of E_term are each weakly connected, and disconnected from each other. The colouring of Algorithm 2 colours each vertex of a connected component the same colour, and distinctly from all other components, and thus is correct.

Mechanism Identification Procedure
In the last section we demonstrated an algorithm that, when applied after a causal discovery algorithm, can identify the underlying game graph of a system. In this section we will show the converse, that if one already has a game graph, one can convert it into an edge-labelled mechanised causal graph. The interpretation is that the same underlying system can equivalently be represented either as an edge-labelled mechanised causal graph, which is a physical representation of the system, or as a game graph, which is a decision-theoretic representation of the system.
We first prove a lemma relating the mechanised causal graph produced by Algorithm 1 to strategic relevance (Koller and Milch, 2003), which captures which other decision rules are relevant for optimising the decision rule at D. Koller and Milch give a sound and complete graphical criterion for strategic relevance, called s-reachability⁷, where V ≠ D is s-reachable from D ∈ D^i, for agent i, if, in a modified game graph Ĝ with a new parent V̂ added to V, we have V̂ ⊥̸_Ĝ U^i_D | Fa_D, where U^i_D is the set of utilities for agent i that are descendants of D (i.e. U^i_D = U^i ∩ Desc_D) and ⊥̸ denotes d-connection (Pearl, 2009). In the game graph in Fig. 1b, both X and U are s-reachable from D.
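The s-reachability criterion can be tested directly. The sketch below uses our own encoding (a DAG as a list of directed edges), with the classic ancestral moral graph construction standing in for a d-separation oracle; it is an illustration of the criterion, not the paper's implementation:

```python
from collections import defaultdict

def d_connected(edges, xs, ys, zs):
    """Test d-connection X not-indep Y | Z in a DAG: restrict to ancestors
    of X, Y, Z, marry co-parents, drop directions, delete Z, then check
    undirected reachability from X to Y."""
    parents = defaultdict(set)
    for a, b in edges:
        parents[b].add(a)
    anc, stack = set(), list(xs | ys | zs)   # ancestors, incl. the sets
    while stack:
        n = stack.pop()
        if n in anc:
            continue
        anc.add(n)
        stack.extend(parents[n])
    adj = defaultdict(set)                   # moralised ancestral graph
    for v in anc:
        ps = parents[v] & anc
        for p in ps:
            adj[p].add(v)
            adj[v].add(p)
            for q in ps:
                if q != p:
                    adj[p].add(q)
    reach, stack = set(), [x for x in xs if x not in zs]
    while stack:                             # delete Z, then search
        n = stack.pop()
        if n in reach or n in zs:
            continue
        reach.add(n)
        stack.extend(adj[n])
    return bool(reach & ys)

def s_reachable(edges, v, d, agent_utilities):
    """Is v s-reachable from decision d? Add a fresh parent v_hat to v and
    test whether v_hat is d-connected to d's downstream utilities given
    Fa_d = {d} | Pa_d (Koller and Milch, 2003)."""
    children, parents = defaultdict(set), defaultdict(set)
    for a, b in edges:
        children[a].add(b)
        parents[b].add(a)
    desc, stack = set(), list(children[d])   # strict descendants of d
    while stack:
        n = stack.pop()
        if n in desc:
            continue
        desc.add(n)
        stack.extend(children[n])
    u_d = set(agent_utilities) & desc
    if not u_d:
        return False
    hat = ("hat", v)
    return d_connected(list(edges) + [(hat, v)], {hat}, u_d, {d} | parents[d])
```

On the chain D → X → U of Fig. 1b, both X and U come out s-reachable from D, matching the text.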
Lemma 3. Let M be a mechanised causal game satisfying Assumptions 1 to 5, containing an agent, i, with decision variables D^i and utility variables U^i, and let C be the mechanised causal graph with edges E_obj ∪ E_func ∪ E_mech, and E_term ⊆ E_mech, which results from applying Algorithm 1 to M. Then:

1. For D ∈ D^i, that a node V ∈ V \ {D} is s-reachable from D is a necessary and sufficient condition for (Ṽ, D̃) ∈ E_mech (this places no restriction on (Ṽ, Ỹ) ∈ E_mech for Y ∉ D). 2. Further, for U ∈ U^i, the existence of a directed path from D to U not through another U′ ∈ U^i \ {U} is a necessary and sufficient condition for (Ũ, D̃) ∈ E_term.
Proof. Necessity of 1: We largely follow the soundness direction of Koller and Milch (2003), Thm 5.1, with an extension relating it to the mechanised causal graph discovered by Algorithm 1. The proof strategy is to suppose that V is not s-reachable from D, and show that this implies (Ṽ, D̃) ∉ E_mech.
We perform two mechanism interventions that differ only on Ṽ, do(Ṽ = ṽ, W̃ = w̃) and do(Ṽ = ṽ′, W̃ = w̃), where W̃ denotes the remaining mechanisms. Since D is a decision variable, by Lemma 5.1 of Koller and Milch (2003) the optimal decision rule π_D under do(Ṽ = ṽ, W̃ = w̃) must be a solution of the optimisation problem of choosing, in each decision context pa_D, a d ∈ dom(D) maximising the expected utility E[∑_{U ∈ U^i_D} U | d, pa_D], and similarly for the decision rule π′_D under do(Ṽ = ṽ′, W̃ = w̃).
Now suppose that V is not s-reachable from D. Then by Lemma 5.2 of Koller and Milch (2003), we have that P(U^i_D | d, pa_D, do(Ṽ = ṽ, W̃ = w̃)) = P(U^i_D | d, pa_D, do(Ṽ = ṽ′, W̃ = w̃)), and so the two optimisation problems are the same. Since the decision rules are solutions of the same optimisation problem, and by Assumptions 2 and 3 the agents choose decision rules which make up a subgame-perfect equilibrium, this leads to the same decision rule in each intervened game, π_D = π′_D. This holds for any ṽ, ṽ′, and so Algorithm 1 does not draw an edge, i.e. (Ṽ, D̃) ∉ E_mech, as was to be shown.
Sufficiency of 1: We can use soft interventions on object-level variables to construct the same model as used in the existence proof for Theorem 5.2 of Koller and Milch (2003). We note that the proof of Theorem 5.2 of Koller and Milch (2003) is written for another decision variable D′ being s-reachable from D. But the proof itself makes no use of the special nature of D′ as a decision, rather than any other type of variable, and so it also applies to any variable V ∈ V \ {D}.
Suppose V is s-reachable from D in M. It follows from Theorem 5.2 of Koller and Milch (2003) that the optimal decision rule for D will differ under these soft interventions (i.e. this choice of causal game) when different mechanism interventions are applied to Ṽ. Hence Algorithm 1 will draw an edge (Ṽ, D̃) ∈ E_mech.
Sufficiency of 2: By the arguments in Theorem 1 (decision, utility), the existence of a directed path from D to U not through another U′ ∈ U^i \ {U} means that (Ũ, D̃) ∈ E_term.
The conversion from game graph to mechanised causal graph is done by Algorithm 3, Mechanism Identification. It first takes the game graph edges and on Lines 3-5 adds the function edges. Lines 8-14 then add the mechanism edges based on s-reachability: if a node V is s-reachable from a decision D in the game graph, then we include an edge (Ṽ, D̃) in the mechanised causal graph. Further, it adds a terminal edge when there is a directed path from one of an agent's decisions to one of its utilities that doesn't pass through another of its utilities. We now establish that Algorithm 2 and Algorithm 3 are inverses of each other. We will use the shorthand A_k(x), for k = 1, 2, 3, to refer to the result of Algorithm k on object x, where e.g. x is a game graph.
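The edge-adding steps just described can be sketched as follows (a simplified rendering with our own data layout; the s-reachability test is passed in as an oracle, e.g. one based on Koller and Milch (2003), and a mechanism node is encoded as a tagged pair):

```python
def mechanism_identification(nodes, obj_edges, decisions, utilities,
                             agent_of, s_reach):
    """Sketch of Algorithm 3: convert a game graph into an edge-labelled
    mechanised causal graph. `agent_of` maps decisions/utilities to an
    agent id; `s_reach(v, d)` is an s-reachability oracle."""
    mech = lambda v: ("mech", v)
    # function edges: each mechanism points at its object-level variable
    func_edges = [(mech(v), v) for v in nodes]
    # mechanism edges: from s-reachable nodes into decision mechanisms
    mech_edges = [(mech(v), mech(d))
                  for d in decisions for v in nodes
                  if v != d and s_reach(v, d)]
    # terminal edges: directed path from an agent's decision to one of
    # its utilities, avoiding that agent's other utilities
    children = {v: [] for v in nodes}
    for a, b in obj_edges:
        children[a].append(b)

    def path_avoiding(src, dst, avoid):
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n in seen or (n != src and n in avoid):
                continue
            seen.add(n)
            stack.extend(children[n])
        return False

    term_edges = [(mech(u), mech(d))
                  for d in decisions for u in utilities
                  if agent_of[u] == agent_of[d]
                  and path_avoiding(d, u, {x for x in utilities
                                           if agent_of[x] == agent_of[d]
                                           and x != u})]
    return func_edges, mech_edges, term_edges
```

On the chain D → X → U with a single agent, this yields a terminal edge from U's mechanism into D's, as in Fig. 1c.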
Theorem 2 (Algorithm 2 is a left inverse of Algorithm 3). Let G be the game graph of a mechanised causal game satisfying Assumptions 1 to 5, and let C be the mechanised causal graph resulting from applying Algorithm 3 to it. Then applying Algorithm 2 to C reproduces G. That is, A_2(A_3(G)) = G.
Proof. All edges between object-level nodes are the same in G and A_2(A_3(G)), because neither Algorithm 2 nor Algorithm 3 changes the object-level edges. We will now show that the node types are the same in both.

Decision:
Let i be an agent with utilities U^i and let D ∈ D^i. Then by Assumption 1 there exists U ∈ U^i and a directed path from D to U not through another U′ ∈ U^i \ {U}. Algorithm 3 Lines 13-14 add (Ũ, D̃) to E_term. Algorithm 2 then adds D to the set of decisions, as desired.
Let V ∈ V \ D. Algorithm 3 Lines 13-14 only add terminal mechanism edges going into decisions, and Algorithm 2 then doesn't add V to the set of decisions, as desired.

Utility: Let i be an agent with decisions D^i and let U ∈ U^i. Then by Assumption 1 there exists D ∈ D^i and a directed path from D to U not through another U′ ∈ U^i \ {U}. So Algorithm 3 Lines 13-14 add (Ũ, D̃) to E_term. Algorithm 2 then adds U to the set of utilities, as desired.
Let V ∈ V \ U. Algorithm 3 Lines 13-14 only add terminal edges going out of utilities, so there will be no edge out of Ṽ in E_term. Algorithm 2 then doesn't add V to the set of utilities, as desired.
Colouring: By the above paragraphs, the node types and edges are the same in both A_2(A_3(G)) and G. By Assumption 1, the colouring in G is a property of the connectedness, and hence will be the same in A_2(A_3(G)).
We now consider the other direction: beginning with a mechanised causal graph, can we transform it into a game graph and then back into the same mechanised causal graph? In general this isn't possible, because the space of possible mechanised causal graphs is larger than the space of mechanised causal graphs that can be recovered using only the information present in a game graph. In particular, mechanisms with non-terminal incoming mechanism edges do not, in general, get codified in the game graph when using A_2. Further, we will find it useful to consider only those mechanised causal graphs that are producible from a mechanised causal game satisfying Assumptions 1 to 5, as this will enable us to use Lemma 3. Thus, in the next theorem, we restrict the space of mechanised causal graphs we consider.

Theorem 3 (Algorithm 3 is a left inverse of Algorithm 2). Let C be a mechanised causal graph such that
• there exists a mechanised causal game, M, satisfying Assumptions 1 to 5, such that A_1(M) = C;
• any mechanism with an incoming mechanism edge also has an incoming terminal edge, i.e. for every (X̃, Ṽ) ∈ E_mech there is some (Ũ, Ṽ) ∈ E_term.
Then A_3(A_2(C)) = C.
Proof. The edges in E_obj and E_func are the same in both C and A_3(A_2(C)), since neither algorithm changes the object-level edges, and all mechanised causal graphs over the same object-level variables have the same function edges, i.e. {(Ṽ, V)}_{V ∈ V}, which are added in Algorithm 3 Lines 3-5. We now show why the edges in E_term and E_mech are the same in both.
From the theorem statement, there exists M such that A_1(M) = C. Let G be the game graph of M.

Examples
We now look at example applications of our algorithms, which help the modeler to draw the correct game graph to describe a system.

Simple example
We begin by considering the simple example of Fig. 1 in more detail. The underlying system has game graph G_real, displayed in Fig. 1b, with a decision node D, a chance node X and a utility node U. Recall that all variables are binary; X = D with probability p_X and X = 1 − D with probability 1 − p_X; and U = X with probability p_U and U = 1 − X with probability 1 − p_U. Having specified the causal game, we can now describe the optimal decision rule, which depends on the values of p_X and p_U: if p_X, p_U > 0.5 or p_X, p_U < 0.5, then D = 1 is optimal; if p_X < 0.5, p_U > 0.5 or p_X > 0.5, p_U < 0.5, then D = 0 is optimal; and if either p_X or p_U is 0.5, then both D = 0 and D = 1 are optimal.
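These cases can be checked numerically. In the sketch below (our own notation), p_x is the probability that X copies D and p_u the probability that U copies X:

```python
def expected_utility(d, p_x, p_u):
    """E[U | D = d] where X = d w.p. p_x (else 1 - d) and
    U = X w.p. p_u (else 1 - X); U is binary, so E[U|d] = P(U=1|d)."""
    p_x1 = p_x if d == 1 else 1 - p_x          # P(X = 1 | d)
    return p_x1 * p_u + (1 - p_x1) * (1 - p_u)  # P(U = 1 | d)

def optimal_decisions(p_x, p_u):
    """Best responses of the agent (Assumption 2)."""
    eus = {d: expected_utility(d, p_x, p_u) for d in (0, 1)}
    best = max(eus.values())
    return [d for d in (0, 1) if eus[d] == best]
```

For instance, `optimal_decisions(0.9, 0.9)` gives `[1]`, while after a mechanism intervention setting p_x = 0.1 it gives `[0]`: the decision rule responds to the intervention, which is exactly the response Algorithm 1 uses to draw a mechanism edge into the decision's mechanism.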
We can now consider mechanism interventions, to understand what Algorithm 1 will discover. Suppose we soft intervene on X̃ and Ũ such that p_X, p_U > 0.5, so that the optimal policy is D = 1. If we then change the soft intervention on X̃ such that p_X < 0.5, we will see the optimal policy change to D = 0. Thus Algorithm 1 draws an edge (X̃, D̃). By a similar argument, it will also draw an edge (Ũ, D̃), which will be a terminal edge. Thus Algorithm 1 produces the edge-labelled mechanised causal graph C_model shown in Fig. 1c. Algorithm 2 then takes C_model and produces the correct game graph by identifying that only D̃ has incoming arrows, and so D is the only decision node, and that U is the only variable whose mechanism has an outgoing terminal edge into the mechanism for D, and hence is a utility. In this simple example, we have recovered the game graph of Fig. 1b.

Figure 3 | Optimising a model of a human. 3a Game graph G_real (reproduced from Everitt et al. (2021a), Fig. 4b). 3b The mechanised causal graph, C, that Algorithm 1 discovers. Note the path from H̃2 through the predicted-clicks mechanism to D̃, which implies the recommendation system's policy depends on how a human updates their opinions when shown the recommended content, which is not visible from the game graph.

Optimising a model of a human
We next consider an example from the influence diagram literature. It has been suggested that a safety problem with content-recommendation systems is that they can nudge users towards more extreme views, to make it easier to recommend content that will generate higher utility for the system (e.g., more clicks), as the extreme views are more easily predictable (Benkler et al., 2018; Carroll et al., 2022; Stray et al., 2021). To combat this, Everitt et al. (2021a) propose that the system's utility be based on predicted clicks using a model of the user, rather than directly on actual clicks and the user's actions. Their Fig. 4b is reproduced here in our Fig. 3a. The node H1 represents a human's initial opinion, and H2 their influenced opinion after seeing the agent's recommended content, D. The agent observes a model, M, of the human's initial opinion, and optimises for the predicted number of clicks, computed using the model M.
Drawing the mechanised causal graph (Fig. 3b) for this system reveals some critical subtleties. First, there is a terminal edge from the predicted-clicks mechanism into D̃, since predicted clicks are the goal that the agent is trained to pursue. But should there be a mechanism edge from H̃2 into the predicted-clicks mechanism? This depends on how the user model was obtained. If, as is common in practice, the model was obtained by predicting clicks based on past user data, then changing how a human reacts to recommended content (H̃2) would lead to a change in the way that predicted clicks depend on the model of the original user. This means that there should be such an edge, as we have drawn in Fig. 3b. Everitt et al. (2021a) likely have in mind a different interpretation, where the predicted clicks are derived from M according to a different procedure, described in more detail by Farquhar et al. (2022). But the intended interpretation is ambiguous when looking only at Fig. 3a; the mechanised graph is needed to reveal the difference.
Why does all this matter? Everitt et al. (2021a) use Fig. 3a to claim that there is no incentive for the policy to instrumentally control how the human's opinion is updated, and they deem the proposed system safe as a result. However, under one plausible interpretation, our causal discovery approach yields the mechanised causal graph representation of Fig. 3b, which contains a directed path from H̃2 to D̃. This can be interpreted as the recommendation system influencing the human in a goal-directed way, as it adapts its behaviour to changes in how the human is influenced by its recommendations (cf. the discussion in Section 1.2).
This example casts doubt on the reliability of graphical incentive analysis (Everitt et al., 2021a) and its applications (Ashurst et al., 2022; Cohen et al., 2021; Evans and Kasirzadeh, 2021; Everitt et al., 2021b; Farquhar et al., 2022; Langlois and Everitt, 2021). If different interpretations of the same graph yield different conclusions, then graph-based inference does not seem possible. Fortunately, by pinpointing the source of the problem, mechanised SCMs also contain the seed of a solution: graphical incentive analysis can be trusted (only) when all non-decision mechanisms lack incoming mechanism edges. Indeed, this mirrors the extra assumption needed for the equivalence between games and mechanised SCMs in Theorem 3. As mechanisms are often assumed completely independent, this is often not an unreasonable assumption (see also Section 5.3). Alternatively, it may be possible to use mechanised SCMs to generalise graphical incentive analysis to allow for dependent mechanisms, but we leave investigation of this for future work.

Figure 4 | Actor-Critic. 4a True game graph G_real. 4b Algorithm 1 produces the mechanised causal graph C_model. From C_model, Algorithm 2 produces the correct game graph by identifying that Ã and Q̃ have incoming terminal edges, so A and Q are decisions, and that Ỹ has an outgoing terminal edge to Ã, so Y is the actor's utility, whilst W̃ has an outgoing terminal edge to Q̃, so W is the critic's utility. The decisions are coloured differently due to having different utilities. 4c Incorrect game graph for actor-critic. 4d Coarse-grained single-agent game graph.

Actor-Critic
Our third example contains multiple agents. It represents an Actor-Critic RL setup for a one-step MDP (Sutton and Barto, 2018). Here an actor selects an action A as advised by a critic (Fig. 4a). The critic's action Q states the expected reward for each action (in the form of a vector with one element for each possible choice of A; this is often called a Q-value function). The action A influences the state S, which in turn determines the reward R. We model the actor as just wanting to follow the advice of the critic, so its utility is Y = Q(A) (the A-th element of the vector Q). The critic wants its advice for the chosen action to match the actual reward. Formally, it optimises W = −(Q(A) − R)².
Algorithm 1 produces the mechanised causal graph C_model in Fig. 4b. We don't justify all of the mechanism edges, but instead focus on a few of interest. For example, there is an edge (S̃, Q̃) but there is no edge (S̃, Ã), i.e. the critic cares about the state mechanism but the actor does not. The critic cares because it is optimising W, which is causally downstream of S, and so the optimal decision rule for Q will depend on the mechanism of S even when other mechanisms are held constant. The dependence disappears if Q is cut off from its children, so the edge (S̃, Q̃) is terminal. In contrast, the actor doesn't care about the mechanism of S, because its utility Y is not downstream of S, so when holding all other mechanisms fixed, varying S̃ won't affect the optimal decision rule for A. There is, however, an indirect effect of the mechanism for S on the decision rule for A, which is mediated through the decision rule for Q. Algorithm 2 applied to C_model produces the correct game graph by identifying that Ã and Q̃ have incoming terminal edges, and therefore A and Q are decisions; that Ỹ has an outgoing terminal edge to Ã, and so Y is the actor's utility; and that W̃ has an outgoing terminal edge to Q̃, and so W is the critic's utility. The decision-utility subgraph consists of two connected components, one being (A, Y) and the other (Q, W). The decisions and utilities therefore get coloured correctly.
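The qualitative claims about this equilibrium can be checked in a small simulation. The sketch below is ours, not the paper's: two actions, advice vectors drawn from a coarse grid, and ties broken by a fixed order as in Assumption 3.

```python
import itertools

def actor_best_response(q):
    """Actor's decision rule: follow the critic's advice, i.e. maximise
    its utility Y = Q[A]. Ties broken by a fixed order (Assumption 3)."""
    return max(range(len(q)), key=lambda a: (q[a], -a))

def critic_best_response(reward, q_grid):
    """Critic's decision rule: pick advice Q anticipating the actor's
    rule, to maximise W = -(Q[A] - R)^2 (best response, Assumption 2)."""
    def w(q):
        a = actor_best_response(q)
        return -(q[a] - reward[a]) ** 2
    best = max(w(q) for q in q_grid)
    return next(q for q in q_grid if w(q) == best)  # first optimum (Assumption 3)

# A coarse grid of advice vectors for two actions.
Q_GRID = list(itertools.product([0.0, 0.5, 1.0], repeat=2))
```

Intervening on the reward mechanism (changing `reward`) changes the critic's optimal advice, while the actor's decision rule is a fixed function of the advice it observes, mirroring the presence of a mechanism edge into the critic's decision rule and the absence of a direct one into the actor's.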
This can help avoid modelling mistakes and incorrect inference of agent incentives. In particular, Christiano (private communication, 2019) has questioned the reliability of incentive analysis from CIDs, because of an apparently reasonable way of modelling the actor-critic system in which the critic is not modelled as an agent, shown in Fig. 4c. Doing incentive analysis on this single-agent diagram would lead to the assertion that the system is not trying to influence the state S or the reward R, because they don't lie on the directed path A → Y (i.e. neither S nor R has an instrumental control incentive; Everitt et al., 2021a). This would be incorrect, as the system is trying to influence both these variables (in an intuitive and practical sense).
The modelling mistake would be avoided by applying Algorithms 1 and 2 to the underlying system, which produce Fig. 4a, differing from Fig. 4c. The correct diagram has two agents, and it's not possible to apply the single-agent incentive concept from Everitt et al. (2021a). Instead, an incentive concept suitable for multi-agent systems would need to be developed. For such a multi-agent incentive concept to be useful, it should capture the influence on S and R jointly exerted by A and Q. Fig. 4d shows a game graph that involves only a subset of the variables of the underlying system, i.e., a coarse-grained version. This is also an accurate description of the same underlying system, though with less detail. At this coarser level, we find an instrumental control incentive on S and R, as intuitively expected.

Modified Action Markov Decision Process
Next, we consider an example regarding the redirectability of different RL agents. Langlois and Everitt (2021) introduce modified action Markov decision processes (MAMDPs) to model a sequential decision-making problem similar to an MDP, but in which the agent's decisions can be overridden by a human. In the game graph in Fig. 5a, this is modelled by each decision D_t only influencing the environment via a chance variable, PD_t, which represents the potentially overridden decision.
Algorithm 1 produces the mechanised causal graph C_model in Fig. 5b, where for readability we restrict to mechanisms only; for the full diagram see Fig. 10 in Appendix D. There are many mechanism edges, so we only elaborate on the interpretation of one of them: the edge from the mechanism of PD1 into D̃1. This edge represents that the agent's choice of decision rule is influenced by the mechanism of the potentially overridden variable, PD1. In general, in this decision problem, it will be suboptimal to ignore knowledge of the mechanism for the potentially overridden variables. Algorithm 2 applied to C_model produces the correct game graph by identifying that D̃1 and D̃2 have incoming terminal edges, so D1 and D2 are decisions, and that the remaining variable whose mechanism has outgoing terminal edges to the mechanisms for the decisions D1 and D2 is a utility. Since the utility is the same, the decisions are coloured the same to show they are the same agent.
We note that the game graph presented here in Fig. 5a differs from Figure 2 of Langlois and Everitt (2021). The reason is that we have been stricter about what should appear in a game graph, and what should appear in a mechanised causal graph. In particular, Langlois and Everitt have a node for the decision rule in their game graph, whereas we only have decision nodes in our game graphs, with decision rule nodes appearing only in mechanised causal graphs, along with the other mechanism nodes. With this extra strictness comes greater expressiveness and clarity. In our game graph we make clear that the agent's decisions can't condition on the result of the modification, whereas Langlois and Everitt draw an information edge from the modification to the policy, which is a decision node in their diagram. Instead, we represent the fact that the decision rules are influenced by the mechanisms of the potentially overridden variables by the corresponding mechanism edges in the mechanised causal graph. This allows us to be clearer about what information is available to each decision: it observes the state, but does not observe the modification, as might be construed from the diagram in Langlois and Everitt (2021).

Figure 5 | Modified action MDP. 5a The underlying system has game graph G_real. 5b Algorithm 1 produces the mechanised causal graph C_model (we display mechanisms only; see Fig. 10 in Appendix D for the full diagram). Since the utility is the same, the decisions D1 and D2 are coloured the same to show they belong to the same agent.

Figure 6 | Zero agents. 6a The true game graph G_real has no decisions or utilities, so is a standard causal Bayesian network. 6b Algorithm 1 produces the mechanised causal graph C_model. Algorithm 2 produces the correct game graph by identifying that there are no agents, and just recovers the standard causal Bayesian network.

Zero agents
Our final example of our algorithms working as desired is one in which there are no agents at all; see Fig. 6. One variable causes a second, which in turn causes a third, but there is no decision or utility. Algorithm 1 produces the mechanised causal graph C_model in Fig. 6b. Algorithm 2 produces the correct game graph by identifying that there are no decisions, as there are no mechanisms with incoming terminal edges, and hence also no utilities. This just recovers a causal Bayesian network graph.
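The labelling step of Algorithm 2 used across these examples, including this degenerate zero-agent case, can be sketched as follows (a simplified rendering with our own encoding, in which a terminal edge is stored as an object-level pair (utility, decision)):

```python
def label_game_graph(nodes, term_edges):
    """Sketch of Algorithm 2's labelling: a node is a decision iff its
    mechanism has an incoming terminal edge, and a utility iff its
    mechanism has an outgoing terminal edge; everything else is chance."""
    decisions = {d for _, d in term_edges}
    utilities = {u for u, _ in term_edges}
    chance = set(nodes) - decisions - utilities
    # colouring: agents correspond to the weakly connected components
    # of the decision-utility subgraph induced by the terminal edges
    comps = []
    for u, d in term_edges:
        merged = [c for c in comps if u in c or d in c]
        for c in merged:
            comps.remove(c)
        comps.append(set().union({u, d}, *merged))
    colour = {v: i for i, comp in enumerate(comps) for v in comp}
    return decisions, utilities, chance, colour
```

With no terminal edges at all, every node is labelled chance and no colours are assigned, recovering the causal Bayesian network of Fig. 6a.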

Breaking Assumptions
Compared to the other assumptions, which are more benign, Assumption 1 rules out some examples that we might wish to consider. We now consider some examples which break it.

Figure 7 | Multiple agents with a shared utility. 7a True game graph G_real, where the yellow utility indicates that both robot and human share the same utility. 7b Algorithm 1 produces the mechanised causal graph C_model, shown here restricted to mechanisms only for readability; see Appendix D, Fig. 11 for the full mechanised causal graph. 7c Algorithm 2 produces an incorrect game graph in this case, because Assumption 1 is violated, and gives that all decisions belong to the same agent.

Multiple agents with a shared utility
First, in Fig. 7, we consider a causal game that has two agents with a shared utility; see Fig. 7a. This diagram represents an Assistance Game, formerly known as Cooperative Inverse Reinforcement Learning (Hadfield-Menell et al., 2016). There is a human which makes two decisions, conditioned on information about their preferences, and a robot which makes its own decisions. Algorithm 1 produces the mechanised causal graph C_model shown in Fig. 7b. We believe this mechanised causal graph representation of the system is correct. However, the problem arises when we apply Algorithm 2 to it. The result is the game graph in Fig. 7c, which has all decisions belonging to the same agent. We hope that in future work we will be able to distinguish agents which share the same utility, through some modification to the colouring logic of Algorithm 2. One approach may be to use a condition involving sufficient recall (Milch and Koller, 2008) to distinguish between agents (a game graph has sufficient recall if, for each agent, the mechanism graph restricted to that agent's decision rules is acyclic).

Non-descendent utility
We now consider an example, Fig. 8a, that breaks Assumption 1 in another way. There are two agents, each making a single decision. The red agent's decision affects the utility that the blue agent receives, and the blue agent's decision affects the red agent's utility. Note that the agent subgraph for the red agent is disconnected (there is no directed path from its decision to its utility), so this example violates Assumption 1.
Algorithm 1 applied to G_real produces the mechanised causal graph C_model in Fig. 8b. We think this mechanised causal graph is an accurate representation of the system. From inspecting it, we can see that although the red agent's utility is not a descendant of its decision, it is a descendant of its decision rule, via a path through the blue agent's decision rule. That is, the red agent's decision rule can still have an effect on its utility, but Assumptions 2 and 3 rule out agents strategising using this path. Applying Algorithm 2 to C_model produces an incorrect game graph, with the red agent's decision and utility incorrectly identified as chance nodes (Fig. 8c).

Figure 8 | Non-descendant utility. 8a True game graph G_real. Note that the agent subgraph (Definition 4) for the red agent is not connected, violating Assumption 1. 8b Algorithm 1 produces the mechanised causal graph C_model. 8c Algorithm 2 produces an incorrect game graph in this case, because Assumption 1 is violated, leading to the red agent's decision and utility being incorrectly identified as chance, rather than as decision and utility variables respectively.
This example highlights several questions for future work: Which agents might learn to influence their utility by means of their decision rule, thereby breaking our Assumptions 2 and 3? And how can Algorithm 2 be generalised to handle non-descendant utilities and agents exploiting the influence of their decision rule?

Relativism of variable types
The first thing we discuss is that the variable types in a causal game, i.e. decision, utility or chance, are only meaningful relative to the choice of which variables are included in the model. Whether our procedure of Algorithm 1 followed by Algorithm 2 classifies a variable as decision, utility or chance depends on what other variables are included in the graph. For example, if in reality there is a utility variable U which is not present in the model (i.e., the set of modelled variables doesn't include U), but some of its parents, Pa_U, are present in the model, then those parents will be labelled as utilities. Similarly, if in reality there is a decision variable, D, which is not present in the model, but some of its children, Ch_D, are present in the model, then those children will be labelled as decisions. See Appendix C for a simple example of this relativism. In a sense, a choice of variables represents a frame in which to model the system, and what is a decision or a utility node is frame-dependent.

Modelling advice
How does one identify the relevant variables to begin with? Section 3.3 and Algorithms 1 and 2 only provide a way to determine the structure of a mechanised SCM and the associated game graph from a given set of variables, not a way to choose them. We now offer some tips on choosing variables.
A few principles always apply. First, variables should represent aspects of the environment that we are concerned with, either as means of influence for an agent, or as inherently valuable aspects of the environment. The content selected by a content recommender system, and the preferences of a user, are good examples. Second, it should be fully clear both how to measure and how to intervene on a variable. Otherwise its causal relationship to other variables will be ill-defined. In our case, this requirement extends also to the mechanism of each variable. Third, a variable's domain should be exhaustive (cover all possible outcomes of that variable) and represent mutually exclusive events (no pair of outcomes can occur at the same time) (Kjaerulff and Madsen, 2008). Finally, variables should be logically independent: one variable taking on a value should never be mutually exclusive with another variable taking on a particular value (Halpern and Hitchcock, 2010).
It's important to clarify whether a variable is object-level or a mechanism. For example, previous work (Langlois and Everitt, 2021) has drawn a policy (i.e., a mechanism) in a way that makes it look like an object-level variable, which led to some confusion, whereas in Section 4.4 we take the decision rule to be a mechanism. Another lesson learnt is that there are important differences between a utility and a variable which is merely instrumental for that utility. This is evident when performing a structural mechanism intervention to cut off instrumental variables from their downstream utilities, in which case a decision-maker won't respond to changes in the instrumental variable alone. Of particular importance is the level of coarse-graining in the choice of variables. There is some work on the marginalisation of Bayesian networks (Evans, 2016; Kinney and Watson, 2020), and of cyclic SCMs (Bongers et al., 2021), which allows one to marginalise out some variables. We hope to explore marginalisation in the context of game graphs in future work, and present one example in Appendix B. The choice of coarse-graining may have an impact on whether agents are discovered.

Relationship to Causality Literature
We now discuss some related literature on causality; other related work was discussed in Section 1.2. Pearl (2009) lays the foundations for modern approaches to causality, with an emphasis on graphical models, and in particular on structural causal models (SCMs), which allow for the treatment of both interventions and counterfactuals. Dawid (2002) considers related approaches to causal modelling, including the use of influence diagrams to specify which variables can be intervened on. One model introduced there, the parameter DAG, is similar to our mechanised SCM in that each object-level variable has a parameter variable which parametrises the distribution of the object-level variable. However, whilst acknowledging that there could be links between the parameter variables, that work does not consider them. In contrast, our focus is less on using influence diagrams as a tool for causal modelling, and more on modelling and discovering agents using causal methods. Further, we allow relationships between mechanism variables in our models, and elucidate their relation to the decision, chance and utility variables of the influence diagram representation. Halpern (2000) gives an axiomatisation of SCMs, generalising to cases where the structural equations may not have a unique (or any) solution. However, in the case of non-unique (or nonexistent) solutions, potential response variables are ill-defined, which White and Chalak (2009) claim prevents the desired causal discourse. They instead propose the settable systems framework, in which settable variables have a role-indicator argument that determines whether a variable's value is a response, determined by its structural equation, or a setting, determined by a hard intervention. Bongers et al. (2021) formalise statistical causal modelling using cyclic SCMs, proving that certain properties of the acyclic case do not hold in the cyclic case.
In our work, we use mechanised SCMs that can have cycles between mechanism variables. Zero, one or multiple solutions reflect the multiple equilibria arising in some games. Our formalism for mechanised SCMs follows the cyclic SCM treatment of Bongers et al. (2021). Correa and Bareinboim (2020) develop sigma-calculus for reasoning about the identification of the effects of soft interventions using observational, rather than experimental data. In our work, we assume access to experimental data, which makes the identification question trivial. Future work could relax this assumption to explore when agents can be discovered from observational data. Their regime indicators roughly correspond to our mechanism variables.
Our work draws on structure discovery in the causal discovery literature. See Glymour et al. (2019) for a review, and Forré and Mooij (2018) for an example of causal discovery of cyclic models. The usual focus in causal discovery is not to model agents, but rather to model some physical (agent-agnostic) system (modelling agents is usually done in the context of decision/game theory). Our work differs in that we use causal discovery in order to get a causal model representation of agents (a mechanised SCM), and can then translate that to the game-theoretic description in terms of game graphs with agents.
One of the most immediate applications of our results concerns the independent causal mechanisms (ICM) principle (Peters et al., 2017; Schölkopf et al., 2021). ICM states that:
1. Changing (intervening on) the causal mechanism P(X_i | pa_i) does not change any of the other mechanisms P(X_j | pa_j), j ≠ i.
2. Knowledge of P(X_i | pa_i) does not provide knowledge of P(X_j | pa_j) for any j ≠ i.
ICM argues that P(X_i | pa_i) typically describes a fixed and modular causal mechanism that does not respond to the mechanisms of other variables. The classic example is the distribution P(T | pa_T) of atmospheric temperature T given its causes pa_T, such as altitude. While the distribution P(pa_T) may vary between countries, P(T | pa_T) remains fixed, as it describes a physical law relating altitude (and other causes) to atmospheric temperature. In recent years ICM has become the predominant inductive bias used in causal machine learning, including causal and disentangled representations (Bengio et al., 2013; Locatello et al., 2019; Schölkopf, 2022), causal discovery (Janzing and Schölkopf, 2010), semi-supervised learning, adversarial vulnerability (Schott et al., 2018), and reinforcement learning (Bengio et al., 2019), and it has even played a role in major scientific discoveries, such as discovering the first exoplanet with atmospheric water (Foreman-Mackey et al., 2015). Our results provide a constraint on the applicability of the ICM principle: P(X_i | pa_i) does not obey the ICM principle if it is an agent's decision rule, or is strategically relevant to some agent's decision rule, as determined by Algorithm 2.
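The modularity claimed by ICM can be illustrated with a toy simulation. In the sketch below (our own construction; the lapse-rate figure, noise levels and helper names are invented for illustration), the marginal over altitude differs between two "countries" while the conditional mechanism for temperature given altitude stays fixed, so the same regression slope is recovered from both datasets:

```python
import random

def sample_scm(altitude_dist, n=10_000, seed=0):
    """Sample (altitude, temperature) pairs from a toy two-variable SCM.
    The mechanism P(T | A) is a fixed 'physical law' (temperature falls
    roughly 6.5 degrees C per km of altitude, plus noise); only the
    marginal P(A) varies between settings."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        alt_km = altitude_dist(rng)                   # mechanism for A
        temp = 15.0 - 6.5 * alt_km + rng.gauss(0, 1)  # mechanism for T | A
        samples.append((alt_km, temp))
    return samples

def fitted_slope(samples):
    """Ordinary least squares slope of temperature on altitude."""
    n = len(samples)
    ma = sum(a for a, _ in samples) / n
    mt = sum(t for _, t in samples) / n
    cov = sum((a - ma) * (t - mt) for a, t in samples)
    var = sum((a - ma) ** 2 for a, _ in samples)
    return cov / var

# Two countries with different altitude marginals P(A)...
netherlands = sample_scm(lambda r: r.uniform(0.0, 0.3))
austria = sample_scm(lambda r: r.uniform(0.3, 3.0))

# ...but a shared conditional mechanism P(T | A): both regressions
# recover approximately the same lapse rate of -6.5.
print(fitted_slope(netherlands), fitted_slope(austria))
```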
Condition 1 of ICM is true only if X_i is not strategically relevant to any agent, and condition 2 fails for agents themselves, as their mechanisms are correlated with the mechanisms of strategically relevant variables. This limits the applicability of ICM (and methods based on ICM) to systems whose data generating process includes no agents. Likely examples of agent-involving data are sociological data and data generated by reinforcement learning agents during training. However, our hope is that Algorithm 1 can be applied to identify mechanism edges that violate ICM, allowing ICM to be applied to the right systems, and in doing so improve the performance of ICM-based methods.
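The way an agent breaks condition 1 can be made concrete in a few lines. The following sketch (illustrative only; the function names and payoffs are ours) contrasts an agentic mechanism, a policy re-derived by best response to the downstream utility mechanism, with a non-agentic one that ignores it. Intervening on the utility mechanism changes the former but not the latter, which is the adaptation our definition of agency keys on:

```python
def best_response_policy(utility_fn, actions=(0, 1)):
    """The 'mechanism' of an agentic decision: a policy chosen to
    maximise the downstream utility mechanism."""
    return max(actions, key=utility_fn)

def fixed_policy(utility_fn, actions=(0, 1)):
    """A non-agentic mechanism: ignores how actions map to utility."""
    return actions[0]

reward_a = lambda d: 1.0 if d == 0 else 0.0  # original utility mechanism
reward_b = lambda d: 1.0 if d == 1 else 0.0  # intervened utility mechanism

# The agentic policy mechanism responds to the mechanism intervention,
# so its P(D | pa_D) is not modular in the ICM sense...
assert best_response_policy(reward_a) != best_response_policy(reward_b)
# ...whereas the non-agentic mechanism is unaffected.
assert fixed_policy(reward_a) == fixed_policy(reward_b)
```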

Conclusion
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key contribution is to formalise the idea that agents are systems that adapt their behaviour in response to changes in how their actions influence the world. Indeed, Algorithms 1 and 2 describe a precise experimental process that can, in principle and under some assumptions, be done to assess whether something is an agent. Our process is largely consistent with previous, informal characterisations of agents (e.g. Dennett, 1987;Flint, 2020;Garrabrant, 2021;Wiener, 1961), but making it formal enables agents and their incentives to be identified empirically or from the system architecture. Our process improves upon an earlier formalisation by Orseau et al. (2018), by better handling systems with a small number of actions and "accidentally optimal" systems (see Section 1.2 for details).
Causal modelling of AI systems is a tool of growing importance, and this paper grounds this area of work in causal discovery experiments. We have demonstrated the utility of our approach by improving the safety analysis of several AI systems (see Section 4). This should in turn improve the reliability of methods that build on such modelling, such as analyses of the safety and fairness of machine learning algorithms (see e.g. Ashurst et al., 2022; Cohen et al., 2021; Evans and Kasirzadeh, 2021; Everitt et al., 2021b; Farquhar et al., 2022; Langlois and Everitt, 2021; Richens et al., 2022).

A.1. Notation
We use roman capital letters, V, for variables and lower case, v, for their outcomes. We use bold type to indicate vectors of variables, 𝑽, and vectors of outcomes, 𝒗. Parent, child, ancestor and descendant variables of V are denoted Pa_V, Ch_V, Anc_V and Desc_V, respectively, with the family denoted Fa_V = Pa_V ∪ {V}. We use dom(V) and dom(𝑽) = ×_{V ∈ 𝑽} dom(V) to denote the sets of possible outcomes of V and 𝑽, respectively, which are assumed finite. Subscripts are reserved for denoting submodels and potential responses to an intervention.

A.2. Structural Causal Model
We begin with a standard definition of a structural causal model. An SCM has an associated directed graph, called a causal graph (CG), with an edge V → W if and only if the structural function f_W(Pa_W, E_W) depends on the value of V (as such, all our causal graphs are faithful (Pearl, 2009) by construction).
The subgraph of white nodes in Fig. 1c is an example of a CG.
In some parts of this work we consider acyclic (recursive) SCMs, in which the CG is acyclic; other parts consider possibly cyclic (nonrecursive) SCMs, in which the CG may contain cycles. See Bongers et al. (2021) for a foundational treatment of SCMs with cyclic CGs. They define a solution of an SCM as a set of exogenous and endogenous random variables, (E, 𝑽), for which the exogenous distribution matches that of the cyclic SCM and the structural equations are satisfied. For a solution (E, 𝑽), the distribution over the endogenous variables, Pr(𝑽), is called the observational distribution associated to it. In the cyclic case there can be zero, one or many observational distributions, due to the existence of different solutions of the structural equations. In this work we assume the existence of a unique solution, even in the case of a nonrecursive SCM; this unique solution then defines a joint distribution over the endogenous variables (Pearl, 2009).
SCMs model causal interventions that set variables to particular outcomes, captured by the following definition of a submodel.
Definition A.7 (SCM Submodel, Pearl, 2009). Let S = ⟨𝑽, E, F, Pr(E)⟩ be an SCM, let 𝒀 ⊆ 𝑽 be a set of endogenous variables, and let 𝒚 ∈ dom(𝒀) assign a value to each variable in that subset. The submodel S_𝒚 represents the effect of an intervention do(𝒀 = 𝒚) and is formally defined as the SCM S_𝒚 = ⟨𝑽, E, F_𝒚, Pr(E)⟩, where F_𝒚 = {f_V(Pa_V, E_V)}_{V ∈ 𝑽 \ 𝒀} ∪ {Y = y}_{Y ∈ 𝒀}. That is, the original functional relationships for 𝒀 are replaced with the constant functions 𝒀 = 𝒚.
We also assume the existence of a unique solution to the set of structural equations under all interventions, allowing us to define the potential response.
Definition A.8 (Potential Response, Pearl, 2009). Let S = ⟨𝑽, E, F, Pr(E)⟩ be an SCM and let 𝑿, 𝒀 ⊆ 𝑽. The potential response of 𝒀 to the intervention do(𝑿 = 𝒙), denoted 𝒀_𝒙(E), is the solution for 𝒀 in the set of equations F_𝒙; that is, 𝒀_𝒙(E) = 𝒀_{S_𝒙}(E), where S_𝒙 is the submodel resulting from the intervention do(𝑿 = 𝒙).
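Definitions A.7 and A.8 can be mirrored directly in code for the acyclic case. The sketch below (our own minimal implementation, not the paper's) represents an SCM as a map from variables to (parents, structural function) pairs; a submodel replaces intervened functions with constants, and the potential response is read off the solution of the intervened equations:

```python
def solve(scm, exogenous):
    """Solve an acyclic SCM given exogenous values: repeatedly evaluate
    any structural function whose arguments are already determined."""
    values = dict(exogenous)
    pending = dict(scm)
    while pending:
        progressed = False
        for var, (parents, fn) in list(pending.items()):
            if all(p in values for p in parents):
                values[var] = fn(*(values[p] for p in parents))
                del pending[var]
                progressed = True
        if not progressed:
            raise ValueError("cyclic or underdetermined SCM")
    return values

def submodel(scm, intervention):
    """Pearl's submodel S_y: replace the structural function of each
    intervened variable with a constant function (Definition A.7)."""
    new = dict(scm)
    for var, val in intervention.items():
        new[var] = ((), lambda v=val: v)
    return new

# Toy SCM:  X = E_X,  Y = X + 1,  Z = X + Y.
scm = {
    "X": (("E_X",), lambda e: e),
    "Y": (("X",), lambda x: x + 1),
    "Z": (("X", "Y"), lambda x, y: x + y),
}
obs = solve(scm, {"E_X": 2})                          # observational solution
resp = solve(submodel(scm, {"Y": 10}), {"E_X": 2})    # potential response to do(Y=10)
print(obs["Z"], resp["Z"])  # 5 12
```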

A.3. Structural Causal Game
We now introduce the (structural) causal game, which draws on the SCM, emphasising the structural causal dependencies present. A causal game has an associated graph:
Definition A.10 (Game Graph). Let M = ⟨N, 𝑽, E, {Pa_D}_{D ∈ 𝑫}, F, Pr(E)⟩ be a causal game. We define the game graph to be the structure G = (N, 𝑽 ∪ E, →), where N = {1, . . . , n} is a set of agents and (𝑽 ∪ E, →) is a DAG with:
• Four vertex types, 𝑽 ∪ E = 𝑿 ∪ 𝑫 ∪ 𝑼 ∪ E: chance nodes 𝑿 drawn as white circles, decision nodes 𝑫 as coloured squares, and utility nodes 𝑼 as coloured diamonds, with the different colours corresponding to different agents; and exogenous nodes, E, drawn as grey circles.
• Two types of edges: dependence edges, V → W, if and only if either W ∈ 𝑽 \ 𝑫 and V is an argument to the structural function f_W, i.e. V ∈ Pa_W ∪ {E_W}, or W = D ∈ 𝑫 and V = E_D; these are drawn as solid edges. Information edges, V ⇢ D, if and only if V ∈ Pa_D of the causal game; these are drawn as dashed edges.
One can also draw a simpler graph by omitting the exogenous variables and their outgoing edges from the game graph. Fig. 1b is an example of a game graph. We will only consider causal games for which the associated game graph is acyclic.
For each non-decision variable, the causal game specifies a distribution. For each decision variable, the causal game does not specify how it is distributed, only the information available at the time of the decision, as captured by Pa_D. The agents select their behaviour at each of their decision nodes, as follows. Let M be a causal game. A decision rule, π_D, for a decision variable D ∈ 𝑫 is a (measurable) structural function π_D : dom(Pa_D ∪ {E_D}) → dom(D), where E_D is uniformly distributed over the [0, 1] interval. A partial policy profile, π_𝑫′, is a set of decision rules π_D for each D ∈ 𝑫′ ⊆ 𝑫. A policy refers to π^i, the set of decision rules for all of agent i's decisions. A policy profile, π = (π^1, . . . , π^n), assigns a policy to every agent.
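Encoding a stochastic decision rule as a deterministic function of its parents together with a uniform exogenous variable E_D can be sketched as follows (a toy epsilon-greedy rule of our own invention; the parent, actions and epsilon are illustrative):

```python
import random

# A decision rule pi_D : dom(Pa_D ∪ {E_D}) -> dom(D).  The uniformly
# distributed exogenous variable E_D lets a single deterministic
# function represent a *stochastic* policy -- here an invented
# epsilon-greedy rule with epsilon = 0.1 over binary actions.
def pi_d(pa_d, e_d):
    if e_d < 0.9:
        return pa_d        # follow the observed parent ("greedy")
    return 1 - pa_d        # explore with probability 0.1

# Marginalising over E_D recovers the stochastic behaviour.
rng = random.Random(0)
freq = sum(pi_d(1, rng.random()) for _ in range(10_000)) / 10_000
print(freq)  # close to 0.9
```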
Definition A.11 (Optimality and Best Response, Koller and Milch, 2003). Let 𝑫′ ⊆ 𝑫 and let π be a policy profile. We say that a partial policy profile π̂_𝑫′ is optimal for the policy profile π = (π_{−𝑫′}, π̂_𝑫′) if, in the induced causal game M(π_{−𝑫′}), where the only remaining decisions are those in 𝑫′, the decision rule π̂_𝑫′ is optimal, i.e. for all partial policy profiles π′_𝑫′, 𝒰^i((π_{−𝑫′}, π̂_𝑫′)) ≥ 𝒰^i((π_{−𝑫′}, π′_𝑫′)).
Agent i's policy π̂^i is a best response to the partial policy profile π^{−i}, assigning strategies to the decisions of all other agents, if for all policies π^i, 𝒰^i((π^{−i}, π̂^i)) ≥ 𝒰^i((π^{−i}, π^i)).
In the game-theoretic setting with multiple agents, we typically take rational behaviour to be represented by a Nash equilibrium.
Definition A.12 (Nash Equilibrium, Koller and Milch, 2003). A policy profile π = (π^1, . . . , π^n) is a Nash equilibrium if, for all agents i, π^i is a best response to π^{−i}.
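The best-response condition makes pure Nash equilibria easy to enumerate by brute force in a small normal-form game. The sketch below (our own illustration; the coordination game and function names are not from the paper) also shows how multiple equilibria arise, echoing the earlier point that mechanised SCMs may have multiple solutions:

```python
from itertools import product

def nash_equilibria(payoffs, actions=(0, 1)):
    """Brute-force the pure Nash equilibria of a two-player normal-form
    game, where payoffs[(a1, a2)] = (u1, u2).  A profile is an
    equilibrium iff each action is a best response to the other."""
    equilibria = []
    for a1, a2 in product(actions, repeat=2):
        u1, u2 = payoffs[(a1, a2)]
        br1 = all(u1 >= payoffs[(b, a2)][0] for b in actions)
        br2 = all(u2 >= payoffs[(a1, b)][1] for b in actions)
        if br1 and br2:
            equilibria.append((a1, a2))
    return equilibria

# A coordination game: both players want to match -- two equilibria.
coordination = {
    (0, 0): (1, 1), (0, 1): (0, 0),
    (1, 0): (0, 0), (1, 1): (1, 1),
}
print(nash_equilibria(coordination))  # [(0, 0), (1, 1)]
```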
In this paper we consider the refined concept of subgame perfect equilibrium (SPE), as follows.
Definition A.13 (Subgame Perfect Equilibrium, Hammond et al., 2021, forthcoming). A policy profile π = (π^1, . . . , π^n) is a subgame perfect equilibrium if, in every subgame, π is a Nash equilibrium.
Informally, in any subgame, the rational response is independent of variables outside of the subgame. See Hammond et al. (2021, forthcoming) for the formal definition of a subgame in a causal game.

B. Getting a mechanised SCM by Marginalisation and Merging
A bandit algorithm repeatedly chooses an arm A_t and receives a reward R_t. We can represent two iterations by using time indices. If we include the mechanisms in the graph, we can model the fact that the policy at time 2, Ã_2, depends on the arm and outcome at time 1. To arrive at the final mechanism graph, we first marginalise A_1 and R_1. The path from Ã_1 to Ã_2 previously mediated by A_1 now becomes a direct edge.
[Figure: mechanised causal graph after marginalising A_1 and R_1.] Finally, we merge Ã_1 with Ã_2 and R̃_1 with R̃_2, with the understanding that observing the merged node Ã corresponds to observing Ã_2, while intervening on Ã means setting both Ã_1 and Ã_2. This yields the final mechanised causal graph (note the terminal edge due to R_2 not having any children). Applying Algorithm 2 then yields the expected game graph.
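The time-2 policy's dependence on the first arm and reward, the edges into Ã_2 discussed above, can be simulated directly. This sketch uses invented arm means and a keep-if-it-paid rule of our own choosing:

```python
import random

def two_step_bandit(arm_means, seed=0):
    """Two iterations of a toy bandit.  The time-2 decision rule is a
    function of (A1, R1): keep the first arm if it paid off, otherwise
    switch.  This realises the A1 -> A2 and R1 -> A2 dependencies in
    the two-step graph."""
    rng = random.Random(seed)
    a1 = rng.choice([0, 1])               # first arm pulled uniformly
    r1 = rng.gauss(arm_means[a1], 0.1)    # reward mechanism at time 1
    a2 = a1 if r1 > 0.5 else 1 - a1       # time-2 policy reads A1 and R1
    r2 = rng.gauss(arm_means[a2], 0.1)
    return a1, r1, a2, r2

# Intervening on the reward mechanism (swapping which arm is good)
# changes the distribution of A2 -- the dependence of the policy on the
# reward mechanism that survives marginalising out (A1, R1).
a2_when_arm0_good = [two_step_bandit({0: 1.0, 1: 0.0}, seed=s)[2] for s in range(100)]
a2_when_arm1_good = [two_step_bandit({0: 0.0, 1: 1.0}, seed=s)[2] for s in range(100)]
print(sum(a2_when_arm0_good), sum(a2_when_arm1_good))  # nearly 0 and nearly 100
```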

C. Example of relativism of variable types
This example illustrates the discussion of Section 5.1 on the relativism of variable types. Whether a variable gets classified as a decision, chance, or utility node by Algorithm 2 depends on which other nodes are included in the graph. To see this, consider the graph in Fig. 9, in which a blueprint for a thermometer, B, influences the constructed thermometer, T, and thereby whether the reading is correct or not, C.
Figure 9 | What is a decision or a utility node depends on what other variables are included: (a) include B, T, and C; (b) include T and C; (c) include B and T. Here the variables represent a blueprint for a thermometer (B), the constructed thermometer (T), and whether the reading is correct (C).
Consider first Fig. 9a, where a first modeler has included all three variables. We find that the designer would produce a different blueprint if they were aware that blueprints are interpreted according to a different convention (i.e., if T̃ changes), or if temperature were measured at a different scale (a change to C̃). Accordingly, Algorithm 2 labels B a decision and C a utility. This makes sense in this context: the designer chooses a blueprint to ensure that the thermometer gives a correct reading.
A second modeler may not care about the blueprint, and only wonder about the relationship between the produced thermometer T and the correctness of the reading C; see Fig. 9b. They will find that if temperature were measured at a different scale, a slightly different thermometer would have been produced, i.e. C̃ now influences T̃ rather than B̃ (as in Fig. 9a). This is not a contradiction, as T̃ is a different object in Figs. 9a and 9b: in Fig. 9a, T̃ represents the relationship between B and T, while in Fig. 9b it represents the marginal distribution of T. As a consequence, Algorithm 2 will label T a decision optimising C. This is not unreasonable: a decision was made to produce a particular kind of thermometer with the aim of getting correct temperature readings.
A third modeler may not bother to represent the correctness of the readings explicitly, and only consider the blueprint B and the produced thermometer T; see Fig. 9c. They will find that the blueprint is optimised to obtain a particular kind of thermometer. Again, this is not unreasonable, as in this context we may well speak of the designer deciding on a blueprint that will produce the right kind of thermometer.