Thermodynamics of computing with circuits

Digital computers implement computations using circuits, as do many naturally occurring systems (e.g., gene regulatory networks). The topology of any such circuit restricts which variables may be physically coupled during the operation of the circuit. We investigate how such restrictions on the physical coupling affect the thermodynamic costs of running the circuit. To do this we first calculate the minimal additional entropy production that arises when we run a given gate in a circuit. We then build on this calculation to analyze how the thermodynamic costs of implementing a computation with a full circuit, comprising multiple connected gates, depend on the topology of that circuit. This analysis provides a rich new set of optimization problems that must be addressed by any designer of a circuit who wishes to minimize thermodynamic costs.


I. INTRODUCTION
A long-standing focus of research in the physics community has been how the energetic resources required to perform a given computation depend on that computation. This issue is sometimes referred to as the "thermodynamics of computation" or the "physics of information" [1][2][3]. Similarly, a central focus of computer science theory has been how the minimal computational resources needed to perform a given computation depend on that computation [4,5]. (Indeed, some of the most important open issues in computer science, like whether P = NP, concern the relationship between a computation and its resource requirements.) Reflecting this commonality of interests, there was a burst of early research relating the resource concerns of computer science theory with the resource concerns of thermodynamics [6][7][8][9] [10].
Starting a few decades after this early research, there was dramatic progress in our understanding of non-equilibrium statistical physics [2,11,12,13,14,15], which has resulted in new insights into the thermodynamics of computation [2,3,13,16]. In particular, recent research has derived the "(generalized) Landauer bound" [17][18][19][20][21][22], which states that the heat generated by a thermodynamically reversible process that sends an initial distribution $p_0(x_0)$ to an ending distribution $p_1(x_1)$ is $kT[S(p_0) - S(p_1)]$ (where $S(p)$ indicates the entropy of distribution $p$, $T$ is the temperature of the single bath, and $k$ is Boltzmann's constant).
Almost all of this work on the Landauer bound assumes that the map taking initial states to final states, $P(x_1|x_0)$, is implemented with a monolithic, "all-at-once" physical process, jointly evolving all of the variables in the system at once. In contrast, for purely practical reasons modern computers are built out of circuits, i.e., they are built out of networks of "gates", each of which evolves only a small subset of the variables of the full system [4,5]. An example of a simple circuit that computes the parity of 3 input bits using two XOR gates, and which we will return to throughout this paper, is illustrated in Fig. 1.
Similarly, in the natural world, biological cellular regulatory networks carry out complicated computations by decomposing
* Also at Complexity Science Hub, Vienna; Arizona State University, Tempe, Arizona.
As elaborated below, there are two major, unavoidable thermodynamic effects of implementing a given computation with a circuit of gates rather than with an all-at-once process:
I) Suppose we build a circuit out of gates which were manufactured without any specific circuit in mind. Consider such a gate that implements bit erasure, and suppose that it is thermodynamically reversible if $p_0$ is uniform. So by the Landauer bound, it will generate heat $kT S(p_0) = kT \ln 2$ if run on a uniform distribution. Now in general, depending on where such a bit-erasing gate appears in a circuit, the actual initial distribution of its states, $p'_0$, will be non-uniform. This not only changes the Landauer bound for that gate from $kT \ln 2$ to $kT S(p'_0)$; it is now known that since the gate is thermodynamically reversible for $p_0 \neq p'_0$, running that gate on $p'_0$ will not be thermodynamically reversible [30]. So the actual heat generated by running that gate will exceed the associated value of the Landauer bound, $kT S(p'_0)$.
II) Suppose the circuit is built out of two bit-erasing gates, and that each gate is thermodynamically reversible on a uniform input distribution when run separately from the circuit. If the marginal distributions over the initial states of the gates are both uniform, then the heat generated by running each of them is $kT \ln 2$, and therefore the total generated heat is $2kT \ln 2$. Suppose though that there is nonzero statistical coupling between their states under their initial joint distribution. Then as elaborated below, even though each of the gates run separately is thermodynamically reversible, running them in parallel is not thermodynamically reversible. So running them generates extra heat beyond the minimum given by applying the Landauer bound to the dynamics of the full joint distribution [31].
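Effect (II) can be made concrete with a small numerical sketch. The code below is illustrative only: the function names are hypothetical, heat is expressed in units of kT, and it assumes that each gate, run separately, generates heat equal to the entropy of its own marginal, while the joint Landauer bound for erasing both bits is the entropy of the joint distribution. Under those assumptions the gap between the two is the mutual information between the bits.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of an iterable of probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)

def extra_heat_two_erasers(joint):
    """Extra heat, in units of kT, from erasing two bits with separate gates.

    joint[a][b] is the initial joint probability of the bit pair (a, b).
    Assumption: each gate run alone generates kT * S(marginal), whereas the
    Landauer bound for erasing both bits jointly is kT * S(joint).
    """
    p1 = [sum(row) for row in joint]          # marginal of the first bit
    p2 = [sum(col) for col in zip(*joint)]    # marginal of the second bit
    separate = entropy(p1) + entropy(p2)
    joint_bound = entropy([x for row in joint for x in row])
    return separate - joint_bound             # mutual information I(X1; X2)

independent = [[0.25, 0.25], [0.25, 0.25]]
correlated = [[0.5, 0.0], [0.0, 0.5]]  # uniform marginals, bits always equal

print(extra_heat_two_erasers(independent))
print(extra_heat_two_erasers(correlated))
```

In the correlated case both marginals are uniform, so each gate still generates kT ln 2, but the joint bound is only kT ln 2 rather than 2kT ln 2.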
These two effects mean that the thermodynamic cost of running a given computation with a circuit will in general vary greatly depending on the precise circuit we use to implement that computation. In the current paper we analyze this dependence.
We make no restriction on the input-output maps computed by each gate in the circuit. They can be either deterministic (i.e., single-valued) or stochastic, logically reversible (i.e., implementing a deterministic permutation of the system's state space, as in Fredkin gates [6]) or not, etc. However, to ground thinking, the reader may imagine that the circuit being considered is a Boolean circuit, where each gate performs one of the usual single-valued Boolean functions, like logical AND gates, XOR gates, etc.
For simplicity, in this paper we focus on circuits whose topology does not contain loops [5,32], such as the circuit shown in Fig. 1.

A. Contributions
We have four primary contributions.
1) We derive exact expressions for how the entropy flow (EF) and entropy production (EP) produced by a fixed dynamical system vary as one changes the initial distribution of states of that system. These expressions capture effect (I) described above. (These expressions extend an earlier analysis [30].)
2) We introduce "solitary processes". These are a type of physical process that can implement any particular gate in a circuit while respecting the constraints on what variables in the rest of the circuit that gate is coupled with. We can use the thermodynamic properties of solitary processes to analyze effect (II) described above.
3) We combine our first two contributions to analyze the thermodynamic costs of implementing circuits in a "serial-reinitializing" (SR) manner. This means two things: the gates in the circuit are run one at a time, so each gate is run as a solitary process; and after a gate is run its input wires are reinitialized, allowing for subsequent reuse of the circuit. In particular, we derive expressions relating the minimal EP generated by running an SR circuit to information-theoretic quantities associated with the wiring diagram of the circuit.
4) Our last contribution is an expression for the extra EP that arises in running an SR circuit if the initial state distributions at its gates differ from the ones that result in minimal EP for each of those gates. This expression involves an information-theoretic function that we call "multi-divergence", which appears to be new to the literature.

B. Roadmap
In Section II A we introduce general notation, and then provide a minimal summary of the parts of stochastic thermodynamics, information theory and circuit theory that will be used in this paper. We also introduce the definition of the "islands" of a stochastic matrix in that section, which will play a central role in our analysis. In Section III we derive an exact expression for how the EF and EP of an arbitrary process depend on its initial state distribution. In Section IV we introduce solitary processes and then analyze their thermodynamics. In Section V we introduce SR circuits. In Section VI we use the tools developed in the previous sections to analyze the thermodynamic properties of SR circuits. In Section VII we discuss related earlier work. Section VIII concludes and presents some directions for future work. All proofs that are longer than several lines are collected in the appendices.

II. BACKGROUND
Because the analysis of the thermodynamics of circuits involves tools from multiple fields, we review those tools in this section. We also introduce some new mathematical structures that will be central to our analysis, in particular "islands". We begin by introducing notation.

A. General notation
We write a Kronecker delta as $\delta(a, b)$. We write a random variable with an upper case letter (e.g., $X$), and the associated set of possible outcomes with the associated calligraphic letter (e.g., $\mathcal{X}$). A particular outcome of a random variable is written with a lower case letter (e.g., $x$). We also use lower case letters like $p$, $q$, etc. to indicate probability distributions.
We use $\Delta_{\mathcal{X}}$ to indicate the set of probability distributions over a set of outcomes $\mathcal{X}$. For any distribution $p \in \Delta_{\mathcal{X}}$, we use $\mathrm{supp}\, p := \{x \in \mathcal{X} : p(x) > 0\}$ to indicate the support of $p$. Given a distribution $p$ over $\mathcal{X}$ and any $Z \subseteq \mathcal{X}$, we write $p(Z) = \sum_{x \in Z} p(x)$ to indicate the probability that the outcome of $X$ is in $Z$. Given a function $f : \mathcal{X} \to \mathbb{R}$, we write $\mathbb{E}_p[f]$ to indicate $\sum_x p(x) f(x)$, the expectation of $f$ under distribution $p$.
Given any conditional distribution $P(y|x)$ of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$, and some distribution $p$ over $\mathcal{X}$, we write $Pp$ for the distribution over $\mathcal{Y}$ induced by applying $P$ to $p$:
$$[Pp](y) := \sum_{x \in \mathcal{X}} P(y|x)\, p(x). \tag{1}$$
We will sometimes use the term "map" to refer to a conditional distribution.
We say that a conditional distribution $P$ is "logically reversible" if it is deterministic (the entries of $P(y|x)$ are 0/1-valued for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$) and if there do not exist distinct $x, x' \in \mathcal{X}$ and $y \in \mathcal{Y}$ such that $P(y|x) > 0$ and $P(y|x') > 0$. When $\mathcal{Y} = \mathcal{X}$, a logically reversible $P$ is simply a permutation matrix. Given any subset of states $Z \subseteq \mathcal{X}$, we also say that $P$ is "logically reversible over $Z$" if the entries $P(y|x)$ are 0/1-valued for all $x \in Z$ and $y \in \mathcal{Y}$, and there do not exist distinct $x, x' \in Z$ and $y \in \mathcal{Y}$ such that $P(y|x) > 0$ and $P(y|x') > 0$.
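As an illustration of this notation, the following sketch represents a conditional distribution as a column-stochastic NumPy array and checks logical reversibility. The array layout (`P[y, x]` holding P(y|x)) and the function names are assumptions made for this example, not part of the paper's formalism.

```python
import numpy as np

def apply_map(P, p):
    """Distribution over outputs induced by applying P to p.

    Assumed layout: P[y, x] = P(y|x), so every column of P sums to 1.
    """
    return P @ p

def is_logically_reversible(P, Z=None):
    """Check logical reversibility of P over input states Z (default: all)."""
    cols = P if Z is None else P[:, list(Z)]
    deterministic = np.all((cols == 0) | (cols == 1))
    # No output y may be reachable from two distinct inputs in Z.
    injective = np.all((cols > 0).sum(axis=1) <= 1)
    return bool(deterministic and injective)

bit_flip = np.array([[0.0, 1.0], [1.0, 0.0]])  # permutation of {0, 1}
erase = np.array([[1.0, 1.0], [0.0, 0.0]])     # both inputs map to state 0

print(apply_map(erase, np.array([0.3, 0.7])))
print(is_logically_reversible(bit_flip))
print(is_logically_reversible(erase))
print(is_logically_reversible(erase, Z=[0]))
```

The last call shows the restricted notion: bit erasure is not logically reversible over the full state space, but it is over a subset containing a single input state.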
We write a multivariate random variable with components $V = \{1, 2, \dots\}$ as $X_V = (X_1, X_2, \dots)$, with outcomes $x_V$. We will also use upper case letters (e.g., $A, V, \dots$) to indicate sets of variables. For any subset $A \subseteq V$ we use the random variable $X_A$ (and its outcomes $x_A$) to refer to the components of $X_V$ indexed by $A$. Similarly, for a distribution $p_V$ over $X_V$, we write the marginal distribution over $X_A$ as $p_A$. For a singleton set $\{a\}$, we slightly abuse notation and write $X_a$ instead of $X_{\{a\}}$.

B. Stochastic thermodynamics
We will consider a circuit to be a physical system in contact with one or more thermodynamic reservoirs (heat baths, chemical baths, etc.). The system evolves over some time interval (sometimes implicitly taken to be $t \in [0, 1]$, where the units of time are arbitrary), possibly while being driven by a work reservoir. We refer to the set of thermodynamic reservoirs and the driving (and, in particular, the stochastic dynamics they induce over the system during $t \in [0, 1]$) as a physical process.
We use $\mathcal{X}$ to indicate the finite state space of the system. Physically, the states $x \in \mathcal{X}$ can either be microstates or they can be coarse-grained macrostates under some additional assumptions (e.g., that all macrostates have the same "internal entropy" [2,20,33]). While we are ultimately interested in the special case where the system is a circuit with a set of nodes $V$, the review in this section is more general.
While much of our analysis applies more broadly, to make things concrete one may imagine that the system undergoes master equation dynamics, also known as a continuous-time Markov chain (CTMC), as often used in stochastic thermodynamics to model discrete-state physical systems. In this subsection we briefly review stochastic thermodynamics, referring the reader to [34,35] for more details.
Under a CTMC, the probability distribution over $\mathcal{X}$ at time $t$, indicated by $p_t$, evolves according to the master equation
$$\frac{d}{dt} p_t(x') = \sum_{x} p_t(x)\, K_t(x \to x'),$$
where $K_t$ is the rate matrix at time $t$. For any rate matrix $K_t$, the off-diagonal entries $K_t(x \to x')$ (for $x \neq x'$) indicate the rate at which probability flows from state $x$ to $x'$, while the diagonal entries are fixed by $K_t(x \to x) = -\sum_{x' \neq x} K_t(x \to x')$, which guarantees conservation of probability. If the system is connected to multiple thermodynamic reservoirs indexed by $\alpha$, the rate matrix can be further decomposed as
$$K_t(x \to x') = \sum_{\alpha} K^{\alpha}_t(x \to x'),$$
where $K^{\alpha}_t$ is the rate matrix at time $t$ corresponding to reservoir $\alpha$.
The term entropy flow (EF) refers to the increase of entropy in all coupled reservoirs. The instantaneous rate of EF out of the system at time $t$ is defined as
$$\dot{Q} = \sum_{\alpha} \sum_{x, x'} p_t(x)\, K^{\alpha}_t(x \to x') \ln \frac{K^{\alpha}_t(x \to x')}{K^{\alpha}_t(x' \to x)}. \tag{3}$$
The overall EF incurred over the course of the entire process is $Q = \int_0^1 \dot{Q}\, dt$. The term entropy production (EP) refers to the overall increase of entropy, both in the system and in all coupled reservoirs. The instantaneous rate of EP at time $t$ is defined as
$$\dot{\sigma} = \sum_{\alpha} \sum_{x, x'} p_t(x)\, K^{\alpha}_t(x \to x') \ln \frac{p_t(x)\, K^{\alpha}_t(x \to x')}{p_t(x')\, K^{\alpha}_t(x' \to x)}. \tag{4}$$
The overall EP incurred over the course of the entire process is $\sigma = \int_0^1 \dot{\sigma}\, dt$. Note that we use terms like "EF" and "EP" to refer to either the associated rate or the associated integral over a non-infinitesimal time interval; the context should always make the precise meaning clear.
Given some initial distribution $p$, the EF, EP, and the drop in the entropy of the system from the beginning to the end of the process are related according to
$$\sigma(p) = Q(p) - [S(p) - S(Pp)]. \tag{5}$$
In general, the EF can be written as the expectation $Q(p) = \sum_x p(x)\, q(x)$, where $q(x)$ indicates the expected EF arising from trajectories that begin on state $x$. Given that the drop in entropy is a nonlinear function of $p$, while the expectation $Q(p)$ is a linear function of $p$, Eq. (5) tells us that EP will generally be a nonlinear function of $p$. Note that if $P$ is logically reversible, then $S(p) = S(Pp)$ and therefore EF and EP will be equal for any $p$.
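The relation between EF, EP and the entropy drop can be checked numerically on a toy system. The sketch below is not taken from the paper: it assumes a two-state, time-homogeneous CTMC coupled to a single reservoir, with arbitrarily chosen rates, uses a plain Euler integrator, and stores the rate of the transition x → x' in `K[x, x']`. All off-diagonal rates and all probabilities are assumed strictly positive so that the logarithms are defined.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def integrate_ctmc(K, p0, n_steps=20000):
    """Euler-integrate dp/dt = p K over t in [0, 1].

    K[x, x'] is the transition rate x -> x' (rows sum to zero).
    Returns the final distribution, the integrated EF and the integrated EP.
    """
    dt = 1.0 / n_steps
    p = np.array(p0, dtype=float)
    off = ~np.eye(len(p), dtype=bool)      # mask selecting off-diagonal entries
    ef = ep = 0.0
    for _ in range(n_steps):
        flux = p[:, None] * K              # flux[x, x'] = p(x) K(x -> x')
        ef += dt * (flux[off] * np.log(K[off] / K.T[off])).sum()
        ep += dt * (flux[off] * np.log(flux[off] / flux.T[off])).sum()
        p = p + dt * (p @ K)
    return p, ef, ep

K = np.array([[-2.0, 2.0], [1.0, -1.0]])   # illustrative rates only
p0 = np.array([0.9, 0.1])
p1, ef, ep = integrate_ctmc(K, p0)

print("EP:", ep)
print("EF minus entropy drop:", ef - (entropy(p0) - entropy(p1)))
```

Up to discretization error, the two printed numbers should agree, and the EP should be non-negative.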
While the EF can be positive or negative, the log-sum inequality can be used to prove that EP for master equation dynamics is non-negative [15,36]:
$$\sigma(p) \geq 0. \tag{6}$$
This can be viewed as a derivation of the second law of thermodynamics, given the assumption that our system is evolving forward in time as a CTMC. All of these results are purely mathematical and hold for any CTMC dynamics, even in contexts having nothing to do with physical systems. However, these results can be interpreted in thermodynamic terms when each $K^{\alpha}_t$ obeys local detailed balance (LDB) with regard to thermodynamic reservoir $\alpha$ [3,15,34]. Consider a system with Hamiltonian $H_t(\cdot)$ at time $t$, and let $\alpha$ label a heat bath whose inverse temperature is $\beta_{\alpha}$. Then, $K^{\alpha}_t$ will obey LDB when for all $x, x' \in \mathcal{X}$, either $K^{\alpha}_t(x \to x') = K^{\alpha}_t(x' \to x) = 0$, or
$$\frac{K^{\alpha}_t(x \to x')}{K^{\alpha}_t(x' \to x)} = e^{\beta_{\alpha} [H_t(x) - H_t(x')]}.$$
If LDB holds, then EF can be written as [35]
$$Q = \sum_{\alpha} \beta_{\alpha} Q_{\alpha},$$
where $Q_{\alpha}$ is the expected amount of heat transferred from the system into bath $\alpha$ during the process.
We end with two caveats concerning the use of stochastic thermodynamics to analyze real-world circuits. First, many of the processes described in this paper require that some transition rates be exactly zero at some moments. In many physical models this implies there are infinite energy barriers at those times. In addition, perfectly carrying out any deterministic map (such as bit erasure) requires the use of infinite energy gaps between some states at some times. Thus, as is conventional (though implicit) in much of the thermodynamics of computation literature, the thermodynamic costs derived in this paper should be understood as limiting values.
Second, there are some conditional distributions that take the system state at time 0 to its state at time 1, $P(x_1|x_0)$, that cannot be implemented by any CTMC [37,38]. For example, one cannot carry out (or even approximate) a simple bit flip $P(x_1|x_0) = 1 - \delta(x_1, x_0)$ with a CTMC. Now, we can design a CTMC to implement any given $P(x_1|x_0)$ to arbitrary precision, if the dynamics is expanded to include a set of "hidden states" in addition to the states in $\mathcal{X}$ [21,22]. However, as we explicitly demonstrate below, SR circuits can be implemented without introducing any such hidden states; this is one of their advantages. (See also Example 9 in Appendix A.)

C. Information theory
Given two distributions $p$ and $r$ over random variable $X$, we use notation like $S(p)$ for Shannon entropy and $D(p \| r)$ for Kullback-Leibler (KL) divergence. We write $S(Pp)$ to refer to the entropy of the distribution over $Y$ induced by $p(x)$ and the conditional distribution $P$, as defined in Eq. (1), and similarly for other information-theoretic measures. Given two random variables $X$ and $Y$ with joint distribution $p$, we write $S(p(X|Y))$ for the conditional entropy of $X$ given $Y$, and $I_p(X; Y)$ for the mutual information (we drop the subscript $p$ where the distribution is clear from context). All information-theoretic measures are in nats.
Some of our results below are formulated in terms of an extension of mutual information to more than two random variables that is known as "total correlation" or multi-information [39]. For a random variable $X_A = (X_1, X_2, \dots)$, the multi-information is defined as
$$\mathcal{I}(p_A) := \Big[ \sum_{v \in A} S(p_v) \Big] - S(p_A).$$
Some of the results reviewed below are formulated in terms of the multi-divergence between two probability distributions over the same multi-dimensional space. This is a recently introduced information-theoretic measure which can be viewed as an extension of multi-information to include a reference distribution. Given two distributions $p_A$ and $r_A$ over $X_A$, the multi-divergence is defined as
$$\mathcal{D}(p_A \| r_A) := D(p_A \| r_A) - \sum_{v \in A} D(p_v \| r_v).$$
Multi-divergence measures how much of the divergence between $p_A$ and $r_A$ arises from the correlations among the variables $X_1, X_2, \dots$, rather than from the marginal distributions of each variable considered separately. See App. A of [3] for a discussion of the elementary properties of multi-divergence and its relation to conventional multi-information. Note that multi-divergence is defined with "the opposite sign" of multi-information, i.e., by subtracting a sum of terms involving marginal variables from a term involving the joint random variable, rather than vice versa.
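The two quantities just described can be computed directly for small joint distributions. The following is a minimal sketch, assuming the joint distribution is stored as a NumPy array with one axis per variable and that the reference distribution is strictly positive wherever the first distribution is; the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl(p, r):
    """KL divergence D(p || r) in nats; assumes r > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    r = np.asarray(r, dtype=float).ravel()
    m = p > 0
    return float((p[m] * np.log(p[m] / r[m])).sum())

def marginals(p):
    """Single-variable marginals of a joint array with one axis per variable."""
    axes = range(p.ndim)
    return [p.sum(axis=tuple(a for a in axes if a != v)) for v in axes]

def multi_information(p):
    """Sum of marginal entropies minus the joint entropy."""
    return sum(entropy(m) for m in marginals(p)) - entropy(p)

def multi_divergence(p, r):
    """Joint KL divergence minus the sum of marginal KL divergences."""
    return kl(p, r) - sum(kl(pm, rm) for pm, rm in zip(marginals(p), marginals(r)))

p = np.array([[0.5, 0.0], [0.0, 0.5]])   # two perfectly correlated bits
r = np.full((2, 2), 0.25)                # uniform product reference

print(multi_information(p))
print(multi_divergence(p, r))
```

In this example the reference distribution is a product distribution, in which case the multi-divergence coincides with the multi-information of the first distribution (both equal ln 2 here).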

D. 'Island' decomposition of a conditional distribution
A central part of our analysis will involve the equivalence relation,
$$x \sim x' \iff \exists y : P(y|x) > 0,\ P(y|x') > 0. \tag{11}$$
In words, $x \sim x'$ if there is a non-zero probability of transitioning to some state $y$ from both $x$ and $x'$ under the conditional distribution $P(y|x)$. We define an island of the conditional distribution $P(y|x)$ as any connected subset of $\mathcal{X}$ given by the transitive closure of this equivalence relation. The set of islands of any $P(\cdot|\cdot)$ forms a partition of $\mathcal{X}$, which we write as $L(P)$. We will also use the notion of the islands of the conditional distribution $P$ restricted to some subset of states $Z \subseteq \mathcal{X}$. We write $L_Z(P)$ to indicate the partition of $Z$ generated by the transitive closure of the relation given by Eq. (11) for $x, x' \in Z$. Note that in this notation, $L(P) = L_{\mathcal{X}}(P)$.
As an example, if $P(y|x) > 0$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ (i.e., any final state $y$ can be reached from any initial state $x$ with non-zero probability), then $L(P)$ contains only a single island. As another example, if $P(y|x)$ implements a deterministic function $f : \mathcal{X} \to \mathcal{Y}$, then $L(P)$ is the partition of $\mathcal{X}$ given by the pre-images of $f$, $L(P) = \{f^{-1}(y) : y \in \mathcal{Y}\}$. For example, the conditional distribution that implements the logical AND operation, $P(c|a, b) = \delta(c, a \wedge b)$, has two islands, corresponding to $(a, b) \in \{(0, 0), (0, 1), (1, 0)\}$ and $(a, b) \in \{(1, 1)\}$, respectively. As a final example, let $P$ be the following conditional distribution: where the rows and columns correspond to the ordered states $\mathcal{X} = \mathcal{Y} = \{00, 01, 10, 11\}$. The island decomposition for this map is illustrated in Fig. 2 (left). We also show the island decomposition for this map restricted to the subset of states $Z = \{00, 01\}$, in Fig. 2 (right).
For any distribution $p$ over $\mathcal{X}$, any $Z \subseteq \mathcal{X}$, and any $c \in L_Z(P)$, $p(c) = \sum_{x \in c} p(x)$ is the probability that the state of the system is contained in island $c$. It will be helpful to use the unusual notation $p^c(x)$ to indicate the conditional probability of $x$ within island $c$. Formally, $p^c(x) = p(x)/p(c)$ if $x \in c$, and $p^c(x) = 0$ otherwise.
Intuitively, the islands of a conditional distribution are "firewalled" subsystems, both computationally and thermodynamically isolated from one another for the duration of the process implementing that conditional distribution. In particular, we will show below that the EP of running P (y|x) on an initial distribution p can be written as a weighted sum of the EPs involved in running P on each separate island c ∈ L(P ), where the weight for island c is given by p(c).
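The island decomposition can be computed mechanically by merging any two input states that share a reachable output. The sketch below uses a small union-find structure; it assumes the conditional distribution is stored as an array with `P[y, x]` holding P(y|x), and the function name `islands` is hypothetical.

```python
import numpy as np

def islands(P, Z=None):
    """Partition the input states Z into the islands of P (P[y, x] = P(y|x))."""
    Z = list(range(P.shape[1])) if Z is None else list(Z)
    parent = {x: x for x in Z}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Merge all inputs in Z that can reach the same output y.
    for y in range(P.shape[0]):
        reach = [x for x in Z if P[y, x] > 0]
        for x in reach[1:]:
            parent[find(x)] = find(reach[0])

    groups = {}
    for x in Z:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Logical AND: inputs ordered (00, 01, 10, 11), outputs ordered (0, 1).
AND = np.array([[1.0, 1.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])

print(islands(AND))             # inputs 00, 01, 10 together; 11 alone
print(islands(AND, Z=[0, 3]))   # restricted to the states 00 and 11
print(islands(np.eye(4)))       # identity map: every state is its own island
```

Because merging is done through shared outputs, the result is the transitive closure of the pairwise relation rather than the pairwise relation itself.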

E. Circuit theory
For the purposes of this paper, a (logical) circuit is a special type of Bayes net [40][41][42]. Specifically, we define any circuit $\Phi$ as a tuple $(V, E, F, \mathcal{X}_V)$. The pair $(V, E)$ specifies the vertices and edges of a directed acyclic graph (DAG). (We sometimes call this DAG the wiring diagram of the circuit.) $\mathcal{X}_V$ is a Cartesian product $\times_{v} \mathcal{X}_v$, where each $\mathcal{X}_v$ is the set of possible states associated with node $v$. $F$ is a set of conditional distributions, indicating the logical maps implemented at the non-root nodes of the DAG.
Following the convention in the Bayes nets literature, we orient edges in the direction of information flow. Thus, the inputs to the circuit are the roots of the associated DAG and the outputs are the leaves of the DAG [43]. Without loss of generality, we assume that each node $v$ has a special "initialized state", indicated as ∅.
We use the term gate to refer to any non-root node, input node to refer to any root node, and output node or output gate to refer to a leaf node. For simplicity, we assume that all output nodes are gates, i.e., there is no root node which is also a leaf node. We write IN and $x_{\mathrm{IN}}$ to indicate the set of input nodes and their joint state, and similarly write OUT and $x_{\mathrm{OUT}}$ for the output nodes.
We write the set of all gates in a given circuit as G ⊆ V , and use g ∈ G to indicate a particular gate. We indicate the set of all nodes that are parents of gate g as pa(g). We indicate the set of nodes that includes gate g and all parents of g as n(g) := {g} ∪ pa(g).
As mentioned, $F$ is a set of conditional distributions, indicating the logical maps implemented by each gate of the circuit. The element of $F$ corresponding to gate $g$ is written as $\pi_g(x_g | x_{\mathrm{pa}(g)})$. In conventional circuit theory, each $\pi_g$ is required to be deterministic (i.e., 0/1-valued). However, we make no such restriction in this paper. We write the overall conditional distribution of output gates given input nodes implemented by the circuit $\Phi$ as
$$\pi_{\Phi}(x_{\mathrm{OUT}} | x_{\mathrm{IN}}) = \sum_{x_{G \setminus \mathrm{OUT}}} \prod_{g \in G} \pi_g(x_g | x_{\mathrm{pa}(g)}). \tag{14}$$
We can illustrate this formalism using the parity circuit shown in Fig. 1. Here, $V$ has 5 nodes, corresponding to the 3 input nodes and the two gates. The circuit operates over bits, so $\mathcal{X}_v = \{0, 1\}$ for each $v \in V$. Both gates carry out the XOR operation, so both elements of $F$ are given by $\pi_g(x_g | x_{\mathrm{pa}(g)}) = \delta(x_g, \mathrm{XOR}(x_{\mathrm{pa}(g)}))$ (where $\mathrm{XOR}(x_{\mathrm{pa}(g)}) = 1$ when the two parents of gate $g$ are in different states, and $\mathrm{XOR}(x_{\mathrm{pa}(g)}) = 0$ otherwise). Finally, $E$ has four elements representing the edges connecting the nodes in $V$, which are shown as arrows in Fig. 1.
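The way the gate maps combine into the overall input-output map of the circuit can be sketched for the 3-bit parity circuit. The node names and the wiring (the first XOR gate reading two inputs, the second reading the first gate and the third input) are an assumed reading of the parity circuit, and the code simply sums the product of gate conditional distributions over the states of the non-output gates.

```python
from itertools import product

def xor_gate(x_g, parents):
    """pi_g(x_g | x_pa(g)) for a deterministic XOR gate."""
    return 1.0 if x_g == (parents[0] ^ parents[1]) else 0.0

# Hypothetical node names; the wiring below is an assumption for illustration.
PARENTS = {"g1": ("x1", "x2"), "g2": ("g1", "x3")}
GATE_MAPS = {"g1": xor_gate, "g2": xor_gate}
OUT = ("g2",)

def circuit_map(x_in):
    """Return {x_OUT: probability} for a given assignment of the input nodes."""
    gates = list(PARENTS)
    result = {}
    # Enumerate all joint gate states; non-output gates are thereby summed out.
    for values in product((0, 1), repeat=len(gates)):
        state = dict(x_in, **dict(zip(gates, values)))
        prob = 1.0
        for g in gates:
            prob *= GATE_MAPS[g](state[g], [state[v] for v in PARENTS[g]])
        key = tuple(state[g] for g in OUT)
        result[key] = result.get(key, 0.0) + prob
    return result

print(circuit_map({"x1": 1, "x2": 0, "x3": 1}))
```

Replacing `xor_gate` with a function that returns values strictly between 0 and 1 would give a stochastic gate, which the formalism also allows.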
In the conventional representation of a physical circuit as a (Bayes net) DAG, the wires in the physical circuit are identified with edges in the DAG. However, in order to account for the thermodynamic costs of communication between gates, it will be useful to represent the wires themselves as a special kind of gate. This means that the DAG $(V, E)$ we use to represent a particular physical circuit is not the same as the DAG $(V', E')$ that would be used in the conventional computer science representation of that circuit. Rather $(V, E)$ is constructed from $(V', E')$ as follows.
To begin, we set $V = V'$ and $E = E'$. Then, for each edge $(v \to \tilde{v}) \in E'$, we first add a wire gate $w$ to $V$, and then replace that edge in $E$ with two edges: an edge from $v$ to $w$ and an edge from $w$ to $\tilde{v}$. So a wire gate $w$ has a single parent and a single child, and implements the identity map, $\pi_w(x_w | x_{\mathrm{pa}(w)}) = \delta(x_w, x_{\mathrm{pa}(w)})$. (This is an idealization of the real world, in which wires have nonzero probability of introducing errors.) We sometimes call $(V, E)$ the wired circuit, to distinguish it from the original, logical circuit defined as in computer science theory, $(V', E')$. We use $W \subset G$ to indicate the set of wire gates in a wired circuit.
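The construction of a wired circuit from a logical circuit can be sketched in a few lines. This is an illustration under stated assumptions: nodes are identified by strings, wire gates receive hypothetical names `w0`, `w1`, ..., and each logical edge is taken to be replaced by the pair of edges through its wire gate, so that every edge of the result touches exactly one wire gate.

```python
def wire_circuit(nodes, edges):
    """Build the wired DAG (V, E) and wire-gate list W from a logical DAG.

    Each logical edge (v, u) becomes v -> w -> u, where w is a new wire gate.
    """
    V, E, W = list(nodes), [], []
    for i, (v, u) in enumerate(edges):
        w = f"w{i}"              # hypothetical naming scheme for wire gates
        V.append(w)
        W.append(w)
        E += [(v, w), (w, u)]
    return V, E, W

# Assumed wiring of the 3-bit parity circuit: 5 nodes and 4 logical edges.
logical_nodes = ["x1", "x2", "x3", "g1", "g2"]
logical_edges = [("x1", "g1"), ("x2", "g1"), ("g1", "g2"), ("x3", "g2")]

V, E, W = wire_circuit(logical_nodes, logical_edges)
print(len(V), len(E), len(W))
```

With this wiring the wired circuit has 9 nodes (5 original nodes plus 4 wire gates) and 8 edges.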
Every edge in a wired circuit either connects a wire gate to a non-wire gate or vice versa. Physically, the edges of the DAG of a wired circuit don't represent interconnects (e.g., copper wires), as they do in a logical circuit. Rather they only indicate physical identity: an edge $e \in E$ going into a wire gate $w$ from a non-wire node $v$ indicates that the same physical variable will be written as either $X_v$ or $X_{\mathrm{pa}(w)}$. Similarly, an edge $e \in E$ going into a non-wire gate $g$ from a wire gate $w$ indicates that $X_w$ is the same physical variable (and so always has the same state) as the corresponding component of $X_{\mathrm{pa}(g)}$. However, despite this modified meaning of the nodes in a wired circuit, Eq. (14) still applies to any wired circuit, as well as applying to the corresponding logical circuit. In Fig. 3, we demonstrate how to represent the 3-bit parity circuit from Fig. 1 as a wired circuit.
Fig. 3. The 3-bit parity circuit of Fig. 1 represented as a wired circuit. Squares represent input nodes, rounded boxes represent non-wire gates, and smaller green circles represent wire gates. The output XOR gate is in blue, while the other (non-output) XOR gate is in red.
We use the word "circuit" to refer to either an abstract wired (or logical) circuit, or to a physical system that implements that abstraction. Note that there are many details of the physical system that are not specified in the associated abstract circuit. When we need to distinguish the abstraction from its physical implementation, we will refer to the latter as a physical circuit, with the former being the corresponding wired circuit. The context will always make clear whether we are using terms like "gate", "circuit", etc., to refer to physical systems or to their formal abstractions.
Even if one fully specifies the distinct physical subsystems of a physical circuit that will be used to implement each gate in a wired circuit, we still do not have enough information concerning the physical circuit to analyze the thermodynamic costs of running it. We still need to specify the initial states of those subsystems (before the circuit begins running), the precise sequence of operations of the gates in the circuit, etc. However, before considering these issues, we need to analyze the general form of the thermodynamic costs of running individual gates in a circuit, isolated from the rest of the circuit. We do that in the next section.

III. DECOMPOSITION OF EF
Suppose we have a fixed physical system whose dynamics over some time interval is specified by a conditional distribution P , and let p be its initial state distribution, which we can vary. We decompose the EF of running that system into a sum of three functions of p. Applied to any specific gate in a circuit (the "fixed physical system"), this decomposition tells us how the thermodynamic costs of that gate would change if the distribution of inputs to the gate were changed.
First, Eq. (6) tells us that the minimal possible EF, across all physical processes that transform $p$ into $p' := Pp$, is given by the drop in system entropy. We refer to this drop as the Landauer cost of computing $P$ on $p$, and write it as
$$\mathcal{L}(p) := S(p) - S(Pp).$$
Since EF is just Landauer cost plus EP, our next task is to calculate how the EP incurred by a fixed physical process depends on the initial distribution $p$ of that process. To that end, in the rest of this section we show that EP can be decomposed into a sum of two non-negative functions of $p$. Roughly speaking, the first of those two functions reflects the deviation of the initial distribution $p$ from an "optimal" initial distribution, while the second term reflects the remaining EP that would occur even if the process were run on that optimal initial distribution.
To derive this decomposition, we make use of a mathematical result provided by the following theorem. The theorem considers any function of the initial distribution $p$ which can be written in the form $S(Pp) - S(p) + \mathbb{E}_p[f]$ (i.e., the increase of Shannon entropy plus an expectation of some quantity with respect to $p$). The EP incurred by a physical process can be written in this form (by Eq. (5), where $\mathbb{E}_p[f]$ refers to the EF). Further below, we will also consider other functions, which are closely related to EP, that can be written in this special form. The theorem shows that any function with this special form can be decomposed into a sum of the two terms described above: the first term reflecting deviation of $p$ from the optimal initial distribution (relative to all distributions with support in some restricted set of states, which we indicate as $Z$), and a remainder term.

Theorem 1. Consider any function of the form
$$\Gamma(p) := S(Pp) - S(p) + \mathbb{E}_p[f],$$
where $P(y|x)$ is some conditional distribution of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$ and $f : \mathcal{X} \to \mathbb{R} \cup \{\infty\}$ is some function. Let $Z$ be any subset of $\mathcal{X}$ such that $f(x) < \infty$ for $x \in Z$, and let $q \in \Delta_Z$ be any distribution that obeys
$$q^c \in \arg\min_{r \,:\, \mathrm{supp}\, r \subseteq c} \Gamma(r) \quad \text{for all } c \in L_Z(P).$$
Then, each $q^c$ will be unique, and for any $p$ with $\mathrm{supp}\, p \subseteq Z$,
$$\Gamma(p) = D(p \| q) - D(Pp \| Pq) + \sum_{c \in L_Z(P)} p(c)\, \Gamma(q^c).$$
(We emphasize that $P$ and $f$ are implicit in the definition of $\Gamma$. We remind the reader that the definition of $L_Z$ and $q^c$ is provided in Section II D. See proof in Appendix A.)
Note that Theorem 1 does not suppose that $q$ is unique, only that the conditional distributions within each island, $\{q^c\}_c$, are. Moreover, as implied by the statement of the theorem, the overall probability weights assigned to the separate islands, $\{q(c)\}_c$, have no effect on the value of $\Gamma$.
Consider some conditional distribution $P(y|x)$, with $\mathcal{Y} = \mathcal{X}$, implemented by a physical process. Then, if we take $Z = \mathcal{X}$ and $\mathbb{E}_p[f] = Q$ in Theorem 1, the function $\Gamma$ is just the EP of running the conditional distribution $P(y|x)$. This establishes the following decomposition of EP:
$$\sigma(p) = D(p \| q) - D(Pp \| Pq) + \sum_{c \in L_Z(P)} p(c)\, \sigma(q^c). \tag{16}$$
We emphasize that Eq. (16) holds without any restrictions on the process, e.g., we do not require that the process obey LDB. In fact, Eq. (16) even holds if the process does not evolve according to a CTMC (as long as EP can be defined via Eq. (5)).
We refer to the first term in Eq. (16), the drop in KL divergence between $p$ and $q$ as both evolve under $P$, as mismatch cost [44]. Mismatch cost is non-negative by the data-processing inequality for KL divergence [45]. It equals zero in the special case that $p^c = q^c$ for each island $c \in L_Z(P)$. We refer to any such initial distribution $p$ that results in zero mismatch cost as a prior distribution of the physical process that implements the conditional distribution $P$ (the term "prior" reflects a Bayesian interpretation of $q$; see [20,30]). If there is more than one island in $L_Z(P)$, the prior distribution is not unique.
We call the second term in our decomposition of EP in Eq. (16), $\sum_{c \in L_Z(P)} p(c)\, \sigma(q^c)$, the residual EP. In contrast to mismatch cost, residual EP does not involve information-theoretic quantities, and depends linearly on $p$. When $L_Z(P)$ contains a single island, this "linear" term reduces to an additive constant, independent of the initial distribution. The residual EP terms $\{\sigma(q^c)\}_c$ are all non-negative, since EP is non-negative.
Concretely, the conditional distributions $\{q^c\}_c$ and the corresponding set of real numbers $\{\sigma(q^c)\}_c$ depend on the precise physical details of the process, beyond the fact that that process implements $P$. Indeed, by appropriate design of the "nitty gritty" details of the physical process, it is possible to have $\sigma(q^c) = 0$ for all $c \in L_Z(P)$, in which case the residual EP would equal zero for all $p$. (For example, this will be the case if the process is an appropriate quasi-static transformation; see [21,46].)
Imagine that the conditional distribution $P$ is logically reversible over some set of states $Z \subseteq \mathcal{X}$, and that $\mathrm{supp}\, p \subseteq Z$. Then, both mismatch cost and Landauer cost must equal zero, and EF must equal EP, which in turn must equal residual EP [47]. Conversely, if $P$ is not logically reversible over $Z$, then mismatch cost cannot be zero for all initial distributions $p$ with $\mathrm{supp}\, p \subseteq Z$ (for such a $P$, regardless of what $q$ is, there will be some $p$ with $\mathrm{supp}\, p \subseteq Z$ such that the KL divergence between $p$ and $q$ will shrink under the mapping $P$). Thus, for any fixed process that implements a logically irreversible map, there will be some initial distributions $p$ that result in unavoidable EP.
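A small numerical sketch may help fix ideas about how Landauer cost, mismatch cost and residual EP add up to the EF. It considers bit erasure, which has a single island, and assumes a process whose prior is uniform and whose residual EP is zero; these are illustrative choices, as are the function names. All quantities are in nats (i.e., in units of kT for a single bath).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl(p, r):
    """KL divergence D(p || r) in nats; assumes r > 0 wherever p > 0."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    m = p > 0
    return float((p[m] * np.log(p[m] / r[m])).sum())

def costs(P, p, q, residual=0.0):
    """Return (Landauer cost, mismatch cost, EF) for a single-island map P.

    P[y, x] = P(y|x); q is the assumed prior; residual is the assumed
    (constant) residual EP of the process.
    """
    landauer = entropy(p) - entropy(P @ p)
    mismatch = kl(p, q) - kl(P @ p, P @ q)
    return landauer, mismatch, landauer + mismatch + residual

erase = np.array([[1.0, 1.0], [0.0, 0.0]])   # both inputs map to state 0
q = np.array([0.5, 0.5])                      # assumed prior: uniform inputs

for p in ([0.5, 0.5], [0.9, 0.1]):
    landauer, mismatch, ef = costs(erase, np.array(p), q)
    print(p, landauer, mismatch, ef)
```

Under these assumptions the EF equals ln 2 for every input distribution: a non-uniform input lowers the Landauer cost, but the mismatch cost makes up the difference exactly.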
To provide some intuition into these results, the following example reformulates the EP of a very commonly considered scenario as a special case of Eq. (16): Example 1. Consider a physical system evolving according to an irreducible master equation, while coupled to a single thermodynamic reservoir and without external driving. Because there is no external driving, the master equation is time-homogeneous with some unique equilibrium distribution $p_{eq}$. So the system is relaxing toward that equilibrium as it undergoes the conditional distribution $P$ over the interval $t \in [0, 1]$.
For this kind of relaxation process, it is well known that the EP can be written as [35,48,49]:
$$\sigma(p) = D(p \| p_{eq}) - D(Pp \| p_{eq}). \tag{17}$$
Eq. (17) can also be derived from our result, Eq. (16), since
1. Taking $Z = X$, $P$ has a single island (because the master equation is irreducible, and therefore any state is reachable from any other over $t \in [0, 1]$);
2. The prior distribution within this single island is $q = p_{eq}$ (since the EP would be exactly zero if the system were started at this equilibrium, which is a fixed point of $P$);
3. The residual EP is $\sigma(q) = 0$ (again using the fact that EP is exactly zero for $p = p_{eq}$, and that there is a single island);
4. $Pq = p_{eq}$ (since there is no driving, and the equilibrium distribution is a fixed point of $P$).
Thus, Eq. (16) can be seen as a generalization of the well-known relation given by Eq. (17), which is defined for simple relaxation processes, to processes that are driven and possibly connected to multiple reservoirs.
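The relaxation scenario of Example 1 can be checked numerically on a two-state system. The sketch below is only illustrative: the rates `a` and `b`, the initial distribution, and the assumption that the equilibrium distribution has Boltzmann form (so that the EF, in units of $k$, is the average of $-\ln p_{eq}$ lost by the system) are choices made here, not details given in the text.

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def relax(p0, a, b, t=1.0):
    """Exact solution of the two-state master equation with constant
    rates a (0 -> 1) and b (1 -> 0), evaluated at time t."""
    peq1 = a / (a + b)
    p1 = peq1 + (p0[1] - peq1) * math.exp(-(a + b) * t)
    return [1.0 - p1, p1]

a, b = 1.0, 3.0                       # hypothetical rates
p_eq = [b / (a + b), a / (a + b)]     # unique equilibrium distribution
p0 = [0.2, 0.8]                       # hypothetical initial distribution
p1 = relax(p0, a, b)                  # distribution at t = 1

# EF in units of k, assuming energies E(x)/kT = -ln p_eq(x) + const and
# no external work: heat sent to the bath is the drop in average energy.
ef = sum((x0 - x1) * (-math.log(pe)) for x0, x1, pe in zip(p0, p1, p_eq))

ep_direct = ef - (entropy(p0) - entropy(p1))   # EP = EF - Landauer cost
ep_kl = kl(p0, p_eq) - kl(p1, p_eq)            # drop in KL to equilibrium
print(ep_direct, ep_kl)
```

Under these assumptions the two expressions agree to floating-point precision, which is the content of the relation between EP and the drop in KL divergence to equilibrium.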
The following example addresses the effect of possible discontinuities in the island decomposition of $P$ on our decomposition of thermodynamic costs: Example 2. Mismatch cost and residual EP are both defined in terms of the island decomposition of the conditional distribution $P$ over some set of states $Z$. That decomposition in turn depends on which (if any) entries in the conditional probability distribution $P$ are exactly 0. This suggests that the decomposition of Eq. (16) can depend discontinuously on very small variations in $P$ which replace strictly zero entries in $P$ with infinitesimal values, since such variations will change the island decomposition of $P$.
To address this concern, first note that if $P' \approx P$, then the EP of the real-world process that implements $P'$ can be approximated as
$$\sigma'(p) = Q'(p) - [S(p) - S(P'p)] \approx Q'(p) - [S(p) - S(Pp)], \tag{18}$$
where $Q'(p)$ is the EF function of the real-world process, with the approximation becoming exact as $P' \to P$ [50]. If we now apply Theorem 1 to the RHS of Eq. (18), we see that so long as $P'$ is close enough to $P$, we can approximate $\sigma'(p)$ as a sum of mismatch cost and residual EP using the islands of the idealized map $P$, instead of the actual map $P'$.
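The discontinuity discussed in Example 2 can be made concrete with a small Python sketch. It assumes that the island decomposition used in Theorem 1 (whose definition is not restated in this section) places two inputs in the same island whenever they can produce a common output with nonzero probability, closed transitively; the function name `islands` and both example maps are hypothetical.

```python
def islands(P, Z=None):
    """Partition the inputs of P (restricted to Z, if given) into islands.
    P[x] = {y: P(y|x)}. Assumed rule: x and x' share an island if some
    output y has P(y|x) > 0 and P(y|x') > 0, closed under transitivity."""
    states = [x for x in P if Z is None or x in Z]
    parent = {x: x for x in states}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_source = {}  # output y -> first input found that can produce it
    for x in states:
        for y, pyx in P[x].items():
            if pyx > 0:
                if y in first_source:
                    parent[find(x)] = find(first_source[y])
                else:
                    first_source[y] = x

    groups = {}
    for x in states:
        groups.setdefault(find(x), set()).add(x)
    return sorted((frozenset(g) for g in groups.values()), key=sorted)

IDENTITY = {0: {0: 1.0}, 1: {1: 1.0}}                     # idealized map P
PERTURBED = {0: {0: 1 - 1e-9, 1: 1e-9}, 1: {1: 1.0}}      # nearby map P'

print(islands(IDENTITY))   # two islands
print(islands(PERTURBED))  # a single island
```

An arbitrarily small perturbation merges the two islands, which is why the approximation argument above works with the islands of the idealized map rather than those of the perturbed one.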

IV. SOLITARY PROCESSES
Implicit in the definition of a physical circuit is that it is "modular", in the sense that when a gate in the circuit runs, it is physically coupled to the gates that are its direct inputs, and those that directly get its output, but is not physically coupled to any other gates in the circuit. This restriction on the allowed physical coupling is a constraint on the possible processes that implement each gate in the circuit. It has major thermodynamic consequences, which we analyze in this section.
To begin, suppose we have a system that can be decomposed into two separate subsystems, A and B, so that the system's overall state space $X$ can be written as $X = X_A \times X_B$, with states $(x_A, x_B)$. For example, A might contain a particular gate and its inputs, while B might consist of all other nodes in the circuit. We use the term solitary process to refer to a physical process over state space $X_A \times X_B$ that takes place during $t \in [0, 1]$ where:
1. A evolves independently of B, and B is held fixed:
$$P(x'_A, x'_B | x_A, x_B) = P_A(x'_A | x_A)\, \delta(x'_B, x_B). \tag{19}$$
2. The EF of the process depends only on the initial distribution over $X_A$, which we indicate with the following notation:
$$Q(p) = Q_A(p_A). \tag{20}$$
3. The EF is lower bounded by the change in the marginal entropy of subsystem A,
$$Q_A(p_A) \ge S(p_A) - S(P_A p_A). \tag{21}$$
Note that it may be that some subset $A'$ of the variables in subsystem A don't change their state during the solitary process. In that sense such variables would be like the variables in B. However, if the dynamics of those variables in A that do change state depends on the values of the variables in $A'$, then in general the variables in $A'$ cannot be assigned to B; they have to be included in subsystem A in order for condition (2) to be met.

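Example 3. A concrete example of a solitary process is a CTMC where at all times, the rate matrix $K_t$ has the decoupled structure
$$K_t(x_A, x_B \to x'_A, x'_B) = \delta(x_B, x'_B) \sum_\alpha K^{A,\alpha}_t(x_A \to x'_A), \tag{22}$$
where $K^{A,\alpha}_t$ indicates the rate matrix for subsystem A and thermodynamic reservoir $\alpha$ at time $t$ [51].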
To verify that this CTMC is a solitary process, first plug the rate matrix in Eq. (22) into Eq. (2) and simplify. Marginalizing the resulting equation, we see that the distribution over the states of A evolves independently of B, according to a master equation that involves only the rate matrices of subsystem A. Note also that, given the form of Eq. (22), the state of B does not change. Thus, the conditional distribution carried out by this CTMC over any time interval must have the form of Eq. (19). (See also App. B in [3].) Next, plug Eq. (22) into Eq. (3) and simplify to get an expression for the EF rate $\dot{Q}$ that involves only the distribution over $X_A$ and the rate matrices of subsystem A. Thus, the EF incurred by the process evolves exactly as if A were an independent system connected to a set of thermodynamic reservoirs. Therefore, a joint system evolving according to Eq. (22) will satisfy Eqs. (20) and (21).
We refer to the lower bound on the EF of subsystem A, as given in Eq. (21), as the subsystem Landauer cost for the solitary process. We make the associated definition that the subsystem EP for the solitary process is
$$\hat{\sigma}_A(p_A) := Q_A(p_A) - [S(p_A) - S(P_A p_A)], \tag{24}$$
which by Eq. (21) is non-negative. Note that if $P_A$ is a logically reversible conditional distribution, then subsystem EP is equal to the EF incurred by the solitary process.
In general, $S(p_A) - S(P_A p_A)$, the subsystem Landauer cost, will not equal $S(p_{AB}) - S(P p_{AB})$, the Landauer cost of the entire joint system. Loosely speaking, an observer examining the entire system would ascribe a different value to its entropy change during the solitary process than would an observer examining just subsystem A, even though subsystem B doesn't change its state. We use the term Landauer loss to refer to this difference in Landauer costs,
$$\mathcal{L}^{loss}(p) := [S(p_A) - S(P_A p_A)] - [S(p_{AB}) - S(P p_{AB})]. \tag{25}$$
Assuming that the lower bound Eq. (21) can be saturated, since the bound Eq. (6) can be saturated, the Landauer loss is the increase in the minimal EF that must be incurred by any process that carries out $P$ if that process is required to be a solitary process. By using the fact that subsystem B remains fixed throughout a solitary process, the Landauer loss can be rewritten as the drop in the mutual information between A and B, from the beginning to the end of the solitary process,
$$\mathcal{L}^{loss}(p) = I_p(A; B) - I_{Pp}(A; B). \tag{26}$$
Applying the data processing inequality establishes that Landauer loss is non-negative [52]. (See Section VII for a discussion of the relation between solitary processes and other processes that have been considered in the literature.) If $P_A$ (and thus also $P$) is logically reversible, then the Landauer loss will always be zero. However, for other conditional distributions, there is always some $p$ that results in strictly positive Landauer loss. Moreover, combining Eqs. (5), (20), (24), and (25), we can write the overall EP as
$$\sigma(p) = \hat{\sigma}_A(p_A) + \mathcal{L}^{loss}(p). \tag{27}$$
So in general the subsystem EP will be less than the overall EP of the entire system [53]. Finally, note that $Q_A(p_A)$ is a linear function of the distribution $p_A$ (since EF functions are linear). Combining this fact with Theorem 1, while taking $Z = X_A$, allows us to expand the subsystem EP as
$$\hat{\sigma}_A(p_A) = D(p_A \| q_A) - D(P_A p_A \| P_A q_A) + \sum_{c \in L(P_A)} p_A(c)\, \hat{\sigma}_A(q_A^c), \tag{28}$$
where $q_A$ is a distribution over $X_A$ that satisfies $\hat{\sigma}_A(q_A^c) = \min_{r : \mathrm{supp}\, r \subseteq c} \hat{\sigma}_A(r)$ for all $c \in L(P_A)$. As before, both the drop in KL divergence and the term linear in $p_A(c)$ are non-negative.
We will sometimes refer to that drop in KL divergence as subsystem mismatch cost, with q A the subsystem prior, and refer to the linear term as subsystem residual EP. Intuitively, subsystem Landauer cost, subsystem EP, subsystem mismatch cost, and subsystem residual EP are simply the values of those quantities that an observer would ascribe to subsystem A if they observed it independently of B.
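The equality between Landauer loss and the drop in mutual information can be checked on a minimal example. The following Python sketch uses two perfectly correlated bits and a solitary process that erases bit A while holding B fixed; the distribution, the erasure map, and all function names are hypothetical choices for illustration.

```python
import math
from collections import defaultdict

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marginal(p, i):
    """Marginal of p(x_A, x_B) onto coordinate i (0 for A, 1 for B)."""
    out = defaultdict(float)
    for x, v in p.items():
        out[x[i]] += v
    return dict(out)

def run_solitary(P_A, p):
    """Apply P_A to the A coordinate of p(x_A, x_B), holding x_B fixed."""
    out = defaultdict(float)
    for (xa, xb), v in p.items():
        for ya, w in P_A[xa].items():
            out[(ya, xb)] += w * v
    return dict(out)

def mutual_info(p):
    return entropy(marginal(p, 0)) + entropy(marginal(p, 1)) - entropy(p)

P_A = {0: {0: 1.0}, 1: {0: 1.0}}     # erase bit A
p = {(0, 0): 0.5, (1, 1): 0.5}       # A and B perfectly correlated
p_end = run_solitary(P_A, p)

subsystem_cost = entropy(marginal(p, 0)) - entropy(marginal(p_end, 0))
joint_cost = entropy(p) - entropy(p_end)
loss = subsystem_cost - joint_cost            # Landauer loss
mi_drop = mutual_info(p) - mutual_info(p_end)  # drop in mutual information
print(loss, mi_drop)
```

In this example the joint entropy does not change at all, yet the subsystem Landauer cost is $\ln 2$, so the whole subsystem cost is Landauer loss, matching the $\ln 2$ of mutual information destroyed by the erasure.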

V. SERIAL-REINITIALIZED CIRCUITS
As mentioned at the end of Section II E, specifying a wired circuit does not specify the initial distributions of the gates in the physical circuit, the sequence in which the gates in the physical circuit are run, etc. So it does not fully specify the dynamics of a physical system that implements that wired circuit. In this section we introduce one relatively simple way of mapping a wired circuit to such a full specification. In this specification, the gates are run serially, one after the other. Moreover, the gates reinitialize the states of their parent gates after they run, so that the entire circuit can be repeatedly run, incurring the same expected thermodynamic costs each time. We call such physical systems serial-reinitialized implementations of a given wired circuit, or just SR circuits for short.
For simplicity, in the main text of this paper we focus on the special case in which all non-output nodes have out-degree 1, i.e., where each non-output node is the parent of exactly one gate. See Appendix C for a discussion of how to extend the current analysis to relax this requirement, allowing some nodes to have out-degree larger than 1.
There are several properties that jointly define the SR circuit implementation of a given wired circuit.
First, just before the physical circuit starts to run, all of its nodes have a special initialized value with probability 1, i.e., $x_v = \emptyset$ for all $v \in V$. The joint state of the input nodes, $x_{IN}$, is then set by sampling the input distribution $p_{IN}$. Typically this setting of the state of the input nodes is done by some offboard system, e.g., the user of the digital device containing the circuit. We do not include the details of this offboard system in our model of the physical circuit. Accordingly, we do not include the thermodynamic costs of setting the joint state of the input nodes in our calculation of the thermodynamic costs of running the circuit [55].
After $x_{IN}$ is set this way, the SR circuit implementation begins. It works by carrying out a sequence of solitary processes, one for each gate of the circuit, including wire gates. At all times that a gate $g$ is "running", the combination of that gate and its parents (which we indicate as $n(g)$) is the subsystem A in the definition of solitary processes. The set of all other nodes of the wired circuit ($V \setminus n(g)$) constitutes the subsystem B of the solitary process. The temporal ordering of the solitary processes must be a topological ordering consistent with the wiring diagram of the circuit: if gate $g$ is an ancestor of gate $g'$, then the solitary process for gate $g$ completes before the solitary process for gate $g'$ begins.
When the solitary process corresponding to any gate $g \in G$ begins running, $x_g$ is still set to its initialized state, $\emptyset$, while all of the parent nodes of $g$ are either input nodes, or other gates that have completed running and are set to their output values. By the end of the solitary process for gate $g$, $x_g$ is set to a random sample of the conditional distribution $\pi_g(x_g | x_{pa(g)})$, while its parents are reinitialized to state $\emptyset$. More formally, under the solitary process for gate $g$, nodes $n(g)$ evolve according to
$$P_g(x'_{n(g)} | x_{n(g)}) = \pi_g(x'_g | x_{pa(g)}) \prod_{v \in pa(g)} \delta(x'_v, \emptyset), \tag{29}$$
while all nodes $V \setminus n(g)$ do not change their states. (Recall notation from Section II E.) Note that this means that the input nodes are reinitialized as soon as their child gates have run.

Example 4. In this example we demonstrate how to implement an XOR gate $g$ in an SR circuit with a CTMC, i.e., how to carry out the following logical map on the state of gate $g$,
$$\pi_g(x_g | x_{pa(g)}) = \delta(x_g, \mathrm{XOR}(x_{pa(g)})).$$
The CTMC involves a sequence of two solitary processes over $n(g)$. For all $x_V \neq x'_V$, the time-dependent rate matrix for both solitary processes has the decoupled form of Eq. (22), with $n(g)$ playing the role of subsystem A and, for simplicity, a single thermodynamic reservoir. The two solitary processes differ in their associated subsystem rate matrices $K^{n(g)}_t$.
In the first solitary process, the state of the gate's parents is held fixed, while the gate's output is changed from the initialized state to the correct XOR value. For $t \in [0, 1]$ (the units of time are arbitrary), the subsystem rate matrix that implements this solitary process is the one given in Eq. (30), defined for $x_{n(g)} \neq x'_{n(g)}$, where $\eta > 0$ is the relaxation speed. Note that the term $\delta(x_{n(g)}, \emptyset)$ inside the square brackets of Eq. (30) encodes the assumption that the initial state of the gate is $\emptyset$ with probability 1, while the factor of 1/4 encodes the assumption that the initial distribution over the four possible states of the gate's parents is uniform.
From the beginning to the end of the first solitary process, the nodes $n(g)$ are updated according to the conditional probability distribution $P^{(1)}_g$, given by the time-ordered exponential of the rate matrix in Eq. (30) over $t \in [0, 1]$. In the quasi-static limit $\eta \to \infty$, this conditional distribution becomes $P^{(1)}_g(x'_{n(g)} | x_{n(g)}) = \delta(x'_{pa(g)}, x_{pa(g)})\, \pi_g(x'_g | x_{pa(g)})$.
In the second solitary process, the gate's output is held fixed while the gate's parents are reinitialized. Redefining the time coordinate so that this second process also transpires in $t \in [0, 1]$, its subsystem rate matrix is the one given in Eq. (31), where $\eta$ is again the relaxation speed. Note that $\pi_g(x_g | x_{pa(g)})/4$ is what the distribution over nodes $n(g)$ would be at the beginning of the second solitary process, if the distribution at the beginning of the first solitary process was $\delta(x_g, \emptyset)/4$. From the beginning to the end of the second solitary process, the nodes $n(g)$ are updated according to the conditional probability distribution $P^{(2)}_g$, which is given by the time-ordered exponential of the rate matrix in Eq. (31). In the quasi-static limit $\eta \to \infty$, this conditional distribution is
$$P^{(2)}_g(x'_{n(g)} | x_{n(g)}) = \delta(x'_g, x_g) \prod_{v \in pa(g)} \delta(x'_v, \emptyset). \tag{32}$$
The sequence of two solitary processes causes the nodes in $n(g)$ to be updated according to the conditional distribution $P_g = P^{(1)}_g P^{(2)}_g$. In the quasi-static limit, this is
$$P_g(x'_{n(g)} | x_{n(g)}) = \pi_g(x'_g | x_{pa(g)}) \prod_{v \in pa(g)} \delta(x'_v, \emptyset),$$
which recovers Eq. (29), as desired.
We now compute thermodynamic costs for the XOR gate. Let $Q(p_{pa(g)})$ be the total EF incurred by running the sequence of two solitary processes, given some initial distribution $p_{pa(g)}$ over the parents of gate $g$. Using results from Section IV, write this EF as
$$\begin{aligned} Q(p_{pa(g)}) ={}& S(p_{pa(g)}) - S(\pi_g p_{pa(g)}) \\ &+ D(p_{pa(g)} \| q_{pa(g)}) - D(\pi_g p_{pa(g)} \| \pi_g q_{pa(g)}) \\ &+ \sum_{c \in L(\pi_g)} p_{pa(g)}(c)\, \hat{\sigma}_g(q^c_{pa(g)}), \end{aligned} \tag{33}$$
where the three lines correspond to subsystem Landauer cost, subsystem mismatch cost, and subsystem residual EP, respectively. To derive this decomposition, we applied Theorem 1, while taking $Z = \{x_{n(g)} \in X_{n(g)} : x_g = \emptyset\}$ (note that for this $Z$, $L_Z(P_g) = L(\pi_g)$).
To compute the second and third of those terms, note that in the quasi-static limit, the prior distribution is uniform:
$$q_{pa(g)}(x_{pa(g)}) = 1/4. \tag{34}$$
To see this, suppose that the distribution over $n(g)$ when the sequence of processes begins is given by $p_{n(g)}(x_{n(g)}) = \delta(x_g, \emptyset)\, q_{pa(g)}(x_{pa(g)})$. Then,
1. The system will remain in equilibrium during the first solitary process, thereby incurring zero EP. At the end of the first solitary process, it will have distribution
$$[P^{(1)}_g p_{n(g)}](x_{n(g)}) = q_{pa(g)}(x_{pa(g)})\, \pi_g(x_g | x_{pa(g)}). \tag{35}$$
2. Given that the system starts the second solitary process with this distribution $P^{(1)}_g p_{n(g)}$, it will remain in equilibrium throughout the second solitary process, thereby again incurring zero EP.
So that sequence of processes will incur zero EP, the minimum possible, if the initial distribution is $q_{pa(g)}$ over $pa(g)$ (and $x_g = \emptyset$), as claimed. In addition, the fact that the minimal EP that can be generated for any initial distribution is strictly zero means that the subsystem residual EP vanishes. This fully specifies all terms in Eq. (33), as a function of $p_{pa(g)}$.
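As a concrete example of this analysis, consider the initial distribution over the parents of $g$ that is uniform over the states {00, 01, 10}, i.e., $p_{pa(g)}(x_{pa(g)}) = 1/3$ for each of those three states and 0 for the state 11.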
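For this initial distribution, the first two lines of Eq. (33) can be evaluated directly. The Python sketch below does so under the assumptions stated in the example (uniform prior over the four parent states, vanishing subsystem residual EP); the function and variable names are hypothetical, and entropies are in nats with $k = 1$.

```python
import math
from collections import defaultdict

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def kl(p, q):
    return sum(v * math.log(v / q[x]) for x, v in p.items() if v > 0)

def xor_out(p):
    """Distribution of x_g = XOR(x_pa(g)) when the parents have distribution p."""
    out = defaultdict(float)
    for (a, b), v in p.items():
        out[a ^ b] += v
    return dict(out)

q = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}   # uniform prior
p = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 1 / 3}    # uniform over {00, 01, 10}

landauer = entropy(p) - entropy(xor_out(p))           # subsystem Landauer cost
mismatch = kl(p, q) - kl(xor_out(p), xor_out(q))      # subsystem mismatch cost
total_ef = landauer + mismatch                        # residual EP assumed zero
print(landauer, mismatch, total_ef)
```

Under these assumptions the subsystem Landauer cost is $(2/3)\ln 2$ and the subsystem mismatch cost is $(1/3)\ln 2$, for a total EF of $\ln 2$.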
We end by noting that the XOR gate may also incur some EP which is not accounted for by these calculations, due to loss of correlations between the nodes n(g) and the rest of the circuit as the gate runs. This is quantified by the Landauer loss, which can be evaluated using Eq. (25), Eq. (26), or Eq. (27).
Given the requirement that the solitary processes are run according to a topological ordering, Eq. (29) ensures that once all the gates of the circuit have run, the states of the output gates of the circuit have been set to a random sample of $\pi_\Phi(x_{OUT} | x_{IN})$, while all non-output nodes are back in their initialized states.

Example 5. Consider the 3-bit parity circuit shown in Fig. 3. An SR implementation of this wired circuit would run its 6 gates in topological order, such that each gate computes its output and then reinitializes its parents. One such sequence of steps is shown in Fig. 4 (note that some other topological orderings are also possible). Each XOR gate could be implemented by the kind of CTMC described in Example 4. Each wire gate could be run by a similar kind of CTMC, but one which carries out the identity operation $\pi_g(x_g | x_{pa(g)}) = \delta(x_g, x_{pa(g)})$, instead of the XOR operation.

FIG. 4. An SR implementation of the parity circuit shown in Fig. 3. Each diagram represents one step of the SR implementation, with white shapes indicating nodes set to their initialized value ($\emptyset$) and maroon shapes indicating nodes that can have non-initialized values. The implementation starts with only the input nodes set to non-initialized values (left-most diagram) and ends with only the output gates set to non-initialized values (right-most diagram).

After the SR circuit has run, some "offboard system" may make a copy of the state of the output gate for subsequent use, e.g., by copying it into some of the input bits of some downstream circuit(s), onto an external disk, etc. Regardless, we assume that after the circuit finishes, but before the circuit is run again, the states of the output nodes have also been reinitialized to $\emptyset$. Just as we do not model the physical mechanism by which new inputs are created for the next run of the circuit, we also do not model the physical mechanism by which the output of the circuit is reinitialized. Accordingly, in our calculation of the thermodynamic costs of running the circuit, we do not account for any possible cost of reinitializing the output [56].
This kind of cyclic procedure for running the circuit allows the circuit to be re-used an arbitrary number of times, while ensuring that each time it will have the same expected thermodynamic behavior (Landauer cost, mismatch cost, etc.), and will carry out the same map π Φ from input nodes to output gates.
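The logical behavior of an SR implementation (run gates in topological order, each gate computing its output and then reinitializing its parents) can be sketched in a few lines of Python. The wiring below, with four wire gates and two XOR gates, is a hypothetical 3-bit parity circuit with out-degree 1; the actual layout of Fig. 3 may differ, and the sketch tracks only logical states, not any thermodynamic quantity.

```python
from itertools import product

EMPTY = None  # stands in for the initialized state

def xor(a, b):
    return a ^ b

def wire(a):
    return a

INPUTS = ("x1", "x2", "x3")
OUTPUTS = ("out",)
# Hypothetical wiring: (gate, parents, function), listed in a topological order.
GATES = [
    ("w1", ("x1",), wire),
    ("w2", ("x2",), wire),
    ("w3", ("x3",), wire),
    ("xor1", ("w1", "w2"), xor),
    ("w4", ("xor1",), wire),
    ("out", ("w4", "w3"), xor),
]

def run_sr(x_in):
    """Run the circuit once on input bits x_in and return the final node states."""
    state = {v: EMPTY for v in INPUTS}
    state.update({g: EMPTY for g, _, _ in GATES})
    state.update(zip(INPUTS, x_in))  # inputs are set by the offboard system
    for g, parents, fn in GATES:
        # The gate must still be initialized, and its parents must hold values.
        assert state[g] is EMPTY
        assert all(state[u] is not EMPTY for u in parents)
        state[g] = fn(*(state[u] for u in parents))  # compute the gate output
        for u in parents:
            state[u] = EMPTY                         # reinitialize the parents
    return state

for x in product((0, 1), repeat=3):
    final = run_sr(x)
    print(x, final["out"])
```

After each run only the output node holds a non-initialized value, so once the output is reset the same procedure can be repeated on fresh inputs.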

VI. THERMODYNAMIC COSTS OF SR CIRCUITS
In general, there are multiple decompositions of the EF and EP incurred by running any given SR circuit. They differ in how much of the detailed structure of the circuit they incorporate. In this section we present some of these decompositions. (See Appendix B for all proofs of the results in this section.)

A. General decomposition of EF and EP
Let $p$ refer to the initial distribution over the joint state of all nodes in the circuit. By Eq. (5), the total EF incurred by implementing some overall map $P$ which takes the initial joint state of all nodes in the full circuit to the final joint state is
$$Q(p) = \mathcal{L}(p) + \sigma(p), \tag{36}$$
where
$$\mathcal{L}(p) := S(p) - S(Pp) \tag{37}$$
is the Landauer cost of computing $P$ for initial distribution $p$. The first term in Eq. (36), $\mathcal{L}$, is the minimal EF that must be incurred by any process over $X_{IN} \times X_{OUT}$ that implements $P$, without any constraints on how the variables in $X_{IN} \times X_{OUT}$ are coupled, and without any reference to a set of intermediate subsystems (e.g., gates) that may connect the input and output variables [57].
The second term in Eq. (36) is the EP, which reflects the thermodynamic irreversibility of the SR circuit. Using Eq. (16), the EP can be further decomposed as
$$\sigma(p) = D(p \| q) - D(Pp \| Pq) + \sum_{c \in L(P)} p(c)\, \sigma(q^c). \tag{38}$$
The decrease in KL divergence reflects the mismatch cost, arising from the discrepancy between $p(x_V)$, the actual initial distribution over all nodes of the circuit defined in Eq. (39), and $q(x_V)$, the optimal prior distribution over the joint state of all the nodes of the circuit which would result in the least EP. The last sum in Eq. (38) is the residual EP, i.e., the EP that remains even when the circuit is initialized with the optimal prior distribution.
Suppose we know that the dynamics is actually implemented with an SR circuit, but don't know the precise wiring diagram. Then we know that the initial joint distribution over all the nodes is
$$p(x_V) = p_{IN}(x_{IN}) \prod_{v \in V \setminus IN} \delta(x_v, \emptyset), \tag{39}$$
and the ending joint distribution is
$$(Pp)(x_V) = p_{OUT}(x_{OUT}) \prod_{v \in V \setminus OUT} \delta(x_v, \emptyset). \tag{40}$$
So $S(p) = S(p_{IN})$ and $S(Pp) = S(p_{OUT}) = S(\pi_\Phi p_{IN})$, where $\pi_\Phi$ is the conditional distribution of the final joint state of the output gates given the initial joint state of the input nodes, defined in Eq. (14). Combining gives
$$\mathcal{L}(p) = S(p_{IN}) - S(\pi_\Phi p_{IN}). \tag{41}$$
Similarly, the EP becomes
$$\sigma(p) = D(p_{IN} \| q_{IN}) - D(\pi_\Phi p_{IN} \| \pi_\Phi q_{IN}) + \sum_{c \in L(\pi_\Phi)} p_{IN}(c)\, \sigma(q^c_{IN}). \tag{42}$$
While the expressions in Eqs. (38) and (42) for EP must be equal, they differ in how they decompose that EP between a mismatch cost term and a residual EP term. The two decompositions differ because they define the "optimal initial distribution" relative to different sets of possible distributions, resulting in different prior distributions (which are also defined over different sets of outcomes). Also note that the residual EP terms in Eq. (42) are defined in terms of a more constrained minimization problem than the residual EP terms in Eq. (38). Thus, given the same initial distribution $p$, the residual EP in Eq. (42) will generally be larger than the residual EP in Eq. (38), while the mismatch cost in Eq. (38) will generally be larger than the mismatch cost in Eq. (42). We also emphasize that the island decompositions appearing in the two expressions are different.

B. Circuit-based decompositions of EF and EP
The decompositions of EF and EP given in Eqs. (36), (38), and (42) do not involve the wiring diagram of the SR circuit. As an alternative, we can exploit that wiring diagram to formulate a decomposition of EF and EP which separates the contributions from different gates. In general, such circuit-based decompositions allow for a finer-grained analysis of the EP in SR circuits than do the decompositions proposed in the last section. In particular, they allow us to derive some novel connections between nonequilibrium statistical physics, computer science theory, and information theory, as discussed in the next two subsections.
Before discussing these circuit-based decompositions, we introduce some new notation. We write $p_{pa(g)}(x_{pa(g)})$ and $p_{n(g)}(x_{n(g)}) = p_{pa(g)}(x_{pa(g)})\, \delta(x_g, \emptyset)$ for the distributions over $x_{pa(g)}$ and $x_{n(g)}$, respectively, at the beginning of the solitary process that implements gate $g$. We write the EF function of the solitary process of gate $g$ as $Q_g(p_{n(g)})$, and its subsystem EP as
$$\hat{\sigma}_g(p_{n(g)}) := Q_g(p_{n(g)}) - [S(p_{n(g)}) - S(P_g p_{n(g)})]. \tag{43}$$
We also write $p^{beg(g)}$ and $p^{end(g)}$ to indicate the joint distribution over all circuit nodes at the beginning and end, respectively, of the solitary process that runs gate $g$. As an illustration of this notation, $p^{beg(g)}(x_{pa(g)}) = p_{pa(g)}(x_{pa(g)})$. On the other hand, $p^{end(g)}(x_g) = (\pi_g p_{pa(g)})(x_g)$ is the distribution over $x_g$ after gate $g$ runs, and $p^{end(g)}(x_{pa(g)})$ is a delta function about the joint state of the parents of $g$ in which they are all initialized, by Eq. (29). Note that since we're considering solitary processes, $p^{beg(g)}(x_{V \setminus n(g)}) = p^{end(g)}(x_{V \setminus n(g)})$.
We now present our first circuit-based decomposition, and then we explain what its terms mean in detail:

Theorem 2. The total EF incurred by running an SR circuit where $p$ is the initial distribution over the joint state of all nodes in the circuit is
$$Q(p) = \mathcal{L}(p) + \mathcal{L}^{loss}(p) + \mathcal{M}(p) + \mathcal{R}(p). \tag{44}$$
1) The first term in Eq. (44), L(p), is the Landauer cost of the circuit, as described in Eq. (37). This Landauer cost can be further decomposed into contributions from the individual gates. Specifically, write L g (p) for the drop in the entropy of the entire circuit during the time that the solitary process for gate g runs, given that the input distribution over the entire circuit is p: L g (p) := S(p beg(g) ) − S(p end(g) ) .
Note that the distribution over the states of the entire physical circuit at the end of the running of any gate is the same as the distribution at the beginning of the running of the next gate. So by canceling terms, and using the fact that entropy does not change when a wire gate runs, we can expand $\mathcal{L}$ as
$$\mathcal{L}(p) = \sum_{g \in G \setminus W} \mathcal{L}_g(p). \tag{46}$$
(Recall from Section II E that $W$ is the set of wire gates in the circuit.) This decomposition will be useful below.
2) The second term in Eq. (44), L loss (p), is the unavoidable additional EF that is incurred by any SR implementation of the SR circuit on initial distribution p, above and beyond L, the Landauer cost of running the map π Φ on initial distribution p.
We refer to this unavoidable extra EF as the circuit Landauer loss. It equals the sum of the subsystem Landauer losses incurred by each non-wire gate's solitary process,
$$\mathcal{L}^{loss}(p) = \sum_{g \in G \setminus W} \mathcal{L}^{loss}_g(p), \tag{47}$$
where $\mathcal{L}^{loss}_g(p) = S(p_{pa(g)}) - S(\pi_g p_{pa(g)}) - \mathcal{L}_g(p)$. Each term $\mathcal{L}^{loss}_g(p)$ in this sum is non-negative (see end of Section IV), and so $\mathcal{L}^{loss}(p) \ge 0$. Note that we can omit wires from the sum in Eq. (47) because $\pi_g$ is logically reversible for any wire gate $g$, which means that $\mathcal{L}^{loss}_g(p) = 0$ for such gates. We define circuit Landauer cost to be the minimal EF incurred by running any SR implementation of the circuit, i.e., $\mathcal{L}^{circ}(p) = \mathcal{L}(p) + \mathcal{L}^{loss}(p) = \sum_{g \in G \setminus W} [S(p_{pa(g)}) - S(\pi_g p_{pa(g)})]$.
Recall that L(p) is the minimal EF that must be generated by any physical process that carries out the map P on initial distribution p. So by Eq. (48), L loss (p) is the minimal additional EF that must be generated if we use an SR circuit to carry out P on p, no matter how efficient the gates in the circuit are. In this sense, Eq. (48) can be viewed as an extension of the generalized Landauer bound, to concern SR circuits.
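To make these quantities concrete, the following Python sketch computes the Landauer cost, the circuit Landauer cost, and their difference (the circuit Landauer loss) for a two-gate parity circuit. It assumes deterministic gates, omits wire gates (which contribute no Landauer loss), and uses a hypothetical input distribution in which inputs 1 and 3 are perfectly correlated; all names are illustrative.

```python
import math
from collections import defaultdict

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marginal(joint, names, keep):
    """Marginal of a joint distribution (state tuples ordered as `names`)."""
    idx = [names.index(n) for n in keep]
    out = defaultdict(float)
    for x, v in joint.items():
        out[tuple(x[i] for i in idx)] += v
    return dict(out)

def node_joint(p_in, in_names, gates):
    """Propagate p_in through deterministic gates, giving the joint
    distribution over the values that every node takes when it is set."""
    names = list(in_names) + [g for g, _, _ in gates]
    joint = defaultdict(float)
    for x, v in p_in.items():
        vals = dict(zip(in_names, x))
        for g, parents, fn in gates:
            vals[g] = fn(*(vals[u] for u in parents))
        joint[tuple(vals[n] for n in names)] += v
    return names, dict(joint)

def landauer_terms(p_in, in_names, gates, out_names):
    names, joint = node_joint(p_in, in_names, gates)
    S = lambda keep: entropy(marginal(joint, names, keep))
    L = S(in_names) - S(out_names)                       # Landauer cost
    L_circ = sum(S(pa) - S([g]) for g, pa, _ in gates)   # circuit Landauer cost
    return L, L_circ, L_circ - L                         # last entry: Landauer loss

xor = lambda a, b: a ^ b
GATES = [("g1", ("x1", "x2"), xor), ("out", ("g1", "x3"), xor)]
# Hypothetical input distribution: x3 is a copy of x1, x2 is independent.
P_IN = {(a, b, a): 0.25 for a in (0, 1) for b in (0, 1)}

L, L_circ, L_loss = landauer_terms(P_IN, ("x1", "x2", "x3"), GATES, ("out",))
print(L, L_circ, L_loss)
```

For this input the overall map could be run at a Landauer cost of $\ln 2$, but this particular wiring has a circuit Landauer cost of $2 \ln 2$, so the circuit Landauer loss is $\ln 2$.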
3) The third term in Eq. (44), $\mathcal{M}$, reflects the EF incurred because the actual initial distribution of each gate $g$ is not the optimal one for that gate (i.e., not one that minimizes subsystem EP within each island of the conditional distribution $P_g$, defined in Eq. (29)). We refer to this cost as the circuit mismatch cost, and write it as
$$\mathcal{M}(p) = \sum_{g \in G} \left[ D(p_{n(g)} \| q_{n(g)}) - D(P_g p_{n(g)} \| P_g q_{n(g)}) \right], \tag{50}$$
where the prior $q_{n(g)}$ is a distribution over $X_{n(g)}$ whose conditional distributions over the islands $c \in L(P_g)$ all obey $\hat{\sigma}_g(q^c_{n(g)}) = \min_{r : \mathrm{supp}\, r \subseteq c} \hat{\sigma}_g(r)$. Note that we must include wire gates $g$ in the sum in Eq. (50) even though $\pi_g$ for a wire gate is logically reversible. This is because the associated overall map over $n(g)$, Eq. (29), is not logically reversible over $n(g)$ [58]. $\mathcal{M}$ is non-negative, since each gate's subsystem mismatch cost is non-negative. Moreover, $\mathcal{M}$ approaches its minimum value of 0 as $p^{c_g}_{n(g)} \to q^{c_g}_{n(g)}$ for all $g \in G$ and all islands $c_g \in L(P_g)$. (Recall that subsystem priors like $q^{c_g}_{n(g)}$ reflect the specific details of the underlying physical process that implements the gate $g$, such as how its energy spectrum evolves as it runs.) Suppose that one wishes to construct a physical system to implement some circuit, and can vary the associated subsystem priors $q^{c_g}_{n(g)}$ arbitrarily. Then in order to minimize mismatch cost one should choose priors $q^{c_g}_{n(g)}$ that equal the actual associated initial distributions $p^{c_g}_{n(g)}$. Moreover, those actual initial distributions $p^{c_g}_{n(g)}$ can be calculated from the circuit's wiring diagram, together with the input distribution of the entire circuit, $p_{IN}$, by "propagating" $p_{IN}$ through the transformations specified by the wiring diagram. As a result, given knowledge of the wiring diagram and the input distribution of the entire circuit, in principle the priors can be set so that mismatch cost is arbitrarily small.

4) The fourth term in Eq. (44), $\mathcal{R}$, reflects the remaining EF incurred by running the SR circuit, and so we call it circuit residual EP. Concretely, it equals the subsystem EP that would be incurred even if the initial distribution within each island of each gate were optimal:
$$\mathcal{R}(p) = \sum_{g \in G} \sum_{c \in L(P_g)} p_{n(g)}(c)\, \hat{\sigma}_g(q^c_{n(g)}). \tag{51}$$
Circuit residual EP is non-negative, since each $\hat{\sigma}_g$ is non-negative. Since for every gate $g$, $p_{n(g)}(c)$ is a linear function of the initial distribution to the circuit as a whole, circuit residual EP also depends linearly on the initial distribution. Like the priors of the gates, the residual EP terms $\{\hat{\sigma}_g(q^c_{n(g)})\}$ reflect the "nitty-gritty" details of how the gates run.
To summarize, the EF incurred by a circuit can be decomposed into the Landauer cost (the contribution to the EF that would arise even in a thermodynamically reversible process) plus the EP (the contribution to that EF which is thermodynamically irreversible). In turn, there are three contributions to that EP:
1. Circuit Landauer loss, which is independent of how the circuit is physically implemented, but does depend on the wiring diagram of the circuit, the conditional distributions implemented by the gates, and the initial distribution over inputs. It is a nonlinear function of the distribution over inputs, $p_{IN}$.
2. Circuit mismatch cost, which does depend on how the circuit is physically implemented (via the priors), as well as the wiring diagram. It is also a nonlinear function of $p_{IN}$.
3. Circuit residual EP, which also depends on how the circuit is physically implemented. It is a linear function of $p_{IN}$. However, no matter what the wiring diagram of the circuit is, if we implement each of the gates in a circuit with a quasi-static process, then the associated circuit residual EP is identically zero, independent of $p_{IN}$ [59].
There are other useful decompositions of the EP incurred by an SR circuit that incorporate the wiring diagram. One such alternative decomposition, which is our second main result, leaves the circuit Landauer loss term in Eq. (44) unchanged, but modifies the circuit mismatch cost and the circuit residual EP terms.

Theorem 3. The total EF incurred by running an SR circuit where $p$ is the initial distribution over the joint state of all nodes in the circuit is
$$Q(p) = \mathcal{L}(p) + \mathcal{L}^{loss}(p) + \mathcal{M}'(p) + \mathcal{R}'(p). \tag{52}$$
To present this decomposition, recall from Eq. (29) that for any gate $g$, the distribution over $X_{n(g)}$ has partial support at the beginning of the solitary process that implements $P_g$, since there is 0 probability that $x_g \neq \emptyset$. We use this fact to apply Theorem 1 to Eq. (43), while taking $Z = \{x_{n(g)} \in X_{n(g)} : x_g = \emptyset\}$. This allows us to express the modified circuit mismatch cost by replacing each map $P_g(x'_{n(g)} | x_{n(g)})$ in the summand in Eq. (50) with $\pi_g(x_g | x_{pa(g)})$:
$$\mathcal{M}'(p) = \sum_{g \in G \setminus W} \left[ D(p_{pa(g)} \| q_{pa(g)}) - D(\pi_g p_{pa(g)} \| \pi_g q_{pa(g)}) \right], \tag{53}$$
where the priors $q_{pa(g)}$ are defined in terms of the island decompositions of the associated conditional distributions $\pi_g$, rather than in terms of the island decompositions of the conditional distributions $P_g$. Note that we can exclude the wire gates from the sum in Eq. (53) because each wire gate's $\pi_g$ is logically reversible, and so the associated drops in KL divergence are zero. Then, the modified circuit residual EP, Eq. (54), is the corresponding sum of terms $p_{pa(g)}(c)\, \hat{\sigma}_g(q^c_{pa(g)})$ over gates $g$ and islands $c \in L(\pi_g)$, where each term $\hat{\sigma}_g(q^c_{pa(g)})$ is given by appropriately modifying the arguments in Eq. (43). In deriving Eq. (54), we used the fact that $L_Z(P_g) = L(\pi_g)$.
As with the analogous results in the previous section, Theorem 2 and Theorem 3 differ, because they define "optimal initial distribution" relative to different sets of possibilities. In particular, the decomposition in Theorem 2 will generally have a larger mismatch cost and smaller residual EP term than the decomposition in Theorem 3.
For the rest of this section, we will use the term "circuit mismatch cost" to refer to the expression in Eq. (53) rather than the expression in Eq. (50), and similarly will use the term "circuit residual EP" to refer to the expression in Eq. (54) rather than the expression in Eq. (51).

C. Information theory and circuit Landauer loss
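By combining Eq. (47) and Eq. (26), we can write circuit Landauer loss as the sum, over all non-wire gates $g$, of the drop in mutual information between the nodes $n(g)$ and the remaining nodes $V \setminus n(g)$ during the solitary process that runs gate $g$ [Eq. (55)]. Any nodes that belong to $V \setminus n(g)$ and that are in their initialized state when gate $g$ starts to run will not contribute to the drop in mutual information terms in Eq. (55). Keeping track of such nodes and simplifying establishes the following: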

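Corollary 4. The circuit Landauer loss is
$$\mathcal{L}^{loss}(p) = I(p_{IN}) - I(\pi_\Phi p_{IN}) - \sum_{g \in G \setminus W} I(p_{pa(g)}).$$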
(We remind the reader that I(p A ) refers to the multiinformation between the variables indexed by A.) Corollary 4 suggest a set of novel optimization problems for how to design SR circuits: given some desired computation π Φ and some given initial distribution p IN , find the circuit wiring diagram that carries out π Φ while minimizing the circuit Landauer loss. Presuming we have a fixed input distribution p and map π Φ , the term I(p IN ) − I(π Φ p IN ) in Corollary 4 is an additive constant that doesn't depend on the particular choice of circuit wiring diagram. So the optimization problem can be reduced to finding which wiring diagram results in a minimal value of g∈G\W I(p pa(g) ). In other words, for fixed p and map π Φ , to minimize the Landauer loss we should choose the wiring diagram for which the parents of each gate are as strongly correlated among themselves as possible. Intuitively, this ensures that the "loss of correlations", as information is propagated down the circuit, is as small as possible.
In general, the distributions over the outputs of the gates in any particular layer of the circuit will affect the distribution of the inputs of all of the downstream gates, in the subsequent layers of the circuit. This means that the sum of multiinformations in Corollary 4 is an inherently global property of the wiring diagram of a circuit; it cannot be reduced to a sum of properties of each gate considered in isolation, independently of the other gates. This makes the optimization problem particularly challenging. We illustrate this optimization problem in the following example.

Example 6.
Consider again the case where we want our circuit to compute the three-bit parity function using 2-input XOR gates, i.e., we want it to implement the map $\pi_\Phi$ that sets the output bit to $x_1 \oplus x_2 \oplus x_3$. Suppose we happen to know that the input distribution to the circuit (which is specified by how we will use the circuit) is given by Eq. (56), where $Z$ is a normalization constant and $\phi$ sets the coupling between pairs of inputs. Under this distribution, inputs 2 and 3 have the strongest coupling strength, inputs 1 and 3 have intermediate coupling strength (1/2), and inputs 1 and 2 have the weakest coupling strength (1/4).
We wish to find the wiring diagram connecting our XOR gates that has minimal circuit Landauer cost for this distribution over its three input bits. It turns out that we can restrict our search to three possible wiring diagrams, which are shown in Fig. 5. We indicate the circuit Landauer loss for the input distribution of Eq. (56) for each of those three wiring diagrams. So for this input distribution, the right-most wiring diagram results in minimal circuit Landauer loss. Note that this wiring diagram aligns with the correlational structure of the input distribution (given that inputs 2 and 3 have the strongest statistical correlation).
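The comparison between wiring diagrams can be sketched numerically. The Python snippet below assumes that the circuit Landauer loss takes the form $I(p_{IN}) - I(\pi_\Phi p_{IN}) - \sum_{g \in G \setminus W} I(p_{\mathrm{pa}(g)})$ (so the multi-information of the single output bit is zero), and it uses a hypothetical Ising-like input distribution with pairwise couplings 1/4, 1/2 and 1; this is an illustrative stand-in, not necessarily the distribution of Eq. (56). All function names are likewise assumptions. Bits are indexed from 0, so the pair `(1, 2)` corresponds to inputs 2 and 3.

```python
import itertools
import math

def entropy(dist):
    """Shannon entropy in nats of a dict mapping states to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginal of a joint distribution over the coordinates listed in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def multi_information(joint):
    """Sum of single-variable entropies minus the joint entropy."""
    n = len(next(iter(joint)))
    return sum(entropy(marginal(joint, [i])) for i in range(n)) - entropy(joint)

def landauer_loss(p_in, first_pair):
    """Loss of the parity circuit whose first XOR gate reads the bits in first_pair."""
    a, b = first_pair
    (c,) = set(range(3)) - {a, b}
    pa_g1 = marginal(p_in, [a, b])   # parents of the first XOR gate
    pa_g2 = {}                       # parents of the second gate: (x_a XOR x_b, x_c)
    for x, p in p_in.items():
        key = (x[a] ^ x[b], x[c])
        pa_g2[key] = pa_g2.get(key, 0.0) + p
    # The output is a single bit, so its multi-information is zero and is omitted.
    return multi_information(p_in) - multi_information(pa_g1) - multi_information(pa_g2)

# Hypothetical pairwise couplings between the three input bits (0-indexed).
J = {(0, 1): 0.25, (0, 2): 0.5, (1, 2): 1.0}
weights = {}
for x in itertools.product((0, 1), repeat=3):
    s = [2 * v - 1 for v in x]  # map bits {0, 1} to spins {-1, +1}
    weights[x] = math.exp(sum(j * s[a] * s[b] for (a, b), j in J.items()))
Z = sum(weights.values())
p_in = {x: w / Z for x, w in weights.items()}

losses = {pair: landauer_loss(p_in, pair) for pair in J}
best = min(losses, key=losses.get)
for pair, loss in losses.items():
    print(pair, round(loss, 4))
```

Under these assumed couplings, the wiring whose first gate reads the most strongly coupled pair has the smallest loss, in line with the intuition that gate parents should be as correlated as possible.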
An interesting variant of the optimization problem described above arises if we model the residual EP terms for the wire gates. In any SR circuit, wire gates carry out a logically reversible operation on their inputs. Thus, by Eq. (16), all of the EF generated by any wire gate is residual EP. If we allow the physical lengths of wires to vary, then as a simple model we could presume that the residual EP of any wire is proportional to its length. This would allow us to incorporate into our analysis the thermodynamic effect of the geometry with which a circuit is laid out on a two-dimensional circuit board, in addition to the thermodynamic effect of the topology of that circuit.
Finally, note that for any set of nodes $A$, the multi-information $I(p_A)$ can be bounded from above. Given this, Corollary 4 implies an upper bound on the circuit Landauer loss. This means that for a fixed input state space, the circuit Landauer loss cannot grow without bound as we vary the wiring diagram. Interestingly, this bound on Landauer loss only holds for SR circuits that have out-degree 1. If we consider SR circuits that have out-degree greater than 1, then the circuit Landauer loss can be arbitrarily large. This is formalized as the following proposition. (See proof in Appendix C.)
Proposition 1. For any $\pi_\Phi$, non-delta function input distribution $p_{IN}$, and $\kappa \geq 0$, there exists an SR circuit with out-degree greater than 1 that implements $\pi_\Phi$ for which $L_{loss}(p) \geq \kappa$.

D. Information theory and circuit mismatch loss
Landauer loss captures the gain in minimal EF due to using an SR circuit, if there is no mismatch cost or residual EP. It is harder to make general statements about the gain in actual EF due to using an SR circuit, i.e., when the mismatch cost is nonzero. In this subsection we make some preliminary remarks about this issue.
Imagine that we wish to build a physical process that implements some computation $\pi_\Phi(x_{OUT} | x_{IN})$ over a space $X_{IN} \times X_{OUT}$. Suppose we want this process to achieve minimal EP when run with inputs generated by $q_{IN}$ (e.g., if we expect future inputs to the process to be generated by sampling $q_{IN}$), and as usual assume the initial value of $x_{OUT}$ will be $\emptyset$ whenever it is run. Using the decomposition of Eq. (42) and assuming that the residual EP of the process is zero, the EF that such a process would generate if it is actually run with an input distribution $p$ (initialized like SR circuits are, so that it has the form of Eq. (39)) is given by the sum of the Landauer cost and the mismatch cost, with no Landauer loss term. We write this as
$$Q_{AO}(p) = L(p) + D(p_{IN} \| q_{IN}) - D(\pi_\Phi p_{IN} \| \pi_\Phi q_{IN}) \qquad (58)$$
Note that in order for the EF generated by an actual physical process to be given by Eq. (58), the prior of that process must be $q_{IN}$, and in general this may require that the process couple together arbitrary sets of variables. This means that the EF generated by implementing $\pi_\Phi(x_{OUT} | x_{IN})$ with an SR circuit cannot obey Eq. (58) in general, due to restrictions on what variables can be coupled in such a circuit. (One can verify, for example, that the prior distribution $q_{IN}$ of a circuit consisting of two disconnected bit erasing gates must be a product distribution over the two input bits.) To emphasize this distinction, we will refer to a process whose EF is given by Eq. (58) as an "all-at-once" (AO) process (indicated by the subscript "AO" in Eq. (58)).
For practical reasons, it may be quite difficult to construct an AO process that implements $\pi_\Phi$, and we must use a circuit implementation instead. In particular, even though the circuit as a whole cannot have prior $q_{IN}$, suppose we can set the priors $q_{\mathrm{pa}(g)}$ at its gates by propagating $q_{IN}$ through the wiring diagram of the circuit. Assuming again that there is zero residual EP, the EF that must be incurred by any such SR circuit implementation of $\pi_\Phi$ on input distribution $p$, assuming some particular wiring topology and gate priors, is given by the decomposition of Theorem 3,
$$Q_{circ}(p) = L(p) + L_{loss}(p) + M(p) \qquad (59)$$
We now ask: how much larger is this EF incurred by the SR circuit implementation, compared to that of the original AO process? Subtracting Eq. (58) from Eq. (59) gives
$$Q_{circ}(p) - Q_{AO}(p) = L_{loss}(p) + M_{loss}(p) \qquad (60)$$
where we have defined $M_{loss}$ as the difference between the circuit mismatch cost, $M(p)$, and the mismatch cost of the AO process. We refer to that difference in mismatch costs as the circuit mismatch loss, and use Eq. (53) to express it in terms of multi-divergences in Eq. (61), where $\mathcal{D}$ refers to the multi-divergence, defined in Eq. (10). Eq. (61) can be compared to Corollary 4, which expresses the circuit Landauer loss rather than the circuit mismatch loss, and involves multi-informations rather than multi-divergences. Interestingly, while circuit Landauer loss is non-negative, circuit mismatch loss can be either positive or negative. In fact, depending on the wiring diagram, $p_{IN}$ and $q_{IN}$, the sum of circuit mismatch loss and circuit Landauer loss can be negative. This means that when the actual input distribution $p_{IN}$ is different from the prior distribution of the AO process, the "closest equivalent circuit" to the AO process may actually incur less EF than the corresponding AO process. This occurs because an SR circuit cannot implement some of the prior distributions that an AO process can implement, so the two implementations end up having different priors. This is illustrated in the following example.
We now implement this computation using an SR circuit which consists of two disconnected erasure gates. The closest equivalent SR circuit has gate priors given by the uniform marginal distributions, $q(x_1) = 1/2$ and $q(x_2) = 1/2$. Then the difference between the EF of the SR circuit and the EF of the AO process, $Q_{circ}(p) - Q_{AO}(p)$, can be made arbitrarily negative by taking $\epsilon$ sufficiently close to zero. Thus, the EF of the AO process may be arbitrarily larger than the EF of the closest equivalent SR circuit.
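The sign of this difference can be explored with a short Python sketch. It assumes zero residual EP, a hypothetical correlated AO prior over two bits parameterized by `eps` (which has uniform marginals), and a uniform actual input distribution; these are illustrative choices and need not coincide with the distributions of the example above. Because erasure maps every input to the same output, the KL divergence between output distributions vanishes.

```python
import math

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def kl(p, q):
    """KL divergence D(p||q) in nats."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)

def marginal(joint, i):
    """Marginal distribution of bit i of a two-bit joint distribution."""
    out = {}
    for x, p in joint.items():
        out[x[i]] = out.get(x[i], 0.0) + p
    return out

def ef_gap(p, q):
    """Q_circ(p) - Q_AO(p) for erasing two bits, in units of kT."""
    # AO process: Landauer cost S(p) plus mismatch cost D(p||q).
    q_ao = entropy(p) + kl(p, q)
    # SR circuit: two disconnected erasure gates whose priors are the marginals of q.
    q_circ = sum(
        entropy(marginal(p, i)) + kl(marginal(p, i), marginal(q, i)) for i in (0, 1)
    )
    return q_circ - q_ao

def correlated_prior(eps):
    """Hypothetical AO prior: the two bits agree with probability 1 - eps."""
    return {(0, 0): (1 - eps) / 2, (1, 1): (1 - eps) / 2,
            (0, 1): eps / 2, (1, 0): eps / 2}

p_uniform = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
gaps = [ef_gap(p_uniform, correlated_prior(eps)) for eps in (1e-1, 1e-3, 1e-6)]
print(gaps)
```

In this sketch the gap becomes more negative as `eps` shrinks, whereas when the actual distribution equals the AO prior the gap reduces to the (non-negative) Landauer loss.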

VII. RELATED WORK
The issue of how the thermodynamic costs of a circuit depend on the constraints inherent in the topology of the circuit has not previously been addressed using the tools of modern nonequilibrium statistical physics. Indeed, this precise issue has received very little attention in any of the statistical physics literature. A notable exception was a 1996 paper by Gershenfeld [60], which pointed out that all of the thermodynamic analyses of conventional (irreversible) computing architectures at the time were concerned with properties of individual gates, rather than entire circuits. That paper works through some elementary examples of the thermodynamics of circuits, and analyzes how the global structure of circuits (i.e., their wiring diagram) affects their thermodynamic properties. Gershenfeld concludes that the "next step will be to extend the analysis from these simple examples to more complex systems" [61].
There are also several papers that do not address circuits, but focus on tangentially related topics, using modern nonequilibrium statistical physics. Ito and Sagawa [41,42] considered the thermodynamics of (time-extended) Bayesian networks [62,63]. They divided the variables in the Bayes net into two sets: the sequence of states of a particular system through time, which they write as X, and all external variables that interact with the system as it evolves, which they write as C. They then derive and investigate an integral fluctuation theorem [12,15,34,64] that relates the EP generated by X and the EP flowing between X and C. (See also [65].) Note that [41] focuses on the EP generated by a proper subset of the nodes in the entire network. In contrast, our results concern the EP generated by all nodes. In addition, while [41] concentrates on an integral fluctuation theorem involving EP, we give an exact expression for (expected) EP.
Otsubo and Sagawa [66] considered the thermodynamics of stochastic Boolean network models of gene regulatory networks. They focused in particular on characterizing the information-theoretic and dissipative properties of 3-node motifs. While their study does concern dynamics over networks, it has little in common with the analysis in the current paper, in particular due to its restriction to 3-node systems.
Solitary processes are similar to "feedback control" processes, which have attracted much attention in the thermodynamics of information literature [2,67,68]. In feedback control processes, there is a subsystem A that evolves while coupled to another subsystem B, which is held fixed. (This joint evolution is often used to represent either A making a measurement of the state of B, or the state of B being used to determine which control protocol to apply to A.) It has been shown for feedback control processes that the total EP incurred by the joint A × B system is the "subsystem EP" of A, plus the drop in the mutual information between A and B [67]. Formally, this is identical to Eq. (26).
Crucially however, in feedback control processes there is no assumption that A and B are physically decoupled. (Formally, Eq. (20) is not assumed.) Therefore the change in mutual information can either be negative or positive in those processes (the latter occurs, for instance, when A performs a measurement of the state of B). In addition, the "subsystem EP" in these processes can be negative. For this reason, in feedback control processes there is no simple relationship between subsystem EP and the total EP incurred by the joint A × B system. In contrast, in solitary processes A and B are physically decoupled (cf. Eqs. (20) and (21)). For this reason, in solitary processes subsystem EP is non-negative, as is the drop in mutual information, Eq. (26), and so each of them is a lower bound on the total EP incurred by the joint A × B system.
Boyd et al. [69] considered the thermodynamics of "modular" systems, which in our terminology are a special type of solitary processes, with extra constraints imposed. In particular, to derive their results, [69] assumes there is exactly one thermodynamic reservoir (in their case, a heat bath). That restricts the applicability of their results. Nonetheless, individual gates in a circuit are run by solitary processes, and one could require that they in fact be run by modular systems, in order to analyze the thermodynamics of (appropriately constrained) circuits. However, instead of focusing on this issue, [69] focuses on the thermodynamics of "information ratchets" [70], modeling them as a sequence of iterations of a single solitary process, successively processing the symbols on a semi-infinite tape. In contrast, we extend the analysis of single solitary processes operating in isolation to analyze full circuits that comprise multiple interacting solitary processes.
Riechers [71] also contains work related to solitary processes, assuming a single heat bath, like [69]. [71] exploits the decomposition of EP into "mismatch cost" plus "residual EP" introduced in [30], in order to analyze thermodynamic attributes of a special kind of circuit. The analysis in that paper is not as complicated as either the analysis in the current paper or the analysis in [69]. That is because [71] does not focus on how the thermodynamic costs of running a system are affected if we impose a constraint on how the system is allowed to operate (e.g., if we require that it use solitary processes). In addition, the system considered in that paper is a very special kind of circuit: a set of N disconnected gates, working in parallel, with the outputs of those gates never combined.
[3] is a survey article relating many papers in the thermodynamics of computation. To clarify some of those relationships, it introduces a type of process related to solitary processes, called "subsystem processes". (See also [72].) For the purposes of the current paper though, we need to understand the thermodynamics specifically of solitary processes. In addition, being a summary paper, [3] presents some results from the arXiv version of the current paper, [72]. Specifically, [3,72] summarize some of the thermodynamics of straight-line circuits subject to the extra restriction (not made in the current paper) that there only be a single output node.
There is a fairly extensive literature on "logically reversible circuits" and their thermodynamic properties [3,6,[73][74][75]. This work is based on the early analysis in [76], and so it is not grounded in modern nonequilibrium statistical physics. Indeed, modern nonequilibrium statistical physics reveals some important subtleties and caveats with the thermodynamic properties of logically reversible circuits [3]. Also see [16] for important clarifications of the relationship between thermodynamic and logical reversibility, not appreciated in some of the research community working on logically reversible circuits.
Finally, another related paper is [77]. This paper starts by taking a distilled version of the decomposition of EP in [30] as given. It then discusses some of the many new problems in computer science theory that this decomposition leads to, both involving circuits and involving many other kinds of computational systems.

VIII. DISCUSSION AND FUTURE WORK
It is important to emphasize that SR circuits are somewhat unrealistic models of many real-world digital circuits. For example, many real digital circuits have multiple gates running at the same time, and often do not reinitialize their gates after they're run. In addition, many real digital circuits have characteristics like loops and branching. This makes them challenging to model at all using simple solitary processes. Extending our analysis to these more general models of circuits is an important direction for future work. Nonetheless, it is worth mentioning that all of the thermodynamic costs discussed above (including Landauer loss, mismatch cost, and residual EP) are intrinsic to any physical process, as described in Section III. So versions of them arise in those other kinds of circuits, only in modified form.
An interesting set of issues to investigate in future work is the scaling properties of the thermodynamic costs of SR circuits. In conventional circuit complexity theory [4,5] one first specifies a "circuit family" which comprises an infinite set of circuits that have different size input spaces but that are all (by definition) viewed as "performing the same computation". For example, one circuit family is given by an infinite set of circuits each of which has a different size input space, and outputs the single bit of whether the number of 1's in its input string is odd or even. Circuit complexity theory is concerned with how various resource costs in making a given circuit (e.g., the number of gates in the circuit) scale with the size of the circuit as one goes through the members of a circuit family. For example, it may analyze how the number of gates in a set of circuits, each of which determines whether its input string contains an odd number of 1's, scales with the size of those input strings. One interesting set of issues for future research is to perform these kinds of scaling analyses when the "resource costs" are thermodynamic costs of running the circuit, rather than conventional costs like the number of gates. In particular, it's interesting to consider classes of circuit families defined in terms of such costs, in analogy to the complexity classes considered in computer science theory, like P/poly, or P/log.
Other interesting issues arise if we formulate a cellular automaton (CA) as a circuit with an infinite number of nodes in each layer, and an infinite number of layers, each layer of the circuit corresponding to another timestep of the CA. For example, suppose we are given a particular CA rule (i.e., a particular map taking the state of each layer i to the state of layer i + 1) and a particular distribution over its initial infinite bit pattern. These uniquely specify the "thermodynamic EP rate", given by the total EP generated by running the CA for n iterations (i.e., for reaching the n'th layer in the circuit), divided by n. It would be interesting to see how this EP rate depends on the CA rule and initial distribution over bit patterns.
Finally, another important direction for future work arises if we broaden our scope beyond digital circuits designed by human engineers, to include naturally occurring circuits such as brains and gene regulatory networks. The "gates" in such circuits are quite noisy, but all of our results hold independent of the noise levels of the gates. On the other hand, like real digital circuits, these naturally occurring circuits have loops, branching, concurrency, etc., and so might best be modeled with some extension of the models introduced in this paper. Again though, the important point is that whatever model is used, the EP generated by running a physical system governed by that model would include Landauer loss, mismatch cost, and residual EP.

ments $P_A$ to a solitary process that does so, and how that change increases the total EP. That example also considers the special case where the prior of the full $A \times B$ system is required to factor into a product of a distribution over the initial value of $x_A$ times a distribution over the initial value of $x_B$. In particular, it shows that the Landauer loss is the minimal value of the mismatch cost in this special case.
[54] Strictly speaking, if the circuit is a Bayes net, then pIN should be a product distribution over the root nodes. Here we relax this requirement of Bayes nets, and let pIN have arbitrary correlations.
[55] For example, it could be that at some t < 0, the joint state of the input nodes is some special initialized state ∅ with probability 1, and that that initialized joint state is then overwritten with the values copied in from some variables in an offboard system, just before the circuit starts. The joint entropy of the offboard system and the circuit would not change in this overwriting operation, and so it is theoretically possible to perform that operation with zero EF [2]. However, to be able to run the circuit again after it finishes, with new values at the input nodes set this way, we need to reinitialize those input nodes to the joint state ∅. As elaborated below, we do include the thermodynamic costs of reinitializing those input nodes in preparation of the next run of the circuit. This is consistent with modern analyses of Maxwell's demon, which account for the costs of reinitializing the demon's memory in preparation for its next run [2,3]. See also Section VIII.
[56] Suppose that the outputs of circuit Φ were the inputs of some subsequent circuit Φ′. That would mean that when Φ′ reinitializes its inputs, it would reinitialize the outputs of Φ. Since we ascribe the thermodynamic costs of that reinitialization to Φ′, it would result in double-counting to also ascribe the costs of reinitializing Φ's outputs to Φ.

Preliminaries
Consider a conditional distribution P (y|x) that specifies the probability of "output" y ∈ Y given "input" x ∈ X , where X and Y are finite.
Given some $Z \subseteq X$, the island decomposition $L_Z(P)$ of $P$, and any $p \in \Delta_X$, let $p(c) = \sum_{x \in c} p(x)$ indicate the total probability within island $c$, and
$$p^c(x) = \begin{cases} p(x)/p(c) & \text{if } x \in c \text{ and } p(c) > 0 \\ 0 & \text{otherwise} \end{cases}$$
indicate the conditional probability of state $x$ within island $c$.
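The quantities $p(c)$ and $p^c$ can be computed as in the Python sketch below. It assumes that the islands of $P$ are the equivalence classes of inputs under the transitive closure of the relation "x and x' can both reach some common output y with nonzero probability", restricted to $Z$ when $Z$ is given; this assumption is consistent with how islands are used in the proofs of this appendix. The function names and the dictionary encoding of $P$ are hypothetical.

```python
def island_decomposition(P, Z=None):
    """Islands of P, where P maps each input x to a dict {y: P(y|x)}.

    Inputs are merged (union-find) whenever they share an output y with
    nonzero probability. If Z is given, only inputs in Z are considered.
    """
    states = [x for x in P if Z is None or x in Z]
    parent = {x: x for x in states}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # output y -> some input that reaches y
    for x in states:
        for y, prob in P[x].items():
            if prob > 0:
                if y in first_seen:
                    parent[find(x)] = find(first_seen[y])
                else:
                    first_seen[y] = x

    islands = {}
    for x in states:
        islands.setdefault(find(x), set()).add(x)
    return [frozenset(c) for c in islands.values()]

def island_probabilities(p, islands):
    """Return {c: (p(c), p^c)}, with p^c(x) = p(x)/p(c) if p(c) > 0 and 0 otherwise."""
    out = {}
    for c in islands:
        pc = sum(p.get(x, 0.0) for x in c)
        out[c] = (pc, {x: (p.get(x, 0.0) / pc if pc > 0 else 0.0) for x in c})
    return out

# A 2-input XOR gate: inputs with equal parity share an output, giving two islands.
P_xor = {(a, b): {a ^ b: 1.0} for a in (0, 1) for b in (0, 1)}
islands = island_decomposition(P_xor)
probs = island_probabilities({(0, 0): 0.5, (0, 1): 0.5}, islands)
```

For a logically reversible map such as the identity, every input forms its own island under this definition.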
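In our proofs below, we will make use of the notion of relative interior. Given a linear space $V$, the relative interior of a subset $A \subseteq V$ is defined as [82]
$$\mathrm{relint}\, A := \{x \in A : \forall y \in A, \exists \epsilon > 0 \text{ s.t. } x + \epsilon(x - y) \in A\}$$
Finally, for any function $g(x)$, we use the notation
$$\partial^+_x g(x)|_{x=a} := \lim_{\epsilon \to 0^+} \frac{1}{\epsilon}\left[g(a + \epsilon) - g(a)\right]$$
to indicate the right-handed derivative of $g(x)$ at $x = a$. When the condition that $x = a$ is omitted, $a$ is implicitly assumed to equal 0, i.e.,
$$\partial^+_x g(x) := \lim_{\epsilon \to 0^+} \frac{1}{\epsilon}\left[g(\epsilon) - g(0)\right]$$
We also adopt the shorthand that $a^\epsilon := a + \epsilon(b - a)$, and write $S(a^\epsilon) := S(p(a^\epsilon))$, $P a^\epsilon := P p(a^\epsilon)$, and so $S(P a^\epsilon) = S(P p(a^\epsilon))$.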

Proofs
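Given some conditional distribution $P(y|x)$ and function $f : X \to \mathbb{R}$, we consider the function $\Gamma : \Delta_X \to \mathbb{R}$ as
$$\Gamma(p) := S(Pp) - S(p) + E_p[f]$$
Note that $\Gamma$ is continuous on the relative interior of $\Delta_X$.

Lemma A1. For any $a, b \in \Delta_X$, the directional derivative of $\Gamma$ at $a$ toward $b$ is given by
$$\partial^+_\epsilon \Gamma(a^\epsilon) = D(Pb \| Pa) - D(b \| a) + \Gamma(b) - \Gamma(a)$$
where $a^\epsilon := a + \epsilon(b - a)$.

Proof. Using the definition of $\Gamma$, write
$$\partial^+_\epsilon \Gamma(a^\epsilon) = \partial^+_\epsilon \left[S(Pa^\epsilon) - S(a^\epsilon)\right] + \partial^+_\epsilon E_{a^\epsilon}[f]$$
Consider the first term on the RHS,
$$\partial^+_\epsilon \left[S(Pa^\epsilon) - S(a^\epsilon)\right] = D(Pb \| Pa) - D(b \| a) + \left[S(Pb) - S(b)\right] - \left[S(Pa) - S(a)\right]$$
where we adopt the convention that if $a(x) = 0$, $b(x) \neq 0$ for some $x$, then this expression means $-\infty$. We next consider the $\partial^+_\epsilon E_{a^\epsilon}[f]$ term,
$$\partial^+_\epsilon E_{a^\epsilon}[f] = E_b[f] - E_a[f]$$
Combining the above gives the expression in the lemma.

Theorem A1. Let $V$ be a convex subset of $\Delta_X$. Then for any $q \in \arg\min_{s \in V} \Gamma(s)$ and any $p \in V$,
$$\Gamma(p) - \Gamma(q) \geq D(p \| q) - D(Pp \| Pq) \qquad (A2)$$
Equality holds if $q$ is in the relative interior of $V$.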
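Proof. Define the convex mixture $q^\epsilon := q + \epsilon(p - q)$. By Lemma A1, the directional derivative of $\Gamma$ at $q$ in the direction $p - q$ is
$$\partial^+_\epsilon \Gamma(q^\epsilon)|_{\epsilon=0} = D(Pp \| Pq) - D(p \| q) + \Gamma(p) - \Gamma(q)$$
At the same time, $\partial^+_\epsilon \Gamma(q^\epsilon)|_{\epsilon=0} \geq 0$, since $q$ is a minimizer within a convex set. Eq. (A2) then follows by rearranging. When $q$ is in the relative interior of $V$, $q - \epsilon(p - q) \in V$ for sufficiently small $\epsilon > 0$. Then,
$$0 \leq \lim_{\epsilon \to 0^+} \frac{1}{\epsilon}\left[\Gamma(q - \epsilon(p - q)) - \Gamma(q)\right] = -\lim_{\epsilon \to 0^-} \frac{1}{\epsilon}\left[\Gamma(q + \epsilon(p - q)) - \Gamma(q)\right] = -\partial^+_\epsilon \Gamma(q^\epsilon)|_{\epsilon=0}$$
where the first inequality comes from the fact that $q$ is a minimizer, in the second step we change variables as $\epsilon \to -\epsilon$, and in the last step we use the continuity of $\Gamma$ on the interior of the simplex. Combining with the above implies
$$\partial^+_\epsilon \Gamma(q^\epsilon)|_{\epsilon=0} = D(Pp \| Pq) - D(p \| q) + \Gamma(p) - \Gamma(q) = 0$$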
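The directional-derivative expression used in this proof can be checked against a finite difference. The Python sketch below assumes that $\Gamma(p) = S(Pp) - S(p) + E_p[f]$, which is the form implied by the last equation of the proof; the channel `P`, the function `f`, and the distributions `a` and `b` are arbitrary hypothetical values chosen in the interior of the simplex.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a list of probabilities."""
    return -sum(v * math.log(v) for v in p if v > 0)

def kl(p, q):
    """KL divergence D(p||q) in nats between two lists of probabilities."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def apply_channel(P, p):
    """Output distribution Pp, with P[x][y] = P(y|x)."""
    return [sum(P[x][y] * p[x] for x in range(len(p))) for y in range(len(P[0]))]

def gamma(P, f, p):
    """Assumed form of Gamma: S(Pp) - S(p) + E_p[f]."""
    expected_f = sum(pi * fi for pi, fi in zip(p, f))
    return entropy(apply_channel(P, p)) - entropy(p) + expected_f

def directional_derivative(P, f, a, b):
    """Lemma A1: D(Pb||Pa) - D(b||a) + Gamma(b) - Gamma(a)."""
    return (kl(apply_channel(P, b), apply_channel(P, a)) - kl(b, a)
            + gamma(P, f, b) - gamma(P, f, a))

# Hypothetical 3-input, 2-output channel and cost function.
P = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
f = [0.3, 1.0, 0.1]
a = [0.2, 0.5, 0.3]
b = [0.6, 0.1, 0.3]

eps = 1e-6
a_eps = [ai + eps * (bi - ai) for ai, bi in zip(a, b)]
finite_diff = (gamma(P, f, a_eps) - gamma(P, f, a)) / eps
analytic = directional_derivative(P, f, a, b)
print(finite_diff, analytic)
```

The two printed values should agree up to the first-order error of the finite difference.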
The following result is key. It means that the prior within an island has full support in that island.
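Lemma A2. For any $c \in L(P)$ and $q \in \arg\min_{s : \mathrm{supp}\, s \subseteq c} \Gamma(s)$,
$$\mathrm{supp}\, q = \{x \in c : f(x) < \infty\}$$
Proof. We prove the claim by contradiction. Assume that $q$ is a minimizer with $\mathrm{supp}\, q \subset \{x \in c : f(x) < \infty\}$. Note there cannot be any $x \in \mathrm{supp}\, q$ and $y \in Y \setminus \mathrm{supp}\, Pq$ such that $P(y|x) > 0$ (if there were such an $x, y$, then $[Pq](y) = \sum_{x'} P(y|x') q(x') \geq P(y|x) q(x) > 0$, contradicting the statement that $y \in Y \setminus \mathrm{supp}\, Pq$). Thus, by definition of islands, there must be an $\hat{x} \in c \setminus \mathrm{supp}\, q$, $\hat{y} \in \mathrm{supp}\, Pq$ such that $f(\hat{x}) < \infty$ and $P(\hat{y}|\hat{x}) > 0$.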
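Since $q$ is a minimizer of $\Gamma$, $\partial_\epsilon \Gamma(q^\epsilon)|_{\epsilon=0} \geq 0$. Since $\Gamma$ is convex, the second derivative $\partial^2_\epsilon \Gamma(q^\epsilon) \geq 0$ and therefore $\partial_\epsilon \Gamma(q^\epsilon) \geq 0$ for all $\epsilon \geq 0$. Taking $a = q^\epsilon$ and $b = u$ in Lemma A1 and rearranging then gives a chain of inequalities, the second of which uses that $q$ is a minimizer of $\Gamma$. At the same time, the same quantity can be bounded using that $q^\epsilon(\hat{x}) = \epsilon$ and that $[Pq^\epsilon](y) = (1 - \epsilon)[Pq](y) + \epsilon P(y|\hat{x})$, so $[Pq^\epsilon](y) \geq (1 - \epsilon)[Pq](y)$ and $[Pq^\epsilon](y) \geq \epsilon P(y|\hat{x})$.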
The following result is also key. Intuitively, it follows from the fact that the directional derivative of S(p) into the simplex for any p on the edge of the simplex is negative infinite.
Lemma A3. For any island $c \in L(P)$, the minimizer $q \in \arg\min_{s : \mathrm{supp}\, s \subseteq c} \Gamma(s)$ is unique.
Proof. Consider any two distributions $p, q \in \arg\min_{s : \mathrm{supp}\, s \subseteq c} \Gamma(s)$, and let $p' = Pp$, $q' = Pq$. We will prove that $p = q$.
First, note that by Lemma A2, $\mathrm{supp}\, q = \mathrm{supp}\, p = c$. By Theorem A1,
$$\Gamma(p) - \Gamma(q) = D(p \| q) - D(p' \| q') = \sum_{x,y} p(x) P(y|x) \ln \frac{p(x) P(y|x)}{q(x) P(y|x)} - \sum_y p'(y) \ln \frac{p'(y)}{q'(y)} \geq 0$$
where the last line uses the log-sum inequality. If the inequality is strict, then $p$ and $q$ can't both be minimizers, i.e., the minimizer must be unique, as claimed.
If instead the inequality is not strict, i.e., $\Gamma(p) - \Gamma(q) = 0$, then there is some constant $\alpha$ such that for all $x, y$ with $P(y|x) > 0$,
$$\frac{p(x) P(y|x)}{q(x) P(y|x)} = \alpha$$
which is the same as
$$\frac{p(x)}{q(x)} = \alpha \qquad (A6)$$
Now consider any two different states $x, x' \in c$ such that $P(y|x) > 0$ and $P(y|x') > 0$ for some $y$ (such states must exist by the definition of islands). For Eq. (A6) to hold for both $x, x'$ with that same, shared $y$, it must be that $p(x)/q(x) = p(x')/q(x')$. Take another state $x'' \in c$ such that $P(y'|x') > 0$ and $P(y'|x'') > 0$ for some $y'$. Since this must be true for all pairs $x, x' \in c$, $p(x)/q(x) = \mathrm{const}$ for all $x \in c$, and $p = q$, as claimed.
In words, $\phi(c)$ is the subset of output states in $Y$ that receive probability from input states in $c$. By the definition of the island decomposition, for any $y \in \phi(c)$, $P(y|x) > 0$ only if $x \in c$. Thus, for any $p$ and any $y \in \phi(c)$, we can write
$$[Pp](y) = \sum_{x \in c} P(y|x) p(x) = p(c) [Pp^c](y)$$
Using $p = \sum_{c \in L(P)} p(c) p^c$ and linearity of expectation,
$$E_p[f] = \sum_{c \in L(P)} p(c) E_{p^c}[f]$$
We are now ready to prove the main result of this appendix.
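Theorem 1. Consider any function $\Gamma : \Delta_X \to \mathbb{R}$ of the form
$$\Gamma(p) := S(Pp) - S(p) + E_p[f]$$
where $P(y|x)$ is some conditional distribution of $y \in Y$ given $x \in X$ and $f : X \to \mathbb{R} \cup \{\infty\}$ is some function. Let $Z$ be any subset of $X$ such that $f(x) < \infty$ for $x \in Z$, and let $q \in \Delta_Z$ be any distribution that obeys $q^c \in \arg\min_{r : \mathrm{supp}\, r \subseteq c} \Gamma(r)$ for all $c \in L_Z(P)$.
Then, each $q^c$ will be unique, and for any $p$ with $\mathrm{supp}\, p \subseteq Z$,
$$\Gamma(p) = D(p \| q) - D(Pp \| Pq) + \sum_{c \in L_Z(P)} p(c) \Gamma(q^c)$$
Proof. We prove the theorem by considering two cases separately.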
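Case 1: $Z = X$. This case can be assumed when $f(x) < \infty$ for all $x$, so that $L_Z(P) = L(P)$. Then, by Lemma A4, we have $\Gamma(p) = \sum_{c \in L(P)} p(c) \Gamma(p^c)$. By Lemma A2 and Theorem A1,
$$\Gamma(p^c) - \Gamma(q^c) = D(p^c \| q^c) - D(Pp^c \| Pq^c)$$
where we've used that if $\mathrm{supp}\, q^c = c$, then $q^c$ is in the relative interior of the set $\{s \in \Delta_X : \mathrm{supp}\, s \subseteq c\}$. $q^c$ is unique by Lemma A3.
At the same time, observe that for any $p, r \in \Delta_X$,
$$D(p \| r) - D(Pp \| Pr) = \sum_{c \in L(P)} p(c) \left[ D(p^c \| r^c) - D(Pp^c \| Pr^c) \right]$$
The theorem follows by combining.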
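Case 2: $Z \subset X$. In this case, define a "restriction" of $f$ and $P$ to domain $Z$ as follows:
1. Define the function $\tilde{f} : Z \to \mathbb{R}$ via $\tilde{f}(x) = f(x)$ for all $x \in Z$.
2. Define the conditional distribution $\tilde{P}(y|x)$ for $y \in Y, x \in Z$ via $\tilde{P}(y|x) = P(y|x)$ for all $y \in Y, x \in Z$.
In addition, for any distribution $p \in \Delta_X$ with $\mathrm{supp}\, p \subseteq Z$, let $\tilde{p}$ be a distribution over $Z$ defined via $\tilde{p}(x) = p(x)$ for $x \in Z$. Now, by inspection, it can be verified that for any $p \in \Delta_X$ with $\mathrm{supp}\, p \subseteq Z$,
$$\Gamma(p) = S(\tilde{P}\tilde{p}) - S(\tilde{p}) + E_{\tilde{p}}[\tilde{f}] =: \tilde{\Gamma}(\tilde{p}) \qquad (A8)$$
We can now apply Case 1 of the theorem to the function $\tilde{\Gamma} : \Delta_Z \to \mathbb{R}$, as defined in terms of the tuple $(Z, \tilde{f}, \tilde{P})$ (rather than the function $\Gamma : \Delta_X \to \mathbb{R}$, as defined in terms of the tuple $(X, f, P)$). This gives
$$\tilde{\Gamma}(\tilde{p}) = D(\tilde{p} \| \tilde{q}) - D(\tilde{P}\tilde{p} \| \tilde{P}\tilde{q}) + \sum_{c \in L(\tilde{P})} \tilde{p}(c) \tilde{\Gamma}(\tilde{q}^c)$$
where, for all $c \in L(\tilde{P})$, $\tilde{q}^c$ is the unique distribution that satisfies $\tilde{q}^c \in \arg\min_{r \in \Delta_Z : \mathrm{supp}\, r \subseteq c} \tilde{\Gamma}(r)$. Now, let $q$ be the natural extension of $\tilde{q}$ from $\Delta_Z$ to $\Delta_X$. Clearly, for all $c \in L(\tilde{P})$, $\Gamma(q^c) = \tilde{\Gamma}(\tilde{q}^c)$ by Eq. (A8). In addition, each $q^c$ is the unique distribution that satisfies $q^c \in \arg\min_{r \in \Delta_X : \mathrm{supp}\, r \subseteq c} \Gamma(r)$. Finally, it is easy to verify that $D(p \| q) = D(\tilde{p} \| \tilde{q})$, $D(\tilde{P}\tilde{p} \| \tilde{P}\tilde{q}) = D(Pp \| Pq)$, $L(\tilde{P}) = L_Z(P)$ (recall the definition of $L_Z$ from Section II D). Combining the above results with Eq. (A8) gives
$$\Gamma(p) = \tilde{\Gamma}(\tilde{p}) = D(p \| q) - D(Pp \| Pq) + \sum_{c \in L_Z(P)} p(c) \Gamma(q^c)$$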
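The decomposition established by the theorem can be verified numerically in a special case where the island priors are available in closed form. The Python sketch below assumes $\Gamma(p) = S(Pp) - S(p) + E_p[f]$ and takes $P$ to be a deterministic many-to-one map, so that each island is the preimage of one output and, within an island, $\Gamma(s) = -S(s) + E_s[f]$ is minimized by $q^c(x) \propto e^{-f(x)}$. The particular map `out`, function `f`, island weights and distribution `p` are hypothetical.

```python
import math

def entropy(d):
    """Shannon entropy in nats of a dict of probabilities."""
    return -sum(v * math.log(v) for v in d.values() if v > 0)

def kl(p, q):
    """KL divergence D(p||q) in nats."""
    return sum(v * math.log(v / q[x]) for x, v in p.items() if v > 0)

def push(p, out):
    """Output distribution of the deterministic map x -> out[x]."""
    r = {}
    for x, v in p.items():
        r[out[x]] = r.get(out[x], 0.0) + v
    return r

def gamma(p, f, out):
    """Assumed form of Gamma: S(Pp) - S(p) + E_p[f]."""
    return entropy(push(p, out)) - entropy(p) + sum(v * f[x] for x, v in p.items())

# Hypothetical deterministic map with two islands, {0, 1} -> 'a' and {2, 3, 4} -> 'b'.
out = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b"}
f = {0: 0.2, 1: 1.0, 2: 0.5, 3: 0.0, 4: 2.0}

islands = {}
for x, y in out.items():
    islands.setdefault(y, []).append(x)

# Closed-form island priors for a deterministic map: q^c(x) proportional to exp(-f(x)).
q_island = {}
for y, c in islands.items():
    z = sum(math.exp(-f[x]) for x in c)
    q_island[y] = {x: math.exp(-f[x]) / z for x in c}

# Any strictly positive weights over islands give a valid prior q.
weights = {"a": 0.5, "b": 0.5}
q = {x: weights[y] * q_island[y][x] for y, c in islands.items() for x in c}
p = {0: 0.1, 1: 0.3, 2: 0.25, 3: 0.05, 4: 0.3}

lhs = gamma(p, f, out)
residual = sum(
    sum(p[x] for x in c) * gamma(q_island[y], f, out) for y, c in islands.items()
)
rhs = kl(p, q) - kl(push(p, out), push(q, out)) + residual
print(lhs, rhs)
```

The two printed values coincide, and the result does not depend on the island weights chosen for `q`, since those weights cancel between the two KL terms.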

Example 8. Suppose we are interested in thermodynamic costs associated with functions $f$ whose image contains the value infinity, i.e., $f : X \to \mathbb{R} \cup \{\infty\}$. For such functions, $\Gamma(p) = \infty$ for any $p$ which has support over an $x \in X$ such that $f(x) = \infty$. In such a case it is not meaningful to consider a prior distribution $q$ (as in Theorem 1) which has support over any $x$ with $f(x) = \infty$. For such functions we also are no longer able to presume that the optimal distribution has full support within each island $c \in L(P)$, because in general the proof of Lemma A2 no longer holds when $f$ can take infinite values.
Nonetheless, by Eq. (A9), for the purposes of analyzing the thermodynamic costs of actual initial distributions p that have finite Γ(p) (and so have zero mass on any x such that f (x) = ∞), we can always carry out our usual analysis if we first reduce the problem to an appropriate "restriction" of f .

Example 9. Suppose we wish to implement a (discrete-time) dynamics P (x′|x) over X using a CTMC. Recall from the end of Section II B that by appropriately expanding the state space X to include a set of "hidden states" Z in addition to X , and appropriately designing the rate matrices over that expanded state space X ∪ Z, we can ensure that the resultant evolution over X is arbitrarily close to the desired conditional distribution P . Indeed, one can even design those rate matrices over X ∪ Z so that not only is the dynamics over X arbitrarily close to the desired P , but in addition the EF generated in running that CTMC over X ∪ Z is arbitrarily close to the lower bound of Eq. (21) [21].
However, in any real-world system that implements some P with a CTMC over an expanded space X ∪ Z, that lower bound will not be achieved, and nonzero EP will be generated. In general, to analyze the EP of such real-world systems one has to consider the mismatch cost and residual EP of the full CTMC over the expanded space X ∪ Z. Fortunately though, we can design the CTMC over X ∪ Z so that when it begins the implementation of P , there is zero probability mass on any of the states in Z [21,22]. If we do that, then we can apply Eq. (A9), and so restrict our calculations of mismatch cost and residual EP to only involve the dynamics over X , without any concern for the dynamics over Z.

Example 10.
Our last example is to derive the alternative decomposition of the EP of an SR circuit which is discussed in Section VI B. Recall that due to Eq. (29), the initial distribution over any gate in an SR circuit has partial support. This means we can apply Eq. (A9) to decompose the EF, in direct analogy to the use of Theorem 1 to derive Theorem 2 -only with the modification that the spaces X and Y are set to X pa(g) and X g , respectively, rather than both set to X n(g) , as was done in deriving Theorem 2. (Note that the islands also change when we apply Eq. (A9) rather than Theorem 1, from the islands of P g to the islands of π g ). The end result is a decomposition of EF just like that in Theorem 2, in which we have the same circuit Landauer cost and circuit Landauer loss expressions as in that theorem, but now have the modified forms of circuit mismatch cost and of circuit residual EP introduced in Section VI B.

Appendix B: Thermodynamic costs for SR circuits
To begin, we will make use of the fact that there is no overlap in time among the solitary processes in an SR circuit, so the total EF incurred can be written as
$$Q(p) = \sum_{g \in G} Q_g(p_{n(g)}) \qquad (B1)$$
Moreover, for each gate $g$, the solitary process that updates the variables in $n(g)$ starts with $x_g$ in its initialized state with probability 1. So we can overload notation and write $Q_g(p_{\mathrm{pa}(g)})$ instead of $Q_g(p_{n(g)})$ for each gate $g$.
Next, we again use the fact that the solitary processes have no overlap in time to establish that the minimal value of the sum of the EPs of the gates is the sum of the minimal EPs of the gates considered separately from one another. As a result, we can jointly take $\sigma_g(p_{n(g)}) \to 0$ for all gates $g$ in the circuit [21]. We can then use Eq. (B1) to establish that the minimal EF of the circuit is simply the sum of the minimal EFs of running each of the gates in the circuit, i.e., the sum of the subsystem Landauer costs of running the gates. In other words, the circuit Landauer cost is
$$\begin{aligned}
\mathcal{L}^{\mathrm{circ}}(p) &= \sum_{g \in G} \left[ S(p_{n(g)}) - S(P_g\, p_{n(g)}) \right] && \text{(B4)} \\
&= \sum_{g \in G} \left[ S(p_{\mathrm{pa}(g)}) - S(\pi_g\, p_{\mathrm{pa}(g)}) \right] && \text{(B5)} \\
&= \sum_{g \in G \setminus W} \left[ S(p_{\mathrm{pa}(g)}) - S(\pi_g\, p_{\mathrm{pa}(g)}) \right]. && \text{(B6)}
\end{aligned}$$
To derive the second line, we've used the fact that in an SR circuit, each gate is set to its initialized value at the beginning of its solitary process with probability 1, and that its parents are set to their initialized states with probability 1 at the end of the process. Then to derive the third line we've used the fact that wire gates implement the identity map, and so $S(p_{\mathrm{pa}(g)}) - S(\pi_g\, p_{\mathrm{pa}(g)}) = 0$ for all $g \in W$.
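As a concrete illustration of the per-gate terms in Eq. (B5), the following minimal sketch computes the entropy drop $S(p_{\mathrm{pa}(g)}) - S(\pi_g\, p_{\mathrm{pa}(g)})$ for two toy gates. It rests on assumptions that are not in the text: gates are taken to be deterministic maps, entropy is measured in bits, and the helper names (`entropy`, `push_forward`, `gate_landauer_cost`) are hypothetical.

```python
import math
from collections import defaultdict

def entropy(p):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def push_forward(p, f):
    """Distribution of f(x) when x ~ p, for a deterministic gate map f (assumption)."""
    out = defaultdict(float)
    for x, q in p.items():
        out[f(x)] += q
    return dict(out)

def gate_landauer_cost(p_parents, f):
    """One summand of Eq. (B5): entropy of the parents minus entropy of the gate output."""
    return entropy(p_parents) - entropy(push_forward(p_parents, f))

# Toy AND gate with two independent, uniformly distributed input bits.
uniform2 = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
and_cost = gate_landauer_cost(uniform2, lambda x: x[0] & x[1])

# Toy wire gate (identity map) on one uniformly distributed bit.
wire_cost = gate_landauer_cost({0: 0.5, 1: 0.5}, lambda x: x)
```

In this sketch the wire gate contributes zero, which is the step that takes Eq. (B5) to Eq. (B6), while the AND gate contributes a strictly positive amount because it is not invertible.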
Given the assumption that $x_g = \emptyset$ at the beginning of the solitary process for gate $g$, we can rewrite
$$I_{p^{\mathrm{beg}(g)}}(X_{n(g)}; X_{V \setminus n(g)}) = I_{p^{\mathrm{beg}(g)}}(X_{\mathrm{pa}(g)}; X_{V \setminus \mathrm{pa}(g)}). \tag{B9}$$
Similarly, because $x_v = \emptyset$ for all $v \in \mathrm{pa}(g)$ at the end of the solitary process for gate $g$, we can rewrite
$$I_{p^{\mathrm{end}(g)}}(X_{n(g)}; X_{V \setminus n(g)}) = I_{p^{\mathrm{end}(g)}}(X_g; X_{V \setminus g}). \tag{B10}$$
Finally, for any wire gate $g \in W$, given the assumption that $x_g = \emptyset$ at the beginning of the solitary process, we can write
$$I_{p^{\mathrm{beg}(g)}}(X_{\mathrm{pa}(g)}; X_{V \setminus \mathrm{pa}(g)}) = I_{p^{\mathrm{end}(g)}}(X_g; X_{V \setminus g}).$$
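The first of these identities can be checked numerically on a small example. The sketch below is illustrative only: the three-variable joint distribution, the use of `None` to stand for the initialized state $\emptyset$, the choice of bits as units, and the helper names are all assumptions rather than anything taken from the text. It confirms that when the gate variable is fixed to its initialized value, including it on one side of the mutual information changes nothing.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    """Marginal of a joint distribution over the coordinates listed in idx."""
    out = defaultdict(float)
    for x, q in joint.items():
        out[tuple(x[i] for i in idx)] += q
    return out

def mutual_information(joint, a, b):
    """Mutual information in bits between coordinate groups a and b (tuples of indices)."""
    pa, pb, pab = marginal(joint, a), marginal(joint, b), marginal(joint, a + b)
    total = 0.0
    for key, q in pab.items():
        if q > 0:
            ka, kb = key[:len(a)], key[len(a):]
            total += q * math.log2(q / (pa[ka] * pb[kb]))
    return total

# Coordinates: 0 = parent of g, 1 = gate g (always initialized, written None), 2 = rest of circuit.
joint = {
    (0, None, 0): 0.4,
    (0, None, 1): 0.1,
    (1, None, 0): 0.1,
    (1, None, 1): 0.4,
}

mi_neighborhood = mutual_information(joint, (0, 1), (2,))  # uses n(g) = pa(g) together with g
mi_parents = mutual_information(joint, (0,), (2,))         # uses pa(g) only
```

The two quantities agree here because a variable that takes a single value with probability 1 carries no information about anything else.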
Now, notice that for every $v \in V \setminus (W \cup \mathrm{OUT})$ (i.e., every node which is not a wire and not an output), there is a corresponding wire $w$ which transmits $v$ to its child, and which has $S(p_w) = S(p_v)$. This lets us rewrite Eq. (B13) as [equation missing].
In this appendix, we consider a more general version of SR circuits, in which non-output gates can have out-degree greater than 1.
First, we need to modify the definition of an SR circuit in Section V. This is because in SR circuits, the subsystem corresponding to a given gate g reinitializes all of the parents of that gate to their initialized state, ∅. If, however, there is some node v that has out-degree greater than 1 (i.e., has more than one child), then we must guarantee that no such v is reinitialized by one of its child gates before all of its child gates have run. To do so, we require that each non-output node v in the circuit is reinitialized only by the last of its child gates to run, while the earlier children (if any) apply the identity map to v.
Note that this rule could result in different thermodynamic costs of an overall circuit, depending on the precise topological order we use to determine which of the children of a given v reinitializes v. This would mean that the entropic costs of running a circuit would depend on the (arbitrary) choice we make for the topological order of the gates in the circuit. This issue won't arise in this paper however. To see why, recall that we model the wires in the circuit themselves as gates, which have both in-degree and out-degree equal to 1. As a result, if v has out-degree greater than 1, then v is not a wire gate, and therefore all of its children must be wire gates, and therefore none of those children has multiple parents. So the problem is automatically avoided.
We now prove that for SR circuits with out-degree greater than 1, circuit Landauer loss can be arbitrarily large.

Proposition 1.
For any $\pi_\Phi$, non-delta-function input distribution $p_{\mathrm{IN}}$, and $\kappa \geq 0$, there exists an SR circuit with out-degree greater than 1 that implements $\pi_\Phi$ for which $\mathcal{L}^{\mathrm{loss}}(p) \geq \kappa$.
Proof. Let Φ = (V, E, F, X ) be such a circuit that implements π Φ . Given that p is not a delta function, there must be an input node, which we call v, such that S(p v ) > 0. Take g ∈ OUT to be any output gate of Φ, and let π g ∈ F be its update map.
4. $X_{w'} = X_{g'} = X_{w''} = X_v$.
In words, $\Phi'$ is the same as $\Phi$ except that: (a) we have added an "erasure gate" $g'$ which takes $v$ as input (through a new wire gate $w'$), and (b) this erasure gate is provided as an additional input, which is completely ignored, to one of the existing output gates $g$ (through a new wire gate $w''$).
It is straightforward to see that $\pi_{\Phi'} = \pi_\Phi$. At the same time, $S(p_{\mathrm{pa}(g')}) - S(\pi_{g'}\, p_{\mathrm{pa}(g')}) = S(p_v)$, and thus $\mathcal{L}^{\mathrm{loss}}_{\Phi'}(p) = \mathcal{L}^{\mathrm{loss}}_{\Phi}(p) + S(p_v)$, where $\mathcal{L}^{\mathrm{loss}}_{\Phi}$ and $\mathcal{L}^{\mathrm{loss}}_{\Phi'}$ indicate the circuit Landauer loss of $\Phi$ and $\Phi'$ respectively. This procedure can be carried out again to create a new circuit $\Phi''$ from $\Phi'$, which also implements $\pi_\Phi$ but which now has Landauer loss $\mathcal{L}^{\mathrm{loss}}_{\Phi''}(p) = \mathcal{L}^{\mathrm{loss}}_{\Phi}(p) + 2S(p_v)$. Iterating, we can construct a circuit with an arbitrarily large Landauer loss which implements $\pi_\Phi$.
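The iteration at the end of the proof can be made concrete with a small calculation. The sketch below is only illustrative and rests on assumptions not stated in the proof: entropy is measured in bits, the helper names are hypothetical, and the Landauer loss of the original circuit is taken to be nonnegative, so that adding enough erasure gates is sufficient to reach a target $\kappa$.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def added_loss(p_v, n_erasure_gates):
    """Extra Landauer loss after n repetitions of the construction: n * S(p_v)."""
    return n_erasure_gates * entropy(p_v)

def erasure_gates_needed(p_v, kappa):
    """Smallest n with n * S(p_v) >= kappa; requires S(p_v) > 0, as in the proposition."""
    s = entropy(p_v)
    if s <= 0:
        raise ValueError("p_v must not be a delta function")
    return math.ceil(kappa / s)

p_v = [0.5, 0.5]                        # input node v holding one uniform bit, S(p_v) = 1 bit
n = erasure_gates_needed(p_v, 2.5)      # number of erasure gates needed to add 2.5 bits of loss
loss = added_loss(p_v, n)
```

The requirement that the input distribution not be a delta function shows up here as the guard on `s`: if $S(p_v) = 0$, no number of erasure gates adds any loss.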