A reaction network scheme for hidden Markov model parameter learning

With a view towards artificial cells, molecular communication systems, molecular multiagent systems and federated learning, we propose a novel reaction network scheme (termed the Baum–Welch (BW) reaction network) that learns parameters for hidden Markov models (HMMs). All variables including inputs and outputs are encoded by separate species. Each reaction in the scheme changes only one molecule of one species to one molecule of another. The reverse change is also accessible but via a different set of enzymes, in a design reminiscent of futile cycles in biochemical pathways. We show that every positive fixed point of the BW algorithm for HMMs is a fixed point of the reaction network scheme, and vice versa. Furthermore, we prove that the ‘expectation’ step and the ‘maximization’ step of the reaction network separately converge exponentially fast and compute the same values as the E-step and the M-step of the BW algorithm. We simulate example sequences, and show that our reaction network learns the same parameters for the HMM as the BW algorithm, and that the log-likelihood increases continuously along the trajectory of the reaction network.


Introduction
The implementation of abstract dynamical systems with molecular systems has gained scientific interest as a promising piece in the nanotechnology toolbox. Several automated theoretical schemes can now compile arbitrary networks of formal reactions into DNA oligonucleotide sequences [1–8]. Experimental demonstrations have synthesized these oligonucleotides, mixed them in a single test tube, and verified that they interact via base pairing reactions to implement the dynamics of the formal network [2,7,9–11]. In this way, if an algorithm can be described by reaction network dynamics, it might equally be carried out in a test tube by a DNA machine. Understanding how to describe algorithms in terms of reaction network dynamics is thus becoming a matter of interest.
Dynamical systems described by formal reaction networks are known to be computationally universal [12]. Several examples of functions computed by reaction network dynamics have been described [13–20]. However, the computational universality proofs often do not lend themselves to elegant implementations of algorithms with chemical reaction networks. We are particularly interested in problems arising in statistical learning theory, such as maximum-likelihood estimation and related optimization problems, Bayesian posterior sampling, and inference from incomplete data. These problems are at the core of statistical theory and may have many practical applications when implemented inside artificial cells. Our work demonstrates that these problems are particularly amenable to implementation by reaction network dynamics, by exploiting the formal connection between the notion of entropy in statistics and in statistical mechanics.
Molecule-based statistical inference is receiving increasing attention. In [21], dynamic Bayesian decision-making in the form of a two-state hidden Markov model (HMM) is implemented by means of intracellular kinetics (which might be interpreted in terms of reaction networks), where the target is to infer an approximate posterior distribution. Similar ideas have been applied to decode information in biological processes [22]. On a different note, a cell's potential for solving tasks and learning requires a thorough understanding of design principles in reaction networks and how molecular sources of stochastic information are communicated within and between cells [23,24]. Such tasks might be seen as computational and statistical problems posed to the cell.
In this article, we describe a reaction network scheme whose dynamics learns parameters of HMMs. HMMs are standard statistical models widely used in machine learning to model complex data with a linear spatial structure as in bioinformatics [25], or a temporal structure as in speech recognition [26]. HMMs also form an essential component of communication systems as well as of intelligent agents trained by reinforcement learning methods. They might be used in an exploratory sense without stipulating the interpretation of the hidden variables in advance, or in a concrete sense to learn the strength and influence of known hidden variables. Our group has previously worked on statistical learning theory from the perspective of reaction networks and log-linear models [27–31], and the current paper builds on these experiences.
Our proposed algorithm has similarities to, and important differences from, the Baum-Welch (BW) algorithm [32], which is the standard learning algorithm for HMMs. The BW algorithm is an iterative expectation-maximization (EM) algorithm where one step is performed at a time in a prescribed sequence. Our reaction network scheme is divided into four subnetworks that correspond to the forward algorithm, the backward algorithm, the expectation step (E-step), and the maximization step (M-step) of the BW algorithm. Each subnetwork describes a system of ordinary differential equations (ODEs) that might be run separately, exactly mimicking the steps of the BW algorithm; or run simultaneously in continuous time and in a distributed manner, obtaining a variant on the BW theme where all four stages of the BW algorithm are being performed at the same time. Because our algorithm permits the different stages to be run simultaneously and without coordination, it is particularly suited to software federated learning implementations where HMMs might need to be run on a network of edge devices in a distributed manner.
We obtain the following results.
- Our scheme can be partitioned into two modules, the inference module (forward, backward, E-step) and the learning module (M-step), which provide feedback to each other. We show in theorem 4.2 that each module separately converges exponentially fast to the correct value (the value of the BW algorithm) when the other module is kept switched off.
- We prove in theorem 4.3 that our scheme (when divided into two modules, or when considered jointly) and the BW algorithm have the same set of positive fixed points.
- We demonstrate practical feasibility of our algorithm by simulation of example HMMs, showing that parameters can be learned successfully with performance comparable to that of the BW algorithm.
The proposed reaction network is a formal (abstract) reaction network without particular chemical features. It might be turned into a chemically realistic reaction network by compiling the formal reactions into reactions between DNA oligonucleotide sequences, using recently proposed techniques [1–4]. Promising areas of application of this work come from cellular biology. In many cellular processes, only partial information about the environment is available in the form of a sequence of observations. For example, this might happen when an enzyme acts processively on a substrate or a molecular walker locates its position on a grid [33–35]. In the future, a molecule-based HMM device might learn a molecular environment within an organism by sensing and interacting with the environment at the molecular level. It might take action according to the learning outcome, for example, choosing among different drug options; or a molecule-based HMM might be used as a building block in an artificial cell or population of cells, enabling cooperative behaviour among cells or facilitating various tasks.

The Baum-Welch algorithm
A stochastic map between two finite sets P and Q is a matrix A = (a_pq)_{p∈P, q∈Q} such that a_pq ≥ 0 for p ∈ P, q ∈ Q, and Σ_{q∈Q} a_pq = 1 for p ∈ P. One might think of a stochastic map as a collection of conditional probability distributions.
An HMM is a tuple (H, V, θ, ψ, π) of two finite sets H (for 'hidden') and V (for 'visible'), an initial probability distribution π = (π h ) h∈H on H, and two stochastic maps: the transition matrix θ : H → H, and the emission matrix ψ : H → V. See figure 1 for an example.
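As a concrete illustration, the tuple (H, V, θ, ψ, π) can be encoded directly as arrays; the sketch below samples a sequence from a two-state HMM whose parameter values match the simulated example in appendix A (the function name sample_sequence is ours, not the paper's).

```python
import numpy as np

# Two-state HMM; theta and psi match the simulated example in appendix A.
pi = np.array([0.5, 0.5])            # initial distribution on hidden states
theta = np.array([[0.2, 0.8],        # transition matrix; rows sum to one
                  [0.7, 0.3]])
psi = np.array([[0.75, 0.25],        # emission matrix; rows sum to one
                [0.3, 0.7]])

def sample_sequence(pi, theta, psi, L, rng):
    """Sample hidden states h_1..h_L and visible symbols v_1..v_L (0-indexed)."""
    hs, vs = [], []
    h = rng.choice(len(pi), p=pi)
    for _ in range(L):
        hs.append(h)
        vs.append(rng.choice(psi.shape[1], p=psi[h]))
        h = rng.choice(theta.shape[1], p=theta[h])
    return hs, vs

hs, vs = sample_sequence(pi, theta, psi, L=10, rng=np.random.default_rng(0))
```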
The likelihood P(v_1, …, v_L | θ, ψ) of an observed sequence v_1, …, v_L ∈ V given the parameter (θ, ψ) is the probability

P(v_1, …, v_L | θ, ψ) = Σ_η π_{h_1} ψ_{h_1 v_1} θ_{h_1 h_2} ψ_{h_2 v_2} ⋯ θ_{h_{L−1} h_L} ψ_{h_L v_L},   (2.1)

where the sum is over all sequences η = (h_1, …, h_L) ∈ H^L, with L ≥ 2 (to avoid triviality). Knowing π and the sequence (v_1, v_2, …, v_L) of visible states, one can estimate maximum-likelihood values for (θ, ψ) by means of the BW algorithm, which is a particular instance of the EM algorithm. In §5, we show how to extend this construction to unknown π and multiple sequences while preserving the theoretical guarantees.
The standard BW algorithm is composed of four subroutines. The forward algorithm (figure 1b) outputs the quantities α_{ℓh}, computed from initial values of (θ, ψ) and π by the recursion

α_{1h} = π_h ψ_{h v_1},   α_{ℓ+1,h} = ψ_{h v_{ℓ+1}} Σ_{g∈H} α_{ℓg} θ_{gh},

and the backward algorithm (figure 1c) outputs the quantities β_{ℓh}, computed by the recursion

β_{Lh} = 1,   β_{ℓh} = Σ_{g∈H} θ_{hg} ψ_{g v_{ℓ+1}} β_{ℓ+1,g},

royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 20: 20220877

where h ∈ H and ℓ = 1, …, L − 1. The E-step computes

γ_{ℓh} = α_{ℓh} β_{ℓh} / Σ_{g∈H} α_{ℓg} β_{ℓg},   ξ_{ℓgh} = α_{ℓg} θ_{gh} ψ_{h v_{ℓ+1}} β_{ℓ+1,h} / Σ_{g′,h′∈H} α_{ℓg′} θ_{g′h′} ψ_{h′ v_{ℓ+1}} β_{ℓ+1,h′},

for ℓ ∈ {1, 2, …, L} (ℓ ∈ {1, …, L − 1} for ξ) and g, h ∈ H. Finally, the M-step computes an update (θ', ψ') of (θ, ψ),

θ'_{gh} = Σ_{ℓ=1}^{L−1} ξ_{ℓgh} / Σ_{ℓ=1}^{L−1} γ_{ℓg},   ψ'_{hw} = Σ_{ℓ=1}^{L} γ_{ℓh} δ_{w,v_ℓ} / Σ_{ℓ=1}^{L} γ_{ℓh},

for g, h ∈ H and w ∈ V, where δ is the Kronecker delta, δ_{w,v_ℓ} = 1 if w = v_ℓ and 0 otherwise. Here, θ'_{gh} and ψ'_{hw} (with primes) are the updated parameter values of θ_{gh} and ψ_{hw}, respectively, after one iteration of the BW algorithm. As π is assumed known, it is not updated. The BW algorithm (figure 1d) sequentially runs the forward algorithm, the backward algorithm, the E-step and the M-step until the change in the likelihood (2.1) is insignificant; in practice, until the difference between (θ, ψ) and (θ', ψ') becomes smaller than a prescribed tolerance level. For the steps to be well defined, division by zero is not allowed. Denote by Θ the parameter region for which division by zero does not happen in any step of an iteration (see the electronic supplementary material for details and proof). Furthermore, let Θ_0 = {(θ, ψ) | θ > 0, ψ > 0} and Θ_1 = {(θ, ψ) | θ ≥ 0, ψ ≥ 0} (the full parameter space), where vector/matrix inequalities are taken coordinate-wise. It follows that Θ_0 ⊆ Θ ⊆ Θ_1, and that all steps in one iteration of the BW algorithm are valid provided (θ, ψ) ∈ Θ. Each row (a conditional distribution) of the update (θ', ψ') automatically sums to one.
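For reference, one full iteration of the four subroutines can be sketched in NumPy as follows; the function bw_iteration and the toy initial values are ours, chosen only to illustrate the update formulas above (π is kept fixed, as in the paper).

```python
import numpy as np

def bw_iteration(theta, psi, pi, v):
    """One BW iteration (forward, backward, E-step, M-step) for a
    0-indexed observed symbol sequence v; pi is kept fixed."""
    L, H = len(v), len(pi)
    alpha = np.zeros((L, H))
    beta = np.zeros((L, H))
    alpha[0] = pi * psi[:, v[0]]                      # forward initialization
    for l in range(L - 1):
        alpha[l + 1] = psi[:, v[l + 1]] * (alpha[l] @ theta)
    beta[L - 1] = 1.0                                 # backward initialization
    for l in range(L - 2, -1, -1):
        beta[l] = theta @ (psi[:, v[l + 1]] * beta[l + 1])
    # E-step: state posteriors gamma and transition posteriors xi
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((L - 1, H, H))
    for l in range(L - 1):
        x = alpha[l][:, None] * theta * (psi[:, v[l + 1]] * beta[l + 1])[None, :]
        xi[l] = x / x.sum()
    # M-step: each row of the update automatically sums to one
    theta_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    psi_new = np.zeros_like(psi)
    obs = np.array(v)
    for w in range(psi.shape[1]):
        psi_new[:, w] = gamma[obs == w].sum(axis=0)
    psi_new /= gamma.sum(axis=0)[:, None]
    return theta_new, psi_new

theta1, psi1 = bw_iteration(np.full((2, 2), 0.5),
                            np.array([[0.6, 0.4], [0.4, 0.6]]),
                            np.array([0.5, 0.5]), [0, 1, 1, 0, 1])
```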
The following lemma is essential. See the electronic supplementary material for a proof.

Lemma 2.1. Assume all letters of V are in the observed sequence v_1, …, v_L and π > 0, L ≥ 2; or all letters of V are in v_2, …, v_L (excluding v_1) and π ≥ 0, L ≥ 3. If (θ, ψ) ∈ Θ_0, then the same holds for the updated parameter value, (θ', ψ') ∈ Θ_0.

Let (θ_n, ψ_n) denote the value of (θ', ψ') after n iterations of the BW algorithm. According to lemma 2.1, if (θ_0, ψ_0) ∈ Θ_0, then (θ_n, ψ_n) ∈ Θ_0 for all n ≥ 0. If (θ_n, ψ_n) converges to some (θ*, ψ*), then this limit is a local extremum or saddle point of the likelihood (2.1) [36–39]. If (θ*, ψ*) ∈ Θ, then it is a fixed point of the BW algorithm. However, the limit might be outside Θ [36–38]. In general, the BW algorithm might have multiple fixed points depending on the initial point (θ_0, ψ_0) and the observed sequence.
Henceforth, we make the assumptions of lemma 2.1. If v ∈ V is not observed, then ψ_hv = 0 for all h ∈ H, and one might equivalently consider an HMM with the state v removed.

Baum-Welch reaction network
Let S be a finite set. A formal (chemical) reaction over S is a pair (a, b) of non-negative integer vectors a, b ∈ Z^S_{≥0}. This is commonly represented in chemical equation notation as Σ_{i∈S} a_i X_i → Σ_{i∈S} b_i X_i, where X_i, i ∈ S, represent formal (chemical) species. Given a rate constant k > 0, mass-action kinetics describes the change of concentrations through time by the ODE system Ẋ(t) = (b − a) k Π_{i∈S} X_i(t)^{a_i}, where species names are overloaded to also represent the vector of concentrations. A reaction network over S is a finite set of reactions together with a choice of reaction rate constants. Combining the effect of all m reactions yields the ODE system Ẋ(t) = Σ_{j=1}^{m} (b_j − a_j) k_j Π_{i∈S} X_i(t)^{a_{ji}}. For background on reaction networks, see [40].
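A minimal sketch of this combined mass-action ODE in NumPy, assuming the reactions are given by reactant and product stoichiometric matrices (the two-reaction example X_1 ⇌ X_2 below is illustrative, not from the paper):

```python
import numpy as np

def mass_action_rhs(x, a, b, k):
    """dX/dt = sum_j (b_j - a_j) k_j prod_i x_i^{a_ji}: a and b are the
    m-by-n reactant and product stoichiometric matrices, k the rate constants."""
    rates = k * np.prod(x ** a, axis=1)   # mass-action rate of each reaction
    return (b - a).T @ rates

# Illustrative example: X1 -> X2 with rate constant 2, X2 -> X1 with rate constant 1.
a = np.array([[1, 0], [0, 1]])
b = np.array([[0, 1], [1, 0]])
k = np.array([2.0, 1.0])
rhs = mass_action_rhs(np.array([1.0, 1.0]), a, b, k)   # -> [-1.  1.]
```

At x = (1, 2) the two reaction rates balance and the right-hand side vanishes, i.e. the network is at equilibrium.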
We now describe a reaction network that implements HMM learning. Our scheme bears close correspondence to the BW algorithm as presented in the previous section. Let an HMM (H, V, θ, ψ, π) and an observed sequence (v_1, v_2, …, v_L) ∈ V^L be given. Choose an arbitrary hidden state h* ∈ H and an arbitrary visible state v* ∈ V. This choice is merely an artifice to break symmetry, and our results hold independently of it. We represent every variable appearing in the BW algorithm by a separate species: θ_{gh}, θ'_{gh}, ψ_{hw}, ψ'_{hw}, α_{ℓh}, β_{ℓh}, γ_{ℓh} and ξ_{ℓgh}. That is, both (θ, ψ) and the update (θ', ψ') are represented as species.
We first work out in full detail how the forward algorithm may be translated into chemical reactions. Recall that the forward algorithm is the recursion

α_{ℓ+1,h} = ψ_{h v_{ℓ+1}} Σ_{g∈H} α_{ℓg} θ_{gh},

for ℓ = 1, …, L − 1 and g, h ∈ H, with initialization α_{1h} = π_h ψ_{h v_1}. This implies a set of balance equations, (3.1) for the initialization and (3.2) for the recursion. For the initialization step, this prompts the use of the reactions in table 1 (subnetwork R^α_1) for all h ∈ H, with all reaction rate constants put to 1. Only one species changes in each reaction (the red species in table 1); the other species are catalysts that remain unchanged by the reaction. The rate by which α_{1h} is converted into α_{1h*} depends on the concentrations of the catalysts and, thus, is time-dependent. By design, at equilibrium this set of reactions fulfils the balance equation (3.1).
For the recursion step, we use the reactions of the subnetworks R^α_ℓ in table 1 for all g, h ∈ H and ℓ = 1, …, L − 1. Again by design, at equilibrium the balance equation (3.2) is fulfilled for this set of reactions.
The reactions depend on the observed sequence (v_1, v_2, …, v_L) ∈ V^L of visible states. This is a problem, because one would have to design different reaction networks for different observed sequences, defeating the whole purpose. To solve this problem, we introduce the catalyst species E_{ℓw} with ℓ = 1, …, L and w ∈ V. The E_{ℓw} species are initialized such that at time zero, E_{ℓw}(0) = 1 for w = v_ℓ, and E_{ℓw}(0) = 0 for w ≠ v_ℓ. As they act only as catalysts, their concentrations remain fixed throughout time.
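The initialization of the E_{ℓw} species amounts to a one-hot encoding of the observed sequence; a minimal sketch (the function name is ours):

```python
import numpy as np

def observation_catalysts(v, num_symbols):
    """One-hot encode the observed sequence: E[l, w] = 1 if w == v_l, else 0.
    These constant catalyst concentrations let one fixed network serve any data."""
    E = np.zeros((len(v), num_symbols))
    E[np.arange(len(v)), v] = 1.0
    return E

E = observation_catalysts([1, 0, 1, 1], num_symbols=2)
```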
The other parts of the BW algorithm may be translated into chemical reactions using a similar logic. The full set of reactions is shown in table 1 and is termed the BW reaction network. It consists of four parts that are further divided into smaller subnetworks. Each of the four parts corresponds to one of the four parts of the BW algorithm, as shown in table 1. The corresponding equations of the ODE system are shown in appendix A (equations (A 1)–(A 8)). Each of the smaller subnetworks in table 1, when run independently, can be separately analysed as a mono-molecular reaction network for which the dynamical behaviour can be fully described, as shown in appendix A.

The dynamics of the Baum-Welch reaction network
In the following, we expose the relationship between the dynamics of the BW algorithm and the BW reaction network.
For this, we define three alternative ways of running the BW reaction network in table 1.
Let α ℓ (t) = (α ℓh (t)) h∈H denote the vector of concentrations at time t ≥ 0, with similar notation for other quantities.For convenience, we leave out the species E ℓw .

BW1
Initialization. Fix an initial value (θ_0, ψ_0) ∈ Θ_0.

Execution. The ODE systems of the subnetworks (table 1) are executed sequentially, such that the ODE system of the kth subnetwork (k = 1, …, 4L − 1) is run until an equilibrium is obtained with the chosen initial values, before the ODE system of the (k + 1)th subnetwork is executed. The catalyst species concentrations of the kth subnetwork are fixed to their equilibrium values obtained in the k'th subnetworks, k' < k. In particular, the concentrations of the species θ, ψ are fixed to their initial values (θ_0, ψ_0).

Iteration. The above procedure is then iterated. After completion of the nth iteration, n ≥ 0, the concentrations of the species θ, ψ in the (n + 1)th iteration are initialized to the equilibrium values (θ_n, ψ_n) ∈ Θ_0 of the species (θ', ψ') in the nth iteration. All other species are initialized at their current values.
Completion. This process is continued until convergence of (θ_n, ψ_n) has been achieved. The limit of (θ_n, ψ_n) is denoted the equilibrium of BW1 with initial point (θ_0, ψ_0).
BW2

Execution. The ODE systems corresponding to the inference module (forward, backward and E-step) and the learning module (M-step) of the BW algorithm are executed sequentially and run until an equilibrium is obtained, with the concentrations of the species θ, ψ fixed to the initial values (θ_0, ψ_0).

Iteration and completion. BW2 is iterated similarly to BW1. The limit of (θ_n, ψ_n) is denoted the equilibrium of BW2 with initial point (θ_0, ψ_0). See figure 2.
BW3

Execution. The species θ, ψ are identified with the species θ', ψ'. That is, unprimed species are substituted for primed ones in table 1. The full ODE system as described in §3 is executed at the same time, without fixing any species concentrations, so that all species concentrations are dynamic.
Completion. The limit of (θ'(t), ψ'(t)) (if it exists) is denoted the equilibrium of BW3 with initial point (θ_0, ψ_0).

BW1 replicates the BW algorithm with one equilibrium update for each step in one iteration of the BW algorithm. In total, there are 4L steps in one iteration of the BW algorithm, equalling the number of subnetworks (table 1). BW2 combines all 4L steps in one iteration of the BW algorithm, corresponding to the simultaneous calculation of the forward and backward algorithms, the E-step and the M-step in one iteration. BW3 makes all updates simultaneously: it is a feedback system where the concentrations of all species are adjusted continuously.
Proofs of the following statements are in appendix A.
Theorem 4.1. If the BW algorithm and BW1 are initiated at the same point (θ_0, ψ_0) ∈ Θ_0, then their equilibria always exist and agree.
Theorem 4.2. If BW1 and BW2 are initiated at the same point (θ_0, ψ_0) ∈ Θ_0, then their equilibria always exist and agree. Furthermore, this equilibrium is globally asymptotically stable for the BW2 dynamics, and convergence is exponentially fast, subject to the invariant subspace defined by the initialization of θ', ψ': the sum of the entries of each row of θ'(t), respectively ψ'(t), is one.
The two theorems imply that the initial conditions of the species other than (θ', ψ') are irrelevant, hence we need not be concerned with these. Furthermore, BW1 and BW2 compute the same parameter value as the BW algorithm, provided (θ_0, ψ_0) ∈ Θ_0. Theorems 4.1 and 4.2 might not be true if (θ_0, ψ_0) is at the boundary Θ \ Θ_0; see below for discussion. In particular, an equilibrium of BW3 at the boundary might not be an equilibrium of the BW algorithm.

Table 1. The BW reaction network, divided into four parts corresponding to the forward (R^α_1, R^α_ℓ) and the backward (R^β_ℓ) algorithms, the E-step (R^γ_ℓ, R^ξ_ℓ) and the M-step (R^θ, R^ψ). All parts but the M-step are further divided into small subnetworks, one for each ℓ = 1, …, L − 1 (or L). Catalytic species are shown in black, non-catalytic species in red. The indices vary over g, h ∈ H, w ∈ V and ℓ = 1, …, L − 1; for γ species, ℓ = L is also allowed.
Theorem 4.3. A point (θ, ψ) ∈ Θ_0 is an equilibrium of BW1, BW2 or BW3 if and only if it is a fixed point of the BW algorithm. In particular, the reaction network scheme and the BW algorithm have the same set of positive fixed points.

Any of the three algorithms, as well as the BW algorithm, might have several positive equilibria, depending on the observed data and the initial point. In general, one should expect coexistence of boundary and positive equilibria [37,39].
Lemma 4.4. The solution to the ODE system of BW3 exists for all times and any initial condition. Furthermore, assume (θ'(t), ψ'(t)) → (θ*, ψ*) ∈ Θ_0 as t → ∞ for some initial condition. Then x(t) → x* for some positive vector x*, where x(t) denotes the vector of concentrations of all species except the species θ', ψ'.
If the initial point (θ_0, ψ_0) belongs to Θ \ Θ_0, then the dynamics of BW1/BW2 and BW3 might differ. Imagine that one or more of the (time-dependent) reaction rates are set to zero in figure 3. This could happen in different ways. In figure 3b, the graph is broken into two, effectively replacing one conserved quantity with two conserved quantities, one for each subgraph (one subgraph has a single node, the other six). The equilibrium depends on how much 'mass' is allocated to each subgraph initially. The same is the case if two dead-end nodes are created, that is, nodes from which no mass can flow out; see figure 3c. Here too, an additional conserved quantity is created. See lemma A.3 for further details.
To complete the story, discrepancies between BW1 and BW2 (and similarly BW3) might also be found if some reaction rates eventually become zero. We illustrate this with a simple example. Imagine a two-species reaction network X_1 → X_2 with rate κ(t) = e^{−t}, t ≥ 0, and conserved amount T = x_1(t) + x_2(t). It has solution x_1(t) = x_1(0) exp(e^{−t} − 1) → x_1(0)/e as t → ∞, whereas the limiting reaction network with zero reaction rate has a constant solution. Thus, a trajectory of the limiting reaction network only approximates a trajectory of the reaction network with time-dependent reaction rate if the initial condition of the latter is e ≈ 2.718 times that of the former.
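This closed form is easy to check numerically; the sketch below integrates dx_1/dt = −e^{−t} x_1 with a simple Euler scheme (step size and horizon are our choices) and compares the result with x_1(0)/e.

```python
import math

# X1 -> X2 with time-dependent rate kappa(t) = exp(-t); x1 + x2 is conserved.
def simulate(x1_0, x2_0, t_end=30.0, dt=1e-4):
    t, x1, x2 = 0.0, x1_0, x2_0
    while t < t_end:
        flux = dt * math.exp(-t) * x1   # Euler step for dx1/dt = -kappa(t) x1
        x1 -= flux
        x2 += flux
        t += dt
    return x1, x2

x1_final, x2_final = simulate(1.0, 0.0)
# Closed form: x1(t) = x1(0) exp(e^{-t} - 1), so x1(t) -> x1(0)/e as t grows.
```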

Multiple sequences and unknown π
If there are R observed sequences, potentially of different lengths, then the forward and backward algorithms in the BW algorithm are replaced with R forward and backward algorithms, one for each sequence, all initialized with the same values of (θ, ψ) and π [41]. Similarly, the E-step is replaced by R E-steps, one for each sequence. The main difference lies in the M-step, which is replaced by [41]

θ'_{gh} = Σ_{i=1}^{R} Σ_ℓ ξ_{iℓgh} / Σ_{i=1}^{R} Σ_ℓ γ_{iℓg},   ψ'_{hw} = Σ_{i=1}^{R} Σ_ℓ γ_{iℓh} δ_{w,v_{iℓ}} / Σ_{i=1}^{R} Σ_ℓ γ_{iℓh},   (5.1)

π'_h = (1/R) Σ_{i=1}^{R} γ_{i1h},   (5.2)

where π now is updated and the additional index i in γ_{iℓh} and ξ_{iℓgh} refers to the ith sequence; otherwise the updates are similar to those of the one-sequence algorithm (if π is considered known, the update step for π is simply ignored). The forward and backward algorithms and the E-step are implemented similarly to the one-sequence reaction network, with a set of species for each observed sequence and common species for θ, ψ and π. The M-step might be implemented by reactions of the same form as in table 1, where E_{iℓw} is defined similarly to E_{ℓw}, but for the ith sequence alone. All these reactions take the same mono-molecular form as in the one-sequence algorithm, resulting in statements analogous to theorems 4.1–4.3 (with analogous proofs). The reaction networks for the R sequences are executed simultaneously in the three different ways BW1, BW2 and BW3.
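The pooled M-step above can be sketched as follows; the per-sequence posteriors fed in below are synthetic stand-ins for the outputs of R E-steps (all names and numbers are ours), constructed so that Σ_h ξ_{iℓgh} = γ_{iℓg} holds, as it does for genuine E-step output.

```python
import numpy as np

def pooled_m_step(gammas, xis, vs, num_symbols):
    """M-step pooled over R sequences; pi is updated as well.
    gammas[i]: (L_i, H) state posteriors, xis[i]: (L_i - 1, H, H) transition
    posteriors, vs[i]: 0-indexed symbols of the ith sequence."""
    H = gammas[0].shape[1]
    theta_new = sum(xi.sum(axis=0) for xi in xis) \
        / sum(g[:-1].sum(axis=0) for g in gammas)[:, None]
    psi_num = np.zeros((H, num_symbols))
    psi_den = np.zeros(H)
    for g, v in zip(gammas, vs):
        obs = np.array(v)
        for w in range(num_symbols):
            psi_num[:, w] += g[obs == w].sum(axis=0)
        psi_den += g.sum(axis=0)
    psi_new = psi_num / psi_den[:, None]
    pi_new = sum(g[0] for g in gammas) / len(gammas)   # average of gamma_{i,1,h}
    return theta_new, psi_new, pi_new

# Synthetic, mutually consistent posteriors for R = 2 sequences.
rng = np.random.default_rng(1)
gammas, xis, vs = [], [], []
for L in (4, 6):
    g = rng.random((L, 2)); g /= g.sum(axis=1, keepdims=True)
    tp = rng.random((L - 1, 2, 2)); tp /= tp.sum(axis=2, keepdims=True)
    gammas.append(g); xis.append(g[:-1, :, None] * tp)
    vs.append(list(rng.integers(0, 2, size=L)))
theta_new, psi_new, pi_new = pooled_m_step(gammas, xis, vs, num_symbols=2)
```

As in the one-sequence case, each row of the pooled update sums to one by construction.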

Examples
To illustrate the performance of BW3 versus the BW algorithm, we simulated an observed sequence of length L = 100 from an HMM with two hidden states and two visible states v_1, v_2; see appendix A and figure 4 for details.
Out of the 100 symbols, 49 were v_1 and 51 were v_2. Then, we ran the two algorithms for different choices of initial values, with π = (0.5, 0.5) fixed. In one case, the two algorithms returned the same estimated values (figure 4a), namely

(θ', ψ') = ((0.5, 0.5; 0.5, 0.5), (0.51, 0.49; 0.51, 0.49)),

far from the true values (see appendix A), while in the second case (figure 4b), the two algorithms returned markedly different values. For the BW algorithm, we obtained

(θ', ψ') = ((0.150, 0.850; 0.998, 0.002), (0.642, 0.358; 0.356, 0.644)),

with log-likelihood −68.5617, while for BW3, we obtained values on the boundary,

(θ', ψ') = ((0.000, 1.000; 0.229, 0.771), (0.000, 1.000; 0.631, 0.369)),

with log-likelihood −68.6750.

As a second example, we study a short observed sequence, v_2, v_1, v_2, v_1, v_2, of length 5, generated from an HMM with two hidden states and two visible states. With initial parameter values given in appendix A, the BW algorithm returns one boundary equilibrium, while BW3 returns a different boundary equilibrium. Both examples illustrate that different things might happen at the boundary. In the latter case, the BW3 equilibrium belongs to Θ. However, when initiated at that point, the BW algorithm returns the first equilibrium point. Hence, the BW3 equilibrium is not an equilibrium of the BW algorithm (as the equilibrium is not positive, theorem 4.3 is not contradicted).

We further study an example where the initial probabilities can be tuned by adding new reactions according to equations (5.1)–(5.2). Consider a casino game which depends on the outcome of a three-sided die. A fair die has equal probabilities for each outcome [42]. However, the casino can switch to an unfair die with unequal outcome probabilities. The player only observes the die outcome sequence. An HMM with two hidden states of the casino (Honest and Dishonest) and with the three die outcomes as visible states can be used to predict if and when the casino is dishonest [42]. We use the HMM specified in appendix A; the predicted hidden state sequences are shown in figure 5b, where spikes indicate the switch to the hidden state Dishonest. We find that both BW and BW3 predict the hidden state sequence with high accuracy when compared with the HMM used for generating the sequences. More details can be found in appendix A.

Discussion
Some machine learning algorithms, like gradient descent, are based on continuous-time dynamical systems, while others, like message passing, appear essentially discrete. Before our work, the BW algorithm fell into the second category.
Performing the different steps in order was seen as an important part of the algorithm. Here, we have shown a continuous-time dynamical system based on the BW algorithm which implements HMM learning. Our work has exposed that there is a greater design space for BW algorithm implementations than was previously known. Specifically, the steps need not be run to completion, and need not be run in order. This reduces the synchronization burden for distributed implementations of such extended BW algorithms. As a result, our algorithm can be implemented on a distributed network of edge devices, and in this manner serves as a federated learning scheme. Our work shows that reaction networks are a natural design language for machine learning algorithms when the data and computation need to be handled in a distributed manner with minimal synchronization overhead.

Our implementation of this algorithm is explicitly given in the form of a chemical reaction network. This opens the possibility of molecular implementations of this algorithm. If implemented in an artificial cell, it might provide the cell with the ability to sample possible realities, and act according to these imaginings. This is important because the world of molecules is a noisy world. Obtaining exquisite control over a large number of molecules, which is one of the main goals of nanotechnology, requires algorithms that are robust to such noise. Denoising and error correction algorithms are essentially statistical algorithms of the kind that we have implemented here. Thus, not only does our work point to future technological directions in nanotechnology, we believe reaction network implementations of algorithms of this kind are inevitable when attempting to exquisitely control large ensembles of molecules.
One of the big challenges in biology has been the immense complexity of living cells. While on one hand cells are capable of behaviours of incredible sophistication, on the other hand our imagination about their workings has been limited to a very low level of abstraction described by systems biology. To understand the behaviour of cells will require the invention of layers of abstraction above that of systems biology. These layers will have to explain the algorithmic power of biochemical reaction networks. Biochemical reaction networks in living systems are known to be performing inference. There may be an opportunity to understand them from the vantage point of our work, which gives concrete schemes by which reaction networks can perform inference. Our design has reproduced certain characteristics found in biochemical reaction networks, like the ubiquitous presence of enzymes and futile cycles, for example in phosphorylation and dephosphorylation of the same site on a protein. This reproduction has led to a novel idea to explain futile cycles: their purpose is to achieve probabilistic inference. For example, reactions of the form A ⇌ A*, catalysed in the forward direction by a kinase and in the backward direction by a phosphatase, are shown here to be the paradigmatic form of reactions needed to carry out probabilistic inference. A research programme to pursue this idea to its logical conclusions is suggested by the work here, but is beyond the scope of the current paper.

Here, G_j is the jth connected component, j = 1, …, λ. The non-negative vector T = (T_1, …, T_λ) defines an invariant subspace of (A 9). Let z_j ∈ R^n be the vector with ith component 1 if i ∈ G_j and otherwise zero. Then, z_j is a left kernel vector of A(t), z_j^T A(t) = 0, and a linear first integral of (A 9).
Lemma A.1. The following hold:
(i) The rank of A(t) equals n − λ if and only if τ = λ.
(ii) If τ = λ, then z_1, …, z_λ is a basis of the left kernel of A(t), and any linear first integral is a linear combination of z_1, …, z_λ.
(iii) Assume A(t) = A is independent of time. If τ = λ, then there is a unique non-negative, exponentially and globally stable equilibrium within each invariant subspace given by T ≥ 0. The equilibrium is positive if and only if the connected components are strongly connected and T > 0.
Lemma A.2. Assume τ = λ and that κ(t), t ≥ 0, converges exponentially fast towards κ̄ > 0 as t → ∞, i.e. there exist γ_1 > 0 and K_1 > 0 such that ‖κ(t) − κ̄‖ ≤ K_1 e^{−γ_1 t} for t ≥ 0. Let Ā be the matrix A(t) with κ̄ inserted for κ(t). Then, any solution to ẋ = A(t)x converges exponentially fast towards the unique globally stable equilibrium x* ∈ R^n_{>0} of the ODE system ẋ = Āx within the relevant invariant subspace, provided T > 0. That is, there exist γ_2 > 0 and K_2 > 0 such that ‖x(t) − x*‖ ≤ K_2 e^{−γ_2 t} for t ≥ 0.

Lemma A.3. Consider a mono-molecular reaction network with reactions such that κ_i > 0 for i = 1, …, n − 1 (the Laplacian graph is then as in figure 3a). In particular, λ = τ = 1. Then, for each T > 0 there is a unique positive equilibrium x* = (x*_1, …, x*_n).
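Lemma A.1(iii) can be illustrated numerically: for a strongly connected mono-molecular network, the equilibrium within the invariant subspace of total mass T is the kernel vector of the (time-independent) rate matrix, scaled to T. The three-species chain and rate constants below are our own toy example.

```python
import numpy as np

# Mono-molecular network X1 <-> X2 <-> X3: strongly connected, lambda = tau = 1.
# A[i, j] is the rate constant of X_{j+1} -> X_{i+1}; the diagonal entries make
# each column sum to zero, so dx/dt = A x conserves the total mass T = sum(x).
k12, k21, k23, k32 = 1.0, 2.0, 3.0, 4.0   # illustrative rate constants
A = np.array([[-k12,          k21,  0.0],
              [ k12, -(k21 + k23),  k32],
              [ 0.0,          k23, -k32]])

T = 5.0
w, vecs = np.linalg.eig(A)
kernel = np.real(vecs[:, np.argmin(np.abs(w))])   # eigenvector for eigenvalue 0
x_star = T * kernel / kernel.sum()                # equilibrium with total mass T
# By detailed balance along the chain: x_star = (8/3, 4/3, 1).
```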

A.4. Proofs
We provide proofs of theorems 4.1–4.3 and lemma 4.4. The remaining statements are proven in the electronic supplementary material, except lemma A.3, which is straightforward. We assume π > 0 and L ≥ 2, and note that similar arguments apply if π ≥ 0 and L ≥ 3.

A.4.1. Proof of theorem 4.1
It is sufficient to demonstrate that (θ_n, ψ_n), n ≥ 0, are the same and positive after each iteration of the BW algorithm and BW1. For this, assume θ, ψ (and π) are positive. Then the subnetwork R^α_1 is mono-molecular with species α_{1h}, h ∈ H, and fixed positive reaction rates. The Laplacian graph is strongly connected and τ = λ = 1. Lemma A.1(iii) gives global stability of the unique positive equilibrium for the concentrations of the species α_{1h}, h ∈ H, identical to that of the BW algorithm (compare lemma A.3). This argument is repeated inductively for each subnetwork (in the order listed for BW1). Specifically, for the kth subnetwork (k = 2, …, 4L), the equilibria of the k'th subnetworks, k' < k, are positive by the induction hypothesis. Hence the reaction rates for the kth subnetwork are positive and fixed, given by the equilibrium values of the subnetworks k' < k and θ, ψ. Hence, the Laplacian graph is strongly connected with τ = λ = 1. Using lemma A.1(iii) completes the argument. In particular, (θ', ψ') is positive and the same for the BW algorithm and BW1.

A.4.2. Proof of theorem 4.2
The proof is similar to that of theorem 4.1. The layered structure of BW2 implies that the equilibrium of the species in the kth subnetwork is unaffected by the presence of the k'th subnetworks, k' > k. Assume θ, ψ (and π) are positive. Then the subnetwork R^α_1 is mono-molecular with species α_{1h}, h ∈ H, and fixed positive reaction rates. The Laplacian graph is strongly connected and τ = λ = 1. Lemma A.1(iii) gives exponential stability of the unique positive equilibrium for the concentrations of the species α_{1h}, h ∈ H. By lemma A.1(iii), it is the same as that of BW1. Consider next the subnetwork R^α_2. It is mono-molecular with positive time-dependent reaction rates, given by the concentrations of the species α_{1h}, h ∈ H, at time t ≥ 0 and θ, ψ. Hence, the Laplacian graph is strongly connected and τ = λ = 1. Applying lemma A.2 yields exponential convergence towards the unique positive equilibrium for the concentrations of the species α_{2h}, h ∈ H. This equilibrium is the same as that of BW1 for R^α_2 (lemma A.2). The theorem now follows by repeated application of the above procedure to all subnetworks.

A.4.3. Proof of theorem 4.3
If the catalyst species concentrations of a subnetwork at equilibrium are positive, then the unique equilibrium concentrations of the non-catalyst species are likewise positive (lemma A.3). Consequently, if θ, ψ (and π) are positive, then the unique equilibrium concentrations of the species α 1h , h ∈ H, in R a 1 are positive. Inductively, the equilibrium concentrations in any subnetwork are positive, provided θ, ψ (and π) are positive.
For BW1 and BW2, if (θ, ψ) is part of a positive equilibrium, then (θ′, ψ′) is well-defined and (θ′, ψ′) = (θ, ψ). Consequently, the species concentrations at equilibrium solve the same equations in the three cases, so the sets of equilibria are the same. The same reasoning gives the uniqueness statement; theorem 4.1 gives uniqueness for the BW algorithm.

A.4.4. Proof of lemma 4.4
The first part follows by standard arguments, as the trajectories are confined to a compact space. The second part follows from [46, theorem 1.2] and lemma A.1(iii).

A.5. Simulated examples
We consider an HMM with two hidden states and two visible states v 1 , v 2 , and generate a random sequence of length L = 100 from π = (0.5, 0.5) (fixed throughout) and

(θ, ψ) = ( [0.2, 0.8; 0.7, 0.3], [0.75, 0.25; 0.3, 0.7] ).

royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 20: 20220877

The log-likelihood of the generated sequence is −68.7955. We trained the HMM using BW3 and the BW algorithm, starting from two different initial values. The other species in the reactions of BW3 were initialized with (1) equal concentrations and (2) concentrations obtained from one step of the BW algorithm and appropriate normalization.
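For reference, this set-up can be reproduced in a few lines: sample a length-100 sequence from the HMM and score it with the scaled forward algorithm. The sampling code and seed below are our own, so the sampled sequence (and hence its log-likelihood) will differ from the particular sequence used in this example.

```python
import numpy as np

rng = np.random.default_rng(0)               # our seed; the paper's sequence differs
theta = np.array([[0.2, 0.8], [0.7, 0.3]])   # hidden-state transition probabilities
psi   = np.array([[0.75, 0.25], [0.3, 0.7]]) # emission probabilities
pi    = np.array([0.5, 0.5])
L = 100

# Sample hidden states and observations from the generator HMM.
h = rng.choice(2, p=pi)
obs = []
for _ in range(L):
    obs.append(rng.choice(2, p=psi[h]))
    h = rng.choice(2, p=theta[h])

def log_likelihood(obs, theta, psi, pi):
    """Scaled forward algorithm: log P(obs | theta, psi, pi)."""
    a = pi * psi[:, obs[0]]
    ll = np.log(a.sum()); a /= a.sum()       # rescale at each step to avoid underflow
    for v in obs[1:]:
        a = (a @ theta) * psi[:, v]
        ll += np.log(a.sum()); a /= a.sum()
    return ll

print(round(log_likelihood(obs, theta, psi, pi), 4))
```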

Figure 1. Learning HMMs from sequences. (a) HMM: the hidden states, elements of H = {H 1 , H 2 }, are not directly observable. Instead, we observe elements from a set V = {V 1 , V 2 } of visible states. The parameters θ 11 , θ 12 , θ 21 , θ 22 denote the probabilities of transitions between the hidden states. The probability of observing the states V 1 , V 2 depends on the parameters ψ 11 , ψ 12 , ψ 21 , ψ 22 , as indicated in the figure. (b) The forward algorithm computes the likelihood of the first ℓ + 1 observed states (the position ℓ + 1 likelihood) α_{ℓ+1,1} = α_{ℓ1} θ_{11} ψ_{1,v_{ℓ+1}} + α_{ℓ2} θ_{21} ψ_{1,v_{ℓ+1}} by forward propagating the position ℓ likelihoods α_{ℓ1} and α_{ℓ2}. Here, v_ℓ , v_{ℓ+1} ∈ V are the observed symbols at positions ℓ and ℓ + 1, respectively. (c) The backward algorithm computes the conditional probability of the observed states ℓ + 1, …, L, given the hidden state at position ℓ (the position ℓ conditional probability) β_{ℓ1} = θ_{11} ψ_{1,v_{ℓ+1}} β_{ℓ+1,1} + θ_{12} ψ_{2,v_{ℓ+1}} β_{ℓ+1,2} by propagating the position ℓ + 1 conditional probabilities β_{ℓ+1,1} and β_{ℓ+1,2} backwards. (d) The BW algorithm is a fixed-point expectation-maximization computation. The E-step calls the forward and backward algorithms as subroutines and, conditioned on the entire observed sequence (v 1 , v 2 , …, v L ) ∈ V^L, computes the probabilities γ_{ℓg} of being in state g ∈ H at position ℓ and the probabilities ξ_{ℓgh} of taking the transition (g, h) ∈ H^2 at position ℓ. The M-step updates the parameters θ and ψ to maximize their likelihood given the observed sequence.
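The E- and M-steps summarized in panel (d) admit a compact textbook implementation (scaled to avoid underflow). The code below is a generic Baum-Welch sweep, not the reaction-network scheme itself; variable names are ours, and π is held fixed as in the simulated examples.

```python
import numpy as np

def baum_welch_step(obs, theta, psi, pi):
    """One E-step + M-step. Returns updated (theta, psi) and the
    log-likelihood of obs under the *input* parameters."""
    L, H = len(obs), theta.shape[0]
    # E-step: scaled forward pass (alpha) with scaling factors c.
    alpha = np.zeros((L, H))
    c = np.zeros(L)
    alpha[0] = pi * psi[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for l in range(1, L):
        alpha[l] = (alpha[l-1] @ theta) * psi[:, obs[l]]
        c[l] = alpha[l].sum(); alpha[l] /= c[l]
    # Scaled backward pass (beta).
    beta = np.ones((L, H))
    for l in range(L - 2, -1, -1):
        beta[l] = (theta @ (psi[:, obs[l+1]] * beta[l+1])) / c[l+1]
    gamma = alpha * beta                 # gamma[l, g]: P(state g at l | sequence)
    xi = np.zeros((L - 1, H, H))         # xi[l, g, h]: P(transition g->h at l | sequence)
    for l in range(L - 1):
        xi[l] = alpha[l][:, None] * theta * (psi[:, obs[l+1]] * beta[l+1])[None, :] / c[l+1]
    # M-step: re-estimate theta and psi from expected counts (pi kept fixed).
    theta_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    obs = np.asarray(obs)
    psi_new = np.stack([gamma[obs == v].sum(axis=0) for v in range(psi.shape[1])], axis=1)
    psi_new /= gamma.sum(axis=0)[:, None]
    return theta_new, psi_new, np.log(c).sum()
```

Iterating this step is the fixed-point EM computation; by the standard EM argument, the returned log-likelihood is non-decreasing from one iteration to the next even with π held fixed.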

Figure 3. Each subnetwork in table 1 is a catalytic mono-molecular reaction network with time-dependent reaction rates. The designated species, with indices h*, (h*, h*) or v* depending on the subnetwork, correspond to the nodes in the middle of the reaction graphs. (a) If the reaction rates are all positive, then the reaction graph is strongly connected. (b) Three reaction rates are zero, breaking the graph into two connected components and creating one node (blue) from which all mass eventually disappears. (c) Two reaction rates are zero, creating two dead-end nodes (blue) from which mass cannot flow out. All mass eventually accumulates in the blue nodes.
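The effect in panel (c) — a node with no outflow eventually absorbs all mass — is easy to reproduce on a small graph. The three-node example below is our own illustration, not one of the scheme's subnetworks.

```python
import numpy as np

# (i, j): rate of the reaction species_i -> species_j (our choices).
# Node 2 has no outgoing reactions, so it is a dead end.
rates = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0}
L = np.zeros((3, 3))
for (i, j), k in rates.items():
    L[j, i] += k        # inflow into j from i
    L[i, i] -= k        # outflow from i

x = np.array([1.0, 0.0, 0.0])
dt = 1e-3
for _ in range(50000):  # forward-Euler integration up to t = 50
    x = x + dt * (L @ x)

print(np.round(x, 6))   # essentially all mass has drained into the dead-end node
```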

Figure 4. Dynamics of BW3 and the BW algorithm for an HMM with two hidden states and two visible states, for an observed sequence of length L = 100 and different initializations. (a) Log-likelihood of the sequence for BW3 (left) compared to the BW algorithm (right), starting with equal initial parameter values. (b) Log-likelihood of the sequence for BW3 (left) compared to the BW algorithm (right), starting with random initial parameter values. In both cases, the likelihood is non-decreasing over time/iterations. See the main text and appendix A for details.

Figure 5. Detection of a dishonest casino: inferred statistics in the dishonest casino example. (a) Outcome probabilities of the loaded and fair dice estimated from the generator HMM for 300 die rolls. (b) Prediction of the hidden states on the test data (the last 150 rolls) using the generator HMM, and after training the BW algorithm and BW3 on the first 150 rolls with a random initial distribution of parameters.
, for arbitrary positive constants A ℓ , B ℓ . Execution. The ODE systems of the subnetworks are executed sequentially in the order R a 1