Model-Based Reinforcement Learning via Stochastic Hybrid Models

Optimal control of general nonlinear systems is a central challenge in automation. Enabled by powerful function approximators, data-driven approaches to control have recently successfully tackled challenging applications. However, such methods often obscure the structure of dynamics and control behind black-box over-parameterized representations, thus limiting our ability to understand closed-loop behavior. This paper adopts a hybrid-system view of nonlinear modeling and control that lends an explicit hierarchical structure to the problem and breaks down complex dynamics into simpler localized units. We consider a sequence modeling paradigm that captures the temporal structure of the data and derive an expectation-maximization (EM) algorithm that automatically decomposes nonlinear dynamics into stochastic piecewise affine models with nonlinear transition boundaries. Furthermore, we show that these time-series models naturally admit a closed-loop extension that we use to extract local polynomial feedback controllers from nonlinear experts via behavioral cloning. Finally, we introduce a novel hybrid relative entropy policy search (Hb-REPS) technique that incorporates the hierarchical nature of hybrid models and optimizes a set of time-invariant piecewise feedback controllers derived from a piecewise polynomial approximation of a global state-value function.


I. Introduction
The class of nonlinear dynamical systems governs a vast range of real-world applications and underpins the most challenging problems in classical control, and reinforcement learning (RL) [1], [2]. Recent developments in learningfor-control have pushed towards deploying more complex and highly sophisticated representations, e.g., (deep) neural networks and Gaussian processes, to capture the structure of both dynamics and controllers. This trend led to unprecedented success in the domain of RL [3] and can be observed in both approximate optimal control [4]- [6], and approximate value and policy iteration algorithms [7]- [9].
However, before the latest revival of neural networks, research has focused on different paradigms for solving complex control tasks. One interesting concept relied on decomposing nonlinear structures of dynamics and control into simpler piecewise (affine) components, each responsible for an area of the state-action space. Instances of this abstraction can be found in the control literature under the labels of hybrid systems or switched models [10]- [13], while in the machine and reinforcement learning communities, the terminology of switching dynamical systems and hybrid state-space models is more widely used [14]- [17].
While the hybrid-state paradigm is a natural choice for studying jump processes, it also provides a surrogate piecewise approximation of general nonlinear dynamical behavior. Despite being less flexible than generic black-box approximators, hybrid models can regularize functional complexity and contribute to improved interpretability by imposing a structured representation.
Adopting this perspective in this paper, we present techniques for data-driven automatic system identification and closed-loop control of general nonlinear systems using piecewise polynomial hybrid surrogate models. More concretely, we focus on dynamic Bayesian graphical models as hybrid representations due to their favorable properties. These models have an inherent time-recurrent structure that captures correlations over extended horizons and carry over the advantages of well-established recursive Bayesian inference techniques for dynamical time series data.
In prior work [18], we presented a maximum likelihood approach for hierarchical piecewise system identification and behavioral cloning. Here, we robustify that approach by introducing suitable priors over all parameters. However, the central contribution of this paper is the introduction of an infinite horizon reinforcement learning framework that integrates the structured representation of stochastic hybrid models. The resulting algorithm interactively synthesizes nonlinear feedback controllers and value functions via a hierarchical piecewise polynomial architecture.
This paper is structured as follows. In Section II, we start by reviewing and comparing prominent paradigms of system modeling and optimal control of hybrid systems. Using that context in Section III, we highlight the advantages of our contributions in comparison with the literature. In Section IV, we cast the control problem as an infinite horizon Markov decision process and extended it to accommodate a hybrid structure. Next, in Section V, we introduce our notation of stochastic switching models in the form of hybrid dynamic Bayesian networks, as previously established in [18]. In Section VI, we recap our approach from [18] and improve it to derive a maximum a posteriori expectation-maximization (EM) algorithm for inferring the parameters of probabilistic hybrid models from data. This inference method is helpful for automatically decomposing nonlinear open-loop dynamics into switching affine regimes with arbitrary boundaries and deconstructing state-of-theart nonlinear expert controllers into piecewise polynomial policies. Furthermore, in Section VII, we formulate hybrid optimal control as a stochastic optimization problem and derive a trust-region reinforcement learning algorithm that incorporates an explicit hierarchical model of the nonlinear dynamics. We use this approach to iteratively learn piecewise approximations of the global nonlinear value function and stationary feedback controller. Finally, in Section VIII, we empirically evaluate our approaches on examples of stochastic nonlinear systems, including results from [18] that contribute to the overall picture.
Our empirical evaluation indicates that hybrid models can provide an alternative to generic black-box representations for system identification, behavioral cloning, and learningbased control. Hybrid models are able to reach comparable performance and deliver simpler, easily identifiable switching patterns of dynamics and control while requiring a fraction of the number of parameters of other functional forms. However, the results also reveal certain drawbacks, mainly in poor scalability and increased algorithmic complexity. We address these issues in a final outlook in Section IX.

II. Related Work
This section reviews work related to the modeling and control of hybrid systems and highlights connections and parallels between approaches stemming from the control and machine and reinforcement learning literature.
Hybrid systems have been extensively studied in the control community and are widely used in real-world applications [19], [20]. For research on hybrid system identifi- x 1 FIGURE 1: A hybrid system with K = 3 piecewise affine regimes. The top row depicts the mean unforced continuous transition dynamics in the phase space. The bottom row shows the distinct activation regions of the three dynamics regime across the phase space. We illustrate examples of affine (left), quadratic (middle), and third-order polynomial (right) switching boundaries. Figure reproduced from [18]. cation, we refer to survey work in [21] and [22]. There, the authors focus on piecewise affine (PWA) systems and introduce taxonomies of different representations and procedures commonly used for identifying sub-regimes of dynamics, ranging from algebraic approaches [23] to mixed-integer optimization [24], and Bayesian methods [25]. Furthermore, identification techniques for piecewise nonlinear systems have been developed based on sparse optimization [26] and kernel methods [27]. Finally, it is worth noting that the majority of literature considers deterministic regimeswitching events with exceptions in [28], [29].
Research in the area of optimal control for hybrid systems stretches back to the seminal work in [30], which highlights the possibility of general nonlinear control by considering piecewise affine systems. In [31], an overview of control approaches for piecewise affine switching dynamics is presented. The authors categorize the literature by distinguishing between externally and internally forced switching mechanisms. The bulk of optimal control approaches in this area focuses on (nonlinear) model predictive control (MPC) [32]. Here we highlight the influential work in [33], which formulates the optimal control problem as a mixed-integer quadratic program (MIQP). This approach was later extended in [34] and [35] to solve a multi-parametric MIQP and arrive at time-variant piecewise affine state-feedback controllers and piecewise quadratic value functions with polyhedral partitions. Recently, more efficient formulations of hybrid control have been proposed [36], which leverage modern techniques from mixed-integer and disjunctive programming to tackle large-scale problems.
Hybrid representations also play a central role in datadriven, general-purpose process modeling and state estimation [37], [38], where different classes of stochastic hybrid systems serve as powerful generative models for complex dynamical behaviors [39]- [41]. The dominant paradigm in this domain has been that of probabilistic graphical models (PGM), more specifically, hybrid dynamic Bayesian networks (HDBN) for temporal modeling [42], [43]. One crucial contribution of recent Bayesian interpretations of switching systems is rooted in the Bayesian nonparametric (BNP) view [44]- [47]. This perspective theoretically allows for an infinite number of components, thus dramatically increasing the expressiveness of such models. Given the limited scope of this review section, we highlight only recent contributions with high impacts, such as [48] and [17], which successfully develop Markov chain Monte Carlo (MCMC) and stochastic variational inference (SVI) techniques for system identification. More recently, the rise of variational auto-encoders [49] has enabled a new and powerful view of inference techniques [50] for hybrid systems. A distinct drawback of such approaches is their reliance on end-to-end differentiability and the need to relax discrete variables in order to perform inference.
In the domain of learning-for-control, the notion of switching systems is directly related to the paradigm of model-free hierarchical reinforcement learning (HRL) [51], [52], which combines simple representations to build complex policies.
Here it is useful to differentiate between two concepts of hierarchical learning, namely temporal [53], and state abstractions [54]. In their seminal work [55], [56], the authors build on the framework of semi-Markov decision processes (SMDP) [57] to learn activation/termination conditions of temporally extended actions (options) for solving discrete environments. Additionally, pioneering work in optimizing hierarchical control structures with temporally extended actions is developed in [58] and [59]. Recent work has focused on formulations of the SMDP framework that facilitate simultaneous discovery and learning of options [60]- [64].
However, the concept of state abstraction -partitioning state-action spaces into sub-regions, each governed by local dynamics and control -carries the most apparent parallels to the classical view of hybrid systems. In [65], a proof of convergence for RL in tabular environments with state abstraction is presented, while [66] does a comprehensive study of different abstraction schemes and gives a formal definition of the problem. Furthermore, recent work has shown promising results in solving complex tasks by combining local policies, albeit while leveraging a complex neural network architecture as an upper-level policy [67].
Switching systems serve as a powerful tool in behavioral cloning. For example, [68] combines hidden Markov models (HMMs) with Gaussian mixture regression to represent trajectory distributions. In contrast, [62] uses a semi-hidden Markov model (HSMM) to learn hierarchical policies, and [69] introduces switching density networks for system identification and behavioral cloning. Finally, a Bayesian framework for the hierarchical policy decomposition is presented in [70], albeit while considering known transition dynamics.

III. Contribution
In light of the motivation and reviewed literature from Section I and II, we establish here the overall contribution of our methodology and highlight the main differences that distinguish it from related approaches.
As previously stated, this work strives to cast the problem of nonlinear optimal control into a data-driven hierarchical learning framework. Our aim is to introduce explicit structure and adopt hybrid surrogate models to avoid the opaqueness of recently popularized black-box representations. While this paradigm has been established before, our realization differs from previous attempts in two central aspects: • System Modeling: This work leverages probabilistic hybrid dynamic networks as hierarchical representations of nonlinear dynamics. Contrary to a piecewise autoregressive exogenous systems (PWARX), HDBNs straightforwardly account for noise in both discrete and continuous dynamics. They also incorporate nonlinear transition boundaries, thus minimizing partitioning redundancy. Furthermore, HDBNs admit efficient inference methods in data-driven applications. Finally, by pursuing an abstraction over states instead of time, we circumvent the need to infer termination policies of the SMDP framework. • Control Synthesis: We propose a hybrid policy search approach that formulates a non-convex infinite horizon objective and optimizes a piecewise polynomial approximation of the value function with nonlinear partitioning. This approximation is used to derive stationary switching feedback controllers. In contrast, trajectory optimization and model predictive control techniques for hybrid models are often cast as sequential convex programs that assume polyhedral partitions and optimize a fixed horizon objective, yielding time-variant value functions and controls.

IV. Problem Statement
Consider the discrete-time optimal control problem of a stochastic nonlinear dynamical system to be defined as an infinite horizon Markov decision processes (MDP). An MDP is defined over a state space X ⊆ R d and an action space U ⊆ R m . The probability of a state transition from state x to state x' by applying action u is governed by the Markovian time-independent density function p(x'|x, u). The reward r(x, u) is a function of the state x and action u. The state-dependent policy π(u|x), from which the actions are drawn, is a density determining the probability of an action u given a state x. The general objective in an average-reward infinite horizon optimal control problem is to maximize the average of rewards V π (x) = lim T →∞ 1 T E T t=1 r , where V π denotes as the state-value function under the policy π, starting from an initial state distribution µ 1 (x).
Given the context of this work and our choice to model the system with hybrid models, we introduce to the MDP formulation a new hidden discrete variable z, an indicator of the currently active local regime. The resulting transition dynamics can then be expressed by a factorized density FIGURE 2: Graphical model of recurrent autoregressive hidden Markov models (rARHMMs) extended to support hybrid controls. In rARHMMs, the discrete state z explicitly depends on the continuous state x and action u, as highlighted in red. Figure reproduced from [18].
function p(x', z'|x, u, z) = p(z'|z, x, u)p(x'|x, u, z'), which we depict as a graphical model in Figure 2 and discuss in further detail in the upcoming section. In the same spirit of simplification through hierarchical modeling, we employ a mixture of switching polynomial controllers π(u|x, z), associated with a piecewise polynomial value function V π (x, z).

V. Hybrid Dynamic Bayesian Networks
In this section, we focus on the modeling assumptions for the stochastic switching transition dynamics p(x', z'|x, u, z), see Section IV. We choose recurrent autoregressive hidden Markov models (rARHMMs) as a representation, which is a special case of recurrent switching linear dynamical systems (rSLDS) [17], also known as augmented SLDS [71]. In contrast to rSLDS, an rARHMM lacks an observation model and directly describes the internal state up to an additive noise process. We extend rARHMMs to support exogenous and endogenous inputs in order to simulate the open-and closed-loop behaviors of driven dynamics. Figure 2 depicts the corresponding graphical model, which closely resembles the graph of a PWARX. An rARHMM with K regions models the trajectory of a dynamical system as follows. The initial continuous state x 1 ∈ R d and continuous action u 1 ∈ R m are drawn from a pair of Gaussian and conditional Gaussian distributions 1 , respectively. The initial discrete state z 1 is a random vector modeled by a categorical density parameterized by φ The transition of the continuous state x t+1 and actions u t are modeled by affine-Gaussian dynamics The discrete transition probability p(z t+1 |z t , x t , u t ) is governed by K categorical distributions parameterized by a state-action dependent multi-class logit link function f [72] where f may have any type of features in (x, u). The vectors ω ij parameterize the discrete transition probabilities for all transition combinations i → j ∀i, j ∈ [1, K]. Figure 1 depicts realizations of different logit link functions leading to various state space partitioning.
The remainder of this paper focuses on using these hybrid models in three scenarios: • An open-loop setting that treats the control u as an exogenous input is used for automatically identifying nonlinear systems via decomposition into continuous and discrete switching dynamics. • A closed-loop setting that assumes the control u to originate from a nonlinear controller. We show that this setting can simultaneously decompose dynamics and control in a behavioral cloning scenario. • A reinforcement learning setting where we develop a model-based hybrid policy search algorithm to learn switching controllers for general nonlinear systems.

VI. Bayesian Inference of Hybrid Models
In this section, we sketch the outline of an expectationmaximization/Baum-Welch algorithm [73]- [75] for inferring the parameters of an rARHMM given time-series observations. The resulting algorithm can be used two-fold. First, it can be applied to automatically identify hybrid models and approximate the open-loop dynamics of nonlinear systems given state-action observations. Second, it can clone the closed-loop behavior of a nonlinear controller and decompose it into a set of local experts. Our developed approach is related in some aspects to the Baum-Welch algorithms proposed in [76] and [62]. However, we introduce suitable priors over all parameters and derive a maximum a posteriori (MAP) technique with a stochastic maximization step and hyperparameter optimization. In our experience, the priors significantly regularize the sensitivity of EM with respect to the initial point, making it less prone to getting stuck in bad local minima.
Moreover, a good prior specification is crucial in small data regimes since a vague prior may dominate the predictive posterior and effectively cause under-fitting. We implement a hyperparameter optimization scheme that elevates this concern by optimizing the prior parameters via empirical Bayes [77], thus attenuating the prior influence and improving the predictive performance significantly.

A. Maximum A Posteriori Optimization
Consider again the rARHMM in Figure 2 where the continuous state x and action u are observed variables, while 2 We abuse notation slightly by sometimes using z to refer to the discrete state index instead of treating it as a one-hot vector. the K-region indicators z are hidden. To infer the model parameters, we assume a dataset of N state-action trajectories where (X n , U n , Z n ) represent the time concatenation of an entire trajectory (x n 1:T , u n 1:T , z n 1:T ). The objective corresponding to system identification and behavioral cloning can be cast as a maximization problem of the log-posterior probability of the observations where p(D n , Z n |θ) is the complete-data likelihood of a single trajectory and factorizes according to and p(θ|h) is the factorized parameter prior We choose all priors to be conjugate or semi-conjugate with respect to their likelihoods. Therefore, we place a normal-Wishart (NW) prior on the initial state distribution (µ k , Ω k ) ∼ NW(0, κ 0 , Ψ 0 , ν 0 ), and a matrixnormal-Wishart (MNW) on the affine transition dynam- The initial discrete state takes a Dirichlet prior φ ∼ Dir(τ 0 ), while the logit link function parameters are governed by a non-conjugate zero-mean Gaussian prior with diagonal precision ω ik ∼ N(0, αI). Finally, we place a separate matrix-normal-Wishart prior on the conditional action like- The choice of priors is not restricted to these distributions. Depending on modeling assumptions, one can assume dynamics with diagonal noise matrices and pair them with gamma distribution priors. Moreover, if the system is known to have a state-independent noise process, the K Wishart and gamma priors can be tied across components, leading to a more structured representation.

B. Baum-Welch Expectation-Maximization
On closer examination of Equations (2) and (3), we observe that the optimization problem is non-convex with multiple local optima since the complete-data likelihood N n=1 p(D n , Z n |θ) can follow complex multi-modal densities. Another technical difficulty is the summation over all possible trajectories of the hidden variables Z n , which is of computational complexity O(N K T ) and is intractable in most cases. Expectation-maximization algorithms overcome the latter problem by introducing a variational posterior distribution over the hidden variables q(Z n ) and deriving a lower bound on the complete log-probability function We find a point estimate θ MAP by following a modified scheme of EM, alternating between an expectation step (Estep), in which the lower bound in Equation (4) is maximized with respect to the variational distributions q(Z n ) given a parameter estimateθ, a maximization step (M-step), that updates θ given (q(Z n ),ĥ), and finally, an empirical Bayes step (EB-step) that updates h given (q(Z n ),θ). A sketch of the overall iterative procedure is presented in Algorithm 1.

1) Exact Expectation Step
Maximizing the lower bound with respect to q(Z n ) is determined by reformulating Equation (4) This form of the lower bound implies that the optimal variational distributionq(Z n ) minimizes the Kullback-Leibler divergence (KL) [78], meaninĝ q(Z n ) = p(Z n |D n , θ) = p(z n 1:T |x n 1:T , u n 1:T , θ). (5) This update tightens the bound if the posterior model q(Z n ) belongs to the same family of the true posterior [15]. Notice that the E-step is independent of the prior p(θ). Moreover, Equation (5) indicates that the E-step reduces to the computation of the smoothed marginals p(z n t |x n 1:T , u n 1:T ,θ) under the current parameter estimateθ. Following [73] and [72], we derive a two-filter algorithm, which enables closedform and exact inference by splitting the smoothed marginals into a forward and backward message 3 γ n t (k) = p(z n t = k|x n 1:T , u n 1:T ) ∝ p(z n t = k|x n 1:t , u n 1:t )p(x n t+1:T , u n t+1:T |z n t = k, x n t , u n t ) = α n t (k)β n t (k), where α n t (k) = p(z n t = k|x n 1:t , u n 1:t ) is the message which computes the filtered marginals via a forward recursion and β n t (k) = p(x n t+1:T |z n t = k, x n t , u n t ) is the backward message that performs smoothing by computing the conditional likelihood of future evidence ×p(x n t+1 |x n t , u n t , z n t+1 = j)p(u n t+1 |x n t+1 , z n t+1 = j). Additionally, by combining both forward and backward messages, we can compute the two-slice smoothed marginals p(z n t , z n t+1 |x n 1:T , u n 1:T ) which will be useful during the maximization and empirical Bayes steps

2) Stochastic Maximization Step
After performing the E-step and computing the smoothed posteriors, we are able to evaluate the lower bound and maximize it with respect to θ given (q(Z n ),ĥ). By plugging Equations (3) and (5) into (4), leveraging conditional independence, and disregarding terms independent of θ, we arrive at the expected complete log-probability function Q(θ, γ, ξ,ĥ) The function Q is non-convex in ω when a nonlinear logit link function f (., ω) is chosen as an embedding for the transition probability χ, see Equation (1). In that case, stochastic 3 We briefly drop the dependency onθ for an uncluttered notation while deriving the forward-backward recursions.

Algorithm 1: Expectation-Maximization for System Identification and Behavioral Cloning
optimization is recommended [79] as batched noisy gradient estimates allow the algorithm to escape shallow local minima and reduce the computational cost that comes with evaluating the gradients for all data instances.
Consequently, when implementing the M-step, we apply stochastic optimization on the transition parameters ω. We use a stochastic gradient ascent direction with an adaptive learning rate ε and batch size M [79] For the parameters with conjugate priors, we derive closed-form optimality conditions. Effectively, we derive the posterior distribution via Bayes' rule and take the mode of each posterior density for a MAP estimate update.
By considering only relevant terms, we write the MAP of the initial gating parameter φ as while the estimate of the initial state parameters (µ k , Ω k ) can be decoupled for each k as follows Analogously, the MAP of the dynamics parameter and, finally, to learn closed-loop behavior, we can infer the controller parameters Due to space constraints, we will refrain from stating the explicit solution for these optimization problems. Instead, we provide the general outline of how to compute these posteriors and their modes based on the unified notation for exponential family distributions in Appendix A and B.

3) Approximate Empirical Bayes
Inference techniques that leverage data-independent assumptions run the risk of prior miss-specification. In our MAP approach, the priors are weakly informative and carry little information. Their main purpose is to regularize greedy updates that might lead to premature convergence. However, when there is little data, the priors, especially those on the precision matrices, may dominate the posterior probability, leading to over-regularization and under-fitting of the objective. Empirical Bayes approaches remedy this issue by integrating out the parameters θ and optimizing the marginal likelihood with respect to the hyperparameters h [77]. In our setting, marginalizing all hidden quantities does not admit a closed-form formula. An approximate approach to empirical Bayes is to interleave the E-and M-steps with hyperparameter updates that optimize the lower bound given an estimate of parametersθ and a step size ϱ where the gradient of Q with respect to h reduces to

VII. Reinforcement Learning via Hybrid Models
The last sections focused on the system modeling aspect and how to use hybrid surrogate models to approximate nonlinear dynamics. Now we turn our attention to the problem of using these models to synthesize structured controllers for general nonlinear dynamical systems. One possible approach is to use the learned hybrid models and apply the classical hybrid control methods, which we have reviewed Section II. However, as discussed earlier, these methods suffer from several drawbacks. On the one hand, they rely a polyhedral partitioning of the space. This limitation is severe because it often leads to an explosion in the number of partitions.
On the other hand, these methods are often focused on computationally expensive trajectory-centric model predictive control. This class of controllers is disadvantageous in applications that require fast reactive feedback signals with broad coverage over the state-action space.
In this section, we address these points and present an infinite horizon stochastic optimization technique that incorporates the structure of hybrid models. This approach can deal with rARHMMs with arbitrary non-polyhedral partitioning and synthesizes stationary piecewise polynomial controllers. Our algorithm extends the step-based formulation of relative entropy policy search (REPS) [80]- [82] by explicitly accounting for the discrete-continuous mixture state variables (x, z). Our approach, hybrid REPS (Hb-REPS), leverages the state-action-dependent nonlinear switches p(z'|z, x, u) as a task-independent upper-level coordinator to a mixture of K lower-level policies π(u|x, z). While the proposed approach shares many features with [62], our formulation relies on a state-abstraction representation of hybrid models and embeds the hierarchical model structure into the optimization problem in order to learn a hierarchy over the global value function. In contrast, [62] operates in the framework of semi-Markov decision processes and optimizes a mixture over termination and feedback policies without considering the existence of a hierarchical structure in the space of dynamics and value functions. For more details on differences between state-and time-abstractions, refer to Section II.

A. Infinite-Horizon Stochastic Optimal Control
In the REPS framework, an optimal control problem is presented as an iterative trust-region optimization for a discounted average-reward objective under a stationary stateaction distribution π(u|x, z)µ(x, z), Equation (6a). The trust-region is formulated as a KL [78], Equation (6c). Its purpose is to regularize the search direction and limit information loss between iterations. The REPS formulation explicitly incorporates a dynamics consistency constraint, Equation (6b), that describes the evolution of the stochastic state of the system. The following optimization problem is solved during a single iteration of hybrid REPS where µ(x, z) is the stationary mixture distribution, q(x, u, z) is the trust-region reference distribution, and the constraint in Equation (6d) guarantees the normalization of the state-action distribution. The factor 1 − ϑ, ϑ ∈ [0, 1), represents the probability of an infinite process to reset to an initial distribution µ 1 (x, z). The notion of resetting is necessary to ensure ergodicity of the closed-loop Markov process and allows the interpretation of ϑ as a discount factor and regularization of the MDP [82], [83].

B. Optimality Conditions and Dual Optimization
To solve the trust-region optimization in Equations (6a)-(6d), we start by constructing the Lagrangian of the primal [84] where we use p(x, u, z) = µ(x, z)π(u|x, z) for convenience and leverage the following identities The second identity implies that the resetting is only dependent on the parameter ϑ and independent of the state and actions (x, u, z) to satisfy the ergodicity property. The parameters η and λ are the Lagrangian variables associated with Equation (6c) and (6d), while V (x, z) is the state-value function, which appears naturally in REPS as the Lagrangian function associated with Equation (6b). Next, we take the partial derivative of L with respect to p(x, u, z) and set it to zero to get the optimal point where A(x, u, z, V ) is the advantage function given as The optimal point p * (x, u, z) = µ * (x, z)π * (u|x, z) has to satisfy the constraint in Equation (6d), which in turn enables us to find the Lagrangian variable λ *

=
By substituting λ * back into p * (x, u, z) in Equation (7), we retrieve the normalized density softmax form Now by plugging the solutions p * and λ * back into the Lagrangian, we arrive at the dual function G as a function of the remaining Lagrangian variables η and V where q(x, u, z) = q(x, u)q(z|x, u) and q(z|x, u) is the posterior over z given x and u. In Section VI, we derived a forward-backward algorithm for inferring this density, allowing us to compute the expectation over z. The expectations over x and u are analytically intractable. Therefore, we approximate them given samples from the reference distribution q(x, u). The multipliers η and V are then obtained by numerically minimizing the dual G(η, V ) arg min that acts as the upper bound on the primal objective.

C. Modeling Dynamics and State-Value Function
Up to this point, the derivation of Hb-REPS has been generic.
We have made no assumptions on initial distributions µ 1 (x, z), the dynamics p(x', z'|x, u, z), or the value function V (x, z). Now, we introduce the piecewise affine-Gaussian dynamics and logistic switching described in Section V and assume these representations to be available in parametric form as a result of a separate learning process. Furthermore, we model the state-value function with piecewise n-th degree polynomial is the state-feature vector which contains polynomial features of the state x, and τ z is the parameter vector assigned to the different regions.
Under these assumptions, we can use the available joint density µ 1 (x, z) and p(x'|x, u, z) to compute the necessary expectations in Equation (8) This computation allows our approach to capture the stochasticity of the dynamics and delivers an estimate of the advantage function A(x, u, z, V ) instead of the temporal difference (TD) error in the general REPS framework [80]. Ultimately, this leads to better estimates of the expected discounted future returns captured by V .
Practically, these integrals can be either naively approximated by applying Monte Carlo integration [85] or, more efficiently, by recognizing the structure of the integrand V (x', z') and using Gauss-Hermite cubature rules for exact integration over polynomial functions [86].

D. Maximum-A-Posteriori Policy Improvement
A significant advantage of our model-based reinforcement learning approach becomes evident when considering the policy improvement step in the REPS framework. The policy update is incorporated into the optimality condition of the stationary state-action distribution p(x, u, z) = π(u|x, z)µ(x, z) in Equation (7). As a consequence, updating the mixture policies π(u|x, z) requires the computation of state probabilities µ(x, z), which in turn require knowledge of the dynamics model. This issue is circumvented in other model-free realizations of REPS by introducing a crude approximation to enable a model-free policy update nonetheless. For example, in [87], the authors postulate that the new state distribution µ(x, z) is usually close enough to the old distribution q(x, z), thus allowing the ratio q(x, z)/µ(x, z) to be ignored when a weighted maximum-likelihood fit of the actions u is performed to update π.
While the assumption of closeness may be practical and empowers many successful variants of REPS, it is crucial to be aware of its technical ramifications, as it undermines the primary motivation of a relative entropy bound on the stateaction distribution in Equation (6c). This aspect is unique in the REPS framework when compared to other state-ofthe-art approximate policy iteration algorithms [7], [9], [88], that optimize a similar objective, albeit with a relaxed bound that only limits the change of the action distribution π.
In contrast, our algorithm uses the surrogate hybrid dynamics and updates the policy π(u|x, z) with the correct weighting. The optimality condition in Equation (7) is satisfied by computing a weighted maximum a posteriori estimate of the parameters θ of the state-action distribution p(x, u, z|θ), thus implicitly updating π(u|x, z). This procedure is equivalent to a modified Baum-Welch expectationmaximization algorithm that learns the parameters of a closed-loop rARHMM, as derived in Section VI. The difference is that the EM objective in Equation (2) has to be augmented with the importance weights from Equation (7) arg max θ log N n=1 z n w n p(X n , U n , Z n |θ)p(θ), where (X n , U n ) are state-action trajectories collected via interaction with the environment and w n are the associated weights resulting from Equation (7) w n = exp A(X n , U n , Z n , V )/η . initialize: q(u|x, z), τ z , η This augmentation leads to weighted M-and EB-steps while the E-step is not altered.
Note that during the policy improvement step, we can either assume an a priori estimate of the open-loop dynamics p(x', z'|x, u, z) and only update the control parameters corresponding to the conditional π(u|x, z), or we can iteratively update p(x', z'|x, u, z) as more data becomes available. A compact sketch of the overall optimization process is available in Algorithm 2.

VIII. Empirical Evaluation
In this section, we benchmark different aspects of our approach to system modeling and control synthesis via hybrid models. In the following, • we assess the predictive performance of rARHMMs at open-loop system identification of nonlinear systems and validate our choice of hybrid surrogate models as a suitable representation. • we test the ability of rARHMMs to approximate and decompose expert nonlinear controllers in a closed-loop behavioral cloning scenario. • we deploy rARHMMs in the proposed hierarchical RL algorithm Hb-REPS to solve the infinite horizon stochastic control objective and optimize piecewise polynomial controllers and value functions.

A. Piecewise Open-Loop System Identification
We start by empirically benchmarking the open-loop learned rARHMMs and their ability to approximate nonlinear dynamics. We compare to popular black-box models in a longhorizon and limited-data setting. This evaluation focuses on rARHMMs with exogenous inputs. We learn the dynamics of three simulated deterministic systems; a bouncing ball, an actuation-constrained pendu-    Figure 3. The values reflect the total number of parameters of each model. The values in parentheses represent the hidden layer sizes S of the neural models and the number of discrete components K for the (r)ARHMM, respectively.
lum, and a cart-pole system. We compare the predictive time forecasting accuracy of rARHMMs to classical non-recurrent autoregressive hidden Markov models (ARHMMs) 4 [16], feed-forward neural nets (FNNs), Gaussian processs (GPs) 5 , long-short-term memory networks (LSTMs) [89], and recurrent neural networks (RNNs). During the evaluation, we collected segregated training and test datasets. The training dataset is randomly split into 24 groups, each used to train different instances of all models. These instances are then tested on the test dataset. During evaluation, we sweep the test trajectories stepwise and predict the given horizon. All neural models have two hidden layers, which we test for different layer sizes, S ∈ {16, 32, 64, 128, 256, 512} for FNNs, S ∈ {16, 32, 64, 128, 256} for RNNs, and S ∈ {16, 32, 64, 128} for LSTMs. In the case of (r)ARHMMs, we vary the number of components K, dependent on the task. As a metric, we evaluate the forecast NMSE for a range of horizons averaged over the 24 data splits. We report the result corresponding to the best choice of S and K. Finally, in Table 1, we qualitatively compare the complexity of all representations in terms of their total number of parameters.

1) Bouncing Ball
This example is a canonical instance of a dual-regime hybrid system due to the hard velocity switch at the moment of impact. We simulate the dynamics with a frequency of 20 Hz and collect 25 training trajectories with different initial heights and velocities, each 30 s long. This dataset is split 24 folds with ten trajectories, 10 × 150 data points, in each subset. The test dataset consists of 5 trajectories, each 30 s long. We evaluate the NMSE for horizons h = {1, 20, 40, 60, 80} time steps. We did not evaluate a GP model in this setting due to the long prediction horizons that led to a very high computational burden. The (r)ARHMMs are tested for K = 2. The logit link function of an rARHMM is parameterized by a neural net with one hidden layer containing 16 neurons. The results in Figure 3 show that the rARHMM approximates the dynamics well and outperforms both ARHMMs and the neural models.

2) Pendulum and Cart-Pole
These systems are classical benchmarks from the nonlinear control literature. Here we consider two different observation types, one in the wrapped polar space, where the angle space θ ∈ [−π, π] includes a sharp discontinuity, and a second model with smooth observations parameterized with the Cartesian trigonometric features {cos(θ), sin(θ)}. Both dynamics are simulated with a frequency of 100 Hz. We collect 25 training trajectories starting from different initial conditions and apply random uniform explorative actions. Each trajectory is 2.5 s long. The 24 splits consist of 10 trajectories each, 10 × 250 data points. The test dataset consists of 5 trajectories, each 2.5 s long. Forecasting accuracy is evaluated for horizons h = {1, 5, 10, 15, 20, 25}. The (r)ARHMMs are tested for K = {3, 5, 7, 9} on both tasks. The logit link function of the rARHMM is parameterized by a neural net with one hidden layer containing 24 neurons. As shown in Figure 3, the forecast evaluation provides empirical evidence for the representation power of rARHMMs in both smooth and discontinuous state spaces. FNNs and GPs perform well in the smooth Cartesian observation space and struggle in the discontinuous space, similar to RNNs and LSTMs. Moreover, in Table 1, it is clear that rARHMMs reach comparable predictive performance to state-of-the-art models with a fraction of the parametric complexity.

B. Piecewise Closed-Loop Behavioral Cloning
We want to analyze the closed-loop rARHMM with endogenous inputs as a behavioral cloning framework. The task is to reproduce the closed-loop behavior of expert policies on challenging nonlinear systems. For this purpose, we train two feedback experts on the pendulum and cart-pole. The two environments are simulated at 50 Hz and are influenced by static Gaussian noise with a standard deviation σ = 1×10 −2 . The experts are two-layer neural network policies with 4545 parameters (pendulum) and 17537 parameters (cart-pole), optimized with the SAC algorithm [9].
For cloning, we construct two 5-regime rARHMMs with piecewise polynomial policies of the third order. The hybrid controllers have a total number of parameters of 100 (pendulum) and 280 (cart-pole). Learning is realized on a dataset of 25 expert trajectories, each 5 s long, for each environment and using the EM technique from Section VI. The decomposed controllers complete the task of swinging up and stabilizing both systems with over 95% success rate. Figure 4 shows the phase portraits of the unforced dynamics and closed-loop control identified during cloning. Figure 5 depicts sampled trajectories of the hybrid policies highlighting the switching behavior.

C. Nonlinear Control Synthesis via Hybrid Models
Finally, we evaluate the performance of the hybrid policy search algorithm Hb-REPS on two nonlinear stochastic dynamical systems: an actuation-constrained pendulum swingup and a cart-pole stabilization task. We make no claim to the absolute sample efficiency of our RL approach when compared to state-of-the-art RL algorithms. Instead, we aim to provide empirical support for the premise that structured representations that rely on compact piecewise parametric forms can provide an alternative to black-box function approximators with comparable overall performance. Therefore, we compare the performance of Hb-REPS to two baselines. The first is a vanilla version of REPS that does not maintain any hierarchical structure and uses nonlinear function approximators with random Fourier features (RFFs) [90] to represent both policy and value function. The second baseline assumes a hierarchical policy structure and a nonlinear value function with Fourier features. This baseline is somewhat akin to what is implemented in [62], albeit with a hierarchy based on state abstraction rather than time. We will refer to this algorithm as hierarchical REPS (Hi-REPS). We assume an offline learning phase in which the hybrid models are learned from pre-collected data.

1) Pendulum Swing-up
In this experiment, the actuation-constrained pendulum is simulated at 50 Hz and perturbed by Gaussian noise with a standard deviation σ = 1 × 10 −2 . The REPS agent relies on a policy and value function with 50 and 75 Fourier basis functions, respectively. Hi-REPS assumes a similar form of the value function but with a piecewise third-order polynomial policy over five partitions. Hb-REPS represents both policy and value function with piecewise third-order polynomials over five partitions. Empirical results in Figure 7 (left half) feature comparable learning performance of all algorithms over ten random seeds. Every iteration involves 5000 interactions with the environment. We provide a phase portrait of the closed-loop behavior for a qualitative assessment of the final stationary hybrid policy.

2) Cart-pole Stabilization
This evaluation features a cart-pole constrained by an elastic wall modeled by a spring. The dynamics are linearized around the upright position. The environment is simulated at 100 Hz and perturbed by Gaussian noise with a standard deviation σ = 1 × 10 −4 . The REPS policy and value function both use 25 random Fourier basis functions. Hi-REPS adopts the same value function structure with a twopartition piecewise affine policy. Hb-REPS also assumes a two-partition piecewise affine policy and second-order value function. Figure 7 (right half) depicts comparable learning performance over ten random seeds. Every iteration involves 2500 interactions with the environment.

IX. Discussion
We presented a general framework for data-driven nonlinear system identification and stochastic control based on the structured representation of hybrid surrogate models. To introduce the hybrid structure, we proposed replacing commonly used piecewise affine auto-regressive models with probabilistic hybrid dynamic Bayesian networks, as they offer a range of advantages in data-driven scenarios. Furthermore, we presented a novel reinforcement learning algorithm that leverages the learned hybrid models to synthesize piecewise polynomial feedback controllers for nonlinear systems.
Our hybrid-model-infused reinforcement learning approach is able to reach comparable performance on control tasks with a significant reduction in the complexity of functional representation. Furthermore, in contrast to deterministic hybrid model predictive control, our approach solves the infinite-horizon stochastic optimal control problem by approximating the global value function and lifts the requirement for polyhedral partitioning.
While initial empirical results are encouraging, the application of this work is limited to low-dimensional dynamical systems. Although a viable alternative to expensive mixedinteger optimization, the inference techniques used in this paper still present a bottleneck in the face of scalability to higher dimensions. While our MAP approach significantly improves the quality of expectation-maximization solutions, it nevertheless struggles in more challenging environments.
A possible course of action is to investigate Bayesian nonparametric extensions of hybrid dynamic Bayesian networks based on non-conjugate variational inference. Fully Bayesian methods tend to improve learning in large structured models significantly. Another potential avenue of research is to improve the hybrid reinforcement learning framework by considering the control-as-inference paradigm. Such approaches may offer ways of integrating the Bayesian structure of the models into the control optimization and constructing an uncertainty-aware approach that is better equipped to deal with the exploration-exploitation dilemma.

Appendix A: Exponential Family
Our work focuses on random variables with probability density functions belonging to the exponential family. The unified minimal parameterization of this class of distributions lends itself for convenient and efficient posterior computation when paired with conjugate priors.
We assume the natural form for a probability density of a random variable x where h(x) is the base measure, η are the natural parameters, t(x) are the sufficient statistics and a(η) is the log-partition function, or log-normalizer. Following the same notation, a conjugate prior g(η|λ) to the likelihood f (x|η) has the form g(η|λ) = h(η) exp λ · t(η) − a(λ) , with prior sufficient statistics t(η) = [η, −a(η)] ⊤ and hyperparameters λ = [α, β] ⊤ . By applying Bayes' rule, we can directly infer the posterior q(η|x) q(η|x) ∝ f (x|η)g(η|λ) ∝ exp ρ(x, λ) · t(η) − a(ρ) , where the posterior natural parameters ρ(x, λ) are a function of the likelihood sufficient statistics t(x) and prior hyperpa- The structure of the resulting posterior reveals a simple recipe for data-driven inference. By moving into the natural space, the posterior parameters are computed by combining the prior hyperparameters with the likelihood sufficient statistics and log-partition function. By definition, every exponential family distribution has a minimal natural parameterization that leads to a unique decomposition of these quantities [91].

Appendix B: Conjugate Posteriors
We present an outline of all M-step updates. We use an adapted form of the exponential natural parameterization, as it offers a clear methodology for deriving and implementing such updates for all relevant distributions.

A. Categorical with Dirichlet Prior
A weighted categorical likelihood over a one-hot random variable z with size K has the form where w nk are the importance weights for each category K. The conjugate prior is a Dirichlet p(φ) distribution The posterior q(φ) is likewise a Dirichlet distribution The maximization step requires computing the mode categorical weights. For a Dirichlet distribution the mode weights areφ = (τ −1)/( K k=1 τ k −K) with τ k > 1. The parameter vector τ is given by τ k = τ 0,k + N n=1 w n,k ∀k ∈ [1, K].

Linear-Gaussian with Matrix-Normal-Wishart Prior
A weighted linear-Gaussian likelihood takes a random variable x ∈ R d and returns a random variable y ∈ R m according to a linear mapping A : R d → R m p(Y|X, A, V) = N n=1 N(y n |x n , A, V) wn where w n are the weights and W = diag(w n ) is the diagonal weight matrix. The data matrices X and Y are of size d×N and m × N , respectively. The conjugate prior p(A, V) is a matrix-normal-Wishart with zero mean p(A, V) = N(A|0, V, K 0 ) W(V|Ψ 0 , ν 0 ) The posterior q(µ, Λ) is matrix-normal-Wishart The mode mapping and precision of a matrix-normal-Wishart areÂ = M andΛ = (ν − m)Ψ, respectively. The standard posterior parameters are