Approximate propagation of normal distributions for stochastic optimal control of nonsmooth systems

We present a method for the approximate propagation of mean and covariance of a probability distribution through ordinary differential equations (ODE) with discontinuous right-hand side. For piecewise affine systems, a normalization of the propagated probability distribution at every time step allows us to analytically compute the expectation integrals of the mean and covariance dynamics while explicitly taking the discontinuity into account. This leads to a natural smoothing of the discontinuity such that, for relevant levels of uncertainty, the resulting ODE can be integrated directly with standard schemes; it is neither necessary to prespecify the switching sequence nor to use a switch detection method. We then show how this result can be employed in the more general case of piecewise smooth functions based on a structure preserving linearization scheme. The resulting dynamics can be straightforwardly used within standard formulations of stochastic optimal control problems with chance constraints.


Introduction
Throughout this paper, we consider initial value problems (IVP) with uncertain initial value x0 ∈ R^n which are defined by an ordinary differential equation (ODE) with discontinuous right-hand side. More specifically, we consider IVP of the form

x(0) = x0, ẋ = f(x) := f1(x) if ψ(x) < 0, f2(x) if ψ(x) > 0, (1)

for t ∈ [0, T] =: T and with the initial uncertainty described by the probability distribution X0. The right-hand side is defined by the smooth modes f1, f2 and the switching function ψ, such that depending on the sign of ψ(x) the dynamics evolve according to f1 or f2. The zero-level set of ψ defines the switching surface as Sψ = {x ∈ R^n | ψ(x) = 0}, and for regularity we assume ∇ψ(x) ≠ 0 for all x ∈ Sψ. For rigorously treating the discontinuity, i.e., the case where ψ(x) = 0, we refer to the notion of Filippov differential inclusions [1]. We only consider the nondegenerate cases where the solution to (1) is well defined, as will be discussed below in more detail. The main results will be derived for the case that f1, f2 and ψ are affine, with an extension on how these results can be employed in the nonlinear case. Since our primary interest here is the simulation of the dynamics, we consider an uncontrolled system. However, the results straightforwardly extend to the case where f additionally depends on a control input.

Nonsmooth dynamics arise in the modeling of a wide range of systems, especially in mechanics and robotics, e.g., Coulomb friction, contact models, gear boxes, but also in electrical circuits [2,3]. Standard theory of numerical integration of ODE is built on the assumption of Lipschitz continuity [4]. Thus, special care needs to be taken when simulating nonsmooth dynamics, for which this assumption is violated [2].

Based on a dynamic system model, optimal control provides a systematic framework for achieving a desired system behavior, i.e., optimizing trajectories based on an objective function and subject to constraints. Its closed loop application, model predictive control (MPC), relies on the numerical
solution of optimal control problems (OCP) in real time [5].
Often this takes the form of solving nonlinear programs (NLP) via Newton-type, i.e., derivative based, methods [6]. Thus, when simulating system dynamics in this context, it is relevant that the integration schemes are both efficient and provide accurate sensitivities. Since models typically do not allow for a perfect prediction of reality, there is always some uncertainty involved. The closely related fields of robust and stochastic optimal control try to explicitly account for this mismatch [5,7,8,9]. In the robust paradigm this takes the form of set based uncertainty models, whereas the stochastic approach is concerned with probability distributions. Numerically the resulting formulations can be very similar: for example, a positive definite matrix can both describe the shape of an ellipsoidal set and the covariance of a normal distribution (giving rise to ellipsoidal confidence sets), cf., e.g., [10,11]. Thus, while working in a stochastic framework -- the results in this paper exploit the smoothly decaying unbounded support of normal distributions -- we will still draw from results in the robust optimization literature.
In this paper we are concerned with the behavior of probability distributions under nonsmooth dynamics. In [12], the authors consider distributions under the dynamics of a Moreau sweeping process. In more detail, they consider probability densities with support on a convex bounded set which moves as a function of time. This leads to nonsmooth interactions at the set boundary as the distribution is pushed along, and the authors derive results on existence and uniqueness for the resulting evolution of the distribution and provide a discretization based approximation. In [13] the authors consider a similar setting with an additional drift term. Based on a Lipschitz approximation of the nonsmooth dynamics, they describe the evolution of the distribution via the Liouville equation. Leveraging an additional finite order moment approximation allows them to compute the distribution over time. In [14], the authors propose the linearization based propagation of an uncertainty set for an ODE with discontinuous right-hand side. Based on detection of the switch, they account for the change in integrator sensitivity resulting from the discontinuity via the so called jump matrix, cf. [15,16]. This allows them to solve an OCP with nonsmooth dynamics under parametric uncertainty. For an overview of jump matrices in this context, see also [17].

Contribution and outline
We present a method for the approximate propagation of mean and covariance of a probability distribution through an ODE in the form of (1). The method is based on (a) linearizing the right-hand side function at the current mean in terms of its components f1, f2, ψ, such that a piecewise affine function is obtained which preserves the discontinuous structure of f, and (b) approximating the current probability distribution by a normal distribution, which for piecewise affine f allows us to analytically compute the current change of mean and covariance. However, since this neglects the change of the higher-order moments, this does not lead to an exact propagation. Finally, we demonstrate how the derived dynamics can be used in a stochastic optimal control problem formulation with chance constraints. We start by discussing the relevant background on ODE with discontinuous right-hand side in Section 2 and on IVP with uncertain initial value in Section 3. In Section 4 we provide a detailed discussion of the simplified setting of a scalar piecewise constant system to further the intuitive understanding. This is followed by the derivation of the main result for piecewise affine systems in Section 5, the extension to the piecewise smooth case in Section 6, and its application to stochastic OCP in Section 7, with a concluding Section 8.

Notation
For a multivariate function f : R^n → R^m, x ↦ f(x), the gradient is defined as the transpose of the Jacobian, ∇f(x) := (∂f(x)/∂x)⊤ ∈ R^{n×m}. For a symmetric matrix S ∈ R^{n×n} which is given as a sum of a matrix A ∈ R^{n×n} and its transpose, we abbreviate this as S = A + (⋆), i.e., S = A + A⊤.

Ordinary differential equations with discontinuous right-hand side
In this section we briefly summarize the relevant background on ODE with discontinuous right-hand side of the form (1), to the extent relevant in the context of numerical optimal control. For a more in-depth discussion we refer especially to [16].
Depending on the directions in which the two modes point at the switching surface, two nondegenerate cases can be distinguished:

1. Crossing: ψ(x) = 0 and both ∇ψ(x)⊤f1(x) > 0 and ∇ψ(x)⊤f2(x) > 0. In consequence, dψ(x(t))/dt > 0 immediately before and after the switching time, such that the state immediately leaves the switching surface after reaching it, and crosses from ψ(x) < 0 to ψ(x) > 0. The case where the surface is crossed in the opposite direction is analogous.

2. Sliding mode (or trapped case): ψ(x) = 0 and ∇ψ(x)⊤f1(x) > 0 but ∇ψ(x)⊤f2(x) < 0. In this case, the solution will remain on the surface, i.e., dψ(x(t))/dt = 0 after the switching time.

Numerical integration
Because f is discontinuous, the standard theory of ODE and their numerical integration does not hold, since it is built on the assumption of (Lipschitz) continuity of f [4]. Thus, when relying on standard integration schemes and their theory, the switch needs to be considered explicitly.
Otherwise the standard results on the order of integration accuracy will not hold: In general, for a step size h the error will be O(h) irrespective of the order of the scheme [2]. Furthermore, the error in the integration map sensitivities will even be independent of the step size [16,18]. In the context of optimal control this requires either a predefined switching sequence or a switch detecting integration scheme which supplies correct sensitivities [2,19,20]. An intuitive and common workaround is to smoothen the right-hand side of (1). This results in the smooth approximate dynamics

ẋ = (1 − α_σ(ψ(x))) f1(x) + α_σ(ψ(x)) f2(x), (2)

where α_σ : R → R is a smooth approximation of the Heaviside step function, parametrized by σ > 0 and with increasing accuracy as σ → 0. One choice is α_σ(ξ) = (1 + tanh(ξ/σ))/2. However, the resulting ODE will be increasingly nonlinear and stiff for decreasing σ and require ever smaller step sizes for sufficiently precise integration. Furthermore, it can be shown that the integrator has to have a step size of h = o(σ) in order for its sensitivities to be adequate [16,18], which is an important requirement if it is to be used as a component in the formulation of a nonlinear program (NLP). Thus, one is in general well advised to carefully choose an appropriate integration scheme when handling an ODE with discontinuous right-hand side.
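To make the smoothing concrete, the following sketch (function names are illustrative, not from the paper) implements the smoothed right-hand side (2) with the choice α_σ(ξ) = (1 + tanh(ξ/σ))/2:

```python
import math

def alpha(xi, sigma):
    # Smooth approximation of the Heaviside step function;
    # approaches the exact step as sigma -> 0.
    return (1.0 + math.tanh(xi / sigma)) / 2.0

def f_smooth(x, f1, f2, psi, sigma):
    # Smoothed right-hand side: blends the two modes f1, f2
    # according to the smoothed sign of the switching function psi.
    a = alpha(psi(x), sigma)
    return (1.0 - a) * f1(x) + a * f2(x)
```

For small σ the blend is essentially a step near ψ(x) = 0, which is exactly where the stiffness and step size restrictions discussed above originate.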

Solution sensitivities
While the solution maps of IVP with discontinuous right-hand side are continuous, their sensitivities may have jumps when encountering a switch. Denote by x(t; x0), t ∈ T, the solution of IVP (1) for a given initial value x0. Assume the IVP is not initialized at a switch but that at some time 0 < ts < T with xs := x(ts; x0) the solution reaches the switching surface, ψ(xs) = 0 and ψ(x(t; x0)) ≠ 0 for all t ∈ [0, ts). Then the sensitivity of the final state with respect to the initial value is given by

∂x(T; x0)/∂x0 = (∂x(T; xs)/∂xs) J(xs) (∂x(ts; x0)/∂x0), (3)

where the jump matrix J(xs) accounts for the jump of sensitivity and is given, for the crossing case, by

J(xs) = I + ((f2(xs) − f1(xs)) ∇ψ(xs)⊤) / (∇ψ(xs)⊤ f1(xs)), (4)

with an analogous expression for the sliding mode [16]. Without loss of generality, the jump matrix for the crossing case is given under the assumption that ψ(x0) < 0, i.e., the state crosses from mode f1 into mode f2.
Example 1 (Crossing the discontinuity). Consider a scalar system with state x ∈ R and ẋ = 3 for x < 0 and ẋ = 1 for x > 0. We simulate the trajectory x(t; x0) for t ∈ [0, T], and for three different values of the initial state x0, given by the set {μ̄1, μ̄1 + 3σ1, μ̄1 − 3σ1}, with μ̄1 = −3, σ1 = 0.3, and T = 2. The resulting trajectories as well as the corresponding integrator map x(T; x0) are shown in Fig. 1. We observe that the discontinuity in ẋ at x = 0 leads to a kink in the trajectories. Due to this kink, the distance between the trajectories narrows. Whereas initially the distance between the two outer points is 6σ1, after each point has crossed the switch this distance has narrowed to (f̄2/f̄1) 6σ1 = (1/3) 6σ1. The scaling factor of 1/3 corresponds to the slope of the integrator map in the respective region and is given by the jump matrix (4).
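Since both modes are constant, the trajectories of Example 1 can be computed in closed form; the following sketch (the helper name `x_final` is ours) reproduces the narrowing of the 6σ1 spread by the factor f̄2/f̄1 = 1/3:

```python
def x_final(x0, T, f1=3.0, f2=1.0):
    # Exact solution of the scalar example for x0 < 0: integrate with
    # rate f1 until the switching surface x = 0 is reached, then with f2.
    t_s = -x0 / f1          # switching time for x0 < 0
    if t_s >= T:
        return x0 + f1 * T  # switch not reached within [0, T]
    return f2 * (T - t_s)   # value after crossing

mu, sigma, T = -3.0, 0.3, 2.0
lo = x_final(mu - 3 * sigma, T)
hi = x_final(mu + 3 * sigma, T)
# The 6-sigma spread narrows by the factor f2/f1 = 1/3 after the crossing.
```

The slope of the map x(T; x0) in the crossed region is exactly this factor, in agreement with the jump matrix discussion above.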
Example 2 (Sliding mode). Now consider a system with state x ∈ R and ẋ = 3 for x < 0 and ẋ = −1 for x > 0. This results in a system with sliding mode, i.e., once a trajectory reaches x = 0, it stays there. Again, we simulate the trajectory x(t; x0) for t ∈ [0, T], and for three different values of the initial state x0, given by {μ̄1, μ̄1 + 3σ1, μ̄1 − 3σ1}, this time with μ̄1 = −1, σ1 = 0.6, and T = 1.2. The resulting trajectories as well as the corresponding integrator map x(T; x0) are shown in Fig. 2. The sliding mode leads to a narrowing of the distance between the trajectories over time, until all of them have reached x = 0. This results in the zero slope of the integrator map x(T; x0) in the region of initial states x0 such that x(T; x0) = 0. Note that this time the zero sensitivity is not a consequence of the jump matrix (4), but of the resulting sliding mode dynamics ẋ = 0 if x = 0. This is due to the scalar state space. In a higher dimensional state space, the sensitivity would not necessarily be zero, since the state could still evolve on the switching surface with nontrivial dynamics.
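Analogously, a closed-form sketch of Example 2 (helper name ours) shows that all three initial points are mapped to x = 0, i.e., the integrator map is flat in this region:

```python
def x_final_sliding(x0, T, f1=3.0, f2=-1.0):
    # Exact solution of the sliding-mode example: both modes point
    # toward x = 0, so trajectories stop once they reach the surface.
    if x0 < 0:
        return min(x0 + f1 * T, 0.0)
    return max(x0 + f2 * T, 0.0)

mu, sigma, T = -1.0, 0.6, 1.2
points = [mu, mu + 3 * sigma, mu - 3 * sigma]
finals = [x_final_sliding(p, T) for p in points]
# All three trajectories are absorbed by the surface: the integrator
# map has zero slope over this region of initial values.
```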

Initial value problems with uncertain initial value
We now consider again a general IVP with x(0) = x0, ẋ = f(x), t ∈ T, and assume its solution is well defined for every x0 ∈ R^n. However, the initial state is not exactly known. Instead it follows a probability distribution, x0 ∼ X0. If we denote by x(t; x0) the solution to the IVP after the time interval [0, t] given x(0) = x0, this induces a distribution X(t) over the solution trajectory such that x(t; x0) ∼ X(t). Alternatively we can view this as an IVP in distribution space, X(0) = X0, Ẋ = F(X), with appropriately defined F.
In principle, we can describe this evolution in terms of the moments of X, assuming that all moments are finite and uniquely determine X. This holds if X has bounded support or if its tails decay sufficiently fast, which includes normal distributions [21]. In particular, consider the first and second-order moments, mean and covariance, defined as

m_X := E_{x∼X}{x}, (5a)
Σ_X := E_{x∼X}{(x − m_X)(x − m_X)⊤}. (5b)

Noting that E_{x∼X(t)}{x} = E_{x0∼X0}{x(t; x0)}, we get the time derivative of the mean as

d/dt m_X = E_{x∼X}{f(x)}. (6a)

Similarly, we get the time derivative of the covariance as

d/dt Σ_X = E_{x∼X}{f(x)(x − m_X)⊤} + E_{x∼X}{(x − m_X)f(x)⊤}. (6d)

However, to exactly describe the evolution of X, we would need to consider the change of all moments up to infinite order, which is in general intractable. Further, we cannot treat the expectation of a nonlinear transformation of a random variable in this general setting.

Linearization-based uncertainty propagation for smooth dynamics
For IVP with smooth dynamics (or in general for smooth nonlinear transformations of random variables) a standard approach to approximate the uncertainty propagation is based on linearization.Examples include widely used methods such as the Extended Kalman Filter [22].There are two major variants: (i) linearization of the right-hand side of the ODE, and (ii) linearization of the integration map.
In the first variant, after substituting f by its first-order Taylor approximation at the mean, the expectations in (6) can be analytically computed as

E_{x∼X}{f(x)} ≈ E_{x∼X}{f(m_X) + ∇f(m_X)⊤(x − m_X)} = f(m_X), (7a)
E_{x∼X}{f(x)(x − m_X)⊤} ≈ ∇f(m_X)⊤ Σ_X. (7b)

When computing the expectation in (7b), the first term vanishes due to E_{x∼X}{x − m_X} = 0, cf. (5a). The second term yields the covariance (5b) premultiplied by a constant (with respect to the expectation operator). The approximation in (7) is exact for the case that f is an affine function.
The resulting approximate propagation of the first two moments is defined by the IVP

μ̇ = f(μ), Σ̇ = ∇f(μ)⊤Σ + (⋆). (8)

Unsurprisingly, the covariance dynamics are in the form of the continuous time differential Lyapunov equation [23], i.e., the covariance dynamics of a linear system. While in the preceding approach the uncertainty propagation is given in continuous time, in direct optimal control the dynamics are usually considered only on a discrete time grid. Thus an alternative approach, corresponding to the second variant, is to discretize the dynamics first and only then consider uncertainty. Let f_h(x_k) := x(h; x_k) denote the integration of the dynamics over the discretization time step h given the initial value x_k, such that x_{k+1} = f_h(x_k). Note that in practice, this integration map is typically approximated by standard numerical integration schemes, cf., e.g., [5]. Then, given x_k ∼ X_k, we can approximate the expectations analogously to (7) by linearizing f_h at the mean. This results in the approximate propagation

μ_{k+1} = f_h(μ_k), Σ_{k+1} = ∇f_h(μ_k)⊤ Σ_k ∇f_h(μ_k), (10)

with the covariance dynamics corresponding to the discrete time Lyapunov difference equation. Thus, it is sufficient to obtain a (numerically approximated) map for the deterministic IVP, from which the covariance dynamics directly follow. This is in contrast to (8), where both mean and covariance need to be numerically integrated. One important practical consideration is that during the numerical integration of (8) it can happen that Σ takes indefinite values, while the propagation in (10) guarantees that Σ_k ⪰ 0 if Σ_0 ⪰ 0, cf. [24].
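The second variant can be sketched as follows (all names are ours; a classical RK4 step stands in for the integration map f_h, and the Jacobian is approximated by finite differences):

```python
import numpy as np

def rk4_step(f, x, h):
    # One classical Runge-Kutta step, standing in for the integration map f_h.
    k1 = f(x)
    k2 = f(x + h / 2 * k1)
    k3 = f(x + h / 2 * k2)
    k4 = f(x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def propagate(f, mu, Sigma, h, eps=1e-6):
    # Push the mean through the integration map; push the covariance
    # through its Jacobian A: Sigma_next = A Sigma A^T, i.e., the
    # discrete time Lyapunov difference equation.
    mu_next = rk4_step(f, mu, h)
    n = mu.size
    A = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        A[:, i] = (rk4_step(f, mu + e, h) - mu_next) / eps  # finite differences
    return mu_next, A @ Sigma @ A.T
```

Since A Σ A⊤ is a congruence transformation, the propagated covariance stays positive semidefinite whenever the initial one is, matching the practical remark about (10) above.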

Linearization-based uncertainty propagation for nonsmooth dynamics
Both of the approaches discussed in the previous subsection rely on linearization of a smooth function. Thus, they are not directly applicable to ODE with a discontinuous right-hand side. Again, an intuitive approach is to smoothen the dynamics as in (2) and then apply a linearization based scheme. This does not come without issues: all the problems regarding stiffness and accuracy of derivatives from the nominal case will transfer, such that it is challenging to use within an optimization problem, cf. Section 2.2. Thus, as in the nominal case, an alternative is to treat the switch explicitly and propagate the covariance based on (10). Using the jump matrix (4), the sensitivity jump can be handled explicitly. However, in the context of optimal control this requires either a method for switch detection [14] or a predefined switching sequence. Furthermore, the linearization based propagation of mean and covariance leans on the assumption that the nonlinear dynamics are sufficiently well approximated by a linearization at the mean within the region of uncertainty. For a switched system this is clearly not the case if the mean is on one side of the switch but a nonnegligible amount of probability mass is on the other. While this works for some situations, in others it can cause a complete failure of the uncertainty propagation, as we will see in a later example.

Normalization-based uncertainty propagation
In the remainder of this paper we will derive an alternative method for the approximate propagation of mean and covariance. Instead of linearization, the method is based on "normalization" of the probability distribution, i.e., at each point in time we approximate the true distribution by a normal distribution that is defined by our current value for mean and variance. For switched affine systems this will allow us to compute the expectations in (6) analytically, resulting in a natural smoothing of the discontinuity. This yields an easy-to-implement ODE for mean and variance that can be treated by standard integrators and be straightforwardly used within tractable stochastic OCP formulations.
A similar idea of renormalization after every time step (although in a discrete time setting) is used in the method of moment matching for recursive time series prediction based on Gaussian process models: for these, the mean and covariance of the output can be exactly computed given a normal distribution in the input variable [25]. Further, the unscented Kalman filter [26] discretizes the current normal distribution into systematically chosen samples which are then propagated through the nonlinear function. Based on mean and variance of the propagated samples, a new normal distribution is obtained.

While we will explain our suggested approach in detail in the following three sections, we already derive some results that will be useful later on. Consider a variable z ∈ R^n, two distributions Z1, Z2, and a function g : R^n → R^m. We would like to approximate E_{z∼Z1}{g(z)} by computing instead the expectation with respect to Z2. We can write

E_{z∼Z1}{g(z)} = E_{z∼Z2}{g(z)} + δ_g(Z1, Z2), (11a)

where δ_g(Z1, Z2) defines the resulting error. Clearly, if the distributions are identical, this error is zero. Intuitively, the error becomes larger as the two distributions become more dissimilar. In the remainder of this paper we will use this definition to describe the error resulting from our approximation, but provide no rigorous analysis. However, we point out that the definition of the error is closely related to integral probability metrics [27]. For example, in case that g is Lipschitz continuous, δ_g(Z1, Z2) can be bounded via the dual representation of the 1-Wasserstein (Kantorovich-Rubinstein) distance of Z1 and Z2, cf. [28, Remark 6.5]. Similar bounds may be derived for the case that g is piecewise Lipschitz, based on conditional expectations. Additionally, with a similar motivation, we define the error terms

∆′_g(Z1, Z2) := E_{z∼Z1}{g(z)(z − m_{Z1})⊤} − E_{z∼Z2}{g(z)(z − m_{Z2})⊤}, (11b)
∆_g(Z1, Z2) := ∆′_g(Z1, Z2) + (⋆). (11c)

Now consider again the state distribution X(t) as defined from the IVP with uncertain initial value x(0) = x0, x0 ∼ X0, ẋ = f(x), t ∈ T.
Both computing and representing X(t) is in general intractable, and we would like to approximate it by a normal distribution X(t) ≈ N(µ(t), Σ(t)), parameterized by µ and Σ, and with its probability density function (PDF) given by

φ(x; µ, Σ) = (2π)^{−n/2} det(Σ)^{−1/2} exp(−(1/2)(x − µ)⊤ Σ^{−1} (x − µ)).

Our aim is to derive a tractable IVP for the parameters µ and Σ, such that X(t) ≈ N(µ(t), Σ(t)).
As a first step, we consider what happens when we replace the expectation with respect to X by an expectation with respect to N (µ, Σ) in the moment dynamics (6).
Lemma 1. Consider a distribution in state space, x ∼ X, that evolves according to ẋ = f(x). Consider also a normal distribution on the same space, N(µ, Σ). Then, the time derivative of mean and covariance can be written as

d/dt m_X = E_{x∼N(µ,Σ)}{f(x)} + δ_f(X, N(µ, Σ)), (14a)
d/dt Σ_X = E_{x∼N(µ,Σ)}{f(x)(x − µ)⊤} + (⋆) + ∆_f(X, N(µ, Σ)), (14b)

with the error terms δ_f, ∆_f defined in (11).
Proof. The mean dynamics d/dt m_X we have from (6a). Exchanging the expectation over X by the expectation over N(µ, Σ) as in (11a) results in (14a). From (6d) we have the covariance dynamics as

d/dt Σ_X = E_{x∼X}{f(x)(x − m_X)⊤} + E_{x∼X}{(x − m_X)f(x)⊤}. (15)

We rearrange the first term of the above right-hand side, where in the first step we add and subtract the expectation over N(µ, Σ) as in (11a) and also add and subtract µ⊤. In the second step we cancel the matching terms, and in the third step we use the definition of ∆′_f(X, N(µ, Σ)) from (11b). Repeating the reformulation for the second term in (15), which is the transpose of the first, and substituting the definition of ∆_f from (11c) yields (14b).
We obtain a differential equation for µ and Σ by disregarding the error terms in (14). This results in

μ̇ = E_{x∼N(µ,Σ)}{f(x)}, (17a)
Σ̇ = E_{x∼N(µ,Σ)}{f(x)(x − µ)⊤} + (⋆). (17b)

In general, (17) is still intractable due to the expectation over a nonlinear function. However, for the piecewise constant and piecewise affine systems treated in this paper, we can compute this expectation analytically, as will be derived in the following two sections. For this purpose, the following lemma will be useful. It shows that we get a closed-form expression of the variance dynamics Σ̇ for free if we can find a closed-form expression of the mean dynamics μ̇.

Lemma 2. Let fµ(µ, Σ) denote the right-hand side of (17a). Then the right-hand side of (17b) is given by

fΣ(µ, Σ) = (∂fµ(µ, Σ)/∂µ) Σ + (⋆).
We conclude this section with a small lemma which we will use several times throughout the paper. Before stating the lemma, we define the probability density function of the univariate standard normal distribution as

φ(ξ) := (1/√(2π)) exp(−ξ²/2),

with associated cumulative distribution function (CDF) Φ. While there exists no closed-form expression for Φ, we can in practice treat it as such, since numerical implementations are readily available via the error function erf(·).
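In code, this treatment of Φ might look as follows (a standard identity, shown here for concreteness):

```python
import math

def Phi(x):
    # Standard normal CDF expressed via the error function, which is
    # available in most standard math libraries.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```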

Uncertainty propagation for scalar piecewise constant systems
In order to get an intuitive understanding of how a normal distribution behaves when encountering a discontinuity, we will first limit our analysis to a scalar state space x ∈ R with piecewise constant dynamics ẋ = f(x) of the form

f(x) = f̄1 if x < 0, f̄2 if x > 0, (23)

with f̄1, f̄2 ∈ R \ {0}. If f̄1 and f̄2 have the same sign, this leads to a crossing of the discontinuity, whereas f̄1 > 0 and f̄2 < 0 leads to the sliding mode. In the following we will take a detailed look at how normal distributions behave when propagated through such a system: first for an example with a crossing of the discontinuity, then for an example with a sliding mode. This is then followed by an approximate approach for propagating the first two moments.

Case study: the switched normal distribution
We start by revisiting Example 1.
Example 3 (Crossing the discontinuity (cont.)). Consider again the system from Example 1, i.e., dynamics of the form (23) with f̄1 = 3 and f̄2 = 1. This time, assume that the initial value x0 follows a normal distribution, x0 ∼ N(μ̄1, σ1²), with μ̄1 = −3, σ1 = 0.3. For ease of presentation, we first limit our attention to the interval x0 ∈ [μ̄1 − 3σ1, μ̄1 + 3σ1] =: X̄0, noting that it contains around 99.7% of the probability mass. Consistent with the behavior we saw in Example 1, after each point of the interval has passed through the discontinuity, e.g., at time t = T, the distribution has been scaled by the factor f̄2/f̄1. Since μ̄1 is the median of the initial distribution, with 50% of probability mass on each side of μ̄1, it follows that, as the system evolves, x(t; μ̄1) will always be the median of the evolved distribution. However, since the points above x(t; μ̄1) are squeezed together earlier than those below, the mean will be below the median while the distribution is crossing the discontinuity, cf. Fig. 3 (left).

We observe that there are two virtual normal distributions involved, visualized in Fig. 3 (left). The first one is associated with points below the switch and given by N(µ1(t), σ1²), with the evolution of its mean defined by µ1(0) = μ̄1 and μ̇1 = f̄1, such that µ1(t) = μ̄1 + f̄1 t. The second one is associated with points above the switch and given by N(µ2(t), σ2²), with σ2 = (f̄2/f̄1)σ1 and μ̇2 = f̄2. Further, there is a time point ts such that µ1(ts) = µ2(ts) = 0. In the considered example, this is ts = 1. From these conditions we can compute μ̄2 = (f̄2/f̄1)μ̄1, such that µ2(t) = μ̄2 + f̄2 t.

We now revisit the assumption that x0 ∼ N(μ̄1, σ1²). While this is certainly a valid assumption, it is slightly inconsistent: It makes a statement about all points, including those above the switching surface. But we have seen that as soon as an interval of points passes the discontinuity, their distribution is scaled such that it will be closely related to N(µ2(t), σ2²). With this in mind, it seems more natural to assume that the distribution for all points above the switch has already been transformed. In consequence, we can describe the distribution at time t as

x(t) ∼ Ns(µ1(t), µ2(t), σ1, σ2), (24)

where µ1(t), µ2(t), σ1, σ2 are as defined above, and which is defined via its PDF

φs(x; µ1, µ2, σ1, σ2) := (1/σ1) φ((x − µ1)/σ1) if x < 0, (1/σ2) φ((x − µ2)/σ2) if x ≥ 0,

with associated CDF Φs. We refer to this distribution as a switched normal distribution, since depending on the sign of x, this distribution switches between N(µ1(t), σ1²) and N(µ2(t), σ2²). We note that the above definition of φs does not result in a probability distribution for arbitrary parameter values, since its integral over R is not necessarily given by 1. A visualization of this distribution can be found in Fig. 3 (right).
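The normalization caveat can be checked numerically: one can verify that the integral of φs over R equals Φ(−µ1/σ1) + 1 − Φ(−µ2/σ2), which is 1 exactly when µ1/σ1 = µ2/σ2; this holds by construction in the example, since σ2 = (f̄2/f̄1)σ1 and μ̄2 = (f̄2/f̄1)μ̄1. A quadrature sketch (helper names ours):

```python
import math

def phi(x, mu, sigma):
    # Univariate normal PDF.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def phi_s(x, mu1, mu2, s1, s2):
    # Switched normal density: N(mu1, s1^2) below the switch at x = 0,
    # N(mu2, s2^2) above it.
    return phi(x, mu1, s1) if x < 0 else phi(x, mu2, s2)

def total_mass(mu1, mu2, s1, s2, lo=-50.0, hi=50.0, n=200000):
    # Crude midpoint quadrature of phi_s over a wide interval.
    h = (hi - lo) / n
    return sum(phi_s(lo + (i + 0.5) * h, mu1, mu2, s1, s2) for i in range(n)) * h

# With the example's parameters (mu2 = (1/3)*mu1, s2 = (1/3)*s1) the mass
# is one; with a mismatched mu2 the two halves no longer add up to one.
```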
We now consider how a normal distribution behaves for the sliding mode, revisiting Example 2.

Approximate uncertainty dynamics via normalization
Even though we can derive analytical results for the specific case considered in the previous subsection, we can see that already in such a simple scenario this becomes rather involved. If we want to consider more general situations, the above results are of limited use. A large part of the complexity came from the fact that the state was not distributed normally. In the following we will see what happens if at each time we approximate the true distribution by a normal distribution. This will allow us to derive an explicit ODE for mean and variance.

Proposition 1. Consider the IVP with state x ∈ R, x(0) = x0, ẋ = f(x), t ∈ T, with piecewise constant f of the form (23), and with x0 ∼ N(µ, v), where v = σ². Then for the corresponding IVP in µ and v as defined by (17), such that N(µ(t), v(t)) is an approximation of the exact distribution at time t, we can state the right-hand side explicitly as

fµ(µ, v) = f̄1 (1 − Φ(µ/σ)) + f̄2 Φ(µ/σ), (29a)
fv(µ, v) = 2 (f̄2 − f̄1) σ φ(µ/σ). (29b)

Proof. For the mean dynamics we have from (6a) that

μ̇ = E_{x∼N(µ,v)}{f(x)} = f̄1 Φ(−µ/σ) + f̄2 Φ(µ/σ),

where in the last step we used Lemma 3. The variance dynamics follow from Lemma 2. Adapting the multivariate notation of Lemma 2 to the currently considered univariate case, fv(µ, v) = 2 (∂fµ(µ, v)/∂µ) v, with fµ(µ, v) given by the right-hand side of (29a). Additionally noting that v = σ², differentiating (29a) with respect to µ yields (29b).

As it turns out, the mean dynamics (29a) are exactly a form of smoothing of the form (2), with smoothing function α_σ(ξ) = Φ(ξ/σ). This is visualized in Fig. 5. As long as uncertainty is sufficiently large, σ ≫ 0, the discontinuity is strongly smoothed, such that most standard methods of numerical integration are well suited. However, if σ decreases, the approximate dynamics become increasingly stiff and nonlinear, such that more care needs to be taken. As we have seen, this is especially relevant in the context of sliding modes, since they inherently lead to a decay of uncertainty. A practical remedy in this context could be the consideration of process noise, since this would in effect provide a lower bound on the state covariance. This is however beyond the scope of this paper. Having pointed out these limitations, we will now try to get an understanding of the error resulting from the proposed approximation. For this purpose, we perform the following numerical experiments.
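The scalar moment dynamics can be integrated with a plain explicit Euler scheme, illustrating that no switch detection is needed. The sketch below (names ours) assumes the mean dynamics take the smoothed form μ̇ = f̄1(1 − Φ(µ/σ)) + f̄2 Φ(µ/σ), i.e., the smoothing interpretation with α_σ(ξ) = Φ(ξ/σ), and the variance dynamics v̇ = 2(f̄2 − f̄1)σφ(µ/σ) obtained from Lemma 2:

```python
import math

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    # Standard normal PDF.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def propagate_moments(mu, v, T, f1=3.0, f2=1.0, h=1e-3):
    # Explicit Euler integration of the approximate moment dynamics for
    # the scalar piecewise constant system with switch at x = 0.
    for _ in range(round(T / h)):
        s = math.sqrt(v)
        mu_dot = f1 * (1.0 - Phi(mu / s)) + f2 * Phi(mu / s)  # smoothed mean dynamics
        v_dot = 2.0 * (f2 - f1) * s * phi(mu / s)             # variance dynamics
        mu, v = mu + h * mu_dot, max(v + h * v_dot, 1e-12)
    return mu, v

mu_T, v_T = propagate_moments(-3.0, 0.09, 2.0)
# mu_T ends close to the nominal value 1.0; sqrt(v_T) ends close to
# (f2/f1) * 0.3 = 0.1, up to the approximation error discussed next.
```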
Example 5. We apply dynamics (29) to the system from Example 3 with initial distribution x0 ∼ N(μ̄1, σ1²) and compare the resulting evolution of µ(t) and σ²(t) to mean and variance of the exactly propagated switched normal distribution (24). While the initial distribution is not exactly identical to a switched normal distribution, the mean is at a distance of more than 6 standard deviations from the switching surface, such that the initial probability mass on the other side of the switching surface is less than 10⁻⁹, and the difference between the two distributions is negligible. The resulting errors over time are shown in Fig. 6. Some error is accumulated while the distribution is crossing the switch, but the error is constant before and after. We now perform two experiments on the influence of parameters on the total error that is accumulated during the crossing of the switch. Strictly speaking, the switching process is never completed, nor does it have a well defined starting point, due to the unbounded support of the normal distributions. We use the error at time T as a proxy for the total error while ensuring that both initial and final mean have a distance of at least six standard deviations from the switching surface in their respective direction. Thus, the probability mass which has not yet switched at the end can be neglected. In the first experiment, we vary the initial standard deviation σ0, with fixed f̄1 = 3, f̄2 = 1, T = 2. The results are shown in Fig. 7 (left). In the second experiment we vary the pre-switching dynamics f̄1 while keeping the post-switching dynamics fixed at f̄2 = 1, with initial standard deviation σ0 = 0.3 and T = 4. For f̄1 = f̄2 we would expect no error, since in this case there is no discontinuity, and an error that continuously rises as |f̄1 − f̄2| moves away from zero. Thus, we vary f̄1 both in the interval [f̄2 + 10⁻¹², f̄2 + 10⁻¹] and [f̄2 − 10⁻¹², f̄2 − 10⁻¹]. The results are shown in Fig. 7 (right). In combination, these experiments suggest that the final integration error is of order O(σ0 |f̄1 − f̄2|). While we do not investigate this in detail here, we point out that this is consistent with the error resulting from smoothing the discontinuity in the nominal case, cf. [16].

Uncertainty propagation for piecewise affine systems
After having developed an understanding of how a normal distribution behaves when passing through the discontinuity of an ODE with piecewise constant right-hand side, we now consider the more general case of a piecewise affine system with affine switching function,

ẋ = f(x) := A1 x + f̄1 if g⊤(x − x̄) < 0, A2 x + f̄2 if g⊤(x − x̄) > 0. (31)

This corresponds to the specific case of (1) where f1(x) = A1 x + f̄1, f2(x) = A2 x + f̄2, and ψ(x) = g⊤(x − x̄). In this case the switching surface is a hyperplane {x ∈ R^n | g⊤(x − x̄) = 0} with normal vector g and x̄ an arbitrary point on the surface. While in the piecewise constant case a normal distribution is asymptotically recovered after crossing a discontinuity, this is not generally true in the piecewise affine case. However, similarly to the piecewise constant case, during the switching process the distribution will be strongly different from a normal distribution, but after crossing, the resulting distribution will resemble a slightly perturbed normal distribution.
For this case we will not try to find an exact parameterization of the distribution as it encounters the switching surface.Instead, we will directly derive approximate dynamics for mean and variance and validate them by sampling.Before stating these dynamics we define by φg (ξ; µ, Σ) the PDF of the univariate normal distribution obtained by projecting the n-variate distribution N (µ, Σ) onto the direction g ∈ R n , φg (ξ; µ, Σ) := 1 where ξ ∈ R is the remaining degree of freedom after projection, and which has mean µ g := g ⊤ µ, standard deviation σ g := g ⊤ Σg and associated CDF Φg (ξ; µ, Σ).Similarly, the projection of x is denoted by xg := g ⊤ x Proposition 2. Consider an IVP with state x ∈ R n , x(0) = x 0 , ẋ = f (x), t ∈ T, x 0 ∼ N (µ, Σ), and with f (x) piecewise affine with affine switching function as in (31).Then for the corresponding IVP in µ and Σ as defined by (17), such that N (µ(t), Σ(t)) is an approximation of the exact distribution at time t, we can state the right-hand side explicitly as Corollary 1.Consider the same situation as in Prop. 2 and additionally assume A 1 = A 2 =: A. Then fµ (µ, Σ) = Aµ + f1 Φg (x g ; µ, Σ) + f2 (1 − Φg (x g ; µ, Σ)), (35a) fΣ (µ, Σ) = ( f2 − f1 )g ⊤ Σ φg (x g ; µ, Σ) + AΣ + (⋆). (35b) Proof.Without loss of generality we state the proof under the assumption that the switching function gradient is given by g = (1, 0, . . 
., 0), such that the first dimension of the state space is orthogonal to the switching surface. If this assumption does not hold, one may substitute y = Rx, with invertible R ∈ R^{n×n}, such that in the transformed state space the assumption holds. We use Lemmas 1 and 2 to derive (34). The basic structure of the proof is to separate the expectation in (14a) into two steps: the expectation over the first dimension, in which the switch occurs, and the conditional expectation over the remaining dimensions, with respect to which the dynamics are affine. We partition the state space as x = (x1, x2) with x1 ∈ R, and partition µ, Σ and the system matrices correspondingly. We use σg² = g^⊤Σg instead of Σ11 to emphasize that in x1 we have a univariate (normal) distribution resulting from projecting N(µ, Σ) onto the direction g. From (14a) we obtain the mean dynamics, where we first use conditional expectation and then move the expectation over x2 into f(·), which is allowed since the switch depends only on x1, such that f(·) is affine in x2. For the inner expectation we use the affine structure of the modes. Defining the abbreviations ai, bi allows us to write the integrand compactly, where x̄1 ∈ R results from the partitioning of x̄ = (x̄1, x̄2). Continuing (37), the last step results from applying Lemma 3 to each row of the integrated function. Resubstitution of ai, bi then yields an expression which is identical to (34a). The covariance dynamics (34b) follow from Lemma 2.
As discussed in the previous section, this approximation leads to an error even in the piecewise constant case. However, it is principled in the following sense: (a) If A1 = A2 and f̄1 = f̄2, i.e., if effectively there is no discontinuity, the corresponding Lyapunov equation for a linear system is recovered [23]. (b) With increasing distance of the mean from the switching surface, as measured in terms of the projected standard deviation σg, φg(x̄g; µ, Σ) decays exponentially to 0 and Φg(x̄g; µ, Σ) to 0 or 1, depending on the side of the surface. Thus, the Lyapunov dynamics for the corresponding mode are recovered asymptotically.
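To make the structure of (35) concrete, the following sketch implements the smoothed right-hand side of Corollary 1 with NumPy/SciPy. The interpretation of (⋆) as the transpose of the preceding terms, and the convention that offset f̄1 is active where ψ(x) = g^⊤(x − x̄) < 0, are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def smoothed_moment_rhs(mu, Sigma, A, f1, f2, g, xbar):
    """Smoothed mean/covariance dynamics (35) for a piecewise affine system
    with shared matrix A: offset f1 active where psi(x) = g^T (x - xbar) < 0,
    offset f2 where psi(x) > 0."""
    mu_g = g @ mu
    sigma_g = np.sqrt(g @ Sigma @ g)                 # projected standard deviation
    xbar_g = g @ xbar                                # switching threshold after projection
    Phi = norm.cdf(xbar_g, loc=mu_g, scale=sigma_g)  # approx. P(psi(x) < 0)
    phi = norm.pdf(xbar_g, loc=mu_g, scale=sigma_g)
    f_mu = A @ mu + Phi * f1 + (1.0 - Phi) * f2      # (35a)
    M = np.outer(f2 - f1, g @ Sigma) * phi + A @ Sigma
    f_Sigma = M + M.T                                # (35b), with (*) as symmetric completion
    return f_mu, f_Sigma
```

Far from the switching surface this reduces to the Lyapunov dynamics of the active mode, consistent with property (b) above.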
Example 6. We consider a linear spring/dashpot contact-impact model [3]. The system has state x = (q, v) with position q ∈ R and velocity v ∈ R. The position q = 0 corresponds to the system being in contact with a wall but with uncompressed spring. For q < 0 the system is in contact and the spring is compressed such that v̇ = −kq − cv, with spring constant k and damping coefficient c. For q > 0 there is no contact and the spring is uncompressed. However, an external force of magnitude g is applied, pushing the mass towards the wall, such that in this case v̇ = −g. Overall, the dynamics can be written as a piecewise affine system with switching function ψ(x) = q. The initial state is distributed as x0 ∼ X0 = N(µ0, Σ0). We simulate the system based on the approximate dynamics (33) for mean and covariance, yielding a time-varying normal distribution N(µ(t), Σ(t)). As a proxy for the exact evolution of the distribution we sample 10^4 points from X0 and propagate them based on the nonsmooth dynamics, yielding the sample distribution X(t).

In the nonlinear case we work with the piecewise linearization of f at µ, which is identical to f if f is piecewise affine. After this piecewise linearization, the results of Proposition 2 can in principle be applied, but with some additional dependencies on µ.
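The sampling proxy for Example 6 can be sketched as follows. The parameter values (k, c, g_ext), the initial distribution, and the use of a fixed-step explicit Euler scheme are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, c, g_ext = 10.0, 0.5, 2.0       # spring constant, damping, external force (assumed)

def f_batch(X):
    """Nonsmooth spring/dashpot dynamics, vectorized over all samples."""
    q, v = X[:, 0], X[:, 1]
    contact = q < 0.0                          # spring compressed
    a = np.where(contact,
                 -k * q - c * v,               # in contact: spring/damper force
                 -g_ext)                       # free: constant push towards the wall
    return np.column_stack([v, a])

mu0, Sigma0 = np.array([1.0, 0.0]), np.diag([0.02, 0.02])
X = rng.multivariate_normal(mu0, Sigma0, size=10_000)  # samples from X0

h, T = 2e-3, 2.0                               # explicit Euler, fixed step
for _ in range(int(T / h)):
    X = X + h * f_batch(X)

mu_T, Sigma_T = X.mean(axis=0), np.cov(X.T)    # sample moments of X(T)
```

Note that fixed-step integration of the nonsmooth sample dynamics is itself only approximate near the switch; the sample moments mu_T, Sigma_T serve as the reference against which N(µ(t), Σ(t)) is compared.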
where σψ(µ, Σ) := √(∇ψ(µ)^⊤ Σ ∇ψ(µ)) is the standard deviation orthogonal to the switching surface with respect to the linearization of ψ at µ. Due to the additional dependencies of fµ(µ, Σ) on µ, the explicit expression for fΣ(µ, Σ) is more involved than (34b). However, if these equations are implemented with a symbolic framework such as CasADi [29], there is no need to derive the explicit expressions by hand. In the case that f1, f2, and ψ are affine, the expressions in (48) simplify to those of (34).
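As a sketch of how σψ(µ, Σ) enters the nonlinear case, the following computes the smoothed mean dynamics from callables for the two modes and the switching function. This covers only the mean part; the covariance part of (48) is omitted, and the convention that f1 is active for ψ < 0 is our assumption.

```python
import numpy as np
from scipy.stats import norm

def smoothed_mean_rhs(mu, Sigma, f1, f2, psi, grad_psi):
    """Smoothed mean dynamics for a piecewise smooth system, based on
    linearizing the switching function psi at mu (mean part only)."""
    gpsi = grad_psi(mu)
    sigma_psi = np.sqrt(gpsi @ Sigma @ gpsi)  # std. dev. orthogonal to the surface
    w1 = norm.cdf(-psi(mu) / sigma_psi)       # approx. P(psi(x) < 0)
    return w1 * f1(mu) + (1.0 - w1) * f2(mu)
```

For larger uncertainty Σ the blending between the two modes becomes wider, which is exactly the uncertainty-dependent smoothing that a linearization of the nonsmooth f at the mean alone cannot provide.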

Stochastic optimal control problem formulation
We now demonstrate how the derived dynamics can be used within a stochastic OCP formulation. After augmenting the dynamics by an argument for the control vector u(t) ∈ R^{nu} (which for the purpose of integration can be seen as a time-dependent parameter), a rather general OCP can be stated as

min over u(·), µ(·), Σ(·) of ∫_0^T l(µ(t), u(t)) dt + L(µ(T)) (49a)

Here we consider stage cost l and terminal cost L only for the state mean µ and controls u, but the formulation could straightforwardly be extended by a direct cost on the covariance Σ. The initial mean and covariance are fixed to μ̄0 and Σ̄0, respectively. For simplicity of notation we consider separate constraints on controls, g(u) ≤ 0, g : R^{nu} → R^{ng}, and states, h(x) ≤ 0, h : R^{nx} → R^{nh}, although the formulation can be straightforwardly generalized to combined constraints. Since x ∼ N(µ, Σ) has unbounded support, the state constraints cannot be strictly enforced for all possible values of x. Instead, they are formulated as chance constraints which need to hold only up to a prescribed probability level.

Since in the region with f1 the additional vector field is too strong given the control constraints, ψ(x) acts like an implicit constraint, and the region is mostly avoided. We compare this to the standard linearization-based approach (10) applied to the smoothed dynamics (2) with smoothing parameter σ = 5 · 10^{−2}. Since there the covariance propagation is based on linearization at the mean, the resulting dynamics only "see" the switch if the mean is in its vicinity, independent of the level of uncertainty. Thus, in the resulting optimal trajectory, cf. Fig. 11 (right), the mean keeps almost no distance from the switching surface, with a significant amount of probability mass overlapping with the second region. In consequence, when simulating the sample distribution, a large fraction of the samples gets trapped in this region and does not arrive at the target state.
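A common way to implement such chance constraints under the normal approximation x ∼ N(µ, Σ) is linearization-based constraint tightening. The following sketch is our illustration (the exact constraint formulation in (49) is not reproduced here) and computes the back-off for a single scalar constraint h_j(x) ≤ 0:

```python
import numpy as np
from scipy.stats import norm

def tightened_constraint(h_mu, grad_h, Sigma, eps=0.05):
    """Deterministic surrogate for P(h(x) <= 0) >= 1 - eps under
    x ~ N(mu, Sigma): require h(mu) + gamma * sqrt(grad_h^T Sigma grad_h) <= 0,
    where h_mu = h(mu) and grad_h is the gradient of h at mu."""
    gamma = norm.ppf(1.0 - eps)                      # standard-normal quantile
    backoff = gamma * np.sqrt(grad_h @ Sigma @ grad_h)
    return h_mu + backoff                            # feasible iff value <= 0
```

Since the back-off grows with the projected standard deviation, trajectories computed with the proposed covariance propagation automatically keep a larger distance from constraints in directions of high uncertainty.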

Conclusions
We derived a method for the approximate propagation of mean and covariance through an ODE with discontinuous right-hand side and demonstrated how it can be straightforwardly used in a stochastic OCP formulation. However, the paper was mostly focused on the derivation of the approximations. A formal analysis of the resulting errors would provide a more rigorous theoretical backing of the method. Further, we only treated the case of two different modes of the right-hand side. If similar results were derived for the case of an arbitrary number of modes, the method could be applied in a wider range of situations.

Figure 2 :
Figure 2: Sliding mode. Left: The state trajectories from Example 2. Once a trajectory reaches x = 0 it stays there, which leads to a narrowing of the distances over time. Right: The corresponding integrator map x(T; x0) as a function of x0, which is piecewise affine. The flat region in the center corresponds to the values of x0 such that x(T; x0) reaches the switching surface. The blue lines visualize the corresponding mapping of the initial states from the left-hand plot.

Figure 3 :
Figure 3: Left: The means of the two imagined normal distributions N(µi(t), σi²), i = 1, 2, compared to the mean of the exactly propagated distribution. The shaded regions indicate 3σ on each side of the mean. The dotted lines indicate the 99.7% probability mass corresponding to the original ±3σ region. Right: The cumulative distribution function Φs(x; µ1(t), µ2(t), σ1, σ2) of the switched normal distribution for various time points as the distribution crosses the switch at x = 0. The arrows indicate the state dynamics.

Figure 5 :
Figure 5: The approximated mean and variance dynamics (29) for the system from Example 3.

Figure 6: Figure 7:
Figure 6: Left: error between exact and approximated mean over time. Right: error between exact and approximated variance over time.

Figure 8 :
Figure 8: Trajectories for the switched system from Example 6 in state space with histograms of the sampling distribution X in comparison with the approximating normal distribution shown for selected times.

Figure 11 :
Figure 11: Optimal solutions when using either the approximate dynamics (33) (left) or the standard linearization-based approach (10) (right), indicated by blue dots for the mean and a shaded 99% confidence region. The green lines are samples from the initial distribution simulated according to the respective optimal control trajectory (open loop).
Figure 4: Left: The means of the two imagined normal distributions N(µi(t), σi²), i = 1, 2, compared to the mean of the exactly propagated distribution. The shaded regions indicate 3σ on each side of the mean. The dotted lines indicate the 99.7% probability mass corresponding to the original ±3σ region. Right: The cumulative distribution function Φ′s(x; µ1(t), µ2(t), σ1, σ2) of the modified switched normal distribution for various time points as the distribution crosses the switch at x = 0. On both sides of x = 0, the probability mass is transported towards the origin, where it accumulates. The arrows indicate the state dynamics.