Model Predictive Control with Variational Autoencoders for Signal Temporal Logic Specifications

This paper presents a control strategy synthesis method for dynamical systems with differential constraints, emphasizing the prioritization of specific rules. Special attention is given to scenarios where not all rules can be simultaneously satisfied to complete a given task, necessitating decisions on the extent to which each rule is satisfied, including which rules must be upheld or disregarded. We propose a learning-based Model Predictive Control (MPC) method designed to address these challenges. Our approach integrates a learning method with a traditional control scheme, enabling the controller to emulate human expert behavior. Rules are represented as Signal Temporal Logic (STL) formulas. A robustness margin, quantifying the degree of rule satisfaction, is learned from expert demonstrations using a Conditional Variational Autoencoder (CVAE). This learned margin is then applied in the MPC process to guide the prioritization or exclusion of rules. In a track driving simulation, our method demonstrates the ability to generate behavior resembling that of human experts and effectively manage rule-based dilemmas.


Introduction
Robotics is increasingly permeating diverse sectors, spanning both civilian and industrial applications, and is becoming integral to everyday life. Service robots are now prevalent in public spaces, interacting with individuals and delivering services. Within the field of robotics, autonomous driving emerges as a particularly dynamic area, garnering extensive research attention.
In robotics, adherence to rules varies from basic collision avoidance in navigation scenarios to compliance with complex traffic regulations in autonomous driving. These rules, established primarily for safety, must generally be upheld by robots while executing their tasks. However, it is essential to acknowledge that not all rules carry equal importance. Depending on the context, some rules may need to be prioritized over others or even disregarded. For example, in autonomous driving, scenarios may necessitate breaching certain rules, such as lane changes in dense traffic, decisions at yellow traffic lights, or crossing double yellow lines to avoid obstacles. These situations compel robots to make intricate decisions regarding rule compliance, presenting significant challenges in determining appropriate control inputs.
Model Predictive Control (MPC) stands out as a robust approach for autonomous control, recognized for its capabilities in online trajectory optimization [1]. The core principle of MPC involves identifying optimal control inputs to minimize a predefined cost function, considering both inputs and anticipated future outputs. This method integrates an objective function characterizing the desired robot behavior and constraints mitigating undesirable actions. The efficacy of MPC is well documented across diverse applications, such as the full-body control of humanoid robots [2][3][4].
Designing effective MPC controllers remains a significant challenge. Experienced operators can adeptly manage robots, yet encoding such expertise into MPC parameters is complex. For instance, expert drivers in autonomous driving must make continuous, complex decisions, such as whether to decelerate or change lanes in response to slow-moving vehicles. However, finding the appropriate MPC parameters to handle such varied scenarios is complex and computationally intensive.
Recently, imitation learning has emerged as a promising solution for robotic learning challenges [5,6]. This approach derives near-optimal control strategies directly from human expert demonstrations, eliminating the need for manual policy or cost function design. Imitation learning excels in capturing complex policy functions that balance multiple considerations [7], learning the importance of various factors from expert behaviors to enable robots to replicate human actions. However, despite its advantages, imitation learning does not inherently ensure performance reliability. In scenarios where safety rules like collision avoidance are crucial, imitation learning may not consistently yield control actions that comply with essential safety norms, underscoring the paramount importance of rule adherence for robot safety and human protection.
In this paper, we address a control synthesis problem within a framework of prioritized rules, building on our previous research [8]. We assumed inherent rule priorities and aimed to design a controller that accounts for these priorities to manage dilemmas effectively. Our methodology is grounded in the MPC framework, which, unlike purely deep learning-based approaches, integrates each rule as a constraint, thereby enhancing performance reliability.
We represent these rules using Signal Temporal Logic (STL) [9,10], a formalism allowing the precise specification of desired system behaviors, commonly applied in robotic task specifications [11][12][13][14][15]. STL is particularly suited for describing properties of real-valued signals in dense time scenarios, making it ideal for real-world robotic applications.
Instead of explicitly determining rule priorities, we adopted a learning approach to identify minimal acceptable levels of rule satisfaction, informed by expert demonstrations. This approach diverges from our earlier work [8] by employing a Conditional Variational Autoencoder (CVAE) [16]. This technique helps discern essential rules and decide on adherence levels, facilitating selective compliance rather than strict obedience to all rules. The use of a CVAE is justified by its efficiency in handling uncertainties within data, providing a more effective solution compared to the Gaussian process regression method used in previous work [8].
Our hybrid approach combines deep learning with traditional MPC, guiding robots to emulate expert human behaviors in complex decision-making scenarios.

Related Work
Extensive research has explored trajectory optimization and Model Predictive Control (MPC) within the framework of temporal logic specifications, particularly Linear Temporal Logic (LTL). Mixed-integer linear programming (MILP) has been employed to generate trajectories for continuous systems subject to finite-horizon LTL specifications [17,18]. Wolff et al. [19] extended this approach by encoding general LTL formulas into MILP constraints, accommodating infinite runs with periodic structures. Additionally, Cho et al. [20] investigated optimal path planning under syntactically co-safe LTL specifications, utilizing a sampling-tree and two-layered structure.
Recent advancements have integrated Signal Temporal Logic (STL) within MPC frameworks. Raman et al. [21] structured MPC to facilitate control synthesis from STL specifications using MILP, allowing for the calculation of open-loop control signals that adhere to both finite and infinite horizon STL properties while maximizing robust satisfaction. Sadigh et al. [22] introduced a novel STL variant incorporating probabilistic predicates to address uncertainties in predictive models, thereby enhancing safety assessments under uncertainty. Mao et al. [23] proposed a solution to handle complex temporal requirements formalized in STL specifications within the Successive Convexification algorithmic framework. This approach retains the expressiveness of encoding mission requirements with STL semantics while avoiding combinatorial optimization techniques such as MILP.
The integration of MPC with machine learning techniques has been pursued to address system identification challenges within MPC contexts [24][25][26]. Lenz et al. [24] applied deep learning within MPC to derive task-specific controls for complex activities such as robotic food cutting. Carron et al. [25] presented a model-based control approach that utilizes data gathered during operation to improve the model of a robotic arm and thereby enhance the tracking performance. Their scheme is based on inverse dynamic feedback linearization and a data-driven error model, integrated into an MPC formulation. Lin et al. [26] compared deep reinforcement learning (DRL) and MPC for Adaptive Cruise Control (ACC) design in car-following scenarios.
Efforts have also been made to address the types of dilemmas introduced in our work. Tumova et al. [27] and Castro et al. [28] examined scenarios where not all LTL rules can be satisfied in path planning, seeking paths that minimally violate these rules. However, their approaches require predetermined weights among rules, contrasting with our method that learns directly from expert demonstrations. Urban driving dilemmas were specifically addressed by Lee et al. [29], who applied inverse reinforcement learning to capture expert driving strategies.
Imitation learning is emerging as a promising approach to robotic learning problems and has been widely applied to autonomous driving. Policies for autonomous vehicles have been learned from image or video datasets through Convolutional Neural Networks (CNNs) [30,31]. Schmerling et al. [32] utilized a Conditional Variational Autoencoder (CVAE) framework to reason about interactions between vehicles in traffic-weaving scenarios, producing multimodal outputs. Additionally, some studies have applied learning approaches to MPC, where certain parameters of MPC are learned from data [8,33]. Reinforcement learning has also been considered for autonomous driving, using CNNs to encode visual information [34].

System Model
We consider a continuous-time dynamical system described by the following differential equation:

ẋ_t = f(x_t, u_t), (1)

where x_t ∈ X ⊂ R^{n_x} represents the state vector, u_t ∈ U ⊂ R^{n_u} denotes the control input, and f is a smooth (continuously differentiable) function with respect to its arguments. By employing a predefined time step dt, the continuous system in Equation (1) can be discretized as follows:

x_{n+1} = f_d(x_n, u_n), (2)

where f_d denotes the discretized dynamics, n represents the discrete time step, defined as n = ⌊t/dt⌋, and x_0 denotes the initial state. For a fixed horizon H, let x(x_n, u_{H,n}) denote a trajectory generated from the state x_n with the control inputs u_{H,n} = {u_n, . . ., u_{n+H−1}}.
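For concreteness, the discretization step can be sketched with a simple forward-Euler update. The dynamics function below (a double integrator) is a stand-in for illustration, not the vehicle model used later in the paper:

```python
import numpy as np

def euler_discretize(f, dt):
    """Turn continuous dynamics xdot = f(x, u) into a discrete update
    x_{n+1} = x_n + f(x_n, u_n) * dt."""
    def step(x_n, u_n):
        return x_n + f(x_n, u_n) * dt
    return step

# Stand-in dynamics: a double integrator with state [position, velocity].
f = lambda x, u: np.array([x[1], u[0]])
step = euler_discretize(f, dt=0.1)

x0 = np.array([0.0, 1.0])        # position 0, velocity 1
x1 = step(x0, np.array([0.0]))   # one discrete step: position advances by v*dt
```

Any one-step integrator (e.g., a Runge-Kutta step) can be substituted for the Euler update without changing the rest of the pipeline.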
A signal is defined as a sequence of states and control inputs:

ξ = ξ_0 ξ_1 . . ., where ξ_n = (x_n, u_n). (3)

In addition to the definition provided in Equation (3), we use the notation ξ(n) to represent a signal starting from the discrete time step n, with a slight abuse of notation.

Signal Temporal Logic
Signal Temporal Logic (STL) is a formalism used to specify properties of real-valued, dense-time signals, and is extensively applied in the analysis of continuous and hybrid systems [9,10]. A predicate within an STL formula is defined as an inequality of the form µ(ξ(t)) > 0, where µ is a function of the signal ξ at time t. The truth value of the predicate µ is determined by the condition µ(ξ(t)) > 0.
An STL formula is composed of boolean and temporal operations on these predicates. The syntax of STL formulae φ is defined recursively as follows:

φ ::= µ | ¬φ | φ ∧ ψ | G_[a,b] φ | φ U_[a,b] ψ,

where φ and ψ are STL formulas, G denotes the globally operator, and U represents the until operator.
The validity of an STL formula φ with respect to a signal ξ at time t is defined inductively as follows:

(ξ, t) ⊨ µ ⟺ µ(ξ(t)) > 0,
(ξ, t) ⊨ ¬φ ⟺ ¬((ξ, t) ⊨ φ),
(ξ, t) ⊨ φ ∧ ψ ⟺ (ξ, t) ⊨ φ and (ξ, t) ⊨ ψ,
(ξ, t) ⊨ G_[a,b] φ ⟺ (ξ, t′) ⊨ φ for all t′ ∈ [t + a, t + b],
(ξ, t) ⊨ φ U_[a,b] ψ ⟺ ∃t′ ∈ [t + a, t + b] such that (ξ, t′) ⊨ ψ and (ξ, t′′) ⊨ φ for all t′′ ∈ [t, t′].

The notation (ξ, t) ⊨ φ indicates that the signal ξ satisfies the STL formula φ at time t. For example, (ξ, t) ⊨ G_[a,b] φ implies that φ holds for the signal ξ throughout the interval from t + a to t + b. In discrete-time systems, STL formulas are evaluated over discrete time intervals.
One significant advantage of Signal Temporal Logic (STL) is its associated metric, known as the robustness degree, which quantifies how well a given signal ξ satisfies an STL formula φ. The robustness degree is defined as a real-valued function of the signal ξ and time t, calculated recursively using the following quantitative semantics:

ρ^µ(ξ, t) = µ(ξ(t)),
ρ^{¬φ}(ξ, t) = −ρ^φ(ξ, t),
ρ^{φ∧ψ}(ξ, t) = min(ρ^φ(ξ, t), ρ^ψ(ξ, t)),
ρ^{G_[a,b] φ}(ξ, t) = min_{t′ ∈ [t+a, t+b]} ρ^φ(ξ, t′),
ρ^{φ U_[a,b] ψ}(ξ, t) = max_{t′ ∈ [t+a, t+b]} min(ρ^ψ(ξ, t′), min_{t′′ ∈ [t, t′]} ρ^φ(ξ, t′′)).

Following our previous study [8], we introduce the notation (ξ, t) ⊨ (φ, r) to indicate that the signal ξ satisfies the STL formula φ at time t with a robustness slackness r, defined as

(ξ, t) ⊨ (φ, r) ⟺ ρ^φ(ξ, t) ≥ r. (18)

Equation (18) asserts that the signal ξ satisfies φ with at least the minimum robustness degree r. The robustness slackness r serves as a margin for the satisfaction of the STL formula φ. As r increases, the constraints on the signal ξ to satisfy φ at time t become more stringent, while smaller values of r imply more relaxed constraints. Notably, when r < 0, it allows for the violation of φ.
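These quantitative semantics are straightforward to evaluate on a discrete signal. The sketch below computes the robustness of a predicate and of a globally formula over a finite window, then checks satisfaction with a robustness slackness r; the lane-bound predicate and the numbers are illustrative:

```python
def rho_pred(mu, xi, t):
    """Robustness of a predicate mu(xi(t)) > 0 at time t."""
    return mu(xi[t])

def rho_globally(mu, xi, t, a, b):
    """Robustness of G_[a,b] (mu > 0): the minimum over the window [t+a, t+b]."""
    return min(mu(xi[k]) for k in range(t + a, t + b + 1))

# Lane-keeping style predicate: stay below the boundary y_upper = 4.0.
mu = lambda y: 4.0 - y
xi = [1.0, 2.0, 3.5, 3.0]               # vertical positions over discrete time

rho = rho_globally(mu, xi, 0, 0, 3)     # worst-case margin over the horizon
r = 1.0                                  # required robustness slackness
satisfied_with_slack = rho >= r          # does (xi, 0) |= (phi, r) hold?
```

With r = 1.0 the formula is not satisfied here (the margin dips to 0.5), whereas a negative r would tolerate an outright violation, which is exactly the flexibility the robustness slackness provides.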

Problem Formulation
This study aimed to solve a control synthesis problem using Signal Temporal Logic (STL) formulas [8]. Let φ = [φ_1, . . ., φ_N] represent a set of STL formulas, with their conjunction denoted as φ = φ_1 ∧ . . . ∧ φ_N. We define a cost function J over the state and control spaces, where J(x, u) measures the cost associated with a trajectory x and control sequence u. The control synthesis problem under STL for Model Predictive Control (MPC) is formulated as follows.
Problem 1. Given a system model as described in (2) and an initial state x_0, with a planning horizon of length H, determine the control input sequence u_{H,t} at each time step t that minimizes the cost function J(x(x_t, u_{H,t}), u_{H,t}) while ensuring that the conjunction of STL formulas φ is satisfied:

minimize_{u_{H,t}} J(x(x_t, u_{H,t}), u_{H,t})
subject to (ξ(x_t, u_{H,t}), t) ⊨ φ. (19)

While this strict formulation ensures compliance with the STL formulas, our primary objective is to develop a control sequence that incorporates flexibility in rule compliance. To this end, we introduce robustness slackness values, denoted by r = [r_1, . . ., r_N], which quantify the degree to which each STL formula is satisfied. By incorporating these robustness values, the MPC problem can be reformulated as follows [8].
Problem 2. Given the system model specified in (2), an initial state x_0, and a horizon length H, compute the control input sequence u_{H,t} at each time step t by solving the following optimization problem:

minimize_{u_{H,t}} J(x(x_t, u_{H,t}), u_{H,t})
subject to (ξ(x_t, u_{H,t}), t) ⊨ (φ_j, r_j), j = 1, . . ., N. (20)

This enhanced formulation allows for a more flexible management of STL constraints, effectively addressing scenarios where it is not feasible to fully satisfy all STL formulas. The robustness slackness values are derived from expert demonstrations, based on the assumption that these experts have accurately assessed the priority and required compliance level of each rule. This learning is achieved through a deep learning approach.

Proposed Method
The proposed framework, illustrated in Figure 1, synergizes learning techniques with STL constraints to refine MPC, enabling it to more accurately mimic human expert behavior. By leveraging expert demonstrations, we learn robustness slackness values, which define the margins of rule compliance. A Conditional Variational Autoencoder (CVAE) [16] is utilized to estimate these robustness slackness values in novel scenarios.
By incorporating the robustness slackness values obtained through the learning process, the MPC method, designed under STL constraints, generates control sequences that respect the specified rules with a certain level of flexibility. To manage the nonlinear differential constraints characteristic of dynamical systems, we employ linearized models. Although this approach may introduce some approximation errors, it remains effective for practical applications.
Figure 1 presents an overview of the proposed learning-based MPC framework. Expert demonstrations are used to learn the lower bounds of robustness, referred to as robustness slackness, through a deep learning approach. These learned values inform the MPC method, which then calculates control sequences that take into account the STL rules.

Feature Description
We introduce a feature function, denoted as ϕ, which transforms a signal into a feature vector, mapping from the combined state and control spaces into the feature space: ϕ : R^{n_x + n_u} → R^{n_f}. As illustrated in Figure 2, the control of the ego vehicle, V_ego, is influenced by six nearby vehicles located in adjacent lanes. These vehicles are collectively referred to as nearby vehicles.
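One possible layout of such a feature function is sketched below; the slot names, the choice of relative position and velocity as features, and the defaults used for absent vehicles are all assumptions for illustration:

```python
import numpy as np

def feature_vector(ego, nearby):
    """Hypothetical feature map phi: relative position and velocity of up to six
    nearby vehicles (front/rear slots in the left, center, right lanes) with
    respect to the ego vehicle. Each vehicle is a dict with 'x', 'y', 'v'.
    Missing slots fall back to a far-away vehicle at matched speed."""
    feats = []
    for slot in ["lf", "cf", "rf", "lr", "cr", "rr"]:
        other = nearby.get(slot)
        if other is None:
            feats += [100.0, 0.0, 0.0]       # far ahead/behind, no speed gap
        else:
            feats += [other["x"] - ego["x"],
                      other["y"] - ego["y"],
                      other["v"] - ego["v"]]
    return np.array(feats)

ego = {"x": 0.0, "y": 0.0, "v": 20.0}
nearby = {"cf": {"x": 30.0, "y": 0.0, "v": 15.0}}   # one slow front vehicle
phi = feature_vector(ego, nearby)                    # n_f = 18 here
```

A fixed slot ordering keeps the feature dimension constant regardless of how many vehicles are actually present, which is what a fixed-size network input requires.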

Learning Robustness Slackness from Demonstration
We consider a set of M demonstrated signals, denoted by Ξ = {ξ^i}_{i=1}^{M}, where each signal ξ^i_n = (x^i_n, u^i_n) comprises the state x^i_n and control input u^i_n at time step n. The robustness degree r_{i,j} is defined as the minimum value observed from the current time step n to the future time step n + H − 1 for the demonstration ξ^i:

r_{i,j} = min_{k ∈ {n, . . ., n+H−1}} ρ^{φ_j}(ξ^i, k),

where H denotes the control horizon length. The robustness degree r_{i,j} serves as the robustness slackness for the signal over the horizon length H, starting from ξ^i_n, indicating the minimum permissible lower bound of robustness within this timeframe. Our CVAE model comprises the following three parameterized functions:

• The recognition model q_ν(Z|ϕ) approximates the distribution of the latent variable Z based on the input features. This is modeled as a Gaussian distribution, N(µ_ν(ϕ), Σ_ν(ϕ)), where µ_ν and Σ_ν represent the mean and covariance determined by the network.
• The prior model p_θ(Z|ϕ) assumes a standard Gaussian distribution, N(0, I), simplifying the structure of the latent space.
• The generation model p_θ(r|Z, ϕ) calculates the likelihood of the robustness slackness based on the latent variable Z and the input feature ϕ.

Both the recognition model q_ν(Z|ϕ) and the generation model p_θ(r|Z, ϕ) are implemented as multi-layer perceptrons.
The training of our CVAE is guided by the Evidence Lower Bound (ELBO) loss function, initially formulated as

ELBO = E_{q_ν(Z|ϕ)}[log p_θ(r|Z, ϕ)] − D_KL(q_ν(Z|ϕ) ∥ p_θ(Z|ϕ)).

To better accommodate the specific requirements of our application, we adapted the ELBO function and define the loss function as follows:

L(ν, θ) = Σ_i E_{q_ν(Z|ϕ)}[−log p_θ(r_i|Z, ϕ)] + λ D_KL(q_ν(Z|ϕ) ∥ p_θ(Z|ϕ)),

where r_i represents an element of the robustness slackness r, and λ is a scaling factor used to balance the terms. The Kullback-Leibler divergence (D_KL) measures the divergence between two probability distributions. We set λ = 1 and optimize the parameters ν and θ by minimizing this loss function.
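A minimal PyTorch sketch of such a CVAE is given below. The layer widths, latent dimension, and the squared-error reconstruction term (a Gaussian likelihood with fixed variance) are assumptions; the structure — a Gaussian recognition model, a standard-normal prior, and an MLP generation model conditioned on (Z, ϕ) — follows the description above:

```python
import torch
import torch.nn as nn

class SlacknessCVAE(nn.Module):
    """Sketch of the CVAE: feature phi -> robustness slackness r (sizes assumed)."""
    def __init__(self, n_f=12, n_z=4, n_r=5, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_f, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_z)          # mean of q(Z|phi)
        self.logvar = nn.Linear(hidden, n_z)      # log-variance of q(Z|phi)
        self.dec = nn.Sequential(                  # generation model p(r|Z, phi)
            nn.Linear(n_z + n_f, hidden), nn.ReLU(), nn.Linear(hidden, n_r))

    def forward(self, phi):
        h = self.enc(phi)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.dec(torch.cat([z, phi], dim=-1)), mu, logvar

def loss_fn(r_hat, r, mu, logvar, lam=1.0):
    """Reconstruction error plus KL divergence to the N(0, I) prior."""
    recon = ((r_hat - r) ** 2).sum(dim=-1).mean()
    kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()
    return recon + lam * kl

model = SlacknessCVAE()
phi = torch.randn(8, 12)     # batch of feature vectors
r = torch.randn(8, 5)        # demonstrated slackness targets
r_hat, mu, logvar = model(phi)
loss = loss_fn(r_hat, r, mu, logvar)
```

Training then consists of the usual loop: sample feature/slackness pairs from the demonstrations, compute `loss_fn`, and step an optimizer over ν and θ jointly.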

Model Predictive Control Synthesis
Previous work, such as that by Raman et al. [21], has shown that MPC optimization with STL constraints can be formulated as a mixed-integer linear program (MILP). This method introduces two encoding strategies: one that focuses on satisfying STL formulas and another, termed 'robustness-based encoding', that considers the robustness degree of the STL formulas. In our problem formulation, we manage each STL formula according to its defined robustness slackness using the robustness-based encoding method.
Let C_{φ_j, r_j} denote the encoded constraints for the STL formula φ_j with robustness slackness r_j. The combined encoded constraints are formulated as follows:

C_{φ, r} = ⋀_{j=1}^{N} C_{φ_j, r_j}, with z_φ = z_{φ_1} ∧ . . . ∧ z_{φ_N},

where z_φ, z_{φ_j} ∈ {0, 1} are Boolean variables, with z_φ representing the satisfaction of all STL constraints and z_{φ_j} representing the satisfaction of an individual STL formula φ_j. Note that z_{φ_j} = 1 only if ρ^{φ_j} − r_j > 0; otherwise, z_{φ_j} = 0. The proposed algorithm is outlined in Algorithm 1. We extended our previous work [8] by incorporating a deep learning network approach. Inputs to the algorithm include a set of STL formulas φ_1, . . ., φ_N, the time of interest τ = [t_0, t_1], the discretization time step dt, a control horizon H, an initial signal state ξ_init, and the demonstrated signals Ξ.
Initially, feature vectors and robustness slackness values (the lowest robustness degree over the horizon H) are pre-computed from the demonstrations (line 1). The closed-loop algorithm, which determines the optimal strategy at each time step, runs over the time interval τ = [t_0, t_1]. The nonlinear dynamics are linearized with respect to the current signal state (line 4). The robustness slackness of the STL formula φ_j for the input feature ϕ(ξ_cur) is predicted using the trained CVAE network (line 6). The predicted robustness slackness for each STL formula is denoted as r_j. Based on the updated robustness slackness r_j, each STL formula φ_j is converted into mixed-integer programming constraints C_{φ_j, r_j} using the robustness-based encoding method (line 7), where C_{φ_j, r_j} consists of binary variables and linear predicates. Considering all STL constraints, dynamic constraints, and past trajectories, the optimal control sequence is computed over the time horizon H using a user-defined cost function (line 11). This procedure is repeated for the entire time interval τ.
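The robustness-based big-M encoding can be illustrated on a one-variable toy problem. The sketch below uses SciPy's MILP interface as a stand-in for Gurobi; the predicate ρ(x) = x − 5, the slackness r = 1, and the big-M constant are illustrative:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Enforce z = 1  =>  rho(x) - r >= 0, with rho(x) = x - 5 and r = 1.
# Big-M form: x - M*z >= 6 - M. With z fixed to 1 this reduces to x >= 6.
M = 100.0
c = np.array([1.0, 0.0])                  # minimize x; z carries no cost
con = LinearConstraint([[1.0, -M]], lb=[6.0 - M], ub=[np.inf])
bnds = Bounds([0.0, 1.0], [10.0, 1.0])    # z pinned to 1: the rule must hold
res = milp(c=c, constraints=con, integrality=np.array([0, 1]), bounds=bnds)
# res.x[0] is the cheapest x that satisfies the rule with slackness r = 1
```

Relaxing z back to {0, 1} and penalizing 1 − z in the objective instead of pinning it would let the solver trade off rule satisfaction against cost, which is how a negative predicted slackness effectively licenses a rule violation.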

Experimental Results
The proposed algorithm was implemented in a Python (version 3.10) environment, utilizing PyTorch (version 2.2.1) [35] for the deep learning components and Gurobi [36] as the optimization engine for MPC. Simulation experiments were conducted on a system equipped with an AMD R7-7700 processor and an RTX 4080 Super GPU. The Gurobi tool enabled solving the proposed MPC problem in approximately 0.11 seconds.
We conducted realistic simulations using the Next Generation Simulation (NGSIM) dataset [37] and the highD dataset [38], assuming that the drivers in these datasets possessed a certain level of expertise, making them suitable as "expert driver" demonstrations in our proposed approach. In the proposed method, obstacles were set as nearby vehicles. For generating training data, we utilized a combination of 70% data from the highD dataset and 30% from the NGSIM dataset. Data points from the NGSIM dataset that involved vehicles deviating from the track or causing collisions were excluded or modified. Additionally, data with normal speeds but no lane changes were partially removed to ensure a diverse set of training scenarios.
The CVAE network was trained with the following hyperparameters: a batch size of 64, a learning rate of 0.001, and 100 epochs.The future time horizon H was set to 16.

System Description
We modeled the dynamics of the vehicles on the track using a unicycle model. The state of the system at time t is described by x_t = [x_t, y_t, θ_t, v_t]^T, where x_t and y_t represent the vehicle's position, θ_t denotes the heading angle, and v_t indicates the linear velocity. The control inputs are u_t = [w_t, a_t]^T, with w_t as the angular velocity and a_t as the acceleration. The vehicle dynamics are expressed as follows:

ẋ_t = v_t cos θ_t, ẏ_t = v_t sin θ_t, θ̇_t = κ_1 w_t, v̇_t = κ_2 a_t,

where κ_1 and κ_2 are constants. To facilitate the optimization process, we linearize the dynamics around a reference point x̂ = [x̂, ŷ, θ̂, v̂]^T. The resulting linear system is derived as a first-order Taylor approximation of the nonlinear dynamics, given by

x_{n+1} = A_n x_n + B_n u_n + C_n,

where the matrices A_n, B_n, and C_n are defined by the Jacobians of the dynamics evaluated at the reference point.
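The linearization step can be sketched as follows, assuming unicycle dynamics of the form ẋ = v cos θ, ẏ = v sin θ, θ̇ = κ_1 w, v̇ = κ_2 a; the constants, time step, and reference point are illustrative:

```python
import numpy as np

K1, K2 = 1.0, 1.0   # kappa_1, kappa_2 (values assumed)
DT = 0.1            # discretization time step (assumed)

def f(x, u):
    """Assumed unicycle dynamics: state [x, y, theta, v], input [w, a]."""
    _, _, th, v = x
    w, a = u
    return np.array([v * np.cos(th), v * np.sin(th), K1 * w, K2 * a])

def linearize(x_ref, u_ref, dt=DT):
    """First-order Taylor model x_{n+1} ~= A x_n + B u_n + C around a reference."""
    _, _, th, v = x_ref
    dfdx = np.array([[0, 0, -v * np.sin(th), np.cos(th)],
                     [0, 0,  v * np.cos(th), np.sin(th)],
                     [0, 0, 0, 0],
                     [0, 0, 0, 0]])
    dfdu = np.array([[0.0, 0.0], [0.0, 0.0], [K1, 0.0], [0.0, K2]])
    A = np.eye(4) + dt * dfdx
    B = dt * dfdu
    C = dt * (f(x_ref, u_ref) - dfdx @ x_ref - dfdu @ u_ref)
    return A, B, C

x_ref, u_ref = np.array([0.0, 0.0, 0.2, 5.0]), np.array([0.0, 0.0])
A, B, C = linearize(x_ref, u_ref)
# Sanity check: the affine model reproduces an Euler step at the reference.
step_lin = A @ x_ref + B @ u_ref + C
step_euler = x_ref + DT * f(x_ref, u_ref)
```

By construction the affine model is exact at the reference point and degrades with distance from it, which is why the reference is reset to the current signal state at every MPC iteration.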
Collision avoidance (front vehicle): Slow down before the front vehicle: In these formulations, t_a and t_b are set to 6 and 12, respectively. Figure 5 illustrates the driving environment used to describe these STL rules. Note that in this figure, the ego vehicle is depicted in blue, the preceding vehicle in orange, and the other vehicles in gray. The positions x_t and y_t and the velocity v_t correspond to the ego vehicle. The boundaries of the preceding vehicle in the x-y coordinates are denoted by x_{c,min}, x_{c,max}, y_{c,min}, and y_{c,max}. Similarly, x_{o,min}, x_{o,max}, y_{o,min}, and y_{o,max} represent the boundaries of the other vehicles except the preceding one. The lane boundaries are denoted by y_{l,min} and y_{l,max}, while the track boundaries are represented by y_{t,min} and y_{t,max}. Here, v_th represents the speed limit threshold for rule φ_4. The final rule, φ_5, mandates that the ego vehicle decelerate when approaching a preceding vehicle in the same lane. The parameters v_u, t_a, and t_b are specific to rule φ_5.

Simulation Results
Figure 6 presents the predicted robustness slackness r generated by the proposed CVAE network, alongside the control sequence produced by the MPC based on these predicted values. In the left subfigures indicating robustness slackness, negative degrees of satisfaction are marked with a red box.
In Figure 6a, the predicted robustness slackness suggests that rules φ_2 and φ_5 may be violated. It can be observed that the control sequence generated by the MPC results in the vehicle moving to the left lane (violating φ_2) and accelerating in the presence of a preceding vehicle (violating φ_5).

Figure 7 demonstrates the application of the proposed method in the NGSIM road environment. The figure illustrates four different scenes, showing the predicted robustness slackness and the corresponding vehicle movements for each situation. For the lane-keeping rules φ_1 and φ_2, if the robustness slackness value is less than or equal to a specified threshold (indicated by 'threshold' in the figure), it is evident that the ego vehicle attempts to change lanes. Conversely, if the robustness slackness value for φ_1 and φ_2 is greater than the threshold value, the proposed method may not initiate a lane change, depending on the specific situation (as illustrated in scene 4). Overall, the proposed method demonstrates the ability to drive efficiently, allowing the violation of some rules in certain situations, while maintaining safety in complex traffic conditions.
Collision experiments using the proposed approach were conducted across five test scenarios: two from the NGSIM dataset and three from the highD dataset. We compared the proposed method against five methods: LBMPC_STL [8], LSTM, TFN [39], DQN [40], and DQN(1/2), which refers to the case where half of the episodes (500,000 episodes) are used in the reinforcement learning stage. The LSTM method employs a naive LSTM encoder-decoder framework for imitation learning, while TFN utilizes a Transformer network for imitation learning. In the DQN method, the Q-network is modeled as a four-layer multi-layer perceptron with 12 discrete actions and receives the input features. The DQN model was trained until convergence was achieved (1,000,000 episodes). Table 1 presents the number of successful trials for each method; the average time steps for the successful cases are shown in parentheses. The two methods with the highest number of successes for each scenario are highlighted in bold. The results clearly demonstrate that the proposed approach outperforms the other methods in most test scenarios.
In the results presented in Table 1, reinforcement learning techniques (specifically DQN) exhibit a longer average time step compared to the other methods due to the emphasis on stability in the design of the reward function. Additionally, there was no significant difference in average time steps (for successful cases) between the MPC techniques, including the proposed method, and the supervised learning techniques. Notably, the proposed method demonstrated a slightly shorter average time step compared to the other MPC technique, LBMPC_STL.
While the "average time step" cannot be an absolute criterion for evaluating the superiority of an algorithm's performance, when combined with the "collision rate", it indicates that the proposed method enables more stable and efficient autonomous driving compared to other methods.
Two key observations can be made from these results:
• Model Predictive Control (MPC) demonstrates superior safety performance compared to reinforcement learning (DQN) and imitation learning approaches (LSTM, TFN).
• The deep learning approach employed in the proposed method yields a better performance than the Gaussian process regression approach used in LBMPC_STL.

Conclusions
In this paper, we present a Model Predictive Control (MPC) method designed to manage dynamic systems while adhering to a set of Signal Temporal Logic (STL) rules. Unlike traditional approaches that enforce strict compliance with all rules, our method efficiently balances rule adherence by selectively disobeying certain rules to resolve dilemma situations where not all rules can be simultaneously satisfied.
The proposed method introduces the concept of robustness slackness, which represents the lower bound of the robustness degree, learned from expert demonstrations or data. By employing a Conditional Variational Autoencoder (CVAE) network, the controller adapts its behavior to prioritize different rules based on the context, emulating the decision-making processes of human experts.
Our contribution lies in the innovative approach of learning the satisfaction measure of rules using a deep-learning network, enabling robots to internalize and replicate the value systems of humans. This approach allows for more flexible and context-aware control, which is crucial for operating in complex and dynamic environments.

Figure 1 .
Figure 1. Overview of the proposed learning-based MPC framework. Expert demonstrations are utilized to learn the lower bounds of robustness, referred to as robustness slackness, through a deep learning approach. The learned values inform the MPC method, which then computes control sequences considering the STL rules.

Figure 2 .
Figure 2. Description of the ego vehicle and nearby vehicles in a track driving scenario. The ego vehicle (V_ego) is shown in blue. The diagram includes up to six nearby vehicles positioned in front and behind, across the left, center, and right lanes relative to the ego vehicle.

Figure 3 .
Figure 3 illustrates a demonstrated trajectory in a track driving scenario, depicting both the robustness degree and its lower bound for the time horizon H. The rule considered involves maintaining the first (lowest) lane, defined by the STL formula φ_lane = (y ≤ y_upper) ∧ (y ≥ y_lower), where y represents the vertical position of the vehicle, and y_upper and y_lower are the upper and lower lane boundaries, respectively. An obstacle (or other vehicle), depicted as a striped black box, necessitates a lane change to proceed. The figure illustrates the variance between the robustness degree values and their corresponding lower bounds.

Figure 4 .
Figure 4.The CVAE network used to predict the robustness slackness.

Figure 5 .
Figure 5. Driving environment illustrating the defined STL rules φ.
(a) Predicted robustness slackness indicates a move to the left lane. (b) Predicted robustness slackness indicates a move to the right lane.

Figure 6 .
Figure 6. Snapshots of the proposed method applied to the NGSIM dataset.

Figure 7 .
Figure 7. Illustration of the proposed method's performance in NGSIM road environments. The figure depicts four different scenes, showing the predicted robustness slackness and the corresponding vehicle movements for each situation.

Table 1 .
Number of successful trials in collision experiments.