Learning-Based Safe Control for Robot and Autonomous Vehicle Using Efficient Safety Certificate

Energy-function-based safety certificates can provide provable safety for complex control systems in safety-critical tasks. However, recent studies on learning-based energy function synthesis have focused only on feasibility, which can lead to over-conservatism and reduce controller efficiency. In this study, we propose magnitude regularization techniques to enhance the efficiency of safe controllers by reducing conservativeness within the energy function while maintaining provable safety guarantees. Specifically, we measure conservativeness by the magnitude of the energy function and reduce it by adding a magnitude regularization term to the synthesis loss. We present the SafeMR algorithm, which uses reinforcement learning (RL) to unify the learning of the safety controller and the energy function. To verify the effectiveness of the algorithm, we conducted two sets of experiments, one in robot environments and the other in an autonomous vehicle environment. The experimental results demonstrate that the proposed approach reduces the conservativeness of the energy function and outperforms the baselines in controller efficiency while ensuring safety.


I. INTRODUCTION
Safety is paramount in real-world applications of intelligent systems, which must always adhere to hard state constraints. Autonomous vehicles, for example, must exercise caution to avoid accidents with other road users, including pedestrians and other vulnerable users. In light of this, a range of cooperative vehicle methods has been proposed to address collisions in diverse scenarios, such as intersections [1], [2], highways [3], [4], and similar settings [5], [6]. Nevertheless, the penetration rate of autonomous vehicles is still in its nascent stage and will take time to grow, so current mainstream technology focuses on non-cooperative vehicle safety. However, focusing solely on safety can lead to over-conservatism. Intelligent systems such as autonomous vehicles must also prioritize efficiency while maintaining safety. For instance, overly conservative strategies may cause autonomous vehicles to remain stationary for prolonged periods to avoid collisions, which negatively impacts traffic efficiency. Hence, a balance between safety and efficiency is crucial in the design and deployment of intelligent systems, particularly in safety-critical applications such as autonomous vehicles.
Energy-function-based safety certificates are a major branch of safe control research for intelligent agents. Existing approaches include barrier certificates and control barrier functions (CBF) [7], [8], [9], the safety index [10], [11], [12], and reachability analysis [13], [14], [15]. Intuitively, an energy function assigns high energy to unsafe states and low energy to safe states; safe control policies are then designed to dissipate the system energy [16]. Thus, the energy function and the safe control policy are closely related. The most promising aspect of energy-function-based safety certificates is that they can provide provable safety guarantees by ensuring forward invariance of the safe set. Forward invariance means that the system never leaves the safe set if there always exist actions that dissipate the system energy. This existence of actions is called feasibility, and the corresponding energy function is said to be feasible. Provable safety guarantees are valuable both in algorithm design and in real-world systems. However, manually synthesizing a feasible energy function is very difficult [12] (the provable safety guarantees hold only under feasibility). This difficulty has stimulated many recent studies that use learning-based techniques to synthesize energy functions [8], [17], [18], [19], [20], [21], [22], [23], [24], [25]. Reinforcement learning (RL) has gained increasing attention because it learns from interactions with the environment and can be used without prior controllers or dynamics. Recent studies have shown that RL can synthesize safety certificates while learning safe control policies [24], [25].
Feasibility is crucial since it determines safety, but the energy function also significantly impacts the controller's efficiency. For instance, an energy function that forces autonomous vehicles to remain stationary at all times may prevent them from encountering any danger, but it also prevents them from accomplishing any task. We define an efficient energy function as one that generates an efficient safe control policy. Intuitively, an ideal energy function is both feasible and efficient. Unfortunately, few studies have discussed the efficiency of synthesized energy functions. We can briefly explain why efficiency has been neglected in previous studies: the energy-function-based safe control problem is usually formalized as constrained optimization [7], [12], [16], where policy efficiency appears in the objective function while the energy functions act as constraints. It is quite challenging to know explicitly how to adjust the constraints to improve the optimal solution of the constrained problem. Therefore, in this paper, we propose the magnitude regularization method, along with an algorithm called SafeMR. SafeMR is an RL-based energy function synthesis method that improves the efficiency of the synthesized energy functions. Some preliminary results were presented in a conference version [26]. The main contributions of this paper can be summarized as follows:
1) We quantify the conservativeness of the energy function in terms of its magnitude. Specifically, we recognize that the energy function should not be unnecessarily high for hazardous states, as this leads to overly conservative, inefficient policies. Instead, we aim to balance safety and efficiency by optimizing the magnitude of the energy function.
2) We add a magnitude regularization term to the energy function synthesis loss and use RL to learn it. This avoids any a priori knowledge about the controller and system dynamics, making our approach flexible and adaptable to different systems and scenarios.
3) We conduct experiments on Safety Gym, a commonly used safe RL benchmark, as well as on SUMO for simulating autonomous vehicles in intersection traffic scenarios. Our results demonstrate that the magnitude regularization method effectively improves the efficiency of the policy while ensuring safety.

II. RELATED WORKS

A. ENERGY-FUNCTION-BASED SAFETY CERTIFICATES
Representative energy-function-based safety certificates include the safe set algorithm (SSA) [12], control barrier functions (CBF) [10], and barrier certificates [27]. Safety certificates and safe control policies are closely related, and both are essential for ensuring the safety of dynamic systems. According to the learning objectives, recent learning-based research can be divided into three main categories: (1) learning safe control policies with a known feasible energy function [17], [18], [22]; (2) learning to synthesize energy functions with known dynamics models or controllers [21], [23], [28], [29], [30]; and (3) joint synthesis of safe control policies and energy functions [8], [24]. However, in all three branches, only the feasibility of the learning objective or a priori knowledge has been considered. Few studies have discussed the efficiency of synthesized energy functions. To the best of our knowledge, the only relevant works are reachability studies that discuss states located on the boundary of the safe set, where the most conservative actions are taken [25], [31], [32]. However, reachability-based methods focus only on purely safe policies, without explicitly learning policies that are both efficient and safe.

B. REINFORCEMENT LEARNING WITH SAFETY CONSIDERATIONS
Safety has always been an important issue in decision-making problems, especially for RL, which learns from interaction with the environment. There are many branches of safety-related RL studies, such as the constrained Markov decision process (CMDP) [33], [34], [35], [36], [37], [38], safe exploration in MDPs [39], [40], risk-sensitive RL [41], [42], and post-processing of the RL policy output [18], [43], [44]. Since the constraints on actions induced by safety certificates are state-dependent, previous studies have had difficulties in handling them [45] (post-processing can handle state-dependent constraints, but requires more known information). A Lagrangian-based method with state-dependent multipliers [25], [46] has been proposed to explicitly handle state-dependent constraints in RL, and the method proposed in this paper builds on this approach.

III. PRELIMINARIES
In this section, we present preliminaries on the problem formulation and the energy-function-based safety certificate.

A. SAFETY SPECIFICATIONS AND PROBLEM FORMULATION
In this paper, safety means that the system state s should remain within a connected closed set S_s, called the safety set. S_s can be expressed as the zero-sublevel set of a safety index function φ_0(·): S_s = {s | φ_0(s) ≤ 0}. We use a Markov decision process (MDP) with deterministic dynamics (a reasonable assumption for robot safety control problems), defined by the tuple (S, A, F, r, c, γ), where S and A are the state and action spaces, F : S × A → S is the unknown system dynamics, r, c : S × A × S → R are the reward and cost functions, and γ is the discount factor.
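For concreteness, the formulation above can be transcribed as a minimal Python sketch; the type aliases and the helper `is_safe` are ours, purely for illustration:

```python
from typing import Callable, TypeVar

S = TypeVar("S")  # state
A = TypeVar("A")  # action

# The deterministic MDP tuple (S, A, F, r, c, gamma) as typed callables.
Dynamics = Callable[[S, A], S]       # F : S x A -> S (unknown to the learner)
Reward = Callable[[S, A, S], float]  # r(s, a, s')
Cost = Callable[[S, A, S], float]    # c(s, a, s')

def is_safe(phi0: Callable[[S], float], s: S) -> bool:
    """Membership in the safety set S_s = {s | phi0(s) <= 0}."""
    return phi0(s) <= 0.0
```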

B. ENERGY-FUNCTION-BASED SAFETY CERTIFICATES
We define the energy function by φ : S → R. Intuitively, if the system state is dangerous, e.g., the autonomous vehicle is close to other vehicles or pedestrians, the system is assigned a high energy value. Conversely, if the system state is safe, the assigned energy value is low. To ensure provable safety, two conditions must be met: (1) the system must remain at a low energy state (φ ≤ 0), and (2) the system must dissipate energy rapidly when the energy is high (φ > 0). Fig. 1 illustrates the relationship between the energy function, the safety set, and the two conditions. Therefore, we get the safety constraint

φ(s′) < max{φ(s) − η_D, 0},  (1)

where η_D is a slack variable controlling the descent rate of the safety index. If, at every state s, there exists an action a ∈ A that satisfies (1), i.e., the safe action set U_s(s) = {a | φ(s′) < max{φ(s) − η_D, 0}} is always nonempty, we say that the energy function φ is feasible. Safety can only be guaranteed if the constraint (1) of a feasible certificate is satisfied. Otherwise, there may exist states with no action that guarantees safety, and therefore safety cannot be guaranteed. However, recent comprehensive studies of energy functions have focused only on the feasibility of the energy function [20], [21], [24]. Feasibility is undoubtedly essential, and synthesizing a feasible energy function is already a difficult task. In the real world, however, feasibility alone is insufficient: when dealing with real-world robotic applications, we must also consider efficiency. For instance, an autonomous vehicle should not remain stationary on a narrow road where both sides pose a danger, as illustrated in Fig. 2. The design of the energy function φ undoubtedly affects both efficiency and policy performance.
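The feasibility condition can be made concrete with a short sketch. Note that it queries a one-step dynamics model purely for illustration, whereas the paper treats F as unknown and works from sampled transitions; all names here are hypothetical:

```python
def safe_action_exists(phi, F, s, candidate_actions, eta_d):
    """Check whether the safe action set U_s(s) is nonempty at state s.

    phi: energy function (phi <= 0 on the safe set).
    F:   one-step dynamics s' = F(s, a), queryable here only for illustration.
    eta_d: slack variable controlling the energy descent rate in (1).
    """
    threshold = max(phi(s) - eta_d, 0.0)
    # U_s(s) = {a | phi(F(s, a)) < max(phi(s) - eta_D, 0)}
    return any(phi(F(s, a)) < threshold for a in candidate_actions)
```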

IV. ENERGY FUNCTION SYNTHESIS AND MAGNITUDE REGULARIZATION A. ENERGY FUNCTION SYNTHESIS USING CONSTRAINED REINFORCEMENT LEARNING
We formulate a constrained reinforcement learning (CRL) problem to synthesize feasible and efficient energy functions. We use RL because it can learn the safety control policy and the energy function without any a priori knowledge. If other supervised learning techniques were used instead, we would need a known safe control policy (or dynamics model) to learn the energy function, which is difficult to obtain for complex real-world tasks. The CRL problem is formulated to maximize the expected return while satisfying the energy constraint (1):

max_π E_s[V^π(s)]  s.t.  φ(s′) < max{φ(s) − η_D, 0}, ∀s ∈ S,  (2)

where V^π(s) is the state-value function of s. It is worth noting that the constraint (1) is imposed for every state, not just for the safe states. We utilize a Lagrangian-based constrained RL algorithm to handle this particular problem formulation [46]. A Lagrangian multiplier network λ(s) is used to handle the state-dependent constraints. Following their methodology, we define the Lagrangian function

L(π, λ) = E_s[−V^π(s) + λ(s)(φ(s′) − max{φ(s) − η_D, 0})].  (3)

In the next section, we will describe how to add magnitude regularization to this Lagrangian function.
Since the Lagrangian function (3) serves as the loss function of RL, i.e., the loss function of policy learning, we name it the primitive loss function. The primitive loss function can jointly learn the safety control policy and the energy function [24]. However, it only considers feasibility, not efficiency.
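As a rough sketch of what such a primitive loss looks like in code (the tensor names and exact weighting are our assumptions, not the authors' implementation), consider:

```python
import torch

def primitive_loss(q_value, lam, phi_s, phi_s_next, eta_d):
    """Minimal sketch of the primitive (Lagrangian) loss (3) with a
    state-dependent multiplier network, in the spirit of [24], [46].

    q_value:    Q(s_t, a_t) for reparameterized actions (return term).
    lam:        lambda(s_t), outputs of the multiplier network (>= 0).
    phi_s, phi_s_next: energy values at s_t and s_{t+1}.
    """
    # Constraint residual: phi(s') - max(phi(s) - eta_D, 0) < 0 is required
    q_phi = phi_s_next - torch.clamp(phi_s - eta_d, min=0.0)
    # The policy descends and the multiplier ascends this same Lagrangian
    return (-q_value + lam * q_phi).mean()
```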

B. EFFICIENCY CONSIDERATION BY MAGNITUDE REGULARIZATION
We now describe how to implement magnitude regularization on top of the primitive loss function. In Section I, we briefly discussed the difficulty of designing an efficient loss function. One might question whether the expected return in RL is already a sufficient efficiency metric and whether additional metrics are needed. However, (2) is a constrained RL problem: the expected return is part of the objective function, while the energy function only shapes the constraints. What we aim to optimize is a better constraint, and the objective function has no gradient w.r.t. the constraints. To address this problem, we developed a method called magnitude regularization. The motivation behind this method is that the higher the energy function in a relatively safe region, such as half a meter to one meter from the edge of an obstacle vehicle, the more conservative the energy constraint (1). This motivation stems from the fundamental principle of the energy function: lower values indicate greater safety. In other words, we require the energy function to be high enough to ensure feasibility, but not excessively high, which would hinder performance. To achieve this balance, we propose adding a regularization term to the primitive loss (3). We refer to this term as the magnitude regularization.
To illustrate how to implement magnitude regularization, we will use the energy function adapted for the collision avoidance task described in [30] as an example. Since collision avoidance is a common safety requirement, the approach we outline here can be readily applied to other safety requirements.
φ(s) = (σ + d_min)^n − d^n − k·ḋ,  (4)

where d is the distance between the intelligent body (e.g., an autonomous vehicle) and the obstacle to avoid, and d_min is the minimum safe distance to the obstacle. ḋ is the derivative of the distance with respect to time, and ξ = [σ, k, n] are the adjustable parameters we wish to optimize in the synthesis algorithm; they should all be positive real numbers. Given a specific energy function formula, we can analyze how changes in the parameters affect the magnitude of the energy function. Clearly, if σ or n increases, φ also increases, since d_min is a constant. The situation is a bit more complex for the parameter k, as the correlation between the energy function and k depends on ḋ. We need only consider the case where the intelligent body is approaching a dangerous obstacle (ḋ < 0); in this case, φ always increases with positive k.
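Written out directly, (4) is a one-liner; the sketch below simply transcribes it (the argument names are ours):

```python
def safety_index(d, d_dot, sigma, k, n, d_min):
    """Collision-avoidance energy function (4), following [30] (sketch).

    phi grows when the agent is close (d near d_min) or closing in fast
    (d_dot < 0 with k > 0), matching the magnitude analysis above.
    """
    return (sigma + d_min) ** n - d ** n - k * d_dot
```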
Based on the analysis above, the tunable parameters σ, k, n affect the primitive loss function L(π, λ) only indirectly, by modifying the value of φ(s). To make their effect on the loss function more direct, we propose taking some of these tunable parameters out of φ(s) and adding them directly to the primitive loss function L(π, λ).
After conducting a series of experiments, we propose adding two terms from φ(s), namely (σ + d_min)^n and k, directly to the loss function L(π, λ). To set their weights, we introduce two new adjustable parameters a and b. This approach allows us to increase the efficiency of the intelligent body while maintaining safety, by adjusting the size of the safety set.
In general, we conclude that the magnitude of φ increases with σ, n, and k. Therefore, the specific magnitude regularization term is designed as

L_MR(ξ) = a(σ + d_min)^n + b·k,  (5)

where a and b are two positive parameters. We analyze the sensitivity to a and b in the experimental results section. Finally, the magnitude regularization term is added to the primitive loss function (3), and the total loss function is

L_total(π, λ, ξ) = L(π, λ) + a(σ + d_min)^n + b·k.  (6)
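The resulting total loss (6) is then a straightforward sum; a minimal sketch, assuming the primitive loss value has already been computed:

```python
def total_loss(primitive, sigma, k, n, d_min, a, b):
    """Total loss (6): primitive Lagrangian loss plus the magnitude
    regularization term (5). sigma, k, n are the learnable energy-function
    parameters (all positive); a, b are the regularization weights."""
    l_mr = a * (sigma + d_min) ** n + b * k  # magnitude regularization (5)
    return primitive + l_mr
```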

V. ENERGY FUNCTION SYNTHESIS USING CONSTRAINED REINFORCEMENT LEARNING
In this section, we outline our practical algorithm for synthesizing energy functions, which includes a step-by-step procedure, the computation of gradients, and a discussion on the convergence properties of the algorithm.

A. DETAILS OF ALGORITHM
Our CRL algorithm is constructed within the actor-critic framework [47]. It is designed as a multi-timescale learning process where different parameters are trained simultaneously, and some converge faster than others. Specifically, the fastest timescale is the value function (or Q-function) learning, followed by the policy. Both are trained using the actor-critic algorithm. Then, we update the multiplier network, which is proposed to handle state-dependent constraints in the CRL problem [46]. Finally, the energy function converges the slowest.
We name the algorithm SafeMR because it targets safe control with magnitude regularization (MR) [26] to increase the efficiency of the controller. The algorithm represents the parameters of the policy network, the multiplier network, and the energy function as θ, ω, and ξ, respectively, and computes the gradients to update the policy, the multiplier, and the energy function as follows.

It should be noted that the loss function used to update the policy and the multiplier is the same, following the nature of the dual ascent algorithm [48]:

min_θ max_ω L(π_θ, λ_ω).  (7)

The policy gradient with the reparameterized policy a_t = f_θ(ε_t; s_t) can be approximated by

∇̂_θ J_π(θ) = ∇_θ [−Q(s_t, a_t) + λ_ω(s_t) Q_φ(s_t, a_t)] |_{a_t = f_θ(ε_t; s_t)},  (8)

where ∇̂_θ J_π(θ) denotes the stochastic gradient with respect to θ, and Q_φ(s_t, a_t) = φ(s_{t+1}) − max{φ(s_t) − η_D, 0}. Neglecting the irrelevant parts, the objective function for updating the multiplier network parameters ω is

J_λ(ω) = E[λ_ω(s_t) Q_φ(s_t, a_t)],  (9)

whose stochastic gradient is

∇̂_ω J_λ(ω) = E[∇_ω λ_ω(s_t) Q_φ(s_t, a_t)].  (10)

The objective function for the energy function parameters ξ is the total loss function (6), and its gradient is

∇̂_ξ L_total(ξ) = E[λ*(s_t) ∇_ξ Q_φ(s_t, a*_t)] + ∇_ξ L_MR(ξ),  (11)

where Q_φ depends on ξ through φ. Similar calculations can be found in [46] and [24]. The superscript * indicates that π and λ have already reached their locally optimal solutions w.r.t. the current energy function φ, since they converge faster than φ.
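A compact sketch of the loss terms behind gradients (8)-(11) is given below. The callables, the direct query of a `dynamics` function, and the tensor names are our assumptions for illustration; in practice the transitions come from sampled rollouts, and the timescales are realized through the learning rates:

```python
import torch

def safemr_losses(q_net, policy, lam_net, phi_net, dynamics, s, eps, eta_d, l_mr):
    """Losses behind gradients (8)-(11) (illustrative sketch, not the
    authors' code). Each optimizer updates only its own parameters."""
    a_t = policy(eps, s)                       # a_t = f_theta(eps_t; s_t)
    s_next = dynamics(s, a_t)
    # Q_phi(s_t, a_t) = phi(s_{t+1}) - max(phi(s_t) - eta_D, 0)
    q_phi = phi_net(s_next) - torch.clamp(phi_net(s) - eta_d, min=0.0)

    # Policy: gradient descent on the Lagrangian, multiplier frozen
    policy_loss = (-q_net(s, a_t) + lam_net(s).detach() * q_phi).mean()
    # Multiplier: gradient ascent, realized as descent on the negation
    multiplier_loss = -(lam_net(s) * q_phi.detach()).mean()
    # Energy function: constraint term plus magnitude regularization, as in (6)
    phi_loss = (lam_net(s).detach() * q_phi).mean() + l_mr
    return policy_loss, multiplier_loss, phi_loss
```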

B. CONVERGENCE DISCUSSION
The convergence of the CRL algorithm and of energy function synthesis was analyzed in a previous study [24]. Intuitively, under moderate assumptions and appropriate learning rate schedules, both the safe control policy and the energy function converge to their local optima. The current paper proposes a CRL algorithm, so its convergence can be analyzed with similar tools. However, the magnitude regularization term poses a challenge: it is always non-negative and produces non-negative gradients in the energy function synthesis steps, which affects the convergence results. Hence, a theoretical convergence analysis may not be available for the proposed algorithm. Nonetheless, convergence is still observed in practice: the feasibility requirement quickly drives the first term of the loss (excluding the regularization) to zero, and the regularization term is bounded since the parameters are bounded. This ensures that the loss and gradient norms do not grow enough to cause divergence.

VI. ROBOT EXPERIMENT
In the robot experiments, we focus on the following questions. First, does the proposed algorithm reduce the conservativeness of the energy function? Second, if so, does the reduced conservativeness lead to increased efficiency? Third, how does the proposed algorithm compare with other constrained RL algorithms, and does it achieve a constraint-satisfying safe policy with zero violations?
In the first part of the robot experiments, we chose two settings. One is a pure collision-avoidance task for aircraft [14], used to verify the conservativeness reduction. The other is Safety Gym [36], a commonly used safe RL benchmark; here we chose four environments with different tasks and different obstacles to validate the robot's performance after CRL training.

A. CONSERVATIVENESS REDUCTION
In this experiment, we control a robot to avoid obstacles or hazard areas. The control input is the angular velocity, and the state consists of the position and heading angle of the robot. The boundaries of the learned unsafe sets are shown in Fig. 3; notably, states outside the boundaries are safe. We compare the performance of SafeMR with JointSIS [24] and with a handcrafted energy function from [30]. The reward is designed as the L2 norm of the actions.
The results show that the unsafe set learned by SafeMR is smaller than that of JointSIS while still covering the real unsafe set (red boundary, obtained through numerical solution). The handcrafted unsafe set, on the other hand, does not cover the entire real unsafe set, which means that there might exist no safe action in the uncovered area, resulting in danger. The experimental results therefore show that SafeMR reduces conservativeness while guaranteeing safety, which answers the first question posed at the beginning of this section.

FIGURE 3. A schematic of the unsafe sets provided by the different algorithms and the real unsafe set. SafeMR learns the smallest unsafe set that covers the true unsafe set (compared with the other algorithms). The handcrafted energy function yields an even smaller unsafe set, but it fails to cover some truly unsafe states, so it is invalid [26].

B. SAFETY GYM: BASELINE ALGORITHMS AND EXPERIMENT SETUP
In this section, three types of baseline algorithms were compared with the proposed algorithm: (1) the joint synthesis method of safe control policy and safety certificate without efficiency consideration [24], which we name JointSIS here for clarity; (2) constrained RL baselines, including PPO-Lagrangian, TRPO-Lagrangian, and CPO [35], [36]; and (3) Lagrangian-based safe RL with fixed handcrafted energy functions, denoted FAC-φ_0 and FAC-φ_h in the results below [46]. It is worth noting that the safety specification in this paper, i.e., zero constraint violation, differs from the specification in the original CRL papers. Therefore, we set the cost threshold to zero to push the CRL baselines toward strictly safe policies (although, as the following results show, they do not learn zero-constraint-violation policies). The proposed approach and all baseline implementations are based on the Parallel Asynchronous Buffer-Actor-Learner (PABAL) architecture proposed by [49]. PABAL is a state-of-the-art parallel RL training platform that is efficient at integrating model-based and model-free algorithms. All experiments are run on Intel Xeon Gold 6248 processors with 12 parallel actors, including 4 workers to sample, 4 buffers to store data, and 4 learners to compute gradients. Implementations of the CRL baselines follow their original papers and the official code releases [35], [36].
The four experimental environments are shown in Fig. 4 and named in the pattern Obstacle-Size-Task. The task of the red robot is to reach the green target area while avoiding overlap or collision with the blue obstacles. In two of the environments there are no physical obstacles, only virtual ones; the other two have pillars as real physical obstacles. In addition, the obstacle size differs between the two environments with virtual obstacles and between the two with real obstacles.

C. EXPERIMENTAL RESULTS
The Safety Gym environments were used to test the performance of the robot trained by the SafeMR algorithm. Success was measured by how quickly the robot reached the target area and whether it did so without any collision. In the experiments, the SafeMR-trained robot consistently moved toward the target area without colliding with obstacles, completing its task both quickly and safely. A few challenging cases arose when the target area was far away with many obstacles in between. In such cases, the robot searched for a passable gap between the obstacles and crossed it to reach the target area; if the gap was too narrow, it bypassed the obstacle-ridden area instead. Overall, the experiment demonstrates that SafeMR enables the robot to navigate toward the target area efficiently and safely, with only minor issues in rare and challenging situations. Table 1 compares the expected returns of SafeMR and the baselines; the desired result is the highest expected return together with zero violations. We use the expected return and the expected cost as metrics to evaluate the algorithms. Specifically, the expected cost is computed by summing the constraint violations within a single episode and averaging across multiple runs.
The experimental results show that SafeMR achieves the highest expected return among all safe algorithms. The baseline algorithms fall into two categories: (1) those that achieve higher expected returns but are not safe, including all CMDP-based algorithms (TRPO-L, PPO-L, and CPO) and FAC-φ_0, and (2) those that guarantee safety but have lower expected returns, including JointSIS (and FAC-φ_h in some environments). According to the original Safety Gym paper [36], metrics for evaluating policy performance must consider both safety and performance. For the zero-violation safety control problem, however, a policy that does not guarantee safety is meaningless. Therefore, we conclude that SafeMR improves policy efficiency while guaranteeing safety, compared with all baseline algorithms.

D. MICROSCOPIC AND SENSITIVITY ANALYSIS
We now give a microscopic and sensitivity analysis of the hyperparameters of the magnitude regularization, i.e., [a, b]. We first present the learned energy function parameters σ, k, n for different a, b in Table 2. Due to space limitations, we only show results for the Pillar-0.15-Goal setting. The feasibility of the learned energy function parameters can be verified via [30, eq. (3)]; all three sets of learned energy functions are feasible. The training curves are shown in Fig. 6. They show that the expected return depends on the choice of [a, b], but all settings outperform JointSIS, which has no efficiency consideration. As for safety, the expected cost converges to zero for all hyperparameter settings. The experimental results thus show that, for SafeMR, all tested hyperparameter choices improve performance while ensuring safety.

VII. VEHICLE EXPERIMENT

A. EXPERIMENT BASED ON VEHICLE MODEL AND VIRTUAL TRANSPORTATION SCENE
The next experiment verifies whether SafeMR is suitable for the autonomous driving domain. For an autonomous vehicle, decision making and control are constrained by the vehicle model, and the vehicle is also affected by the traffic environment and traffic regulations. Since the vehicle model is very different from the robot model, and the road traffic scene differs from the robot experimental environments above, we first build the vehicle dynamics model and the traffic environment, and then apply SafeMR for experimental analysis.

B. TRAFFIC ENVIRONMENT AND SCENE BUILDING
There are many kinds of traffic scenes in the real world. In this experiment, we choose the intersection scene without pedestrians but with complex vehicle traffic flow.
The environment follows the prior work [45], modeling a real intersection located at (31°08′13″ N, 120°35′06″ E) in the SUMO network editor. The random traffic flow of the vehicles is generated by the SUMO traffic simulator.
In the intersection scene, we give the ego vehicle a left-turn trajectory and let it follow the trajectory. During the experiment, however, other vehicles in the traffic flow may collide with the ego vehicle on this trajectory at some point. The other vehicles play the same role as the obstacles in the robot experiment: they induce the corresponding high-energy regions of the energy function. Meanwhile the ego vehicle, like the robot, uses SafeMR to remain collision-free and to reach the lane after the left turn as quickly as possible, the same goal as in the robot experiment.
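A minimal sketch of driving such a SUMO co-simulation loop via TraCI is shown below; the configuration file name, the ego vehicle id, and the horizon are placeholders, not the paper's actual setup:

```python
import traci

# Start SUMO and step the simulation, reading surrounding vehicles.
traci.start(["sumo", "-c", "intersection.sumocfg"])
EGO = "ego"
for _ in range(3600):
    traci.simulationStep()
    vehicles = traci.vehicle.getIDList()
    if EGO in vehicles:
        ego_xy = traci.vehicle.getPosition(EGO)
        # The surrounding vehicles act as the moving obstacles that
        # feed the energy function, like obstacles in the robot tasks.
        obstacles = [traci.vehicle.getPosition(v) for v in vehicles if v != EGO]
traci.close()
```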

C. INTRODUCTION AND BUILDING OF THE VEHICLE MODEL
Vehicles driving in urban road environments such as intersections do not exhibit particularly complex dynamics. Therefore, a dynamic bicycle model is sufficient to describe the vehicle dynamics constraints in the experimental scene [50]. This is a classical vehicle model, widely used to characterize the lateral dynamics of vehicles in non-extreme driving experiments. The model is shown in Fig. 8, and its characteristic equations are

ẋ = u cos φ − v sin φ
ẏ = u sin φ + v cos φ
φ̇ = ω
u̇ = a + vω
v̇ = (F_Y1 cos δ + F_Y2)/m − uω
ω̇ = (l_f F_Y1 cos δ − l_r F_Y2)/I_z

The six state variables of the vehicle state X are the horizontal position x, vertical position y, yaw angle φ, longitudinal velocity u, lateral velocity v, and yaw rate ω. The input U consists of the front wheel steering angle δ and the acceleration command a. The lateral tire forces F_Y1, F_Y2 are functions of the sideslip angles α_f and α_r. For a vehicle under non-extreme conditions, the lateral tire forces and sideslip angles satisfy the linear relations

F_Y1 = k_f α_f,  F_Y2 = k_r α_r,  α_f = (v + l_f ω)/u − δ,  α_r = (v − l_r ω)/u,

where k_f and k_r are the front and rear wheel cornering stiffnesses. Moreover, l_f and l_r are the distances from the vehicle's center of mass to the front and rear axles, m is the vehicle mass, and I_z is the vehicle's rotational inertia about the z-axis. These parameters are related to the physical properties of the vehicle itself and are listed in Table 3.
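A sketch of these dynamics in code is given below, with placeholder parameter values (the paper's actual values are in Table 3) and a plain forward-Euler step standing in for the numerically stable discretization of [50]:

```python
import numpy as np

# Placeholder parameters; the paper's actual values are listed in Table 3.
M, I_Z = 1500.0, 2500.0        # mass [kg], yaw inertia [kg m^2]
L_F, L_R = 1.2, 1.4            # CoM to front / rear axle [m]
K_F, K_R = -90000.0, -90000.0  # cornering stiffness [N/rad], negative convention

def bicycle_dot(X, U):
    """Continuous dynamic bicycle model with linear tire forces (sketch).
    X = [x, y, phi, u, v, omega], U = [delta, a]; assumes u > 0."""
    x, y, phi, u, v, omega = X
    delta, acc = U
    alpha_f = (v + L_F * omega) / u - delta     # front sideslip angle
    alpha_r = (v - L_R * omega) / u             # rear sideslip angle
    f_y1, f_y2 = K_F * alpha_f, K_R * alpha_r   # linear lateral tire forces
    return np.array([
        u * np.cos(phi) - v * np.sin(phi),               # x_dot
        u * np.sin(phi) + v * np.cos(phi),               # y_dot
        omega,                                           # phi_dot
        acc + v * omega,                                 # u_dot
        (f_y1 * np.cos(delta) + f_y2) / M - u * omega,   # v_dot
        (L_F * f_y1 * np.cos(delta) - L_R * f_y2) / I_Z, # omega_dot
    ])

def step(X, U, T_s=0.1):
    """Forward-Euler step with control period T_s (illustrative; [50]
    derives the numerically stable scheme used in the experiments)."""
    return X + T_s * bicycle_dot(X, U)
```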

D. EXPERIMENTAL SETUP
This experiment differs methodologically from the robot experiment above. Since the training object has a describable model, i.e., the dynamic bicycle model, this experiment uses a dynamic-programming RL method, which is model-based, to train the virtual autonomous vehicle. The other vehicles on the road are just moving obstacles from the ego vehicle's perspective, so we do not need to consider their dynamics models. Covering each obstacle vehicle in the traffic flow with two circles is a simple and practical approach, as shown in Fig. 9. In this case, the region inside the two circles (approximately the solid body of the vehicle) is the high-energy region.
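The two-circle distance computation might look like the following sketch; the circle placement and radius are our assumptions about the covering in Fig. 9:

```python
import math

def two_circle_distance(ego_xy, obs_pose, length, width):
    """Point-to-vehicle distance under a two-circle covering (sketch).
    obs_pose = (x, y, yaw); the circles sit on the longitudinal axis at
    +/- length/4 from the center, each with radius r covering a half body."""
    x, y, yaw = obs_pose
    r = math.hypot(length / 4.0, width / 2.0)  # radius covering each half body
    dists = []
    for side in (-1.0, 1.0):
        cx = x + side * (length / 4.0) * math.cos(yaw)
        cy = y + side * (length / 4.0) * math.sin(yaw)
        dists.append(math.hypot(ego_xy[0] - cx, ego_xy[1] - cy) - r)
    return min(dists)  # <= 0 means inside the high-energy region
```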
In this part of the experiment, we use the dynamic programming RL algorithm for training; this is a model-based RL algorithm whose details are described in [51]. The algorithm requires discretization of the vehicle model, which can be found in [50]; the discretization takes the form

X_{t+1} = X_t + T_s · f(X_t, U_t),

where T_s is the time length of a control cycle and is an adjustable parameter (the exact numerically stable scheme is derived in [50]). For this part of the experiment, the autonomous vehicle is asked to start at the bottom of the intersection and turn left into the target area on the left. The departure point of the autonomous vehicle is fixed, while the target area is the area to the left of the pedestrian crossing on the left side of the intersection. In the middle of the intersection, the autonomous vehicle needs to avoid the surrounding obstacle vehicles and ensure that it does not collide with any of them. At the same time, we want the autonomous vehicle to reach the target area as quickly as possible, without long stops.
Considering that the traffic environment requires observance of traffic rules, we set a reference trajectory for the autonomous vehicle (the green line in Fig. 7). When there is no obstacle vehicle to avoid nearby, the autonomous vehicle follows the reference trajectory. The details of the method can be found in [51].

E. EXPERIMENTAL RESULTS OF THE INTERSECTION SCENE
The results of the experiment are shown in Fig. 10. They show that the autonomous vehicle trained with the SafeMR method successfully avoids the other vehicles in the traffic flow, ensuring safety while efficiently reaching the target area (the left side of the intersection; either lane is acceptable).
Analyzing Fig. 10, we can see that before the autonomous vehicle enters the intersection, there are many vehicles around it because the traffic light has just turned green. At this time, the speed of the autonomous vehicle is kept relatively low, but it follows the vehicle ahead closely to maintain traffic efficiency. After entering the intersection, the surrounding vehicles spread out and the vehicle density decreases. At this point the speed of the autonomous vehicle increases rapidly so as to reach the target area on the left side of the intersection as quickly as possible. During driving, when the autonomous vehicle meets other vehicles in its way (the last two frames in Fig. 10), it decelerates in time to follow them. This illustrates that the SafeMR approach can be used for autonomous vehicle planning and control that balances safety and efficiency.

VIII. CONCLUSION
This paper proposed the magnitude regularization technique to synthesize efficient energy functions while guaranteeing safety in robotic and autonomous driving safe control tasks. We quantify conservativeness by the magnitude of the energy function and construct a magnitude regularization term to control the growth of the magnitude during synthesis. An algorithm called SafeMR is proposed to combine magnitude regularization with RL and synthesize feasible and efficient energy functions. Experimental results on various tasks show that the proposed algorithm reduces the conservativeness of the energy function and thereby improves the efficiency of the safe control policies. Meanwhile, the algorithm solidly guarantees safety and is robust to hyperparameter choices.