Research and implementation of variable-domain fuzzy PID intelligent control method based on Q-Learning for self-driving in complex scenarios

Abstract: In the control of self-driving vehicles, PID controllers are widely used because of their simple structure and good stability. However, in complex self-driving scenarios such as large-curvature curves, car following, and overtaking, the control accuracy of the vehicle must remain stable. Some researchers have used fuzzy PID to change the PID parameters dynamically so that vehicle control remains stable. However, it is difficult to guarantee the control effect of the fuzzy controller when the size of the domain is not selected properly. This paper designs a variable-domain fuzzy PID intelligent control method based on Q-Learning that makes the system robust and adaptable by dynamically changing the size of the domain, further ensuring the control effect of the vehicle. The algorithm takes the error and the rate of change of the error as input and uses Q-Learning to learn the scaling factor online, thereby adjusting the PID parameters online. The proposed method is verified on the Panosim simulation platform. The experiment shows that accuracy is improved by 15% compared with traditional fuzzy PID, which reflects the effectiveness of the algorithm.


Introduction
PID controllers and fuzzy-logic-based controllers dominate the field of intelligent automation control by virtue of their scalability and structural simplicity [1]. They have been widely used in vehicle corner speed control [2,3]. However, these conventional controllers are highly parameter-dependent. In practice, parameter adjustment relies mainly on expert experience, which is time-consuming, and once the control parameters are determined they cannot be adjusted automatically. As a result, such a controller performs well only in the scene for which it was tuned. If the system is used for other tasks, the control effect often changes, and the control parameters need to be adjusted frequently [4]. Typical driving scenes, such as overtaking and meeting on a narrow road, involve many diverse, time-varying, and uncertain problems [5,6]. Therefore, tuning the parameters of conventional controllers to achieve optimal performance has become a popular research direction [7]. To handle complicated processes and combine the benefits of traditional controllers with human operator knowledge, fuzzy control [8] has emerged as an alternative to conventional control algorithms. Sang Hyuk Park [9] proposed a fuzzy self-tuning PID controller and a method for online tuning of the PID controller gain values. Priyam [10] proposed PID and fuzzy PID control for DC motors and compared the results of the two methods, concluding that fuzzy PID is more adaptable than classical PID. Quan et al. [11] used fuzzy PID control to study the thermal degradation behavior and kinetics of biomass microwave pyrolysis, which significantly improved the response speed of the system. Jie et al. [12] proposed a Type-2 fuzzy PID controller for improving the drive performance of autonomous mobile robots in static obstacle environments. Neerendra [13] showed that satisfactory results can be achieved using fuzzy logic and laser sensors for navigation in unknown environments.
Muhammad [14] proposed a method for designing a robust fuzzy-tuned PD (FPD) controller with better tracking and minimal ball oscillations under different lighting conditions. Reinforcement learning [15] is a machine learning approach that generates optimal policies by interacting with the environment. Various approaches such as Q-Learning and the deep deterministic policy gradient (DDPG) algorithm have been developed for self-driving decision-making and control tasks, including driving policy generation [16], autonomous mobile robot obstacle avoidance [17], robot manipulator control [18], path planning [19], and path tracking [20]. Traditional control methods combined with reinforcement learning have proven to be very effective in solving problems in high-dimensional action spaces and state spaces [21]. One line of research uses reinforcement learning algorithms to replace traditional controllers completely and applies them directly as control strategies. Ramanathan [22] used Q-Learning to control the level in a non-linear conical tank with satisfactory results. It is worth noting, however, that the selection of the state space, the action space, and the rewards all affect the performance of a reinforcement learning algorithm, and designing a controller by reinforcement learning alone becomes challenging when the number of dimensions increases [23].
Another line of research combines traditional controllers with reinforcement learning to compensate for the shortcomings of traditional controllers. Lakhani et al. [24] proposed an automatic PID tuning framework based on reinforcement learning (RL), in particular the deterministic policy gradient (DPG) method, to address the fact that traditional PID tuning methods rely on trial and error for complex processes where insight into the system is limited and may not yield the optimal PID parameters. Dogru et al. [25] combined recent developments in computer science and control theory to address the tuning problem, formulating PID tuning as a reinforcement learning task with constraints. Yu et al. [26] proposed a self-adaptive model-free SAC-PID control method based on reinforcement learning for automatic control of mobile robots, in which the upper controller is based on soft actor-critic (SAC), one of the most competitive continuous control algorithms, and the lower controller is an incremental PID controller.
The Q-Learning algorithm has proved to be one of the most efficient model-free RL algorithms: it directly parameterizes and updates value functions or policies without explicitly modeling the environment. Therefore, we design a variable-domain fuzzy PID controller based on Q-Learning. Specifically, the variable-domain fuzzy PID controller changes the initial domain of the fuzzy PID controller through scaling factors, which vary with the deviation and the rate of change of the deviation, to adjust the PID parameters intelligently. However, the control function of a variable-domain fuzzy PID controller can become distorted, which decreases control accuracy. This paper therefore uses a reinforcement learning algorithm to improve the variable-domain fuzzy PID controller and give it the ability to optimise online. Combining the variable domain with online learning in this way improves the effectiveness of the PID parameter adjustment.

PID
PID combines proportional, integral, and differential control actions on the system's error, as is shown in Figure 1. The PID algorithm forms the control output u(t) as a linear combination of the proportional, integral, and differential terms of the error e(t):

$$u(t) = k_p e(t) + k_i \int_0^t e(s)\,ds + k_d \frac{de(t)}{dt} \tag{2.1}$$

where k_p, k_i, and k_d are the proportional, integral, and differential coefficients respectively. In self-driving vehicle control, the deviation is e(t) = r(t) − c(t), where c(t) is the actual trajectory and r(t) is the preset trajectory. P (proportional) control is the basis of PID control: it reflects the deviation signal of the control system proportionally and acts to reduce the deviation as soon as it occurs. When only P control is available, steady-state error or overshoot is generated. I (integral) control is used to eliminate steady-state error (the very small error that still exists between the vehicle's stable driving and the preset trajectory under PD control). The error is integrated as long as it exists, so the output continues to increase or decrease until the error is zero, at which point the integration stops and the output no longer changes. D (differential) control reflects the trend of the deviation signal and can introduce an effective early correction signal before the deviation becomes too large, thus speeding up the action of the system and reducing the regulation time. In terms of time, P control adjusts for current errors, I control adjusts for historical errors, and D control adjusts for future errors.
Although PID control has many advantages, such as simple structure and good stability, a single set of PID parameters rarely performs well in the complex environment of self-driving vehicles, where speed and corner control are time-varying non-linear systems.
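The control law above can be illustrated with a minimal discrete-time sketch. The class name `PID`, the fixed sampling period `dt`, the rectangle-rule integral, and the backward-difference derivative are assumptions made for illustration; the source does not specify a discretisation.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Discrete positional PID controller (illustrative sketch, not the paper's code)."""
    kp: float
    ki: float
    kd: float
    dt: float                # sampling period in seconds (assumed constant)
    integral: float = 0.0    # running approximation of the integral of e(t)
    prev_error: float = 0.0  # error at the previous sampling instant

    def step(self, reference: float, measured: float) -> float:
        """Return the control output u for one sampling instant."""
        error = reference - measured                       # e(t) = r(t) - c(t)
        self.integral += error * self.dt                   # rectangle-rule integral
        derivative = (error - self.prev_error) / self.dt   # backward difference
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains and a unit step in the reference.
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
u = pid.step(reference=1.0, measured=0.0)  # 2.0*1 + 0.5*0.1 + 0.1*10 = 3.05
```

The gains here stay fixed for the lifetime of the controller, which is the limitation that the fuzzy and variable-domain variants in the following sections address.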

Fuzzy PID
Fuzzy PID control [28] optimises the PID parameters in real time using fuzzy logic and a set of fuzzy rules, overcoming the shortcoming of traditional PID that its parameters cannot be adjusted in real time. Fuzzy PID control includes fuzzification, the determination of fuzzy rules, and defuzzification. The fuzzy PID is a two-input, three-output controller. Its inputs x_i(t) (i = 1, 2) are the error and its rate of change, x_1(t) = e(t) and x_2(t) = ec(t) = de(t)/dt. Its outputs y_j(t) (j = 1, 2, 3) are the proportional, integral, and differential corrections of the PID parameters, y_1(t) = ∆k_p, y_2(t) = ∆k_i, and y_3(t) = ∆k_d. The initial domains of the inputs and outputs are set to

$$X_i = [-E_i, E_i] \quad (i = 1, 2), \qquad Y_j = [-U_j, U_j] \quad (j = 1, 2, 3)$$

where E_i and U_j are the boundaries of the theoretical domains. The theoretical domains of both input and output variables are divided into 7 fuzzy subsets: NB (negative big), NM (negative medium), NS (negative small), ZO (zero), PS (positive small), PM (positive medium), and PB (positive big). After the form of the membership function is determined, the output variables ∆k_p, ∆k_i, ∆k_d are obtained through the three processes of input fuzzification, fuzzy inference, and defuzzification. The final control quantity is determined according to Eq (2.2):

$$k_p = k_{p0} + \Delta k_p, \qquad k_i = k_{i0} + \Delta k_i, \qquad k_d = k_{d0} + \Delta k_d \tag{2.2}$$

where k_p0, k_i0, and k_d0 are the initial design values of the PID parameters, obtained by a conventional PID parameter tuning method. ∆k_p, ∆k_i, ∆k_d are the three outputs of the fuzzy controller, which automatically adjust the values of the three PID control parameters according to the state of the controlled object.
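The fuzzification step and the gain update can be sketched as follows. This is only an illustration: the evenly spaced triangular membership functions, the function names, and the numeric values are assumptions, and the rule table and defuzzification that would actually produce the corrections are omitted because the source does not give them.

```python
LABELS = ["NB", "NM", "NS", "ZO", "PS", "PM", "PB"]  # the 7 fuzzy subsets


def triangular_memberships(x: float, bound: float) -> dict:
    """Membership degree of x in 7 evenly spaced triangular subsets on [-bound, bound].

    Evenly spaced triangles are an assumed shape; the source does not specify one.
    """
    x = max(-bound, min(bound, x))            # clip the input to the domain
    step = 2.0 * bound / (len(LABELS) - 1)    # distance between neighbouring peaks
    degrees = {}
    for k, label in enumerate(LABELS):
        centre = -bound + k * step
        degrees[label] = max(0.0, 1.0 - abs(x - centre) / step)
    return degrees


def adjusted_gains(base_gains, corrections):
    """Add the fuzzy corrections to the initial gains: k = k0 + delta_k for each gain."""
    return tuple(k0 + dk for k0, dk in zip(base_gains, corrections))


# Hypothetical values: an error of 0.5 on the domain [-3, 3] lies halfway
# between the peaks of ZO and PS.
memberships = triangular_memberships(0.5, bound=3.0)
kp, ki, kd = adjusted_gains((1.0, 0.1, 0.01), (0.2, -0.05, 0.0))
```

With this shape, any input inside the domain activates at most two neighbouring subsets and the activated degrees sum to one, which keeps the subsequent rule evaluation small.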

Variable-domain fuzzy PID
When the size of the theoretical domain in a fuzzy PID controller is not chosen properly, it is difficult to guarantee the control effect of the fuzzy controller, which motivated the variable-domain fuzzy PID controller [29].
As is shown in Figure 3, the variable-domain fuzzy PID adjusts the theoretical domain ranges of the inputs and outputs of the fuzzy controller online by introducing scaling factors. The scaling factors α(e(t)), α(ec(t)), and β(e(t), ec(t)) are calculated from e(t) and ec(t), where α(e(t)) and α(ec(t)) are the scaling factors of the input variables e(t) and ec(t), and β(e(t), ec(t)) is the common scaling factor of the three output variables ∆k_p, ∆k_i, ∆k_d. The initial theoretical domains of the input and output variables are then scaled accordingly. Taking the ith input as an example, the new theoretical domain obtained after the adjustment is given by Eq (2.3), where ε is a sufficiently small positive number, E_1 and E_2 are the initial domain boundaries of the two input variables, and τ_i (i = 1, 2) is the scaling factor design parameter with τ_i ∈ [0, 1]. The scaling factor should be stable. Reference [30] verified the stability of the new scaling factor from five aspects: duality, zero avoidance, monotonicity, coordination, and normality.
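The domain-scaling equation is not reproduced in this text. As an illustration only, the sketch below assumes the widely used power-law scaling factor α(x) = (|x| / E)^τ + ε, which is consistent with the symbols ε, E_i, and τ_i defined above but may differ from the paper's exact formula. The function names and the default value of `eps` are hypothetical.

```python
def scaling_factor(x: float, bound: float, tau: float, eps: float = 1e-3) -> float:
    """Assumed scaling factor alpha(x) = (|x| / E)^tau + eps.

    x     : current input value (error or error rate)
    bound : initial domain boundary E
    tau   : design parameter in [0, 1]
    eps   : small positive number that keeps alpha away from zero
    """
    ratio = min(abs(x), bound) / bound  # inputs beyond the boundary are clipped
    return ratio ** tau + eps


def scaled_domain(x: float, bound: float, tau: float, eps: float = 1e-3):
    """Return the contracted or expanded domain [-alpha * E, alpha * E]."""
    alpha = scaling_factor(x, bound, tau, eps)
    return (-alpha * bound, alpha * bound)


# With E = 350 and tau = 0.5, an error of 87.5 gives alpha = 0.5 + eps,
# so the domain shrinks to roughly half of its initial width.
low, high = scaled_domain(87.5, bound=350.0, tau=0.5)
```

Under this assumed form the domain contracts as the error approaches zero, so the same seven fuzzy subsets cover a narrower range and resolve small errors more finely; τ controls how quickly the contraction happens.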

Reinforcement learning
Reinforcement learning is a subfield of machine learning. As is shown in Figure 4, an agent interacts with its environment to learn a strategy π(s) [31] that maximizes the expected future reward; the strategy defines which action a should be taken in each state s. When an action is performed and the environment shifts to a new state s′, the agent receives a reward r. The reinforcement learning process can be modelled as a Markov decision process (MDP) [32], defined by the tuple (S, A, T, R, γ), where S represents the state space, A the action space, T the state transfer function, R the reward function, and γ the discount factor. At each time step t, the quantity to be maximised is the discounted return

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \tag{2.4}$$

Reinforcement learning algorithms are divided into policy-based algorithms and value-based algorithms. In value-based reinforcement learning, the value function is used to estimate the value of being in each state. The state value function under the strategy π is

$$v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right] \tag{2.5}$$

where E_π denotes the expectation under strategy π. Similarly, the state-action value function q_π for taking action a in state s under strategy π can be defined as

$$q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s,\ a_t = a \right] \tag{2.6}$$

Q-Learning [33] is an off-policy value-based algorithm with a value function based on state-action pairs and an iterative formula that can be described as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \tag{2.7}$$

where α ∈ (0, 1] is the learning rate (not to be confused with the scaling factor α(·) of the variable-domain controller), and γ ∈ (0, 1] is the discount factor, which weights immediate rewards against future long-term rewards [34].
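The tabular Q-Learning update can be written in a few lines. The sketch below is a generic illustration using a list-of-lists Q table; the function name `q_update` and the default values of `alpha` and `gamma` are assumptions, not values from the paper.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-Learning update in place and return the new Q(s, a).

    q          : list of rows, q[state][action] holds the current estimate
    alpha      : learning rate in (0, 1]
    gamma      : discount factor in (0, 1]
    """
    # Off-policy target: bootstrap from the best action in the next state,
    # regardless of which action the behaviour policy will actually take.
    td_target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical 3-state, 2-action table.
q_table = [[0.0, 0.0], [1.0, 2.0], [0.0, 0.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9 * 2.0 = 2.8, so Q(0, 1) moves halfway from 0 to 2.8
```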

Variable-domain fuzzy PID based on Q-Learning
Traditional PID control is limited in that its parameters cannot change with the external environment; adding fuzzy control improves robustness. To increase the accuracy of fuzzy control, the theoretical domain can be changed by adding a scaling factor, which gives variable-domain fuzzy PID control. The control function in a variable-domain fuzzy PID controller adjusts itself continuously, which makes the PID parameters self-tuning. However, as the control function is repeatedly reduced or enlarged over time, the updated control function becomes less accurate and distorted. As is shown in Figure 5, the Q-Learning algorithm is added so that the scaling factor can be optimised online, which adjusts the PID parameters more accurately and seeks the optimal PID parameters for the current operating conditions. This section analyses the key scaling factors in variable-domain fuzzy PID control. Taking Eq (2.3) as an example, the size of the theoretical domain is determined by its two parameters τ_i (i = 1, 2); when these parameters change, the domain changes. Controlling these two parameters with Q-Learning achieves a better control effect: the controller becomes capable of learning and online correction, and the system becomes more robust and more resistant to disturbances. In the Q-Learning process, the action set consists of the changes in the parameters, ∆τ_i (i = 1, 2), i.e., A = {−0.075, −0.05, −0.025, 0, 0.025, 0.05, 0.075}, and the state set is the deviation e, from which the Q matrix is built. The reward function is related to the rate of change of the error. A negative rate of change of the error indicates that the current learning direction is the reward direction and adjustment can continue in this direction; a positive rate of change indicates that the current learning direction is the penalty direction and adjustment should proceed in the opposite direction.
In the reward function, r(t) is the immediate reward at moment t; e(t) is the deviation between the preset trajectory and the actual trajectory c(t), as defined for the PID controller; ec(t) = e(t) − e(t − 1) is the rate of change of the error at moment t, and its sign denotes the direction of learning; g is the penalty term; and ω is the weight. The Q-value table is iterated through Eq (2.7).

Loop
    If S = 0, with probability ϵ select a random action to get out of the local minimum
    Else select a_t = argmax_{a′} Q(s_i, a′; θ_{t|t−1}) from the selection criterion of the current parameters τ_i
    End
    If t < N
        State update
        Update the Q-value function Q(s, a) with the updated action a
    End
    t = t + 1
Outputs: The parameters τ_1, τ_2.
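The loop above can be sketched as a small tabular Q-Learning routine that adjusts one scaling parameter τ. Several parts are assumptions made only so that the example runs: the number of error bins, the reward shape in `hypothetical_reward` (the paper's reward formula is not reproduced in this text), the hyperparameters, and the callback `measure_error`, which stands in for running the controller in simulation. Only the action set, the boundary E_1 = 350, and the initial values ω = 1 and g = 20 come from the text.

```python
import random

ACTIONS = [-0.075, -0.05, -0.025, 0.0, 0.025, 0.05, 0.075]  # delta-tau values from the text
N_STATES = 7   # number of error bins (assumed)
E1 = 350.0     # domain boundary of the error input, as used in the experiment


def error_to_state(e: float) -> int:
    """Map an error clipped to [-E1, E1] onto one of N_STATES equal-width bins."""
    clipped = max(-E1, min(E1, e))
    idx = int((clipped + E1) / (2 * E1) * N_STATES)
    return min(idx, N_STATES - 1)


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else take the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    row = q[state]
    return row.index(max(row))


def hypothetical_reward(e_prev, e_now, omega=1.0, g=20.0):
    """Assumed reward: positive when |e| shrinks, minus a penalty g at the boundary."""
    reward = omega * (abs(e_prev) - abs(e_now))
    if abs(e_now) >= E1:
        reward -= g
    return reward


def tune_tau(measure_error, tau=0.5, steps=200, alpha=0.1, gamma=0.9,
             epsilon=0.1, seed=0):
    """Adjust tau online; measure_error(tau) is a stand-in for the simulated plant."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    e_prev = measure_error(tau)
    state = error_to_state(e_prev)
    for _ in range(steps):
        a = choose_action(q, state, epsilon, rng)
        tau = max(0.0, min(1.0, tau + ACTIONS[a]))  # keep tau inside [0, 1]
        e_now = measure_error(tau)
        next_state = error_to_state(e_now)
        r = hypothetical_reward(e_prev, e_now)
        q[state][a] += alpha * (r + gamma * max(q[next_state]) - q[state][a])
        state, e_prev = next_state, e_now
    return tau, q


# Synthetic stand-in for the plant: the error vanishes at tau = 0.6.
tau_final, q_table = tune_tau(lambda t: 100.0 * (t - 0.6))
```

The sketch tunes a single parameter; for the two parameters τ_1 and τ_2 it would be run once per parameter or extended to a joint action set, a design choice the source does not spell out.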

Experiment
To verify the effectiveness of the proposed method, we trained a Q-Learning-based variable-domain fuzzy PID controller and built urban road scenarios, including large-curvature, overtaking, and following scenarios, in the Panosim simulation environment, as is shown in Figure 6. The initial domain boundaries are E_1 = 350 and E_2 = 100. When the error or its rate of change exceeds the domain boundary, its value is taken as the boundary. The reward function is evaluated from the real-time trajectory error, with ω initialized to 1 and g initialized to 20.

Simulation experiments
The whole experimental process is divided into training and testing. Training is conducted in the following and overtaking scenarios, where the overtaking trajectories are generated from Bézier curves, and testing is conducted in the cornering scenario. The proposed method is compared with fuzzy PID and variable-domain fuzzy PID (vdfuzzyPID), using the error between the actual trajectory and the preset trajectory as the indicator.
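A reference path for the overtaking scenario can be generated as in the sketch below, assuming a cubic Bézier curve. The curve order, the placement of the control points, and the lane dimensions are hypothetical; the source states only that curves of this family were used.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3  # Bernstein weights
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def lane_change_path(length, lane_width, n_points=50):
    """Sample a left lane change of the given longitudinal length and lateral offset.

    The two inner control points sit at mid-length so that the path leaves and
    enters the lanes tangentially (an assumed, commonly used layout).
    """
    p0 = (0.0, 0.0)
    p1 = (length / 2.0, 0.0)
    p2 = (length / 2.0, lane_width)
    p3 = (length, lane_width)
    return [cubic_bezier(p0, p1, p2, p3, k / (n_points - 1)) for k in range(n_points)]


# Hypothetical 40 m lane change onto a lane 3.5 m to the left.
path = lane_change_path(length=40.0, lane_width=3.5, n_points=5)
```

The resulting list of (x, y) points can serve as the preset trajectory r(t) against which the tracking error of each controller is measured.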
As is shown in Figures 7-9, we compare fuzzy PID control, variable-domain fuzzy PID control, and variable-domain fuzzy PID control based on Q-Learning. The red line is the expected vehicle trajectory, the blue line is the vehicle trajectory under fuzzy PID control, the black line is the vehicle trajectory under variable-domain fuzzy PID control, and the green line is the vehicle trajectory under variable-domain fuzzy PID control based on Q-Learning. Figures 7(a)-9(a) show the target trajectory and the actual trajectory in the three scenarios respectively, where A indicates the starting point and B the ending point. Figure 7(a) shows the control effect in the following scenario. Because the vehicle speed is low and the corner change is small in the following scenario, the three methods show no visible difference in control effect. Figure 8(a) shows the control effect in the overtaking scenario, where a slow-moving vehicle ahead on the original trajectory obstructs the car, so it decides to change to the left lane; the steering path shows that steering is smoother under variable-domain fuzzy PID control based on Q-Learning than under fuzzy PID. Figure 9(a) shows the control effect in the curvature scenario, where variable-domain fuzzy PID control based on Q-Learning produces a trajectory more closely aligned with the preset trajectory, whereas fuzzy PID produces a larger actual trajectory arc during cornering because of its slow response and small change in steering angle.
As shown in Figures 7(b)-9(b), the average reward curve for each path showed a steady upward trend with no significant drop or fluctuation, which means that the training process was very stable.
The control error is used as a measure of the stability of the proposed method, and the error varies as the vehicle moves. Figure 10 shows the real-time error and local zooms for the car-following scene. As shown in Figure 10(a), the control errors fluctuate around 0. The enlarged area in Figure 10(b) corresponds to the curve in Figure 7(a). Both the trajectory and the error diagram show that the control effects of fuzzy PID and variable-domain fuzzy PID are comparatively poor. In Figure 10(c), the error changes in the enlarged area are relatively dense, and the enlarged view shows that the control error of the Q-Learning-based method (Q-fuzzyPID) is small. Figure 11 shows the real-time trajectory control error for the overtaking and curvature scenarios. As shown in Figure 11(a), after lane changing and overtaking, the error between the actual and preset trajectories is large for all three methods, but the proposed method performs better by comparison. Figure 11(b) shows the control error of the three methods on the curve. Because fuzzy PID responds slowly when turning, its small change in steering angle leads to a large error. Table 1 shows the average error of the three methods in each of the three scenarios. The data in the table show that the average error of the proposed method is smaller than that of fuzzy PID control and variable-domain fuzzy PID control in all three scenarios.

Conclusions
Self-driving vehicles require different control accuracies when faced with different complex scenarios. Traditional PID control uses a single set of parameters that is difficult to adapt to changes in the scenario. The control function in a variable-domain fuzzy PID controller makes the PID parameters self-tuning by constantly adjusting itself. However, the control function is reduced or enlarged over time, which can make the updated control function inaccurate and distorted. In this paper, the Q-Learning algorithm is added so that the scaling factor can be optimised online, allowing the PID parameters to be adjusted more accurately and the optimal PID parameters to be found for the current working conditions. The effectiveness of the method is verified on the Panosim simulation platform.