An Energy Management Strategy for a Super-Mild Hybrid Electric Vehicle Based on a Known Model of Reinforcement Learning

A global optimal control strategy requires the driving cycle to be known in advance and is difficult to implement online because of its large calculation volume. As an artificial-intelligence-based control strategy, reinforcement learning (RL) is applied to the energy management strategy of a super-mild hybrid electric vehicle. From time-speed datasets of sample driving cycles, a stochastic model of the driver's power demand is developed. Based on Markov decision process theory, a mathematical model of an RL-based energy management strategy is established, which takes the minimum cumulative return expectation as its optimization objective. A policy iteration algorithm is adopted to obtain the optimum control policy that takes the vehicle speed, driver's power demand, and state of charge (SOC) as the input and the engine power as the output. Using a MATLAB/Simulink platform, a simulation model for the CYC WVUCITY driving cycle is established. The results show that, compared with dynamic programming, this method can adapt to random driving cycles, reduce fuel consumption by 2.4%, and be implemented online because of its small calculation volume.


Introduction
With the increasing problems of global warming, air pollution, and energy shortage, hybrid electric vehicles (HEVs) have been extensively studied because of their potential to significantly improve fuel economy and reduce emissions.
Energy management strategies are critical for HEVs to achieve optimal performance and energy efficiency through power split control. According to how they are implemented, energy management strategies can be divided into rule-based strategies and optimization-based strategies.
Optimization-based strategies form an important group of energy management strategies and can be divided into strategies based on instantaneous optimization, global optimization, model predictive control, and artificial intelligence.
Rule-based strategies are primarily based on human experience. The torque distribution of the engine and motor follows preset control rules, which are formulated from the steady-state MAP charts of the engine and motor. Reference [1] proposed an energy management strategy based on a logic threshold and a fuzzy algorithm for improving fuel economy. Wang et al. [2] developed algorithms for On/Off control, load tracking control, and bus voltage control and conducted a simulation. A rule-based control strategy is simple and can easily be implemented online; however, such a static control strategy is not optimal in theory and does not consider the dynamic changes in working conditions.
An instantaneous optimal control strategy can ensure an optimum at every time step; however, it cannot guarantee an optimum over the whole driving cycle. Compared with the logic threshold strategy, its calculation volume is larger, but it can still be implemented online. Reference [3] formulated the energy management strategy of an HEV based on driving style recognition. Reference [4] combined the K-means clustering algorithm with the equivalent consumption minimization strategy to realize the energy management of the whole vehicle. A model predictive control (MPC) strategy can ensure an optimum within the prediction domain and can be implemented online. Reference [5] proposed a stochastic MPC-based energy management strategy that uses the vehicle location, traveling direction, and terrain information of the area for HEVs operating in hilly regions with light traffic.
A global optimal control strategy can guarantee an optimum over a given driving cycle by distributing the power of the engine and motor. However, it can only be implemented offline because of the large volume of calculations involved. Reference [6] proposed an improved dynamic programming (DP) control strategy for hybrid electric buses based on the state transition probability. In [7], DP was applied to an EVT-based HEV powertrain to realize its optimal control. A DP algorithm was also used in [8] for global optimization of the performance of a speed-coupling ISG HEV.
Intelligent energy management strategies include fuzzy logic, neural network, genetic algorithm, and machine learning-based control strategies. Energy management strategies based on machine learning include those based on supervised learning [9,10], unsupervised learning [11,12], and reinforcement learning.
RL is a data-driven approach that treats the system as a black box, regardless of whether it is linear or nonlinear. As a type of self-adaptive optimal control method based on machine learning, RL has been widely applied to the learning control of nonlinear systems.
Reference [13] proposed an energy management strategy for PHEB based on RL. Liu et al. [14] proposed an RL-enabled energy management strategy that uses the speedy Q-learning algorithm to accelerate the convergence rate in Markov chain-based control policy computation. Reference [15] developed an online energy management controller for a plug-in HEV based on driving condition recognition and a genetic algorithm. In [16], a deep Q-learning-based energy management strategy was proposed and verified. RL was shown to derive model-free and adaptive control for energy management in [17]. Liu et al. [18] proposed a bilevel control framework that combined predictive learning with RL to formulate an energy management strategy.
RL can approach a global optimum over the driving cycle in the sense of the expected cumulative return, and it does not require the driving cycle to be known in advance. Compared with complex dynamic models, data-driven models can be implemented online because of their small calculation volume.
For super-mild HEVs, [19] studied rule-based and instantaneous optimization methods for energy management. However, the application of RL to the energy management strategy of super-mild HEVs has not been reported.
This study establishes an energy management strategy for super-mild HEVs based on a known model of RL. The optimal control results are obtained by a policy iteration algorithm. A simulation of the economic performance of the vehicle is then conducted on the MATLAB/Simulink platform.

Structure and Main Parameters.
A super-mild HEV is primarily composed of an engine, a motor, and a continuously variable transmission with reflux power. The continuously variable transmission with reflux power includes a metal belt continuously variable transmission, a fixed speed ratio gear transmission device, a planetary gear transmission device, wet clutches, a one-way clutch, and a brake. Its structure is shown in Figure 1. The main parameters are shown in Table 1.

Working Modes.
There are four working modes of a super-mild HEV: only-motor mode, only-engine mode, engine-charging mode, and regenerative braking mode, as shown in Figure 2.

Power Demand Model.
The power demand at the wheel is the sum of the power needed to overcome rolling resistance, air resistance, and acceleration resistance:
$$P_{req} = (F_f + F_w + F_a)\, v,$$
where $F_f$ is the rolling resistance, $F_w$ is the air resistance, $F_a$ is the acceleration resistance, $v$ is the vehicle speed, $f$ is the rolling resistance coefficient, $m$ is the vehicle mass, $C_D$ is the air resistance coefficient, $A$ is the frontal area, and $\delta$ is the conversion coefficient of rotating mass.
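As a minimal sketch of this calculation, the Python function below evaluates the three resistance terms in SI units. The closed forms used for the individual forces (rolling resistance as m·g·f, air resistance as ½·ρ·C_D·A·v², acceleration resistance as δ·m·dv/dt) are the textbook longitudinal-dynamics expressions and are an assumption here, as are the air density constant, the function name, and the sample parameter values; the unit conventions of the original formula may differ.

```python
G = 9.81        # gravitational acceleration, m/s^2
RHO_AIR = 1.2   # assumed air density, kg/m^3


def power_demand(v, dv_dt, m, f, c_d, area, delta):
    """Return the wheel power demand in watts.

    v      vehicle speed, m/s
    dv_dt  vehicle acceleration, m/s^2
    m      vehicle mass, kg
    f      rolling resistance coefficient
    c_d    air resistance coefficient
    area   frontal area, m^2
    delta  conversion coefficient of rotating mass
    """
    f_roll = m * G * f                          # rolling resistance, N
    f_air = 0.5 * RHO_AIR * c_d * area * v**2   # air resistance, N
    f_acc = delta * m * dv_dt                   # acceleration resistance, N
    return (f_roll + f_air + f_acc) * v


# Hypothetical parameter values, for illustration only.
p_req = power_demand(v=10.0, dv_dt=0.5, m=1000.0, f=0.015,
                     c_d=0.3, area=2.0, delta=1.05)
```

Evaluating this function at every sample of a time-speed dataset gives the power demand sequence that the stochastic model is later built from.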

Engine Model.
The engine is a highly nonlinear system and its working process is very complex. Therefore, engine data for steady-state conditions are obtained through experimental testing. Based on these data, an engine torque model and an effective fuel consumption model are established, as shown in Figures 3 and 4, respectively.

Battery Model.
Based on a Ni-MH battery performance experiment, the electromotive force and internal resistance models are obtained as shown in Formulas (2) and (3). In Formula (2), $E_0$ is the electromotive constant of the battery, the remaining coefficients are fitting coefficients, SOC is the state of charge, and $E_{soc}$ is the electromotive force under the current state. In Formula (3), $R_0$ is the internal resistance constant of the battery, the remaining coefficients are the fitting coefficients and the compensation coefficient of internal resistance with the change in current, SOC is the state of charge, and $R_{soc}$ is the internal resistance under the current state.
The process to calculate the change in SOC is shown by Formulas (4) to (6), where $P_{bat}$ is the battery power, $I_{bat}$ is the battery current, and $Q_{bat}$ is the battery capacity.
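The SOC calculation can be illustrated with the widely used internal-resistance battery model. This is a sketch under the assumption that the battery power satisfies P_bat = E_soc·I_bat − I_bat²·R_soc and that the SOC changes at the rate −I_bat/Q_bat; the function names, the sign convention (positive power means discharging), and the numbers are illustrative and not taken from the paper.

```python
import math


def battery_current(p_bat, e_soc, r_soc):
    """Solve p_bat = e_soc * i - i**2 * r_soc for the current i in amperes.

    Positive p_bat means discharging. Of the two roots of the quadratic,
    the smaller one is the physically meaningful operating point.
    """
    disc = e_soc**2 - 4.0 * r_soc * p_bat
    if disc < 0:
        raise ValueError("requested power exceeds what the battery can deliver")
    return (e_soc - math.sqrt(disc)) / (2.0 * r_soc)


def soc_step(soc, p_bat, e_soc, r_soc, q_bat_ah, dt):
    """Advance the SOC by one time step of dt seconds (capacity in A*h)."""
    i_bat = battery_current(p_bat, e_soc, r_soc)
    return soc - i_bat * dt / (q_bat_ah * 3600.0)


# Illustrative numbers only: 100 V, 0.1 ohm, 10 A*h, 990 W drawn for 36 s.
i_example = battery_current(990.0, 100.0, 0.1)             # about 10 A
soc_next = soc_step(0.6, 990.0, 100.0, 0.1, 10.0, 36.0)    # about 0.59
```

In a full model, e_soc and r_soc would themselves be evaluated from the fitted electromotive force and internal resistance models at the current SOC before each step.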

Stochastic Modeling of Driver's Power Demand
Traditionally, the driver's power demand is obtained from a given driving cycle; in reality, however, the driving cycle is random. The discrete-time stochastic process is regarded as a Markov decision process (MDP). In other words, the transition probability from the current state to the next state depends only on the current state and the selected action and is independent of historical states. Likewise, the power demand at the next state depends only on the current power demand and is independent of any earlier state. To establish the transition probability of the power demand, a large volume of data is required. In this study, time-speed datasets of the UDDS and ECE EUDC driving cycles are adopted to calculate the power demand at each moment (Figure 5). The transition probability matrix of the driver's power demand is then obtained using the maximum likelihood estimation method, as follows.
The transition probability $p_{ij}$ is the probability of moving from the current state $P_{req}^{i}$ to the next state $P_{req}^{j}$ at a given vehicle speed $v$:
$$p_{ij} = \Pr\left(P_{req}(k+1) = P_{req}^{j} \mid P_{req}(k) = P_{req}^{i},\ v\right).$$
According to the maximum likelihood estimation method, the power transition probability can be obtained by
$$p_{ij} = \frac{N_{ij}}{N_{i}},$$
where $N_{ij}$ represents the number of times that the transition from $P_{req}^{i}$ to $P_{req}^{j}$ has occurred at the given vehicle speed and $N_{i}$ represents the total number of times that $P_{req}^{i}$ has occurred at that vehicle speed; it is given by
$$N_{i} = \sum_{j} N_{ij}.$$
Based on the data of the ECE EUDC and UDDS driving cycles, transition probability matrices for the power demand are obtained at given speeds of 10 km/h, 20 km/h, 30 km/h, and 40 km/h, as shown in Figure 6.
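The counting estimate can be sketched in a few lines of Python. The example assumes the power demand has already been discretized into integer bin indices for one vehicle speed; the function name and the toy sequence are hypothetical and are not taken from the UDDS or ECE EUDC data.

```python
def transition_matrix(power_bins, n_states):
    """Estimate p_ij = N_ij / N_i from a sequence of discretized power states."""
    counts = [[0] * n_states for _ in range(n_states)]
    # Count every observed one-step transition i -> j.
    for i, j in zip(power_bins[:-1], power_bins[1:]):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)  # N_i: number of transitions that start in state i
        # States that never occur as a starting state keep an all-zero row.
        matrix.append([n / total if total else 0.0 for n in row])
    return matrix


# Toy sequence of power-demand bin indices at a single vehicle speed.
sequence = [0, 1, 1, 2, 1, 0, 1, 2, 2]
p = transition_matrix(sequence, n_states=3)
```

Repeating the estimate separately for the samples that fall into each speed bin would give one matrix per speed, matching the per-speed matrices described for Figure 6.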

Addressing Energy Management by RL.
From a macroscopic perspective, the energy management strategy of an HEV involves determining the driver's power demand according to the driver's operation (accelerator pedal or brake pedal) and splitting the power optimally between the two power sources (engine and motor) on the premise of guaranteeing dynamic performance. From a microscopic perspective, the energy management strategy can be abstracted as a sequential optimization decision problem. RL is a machine learning method based on the Markov decision process and can solve sequential optimization decision problems. The process of solving a sequential decision problem by RL is as follows. First, a Markov decision process is represented by the tuple $(S, A, P, R, \gamma)$, where $S$ is the state set, $A$ is the action set, $P$ is the transition probability, $R$ is the return function, and $\gamma$ is the discount factor (Figure 7). Second, based on the Markov decision process, the vehicle controller of the HEV is regarded as the agent, the distribution of its engine power is regarded as the action $a$, and the hybrid electric system except for the vehicle controller is regarded as the environment. To achieve minimum cumulative fuel consumption, the agent takes an action to interact with the environment. After the environment accepts the action, the state changes and an immediate return is generated and fed back to the agent. The agent chooses the next action based on the immediate return and the current state of the environment and then interacts with the environment again. The agent interacts with the environment continually, thus generating a considerable amount of data. RL uses the generated data to modify the action variable. The agent then interacts with the environment again and generates new data, which are used to further optimize the action variable. After several iterations, the agent learns the optimal action that completes the corresponding task; in other words, it determines the decision sequence $a_0^* \to a_1^* \to \cdots \to a_t^*$ (Figure 8), thereby solving the sequential decision problem.
An equivalent fuel consumption term combined with an SOC penalty is taken as the immediate return $r$. It is built from the factor of the equivalent fuel consumption, the fuel consumption, the electricity consumption, the penalty factor of the battery, and the reference value of the SOC.
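One common way to write an immediate return of this kind is given below. It is only an assumed form, since the paper's own expression is not reproduced in this text: the fuel consumption plus the electricity consumption weighted by the equivalence factor, plus a quadratic penalty on the deviation of the SOC from its reference value. The symbols $\dot{m}_{fuel}$, $P_{ele}$, $\lambda$, and $\beta$ are illustrative choices for the fuel consumption, the electricity consumption, the equivalence factor, and the battery penalty factor.

```latex
r = \dot{m}_{fuel} + \lambda\, P_{ele} + \beta \left( SOC - SOC_{ref} \right)^{2}
```

With a return of this shape, minimizing the cumulative return trades off fuel use against electricity use while discouraging the SOC from drifting away from its reference value.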

Mathematical Model of an Energy Management Strategy Based on RL.
In an infinite time domain, the Markov decision process is solved by determining a sequence of decision policies that attains the minimum cumulative return expectation of the random process, which is called the optimal state value function $V^*(s)$:
$$V^*(s) = \min_{\pi} E\left(\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\right),$$
where $\gamma \in [0, 1]$ is the discount factor, $r_{t+1}$ represents the return value at time $t+1$, and $\pi$ represents the control policy. Meanwhile, the control and state variables must also meet the following constraints:
$$SOC_{\min} \le SOC \le SOC_{\max}, \qquad v_{\min} \le v \le v_{\max}, \qquad P_{\min} \le P \le P_{\max},$$
where the subscripts min and max represent the minimum and maximum threshold values for the state of charge, the vehicle speed $v$, and the power $P$, respectively. The purpose of the solution is to determine the optimal policy $\pi^*$:
$$\pi^* = \arg\min_{\pi} E\left(\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\right).$$
A policy iteration (PI) algorithm is used to solve the problem of the random process. It involves a policy evaluation step and a policy improvement step.
The calculation process is shown in Algorithm 1.
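In the policy evaluation step, for a control policy $\pi_k(s)$ (the subscript $k$ represents the number of iterations), the corresponding state value function $V^{\pi_k}(s)$ is calculated as
$$V^{\pi_k}(s) = \sum_{s'} P\left(s' \mid s, \pi_k(s)\right)\left[r\left(s, \pi_k(s), s'\right) + \gamma V^{\pi_k}(s')\right].$$
In the policy improvement step, the improved policy $\pi_{k+1}(s)$ is determined through a greedy strategy:
$$\pi_{k+1}(s) = \arg\min_{a} \sum_{s'} P\left(s' \mid s, a\right)\left[r\left(s, a, s'\right) + \gamma V^{\pi_k}(s')\right].$$
During policy iteration, policy evaluation and policy improvement are performed alternately until the state value function and the policy converge.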
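The alternation of evaluation and greedy improvement can be sketched on a toy problem. The code below is a generic tabular policy iteration for cost minimization, not the paper's Algorithm 1; the two-state model, its costs, and all names are made up for illustration.

```python
def policy_iteration(P, gamma=0.9, tol=1e-10):
    """Tabular policy iteration that minimizes expected discounted cost.

    P[s][a] is a list of (probability, next_state, cost) tuples.
    """
    n_states = len(P)
    policy = [0] * n_states
    V = [0.0] * n_states

    def q(s, a):
        # Expected cost of taking action a in state s, then following V.
        return sum(p * (c + gamma * V[s2]) for p, s2, c in P[s][a])

    while True:
        # Policy evaluation: sweep until the value function stops changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: pick the minimum-cost action in every state.
        stable = True
        for s in range(n_states):
            best = min(range(len(P[s])), key=lambda a: q(s, a))
            if q(s, best) < q(s, policy[s]) - 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Toy model with made-up costs: state 0 = low SOC, state 1 = high SOC;
# action 0 = motor only, action 1 = engine on.
P = [
    [[(1.0, 0, 5.0)], [(1.0, 1, 2.0)]],   # low SOC
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],   # high SOC
]
policy, V = policy_iteration(P)
```

In this toy model the converged policy runs the engine when the SOC is low and uses only the motor when the SOC is high. In the energy management problem, the state would instead combine vehicle speed, power demand, and SOC, and the action would be the engine power.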
The policy iteration algorithm is adopted to obtain the optimum control policy that takes the vehicle speed, driver's power demand, and SOC as the input and the engine power as the output. Figure 9 shows the optimized engine powers at vehicle speeds of 10 km/h, 20 km/h, 30 km/h, and 40 km/h. It can be seen from Figure 9 that only the motor works when the SOC is high and the power demand is low; when both the SOC and the power demand are high, only the engine works; and when the SOC is low, the engine is in the charging mode.
Figure 10 illustrates the offline and online implementation frameworks of the energy management strategy. In the offline part, the power demand transition probability matrix is obtained from the Markov chain model. Mathematical models of the state variable $s$, action variable $a$, and immediate return $r$ are derived according to RL theory. The policy iteration algorithm is employed to determine the engine power optimization tables. In the online part, the driver's power demand is obtained from the opening of the accelerator pedal and brake pedal. Then, the power of the engine and motor is distributed by looking up the offline engine power optimization tables. Finally, the power is transferred to the wheels by the transmission system.
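The online table lookup can be illustrated as follows. This is a sketch that assumes the offline result is stored as a three-dimensional table indexed by vehicle speed, power demand, and SOC, that the nearest grid point is used, and that the motor supplies whatever power the engine does not (losses ignored). The grids, table values, and function names are hypothetical.

```python
from bisect import bisect_left


def nearest_index(grid, x):
    """Index of the grid point closest to x (grid sorted in ascending order)."""
    k = bisect_left(grid, x)
    if k == 0:
        return 0
    if k == len(grid):
        return len(grid) - 1
    return k if grid[k] - x < x - grid[k - 1] else k - 1


def split_power(table, v_grid, p_grid, soc_grid, v, p_req, soc):
    """Look up the stored engine power and give the remainder to the motor."""
    i = nearest_index(v_grid, v)
    j = nearest_index(p_grid, p_req)
    k = nearest_index(soc_grid, soc)
    p_engine = table[i][j][k]
    return p_engine, p_req - p_engine


v_grid = [10.0, 20.0, 30.0, 40.0]   # km/h
p_grid = [0.0, 5.0, 10.0]           # kW
soc_grid = [0.4, 0.6, 0.8]
# Made-up table: motor only at the highest SOC grid point, engine otherwise.
table = [[[0.0 if s >= 0.8 else p for s in soc_grid] for p in p_grid]
         for _ in v_grid]

high_soc = split_power(table, v_grid, p_grid, soc_grid, 22.0, 6.0, 0.78)
low_soc = split_power(table, v_grid, p_grid, soc_grid, 22.0, 6.0, 0.45)
```

Because the online step is only an index search and a table read, its cost is negligible compared with the offline policy iteration, which is the reason the strategy can run in the vehicle controller.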

Simulation Experiment and Results
A simulation is conducted on the MATLAB/Simulink platform, taking the ECE EUDC driving cycle as the simulation driving cycle and setting the initial SOC to 0.6. The energy management strategy based on a known model of RL is adopted to simulate the vehicle operation status online.
Results are shown in Figure 11.
From Figure 11, we can see how the following parameters change with time: the gear ratio of the transmission $i_g$, the SOC of the battery, the power demand $P_{req}$, the motor power $P_m$, the engine torque $T_e$, the motor torque $T_m$, and the instantaneous equivalent fuel consumption. From Figure 11(b), it can be seen that the gear ratio varies continuously from 0.498 to 4.04; this is primarily because of the use of a CVT with reflux power, which, unlike an AMT, can transmit power without interruption. In Figure 11(c), the SOC of the battery changes from the initial value of 0.6 to the final value of 0.5845, a difference of 0.0155 in magnitude; this meets the HEV SOC balance requirement before and after the cycle (−0.02 ≤ ΔSOC ≤ 0.02). The power demand curve of the vehicle is obtained based on ECE EUDC, as shown in Figure 11(d). The power distribution curves of the motor and the engine are obtained from the engine power optimization control tables, as shown in Figures 11(e) and 11(f). A negative motor power indicates that the motor is in the generating state. To validate the optimal performance of the RL control strategy, vehicle simulation tests based on DP and RL were carried out. Figure 12 shows a comparison of the optimization results of the two control methods based on the ECE EUDC driving cycle. The solid line indicates the optimization results of RL and the dotted line indicates the results of DP.
Figure 12(a) shows the SOC optimization trajectories of the DP- and RL-based strategies. The trends of the two curves are essentially the same. The difference between the final SOC and the initial SOC is within 0.02, and the SOC is relatively stable. Compared with RL, the SOC curve of the DP-based strategy fluctuates more; this is primarily related to the distribution of the power source torque. Figures 12(b) and 12(c) show the engine and motor torque distribution curves of the DP and RL strategies. The engine torque curves essentially coincide; however, the motor torque curves are somewhat different. The difference appears mainly in the torque distribution when the motor is in the generating state.
Table 2 shows the equivalent fuel consumption obtained through DP and RL optimization. Compared with that obtained by DP (4.952 L), the fuel consumption obtained by RL is 2.3% higher. The reason is that DP ensures a global optimum only for a given driving cycle (ECE EUDC), whereas RL optimizes the result over a series of driving cycles in an average sense, thereby realizing an optimum of the cumulative expected value. Compared with the rule-based (5.368 L) and instantaneous optimization (5.256 L) strategies proposed in [19], RL decreases the fuel consumption by 5.6% and 3.6%, respectively.
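The quoted percentages can be cross-checked with a short calculation. Table 2 itself is not reproduced in this text, so the RL value below is inferred from the stated 2.3% increase over DP rather than read from the table; the variable names are illustrative.

```python
dp = 4.952                  # L, DP result quoted in the text
rl = dp * (1 + 0.023)       # L, RL value implied by "2.3% higher" (about 5.066)
rule_based = 5.368          # L, rule-based strategy from [19]
instantaneous = 5.256       # L, instantaneous optimization from [19]

# Relative fuel savings of RL against the two strategies from [19].
saving_vs_rule = (rule_based - rl) / rule_based
saving_vs_inst = (instantaneous - rl) / instantaneous

print(f"RL: {rl:.3f} L, "
      f"vs rule-based: {saving_vs_rule:.1%}, "
      f"vs instantaneous: {saving_vs_inst:.1%}")
```

The inferred RL value reproduces the 5.6% and 3.6% reductions stated above, so the three percentages are mutually consistent.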
To validate the adaptability of RL to a random driving cycle, CYC WVUCITY is also selected as a simulation cycle. The optimization results are shown in Figure 13. Figure 13(b) shows the SOC curves of the battery under the DP and RL control strategies. The solid line indicates the optimization results of RL and the dotted line indicates the results of DP. The final SOC values for the DP and RL strategies are 0.59 and 0.58, respectively, with the same initial value of 0.6, so the SOC remains essentially stable. Figures 13(c) and 13(d) show the optimal torque curves of the motor and the engine under the DP and RL strategies. The trends of the torque curves under the two strategies are largely the same, although the torque obtained by DP fluctuates more. Table 3 lists the equivalent fuel consumption of the two control strategies. Compared with DP, RL reduces fuel consumption by 2.4%, which indicates that it adapts well to a random driving cycle. The computation times of DP and RL, both offline and online, are also recorded. The offline computation time of DP is 5280 s and that of RL is 4300 s. Because of its large calculation volume and because the driving cycle is not known in advance, DP cannot be realized online, whereas RL is not limited to a particular cycle and can be realized online by embedding the engine power optimization tables into the vehicle controller. The online operation time of RL is 35 s for the CYC WVUCITY simulation cycle.

Conclusion
(1) We established a stochastic Markov model of the driver's power demand based on datasets of the UDDS and ECE EUDC driving cycles.
(2) An RL energy management strategy was proposed which takes SOC, vehicle speed, and power demand as the state variables, engine power as the action variable, and minimum cumulative return expectation as the optimization objective.
(3) Using the MATLAB/Simulink platform, a simulation model for the CYC WVUCITY driving cycle was established. The results show that, compared with DP, RL could reduce fuel consumption by 2.4% and be implemented online.

Figure 11: ECE EUDC simulation results based on RL.

Figure 12: Comparison between DP and RL optimization results based on ECE EUDC.

Figure 13: Comparison between DP and RL optimization results based on CYC WVUCITY.

Table 2: Simulation results based on ECE EUDC.

Table 3: Simulation results based on CYC WVUCITY.