Energy Management of Multi-mode Plug-in Hybrid Electric Vehicle using Multi-agent Deep Reinforcement Learning

The recently emerging multi-mode plug-in hybrid electric vehicle (PHEV) technology is one of the pathways contributing to decarbonization, and its energy management requires multiple-input and multiple-output (MIMO) control. At present, existing methods usually decouple the MIMO control into multiple-input and single-output (MISO) control and can only achieve locally optimal performance. To optimize the multi-mode vehicle globally, this paper studies a MIMO control method for energy management of the multi-mode PHEV based on multi-agent deep reinforcement learning (MADRL). By introducing a relevance ratio, a hand-shaking strategy is proposed to enable two learning agents to work collaboratively under the MADRL framework using the deep deterministic policy gradient (DDPG) algorithm. Unified settings for the DDPG agents are obtained through a sensitivity analysis of the factors influencing learning performance. The optimal working mode for the hand-shaking strategy is attained through a parametric study on the relevance ratio. The advantage of the proposed energy management method is demonstrated on a software-in-the-loop testing platform. The results indicate that the learning rate of the DDPG agents is the factor with the greatest influence on learning performance. Using the unified DDPG settings and a relevance ratio of 0.2, the proposed MADRL system can save up to 4% energy compared to the single-agent learning system and up to 23.54% energy compared to the conventional rule-based system.


Introduction
Demands for advancement in performance and decarbonization motivate the automotive industry to move towards automation and electrification based on intelligent optimization [1][2][3]. Electrified vehicles, including plug-in hybrid, battery electric, and fuel cell vehicles, are the keys to road transport electrification [4]. The energy management system (EMS) is a critical function module for electrified vehicles; it should be tailored to various powertrain architectures and thus be capable of maximizing energy efficiency while maintaining the health of the powertrain components [5]. The recently developed multi-mode PHEV utilizes a new powertrain topology that allows the vehicle to operate in pure battery mode, series hybrid mode, or parallel hybrid mode adaptively according to the driving conditions [6]. This powertrain topology has been adopted by several OEMs and Tier-1 suppliers worldwide, e.g., Honda, BYD, and MAHLE [7]. Unlike series or parallel HEVs, the multi-mode vehicle cannot use coupled control of the engine, generator, and traction motor, and thus MIMO control is required.
There are currently three main categories of control strategies for the EMS: rule-based methods, optimization-based methods, and learning-based methods [8][9][10]. Rule-based powertrain control strategies typically implement deterministic rules or fuzzy logic founded on vehicle parameters and expert knowledge; they are computationally efficient and easy to apply in real time [11]. Optimization-based strategies include model predictive control (MPC) [12], dynamic programming (DP) [13], Pontryagin's minimum principle (PMP) [14], and the equivalent consumption minimization strategy (ECMS) [15]. They can optimize vehicle performance under certain conditions. Their main drawback is limited adaptability to real-world conditions, especially rapidly changing ones. It is also difficult to obtain good results for multi-objective and multi-mode optimization problems, since solving the control models requires heavy computation [16].
Learning-based EMSs have emerged in recent years and have demonstrated their advantages in optimizing control policies during real-world driving [17][18]. Q-learning, a prevalent RL method with low computational cost and algorithmic complexity, has been developed for EMSs with discretized state and action spaces. Qi et al. utilized the Q-learning algorithm to optimize the EMS for charge-depletion conditions [19]. Liu et al. proposed a Q-learning-based EMS by combining neuro-dynamic programming with future trip information under a two-stage deployment [20]. Chen et al. formulated a new EMS by incorporating the Q-learning algorithm with stochastic model predictive control (SMPC), where a Markov chain-based velocity prediction model was developed to achieve superior fuel economy [21]. Zhou et al. proposed a multi-step Q-learning algorithm to enable all-life-long online optimization of a model-free predictive EMS [22]. Shuai et al. developed a double Q-learning algorithm for the hybrid vehicle by proposing two new heuristic action execution policies, the max-value-based policy and the random policy [23]. However, with the increasing number of state and action variables in advanced decision-making tasks, it becomes more difficult for Q-learning algorithms to compute all Q values over the discretized state-action space with acceptable computing efficiency and optimality. To overcome this shortcoming, deep reinforcement learning (DRL) algorithms handle high-dimensional continuous decision-making tasks by using neural networks to approximate the value function [24][25][26][27].
DRL algorithms have been studied by many researchers to deal with multiple-input and single-output (MISO) control in series hybrids, parallel hybrids, and power-split hybrids, which normally involve high-dimensional input spaces such as the SoC, power demand, and vehicle velocity. Some of the states need to be obtained through multi-information fusion, combining inverse smoothing and gray prediction fusion to mitigate the impact of sensor measurement errors [28]. Related DRL-based strategies have also been reported by Xiong et al. [31]. Since they provide much better control performance than Q-learning and rule-based methods, DRL-based control strategies combined with other advanced algorithms have been employed in the EMS of many single-mode hybrid vehicles [32][33].
Studies of RL-based control for the multi-mode PHEV, however, are only beginning, because the multi-mode PHEV itself is new [34][35][36]. Tang et al. proposed a DRL method for the control of a multi-mode HEV, in which the deep Q-network (DQN) algorithm is used for gear shifting and the DDPG algorithm is used for controlling the engine throttle [37]. Sun et al. developed a hierarchical power-splitting strategy that implements two DRL agents for a multi-mode PHEV [38]. To enable MIMO control with conventional RL/DRL algorithms that are only capable of MISO control, the above-mentioned methods implemented two or more RL/DRL agents to obtain the control outputs, but these agents have no links or communication and thus can only achieve locally optimal results. Because the above-mentioned RL/DRL algorithms can only deal with single-output control and are thus not capable of global optimization, the main challenge addressed in this paper is to develop a new type of RL algorithm with multiple, strongly linked agents for MIMO control of the multi-mode PHEV.
Multi-agent deep reinforcement learning (MADRL) is a recent breakthrough in artificial intelligence emphasizing the behaviors of multiple learning agents coexisting in a shared environment [39]. It links multiple RL agents in three working modes: 1) cooperative mode, 2) competitive mode, and 3) a mixture of the two [40]. In cooperative scenarios, agents work together to maximize a shared long-term return; in contrast, in competitive scenarios, the agents' returns typically sum to zero; in mixed scenarios, general-sum returns arise among both cooperative and competitive agents.
So far, MADRL has been explored in some areas, such as games and robotics, but it has never been used for the EMS of a PHEV. The authors believe that MADRL is a good solution to the MIMO control problem of the multi-mode PHEV. Therefore, the presented work focuses on developing such an EMS based on MADRL, with two new contributions: 1) the best setting for the DDPG agents is obtained through a parametric study covering network layers, learning rate, and policy noise; 2) multiple objectives, including minimizing fuel consumption and minimizing the battery SoC sustaining error, are optimized using a hand-shaking strategy proposed for the DDPG agents.
The rest of this paper is organized as follows: Section 2 formulates the MIMO control problem by modeling the multi-mode PHEV system. The MADRL framework is proposed in Section 3, with a DDPG-based EMS introduced as the baseline method. Test and validation are conducted on a software-in-the-loop platform and the results are discussed in Section 4. Section 5 summarizes the conclusions.

MIMO control of a multi-mode PHEV
The architecture of the multi-mode PHEV studied in this paper is shown in Fig. 1. The motor generator (MG1) and the engine work together to maintain the battery SoC at a certain level for safety. The other motor (MG2) and the engine are the power sources for driving. The multi-mode PHEV can work in different modes, which are selected by engaging or disengaging the clutch, as illustrated in Fig. 1. In series mode (red lines), the clutch is disengaged and only MG2 drives the powertrain. In parallel mode (yellow lines), the clutch is engaged, and the engine provides part of the driving torque to supplement MG2. The energy flow of the vehicle is modeled based on longitudinal vehicle dynamics, and the force demand, $F_d(t)$, the power demand, $P_d(t)$, and the torque demand, $T_d(t)$, can be calculated by

$$F_d(t) = M\dot{v}(t) + Mgf\cos\theta + \tfrac{1}{2}\rho A C_d v(t)^2 + Mg\sin\theta$$
$$P_d(t) = F_d(t)\,v(t), \qquad T_d(t) = F_d(t)\,r_w$$

where $M$ is the vehicle mass; $\dot{v}(t)$ is the vehicle acceleration; $g$ is the gravitational acceleration; $f$ is the rolling resistance coefficient; $\rho$ is the air density; $A$ is the frontal area of the vehicle; $C_d$ is the air resistance coefficient; $v(t)$ is the longitudinal velocity; $r_w$ is the wheel radius; and $\theta$ is the road slope, which is not considered in this paper. Note that vehicle velocity and acceleration can be estimated accurately by utilizing a robust regression method and an adaptive Kalman filter that incorporates vehicle heading alignment [41].
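For illustration, the following sketch evaluates the longitudinal-dynamics demand calculation above; the vehicle parameters are assumed values, not the calibration of the studied PHEV.

```python
import numpy as np

def demand(v, a, M=1800.0, g=9.81, f=0.012, rho=1.2, A=2.3, Cd=0.30, rw=0.32, theta=0.0):
    """Force, power, and torque demand from longitudinal dynamics.
    All parameter values are illustrative assumptions, not the paper's calibration."""
    F = M * a + M * g * f * np.cos(theta) + 0.5 * rho * A * Cd * v**2 + M * g * np.sin(theta)
    P = F * v    # power demand at the wheels [W]
    T = F * rw   # torque demand at the wheels [N*m]
    return F, P, T

# Example: 20 m/s cruise with 0.5 m/s^2 acceleration on a flat road
print(demand(v=20.0, a=0.5))
```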
In the series mode and the parallel mode, the energy flow is described by the corresponding power-balance equations, where $P_{m1}(t)$ and $P_{m2}(t)$ are the electric power of MG1 and MG2, respectively; $\omega_1(t)$, $\omega_2(t)$, and $\omega_e(t)$ are the rotational speeds of MG1, MG2, and the engine, respectively; $T_1(t)$ and $T_2(t)$ are the torques of MG1 and MG2, and $T_e(t)$ is the engine torque; $i_1$ is the transmission ratio of the gearbox to MG1; and $i_2$ is the final ratio of the gearbox from MG2 to the wheels.
The energy of a multi-mode PHEV comes from the battery and the engine (fuel tank), and the total power loss mainly consists of the engine loss, $P_{e,loss}(t)$, and the battery loss, $P_{b,loss}(t)$, calculated from the engine and battery loss models, where $P_{loss}$ is the total power loss, $H_f$ is the heating value of the fuel ($H_f$ = 43.5 kJ/g), and $R_{int}$ is the equivalent internal resistance in the battery model.

a) Engine model
The engine model determines the fuel consumption rate $\dot{m}_f$ (g/s) from a 2D look-up table as a function of the engine speed, $\omega_e(t)$, and the engine torque, $T_e(t)$:

$$\dot{m}_f(t) = f_{eng}\big(\omega_e(t),\, T_e(t)\big)$$
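As an illustration of such a map-based model, the sketch below performs a bilinear 2D look-up of the fuel rate; the grid points and map values are placeholders, not the engine data used in this study.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Placeholder engine map: speed grid [rad/s], torque grid [N*m], fuel rate [g/s]
speed_grid = np.array([100.0, 200.0, 300.0, 400.0])
torque_grid = np.array([20.0, 60.0, 100.0, 140.0])
fuel_map = np.array([[0.3, 0.8, 1.4, 2.1],
                     [0.5, 1.2, 2.0, 3.0],
                     [0.8, 1.7, 2.8, 4.1],
                     [1.2, 2.4, 3.8, 5.5]])

fuel_rate = RegularGridInterpolator((speed_grid, torque_grid), fuel_map)

# Fuel consumption rate at 250 rad/s and 80 N*m, interpolated from the placeholder map
print(fuel_rate([[250.0, 80.0]]))
```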

b) Motor models
The power demands of MG1 and MG2 are modelled based on two quasi-static energy efficiency maps, $\eta_1$ and $\eta_2$, respectively.

c) Battery model
The battery model is established based on an equivalent circuit, where $P_b(t)$ is the output power of the battery pack during charging and discharging.
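A minimal sketch of an internal-resistance (equivalent-circuit) battery update is given below; the open-circuit voltage, internal resistance, and capacity are assumed values for illustration only.

```python
import math

def battery_step(soc, P_b, dt=1.0, V_oc=350.0, R_int=0.1, Q_Ah=37.0):
    """Advance the SoC one step for an internal-resistance battery model.
    P_b > 0 means discharging; all parameter values are illustrative assumptions."""
    # Solve V_oc*I - R_int*I^2 = P_b for the battery current
    I = (V_oc - math.sqrt(V_oc**2 - 4.0 * R_int * P_b)) / (2.0 * R_int)
    P_loss = R_int * I**2                       # battery power loss
    soc_next = soc - I * dt / (Q_Ah * 3600.0)   # coulomb counting
    return soc_next, P_loss

print(battery_step(soc=0.30, P_b=20e3))
```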

The multiple-input and multiple-output (MIMO) energy management controller
A multiple-input and multiple-output (MIMO) controller, as shown in Fig. 2, is developed to manage the energy flow of the studied vehicle. Taking the battery SoC and the overall torque demand as the control inputs, the MIMO controller calculates the torque demands for MG1, MG2, and the engine, respectively, to provide sufficient total torque to drive the vehicle while maintaining the battery SoC.

Fig.2. MIMO control architecture
The core of the MIMO control is to resolve an optimization problem defined as follows:

$$\min_{T_{c1}(t),\,T_{c2}(t)} \;\{\,P_{loss},\; \Delta SoC\,\} \qquad \text{s.t. the vehicle energy flow models and powertrain constraints}$$

where the overall power loss, $P_{loss}$, and the SoC difference, $\Delta SoC$, are the two objectives to be minimized; the MG1 torque command, $T_{c1}(t)$, and the MG2 torque command, $T_{c2}(t)$, are the optimization variables to be determined during real-time control. The optimization is subject to the vehicle energy flow models and other physical constraints of the powertrain system and its subsystems.

Multi-agent learning with the DDPG algorithm
To resolve the optimization problem defined in Eq. (11), this paper proposes a hand-shaking multi-agent learning scheme, as shown in Fig. 3, in which two DDPG agents work together to minimize the fuel consumption and the battery usage simultaneously through the torque control of MG1 and MG2.
Each learning agent has an actor-network and a critic-network. In each time interval, the agent starts learning with an observation of the state variables, uses the actor-network to generate a control action for the vehicle system, and then evaluates a reward from the system feedback. The critic-network is trained to update the actor-network using a policy gradient algorithm based on the recorded state, action, and reward variables.

Fig.3. DDPG-based EMS with multi-agent learning
The differences between the multi-agent system and the conventional single-agent system (baseline) are summarized in Table 1, and the main components of both learning systems are described as follows. The main difference between the two systems is the number of learning agents: the single-agent system has only one learning agent, whereas the multi-agent system has two or more. A learning agent is a multi-input and single-output (MISO) control model with the capability of self-learning for the development of a control policy. It can be developed based on Q-learning, deep Q-learning, or other RL algorithms. In this study, the environment states and action variables vary continuously; therefore, DDPG agents are used. The proposed multi-agent system includes two agents with different reward preferences for the two optimization objectives, and each agent generates the control signal for MG1 and MG2, respectively.
Table 1. Differences between the single-agent system and the multi-agent system
  Number of learning agents: one (single-agent) vs. two (multi-agent)
  Reward function: weighted sum (single-agent) vs. two unique functions for different preferences (multi-agent)

DDPG-based learning process
This paper implements the deep deterministic policy gradient algorithm to optimize the control policy with an outer loop and an inner loop, as shown in Fig. 4. In the inner loop (red line), the agent interacts with the vehicle driving in the real world with a sampling time of 1 s. Once the vehicle has finished driving for the period defined by a driving cycle, the agent updates the policy in the outer loop (cyan line) [37].

Fig.4. Control policy optimization using the DDPG algorithm
A policy, $\mu(s)$, maps the state variables $s$ to the control action $a$. If the policy, $\mu: S \rightarrow A$, is deterministic with the actor-network, the policy update process can be formulated as statistical learning of the critic-network, $Q(s,a)$, by

$$Q^{\mu}(s_t, a_t) = \mathbb{E}\big[\,r_t + \gamma\, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big)\,\big]$$

where $r$ is the reward variable and $\gamma$ is the discount factor. In the statistical learning process, an experience replay buffer of size $R$ is used to store the transitions, which form a time-series batch of states, actions, and rewards of the form $(s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, \dots, s_{t+N}, a_{t+N}, r_{t+N})$. Since DDPG is an off-policy algorithm, the buffer size can be very large, allowing the system to learn from a large number of uncorrelated transitions. To reduce the computational load, this paper implements a minibatch method that randomly selects $N$ samples ($N \ll R$) from the experience buffer to train the actor and critic networks at different times.
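The experience replay and minibatch sampling described above can be sketched as follows; the buffer capacity, batch size, and transition layout are assumptions in line with common DDPG practice rather than the exact implementation of this study.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay buffer storing (s, a, r, s_next) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Randomly selected, largely uncorrelated transitions for off-policy updates
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```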
Once a new batch of data is collected from real-world driving, this paper implements the temporal difference (TD) method to estimate the parameters of the critic network, $\theta^Q$, by minimizing a loss function, $L$, defined by

$$L(\theta^Q) = \frac{1}{N}\sum_{i}\Big(y_i - Q\big(s_i, a_i \,\big|\, \theta^Q\big)\Big)^2, \qquad y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}\,|\,\theta^{\mu'})\,\big|\,\theta^{Q'}\big)$$
Because the critic network $Q(s,a\,|\,\theta^Q)$ is both updated and used to estimate the target value, the Q update is prone to instability in many environments. Target networks, $Q'(s,a\,|\,\theta^{Q'})$ and $\mu'(s\,|\,\theta^{\mu'})$, are therefore employed for the critic and actor networks, respectively, to calculate the target values and to provide momentum to the learning process by introducing a factor $\tau \ll 1$ that weights the parameters as $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. Exploration is one of the most challenging problems of learning in continuous action domains. An exploration policy $\mu'$ is constructed by combining the actor policy with a noise process $\mathcal{N}$:

$$\mu'(s_t) = \mu(s_t\,|\,\theta^{\mu}) + \mathcal{N}_t$$

An Ornstein-Uhlenbeck (OU) process is chosen to produce the noise:

$$dx_t = \theta\,(\mu - x_t)\,dt + \sigma\, dW_t$$

where $W_t$ is a Wiener process with normally distributed increments; the decay rate $\theta > 0$ (how strongly the system reacts to perturbations) and the variation $\sigma > 0$ of the noise should be set and tuned, as shown in Fig. 5. Because the OU process is time-series related, it generates temporally correlated exploration between consecutive action selections of the RL agent, which improves the exploration efficiency of the control system.

Fig. 5. Ornstein-Uhlenbeck (OU) noise process under different parameters
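To illustrate the temporally correlated exploration noise of Eq. (17) and Fig. 5, a minimal Euler-discretised sketch of the OU process is given below; the mean level, decay rate, variation, and seed are illustrative values, not the tuned settings used in this study.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: dx = theta*(mu - x)*dt + sigma*dW."""
    def __init__(self, size=1, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        dW = self.rng.normal(0.0, np.sqrt(self.dt), size=self.x.shape)
        self.x = self.x + self.theta * (self.mu - self.x) * self.dt + self.sigma * dW
        return self.x

noise = OUNoise()
trace = [noise.sample()[0] for _ in range(5)]   # temporally correlated samples
print(trace)
```

Because successive samples are correlated, adding this noise to the actor output explores the torque commands more smoothly than independent Gaussian noise would.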

States and actions
In this study, both the single-agent system and the multi-agent system monitor the vehicle torque demand and the battery SoC as the state variables in a two-dimensional vector, $s(t) = [T_d(t), SoC(t)]$, where $T_d(t)$ is the torque demand of the vehicle and $SoC(t)$ is the battery state-of-charge level.
Since the single-agent system can only output a single control action, $a(t)$, this study uses the DDPG algorithm to compute the control command of MG1, $T_{c1}(t)$, based on a deterministic policy $\mu_{sa}$, where $\theta_{sa}(t)$ is a parameter matrix representing the control policy that is updated over time.
The control commands $T_{c1}(t)$ and $T_{c2}(t)$ of MG1 and MG2 and the control command $T_{ce}(t)$ of the engine can then be calculated, where $T_{1\_max}$ is the maximum torque of MG1; $T_{e\_max}$ is the maximum torque that can be supplied by the engine; $T_d$ is the torque demand for driving and braking the vehicle; and $T_{e,out}$ is the torque of the output shaft provided by the engine when the MG2 output torque cannot meet the total torque requirement.
For the proposed multi-agent system with two DDPG agents, the output actions can be expressed as $a(t) = [a_1(t), a_2(t)]$, where $a_1(t)$ is the output of the first DDPG agent for control of MG1 and $a_2(t)$ is the output of the second DDPG agent for MG2. The engine control command can then be calculated using Eq. (20).
Both $a_1(t)$ and $a_2(t)$ are calculated following a rolling process of exploration and exploitation [18].
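To make the action-to-torque mapping concrete, the sketch below scales two normalised agent outputs by the motor torque limits and lets the engine supplement MG2 when the demand cannot otherwise be met, in the spirit of Eq. (20); the split rule, torque limits, and gear ratio are illustrative assumptions rather than the exact equations of this study.

```python
import numpy as np

def torque_split(a1, a2, T_dem, T1_max=80.0, T2_max=250.0, Te_max=180.0, i2=3.5):
    """Map normalised agent actions a1, a2 in [-1, 1] to torque commands.
    The split rule and all limits are illustrative assumptions."""
    T1 = np.clip(a1, -1.0, 1.0) * T1_max           # MG1 (generator) torque command
    T2 = np.clip(a2, -1.0, 1.0) * T2_max           # MG2 (traction) torque command
    # Engine supplements the driveline when MG2 alone cannot meet the demand
    shortfall = T_dem / i2 - T2
    Te = float(np.clip(shortfall, 0.0, Te_max))
    return T1, T2, Te

print(torque_split(a1=-0.3, a2=0.8, T_dem=900.0))
```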

Reward functions and hand-shaking design
The single-agent system implements a weighted-sum method to incorporate the optimization objectives, where $\kappa$ is a scaling factor; $SoC_{ref}$ is the target battery SoC value to be maintained during driving; and $w$ is a conditional weight factor. The conditional weight factor, $w$, allows the DDPG agent to give higher priority to minimizing fuel consumption when the SoC level is high [32].
The proposed multi-agent learning system provides a more comprehensive evaluation of the optimization objectives by incorporating a global reward, $r_g(t)$, and local rewards, $r_{l,1}$ and $r_{l,2}$, in a hand-shaking manner through the relevance ratio $\varepsilon_{rel}$, where $r_1$ and $r_2$ are the rewards for the first and the second DDPG agent, respectively. Since minimizing the power loss is the main optimization objective, the power loss is the element of the global reward function. Two local reward functions are designed to balance the usage of the engine and the battery between the two DDPG agents; $r_{l,1}$ and $r_{l,2}$ are allocated to the first and the second DDPG agent, respectively, where $\kappa$ is the scaling factor and $w$ is the weighting factor. In this research, $\kappa$ and $w$ in the multi-agent system are set the same as in the single-agent system.
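A possible reading of the hand-shaking reward design is sketched below as a convex combination of the global and local rewards through the relevance ratio; this blending form, the local reward expressions, and all coefficient values are assumptions for illustration and may differ from the exact formulation used in this study.

```python
def handshake_rewards(P_loss, fuel_rate, soc, soc_ref, eps_rel=0.2, kappa=1.0, w=10.0):
    """Blend a shared global reward with agent-specific local rewards.
    The convex-combination form and coefficients are illustrative assumptions."""
    r_global = -P_loss                            # shared objective: minimise power loss
    r_local_1 = -kappa * fuel_rate                # agent 1 preference: engine/fuel usage
    r_local_2 = -w * (soc_ref - soc) ** 2         # agent 2 preference: SoC sustaining
    r1 = eps_rel * r_global + (1.0 - eps_rel) * r_local_1
    r2 = eps_rel * r_global + (1.0 - eps_rel) * r_local_2
    return r1, r2

print(handshake_rewards(P_loss=5.0, fuel_rate=1.2, soc=0.28, soc_ref=0.30))
```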

Results and Discussion
A software-in-the-loop (SiL) testing platform is built for testing and validation on a workstation with an i7-7600U CPU and 64 GB RAM. The models for the SiL test, including the vehicle plant model and the MIMO control model, are developed in MATLAB/Simulink R2022a. The impacts of the learning agent design on MIMO control performance are first studied and their sensitivities ranked. A parametric study on the relevance ratio, $\varepsilon_{rel}$, is then conducted to attain the best hand-shaking strategy for multi-agent control optimization. The performance of the single-agent system and the multi-agent system is compared on a multi-mode PHEV driving under a training cycle and two testing cycles.
The training driving cycle is built with elements generated from four standard driving cycles, namely Artemis Rural, RTS95, UDDS, and WLTP, as illustrated in Fig. 6. It has four phases: Phase 1 represents the low-speed region of the Artemis Rural cycle; Phase 2 involves the maximum acceleration of the RTS95 cycle; Phase 3 represents the medium-speed region of the UDDS cycle; and Phase 4 involves the high-speed region of the WLTP cycle. In each learning episode, the four phases are reorganized randomly to provide learning noise for evaluating the robustness of the learning.

a) Network layers
Since the critic network is used to obtain an accurate estimate of the Q value for the evolution of the actor-network, this paper first focuses on the design of the critic network. Critic networks with the number of layers varied from 2 to 7 in steps of 1 are investigated as Group 1, and the results are compared in Fig. 7. Based on Fig. 7(c), this paper suggests that the critic network with the parameters of Group 1.2 is the best with respect to three aspects: convergence episodes, computation time, and fuel consumption.
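The layer-count study of Group 1 can be reproduced conceptually with a parameterised critic builder, as in the sketch below; PyTorch is assumed here, and the layer width, activations, and input dimensions are illustrative rather than the exact settings used in this study.

```python
import torch
import torch.nn as nn

def build_critic(state_dim=2, action_dim=1, n_layers=3, width=64):
    """Critic Q(s, a) with n_layers hidden layers of equal width (illustrative settings)."""
    layers, in_dim = [], state_dim + action_dim
    for _ in range(n_layers):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))   # scalar Q-value output
    return nn.Sequential(*layers)

# Critic variants with 2 to 7 hidden layers, as in the Group 1 study
critics = {n: build_critic(n_layers=n) for n in range(2, 8)}
q = critics[3](torch.zeros(1, 3))         # forward pass on a dummy (state, action) pair
print(q.shape)
```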

b) Learning rate
Five groups of tests are designed to investigate the impact of the learning rates on the MIMO control performance. Groups 2.1, 2.2, and 2.5 use equal learning rates for the actor and critic networks, with values of $1\times10^{-4}$, $1\times10^{-3}$, and $1\times10^{-5}$, respectively. Groups 2.3 and 2.4 use different settings for the actor and critic networks: in Group 2.3, the actor learning rate is $1\times10^{-4}$ and the critic learning rate is $1\times10^{-3}$; in Group 2.4, the actor learning rate is $1\times10^{-3}$ and the critic learning rate is $1\times10^{-4}$. The other network settings are the same as in Group 1.2. The learning and MIMO control performance are compared in Fig. 8 and Table 2. As shown in Fig. 8(a) and (b), the learning rate is crucial for MIMO control optimization and has a marked impact on learning through the update of both networks. Theoretically, a learning agent with a higher learning rate learns faster, which explains the behaviour observed among the Group 2 settings. According to Table 2, the MIMO controller with the settings of Group 2.3 is the best: it requires the least computation time and reaches the convergence point fastest, although its fuel consumption is 0.9% higher than that of Group 2.2.

c) Policy noise
As defined in Eq. (17), the OU process, which governs the degree of exploration during reinforcement learning, is determined by the weighting value of the Wiener process and the variation of the noise. Using the network settings of Group 1.2 and Group 2.3, three groups of tests with different values of these two parameters, as listed in Table 3, are conducted. The three policy-noise combinations are implemented in the studied vehicle driving under the learning cycle, and the average reward, battery SoC, and fuel economy (L/100 km) are compared in Fig. 9 and Table 4. The results indicate that the controller with the setting of Group 3.1 is the best, since it requires less time to converge to the best system performance, which leads to the best fuel economy and battery charge sustaining. The results also show that the weighting value of the Wiener process affects the learning speed more significantly than the variation of the OU process.

d) Importance analysis
A sensitivity level is introduced to quantify the contribution of each influencing factor, and the results are summarized in Table 5. The results highlight that the learning rate is the most important influencing factor on learning performance in terms of computation time (CT), the number of episodes for convergence (CE), and fuel economy (FE). Policy noise is the second contributor to CE, while the number of network layers is the second contributor to FE. Policy noise makes no contribution to CT.

Hand-shaking of the multi-agent systems with different relevance ratios
Following the study in Section 4.1, the two DDPG agents in the hand-shaking learning system share the same network parameters. To determine the unified setting for the relevance ratio, multi-agent systems with $\varepsilon_{rel}$ values of 0, 0.2, 0.4, 0.6, and 0.8 are examined. The two agents develop in opposite directions in the system with an $\varepsilon_{rel}$ value of 0.4. In the system with an $\varepsilon_{rel}$ value of 0.8, although both agents show the same tendency, the reward values of the multi-agent system decrease over time. Most of the studied cases ($\varepsilon_{rel}$ = 0, 0.2, and 0.6) demonstrate hand-shaking in the multi-agent system, in which both agents share the same tendency and their reward values increase over time. With $\varepsilon_{rel}$ values of 0, 0.2, and 0.6, the two agents perform well, achieve fast convergence at about 40 episodes, and move towards the common goal. Therefore, the learning performance of these cases is investigated further with more training time.
The learning performance of the two agents with relevance ratios of 0, 0.2, and 0.6 is compared in Fig. 11. The results suggest that when the two agents have a relatively high level of relevance, i.e., $\varepsilon_{rel}$ = 0.6, although they follow the same trend during training, both tend to be unstable and thus cannot converge to a steady point. By comparing the average value and the standard deviation of the rewards obtained during training with $\varepsilon_{rel}$ values of 0 and 0.2 in Table 6, the study suggests that the multi-agent system with an $\varepsilon_{rel}$ value of 0.2 is the best, since it achieves the highest average reward with a lower level of variation.

Comparison with single-agent system
To demonstrate the advantage of the proposed multi-agent system, a comparison study using a single-agent system as the baseline is conducted on the studied multi-mode PHEV. The configuration of the single-agent system is summarized in Section 3, and the DDPG algorithm with the same settings as the multi-agent system is used for its online optimization. The battery SoC sustaining error and the fuel saving rate are defined as the evaluation metrics. The results indicate that the multi-agent system can maintain the battery SoC error within 5% when the initial SoC is relatively low (below 0.30). This means the PHEV controlled by the multi-agent system complies with the requirement of most vehicle testing regulations around the world. The multi-agent system outperforms the single-agent system in all the studied cases in terms of using less fuel and battery energy. This is because the multi-agent system generates two control variables, allowing MG1 and MG2 to be controlled independently for their best performance. Up to 4% fuel can be saved, and the best fuel economy is achieved under the WLTC cycle with an initial battery SoC of 0.28. For the vehicle driving under the WLTC cycle with an initial SoC of 0.30, 3.7% fuel is saved with an SoC sustaining error of 1.33%, which is far better than the industry standard (5%).
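For reference, the sketch below computes the two comparison metrics under commonly used definitions (terminal SoC deviation from the target, and relative fuel reduction versus the baseline); these formulas and the example inputs are assumptions, not the exact expressions or data of this study.

```python
def soc_sustaining_error(soc_final, soc_target):
    """Absolute terminal SoC deviation from the target, in percent (assumed definition)."""
    return abs(soc_final - soc_target) * 100.0

def fuel_saving_rate(fuel_baseline, fuel_proposed):
    """Relative fuel reduction of the proposed EMS versus the baseline, in percent (assumed definition)."""
    return (fuel_baseline - fuel_proposed) / fuel_baseline * 100.0

# Hypothetical inputs chosen only to reproduce the reported orders of magnitude
print(soc_sustaining_error(0.2867, 0.30))   # ~1.33 % SoC sustaining error
print(fuel_saving_rate(5.4, 5.2))           # ~3.7 % fuel saving
```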

Conclusions
This paper studied a new MIMO control method for the multi-mode PHEV based on MADRL. By introducing a relevance ratio, a hand-shaking strategy has been proposed to enable two learning agents to work collaboratively under the MADRL framework using the DDPG algorithm. Through sensitivity analysis, parametric study, and software-in-the-loop testing, the conclusions drawn from the investigation are as follows:

1)
The learning rate of the DDPG agents in the proposed MADRL-based EMS is the most significant factor influencing learning performance, including computation time, the number of episodes for convergence, and fuel economy.

2)
Hand-shaking among the DDPG agents is achievable by tuning the relevance ratio. The optimal relevance ratio is 0.2 for control of the studied multi-mode PHEV. A smaller relevance ratio leads to a lower learning speed, while a higher value makes the system unstable.

3)
The proposed MADRL method outperforms the baseline single-agent method under all studied driving cycles in terms of reducing energy consumption and battery SoC sustaining error. Up to 4% fuel can be saved, with the battery SoC error well controlled within 5%, compared to the single-agent method.