Tactical Decision-making for Autonomous Driving using Dueling Double Deep Q Network with Double Attention

Decision-making remains a significant challenge for fully autonomous driving. Using deep reinforcement learning (DRL) to solve autonomous driving decision-making problems is a recent trend. A common method is to encode surrounding vehicles in a grid to describe the state space and help the DRL network extract scene features. However, during human driving, surrounding vehicles at different positions contribute differently to the decision. Meanwhile, the network struggles to fully extract useful features from a sparse state, which can also lead to catastrophic actions. Therefore, this work proposes a spatial attention module that calculates different weights to represent the different contributions to the decision-making result, and a channel attention module that fully extracts useful features from sparse state features. These two attention modules are integrated into a dueling double deep Q network, named D3QN-DA, which serves as the high-level decision maker of a hierarchical control structure based decision-making system. To improve agent performance, an emergency safe checker is introduced into this system. The agent is trained and tested in simulation with a reward function designed for safety and efficiency. The experimental results show that the proposed method increases the success rate by 54% and the average exploration distance by 30%, making the decision-making process of autonomous driving safer and more intelligent.


I. INTRODUCTION
In the last decades, autonomous driving has developed greatly with the increase of computing power and algorithmic innovation. It is expected to bring many benefits, such as a reduction of accidents and better energy efficiency [1], [2]. However, fully autonomous driving is still a challenge because of the strict requirement for efficient and safe decision-making in complex and uncertain environments. The goal of the decision-making system in an autonomous vehicle is to give safe and appropriate action instructions according to the surrounding environment information [3]. The decision-making task is commonly divided into three types: strategic, tactical, and operational, which are also named navigation, guidance, and stabilization [1], [4], [5]. In this paper, we focus on a high-level tactical decision-making system, which takes a high-dimensional state as input and outputs discrete actions, such as lane changes and speed-ups, during driving. The inputs of this system are the states of the ego vehicle and surrounding vehicles, the High Definition (HD) map, and the target reference path; the outputs are the actions for the current moment, used to plan a trajectory [6].
Traditionally, rule-based decision-making algorithms are common solutions for autonomous driving [7], [8]. [4], [9], [10] applied rule-based algorithms successfully in the DARPA Urban Challenge. However, the drawback of rule-based decision-making is that it is not flexible enough to deal with complex situations. When new rules are added, old rules may be affected, undone, or rewritten, which makes the system fragile. Moreover, when different driving tasks are combined, the rules can be combined in different ways, which couples different rules and can even make them contradict each other. These problems limit the generalization of such algorithms and make it difficult to scale them to complex and uncertain environments. To solve these problems, machine learning based methods have been developed for decision-making in complex environments. Since the decision-making process does not have a definite standard, i.e., a ground truth, to evaluate the performance of the current action, it cannot adopt the supervised machine learning that is usually used in environment perception. The problem that Deep Reinforcement Learning (DRL) solves is how an agent chooses actions in an environment to obtain the maximum reward, which coincides with the autonomous driving decision-making problem [11]. Many studies have applied DRL to autonomous driving. [12] summarizes a DRL framework for autonomous driving which proves the feasibility of DRL for decision-making.
DRL has two main approaches to determine the optimal actions: value-based and policy-based algorithms [13]. The optimization goal of value-based DRL is to find the action with the maximum action value, and it usually outputs discrete actions, while the target of policy-based methods is to find a policy that achieves the maximum reward, usually outputting probability distributions over actions to deal with continuous action spaces [14]. Value-based DRL is therefore well suited to the tactical decision-making problem of autonomous driving. Deep Q Network (DQN) is a widely used value-based DRL algorithm, where Q denotes the action value [15]. [16] pioneered this concept, and the authors demonstrated the powerful performance of DQN in games, controlling agents that play at a human level, in [17]. However, since DQN selects the maximum action value as the target Q value and optimizes toward a moving target, overestimation and difficulty in convergence become its pivotal problems. [18] proposes Double DQN (DDQN) to solve the overestimation by decoupling the selection of the action from the calculation of its action value. [19] uses a dueling network, named Dueling DQN, with an action advantage function to reduce fluctuations in the action value and improve the stability of DQN. [20] integrates DDQN and Dueling DQN as Dueling Double DQN (D3QN). In this paper, we adopt D3QN as the framework for tactical decision-making in autonomous driving.
A valid method of encoding state information and extracting features is especially important to help a DRL-based network learn effective features. [21] uses a vector based on a fixed number of vehicles and their characteristics to construct the state space and trains agents with double action spaces using a Fully Connected Neural Network (FCNN), but this method does not consider the spatial relationship between the ego vehicle and other vehicles. [22] uses a 2-dimensional (2D) occupancy grid map to represent the environment state as the observation space and trains an agent to select lane-changing actions in highway scenarios. When a network extracts features from an occupancy grid, every vehicle is treated with the same importance when generating the current decision. In human driving, however, different surrounding vehicles contribute differently to the decision, and we pay different amounts of attention to different vehicles. The attention mechanism, which assigns different attention according to different attributes, has been used in object detection [23], [24], object segmentation [25], and natural language processing for machine translation [26]. In an object detection task, every feature point contributes differently to the final detection result. Inspired by these algorithms, we present a spatial attention based module to model this mechanism and imitate the human driving process.
In addition, the information of surrounding vehicles, such as velocity, heading, and position, is encoded as the channel features of the occupancy grid map. [27] uses the position, speed, and heading of a leading vehicle, the ego vehicle, and a following vehicle in the same lane as channel features to select decision actions. [28] encodes the positions and speeds of all surrounding vehicles as the input of the network. However, different features also contribute differently to decision-making depending on the surrounding vehicles. If the network cannot abundantly and effectively extract features from the sparse state space, it will produce inaccurate action predictions and catastrophic exploratory actions. Hence, we present a channel attention module that extracts features effectively, making the output actions safer and more accurate by reweighting object features in the feature map. Finally, we integrate the spatial attention module and the channel attention module into D3QN, named D3QN with double attention (D3QN-DA), as the high-level decision maker in this research.
In this paper, we present a hierarchical control structure based system to control the ego vehicle interacting with the simulation environment. The system contains a high-level decision maker, a middle emergency safe checker, and a low-level executive controller. To establish a strong baseline that better validates the modules we propose, a well-designed reward function based on safety and efficiency is introduced to guide network learning.
The contributions of this study are as follows: 1) We design a hierarchical control structure based system to control the ego vehicle interacting with the simulation environment. The system contains a high-level decision maker, a middle emergency safe checker, and a low-level executive controller. 2) An adaptive state encoder in the hierarchical control structure system is used to effectively encode surrounding vehicles. 3) We model the attention process of human driving, including a spatial attention module that models the spatial relationship among vehicles and a channel attention module that calculates channel weights to extract features abundantly and effectively. We integrate these two attention modules into D3QN to make safer and more efficient decisions for autonomous driving. 4) We train and test the model in a simulation environment with a reward function designed for safety and efficiency. Then we qualitatively and quantitatively compare the results with DRL baselines using two kinds of metrics: a reward metric and a capability metric. The results show that the proposed method is safer and smarter than the baseline models.
The rest of this paper is organized as follows. The development and optimization goals of D3QN are introduced in Section II. In Section III, we introduce the hierarchical control structure system with a D3QN-DA based high-level decision maker and an emergency safe checker. In Section IV, we describe the agent training environment, conduct a comparative experiment, and evaluate the effectiveness of the proposed method. Finally, the conclusions are given in Section V.

II. BACKGROUND
Reinforcement Learning means that an agent learns, through trial and error, which action in a specific situation yields the most reward. The agent interacts with an environment and tries to learn a policy π that maximizes the potential future reward [29]. The policy is the process by which the agent chooses an action a in state s; the environment then returns the next state s' and a reward r after the action a is executed. Almost all reinforcement learning problems can be modeled as a Markov Decision Process (MDP) [30]; that is, the current state contains all relevant history, and the state s_{t+1} at time t+1 is determined only by the state s_t and action a_t at time t. An MDP is defined as a 5-tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P is the state transition probability function from s to s', R is the reward function, and γ ∈ [0, 1] is the discount factor. At every time step t, the goal of the agent is to obtain the maximum discounted future return as in Eq. (1):

R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i},    (1)

where r_{t+i} is the reward at time step t+i. Q-learning is one of the classic reinforcement learning algorithms [15], in which the agent is trained to find the action with the optimal action value function Q^*(s, a) defined in Eq. (2):

Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right],    (2)

where a' is the action at the next time step and Q^*(s', a') is its optimal action value. When the state space is discrete and finite, we can easily tabulate the Q function, but the state spaces of most problems are infinite and continuous. DQN was proposed to solve this situation.
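For the discrete, finite case just mentioned, a minimal tabular Q-learning update can be written as follows; this is an illustrative sketch rather than part of the original paper, and the learning rate alpha is a placeholder:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update toward the optimal action value of Eq. (2).
    Q is an (n_states, n_actions) array; alpha and gamma are placeholder values."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```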
DQN was produced to handle the curse of high dimensionality by combining reinforcement learning and deep learning. At every time step t, the agent's interaction data with the environment are collected as a 4-tuple e_t = (s_t, a_t, r_t, s_{t+1}) in an experience replay dataset D = {e_1, e_2, ..., e_n}. DQN randomly samples a batch of B tuples from the replay dataset D to guarantee the independence of the training data. The sampled tuples are fed into the network as training data to train the agent to approximate the action value function [17]. The action value function is defined as Q(s, a; θ_i) to approximate Q^*(s, a), where θ_i are the weights of the Q network at iteration i. DQN has two networks with the same structure: a target Q net F_t to calculate the target Q value Q_t and an estimated Q net F_e to estimate the Q value Q_e in the current state. The goal of DQN is to train suitable weights that minimize the loss L_i(θ_i) between Q_t and Q_e, as shown in Eq. (4):

L_i(θ_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; θ^-) - Q(s, a; θ_i) \right)^2 \right],    (4)

where θ^- are the parameters of F_t, which are updated with the parameters of F_e every C episodes. DQN then uses stochastic gradient descent to optimize the loss function. The advantage of the double-network structure is that it provides a temporarily fixed target and avoids the instability caused by a non-stationary optimization target. DQN uses an ε-greedy policy to deal with the trade-off between exploration and exploitation: the agent selects a random action with probability ε and otherwise chooses the action with the highest action value.
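The update described above can be sketched in PyTorch-style code as follows; this is a minimal illustration rather than the authors' implementation, and a terminal flag done is added for completeness:

```python
import random
import torch
import torch.nn.functional as F

def select_action(q_net, state, epsilon, n_actions):
    # epsilon-greedy: random action with probability epsilon,
    # otherwise the action with the highest estimated Q value
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()

def dqn_loss(q_net, q_target, batch, gamma):
    # batch of tuples (s_t, a_t, r_t, s_{t+1}, done) sampled from the replay buffer D
    s, a, r, s_next, done = batch
    q_e = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                     # estimated Q value
    with torch.no_grad():
        q_t = r + gamma * (1 - done) * q_target(s_next).max(dim=1).values  # target Q value (Eq. (4))
    return F.mse_loss(q_e, q_t)
```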
DDQN was proposed to address the overestimation of action values in DQN, which arises because F_t takes the maximum action value when forming Q_t. DDQN removes this overestimation by decoupling the selection of the action from the calculation of Q_t. Instead of taking the maximum Q value directly in F_t, DDQN selects the action with the maximum Q value from F_e and then evaluates it in F_t, as shown in Eq. (5):

Q_t = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; θ_i); θ^-\big).    (5)

Dueling DQN [19] was proposed to improve the stability and performance of DQN from an alternative but complementary direction by restructuring the network. Dueling DQN separates the estimation of state values and state-dependent action advantages. [32] puts forward the advantage function, which expresses the advantage of a certain action a relative to the average in state s and normalizes the Q value, as shown in Eq. (6):

A(s, a) = Q(s, a) - V(s).    (6)
[19] uses Convolutional Neural Network (CNN) layers to extract features and then uses two streams of FCNN to output the value function V(s; θ, β) and the advantage function, an |A|-dimensional vector A(s, a; θ, α), where θ denotes the parameters of the CNN layers and α and β are the parameters of the two FCNN streams. Dueling DQN uses Eq. (7) to calculate the Q value and shares the same input-output interface with the standard DQN:

Q(s, a; θ, α, β) = V(s; θ, β) + \Big( A(s, a; θ, α) - \frac{1}{|A|} \sum_{a'} A(s, a'; θ, α) \Big).    (7)
D3QN [33] integrates DDQN and Dueling DQN to solve the overestimation, instability, and convergence problems of DQN. In this work, we adopt D3QN as a strong baseline framework to verify the proposed model.
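A minimal sketch of the two ingredients of D3QN, the dueling aggregation of Eq. (7) and the double-Q target of Eq. (5), is given below; it is illustrative only and not the authors' network:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregation of Eq. (7): Q = V + (A - mean(A))."""
    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Linear(in_features, 1)               # state value stream V(s)
        self.advantage = nn.Linear(in_features, n_actions)   # advantage stream A(s, a)

    def forward(self, x):
        v = self.value(x)
        a = self.advantage(x)
        return v + a - a.mean(dim=1, keepdim=True)

def d3qn_target(q_net, q_target, r, s_next, gamma, done):
    # Double-Q target of Eq. (5): the online network selects the action,
    # the target network evaluates it.
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = q_target(s_next).gather(1, a_star).squeeze(1)
    return r + gamma * (1 - done) * q_next
```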

III. METHODOLOGY
In this section, we introduce the proposed hierarchical control structure based decision-making system shown in Fig. 1. The system contains a high-level decision maker, a middle emergency safe checker, and a low-level executive controller. The high-level decision maker uses the proposed D3QN-DA to generate an action according to the intentions and behaviours of surrounding vehicles. Section III-A1 introduces the state space with an adaptive state encoder, the action space, and the designed reward function of D3QN-DA. Section III-A2 illustrates the proposed spatial and channel attention modules, which are integrated into D3QN to improve the performance of the agent. In Section III-B, the middle emergency safe checker is introduced to avoid catastrophic exploratory actions; this module restricts dangerous actions through a speed limit. Finally, in Section III-C, we introduce the low-level controller, which executes the actions to control the ego vehicle interacting with the environment.

A. HIGH-LEVEL DECISION MAKER
The ego vehicle interacts with the environment to obtain observation information and the reward of each action so as to make optimal, rational decisions. We treat this decision-making process in autonomous driving as an MDP and use the D3QN-DA algorithm to solve it.

1) MDP
Since the state transition probability function P is expressed by a neural network, the MDP is defined here by a 4-tuple (S, A, R, γ), where γ is a hyperparameter, S is the state representation, A is the action space, and R is the reward function. This tuple defines the input and output of the network, and S, A, R are specified as follows.

State Representation. To build a spatial representation of the surrounding environment, a common method is to encode surrounding objects in a grid map according to each object's centroid [34]. However, such methods do not consider the relationship between vehicle size and grid size; the missing shape characteristics of the vehicles mean the network may not extract the real features fully, resulting in wrong actions. In this work, an adaptive state encoder is proposed to express the shape characteristics of the vehicles. Fig. 2(a) shows the vehicle positions, where (x_0, y_0) are the global coordinates of the ego vehicle and (x_1, y_1), (x_2, y_2) are the global coordinates of surrounding vehicles. Each vehicle is 5 m × 2 m and each grid cell is 2 m × 2 m. Fig. 2(b) shows that vehicles are adaptively encoded into three or four grid cells according to their size and centroid position, which constructs the spatial feature p indicating whether a cell is occupied by a vehicle. Since far-away vehicles are not relevant to the current decision, we only select vehicles within a Region Of Interest (ROI) around the ego vehicle; in this simulation the region is 128 m × 128 m, and all vehicle velocities are limited to [V_l, V_h], set to [30, 40] m/s. The states of all vehicles are defined in Eq. (8).

Action Space. We define the action space A with 5 actions. The ego vehicle chooses one action a at each decision step during autonomous driving, including lane change actions and accelerate or brake actions, as shown in Table 1.

Reward Function. The reward function is essential to guide DRL learning. Unlike supervised learning, DRL has no exact ground truth; its optimization goal is defined only by the reward and the expectation of future discounted rewards. A good reward function guides the agent to produce correct actions and avoid catastrophic ones. In this research, the reward function is designed to achieve higher efficiency while maintaining safety. In terms of safety, the ego vehicle must make appropriate choices to avoid collisions; when a collision occurs, the agent receives the collision punishment R_c = -20. Discrete sparse rewards are not sufficient and may slow learning or lead to ineffective learning, so we address the sparse reward problem by adding a safety guideline. In this guideline, a Dangerous Approach Punishment (DAP) R_a = -20 is introduced into the evaluation mechanism, triggered by the time to collision T_ttc, which is calculated from the distance d_f between the ego vehicle and the leading vehicle in the current lane and the velocities of the front vehicle v_f and the ego vehicle v, as shown in Eq. (9):

T_{ttc} = \frac{d_f}{v - v_f}.    (9)

The time to collision is clipped within [T_max, T_min], set to [3, 1] s in this research. Overall, the reward function R is expressed in Eq. (10).
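As an illustrative sketch of the safety terms described above (not the authors' full Eq. (10), whose efficiency terms are omitted, and with the exact use of the [T_max, T_min] band assumed), the collision and dangerous approach punishments could be computed as follows:

```python
def safety_reward(collided, d_front, v_ego, v_front,
                  r_collision=-20.0, r_dap=-20.0, t_max=3.0):
    """Safety terms of the reward (illustrative): collision punishment R_c plus a
    Dangerous Approach Punishment R_a when the time to collision of Eq. (9),
    T_ttc = d_f / (v - v_f), drops below the threshold."""
    if collided:
        return r_collision
    if d_front is None or v_ego <= v_front:
        return 0.0                       # not closing in on the leading vehicle
    t_ttc = d_front / (v_ego - v_front)
    return r_dap if t_ttc < t_max else 0.0
```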
2) D3QN-DA
In this study, we use the D3QN-DA shown in Fig. 3 to estimate actions according to the current state of the simulation environment. The structure of D3QN-DA contains a feature extraction subnet and an action prediction subnet. The input of the D3QN-DA network is the adaptively encoded grid map M ∈ R^{H×W×C}; the outputs are the Q values of the five actions, and the agent chooses the action a with the maximum Q value. CNNs [35] are widely used to extract features while keeping the position invariance of the inputs. We use three convolutional layers F_1, F_2, F_3 with C_1, C_2, C_3 channels, set to 16, 32, and 64 in this study, to extract features from M. All layers have the same 2 × 2 kernel size, a stride of 2, and a rectified linear unit (ReLU) activation function. We then integrate the double attention (DA) module, and this subnet outputs a weighted feature map M_d. The action prediction subnet is composed of two sequential FCNN layers F_c and a dueling network. The first FCNN layer takes the flattened M_d as input and outputs 256 neurons; the second FCNN layer takes these 256 neurons and outputs 128 neurons; each layer has a ReLU activation function. Finally, via the dueling network that calculates the advantage function and value function in Eq. (7), the network outputs the action value Q(s, a).

Spatial Attention Module. For each surrounding vehicle, the ego vehicle should pay a different amount of attention, as a human driver does. We add a Spatial Attention (SA) module to model this attention as a weight on the decision actions. In the SA module, we reweight the feature map M_3 ∈ R^{H_3×W_3×C_3} obtained from the input grid map M via the three convolutional layers. We express M_3 by slicing it over spatial locations, as shown in Eq. (11):

M_3 = [m_{1,1}, m_{1,2}, \ldots, m_{H_3, W_3}],    (11)
where m_{i,j} ∈ R^{1×1×C_3} is the element at spatial location (i, j). The SA module is computed by a channel squeeze function F_cs(·) followed by a sigmoid activation function. F_cs(·) is realized by a convolutional layer with a 1 × 1 kernel, stride 1, and a single output channel over the C_3 input channels. After these operations, we obtain a projection tensor P ∈ R^{H_3×W_3×1}, the weight for each spatial element. We then use P to reweight each spatial location. The spatial attention based feature map M_s is given in Eq. (12):

M_s = [p_{1,1} m_{1,1}, p_{1,2} m_{1,2}, \ldots, p_{H_3, W_3} m_{H_3, W_3}],    (12)

where p_{i,j} ∈ P is the weight for each spatial element calculated by the spatial attention module.
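A minimal PyTorch-style sketch of the SA module, assuming the layer sizes given above, is shown below; it is illustrative and not the authors' code:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: squeeze the C3 channels into one spatial weight map P
    with a 1x1 convolution and a sigmoid, then reweight every location of M3 (Eq. (12))."""
    def __init__(self, channels):
        super().__init__()
        self.channel_squeeze = nn.Conv2d(channels, 1, kernel_size=1, stride=1)  # F_cs
        self.sigmoid = nn.Sigmoid()

    def forward(self, m3):                            # m3: (B, C3, H3, W3)
        p = self.sigmoid(self.channel_squeeze(m3))    # (B, 1, H3, W3), one weight per location
        return m3 * p                                 # broadcast reweighting -> M_s
```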
Channel Attention Module. The surrounding vehicles have many characteristics, such as velocity, position, orientation, and lane. When the network predicts actions, different features contribute differently to the decision. For example, a close low-speed vehicle ahead of the ego vehicle is more important than a far-away high-speed vehicle in the same lane. If the ego vehicle treats all features with the same importance, the features that really affect the action prediction cannot be fully utilized, and the network may not estimate a precise action, which can cause catastrophic collisions. Following the squeeze-and-excitation principle [36] used in object detection, we introduce a Channel Attention (CA) module that recalibrates features by using global information Z ∈ R^{1×1×C_3} from M_3 to pick significant features and suppress less useful ones. We express M_3 as a combination of channels M_3^*, as shown in Eq. (13):

M_3^* = [m_1^*, m_2^*, \ldots, m_{C_3}^*],    (13)

where m_i^* ∈ R^{H_3×W_3} is the slice of the ith channel. We extract the global information through a spatial squeeze function F_ss(·), which is a global average pooling, as shown in Eq. (14):

z_i = F_{ss}(m_i^*) = \frac{1}{H_3 W_3} \sum_{h=1}^{H_3} \sum_{w=1}^{W_3} m_i^*(h, w).    (14)

Then, Z is passed to a channel excitation function F_ce(·) to calculate the channel weights U ∈ R^{1×1×C_3}. F_ce(·) contains two FCNN layers with parameters W_1, W_2, a ReLU activation δ to increase the non-linear interaction, and a sigmoid function σ, as shown in Eq. (15):

U = F_{ce}(Z) = \sigma\big(W_2\, \delta(W_1 Z)\big).    (15)

Finally, U is used to reweight each channel of M_3^*. The channel attention based feature map M_c is given in Eq. (16):

M_c = [u_1 m_1^*, u_2 m_2^*, \ldots, u_{C_3} m_{C_3}^*].    (16)
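A minimal PyTorch-style sketch of the CA module in the squeeze-and-excitation style is given below; the reduction ratio of the two FCNN layers is an assumption, since the paper does not specify the sizes of W_1 and W_2:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (Eqs. (14)-(16)): global average pooling F_ss, two FC layers
    with ReLU and sigmoid (F_ce), then per-channel reweighting of M3."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # F_ss: Z in R^{1x1xC3}
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W_2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, m3):                  # m3: (B, C3, H3, W3)
        b, c, _, _ = m3.shape
        z = self.squeeze(m3).view(b, c)     # global spatial information
        u = self.excite(z).view(b, c, 1, 1)
        return m3 * u                       # per-channel reweighting -> M_c
```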
Double Attention Module. We integrate the above two modules, the SA module and the CA module, into the D3QN algorithm through element-wise addition as a double attention (DA) module, as shown in Eq. (17):

M_d = M_s \oplus M_c.    (17)
In this work, we place the DA module after the third convolutional layer. We also discuss the placement of the DA module in Section IV-B.
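Building on the two sketches above, the DA module then reduces to a simple element-wise addition of the two branches:

```python
import torch.nn as nn

class DoubleAttention(nn.Module):
    """Double attention of Eq. (17): element-wise addition of the spatial and
    channel attention branches sketched above (M_d = M_s + M_c)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention(channels)

    def forward(self, m3):
        return self.spatial(m3) + self.channel(m3)
```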

B. MIDDLE EMERGENCY SAFE CHECKER
To improve the safety of the ego vehicle, we add a middle emergency safe checker in both the lateral and longitudinal directions. According to the velocity limits [V_l, V_h] and the decision period of D3QN-DA, we set a dangerous distance D_d = 10.0 m, shown as the red circle in Fig. 1.
In the longitudinal safe checker, when the distance between the closest front vehicle in the current lane and the ego vehicle is less than D_d and the ego vehicle velocity v is higher than the front vehicle velocity v_f, the target speed is reduced linearly according to the relative velocity.
Meanwhile, when the action produced by the high-level decision maker is a lane change and there is a vehicle in the dangerous region of the target lane, the lane change is cancelled and the ego vehicle keeps its current lane until the next decision action is generated.
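A minimal sketch of this checker is given below; the action labels and the exact form of the linear speed reduction are assumptions for illustration:

```python
def safe_check(action, ego, front_gap, front_speed, target_lane_clear, d_danger=10.0):
    """Illustrative middle emergency safe checker (not the authors' code).

    ego: dict with 'speed' and 'target_speed';
    front_gap / front_speed: distance and speed of the closest leading vehicle in the current lane;
    target_lane_clear: False if a vehicle occupies the dangerous region of the target lane.
    """
    # Lateral check: cancel a lane change if the target lane's dangerous region is occupied,
    # and keep the current lane until the next decision is generated.
    if action in ("LANE_LEFT", "LANE_RIGHT") and not target_lane_clear:
        action = "IDLE"

    # Longitudinal check: if the leading vehicle is closer than the dangerous distance
    # and slower than the ego vehicle, reduce the target speed linearly with the relative speed.
    if front_gap is not None and front_gap < d_danger and ego["speed"] > front_speed:
        ego["target_speed"] = max(front_speed,
                                  ego["target_speed"] - (ego["speed"] - front_speed))
    return action, ego["target_speed"]
```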

C. LOW-LEVEL EXECUTIVE CONTROLLER
In the low-level controller, the ego vehicle is controlled by a lateral steering angle controller and a longitudinal acceleration controller based on a Proportional Integral Derivative (PID) controller. When the ego vehicle receives a lane change command, it is controlled according to the lateral position error and the ego vehicle orientation error. The lateral position error ∆l is converted to a heading reference ψ_ref by a proportional controller, as shown in Eq. (18):

ψ_{ref} = K_{p,l} \, \Delta l,    (18)

where K_{p,l} is the lateral position proportional gain. Then, a steering angle command θ is generated from the ego vehicle heading ψ, as shown in Eq. (19):

θ = K_{p,h} (ψ_{ref} - ψ),    (19)

where K_{p,h} is the lateral heading proportional gain. The longitudinal motion of the vehicle is also controlled by a proportional controller with the target velocity v_tar and the vehicle speed v, as shown in Eq. (20):

a = K_{p,v} (v_{tar} - v),    (20)

where K_{p,v} is the longitudinal speed proportional gain.
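These proportional laws can be sketched as follows; the gains are placeholders, and the forms of Eqs. (19) and (20) follow the reconstructions above:

```python
def lateral_control(delta_l, psi, k_p_l=0.1, k_p_h=1.0):
    """Proportional lateral control of Eqs. (18)-(19): lateral error -> heading
    reference -> steering command. Gains are placeholder values."""
    psi_ref = k_p_l * delta_l            # Eq. (18)
    steering = k_p_h * (psi_ref - psi)   # Eq. (19)
    return steering

def longitudinal_control(v_target, v, k_p_v=0.5):
    """Proportional speed control of Eq. (20). Gain is a placeholder value."""
    return k_p_v * (v_target - v)        # commanded acceleration
```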

IV. EXPERIMENT AND RESULT
We verify our algorithm in a challenging environment: a devised dynamic, dense highway traffic environment based on [37]. The training and testing details are introduced in Section IV-A. Finally, we evaluate the performance of D3QN-DA and compare the results with related works in Section IV-B.

A. SIMULATION ENVIRONMENT
Since free exploration can increase the diversity of DRL training data, we populate a complex and dynamic 4-lane highway environment based on [37] with up to 25 other vehicles. To better simulate the real world, the acceleration of the other vehicles is modeled by the Intelligent Driver Model (IDM) [38] and their discrete lane change decisions by the MOBIL model [39]. The environment vehicles are also controlled by the same PID controller as the ego vehicle. We use the D3QN-DA algorithm to train the ego vehicle for 2500 episodes of exploration in the environment. Each episode terminates when the ego vehicle collides or when its travelling time exceeds 40 seconds. After training, we evaluate the final model over 100 testing episodes.
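If the environment of [37] is the open-source highway-env package, which matches the IDM/MOBIL setup and the discrete meta-actions described here, a configuration in the spirit of this section might look as follows; the key names are taken from highway-env and vary across versions, and the feature channels listed are assumptions rather than the paper's exact settings:

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway environments)

env = gym.make("highway-v0")
env.unwrapped.configure({
    "lanes_count": 4,                          # 4-lane highway
    "vehicles_count": 25,                      # up to 25 surrounding vehicles
    "duration": 40,                            # episode ends after 40 s or a collision
    "action": {"type": "DiscreteMetaAction"},  # 5 meta-actions: lane changes, faster, slower, idle
    "observation": {
        "type": "OccupancyGrid",
        "features": ["presence", "vx", "vy"],  # example channel features
        "grid_size": [[-64, 64], [-64, 64]],   # 128 m x 128 m ROI around the ego vehicle
        "grid_step": [2, 2],                   # 2 m x 2 m cells
    },
})
obs, info = env.reset()
```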
Training Parameters. We train the agent using the D3QN algorithm as the framework described in Section II. To collect training samples appropriately, the agent must balance exploration and exploitation, so we adopt an ε-greedy behaviour policy. The exploration rate ε decreases from an initial exploration constant ε_i to a final exploration constant ε_f over the final exploration iteration interval I_e. During testing episodes, the exploration rate is set to 0, meaning all actions come from the trained network without random exploration. The discount factor γ is used to discount future potential value when calculating Q_t in the network. At every time step t, the agent's interaction data with the environment are collected in an experience replay dataset D with capacity D_max, as mentioned in Section II, and the network samples a batch of B training data from D. The goal of D3QN is to train suitable weights to minimize the error between Q_t and Q_e. The training hyperparameters are summarized in Table 2.

Reward Metric. Since DRL differs from supervised learning in having no ground truth, the reward is applied to guide the training process; hence the reward is a natural indicator of the agent's performance, and the higher the reward, the better the performance. Fig. 4(a) shows the smoothed total rewards during the 2500 training episodes, and Fig. 4(b) shows the average rewards over the 100 testing episodes. We can see that A-D3QN improves the reward by 4.78% compared with D3QN, which demonstrates the value of the adaptive encoder that encodes the surrounding environment effectively and completely for the network, and AE-D3QN improves the average reward by 22.97%. The attention based D3QN variants, AE-D3QN-SA, AE-D3QN-CA, and AE-D3QN-DA, obtain even higher rewards during training; on the test set, they improve the reward by 24.12%, 24.41%, and 28.39%, respectively.
Fig. 4. (a) Total reward during the training process. (b) Average reward during the test process.

Capability Metric. The second metric is the capability metric, used to analyze the performance of the models quantitatively on the test set. This metric contains four indicators: success rate, average speed, average exploration time, and average exploration distance. The success rate is the proportion of episodes without a crash among all episodes, shown in Fig. 5(a). The average speed is shown in Fig. 5(b). The average exploration time is the travelling time in each episode, shown in Fig. 5(c); note that the maximum travelling time is limited to 40 seconds. The average exploration distance is the distance travelled by the ego vehicle until the episode ends, shown in Fig. 5(d). All indicators are averaged over the number of testing episodes. The results of A-D3QN and AE-D3QN indicate that an adaptive grid encoding module helps earn short-term rewards, as shown by the high speed, but without a safety limit the agent crashes often: the success rate increases by only 10%, while the other models improve by more than 42%. Without a safe checker, the network learns a riskier driving policy. AE-D3QN drastically improves the success rate, exploration time, and exploration distance by 42%, 27.5%, and 23.72%, but it still fails in some corner cases; the results show that the agent fails to fully utilize the information of other vehicles, leading to short exploration distances. With the attention modules, the ego vehicle pays different amounts of attention to each surrounding vehicle and makes full use of features with different characteristics, so a much safer action is chosen. Owing to this significant performance improvement, the overall results show that the attention modules make the ego vehicle's actions more intelligent.

Ablation Study. To further explore how the SA and CA modules should be integrated into the D3QN algorithm, we conduct an ablation study of different integration positions over the 100 testing episodes. We place the double attention module after each convolutional layer F_1, F_2, F_3 in Fig. 3, producing the models named Combine1, Combine2, and Combine3, as shown in Table 4. Compared with D3QN, Combine1 improves the reward by 23.23% and the success rate by 45%, but its average speed drops more than that of the other models. Combine2 keeps a higher speed than Combine1 and Combine3, but its improvement on the other indicators is not as good as the above two models. Combine3 achieves a higher reward and significantly improves the success rate by 54% while maintaining the average velocity. We assume this is because, after three convolutional layers, the feature map contains higher-level semantic features that are useful for decision-making. Hence, we choose Combine3 as the integration method of the D3QN-DA algorithm.

V. CONCLUSION
Decision-making is still an important technical problem for fully autonomous driving. This work puts forward a hierarchical control structure based AE-D3QN-DA algorithm that assigns different attention to different vehicles to improve efficiency and ensure safety in an uncertain highway environment. We evaluate the proposed method with a reward metric and a capability metric. The results show that AE-D3QN-DA improves the average total reward, success rate, average exploration time, and average exploration distance while maintaining the average velocity. In future work, we intend to analyze the transferability of the proposed method to more scenes, such as intersections, and to consider trajectory prediction and intention recognition for other vehicles.

ACKNOWLEDGMENT
Portions of this manuscript were presented at the 59th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), Chiang Mai, Thailand, September 25th, 2020.