Online Personalized Preference Learning Method Based on In-Formative Query for Lane Centering Control Trajectory

The personalization of autonomous vehicles or advanced driver assistance systems has been a widely researched topic, with many proposals aiming to achieve human-like or driver-imitating methods. However, these approaches rely on an implicit assumption that all drivers prefer the vehicle to drive like themselves, which may not hold true for all drivers. To address this issue, this study proposes an online personalized preference learning method (OPPLM) that utilizes a pairwise comparison group preference query and the Bayesian approach. The proposed OPPLM adopts a two-layer hierarchical structure model based on utility theory to represent driver preferences on the trajectory. To improve the accuracy of learning, the uncertainty of driver query answers is modeled. In addition, informative query and greedy query selection methods are used to improve learning speed. To determine when the driver’s preferred trajectory has been found, a convergence criterion is proposed. To evaluate the effectiveness of the OPPLM, a user study is conducted to learn the driver’s preferred trajectory in the curve of the lane centering control (LCC) system. The results show that the OPPLM can converge quickly, requiring only about 11 queries on average. Moreover, it accurately learned the driver’s favorite trajectory, and the estimated utility of the driver preference model is highly consistent with the subject evaluation score.


Introduction
Advanced driver assistance systems (ADAS) such as adaptive cruise control (ACC), forward collision warning, lane keeping assistance, and lane change assistance have become increasingly common in newly manufactured vehicles. The ADAS is expected to improve driving safety and comfort. However, to be effective, ADAS systems must match drivers' preferences and driving behaviors [1,2]. Drivers' preferences vary based on personality traits, driving experience, and situational factors. Therefore, ADAS systems must be personalized [3,4].
Personalization methods for ADAS systems can be divided into explicit and implicit personalization approaches [3]. Explicit personalization requires drivers to manually choose a specific system setting that matches their preference. ACC is an example of explicit personalization, where drivers can set their desired speed and choose between predefined time gaps when using ACC. However, explicit personalization can be difficult for drivers to understand and may be limited in terms of the available options, particularly when the settings are interactive between multiple ADAS systems. Limited choices are another drawback of explicit personalization.
Implicit personalization is a promising approach for resolving these challenges by developing a personalized driver preference model that can predict preferences based on collected driver data. One common method for implicit personalization is driving style identification, where individual drivers are classified into driving style categories such as comfortable, normal, and sporty based on individual driving data [5,6]. In study [7], a driver style identification method was proposed based on questionnaire surveys and corresponding driving behavior characteristics. This approach was used to identify the driver type (aggressive, ordinary, or cautious) online, and a driver-adaptive ACC/CA (collision avoidance) fusion control strategy was designed accordingly. However, as highlighted in a review by [8] on the driving style identification method used for ADAS, this method still faces the challenge of limited categories and may not be able to adapt to individual preferences.
Imitation learning is another widely researched implicit personalization method, where the vehicle controller is personalized based on a model built from a group of driver behaviors [9]. The general process of this method involves observing driver behavior by collecting driving data from a group of drivers, building a driver behavior or preference model, and obtaining a personalized vehicle controller based on the driving behavior model and the measured driving data of a new individual driver [3,4]. The driver behavior model can be built based on a steering or car-following driver model [10,11] and machine learning methods [12][13][14], such as inverse reinforcement learning, which has been used to learn human-like driving [15][16][17] since the work in Ref. [18]. However, these methods assume that drivers prefer the vehicle to drive like themselves, which is not necessarily true for all drivers [19]. For instance, assertive drivers prefer a significantly more defensive driving style than their own [20,21]. Ref. [22] revealed that the perceived control of risk-taking for drivers and passengers is different. Passengers who are out of control of their vehicles perceive more risk than drivers in control of their vehicles. Therefore, personalized ADAS must consider drivers' real preferences when designing their systems.
Preference learning is a type of machine learning that involves learning from observations that reveal information about an individual's preferences [23]. It has been applied in various fields, including user preference mining and human-robot interaction. There are three main types of preference information used in preference learning: pairwise comparison, ranking, and rating of alternatives [24]. In human-robot interaction, learning user preferences for robot motion trajectories can be challenging due to the quality and quantity of user feedback. Learning from demonstration (LFD) is a common method used to learn user preferences [25,26], but it can be difficult for users to provide demonstrations that orchestrate all of the robot's degrees of freedom [27]. Preference queries, such as pairedgroup comparison, are an easier form of user feedback, but they require a large amount of data [28]. To reduce the amount of data required, active learning or active query selection methods are used [29][30][31]. Other tricks, such as batch active preference learning [32] and scale feedback [33], are also used to further reduce the data needed. In study [34], a generative adversarial network (GAN) has been used to learn human preferences with fewer queries required. This approach replaces the role of a human in assigning preferences.
At present, great progress to learn user preference in human-robot interaction has been made. However, these works mainly focus on mobile manipulators such as personal robots and assembly line robots. Applying preference learning methods to ADAS or autonomous vehicles is a challenge as driving tasks are much more difficult to demonstrate compared to mobile manipulators, especially for unskilled drivers.
The acceptable number of queries for drivers is also much less than that for robots. Even with tricks such as batch learning, about 100 queries are required to obtain driver preference [32], which is unacceptable for drivers. Ref. [35] uses an augmenting comparison query with feature queries and an active query selection method to learn the driver's reward function for trajectory, which is faster than preference comparison only. However, it does not provide a convergence criterion to indicate when to stop queries and obtain the final driver preference. The feature query requires drivers to carefully point out the difference in the host vehicle's pose, relative position to the road lane, and other vehicles between paired-group trajectories, which is not easy for drivers. Therefore, there is a need for further research to develop efficient and effective preference learning methods that consider the challenges specific to ADAS and autonomous vehicles. Based on preference learning methods, this study aims to develop an online personalized preference learning method (OPPLM) with a particular focus on the trajectory of lane centering control (LCC) in a simple curve condition without other vehicles involved. The main contributions of this paper are:

1.
Introducing an online personalized preference learning method (OPPLM) based on pairwise comparison group preference queries and Bayesian approach.

2.
Establishing a two-layer hierarchical structure model based on utility theory to model driver preferences on trajectory, taking into account the uncertainty of drivers' query answers.

3.
Utilizing informative and greedy query selection methods to improve the learning speed. A convergence criterion is proposed to indicate when the driver's preferred trajectory has been found.
The paper is organized as follows: Section 2 introduces the proposed OPPLM. Section 3 describes the user study experiment conducted to validate the OPPLM. The experiment's results are presented in Section 4, followed by a discussion in Section 5. Finally, the proposed OPPLM is summarized, and its limitations are pointed out. Figure 1 illustrates the flow chart of the proposed OPPLM. The driver preference model is initialized at the beginning and then updates online. The system starts by selecting a trajectory from the pre-prepared trajectory pool, which contains many alternative trajectories. Next, a new pairwise comparison group is constructed based on the selected trajectory and the driver's preferred trajectory from the previous query. The driver is then queried for their preferred trajectory, and the driver preference model is updated accordingly. Finally, the OPPLM checks whether it has converged or not. If it has, the learning process ends, and the driver-preferred trajectory is identified. Otherwise, the process repeats from the first step. between paired-group trajectories, which is not easy for drivers. Therefore, there is a need for further research to develop efficient and effective preference learning methods that consider the challenges specific to ADAS and autonomous vehicles. Based on preference learning methods, this study aims to develop an online personalized preference learning method (OPPLM) with a particular focus on the trajectory of lane centering control (LCC) in a simple curve condition without other vehicles involved. The main contributions of this paper are:

Flow of the Online Personalized Preference Learning Method
1. Introducing an online personalized preference learning method (OPPLM) based on pairwise comparison group preference queries and Bayesian approach. 2. Establishing a two-layer hierarchical structure model based on utility theory to model driver preferences on trajectory, taking into account the uncertainty of drivers' query answers. 3. Utilizing informative and greedy query selection methods to improve the learning speed. A convergence criterion is proposed to indicate when the driver's preferred trajectory has been found.
The paper is organized as follows: Section 2 introduces the proposed OPPLM. Section 3 describes the user study experiment conducted to validate the OPPLM. The experiment's results are presented in Section 4, followed by a discussion in Section 5. Finally, the proposed OPPLM is summarized, and its limitations are pointed out. Figure 1 illustrates the flow chart of the proposed OPPLM. The driver preference model is initialized at the beginning and then updates online. The system starts by selecting a trajectory from the pre-prepared trajectory pool, which contains many alternative trajectories. Next, a new pairwise comparison group is constructed based on the selected trajectory and the driver's preferred trajectory from the previous query. The driver is then queried for their preferred trajectory, and the driver preference model is updated accordingly. Finally, the OPPLM checks whether it has converged or not. If it has, the learning process ends, and the driver-preferred trajectory is identified. Otherwise, the process repeats from the first step.

Formulation of Driver Preference Model
Utility theory is widely used to model discrete choice problems. In this theory, a decision maker selects the alternative with the highest utility among those available [36]. The utility of an alternative is typically modeled as a function of its relevant attributes, often a

Formulation of Driver Preference Model
Utility theory is widely used to model discrete choice problems. In this theory, a decision maker selects the alternative with the highest utility among those available [36]. The utility of an alternative is typically modeled as a function of its relevant attributes, often a linear function. To account for the uncertainty of the decision maker, a random utility is added to the utility function, which makes the discrete choice problem probabilistic. In this study, the driver's preferred trajectory is modeled as the one with the highest expected utility among all alternatives, while the preferred trajectory of the pairwise comparison group is modeled as the one with a higher expected utility.
The relevant attributes of trajectory for vehicles typically include safety, comfort, efficiency, and energy-saving. However, for simplicity, energy-saving is not considered in this study. Let U S , U C , and U E represent the safety, comfort, and efficiency utility, respectively. Let β S , β C , and β E represent the linear weight parameters of the safety, comfort, and efficiency utility, with ε representing the random utility. Therefore, the utility (U) of a trajectory could be represented as follows: and θ E represent the normalized linear weight parameters of the safety, comfort, and efficiency utilities, respectively, all within [0, 1]. β =|β S |+|β C |+|β E | represents the normalization coefficient. The safety, comfort, and efficiency utility (U S , U C , U E ) of a trajectory are unknown. However, they can be calculated using assumed utility functions and corresponding trajectory attributes or characteristic indicators. Let X S_1 , X S_2 , . . . represent the safety corresponding trajectory indicators. Let β S_1 , β S_2 , . . . represent the linear weight parameters of the safety, comfort, and efficiency utility. Similarly, the safety utility (U S ) can be calculated using the following equation: where, The comfort and efficiency utility functions are similar but use different trajectory characteristic indicators. The safety utility is calculated based on trajectory indicators. It means that the safety utility is the driver's direct perception of the trajectory. Therefore, it is called the safety perception model (SPM) in this research. Similarly, the comfort perception model (CPM) and efficiency perception model (EPM) are used for calculating the comfort and efficiency utility functions, respectively. The utility function of a trajectory, as shown in Equation (1), is indirectly evaluated using the safety, comfort, and efficiency utility functions, and it is referred to as the utility evaluation model (UEM).
For pairwise trajectory comparison group (A, B), the probability that a driver with utility function parameter (β, Θ) prefers trajectory A to B, represented by P r (A|β, Θ, X A , X B ), can be modeled as the probability that the utility of trajectory A ( U A ) is larger than that of trajectory B ( U B ), represented by P r (U A > U B ), which is formulated as: where, A and B could be assumed to be an independent and identical distribution, although the specific distribution is unknown. A reasonable assumption for the distribution of ε is a Gaussian distribution, given the central limit theorem. However, this assumption does not lead to a closed-form solution for the probability. A better assumption for the distribution is the standard Gumbel (or type I extreme value) distribution, which does lead to a closed-form solution [36] as follows: Sensors 2023, 23, 5246

of 22
Equation (4) models the likelihood that the driver prefers trajectory A to B for the pairwise comparison group (A, B). The probability that the driver prefers trajectory B to trajectory A can be modeled as follows: Based on the above equation, it is easy to predict the driver's answer to a query. If P r (A|β, Θ, X A , X B ) is larger than 0.5, then the driver is more likely to prefer A than B, and vice versa.
Equations (3)-(5) do not consider the uncertainty of a driver's answer when distinguishing between two trajectories. Sometimes, drivers may find it difficult to discern the difference between two trajectories, and forcing them to make a deterministic choice may be inappropriate. In such cases, it is more appropriate to allow for uncertain answers. It is assumed that when the absolute difference in utility is closer to zero, the driver is more likely to give an uncertain answer. Thus, the probability of different answers to a query can be modeled as follows: where, the UB (upper bound) and LB (lower bound) represent the probability threshold between the uncertain result and the other two deterministic results.
To calculate the likelihood of a deterministic answer, we can use Equations (4) and (5), respectively. However, to calculate the likelihood of an uncertain answer, we can view it as a joint result of two opposite answers: where, P r (A ≈ B|β, Θ, X A , X B ) represents the probability that the driver's answer is uncertain.
The correlation between the likelihood of different answers to a query and the utility difference Θ T (X A − X B for drivers with different parameters β is shown in Figure 2. Based on the above equation, it is easy to predict the driver's answer to a query. If ( | , , X , X ) is larger than 0.5, then the driver is more likely to prefer A than B, and vice versa.
Equations (3)-(5) do not consider the uncertainty of a driver's answer when distinguishing between two trajectories. Sometimes, drivers may find it difficult to discern the difference between two trajectories, and forcing them to make a deterministic choice may be inappropriate. In such cases, it is more appropriate to allow for uncertain answers. It is assumed that when the absolute difference in utility is closer to zero, the driver is more likely to give an uncertain answer. Thus, the probability of different answers to a query can be modeled as follows: where, the UB (upper bound) and LB (lower bound) represent the probability threshold between the uncertain result and the other two deterministic results.
To calculate the likelihood of a deterministic answer, we can use Equations (4) and (5), respectively. However, to calculate the likelihood of an uncertain answer, we can view it as a joint result of two opposite answers: ( ≈ | , , X , X ) = ( | , , X , X ) • ( | , , X , X ) where, ( ≈ | , , X , X ) represents the probability that the driver's answer is uncertain.
The correlation between the likelihood of different answers to a query and the utility difference (X − X ) for drivers with different parameters is shown in Figure 2.

Figure 2.
The correlation between the likelihood of different answers to a query and the utility difference (X − X ) for drivers with different parameters . Figure 2 illustrates that, on one hand, the likelihood that the driver prefers A to B increases as the utility difference (X − X ) becomes larger. However, as (X − X ) approaches 0, the likelihood of an uncertain answer increases. On the other hand, for the same utility difference (X − X ), the likelihood of uncertainty decreases as the parameter β increases, which means that drivers are more likely to give a deterministic answer  Figure 2 illustrates that, on one hand, the likelihood that the driver prefers A to B increases as the utility difference Θ T (X A − X B becomes larger. However, as Θ T (X A − X B Sensors 2023, 23, 5246 6 of 22 approaches 0, the likelihood of an uncertain answer increases. On the other hand, for the same utility difference Θ T (X A − X B , the likelihood of uncertainty decreases as the parameter β increases, which means that drivers are more likely to give a deterministic answer to a query. The parameter β measures the driver's ability to distinguish between trajectories and is therefore referred to as the perception coefficient in this study. It is worth noting that, for a specific critical utility difference, e.g., the minimum difference required for a driver to give a deterministic answer, the perception coefficient β and the corresponding probability thresholds UB and LB are interrelated. As the perception coefficient β increases, the values of UB and LB approach 0.5. This coupling indicates that the parameters UB, LB, and β are interdependent. Overall, the personalized preference learning system aims to estimate the linear weight parameters Θ and the perception coefficient β of the driver preference model (UEM, SPM, CPM, EPM) for each individual.

Estimation Method
The driver preference model parameters are estimated using a Bayesian approach and a limited greedy estimation method. Firstly, an assumption is made about the prior probability distribution of estimation parameters. Then, the estimation is updated based on the driver's preference trajectory query result of a pairwise trajectory comparison group (A, B) at every step, following the Bayesian approach. The flow chart of the parameter update method for query results at each step is shown in Figure 3. to a query. The parameter measures the driver's ability to distinguish between trajectories and is therefore referred to as the perception coefficient in this study. It is worth noting that, for a specific critical utility difference, e.g., the minimum difference required for a driver to give a deterministic answer, the perception coefficient β and the corresponding probability thresholds UB and LB are interrelated. As the perception coefficient increases, the values of UB and LB approach 0.5. This coupling indicates that the parameters UB, LB, and are interdependent. Overall, the personalized preference learning system aims to estimate the linear weight parameters and the perception coefficient of the driver preference model (UEM, SPM, CPM, EPM) for each individual.

Estimation method
The driver preference model parameters are estimated using a Bayesian approach and a limited greedy estimation method. Firstly, an assumption is made about the prior probability distribution of estimation parameters. Then, the estimation is updated based on the driver's preference trajectory query result of a pairwise trajectory comparison group (A, B) at every step, following the Bayesian approach. The flow chart of the parameter update method for query results at each step is shown in Figure 3. The first step of the estimation process involves updating the parameter for a given prior parameter set and perception coefficient . For the pairwise comparison group (A, B) and its corresponding driver query answer, the parameter can be updated using the following equation: where, ( ) represents the prior probability distribution of estimation parameters , and ( | , , X , X ) represents the likelihood of the query answer as calculated by Equations (4,5,7). ( | , , X , X ) represents the posterior probability distribution of the updated .
The second step of the estimation process involves determining whether the updated driver preference model, with its newly estimated parameters, can correctly predict the latest query result. The prediction of the latest query result can be calculated using Equation (6). If the prediction matches the real driver answer, then the parameter estimation update ends, and the perception coefficient remains unchanged. However, if the pre- The first step of the estimation process involves updating the parameter for a given prior parameter set Θ and perception coefficient β. For the pairwise comparison group (A, B) and its corresponding driver query answer, the parameter Θ can be updated using the following equation: where, P r (Θ) represents the prior probability distribution of estimation parameters Θ, and P r (Ans|β, Θ, X A , X B ) represents the likelihood of the query answer as calculated by Equations (4)- (7). P r (Θ|β, Ans, X A , X B ) represents the posterior probability distribution of the updated Θ. The second step of the estimation process involves determining whether the updated driver preference model, with its newly estimated parameters, can correctly predict the latest query result. The prediction of the latest query result can be calculated using Equation (6). If the prediction matches the real driver answer, then the parameter estimation update ends, and the perception coefficient β remains unchanged. However, if the prediction does not match the real driver answer, the third step involves incrementally increasing the perception coefficient β and repeating the first and second steps until the prediction is consistent with the driver answer. This process is referred to as "greedy" because the likelihood of the answer increases with β, as shown in Figure 2, leading to a more accurate prediction by the driver preference model with updated parameters. To ensure parameter estimation stability and prevent noisy answers from significantly decreasing estimation accuracy, the increment of β is limited to a value no larger than a specified number Incre_max for each query answer. This is why the estimation method is referred to as "limited-greedy". Finally, the parameters Θ are normalized, and the perception coefficient β is modified accordingly.

Pairwise Comparison Group Construction and Query Trajectory Selection
To achieve accurate and efficient learning of the driver-preferred trajectory, it is crucial to construct an appropriate pairwise comparison group that minimizes the required amount of data. However, there is often a trade-off between speed and accuracy. Selecting a comparison group that leads to faster learning may result in a suboptimal outcome, while a more accurate approach may require more data and time. This is similar to the explorationexploitation dilemma encountered in reinforcement learning, which can be addressed using the -greedy method [37]. In reinforcement learning, exploitation refers to selecting actions that have the highest expected reward based on current experience, while exploration involves selecting untried actions in the hopes of achieving a higher reward.
Drawing on the -greedy method, we propose a similar "greedy" policy to construct the pairwise comparison group. This involves combining a new query trajectory with one that has been previously compared, at each step, to minimize the required number of query trajectories. The greedy approach is reflected in both the construction of the pairwise comparison group and the selection of new query trajectories.

Pairwise Comparison Group Construction
The greedy pairwise comparison group is constructed by selecting the two trajectories that the driver is most likely to prefer. The first trajectory is the driver's favorite among all compared trajectories, representing the greediest selection among them. A second trajectory selection method is introduced below.

Query Trajectory Selection
A greedy query trajectory selection policy entails selecting the trajectory that the driver is most likely to prefer according to the current driver preference model, except for the first one selected above. On the other hand, the -greedy policy involves selecting a greedy trajectory with probability 1 − at each step, while all non-greedy trajectories are selected at random with probability , the exploration probability, as expressed in Equation (9): Here, Pr_selected(Trajectory) represents the probability of the trajectory being selected, and N is the number of all alternative trajectories. This equation ensures that each trajectory has a chance of being selected. As the value of increases, the probability of selecting the greedy trajectory decreases, while the other trajectories become more likely to be selected, thus increasing the level of exploration. To increase the level of exploration at the start and decrease it after some queries to speed up convergence, a variable exploration level is proposed as given in Equation (10): where, ini is the initial value, and γ is the decay rate. Figure 4 illustrates how changes as the step increases, with an initial value of ini = 1 and three different values of decay rate γ. The figure shows that as γ increases, decreases faster, leading to a decrease in the level of exploration.
where, is the initial value, and is the decay rate. Figure 4 illustrates how ϵ changes as the step increases, with an initial value of = 1 and three different values of decay rate . The figure shows that as increases, decreases faster, leading to a decrease in the level of exploration.

Convergence Criterion
A convergence criterion is proposed to end preference learning as shown in Figure 5 below. Firstly, it is required to check whether the current step is larger than the Min-Exploration Step, which guarantees a minimum number of queries. The second step is to match up the current driver preferred trajectory with the preference estimation, which is the trajectory with the highest utility among all trajectories in the trajectory pool. In addition, the trajectory utility can be calculated using Equation (1) with the newly updated preference model parameters. If they match, the Flag representing the number of consecutive false comparisons should be reset to zero. However, if the comparison is true, the Flag should be incremented by one. The final step is to judge whether the Flag is equal to the Threshold to ensure a stable preference estimation. This criterion should not be too conservative, resulting in numerous inefficient comparison queries, nor too aggressive, resulting in premature termination and inaccurate estimation.

Convergence Criterion
A convergence criterion is proposed to end preference learning as shown in Figure 5 below.  Firstly, it is required to check whether the current step is larger than the Min-Exploration Step, which guarantees a minimum number of queries. The second step is to match up the current driver preferred trajectory with the preference estimation, which is the trajectory with the highest utility among all trajectories in the trajectory pool. In addition, the trajectory utility can be calculated using Equation (1) with the newly updated preference model parameters. If they match, the Flag representing the number of consecutive false comparisons should be reset to zero. However, if the comparison is true, the Flag should be incremented by one. The final step is to judge whether the Flag is equal to the Threshold to ensure a stable preference estimation. This criterion should not be too conservative, resulting in numerous inefficient comparison queries, nor too aggressive, resulting in premature termination and inaccurate estimation.

Experiment Configuration
The proposed preference learning method (OPPLM) was evaluated through a user study conducted on a fixed-base driving simulator.

Equipment
The fixed-base driving simulator comprises four components, as shown in Figure 6. These include a real-time target machine, a steering system and pedals, a personal computer, and a screen. The real-time target machine is responsible for computing trajectory planning, tracking, and controlling the steering system and pedals. Specifically, it sends alignment torque to the load motor and haptic feedback torque to the EPS (Electric Power Steering) motor and outputs the vehicle state to the scenario simulation software (prescan) running on the personal computer. The real-time scenario is then displayed on a screen with a resolution of 3840 × 1080.

Experiment Configuration
The proposed preference learning method (OPPLM) was evaluated through a user study conducted on a fixed-base driving simulator.

Equipment
The fixed-base driving simulator comprises four components, as shown in Figure 6. These include a real-time target machine, a steering system and pedals, a personal computer, and a screen. The real-time target machine is responsible for computing trajectory planning, tracking, and controlling the steering system and pedals. Specifically, it sends alignment torque to the load motor and haptic feedback torque to the EPS (Electric Power Steering) motor and outputs the vehicle state to the scenario simulation software (prescan) running on the personal computer. The real-time scenario is then displayed on a screen with a resolution of 3840 × 1080.

Scenario
The experiment was conducted in a simple single-lane curve scenario without any other vehicles present. To speed up the experiment procedure, a closed-loop triangular field was designed, as shown in Figure 7. Throughout the entire experiment, the vehicle was controlled by a controller, and the test subject did not need to drive manually. The vehicle initially entered the curved road

Scenario
The experiment was conducted in a simple single-lane curve scenario without any other vehicles present. To speed up the experiment procedure, a closed-loop triangular field was designed, as shown in Figure 7.

Experiment Configuration
The proposed preference learning method (OPPLM) was evaluated through a user study conducted on a fixed-base driving simulator.

Equipment
The fixed-base driving simulator comprises four components, as shown in Figure 6. These include a real-time target machine, a steering system and pedals, a personal computer, and a screen. The real-time target machine is responsible for computing trajectory planning, tracking, and controlling the steering system and pedals. Specifically, it sends alignment torque to the load motor and haptic feedback torque to the EPS (Electric Power Steering) motor and outputs the vehicle state to the scenario simulation software (prescan) running on the personal computer. The real-time scenario is then displayed on a screen with a resolution of 3840 × 1080.

Scenario
The experiment was conducted in a simple single-lane curve scenario without any other vehicles present. To speed up the experiment procedure, a closed-loop triangular field was designed, as shown in Figure 7. Throughout the entire experiment, the vehicle was controlled by a controller, and the test subject did not need to drive manually. The vehicle initially entered the curved road Throughout the entire experiment, the vehicle was controlled by a controller, and the test subject did not need to drive manually. The vehicle initially entered the curved road at a speed of 40 km/h and exited the curve with an end speed of 40 km/h. Within the curve, the vehicle was controlled to follow the planned trajectory, and, outside the curve, it traveled at a constant speed of 15 km/h on the long straight road to allow the subjects enough time to evaluate. Smooth speed profiles connected the speed profiles between these sections to avoid discomfort.

Trajectory Pool
To obtain a driver-preferred trajectory under curved driving conditions, a trajectory planning method capable of generating diverse trajectories is introduced. Additionally, the trajectory track method utilized in this research is presented. The trajectory pool is established based on the planning and tracking methods employed in the study.

Trajectory Planning
The trajectory of the curved path is composed of three distinct segments, each corresponding to a particular section of the road, as depicted in Figure 8.
at a speed of 40 km/h and exited the curve with an end speed of 40 km/h. Within the curve, the vehicle was controlled to follow the planned trajectory, and, outside the curve, it traveled at a constant speed of 15 km/h on the long straight road to allow the subjects enough time to evaluate. Smooth speed profiles connected the speed profiles between these sections to avoid discomfort.

Trajectory Pool
To obtain a driver-preferred trajectory under curved driving conditions, a trajectory planning method capable of generating diverse trajectories is introduced. Additionally, the trajectory track method utilized in this research is presented. The trajectory pool is established based on the planning and tracking methods employed in the study.

Trajectory Planning
The trajectory of the curved path is composed of three distinct segments, each corresponding to a particular section of the road, as depicted in Figure 8. Referring to the modes established in Ref. [38], path planning is achieved by combining various modes with different weights to generate divergent paths. For the entry and exit segments, the path is planned by connecting the endpoint of the middle segment of the curve to the center of the road lane using a cubic spline.
Speed planning is also conducted using the optimal speed planning methodology outlined in Ref. [38], with certain modifications to adapt to the curve conditions. Specifically, the minimum jerk mode, which minimizes the derivative of longitudinal acceleration, is substituted with the minimum speed variation mode, which aims to minimize variations in longitudinal speed. In addition, the maximum allowable velocity is replaced by the maximum allowable lateral acceleration, which is used to constrain the maximum speed during the curve. For further details, please refer to Ref. [38]. Figure 9 illustrates some of the planned paths and speed profiles presented in the Frenet coordinate system where a negative lateral offset indicates positions located close to the inner side of the curved road, and a positive lateral offset indicates positions close to the outer side. The lateral offsets −1, 0, and 1 indicate the inner side, center, and outer side of the road, respectively. The speed profile consists of three constant speed profiles, which are smoothly connected. The minimum constant speed is determined by the maximum allowable lateral acceleration. Referring to the modes established in Ref. [38], path planning is achieved by combining various modes with different weights to generate divergent paths. For the entry and exit segments, the path is planned by connecting the endpoint of the middle segment of the curve to the center of the road lane using a cubic spline.
Speed planning is also conducted using the optimal speed planning methodology outlined in Ref. [38], with certain modifications to adapt to the curve conditions. Specifically, the minimum jerk mode, which minimizes the derivative of longitudinal acceleration, is substituted with the minimum speed variation mode, which aims to minimize variations in longitudinal speed. In addition, the maximum allowable velocity is replaced by the maximum allowable lateral acceleration, which is used to constrain the maximum speed during the curve. For further details, please refer to Ref. [38]. Figure 9 illustrates some of the planned paths and speed profiles presented in the Frenet coordinate system where a negative lateral offset indicates positions located close to the inner side of the curved road, and a positive lateral offset indicates positions close to the outer side. The lateral offsets −1, 0, and 1 indicate the inner side, center, and outer side of the road, respectively. The speed profile consists of three constant speed profiles, which are smoothly connected. The minimum constant speed is determined by the maximum allowable lateral acceleration.

Trajectory Tracking
The augmented Stanley controller [39] is utilized to track the various planned trajectories. The tracking error is typically less than 5 cm when following the planned trajectories, with a maximum error of less than 10 cm. This level of tracking performance is deemed acceptable and ensures that the distinct planned trajectories can be clearly discerned by

Trajectory Tracking
The augmented Stanley controller [39] is utilized to track the various planned trajectories. The tracking error is typically less than 5 cm when following the planned trajectories, with a maximum error of less than 10 cm. This level of tracking performance is deemed acceptable and ensures that the distinct planned trajectories can be clearly discerned by the driver.

Trajectory Pool Setting
A trajectory pool consisting of thirty unique trajectories is generated by manipulating various parameters of the trajectory planning method. These thirty trajectories are obtained through an orthogonal design of ten paths and three distinct speed profiles. The speed profiles differ mainly based on the minimum speed required during the curve, which is set to 20 km/h, 30 km/h, and 40 km/h.

Trajectory Indicators
In line with similar research, trajectory indicators utilized to construct the SPM, CPM, and EPM were selected based on prior experience. The trajectory indicators commonly used to identify driving styles [8] and personalize ADAS [4] (p. 400) were reviewed, including speed, acceleration, jerk, and distance to the lane center, among others. From this review, three trajectory indicators were selected for each perception model, as detailed in Table 1 below.

Ind_S1
Maximum left lateral offset to lane center Ind_S2 Maximum right lateral offset to lane center Ind_S3 Minimum time to lane crossing CPM

Ind_C1
The range of lateral offset to lane center Ind_C2 Mean value of speed jerk Ind_C3 Mean value of yaw acceleration EPM Ind_E1 Minimum speed Ind_E2 Maximum speed acceleration Ind_E3 Mean inverse time to right lane crossing

Estimation Method Setting
The parameters of the preference model were estimated without prior knowledge by assuming a normal distribution with a mean of (1/3, 1/3, 1/3) and an identity covariance matrix for the prior probability distribution of estimation parameters Θ. The initial perception coefficient was assumed to be one, and the increment limitation Incre_max of the β was set to three. In order to determine convergence, the minimum number of queries Min-Exploration was set to four, while the Threshold, which ensures a stable preference estimation, was set to three.

Subjects
In order to ensure an effective statistical analysis with a sufficient number of subject samples, the G*power software [40] was utilized to determine the minimum required sample size. A paired-samples t-test was employed to validate the results, with an effect size set at 0.5, an α error probability of 0.05, and a power of 0.8. As a result, a total sample size of 27 was determined. In the experiment, 29 subjects ultimately participated, with an average age of 34.9 (SD = 11.8). Among the subjects, 15 were identified as inexperienced drivers with an average annual driving mileage of less than 1000 km, while 14 were identified as experienced drivers with an average annual driving mileage of more than 20,000 km.

Procedure
Before conducting the main experiment, a pre-comparison test was conducted to ensure that each subject was familiar with the scenario and experiment procedure. Four pairwise comparison groups were presented to each subject, and they were required to experience the trajectories in the driving simulator, followed by describing the differences between the trajectories. The subjects were allowed to experience each trajectory multiple times until they were confident with the comparison. In case of any ignored differences, they were informed and the trajectories were re-compared. After comparing all four groups, the subjects were required to provide answers to the four queries listed in Table 2. During the experiment, each participant was instructed to respond to the four queries listed in Table 2 for every comparison group. The selection of query trajectories and the construction of comparison groups were both automated using the method described in Section 2.3. The first comparison group required the participant to answer the queries after two trajectories were presented in turn. Subsequently, the participant was required to compare the newly selected trajectory to the preferred one in the last query and repeat the process until the OPPLM converged.
Once the OPPLM converged, four trajectories were selected to validate the effectiveness of the learned driver preference model. The utility of each trajectory in the pool of 30 trajectories was calculated using Equations (1) and (2) with the newly estimated driver preference model parameters. The 30 trajectories were then sorted in descending order of utility, and the leading two trajectories, the 15th (middle) trajectory, and the last trajectory according to their utility were selected for evaluation using the Likert scale presented in Table 3. Table 3. Trajectory evaluation Likert scale.

Item Scale
I feel very safe. 1 (strongly disagree)-7 (strongly agree) I feel very comfortable.
1 (strongly disagree)-7 (strongly agree) I like the way it drives.

Evaluation Indices
To evaluate the effectiveness of the OPPLM, two aspects are considered: learning speed and accuracy. The learning speed is measured by the number of queries (QN) required for the OPPLM to converge. The learning accuracy is evaluated by two indices. The first index is the goodness-of-fit (GOF) of the final learned driver preference model, which is calculated by the ratio of QN_Positive to QN, as given by Equation (11).
Here, QN_Positive refers to the number of queries that can be correctly predicted by Equation (6) with the learned preference model parameters. In addition, QN refers to the queries number that is used when the OPPLM converges. A higher value of GOF indicates greater accuracy of the model.
The second index is the score-utility consistency (SUC), which measures the consistency between the evaluation Likert score ordering and the corresponding utility ordering for the final evaluated four trajectories. Taking the item preference (the fourth item in Table 3) as an example. Let A, B, C, and D represent the four trajectories that are selected to be evaluated after the OPPLM converges in Section 3.6. They are sorted by utility in descending order, meaning that the utility of the four trajectories is ranked as Let a, b, c, and d represent the preference evaluation score of the corresponding four trajectories. The SUC for item preference could be calculated as follows: Here, sign() is the sign function. Sign(a−b) equals 1 when (a−b) is positive; sign(a−b) equals 0 when (a−b) equals 0; and sign(a−b) equals -1 when (a−b) is negative. For example, if the evaluation scores of the four trajectories are a = 6, b = 5, c = 4, and d = 3, the evaluation score ordering is consistent with the utility ordering, indicating that the learned utility model accurately predicts the driver's preference. In this case, the SUC is equal to 1. Conversely, if the scores of the four trajectories are a = 3, b = 4, c = 5, and d = 6, the estimated utility is opposite to the driver's preference for trajectory, resulting in an SUC of −1. The SUC is within the range of [−1, 1]. The SUC value provides an indication of the accuracy of the model, with a higher SUC indicating greater accuracy.

Results
This section presents an analysis of the learning speed and accuracy of the experiment results.

Learning Speed
The distribution of the number of queries (QN) of all 29 subjects is presented in Figure 10. The mean QN of all participants is 11.1 with a standard deviation of 4.6. The vast majority of subjects (excluding one outlier) had a QN of less than 20, indicating that the OPPLM generally converged within 20 queries. Notably, one participant had a QN of 27, which deviates significantly from the remaining results. In Figure 10b, the QN of inexperienced and experienced drivers is compared. The mean QN of the inexperienced subjects is 12.4, with a standard deviation of 5.7. In comparison, the mean QN of experienced subjects is 9.7, with a standard deviation of 2.6. An independent-samples t-test was conducted to examine the QN difference between inexperienced and experienced subjects. The results indicate that the influence of subject experience on QN is not statistically significant (t (27) = 1.61, p = 0.12). The mean QN of all participants is 11.1 with a standard deviation of 4.6. The vast majority of subjects (excluding one outlier) had a QN of less than 20, indicating that the OP-PLM generally converged within 20 queries. Notably, one participant had a QN of 27, which deviates significantly from the remaining results. In Figure 10b, the QN of inexperienced and experienced drivers is compared. The mean QN of the inexperienced subjects is 12.4, with a standard deviation of 5.7. In comparison, the mean QN of experienced subjects is 9.7, with a standard deviation of 2.6. An independent-samples t-test was conducted to examine the QN difference between inexperienced and experienced subjects. The results indicate that the influence of subject experience on QN is not statistically significant (t(27) = 1.61, p = 0.12).

Goodness-of-Fit
This section presents an analysis of the goodness-of-fit (GOF) of the four driver preference models (UEM, SPM, CPM, and EPM) for 29 subjects, as shown in Figure 11 and Table 4.
The mean QN of all participants is 11.1 with a standard deviation of 4.6. The vast majority of subjects (excluding one outlier) had a QN of less than 20, indicating that the OPPLM generally converged within 20 queries. Notably, one participant had a QN of 27, which deviates significantly from the remaining results. In Figure 10b, the QN of inexperienced and experienced drivers is compared. The mean QN of the inexperienced subjects is 12.4, with a standard deviation of 5.7. In comparison, the mean QN of experienced subjects is 9.7, with a standard deviation of 2.6. An independent-samples t-test was conducted to examine the QN difference between inexperienced and experienced subjects. The results indicate that the influence of subject experience on QN is not statistically significant (t(27) = 1.61, p = 0.12).

Goodness-of-fit
This section presents an analysis of the goodness-of-fit (GOF) of the four driver preference models (UEM, SPM, CPM, and EPM) for 29 subjects, as shown in Figure 11 and Table 4.   The UEM model had a mean GOF of 0.85 across all subjects, indicating that 85% of the queries could be correctly predicted by the learned UEM. This result suggests that the UEM performs well in modeling subject preference queries in a pairwise comparison group. The GOF for the other three perception models (SPM, CPM, and EPM) were all above 0.64. To compare the performance of each model, a paired-sample t-test was conducted between each two of the four models, as depicted in Figure 11a. The results indicate that the GOF of UEM is significantly higher than that of SPM (t(28) = 3.98, p = 0.000), CPM (t(28) = 3.25, p = 0.003) and EPM (t(28) = 4.78, p = 0.000). Additionally, the GOF of CPM was found to be significantly larger than that of EPM (t(28) = 4.78, p = 0.016).
Moreover, the influence of subject experience on the GOF of the models was investigated. The results demonstrate that the mean GOF of the experienced subjects was higher than that of the inexperienced subjects for all models. However, an independent-samples t-test revealed no significant difference between experienced and inexperienced subjects for each model (UEM: t (27)

Score-Utility Consistency
The score-utility-consistency (SUC) of the driver preference model (UEM, SPM, CPM, and EPM) of 29 subjects is summarized and presented in Figure 12 and Table 5.
above 0.64. To compare the performance of each model, a paired-sample t-test was conducted between each two of the four models, as depicted in Figure 11a. The results indicate that the GOF of UEM is significantly higher than that of SPM (t(28) = 3.98, p = 0.000), CPM (t(28) = 3.25, p = 0.003) and EPM (t(28) = 4.78, p = 0.000). Additionally, the GOF of CPM was found to be significantly larger than that of EPM (t(28) = 4.78, p = 0.016).
Moreover, the influence of subject experience on the GOF of the models was investigated. The results demonstrate that the mean GOF of the experienced subjects was higher than that of the inexperienced subjects for all models. However, an independent-samples t-test revealed no significant difference between experienced and inexperienced subjects for each model (UEM: t (27)

Score-Utility Consistency
The score-utility-consistency (SUC) of the driver preference model (UEM, SPM, CPM, and EPM) of 29 subjects is summarized and presented in Figure 12 and Table 5.   The mean SUC of UEM for all subjects is 0.74, indicating good consistency between the order of estimated utility and the evaluation score. This suggests that UEM models the degree of the subjects' trajectory preferences well. The SUC of the other three perception models (SPM, CPM, and EPM) is smaller than that of UEM, with EPM having the smallest SUC value. Paired-sample t-tests are conducted for each pair of models, and the results are displayed in Figure 12a. The analysis reveals that the SUC of UEM is significantly larger than SPM (t(28) = 2.24, p = 0.03), CPM (t(28) = 3.45, p = 0.002), and EPM (t(28) = 6.11, p = 0.000). Furthermore, the SUC of EPM is significantly smaller than that of SPM (t(28) = −3.06, p = 0.005) and CPM (t(28) = −2.81, p = 0.009).
The impact of subject experience on the SUC of the models is also studied. The results indicate that the mean SUC of experienced subjects is larger than that of inex-perienced subjects for each model. However, an independent-samples t-test reveals no significant difference between experienced and inexperienced subjects for each model (UEM (t (27)

Evaluation Score
The evaluation score of the four selected trajectories, based on the estimated utility by UEM for all subjects, is presented in Figure 13. Table 6 provides a summary of the evaluation scores for UEM and the other three perception models (SPM, CPM, and EPM).
The impact of subject experience on the SUC of the models is also studied. The results indicate that the mean SUC of experienced subjects is larger than that of inexperienced subjects for each model. However, an independent-samples t-test reveals no significant difference between experienced and inexperienced subjects for each model (UEM (t (27)

Evaluation Score
The evaluation score of the four selected trajectories, based on the estimated utility by UEM for all subjects, is presented in Figure 13. Table 6 provides a summary of the evaluation scores for UEM and the other three perception models (SPM, CPM, and EPM).   The mean score of the trajectory ranked first, by estimated utility in descending order, is 6.48, which is larger than that of the other three trajectories. A paired-sample t-test was conducted to validate the evaluation score difference between adjacent trajectory groups for the UEM model, and the result showed that the evaluation score was significantly different between adjacent groups. This qualitative result indicates that the estimated utility by UEM is consistent with subjects' degree of preference for trajectory. The summary and statistical test results for the other three perception models are also listed in Table 6. The result shows that the estimated utility by SPM and CPM is consistent with subjects' corresponding evaluation of safety and comfort, respectively. However, for EPM, the evaluation score of trajectories between groups has no significant difference. This indicates that the EPM could not model subjects' evaluation of efficiency well. The qualitative result is consistent with the quantitative result of GOF and SUC, in which the GOF and SUC of the EPM are the smallest among the models.
The influence of driver experience on the result is presented in Figure 14. A pairedsample t-test is conducted to compare the result of different driver experience groups for each model. The result shows that there is little difference between drivers with different experience, except for the CPM. For inexperienced drivers, there is a significant difference between the evaluation scores of comfort of the 2nd and 15th trajectories, but not for experienced drivers. The degree to which comfort affects drivers' preference for trajectory is inconsistent among drivers of different experience. statistical test > 0.05, which is not significant, ** means the statistical test < 0.01, and * means < 0.05.

Learning Speed
The OPPLM converged in approximately 11 queries on average. This is significantly fewer queries than those required in human-robot-interaction related research, where as many as 100 queries are often required [32]. However, there was one subject who required 27 queries, as shown in Figure 10a, which is significantly more than the others. The reason for this is that this subject provided conflicting answers to some pairwise comparison groups. It is challenging for subjects to avoid such cases completely, and a more efficient S. means statistical test p > 0.05, which is not significant, ** means the statistical test p < 0.01, and * means p < 0.05.

Learning Speed
The OPPLM converged in approximately 11 queries on average. This is significantly fewer queries than those required in human-robot-interaction related research, where as many as 100 queries are often required [32]. However, there was one subject who required 27 queries, as shown in Figure 10a, which is significantly more than the others. The reason for this is that this subject provided conflicting answers to some pairwise comparison groups. It is challenging for subjects to avoid such cases completely, and a more efficient approach is required to handle such cases and improve the convergence speed of OPPLM in the future.
Another problem with the learning speed is that OPPLM converged too early for some subjects. Six subjects' OPPLM converged within seven queries, which occurred when the subject's favorite trajectory was selected by accident, and the greedy trajectories were selected at the first few queries. Early convergence can result in a sub-optimal solution for finding the subject's favorite trajectory. To avoid such cases and guarantee enough exploration, further study is needed.
In this research, the convergence and stopping of OPPLM were based on the proposed convergence criterion. However, for practical applications, convergence of OPPLM is not necessary. As long as the driver is satisfied with the current trajectory and does not request another query actively, the OPPLM will not be updated. Thus, the learning speed of a satisfied trajectory may be faster in actual application.

Learning Accuracy
The mean goodness of fit (GOF) for UEM, calculated within the range of 0-1 for all subjects, is 0.85, while the mean score-utility-consistency (SUC), calculated within the range of [−1, 1], is 0.74. A paired-sample t-test confirms a significant difference in the evaluation score between two adjacent trajectory groups for UEM, indicating that UEM models the subject's degree of preference for trajectory well. However, for some subjects, the GOF of UEM is less than 0.6, and the SUC is less than 0.4. Nonetheless, the accuracy of UEM is considerably higher than that of the other three perception models (i.e., SPM, CPM, and EPM). This suggests that the three perception models are not estimated well, particularly EPM. The low accuracy of these models could be attributed to the direct relationship between the query trajectory selection and convergence criteria with UEM. Therefore, to enhance the learning accuracy of UEM, it is necessary to improve the estimation accuracy of the perception models.

Driver Preference Model Assumption and Setting
In this research, a two-layer hierarchical structure model based on utility theory is assumed to be the driver preference model. The corresponding trajectory indicators for each of the three perception models are selected based on experience. However, the selection method of indicators could be further optimized or even adapted to individual drivers to improve the accuracy of the driver preference model. Additionally, the estimation method settings in Section 3.4.2 could be further studied to improve the performance of OPPLM.
Moreover, the proposed OPPLM could be applied to other advanced driver assistance system functions or autonomous vehicles to learn drivers' preferences based on the driver preference model. However, for different functions, the driver preference model needs to be specifically developed. Further research is needed to explore the potential applications of OPPLM in other contexts and to improve its performance.

Conclusions
This research introduced an online personalized preference learning method (OPPLM) based on pairwise comparison group preference queries and the Bayesian approach. The driver preference model was established using a two-layer hierarchical structure model based on utility theory. To improve accuracy, the uncertainty of drivers' query answers was modeled, and informative and greedy query selection methods were used to enhance learning speed. A convergence criterion was proposed to identify the preferred trajectory.
A user study was conducted to learn the drivers' preferred trajectories of the lane centering control (LCC) system in a simple curve condition without other vehicles. A total of 14 experienced and 15 inexperienced subjects participated in the experiment. The results demonstrate that the OPPLM converges rapidly, within approximately 11 queries on average, and that the driver evaluation scores of the trajectories are consistent with the estimated utility by the learned driver preference model. The OPPLM can quickly and accurately learn the preferences of most subjects.
However, several limitations need to be addressed. The conflict of query answers is common, causing a delay in convergence and inaccurate preference estimation. To ensure adequate exploration, more exploration is necessary for occasional situations. The perception models are not estimated accurately enough when the OPPLM converges because the query selection method and convergence criteria are directly related to UEM only. The trajectory indicators used to build perception models are chosen based on experience, which can be optimized and even adapted to each individual. The estimation method settings in Section 3.4.2 are established based on experience and should be studied further to improve the performance of OPPLM.

Patents
A patent is being applied for based on this research.