Modeling Autonomous Vehicle Responses to Novel Observations Using Hierarchical Cognitive Representations Inspired Active Inference

Abstract: Equipping autonomous agents for dynamic interaction and navigation is a significant challenge in intelligent transportation systems. This study aims to address this by implementing a brain-inspired model for decision making in autonomous vehicles. We employ active inference, a Bayesian approach that models decision-making processes similarly to the human brain, focusing on the agent's preferences and the principle of free energy. This approach is combined with imitation learning to enhance the vehicle's ability to adapt to new observations and make human-like decisions. The research involved developing a multi-modal self-awareness architecture for autonomous driving systems and testing this model in driving scenarios, including abnormal observations. The results demonstrated the model's effectiveness in enabling the vehicle to make safe decisions, particularly in unobserved or dynamic environments. The study concludes that the integration of active inference with imitation learning significantly improves the performance of autonomous vehicles, offering a promising direction for future developments in intelligent transportation systems.


Introduction
The rapidly developing field of autonomous driving systems (ADS) has the potential to revolutionize transportation in smart cities [1,2]. One of the main challenges that ADS face is their ability to adapt to novel observations in a constantly changing environment. This challenge stems from the inherent complexity and uncertainty of the real world, which may lead to unexpected and unpredictable situations.
Driving scenarios involve multiple unpredictable factors, such as diverse driver behaviors and environmental fluctuations. These challenges require a shift from traditional rule-based learning models [3] to adaptive and cognitive entities, enabling autonomous vehicles (AVs) to navigate through complex and unpredictable terrains effectively.
Cognitive learning is a promising approach to tackle the challenge of adapting to novel situations in a dynamic environment [4]. This approach aligns with the principles of Bayesian brain learning [5], which suggests that the human brain operates as a Bayesian inference system, constantly updating its probabilistic models to perform a wide range of cognitive tasks such as perception, planning, and learning. This allows autonomous agents to update their beliefs about the external world in response to novel observations. In the context of ADS, integrating these concepts leads to the emergence of vehicles with a cognitive hierarchy. One of the open questions in this area is to what extent autonomous vehicles can move beyond mere rule-following or imitating prior sub-optimal experiences. The hierarchical principles go beyond rule-based behavior, allowing autonomous vehicles to perceive their surroundings through probabilistic perception, infer causal structures, predict future states, and act with agency. This yields an adaptive learning agent that continuously enhances its cognitive models through experience.
At the heart of the cognitive hierarchy of AVs is the critical role played by generative models (GMs) [6]. These models allow AVs to comprehend the underlying dynamics of the environment, which, in turn, empowers them to anticipate future behavior and engage in proactive decision making [7]. By relying on GMs, AVs go beyond simple perception and use predictive inferences to represent the agent's beliefs about the world. Furthermore, the autonomous agent must be able to plan a sequence of actions that will help it gain information and reduce uncertainty about its surroundings.
Active inference [8], grounded in the Bayesian brain learning paradigm, is a computational framework bridging the gap between perception and action. It suggests that an intelligent agent, such as an autonomous vehicle (AV), should not just passively observe its environment but actively engage in exploratory actions to refine its internal probabilistic models. This is achieved through a continuous cycle of observation, belief updating, and action selection. The process begins with perception, where the AV uses an array of sensors to engage in multisensory, multimodal observation of its surroundings, accumulating sensory evidence about the external world. This sensory evidence is then integrated into a probabilistic model, commonly represented as a belief distribution. The belief distribution encapsulates the vehicle's understanding of the environment, encompassing not only the current state but also a spectrum of potential states, inherently acknowledging uncertainty.
In active inference, messages are propagated between different layers of a probabilistic model, allowing for the exchange of information. These messages are probabilistic updates that convey crucial insights about the environment's dynamics and help to refine the AV's internal probabilistic representation. In addition, active inference introduces the concept of action selection guided by inference. Rather than responding automatically, the AV engages in a deliberate action-planning process. The agent chooses actions that maximize the expected sensory evidence while also accounting for its internal beliefs and uncertainty, contributing to the minimization of free energy [9].
Furthermore, active inference is a powerful technique that enables autonomous agents to develop a sense of self-awareness [10]. By continuously comparing their sensory observations with their internal beliefs, these agents strive to minimize free energy and gain an understanding of their own knowledge and the limits of their perception. This self-awareness allows them to recognize when to seek additional information through exploratory actions and when they can rely on their existing knowledge to make decisions. Simply put, active inference helps autonomous agents become cognitively self-aware while also compelling them to minimize the difference between their internal models and external reality. This allows them to adapt and learn from their environment, make better decisions, and navigate uncertain and complex situations.
Motivated by the previous discussion, we propose a cognitive hierarchical framework, based on active inference, for modeling AV responses to abnormalities, such as novel observations, during a lane-changing scenario. The AV, equipped with self-awareness, should learn to self-drive in a dynamic environment while interacting with other participants. The proposed framework consists of two essential computational modules: a perception module and an adaptive learning module. The perception module analyzes the sensory signals and creates a model based on the observed interaction between the participants in a dynamic environment. This allows the AV to perceive the external world as a bundle of exteroceptive and proprioceptive sensations from multiple sensory modalities and to integrate and appropriately match information from different sensory inputs. In this work, the AV integrates proprioceptive stimuli (i.e., the AV's positions) with exteroceptive stimuli (i.e., the relative distance between the AV and another object) and describes the integration process using Bayesian inference. The adaptive learning module consists of a world model and an active model. The world model is essential to the cognitive processes of perception, inference, prediction, and decision making in active inference systems. It bridges the agent's internal representation and its interactions with the external environment, enabling the agent to adapt its behavior to uncertain scenarios. The active model plans the AV's actions as inferred from the world model in terms of minimizing the cost function due to uncertainty (i.e., unforeseen situations), and its components are linked by the active inference process.
The main contributions of this paper can be summarized as follows:

• We present a comprehensive hierarchical cognitive framework for autonomous driving, addressing the challenge of responding to novel observations in dynamic environments. This framework marks a fundamental shift from rule-based learning models to cognitive entities capable of allowing AVs to navigate unforeseen terrains.

• The proposed framework is firmly grounded in the principles of Bayesian learning, enabling ADS to continually adapt their probabilistic models. This adaptation is essential for the continuous improvement of the cognitive model through experience. Consequently, an AV can consistently update its beliefs regarding its surroundings.

• We expand upon a global dictionary to incrementally develop a dynamic world model during the learning process. This world model efficiently structures newly acquired environmental knowledge, enhancing AV perception and decision making.

• Through active inference, the proposed approach equips the AV with a sense of self-awareness by continually comparing sensory observations with internal beliefs and aiming to minimize free energy. This self-awareness enables the AV to make informed decisions about when to seek additional information through exploratory actions and when to rely on existing knowledge.

• The dynamic interaction between the ego AV and its environment, as facilitated by active inference, forms the basis for adaptive learning. This adaptability augments the AV's decision-making capabilities, positioning it as a cognitive entity capable of navigating confidently and effectively in uncertain and complex environments.

Related Works
One of the most pivotal advancements in intelligent transportation is the development of autonomous driving (AD). AVs are defined as agents capable of navigating from one location to another without human control. These vehicles perceive their surroundings using a variety of sensors, processing this information to operate independently of human intervention [11]. AD involves addressing challenges in perception and motion planning, particularly in environments with dynamic objects. The complex interactions among multiple agents pose significant challenges, primarily due to the unpredictability of their future states. Most model-based AD strategies require the manual creation of driving policy models [12,13], or they incorporate safety assessments to assist human drivers [14,15].
In recent years, there has been a substantial increase in the demand for AVs that can imitate human behavior. Advances in ADS have opened up a wide range of potential applications where an agent is required to make intelligent decisions and execute realistic motor actions in diverse scenarios. A key aspect of future developments in AVs hinges on the agent's ability to perform as an expert in similar situations. Research indicates that utilizing expert knowledge is more effective and efficient than starting from scratch [16-18]. One practical method for transferring this expertise is through providing optimal demonstrations of the desired behavior for the learning agent (L) to replicate [19].
Imitation learning (IL) involves acquiring skills or behaviors by observing an expert perform a specific task. This approach is vital to the development of machine intelligence, drawing inspiration and foundational concepts from cognitive science. IL has long been considered a crucial component in the evolution of intelligent agents [17].
IL is similar to standard supervised learning, but instead of pairing features with labels, it pairs states with actions. In IL, a state represents the agent's current situation and the condition of any target object involved. The IL process typically begins with collecting example demonstrations from an expert agent (E), which are then translated into state-action pairs. However, simply learning a direct state-to-action relationship is not enough to ensure the desired behavior. Challenges such as errors in demonstration collection or a lack of comprehensive demonstrations can arise [20]. Additionally, the learner's task might differ slightly from the demonstrated one due to environmental changes, obstacles, or targets. Therefore, IL often includes a step where the learner applies the learned actions and adjusts its approach based on task performance.
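The state-action pairing described above can be sketched as a minimal direct-learning policy. The following toy example (the state encoding and the nearest-neighbor rule are illustrative assumptions, not the paper's method) maps a queried state to the action of the closest demonstrated state:

```python
import math

# Expert demonstrations as (state, action) pairs.
# Illustrative state: (gap to the lead vehicle in meters, own speed in m/s).
demos = [
    ((30.0, 10.0), "keep_lane"),
    ((12.0, 10.0), "brake"),
    ((15.0, 12.0), "change_lane"),
]

def nearest_neighbor_policy(state):
    """Direct imitation: return the expert action of the closest demonstrated state."""
    def dist(s):
        return math.hypot(s[0] - state[0], s[1] - state[1])
    closest_state, action = min(demos, key=lambda pair: dist(pair[0]))
    return action

print(nearest_neighbor_policy((11.0, 9.5)))  # closest demo is (12.0, 10.0) -> "brake"
```

This kind of policy inherits the limitations discussed next: it cannot generalize beyond the demonstrated states, which is precisely why the corrective refinement step is needed.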
Existing IL approaches for driving can handle simple driving tasks such as lane following [21,22]. However, if the agent faces a new environment or a more complicated task (such as lane changing), the human driver must take control, or the system ultimately fails [23,24]. More specifically, a typical IL procedure is direct learning, where the main goal is to learn a mapping from states to actions that mimics the demonstrator explicitly [25,26]. Direct learning methods are categorized into classification methods, when the learner's actions can be grouped into discrete classes [27,28], and regression methods, which are used to learn actions in a continuous space [29]. Direct learning often fails to reproduce the proper behavior due to issues such as insufficient demonstrations or the need to perform different tasks in changing environments. Additionally, indirect learning can complement direct approaches by refining the policies based on sub-optimal expert demonstrations [30].
The primary limitations of IL include the policy's inability to surpass the expert's sub-optimal performance and its susceptibility to distributional shifts [31]. Consequently, IL often incorporates an additional step where the learning agent refines the estimated policy according to its current context. This self-improvement process can be guided by measurable rewards or by learning from specific instances.
Many of these approaches fall under the umbrella of reinforcement learning (RL) methods. RL enables the encoding of desired behaviors, such as reaching a target and avoiding collisions, and does not rely solely on perfect expert demonstrations. Additionally, RL focuses on maximizing the expected return over an entire trajectory, unlike IL, which treats each observation independently [32]. This conceptual difference often positions RL as superior to IL. However, without prior knowledge from an expert, the RL learning agent may struggle to identify desired behaviors in environments with sparse rewards [33]. Furthermore, even when RL successfully maximizes rewards, the resulting policy may not align with the behaviors anticipated by the reward designer. The trial-and-error nature of RL also necessitates task-specific reward functions, which can be complex and challenging to define in many scenarios.
Learning approaches like IL and RL can be complex without adequate representation or model learning from the environment. To address these challenges, autonomous systems often adopt an incremental learning approach. This method enables the agent to acquire new knowledge while retaining previously learned information [34,35]. As a result, the learning agent becomes capable of processing and understanding new situations that it encounters over time.

Proposed Framework
The hierarchical cognitive schematic introduced in Figure 1 comprises several modules that form a cognitive cycle, preparing an AV to perceive its environment and interact with its surroundings. When the system faces a new situation (i.e., a novel sensorial input), the AV interprets the external world by formulating and testing hypotheses about its evolution. It generates predictions based on prior knowledge acquired from past experiences, performs actions based on its beliefs, observes the outcomes, and refines its beliefs accordingly. The different modules in the architecture can be likened to different areas of the biological brain, each handling particular functionalities. Some parts are responsible for sensory perception, while others are dedicated to planning and decision making. All parts are interconnected and operate together. The cognitive model is characterized by inferences across different modules that enable it to predict the perceptual outcomes of actions. Moreover, the model must utilize these representations to minimize prediction errors and predict how sensory signals change for specific actions. The following sections present a detailed description of the different modules involved in the architecture.

Perception Module
To effectively operate in its environment, an autonomous vehicle (AV) requires a perception module that can learn how it interacts with other dynamic entities. This module takes in multi-sensorial information to identify causalities among the data perceived by the agent. Using various sensors to gather information is vital in constructing a model capable of predicting the agent's dynamics for motion planning. The ability to perceive multimodal stimuli is crucial, as it provides multimodal information under different conditions to augment the scene library of the ADS.
To accurately predict how information will be perceived, the module integrates exteroceptive and proprioceptive perceptions to formulate a contextual viewpoint. This viewpoint includes both internal and external perceptions of the agent. The primary aim is to use this information to predict subsequent internal or external states. To achieve this, the movements of both the AV and other participants are simulated at each instant through interaction rules dependent on their positions and motions, generating coupled trajectory data. Analyzing this multisensory data helps encode the dynamic interaction of the associated agents as probabilities into a coupled generative dynamic Bayesian network (C-GDBN). The resulting dynamic interaction model is self-aware and capable of identifying abnormalities and incrementally learning new interacting behaviors derived from an initial one, influencing the agent's decision making.

Adaptive Learning Module
In addition to the perception module, we have developed an adaptive learning module that enhances the AV's ability to respond to its surroundings. This module continuously analyzes the agent's interactions with the environment and adapts the AV's responses accordingly by acquiring incremental knowledge of the evolving context. This approach ensures the AV can proactively anticipate changes, make better decisions, and respond adeptly. Integrating the adaptive learning module represents a significant step forward in promoting an adaptive interaction between the AV and its surroundings. The module comprises two components: the world model and the active model.

World Model
The world model (WM) acts like a simulator in the brain, providing insights into how the brain learns to execute sensorimotor behaviors [36]. In the proposed architecture, the WM is formulated using generative models, leveraging interactive experiences derived from multimodal sensory information. Initially, the WM is established via the situation model (SM), serving as a fundamental input module (see Figure 2). The SM, represented as a coupled GDBN (C-GDBN), models the motions and dynamic interactions between two entities in the environment, enabling the estimation of the vehicles' intentions through probabilistic reasoning. This constructed C-GDBN encodes the gathered sub-optimal information concerning the interaction of an expert AV (E) with one vehicle (V1) in the same lane, where E changes lanes to overtake V1 without a collision (comprehensive details on structuring the SM can be found in our previous work [37]). To initialize the WM from the provided SM, we transfer the knowledge of E to a first-person perspective, where an intelligent vehicle (L) learns by interacting with its environment and observing E's behavior, integrating the gained knowledge into its understanding of the surroundings. The first-person model (FP-M) establishes a dynamic model that shifts L from a third-person viewpoint to a first-person experience. This allows L to perceive driving tasks as E does, enhancing its imitation accuracy. Such a perspective empowers L to react promptly during interactions with V1. The FP-M's structure is derived by translating the hierarchical levels of the SM into the FP context (as illustrated in Figure 3). The top level of the hierarchy in the FP-M denotes pre-established configurations (D̃) derived from the dynamic behavior of how E and V1 interact in the environment. Each configuration represents a joint discrete state that evolves as:

D̃_k = f(D̃_{k−1}) + w_k,   (1)

where D̃_k is a latent discrete state evolving from the previous state D̃_{k−1} via a non-linear state evolution function f(·) representing the transition dynamic model, and w_k ∼ N(0, Q) is Gaussian process noise. The discrete state variables D̃_k = [S^E_k, S^V1_k] jointly represent the discrete states of E and V1, where S^E_k ∈ S^E, S^V1_k ∈ S^V1, D̃_k ∈ D, and S^E and S^V1 are learned according to the approach discussed in [38], while D = {D̃_1, D̃_2, ..., D̃_m} is the dictionary of all possible joint discrete states (i.e., configurations) and m is the total number of configurations. Therefore, by tracking the evolution of these configurations over time, it is possible to determine the transition matrix that quantifies the likelihood of transitioning from one configuration to the next, as defined by the following:

Π = [P(D̃_i | D̃_j)] ∈ R^{m×m},  with ∑_{k=1}^{m} P(D̃_i | D̃_k) = 1 ∀i,   (2)

where P(D̃_i | D̃_j) represents the transition probability from configuration i to configuration j. The hidden continuous states (X̃) in the FP-M represent the dynamic interaction in terms of a generalized relative distance, consisting of the relative distance and the relative velocity, which evolves as:

X̃_k = F X̃_{k−1} + U_{D̃_k} + w_k,   (4)

where F ∈ R^{n_x×n_x} is the state evolution matrix and U_{D̃_k} = μ_{D̃_k} is the control unit vector.
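The transition matrix described above can be estimated by counting configuration-to-configuration transitions along the training sequence and row-normalizing. A minimal sketch (the configuration indices and the example sequence are illustrative):

```python
def learn_transition_matrix(config_sequence, m):
    """Estimate an m x m transition matrix from a sequence of configuration indices.

    counts[i][j] accumulates observed transitions i -> j; each row is then
    normalized so its probabilities sum to 1 (uniform if a row was never visited).
    """
    counts = [[0.0] * m for _ in range(m)]
    for prev, nxt in zip(config_sequence, config_sequence[1:]):
        counts[prev][nxt] += 1.0
    pi = []
    for row in counts:
        total = sum(row)
        pi.append([c / total for c in row] if total > 0 else [1.0 / m] * m)
    return pi

seq = [0, 0, 1, 2, 1, 2, 2, 0]        # hypothetical configuration indices over time
Pi = learn_transition_matrix(seq, 3)   # each row of Pi sums to 1
```

In the paper's setting the sequence would come from the joint discrete states of E and V1 extracted from the expert demonstrations.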
Likewise, the observations in the FP-M depict the measured relative distance between the two vehicles:

Z̃_k = h(X̃_k) + ν_k,

where Z̃_k ∈ R^{n_z} is the generalized observation, generated from the latent continuous states via a linear function h(·) corrupted by Gaussian noise ν_k ∼ N(0, R). Since the observation transformation is linear, there exists an observation matrix H ∈ R^{n_z×n_x} mapping the hidden continuous states to the observations. Consequently, within the FP framework, L can reproduce anticipated interactive maneuvers, serving as a benchmark to assess its interaction with V1.

Active Model
The active first-person model (AFP-M) links the WM to the decision-making framework associated with L's behavior. This connection is achieved by augmenting the FP-M with active states representing L's movements in the environment. Consequently, the AFP-M represents a generative model P(Z̃, X̃, D̃, a), as illustrated graphically in Figure 4, which is conceptualized based on the principles of a partially observed Markov decision process (POMDP). The AFP-M encompasses joint probability distributions over observations, hidden environmental states at multiple levels, and actions executed by L, factorized as follows:

P(Z̃, X̃, D̃, a) = P(D̃_0) P(X̃_0) ∏_k P(Z̃_k | X̃_k) P(X̃_k | X̃_{k−1}, D̃_k) P(D̃_k | D̃_{k−1}, a_{k−1}),

where each factor corresponds to one level of the hierarchy.

Joint Prediction and Perception
In the initial stage of the process (at k = 1), L employs prior probability distributions, denoted as P(D̃_0) and P(X̃_0), to predict the environmental states, realized through the expressions D̃_0 ∼ P(D̃_0) and X̃_0 ∼ P(X̃_0). The methodological framework for the prediction is grounded in a hybrid Bayesian filter, specifically the modified Markov jump particle filter (M-MJPF) [39], which integrates the functionalities of both the particle filter (PF) and the Kalman filter (KF). As the process progresses beyond the initial stage (for k > 1), L leverages the previously accumulated knowledge about the evolution of configurations. This knowledge is encapsulated in the probability distribution P(D̃_k | D̃_{k−1}), which is encoded in the transition matrix as outlined in (2). The PF propagates N particles, each assigned equal weight and drawn from the importance density π(D̃_k) = P(D̃_k | D̃_{k−1}, a_{k−1}), forming the particle set {D̃_k^(i), w_k^(i)}_{i=1}^N. Concurrently, a bank of Kalman filters is employed, one for each particle, to predict the corresponding continuous generalized states (GSs) X̃_k^(i).

The prediction of these GSs is directed by the higher level, as indicated in Equation (4), which can be articulated in probabilistic terms as P(X̃_k | X̃_{k−1}, D̃_k). The posterior distribution associated with these predicted GSs is characterized by the predictive message:

π(X̃_k^(i)) = ∫ P(X̃_k | X̃_{k−1}, D̃_k^(i)) λ(X̃_{k−1}^(i)) dX̃_{k−1},

where λ(X̃_{k−1}^(i)) represents the diagnostic message that has been previously propagated, following the observation of Z̃_{k−1} at time k − 1. This mechanism plays a crucial role in the algorithm: upon receiving a new observation Z̃_k, a series of diagnostic messages is propagated in a bottom-up manner to update L's belief about the hidden states. Consequently, the updated belief in the GSs is expressed as λ(X̃_k^(i)) = P(Z̃_k | X̃_k^(i)). In parallel, the belief in the discrete hidden states is refined by adjusting the weights of the particles as w_k^(i) = w_{k−1}^(i) λ(D̃_k^(i)), where λ(D̃_k) is defined as a discrete probability distribution.
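The coupled PF/KF prediction step of the M-MJPF can be sketched as follows. This is a simplified one-dimensional illustration under assumed linear-Gaussian dynamics; the transition matrix, dictionary means, and noise values are invented for the example and are not taken from the paper:

```python
import random

random.seed(0)

# Dictionary of configurations: each has a control term (mean velocity) for the KF.
mu = {0: -1.0, 1: 0.0, 2: 1.0}        # U_D = mu_D, one scalar control per configuration
Pi = [[0.8, 0.1, 0.1],                # P(D_k | D_{k-1}): transition matrix
      [0.1, 0.8, 0.1],
      [0.1, 0.1, 0.8]]
F, Q = 1.0, 0.01                      # scalar state model: x_k = F x_{k-1} + mu_D + w

def sample_discrete(probs):
    """Draw an index from a discrete distribution."""
    u, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

def mjpf_predict(particles):
    """One M-MJPF prediction step: the PF samples the next configuration, and a
    per-particle KF predicts the continuous state mean and variance."""
    new = []
    for d, x_mean, x_var, w in particles:
        d_next = sample_discrete(Pi[d])          # PF: discrete jump
        x_pred = F * x_mean + mu[d_next]         # KF predict (mean), driven by mu_D
        p_pred = F * x_var * F + Q               # KF predict (variance)
        new.append((d_next, x_pred, p_pred, w))
    return new

N = 100
particles = [(1, 0.0, 1.0, 1.0 / N) for _ in range(N)]   # equal-weight initial set
particles = mjpf_predict(particles)
```

The bottom-up update (reweighting the particles with λ(D̃_k) after observing Z̃_k) would follow this prediction step.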
λ(D̃_k) ∝ exp(−D_B(λ(X̃_k), P(X̃_k | D̃_k))),

where D_B denotes the Bhattacharyya distance, a measure used to quantify the similarity between two probability distributions. The probability distribution P(X̃_k | D̃_k) is assumed to follow a Gaussian distribution, characterized by a mean vector and a covariance matrix as N(μ_{D̃_k}, Σ_{D̃_k}).
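For univariate Gaussians the Bhattacharyya distance has a closed form, which is enough to sketch how λ(D̃_k) can be built. The 1-D case, the cluster parameters, and the exponential weighting rule are illustrative assumptions:

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Closed-form Bhattacharyya distance between two 1-D Gaussians."""
    v = (var1 + var2) / 2.0
    return (mu1 - mu2) ** 2 / (8.0 * v) + 0.5 * math.log(v / math.sqrt(var1 * var2))

def lambda_discrete(x_mean, x_var, cluster_params):
    """lambda(D_k): similarity of the continuous evidence N(x_mean, x_var) to each
    configuration's Gaussian N(mu_D, Sigma_D), normalized into a distribution."""
    scores = [math.exp(-bhattacharyya_gaussian(x_mean, x_var, m, v))
              for m, v in cluster_params]
    total = sum(scores)
    return [s / total for s in scores]

clusters = [(0.0, 1.0), (5.0, 1.0)]     # hypothetical (mu_D, Sigma_D) per configuration
lam = lambda_discrete(0.2, 1.0, clusters)
```

Here the evidence near the first cluster yields a distribution that strongly favors that configuration, which is then used to reweight the particles.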

Action Selection
The decision-making process of L hinges on its ability to decide between exploration and exploitation, depending on its interaction with the external environment. This discernment is predicated on the detection of observation anomalies. Specifically, L assesses its current state by analyzing the nature of its interactions. In scenarios where the observations align with familiar or normal patterns, L solely observes V1. Conversely, in instances characterized by novel or abnormal observations, L encounters a more complex situation involving multiple agents (e.g., two dynamic agents V1 and V2). This latter scenario represents a deviation from the experiences encapsulated in the expert demonstrations (see Figures 5 and 6). Consequently, based on this assessment, L opts for action a_k, informed by its interpretation of environmental awareness and the perceived need for exploration or exploitation, according to the following:

a_k = { Γ(·, D̃_k^(β))                      if the observation is normal (exploitation),
      { exploratory action based on d_k     if the observation is abnormal (exploration).   (10)

In (10), under normal observation, L imitates E's action selected from the active inference table Γ, defined as follows:

Γ = [P(a_i | D̃_j)] ∈ R^{m×m},

where P(a_i | D̃_j) is the probability of selecting action a_i ∈ A conditioned on being in configuration D̃_j ∈ D, and A = {μ_{D̃_1}, μ_{D̃_2}, ..., μ_{D̃_m}} is the set of available actions. In addition, β denotes the index of the particle with the maximum weight, given by:

β = argmax_i w_k^(i).

In (10), if L encounters a situation that is abnormal and has not been seen by E before, then L looks for new ways to act. It does this by calculating the Euclidean distance (d_k), i.e., the shortest distance between itself and V1 when they are in the same lane. Based on the measured distance, L adjusts its speed to ensure it does not exceed the speed of V1, helping to prevent a collision by slowing down or braking.

Free Energy Measurements and GEs
The predictive messages, π(D̃_k) and π(X̃_k^(i)), are propagated top-down through the hierarchy. At the same time, the AFP-M receives sensory responses in the form of diagnostic messages, λ(X̃_k^(i)) and λ(D̃_k), that move from the bottom level up the hierarchy. Calculating multi-level free energy (FE) helps to quantify how well the current observations match what the model predicts.
At the discrete level, FE is measured as the discrepancy between the two messages, π(D̃_k) and λ(D̃_k), as they enter the node D̃_k. These messages take the form of discrete distributions. Therefore, we propose using the Kullback-Leibler divergence (D_KL) [40] as a method to measure the probability distance and calculate the difference between these distributions:

FE_D̃ = D_KL(π(D̃_k) ∥ λ(D̃_k)).
At the continuous level, FE is conceptualized as the distance between the probabilistic messages arriving at node X̃_k. This involves the Bhattacharyya distance between the messages π(X̃_k^(i)) and λ(X̃_k^(i)), originating from the observation level, defined as follows:

FE_X̃ = D_B(π(X̃_k^(i)), λ(X̃_k^(i))) = −ln BC(π(X̃_k^(i)), λ(X̃_k^(i))),   (13)

where BC is the Bhattacharyya coefficient. Furthermore, generalized errors (GEs) facilitate understanding how to suppress such abnormalities in the future. The GE associated with (13) and conditioned upon transitioning from D̃_{k−1} is the aleatory variable Ė_{D̃_k}, characterized by a discrete probability density function (pdf) denoted as P(Ė_{D̃_k}). The errors identified at the discrete level are then conveyed to the observation level. This step is essential for computing the generalized error at that level, represented as Ẽ_{Z̃_k}, which explains the emergence of a new interaction within the surroundings.
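The two FE measures can be sketched directly for discretized message vectors; the numeric messages below are hypothetical:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions (FE at the discrete level)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bhattacharyya_coefficient(p, q):
    """BC between two discretized messages; FE at the continuous level is -ln(BC)."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

pred = [0.7, 0.2, 0.1]   # top-down predictive message pi(D_k)
diag = [0.1, 0.2, 0.7]   # bottom-up diagnostic message lambda(D_k)

fe_discrete = kl_divergence(pred, diag)                      # large -> abnormal
fe_continuous = -math.log(bhattacharyya_coefficient(pred, diag))
```

When the predictive and diagnostic messages agree, both quantities approach zero; large values signal an abnormality that triggers exploration.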

Incremental Active Learning
By integrating the principles of adaptive learning and active inference, our objective is to minimize the occurrence of abnormalities. This goal can be achieved either by constructing a robust and reliable WM or by actively adapting to the dynamics of the environment. Such an approach ensures a comprehensive understanding of and interaction with environmental variables, thereby enhancing the system's predictive accuracy and responsiveness. The information acquired in the previous phases will be utilized to modify the beliefs of L and to incrementally expand its knowledge regarding the environment. This process involves updating the active inference table (Γ) and expanding the transition matrix (Π). These updates will also take into account the abnormality observation parameter, which considers the similarity between two configurations.
In situations involving abnormalities, L incrementally encodes the novel experiences in the WM by updating both the active inference matrix and the transition matrix. It is important to note that during such abnormal situations, L may encounter scenarios that involve configurations not previously experienced. These configurations are characterized by new relative distances between L and other dynamic objects in its environment, differing from those configurations previously known by the entity E. The discovery and understanding of these new configurations enable L to learn and adapt, thereby enhancing its ability to respond to similar situations in the future.
Consequently, a set C consisting of the relative distance-action pairs collected during the abnormal period T (i.e., exploration) can be defined as C = {Z̃_t, a_t}_{t=1}^T. From these pairs, the conditional probability P(a_t | D̃_t) of selecting each action in the newly discovered configurations can be estimated. Similarly, by analyzing the dynamic evolution of these new configurations, it becomes possible to estimate their transition probabilities P(D̃_t | D̃_{t−1}), encoded in Π′ ∈ R^{n×n}, where n is the number of newly acquired configurations and ∑_{k=1}^{n} P(D̃_i | D̃_k) = 1 ∀i. Consequently, the updated global transition matrix Π′′ ∈ R^{(m+n)×(m+n)} is expressed as follows:

Π′′ = [ Π        Π_{o→n} ]
      [ Π_{n→o}  Π′      ],

where Π is the original transition matrix, Π′ is the newly acquired one, and the off-diagonal blocks Π_{o→n} and Π_{n→o} account for the transitions between the original and the novel configurations.
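Expanding the transition matrix to accommodate n new configurations can be sketched as a block construction. The choice to initialize the cross-block transitions uniformly with a small mass ε is our assumption for the sketch, not the paper's exact rule:

```python
def expand_transition_matrix(pi_old, pi_new, eps=0.01):
    """Build Pi'' in R^{(m+n)x(m+n)} from the original Pi (m x m) and the newly
    learned Pi' (n x n). A small probability eps per row is reserved for cross
    transitions between old and new configurations, so every row still sums to 1."""
    m, n = len(pi_old), len(pi_new)
    size = m + n
    pi2 = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(m):
            pi2[i][j] = (1.0 - eps) * pi_old[i][j]
        for j in range(n):
            pi2[i][m + j] = eps / n              # old -> new cross transitions
    for i in range(n):
        for j in range(n):
            pi2[m + i][m + j] = (1.0 - eps) * pi_new[i][j]
        for j in range(m):
            pi2[m + i][j] = eps / m              # new -> old cross transitions
    return pi2

Pi2 = expand_transition_matrix([[1.0]], [[0.4, 0.6], [0.5, 0.5]])
```

The construction preserves the learned statistics inside each block while allowing the filter to jump between the original and the novel dictionaries.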
Action Update
L evaluates the action performed at time k − 1 using the FE calculated at time k, as defined in (13) and (14). In abnormal conditions, L learns future behaviors by gathering information about its surrounding environment.
During the online learning procedure, L modifies/updates the current active inference table and transition matrix based on the diagnostic messages λ(D̃_k) and λ(a_{k−1}); additionally, the transition matrix is refined using the GE defined in (15). The active inference table Γ can be adjusted row-wise, where π(a_k) = P(· | D̃_k) represents a specific row within Γ. Furthermore, P(Ė_{a_k}) denotes the pdf of the GE associated with the active states, which is calculated from:

λ(a_{k−1}) = λ(D̃_k) × P(D̃_k | a_{k−1}).
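One simple way to realize such a row-wise adjustment of Γ is to nudge the current row toward the bottom-up evidence distribution and renormalize. This is a hedged sketch: the mixing rate η and the convex-combination form are our assumptions, not the paper's exact update:

```python
def update_gamma_row(row, evidence_pdf, eta=0.2):
    """Nudge one row P(a | D_k) of the active inference table toward the
    evidence/error distribution, then renormalize so it stays a valid pdf."""
    updated = [(1.0 - eta) * p + eta * e for p, e in zip(row, evidence_pdf)]
    total = sum(updated)
    return [u / total for u in updated]

row = [0.7, 0.2, 0.1]        # current P(a | D_k) for one configuration
evidence = [0.1, 0.1, 0.8]   # hypothetical pdf of the GE over actions
new_row = update_gamma_row(row, evidence)
```

After the update, actions that the evidence favors gain probability mass while the row remains normalized.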

Results
In this section, we evaluate the proposed framework across different settings. First, we introduce the experimental dataset. Then, we describe the learning process, encompassing both the offline and online phases.

Experimental Dataset
The dataset used in this study was gathered from real experiments conducted on a university campus, involving various maneuvering scenarios with two Intelligent Campus Automobile (iCab) vehicles [41]. During these experiments, expert demonstrations were recorded. These involved two AVs, iCab1 and iCab2, interacting to execute a specific driving maneuver: iCab2, acting as the expert vehicle (E), overtakes iCab1, which represents a dynamic object (V1), from the left side (see Figure 7). Each AV, denoted as (i), was equipped with both exteroceptive and proprioceptive sensors. These sensors collected data on odometry trajectories and control parameters to study the interactions between the AVs. The sensory data provided four-dimensional information, comprising the AVs' positions in (x, y) coordinates and their velocities (ẋ, ẏ).
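A possible in-memory representation of one such four-dimensional sensory reading, together with the relative-distance quantity Z_k used when encoding configurations, is sketched below. The field names and the Euclidean-distance choice are illustrative assumptions, as the excerpt does not publish the dataset's schema:

```python
from dataclasses import dataclass

@dataclass
class SensorSample:
    """One 4-D reading per AV: odometry position (x, y) and velocities."""
    x: float
    y: float
    x_dot: float
    y_dot: float

def relative_distance(a: SensorSample, b: SensorSample) -> float:
    """Euclidean distance between two AVs' positions (assumed form of Z_k)."""
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
```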

Offline Learning Phase
The expert demonstrations from the interactions between the two autonomous vehicles, iCab1 and iCab2, are utilized to learn the situation model (SM) offline. The learned SM comprises 24 joint clusters that encode the dynamic interaction between the AVs and the corresponding transition matrix, as depicted in Figure 8. Following this, the FP-M is initialized with these 24 learned configurations, which include position data and control parameters, as detailed in Section 3.2.1.
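The offline step, clustering joint states into discrete configurations and counting their temporal transitions, can be sketched as below. We use a plain k-means loop for concreteness; the paper does not specify its clustering algorithm in this excerpt, so treat the method as a stand-in:

```python
import numpy as np

def learn_situation_model(states, n_clusters=24, n_iter=50, seed=0):
    """Cluster joint 4-D states (x, y, xdot, ydot) into discrete
    configurations (k-means sketch) and build the row-normalized
    transition matrix from the resulting label sequence."""
    rng = np.random.default_rng(seed)
    centroids = states[rng.choice(len(states), n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each state to its nearest centroid.
        labels = np.argmin(((states[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            pts = states[labels == c]
            if len(pts):
                centroids[c] = pts.mean(axis=0)
    # Transition counts between consecutive configuration labels.
    pi = np.zeros((n_clusters, n_clusters))
    for a, b in zip(labels[:-1], labels[1:]):
        pi[a, b] += 1
    pi = pi / np.maximum(pi.sum(axis=1, keepdims=True), 1)
    return centroids, pi, labels
```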

Online Learning Phase
The offline-acquired FP-M is augmented with the action node to form the AFP-M, as discussed in Section 3.2.2. This model enables L to operate in first person during the online phase. As L undertakes a specific driving task, it assesses the situation based on its beliefs about the evolution of the environmental state and on actual observations. In normal situations, where observations align with predictions, L performs expert-like maneuvers by imitating the expert's actions (exploitation). Conversely, when facing abnormal conditions, such as encountering an unexpected vehicle, L begins to explore new actions based on these novel observations. This approach helps L avoid future abnormal situations and achieve its goals. These goals are twofold: when overtaking is not feasible due to traffic in the adjacent lane, avoiding collisions with other vehicles in L's home lane; and, when observations are normal, safely overtaking the home-lane vehicle by following the expert's demonstration.
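The exploitation/exploration switch described above can be sketched as a simple decision rule. Reducing abnormality detection to a scalar free-energy threshold, and sampling exploratory actions uniformly, are both illustrative simplifications of the paper's FE-driven mechanism:

```python
import numpy as np

def select_action(fe, threshold, expert_action, candidate_actions, rng):
    """Exploit (imitate the expert) when the free-energy signal indicates
    a normal situation; otherwise explore a new action to gather
    information about the novel observation."""
    if fe < threshold:  # observations align with predictions
        return expert_action, "exploit"
    idx = rng.integers(len(candidate_actions))
    return candidate_actions[idx], "explore"
```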

Action-Oriented Model
Figure 9 illustrates a trajectory performed by L in a normal situation, successfully overtaking the home-lane vehicle (V1) by following the prediction.
Conversely, Figure 10 shows how L alternates between exploration and exploitation during a driving task when faced with an abnormality, such as encountering a vehicle (V2) in the adjacent lane. In this scenario, L deviates from the predicted path to avoid a collision with V2 and slows down to maintain a safe distance from V1. This exploratory behavior enables L to learn new actions, like braking, in response to the proximity of the two vehicles. As a result, L expands its knowledge, adjusts its beliefs, and later applies these newly collected experiences in similar situations. Figure 10 further demonstrates that after learning new observations and actions related to reducing speed, L switches to exploitative behavior, using the newly acquired data to maintain a safe distance from V1 as long as it observes traffic in the adjacent lane.
Furthermore, Figure 11 depicts how, after navigating the abnormality and with available space in the adjacent lane (as V2 is moving faster), L once again has the opportunity to overtake V1. These results demonstrate the learned action-oriented model's effective adaptation to the dynamic variability in the environment.

Free Energy Measurement
In this section, we explore how updating and expanding the agent's beliefs about its surroundings minimizes the FE measurement. This minimization occurs through hierarchical processing, where prior expectations generate top-down predictions about likely observations. Any mismatches between these predictions and actual observations are then passed up to higher levels as prediction errors. We examine the efficiency of two action-oriented models in terms of cumulative FE measurement:
• Model A, developed in a normal situation during the online learning phase, where L can overtake V1.
• Model B, formulated in an abnormal situation during the online learning phase, where L is temporarily unable to overtake V1 due to traffic in the adjacent lane.
Figures 12 and 13 offer a visual comparison of FE trends across training episodes. Figure 12 presents the results for Model A, which was trained over 1000 episodes. In contrast, Figure 13 illustrates the performance of Model B over 2000 training episodes. The increase in training episodes for Model B is due to the more complex experimental scenario it addresses. The results demonstrate the capabilities of the proposed framework and its adaptability to new environments. Here, the autonomous vehicle (AV) continuously updates and improves its decision-making processes by refining its beliefs about the surroundings.
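One common way to track such a quantity over episodes is to accumulate per-step surprise, the negative log-probability the belief assigns to the observed configuration. Using surprise as a stand-in for the paper's FE measurement is an illustrative simplification:

```python
import numpy as np

def surprise(predicted, observed, eps=1e-12):
    """Per-step FE proxy: -log P(observed) under the predicted
    categorical belief over configurations."""
    return -np.log(predicted[observed] + eps)

def cumulative_free_energy(episodes):
    """Sum per-step surprise within each episode, producing one
    cumulative value per training episode (a curve analogous in shape
    to the per-episode FE trends discussed above)."""
    return [sum(surprise(p, o) for p, o in ep) for ep in episodes]
```

As the belief sharpens around what is actually observed, each episode's cumulative value shrinks, which is the trend a successful learner should exhibit.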

Conclusions
In this study, we presented a comprehensive framework for autonomous vehicle navigation that integrates principles of active inference and incremental learning. Our approach addresses both normal and abnormal driving scenarios, providing a dynamic and adaptive solution for real-world autonomous driving challenges. The proposed framework effectively combines offline learning from expert demonstrations with online adaptive learning. This dual approach allows the autonomous agent not only to replicate expert maneuvers in normal conditions but also to develop novel strategies in response to unobserved environmental changes.

Additionally, we introduced an action-oriented model that enables the agent to alternate between exploration and exploitation strategies. This adaptability is crucial in dynamic environments, where the agent must constantly balance the need for safety with efficient navigation. In the proposed scenarios involving abnormalities, the learning agent demonstrated an ability to incrementally encode novel experiences and update its beliefs and transition matrix accordingly. This capability ensures that the intelligent agent continuously enhances its understanding of the environment and improves its decision-making process over time. Moreover, the results highlight the effectiveness of the framework in minimizing free energy, indicating a suitable alignment between the agent's predictions and environmental realities. This alignment is key to the agent's successful navigation and decision-making abilities.

In conclusion, this research contributes to the field of autonomous navigation by presenting a framework that learns from both pre-existing knowledge and ongoing environmental feedback, offering a promising solution for real-world autonomous navigation challenges.

Figure 2. Coupled-GDBN representing the dynamic interactions between E and V1.

Figure 3. First-person model consists of a proprioceptive model (right side) and the learned joint configurations (left side) from the learning agent view.

Figure 4. Graphical representation of the active first-person model.

Figure 5. Learning agent receives a normal observation. It is allowed to overtake the vehicle directly ahead in its home lane.

Figure 8. (a) Generated clusters and their associated mean actions, and (b) the transition matrix generated from the AVs' movements.

Figure 9. The learner observes a normal interaction, allowing it to follow its predictions.

Figure 11. After overcoming the abnormality, the learner resumes imitating expert demonstrations.

Figure 12. Cumulative free energy related to Model A. The red curve shows the measurement trend.
The initialization is based on the SM, where the continuous latent state X_k = [X_k^E, X_k^V1] ∈ R^{n_x} represents a joint belief state, with X_k^E and X_k^V1 denoting the hidden generalized states (GSs) of E and V1, respectively. The GSs consist of the vehicles' positions and velocities, where X_k^i = [x_k^i, y_k^i, ẋ_k^i, ẏ_k^i] and i ∈ {E, V1}. The continuous variables X_k evolve from the previous state X_k−1 via the linear state function g(·) and a Gaussian noise w_k, as follows:

X_k = g(X_k−1) + w_k.
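A minimal sketch of one evolution step for a single vehicle's GS follows. The constant-velocity form chosen for the linear map g and the isotropic noise are illustrative assumptions, not the paper's calibrated model:

```python
import numpy as np

def evolve_state(x_prev, dt=0.1, noise_std=0.01, rng=None):
    """One step of X_k = g(X_{k-1}) + w_k for a GS [x, y, xdot, ydot],
    with g a constant-velocity linear map and w_k ~ N(0, noise_std^2 I)."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    rng = rng or np.random.default_rng(0)
    w = rng.normal(0.0, noise_std, size=4)  # Gaussian process noise
    return A @ x_prev + w
```

Stacking two such vehicle states yields the joint belief state X_k = [X_k^E, X_k^V1] used by the coupled model.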