Joint Optimization of Age of Information and Energy Consumption in NR-V2X System Based on Deep Reinforcement Learning

As autonomous driving may become the most important application scenario of the next generation, the development of wireless access technologies enabling reliable and low-latency vehicle communication becomes crucial. To address this, 3GPP has developed Vehicle-to-Everything (V2X) specifications based on 5G New Radio (NR) technology, where Mode 2 Side-Link (SL) communication resembles Mode 4 in LTE-V2X, allowing direct communication between vehicles. This supplements SL communication in LTE-V2X and represents the latest advancement in cellular V2X (C-V2X), with the improved performance of NR-V2X. However, in NR-V2X Mode 2, resource collisions still occur and thus degrade the Age of Information (AoI). Therefore, an interference cancellation method is employed to mitigate this impact by combining NR-V2X with Non-Orthogonal Multiple Access (NOMA) technology. In NR-V2X, when vehicles select smaller resource reservation intervals (RRIs), the resulting higher-frequency transmissions consume more energy to reduce AoI. Hence, it is important to jointly consider AoI and communication energy consumption in NR-V2X communication. We formulate such an optimization problem and employ a Deep Reinforcement Learning (DRL) algorithm to compute the optimal RRI and transmission power for each transmitting vehicle, so as to reduce the energy consumption of each transmitting vehicle and the AoI of each receiving vehicle. Extensive simulations demonstrate the performance of our proposed algorithm.


I. INTRODUCTION
As autonomous driving is one of the most promising application fields for the next generation of communication systems, the development of reliable and low-latency vehicle communication becomes crucial. Such technologies not only enhance interconnectivity between vehicles but also facilitate efficient communication between vehicles and infrastructure. For autonomous vehicles, wireless access technologies play a critical role by providing real-time information exchange and collaboration capabilities among vehicles, thereby enhancing driving safety and efficiency [1]-[5]. Therefore, continuous innovation and development in wireless access technologies hold strategic significance, further propelling the advancement and adoption of autonomous vehicle technology [6]-[10]. Early research on wireless access technologies for vehicles primarily focused on the IEEE 802.11p standard [11]-[13]. In recent related research, the scheduling and allocation of the factors that affect communication performance are the main research directions [14]-[19], and the main research method is DRL [20]-[22].
3GPP has formulated V2X specifications based on 5G NR technology to support ultra-low latency and ultra-high reliability in evolving vehicle applications, communications, and service requirements [23]. As pointed out in [24], SL in NR-V2X was developed to supplement and extend SL communication in LTE-V2X. However, the autonomous resource allocation method used in Mode 2 still suffers from resource collisions. When the RRI decreases and vehicles occupy more resources, the collision probability gradually increases. A collision means that one vehicle receives multiple messages simultaneously; these messages interfere with one another, directly reducing the Signal-to-Interference-plus-Noise Ratio (SINR) of each message and prolonging the transmission time [25]-[27]. Therefore, NOMA is used to mitigate this situation. When a vehicle receives multiple colliding messages, NOMA decodes these messages separately to increase the SINR of messages with relatively low power and improve communication performance [28]-[31].
Furthermore, as mentioned in [32], energy consumption also needs to be considered while ensuring communication effectiveness. The transmission power affects both the SINR and the energy consumed during transmission [33]. When the power is high, the SINR is more likely to meet the requirements for successful transmission, thereby reducing AoI, but energy consumption increases at the same time [34]. Therefore, in NR-V2X communication, it is necessary to comprehensively consider the balance between communication effectiveness and energy consumption. To address this issue, a resource allocation scheme based on DRL is proposed to allocate RRI and power for vehicles, ensuring low energy consumption and a low system AoI during the communication process. The performance of the proposed resource allocation method is evaluated through simulation experiments, and the results demonstrate that it can improve the communication performance of the NR-V2X vehicular networking system.
The remainder of this paper is structured as follows: Section II provides a review of related literature. In Section III, we introduce the system model and formulate the optimization problem. Section IV simplifies the formulated optimization problem and presents a near-optimal solution using DRL. We conduct simulations to demonstrate the effectiveness of our proposed method in Section V, followed by concluding remarks in Section VI.

II. RELATED WORK
In this section, we review existing work on the analysis and network optimization of NR-V2X systems, and on the application of reinforcement learning to them. Rehman et al. proposed an analytical model for evaluating NR-V2X communication performance, focusing on the sensing-based semi-persistent scheduling operations defined in NR-V2X Mode 2 and comparing them with LTE-V2X Mode 4. For different physical layer specifications, the average packet success probabilities of LTE-V2X and NR-V2X were analyzed, and a moment-matching approximation was used to approximate the SINR statistics under the Nakagami-lognormal composite channel model. The results show that, under conditions of relatively large inter-vehicle spacing and a high number of vehicles, NR-V2X outperforms LTE-V2X [35]. Anwar et al. evaluated and compared the PHY-layer performance of various V2V communication technologies. The results showed that NR-V2V outperforms other standards (such as IEEE 802.11bd) in terms of reliability, range, delay, and data rate. However, under the same modulation and coding scheme, IEEE 802.11bd performs better in terms of PER. Although the lowest MCS of NR-V2X is more reliable than that of IEEE 802.11bd, IEEE 802.11bd has a wider range. Overall, NR-V2X performs best in V2V communication [36].
The authors of [37] investigated the energy consumption and AoI of a single-source node, considering potential transmission failures due to poor channel conditions between the source node and the receiver. For a threshold-based retransmission strategy, the corresponding closed-form expressions for average AoI and energy consumption were derived, which can be used to estimate channel failure probabilities and maximum retransmission attempts. The authors of [38] adopted the Truncated Automatic Repeat Request scheme, where terminal devices repeatedly send the current status update until reaching the maximum allowable number of transmission attempts or generating a new status update. Closed-form expressions for the average AoI, average peak AoI, and average energy consumption were derived based on the evolution of the AoI. The authors of [39] primarily considered scenarios where multiple information sources are needed to transmit information to complete status updates and thereby reduce AoI. They investigated freshness-based packet scheduling for application-oriented scenarios with correlated information sources; specifically, they employed AoI to characterize the freshness of status updates and formulated the application-oriented scheduling problem as an MDP, solved using DRL.
Liang et al. proposed an implementation of an Integrated Sensing and Communication (ISAC) system for vehicular networks, addressing the potential performance degradation caused by the coexistence of millimeter-wave radar and communication by extending NR-V2X Mode 2. By using Semi-Persistent Scheduling (SPS) resource selection and dynamically adjusting the radar scanning cycle and transmission power of each vehicle based on the speed and channel congestion status reported by neighboring vehicles, the ISAC system ensures that high-priority vehicles occupy spectrum resources preferentially. Simulation results validated the effectiveness of this approach in improving radar and communication performance, and the ISAC system could better coordinate the coexistence of radar and communication functions, improving the overall performance and security of vehicular networks [40]. Song et al. proposed a scheme for SL resource allocation in NR-V2X Mode 1 based on 5G cellular mobile communication networks. Using hybrid spectrum access and periodic reporting of channel state information, SL resource allocation was modeled as a mixed binary integer nonlinear programming problem to maximize the total throughput of NR-V2X networks across different subcarriers while complying with total available power and minimum transmission rate constraints. Simulation results showed that the proposed power allocation scheme saves energy and that the suboptimal SL resource allocation algorithm outperforms other methods [41]. Molina-Galan et al. conducted an in-depth analysis of the performance of 5G NR-V2X Mode 2 under different traffic patterns. The study pointed out that additional reselections can make SPS more unstable and prone to collisions, and that frequent resource reselections increase implementation costs. They therefore proposed an adjusted re-evaluation mechanism to reduce implementation costs and improve system performance. This work provided valuable insights for the further development and optimization of NR-V2X Mode 2 [42]. Soleymani et al. focused on the joint energy-efficiency and sum-rate maximization problem of autonomous resource selection in NR-V2X vehicle communication under reliability and latency requirements. They formulated the autonomous resource allocation problem as the ratio of total rate to energy consumption, aiming to maximize the total energy efficiency of power-saving users. Since this energy-efficiency problem is a complex mixed integer program, a traffic-based resource allocation density heuristic was proposed, ensuring the same successful transmission rate as sensing-based algorithms while improving energy efficiency by reducing the power consumption per user [43].
Hegde et al. focused on the efficiency of radio resource allocation and scheduling algorithms in C-V2X communication networks and their impact on latency and reliability. Due to the continuous movement of vehicles, sensing-based SPS becomes ineffective, leading to poor resource allocation and frequent resource conflicts. Therefore, the C-V2X communication network was described as a decentralized multi-agent networked Markov decision process. Two variants, independent actor-critic and shared-experience actor-critic, were proposed, achieving a 15-20% improvement in reception probability in high-vehicle-density scenarios [44]. Saad et al. considered optimizing the medium access control layer in NR-V2X for more effective congestion control. They took the AoI indicator into account in the optimization process and introduced DRL to manage the packet transmission rate and transmission power while ensuring high throughput. Compared with traditional distributed congestion control algorithms, the proposed solution demonstrated better performance in terms of timeliness, throughput, and average CBR, highlighting the importance and effectiveness of DRL-based congestion control mechanisms in the context of AoI [45].
NOMA has already been applied in many scenarios. Ju et al. introduced an energy-efficient sub-channel and power allocation strategy for URLLC-enabled GF-NOMA systems using multi-agent deep reinforcement learning (MADRL), aiming to maximize network energy efficiency while meeting URLLC requirements; simulations were used to evaluate the MA2DQN and MADQN variants [46]. Tran et al. explored secure offloading in vehicular edge computing (VEC) networks with malicious eavesdroppers using NOMA. An A3C-based scheme was proposed to optimize energy consumption and computation delay, and simulation results demonstrated its advantages in terms of energy efficiency and security [47]. Long et al. focused on VEC systems in which tasks can be processed locally or offloaded via vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication. They employed decentralized DRL and the deep deterministic policy gradient (DDPG) algorithm for power allocation while addressing the uncertainty of MIMO-NOMA-based V2I communication and random task arrivals [48].
In summary, the existing works on NR-V2X performance analysis have not considered the impact of NOMA on AoI, nor have they employed DRL to jointly optimize AoI and energy consumption in NOMA-enabled NR-V2X vehicular networks. Therefore, we undertake the research presented in this paper.

III. SYSTEM MODEL

A. Scenario description
This section describes the AoI and energy optimization model for the NR-V2X Internet of Vehicles (IoV) system shown in Fig. 1, focusing primarily on the NR-V2X resource selection method. In Mode 2, vehicles using NR-V2X adopt a sensing-based SPS scheme with dynamic and semi-persistent resource selection. In the dynamic scheme, a selected resource is used only once, while in the semi-persistent scheme, resources are reserved for RC transmissions. Additionally, the re-evaluation mechanism in Mode 2 can detect and avoid potential conflicts in message propagation. For NR-V2X sidelink communication, resources in the time domain are organized into frames and subframes. Each frame typically consists of 10 subframes and has a duration of 10 ms [49]; each subframe lasts 1 ms. In the frequency domain, the smallest schedulable frequency unit is the Resource Block (RB). In the NR-V2X standard, consecutive RBs are combined into subchannels, allowing vehicles to transmit messages on one or more subchannels [50]. Vehicles continuously monitor the channel by measuring the Reference Signal Received Power (RSRP) of all J subchannels, storing the latest information of N_sense time slots as a perception window for use when resource selection is required. RSRP represents the received power level of the reference signal in a mobile communication system and is a key indicator for evaluating wireless signal quality and coverage: the higher the RSRP value, the stronger the received signal and the better the signal quality. Vehicles then initialize a selection window (SW) consisting of a set of consecutive candidate time slots, i.e., RRI-sized slots, where z_n denotes the time slot n following the perception window. Each vehicle utilizes the information in the perception window to select available communication resources within the SW. Initially, Z_A is set to include all slots in the SW; vehicles then exclude resources from the set Z_A based on the following conditions. First, due to half-duplex communication, vehicles cannot perceive the resources used by other vehicles in the slots in which they themselves transmit during the perception window; hence, all resources corresponding to those slots are excluded from the SW. Second, if the RSRP measurement of a candidate subframe exceeds RSRP_th, all resources corresponding to that candidate subframe are excluded from the SW; this exclusion criterion applies to the RSRP of the i-th subframe (j-th subchannel) in the SW. If the number of resources remaining in Z_A is less than X% of the total available resources, RSRP_th is increased by 3 dB and the exclusion is repeated. In NR-V2X Mode 2, X can be set to 20, 35, or 50. Finally, vehicles randomly select a communication resource from the remaining resources in Z_A and reserve it for subsequent transmissions, transmitting RC times at intervals of RRI. RC varies with the RRI so that the reservation spans between 0.5 and 1.5 seconds; as shown in [51], this determines the initial value RC_0 for a vehicle. After RC decreases to 0, a vehicle continues to use the preselected resources with probability P_rk or reselects new resources for transmission with probability 1 - P_rk. Before transmitting messages, vehicles that selected communication resources in time slot z_old may check whether these resources are still available (i.e., not reserved by another vehicle) using the re-evaluation mechanism [52]. Vehicles perform this check in time slot z_g. The new resource selection window, denoted SW', is defined as [z_g + T_1, ..., z_n + Γ]. If resources previously excluded are found to be available again during reselection, vehicles select new resources from the available resources in SW'. The resources initially chosen in time slot z_old are then replaced by new resources in time slot z_new, as depicted in the figure. Table I lists the parameters used in this section.
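The sensing-based exclusion and threshold relaxation described above can be sketched as follows. This is a simplified illustration in Python, not the 3GPP procedure itself: the function name, the flat (slot, subchannel) dictionary of RSRP measurements, and the seeded random choice are our own assumptions.

```python
import random

def select_sps_resource(rsrp, rsrp_th, x_percent=20, step_db=3.0, rng=None):
    """Sketch of the Mode 2 sensing-based exclusion step.

    rsrp: dict mapping (slot, subchannel) -> measured RSRP in dBm for
          every candidate resource in the selection window.
    Candidates whose RSRP exceeds the threshold are excluded; if fewer
    than x_percent% of all candidates remain, the threshold is raised
    by step_db (3 dB in NR-V2X Mode 2) and the exclusion is repeated.
    Returns (chosen_resource, final_threshold).
    """
    rng = rng or random.Random(0)
    total = len(rsrp)
    th = rsrp_th
    while True:
        # keep only resources whose measured RSRP does not exceed the threshold
        candidates = [r for r, power in rsrp.items() if power <= th]
        if len(candidates) >= x_percent / 100.0 * total:
            # randomly reserve one of the remaining resources
            return rng.choice(candidates), th
        th += step_db  # relax the threshold by 3 dB and retry
```

A busy channel (all measurements above the threshold) simply forces one or more 3 dB relaxations before a resource is picked.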

B. NR-V2X-NOMA communication model
In the NR-V2X vehicular networking system, we denote the transmitting vehicle as i, the receiving vehicle as j, and the considered time slot as t. The SINR is expressed in terms of the following quantities: p_i^t is the transmission power of vehicle i, h_s^t is the random small-scale fading gain, h_{i→j}^t is the large-scale fading gain of the link from i to j in time slot t, L(d_{i→j}) is the path loss as a function of the distance from i to j, p_n is the noise power, and I_{i→j}^t is the interference power. In Eq. (3), the numerator represents the received power, while the denominator is the sum of the noise power and the interference, assumed to be Gaussian with zero mean. I_{i→j}^t is defined as the sum, over the set V^t of nodes transmitting in time slot t, of each interferer's power weighted by σ_{k,i}^t, a multiplicative coefficient between 0 and 1 that quantifies the interference power of k in the subchannel used by i relative to the transmission power of k. If k uses exactly the same subchannel as i, σ_{k,i}^t is 1; if its signal does not overlap or only partially overlaps, σ_{k,i}^t is less than 1. The calculation of σ_{k,i}^t takes in-band emission into account, consistent with the specifications in [53].
The receiver j employs the successive interference cancellation mechanism of NOMA to decode multiple messages arriving on different subchannels. The decoding process selects the message with the maximum received power as the desired signal while treating the others as interference. Let i denote the signal with the current maximum received power, and let I_i denote the set of other vehicles whose received signal power is lower than that of vehicle i; the SINR obtained by vehicle j using NOMA then follows. When vehicle j decodes the highest-power message from vehicle i, the message powers of the vehicles in I_i act as interference, and the magnitude of this interference is affected by the degree of channel overlap σ. Therefore, by adjusting the power allocation, the SINR of each message can be increased, thereby increasing the likelihood of successful communication.
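The power-ordered decoding described above can be illustrated with a short sketch. Linear-scale received powers and the optional per-pair overlap dictionary for σ are our own assumed data layouts; when decoding message i, only the not-yet-decoded (weaker) messages contribute interference.

```python
def sic_sinr(received_powers, noise, overlap=None):
    """Per-message SINR under power-domain SIC (sketch).

    received_powers: dict vehicle_id -> received power (linear scale).
    overlap: optional dict (k, i) -> sigma in [0, 1] quantifying how
             much of k's power falls in the subchannel used by i
             (defaults to full overlap, sigma = 1).
    Returns dict vehicle_id -> SINR.
    """
    overlap = overlap or {}
    # decode from strongest to weakest received power
    order = sorted(received_powers, key=received_powers.get, reverse=True)
    sinr = {}
    for idx, i in enumerate(order):
        weaker = order[idx + 1:]  # messages not yet cancelled
        interference = sum(
            overlap.get((k, i), 1.0) * received_powers[k] for k in weaker)
        sinr[i] = received_powers[i] / (noise + interference)
    return sinr
```

Partial subchannel overlap (σ < 1) scales down an interferer's contribution, which is why power allocation can raise every message's SINR.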

C. AoI model
When the size of the transmitted message is G, the criterion for successful communication between vehicles i and j is whether the message can be delivered within the time slot at the achievable rate, where ⌊·⌋ denotes rounding down, log_2(1 + η_{i→j}^t) represents the transmission rate, and W_i^t represents the bandwidth utilized by vehicle i for message transmission. Thus, u_{i→j}^t = 0 indicates that the communication rate between vehicles i and j is insufficient to transmit the message within the specified time slot t, resulting in communication failure. Due to the nature of NR-V2X, where each transmission between vehicles requires waiting for a time equal to the RRI, each failed transmission increases the AoI at the receiving vehicle. A successful transmission instead updates the AoI at the receiving vehicle to the AoI of the message transmitted by vehicle i plus Γ. It can be observed that the AoI at the receiving end is influenced by the transmission interval Γ, the transmission status u, and the AoI of the message transmitted by vehicle i, where Φ_{i→i} = 0. When the transmission interval Γ is smaller, the receiving end has more opportunities to update to the AoI of the transmitting end; a higher transmission success rate likewise increases the likelihood of such updates. Furthermore, the AoI at the transmitting end decreases as the queue processing rate 1/Γ increases. Therefore, when Γ is smaller, the AoI at the transmitting end is smaller, resulting in a smaller AoI at the receiving end as well.
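As a minimal illustration of this AoI evolution rule (the function and argument names are ours; the update-on-success behavior follows the description above, with Φ_{i→i} = 0 for a freshly generated message):

```python
def update_aoi(aoi_rx, aoi_tx, success, gamma):
    """AoI evolution at receiver j between communication slots (sketch).

    gamma: the RRI, i.e., the time elapsed between two transmissions.
    On a successful transmission the receiver updates to the sender's
    message AoI plus the interval; on failure its own AoI simply grows
    by the interval.  For a freshly generated message aoi_tx = 0
    (Phi_{i->i} = 0), so a success brings the receiver's AoI to gamma.
    """
    if success:
        return aoi_tx + gamma
    return aoi_rx + gamma
```

Iterating this rule over a sequence of slots reproduces the sawtooth AoI trajectory: linear growth between successes, a drop to aoi_tx + gamma at each success.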
The average AoI at the receiving end over all vehicles in the system is defined accordingly. In the scenario considered, four message types are used, and the queue-position AoI φ_{i,n}^{t+1,b} evolves as follows: in Eq. (11), n indexes the message types, b denotes the position of the message in the queue, and β_{i,n}^t ∈ {0, 1} indicates whether a type-n message can be transmitted (1 indicating that it can). The queue operates on a first-in-first-out basis, and the parameter Γ determines the frequency with which β_{i,n}^t = 1. β_{i,n}^t can therefore be expressed in terms of ⌈·⌉ (rounding up), z_new, the time slot allocated for vehicle reservation in the SW, m, the number of times a vehicle uses reserved resources, and q and L, the queue length and queue capacity, respectively. Since multiple priority queues are considered in this scenario, when a high-priority message in its queue has β_{i,n}^t = 1, it takes precedence over the β_{i,n}^t of the other queues.

D. Energy consumption model
In NR-V2X, when vehicle i reselects resources, the energy consumption for the previously reserved resources is given by the product of the per-use energy and the number of uses: p_i^t l_i represents the energy consumed by one use of the reserved resources, where l_i is the time for which vehicle i utilizes the resource, i.e., the size of one time slot. As shown in Eq. (2), when the transmission interval is smaller, RC_i^0 is larger, and the energy consumption over this period is greater, indicating a trade-off between energy consumption and AoI.
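A hedged sketch of this cost follows. It assumes RC_0 is drawn uniformly so that the reservation spans roughly 0.5 to 1.5 s (Eq. (2) in the paper ties RC_0 to the RRI; the uniform draw and the function name here are our own assumptions):

```python
import math
import random

def reselection_energy(p_tx, slot_s, rri_s, t_min=0.5, t_max=1.5, rng=None):
    """Energy spent on previously reserved resources (sketch).

    Each use of a reserved resource costs p_tx * slot_s joules; the
    resource is reused RC0 times, where RC0 is drawn so that the
    reservation spans roughly t_min..t_max seconds.
    Returns (energy_joules, rc0).
    """
    rng = rng or random.Random(0)
    rc0 = rng.randint(math.ceil(t_min / rri_s), math.floor(t_max / rri_s))
    return rc0 * p_tx * slot_s, rc0
```

A smaller RRI yields a larger RC_0 range and hence more transmissions and more energy over the same reservation span, which is the AoI-energy trade-off noted above.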
The average energy consumption of all vehicles in the system is defined as the mean of the per-vehicle energy consumption.

IV. OPTIMIZATION METHOD OF MPDQN BASED ON DRL

A. Framework for optimization problems
Based on the defined system model, the optimization problem is formulated to minimize the weighted sum of the average AoI and the average energy consumption of vehicles in the system. Since the AoI and energy consumption of vehicles depend on the RRI Γ and the power p, the optimization problem is a weighted minimization over these variables, where ω_1 and ω_2 are non-negative weight factors. Since the channel conditions in the NR-V2X system are uncertain, we employ the Multi-Pass Deep Q-Network (MPDQN) method based on DRL to solve this optimization problem. In this method, the RSU serves as the agent, and its observed state at time slot t comprises the states of all vehicles. The state of each vehicle consists of N_t, the total number of other vehicles within the range w (defined as receivers), the average distance d̄_t to these receivers, P_t(u_t = 1), the probability of successful message reception by the receivers, and RC_0, the total number of times that vehicles use reserved resources. The action assigned by the RSU to vehicles at time slot t is the tuple (Γ, p_Γ), where p_Γ represents the parameter that converts the continuous action p into the discrete action Γ. They act as two sub-actions, and together they constitute the complete assigned action.
The objective of the optimization problem is to minimize the AoI and energy consumption in the system. Therefore, the reward function is defined as the negative weighted sum of these two quantities, where Φ_i^t is the mean AoI of the receivers of vehicle i over a certain period of time.
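The reward described here can be sketched as the negative weighted sum of the two averages; the list-based inputs and function name are illustrative assumptions, while ω_1 and ω_2 are the weight factors from the optimization problem.

```python
def reward(aoi_list, energy_list, w1, w2):
    """Negative weighted sum of average AoI and average energy (sketch).

    Minimizing w1 * avg_AoI + w2 * avg_energy is equivalent to
    maximizing this reward, which the MPDQN agent at the RSU receives.
    """
    avg_aoi = sum(aoi_list) / len(aoi_list)
    avg_e = sum(energy_list) / len(energy_list)
    return -(w1 * avg_aoi + w2 * avg_e)
```

The sign convention matters: DRL maximizes cumulative reward, so the minimization objective enters with a negative sign.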

B. Solution to optimization problems
For the action tuple (Γ, p_Γ), a policy network is used to generate the continuous parameter,
where θ_x represents the weights of this network. Then, another deep neural network is used to approximate the action-value function Q(s, (Γ, x)), with network weights denoted θ_Q. The process by which the agent obtains the action with the highest action value is illustrated in Fig. 2.
The loss functions for the Q-network and the actor network are defined in Eqs. (23) and (24), where y_t is the bootstrapped target value. Finally, the network weights are updated using the learning rates lr_x and lr_Q for the respective networks, so as to approach the optimization objective.
Next, we describe the algorithm in detail. First, the parameters of both networks are randomly initialized, and an experience replay buffer of size M is established. Then, the algorithm iterates over EP episodes. At the beginning of each episode, the system parameters are reset. The RSU selects initial action tuples based on the initial state and the networks, and observes the next state. Subsequently, the algorithm iterates from time slot 1 to time slot T. For each time slot t, the RSU allocates actions to vehicles needing resource reallocation based on the current state. When selecting actions, the RSU either explores randomly with a certain probability or chooses the action with the maximum Q-value, introducing exploration noise to avoid local optima. Finally, the tuple (s_t, (Γ_t, p_Γ), r_t, s_{t+1}) is stored in the experience replay buffer. Once the number of tuples in the buffer exceeds the sample size B, they are used to update the network parameters. The pseudocode is shown in Algorithm 1.
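The replay buffer and the exploration step of this loop can be sketched as follows. This is a minimal illustration: the class and function names are ours, and the ε-greedy selection over the discrete sub-action stands in for the noise-based exploration used in the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay for the MPDQN training loop (sketch)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest tuple when full
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size, rng=None):
        rng = rng or random
        return rng.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

def choose_action(q_values, epsilon, rng=None):
    """Epsilon-greedy selection over the discrete sub-action (Gamma index).

    With probability epsilon the agent explores a random action;
    otherwise it picks the action with the largest Q-value.
    """
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Updates begin only once len(buffer) exceeds the sample size B, exactly as in the algorithm description above.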
During the testing phase, there is no need to update the parameters; the RSU simply assigns actions to the vehicles that require them based on the strategy optimized in the training phase. The corresponding pseudocode is shown in Algorithm 2.

V. SIMULATION RESULTS

A. Parameter Settings
The simulation has been conducted using Python 3.6 and MATLAB 2023b, based on modifications to the code provided in [54]. The simulation scenario involves a two-way highway covered by the RSU communication range, where randomly distributed vehicles travel with constant speeds in their respective lanes and use NR-V2X sidelink technology for V2V communication. The length of the highway, D, is 500 m, the RSU coverage range, D_RSU, is 250 m, and the maximum distance, w, between vehicles and receivers is 150 m. All vehicles utilize four different priority queues of length L, and the vehicles occupy a channel bandwidth of 10 MHz within the 5.9 GHz frequency band. The receiver has a 9 dB noise figure. The path loss model features a standard deviation of 3 dB and a decorrelation distance of 25 m, and RSRP_th is -126 dBm. MPDQN employs a neural network with one hidden layer and updates its parameters using the Adam optimizer with learning rates lr_Q = 5 × 10^-4 and lr_x = 10^-4.

[Algorithm 1: Optimization algorithm for AoI and energy consumption based on MPDQN. Input: γ, lr_x, lr_Q; output: optimized θ_Q, θ_x. The algorithm initializes the learning rates lr_x and lr_Q, the experience replay buffer M, the sample size B, and the network weights θ_x and θ_Q; observes the initial state s_0 and outputs the initial action (Γ_i^0, p_Γ); then, in each time slot, obtains action tuples from the policy network or through random exploration, assigns RRI and power to vehicles that need communication resources according to the policy and their status, executes the action (Γ_i^t, p_Γ), observes the state s_{t+1} and reward r_t, and iterates the networks according to Eqs. (23), (24), (26), and (27).]

The experience replay buffer size M is set to 2000, and the sample size B is 128 [55].
Ornstein-Uhlenbeck noise is used as the exploration noise for the network, with a decay rate of 0.15 and a variance of 0.0001. The key simulation parameters are listed in Table II.

B. Simulation Results
In this section, we first compare the AoI of vehicles in LTE-V2X and NR-V2X. Then, we compare the AoI of vehicles in NR-V2X before and after applying NOMA, as well as the AoI in NOMA-based LTE-V2X and NR-V2X. Finally, we optimize the joint objective of AoI and energy consumption in NR-V2X using MPDQN. Many recent works employed genetic algorithms [56] and random algorithms [57] as baselines for resource allocation, and we therefore compare our approach with these two methods.

Fig. 3 illustrates the variation of the average AoI in the system with the number of vehicles using LTE-V2X and NR-V2X for direct V2V communication. The numbers of vehicles considered are 20, 30, 40, and 50, with each vehicle following 3GPP standards and employing a random strategy to select its RRI and transmission power. It is observed that the average AoI in the system increases with the number of vehicles, regardless of whether the LTE-V2X or NR-V2X communication mode is used. This increase can be attributed to the expansion of the receiver set within the communication range of each vehicle as the number of vehicles grows, leading to increased interference among them. Furthermore, due to the half-duplex communication mode, more vehicles are unable to receive messages because of resource contention, resulting in an increase in the average AoI. Additionally, since NR-V2X serves as a complement to and advancement of LTE-V2X, the average AoI of vehicles in a vehicular networking system using NR-V2X is consistently lower than that in an LTE-V2X system.

Fig. 4 illustrates the variation of the average AoI in the system with the number of vehicles when vehicles use NR-V2X for V2V communication with NOMA enabled. In this scenario, vehicles randomly select their RRI and transmission power. It can be observed that when vehicles use NOMA, they exhibit a lower AoI as the number of vehicles changes, and the overall growth trend is smoother. This is attributed to NOMA's power-domain decoding approach, which significantly mitigates the impact of different vehicles occupying the same resources. Similarly, the average AoI within the system increases with the number of vehicles due to the growing number of receivers: when more receivers cannot successfully receive messages because of resource contention, the AoI increases as the number of vehicles becomes larger. Fig. 5 depicts the variation of the average AoI within the system with the number of vehicles when vehicles utilize NOMA-based NR-V2X and LTE-V2X for V2V communication. With NOMA incorporated, the relationship between the two remains consistent with Fig. 3: the average AoI of NR-V2X consistently remains lower than that of LTE-V2X. This persistent advantage can be attributed to the fact that vehicles in NR-V2X experience fewer resource collisions even before the introduction of NOMA.
Fig. 6 illustrates the learning curves of training under different scenarios. Overall, the rewards of the different curves fluctuate upward from episode 0 to episode 500; subsequently, the learning curves stabilize, indicating that the agent has learned a near-optimal strategy. There is some jitter in the curves around episode 1000 when the number of vehicles is 40 and 50, which is attributed to detection noise impacting the agent, necessitating adjustments to return to a convergent state. Furthermore, as the number of vehicles increases, the rewards decrease. This is because each device experiences more interference as the number of devices in the system grows, resulting in a lower SINR. The lower SINR prolongs transmission delays and increases the system AoI. To maintain a lower AoI, RSUs instruct more vehicles to utilize communication resources for transmission; that is, more vehicles imply a higher AoI and higher energy consumption, and hence lower rewards.

Fig. 7 depicts the variation of the average AoI with the number of vehicles in the NOMA-based NR-V2X vehicular network when the MPDQN, genetic algorithm, and random strategies are employed. The AoI of all three strategies increases with the number of devices. This is attributed to the interference experienced by each device as the number of devices grows, which lengthens transmission times according to the rate equation and may further increase the system AoI. Furthermore, the allocation strategies obtained by MPDQN, which approximates the optimal strategy, and by the genetic algorithm consistently outperform the random strategy: the near-optimal MPDQN policy selects actions for vehicles based on observed states, and the genetic algorithm derives better action allocations through evolution, whereas the random strategy merely generates action allocations at random. Additionally, the strategy derived from MPDQN outperforms that of the genetic algorithm, yielding a lower average AoI, because MPDQN accounts for the impact of each time slot's action allocation on subsequent AoI, whereas the genetic algorithm does not.

Fig. 8 compares the energy consumption within the system when vehicles employ the three methods. Energy consumption increases with the number of devices. The energy consumption of the random method does not vary significantly with the number of vehicles, resulting in a roughly linear pattern. Under MPDQN and the genetic algorithm, by contrast, the growing number of vehicles raises the interference power and thus lowers the SINR. The reduced SINR lengthens transmission times and raises the probability of exceeding the transmission time slots, thereby increasing the average AoI of the system. However, because the average energy consumption carries a relatively larger weight in the optimization objective, RSUs may choose to incur only minimal additional energy consumption when the AoI is low; thus, the number of vehicles affects energy consumption less than it affects AoI. Furthermore, the strategies obtained by MPDQN and the genetic algorithm consistently outperform the random strategy, as both derive better actions that keep energy costs low at low AoI. Additionally, MPDQN consistently outperforms the genetic algorithm when the number of vehicles is high, because MPDQN has an advantage in real-time decision-making in dynamic environments: it continuously adjusts its strategy based on environmental feedback, giving it stronger adaptability. In scenarios with more vehicles, MPDQN can more easily learn near-optimal scheduling strategies through interaction with the environment.

Fig. 9 compares the impact of the different algorithms on the average AoI in a scenario with 50 vehicles. As the message size increases, the average AoI also tends to increase, because larger messages require higher transmission rates, necessitating greater bandwidth and a higher SINR. Among the three algorithms, MPDQN generally achieves the lowest average AoI, followed by the genetic algorithm, highlighting the effectiveness of MPDQN.
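The link between message size, SINR, and transmission time can be sketched as follows, assuming a Shannon-capacity rate model (a simplification; the paper's exact rate equation is not reproduced here), with illustrative bandwidth and SINR values.

```python
import math

def tx_time_s(msg_bits, bandwidth_hz, sinr):
    """Transmission time in seconds under a Shannon-rate assumption."""
    rate = bandwidth_hz * math.log2(1.0 + sinr)   # achievable rate, bit/s
    return msg_bits / rate

t_small = tx_time_s(300 * 8, 10e6, 5.0)    # 300-byte message, 10 MHz, SINR 5
t_large = tx_time_s(1200 * 8, 10e6, 5.0)   # 4x larger message, same link
# At a fixed SINR the transmission time scales linearly with message size;
# a lower SINR shrinks the rate and stretches it further, inflating AoI.
```

This is why larger messages and heavier interference (lower SINR) both push the average AoI upward in Fig. 9.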

VI. CONCLUSIONS
This paper addresses the probability of resource collisions in NR-V2X Mode 2 communication, which persists despite the autonomous resource selection and probabilistic reselection mechanisms. To mitigate the impact of collisions on the communication process, we proposed utilizing NOMA's successive interference cancellation mechanism. Additionally, we employed the MPDQN algorithm to dynamically adjust the transmission interval and transmission power of vehicles so as to reduce the average AoI and energy consumption in the system. We first established communication models for NR-V2X and NOMA, and then constructed a reinforcement learning framework based on MPDQN. Within this framework, we modified the action space to enable simultaneous scheduling of discrete and continuous actions, and finally optimized the joint problem of AoI and energy consumption. Through simulation analysis, we demonstrated the advantages of NR-V2X over LTE-V2X, the improvement that NOMA brings to AoI performance in NR-V2X scenarios, and the effectiveness of MPDQN in reducing AoI and energy consumption in NR-V2X scenarios. Some potential challenges remain in this direction. Future vehicles may use both LTE-V2X and NR-V2X for communication, so the coexistence and integration of these two technologies is a challenge [58]. In addition, fairness is often considered a key factor in NOMA-related scenarios [59]. Therefore, our future research will focus on performance optimization in scenarios where LTE-V2X and NR-V2X coexist and on fairness of resource allocation in NOMA-enabled scenarios. Moreover, since MPDQN combines DQN and DDPG, its performance can naturally be further improved by improving either component.
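The hybrid discrete-continuous action selection mentioned above can be sketched as follows. This is a minimal MPDQN-style sketch, assuming the discrete action is the RRI choice and its continuous parameter is the transmit power; the network weights are random stand-ins (not trained), and the RRI candidates, power range, and state dimension are illustrative values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

RRIS = [20, 50, 100]           # candidate RRIs in ms (illustrative)
P_MIN, P_MAX = 0.0, 23.0       # assumed transmit-power range in dBm

STATE_DIM = 4
# Actor: one continuous power parameter per discrete RRI choice.
W_actor = rng.normal(size=(STATE_DIM, len(RRIS)))
# Q-network: scores every discrete action given the state and ALL parameters.
W_q = rng.normal(size=(STATE_DIM + len(RRIS), len(RRIS)))

def select_action(state):
    raw = state @ W_actor
    # Sigmoid squash keeps each power parameter inside [P_MIN, P_MAX].
    powers = P_MIN + (P_MAX - P_MIN) / (1.0 + np.exp(-raw))
    q_values = np.concatenate([state, powers]) @ W_q
    k = int(np.argmax(q_values))           # greedy discrete choice
    return RRIS[k], float(powers[k])       # (RRI, matching power)

rri, power = select_action(rng.normal(size=STATE_DIM))
```

The design point is that the actor proposes a parameter for every discrete action, and the Q-network then picks the best (RRI, power) pair jointly, which is what lets one agent schedule a discrete interval and a continuous power at once.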

Algorithm 2: Testing stage of the MPDQN
1: for episode from 1 to EP do
2:   Reset the model parameters;
3:   Receive the initial observation state s_1;
4:   for slot t from 1 to T do
5:     ...
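The testing-stage loop above can be sketched as a toy script: the trained policy is run greedily, with no exploration or learning updates. Here `reset_env`, `policy`, and `step_env` are illustrative stand-ins, not the paper's simulator, and the episode/slot counts are toy values.

```python
EP, T = 3, 5                      # episodes and slots per episode (toy)

def reset_env():
    return 0.0                    # toy initial observation s_1

def policy(state):
    return (20, 10.0)             # fixed (RRI in ms, power in dBm) stand-in

def step_env(state, action):
    return state + 1.0, -1.0      # toy next state s_{t+1} and reward r_t

returns = []
for episode in range(EP):
    s = reset_env()
    total = 0.0
    for t in range(T):
        a = policy(s)             # greedy action from the learned policy
        s, r = step_env(s, a)
        total += r
    returns.append(total)
```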

Fig. 10 compares the influence of the different algorithms on the average energy consumption in the same scenario as Fig. 9. Here, energy consumption is averaged over the number of vehicles depicted in Fig. 8. Consistent with the previous findings, MPDQN outperforms the GA and random algorithms, ensuring lower energy consumption during vehicle communication in this scenario.

TABLE I: Summary of notations.
Algorithm 1 (training stage of the MPDQN), excerpt:
... observe state s_{t+1} and reward r_t;
Store the transition tuple (s_t, (Γ_t, p_Γ), r_t, s_{t+1}) in M;
11: if the number of tuples in M is larger than B then
12:   Randomly sample B transition tuples from M;
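The replay memory M used in these steps can be sketched as a simple bounded buffer; the capacity, batch size, and toy transitions below are illustrative, not the paper's hyperparameters.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s_t, (rri_t, p_t), r_t, s_{t+1}) and samples a
    minibatch of B tuples once it holds at least B of them."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest tuples evicted first

    def store(self, s, action, r, s_next):
        self.buf.append((s, action, r, s_next))

    def sample(self, batch_size):
        if len(self.buf) < batch_size:
            return None                    # not enough experience yet
        return random.sample(self.buf, batch_size)

memory = ReplayMemory(capacity=100)
for t in range(10):
    memory.store(t, (20, 1.0), -1.0, t + 1)   # toy transitions
early = memory.sample(32)   # None: fewer than B tuples stored so far
batch = memory.sample(5)
```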

TABLE II: Values of the parameters in the experiments.