Neural scalarisation for multi-objective inverse reinforcement learning

Multi-objective inverse reinforcement learning (MOIRL) extends inverse reinforcement learning (IRL) to multi-objective problems by estimating weights and multi-objective rewards, which can be used to retrain agents and to analyse preference-conditioned behaviour. Unlike previous methods, which rely on linear scalarisation, we propose a MOIRL method that uses neural scalarisation. The method comprises four neural networks: a weight-mapping network, a reward network, a scalarisation network and a weight back-translation network. Additionally, we introduce two techniques that stabilize the training of the proposed method. Experiments show that the proposed method can estimate appropriate weights and rewards reflecting true multi-objective intentions. Furthermore, the estimated weights and rewards can be used for retraining to reproduce the expert solutions.


Introduction
Reinforcement learning (RL) [1] is a machine learning technique in which RL agents, acting as decision makers, learn optimal control to maximize cumulative rewards. RL has attracted attention in various fields, such as robotics [2] and autonomous driving [3], due to its ability to learn optimal control automatically. However, RL methods are limited by two problems [4]: the feature design problem and the reward design problem. In the feature design problem, an appropriate method for extracting features from observations is required to achieve the desired performance. In recent years, neural network-based deep RL methods [5,6] have been proposed to solve this problem. Neural networks dramatically improve the performance of RL methods by automating the feature extraction and function approximation processes.
In the reward design problem, an appropriate reward function is required to achieve the desired performance. Learning from demonstration (LfD) methods have been proposed as a solution to this problem. LfD methods reproduce expert behaviour by imitating expert demonstrations. Collecting such demonstrations is often easier than designing the reward function; thus, LfD methods have received increasing attention in recent years. LfD methods can be divided into two main approaches: imitation learning (IL) [7,8] and inverse RL (IRL) [9,10]. In IL, the expert is imitated directly. In contrast, IRL imitates the expert indirectly by first estimating the reward from a demonstration and then learning a policy for the estimated reward via RL. The resulting agent is robust to the covariate shift problem because it is retrained with the estimated reward. Note that the estimated reward also allows a quantitative analysis of the expert's intentions.
Traditional IRL methods [4,10-15] assume that experts optimize a single-objective reward. However, many real-world problems are multi-objective. For example, in autonomous driving, it is possible to set two objectives, i.e. speed and safety. Here, the agent has to consider both objectives and drive according to the preferences of each passenger. The multi-objective decision problem is formulated as a multi-objective Markov decision process (MOMDP) [16]. In this paper, we propose a multi-objective IRL (MOIRL) method that extends IRL to MOMDPs. The proposed MOIRL method estimates expert weights (i.e. per-objective preferences) and rewards, and it supports two main applications.
The first application is retraining, where the estimated rewards can be used to train agents according to specified preferences. In the autonomous driving example, multiple driving styles that trade off these two objectives in arbitrary proportions can be generated by estimating the rewards for speed and safety. The second application is intention analysis, where the weights and rewards estimated by MOIRL are the experts' per-objective preferences and common intentions, respectively. IRL has been widely used to estimate the intentions of humans and other organisms, and it has helped to explain complex behaviour [17-20]. However, single-objective IRL can only provide a mixture of multiple expert rewards. In contrast, MOIRL estimates weights and rewards separately, providing a more detailed intention analysis.
Previous MOIRL methods [21-23] assumed that the expert applies linear scalarisation to the multi-objective rewards. However, weighted scalarisation also includes non-linear methods, e.g. Chebyshev scalarisation [24,25]. Thus, a method that can learn without the linear scalarisation assumption is desirable. In this paper, we generalize weighted scalarisation and then propose a neural scalarisation method that uses a neural network to learn the scalarisation operation. The proposed method can learn without the linear scalarisation assumption. In addition, a more accurate estimation of the weights and rewards is achieved using two learning stabilization techniques. Finally, we experimentally verify that the proposed method, i.e. MOIRL with neural scalarisation, can estimate appropriate weights and rewards and can achieve retraining performance comparable to existing methods.
The remainder of this paper is organized as follows. Section 2 discusses related work on MOIRL from the literature. Section 3 gives the background of the proposed method, including the Markov decision process, RL, multi-objective RL (MORL), IRL, and MOIRL. Section 4 presents a generalization of weighted scalarisation. Section 5 describes the proposed method, and Section 6 discusses techniques for learning stabilization. Section 7 details the IRL method employed in the experiments. Section 8 reports the experimental results, and Section 9 discusses the findings. Finally, Section 10 provides our conclusions.

Related work
Three MOIRL methods for estimating weights and multi-objective rewards have already been proposed. Table 1 compares the existing methods with the proposed method. Our early MOIRL methods were based on matrix factorization. NMF-MOIRL [21] is based on the fact that linear scalarisation is equivalent to a matrix dot product. Here, NMF-MOIRL estimates the reward matrix via single-objective IRL and then applies matrix factorization to estimate the weights and rewards. Note that NMF-MOIRL can only handle nonnegative rewards; however, reward matrix decomposition (RMD) [22] solves this problem by using a gradient descent-based matrix decomposition procedure.

As summarized in Table 1, NMF-MOIRL [21], RMD [22] and MODIRL [23] all assume linear scalarisation, whereas the proposed method uses neural scalarisation. The matrix factorization-based methods face two challenges. First, the process from the IRL to the matrix factorization is unidirectional; therefore, it is not possible to modify the results of the IRL based on the results of the matrix factorization. Second, the number of state dimensions is limited to two or less due to the matrix conversion. Therefore, for tasks with high-dimensional state spaces, a dimensionality reduction method is required.
Multi-objective deep IRL (MODIRL) [23] employs neural networks to construct the decomposition structure. MODIRL learns IRL and decomposition simultaneously. The number of dimensions in the neural network can be varied to handle an arbitrary number of state dimensions. All previous MOIRL methods assume that the expert applies a linear scalarisation to multi-objective rewards. However, there are other weighted scalarisations, including non-linear methods. The proposed method automatically learns appropriate scalarisation operations by employing neural scalarisation, as described in Section 4. The basic structure is inherited from MODIRL; thus, it can handle an arbitrary number of state dimensions and modify the IRL results based on the decomposition results.

Reinforcement learning and Markov decision process
In RL, an agent learns the optimal policy under an MDP [1]. The MDP M = ⟨S, A, P, R, γ⟩ is represented by a state s ∈ S, an action a ∈ A, a reward r ∈ R, a state transition probability p(s′|s, a) ∈ P, and a discount factor γ (0 ≤ γ ≤ 1). The reward function r(s_t, a_t) evaluates the goodness of state s_t and/or action a_t at time t. In an MDP, the reward function returns a scalar reward value r_t. Here, the agent learns a policy that maximizes the expected discounted cumulative reward given by (1) in a single trial (i.e. an episode).
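Under the usual definition, this objective can be written as

\[
\mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right],
\]

i.e. the reward accumulated over an episode, discounted by γ at each time step.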

Multi-objective RL and multi-objective MDP
In real-world decision problems, multiple objectives have to be considered. MORL learns the optimal policy according to a multi-objective MDP (MOMDP) [16] M = ⟨S, A, P, R, γ⟩. In a MOMDP, the scalar reward r ∈ R of the MDP is replaced by a multi-objective reward vector r ∈ R. In addition, an agent has a weight vector w ∈ W (0 ≤ w_o ≤ 1, Σ_o w_o = 1), which expresses preferences indicating how much the agent values the reward of each objective. Two main approaches are used in MORL methods [26]. The first is the multiple-policy approach, where the agent maintains per-objective policies, and the second is the single-policy approach, where the agent learns a single policy based on a scalarised reward value. The most common method used in the single-policy approach is linear scalarisation [24]. Linear scalarisation is also referred to as the weighted sum [27], where the multi-objective rewards are multiplied by per-objective weights and summed to obtain a scalar.

Inverse reinforcement learning and multi-objective inverse reinforcement learning
IRL [9,10] methods are the inverse of RL methods. While RL methods learn the optimal policy from the reward, IRL methods estimate the reward from a recorded demonstration of the optimal policy. Here, a demonstration is a record of an expert's behaviour along a time series {s_0, s_1, s_2, . . ., s_T} (note that s_T is the terminal state). In this paper, we define a demonstration as data consisting of sequences of the expert's transitioned states. Among IRL methods, a method that approximates the reward using a neural network and estimates a non-linear reward is called a deep IRL (DIRL) method.

Assumption 3.1:
The multi-objective reward r is the same for all experts; however, the weights w (i.e. the per-objective preferences) for each objective differ among the experts. Experts learn policies according to the reward value r_SC scalarised from w and r.

Assumption 3.2:
Demonstrations are grouped by different weights.
These two assumptions are the premises of the MOIRL framework. First, Assumption 3.1 is essential for the decomposition of weights and rewards. In addition, Assumption 3.2 is necessary for finding differences between experts and assigning a weight to each group in the weight back-translation.
Linear scalarisation: Linear scalarisation computes the objective-dimensional sum of the weighted rewards as follows.
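In the notation defined below, this sum takes the standard form

\[
r^{i}_{SC}(s) = \sum_{o} \left( \mathbf{w}^{i} \odot \mathbf{r}^{s} \right)_{o} = \sum_{o} w^{i}_{o}\, r^{s}_{o}.
\]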
Here, ⊙ denotes the element-wise product (i.e. the Hadamard product), and w^i_o and r^s_o denote the weight of the i-th agent for the o-th objective and the reward value for the o-th objective in state s, respectively (the same hereafter).
Chebyshev scalarisation: Chebyshev scalarisation is a type of non-linear scalarisation method that computes the objective-dimensional maximum of the weighted rewards as follows.
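One common form of Chebyshev scalarisation, consistent with this description, is

\[
r^{i}_{SC}(s) = \max_{o}\; w^{i}_{o}\, \left| r^{s}_{o} - r^{*}_{o} \right|.
\]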
Here, r* denotes the reference point. These weighted scalarisations can be generalized by the following definition (4).
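One way to write this generalization, consistent with the description above, is

\[
r^{i}_{SC}(s) = \bigoplus_{o} \left( \mathbf{w}^{i} \odot \mathbf{r}^{s} \right)_{o}.
\]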
Here, ⊕ denotes the scalarisation operator for the weighted reward. Equation (4) corresponds to linear scalarisation when ⊕ is the objective-dimensional sum, and to Chebyshev scalarisation with r* = 0 when ⊕ is the objective-dimensional maximum.

Neural scalarisation
In the following, we consider the case where the operator ⊕ in (4) is replaced by a neural network. This case can be represented as (5) by a neural network NN, where the number of neurons in the input layer equals the number of objectives, and the number of neurons in the output layer is one.
We call this case neural scalarisation. Neural networks have strong function approximation capabilities; therefore, it is expected that a neural network can approximate an arbitrary scalarisation function of the Hadamard product of the weights and rewards. In other words, any weighted scalarisation can be learned by neural scalarisation.
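As an illustration, the following is a minimal PyTorch-style sketch of such a scalarisation network; the hidden-layer sizes, activation and class name are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class ScalarisationNetwork(nn.Module):
    """Maps the Hadamard product of weights and rewards to a single scalarised reward."""

    def __init__(self, n_objectives: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one output neuron: the scalarised reward
        )

    def forward(self, weighted_reward: torch.Tensor) -> torch.Tensor:
        # weighted_reward has shape (batch, n_objectives), i.e. w ⊙ r.
        return self.net(weighted_reward)
```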
In the case of a Pareto front with non-convex parts, there are Pareto solutions that can only be found using non-linear scalarisation; thus, only agents using non-linear scalarisation can find such solutions [24]. Experts who exhibit such solutions should therefore be modelled with a non-linear scalarisation. On the other hand, if the non-convex part of the front is not needed and a partial set of solutions is sufficient, a linear scalarisation is adequate for practical purposes. In such cases, the expert can be modelled by linear scalarisation. The proposed method learns neural scalarisation, a generalization that includes both linear and non-linear scalarisation; thus, it is suitable for both cases.

Proposed method
The proposed method consists of four neural networks. The architecture of the proposed method is shown in Figure 1, and each neural network is described below.

Weight-mapping network
The weight-mapping network is parameterized by θ_WMN and learns the mapping from agent number i to weight w^i. The weight-mapping network WMN(x) is given in (6).
Here, the agent number i ∈ Z_+ is a positive integer assigned to distinguish each expert.

Reward network
The reward network is parameterized by θ_RN and predicts the multi-objective reward value r^s from a state s. The reward network RN(x) is given in (7).

Scalarisation network
The scalarisation network is parameterized by θ_SN. This network outputs a scalarised reward r_SC(s) from the Hadamard product w^i ⊙ r^s. The scalarisation network SN(x) is given in (8).

Weight back-translation network
The weight back-translation (BT) network is parameterized by θ_WBTN. This network predicts the agent number ĩ from the weight w^i and plays the opposite role of the weight-mapping network. The weight BT network WBTN(x) is given in (9).
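To make the four components concrete, below is a minimal PyTorch-style sketch of the networks; the layer sizes, the softmax output of the weight-mapping network and the treatment of the agent number as a one-dimensional float input are illustrative assumptions, not the exact design used in the paper.

```python
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """Small MLP with two hidden layers, used as a generic building block."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

n_objectives, state_dim = 2, 2  # e.g. the mountain car experiment described later

# Weight-mapping network WMN: agent number i -> weight vector w^i.
# The softmax keeps the weights in [0, 1] and summing to one, consistent with
# the weight definition in the MOMDP section (an assumed design choice).
weight_mapping_net = nn.Sequential(mlp(1, n_objectives), nn.Softmax(dim=-1))

# Reward network RN: state s -> multi-objective reward r^s.
reward_net = mlp(state_dim, n_objectives)

# Scalarisation network SN: Hadamard product w ⊙ r -> scalarised reward r_SC.
scalarisation_net = mlp(n_objectives, 1)

# Weight back-translation network WBTN: weight w^i -> predicted agent number ĩ.
weight_bt_net = mlp(n_objectives, 1)
```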

Two stabilization techniques
MOIRL is an ill-posed problem. To address this difficulty, we experimentally introduce two techniques to stabilize learning, i.e. pre-training of the reward network and weight BT. These two techniques are described in the following.

Pre-training of the reward network
The initialization of the reward network is a key issue. The reward network is trained through the scalarisation network; thus, simple approaches such as random initialization are ineffective. Suppose the scalarisation and reward networks start learning at the same time. In this case, the roles of the two networks are not yet separated into decomposition and scalarisation, making it difficult to find an appropriate solution.
We therefore pre-train the reward network. Suppose the reward network outputs n objective reward values. Using the same IRL algorithm as the main algorithm (Algorithm 2), only the first objective of the multi-objective reward is trained as a single-objective reward. This treats all experts' datasets as a single expert dataset. At the same time, we minimize the variance along the objective dimension of the multi-objective reward.
Variance minimization unifies the reward values across all objectives. Here, σ(r) denotes the objective-dimensional unbiased variance of r in (10).
To "unify" the rewards of all objectives means that RN(x) is initialized to output the same reward for every objective. The purpose of pre-training is, for example, to reduce the reward value of states with uniformly low transition frequencies across all objectives. As shown in the experiments, without pre-training the reward is factorized according to the frequency of transitions; with pre-training, the main algorithm (Algorithm 2) only needs to fine-tune the rewards of each objective from the initial mixed value of the rewards of all objectives, which stabilizes the training process. As a result, pre-training initializes the output to be a mixture of all experts' rewards. The procedure is described in Algorithm 1, which updates RN(x) by gradient descent on a loss L combining the IRL loss on the first objective r_1 with the variance σ(r).

Weight back-translation
Learning the weights is another key issue in MOIRL. We observed that, when learning started from random initialization, the weights converged to 0 or 1 (as confirmed later by the results in Table 2) and could not capture the differences between agents. Therefore, we introduce a mechanism that performs BT for the weights using the weight BT network. BT is a key technique in unsupervised machine translation [28]: one network translates language A to language B, another network translates language B back to language A, and both are trained so that the recovered sentences are close to the originals. The same concept is used in image translation as cycle consistency [29].
The effect of the weight BT is shown in Figure 2. The weight BT network is trained to infer the original agent number from the weights; thus, it cannot predict the original agent number if the weights of different agents converge to the same value, e.g. 0 or 1. The BT of the weights therefore pushes the mapping between agent numbers and weights toward a bijective relationship. An autoencoder has the same structure as BT: the decoder can recover the original data when the encoder is bijective [30].
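A minimal sketch of the weight BT loss, assuming the network sketches above and an agent number represented as a float tensor (an illustrative choice):

```python
import torch
import torch.nn.functional as F

def weight_bt_loss(weight_mapping_net, weight_bt_net, agent_ids: torch.Tensor) -> torch.Tensor:
    # agent_ids: float tensor of shape (batch, 1) holding the agent numbers i.
    w = weight_mapping_net(agent_ids)     # i   -> w^i
    i_recovered = weight_bt_net(w)        # w^i -> ĩ
    # The L1 loss pulls the recovered agent number back toward the original,
    # discouraging the weights of different agents from collapsing to one value.
    return F.l1_loss(i_recovered, agent_ids)
```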

Main learning algorithm
The main learning algorithm is detailed in Algorithm 2. First, pairs of state s and agent number i are sampled from the multi-objective demonstrations D obtained from m experts. Next, the weight-mapping and reward networks output the weight w^i of agent i and the multi-objective reward r^s of state s, respectively. The Hadamard product of the weights and rewards is then input to the scalarisation network to obtain the scalarised reward r_SC. Finally, r_SC is optimized by the loss function L_IRL(s, i) of an arbitrary IRL method. Here, the weight BT network is trained by the L1 loss such that the agent number ĩ predicted from the weights and the true agent number i are close.
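Putting the pieces together, the following is a minimal sketch of one training step under the assumptions of the earlier snippets; the signature of irl_loss and the coefficient lambda_bt are illustrative, not the exact formulation of Algorithm 2.

```python
import torch.nn.functional as F

def train_step(weight_mapping_net, reward_net, scalarisation_net, weight_bt_net,
               states, agent_ids, irl_loss, optimizer, lambda_bt=1.0):
    # states: (batch, state_dim); agent_ids: (batch, 1) float agent numbers.
    w = weight_mapping_net(agent_ids)       # per-agent weights w^i
    r = reward_net(states)                  # multi-objective rewards r^s
    r_sc = scalarisation_net(w * r)         # neural scalarisation of w ⊙ r
    loss_irl = irl_loss(states, agent_ids, r_sc)       # arbitrary IRL loss
    loss_bt = F.l1_loss(weight_bt_net(w), agent_ids)   # weight back-translation
    loss = loss_irl + lambda_bt * loss_bt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```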

Reward-seeker training
To evaluate the proposed method experimentally, we employed a DIRL method based on reward-seeker training (RESET), a learning method built on the reward-seeking principle.
Reward-seeker principle: Expert agents always transition toward states of equal or greater preference to seek higher rewards. In other words, the following preference relation holds for any pair of state transitions sampled from the expert demonstration.
Reward learning is performed as follows. First, (13) is computed for the expert transition pair (s*_t, s*_{t+1}) to obtain the probability p*. Next, we define a candidate transition pair (s*_t, s̃_{t+1}), where only the next state is replaced by a sample s̃_{t+1} generated from a uniform distribution over the range [S_min, S_max]. Then, (13) is computed for the candidate transition to obtain the probability p. Finally, the reward is updated by minimizing the squared loss in (14), which maximizes p* and minimizes p. T-REX [15] also employs the Bradley-Terry model, and RESET can be considered a state-centric extension of T-REX's trajectory-centric model. T-REX requires a trajectory (or state sequence); in contrast, RESET only requires a pair of state transitions, so it can learn even from fragmentary demonstrations. Additionally, RESET does not require the ranking data needed by T-REX and can be trained from demonstrations alone. T-REX is an offline IRL method that does not require access to the environment when estimating rewards. RESET, which uses the same formulation, is also an offline IRL method and can be trained quickly. Thus, we adopt (14) as the IRL loss L_IRL.
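As a sketch of this update, assume a Bradley-Terry preference probability of the form p = σ(r(s_{t+1}) − r(s_t)) for (13) and a squared loss of the form (1 − p*)² + p² for (14); these exact forms are assumptions made for illustration, not necessarily the precise equations of RESET.

```python
import torch

def reset_loss(reward_fn, s_t, s_next_expert, s_next_random):
    """Reward-seeker style loss for a batch of expert transitions.

    reward_fn maps states to scalar rewards (e.g. the scalarised reward r_SC);
    s_next_random is drawn uniformly from [S_min, S_max] as a candidate next state.
    """
    # Bradley-Terry probability that the next state is preferred to the current one.
    p_expert = torch.sigmoid(reward_fn(s_next_expert) - reward_fn(s_t))
    p_random = torch.sigmoid(reward_fn(s_next_random) - reward_fn(s_t))
    # Squared loss: push p* toward 1 for expert transitions and p toward 0
    # for the randomly generated candidate transitions.
    return ((1.0 - p_expert) ** 2 + p_random ** 2).mean()
```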

Experimental results
We evaluated the proposed method in computer experiments. In the following sections, we describe the settings of the RL and IRL methods, the experimental environment and the results, and provide a discussion.

RL method settings
As the RL method, we used the Soft Actor-Critic (SAC) algorithm [6] to train the expert agents and to perform retraining. The SAC agent is given a linearly scalarised reward value computed from the agent's weight vector and the multi-objective reward vector, and it learns an optimal policy for this scalarised reward. For the implementation of the SAC algorithm, we used pfrl [33] in our experiments. pfrl is a deep RL library implemented in PyTorch [34].

IRL method settings
All networks were multi-layer perceptrons comprising an input layer, an output layer and two hidden layers.
Here, we used PyTorch as the deep learning framework to implement IRL. The number of neurons in each hidden layer was set to 256, and the Adam optimizer [35] (learning rate: 10^-4) was used for optimization. Dropout [36] was applied to the input layer with a probability of 0.2 and to the hidden layers with a probability of 0.5. The batch size was set to 1024.
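For concreteness, a sketch of an MLP with these hyperparameters; the ReLU activation is an assumption, while the layer width, dropout probabilities, optimizer and learning rate follow the values stated above.

```python
import torch.nn as nn
from torch.optim import Adam

def build_mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Dropout(p=0.2),                                      # dropout on the input layer
        nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(p=0.5),   # hidden layer 1
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),      # hidden layer 2
        nn.Linear(256, out_dim),                                 # output layer
    )

net = build_mlp(in_dim=2, out_dim=1)
optimizer = Adam(net.parameters(), lr=1e-4)  # Adam with learning rate 10^-4
```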

Multi-objective mountain car continuous environment
The multi-objective mountain car continuous environment extends OpenAI Gym's MountainCarContinuous-v0 environment [37] with a multi-objective reward vector. Here, the agent considers two objectives: r^s_1, reaching the goal along the shortest path, and r^a_2, reducing the output of the action as much as possible. Note that reaching the goal along the shortest path requires a large action output; thus, there is a trade-off between r^s_1 and r^a_2. The rewards are given by (15) and (16).
Equation (15) is the reward function for learning the shortest path: it gives a reward of 10 when the agent reaches the goal and terminates the episode, and −0.01 otherwise. Equation (16) is a reward function for the magnitude of the action, which is 0 when the action is largest and 0.01 when the action is smallest. We prepared 11 weights w_1 for r^s_1, [0.0, 0.1, 0.2, . . ., 1.0] (increments of 0.1, with w_2 = 1 − w_1 for r^a_2), and trained 11 SAC agents. After recording 100 demonstrations from each of the 11 SAC agents, we applied the proposed method to estimate the weights and rewards with n = 2 objectives.
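A sketch of these two reward functions, assuming actions normalized to [−1, 1]; the linear dependence of (16) on the action magnitude is an assumption made for illustration.

```python
def reward_shortest_path(reached_goal: bool) -> float:
    # Eq. (15): 10 when the agent reaches the goal (the episode terminates),
    # -0.01 for every other step.
    return 10.0 if reached_goal else -0.01

def reward_action_minimisation(action: float) -> float:
    # Eq. (16): 0 when the action magnitude is largest, 0.01 when it is smallest.
    # A linear interpolation in |action| is assumed here for illustration.
    return 0.01 * (1.0 - min(abs(action), 1.0))
```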

Estimation results for weights and rewards
The weights and rewards estimated using the proposed method are shown in Figure 3. Figure 3(a) shows a plot of the weights, where the x-axis shows the true weights and the y-axis shows the estimated weights. The plot for the first objective is on the left, and the plot for the second objective is on the right. Figure 3(b) visualizes the estimated rewards, where whiter regions indicate higher reward values and blacker regions indicate lower values.

Retraining results
The 11 SAC agents were retrained using the estimated weights and multi-objective reward vectors. The distribution of solutions (i.e. the Pareto front) for each agent is shown in Figure 4 for the demonstration generation, the MODIRL [23] results for comparison, and the retraining results obtained with the proposed method. In Figure 4, the x-axis is the true expected cumulative reward for the first objective (i.e. the shortest path), and the y-axis is the true expected cumulative reward for the second objective (i.e. action minimization). In addition, the number for each solution represents the weight w_1 for the first objective during expert training.
The solutions for the expert, MODIRL and the proposed method were divided into three groups. Here, Group 1 emphasizes rewards for actions, Group 2 considers both rewards for actions and rewards for the shortest path, and Group 3 emphasizes rewards for the shortest path.

Analysis of scalarisation network
Finally, we analysed the learnt scalarisation network. Here, we generated 10,000 samples uniformly in the interval [0.0, 1.0] for the weights and [−20.0, 20.0] for the multi-objective reward values, and we examined the relationship between the true linear scalarisation values and the predictions of the scalarisation network. Figure 5 shows the results, where the x-axis is the true scalarised value and the y-axis is the predicted value.

Ablation study on the impact of the two stabilization techniques
We introduced two techniques for stable learning in Section 6: reward pre-training and weight BT. To confirm the effectiveness of both techniques, we trained the proposed method with each technique disabled as an ablation study. The estimation results are shown in Table 2 for the weights and Table 3 for the rewards.
In Tables 2 and 3, the upper and lower figures in each cell correspond to the first and second objectives, respectively. In Table 2, the x-axis, y-axis and dotted line represent the true weights, the estimated weights and the line corresponding to a correlation coefficient of −1, respectively. In Table 3, the x-axis denotes the position and the y-axis the velocity, with white representing higher reward values and black representing lower reward values.

Discussion
We now discuss the results of our computer experiments, focusing on five questions: (1) Were the weights estimated correctly? (2) Do the estimated rewards reflect the true multi-objective intentions? (3) Can the original Pareto front be reproduced from the estimated weights and rewards? (4) How did the scalarisation network work? (5) Are the two stabilization techniques necessary?

Were weights estimated accurately?
If the weight-mapping network estimates the weights accurately, there should be a proportional relationship between the true and estimated weights. In that case, the correlation coefficient between the true and estimated weights should be close to −1 or 1. As shown in Figure 3(a), the true and estimated weights are in a proportional relationship, and the correlation coefficient between them is −1. Thus, we believe that the proposed method can accurately estimate the weights.

Do the estimated rewards indicate true multi-objective intentions?
First, the reward on the left of Figure 3(b) has the greatest value in the initial state (represented by the red line). Under this reward function, the agent attains the largest cumulative reward when it minimizes its action and stays at the starting point. As a result, the left reward can be interpreted as the reward for action minimization. Second, the reward on the right of Figure 3(b) has the largest value in the goal state (indicated by the yellow box) and smaller values in the other states. Under this reward function, the agent obtains the largest cumulative reward by minimizing the time spent in non-goal states. Hence, the right reward can be interpreted as the reward for reaching the goal along the shortest path.
As designed in (15) and (16), the true reward function comprises two objectives: the shortest path and action minimization. We may therefore conclude that the proposed method can estimate multi-objective intentions, because the objectives of the true reward function and those of the estimated rewards coincide.

Can the original Pareto front be reproduced from the estimated weights and rewards?
As shown in Figure 4(a,c), the composition of agents belonging to each solution group was almost the same in the demonstration Pareto front and in the Pareto front obtained by retraining with the weights and rewards estimated by the proposed method. Thus, the expert Pareto front was partially reproduced. However, the position of Group 2 differed between the two Pareto fronts. This is because the reward network approximates only a state-conditioned function and thus cannot accurately reproduce rewards that depend on the action. Comparing the results of the previous method (Figure 4(b)) and the proposed method (Figure 4(c)), the proposed method obtains almost the same solutions as the previous method, except for the solution with w_1 = 0.2. Therefore, the proposed method can learn appropriate weights and rewards without assuming linear scalarisation.

How did the scalarisation network work?
Figure 5 shows that the predicted values increase consistently with the true values. In other words, the scalarisation network was able to reproduce the linear scalarisation mechanism. An interesting trend is that the scalarisation network tends to assign the minimum reward value uniformly to regions where the true value is negative, and to spread the full range of predictions over regions where the true value is positive. This may be because the RESET method sets all states that no expert visited to the minimum reward value and then adjusts the reward based on the demonstrations in the positive regions.

Are two stabilization techniques necessary?
First, the results in Table 2 show the effect of the stabilization techniques on the weights. When weight BT is not applied, the weights converge to the binary values 0 and 1, which indicates that training fails. In other words, the weight BT strongly encourages the mapping from agent numbers to estimated weights to be bijective. Because reward pre-training does not directly affect the weights, its effect is not visible in Table 2. Second, the results in Table 3 show the effect of the stabilization techniques on the rewards. Different outcomes were observed for the four combinations.
• When neither reward pre-training nor weight BT is used, the two rewards are equivalent and remain undecomposed.
• When only weight BT is used, the two rewards are separated. However, the corresponding cell in Table 3 shows that, although the true primary objective is action minimization, the reward values are greater near the goal (around 0.5 to 0.6), resulting in an incorrect estimate closer to the shortest-path reward.
• When only reward pre-training is used, the goal reward is higher for both rewards because the weights fail to be estimated, and the rewards are not decomposed.
• Only when both weight BT and reward pre-training are used do the rewards match the intended objectives.
In other words, the rewards were estimated correctly only when both techniques were used. Therefore, we verified that reward pre-training and weight BT stabilize the learning of the proposed method.

Conclusions
MOIRL is effective for retraining RL agents on multi-objective problems and for intention analysis. In this paper, we first highlighted that previous methods require the assumption of linear scalarisation. We then generalized weighted scalarisation to include linear scalarisation and Chebyshev scalarisation. Finally, we proposed neural scalarisation, which automatically learns the scalarisation operation. In addition, we proposed two techniques to stabilize learning.
Experimental evaluation of the proposed method in a multi-objective mountain car continuous environment showed that the proposed method could estimate appropriate weights and rewards that reflect the true multi-objective intentions. The proposed method could also approximate the distribution of expert solutions by retraining with the estimated weights and rewards. Furthermore, we confirmed that the scalarisation network could mimic the linear scalarisation mechanism. Moreover, the results of the ablation study showed that both stabilization techniques were necessary. We plan to address the following three issues in future work.
• Confirm that the proposed method works for experts who use other scalarisation techniques, e.g. Chebyshev scalarisation.
• Conduct experiments on tasks with higher dimensional state spaces, e.g. robotics.
• Verify that the proposed method works effectively on real-world human demonstration data.

Figure 1. Architecture of the proposed method.

Figure 2. Effect of the weight BT.

Figure 3. Estimation results for weights and rewards. (a) Plot of true vs. estimated weights and (b) visualisation of the estimated rewards.

Figure 5. Prediction results of the scalarisation network.

Table 1. Comparison of existing and the proposed MOIRL methods.

Algorithm 1. Pre-training the reward network.
Require: Dataset D = {τ_{1,1}, τ_{1,2}, . . ., τ_{k,m}}, which contains k demonstrations τ from each of m expert agents; number of objectives n; IRL loss function L_IRL(s, r); coefficient of the variance loss λ_PR
Ensure: Initial parameters of the reward network RN(x)
1: Initialise the reward network RN(x)
2: for each iteration do
   ⋮
5: Update RN(x) by gradient descent according to loss L (here, r_1 denotes the first objective of r, and σ denotes the variance of r)
6: end for

Table 2. The effect of the two stabilization techniques on the estimated weights.

Table 3. The effect of the two stabilization techniques on the estimated rewards.