Investigating the Properties of Neural Network Representations in Reinforcement Learning

In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the early work on representations for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation -- good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. We introduce and measure six representational properties over more than 25 thousand agent-task settings. We consider Deep Q-learning agents with different auxiliary losses in a pixel-based navigation environment, with source and transfer tasks corresponding to different goal locations. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance. We demonstrate the generality of the methodology by investigating representations learned by a Rainbow agent that successfully transfer across game modes in Atari 2600.


GOOD REPRESENTATIONS FOR RL
In reinforcement learning, an agent interacts with its environment, receiving observations and taking actions based on those observations, with the goal of maximizing the sum of a special numerical signal, the reward. In this context, the first problem an agent faces is the problem of agent state construction: determining how to process observations to summarize the state it is in. The function that converts these observations is known as the representation, its elements are known as features, and the process of learning such a function is known as representation learning. Ultimately, many other subproblems depend on the successful construction of agent states. Bad representations hinder predictions and diminish the effectiveness of planning and learning algorithms [1]-[4]. Good representations can lead to better sample efficiency [5], [6]. Therefore, the key question motivating this research is: what are good representations and how can the agent find them?
Good representations are classically defined in service to other tasks. Often, good representations are said to be those that improve agents in some dimension, such as learning efficiency [7], [8], performance in unseen tasks [9]-[16], planning with learned models [17]-[22], and the ability to represent the world the way humans do [23], [24]. In this context, there are two main approaches for obtaining good representations: using fixed, expert-designed transformations of the agent's observations, or learning such transformations from data.
Fixed transformations of the agent's observations, which lead to fixed-basis architectures, have been extensively explored in reinforcement learning. They allow us to enforce specific properties that are thought to be beneficial. For example, many approaches either use or search for orthogonal or decorrelated features, such as orthogonal matching pursuit [25], Bellman-error basis functions [26], Fourier basis [27], tile coding [28], and proto-value functions [29], [30]. Prototypical input matching methods have been explored, as in kernel methods, radial basis functions [31], cascade correlation networks [32], and Kanerva coding [33]. Most of the above representations project the input to a low-dimensional space to encourage the representation to encode only the most important information, saving memory and computational resources. High-dimensional, sparse representations have also been proposed, as they are more likely to be orthogonal. Moreover, by activating only a small subset of features, a sparse representation reduces computation and increases scalability, such as in tile coding [28] and in sparse distributed memories [34]. However, fixed-basis architectures are not adaptive and they are difficult to scale to high-dimensional input spaces.
Recent developments in representation learning for reinforcement learning explore a different perspective: we should avoid optimizing specific properties and instead use gradient descent to let the training data dictate the properties of the representation. This is achieved with specific training regimes, including multi-task (parallel) training [38]-[40], auxiliary losses [8], [41], and training on a distribution of problems (à la meta-learning) [10], [17], [42]-[44]. The underlying idea is that good representations will emerge if the problem setting is complex enough, where goodness is often measured by success on some held-out test task.
There are many different ways to evaluate and understand these emergent representations. Recent work has explored this question in roughly two ways: what do good representations look like, and what capabilities good and bad representations allow. The most common approach is to visualize the learned representations [7], [18]-[21], [23], [41], [45]-[50]. This approach has been used, for example, to provide evidence for the emergence of abstraction and compositionality in supervised learning [51], [52]. However, in reinforcement learning, the impact of delayed consequences and temporally correlated data makes it difficult to import these analysis techniques from other fields, and recent work highlighted how popular approaches like saliency maps may not always be appropriate [53].
This paper explores the properties of representations learned by deep reinforcement learning systems: specifically Q-learning with neural network function approximation combined with numerous auxiliary tasks for transfer learning. By necessity, we focus on a specific transfer setting: one with a training phase to learn a representation, followed by testing in a variety of transfer tasks with the same dynamics but different rewards. We investigate properties grouped in three categories: capacity (complexity reduction, dynamics awareness, and diversity), redundancy (orthogonality and sparsity), and robustness (non-interference). Our property set consists of both a subset of properties discussed in the literature, and properties newly introduced in this paper.
We conducted a study across nine auxiliary tasks, resulting in 150 representations and 173 transfer tasks in an image-based maze environment. We investigated two activations: the widely used ReLU which produces a relatively compact, dense representation, and a new activation function called FTA [36] that produces high-dimensional, sparse representations. The key insights are as follows.
1) Auxiliary tasks can facilitate the emergence of representations effective for transfer; however, with ReLU networks many auxiliary tasks do not outperform learning from scratch (do not transfer).
2) Using sparse activations (FTA) was a significant factor in improving transfer. The FTA-based representations transferred consistently, with or without auxiliary tasks.
3) ReLU-based representations transferred well to very similar tasks (better than FTA), but significantly worse than FTA to less similar tasks in our setup.
4) Transfer was not possible with linear value functions: performance was significantly better when the representation was input into a nonlinear value function.
5) The representations that transferred best had high values for capacity metrics (complexity reduction and dynamics awareness), low orthogonality and medium sparsity.
A key contribution of this work is providing a systematic approach to investigate representations and their properties. The empirical design took many iterations, including 1) developing the transfer setup and a novel way of ranking task similarity using successor features so we could systematically vary the level of difficulty in transfer, 2) developing the set of properties to measure, 3) appropriately sweeping hyperparameters to obtain reasonably performing agents but still avoiding over-tuning, and 4) providing several mechanisms to aggregate and visualize the mountain of data produced across representations. For example, initial results had little consistency because the agents themselves were not effectively trained; it turns out analyzing poorly performing agents results in unclear conclusions. It would be difficult to show that the conclusions of our study generalize beyond the specific transfer setting investigated here. But these properties and our methodology can be used to understand representations in other transfer settings in a systematic and reproducible way toward the ultimate goal of understanding good representations for reinforcement learning.
Fig. 1: We experiment with agents using this network architecture, with different auxiliary losses. The representation network, $\Phi_{\theta_R}$, learns a mapping from input-state $s_t$ to the agent-state (representation of $s_t$). The representation network is learned to improve two objectives: performance on a main task and on an auxiliary task. The diagram depicts the auxiliary tasks we use in this work, described in Section 4. Our agents only use one auxiliary task at a time.

PROBLEM FORMULATION AND NOTATION
We formalize the agent's interaction as a finite Markov Decision Process (MDP) with a finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, transition function $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, and bounded reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$. On each time step, $t = 1, 2, \ldots$, the agent takes action $A_t$ in state $S_t$ and the environment transitions to state $S_{t+1} \sim P(\cdot \mid S_t, A_t)$ and emits a reward $R_{t+1}$. The agent's objective is to find a policy, $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, that maximizes the expected discounted sum of future rewards, the return, $G_t \doteq R_{t+1} + \gamma_{t+1} G_{t+1}$, where $\gamma_{t+1} \in [0, 1]$ denotes a discount that depends on the transition $(S_t, A_t, S_{t+1})$ [54]. In episodic problems, which we study here, the discount might be 1 during the episode, and it becomes zero when $S_t, A_t$ lead to termination.
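The recursive definition of the return can be illustrated with a short backward pass over one episode. This is a minimal sketch: the function name and the three-step episode are illustrative assumptions, not part of the experimental code.

```python
import numpy as np

def discounted_returns(rewards, discounts):
    """Compute G_t = R_{t+1} + gamma_{t+1} * G_{t+1} by a backward recursion.

    rewards[t] holds R_{t+1} and discounts[t] holds gamma_{t+1}; a discount
    of 0 marks a transition into termination.
    """
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + discounts[t] * g
        returns[t] = g
    return returns

# Hypothetical three-step episode: reward only on the final, terminating transition.
returns = discounted_returns([0.0, 0.0, 1.0], [0.99, 0.99, 0.0])
```

Setting the final discount to zero is what cuts the recursion off at termination, so no separate episode-boundary logic is needed.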
We study the representations learned by DQN [45], a widely used value-based method in deep reinforcement learning. The approximate value function is parameterized by a set of weights $\theta$: $Q_\theta(s, a) \approx q_\pi(s, a)$. The word deep, in deep reinforcement learning, stems from the use of neural networks to approximate $q_\pi$. In particular, DQN iteratively updates its action-value estimates by training the parameters of a neural network, $\theta$, with stochastic gradient descent:
$$\theta \leftarrow \theta + \alpha \left[ R_{t+1} + \gamma_{t+1} \max_{a' \in \mathcal{A}} \bar{Q}(S_{t+1}, a') - Q_\theta(S_t, A_t) \right] \nabla_\theta Q_\theta(S_t, A_t),$$
where $\alpha$ is the step size. The target network, $\bar{Q}$, is not updated on every step, but only periodically set equal to the current $Q_\theta$. Actions are selected according to an $\epsilon$-greedy policy, where $A_t \doteq \arg\max_{a \in \mathcal{A}} Q_\theta(S_t, a)$ with probability $1 - \epsilon$ or a random action with probability $\epsilon$. As is typical, we use mini-batch updates from an experience replay buffer [55].
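The two ingredients described above, bootstrapped targets from a target network and $\epsilon$-greedy action selection, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the batch values, and the four-action setup are hypothetical, and the network itself is replaced by fixed arrays of action-values.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_targets(rewards, discounts, next_q_target):
    """Bootstrap targets R + gamma * max_a Q_target(S', a) for a mini-batch.

    next_q_target has shape (batch, num_actions) and stands in for the output
    of the periodically synced target network; discounts are 0 on terminating
    transitions, so those targets reduce to the reward.
    """
    return rewards + discounts * next_q_target.max(axis=1)

rng = np.random.default_rng(0)
batch_rewards = np.array([0.0, 1.0])
batch_discounts = np.array([0.99, 0.0])            # second transition terminates
batch_next_q = np.array([[0.2, 0.5, 0.1, 0.0],
                         [0.9, 0.3, 0.4, 0.8]])
targets = td_targets(batch_rewards, batch_discounts, batch_next_q)
greedy_action = epsilon_greedy(batch_next_q[0], epsilon=0.0, rng=rng)
```

In a full agent, the difference between these targets and $Q_\theta(S_t, A_t)$ would drive the gradient step on $\theta$.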
We use image inputs, a convolutional network and a fully connected layer with a lower dimension d. We consider two activations on this fully connected layer: rectified linear units (ReLU) [56] and the fuzzy tiling activation (FTA) [36]. FTA is a one-to-many activation that leads to a larger number of units in the representation layer: k × d instead of d, but with only a small number of active features at once. This allows us to investigate more compact, lower-dimensional representations produced by the ReLU and higher-dimensional, sparse representations produced by FTA. We call this layer the representation layer.
We make extensive use of auxiliary tasks [8] to both induce better representations and to study them. Auxiliary tasks are additional prediction tasks given to agents to incentivize the network to learn about properties of the environment which are, in principle, not directly related to reward maximization. Examples include predicting pixel changes [8] and the next state given an action [57]. These tasks are posed as additional loss functions and the agent is tasked to balance between them and return maximization.
We use a unified architecture to explore representations induced by a variety of different auxiliary tasks, as shown in Figure 1. The first layers, parameterized by $\theta_R$, produce the representation $\phi_t = \Phi_{\theta_R}(s_t)$. The last layers, which are parameterized by $\theta_V$, use the representation to estimate the action-values. Auxiliary tasks are encoded with additional layers and separate heads (with parameters $\theta_A$), further impacting the updates to $\theta_R$ via gradient descent: $\Phi_{\theta_R}(s)$ must be adjusted to be useful for both estimating action-values and reducing the auxiliary losses.
Given this setup, a natural (and basic) question is: what is the representation in a DQN agent? The answer relies on the role of the representation, which is primarily to promote future learning. We learn a transformation on the inputs (features) to facilitate downstream learning. The features should 1) be reusable or generally useful for multiple predictions, 2) improve sample efficiency for the given online algorithm (e.g., SGD), and 3) be computationally efficient.
For example, an agent may want to pick up objects in a room, with new objects continually added. The features could describe the objects, so that new objects reuse previously learned concepts (e.g., red, cup) that are a succinct (efficient) description of the object. If the agent uses online updating, from temporally correlated samples, we may prefer sparsely activated features that only change a subset of the weights, to reduce interference and promote sample efficiency.
Of course, this is only a hypothetical example; we do not truly know what is learned with DQN, nor what features would improve learning. In this paper, we attempt to systematically measure representation properties and relationships to transfer performance (future learning), to gain more insight into the representations of DQN agents.

GOOD REPRESENTATIONS FOR TRANSFER
We aim to understand the properties of representations that emerge in deep reinforcement learning, but it is critical to do so for both good and bad representations, which, again, raises the question: what is a good representation? We use a simple definition: a good representation is one that transfers. More precisely, if features learned from a collection of training data allow faster learning on future data, then those features are useful for transfer. Good representations that reliably achieve good transfer may exhibit properties and attributes different from representations that result in poor or negative transfer.
Fig. 2: Representation transfer is possible in the Maze navigation task and auxiliary tasks improve transfer when using non-linear function approximation. (a) The position of the walls (dark grey), the goal (green) in the training, and two transfer tasks (purple) are shown. (b) Performance relative to the baseline (from scratch) agent: above 1 represents an improvement and below 1 denotes negative transfer. Error bars show a 95% confidence interval. Performance is reported for the best auxiliary task for each activation: VirtualVF5 for ReLU-based representations, and SF for FTA, explained in Section 4. The dotted lines labeled with a 1 are at different locations, to indicate the relative performance between FTA and ReLU from scratch. In non-linear, FTA from scratch was better; in linear, ReLU from scratch was better.
We seek to empirically relate representation properties and performance, which requires an environment where transfer is possible. We investigate how the complexity of the value function interacts with transfer in a simple pixel-based navigation environment with obstacles. This environment, depicted in Figure 2a, can be readily used to generate numerous related tasks. The agent must learn to navigate to a given goal state in as few steps as possible. The problem is episodic, with γ = 0.99, a reward of +1 when reaching the goal and 0 otherwise. The input state consists of an RGB input of a 15 × 15 grid (size 15 × 15 × 3), encoding the agent's current location (but not the goal). The actions correspond to the four cardinal directions, and each action moves the agent deterministically by one pixel, or not at all if the action is into a wall. To simplify exploration, the agent starts in a uniform random state and episodes are cut off at 100 steps; the agent is then teleported to a new random state and this transition is discarded.
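The transition and reward rules of the navigation environment can be sketched as below. This is an illustrative sketch only: the action encoding, the function name, and the border-only wall layout are assumptions, and the maze in Figure 2a also contains interior walls that are not reproduced here.

```python
import numpy as np

# Action index -> (row, column) offset for the four cardinal directions.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def step(walls, position, action, goal):
    """One deterministic transition: move one cell unless blocked by a wall.

    walls is a boolean grid (True = wall); returns (next_position, reward, done).
    """
    row, col = position
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < walls.shape[0] and 0 <= new_col < walls.shape[1]
    if inside and not walls[new_row, new_col]:
        row, col = new_row, new_col
    reached = (row, col) == goal
    return (row, col), (1.0 if reached else 0.0), reached

# Hypothetical 15 x 15 layout with only a border wall.
walls = np.zeros((15, 15), dtype=bool)
walls[0, :] = walls[-1, :] = walls[:, 0] = walls[:, -1] = True
blocked = step(walls, (1, 1), 0, goal=(9, 9))    # moving up into the border wall
arrived = step(walls, (9, 8), 3, goal=(9, 9))    # moving right onto the goal
```

Changing only the `goal` argument yields a different task with identical dynamics, which is the property the transfer experiments rely on.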
We use different navigation tasks (different goal locations) to define training and transfer in our environment. The agent is first trained to go to the goal at a specific location (e.g., [9,9], depicted in Figure 2a). Next, we create a new agent, which we call a transfer agent, with the parameters of the network up to the representation layer copied over from the training agent, and we freeze them to prevent further adaptation. In this transfer phase of the experiment, the new agent is trained to navigate to a nearby but different goal location. We compare this to a baseline agent where none of the network's parameters are pre-trained; they are instead randomly initialized and updated during the transfer phase. This baseline agent learns the task from scratch because it does not benefit in any way from prior learning. If the transfer agent learns more quickly than the baseline scratch agent, then we say the representation learned in the first phase of the experiment facilitates transfer. Because of the multiple possible goal locations, there are many possible training and transfer tasks in this environment.
At this point we have the major ideas in place to present a simple but foundational result: representation transfer is possible in our navigation task and auxiliary tasks improve transfer. Figure 2b summarizes this result upfront; we give details later about the experiment as we proceed to more complex and nuanced results. Representations pre-trained on the training task significantly outperformed representations learned from scratch on the transfer task. In addition, we found auxiliary tasks were important for transfer, at least for ReLU. Pre-trained representations using ReLU exhibited negative transfer, whereas ReLU-based representations combined with well-designed auxiliary tasks did transfer. To the best of our knowledge, no prior work has demonstrated that auxiliary tasks improve representation transfer in reinforcement learning; instead most work with auxiliary tasks uses them during learning [8], [58], [59] or on an offline dataset [12], [60]. These results, though, produced some surprises. First, even in this simple environment, transfer with ReLU and auxiliary tasks was difficult. This reaffirms some of the previous anecdotal and documented [16], [61] issues with transferring representations. In fact, switching to a different (sparse) activation, FTA, had a bigger impact than any auxiliary task. This suggests some issues with the representations learned under ReLU.
Second, we were unable to obtain successful transfer with linear value functions. This outcome was not for a lack of effort. Figure 2b strongly suggests that, in our navigation environment, non-linear value functions significantly improve transfer, even in a relatively simple environment. We see that when the value function was linear in the features, neither a pre-trained representation (labelled ReLU and FTA) nor any other representations trained with auxiliary tasks improved over training a fresh representation from scratch. This suggests representations may emerge in earlier layers of the network, and that it may be more feasible to learn re-usable features when they can be nonlinearly combined, even if only with a simple shallow network. We highlight this important point here; for the remainder of this paper, we restrict our scope only to non-linear value functions.
We presented these overall results upfront, before diving deeper and understanding why we see these outcomes. In the next sections, we describe the auxiliary tasks we used to find good representations; then how we evaluate the representations for different transfer tasks; and finally give insights into failures in transfer under ReLU due to transfer task similarity. Later, we dive even deeper, measuring and correlating representation properties to transfer performance.

THE AUXILIARY LOSSES
We focus on the representations that emerge under different auxiliary losses for two reasons: 1) Adding or changing auxiliary tasks does not change the functional capacity and size of the learned representations; this consistency in capacity makes comparisons more interpretable. 2) The role and impact of auxiliary losses in reinforcement learning remains poorly understood, and constitutes an important area of study. We introduce the general idea behind each auxiliary task in text, while their formal definition is presented in Table 1 in Appendix B.1.
Input Reconstruction (IR): This auxiliary task tries to reconstruct the network's input, as in an autoencoder. Feature extraction is achieved by using a bottleneck layer: a low-dimensional layer that forces only the most important information to be retained and the remainder, including the noise, to be discarded. We include this auxiliary task as a classic and simple choice.
Next Agent State Prediction (NAS): Another common choice is to predict the next agent state [8], [19], [21], [22], [42], [43], [58], [62], [63]. This loss encourages the representation to capture the transition dynamics. The agent predicts $\phi_{t+1}$ using $\phi_t$ and action $a_t$. Predicting the next agent state might give vacuous solutions when it is the only training signal; jointly training with the main task, however, prevents this from happening. The combination of this auxiliary loss with the main task encourages the representation to both be useful for action-value estimation, as well as capable of anticipating features on the next step. Several papers have highlighted that the ability to predict the next state is related to the ability to predict action-values [41], [64], [65].
Successor Feature Prediction (SF): NAS can be taken one step further, with the target including not just the next agent-state but many future agent states [66]. Successor features (SFs) provide just such a target. SFs are defined with respect to a particular policy $\pi$ as $\psi^\pi_t = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i \phi_{t+i}\right]$. They have been used in the transfer setting because they can be used to quickly infer value estimates for new reward functions that are a linear function of $\phi_t$ [11]. In the tabular case, SFs correspond to successor representations [67], which have an equivalence to proto-value functions [63]. We opt to use the greedy policy according to the action values for the main task, which means the SFs are tracking a changing policy.
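To illustrate why SFs allow value estimates to be inferred quickly for linear rewards, the sketch below computes SFs in closed form for a small fixed-policy Markov chain and recovers state values as a dot product with the reward weights. The chain, the one-hot features, and the reward weights are hypothetical, and the closed form is only available in this tabular setting; the agents in this paper instead learn SFs as an auxiliary prediction.

```python
import numpy as np

def successor_features(transition, features, gamma):
    """Closed form Psi = (I - gamma * P_pi)^{-1} Phi for a fixed policy.

    transition[i, j] is the probability of moving from state i to state j
    under the policy, and features[i] is phi for state i.
    """
    n = transition.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * transition, features)

# Hypothetical 3-state chain; state 2 is absorbing.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
Phi = np.eye(3)                   # one-hot features: SFs reduce to the successor representation
w = np.array([0.0, 0.0, 1.0])     # reward weights, so r(s) = phi(s) . w
gamma = 0.9

Psi = successor_features(P_pi, Phi, gamma)
values_from_sf = Psi @ w
values_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, Phi @ w)
```

A new reward function of the same linear form only requires a new `w`; `Psi` is reused unchanged.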
Reward Prediction (Reward): Another auxiliary task we consider is predicting the immediate reward based on the current state and action [8]. This prediction requires the representation to encode the reward information that the agent can obtain in the short term.
Expert Target Prediction (XY): Another auxiliary task is the prediction of expert-designed targets. It is based on the idea that a good representation should be able to predict key artifacts of an environment. This requires domain knowledge and is not always possible. Here we consider the coordinates of the agent in the environment as the target predictions.
Virtual Value Function Learning (VirtualVF): This auxiliary task is based on the tasks the agent will face in transfer. We consider one auxiliary loss that uses a goal location at the center of the maze (VirtualVF-1), and another that uses five goals at the four corners and the center of the maze (VirtualVF-5). These are virtual tasks, because the agent imagines achieving these goals, even though they are not the training goal. We use VirtualVF-1 and VirtualVF-5 to assess the utility of having a larger set of virtual goals. We learn these auxiliary value functions with DQN.
Augmented Temporal Contrast (ATC): The contrastive loss encourages the network to learn similar representations for input-states that are temporally close to each other [68], [69]. This auxiliary task led to the first successful pre-training of a deep reinforcement learning agent, meaning it led to representations that could be generally reused for other tasks. ATC also includes other augmentations, like data augmentation. We test it with these additions, to report performance of the originally proposed approach, even though it goes beyond strictly only adding an auxiliary loss.

EXPERIMENT DESIGN
Our experiments consist of two stages: a representation learning stage in a training task, and a transfer stage using this learned representation in a transfer task. In this section, we outline the details for these two stages in our experiments, and outline the agents and how we evaluate them.

Transfer Setup
The first stage is to train the representation. All representations are trained with a DQN agent in the training task, with goal location depicted in Figure 2a. To prevent overfitting, we employ an early-saving strategy to save the representation function, $\Phi_{\theta_R}$, as soon as the agent is able to finish 100 consecutive episodes in 100 steps or less. Each representation corresponds to a choice of activation function and auxiliary loss, including choosing not to use an auxiliary loss.
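The early-saving criterion can be sketched as a streak counter. The class name is hypothetical, and the sketch assumes that "finishing" an episode means reaching the goal within the 100-step limit, so a timeout resets the streak.

```python
class EarlySaveMonitor:
    """Track consecutive successful episodes to decide when to save the representation."""

    def __init__(self, required_streak=100, max_steps=100):
        self.required_streak = required_streak
        self.max_steps = max_steps
        self.streak = 0

    def update(self, reached_goal, episode_steps):
        """Record one finished episode; return True once the streak is long enough."""
        if reached_goal and episode_steps <= self.max_steps:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required_streak

monitor = EarlySaveMonitor()
signals = [monitor.update(True, 40) for _ in range(99)]
signals.append(monitor.update(False, 100))    # a timeout resets the streak
signals += [monitor.update(True, 40) for _ in range(100)]
```

In this usage example the save signal fires only on the last episode, after 100 uninterrupted successes.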
In the second stage, we learn with the representation from the first stage, in a new transfer task. Specifically, we 1) load and freeze the learned representation, 2) re-initialize the value function and 3) learn the value function for the transfer task with DQN, with the fixed representation. No auxiliary tasks are used in transfer, and only the 64×64 value function network is learned with DQN. Learning with a re-initialized value function rather than fine-tuning prevents negative effects from the old value function during transfer, especially to less similar transfer tasks. Further, re-initializing the value function ensures that the difference between transfer learning and learning from scratch is due to the learned representation. The agent learns in this new task for 300 thousand steps.
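The three steps of the transfer stage (load and freeze, re-initialize, learn only the value function) can be sketched with plain parameter dictionaries. This is a schematic sketch: the helper names are hypothetical, a single dense layer stands in for the convolutional representation network, and the gradients are placeholders rather than DQN gradients.

```python
import copy
import numpy as np

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return {"W": rng.normal(0.0, 0.1, size=(n_in, n_out)), "b": np.zeros(n_out)}

def make_transfer_agent(trained, rng, rep_dim=32, hidden=64, num_actions=4):
    """Copy the representation, then attach a freshly initialized value network."""
    return {
        "representation": copy.deepcopy(trained["representation"]),  # loaded, kept fixed
        "value": [init_layer(rng, rep_dim, hidden),
                  init_layer(rng, hidden, hidden),
                  init_layer(rng, hidden, num_actions)],             # re-initialized
        "trainable": ("value",),                                     # only this group is updated
    }

def sgd_step(agent, grads, step_size):
    """Apply a gradient step only to parameter groups marked trainable."""
    for group in agent["trainable"]:
        for layer, layer_grads in zip(agent[group], grads[group]):
            for name in layer:
                layer[name] -= step_size * layer_grads[name]

rng = np.random.default_rng(0)
trained = {"representation": [init_layer(rng, 15 * 15 * 3, 32)]}    # stand-in for the conv stack
agent = make_transfer_agent(trained, rng)
ones = {g: [{k: np.ones_like(v) for k, v in layer.items()} for layer in agent[g]]
        for g in ("representation", "value")}
before = copy.deepcopy(agent)
sgd_step(agent, ones, step_size=0.1)
```

Because the representation group is never listed as trainable, any difference from the scratch baseline can be attributed to the copied features rather than to further adaptation.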
We consider 173 transfer tasks: all possible goal locations, including the training goal state. To sort performance amongst these tasks, we provide a novel method to measure their similarity to the training task. In this way, we can ask questions about transfer to more or less similar transfer tasks.
The key idea is to first obtain successor representations for each state, and then compute similarity in this new space. The successor representation encodes similarity based on transition dynamics, meaning that states are considered nearby due to the ability to reach them rather than due to other distances, such as Euclidean distance, which does not respect the walls in the Maze. For specific details, see Appendix B.3.
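The contrast between dynamics-based and Euclidean similarity can be illustrated on a tiny maze. This sketch is not the procedure of Appendix B.3: the uniform random-walk policy, the discount, the Euclidean norm between successor-representation rows, and the 3 × 4 layout are all assumptions made for illustration.

```python
import numpy as np

def successor_representation(walls, gamma=0.99):
    """SR of a uniform random walk over free cells: M = (I - gamma * P)^{-1}."""
    free = [(int(r), int(c)) for r, c in np.argwhere(~walls)]
    index = {cell: i for i, cell in enumerate(free)}
    P = np.zeros((len(free), len(free)))
    for (r, c), i in index.items():
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            # Moving into a wall or off the grid leaves the agent in place.
            j = index.get((r + dr, c + dc), i)
            P[i, j] += 0.25
    M = np.linalg.inv(np.eye(len(free)) - gamma * P)
    return M, index

def sr_distance(M, index, cell_a, cell_b):
    """Distance between two cells measured between their SR rows."""
    return float(np.linalg.norm(M[index[cell_a]] - M[index[cell_b]]))

# Hypothetical maze: a wall separates the top and bottom rows except at the right end.
walls = np.zeros((3, 4), dtype=bool)
walls[1, :3] = True
M, index = successor_representation(walls)

same_corridor = sr_distance(M, index, (0, 0), (0, 2))   # Euclidean distance 2, two steps away
across_wall = sr_distance(M, index, (0, 0), (2, 0))     # Euclidean distance 2, eight steps away
```

Both target cells are equally far from the start cell in Euclidean terms, yet the cell behind the wall is farther in the successor-representation space because it takes many more steps to reach.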

Network Choices and Activation Functions
To obtain representations with different properties, we use two different activation functions: Rectified Linear Unit (ReLU) [56] and Fuzzy Tiling Activation (FTA) [36]. ReLU is a standard activation function, defined as max(z, 0) for input z, where z is a linear weighting on the previous layer.
FTA is a newly introduced activation, designed to generate sparse outputs. Essentially, it bins the scalar input into k bins, with some smoothing to ensure non-zero gradients through the activation. The smoothness and bin width are controlled by a parameter $\eta > 0$. The binning interval is $[-k\eta/2, k\eta/2]$. Larger $\eta$ activates more entries in the output $h(z)$, and smaller $\eta$ results in more sparsity. This formulation removes a hyperparameter by using the suggested default choice of $\eta = \delta$, where $\delta$ is the bin width.
For our experiments, the representation function consists of two convolutional layers, one linear transformation, and a choice of activation function. The linear layer projects the output of the convolutional layer to a 32-dimensional space. When using ReLU, the representation layer has d = 32 features. If FTA is used, it has 640 features, since FTA projects each scalar to a short, sparse vector with 20 bins. Note that FTA still uses the same number of learned parameters to produce these 640 features as ReLU uses to produce the 32 features, because binning occurs after the linear weighting. However, the number of output features is higher, and so the value function and auxiliary tasks all have more parameters, at least in their first layer. We therefore also evaluate ReLU(L), with L for large, which uses 640 features. ReLU(L) uses significantly more parameters to produce these 640 features than FTA.
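The expansion from 32 to 640 features can be made concrete with a sketch of a fuzzy tiling activation. This follows our reading of the published FTA definition [36] with $\eta = \delta$ and a symmetric binning interval; the function name, the handling of the boundary case, and the random test input are assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def fta(z, k=20, eta=0.2):
    """Sketch of a fuzzy tiling activation with bin width delta = eta.

    Each scalar in z is expanded into k bins covering [-k*eta/2, k*eta/2],
    so the output has k times as many features as the input.
    """
    delta = eta
    lower = -k * eta / 2.0
    c = lower + delta * np.arange(k)                  # left edges of the k bins
    z = np.asarray(z, dtype=float)[..., None]         # broadcast each scalar against all bins
    dist = np.maximum(c - z, 0.0) + np.maximum(z - delta - c, 0.0)
    fuzzy = np.where(dist <= eta, dist, 1.0)          # linear ramp near the bin, saturate beyond eta
    out = 1.0 - fuzzy
    return out.reshape(*out.shape[:-2], -1)           # flatten (d, k) into d * k features

rng = np.random.default_rng(0)
pre_activation = rng.uniform(-1.9, 1.9, size=32)      # hypothetical output of the linear layer
features = fta(pre_activation)
active_per_unit = (features.reshape(32, 20) > 0).sum(axis=1)
```

Only the bin containing the input and its immediate neighbours are non-zero, so most of the 640 outputs are exactly zero even though no additional parameters are learned.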
The structure for the value function and auxiliary tasks is given in Figure 1. We use two hidden layers with 64 nodes each for the value function, and one hidden layer with 64 nodes for the auxiliary task. We use a simpler network for the auxiliary task to force the representation to learn as much as possible. We use a slightly larger network for the value function, to avoid overly constraining it and so confounding transfer performance.

Agent Specification
We use standard choices for DQN, including the use of $\epsilon$-greedy exploration, an experience replay buffer, target networks, and the Adam optimizer [70]. In total there are 9 choices for auxiliary tasks: No-aux, ATC, IR, NAS, SF, Reward, XY, VirtualVF-1, and VirtualVF-5. There are 3 activations: FTA, ReLU, and ReLU(L). When using FTA with auxiliary tasks, we set the number of bins k = 20 and η = 0.2. This implicitly specifies the range for binning to [−2, 2]. For the No-aux task agents, we test η = 0.2, 0.4, 0.6, and 0.8 and report performance for each, not the best one. This gives a total of 30 agent specifications.
We consider three baseline agents, which we call RANDOM, INPUT, and SCRATCH. They allow us to falsify different hypotheses about the role of the learned representation. RANDOM uses a randomly initialized network as the representation, without any learning. The agents start with a random network, so this baseline checks whether learning actually improved the representation. INPUT omits the representation network, and directly inputs the agent's observation to the value function component. It is meant to check if the learned representations play any (useful) role, and if learning from scratch in the transfer task might just have been faster with smaller networks. Finally, SCRATCH is a DQN agent that starts learning from randomly initialized weights in the transfer task. The purpose of learning the representation is to learn faster than learning from scratch in the transfer task. This is the most important baseline, as it defines whether a learned representation was successful for transfer (it facilitated learning faster than SCRATCH) or not (it was comparable to or worse than SCRATCH).

Reporting Performance and Hyperparameters
To report performance, we have to consider how to measure performance and how to set hyperparameters. In both the training and transfer tasks, every 10 thousand steps we record the average return of the last 100 episodes. To summarize performance across the 300 thousand steps, we take the sum of these recorded values, also called the Area Under Curve (AUC). The AUC is used to select hyperparameters.
The only hyperparameter common across all agents is the step-size; we therefore only sweep this hyperparameter. We separately pick the step-size in Stage 1, in the training task, and in Stage 2, when just learning the value function. We use the average performance over 5 runs to select the step-sizes. Namely, in Stage 1 we run each of the 30 agent specifications with different step-sizes, for 5 runs. We select the best step-size according to training AUC, and use the representations produced under those step-sizes. Then in Stage 2, we evaluate each step-size only for those representations, and pick the best step-size for an agent specification for each transfer task by averaging performance across the 5 runs. We sweep the step-size to ensure we are evaluating reasonably well-optimized agents. Additional hyperparameter details, including the selected values, are available in Appendix B.
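The AUC summary and the step-size selection rule can be sketched as follows. The candidate step-sizes and the synthetic learning curves are hypothetical; the actual swept values are those listed in Appendix B.

```python
import numpy as np

def area_under_curve(checkpoint_returns):
    """Sum of the average returns recorded at each evaluation checkpoint."""
    return float(np.sum(checkpoint_returns))

def select_step_size(curves_by_step_size):
    """Return the step-size whose AUC, averaged over runs, is largest.

    curves_by_step_size maps a step-size to an array of shape
    (num_runs, num_checkpoints) of recorded average returns.
    """
    mean_auc = {alpha: np.mean([area_under_curve(run) for run in curves])
                for alpha, curves in curves_by_step_size.items()}
    return max(mean_auc, key=mean_auc.get), mean_auc

# Hypothetical data: 5 runs, 30 checkpoints (300 thousand steps / 10 thousand).
rng = np.random.default_rng(0)
curves = {1e-3: rng.uniform(0.0, 0.2, size=(5, 30)),
          1e-4: rng.uniform(0.4, 0.9, size=(5, 30)),
          1e-5: rng.uniform(0.1, 0.3, size=(5, 30))}
best_step_size, mean_auc = select_step_size(curves)
```

The same routine would be applied once per agent specification in Stage 1 and once per agent specification and transfer task in Stage 2.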
When we report performance across agents, we do not average across these 5 runs. Instead, each run produces a different representation and we report performance for each one as an independent data point. When showing aggregate performance, we aggregate from this larger pool of 5 runs for each of the 30 agent specifications, namely over 150 representations. We do so because each representation has different properties; when correlating agent properties and performance, we may not care which auxiliary task was used, but rather only care about its emergent properties. Averaging across runs compares methods (agent specifications), rather than representations.
Finally, we obtain transfer performance in 173 transfer tasks. This means we get 173 transfer performance samples for each of the 150 representations. In total, when aggregating across transfer tasks or agent specifications, we obtain a significant number of samples to estimate aggregate performance, even though each agent specification only has 5 runs in the training task. For example, in Figure 2b, each bar in the plot is for one agent specification and uses 173 × 5 = 865 performance samples to estimate medians, means and standard deviations. In total, we generate 173 × 150 = 25,950 agent-task combinations.

GOOD, BAD AND UGLY REPRESENTATIONS
We expect some agent specifications to result in representations that aid transfer, and others to impede transfer. The black line shows the performance when learning in each transfer task from scratch. Lines completely above the black line indicate a representation yielded successful transfer in all tasks. Lines that start above the black line but fall below as we move left to right indicate a representation that transfers to similar tasks but not dissimilar tasks. The INPUT and RANDOM baselines are not competitive; for completeness, we still report their performance, but with lighter lines. Overall, many representations achieve transfer and generally FTA-based representations are better on this problem. Details on how task similarity was computed and how this plot was generated can be found in the Appendix. The transfer performance of ReLU(L) is shown in Figure 10; it exhibits the same pattern as ReLU, transferring well to similar tasks and not as well to less similar tasks.
UNREAL [8], the first large-scale deep reinforcement learning system to highlight the utility of auxiliary tasks, showed that although auxiliary tasks like pixel prediction improved performance substantially, other tasks such as feature control had a much smaller impact. Other work has highlighted that it can be difficult to obtain any transfer in reinforcement learning [16], [61]. It seems the design and deployment of auxiliary tasks remains largely an art.
In this section, we provide some clarity on these discrepancies by showing that 1) there is large variability in performance across auxiliary tasks, and 2) transfer performance can degrade significantly as tasks become more dissimilar. Though these results are intuitive, they constitute the first approach to systematically vary these two axes to understand when methods may be succeeding or failing. Figure 3 summarizes the transfer performance of many different representations corresponding to different auxiliary tasks and activation functions. The plot has task similarity on the x-axis, and each point on the plot summarizes the performance of one representation on one particular transfer task. The lines show how much transfer performance degrades as tasks become more dissimilar (see footnote 2). The bold black line shows performance in the transfer task if the representation and value function were trained from scratch, that is, with no transfer. Any point above the bold black line indicates a representation that achieved better performance than training from scratch on that task, that is, successful transfer. Any line completely above its corresponding black line indicates a representation that achieved successful transfer for all goal states.
The most important conclusions from Figure 3 are that (1) several representations achieve successful transfer across all tasks, and (2) a great variety of representations emerge, with transfer performance ranging from good to significantly worse than scratch. Looking more closely, some representations achieve successful transfer in the dozens of tasks that are most similar to the training task, but for tasks very dissimilar to the training task, performance is poor, as seen by the step down in many of the lines.
Generally, we found that FTA-based representations yield better representations for transfer compared to ReLU. Almost all FTA representations outperformed Scratch (FTA), and transfer to less similar tasks was effectively the same as transfer to tasks similar to training, as evidenced by the nearly flat lines in Figure 3. Interestingly, many ReLU-based representations achieved very good performance in transfer to similar tasks but performed significantly worse than Scratch (ReLU), nearly as bad as the INPUT baseline, on less similar tasks. ReLU(L) performed similarly to ReLU; this result is in Appendix C.2. One other note of interest is that Scratch (FTA) performs better than Scratch (ReLU), and yet FTA representations were still better able to transfer than ReLU-based ones: using FTA improved on using ReLU, and then training the FTA representation first in a training task improved performance in the transfer task even more.
Digging a little deeper, Figure 4 depicts the transfer performance of each auxiliary task. We again see that FTA-based representations achieve higher performance overall and higher performance across auxiliary tasks: the worst performing representation never used FTA, always ReLU. Inspecting each auxiliary task, the FTA-based representations exhibited lower variance across runs. Larger ReLU representations, ReLU(L), did improve performance over the smaller ReLU representations, but not uniformly. The IR auxiliary task representation, for example, improves with large ReLU networks, but ATC performs worse, though not significantly in either case.
At the auxiliary task level, there are no obvious trends (except for the fact that IR and Reward are generally not useful). For example, the successor-feature auxiliary task (labelled SF) is among the best performing FTA representations and among the worst performing ReLU representations. The subgoal-navigation auxiliary tasks (VirtualVF) result in the best performance with ReLU representations. These subgoals can be thought of as waypoints placed at strategic locations in the environment; perhaps these tasks force the network to represent how to navigate to these waypoints, which then speeds learning when navigating to other nearby goals in transfer. Perplexingly, these are not the best performing representations when combined with FTA activation functions. Perhaps FTA networks already extract a general and transferable representation (as evidenced by the performance of 'No Aux'), and thus the subgoal auxiliary tasks simply do not help much. It is difficult to know looking at performance only; in the following sections we look at different properties of the learned representations as a lens to understand such mysteries.
2. Figure 2a shows two transfer tasks as purple squares. The one beside the training goal is most similar according to our ranking, and the other is least similar to the training goal.
Fig. 4: Transfer performance depends on the activation function, representation size, and auxiliary tasks. Overall, FTA-based representations achieved the best performance and exhibited the least variation in performance across auxiliary tasks. The orange lines depict the median, the lower and upper edges of the box show the 25th and 75th percentiles, while the whiskers show 1.5 times the interquartile range. These results are computed over 173 × 5 = 865 samples, and so the standard errors are quite small (as you can see in Figure 11 in Appendix C.3).
We included ATC, a recent representation learning strategy, to calibrate the quality of transfer performance. This approach uses multiple networks to compute a contrastive loss, while also using data augmentation of the input images. ATC worked well, but it did not significantly outperform the best ReLU and FTA representations. In addition, we found that the ReLU network combined with data augmentation (random shifting of the input images) and no contrastive setup achieved similar performance to ReLU ATC.

REPRESENTATIONAL PROPERTIES, OLD & NEW
Before the emergence of deep reinforcement learning, the study of representations in reinforcement learning and their effects on learning was focused on fixed bases. The problem with this is not that a fixed basis cannot capture complex non-linear relationships (e.g., see the work by Liang et al. [71]), but rather that the representation is fixed: the features are not adapted to the task. In some sense, this is good because it forces the agent designer to consider what the desirable representation properties are, a level of analysis complementary to the design of good algorithms. Over the years, researchers have proposed and debated numerous properties. We leverage these discussions to analyze our learned representations.
We characterize a representation along three main axes: capacity, efficiency, and robustness. Capacity reflects whether a representation can represent a given function. Efficiency captures the lack of redundancy of the features and the computational cost of using them. Robustness captures the idea that interference is undesirable and that representations should avoid it; a more complete name is update robustness. We define six metrics that capture these three axes, and we use them to evaluate the representations learned by our agents. Our goal is to develop a systematic methodology for assessing learned representations, based on a diverse set of properties. This evaluation list does not suggest that a property is necessary; rather, it provides some quantitative measures to supplement more qualitative evaluation like visualization. Such a list is necessarily incomplete; we attempt only to start with a reasonably broad set of properties.
We assume we have a dataset of 1000 transitions to measure the properties, $D_{test} = \{(s_1, a_1, s'_1, r_1), (s_2, a_2, s'_2, r_2), \ldots, (s_N, a_N, s'_N, r_N)\}$. This dataset is obtained by running the random policy for N episodes, and then randomly subsampling N transitions, to ensure we cover the state space. We store the full transition because some of the properties rely on consecutive states or the entire transition. The symbol $\phi_i$ refers to the representation of $s_i$; $Q(\phi_i, \cdot)$ is the value network learned given that representation. We compute distances both according to the representation, $d_{s,i,j}$, and according to action-values, $d_{v,i,j}$ and $d_{q,i,j}$. Of the two action-value distances, the first reflects differences in the values the agent uses to select the greedy action, and the second looks at the difference in values across all actions.
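The three pairwise distances can be sketched as below. The exact formulas are not reproduced in this text, so the specific choices here are assumptions: a Euclidean norm for the representation distance and for the all-actions distance, and an absolute difference of maximal action-values for the greedy-value distance. The function name and array layout are likewise illustrative.

```python
import numpy as np

def pairwise_distances(phi, q):
    """Return (d_s, d_v, d_q), each an (N, N) matrix over dataset pairs.

    phi: (N, d) array, row i is the representation phi_i.
    q:   (N, A) array, row i is the action-value vector Q(phi_i, .).
    """
    phi = np.asarray(phi, dtype=float)
    q = np.asarray(q, dtype=float)
    # d_s: distance between representations (Euclidean norm is an assumption).
    d_s = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=-1)
    # d_v: difference between the greedy (maximal) action-values.
    v = q.max(axis=1)
    d_v = np.abs(v[:, None] - v[None, :])
    # d_q: difference across all action-values (Euclidean norm is an assumption).
    d_q = np.linalg.norm(q[:, None, :] - q[None, :, :], axis=-1)
    return d_s, d_v, d_q
```

Note that two states can have d_v equal to zero while d_q is positive, for example when their greedy values agree but the values of other actions differ.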

Capacity: Retaining Relevant Information and Nonlinear Transformations
The first property to consider for a representation is its capacity: can it represent the functions we want to learn? The value function network should be a simple function of these features, such as a simple neural network. To measure capacity, we use one direct measure, complexity reduction, and two indirect measures, dynamics-awareness and diversity.
Complexity reduction reflects how much the representation facilitates simplicity of the learned value function on top of those features. If the complexity is small, the features encode much of the non-linearity needed. We measure this using the Lipschitz constant of the value network given the representation. A higher Lipschitz constant indicates the value function is more complex to learn. Lipschitz value functions have been motivated for value transfer [72] and learning models [73].
The definition is shown in Equation 2. The Lipschitz constant L is one where $d_{q,i,j} / d_{s,i,j} < L$ for all pairs (i, j) in the dataset. When this ratio is computed on a given time step t, either during learning or on the last time step before the representation is frozen for transfer, we use the current action-values. We take a slightly less conservative measure, by averaging across these ratios rather than taking the max. The issue with using the Lipschitz constant L itself is that it is an imprecise measure of the regularity in the surface: one poorly behaved region could result in a high Lipschitz constant, even if the rest have low local Lipschitz constants. We call these averaged ratios $L_{rep}$, giving
$$\text{Complexity Reduction} = 1 - \frac{L_{rep}}{L_{max}}, \qquad L_{rep} = \frac{2}{N(N-1)} \sum_{i<j} \frac{d_{q,i,j}}{d_{s,i,j}}. \qquad (2)$$
We normalize $L_{rep}$ between 0 and 1 using $L_{max}$, computed as the maximum $L_{rep}$ over all representations and across time steps. This is subtracted from 1 to ensure higher values refer to higher reduction in complexity.
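A minimal sketch of this computation follows. It assumes Euclidean norms for both distances and skips pairs whose representations coincide (to avoid dividing by zero); neither choice is stated in the text, and the function names are illustrative.

```python
import numpy as np

def lipschitz_ratio_mean(phi, q, eps=1e-12):
    """Average of d_q / d_s over all unordered pairs (the quantity L_rep)."""
    phi = np.asarray(phi, dtype=float)
    q = np.asarray(q, dtype=float)
    i, j = np.triu_indices(len(phi), k=1)      # each unordered pair once
    d_s = np.linalg.norm(phi[i] - phi[j], axis=1)
    d_q = np.linalg.norm(q[i] - q[j], axis=1)
    keep = d_s > eps   # assumption: ignore pairs with identical representations
    return float(np.mean(d_q[keep] / d_s[keep]))

def complexity_reduction(l_rep, l_max):
    """Normalize by the largest L_rep observed and subtract from 1."""
    return 1.0 - l_rep / l_max
```

Here `l_max` would be the maximum of `lipschitz_ratio_mean` over all representations and time steps under comparison, so the resulting score lies between 0 and 1.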
We can also indirectly measure complexity, that is, without specifying a set of value functions, by testing if the representation is dynamics-aware. This means that pairs of states, where one is a successor to the other, should have similar representations, and states further apart in terms of reachability should have a low similarity. This measure is in fact related to the Laplacian used for proto-value functions and successor features [63], [74]. For every state in the dataset, we take its successor state and a random state. If the distance, in representation space, to the successor state is smaller than the distance to a random state, then the representation has high dynamics awareness.
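One way to turn this comparison into a single score is sketched below. The normalized form, (random distance minus successor distance) divided by random distance, summed over the dataset, is an assumption made for illustration, as are the Euclidean norm and the function name.

```python
import numpy as np

def dynamics_awareness(phi, phi_next, rng=None):
    """Compare distances to successor states against distances to random states.

    phi:      (N, d) representations of the states s_i.
    phi_next: (N, d) representations of the successor states s'_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    n = len(phi)
    rand_idx = rng.integers(0, n, size=n)   # one random state per state
    d_rand = np.linalg.norm(phi - phi[rand_idx], axis=1).sum()
    d_succ = np.linalg.norm(phi - phi_next, axis=1).sum()
    # Assumed normalization: 1 when successors coincide with their
    # predecessors, negative when successors are farther than random states.
    return float((d_rand - d_succ) / d_rand)
```

Under this form the score is at most 1, and it is high exactly when successor states are closer in representation space than randomly drawn states.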

Dynamics Awareness
In addition, we can measure the diversity of a representation, which is the opposite of specialization. If a representation is specialized to one value function, then it likely uses a small subspace of the larger Euclidean space and likely does not produce a diversity of possible feature vectors. This specialization may be problematic, as it means the representation is unlikely to perform well when it is transferred to learn another value function.
To define diversity, we use a ratio between state and value differences. Given two states $s_i$ and $s_j$, we can compare the distance between their representations ($d_{s,i,j}$) and the distance between their action values ($d_{v,i,j}$). If the value distance is high (the two state values are very different), then the representation distance is also likely to be high to allow this. The interesting case is when the value distance is low. The representation distance can be high or low, and still allow two states to have similar values, because we project from a higher-dimensional feature vector to a scalar value. A representation with high diversity would have high representation distance when possible, allowing two states to be distinguished even when they have similar values. A representation with low diversity would simply map these two states with similar values to similar representations, specializing to this value function. In this measure, we normalize each distance by its maximum, to be invariant to value and representation scales. Diversity can be seen as 1 − specialization. The specialization is lower when $d_{v,i,j}$ is small and $d_{s,i,j}$ is large, causing this ratio to be closer to zero. The specialization is higher when the ratio between $d_{v,i,j}$ and $d_{s,i,j}$ is nearly one. Diversity allows us to indirectly measure capacity, as we can check the level of specialization for a given function without needing to have access to the larger set of possible functions.
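The description above can be sketched as follows. Several details are assumptions introduced for illustration: the ratio is clipped at 1, a small constant `eps` guards against division by zero, the average runs over unordered pairs, and Euclidean and absolute-value distances are used.

```python
import numpy as np

def diversity(phi, q, eps=1e-2):
    """1 - specialization, with specialization an averaged, clipped ratio."""
    phi = np.asarray(phi, dtype=float)
    q = np.asarray(q, dtype=float)
    i, j = np.triu_indices(len(phi), k=1)
    d_s = np.linalg.norm(phi[i] - phi[j], axis=1)
    v = q.max(axis=1)                      # greedy action-values
    d_v = np.abs(v[i] - v[j])
    # Normalize by the maximum distances to remove scale effects.
    d_s = d_s / max(d_s.max(), 1e-12)
    d_v = d_v / max(d_v.max(), 1e-12)
    # Assumption: clip the ratio at 1 so specialization lies in [0, 1].
    specialization = np.minimum(d_v / (d_s + eps), 1.0)
    return float(1.0 - specialization.mean())
```

With this form, states that share a value but have distinct representations push the score toward 1, while representation distances that simply track value distances push it toward 0.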

Efficiency: Feature Redundancy
Many function classes can satisfy these capacity properties and so we consider other functional properties of the features. Reducing redundancy in the representation, finding linearly independent features, is a basic requirement. Orthogonality satisfies this requirement, and additionally provides distributed features as well as minimal interference. For example, factor analysis finds a dense set of orthogonal (latent) factors to explain the data. This representation is highly distributed, as each feature is used to describe many different inputs. At the same time, interference is reduced: the interference for two states with orthogonal feature vectors is zero under linear updating. As before, we normalize magnitudes and ensure higher orthogonality means that more feature vectors φ i and φ j are orthogonal to each other.
Note that there is an equivalence between orthogonal feature vectors (orthogonal representations) and orthogonal features: the sum over all states i, j of $\langle \phi_i, \phi_j \rangle^2$ is equal to the sum over all pairs of features of the squared dot product between the vectors of those feature values across states (see Appendix A.1). Additionally, for centered features, orthogonality is also equivalent to decorrelation (see Appendix A.3).
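An orthogonality score in this spirit can be sketched as one minus the average absolute cosine similarity between pairs of feature vectors. The averaging over unordered pairs, the use of the absolute value, and the function name are assumptions; the text only specifies that magnitudes are normalized and that higher values mean more orthogonal pairs.

```python
import numpy as np

def orthogonality(phi, eps=1e-12):
    """1 minus the mean absolute cosine similarity over unordered pairs."""
    phi = np.asarray(phi, dtype=float)
    norms = np.linalg.norm(phi, axis=1, keepdims=True)
    unit = phi / np.maximum(norms, eps)    # normalize magnitudes
    cos = np.abs(unit @ unit.T)            # |cosine| for every pair
    i, j = np.triu_indices(len(phi), k=1)
    return float(1.0 - cos[i, j].mean())
```

Mutually orthogonal feature vectors give a score of 1, and collinear feature vectors give a score of 0.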
One idea related to orthogonality is sparsity. If only a small number of features are active for an input, then the features are sparse, typically with the additional condition that each feature is active for some inputs (no dead features). For non-negative features, maximizing sparsity corresponds to finding orthogonal features: dot products can only be zero when features are non-overlapping for two inputs. Sparsity has the additional benefit, though, of improving efficiency for querying and updating the function, because only a small number of features are active. To measure sparsity, we calculate the percentage of inactive features on average across states in the dataset,
$$\text{Sparsity} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{d} \sum_{k=1}^{d} \mathbb{1}(\phi_{i,k} = 0),$$
where the representation $\phi_i$ for state $s_i$ is d dimensional.
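As a small sketch of this measure, the function below treats a feature as inactive when its magnitude is at most `tol`. The tolerance parameter and the function name are assumptions added for illustration; with `tol=0.0` a feature is inactive only when it is exactly zero.

```python
import numpy as np

def sparsity(phi, tol=0.0):
    """Fraction of inactive features, averaged across states.

    phi: (N, d) array of representations.
    """
    phi = np.asarray(phi, dtype=float)
    inactive = np.abs(phi) <= tol
    # Every state has the same dimension d, so the mean over all entries
    # equals the average over states of the per-state inactive fraction.
    return float(inactive.mean())
```

For example, two 4-dimensional representations with three and two inactive features respectively give a sparsity of 5/8.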

Robustness: Interference Reduction
More recent work in neural networks has also focused on robustness, both to interference and noise. Interference reflects how much updates in one state reduce accuracy in other states. We use a recent measure developed for reinforcement learning [75], which uses the difference in temporal difference errors before and after an update. We do the comparison each time the target network is synchronized, which occurs every 64 steps, for a total of T times during learning. For every t = 1, . . . , T, we compare the error between $\theta_t$ and the parameters after 64 updates, $\theta_{t+1}$; note that t here references the synchronization iterator rather than time.
The maximal interference is taken across all representations.
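The before-and-after comparison can be sketched as below. This is not the exact measure of [75]: squaring the temporal difference errors, averaging over the dataset, and averaging over synchronization points are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def update_interference(td_before, td_after):
    """Mean change in squared TD error between two parameter settings.

    td_before: TD errors on the dataset under theta_t.
    td_after:  TD errors on the same dataset under theta_{t+1}.
    Positive values mean the updates made errors larger on average.
    """
    before = np.asarray(td_before, dtype=float) ** 2
    after = np.asarray(td_after, dtype=float) ** 2
    return float(np.mean(after - before))

def interference_over_training(td_errors):
    """Average the comparison over consecutive synchronization points.

    td_errors: sequence of T + 1 arrays, one per synchronization point.
    """
    pairs = zip(td_errors[:-1], td_errors[1:])
    return float(np.mean([update_interference(a, b) for a, b in pairs]))
```

A normalized score would then divide by the maximal interference observed across all representations.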

THE PROPERTIES OF GOOD REPRESENTATIONS
We can now return to the main question of this work: do good representations that facilitate transfer exhibit particular properties? In this section, we investigate how the properties defined in the previous section relate to transfer performance. Each curve shows the property of one agent specification (activation and auxiliary task pair), averaged over the 5 runs in the training task. The curve changes color, to black, at the time point where we took the representation and fixed it; this point was chosen based on when the return for the agent stopped changing. Line colors vary from light gray to black based on how much their values fluctuate from the property value after their return converges, with darker lines denoting lower variation. In these plots, we allowed the representation to keep learning to understand if properties significantly change afterwards. Our primary focus is to show the general trend that properties converge over time, and that they converge approximately when the return does; therefore we use the same color for all agent specifications.
reduction on the x-axis. Representations with high complexity reduction and good transfer performance would appear as a dot in the top right of the subplot. Representations with low complexity reduction and good transfer performance would appear as a dot in the top left of the subplot, and so on.
At the highest level we see FTA (top row) and ReLU (bottom row) exhibit different properties across representations. FTA-based representations by and large exhibit high complexity reduction and high diversity, whereas ReLU representations range widely from low to medium on the same two measures. In fact, the lowest observed complexity reduction and diversity of any FTA representation were greater than the highest observed complexity reduction and diversity for ReLU. ReLU representations could be sparse and have low or high orthogonality, whereas FTA representations are mostly sparse. Interestingly, the top representations in terms of sparsity were ReLU. ReLU representations with similar property values can achieve very different transfer performance (visible as points stacked vertically). There appears to be no clear relationship between sparsity and performance for ReLU representations.
Now consider the properties of the top performing representations. Again, let us focus our attention on the FTA representations in the top row of Figure 5. The green stars in each subplot correspond to the top performing representations (in terms of transfer). First notice the stars are typically close together in x and y, indicating all three achieve similar performance with similar property values; this is true for ReLU representations as well. In general, we see that the best performing representations are not at the extremes of any property (high or low). Given that FTA representations by and large exhibit high complexity reduction, diversity and sparsity, it is notable that the best performing representations are the lowest on those three properties.
In general, the best representations for both FTA and ReLU exhibit fairly similar properties (relative to other representations with the same activation function): high complexity reduction, low orthogonality, high dynamics awareness, and medium sparsity. Of particular note is the clear pattern in complexity reduction and diversity for ReLU: both needed to be higher, and performance clearly drops for lower values.
FTA seems to more naturally produce representations that are higher on these measures; we hypothesize that this is the main explanation for why FTA representations work well across the board for transfer.
The property values depicted in Figure 5 were computed from the representations when frozen for transfer, but one might wonder what the dynamics of the properties are over time. Recall that we froze and transferred each representation after 100 episodes were completed in the training phase. This choice balances the need for reasonable performance without having to select somewhat arbitrary step budgets or performance criteria. However, our choice does mean that each representation could receive different amounts of experience, which raises the natural question: would the properties reported in Figure 5 be very different with more or less training? Figure 6 provides the answers.
Generally, across all auxiliary tasks and activation functions, the representation property values remained similar after initial transients in early learning. Each subplot of Figure 6 shows a particular property value for every single representation tested over an extended training period. We intentionally do not distinguish between activation functions and auxiliary tasks in this plot. The change in color indicates when the representation was frozen for transfer, in terms of training time. Note that many representations were frozen after the same number of training steps. The orthogonality, dynamics awareness, and sparsity of a small number of representations slowly increase with more training, and the complexity reduction of a few representations slowly decreases. Overall, the properties for the most part converge, and do so just before the representations were frozen for transfer. Training the representations longer would not have resulted in significant changes to the property values.
Finally, we investigated why VirtualVF5 was helpful for ReLU and harmful for FTA through the lens of our representation properties. The goal is to better understand this discrepancy, by analyzing the representations themselves. We plotted properties for just these representations in Figure 7, and found that the addition of this auxiliary loss significantly decreased dynamics awareness for FTA, to the detriment of performance, but increased it for ReLU. Additionally, it caused the FTA-based representation to have much higher orthogonality, likely increasing it to one of the extremes that performed more poorly. For ReLU, the increase in orthogonality was to an interim level, from a value that was very small. It is as yet unclear why this auxiliary loss caused these effects, but this clear and systematic change in the properties helps explain this outcome.
Fig. 7: The VirtualVF5 task produces bad FTA representations but improves ReLU representations. Each subplot shows a property value achieved by four different representations: FTA and ReLU with VirtualVF5 and FTA and ReLU with no auxiliary tasks. It is clear this auxiliary task changes the properties of the representations, particularly Dynamics Awareness and Orthogonality. We did not include non-interference as VirtualVF5 had no impact on it.

CONCLUSIONS
The goal of this work is to make progress towards an answer to a classic question: how do the properties of representations that emerge under standard architectures used in reinforcement learning relate to transfer performance? We introduced a method of measuring the similarity between training and transfer tasks and designed experiments to assess learned representations. All tasks are similar, in that they involve navigating to locations in the same Maze. Intuitively, transfer should be possible, even to locations that are quite far from the goal in training. We found that 1) ReLU-based representations transferred only to very similar tasks, potentially highlighting why transfer has been so difficult in reinforcement learning (the vast majority of SOTA agents use ReLU networks), 2) some auxiliary tasks improved transfer of ReLU-based representations, but none facilitated transfer to less similar tasks, 3) the FTA activation significantly improved transfer, suggesting it might be a promising activation to use going forward, and 4) transfer was not possible with linear value functions, even in this seemingly simple environment.
We extensively and systematically investigated the properties of all of these (good and bad) representations attempting to better understand what causes the improvement in transferability. We defined diversity, complexity reduction, and dynamics awareness, as well as used measures of orthogonality, sparsity and non-interference from the literature. In general, interim values for properties were better: representations at the very extremes were never the best. Further, we found that the best representations maintained high capacity (complexity reduction and dynamics awareness), lower orthogonality and medium sparsity. These conclusions do not mean representations should have low orthogonality, for example, but rather representations that emerge under training with auxiliary losses tend to do more poorly if orthogonality is higher or if it is very small (at the extreme), in our particular transfer setting.
This paper investigated only one relatively simple environment (a Maze) and one neural network architecture, albeit an environment with many different tasks/goals. This network architecture is one of the most widely used in deep reinforcement learning. Even in just this setting, there was a mountain of data to analyze. To gain insights on the complex representations learned by our agents, it was necessary to start in a simple setting and develop a clear and systematic methodology. A natural next step is to execute the same procedure in other, possibly larger, environments, or to use different neural network architectures. We would like to highlight, however, that the results in even just this one setting are already informative and change our perspective on these representations. A priori, one might have thought that transfer would be very easy in this environment; after all, we are not learning small networks here! Yet, repeatedly we hit roadblocks.
The results here highlight how difficult it can be to achieve transfer. The variety in agent performance obtained here, and the significant changes in performance when moving from closer to further goal locations, already allow us to tease apart differences in approaches and properties. The specific conclusions about network architectures and activations, auxiliary losses, and even properties, may be different in other environments, but the higher-level conclusions about the relevance of these properties, the interactions between components, and the need for a careful methodology to understand these nuances will extend.
There are a variety of settings for which we can use the methodology proposed in this work. There has been substantial effort to characterize transfer, generalization, and overfitting in deep reinforcement learning, primarily in terms of performance [61], [76]- [78]. Notably, prior work illustrated representation transfer is possible across Atari modes [61], but did not yet quantify any properties of those representations. A natural next step is to revisit these experiments, with new tools to understand and improve the representations in these benchmark environments. The compute required to do this in a systematic and principled way will be a major challenge.
Han Wang is a PhD student at the University of Alberta. She received her MSc in Computing Science from the University of Alberta, in 2020. Her research area of interest is reinforcement learning.
Erfan Miahi is an MSc student at the University of Alberta. He received his bachelor's in Computer Engineering from the University of Guilan in 2020. His research area of interest is reinforcement learning and representation learning.

Martha White is an Associate Professor of Computing Science at the University of Alberta and a PI of Amii (the Alberta Machine Intelligence Institute). She holds a Canada CIFAR AI Chair and received IEEE's "AI's 10 to Watch: The Future of AI" award in 2020. She has authored more than 50 papers in top journals and conferences and is an associate editor for TPAMI.

Marlos C. Machado is a senior research scientist at DeepMind and holds a Canada CIFAR AI Chair at the Alberta Machine Intelligence Institute (Amii). Marlos is also an adjunct professor at the University of Alberta, an Amii Fellow, and a principal investigator of the Reinforcement Learning and Artificial Intelligence Laboratory (RLAI).

Zaheer Abbas is a Research Engineer at DeepMind. He received his MSc in computing science from the University of Alberta in 2019. His research focus is reinforcement learning and planning.
Raksha Kumaraswamy completed this work as a PhD student in Computing Science at the University of Alberta. She is now a Research Scientist in Noah's Ark Labs in Huawei.

Vincent Liu is a PhD student at the University of Alberta. He received his MSc in Computing Science from the University of Alberta, in 2020. His research focus is sequential decision making in the real world, particularly theoretically-sound methods for offline reinforcement learning.

APPENDIX A MORE ON SOME REPRESENTATION PROPERTIES
Interestingly, orthogonality in the representations, as measured in the paper, has a strong relationship with three other properties. Specifically, orthogonality in the representations can be equivalent to orthogonality between features, which implies that each feature captures distinct information from the input states. Orthogonality can also result in reduced interference, where interference measures how much an update on one state interferes with the updates on other states. Lastly, orthogonal features can be indicators of linearly uncorrelated features.
In this section, we show the aforementioned relationships between orthogonal representations, orthogonal features, interference, and uncorrelated features. To do so, we make the assumption of having a finite set of input-states.

A.1 Relationship between Orthogonal Representations and Orthogonal Features
Here, we show that there is an equivalence relationship between orthogonality in the representation and between features. Below we show that $\sum_{i,j} \langle \phi_i, \phi_j \rangle^2 = \sum_{k,l} \langle f_k, f_l \rangle^2$, where $f_k$ is the vector of values of feature k across the input-states. Therefore, when the sample-space is not enumerable, that is, $f_k$ is infinite-dimensional, orthogonality of representations may be used as a surrogate for measuring the orthogonality of features.
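One short way to see this equivalence for a finite set of N input-states is a trace argument. The matrix notation is introduced here for illustration and is an assumption, not the notation of the paper: let $\Phi$ be the N × d matrix whose rows are the feature vectors $\phi_i$ and whose columns are the per-feature vectors $f_k$.

```latex
\begin{align*}
\sum_{i=1}^{N}\sum_{j=1}^{N} \langle \phi_i, \phi_j \rangle^2
  &= \lVert \Phi \Phi^\top \rVert_F^2
   = \operatorname{tr}\!\left(\Phi \Phi^\top \Phi \Phi^\top\right) \\
  &= \operatorname{tr}\!\left(\Phi^\top \Phi \, \Phi^\top \Phi\right)
   = \lVert \Phi^\top \Phi \rVert_F^2
   = \sum_{k=1}^{d}\sum_{l=1}^{d} \langle f_k, f_l \rangle^2 .
\end{align*}
```

The first and last equalities hold because the entries of $\Phi \Phi^\top$ and $\Phi^\top \Phi$ are the pairwise dot products of rows and of columns respectively, and the middle step uses the cyclic property of the trace.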

A.3 Relationship between Orthogonal and Uncorrelated Features
Here, we show the relationship between orthogonal features and uncorrelated features.
Let $\bar{f}_i$ denote the expected value of feature $i$ over the set of input states. If all features are centered, that is, $\bar{f}_i = 0$ for all $i$, then it is trivial to see that
$\sum_{s} (f_i(s) - \bar{f}_i)(f_j(s) - \bar{f}_j) = \sum_{s} f_i(s) f_j(s).$
The LHS is a measure of correlation and the RHS is a measure of orthogonality.
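A small numerical check of this relationship can be sketched as follows. It assumes a uniform expectation over a finite set of states and uses the unnormalized covariance as the correlation measure. The feature values and helper names are hypothetical.

```python
# Sketch: for centered features, the (unnormalized) covariance equals the inner product.
# The feature values below are hypothetical.

def mean(v):
    return sum(v) / len(v)

def center(v):
    m = mean(v)
    return [x - m for x in v]

def covariance_sum(u, v):
    """Unnormalized covariance: sum_s (u(s) - mean(u)) * (v(s) - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

raw_i = [1.0, 2.0, 4.0, 5.0]
raw_j = [3.0, -1.0, 0.0, 2.0]
f_i, f_j = center(raw_i), center(raw_j)

cov_centered = covariance_sum(f_i, f_j)   # correlation-style measure
dot_centered = inner_product(f_i, f_j)    # orthogonality-style measure
dot_raw = inner_product(raw_i, raw_j)     # differs from the covariance when not centered
```

For the centered features the two quantities coincide, whereas for the raw features the inner product also picks up the product of the means.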

APPENDIX B EMPIRICAL DETAILS

B.1 Auxiliary Tasks
In this section, we provide a more detailed explanation of each of the seven auxiliary tasks used to aid representation learning. All the losses, except the ATC loss, use the same samples taken from the replay buffer to update the value function. The detailed formulas for the auxiliary losses are presented in Table 1.
Augmented Temporal Contrast (ATC) Loss: This contrastive loss encourages the network to learn similar representations for an input state, $s_t$, and an input state from a predetermined near-future time step, $s_{t+k}$, where $k = 3$. In contrast to the other losses employed in this work, this loss uses more than a single auxiliary head. It first applies a data augmentation technique called random shift, with a probability of 0.1 and padding of 4, to both of these input states. Then, it feeds the augmented version of $s_t$, $\text{AUG}(s_t)$, through a set of networks to compute $p_t$, where $F_{\theta_A}$ is a linear mapping of the representation into an embedding space of size 32, and $F_{\theta_C}$ is a single-layer neural network with a hidden layer size of 64 and an output size of 32. Next, $\text{AUG}(s_{t+k})$ is fed into momentum-encoder copies of $\Phi_{\theta_R}$ and $F_{\theta_A}$ to compute $c_{t+k}$. The outputs of these networks, $p_t$ and $c_{t+k}$, are combined through a 32 × 32 matrix $W$ to compute the logits, $l_{i,j+k}$. Finally, these logits are used to compute the InfoNCE loss, $L_{ATC}$. In contrast to the original paper, which uses a complex learning-rate schedule, we implement this loss in its simplest form with a fixed learning rate, which we sweep over [0.003, 0.001, 0.0003, 0.0001, 0.00003, 0.00001]. We update the momentum encoder at every step using a τ of 0.01. The batch size is the same as the batch size used for computing the value function loss, and the weight of this auxiliary loss is set to 1.
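The last two steps of this description, the bilinear logits with the InfoNCE loss and the momentum-encoder update, can be sketched as below. This is a minimal illustration of the general technique rather than the implementation used here. It assumes the positive for anchor $i$ is the $i$-th entry of the batch, and that the momentum update is the usual exponential moving average with rate τ. All function names and the toy two-dimensional embeddings are hypothetical.

```python
import math

# Hypothetical helper names; a sketch of the general InfoNCE form with a bilinear
# score, not the exact implementation used in the paper.

def bilinear_logits(P, W, C):
    """logits[i][j] = p_i^T W c_j for anchors P (B x d) and positives C (B x d)."""
    WC = [[sum(W[a][b] * c[b] for b in range(len(c))) for a in range(len(W))] for c in C]
    return [[sum(p[a] * wc[a] for a in range(len(p))) for wc in WC] for p in P]

def info_nce(logits):
    """Mean cross-entropy where the positive for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def momentum_update(target, online, tau=0.01):
    """Assumed EMA rule: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

# Tiny example with embedding size 2 instead of 32.
P = [[1.0, 0.0], [0.0, 1.0]]   # anchor embeddings p_t
C = [[1.0, 0.0], [0.0, 1.0]]   # momentum-encoded embeddings c_{t+k}
W = [[2.0, 0.0], [0.0, 2.0]]   # bilinear matrix

loss_matched = info_nce(bilinear_logits(P, W, C))        # positives on the diagonal
loss_swapped = info_nce(bilinear_logits(P, W, C[::-1]))  # positives misaligned
new_target = momentum_update([0.0, 1.0], [1.0, 1.0])
```

The loss is small when each anchor scores highest against its own future state and grows when the pairing is broken, which is the behaviour the contrastive objective relies on.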
$G$ is a set of goal locations. The reward $r_g$ and representation function $\phi^{g}_{\theta_A}$ are associated with the goal $g$, sampled from this set.

Input Reconstruction (IR): This auxiliary task reconstructs the input image from the representation through a deconvolutional network. On the auxiliary head, the representation was first projected to a hidden layer with 1024 nodes, then passed to a two-layer deconvolutional network with a kernel size of 4, 32 and 3 channels, strides of 2 and 1, and padding of 2 and 1 on the two layers, respectively. The output had the same size as the input image (i.e., 15 × 15 × 3). The weight of the auxiliary loss was set to 0.0001.
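As a sanity check on these layer settings, the spatial output size of the two deconvolutional layers can be computed with the standard transposed-convolution size formula (no output padding, dilation 1). The sketch assumes the 1024-node hidden layer is reshaped into 16 channels of 8 × 8. That reshaping is an assumption made for illustration and is not stated in the text.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

# Assumption: the 1024-node hidden layer is reshaped to 16 channels of 8 x 8.
side = 8
side = conv_transpose_out(side, kernel=4, stride=2, padding=2)  # first layer, 32 channels
side = conv_transpose_out(side, kernel=4, stride=1, padding=1)  # second layer, 3 channels
print(side)  # 15, matching the 15 x 15 x 3 input image
```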
Next Agent State Prediction (NAS): This task encourages the representation of the current state to be predictive of the next state's representation. To do so, it uses a contrastive loss that minimizes the prediction error on the next state while maximizing the prediction error on the rest of the states. For this task, the auxiliary head, $F_{\theta_A}$, consists of two fully connected layers with 64 neurons each. The weight of this auxiliary loss was 0.001.
Successor Feature Prediction (SF): The successor feature prediction task was similar to next-agent-state prediction, though the target was constructed by bootstrapping. Given the transition in latent space $\langle \phi_t, a_t, \phi_{t+1}, a_{t+1} \rangle$, the auxiliary head learned to minimize the difference between the prediction $F_{\theta_A}(\phi_t, a_t)$ and the target $(1 - \lambda)\phi_{t+1} + \lambda F_{\theta_A}(\phi_{t+1}, a_{t+1})$, where $\lambda$ is set to 0.99. To satisfy the property of successor features, we added an extra head to predict the reward linearly from the representation $\phi_t$. The head for predicting the successor features used the same neural network architecture as NAS. The weight was set to 1.
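The bootstrapped target above can be written out directly. The sketch below is illustrative only. It assumes a squared-error loss and treats the target as a constant when computing the loss. Neither choice is specified in the text, and all vectors are hypothetical.

```python
def sf_target(phi_next, sf_next, lam=0.99):
    """Bootstrapped target (1 - lam) * phi_{t+1} + lam * F(phi_{t+1}, a_{t+1})."""
    return [(1.0 - lam) * p + lam * f for p, f in zip(phi_next, sf_next)]

def squared_error(prediction, target):
    """Assumed loss: sum of squared differences, with the target held fixed."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

phi_next = [1.0, 0.0, 2.0]     # hypothetical next-state representation phi_{t+1}
sf_next = [3.0, 1.0, 2.0]      # hypothetical head output F(phi_{t+1}, a_{t+1})
prediction = [2.9, 1.0, 2.0]   # hypothetical head output F(phi_t, a_t)

target = sf_target(phi_next, sf_next)
loss = squared_error(prediction, target)
```

With λ = 0.99 the target is dominated by the bootstrapped prediction, so the next-state representation contributes only a small correction at each step.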
Reward Prediction (Reward): Another auxiliary prediction task was to predict the reward independently given the representation $\phi_t$. This auxiliary task used the same nonlinear transformation structure as SF, and the weight was set to 1.
Expert Target Prediction (XY): The last prediction task was the prediction of expert-designed targets. The agent was asked to predict its current position given the image. Since predicting the position is a regression task, we used the MSE loss with the same network structure as Reward. We set a low weight of 0.0001 for this auxiliary task.
Virtual Value Function Learning (VirtualVF): This auxiliary task learns a different value function on the auxiliary head. There were two settings in Maze: learning one value function assuming the goal is at grid cell [7, 7], and learning five value functions with goals at grid cells [0, 0], [0, 14], [14, 0], [14, 14], and [7, 7], respectively. The weight of this auxiliary task remained 1, but the discount rate on the auxiliary head was set lower, to 0.9, so that the agent can focus on the main task. The auxiliary head learned this task with the same network structure as XY.

B.2 General Hyperparameter Setting
We used the same neural network architecture across representations learned with the various loss functions, for each domain. All hidden layers are initialized with Xavier initialization. Table 2 shows the number of nodes on the representation function's last hidden layer, and the number of features.
During training, the inputs are normalized to be in the range [−1, 1]. We use the Adam optimizer to update the weights, and we use the mean-squared error as the loss. The batch size is set to 32. The buffer has length 10,000. For the representation function, we use a two-layer convolutional network with a kernel size of 4, stride of 1, padding of 1, and 32 channels for the first layer, and a kernel size of 4, stride of 2, padding of 2, and 16 channels for the second layer. A target network was used with the synchronization frequency set to 64. The buffer's memory size was 100,000 and 32 samples were randomly chosen at each step. The agent learns for 300,000 steps with an ε-greedy policy. During transfer learning, all agents, including baselines, learned for 100,000 steps only.
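The spatial sizes produced by the two convolutional layers follow from the standard convolution output-size formula (dilation 1). The sketch below assumes a 15 × 15 input image, as given for the Maze domain in the Input Reconstruction description. The helper name is hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 15                                             # input images are 15 x 15 x 3
first = conv_out(side, kernel=4, stride=1, padding=1)   # first layer, 32 channels
second = conv_out(first, kernel=4, stride=2, padding=2) # second layer, 16 channels
flat = 16 * second * second                             # flattened feature count
print(first, second, flat)  # 14 8 1024
```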
As for the FTA setting, we use 20 bins with upper and lower bounds of 2 and −2. We tested η values of 0.2, 0.4, 0.6, and 0.8 for the agent with no auxiliary task, and we fixed η = 0.2 for agents trained with an auxiliary task.
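To make these settings concrete, the following is a minimal sketch of a fuzzy tiling activation for a single scalar input, written from the commonly cited definition of FTA. The exact form used in the experiments is not given here, so the function should be read as an assumption. With 20 bins on [−2, 2] the bin width is 0.2, and each scalar input is expanded into 20 outputs.

```python
def fta(z, lower=-2.0, upper=2.0, bins=20, eta=0.2):
    """Map a scalar activation z to a sparse vector with `bins` entries in [0, 1]."""
    delta = (upper - lower) / bins
    out = []
    for i in range(bins):
        c = lower + i * delta  # left edge of bin i
        # Distance from z to bin i; zero when z lies inside [c, c + delta].
        dist = max(c - z, 0.0) + max(z - delta - c, 0.0)
        # Fuzzy indicator: linear within eta of the bin, saturating at 1 beyond it.
        fuzzy = (dist if eta - dist > 0 else 0.0) + (1.0 if dist - eta > 0 else 0.0)
        out.append(1.0 - fuzzy)
    return out

features = fta(0.05)
active = [i for i, v in enumerate(features) if v > 0]
```

Only the bin containing the input and its immediate neighbours are active, which is the source of the sparsity. Applying the function elementwise to a hidden layer with d units yields 20 d features.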
The learning rate was swept for every representation learning architecture and control task. The best setting was picked according to the averaged performance over 5 runs, each with a different random seed. In the non-linear value function case and in transfer tasks with a linear value function, we use a fixed ε = 0.1. However, for representation learning in the linear case, it turned out to be harder for the agent to converge when keeping the other settings the same as in the non-linear value function case. Thus, we provided better exploration in the early learning stage to speed up learning by decaying ε from 1 to 0.1 over the first 100,000 steps.
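The exploration schedule for the linear case can be sketched as follows. The text gives only the endpoints (1 and 0.1) and the duration (100,000 steps), so the linear shape of the decay is an assumption, and the function names are hypothetical.

```python
import random

def epsilon_at(step, start=1.0, end=0.1, decay_steps=100_000):
    """Assumed linear decay from `start` to `end`, then held constant at `end`."""
    if step >= decay_steps:
        return end
    return start + (end - start) * step / decay_steps

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
halfway = epsilon_at(50_000)
```

Setting `start=end=0.1` recovers the fixed ε = 0.1 used in the other settings.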

B.3 Using SFs to Measure Task Similarity
When checking how the difficulty of transfer affects the transfer performance, we consider the similarity between each transfer task and the original task. When a transfer task is similar to the task in which the representation is trained, we consider the transfer to be easier.
We measure the similarity between tasks according to the successor representations, ψ. Since successor representations encode the trajectories of the agent, the difference between successor representations generated under the optimal policy can reflect the difference between the optimal policies that the agent learns in different tasks. If the optimal policies of two tasks turn out to be dissimilar, the similarity between these tasks is considered low.
To compute a highly accurate estimate of successor features, we solve the maze using a simple tabular algorithm. To do so, we define the state as the cell in the maze that is occupied by the agent, which yields 173 states in total. We then use value iteration to generate an optimal policy, and calculate the successor representation of each state based on the optimal policy.
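The tabular procedure described above can be sketched on a toy problem. The example below uses a hypothetical four-cell corridor rather than the 173-state maze, and it assumes one-hot state features, a deterministic environment, an absorbing goal, and a finite horizon that truncates the discounted sum. The reward of 1 on reaching the goal and the discount of 0.9 are also illustrative choices.

```python
GAMMA = 0.9
N_STATES = 4          # hypothetical corridor: cells 0..3, goal at cell 3
ACTIONS = (-1, +1)    # move left / move right
GOAL = 3

def step(state, action):
    """Deterministic transition; the goal is absorbing and walls block movement."""
    if state == GOAL:
        return state, 0.0
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def value_iteration(tol=1e-10):
    v = [0.0] * N_STATES
    while True:
        new_v = [max(r + GAMMA * v[s2] for s2, r in (step(s, a) for a in ACTIONS))
                 for s in range(N_STATES)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v

def greedy_policy(v):
    def q(s, a):
        s2, r = step(s, a)
        return r + GAMMA * v[s2]
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in range(N_STATES)]

def successor_representation(policy, start, horizon=50):
    """psi(start) = sum_t gamma^t * one_hot(S_t) along the policy's trajectory."""
    psi = [0.0] * N_STATES
    s = start
    for t in range(horizon + 1):
        psi[s] += GAMMA ** t
        s, _ = step(s, policy[s])
    return psi

values = value_iteration()
policy = greedy_policy(values)
psi_0 = successor_representation(policy, start=0)
```

Each entry of `psi_0` is the discounted number of visits to that cell when starting from cell 0 and following the greedy policy.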
The successor representations of all states in the same task are considered. The successor representation of each task, $\Psi$, is obtained by concatenating all successor features in the same task. The similarity is defined as the dot product between the $\Psi$'s of the transfer and the original tasks. We choose the dot product to keep both angle and magnitude information between the concatenated successor representations. A higher dot product value means the transfer task is more similar to the original task, and vice versa.

Fig. 8: Transfer tasks are ranked by the similarity between successor features of each task. This figure shows the similarity ranking of different tasks compared to the source task, where the source task is marked by O. Each number in the cell indicates the similarity rank of the task when the goal is moved to that specific position.

$\psi(s, \text{task}_x) = \mathbb{E}_{\pi^*_{\text{task}_x}}\left[\sum_{t=0}^{T} \gamma^t f(S_t) \,\middle|\, S_0 = s\right]$
$\Psi_{\text{task}_x} = \left[\psi(s_0)\ \psi(s_1)\ \cdots\ \psi(s_{|S|})\right]$
$\text{similarity}(\text{task}_x, \text{task}_y) = \Psi_{\text{task}_x} \cdot \Psi_{\text{task}_y}$

Interestingly, goal states that are more distant from each other become more dissimilar when similarity is computed this way. This is clearer when we look at the similarity rankings of the goal states, as depicted in Figure 8. As shown in Figure 3, the representations have a hard time transferring to higher-ranking goals, so there is a clear connection between the ranks of the transfer tasks and the transfer performance. These findings support the use of this approach for calculating task similarity and ranking.
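The similarity computation and the resulting ranking can be sketched as below. The successor representations are hypothetical two-state examples, not values from the maze, and the task names are invented for illustration.

```python
def concatenate(psis):
    """Stack per-state successor representations into one long vector Psi."""
    return [x for psi in psis for x in psi]

def similarity(psi_task_x, psi_task_y):
    """Dot product between the concatenated successor representations of two tasks."""
    flat_x, flat_y = concatenate(psi_task_x), concatenate(psi_task_y)
    return sum(a * b for a, b in zip(flat_x, flat_y))

def rank_by_similarity(source, candidates):
    """Return task names ordered from most to least similar to the source task."""
    return sorted(candidates, key=lambda name: similarity(source, candidates[name]),
                  reverse=True)

# Hypothetical successor representations for 2 states under three tasks.
source = [[1.0, 0.9], [0.0, 1.0]]
candidates = {
    "near_goal": [[1.0, 0.8], [0.1, 1.0]],
    "far_goal": [[1.0, 0.0], [0.9, 1.0]],
}
ranking = rank_by_similarity(source, candidates)
```

Because the dot product is not normalized, tasks whose optimal trajectories visit the same states with similar discounted frequency score higher, which retains the magnitude information mentioned above.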

C.1 Representation Training
We show the learning curves of all representation learning architectures in Figure 9, to show that the early-saved representations have converged by the time they are saved.

C.2 Larger ReLU Transfer
We show the transfer performance of ReLU(L) in Figure 10. The ReLU(L) setting stays between ReLU and FTA representations: ReLU(L) keeps the same activation function as ReLU representation, but increases the size of the representation layer to 640, which is the same as the size of FTA representations. Therefore, it maintains the same value function capacity as the FTA representations. The pattern in the transfer performance of ReLU(L) is similar to ReLU (Figure 3). As the transfer tasks become dissimilar, the transfer performance drops below the Scratch agent. In general, when considering the total reward obtained by the agent, the performance of ReLU(L) is better than ReLU and worse than FTA.

Fig. 9: The plot shows the averaged return over the most recent 100 episodes at each checkpoint. The x-axis is the number of time steps and the y-axis is the average return. Each curve represents one agent specification (activation and auxiliary task pair). As our main focus is not to compare the learning efficiency during the representation learning step, and the difference between learning curves is not large, we only show the general trend by plotting every curve with the same color. The curve changes color to black, at the time point where we took the representation and fixed it.

Fig. 10: ... being most dissimilar. The black line shows the performance when learning in each transfer task from scratch, with the same representation size as ReLU(L). Lines completely above the black line indicate a representation yielded successful transfer in all tasks. Lines that start above the black line but fall below it as we move left to right indicate a representation that transfers to similar tasks but not dissimilar tasks.

Fig. 11: Transfer performance depends on the activation function, representation size, and auxiliary tasks. This plot presents the same data as Figure 4, but the error bar shows a 95% confidence interval. The bar shows the mean value over 5 seeds × 173 transfer tasks.

Figure 11 shows the 95% confidence interval of transfer performance with different representation sizes, activation functions, and different auxiliary tasks, with a non-linear value function.

C.4 Relationship between Properties
We also checked the relationship between properties. The result is shown in Figure 12. Two subplots are highlighted in orange. We noticed that diversity and complexity reduction showed a strong positive linear correlation. This suggests that monitoring either diversity or complexity reduction should be informative for predicting transfer performance on dissimilar tasks in practice. Furthermore, a threshold exists when looking at diversity and orthogonality, as well as diversity and complexity reduction. Representations with higher diversity (higher than 0.5, in this case) also showed higher orthogonality, while this pattern does not exist among low-diversity representations. Although there are several outliers, this still indicates the possibility that pursuing a representation with high orthogonality may also result in relatively high diversity in practice.