Quantifying the Impact of Non-Stationarity in Reinforcement Learning-Based Traffic Signal Control

In reinforcement learning (RL), dealing with non-stationarity is a challenging issue. However, some domains such as traffic optimization are inherently non-stationary. Causes for and effects of this are manifold. In particular, when dealing with traffic signal controls, addressing non-stationarity is key since traffic conditions change over time and as a function of traffic control decisions taken in other parts of a network. In this paper we analyze the effects that different sources of non-stationarity have in a network of traffic signals, in which each signal is modeled as a learning agent. More precisely, we study both the effects of changing the \textit{context} in which an agent learns (e.g., a change in flow rates experienced by it), as well as the effects of reducing agent observability of the true environment state. Partial observability may cause distinct states (in which distinct actions are optimal) to be seen as the same by the traffic signal agents. This, in turn, may lead to sub-optimal performance. We show that the lack of suitable sensors to provide a representative observation of the real state seems to affect the performance more drastically than the changes to the underlying traffic patterns.


Introduction
Controlling traffic signals is one way of dealing with the increasing volume of vehicles that use the existing urban network infrastructure. Reinforcement learning (RL) adds to this effort by allowing decentralization (traffic signals, modeled as agents, can independently learn the best action to take in each state) as well as on-the-fly adaptation to traffic flow changes. It is noteworthy that this can be done in a model-free way (with no prior domain information) via RL techniques. RL is based on an agent computing a policy mapping states to actions without requiring an explicit environment model. This is important in traffic domains because such a model may be very complex, as it involves modeling traffic state transitions determined not only by the actions of multiple agents, but also by changes inherent to the environment, such as time-dependent changes to the flow of vehicles.
One of the major difficulties in applying reinforcement learning (RL) in traffic control problems is the fact that the environments may change in unpredictable ways. The agents may have to operate in different contexts-which we define here as the true underlying traffic patterns affecting an agent; importantly, the agents do not know the true context of their environment, e.g., since they do not have full observability of the traffic network. Examples of partially observable variables that result in different contexts include different traffic patterns during the hours of the day, traffic accidents, road maintenance, weather, and other hazards. We refer to changes in the environment's dynamics as non-stationarity.
In terms of contributions, we introduce a way to model different contexts that arise in urban traffic due to time-varying characteristics. We then analyze different sources of non-stationarity that arise when applying RL to traffic signal control, and quantify the impact that each one has on the learning process. More precisely, we study the impact on learning performance resulting from (1) explicit changes in traffic patterns introduced by different vehicle flow rates; and (2) reduced state observability resulting from imprecision or unavailability of readings from sensors at traffic intersections. The latter problem may cause distinct states (in which distinct actions are optimal) to be seen as the same by the traffic signal agents. This not only leads to sub-optimal performance but may introduce drastic drops in performance when the environment's context changes. We evaluate the performance of deploying RL in a non-stationary multiagent scenario, where each traffic signal uses Q-learning, a model-free RL algorithm, to learn efficient control policies. The traffic environment is simulated using the open source microscopic traffic simulator SUMO (Simulation of Urban MObility) [1] and models the dynamics of a 4 × 4 grid traffic network with 16 traffic signal agents, where each agent has access only to local observations of its controlled intersection. We empirically demonstrate that the aforementioned causes of non-stationarity can negatively affect the performance of the learning agents. We also demonstrate that the lack of suitable sensors to provide a representative observation of the true underlying traffic state seems to affect learning performance more drastically than changes to the underlying traffic patterns.
The rest of this paper is organized as follows. The next section briefly introduces relevant RL concepts. Then, our model is introduced in Section 3, and the corresponding experiments in Section 4. Finally, we discuss related work in Section 5 and then present concluding remarks.

Reinforcement Learning
In reinforcement learning [2], an agent learns how to behave by interacting with an environment, from which it receives a reward signal after each action. The agent uses this feedback to iteratively learn an optimal control policy $\pi^*$. An experience tuple $\langle s, a, s', r \rangle$ denotes the fact that the agent was in state $s$, performed action $a$ and ended up in $s'$ with reward $r$. Let $t$ denote the $t$-th step in the policy $\pi$. In an infinite horizon MDP, the cumulative reward in the future under policy $\pi$ is defined by the Q-function $Q^{\pi}(s, a)$, as in Eq. 1, where $\gamma \in [0, 1]$ is the discount factor for future rewards:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi\right] \quad (1)$$
If the agent knows the optimal Q-values $Q^*(s, a)$ for all state-action pairs, then the optimal control policy $\pi^*$ can be easily obtained: since the agent's objective is to maximize the cumulative reward, the optimal control policy is given by Eq. 2:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a) \quad (2)$$

Reinforcement learning methods can be divided into two categories: model-free and model-based. Model-based methods assume that the transition function T and the reward function R are available, or instead try to learn them. Model-free methods, on the other hand, do not require that the agent have access to information about how the environment works.
The RL algorithm used in this paper is Q-learning (QL), a model-free off-policy algorithm that estimates the Q-values in the form of a Q-table. After an experience $\langle s, a, s', r \rangle$, the corresponding $Q(s, a)$ value is updated through Eq. 3, where $\alpha \in [0, 1]$ is the learning rate:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \quad (3)$$
In order to balance exploitation and exploration when agents select actions, we use in this paper the ε-greedy mechanism. This way, agents randomly explore with probability ε and choose the action with the best expected reward so far with probability 1 − ε.
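The tabular Q-learning update (Eq. 3) and the ε-greedy selection mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation; the state/action encodings and class interface are assumptions.

```python
import random
from collections import defaultdict

class QLAgent:
    """Minimal tabular Q-learning agent with epsilon-greedy action selection.

    Hypothetical sketch: hyperparameter defaults mirror the values reported
    in the experiments (alpha=0.1, gamma=0.99, epsilon decaying by 0.9985).
    """
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=1.0, decay=0.9985):
        self.q = defaultdict(float)          # Q-table: (state, action) -> value
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.decay = epsilon, decay

    def act(self, state):
        # Explore with probability epsilon; otherwise exploit the best Q-value.
        self.epsilon *= self.decay
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # Eq. 3: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])
```

With ε = 0 and α = 0 this agent reduces to pure exploitation of a frozen policy, which is exactly the setting analyzed in Section 4.4.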

Non-stationarity in RL
In RL, dealing with non-stationarity is a challenging issue [3]. Among the main causes of non-stationarity are changes in the state transition function $T(s, a, s')$ or in the reward function $R(s, a, s')$, partial observability of the true environment state (discussed in Section 2.3) and non-observability of the actions taken by other agents.
In an MDP, the probabilistic state transition function T is assumed not to change. However, this is not realistic in many real world problems. In non-stationary environments, the state transition function T and/or the reward function R can change at arbitrary time steps. In traffic domains, for instance, an action in a given state may have different results depending on the current context-i.e., on the way the network state changes in reaction to the actions of the agents. If agents do not explicitly deal with context changes, they may have to readapt their policies. Hence, they may undergo a constant process of forgetting and relearning control strategies. Though this readaptation is possible, it might cause the agent to operate in a sub-optimal manner for extended periods of time.

Partial Observability
Traffic control problems might be modeled as Dec-POMDPs [4], a particular type of decentralized multiagent MDP where agents have only partial observability of their true states. A Dec-POMDP adds to an MDP a set of agents $I$; for each agent $i \in I$, a set of actions $A_i$, with $A = \times_i A_i$ the set of joint actions; a set of observations $\Omega_i$, with $\Omega = \times_i \Omega_i$ the set of joint observations; and observation probabilities $O(o \mid s, a)$, the probability of the agents seeing observations $o$ given that the state is $s$ and the agents take actions $a$. As specific methods to solve Dec-POMDPs do not scale with the number of agents [5], it is usual to tackle them using techniques conceived for the fully-observable case. Though this allows for better scalability, it introduces non-stationarity, as the agents cannot completely observe their environment nor the actions of other agents.
In traffic signal control, partial observability can appear due to the lack of suitable sensors to provide a representative observation of the traffic intersection. Additionally, even when multiple sensors are available, partial observability may occur due to inaccurate (low-resolution) measurements.

Methods
As mentioned earlier, the main goal of this paper is to investigate the different causes of non-stationarity that might affect performance in a scenario where traffic signal agents learn how to improve traffic flow under various forms of non-stationarity. To study this problem, we introduce a framework for modeling urban traffic under time-varying dynamics.
In particular, we first introduce a baseline urban traffic model based on MDPs. This is done by formalizing-following similar existing works-the relevant elements of the MDP: its state space, action set, and reward function.
Then, we show how to extend this baseline model to allow for dynamic changes to its transition function so as to encode the existence of different contexts. Here, contexts correspond to different traffic patterns that may change over time according to causes that might not be directly observable by the agent. We also discuss different design decisions regarding the possible ways in which the states of the traffic system are defined; many of these are aligned with the modeling choices typically made in the literature, as for instance [6,7]. Discussing the different possible definitions of states is relevant since these are typically specified in a way that directly incorporates sensor information. Depending on the amount and quality of sensor information, however, different state definitions arise that, as a function of sensor resolution and partial observability of the environment and/or of other agents, result in different amounts of non-stationarity.
Furthermore, in what follows we describe the multiagent training scheme used (in Section 3.4) by each traffic signal agent in order to optimize its policy under non-stationary settings. We also describe how traffic patterns, the contexts in which our agents may need to operate, are modeled mathematically (Section 3.5). We discuss the methodology used to analyze and quantify the effects of non-stationarity in the traffic problem in Section 4.
Finally, we emphasize here that the proposed methods and analyses conducted in this paper, aimed at evaluating the impact of different sources of non-stationarity, are a main contribution of our work. Most existing works (e.g., those discussed in Section 5) do not address or directly investigate at length the implications of varying traffic flow rates as sources of non-stationarity in RL.

State Formulation
In the problems or scenarios we deal with, the definition of state space strongly influences the agents' behavior and performance. Each traffic signal agent controls one intersection, and at each time step t it observes a vector s t that partially represents the true state of the controlled intersection.
A state, in our problem, could be defined as a vector $s \in \mathbb{R}^{(2+2|P|)}$, as in Eq. 4, where $P$ is the set of all green traffic phases 1 , $\rho \in P$ denotes the current green phase, $\delta \in [0, maxGreenTime]$ is the elapsed time of the current phase, and each phase $p \in P$ contributes a density attribute and a queue attribute describing the vehicles on its incoming lanes:

$$s = [\rho, \delta, density_1, \ldots, density_{|P|}, queue_1, \ldots, queue_{|P|}] \quad (4)$$

Note that this state definition might not be feasible to implement in real-life settings, due to the cost of the many physical sensors that would have to be paid for and deployed. We introduce, for this reason, an alternative definition of state with a reduced scope of observation. More precisely, this alternative state definition removes the density attributes from Eq. 4, resulting in the partially-observable state vector $s \in \mathbb{R}^{(2+|P|)}$ in Eq. 5:

$$s = [\rho, \delta, queue_1, \ldots, queue_{|P|}] \quad (5)$$

The absence of these state attributes is analogous to the lack of real-life traffic sensors capable of detecting approaching vehicles along the extension of a given street (i.e., the density of vehicles along that street).
Note also that the above definition results in continuous states. Q-learning, however, traditionally works with discrete state spaces. Therefore, states need to be discretized after being computed. Both density and queue attributes are discretized into ten equally-spaced levels/bins. We point out that a low level of discretization is also a form of partial observability, as it may cause distinct states to be perceived as the same state. Furthermore, in this paper we assume, as commonly done in the literature, that one simulation time step corresponds to five seconds of real-life traffic dynamics. This helps encode the fact that traffic signals typically do not change actions every second; this modeling decision implies that actions (in particular, changes to the current phase of a traffic light) are taken at intervals of five seconds.
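The discretization step can be sketched as follows; the helper name and the assumption that attributes lie in a normalized $[0, maxValue]$ range are ours, not the paper's.

```python
def discretize(value, max_value, bins=10):
    """Map a continuous state attribute (e.g., lane density or queue length,
    assumed to lie in [0, max_value]) to one of `bins` equally-spaced levels.

    Hypothetical helper illustrating the binning described above.
    """
    level = int(value / max_value * bins)
    return min(level, bins - 1)  # clamp the upper edge into the last bin
```

A lower `bins` value coarsens the observation: distinct traffic conditions collapse into the same discrete state, which is the form of partial observability analyzed in Section 4.6.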

Actions
In an MDP, at each time step t each agent chooses an action a t ∈ A. The number of actions, in our setting, is equal to the number of phases, where a phase gives a green signal to a specific traffic direction; thus, |A| = |P |. In the case where the traffic network is a grid (typically encountered in the literature [8,6,9]), we consider two actions: an agent can either keep green time on the current phase or give green time to the other phase; we call these actions keep and change, respectively. There are two restrictions on action selection: an agent can take the action change only if δ ≥ 10s (minGreenT ime) and the action keep only if δ < 50s (maxGreenT ime). Additionally, change actions impose a yellow phase with a fixed duration of 2 seconds. These restrictions are in place to, e.g., model the fact that in real life, a traffic controller needs to commit to a decision for a minimum amount of time to allow stopped cars to accelerate and move to their intended destinations.
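The action restrictions above can be expressed as a simple validity check on the elapsed green time δ; the constant names below are ours, chosen to match the text.

```python
KEEP, CHANGE = 0, 1
MIN_GREEN, MAX_GREEN = 10, 50  # seconds (minGreenTime, maxGreenTime above)

def valid_actions(elapsed):
    """Return the actions allowed given the elapsed green time of the current
    phase: `change` requires elapsed >= MIN_GREEN, `keep` requires
    elapsed < MAX_GREEN. A sketch of the restrictions described above."""
    actions = []
    if elapsed < MAX_GREEN:
        actions.append(KEEP)
    if elapsed >= MIN_GREEN:
        actions.append(CHANGE)
    return actions
```

Note that for 10 ≤ δ < 50 both actions are available, while at the boundaries exactly one action remains, so the agent is never left without a valid choice.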

Reward Function
The rewards assigned to traffic signal agents in our model are defined as the change in cumulative vehicle waiting time between successive actions. After the execution of an action $a_t$, the agent receives a reward $r_t \in \mathbb{R}$ as given by Eq. 6:

$$r_t = W_t - W_{t+1} \quad (6)$$

where $W_t$ and $W_{t+1}$ represent the cumulative waiting time at the intersection before and after executing the action $a_t$, following Eq. 7:

$$W_t = \sum_{v \in V_t} w_{v,t} \quad (7)$$

where $V_t$ is the set of vehicles on roads arriving at the intersection at time step $t$, and $w_{v,t}$ is the total waiting time of vehicle $v$ since it entered one of the roads arriving at the intersection until time step $t$. A vehicle is considered to be waiting if its speed is below 0.1 m/s. Note that, according to this definition, the larger the decrease in cumulative waiting time, the larger the reward. Consequently, by maximizing rewards, agents reduce the waiting time at the intersections, thereby improving the local traffic flow.
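The reward computation in Eqs. 6 and 7 can be sketched directly; the data structures (a per-vehicle waiting-time mapping) are illustrative assumptions, not the simulator's API.

```python
def cumulative_waiting_time(vehicles, waiting):
    """W_t (Eq. 7): sum of the accumulated waiting times w_{v,t} of all
    vehicles v currently on roads arriving at the intersection.

    `vehicles` is the set V_t of vehicle ids; `waiting` maps each id to its
    accumulated waiting time (hypothetical representation)."""
    return sum(waiting[v] for v in vehicles)

def reward(w_before, w_after):
    """r_t = W_t - W_{t+1} (Eq. 6): positive when waiting time decreased."""
    return w_before - w_after
```

A decrease in cumulative waiting time yields a positive reward, so maximizing reward minimizes delay at the intersection.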

Multiagent Independent Q-learning
We tackle the non-stationarity in our scenario by using Q-learning in a multiagent independent training scheme [10], where each traffic signal is a QL agent with its own Q-table, local observations, actions and rewards. This approach allows each agent to learn an individual policy, applicable given the local observations that it makes; policies may vary between agents as each one updates its Q-table using only its own experience tuples. Besides allowing for different behaviors between agents, this approach also avoids the curse of dimensionality that a centralized training scheme would introduce. However, there is one main drawback of an independent training scheme: as agents are learning and adjusting their policies, changes to their policies cause the environment dynamics to change, thereby resulting in non-stationarity. This means that the original convergence guarantees for single-agent algorithms no longer hold, due to the fact that the best policy for an agent changes as other agents' policies change [11].
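One step of this independent training scheme can be sketched as follows. `env` and the per-agent dictionaries are hypothetical stand-ins for the SUMO/SUMO-RL interface, not its actual API; the sketch only shows that each agent acts on, and learns from, purely local information.

```python
def independent_step(env, agents, observations):
    """One training step of independent Q-learning: each traffic signal agent
    selects an action from its own local observation and updates only its own
    Q-table; no information is shared between agents."""
    # Each agent acts independently on its local observation.
    actions = {i: agent.act(observations[i]) for i, agent in agents.items()}
    next_observations, rewards = env.step(actions)
    # Each agent learns only from its own experience tuple.
    for i, agent in agents.items():
        agent.update(observations[i], actions[i], rewards[i],
                     next_observations[i])
    return next_observations
```

Because every agent's environment includes the other learning agents, each `update` chases a moving target, which is precisely the convergence issue noted above.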

Contexts
In order to model one of the causes for non-stationary in the environment, we use the concept of traffic contexts, similarly to da Silva et al. [12]. We define contexts as traffic patterns composed of different vehicle flow distributions over the Origin-Destination (OD) pairs of the network. The origin node of an OD pair indicates where a vehicle is inserted in the simulation. The destination node is the node in which the vehicle ends its trip, and hence is removed from the simulation upon its arrival. A context, then, is defined by associating with each OD pair a number of vehicles that are inserted (per second) in its origin node.
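A context can thus be represented as a mapping from OD pairs to insertion rates. The pair names and rates below are illustrative placeholders, not the flows used in our experiments; they only show how two contexts can insert the same total number of vehicles per second while distributing them differently.

```python
# A context maps each origin-destination (OD) pair to an insertion rate
# (vehicles per second at the origin node). Values are hypothetical.
context_1 = {("north", "south"): 0.2, ("west", "east"): 0.2}  # balanced flows
context_2 = {("north", "south"): 0.1, ("west", "east"): 0.3}  # W-E heavier

def total_insertion_rate(context):
    """Total vehicles inserted per second across all OD pairs of a context."""
    return sum(context.values())
```

Switching the active context during a simulation changes the transition dynamics the agents face without changing the overall traffic volume.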
Changing the context during a simulation causes the sensor measurements to vary over time in different ways. Events such as traffic accidents and rush hours, for example, cause the flow of vehicles to increase in a particular direction, thus making the queues on the lanes in that direction grow faster. In the usual case, where agents do not have access to all information about the environment state, this can directly affect the state transition function T and the reward function R of the MDP. Consequently, when the state transition probabilities and the rewards agents observe change, the Q-values of the state-action pairs also change. Therefore, traffic signal agents will most likely need to undergo a readaptation phase to correctly update their policies, resulting in periods of catastrophic drops in performance.

Experiments and Results
Our main goal with the following experiments is to quantify the impact of different causes of non-stationarity in the learning process of an RL agent in traffic signal control. Explicit changes in context (e.g., vehicle flow rate changes in one or more directions) are one of these causes and are present in all of the following experiments. This section first describes details of the scenario being simulated as well as the traffic contexts, followed by a definition of the performance metrics used as well as the different experiments that were performed.
We first conduct an experiment where traffic signals use a fixed control policy, a common strategy in case the infrastructure lacks sensors and/or actuators. The results of this experiment are discussed in Section 4.3 and are used to emphasize the problem of lacking a policy that can adapt to different contexts; it also serves as a baseline for later comparisons. Afterwards, in Section 4.4 we explore the setting where agents employ a given policy in a context/traffic pattern that has not been observed during the training phase. In Section 4.5 we analyze (1) the impact of context changes when agents continue to explore and update their Q-tables throughout the simulation; and (2) the impact of having non-stationarity introduced both by context changes and by the use of the two different state definitions presented in Section 3.1. Then, in Section 4.6 we address the relation between non-stationarity and partial observations resulting from the use of imprecise sensors, simulated by poor discretization of the observation space. Lastly, in Section 4.7 we discuss the main findings and implications of the observed results.

Scenario
We used the open source microscopic traffic simulator SUMO to model and simulate the traffic scenario and its dynamics, and SUMO-RL [13] to interface the learning agents with the simulation. In order to demonstrate the impact of context changes on traffic signals (and hence, on the traffic), we defined two different traffic contexts with different vehicle flow rates. Both contexts insert the same number of vehicles per second into the network, but distribute those vehicles differently over the possible OD pairs. It is expected that a policy in which the two green traffic phases are equally distributed would have a satisfactory performance in Context 1, but not in Context 2. In the following experiments, we shift between Context 1 and Context 2 during the simulation.

Metrics
To measure the performance of traffic signal agents, we used as metric the sum of the cumulative vehicle waiting time over all intersections, as in Eq. 7. Intuitively, this quantifies for how long vehicles are delayed by having to reduce their velocity below 0.1 m/s due to long waiting queues and to the inadequate use of red signal phases. At the time steps in which phase changes occur, natural oscillations in the queue sizes arise, since many vehicles are stopping and many are accelerating. Therefore, all plots shown here depict moving averages of this metric within a time window of 15 seconds. The plots related to Q-learning are averaged over 30 runs, with the shaded area showing the standard deviation. Additionally, we omit the time steps at the beginning of the simulation (when the network is not yet fully populated with vehicles) as well as the last time steps (when vehicles are no longer being inserted).

Traffic Signal Control under Fixed Policies
We first demonstrate the performance of a fixed policy designed following the Highway Capacity Manual [14], which is widely used for this task. The fixed policy assigns to each phase a green time of 35 seconds and a yellow time of 2 seconds. As mentioned, our goal in defining this policy is to construct a baseline used to quantify the impact of a context change on the performance of traffic signals in two situations: one where traffic signals follow a fixed policy and one where traffic signals adapt and learn a new policy using the QL algorithm. This section analyzes the former case. Fig. 2 shows that the fixed policy, as expected, loses performance when the context is changed. When the traffic flow is set to Context 2 at time step 20000, a larger number of vehicles drive in the W-E direction, thus producing longer waiting queues. In order to obtain good performance using fixed policies, it would be necessary to define a policy for each context and to know in advance the exact moment when context changes will occur. Moreover, there may be an arbitrarily large number of such contexts, and the agent, in general, has no way of knowing in advance how many exist. Prior knowledge of these quantities is typically unavailable, since non-recurring events that may affect the environment dynamics, such as traffic accidents, cannot be predicted. Hence, traffic signal control by fixed policies is inadequate in scenarios where traffic flow dynamics may change (slowly or abruptly) over time.

Effects of Disabling Learning and Exploration
We now describe the case in which agents stop learning from their actions at some point in time and simply follow the policy learned before a given context change. The objective here is to simulate a situation where a traffic signal agent applies a previously-learned policy to a context/traffic pattern that was not observed during its training phase. We achieve this by setting both α (learning rate) and ε (exploration rate) to 0 when there is a change in context. By observing Eq. 3, we see that the Q-values no longer change if α = 0. By setting ε = 0, we also ensure that the agents will not explore and will only choose the actions with the highest estimated Q-value given the dynamics of the last observed context. By analyzing performance in this setting, we can quantify the negative effect of agents acting solely by following the policy learned from previous contexts.
During the training phase (until time step 20000), we use a learning rate of α = 0.1 and discount factor γ = 0.99. The exploration rate starts at ε = 1 and decays by a factor of 0.9985 every time the agent chooses an action. These settings ensure that the agents are mostly exploring at the beginning, while by time step 10000 ε is below 0.05; after the context change, the agents purely exploit the previously-learned policy and thus do not adapt. In Fig. 3 we observe that the total waiting time of vehicles rapidly increases after the context change (time step 20000). This change in the environment dynamics causes the policy learned in Context 1 to no longer be efficient, since Context 2 introduces a flow pattern that the traffic signals have not observed before. Consequently, the traffic signal agents do not know the best actions to take in those states. Note, however, that some actions (e.g., changing the phase when there is congestion in one of the directions) are still capable of improving performance, since they are reasonable decisions under both contexts. This explains why performance drops considerably when the context changes and why the waiting time keeps oscillating afterwards.

Effects of Reduced State Observability
In this experiment, we compare the effects of context changes under the two different state definitions presented in Section 3.1. The state definition in Eq. 4 represents a less realistic scenario in which expensive traffic sensors are available at the intersections. In contrast, under the partial state definition in Eq. 5 each traffic signal has information only about how many vehicles are stopped at its corresponding intersection (queue), but cannot relate this information to the number of vehicles currently approaching its waiting queue, as moving vehicles are captured only by the density attributes.
Differently from the previous experiment, agents now continue to explore and update their Q-tables throughout the simulation. The ε parameter is set to a fixed value of 0.05; this way, the agents mostly exploit but still have a small chance of exploring other actions in order to adapt to changes in the environment. By keeping ε fixed, we ensure that performance variations are not caused by the exploration strategy. The values of the QL parameters (α and γ) are kept as in the previous experiment. The results of this experiment are shown in Fig. 4. By analyzing the initial steps of the simulation, we note that agents using the reduced state definition learn significantly faster than those using the state definition that incorporates both queue and density attributes. This is because there are fewer states to explore, so it takes fewer steps for the policy to converge. However, given this limited observation capability, agents converge to a policy resulting in higher waiting times than that obtained by agents with more extensive state observability. This shows that the density attributes are fundamental to better characterize the true state of a traffic intersection. Also note that around time step 10000, the performance under both state definitions (around 500 seconds of total waiting time) is better than that achieved under the fixed policy program (around 2200 seconds of total waiting time), depicted in Fig. 2.
After the first context change, at time step 20000, the total waiting time under both state definitions increases considerably. This is expected, as it is the first time agents have to operate in Context 2. Agents operating under the original state definition recover from this context change rapidly and achieve the same performance obtained in Context 1. With the partial state definition (i.e., only queue attributes), however, it is more challenging for agents to behave properly when operating under Context 2, which features an unbalanced traffic flow arriving at the intersections.
Finally, we can observe how (at time step 60000) the non-stationarity introduced by context changes relates to the limited partial state definition. While traffic signal agents observing both queue and density do not show any oscillations in the waiting time of their controlled intersections, agents observing only queue suffer a significant performance drop. Despite having already experienced Context 2, they had to relearn their policies, since their past Q-values were overwritten when the learning mechanism adapted to the dynamics of the previous context. The dynamics of both contexts are, however, well captured by the original state definition, as the combination of the density and queue attributes provides enough information about the dynamics of traffic arrivals at the intersection. This observation emphasizes the importance of more extensive state observability to avoid the negative impacts of non-stationarity on RL agents.

Effects of Different Levels of State Discretization
Besides the unavailability of appropriate sensors (which results in an incomplete description of states), another possible cause of non-stationarity is poor precision and low range of observations. As an example, consider imprecision in the measurement of the number of vehicles waiting at an intersection; this may cause distinct states, in which distinct actions are optimal, to be perceived as the same state. This not only leads to sub-optimal performance, but also introduces drastic performance drops when the context changes. We simulate this effect by lowering the number of discretization levels of the queue attribute in cases where the density attribute is not available. Fig. 5 depicts how the discretization level of the queue attribute affects performance when a context change occurs. The red line corresponds to the performance when queue is discretized into 10 equally-spaced levels/bins (see Section 3.1). The green line corresponds to performance under a reduced discretization level of 4 bins. After each context change (at time steps 20000, 40000 and 60000), the reduced discretization level causes a significant drop in performance. At time step 40000, for instance, the total waiting time increases by up to a factor of 3 when operating under the lower discretization level.
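The aliasing effect of coarse discretization can be seen with a small numeric example; the occupancy values below are hypothetical, chosen only to illustrate how two distinct queue readings collapse into one discrete state under 4 bins but not under 10.

```python
def to_level(value, bins):
    """Map a normalized queue occupancy in [0, 1] to one of `bins`
    equally-spaced discrete levels (illustrative helper)."""
    return min(int(value * bins), bins - 1)

# Two distinct traffic conditions (30% vs. 45% queue occupancy):
fine_a, fine_b = to_level(0.30, 10), to_level(0.45, 10)      # 10 bins
coarse_a, coarse_b = to_level(0.30, 4), to_level(0.45, 4)    # 4 bins
```

With 10 bins the two readings land in different levels, so the agent can learn distinct Q-values for them; with 4 bins they alias into the same state, and optimal actions for one condition overwrite those for the other, which matches the performance drops observed in Fig. 5.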
Intuitively, an agent with imprecise observation of its true state has reduced capability to perceive changes in the transition function. Consequently, when traffic flow rates change at an intersection, agents with imprecise observations require a larger number of actions to readapt, thereby dramatically increasing queues.

Discussion
Many RL algorithms have been proposed to tackle non-stationary problems [15,16,12]. Specifically, these works assume that the environment is non-stationary (without studying or analyzing the specific causes of non-stationarity) and then propose computational mechanisms to learn efficiently under that setting. In this paper, we deal with a complementary problem, which is to quantify the effects of different causes of non-stationarity on learning performance. We also assume that non-stationarity exists, but we explicitly model many of the possible underlying reasons why its effects may take place. We study this complementary problem because it is our understanding that, by explicitly quantifying the different causes of non-stationary effects, it may be possible to make better-informed decisions about which specific algorithm to use, or to decide, for instance, whether effort is better spent designing a more complete set of state features rather than more sophisticated learning algorithms.
In this paper, we studied these possible causes specifically as they affect urban traffic environments. The results of our experiments indicate that non-stationarity in the form of changes to vehicle flow rates significantly impacts both traffic signal controllers following fixed policies and policies learned by standard RL methods that do not model different contexts. However, this impact (which results in rapid changes in the total number of vehicles waiting at the intersections) affects agents differently depending on the level of observability available to them. While agents with the original state definition (queue and density attributes) only suffer performance drops the first time they operate in a new context, agents with reduced observations (only the queue attribute) may have to relearn the appropriate Q-values after every context change. The original state definition, however, is not very realistic in the real world, as sensors capable of providing both attributes for large traffic roads are very expensive. Finally, in cases where agents observe only the queue attribute, we demonstrated that imprecise measurements (e.g., a low number of discretization bins) amplify the impact of context changes. Hence, in order to design a robust RL traffic signal controller, it is critical to consider which sensors are most adequate and how they contribute to a more extensive observation of the true environment state.
We observed that the non-stationarity introduced by the actions of other concurrently-learning agents in a competitive environment seemed to be a minor obstacle to acquiring effective traffic signal policies. However, a traffic signal agent that selfishly learns to reduce its own queue size may introduce a higher flow of vehicles arriving at neighboring intersections, thereby affecting the rewards of other agents and producing non-stationarity. We believe that in more complex scenarios this effect would be more clearly visible.
Furthermore, we found that traditional tabular independent Q-learning performed well in our scenario when the impacts of non-stationarity are set aside. Therefore, in this particular simulation it was not necessary to use more sophisticated methods such as algorithms based on value-function approximation, e.g., deep neural networks. Such methods could help in larger-scale simulations that require dealing with higher-dimensional states. However, we emphasize that even though they could help with higher-dimensional states, they would also be affected by the presence of non-stationarity, just as standard tabular methods are. This happens because, like standard tabular Q-learning, deep RL methods do not explicitly model the possible sources of non-stationarity, and therefore suffer in terms of learning performance whenever changes in the state transition function occur.
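For reference, the core of a tabular independent Q-learner of the kind discussed above can be sketched as follows. This is a generic sketch, not the paper's exact implementation: the class name, the hyperparameter values, and the epsilon-greedy exploration scheme are assumptions made for illustration. Each traffic signal agent runs its own copy, treating the other agents as part of the (possibly non-stationary) environment, which is precisely why the update rule has no explicit model of context changes.

```python
import random
from collections import defaultdict

class IndependentQLearner:
    """Tabular Q-learning for a single traffic signal agent.
    States are discretized local observations; actions are the
    available signal phases. Hyperparameters are illustrative."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.05):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # epsilon-greedy selection over this agent's own Q-table
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # One-step Q-learning update. Concurrent learning by other
        # agents and flow changes shift the transition function, so
        # these estimates must be relearned after a context change.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])
```

Because the Q-table keys on the observed (discretized) state, any aliasing introduced by coarse sensors directly corrupts these estimates, linking this sketch back to the observability results above.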

Related Work
Reinforcement learning has been previously used with success to provide solutions to traffic signal control. Surveys of the area [17,18,19] have discussed fundamental aspects of reinforcement learning for traffic signal control, such as state definitions, reward functions and algorithm classifications. Many works have addressed multiagent RL [20,6,8] and deep RL [21,22,23] methods in this context. Despite non-stationarity being frequently mentioned as a complex challenge in traffic domains, we found a lack of works quantifying its impact and relating it to its many causes and effects.
In Table 1 we compare relevant related works that have addressed non-stationarity in the form of partial observability, changes in vehicle flow distribution, and/or multiagent scenarios. In [12], da Silva et al. explored non-stationarity in traffic signal control under different traffic patterns. They proposed the RL-CD method to create partial models of the environment, each one responsible for dealing with one kind of context. However, they used a simple model of the states and actions available to each traffic signal agent: state was defined as the occupation of each incoming link and discretized into 3 bins; actions consisted of selecting one of three fixed, previously-designed signal plans. In [24], Oliveira et al. extended the work in [12] to address the non-stationarity caused by the random behavior of drivers with regard to the operational task of driving (e.g., deceleration probability), but the aforementioned simple model of states and actions was not altered. In [23], Liu et al. proposed a variant of independent deep Q-learning to coordinate four traffic signals; however, no information about vehicle distribution or insertion rates was mentioned or analyzed. A comparison between different state representations using the A3C algorithm was made in [7]; however, that paper did not study the capability of agents to adapt to different traffic flow distributions. The work in [25] also explored non-stationarity caused by different traffic flows, but did not consider the impact of the state definition used (with low discretization and only one sensor) on its results. To the best of our knowledge, this is the first work to analyze how different levels of partial observability affect traffic signal agents in non-stationary environments where traffic flows change not only in vehicle insertion rate, but also in vehicle insertion distribution between phases.

Conclusion
Non-stationarity is an important challenge when applying RL to real-world problems in general, and to traffic signal control in particular. In this paper, we studied and quantified the impact of different causes of non-stationarity on a learning agent's performance. Specifically, we studied the problem of non-stationarity in multiagent traffic signal control, where non-stationarity resulted from explicit changes in traffic patterns and from reduced state observability. This type of analysis complements those made in existing works on non-stationarity in RL; these typically propose computational mechanisms to learn under changing environments, but usually do not systematically study the specific causes and impacts that the different sources of non-stationarity may have on learning performance.
We have shown that independent Q-learning agents can re-adapt their policies to traffic pattern context changes. Furthermore, we have shown that the agents' state definition and their scope of observations strongly influence the agents' re-adaptation capabilities. While agents with more extensive state observability do not undergo performance drops when dynamics change to previously-experienced contexts, agents operating under a partially observable version of the state often have to relearn policies. Hence, we have shown how a better understanding of the causes and effects of non-stationarity may aid in the development of RL agents. In particular, our results empirically suggest that effort in designing better sensors and state features may have a greater impact on learning performance than effort in designing more sophisticated learning algorithms.
In future work, traffic scenarios that include other causes of non-stationarity can be explored. For instance, traffic accidents may cause drastic changes to the dynamics of an intersection, as they introduce queues only in some traffic directions. In addition, we propose studying how well our findings generalize to settings involving arterial roads (which carry a greater volume of vehicles) and intersections with different numbers of traffic phases.