Repot: Transferable Reinforcement Learning for Quality-centric Networked Monitoring in Various Environments

Collecting and monitoring data with low latency from numerous sensing devices is one of the key foundations of networked cyber-physical applications such as industrial process control, intelligent traffic control, and networked robots. As delays in data updates can degrade the quality of networked monitoring, it is desirable to continuously maintain optimal settings on sensing devices in terms of transmission rates and bandwidth allocation, taking into account application requirements as well as the time-varying conditions of the underlying network environments. In this paper, we adapt deep reinforcement learning (RL) to achieve a bandwidth allocation policy for networked monitoring. We present Repot, a transferable RL model in which a policy trained in an easy-to-learn network environment can be readily adjusted to various target network environments. Specifically, we employ flow embedding and action shaping schemes in Repot that enable the systematic adaptation of a bandwidth allocation policy to the conditions of a target environment. Through experiments with the NS-3 network simulator, we show that Repot achieves stable and high monitoring performance across different network conditions, e.g., outperforming other heuristic and learning-based solutions by 14.5~20.8% in quality-of-experience (QoE) in a target network environment. We also demonstrate sample-efficient adaptation in Repot, which uses only 6.25% of the samples required for model training from scratch. We further present a case study with the SUMO mobility simulator and verify the benefits of Repot in practical scenarios, showing performance gains over the other methods, e.g., 6.5% in urban-scale and 12.6% in suburb-scale scenarios.


I. INTRODUCTION
In cyber-physical applications, sensing devices operate as data sources of a distributed database in that each continuously sends its status information to a centralized node that evaluates application-specific queries on the aggregated information. Such a sensor-based, networked monitoring system requires the status updates to be as timely as possible to maintain high quality in query evaluation [1]-[3]. However, due to the nature of existing network infrastructures with inherent restrictions on low-latency data communications, it can be challenging to ensure the timeliness of status updates and information aggregation at all times from numerous sensing devices [4].
In general, the problem has been investigated in several fields of network systems and applications such as the age of information (AoI) [4], [5], decision fusion [6]-[8], sensor networks [9], [10], and the Internet of Things (IoT) [11]. Existing studies have normally focused on heuristic strategies for status updates according to given optimization objectives and resource constraints. For example, Jiang et al. [5] addressed AoI problems in wireless networks with dynamic channel errors by exploiting the closed-form Whittle's index [12], which estimates status tracking accuracy, and establishing a heuristic strategy to configure the transmission rates of sensing devices. At the physical network level, Ciuonzo et al. [8] presented efficient decision fusion rules over massive multiple-input multiple-output (MIMO) in wireless sensor networks to reduce complexity and improve energy efficiency by employing a widely-linear statistic [13], [14] and linear filters.
In this paper, we take a learning-based approach for scheduling and optimizing the data transmission and state updates under restrictive network conditions. Deep reinforcement learning (RL) has been recently considered a feasible solution to tackle complex optimization problems in the area of various networked systems, e.g., energy optimization in data centers [15], [16], cluster resource management in cloud computing [17]- [19], video streaming in wireless networks [20], network slicing [21], [22], and others. In the same vein as those RL-based approaches, we address the optimization problem of status updates in networked monitoring by formulating it as successive decisions on bandwidth allocation for multiple sensing devices, which can be modeled in a Markov decision process (MDP) to be learned through RL.
In doing so, we propose a transferable RL model of which the structure is tailored to the characteristics of networked monitoring. Our proposed model is called Repot (REinforcement learning POlicy with Transferability). In Repot, we train a bandwidth allocation policy in a learnable network environment using conventional RL algorithms (e.g., SAC [23]) and then adapt the policy according to the conditions of a specific target network environment using flow embedding and action shaping schemes.
The flow embedding scheme is intended to represent each sensor data stream and its relation to the other streams on a common vector space, rendering Repot scalable across a wide variety of observed network states. The action shaping scheme is intended to decompose the action inference on bandwidth allocation into a two-stage procedure with several modules, by which a latent action is generated upon flow embeddings and its representation is then transformed according to different conditions (e.g., underlying network limitations or spatial characteristics of monitored objects).
These two schemes in Repot enable the rapid adaptation of a policy optimized in an easy-to-learn source environment to a target environment, and alleviate the difficulty of achieving an optimal policy across a variety of network scales and environment conditions. For example, collecting 1.6M training samples in a network simulator can require tens or hundreds of days (e.g., in Table 4); based on our simulations, that amount is the estimated minimum for a (non-pretrained) model to converge in a target environment. In Repot, the adaptation schemes can establish a competitive policy sample-efficiently. The learned policy enables high-quality monitoring, showing 14.5∼20.8% higher quality-of-experience (QoE) than the other heuristic and learning-based methods in comparison for a given target environment (in Figure 5). This performance benefit is achieved by the sample-efficient adaptation schemes, in which about 100K samples are used to transfer a policy learned in a source environment; that is only 6.25% of what would otherwise be demanded for model training from scratch (e.g., 1.6M).
As such, Repot allows us not only to exploit conventional RL algorithms to robustly establish a policy optimized in a source environment, but also to efficiently adapt the policy to target environments. Repot shows robust adaptation performance, comparable to that optimized in a source environment (i.e., within a 1% margin in Figure 5). Furthermore, we present a case study with the SUMO (Simulation of Urban MObility) mobility simulator [24] and demonstrate the applicability of Repot in practical network monitoring scenarios. Repot achieves performance gains over the other methods, e.g., 6.5% in urban-scale and 12.6% in suburb-scale scenarios (in Figure 10).
In Repot, we focus on the modular model structure and policy transferability in RL, which is the first such attempt in the context of networked monitoring.
The rest of the paper is organized as follows. Section II describes the problem of networked monitoring in different network environments and our RL-based approach to it. Section III presents the modular structure and algorithm of our proposed model with flow embedding and bandwidth allocation modules, and describes the adaptation scheme based on action shaping. Sections IV, V, and VI provide the experiment results, the related research works, and the conclusion, respectively. In addition, Table 1 provides a list of acronyms used in this paper, and Table 2 summarizes notations frequently used in three aspects (system, algorithm, and implementation).

II. OVERALL SYSTEM
In this section, we explain the problem formulation regarding the QoE assurance on networked monitoring in various environments, and describe our approach to the problem.

A. QUALITY-CENTRIC NETWORKED MONITORING
In resource-constrained network environments, we consider a monitoring system in which geographically distributed sensing devices communicate with a server running data-driven applications on aggregated information. The QoE achieved by data-driven applications such as industrial process control, intelligent traffic control, and networked robots usually depends on the timeliness of status updates and aggregation. For example, a high-quality map can be created based on real-time data streams from geographically distributed sensing devices [26], [27]. Figure 1 briefly illustrates such a networked monitoring system, where a server employs a bandwidth allocation strategy among networked sensor devices to continually ensure high QoE in query evaluation on timely updates from the devices, and Algorithm 1 represents the overall procedure of the system. Specifically, each device D_i for i ∈ {1, ..., N_D} collects surrounding information about tracked objects and sends it to a server in packets (e.g., multi-access edge computing (MEC) systems [28]); timely aggregated information from all D_i enables real-time monitoring. We represent the latest updated information at time-step t by D_i as I^t_i and the aggregated information of all D_i as I^t = {I^t_i}_{i=1}^{N_D}. In addition, we represent the real-time monitoring quality as a function QUAL that is evaluated on I^t in an application-specific way (e.g., Eq. (21)). Due to the resource limitations of underlying network systems, it is non-trivial to always keep I^t up-to-date and achieve the optimal quality. We assume that bandwidth is a major limited resource affecting the transmission rate of sensor devices. We represent the link capacity of a network environment E as L_E and the bandwidth allocated to D_i as a_i.
Then, we have a resource constraint

\sum_{i=1}^{N_D} a^t_i \le A,   (1)

and formulate an optimization problem

\max_{\{a^t_i\}} \frac{1}{T} \sum_{t=1}^{T} \mathrm{QUAL}(I^t) \quad \text{subject to Eq. (1)},   (2)

where A denotes the overall bandwidth limit given the link capacity L_E, T denotes the entire time-step period, t denotes a discrete time-step, and N_D denotes the number of sensing devices.
Given the formulation, we aim at achieving an RL-based (resource) orchestrator that allocates the bandwidth limit a_i of each sensing device D_i to maximize the monitoring quality (QoE) under the limited overall capacity (L_E). That is, the RL-based orchestrator takes the online network state as input and determines {a^t_i}_{i=1}^{N_D} at each time-step t, receiving rewards based on the achieved QoE.
In this regard, we address the challenging problems of such an RL-based orchestrator under a variety of network conditions in terms of scales, configurations, and observation dynamics, which usually affect the performance of an orchestrator deployed in a target environment.

B. RL-BASED ORCHESTRATION
For the optimization problem in Eq. (2), we formulate an RL-based orchestrator as an MDP with a tuple (S, A, p, r, γ). An MDP consists of a state space S, an action space A, a state transition probability p : S × A × S → [0, 1], a reward r : S × A → [0, 1], and a discount factor γ ∈ [0, 1]. We assume that S and A are continuous and p is unknown.
State. A state S is represented as

S^t = [S^t_1, ..., S^t_{N_D}], where S^t_i ∈ R^{u×d},   (3)

and S^t_i is the historical flow state of device D_i containing its u latest status updates with d features each. In our implementation, we set d = 3, in that the features include the status update information, whether or not the information varies from the last one, and a timestamp; furthermore, we set the history size u = 3.
Action. An RL-based orchestrator determines an action a using a φ-parameterized policy π_φ upon a state S, i.e.,

a = [a_1, ..., a_{N_D}] = π_φ(S),   (4)

where a_i sets the bandwidth limit of each device D_i.
Reward. A reward r is calculated based on the quality function QUAL(·). For time-step t, we have

r = QUAL(I^t).   (5)
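As a concrete illustration of this MDP formulation, the following minimal sketch shows the state/action/reward interface. All names, dimensions, and the toy reward signal are illustrative assumptions, not the paper's implementation; in particular, the placeholder reward merely stands in for the application-specific QUAL(·).

```python
import numpy as np

N_D = 4      # number of sensing devices (assumed small for illustration)
u, d = 3, 3  # history size and per-update feature count, as in the text

class MonitoringEnv:
    """Toy stand-in for the networked-monitoring MDP."""
    def __init__(self, link_capacity=10.0):
        self.L_E = link_capacity
        # State of Eq. (3): one (u x d) historical flow state per device.
        self.state = np.zeros((N_D, u, d))

    def step(self, action):
        # Enforce the resource constraint of Eq. (1): sum_i a_i <= limit.
        action = np.asarray(action, dtype=float)
        assert action.sum() <= self.L_E + 1e-9
        # Placeholder quality signal standing in for QUAL(I^t) in Eq. (5):
        # here, reward grows as bandwidth is spread more evenly (toy choice).
        reward = 1.0 - np.abs(action - action.mean()).sum() / self.L_E
        self.state = np.roll(self.state, shift=1, axis=1)  # push history
        return self.state, reward

env = MonitoringEnv()
s, r = env.step(np.full(N_D, env.L_E / N_D))  # uniform allocation
```

An RL agent would replace the uniform allocation above with actions sampled from its policy π_φ.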

C. POLICY TRANSFERABLE RL
In Repot, the flow embedding scheme is intended to represent the relation of multiple flow states in low-dimensional vectors, extracting the historical features from status updates of sensing devices. Given the up-to-date flow embedding vectors as input, actions on bandwidth allocation are calculated through the action shaping scheme.
To achieve the Repot model, we employ two-phase model training: a base model is trained in an easy-to-learn source environment, and then its learned policy is adapted to a target environment. Figure 2 illustrates the modular model structure in Repot, with the flow embedding and bandwidth allocation modules in the middle and top, respectively. It also represents the two-phase model training, where the left part corresponds to model training in a source environment and the right part corresponds to fine-tuning for adaptation in a target environment. (Phase-1) In a source environment, the flow embedding module and the allocation function in the bandwidth allocation module are trained to establish a general policy on bandwidth allocation upon flow embeddings. (Phase-2) Then, the bandwidth allocation policy optimized in the source environment is fine-tuned to adapt to the network conditions in a given target environment. This two-phase procedure is detailed in Algorithms 2 and 3.
In Repot, the fine-tuning structure is tailored in that only the adjustment function needs to be updated while the other trainable functions are frozen, as shown in the right part of Figure 2. In doing so, we employ the action shaping scheme, by which a latent action is first calculated from flow embeddings and then the action is transformed into bandwidth limit values according to specific network conditions.
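The Phase-2 idea of updating only the adjustment function while freezing the rest can be sketched as follows. This is a framework-agnostic toy illustration; the module names, scalar parameters, and the plain gradient step are all placeholders, not the paper's implementation.

```python
# Which modules receive gradient updates in Phase-2 fine-tuning:
trainable = {
    "EMB": False,      # flow embedding: frozen after Phase-1
    "ALLOC": False,    # allocation function: frozen after Phase-1
    "ADJUST": True,    # adjustment function: fine-tuned in Phase-2
}

def phase2_update(params, grads, lr=0.01):
    # Apply a gradient step only to modules marked trainable.
    return {name: (p - lr * grads[name] if trainable[name] else p)
            for name, p in params.items()}

# Toy scalar "parameters" per module, just to show the selective update.
params = {"EMB": 1.0, "ALLOC": 2.0, "ADJUST": 3.0}
grads = {"EMB": 10.0, "ALLOC": 10.0, "ADJUST": 10.0}
new = phase2_update(params, grads)  # only ADJUST moves
```

In a deep-learning framework, the same effect is typically obtained by excluding frozen parameters from the optimizer or disabling their gradients.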

III. POLICY TRANSFERABLE RL STRUCTURE
In this section, we describe the model structure and algorithms in Repot. The flow embedding module encodes state S t in Eq. (3) into embeddings, and the bandwidth allocation module calculates action a t in Eq. (4) from the embeddings. In the following, we explain those modules in detail.

A. FLOW EMBEDDING
The flow embedding module EMB ψ (·) with trainable model parameters ψ consists of vectorization and self-attention functions, and it represents each flow information from a device in a common vector space.
Vectorization: For each time-step t, an intermediate embedding vector e_i based on the historical flow state S^t_i is obtained through MLP_ψ(·), i.e.,

e_i = MLP_ψ(S^t_i).   (6)

Relation Extraction: Given the intermediate embeddings E = [e_1, ..., e_{N_D}], the respective flow embeddings are obtained by

[ê_1, ..., ê_{N_D}] = ATT_ψ(e_1, ..., e_{N_D}).   (7)

In ATT_ψ(·), query, key, and value vectors (i.e., q_i, k_i, and v_i) are calculated through respective MLPs (i.e., MLP_ψq(·), MLP_ψk(·), and MLP_ψv(·)). Given each vector in E, its respective elements are first obtained by

x_i = MLP_ψx(e_i)   (8)

for x ∈ {q, k, v}. Then, the attentive weight vector w_i representing the relationship between flow state S^t_i and the others is obtained using the scaled dot-product of q_i and [k_1, ..., k_{N_D}], i.e.,

w_i = softmax(q_i [k_1, ..., k_{N_D}]^T / √d_k),   (9)

where d_k is the dimension of the key vectors. Finally, flow embedding vector ê_i is calculated as the weighted sum of the value vectors, i.e., ê_i = Σ_j w_{i,j} v_j. As described, flow embeddings encapsulate both the historical and relational features of individual flows and represent them on a common vector space, supporting a scalable model structure in which a policy can be optimized across different network scales. For further explanation, we notate the flow embedding in a simple form combining Eq. (6) and (7):

EMB_ψ(S^t) = [ê_1, ..., ê_{N_D}].   (10)
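As an illustration, the vectorization and scaled dot-product attention of the flow embedding module can be sketched in plain numpy. Linear maps stand in for the MLPs here, and all shapes and parameter values are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N_D, u, d = 5, 3, 3          # devices, history size, features (assumed)
d_e, d_k = 8, 8              # embedding and key dimensions (assumed)
W_emb = rng.normal(size=(u * d, d_e))   # stands in for MLP_psi of Eq. (6)
W_q = rng.normal(size=(d_e, d_k))       # stands in for MLP_psi_q
W_k = rng.normal(size=(d_e, d_k))       # stands in for MLP_psi_k
W_v = rng.normal(size=(d_e, d_k))       # stands in for MLP_psi_v

def flow_embedding(S):
    # Vectorization: flatten each (u x d) flow state and project it.
    E = S.reshape(N_D, u * d) @ W_emb
    # Query/key/value projections for each flow.
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    # Scaled dot-product attention weights over all flows.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Weighted sum of value vectors gives one embedding per flow.
    return attn @ V

S = rng.normal(size=(N_D, u, d))
emb = flow_embedding(S)  # one fixed-size embedding per flow
```

Because the output is one fixed-size vector per flow regardless of N_D, the same downstream policy can consume embeddings from networks of different scales.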

B. BANDWIDTH ALLOCATION
The bandwidth allocation module is structured based on the design principles explained below to provide the adaptation of a learned policy to a target environment. To mitigate the large action space problem [29], [30] and establish fast convergence in model training, we employ a latent action representation: a latent action generated by a trainable function (ALLOC) upon flow embeddings is transformed according to the observed network conditions. As the latent action space is much smaller in dimension than what is required for individual actions for all sensor devices, it is effective to build the ALLOC function by model training and to use its output for adaptation. Furthermore, to support the adaptation in a target environment, we restructure the action transformation (that processes the latent action output of ALLOC) into two functions: a non-trainable, controllable function (SHAPE) that conducts action shaping, and a trainable function (ADJUST) that sets the control parameter values of SHAPE. That is, the ADJUST function is intended to learn to calculate the optimal control values from a small number of training samples, rendering the latent action outputs of ALLOC well-fitted to a target environment.
Allocation: A latent action is calculated upon the flow embeddings, i.e.,

ã^t = ALLOC_φ1(EMB_ψ(S^t)),   (11)

where each ã^t corresponds to weighted values on a fixed-size set of geographical points [p_1, ..., p_{N_p}] for N_p ≪ N_D. Note that each point is randomly placed on the 2D grid where the sensing devices are located. ALLOC_φ1(·) is a φ_1-parameterized function trained by RL to induce policy π_φ1.
Adjustment: As described in our design principle above, each latent action is transformed into a target-specific action.
The control values ã^t_δ are first calculated by another policy π_φ2 with trainable parameters φ_2, i.e.,

ã^t_δ = ADJUST_φ2(EMB_ψ(S^t)).   (12)

Those control values are used in Eq. (13) below to complete action shaping. To minimize the search space in RL and support rapid adaptation, we deliberately confine the range of ã_δ within a small z% of that of ã, with z = 30% by default. Given the control values in Eq. (12), a latent action ã^t in Eq. (11) is transformed as

a^t = SHAPE(ã^t, ã^t_δ; k, v),   (13)

where k and v are constants, having k = 1 and v = 0 by default, and SHAPE is a non-trainable function whose implementation is explained in Eq. (14)-(16) below. First, intermediate values a_1, ..., a_{N_D} for the bandwidth limits of devices are calculated using the latent action in Eq. (11), the control values in Eq. (12), and the inverse distance from each device to the geographical points. Note that ||D_i − p_j|| is the distance from device D_i to point p_j, 0.1N_p is a clipping threshold, ε_1 is a small positive constant, and c = −2 is a clipping value.
Then, action a^t is obtained by transforming the intermediate values calculated above so that the total allocation conforms to the overall bandwidth limit A in Eq. (2). Finally, upon receiving the bandwidth allocation a^t_i at time-step t, each sensing device D_i modifies its configuration on the status update rate accordingly.
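The general mechanism of spreading a latent action over geographical points into per-device bandwidth limits might be sketched as follows. This is a hedged illustration of the idea only: it uses inverse-distance weighting with the exponent c = −2 mentioned in the text and a final normalization to the overall limit A, but it does not reproduce the exact Eqs. (14)-(16) or the clipping details; coordinates and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N_D, N_p, A = 6, 3, 12.0                      # devices, points, bandwidth limit
devices = rng.uniform(0, 10, size=(N_D, 2))   # device coordinates (assumed)
points = rng.uniform(0, 10, size=(N_p, 2))    # latent-action anchor points

def shape_action(latent, eps=1e-1):
    # Distance from each device to each anchor point.
    dist = np.linalg.norm(devices[:, None, :] - points[None, :, :], axis=-1)
    # Inverse-distance weighting; exponent -2 as in the text, eps avoids /0.
    inv = (dist + eps) ** -2
    # Intermediate per-device values from the latent action over points.
    inter = inv @ latent
    # Normalize so the total allocation meets the overall limit A.
    return A * inter / inter.sum()

a = shape_action(rng.uniform(0.1, 1.0, size=N_p))
```

The key property is that the trainable policy only outputs N_p latent values, while the non-trainable shaping expands them to N_D device-level allocations that respect the bandwidth budget.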

C. IMPLEMENTATION
To implement the Repot model, we use the soft actor-critic (SAC) RL algorithm [23], an off-policy RL method known to effectively retain the benefits of entropy maximization and stability. Specifically, we employ two SAC structures, denoted as SAC_m for m ∈ {1, 2}. SAC_1 is used to establish a bandwidth allocation policy in a source environment through end-to-end model training, and SAC_2 is used to adapt a policy learned by SAC_1 to a given target environment through sample-efficient partial model training (fine-tuning). Accordingly, the dotted-line modules in blue in Figure 2 are all trained by SAC_1, and the dotted-line module in orange is trained by SAC_2. In our implementation of SAC_1, the actor includes the model parameters ψ_Act and φ_1, and the critic includes the model parameters ψ_Crit and θ_1. SAC_1 drives all those parameters to be updated in an end-to-end training fashion. Note that θ_1 consists of two soft Q-functions, which we represent as θ_1,n for n ∈ {1, 2}. The training rollout of SAC_1 is described in Algorithm 5, which corresponds to the detailed implementation of Algorithm 2 for (Phase-1) model training.
Unlike SAC_1, SAC_2 is tailored for fine-tuning and adaptation in a target environment. Accordingly, in SAC_2, while the actor includes the model parameters ψ_Act and φ_2 and the critic includes the model parameters ψ_Crit and θ_2, SAC_2 drives only the parameters φ_2 and θ_2 to be updated during its training; the other model parameters are fixed. Same as the two soft Q-functions of θ_1, we have θ_2,n for n ∈ {1, 2}. The training rollout of SAC_2 is described in Algorithm 6, which corresponds to the detailed implementation of Algorithm 3 for (Phase-2) fine-tuning.
In the following, we describe several objective functions used to train SAC_m, where we use the index notation m, n ∈ {1, 2} to represent each of the two SAC structures (SAC_m) and the two Q-functions (Q_θm,n(·)) of each SAC structure explained above, respectively. We also represent a φ_m-parameterized policy as π_φm, and replace ã^t with ∆^t for SAC_2.
Critic: Following the standard SAC formulation [23], we optimize each soft Q-function Q_θm,n(·) by minimizing the soft Bellman residual,

J_Q(θ_m,n) = E_{(S^t, ã^t)∼D} [ ½ ( Q_θm,n(S^t, ã^t) − ( r(S^t, ã^t) + γ E_{S^{t+1}∼p} [ V_θ̄m,n(S^{t+1}) ] ) )² ].

Here, D is the replay pool, γ is the discount factor, ρ_πφm is the marginal of the trajectory distribution induced by policy π_φm, and α is the adjustable temperature parameter that controls the stochasticity of the optimal policy [23]. θ̄_m,n denotes the parameters of a target soft Q-function, obtained as an exponentially moving average of the soft Q-function weights. Actor: We optimize policy π_φm and the flow embedding module EMB_ψAct(·) in Eq. (10) with the parameters ψ_Act by minimizing the objective

J_π(φ_m) = E_{S^t∼D, ε^t∼N} [ α log π_φm( f_φm(ε^t; EMB_ψAct(S^t)) | S^t ) − Q^Min_θm( S^t, f_φm(ε^t; EMB_ψAct(S^t)) ) ],

where f_φm(ε^t; EMB_ψAct(S^t)) = ã^t is the neural network transformation that re-parameterizes the policy, ε^t is a noise vector sampled from a Gaussian, and Q^Min_θm(·) is the minimum of the soft Q-functions used for the policy gradient.
Temperature parameter: To improve the performance and stability of the SAC algorithm, we use the following objective to calculate gradients for the temperature parameter α:

J(α) = E_{ã^t∼π_φm} [ −α log π_φm(ã^t | S^t) − α H̄ ],

where H̄ denotes the desired minimum entropy [23].
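A minimal numpy sketch of this temperature objective follows the standard SAC formulation [23]. The batch of log-probabilities, the target entropy value, and the step size are mock assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
log_probs = rng.normal(-3.0, 0.5, size=256)  # mock log pi(a_t | s_t) batch
H_bar = -2.0                                 # desired minimum entropy (assumed)

def temperature_loss(alpha, log_probs, H_bar):
    # J(alpha) = E[ -alpha * log pi(a_t|s_t) - alpha * H_bar ]
    return float(np.mean(-alpha * (log_probs + H_bar)))

def alpha_gradient(log_probs, H_bar):
    # dJ/dalpha = E[ -(log pi(a_t|s_t) + H_bar) ], used for a gradient step.
    return float(np.mean(-(log_probs + H_bar)))

alpha = 0.2
alpha -= 0.01 * alpha_gradient(log_probs, H_bar)  # one gradient step
```

When the policy's entropy falls below H̄ (log-probabilities become large), the gradient pushes α upward, which in turn strengthens the entropy bonus in the actor objective.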

IV. EVALUATION
In this section, we describe the implementation of Repot, and evaluate its performance compared to other algorithms including a learning model based on a state-of-the-art AoI algorithm [3] under various network conditions. We also provide our case study with a microscopic multi-modal urban mobility simulator.

A. LEARNING ENVIRONMENTS
In evaluation, we consider a networked monitoring system where status updates of sensing devices are aggregated for continual query processing. We built a simulation environment for such networked monitoring based on NS-3 [25] v3.29, a widely used packet-level discrete-event network simulator, and we use the ZeroMQ messaging library [31] for asynchronous communication between the NS-3 based simulation environment and the modules in Repot.
In networked monitoring, we focus on data-driven real-time queries whose quality depends on the timeliness of status updates. For example, a real-time traffic map is constructed using low-latency image streams generated from geographically distributed camera devices, where traffic congestion can be estimated for autonomous vehicle navigation [32]. This networked monitoring capability is required in a variety of data-driven applications, e.g., urban air quality inference [33] and target tracking [34].
To evaluate the quality of query processing through QUAL(·) in Eq. (2), we use the structural similarity index measure (SSIM) [35], which estimates the similarity between the aggregated status information I^τ (at the server) and the ground-truth information I^τ_Truth, i.e.,

QUAL(I^τ) = SSIM(I^τ, I^τ_Truth) = l(x, y) · c(x, y) · s(x, y),   (21)

with l(x, y) = (2μ_x μ_y + ε_1)/(μ_x² + μ_y² + ε_1), c(x, y) = (2σ_x σ_y + ε_2)/(σ_x² + σ_y² + ε_2), and s(x, y) = (σ_xy + ε_3)/(σ_x σ_y + ε_3). Note that μ and σ are the mean and standard deviation of x = I^τ and y = I^τ_Truth, and ε_j are small positive constants for j ∈ {1, 2, 3}. QUAL(I^t) is evaluated several times with a uniform-random interval of [6.7, 8.3] ms within the time-step interval of 33.3 ms.
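A sketch of the standard three-component SSIM computation is shown below. The constant values are placeholders, not the paper's settings, and the global (whole-image) statistics are a simplification of windowed SSIM implementations.

```python
import numpy as np

def ssim(x, y, e1=1e-4, e2=9e-4, e3=4.5e-4):
    """Global three-component SSIM; e1..e3 are assumed small constants."""
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + e1) / (mx**2 + my**2 + e1)   # luminance term
    c = (2 * sx * sy + e2) / (sx**2 + sy**2 + e2)   # contrast term
    s = (sxy + e3) / (sx * sy + e3)                 # structure term
    return l * c * s

img = np.linspace(0, 1, 64).reshape(8, 8)
identical = ssim(img, img)   # identical inputs yield the maximum score
```

Here, the aggregated information at the server would play the role of x and the ground truth that of y; identical inputs score 1, and structural disagreement lowers the score.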
In our simulation tests, we construct two different wireless network environments. Table 3 illustrates the environment settings where the source B denotes a learning environment and the target T denotes an adaptation and testing environment. With these source and target environments, our experiment aims at verifying the policy transferability in Repot across different network conditions.
The source B is set to have easy-to-learn network conditions characterized as ideal time-slotted communications [36] with no channel access collisions, no propagation loss, and no other transmission errors; yet, given a resource constraint on link capacity, channel access delay and propagation delay are modeled. The target T is set to have complex network conditions similar to real-world deployment conditions, and it is implemented using several modules in NS-3, e.g., the WiFi, network, internet, and mobility modules, including NistErrorRateModel, ConstantSpeedDelayModel, and LogDistanceLossModel. Table 4 illustrates the time in days required to generate training samples, i.e., RL states over 1.6M time-steps, which are used to achieve a converged model in our source environment. In the NS-3 based target environment, model training would take at least 36 days even for a system of 49 sensing devices, but in the source environment, it takes less than 5 hours. This difference in training times indicates the benefit of policy transferability across different environments: it is desirable to learn a policy in an easy-to-learn source environment within a reasonable training time and to rapidly adapt that policy to target environments.

B. COMPARISON METHODS
For model training, we use the SAC modules in Stable-Baselines [37] and PyTorch [38] v1.5.1. In addition to our Repot, we test several heuristic- and learning-based methods, including ARN-RL, which exploits a state-of-the-art AoI-centric algorithm [3]. Each method below determines the bandwidth limit (or the transmission rate) a^t_i of device D_i at time-step t.
• Uniform. All devices have an equally distributed bandwidth limit a^t. This method is used to set the reference performance for comparison.
• Random. Each device is assigned a randomly distributed a^t at every time-step.
• Top-Opt. K in Top-K is set to be optimal for a given environment. In Top-K, the top K% ranked devices share a common bandwidth margin, e.g., Top-20 allows the top 20% ranked devices to share the overall bandwidth limit A. The ranking score of device D_i is calculated based on the total number of objects captured by the devices close to D_i within range r, i.e.,

score_i = Σ_{j : ||D_i − D_j|| ≤ r} OB(I^t_j),

where OB(·) yields the number of objects observed by D_j in I^t_j, and ||D_i − D_j|| denotes the distance between D_i and D_j.
• Naïve-RL. An RL model is trained by the actor-critic policy gradient method to set individual a^t_i. The actor and critic are each implemented with a five-layer MLP.
• ARN-RL. A model with the attention-integrated relevance network (ARN) [3] is implemented to make use of a state-of-the-art AoI-centric scheme in networked monitoring. ARN is intended to extract important features from the observed states and the previously executed action. In our implementation, each module of three feed-forward layers for the actor's policy function and the critic's Q-functions takes those features as input to determine the next action.
The hyperparameter settings for the aforementioned RL-based methods (Naïve-RL and ARN-RL), as well as our Repot model, are summarized in Table 5.
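The Top-K ranking and allocation described above might be sketched as follows. The device locations, object counts, and equal-share rule are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
N_D, r, A, K = 10, 3.0, 10.0, 20              # K in percent, as in Top-20
pos = rng.uniform(0, 10, size=(N_D, 2))       # device locations (assumed)
objects = rng.integers(0, 5, size=N_D)        # OB(.) per device (assumed)

def top_k_allocation(pos, objects, r, A, K):
    # Ranking score: objects observed by devices within range r of D_i.
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    scores = np.where(dist <= r, objects[None, :], 0).sum(axis=1)
    # Top K% of devices share the overall bandwidth limit A equally.
    n_top = max(1, int(np.ceil(N_D * K / 100)))
    top = np.argsort(-scores)[:n_top]
    alloc = np.zeros(N_D)
    alloc[top] = A / n_top
    return alloc

alloc = top_k_allocation(pos, objects, r, A, K)
```

Top-Opt then corresponds to sweeping K and keeping the value that maximizes QoE in the given environment.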

C. QOE PERFORMANCE
Using the SSIM-based quality in Eq. (21), we evaluate Repot compared with the other methods. We measure the ratio of the achieved SSIM to an ideal reference. That is, for a method M, the relative QoE is calculated as

Relative QoE_M = QoE_M / QoE_Uniform,

where QoE_Uniform denotes the ideal reference QoE, QoE_M denotes the QoE achieved by method M, and QoE is estimated by the average quality QUAL in Eq. (2). QoE_Uniform is calculated based on the QoE achieved by the Uniform method in a reference network environment that is intentionally built to measure the ideal reference performance; unlike the source B and target T, this reference network is set to have neither channel access delay nor propagation delay. Figure 3 represents the performance of Repot and the other methods, trained and tested in B, with respect to various network scales. As shown, Repot outperforms the other methods in all cases, maintaining a high relative QoE of 91∼94% while the other methods stay below 90%. Repot and the heuristic-based methods (Random and Top-Opt) maintain stable relative QoE regardless of scale, while Naïve-RL and ARN-RL are affected at large scales. For example, ARN-RL shows better performance than all methods except Repot at sizes of 49, 64, and 100, but degrades at sizes of 256 and 400. Naïve-RL shows a similar pattern, with 7.1% degradation from 49 to 400. This policy deterioration has been discussed in the RL context as the curse of dimensionality [39] and the large action space problem [29]. In contrast, Repot maintains high relative QoE at sizes of 256 and 400. In Repot, a policy is learned in a low-dimensional latent space, and a latent action is then mapped to a target-dependent large action space through the action shaping scheme.
This latent action structure renders RL models scalable to a large action space which is common in networked, sensor-based monitoring systems, along with the flow embedding that abstracts complex network states in a common vector space of a fixed size.
In Figure 4, we illustrate the bandwidth allocation patterns of the different methods. The x- and y-coordinates indicate the locations of sensor devices on a geographical space, where each device in a grid cell (a square in this figure) sends real-time information about its observations within the cell to a server. Interestingly, Naïve-RL tends to cover the entire area with only a slight concentration on object locations. On the contrary, both ARN-RL and Repot focus more on object locations, showing the effect of the attention mechanism. However, while both use attention, unlike ARN-RL, which extracts actions directly from a learned policy, Repot employs the latent action and action shaping, which lead to a fine-grained action representation that can be appropriately formed to observed state changes over time. For example, Repot shows more relevant patterns than the others with respect to the number of clusters in the object maps. Interestingly, the patterns of Top-Opt and Repot have in common that they concentrate on dense areas or clusters. However, Top-Opt considers only currently dense areas, whereas Repot tends to consider their trajectories.

D. ADAPTATION PERFORMANCE
As it is expensive to collect sufficient training samples for networked monitoring systems under various environment conditions, we discuss the policy transferability of Repot across different conditions. We explore policy transferability by evaluating how well methods adapt to a target environment with a small number of training samples, e.g., within 100K time-steps, 6.25% of the 1.6M in Table 4. For adaptation in a target environment T, we train the models of Naïve-RL, ARN-RL, and Repot in the source environment B with a dynamic link capacity in [½L_B, L_B], where L_B is the link capacity of B. Then, we fine-tune each learned model (policy) M in T through top-layer updates in conventional transfer learning [40], [41] for Naïve-RL and ARN-RL. For Repot, we employ fine-tuning with action shaping, in which only ADJUST_φ2(·) in Eq. (12) is updated. The target environments T1∼T4 are configured differently by VHT MCS settings that determine the wireless data rates; T1, T2, and the others correspond to VHT MCS 8, 7, and so on. We compare Repot models with different fine-tuning approaches in terms of adaptation performance. Figure 5 shows the adaptation performance of the methods. For each method, applying its policy optimized in B incurs significant degradation in T, e.g., 21% degradation for Naïve-RL (from M_Base(B) to M_Base(T)). After fine-tuning, some of the methods show stable recovery to some extent. Most importantly, Repot shows superior resilience upon the change from B to T, achieving a recovery of 22.7% and outperforming the others with 14.5∼20.8% higher relative QoE in T. Specifically, in terms of relative QoE, Repot achieves 14.5% higher performance than Naïve-RL and 20.8% higher than ARN-RL in the target environment T.
This performance achieved in T by Repot is not only better than that of the other methods but also comparable to Repot's own performance in B (i.e., 91.0% for M_Base(B) and 90.7% for M_FT(T); their difference is no more than 1%). This result demonstrates the policy transferability of the modular structure in Repot tailored for bandwidth allocation strategies in different network environments. Notice that ARN-RL in T shows a different pattern, with no performance recovery after fine-tuning. We speculate that conventional fine-tuning methods with layer-wise parameter updates are hardly effective for adapting a model optimized in one environment to another target, unless the policy learned in the source environment overfits less and fine-tuning in the target environment has sufficient samples to overwrite the model parameters. Figure 6 compares the adaptation efficiency of the RL-based methods, where the learning graphs over time-steps correspond to the fine-tuning in Figure 5. Notice that the learning efficiency of Repot FT is attributed to the action shaping, in which a learned policy is adjusted through control value updates that require fewer learning steps. We observe low performance for ARN-RL FT and Repot FT during the first period of learning steps due to our learning with domain randomization. Furthermore, the limited performance improvement of Naïve-RL FT and ARN-RL FT indicates the restriction of conventional fine-tuning with partial parameter updates, particularly when environment conditions vary significantly.
In Figure 7, we evaluate the adaptation of Repot across different network environments. We deliberately set the VHT MCS setting of 802.11ac to simulate different target environments T1∼T4 with various theoretical link capacities, where a smaller VHT MCS yields a lower capacity and QoE performance. In this experiment, we employ different fine-tuning approaches in Repot to evaluate the policy transferability of action shaping within the same model structure. In Repot_Base, we use the base model optimized in B without fine-tuning. In Repot_AS, we use our proposed action shaping, while Repot_Top and Repot_Full use other fine-tuning schemes. Specifically, Repot_Top updates the last-layer parameters of the ALLOC function, similar to conventional transfer learning, and Repot_Full updates all layer parameters, considering significant differences between source and target environments. Overall, Repot_AS achieves robust performance across all of T1∼T4, showing adaptation performance in T1∼T4 highly comparable to the respective performance in B. The other fine-tuned models, Repot_Top and Repot_Full, show limited adaptability, which is consistent with the learning curves in Figure 8. Furthermore, in Figure 9, we visualize the action patterns of Repot as shaped by the different fine-tuning approaches, where the action value represents how much bandwidth is allocated. The action values of Repot_AS differ from those of Repot_Base more than the other variants' do, and they often tend to spread. We speculate that Repot_AS is able to properly adapt its policy to target environments that, due to their realistic network settings, involve more uncertainty than our source environment.
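The four variants compared above differ only in which parameter group is unfrozen during fine-tuning. A hedged sketch of this selection is shown below; the layer names ("embed", "alloc.head", "adjust") are illustrative placeholders, not Repot's actual module names.

```python
def select_trainable(param_names, variant):
    """Return the parameter names to fine-tune for each Repot variant
    compared in Figure 7 (names here are hypothetical)."""
    if variant == "Base":        # base model used as-is, no fine-tuning
        return []
    if variant == "AS":          # action shaping: adjustment module only
        return [n for n in param_names if n.startswith("adjust")]
    if variant == "Top":         # last layer of the ALLOC function
        return [n for n in param_names if n.startswith("alloc.head")]
    if variant == "Full":        # every parameter in the model
        return list(param_names)
    raise ValueError(f"unknown variant: {variant}")

params = ["embed.w", "alloc.body.w", "alloc.head.w", "alloc.head.b", "adjust.w"]
assert select_trainable(params, "AS") == ["adjust.w"]
assert select_trainable(params, "Top") == ["alloc.head.w", "alloc.head.b"]
assert len(select_trainable(params, "Full")) == 5
```

The design choice here is that Repot_AS touches the smallest parameter group, which explains why it needs the fewest target-environment samples, while Repot_Full risks overwriting the source-trained policy entirely.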

E. CASE STUDY
In the following, we show our case study for a networked monitoring system with urban- and suburb-scale car traffic datasets generated by the SUMO simulator [24] on the maps of Midtown Manhattan in New York and a suburb of Orlando, Florida. We make the datasets publicly available on GitHub [42]. Table 6 describes the configuration used to generate the datasets, where the urban-scale datasets have about 10 times heavier traffic than the suburb-scale datasets.
In this case study, we set the underlying network conditions of the source and target environments the same as in the previous tests in Table 3. Figure 10 shows the performance of several methods in the case study, urban-scale in (a) and suburb-scale in (b). We fine-tune the Repot model trained in the source B for the target T using 6.25% of the training sample amount used in B. The Repot model yields the best performance in B (i.e., 96.1% in (a) and 99.6% in (b)). More importantly, the fine-tuned Repot model also yields the best performance in T (i.e., 94.9% in (a) and 93.9% in (b)). Whereas the Repot model before fine-tuning in T yields relatively low performance (i.e., 84.0% in (a) and 67.7% in (b)), its performance recovery after fine-tuning is significant (i.e., 10.9% in (a) and 26.1% in (b)). Repot also shows higher quality than Top-Opt (i.e., 6.5% in (a) and 12.6% in (b)). This demonstrates the capability of Repot to learn a near-optimal policy for a given environment through model training and fine-tuning.
There is a larger difference before and after fine-tuning in (b) than in (a). The datasets of the suburb-scale scenario are more cluster-centric as they have less traffic, while the datasets of the urban-scale scenario are less cluster-centric as they have much heavier traffic over the entire region. Such a dataset difference provides more margin for Repot to be optimized on the dataset of the suburb-scale scenario.

V. RELATED WORK
In network monitoring systems, information quality has been discussed in the context of AoI [4], and several heuristic solutions to AoI problem settings have been introduced; they considered specific network conditions such as unreliable broadcast channels [50], throughput constraints [51], channel interference constraints [52], and dynamic channel status [5].
Recently, RL has been leveraged to address optimization problems under time-varying network conditions in different problem settings such as network traffic control [53]- [55], wireless channel management [43], [56], and bandwidth allocation for video streaming [20], [57], [58]. Several research works have demonstrated the applicability of RL to AoI problems in networked monitoring systems [3], [44], [45]. Elgabli et al. [44] presented an RL-based resource scheduling algorithm to orchestrate sensors and optimize the expected AoI, satisfying the requirement of ultra-reliable low-latency communication (URLLC). Abd-Elmagid et al. [45] explored RL-based scheduling of information transmission in unmanned aerial vehicle (UAV)-assisted networks, showing the capability of RL to improve QoE when features are well-defined to learn the flight trajectory of UAVs and the energy consumption pattern of sensors. Similarly, Traub et al. [3] addressed the transmission scheduling problem by exploiting application-specific features and an attention mechanism.
Those prior works exploited RL and focused on QoE or AoI enhancement, but they rarely addressed the adaptability of RL model training and fine-tuning to different network conditions. Our work also employs RL to improve QoE in networked monitoring systems. However, unlike the prior works, our work concentrates on policy transferability in RL, which enables the adaptation of a learned policy to different target network environments.
In the field of robotics, model adaptation schemes have been investigated, aiming at bridging the data mismatch gap between simulation environments and real-world robot deployment environments. Tobin et al. [46] exploited domain randomization with manifold data for RL-based object detectors. Peng et al. [47] developed an RL-based robot arm controller operating in dynamic environments.
To mitigate the sample inefficiency and long learning time of domain randomization with manifold data, several approaches have recently been introduced, such as building realistic training data [59]- [61] and robot action embedding [48], [49]. In particular, Losey et al. [48], [49] explored the concept of action embedding [29], [30] to improve the control performance of remote assistive robots, which is similar to our action shaping scheme. Whereas the robot action embedding in [48], [49] requires target-specific training data during model training, in our work, a policy is established independently from targets, and an action by a learned policy can be shaped for any target later with a small amount of target-specific training samples.
To the best of our knowledge, our work is the first to discuss RL model adaptation in the context of networked monitoring applications and investigate a modular structure of RL models to provide fast adaptation in different network conditions. Table 7 provides a summary of the related studies.

VI. CONCLUSION
In this paper, we proposed Repot, a transferable RL model that enables QoE enhancement in networked monitoring systems and efficient adaptation to various target network conditions. To this end, we employ flow embedding and action shaping, by which a bandwidth allocation policy for QoE-driven networked monitoring is trained in an easy-to-learn source environment, and an action by the learned policy can be shaped to target conditions. Through simulation and experiments, we demonstrate that Repot achieves competitive QoE performance, outperforming other methods in many cases in both source and target environments. For example, Repot achieves 14.5∼20.8% QoE gains in networked monitoring over the compared methods in a target environment by fine-tuning with only 6.25% of the samples originally required for model training from scratch.
For future work, we plan to adopt meta-RL and multi-task learning for adaptation to different network conditions as well as to a variety of application-specific, network-related tasks such as traffic engineering, caching, routing, and intrusion detection. We are also interested in the real-world deployment and testing of transferable RL for AI-based surveillance applications that must provide low-latency and high-accuracy model inference despite harsh network environments.