Learning-based decentralized offloading decision making in an adversarial environment

Vehicular fog computing (VFC) pushes cloud computing capability to distributed fog nodes at the edge of the Internet, enabling compute-intensive and latency-sensitive computing services for vehicles through task offloading. However, a heterogeneous mobility environment introduces uncertainties in resource supply and demand, which are inevitable bottlenecks for optimal offloading decisions. These uncertainties also bring extra challenges to task offloading under oblivious adversary attacks and data privacy risks. In this article, we develop a new adversarial online learning algorithm with bandit feedback, based on adversarial multi-armed bandit theory, to enable scalable and low-complexity offloading decision making. Specifically, we focus on optimizing fog node selection with the aim of minimizing the offloading service costs in terms of delay and energy. The key is to implicitly tune the exploration bonus in the selection process and the assessment rules of the designed algorithm, taking into account volatile resource supply and demand. We theoretically prove that the input-size dependent selection rule allows a suitable fog node to be chosen without exploring sub-optimal actions, and that an appropriate score-patching rule enables quick adaptation to evolving circumstances; together they reduce variance and bias simultaneously, thereby achieving a better exploitation-exploration balance. Simulation results verify the effectiveness and robustness of the proposed algorithm.


I. INTRODUCTION
Increasing demand for high-complexity but low-latency computation, triggered by emerging applications such as autonomous driving, motivates the use of rising technologies, mobile edge/fog computing, that bring cloud-like computing services closer to end-users [1]-[3]. To supplement the limited edge computing resources, vehicular fog computing (VFC) [4], [5] has emerged as a new computing paradigm in which moving fog nodes with surplus resources and good connectivity, named vehicular fog nodes (VFNs), are utilized as viable components that execute computation tasks offloaded from service clients. As such, leveraging distributed fog nodes for task offloading benefits from direct communication between a client and a VFN, e.g., 5G V2V (reduced transmission delay), and from similar trajectories when a client travels along with VFNs (relatively long contact duration and fewer handoffs), resulting in a substantial improvement in quality of experience compared with using fixed infrastructure. VFNs are heterogeneous in terms of location, availability and reputation, and thus the computing service has diverse preferences towards them [6], i.e., one may prefer a vehicle with high processing capability and efficiency. One issue is how to make task offloading decisions, especially fog node selection, considering these distinct characteristics and preferences.

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 825496.
Computation task offloading decision algorithms have been investigated in [7], [8], where a centralized coordinator schedules the computation offloading tasks. A decision-making problem is formulated in [7] as a stochastic control process, e.g., semi-Markov, to minimize the offloading service cost in terms of delay and energy. The trade-off between the delay and energy cost is investigated in [8] based on matching theory. However, such centralized decision-making might be challenging to run due to i) the signaling overhead burden caused by gathering and processing a massive amount of information, e.g., requested tasks of service users, available resources of VFNs, and the mobility of both, and ii) privacy concerns raised by exchanging such private information with a central controller.
Decentralized decision-making is considered as an alternative to address the issues above. Each client can make decisions independently and perform task offloading in a distributed manner. A client may lack the state information of neighboring VFNs within its communication range, and thus it is not known a priori which VFN would provide the best performance, i.e., the lowest offloading cost. Exchanging state between the client and a potential VFN may be informative and helpful for making an appropriate decision. However, such decentralized decision-making is still challenging to conduct in a mobile environment where i) frequent state updates are needed to adapt to system fluctuations, and ii) the resulting heavy signaling load could cause transmission failures and thus outdated state. One approach to deal with these issues is, rather than obtaining the state information of VFNs from signaling messages, to enable a client to directly learn the state information of VFNs and to map the decision history to the current offloading decision.
Given the availability of a huge amount of data, historical data can be used to improve the quality of resource management policies, since they contain statistics of an environment that varies in a non-stationary and unknown manner, and learning from them can mitigate the uncertainty of future management tasks. The further capability to reinforce the current policies allows envisioning a learn-to-optimize framework where decisions are made in an environment to optimize a given notion of cumulative loss with the fewest possible assumptions. In a nutshell, the adaptive decision-making procedure becomes two-fold: i) exploration, learning as much as possible about different candidate actions to obtain good estimates of their loss, and ii) exploitation, optimizing the desired objective by selecting the optimal actions given the learned information.
One fundamental issue is balancing the exploration-exploitation trade-off in the learning process, i.e., making decisions with the aim of reducing uncertainty over states versus maximizing cumulative reward given current estimates. Such an exploration versus exploitation dilemma can be formulated as a multi-armed bandit (MAB) problem where each neighboring VFN is treated as an independent arm, and its associated offloading service cost, e.g., latency and energy at the edge node, is dominated by its computing capability. A task requester performs an online learning process while running the offloading service and updating the optimal decision on VFN selection. However, the variations in the requested workload (dynamic resource demand) and the candidate VFN set (dynamic resource supply) make it non-trivial for the task requester to learn the latency and energy consumption of the candidate VFNs, especially in a rapidly changing or adversarial environment in which an attacker may impose stress.
In our task offloading problem, two dynamic factors are considered in terms of resource supply and demand: i) a time-varying, volatile candidate fog node set resulting from the inherent mobility of fog nodes [9], and ii) a time-varying task size resulting from different types of applications, or even different parts of the same application workload [10]. Both factors cause unnecessary but inevitable costs, i.e., sub-optimal actions may be explored, thereby unbalancing exploration and exploitation activities. In particular, newly and re-appearing arms may fail to quickly adapt to the evolving circumstances. Also, a large task offloaded to a fog node with weak service capability may provoke worse performance. In the literature, such dynamic attributes have not been taken into account due to challenges associated with i) randomized selection rules and ii) unbiased estimation assessment rules in an adversarial environment, which this work sheds light on.
To the best of our knowledge, this is the first work aiming at bridging such dynamics to an adversarial domain, with the following contributions:
• This work proposes a modified implicit exploration-based algorithm for adaptive learning-based task offloading (MIX-AALTO), which enables scalable and low-complexity decision making on VFN selection toward minimizing the task offloading service cost in terms of latency and energy. Such a model-free algorithm captures the unknown offloading cost variations under an oblivious adversary, e.g., via a weighted-average randomized selection rule, biased cost estimation, e.g., implicit exploration, and data privacy considerations, e.g., full-bandit feedback.
• The proposed algorithm facilitates offloading decisions in a manner adaptive to the volatile and time-varying resource demand and supply through appropriate adjustment of the cumulative learning score in the selection rule. As such, the modified learning score, which considers the incoming task demand and the evolving circumstances, allows a suitable fog node to be selected rather than merely a capable one, i.e., choosing a VFN better suited for the next task with time-varying size, rather than the one providing the lowest service cost for the current task.
• The proposed algorithm alleviates the uncertainty of the empirical cost estimates in the assessment rule. The robust learning based on implicit exploration, which controls the variance at the price of introducing some bias, can guarantee near-optimal performance rather than exploring sub-optimal actions due to the large variability attributable to an unbiased estimation process.
• A theoretical analysis of the efficiency of the proposed algorithm is provided in terms of learning regret. It is proved that the modified implicit exploration approach reduces variance and bias simultaneously, thereby achieving a better exploitation-exploration balance in an adversarial environment. Simulation results in synthetic and real-world scenarios verify its effectiveness and robustness.

The rest of this paper is organized as follows. In Section II, related work is reviewed. In Section III, the system model and problem formulation are presented. In Section IV, the task offloading algorithm is proposed and the learning regret is analyzed. Simulation results are provided in Section V, and Section VI concludes the paper.
II. RELATED WORK

This section presents related work in the area of VFC-enabled task offloading, in terms of the potential scenarios of VFC and the task offloading algorithms.

A. Task offloading scenarios
A variety of use cases have been identified as potential scenarios for VFC, e.g., efficient dissemination of real-time vehicle traffic and emergency information in cooperative driving of autonomous vehicles for road safety, and intelligent firefighting for rescue safety [11]. In particular, emerging assisted driving applications, such as real-time situational awareness [12], lane changing and see-through for passing [13], and localization/mapping applications, such as HD map generation and road construction detection [14], involve time-critical and computationally intensive tasks, i.e., on-road object recognition and scene understanding from images/videos, and impose requirements on the validity period of a task and on reducing its consumed power. Safety-related services require low-latency responses, such as 10 ms for cooperative collision avoidance, 25 ms for vehicle platooning, and 500 ms for collective environment perception [15]. The work in [14] investigated the feasibility and challenges of applying VFC for real-time analytics of high-resolution video streams, and demonstrated the efficiency of VFC-based task offloading in terms of latency, packet loss ratio and throughput. The work in [16] reduced the offloading latency while considering power usage efficiency in fog computing by balancing the workload of fog nodes. Powerful computers are required for processing computationally complex tasks, but such computers also consume energy at a high rate, which affects vehicles' driving endurance if the computers are powered by the vehicles. The work in [17] indicated that the driving distance is reduced by 6% due to the consumption of a computing engine equipped with one Intel Xeon E5-2630 CPU and three NVIDIA TitanX GPUs.

B. Task offloading algorithms
Some efforts have been made to address decision-making strategies for VFC-based task offloading. Specifically, the works [9], [18]-[24] designed decentralized task offloading strategies where offloading decisions are made by the task generators independently. The works in [18], [19] proposed task assignment algorithms for VFC-enabled systems, without centralized control, according to collected information from adjacent vehicles. The task caching and Ant Colony Optimization (ACO) based algorithm proposed in [18] achieves better time complexity than the brute-force approach, but may suffer from high complexity for a large number of vehicles and fail to adapt to a volatile environment. In [19], the task is processed in an online manner, but the proposed algorithm may suffer from high signaling overhead, i.e., it relies heavily on frequent state information exchange, and thus may fail to process subsequent tasks properly when the information provided by vehicles is outdated. To overcome such scalability issues, learning-based task offloading schemes have been considered in [9], [20]-[24].
The work in [20] proposed a learning-based task replication algorithm based on combinatorial MAB, where task replicas can be offloaded to multiple vehicles to be processed simultaneously. Some enhancements to this approach were achieved by adjusting the exploration weight according to the computation workload [21] and the appearance time [22] of fog nodes. The work in [23] proposed a fluctuation-aware learning-based computation offloading algorithm based on MAB, where base stations are regarded as agents that learn the states of moving servers. The work in [24] proposed an efficient online task offloading strategy to minimize the long-term cost of non-stationary fog-enabled networks. The work in [9] considers a mortal bandit formulation to address the time-varying set of VFNs for a given task generator, where the computation capacities of the edge nodes are used as contextual information to reduce the exploration space. However, all previous works assume that the task offloading performance experienced by an offloading service client lies in a stochastic domain, in which some private information could be inferred by an attacker, and would be severely compromised by the non-stochastic task offloading strategies of other devices; i.e., the task offloading problem is adversarial, and conventional upper confidence bound-based task offloading algorithms cannot be directly applied in an arbitrary dynamic environment.
To solve the non-stochastic task offloading problem, an adversarial MAB approach can be considered, where each strategy is assigned an arbitrary and unknown sequence of rewards, one for each time step, chosen from a bounded real interval. In particular, the Exponential-weight algorithm for Exploration and Exploitation (Exp3) is a well-known learning algorithm for the adversarial setting, and has been studied in resource provider selection problems [25]-[27]. Exp3-based online schemes have been proposed with the objective of optimizing QoS metrics such as throughput [25], energy consumption [26] and latency [27]. However, the previous works fail to address mobility-induced volatile resource availability and resource demand in an adversarial environment at the same time.

III. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, the system model and problem formulation for the offloading service are presented.

A. System model
An offloading service client generates tasks, while a set of offloading service providers k ∈ K = {1, ..., K} supports the requested tasks with their own available computational resources. Any vehicle on the ground can become a task offloading service client or a service provider. A service client can offload a task t to any VFN k within its communication range, where k ∈ K_t ⊆ K and K_t is the candidate VFN set, which varies due to the inherent mobility of VFNs. VFNs available to a service client are discovered and selected by the client based on their topological states, including moving direction and speed [14]. For example, VFNs periodically broadcast single-hop messages including such state information following vehicular communication protocols, such as dedicated short-range communication (DSRC) or cellular vehicle-to-everything (C-V2X). Each client forms a candidate VFN set by selecting, from the accessible VFNs, those which follow the same driving direction as the client. It is assumed that the client interacts with accessible VFNs continuously and updates the candidate VFN set K_t ≠ ∅ in real time.

1) Demand model: In general, computing tasks can be divided into subtasks at different levels of granularity [28], down to atomic tasks. Multiple divisible subtasks can be executed in a parallel, serial or mixed manner. Some of these subtasks must be performed locally, and some can be either performed locally or offloaded to external computing resources. In this work, each atomic task is considered the basic unit for offloading, i.e., it is offloaded to and processed by a fog node within one time period, and the operational timeline is discretized based on the atomic task unit, [t, t + 1) [9]. An atomic task t is characterized by two parameters: the input size q_t (bits/task) and its required computation resource, defined as the number of CPU cycles c_t (cycles/task).
The resource demand can be estimated from measurements by applying the methods described in [29], and expressed as the product of two parameters, the input size q_t and the computational complexity w_t, which represents the number of CPU cycles required to process one bit of input data, i.e., c_t = q_t w_t. The value of w_t varies with the nature of the performed application.
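The demand model above can be sketched in a few lines. This is an illustrative fragment, not the paper's implementation; the numeric values are examples only (the 2339 cycles/bit figure for face recognition is cited later in the cost model discussion).

```python
# Demand model sketch: an atomic task t is characterized by its input size
# q_t (bits) and computational complexity w_t (cycles/bit), so the required
# CPU cycles are c_t = q_t * w_t.

def required_cycles(q_t: float, w_t: float) -> float:
    """Resource demand c_t (cycles/task) for an atomic task."""
    return q_t * w_t

# Example: a 1 Mbit input for a face-recognition-like workload
# (roughly 2339 cycles/bit).
c_t = required_cycles(q_t=1e6, w_t=2339.0)
print(c_t)  # 2339000000.0 cycles
```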

2) Resource model:
The computational capability of a fog node k ∈ K_t is described by its maximum CPU frequency F_k (cycles/second). One atomic task is offloaded as a whole to a single fog node, which may execute tasks in parallel depending on its own resource allocation rules. To deal with multiple computation tasks simultaneously, a fog node dynamically adjusts its CPU frequency with the dynamic frequency and voltage scaling (DFVS) technique. This work considers that the computing capability allocated by fog node k, denoted by f_k^t (cycles/second), is determined by its computing resource allocation policy, remains static for each task t, and in general is a non-increasing function of the total number of clients that offload to fog node k. It is assumed that each fog node employs equal fair resource scheduling over different tasks. The wireless medium of a fog node k is shared by the clients that choose to offload to it. The achievable uplink and downlink transmission rates between a client and fog node k are determined by the physical characteristics of the wireless medium, such as distance, fading gain, bandwidth, and interference.
3) Cost model: Performing task offloading incurs transmission and computation costs. Two kinds of cost are considered: the offloading service latency and the related energy consumption. Specifically, the latency for offloading includes the time for uploading the input to a fog node k, the execution time at the fog node, and the time for downloading the result to the service client. It is assumed that the feedback size is small enough that the downlink transmission latency can be safely ignored. Thus, the latency of completing task t at fog node k is expressed as

D_k^t = q_t / r_k^t + c_t / f_k^t,

where r_k^t = B log2(1 + P g_k / (N + I_k)) is the link rate for transmitting the input data of task t from the client to fog node k, B is the channel bandwidth, P denotes the transmission power of the client, g_k is the uplink channel gain between the client and fog node k, N is the noise power, and I_k denotes the interference measured at the fog node. Given orthogonal channel allocation [31], co-channel interference can be avoided. Furthermore, cross-channel interference can be ignored according to the experimental results in [32]. The channel gains are static during the uploading of each computation task and the downloading of the computation result. The energy consumption of completing the task is expressed as

E_k^t = P q_t / r_k^t + P_k^t c_t / f_k^t,

where P_k^t = ρ(f_k^t)^3 is the computing power, with ρ the effective switched capacitance related to the chip architecture [33]. To take into account both types of cost for offloading task t to a VFN k, the cost function is defined as the weighted sum of the latency and energy consumption, ξ D_k^t + (1 − ξ) E_k^t, where ξ denotes the weighting parameter of latency [34]. Note that w_t can be approximated by a Gamma distribution [30]; e.g., face recognition requires 2339 cycles/bit, while the video transcoding requirement varies from 200 to 1200 cycles/bit. The term cost is used interchangeably with loss in this work.
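The cost model can be sketched as a single function. This is a hedged illustration following the symbols in the text (r, D, E, ρ, ξ); the parameter values in any call are hypothetical, and the downlink latency is ignored as the text assumes.

```python
import math

# Cost model sketch: latency = upload time + execution time;
# energy = transmit energy + computing energy, with computing power
# P_comp = rho * f_k**3; weighted cost = xi * D + (1 - xi) * E.

def offloading_cost(q_t, c_t, B, P, g_k, N, I_k, f_k, rho, xi):
    r_k = B * math.log2(1 + P * g_k / (N + I_k))   # uplink rate (bits/s)
    D = q_t / r_k + c_t / f_k                      # latency (s)
    P_comp = rho * f_k ** 3                        # computing power (W)
    E = P * (q_t / r_k) + P_comp * (c_t / f_k)     # energy (J)
    return xi * D + (1 - xi) * E                   # weighted cost for task t

# Hypothetical parameter values, for illustration only.
cost = offloading_cost(q_t=1e6, c_t=2e9, B=1e7, P=0.1, g_k=1e-3,
                       N=1e-9, I_k=0.0, f_k=2e9, rho=1e-28, xi=0.5)
```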

B. Problem formulation
We define the unit cost of task offloading as the cost of processing one bit of input data for task t at VFN k,

l(t, k) = ξ D_{k,o}^t + (1 − ξ) E_{k,o}^t,

where D_{k,o}^t = D_k^t / q_t and E_{k,o}^t = E_k^t / q_t denote the per-bit latency cost and per-bit energy cost, reflecting the service capability of each candidate VFN k. We aim to minimize the average unit cost of task offloading by optimizing the fog node selection k_t, done by each client, for each task (up to a finite number T of tasks) in each round. If all state information related to the per-bit cost were known exactly by the task requester before offloading each task, the optimization problem could be expressed as

k_t = arg min_{k ∈ K_t} l(t, k),

where k_t ∈ K_t is the optimization variable representing the index of the fog node selected for task t.
In fact, the state information of fog nodes in heterogeneous and dynamic networks is hard to predict, and exchanging this information among clients and fog nodes causes high signaling overhead. Thus, the clients may lack the state information of fog nodes and cannot make accurate predictions about which fog node would provide the optimal offloading service for each task. To overcome this, one may utilize a learning-and-adapting-based offloading scheme in which a client observes and learns the costs of each task offloaded to the candidate fog nodes and makes offloading decisions based on the historical cost observations, without exact knowledge of the current state information. To this end, we aim to design a learning-based algorithm minimizing the expected unit offloading cost, formulated as

min E[ (1/T) Σ_{t ∈ T} l(t, k_t) ],    (3)

where E[·] is the expectation operator, l(t, k_t) is the unit cost of the t-th task in the set of tasks T, and T = |T| ∈ N+ is the number of tasks.

IV. ONLINE LEARNING-BASED TASK OFFLOADING
In this section, a learning-based task offloading algorithm is developed based on MAB, which enables a client to learn the offloading costs of candidate fog nodes and optimizes the expected task offloading cost. Problem (3) requires online sequential decision making, whose nature enables the design of a lightweight algorithm but suffers from the uncertainty associated with the lack of knowledge about the properties and conditions of the phenomena underlying the system's behaviour.

A. Learning under uncertainty: an adversarial approach
Consider a general framework of online learning where a task client selects one fog node k from a finite set K_t based on an a priori unknown payoff function. Previously offloaded tasks allow an empirical mean to be used as an estimate of the expectation, but without enough observations this estimate may not be accurate. To get more information about a specific fog node, the client needs to offload more tasks to that node even if it is not the empirically best node to offload to. However, the empirically best node is preferred for the sake of instantaneous benefits in online decision-making. Therefore, there exists a trade-off between exploiting the empirically best node for instantaneous rewards and exploring other nodes for potential benefits. Note also that learning under uncertainty generally relies on feedback. Thus, the quality of the feedback in terms of completeness has significant implications for the assessment rule. Incomplete feedback stands in contrast to full-information feedback, where the utilities of all actions a client could have taken are observed at each stage. Incompleteness can be spatial or temporal, across the action space or across stages. When a client sends a task to a fog node, there is no way to know how other fog nodes would have performed on the same task. Moreover, local visibility of the loss makes decision-making challenging. A commonly studied model is so-called bandit feedback, where only the utility of the chosen action is revealed. The term has its roots in the classical online learning problem of playing a multi-armed slot machine, known as a bandit. A MAB problem is specified by a set of arms (actions, here the available VFNs) K_t and a sequence of costs l_k^t, t ∈ T. For each task, a client selects an arm and observes the cost of the selected arm only, not those of the other arms.
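The bandit-feedback interaction described above can be sketched as a minimal loop. This is an illustrative stand-in, not the proposed algorithm: the policy here is a uniform placeholder and the cost sequence is synthetic, simply to show that only the chosen arm's cost is ever revealed per round.

```python
import random

# Minimal bandit-feedback loop: each round, the client picks one VFN (arm)
# and observes only that arm's cost; costs of non-selected arms stay hidden.

random.seed(1)
K = 3                            # number of candidate VFNs (illustrative)
observed = []                    # history of (round, arm, cost) tuples

for t in range(5):
    arm = random.randrange(K)    # placeholder policy (uniform selection)
    cost = random.random()       # environment reveals only this arm's cost
    observed.append((t, arm, cost))

# Exactly one (arm, cost) pair per round is revealed.
print(len(observed))  # 5
```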
The objective of a client is to minimize the long-term cost in equation (3) while managing the exploration-exploitation trade-off in the bandit setting. Each arm pulled by the client generates a cost in an adversarial fashion: an adversary may change the future costs of arms, and the cost distribution of each arm may change over time, which is not inherently probabilistic and does not involve stochastic averaging, in contrast to the stochastic MAB case. In this sense, the non-stochastic formulation of MAB is more appropriate for evaluating the most promising strategy in an arbitrarily changing environment, where there may even exist an oblivious adversary, e.g., a jamming attack. Existing stochastic MAB approaches may characterize the exploration bonus in a deterministic selection rule with a padding function that accounts for additional informative data, such as the history of plays up to the current round, which can improve performance. However, incorporating such information into an adversarial setting is challenging due to the randomized arm selection rule, and payoffs generated in an adversarial fashion under an information-limited environment weaken the robustness and smoothness of the estimation process in the subsequent assessment rule.

(While many works assume noise with a well-conditioned stochastic component, i.e., independent, identically distributed Gaussian process noise, imperfect feedback, referring to inaccuracy of the observed utilities in revealing the quality of the selected actions, is assumed to be null in the adversarial regime due to its arbitrary nature. An adversarial noise sequence may be further considered, which is left for future work.)

B. Exploration in selection rule
In an adversarial MAB problem, a randomized policy is needed because a client using a deterministic policy, or a stochastic one such as Upper Confidence Bound (UCB)-style exploration [35], may be easily fooled by adversaries. Thus, instead of choosing an arm k′ ∈ K_t directly, the client selects a probability distribution Λ_t = [p_k^t]_{k ∈ K_t} ∈ [0, 1]^{|K_t|}, with Σ_{k ∈ K_t} p_k^t = 1, over the available arms for task t. The resulting probability vector Λ_t is called a mixed strategy, and the client draws an arm according to this distribution, k′ ∼ Λ_t. The selection probability of each arm is appropriately weighted by its accumulated loss: the idea is to give more weight to actions that performed well in the past. One may employ a weighted-average randomized strategy with potentials to achieve a cumulative cost (almost) as small as that of the best action [36, Section 6]. An arm k is assigned a selection probability p_k^t for task t that is proportional to a weight W_k^t determined by the cost accumulated by that arm in the past; W_k^t is maintained by the client and represents the confidence that arm k is a good choice.
In a bandit setting, rather than worrying about how to estimate the cost of an arm that was not pulled, one seeks to investigate how such information can be used once it becomes available. To that end, a score (penalty) based learning process is considered as follows. The service capability of a fog node is represented by the score parameter, the cumulative estimated per-bit cost up to round s − 1,

L̂_k^{s−1} = Σ_{t=1}^{s−1} η_t l̂_k^t,

where l̂_k^t is the estimated loss of arm k for task t and η_t ∈ (0, 1] is the learning rate. If all arms newly appear in round t = 1, their scores are initialized to zero, L̂_k^0 = 0, ∀k ∈ K_1, and the resulting probability distribution is initially uniform. In each round s, the task requester chooses an action k′ according to the probability distribution Λ_s, which is determined from the scores L̂_k^{s−1}, and updates the loss estimate l̂_k^s for the selected arm. Essentially, past experience is leveraged to gain intuition about the best value to use. Considering the exponential potential function with the score, the weighting parameter in round s can be expressed as

W_k^s = e^{−L̂_k^{s−1}}, ∀k ∈ K_s,   so that   p_k^s = e^{−L̂_k^{s−1}} / Σ_{m ∈ K_s} e^{−L̂_m^{s−1}}.

Note that this importance-weighted mechanism assigns exponentially higher probability to strategies with lower cumulative scores up to s − 1. These scores reflect the success of each strategy as measured by the estimated offloading cost l̂_k^{s−1}, so a client relies most on the strategy with the lowest score.
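The exponentially weighted selection rule above can be sketched as follows. This is an illustrative fragment assuming the per-arm cumulative scores are already available; the scores used in the example are made up.

```python
import math
import random

# Weighted-average randomized selection: each available arm k keeps a
# cumulative score L[k] (sum of learning-rate-weighted loss estimates),
# and the selection probability is proportional to exp(-L[k]).

def selection_probs(scores):
    weights = [math.exp(-s) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

probs0 = selection_probs([0.0, 0.0, 0.0])  # fresh arms: zero scores ...
print(probs0)                              # ... give a uniform distribution

scores = [2.0, 0.5, 1.0]                   # lower score -> higher probability
probs = selection_probs(scores)
arm = random.choices(range(3), weights=probs)[0]  # draw k' ~ Lambda
```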
An appropriate selection rule can balance exploitation and exploration, i.e., exploiting known resources with certainty versus exploring new possibilities, by differentially choosing among actions and favoring those with lower cumulative scores, perceived to be more attractive. While the exploration-exploitation trade-off conventionally depends on the service providers' states, i.e., the candidate fog nodes' capabilities, the balance can be improved by considering additional information on the service requester's needs and the providers' activities. However, in an adversarial setting it is non-trivial to improve performance with such additional information. The non-stochastic property nullifies the statistics of the historical data on the service providers' activities, i.e., the number of computational tasks that an arm has served and the time for which it has been connected to the requester since its initial connection [22], especially in a volatile, dynamic environment. Also, a padding function in a UCB-based selection rule allows an exploration bonus to be characterized from such informative data, whereas this is not straightforward in an adversarial setting due to its randomized policy.
This work aims at incorporating observations of the resource providers' volatility and the resource requester's task size into the selection rule in an adversarial setting, to achieve a better balance between exploration and exploitation. To do so, in the following, a dynamic resource supply and demand-based exploration bonus is augmented in the score L̂_k^{s−1} toward fair and suitable fog node selection.
1) Dynamic resource supply: If an arm k̃ newly appears in round τ, K_τ = K_{τ−1} ∪ {k̃}, then, as with the previous candidate set of fog nodes, i) all arms including the new one could be reset, L̂_k^{τ−1} = 0, ∀k ∈ K_τ, named full reset, or ii) only the new arm's score could be initialized to zero, L̂_k̃^{τ−1} = 0, named partial reset. However, such a resetting mechanism may invalidate the benefit of score-based learning in a rapidly changing environment. The bandit may take a long time to collect enough samples for those arms to correct their null scores. Also, the incomparable scores caused by partial nullification may prevent fair exploration of all available arms to identify the best one within the total number of tasks T. For instance, if the existing scores L̂_k^{τ−1}, ∀k ∈ K_{τ−1}, are high enough to make the corresponding selection probabilities very low, the newly appeared arm k̃ will dominate, p_k̃^τ ≫ p_k^τ, ∀k ∈ K_{τ−1}, and thus will repeatedly be selected in all eligible rounds τ′ > τ until its score grows large enough for its estimate to become comparable with the other arms, s > τ′. In other words, the old arms may sacrifice their opportunities, regardless of their accumulated experience, to learn the dynamic task offloading environment. Such an unfair selection rule, from the perspective of the old arms, can be amended by initializing the score of an appearing arm with an already existing one, from its own history or from the other arms, L̂_m^{τ−1} > 0, m ∈ K_{τ−1}, where K_{τ−1} is the set of old arms in round τ − 1.
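The score-patching idea for a newly appearing arm can be sketched as below. The text leaves the exact inheritance rule open (own history or another arm's score); using the minimum existing score is one plausible choice assumed here for illustration, and the VFN names are hypothetical.

```python
# Score patching for a newly appearing arm: instead of a null score (which
# would make the new arm dominate whenever existing scores are high), the
# new arm inherits a score already accumulated by the old arms.

def patch_new_arm(scores: dict, new_arm) -> dict:
    if scores:                                   # old arms exist ...
        scores[new_arm] = min(scores.values())   # ... inherit an existing score
    else:
        scores[new_arm] = 0.0                    # no old arms: start from zero
    return scores

scores = {"vfn1": 3.2, "vfn2": 4.1}
patch_new_arm(scores, "vfn3")
print(scores["vfn3"])  # 3.2, comparable with the old arms' scores
```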
2) Dynamic resource demand: Note that while the objective in equation (3) is to optimize the expected per-bit cost of offloading task t to a fog node k, what actually needs to be learned is the potential capability of each candidate fog node and its projected suitability for the upcoming task under an adversarial framework. The service suitability of a fog node can be assessed by the normalized total delay of offloading the next task, L̂_k^{t−1} q_t, which can be added to the service capability to build refined weights for better arm selection and thus improved quality of service, e.g., cost per task. Such joint consideration of both the normalized per-bit and per-task offloading delay, L̂_k^{t−1}(1 + q_t), calls for coordination in terms of an input-data-size-dependent exploration-exploitation trade-off. For feature scaling, the normalized size of the upcoming task q_t is used as a weight factor δ_t = 1 + (q_t − q_min)/(q_max − q_min), where q_max and q_min are the upper and lower thresholds of the input data size, respectively, on the offloading delay in the decision-making algorithm, i.e., W_k^t = e^{−L̂_k^{t−1} δ_t}. This approach turns out to be analogous to Boltzmann (or softmax) exploration [37], which creates a graded function of the estimated value, here with a maximum inverse temperature parameter equal to 2 [38]. Higher values, δ_t → ∞, would lead to a fully greedy strategy, while the lowest value, δ_t → 1, moves the selection strategy more towards an offloading service capability-based one.
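As a minimal sketch of the weighting above (the function name and arguments are illustrative, not from the paper), the input-size-dependent inverse temperature δ_t and the resulting selection probabilities can be computed as:

```python
import numpy as np

def selection_probs(scores, q, q_min=0.2, q_max=1.0):
    """Boltzmann-style selection over cumulative scores L_hat_k, with the
    input-size-dependent inverse temperature delta_t in [1, 2]."""
    delta = 1.0 + (q - q_min) / (q_max - q_min)   # feature-scaled task size
    s = np.asarray(scores, dtype=float)
    w = np.exp(-delta * (s - s.min()))            # W_k = exp(-delta * L_hat_k), shifted for stability
    return w / w.sum(), delta
```

A larger task size q raises δ_t toward 2 and concentrates more probability on the lowest-score arm, matching the greedier behavior described above.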

C. Exploration in assessment rule
According to the selection rule above, a client selects a suitable fog node for the upcoming task, offloads the task to the selected node, receives a real-valued payoff, i.e., the offloading service cost per bit, and then uses its own assessment rule to independently convert the realized payoff into a learning-weighted estimate, which is added to the previous score representing the fog node's estimated capability.
1) Iteration-varying learning: The learning rate is a parameter controlling how much weight the current estimated payoff receives in the upcoming cumulative score, which determines the importance of the estimated payoff at each round in terms of its contribution to the cumulative score. Conventionally, the learning factor η_t is predefined as an empirical constant or as a variable depending on the horizon T, which requires advance knowledge of the horizon and weakens the learning ability of the algorithm. Note that perfect knowledge of T is usually not available in practice. While one could use a standard doubling trick [39] to overcome this difficulty, we take a different path to circumvent the issue, and propose to tune the learning rate iteration-dependently, η = η_t, ∀t, together with the other parameters, solely based on observations. Thus, from a technical perspective, the task requester should take positive actions to explore the unfamiliar environment and learn the loss statistics l̂_k^t of all strategies in the initial stage. As the learning iterations go on, the client may want to exploit the observations obtained so far to identify the best strategy without engaging the others too often.
However, it is nontrivial to select a proper η_t: it should be large enough to avoid selecting a bad arm too many times, yet small enough to limit the transient effect. One way is to encourage the algorithm to explore less over the rounds by decreasing the learning factor with the round index; the more distant the past, the larger its learning factor. When the learning rate is small, p_k^t stays close to uniform and the algorithm explores more frequently. For a larger learning rate, p_k^t concentrates on the arm with the lowest estimated cost and the resulting algorithm exploits aggressively. Furthermore, if the exploration-exploitation level changes too fast, the window for identifying the inflection point from exploration to exploitation becomes too short. For this matter, one may further let the learning factor vary with the size of the candidate set; the larger the number of arms, the more slowly the learning factor decreases.
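The effect of the learning rate on the concentration of p_k^t can be illustrated with a small sketch (the helper is hypothetical and uses a constant η on a fixed loss history for simplicity):

```python
import numpy as np

def probs_from_scores(cum_losses, eta):
    """Exponential-weights probabilities when a loss history is scored with a
    constant step size eta: p_k is proportional to exp(-eta * cumulative loss)."""
    L = eta * np.asarray(cum_losses, dtype=float)
    w = np.exp(-(L - L.min()))   # shift for numerical stability
    return w / w.sum()
```

With the same loss history, a small η keeps the distribution nearly uniform (exploration), while a large η concentrates it on the lowest-loss arm (exploitation), which is why the schedule of η_t governs the exploration-exploitation balance.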
2) Robust learning: Learning algorithms are based on a model of reality, and their performance depends on the degree of agreement between the assumed model and reality. The robustness of an algorithm is its sensitivity to discrepancies between the assumed model l̂_k^t and the reality l_k^t, which is essentially determined by how the related assessment rule is set.
The loss of an arm k ≠ k′ cannot be observed due to the incomplete feedback in the bandit problem. This motivates the use of an unbiased estimate built from what the client does observe: i) use the loss l_k^t if it is observed and 0 otherwise, l̂_k^t = l_k^t · 𝟙{k = k′}, and ii) correct the bias by dividing by the probability of selecting the arm, l̂_k^t = l_k^t · 𝟙{k = k′}/p_k^t, thereby preserving the expectation and making arms that have not been pulled recently look optimistic and thus likely to be explored. However, the unbiased estimate causes large fluctuations in the loss due to the inverse proportionality to p_k^t. One idea is to prevent p_k^t from becoming too small. The first option that comes to mind is to mix p_k^t with the uniform distribution; this is an explicit way of forcing exploration, which can be made to work after further modification. The idea of reducing the variance of importance-weighted estimators has been applied in various forms [40]-[42], but all of these works are based on truncating the estimators, which makes the resulting estimator less smooth.
This work takes a similar approach for a simpler and empirically superior algorithm. The key idea is to change the cost estimates to control the variance at the price of extra bias. To achieve this, we consider the Exp3 algorithm endowed with implicit exploration (IX)-style cost estimates [43]. After each action, the cost is estimated as l̂_k^t = l_k^t/(p_k^t + γ_t) · 𝟙{k = k′}, which is a biased estimator, where p_k^t is the probability, i.e., the fraction of weight, with which arm k is chosen for task t. The implicit exploration parameter γ_t ∈ (0, 1] smooths p_k^t so that actions with large losses, to which the classical exponential-weights recipe would assign negligible probability, are still chosen occasionally; the estimator is thus able to guarantee reliable performance in rapidly changing, adversarial environments.
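The IX-style estimate l̂_k^t = l_k^t/(p_k^t + γ_t) · 𝟙{k = k′} can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def ix_loss_estimate(observed_loss, chosen, p, gamma):
    """IX estimator: only the pulled arm's loss is observed under bandit
    feedback, and it is divided by (p_k + gamma) rather than p_k."""
    l_hat = np.zeros(len(p))
    l_hat[chosen] = observed_loss / (p[chosen] + gamma)
    return l_hat
```

Since E[l̂_k^t] = l_k^t · p_k^t/(p_k^t + γ_t) ≤ l_k^t, the estimate is optimistically biased, but its magnitude can never exceed l_k^t/γ_t, which is exactly the variance control described above.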

D. Proposed algorithm
In this work, taking into account the motivations above, an algorithm for adaptive learning-based task offloading is proposed to solve the offloading decision problem where a client decides, for each task, to which fog computing node to offload it. The proposed algorithm makes use of two exploration processes. One is a modified Boltzmann distribution based exploration that supports time-varying resource supply and demand dependent offloading, emphasizing feasibility and fairness in fog node selection. The other is an implicit exploration based on biased loss estimation that alleviates the uncertainty of the importance-weighted estimator.

Algorithm 1 MIX-AALTO: Modified Implicit Exploration based Algorithm for Adaptive Learning-based Task Offloading
1: Input: sequences η_t > 0, γ_t > 0, K′ = ∅
2: for t ∈ T do
3:   Set L̂_k ← 0, β_k ← 0, k ∈ K_t ⊲ Dynamic supply
5:   for any k ∈ K_t do
6:     if k ∈ K_t \ (K_t ∩ K′) then
       Update L̂_k ← L̂_k, k ∈ K_t
14:  Update δ ← q_t ⊲ Dynamic demand
15:  Update W_k ← e^{−δ·(L̂_k + β_k)} ⊲ Selection rule
16:  Select action k′ ∼ p
18:  Receive the cost l_{k′} ← U_{k′} ⊲ Assessment rule
19:  Compute l̂_k ← l_k · 𝟙{k = k′}/(p_k + γ_t), k ∈ K_t
20:  Update scores: L̂_k ← L̂_k + η_t l̂_k, ∀k
21: end for
In Algorithm 1, the vanishing learning factor η_t, the exploration factor γ_t, and the set of previously used fog nodes K′, here assumed to be empty initially, are the input parameters (Line 1). Then the iteration-dependent parameters η_t, γ_t, K_t and q_t are set. Upon the generation of each task by the application, the input data size q_t is known to the task requester. Also, up-to-date information on the set of candidate VFNs K_t is available from the neighbor discovery process, and K′ is updated (Line 3). Afterward, the algorithm is structured in three parts: i) exploration bonus adjustment, where the two dynamic factors for resource supply and demand, β_k and δ, are updated and then used to tune the weighting parameter w_k^t (Lines 4-14); ii) the selection rule domain, where the selection probability is proportional to the cumulative score tuned by the resource demand and supply aspects for suitable and fair selections via the modified exploration bonus (Lines 15-17); iii) the assessment rule domain, where the utility function defined in Eq. (3) is used to evaluate the service capability of each fog node, by observing the empirical offloading cost and converting it to an estimated cost via the implicit exploration factor, after which the cumulative learning score is updated (Lines 18-20).
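A minimal single-client sketch of the loop structure of Algorithm 1, under simplifying assumptions (the score-patching, selection, and IX rules as described in the text; all names and the cost/candidate interfaces are illustrative, not from the paper):

```python
import numpy as np

def mix_aalto(costs, q, K_sets, eta, gamma, q_min=0.2, q_max=1.0, seed=0):
    """Sketch of Algorithm 1 for one client. `costs[t][k]` is the per-bit cost
    the environment assigns to arm k in round t, `q[t]` the task size, and
    `K_sets[t]` the candidate-arm indices available in round t."""
    rng = np.random.default_rng(seed)
    L = {}                                   # cumulative scores, kept across rounds
    chosen = []
    for t, K in enumerate(K_sets):
        K = list(K)
        for k in K:                          # dynamic supply: score patching
            if k not in L:                   # brand-new arm inherits the minimum
                old = [L[m] for m in K if m in L]
                L[k] = min(old) if old else 0.0
        delta = 1.0 + (q[t] - q_min) / (q_max - q_min)   # dynamic demand
        s = np.array([L[k] for k in K])
        w = np.exp(-delta * (s - s.min()))   # selection rule (Line 15)
        p = w / w.sum()
        i = rng.choice(len(K), p=p)          # select action (Line 16)
        l_hat = costs[t][K[i]] / (p[i] + gamma[t])       # IX estimate (Line 19)
        L[K[i]] += eta[t] * l_hat            # score update (Line 20)
        chosen.append(K[i])
    return chosen, L
```

A re-appearing arm automatically reuses its previous score, since the score dictionary persists across rounds; this mirrors the re-utilization behavior of Line 10 discussed below.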
1) Adaptivity: Adaptivity is an essential property that has steadily gained importance in solving the offloading decision problem, particularly in dynamic environments. To adapt to dynamic resource supply and demand, the cumulative learning score is fine-tuned with the parameters β_k and δ such that the available arms are fairly and suitably explored. While the parameter δ, identical for all candidate VFNs, modifies the explore-exploit behavior for demand-dependent suitability, the parameter β_k, possibly different for different VFNs, reduces large disparities between the cumulative learning scores of different VFNs, thereby avoiding unfair selection. Prior to the offloading decision for each task, a neighboring VFN set is discovered within the client's communication range [14], and those moving in the same direction are considered as candidate VFNs [22]. Due to inherent mobility-induced time-varying features, some fog nodes physically leave a candidate set temporarily but return within a finite number of time periods, called the volatile occurrence of the potential candidate fog nodes. For an arm k ∈ K_t \ (K_t ∩ K′) newly appearing in the candidate set for task t, the corresponding score β_k is set to the minimum score of the other existing available arms (Line 7). If an arm k ∈ (K_t ∩ K′) that has previously been connected to the task generator becomes available again after a short absence, the previously used score is re-utilized, so that the arm leverages its own recently estimated computation capability rather than the other arms' (Line 10). Such volatile resource supply based score assignment allows discovered candidate arms to be fairly explored, using either the most capable existing arm's score or their own knowledge, and allows the algorithm to adapt to the change.
Regarding the adaptation to resource demand, a time-varying, demand-dependent offloading decision is enabled by jointly considering both the normalized per-bit and per-task offloading cost in the score, which results in a more suitable fog node selection, i.e., with a larger input size, more exploitation is executed with a firmer belief for a more suitable selection (Line 14). While such adaptivity to dynamic resource supply and demand is treated as a modified exploration bonus in the selection rule for fairness and suitability, implicit exploration is employed to enable reliable cost estimation in the assessment rule.

2) Scalability:
To cope with the heterogeneity in resource capacities and the need for adaptivity in a dynamic environment, the proposed algorithm is designed to maintain high scalability, taking into account i) computational complexity, e.g., time complexity, ii) communication overhead given its implementation, i.e., how many times a decision-maker needs to communicate with the available fog nodes, and iii) accessible information, i.e., what type of information a decision-maker needs before making decisions. That is, the key properties of scalability are low complexity, low communication overhead, and a reduced need for information.

Remark 2. (Low communication overhead)
The proposed algorithm allows the task generator to learn states such as the allocated CPU frequency of each fog node, instead of obtaining them from physical signaling messages, which saves |K_t| signaling messages for the states of the |K_t| candidate fog nodes.

V. LEARNING EFFICIENCY OF PROPOSED ALGORITHM
This section characterizes the performance of the online learning algorithm. Naturally, exploring an uncertain world with a specific goal always incurs some regret. As a performance criterion, the considered assessment rule employs a notion of learning regret that captures the degree of cumulative dissatisfaction of a task-generating client in the presence of dynamic resource supply and demand.

A. Regret
Concretely, the regret of an algorithm is defined as its cumulative loss minus the cumulative loss of the best strategy in the pool, i.e., the available candidate set. To address a non-stationary environment, where there is no single fixed arm that does well overall, we use the regret with respect to an interval, the maximal run of rounds over which the network structure remains unchanged, i.e., the available fog nodes are identical during an interval T_i. The significance of no-regret learning depends on the benchmark policy against which the learning algorithm is measured.
An oracle benchmark to P in equation (3), the optimal solution to the minimization problem during each interval t ∈ T_i, is given by k* ∈ arg min_k l̄_k[i], ∀i, where l̄_k[i] is the expectation of l_k^t, E[l_k^t], over the interval i, which is unknown beforehand in practice. Given the oracle benchmark, the learning regret, which measures how much the client regrets its pulled action sequence compared with the optimal policy, can be expressed as R_T = L_{k′}^T − L_{k*}^T, where L_{k′}^T = Σ_{t∈T} l_{k′}^t and L_{k*}^T = Σ_{t∈T} l_{k*}^t are the cumulative losses incurred by Algorithm 1 and by the adopted oracle, respectively.
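The interval regret against the best fixed arm in hindsight can be computed as in this small sketch (names illustrative):

```python
import numpy as np

def interval_regret(losses, chosen):
    """Regret over one interval: cumulative loss of the pulled sequence minus
    that of the best fixed arm in hindsight (losses: T x K loss matrix)."""
    losses = np.asarray(losses, dtype=float)
    incurred = losses[np.arange(len(chosen)), list(chosen)].sum()
    best_fixed = losses.sum(axis=0).min()   # oracle: best single arm over the interval
    return incurred - best_fixed
```

Note that this hindsight quantity uses the full loss matrix, which the bandit learner never observes; it serves only as the evaluation benchmark.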
The regret upper bound of the proposed algorithm is analyzed; it is desirable for the regret to stay small in mean and well concentrated around the mean, i.e., to admit so-called high-probability (1 − ν) bounds. The targeted properties guaranteed for each interval are also valid across multiple intervals; thus, without loss of generality, one may focus on a single interval and omit the interval index i, e.g., T_i = T. Such a probability-based measure can be quantified via a concentration-of-measure inequality based on the Cramér-Chernoff method. The quantities of interest here are the variance and the bias, both of which appear as bounded components of the regret.
We show that the sum of a sequence of random variables cannot be much higher than the sum of their expectations conditioned on the past, following [43], using a martingale sequence and Markov's inequality.
Let γ_t be a fixed nonincreasing sequence and α_t be non-negative F_{t−1}-measurable random variables. According to [43], i) Z_t is a super-martingale relative to F_{t−1}, with Z_0 = 1. ii) By Markov's inequality, P(Σ_{t∈T} Σ_{k∈K} α_t(l̂_k^t − l_k^t) > ε) ≤ e^{−ε} = ν for any ε > 0, where ν is the probability that the bound is not satisfied. With the complementary (guaranteed) probability 1 − ν and bandit feedback, one gets Σ_{t∈T} Σ_{k∈K} α_t(l̂_k^t − l_k^t) ≤ ln(|K|/ν) and, similarly, Σ_{t∈T} α_t(l̂_k^t − l_k^t) ≤ ln(|K|/ν) for any fixed k. Thus, we obtain Σ_{k∈K} Σ_{t∈T} γ_t(l̂_k^t − l_k^t) ≤ ln(|K|/ν)/2 and Σ_{k∈K} Σ_{t∈T} (l̂_k^t − l_k^t) ≤ ln(|K|/ν)/(2γ_T), and similarly for any fixed k.
Unlike a typical unbiased estimator, the cost estimator with the implicit exploration parameter in the assessment rule incurs a bias, namely the difference between the realized cost and the biased estimator's expected cost.

Lemma 2. (Bias) With probability at least 1 − ν, the bias is bounded by Σ_{t∈T} γ_t l_k^t + ln(|K|/ν)/2, from Lemma 1.
Remark 4. The parameter γ_t serves to decrease the variance but to increase the bias for a given learning rate η_t, resulting in a variance-bias trade-off.
Remark 5. The parameters η_t and γ_t, selected irrespective of ν, endow the proposed algorithm with a high-probability bound for any confidence level ν.
Rearranging the variance and bias components above, the regret with respect to arm k * is bounded as follows:

Proof: The learning regret in (3) can be decomposed into sub-parts following [35]. The upper bound of R_T^{Exp3} is further handled by manipulating the bias term Σ_{k∈K} Σ_{t∈T} γ_t l̂_k^t with a proper value of η_t, i.e., conditioned on η_t/2 ≤ γ_t. To sum up, the aggregate regret is upper-bounded by R_T ≤ log|K|/η_T plus the variance and bias terms bounded above.

Remark 6. The learning regret can be bounded by controlling variance and bias.
Remark 7. Step sizes η_t and γ_t that vanish too rapidly may be sub-optimal, since they may incur a higher regret.
The implicit exploration enables reliable cost estimation in the assessment rule, thereby yielding a high-probability (1 − ν) regret bound. However, the bound does not accommodate adaptation to the dynamics of resource demand and supply, since an arm is typically selected under the assumptions that all candidate VFNs have i) identical resource demand and ii) a fair opportunity to be assessed. Next, focusing on achieving a better regret bound with the same (1 − ν) probability, taking into account the two dynamic factors, we show how the adaptation can be treated as an exploration bonus in the selection rule for suitability and fairness.

B. Dynamic resource demand
In the following, dynamic resource demand-based offloading decision making is studied. An arm k is selected with probability proportional to e^{−L̂_k^{t−1} δ_t}, without mixing any explicit exploration term into the distribution, but multiplying the cumulative score of arm k by the normalized input data size δ_t. Note that the positive value of δ_t makes high selection probabilities higher and low selection probabilities lower, and determines the sensitivity of the probability with which a given arm is chosen to the estimated cumulative scores of the alternative arms in the corresponding state. The lower the value of δ_t, the less sensitive the selection probability is to relative differences in the cumulative scores; conversely, high δ_t values make the choices sensitive to the estimated values of the various alternative arms.
One critical issue is that, even having estimated all the arms correctly, Boltzmann exploration may pull sub-optimal arms prematurely or excessively. Such abrupt decisions may cause unintended consequences, i.e., cumulative importance-weighted loss estimates may become irreversible afterward [44]. This is mainly because Boltzmann exploration does not account for the uncertainty of the empirical cost estimates, i.e., the large variance caused by an unbiased bandit estimator with arbitrarily small selection probability may result in a worse outcome. This work circumvents the issue by guaranteeing algorithmic robustness, for which bounded properties of the estimates are used. High-probability bounds for adversarial bandits were provided in [39] with the Exp3.P algorithm and in [43] with Exp3-IX, but limited to the surrogate regret with a capability-based selection strategy.
Note that for task t, the probability p_m^t of a dominant arm m ∈ K, which is superior to the other arms, L̂_m^{t−1} < L̂_k^{t−1}, ∀k ∈ K, or even has a low enough score, increases with δ_t, due to the fact that the derivative of the resulting probability p_m^t with respect to δ_t is positive. The proposition states that such an escalation in the resulting probability of the dominant arm yields distinct but enhanced concentration profiles, achieving a lower regret than the case with δ = 0, equivalently δ_t = 0, ∀t ∈ T. Meanwhile, a dominated arm k ∈ K\{m} gets a lower selection probability with δ > 0, p_{k|δ=0}^s > p_{k|δ>0}^s. Nevertheless, the estimated cost l̂_k^s of the dominated arm becomes higher with δ > 0 due to the lower resulting probability, which increases its cumulative score, thereby making its selection probability lower in the next round.
To sum up, the proposed algorithm with δ > 0 achieves a lower cumulative regret than the one with δ = 0, R_{δ>0}^s ≤ R_{δ=0}^s, once s becomes large enough, s ≥ t_o. A natural question is whether an arm with a rather good service capability compared to the other arms results in a lower score, i.e., whether an arm with relatively low cumulative realized offloading cost is allowed to form a low score after performing a certain number of tasks, which would eventually be effective in reducing variance and bias, and thus the learning regret (Prop. 2). The following proposition establishes such rational selection behavior.
The proof follows from treating the two score sequences. Suppose the contrary, (1/(K−1)) Σ_{k≠m} L̂_k^s − L̂_m^s < 0, and express a measure of the cumulative distance between the two F_t-measurable random variables in terms of scores. According to the strong law of large numbers for super-martingale difference sequences [46, Corollary 4.2], [47, Theorem 5], [48, Theorem 2], if the second moment of the super-martingale differences is bounded,

Thus we get an upper bound as follows
By using the bound, the contradiction follows as s → ∞, since Σ_{t=1}^s η_t → ∞. Remark 8. The step size γ_t vanishing faster than or equal to t^{−1}, γ_t ≤ t^{−1}, could be sub-optimal.

C. Dynamic resource supply
In the following, we show that giving a certain arm some fixed extra information at the beginning of the learning interval [49] results in better performance than the conventional approaches, i.e., the partial or full reset in Section III-B. Such fine-tuned scores reduce the exploration space, thereby quickly calibrating perception and adapting to environmental changes. One might share the explored information among arms that follow the same cost distribution to reduce the exploration space, but this is valid only in the stochastic MAB framework [9]. Considering the adversary's non-stochastic force, when the corresponding arm reappears after a finite but not too large number of rounds, the task requester may reuse its own previous score, or may use the minimum of the other arms' scores in the immediately preceding round if no exploration progress has been made for many rounds, β_k̃ = max(min_m(L̂_m^{τ−1}), L̂_k̃^{τ−1}), m ∈ K^{τ−1}. If an arm disappears in round τ due to its inherent mobility, the task requester can avoid unnecessary pulls by ruling out the vanishing arm in its selection process, |K^τ| < |K^{τ−1}|.
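The score-patching rule β_k̃ = max(min_m(L̂_m^{τ−1}), L̂_k̃^{τ−1}) can be sketched as follows (the helper is hypothetical; scores are stored per arm in a dictionary):

```python
def patch_score(L_prev, k):
    """Initial score for a (re)appearing arm k: the maximum of the minimum
    score among the arms known in the previous round and the arm's own last
    score (0 if it has never been seen)."""
    floor = min(L_prev.values()) if L_prev else 0.0
    return max(floor, L_prev.get(k, 0.0))
```

A brand-new arm thus starts from the minimum existing score rather than from zero, while a returning arm with a higher own score simply reuses it, which is the fair-exploration behavior motivating the rule.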
We compare the proposed approach for the volatile resource supply case, where the arm k̃ which recently joined the exploration process for a previous task τ_{i,p} appears again for the current task τ_i = T_i[1] ∈ T_i, with two other cases: i) a zero value of β_k̃ for the partial reset, i.e., only L̂_k̃^{τ_i−1} is nullified while the others keep L̂_k^{τ_i−1} > 0, and ii) zero values of β_k̃ and L̂_k^{τ_i−1} for the full reset. The parameter β_k̃ is not iteration-varying but interval-varying, i.e., it is updated whenever the candidate fog node set changes for a task τ_i, ∀i. Note that while the modified score for dynamic demand affects all arms equally, the one for dynamic supply only tunes the scores of the joining arms, which results in different fairness of exploration and different filtration processes. Denote β_k̃ > 0, ∀i, by β > 0. In the following, we show that β > 0 yields enhanced learning performance with a better composite of cumulative scores over the arms. Proof: The proof follows from deriving an improved concentration. Considering task t ∈ T_i = [τ_i, s] and arm k ∈ K^{τ_i}, and modifying the term λ̂_t = Σ_k η_t l̂_k^t in Lemma 1, the product remains a super-martingale relative to F_{t−1}. With the complementary probability 1 − ν and bandit feedback, one gets Σ_t Σ_k η_t(l̂_k^t − l_k^t) = Σ_t Σ_k η_t(l̂_{k|β>0}^t − l_k^t) + H_K^s ≤ ln(|K_i|/ν)/2, where H_K^s = Σ_t Σ_k η_t(l̂_k^t − l̂_{k|β>0}^t). According to Lemmas 1 and 2, when Σ_k L̂_k^s > Σ_k L̂_{k|β>0}^s, one gets lower variance and bias.
The result above balances the exploration needed to identify the reasonable alternatives in a fair manner, which results in lower learning regret by Prop. 1, conditioned on a better-formed set of cumulative scores. Now, the natural question is whether Σ_{k∈K_i} L̂_k^s > Σ_{k∈K_i} L̂_{k|β>0}^s, s ∈ T_i, ∀i. The following proposition states that the cumulative scores of the proposed approach are better than those of the reset cases. Consider the filtrations F and F_{β>0} of the measures used in estimation. A filtration represents the iteration-varying available knowledge, an increasing sequence of sigma algebras, i.e., F^1 ⊆ · · · ⊆ F^{s−1} and F^1_{β>0} ⊆ · · · ⊆ F^{s−1}_{β>0}, where F^{s−1} and F^{s−1}_{β>0} are the information available for task s. Larger information allows a more accurate estimate. The amount of information differs between the two filtration sets, because the joining arms (partial reset) or all arms (full reset) start the exploration process with a lack of information. Using a modified score β > 0 provides extra information at task τ_i which influences the subsequent estimates for tasks s > τ_i. From the two filtration sets F and F_{β>0}, we compare Σ_k L̂_{k|β>0}^s with Σ_k L̂_k^s under the partial and full reset cases. i) For the partial reset case, L̂_k̃^{τ_i−1} = 0 and β = 0, one may consider a certain task offloading round τ′_i, as a self-adjoint operator adapted to the filtration, representing the number of additional exploration rounds that the newly appeared or reappeared arms need to experience to become comparable, but which could in fact be saved. When γ_t/η_t is fixed over the task rounds, γ_t/η_t = φ > 0, ∀t, the least number of saved explorations is positive. ii) For the full reset case, the score deviation among the existing arms still remains and encodes the distinction of their offloading capabilities, saving a positive number of explorations as in the partial reset, with F^{τ_i−1} ⊆ F^{τ_i−1}_{β>0}.
To sum up, in both cases the fine-tuned scores yield Σ_k L̂_k^s > Σ_k L̂_{k|β>0}^s, k ∈ K^τ, which contradicts the supposition. From the result above, the proposed approach β > 0 saves at least a positive number of exploration rounds for a positive implicit exploration parameter.

D. Sub-linear regret
A learning algorithm is said to achieve the no-regret condition if the cumulative regret grows sub-linearly with the number of tasks, T; in other words, the per-round regret vanishes [36], i.e., becomes negligible as T grows, R_T/T → 0. Note that the parameters η_t and γ_t of a potential fog node existing for multiple intervals, i ∈ I, are decreasing with respect to t ∈ T, i.e., iteration-varying, while the candidate set may differ between intervals, i.e., interval-varying. Remark 9. The conditions on the algorithm's step sizes, η_t and γ_t, allow a sub-optimal exploration-exploitation balance to be avoided.
Corollary 2. If the learning rate is candidate-set and task-round dependent, in the form η_t = √(log|K_i|/(|K_i|·t)) [35], the algorithm's sub-linearity properties take effect after at least ⌈|K_i|/log|K_i|⌉ rounds. Proof: With a decreasing but candidate-set dependent schedule, η_t = √(log|K_i|/(|K_i|·t)) > t^{−1} requires the number of candidate fog nodes to satisfy |K_i| < e^{−W_{−1}(−1/t)}, where W_{−1}(·) is the lower branch of the Lambert function. Likewise, once the available set K_i is updated, the number of iterations should exceed the minimum task round τ_{i,o} to ensure the sub-linearity of the proposed algorithm.
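Assuming the square-root schedule stated in the corollary, the minimum number of rounds follows from solving η_t > 1/t for t (illustrative helper; the schedule form is the one assumed in the corollary):

```python
import math

def min_rounds_for_sublinearity(K):
    """Smallest round count from eta_t = sqrt(log K / (K t)) > 1/t,
    which rearranges to t > K / log K."""
    return math.ceil(K / math.log(K))
```

For example, with |K_i| = 10 candidate VFNs this gives ⌈10/ln 10⌉ = 5 rounds before the schedule dominates 1/t.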

VI. NUMERICAL ILLUSTRATION
This section conducts numerical studies to assess the average per-bit cost (bit-cost) and regret of the proposed algorithm.
A. Performance evaluation 1) Evaluation setting: Consider one vehicle (client) of interest, requesting computational resources from candidate edge computational resource providing vehicles (VFC nodes). The distance between the client and each candidate VFC node is assumed to follow a uniform distribution, d ∼ U[0, d_r], where d_r is the communication range, equal to 400 m. The transmission power of the client is 24 dBm, the large-scale fading gain follows the 3GPP pathloss model [50], A_o = 128.1 + 37.6 log_10(d), the small-scale fading gain follows a Rayleigh distribution with unit variance, the channel bandwidth is W = 10 MHz, and the noise power spectral density is N_o = −174 dBm/Hz. Note that co-channel and adjacent-channel interference effects are assumed negligible owing to orthogonal channel allocation [31] and experimental results [32]. Also, it is assumed that the service discovery solution, which finds neighboring VFNs within the client's communication range, selects fog nodes moving in the same direction as candidates [9]. Thus, the small relative speed makes the Doppler shift insignificant, and the fading gains remain unchanged during the uplink transmission for each task offloading request.
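Under the stated link parameters, a per-distance uplink Shannon rate can be sketched as follows (the helper name is illustrative; the 3GPP formula is assumed to take the distance in km, and h2 denotes the Rayleigh small-scale power gain):

```python
import math

def uplink_rate_bps(d_m, h2=1.0):
    """Shannon uplink rate with P_tx = 24 dBm, pathloss
    128.1 + 37.6*log10(d_km) dB, W = 10 MHz, N0 = -174 dBm/Hz."""
    pl_db = 128.1 + 37.6 * math.log10(d_m / 1000.0)   # distance in km assumed
    p_rx_dbm = 24.0 - pl_db + 10.0 * math.log10(h2)   # received power
    noise_dbm = -174.0 + 10.0 * math.log10(10e6)      # N0 * W over 10 MHz
    snr = 10.0 ** ((p_rx_dbm - noise_dbm) / 10.0)
    return 10e6 * math.log2(1.0 + snr)                # Shannon capacity
```

At the edge of the 400 m range this yields tens of Mbps, consistent with a setting where uplink transmission delay is non-negligible but not dominant.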
Consider 7 volatile VFNs (see Fig. 2) with maximum CPU frequencies F_k ∈ {6, 4, 5, 4, 1.5, 2, 4} GHz that appear or disappear as candidate fog nodes of one task-generating vehicle (client) over a finite number of tasks in 3 epochs, each epoch consisting of 1000 tasks and keeping the same fog node set. In the first epoch, there are 5 candidate VFNs, K_t = {1, 2, 3, 4, 5}, ∀t ∈ [1, 1000]. At the beginning of the second epoch, the less powerful VFN 5 disappears and VFNs 6 and 7, with higher computing capability, appear, K_t = {1, 2, 3, 4, 6, 7}, ∀t ∈ [1001, 2000]. At the beginning of the third epoch, VFN 4 disappears, while VFN 5 re-appears, K_t = {1, 2, 3, 5, 6, 7}, ∀t ∈ [2001, 3000]. For each VFN, the CPU frequency allocated to the task client is a fraction of the maximum CPU frequency, distributed from 20% to 50% but arbitrarily constrained. To realize such a non-stochastic environment, adversarial perturbation is applied in a similar manner as in [51], where the realized cost function is affected by the oblivious attack, specifically an arbitrary fraction of the allowable CPU frequency range. The total tasks are split into phases of different lengths, each with different means for different arms. The computation intensity is set to w = 1000 cycles/bit. To meet the client's diverse demand, the requested service type can change arbitrarily with different task sizes. Varying service types can be considered at regular intervals; for simplicity, the periodic interval for changing service types is aligned with an epoch. The task size, q Mbits, is either fixed or randomly distributed according to either a uniform or a truncated normal distribution on a predefined interval q ∈ [0.2, 1].
2) Evaluation result: The proposed algorithm is compared with its counterparts, implicit exploration-based algorithms with bandit feedback and with full feedback. The performance of the learning algorithms in terms of the learning regret and the average per-bit cost, with ξ = 1 in equation (1), i.e., per-bit latency, is depicted in Fig. 3, showing that the proposed algorithm outperforms the other implicit exploration-based algorithms, in which an arm is selected based on scores that are i) fully reset, with zero values of β_i and L̂_k^{T_i[1]} (full-reset), or ii) partially reset, with a zero value of β_i (partial-reset). Two kinds of adaptivity, to dynamic resource demand and supply, are considered; notably, their joint consideration achieves a better exploration-exploitation trade-off, since it allows adapting to the dynamic task offloading environment without exploring the sub-optimal actions, thereby reducing the regret by 65% and 40% relative to the two conventional Exp3-IX based variants, respectively, and coming much closer to the full information setting [35], where the complete cost vector is revealed after every round (full feedback), in Fig. 3(a).
Also, the proposed algorithm exhibits sub-linearity of the regular regret, i.e., the regret grows sub-linearly with the number of tasks, intuitively indicating that the task-generating client's learning algorithm asymptotically converges to the VFC node with optimal performance. Note that in the first epoch, the algorithm with dynamic supply and demand is equivalent to the one with only dynamic demand, and the one with only dynamic supply is equivalent to Exp3-IX. Besides, in the optimal genie-aided policy, the client always connects to the VFC node with minimum cost. As shown in Fig. 3(b), during each epoch, the average per-bit cost of the proposed algorithm converges faster than the others, except for the full-feedback case.
Impact of β: Implicit exploration-based algorithms taking dynamic resource supply into account achieve lower learning regret than vanilla Exp3-IX algorithms that initialize the learning history of all candidates, L_k = 0, ∀k, or of a new one, L_n = 0, whenever the candidate fog node set is updated. When a client discovers a newly appeared VFC candidate, its weight is set to the lowest one among the other candidates. For example, VFC nodes 6 and 7, appearing at τ = 1001, are initialized with the score last updated by a rather more capable fog node, min_k(L_k^{τ−1}). If a VFC node leaves the candidate set temporarily but returns to the set within a finite number of tasks, MIX-AALTO utilizes the information on the last updated score that the rejoining VFC node had before, or that the other fog nodes have. For example, VFC node 5, re-appearing at τ = 2001, may launch its score with the value it had at τ_p = 1000 or with min_k(L_k^{τ−1}), depending on the circumstance. Such a dynamic resource supply-based policy ensures that a newly discovered or re-discovered VFC node is likely to be explored, so that the proposed algorithm avoids unfair selection opportunities and thus adapts quickly to changes in a volatile environment. This indicates that the dynamic resource availability-based policy yields better adaptivity to dynamic and adversarial environments, and thus reduces the performance loss incurred through learning. Fig. 4 shows the impact of the number of VFNs, |K_t|, in the candidate set, K_t = K̄_t ∪ K̃_t for tasks t ∈ [1001, 2000], where K̄_t = {1, 2, 3, 4} are the existing VFNs from the first epoch, t ∈ [1, 1000], and K̃_t is the set of appearing VFNs whose distances to the client and CPU frequencies are randomly selected from U and F_k. As the density of candidate VFNs grows, more exploration is performed, requiring more rounds for the unit offloading cost to converge and resulting in a higher regret.
This observation has implications for the design of the discovery process protocol. For instance, limiting the maximum allowable number of candidate VFNs would be beneficial when the service requirement is strict or the network topology is highly volatile; one may adjust the maximum number of candidate VFNs accordingly [18]. The proposed algorithm outperforms the two other exploration-reset cases: modifying cumulative scores only for the appearing VFNs (partial-reset) and for all VFNs (full-reset). Compared to the partial-reset case, the performance gain of the proposed dynamic supply-based algorithm (β > 0) grows with the minimum gap to the existing arms' scores after task τ − 1, min(L_m^{τ−1}), m ∈ K_τ, where τ = 1001, because the proposed dynamic supply approach reduces the number of exploration rounds the appearing arms need to experience. Compared to the full-reset case, on the other hand, the effect of this minimum gap to the existing VFNs' cumulative scores on the performance gain is minimal, since the score differences among the existing VFNs are only effective within a distinct filtration set; such residual differences would influence the estimation performance. A high density of appearing VFNs may also alleviate the effect of such score deviations among the existing VFNs, since the importance-weighted mechanism assigns a probability proportional to the number of candidate VFNs as well as to the cumulative scores.
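The score patching behavior discussed above, initializing a newly appearing arm at the minimum existing cumulative score and letting a re-appearing arm resume from its last recorded score, can be sketched as follows. The resumption rule used here (taking the larger of the old score and the current floor) is our assumption for illustration; the paper leaves the choice circumstance-dependent.

```python
def patch_scores(L, history, new_set):
    """Patch cumulative score table L when the candidate set changes.
    - A never-seen arm starts from the smallest cumulative score among the
      surviving arms, so it is explored without being unfairly favored.
    - A re-appearing arm resumes from its last recorded score (here floored
      by the current minimum; an illustrative choice, not the paper's rule).
    Returns the patched score table restricted to new_set."""
    surviving = [k for k in new_set if k in L]
    floor = min(L[k] for k in surviving) if surviving else 0.0
    patched = {}
    for k in new_set:
        if k in L:                       # arm stayed in the candidate set
            patched[k] = L[k]
        elif k in history:               # arm re-appears: resume its old score
            patched[k] = max(history[k], floor)
        else:                            # brand-new arm: start at the floor
            patched[k] = floor
    # Remember scores of arms that just left, for possible re-appearance.
    for k in L:
        if k not in new_set:
            history[k] = L[k]
    return patched
```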
Impact of δ: For dynamic resource demand, the proposed algorithm considers two major exploration perturbations: implicit exploration, which guarantees low variance in the assessment rule, and Boltzmann exploration, which drives suitability-based selection. Fig. 5(a) demonstrates the robustness of the proposed algorithm compared to Exp3, and its superior performance compared to algorithms that choose arms based on current knowledge with probability 1 − ε, such as ε-greedy with ε = 0.1, on upper confidence bounds, such as UCB1, or that guarantee high-probability bounds, such as Exp3.P and Exp3-IX. This clearly shows that a fine-grained implicit exploration approach achieves higher and more robust performance, i.e., a lower empirical mean and standard deviation of the regret, than the others. Fig. 5(b) shows the effect of the suitability-based selection policy on the learning regret. In general, when a positive value of the normalized input data size, δ > 0, is considered in the selection rule, the client's learning performance improves. This means that a score associated with both the normalized per-task cost and the per-bit cost identifies a more suitable candidate and thus ensures a better trade-off between exploitation and exploration. On the other hand, when δ = 0, there is no exploration for suitability, but only for the capability of candidate VFC nodes. Since such a capability-based learning approach may fail to appropriately address upcoming variations in computational demand, its learning regret for the per-bit cost is clearly worse than those for δ > 0. This can be captured in the learning regret with the per-bit latency cost function, i.e., ξ = 1 in equation (1).
The effect of different input data sizes on the learning regret is also evaluated with three fixed sizes and one uniformly distributed size ranging between 0.2 and 1 Mbits. (While the task workload is determined by both the task size and the computation intensity, the per-bit cost depends only on the computation intensity.) The proposed algorithm brings a better performance gain by exploiting more for a large δ and less for a small δ, and the gain grows as the input data size increases: the per-bit learning regret is reduced by around 15%, 30%, and 45% for 0.3 Mbits, 0.6 Mbits, and 0.9 Mbits, respectively, relative to the policy considering only capability-based selection. This observation reveals the vital role of the proposed algorithm in coping with dynamic resource demand, enabled by the variance and bias reduction techniques in Sections IV-B and IV-C; the corresponding diminishing effects of variance and bias can be captured in Fig. 5(c) and Fig. 5(d). Apparently, when a user with a large task selects a VFC node with weak service capability, the learning performance is poorer than with a small task, whereas selecting a low-capability VFC node for a small input data size does not incur an enormous delay. With the input size uniformly distributed over the same range as the fixed sizes, the result is similar to that of a fixed size equal to the mean value, 0.6 Mbits.
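The role of δ in suitability-based selection can be illustrated with a Boltzmann rule that blends per-bit (capability) and per-task (suitability) cumulative costs. The blending form below is hypothetical, chosen only to show how δ = 0 degenerates to capability-only scoring while a larger δ weighs the per-task cost more; it is not the paper's exact selection rule.

```python
import math

def suitability_probs(L_per_bit, L_per_task, delta, eta):
    """Boltzmann selection over a score blending normalized per-task and
    per-bit cumulative costs; delta in [0, 1] is the normalized input size.
    delta = 0 recovers pure capability-based (per-bit) scoring; a larger
    delta exploits capable nodes more for big tasks. The convex blend is
    an illustrative assumption, not the paper's rule."""
    arms = sorted(L_per_bit)
    score = {k: (1 - delta) * L_per_bit[k] + delta * L_per_task[k] for k in arms}
    m = min(score.values())                          # stabilize the softmax
    w = {k: math.exp(-eta * (score[k] - m)) for k in arms}
    z = sum(w.values())
    return {k: w[k] / z for k in arms}
```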
Impact of ξ: Fig. 6(a) shows the impact of the weighting parameter ξ on the per-bit latency cost, D_k^t/q^t, and the per-bit energy consumption cost, E_k^t/q^t, of the proposed algorithm, the vanilla Exp3-IX algorithm, and the full-feedback case with T = 3000 tasks. The proposed algorithm yields per-bit cost values, for both latency and energy, that lie between the corresponding per-bit costs of Exp3-IX and the full-feedback algorithm over the whole range of ξ. It is also observed that the individual per-bit costs of the proposed algorithm and of the oracle vary smoothly with the weighting parameter ξ, i.e., they improve or degrade gradually. Increasing the weighting parameter makes the latency term dominate, so the per-bit latency performance improves while the energy consumption performance degrades. This behavior is captured for the two per-bit costs, latency and energy, in Fig. 6(b) for ξ = 0, 0.5, and 1, respectively. To sum up, the effectiveness and robustness of the proposed algorithm are verified under the synthetic scenario by its outperforming the other benchmarks in terms of the learning regret and the average per-bit cost, taking into account the dynamics of resource availability and demand in an adversarial environment.

B. Performance evaluation under realistic scenario

1) Evaluation setting: In this subsection, the applicability of the proposed task offloading algorithm is further explored using the Luxembourg SUMO Traffic (LuST) scenario [53]. The LuST scenario simulates real traffic in the city of Luxembourg using SUMO, where arterial and residential roads link downtown and metropolitan areas with highways on the outskirts that surround the city.
To better evaluate the resource supply volatility awareness of the proposed algorithm, we choose vehicles traveling on a highway road, consisting of multiple edges (IDs 31622#5 ∼ 31622#10) with multiple entrances and exits, as available VFNs. A task requester is assumed to travel the full route on the highway, departing from edge ID 31622#1 every minute, and its candidate VFN set is volatile for two reasons: i) the VFNs may join or leave the highway, and ii) the vehicles move in the same direction at relatively fast but different speeds. The vehicle coordinate and velocity data are used for simulation in MATLAB. The maximum CPU frequency of each VFN is randomly distributed in [1, 5] GHz. The remaining parameters follow the setting described for the synthetic scenario.
2) Evaluation result: Fig. 7 depicts the performance of the proposed algorithm in terms of the average per-bit cost for different VFN densities. To consider different volumes of available VFNs, the scenario is run in two time windows: i) the morning rush-hour peak period, 08:00 ∼ 08:05, and ii) the off-peak period around lunchtime, 13:00 ∼ 13:05, shown in Fig. 7(a) and 7(b), respectively [14], [53]. The proposed task offloading algorithm always outperforms the other learning algorithms, since it adapts better to the adversarial and dynamic environment. Specifically, compared with the UCB and volatile Exp3-IX (partial-reset) algorithms, the proposed algorithm reduces the average per-bit cost by about 23% and 10% in peak time, see Fig. 7(a), and by about 30% and 20% in off-peak time, see Fig. 7(b), when T = 300. The average per-bit cost decreases further, at the expense of the convergence rate, for a high density of VFNs: a large set of vehicles extends the exploration space and increases the probability of finding good solutions, i.e., the average per-bit cost is reduced more in peak time than off-peak, but the larger exploration space also slows convergence. Similar phenomena are observed for the per-bit energy cost in Fig. 7(c) and 7(d).

VII. CONCLUSIONS
This work proposes an adaptive learning-based decentralized task offloading algorithm in which each client makes the fog node selection decision independently. The proposed online learning algorithm provides a foundation for scalable and low-complexity offloading decision making in an adversarial environment. In particular, two bottlenecks in the VFC-induced heterogeneous and dynamic environment, the volatile candidate fog node set and the varying task size, are addressed. We prove that the input-size dependent selection rule allows a suitable fog node to be chosen without exploring the sub-optimal actions, and that an appropriate score patching rule allows quick adaptation to evolving circumstances, thereby achieving a better exploitation-exploration balance. While this work focuses on self-interested regret-optimal decision making, a system-level perspective can be considered in future work; it would be desirable to know whether the dynamic behaviors of distributed players guarantee a certain level of optimality in terms of social welfare in the information-limited case, i.e., an unknown game.