Double Deep Q-Learning-based Path Selection and Service Placement for Latency-Sensitive Beyond 5G Applications

Nowadays, as the need for capacity continues to grow, entirely novel services are emerging. A solid cloud-network integrated infrastructure is necessary to supply these services in a real-time, responsive, and scalable way. Due to their diverse characteristics and limited capacity, communication and computing resources must be collaboratively managed to unleash their full potential. Although several innovative methods have been proposed to orchestrate the resources, most ignored network resources or relaxed the network as a simple graph, focusing only on cloud resources. This paper fills the gap by studying the joint problem of communication and computing resource allocation, dubbed CCRA, including function placement and assignment, traffic prioritization, and path selection considering capacity constraints and quality requirements, to minimize total cost. We formulate the problem as a non-linear programming model and propose two approaches, dubbed B&B-CCRA and WF-CCRA, based on the Branch & Bound and Water-Filling algorithms to solve it when the system is fully known. Then, for partially known systems, a Double Deep Q-Learning (DDQL) architecture is designed. Numerical simulations show that B&B-CCRA optimally solves the problem, whereas WF-CCRA delivers near-optimal solutions in a substantially shorter time. Furthermore, it is demonstrated that DDQL-CCRA obtains near-optimal solutions in the absence of request-specific information.


I. INTRODUCTION
Nowadays, an increase in data flow has resulted in a 1000-fold increase in network capacity, which is the primary driver of network evolution. While this demand for capacity will continue to grow, the Internet of Everything is forging a paradigm shift to new-born perceptions, bringing a range of novel services with rigorous deterministic criteria, such as connected robotics, smart healthcare, autonomous transportation, and extended reality [1]. These services will be provisioned by establishing functional components, Virtual Network Functions (VNFs), which will generate and consume vast amounts of data that must be processed in real-time to ensure service responsiveness and scalability.
In these circumstances, a distributed cloud architecture is essential [2], which could be implemented via a solid cloud-network integrated infrastructure built of distinct domains in Beyond 5G (B5G) [3]. These domains can be distinguished by the technology employed, including radio access, transport, and core networks, as well as edge, access, aggregation, regional, and central clouds. Moreover, these resources can be virtualized using technologies such as Network Function Virtualization (NFV), which enables the construction of separate virtual entities on top of this physical infrastructure [4], [5]. Since distributed cloud and network domains would be diverse in terms of characteristics but limited in terms of capability, communication and computing resources should be jointly allocated, prioritized, and scheduled to ensure maximum Quality of Service (QoS) satisfaction while maximizing resource sharing and maintaining a deterministic system state, resulting in energy savings as one of the most significant examples of cost minimization objectives [6].
The joint problem of resource allocation in cloud-network integrated infrastructures has been extensively studied in the literature. Emu et al. [7] analyzed the VNF placement problem as an Integer Linear Programming (ILP) model that guarantees low End-to-End (E2E) latency while preserving QoS requirements by not exceeding an acceptable latency violation limit. They proposed an approach based on neural networks and demonstrated that it can result in near-optimal solutions in a timely way. Vasilakos et al. [8] examined the same problem and proposed a hierarchical Reinforcement Learning (RL) method with local prediction modules as well as a global learning component. They demonstrated that their method significantly outperforms conventional approaches. Sami et al. [9] investigated a similar topic to minimize the cost of allocations, and a Markov decision process design was provided. They claimed that the proposed method provides efficient placements. Performing cost-effective services was also investigated by Liu et al. [10] and He et al. [11]. In the former, the authors considered the cost of computing and networking resources as well as the cost of using VNFs and proposed a heuristic algorithm, whereas, in the latter, they considered latency as a cost and proposed a Deep Reinforcement Learning (DRL) solution to the problem. Iwamoto et al. [12] investigated the problem of scheduling VNF migrations in order to optimize the QoS degradation of all traffic flows and proposed a stochastic method on the basis of the load degree of VNF instances.
Objectives of the related studies (reference: objective): [13] max profit; [10] min cost; [11] max profit - cost; [14] min energy; [15] min cost; [12] max fairness; [16] max profit - cost; [7] min latency; [8] min latency; [9] min cost; [17] max rate; [18] min cost; [19] max rate; [20] min latency + cost; [21] min latency; this work: min cost.
Although innovative techniques for addressing computing resource restrictions have been proposed by the abovementioned authors, the network is solely considered as a pipeline in their studies, with no cognitive ability to the cloud domains. Nevertheless, there are additional studies in the literature that have been concentrating on communication and computing resources jointly. Kuo et al. [17] studied the joint problem of VNF placement and path selection in order to better utilize the network resources, and a heuristic approach was proposed to tackle it. Mada et al. [18] and Zhang et al. [19] addressed the problem of VNF placement with the objective of maximizing the sum rate of accepted requests. Mada et al. solved the problem by using an optimization solver, and Zhang et al. adopted a heuristic strategy. Yuan, Tang and You [20] formulated the latency-optimal placement of functions as an ILP problem and proposed a genetic metaheuristic algorithm to solve it. Gao et al. [21] focused on the VNF placement and scheduling to reduce the cost of computing resources by proposing a latency-aware heuristic algorithm. Minimizing the cost of allocations was also investigated by Miyamura et al. [15] and Yang et al. [16]. They took into account traffic routing constraints and proposed heuristic approaches to address the problem. By considering energy consumption as the most significant cost associated with networking and computing resources, Xuan et al. [14] addressed the same problem by proposing an algorithm based on a multi-agent DRL and a self-adaptation division strategy. Nguyen et al. [13] investigated the problem of VNF placement, where requests are weighted according to their priority and the goal is to maximize the total weight of services accepted for deployment on the infrastructure.
The methods presented in the cited studies are effective for resolving the resource allocation problem. However, such approaches cannot be utilized in B5G systems. Due to the stringent QoS requirements in the delay-reliability-rate space [22], the large number of concurrent services and requests, and the ever-changing dynamics of both infrastructure and end-user service usage behavior in terms of time and space, every detail of communication and computing resources must be determined and controlled in order to realize a deterministic B5G system [3]. In some studies, latency-related limitations and requirements were simply ignored [17], [15], [16], [13]. Although delay is addressed in the other studies mentioned, they simplified it to be a connection feature, and queuing delay in network devices is completely neglected. Furthermore, path selection is disregarded in some studies [18], [19], [20], and cost optimization is overlooked in others [19], [20].
This paper fills in the gap in the current literature by investigating the joint problem of allocating communication and computing resources, including VNF placement and assignment, traffic prioritization, and path selection. The problem is addressed while taking into account capacity constraints and link and queuing delays, to minimize overall cost. As an extension of the work presented in [23], the following are the primary contributions of this research:
• Formulating the joint resource allocation problem of the cloud-network integrated infrastructure as a Mixed Integer Non-Linear Programming (MINLP) problem.
• Proposing a method based on the Branch & Bound (B&B) algorithm to discover the optimal solution of the problem, and devising a heuristic approach based on the Water-Filling (WF) algorithm in order to identify near-optimal solutions to the problem. When the system is fully known, both techniques can be applied to solve the problem.
• Developing an architecture based on the Double Deep Q-learning (DDQL) technique comprising agent design, training procedure, and decision-making strategy for allocating resources when the system is only partially known, i.e., there is no prior knowledge about the requests' requirements.
The remainder of this paper is organized as follows. Section II introduces the system model. The resource allocation problem is formulated in Section III. Next, the B&B and heuristic approaches are presented in Section IV. Section V presents a DDQL-based resource allocation architecture. Finally, numerical results are illustrated and analyzed in Section VI, followed by concluding remarks in Section VII.

II. SYSTEM MODEL
In the following, we describe the main components of the system envisioned in this paper. As depicted in Fig. 1, the system consists of an infrastructure (integrated networking and computing resources), services running on computing resources, and end-user requests that must be connected to the services via networking resources. The parameters defined in this section are summarized in Table II.

A. Infrastructure Model
The considered infrastructure is composed of the edge (non-radio side) and core network domains consisting of $V$ nodes, $L$ links, and $P$ paths, denoted by $\mathcal{G} = \langle \mathcal{V}, \mathcal{L}, \mathcal{P} \rangle$. $\mathcal{V} = \{v \mid v \in \{1, 2, ..., V\}\}$ is the set of nodes. $\mathcal{L} \subset \{l : (v, v') \mid v, v' \in \mathcal{V}\}$ indicates the set of links, where the bandwidth of link $l$ is constrained by $B_l$, and it costs $\Xi_l$ per capacity unit. Although a variety of factors (distance, technology, redundancy, accessibility, etc.) contribute to this cost as a capital expenditure, the energy used by network devices to process the traffic carried by this link is one of the significant operating expenses affecting this cost and must be precisely addressed in order to realize future networks [24]. $\mathcal{P}$ denotes the set of all paths in the network, where each path $p \subset \mathcal{L}$ is identified by its head and tail nodes, and $\delta_{p,l}$ is a binary constant equal to 1 if path $p$ contains link $l$. It should be noted that all paths are directed vectors of nodes with no loops.
Each node in the network is an IEEE 802.1 Time-Sensitive Networking (TSN) device comprising an IEEE 802.1Qcr Asynchronous Traffic Shaper (ATS) at each egress port. As depicted in Fig. 2, an ATS uses a two-level queuing model [25]: 1) an array of shaped queues, each associated with a priority level and an ingress port, and 2) one queue per priority level. Each priority queue combines the output of all shaped queues with the same priority level. All queues implement the First-In-First-Out (FIFO) strategy. The next packet for transmission is identified by comparison of 1) the associated priority levels, and 2) the eligibility times of the Head-of-Queue (HoQ) packets. This could be accomplished, for instance, using comparator networks or linear iteration over all queues/HoQ packets while the transmission of a previous packet is in progress. We consider $\mathcal{K} = \{k \mid k \in \{1, 2, ..., K\}\}$ as the set of priority levels and assume that $k_r$ is the assigned priority of the traffic associated with request $r$, and that all queues of priority level $k$ have the same size, equal to $T_k$. Note that lower levels have higher priorities. Moreover, each node $v$ is equipped with computing resources as one of the prospective hosts to deploy service VNFs and limited to a predefined capacity threshold $\zeta_v$, which costs $\Psi_v$ per capacity unit. $\Psi_v$ is an increasing function of the energy consumed by various components of computing nodes (such as the processor, memory, and storage) to process requests and required by cooling systems to maintain appropriate temperatures. This cost is one of the most significant obstacles that must be overcome to make future applications feasible [26].
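The linear-iteration variant of the packet selection described above can be sketched as follows. This is a minimal illustration, not the IEEE 802.1Qcr procedure itself: the `HoQPacket` type, the `select_next_packet` function, and the rule that only packets whose eligibility time has already passed may be sent are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HoQPacket:
    """Head-of-queue packet of one priority queue (hypothetical type)."""
    priority: int            # lower value = higher priority, as in the text
    eligibility_time: float  # earliest time the shaper allows transmission


def select_next_packet(hoq_packets: Sequence[HoQPacket],
                       now: float) -> Optional[HoQPacket]:
    """Linear iteration over all HoQ packets.

    Assumption: a packet is a candidate only once its eligibility time
    has been reached. Among candidates, the highest priority wins and
    ties are broken by the earliest eligibility time.
    """
    eligible = [p for p in hoq_packets if p.eligibility_time <= now]
    if not eligible:
        return None
    return min(eligible, key=lambda p: (p.priority, p.eligibility_time))


packets = [HoQPacket(2, 0.0), HoQPacket(1, 5.0), HoQPacket(1, 1.0)]
print(select_next_packet(packets, now=2.0))
```

At `now=2.0` the priority-1 packet with eligibility time 1.0 is chosen, while the other priority-1 packet is still being held back by the shaper.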

[Fig. 2: ATS two-level queuing model, with shaping queues feeding priority queues.]
It is worth mentioning that the network is divided into several tiers, with nodes distributed across them so that the edge nodes (the entry nodes of requests) are located in tier 0. The higher the tier index, the greater the capacity of the associated nodes, and the lower their cost. In other words, the nodes closest to end-users (or to the nodes that serve as entry points, e.g., far edge nodes or in-network computing nodes [27]) are provisioned with high-cost, limited-capacity computing facilities, while low-cost, high-capacity depots are deployed in the core.

B. Service Model
The set of services is dubbed $\mathcal{S} = \{s \mid s \in \{1, 2, ..., S\}\}$, where $S$ indicates the number of services. If an end-user requests a service, its VNF has to be replicated in the network-embedded computing resources. Each VNF is empowered to serve more than one request, and $C_s$ indicates the maximum capacity of each VNF of service $s$.

C. Request Model
The set of requests asking for services is represented by $\mathcal{R} = \{r \mid r \in \{1, 2, ..., R\}\}$, where $R$ is the number of requests. Each request $r$ arrives in the network through node $v_r$ (one of the nodes equipped with a radio access base station) and intends service $s_r$, specifying its minimum necessitated service capacity, network bandwidth, and maximum tolerable delay, indicated by $C_r$, $B_r$, and $D_r$, respectively. In addition, $T_r$ and $H_r$, denoting the burstiness of traffic and the largest packet size for request $r$, are also assumed to be known a priori. Utilizing historical data, along with predictive data analytics methods, is one of the viable options for obtaining such accurate and realistic statistical estimates of traffic.

III. PROBLEM DEFINITION
This section describes the joint problem of VNF placement and assignment, traffic prioritization, and path selection. In what follows, the constraints and objective function are formulated as a MINLP problem and the problem is stated at the end of the section. The variables and parameters defined in this section are summarized in Table II.

A. VNF Placement and Assignment Constraints
To arrange VNFs, each request must first be assigned a single node to serve as its service location (C1). This assignment is acceptable if the assigned node hosts a VNF for the requested service (C2). When the requests of a specific service are assigned to a particular node, they will be handled by a shared VNF. C3 ensures that the total service capacity required by these requests does not surpass the VNF's capacity. Additionally, C4 guarantees that the computing capacity of a node is not exceeded by the VNFs placed on it. Without these two constraints, both VNFs and nodes are at risk of becoming overloaded, leading to the potential termination of VNFs and congestion of requests. Such a scenario would significantly decrease the system's reliability and availability. In the formulation of C1 - C4, $g_{r,v}$ and $z_{s,v}$ are binary variables: $g_{r,v}$ is 1 if node $v$ is selected as the service node of request $r$, and $z_{s,v}$ is 1 if service $s$ is replicated on node $v$.
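One way to write C1 - C4 that is consistent with the verbal description above is sketched below. It is a reconstruction under assumptions rather than the authors' exact formulation; in particular, C4 assumes that a VNF of service $s$ occupies $C_s$ units of the computing capacity of its host node.

```latex
\begin{align}
&\sum_{v \in \mathcal{V}} g_{r,v} = 1
  && \forall r \in \mathcal{R} \tag{C1}\\
&g_{r,v} \le z_{s_r,v}
  && \forall r \in \mathcal{R},\ v \in \mathcal{V} \tag{C2}\\
&\sum_{\{r \in \mathcal{R} \mid s_r = s\}} C_r\, g_{r,v} \le C_s
  && \forall s \in \mathcal{S},\ v \in \mathcal{V} \tag{C3}\\
&\sum_{s \in \mathcal{S}} C_s\, z_{s,v} \le \zeta_v
  && \forall v \in \mathcal{V} \tag{C4}
\end{align}
```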

B. Traffic Prioritization and Path Selection Constraints
To direct traffic, we must first ensure that each request is assigned to exactly one priority level (C5). Then, the inquiry and response paths of each request are determined (C6 and C7). For each request, a single inquiry path is chosen that starts at the request's entry node and ends at the request's VNF node. The response path follows the same logic but in reverse order. The following two constraints guarantee that the two paths are chosen on the priority level assigned to each request (C8 and C9). Finally, the constraints maintaining the maximum capacity of links and queues are enforced (C10 and C11). With C10, the sum of the required bandwidth for all requests whose inquiry or response path, or both, contains link $l$ is guaranteed to be less than or equal to the link's capacity. In C11, the capacity of queues is guaranteed in the same way for each link and each priority level. In this set of constraints, the priority-assignment variable of request $r$ and priority level $k$ is a binary variable that equals 1 only when the priority level assigned to request $r$ is $k$, and $\overrightarrow{f}_{r,p,k}$ and $\overleftarrow{f}_{r,p,k}$ are binary variables that reflect the inquiry and response paths for request $r$ on priority level $k$, respectively.
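The link-capacity condition described for C10 can be checked for a given routing as sketched below. The function names and the dictionary-based data layout are hypothetical, and the sketch assumes that a request whose inquiry and response paths both traverse the same link is counted once on that link, which is one reading of the wording above.

```python
from typing import Dict, Hashable, Set

Link = Hashable  # e.g. a (tail, head) tuple; illustrative alias


def link_loads(bandwidth: Dict[int, float],
               inquiry_links: Dict[int, Set[Link]],
               response_links: Dict[int, Set[Link]]) -> Dict[Link, float]:
    """Aggregate the required bandwidth B_r on every link a request uses."""
    loads: Dict[Link, float] = {}
    for r, b_r in bandwidth.items():
        # Set union: a link on both paths of a request is counted once
        # (assumption of this sketch).
        for link in inquiry_links[r] | response_links[r]:
            loads[link] = loads.get(link, 0.0) + b_r
    return loads


def satisfies_link_capacity(loads: Dict[Link, float],
                            capacity: Dict[Link, float]) -> bool:
    """True if no used link carries more than its capacity B_l."""
    return all(load <= capacity[link] for link, load in loads.items())


bandwidth = {1: 4.0, 2: 3.0}
inquiry = {1: {("a", "b")}, 2: {("a", "b"), ("b", "c")}}
response = {1: {("a", "b")}, 2: {("c", "b")}}
loads = link_loads(bandwidth, inquiry, response)
print(satisfies_link_capacity(loads, {("a", "b"): 7.0, ("b", "c"): 5.0, ("c", "b"): 5.0}))
```

The queue-capacity condition of C11 could be checked the same way by aggregating the burstiness $T_r$ per link and priority level instead of the bandwidth per link.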

C. Delay Constraints
To guarantee the delay requirement of requests, a set of delay constraints should be adhered to, in which $D_{r,k,l}$, $D_{r,s_r}$, and the E2E delay of request $r$ are continuous variables denoting, respectively, the delay experienced by a given flow of request $r$ associated with priority level $k$ passing through ATS-based link $l$ [25], its computing delay, and the corresponding E2E delay, calculated as the sum of the delays on the links that comprise both paths of the request and its computing delay. Besides, the constraints rely on a function that returns the maximum value over a given set and on three sets of requests that share the same link as request $r$: $\mathcal{R}_1$ includes requests with a higher or equal priority, $\mathcal{R}_2$ contains requests with a lower priority, and $\mathcal{R}_3$ contains requests with a higher priority.

D. Objective Function
The objective function, denoted by OF, is to minimize the total cost of allocated computing nodes and network links. As mentioned in Section II, this cost is directly related to the energy consumption of networking and computing elements, and its reduction is a crucial open challenge that must be carefully addressed to enable B5G systems [28], [29], [30], [31].
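As a rough illustration of such a cost, the sketch below charges every node and link its per-unit price ($\Psi_v$ and $\Xi_l$ in Section II) times the amount allocated on it. This linear form, the function name, and the input layout are assumptions made for the example; the exact objective of the paper may weight or aggregate the terms differently.

```python
from typing import Dict, Hashable


def total_cost(node_alloc: Dict[Hashable, float],
               node_price: Dict[Hashable, float],
               link_alloc: Dict[Hashable, float],
               link_price: Dict[Hashable, float]) -> float:
    """Sum of (price per capacity unit) x (allocated units) over nodes and links.

    node_price plays the role of Psi_v and link_price the role of Xi_l.
    The linear aggregation is an assumption of this sketch.
    """
    computing = sum(node_price[v] * units for v, units in node_alloc.items())
    network = sum(link_price[l] * units for l, units in link_alloc.items())
    return computing + network


# Two capacity units on an expensive edge node, one on a cheap core node,
# and three bandwidth units on the link between them.
cost = total_cost({"edge": 2.0, "core": 1.0}, {"edge": 5.0, "core": 1.0},
                  {("edge", "core"): 3.0}, {("edge", "core"): 2.0})
print(cost)
```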

E. Problem
Considering the constraints and objective function, the problem of Communication and Computing Resource Allocation (CCRA) is:
CCRA: min OF s.t. C1 - C15. (1)

IV. FULLY-INFORMED METHODS
In this section, the system is assumed to be fully known, i.e., the list of services and their characteristics are available, and the current state of the network and cloud resources as well as requests and their requirements are being monitored and collected on a regular basis. This could be the case of an industrial environment whereby tasks and communications among robots and devices are pre-planned [32], [33], [34]. Under such scenarios, this section proposes two methods, B&B-CCRA and WF-CCRA, to solve the problem specified by (1). Clearly, an efficient strategy for implementing these methods is to centralize their development as system orchestrator components. Then, when end-users request access to the services, the methods can be executed, and the resulting decisions can be applied to the network and cloud resources using Software-Defined Networking (SDN) and NFV technologies.

A. B&B-CCRA
Suppose that C1 and C5 - C15 are eliminated from (1) and only C2 - C4 affect the problem. Given this, the problem can be reformulated as minimizing the cost of assigned nodes within the capacity constraints of nodes and VNFs, that is, $\min \sum_{\mathcal{R},\mathcal{V}} \Psi_v\, g_{r,v}$ s.t. C2 - C4. If a transformed cost, obtained from $\Psi_v$ and a big positive number $M$, is defined and substituted for $\Psi_v$, the relaxed problem can be rewritten equivalently as a maximization of the same form, $\max \sum_{\mathcal{R},\mathcal{V}} \Psi_v\, g_{r,v}$ s.t. C2 - C4 (with $\Psi_v$ now denoting the transformed cost), which is the Multi-Dimensional Knapsack (MDK) problem with at least $S$ items and $V$ knapsacks. Since the MDK problem is NP-hard [35] and a relaxed version of our problem is as hard as this problem, it follows that our problem is also NP-hard, and finding its optimal solution in polynomial time is intractable unless P = NP. One potential strategy for addressing such a problem is to restrict its solution space using the B&B algorithm, which relaxes and solves the problem to obtain lower bounds, and then improves the bounds using mathematical cuts to reach acceptable solutions. The method is described in Algorithm 1. In this algorithm, the solution space is explored by maintaining an unexplored candidate list $\mathcal{N} = \{N_t \mid t \ge 1\}$, where each node $N_t$ contains a problem, denoted by $\Phi_t$, and $t$ is the iteration number. At the beginning, this list only contains $N_1$, the root candidate, with the primary problem to be solved. To reduce the enormous computational complexity, instead of directly applying the B&B algorithm to CCRA, we consider its integer linear transformation as the problem of $N_1$.
CCRA comprises non-linear constraints C13 and C14. To linearize C13, the summations and max function with variable boundaries should be converted to a linear form. A simple, effective technique is to replace each term with an approximated upper bound. Since the aggregated traffic burstiness is bounded by $T_k$ for each priority level $k$ in C11, $\sum_{\mathcal{R}_1} T_r$ can be replaced by the sum of this bound over all priority levels with a priority higher than or equal to that of $k$, that is $\sum_{\{k' \mid k' \le k\}} T_{k'}$. In a similar way, we define a new constraint (C13'') for the aggregated bandwidth allowed on priority level $k$ over link $l$, dubbed $f_{l,k}$, and replace the sum of allocated bandwidths with $\sum_{\{k' \mid k' < k\}} f_{l,k'}$. Besides, the maximum packet size for a particular subset of requests can be replaced by the maximum permitted packet size in the network, denoted by $H$. Therefore, C13 is replaced by its linear transformation, C13' and C13'', in which $D_{k,l}$ is the delay upper bound on link $l$ with priority level $k$, $\mathcal{K}_1$ is $\{k' \mid k' \le k\}$, and $\mathcal{K}_2$ is $\{k' \mid k' < k\}$. Since $D_{r,s_r}$ is linear, C14 can be linearized by substituting the upper bound derived in C13' for the actual delay, which yields the new constraint C14' for the E2E delay. Given this, the linear transformation of CCRA, dubbed LiCCRA, is as follows:
LiCCRA: min OF s.t. C1 - C12, C13', C13'', C14', C15. (2)
Now, with LiCCRA as $\Phi_1$, each iteration of the B&B algorithm begins with the selection and removal of a candidate from the unexplored list. Then, the problem of this candidate is naturally relaxed and solved, i.e., all the integer variables in the set {0, 1} are replaced with their continuous equivalents restricted by the box constraint [0, 1], and the relaxed problem is solved; the subsequent bounding and branching steps follow Algorithm 1. Alternatively, we can run this algorithm until a desired solving time is reached or an acceptable objective value is acquired. The key advantage of this algorithm is that it produces at least a lower bound even when the solving time is limited. As a result, it may be used to establish baselines allowing for the evaluation of alternative approaches.
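The relax, bound, and branch loop over an unexplored candidate list can be illustrated on a toy problem. The sketch below is not Algorithm 1 and does not solve LiCCRA: it minimizes the cost of a 0/1 selection of items whose total weight must cover a demand (a hypothetical covering-knapsack instance chosen for brevity), and it solves the [0, 1] relaxation greedily instead of calling an LP solver. All names are illustrative.

```python
from typing import Dict, List, Optional, Tuple


def relaxed_bound(costs: List[float], weights: List[float], demand: float,
                  fixed: Dict[int, int]) -> Optional[Tuple[float, Optional[int], List[int]]]:
    """Solve the relaxation with free variables in [0, 1] (weights must be > 0).

    Returns (lower bound, index of the fractional variable or None,
    free items set to 1), or None if the candidate is infeasible.
    """
    cost = sum(costs[i] for i, x in fixed.items() if x == 1)
    remaining = demand - sum(weights[i] for i, x in fixed.items() if x == 1)
    free = sorted((i for i in range(len(costs)) if i not in fixed),
                  key=lambda i: costs[i] / weights[i])
    chosen: List[int] = []
    for i in free:
        if remaining <= 0:
            break
        if weights[i] <= remaining:
            cost += costs[i]
            remaining -= weights[i]
            chosen.append(i)
        else:  # only a fraction of item i is needed in the relaxation
            return cost + costs[i] * remaining / weights[i], i, chosen
    return None if remaining > 0 else (cost, None, chosen)


def branch_and_bound(costs: List[float], weights: List[float], demand: float):
    best_cost, best_items = float("inf"), None
    candidates: List[Dict[int, int]] = [{}]  # unexplored list with the root
    while candidates:
        fixed = candidates.pop()             # select and remove a candidate
        result = relaxed_bound(costs, weights, demand, fixed)
        if result is None:
            continue                         # infeasible: discard
        bound, fractional, chosen = result
        if bound >= best_cost:
            continue                         # cannot beat the incumbent: prune
        if fractional is None:               # relaxed solution is integral
            best_cost = bound
            best_items = sorted([i for i, x in fixed.items() if x == 1] + chosen)
            continue
        candidates.append({**fixed, fractional: 0})  # branch on the
        candidates.append({**fixed, fractional: 1})  # fractional variable
    return best_cost, best_items


print(branch_and_bound([4, 3, 5], [3, 2, 4], 5))
```

Stopping the loop early would still leave the smallest bound among the unexplored candidates as a valid lower bound, which mirrors the time-limited use of the algorithm mentioned above.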

B. WF-CCRA
Since the B&B method searches the problem's solution space for the optimal solution, its complexity can grow up to the size of the solution space in the worst case [36]. Given that the size of the solution space in CCRA (or LiCCRA) for each request is $V^2 P^2 K$ considering its integer variables, the problem's overall size is $R!\,V^2 P^2 K$, considering the number of permutations of $R$ requests. Therefore, finding its optimal solution for large-scale instances using B&B is impractical in a timely manner, and the goal of this section is to devise an efficient approach based on the WF concept in order to identify near-optimal solutions for this problem.
The WF-CCRA method is elaborated in Algorithm 2. The first step is to initialize the vectors of parameters and variables used in (1) (or in (2)). Following that, two empty sets are established: the set of accepted requests, and $\Omega$, which stores the feasible resource combinations for each request during its iteration. Now, the algorithm iterates through each request in $\mathcal{R}$, starting with the one with the most stringent delay requirement, and keeps track of the feasible allocations of VNF, priority, as well as inquiry and response paths based on the constraints of (1) (or (2)). The final steps of each iteration are to choose the allocation with the lowest cost and fix it for the request, as well as to update remaining resources and the set of pending and accepted requests. When there is no pending request, the algorithm terminates.
Algorithm 2 WF-CCRA.
1: initialize variable and parameter vectors
2: accepted-request set ← {}, Ω ← {}
3: sort R in ascending order according to D_r
4: while R is not empty do
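The greedy loop described above can be sketched on a reduced version of the problem in which only the service node is chosen. The `Request` and `Node` types, the fixed per-node `delay`, and the rule that a request without a feasible node is left unplaced are assumptions of this sketch; the full method also enumerates priority levels and inquiry and response paths.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    name: str
    max_delay: float  # D_r
    capacity: float   # C_r


@dataclass
class Node:
    price: float  # Psi_v, cost per capacity unit
    free: float   # remaining computing capacity
    delay: float  # hypothetical fixed delay to reach this node


def wf_allocate(requests: List[Request],
                nodes: Dict[str, Node]) -> Dict[str, Optional[str]]:
    """Serve requests from the most to the least delay-stringent one.

    Note: this sketch updates the `free` capacity of `nodes` in place.
    """
    placement: Dict[str, Optional[str]] = {}
    for req in sorted(requests, key=lambda r: r.max_delay):
        # Feasible allocations for this request (the role of Omega).
        feasible = [name for name, node in nodes.items()
                    if node.free >= req.capacity and node.delay <= req.max_delay]
        if not feasible:
            placement[req.name] = None  # request cannot be accepted
            continue
        # Fix the cheapest feasible allocation and update the resources.
        best = min(feasible, key=lambda name: nodes[name].price * req.capacity)
        nodes[best].free -= req.capacity
        placement[req.name] = best
    return placement


nodes = {"edge": Node(price=5.0, free=2.0, delay=1.0),
         "core": Node(price=1.0, free=10.0, delay=5.0)}
requests = [Request("a", 2.0, 2.0), Request("b", 10.0, 3.0), Request("c", 3.0, 1.0)]
print(wf_allocate(requests, nodes))
```

In this toy run, request `c` stays unplaced because the edge node was filled by the more urgent request `a` and the core node is too far away, which shows how the fixed serving order shapes the outcome.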
The complexity of the WF-CCRA algorithm is $O(RVKP^2)$. Although this approach is significantly more efficient than the B&B algorithm in terms of complexity (it can be executed within milliseconds), its complexity can be further reduced by restricting the number of valid paths between each pair of nodes to a fixed-size set of paths with the lowest costs or smallest number of links. In addition, even though this algorithm implements only one of the $R!$ possible permutations (serving requests in descending order of their urgency) and converges to a solution where the cost of allocating resources to each request is locally minimized, it is expected to provide efficient solutions in terms of accuracy as well. The reason is that, since requests for the same service are of the same quality and requests for all services have stringent QoS requirements, the unique permutations do not vary significantly.

V. PARTIALLY-INFORMED METHOD
In the previous section, we assumed that the system is fully known. In this section, we consider a scenario wherein the system is only partially known, i.e., the state of the available network and cloud resources is tracked in real-time, and the list of services and their associated characteristics have been introduced in advance. However, end-users and the orchestrator do not exchange information pertaining to requests and their requirements. In this particular scenario, to solve the problem stated in (1), we employ the DDQL technique, proposed by Google in the DeepMind project [37]. In what follows, the DDQL concept and its agent, which serves as the core building block of the learning logic, are briefly introduced. The design of the learning algorithm and the architecture of the DDQL-based resource allocation approach are then discussed, along with an analysis of various implementation strategies.

A. Double Deep Q-Learning Agent
RL is a technique wherein an agent is trained to tackle sequential decision problems through trial-and-error interactions with the environment. Q-learning is a widely used RL algorithm wherein the agent learns the value of each action, defined as the sum of future rewards associated with performing that action, and then follows the optimal policy, which is choosing the action with the highest value in each state.
According to Watkins and Dayan [38], one method for obtaining the optimal action-value function is to define a Bellman equation as a straightforward value iteration update using the weighted average of the old value and the new information, that is
$$Q(\theta_\tau, a_\tau) \leftarrow (1 - \sigma)\, Q(\theta_\tau, a_\tau) + \sigma\, Y^{Q}_{\tau},$$
where $\theta_\tau$ and $a_\tau$ are the agent's state and action at time slot $\tau$ respectively, $\sigma$ is a scalar step size, and $Y^{Q}_{\tau}$ is the target, defined by
$$Y^{Q}_{\tau} = \beta_{\tau+1} + \gamma \max_{a \in \mathcal{A}} Q(\theta_{\tau+1}, a),$$
where $\beta_{\tau+1}$ is the reward at time slot $\tau + 1$, $\gamma \in [0, 1]$ is a discount factor that balances the importance of immediate and later rewards, and $\mathcal{A}$ is the set of actions. Since most interesting problems are too large to discover all possible combinations of states and actions and learn all action-values, one potential alternative is to use a Deep Neural Network (DNN) to approximate the action-value function. In a Deep Q-Network (DQN), the state is given as the input and the Q function of all possible actions, denoted by $Q(\theta, \cdot\,; W)$, is generated as the output, where $W$ is the set of DNN parameters. The target of the DQN is as follows:
$$Y^{DQN}_{\tau} = \beta_{\tau+1} + \gamma \max_{a \in \mathcal{A}} Q(\theta_{\tau+1}, a; W_\tau),$$
and the update function of $W$ is
$$W_{\tau+1} = W_\tau + \sigma \left( Y^{DQN}_{\tau} - Q(\theta_\tau, a_\tau; W_\tau) \right) \nabla_{W_\tau} Q(\theta_\tau, a_\tau; W_\tau).$$
To further enhance the efficiency of DQN, it is necessary to consider two additional improvements. The first is the use of an experience memory [39], wherein the observed transitions are stored in a memory bank, and the neural network is updated by randomly sampling from this pool. The authors demonstrated that the concept of experience memory significantly improves the DQN algorithm's performance. The second is to employ the concept of Double Deep Q-Learning (DDQL), introduced in [37]. In both standard Q-learning and DQNs, the max operator selects and evaluates actions using the same values (or the same Q). Consequently, overestimated values are more likely to be selected, resulting in overoptimistic value estimations. DDQL implements decoupled selection and evaluation processes. The following is the definition of the target in DDQL:
$$Y^{DDQL}_{\tau} = \beta_{\tau+1} + \gamma\, Q(\theta_{\tau+1}, a'; W^{-}_{\tau}),$$
where $a' = \operatorname{argmax}_{a \in \mathcal{A}} Q(\theta_{\tau+1}, a; W_\tau)$, and the update function is
$$W_{\tau+1} = W_\tau + \sigma \left( Y^{DDQL}_{\tau} - Q(\theta_\tau, a_\tau; W_\tau) \right) \nabla_{W_\tau} Q(\theta_\tau, a_\tau; W_\tau).$$
In this model, $W$ is the set of weights for the main (or evaluation) Q and is updated in each step, whereas $W^{-}$ is for the target Q and is replaced with the weights of the main network every $t$ steps. In other words, the target Q remains a periodic copy of the main Q. The authors demonstrated that the DDQL algorithm not only mitigates observed overestimations but also significantly improves accuracy. The training procedure of the DDQL agent is depicted in Fig. 3, which includes receiving the environment response and storing it in the memory bank, passing transitions to the evaluation network and updating its weights with the update function, and adjusting the weights of the target network. In this figure, $\theta'$ is the resulting state after applying action $a$.
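The difference between the DQN and DDQL targets can be seen on a single transition. In the sketch below, the two lists stand for the outputs of the main and target networks for the next state; the function names and the numeric values are made up for illustration.

```python
from typing import Sequence


def dqn_target(reward: float, gamma: float, q_next: Sequence[float]) -> float:
    """Selection and evaluation use the same action-value estimates."""
    return reward + gamma * max(q_next)


def ddql_target(reward: float, gamma: float,
                q_main_next: Sequence[float],
                q_target_next: Sequence[float]) -> float:
    """Decoupled target: the main network selects the action,
    the target network evaluates it."""
    best_action = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    return reward + gamma * q_target_next[best_action]


q_main_next = [1.0, 3.0, 2.0]    # Q(theta', . ; W), illustrative values
q_target_next = [0.5, 1.0, 4.0]  # Q(theta', . ; W^-), illustrative values

print(dqn_target(1.0, 0.5, q_target_next))                 # max over one network
print(ddql_target(1.0, 0.5, q_main_next, q_target_next))   # decoupled estimate
```

Here the single-network target picks the largest (possibly overestimated) value of one network, while the decoupled target evaluates the action preferred by the main network with the target network's estimate.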

B. DDQL-CCRA
Since the CCRA problem comprises different sets of variables and their corresponding constraints, to solve it based on the DDQL agent depicted in Fig. 3, the first step is to design a chain of agents, each of which is responsible for addressing one group of the variables. Our proposed chain consists of four DDQL agents. The first agent, denoted by $\Lambda_{SP}$, is intended to determine the location of service VNFs in response to requests ($g$ and $z$), and thus its action set is the set of network nodes. In other words, $a_{SP} \in \mathcal{A}_{SP} = \mathcal{V}$. $\Lambda_{PA}$ is the second agent with action set $\mathcal{K}$, and it is responsible for assigning the priority level of traffic. The remaining two agents route traffic by determining the inquiry path from the entering node to its VNF location and the response path in the opposite direction, denoted by $\Lambda_{QPS}$ and $\Lambda_{PPS}$, respectively. The action set of these agents comprises all possible network paths. To interact with the system, each agent provides an action that contains the index of the request for which it is attempting to satisfy its resource requirements and a value from its action space. For example, $a_{SP} = \{r : 1, \xi : 3\}$ means that the VNF for request 1 should be located in node 3, or $g_{1,3} = 1$. Moreover, $a = \{a_{SP}, a_{PA}, a_{QPS}, a_{PPS}\}$ represents the set of all agents' actions.
Next, the system state has to be formulated. As the infrastructure is the only side of the system that is known, the state is a collection of network and cloud resources, built by concatenating the arrays that describe these resources, where $\oplus$ returns the concatenation of two arrays. When the system receives actions, the state of the available network and cloud resources is updated by deactivating the resources assigned to the associated request, and the resulting state $\theta'$ is generated.
The final step is to design the reward, which is a reaction to the effectiveness of an action after receiving it and shifting from state $\theta$ to the resulting state $\theta'$. In other words, agents are wired to the system via the reward. To address the problem defined in (1), the reward proposed for request $r$ is built from the following quantities: $\max \mathrm{OF}_r$ and $\min \mathrm{OF}_r$ are the maximum and minimum costs that can be achieved by allocating the available network and cloud resources to request $r$ without considering any constraints or requirements, $\mathrm{OF}_{r,a}$ is the cost of the allocations provided by the agents, and $\chi_{r,a}$ represents the response of request $r$ to actions $a$. $\chi_{r,a}$ ensures that all constraints of (1) are met. Consider a set of actions $a$ containing an action that violates one of the constraints (for example, a node or a VNF or a path is overloaded, or a priority level is assigned in such a way that the E2E delay requirement is violated). In this circumstance, the affected request will respond with $\chi_{r,a} = 0$, and the reward for $a$ will be 0.
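A reward with the properties described above (zero for infeasible allocations, larger for cheaper feasible ones) can be sketched as follows. The min-max normalization and the function name are assumptions of this sketch; the exact expression used in the paper may differ.

```python
def reward(cost: float, min_cost: float, max_cost: float, feasible: bool) -> float:
    """Hypothetical reward for one request.

    feasible plays the role of chi_{r,a}; cost, min_cost, and max_cost
    play the roles of OF_{r,a}, min OF_r, and max OF_r.
    """
    if not feasible:
        return 0.0  # any violated constraint zeroes the reward
    if max_cost == min_cost:
        return 1.0  # degenerate case: every allocation costs the same
    # Cheaper allocations map closer to 1, more expensive ones closer to 0.
    return (max_cost - cost) / (max_cost - min_cost)


print(reward(cost=4.0, min_cost=2.0, max_cost=10.0, feasible=True))
print(reward(cost=4.0, min_cost=2.0, max_cost=10.0, feasible=False))
```

Under this particular normalization a feasible allocation at the maximum cost also receives 0, so a small positive offset could be added if feasible and infeasible actions must always be distinguishable.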
Since an infeasible action earns no reward, the probability of selecting it decreases, and after a certain number of iterations, actions with infeasible allocations are implicitly removed from the set of possible actions. In addition, $OF_{r,a}$ controls the efficiency of $a$: after a number of iterations, allocations with lower costs have a greater chance of being selected. Therefore, after training, the agents choose feasible actions (within the constraints of (1)) with lower costs (minimizing OF).

Algorithm 3 details DDQL-CCRA, the learning algorithm proposed to solve the CCRA problem based on DDQL. The algorithm is divided into two phases:

1) Training Phase: In this phase (lines 1 to 24), $T$ represents the number of training steps, whereas the initial and final values of $\epsilon$ are small positive constants that control the $\epsilon$-greedy algorithm. In each step, the set of actions is determined and transmitted to the system, after which the reward and the updated state are received and used to train the agents with the ADAM optimizer [40] and to update their DNN weights via the memory bank. This process is repeated over the set of requests until the specified maximum number of steps is reached. The action of each agent is selected by an $\epsilon$-greedy policy that follows the evaluation function of the corresponding agent with probability $(1 - \epsilon)$ and selects a random action with probability $\epsilon$. The probability $\epsilon$ is decreased linearly from its initial value to its final value during the training process. Using the $\epsilon$-greedy method and the ADAM optimizer ensures the convergence of DDQL-CCRA to feasible, low-cost solutions (based on the defined reward) [41].
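The exploration rule used in the training phase can be sketched as follows. This is a generic $\epsilon$-greedy policy with linear decay, not code taken from Algorithm 3; the parameter names `eps_start` and `eps_end` and the representation of the evaluation function as a list of Q-values are assumptions.

```python
import random


def linear_epsilon(step, total_steps, eps_start, eps_end):
    """Linearly anneal epsilon from eps_start to eps_end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress clipped to [0, 1]
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```

Early in training, `linear_epsilon` returns a value close to `eps_start`, so most actions are random; as the step counter approaches `total_steps`, the policy increasingly follows the agent's own value estimates.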
2) Decision Making Phase: In each step of this phase (lines 25 to 36), one request is selected, and its required resources are allotted by the agents. The decision is then transmitted to the system to collect the responses of the infrastructure and the request. Following this, the reward and the mean reward, denoted by $\beta$, are determined. Fig. 4 depicts the actions generated by the agents, their transmission to the environment, and the subsequent feedback to the agents in preparation for the next decision. Because we have no knowledge of the requests' requirements, every change in the criteria is detected by examining the mean reward; if it falls below a specified threshold, this indicates that end-users have adopted a new policy, and the training phase must be repeated. This procedure continues until the required resources for each request have been determined.
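The retraining trigger described in this phase can be illustrated with a small monitor that tracks the mean reward. The class name, the use of a sliding window, and the window length are assumptions made for this sketch; the text only states that the mean reward is compared with a threshold.

```python
from collections import deque


class RewardMonitor:
    """Track the mean reward over recent decisions and flag when retraining is needed."""

    def __init__(self, threshold, window=100):
        self.threshold = threshold
        self.rewards = deque(maxlen=window)  # only the most recent `window` rewards are kept

    def update(self, reward):
        """Record a reward; return True if the mean reward fell below the threshold."""
        self.rewards.append(reward)
        mean_reward = sum(self.rewards) / len(self.rewards)
        return mean_reward < self.threshold
```

A decision loop would call `update` after each request and switch back to the training phase as soon as it returns `True`.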

C. DDQL-CCRA Resource Allocation Architecture
The architecture of the DDQL-CCRA resource allocation method is depicted in Fig. 5. Because the characteristics of different services may be entirely different, an isolated instance of the DDQL-CCRA algorithm is executed for each service. The broker receives requests, classifies them, and forwards each service's requests to its respective controller. In addition, the broker collects the most recent state of the network and cloud resources from the resource orchestrator and transmits it to the controllers. Each controller is responsible for executing the DDQL-CCRA algorithm by implementing the memory bank, maintaining the state of requests, calculating the reward, and returning action sets to the broker. The broker collects the action sets from all controllers and relays them to the resource orchestrator, which applies them to the infrastructure. Since actions are chosen at random during the training phase, digital twins could be used to evaluate them and prevent the infrastructure from entering unpredictable states that disrupt its operation [42].
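The broker's role in this architecture can be sketched as a simple dispatch step. Everything below is hypothetical scaffolding for illustration: controllers are modeled as plain callables, requests as dictionaries with `id` and `service` keys, and `dummy_controller` stands in for a trained DDQL-CCRA instance.

```python
from collections import defaultdict


def broker_step(requests, controllers, state):
    """Classify requests by service, query each service's controller, collect the action sets."""
    by_service = defaultdict(list)
    for request in requests:
        by_service[request["service"]].append(request)
    action_sets = []
    for service, service_requests in by_service.items():
        controller = controllers[service]  # one isolated DDQL-CCRA instance per service
        action_sets.extend(controller(service_requests, state))
    return action_sets  # to be relayed to the resource orchestrator


def dummy_controller(service_requests, state):
    """Placeholder for a trained controller: proposes node 0 for every request."""
    return [{"r": request["id"], "node": 0} for request in service_requests]


requests = [{"id": 1, "service": "xr"}, {"id": 2, "service": "health"}, {"id": 3, "service": "xr"}]
controllers = {"xr": dummy_controller, "health": dummy_controller}
actions = broker_step(requests, controllers, state=[10, 10, 5])
```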
Fig. 5. DDQL-CCRA Resource Allocation Architecture.

To enhance the scalability of this architecture, rather than taking the set of all nodes as the action set of $\Lambda_{SP}$ and the set of all network paths as the action set of $\Lambda_{QPS}$ and $\Lambda_{PPS}$, these spaces can be pruned to create fixed-size sets consisting of the most likely options for VNF placement and path selection.

• For $\Lambda_{SP}$, the lower and upper boundaries of the QoS requirements for each service can be extracted (or considered inputs to the problem), and then a fixed-size set of candidate nodes is generated, containing the feasible nodes that maintain the QoS boundaries at the lowest cost ($\Psi_v$).
• For $\Lambda_{QPS}$ and $\Lambda_{PPS}$, a fixed-size set of candidate paths is created for each service, containing the feasible paths that maintain the QoS boundaries at the lowest cost ($\Xi_l$). For $\Lambda_{QPS}$, these paths should begin at the edge devices (the entry nodes of requests) and terminate at one of the candidate nodes. For $\Lambda_{PPS}$, the same logic is followed, but in reverse order.

The complexity and accuracy of the DDQL-CCRA algorithm can be tuned by adjusting the sizes of these sets. Large sets can be used if high precision is required or if the complexity of running the DDQL-CCRA algorithm can be handled by high-powered software/hardware. Alternatively, small sets can be used to return the result in a relatively shorter amount of time.
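The node-pruning idea can be sketched as a filter followed by a cost-ordered cut. The node representation (dictionaries with `id`, `delay`, and `cost`), the use of a single upper delay bound as the QoS criterion, and the function name are assumptions for illustration; the same pattern would apply to candidate paths.

```python
def prune_nodes(nodes, delay_bound, size):
    """Keep the `size` cheapest nodes whose delay meets the service's delay bound."""
    feasible = [n for n in nodes if n["delay"] <= delay_bound]  # QoS feasibility filter
    feasible.sort(key=lambda n: n["cost"])                      # cheapest candidates first
    return [n["id"] for n in feasible[:size]]                   # fixed-size action set


nodes = [
    {"id": 0, "delay": 0.5, "cost": 9},
    {"id": 1, "delay": 2.0, "cost": 5},
    {"id": 2, "delay": 2.5, "cost": 4},
    {"id": 3, "delay": 8.0, "cost": 1},
]
candidates = prune_nodes(nodes, delay_bound=3.0, size=2)
```

Here node 3 is the cheapest but violates the delay bound, so the two retained candidates are nodes 2 and 1. Increasing `size` enlarges the agent's action space and therefore the training effort.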

VI. NUMERICAL RESULTS
In this section, the efficiency of the proposed methods is numerically investigated. The system model parameters are listed in Table III, and the configuration of the agents' training procedure is shown in Table IV. Note that the results are obtained on a computer with 8 processing cores, 16 GB of memory, and a 64-bit operating system.

The accuracy of the B&B-CCRA and WF-CCRA methods is illustrated in Fig. 6. The accuracy of a solution $\eta$ for a scenario is defined as $1 - ((\eta - \eta^*)/\eta^*)$, where $\eta^*$ is the scenario's optimal solution, obtained by solving it with CPLEX 12.10. In Fig. 6.A, the accuracy of B&B-CCRA is plotted vs. the solving time (in logarithmic scale) for five scenarios with different network sizes. As illustrated, the accuracy of B&B-CCRA starts at 80% after the first iteration, which is obtained by solving the LP transformation of LiCCRA with CPLEX 12.10 in just a few milliseconds, and increases with the solving time, reaching 92% for all samples after 100 seconds. This indicates that the method can readily provide baseline solutions for small and medium-size use cases. However, the accuracy growth slows as the network size increases, which is expected given the problem's NP-hardness and complexity. In the two remaining subfigures, the accuracy of WF-CCRA is depicted against network size and against the number of requests attempting to use system resources, known as request burstiness. Regardless of network size, WF-CCRA has an average accuracy greater than 99%, implying that it can allocate resources in a near-optimal manner even for large networks. For different numbers of requests, the average accuracy remains high, at greater than 96%. It does, however, decrease slightly as the number of requests increases, which is the cost of simplifying the problem by allocating resources to requests in isolation. Since the decrease is negligible, the algorithm is expected to allocate resources efficiently for large numbers of requests.
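The accuracy metric used in Fig. 6 is straightforward to compute. The helper below is a direct transcription of the definition, with hypothetical argument names; the cost values in the usage note are illustrative and not taken from the experiments.

```python
def accuracy(cost, optimal_cost):
    """Accuracy = 1 - (eta - eta*) / eta*, where eta* is the optimal (minimum) cost."""
    return 1.0 - (cost - optimal_cost) / optimal_cost
```

For instance, a solution whose cost is 8% above the optimum (108 against an optimal 100) has an accuracy of about 0.92, and a solution that matches the optimum has an accuracy of exactly 1.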
The DDQL-CCRA resource allocation architecture, depicted in Fig. 5, is examined in Fig. 7. In this figure, the mean cost and E2E delay per supported request, as well as the percentage of supported requests, are plotted against the DDQL-CCRA iteration counter for three scenarios with different E2E delay requirements. To supplement the analysis, the figure also includes the outcomes of WF-CCRA alongside R-, CM-, and DM-CCRA. In R-CCRA, all allocations are determined at random, whereas in CM- and DM-CCRA, allocations are made to minimize cost and delay, respectively, without considering other constraints. Note that to implement DDQL-CCRA, we deployed the DDQL-CCRA resource allocation architecture on all edge devices (the entry nodes).
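The three baseline policies can be illustrated for the node-selection step alone. This is a simplified sketch: the actual baselines also choose priorities and paths, and the node dictionaries and function names are assumptions.

```python
import random


def r_ccra(nodes, rng=random):
    """R-CCRA: pick a node uniformly at random."""
    return rng.choice(nodes)["id"]


def cm_ccra(nodes):
    """CM-CCRA: pick the cheapest node, ignoring all other constraints."""
    return min(nodes, key=lambda n: n["cost"])["id"]


def dm_ccra(nodes):
    """DM-CCRA: pick the node with the smallest delay, ignoring all other constraints."""
    return min(nodes, key=lambda n: n["delay"])["id"]


# One illustrative node per tier: tier 1 is fast but costly, tier 3 is cheap but slow.
tiers = [
    {"id": 0, "cost": 9, "delay": 0.2},
    {"id": 1, "cost": 5, "delay": 2.0},
    {"id": 2, "cost": 1, "delay": 8.0},
]
```

Because `cm_ccra` and `dm_ccra` always return the same node, their support rate is bounded by that single node's capacity, which matches the behavior discussed for these baselines.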
When $D_r$ is less than 1 ms, the only feasible solution is to assign all requests to the most costly nodes, those of the first tier. Consequently, the mean cost is high for all techniques except CM-CCRA, which attempts to fit all requests into one of the third-tier nodes with the lowest cost and therefore cannot support any request, yielding a mean cost of 0. Since the delay of all first-tier nodes is very low, the average delay per supported request is similar and below 1 ms for all methods except CM-CCRA. However, the supported request rate differs markedly between methods. R-CCRA, which assigns nodes evenly to requests, places a third of the requests on the first tier, so its rate is approximately 33%. DM-CCRA selects the node with the shortest E2E delay; hence, its support rate is the share of requests that can be served by a single first-tier node, approximately 45%. Because DDQL-CCRA employs the $\epsilon$-greedy technique, it also produces random results during the initial learning iterations. However, as learning progresses, it receives rewards based on end-user responses and places more and more requests on the first tier until it reaches the near-optimal solutions supplied by WF-CCRA.
When the E2E delay requirement is changed to 3 ms, both first- and second-tier nodes can be used to support requests. Since DM- and CM-CCRA always select a node in the first and third tiers, respectively, their outcomes are identical to those of the preceding scenario. R-CCRA doubles its percentage of supported requests because it randomly assigns 66% of requests to the first and second tiers. In addition, its mean delay is slightly smaller than that of WF- and DDQL-CCRA since it uses first-tier nodes more than these two cost-effective approaches. Note that the difference is negligible, as the delay of first-tier nodes is vanishingly small and cannot significantly affect the mean delay. In contrast, when DDQL-CCRA detects a change in requirements (lines 35 and 36 of Algorithm 3), it restarts the learning process and re-enables the $\epsilon$-greedy technique. Therefore, it begins anew with random results and then optimizes the allocation by fitting as many requests as feasible into the second-tier nodes in ascending cost order. In this scenario as well, the learning technique yields near-optimal outcomes.
The final scenario eliminates the delay requirement and releases the entire infrastructure to serve requests. In this case, although the results for DM-CCRA are unchanged, the support rate of CM-CCRA is approximately 50%, indicating that the node with the lowest cost can serve approximately half of the requests. As in the prior scenario, the outcomes of R-CCRA improve: it can now support all requests, but its mean cost and delay are not optimal because it consumes the resources of all tiers equally. The trend for DDQL-CCRA is also the same. As soon as it senses a change in requirements, it begins to assign resources randomly, learning that requests should be sent to the core clouds as much as possible. It initially determines that the node with the lowest cost yields the best outcome. Therefore, it places all requests on a single node, thereby reducing the number of supported requests and enhancing the mean delay and cost. Subsequently, the reward of this allocation begins to decline as certain requests cannot be supported, the value of dispersing requests throughout the third tier increases progressively, and the optimal policy leads to an increase in the support rate, coupled with a reduction in the mean cost and delay. Fig. 7 thus shows that DDQL-CCRA, operating in partially known systems, can approach the near-optimal solutions obtained when those systems are fully known.
In Fig. 8, DDQL-CCRA is investigated with regard to request burstiness. The mean cost and E2E delay of each supported request are depicted in Fig. 8.A and Fig. 8.C, respectively, whereas Fig. 8.B illustrates the number of supported requests. In this figure, the results of DDQL-CCRA are compared with those of WF-CCRA, FSA [13], BSA [13], CEP [10], A-DDPG [11], and MDRL-SaDS [14]. FSA is a heuristic algorithm that randomly assigns resources to requests in descending order of their required computing capacity. BSA is a similar method that assigns resources in descending order of their remaining capacity to the sorted requests. In CEP, resources are allocated with the aim of minimizing the total cost of links. A-DDPG is an RL method that adjusts the reward for each request to maximize its overall utility. In this solution, utility is defined as the profit of serving the request as a function of its required bandwidth minus the E2E path delay experienced. MDRL-SaDS is another RL technique in which the reward is the computing and networking cost of serving each request divided by the total cost of allocated resources across the infrastructure. This strategy seeks to minimize the cost of allocated resources in relation to their energy consumption.
Evidently, the number of requests supported by FSA is relatively high, as are its mean cost and E2E delay. FSA distributes requests across all tiers, resulting in significant utilization of all resources, a high mean cost, and a high mean E2E delay. Since all links have a similar cost, CEP exhibits comparable performance; however, because it considers the feasibility of links, it achieves slightly better results. A-DDPG, where requests are assigned to nodes with a lower E2E delay, is another costly method. Increasing the number of requests causes first- and second-tier nodes to become occupied and requests to be assigned to the resources of other tiers, thereby increasing the E2E delay. In BSA, because the performance metric is the remaining capacity of computing nodes and the nodes are ordered from low capacity to high capacity across the tiers, the nodes are occupied from the cloud to the edge, resulting in outcomes with very low cost and moderate delay. A-DDPG and BSA cannot support a substantial number of requests because they do not explicitly evaluate the feasibility of links. MDRL-SaDS is the most inefficient method, as it routes requests to the node with the lowest cost. Therefore, the number of supported requests is proportional to the node's capacity and delay. The behavior of the two remaining methods, DDQL- and WF-CCRA, is comparable. They support requests by initially assigning third-tier nodes. Then, once this tier is occupied, they proceed to occupy the second tier, resulting in an exponential cost increase. The mean cost converges to a fixed value when all resources are occupied and the first tier is in use. This approach results in a very low E2E delay because it assigns request priorities based on their delay requirements (unlike other approaches, which are unaware of ATS queues). As demonstrated, DDQL-CCRA can provide near-optimal solutions regardless of the number of requests received.

Fig. 9. ..., BSA [13], CEP [10], A-DDPG [11], and MDRL-SaDS [14] vs. network size. In this scenario, the first four nodes are added to the first tier, followed by the second four nodes to the second tier, and then the last four nodes to the third tier. The delay requirement is 10 ms, and there are a total of 300 requests. The results are calculated as a moving average with a window size of 20, where each sample is the average of 50 arbitrary systems.
The final figure compares the proposed approaches with the approaches of the preceding figure for various network sizes. In this scenario, if $V \le 15$, the first-tier network has a very high capacity, whereas if $V > 15$, there are sufficient resources to fulfill all requests. Therefore, CEP and A-DDPG, which focus on minimizing the cost of allocated links and the E2E delay, respectively, as well as FSA, which allocates resources randomly, can support a large number of requests despite the high cost of allocations. Since the capacity of the low-cost tier for BSA and MDRL-SaDS is not excessive (and the capacity ratio of this tier to the others is lower than in the previous figure), their request support rate is unpromising despite the low cost. When the infrastructure is full ($V \le 15$), the results of WF- and DDQL-CCRA are comparable to those of the other algorithms. However, when there are more resources ($V > 15$), these two approaches move requests to low-cost resources, thereby reducing the total cost of allocations. Regarding the E2E delay, although the results are similar for all methods, adding a node to the initial network extends the resources of the first tier, so more requests can be supported with smaller delays, resulting in a sudden decrease at $V = 10$. Adding more nodes, however, adds resources to the other tiers, and FSA (which allocates resources randomly), BSA (which assigns resources in descending order of their remaining capacity), and A-DDPG (which tries to maximize the overall utility) migrate requests to the lower-cost tiers, resulting in a slight increase in E2E delay. Although the delay of WF- and DDQL-CCRA also increases, they still achieve the lowest E2E delay because they manage priority queues according to the delay requirements of requests.

VII. CONCLUSION
In this paper, the joint problem of communication and computing resource allocation, comprising VNF placement and assignment, traffic prioritization, and path selection under capacity and delay constraints, was studied. The primary objective was to minimize the total cost of allocations. We initially formulated the problem as an MINLP model and used a method, named B&B-CCRA, to solve it optimally. Then, due to the complexity of B&B-CCRA, a WF-based approach was developed to find near-optimal solutions in a timely manner. These two methods can be used to solve the problem when the system is fully known. However, for scenarios in which there is no request-specific information, a DDQL-based architecture was presented that yields near-optimal solutions. The efficiency of the proposed methods was demonstrated by numerical results.
As potential future work, we intend to address the problem in more dynamic environments in which end-users are mobile and all of their needs are subject to change. In addition, the proposed methods could be extended to account for dynamic infrastructure resources, in which the cost of resources (such as their energy consumption) or their availability can fluctuate over time. In such highly dynamic scenarios, we intend to enhance the proposed DDQL-based method with Continual Learning in order to reduce the adaptation time required to adjust agents after each change. Another possible research direction is to extend the problem to include radio domain resources (such as power control, channel assignments, rate control, and relay selection in multi-hop scenarios), thereby providing end-user-to-end-user resource allocations. Furthermore, we intend to improve the proposed method to allocate resources to VNF chains rather than individual VNFs.

Fig. 4. Data flow between the agents and the system.

Fig. 6. Solution accuracy of A) B&B-CCRA vs. solving time, B) WF-CCRA vs. network size, and C) WF-CCRA vs. request burstiness. In subfigures A and B, the number of requests is set to 200, and the number of network nodes in subfigure C is 20. In subfigures B and C, for each number of nodes or requests, 50 random systems are generated, and the problem is solved using WF-CCRA, with B&B-CCRA providing the optimal solution. The results of random samples are represented by blue dots, and the aggregated results are represented through boxplots, where red points indicate medians.
Fig. 7. DDQL-CCRA convergence compared with WF-CCRA, R-CCRA, CM-CCRA, and DM-CCRA. The number of requests is 150, the number of network nodes is 30, and all requests can be serviced by a single tier's resources. The results are calculated as a moving average with a window size of 100 in order to capture trend lines.

Fig. 8. A) Mean cost of each supported request, B) number of supported requests, and C) mean E2E delay of each supported request for DDQL-CCRA, WF-CCRA, FSA & BSA [13], CEP [10], A-DDPG [11], and MDRL-SaDS [14] vs. request burstiness. The delay requirement of requests is 10 ms, and the number of network nodes is 9. The results are calculated as a moving average with a window size of 100, where each sample is the average of 50 arbitrary systems.
... is solved using a Linear Programming (LP) solver to obtain the solution of the relaxed problem $(\mu_t, \lambda_t)$ and the optimal objective value $\phi_t$, where $\mu$ is the set of relaxed integer variables and $\lambda$ is the set of continuous variables. Next, if all relaxed variables have integer values, the objective obtained in this iteration is used to update the best explored integer solution. Otherwise, a variable index $j$ is selected such that $\mu_t[j]$ is fractional, and the feasible constraint set $\pi_t$ is divided into two parts, $\pi^1_t = \pi_t \cap \{\mu[j] \le \lfloor \mu_t[j] \rfloor\}$ and $\pi^2_t = \pi_t \cap \{\mu[j] \ge \lceil \mu_t[j] \rceil\}$. Then, two problems are formed as $\Phi^1_t = \min OF$ s.t. $\pi^1_t$ and $\Phi^2_t = \min OF$ s.t. $\pi^2_t$. Now, two child nodes $N^1_t$ and $N^2_t$, whose problems are $\Phi^1_t$ and $\Phi^2_t$, respectively, are put into the unexplored list. The B&B algorithm is iterated until $N$ is empty.
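The branching step of this procedure can be sketched without an LP solver. The helper below only finds a fractional entry of a relaxed solution and returns the two bounds that define the child problems; selecting the first fractional index and the tolerance value are assumptions, since the text does not specify a selection rule.

```python
import math


def branch(solution, tol=1e-9):
    """Return (j, floor bound, ceil bound) for the first fractional entry, or None if all are integral."""
    for j, value in enumerate(solution):
        if abs(value - round(value)) > tol:  # entry j is fractional
            # Child 1 adds the constraint mu[j] <= floor(value); child 2 adds mu[j] >= ceil(value).
            return j, math.floor(value), math.ceil(value)
    return None
```

For a relaxed solution such as `[1.0, 2.6, 0.5]`, the helper selects index 1 and yields the bounds 2 and 3, so the two child nodes exclude the fractional region between them. When it returns `None`, the relaxed solution is already integral and can be used to update the best explored solution.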

TABLE III. SIMULATION PARAMETERS.

TABLE IV. TRAINING CONFIGURATION.