Wireless Edge Machine Learning: Resource Allocation and Trade-Offs

The aim of this paper is to propose a resource allocation strategy for dynamic training and inference of machine learning tasks at the edge of the wireless network, with the goal of exploring the trade-off between energy, delay and learning accuracy. The scenario of interest is composed of a set of devices sending a continuous flow of data to an edge server that extracts relevant information running online learning algorithms, within the emerging framework known as Edge Machine Learning (EML). Taking into account the limitations of the edge servers, with respect to a cloud, and the scarcity of resources of mobile devices, we focus on the efficient allocation of radio (e.g., data rate, quantization) and computation (e.g., CPU scheduling) resources, to strike the best trade-off between energy consumption and quality of the EML service, including service end-to-end (E2E) delay and accuracy of the learning task. To this aim, we propose two different dynamic strategies: (i) The first method aims to minimize the system energy consumption, under constraints on E2E service delay and accuracy; (ii) the second method aims to optimize the learning accuracy, while guaranteeing an E2E delay and a bounded average energy consumption. Then, we present a dynamic resource allocation framework for EML based on stochastic Lyapunov optimization. Our low-complexity algorithms do not require any prior knowledge on the statistics of wireless channels, data arrivals, and data probability distributions. Furthermore, our strategies can incorporate prior knowledge regarding the model underlying the observed data, or can work in a totally data-driven fashion. Several numerical results on synthetic and real data assess the performance of the proposed approach.

albeit limited, at the edge of the network, typically within the Radio Access Network or at an aggregation point of the core network. In particular, recent surveys on 5G architectures place the MEC functionalities and facilities behind the User Plane Function (UPF) [3]. Thus, the main advantage of MEC is its proximity to the end users, which enables low end-to-end (E2E) latency services. For a recent survey of MEC in 5G and beyond systems, the interested reader can refer to [4]. Furthermore, the convergence of communication, computation and control is fundamental to enable mission-critical applications in many scenarios. As an example, 5G and beyond networks, aided by edge computing, are foreseen to enable real-time industrial automation² with low E2E delay and extremely high reliability, possibly from an energy-efficient perspective, to reduce the global carbon footprint of the ICT industry [5].
In this new heterogeneous context, the notions of latency and reliability need to evolve from classical communication-related concepts. The traditional definition of E2E delay takes into account the overall latency from the transmission of a packet until its successful decoding at the receiver. However, since future services will involve communication and computation, the E2E delay must take into account the time elapsing from the generation of a new request/data unit/task by a peripheral device, to the time in which the edge server completes the computations necessary to fulfill the request. In parallel, also the definition of reliability needs a revision. From a pure communication perspective, reliability is associated to the probability of successfully decoding the received packets. However, whenever the goal of communication is that the Edge Server (ES) is able to take decisions about the data transmitted by the peripheral devices, a new definition of reliability should be adopted, associated to the reliability of the decisions taken on the basis of the received packets. This opens a new perspective calling for a holistic vision building on the integration of communication, computation, caching, and control.
To be more specific, one of the key applications of edge computing is the development of Machine Learning (ML) algorithms that run at the edge, e.g., in close proximity to the industrial facilities, for different purposes such as control decision making, anomaly detection, monitoring and maintenance. The new paradigm of integrating wireless networks with ML at the edge is known in the literature as Edge Machine Learning (EML). In the EML context, it is important to control reliability not only from the communication point of view, but also from the computation point of view, assessing the accuracy of the decisions taken by the edge server, which can involve tasks such as prediction, estimation, classification, and so on. In a nutshell, the goal of EML is to devise resource allocation strategies that enable machine learning at the wireless network edge with low energy consumption, low E2E delay and high learning/inference accuracy. It is then clear that enabling edge machine learning introduces novel fundamental problems in terms of jointly optimizing communication (e.g., power, bits, source encoding, etc.), computation (e.g., CPU cycles, number of active servers/cores, etc.), and inference/training (e.g., choice and splitting of a (deep) learning architecture, model and/or data partitioning, etc.) to meet the system constraints (e.g., E2E delay, reliability, energy) while guaranteeing a prescribed performance of the inference task. The main contribution of this paper is to propose a joint optimization leading to a control action, to be performed in real time, whose goal is to strike an optimal trade-off between energy, delay and accuracy, while coping with the time-variability of radio channels, data arrivals, computation loads, memory availability, etc. In particular, as illustrated in Fig. 1, it is possible to consider different trade-offs among the main variables involved in EML, e.g., the trade-off between energy consumption and delay, accuracy and delay, energy consumption and accuracy, or their joint combination. In this way, the performance requirements (e.g., accuracy, convergence rate, etc.) of the learning task play a key role in selecting the best allocation of radio and computation resources.
² H2020 EU/Taiwan Project 5G CONNI, https://5g-conni.eu/

A. RELATED WORKS
In the last few years, there has been a huge interest in edge computing, from communication to computation and caching perspectives, as well as the investigation of a tight integration of the computing paradigm in the context of wireless communication networks [4], [6], [7]. Since our work focuses on dynamic resource allocation strategies for computation offloading of machine learning tasks, in the sequel we review the general literature on dynamic computation offloading and the recent advances in EML.

1) COMPUTATION OFFLOADING
The goal of computation offloading is to move the execution of computationally heavy applications from mobile or IoT devices to nearby edge servers. The motivation for using computation offloading is threefold: i) empower simple IoT devices with superior computational capabilities, as available at the server; ii) reduce the energy consumption at the resource-hungry mobile devices; iii) reduce E2E service delay, whenever the sum of transmission and computation time at the edge is smaller than the computation time at the mobile device. In particular, dynamic computation offloading refers to applications that continuously generate tasks or data to be sent to the edge server for processing. A typical example is continuous sensor data acquisition, with the aim of performing real-time data analytics for different purposes, such as anomaly detection or prediction. Several works investigate the problem of radio and computation resource allocation for dynamic computation offloading [8]-[17]. In [8], a dynamic formulation is proposed, with a strategy based on Lyapunov optimization in a cloud computing framework. In [9], the authors consider a fog-enabled Device-to-Device (D2D) scenario and propose a strategy for the mutual association of mobile devices to offload tasks among each other. User assignment is also addressed in [10], with the goal of optimizing the average delay under energy constraints, with a penalty function that discourages frequent handovers. The assignment strategy is based on a multi-armed bandit algorithm to learn the optimal association. In [11], we propose a dynamic computation offloading algorithm to jointly optimize computation and communication resources and the assignment of mobile users to APs and edge servers, merging tools from stochastic optimization and matching theory. In [12], a Lyapunov-based strategy is proposed for the joint optimization of radio and computation resources, to minimize the users' energy consumption under E2E delay constraints. In [15], the authors investigate a scenario with multiple APs and edge servers, where an assignment strategy based on matching theory is proposed, coupled with the tools of Lyapunov optimization and Extreme Value Theory to control reliability. The interested reader is referred to the recent surveys related to MEC and computation offloading [4], [6], [18], [19].
All the aforementioned works present general formulations for computation offloading, without explicitly taking into account the requirements of the offloaded tasks. They mainly focus on energy efficiency with latency guarantees and/or reliability over the wireless interface, with a high level and general description of the application in terms of computational requests, but without investigating the requirements associated to the application layer, e.g., the accuracy of the learning tasks to be offloaded.

2) EDGE MACHINE LEARNING
A first general introduction to EML can be found in [20], where the authors present several possible trade-offs. Other recent general surveys can be found in [21]-[23]. Going into more specific contributions, the authors of [24] consider an edge machine learning system, where an edge processor runs an algorithm based on Stochastic Gradient Descent (SGD). In particular, they investigate the trade-off between latency and accuracy, by optimizing the packet payload size, given the overhead of each data packet transmission and the ratio between the computation and communication rates. In [25], the authors propose an algorithm to maximize the learning accuracy under latency constraints, while the authors of [26] present a distributed machine learning algorithm at the edge, where wireless devices collaboratively minimize an empirical loss function with the help of a remote server. The authors of [27] propose a communication-efficient decentralized machine learning algorithm that dynamically optimizes a stochastic quantization method, with applications to regression and image classification. The authors of [28] consider generic distributed machine learning algorithms at the edge, based on SGD, investigating the trade-off between local update and global aggregation. In [29], the authors present a data compression algorithm to reduce the communication burden and energy consumption of an IoT network, to enable machine learning with a desired target accuracy. Finally, an important research topic related to EML is federated learning (FL) [30]-[36]. In FL, multiple edge devices perform local model updates on collected data, and the server then takes a weighted average of the resulting models. Since no data are sent to the ES, but only local gradient updates, an energy efficient, low latency, privacy preserving training at the edge is enabled. In [31], two update methods to reduce the uplink communication costs for FL are proposed. In [34], the authors present a practical update method for a deep FL algorithm with an extensive empirical evaluation for different FL models. Reference [36] investigates the problem of joint power and resource allocation for ultra-reliable low latency communication in vehicular networks. The interested reader can refer to [30] for a comprehensive survey on FL.

B. CONTRIBUTION
In this work, we propose a dynamic algorithmic framework, whose goal is to strike an optimal balance between energy consumption (both for communication and computation), E2E delay, and learning/inference performance, enabling training and inference tasks at the edge of the wireless network. Differently from previous works on computation offloading and EML, we assume a goal-oriented communication perspective, where the scope of the communication is not necessarily to convey all bits reliably within a given time constraint, but to send enough data to enable the edge server to take decisions with the desired accuracy, thus striking the best trade-off between energy consumption, E2E delay and accuracy. To achieve this goal, we dynamically act on the source encoder to adjust the transmission rate, while still fulfilling the goal of the learning task. The idea is to tolerate a small amount of distortion on the received data, to achieve a better energy-delay trade-off, but still being able to satisfy the accuracy requirements of the learning task. In particular, we focus on two different resource allocation strategies:
1) Minimum energy under latency and learning performance constraints. For this first class of algorithms, we consider two different sub-classes:
• Model-based EML: In this case, we exploit the fact that, for some learning algorithms and data models, it is possible to provide closed form expressions for the accuracy, which can be used to seek the minimum energy strategy with guaranteed E2E delay and learning performance constraints;
• Data-driven EML: In this case, without assuming any model for the data, we hinge on performance metrics that can be practically measured online, to devise a dynamic method that aims to minimize the energy consumption under E2E delay and learning guarantees. Indeed, for some learning tasks such as estimation and prediction, the accuracy (e.g., the Mean Squared Error) can be estimated online from the data within a limited delay.
2) Best learning/inference performance under latency and energy constraints. In this case, we assume that no model is available, and the performance cannot be measured online. This is typical of some learning tasks such as, e.g., classification. In this case, it might be impossible to set a constraint on the learning performance, and it is more convenient to rely on a best accuracy strategy, under E2E delay and energy constraints, which are strongly related to application requirements and physical needs (e.g., battery lifetime).
For the first class of problems (i.e., minimum energy), we present an application involving estimation/prediction based on Least Mean Squares (LMS), using a synthetic and a real dataset. For the second class of problems (i.e., best learning accuracy) we provide results on classification over two different real datasets, involving a Support Vector Machine (SVM) and a Neural Network (NN). The simulations are carried out over both synthetic and real data sets to show how, without any prior knowledge on the statistics of radio channels, pattern arrivals, and data distributions, the proposed methods are able to strike the desired trade-off between energy, latency and learning/inference accuracy.
FIGURE 2. Network model: Sensors offloading data via the Access Point (AP) to the Edge Server (ES). The ES runs the learning/inference task (e.g., estimation, prediction, classification) and feeds the quantization levels back to the sensors.

C. OUTLINE
The paper is organized as follows: in Sec. II we present the system model, comprising energy consumption, delay and learning accuracy; in Sec. III, we introduce the minimum energy strategy, starting from the problem formulation and then presenting the algorithmic solution; in this case, we consider both a model-based and a data-driven approach. Similarly, in Sec. IV we present the best accuracy strategy with E2E delay and energy constraints. In Sec. V, we customize our frameworks to LMS estimation, SVM and NN classification, showing numerical results for both resource allocation strategies. Finally, Sec. VI draws some conclusions and future directions.

II. SYSTEM MODEL
In this section, we present the system model. In particular, we first present the energy consumption model, both for the communication side (devices' energy consumption), and for the computation side (ES's energy consumption). Then, we present the performance metrics used throughout the paper, namely latency and learning accuracy. We consider a scenario with K sensors and an Access Point (AP) equipped with an ES, as illustrated in Fig. 2. Each sensor captures data from the environment and uploads them to the server through the wireless connection with the AP. The server collects data and runs a learning/inference algorithm that requires certain performance in terms of E2E delay and accuracy.

A. ENERGY CONSUMPTION
In the sequel, we illustrate the model used to quantify the energy consumption of mobile devices/sensors and of the ES running the EML tasks.

1) DEVICE ENERGY CONSUMPTION
Since we deal with a dynamic scenario, we consider time as organized in time slots of equal duration $\tau$. In each time slot $t$, a generic sensor device $k$ transmits an amount of data that depends on its data rate $R_k(t)$ (expressed in bit/s), as detailed later in this section. Then, inverting the well-known Shannon capacity formula, the power spent for transmission during time slot $t$ is given by
$$p_k(t) = \frac{B_k N_0}{h_k(t)}\left[2^{R_k(t)/B_k} - 1\right], \tag{1}$$
where $B_k$ is the bandwidth allocated to sensor $k$, $h_k(t)$ is the time-varying channel power gain, and $N_0$ represents the noise power spectral density at the receiver. Then, the overall sensor energy consumed during time slot $t$ is given by
$$e^{\rm s}(t) = \tau \sum_{k=1}^{K} p_k(t). \tag{2}$$
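As a quick numerical illustration of the device-side model, the sketch below inverts the Shannon capacity formula $R = B \log_2(1 + p h / (N_0 B))$ to obtain the transmit power needed for a target rate, and multiplies by the slot duration to get the per-slot energy. The function names and the numerical values are assumptions chosen for illustration only; they are not taken from the paper.

```python
import math


def transmit_power(rate_bps, bandwidth_hz, channel_gain, noise_psd):
    """Power (W) needed to sustain rate_bps, from inverting the Shannon formula."""
    return (bandwidth_hz * noise_psd / channel_gain) * (
        2.0 ** (rate_bps / bandwidth_hz) - 1.0
    )


def shannon_rate(power_w, bandwidth_hz, channel_gain, noise_psd):
    """Achievable rate (bit/s) for a given transmit power."""
    snr = power_w * channel_gain / (noise_psd * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


def slot_energy(rate_bps, bandwidth_hz, channel_gain, noise_psd, slot_s):
    """Energy (J) spent by one device in one slot of duration slot_s."""
    return slot_s * transmit_power(rate_bps, bandwidth_hz, channel_gain, noise_psd)


if __name__ == "__main__":
    # Illustrative (assumed) values: 1 MHz bandwidth, -100 dB channel gain,
    # noise PSD of about -174 dBm/Hz, 10 ms slots, 5 Mbit/s target rate.
    p = transmit_power(5e6, 1e6, 1e-10, 4e-21)
    print(f"power: {p:.3e} W, energy per slot: {0.01 * p:.3e} J")
```

The exponential dependence of the power on the rate is what makes a coarser quantization (fewer bits to send per data unit) attractive from an energy standpoint.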

2) EDGE SERVER ENERGY CONSUMPTION
The dynamic energy consumption of a CPU is highly dependent on its clock frequency $f_c(t)$, which we assume to be dynamically scaled, when possible, to reduce the energy consumption of the processor [37], [38]. In particular, we assume that $f_c(t)$ can be selected from a discrete finite set $\mathcal{F}$, with a maximum CPU cycle frequency denoted by $f_{\max}$, and we exploit a widely used cubic model for the energy consumption, described as
$$e^{\rm c}(t) = \tau \kappa f_c^3(t), \tag{3}$$
where $\kappa$ is the effective switched capacitance of the processor [37], [38]. Furthermore, since we consider a multiuser scenario, where the edge server does not have the virtually infinite computational capabilities of a central cloud, we assume that the CPU time is shared across the tasks required by each device. This is equivalent to allocating a portion $f_k(t)$ of the CPU clock frequency $f_c(t)$ to each device $k$, while imposing
$$\sum_{k=1}^{K} f_k(t) \le f_c(t).$$
Finally, from (2) and (3), the total system energy consumption at time $t$ is given by
$$e^{\rm tot}(t) = e^{\rm s}(t) + e^{\rm c}(t) = \tau \sum_{k=1}^{K} p_k(t) + \tau \kappa f_c^3(t). \tag{4}$$
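The server-side model can be sketched in a few lines. The code below assumes that the cubic model gives the CPU power $\kappa f_c^3$, so that the per-slot energy is obtained by multiplying by the slot duration; the function names and the sample values of $\kappa$ and $f_c$ are hypothetical and only serve to show the orders of magnitude and the feasibility check on the CPU shares.

```python
def cpu_energy(f_c_hz, kappa, slot_s):
    """Per-slot CPU energy (J) under the cubic model: slot_s * kappa * f_c^3."""
    return slot_s * kappa * f_c_hz ** 3


def is_feasible_share(shares_hz, f_c_hz):
    """CPU shares must be non-negative and sum to at most the clock frequency."""
    return all(f >= 0.0 for f in shares_hz) and sum(shares_hz) <= f_c_hz


def total_energy(tx_powers_w, f_c_hz, kappa, slot_s):
    """Total per-slot energy: devices' transmit energy plus server CPU energy."""
    return slot_s * sum(tx_powers_w) + cpu_energy(f_c_hz, kappa, slot_s)


if __name__ == "__main__":
    # Assumed values: kappa = 1e-27, 2 GHz clock, 10 ms slot, two devices.
    print(cpu_energy(2e9, 1e-27, 0.01))
    print(is_feasible_share([1.0e9, 0.5e9], 2e9))
    print(total_energy([1e-3, 2e-3], 2e9, 1e-27, 0.01))
```

Because of the cubic law, halving the clock frequency reduces the CPU energy by a factor of eight, which is why scaling $f_c(t)$ down whenever the queues allow it pays off.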

B. DELAY AND QUEUEING MODEL
In this paper, we consider a continuous flow of data that are generated locally by the sensor devices and uploaded to the ES, which processes them by running an online learning algorithm. Then, the overall delay experienced by a data unit from its generation to its processing at the ES is given by the sum of: i) the uplink queueing delay, ii) the transmission delay, iii) the queueing delay at the ES, and iv) the computation time at the ES. Thus, proceeding as in [12], [39], we define an uplink transmission queue $Q^l_k(t)$, and a remote queue $Q^m_k(t)$ of data to be processed at the ES. The uplink queue is fed by the new task arrivals, and drained by the uplink data transmission. Since the goal of our system is to accomplish tasks, e.g., to perform image recognition, we need to identify the size of the smallest data unit that can be processed individually. For example, in image processing, the data unit is one image. Each task is characterized by the following quantities: the number of samples composing each data unit to be processed and the number of CPU cycles necessary to process each data unit. We denote by $M_k$ the number of samples in one data unit, for example the number of features extracted from a data set or the number of pixels in an image. In our dynamic resource allocation strategy, we adapt the number of bits per sample in order to find the best trade-off between energy consumption, service delay and learning accuracy. Denoting by $n^q_k(t)$ the number of bits per sample used by device $k$ in time slot $t$, a transmitted data unit is represented by $M_k n^q_k(t)$ bits. We assume here, for simplicity, that the data to be quantized are statistically independent, as they are the result of a source encoder that has removed any unnecessary redundancy. Furthermore, we assume that the granularity with which we adapt the quantization level is the time slot, so that within each time slot $n^q_k(t)$ is constant, i.e., all the data units transmitted in the same slot are quantized with the same number of bits. Hence, since in each slot we transmit a number of bits equal to $\tau R_k(t)$, the number of data units/tasks transmitted during time slot $t$ is
$$N^u_k(t) = \left\lfloor \frac{\tau R_k(t)}{M_k\, n^q_k(t)} \right\rfloor, \tag{5}$$
where $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$. Then, the local queue, indicating the number of tasks to be completed, evolves as follows:
$$Q^l_k(t+1) = \max\left(0,\, Q^l_k(t) - N^u_k(t)\right) + A_k(t), \tag{6}$$
where $A_k(t)$ denotes the new data arrivals, for example the number of images generated in time slot $t$. The arrivals are assumed to be random with unknown statistics. The role played by the quantization level will become clear later in this section, when we introduce the accuracy of the edge machine learning task.
At the server side, the remote queue is fed by the uplink task arrivals, and it is drained by the task computation. In this work, proceeding as in [12], we assume that there exists a linear relation between the data units/tasks and the number of CPU cycles necessary to run the task. Thus, denoting by $1/J_k$ the number of CPU cycles necessary to process one data unit/task, the number of data units processed in slot $t$ is
$$N^c_k(t) = \left\lfloor \tau f_k(t) J_k \right\rfloor.$$
Then, the computation queue at the server side, counting the number of tasks to be executed, evolves as follows:
$$Q^m_k(t+1) = \max\left(0,\, Q^m_k(t) - N^c_k(t)\right) + \min\left(Q^l_k(t),\, N^u_k(t)\right).$$
The overall service delay is directly related to the sum of the local and the computation queues
$$Q^{\rm tot}_k(t) = Q^l_k(t) + Q^m_k(t).$$
In fact, from Little's law [40], given a data arrival rate $\bar{A}_k = \mathbb{E}\{A_k(t)/\tau\}$ and a stationary queueing system, the overall long-term average latency experienced by a new data unit from its generation to its computation at the ES is
$$\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \frac{\mathbb{E}\{Q^{\rm tot}_k(t)\}}{\bar{A}_k},$$
where the expectation is taken with respect to the radio channel and data arrival statistics. Our goal is to guarantee a long-term average delay constraint $D^{\rm avg}_k$, which can be written as follows
$$\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\{Q^{\rm tot}_k(t)\} \le Q^{\rm avg}_k,$$
where $Q^{\rm avg}_k = D^{\rm avg}_k \bar{A}_k$. Although $\bar{A}_k$ is unknown a priori, it can be estimated online.
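The queueing model described above can be exercised with a small simulation step. The sketch below assumes the usual "drain, then add arrivals" queue recursion, with the remote queue fed by the data units actually transmitted (at most the uplink backlog); all function names are hypothetical, and `units_per_cycle` plays the role of $J_k$.

```python
import math


def uplink_units(rate_bps, slot_s, samples_per_unit, bits_per_sample):
    """Whole data units transmitted in one slot."""
    return math.floor(slot_s * rate_bps / (samples_per_unit * bits_per_sample))


def computed_units(f_k_hz, slot_s, units_per_cycle):
    """Whole data units processed in one slot by the CPU share f_k_hz."""
    return math.floor(slot_s * f_k_hz * units_per_cycle)


def step_queues(q_local, q_remote, arrivals, n_up, n_comp):
    """One-slot update of the uplink queue and of the remote (computation) queue."""
    sent = min(q_local, n_up)  # cannot send more than what is queued
    new_local = max(0, q_local - n_up) + arrivals
    new_remote = max(0, q_remote - n_comp) + sent
    return new_local, new_remote


def average_delay(mean_total_queue, arrival_rate):
    """Little's law: average delay = average backlog / arrival rate."""
    return mean_total_queue / arrival_rate


if __name__ == "__main__":
    n_up = uplink_units(1e6, 0.01, 100, 8)   # 10^4 bits per slot, 800 bits per unit
    n_comp = computed_units(1e9, 0.01, 1e-7)
    print(n_up, n_comp, step_queues(5, 3, 2, n_up, n_comp))
```

Note how a smaller `bits_per_sample` increases the number of data units drained per slot for the same rate, which is the lever exploited later to trade accuracy for delay and energy.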

C. LEARNING ACCURACY
The metric used to quantify the inference accuracy depends on the specific learning task (e.g., prediction, estimation, or classification), and on the adopted machine learning algorithm (e.g., Least Mean Squares, Support Vector Machine, Neural Networks, etc.). Since we aim to control the accuracy of the learning task, we introduce an instantaneous performance metric $G_k(t) = G_k(n^q_k(t))$, which depends on the number of quantization bits used to represent the data and is therefore also a function of the time slot index $t$. Intuitively, the larger the number of quantization bits, the better the learning performance. Given the instantaneous performance $G_k(t)$, we can impose a long-term constraint on the learning accuracy as follows:
$$\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\{G_k(t)\} \le G^{\rm avg}_k.$$
$G_k(t)$ may represent, for example, the misclassification rate in a classification task, the Mean Squared Error (MSE) or Mean Squared Deviation (MSD) in an estimation or prediction task, etc. In Secs. III and IV, we will keep this function generic on purpose, so that it can be suitably adapted to the specific edge learning task we are interested in. In more specific cases, when a data model is available, the expression of $G_k(t)$ is known and can be directly exploited in the optimization (see Sec. III.A); otherwise, the value of $G_k(t)$ can be estimated from the data and used to control the system in a fully data-driven fashion (see Sec. III.D). In Sec. V, we will customize the proposed framework to some specific learning tasks, showing how to set and update $G_k(t)$ for both model-based and data-driven strategies.
As mentioned before, typically $G_k(t) = G_k(n^q_k(t))$ is a function of the number of quantization bits $n^q_k$, since a coarser representation of the data can lead to deteriorated performance of the learning task. On the other hand, using a finer quantization level yields more bits to be transmitted and thus a higher energy consumption at the transmit side (cf. (1), (5), (6)) to meet the desired latency constraint. In this work, we control the learning/inference accuracy by acting on the source encoder, at the transmit side, and thus on the number of quantization bits used to encode the data. To enforce this strategy, we introduce a feedback loop, as depicted in Fig. 2, from the edge server to the source encoder, to feed back the information about the number of quantization bits, in order to find the desired balance between energy consumption, E2E delay and learning accuracy.

III. MINIMUM ENERGY UNDER E2E DELAY AND ACCURACY CONSTRAINTS
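In this section, we introduce the first dynamic resource allocation strategy for EML. Our goal is to devise an online strategy striking a good trade-off between energy, latency, and learning accuracy. To this aim, we formulate the following long-term average optimization problem:
$$\begin{aligned}
\min_{\Psi(t)} \quad & \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\{e^{\rm tot}(t)\} \\
\text{subject to} \quad & \text{(a)}\;\; \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\{Q^{\rm tot}_k(t)\} \le Q^{\rm avg}_k, \quad \forall k; \\
& \text{(b)}\;\; \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\{G_k(t)\} \le G^{\rm avg}_k, \quad \forall k; \\
& \text{(c)}\;\; 0 \le R_k(t) \le R^{\max}_k(t), \quad \forall k, t; \\
& \text{(d)}\;\; n^q_k(t) \in \mathcal{N}^q_k, \quad \forall k, t; \\
& \text{(e)}\;\; f_c(t) \in \mathcal{F}, \quad \forall t; \\
& \text{(f)}\;\; f_k(t) \ge 0, \quad \forall k, t; \\
& \text{(g)}\;\; \sum_{k=1}^{K} f_k(t) \le f_c(t), \quad \forall t,
\end{aligned} \tag{12}$$
where $\Psi(t) = [\{R_k(t)\}_k, \{n^q_k(t)\}_k, \{f_k(t)\}_k, f_c(t)]$ collects the optimization variables, $e^{\rm tot}(t)$ is the total system energy consumption in slot $t$ (cf. Sec. II-A), $Q^{\rm tot}_k(t)$ is the sum of the uplink and remote queues of device $k$ (cf. Sec. II-B), and $R^{\max}_k(t) = B_k \log_2\left(1 + p^{\max}_k h_k(t)/(N_0 B_k)\right)$. The constraints of (12) have the following meaning: (a) the long-term average E2E delay experienced by a data unit from its generation to its computation at the server must be smaller than a predefined threshold³; (b) the long-term average of the function quantifying the accuracy of the learning task must be smaller than a predefined threshold; (c) the data rate of each device is non-negative and is lower than a maximum value obtained by plugging the maximum transmit power budget $p^{\max}_k$ into the Shannon formula; (d) the number of quantization bits is selected from a discrete set $\mathcal{N}^q_k$; (e) the CPU clock frequency is selected from a finite set $\mathcal{F}$; (f) the CPU cycle frequency assigned to each device $k$ is non-negative; (g) the sum of the CPU cycle frequencies assigned to all devices cannot exceed the CPU clock frequency. Problem (12) is very difficult to solve, especially because of the lack of knowledge of the statistics of the radio channels and task arrivals. Nevertheless, in the sequel, we show how to tackle the problem effectively, hinging on stochastic Lyapunov optimization.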

A. STOCHASTIC LYAPUNOV OPTIMIZATION
The first step is to define a dynamic strategy to handle the long-term constraints (a) and (b) of (12). Proceeding as in [41], we introduce virtual queues to control these two constraints. More specifically, we introduce the virtual queue $Z_k(t)$ for each device, to control latency, evolving as
$$Z_k(t+1) = \max\left(0,\, Z_k(t) + Q^{\rm tot}_k(t+1) - Q^{\rm avg}_k\right), \tag{13}$$
and the virtual queue $Y_k(t)$ for each device, to control accuracy, evolving as
$$Y_k(t+1) = \max\left(0,\, Y_k(t) + \nu_k \left(G_k(t) - G^{\rm avg}_k\right)\right), \tag{14}$$
where $\nu_k$ is a step size used to control the convergence of the algorithm.⁴ The aim of the virtual queues is to keep track of how the system is behaving in terms of constraint violations. In particular, it can be shown that imposing the mean rate stability of the virtual queues (13) and (14) is equivalent to ensuring constraints (a) and (b) of (12), respectively [41].
The mean rate stability of $Z_k(t)$ and $Y_k(t)$ is defined as follows [41]:
$$\lim_{T\to\infty} \frac{\mathbb{E}\{Z_k(T)\}}{T} = 0, \qquad \lim_{T\to\infty} \frac{\mathbb{E}\{Y_k(T)\}}{T} = 0, \qquad \forall k.$$
Using the general framework of [41], to impose the mean rate stability, we now introduce the Lyapunov function $L(\Theta(t))$, with $\Theta(t) = [\{Z_k(t)\}_k, \{Y_k(t)\}_k]$, defined as:
$$L(\Theta(t)) = \frac{1}{2} \sum_{k=1}^{K} \left[ Z_k^2(t) + Y_k^2(t) \right].$$
Having defined the Lyapunov function, we can introduce the conditional Lyapunov drift, which is the conditional expected change of the Lyapunov function over one slot:
$$\Delta(\Theta(t)) = \mathbb{E}\left\{ L(\Theta(t+1)) - L(\Theta(t)) \,\middle|\, \Theta(t) \right\}. \tag{15}$$
It can be shown that, by minimizing (15), the mean rate stability of the queues is imposed, so that constraints (a) and (b) of (12) are satisfied. However, directly minimizing the conditional Lyapunov drift can lead to an unnecessary system energy consumption. Then, as in [41], we introduce the drift-plus-penalty function, defined as
$$\Delta_p(\Theta(t)) = \Delta(\Theta(t)) + V\, \mathbb{E}\left\{ e^{\rm tot}(t) \,\middle|\, \Theta(t) \right\}, \tag{16}$$
where $e^{\rm tot}(t)$ is the total system energy consumption in slot $t$, and $V$ is a control parameter used to assign more importance to the energy term with respect to the virtual queue backlogs. Now, instead of directly minimizing (16), we proceed by minimizing the upper bound (17), whose derivation is given in the Appendix, where $\zeta$ is a positive constant given by (18) and $\chi_k(t)$ is defined in (19). Note that, when plugged into (17), because of the conditioning, the expected value containing $\chi_k(t)$ is a function that does not depend on the optimization variables. Now, hinging on stochastic optimization, we minimize the instantaneous realizations of (17), thus removing the expectation term. Then, neglecting all the constant terms, the per-slot optimization problem can be written as (20), where $\mathcal{Z}(t)$ is the feasible set according to constraints (c)-(g) of (12). Problem (20) is still complicated to solve, due to its mixed-integer nonlinear nature. However, the problem can be split into two simpler sub-problems: the first one selecting the data rate and number of quantization bits; the second one optimizing the edge server CPU scheduling. Then, as we will illustrate in the sequel, each sub-problem can be solved using a low-complexity procedure.
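To make the role of the virtual queues more tangible, the following sketch implements a generic virtual-queue update in the style of the framework of [41] and checks numerically why a slowly growing queue implies that the long-term constraint is met: by construction, the final queue value upper-bounds the accumulated (scaled) constraint violation. The exact time indexing of the updates and all function names are assumptions made for illustration, not the paper's definitions.

```python
def update_latency_queue(z, q_tot_next, q_avg):
    """Virtual queue for latency: grows when the total backlog exceeds its target."""
    return max(0.0, z + q_tot_next - q_avg)


def update_accuracy_queue(y, g, g_avg, step):
    """Virtual queue for accuracy: grows when the metric g exceeds its target."""
    return max(0.0, y + step * (g - g_avg))


def lyapunov(z_values, y_values):
    """Quadratic Lyapunov function over all virtual queues."""
    return 0.5 * sum(z * z for z in z_values) + 0.5 * sum(y * y for y in y_values)


def run_accuracy_queue(g_trace, g_avg, step):
    """Run the accuracy queue over a trace of measured metrics, starting from 0."""
    y = 0.0
    for g in g_trace:
        y = update_accuracy_queue(y, g, g_avg, step)
    return y


if __name__ == "__main__":
    trace = [0.3, 0.1, 0.4, 0.05]
    y_final = run_accuracy_queue(trace, g_avg=0.2, step=2.0)
    violation = 2.0 * sum(g - 0.2 for g in trace)
    # Since max(0, x) >= x at every step, y_final >= accumulated violation.
    print(y_final, violation)
```

Dividing the inequality `y_final >= step * sum(g - g_avg)` by the horizon length shows that, if the queue grows sublinearly in time, the time-averaged metric cannot exceed its target in the limit.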

B. OPTIMAL DATA RATE AND QUANTIZATION BITS
We now proceed to solve (20) from the radio resource allocation perspective. To this aim, we first introduce an additional upper bound, used to handle the expression of $N^u_k(t)$ (cf. (5)) and simplify the structure of the sub-problem. In particular, using the fact that $x - 1 \le \lfloor x \rfloor \le x$, we can remove the non-linearity introduced by the $\lfloor \cdot \rfloor$ operator. Moreover, without loss of generality, we tighten constraint (c) of (12) with an additional upper bound on $R_k(t)$, thus eliminating the non-linearity introduced by the $\max(\cdot)$ operator. This constraint does not alter the problem, since it only ensures that each device does not transmit more data than those available in the uplink queue at time $t$. Now, the optimization problem in (20) with respect to data rate and number of quantization bits can be split into simpler problems for each device. In particular, for each device $k$, the radio resource allocation problem can be formulated as problem (21) (we omit the temporal index $t$ for ease of notation). Problem (21) is a mixed-integer problem and thus, in principle, it is difficult to solve. However, since $\mathcal{N}^q_k$ is a discrete finite set with typically low cardinality, it is possible to solve it exactly using an exhaustive search over the variable $n^q_k$. Since we have been able to split the optimization per device, there is no exponential increase of complexity with the number of connected devices. In particular, for any given $n^q_k \in \mathcal{N}^q_k$, if $Q^l_k \le 0$, the optimal solution is $R_k = 0$, since the first two terms of the objective function of (21) are monotone increasing functions of $R_k$, while the third one does not depend on $R_k$. Instead, if $Q^l_k > 0$, given $n^q_k$, (21) is a convex optimization problem with respect to $R_k$, and its global optimal solution can be derived in closed form. Indeed, in the Lagrangian associated to problem (21), $\alpha_k$ and $\beta_k$ denote the Lagrange multipliers associated to constraint (a) of (21). Since we are looking for the solution corresponding to a given value of $n^q_k$, there is no need to introduce any multiplier for $n^q_k$. Solving the Karush-Kuhn-Tucker (KKT) conditions [42] of (21), it is easy to see that the global optimal solution of (21), for a given $n^q_k$, is given in closed form by (22). Finally, to find the global optimal solution of (21), it is only necessary to compute (22) for all $n^q_k \in \mathcal{N}^q_k$ and choose the solution that yields the smallest value of the objective function of (21). The procedure is summarized in Algorithm 1, and it requires $K \times |\mathcal{N}^q_k|$ iterations (which can be directly parallelized if a multi-core architecture is available). Interestingly, the fact that $\mathcal{N}^q_k$ is a finite discrete set, usually with a few elements, allows us to achieve the global optimal solution of (21) within a few iterations, independently of the structure of $G_k(n^q_k)$, which can be non-convex and/or non-differentiable.
Algorithm 1 (excerpt): [...] (22) and save it in $\rho_{ki}$; S4. Compute the value of the objective function of (21) with $n^q_k = N^q_{ki}$ and $R_k = \rho_{ki}$, and save it in $P_{ki}$; end
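The structure of the per-device search (an outer exhaustive loop over the quantization levels, an inner convex minimization over the rate) can be sketched as follows. This is only an illustration under explicit assumptions: the objective passed in is a user-supplied stand-in, not the objective of (21), and the inner problem is solved numerically by ternary search rather than through the closed-form expression (22). All names are hypothetical.

```python
def ternary_min(fun, lo, hi, iters=100):
    """Minimize a convex scalar function on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) <= fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def best_rate_and_bits(objective, bit_levels, rate_max, queue_len):
    """Exhaustive search over bit levels; convex search over the rate for each.

    objective(rate, n_q) is assumed convex in rate for every fixed n_q.
    Returns (best_value, best_rate, best_n_q).
    """
    best = None
    for n_q in bit_levels:
        if queue_len <= 0:
            rate = 0.0  # nothing to transmit: the zero rate is optimal
        else:
            rate = ternary_min(lambda r: objective(r, n_q), 0.0, rate_max)
        value = objective(rate, n_q)
        if best is None or value < best[0]:
            best = (value, rate, n_q)
    return best


if __name__ == "__main__":
    # Toy stand-in objective: a rate-dependent convex cost plus a term that
    # decreases with the number of bits (mimicking an accuracy penalty).
    def toy(rate, n_q):
        return (rate - n_q) ** 2 + 1.0 / n_q

    print(best_rate_and_bits(toy, [1, 2, 4], rate_max=10.0, queue_len=3))
```

Since the outer loop only evaluates the objective, the accuracy term may be arbitrary in the number of bits (tabulated, non-convex, non-differentiable), which mirrors the remark made in the text.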

C. CPU SCHEDULING AT THE SERVER
We now proceed to the solution of (20) from the computation perspective, optimizing the CPU scheduling at the ES. Proceeding in a similar way to the previous case, we exploit the inequality $x - 1 \le \lfloor x \rfloor \le x$ in (20), and we impose the following additional constraint: which ensures that we cannot allocate more computation resources than those required to completely drain the remote queue $Q_k^m(t)$. In this way, the optimization problem for the computation resource allocation can be cast as (we omit the temporal index $t$ for ease of notation): where $\widetilde{Q}_k^m = 2Q_k^m + Z_k$. Problem (23) is a mixed-integer program that, similarly to the previous case, admits a simple solution. In fact, if the discrete variable $f_c$ is fixed, (23) becomes a linear programming problem and its solution can be found via a simple and fast iterative procedure, described by steps S2-S5 of Algorithm 2. Thus, repeating these steps for all possible $f_c \in \mathcal{F}$, we can find by comparison the global optimal solution of (23), through the procedure summarized in Algorithm 2. Note that, denoting by $|\mathcal{F}|$ the cardinality of $\mathcal{F}$, Algorithm 2 requires at most $|\mathcal{F}| \times K$ iterations. Again, the complexity increases linearly with the number of connected devices.
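The two-level structure described above (an outer loop over the discrete clock frequencies and, for each of them, a linear program in the per-device CPU shares) can be sketched as follows. Since the exact statement of (23) and the steps of Algorithm 2 are not reproduced here, the sketch assumes a generic form: for a fixed clock, maximize a weighted sum of the allocated CPU shares subject to a total budget and per-device caps, which a greedy fill solves exactly. The weights, caps, and `energy_cost` function are hypothetical placeholders, not the quantities of (23).

```python
def schedule_for_clock(f_c, weights, caps):
    """Greedy solution of: max sum_k weights[k] * f[k]
    s.t. sum_k f[k] <= f_c and 0 <= f[k] <= caps[k]."""
    alloc = [0.0] * len(weights)
    remaining = float(f_c)
    order = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    for k in order:
        if remaining <= 0.0 or weights[k] <= 0.0:
            break  # nothing left to assign, or no benefit in serving device k
        alloc[k] = min(float(caps[k]), remaining)
        remaining -= alloc[k]
    return alloc


def best_clock_and_schedule(clock_options, weights, caps, energy_cost):
    """Compare all discrete clock frequencies and keep the best one."""
    best = None
    for f_c in clock_options:
        alloc = schedule_for_clock(f_c, weights, caps)
        value = energy_cost(f_c) - sum(w * f for w, f in zip(weights, alloc))
        if best is None or value < best[2]:
            best = (f_c, alloc, value)
    return best  # (clock, per-device allocation, objective value)


# Hypothetical example: three devices, cubic energy cost in the clock frequency.
clock, alloc, value = best_clock_and_schedule(
    clock_options=[0.0, 1.0, 2.0, 3.0, 4.0],
    weights=[3.0, 1.0, 2.0],
    caps=[2.0, 2.0, 2.0],
    energy_cost=lambda f: 0.5 * f ** 3,
)
print(clock, alloc, value)
```

Under these assumptions the inner step costs one pass over the devices (after sorting), so the overall cost scales with the number of clock options times the number of devices.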
Finally, the overall procedure for the proposed dynamic resource allocation strategy for minimum energy edge machine learning is summarized in Algorithm 3.

Algorithm 3: Model-Based Edge Machine Learning
Set the Lyapunov trade-off parameter $V$, $Z_k(0)$, $Y_k(0)$, $\nu_k$, for all $k$. In each time slot $t$, repeat the following steps: S1. Find the transmit data rate $R_k$ and the number of quantization bits $n_k^q$, $\forall k$, using Algorithm 1; S2. Solve the CPU scheduling through Algorithm 2; S3. Run the online learning task; S4. Update the physical queues as in (6) and (8), and the virtual queues as in (13) and (14).

D. DATA-DRIVEN CONTROL OF LEARNING ACCURACY
In the previous section, we have proposed a model-based online algorithm for radio and computation resource orchestration, assuming the possibility of writing a closed form expression for the learning accuracy $G_k(t)$. Indeed, as we will show in Sec. V-A, some learning tasks admit closed form expressions for different accuracy metrics, thus allowing us to use Algorithm 3. However, in several other cases, it is not possible to provide a closed form expression for the performance metrics. For these cases, we now propose an alternative approach, valid whenever the accuracy can be estimated online. As an example, if we consider a prediction task, once a sensor observes a new sample, it can compare it with its prediction from the previous samples and then measure the prediction error. The prediction can be fed back to the sensor by the edge server. Once the prediction accuracy is estimated, it is possible to actuate a control action accordingly, in order to achieve the target performance within a given delay. More specifically, indicating with $y_k(t)$ the sample observed by device $k$ at time $t$, and with $\hat{y}_k^w(t)$ its prediction based on the previous samples, using the data set $\mathcal{W}_k$, the online learning accuracy can be estimated as follows: Using (24), the evolution of the virtual queue $Y_k(t)$ can be written as (14), replacing the closed form expression $G_k(t)$ with $\hat{G}_k(t)$. Using $\hat{G}_k(t)$ is useful for the virtual queue's update, but it is not directly related to the number of quantization bits, which affects the learning accuracy. Thus, the control action might still not be easily implementable, due to the lack of a closed form expression for $G_k$ and $G_k(t)$. One possible solution to this issue builds on the following assumption, which is largely verified both from a theoretical and a numerical point of view (practical examples follow in Sec. V). Assumption 1: $G_k$ is a monotone non-increasing function of the number of quantization bits $n_k^q$. Assumption 1 hinges on the fact that a finer representation of the data generally leads to better performance of the learning task in terms of accuracy. Then, under Assumption 1, we propose to exploit a surrogate function for $G_k$ in (20), say $\widetilde{G}_k(n_k^q)$, which approximates the non-increasing behavior of $G_k$. The rationale underlying this choice comes from the concept of $\epsilon$-additive approximation [41, p. 59] of the drift-plus-penalty method in (20), which makes it possible to use inexact updates of the algorithm at each iteration, provided that the approximation error can be bounded within a finite error $\epsilon$. Of course, due to the boundedness of $\widetilde{G}_k(n_k^q)$ over the finite discrete set $\mathcal{N}_k^q$, any bounded non-increasing discrete function of the number of bits leads to a valid $\epsilon$-approximation of the drift-plus-penalty method in (20). As an example, we can assume that the accuracy is inversely proportional to the distortion $d$ induced by the quantization.
More specifically, denoting by $d_k(t)$ the amount of distortion tolerated in time slot $t$ for user $k$, to be adjusted online depending on the learner accuracy, the number of bits associated with $d_k(t)$ can be derived from fundamental rate-distortion theory limits [43]. Let us recall that, for a Gaussian random variable $X$ with zero mean and variance $\sigma^2$, the minimum number of bits necessary to quantize $X$ with a distortion at most equal to $d_k(t)$ is [43]
$$r(d_k(t)) = \frac{1}{2} \max\left\{0, \log_2 \frac{\sigma^2}{d_k(t)}\right\} \ \text{bits}. \tag{25}$$
In the sequel, we will use a number of bits per sample $n_k^q(t) = \alpha\, r[d_k(t)]$, where $\alpha > 1$ is a margin coefficient introduced to take into account the fact that in practice the data may not follow a Gaussian distribution.
Inverting (25), we can choose $\widetilde{G}_k(n_k^q) = c\, 2^{-2 n_k^q}$, where $c$ is a suitable coefficient. However, more general designs for $\widetilde{G}_k(n_k^q)$ can be exploited if some information on the shape of $G_k$ is known in advance or inferred from data. Then, at a given time slot $t$, substituting $G_k$ with $\widetilde{G}_k(n_k^q)$ in (21), we solve the following deterministic sub-problem for the data rate and quantization bits selection: where the virtual queues $\{Y_k(t)\}_{k=1}^{K}$ in (14) are updated using the online accuracy estimate $\hat{G}_k(t)$ given by (24). The sub-problem in (26) can then be solved as in the previous case. The main steps of the proposed data-driven approach are summarized in Algorithm 4. The performance of this data-driven strategy will be numerically assessed in Sec. V.

Algorithm 4: Data-Driven Edge Machine Learning
Set the Lyapunov trade-off parameter $V$, $Z_k(0)$, $Y_k(0)$, $\nu_k$, $\forall k$. In each time slot $t$, repeat the following steps: S1. Find the transmit data rate $R_k$ and the number of quantization bits solving (26) with Algorithm 1; S2. Solve the CPU scheduling through Algorithm 2; S3. Run the online learning task; S4. Update the physical queues as in (6) and (8), and the virtual queues as in (13) and (14), using (24) for the latter.
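The link between the tolerated distortion, the Gaussian rate-distortion bound (25), and the exponential surrogate obtained by inverting it can be checked numerically with a few lines. This is a minimal sketch: the function names are hypothetical, and the margin value used in the example is an arbitrary choice for illustration.

```python
import math


def gaussian_rate_bits(variance, distortion):
    """Rate-distortion bound (25) for a zero-mean Gaussian source."""
    return 0.5 * max(0.0, math.log2(variance / distortion))


def bits_per_sample(variance, distortion, margin):
    """Bits per sample with a margin coefficient (alpha > 1 in the text)."""
    return margin * gaussian_rate_bits(variance, distortion)


def surrogate_accuracy(n_bits, c=1.0):
    """Surrogate obtained by inverting (25): c * 2^(-2 n)."""
    return c * 2.0 ** (-2.0 * n_bits)


variance = 1.0
distortion = 2.0 ** -8
bits = gaussian_rate_bits(variance, distortion)
print(bits)                                             # bits needed at the bound
print(bits_per_sample(variance, distortion, margin=1.5))  # with a safety margin
print(surrogate_accuracy(bits, c=variance))             # recovers the distortion
```

With `c` set to the source variance, the surrogate evaluated at the bound returns the tolerated distortion, which is exactly the inversion used to motivate the exponential shape.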

IV. BEST ACCURACY UNDER ENERGY AND E2E DELAY CONSTRAINTS
In this section, we present an alternative formulation of EML, useful in cases where there is no a priori specification of the accuracy of the learner, and the goal is rather to optimize the learning accuracy (without any particular assumption on the model and the final performance), subject to energy and latency constraints. This alternative formulation hinges on Assumption 1. In particular, if we aim to optimize the accuracy $G_k(n_k^q)$, exploiting the assumption that $G_k(n_k^q)$ is a monotone non-increasing function of the number of quantization bits, we can equivalently formulate the problem as the maximization of the number of quantization bits, subject to latency and energy constraints. Thus, using the notation introduced in the previous section, we formulate the problem as follows: Differently from (12), we now introduce the constraints (b) and (c) on the average device and ES energy consumption, respectively. The expectation is still taken with respect to the random wireless channels and data arrivals. Then, to tackle this long-term optimization problem, as before, we introduce a virtual queue for each long-term constraint, and we devise a strategy satisfying the desired constraints by enforcing the mean rate stability of the virtual queues. In particular, the virtual queue for the E2E delay constraint is the same as in the previous section (cf. (13)). For constraint (b) of (27), we introduce a virtual queue $S_k(t)$ evolving as follows: where $\lambda_k$ is a step size. Similarly, for constraint (c) of (27), we use a virtual queue $O(t)$ evolving as: To stabilize the virtual queues while taking the objective of (27) into account, we first introduce the Lyapunov function and then its associated drift-plus-penalty function (cf. (16)), which in this case reads as In particular, it is easy to show (the details are given in the Appendix) that the drift-plus-penalty function enjoys the following upper bound: where $\zeta_2$ is a positive constant given by and $\chi_{k,2}(t)$ is defined as This function is to be considered as a constant with respect to the optimization problem, because it does not depend on the optimization variables. Now, following similar arguments as in the previous section, the problem can be split into two sub-problems. In particular, it can be easily shown that the final per-slot problem used to find the data rate and the number of quantization bits for a generic device $k$ is given by: Again, if $n_k^q$ is fixed, (34) is convex and differentiable, and its solution can be derived in closed form. In particular, solving the KKT conditions for a given $n_k^q$, the global optimal solution of (34) is given by: Thus, using (35) for each value of $n_k^q \in \mathcal{N}_k^q$, we find the global optimal solution by selecting the pair $(R_k^*, n_k^q)$ that minimizes the objective function of (34), following the steps illustrated in Algorithm 5.
Similarly, the sub-problem for the optimal CPU clock frequency and scheduling is given by: Again, once $f_c$ is fixed, (36) is a linear programming problem, which can be solved using Algorithm 2, where in step S6, the value of the objective function of (23) is substituted with the value of the objective function of (36). To summarize, Algorithm 6 describes the overall EML dynamic resource allocation procedure that optimizes the learning accuracy under latency and energy constraints.

Algorithm 5
... (35) and save it in $\rho_{ki}$; S4. Compute the value of the objective function of (34) with $n_k^q = N_{ki}^q$ and $R_k = \rho_{ki}$, and save it in $P_{ki}$; end S5. ...
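The role of a virtual queue in enforcing a long-term average energy constraint can be illustrated with a small simulation. The update rules (28) and (29) are not reproduced in this section, so the sketch below assumes the standard form used in Lyapunov optimization, in which the queue grows by the step size times the excess of the consumed energy over the budget and is clipped at zero. The per-slot decision rule, the energy options, and the reward (taken equal to the energy, as a stand-in for "more bits") are hypothetical toy choices.

```python
def update_virtual_queue(queue, consumed, budget, step=1.0):
    """Assumed standard form: grow with the constraint violation, clip at 0."""
    return max(0.0, queue + step * (consumed - budget))


def simulate(num_slots, energy_options, budget, tradeoff_v):
    """Toy drift-plus-penalty loop: each slot, pick the energy level e that
    minimizes queue * e - V * reward(e), with reward(e) = e."""
    queue = 0.0
    total_energy = 0.0
    for _ in range(num_slots):
        energy = min(energy_options, key=lambda e: (queue - tradeoff_v) * e)
        total_energy += energy
        queue = update_virtual_queue(queue, energy, budget)
    return total_energy / num_slots, queue


average_energy, final_queue = simulate(
    num_slots=1000, energy_options=[1.0, 2.0, 3.0], budget=1.5, tradeoff_v=10.0
)
print(average_energy, final_queue)
```

Because the queue stays bounded, the time-averaged energy can exceed the budget only by a term that vanishes as the horizon grows, which is the mechanism behind the mean rate stability argument; a larger `tradeoff_v` lets the queue grow further before the constraint starts to bite.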

V. APPLICATIONS AND NUMERICAL RESULTS
In this section, we illustrate some applications of our EML framework to specific learning problems, and then present numerical simulations to assess the performance of the proposed resource allocation strategies. In particular, in Sec. V-A, we customize our adaptive learning strategy to least mean squares estimation as a particular case, and illustrate numerical results for the model-based and the data-driven solutions presented in Sec. III, considering both synthetic and real datasets. Finally, in Secs. V-B and V-C, we customize our framework to SVM and NN classification, illustrating numerical results using the best accuracy strategy presented in Sec. IV.

A. LEAST MEAN SQUARES ADAPTIVE ESTIMATION
Let us briefly recall the basic concepts of the LMS adaptive algorithm that we consider in this paper. Given a streaming sequence of $U_k \times 1$ input data vectors $u_k^{(n)}$ and a parameter vector $w_{k,0}$ (to be learnt), we assume the following linear input/output relation:
$$y_k^{(n)} = u_k^{(n)T} w_{k,0} + v_k^{(n)}, \tag{37}$$
where the output $y_k^{(n)}$ is a random streaming sequence of output data, $v_k^{(n)}$ is a realization of random observation noise with variance $\sigma_{k,v}^2$, and the superscript $T$ denotes vector transposition. A typical approach to find the best estimate of the parameter vector $w_{k,0}$ from the stream of input data $u_k$ and noisy observations $y_k$ consists in minimizing the Mean Squared Error given by:
$$\mathbb{E}\left\{ \left| y_k^{(n)} - u_k^{(n)T} w_k \right|^2 \right\}. \tag{38}$$
If the statistics of the data are known in advance, the MSE cost function can be optimized via the traditional gradient descent algorithm. However, if the statistics are unknown, one can follow the LMS approach that drops the expectation and uses an instantaneous approximated version of the gradient, thus obtaining the stochastic gradient descent recursion given by [44]:
$$w_k^{(n+1)} = w_k^{(n)} + \mu_k\, u_k^{(n)} \left( y_k^{(n)} - u_k^{(n)T} w_k^{(n)} \right), \tag{39}$$
where $\mu_k > 0$ is a sufficiently small step-size. In our case, the data $u_k^{(n)}$ and the observations $y_k^{(n)}$ are first quantized, using a finite number of bits, and then sent to the edge server for processing. The quantization introduces an additional noise to the data, which determines a biased estimate of the LMS algorithm [45]. To avoid the detrimental effect of the bias, we can use the following bias-compensated recursion [45]: where $\Sigma_{k,q}$ is the covariance matrix of the noise affecting the input data (related to quantization effects), which is assumed to be diagonal, and $I_{U_k}$ is the $U_k \times U_k$ identity matrix. In the sequel we will provide a closed form expression for $\Sigma_{k,q}$, which depends on the adopted quantization scheme. The stochastic recursion (40) will then be used in step S3 of Algorithm 3.
As accuracy metric $G_k$, we choose the Mean Squared Deviation between the estimated parameter vector $w_k^{(n)}$ and the true parameter vector $w_{k,0}$, defined as: Interestingly, in the case of noisy data, denoting by $C_{k,u}$ the covariance matrix of the input data, and by $\sigma_{k,q,o}^2$ the variance of the noise affecting the output data, and defining the MSD admits the following closed form expression [45]: where $\otimes$ denotes the Kronecker product and $\mathrm{vec}(A)$ is the vectorization of matrix $A$. Then, as accuracy function $G_k(n_k^q(t))$ in (11)-(12), we use the values of $\mathrm{MSD}_k$ given by (42). Furthermore, since $w_{k,0}$ in (41) is obviously unknown a priori, we replace it with the online LMS estimate $w_k^{(n)}$ in (40), which is asymptotically (at convergence) unbiased and has a small steady-state error [45]. In this way, as time goes on, the true parameter $w_{k,0}$ can be suitably replaced by $w_k^{(n)}$ in (41).

Algorithm 6: Best Accuracy Edge Machine Learning
Set the Lyapunov trade-off parameter $V$, $Z_k(0)$, $S_k(0)$, $\lambda_k$, for all $k$, and $O(0)$. In each slot $t$, repeat the following steps: S1. Find the transmit data rate $R_k$ and the number of quantization bits $n_k^q$, $\forall k$, through Algorithm 5; S2. Solve the CPU scheduling through Algorithm 2, with step S6 modified with the objective function of (36); S3. Update the physical queues as in (6) and (8), and the virtual queues as in (13), (28), and (29).
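The effect of quantization noise on LMS, and of the bias compensation recalled in this subsection, can be reproduced with a small pure-Python experiment. This is a sketch under stated assumptions rather than the exact recursion (40): it uses the common form in which the weight vector is scaled by one plus the step size times the noise variance before the usual LMS correction, only the regressors are quantized (not the observations), and the noise variance is taken as the squared quantization step divided by 6, the textbook value for a uniform quantizer with non-subtractive uniform dither. The quantization step, LMS step size, and parameter vector are hypothetical choices.

```python
import math
import random


def dithered_quantize(x, step, rng):
    """Uniform quantizer with non-subtractive uniform dither in [-step/2, step/2]."""
    dither = rng.uniform(-step / 2.0, step / 2.0)
    return step * math.floor((x + dither) / step + 0.5)


def run_lms(w_true, step, mu, n_iter, compensate, seed=0,
            sigma_u=math.sqrt(0.1), sigma_v=0.1):
    """LMS on quantized regressors; returns the estimate averaged over the
    last quarter of the iterations."""
    rng = random.Random(seed)
    dim = len(w_true)
    # Assumed per-entry variance of the total quantization error.
    noise_var = step ** 2 / 6.0 if compensate else 0.0
    w = [0.0] * dim
    tail = n_iter // 4
    w_avg = [0.0] * dim
    for n in range(n_iter):
        u = [rng.gauss(0.0, sigma_u) for _ in range(dim)]
        y = sum(ui * wi for ui, wi in zip(u, w_true)) + rng.gauss(0.0, sigma_v)
        u_q = [dithered_quantize(ui, step, rng) for ui in u]
        err = y - sum(ui * wi for ui, wi in zip(u_q, w))
        # Plain LMS when noise_var == 0, bias-compensated update otherwise.
        w = [(1.0 + mu * noise_var) * wi + mu * ui * err for wi, ui in zip(w, u_q)]
        if n >= n_iter - tail:
            w_avg = [a + wi / tail for a, wi in zip(w_avg, w)]
    return w_avg


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


w_true = [0.1, 0.2, 0.3, 0.4]
plain = run_lms(w_true, step=0.5, mu=0.05, n_iter=20000, compensate=False)
compensated = run_lms(w_true, step=0.5, mu=0.05, n_iter=20000, compensate=True)
print(distance(plain, w_true), distance(compensated, w_true))
```

With a coarse quantizer the plain recursion settles on a shrunken version of the true parameter vector, whereas the compensated one fluctuates around the true value, at the price of a somewhat larger steady-state variance.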

NUMERICAL RESULTS: MINIMUM ENERGY STRATEGIES FOR LMS
In this section, we present some numerical results obtained with computer simulations, for the resource allocation strategy devised in Sec. III. We present simulations for two different approaches: model-based and data-driven, which will be explained later on in this section. For these simulations, we use the following settings.
Scenario: We consider a single AP at the center of a square area of side 200 m, with a carrier frequency $f_0 = 3.5$ GHz. The propagation model is the ''Alpha-Beta-Gamma'' model presented in [46], and we assume Rayleigh fading with unit variance. We also assume a total available bandwidth $B = 180$ kHz, equally shared among $K = 5$ end devices. The noise spectral density is $N_0 = -174$ dBm/Hz, with a noise figure $F = 5$ dB at the receiver. The slot duration is fixed to $\tau = 5$ ms, and the maximum transmit power of the end devices is $p_k^{\max} = 100$ mW, $\forall k$, used to compute the maximum achievable data rate $R_k^{\max}(t)$ in each time slot from the Shannon formula. We assume an edge server equipped with a CPU clock frequency to be chosen in ϕ = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1] × 3.3 GHz. All devices are randomly distributed in a square area around the AP. The conversion factor $J_k$ (cf. (8)) has been estimated offline. In particular, we have run LMS in Matlab® R2019b on a Linux CentOS 7 workstation equipped with an Intel® Core™ i9-9940X CPU @ 3.30 GHz. The CPU speed (or 'clock') measures the number of cycles per second: multiplying the CPU speed (in Hz) by the time (in seconds) required to accomplish a given calculation, we get the number of CPU cycles needed to accomplish the calculation. However, it is worth noting that this is a rough estimate, because the CPU may be running some background task at the same time. To mitigate the effect of background tasks, we averaged the number of cycles per pattern across several trials. The resulting estimate is $J = 2 \times 10^{-4}$ data units/CPU cycle. This is the value we use for the following simulations.
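The link-budget quantities of this scenario can be turned into numbers with a short sketch. The noise density, noise figure, bandwidth split, and maximum power are taken from the scenario above; the channel gain passed to `max_rate_bps` and the workload figures passed to `data_units_per_cycle` are hypothetical inputs chosen only for illustration, and the function names are not from the paper.

```python
import math


def noise_power_watt(bandwidth_hz, n0_dbm_hz=-174.0, noise_figure_db=5.0):
    """Receiver noise power over the given bandwidth, in watts."""
    n0_watt_hz = 10.0 ** (n0_dbm_hz / 10.0) / 1000.0
    return n0_watt_hz * bandwidth_hz * 10.0 ** (noise_figure_db / 10.0)


def max_rate_bps(channel_gain, p_max_watt=0.1, bandwidth_hz=180e3 / 5):
    """Shannon rate at maximum transmit power for a given channel power gain."""
    snr = p_max_watt * channel_gain / noise_power_watt(bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


def data_units_per_cycle(num_units, clock_hz, seconds):
    """Conversion factor estimate: processed units over consumed CPU cycles."""
    return num_units / (clock_hz * seconds)


per_device_bandwidth = 180e3 / 5  # total bandwidth equally shared among 5 devices
noise_dbm = 10.0 * math.log10(noise_power_watt(per_device_bandwidth) / 1e-3)
print(noise_dbm)                   # noise power per device band, in dBm
print(max_rate_bps(1e-10))         # hypothetical channel gain of -100 dB
print(data_units_per_cycle(660000, 3.3e9, 1.0))  # hypothetical timing run
```

Multiplying the rate by the slot duration gives the number of bits a device can upload in one slot, which is the quantity that the number of quantization bits per sample trades against.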
Model-based minimum energy LMS estimation. In this paragraph, we aim at assessing the performance of the model-based Algorithm 3, when applied to adaptive LMS estimation. For the first simulation, we use a synthetic dataset, obtained by generating regression parameters from a Gaussian distribution with zero mean and variance $\sigma_{k,u}^2 = 0.1$, with data unit dimensionality $U_k = 10$. Thus, the parameters to be transmitted to the edge server have dimensionality $M_k = U_k + 1$, since both the observation $y_k^{(n)}$ and the regressors $u_k^{(n)}$ must be transmitted to run recursion (40). The observations $y_k^{(n)}$ (cf. (37)) are corrupted by Gaussian noise with zero mean and variance $\sigma_{k,v}^2 = 10^{-2}$. The true parameter vector $w_{k,0}$ is the realization of a uniform random vector variable, whose elements are in [0, 0.5]. The step size used for the LMS recursion is $\mu_k = 5 \times 10^{-3}$ (cf. (40)). The set of the possible numbers of quantization bits is $\mathcal{N}_k^q = \{3, 4, 5, 6, 7, 8\}$. The input data $u_k^{(n)}$ are quantized with a uniform dithered quantization [47], with dithering uniformly randomly distributed ...; here $l_k$ represents the amplitude dynamic of the signal. Using this kind of dithered quantization, the noise affecting the data has covariance $\Sigma_{k,q} = \frac{\Delta_k^2}{6} I_{U_k}$ (cf. (41)).
Energy-delay-accuracy trade-off: In Fig. 3, we illustrate the performance of Algorithm 3, in terms of trade-off between energy, delay, and learning accuracy.
In particular, the five curves refer to 5 different strategies: i) Strategy $S_1$ (blue curve) is the minimum energy consumption strategy, obtained by always using the minimum number of bits/sample, i.e., $n_k^q(t) = n_{k,\min}^q = 3$, for all $t$; ii) Strategy $S_5$ (green curve) is the best accuracy strategy, where the sensor always uses the maximum number of bits/sample, i.e., $n_k^q(t) = n_{k,\max}^q = 8$, for all $t$; iii) Strategies $S_2$, $S_3$, and $S_4$ (orange, yellow, and purple curves) represent the intermediate strategies, corresponding to different constraints on the value of the MSD, where the number of bits is adapted over time, depending on the values of the real and virtual queues. From these curves, it is interesting to highlight how the accuracy affects the performance of the overall system in terms of energy-delay trade-off.
In particular, Fig. 3a shows the average E2E delay as a function of the average sensor energy consumption. The curves have been obtained by changing the parameter $V$ in (16), in order to explore the energy-delay trade-off. More specifically, $V$ increases from right to left, as shown in the figure. From Fig. 3a, we can notice how, by increasing $V$, the energy consumption decreases while the E2E delay increases, as expected. From Fig. 3c, we can observe how the average number of bits/sample varies as a function of the parameter $V$ used to explore the energy-delay trade-off: while the strategies $S_1$ and $S_5$ keep that number constant, all the intermediate strategies adapt the number of bits, assigning fewer bits to spend less energy, while respecting the long-term accuracy constraint. Clearly, the number of bits increases as a better accuracy is required. For each value of $V$, represented by the markers in Fig. 3c, we also report the corresponding average ES energy consumption in Fig. 3d, to assess how much the energy can be reduced by acting on the Lyapunov parameter $V$. Interestingly, the trade-offs achieved by the different strategies are different, because of the different accuracy constraints. In particular, the best accuracy strategy (green curve) achieves, asymptotically, the minimum MSD, as evidenced in Fig. 3b, but at the same time exhibits the worst energy-delay trade-off, as shown in Fig. 3a. Conversely, the minimum energy strategy (blue curve) achieves the best energy-delay trade-off, but at the same time it converges to the largest MSD. Besides the two extreme cases, the most interesting aspect of Fig. 3 is the behaviour of the intermediate strategies. We can notice how, tolerating some deterioration of the final MSD with respect to the best accuracy strategy, we can achieve a substantially better energy-delay trade-off, as evidenced by the strategies $S_2$, $S_3$, and $S_4$.
In summary, the message coming from the results shown in Fig. 3 is twofold: 1) Our method is able to strike an optimal energy-delay trade-off, depending on the requirements concerning the accuracy level; 2) Acting on a single parameter, i.e., the $V$ parameter, and adapting the number of bits/sample online, it is possible to significantly reduce the energy consumption, under a given E2E delay constraint, by slightly relaxing the accuracy constraint.
Clearly, it is not always desirable to obtain the best possible accuracy, if this leads to too large an energy cost. With our method, it is possible to evaluate how to achieve a better energy-delay trade-off, accepting some degradation of the final accuracy.
Data-driven minimum energy LMS estimation. In this paragraph, we illustrate the performance of the data-driven approach given by Algorithm 4. In this case, as a measurable performance metric for the accuracy (cf. (24)), we use the Normalized Mean Squared Error (NMSE) between the true signal and our online estimate/prediction. In particular, we consider the instantaneous estimation of the NMSE, which translates into choosing $|\mathcal{W}_k| = 1$ in (24), with $\mathcal{W}_k$ being the set composed only of the last data sample. To compute this metric, there is an additional back-and-forth data exchange between AP and edge device. For simplicity, we neglect the time needed for this exchange of (scalar) information. To assess the performance of this approach, we use a real dataset composed of measurements of gas sensors exposed to dynamic mixtures of carbon monoxide (CO) and humid synthetic air inside a gas chamber [48], [49]. Each of the $K$ sensors transmits $U_k = 18$ regressors and a scalar observation to perform its LMS estimation. We assume that all $K$ sensors use the same dataset for the sake of comparison, and that each sensor has a different accuracy requirement on the NMSE. Then, we consider a similar scenario as in the previous simulation, with $K = 3$ sensors generating data with Poisson arrivals. In Fig. 4, we illustrate the temporal behavior of the NMSE (Fig. 4a), the average energy over time (Fig. 4b), and the average E2E delay over time (Fig. 4c), equal for all users. Finally, the comparison between the estimated and the true signals in a time window of 300 slots is shown in Fig. 4d. The target NMSE values for the three sensors are indicated by the dashed lines in Fig. 4a. As we can notice from Fig. 4a, each device converges on average to the target NMSE. As expected, from Fig. 4b, we notice that the device with the best NMSE (yellow curve) requires the highest energy consumption, due to the higher average number of quantization bits (around 6 bits are necessary on average to meet the constraint). On the contrary, the device requiring the worst NMSE (blue curve) achieves the minimum energy (around 3.5 bits are sufficient on average to meet the constraint). The effect of the different accuracy constraints is clearly visible from Fig. 4d where, for each device, we show the time-varying signal and its estimate/prediction. In particular, from Fig. 4d, setting a lower target NMSE (i.e., going from the first to the third sensor), we can see a noticeable improvement in estimation performance. Finally, in Fig. 4c we can notice how the E2E delay converges to the desired value. In summary, the proposed data-driven solution in Algorithm 4 is able to reduce the system energy consumption, while ensuring a target learning accuracy in terms of NMSE and a given E2E delay constraint.

B. CLASSIFICATION VIA SUPPORT VECTOR MACHINES
In this section, we customize our framework to SVM classification [50], [51]. SVMs are among the most popular algorithms used for binary classification problems. In the case of a linearly separable dataset, the goal of SVMs is to find a hyperplane, among the infinitely many able to discriminate two classes, such that the distance between the hyperplane and the training data points that lie closest to it is maximized. This distance is called the margin, while the closest points are called support vectors. In practice, it is quite typical to encounter situations in which the classes are not linearly separable. In these cases, the usage of kernel functions [52] helps in projecting the data into a higher-dimensional space, where the patterns become linearly separable.
In their native definition, SVMs are able to solve only binary classification problems. In the literature, two main strategies were proposed for multi-class SVMs. The first one, known as one-against-one [53], [54], trains $s(s-1)/2$ SVMs (with $s$ being the number of classes), where each SVM is dedicated to separating two of the $s$ classes. Once the training of all SVMs is completed, a new test pattern is classified by means of a majority vote. The second strategy is known as one-against-all [55], [56], where $s$ SVMs are trained such that the $i$-th SVM sees the patterns belonging to the $i$-th class as positive and all other patterns as negative. The final classification is given by the SVM that marks the incoming pattern as positive. In the case of a tie among SVMs, a reliability score (e.g., the distance from the separating hyperplane) can be employed as a tie-breaker. Between the two, we choose the former strategy, being the most competitive in terms of training times and prediction accuracy [57].
Once an SVM is trained for a classification task, a meaningful performance metric is the correct classification rate. Denoting by $y_i$ the true label of a given pattern, and by $\hat{y}_i$ its prediction, the correct classification rate reads as:
$$\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{y_i = \hat{y}_i\}, \tag{43}$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function, and $N$ is the number of patterns in the test set. In the sequel, we will use (43) as the performance metric for the classification accuracy.
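The one-against-one voting scheme and the correct classification rate described in this subsection can be sketched in a few lines. The pairwise classifiers below are hypothetical nearest-center rules on scalar patterns, used only as stand-ins for trained binary SVMs; the function names are illustrative and do not belong to any SVM library.

```python
from collections import Counter
from itertools import combinations


def make_nearest_center_classifiers(centers):
    """Build one toy binary classifier per pair of classes (s(s-1)/2 in total).
    Each classifier returns the class whose center is closer to the pattern."""
    def make(a, b):
        return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b
    return {(a, b): make(a, b) for a, b in combinations(sorted(centers), 2)}


def one_against_one_predict(pattern, pairwise_classifiers):
    """Majority vote among all pairwise classifiers (ties: first class voted)."""
    votes = Counter(clf(pattern) for clf in pairwise_classifiers.values())
    return votes.most_common(1)[0][0]


def correct_classification_rate(true_labels, predicted_labels):
    """Fraction of test patterns whose predicted label matches the true one."""
    hits = sum(1 for y, y_hat in zip(true_labels, predicted_labels) if y == y_hat)
    return hits / len(true_labels)


classifiers = make_nearest_center_classifiers({0: 0.0, 1: 1.0, 2: 2.0})
patterns = [0.1, 1.2, 1.9, 1.4]
true_labels = [0, 1, 2, 2]
predictions = [one_against_one_predict(x, classifiers) for x in patterns]
print(predictions, correct_classification_rate(true_labels, predictions))
```

With $s$ classes the dictionary holds $s(s-1)/2$ classifiers, so the voting cost grows quadratically with the number of classes, while each individual classifier only ever sees two classes.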
NUMERICAL RESULTS: BEST ACCURACY STRATEGY FOR SVM
We now present some numerical results for SVM classification at the edge, using the strategy proposed in Section IV, aimed at maximizing the accuracy under latency and energy constraints. In such a case, the ES runs an SVM classification task on the data uploaded by the end devices. We consider the same scenario of the previous section, with $K = 4$ sensors generating data from the MNIST dataset [58], with Poisson arrivals with parameter $A_k^{\text{avg}} = 2$. MNIST patterns have dimensionality $M_k = 784$ features per sample, i.e., the number of pixels. At the ES, an SVM with polynomial kernel classifies the data. Each sample is quantized with a number of bits $n_k^q \in \mathcal{N}_k^q = \{1, 2, 4, 8, 16\}$. We assume that each of the 4 sensors has a different requirement on the energy spent for transmission, to simulate the situation in which the devices have different battery energy levels and then adapt their requirements in terms of energy consumption in order to prolong the battery lifetime. More specifically, the average energy constraints are set to $e_k^{\text{avg}} = [1, 1.5, 2] \times 10^{-7}$ J for the first 3 devices, while no energy constraint is imposed on the fourth one, meaning that it can possibly transmit with its maximum power. The ES average energy constraint is $E_s^{\text{avg}} = 0.12$ J, while the average E2E delay constraint is $D_k^{\text{avg}} = 150$ ms for all devices. The slot duration is $\tau = 50$ ms, the carrier frequency is $f_0 = 6$ GHz, with a 10 MHz bandwidth equally shared among devices. The conversion factor $J_k$ (cf. (8)) has been estimated to be equal to $2.8 \times 10^{-7}$, with the same procedure used in Section V-A.
The performance of the proposed strategy is illustrated in Fig. 5, in terms of trade-off between energy, delay, and accuracy. In particular, Fig. 5a shows the average E2E delay vs. the obtained correct classification rate, whereas Figs. 5b, 5c and 5d illustrate the average device and ES energy consumption, and the number of quantization bits, all as functions of the Lyapunov parameter $V$. Note that Fig. 5a is obtained by increasing $V$ from left to right, with the values visible in Figs. 5b, 5c and 5d. We can notice from Fig. 5 that the different sensors show different trade-offs, but all devices meet the required energy and delay constraints. In particular, the device without energy constraint (purple curve) shows the best accuracy, with the lowest average E2E delay, but it pays for this performance with the highest energy consumption, as clearly visible from Fig. 5b. At the same time, the device with the lowest energy constraint (blue curve), given the delay bound of 150 ms, shows the worst performance in terms of correct classification rate, due to the fact that it needs to lower the average number of quantization bits (see Fig. 5d) to meet the energy and delay constraints. All other devices show intermediate performance in terms of energy, delay and accuracy. Finally, from Fig. 5c, we can notice that the ES always meets its energy constraint.
To summarize, the take-home message of Fig. 5 is twofold: 1) Our method is able to achieve the best accuracy while guaranteeing constraints on the device and ES energy consumption, and on the E2E delay; 2) By decreasing the maximum allowed energy consumption, as expected, a device has to sacrifice the learning performance to guarantee the same average E2E delay. These results further show the power of Lyapunov optimization in exploring the energy-delay-accuracy trade-off, thus guaranteeing the best learning performance under E2E delay and energy constraints.

C. CLASSIFICATION WITH NEURAL NETWORKS
In this section, we customize our framework to a classification task implemented by a Neural Network. NNs have emerged as a fundamental tool for classification in pattern recognition, representing a valid and promising alternative to various conventional classification methods [59]. An NN for classification aims at learning a functional relationship between the group membership and the attributes of the object. In this sense, NNs represent powerful learning tools since: i) they are universal functional approximators, i.e., they can approximate any function with arbitrary accuracy; ii) they are data-driven self-adaptive methods, i.e., they can adjust themselves to the data without any explicit specification of functional or distributional form for the underlying model. In the sequel, we assume a Multi-Layer Perceptron (MLP) structure composed of two layers. The hidden layer has 10 units with hyperbolic tangent sigmoid activation function. The output layer uses the softmax function, and the network is trained via scaled conjugate gradient backpropagation in order to minimize the cross-entropy loss function. The MLP weights are initialized by following the Nguyen-Widrow initialization algorithm, that is, by making sure that the active regions of the layer's neurons are distributed approximately evenly over the input space. We used the Hydraulic System Monitoring (HSM) dataset [60]. In particular, this dataset considers physical measurements from several sensors (temperature, volume, pressure, etc.) to infer the working condition of a hydraulic system. We used the pressure sensors as features. Then, 4 possible classes (conditions) are considered: i) optimal pressure; ii) slightly reduced pressure; iii) severely reduced pressure; iv) close to total failure.
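The forward pass of the two-layer MLP described above (a hidden layer with hyperbolic tangent activation followed by a softmax output, scored with the cross-entropy loss) can be sketched as follows. This is a minimal pure-Python illustration: the weights are hypothetical placeholders, and neither the scaled conjugate gradient training nor the Nguyen-Widrow initialization is implemented here.

```python
import math


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Return the class probabilities of a tanh-hidden, softmax-output MLP."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label):
    """Cross-entropy loss of one pattern with integer class label."""
    return -math.log(probabilities[label])


# Hypothetical sizes: 3 input features, 10 hidden units, 4 output classes.
n_in, n_hidden, n_classes = 3, 10, 4
w_hidden = [[0.1 * (i - j) for j in range(n_in)] for i in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[0.05 * (c + 1) * (i % 3 - 1) for i in range(n_hidden)] for c in range(n_classes)]
b_out = [0.0] * n_classes

probs = mlp_forward([0.5, -1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
print(probs, cross_entropy(probs, label=2))
```

The predicted class is the index of the largest probability, and averaging the agreement with the true labels over a test set gives the correct classification rate used as accuracy metric.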

NUMERICAL RESULTS: BEST ACCURACY STRATEGY FOR NN
Here, we present some numerical results for NN-based classification at the edge, using the strategy presented in Section IV, aimed at maximizing the accuracy under latency and energy constraints. In this case, the ES runs an NN classification task on the data uploaded by the end devices. = 1. In our setting, HSM patterns have dimensionality $M_k = 36000$ features. At the ES, an NN classifies the data, which can be quantized with $n_k^q \in \mathcal{N}_k^q = \{1, 2, 3, 4, 5, 6, 7, 8\}$ bits. We assume that each of the 5 sensors has a different requirement on the energy spent for transmission, in order to compare different solutions in terms of final accuracy. The device average energy constraints are set to $e_k^{\mathrm{avg}} = [0.8, 0.9, 1, 1.5] \times 10^{-5}$ J for the first 4 devices, while no energy constraint is imposed on the fifth one. The ES average energy constraint is $E_s^{\mathrm{avg}} = 12$ mJ, while the average E2E delay constraint is $D_k^{\mathrm{avg}} = 300$ ms for all devices. The slot duration is $\tau = 100$ ms, the carrier frequency is $f_0 = 6$ GHz, with a 10 MHz bandwidth equally shared among devices. The conversion factor $J_k$ (cf. (8)) has been estimated to be equal to $3.54 \times 10^{-7}$.
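To make the role of the number of quantization bits concrete, the sketch below applies a uniform scalar quantizer to a synthetic pattern with 36000 features and reports the resulting payload size and worst-case quantization error for each choice from 1 to 8 bits. The mid-point reconstruction rule, the input range [0, 1], the synthetic data, and the helper names `uniform_quantize` and `payload_bits` are illustrative assumptions; the payload figure also assumes no further compression or protocol overhead.

```python
import numpy as np

def uniform_quantize(x, n_bits, lo, hi):
    """Uniform scalar quantizer with 2**n_bits cells on [lo, hi].

    Each sample is mapped to the midpoint of the cell it falls in
    (an assumed reconstruction rule, chosen for illustration).
    """
    levels = 2 ** n_bits
    step = (hi - lo) / levels
    idx = np.clip(np.floor((x - lo) / step), 0, levels - 1).astype(int)
    return lo + (idx + 0.5) * step

def payload_bits(n_features, n_bits):
    """Bits needed to upload one pattern, ignoring any overhead."""
    return n_features * n_bits

M_K = 36_000                                  # features per pattern (from the text)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=M_K)           # synthetic stand-in for one HSM pattern

summary = {}
for n_bits in range(1, 9):
    xq = uniform_quantize(x, n_bits, 0.0, 1.0)
    summary[n_bits] = (payload_bits(M_K, n_bits), float(np.max(np.abs(x - xq))))

for n_bits, (bits, err) in summary.items():
    print(f"{n_bits} bits: payload = {bits} bits, max error = {err:.5f}")
```

Under these assumptions, each additional bit adds 36000 bits to the payload of a pattern while halving the quantization cell width, which is the tension that the resource allocation strategy has to resolve slot by slot.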
The performance of the proposed strategy is illustrated in Fig. 6, in terms of the trade-off between energy, delay, and accuracy. As for the SVM, Fig. 6a shows the average E2E delay vs. the obtained correct classification rate, whereas Figs. 6b, 6c and 6d show the average device energy consumption, the average ES energy consumption, and the number of quantization bits, respectively, all as a function of the Lyapunov parameter V. Fig. 6a is obtained by increasing V from left to right, with the values visible in Figs. 6b, 6c and 6d. Again, all devices meet the required energy and delay constraints. The device without energy constraint (green curve) shows the best accuracy, with the lowest average E2E delay. This again comes at the cost of a much higher energy consumption than the other devices (cf. Fig. 6b). At the same time, the device with the most stringent energy constraint (blue curve), given the delay bound of 150 ms, shows the worst performance in terms of correct classification rate, because it needs to lower the average number of quantization bits (see Fig. 6d) to meet the energy and delay constraints. All other devices show intermediate performance in terms of energy, delay and accuracy. Finally, from Fig. 6c, we can notice that the ES always meets its energy constraint. Thus, similar conclusions as for SVM edge classification can be drawn.
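The qualitative behavior discussed above can be reproduced with a toy drift-plus-penalty loop for a single device. In each slot, the number of quantization bits is chosen to minimize a weighted sum of negative accuracy (weighted by V) and energy (weighted by a virtual queue Z that tracks violations of the average energy constraint). This is a generic textbook sketch, not the algorithm of Section IV: the functions `toy_accuracy` and `toy_energy`, the channel model, and all numerical values are hypothetical, and the delay queues and ES resources are left out.

```python
import numpy as np

BITS = np.arange(1, 9)                       # candidate quantization bits

def toy_accuracy(n_bits):
    # Hypothetical accuracy curve, increasing and concave in the number of bits
    return 1.0 - 0.5 * 2.0 ** (-n_bits / 2.0)

def toy_energy(n_bits, gain):
    # Hypothetical energy model: grows with payload, shrinks with channel gain
    return 0.1 * n_bits / gain

def run(V, e_avg, n_slots=20_000, seed=0):
    """Return (average accuracy, average energy, final virtual queue)."""
    rng = np.random.default_rng(seed)
    Z = 0.0                                  # virtual queue for the energy constraint
    acc_sum = en_sum = 0.0
    for _ in range(n_slots):
        gain = rng.exponential(1.0) + 0.1    # random channel gain, bounded away from 0
        # Per-slot drift-plus-penalty objective over all candidate bit choices
        cost = -V * toy_accuracy(BITS) + Z * toy_energy(BITS, gain)
        n = BITS[np.argmin(cost)]
        e = toy_energy(n, gain)
        Z = max(0.0, Z + e - e_avg)          # queue grows when the slot energy exceeds e_avg
        acc_sum += toy_accuracy(n)
        en_sum += e
    return acc_sum / n_slots, en_sum / n_slots, Z

for V in (1.0, 20.0):
    acc, en, Z = run(V, e_avg=0.5)
    print(f"V = {V}: accuracy = {acc:.3f}, average energy = {en:.3f}, final Z = {Z:.2f}")
```

By construction of the queue update, the average energy over T slots cannot exceed `e_avg` by more than Z(T)/T. With a constraint loose enough that the queue never builds up, the loop always selects 8 bits, mirroring the behavior of the device without an energy constraint.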

VI. CONCLUSION AND FUTURE DIRECTIONS
In this work, we have devised dynamic resource allocation strategies for executing machine learning tasks at the wireless network edge, hinging on the trade-off between energy, delay, and learning accuracy. The main novelty of our approach is to incorporate the learning accuracy into the search for the optimal balance between energy consumption and E2E service delay. The proposed methods are based on Lyapunov stochastic optimization, which enables the derivation of low-complexity algorithms able to work without a priori knowledge of a number of context features, such as task arrivals or channel state.
The first proposed strategy is aimed at minimizing the energy expenditure under E2E delay and learning accuracy constraints. We showed how the energy-delay trade-off curve changes by acting on the learning accuracy requirement. Whenever it is possible to write closed-form expressions for the learning performance, we derived a corresponding model-based solution. In all cases where a model is not available, we have proposed a purely data-driven approach that measures performance in an online fashion. Both model-based and data-driven approaches have been customized and tested on a training task carried out using an LMS algorithm for estimation/prediction purposes, both on synthetic and real datasets.
The second proposed strategy aims at optimizing the learning accuracy under E2E delay and energy constraints. This strategy has been applied to inference tasks at the edge server, i.e., SVM and NN classification. Several numerical results on two real datasets illustrate the performance of the proposed approaches in terms of energy-delay-accuracy trade-off.
The proposed methods are very general and can be customized to several supervised, unsupervised, or semi-supervised learning tasks. Further developments are possible in different directions. In this work, we used a simple scalar source encoder, but more sophisticated encoders, such as a vector quantizer, can be used to achieve a better rate-distortion pair. Moreover, the stochastic optimization has been built on a drift-plus-penalty formulation, with a fixed parameter V used to explore the energy-delay trade-off. Alternative approaches can be followed, adapting the value of the parameter V depending on the behavior of the overall system.