Multi-User Goal-Oriented Communications With Energy-Efficient Edge Resource Management

Edge Learning (EL) pushes computational resources toward the edge of 5G/6G networks to assist mobile users requesting delay-sensitive and energy-aware intelligent services. A common challenge in running inference tasks remotely is to extract and transmit only the features that are most significant for the inference task. From this perspective, EL can be effectively coupled with goal-oriented communications, whose aim is to transmit only the information relevant to perform the inference task, under prescribed accuracy, delay, and energy constraints. In this work, we consider a multi-user/single-server wireless network, where the users can opportunistically decide whether to perform the inference task by themselves or, alternatively, to offload the data to the edge server for remote processing. The data to be transmitted undergoes a goal-oriented compression stage performed using a convolutional encoder, jointly trained with a convolutional decoder running at the edge-server side. Employing Lyapunov optimization, we propose a method to jointly and dynamically optimize the selection of the most suitable encoding/decoding scheme, together with the allocation of computational and transmission resources, across all the users and the edge server. Extensive simulations confirm the effectiveness of the proposed approaches and highlight the trade-offs between energy, latency, and learning accuracy.

Wireless networks are evolving from a pure communication infrastructure to a key enabler for pervasive services, which rely heavily on Artificial Intelligence (AI) and Machine Learning (ML). Typical examples can be found in augmented reality, autonomous driving, massive Internet of Things, and mission-critical applications [1]. In these scenarios, the service delay and reliability constraints are often very restrictive, and this motivates the need to design a holistic system where communication, computation, learning, and control are jointly managed in order to achieve reliability, energy efficiency, and sustainability.
The need to process a huge amount of data, in real time, through proper AI/ML techniques, has driven researchers to design training/inference tasks at the wireless edge, in collective as well as distributed fashions. This has led to the definition of the so-called Edge Intelligence (EI) paradigm [2]. In this view, the allocation of system resources in order to reach prescribed target performance in terms of latency, accuracy, and energy consumption has already been considered in [3], [4], [5], [6]. Specifically, EI allows User Equipments (UEs) connected to a mobile network to opportunistically offload their learning tasks to Edge Servers (ESs), which are placed at the network edge, near the Radio Access Points (RAPs). This allows the efficient management of system resources, such as transmission rate, bandwidth, and CPU clock rates, according to specific optimization strategies, which are mainly focused on the trade-offs between energy consumption, overall latency, and learning accuracy [6].
Clearly, from a resource optimization perspective, it would be useful to offload to the ESs only the (minimum) amount of information strictly necessary to fulfill the learning task with the desired accuracy, while respecting the performance requirements. This intuitive consideration, jointly with the huge increase of traffic envisaged in future 6G networks [7], motivates the search for a new communication paradigm, alternative to the classical Shannon design. In this view, a valuable candidate is represented by Goal-Oriented Communications (GOC) [8]. More specifically, if the goal of communication is to perform an inference task on the data collected by the UE, rather than requiring the accurate reproduction of all the transmitted bits at the receiver side, the aim of GOC is to transmit only the information that is most relevant to run the inference task at the ES, guaranteeing a prescribed level of decision accuracy and system performance. In this way, it is possible to help the UEs save transmission resources and avoid unnecessary data rate growth, while still respecting application constraints, such as service delay and energy consumption.

© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Related works. Seminal EI frameworks, with a wireless offloading strategy, have been proposed in [6], [9], which save transmission resources by simply allocating, in a dynamic fashion, the number of (quantization) bits used by UEs to transmit their data to the ES. This compression strategy has also been employed in [10] and [11], where edge classification and ensemble learning are considered, respectively, with reliability guarantees. A more principled data reduction strategy, better matched to the learning task and based on the Information Bottleneck (IB) [12], [13], has been proposed in [14]. However, the IB principle admits a closed-form solution for the encoder only if the overall statistics are jointly Gaussian [14], [15], or a solution achievable through an iterative mechanism if the statistics are discrete. When the sensed data and decision outputs are neither jointly Gaussian, nor discrete with manageable cardinality, it is not easy to derive the IB solution, and the source encoding problem can be reformulated using the so-called variational IB (VIB), as recently explored in [16] and in [17], where a cooperative (multi-device) inference framework is proposed.
A possibility to further deviate from the classical communication design is offered by Joint Source/Channel Coding (JSCC), which has received increasing attention with the widespread use of Deep Neural Networks (DNNs). Quite recently, several works have proposed to replace the classical cascade of source and channel encoders with a DNN properly trained with respect to the specific task. For instance, [18] proposed a DNN-based JSCC scheme to achieve higher performance in the finite block-length regime for image retrieval applications. Furthermore, if the task of communication is image recognition, it makes sense to design the JSCC architecture by directly focusing on the learning task, rather than on image reconstruction followed by the recognition task, as proposed in [19]. The authors of [20] presented a scheme for image retrieval where the extracted feature vectors are directly mapped to the channel input symbols, without resorting to any channel coding technique, and the server retrieves the most relevant images directly from the noisy channel output. This approach has been extended in [21], where the extracted features are quantized before being mapped onto the channel symbols. In [22], JSCC is coupled with an OFDM system operating over a frequency-selective channel, while [23] considers the combination of JSCC with non-linear transform coding [24].
As far as goal-oriented (also known as task-oriented) communications are concerned, several recent works testify to the emerging relevance of this topic. For instance, in [25] and [26] GOCs have been exploited to define the common language between a listener and a speaker, employing Reinforcement Learning (RL) and Curriculum Learning (CL), while a transformer-based approach has been proposed to assist image and text transmissions [27]. A noise-aware JSCC for text transmission is described and assessed in [28], while [29] exploited a hybrid automatic repeat request (HARQ) scheme to improve reliability in sentence semantic transmission. Other examples of image classification for Unmanned Aerial Vehicle (UAV) applications, and a GOC-assisted Visual Question Answering (VQA) task, can be found in [30] and [31], respectively. Furthermore, [19] and [20] motivate the use of GOC schemes for computer vision applications, by showing the accuracy improvements they provide in image-classification and re-identification tasks of humans and cars, respectively. Finally, the impact of goal-oriented communications has also been analyzed in speech recognition tasks [32].
However, none of the works cited above considered the dynamic optimization of the data reduction strategy for multi-user goal-oriented communications, jointly with the global network resource management, under prescribed performance guarantees, as we do in this manuscript. Along this line, in [33] we proposed minimum-energy and maximum-accuracy resource allocation strategies for edge-assisted image classification tasks, in a single-user/single-server scenario, whereas in [34] we reported some preliminary results on the extension to the multi-user scenario, which we further develop and investigate more thoroughly hereinafter.
Our contributions. The main contributions of this work concern the system architecture, the optimization strategies, and the simulation results. They can be summarized as follows:

A. System Architecture
Extending the preliminary strategies presented in [34], we consider a multi-user goal-oriented communication scenario, where multiple UEs may decide to offload their learning tasks to an ES (or not). Each user relies on a bank of source encoders, each one associated with a specific compression ratio, which dynamically compresses the data units (DUs) to be transmitted to the ES, depending on the online system state. Specifically, exploiting convolutional encoders (CEs), i.e., the encoders of convolutional auto-encoders (CAEs), as in [33], we improve their performance through a new training objective. The ES, when requested, carries out multiple, user-independent, inference tasks, using a bank of convolutional classifiers (CCs), i.e., CNNs, each one matched to the CE used at the UE. The overall CE-CC structure is instrumental to splitting the classification task between UE and ES.

B. Optimization Strategies
We implement a dynamic split of the inference task, selecting, in each time slot, the most suitable CE-CC pair, within the bank of available (pre-trained) CE-CCs, depending on the channel state and on the online accuracy and performance. More specifically, resorting to Lyapunov optimization, we implement a multi-user dynamic goal-oriented source compression architecture that selects the CE-CC pair and allocates computational and communication resources, trading off energy consumption (of both UEs and ES), delay, and classification accuracy. Hereinafter, we extend the preliminary results and optimization strategy shown in [34] by also considering a multi-user Maximum Accuracy strategy, with guaranteed (maximum) Delay bounds and Energy consumption (MADE). Furthermore, we let every UE decide whether to perform the inference task locally or to offload it to the ES, since there might be applications where the UE hardware is capable of running the application locally, or it could be more convenient, for the overall resource management, to do so.

C. Simulation Scenarios
We investigate herein scenarios that were not analyzed in [34], where each UE has different service requirements and constraints. The wide set of possible scenarios, optimization strategies, and simulation results significantly extends the results presented in [34], highlighting the effectiveness and flexibility of the proposed holistic resource management.
Outline. The paper is organized as follows. Section II illustrates the goal-oriented communication system and the related joint training procedure of both the CEs and the CCs for classification purposes. Section III describes the overall system model used in the formulation of the resource optimization strategies, which are then solved in Section IV exploiting stochastic Lyapunov optimization. In Section V we discuss our experimental results and, finally, in Section VI we draw some conclusions and highlight future research directions.

II. CLASSIFICATION NETWORK AND TRAINING
This section describes the architecture employed to make parsimonious use of transmission energy and bandwidth. Specifically, we compress the UEs' data units (DUs) (i.e., the input of the learning task), before they are transmitted to the ES. The latter has to perform the learning task without sacrificing a prescribed target accuracy. As explained more deeply in [33], the Information Bottleneck (IB) [12] is a promising theoretical framework to meaningfully compress the data source in a goal-oriented perspective. However, IB admits a closed-form solution only when the associated statistics are discrete or Gaussian distributed [14], [15]. Thus, since in the multi-class image classification task we are focusing on, the Gaussian assumptions do not hold true and a meaningful definition of mutual information is problematic [35], we proposed in [33] a heuristic approximation of the IB that nicely fits our goal-oriented strategy. Specifically, our approach is based on the deployment of a tunable data compression at the UEs that is useful for the associated inference task at the ES. Without loss of generality for the overall GOC architecture and its resource management, we resort to banks of CEs to compress images at the UE side, according to a layer-by-layer max-pooling strategy. The CEs are coupled with CCs at the ES to perform the final decision, as summarized in Fig. 1 for a single UE.
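To fix ideas, the spatial reduction performed by the max-pooling stages can be sketched as follows. This is a minimal NumPy illustration, not the actual CE: the convolutional filters are omitted and only the pooling geometry is reproduced, with hypothetical image sizes.

```python
import numpy as np

def max_pool_2x2(img: np.ndarray) -> np.ndarray:
    """Halve each spatial dimension by taking the max over 2x2 blocks."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def deep_ce_downsample(img: np.ndarray, n_layers: int) -> np.ndarray:
    """Mimic the spatial reduction of a Deep-CE: each layer halves the image."""
    for _ in range(n_layers):
        img = max_pool_2x2(img)
    return img

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
h1 = deep_ce_downsample(img, 1)   # 16x16 latent pseudo-image (compression factor 4)
h3 = deep_ce_downsample(img, 3)   # 4x4 latent pseudo-image (compression factor 64)
```

Each additional layer quarters the number of latent pixels, which is what makes the bank of CEs span a range of compression factors.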
As detailed in [33], a CE may be realized as:
• Short-CE: it resizes the images to the desired resolution by a single convolutional layer followed by a max-pooling layer.
• Deep-CE: it down-samples the images by multiple convolutional layers, each one followed by a max-pooling layer that halves the size of the (pseudo) image.
Note that our goal is to classify the images and not to reproduce them. Thus, for the CE-CC compression and classification network shown in Fig. 1, we have to consider a different learning cost function than those used for classical CAEs. Specifically, we resort to the following objective function

L(φ, θ) = L_ce(Y_n, Ŷ_n, φ, θ) + λ L_mse(X_n, X̂_n, θ),    (1)

where L_ce(Y_n, Ŷ_n, φ, θ) is the cross-entropy loss, used in order to control the performance of the ES classification task, while L_mse(X_n, X̂_n, θ) is the Mean Squared Error between the input and the reconstructed version X̂_n at the output of the full CAE. Note that the cross-entropy loss in (1) is a proxy of the mutual information I(h; Y) [36]. Thus, by minimizing the cross-entropy, we maximize I(h; Y) for a fixed CE architecture (compression size), and this constitutes the link between the proposed approach and the IB principle. However, differently from what we did in [33], (1) also considers the output MSE of a Convolutional Decoder (CD), i.e., the part of the CAE that is typically used for image reconstruction. The presence in (1) of this (regularizing) MSE penalty term favours a meaningful feature extraction [37], which can improve the performance of the overall learning task for proper values of the parameter λ. Note, however, that the CD is taken into account only during the CE-CC training, while it is not used for classification, as clarified by Fig. 1. Each (split) CE-CC couple has to be properly trained, possibly off-line, by a third party. Thus, although it would be interesting to analyze how to train the classification network through the same wireless edge-computing architecture we consider herein for classification, this is not the object of this manuscript and is left for future studies.
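The combined objective in (1) can be sketched numerically as follows. This is a minimal NumPy illustration with hypothetical toy tensors; in practice, both terms would be computed inside a deep-learning framework during CE-CC training.

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Average cross-entropy between one-hot labels and predicted class probabilities."""
    eps = 1e-12  # numerical guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))

def combined_loss(y_true, y_prob, x, x_rec, lam: float) -> float:
    """L = L_ce + lam * L_mse: classification loss plus the MSE regularizer of (1)."""
    l_ce = cross_entropy(y_true, y_prob)
    l_mse = float(np.mean((x - x_rec) ** 2))
    return l_ce + lam * l_mse

# Toy example: with perfect reconstruction the loss reduces to the cross-entropy alone
y_true = np.array([[1.0, 0.0]])
y_prob = np.array([[0.5, 0.5]])
x = np.zeros((2, 2))
loss_perfect = combined_loss(y_true, y_prob, x, x, lam=0.3)
loss_noisy = combined_loss(y_true, y_prob, x, x + 1.0, lam=0.3)
```

Tuning λ trades off the weight of the reconstruction regularizer against the pure classification objective, as discussed above.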
JPEG compression. Note that the CE, targeting good classification performance, compresses the images by a down-sampling principle, due to the max-pooling strategy at each layer. However, this design does not take into account the wireless communication between UEs and ES. Thus, while the size of the latent representation h of a CE output (see Fig. 1) may be optimal for a target classification accuracy, it could still be sub-optimal with respect to the file size of the compressed data units, leading to huge costs in terms of transmission energy and (long) transmission time. This problem justifies the employment of a further zipping (compression) phase on h, before transmitting it to the ES, which will unzip it back to h at the CC input. Due to the nature of the classification task and the structure of the pseudo-images h extracted by the CE, we base this further compression at the UE on a JPEG codec, which proved to effectively reduce the file size of the data units, paying a reasonable price in terms of additional computational overhead from the UE perspective. The choice of JPEG is justified since it is a widely used zipping system, with a plethora of efficient implementations. Furthermore, despite its lossy nature, it has been proved that JPEG codecs do not significantly affect the classification performance of CNNs [38].
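The zip/unzip stage can be sketched as follows. Since a JPEG codec requires an image library, this minimal illustration uses Python's standard zlib as a lossless stand-in for the codec described above, applied to an 8-bit quantized latent pseudo-image; names and values are hypothetical.

```python
import zlib
import numpy as np

def zip_latent(h: np.ndarray, level: int = 9) -> bytes:
    """UE side: quantize the latent pseudo-image to 8 bits and compress it."""
    q = np.clip(np.round(h), 0, 255).astype(np.uint8)
    return zlib.compress(q.tobytes(), level)

def unzip_latent(payload: bytes, shape: tuple) -> np.ndarray:
    """ES side: recover the latent pseudo-image to feed the CC input."""
    q = np.frombuffer(zlib.decompress(payload), dtype=np.uint8)
    return q.reshape(shape).astype(float)

h = np.tile(np.arange(16.0), (16, 1))   # smooth 16x16 pseudo-image: compresses well
payload = zip_latent(h)                 # bit stream actually sent over the air
h_hat = unzip_latent(payload, h.shape)
```

The payload size, rather than the latent dimension, is what enters the transmission-energy accounting of Section III.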

III. SYSTEM MODEL
The considered goal-oriented scenario encompasses multiple devices (UEs), with limited computational and energy capabilities, which are connected through an Access Point (AP) to an ES with a larger amount of computing resources; an illustration is given in Fig. 2. To perform a generic learning task, for each UE connected to the network, the system handles three main phases: i) the UE buffers the Data Units (DUs), i.e., the images to be classified; ii) depending on the specific offloading decision, which is affected by the system status, the DUs are either scheduled to be compressed and transmitted by the goal-oriented compression strategy proposed in Section II or, alternatively, to be processed locally; iii) the inference task takes place either at the UE or ES side, depending on the offloading decision.
The system evolves in a time-slotted fashion, where each time slot has a fixed duration τ. Therefore, we deal with discrete-time functions f(t), where t ∈ N is an index for the t-th time-slot [tτ, (t + 1)τ[. The aim of the resource optimization strategies for GOC is to guarantee a specific (maximum) E2E delay requirement, while optimizing either the system energy consumption or the learning accuracy. To this end, the proposed policies have to manage several resources. In particular, the k-th UE has to allocate its transmission rate R_k(t) toward the ES, its clock frequency f_k^d(t), employed to perform the data compression by a specific compression factor ρ_k(t), and the offloading decision d_k(t). As far as the ES is concerned, the main optimization variable is represented by the clock frequency f^c(t), which has to be properly split among the learning tasks of the different users. These quantities represent the optimization variables of the objective functions we will define for the proposed resource management strategies. We are now ready to describe the models adopted for latency, energy, and classification accuracy.

A. Latency Model
The system evolution over time is entirely described by a queuing system, as prescribed by the Lyapunov optimization framework [39]. In particular, for each user involved in the network, we define two kinds of physical queues:
- A computation/communication queue at each UE, which collects the DUs, i.e., the images, generated by each device, which are waiting to be compressed and transmitted to the ES for classification.
- A separate computation queue at the ES side for each possible compression degree (i.e., CE) that the UEs may dynamically employ: thus, for each UE connected to the network, we have a different number of ES queues, depending on the CE compression degrees that are available. This design choice is motivated by the need to make the ES optimization problem computationally affordable, as we will clarify later.
We denote by K the total number of UEs connected to the network. The binary variable d_k(t) ∈ {0, 1} models the decision to offload (or not) the learning task of the k-th device during the t-th time-slot. When any UE has to offload its learning task (i.e., d_k(t) = 1), we make the following assumptions, which are instrumental to practically manage the optimization problem (see [33] for further details).
Assumption 1: The DUs in each UE queue have to be compressed and transmitted within the same time-slot. Indeed, during a given time-slot, it is impossible to optimally compress DUs that will be transmitted during one of the next time-slots, when the system could possibly experience different channel conditions, or different lengths of the ES/UE queues, etc. Therefore, compression and transmission operations have to be done sequentially within the same time-slot.
Assumption 2: We assume that, while a UE is transmitting some DUs, it can also simultaneously compress other DUs.
The number of (compressed) DUs that can be transmitted during the t-th time-slot is expressed by

N_k^tx(t) = ⌊ τ R_k(t) / (M(ρ_k(t)) N(ρ_k(t))) ⌋,    (2)

where R_k(t) and ρ_k(t) are the transmission rate and the compression factor,1 respectively, selected for the k-th UE at time t; M(ρ_k(t)) is the DU's size (in pixels) for a certain compression factor ρ_k(t), and N(ρ_k(t)) is the number of bits that are necessary (on average) to encode a pixel of the (zipped) pseudo-image h. To shorten the notation, we also define W(ρ_k(t)) = M(ρ_k(t)) N(ρ_k(t)), which represents the average number of bits to store an image with a given ρ_k(t).
On the other hand, the number N_k^c(t) of DUs that can be compressed during the t-th time-slot by the k-th device is expressed by

N_k^c(t) = ⌊ τ f_k^d(t) J_d(ρ_k(t)) ⌋,    (3)

where J_d(ρ_k(t)) denotes the number of DUs compressed in a clock cycle (which depends on the selected compression factor ρ_k(t)), and f_k^d(t) denotes the clock frequency chosen for the k-th UE during the same time-slot. Recalling Assumption 1, all the DUs that are compressed within a time-slot have to be transmitted during the same time-slot, and all the transmitted DUs have to be compressed first; thus, the transmission rate has to be matched to the number of DUs that are actually compressed. Taking into account that, before the transmission can start, we need to wait a time equal to 1/(f_k^d(t) J_k^d(t)) to compress the first DU, the actual number of DUs that can be offloaded by the k-th device during the t-th slot is expressed by

N_k^off(t) = min( N_k^c(t), max( N_k^tx(t) − ⌈ R_k(t) / (W(ρ_k(t)) f_k^d(t) J_k^d(t)) ⌉, 0 ) ).    (4)

Plugging the inequality N_k^tx(t) ≤ N_k^c(t) into (4), we end up with the following (integer) inequality,

R_k(t) ≤ W(ρ_k(t)) f_k^d(t) J_d(ρ_k(t)),    (5)

which will be useful in the next derivations. Finally, similarly to (3), when the learning task is performed locally, the total number of DUs processed by the k-th UE is expressed by

N_k^L(t) = ⌊ τ f_k^d(t) J_k^L(ρ_k(t)) ⌋,    (6)

where J_k^L(ρ_k(t)) expresses the number of DUs that can be compressed by a factor ρ_k(t) and subsequently classified in a clock cycle by the UE hardware. Putting together (4) and (6), the number of DUs that can be processed by a UE within a single time-slot is expressed by

N_k^UE(t) = d_k(t) N_k^off(t) + (1 − d_k(t)) N_k^L(t).    (7)

The UE queue Q_k^UE(t) is fed by the arrival of new DUs, and is drained either by the transmission of DUs to the ES, or by their local classification at the UE. Thus, it is characterized by the following evolution:

Q_k^UE(t + 1) = max( Q_k^UE(t) − N_k^UE(t), 0 ) + A_k(t),    (8)

where A_k(t) models the DU arrival process, whose statistical properties are generally unknown.
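The per-slot accounting of transmittable, compressible, and offloadable DUs in (2)-(4) can be sketched as follows. The values of W(ρ) and J_d(ρ) are hypothetical, and for simplicity this sketch ignores the compression latency of the first DU accounted for in (4).

```python
import math

# Illustrative per-compression-factor parameters (hypothetical values):
# W[rho]: average bits per compressed DU;  J_D[rho]: DUs compressed per clock cycle
W = {2: 40_000.0, 4: 12_000.0}
J_D = {2: 2e-8, 4: 5e-8}

def n_tx(R: float, rho: int, tau: float) -> int:
    """DUs that the radio link can deliver in one slot, as in (2)."""
    return math.floor(tau * R / W[rho])

def n_c(f_d: float, rho: int, tau: float) -> int:
    """DUs that the UE can compress in one slot, as in (3)."""
    return math.floor(tau * f_d * J_D[rho])

def n_offload(R: float, f_d: float, rho: int, tau: float) -> int:
    """Offloadable DUs: limited by both the compression and the transmission stage."""
    return min(n_c(f_d, rho, tau), n_tx(R, rho, tau))

# Example slot: tau = 0.1 s, R = 20 Mbit/s, f_d = 1 GHz, rho = 4
slot = n_offload(R=20e6, f_d=1e9, rho=4, tau=0.1)
```

In this example the link could carry 166 DUs, but the UE can only compress 5, so compression is the bottleneck; a higher compression factor or clock frequency shifts the bottleneck back to the radio link.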
At the ES, we employ L_k different queues for each UE, i.e., one for each available compression factor ρ_ki ∈ S_k, whose evolution is described by

Q_ki^ES(t + 1) = max( Q_ki^ES(t) − N_ki^ES(t), 0 ) + d_k(t) N_k^off(t) 1{ρ_k(t) = ρ_ki},    (9)

where the number of DUs that the ES can process for the i-th queue of the k-th UE during the t-th time slot is

N_ki^ES(t) = ⌊ τ f_ki^s(t) J_i^s ⌋,    (10)

and f_ki^s(t) is the ES clock frequency assigned to the i-th queue (compression factor) of the k-th UE during the t-th time slot.2

2 The quantity 1/J_i^s in (10) is a conversion factor that maps the number of DUs received by the ES into the equivalent number of clock cycles requested for their processing (e.g., classification).
To set up our delay constraints, we need to define an overall queue that, for each device, takes into account the overall computational load at both the UE and ES side. Since we aim to respect an average latency constraint, as we will detail in the following, and bearing in mind that the ES can perform a parallel computation of multiple DUs, by means of (8) and (9) it makes sense to consider the average length of the parallel queues, which is expressed by

Q̄_k(t) = Q_k^UE(t) + Σ_{i=1}^{L_k} p_ki Q_ki^ES(t),    (11)

where p_ki is the probability to employ the i-th compression factor in S_k, which can be estimated by an online sample mean.3 By assuming a certain average data arrival rate A_k = E{A_k(t)}, and exploiting Little's Law [40], (11) allows us to model the average long-term delay, as expressed by

D_k^avg = lim_{T→∞} (1/T) Σ_{t=1}^{T} E{Q̄_k(t)} / A_k.    (12)

For a latency constraint D_k^avg, we get a queue length constraint Q_k^avg = D_k^avg A_k and, consequently, we can equivalently formalize the latency constraint as a queue constraint by

lim_{T→∞} (1/T) Σ_{t=1}^{T} E{Q̄_k(t)} ≤ Q_k^avg.    (13)
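The mapping between the latency target and the queue constraint via Little's Law can be illustrated with a small simulation. The arrival and service parameters are hypothetical, and the binomial arrival process is chosen purely for illustration.

```python
import random

random.seed(0)

def queue_sim(arrival_rate: float, service_per_slot: int, n_slots: int):
    """Simulate Q(t+1) = max(Q(t) - served, 0) + A(t); return avg queue and avg arrivals."""
    q, q_sum, a_sum = 0, 0.0, 0.0
    for _ in range(n_slots):
        # ~Binomial(10, arrival_rate/10) arrivals per slot, mean = arrival_rate
        a = sum(1 for _ in range(10) if random.random() < arrival_rate / 10)
        q = max(q - service_per_slot, 0) + a
        q_sum += q
        a_sum += a
    return q_sum / n_slots, a_sum / n_slots

q_avg, a_avg = queue_sim(arrival_rate=3.0, service_per_slot=4, n_slots=20_000)
d_avg = q_avg / a_avg            # Little's Law: average delay in slots
q_target = 5.0 * a_avg           # queue budget enforcing a 5-slot average delay
```

A latency budget of D slots thus translates into the queue budget Q_avg = D * A, which is the form actually enforced by the optimization in (13).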

B. Energy Model
The energy model of our system involves three main components:
- Transmission energy at the UEs, needed to transmit the DUs to the ES in case of offloading decisions.
- Computation energy at the UEs, needed either to compress/encode the DUs to be transmitted, or to perform the learning task locally.
- Computation energy at the ES, needed to classify the DUs transmitted by the UEs that decide to offload their learning tasks.
For simplicity, assuming a capacity-achieving transmission system over a flat-fading wireless channel, the transmission power p_k^tx(t) requested by the k-th UE can be inferred from the Shannon capacity [41]:

R_k(t) = B_k log2( 1 + p_k^tx(t) |h_k(t)|² / (N_0 B_k) ),    (14)

where |h_k(t)|² is the channel power gain, N_0 denotes the noise power spectral density at the receiver side, and B_k is the bandwidth. Thus, by inverting (14), we obtain that the transmission energy spent by the k-th UE during the t-th time-slot depends on the rate R_k(t) as

E_k^tx(t) = τ ( N_0 B_k / |h_k(t)|² ) ( 2^{R_k(t)/B_k} − 1 ).    (15)

From the computation perspective, the ES and UE models are equivalent. Specifically, in order to estimate the energy consumption, we exploit the model in [42], which assumes a cubic dependence on the ES and UE clock frequencies f^s(t) and f_k^d(t):

E^s(t) = τ κ_s ( f^s(t) )³,    E_k^d(t) = τ κ_k^d ( f_k^d(t) )³.    (16)

The constants κ_s and κ_k^d represent the effective switched capacitance [42] of the ES and the k-th UE processor, respectively. Thus, we quantify the system energy consumption during the t-th time-slot using the following weighted performance metric:

E^tot(t) = γ Σ_{k=1}^{K} σ_k ( E_k^tx(t) + E_k^d(t) ) + (1 − γ) E^s(t),    (17)

where the parameter γ ∈ [0, 1] is used to weight the UEs versus the ES energy consumption, enabling tuning toward the implementation of a user-centric (γ → 1) or a server-centric (γ → 0) optimization strategy. Furthermore, the weights σ_k ≥ 0 can be employed to assign different importance to the energy consumption of different users, providing an extra degree of flexibility to the resource optimization, depending on the needs of the operators, users, and service providers.
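The rate-energy relationship obtained by inverting the Shannon capacity in (14) can be sketched as follows, with illustrative, hypothetical channel parameters.

```python
import math

def tx_power(R: float, B: float, gain2: float, N0: float) -> float:
    """Power needed to sustain rate R on a flat-fading AWGN link (inverted capacity formula)."""
    return (N0 * B / gain2) * (2.0 ** (R / B) - 1.0)

def tx_energy(R: float, tau: float, B: float, gain2: float, N0: float) -> float:
    """Energy spent transmitting at rate R for a whole slot of duration tau."""
    return tau * tx_power(R, B, gain2, N0)

# Hypothetical link: B = 1 MHz, |h|^2 = 1e-3, N0 = 1e-10 W/Hz, tau = 0.1 s
e1 = tx_energy(R=1e6, tau=0.1, B=1e6, gain2=1e-3, N0=1e-10)
e2 = tx_energy(R=2e6, tau=0.1, B=1e6, gain2=1e-3, N0=1e-10)
```

Note the convexity in the rate: doubling R here triples the energy, which is precisely why shaving bits off the payload through goal-oriented compression pays off so strongly.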

C. Accuracy Model
For the accuracy of the learning task of each UE, we resort to a model-based management strategy. This means that the accuracy for the k-th task can be cast in the optimization problem as a function G_k(ρ_k(t)) of the compression degree. This can be done in practice by employing a look-up table (LUT) (shown in Section V), where each entry is associated with a specific compression factor ρ_k ∈ S_k.4 This LUT stores the (average) classification accuracy of the k-th learning task, associated with each one of the CE-CC classifying chains that are available for the k-th UE. The values stored in this accuracy-LUT can be estimated off-line on meaningful test sets, after each CE-CC structure has been properly trained, as described in the previous section. Thus, we can exploit the LUTs G_k(ρ_k(t)) to enforce an average accuracy constraint for each learning task, as expressed by

lim_{T→∞} (1/T) Σ_{t=1}^{T} E{G_k(ρ_k(t))} ≥ G_k^avg.    (18)

IV. DYNAMIC RESOURCE OPTIMIZATION FOR MULTI-USER GOAL-ORIENTED COMMUNICATIONS

On the basis of the delay, accuracy, and energy models presented in the previous section, we develop two resource optimization strategies: a multi-user Minimum-Energy with (maximum) Delay and Accuracy constraints (mu-MEDA), and a multi-user Maximum-Accuracy with (maximum) Delay and Energy consumption constraints (mu-MADE). In the sequel, we describe the problem formulation and the algorithmic solution for both strategies.

A. mu-MEDA: Multi-User Minimum-Energy With Delay and Accuracy Constraints
Following a system energy minimization perspective, the long-term optimization problem can be cast as follows:4

min_{Ψ(t)} lim_{T→∞} (1/T) Σ_{t=1}^{T} E{E^tot(t)}
subject to:
(a) lim_{T→∞} (1/T) Σ_{t=1}^{T} E{Q̄_k(t)} ≤ Q_k^avg, ∀k;
(b) lim_{T→∞} (1/T) Σ_{t=1}^{T} E{G_k(ρ_k(t))} ≥ G_k^avg, ∀k;
(c) R_k(t) ≤ R_k,max(t), ∀k, t;
(d) f^c(t) ∈ F_c, f_k^d(t) ∈ F_{d,k}, ρ_k(t) ∈ S_k, ∀k, t;
(e) Σ_{k=1}^{K} Σ_{i=1}^{L_k} f_ki^s(t) ≤ f^c(t), ∀t;
(f) f_ki^s(t) ≥ 0, ∀k, i, t;
(g) d_k(t) ∈ {0, 1}, ∀k, t;    (19)

where Ψ(t) = {f^c(t), f_ki^s(t), R_k(t), f_k^d(t), ρ_k(t), d_k(t)}_{∀k,i} contains all the optimization variables.

4 We modeled the relationship between the compression factor and the accuracy through a LUT, rather than by a formal analytical expression, because it is almost impossible to find a closed-form expression for this function in practice. Indeed, although noticeable attempts to theoretically formalize DNN performance can be found in [43], [44], these approaches are based on Mutual Information, which is intractable to derive in closed form in most practical cases.

The constraints in (19) have the following meaning: (a) the average queue length for the k-th UE must be lower than Q_k^avg, i.e., we are imposing a maximum average service delay equal to D_k^avg = Q_k^avg / A_k (cf. (13)); (b) the average classification accuracy for the k-th UE must be greater than G_k^avg; (c) the k-th UE transmission rate R_k(t) must be smaller than the value R_k,max(t), which is the maximum possible rate for the k-th device, inferred from (14) considering the maximum available transmission power p_k^tx,max; (d) specifies the discrete sets F_c, F_{d,k}, and S_k for the server frequencies, the frequencies of the k-th UE, and the possible compression factors, respectively; constraints (e)-(f) state that the sum of the clock frequencies f_ki^s(t) that the (edge) server allocates across all the queues assigned to the users must be lower than the total ES clock frequency chosen for the t-th time slot, and that each clock frequency must obviously be greater than 0; finally, (g) represents the binary constraint on the opportunistic offloading decision variable of each UE. Problem (19) is complicated by the lack of knowledge of the statistics of the radio channels and data arrivals, which would be necessary to compute the expected values in (19). To tackle this issue, we resort to Lyapunov stochastic optimization arguments [39], which solve the long-term problem (19) by casting it into a sequence of instantaneous optimization problems that can be solved in a per-slot fashion. According to such an optimization framework [39], we start by associating a virtual queue with each of the long-term constraints (a) and (b). These virtual queues evolve according to

Z_k(t + 1) = max( Z_k(t) + μ_k ( Q̄_k(t) − Q_k^avg ), 0 ),
Y_k(t + 1) = max( Y_k(t) + ν_k ( G_k^avg − G_k(ρ_k(t)) ), 0 ),    (20)

where μ_k and ν_k are step sizes that control the convergence speed of the algorithm. This way, it is
possible to prove that respecting the long-term constraints (a)-(b) is equivalent to guaranteeing the mean-rate stability of the virtual queues in (20) [39]. To this end, we define the Lyapunov function L(t) as the sum of the squares of all the (virtual and physical) queues:

L(t) = (1/2) Σ_{k=1}^{K} [ Z_k(t)² + Y_k(t)² + Q̄_k(t)² ].    (21)

Defining S(t) as the system state at the t-th time slot, we obtain the associated conditional Lyapunov drift

Δ(t) = E{ L(t + 1) − L(t) | S(t) },    (22)

whose minimization corresponds to the stabilization of the virtual queues, but does not take into account the objective function (i.e., the system energy consumption). Thus, in order to trade off system stability and energy consumption, the Lyapunov drift is augmented with a term depending on the system energy, to obtain the so-called Lyapunov drift-plus-penalty function

Δ_V(t) = Δ(t) + V E{ E^tot(t) | S(t) }.    (23)

By increasing the value of the parameter V, we give more importance to the objective function than to queue stability, thus pushing the solution toward optimality while still guaranteeing the stability of the system, i.e., respecting the long-term constraints. In particular, [39] proved that, as the parameter V increases, the optimal solution of (19) is asymptotically reached. Following stochastic optimization arguments [39], we proceed by minimizing an upper bound of the drift-plus-penalty function in (23) (derived in the Appendix), ending up with the instantaneous optimization problem in (24), where, since the optimization variables affect only the terms N_k^UE, N_ki^ES, and G_k, we neglect all the terms that do not depend on them. Moreover, in the following we omit the time index t to simplify the notation.
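The virtual-queue and drift-plus-penalty machinery can be sketched per slot as follows. This is a simplified scalar illustration with hypothetical targets; the actual per-slot objective is the upper bound derived from (23) and minimized in (24).

```python
def update_virtual_queues(Z, Y, q_bar, q_avg, acc, acc_avg, mu=1.0, nu=1.0):
    """One-slot update of the latency (Z) and accuracy (Y) virtual queues, as in (20).
    Z grows when the average queue exceeds its target, Y when the accuracy falls
    below its target; mean-rate stability of Z and Y enforces constraints (a)-(b)."""
    Z = max(Z + mu * (q_bar - q_avg), 0.0)
    Y = max(Y + nu * (acc_avg - acc), 0.0)
    return Z, Y

def drift_plus_penalty(Z, Y, q_bar, q_avg, acc, acc_avg, energy, V):
    """Simplified per-slot cost traded off by V: a large V favours low energy,
    a small V favours fast constraint satisfaction."""
    return Z * (q_bar - q_avg) + Y * (acc_avg - acc) + V * energy

# Constraints persistently violated: both virtual queues build up, raising the
# price the per-slot optimizer pays for further violations
Z, Y = 0.0, 0.0
for _ in range(100):
    Z, Y = update_virtual_queues(Z, Y, q_bar=12.0, q_avg=10.0, acc=0.80, acc_avg=0.85)

dpp = drift_plus_penalty(2.0, 1.0, q_bar=12.0, q_avg=10.0,
                         acc=0.80, acc_avg=0.85, energy=0.5, V=10.0)
```

The growing Z and Y act as time-varying prices: once they are large, the per-slot minimizer is pushed to drain queues and pick more accurate CE-CC pairs even at a higher energy cost.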
Since the UE energy-consumption terms in the cost function of problem (24) depend only (and separately for each UE) on the UE optimization variables, we can optimize this part of the cost function separately at each UE. Note that our design choice to assign separate ES computation queues to each UE offloaded task lets us completely decouple the optimization problem and separately handle the UE and ES resource optimization. Furthermore, as already pointed out in footnote 2, the use of multiple queues for each compression factor ρ_ki, thanks to (11), makes the problem linear with respect to f_ki^s by (10), up to the ⌊·⌋ operator. Consequently, problem (24) is separable and solvable for each compression factor, as described in the following.
1) UE Sub-Problem: For the k-th device, at each time slot t, we have to solve the following optimization problem. Depending on the value of the offloading decision variable d_k, we can optimize the other variables employing two different strategies. If d_k = 1, we have to allocate both the transmission rate R_k to transmit the DUs to the ES, and the UE clock frequency f_k^d and compression factor ρ_k to perform compression. Otherwise, if d_k = 0, we only need to allocate f_k^d and ρ_k to perform the learning task locally. We remark that we assume, although this is not mandatory, that the UE also employs locally the same (bank of) CE-CC classification chains we designed for the GOC scheme, thus fairly offering to the UEs the same flexibility in classification accuracy and energy consumption that could be exploited by the ES solution. Other choices, or a fixed classifier structure at the UE, would obviously have an impact on the offloading decisions made by the optimal resource management and, consequently, on the energy-delay-accuracy trade-offs.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Coming to the solution of the problem, when d_k = 1 we handle the min(·) in (4) by adding the following constraint on the transmission rate of the k-th user. This way, according to Assumptions 1 and 2, and keeping in mind that we cannot compress more DUs than we can transmit, we select a data rate that is bounded by the minimum between the maximum achievable rate R_k,max (computed by plugging the maximum power p_k^tx into the Shannon capacity (14)) and the draining rate Q_k^UE W(ρ_k)/τ that is capable of emptying the transmission queue (which lets us remove the max(·)). By considering that ⌈x⌉ − 1 ≤ x ≤ ⌈x⌉, we can also remove the ⌈·⌉ in (4). Using the definition of the indicator function, for any fixed compression factor ρ_ki ∈ S_k, we end up with the corresponding optimization problem. This is a mixed-integer optimization problem. However, in practice, the sets F_d,k and S_k have fairly low cardinality and, as detailed below, the solution can be rapidly found by an exhaustive search. Indeed, for any fixed pair of compression factor ρ_k ∈ S_k and computation frequency f_k^d ∈ F_d,k, the optimization problem is convex with respect to the data rate R_k, whose optimal value can be found in closed form by duality theory through the Lagrangian, where α and β are the Lagrange multipliers. Note that, if Q_ki^TX ≤ 0, the second term monotonically increases with the rate, and the Lagrangian is minimized for R_k = 0.
Otherwise, when Q_ki^TX > 0, we can solve the optimization problem by imposing the KKT conditions [45]. Solving the KKT conditions, we can compute the optimal rate R_k*(ρ_k, f_k^d) in closed form, for any fixed compression factor ρ_k and clock frequency f_k^d of the k-th user. Thus, as anticipated, to select the best clock frequency f_k^d* and compression factor ρ_k*, we can proceed by exhaustive search, thanks to the limited cardinality of F_d,k and S_k. Summarizing, for a potential offloading (d_k = 1), we compute the optimal rate and clock frequency for each possible compression factor ρ_k and then, at every time slot, we select the triple that yields the lowest energy cost. Otherwise, for a potential classification at the UE (d_k = 0), the transmission rate to the ES would be R_k = 0 and we only need to optimize the clock frequency for each possible compression factor, thus obtaining the optimal pair that minimizes the UE's energy consumption. The overall optimal solution of the UE's optimization problem, which includes the decision whether to offload the learning task, is finally given by choosing between the pairs (d_k = 1, T_k*) and (d_k = 0, P_k*) the one that leads to the minimum value of the UE's energy cost function.
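The two-level search described above (a closed-form rate inside an exhaustive scan of the discrete sets) can be sketched as follows. The closed-form KKT rate is abstracted into a caller-supplied `optimal_rate` function, and all energy models and numerical values are illustrative assumptions rather than the paper's actual expressions:

```python
from itertools import product

# Hypothetical discrete decision sets for one UE (illustrative values).
FREQS = [r * 1.4e9 for r in (0.2, 0.6, 1.0)]   # F_d,k: clock frequencies [Hz]
RHOS = [2, 4, 8, 16, 32, 64]                   # S_k: compression factors

def ue_best_action(optimal_rate, tx_energy, comp_energy, local_energy):
    """Exhaustive scan over (rho, f); the rate comes from the closed form.

    optimal_rate(rho, f) -> R_k* for the fixed pair (stands in for the
                            paper's KKT-based expression)
    tx_energy(R)         -> transmission energy at rate R
    comp_energy(rho, f)  -> compression energy at the UE (offloading)
    local_energy(rho, f) -> energy for fully local classification
    """
    # d_k = 1: offloading candidates (rho, f, R).
    best_offload = min(
        ((rho, f, optimal_rate(rho, f)) for rho, f in product(RHOS, FREQS)),
        key=lambda t: tx_energy(t[2]) + comp_energy(t[0], t[1]),
    )
    off_cost = tx_energy(best_offload[2]) + comp_energy(*best_offload[:2])

    # d_k = 0: local classification candidates (rho, f), with R = 0.
    best_local = min(product(RHOS, FREQS), key=lambda t: local_energy(*t))
    loc_cost = local_energy(*best_local)

    # Final decision: the cheaper of the two strategies.
    if off_cost <= loc_cost:
        return 1, best_offload, off_cost
    return 0, best_local, loc_cost
```

The double `min` mirrors the text: an inner closed-form rate for each fixed (ρ, f) pair, an outer exhaustive search over the low-cardinality sets, and a final comparison between the offloading and local strategies.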
2) ES Sub-Problem: From the ES perspective, for each UE we have to manage multiple computing queues, each one associated with a specific compression factor used by that UE: in the following, we denote by Q_ki^ES the i-th ES computing queue for the k-th UE. It clearly makes sense to constrain the fraction f_ki^s (of the total ES computing frequency f_s) reserved to the i-th queue of the k-th user to be lower than what would be necessary to completely drain that queue within a time slot. This way, we can remove the terms max(0, Q_ki^ES − N_ki^ES) from the sum in (24) and, consequently, we can rewrite the ES resource allocation problem accordingly. Although the problem is a mixed-integer one, for any fixed ES clock frequency f_s, it
boils down to the classical (fractional) knapsack problem [46]. Consequently, the optimal solution is obtained by a greedy algorithm, which consists in ordering the queues by their weights Q_ki^comp J_ki^s in descending order, and then assigning to each queue the clock frequency min(φ, ·), where φ is the remaining part of the ES clock frequency f_s(t) and the second argument is the queue's draining frequency. Consequently, due to the limited cardinality of the ES clock-frequency set F_s, also in this case we can exhaustively solve the problem for all the server clock frequencies f_s ∈ F_s, thus obtaining the set of possible solutions {(f_ki^s, f_s)}_{f_s ∈ F_s}, and then choose the one associated with the minimum ES cost in (33).
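The greedy allocation above is the classical fractional-knapsack rule: sort the queues by weight, then give each its full drain frequency until the budget φ runs out. A minimal sketch, with all numerical values illustrative:

```python
def allocate_es_frequency(queues, f_s):
    """Greedy split of the ES clock budget f_s among per-(UE, rho) queues.

    queues: list of (weight, f_drain) pairs, where weight plays the role
            of Q_ki^comp * J_ki^s and f_drain is the frequency needed to
            fully drain that queue within one slot.
    Returns the allocated frequencies, in the input order.
    """
    phi = f_s                               # residual frequency budget
    alloc = [0.0] * len(queues)
    # Serve the heaviest (most urgent) queues first.
    order = sorted(range(len(queues)), key=lambda i: queues[i][0], reverse=True)
    for i in order:
        give = min(phi, queues[i][1])       # min(phi, drain frequency)
        alloc[i] = give
        phi -= give
        if phi <= 0.0:
            break
    return alloc
```

Since each queue is capped at its own drain frequency, partial assignments only occur for the last queue served, which is exactly why the greedy rule is optimal for the fractional knapsack.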

B. mu-MADE: Multi-User Maximum-Accuracy With Delay and Energy Constraints
An alternative resource allocation strategy, targeting maximum accuracy, can be formulated as a constrained minimization whose variable set, for k = 1, . . ., K and i = 1, . . ., L_k, collects all the optimization variables. The constraints have the following meaning: (a) the average queue length for the k-th UE must be lower than Q_k^avg, i.e., we are imposing a maximum average service delay equal to Q_k^avg/A_k; (b) the average energy consumption of the k-th UE must be lower than E_k^d,avg; (c) the average ES energy consumption must be lower than E_s^avg; (d)-(h) have the same meaning as (c)-(g) in (19).
Proceeding similarly to the mu-MEDA strategy, in order to manage the long-term energy constraints (b) and (c), in addition to the virtual queue Z_k(t) defined in (20) to manage (a), we need to define further virtual queues, where {λ_k}_{k=1}^K and η are the step sizes used to control the convergence speed of the algorithm. By the definition of the virtual queues, the Lyapunov function changes accordingly in this case and, consequently, we derive the corresponding expression of the Lyapunov drift-plus-penalty function. As detailed in the Appendix, we end up with the following optimization problem. Exploiting again the decoupling of the problem, which is granted by our design choice of handling separate queues for each UE and each compression factor, we also end up in this case with distinct instantaneous optimization problems, one at each UE, and a single one at the ES.
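The virtual-queue mechanism behind these long-term constraints can be illustrated with a minimal sketch. The update below follows the standard Lyapunov form Z(t+1) = max(0, Z(t) + step·(actual − budget)), with the step size playing the role of λ_k or η; the numerical traces and budgets are illustrative assumptions:

```python
def update_virtual_queue(z, actual, budget, step=1.0):
    """One-slot update of a Lyapunov virtual queue.

    The queue grows when the running cost exceeds its long-term budget;
    it is mean-rate stable iff the time-average constraint is satisfied.
    """
    return max(0.0, z + step * (actual - budget))

def run_constraint_tracking(costs, budget, step=1.0):
    """Track a long-term constraint avg(costs) <= budget over a trace."""
    z = 0.0
    for c in costs:
        z = update_virtual_queue(z, c, budget, step)
    return z
```

Feeding the tracker a trace whose time average respects the budget keeps the virtual queue bounded near zero, while an infeasible trace makes it grow without bound, which is exactly the signal the drift-plus-penalty controller reacts to.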
1) UE Sub-Problem: As far as the k-th UE is concerned, we get the following optimization problem formulation, for k = 1, . . ., K. The resolution strategy is quite similar to the previous case, in which we minimized the energy consumption: if a UE decides to offload its task (d_k = 1), we need to allocate the optimal transmission rate R_k for any fixed compression factor ρ_k and clock frequency f_k^d.
Thus, for a possible offloading decision (d_k = 1), we compute by (40) the optimal data transmission rate R_k* for each pair (ρ_k, f_k^d), and we select the optimal triple that minimizes the cost function in (39). Conversely, in order to evaluate the minimum cost of a local learning task at the k-th UE (d_k = 0), we just need to exhaustively search for the pair that optimizes the accuracy under the prescribed constraints. Finally, depending on which of the two optimal allocation strategies guarantees the best accuracy, we decide whether to offload (d_k = 1) or not (d_k = 0) the k-th user's task, using the associated optimal allocation strategy T_k* or P_k*, respectively. 2) ES Sub-Problem: From the ES perspective, the optimization problem is similar to that of mu-MEDA, except for small differences in the cost function, and can be solved in the same way.

V. SIMULATION RESULTS
In this section, we present the simulation results obtained by the two optimization strategies we proposed and solved. Tables I-II report the values of the accuracy G_k(ρ), the data-units J_k^d(ρ) that can be compressed (and zipped by JPEG) in a clock cycle by the k-th UE when it decides to offload the classification, and the data-units J_k^L(ρ) that can be compressed and classified locally in a clock cycle by the same UE. Table III reports the data-units J_s(ρ) that can be classified in a clock cycle at the ES, as well as the image size M(ρ) and the average number of bits/pixel N(ρ), which are shared by both the short- and deep-CE when using JPEG. We assumed a flat-fading channel, whose statistical characterization is based on Clarke's autocorrelation function [47]. We considered two operating scenarios, summarized in Table IV, and we accordingly set the time-slot duration to τ = 50 ms, which corresponds to the channel coherence time. The parameter σ_0^2 models the wireless channel power path loss and has been computed by considering the Alpha-Beta-Gamma model [48]. In a first set of simulations, we considered a scenario with K = 5 UEs connected to the network. Although this is not strictly necessary, we assumed that the devices of all the UEs share the same computation frequency set F_d = {0.1, 0.2, . . ., 0.9, 1} × 1.4 GHz, while the server computation frequency set is F_s = {0.1, 0.2, . . ., 0.9, 1} × 4.5 GHz. Finally, for simplicity, we considered an effective switched capacitance κ = 1.097 × 10^−27 [s/cycles]^3 for all the UEs and for the ES. We underline that all the simulation results have been obtained at convergence of the tested strategies [39].
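Under the usual dynamic-power model for an effective switched capacitance (per-slot computation energy κ f³ τ, assumed here since the paper's energy expressions are not reproduced in this excerpt), the frequency sets above map to per-slot energy costs spanning several orders of magnitude, which is what the resource manager trades against latency and accuracy:

```python
KAPPA = 1.097e-27      # effective switched capacitance (Sec. V)
TAU = 0.05             # slot duration: 50 ms

def slot_energy(f):
    """Per-slot computation energy under the dynamic-power model kappa*f^3."""
    return KAPPA * f ** 3 * TAU

# Candidate clock sets of Sec. V.
F_D = [r / 10 * 1.4e9 for r in range(1, 11)]   # UE: 0.14 ... 1.4 GHz
F_S = [r / 10 * 4.5e9 for r in range(1, 11)]   # ES: 0.45 ... 4.5 GHz

ue_max = slot_energy(F_D[-1])   # energy per slot at the UE's top clock
es_max = slot_energy(F_S[-1])   # energy per slot at the ES's top clock
```

The cubic dependence on f is what makes frequency scaling such an effective energy lever: the top UE clock costs roughly a thousand times more per slot than the lowest one.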

A. Goal-Oriented Compression Results
For simplicity, all the UEs were assigned the same image classification task, based on the German Traffic Sign Recognition Benchmark (GTSRB) [49] dataset. This dataset includes 1213 pictures of German road signs, divided into 43 different classes. The dataset has been split into an 80% training set, composed of 970 images, and a 20% test set, composed of 243 images. During the data loading phase, all the images have been normalized to a size of 256x256 and converted to 3-channel images (one channel for each RGB color), such that the initial size of each data-unit is 256x256x3. Although this is not strictly necessary, we assumed that all the UEs share the same bank of CE-CC classification networks, i.e., the compression factors ρ_k take values in the same fixed set S = {2, 4, 8, 16, 32, 64}. In order to shed light on the performance obtained by the proposed resource management strategies, we find it useful to show in Fig.
3 the average accuracy on the test set associated with different compressive architectures: i) deep-CE, ii) short-CE, iii) down-sampling with an anti-aliasing pre-filter. As expected, the accuracy G(ρ) is monotonically decreasing with the compression factor for all the models. The deep-CE always has the best performance even if, for the lower compression factors (up to 16), the differences with the short-CE are almost negligible. In contrast, for the highest ones (i.e., 32, 64) there is a clear advantage in using the deep-CE. For compression factor ρ = 64 we get output tensors with a size of 4x4x3 = 48 pixels: although (pseudo) images of this size have clearly undergone a heavy transformation, the deep-CE still allows the ES's CC to classify them with 67% accuracy, which is a remarkable performance for a 43-class classification task. Conversely, for this compression factor, neither the down-sampling strategy nor the short-CE allows a meaningful classification. The price to be paid for the increased accuracy of the deep-CE is an increase in computation energy and processing delay (as summarized in Tables I-II), which we trade off through our resource management policies.
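The output tensor sizes follow directly from the 256x256x3 input, under the assumption (suggested by the ρ = 64 → 4x4x3 example above) that ρ divides each spatial dimension while the channel count is preserved:

```python
INPUT_SIDE, CHANNELS = 256, 3
RHOS = [2, 4, 8, 16, 32, 64]   # compression set S

def output_shape(rho):
    """Spatial side divided by rho per dimension, channels kept (assumed)."""
    side = INPUT_SIDE // rho
    return (side, side, CHANNELS)

sizes = {rho: output_shape(rho) for rho in RHOS}
```

This makes explicit that the pixel count actually shrinks by ρ², which is why the transmission-energy savings at high ρ are so pronounced.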

B. mu-MEDA Results
First of all, we tested the mu-MEDA strategy, comparing the CEs (short and deep) with the down-sampling compression strategy in channel scenario B, reported in Table IV. We set the same latency constraint D_k^avg = Q_k^avg/A_k = 0.20 s for all the UEs. We considered a task arrival process with A_k = 2 DU/slot, and we forced the UEs to always offload the classification task to the ES, without any opportunistic strategy (i.e., d_k(t) = 1, ∀k, t).
Each trade-off curve in Figs. 4 and 5 is associated with a different accuracy constraint, while they all respect the same latency constraint, highlighted by a dashed horizontal line in the plots. Each curve is obtained by evaluating the solution (at convergence) of the resource optimization problem for several values of the trade-off parameter V in (23). Specifically, by increasing V we obtain solutions characterized by a lower energy consumption and a higher latency and, as indicated by the black arrow in the figures, we move from the bottom-right to the top-left corner of the trade-off plots, which corresponds to the desired optimal solutions on the border of the feasibility region. Fig. 4 shows that, from the UE's perspective, there is a clear advantage in employing the CE compression strategy, since it leads to solutions characterized by a lower (computational and transmission) energy consumption, while satisfying the same latency and accuracy constraints. This is because channel B is characterized by severe attenuation: since the CE compression strategy satisfies the same accuracy constraint while transmitting smaller DUs than classical down-sampling, it considerably reduces the transmission energy expenditure, without spending too much extra computational energy for CE-based compression at the UE. Moreover, the proposed dynamic, goal-oriented compression strategy also leads to a lower ES computational energy expenditure, as shown in Fig. 5. Indeed, the classification of smaller DUs is also cheaper from a computational and energetic perspective.

C. Opportunistic Offloading
We compared the previous scenario, where the UEs always offload the classification tasks to the ES, with the opportunistic offloading strategy, where the UEs can also decide to perform the classification locally, using the same CE-CC classification architecture. Specifically, two out of five UEs are connected to the ES through the channel of scenario A in Table IV, while the others through the channel of scenario B. The opportunistic offloading strategy yields a dynamic resource optimization characterized by a significantly lower UE energy expenditure with respect to the always-offload strategy, while still satisfying both the accuracy and latency constraints, as shown by Figs. 6-7, where all the solid curves lie to the left (i.e., at a lower energy expenditure) of the dashed curves of the pure offloading strategy. Fig. 8 shows the histogram of the offloading decisions for each UE, for a (minimum) accuracy constraint G_avg = 70% and a trade-off parameter V = 1 × 10^6. As expected, since UE-0 and UE-3 experience good channel conditions, they decide to offload more frequently than the other devices, whose channel B requires a much higher transmission power to allocate rates to the UEs and, sometimes, may even make it unfeasible to respect the accuracy constraint, the delay constraint, or both.

D. Comparison With Static Allocation Strategies
A key strength of the proposed approach is the joint dynamic optimization of transmission and computational resources, together with the optimal dynamic selection of the classification architecture used to perform the task. Thus, we compare the proposed multi-user optimization strategy with:
• A Fixed-Accuracy optimization strategy, where we optimize both the computational and the transmission resources at the UE side, while keeping fixed a single CE-CC classification architecture. This approach is quite similar to the one presented in [6].
• A Hybrid static/dynamic optimization strategy, where, inspired by [50], we fix the transmission rate R on the basis of the average channel conditions, while we dynamically optimize the CE-CC architecture, as well as the computational resources at the UEs. The transmission rate R is fixed as the minimum one that guarantees the stability of the UE queue. This rate can be computed through the capacity of flat-fading Rayleigh channels [51, eq. (9)], and it also fixes the transmission power.
In this case we considered a scenario with K = 3 UEs, each one experiencing different channel conditions and computational efficiency, as summarized in Table V. We set a task arrival rate A = 2 DU/slot, and we imposed the same accuracy and latency constraints for all the UEs, G_k^avg = 92% and D_k^avg = 0.2 s, respectively. Thus, for the Fixed-Accuracy optimization strategy, we considered the short-CE with ρ_k = 8 as the unique learning model, which according to Table II is capable of granting the requested average classification performance with a fairly moderate computational energy. Fig.
9 shows that employing a fully dynamic optimization strategy leads to solutions characterized by a lower UE energy consumption. As expected, UE-0 and UE-2 reach the lowest and highest energy consumption, respectively, given their computational and channel conditions summarized in Table V. It is clear that, for all the UEs, our optimization strategy reaches the lowest energy consumption, thus confirming the effectiveness of jointly and dynamically optimizing the transmission/computation resources as well as the learning architecture (i.e., the CE-CC pair) to be employed, depending on the instantaneous system conditions.

E. mu-MADE Results
We tested the mu-MADE optimization strategy considering a scenario with K = 3 UEs, each one characterized by different channel and computational conditions. In particular, we considered an effective switched capacitance κ_0 = 1.097 × 10^−27 [s/cycles]^3 for the ES, and higher values for the UEs, in order to simulate a lower energy efficiency. The UE energy constraint has been set to E_k^avg = 128 × 10^−3 J. Table V summarizes the different conditions of the devices considered in the simulation, where we employed, concurrently, both the deep- and the short-CE. We remark that UE-0 experiences both good channel conditions and computational efficiency: this means that it has the maximum degree of flexibility in the management of the opportunistic offloading. UE-1 is characterized by the same channel conditions as UE-0, with a lower computational efficiency, while UE-2 operates with both a bad channel and a low computational energy efficiency.
The curves shown in Fig. 10 represent the accuracy-latency trade-off: by increasing the parameter V in (37), we end up with solutions with higher accuracy and latency, moving along the curves from the bottom-left to the top-right corner, where we get the desired optimal solutions at the boundary of the feasible region. Specifically, Fig. 10 shows that UE-0 (i.e., the UE with the best computational and channel conditions) gets the highest accuracy, while widely satisfying the latency constraint. We note a similar behaviour for UE-1 and UE-2, with a higher latency for UE-2 (i.e., the device that works in the worst conditions). Finally, we report in Fig. 11 the histogram of the offloading decisions for each UE. Given its favorable channel and computational energy efficiency, we observe a balanced situation for UE-0, since it has the highest flexibility in choosing whether to offload computations or not. On the other hand, UE-1 mostly performs offloading, since the transmission of DUs over a channel with fairly low attenuation allows it to mitigate the burden of its low computational energy efficiency. Finally, although UE-2 has a much worse channel, it offloads more DUs than UE-0 due to its much higher computational inefficiency.

VI. CONCLUSION AND FUTURE DIRECTIONS
In this work, we implemented a goal-oriented compression architecture based on CEs, which is exploited by two distinct dynamic optimization strategies in order to either minimize the energy consumption or maximize the learning accuracy in a multi-user scenario, where the UEs can opportunistically decide whether and when to offload the computations to the ES. The extensive simulation results confirmed the effectiveness and the flexibility of the proposed approaches in different scenarios. However, we remark that the proposed goal-oriented communication architecture, and the associated resource management strategy, could also exploit classification- or learning-oriented compression strategies different from the CE-based solutions presented herein. Future research directions include the extension to multi-server scenarios and cooperative learning tasks (e.g., Federated Learning), as well as explicitly taking into account the battery level of each UE, which may be equipped with some energy harvesting mechanism or battery recharge plan.

APPENDIX MATHEMATICAL DERIVATIONS FOR MU-MEDA
Two Lemmas in [39] are useful to solve the proposed resource optimization strategies.
Lemma 2: The following inequality holds true. Employing Lemma 1, for the latency virtual queue Z_k(t) we obtain an instantaneous upper bound on its drift. Now, recalling (8) and (9), and using Lemma 2, we can derive the analogous inequality for the computation queues; the same argument can be applied to the accuracy virtual queue, thus obtaining an upper bound for Δ_{y_k}(t). Putting together the derived instantaneous upper bounds, we end up with the optimization problem presented in Section IV-A.
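The one-slot drift bound from [39] that underlies these derivations has the following standard form, written here in generic notation since the exact statements of the lemmas do not appear in this excerpt: for a queue evolving as $Q(t+1) = \max\bigl(0,\, Q(t) - b(t)\bigr) + a(t)$ with $a(t), b(t) \ge 0$,

```latex
\frac{Q(t+1)^2 - Q(t)^2}{2}
\;\le\; \frac{a(t)^2 + b(t)^2}{2} \;+\; Q(t)\,\bigl(a(t) - b(t)\bigr).
```

Summing such bounds over all (physical and virtual) queues yields an instantaneous upper bound on the drift that is linear in the current queue lengths, which is what makes the per-slot minimization in Section IV-A tractable.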

Fig. 1. Training scheme: the output of the CE h feeds both the ES classification CNN and a CD.

Fig. 2. Scenario: each UE dynamically employs its own set of CEs coupled with a proper set of CCs at the ES.
i.e., a queue for each compression factor among the L_k in the set S_k = {s_ki}, i = 1, . . ., L_k, which represents the set of compression factors employable by the k-th UE. These queues store the ES computation load, expressed in number of DUs, reserved for the k-th device. The term 1_i{ρ_k(t)} in (9) is shorthand for the indicator function 1{ρ_k(t) = s_ki}, which models the arrival of new DUs in the ES queue only if the UE has chosen the i-th compression factor. The term N_ki^ES(t) in (9) denotes the number of DUs processed by the ES during the t-th time slot.