An End-to-End Scheme for Learning Over Compressed Data Transmitted Through a Noisy Channel

Within the emerging area of goal-oriented communications, this paper introduces a novel end-to-end transmission scheme dedicated to learning over a noisy channel, under the constraint that no prior training dataset is available. In this scheme, the transmitter makes use of powerful Spherical Harmonic Transform and Irregular Hexagonal Quadrature Amplitude Modulation techniques, while the receiver relies on a Complex-Valued Neural Network (CVNN) to realize the learning task on the received noisy data. As a main feature of the proposed scheme, the transmitter is fixed and does not depend on the source statistics, while the receiver is trained from a first data transmission phase, thus providing an efficient transmission-versus-learning approach under the considered constraint. The proposed transmission scheme may be adapted to a variety of learning problems, and the paper specifically investigates clustering and classification, two very common learning tasks. In the last part of the paper, the source/channel coding rate of the proposed transmission scheme is evaluated both theoretically and through numerical simulations. This analysis shows a clear advantage in coding rate for our scheme over conventional coding approaches, when targeting the same learning performance level.


I. INTRODUCTION
Conventional schemes for data transmission over a noisy channel are designed to reconstruct the original information sequence either without error (lossless transmission) or with a residual amount of errors (lossy transmission). However, in many applications, the objective of the receiver is not to reconstruct the original data, but rather to apply a given learning task to the received data. As examples, one may consider disease detection from human body sensors [1], [2], underwater activity monitoring [3], or traffic flow prediction with autonomous vehicles [4]. The problem of learning over data transmitted through a noisy channel falls into the emerging area of goal-oriented communications, and was identified as a key functionality to be integrated in the upcoming 6G standard [5].
In this paper, we consider that several sensors transmit their data to a fusion center whose objective is to apply a certain learning task to the sensors' measurements. In this context, conventional source/channel coding schemes targeting data reconstruction are known to be sub-optimal in terms of the amount of data that must be transmitted to achieve a certain learning performance [6], [7]. An alternative to conventional coding approaches consists of replacing the transmitter and the receiver by Deep Neural Networks (DNNs) trained to realize the learning task while taking into account the channel effect on the transmitted data [8], [9]. However, this approach can only be implemented if an initial training dataset is available for pre-training, or if there exists some reliable feedback link allowing for significant data transmission from the receiver to the transmitter. Otherwise, either the transmitter must have enough data and power resources to train the DNN itself, or the receiver must perform the training and then send the weights back to the encoder through the feedback link. By contrast, this paper addresses the design of a practical coding scheme dedicated to learning, considering that: (i) the sensors do not have enough resources to perform the training, (ii) no initial training set is available, (iii) the feedback link only allows for limited data transmission. Under these three constraints, the key challenges reside in devising a fixed transmitter able to work by itself with only limited feedback from the receiver, and in developing a receiver dedicated to learning and capable of online adaptation to the source and channel statistics.
In the literature, the problem of designing source and channel coding schemes dedicated to learning was first addressed from the theoretical point of view of Information Theory. In this field, the most studied learning problem was by far Distributed Hypothesis Testing (DHT), in which the receiver should decide between two hypotheses related to the statistics of two sources X and Y [10], [11], [12], [13]. These works provided error exponents for DHT under rate-limited transmission links and under various transmission setups (perfect and imperfect channels, relaying opportunities, etc.). Apart from DHT, [6] identified a trade-off between content identification and data reconstruction from a noisy database, while [14] addressed parameter estimation over compressed data. Finally, [15] considered the problem of supervised learning of a given function f from compressed observations. Although all the above works provide meaningful insights on how to perform learning over compressed data, they are mostly theoretical and do not provide any practical code design solution.
On the practical code design side, several works proposed to perform parameter estimation [16], hypothesis testing [17], or clustering [18], [19], from only a small number of linear combinations of the input data, following the Compressed Sensing (CS) approach [20]. However, these works are not directly suitable for digital data transmission, since they produce real-valued data and do not evaluate the effect of quantization or channel noise on the learning performance. As an attempt to develop discrete CS approaches for learning, [21], [22] considered parameter estimation over Low-Density Parity-Check (LDPC) codes, and [23] investigated clustering over LDPC codes. However, these works considered only discrete sources, and they assumed a perfect (noiseless) transmission channel.
In this paper, we propose a full end-to-end transmission scheme dedicated to learning. The proposed scheme does not rely on any prior knowledge of the source statistics and can therefore be adapted online to the collected data. We first introduce a transmitter which can be rate-adapted so as to ensure a good learning performance. This transmitter is built from Spherical Harmonic Function (SHF) transforms [24] and Irregular Hexagonal Quadrature Amplitude Modulation (IHQAM) [25]. Used together, these two techniques allow us to preserve the data structure after channel transmission, so that learning can be applied efficiently after only a few reconstruction operations at the receiver. We then propose a receiver dedicated to learning and built from a Complex-Valued Neural Network (CVNN) [26] trained during a first data transmission phase. One main advantage of the proposed strategy is that the transmitter does not depend on the considered learning task. We therefore consider two learning problems, clustering and classification, and provide two versions of the receiver (both based on a CVNN), depending on the considered problem. We aim to evaluate the effectiveness of the proposed approach for these two problems.
For that purpose, the second part of the paper is dedicated to the performance evaluation of our scheme. Since the standard performance metrics usually considered for data reconstruction (error probability, distortion, etc.) are not suitable in our context, we first identify metrics of interest which can be used to evaluate the learning performance of the proposed scheme. We then evaluate the source-channel coding rate of our scheme as a function of its parameters. Unfortunately, we cannot compare this coding rate to any available information-theoretic achievable coding rate: no such theoretical result exists for clustering or classification and, given the few learning tasks considered in the Information Theory literature, deriving one appears to be a difficult problem. This is why we here adopt a more pragmatic approach and evaluate the coding rates of two identified baseline schemes: one theoretical and one more practical, the latter consisting of a conventional coding scheme. These two baselines will allow us to position the performance of our scheme with respect to other potential approaches.
Finally, we run numerical simulations to evaluate the learning performance of the proposed scheme, and compare the coding rate of our scheme with respect to the two identified baselines. As a main result, we observe a significant gain in coding rate compared to conventional coding schemes, while maintaining the same learning performance.
The outline of the paper is as follows. Section II introduces our notation and assumptions for the problem of learning over transmitted data. Section III describes the coding scheme at the transmitter. Section IV introduces the learning scheme at the receiver. Section V evaluates the coding rate and learning performance of the proposed scheme. Finally, Section VI provides numerical results.

II. SYSTEM DESCRIPTION
This section introduces the notation and main assumptions of our work, and presents the problem of learning over data transmitted through a noisy channel. In what follows, ⟦1, N⟧ denotes the set of integers from 1 to N.

A. SOURCE AND CHANNEL MODELS
We consider a setup in which a potentially large number of sensors collect data to be transmitted to a fusion center. We assume that each sensor has access to several pieces of data, each denoted with a bold letter X_s, where s ∈ ⟦1, S⟧, and we let {X_s}_{s∈⟦1,S⟧} be the full dataset. We consider that each piece of data X_s is a matrix of size N × M. This corresponds to two-dimensional data such as images, although the proposed scheme could be adapted to one-dimensional data such as measurement vectors or time series. We do not make any assumption on the source statistics, in order to develop an agnostic coding scheme which can adapt to a wide range of situations. In addition, we consider that the data is transmitted through an Additive White Gaussian Noise (AWGN) channel with variance σ², a common assumption in the study of communication systems.

B. LEARNING TASKS
The objective of the fusion center is to apply a given learning task to the dataset {X_s}_{s∈⟦1,S⟧} collected by the sensors. In this work, we focus on two tasks: clustering and classification. Clustering is an unsupervised learning task (i.e., no labelled data is available for training), while classification is a supervised learning task (labelled data is required for training). This will allow us to evaluate the performance of the proposed transmission scheme over two learning tasks which are very different by nature. We now briefly introduce these two tasks.

1) CLUSTERING
Clustering consists of separating the dataset {X_s}_{s∈⟦1,S⟧} into clusters, such that data within a cluster are similar to each other. In this work, we consider the Euclidean distance d(X_s, X_s′) as the similarity measure between two data samples X_s and X_s′. We further consider the very popular K-means clustering algorithm [27], since our purpose is not to introduce a new clustering method, but rather to work on the design of the transmission system. The K-means algorithm requires the knowledge of the number of clusters K, and aims to minimize the cost function

J({c_{s,k}}, {θ_k}) = Σ_{s=1}^{S} Σ_{k=1}^{K} c_{s,k} d(X_s, θ_k)²,   (1)

with respect to the cluster assignments c_{s,k} ∈ {0, 1} (with Σ_{k=1}^{K} c_{s,k} = 1 for every s) and to the cluster centroids θ_k ∈ R^{N×M}. K-means usually suffers from initialization issues and from the fact that the number K of clusters can be difficult to know in advance. We refer the reader to [28] and [29] for methods to solve these two issues. These methods consist of simple modifications of the K-means algorithm, and could easily be incorporated in our approach.
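To make the objective concrete, the K-means cost and a plain Lloyd-style iteration can be sketched as follows on flattened data matrices (a minimal NumPy sketch with our own function names, not the implementation used in the paper):

```python
import numpy as np

def kmeans_cost(X, centroids, assign):
    # Sum of squared Euclidean distances between each sample and the
    # centroid of the cluster it is assigned to.
    return float(sum(np.sum((X[s] - centroids[assign[s]]) ** 2)
                     for s in range(len(X))))

def kmeans(X, K, n_iter=50, seed=0):
    # Plain Lloyd iterations; X has shape (S, N*M), one flattened matrix per row.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each sample goes to its closest centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids
```

In practice, the initialization and model-selection fixes of [28], [29] would be added on top of this skeleton.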

2) CLASSIFICATION
Classification consists of assigning data to one of K predefined classes. It is realized in two phases. In the first, training, phase, a classifier is trained from a set of labelled data, where each label indicates to which class a data sample belongs. In the second, inference, phase, the classifier should assign new unlabelled data to the correct class. Among the various methods that exist for classification, we here consider standard feedforward Neural Networks (NNs) for their efficiency and adaptability [30]. In what follows, and as commonly done in classification, we consider that the NN is trained with the cross-entropy as loss function [31].
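For reference, the cross-entropy loss used to train the classifier can be written in a few lines of NumPy (a generic sketch, independent of the specific network described in Section IV):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # Average negative log-likelihood of the true classes.
    # logits: (batch, K) raw network outputs; labels: (batch,) class indices.
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))
```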

C. TRANSMISSION SCHEME FOR LEARNING
When developing our transmission scheme, we consider that we do not have access to any prior training dataset, since constructing such a training set in advance is not always possible in practical transmission scenarios. Therefore, in order to develop an efficient transmission scheme dedicated to learning, we consider two transmission phases. In the first phase, a fraction β of the dataset {X_s}_{s∈⟦1,S⟧} is transmitted to the fusion center using a conventional lossless or lossy data transmission scheme. This first data transmission allows the fusion center to properly calibrate the learning algorithm. This is of special importance in our context, since we want the transmission scheme to be as agnostic as possible with respect to the source statistics. The second phase, which constitutes the main contribution of this paper, is specifically designed to allow for learning over the transmitted data, without the need to reconstruct the original data. Note that both phases are taken into account when evaluating the overall source/channel coding rate of the proposed scheme. In addition, we consider a feedback channel between the fusion center and the sensors. Through this channel, the fusion center informs the sensors about the amount of data they should transmit in the second phase so as to achieve a good learning performance. However, we pay special attention to sending only a small amount of information through the feedback channel, since setting up such a down-link connection can be very costly in practical applications.

D. EXISTING APPROACHES FOR LEARNING OVER TRANSMITTED DATA
Before describing our proposed transmission scheme for learning over received data, we review existing approaches for this problem, and identify their limitations.
A first straightforward approach would consist of using a conventional data transmission scheme targeting data reconstruction. In this case, it is shown in [32] that there is no need to completely reconstruct the data before applying the learning task. For instance, [33] proposed to train a Deep Neural Network (DNN) dedicated to classification directly in the JPEG transform domain. This solution is especially appropriate when the data is already compressed and, e.g., stored on a dedicated server. In our case, however, it may be very sub-optimal in terms of coding rate. As a matter of fact, an information-theoretic analysis carried out in [11] shows that the rate needed for DHT is much lower than the rate needed for data reconstruction. The same fact was also empirically observed in [23] for clustering over compressed data. In addition, and perhaps more surprisingly, this approach may also be sub-optimal in terms of learning performance. For instance, it is shown in [6] that there exists a trade-off in terms of coding rate between data reconstruction and identification. It is also shown in [34] that the classification performance after video compression and decompression is poor at low bitrates. This shows the need to design a coding scheme fully dedicated to learning.
Alternatively, one could consider full end-to-end Deep-Learning techniques, which have been widely investigated in the telecommunication field recently, for data compression [8], [35], noisy channel transmission [9], [36], or joint source-channel coding [37], [38], [39]. Some of these solutions were extended to target learning problems such as image retrieval [40], image classification [41], or image recognition [42]. Most of these solutions can be seen as Variational Auto-Encoders (VAEs). VAEs are composed of an encoder, which produces a latent vector, and a decoder, which may either perform data reconstruction [8], [9], [35], [36], [37], [38], [39] or apply some specific learning task [40], [41], [42], [43] to the latent vector. Usually, the VAE encoder and decoder are constructed from NNs. When no training set is available, the emitter may train its own NN [42], [44], provided that it has enough computation and power resources. Otherwise, the training algorithm should be applied at the receiver, and the updated NN weights or the loss function for each training sample [9] should be transmitted back to the encoder via a reliable feedback channel. Therefore, the use of VAEs seems unrealistic in all the applications in which such an intensive use of the feedback channel is not possible. This is why, in this work, we do not consider the Deep-Learning based approach either. Instead, we develop a transmission scheme in which the transmitter is fixed and does not need to be updated online, while the receiver can make use of an NN dedicated to the considered learning task. This NN is trained during the first transmission phase, with no need to send back any training information to the encoder.
In what follows, we first introduce the proposed transmitter scheme (Section III), and then describe the proposed receiver learning scheme (Section IV).

III. DATA TRANSMISSION SCHEME
In conventional data transmission schemes, the source-channel separation theorem states that designing the source coding scheme and the channel coding scheme independently from each other is optimal, at least in the asymptotic regime. However, this result most probably does not hold anymore when targeting learning. For instance, [10] proposes a joint source-channel code design for DHT which achieves better performance than the separate design. Intuitively, since learning algorithms are designed to handle the noise within the data, they should also show some robustness against channel noise. Therefore, it may be irrelevant to put effort into completely correcting the channel noise before applying the learning algorithm. Following this idea, we build an unconventional data transmission scheme which avoids both standard lossless source coding (Huffman, Lempel-Ziv, etc.) and standard channel coding (LDPC codes, Turbo codes, etc.). The proposed scheme is designed to preserve the data structure during channel transmission, so that the learning algorithm can directly handle the additional noise introduced by the channel. Note that in the context of lossy data reconstruction through a noisy channel, it was shown in [45] that transmitting uncoded data is optimal for some pairs of sources and channels. However, the work of [45] was mostly theoretical and did not address learning. Figure 1 shows the generic coding scheme (transmitter + receiver) we propose in this paper. As shown in Figure 1, the transmitter is composed of three main blocks, transform coding, scaling, and modulation, which we now describe.

A. SHF TRANSFORM CODING
As in most source coding approaches, our scheme first applies a transform coding operation. Here, we consider Spherical Harmonic Functions (SHFs) [24], which are known to be very good function approximators, mostly due to their polynomial forms. Transforms usually employed in source coding, like the Discrete Cosine Transform (DCT), are real-valued for ease of use. Here, by contrast, we choose to employ the SHF because it is a 2D complex transform. This interfaces better with the modulation method considered in our scheme, since the constellation of this modulation is defined in the complex domain.
Consider a given matrix X of size M × N from the dataset. In order to apply the SHF, Step 1 of Figure 1 converts the Cartesian coordinates (m, n) of each entry X_{m,n} into a spherical coordinate system. To do so, we set θ_m = πm/M and φ_n = 2πn/N, where θ_m is the zenith (polar) angle such that 0 < θ_m < π, and φ_n is the azimuthal angle such that 0 < φ_n < 2π.
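The coordinate change of Step 1 is straightforward to express in code (a small sketch with our own function name; the exact treatment of the grid endpoints is a convention left open by the definitions above):

```python
import numpy as np

def grid_to_spherical(M, N):
    # Map pixel indices (m, n) of an M x N matrix to the angles
    # theta_m = pi*m/M (zenith) and phi_n = 2*pi*n/N (azimuth).
    theta = np.pi * np.arange(M) / M
    phi = 2.0 * np.pi * np.arange(N) / N
    return np.meshgrid(theta, phi, indexing="ij")
```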
Then, the SHFs are defined from the associated Legendre polynomials P_ℓ^(k), where ℓ ∈ ⟦0, +∞⟦ and k ∈ ⟦0, ℓ⟧, and these polynomials define an orthogonal basis. The spherical harmonic functions Y_ℓ^(k) (ℓ ∈ ⟦0, +∞⟦, k ∈ ⟦0, ℓ⟧) then form an orthonormal basis system that maps the spherical coordinates to complex scalar values as follows:

Y_ℓ^(k)(θ, φ) = N_ℓ^(k) P_ℓ^(k)(cos θ) e^{jkφ},

where N_ℓ^(k) = √[(2ℓ+1)(ℓ−k)! / (4π(ℓ+k)!)] is a normalization constant, chosen so that the SHFs satisfy the standard orthonormality conditions restated in [24]. All the expressions of the SHFs Y_ℓ^(k) can be found in [46].
Finally, in Step 2 of Figure 1, for all ℓ ∈ ⟦0, +∞⟦ and k ∈ ⟦0, ℓ⟧, the transform coefficients C_ℓ^(k) are obtained as the inner products between the data expressed in spherical coordinates and the SHFs Y_ℓ^(k). In order to simplify the notation in what follows, we re-index the SHFs and the transform coefficients as Y_p and C_p, with p ∈ ⟦1, P⟧, where P is the number of retained coefficients, so that the data can be approximated as

X_{m,n} ≈ Σ_{p=1}^{P} C_p Y_p(θ_m, φ_n).   (6)

Equation (6) shows that the choice of the value of P is critical for the quality of the approximation. This choice will be discussed later in the paper. Since our scheme does not include channel coding, the P transform coefficients C_p are passed to the modulation step, after an intermediate scaling step.
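The analysis step can be sketched as a discrete inner product between the data on the angular grid and the SHFs, using SciPy's spherical harmonics (a sketch under our own conventions: midpoint sampling in the zenith angle and a simple quadrature weight, which differ slightly from the exact grid of the paper):

```python
import numpy as np

# SciPy renamed its spherical-harmonic routine; support both APIs.
try:
    from scipy.special import sph_harm_y
    def Y_lk(l, k, theta, phi):            # theta: zenith, phi: azimuth
        return sph_harm_y(l, k, theta, phi)
except ImportError:
    from scipy.special import sph_harm
    def Y_lk(l, k, theta, phi):
        return sph_harm(k, l, phi, theta)  # legacy API takes azimuth first

def shf_coefficients(X, L):
    # Coefficients C_l^(k) = <X, Y_l^(k)> approximated on the sampling grid,
    # for all degrees l <= L and orders 0 <= k <= l.
    M, N = X.shape
    theta = np.pi * (np.arange(M) + 0.5) / M         # zenith samples (midpoint rule)
    phi = 2.0 * np.pi * np.arange(N) / N             # azimuth samples
    T, P = np.meshgrid(theta, phi, indexing="ij")
    w = np.sin(T) * (np.pi / M) * (2.0 * np.pi / N)  # surface element on the sphere
    return {(l, k): np.sum(X * np.conj(Y_lk(l, k, T, P)) * w)
            for l in range(L + 1) for k in range(l + 1)}
```

Retaining the P coefficients of lowest degree then amounts to truncating this dictionary.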

B. MODULATION
For modulation, we consider the Irregular Hexagonal QAM (IHQAM) constellation [47, Section V], which is a very energy-efficient 2D signal constellation. For the sake of brevity, we only describe the 64-IHQAM constellation here. In our simulations, we only considered the 64-IHQAM for its high reliability, but our scheme can be straightforwardly adapted to other constellation orders.
In a 2D signal constellation, the Symbol Error Rate (SER) is mainly affected by the minimum distance between two neighboring constellation points, and by the average symbol energy, which depends upon the mean squared distance between the constellation points and the origin. In the optimum 2D hexagonal-lattice-based IHQAM constellation [47], the constellation points are situated on concentric discs, and the minimum distance between any two adjacent points is 2d. Further, according to [47], the real (resp. imaginary) coordinates of each constellation point are integer multiples of d (resp. √3 d). At the transmitter, each complex information signal is first mapped to the center of the nearest hexagon (Step 4 of Figure 1), considering the decision boundaries given in [47, Table 5]. In more detail, [47, Table 5] provides the linear equations that define the boundaries of the hexagons in the constellation represented in [47, Figure 16]. When implementing our scheme, we noticed that [47, Table 5] contained some typos in the definitions of the boundaries. This is why we restate the correct boundaries in Table 1 of this paper, for clarity and future use. Note that the regular HQAM constellation has comparatively simpler detection, while the irregular HQAM provides improved power efficiency and optimum performance, at the cost of an increased detection complexity [25]. When considering IHQAM, if the modulated signals are passed through an AWGN channel with noise variance σ², then according to [47], the Signal-to-Noise Ratio (SNR) in dB is

SNR_dB = 10 log₁₀(E_s / N₀),

where E_s is the average energy of each signal, and N₀ = 2σ² is the two-sided spectral density of the Gaussian noise. Moreover, for a fixed SNR_dB, we can apply relations (7), (24), and (25) of [47] to calculate the Bit Error Probability (BEP) P_b of the 64-IHQAM over an AWGN channel.
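Since the exact 64-IHQAM point set and decision boundaries are those of [47] (restated in Table 1), we only sketch the generic ingredients here: a hexagonal lattice with the coordinate structure described above, and a minimum-distance mapper, which is equivalent to applying hexagonal decision boundaries. The point-selection rule and coordinate ranges below are our own illustrative choices:

```python
import numpy as np

def hex_constellation(d, n_points=64):
    # Hexagonal lattice: x = a*d, y = b*sqrt(3)*d with a + b even,
    # which yields a minimum inter-point distance of 2d.
    pts = [complex(a * d, b * np.sqrt(3) * d)
           for a in range(-8, 9) for b in range(-4, 5) if (a + b) % 2 == 0]
    # Keep the n_points lowest-energy points (illustrative selection rule).
    pts.sort(key=lambda z: (abs(z), np.angle(z)))
    return np.array(pts[:n_points])

def map_to_constellation(z, constellation):
    # Minimum-distance mapping of a complex sample to the nearest point.
    return constellation[np.argmin(np.abs(constellation - z))]
```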
Finally, we did not consider channel coding prior to the modulation scheme, since we do not target data reconstruction. Machine Learning algorithms considered at the receiver have the ability to handle the noise introduced by the channel.

C. SCALING
We now describe how the proposed scheme scales and maps the complex-valued transform coefficients C_p onto the IHQAM constellation. For a given s ∈ ⟦1, S⟧, we use C_s = (C_s,1, C_s,2, ..., C_s,P) to denote the vector of transform coefficients of size P, where C_s,p = y_s,p + j z_s,p is a continuous complex value. For all p ∈ ⟦1, P⟧, the means of the random variables y_s,p and z_s,p are given by µ_y,p = E[y_s,p] and µ_z,p = E[z_s,p], respectively, and their variances are given by σ²_y,p = Var[y_s,p] and σ²_z,p = Var[z_s,p], respectively. Given that the number S of samples is sufficiently large, we can estimate µ_y,p, µ_z,p, σ²_y,p, σ²_z,p from the empirical means and variances of the data samples y_s,p and z_s,p. These empirical means and variances can be calculated both at the transmitter from the observed data, and at the receiver from the first data transmission phase. The normalized versions of the components y_s,p and z_s,p are denoted by ȳ_s,p = (y_s,p − µ_y,p)/σ_y,p and z̄_s,p = (z_s,p − µ_z,p)/σ_z,p, respectively. For a random variable U with mean 0 and variance 1, we introduce the quantity u_α such that P(|U| ≤ u_α) ≥ α. In other words, the confidence interval I_y,α = [µ_y,p − u_α σ_y,p, µ_y,p + u_α σ_y,p] contains α% of the real (resp. imaginary) values of the transform coefficients C_s,p. The value of α has an impact on the quality of the signal at the receiver. Now considering the 64-IHQAM constellation, Step 3 of Figure 1 scales the confidence interval of the real (resp. imaginary) part of each transform coefficient onto the in-phase (resp. quadrature) axis of the 64-IHQAM constellation. To do so, and given that the in-phase (resp. quadrature) axis of the 64-IHQAM is limited to [−8d, 8d] (resp. [−4√3 d, 4√3 d]), we apply the following linear transforms to the normalized real part ȳ_s,p and imaginary part z̄_s,p of each of the P coefficients C_s,p:

T_Re,p(ȳ_s,p) = (8d / u_α) ȳ_s,p,   (7)
T_Im,p(z̄_s,p) = (4√3 d / u_α) z̄_s,p.   (8)

Transforms (7) and (8) map at least α% of the most probable coefficients into the 2-dimensional region of the 64-IHQAM constellation.
Note that values falling outside the α% confidence interval are mapped to the borders of this interval.
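The scaling of Step 3 and its inverse at the receiver are linear maps of the form below (a sketch with our own function names; half_range is 8d for the in-phase axis and 4√3 d for the quadrature axis):

```python
import numpy as np

def scale_to_axis(x, mu, sigma, u_alpha, half_range):
    # Normalize, clip to the alpha-confidence interval [-u_alpha, u_alpha],
    # then stretch that interval onto [-half_range, half_range].
    x_bar = np.clip((x - mu) / sigma, -u_alpha, u_alpha)
    return x_bar * (half_range / u_alpha)

def descale_from_axis(t, mu, sigma, u_alpha, half_range):
    # Inverse linear transform applied at the receiver (Step 7).
    return t * (u_alpha / half_range) * sigma + mu
```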
After the scaling step, Step 4 of Figure 1 uses the decision boundaries of the 64-IHQAM modulation to quantize any continuous 2-dimensional value [T_Re,p(ȳ_s,p), T_Im,p(z̄_s,p)] to one of the 64-IHQAM constellation points. Since we do not use channel coding, the scaled values are directly quantized into constellation points, with no "mapping" strategy.

IV. RECEIVER LEARNING SCHEME
The objective of the receiver is to perform either clustering or classification onto the received noisy data output by the AWGN channel. For both learning tasks, the receiver starts with the same two steps: demodulation and descaling.
Step 6 of Figure 1 demaps the noisy signal distorted by the Gaussian noise. In this phase, we use the decision boundaries of the 64-IHQAM to demap the noisy signal to the center point of the nearest hexagon. Then, Step 7 descales the p-th component of the demodulated/demapped vector of size P. This is done by applying the inverse of the linear transform T_Re,p in (7) (resp. T_Im,p in (8)) to the real (resp. imaginary) part of the p-th component of the demodulated/demapped vector. Since these components are the center points of the hexagons of the 64-IHQAM, their corresponding descaled (inverted) points only take values among 64 discrete possibilities. The next step of Figure 1 is specific to the considered learning task, although both tasks rely on a Complex-Valued Neural Network (CVNN), which we now describe.

A. COMPLEX-VALUED NEURAL NETWORKS
In this section, we only provide the salient points of CVNN, and we refer the reader to [49] and [50] for a full description. Given that a CVNN has complex input values, the activation functions and their derivatives have to be well-defined so that one can apply e.g., a gradient descent optimization method over complex data. Specifically, [50] relies on Wirtinger derivation in order to calculate the gradient of the loss function of a CVNN.
In what follows, we consider a CVNN with an input layer of k₀ + 1 nodes, V hidden layers with k_v nodes per layer, and a single-valued output layer. We further consider the Cartesian hyperbolic tangent F(Z) = tanh(Re(Z)) + j tanh(Im(Z)) as the activation function for the hidden layers of the CVNN, where Z is a complex value, and we consider a sigmoid-based activation function for the output layer. The CVNN with the aforementioned components has been implemented in Python using Tensorflow and Keras, see [26], [51]. For weight initialization, Glorot uniform (also known as Xavier uniform) [52] is used, and all biases start at zero, as these are Tensorflow's current (v2.1) default initialization methods for dense layers.
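The split (Cartesian) activation is simple to state outside any framework; below is a NumPy sketch of the activation and of one complex-valued dense layer (our own minimal version, not the Keras implementation used in the paper):

```python
import numpy as np

def cartesian_tanh(z):
    # F(Z) = tanh(Re(Z)) + j*tanh(Im(Z)), applied element-wise.
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def complex_dense(z, W, b):
    # One complex-valued dense layer followed by the Cartesian tanh.
    return cartesian_tanh(W @ z + b)
```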

B. CLUSTERING WITH CVNN
In our scheme, clustering over the received data is performed in two steps. The first step consists of reconstructing rough versions of the original matrices X_s using the CVNN shown in Figure 2. This reconstruction step allows us to efficiently invert the transform operation, while removing a part of the channel noise. It also allows us to use the standard Euclidean distance in the cost function (1) of K-means. Indeed, applying K-means without data reconstruction would require identifying a proper distance adapted to the internal geometric structure of the demodulated data.
In the CVNN, we use the activation functions described in Section IV-A. As the loss function, we consider the Mean-Squared Error (MSE), since the aim of the K-means clustering algorithm is to minimize the MSE between each sample and its closest centroid. The CVNN is trained from the set of ⌈βS⌉ samples that were transmitted in the first phase. In this first phase, we take advantage of Data Augmentation (DA), a technique that increases the amount of training data by adding modified copies of already existing data [53]. This results in a training set of N_t samples. For a given coefficient X_{m,n}, the input layer of the NN is fed with the vector (C̃_1 Y_1(θ_m, φ_n), ..., C̃_P Y_P(θ_m, φ_n)), see (6), where the coefficients C̃_p are the received transform coefficients after demodulation and descaling. The goal of the NN is to minimize the MSE between the NN output value X̂_{m,n} and the original value X_{m,n}, for all (m, n) ∈ ⟦0, M−1⟧ × ⟦0, N−1⟧. The NN is applied MN times so as to obtain the MN components X̂_{m,n}, which allows us to reconstruct a degraded version X̂ of each matrix X. Then, in the second stage of Step 8, the K-means algorithm is applied to the set of reconstructed matrices {X̂_s}_{s∈⟦1,S′⟧}, where S′ ≤ S is the number of received data. In a practical system, for more efficiency, the K-means algorithm should be applied to the set of matrices received from both the first and the second data transmission phases, so that a part of the matrices X̂_s are in fact given by the original matrices X_s coming from the first phase. However, in our simulations, we only apply K-means to the data received in the second phase, which allows a fairer evaluation of the performance of the proposed scheme (otherwise, the K-means results could be positively biased by the first phase).

C. CLASSIFICATION WITH CVNN
When considering classification instead of clustering, our proposed scheme remains entirely the same, except for Step 8 of Figure 1. This step is now composed of a single stage, since the CVNN completely handles the classification. In more detail, in the training phase, we assume that the receiver has access not only to a fraction β of the dataset, but also to the corresponding true labels. For instance, these true labels may be determined manually by a human operator, after a clustering step. This human operator only needs to provide such an effort for a small part of the dataset, the NN then providing labels for the remaining fraction (1−β) of the dataset. Manual labeling would have to be done anyway in many applications in which no prior dataset is available. As a result, while for clustering the NN is applied pixel-wise, here, by contrast, the CVNN is fed with the P descaled transform coefficients C̃_p. In addition, at the last layer, the number of NN outputs now equals the number of predefined classes, and the loss function is now the cross-entropy. However, the activation functions remain the same as for clustering.

D. RATE-ADAPTATION MECHANISM
Until now, we assumed in our description of the transmission scheme that the parameter P was fixed. This parameter indicates the number of SHF transform coefficients C_ℓ^(k) in (5) which are retained at the transmitter, and it affects both the amount of data transmitted over the channel and the learning performance. This is why, when training the CVNN after the first transmission phase, we try different values of P and select the one that provides a sufficient level of learning performance. The performance criterion considered at this step is defined in the next section for both clustering and classification. For classification, this step requires setting aside a small part of the transmitted data as a validation set, which is used only for performance evaluation and is not considered during training. Finally, the retained value of P is sent back to the transmitter via the feedback link, and it is used during the second transmission phase. As a result, the feedback channel only needs to transmit ⌈log_2(P)⌉ bits of data (where P ≤ M × N), which is very low compared to the amount of data transmitted through the direct link; see the next section.
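The selection procedure above can be sketched as follows. The helper `select_P` and the toy performance table are hypothetical stand-ins: in the actual scheme, `evaluate(P)` would train the CVNN with P retained coefficients and return the validation metric (cm for clustering, accuracy for classification).

```python
import math

def select_P(candidates, evaluate, target):
    """Return the smallest P whose validation performance reaches `target`,
    falling back to the largest candidate if none does."""
    for P in sorted(candidates):
        if evaluate(P) >= target:
            return P
    return max(candidates)

# Toy stand-in for the train-and-validate routine (values are illustrative).
perf = {16: 0.71, 36: 0.92, 49: 0.95, 64: 0.95}

P_star = select_P(perf, perf.get, target=0.90)  # smallest sufficient P
feedback_bits = math.ceil(math.log2(P_star))    # bits sent on the feedback link
```

With these toy numbers, P_star = 36 is retained and only ⌈log_2(36)⌉ = 6 feedback bits are needed.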

V. RATE AND LEARNING PERFORMANCE EVALUATION
Usually, the performance of conventional source and channel coding schemes is evaluated from metrics related to data reconstruction, such as the error probability for lossless source coding, or the distortion for lossy source coding. Alternatively, this section first identifies metrics of interest to evaluate the clustering and classification performance of a transmission scheme dedicated to learning. It then evaluates the source-channel coding rate of the proposed scheme.

A. CLUSTERING PERFORMANCE
We consider two metrics of interest in order to evaluate the clustering performance of the proposed scheme. The first metric comes from the Confusion Matrix (CM), a square matrix whose positive integer coefficient at position (i, j) gives the number of elements of actual class j which were predicted as belonging to class i. The metric cm = tr(CM)/S′, where S′ is the total number of clustered samples, 0 < cm ≤ 1, and tr(·) is the trace of the matrix, is then calculated from the CM. A higher cm indicates a better clustering. The second metric is the Silhouette score, denoted ss, and defined in [54]. The value of the Silhouette score varies between −1 and 1. A high ss means that the clusters are dense and well separated from other clusters.
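A small sketch of the cm computation is given below. One detail not spelled out in the text: K-means numbers its clusters arbitrarily, so before taking the trace the cluster indices must be aligned to the true classes; using the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) for this alignment is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cm_metric(true_labels, predicted_clusters, K):
    """Build the confusion matrix, align cluster indices to classes, and
    return cm = tr(CM)/S' in (0, 1]."""
    CM = np.zeros((K, K), dtype=int)
    for t, p in zip(true_labels, predicted_clusters):
        CM[p, t] += 1  # row = predicted cluster, column = actual class
    # Find the cluster-to-class permutation maximizing the trace.
    rows, cols = linear_sum_assignment(-CM)
    return CM[rows, cols].sum() / len(true_labels)

# Toy example: the clustering is a pure relabeling of the true classes,
# so the aligned metric should be perfect.
true = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 0, 0, 1, 1]
cm = cm_metric(true, pred, K=3)  # -> 1.0
```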
The first metric cm is calculated from the ground truth, that is, the knowledge of the true clusters, while the second one, ss, only evaluates cluster homogeneity. We use these two very common metrics in our simulations, although many others exist (homogeneity score, completeness score, inertia, etc.); see [55] for an overview.

B. CLASSIFICATION PERFORMANCE
For classification, given the nature of the problem, performance criteria always require the ground truth. In our simulations, we consider a very common one, the accuracy [56]. Accuracy is simply calculated as the proportion of correctly predicted labels among all tested samples.

C. RATE EVALUATION
We now evaluate the coding rate of the proposed transmission scheme. For the conventional coding scheme used at the first phase, we use R_sc and R_cc to denote respectively the source coding rate and the channel coding rate needed to transmit an image without or with loss, depending on what is needed at the first transmission phase. Then, since the number of input bits is BMN (assuming B bits for each of the M × N pixels of a gray-scaled image), and since the number of transmitted bits is 6P (P SHF coefficients are retained, and each one is mapped onto one of 2^6 = 64 constellation points), the overall coding rate R_learn of our scheme is

R_learn = β (R_sc / R_cc) + (1 − β) 6P / (BMN),   (11)

where β is the proportion of the dataset transmitted at the first phase, and the ratio R_sc/R_cc is the joint source-channel coding rate [57]. In addition, while R_sc ∈ [0, 1] and R_cc ∈ [0, 1], the ratio R_sc/R_cc does not necessarily belong to [0, 1]. In our simulations, we assume that R_sc is the source coding rate after JPEG compression with a very small distortion, and that R_cc is the rate of an error-correction code aiming to correct most errors introduced by the AWGN channel. In (11), we see that the second transmission phase of our scheme highly benefits from the fact that no channel coding is employed in this phase. We also see that at this phase, the quantity to be optimized is the value of P, that is, the number of retained SHF coefficients. Finally, note that if the learning was already done from a prior available dataset, we could set β = 0.

D. CODING RATES OF BASELINE SCHEMES
In conventional data transmission setups, well-known Information-Theory results [58] state that the source coding rate R_sc should be greater than the source entropy (lossless coding) or than a certain rate-distortion function (lossy coding). These information-theoretic results make it possible to compare the performance of a given practical coding scheme with the optimal coding rate. Unfortunately, no such result exists for the clustering and classification tasks considered in this paper, although some simpler problems such as DHT were addressed in the literature [10], [11], [12]. Determining the information-theoretic achievable performance for classification or clustering is beyond the scope of this paper. Alternatively, we now provide the coding rates of two baseline schemes, which will serve as points of comparison with our approach.
As a first baseline, we consider a scheme in which the dataset is fully transmitted with conventional source and channel coding techniques, and completely reconstructed at the receiver, before applying the learning algorithm. This conventional approach has coding rate R_conv given by

R_conv = R_sc / R_cc,   (12)

where the terms R_sc and R_cc were introduced in Section V-C. This corresponds to setting β = 1 in our proposed transmission scheme.
As a second baseline, we use the result of [59], which states that in theory, K log(K) coefficients are sufficient to retrieve correct cluster or class assignments, where K is the number of clusters or classes. This result holds after training, that is, when the centroids (for clustering) or classes (for classification) are already known. Therefore, in this case, we consider the same first transmission phase as in our scheme, and we assume that K log(K) coefficients per data sample are transmitted in the second phase. We further consider that these coefficients are protected with a channel code of rate R_cc. The coding rate R_ideal of this scheme is given by

R_ideal = β (R_sc / R_cc) + (1 − β) 6 K log(K) / (R_cc BMN),   (13)

where K is the number of clusters or classes. This scheme is termed ''ideal'' because it relies on the theoretical result of [59].
In the end, we expect the rate ordering R_ideal ≤ R_learn ≤ R_conv to hold when considering the same level of learning performance among the three schemes. In our simulations, we compare the rate-versus-learning performance of our scheme to these two baselines, and check whether this rate ordering is satisfied.
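The three rate expressions can be checked numerically against the values reported later in the simulations. The functions below follow the rate descriptions given in the text (β fraction conventionally coded, then 6 bits per retained coefficient over BMN input bits); the use of the natural logarithm in the ideal rate is an assumption that reproduces the paper's reported figures.

```python
import math

def rate_conv(Rsc, Rcc):
    """Full conventional source/channel coding of the whole dataset."""
    return Rsc / Rcc

def rate_learn(beta, Rsc, Rcc, P, B, M, N):
    """Proposed scheme: conventional coding for the beta fraction,
    then 6 uncoded bits per retained SHF coefficient."""
    return beta * Rsc / Rcc + (1 - beta) * 6 * P / (B * M * N)

def rate_ideal(beta, Rsc, Rcc, K, B, M, N):
    """Ideal baseline: K log(K) coefficients per sample, channel-coded at Rcc."""
    return beta * Rsc / Rcc + (1 - beta) * 6 * K * math.log(K) / (Rcc * B * M * N)

# MNIST parameters used in Section VI (B bits/pixel, M x N pixels, K clusters).
B, M, N, K = 8, 28, 28, 10
beta, Rsc, Rcc = 0.01, 1 / 4, 3 / 4

r_learn = rate_learn(beta, Rsc, Rcc, P=49, B=B, M=M, N=N)  # ~0.05 bit/symbol
r_conv = rate_conv(Rsc, Rcc)                                # ~0.33 bit/symbol
r_ideal = rate_ideal(beta, Rsc, Rcc, K, B, M, N)            # ~0.032 bit/symbol

assert r_ideal <= r_learn <= r_conv  # expected rate ordering
```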

VI. SIMULATION RESULTS
In this section, we evaluate the performance of our proposed transmission scheme for clustering and classification. Our evaluation consists of two aspects: 1) the quality of clustering or classification, 2) the coding rate, given that the parameters of the scheme (number of SHF coefficients, etc.) are chosen so as to avoid any significant negative impact on the clustering or classification performance. The metrics considered in this evaluation are described in Section V. For both learning problems, we consider the MNIST dataset [60] that contains 70,000 grayscale images of size 28 × 28 of handwritten digits. The parameters of the considered CVNN are provided in Table 2, both for clustering and classification.

A. CLUSTERING
1) SIMULATION PARAMETERS
In order to evaluate the clustering performance of the proposed transmission scheme, we consider that β = 1% of the MNIST samples are transmitted at the first phase, which represents 700 samples. The set of samples transmitted in the first phase is then expanded into N_t = 2000 samples by applying Data Augmentation (DA) techniques [61]. The considered DA technique includes a maximum of 20 degrees of rotation and up to 3 pixels of left/right/up/down shifts. Next, the CVNN used at the receiver contains V = 2 layers, with k_1 + 1 = 120 nodes in its first hidden layer, and k_2 + 1 = 784 nodes at the second layer, which is equal to the number of pixels in each MNIST image. The CVNN is trained with a batch size of 200, 10 epochs, and a learning rate of 0.001. For the modulation scheme, we consider the 64-IHQAM constellation with parameter d = 1. In our simulations, we evaluate the clustering performance for values P ∈ {36, 49, 64}. For the AWGN channel, we set a noise variance σ^2 = 0.56, which corresponds to an SNR of around 15 dB, and to a bit error probability P_b = 0.062 [47], [62].
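The DA budget described above (rotations up to 20 degrees, shifts up to 3 pixels) can be sketched with `scipy.ndimage`; the sampling of random angles and shifts is an assumption, since the paper does not specify the exact augmentation pipeline.

```python
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(0)

def augment(image, max_deg=20, max_px=3):
    """One randomly rotated and shifted copy of a 28x28 image, within the
    DA budget used in the simulations (<= 20 deg rotation, <= 3 px shifts)."""
    angle = rng.uniform(-max_deg, max_deg)
    dy, dx = rng.integers(-max_px, max_px + 1, size=2)
    out = rotate(image, angle, reshape=False, mode='constant', cval=0.0)
    return shift(out, (float(dy), float(dx)), mode='constant', cval=0.0)

# Each transmitted image yields extra modified copies, growing the
# 700 first-phase samples toward the N_t = 2000 training samples.
img = rng.random((28, 28))
copies = [augment(img) for _ in range(3)]
```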
After completing the first transmission phase, and in order to evaluate the performance of the proposed transmission scheme, we consider the transmission of 2000 new samples from MNIST. In our scheme, for the clustering, we use the K-means algorithm initialized with K-means++, with 20 random initializations, a maximum of 300 iterations, and a tolerance value of 10^−4. The K-means function is imported from the Scikit-learn Python library and, for evaluation purposes, it is applied only to the samples transmitted at the second phase. In our simulations, we assume that the number K = 10 of clusters is known by the algorithm.
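The Scikit-learn configuration matching the parameters above can be written as follows; the random reconstructed data `X_hat` is a stand-in for the CVNN outputs.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for the reconstructed 28x28 images, flattened to 784-dim vectors.
X_hat = rng.random((200, 784))

km = KMeans(n_clusters=10,     # K = 10 clusters, assumed known
            init='k-means++',  # K-means++ initialization
            n_init=20,         # 20 random initializations
            max_iter=300,      # maximum of 300 iterations
            tol=1e-4,          # tolerance value of 10^-4
            random_state=0)
labels = km.fit_predict(X_hat)
```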

2) CLUSTERING PERFORMANCE
In our simulations, we evaluate the clustering performance with the metrics ss and cm described in Section V-A. For further refinement, we also provide an array ''Counter'' which lists the number of samples assigned to each class. Since images in the MNIST dataset are uniformly distributed across clusters, a successful clustering should output an almost equal number of data points in each cluster. In this sense, for MNIST, a lower sample variance σ^2_Count of Counter means a better clustering quality.
We now present our simulation results for two sets of simulations. The first set of simulations aims to position our scheme with respect to K-means clustering on the original MNIST data, and on binary MNIST (one-bit quantization of each pixel). In other words, the K-means algorithm is applied directly to 2000 samples of these two datasets (original and binary), without considering our transmission scheme and without adding any channel noise. Table 3 shows the metrics cm, ss, and σ^2_Count obtained for the original and binary versions of MNIST images, and also by applying the proposed scheme (referred to as ''NN output images'' in the table) with P = 64, 49, and 36. We observe from Table 3 that the aforementioned metrics are very similar for clustering over the original and binary versions of MNIST images, except for the variance σ^2_Count, which is meaningfully smaller for the original MNIST images. Then, when applying our scheme with P = 64 and P = 49, the metrics are all reasonably close to the ones for the original MNIST images. Clustering over NN output images with P = 36, however, leads to a performance degradation, as it results in values of cm that are smaller and values of the variance σ^2_Count that are meaningfully larger than the ones for original MNIST. As a result, for MNIST and with the parameters considered in our simulations, setting P to any value greater than or equal to 49 seems sufficient to apply the K-means algorithm with a sufficient level of performance.
Finally, we notice that the Silhouette coefficient ss gives inconsistent results on our different setups. The metric ss does not rely on the ground truth, and instead evaluates cluster homogeneity. But here, the space in which the data evolves varies with the value of P, which may mislead the computation of ss and make it improper for comparing the schemes' performance. Despite this, we provide the ss values, since the Silhouette criterion is widely considered in the literature.
Then, as the output of the second set of simulations, Table 4 compares the clustering performance after two transforms: DCT and SHF. In this table, we first show the clustering performance for a first scenario, referred to as ''reconstructed MNIST images'', in which the two transforms are applied to the original data, but the transform coefficients pass neither through the transmission scheme nor through the noisy channel (no quantization, modulation, channel noise, etc., is applied to the data) and are reconstructed with the corresponding inverse transform. As a second scenario, we also restate the clustering performance of the proposed transmission scheme, referred to as ''NN output images'', for P = 36 and P = 49. From the metrics cm and σ^2_Count of Table 4, we see that clustering over direct DCT reconstruction is meaningfully better than with SHF. By contrast, we observe that clustering after our transmission scheme with SHF is far better than over the direct DCT and SHF reconstructions. This shows the efficiency of the CVNN in handling all the non-linear effects of the transmission scheme (quantization, channel noise, etc.). Finally, in Table 4, we notice that the methods which consider smaller values of P always have higher ss values. This confirms that the structure of the data has a strong influence on the Silhouette values.

3) RATE EVALUATION
We now evaluate the rate of the proposed scheme, as well as the rates of the baseline schemes of Section V-D, for a given set of parameters. For MNIST, we have M = N = 28, and K = 10 clusters. As before, we consider a proportion β = 1% of data transmitted at the first phase. We consider a source coding rate R_sc = 1/4, since we observed that JPEG compression on MNIST with rate R_sc = 1/4 reconstructs the original images almost without loss. We also consider a channel coding rate R_cc = 3/4, which is sufficient to correct errors with a bit error probability P_b = 0.062, as considered in our previous simulations. The coding rates R_conv and R_learn obtained with these parameters are shown in the last column of Table 3, where the latter was evaluated for the three values P = 36, 49, 64. In the table, we also indicate the coding rate R_conv for binary MNIST, evaluated by considering R_cc = 3/4 as before, and R_sc = 1/16. This value of R_sc comes from the fact that one bit per pixel gives a compression ratio of 1/8, and we observed that lossless Huffman coding further divides this ratio by two. We observe that our scheme has a clear gain in coding rate compared to the two considered conventional coding schemes. In particular, the case P = 49, which was identified as allowing for a sufficient clustering performance, has a coding rate R_learn = 0.05 bit/symbol, which is better than the R_conv = 0.083 bit/symbol obtained for binary MNIST, while also allowing for a slightly improved clustering performance. In addition to the rates provided in Table 3, we also get R_ideal = 0.032 bit/symbol for the ideal scheme described in Section V-D. We observe that the rate R_learn for P = 49 is not too far from the rate R_ideal of the ideal scheme, although there is still some room for improvement.

4) DATA RECONSTRUCTION
In the receiver designed for clustering, the CVNN first performs a reconstruction of the data, before applying the clustering algorithm. In Figure 3, we show some examples of MNIST images reconstructed by the CVNN, for 16 transmitted SHF coefficients (left figure) and for 32 transmitted SHF coefficients (right figure). These figures were obtained by considering the full transmitter scheme with SHF and 64-IHQAM modulation, but without channel noise. The CVNN was trained over β = 2.85% of the dataset transmitted at the first phase. While data reconstruction is not the main purpose of our scheme, these figures also illustrate the generic aspect of our transmission scheme and the fact that it could be employed in various applications.

B. CLASSIFICATION
1) SIMULATION PARAMETERS
We now evaluate the classification performance achieved by the proposed transmission scheme for various sets of parameters. The considered metric for classification performance evaluation is the accuracy. We consider the 64-IHQAM modulation with distance values d ∈ {1, 2} in the constellation scheme, a number of SHF coefficients P ∈ {16, 36}, and values β ∈ {2.85%, 5.71%} for the proportion of MNIST samples transmitted at the first phase. We then evaluate the classification performance over a test set of 2000 MNIST images transmitted at the second phase of our scheme. We also consider several variance values for the channel noise, ranging from σ^2 = 0 to σ^2 = 16. Table 5 shows the classification performance in terms of accuracy over the MNIST dataset using the proposed scheme.

2) CLASSIFICATION PERFORMANCE
The same results are also represented graphically in Figure 4 for an easier comparison between the different sets of parameters. As expected, we observe that the accuracy increases with P and d for fixed values of β and σ^2. Also, for given P and d, the accuracy slightly increases with β. Finally, the parameters P and d seem to have a higher impact on the classification performance than β. Another deduction from Table 5 is that P ≥ 36 is sufficient for our transmission scheme to achieve at least 90% accuracy when considering a channel noise variance 0 ≤ σ^2 ≤ 9. In addition, considering P = 64 instead of P = 36 does not improve the accuracy much. We also remark that a larger value of β is needed than when the transmission scheme was designed for clustering.

3) RATE EVALUATION
For conventional coding schemes, in order to obtain a fair rate comparison with our scheme, we considered classification of MNIST and binary MNIST with a standard Multilayer Perceptron (MLP) classifier, built with the same parameters as our CVNN (two layers, k_1 + 1 = 120, k_2 + 1 = 784). When trained on β = 5.71% of the original MNIST dataset, the MLP classifier returned an accuracy of 92% on a test set of 2000 MNIST samples. With the same value of β and training over binary MNIST, the MLP obtained an accuracy of 90%. In addition, for these two datasets (original and binary MNIST) with conventional coding schemes, the coding rates needed for classification are the same as the coding rates shown for clustering in Table 3, that is, R_conv = 0.33 bit/symbol for original MNIST, and R_conv = 0.083 bit/symbol for binary MNIST, given that we still consider R_cc = 3/4. To obtain equivalent accuracy levels with our scheme, we can for instance consider P = 36 and β = 5.71%, which gives an accuracy larger than 90% for SNR values larger than 12.46 dB. For these parameters, our scheme gives R_learn = 0.051 bit/symbol, which is still better than the coding rate of the conventional coding scheme for binary MNIST. In addition, considering a smaller value β = 2.85% with the same value P = 36 (at the price of a small accuracy degradation) gives a coding rate R_learn = 0.042 bit/symbol, which is even closer to the rate of the ideal scheme R_ideal = 0.038 bit/symbol. This allows us to conclude that the proposed transmission scheme achieves a better coding rate than conventional coding.

VII. CONCLUSION
In this paper, we introduced a practical transmission scheme for efficient learning over the data received at the output of an AWGN channel. The proposed scheme consists of a transmitter built from the SHF transform and IHQAM modulation, and of a receiver that makes use of a CVNN to perform the considered learning task. We also provided the source/channel coding rate of this scheme, and evaluated its learning performance through numerical simulations. Numerical results showed a clear gain in terms of coding rate compared to conventional coding approaches, at the same learning performance level. These promising results were obtained given that no prior training dataset is needed by our scheme, and that only a small amount of feedback is allowed between the receiver and the transmitter. Although generic, the proposed scheme was specified and evaluated for two standard learning tasks, namely clustering with K-means and classification. Future works will be dedicated to adapting the proposed scheme to other learning tasks such as regression, or to clustering with other techniques. We will also investigate other channel models such as the fading channel.