Distortion-Rate Bounds for Distributed Estimation using Wireless Sensor Networks

We deal with centralized and distributed rate-constrained estimation of random signal vectors performed using a network of wireless sensors (encoders) communicating with a fusion center (FC, the decoder). For this context, we determine lower and upper bounds on the corresponding distortion-rate (D-R) function. The non-achievable lower bound is obtained by considering centralized estimation with a single sensor that has all observation data available, and by determining the associated D-R function in closed form. Interestingly, this D-R function can be achieved using an estimate-first compress-afterwards (EC) approach, whereby the sensor: i) forms the minimum mean-square error (MMSE) estimate for the signal of interest; and ii) optimally (in the MSE sense) compresses and transmits it to the FC, which reconstructs it. We further derive a novel alternating scheme to numerically determine an achievable upper bound on the D-R function for general distributed estimation using multiple sensors. The proposed algorithm tackles an analytically intractable minimization problem, while accounting for sensor data correlations. The obtained upper bound is tighter than the one determined by having each sensor perform MSE-optimal encoding independently of the others. Numerical examples indicate that the algorithm performs well and yields D-R upper bounds which are relatively tight with respect to analytical alternatives obtained without taking into account the cross-correlations among sensor data.


I. INTRODUCTION
Stringent bandwidth and energy constraints that wireless sensor networks (WSNs) must adhere to motivate efficient compression and encoding schemes when estimating random signals or parameter vectors of interest. In such networks, it is of paramount importance to determine bounds on the minimum achievable distortion between the signal of interest and its estimate formed at the fusion center (FC) using the encoded information transmitted by the sensors subject to rate constraints.
In the reconstruction scenario, the FC wishes to accurately reconstruct the sensor observations that are transmitted to the FC in compressed form. In the estimation scenario, the FC is interested in accurately estimating an underlying random vector which is correlated with, but not equal to, the sensor observations. Thus, the FC utilizes the compressed sensor data to estimate a vector parameter which is conveyed implicitly by the sensor data. In a setup involving one sensor, single-letter characterizations of the D-R function for both scenarios are known: the reconstruction scenario is the standard distortion-rate problem [4, p. 336]; and the estimation one, also referred to as a rate-distortion problem with a remote source, has also been determined [1, p. 78]. In the distributed setup, involving multiple sensors with correlated observations, neither problem is well understood. The best analytical inner and outer bounds for the D-R function for reconstruction can be found in [2] and [16]; see also [17], which determines the rate-distortion region for a two-sensor setup. An iterative scheme has been developed in [6], which numerically determines an achievable upper bound for distributed reconstruction but not for signal estimation. The numerical D-R upper bound obtained by [6] is applicable when the signal to be reconstructed at the FC coincides with the sensor observations. However, this is not the case in the estimation setup considered here, where sensors observe a statistically perturbed version of the signal of interest that the FC wishes to reconstruct.
For the general problem of estimating a parameter vector with analog-amplitude entries correlated with sensor observations, most of the existing literature examines Gaussian data and Gaussian parameters.
Specifically, when each sensor observes a common scalar random parameter contaminated with Gaussian noise, the D-R function for estimating this parameter has been determined in [3], [7]-[9], [15] to solve the so-called Gaussian CEO problem. D-R bounds for a linear-Gaussian data model have been derived in [10] and [11] when the number of parameters equals the number of all scalar observations, with one scalar observation per sensor. Under a similar setup, [17] determines the rate-distortion region in a two-sensor WSN. Another formulation was considered in [21], where each sensor has available a vector of observations with the same length as the parameter vector; see also [14], where a two-sensor setup is considered. All existing formulations dealing with vectors of parameters and observations are special cases of the general vector Gaussian CEO problem. In this paper, we pursue D-R analysis for distributed estimation with WSNs under the vector Gaussian CEO setup, without constraining the number of observations at each sensor and/or the number of random parameters to be estimated.
We first determine in closed form the D-R function for estimating a parameter vector when applying rate-constrained encoding to the observation data collected by a single sensor (Section III). Without assuming that the number of parameters equals the number of observations, we prove that the optimal scheme achieving the D-R function amounts to first computing the minimum mean-square error (MMSE) estimate of the source at the sensor, and then optimally compressing at the sensor and reconstructing at the FC this estimate via reverse water-filling (rwf). The D-R function for the single-sensor setup serves as a non-achievable lower D-R bound for rate-constrained estimation in the multi-sensor setup. Next, we develop an alternating scheme that numerically determines an achievable D-R upper bound for the multi-sensor scenario (Section IV). Using this iterative algorithm, we can tackle an analytically intractable minimization problem and determine a D-R upper bound. Different from [6], which deals with WSN-based distributed reconstruction, our approach aims at general estimation problems where the parameters of interest are not directly observed at the sensors. Combining the lower bound of Section III with the numerically determined upper bound of Section IV, we specify a region in which the D-R function for distributed estimation lies.

II. PROBLEM STATEMENT
With reference to Fig. 1 (a), consider a WSN comprising L sensors that communicate with an FC.
Each sensor, say the ith, observes an N_i × 1 vector x_i(t) which is correlated with a p × 1 random signal (parameter vector) of interest s(t), where t denotes discrete time. Similar to [8], [11], [15], we assume that: (a1) No information is exchanged among sensors, and the links with the FC are noise-free.
(a2) The random vector s(t) is generated by a stationary Gaussian vector memoryless source with s(t) ∼ N(0, Σ_ss), and the observations obey the linear model x_i(t) = H_i s(t) + n_i(t), where n_i(t) denotes additive white Gaussian noise (AWGN); i.e., n_i(t) ∼ N(0, σ² I); the noise n_i(t) is uncorrelated across sensors, across time, and with s(t); and H_i as well as the (cross-)covariance matrices Σ_ss, Σ_sx_i and Σ_x_ix_j are known for all i, j ∈ {1, . . . , L}.
Notice that (a1) holds when sufficiently strong channel codes are employed to cope with channel effects. Further, whiteness of n i (t) and the zero-mean assumptions in (a2) are made without loss of generality. The linear model in (a2) is commonly encountered in estimation and in a number of cases it even accurately approximates non-linear mappings; e.g., via a first-order Taylor expansion in target tracking applications. Although confining ourselves to Gaussian vectors x i (t) is of interest on its own, following arguments similar to those in [1, p. 134], it can be shown that the D-R functions obtained in this paper upper bound their counterparts for non-Gaussian sensor data x i (t) with (cross)-covariance matrices identical to those in (a2).
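As a concrete illustration of the model in (a1)-(a2), the following minimal sketch draws one time snapshot of the source and of each sensor's observations; all dimensions, seeds, and numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: p parameters, L sensors, N_i observations each.
p, L, sigma2 = 4, 2, 0.5
N = [20, 20]

Sigma_ss = np.eye(p)                  # source covariance, as in (a2)
H = [rng.standard_normal((N[i], p)) for i in range(L)]

# One draw of the memoryless source and the per-sensor observations.
s = rng.multivariate_normal(np.zeros(p), Sigma_ss)
x = [H[i] @ s + np.sqrt(sigma2) * rng.standard_normal(N[i]) for i in range(L)]

# Model covariances assumed known under (a2).
Sigma_sx = [Sigma_ss @ H[i].T for i in range(L)]
Sigma_xx = [H[i] @ Sigma_ss @ H[i].T + sigma2 * np.eye(N[i]) for i in range(L)]
```

Under this model, every covariance needed later (Σ_ss, Σ_sx_i, Σ_x_ix_j) follows in closed form from H_i, Σ_ss and σ².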
Blocks x_i^(n) := {x_i(t)}_{t=1}^n, comprising n consecutive time instantiations of the vector x_i(t), are encoded per sensor to yield each encoder's output u_i^(n) = f_i^(n)(x_i^(n)), i = 1, . . . , L. These encoded blocks are communicated through ideal orthogonal channels to the FC. There, the u_i^(n)'s are decoded to obtain an estimate of s^(n) := {s(t)}_{t=1}^n, which we denote as ŝ_R^(n). The subscript R signifies the rate constraint, which is imposed through a bound on the cardinality of the range of the sensor encoding functions; namely, the cardinality of the range of f_i^(n) must be no larger than 2^{nR_i}, where R_i is the available rate at the encoder of the ith sensor, and R := Σ_{i=1}^L R_i is the total rate available for the L sensors. It is worth re-iterating that this setup is precisely the vector Gaussian CEO problem in its most general form, without any restrictions on the number of observations N_i at the ith sensor or the number of random parameters p.
Under the sum-rate constraint Σ_{i=1}^L R_i ≤ R, the ultimate goal is to determine the minimum possible distortion E[||s − ŝ_R||²] for estimating s in the limit of infinite block length n. Such a (so-called single-letter) characterization of the D-R function is available for the single-sensor case (L = 1), but not for the distributed multi-sensor scenario. For this reason, our objective in this paper is to derive (preferably tight) inner and outer bounds on the D-R function of the general vector CEO setup.

III. DISTORTION-RATE FOR CENTRALIZED ESTIMATION
We will first determine in closed form the D-R function for estimating s(t) in a single-sensor setup and provide a scheme that achieves it. The single-letter characterization of the D-R function in this setup allows us to drop the time index. Here, all observation data x := {x_i}_{i=1}^L, whose dimensionality is N, are available to a single sensor, and are related to the p × 1 parameter vector s according to the linear model x = Hs + n. The D-R function in this setting provides a lower (non-achievable) bound on the MMSE that can be achieved in a multi-sensor distributed setup, where each x_i is observed and encoded by a different sensor. Existing works treat the case N = p [5], [13], [19] and transform the D-R function with a remote source into an ordinary reconstruction D-R problem; see also [18], which provides more general conditions under which this transformation is possible. Other works deal with practical encoding-decoding schemes using, e.g., vector quantization [12]. However, here we look for the D-R function for general N and p in the linear-Gaussian model framework.

A. Background on D-R for Reconstruction
The D-R function for encoding x, which has probability density function (pdf) p(x), with rate R at an individual sensor, and reconstructing it (in the MMSE sense) as x̂ at the FC, is given by [4, p. 342] D_x(R) = min E[||x − x̂||²] subject to I(x; x̂) ≤ R, where the minimization is w.r.t. the conditional pdf p(x̂|x).
For x Gaussian with covariance Σ_xx = Q_x Λ_x Q_x^T, D_x(R) can be determined by applying rwf to the pre-whitened vector x_w := Q_x^T x [4, p. 348]. For a prescribed rate R, it turns out that there exists a k such that the first k entries {x_w(i)}_{i=1}^k of x_w are encoded and reconstructed independently of each other, using rates R_i = (1/2) log₂(λ_x,i / d(k, R)), while the last N − k entries of x_w are assigned no rate.

The resultant overall MMSE (D-R function) is D_x(R) = k d(k, R) + Σ_{i=k+1}^N λ_x,i, where d(k, R) := (∏_{i=1}^k λ_x,i)^{1/k} 2^{−2R/k} is a threshold distortion determining which entries of x_w are assigned non-zero rate. The first k entries of x_w, with variances λ_x,i > d(k, R), are encoded with non-zero rate, whereas the last N − k ones, with variances λ_x,i ≤ d(k, R), are discarded in the encoding procedure (set to zero).
Associated with the rwf principle is the so-called test channel; see Fig. 1 (b) and, e.g., [4, p. 345]. The encoder's MSE-optimal output is u = Q_{x,k}^T x + ζ, where Q_{x,k} is formed by the first k columns of Q_x, and ζ models the distortion noise that results from the rate-constrained encoding of x. The zero-mean AWGN ζ is uncorrelated with x, and its diagonal covariance matrix Σ_ζζ has entries [Σ_ζζ]_ii = d(k, R) λ_x,i / (λ_x,i − d(k, R)), i = 1, . . . , k. The part of the test channel that takes u as input and outputs x̂ models the decoder. The reconstruction x̂ of x at the decoder output is the linear MMSE estimate x̂ = Σ_xu Σ_uu^{−1} u.
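The rwf allocation above is easy to compute numerically. The helper below is a sketch under the stated Gaussian assumptions (function and variable names are ours, not the paper's): it returns the overall distortion and the water level d(k, R).

```python
import numpy as np

def reverse_waterfill(lams, R):
    """Reverse water-filling over independent Gaussian components.

    lams : eigenvalues of the source covariance (any order)
    R    : total rate in bits
    Returns (D, d): overall MMSE distortion and the water level d, where
    component i gets rate max(0, 0.5*log2(lam_i/d)) and distortion min(lam_i, d).
    """
    lams = np.sort(np.asarray(lams, float))[::-1]
    # With k active components, d(k, R) = (prod of lam_1..lam_k)^(1/k) * 2^(-2R/k);
    # the valid k is the largest one whose level lies below all active eigenvalues.
    for k in range(len(lams), 0, -1):
        d = np.exp(np.mean(np.log(lams[:k]))) * 2.0 ** (-2.0 * R / k)
        if d <= lams[k - 1] + 1e-12:
            D = k * d + lams[k:].sum()
            return D, d
    return lams.sum(), None

D, d = reverse_waterfill([4.0, 1.0], R=1.0)   # two components, 1 bit total -> D = 2.0, d = 1.0
```

Given the returned level d, the test-channel noise variances of the active components follow as d·λ_x,i/(λ_x,i − d), matching the entries of Σ_ζζ above.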

B. D-R for Estimation
The D-R function for estimating the source s given the observation x (where the source and observation are drawn from the joint pdf p(x, s)) with rate R at an individual sensor, and reconstructing it (in the MMSE sense) as ŝ_R at the FC, is given by [1, p. 79] D_s(R) = min E[||s − ŝ_R||²] subject to I(x; ŝ_R) ≤ R, where the minimization is w.r.t. the conditional pdf p(ŝ_R|x).
In order to achieve this D-R function, one might be tempted to first compress the observation x via rwf and then estimate s at the FC based on the reconstructed observation; we call this the compress-first estimate-afterwards (CE) option. Alternatively, one can first form the MMSE estimate ŝ of s at the sensor and then compress it; this is the estimate-first compress-afterwards (EC) option. With subscripts ce and ec corresponding to these two options, let us also define the errors s̃_ce := s − ŝ_ce and s̃_ec := s − ŝ_ec.
For CE, we depict in Fig. 2 (a) the test channel for encoding x via rwf, followed by MMSE estimation of s based on x̂. Suppose that, when applying rwf to x with prescribed rate R, the first k_ce ∈ {1, . . . , N} components of x_w are assigned non-zero rate and the rest are discarded. The MMSE-optimal encoder output for encoding x is then given as in Section III-A by (6), since Q_x^T is unitary and the last N − k_ce entries of x̂ are useless for estimating s. We show in Appendix A that the covariance matrix of the resulting error obeys (7); Eqs. (6) and (7) characterize the CE scheme fully. In Fig. 2 (b) we depict the corresponding test channel for EC. Let Σ_ŝŝ = Σ_sx Σ_xx^{−1} Σ_xs = Q_ŝ Λ_ŝ Q_ŝ^T be the eigenvalue decomposition of the covariance matrix of ŝ, where Λ_ŝ = diag(λ_ŝ,1, . . . , λ_ŝ,p) with λ_ŝ,1 ≥ · · · ≥ λ_ŝ,ρ > λ_ŝ,ρ+1 = · · · = λ_ŝ,p = 0, and ρ := rank(Σ_sx) denotes the rank of the matrix Σ_sx. Suppose now that the first k_ec ∈ {1, . . . , ρ} entries of ŝ_w = Q_ŝ^T ŝ are assigned non-zero rate and the rest are discarded. The MSE-optimal encoder output is given by u_ec = Q_{ŝ,k_ec}^T ŝ + ζ_ec, and the estimate ŝ_ec follows as in (9), where Q_{ŝ,k_ec} is formed by the first k_ec columns of Q_ŝ, and the k_ec × k_ec diagonal matrices Θ_ec and Σ_{ζ_ec ζ_ec} are defined as in Section III-A. The MMSE associated with CE and EC is given, respectively, by [cf. (7) and (9)] D_ce(R) = J_o + ε_ce(R) and D_ec(R) = J_o + ε_ec(R), where J_o := E[‖s − ŝ‖²] is the MMSE achieved when estimating s based on x without source encoding (R → ∞). Since J_o is common to both EC and CE, it is important to compare ε_ce(R) with ε_ec(R) in order to determine which estimation scheme achieves the smaller MSE. The following theorem, proved in Appendix C, provides such an asymptotic comparison. Theorem 1: Under (a1) and (a2), for sufficiently large R, ε_ce(R) = γ_1 2^{−2R/N} and ε_ec(R) = γ_2 2^{−2R/ρ}, where γ_1 and γ_2 are constants that do not depend on R.
An immediate consequence of Theorem 1 is that the MSE distortion for EC, namely D_ec(R), converges to J_o as 2^{−2R/ρ}, whereas D_ce(R) converges to J_o only as 2^{−2R/N}. Typically, sensors acquire more observations, namely N, than the number of parameters of interest p. Having N > p enables identifiability and improved MSE performance in estimating s. With N > p, it clearly holds that ρ ≤ min(N, p) < N. Then, the EC scheme approaches the lower bound J_o faster than CE, implying a more efficient usage of the available rate R. This is intuitively reasonable, since CE compresses x taking into account only the covariance matrix Σ_xx, which can result in using part of the rate to compress components of x that are irrelevant (e.g., noise) to the estimation of s. On the contrary, the MMSE estimator ŝ_ec in EC first extracts from x all the information pertinent to estimating s, and then performs compression. In that way, EC suppresses a significant part of the noise and allocates the rate more efficiently.
Let us now examine some special cases to gain more insight into Theorem 1.
Scalar model (p = 1, N = 1): With σ²_s̃ce and σ²_s̃ec denoting the variances of s̃_ce and s̃_ec, respectively, we prove in Appendix D that σ²_s̃ce = σ²_s̃ec; i.e., the CE and EC schemes coincide in this case. Vector model (p = 1, N > 1): With x = hs + n, we establish in Appendix E that ε_ce(R) ≥ ε_ec(R) for all R, which implies that EC uses the available rate more efficiently.
Defining the signal-to-noise ratio (SNR) as SNR = tr(HΣ ss H T )/(N σ 2 ), we compare in Fig. 3 (a,b) the MMSE when estimating s using the CE and EC schemes. With Σ ss = σ 2 s I p , p = 4 and N = 40, we observe that beyond a threshold rate, the distortion of EC converges to J o faster than that of CE, which corroborates Theorem 1. Notice also that the gap between the EC and CE curves for SNR = 2 is larger than the gap for SNR = 4. This is true because as the noise power increases, the portion of the rate allocated to noise terms in CE increases accordingly. However, thanks to the MMSE estimator, EC cancels part of the noise and utilizes the available rate more efficiently.
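The CE-EC comparison illustrated in Fig. 3 can be mimicked with a short numerical sketch. The covariance algebra below follows the standard rwf test channel; all dimensions, the rate, the seed, and the helper names are our illustrative assumptions, not the paper's simulation code.

```python
import numpy as np

def rwf_level(lams, R):
    """Water level d of reverse water-filling at total rate R (bits)."""
    lams = np.sort(np.asarray(lams, float))[::-1]
    for k in range(len(lams), 0, -1):
        d = np.exp(np.mean(np.log(lams[:k]))) * 2.0 ** (-2.0 * R / k)
        if d <= lams[k - 1] + 1e-12:
            return d
    return lams[0]

rng = np.random.default_rng(1)
p, N, sigma2, R = 4, 40, 0.5, 8.0          # illustrative values only
H = rng.standard_normal((N, p))
Sigma_ss = np.eye(p)
Sigma_xx = H @ Sigma_ss @ H.T + sigma2 * np.eye(N)
Sigma_sx = Sigma_ss @ H.T                  # p x N cross-covariance

# Clairvoyant MMSE J_o (rate R -> infinity).
J_o = np.trace(Sigma_ss - Sigma_sx @ np.linalg.solve(Sigma_xx, Sigma_sx.T))

# CE: reverse water-fill x itself, then MMSE-estimate s from the reconstruction.
lam_x, Q_x = np.linalg.eigh(Sigma_xx)
lam_x, Q_x = lam_x[::-1], Q_x[:, ::-1]     # descending eigenvalues
d = rwf_level(lam_x, R)
D_i = np.minimum(lam_x, d)                 # per-component distortions
C = Sigma_sx @ Q_x                         # columns are Cov(s, x_w(i))
D_ce = np.trace(Sigma_ss) - np.sum((C**2).sum(axis=0) * (lam_x - D_i) / lam_x**2)

# EC: form the MMSE estimate first, then reverse water-fill its covariance.
Sigma_shat = Sigma_sx @ np.linalg.solve(Sigma_xx, Sigma_sx.T)
lam_s = np.linalg.eigvalsh(Sigma_shat)[::-1]
lam_s = lam_s[lam_s > 1e-10]               # keep the rank(Sigma_sx) positive modes
D_ec = J_o + np.minimum(lam_s, rwf_level(lam_s, R)).sum()
```

Here D_ec is the distortion of the EC scheme, while D_ce is that of the achievable CE scheme, so J_o ≤ D_ec ≤ D_ce should hold at every rate.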
Our analysis so far raises the question of whether EC is MSE optimal. We show next that this is indeed the case when estimating s with a given rate R, without forcing any relationship between N and p. A related claim has been reported in [13], [19] for N = p, but the extension to N ≠ p is not obvious. To this end, we prove in Appendix G that: Theorem 2: Under (a1) and (a2), the D-R function when estimating s based on x can be expressed as D_s(R) = E[‖s − ŝ‖²] + D_ŝ(R), (12) where ŝ = Σ_sx Σ_xx^{−1} x is the MMSE estimator, s̃ := s − ŝ is the corresponding MMSE error, and D_ŝ(R) is the D-R function for reconstructing ŝ.
Theorem 2 reveals that the optimal means of estimating s is to first form the MMSE estimate ŝ, and then apply optimal D-R encoding to this estimate. The lower bound on this distortion as R → ∞ is E[‖s − ŝ‖²] = J_o, which is intuitively appealing. The D-R function in (12) is achievable, because the rightmost term in (12) corresponds to the D-R function for reconstructing the MMSE estimate ŝ, which is known to be achievable using random coding; see, e.g., [1, p. 66]. Theorem 2 implies an important separation result regarding estimation of (remote) Gaussian sources: optimal estimation can be performed by separately estimating the source s based on the observation x, and then compressing the estimate ŝ based only on the covariance of ŝ. The important consequence of this result is that the total distortion D_s(R) can be minimized by minimizing separately: i) the MSE distortion associated with the estimation of s based on x; and ii) the MSE distortion related to the compression/reconstruction task, given by the second term on the right-hand side (RHS) of (12).

IV. DISTORTION-RATE FOR DISTRIBUTED ESTIMATION
Let us now consider the D-R function for estimating s in a multi-sensor setup, under a total available rate R which has to be shared among all sensors. Because analytical specification of the D-R function in this case remains intractable, we will develop an alternating algorithm that numerically determines an achievable upper bound for it. Combining this upper bound with the non-achievable lower bound corresponding to an equivalent single-sensor setup applying the MMSE-optimal EC scheme will provide a (hopefully tight) region in which the D-R function lies. For simplicity of exposition, we confine ourselves to a two-sensor setup, but our results extend readily to any L > 2.
To this end, we consider the following single-letter characterization of an upper bound on the D-R function: D̄(R) = min E[‖s − ŝ_R(u_1, u_2)‖²], (13) where the minimization is w.r.t. the conditional pdfs p(u_1|x_1), p(u_2|x_2) and the reconstruction function ŝ_R(u_1, u_2), subject to the sum-rate constraint I(x; u_1, u_2) ≤ R. Achievability of D̄(R) can be established by readily extending to the vector case the scalar results in [3]. To carry out the minimization in (13), we develop an alternating scheme whereby u_2 is treated as side information that is available at the decoder when optimizing (13) w.r.t. p(u_1|x_1) and ŝ_R(u_1, u_2). The minimization is carried out within the class of Gaussian auxiliary variables u_1, u_2. As a starting point, we assume that the side information u_2 is the output of an optimal D-R encoder applied to x_2 for estimating s, without taking x_1 into account.
This initialization for u_2 is motivated by the Gaussianity of s and x_2, as well as by the single-sensor D-R results in Section III-B. Since x_2 is Gaussian, the side information has the form (cf. Section III-B) u_2 = Q_2 x_2 + ζ_2, where Q_2 ∈ R^{k_2 × N_2} and k_2 ≤ N_2, due to the rate-constrained encoding of x_2. Recall also that the k_2 × 1 vector ζ_2 is uncorrelated with x_2 and Gaussian; i.e., ζ_2 ∼ N(0, Σ_{ζ_2ζ_2}).

Based on ψ := [x_1^T u_2^T]^T, which is the information that the decoder can have assuming infinite rate at the first encoder, the optimal estimator for s is the MMSE one: ŝ = E[s|x_1, u_2]. If s̃ := s − ŝ is the corresponding MSE error, then s = ŝ + s̃, where s̃ is uncorrelated with ψ and ŝ due to the orthogonality principle. Notice also that ŝ_R(u_1, u_2) is uncorrelated with s̃, because it is a function of ψ. Since x_1 and x_2 are correlated, and u_1 is stochastically related to x_1 through the conditional pdf p(u_1|x_1), we have the Markov chain (MC) (x_2, u_2) → x_1 → u_1. Using MC properties, we obtain after some simple algebra that I(x; u_1, u_2) = R_2 + I(x_1; u_1) − I(u_2; u_1), where R_2 := I(x; u_2) is the rate consumed to form the side information u_2, while the rate constraint in (13) reduces accordingly. Writing ŝ = L_1 x_1 + L_2 u_2 for appropriate matrices L_1 and L_2, the new signal of interest that we wish to reconstruct in (14) is L_1 x_1. Continuing, we prove in Appendix H that I(x_1; u_1) = I(L_1 x_1; u_1). (15) Using (15), we obtain I(x_1; u_1) − I(u_2; u_1) = I(L_1 x_1; u_1) − I(u_2; u_1), and from the RHS of the last equation we deduce the equivalent rate constraint I(L_1 x_1; u_1) − I(u_2; u_1) ≤ R_1 := R − R_2. Combining the latter with (14) and (13), we arrive at the D-R upper bound (16), through which we can determine an achievable D-R region, having available rate R_1 at the encoder and side information u_2 at the decoder. Since x_1 and u_2 are jointly Gaussian, we can apply the Wyner-Ziv result [20], which allows us to consider that u_2 is available both at the decoder and the encoder. This, in turn, permits rewriting the first expectation in (16), yielding (17). If ŝ_1 := E[L_1 x_1 | u_2] and s̃_1 := L_1 x_1 − ŝ_1 is the corresponding MSE error, then we can write L_1 x_1 = ŝ_1 + s̃_1. For the rate constraint in (17), we have (18), where the first equality holds because u_2 is given; the second holds since u_2 is uncorrelated with s̃_1, due to the orthogonality principle; and likewise, u_2 can be taken uncorrelated with s̃_R,12(u_1, u_2), the reconstructed version of s̃_1, which is uncorrelated with u_2.
Utilizing (17) and (18), we arrive at (19). Notice that the minimization term in (19) is the D-R function for reconstructing the MSE error s̃_1 with rate R_1. Since s̃_1 is Gaussian, we can readily apply rwf to the pre-whitened vector Q_{s̃_1}^T s̃_1 to determine D̄(R_1) and the corresponding test channel that achieves it (cf. Section III-A). Through the latter, and considering the eigenvalue decomposition Σ_{s̃_1s̃_1} = Q_{s̃_1} diag(λ_{s̃_1,1}, . . . , λ_{s̃_1,p}) Q_{s̃_1}^T, we find that the first encoder's output that minimizes (13) given side information u_2 has the form (20), where Q_{s̃_1,k_1} denotes the first k_1 columns of Q_{s̃_1}, k_1 is the number of entries of Q_{s̃_1}^T s̃_1 that are assigned non-zero rate, and the per-component distortions are D¹_i = d_1(k_1, R_1) for i = 1, . . . , k_1 and D¹_i = λ_{s̃_1,i} for i = k_1 + 1, . . . , p. This way, we are also able to determine p(u_1|x_1). The reconstruction function follows from the corresponding test channel (cf. Section III-A). Interestingly, it can be seen from (19) and the last expressions for s̃_1 and ŝ that the MSE-optimal approach for estimating s with side information u_2 is exactly the EC scheme, with the difference that in the compression step we apply rwf to the part of the MMSE estimate ŝ formed by the "innovation" of x_1 with respect to u_2. Note that the optimal encoder for sensor 1 in (20) has the same structure as the one assumed for the side information u_2 at the initialization step. Thus, we can proceed as described earlier to determine the optimal encoder for sensor 2, after treating u_1 in (20) as side information.
The approach in this subsection can be applied in an alternating fashion from sensor to sensor in order to determine appropriate p(u_i|x_i), for i = 1, 2, and ŝ_R(u_1, u_2) that, at best, globally minimize (16). The importance of the algorithm lies in the fact that it provides a way to numerically tackle (16) and determine an achievable D-R upper bound when estimating s at the FC based on compressed sensor observations. The conditional pdfs can be determined by finding the appropriate covariances Σ_{ζ_iζ_i}. Furthermore, by specifying the optimal Q_1 and Q_2, we obtain a complete characterization of the encoders' structure.
Relative to [6], the algorithm here can be applied to derive D-R upper bounds in general estimation setups, where the parameter vector s that the FC wishes to estimate and reconstruct based on compressed sensor data is observed at the sensors only through the x_j's. The scheme in [6] can be viewed as a special case of the present one corresponding to x_j = s. In Fig. 4, we plot the non-achievable lower bound, which corresponds to one sensor having available the entire x and using the optimal EC scheme. The same figure also depicts an achievable D-R upper bound determined by letting the ith sensor form its local estimate ŝ_i = E[s|x_i] and then apply optimal D-R encoding to ŝ_i. If ŝ_R,1 and ŝ_R,2 are the reconstructed versions of ŝ_1 and ŝ_2, respectively, then the decoder at the FC forms the final estimate as ŝ_R = E[s|ŝ_R,1, ŝ_R,2]. We refer to this approach as the decoupled EC scheme. We also plot the achievable D-R region determined numerically by the alternating algorithm. For each rate, we keep the smallest distortion returned after 500 executions of the algorithm, simulated with Σ_ss = I_p, p = 4, and N_1 = N_2 = 20, at SNR = 2. We observe that the proposed algorithm provides a tighter upper bound for the achievable D-R region than the one obtained using the decoupled EC strategy. This is expected, since the proposed algorithm takes into account the cross-correlations among the sensor data when determining the encoders, whereas the decoupled EC approach does not; the rate wasted on encoding redundant information is thus reduced. Using also the non-achievable lower bound (solid line), we have effectively reduced the 'uncertainty region' where the D-R function lies.

V. CONCLUSIONS
We derived inner and outer D-R bounds for the generalized Gaussian CEO problem. Specifically, we determined the D-R function for estimating a random vector in a single-sensor setup and established optimality of the estimate-first compress-afterwards (EC) approach, along with the (sub)optimality of the compress-first estimate-afterwards (CE) alternative. When it comes to estimation using multiple sensors, the corresponding D-R function can be bounded from below using the single-sensor D-R function achieved by the EC scheme. An alternating algorithm was also derived for numerically determining an achievable D-R upper bound in the distributed multi-sensor setup. Simulations demonstrated that the numerically determined upper bound is tighter than analytically found alternatives (cf. the decoupled EC scheme), which is expected since the novel algorithm accounts for the cross-correlations among sensor data during the design of the encoders.
Issues of interest not accounted for by this paper's analysis include general (possibly non-linear) dynamical data models, where the distribution of the observation data is no longer stationary or Gaussian.

[19] J. Wolf and J. Ziv, "Transmission of noisy information to a noisy receiver with minimum distortion," IEEE Trans. on Info. Theory, vol. 16, pp. 406-411, July 1970.
[20] A. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. on Info. Theory, vol. 22, pp. 1-10, January 1976.

APPENDIX

A. Derivation of (7)
Using (6), we find that the covariance matrix of s̃_ce is given by Σ_{s̃_ce s̃_ce} = Σ_ss − Σ_{sx̂_1} Σ_{x̂_1x̂_1}^{−1} Σ_{x̂_1s}. From the definition of x̂_1, it also follows that Σ_{sx̂_1} can be expressed in terms of Λ_{x,k_ce}, where Λ_{x,k_ce} denotes the first k_ce diagonal entries of Λ_x. Let us define the diagonal matrix D_ce := diag(D^ce_1, . . . , D^ce_N), and let D^ce_{k_ce} denote the upper-left k_ce × k_ce submatrix of D_ce. We then have Σ_{x̂_1x̂_1} = Λ_{x,k_ce} − D^ce_{k_ce}, using which we can write (23). Let D^ce_{N−k_ce} and Λ_{x,N−k_ce} denote the lower-right submatrices of D_ce and Λ_x, respectively; similarly, let Q_{x,N−k_ce} be formed by the last N − k_ce columns of Q_x. Because the last N − k_ce entries of Q_x^T x are not assigned any rate, we have D^ce_{N−k_ce} = Λ_{x,N−k_ce}. Adding and subtracting from (23) the matrix Σ_sx Q_{x,N−k_ce} Λ_{x,N−k_ce}^{−1} Q_{x,N−k_ce}^T Σ_xs, and also adding the matrix Σ_ss, we arrive at the RHS of (7).
B. Derivation of (9)
Adding and subtracting the matrix Q_{ŝ,p−k_ec} Λ_{ŝ,p−k_ec} Q_{ŝ,p−k_ec}^T from (27), where Q_{ŝ,p−k_ec} is formed by the last p − k_ec columns of Q_ŝ, we arrive at (9).

C. Proof of Theorem 1
Consider first the CE scheme with k_ce = N. In this case, the rwf threshold is d_ce(N, R) = (∏_{i=1}^N λ_x,i)^{1/N} 2^{−2R/N}. Since for k_ce = N all entries of Q_x^T x are assigned non-zero rate, we infer that d_ce(N, R) < σ², or, equivalently, R > R_ce. Focusing on the EC scheme with k_ec = ρ, we have d_ec(ρ, R) = (∏_{i=1}^ρ λ_ŝ,i)^{1/ρ} 2^{−2R/ρ}. When k_ec = ρ, all entries of Q_ŝ^T ŝ are assigned non-zero rate; the latter implies that d_ec(ρ, R) < λ_ŝ,ρ, which translates into R > R_ec. If R > max(R_ce, R_ec), then we have k_ce = N and k_ec = ρ. Since the matrices ∆_ce and ∆_ec then depend on R only through the factors 2^{−2R/N} and 2^{−2R/ρ}, respectively, it follows readily that ε_ce(R) = tr(Σ_sx Q_x ∆_ce Q_x^T Σ_xs) = γ_1 2^{−2R/N} and ε_ec(R) = tr(Q_ŝ ∆_ec Q_ŝ^T) = γ_2 2^{−2R/ρ}, where γ_1 and γ_2 are constants that do not depend on R.

D. Proof of Proposition 1
For the CE scheme, ∆_ce is given by (33); likewise, for the EC scheme, ∆_ec is given by (34). It then follows readily from (33) and (34) that σ²_s̃ce = σ²_s̃ec.

E. Proof of Proposition 2
Using the vector model x = hs + n, we can easily verify that Λ_x = diag(σ² + σ_s²‖h‖², σ², . . . , σ²) and Q_x = [q_x,1, . . . , q_x,N], where q_x,1 = h/‖h‖. In the CE scheme, if d_ce(k_ce, R) ≥ σ², then only the first entry of Q_x^T x is assigned positive rate; while if d_ce(k_ce, R) < σ², then all elements of Q_x^T x are assigned non-zero rate. Thus, k_ce can be either 1 or N. When k_ce = 1, the rwf threshold is d_ce(1, R) = (σ² + σ_s²‖h‖²) 2^{−2R}, and the corresponding threshold rate R_th follows from d_ce(1, R_th) = σ². If R > R_th, we have k_ce = N and the threshold is d_ce(N, R) = (∏_{i=1}^N λ_x,i)^{1/N} 2^{−2R/N}. The resulting ε_ce(R) = tr(Σ_sx Q_x ∆_ce Q_x^T Σ_xs) involves β = σ_s⁴‖h‖² (σ_s²‖h‖² + σ²)^{−1}. For the EC scheme, we obtain σ_ŝ² = β. Since in the EC scheme we compress the scalar MMSE estimate ŝ, we have ε_ec(R) = β 2^{−2R} for all R. The result now follows immediately after direct comparison of ε_ce(R) with ε_ec(R) when R ≤ R_th and when R > R_th, respectively.
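The scalar quantities appearing in this proof are easy to verify numerically. The sketch below checks the closed form β against the matrix MMSE computation for the model x = hs + n; the dimensions, seed, and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma2_s, sigma2 = 8, 1.5, 0.5         # illustrative values
h = rng.standard_normal(N)

# Variance of the MMSE estimate s_hat for x = h s + n: matrix form vs closed form.
Sigma_xx = sigma2_s * np.outer(h, h) + sigma2 * np.eye(N)
Sigma_sx = sigma2_s * h                    # Cov(s, x) as a length-N vector
var_shat = Sigma_sx @ np.linalg.solve(Sigma_xx, Sigma_sx)
beta = sigma2_s**2 * (h @ h) / (sigma2_s * (h @ h) + sigma2)

# EC compresses the scalar s_hat, so eps_ec(R) = beta * 2^{-2R} for every R,
# and D_ec(R) = J_o + beta * 2^{-2R} with J_o = sigma2_s - beta.
R = 3.0
J_o = sigma2_s - beta
D_ec = J_o + beta * 2.0 ** (-2 * R)
```

The agreement of `var_shat` with `beta` is an instance of the Sherman-Morrison identity applied to (σ²I + σ_s² h hᵀ)⁻¹.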

F. Proof of Proposition 3
From the matrix-vector model x = Hs + n, with SVD H = U_h Σ_h V_h^T, it follows immediately that Q_x = U_h and Λ_x = diag(σ_s² σ_{h,1}² + σ², . . . , σ_s² σ_{h,ρ}² + σ², σ², . . . , σ²). The covariance of ŝ can be written accordingly, and we can easily verify that Q_ŝ = V_h. Focusing on the CE scheme, when k_ce = N all the components of Q_x^T x are assigned non-zero rate and the rwf threshold is d_ce(N, R) = (∏_{i=1}^N λ_x,i)^{1/N} 2^{−2R/N}; notice also that d_ce(N, R) < σ² determines R_ce. In the EC scheme, the trace of Q_ŝ ∆_ec Q_ŝ^T equals ε_ec(R) = tr(∆_ec) = Σ_{i=1}^ρ D^ec_i. When k_ec = ρ, the corresponding rwf threshold is d_ec(ρ, R) = (∏_{i=1}^ρ λ_ŝ,i)^{1/ρ} 2^{−2R/ρ}, and the equality k_ec = ρ is satisfied when R > R_ec. Notice that when R > max(R_ce, R_ec), we have k_ce = N and k_ec = ρ. For N > ρ, and after some algebraic manipulations, we conclude that for ε_ce(R) > ε_ec(R) to hold we must have R > R̄, where R̄ is determined by a constant γ. From the arithmetic mean-geometric mean inequality, we further deduce that γ ≤ 1, which in turn implies that max(R_ce, R_ec) > R̄. Thus, ε_ce(R) > ε_ec(R) when N > ρ and R > max(R_ce, R_ec). When N = ρ, the comparison follows readily from (41) and (42); otherwise, ε_ce(R) > ε_ec(R) for all R.

G. Proof of Theorem 2
Using the orthogonality principle, we can write s = ŝ + s̃, where s̃ is independent of x; thus, E[‖s − ŝ_R‖²] = E[‖s̃‖²] + E[‖ŝ − ŝ_R‖²], which stems directly from the fact that ŝ and ŝ_R are independent of s̃, since they are functions of x. In order to arrive at (12), it suffices to show that I(x; ŝ_R) = I(ŝ; ŝ_R).
To this end, consider the SVD of Σ sx = U sx S sx V T sx , where V sx is an N × N unitary matrix. Further, recall that ρ = rank(Σ sx ), and we define the N × N matrix T :

H. Proof of Equation (15)
After expressing Σ_ψψ^{−1} in terms of Σ_{x_1x_1}, Σ_{x_1u_2}, and Σ_{u_2u_2}, we find (50). Let rank(H_1) = ρ_1, consider the SVD H_1 = U_{h_1} Σ_{h_1} V_{h_1}^T, and let Σ_{h_1,ρ_1} denote the upper-left ρ_1 × ρ_1 diagonal submatrix of Σ_{h_1}, which contains the ρ_1 positive singular values of H_1. Based on these definitions, we can re-express the matrices inside the parentheses in (50) as (51) and (52). Upon substituting (51) and (52) into (50), we obtain (53). To proceed, we assume that rank(Σ_ss − Σ_{su_2} Σ_{u_2u_2}^{−1} Σ_{u_2s}) = p. If this is not the case, we can use instead as side information the random vector ũ_2 = u_2 + ṽ, where ṽ is white noise with very small power. In so doing, we ensure that rank(L_1) = ρ_1 and range(L_1^T) = span(U_{h_1,ρ_1}). The next step is to show that I(x_1; u_1) = I(L_1 x_1; u_1). To this end, let L_1 Σ_{x_1x_1} L_1^T = Q_{L_1} Λ_{L_1} Q_{L_1}^T be the eigenvalue decomposition of L_1 Σ_{x_1x_1} L_1^T. As in Appendix G, we consider an N_1 × N_1 invertible matrix, based on which we obtain (54), where (54) follows because: (i) the vectors U_{h_1,N_1−ρ_1}^T n_1 and Q_{L_1,ρ_1}^T L_1 x_1 are independent; and (ii) u_1 can also be taken independent of U_{h_1,N_1−ρ_1}^T n_1 without affecting the resulting distortion. Evaluating (54) in terms of the mutual information I(L_1 x_1; u_1), we find I(L_1 x_1; u_1) = I(Q_{L_1}^T L_1 x_1; u_1) = I(Q_{L_1,ρ_1}^T L_1 x_1; u_1) + I(Q_{L_1,p−ρ_1}^T L_1 x_1; u_1 | Q_{L_1,ρ_1}^T L_1 x_1), where the last conditional term vanishes by (55).