Decentralized practical design and centralized benchmark for analog network coding

This article proposes both centralized and decentralized design schemes for an analog network coding multiple-input multiple-output system with a relay node between two end nodes. The proposed centralized scheme is called the generalized iterative approach (GIA). It jointly designs the precoders and decoders at the two end nodes, and the processor at the relay to maximize the sum mutual information. Numerical results for the per-node power constraint show the convergence behavior of the GIA and give a performance benchmark for the analog network coding scheme. The proposed decentralized scheme is a practical joint transceiver and signaling design scheme. The keys to its low signaling load are time-division duplex and a symmetric relay processing matrix. The proposed signaling protocol enables the needed information, including channel state information (CSI), to be available at each node. With the needed CSI, a novel symmetric processor design at the relay is developed to maximize an approximate sum mutual information formula (to reduce the signaling loading, the precoders at both end nodes and the noise propagated from the relay are not considered). Employing singular value decomposition transceivers at the two end nodes, it is remarkable that the proposed decentralized approach performs almost as well as the centralized GIA design. It is concluded that the proposed decentralized scheme is a feasible way to implement analog network coding systems.

There is no restriction on how many antennas each node needs to have in order for the system to do analog network coding. There are works on all single-antenna nodes (e.g., [2][3][4]). There are also works where the nodes can have multiple antennas (e.g., [5][6][7][8][9][10][11][12][13][14][15][16]). The multiple antennas can aid in multiplexing and/or diversity. Consequently, the rest of this article is about the case when each node has multiple antennas.
When each node has multiple antennas, one obvious and flexible approach is to use multiple-input multiple-output (MIMO) linear processing: a precoding matrix and a decoding matrix at each end node and a processing matrix at the relay. However, the design of these five matrices is not trivial. Some papers optimize according to some criterion (e.g., [8][9][10][11]) while others propose low-complexity heuristics (e.g., [12,13]). Since the optimization problems are highly nonlinear, the solutions have only been derived numerically (or approximately) and are inherently suboptimal. Moreover, they are not implementable unless they are supported by practical signaling.
Some have realized this need to consider signaling. For example, Panah and Heath Jr [14] and Roemer and Haardt [15] have proposed channel estimation techniques to supplement their designs. However, both works assume that no channel state information (CSI) is available at the relay. In [14], the relay basically just scales and forwards its received signal. In [15], the relay processing does not have to be just a scaling. But, it is independent of the current CSI. There may be some improvement if CSI-dependent relay processing can be enabled.
Therefore, this article proposes a joint, but decentralized, transceiver and signaling design with the following goal: to have high-performing MIMO linear processing, not just scaling, at each node using only a small amount of signaling. The proposed practical signaling protocol uses time-division duplex (TDD) and enables the relay to estimate its two outgoing channels (i.e., the channels from the relay to the two end nodes). It also enables the two end nodes to estimate the two effective channels between themselves (i.e., the link from one end node through the relay to the other end node and the same link in the opposite direction). Most importantly, it enables the two end nodes to cancel their self-interference.
To help keep the amount of signaling low, the proposal has the relay design its own symmetric relay processing matrix. The symmetry and the use of TDD together cause the end-to-end channel in one direction to be the transpose of the other direction's, i.e., reciprocity holds for the two effective channels between the two end nodes. (The conjugate-transpose instead of the transpose is used in a lot of literature for representing the channel reciprocity. However, the conjugate-transpose should be used only for time reversal transmissions. Discussions on reciprocity can be found in most electromagnetic textbooks (e.g., see [17,18])). An end node needs to know the effective channel in each direction in order to choose a channel-dependent precoder and decoder. In general, knowing one effective channel does not imply knowing the other thus necessitating signaling both of them. Due to the symmetry and TDD, only one effective channel needs to be signaled here.
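The reciprocity argument above can be checked numerically. The following is a minimal, dependency-free sketch, assuming the effective-channel definition C_ij = H_i' γ T H_j from Section 2; the antenna counts, the value of γ, the random seed, and all helper names are illustrative assumptions rather than anything specified in the article.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Plain transpose A' (no conjugation)."""
    return [list(col) for col in zip(*A)]

def random_complex(rows, cols, rng):
    """Matrix with i.i.d. complex Gaussian entries (stand-in for Rayleigh fading)."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(cols)]
            for _ in range(rows)]

def effective_channel(H_i, H_j, T, gamma):
    """C_ij = H_i' (gamma T) H_j: from E_j, through the relay, to E_i."""
    gamma_T = [[gamma * t for t in row] for row in T]
    return matmul(matmul(transpose(H_i), gamma_T), H_j)

def max_abs_diff(A, B):
    return max(abs(a - b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

rng = random.Random(0)
M_1, M_2, M_R = 2, 3, 4                  # illustrative antenna counts
H_1 = random_complex(M_R, M_1, rng)      # channel from E_1 to the relay
H_2 = random_complex(M_R, M_2, rng)      # channel from E_2 to the relay
A = random_complex(M_R, M_R, rng)        # generic, non-symmetric relay matrix
T_sym = [[(A[r][c] + A[c][r]) / 2 for c in range(M_R)] for r in range(M_R)]
gamma = 0.7                              # illustrative relay scaling

def reciprocity_gap(T):
    """Largest entry of C_12 - C_21' for the relay matrix T."""
    C_12 = effective_channel(H_1, H_2, T, gamma)
    C_21 = effective_channel(H_2, H_1, T, gamma)
    return max_abs_diff(C_12, transpose(C_21))

gap_symmetric = reciprocity_gap(T_sym)   # numerically zero when T' = T
gap_generic = reciprocity_gap(A)         # not zero for a generic T
print(gap_symmetric, gap_generic)
```

With a symmetric T the gap is at rounding level, so an end node that knows one effective channel also knows the other; with a generic T both would have to be signaled.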
Another design decision is to sequentially design the relay processor and the end node processors; the relay designs its processor and then the end nodes design their own. There are no iterations between the processors. The major benefit of this choice is avoiding signaling repeatedly between the nodes. The chosen way to enable this sequential design is to have the relay design its processor ignoring the precoders, decoders, and the noise propagated from the relay. Though there may be some degradation in the sum mutual information, this decoupling substantially reduces the complexity of the relay. In turn, the lower complexity admits a shorter hardware timeline, etc.-reducing the duration of an analog network coding transmission. Two relay designs which comply with the above design choices are a heuristic approach RRANOMAX in [12] and a novel iterative scheme herein developed which seeks to maximize an approximation to the sum mutual information.
To evaluate the performance of the proposed decentralized scheme, a generalized iterative approach (GIA) is developed for jointly designing the precoders, decoders, and relay processor to maximize the true sum mutual information. The GIA is a centralized scheme where all CSI, noise, and source statistics have to be known at a central processing unit and, in addition, the processed results have to be available at the end and relay nodes. Because of the large amount of signaling load, the GIA is not very practical but can be employed for generating a performance benchmark. Although mutual information maximization for MIMO relay systems has been studied extensively, those for analog network coding systems have only been recently introduced in [6,7]. Both [6,7] are different from the GIA in that they are both based on the assumption that source precoders are given beforehand. Like the GIA, the studies [9,10] are also for the design of the precoders, decoders, and relay processor. However, its criterion is minimum mean squared error (MMSE). To the best of the authors' knowledge, our GIA is the first work on joint precoders, decoders, and relay processors design to maximize the sum mutual information of MIMO analog network coding systems. It is known that the MMSE and mutual information are related for a MIMO channel with Gaussian noise (Equation 22 in [19]). However, that relation is not satisfied for analog network coding systems because of the following two reasons. First, the relay and precoders chosen for the maximum sum mutual information case are different from those for the MMSE case. Second, the noises at the two end nodes (which include the propagation of the noise received at the relay) for the maximum sum mutual information case are also not the same as those for the MMSE case.
The convergence behaviors of the proposed symmetric relay design and the proposed GIA are studied. Their sum mutual information performances are also studied. It is shown numerically that the decentralized approach performs almost as well as the centralized benchmark, the GIA. It is thus concluded that the proposed decentralized approach is a feasible way to implement analog network coding.
The outline of this article is as follows. Section 2 lays the formulation foundation. Section 3 presents the GIA, a novel centralized scheme for jointly designing the precoders, decoders, and relay processor. Section 4 explains the proposed signaling protocol for decentralized designs and gives a joint precoder-decoder design where no additional signaling is required. Following the decentralized signaling protocol, Section 5 proposes a novel design for a symmetric relay processing matrix according to the approximate sum mutual information metric. The RRANOMAX design [12] is also summarized in Section 5. Numerical results are shown in Section 6. Section 7 concludes the article.
The notation of this article is as follows: all boldface letters indicate vectors (lower case) or matrices (upper case). $A'$, $\bar{A}$, $A^*$, $A^{-1}$, $\mathrm{tr}(A)$, $|A|$, $\|A\|_F$, and $\langle A \rangle$ stand for the transpose, conjugate, conjugate transpose, inverse, trace, determinant, Frobenius norm, and expectation of $A$, respectively. $I_r$ denotes the $r \times r$ identity matrix. $0$ denotes the zero matrix with proper dimension. $A > 0$ denotes that $A$ is a positive definite matrix. $A \otimes B$ denotes the Kronecker product of $A$ and $B$. vec( ) and unvec( ) are the matrix vectorization operator and the inverse matrix vectorization operator, respectively. $\mathrm{DIAG}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ is a diagonal matrix where the elements $\{\sigma_1, \sigma_2, \ldots, \sigma_r\}$ are put on the main diagonal. $\{A_k\}$ denotes the set of matrices $A_1$, $A_2$, etc. $\max(a,b)$ and $\min(a,b)$ denote the maximum and the minimum of the real numbers $a$ and $b$, respectively. $(\chi)^+ = \max(\chi, 0)$.

Analog network coding formulation
This article considers a TDD system with relay node R, equipped with M R antennas, between two end nodes E 1 and E 2 , equipped with M 1 and M 2 antennas, respectively. Each node works in half-duplex mode, receiving and transmitting data in different time slots. These nodes perform analog network coding, completing a bidirectional communication between the two end nodes in just two time slots (see Figure 1); note that this does not include the time slots for signaling. The problem will be formulated in the frequency domain. It is assumed that the guard time is larger than the delay spread so that there is no inter-symbol interference. It is also assumed that the time the system needs to complete a bidirectional communication is much smaller than the coherence time so that the channels can be considered stationary within that duration.
Two time slots for data (not signaling) transmission are used in analog network coding. In the first time slot, the end nodes $E_1$ and $E_2$ broadcast signals $x_1 = F_1 s_1$ and $x_2 = F_2 s_2$, respectively, to the relay. The relay thus receives
$y = H_1 x_1 + H_2 x_2 + w$. (1)
The data vector $s_i$ is a $d_i \times 1$ vector and is precoded by the $M_i \times d_i$ precoder $F_i$; the scalar $d_i$ is the number of data streams from $E_i$. Each element of $s_i$ is assumed to be zero-mean and the entire vector satisfies $\langle s_i s_i^* \rangle = \sigma_{s_i}^2 I_{d_i}$. In (1), $H_i$ is the $M_R \times M_i$ Rayleigh fading channel from $E_i$ to the relay, $y$ is the $M_R \times 1$ relay received vector, and $w$ is the $M_R \times 1$ Gaussian noise vector at the relay. The elements of $w$ are zero mean and satisfy $\langle w w^* \rangle = \Phi_w > 0$. In this article, the transmit power of $E_i$ is constrained by the per-node power constraint (2), which is stated in terms of the source variance $\sigma_{s_i}^2$ and the power budget $p_i$.
In the second time slot, the relay transmits $\gamma T y$ back to each end node. Here, $T$ is its $M_R \times M_R$ processing matrix. The scalar $\gamma$ is chosen to ensure the relay power constraint (3), which ties the relay transmit power $\mathrm{tr}(Q)$ to the relay power budget $q$. From (3), we observe that the power constraint on the relay processor depends on the precoders $\{F_i\}$. Due to the channel reciprocity, the received signal vector of $E_i$ ($i = 1,2$), of dimension $M_i \times 1$, is
$z_i = H_i' \gamma T y + a_i = H_i' \gamma T H_i x_i + H_i' \gamma T H_j x_j + H_i' \gamma T w + a_i$ (4)
($j = 1$ if $i = 2$ and $j = 2$ if $i = 1$). In the above, $a_i$ is the $M_i \times 1$ Gaussian noise vector at $E_i$. It has elements with zero mean and satisfies $\langle a_i a_i^* \rangle = \Phi_{a_i} > 0$. The vectors $s_1$, $s_2$, $w$, $a_1$, and $a_2$ are all independent of each other. Also, the channel matrix from the relay to $E_i$ is $H_i'$. For convenience, let $C_{ij} = H_i' \gamma T H_j$ denote the effective channel from the end node $E_j$, through the relay, to the end node $E_i$, and let $n_i = H_i' \gamma T w + a_i$ represent the effective noise at the end node $E_i$, where $i,j = 1,2$. Equation (4) can thus be rewritten as
$z_i = C_{ii} x_i + C_{ij} x_j + n_i$, (5)
where the effective noise covariance matrix is
$\Phi_{n_i} = \langle n_i n_i^* \rangle = \gamma^2 H_i' T \Phi_w T^* (H_i')^* + \Phi_{a_i}$. (6)
Look at the received signal vector of $E_i$ ($i = 1,2$). It is clear that using only two time slots has caused $x_i$ and $x_j$ to be mixed together in $z_i$. That is, $E_i$ interferes with itself.
Suppose for now that $E_i$ knows $C_{ii}$, the effective channel from itself, through the relay, and back to itself (Section 4 gives a protocol that provides it). Since $E_i$ transmitted $x_i$, it knows $x_i$ as well and thus can subtract its self-interference completely from $z_i$. Consequently, it obtains
$\hat{z}_i = z_i - C_{ii} x_i = C_{ij} F_j s_j + n_i$. (7)
To decode $s_j$, $E_i$ now has the options available for regular point-to-point transmissions (e.g., applying a $d_j \times M_i$ decoder $G_i$).
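The two-slot signal model and the self-interference cancellation can be traced step by step in a short sketch. It assumes the relations stated in this section (the relay forwards γTy, the relay-to-E_i channel is H_i', and C_ij = H_i' γ T H_j); the dimensions, the seed, and the helper names are illustrative assumptions.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows (vectors are n x 1)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    return [[c * a for a in row] for row in A]

def random_complex(rows, cols, rng):
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(1)
M_1, M_2, M_R = 2, 2, 3                       # illustrative antenna counts
H_1 = random_complex(M_R, M_1, rng)           # E_1 -> relay
H_2 = random_complex(M_R, M_2, rng)           # E_2 -> relay
x_1 = random_complex(M_1, 1, rng)             # precoded signal F_1 s_1
x_2 = random_complex(M_2, 1, rng)             # precoded signal F_2 s_2
w = random_complex(M_R, 1, rng)               # relay noise
a_1 = random_complex(M_1, 1, rng)             # noise at E_1
B = random_complex(M_R, M_R, rng)
T = add(B, transpose(B))                      # any symmetric relay matrix
gamma = 0.5                                   # illustrative relay scaling

# Time slot 1: both end nodes transmit, the relay receives y.
y = add(add(matmul(H_1, x_1), matmul(H_2, x_2)), w)

# Time slot 2: the relay sends gamma*T*y; E_1 receives it through H_1'.
relay_to_E1 = scale(gamma, matmul(transpose(H_1), T))   # H_1' gamma T
z_1 = add(matmul(relay_to_E1, y), a_1)

# Effective channels and effective noise seen by E_1.
C_11 = matmul(relay_to_E1, H_1)
C_12 = matmul(relay_to_E1, H_2)
n_1 = add(matmul(relay_to_E1, w), a_1)

# E_1 knows C_11 and x_1, so it removes its own contribution.
z_hat_1 = sub(z_1, matmul(C_11, x_1))
expected = add(matmul(C_12, x_2), n_1)
residual = max(abs(p[0] - q[0]) for p, q in zip(z_hat_1, expected))
print(residual)
```

After the subtraction, E_1 is left with a point-to-point MIMO observation of x_2 through C_12 plus the effective noise n_1.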

Centralized design
Let G i denote the decoder at E i . Then, from [20], the mutual information pertaining to the transmission of s j from E j to E i is given by (8). Assuming that all CSI and all source and noise statistics are known at a centralized processing unit, we maximize I E1 + I E2 by jointly designing the decoders {G 1 , G 2 }, precoders {F 1 , F 2 }, and relay processor T subject to the per-node power constraints at the precoders (see (2)) and at the relay (see (3)).
In the decoder expression (9), Π is any invertible square matrix. Note that the commonly used MMSE decoder and singular value decomposition (SVD) decoder (if F j * C ij * is invertible) can also be expressed using (9). In this case, Corollary 1 of [20] shows that G i,OPT is the optimum decoder and the mutual information I Ei in (8) simplifies to (10), provided that F j has full column rank. Due to the way T is made in this design, γ is not used. The scalar γ is thus equal to 1. The remaining task now is to choose the precoders {F 1 , F 2 } and relay processor T to maximize the sum mutual information, I E1 + I E2 , subject to the constraints (2-3). That is, we solve the optimization problem (11). Obtaining an optimum solution to (11) is very difficult because the design variables are coupled with one another. Consequently, the GIA seeks a solution by decoupling the choices for the precoders and the relay processor and designing them iteratively. Before the GIA's actual iterative procedure is given in Section 3.3, Section 3.1 gives a way to choose the precoders given the relay processor T, and Section 3.2 gives a way to choose T given the precoders. The findings in Sections 3.1-3.2 are then used in Section 3.3.

Fix relay processor and get precoders
Given a relay processor T, the cost function in (11), and the constraint (2), this section details one way to choose precoders {F 1 , F 2 }. As shown in (7), the two-way transmission can be considered as two independent single-user MIMO systems when T is known. As there exists a closed-form solution for maximizing the mutual information of a single-user MIMO system subject to the per-node power constraint [20], the proposed way is to just apply that closed-form solution here: First, perform the eigenvalue decomposition.
As is standard, the block matrix [V j U j ] is unitary while Ξ j and Γ j are diagonal matrices with the eigenvalues. As the eigenvalues are arranged in descending order, Ξ j has the r j largest eigenvalues and Γ j has the remaining (M j − r j ) eigenvalues. The precoder F j is then given by the closed-form solution (14), where Φ j is a diagonal matrix whose lth diagonal entry φ j,ll is given in (15). In (15), τ is the number of positive φ j,ll (see Lemma 2 in [20]), and σ 2 sj and p j are defined in (2).
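Closed-form power loadings with a count τ of strictly positive entries are characteristic of water-filling. Since (15) is not reproduced here, the following is only a sketch of the textbook water-filling allocation, p_l = (μ − 1/g_l)^+ with the powers summing to a budget; whether (15) has exactly this form is an assumption, and the function name, the gains, and the budget are hypothetical.

```python
def water_filling(gains, total_power):
    """Textbook water-filling: p_l = (mu - 1/g_l)^+ with sum(p_l) = total_power.

    gains: positive effective channel gains g_l (for example, eigenvalues).
    Returns the per-stream powers and the number tau of active streams.
    """
    order = sorted(range(len(gains)), key=lambda l: gains[l], reverse=True)
    inverse = [1.0 / gains[l] for l in order]      # 1/g_l in ascending order
    # Largest tau for which the weakest active stream still gets positive power.
    for tau in range(len(gains), 0, -1):
        water_level = (total_power + sum(inverse[:tau])) / tau
        if water_level > inverse[tau - 1]:
            break
    powers = [0.0] * len(gains)
    for rank, l in enumerate(order):
        if rank < tau:
            powers[l] = water_level - inverse[rank]
    return powers, tau

powers, tau = water_filling([4.0, 1.0, 0.1], total_power=1.0)
print(powers, tau)   # -> [0.875, 0.125, 0.0] 2
```

In this toy case the weakest eigenmode receives no power, which mirrors the role of τ as the number of positive diagonal entries and explains why all-zero precoder columns can appear.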

Fix precoders and get relay processor
Given precoders {F 1 , F 2 }, the cost function in (11), and the constraint in (3), this section details one way to choose the relay processor T. The method of Lagrange multipliers can be used to set up the augmented cost function (16), where λ T is an unknown Lagrange multiplier. In (16), the matrix T is the design variable while λ T is solved for so that T satisfies its power constraint. Setting the gradient of the augmented cost function in (16) with respect to T equal to zero, we obtain (17) (see Appendix 1), whose supporting quantities are defined in (17a)-(17c). By using Kronecker products, we obtain the relay processor from (17) for a given λ T ; this is (18). Note that T X 1 T * = Q in (3) and Y 1 is the scalar λ T times an identity matrix. Thus, left multiplying (17) by T, taking the trace, and applying the power constraint in (3), we have (19). Note that (18) gives T as a function of T and λ T . Moreover, (19) gives λ T as a function of T. Thus, one way to get T is to iterate between (18) and (19) until convergence.
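Solving a linear matrix equation for T "by using Kronecker products" usually rests on the identity vec(AXB) = (B' ⊗ A) vec(X), which turns the matrix equation into an ordinary linear system in vec(T). Equation (18) is not reproduced here, so the sketch below only verifies that standard identity; the assumption that (18) is derived this way, and all names and sizes, are illustrative.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product of A and B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(X):
    """Stack the columns of X into one long vector (returned as a flat list)."""
    return [X[r][c] for c in range(len(X[0])) for r in range(len(X))]

def random_complex(rows, cols, rng):
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(4)
A = random_complex(2, 3, rng)
X = random_complex(3, 2, rng)
B = random_complex(2, 4, rng)

lhs = vec(matmul(matmul(A, X), B))             # vec(A X B)
K = kron(transpose(B), A)                      # B' kron A (plain transpose)
x = vec(X)
rhs = [sum(k * v for k, v in zip(row, x)) for row in K]
identity_gap = max(abs(l - r) for l, r in zip(lhs, rhs))
print(identity_gap)
```

Note that the identity uses the plain transpose, not the conjugate transpose, which matches the notation A' introduced in Section 1.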

The iterative procedure
The GIA is as follows:
Step a. Set k = 0 and set the stopping threshold δ > 0 and the maximum number of iterations k max . Randomly initialize the precoders and relay processor for the 0th iteration: F 1,0 , F 2,0 , and T 0 . Make sure they satisfy (2-3).
Steps b and c. [...] Also, use (6) to find Φ ni,k (replacing Φ ni and T by Φ ni,k and T k , respectively).
Step d. Use (17a-c) to find Z k , X 1,k , and Y 2,k (replacing [...]).
Step e. Use (19) to calculate λ T,k (replacing λ T , T, Z, and Y 2 by λ T,k , T k , Z k , and Y 2,k , respectively).
Steps f and g. [...]
Step h. If the precoder matrices converge (dF j,k = ||F j,k − F j,k-1 || F < δ, for j = 1,2), the relay processor converges (dT k = ||T k+1 − T k || F < δ), and the power constraint at the relay processor converges (dq k = |tr(Q k ) − q| < δ), then set F j = F j,k and T = T k and stop. Here, we do not have to check the power constraints at the end nodes since the closed-form solution (14) always guarantees that they are satisfied. Else if k = k max , set F j = F j,k and T = T k+1 and stop. Otherwise, set k = k + 1 and return to step b.
After the above iteration is finished, remove any all-zero columns from F 1 and F 2 . This does not change I E1 and I E2 in (10). However, it is needed so that (8) can be simplified to (10). Also, scale T to satisfy the relay power constraint if the iteration terminated before convergence. Let dp i,k denote the deviation of the transmit power of E i from its constraint at iteration k; there is no need to check dp 1,k and dp 2,k in step h due to the use of the closed-form solutions for the precoders.
Below is a summary of the GIA. First, we transform the optimization problem in (11) into a root-finding problem which attempts to solve for the relay processor T using the system of nonlinear equations defined in (18). Note that, in (18) and the supplementary equations (17), (17a), (17b), and (17c), the precoders F 1 and F 2 are no longer considered unknowns because they can be expressed in terms of T using (14). Second, we solve (18) iteratively using step g. Thus, the convergence of the GIA depends on the convergence of solving (18) using step g. As shown in [21], as long as the spectral radius (the largest magnitude of the eigenvalues) of the Jacobian matrix corresponding to (18) is less than 1 at the initial guess of the solution, an appropriate iterative approach will converge to one of the solutions. Thus, a proper selection of initial estimates will guarantee the convergence of the proposed GIA. Extensive numerical studies have been performed and the convergence properties of the GIA are shown in Section 6.1. When the GIA converges, it converges to a local extremum. It is difficult to prove whether the numerically derived solution is globally optimal because the problem is highly nonlinear.
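The role of the spectral radius can be illustrated on a toy problem. The sketch below is not the GIA: it iterates a hypothetical two-dimensional affine map whose Jacobian J is constant, checks that the spectral radius of J is below 1, and applies a stopping rule of the same kind as step h (step size below δ, or k max reached). The map, the threshold, and the function names are assumptions chosen for illustration.

```python
import math

def spectral_radius_2x2(J):
    """Largest eigenvalue magnitude of a real 2x2 matrix (closed form)."""
    trace = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = trace * trace - 4.0 * det
    if disc >= 0:
        root = math.sqrt(disc)
        return max(abs((trace + root) / 2), abs((trace - root) / 2))
    return math.sqrt(det)   # complex-conjugate pair: |lambda|^2 = det

def fixed_point(update, x0, delta, k_max):
    """Iterate x <- update(x) until the step is below delta or k_max is hit."""
    x = list(x0)
    for k in range(1, k_max + 1):
        x_next = update(x)
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_next, x)))
        x = x_next
        if step < delta:
            return x, k, True
    return x, k_max, False

J = [[0.5, 0.2], [0.1, 0.3]]   # Jacobian of the toy map
b = [1.0, 1.0]

def update(x):
    """Toy map x -> J x + b; its fixed point solves (I - J) x = b."""
    return [J[0][0] * x[0] + J[0][1] * x[1] + b[0],
            J[1][0] * x[0] + J[1][1] * x[1] + b[1]]

rho = spectral_radius_2x2(J)
solution, iterations, converged = fixed_point(update, [0.0, 0.0], delta=1e-9, k_max=200)
print(rho, solution, iterations, converged)
```

For a nonlinear map such as (18) the Jacobian changes from point to point, which is why the condition in [21] is stated at the initial guess and why the choice of initialization matters.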

Decentralized design: protocol
As shown in (7), E i (i = 1,2) needs to know C ii to remove the self interference. Actually, each of the nodes needs certain information such as C ij , i ≠ j, in order to design its processing matrix, decoder, etc. So, this section proposes a four-step protocol for the nodes to follow. To help explain why the protocol is constructed the way it is, perfect estimation for each channel sounding is assumed in this explanation.
The first step is for E 1 and E 2 to perform channel soundings so that the relay can estimate H 1 and H 2 . In the second step, the relay chooses a symmetric T-two possible ways are given in Section 5. Next, the relay picks γ so that (3) is satisfied. It can do this since it knows what precoders the end nodes will use (see step 4). In the third step, the relay performs two equivalent channel soundings, one precoded with γTH 1 and the other precoded with γTH 2 . From these two soundings, E i (i = 1,2) estimates the effective channel matrices C ii and C ij (j ≠ i, j = 1,2). The first term is what E i needs to subtract its self-interference; C ii is exactly what is left multiplying x i in (5). The second term is the effective channel from E j , through the relay, to itself. What is more, the transpose of the second term is the effective channel from itself to E j -this is why the relay made T symmetric.
In the fourth step, E i (i = 1,2) designs its precoder F i and decoder G i without any signaling with the other end node. How can they do this and still have F 1 match G 2 and F 2 match G 1 ? Simple: E 1 uses a reproducible algorithm based on its estimate of C 21 to design its precoder F 1 . E 2 follows the same algorithm with its estimate of C 21 so that it can know E 1 's precoder F 1 . E 2 can thus design its decoder G 2 to match E 1 's precoder F 1 . E 2 and E 1 proceed in an analogous fashion so that E 2 's precoder F 2 and E 1 's decoder G 1 match as well. Recall that the relay already picked γ in the second step since it knew what precoders the end nodes would choose. Thus, one additional requirement for this reproducible algorithm is that it must be unaffected by a positive scaling of C 21 . That is, the end node will result in the same precoder and decoder whether it is given C 21 or ρC 21 where ρ is any positive scalar. Though this last requirement may seem like a stringent restriction, the following example implementation of step 4 shows this is not necessarily the case. Moreover, the ability of the relay to calculate γ before the end nodes design their precoders is very important as explained in Appendix 2.
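To illustrate step 4, here is an example for $F_1$ and $G_2$. The effective channel $C_{21}$ is fixed since the relay has already picked $T$ and $\gamma$. As in Section 3.1, let $r_1 = \min(d_1, \mathrm{rank}(C_{21}))$ be the number of data streams $E_1$ will transmit. $E_1$ first takes an SVD of $C_{21}$,
$C_{21} = U \Sigma V^*$, (20a)
where the singular values $\sigma_1, \sigma_2, \ldots$ on the diagonal of $\Sigma$ are in descending order and $v_l$ denotes the $l$th column of $V$. Then, $E_1$ sets
$F_1 = \alpha [v_1 e^{j\theta_1}, \ldots, v_{r_1} e^{j\theta_{r_1}}]$, (20b)
where the phase $\theta_l$ ($l = 1, \ldots, r_1$) is chosen such that the first non-zero element of $v_l e^{j\theta_l}$ is positive, and $\alpha > 0$ is chosen to satisfy the power constraint. $E_2$ takes the same steps. It takes its own SVD of $C_{21}$,
$C_{21} = \tilde{U} \Sigma \tilde{V}^*$, (21a)
where $\tilde{u}_l$ and $\tilde{v}_l$ denote the $l$th columns of $\tilde{U}$ and $\tilde{V}$. For each $l = 1, \ldots, r_1$, it chooses $\phi_l$ such that the first non-zero element of $\tilde{v}_l e^{j\phi_l}$ is positive. Then, it chooses the scalar to satisfy the power constraint. Because singular vectors belonging to distinct singular values are unique up to a phase factor,
$v_l e^{j\theta_l} = \tilde{v}_l e^{j\phi_l}, \ \forall l$. (21b)
Thus, $E_2$ also gets $F_1$. Equation (21b) depends on $\sigma_1, \ldots, \sigma_{r_1}$ being distinct, something that can be assumed for physical systems. Now, $E_2$ proceeds to set its decoder $G_2$ as in (22). The end-to-end channel $G_2 C_{21} F_1$ is thus obtained without any signaling between the two end nodes. Moreover, $E_1$ and $E_2$ would have gotten the same precoder and decoder, respectively, if they had been using $\rho C_{21}$.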
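The phase rule is what makes the design reproducible at both end nodes: an SVD routine may return each singular vector with an arbitrary unit-modulus factor, and rotating the vector so that its first non-zero entry is real and positive removes that ambiguity. A minimal sketch follows; the function name, the tolerance, and the two example phases are hypothetical, and the SVD itself is replaced by a random vector to keep the example dependency-free.

```python
import cmath
import random

def normalize_phase(v, tol=1e-12):
    """Rotate v by a unit-modulus factor so its first non-zero entry is real and positive."""
    for entry in v:
        if abs(entry) > tol:
            rotation = entry.conjugate() / abs(entry)   # equals exp(-j * arg(entry))
            return [rotation * x for x in v]
    return list(v)   # all-zero vector: nothing to rotate

rng = random.Random(2)
v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(4)]

# The same singular vector as two different SVD routines might return it.
v_at_E1 = [cmath.exp(1j * 0.9) * x for x in v]
v_at_E2 = [cmath.exp(-1j * 2.3) * x for x in v]

col_E1 = normalize_phase(v_at_E1)
col_E2 = normalize_phase(v_at_E2)
mismatch = max(abs(a - b) for a, b in zip(col_E1, col_E2))
print(mismatch)
```

Both nodes end up with the same column, so E 2 can reconstruct F 1 exactly. A positive scaling ρ of the channel changes only the singular values, not the singular vectors, so the normalized columns are unaffected by it.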

Decentralized design: relay processors
The decentralized protocol presented in Section 4 is very general and can support many possible relay processor designs as long as their T's are symmetric. Two possible designs are presented here as examples.

Iterative symmetric design (ISA)
The sum mutual information, (10), depends on both the relay processor and the precoders. For this proposed decentralized design, the dependence on the precoders in (10) is removed to reduce the signaling load, as discussed in Section 1. Furthermore, the noise propagated from the relay to the end nodes is ignored for convenience. That is, this design considers the simplified version of the mutual information given in (24). Here, ρ 2 = p i /tr(Φ ai ) signifies the transmit SNR (the ratio between the total transmit signal power and the total receive thermal noise power) and c j is related to the mutual information of the effective channel C ij . The relay design problem is stated in (25). The second "=" in (25) is due to (26). Equation (26) itself follows from C ij = C ji ' which, in turn, is a result of T = T ' . The trace constraint is needed to keep the elements of T from growing without bound. To search for such a T, the augmented cost function is introduced with the real Lagrange multiplier λ. Using the technique of variation, we obtain (28) (see Appendix 3), which holds for every M R × M R symmetric matrix S; the quantities appearing in (28) are defined in (28a) and (28b). By Appendix 4, this in turn means that H 2 Γ a H 2 ' T M 1 + λT is skew-symmetric, which is stated in (29). Noting that T ' = T, right multiply (29) by T and apply the trace constraint in (25) to get λ. Since the trace of the product of a skew-symmetric matrix and a symmetric matrix is zero, this yields the expression for λ in (30). Finally, plugging (30) into (29), we have (31). Clearly, a T satisfying (31) is a feasible solution to (25) as it is symmetric and satisfies the trace constraint of (25). As (31) gives an implicit expression of T as a function of T itself, this naturally leads to the following iterative procedure to get T. Since (31) is a highly nonlinear equation of T, averaging (step c) is used.
Step a. Randomly initialize T 0 as an M R × M R symmetric matrix satisfying the trace constraint of (25). Set k = 0. Also, set the stopping threshold δ and the maximum number of iterations k max .
Step b. Use (28b) to calculate Γ a,k (replacing Γ a by Γ a,k and T by T k ).
Step c. [...]
Step d. [...] and stop. Else if k = k max , set T = T k+1 and stop. Otherwise, set k = k + 1 and go to step b.
As in the GIA approach, the ISA approach transforms the optimization problem in (24) into a root-finding problem which attempts to solve for the relay processor T using the system of nonlinear equations defined in (31). Thus, the convergence of the ISA depends on the convergence of solving (31) iteratively using step c. As shown in [21], a proper selection of initial estimates will guarantee the convergence of the proposed ISA scheme. Extensive numerical studies have been performed and the convergence properties of the ISA are shown in Section 6.1. When the ISA converges, it converges to a local extremum. Recall that (24) is an approximate mutual information formula that enables the decentralization. So, no optimality claim is made with regard to the actual sum mutual information.
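Why averaging can help a fixed-point iteration is easy to see on a scalar toy map. The exact form of step c is not reproduced above, so the sketch below assumes an equal-weight average of the previous iterate and the new update; the map g, the starting point, and the function names are hypothetical, while δ = 0.001 and k max = 50 are the ISA stop parameters quoted in Section 6.

```python
def iterate(g, x0, weight, delta, k_max):
    """Iterate x <- (1 - weight) * x + weight * g(x).

    weight = 1.0 is the plain fixed-point iteration; weight = 0.5 averages
    the previous iterate with the new update.
    """
    x = x0
    for k in range(1, k_max + 1):
        x_next = (1.0 - weight) * x + weight * g(x)
        step = abs(x_next - x)
        x = x_next
        if step < delta:
            return x, k, True
    return x, k_max, False

def g(x):
    """Toy map with fixed point x = 1 and slope -1.5 (plain iteration diverges)."""
    return 2.5 - 1.5 * x

plain = iterate(g, 0.0, weight=1.0, delta=0.001, k_max=50)
averaged = iterate(g, 0.0, weight=0.5, delta=0.001, k_max=50)
print(plain[2], averaged[0], averaged[2])
```

Averaging turns the slope −1.5 into −0.25, so the same fixed point is reached by a contraction. The same damping idea is commonly used for matrix-valued iterations, although the toy example makes no claim about the behavior of (31) itself.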

RRANOMAX design
The second design is the relay processor design used in the RRANOMAX approach in [12]. It can be used when the relay noise covariance matrix is a scalar times an identity matrix, i.e., $\Phi_w = \sigma^2 I > 0$. For completeness, we restate it below. Perform the SVD on $K = 0.5[H_2 \otimes H_1, \ldots]$. Define $u_{K,1}$ as the first column of $U_K$. Then, the initial design of the relay processor is the $M_R \times M_R$ symmetric matrix $\Omega$, which is obtained from $u_{K,1}$. The matrix $\Omega$ could be used as $T$ but it is not, because of the distribution of its singular values. So, instead, perform the SVD on $\Omega$, as in (34). The singular values in (34) are arranged in descending order. $T$ is obtained from $\Omega$ by replacing the singular values $\sigma_{\Omega,1}, \ldots, \sigma_{\Omega,M_R}$ by $\hat{\sigma}_{T,1}, \ldots, \hat{\sigma}_{T,M_R}$. These new singular values depend on $L = \min\{M_R, \min\{M_1, M_2\} + 1\}$ and on a scalar $\mu$, which is chosen so that $\hat{\sigma}_{T,1}^2 + \hat{\sigma}_{T,2}^2 + \cdots + \hat{\sigma}_{T,M_R}^2 = 1$. In (35a), $\sigma_{i,k}$ is the $k$th singular value of $H_i$ (arranged in descending order).

Numerical results
Without loss of generality, assume that the noise covariance matrices Φ ai and Φ w are identity matrices. Also assume that the source covariance matrices are also identity matrices (i.e., σ 2 s1 ¼ σ 2 s2 ¼ 1 ). The numbers of antennas at the two end nodes are the same (i.e., M 1 = M 2 ) and are equal to 4. The number of antennas at the relay node M R is either equal to M 1 or 2 M 1 . Consider uncorrelated Rayleigh fading channels where all channel matrices are normalized such that the Frobenius norm of H j , j = 1,2, is one. With the per-node power at each end node as p 1 = p 2 = P and the relay power as q = 2P, we define "SNR" as 10 log 10 (P/M 1 ). The reason is that it is the dB value of the ratio between the total transmit power (tr(F i F i * )) and the total thermal noise power (tr(Φ ai )) at an end node. Here, the channels and the noise propagated from the relay are not included in the "SNR" definition. The stop parameters for the ISA, δ, and k max are chosen to be 0.001 and 50, respectively. The stop parameters for the GIA, δ, and k max are chosen to be 0.001 and 2000, respectively. Figure 2 shows the average number of iterations the GIA needed per "SNR" for 100 channel realizations. One set of points is for the M 1 = M 2 = M R = 4 configuration while the other is for the M 1 = M 2 = 4 and M R = 8 one. There have been other iterative transceiver designs very similar to the GIA (e.g., [22]). For those designs, it was common to see the number of iterations increasing with the SNR. It is interesting that the phenomenon is not seen here for both antenna configurations. Figure 3 shows an example convergence plot for the GIA for the M 1 = M 2 = M R = 4 configuration and "SNR" = 0 dB. Observe that dp 1, k and dp 2,k are very small as expected due to the use of the closed-form solutions for the precoders.
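The simulation conventions above translate into a few lines of setup code. The sketch assumes identity noise covariance, so that tr(Φ ai ) = M 1 , and uses the stated definitions "SNR" = 10 log 10 (P/M 1 ), q = 2P, and unit Frobenius norm for each channel matrix; the function names and the seed are hypothetical.

```python
import math
import random

def frobenius_norm(H):
    return math.sqrt(sum(abs(h) ** 2 for row in H for h in row))

def normalized_rayleigh_channel(rows, cols, rng):
    """Uncorrelated complex Gaussian channel, scaled to unit Frobenius norm."""
    H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(cols)]
         for _ in range(rows)]
    norm = frobenius_norm(H)
    return [[h / norm for h in row] for row in H]

def snr_db(P, M_1):
    """The article's "SNR" in dB: total transmit power over total thermal noise power."""
    return 10.0 * math.log10(P / M_1)

def power_for_snr_db(snr, M_1):
    """Inverse of snr_db: the end-node power budget P for a target "SNR"."""
    return M_1 * 10.0 ** (snr / 10.0)

M_1, M_R = 4, 8
P = power_for_snr_db(20.0, M_1)    # p_1 = p_2 = P
q = 2.0 * P                        # relay power budget
H_1 = normalized_rayleigh_channel(M_R, M_1, random.Random(3))
print(P, q, snr_db(P, M_1), frobenius_norm(H_1))
```

Because the channels are normalized and the relay-propagated noise is excluded, this "SNR" describes the transmit side only and should not be read as a received SNR.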

Sum mutual information performances
Figure 6 Example convergence plot of the ISA for M 1 = M 2 = 4 and M R = 8. To plot the curves for both "SNR" values ("SNR" = 0 dB and "SNR" = 20 dB) in the same plot, c 1 (T) is normalized for each "SNR" so that its value at the last iteration is unity.
Recall that the decentralized signaling protocol proposed in Section 4 was flexible in terms of what relay processor T the relay used. So here, we demonstrate the sum mutual information performance of the proposed protocol with three relay designs. The first one is to choose T = I and is denoted as "Identity". This is the simplest design. The second one is RRANOMAX where T is given in (36). The third one is using the ISA (see Section 5.1). The protocol also allows for flexibility in the precoder and decoder design. For all three implementations of the protocol (one for each of the three relay processor designs), the SVD methodology described in (20a) to (22) is used for the precoder and decoder design. For reference, we also use (10) to evaluate the mutual information obtained using the proposed centralized GIA design. The GIA is described in Section 3.3. Figures 7 and 8 show the sum mutual information, I E1 + I E2 , for the three practical (decentralized) designs and the centralized GIA design. The antenna numbers depend on the figure: M 1 = M 2 = 4 for both figures, with M R = 4 in Figure 7 and M R = 8 in Figure 8. Each of the curves is obtained by averaging the results of 100 channel realizations. The number of data streams of each practical design at each "SNR" for each channel realization is chosen so that it provides the maximum sum mutual information. The number of data streams the GIA gives to E i (i = 1,2) is only determined in the last iteration; it is simply the number of non-zero columns of F i . From Figures 7 and 8, it is obvious that the "Identity" design of T has the worst performance and is greatly outperformed by the other three designs.
The centralized "GIA" design has the best performance. The implementations of the proposed protocol with the "ISA" and "RRANOMAX" relay processor designs come in second and third place, respectively. By comparing Figure 7 with Figure 8, it is easy to see that the performance gaps increase as the number of relay antennas increases. In addition, the sum mutual information also becomes larger as the number of relay antennas increases. Both of these observed phenomena are probably due to the increased freedom at the relay-the size of T increases from 4 × 4 to 8 × 8. Note that we normalize the Frobenius norms of all channel matrices to one.
Note that some kind of power loading at the precoders may increase the sum mutual information for the decentralized designs. However, performing the water filling procedure for power loading at the end nodes does not necessarily increase the mutual information. This is because the γ in (3) depends on the precoders.
When the power loading at the precoders changes the precoders {F j } according to the water filling procedure, γ has to change too (because γ depends on {F j }). But then, the optimized power loading done previously for a different γ is no longer optimum in the sense of maximizing the sum mutual information for the new γ. Moreover, water filling requires the noise covariance matrices Φ ni in (6) to be available at end nodes. But, Φ ni is not available at the end node using the proposed decentralized protocol in Section 4. Thus, the water filling procedure is not applied here.

Conclusion
This article presents both practical and theoretical advances for MIMO analog network coding, a technique which requires only two time slots (excluding the time slots required for signaling) to complete a bi-directional communication between two end nodes. For the practical advance, the article proposes a decentralized joint transceiver and signaling design scheme which requires the system to work in a TDD mode and to have four channel soundings. With the proposed signaling scheme, each node gets the information needed for designing its own transmit and/or receive processors. The designs at all nodes are harmonized and coordinated such that no additional signaling overhead is needed. In particular, this article presents a novel iterative approach for designing a symmetric relay processor to maximize an approximate sum mutual information of the effective channels between the two end nodes. It is seen to converge quickly for all SNRs, which is highly desirable for a practical design.
For the theoretical advance, a novel iterative approach, named GIA, is proposed for jointly designing the precoders and decoders at the two end nodes and the processor at the relay node. The goal of the design is to maximize the sum mutual information of the system subject to a per-node power constraint at each node. This is a centralized design and may not be practical, but it can be used to generate benchmark results. The GIA alternately finds the precoders at the end nodes and the processor at the relay until all relevant parameters converge. In this article, the centralized GIA is employed to provide a performance benchmark for evaluating various implementations of the proposed decentralized design. It is remarkable that the performance of the proposed decentralized design is almost as good as the benchmark set by the centralized GIA. It is concluded that the proposed decentralized scheme performs well and can be a feasible way to implement analog network coding. Insights gained here may also be useful for enhancing other existing designs in the analog network coding literature.
The proposed centralized GIA approach is very general and can be extended to deal with multiple relay nodes where the relay processors are determined sequentially. Moreover, it can be generalized to deal with arbitrary linear power constraints, including the practical per-antenna power constraint, where closed-form expressions for the precoders may not be available. For the proposed decentralized ISA design, an extension to multiple relays is much more difficult. As in the single-relay case, let each relay design its own relay processor. At this stage, each relay should calculate its power scaling parameter. However, it cannot: because it does not know the other channels and relay processors, it cannot determine the SVD precoders and decoders of the end nodes. As the discussion in Appendix 2 still holds when there are multiple relays, having the end nodes calculate the power scaling parameters is not attractive. Having the relays communicate among themselves to determine their power scaling parameters would also involve a great deal of signaling. How to perform a practical decentralized design for the multiple-relay case is thus a nontrivial problem for future research.
One challenging issue for simultaneous multi-relay transmission in practical applications of the analog network coding scheme is the inevitable difference in propagation delays from the multiple relays to the end nodes. If the differences are large, the delay spread of the effective channel from the multiple relays to an end node will also be large, and a large guard time will then be required. Moreover, if the relays are not well synchronized, the effective delay spread will vary from one transmission to the next. This will make it very challenging for each end node to cancel its own signal and to detect the desired signal. Thus, the simultaneous multi-relay transmission mechanism is not practical for the analog network coding scheme if the relays cannot all be well synchronized. In that case, a sequential multi-relay transmission mechanism can be implemented, and the multi-relay system reduces to multiple single-relay systems.
Endnotes
a It was also presented as a part of Enoch Lu's Ph.D. thesis [16].
b It was also presented in IEEE Globecom 2011 [11] and in Enoch Lu's Ph.D. thesis [16].

Appendix 1
Removing the terms containing $\Delta_T^*$ by summing (39) and (40), and after some manipulations (mainly cycling the matrices inside the trace operators so that $\Delta_T$ is on the left side of each term), we obtain (41). As $\Delta_T$ is arbitrary, $\Delta_T$ can be the conjugate of the term inside the square brackets in (41). We thus conclude that the term inside the square brackets must be zero, because $\mathrm{tr}(AA^*) = 0$ implies A = 0. Setting that term to zero gives the expression which leads to (17) and (17a)-(17c).

Appendix 2
Let us assume that the relay is unable to calculate γ before the end nodes design their precoders. Certainly, this means that the relay cannot know which precoders the end nodes will choose without some signaling from the end nodes; if it knew the precoders, it could simply solve (3) and get γ. Consequently, γ or the information needed to calculate γ must be signaled to the relay.

Consider the first strategy of having γ signaled to the relay. Necessarily, an end node then has to calculate γ. The current protocol, however, does not give either end node enough information. Take $E_1$ for example. Assume $E_1$ knows all the information the protocol in Section 4 gives it: $H_1' T H_2$, $H_1' T H_1$, $F_1$, and $F_2$ (note that $\gamma H_1' T H_2$ and $\gamma H_1' T H_1$ are changed to $H_1' T H_2$ and $H_1' T H_1$ because the relay does not know γ). Even with all of this information, $E_1$ cannot compute even one of the three terms of Q in (3): $T \Phi_w T^*$, $\sigma_{s1}^2 T H_1 F_1 F_1^* H_1^* T^*$, and $\sigma_{s2}^2 T H_2 F_2 F_2^* H_2^* T^*$. Regarding the first term of Q, it does not even know T and $\Phi_w$. Regarding the other two terms of Q, it does not know $T H_1$ and $T H_2$. Clearly, considerable signaling is needed for an end node to be able to calculate γ.

Now, consider the alternative strategy of having the end nodes signal whatever the relay needs so that it can calculate γ. For example, let each end node perform an equivalent channel sounding with its precoder. The relay thus knows $H_1 F_1$ and $H_2 F_2$. Combined with the information the relay already has, it can now calculate γ. Whichever strategy is used, not enabling the relay to calculate γ before the end nodes design their precoders necessitates adding signaling to the protocol. Or, to put it positively, choosing the precoder design in step 4 so that the relay can calculate γ in step 2 helps reduce the amount of signaling.
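The second strategy can be illustrated with a short sketch of what the relay would compute once it knows the equivalent channels $H_1 F_1$ and $H_2 F_2$. The three terms of Q are taken from the discussion above. Everything else is an assumption of this sketch and is not taken from (3): in particular, γ is assumed to be chosen so that $\gamma^2 \mathrm{tr}(Q)$ equals a relay power budget, and the function and parameter names are hypothetical.

```python
import numpy as np

def relay_scaling(T, Phi_w, H1F1, H2F2, sigma_s1_sq, sigma_s2_sq, relay_power):
    """Build Q from its three terms and return (gamma, Q).

    ASSUMPTION: gamma is set so that gamma**2 * tr(Q) == relay_power.
    The exact form of (3) is not reproduced here.
    """
    Th = T.conj().T
    Q = (T @ Phi_w @ Th
         + sigma_s1_sq * T @ H1F1 @ H1F1.conj().T @ Th
         + sigma_s2_sq * T @ H2F2 @ H2F2.conj().T @ Th)
    gamma = np.sqrt(relay_power / np.trace(Q).real)
    return gamma, Q

# Illustrative dimensions and values only.
rng = np.random.default_rng(0)
MR, M, n_streams = 8, 4, 2

def crandn(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

X = crandn(MR, MR)
T = X + X.T                                  # a symmetric relay processor
Phi_w = 0.1 * np.eye(MR)                     # relay noise covariance (assumed white)
H1F1 = crandn(MR, M) @ crandn(M, n_streams)  # learned from equivalent channel sounding
H2F2 = crandn(MR, M) @ crandn(M, n_streams)
P_R = 1.0
gamma, Q = relay_scaling(T, Phi_w, H1F1, H2F2, 1.0, 1.0, P_R)
```

Note that the relay needs only the products $H_i F_i$ together with quantities it already holds (T and $\Phi_w$), which is why one equivalent channel sounding per end node suffices in this strategy.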

Appendix 3
As in Appendix 1, the technique of variation is employed here. Note, though, that the symmetric matrix constraint here will lead us to a slightly different development. First, replace the T in (27) by T(ε) = T + εΔ. Here, ε is a real scalar and Δ is an arbitrary symmetric matrix. Second, evaluate the derivative of ς(T + εΔ) with respect to ε at ε = 0 and set it equal to 0. Through laborious but straightforward manipulations, we obtain (43), where $M_1$ and $\Gamma_a$ are defined in (28a) and (28b), respectively. For any symmetric Δ, Δ times the imaginary unit is also symmetric. We can thus replace Δ in (43) by Δ times the imaginary unit, which yields (44). Summing (43) and (44) removes the terms containing Δ. For convenience, replacing the symmetric matrix $\Delta^*$ in the result by S leads to (28).

Appendix 4
For notational convenience, let ℬ denote the set of all N × N symmetric complex matrices. This appendix will prove the following lemma.
Lemma: Let L be an N × N matrix. Then tr{LS} = 0 for all S ∈ ℬ if and only if L is skew-symmetric.
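Proof of forward direction: One can always write $L = L_A + L_B$, where $L_A = L/2 + L'/2$ is symmetric and $L_B = L/2 - L'/2$ is skew-symmetric. Since $L_A$ is symmetric, $L_A^*$ is also symmetric, i.e., $L_A^* \in ℬ$. Choosing $S = L_A^*$, the hypothesis implies that $0 = \mathrm{tr}\{L L_A^*\} = \mathrm{tr}\{L_A L_A^*\} + \mathrm{tr}\{L_B L_A^*\}$. The second term, $\mathrm{tr}\{L_B L_A^*\}$, is 0 because $L_B$ is skew-symmetric and $L_A^*$ is symmetric (see proof of reverse direction). So, the first term must be zero, i.e., $0 = \mathrm{tr}\{L_A L_A^*\}$. Since $\mathrm{tr}\{L_A L_A^*\}$ is the sum of the squared magnitudes of the entries of $L_A$, this last equality means $L_A = 0$, making $L = L_B$ skew-symmetric.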
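The ingredients of this argument are easy to check numerically. The sketch below is an illustration only: the variable names and the random test matrices are assumptions. It splits a random complex matrix into its symmetric and skew-symmetric parts, confirms that the trace of a skew-symmetric matrix times a symmetric one vanishes, and confirms that $\mathrm{tr}\{L_A L_A^*\}$ equals the squared Frobenius norm of $L_A$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

def crandn(n):
    return rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

L = crandn(N)
L_A = (L + L.T) / 2          # symmetric part (plain transpose, no conjugation)
L_B = (L - L.T) / 2          # skew-symmetric part

X = crandn(N)
S = X + X.T                  # an arbitrary complex symmetric matrix

S_star = L_A.conj().T        # the choice S = L_A^* used in the proof
is_symmetric = np.allclose(S_star, S_star.T)

tr_skew_sym = np.trace(L_B @ S)                 # vanishes up to rounding
tr_split = np.trace(L @ S_star)                 # equals tr{L_A L_A^*}
tr_gram = np.trace(L_A @ S_star)                # equals ||L_A||_F^2
frob_sq = np.linalg.norm(L_A, "fro") ** 2
```

Here `tr_skew_sym` is zero to machine precision, while `tr_split`, `tr_gram`, and `frob_sq` agree, which mirrors the step from the hypothesis to $L_A = 0$.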