Iterative Decoding of Concatenated Codes: A Tutorial

The turbo decoding algorithm of a decade ago constituted a milestone in error-correction coding for digital communications, and has inspired extensions to generalized receiver topologies, including turbo equalization, turbo synchronization, and turbo CDMA, among others. Despite an accrued understanding of iterative decoding over the years, the “turbo principle” remains elusive to master analytically, thereby inciting interest from researchers outside the communications domain. In this spirit, we develop a tutorial presentation of iterative decoding for parallel and serial concatenated codes, in terms hopefully accessible to a broader audience. We motivate iterative decoding as a computationally tractable attempt to approach maximum-likelihood decoding, and characterize fixed points in terms of a “consensus” property between constituent decoders. We review how the decoding algorithm for both parallel and serial concatenated codes coincides with an alternating projection algorithm, which allows one to identify conditions under which the algorithm indeed converges to a maximum-likelihood solution, in terms of particular likelihood functions factoring into the product of their marginals. The presentation emphasizes a common framework applicable to both parallel and serial concatenated codes.


INTRODUCTION
The advent of the turbo decoding algorithm for parallel concatenated codes a decade ago [1] ranks among the most significant breakthroughs in modern communications in the past half century: a coding and decoding procedure of reasonable computational complexity was finally at hand offering performance approaching the previously elusive Shannon limit, which promises reliable communications whenever the channel capacity exceeds, however slightly, the source entropy rate. The practical success of the iterative turbo decoding algorithm has inspired its adaptation to other code classes, notably serially concatenated codes [2,3], and has rekindled interest [4,5] in low-density parity-check codes [6], which give the definitive historical precedent in iterative decoding.
The serial concatenated configuration holds particular interest for communication systems, since the "inner encoder" of such a configuration can be given more general interpretations, such as a "parasitic" encoder induced by a convolutional channel or by the spreading codes used in CDMA. The corresponding iterative decoding algorithm can then be extended into new arenas, giving rise to turbo equalization [7,8,9] or turbo CDMA [10,11], among doubtless other possibilities. Such applications demonstrate the power of iterative techniques which aim to jointly optimize receiver components, compared to the traditional approach of adapting such components independently of one another.
The turbo decoding algorithm for error-correction codes is known not to converge, in general, to a maximum-likelihood solution, although in practice it is usually observed to give comparable performance [12,13,14]. The quest to understand its convergence behavior has spawned numerous inroads, including extrinsic information transfer (or EXIT) charts [15], density evolution of intermediate quantities [16,17], phase trajectory techniques [18], Gaussian approximations which simplify the analysis [19], and cross-entropy minimization [20], to name a few. Some of these analysis techniques have been applied with success to other configurations, such as turbo equalization [21,22]. Connections to the belief propagation algorithm [23] have also been identified [24], an approach which in turn is closely linked to earlier work on graph-theoretic methods [25,26,27,28]. In this context, the turbo decoding algorithm gives rise to a directed graph having cycles; the belief propagation algorithm is known to converge provided no cycles appear in the directed graph, although less can be said in general once cycles appear.
Interest in turbo decoding and related topics now extends beyond the communications community, and has been met with useful insights from other fields; some references in this direction include [29], which draws on nonlinear system analysis, and [30], which draws on computer science, in addition to [31] (predating turbo codes) and [32] (more recent), which inject ideas from statistical physics, ideas which in turn can be rephrased in terms of information geometry [33,34]. Despite this impressive pedigree of analysis techniques, the "turbo principle" remains difficult to master analytically and, given its fair share of specialized terminology if not a certain degree of mystique, is often perceived as difficult to grasp by the nonspecialist. In this spirit, the aim of this paper is to provide a reasonably self-contained and tutorial development of iterative decoding for parallel and serial concatenated codes, in terms hopefully accessible to a broader audience. The paper does not aim at a comprehensive survey of available analysis techniques and implementation tricks surrounding iterative decoding (for which the texts [12,13,14] would be more appropriate), but rather chooses a particular vantage point which steers clear of unnecessary sophistication and avoids approximations.
We begin in Section 2 by reviewing optimum (maximum a posteriori and maximum-likelihood) decoding of parallel concatenated codes. We motivate the turbo decoding algorithm as a computationally tractable attempt to approach maximum-likelihood decoding. A characterization of fixed points is obtained in terms of a "consensus" property between the two constituent decoders, and a simple proof of the existence of fixed points is obtained as an application of the Brouwer fixed point theorem.
Section 3 then reexamines the calculation of marginal distributions in terms of a projection operator, leading to a compact formulation of the turbo decoding algorithm as an alternating projection algorithm. The material of the section aims at a concrete transcription of ideas originally developed by Richardson [29]; we include in addition a minimum-distance property of the projector in terms of the Kullback-Leibler divergence, and review how the turbo decoding algorithm indeed converges to a maximum-likelihood solution whenever specific likelihood functions factor into the product of their marginals. The factorization is known [18] to hold in extreme signal-to-noise ratios.
Section 4 shows that the iterative decoding algorithm for serial concatenated codes also admits an alternating projection interpretation, allowing us to transcribe all results for parallel concatenated codes to their serial concatenated counterparts. This should also facilitate unified studies of both code classes. Concluding remarks are summarized in Section 5.

TURBO DECODING OF PARALLEL CONCATENATED CODES
We begin by reviewing the classical turbo decoding algorithm for parallel concatenated codes. For simplicity, we restrict our development to the binary signaling case; the m-ary case can be handled by direct extension (see, e.g., [24] for a particularly clear treatment) or by mapping the m-ary constellation back to its binary origins.
To begin, a binary (0 or 1) information block ξ = (ξ_1, ..., ξ_k) is passed through two constituent encoders, as in Figure 1, to create two codewords, each comprising the k information bits together with parity-check bits. Both encoders are systematic and of rate k/n, so that the information bits ξ_1, ..., ξ_k are directly available in either codeword. Note also that the two encoders need not share a common rate, although we will adhere to this case for ease of notation.
In practice, an expedient method of realizing the second systematic encoder is to permute (or interleave) the information bits ξ_i and duplicate the first encoder, as in Figure 2. Since this is a particular instance of Figure 1, we will simply consider two separate encodings of ξ = (ξ_1, ..., ξ_k) in what follows and avoid explicit reference to the interleaving operation, despite its importance in the study of the distance properties of concatenated codes [35].
The encoder outputs are converted to antipodal signaling (±1) and transmitted over a channel containing additive noise, giving the received signals

x_i = c_{x,i} + b_{x,i},   y_i = c_{y,i} + b_{y,i},   z_i = c_{z,i} + b_{z,i},

in which x_i carries the systematic bits, y_i and z_i carry the parity-check bits of the two encoders, and c_{x,i}, c_{y,i}, and c_{z,i} denote the transmitted antipodal symbols. We assume that the noise samples b_{x,i}, b_{y,i}, and b_{z,i} are Gaussian and mutually independent, sharing a common variance σ². For notational convenience, we arrange the received signals into the vectors x = (x_1, ..., x_k), y = (y_1, ..., y_{n−k}), and z = (z_1, ..., z_{n−k}).

Optimum decoding
The maximum a posteriori decoding rule aims to calculate the a posteriori probability ratios

Pr(ξ_i = 1 | x, y, z) / Pr(ξ_i = 0 | x, y, z),   i = 1, 2, ..., k,

with the decision rule favoring a 1 for the ith bit if this ratio is greater than one, and a 0 if the ratio is less than one. By using Bayes's rule, each ratio can be developed as

Pr(ξ_i = 1 | x, y, z) / Pr(ξ_i = 0 | x, y, z) = [Σ_{ξ: ξ_i=1} p(x, y, z|ξ) Pr(ξ)] / [Σ_{ξ: ξ_i=0} p(x, y, z|ξ) Pr(ξ)],

involving the a priori probability mass function Pr(ξ) and the likelihood function p(x, y, z|ξ), which is evaluated for the received x, y, and z as a function of the candidate information bits ξ = (ξ_1, ..., ξ_k); the sum in the numerator (resp., denominator) is over all configurations of the vector ξ for which the ith bit is a "1" (resp., "0"). Since the noise samples are assumed independent, the likelihood function naturally factors as

p(x, y, z|ξ) = p(x|ξ) p(y|ξ) p(z|ξ).    (6)

For the Gaussian noise case considered here, the three likelihood evaluations appear as

p(x|ξ) ∝ exp(−‖x − c_x(ξ)‖² / (2σ²)),

and similarly for p(y|ξ) and p(z|ξ), where c_x(ξ), c_y(ξ), and c_z(ξ) contain the antipodal symbols ±1 which would be received as a function of the candidate information bits ξ, in the absence of noise. For non-Gaussian noise, the likelihood functions would, of course, assume different forms.
The a posteriori probability ratios may therefore be written as

Pr(ξ_i = 1 | x, y, z) / Pr(ξ_i = 0 | x, y, z) = [Σ_{ξ: ξ_i=1} p(x|ξ) p(y|ξ) p(z|ξ) Pr(ξ)] / [Σ_{ξ: ξ_i=0} p(x|ξ) p(y|ξ) p(z|ξ) Pr(ξ)].

If the a priori probability mass function Pr(ξ) is uniform (i.e., Pr(ξ) = 1/2^k for all ξ), then this reduces to the maximum-likelihood decision metric

Pr(ξ_i = 1 | x, y, z) / Pr(ξ_i = 0 | x, y, z) = [Σ_{ξ: ξ_i=1} p(x|ξ) p(y|ξ) p(z|ξ)] / [Σ_{ξ: ξ_i=0} p(x|ξ) p(y|ξ) p(z|ξ)]  if Pr(ξ) is uniform.    (9)

If this expression were evaluated as written, the complexity of an optimum decision rule would be O(2^k), since there are 2^k configurations of the k information bits comprising ξ, leading to as many likelihood function evaluations. This clearly becomes impractical for sizable k.
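The brute-force rule above can be made concrete in a few lines. The following sketch (using a made-up stand-in for the likelihood function, since no specific channel realization is fixed here) enumerates all 2^k information blocks and accumulates the numerator and denominator sums of the bitwise a posteriori ratios, illustrating the O(2^k) cost:

```python
import itertools
import numpy as np

def map_bit_ratios(likelihood, k):
    """Brute-force a posteriori ratios Pr(xi_i = 1 | obs) / Pr(xi_i = 0 | obs)
    under a uniform prior, by enumerating all 2**k information blocks."""
    num = np.zeros(k)
    den = np.zeros(k)
    for xi in itertools.product([0, 1], repeat=k):
        w = likelihood(xi)  # stands in for p(x, y, z | xi); any positive weight
        for i, bit in enumerate(xi):
            if bit:
                num[i] += w
            else:
                den[i] += w
    return num / den

# toy stand-in likelihood: rewards agreement with a fixed block (1, 0, 1)
target = (1, 0, 1)
ratios = map_bit_ratios(lambda xi: 2.0 ** sum(a == b for a, b in zip(xi, target)), 3)
# ratios exceed 1 exactly where the target bit is 1
```

The loop visits 2^k configurations, which is precisely the exponential cost that motivates the iterative alternative developed next.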
Observe now that if we instead consider an optimum decoding rule using only one of the constituent encoders, we may write, by a development parallel to that above,

Pr(ξ_i = 1 | x, y) / Pr(ξ_i = 0 | x, y) = [Σ_{ξ: ξ_i=1} p(x|ξ) p(y|ξ) Pr(ξ)] / [Σ_{ξ: ξ_i=0} p(x|ξ) p(y|ξ) Pr(ξ)],    (10)

Pr(ξ_i = 1 | x, z) / Pr(ξ_i = 0 | x, z) = [Σ_{ξ: ξ_i=1} p(x|ξ) p(z|ξ) Pr(ξ)] / [Σ_{ξ: ξ_i=0} p(x|ξ) p(z|ξ) Pr(ξ)].    (11)

If each constituent encoder implements a trellis code, then x and y form a Markov chain, as do x and z; the complexity of either decoding expression can then be reduced to O(k) by using the forward-backward algorithm from [36] (which, in turn, is a particular case of the sum-product algorithm [27]).
If the a priori probability function Pr(ξ) is indeed uniform, then it weighs all terms in the numerator and denominator equally and, as such, is effectively relegated to an unused variable in either decoding expression (10) or (11). Rather than accepting this status, one can imagine replacing the a priori probability function Pr(ξ), or "usurping" its position, by some other function in an attempt to "bias" either decoding rule (10) or (11) towards the maximum-likelihood decoding rule in (9). In particular, if Pr(ξ) were replaced by p(z|ξ) in (10), or by p(y|ξ) in (11), then either expression would agree formally with (9).
In order to retain the O(k) complexity of the forward-backward algorithm from [36], however, the a priori probability function Pr(ξ) is assumed to factor into the product of its bitwise marginals:

Pr(ξ) = Pr(ξ_1) Pr(ξ_2) · · · Pr(ξ_k).

The likelihood function p(y|ξ) or p(z|ξ) does not, on the other hand, generally factor into its bitwise marginals; that is,

p(y|ξ) ≠ p_1(y|ξ_1) p_2(y|ξ_2) · · · p_k(y|ξ_k).

As such, a direct usurpation of the a priori probability by the likelihood function of the parity-check bits of the other constituent coder is not feasible. Rather, one must approximate the likelihood function p(y|ξ) or p(z|ξ) by a function that does factor into the product of its marginals. Many candidate approximations may be envisaged; that which has proved the most useful relies on extrinsic information values, which are reviewed next.

Extrinsic information values
We reexamine the likelihood function for the systematic bits:

p(x|ξ) = p(x_1|ξ_1) p(x_2|ξ_2) · · · p(x_k|ξ_k).

This shows that the likelihood function p(x|ξ) for the systematic bits factors into the product of its marginals, just like the a priori probability mass function. Owing to these factorizations, each term from the numerator of (10) contains a factor p(x_i|ξ_i = 1) Pr(ξ_i = 1), and each term from the denominator contains a factor p(x_i|ξ_i = 0) Pr(ξ_i = 0). By isolating these common factors, we may rewrite the ratio from (10) as

Pr(ξ_i = 1 | x, y) / Pr(ξ_i = 0 | x, y) = [p(x_i|ξ_i = 1)/p(x_i|ξ_i = 0)] × [Pr(ξ_i = 1)/Pr(ξ_i = 0)] × [Σ_{ξ: ξ_i=1} p(y|ξ) Π_{j≠i} p(x_j|ξ_j) Pr(ξ_j)] / [Σ_{ξ: ξ_i=0} p(y|ξ) Π_{j≠i} p(x_j|ξ_j) Pr(ξ_j)].

The three terms on the right-hand side may be interpreted as follows:
(i) the first term indicates what the ith received bit x_i contributes to the determination of the ith transmitted bit ξ_i; hence the name "intrinsic information." It coincides with the maximum-likelihood metric for determining the ith bit when no coding is used;
(ii) the second term expresses the a priori probability ratio for the ith bit, and will be usurped shortly;
(iii) the third term expresses what the remaining bits in the packet (i.e., of index j ≠ i) contribute to the determination of the ith bit; hence the name "extrinsic information."
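The intrinsic/prior/extrinsic split can be checked numerically by brute force. In the sketch below, the tables px and pr and the dictionary py are random positive stand-ins (hypothetical, since no specific code is fixed at this point) for p(x_i|ξ_i), Pr(ξ_i), and p(y|ξ); the full posterior ratio of (10) is verified to equal the product of the three factors:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
k = 4
px = rng.uniform(0.1, 1.0, size=(k, 2))   # px[i, b] stands in for p(x_i | xi_i = b)
pr = rng.uniform(0.1, 1.0, size=(k, 2))   # pr[i, b] stands in for Pr(xi_i = b)
py = {xi: rng.uniform(0.1, 1.0) for xi in itertools.product([0, 1], repeat=k)}

def full_ratio(i):
    """Posterior ratio for bit i computed directly from the joint sums."""
    num = den = 0.0
    for xi, w in py.items():
        term = w * np.prod([px[j, xi[j]] * pr[j, xi[j]] for j in range(k)])
        if xi[i]:
            num += term
        else:
            den += term
    return num / den

def extrinsic(i):
    """Extrinsic information: the same sums with bit i's own factors removed."""
    num = den = 0.0
    for xi, w in py.items():
        term = w * np.prod([px[j, xi[j]] * pr[j, xi[j]] for j in range(k) if j != i])
        if xi[i]:
            num += term
        else:
            den += term
    return num / den

# the posterior ratio splits as intrinsic x prior x extrinsic
for i in range(k):
    split = (px[i, 1] / px[i, 0]) * (pr[i, 1] / pr[i, 0]) * extrinsic(i)
    assert abs(full_ratio(i) - split) < 1e-9 * split
```

The identity holds exactly because the factors p(x_i|ξ_i) Pr(ξ_i) are common to every term of the corresponding sum and simply factor out.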
Let

T(ξ) = T_1(ξ_1) T_2(ξ_2) · · · T_k(ξ_k)

be a factorable probability mass function whose bitwise ratios are chosen to match the extrinsic information values above:

T_i(ξ_i = 1) / T_i(ξ_i = 0) = [Σ_{ξ: ξ_i=1} p(y|ξ) Π_{j≠i} p(x_j|ξ_j) Pr(ξ_j)] / [Σ_{ξ: ξ_i=0} p(y|ξ) Π_{j≠i} p(x_j|ξ_j) Pr(ξ_j)].

Since these values depend on the likelihood function p(y|ξ) (in addition to the systematic bits save for x_i), we may consider T(ξ) a factorable function which approximates, in some sense, the likelihood function p(y|ξ). (We will see in Theorem 2 a condition under which this approximation becomes exact.) We now let T(ξ) usurp the place reserved for the a priori probability function Pr(ξ) (denoted Pr(ξ) ← T(ξ)) in the evaluation of the second decoder (11); since both p(x|ξ) and T(ξ) factor into the product of their respective marginals, the forward-backward algorithm again applies, furnishing the ratios

[Σ_{ξ: ξ_i=1} p(x|ξ) p(z|ξ) T(ξ)] / [Σ_{ξ: ξ_i=0} p(x|ξ) p(z|ξ) T(ξ)].

Here we adopt the term "pseudoprior" for T(ξ), since it usurps the a priori probability function; similarly, the result of this substitution may be termed a "pseudoposterior" which usurps the true a posteriori probability ratio.
Let now

U(ξ) = U_1(ξ_1) U_2(ξ_2) · · · U_k(ξ_k)

denote another factorable probability function whose bitwise ratios match the extrinsic information values furnished by this second decoder. This function may then usurp the a priori probability values used in the first decoder, and the process iterates. If we let a superscript (m) denote an iteration index, the coupling of the two decoders admits an external description as

[p(x_i|ξ_i = 1) U_i^(m)(1) T_i^(m)(1)] / [p(x_i|ξ_i = 0) U_i^(m)(0) T_i^(m)(0)] = [Σ_{ξ: ξ_i=1} p(y|ξ) p(x|ξ) U^(m)(ξ)] / [Σ_{ξ: ξ_i=0} p(y|ξ) p(x|ξ) U^(m)(ξ)],    (20)

[p(x_i|ξ_i = 1) T_i^(m)(1) U_i^(m+1)(1)] / [p(x_i|ξ_i = 0) T_i^(m)(0) U_i^(m+1)(0)] = [Σ_{ξ: ξ_i=1} p(z|ξ) p(x|ξ) T^(m)(ξ)] / [Σ_{ξ: ξ_i=0} p(z|ξ) p(x|ξ) T^(m)(ξ)],    (21)

in which (20) furnishes T^(m)(ξ) and (21) furnishes U^(m+1)(ξ). This is depicted in Figure 3. A fixed point corresponds to U^(m+1)(ξ) = U^(m)(ξ) which, by inspection of the pseudoposteriors above, yields the following property.
Property 1. A fixed point is attained if and only if the two decoders yield the same pseudoposteriors (the left-hand sides of (20) and (21)) for i = 1, 2, ..., k. A fixed point is therefore reflected by a state of "consensus" between the two decoders [15,29,37].

Existence of fixed points
A necessary (but not sufficient) condition for the algorithm to converge is that a fixed point exist, reflected by a state of consensus according to Property 1. A convenient tool in this direction is the Brouwer fixed point theorem [38], which asserts that any continuous map from a closed, bounded, and convex set into itself admits a fixed point; its application in the present context gives the following result [18,29].

Theorem 1. The turbo decoding algorithm from (20) and (21) always admits a fixed point.
To verify, consider the pseudopriors U_i^(m)(ξ_i) evaluated for ξ_i = 1, which, at any iteration m, are (pseudo-)probabilities lying between 0 and 1:

0 ≤ U_i^(m)(ξ_i = 1) ≤ 1,   i = 1, 2, ..., k.

This clearly gives a closed, bounded, and convex set. Since the updated pseudopriors U^(m+1) also lie in this set, and since the map from U^(m)(ξ) to U^(m+1)(ξ) is continuous [18,29], the conditions of the Brouwer theorem are satisfied, showing the existence of a fixed point.

PROJECTIONS AND PRODUCT DISTRIBUTIONS
A key element of the development thus far concerns the calculation of bitwise marginal ratios which, according to [20], provides the troublesome element accounting for the difference between a provably convergent algorithm [20] which is not practically implementable, and the implementable (but difficult to grasp) turbo decoding algorithm. We develop here an alternate viewpoint of the calculation of bitwise marginals in terms of a certain projection operator, adapted from the seminal work of Richardson [29].
Let q(ξ) denote a distribution over the 2^k configurations of ξ; we assume that q is scaled such that its entries sum to one. The k marginal distributions determined from q(ξ), each having two evaluations at ξ_i = 0 and ξ_i = 1 (1 ≤ i ≤ k), are given by

q_i(ξ_i) = Σ_{ξ_j, j≠i} q(ξ),   i = 1, 2, ..., k.

Definition 1. The distribution q(ξ) is a product distribution if it coincides with the product of its marginals:

q(ξ) = q_1(ξ_1) q_2(ξ_2) · · · q_k(ξ_k).

The set of all product distributions is denoted by P.
It is straightforward to check that q(ξ) ∈ P if and only if its vector representation q (collecting the 2^k evaluations of q(ξ)) is Kronecker decomposable as

q = q_k ⊗ · · · ⊗ q_2 ⊗ q_1,  with  q_i = [q_i(0), q_i(1)]^T.

We note also that P is closed under multiplication: if q(ξ) and r(ξ) belong to P, so does their product

s(ξ) = α q(ξ) r(ξ),

where the scalar α is chosen so that the evaluations of s(ξ) sum to one. This operation can be expressed in vector notation using the Hadamard (or term-by-term) product ⊙:

s = α (q ⊙ r).

To simplify notations, the scalar α will not be explicitly indicated, with the tacit understanding that the elements of the vector must be scaled to sum to one; we will henceforth write s = q ⊙ r, omitting explicit mention of the scale factor α.

Suppose now r(ξ) is not a product distribution. If r_1(ξ_1), ..., r_k(ξ_k) denote its marginal distributions, then we can set

q(ξ) = r_1(ξ_1) r_2(ξ_2) · · · r_k(ξ_k)

to create a product distribution q(ξ) ∈ P which, by construction, generates the same marginals as r(ξ):

q_i(ξ_i) = r_i(ξ_i),   i = 1, 2, ..., k.

This operation will be denoted by q = π(r). We can observe that q is a product distribution (q ∈ P) if and only if π(q) = q, and since π(r) ∈ P for any distribution r, we must have π(π(r)) = π(r), so that π(·) is a projection operator.
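These definitions translate directly into vector form. The sketch below (the bit-to-index convention is our own choice) builds π(r) as the product of the bitwise marginals of r, and checks the idempotence and marginal-preservation properties just noted:

```python
import numpy as np

def marginals(r, k):
    """Bitwise marginals of a length-2**k distribution vector r;
    entry j of r is read as the k-bit pattern of j (bit i = (j >> i) & 1)."""
    idx = np.arange(2 ** k)
    m = np.empty((k, 2))
    for i in range(k):
        bit = (idx >> i) & 1
        m[i, 0] = r[bit == 0].sum() / r.sum()
        m[i, 1] = r[bit == 1].sum() / r.sum()
    return m

def project(r, k):
    """pi(r): the product distribution built from the bitwise marginals of r."""
    m = marginals(r, k)
    idx = np.arange(2 ** k)
    q = np.ones(2 ** k)
    for i in range(k):
        q *= m[i, (idx >> i) & 1]
    return q / q.sum()

rng = np.random.default_rng(1)
k = 3
r = rng.random(2 ** k)
r /= r.sum()
q = project(r, k)

assert np.allclose(project(q, k), q)                  # idempotence: pi(pi(r)) = pi(r)
assert np.allclose(marginals(q, k), marginals(r, k))  # q generates the same marginals
```

Note that the Hadamard product of two product distributions built this way is again in product form, mirroring the closure of P under multiplication.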
Definition 2. The distribution q is the projection of r into P if (i) q ∈ P, and (ii) q generates the same marginals as r, that is, q_i(ξ_i) = r_i(ξ_i) for i = 1, 2, ..., k. The following section details some simple information-theoretic properties which reinforce the interpretation as a projection.

Information-theoretic properties of the projector
The results summarized in this section may be understood as concrete transcriptions of ultimately deeper results from the field of information geometry [33,34]. To begin, we recall that the entropy of a distribution r(ξ) is defined as [39]

H(r) = −Σ_ξ r(ξ) log r(ξ),

involving the sum over all 2^k configurations of the vector ξ = (ξ_1, ..., ξ_k). A basic result of information theory asserts that the entropy of any joint distribution is upper bounded by the sum of the entropies of its marginal distributions [39], that is,

H(r) ≤ Σ_{i=1}^k H(r_i),

with equality if and only if r(ξ) factors into the product of its marginals [r(ξ) ∈ P]. Therefore, by setting q = π(r), we have

H(q) = Σ_{i=1}^k H(q_i) = Σ_{i=1}^k H(r_i) ≥ H(r),

because q_i(ξ_i) = r_i(ξ_i) and q(ξ) ∈ P. This shows that the projection q = π(r) maximizes the entropy over all distributions that generate the same marginals as r(ξ).
We recall next that the Kullback-Leibler distance (or relative entropy) between two distributions r(ξ) and s(ξ) is given by [20,39]

D(r‖s) = Σ_ξ r(ξ) log [r(ξ)/s(ξ)],

with D(r‖s) = 0 if and only if r(ξ) = s(ξ) for all ξ. If s(ξ) ∈ P and q = π(r), then we may verify (see the appendix) that

D(r‖s) = D(r‖q) + D(q‖s) ≥ D(r‖q),

since D(q‖s) ≥ 0, with equality if and only if s(ξ) = q(ξ). This shows that the projection q(ξ) is the closest product distribution to r(ξ) in the Kullback-Leibler distance.
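Both properties of the projector can be verified numerically. The following sketch (indexing conventions are our own) builds q = π(r) for a random r, checks the entropy bound H(q) ≥ H(r), and confirms the Pythagorean-type identity D(r‖s) = D(r‖q) + D(q‖s) for an arbitrary product distribution s:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
idx = np.arange(2 ** k)
r = rng.random(2 ** k)
r /= r.sum()

# q = pi(r): product of the bitwise marginals of r
q = np.ones(2 ** k)
for i in range(k):
    bit = (idx >> i) & 1
    q *= np.array([r[bit == 0].sum(), r[bit == 1].sum()])[bit]

# an arbitrary product distribution s in P
s = np.ones(2 ** k)
for i in range(k):
    si = rng.uniform(0.1, 1.0, 2)
    si /= si.sum()
    s *= si[(idx >> i) & 1]

def entropy(p):
    return float(-(p * np.log(p)).sum())

def kl(a, b):
    return float((a * np.log(a / b)).sum())

# the projection maximizes entropy among distributions sharing r's marginals ...
assert entropy(q) >= entropy(r) - 1e-12
# ... and is KL-closest: D(r||s) = D(r||q) + D(q||s) >= D(r||q)
assert abs(kl(r, s) - (kl(r, q) + kl(q, s))) < 1e-12
```

The identity holds because log(q/s) separates into a sum of single-bit functions, whose expectation under r depends only on the marginals of r, which q shares by construction.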

Application to turbo decoding
The added complication of accounting for the calculation of bitwise marginals noted in [20] can be offset by appealing to the previous section, which interprets bitwise marginals as resulting from a projection. Accordingly, we show in this section how the turbo decoding algorithm of (20) and (21) falls out as an alternating projection algorithm [29]. Let p_x, p_y, and p_z denote the vectors which collect the 2^k evaluations of the likelihood functions p(x|ξ), p(y|ξ), and p(z|ξ), respectively, as ξ varies over its 2^k configurations, and likewise let the vectors t^(m) and u^(m) collect the 2^k evaluations of T^(m)(ξ) and U^(m)(ξ), respectively, at a given iteration m. We can observe that the right-hand side of (20) calculates the bitwise marginal ratios of the distribution p(y|ξ)p(x|ξ)U^(m)(ξ); this distribution admits a vector representation of the form p_y ⊙ p_x ⊙ u^(m). The left-hand side of (20) displays the bitwise marginal ratios of the product distribution p_x ⊙ u^(m) ⊙ t^(m) which generates, by construction, the same bitwise marginals as p_y ⊙ p_x ⊙ u^(m). This confirms that p_x ⊙ u^(m) ⊙ t^(m) is the projection of p_y ⊙ p_x ⊙ u^(m) into P. By applying the same reasoning to (21), we establish the following [29].

Proposition 1. The turbo decoding algorithm of (20) and (21) admits an exact description as the alternating projection algorithm

p_x ⊙ u^(m) ⊙ t^(m) = π(p_y ⊙ p_x ⊙ u^(m)),    (39)

p_x ⊙ t^(m) ⊙ u^(m+1) = π(p_z ⊙ p_x ⊙ t^(m)).    (40)
From this, a connection with maximum-likelihood decoding follows readily [18].

Theorem 2. If p_x ⊙ p_y and/or p_x ⊙ p_z is a product distribution, then (1) the turbo decoding algorithm ((39) and (40)) converges in a single iteration; (2) the pseudoposteriors so obtained agree with the maximum-likelihood decision rule for the code.
For the proof, assume that p_x ⊙ p_y ∈ P. We already have u^(m) ∈ P, and since P is closed under multiplication, we see that p_y ⊙ p_x ⊙ u^(m) ∈ P. Since the projector behaves as the identity operation for distributions in P, the first decoder step of the turbo decoding algorithm from (39) becomes

p_x ⊙ u^(m) ⊙ t^(m) = p_y ⊙ p_x ⊙ u^(m).

From this, we identify p_x ⊙ t^(m) = p_x ⊙ p_y for all iterations m, showing that a fixed point is attained. The second decoder from (40) then gives

p_x ⊙ t^(m) ⊙ u^(m+1) = π(p_z ⊙ p_x ⊙ p_y),

which furnishes the bitwise marginal ratios of p(x|ξ)p(y|ξ)p(z|ξ). This agrees with the maximum-likelihood decision rule seen previously in (9). The proof when instead p_x ⊙ p_z ∈ P follows by exchanging the roles of the two decoders. Note that since p_x is already a product distribution (i.e., p_x ∈ P), it is sufficient (but not necessary) that p_y ∈ P to have p_x ⊙ p_y ∈ P. One may anticipate from this result that if p_x ⊙ p_y and/or p_x ⊙ p_z is "close" to a product distribution, then the algorithm should converge "rapidly"; formal steps confirming this notion are developed in [18]. Such proximity to a product distribution can be verified, in particular, in extreme signal-to-noise ratios [18].
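Theorem 2 is easy to exercise numerically. In the sketch below (our own indexing; ⊙ implemented as elementwise multiplication, with normalization absorbing the scale factors), p_x and p_y are chosen as random product distributions, so that p_x ⊙ p_y ∈ P, and one pass of (39) and (40) is checked to reproduce the maximum-likelihood marginals and to reach a fixed point:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3
idx = np.arange(2 ** k)

def project(r):
    """pi(r): normalized product of the bitwise marginals of r."""
    q = np.ones(2 ** k)
    for i in range(k):
        bit = (idx >> i) & 1
        q *= np.array([r[bit == 0].sum(), r[bit == 1].sum()])[bit] / r.sum()
    return q / q.sum()

def random_product():
    q = np.ones(2 ** k)
    for i in range(k):
        qi = rng.uniform(0.1, 1.0, 2)
        q *= qi[(idx >> i) & 1]
    return q / q.sum()

# p_x and p_y in P (sufficient for p_x (x) p_y in P); p_z arbitrary
px, py = random_product(), random_product()
pz = rng.random(2 ** k)
pz /= pz.sum()

u = np.full(2 ** k, 1.0 / 2 ** k)    # uniform initial pseudoprior
t = project(py * px * u) / (px * u)  # decoder 1, cf. (39)
t /= t.sum()
pseudo = project(pz * px * t)        # decoder 2 output, cf. (40)

# pseudoposterior marginals match the maximum-likelihood metric p_x p_y p_z
ml = project(px * py * pz)
assert np.allclose(pseudo, ml)

# a second pass leaves the extrinsic table unchanged: converged in one iteration
u2 = project(pz * px * t) / (px * t)
u2 /= u2.sum()
t2 = project(py * px * u2) / (px * u2)
t2 /= t2.sum()
assert np.allclose(t2, t)
```

Since p_y ⊙ p_x ⊙ u lies in P, the projection acts as the identity and t is proportional to p_y, exactly as in the proof.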
Example 1 (high signal-to-noise ratios). Let ξ* denote the vector of true information bits. The joint likelihood evaluation for x and y becomes

p(x, y|ξ) ∝ exp( −[‖x − c_x(ξ)‖² + ‖y − c_y(ξ)‖²] / (2σ²) ),  with  x = c_x(ξ*) + b_x,  y = c_y(ξ*) + b_y,

where c_x(ξ) and c_y(ξ) denote the antipodal (±1) representation of the coded information bits ξ, and where b_x and b_y are the vectors of channel noise samples. As the noise variance σ² tends to zero, we have b_x, b_y → 0, and p(x, y|ξ) approaches (after normalization) a delta function concentrated at ξ = ξ*. We note that the delta function can always be written as the product of its marginals (which are themselves delta functions of the individual bits of ξ*). Experimental evidence confirms that, in high signal-to-noise ratios, the algorithm converges rapidly to decoded symbols of high reliability.
Example 2 (poor signal-to-noise ratios). As the noise variance σ² increases, the likelihood evaluations are dominated by the presence of the noise terms; ratios of candidate likelihood evaluations then tend to 1, which is to say that p(x, y|ξ) approaches (after normalization) a uniform distribution over the 2^k configurations of ξ. We note that a uniform distribution can always be written as the product of its marginals (which are themselves uniform distributions). Experimental evidence again confirms (e.g., [15,18]) that, in poor signal-to-noise ratios, the algorithm converges rapidly to a fixed point, but offers low confidence in the decoded symbols.
Although the above examples assume a Gaussian channel for simplicity, the basic reasoning can be extended to other memoryless channel models. More interesting, of course, is the convergence behavior for intermediate signal-to-noise ratios, which still presents a challenging problem. A natural question at this stage, however, is whether there exist constituent encoders which would give p_x ⊙ p_y or p_x ⊙ p_z as a product distribution irrespective of the signal-to-noise ratio. The answer is in the affirmative by considering, for example, a repetition code for the second constituent encoder. The arguments showing that p_x ∈ P can then be copied to show that p_z ∈ P as well (and therefore that p_x ⊙ p_z ∈ P). But the distance properties of the resulting concatenated code are not very impressive, being basically the same as for the first constituent encoder. This concurs with an observation from [24], namely that "easily decodable" codes do not tend to be good codes.

SERIAL CONCATENATED CODES
We turn our attention now to serial concatenated codes, which have been studied extensively by Benedetto and his coworkers [2,3,35], and which encompass an ultimately richer structure. Our aim in this section is to show that the alternating projection interpretation again carries through, thus affording a unified study of serial and parallel concatenated codes. The basic flow graph for serial concatenated codes is depicted in Figure 4, in which the information bits ξ = (ξ_1, ..., ξ_k) are first processed by an outer encoder, which here is systematic, so that the first k bits of its output χ = (χ_1, ..., χ_n) are the information bits:

χ_i = ξ_i,   i = 1, 2, ..., k.

The remaining bits χ_{k+1}, ..., χ_n furnish the n − k parity-check bits. The cascaded inner encoder may admit different interpretations.
(i) The inner encoder may be a second (block or convolutional) encoder, perhaps endowed with an interleaver to offer protection against burst errors, consistent with conventional serial concatenated codes [2,3]. Each input configuration χ is mapped to an output configuration ψ. With reference to Figure 4, the rate of the inner encoder is n/l.
(ii) The inner encoder may be a differential encoder, in order to endow the receiver with robustness against phase ambiguity in the received signal. Since a differential encoder is a particular case of a rate-1 convolutional encoder (with l = n or perhaps l = n + 1), this case is accommodated by the previous one.
(iii) The inner encoder may represent the convolutional effect induced by a channel whose memory is longer than the symbol period. In this case, taking into account that the symbols {χ_i} will have been converted to antipodal signaling (±1), the baseband channel output appears as

v_i = Σ_m h_m χ̃_{i−m} + b_i,

where χ̃_i = ±1 denotes the antipodal version of χ_i, {h_m} denotes the equivalent impulse response of the baseband model, b_i is the additive channel noise, and v_i may be scalar-valued (for a single-input single-output channel) or vector-valued (for a single-input multiple-output channel).
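For the last interpretation, the baseband model translates into a one-line convolution. The following minimal sketch (the impulse response, block length, and noise level are hypothetical choices of our own) generates the antipodal symbols and the noisy channel output:

```python
import numpy as np

rng = np.random.default_rng(5)
chi = rng.integers(0, 2, size=8)           # inner-encoder input bits chi_i
c = 2 * chi - 1                            # antipodal conversion to +/-1
h = np.array([1.0, 0.5, -0.2])             # hypothetical impulse response h_m
sigma = 0.1
noise = sigma * rng.standard_normal(chi.size)
v = np.convolve(c, h)[: chi.size] + noise  # v_i = sum_m h_m c_{i-m} + b_i
```

Here the channel itself plays the role of a (rate-1, real-valued) "parasitic" inner encoder, which is precisely what turbo equalization exploits.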
Certainly other interpretations may be developed as well; the above list may nonetheless be considered representative of some common configurations.

Optimum decoding
With v denoting the noisy received version of ψ (after conversion to antipodal form, possibly corrupted by intersymbol interference), the optimum decoding metric is again based on the a posteriori marginal probability ratios

Pr(ξ_i = 1|v) / Pr(ξ_i = 0|v) = [Σ_{ξ: ξ_i=1} p(v|ξ) Pr(ξ)] / [Σ_{ξ: ξ_i=0} p(v|ξ) Pr(ξ)],   i = 1, 2, ..., k.    (48)

If all input configurations are equally probable, we have Pr(ξ) = 1/2^k and we recover the maximum-likelihood decoding rule.
If no interleaver is used between the two coders, then the mapping from ξ to v is a noisy convolution, allowing a trellis structure to perform optimum decoding at a reasonable computational cost.In the presence of an interleaver, on the other hand, the convolutional structure between ξ and v is compromised, such that a direct evaluation of (48) leads to a computational complexity that grows exponentially with the block length.Iterative decoding, to be reviewed next, represents an attempt to reduce the decoding complexity to a reasonable value.

Iterative decoding for serial concatenated codes
Iterative serial decoding [2] amounts to implementing locally optimum decoders which infer χ from v, and then ξ from χ, and subsequently exchanging information until consensus is reached.Our development emphasizes the external descriptions of the local decoding operations in order to better identify the form of consensus that is reached, as well as to justify the seemingly heuristic coupling between the coders by way of connections with maximum-likelihood decoding.
Consider first the inner decoding rule, which seeks to determine the inner encoder's input χ = (χ_1, ..., χ_n) from the noisy received signal v:

Pr(χ_i = 1|v) / Pr(χ_i = 0|v) = [Σ_{χ: χ_i=1} p(v|χ) Pr(χ)] / [Σ_{χ: χ_i=0} p(v|χ) Pr(χ)],   i = 1, 2, ..., n.    (49)

The inner decoder assumes that the a priori probability mass function Pr(χ) factors into the product of its marginals as

Pr(χ) = Pr(χ_1) Pr(χ_2) · · · Pr(χ_n).

This assumption, strictly speaking, is incorrect, because the bits {χ_i} are produced by the outer encoder, which imposes dependencies between the bits for error control purposes. The forward-backward algorithm from [36], however, cannot exploit these dependencies without incurring a significant increase in computational complexity. By turning a "blind eye" to this fact, and therefore admitting the factorization of Pr(χ) into the product of its marginals, each term from the numerator (resp., denominator) of (49) will contain a factor Pr(χ_i = 1) (resp., Pr(χ_i = 0)), which gives

Pr(χ_i = 1|v) / Pr(χ_i = 0|v) = [Pr(χ_i = 1)/Pr(χ_i = 0)] × [Σ_{χ: χ_i=1} p(v|χ) Π_{j≠i} Pr(χ_j)] / [Σ_{χ: χ_i=0} p(v|χ) Π_{j≠i} Pr(χ_j)],   i = 1, 2, ..., n,

the final factor being the extrinsic information. We now let T(χ) = T_1(χ_1) · · · T_n(χ_n) denote a factorable probability mass function whose marginal ratios match the extrinsic information values above:

T_i(χ_i = 1) / T_i(χ_i = 0) = [Σ_{χ: χ_i=1} p(v|χ) Π_{j≠i} Pr(χ_j)] / [Σ_{χ: χ_i=0} p(v|χ) Π_{j≠i} Pr(χ_j)],   i = 1, 2, ..., n.    (52)

The outer decoder would normally aim to determine the information bits ξ based on an estimate (denoted χ̂) of the outer encoder's output, according to the a posteriori probability ratios

Pr(ξ_i = 1|χ̂) / Pr(ξ_i = 0|χ̂) = [Σ_{ξ: ξ_i=1} p(χ̂|ξ) Pr(ξ)] / [Σ_{ξ: ξ_i=0} p(χ̂|ξ) Pr(ξ)].    (53)

The estimate χ̂, however, is not immediately available. If it were, then, assuming a Gaussian channel, each likelihood function evaluation would appear as a product over j = 1, ..., n of terms of the form exp[−(χ̂_j ∓ 1)²/(2σ²)], according as the bit χ_j(ξ) determined by ξ = (ξ_1, ..., ξ_k) is 1 or 0. To each hypothetical bit χ̂_j, therefore, we associate two evaluations exp[−(χ̂_j ∓ 1)²/(2σ²)] (corresponding to χ_j(ξ) = 1 or 0), which are usurped by the two evaluations of T_j(χ_j) from (52):

exp[−(χ̂_j − 1)²/(2σ²)] ← T_j(χ_j = 1),   exp[−(χ̂_j + 1)²/(2σ²)] ← T_j(χ_j = 0).

The forward-backward algorithm [36] may then run, following this systematic substitution.
To develop an external description of the resulting decoding algorithm, we note that this substitution amounts to usurping the likelihood function p(χ̂|ξ) by

p(χ̂|ξ) ← T_1(χ_1) T_2(χ_2) · · · T_n(χ_n) |_{χ = χ(ξ)},    (56)

in which the right-hand side notationally emphasizes that only those bit combinations χ_1, ..., χ_n that lie in the outer codebook make sense.
To arrive at a more convenient form, let φ(χ) denote the indicator function for the outer codebook:

φ(χ) = 1 if χ lies in the outer codebook, and φ(χ) = 0 otherwise.

The 2^n configurations of (χ_1, ..., χ_n) generate 2^n evaluations of Π_{j=1}^n T_j(χ_j), but only 2^k of these evaluations survive in the product φ(χ) Π_j T_j(χ_j), namely, the 2^k evaluations from the right-hand side of (56) which are generated as ξ varies over its 2^k configurations. We may then establish a one-to-one correspondence between the 2^k "surviving" evaluations in φ(χ) Π_j T_j(χ_j) and the 2^k evaluations of the likelihood function p(χ̂|ξ) which are usurped in (56). Assuming that Pr(ξ) is a uniform distribution, the usurped pseudoposteriors from (53) become

[Σ_{ξ: ξ_i=1} p(χ̂|ξ)] / [Σ_{ξ: ξ_i=0} p(χ̂|ξ)] ← [Σ_{χ: χ_i=1} φ(χ) Π_j T_j(χ_j)] / [Σ_{χ: χ_i=0} φ(χ) Π_j T_j(χ_j)],

in which we note the following:
(i) since the outer code is systematic, the first k bits χ_1, ..., χ_k coincide with the information bits ξ_1, ..., ξ_k, therefore allowing a direct substitution for the variables of summation. In addition, the formula above may be evaluated as written for the parity-check bits χ_{k+1}, ..., χ_n;
(ii) each term in the numerator (resp., denominator) contains a factor T_i(χ_i = 1) (resp., T_i(χ_i = 0)), so that the ratio T_i(1)/T_i(0) naturally factors out.

Let

U(χ) = U_1(χ_1) U_2(χ_2) · · · U_n(χ_n)

be a factorable probability function whose marginal ratios match the extrinsic information values:

U_i(χ_i = 1) / U_i(χ_i = 0) = [Σ_{χ: χ_i=1} φ(χ) Π_{j≠i} T_j(χ_j)] / [Σ_{χ: χ_i=0} φ(χ) Π_{j≠i} T_j(χ_j)],   i = 1, 2, ..., n.

These values may then usurp the a priori probability function Pr(χ) of the inner decoder: Pr(χ) ← U(χ).
If we let a superscript (m) denote an iteration number, then the coupling of the two decoders admits an external description of the form

[T_i^(m)(1) U_i^(m)(1)] / [T_i^(m)(0) U_i^(m)(0)] = [Σ_{χ: χ_i=1} p(v|χ) U^(m)(χ)] / [Σ_{χ: χ_i=0} p(v|χ) U^(m)(χ)],    (61)

[U_i^(m+1)(1) T_i^(m)(1)] / [U_i^(m+1)(0) T_i^(m)(0)] = [Σ_{χ: χ_i=1} φ(χ) T^(m)(χ)] / [Σ_{χ: χ_i=0} φ(χ) T^(m)(χ)],    (62)

as depicted in Figure 5. A fixed point corresponds to U^(m+1)(χ) = U^(m)(χ) which, in analogy with the parallel concatenated code case, can be characterized by the following "consensus" property.
Property 2. A fixed point in the serial decoding algorithm occurs if and only if the two decoders yield the same pseudoposteriors (left-hand sides of (61) and (62)) for i = 1, 2, …, n.
Note that the consensus here covers the information bits plus the parity-check bits furnished by the outer decoder. As with the parallel concatenated code case, the existence of fixed points follows by applying the Brouwer fixed point theorem (cf. Section 2.3).

Projection interpretation
The iterative decoding algorithm for serial concatenated codes can also be rephrased as an alternating projection algorithm, analogously to the parallel concatenated code case of Section 3, as we develop presently.
We continue to denote by P the set of distributions q(χ) which factor into the product of their marginals:

P = { q : q(χ) = q_1(χ_1) q_2(χ_2) · · · q_n(χ_n) }.

The only modification here is that we now have n marginal distributions to consider, to account for the k information bits plus the n − k parity-check bits which intervene in the consensus of Property 2. If r(χ) is an arbitrary distribution, then q = π(r) yields a distribution q(χ) ∈ P which generates the same n marginal distributions as r(χ).
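By way of illustration, the projector π may be realized by brute-force enumeration of the marginals. This is a sketch of my own (the function names marginals and project are not from the text):

```python
from itertools import product
from math import prod

def marginals(r, n):
    """Bitwise marginal distributions of a function r defined on {0,1}^n."""
    m = [[0.0, 0.0] for _ in range(n)]
    for c in product([0, 1], repeat=n):
        for i, b in enumerate(c):
            m[i][b] += r[c]
    return m

def project(r, n):
    """q = pi(r): the element of P built from the marginals of r."""
    m = marginals(r, n)
    return {c: prod(m[i][b] for i, b in enumerate(c))
            for c in product([0, 1], repeat=n)}

# Example: an arbitrary distribution r(chi) on {0,1}^2 and its projection.
r = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
q = project(r, 2)   # q lies in P and shares the n marginal distributions of r
```

Since π reproduces the marginals, applying it a second time changes nothing: the projector acts as the identity on elements of P, a property exploited in the convergence arguments below.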
With respect to the inner decoder, we see that the right-hand side of (61) calculates the marginal ratios of the distribution p(v|χ) U^(m)(χ), which admits a vector representation as p_v u^(m). The left-hand side of (61) contains the marginal ratios of t^(m) u^(m) ∈ P, which agree with those of p_v u^(m), consistent with our projection operation. By applying the same reasoning to (62), we obtain a natural counterpart to Proposition 1.

Proposition 2. The iterative serial decoding algorithm of (61) and (62) coincides with the alternating projection algorithm

t^(m) u^(m) = π(p_v u^(m)),
u^(m+1) t^(m) = π(φ t^(m)).

From this follows a natural analogue to Theorem 2 establishing a key link with maximum-likelihood decoding.

Theorem 3. If p(v|χ) factors into the product of its marginals, then
(1) the iterative algorithm (61) and (62) converges in a single iteration; (2) the pseudoposteriors so obtained agree with the maximum-likelihood decision metric for the code.
The proof parallels that of Theorem 2, but displays its own particularities which merit its inclusion here. If p(v|χ) factors into the product of its marginals, then p_v ∈ P, giving p_v u^(m) ∈ P as well. Since the projector behaves as the identity when applied to elements of P, the first displayed equation of Proposition 2 becomes

t^(m) u^(m) = π(p_v u^(m)) = p_v u^(m).

From this we identify t^(m) = p_v for all iterations m, giving a fixed point. Substituting t^(m) = p_v into the projector of the second displayed equation of Proposition 2 reveals

u^(m+1) p_v = π(φ p_v).

This calculates the marginal functions of φ(χ) p(v|χ), whose surviving evaluations are the restriction of the likelihood function p(v|χ) to the outer codebook:

φ(χ) p(v|χ) = p(v|χ) if χ lies in the outer codebook, and 0 otherwise.

Since the outer code is systematic, we have χ_i = ξ_i for i = 1, …, k. Therefore, the first k marginal ratios from φ(χ) p(v|χ) coincide with those from p(v|ξ); these in turn agree with the maximum-likelihood decoding rule which results from (48) when the a priori probability function Pr(ξ) is uniform.
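Under these assumptions, Theorem 3 can be checked numerically on a toy example. The sketch below is my own (the (3,2) parity-check outer code, the factorable likelihood values, and the names pi_, p_v, phi are invented for illustration): it runs one step of the alternating projection updates as read here (t u = π(p_v u), then pseudoposteriors from π(φ t)) and confirms that t^(1) = p_v and that the resulting pseudoposterior ratios equal the codebook-restricted maximum-likelihood ratios.

```python
from itertools import product
from math import isclose

n = 3
chis = list(product([0, 1], repeat=n))
codebook = {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}  # toy (3,2) parity code
phi = {c: 1.0 if c in codebook else 0.0 for c in chis}

def pi_(r):
    """Projection pi: the factorable function built from the marginals of r."""
    m = [[0.0, 0.0] for _ in range(n)]
    for c, v in r.items():
        for i, b in enumerate(c):
            m[i][b] += v
    out = {}
    for c in chis:
        v = 1.0
        for i, b in enumerate(c):
            v *= m[i][b]
        out[c] = v
    return out

def ratio(i, r):
    """Marginal ratio of the function r for bit i."""
    num = sum(v for c, v in r.items() if c[i] == 1)
    den = sum(v for c, v in r.items() if c[i] == 0)
    return num / den

# A factorable likelihood p(v|chi) = prod_i p_i(chi_i) (made-up numbers).
p_bit = [(0.8, 0.2), (0.3, 0.7), (0.6, 0.4)]
p_v = {c: p_bit[0][c[0]] * p_bit[1][c[1]] * p_bit[2][c[2]] for c in chis}

u = {c: 1.0 for c in chis}                 # uniform a priori U^(0)
pu = pi_({c: p_v[c] * u[c] for c in chis})
t = {c: pu[c] / u[c] for c in chis}        # inner update: t u = pi(p_v u)

# Fixed point reached in one iteration: t^(1) coincides with p_v.
assert all(isclose(t[c], p_v[c]) for c in chis)

# Pseudoposteriors agree with the ML ratios restricted to the codebook.
q = pi_({c: phi[c] * t[c] for c in chis})
for i in range(n):
    ml = (sum(p_v[c] for c in codebook if c[i] == 1)
          / sum(p_v[c] for c in codebook if c[i] == 0))
    assert isclose(ratio(i, q), ml)
```

Here the equality t^(1) = p_v holds exactly because p_v already lies in P, so the projector leaves p_v u^(0) unchanged, mirroring the argument in the proof above.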
As with the case of parallel concatenated codes, the likelihood function p(v|χ) will be "close" to a factorable distribution when the signal-to-noise ratio is sufficiently high or sufficiently low.The conclusions from [18, Section 3, Examples 1 and 2] therefore apply to serial concatenated codes as well.

CONCLUDING REMARKS
We have developed a tutorial overview of iterative decoding for parallel and serial concatenated codes, in the hopes of rendering this material accessible to a wider audience. Our development has emphasized descriptions and properties which remain valid irrespective of the block length, and which may therefore facilitate the analysis of such algorithms at short block lengths. At the same time, the presentation emphasizes how the decoding algorithms for parallel and serial concatenated codes may be addressed in a unified manner.
Although different properties have been exposed, the critical question of convergence domains versus code choice and signal-to-noise ratio remains more difficult to develop. The natural extension of the projection viewpoint favored here involves studying the stability properties of the resulting dynamic system. This is pursued in [18, 29] (among others), in which explicit expressions for the Jacobian of the system feedback matrix are obtained; once a fixed point is isolated, its local stability properties can then be studied [18, 29], although they depend in a complicated manner on the specific code and channel properties (distance, block length, signal-to-noise ratio, etc.).
One may observe that a fixed point occurs whenever the pseudoposteriors assume uniform distributions, and that this gives a convergence point at pessimistic signal-to-noise ratios [18]. With some further code constraints [40], fixed points are also shown to occur at codeword configurations (i.e., where T_i(1) = ξ_i), consistent with the observed convergence behavior at signal-to-noise ratios beyond the waterfall region, and corresponding to an unequivocal fixed point in the terminology of [18]. Interestingly, the convergence of pseudoprobabilities to 0 or 1 was observed for low-density parity-check codes as far back as [6]. Deducing the stability properties of different fixed points versus the signal-to-noise ratio and block length, however, remains a challenging problem.
By allowing the block length to become arbitrarily long, large-sample approximations may be invoked, which typically take the form of log-pseudoprobability ratios approaching independent Gaussian random variables. Many insightful analyses may then be developed (e.g., [15, 16, 17, 19], among others). Such approximations, however, are known to be less than faithful at shorter block lengths, which are of greater interest in two-way communication systems, so that analyses exploiting them do not adequately predict the behavior of iterative decoding algorithms in this regime.
Graphical methods (including [25, 26, 27, 28]) provide another powerful analysis technique in this direction. Present trends include studying how code design impacts the cycle length of the decoding algorithm, based on the plausible conjecture that longer cycles should offer a greater "stability margin" in an ultimately closed-loop system. Further study, however, is required to better understand the stability properties of iterative decoding algorithms in the general case.

Figure 3: Flow graph of the turbo decoding algorithm.

Figure 5: Flow graph for iterative decoding of serial concatenated codes.