COMPUTATION OF DISTRIBUTIONS AND THEIR MOMENTS IN THE TRELLIS

Abstract. Consider a function whose set of vector arguments with known distribution is described by a trellis. For a certain class of functions, the distribution of the function values can be calculated in the trellis. The forward/backward recursion known from the BCJR algorithm [2] is generalized to compute the moments of these distributions. In analogy to the symbol probabilities, by introducing a constraint at a certain depth in the trellis we obtain symbol distributions and symbol moments, respectively. These moments are required for an efficient implementation of the discriminated belief propagation algorithm in [8], and can furthermore be utilized to compute conditional entropies in the trellis. The moment computation algorithm has the same asymptotic complexity as the BCJR algorithm. It is applicable to any commutative semi-ring, thus actually providing a generalization of the Viterbi algorithm [10].


Introduction
Trellises were introduced into the coding theory literature by Forney [4] as a means of describing the Viterbi algorithm for decoding convolutional codes. Bahl et al. [2] showed that block codes can also be described by a trellis, and Wolf [11] proposed the use of the Viterbi algorithm for trellis-based soft-decision decoding of block codes. Massey [6] gave a graph-theoretic definition of a block trellis and an alternative construction of minimal trellises. Forney's paper [3] showed that group codes, including linear codes and lattices, have a well-defined trellis structure.
In [7], McEliece investigated the complexity of a generalized Viterbi algorithm which allows efficient computation of flows in a code trellis. These results were extended in [1] and [5]. However, the calculation of flows does not fully exploit the capabilities of the trellis (representation): for a certain class of functions it is possible to calculate the distribution of the function values and the moments of these functions in the trellis.
For iterative decoding of coupled codes, the popular sum-product algorithm is used to calculate the symbol probabilities of the component codes. These probabilities are exchanged between the component decoders until a stable solution is found. This iterative algorithm works very well for long "turbo", low-density parity-check (LDPC) and some other codes obtained by concatenating simple component codes in a special way. However, performance becomes poor when short or certain good component codes are used.
Recently, Sorger [8] proposed a generalized decoder discriminating code words c by their correlation cr^T or cw^T with the received word r or a 'believed' word w, respectively. Not only symbol probabilities are considered, but also the distribution of these probabilities over the correlation value. An efficient algorithm is introduced using the first two moments to approximate these distributions. More details can be found in Appendix A.
In this paper we propose algorithms to compute both such distributions and their moments in the trellis. Example 1. Consider Figure 1, which shows two distributions of the correlation function cr^T, where c is a code word and r is the noisy version of a code word č ∈ C after transmission over a memory-less binary symmetric channel (BSC). The curves show the distributions for c ∈ C_i(+1) and c ∈ C_i(−1), respectively, where C_i(x) := {c ∈ C : c_i = x} denotes the sub-code of C for which the symbol c_i at a given position i of each code word equals x ∈ {−1, +1}. The sums over the distributions equal the symbol probabilities P(c_i = x|r). However, the probability ratio P(cr^T, c_i = +1|r) / P(cr^T, c_i = −1|r) varies significantly over cr^T, which can be exploited when knowledge of the correlation čr^T with the transmitted code word is available. The distributions in Figure 1 can be approximated with the moments

(1) E_C[(cr^T)^m | r, c_i = x],

where E_C denotes the expectation over all code words c ∈ C, which are assumed to be equiprobable. The distributions will be Gaussian for sufficiently long codes, which can be understood by the law of large numbers. Hence we can expect the first two moments to suffice for a good approximation.
We present generalizations of the methods in [7] which enable us to compute distributions P(cw^T, c_i = +1|r) and expressions like E_C[(cw^T)^m | r, c_i] for some word w, of which (1) is a special case. The complexity of the algorithm is of the same order as that of the BCJR algorithm.
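To make the Gaussian approximation concrete, the following sketch (our own illustration, with hypothetical moment values; NumPy and SciPy are assumed available) fits a Gaussian to such a distribution given only its 0-th, first and second moments:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical moments of the correlation c r^T restricted to c_i = +1:
# theta0 = P(c_i = +1 | r); theta1, theta2 = first and second moment numerators.
theta0, theta1, theta2 = 0.62, 37.2, 2321.3

mean = theta1 / theta0               # conditional mean of c r^T given c_i = +1
var = theta2 / theta0 - mean**2      # conditional variance

# Gaussian approximation of P(c r^T = u, c_i = +1 | r) over the correlation u;
# for a length-200 bipolar code the correlation takes values spaced by 2,
# hence the factor 2 when converting the density to point masses.
u = np.arange(-200, 201, 2)
approx = theta0 * norm.pdf(u, loc=mean, scale=np.sqrt(var)) * 2
```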
The remainder of this paper is structured as follows. The next section reviews common terminology in the context of trellises. Section 3 extends this and deals with the computation of distributions and their moments in a more general framework. In Section 4 we return to the original problem by transferring the results of Section 3 to linear block codes and by calculating the conditional entropy in the trellis. The Appendix contains a derivation of the relation between uncertainty and correlation, a practical algorithm for the computation of distributions in the trellis, and a generalization of the forward recursion of Section 3 to computation on a commutative semi-ring.

Definitions
We deliberately follow to a wide extent the notation and style of McEliece; the first paragraph is an excerpt from [7] with minor modifications. * A trellis T = (V, E) of rank n is a finite directed graph with vertex set V and edge set E, in which every vertex is assigned a depth in the range {0, 1, . . . , n}. Each edge connects a vertex at depth i − 1 to one at depth i, for some i ∈ {1, . . . , n}. Multiple edges between vertices are allowed. The set of vertices at depth i is denoted by V_i. There is only one vertex at depth 0, called A, and only one at depth n, called B. If e ∈ E is a directed edge connecting the vertices u and v, which we denote by e : u → v, we call u the initial vertex and v the final vertex of e, and write init(e) = u, fin(e) = v. We denote the number of edges leaving a vertex v by ρ^+(v), and the number of edges entering a vertex v by ρ^−(v), i.e. ρ^+(v) = |{e : init(e) = v}| and ρ^−(v) = |{e : fin(e) = v}|.
If u and v are vertices, a path P of length L from u to v is a sequence of L edges P = e_1 e_2 · · · e_L such that init(e_1) = u, fin(e_L) = v, and fin(e_i) = init(e_{i+1}) for i = 1, 2, . . . , L − 1. If P is such a path, we sometimes write P : u → v for short, as well as init(P) = init(e_1) and fin(P) = fin(e_L). We denote the set of paths from vertices at depth i to vertices at depth j by E_{i,j}. We assume that for every vertex v ≠ A, B there is at least one path from A to v and at least one path from v to B. * In contrast to [7], here we restrict our definitions and derivations to the set of real numbers and give a generalization to semi-rings in Appendix D. We assume each edge in the trellis is labeled: let T = (V, E) be a trellis of rank n such that each edge e ∈ E is labeled with a real-valued number λ(e) ∈ R. We now define the label of a path and the flow between two vertices.

Definition 2.1. (Path Labels) The label λ(P) of a path P = e_1 e_2 · · · e_L is defined as the product λ(P) = λ(e_1) · λ(e_2) · … · λ(e_L) of the labels of all edges in the path. (Note that the subscript indicates the sequence number rather than the edge's depth.) The flow between two vertices u and v is the sum of the path labels λ(P) over all paths P : u → v (cf. [7]).

In this paper we only consider operations on the set of real numbers with ordinary addition and multiplication, as the authors are not aware of applications for other algebraic structures. However, Appendix D briefly shows that the algorithm can be transferred to any commutative semi-ring, thus leading to a generalization of the Viterbi algorithm [10].
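As a concrete reference for these definitions, here is a minimal sketch (data-structure names are our own, not from [7]) of a labeled trellis and of the path label as the product of the edge labels:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    init: str          # initial vertex init(e)
    fin: str           # final vertex fin(e)
    lam: float         # lambda-label lambda(e)
    c: int = 0         # c-label c(e), introduced in Section 3

@dataclass
class Trellis:
    n: int                                        # rank
    depth: dict = field(default_factory=dict)     # vertex -> depth in {0, ..., n}
    edges: list = field(default_factory=list)     # all edges

    def edges_into(self, v):
        """Edges e with fin(e) = v; their number is rho^-(v)."""
        return [e for e in self.edges if e.fin == v]

def path_label(path):
    """Label lambda(P) of a path P = e_1 e_2 ... e_L: the product of its edge labels."""
    lam = 1.0
    for e in path:
        lam *= e.lam
    return lam
```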
Example 3. We continue Example 2. The trellis depicted in Figure 2 is the trellis of the (4, 3, 2) single parity check code. In the BCJR algorithm, the edge labels λ(e) are the probabilities of the corresponding transitions in the trellis.

Trellis-based computations
In this section we consider distributions of the type

q ↦ Σ_{P : f(P) = q} λ(P)

for special functions f, i.e. q is mapped to the sum of the λ-labels of all paths P with f(P) = q. We present an algorithm to calculate these distributions over all paths of a trellis or a subset of these. First, however, we develop algorithms to calculate the moments θ^(m) of these distributions. To each edge e ∈ E of the trellis T we introduce a second label c(e) ∈ R, which we will refer to as the c-label. For distinction, we will call λ(e) the λ-label.
Example 4. We continue Example 3. Solid lines correspond to the c-label c(e) = 1, dashed lines correspond to c(e) = −1 (bipolar binary notation). For example, the path P = adik has the c-label sequence c(P) = [+1 −1 −1 +1], which is a code word. Let f(c(P)) be a function of the c-labels of the edges of a path P of length L; the bold letter indicates that c is a vector. For simplicity, in the following we abbreviate g_i(c(e)) and f(c(P)) by g_i(e) and f(P), respectively. The functions f(P) have to fulfill the linearity criterion

(2) f(P) = f(e_1 e_2 · · · e_n) = g_1(e_1) + g_2(e_2) + · · · + g_n(e_n)

for all paths P : A → B.
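As a small illustration (our own example, not from the text), the correlation of a path's c-labels with a fixed word w satisfies criterion (2) with g_i(e) = c(e) · w_i:

```python
# f(P) = sum_i g_i(e_i) with g_i(e) = c(e) * w[i], i.e. the correlation c(P) w^T.
w = [0.7, -1.3, 0.2, 1.1]          # some word w (hypothetical values)
c_path = [+1, -1, -1, +1]          # c-labels of a path, e.g. P = adik in Example 4

f = sum(c_i * w_i for c_i, w_i in zip(c_path, w))   # additive over the edges, as (2) requires
```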
Definition 3.1. (Forward Numerator) The m-th forward numerator of the function f at a vertex v of the trellis T is defined as

(3) α^(m)(v) := Σ_{P : A → v} λ(P) · (f(P))^m.

Note that the 0-th forward numerator α^(0)(v) is the flow from A to v (cf. [7]) as it is calculated by the BCJR algorithm.

Theorem 3.2. The m-th forward numerator α^(m)(v) of a vertex v ∈ V_i at depth i ≥ 1 can be calculated recursively as in Algorithm 1, i.e. by

(4) α^(m)(v) = Σ_{e : fin(e) = v} λ(e) · Σ_{l=0}^{m} \binom{m}{l} (g_i(e))^{m−l} · α^(l)(init(e)),

with initial values α^(0)(A) = 1 and α^(m)(A) = 0 for m > 0.
Proof. The proof is by induction on depth(v). For depth(v) = 1, it follows from the definition of a trellis that all paths from A to v must consist of just one edge e with init(e) = A and fin(e) = v. Thus the true value of α^(m)(v) is the sum of the λ-labels on all edges e joining A to v, weighted by (g_1(e))^m. On the other hand, with the initial values α^(0)(A) = 1 and α^(l)(A) = 0 for l > 0, the recursion (4) yields

α^(m)(v) = Σ_{e : fin(e) = v} λ(e) · (g_1(e))^m,

which is, as required, the sum of the labels on all edges e joining A to v, weighted by (g_1(e))^m. Thus the algorithm works correctly for all vertices v with depth(v) = 1 and any m ≥ 0.
Assuming now that the assertion is true for all vertices at depth i or less and all m ≤ M, a vertex v at depth i + 1 is considered. Inserting the induction hypothesis in (3) into (4) we have

α^(m)(v) = Σ_{e : fin(e) = v} λ(e) · Σ_{l=0}^{m} \binom{m}{l} (g_{i+1}(e))^{m−l} · Σ_{P : A → init(e)} λ(P) · (f(P))^l,

and using the binomial theorem we obtain

(5) α^(m)(v) = Σ_{e : fin(e) = v} Σ_{P : A → init(e)} λ(e) · λ(P) · (f(P) + g_{i+1}(e))^m.

But every path from A to v must be of the form Pe, where P is a path from A to a vertex u with depth(u) = i, init(e) = u and fin(e) = v. Thus by (5) and (2), α^(m)(v) is correctly calculated by the algorithm.

Theorem 3.3. (Complexity) For a fixed maximum moment M, the number of operations required by Algorithm 1 grows as O(|E| · (M + 1)^2) and is thus of the same order as for the BCJR algorithm.

Proof. The calculation of the powers of g_i(e) up to the maximum moment M for all edges e ∈ E requires |E| · (M − 1) multiplications (assuming M > 0) and no additions. We do not consider the operations needed for calculating g_i(e) or the binomial coefficients here. A multiplication with the factor 1 is fully counted. For each incoming edge e of a vertex v, the sum over l in (4) requires m additions and 2(m + 1) multiplications, and for the sum over e a further ρ^−(v) multiplications with λ(e) and ρ^−(v) − 1 additions are necessary. Summing over all vertices except A and over m = 0, . . . , M, and with |E| = Σ_{i=1}^{n} Σ_{v ∈ V_i} ρ^−(v), the required number of additions and multiplications, including the computation of the powers of g_i(e), is with |V| ≥ 1 bounded above by a term of order |E| · (M + 1)^2.
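The following sketch of the forward recursion is our own (the names and the edge representation are assumptions, not from the paper); it computes the forward numerators α^(0)(v), …, α^(M)(v) in one pass, for a function f satisfying (2):

```python
from math import comb

def forward_moments(edges_by_depth, M):
    """Forward recursion (4): edges_by_depth[i-1] lists the edges (u, v, lam, g)
    between depth i-1 and depth i, with lambda-label lam and g = g_i(e).
    Returns alpha[v][m] = sum over paths P: A -> v of lam(P) * f(P)**m."""
    alpha = {'A': [1.0] + [0.0] * M}          # alpha^(0)(A) = 1, higher numerators 0
    for depth_edges in edges_by_depth:
        nxt = {}
        for (u, v, lam, g) in depth_edges:
            acc = nxt.setdefault(v, [0.0] * (M + 1))
            for m in range(M + 1):
                # binomial expansion of (f(P) + g)**m, cf. the proof above
                acc[m] += lam * sum(comb(m, l) * g**(m - l) * alpha[u][l]
                                    for l in range(m + 1))
        alpha.update(nxt)
    return alpha

# Tiny rank-2 example: two parallel paths A -> v1/v2 -> B with g-values +/-1 per edge.
edges = [[('A', 'v1', 0.5, +1.0), ('A', 'v2', 0.5, -1.0)],
         [('v1', 'B', 1.0, +1.0), ('v2', 'B', 1.0, -1.0)]]
alpha = forward_moments(edges, M=2)   # alpha['B'] == [1.0, 0.0, 4.0]
```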
In analogy to the forward numerator in Definition 3.1 we can also define a backward numerator: the m-th backward numerator of f at a vertex v is β^(m)(v) := Σ_{P : v → B} λ(P) · (f(P))^m.

Theorem 3.5. The m-th backward numerator β^(m)(v) of a vertex v ∈ V_i can be calculated by the backward recursion analogous to (4),

β^(m)(v) = Σ_{e : init(e) = v} λ(e) · Σ_{l=0}^{m} \binom{m}{l} (g_{i+1}(e))^{m−l} · β^(l)(fin(e)),

with initial values β^(0)(B) = 1 and β^(m)(B) = 0 for m > 0.

Proof. The proof is analogous to the proof of Theorem 3.2.
The normalized quantity θ̄^(m)(T) := α^(m)(B) / α^(0)(B) = β^(m)(A) / β^(0)(A) is the m-th moment of the distribution of function f given T.
In analogy to the BCJR algorithm [2] for calculating symbol probabilities, we next consider the calculation of moments of f introducing a constraint on the value of the c-labels at a certain depth i in the trellis; that is, the moments are calculated in a sub-trellis of T.

Definition 3.6. (Symbol Numerator) The m-th symbol numerator at depth i for the c-label value x is defined as

Ω_i^(m)(x) := Σ_{P : A → B, c_i = x} λ(P) · (f(P))^m,

where c_i = c(e_i) and e_i ∈ E_{i−1,i} is the i-th edge of path P.
Theorem 3.7. The m-th symbol moment can be calculated by Ω̄_i^(m)(x) = Ω_i^(m)(x) / Ω_i^(0)(x) with

Ω_i^(m)(x) = Σ_{e ∈ E_{i−1,i} : c(e) = x} λ(e) · Σ_{k=0}^{m} Σ_{l=0}^{k} \binom{m}{k} \binom{k}{l} (g_i(e))^{m−k} · α^(l)(init(e)) · β^(k−l)(fin(e)).

Proof. Every path P : A → B with c_i = x is of the form P_1 e P_2, where e ∈ E_{i−1,i} with c(e) = x, P_1 : A → init(e) and P_2 : fin(e) → B, so that λ(P) = λ(P_1) · λ(e) · λ(P_2) and, by (2), f(P) = f(P_1) + g_i(e) + f(P_2). Applying the binomial theorem twice and separating the λ-labels we obtain the double sum above, and using the definitions of forward and backward numerators finally yields the assertion of the theorem.
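A sketch of the symbol-numerator computation of Theorem 3.7 (our own code; it assumes the forward numerators alpha[u][l] and backward numerators beta[v][l] have already been computed by the forward and backward recursions):

```python
from math import comb

def symbol_numerator(edges_at_depth_i, alpha, beta, x, m):
    """m-th symbol numerator at depth i under the constraint c(e_i) = x.
    edges_at_depth_i: edges (u, v, lam, g, c) connecting depth i-1 to depth i."""
    total = 0.0
    for (u, v, lam, g, c) in edges_at_depth_i:
        if c != x:
            continue
        for k in range(m + 1):
            for l in range(k + 1):
                total += (lam * comb(m, k) * comb(k, l) * g**(m - k)
                          * alpha[u][l] * beta[v][k - l])
    return total

# The m-th symbol moment follows by normalizing with the flow (the 0-th numerator):
# symbol_numerator(E_i, alpha, beta, x, m) / symbol_numerator(E_i, alpha, beta, x, 0)
```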
The evaluation of Theorem 3.7 requires additional multiplications of the same order as the forward and backward recursions themselves. As we calculate the Ω_i^(m)(x) for several depths i and both c-label values x, the forward and backward numerators need to be computed only once each; moreover, it is advisable to calculate and carry the 0-th numerators (flows) in the logarithmic domain.
Finally in this section, we describe the calculation of distributions over all paths P : A → B, or a subset of paths, in the trellis, in analogy to the calculation of moments and symbol moments, respectively.

Definition 3.9. (Distributions) The forward distribution α^D(v) of a vertex v maps each value q to the sum of the λ-labels of all paths P : A → v with f(P) = q; the backward distribution β^D(v) is defined analogously over the paths P : v → B, and the symbol distribution Ω_i^D(x) over all paths P : A → B with c_i = x.

Theorem 3.10. The forward distributions and the symbol distributions can be calculated by

α^D(v) = Σ_{e : fin(e) = v} λ(e) · (α^D(init(e)) ⊞ g_i(e))

and

Ω_i^D(x) = Σ_{e ∈ E_{i−1,i} : c(e) = x} λ(e) · ((α^D(init(e)) ⊞ g_i(e)) * β^D(fin(e))),

respectively, where ⊞ denotes a shift of the domain. Here, * denotes the convolution operator.
Proof. Theorem 3.10 follows directly from Definition 3.9.
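A sketch of the forward part of Theorem 3.10 (our own code, using a dictionary per vertex as an exact, unquantized representation of the distribution):

```python
from collections import defaultdict

def forward_distribution(edges_by_depth):
    """alpha_D(v)[q] = sum of lambda-labels of all paths P: A -> v with f(P) = q.
    Extending a path by an edge (u, v, lam, g) shifts the domain by g (a
    convolution with a unit pulse at g) and scales the contents by lam."""
    alpha = {'A': {0.0: 1.0}}                 # the empty path has f = 0 and label 1
    for depth_edges in edges_by_depth:
        nxt = {}
        for (u, v, lam, g) in depth_edges:
            dist = nxt.setdefault(v, defaultdict(float))
            for q, weight in alpha[u].items():
                dist[q + g] += lam * weight
        alpha.update({v: dict(d) for v, d in nxt.items()})
    return alpha

# sum(forward_distribution(edges)['B'].values()) equals the flow alpha^(0)(B).
```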

Remark 3. (Density Distributions) When normalizing distributions by the corresponding flow, we obtain density distributions.
Remark 5. By Theorem 3.10, the complexity due to the calculation in the trellis is in general not reduced (except for the hard decision case) as infinite resolution of the domain of α^D(v) etc. is required. However, in Appendix C an algorithm is introduced which approximates Theorem 3.10 and does reduce complexity.
Remark 6. We can determine not only the distribution and its moments of a trellis or sub-trellis, but also those of a single edge.
Remark 7. The symbol distributions for two sub-trellises of the [7 5]_oct convolutional code, namely the sub-codes with the i-th code bit c_i = +1 and c_i = −1, respectively, are given in Example 1. The curves obtained by the Gaussian approximation almost coincide with the ones plotted in Figure 1.

Applications
We will now apply the results of Section 3 to linear block codes. We compute the moments

(7) E_C[(H(c|w))^m | r, c_i = x] = Σ_{c ∈ C} (H(c|w))^m · P(c|r, c_i = x)

of the distribution over all code words c ∈ C given a received word r and the i-th code bit being c_i = x, where H(c|w) is the conditional uncertainty of c given a word w and P(c|r) is the conditional probability of c given r. These moments are required e.g. for the discriminated belief propagation algorithm in [8]. As a special case we can calculate the conditional mean uncertainty or entropy of a code or sub-code given r.
Both for hard decision (BSC) and soft decision (AWGN channel) the conditional uncertainty is linearly related to the correlation cw^T (cf. Appendix B),

(8) H(c|w) = K_1 + K_2 · cw^T,

with K_1 and K_2 being constant functions of the error probability and the vector w (assuming equiprobable code words). Therefore it is sufficient to calculate the moments of the correlation cw^T in the trellis and afterwards apply the binomial theorem to obtain (7). Consider a binary linear block code C of length n which is representable in a trellis, e.g. a terminated convolutional code. Let the c-labels c(e) = c_i ∈ {±1} be the bipolar representation of the code bit labeling edge e. To each path P : A → B there belongs a sequence c(P) of n c-labels representing a code word c ∈ C. Let r = [r_1 r_2 · · · r_n], r_i ∈ R, be the noisy version of a code word c after transmission over a memory-less channel. Let the λ-label of a path P be the conditional probability of the received word r given the code word c, i.e. λ(P) = P(r|c). Furthermore, let the function f of the paths' c-labels, i.e. the function of the code words, be the correlation (inner product) of w ∈ R^n and c,

(9) f(c) = cw^T = Σ_{i=1}^{n} c_i w_i.

Hence g_i(e) = c_i w_i and the separability criterion (2) is fulfilled. In the trellis of C, for each vertex v ∈ V the c-labels c(e) of the edges {e : init(e) = v} emerging from v are distinct. Therefore there is a one-to-one mapping of each code word c to a path P in the trellis, and we can apply the theorems of Section 3, replacing P by c. Applying Bayes' rule to the moments of the correlation (9),

E_C[(cw^T)^m | r, c_i = x] = Σ_{c ∈ C_i(x)} (cw^T)^m · P(c|r, c_i = x) = Σ_{c ∈ C_i(x)} (cw^T)^m · P(r|c) · P(c) / P(r, c_i = x),

and comparing with Definition 3.6 with λ(P) = P(r|c), we observe that Theorems 3.2 and 3.5 hold, and hence these moments can be calculated in the trellis according to Theorem 3.7 as the symbol moments E_C[(cw^T)^m | r, c_i = x] = Ω̄_i^(m)(x). Analogously, when omitting the code bit constraint c_i = x, the moments are given by E_C[(cw^T)^m | r] = Σ_{c ∈ C} (cw^T)^m · P(c|r) = θ̄^(m)(T).
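Once the correlation moments are available from the trellis, the conversion to uncertainty moments via (8) and the binomial theorem is straightforward; the helper below is our own sketch (K_1 and K_2 as in Appendix B):

```python
from math import comb

def uncertainty_moments(corr_moments, K1, K2):
    """Convert moments of the correlation c w^T into moments of the uncertainty
    H(c|w) = K1 + K2 * c w^T:  E[H^m] = sum_l C(m,l) K1^(m-l) K2^l E[(c w^T)^l]."""
    M = len(corr_moments) - 1                 # corr_moments[l] = E[(c w^T)^l | .]
    return [sum(comb(m, l) * K1**(m - l) * K2**l * corr_moments[l]
                for l in range(m + 1))
            for m in range(M + 1)]
```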
The first moments (m = 1) yield the conditional mean uncertainties, i.e. the conditional entropies H(C|r) and H(C_i(x)|r), of the code C and the sub-code C_i(x) = {c ∈ C : c_i = x} given r, respectively. While H(C|r) can also be calculated with the classical BCJR algorithm, this does not hold for the conditional entropy of C_i(x).
Example 5. Figure 1 shows the distributions P(cr^T, c_i = ±1|r) over cr^T for the [7 5]_oct convolutional code of length n = 200, given a noisy received word r after transmission over a BSC with bit error probability p = 0.35. These are the normalized symbol distributions Ω_{i=10}^D(±1) weighted by the probabilities P(c_i = ±1|r). Figure 3 shows a distribution of the terminated [7 5]_oct convolutional code as well as the Gaussian approximation given the first two moments for a BSC.

Conclusions
A trellis represents a general distribution which can be marginalized, e.g. with respect to edge labels. Two algorithms for computations in the trellis were presented: one to calculate distributions, the other to compute their moments, which allow the distributions to be approximated. The latter was derived by generalizing the forward/backward recursion known from the BCJR algorithm. The results were transferred to the concrete problem of computing the moments of the conditional distribution of the correlation between a block code and some given word. The moment calculation algorithm is a prerequisite for an efficient implementation of the discriminated belief propagation algorithm in [8]. It can also be used to calculate the conditional entropy of a code or sub-code. Though not the focus of this paper, the Appendix shows that the algorithm is not restricted to calculations with real numbers but is valid for any commutative semi-ring, thus providing a generalization of the Viterbi algorithm. The asymptotic complexity of the moment computation algorithm is the same as that of the BCJR algorithm.

Appendix A. The idea behind discriminated belief propagation
When an iterative decoder with belief propagation fails, one should increase the amount of information transferred between the constituent decoders. For example, decoding fails when the 'turbo tunnel' in an EXIT chart [9] is closed. Optimally, each constituent decoder transfers the a posteriori probabilities of all code words, yielding the maximum likelihood solution. However, such a decoder has exponential (transfer) complexity. Discriminated Belief Propagation (DBP) is an attempt to only moderately increase the amount of information passed between the constituent decoders. The transfer itself remains symbol based, but additional constraints, for example the distance of the received word to the constituent code words, are added.
Example 7. Let c ∈ C^(a) be words of the overall code C^(a) = C^(1) ∩ C^(2), which is the intersection of the constituent codes C^(l), l = 1, 2. Let r be a noisy version of the word ĉ ∈ C^(a) after transmission over a binary symmetric channel. In both constituent decoders, the joint probabilities P^(l)(cr^T = u, c_i = x | r) for the i-th code bit being c_i = x ∈ {±1} and the correlation being cr^T = u given r are calculated. For some position i_0, bit value x_0 and correlation u_0, let these probabilities be P^(1)(cr^T = u_0, c_{i_0} = x_0 | r) = P_1 > 0 and P^(2)(cr^T = u_0, c_{i_0} = x_0 | r) = 0.
As the transmitted code word ĉ is in both C^(1) and C^(2), it follows that either ĉ_{i_0} ≠ x_0 or the correlation ĉr^T is not u_0 (or both). Hence, this provides more information than the symbol probabilities alone.
Consequently, in DBP not only symbol probabilities are considered, but also the distribution of the distance or correlation to the received word, respectively. As by the law of large numbers the distributions are approximately Gaussian, they can be effectively represented by their mean and variance. This is when the computation of moments in the trellis comes into play.

Appendix B. Relation between uncertainty and correlation
The conditional uncertainty of a code word c given a word w is defined as

H(c|w) := −log_2 P(c|w) = −log_2 P(w|c) + log_2 (P(w)/P(c)) = −log_2 P(w|c) + K_{1a},

where K_{1a} := log_2(P(w)/P(c)) is a constant assuming equiprobable code words. Assuming further that w_i is independent of c_j for i ≠ j, it follows that P(w|c) = Π_{i=1}^{n} P(w_i|c_i).

• For a binary symmetric channel (BSC) with w_i, c_i ∈ {±1} and error probability p, the Hamming distance between c and w is (n − cw^T)/2, which gives

−log_2 P(w|c) = ((n − cw^T)/2) · log_2((1−p)/p) − n · log_2(1 − p).

• For an AWGN channel with noise variance σ^2 we obtain (note that P(w|c) actually is the Gaussian probability density)

−log_2 P(w|c) = (||w||^2 + n − 2 cw^T) / (2σ^2 ln 2) + (n/2) · log_2(2πσ^2).

In either case we can thus express the conditional uncertainty as H(c|w) = K_1 + K_2 · cw^T, that is, the uncertainty is linearly related to the correlation.
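For the BSC case the constants can be read off directly; the following sketch (our own, with a small numerical self-check) computes K_1 and K_2 for the −log_2 P(w|c) part, with the constant K_{1a} simply passed in:

```python
import numpy as np

def bsc_uncertainty_constants(n, p, K1a=0.0):
    """K1, K2 with H(c|w) = K1 + K2 * c w^T for a BSC with error probability p
    and bipolar words of length n."""
    K2 = -0.5 * np.log2((1 - p) / p)
    K1 = K1a + (n / 2) * np.log2((1 - p) / p) - n * np.log2(1 - p)
    return K1, K2

# Numerical check of the linear relation for a random +/-1 pair (c, w):
rng = np.random.default_rng(0)
n, p = 16, 0.1
c = rng.choice([-1, 1], n)
w = np.where(rng.random(n) < p, -c, c)        # w = c with bits flipped w.p. p
d = np.sum(c != w)                            # Hamming distance
K1, K2 = bsc_uncertainty_constants(n, p)
assert np.isclose(-np.log2(p**d * (1 - p)**(n - d)), K1 + K2 * float(c @ w))
```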

Appendix C. Computing the actual distribution
For a trellis of rank n and g_i(e) ∈ {±1}, which is the case for hard decision decoding, the domain of the distributions, i.e. the set of values that f(P) can take, is D = {−n, −n + 2, . . . , n − 2, n} with cardinality |D| = n + 1. In this case the distributions can be directly implemented as vectors of length n + 1. A shift ⊞ of the domain is simply a shift of the vector contents, and the convolution operation * is discrete.
In case of soft decision, the domain needs to be quantized. For Gaussian-like distributions, an efficient way for uniform mid-tread quantization is to carry along the mean value µ of the distribution and to arrange the partitions equally to both sides of it, storing the partition contents in vectors d. When extending a path P by an edge e ∈ E_{i−1,i} in the forward/backward recursion (lengthening), the domain of f(P) is shifted by g_i(e), i.e. g_i(e) is added to the mean value µ. However, when joining paths in a vertex, the mean values of the incoming path distributions usually do not coincide. Hence a new mean value µ_new has to be determined and the partition contents need to be redistributed onto the partitions centred on µ_new. Here a += b denotes the addition of b to a, i.e. a = a + b. The forward distribution vector α^D(A) is initialized with the all-zero vector and a single '1' in the center position. The backward distribution is computed analogously.
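The re-centering step when two path distributions are joined can be sketched as follows (our own code; moving each bin content to the nearest bin of the new grid is one possible choice of redistribution):

```python
import numpy as np

def recenter(d, mu, mu_new, step):
    """Shift the quantized distribution vector d (centred on mu, uniform bin
    width 'step') onto a grid centred on mu_new."""
    out = np.zeros_like(d, dtype=float)
    offset = int(round((mu - mu_new) / step))
    for k, val in enumerate(d):
        j = k + offset
        if 0 <= j < len(out):
            out[j] += val                     # out[j] += val, as in the text
    return out

def join(d1, mu1, d2, mu2, step):
    """Join two incoming distributions at a vertex: choose a new mean (here the
    flow-weighted average) and accumulate the re-centred partition contents."""
    w1, w2 = d1.sum(), d2.sum()
    mu_new = (w1 * mu1 + w2 * mu2) / (w1 + w2)
    return recenter(d1, mu1, mu_new, step) + recenter(d2, mu2, mu_new, step), mu_new
```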
Appendix D. Generalization to computations on a semi-ring

In the main part of this paper, the computation of moments in the trellis is introduced for real numbers. However, the algorithm is valid for the more general algebraic structure of commutative semi-rings. The 0-th forward moment then results in the Viterbi algorithm on semi-rings.
Let the λ-label and the c-label come from an algebraic set S which is closed under the two binary operations ⊕ and ⊙, called addition and multiplication, which satisfy the following axioms:
• The operation ⊙ is associative and commutative, and there is an identity element 1_⊙ such that s ⊙ 1_⊙ = 1_⊙ ⊙ s = s for all s ∈ S, making (S, ⊙) a commutative monoid.
• The operation ⊕ is associative and commutative, and there is an identity element 0_⊕ such that s ⊕ 0_⊕ = 0_⊕ ⊕ s = s for all s ∈ S, making (S, ⊕) a commutative monoid.
• The distributive law (x ⊕ y) ⊙ z = (x ⊙ z) ⊕ (y ⊙ z) holds for all triples (x, y, z) from S.
• The identity element 0_⊕ of the addition annihilates S, i.e. 0_⊕ ⊙ s = s ⊙ 0_⊕ = 0_⊕ for all s ∈ S.
The triple (S, ⊙, ⊕) is called a commutative semi-ring. Let a, b ∈ (S, ⊙, ⊕) be elements of such a commutative semi-ring. We define the following notation: a^m := a ⊙ a ⊙ · · · ⊙ a (m factors, with a^0 := 1_⊙) and k · a := a ⊕ a ⊕ · · · ⊕ a (k summands, with 0 · a := 0_⊕), with the binomial coefficient \binom{m}{l} ∈ N_0 = N ∪ {0}. In analogy to Definition 3.1 and Theorem 3.2 we can now define the forward numerator and its calculation on a semi-ring.
We define the m-th forward numerator of a function f ∈ (S, ⊙, ⊕) at vertex v of a trellis T as

α^(m)(v) := ⊕_{P : A → v} λ(P) ⊙ (f(P))^m.

The m-th forward moment α^(m)(v) of a vertex v ∈ V_i on depth i can be recursively calculated on a trellis T and a commutative semi-ring (S, ⊙, ⊕) by

α^(m)(v) = ⊕_{e : fin(e) = v} λ(e) ⊙ ⊕_{l=0}^{m} \binom{m}{l} · ( (g_i(e))^{m−l} ⊙ α^(l)(init(e)) ),

with initial values α^(0)(A) = 1_⊙ and α^(m)(A) = 0_⊕ for m > 0, for all functions f(P : A → v) and g_j, j = 1, . . . , i, which fulfill

(13) f(P) = f(e_1 e_2 · · · e_i) = g_1(e_1) ⊕ g_2(e_2) ⊕ . . . ⊕ g_i(e_i).
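As an illustration of the semi-ring formulation (our own sketch), the 0-th forward numerator computed with ⊕ = max and ⊙ = + is exactly the Viterbi best-path recursion, while ordinary addition and multiplication recover the flow of the BCJR algorithm:

```python
def forward_flow(edges_by_depth, add, mul, zero, one):
    """0-th forward numerator on a commutative semi-ring (S, mul, add):
    alpha(v) = 'sum' over all paths P: A -> v of the 'product' of the edge labels."""
    alpha = {'A': one}                        # identity of 'mul' at the root vertex A
    for depth_edges in edges_by_depth:
        nxt = {}
        for (u, v, lam) in depth_edges:
            nxt[v] = add(nxt.get(v, zero), mul(alpha[u], lam))
        alpha.update(nxt)
    return alpha

# Viterbi on the (max, +) semi-ring: edge labels are log-likelihoods (hypothetical values).
edges = [[('A', 'v1', -0.2), ('A', 'v2', -1.6)],
         [('v1', 'B', -0.9), ('v2', 'B', -0.1)]]
best = forward_flow(edges, add=max, mul=lambda a, b: a + b,
                    zero=float('-inf'), one=0.0)
# best['B'] == -1.1, the metric of the best path A -> v1 -> B
```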