Reduced-Latency and Area-Efficient Architecture for FPGA-Based Stochastic LDPC Decoders

This paper introduces a new field programmable gate array (FPGA) based stochastic low-density parity-check (LDPC) decoding process for implementing fully parallel LDPC decoders. The proposed technique is designed to optimize the FPGA logic utilisation and to decrease the decoding latency. To reduce the complexity, the variable node (VN) output saturated counter is removed and each VN internal memory is mapped into a single slice distributed RAM. Furthermore, an efficient VN initialization, using the channel input probability, is performed to improve the decoder convergence without requiring additional resources. The Xilinx FPGA implementation shows that the proposed decoding approach reaches high performance along with a reduction of logic utilisation, even for short codes. As a result, for a (200, 100) regular code, a 57% reduction of the average number of decoding cycles is attained with a significant bit error rate improvement at Eb/N0 = 5.5 dB. Additionally, a significant hardware reduction is achieved.
Keywords: Stochastic decoding; low-density parity-check (LDPC) decoder; field programmable gate array (FPGA)


I. INTRODUCTION
The need to increase the throughput and capacity of modern communication systems, for optical and wireless networks, requires high-performance error-correcting codes. In 1962, Gallager presented the first version of low-density parity-check (LDPC) codes [1], offering an excellent means of correcting the transmission errors introduced by channel effects. The near-Shannon-capacity decoding performance of LDPC codes [2] justifies their adoption by various digital communication standards, including WiMAX (IEEE 802.16e), DVB-S2, WiFi (IEEE 802.11) and 10GBASE-T (IEEE 802.3an).
A variety of LDPC decoding implementations have been explored to achieve high throughput [3]-[5]. It has been shown that the highest throughput is achieved by fully parallel decoding solutions; nevertheless, they increase the hardware complexity. To overcome this drawback, several reduced-complexity and stochastic LDPC decoding algorithms have been developed [6]-[8]. Current stochastic decoding algorithms have confirmed their suitability for the fully parallel decoding approach [9]-[18]. Moreover, for additional silicon area reduction, diverse stochastic LDPC decoding architectures and strategies have been proposed.
However, an area-efficient architecture for an ASIC-based stochastic LDPC decoder does not systematically produce efficient FPGA logic utilisation. An ASIC implementation of a six-bit counter clearly requires less silicon area than a 32-bit memory. Nevertheless, the opposite result is obtained with an FPGA implementation: on a Xilinx FPGA, a 32-bit memory can be implemented using only one LUT, in contrast with the logic utilisation of the counter. This paper introduces a new FPGA-based stochastic LDPC decoding process for implementing fully parallel LDPC decoders. The proposed technique is designed to optimize the FPGA logic utilisation, to decrease the decoding latency and to improve the convergence, even for short codes. To validate the advantage of the proposed approach, the decoder is implemented on a Xilinx Virtex-6 VLX240T FPGA.
The paper is organized as follows. In Section II, an overview of stochastic LDPC decoding is provided. In Section III, the architecture of the proposed stochastic LDPC decoder is introduced. Results of the FPGA implementation and performance are presented in Section IV, and finally a conclusion is given in Section V.

II. LDPC STOCHASTIC DECODERS
The design of an LDPC decoder is based on the M×N parity-check matrix H, where N is the number of variable nodes (VNs) and M is the number of check nodes (CNs). To encode K information bits, an (N, K) LDPC code uses N encoded bits, where N > K. An LDPC decoder can be represented by a factor graph that uses N VNs and (N-K) CNs. A degree-dv VN has (dv+1) ports: one receives the channel probability and the other dv are connected to different CNs by bidirectional ports. In the same way, a degree-dc CN has dc bidirectional ports, which are connected to different VNs, and one parity-check output port. A conventional fully parallel LDPC decoder uses fixed-point operands to represent the probabilities exchanged between the VNs and the CNs of the factor graph.
Stochastic LDPC decoders operate by a bit-serial iterative process. In this architecture, the probabilities P_ch received from the channel are converted to Bernoulli sequences of random bits. Different stochastic sequences can encode the same probability. For a Bernoulli sequence {a_i} of m bits, in which a_i ∈ {0, 1}, the estimated probability value is computed as:

$$P \approx \frac{1}{m}\sum_{i=1}^{m} a_i \qquad (1)$$

Let Cout, Pcout_i and Pcin_i be the parity-check output, the CN output probabilities and the CN input probabilities respectively, where Pcin_i = Pr(cin_i = 1) is the probability of each CN input, i ∈ {1, 2, ..., dc} and dc is the CN degree. Since each CN output bit is the XOR of the other dc-1 input bits, the output probability Pcout_i can be computed as:

$$P_{cout_i} = \frac{1}{2}\left[1 - \prod_{j \neq i}\left(1 - 2P_{cin_j}\right)\right] \qquad (2)$$
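As an illustration of the probability-to-stream conversion and of the estimate in (1), the following is a minimal Python sketch. It assumes a software pseudo-random generator in place of the hardware comparator and random number generator; the names `to_bernoulli` and `estimate_probability` are hypothetical and not taken from the paper.

```python
import random

def to_bernoulli(p_ch, m, rng):
    """Comparator-style conversion: emit m bits with Pr(bit = 1) = p_ch."""
    return [1 if rng.random() < p_ch else 0 for _ in range(m)]

def estimate_probability(bits):
    """Equation (1): the fraction of ones in the sequence."""
    return sum(bits) / len(bits)

rng = random.Random(0)  # fixed seed, used only to make the run repeatable
stream = to_bernoulli(0.75, 10_000, rng)
p_hat = estimate_probability(stream)
print(f"estimated probability: {p_hat:.3f}")
```

Longer sequences give a tighter estimate, which is one reason why the number of decoding cycles matters for stochastic decoders.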
The parity-check output Cout of a degree-dc CN is the XOR of all its input bits and can be computed according to (3):

$$C_{out} = \bigoplus_{i=1}^{d_c} cin_i \qquad (3)$$

Fig. 1 illustrates the structure of a dc = 4 CN used in the conventional stochastic decoder.
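The check node behaviour described above can be sketched in Python as follows. The sketch assumes that each CN output bit is the XOR of the other input bits and that the input streams are independent, in which case the probability of an output bit being one has the closed form ½[1 − ∏(1 − 2p_j)]. The function names are hypothetical, and the simulation is only a sanity check of that closed form, not a model of the hardware.

```python
import random
from functools import reduce
from operator import xor

def cn_out(cin, i):
    """Output bit i: XOR of every input bit except input i."""
    return reduce(xor, (b for j, b in enumerate(cin) if j != i), 0)

def parity(cin):
    """Parity-check output: XOR of all input bits."""
    return reduce(xor, cin, 0)

def cn_out_probability(p_in, i):
    """Closed form for Pr(output i = 1), assuming independent inputs."""
    prod = 1.0
    for j, p in enumerate(p_in):
        if j != i:
            prod *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - prod)

# Monte Carlo check for a degree-4 check node (seed fixed for repeatability).
rng = random.Random(1)
p_in = [0.9, 0.2, 0.6, 0.3]
n = 20_000
ones = 0
for _ in range(n):
    cin = [1 if rng.random() < p else 0 for p in p_in]
    ones += cn_out(cin, 0)
empirical = ones / n
expected = cn_out_probability(p_in, 0)
print(f"empirical {empirical:.3f}, closed form {expected:.3f}")
```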
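Let Pvin_1 and Pvin_2 be the probabilities of the two input bits of a dv = 2 VN. The variable node output probability Pvout can be computed as:

$$P_{vout} = \frac{P_{vin_1} P_{vin_2}}{P_{vin_1} P_{vin_2} + \left(1 - P_{vin_1}\right)\left(1 - P_{vin_2}\right)} \qquad (4)$$

If the VN inputs are the same, the state is called the agreement state and one of the input bits is transmitted to the output. When the inputs are not identical, the variable node requires an advanced method to generate the output bit. This state is called the hold state or disagreement state, and one of the advanced stochastic methods of bit generation (ASMBG) can be used:

$$vout(t) = \begin{cases} vin_1(t), & vin_1(t) = vin_2(t) \\ \text{bit generated by an ASMBG}, & \text{otherwise} \end{cases} \qquad (5)$$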
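The agreement and hold behaviour of a dv = 2 variable node can be illustrated with the short Python sketch below. As a simplification, the hold state here simply repeats the previous output bit; this plain hold rule is a stand-in for the advanced bit generation methods mentioned above, not the method used in the paper. For independent input streams, the long-run fraction of ones at the output approaches p1·p2 / (p1·p2 + (1−p1)(1−p2)). All names are hypothetical.

```python
import random

def vn_update(vin1, vin2, previous):
    """Agreement: forward the common bit. Disagreement: hold the last output."""
    if vin1 == vin2:
        return vin1
    return previous

def vn_probability(p1, p2):
    """Output probability of a degree-2 VN with independent inputs."""
    agree_one = p1 * p2
    agree_zero = (1.0 - p1) * (1.0 - p2)
    return agree_one / (agree_one + agree_zero)

# Monte Carlo check (seed fixed for repeatability).
rng = random.Random(2)
p1, p2 = 0.8, 0.6
n = 50_000
out, ones = 0, 0
for _ in range(n):
    a = 1 if rng.random() < p1 else 0
    b = 1 if rng.random() < p2 else 0
    out = vn_update(a, b, out)
    ones += out
empirical = ones / n
print(f"empirical {empirical:.3f}, formula {vn_probability(p1, p2):.3f}")
```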
Fig. 2 shows the principal structure of recent stochastic variable nodes. The stochastic LDPC decoding algorithm can be summarised as follows:

Algorithm 1 Stochastic LDPC decoding
Initialization
1. Load the probabilities P_ch corresponding to the LLRs for each variable node (one decoding cycle, DC) and transform P_ch into a Bernoulli sequence a_i (each DC).
2. Initialize the variable node internal memories (16 to 32 DCs for a 32-bit memory [10]) or the internal saturated counter (one DC [16]).
Iterations
3. Variable to check node: at each decoding cycle, each variable node processes its input bits using (5) and sends its output bits to the corresponding check nodes.
4. Check to variable node: at each decoding cycle, each check node processes its input bits using (2) and sends its output bits to the corresponding variable nodes. Simultaneously, the check nodes send their output states, computed using (3), to the syndrome checker.

III. PROPOSED LDPC STOCHASTIC STRUCTURE
As mentioned in the introduction, an efficient ASIC-based architecture does not systematically provide the best approach for an efficient FPGA implementation. In this section, we present the new stochastic LDPC decoding method, which aims to improve the decoder performance and to reduce the FPGA resource utilisation.
It has been shown that a stochastic LDPC decoder whose VNs use the latest output bits as code bits provides BER performance similar to the version with saturating up/down counters as the VN output decision mechanism [17]-[18]. Furthermore, it has been demonstrated that initializing the first VN output bit, transmitted to the hard decision unit, according to the received channel probability helps to improve the stochastic decoder convergence [15]. The proposed VN exploits these two characteristics and also adopts an internal memory-based approach, similar to the DS and EM versions. The converted Bernoulli sequences are used as variable node inputs. All variable node outputs are initialized by the one-bit coded channel probability during the loading of the probabilities corresponding to the channel log-likelihood ratios (LLRs). Each VN internal memory is mapped in one FPGA LUT RAM. The VN output probability is computed according to (6), where Pvout(1) is the first-iteration VN output probability. During the disagreement state, the proposed architecture generates a new bit based on a random bit selection from the VN internal memory (IM). The IM length can be increased up to the FPGA LUT RAM size. Based on (6), the hardware implementation of the improved decoding approach does not require extra hardware complexity on FPGA devices. Moreover, the proposed technique computes the received probability without any additional decoding cycle.
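The behaviour described in this paragraph can be sketched in Python as below. This is a rough behavioural model under several assumptions that are not stated in the text: the one-bit coded channel probability is taken as the most significant bit of a k-bit quantized P_ch, the internal memory is pre-filled with that same bit, and agreement bits are shifted into the memory in the manner of an edge-memory design. The class name `ProposedVNSketch` and its parameters are hypothetical.

```python
class ProposedVNSketch:
    """Behavioural sketch of a VN with an internal memory (IM) and no output counter."""

    def __init__(self, p_ch_quantized, k=8, im_bits=16):
        # Assumption: the initial output is the MSB of the k-bit channel probability.
        msb = (p_ch_quantized >> (k - 1)) & 1
        self.output = msb
        # Assumption: the IM starts filled with the same hard-decision bit.
        self.im = [msb] * im_bits

    def step(self, in_bits, rand_addr):
        """Process one decoding cycle and return the new output bit."""
        if all(b == in_bits[0] for b in in_bits):
            # Agreement state: forward the common bit and store it in the IM.
            bit = in_bits[0]
            self.im = [bit] + self.im[:-1]
        else:
            # Disagreement state: read a randomly addressed bit from the IM.
            bit = self.im[rand_addr % len(self.im)]
        self.output = bit
        return bit

vn = ProposedVNSketch(p_ch_quantized=200)   # 200/256 is above one half
first_output = vn.output                    # available before any iteration
agree_bit = vn.step([0, 0], rand_addr=3)    # agreement on 0
hold_new = vn.step([1, 0], rand_addr=0)     # reads the bit just stored
hold_old = vn.step([1, 0], rand_addr=5)     # reads an older stored bit
```

In this model the decision bit is simply the latest output, so no saturating counter is needed at the VN output.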
The P_ch(k-1) signal is transmitted to the VN output DFF during the first process cycle. After the first iteration and until the last one, the multiplexer sends the variable node processor output bit to the VN output DFF. In this way, as in the CSS process, the majority of variable node outputs start with a correct bit and bypass the random stochastic initialization. The new CN uses a computing process similar to the DS approach. All CNstate outputs are connected to the syndrome checker unit. The parity-check output state CNstate of a degree-dc CN is computed in the same way as (3) and can be written as follows:

$$CNstate = \bigoplus_{i=1}^{d_c} CNin_i \qquad (8)$$

where ⊕ is the bitwise XOR operation and dc is the parity-check node degree. Each CN output signal CNout_i is produced from the CNin_i and State_i signals according to (9), and the CNstate result given by (8) can be exploited in this computation.
In (9), ⋁ is the bitwise NOR operation and ⋀ is the bitwise AND operation.
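One way in which the CNstate result might be exploited is sketched below: because XOR is its own inverse, the XOR of all inputs except input i equals CNstate XOR CNin_i, so the dc outputs can share a single XOR tree. This is an assumption about the intent of the text, the State_i signals are deliberately left out, and the function names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor

def cn_state(cn_in):
    """Parity-check state: XOR of all CN inputs."""
    return reduce(xor, cn_in, 0)

def cn_out_from_state(cn_in):
    """Each output reuses the shared state: state XOR own input."""
    state = cn_state(cn_in)
    return [state ^ b for b in cn_in]

def cn_out_direct(cn_in):
    """Reference version: XOR of all other inputs, computed separately."""
    return [reduce(xor, cn_in[:i] + cn_in[i + 1:], 0)
            for i in range(len(cn_in))]

# Exhaustive comparison for a degree-6 check node (64 input patterns).
all_match = all(
    cn_out_from_state(list(bits)) == cn_out_direct(list(bits))
    for bits in product((0, 1), repeat=6)
)
print("shared-state outputs match direct outputs:", all_match)
```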
The new stochastic LDPC decoding algorithm is summarised in Algorithm 2.

IV. IMPLEMENTATION RESULTS AND PERFORMANCE
It has been demonstrated that enlarging the VN internal memories improves the convergence of stochastic LDPC decoding [10]-[11]. However, adding memory capacity generally implies extra hardware complexity and resources. The FPGA organization and implementation need special consideration: in addition to slice resources, memory can be mapped using Block RAMs or Distributed RAMs (LUT RAM).
The main target of the proposed structure is to improve the FPGA-based LDPC decoding performance without supplementary FPGA resources. To confirm the improvement of the new design, medium (1024, 512) and short (200, 100) LDPC codes are implemented on a Xilinx Virtex-6 VLX240T FPGA device with various methods.

The implemented CNs use a structure similar to the CNs adopted by the DS and CSS decoders. Fig. 5 gives the main structure of a degree-6 CN. The results of the FPGA implementation of the (200, 100) and (1024, 512) regular LDPC codes, with the one-step initialized counter-based VN [16], DS and the proposed approach, are shown in Table 1. The EM version gives an FPGA implementation result close to the DS version. On a Xilinx Virtex-6 FPGA, a memory of 2×1 bits up to 64×1 bits uses only one LUT. Therefore, the proposed LDPC decoder can be implemented using up to a 64-bit VN internal memory without requiring additional FPGA resources. As can be seen, the FPGA implementation of the one-step initialized counter-based decoder needs additional resources compared to the EM and DS versions. This disadvantage is caused by the use of initializing counters instead of the VN internal memories used in DS. The additional reduction of FPGA logic utilisation seen for the proposed decoder is principally obtained by removing the VN output saturated counter.
Fig. 6 presents the block diagram of the proposed (1024, 512) stochastic LDPC decoder. The main units are the variable node unit, the parity-check node unit and the syndrome checking unit. The VN unit and the CN unit exchange stochastic information until a correct codeword or the maximum number of iterations is reached. The correct codeword is detected by the syndrome checking unit and the maximum number of iterations is signalled by the iteration counter unit. The outputs of the random number generator are used with the VN comparators to generate the Bernoulli sequences. Furthermore, they directly drive the address buses of the FPGA distributed RAM used as VN internal memory.
Fig. 7 shows the scaled FPGA logic utilisation rate. The proposed decoder achieves an average reduction of about 50% compared to the one-step initialized counter decoder and an average reduction of about 35% compared to the DS and EM decoders.
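The sharing of one random number generator between the VN comparators and the internal memory address buses can be sketched as follows. The text does not specify the generator; a 16-bit Galois LFSR with the well-known maximal-length mask 0xB400 is assumed here purely for illustration, as are the 8-bit comparator width, the 16-bit memory and the names `lfsr16_step` and `vn_signals`.

```python
def lfsr16_step(state):
    """One step of a 16-bit Galois LFSR (feedback mask 0xB400, period 65535)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

def vn_signals(state, p_ch_q, k=8, im_bits=16):
    """Derive a Bernoulli bit and an IM address from the same random word."""
    rnd = state & ((1 << k) - 1)           # k-bit random value for the comparator
    bernoulli_bit = 1 if rnd < p_ch_q else 0
    address = state & (im_bits - 1)        # im_bits is assumed to be a power of two
    return bernoulli_bit, address

# Run the generator over one full period with P_ch quantized to 192/256 = 0.75.
seed = 0xACE1
state = seed
ones = 0
period = 0
while True:
    bit, addr = vn_signals(state, p_ch_q=192)
    ones += bit
    period += 1
    state = lfsr16_step(state)
    if state == seed:
        break
fraction = ones / period
print(f"period {period}, fraction of ones {fraction:.4f}")
```

In this sketch the comparator and the address reuse the same low-order bits, so the two signals are correlated; a hardware design could tap different bits of the generator if that matters.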

Algorithm 1 (continued)
5. If xH^T = 0 or the maximum number of DCs is reached, terminate the decoding process. Otherwise, go to Step 3.
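The termination test in this step can be sketched in Python as below, with the syndrome computed over GF(2). The small parity-check matrix is a made-up example for illustration only and is not one of the codes used in the paper; the function names are hypothetical.

```python
def syndrome_is_zero(H, x):
    """Return True when every parity-check row of H is satisfied by x (mod 2)."""
    return all(sum(h & b for h, b in zip(row, x)) % 2 == 0 for row in H)

def should_stop(H, x, cycle, max_cycles):
    """Stop on a valid codeword or when the decoding cycle budget is spent."""
    return syndrome_is_zero(H, x) or cycle >= max_cycles

# Toy 3x6 parity-check matrix (illustrative only).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
valid = [1, 1, 0, 0, 1, 1]     # satisfies all three checks
invalid = [1, 0, 0, 0, 0, 0]   # violates the first and third checks

print(should_stop(H, valid, cycle=1, max_cycles=100))
print(should_stop(H, invalid, cycle=1, max_cycles=100))
```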

Fig. 3 represents the main structure of the proposed variable node. As in the EM and DS designs, one decoding cycle (DC) corresponds to one iteration of the proposed stochastic LDPC decoding. The proposed CN has two categories of input signals and two outputs. The input signals are the CNin_i signals and the State_i signals, in which i ∈ {1, 2, ..., dc}. Both are provided by the VN output signals. The first outputs are the CNout_i signals, which are sent to the VN inputs. The second output is the parity-check output state CNstate.

Fig. 3. The structure of the proposed stochastic variable node.

Algorithm 2 The proposed stochastic LDPC decoding
Initialization
1. Load the probabilities P_ch corresponding to the LLRs simultaneously with initializing the variable node output (one DC) and transform P_ch into a Bernoulli sequence a_i (each DC).
Iterations
2 and 3. Similar to Step 3 and Step 4 of Algorithm 1.
4. If xH^T = 0 or the maximum number of DCs is reached, terminate the decoding process. Otherwise, go to Step 2.

Fig. 4 presents the block diagram of the new degree-3 variable node of the proposed (1024, 512) and (200, 100) LDPC codes. The degree-3 variable node is composed of three degree-3 sub-nodes. The majority state and the random address signals are connected to all sub-nodes. One of the three inputs of each degree-3 sub-node is connected to the probability signal through a comparator. The other two sub-node inputs are connected to check node outputs. The three S signals are combined to produce the majority state signal. The FPGA implementation
[Legend of compared decoders, partially recovered: ... output VN saturation counter and 2-bit IM; CSSD with output VN saturation counter and 2-bit IM; proposed decoder with 16-bit IM and without VN saturation counter; EM with M = 16 and without VN saturation ...]