Review on 5G NR LDPC Code: Recommendations for DTTB System

The freezing of the 5th generation new radio (5G NR) Release 16 indicates that 5G development has stepped into a new stage. The application of a dedicated low-density parity-check (LDPC) code for channel coding is an important technical advance that distinguishes 5G NR from the 4th generation (4G) long-term evolution (LTE) and LTE-advanced. Although LDPC codes have been used in many different systems, the newly developed LDPC code in 5G NR integrates many cutting-edge technologies to provide better performance and attractive features. Thus, it can be a good reference for channel coding in other evolving systems, headed by digital terrestrial television broadcasting (DTTB). In emerging applications, the DTTB system needs to carry information with higher density while meeting the demanding broadcasting requirements for real-time operation, coverage, and bit error rate. To provide a reference for DTTB channel coding that improves its performance and supports new services, a review of 5G NR LDPC code implementation is carried out from three aspects: code analysis and design, decoding algorithm, and decoder architecture. We evaluate each solution and highlight candidates for existing implementations as well as directions for further development of the DTTB system.


I. INTRODUCTION
As a new generation standard of mobile communication, the 5th generation (5G) should be designed to meet the demands of multiple application scenarios, including enhanced mobile broadband (eMBB), ultra reliable low latency communications (URLLC), and massive machine type communications (mMTC) [1]. On the technical level, this means significant improvement in data rate, latency, compatibility, and many other key performance indicators. Specified by the 3rd Generation Partnership Project (3GPP), 5G New Radio (NR) is a standard developed for 5G air interfaces to meet the above technical requirements. In contrast to the Turbo code used in 4G long-term evolution (LTE) and LTE-advanced (LTE-A), the low-density parity-check (LDPC) code is introduced in 5G NR as the channel coding scheme for the data channel [2], which is one of the landmark technical achievements.
After decades of research since it was first proposed in [3], the LDPC code has been widely used in many wireless communication scenarios, such as deep space communication [4], wireless local area networks (LANs) [5], and digital terrestrial television broadcasting (DTTB) [6] [7] [8]. The DTTB system, as a broadcasting system, has special technical requirements. For instance, area coverage is a characteristic criterion in the DTTB system, and a lower bit error rate is required in the physical layer because upper-layer control is relatively limited. With the continuous expansion of demand for service quality and variety, higher spectrum efficiency and throughput are also desired. To improve performance, DTTB standards have been constantly evolving in recent years. Following the digital video broadcasting-terrestrial 2 (DVB-T2) standard, the American Advanced Television Systems Committee (ATSC) officially promulgated ATSC 3.0 in 2017 [6], and Chinese digital television terrestrial multimedia broadcasting-advanced (DTMB-A) was accepted as the second-generation DTTB standard by ITU in 2019 [8].
The newly developed 5G NR LDPC code integrates both existing and emerging technologies, leading to better performance. Thus, utilizing its characteristics and implementation methods to enhance the coding scheme of the evolving DTTB system is a subject worthy of investigation. For simplification, the 5G NR LDPC code and LDPC codes in existing DTTB standards are represented by 5G-LDPC and DTTB-LDPC in the following, respectively.
To provide a reference, we focus on not only the specific 5G-LDPC but also its complete implementation process, including code analysis and design, encoding/decoding algorithm, and encoder/decoder architecture. It should be pointed out that owing to structural designs such as quasi-cyclic (QC), raptor-like (RL), and irregular-repeat-accumulate (IRA) in 5G-LDPC and DTTB-LDPC, encoding is relatively easy to implement [9] [10] [11]. On the other hand, the decoding part often determines the performance of the entire transmission system and still has a lot of room for research. Therefore, we mainly focus on the decoding algorithm and decoder implementation.
The three aforementioned parts are not independent. Some constraints among them are given in [12], but it focuses on the decoder architecture and lacks discussion of the other two parts. Based on [12], a more detailed description is provided in Fig. 1. Code analysis and design are always the primary parts of the process. Analytical tools are utilized to evaluate the average asymptotic thresholds and potential error floors for different code ensembles under the constraints on the parity-check matrix (PCM) with structural features and parameters. We define such constraints as "structure", which is marked in red in Fig. 1. After determining the structure, a construction algorithm is executed to find a specific code with good performance and desirable features in the ensemble.
In terms of decoding algorithms, most existing methods can be classified as message passing (MP) algorithms. In the Tanner graph, each row of the PCM corresponds to a check node (CN), each column corresponds to a variable node (VN), and each nonzero element in the matrix corresponds to an edge. In hardware implementation, VN and CN refer to two types of operation units, named VNU and CNU, respectively, which process incoming messages and pass the results along the edges to other units. Thus, the decoding algorithm is divided into the operational method (the method for node operations) and the scheduling. The former focuses on the operations in two types of nodes and message representation, while the latter focuses on the order of these operations.
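The correspondence between the PCM and the Tanner graph can be sketched in a few lines. The following is a minimal illustration, assuming a toy 3 × 6 PCM that is not taken from any standard; the function name `tanner_sets` and the list names `N` and `M` are chosen here for illustration only.

```python
import numpy as np

def tanner_sets(H):
    """Return (N, M): N[i] lists the VNs adjacent to CN i, M[j] lists the CNs adjacent to VN j."""
    H = np.asarray(H)
    N = [np.flatnonzero(row).tolist() for row in H]    # one entry per PCM row (CN)
    M = [np.flatnonzero(col).tolist() for col in H.T]  # one entry per PCM column (VN)
    return N, M

# Toy PCM, chosen only for illustration (assumption, not from any standard).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])
N, M = tanner_sets(H)
num_edges = int(H.sum())  # every nonzero PCM entry is one edge
```

Each edge appears once in some `N[i]` and once in some `M[j]`, which is why the messages exchanged by the CNUs and VNUs can be indexed by the pair (i, j).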
Structure designing should be carried out at the very beginning to provide a foundation for the following parts. Considering the major goal of having good performance, irregular codes involving some typical structures, such as RL and IRA, have been widely used in previous works, whose performances are validated by multi-edge-type density evolution (MET-DE) [13]. As shown in Fig. 1, structure designing should also consider other implementation-related requirements. Some structural designs, e.g., QC and protograph-based design [14], are preferred to narrow the search space in code construction, facilitate code description, simplify scheduling, and facilitate parallelism in both the encoder and decoder.
Moreover, the application scenarios also impel the structure designing. The above structures guarantee good performance and high decoding throughput requirements in both 5G NR and DTTB applications. 5G-LDPC has QC and RL features, while ATSC 3.0, as an example of DTTB standards, also uses the QC-LDPC code, which has the RL or IRA structure for different code rates. In addition, to adapt to various 5G application scenarios and make it competitive with the 4G LTE Turbo code, the higher flexibility of multi-length and multi-rate (multi-mode) should be supported by 5G-LDPC. In contrast, the DTTB system adopts only limited predetermined modes chosen from many available mode options, which means a relatively low demand for flexibility. However, considering the application of layered division multiplexing [15], the compatibility of code length required by the control channel [16], and so on, the guarantee of flexibility is still important to the DTTB system.
It can be concluded that MET, QC, and multi-mode support are applied to both 5G-LDPC and DTTB-LDPC. Thus, we review the existing solutions and noteworthy recent works on 5G-LDPC implementation considering these characteristics. Some candidates are identified and recommended for the DTTB system. Sections II, III, and IV focus on code analysis and design, decoding algorithm, and decoder architecture, respectively. Finally, Section V concludes the paper.

II. CODE ANALYSIS AND DESIGN
Different from the structure designing discussed above, this section discusses the process of further narrowing down the code ensemble under a given structure (MET, QC, multi-mode) until a specific code (PCM) is obtained, which corresponds to "construction algorithm" in Fig. 1. For QC-LDPC codes, a two-step design for the base matrix (base graph) and lifting matrix (lifting factor) is utilized.
Threshold performance is the primary criterion for code design and is guaranteed by the base graph design, especially for middle or long code lengths. Owing to the structural design in demand, the traditional searching algorithm can be utilized under the given structure with limited search space. Analytical tools, e.g., MET-DE, are used to evaluate the threshold performance for each searching stage.
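To make the idea of evaluating a threshold at each searching stage concrete, the sketch below uses a deliberately simpler stand-in for MET-DE: single-edge-type density evolution for a regular (dv, dc) ensemble on the binary erasure channel, with the threshold located by bisection. The channel model, the function names, and the iteration limits are assumptions made for illustration; MET-DE on the AWGN channel tracks message densities per edge type and is considerably more involved.

```python
def de_converges(eps, dv, dc, max_iter=5000, tol=1e-10):
    """BEC density evolution for a regular (dv, dc) ensemble.
    Returns True if the erasure probability of VN-to-CN messages goes to zero."""
    x = eps
    for _ in range(max_iter):
        x = eps * (1.0 - (1.0 - x) ** (dc - 1)) ** (dv - 1)
        if x < tol:
            return True
    return False

def bec_threshold(dv, dc, steps=30):
    """Largest channel erasure probability for which density evolution converges (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if de_converges(mid, dv, dc):
            lo = mid      # still decodable: the threshold is above mid
        else:
            hi = mid      # stuck at a nonzero fixed point: the threshold is below mid
    return lo

threshold_36 = bec_threshold(3, 6)  # regular rate-1/2 ensemble
```

In a search procedure, a call such as `bec_threshold` would play the role of the scoring function that ranks candidate base graphs, with the analytical tool swapped for MET-DE or PEXIT as appropriate.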
Extrinsic information transfer (EXIT) [17] analysis can also help evaluate threshold performance by tracking the variation of mutual information with the advantage of visualization via EXIT curve matching. Based on the MET view of irregular codes, protograph-based EXIT (PEXIT) is proposed [18], which can be used for code construction but cannot be visualized. A recent study [19] simplifies the classification of edges originating from the RL structure of 5G-LDPC, in which VNs and CNs are both divided into two categories. In this way, the EXIT analysis results can be displayed in a 3-dimensional diagram called a 3D-EXIT chart. This method has a relatively low complexity, and the graphical results provide a new perspective for algorithm scheduling.
Error floor performance is another concern in code design, and the existence of trapping sets is the primary cause of the undesirable error floor. A review of related works and a search algorithm for elementary trapping set in irregular code are given in [20]. Alternatively, the existence of small girths is a more intuitive indicator closely related to the trapping set [21]. Optimization of small girths can also improve error floor performance, leading to the proposal of progressive edge growth (PEG) algorithm [22] and approximated cycle extrinsic message degree (ACE) algorithm [23]. For QC codes, cyclic PEG (cPEG) [24] is proposed to jointly design the base graph and lifting factors.
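The shortest cycles that such construction algorithms try to avoid are easy to detect on a binary PCM: two CNs that share two or more VNs close a cycle of length 4. The following sketch only illustrates this check; it is not an implementation of PEG, ACE, or cPEG, and the two small matrices are made up for the example.

```python
import numpy as np

def count_row_pairs_with_4cycles(H):
    """Count pairs of CNs that share at least two VNs (each such pair closes a length-4 cycle)."""
    H = np.asarray(H, dtype=int)
    overlap = H @ H.T                       # overlap[a, b] = number of VNs shared by CN a and CN b
    iu = np.triu_indices(H.shape[0], k=1)   # distinct row pairs only
    return int(np.sum(overlap[iu] >= 2))

# Illustrative matrices (assumptions): rows 0 and 1 of H_bad share columns 0 and 1.
H_bad = np.array([[1, 1, 0, 1],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1]])
H_ok = np.array([[1, 1, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 1, 1]])
```

A construction procedure would typically reject or re-draw an edge (or a lifting factor) whenever a check of this kind fails, and fall back to ACE-style criteria when short cycles cannot be avoided entirely.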
These methods can be directly utilized in single-length and single-rate code designing, which are suitable for existing DTTB standards with a relatively small number of modes. As for 5G-LDPC, it has to meet the requirement of higher flexibility, which leads to an update of the structure and the corresponding construction method. 5G-LDPC utilizes a nested structure similar to the accumulate-repeat-4-jagged-accumulate (AR4JA) code [4] and protograph-based raptor-like (PBRL) code [25] but has a two-dimensional extension on both CNs and VNs to obtain finer granularity. A review of the nested design is given in [26], with a newly proposed progressive matrix growth (PMG) algorithm for two-dimensional construction of the base graph. Note that as one of the few algorithms for nested design, PMG uses average threshold performance as the criterion in every step of extension instead of greedy searching. Therefore, it can guarantee the average performance among different modes and has a larger search space. The flexibility of the lifting matrix size is another source of multi-length. 5G-LDPC adopts modulo-lifting [27], which provides the lifting matrix size with length-scalability and simplifies code description.
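Modulo-lifting can be illustrated with a short sketch: a single table of shift coefficients is stored, and the actual shift for a given lifting size z is the coefficient reduced modulo z. The table below is hypothetical and is not taken from the 5G NR specification; the function name `lift` and the convention that -1 marks a zero block are assumptions for this example.

```python
import numpy as np

def lift(base, z):
    """Expand a table of shift coefficients into a binary PCM.
    Entry -1 gives a z x z zero block; entry v >= 0 gives the identity
    matrix cyclically shifted to the right by (v mod z)."""
    base = np.asarray(base)
    m, n = base.shape
    H = np.zeros((m * z, n * z), dtype=np.uint8)
    I = np.eye(z, dtype=np.uint8)
    for i in range(m):
        for j in range(n):
            v = int(base[i, j])
            if v >= 0:
                H[i * z:(i + 1) * z, j * z:(j + 1) * z] = np.roll(I, v % z, axis=1)
    return H

# Hypothetical 2 x 4 coefficient table (illustrative only).
base = np.array([[9, -1, 0, -1],
                 [4, 11, 0,  0]])
H4 = lift(base, 4)   # code length 16
H8 = lift(base, 8)   # code length 32, same stored table
```

The same stored table serves both lifting sizes, which is the length-scalability and compact description mentioned above.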
The above methods can be applied to existing or even future codes with the same characteristics as 5G-LDPC, including QC, RL, and nested structures. For 5G-LDPC itself, which has already been standardized, we can use the above indicators to further evaluate its performance and point out potential development directions.
5G-LDPC has been shown to outperform LTE Turbo codes in threshold [28], yet an error floor also exists in some modes, as pointed out in [29]. Because there are many modes with similar parameters, it is suggested to avoid these underperforming modes in practical use.
Different from [29], we tested modes with smaller code lengths. We define the lifting matrix size as z × z, the number of information VNs in the base graph as k, and the number of CNs in the base graph as m. We took Base Graph 2 with k = 6, 8, m = 5, 8, 10, and z = 13, 26, 52. We carried out ACE analysis and found that in the modes with k = 6, m = 5, z = 13 and k = 6, m = 8, z = 13, there were length-4 cycles with a small ACE value, which suggested a potential undesired error floor. Thus, the cPEG and ACE algorithms were utilized to re-design the lifting factors of the base graph with k = 8, m = 10, z = 52, which gave a newly designed base matrix with 10 rows and 18 columns in which no lifting factor is larger than 52. For easy comparison, only the part of Base Graph 2 with k = 8, m = 10 is considered. Only the first 12 columns of both the newly designed base graph and the original Base Graph 2 are displayed in Fig. 2 (a) and (b), respectively, since the last 6 columns are the same in both designs, with the form $[\mathbf{0}_{6\times 4}\ \mathbf{I}_{6\times 6}]^{T}$.
Simulations of the original and new code families were conducted using the binary-input additive white Gaussian noise (AWGN) channel. The results are shown in Fig. 3. It can be observed that in the mode with k = 6, m = 8, z = 13 (code rate R = 2/3), the new code family performs better than the original one. In the other examined modes, the two codes have similar performance, which indicates the compatibility of the new one. This indicates that, in designing the 5G-LDPC, the optimization of the error floor with multi-mode compatibility is still a challenge and lacks effective methods. 5G-LDPC still has room for improvement.
Though existing DTTB-LDPC does not use a nested design, it is recommended to introduce such a design in the future evolution standard to improve flexibility and compatibility, as pioneered in [30]. Compared to 5G NR, DTTB does not require a large number of modes but calls for better error floor performance. It also has a longer code length to ensure a lower threshold, which means higher code construction complexity. Therefore, developing an efficient method with consideration of nested design while maintaining a high threshold and error-floor performance is still an open problem for both 5G NR and DTTB systems.

III. DECODING ALGORITHM
The sum-product algorithm (SPA) [31], also called the belief-propagation (BP) algorithm [32], is a near-optimal iterative decoding algorithm. In SPA, log-likelihood-ratio (LLR) messages are passed between VNUs and CNUs and processed by them. We define the index i for the CN/CNU corresponding to the i-th row of the PCM, and the index j for the VN/VNU corresponding to the j-th column. The indices of the CNs that are adjacent to $VN_j$ form the set $M(j)$, and the indices of the VNs that are adjacent to $CN_i$ form the set $N(i)$. The operations in the VNU and CNU can be written as
$$Q_{i,j} = r_j + \sum_{i' \in M(j)\setminus i} L_{i',j}, \tag{1}$$
$$L_{i,j} = 2\tanh^{-1}\Big(\prod_{j' \in N(i)\setminus j} \tanh\frac{Q_{i,j'}}{2}\Big), \tag{2}$$
where $Q_{i,j}$ is the message passed from $VNU_j$ to $CNU_i$, $L_{i,j}$ is the message passed from $CNU_i$ to $VNU_j$, and $r_j$ is the a priori message of the j-th bit from the upper module (e.g., constellation demapping) to $VNU_j$. $M(j)\setminus i$ denotes the set $M(j)$ excluding the element i, and $N(i)\setminus j$ is defined similarly.
Obviously, $\tanh$ and $\tanh^{-1}$ in (2) lead to high implementation complexity. Many modified algorithms have been proposed to approximate and simplify computation and implementation. In addition, data dependence exists between VNs and CNs, which means the processing order affects the convergence speed. Thus, we discuss two aspects of the decoding algorithm: the operational method in a single unit and the scheduling among multiple units, and then recommend decoding schemes for the DTTB system.
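As a point of reference for the simplifications discussed next, the standard SPA node updates can be sketched for a single CN and a single VN. The sketch assumes the usual textbook form: each VN-to-CN message is the a priori LLR plus all incoming CN messages except the one on the target edge, and each CN-to-VN message is 2 tanh^-1 of the product of tanh(Q/2) over the other edges. Function names and the clipping constant are illustrative assumptions.

```python
import numpy as np

def spa_cn_update(q_in):
    """CN update for one check node: for each edge, combine the tanh(Q/2) of all other edges."""
    t = np.tanh(np.asarray(q_in, dtype=float) / 2.0)
    out = np.empty_like(t)
    for k in range(t.size):
        prod = np.prod(np.delete(t, k))             # exclude the target edge
        prod = np.clip(prod, -0.999999, 0.999999)   # keep arctanh finite (numerical guard)
        out[k] = 2.0 * np.arctanh(prod)
    return out

def spa_vn_update(r_j, l_in):
    """VN update for one variable node: a priori LLR plus all incoming messages except the target edge."""
    l_in = np.asarray(l_in, dtype=float)
    return r_j + l_in.sum() - l_in

L_out = spa_cn_update([1.2, -0.4, 2.5])
Q_out = spa_vn_update(0.8, [0.5, -1.0, 2.0])
```

The per-edge tanh and arctanh evaluations in `spa_cn_update` are exactly the operations that the min-sum family replaces with comparisons.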

A. OPERATIONAL METHOD
The original operation of (2) can be implemented by looking up dedicated look-up tables (LUTs) multiple times. For further simplification, the min-sum algorithm (MSA) [34] is proposed as a suboptimal iterative algorithm, in which the operation in the CNU is
$$L_{i,j} = \prod_{j' \in N(i)\setminus j} \operatorname{sign}(Q_{i,j'}) \cdot \min_{j' \in N(i)\setminus j} |Q_{i,j'}|, \tag{3}$$
and the operation in the VNU is the same as in (1). We add the superscript SPA or MSA to $Q_{i,j}$ and $L_{i,j}$ to distinguish the messages of these two algorithms. Note that operations (1) and (3) in MSA are independent of the channel estimation error [35]; therefore, MSA has good robustness. However, this approximation inevitably causes performance degradation. Some improved methods have been proposed by introducing an acceptable increase in complexity. We classify them into the following categories based on the way in which the improvement is obtained.
a) MSA-based approximation. A correction can be applied to $L^{MSA}_{i,j}$ to approximate $L^{SPA}_{i,j}$ better. The simplest correction is attained by multiplying a normalization factor or subtracting an offset factor, which leads to the well-known normalized-MSA (NMSA) or offset-MSA (OMSA) [36], respectively. Note that the former preserves the independence between the algorithm and channel estimation in MSA, indicating better robustness [35].
b) Optimal post-processing. It is pointed out in [37] that $L^{MSA}_{i,j}$ is not an LLR and is mismatched to (1), which is oriented to LLR inputs, and this mismatch is the source of performance degradation.
c) Reliability-based correction. It is suggested in [38] that a stronger correction should be applied to more unreliable messages in NMSA or OMSA. Thus, two correction factors are employed, and the failure of the parity check in the current row is considered as the criterion for unreliability, giving adaptive-NMSA or adaptive-OMSA. A similar point of view is utilized in [33] but in a more extreme way. It sets unreliable $Q^{MSA}_{i,j}$ results to zero and proposes self-corrected MSA (SC-MSA). The criterion is that the sign of $Q^{MSA}_{i,j}$ with the same index changes in two adjacent iterations. It is proved that this erasure process can make the distributions of the messages approach a Gaussian distribution, which guarantees its performance improvement and provides another interpretation of SC-MSA.
d) SPA-based simplification. Some algorithms give a numerical approximation of the SPA directly considering MSA. They retain the operation in (2) to some extent but simplify it by discarding unimportant inputs [39] or reducing the output diversity [40].
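The MSA-based corrections in category a) can be sketched for a single CN as follows. The min-sum CN output on each edge is the product of the signs of the other inputs times the minimum magnitude among the other inputs; NMSA scales that magnitude and OMSA subtracts an offset. The factor values 0.75 and 0.3 below are illustrative assumptions, not tuned values from any of the cited works.

```python
import numpy as np

def msa_cn_update(q_in, alpha=1.0, beta=0.0):
    """Min-sum CN update with optional normalization (alpha) and offset (beta).
    alpha=1, beta=0 gives plain MSA; alpha<1 gives NMSA; beta>0 gives OMSA."""
    q = np.asarray(q_in, dtype=float)
    mag = np.abs(q)
    order = np.argsort(mag)
    min1, min2 = mag[order[0]], mag[order[1]]
    # Each edge sees the minimum over the other edges:
    # the second minimum for the edge holding the minimum, the first minimum otherwise.
    out_mag = np.where(np.arange(q.size) == order[0], min2, min1)
    out_mag = np.maximum(alpha * out_mag - beta, 0.0)
    sign = np.where(q < 0, -1.0, 1.0)
    # Product of the other signs equals the total sign product times the edge's own sign.
    return np.prod(sign) * sign * out_mag

q = [1.2, -0.4, 2.5]
L_msa = msa_cn_update(q)
L_nmsa = msa_cn_update(q, alpha=0.75)  # illustrative normalization factor
L_omsa = msa_cn_update(q, beta=0.3)    # illustrative offset factor
```

Only two magnitudes, the index of the minimum, and the signs are needed to describe all outputs of the CN, which is one reason this family is attractive in hardware.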
Most of the existing decoding algorithms, except for SPA and MSA, are derived from the above views, and some notable recent ones are summarized in Table 1. Since the implementation of b) is mainly NMSA or OMSA, b) is regarded as a subset of a) and is not listed in Table 1.
Because of the application of irregular codes and the emergence of MET-based analysis tools, improvements based on MET are conducted on the original algorithms, indicating different parameters or operations for different types of CNUs or VNUs. In NMSA and OMSA, it means the introduction of multiple correction factors, whose necessity is proved in [37] from the optimal post-processing view. In [41], the classification is simplified based on row and column weights directly, while in [42], rows with different weights are merged for further simplification.
It has been pointed out in [37] that I is also a variable of the optimal post-processing function and is changing during the decoding process. Thus, adaptive correction factors are preferred to enhance the performance of NMSA or OMSA. Since I is hard to attain, some substitute criteria are utilized to guide the adaptation of factors. An intuitive one is the index of the iterations. In [43], the value of the sub-minimal input of a CNU is used as the criterion, and in [54], it is the failed parity-check equation proportion. Note that the algorithms with view d) are adaptive naturally.
To summarize, Table 1 is extended to show whether the algorithms are MET-based and adaptive.
For algorithms using predetermined parameters, selecting the best parameters is also important in the implementation. As a typical case, the design of correction factors in NMSA or OMSA has been discussed in different works. In the early literature, such as [37], theoretical analysis and numerical computation are used to get the factors. However, it becomes difficult when MET-based and adaptive views are introduced, and thus, some search methods have been applied. To evaluate each searching stage, theoretical methods using code analysis tools and simulated methods using the Monte Carlo method are both applicable. The former has a smaller complexity and potential universality, whereas the latter tends to have better performance for a specific code. To illustrate this difference, Table 1 is extended with the classification of parameter obtaining methods, which are divided into data-based simulated methods (S) and analysis-based theoretical methods (T).
Since the goal is always to balance the performance and complexity of the decoding algorithm, the above schemes need to be carefully selected according to the specific code and application scenario.

B. SCHEDULING
The original iterative decoding algorithms, such as SPA and MSA, use flooding-scheduling, i.e., in every iteration, all VNUs and then all CNUs run once. This processing order can be shuffled, and layered-scheduling is proposed in [44] (also called turbo decoding message passing, TDMP, in [45]), inspired by the decoding of concatenated codes. It splits the PCM into several submatrices by row grouping, called layers, in which part of the CNUs and VNUs run alternately. The simulation results show that the average number of iterations can be reduced by 50%. If the maximum column weight in each layer is 1, parallelism can be easily performed in the decoder [44]. Thus, for QC-LDPC codes, the most direct way to define layers is to regard the submatrix corresponding to a row of the base graph as a layer. Similarly, the PCM is split by column grouping in [46], which achieves a similar improvement. The key is to enable the newly calculated messages to participate in the following operations instantly. Based on this point, a more complex scheduling scheme is proposed [47], resulting in faster convergence. However, considering the complexity of implementation and the occupancy of storage resources, layered-scheduling based on row grouping is used more widely among all non-flooding methods.
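Row-based layered-scheduling can be sketched as follows. This is a minimal illustration rather than the decoder of any cited work: each PCM row is treated as one layer, the CN operation is normalized min-sum with an assumed factor of 0.8, and the (7,4) Hamming PCM is used only because it is small. The convention that a positive LLR means bit 0 is also an assumption of the example.

```python
import numpy as np

def layered_nmsa(H, r, alpha=0.8, max_iter=10):
    """Row-layered normalized min-sum. Returns (hard-decision bits, iterations used)."""
    H = np.asarray(H)
    rows = [np.flatnonzero(h) for h in H]
    P = np.asarray(r, dtype=float).copy()      # a posteriori LLRs, initialised with channel LLRs
    L = [np.zeros(len(idx)) for idx in rows]   # stored CN-to-VN messages, one vector per layer
    bits = (P < 0).astype(int)
    for it in range(1, max_iter + 1):
        for i, idx in enumerate(rows):
            Q = P[idx] - L[i]                  # remove this layer's previous contribution
            mag = np.abs(Q)
            k = int(np.argmin(mag))
            out_mag = np.full(len(idx), mag[k])
            out_mag[k] = np.min(np.delete(mag, k))
            sign = np.where(Q < 0, -1.0, 1.0)
            L[i] = alpha * np.prod(sign) * sign * out_mag
            P[idx] = Q + L[i]                  # updated LLRs are used by the next layer at once
        bits = (P < 0).astype(int)
        if not np.any((H @ bits) % 2):         # all parity checks satisfied
            return bits, it
    return bits, max_iter

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
r = np.array([2.0, 2.0, 2.0, -0.5, 2.0, 2.0, 2.0])  # all-zero codeword, bit 3 received wrongly
bits, iters = layered_nmsa(H, r)
```

The line that writes `P[idx]` inside the layer loop is what distinguishes this schedule from flooding: a later layer in the same iteration already sees the messages produced by an earlier one.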
Different operation units have different abilities to process messages, so the processing order of layers in layered-scheduling also influences the convergence speed. A layer ordering strategy is proposed in [48], based on the idea of processing and passing highly reliable messages preferentially. In [19], a theoretical analysis and design of the layer processing order are provided using the proposed 3D-EXIT chart. Moreover, in hardware implementation, the data dependency between VNUs and CNUs may cause conflicts in pipeline design. Therefore, conflict avoidance should also be considered in the design of the processing order, which is further discussed in Section IV.
Due to the change in schedule, the parameters or criteria that are determined under flooding-scheduling may require redesigning. The classification of scheduling has to be added to Table 1, where F means flooding and L means layered.

C. DECODING ALGORITHM SELECTION FOR 5G NR AND DTTB SYSTEMS
As for the decoding algorithm, the special consideration of nested design, which is the main difference between DTTB-LDPC and 5G-LDPC, is not involved. A comprehensive comparison can be made to determine the recommended schemes for the DTTB system. It is shown in Table 1 that most of the algorithms are based on view a) even for 5G NR because of its advantages in terms of complexity and performance. Thus, NMSA/OMSA can be selected for existing or even future DTTB systems, but some improvements are needed. Based on the discussion on the structure, the typical MET characteristics in the codes of both systems should be noticed and the MET-based view is recommended. Further consideration is needed for the classification of nodes using the same parameter table, and the degree of nodes is still the most commonly used criterion. Because of the QC structure, layered-scheduling is preferred for faster convergence and utilizing parallelism in hardware, both of which improve the decoder throughput. Adaptivity can be introduced for further improvements. With respect to multi-mode support, when introducing MET-based or adaptive views, the parameters used in the algorithm should have universality to save storage resources, instead of training for every single mode.
Among the algorithms in Table 1, an enhanced NMSA with all the improvements listed above is proposed in [54], which is recommended for implementation. Note that it uses the failed parity-check equation proportion as a criterion for choosing adaptive correction factors, which has potential universality, although the factors are trained via Monte Carlo simulation. The layer processing order design proposed in [19] and [48] can be combined with this algorithm for higher throughput.

IV. DECODER ARCHITECTURE
The basic requirement of decoder design is to effectively and efficiently implement the target algorithm with good performance in various hardware indicators, such as throughput, resource occupation, and energy efficiency. Moreover, the decoder has its own concerns, including quantization, parallelization, and pipelining, which directly influence the system performance and should be well-designed.

A. FUNDAMENTAL IMPLEMENTATION
In the aforementioned algorithms, the core processing of messages is the operations in CNUs and VNUs. Thus, the implementation of these two types of units is the basis of decoder design.
The operation in the CNU is different among the algorithms. Headed by MSA, many of them reduce the output diversity of the CNU compared to SPA, which reduces not only the amount of calculation but also the storage for intermediate results, and are preferred in decoder implementation. For the VNU, the accumulation of $L_{i,j}$ in (1) is carried out in almost every algorithm. For simplification, (1) is modified into
$$P_j = r_j + \sum_{i \in M(j)} L_{i,j}, \tag{4}$$
$$Q_{i,j} = P_j - L_{i,j}, \tag{5}$$
where $P_j$ indicates the a posteriori message of the j-th bit. In addition, storage usage can be reduced if $P_j$ is stored and $Q_{i,j}$ is calculated by (5) on the fly, instead of storing $Q_{i,j}$ directly. Data quantization for the CNU/VNU should also be optimized to achieve a trade-off between performance and resource usage. Variations on quantization can be introduced for different types of messages or along with the iteration progress [58] [59]. Note that some new algorithms have been developed for quantized messages directly [53] [57] [60], which provides a new view for algorithm design.
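The storage-saving form of the VN operation and a simple message quantizer can be sketched together. The sketch stores one a posteriori value per VN and recomputes each VN-to-CN message by subtracting the corresponding CN-to-VN message; the quantizer is a saturating fixed-point model whose word length (6 bits) and fractional precision (2 bits) are assumptions chosen for illustration, not recommendations.

```python
import numpy as np

def quantize_llr(x, total_bits=6, frac_bits=2):
    """Saturating signed fixed-point quantizer with symmetric clipping."""
    scale = 2 ** frac_bits
    q_max = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(np.asarray(x, dtype=float) * scale), -q_max, q_max)
    return q / scale

r_j = 0.8                             # a priori LLR of VN j (illustrative value)
L_col = np.array([0.5, -1.0, 2.0])    # CN-to-VN messages on the edges of VN j
P_j = r_j + L_col.sum()               # a posteriori LLR, stored once per VN
Q_col = P_j - L_col                   # VN-to-CN messages recomputed on demand

Q_fixed = quantize_llr([0.3, 100.0, -100.0, 1.13])
```

Storing `P_j` costs one word per VN, whereas storing every `Q_col` entry costs one word per edge; the quantizer shows how the chosen word length bounds the dynamic range of whatever is stored.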

B. PARALLELIZATION
Parallelization can be supported by the introduction of multiple operation units. Different scheduling strategies have different parallel architectures.
Consider a PCM with m rows and n columns. For flooding-scheduling, at most m CNUs and n VNUs can be used in parallel, resulting in full-parallel architecture [62]. Comparatively, the layered-scheduling, recommended in Section III, requires serial implementation between layers, and parallel calculations are applied only within a single layer, leading to layered-parallel or row-parallel [61] [62].
If each layer contains only 1 row, 1 CNU is required and is shared by all layers. $d_c^i$ VNUs are needed for the i-th layer (row) to support the parallel operations of VNs, so $d_c^{\max}$ VNUs should be implemented for all layers, where $d_c^{\max}$ is the maximum among all $d_c^i$. The above discussion of the m × n PCM is also valid for an m × n base graph, where $d_c^i$ indicates the weight of the i-th row in the base graph. Thus, replicating the above units z times leads to full-parallel and layered-parallel architectures for a QC-LDPC code with lifting matrices of size z × z. For a group of QC-LDPC codes with different lifting matrix sizes, $z_{\max}$ replicas are needed to support all codes, where $z_{\max}$ is the maximum z among them.
Implementations of both full-parallel and layered-parallel closely depend on the base graph and the size of the lifting matrix. If a code has parameters (m, n, $d_c^{\max}$, or $z_{\max}$) beyond the range specified by the decoder, it cannot work normally, which means a limitation of universality. On the contrary, if the parameters do not reach the bounds, the decoder wastes resources and has low energy efficiency. To solve this problem, a reduction in parallelism can be further carried out. One solution is block-parallel [61] [62]. It is based on the layered-scheduling but carries out the processing within each layer in a serial manner, which is different from the layered-parallel. In the base graph view, for the i-th layer (row), $d_c^i$ clock cycles are required for the whole layer processing. Similarly, $z_{\max}$ degrees of parallelism should be supported by hardware implementation. This architecture decouples the hardware design from the base graph and is universally applicable to any QC-LDPC code constrained by a fixed $z_{\max}$. Theoretically, E clock cycles are needed for a single iteration, where E is the number of nonzero elements (edges) in the base graph. Another solution, called parallel layered decoding architecture (PLDA), is proposed in [63]. It is a variant of the layered-parallel and only targets QC-LDPC codes. It extracts one row from each layer defined in the layered-parallel to form a new layer. Thus, the PCM is split into z layers, each of which has m rows. In contrast to the layered-parallel, this makes rows in one layer have different weights, and the weight distribution is the same as that of the base graph. This architecture decouples the hardware design from the lifting matrix size and theoretically requires z clock cycles in one iteration. It is compatible with codes with nested design but constrained by the largest base graph (mother code) in the nested structure [64], which means the universality of different base graphs is limited.
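The counting arguments above can be collected into a small helper that, for a given base graph and lifting size, reports the number of operation units or the ideal number of clock cycles per iteration for each architecture. The dictionary keys, the `serial_steps` entry (one step per layer, inferred from the serial processing between layers), and the example base graph are assumptions for illustration; pipelining, stalls, and memory access are ignored.

```python
import numpy as np

def architecture_estimates(base_mask, z):
    """Idealised unit counts and cycles per iteration for a QC-LDPC code."""
    B = np.asarray(base_mask, dtype=bool)   # True where the base graph has an edge
    m, n = B.shape
    row_w = B.sum(axis=1)                   # row weights of the base graph
    return {
        "full_parallel":    {"cnu": m * z, "vnu": n * z},
        "layered_parallel": {"cnu": z, "vnu": int(row_w.max()) * z, "serial_steps": m},
        "block_parallel":   {"lanes": z, "cycles": int(B.sum())},  # E cycles per iteration
        "plda":             {"cycles": z},                         # z cycles per iteration
    }

# Illustrative 2 x 4 base graph (assumption).
base_mask = [[1, 0, 1, 0],
             [1, 1, 1, 1]]
est = architecture_estimates(base_mask, z=4)
```

Comparing the entries makes the trade-off visible: block-parallel depends on the base graph only through its cycle count, whereas PLDA depends on the lifting size only through its cycle count.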

C. MESSAGE PASSING NETWORK (MPN)
The MPN is designed to support the above parallel architectures by passing new messages at the same time. For full-parallel, CNUs and VNUs should be connected directly according to the PCM, resulting in a complicated fixed MPN, which also limits the application of this architecture. In terms of partially-parallel architectures, the connection changes during the serial process of each iteration. Thus, a simpler and reconfigurable MPN is needed. Furthermore, the structural design of the code can make the MPN structural [12]. Hence, for QC-LDPC codes, a two-level MPN design is adopted, in which the first level reflects the connection given by the base graph, and the second level implements circular shift based on the lifting factor. Consequently, a universal circular shift network with low resource occupancy and a short critical path is one of the key points in MPN design.
For a group of codes with a given $z_{\max}$, a circular shift network that can support the shift value p for z messages ($0 \le p < z$, $1 \le z \le z_{\max}$) is desired. Traditional multi-level barrel shifters (BSs) do not have such universality for different z. This is solved by adding an extra self-routing bit to every message and a look-up level in [65]. The Benes network, as a non-blocking switch network, is used for circular shifting in [66] with modifications. Because the original Benes network can support arbitrary switches, redundancy exists when it is used only for circular shifting. A dedicated circular shift network, QC-LDPC shift network (QSN), is proposed in [67]. It has the advantages of a shorter critical path, lower resource occupancy, and easy generation of control signals by gate circuits. In [68], a Banyan network with a bypass structure is used for MPN, which is also an alternative with good performance. Recent works modify existing MPNs to support parallel circular shifting, which gives the decoder potential of parallel decoding when z is small, e.g., a Banyan-based shifter and a BS-based shifter (called extended barrel-shifter, EBS) are designed with this consideration in [69] and [70], respectively.
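What such a universal network has to compute can be modelled functionally: with a fixed number of lanes, only the first z lanes carry messages, and they must be rotated by p among themselves. The sketch below builds this rotation from two plain (non-cyclic) shifts and a per-lane selection. It is a behavioural model under these assumptions and does not reproduce the structure of any specific network cited above; the function name and the zero fill of unused lanes are illustrative choices.

```python
import numpy as np

def cyclic_shift_variable_z(msgs, z, p):
    """Rotate the first z of z_max lanes by p (0 <= p < z): out[k] = in[(k + p) mod z]."""
    x = np.asarray(msgs)
    z_max = x.size
    pad = np.zeros(z_max, dtype=x.dtype)
    s = z - p                                              # amount of the opposite shift
    left = np.concatenate([x, pad])[p:p + z_max]           # left[k]  = in[k + p]
    right = np.concatenate([pad, x])[z_max - s:2 * z_max - s]  # right[k] = in[k - s]
    lane = np.arange(z_max)
    out = np.where(lane < s, left, right)                  # per-lane select between the two shifts
    out[z:] = 0                                            # lanes beyond z are unused
    return out

msgs = np.arange(10, 18)          # z_max = 8 lanes holding distinguishable values
shifted = cyclic_shift_variable_z(msgs, z=5, p=2)
```

A plain barrel shifter of width 8 would wrap around lane 7 instead of lane 4 in this example, which is the lack of universality for different z noted above.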

D. PIPELINING
The critical path composed of the VNU, CNU, and MPN can be split into stages and pipelined for higher throughput. The main obstacle is the frequent pipeline conflict caused by data dependence.
When using flooding-scheduling, the entire process of a single iteration should be split into stages for pipelining. However, each iteration depends on the results of the previous one, so different stages cannot process different iterations of the same codeword. Thus, multi-frame decoding is proposed [71], in which different stages process different codewords. It multiplies the throughput but has a high latency and requires huge hardware resources.
In terms of layered-scheduling, the process of each layer can be divided into stages. In the base graph view, if two CNs are connected to at least one common VN, there is a risk of pipeline conflict. As with flooding-scheduling, multi-frame decoding is naturally feasible [72]. In addition, single-frame decoding can also be pipelined in layered-scheduling with the help of techniques that alleviate or avoid conflict, and we classify them into the following categories.
a) Stalling. It is the most direct way to avoid conflict entirely, but it lowers the throughput because of the stall cycles.
b) PCM-based methods. In [73], every block of the lifting matrix is split into several submatrices without changing the code, leading to more layers but less conflict between layers. Direct avoidance of conflict between layers can also be conducted in the code design, e.g., the quasi-orthogonal structure of 5G-LDPC. Note that these methods are not perfect and may fail when the number of stages increases.
c) Decoding-algorithm-based methods. Both the scheduling and the operations can be modified to resolve conflicts. For the former, changing the layer processing order can reduce the risk of conflict. Moreover, when utilizing the block-parallel structure, the processing order of VNs provides a new degree of freedom for scheduling optimization, making the method more powerful. For the latter, a modified operation for the VNU is proposed in [74] to solve the parallel conflict, and this method is used in [75], under the name "residue-based", to solve the pipeline conflict. This is equivalent to conducting flooding-scheduling partially and is called hybrid scheduling in [76]. This method removes conflict completely but lowers the efficiency of message passing because partial flooding-scheduling is introduced.
d) Hardware-based methods. Traditional hardware solutions for pipeline conflict can be utilized directly, e.g., by introducing a forwarding module or bypass structure [77].
These four categories of methods are independent and can be combined. Since the goal is to avoid conflict completely, inserting enough stall cycles in a) or changing the operation in c) is necessary. Hence, the degradation of throughput caused by stall cycles or partial flooding-scheduling should be carefully evaluated.
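The interaction between stalling and layer reordering can be sketched with a toy model. The code below assumes a deliberately simplified pipeline in which only consecutive layers can conflict, the last layer wraps around to the first layer of the next iteration, and every conflicting pair costs a fixed number of stall cycles; the base matrix, the function names, and the exhaustive search are hypothetical and only serve to illustrate the idea.

```python
from itertools import permutations


def shared_columns(base, a, b):
    """Columns of the base matrix where layers a and b both have a non-zero block."""
    return [j for j, (x, y) in enumerate(zip(base[a], base[b]))
            if x >= 0 and y >= 0]


def stall_cycles(base, order, stall_per_conflict=1):
    """Stall cycles per iteration under the simplified model described above."""
    total = 0
    for k in range(len(order)):
        a, b = order[k], order[(k + 1) % len(order)]  # wrap to next iteration
        if shared_columns(base, a, b):
            total += stall_per_conflict
    return total


def best_order(base):
    """Exhaustive search over layer orders; feasible only for toy base matrices."""
    return min(permutations(range(len(base))),
               key=lambda order: stall_cycles(base, order))


# Toy base matrix (illustrative only); -1 marks an all-zero block.
toy = [[0, 1, -1, -1],
       [2, -1, 3, -1],
       [-1, -1, 1, 0],
       [-1, 4, -1, 2]]

natural = stall_cycles(toy, (0, 1, 2, 3))     # every consecutive pair conflicts
reordered = stall_cycles(toy, best_order(toy))
```

In this toy case reordering halves the stall count but cannot remove it, which matches the observation that complete avoidance still requires stall cycles or a modified operation. A practical decoder would need a model with the real pipeline depth and a heuristic search, since the number of layer orders grows factorially.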

E. DECODER ARCHITECTURE SELECTION FOR 5G NR AND DTTB SYSTEMS
A good summary of many LDPC decoders with a field-programmable gate array (FPGA) implementation is given in [78]. However, it does not review pipelining and does not cover some recent implementations of 5G-LDPC. As a supplement to [78], we list several 5G-LDPC decoders in Table 2 and summarize them according to the design points mentioned above.
The decoder can meet the demands of universality and throughput by introducing the full-parallel architecture with sufficient operation units and multi-frame pipelining. However, when considering resource occupation, efficiency, latency, and some other application-related criteria, partially-parallel architectures based on layered-scheduling are preferred, as shown in Table 2.
If we focus on the high flexibility of the lifting matrix size, PLDA has little waste of hardware resources for 5G-LDPC. On the contrary, if we focus on the high flexibility of the base graph, block-parallel is preferred. Since the number of edges in the base graphs of 5G-LDPC (BG1: E = 316, BG2: E = 197) has a similar magnitude to its $z_{\max} = 384$ and both have a large range of variation, it is difficult to simply tell which architecture is better. However, it is worth noting that, because of the parallel circular shift design, block-parallel will have a higher throughput and less waste of resources even when $z < z_{\max}$, which compensates for its key shortcoming compared to PLDA. If we further consider the demand for supporting two different base graphs and the potential for supporting new base graphs, block-parallel, which is decoupled from the base graph, is preferred. By comparison, DTTB-LDPC has extremely low flexibility in the parameter z (e.g., z = 360 in ATSC 3.0) and has different base graphs for different modes, so PLDA has no advantage, and block-parallel is recommended.
In terms of pipelining, the PLDA implementation in [80] is not pipelined. In fact, conflict-free pipelining can be introduced into PLDA [63] [64] with a strictly designed layer-splitting method. The block-parallel implementations in [79] and [76] both introduce pipelining. The latter adopts both the scheduling and the operation modifications in c) and has been shown to have higher throughput than direct stalling, making it the better choice. Compared to the 5G-LDPC code, DTTB-LDPC does not have a quasi-orthogonal structure, which means greater challenges in pipeline conflict avoidance. The implementation in [76] is worth implementing and testing for the DTTB system. Additional structural constraints, such as quasi-orthogonality, can also be imposed in future code design.

V. CONCLUSION
This paper provided a review of the technical points in 5G-LDPC implementation based on recent notable works. Starting from the code structure, it discussed the code analysis and design, decoding algorithm, and decoder architecture within the framework for final implementation and application. The development roadmap was clarified, and different schemes were classified and compared. By comparing the characteristics of 5G-LDPC and DTTB-LDPC, this paper recommended some universal solutions with good performance for both 5G NR and DTTB systems and also proposed some directions for future development of DTTB code design. All key observations can be summarized as follows:
a) For code design, PMG or similar algorithms can be used for the base graph design of 5G-LDPC or codes with the same structure. Further development is needed for lifting factor design considering the average performance guarantee of different lifting matrix sizes. DTTB-LDPC can refer to 5G-LDPC with the aim of a longer code length, a better threshold, better performance, and a lower error floor. Thus, it is necessary to improve the existing methods and pay more attention to the average error floor performance under different modes.
b) For the decoding algorithm, a layered-scheduling, MET-based, and adaptive NMSA/OMSA provides a good balance between performance and complexity. It is suggested that the indicators for choosing adaptive parameters should have potential universality, e.g., the proportion of failed parity-check equations. Layer processing order design is also worth introducing.
c) For decoder architecture, recent block-parallel implementations for 5G-LDPC have advantages in terms of universality and hardware efficiency. By introducing the optimized CNU/VNU/MPN design, pipelining, and conflict handling mechanisms, higher throughput can be achieved. These techniques can be applied to DTTB systems.
All of the recommended solutions for DTTB are based on 5G-LDPC with full consideration of its rate-compatibility and length-scalability features, making them applicable to other systems, e.g., wireless LANs and deep space communication. These solutions may also be used in future 3GPP Releases, e.g., for flexible and effective delivery of 5G broadcast and multicast services to mobile devices [82] [83] [84]. Moreover, the continuous improvement of these techniques can make them suitable for future codes with similar structures. New structural requirements can be put forward for the better use of existing techniques, which emphasizes constraints between the code structure and the implementation.