A 10 Gb/s PAM-4 Transmitter With Feed-Forward Implementation of Tomlinson-Harashima Precoding in 28 nm CMOS

A 10 Gb/s PAM-4 transmitter (TX) with a modulo-based equalization technique is presented. The proposed feed-forward Tomlinson-Harashima precoding (FF-THP) scheme takes advantage of both Tomlinson-Harashima precoding (THP) and feed-forward equalization (FFE). The vertical eye margin (VEM) is enhanced by removing the precursor inter-symbol interference (ISI) with pretaps while incorporating the modulo operation. The VEMs of the equalization methods are derived from the z-domain response (ZDR). The effectiveness of the FF-THP is examined by quantitative analysis and numerical simulation. In particular, for a one-pole channel with a precursor, the optimized tap coefficients of an FFE and an FF-THP are derived in closed form in terms of the precursor and the first postcursor. The decision threshold voltage and an estimated bathtub curve, both based on Gaussian noise, are calculated from the histogram of an eye diagram. The advantages of the FF-THP over a conventional FFE are verified by measurements on a fabricated chip. The proposed TX compensates for a 21 dB channel loss with a level mismatch ratio of 99.1% and with a figure of merit (energy efficiency per sum of channel ISI) of 4.05 pJ/b/ISI. Moreover, the FF-THP achieves 38% and 87.5% improvement in VEM and horizontal eye margin, respectively, compared with an FFE. It is fabricated in 28 nm CMOS technology, occupying an active area of 0.075 mm^2.

in equalizing the channel loss on the TX side. Equalization is more straightforward to implement at the TX than at the RX side because TX has the exact information of the input data, whereas the RX may have a sampling error. Therefore asymmetric links, such as DRAM interfaces, may use TX equalization thanks to its simplicity [4].
While a feed-forward equalizer (FFE) in the form of a finite impulse response (FIR) filter is widely employed, the scaling factor imposed by the maximum drivable voltage or current can significantly reduce the eye opening on a high-loss channel [5]. The nonlinear decision feedback equalizer (DFE) is also widely used because it is immune to noise boosting, but its errors tend to occur in bursts that degrade the forward error correction (FEC) performance. Thus, combining the DFE and the FEC in PAM-4 signaling can cause significant performance degradation [5].
As an alternative, Tomlinson-Harashima precoding (THP) is a viable candidate for TX equalization of a high-loss channel because, being nonlinear, it offers an SNR gain and an evenly distributed transmitted signal [6], [7]. The THP can theoretically equalize a wide range of channels, regardless of the channel loss [8], [9]. Although attractive, the THP sees limited use in physical implementations because of the feedback timing constraint and the lack of precursor-handling capability, the same problems that affect the DFE.
Various techniques such as pipelining and mapping have been reported to relieve the timing constraint in the THP implementation [10]-[12]. However, even when the timing constraint is alleviated, the precursor ISI remains a problem in equalizing a high-loss channel. Another reported approach is model predictive control (MPC), which offers limited controllability of the precursor ISI [13], [14].
This paper proposes a novel feed-forward Tomlinson-Harashima precoding (FF-THP) architecture incorporating the precursor compensation in the modulo-based equalization to build a 10 Gb/s PAM-4 TX. To mitigate the timing issue, the proposed FF-THP utilizes modulo prediction rather than direct calculation. In addition, by implementing 2 pretaps and 10 posttaps, the FF-THP can handle large channel ISI, including a precursor.
The rest of this paper is organized as follows. Section II introduces the basic concepts and functions of the THP, the FFE, and the proposed FF-THP, and presents the z-domain analysis. In particular, quantitative analysis and numerical simulation are conducted on a simple one-pole channel with a precursor to examine the effect of the precursor. Section III describes the overall architecture and the constituent engines of the proposed FF-THP TX. The measurement results of the fabricated chip are presented in Section IV, followed by a summary and conclusion in Section V.

II. PROPOSED ARCHITECTURE AND ANALYSIS
A. COMPARISON BETWEEN TOMLINSON-HARASHIMA PRECODING AND FEED-FORWARD EQUALIZER
The architectures of the THP and the FFE are illustrated in Fig. 1(a) and (b). The THP includes a feedback loop with a nonlinear modulo operation, and the tap coefficients of the THP are directly determined by the channel response, H_ch(z). On the other hand, the FFE has a feed-forward structure whose tap coefficients, w_i, perform equalization by the inverse function of the channel. Fig. 1(c) presents the swing limitation of the TX and the received data eye. In physical implementations, the output signal range of the TX driver, M, is limited. Therefore, the amplitudes of D_in of the THP and the FFE for PAM-4 signaling (A_THP and A_FFE) are represented below.
In the THP, the nonlinear modulo operation adds M to, or subtracts M from, the sum of the input and the feedback filter output when that sum exceeds the output range, [−M/2, M/2], to guarantee that Ch_in,THP remains within the driver output range. As a result, A_THP remains the same for a high-loss channel, and the signal amplitude of Ch_in,THP is uniformly distributed. In contrast, the output of the FFE (Ch_in,FFE) is summed without a modulo operation. Thus, A_FFE is inversely proportional to the sum of |w_i|, and the distribution of Ch_in,FFE is bell-shaped with the signal power concentrated at the center. The amplitudes at the channel outputs (Ch_out,THP and Ch_out,FFE) are the product of the main-cursor amplitude of the channel (H_0) and the amplitude of D_in, that is, H_0 A_THP and H_0 A_FFE, respectively. For a high-loss channel, the equalizer tap coefficients increase, so both H_0 and A_FFE decrease. Consequently, the amplitude of Ch_out,FFE is reduced quadratically with the amount of ISI. However, the amplitude of Ch_out,THP depends only on H_0 because A_THP remains the same by virtue of the nonlinear modulo operation.
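The scaling argument above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's own expressions: it assumes A_THP = M/L and A_FFE = M/((L−1)·Σ|w_i|) with w_0 = 1, which matches the statements that the modulo-based step is M/L, the conventional step is M/(L−1), and A_FFE is inversely proportional to the tap sum. The tap sets and function names are hypothetical.

```python
def thp_step(M, L):
    """Assumed data step of the THP: the modulo folds L levels into one period M."""
    return M / L


def ffe_step(M, L, w):
    """Assumed data step of the FFE: full-swing step M/(L-1), shrunk by sum |w_i|."""
    return M / ((L - 1) * sum(abs(wi) for wi in w))


M, L = 1.0, 4                      # normalized driver range, PAM-4
w_low_loss = [1.0, -0.2]           # hypothetical taps for a mild channel
w_high_loss = [1.0, -0.5, -0.2]    # hypothetical taps for a lossier channel

a_thp = thp_step(M, L)                  # independent of the taps
a_ffe_low = ffe_step(M, L, w_low_loss)
a_ffe_high = ffe_step(M, L, w_high_loss)

# Received amplitude is H_0 times the transmitted step, so the FFE loses twice:
# once through a smaller H_0 and once through a smaller A_FFE.
print(a_thp, a_ffe_low, a_ffe_high)
```

Under these assumptions the THP step does not change as taps are added, while the FFE step keeps shrinking with the tap sum.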

B. PROPOSED FEED-FORWARD TOMLINSON-HARASHIMA PRECODING
The design process of the proposed FF-THP is illustrated in Fig. 2. {D} and {k} denote the sequences of the input data and of the quotient resulting from the modulo operation for the present data, respectively. M represents the modulus of the modulo operation and corresponds to the maximum amplitude of the signal range of Ch_in, which is [−M/2, M/2] as shown in Fig. 1. The FF-THP inherits the traditional THP operation, which has two main functions: a modulo operation to stabilize the output and a feedback equalization to compensate for the channel loss. These two key features are modified to build the FF-THP. First, the modulo operation is replaced by the addition of a predicted modulo value, {kM}, to the input as shown in Fig. 2(b) and (c), which is essential for the next modification. Second, the feedback equalizer is reconstructed as the equivalent FFE with pretaps to remove the precursor ISI as shown in Fig. 2(d). Thus, the proposed FF-THP gains the ability to remove the precursors of a channel while keeping the modulo operation. The tap coefficients of the FFE are determined to maximize the vertical eye margin (VEM) at the channel output. Because of the increased number of signal levels, the FF-THP, like the THP, has the drawback of requiring a larger receiver input range and more receiver samplers than a conventional FFE. However, because the FF-THP uses a feed-forward structure instead of the feedback equalizer, the feedback timing constraint is completely removed from the equalization, which enables high-speed operation. Moreover, the predictive modulo operation yields a larger eye opening and a larger SNR, which suit multi-level signaling.
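The first modification, replacing the modulo operation by the addition of kM, can be checked with a short simulation. This is a sketch under stated assumptions rather than the implemented circuit: the postcursors in `H_POST` are hypothetical, k is taken from a reference THP loop instead of being predicted, and the linear inverse of the channel is written recursively for brevity, whereas the paper realizes it as an FIR filter with pretaps.

```python
import numpy as np

M = 1.0
LEVELS = np.array([-0.375, -0.125, 0.125, 0.375]) * M  # PAM-4, spacing M/4
H_POST = [0.5, 0.25, 0.125]  # hypothetical normalized postcursors h_1..h_3


def thp_precode(d, h_post, M):
    """Reference THP: subtract postcursor ISI, then wrap into [-M/2, M/2)."""
    x = np.zeros(len(d))
    k = np.zeros(len(d), dtype=int)
    for n in range(len(d)):
        isi = sum(h * x[n - i] for i, h in enumerate(h_post, 1) if n - i >= 0)
        v = d[n] - isi
        k[n] = -int(np.floor((v + M / 2) / M))  # quotient of the modulo
        x[n] = v + k[n] * M
    return x, k


def linear_inverse(u, h_post):
    """Apply 1 / (1 + sum_i h_i z^-i) to u with no modulo in the loop."""
    x = np.zeros(len(u))
    for n in range(len(u)):
        isi = sum(h * x[n - i] for i, h in enumerate(h_post, 1) if n - i >= 0)
        x[n] = u[n] - isi
    return x


rng = np.random.default_rng(0)
d = rng.choice(LEVELS, size=2000)

x_thp, k = thp_precode(d, H_POST, M)
x_ff = linear_inverse(d + k * M, H_POST)   # modulo replaced by adding k*M

# Channel output (normalized, postcursors only) and receiver-side modulo.
y = np.convolve(np.concatenate(([1.0], H_POST)), x_thp)[: len(d)]
d_hat = (y + M / 2) % M - M / 2

print(np.allclose(x_thp, x_ff), np.allclose(d_hat, d), np.abs(x_thp).max())
```

Both paths produce the same transmitted sequence, and the channel output equals D + kM, so a modulo at the receiver recovers the data.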

C. z-DOMAIN ANALYSIS OF VERTICAL EYE MARGIN
The primary function of an equalizer is to provide a response that removes the channel ISI. Assuming that a channel has one precursor and N postcursors, the z-domain responses (ZDRs) of the channel and the normalized channel (H_ch(z) and h_ch(z)) can be represented as (3) and (4), respectively.
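The displayed equations did not survive in this copy. A form consistent with the stated assumptions (one precursor, N postcursors, normalization by the main cursor) would be the following; the sign convention for the exponent of z is an assumption:

```latex
\begin{align}
H_{ch}(z) &= H_{-1}\,z + H_{0} + H_{1}\,z^{-1} + \cdots + H_{N}\,z^{-N} \tag{3}\\
h_{ch}(z) &= \frac{H_{ch}(z)}{H_{0}} = h_{-1}\,z + 1 + h_{1}\,z^{-1} + \cdots + h_{N}\,z^{-N} \tag{4}
\end{align}
```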
where H_i and h_i denote the magnitude of the i-th tap of a single bit response (SBR) and of a normalized SBR, respectively. Since h_ch(z) is normalized by the main cursor H_0, h_i is equal to H_i/H_0. As shown in Fig. 2(a), the feedback filter of the THP is composed of posttaps that address only the postcursors of the normalized SBR and lacks the ability to remove the precursor. Thus, the ZDR of the equalizer having tap coefficients of h_1, h_2, ..., h_N is as follows, and the ZDR of its equivalent FIR implementation becomes (5), assuming the convergence of the THP.
On the other hand, both the FFE and the FF-THP can compensate for precursors by using pretaps. Since the output range of the TX is limited between −M/2 and M/2, an amplitude-adjusting coefficient is necessary for the FFE [5]. Including the amplitude adjustment, the ZDRs of the FFE and the FF-THP using tap coefficients (w_−2, w_−1, w_0 (= 1), w_1, ..., and w_N) that equalize the channel ISI including the precursor are given below.
An expression for the VEM can be derived by multiplying the ZDR of a channel by the ZDR of each equalizer. With the combined ZDR, R(z), representing the received signal, the VEM in PAM-L signaling is described below.
where R_i denotes the i-th coefficient of R(z). When a modulo operation is introduced, the amplitude of the data signal becomes M/L, reduced from M/(L−1) in PAM-L signaling [4]. Therefore, for calculating the VEMs of the THP and the FF-THP, (8) must be multiplied by the amplitude ratio of (L−1)/L. Calculating R(z) for the three equalizers and using (8), the VEMs are represented by (9)-(11) as follows, assuming that N and N′ go to infinity.
According to the above equations, as the channel has a larger precursor, H_−1, the VEM of the THP becomes smaller. Also, as the tap coefficients needed to compensate for the channel ISI become larger, the VEM of the FFE becomes smaller compared with the VEM of the FF-THP.

D. ONE POLE CHANNEL WITH ONE PRECURSOR
To demonstrate the effect of the precursor and the channel ISI, a hypothetical wireline channel with exponentially decaying postcursors and one precursor is taken as an example. In this case, the channel response (3) can be simplified to (12). Furthermore, assuming that the channel has unity DC gain, H_0 can be expressed in terms of h_−1 and h_1.
Two pretap coefficients and one posttap coefficient of the FFE and the FF-THP can be optimized for the channel response (12). The tap coefficients are derived from the partial derivatives of the ISI with respect to each of w_−2, w_−1, and w_1. The optimized tap coefficients are shown below.
From (17)-(19), the calculated VEMs of the THP, the FFE, and the FF-THP with respect to h_−1 and h_1 for PAM-4 signaling are illustrated in Fig. 3. In Fig. 3(a), when h_−1 is 0, the VEMs of the THP and the FF-THP are the same. However, since the THP has no control over the precursor, as h_−1 increases, the VEM of the THP becomes the smallest. As shown in the plots, the VEM of the FF-THP is the largest among the three.
A numerical simulation is conducted to compare the calculation with the simulation result. The SBR of the simulated channel and the simulated eye diagrams of the THP, the FFE, and the FF-THP are presented in Fig. 4. Considering that the natural channel response of ∼20 dB-loss channels features an h_1 of approximately 0.5 [12], h_−1 and h_1 of the SBR shown in Fig. 4(a) are set to 0.2 and 0.5, respectively, to verify the effect of a precursor on a ∼20 dB-loss channel. From (17)-(19), when h_−1 is 0.2 and h_1 is 0.5, the calculated VEMs of the THP, the FFE, and the FF-THP are 34 mV, 69 mV, and 89 mV, respectively. The tap coefficient h_i is equal to 0.5^i for the THP, and the tap coefficients of the FFE and the FF-THP are determined by (14)-(16). As shown in Fig. 4(b), the simulated VEMs match the calculated VEMs, which confirms the validity of the VEM equations (17)-(19).
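The unity-DC-gain step can be made concrete with a small sketch. It assumes that the normalized postcursors decay as h_i = h_1^i and that the DC gain equals the sum of the SBR taps, so that H_0·(h_−1 + 1/(1 − h_1)) = 1; the function names are hypothetical and the resulting expression is not quoted from the paper.

```python
def main_cursor(h_pre, h1):
    """H_0 for a one-pole channel with one precursor and unity DC gain.

    Assumes h_i = h1**i for i >= 1, so the normalized taps sum to
    h_pre + 1 + h1 + h1**2 + ... = h_pre + 1 / (1 - h1).
    """
    return 1.0 / (h_pre + 1.0 / (1.0 - h1))


def sbr(h_pre, h1, n_post):
    """Unnormalized SBR taps [H_-1, H_0, H_1, ..., H_n_post]."""
    h0 = main_cursor(h_pre, h1)
    return [h0 * h_pre, h0] + [h0 * h1 ** i for i in range(1, n_post + 1)]


H0 = main_cursor(0.2, 0.5)   # precursor 0.2, first postcursor 0.5
taps = sbr(0.2, 0.5, 60)
print(H0, sum(taps))         # the tap sum approaches the DC gain of 1
```

A larger precursor or a slower postcursor decay both lower H_0, which is the quantity that scales every VEM expression.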

III. TRANSMITTER IMPLEMENTATION
A. OVERALL ARCHITECTURE
The overall block diagram of the proposed TX with the FF-THP is illustrated in Fig. 5. The digital block of the TX consists of an 8-bit parallel PRBS generator, a modulo prediction engine (MPE), and FFE cells. The analog block includes 4:1 serializers with 1-UI pulse generators, single-to-differential converters (S2Ds), an 8-bit differential digital-to-analog converter (DAC), and a phase-locked loop (PLL) based on a ring oscillator for 1.25-GHz quadrature clocks. The quadrature clocks from the PLL generate the 1-UI pulses, and four pass-gates serialize the data with the four phases of 1-UI pulses. Also, in the DAC, 50 Ω matching resistors are implemented to remove the reflection from the channel. The externally controlled 10-bit coefficients for the two pretaps, the main tap, and the ten posttaps in the 4-phase FFE cells accurately compensate for the channel ISI and maximize the VEM. The operation of the TX can be switched between the FFE mode and the FF-THP mode to compare the performance of the two equalization methods.

B. MODULO PREDICTION ENGINE
The structure of the MPE is presented in Fig. 6. The inputs of the modulo table cell (MTC) are the last two PAM-4 data (D_0 and D_1), the modulo values for both data (M_0 and M_1), and the current PAM-4 data (D_2). From these, the MTC generates the modulo value for the current data (M_2). It is worth noting that, since the MTC depends only on the last two data and their modulo values, the MTC can be applied to another channel if its first and second posttaps (w_1 and w_2) are similar to those of the target channel. However, since the MTC considers only w_1 and w_2, the residual ISI that is not removed by w_1 and w_2 may cause a modulo prediction error and induce additional ISI. Because a wireline channel shows a response similar to that of a one-pole channel, w_1 and w_2 can sufficiently compensate for the channel response. Therefore, the residual ISI is negligible, and the other tap coefficients are much smaller than w_1 and w_2. Also, even if a modulo prediction error occurs, when D_1 is −0.375, which corresponds to PAM-4 data 00, M_2 depends only on D_2 regardless of whether M_1 is 0 or 1, as shown in the simplified table. Consequently, the modulo prediction error can be self-healed, and burst errors can be prevented.
A modulo operation in the THP is calculated from a direct summation of the products of the data and the taps of the feedback equalizer. In the MTC, however, the modulo value is predetermined by the channel. Therefore, the burden of digital computation is much reduced. In addition, a modulo look-ahead (MLA) technique is used through 9 modulo prediction units.
To further enhance the data rate, there are two options: increasing the clock frequency and expanding the parallelism. The MTC is designed considering the first and the second posttaps. Still, since the modulo prediction error can be self-healed, the MTC can be simplified so that it considers only w_1 at the expense of a slight degradation of BER. The simplified version of the MTC allows a higher clock frequency. Moreover, expanding the 4-parallel structure to a 2^N-parallel structure can nominally increase the data rate by a factor of 2^(N−2). Thus, with the simplified MTC and the expansion of parallelism, the data rate can be increased significantly. Also, because the MPE is a purely digital structure, immediate improvements in efficiency and data rate are expected in newer technologies.
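The parallelism argument is simple arithmetic and can be written out as a sketch. It assumes the data rate is the product of the number of parallel phases, the digital clock frequency, and the bits per symbol, with the clock held constant while the parallelism grows; the function names are hypothetical.

```python
def data_rate_bps(parallel_phases, clock_hz, bits_per_symbol=2):
    """Nominal line rate: each phase launches one PAM symbol per clock cycle."""
    return parallel_phases * clock_hz * bits_per_symbol


def parallel_speedup(n):
    """Nominal gain of a 2**n-parallel structure over the 4-parallel one."""
    return 2 ** (n - 2)


base = data_rate_bps(4, 1.25e9)    # 4 phases at 1.25 GHz, PAM-4
wider = data_rate_bps(8, 1.25e9)   # hypothetical 2**3-parallel variant
print(base, wider / base, parallel_speedup(3))
```

With the prototype's four phases at 1.25 GHz this reproduces the 10 Gb/s line rate, and an 8-parallel variant at the same clock would nominally double it.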

C. FEED-FORWARD EQUALIZER
The structure of one phase of the 4-parallel FFE is described in Fig. 7. The 5-bit sums of the data and the modulo value are multiplied by the 10-bit tap coefficients. For clarity, the pipelining is omitted in the figure but is implemented in the fabricated chip. As a result, contrary to the THP, the digital computation of the FFE does not suffer from the timing issue and operates at a high digital clock frequency. The tap coefficients, w_i, corresponding to a specific channel are determined to maximize the VEM by using the ArgMax function in Mathematica, which finds a global maximum under given constraints. Optimized for the same SBR, the ratios between the main tap and the other 12 tap coefficients (w_i/w_0) remain the same for the FFE and the FF-THP. However, the magnitude of the tap coefficients can be greater for the FF-THP because adding the modulo value guarantees that the output remains within the acceptable input range of the DAC driver.

IV. MEASUREMENT RESULT
The measurement setup for the 10 Gb/s PAM-4 transmitter is presented in Fig. 8. The vector signal generator provides a 78.125 MHz reference clock to the PLL, which generates a 1.25-GHz clock with a 1/16 divider. To measure the performance of the FFE and the FF-THP, a DisplayPort cable and SMA cables are used as the channel. To measure the transmitter output itself, the output of the test chip is connected directly to the oscilloscope. Fig. 9 exhibits the measured 10 Gb/s PAM-4 eye diagram and the histogram of the eye diagram. The eye diagram of the TX output features an output range of 800 mV_PP. For this measurement, a lossy channel is not added. The distribution in Fig. 9(b) shows that the signal is concentrated at the center when the TX operates in the FFE mode. On the other hand, when the TX operates in the FF-THP mode, the signal is evenly distributed. Because of this widespread distribution, the FF-THP features a better SNR than the FFE.
The insertion loss and the normalized SBR of the measured channel are presented in Fig. 10. The channel loss is 21 dB at the Nyquist frequency of 2.5 GHz, and the first postcursor of the channel is around 0.5, which is the natural response of a ∼20 dB-loss channel, as mentioned before. Also, the sum of the normalized ISI of the SBR is 1.48 times the main cursor.
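The two channel figures quoted above follow from short calculations, sketched here under stated assumptions: the Nyquist frequency is taken as half the symbol rate, and the normalized ISI sum is taken as the sum of the absolute non-main SBR taps divided by the main cursor. The tap values in the example are hypothetical and are not the measured SBR.

```python
import math


def nyquist_hz(bit_rate_bps, levels):
    """Half the symbol rate of a PAM signal with the given number of levels."""
    symbol_rate = bit_rate_bps / math.log2(levels)
    return symbol_rate / 2.0


def normalized_isi_sum(sbr_taps, main_index):
    """Sum of |H_i| over all non-main taps, divided by the main cursor."""
    main = sbr_taps[main_index]
    isi = sum(abs(t) for i, t in enumerate(sbr_taps) if i != main_index)
    return isi / abs(main)


f_nyq = nyquist_hz(10e9, 4)                  # 10 Gb/s PAM-4
example_taps = [0.05, 0.40, 0.20, 0.10, 0.05]  # hypothetical SBR, main at index 1
print(f_nyq, normalized_isi_sum(example_taps, 1))
```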
Before presenting the measurements of the channel output of the proposed TX, it is necessary to mention a method that indirectly evaluates the BER performance of the TX [15]. Assuming that Gaussian noise is added to the output data, the BER for the PAM-L signal and the decision threshold of the data X are expressed in terms of d and σ, which denote the magnitude of the data and the standard deviation of the Gaussian noise, respectively. d can be substituted by the difference between the mean of X and that of the data adjacent to X, and σ can be substituted by the standard deviation of X. The means and standard deviations of each PAM-4 data level can be obtained from the histogram of the received signal.
Fig. 11 exhibits the measured 10 Gb/s PAM-4 eye diagrams of the fabricated chip compensating for the channel. When the TX operates in the FF-THP mode, two additional levels appear along with the conventional PAM-4 levels, as expected. The proposed FF-THP achieves a level mismatch ratio (R_LM) of 99.1%, and the VEM is improved by 38.9% compared with the FFE. From the histograms in Fig. 12(c) and (d), the means and standard deviations of the data signal are obtained. Estimated from the Gaussian distribution, the decision thresholds and the bathtub curves of the FFE and the FF-THP are presented in Fig. 12(e) and (f). The proposed FF-THP achieves a BER lower than 10^−8 at the center of the eye and an 87.5% larger horizontal eye margin (HEM) than the FFE at a BER of 10^−5.
Fig. 12 features the chip photomicrograph. The proposed TX occupies an active area of 0.075 mm^2. The power and area breakdowns of the fabricated chip are presented in Fig. 13. The digital block occupies 0.0322 mm^2 and consumes 53.3% of the total power. Without the PRBS generator, the FF-THP alone occupies 0.022 mm^2. With a 1-V supply, the power consumptions of the digital and analog blocks are 32 mW and 28 mW, respectively.
Table 1 compares the performance of the proposed FF-THP-based TX with other PAM-4 TXs that compensate for a high channel loss or large ISI. The sum of channel ISI is an important parameter because the VEMs of TX equalizers depend on it. Also, an asymmetric link such as a memory interface has multiple drops, whose effect is captured not by the channel loss at the Nyquist frequency but by the sum of channel ISI. From the viewpoint of channel ISI, the proposed design, assisted by the pretaps and the modulo-based signaling, compensates for a normalized ISI sum of 1.48, which is the largest among the compared designs. As a result, the FF-THP achieves the best FoM 2 of 4.05 pJ/b/ISI with a BER lower than 10^−8.
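The histogram-based BER estimate described above can be sketched as follows. The paper's exact expression is not reproduced in this copy, so this is a common textbook form and not necessarily the one used: each level is modeled as a Gaussian with the mean and standard deviation read from the histogram, the threshold between two neighbors is placed where both tails are the same number of sigmas away, levels are assumed equiprobable, and one bit error per symbol error is assumed. The numeric inputs are hypothetical.

```python
import math


def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def threshold(mu_lo, sig_lo, mu_hi, sig_hi):
    """Threshold that is equally many sigmas away from both neighboring levels."""
    return (mu_lo * sig_hi + mu_hi * sig_lo) / (sig_lo + sig_hi)


def estimate_ber(means, sigmas, bits_per_symbol=2):
    """Estimated BER from per-level means and sigmas (levels sorted ascending)."""
    n = len(means)
    ser = 0.0
    for i in range(n - 1):
        t = threshold(means[i], sigmas[i], means[i + 1], sigmas[i + 1])
        ser += q_func((t - means[i]) / sigmas[i])          # level i read as i+1
        ser += q_func((means[i + 1] - t) / sigmas[i + 1])  # level i+1 read as i
    return ser / n / bits_per_symbol


# Hypothetical histogram statistics in volts, not measured values.
means = [-0.15, -0.05, 0.05, 0.15]
sigmas = [0.008, 0.008, 0.008, 0.008]
print(threshold(means[0], sigmas[0], means[1], sigmas[1]), estimate_ber(means, sigmas))
```

Sweeping the sampling phase and repeating the estimate at each phase would give a bathtub curve of the kind shown in the figures.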
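The figure of merit can be reproduced from the numbers reported above. This sketch assumes the FoM is the total power divided by the data rate and by the normalized ISI sum, as the abstract's description (energy efficiency per sum of channel ISI) suggests; the function name is hypothetical.

```python
def fom_pj_per_bit_per_isi(power_w, bit_rate_bps, isi_sum):
    """Energy per bit in pJ, divided by the normalized sum of channel ISI."""
    energy_pj_per_bit = power_w / bit_rate_bps * 1e12
    return energy_pj_per_bit / isi_sum


p_digital, p_analog = 32e-3, 28e-3   # W, at a 1-V supply
p_total = p_digital + p_analog

fom = fom_pj_per_bit_per_isi(p_total, 10e9, 1.48)
digital_share = p_digital / p_total
print(round(fom, 2), round(100 * digital_share, 1))
```

The 60 mW total corresponds to 6 pJ/b at 10 Gb/s, and the digital share of 32/60 agrees with the reported 53.3%.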

V. CONCLUSION
This paper presents a PAM-4 TX introducing the FF-THP. The proposed TX takes advantage of both the modulo-based equalization and the controllability over a precursor. Moreover, a quantitative z-domain analysis of the channel response and of the equalization by the THP, the FFE, and the FF-THP is conducted. A simple one-pole channel with one precursor is employed to demonstrate the repercussions of a precursor and the effectiveness of the FF-THP. From the analysis, the FF-THP shows the largest VEM among the TX equalization methods when the channel has a precursor or large ISI. Also, considering Gaussian noise, the bathtub curve and the decision threshold voltage are indirectly estimated from the histogram of the eye diagram.
The fabricated chip employs two pretaps and ten posttaps to compensate for the 21 dB channel loss. The proposed FF-THP presents an 87.5% wider HEM at the estimated BER of 10^−5 and a 38.9% larger VEM compared with the FFE. In addition, the FF-THP achieves an estimated BER lower than 10^−8 and the FoM 2 of 4.05 pJ/b/ISI. Moreover, the digital-based equalization technique can take full advantage of process scaling. The prototype TX can comprehensively equalize channels having precursors, large ISI, or both. Therefore, the proposed design can be applied to further high-speed wireline communication and multi-level signaling.