The GBT-FPGA core: features and challenges

Initiated in 2009 to emulate the GBTX (Gigabit Transceiver) serial link and to test the first GBTX prototypes, the GBT-FPGA project is now a full library targeting FPGAs (Field Programmable Gate Arrays) from Altera and Xilinx, allowing the implementation of one or several GBT links of two different types: "Standard" or "Latency-Optimized". The first major version of this IP Core was released in April 2014. This paper presents the various flavours of the GBT-FPGA kit and focuses on the challenge of providing a system with fixed and deterministic latency, for both clock and data recovery, across all supported FPGA families.


Rad-hard optical link for experiments
Due to the high beam luminosity planned for the future upgrade of the LHC accelerator, the High-Luminosity LHC (HL-LHC) [1], the experiments will require high-data-rate links and electronic components capable of sustaining high radiation doses. To address these requirements, the new "Rad-Hard Optical Link for Experiments" (referred to in this article as "the GBT-based link") is currently under development [2]. The diagram shown in figure 1 depicts a typical system featuring the GBT-based link, highlighting its major components.
On the on-detector side, the GBTX ASIC [3] forwards the Timing, Trigger and Control (TTC) [4,5] and Experiment Control (EC) information to Front-End (FE) ASICs and reads them out through low-speed (80, 160 or 320 Mb/s) electrical links named E-links [6].
The physical link between the off-detector electronics and the GBTX ASIC is known as the "Versatile Link" (VL) [7], a high-speed (4.8 Gbps) optical link whose major component is a custom plug-in module performing optical-to-electrical conversion (and vice versa) named the Versatile Transceiver (VTRx) [8]. It is important to mention that only rad-hard parts are used on-detector, since they have to cope with extremely high radiation levels, whilst the off-detector electronics may be implemented using Commercial Off-The-Shelf (COTS) components. On the off-detector side, a Back-End (BE) FPGA-based board programmed with the GBT-FPGA VHDL-based firmware [9] acts as the single connection point, transmitting TTC and EC data to the on-detector electronics as well as receiving and forwarding detector data to the central Data AcQuisition (DAQ).

The GBT-FPGA core
In order to facilitate the in-system implementation and the user support of the GBT-FPGA, the different components of the core are integrated in a single module called "GBT Bank" (see figure 2). Most of these components are common to the different platforms.
The GBT Bank may include several "GBT Links" (the maximum number is vendor dependent). An example of a GBT Bank featuring two GBT Links is shown in figure 2. Each GBT Link is composed of a GBT Transmitter (TX), a GBT Receiver (RX) and a MultiGigabit Transceiver (MGT). The clocking resources are external to the GBT Bank so that the user can connect the different clocks as desired. The parameters of the GBT Bank may be set at implementation time through a single file (the GBT User Configuration File).

The clock recovery and latency determinism issues
The unification of the TTC, EC and DAQ functionality on the new link simplifies the topology and reduces the number of optical links. However, this topology introduces new technical challenges related to the reference clock, which has to be recovered from the incoming data stream [10]. Trigger-related electronic systems in High Energy Physics (HEP) experiments, such as TTC links, require a fixed, low and deterministic latency in the transmission of clock and data to ensure correct event building.
On the other hand, other electronic systems that are not time critical, such as DAQ, do not need to comply with this requirement.
The GBT-FPGA project provides two types of implementation for the transmitter and the receiver: the "Standard" version, targeted for non-time critical applications and the "Latency-Optimized" version, ensuring a fixed, low and deterministic latency of the clock and data (at the cost of a more complex implementation). The main differences between the Standard and the Latency-Optimized versions are presented in table 1.
This article focuses on the Potential Uncertainty Points (PUPs) of the GBT Bank when implementing the Latency-Optimized version and explains the methods applied by the GBT-FPGA team in order to guarantee the latency determinism in both clock and data. Then the importance of the system calibration, that ensures the correct sampling of the data, is emphasized. Finally, the results of the characterization of the GBT-FPGA core in terms of latency versus temperature are presented.

Potential uncertainty points in the latency-optimized version
Achieving a fixed and deterministic phase in clocks, as well as a low, fixed and deterministic latency in data transmission and reception, when implementing the Latency-Optimized version of the GBT-FPGA core requires identifying and properly managing the different PUPs of the GBT-FPGA core-based system.
Since these uncertainty points are implementation and configuration dependent, dealing with systems featuring the Latency-Optimized version of the GBT-FPGA core may become very challenging, especially in multi-GBT Link configurations.
The different PUPs within the GBT Bank are already properly managed by the firmware; the user only needs to constrain some critical paths of the GBT Link(s) during in-system implementation (these constraints are shown in the GBT-FPGA example design). On the other hand, the external PUPs have to be managed by the user (although the GBT-FPGA project provides example designs including custom modules for this purpose). It is important to mention that the methods applied for achieving latency determinism in the Latency-Optimized version are based on general concepts of logic design, so most of them are common to the different FPGAs supported by the GBT-FPGA project.
The block diagram shown in figure 3 depicts a GBT Bank featuring one GBT Link, highlighting the internal and external PUPs, which are explained in the following subsections.

Unnecessary Clock Domains (UCD)
One of the most delicate points in logic design and a source of latency uncertainty issues is the clock domain crossing (CDC). In order to minimize the impact of CDCs in the GBT-FPGA Core, the number of clock domains has been reduced as much as possible.
The most critical components of the GBT-FPGA Core in terms of CDCs are the MGTs, both because of their complexity (they contain a high number of clock domains) and because they are hard IP blocks provided by the FPGA vendor, so the user can only set the parameters of the core.
The number of internal clock domains within the MGT has been minimized by properly selecting the frequency of the different clocks and the width of the data buses. It is important to mention that the internal architecture of the MGT varies between the different models so each case has to be studied independently.
As an example of MGT clock domains complexity, the internal structure of a GTX TX in a Xilinx Virtex 6 is shown in figure 4. This single GTX structure can have up to 4 separate clock domains.

Non Latency Deterministic Components (NLDC)
Although the number of clock domains in the GBT-FPGA Core has been minimized, not all of them can be merged, so the crossing of the remaining clock domain boundaries has to be achieved in a way that ensures low and deterministic data latency.
The common approach for the CDC of multi-bit signals in high-throughput data paths is the use of memories (such as asynchronous FIFOs or DPRAMs) [11], due to their high performance, reliability and simplicity of use (this approach is used in the Standard version). However, these blocks carry a latency penalty, which may not be deterministic, and add undesired extra clock cycles.
Due to the above-mentioned penalty, in the Latency-Optimized version the memory-based CDC used in the Standard version is replaced by a register-based approach that ensures low and deterministic data latency, at the cost of requiring a calibration after system implementation to guarantee data integrity. The constraints on the phase between clocks are discussed below and the calibration is detailed in section 3.
One example of Non Latency Deterministic Components in the GBT-FPGA Core is the GBT TX GearBox (see figure 5), where the scrambled and encoded data (a 120-bit frame), generated at 40 MHz in the TX FRAMECLK domain ("Bunch Clock" in LHC Experiments [5]), is divided into several words when crossing to the TX WORDCLK domain (the MGT TX clock). The number and width of the words as well as the frequency of TX WORDCLK are device dependent and it may be either three words of 40 bits at 120 MHz or six words of 20 bits at 240 MHz.
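As a purely illustrative aid (the core itself is VHDL firmware; the function name and dummy frame below are ours, not part of the project), the word splitting performed by the TX GearBox can be sketched in Python:

```python
# Illustrative model of the GBT TX GearBox: a 120-bit frame produced at
# 40 MHz in the TX FRAMECLK domain is split into device-dependent words
# in the TX WORDCLK domain (3 x 40 bits at 120 MHz or 6 x 20 bits at
# 240 MHz).

def gearbox_split(frame_bits, word_width):
    """Split a 120-bit frame (given MSB first) into words of word_width bits."""
    assert len(frame_bits) == 120 and 120 % word_width == 0
    return [frame_bits[i:i + word_width]
            for i in range(0, 120, word_width)]

frame = "01" * 60                       # dummy 120-bit frame
words_40 = gearbox_split(frame, 40)     # 3 words, sent at 120 MHz
words_20 = gearbox_split(frame, 20)     # 6 words, sent at 240 MHz
assert len(words_40) == 3 and len(words_20) == 6
assert "".join(words_40) == frame == "".join(words_20)
```

The model only captures the data reshaping; the actual firmware must additionally guarantee a deterministic phase between the two clock domains, which is the point of this subsection.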

Phase Relationship Between Clocks (PRBC)
As previously mentioned, the phase relationship between the different clocks is another constraint when implementing the Latency-Optimized version of the GBT-FPGA Core. It must be fixed in order to achieve latency determinism in both clock and data as well as set to a correct value for proper data sampling.
The fixed phase relationship requires the use of a single clock source (generally the Bunch Clock) that is multiplied, divided and recovered from the data stream to generate the different clock domains. In some cases, the devices in charge of deriving this single clock source (clock synthesizers, etc.) have been carefully selected and characterized [12]. These devices offer the features required for fixed and deterministic phase operation (such as clock feedbacks (zero-delay), synchronous dividers, etc.) and are controlled by the firmware when necessary. In other cases, when a component does not comply with the requirements of the Latency-Optimized version (e.g. the asynchronous "Byte Serializer" clock divider in Altera Cyclone V GT [13] and Stratix V GX [14] transceivers, most common PLLs, etc.), the firmware monitors and controls the phase of the derived clocks to ensure phase determinism.
The correct phase value of the different clocks is selected during the calibration procedure, explained in section 3.

Clock Frequency Multiplication/Division (CFMD)
Multiplying the frequency of a source clock and then dividing it back to the original frequency (e.g. creating the TX WORDCLK clock out of the TX FRAMECLK) leads to a non-deterministic phase difference between the divided and source clocks. This is due to the fact that the rising edge of the divided clock may lock onto any of the rising edges of the multiplied clock (see figure 6).
In order to avoid this uncertainty in the phase of the divided clock, it is necessary to synchronize the reset of the clock divider with the rising edge of the source clock, thus ensuring that the divided clock will be in phase with the source clock. In the case of the GBT-FPGA Core, for the Latency-Optimized version, clock dividers are synchronized with the rising edge of TX FRAMECLK (which can be considered as the source clock).
In cases where this approach is not possible (e.g. the source clock is not available because the multiplied clock is recovered from a data stream on a different board), it is necessary to generate a reference signal that indicates when the rising edge of the source clock occurs and then shift the phase of the divided clock to lock onto the reference signal. In the GBT-FPGA Core, each time the header of the data frame (which is aligned with the rising edge of TX FRAMECLK) is detected by the GBT RX control logic, a flag signal ("header flag") is asserted, so it may be used as a reference for the phase alignment of RX FRAMECLK (which is a recovered TX FRAMECLK).
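To make the uncertainty concrete, the following sketch (illustrative Python only; the helper is hypothetical and not part of the firmware) enumerates the phase offsets a free-running divider may take, and shows that a divider reset synchronized to the source clock corresponds to the zero-offset case:

```python
# Illustrative model of the CFMD uncertainty: a source clock at frequency f
# is multiplied by M and divided back to f. The divided clock may start on
# any of the M rising edges of the multiplied clock within one source
# period, giving M possible phase offsets with respect to the source clock.

def possible_phases(mult_factor, source_period_ns):
    """Phase offsets (ns) the divided clock may take relative to the source."""
    step = source_period_ns / mult_factor
    return [k * step for k in range(mult_factor)]

# Example: TX FRAMECLK at 40 MHz (25 ns period) multiplied by 6 to 240 MHz.
phases = possible_phases(6, 25.0)
assert len(phases) == 6        # free-running divider: 6 possible lock points
assert phases[0] == 0.0        # synchronized divider reset picks this one
```

A reset synchronized to the source-clock rising edge forces the divider to always start at the same multiplied-clock edge, collapsing the M possibilities to the single in-phase case.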

Clock & Data Recovery (CDR)
A particular case of CFMD is the Clock and Data Recovery (CDR), where both clock and data have to be properly managed in order to avoid uncertainty issues.
In the case of the GBT-FPGA Core, clock and data from the incoming serial stream at 4.8 Gbps are separated during the CDR procedure in the MGT RX of the GBT Link. Then, the frequency of the serial clock is divided by a factor N to generate RX WORDCLK (N equals 10 when RX WORDCLK runs at 240 MHz, or 20 when it runs at 120 MHz). Because the CDR uses Dual Data Rate (DDR), the recovered clock may lock onto either edge (rising or falling) of the serial clock. This gives 2xN possible phases for the recovered clock and the same number of possible bitslips in the recovered data. This clock and data uncertainty is resolved by custom monitor logic embedded in the GBT Link, which shifts both clock and data to ensure a constant line latency [10].
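The arithmetic behind the 2xN uncertainty can be sketched as follows (an illustrative Python model, not firmware code; the 2400 MHz DDR serial clock corresponds to the 4.8 Gbps line rate):

```python
# Illustrative model of the CDR uncertainty: the serial clock divided by N
# yields RX WORDCLK, and DDR operation lets the recovered clock lock on
# either edge of the serial clock, so there are 2*N possible clock phases
# (and the same number of possible bitslips in the recovered data).

def cdr_phase_count(serial_clk_mhz, wordclk_mhz):
    """Number of possible recovered-clock phases (and bitslips)."""
    n = serial_clk_mhz // wordclk_mhz   # division factor N
    return 2 * n                        # DDR: either edge of the serial clock

# 4.8 Gbps line rate -> 2400 MHz serial clock in DDR mode
assert cdr_phase_count(2400, 240) == 20   # N = 10, RX WORDCLK at 240 MHz
assert cdr_phase_count(2400, 120) == 40   # N = 20, RX WORDCLK at 120 MHz
```

The monitor logic in the GBT Link effectively scans these 2xN possibilities and shifts clock and data until the known, constant line latency is observed.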

Calibration
In any FPGA logic design, the phase of the internal clocks as well as the length of the combinatorial data paths between flip-flops may vary after re-implementation [15]. In addition, the phase of the different clocks may also vary with physical changes in the system (e.g. fibre replacement, etc.). For this reason, in systems featuring register-based CDC, this uncertainty in clock phase and data propagation delay may lead to errors and latency uncertainty in the data.
In the worst case, the launch clock (e.g. TX FRAMECLK) and the latch clock (e.g. TX WORDCLK) are wrongly aligned and data is sampled in the metastability zone, producing errors. Furthermore, data may be sampled correctly after implementation but very close to the metastability zone, so a slight phase shift of the launch and/or latch clocks (due to temperature variations, etc.) will result in errors or even in a phase jump of one latch clock period for the data.
To ensure the reliability of the system, a calibration process is required after any re-implementation or any physical change in the system that varies clock and/or data delays. The calibration consists of adjusting the phase of the launch and/or latch clocks so that the latch clock samples in the middle of the sampling window, guaranteeing a comfortable margin (see figure 7).
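A hypothetical sketch of such a calibration, assuming the system can step the latch-clock phase and report, per phase tap, whether data is received error-free (the scan logic and names below are ours, not the actual procedure, and a contiguous sampling window is assumed):

```python
# Hypothetical calibration sweep: step the latch-clock phase across one
# period, record which taps sample data without errors (the sampling
# window), and select the tap in the centre of that window to maximize
# the margin against phase drift.

def centre_of_sampling_window(ok):
    """ok: list of booleans, one per phase tap, True where data is error-free.
    Assumes a single contiguous window; returns the index of its middle tap."""
    good = [i for i, error_free in enumerate(ok) if error_free]
    return good[len(good) // 2]

# Example: 16 phase taps, of which taps 5..11 sample correctly.
ok = [False] * 5 + [True] * 7 + [False] * 4
assert centre_of_sampling_window(ok) == 8   # mid-window tap
```

In a real system the window may wrap around the phase period, in which case the scan must unwrap it before taking the midpoint; the principle of centring the latch clock in the error-free region is the same.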
Please note that after calibration, further compensation (e.g. due to variation in fibre length) is not required, since the Back-End crates as well as the fibres between the counting room and the detector are not subject to large temperature variations.

Measurements and results
A fully automated testbench was used to verify the determinism of the clock phase and data latency, as well as to measure the drift of the clock phases versus temperature, making sure that this drift does not affect the reliability of the system.
For the test, two FPGA-based development kits from Xilinx (KC705) [16] were used, each loaded with a test firmware featuring the Latency-Optimized version of the GBT-FPGA Core. One of the development kits was placed inside a thermal chamber whereas the other was placed outside, so only one board was affected by the temperature variation, thus facilitating the identification of critical points.
Firstly, both systems were calibrated at 20 °C and then the measurements were performed in two steps. In the first step, the skew between the different clocks and the latency of the data paths were measured using an oscilloscope while performing power cycles of the boards (about 10,000 cycles) at a constant temperature of 20 °C. In the second step, in addition to the previous test, repetitive temperature cycles from 10 °C to 60 °C and back to 10 °C (in steps of 0.3 °C) were performed on the board hosted within the thermal chamber. The temperature cycles were repeated over eight hours.
The test showed a maximum clock and data phase uncertainty of 100 ps peak-to-peak between system power-ups and a maximum phase drift of 200 ps for a temperature variation of 50 °C. An example of a latency versus temperature measurement is shown in figure 8.
As previously mentioned, temperature variations in the Back-End crates are foreseen to be very small and to drift very slowly. For that reason, the maximum phase drift expected during operation is about 20 ps, a value within the specification for LHC experiments.

Summary
The GBT-FPGA Core is an HDL (VHDL) IP Core provided by the GBT-FPGA project, targeting FPGAs from Xilinx and Altera (versions for other vendors are under development by users), that allows communication with the GBTX from the counting room as well as emulation of part of the GBTX for test purposes, supporting all three encoding schemes of the GBTX (GBT-Frame, Wide-Bus and 8B/10B). This core may be implemented in two different versions: the Standard, for DAQ applications where latency is not an issue, and the Latency-Optimized, for TTC applications where latency determinism is a critical factor. When implementing the Latency-Optimized version, it is necessary to properly manage the Potential Uncertainty Points and to calibrate the system in order to guarantee fixed clock phase and data latency as well as correct data sampling. The determinism and the impact of temperature on the system have been verified in an intensive study carried out using a fully automated test bench, with excellent results.