GBT link testing and performance measurement on PCIe40 and AMC40 custom design FPGA boards

The high-energy physics experiments at CERN's Large Hadron Collider (LHC) are preparing for Run 3, which is foreseen to start in the year 2021. Data from the high-radiation environment of the detector front-end electronics are transported to the data processing units, located in low-radiation zones, through GBT (Gigabit Transceiver) links. The present work discusses the GBT link performance study carried out on custom FPGA boards, the clock calibration logic and its implementation in the new Arria 10 FPGA.


Introduction
In a High Energy Physics experiment, one of the major challenges is to transfer data with very high reliability between the different sub-detectors, situated in the harsh radiation zone, and the data acquisition electronics located in the non-radiation area. The S-LINK (Simple Link Interface) [1] specification, defined in 1995 at CERN, describes a data link for moving data or control words from the front-end electronics to the read-out electronics via the GOL (Gigabit Optical Link) serializer chip. The sub-detectors must at the same time be able to receive trigger and timing information while maintaining a constant latency. A unified approach was developed by RD12 for the broadcasting of TTC (Timing, Trigger and Control) [2] signals from the RF generators of the LHC machine to the outputs of the timing receiver ASICs (TTCrx + QPLL) at the experiment and beam instrumentation destinations.
Since there was no single chip providing both functionalities, the CERN team decided to build one chip combining the two. This initiated the GBT (Gigabit Transceiver) [3] and Versatile Link [4] projects. GBT is a radiation-tolerant, error-resilient data communication standard with fixed-latency support. A single GBT link can carry detector data, timing, trigger and control traffic. The radiation-tolerant GBT chipset is used to package detector data and transmit it in the GBT standard, while the optoelectronic components and the point-to-point optical link connecting the GBT ASIC with the FPGA (Field Programmable Gate Array) / COTS (Commercial Off-The-Shelf) electronics are qualified by the Versatile Link project group.

JINST 11 C03039

A short summary of the GBT protocol is tabulated in table 1, and figure 1 gives the details of the GBT frame. The current work describes the GBT-FPGA core firmware [5,6] implementation on FPGA (reconfigurable electronics hardware), inter-FPGA firmware migration, the clock calibration strategy and a comparative performance study. Two custom DAQ boards, AMC40 [7] and PCIe40 [8,9], based on the latest Altera Stratix V [10] and Arria 10 [11] FPGAs, were used for the detailed comparative study of the GBT-FPGA core firmware implementation. Both designs use the PMA (Physical Media Attachment) bonded mode [10,11] configuration, to get the same latency through each bonded data path. Accordingly, in Stratix V and Arria 10, the GBT PMA links are grouped into 3 and 6 links respectively to form a single GBT bank.

Resource estimation
For resource estimation, a standard reference design from the CERN GBT team is used, with the same configuration on both boards. The reference design altogether contains 4 links operating in two different modes, as shown in table 2. A comparison of the resource utilization for the GBT reference design implemented on the AMC40 vs the PCIe40 board is shown in table 3.

Latency measurement
GBT has two operational modes: Standard mode, which uses an elastic buffer (FIFO) for Clock Domain Crossing (CDC), and Latency Optimized mode, which uses an inelastic buffer (register bank) for CDC. Latency measurement is very critical for GBT. In GBT, the information content can be in data, timing or control format, and each comes with its own latency boundary condition. The most stringent latency boundary conditions apply to timing and trigger distribution.

Table 2. Reference Design configuration.
Configuration Parameter      Value
No. of GBT Banks             2
Bank 1: No. of Links         1
Latency. Latency occurs in both the transmit and receive directions, and is different in each direction, depending on the media and path involved. Latency gives information about the logic path delay and whether the path contains an elastic or an inelastic buffer. It directly contributes to the calculation of the round-trip delay. In our observation, we have used the round-trip delay to get a quick estimate of the aggregate latency.
Roundtrip delay. The round-trip delay corresponds to the length of time it takes for a signal to be sent plus the length of time it takes for an acknowledgment of that signal to be received. It includes the serialization and deserialization times along with the propagation delay. The round-trip delay is formulated in the following equations for an easy understanding of the path delay, or logic delay (D), involved.
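As an illustrative sketch of the delay budget just described (the paper's own equations are not reproduced here), the round-trip delay can be written as twice the sum of the serialization, propagation, deserialization and logic-delay terms. All numeric values below are hypothetical placeholders, not measured results:

```python
# Hedged sketch of the round-trip delay budget described above.
# Every numeric value is an illustrative assumption, not a measurement.

def round_trip_delay_ns(serialization_ns, propagation_ns,
                        deserialization_ns, logic_delay_ns):
    """Round-trip delay = one-way path delay traversed in both directions."""
    one_way = serialization_ns + propagation_ns + deserialization_ns + logic_delay_ns
    return 2 * one_way

# Example: assumed 5 m of fibre (~5 ns/m) plus assumed SerDes and logic delays.
rtd = round_trip_delay_ns(serialization_ns=50.0,
                          propagation_ns=5 * 5.0,
                          deserialization_ns=50.0,
                          logic_delay_ns=25.0)
print(f"round-trip delay = {rtd} ns")  # 300.0 ns for these assumed inputs
```

Subtracting the known serialization, deserialization and propagation terms from a measured round-trip delay then isolates the logic delay D of interest.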

GBTx ASIC Loopback
The round-trip delay can be considered as the round-trip latency (L), provided there are no hold-ups in the path such as elastic buffers, queuing or congestion control. Such conditions exist only under special constraints. The GBT ASIC by default operates in latency optimized mode on both the Tx and Rx sides, while the GBT-FPGA core needs to be forced to operate in that mode. By design, the GBT ASIC achieves the lowest round-trip latency; so, the lower the ∆L, the better the GBT-FPGA core performance.
Resolution. The measurement of latency within the FPGA using firmware always depends on the data rate and on the step of data parallelism at which it is sampled. In our case, it is sampled at the beginning of the transmitted data frame (120 bit) at 40 MHz. So, the resolution is 25 ns, which is the rate of the LHC bunch crossing.
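The numbers quoted above are internally consistent and can be checked directly: sampling once per 120-bit frame at the 40 MHz frame clock gives a 25 ns measurement resolution and the 4.8 Gbps GBT line rate:

```python
# Consistency check of the sampling resolution quoted in the text.
frame_bits = 120               # GBT frame width
frame_clock_hz = 40e6          # LHC bunch-crossing frequency (40 MHz)

resolution_ns = 1e9 / frame_clock_hz               # one frame period in ns
line_rate_gbps = frame_bits * frame_clock_hz / 1e9 # serial line rate

print(f"resolution = {resolution_ns} ns, line rate = {line_rate_gbps} Gbps")
# resolution = 25.0 ns, line rate = 4.8 Gbps
```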
From the above table we can infer:

Bit Error Rate (BER) analysis
The Transceiver Toolkit (TTK), an on-chip debugging tool provided by Altera for BER monitoring, is used for this analysis. The test values are listed in table 6. The full TTK functionality is not available for Arria 10 ES1, so a different test setup is used for the two boards. The results for the PCIe40 first-version DAQ engine are preliminary.

An eye diagram is a common indicator of signal quality in high-speed digital transmission. The signal-to-noise ratio of the high-speed data signal is directly indicated by the amount of eye closure, or eye height. The jitter spectrum is the measurement of the timing variation of a signal edge from its ideal position. Contributing factors include power load variation, thermal noise, line attenuation and interference coupled from nearby devices. It is very important that the clock jitter stays within the tolerable hold time of the pipeline registers for error-free data transmission.

Calibration
High-Speed Serializer/Deserializer (SerDes) or Multi-Gigabit Transceiver (MGT) blocks, as well as internal PLLs, include both analog and digital blocks that require calibration to compensate for process, voltage and temperature (PVT) variations [11]. MGTs in fixed-latency mode do not have the circuitry to maintain the same data-path latency across power or reset cycles [14]. In a time synchronization protocol like GBT in Latency Optimized mode, Automated Transceiver Calibration (ATC) must be followed by Tx Latency Calibration (TLC) and Rx Latency Calibration (RLC) during both power and reset cycles. In the case of incomplete or improper calibration, ATC contributes random jitter, while TLC and RLC are responsible for deterministic jitter. The latency error is measured in Unit Intervals (UI), where one UI corresponds to one pulse duration of the GBT data stream (1 UI = 208.33 ps at 4.8 Gbps).
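The UI figure quoted above follows directly from the line rate and can be verified with a one-line calculation:

```python
# One Unit Interval (UI) is the duration of a single bit at the GBT line rate.
line_rate_bps = 4.8e9                 # GBT serial line rate, 4.8 Gbps
ui_ps = 1e12 / line_rate_bps          # UI in picoseconds

print(f"1 UI = {ui_ps:.2f} ps")       # 1 UI = 208.33 ps
```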

Tx Latency Calibration (TLC)
The fabric-transceiver clock interface in the GBT design consists of the input reference clock (refclk) and the transceiver data path interface clocks (tx_clkout, rx_clkout) [10,11]. The transceiver forwards tx_clkout to be used in the FPGA fabric as the Tx parallel word clock. This clock drives the user logic data into the transmitter DDR (Double Data Rate) PISO (Parallel Input Serial Output). This tx_clkout (120 MHz) is obtained from the serial clock (2400 MHz, i.e. a period of 2 UIs) by dividing it by the serialization factor of 20. However, the clock divider used to obtain the Tx parallel word clock (tx_clkout) introduces a phase difference uncertainty (δφ_p) with respect to the external parallel word clock (refclk). Equation (7.1) shows that δφ_p can take any value within a set of 20 elements in ∆Φ_p. This phase difference error goes completely undetected by the GBT PCS (Physical Coding Sublayer) unless it falls in the metastability region and flags a data frame error. The phase (φ_p) transition from one value to the other with each reset cycle is 2 UIs, which is sequential, predictable and has an exhaustive number of states. We have used this property to develop the proposed calibration solution, either to always get the same value for δφ_p and therefore a reproducible fixed latency, or simply to find a value that avoids metastability.
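The divider behaviour above can be sketched numerically: a divide-by-20 of the 2400 MHz serial clock can power up in any of 20 states, each separated by one serial-clock period (2 UIs), so the phase set spans one full 120 MHz word-clock period (40 UIs). This is a consistency sketch of the relations in the text, not the firmware itself:

```python
# Sketch of the phase-uncertainty set introduced by the tx_clkout divider.
serialization_factor = 20      # 2400 MHz serial clock / 20 -> 120 MHz tx_clkout
ui_per_serial_period = 2       # one serial-clock period spans 2 UIs (DDR)

# The divider can start in any of 20 states; each reset cycle steps the
# phase by 2 UIs, so delta-phi_p takes one of 20 discrete values.
phase_set_ui = [k * ui_per_serial_period for k in range(serialization_factor)]

word_clock_period_ui = serialization_factor * ui_per_serial_period
print(phase_set_ui)            # [0, 2, 4, ..., 38]
print(word_clock_period_ui)    # 40 UIs = one 120 MHz word-clock period
```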

Proposed solution
The forwarded Tx word clock takes some initialization time after each reset cycle, so, for proper alignment between the frame and the word in the (1:3) gearbox, the gearbox sync reset is delayed by a few clock cycles with respect to the global reset. A synchronization register chain (synchronizer) is used to minimize failures due to metastability [15]. The calibration trigger can be based on metastability detection (Logic 1) or on phase error detection (Logic 2). If the design requirements allow a latency error tolerance margin of ±4 ns, then the metastability detection logic is suitable because of its lower resource usage. Otherwise, the phase error detection logic is more suitable, as it gives better phase predictability. The entire solution is implemented by modifying the latency optimized gearbox in the GBT PCS.

Phase error detection logic
Tx FrameClk has a synchronous phase relationship with the reference clock (refclk). So, the phase calculation is computed between the Tx word and Tx frame clocks. This digital calculation of the phase difference within the FPGA fabric is not very accurate, yet it is sufficient to detect a δφ_p phase difference. A look-up table is prepared manually, and a particular phase value is pre-defined with a safe margin from the metastable zone. Figure 7 shows how the detected phase difference is compared against the phase calibration table values to assert the global reset.

Phase selection for temperature variation tolerance. Proper phase synchronization between the frame clock and the word clock is necessary to avoid metastability during data transmission. However, phase drift occurs with temperature variation. If the synchronized region lies close to the metastable region, phase drift due to temperature causes it to become unstable. So, the calibrated phase should be chosen wisely. We have varied the temperature by controlling the on-board cooling system and monitored for any data frame errors. The Temperature Sensing Diode (TSD) internal to the FPGA is used for the temperature reading.
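A minimal sketch of the phase-error detection trigger is given below. The safe-phase values in the look-up table are assumed placeholders (the real table is prepared manually for the actual device); the loop mimics how each global reset steps the divider phase by 2 UIs until a safe value is reached:

```python
# Hedged sketch of the phase-error detection calibration trigger.
# SAFE_PHASES_UI is a hypothetical look-up table of phases (in UIs) that
# sit safely away from the metastable zone; real values are device-specific.
SAFE_PHASES_UI = {10, 12, 14}

def needs_recalibration(measured_phase_ui: int) -> bool:
    """Assert the global reset when the measured phase is not in the safe set."""
    return measured_phase_ui not in SAFE_PHASES_UI

resets = 0
phase = 0                        # assumed divider phase after power-up, in UIs
while needs_recalibration(phase):
    resets += 1                  # each reset cycle shifts the phase by 2 UIs
    phase = (phase + 2) % 40     # 20 states x 2 UIs wrap at one word-clock period

print(f"locked at phase {phase} UIs after {resets} resets")
```

Because the phase sequence is sequential and exhaustive, this loop is guaranteed to terminate within at most 20 reset cycles, which is what makes the calibration deterministic.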

Metastability detection logic
To ensure reliable data writes, the input to a register must be stable for a minimum time before the clock edge (register setup time), and for reliable data reads, the output of a register must be stable for a minimum time after the clock edge (register hold time) [15]. For proper data frame-to-word conversion, the metastable regions must be avoided. However, due to the phase transition of the forwarded tx_clkout, the sampling point sometimes falls into this zone and causes data errors. Figure 8 shows, with a flowchart, the 3 types of data error detection logic.

Stratix V and Arria 10 FPGA
A user recalibration of the FPGA is required after every power cycle that shows a loss of lock of the MGT reference clock during Automatic Transceiver Calibration (ATC) [10,11]. The Arria 10 FPGA uses the hardened Precision Signal Integrity Calibration Engine (PreSICE) to perform calibration routines at every power-up before entering user mode. After transceiver calibration is over, the latency calibration logic follows. Tx Latency Calibration (TLC) is done as discussed in section 7.1. The Arria 10 device used in this test procedure can work without TLC. Rx Latency Calibration (RLC) uses a new type of barrel shift logic that does bit-slip scanning to lock onto the Frame Alignment Word (FAW) at the first hit [3]. The maximum number of bits slipped is equal to the FPGA fabric-to-transceiver interface width minus 1, i.e. 19 (= 20 − 1). After completing the calibration sequences, the control is transferred to the user logic.

Conclusions
With this comparative analysis, we have demonstrated that the GBT protocol can indeed be implemented successfully on both the 28 nm Stratix V and the 20 nm Arria 10 Altera FPGAs. The source code of the developed firmware is archived in CERN e-space, and more resources are available to users on request.