Design of a hardware track finder (Fast TracKer) for the ATLAS trigger

The ATLAS Fast TracKer is a custom electronics system that will operate at the full Level-1 accept trigger rate, 100 kHz, to provide high quality tracks as input to the high level trigger. The event reconstruction is performed in hardware, thanks to the massive parallelism of associative memories and FPGAs. We present the advantages for the physics goals of the ATLAS experiment at LHC as well as the recent results on the design, technological advancements and test of some of the core components used in the processor.


Introduction
The LHC Run I has confirmed the existence of a Higgs boson with a mass of about 125 GeV [1,2]. This represents a great success of the Standard Model (SM). In the future it will be important to precisely measure the Higgs couplings, in particular to the τ lepton and b quark. The next LHC runs, at higher energies and luminosities, will also be crucial in the search for physics beyond the SM.
Providing track reconstruction over the entire ATLAS detector at each Level 1 trigger, with low latency to be ready for the beginning of the Level 2 trigger, will enable lower thresholds to be maintained for 3rd-generation fermions, which will broaden the overall physics reach including the accessible range of SUSY parameters. Full tracking can also be an important general tool, for example to determine the number of interactions in the event and thus allow pile-up corrections to be made to jet energies and missing transverse energy.

Fast Tracker working principle
The Fast TracKer (FTK) is an approved ATLAS [3] upgrade [4] that will provide a list of tracks to the ATLAS high-level trigger (HLT) after each Level-1 accept, up to 100 kHz, with a very small latency, of less than 100 µs.
To achieve the goal of providing global tracking, FTK subdivides track reconstruction into consecutive steps, executed by custom electronic boards (see figure 1):
Data formatting. The data from all the silicon detector Readout Drivers (RODs) are received and the cluster centroids are calculated. The data are then split into 64 η − φ towers, which provides the parallelism needed to achieve the necessary system bandwidth. There is some overlap between adjacent towers to avoid efficiency loss at the boundaries. These functions are implemented in the Data Formatter (DF) boards.
Pattern matching. In each tower the clusters are compared with a pre-calculated list of about 15 million possible trajectories, using coarse hit resolution appropriate to pattern recognition. The comparison is performed in a few microseconds using CAM-like "Associative Memory" (AM) [5] chips. Pattern recognition is done with 8 of the 12 silicon detector layers that a track can cross. Converting clusters into coarse resolution occurs in the Auxiliary card (AUX), while the pattern matching is performed in the AM board (AMB).
Track fitting. For each pattern with clusters in at least 7 of the 8 layers (called a road), the clusters are retrieved at full precision. Candidate tracks are then built with all combinations of one hit per layer, and each is fit using a linear approximation. The tracks with a good fit quality are extrapolated into the remaining 4 layers and refit to improve their quality and reduce the fraction of fake tracks. The first fit is performed in the AUX card, while the final fit is performed in the Second Stage Board (SSB).
Conversion to the HLT format. The FTK tracks, after the final fitting stage, are collected by the FTK-to-Level-2 Interface Crate (FLIC) which converts the FTK track parameters and silicon clusters into a data stream format used by the FTK readout buffers.
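The linear approximation used in the track-fitting step can be sketched in a few lines: each track parameter is a linear combination of the hit coordinates, and the fit quality is a quadratic form of the same coordinates. The matrices below are random placeholders, not FTK fit constants, and the dimensions (8 hit coordinates, 5 parameters) are illustrative.

```python
import numpy as np

def linear_fit(hits, C, q, S, h):
    """Estimate track parameters and fit quality from hit coordinates.

    hits : 1-D array of hit coordinates, one entry per detector layer
    C, q : constants giving the track parameters as p = C @ hits + q
    S, h : constants giving chi-square components r = S @ hits + h,
           so that chi2 = sum(r**2)
    In FTK these constants are pre-computed per detector sector;
    the values used here are placeholders.
    """
    params = C @ hits + q       # helix parameters (e.g. d0, z0, phi, ...)
    residuals = S @ hits + h    # components orthogonal to the track model
    chi2 = float(residuals @ residuals)
    return params, chi2

# Toy example: 8 hit coordinates, 5 track parameters, 3 chi2 components
rng = np.random.default_rng(0)
hits = rng.normal(size=8)
C, q = rng.normal(size=(5, 8)), rng.normal(size=5)
S, h = rng.normal(size=(3, 8)), rng.normal(size=3)
params, chi2 = linear_fit(hits, C, q, S, h)
```

Because the fit reduces to matrix-vector products, it maps naturally onto the DSP blocks of the FPGAs that perform it in hardware.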
This document describes recent results on the design and development of the FTK input mezzanine (FTK IM), installed on the DF board, and the main processing unit consisting of the AMB and the AUX card.
Figure 2. The left picture shows a prototype of the current FTK IM card with a "FPGA Mezzanine Connector" element, two FPGAs, and S-link connectors. The right figure shows the clustering scheme with a small grid size, 5 × 5. The search region is the orange box, which is aligned so that the first incoming hit is placed at the middle of the left side. Data from the same columns, shown in the blue box, are temporarily stored in a separate RAM. After a cluster is read out, the grid is realigned to the next left-most hit.

FTK clustering mezzanine
The FTK IM is a mezzanine using the Data Formatter as mother board. Each of the 32 DF boards can host up to 4 FTK IM cards. The cards receive data from the Pixel and SCT RODs and perform hit clustering, which is important both for reducing the amount of data to be sent downstream and for improving the position resolution.
A picture of a prototype is shown in figure 2 (left). Data processing is performed by 2 Xilinx Spartan6 XC6SLX150T-3FGG484C FPGAs [6]. To internally balance the processing load, each FPGA is directly connected to only 2 links, one coming from a Pixel ROD and the other from an SCT ROD.
The mezzanine is responsible for clustering the data coming from the Pixel and SCT modules. The SCT data are already clustered by the SCT front-end chip. If a cluster is split at a chip boundary, the mezzanine has to combine the partial clusters.
The 2D nature of the Pixel measurement makes clustering more complicated. The mezzanine firmware has to identify clusters of contiguous activated pixels and then calculate the centroid using the time-over-threshold (charge deposition) information. The algorithm scans the hits within a module in a limited region, as shown on the right side of figure 2.
The Pixel front-end chip provides data from a module by reading a pair of pixel columns. An algorithm parameter sets the number of columns to be read, storing the hits in a circular buffer. This determines the maximum width of a cluster and the width of the search window. A second parameter is the maximum height of the cluster. The search grid places the first incoming hit in the left center of the grid. The procedure loops until all the hits in a module are assigned to a cluster [7]. The clustering time and the resources used in the FPGA depend on the number of hits and the grid size. Preliminary post-place-and-route results are shown in table 1; typical algorithm execution will require on average O(10) clock cycles for each input hit.
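A simplified software analogue of this clustering can be written as a flood fill over contiguous pixels followed by a charge-weighted centroid. The firmware instead streams column pairs through a fixed-size grid, but the grouping and centroid logic it implements is the same; the function below is a sketch, not the firmware algorithm itself.

```python
from collections import deque

def cluster_pixels(hits):
    """Group contiguous pixel hits and compute charge-weighted centroids.

    hits: dict mapping (column, row) -> time-over-threshold (ToT) value.
    Returns a list of (column_centroid, row_centroid) tuples, one per
    cluster, weighted by ToT as a proxy for the deposited charge.
    """
    remaining = dict(hits)
    centroids = []
    while remaining:
        seed = next(iter(remaining))          # plays the role of the left-most hit
        queue, cluster = deque([seed]), []
        while queue:
            px = queue.popleft()
            if px not in remaining:
                continue
            cluster.append((px, remaining.pop(px)))
            c, r = px
            for n in ((c + 1, r), (c - 1, r), (c, r + 1), (c, r - 1)):
                if n in remaining:
                    queue.append(n)
        tot = sum(w for _, w in cluster)
        centroids.append((sum(c * w for (c, _), w in cluster) / tot,
                          sum(r * w for (_, r), w in cluster) / tot))
    return centroids
```

For example, two adjacent pixels with equal ToT yield a centroid halfway between them, which is how the clustering improves the position resolution beyond the pixel pitch.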
The SCT firmware was used in a small test performed with a slice of the ATLAS inner detector (section 5). The Pixel clustering firmware is being finalized and will be tested shortly.

FTK processing unit
The FTK processing unit (PU) is composed of the AMB and a rear transition board, the AUX card. The two cards together perform most of the important tasks of the system: pattern matching and the 1st stage of fitting. The PU input is a list of unordered clusters, calculated by the FTK IM and distributed by the DF boards, while the output is a list of candidate tracks and their associated clusters to be refined by the 2nd-stage fit. In the final system there will be 2 PUs for each of the 64 towers. Additional parallelism is achieved by having 4 independent engines in each PU. However, the system is configurable and the region covered by a single PU can change, allowing flexibility and scalability.

Auxiliary card
The AUX card is a 9U rear module, 280 mm deep, that receives the input cluster centroids through 2 QSFP connectors from the DF board assigned to this PU's tower (see figure 3).

JINST 9 C01045
The incoming clusters are received by two Altera Arria V FPGAs [8] which translate each hit into the coarser resolution needed for pattern recognition. The spatial segmentation is the Super-Strip (SS), defined as the resolution at which pattern matching is performed in the AM chip. The full-resolution clusters and their associated SSs are sent to four other Altera Arria V FPGAs, the AUX processor chips, each of which contains about 20 MB of RAM and 1000 DSPs. At the same time, the SSs are sent to the AM board. Each processor chip fits the track candidates found in a quarter of the AMB's patterns.
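The translation from a full-resolution cluster to a super-strip amounts to coarse binning of the cluster coordinate, with the layer number packed into the identifier so that super-strips from different layers never collide. The bin width and bit layout below are illustrative assumptions, not the FTK configuration.

```python
def super_strip(layer, cluster_coord, ss_width=16):
    """Map a full-resolution cluster coordinate to a super-strip ID.

    layer         : silicon layer index
    cluster_coord : integer cluster position within the layer
    ss_width      : physical units per super-strip (16 is illustrative)
    The layer index is packed into the upper bits of the ID; the bit
    layout here is an assumption for the sketch.
    """
    return (layer << 16) | (cluster_coord // ss_width)
```

Coarsening the hits this way keeps the pattern bank small enough to fit in the AM chips, while the full-precision clusters are kept aside for the later fit.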
The data received by an AUX processor chip are stored in the Data Organizer (DO), a smart database that offers rapid retrieval of the clusters in a pattern. After the last hit is transmitted, the addresses of the roads found by the AMB are sent back to the AUX. The I/O between the AMB and the AUX is sent through a high speed ERNI connector placed on the P3 VME backplane.
When a road is received from the AM board the DO builds a data packet containing the road address and the associated clusters, at full precision, which is sent to the track fitter (TF) on the same FPGA. The TF tries all track candidates and keeps those with a good χ2; the current threshold is 36, but it is configurable. The track candidates passing this selection are sent to the SSB board on a single Small Form-factor Pluggable (SFP) fiber.
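The combination of the DO and the TF can be sketched as follows: the DO acts as a map from super-strip to the full-precision clusters stored during the event, and for each road the TF tries every combination of one cluster per layer and applies the χ2 cut. The function and data layout are illustrative; the χ2 function stands in for the linearized fit performed in the FPGA.

```python
from itertools import product

CHI2_CUT = 36.0  # the configurable threshold quoted in the text

def fit_road(road_superstrips, data_organizer, chi2_fn):
    """Build and fit all track candidates in one road.

    road_superstrips : super-strip IDs of the matched pattern, one per layer
    data_organizer   : dict mapping super-strip ID -> list of
                       full-precision clusters stored during the event
    chi2_fn          : callable returning the fit chi2 for one
                       combination (stand-in for the linearized fit)
    Returns the (combination, chi2) pairs passing the chi2 cut.
    """
    layers = [data_organizer.get(ss, []) for ss in road_superstrips]
    passing = []
    for combo in product(*layers):  # one cluster per layer
        chi2 = chi2_fn(combo)
        if chi2 < CHI2_CUT:
            passing.append((combo, chi2))
    return passing
```

Because the number of candidates is the product of the cluster multiplicities per layer, keeping the super-strips small limits the combinatorics the TF must process per road.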
The AUX card FPGAs and memory chips must be directly accessed via VME for monitoring and configuration, a non-standard implementation for a rear transition module. To achieve this, a specific mechanism was developed to allow the AUX to use the AMB as a proxy to broadcast VME requests.

Associative Memory board
The AMB is the 9U front board of the PU. The board receives the SS information from the AUX card through the P3 connector at a total rate of up to 24 Gbps. The SSs are distributed to 4 mezzanines containing the AM chips that perform the pattern matching. After the last data of the events are received, the same connector is used to send the found road addresses back to the AUX, with a maximum bandwidth of 8 Gbps.
The board is designed to exploit the 2 Gbps serial links that the AM chips will use starting with the new version: the AMChip05 [9], which is expected to be available at the beginning of 2014. Data routing to the 4 mezzanines is controlled by two Xilinx Artix7 FPGAs. The left picture in figure 4 shows the 2 main FPGAs and the 4 large connectors where the mezzanines will be connected.
The Large Area Mezzanine Board with the fully serial communication protocol (LAMBSLP) will hold 16 AM chips. Each mezzanine will receive a copy of the incoming data and distribute the data to 4 internal chip chains of 4 chips each. This scheme allows a good balance between parallel computational capability and routing complexity. A floor plan of the LAMBSLP is shown on the right side of figure 4.
The board will be able to compare an incoming SS with a large number of patterns nearly simultaneously: a total of 64 chips will be installed in each board, with about 128K patterns/chip, totaling about 8M patterns per board. This massive amount of logic working at high speed requires a lot of power at different voltages. The board has to provide 3.3 V for the I/O and 1.2 V to the AM chips, with a total expected power consumption of about 200 W, including the power required by the AM chips, the FPGAs, and the DC/DC converters.
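The behaviour of the AM chips can be emulated in software: each stored pattern is a tuple of per-layer super-strips, and a pattern becomes a road when enough of its layers contain a hit super-strip (at least 7 of 8, as described above). The AM hardware evaluates every pattern in parallel as the data arrive; the sequential loop below is only a functional sketch.

```python
def match_patterns(event_superstrips, pattern_bank, min_layers=7):
    """Emulate the associative-memory pattern match in software.

    event_superstrips : list of 8 sets, the super-strips hit in each layer
    pattern_bank      : list of 8-tuples of super-strip IDs, the
                        pre-computed candidate trajectories
    min_layers        : minimum matched layers for a pattern to fire
    Returns the addresses (bank indices) of the matched patterns (roads).
    """
    roads = []
    for address, pattern in enumerate(pattern_bank):
        matched = sum(ss in event_superstrips[layer]
                      for layer, ss in enumerate(pattern))
        if matched >= min_layers:
            roads.append(address)
    return roads
```

In the real system this comparison runs concurrently over roughly 8M patterns per board, which is what makes the few-microsecond pattern recognition possible.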

Simulated results and hardware tests
In preparing the FTK Technical Design Report (TDR) [4], an optimal working point for the system was found, with a SS size and other AM chip configuration parameters that can both achieve the necessary system performance and stay within the hardware constraints. Using this baseline configuration, the team carried out an extensive set of emulation studies using physics events generated with the ATLAS simulation framework [10]. The simulation confirmed that the system can work within the allocated bandwidth and the latency constraints.
The system will be able to completely reconstruct all of the tracks in an event in less than 100 µs, with single particle efficiency better than 90% with respect to offline tracks, for both muons and pions. The fraction of fake or duplicated tracks is expected to be a few percent, with a small increase in the high pseudo-rapidity regions. Initial tests of higher level algorithms such as b-tagging and τ identification confirm that it is possible to obtain near-offline performance using the tracks provided by FTK [4]. More detailed studies will be performed before the beginning of Run II in order to fully validate the system and achieve the best possible performance in the ATLAS trigger.
Tests were carried out using existing prototype hardware. A mini-FTK system consisted of a single VME crate with one complete AM board and two FTK IM mezzanines, installed on an EDRO board. The system was connected to a ROS and integrated into the ATLAS DAQ system at the end of summer 2012, and it operated until February 2013, before the two-year LHC shutdown. It received data from a 45-degree slice of the SCT only, because the connection to the Pixel detector was not fully tested, and it was not used for trigger decisions. The test was successful: the mini-FTK system was integrated in the DAQ and was able to communicate with the SCT RODs and the ROSes up to a Level-1 trigger rate of 75 kHz, which was the maximum for this data-taking period.
Other hardware integration tests involved the high speed serial links between the AUX card and the AM board, verifying the transmission rate capability, and AMB cooling tests using heater boards to emulate the consumption of the final AM boards. All the tests were successful.

Conclusions
The FTK processor will provide full tracking information by the start of event processing in the ATLAS HLT. This will allow selection algorithms to fully employ track information. It will also increase background suppression at the beginning of the HLT chain, leaving processing time for more sophisticated algorithms.
Prototype testing, including global integration tests, will continue over the next year. First data taking with FTK is expected in the summer of 2015 with a system of 8-16 PUs covering the central detector region. It will be able to handle the accelerator conditions expected at that time, 40-50 pile-up events at 25 ns bunch spacing and 13 TeV energy. The system will then keep growing to cover the full detector and cope with harsher pile-up conditions.