A time-multiplexed track-trigger for the CMS HL-LHC upgrade

A new CMS Tracker is under development for operation at the High Luminosity LHC from 2025. It includes an outer tracker based on special modules of two different types which will construct track stubs using spatially coincident clusters in two closely spaced sensor layers, to reject low transverse momentum track hits and reduce the data volume before data transmission to the Level-1 trigger. The tracker data will be used to reconstruct track segments in dedicated processors before onward transmission to other trigger processors which will combine tracker information with data originating from the calorimeter and muon detectors, to make the final L1 trigger decision. The architecture for processing the tracker data outside the detector is under study, using several alternative approaches. One attractive possibility is to exploit a Time Multiplexed design similar to the one which is currently being implemented in the CMS calorimeter trigger as part of the Phase I trigger upgrade. The novel Time Multiplexed Trigger concept is explained, the potential benefits for processing future tracker data are described and a feasible design based on currently existing hardware is outlined. © 2015 The Author. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
A major upgrade of the CMS experiment at the CERN Large Hadron Collider is being planned so that the accelerator can deliver an integrated luminosity of 3000 fb⁻¹ over a period of about a decade of operation for a physics programme which will include precision measurements of the newly discovered Higgs boson, further searches for new physics extended to higher masses, and for possible discrepancies with the standard model. The High Luminosity LHC will require a new tracker after 2023 because of gradual deterioration of the present detector due to accumulated radiation damage. Its replacement must satisfy several demanding requirements, including a lower material budget and increased granularity compared to the present detector, and enhanced radiation tolerance combined with tolerable power consumption and affordable cost.
For the first time, tracker data must be provided to the L1 trigger to constrain the trigger rate to 0.5-1 MHz with a latency of up to ~10 μs. The baseline outer tracker design has a barrel-endcap layout [1] with two types of modules, denoted "2S" and "PS", which allow selection of hits consistent with a transverse momentum above a chosen threshold to reduce significantly the volume of data to be transmitted out of the tracker to be used for the L1 trigger decision; about 99% of all tracks in CMS are below ~2 GeV/c, and are of no relevance to most practical triggers. The basic concept, proposed some years ago [2-4], is to compare the binary pattern of hit strips on upper and lower sensors of a two-layer module to reject patterns that are consistent with a low transverse momentum track. Hit combinations in the two layers consistent with a high-p_T track segment are known as "stubs".
In the 2S-module [5,6], which is the most advanced, two silicon microstrip sensors are separated by a few mm, with the spacing determined by a compromise between transverse momentum precision and the fake stub rate resulting from combinatorial background caused by hits from nearby tracks, secondary interactions and photon conversions, for example. In the cylindrical barrel region, high-p_T tracks can be identified if hit clusters lie within a search window in R-ϕ (rows) in the second layer. The sensor separation and search window determine the p_T cut, where the objective is simply rejection of a sufficiently large fraction of low-p_T candidates in the detector so that trigger primitives can be transmitted within the available bandwidth. In the barrel region, the sensor segmentation in z (along the beam axis) determines the vertex measurement precision; the outermost tracker region uses 5 cm strips, so dedicated pixel-strip (PS) module layers are planned for added precision on the longitudinal vertex reconstruction. The same method can be deployed in the end-cap detectors, with modifications to the module geometry. In total, ~15,000 modules, each with a dedicated fibre-optic link, will send p_T-stubs to off-detector processors at a 40 MHz event rate. After a L1-accept signal the full tracker data from the modules will be read out. Outside the Tracker the arriving stub data must be used to reconstruct tracks sufficiently precisely within a few μs for the L1 trigger to then associate them with other trigger primitives from the calorimeter and muon detectors.
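The stub selection described above can be illustrated with a minimal sketch. It assumes that clusters are represented by integer strip indices in the R-ϕ direction and that the search window is a symmetric number of strips; the function name `find_stubs` and the example values are hypothetical and do not describe the actual front-end logic.

```python
from typing import List, Tuple


def find_stubs(lower_hits: List[int], upper_hits: List[int],
               window: int) -> List[Tuple[int, int]]:
    """Pair clusters in the two sensors of one module.

    A (lower, upper) pair is kept as a stub candidate when the upper
    cluster lies within +/- `window` strips of the lower cluster.
    A small displacement corresponds to a nearly straight, high-p_T track.
    """
    stubs = []
    for lo in sorted(lower_hits):
        for up in sorted(upper_hits):
            if abs(up - lo) <= window:
                stubs.append((lo, up))
    return stubs


# Hypothetical strip indices for one module and one bunch crossing.
stubs = find_stubs(lower_hits=[10, 57, 120], upper_hits=[11, 70, 118], window=3)
```

In this illustration, widening `window` lowers the effective p_T threshold and accepts more combinatorial pairs, which is the trade-off between threshold and fake stub rate mentioned in the text.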
A prototype 2S-module has been assembled using CBC (CMS Binary Chip) front-end readout ASICs [7,8], which have been developed for this purpose in a 130 nm CMOS technology, and studied in a DESY test beam in December 2013, with performance consistent with simulations. As reported elsewhere, the CBC has evolved from a 128 channel wire-bonded design, with no trigger functionality, to a 254 channel chip, which is bump-bonded to a hybrid substrate, then wire-bonded to the silicon sensors; this version (CBC2) includes the stub-finding logic and was used for the prototype beam tests. The CMS L1 trigger latency has also evolved over this period, from an original maximum of 6.4 μs to a value of ~10 μs; the CBC3 allows up to 12.8 μs. The CBC3 is now at the layout stage and expected to be manufactured in early 2016.

Time Multiplexed Trigger
The Time Multiplexed Trigger (TMT) principle is similar to that deployed in the CMS Event Builder and High Level Trigger (HLT) [9], which uses a commercial CPU farm as a Layer-2. The fundamental idea (Fig. 1) is that multiple sources send their data to a single destination (node) for processing, in this case identification of tracks using stub data. Two layers are required, with a static switching network between them, which can be a passive optical fibre network. The minimum requirement is suitable formatting and ordering of data in Layer-1, followed by track finding algorithms operating at Layer-2.
In the HLT, data traffic scheduling is more complex because the volume of data for each event fluctuates greatly, and the latency requirements are very different from those of the L1 trigger, which should operate with a fixed latency.
A time multiplexed calorimeter trigger has been demonstrated in prototypes [10] and is currently being installed in CMS for operation during the 2016 run. The hardware, based on the MP7 [11] processing board, which has a Xilinx Virtex-7 FPGA and Avago MiniPOD optics in a μTCA format, is designed to implement Layer-2 of the CMS L1 calorimeter trigger. Each MP7 has a total optical bandwidth of 0.9 + 0.9 Tbps (Tx + Rx), provided by 72 links operating at up to 12.5 Gbps for both input and output, and is therefore well matched to a high throughput processing application such as track finding at HL-LHC. A series of MP7 boards has been thoroughly tested during a development and prototyping phase and manufactured for use in CMS, and can serve as a prototype system for a time multiplexed track-trigger. Further evolution of FPGAs in the coming years is expected and the final version of a TMT should profit from technology progress.
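The quoted board bandwidth follows directly from the link count and the maximum line rate. The short check below uses only the numbers given in the text; it treats 12.5 Gbps as the raw line rate and makes no allowance for encoding or protocol overhead, which is an assumption of this sketch.

```python
LINKS_PER_DIRECTION = 72   # optical links per MP7, for Tx and again for Rx
LINK_SPEED_GBPS = 12.5     # maximum raw line rate per link

# Aggregate bandwidth in one direction, converted from Gbps to Tbps.
bandwidth_tbps = LINKS_PER_DIRECTION * LINK_SPEED_GBPS / 1000.0
print(f"{bandwidth_tbps:.1f} Tbps per direction")
```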
In the TMT all data from a given LHC bunch crossing (BX) arrive at a single node for processing, which minimises boundaries and data sharing between processors. This still permits subdivision of the detector into regions, which is essential for such a large system as the tracker, but the division is now a choice, rather than a constraint. The number of links into each processing node is defined by the number of Layer-1 cards (Fig. 1), which is in turn defined by the number of detector links. Present FPGAs have insufficient input links to receive 15,000 TPGs into a single device, which is not expected to change.
The overall processing architecture is well matched to FPGA processing. FPGAs operate optimally using highly parallel streams with pipelined steps running at data link speed, and it has been found [10] to be very important to adapt the algorithms to the constraints of FPGA operation or else even relatively simple tasks may result in a design which cannot be mapped physically into an FPGA. Many algorithms can easily overflow the capacity of even a very large FPGA because of timing constraints or routing congestion, e.g. for two-dimensional sorting problems.
The TMT also reduces requirements on synchronisation, which can be demanding in complex high speed systems. Data can only be reliably exchanged if processors sharing data are precisely in phase. This is achieved by careful serialisation and deserialisation stages, running much faster than the LHC clock frequency, using well defined, high quality clock signals which must be carefully managed throughout the entire system. In contrast, since the TMT processors are essentially operating independently, accurate synchronisation is required only between the boards within each node, and not across the entire Layer-2.
Boundary handling is an important issue to minimise data sharing, and reduce the time needed for duplicate track removal. Provided data are only shared between a maximum of two regions, a convention of accepting only candidates from a "left" or "right" region when data originate from both regions can simplify this task, so the layouts under consideration incorporate this constraint.
Another TMT advantage is that a very small number of nodes is needed to validate an entire trigger, since each main processor is carrying out identical processing, just delayed by one LHC clock cycle compared to its neighbour in a round robin fashion. Additional nodes replicating one of the others can be used for redundancy, or algorithm development. If a hardware fault occurs at one TMT processor or its links, a redundant node can be quickly switched in by software to replace the faulty one.
Fig. 1. In the Time Multiplexed Trigger all data from every TPG (Trigger Primitive Generator) are transmitted to a single processing node for each LHC bunch crossing. The processors are therefore out of phase with one another by one LHC clock cycle. The multiplexing fabric (MUX) is a serial interconnection from each TPG to each TMT node. In this simple example, each TPG has seven serial transmitting links and seven links arrive at each processor, one from each TPG. GT represents a possible global decision level.
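The round robin assignment of bunch crossings to processing nodes can be sketched in a few lines. The function name `node_for_bx` is hypothetical, and the choice of seven nodes simply mirrors the example of Fig. 1; a real system would use a number of nodes equal to the time multiplexing period.

```python
def node_for_bx(bx: int, n_nodes: int) -> int:
    """Index of the Layer-2 node that receives all data from bunch crossing `bx`."""
    return bx % n_nodes


N_NODES = 7  # assumed here, following the seven-link example of Fig. 1

# Consecutive bunch crossings go to consecutive nodes, so each node
# starts a new event one LHC clock cycle after its neighbour and
# receives a new event only every N_NODES crossings.
assignment = [node_for_bx(bx, N_NODES) for bx in range(10)]
```

Because every node runs the same processing on a different crossing, validating one node in this picture is equivalent to validating any other, which is the property exploited for redundancy and for the demonstrator.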

System architecture
The expected requirement is to carry out quasi-full track reconstruction with high efficiency for all tracks with transverse momenta greater than a few GeV/c within ~5 μs; a sufficiently low rate of fake tracks, from combinatorial background, is also required. The precise definition of high efficiency, low fake rate and the optimal momentum threshold form part of the simulation studies currently under way. Several approaches to solve this problem seem feasible and are being evaluated [12]. The challenge, however, is to achieve this in the HL-LHC environment where pileup of ~140 events per crossing is expected to be typical, and values of up to 200 events per crossing are considered likely.
Possible TMT architectures, matched to the new tracker layout, have been considered. Earlier versions [13] considered subdivision of the tracker into ϕ regions; at present a design based on segmentation into five η regions looks attractive (Fig. 2). The choice of ϕ or η subdivision is strongly linked to the effective implementation of the track finding algorithm in the FPGA. The number of Layer-1 processors required is determined mainly by the number of links emerging from the tracker and, to a lesser extent, by practical considerations of fibre routing. Assuming an MP7-like module with 72 input links per board would be used at Layer-1, ~200 MP7 cards would be required for the preprocessor (PP) layer to receive the data. The tracker data link speed is not yet final; current technology for radiation-tolerant links could provide 3.2 Gbps but it is expected that further developments should double this. However, unless a concentration stage before the preprocessors is adopted, this does not change the number of PP units.
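The preprocessor count and the raw capacity of a module link can be checked with simple arithmetic. The sketch below assumes exactly 15,000 module links (the text gives this as an approximate figure), one link per PP input, and the 40 MHz bunch crossing rate; the per-crossing bit counts are raw line-rate values with no allowance for encoding or protocol overhead.

```python
import math

N_MODULE_LINKS = 15_000   # approximate number of tracker modules, one link each
LINKS_PER_PP = 72         # input links on an MP7-like preprocessor board
BX_RATE_HZ = 40e6         # LHC bunch crossing rate

# Boards needed if every module link goes straight to a PP input.
pp_boards = math.ceil(N_MODULE_LINKS / LINKS_PER_PP)


def raw_bits_per_bx(link_speed_bps: float) -> float:
    """Raw bits one module link can carry per bunch crossing."""
    return link_speed_bps / BX_RATE_HZ


bits_now = raw_bits_per_bx(3.2e9)      # current radiation-tolerant links
bits_doubled = raw_bits_per_bx(6.4e9)  # if the link speed doubles

print(pp_boards, bits_now, bits_doubled)
```

Doubling the link speed doubles the payload per crossing but leaves `pp_boards` unchanged, since the board count depends only on the number of links, as noted in the text.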
The number of Layer-2 modules depends on the number of links per board but also on the time multiplexing period, i.e. the number of Layer-2 processors used for each region, Fig. 3. This period is a compromise between several factors: with a small TM period the full event can be quickly assembled into one main processor (MP) while a longer period could allow more efficient pipelined processing of data into the MP. A small period limits the allowable data volume which can be transferred from PP to MP for each event, which could be increased only by increasing the number of links per PP or link speed, which may be physically limited, e.g. by space on the board or power considerations. Clearly, increasing the TM period gives rise to converse arguments. Link speed is not a concern, as commercial links of up to 12.5 Gbps have already been demonstrated with the MP7 and the link speed is expected to increase further in future; thus bandwidth is not a concern for Layer-1 to Layer-2 data transmission.
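The dependence of the transferable data volume on the time multiplexing period can be made concrete with a rough estimate. It assumes that one PP-to-MP link is available to a single event for a full TM period, a 25 ns bunch spacing (from the 40 MHz crossing rate), and the raw 12.5 Gbps line rate without encoding overhead; the function name is hypothetical.

```python
BX_NS = 25.0  # bunch crossing spacing in ns, i.e. 1 / 40 MHz


def raw_bits_per_event_per_link(link_gbps: float, tm_period_bx: int) -> float:
    """Raw bits one PP-to-MP link can deliver for a single event.

    The link serves one event for `tm_period_bx` crossings, and
    Gbps multiplied by ns gives bits directly.
    """
    return link_gbps * BX_NS * tm_period_bx


short_period = raw_bits_per_event_per_link(12.5, 24)
long_period = raw_bits_per_event_per_link(12.5, 36)
print(short_period, long_period)
```

The estimate grows linearly with the period, which illustrates why a small period limits the data volume per event unless the number of links or the link speed is increased.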
A time multiplexing period of 24-36 BX would be possible using the MP7, which would then require five times as many boards to process data from the entire tracker, i.e. 120-180 Layer-2 boards in total. Thus a system of ~300+ boards looks quite feasible; for comparison, the present CMS microstrip tracker requires 440 Front End Drivers (FEDs), which are 9U VME boards with 96 optical fibre links per board. Present assumptions are that the PP layer would be supplemented by a layer of FEDs for the processing of the trigger data.
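The board counts quoted above can be reproduced as follows. The sketch assumes one Layer-2 board per bunch crossing of the TM period in each of five regions, and takes the approximate figure of 200 PP boards for Layer-1; both are readings of the text rather than a fixed design.

```python
N_REGIONS = 5          # assumed number of detector regions, each with its own set of nodes
PP_BOARDS = 200        # approximate Layer-1 (preprocessor) board count


def layer2_boards(tm_period_bx: int) -> int:
    """Layer-2 boards if each region needs one board per crossing of the period."""
    return N_REGIONS * tm_period_bx


for period in (24, 36):
    l2 = layer2_boards(period)
    print(f"TM period {period} BX: {l2} Layer-2 boards, {PP_BOARDS + l2} boards in total")
```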

Firmware implementation
While the bandwidth requirements can already be met by existing hardware, such as the MP7, it is clearly still challenging to implement the track-finding algorithms in present generation FPGAs. To make progress with this, we are exploring algorithm designs both by simulations [14] and firmware design, also emulating the firmware code in software. The approach which is currently under study is based on a Hough transform, in which a line in a two dimensional space (say, ϕ = mr + c, where ϕ and r are the angular and the radial position in the tracker) is parameterised by means of its gradient and intercept as a point (m, c) in a dual space. Since high transverse momentum (p_T > 2 GeV/c) tracks can be approximated as straight lines in r-ϕ space, c can be approximated for a given m, which itself has a limited range as it is inversely proportional to p_T. Therefore, by histogramming m and c in a two-dimensional array, a peak above background would signal the presence of a track. Since m and c are unknown quantities at the outset, the code histograms ϕ, r data from stubs by assuming an m value compatible with the p_T range of interest and then computing c.
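A software illustration of this histogramming step is sketched below. It is not the firmware implementation: the function name `hough_fill`, the bin ranges, the radii and the track parameters are all hypothetical values chosen only to show how stubs lying on one line ϕ = mr + c accumulate in a single (m, c) bin.

```python
from collections import defaultdict


def hough_fill(stubs, m_values, c_min, c_max, n_c_bins):
    """Fill an (m, c) histogram from stubs given as (r, phi) pairs.

    For every stub and every assumed gradient m, the intercept
    c = phi - m * r is computed and the stub index is stored in the
    corresponding bin. Returns {(m_index, c_index): [stub indices]}.
    """
    hist = defaultdict(list)
    c_width = (c_max - c_min) / n_c_bins
    for idx, (r, phi) in enumerate(stubs):
        for im, m in enumerate(m_values):
            c = phi - m * r
            if c_min <= c < c_max:
                ic = int((c - c_min) / c_width)
                hist[(im, ic)].append(idx)
    return hist


# Hypothetical example: six stubs on the line phi = 0.001 * r + 0.125.
radii = [25.0, 35.0, 50.0, 70.0, 90.0, 110.0]
stubs = [(r, 0.001 * r + 0.125) for r in radii]
m_values = [-0.002, -0.001, 0.0, 0.001, 0.002]

hist = hough_fill(stubs, m_values, c_min=-0.5, c_max=0.5, n_c_bins=20)
peak = max(hist, key=lambda key: len(hist[key]))
print(peak, len(hist[peak]))
```

For the correct gradient all stubs share the same intercept and fall in one bin, whereas for the other assumed gradients the computed intercepts drift with r and spread over several bins.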
Simulations show that in the presence of a large background of pileup events, the two-dimensional array is heavily populated and such peaks are not initially prominent. However, simple constraints, such as requiring all stubs in the (m, c) histogram bin to be from different radial layers, significantly reduce the background. Methods to impose further constraints by using the information from the z coordinate are under study to improve the selection efficiency. Firmware has been developed to apply this algorithm using a method based on a self-filling systolic array, in which pipelined data flow through the 2D array, so that the stub should be assigned to the appropriate bin of the histogram depending on the initial m value. Presently this is working using a 20 × 20 array, which is smaller than eventually desirable, but provides sufficient resolution to begin practical studies.
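One way to express the layer constraint in software is to count the distinct radial layers contributing to a bin, as in the sketch below. The threshold `min_layers` and the layer numbering are assumptions made for illustration; the text does not specify the values used in the simulation or firmware.

```python
from typing import Iterable


def passes_layer_constraint(stub_layers: Iterable[int], min_layers: int) -> bool:
    """Accept an (m, c) bin only if its stubs come from enough distinct layers.

    `stub_layers` holds the radial layer index of each stub in the bin.
    Several stubs from the same layer count only once.
    """
    return len(set(stub_layers)) >= min_layers


# Hypothetical bins: the first is dominated by one layer, the second is track-like.
background_bin = [1, 1, 1, 2]
track_like_bin = [1, 2, 3, 5]

keep_background = passes_layer_constraint(background_bin, min_layers=4)
keep_track = passes_layer_constraint(track_like_bin, min_layers=4)
```

Both example bins contain four stubs, so a cut on the raw bin count alone would not separate them; counting distinct layers does.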

Demonstrator
As explained earlier, a TMT trigger can be demonstrated with only a limited fraction of the final hardware, since one main processor is sufficient to emulate one node, which is independent of all others, and is running the same algorithm as all other nodes (at least for the same η region). A large number of inputs can be generated from a single pre-processor board, since an MP7 has 72 links, so as few as two MP7s are sufficient to act as a system demonstrator. Such a system is already available from the calorimeter trigger prototyping and is currently being prepared for use in this application.

Conclusions
The TMT is now a proven architecture in CMS and will be used in the CMS calorimeter trigger from 2016.
Existing hardware is very flexible and can be deployed for a TM track-finder. Only a fraction of the system is required to validate the concept since installing and building a full system should only require replicating identical nodes. As explained, a track-trigger could be built using present technology and it is safe to assume further technological progress in the next decade.
The main TMT challenge is to find feasible track-finding algorithms and show that they can be implemented in FPGAs with the required performance. Firmware for this is under development and a system will be in operation by summer 2015. Simulation studies which support the concept are well advanced.