Next generation Associative Memory devices for the FTK tracking processor of the ATLAS experiment

The AMchip is a VLSI device that implements the Associative Memory function, a special content addressable memory specifically designed for high energy physics applications and first used in the CDF experiment at the Tevatron. The 4th generation of AMchip has been developed for the core pattern recognition stage of the Fast TracKer (FTK) processor: a hardware processor for online reconstruction of particle trajectories at the ATLAS experiment at the LHC. We present the architecture, design considerations, power consumption and performance measurements of the 4th generation of AMchip. We also present the design innovations toward the 5th generation and the first prototype results.


ATLAS FTK Architecture
The Fast TracKer (FTK) is an electronics system that rapidly finds and fits tracks in the ATLAS [8] inner detector silicon layers (pixel and SCT) for every event that passes the Level-1 Trigger (figure 1). It uses all 12 silicon layers over the full rapidity range covered by the barrel and the disks. It receives a parallel copy of the pixel and silicon strip (SCT) data at the full data transfer speed from the detector front end to the read-out subsystem, following the Level-1 Trigger rate of about 100kHz. The FTK algorithm consists of two sequential steps. In step 1, pattern recognition is carried out by a dedicated device called the Associative Memory (AM) [5], which finds track candidates in coarse-resolution roads using 8 of the silicon layers. When a road has hits in at least 7 silicon layers, step 2 is carried out, in which the full resolution hits within the road are fit to determine the track helix parameters and the goodness of the fit. Tracks that pass all these steps are kept. The first step uses massive parallelism to carry out what is usually the most CPU-intensive aspect of tracking, by processing hundreds of millions of roads nearly simultaneously as the silicon data pass through FTK. This step is performed by the associative memory chips, which store roads consistent with particle trajectories and compare them with the data coming from the ATLAS inner tracker.
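The matching criterion of step 1 can be sketched in software. The following Python model is only an illustration under stated assumptions, not the AMchip logic: the function name `find_roads`, the toy pattern bank and the coarse hit addresses are hypothetical, while the 8 layers and the 7-layer threshold are taken from the text. In hardware all patterns are compared in parallel, whereas this sketch loops over them.

```python
from typing import Dict, List, Sequence, Set

N_LAYERS = 8        # layers used for pattern recognition (from the text)
MIN_LAYERS_HIT = 7  # a road fires when at least 7 layers match (from the text)


def find_roads(patterns: Sequence[Sequence[int]],
               hits_per_layer: Dict[int, Set[int]]) -> List[int]:
    """Return the indices of patterns matched in at least MIN_LAYERS_HIT layers.

    patterns[i][layer] is the coarse hit address stored for that layer.
    hits_per_layer[layer] is the set of coarse addresses seen in the event.
    """
    roads = []
    for index, pattern in enumerate(patterns):
        # Count the layers whose stored address appears among the event hits.
        matched = sum(
            1 for layer in range(N_LAYERS)
            if pattern[layer] in hits_per_layer.get(layer, set())
        )
        if matched >= MIN_LAYERS_HIT:
            roads.append(index)
    return roads


# Hypothetical toy bank of three patterns and one event (illustrative values).
bank = [
    [1, 2, 3, 4, 5, 6, 7, 8],  # matches the event in 7 of 8 layers
    [1, 2, 3, 4, 9, 9, 9, 9],  # matches in only 4 layers
    [5, 5, 5, 5, 5, 5, 5, 5],  # matches in only 1 layer
]
event = {0: {1}, 1: {2}, 2: {3}, 3: {4}, 4: {5}, 5: {6}, 6: {7}, 7: {0}}
matched_roads = find_roads(bank, event)
```

Only the first pattern reaches the threshold here, so it is the only road that would be passed to the step 2 fit.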

STD cells vs full custom design
In this section we describe the project constraints and architectural solutions that have brought the AMchip ASIC design from a purely standard cell layout to a full custom layout of the memory core. The starting point for the design of the new generation of AMchips was the AMchip03 [2]. This chip was developed in UMC 0.18 µm technology. The goal was to reach a density of 5000 patterns per chip consuming 1W at a 40MHz clock frequency. This chip was designed using a full standard cell architecture both for the memory core and the control logic. AMchip03 was successfully used in the CDF experiment at Fermilab. For the FTK upgrade of the ATLAS experiment at CERN we had to change the approach to the memory core design. In this application the AMchip has to process events coming from the inner tracker (pixel and SCT detectors) at the Level-1 trigger rate, about 100kHz. This application imposed a new set of design constraints. To meet all these constraints it was not possible to design the new AM chip using only standard cells, because of the area and power limitations. Instead we decided to use a mixed approach in which the control logic is designed using standard cells, while the memory core is a full custom design. The advantage is that a full custom design usually occupies less area than a standard cell design [8], mainly because standard cells have a fixed size and cannot be scaled down to reduce their dimensions. Moreover, we could implement power saving techniques that are impossible to realize with standard cells. AMchip04 was designed with this philosophy in order to contain the power consumption and increase the number of patterns that could be stored in the memory.

AMchip04 core architecture
The AMchip04 core, designed in the TSMC 65nm LP process, is based on a CAM architecture whose working principle is described in [3]. The power dissipation of the matchline (shown in figure 2) is a major source of power consumption, since it is charged and discharged during every clock cycle. To reduce matchline activity, we chose to implement two power reduction techniques: selective precharging and a current race scheme.
The former performs a match operation on the first few bits of a word before activating the search of the remaining bits [4]. For example, in our 18-bit word memory, selective precharge initially searches only the first 4 bits and then searches the remaining 14 bits only for words that matched the first 4 bits.
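The two-stage search can be illustrated with a behavioural Python sketch. It is a software model under stated assumptions, not the circuit: the function name `selective_precharge_search` is hypothetical, and treating the "first 4 bits" as the most significant bits is an assumption made only for this example. The 18-bit word and the 4/14 split come from the text.

```python
WORD_BITS = 18    # word width (from the text)
PREFIX_BITS = 4   # bits searched in the first stage (from the text)
REST_BITS = WORD_BITS - PREFIX_BITS


def selective_precharge_search(stored_words, key):
    """Two-stage CAM search model.

    Stage 1 compares only the 4-bit prefix (assumed here to be the MSBs).
    Stage 2 compares the remaining 14 bits, only for words that passed stage 1.
    Returns (matching indices, number of words that activated stage 2).
    """
    rest_mask = (1 << REST_BITS) - 1
    key_prefix = key >> REST_BITS
    key_rest = key & rest_mask

    matches = []
    second_stage_activations = 0
    for index, word in enumerate(stored_words):
        if (word >> REST_BITS) != key_prefix:
            continue  # prefix mismatch: the remaining 14 bits are never searched
        second_stage_activations += 1
        if (word & rest_mask) == key_rest:
            matches.append(index)
    return matches, second_stage_activations


# Hypothetical stored words, built as (prefix << 14) | remaining bits.
words = [
    (0b1010 << REST_BITS) | 1,
    (0b1010 << REST_BITS) | 2,
    (0b0101 << REST_BITS) | 1,
    (0b1111 << REST_BITS) | 0x3FFF,
]
result = selective_precharge_search(words, (0b1010 << REST_BITS) | 1)
```

In this toy bank only two of the four words share the key prefix, so only those two would activate the second stage. If the stored prefixes were uniformly distributed, roughly one word in 16 would do so.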
The current race scheme precharges the matchline low, instead of high as in conventional schemes, and evaluates the matchline state by charging the matchline with a current $I_{ML}$ supplied by a current source. The benefit of this scheme, which is partly responsible for the power reduction, is the simplicity of the circuitry, which consists only of two types of memory cells (NAND and NOR type), a current generator and an output SR latch. Table 1 shows the power consumption measurements done on the AMchip04 prototype chip with 8K patterns at 100MHz and the extrapolation to the final 128K pattern chip.
Comparing the AMchip03 power consumption of $P_{\mathrm{AMchip03}} = 1$ µW/pattern/layer/MHz [2] with the AMchip04 power consumption of $P_{\mathrm{AMchip04}} = 0.036$ µW/pattern/layer/MHz, we can see that the power saving techniques implemented in AMchip04 reduced the power consumption by about a factor of 28 with respect to the previous version. With this memory core architecture a full 128K pattern AMchip should consume 3.7W. This value, however, is still too high for our purposes because, including FPGAs and the other components on the board, the power consumption would be greater than 5kW per crate. Further power reduction is needed in the AMchip and our goal is to reach less than 2W per chip.
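The figures quoted above can be cross-checked with a short calculation. The sketch below assumes 8 layers per pattern (the number of silicon layers used for pattern recognition), a 100MHz clock, and reads "128K" as 128000 patterns; these choices and the function name `chip_power_watts` are assumptions for illustration.

```python
def chip_power_watts(uw_per_pattern_layer_mhz, n_patterns, n_layers, f_mhz):
    """Scale a per-pattern, per-layer, per-MHz power figure (in µW) to a chip."""
    return uw_per_pattern_layer_mhz * 1e-6 * n_patterns * n_layers * f_mhz


P_AMCHIP03 = 1.0    # µW/pattern/layer/MHz (from the text)
P_AMCHIP04 = 0.036  # µW/pattern/layer/MHz (from the text)

# Ratio of the two per-pattern figures: close to the quoted factor of 28.
reduction_factor = P_AMCHIP03 / P_AMCHIP04

# Assumed: 128000 patterns, 8 layers, 100 MHz. Gives roughly 3.7 W.
amchip04_full_chip = chip_power_watts(P_AMCHIP04, 128_000, 8, 100)
```

Under these assumptions the extrapolated power is close to the quoted 3.7W, well above the 2W goal.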

Power reduction strategies
To further reduce the power we could apply several different strategies: 1. Reduce the full custom core power supply voltage from 1.2V to 0.8V. What we wanted to avoid is reducing the clock frequency below 100MHz. As can be seen from this list, there are two different ways to control the power consumption: reducing the capacitance of the nets and reducing their voltage swing. This is because the dynamic power consumption is dominated by the charging and discharging of line capacitance, which can be expressed by the following equation:

$$P_{dyn} = C_{line} \, V_{DD}^{2} \, f$$

where $C_{line}$ is the capacitance of the net, $V_{DD}$ is the power supply voltage and $f$ is the switching frequency of the line. Reducing the capacitance of the nets means reducing their length and their coupling with neighboring nets, in particular the power supply (VDD) and ground (GND). This can be achieved by properly designing the CAM cell layout. Figure 3 shows a layout example of the NOR CAM cell, in which the two memory cells are placed one over the other instead of side by side as in the AMchip04 memory core. The match line path is shown in white. The other way to reduce power consumption is to decrease the power supply voltage. However, this raises speed problems, since reducing VDD increases the MOS channel resistance and reduces the MOS speed. In order to maintain the circuit speed it is necessary to use low threshold transistors instead of standard ones. Moreover, to reduce the voltage drop through the MOS channel resistance it is necessary to increase the MOS channel width, and therefore the transistor area. Taking into account all of these design considerations, we arrived at the design of figure 4.
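The effect of the supply voltage can be estimated from the standard relation for dynamic power, $P = C_{line} V_{DD}^{2} f$. The Python sketch below is a rough estimate only: it assumes the capacitance and the frequency stay fixed and that the whole voltage swing scales with VDD. The 0.8V value is the core supply quoted later in the paper, and the function names and the 1pF example capacitance are hypothetical.

```python
def dynamic_power(c_line_farads, vdd_volts, f_hz):
    """Dynamic power of a net charged and discharged once per cycle: C * V^2 * f."""
    return c_line_farads * vdd_volts ** 2 * f_hz


def voltage_scaling_ratio(vdd_new, vdd_old):
    """Power ratio when only VDD changes (capacitance and frequency fixed)."""
    return (vdd_new / vdd_old) ** 2


# Lowering the core supply from 1.2 V to 0.8 V at constant C and f.
ratio = voltage_scaling_ratio(0.8, 1.2)  # (2/3)^2, about 0.44

# Hypothetical 1 pF line switching at 100 MHz, before and after scaling.
p_before = dynamic_power(1e-12, 1.2, 100e6)
p_after = dynamic_power(1e-12, 0.8, 100e6)
```

Because the dependence on VDD is quadratic, the voltage reduction alone would more than halve the dynamic power of the affected nets under these assumptions, which is why it is attractive despite the speed penalty described above.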

Low voltage CAM cells
As can be seen in the figure, in order to maintain the circuit speed an additional current generator is inserted between the NAND and NOR memory cells. Previous versions of the memory layer (figure 2) had only one current generator at the input of the NAND type memory cells. Due to all these improvements the area is increased by about 12%. Another approach to reduce power consumption is to replace the matchline with a combinatorial logic network that compares the memory content with the data present on the bit lines. This can be achieved by combining an SRAM memory cell and an XOR combinatorial network that performs the comparison [7]. The result is shown in figure 5. As can be seen from the schematic, this is a completely digital approach to the CAM memory design. With this architecture, the circuit does not have to charge and discharge the matchline every read cycle to perform the comparison between the data on the bitlines and the data stored in memory. The XOR network is devoted to this job. Only when there is a match is the XOR gate activated; otherwise only a small fraction of these gates are on.
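The logical function of the XOR-based comparison can be shown with a small Python model. This is a behavioural sketch only, assuming an 18-bit word as in the AMchip04 memory; the function names are hypothetical and the model says nothing about the gate-level structure of figure 5.

```python
def xor_cam_match(stored_word, search_word, width=18):
    """Return True when every bit of the two words agrees.

    A bitwise XOR yields 1 only where the bits differ, so the words match
    exactly when the XOR result is zero over the compared width.
    """
    mask = (1 << width) - 1
    return ((stored_word ^ search_word) & mask) == 0


def search_bank(stored_words, search_word, width=18):
    """Compare one search word against every stored word (sequential model)."""
    return [index for index, word in enumerate(stored_words)
            if xor_cam_match(word, search_word, width)]


# Hypothetical stored contents.
memory = [0x00001, 0x2A5F3, 0x2A5F2, 0x3FFFF]
hits = search_bank(memory, 0x2A5F3)
```

Here the search word differs from the third entry in a single bit, so only the second entry matches.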

Serialized and deserialized input and output
In this section we explain why we chose serialized and deserialized (SERDES) input and output instead of parallel buses as in AMchip04, and we present the SERDES characteristics. The main reason for the change is that the reduction in the core VDD from the standard 1.2V to 0.8V meant several different power domains within the chip. This implies many more power pins and a package change from the TQFP208 in the AMchip04 to FBGA23x23 for AMchip05. Most of these pins are devoted to the different power supply voltages and ground, as shown in figure 6. In the few remaining pins we have to accommodate all the input and output buses, the clock and the control pins. There are not enough pins to accommodate parallel buses, so we have to use high speed serialized input and output. The main requirements for the SERDES are:
• data rate of at least 2Gbps
• separate serializer and deserializer macros
• 32bit input/output buses

JINST 9 C03053
• driver and receiver circuits compatible with standard LVDS
• 8b/10b encode/decode capability
• comma detection and word alignment
• BIST capability for fast debugging
We bought a SERDES core IP from Silicon Creations. To test it we designed a miniasic with 5 DES, 1 SER, their control logic and our AMchip04 memory core plus XOR+RAM with only a few banks. The miniasic is fully working and its characterization is in progress.
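The 8b/10b requirement listed above has a direct effect on the usable bandwidth, since every 8 data bits are sent as 10 line bits. The short Python calculation below illustrates this under the assumption that the 2Gbps figure is the line rate of one serial link; the text does not say whether it refers to line rate or payload rate, and the function names are hypothetical.

```python
def payload_rate_bps(line_rate_bps):
    """Payload rate of an 8b/10b link: 8 data bits for every 10 line bits."""
    return line_rate_bps * 8 // 10


def words_per_second(line_rate_bps, word_bits=32):
    """Number of whole data words per second carried by one 8b/10b link."""
    return payload_rate_bps(line_rate_bps) // word_bits


LINE_RATE = 2_000_000_000  # 2 Gbps, assumed to be the line rate

payload = payload_rate_bps(LINE_RATE)   # 1.6 Gbps of data
word_rate = words_per_second(LINE_RATE) # 32-bit words per second on one link
```

Under this assumption one link carries 1.6Gbps of data, equivalent to 50 million 32-bit words per second.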

AMchip05
Starting from the good results obtained in the miniasic test, we began to design the AMchip05 in TSMC 65nm technology. In this chip there will be 8 hit buses, 2 pattern-in and 1 pattern-out buses, one input 100MHz LVDS clock plus single-ended control signals: JTAG Init, Dtest, Holds. As stated in the previous section, all the input and output buses are serialized and deserialized at 2Gbps. Moreover, due to the high number of power supply regions, we have to pay particular attention to the floor plan of the chip. Figure 7 shows the floor plan for the AMchip05, in which the various blocks in the chip can be seen. In particular, the high frequency input and output buses are aligned on the top, while the bottom is devoted to the various memory core architectures that we want to test.

Future evolution of the associative memory chip
The future evolution of the associative memory chip will be to increase the pattern density while maintaining an acceptable power consumption. The first step is a technique we call 2.5D. AMchip04 has been designed to be horizontally symmetric. In/out buses for the pattern output pipeline can change direction. Moreover, buses are swapped internally to maintain consistency. In this way the symmetry of the chip helps in designing and routing mezzanines for 2D chips, but also enables vertical stacking, that is, the 2.5D chip architecture.
The internal design of associative memory makes it a good candidate for full 3D implementation to increase further the pattern density and decrease the footprint. Stacking of dies allows the matchline to be shortened, thus increasing speed and decreasing capacitance and power consumption [5]. The bit line will also be shorter, contributing to a further power reduction.

Conclusions
This paper has shown the recent development of the associative memory chip, describing in particular the power reduction techniques and architectural solutions implemented in the final chip. Two different solutions are proposed for achieving the power consumption and speed requirements. Moreover, because several different power supply voltages are needed, a change in the associative memory package was necessary. We also changed from parallel input and output buses to high speed serial differential lines. In the future, we are going towards 2.5D, i.e. stacking of several dies. Then we can start to design a real 3D chip which has several advantages in terms of power consumption, memory capacity and speed.