Implementation of high-performance, sub-microsecond deep neural networks on FPGAs for trigger applications

Deep neural networks are already widely used for physics analysis, but there are only few applications within low-level hardware triggers, and typically only with small networks. Modern high-end FPGAs offer Tera-scale arithmetic performance, and thereby provide a significant amount of operations per data set even for {MHz}-range data rates. We present a bottom-up approach of implementing typical neural network layers, in which we took into account both the special constraints that come from high-performance trigger systems, such as the ATLAS hardware trigger at the LHC, as well as an efficient implementation. By specifically designing each layer type to match our requirements, we could develop a framework that reaches 90 to 100% processing efficiency for large layers, requires only few extra resources for data flow and controlling, and offers latencies in the range of only tens to hundreds of nanoseconds for entire (deep) networks. Additionally, a toolkit was built around these optimized layer implementations, which facilitates the creation of the FPGA implementation of a trained NN model.

neural network usage within triggers is the "High Level Synthesis for Machine Learning" (hls4ml) companion compiler, which uses high-level synthesis tools to generate the FPGA firmware design for a given network [2]. A more detailed overview on trigger requirements and existing work regarding neural network inference on FPGAs is also given in [2].
In contrast to the hls4ml framework, we chose to pursue a hardware-centric, bottom-up approach for implementing general neural networks on FPGAs, which grants maximum control over the FPGA design, and therefore allows very fine tuning for the specific use case. Thereby, we intend to further lift network size limits, while simultaneously providing scientists working on trigger algorithms with an easy-to-use tool for efficiently incorporating neural networks of sizes that were never used before into their systems.
We begin with a very brief overview over the relevant aspects of neural networks as well as FPGAs. Following that, we discuss the required types of operations, how they can be implemented in hardware, and introduce the concept of user-configurable fixed-point precision. The main section of this paper focusses on the design of the individual layers, which currently include dense layers, as well as two-dimensional (multi-channeled) convolutions and maximum pooling, and conclude this with the implementation of activation functions. We then quickly discuss what needs to be considered for putting multiple layers together in a functioning network, before the presentation of implementation results on individual layers and entire networks. In the end, the results of our developments are summarized and an outlook on possible future improvements is given.

Neural networks
In the following, we are focussing on artificial neural networks which consist of dense, 2D convolutional and 2D maximum pooling layers, and any meaningful combination of these, i.e. an arbitrary combination of the 2D layers, which might be followed by flattening and then an arbitrary sequence of dense layers, or alternatively a network consisting of dense layers only. As neural network framework, Keras was used, with a TensorFlow backend [3].
In a (fully-connected) dense layer, every neuron receives every input, with the number of inputs N I being predetermined by the network and the number of neurons N N being a parameter. The inputs ì i ∈ R N I are weighted by a weight matrix W ∈ R N N ×N I , typically have an offset ì b ∈ R N N applied and then go through a usually component-wise real activation function ì A : R N N → R N N to produce the result ì o ∈ R N N , i.e. the dense layer implements equation 2.1.
A 2D convolutional layer receives an input image/feature map of shape H I × W I × D I . The amount of kernels N K and kernel area H K × W K are parameters, while the kernel depth D K equals D I to incorporate all input channels into the kernel application. For a general layer, there are two additional architectural parameters called padding and stride. The stride determines the step lengths between kernel applications, and padding determines how kernel applications at the edges are treated. Currently, we support the 'default' stride of 1 in both directions and what is usually called padding 'valid', i.e. no incomplete kernel applications at the edges. Accordingly, output height and width are computed as H/W O = H/W I − (H/W K − 1), with the output depth D O = N K . The application of a kernel is similar to the dense neuron operation, where ì i contains the inputs that are covered at the given position and W depends on the kernel that is used.
As the 2D convolutional layer, a 2D pooling layer also receives a possibly multi-channeled input image. Instead of a kernel area, a pooling area H P ×W P is defined, which is applied channel by channel and selects the maximum value within the given input range. As for the convolutional layer, currently only the default stride is supported, which in this case is (H P , W P ) (i.e. non-overlapping pooling areas without free spaces in between). Padding options 'valid', 'same' and 'unchanged' are already implemented, see section 3.4. In consequence, the output shape is at least H I H P × W I W P × D I , with possibly one more element in height and width depending on the padding option and shapes.
Generally, it is also possible to reshape the data between layers. Until now, we support flattening from 2D multi-channeled data to 1D, where inputs are rearranged in a C-like manner, i.e. an input at (x, y, z) in 3D/2D multi-channeled will be put into position x · (W I · D I ) + y · D I + z in 1D.

FPGAs
Modern FPGAs comprise a large number (up to the order of millions) of programmable look-up tables (LUTs) and typically one or two flipflop registers (FFs) per LUT, with the routing between these also being progammable. In the following, any specifics apply to the Xilinx UltraScale+ (US+) FPGA architecture [4][5][6][7]. However, most features are directly or similarly applicable to other recent device families from Xilinx and competitors such as Intel/Altera.
The LUTs typically implement binary combinational functions from few bits to one or two result bits, e.g. f : {0, 1} 6 → {0, 1} 2 in the US+ architecture.1 A flipflop can store a single bit.
The combination of a large ensemble of programmable logic functions, registers for storage and synchronous processing and programmable routing makes it possible to implement even complex and large digital circuits within FPGAs.
Apart from these basic building blocks, there are also specialized components, of which block memories (BRAMs) and digital signal processors (DSPs) are important in the following. The BRAMs provide a high-density, high-capacity memory. One BRAM block consists of two 18 kib parts, where each is addressed by 9 to 14 address bits, which results in port widths and depths ranging from 36 and 512 to 1 and 16384. The DSPs are 'simple' ALUs (arithmetic logic units), which offer a wide range of operations. In the US+ architecture, the operation mode which is most important for NN applications is an w · i + b type operation, where w, i and b are binary numbers of up to 18, 27 and 48 bits. b can be an external input or come from an internal result register, allowing to perform a multiply-accumulate operation in either a pipelined (b external) or localized (b from internally accumulated) design.
Our target device was the Xilinx US+ XCVU9P, which features approximately 1.2 M LUTs and 2.4 M FFs, as well as 6840 DSPs and 2160 2x18 kib BRAM units, and supports maximum operation frequencies in the order of 640 MHz to 900 MHz, depending on device speed grade and hardware component. FPGA code was written in VHDL, Vivado 2018.2 used as IDE, synthesis and implementation strategy settings were Vivado defaults, apart from synthesis mode out of context, to make it possible to implement designs without any I/O connections. Apart from automatic 1True di-output LUTs are only supported for up to 5 inputs, otherwise constraints apply, see [6].
inference of hardware components such as LUTs, FFs and DSPs from the VHDL code, they were also explicitly instantiated by design primitives for the DSPs and adders implemented in the general logic, which gave maximum control over the implementation.
Key metrics for the evaluation of an FPGA implementation are the amount of resources required and the design timing, i.e. if the target frequency is met or if the design cannot run at the targeted frequency due to time needed for signal propagation. Designs which miss timing cannot be used without optimization, but even in these cases, it is useful to have knowledge about the severity of the timing violation.

Arithmetics implementation
Precision requirements Previous work has demonstrated that for NN inference, comparably low precision is sufficient, and neither floating-point arithmetics nor a large amount of bits are necessarily required for not losing any (or significant) performance in many cases, see e.g. [2].2 Using fixedpoint arithmetics provides the benefit of a significantly reduced implementation complexity, which saves resources and makes it a lot easier to adjust the arithmetic precision to specific needs regarding value range and granularity.
In the following, we define the precision specification 'i. f ', where i and f are the integer and fractional bits in a fixed-point representation, with the entire value being represented in two's complement. Accordingly, the resulting value range is Implementation in hardware For ANNs as described above, only a few basic operations are needed: Multiplication and addition are combined into the multiply-accumulate (MAC) operation, and for the pooling layers, the 'select maximum' operation (MAX) on two values is required.
Since fixed-point arithmetics are sufficient, a single DSP can be used to implement the MAC that is required for computing weighted input sums. Thereby, it is possible to provide one input and weight to a DSP per cycle, and then the DSP can add the product either to the internally accumulated value or to an externally provided partial result. For reaching the maximum DSP frequency, it is necessary to use internal pipeline registers of the DSPs. This requires having internal registers enabled at three levels: input, product and after the accumulation. Due to this, a 'first DSP' latency of three cycles arises. For any additional DSP in a pipeline, only one extra cycle is necessary, as even an externally provided extra term (e.g. the current partial result) can be inserted latency-free before the accumulation step. 3 In some situations, it is necessary to compute sums of values that come out of multiple DSP pipelines. Adding N values requires N − 1 adders. By adding multiple values in a binary tree structure, where at each level, pairs of values are added and the result is sent to the next level, it is possible to keep the logic depth at the optimal value of log 2 N , which is desirable from a latency point of view. To guarantee optimal performance, these logic-based adders were instantiated on a design primitive level, forcing Vivado to use the dedicated carry logic and place the corresponding primitives as closely as possible.
2We also verified this for our own test implementations. 3An extra external input register is also possible and might be required for future frequency improvements, but it is possible to incorporate this with only one extra cycle for the entire DSP pipeline, instead of one cycle per DSP.
For the pooling layers, it is necessary to find the maximum in a set of values. Similarly to the sum of multiple values, this can be done in a binary tree structure, to provide the lowest possible logic depth and resource utilization.4 Instead of being added, each pair of values is processed by the MAX operation, to detect the larger and send it to the next level, until only one value remains.
For the maximum and sum of set structures, we added configuration parameters for adding register stages at the input, inter-operation and output levels, allowing to trade maximum frequency against register utilization and latency in terms of cycles.

General decisions
Due to the very strict real-time context of high-performance detector triggers, it is very important to design for a minimized latency. At the same time, it is necessary to aim for an optimized inference performance, which strongly relates to an efficient resource usage. It is not expected that a solution could be found which optimizes all three metrics at once, especially if no further constraints exist for the network architectures. The selection of potentially suitable design approaches is tightly bound to the data rate f D = T −1 D (with the data tick period T D ), the possible processing frequency f P and the latency budget. Motivated by the ATLAS first level trigger, we chose to design for a data rate of 40 MHz; and for the latency aim for as little as few tens of nanoseconds for entire networks, but put performance and efficiency before a perfectly minimized latency.
Generally, if data is coming in at a rate f D and the frequency of the 'processing units' (PUs), e.g. the DSPs, is f P , there are C = f P f D cycles per PU and data set for real-time processing. Moreover, three general structures could be used that potentially reach perfect efficiency: Ideally, one would have all of the hardware resources available to process an entire data set within one data tick, and then proceed with the next set, which also results in the shortest latency. If that is not possible, one can either build a single pipeline of processing steps, where partial results of one pipeline stage propagate to the next stage during each data tick, or partition the hardware into N P parts, and place the corresponding amount of instances of the network processors, where each has N P data ticks for processing, and inputs are multiplexed to a different instance at each tick. We have decided for a single-instance pipelined design, and excluded the others: The 'single tick processing' design can be ruled out because with e.g. T D = 25 ns for the ATLAS detector, it will not be possible to have large designs which process all of the data within one data tick, simply due to the time required for computation and signal propagation in device-spanning, deep logic structures.
The decision for either a single pipeline design or a multi-instance design is less trivial, as both have advantages: In the single pipeline design, every pipeline stage could potentially be adapted to a specific task, which would reduce overhead structures. A multi-instance design would need to have more general processing units instead (as these would need to perform different tasks/compute different layers while time progresses), and could be implemented for example as systolic arrays. At the same time, data propagation through the single pipeline design would be bound to the pipeline 4It might be possible to use an algorithm with formally lower logic depth, but this is not expected to scale to the hardware regarding utilization and timing. structure, while the processing units in the multi-instance design could potentially allow better adjustment of the latency depending on the exact design, although at the price of an increased complexity.
We took three further aspects into account for our decision: For not unnecessarily increasing the latency, we decided not to batch the input data, i.e. the batch size during inference is 1. Additionally, with the given maximum hardware processing frequency and targeted data frequency range, C is in the order of 10 1 , and in a neural network, a layer can only begin computing when at least some of the required inputs are available.
All of this convinced us to choose the single pipeline design: It makes it easier to keep the efficiency high if no batching is used and C is small, it allows having relatively small (although possibly not minimized) latencies and it avoids too much overhead structure for reconfigurability and data flow management.5 For the pipelining approach, it also appears natural to have one pipeline stage per network layer, which is conceptually much less complex than making efficient use of a systolic array while simultaneously keeping the latency low. Layers were therefore designed to take the data within C cycles after the first part of the data arrives, and also produce all results within a period of C cycles, to maintain network synchronicity, with an arbitrary but fixed delay in between.
Moreover, with N PU processing units placeable in a device, there are N Op = C · N PU operations possible per data set and operation type, which is useful to understand the ideal-case inference performance of any device and thereby the maximum network size. Figure 1 shows the amount of DSP MACs possible depending on the device/DSP count and processing frequency, where the ATLAS data frequency of 40 MHz was assumed. This shows that already with one of the current high-end FPGA families, a network size of up to 280 kMACs could be possible even for such high data frequencies, if the maximum device frequency can be achieved and the computational efficiency is close to 100%.

Dense layers
Generally, it is possible to use a single DSP to implement the weighted input sum of a neuron, as the DSPs support the MAC operation. However, the amount of processing cycles is limited, therefore arbitrary amounts of inputs N I could only be handled if a weighting parallelization factor P is introduced, and even then (or if N I < C), such an approach can be computationally very inefficient. 6 Instead, we exploited that neurons in fully-connected dense layers require each of the inputs at some point. By combining this with the inputs being made available within C cycles, we implemented the weighted input sums using pipelines of DSPs.
An example data flow is illustrated in figure 2a, where for illustrational purposes we assumed that per processing cycle, only one further input becomes available. In this scheme, there are as many DSPs as inputs to the given pipeline. The input i n (n ∈ N 0 ), which is available at cycle n, is stored in a register which is connected to the input of DSP n at the subsequent cycle edge, which keeps this input until it is overwritten after C cycles. With n + 1 being the first cycle when DSP n has the new input available in its register, weight w m,n is presented to DSP n during cycle n + 1 + m, where m ∈ N 0 denotes the neuron index. After each cycle, the partially accumulated weighted input sum is passed as third input to the next DSP in the pipeline, which adds the next weighted input for the corresponding neuron, until the final result comes out of the last DSP.
Multiple of these pipelines can be used when multiple inputs become available during each cycle. If P is the number of pipelines, each pipeline weights at least N I P inputs, and the first N I mod P pipelines need to weight one extra input, while the remaining pipelines are extended by a shift register for cycle-matching the partial results.
Apart from parallelizing the pipelines when multiple inputs become available per processing cycle, it is also possible to use multiple of such 'neuron units' (NUs) if necessary, i.e. N NU = N N C > 1. In that case, each NU computes at least N N N NU neurons, and the first N N mod N NU NUs 5Even with this design, it is possible to further reduce the latency at cost of computational efficiency by constraining the layers to less than C cycles. The loss in computational efficiency would typically be tolerable, as mostly small networks would benefit from this, while larger networks are rather throughput-bound than latency-bound anyway.
6We found a way to increase the efficiency again for arbitrary layers and C, but with no benefit over the following.
compute one extra neuron. If N N N NU < C, all neuron results are available even before C − 1 extra cycles after the first result, allowing a reduction of the layer-to-layer latency.
Code structure As for the following layers, the dense layer consists of multiple design subunits, which are show in figure 2b. The main part of the dense layer are the neuron units. These contain the DSP pipelines, which use the w · i + b type operation of the DSPs, with w for weights, i for the input values, and b as partial result of the previous DSP. In the current implementation, the b signal can be implemented either via dedicated connections between the DSPs or through the general fabric wiring. The latter case provides more placement flexibility, but worse timing characteristics. For the fabric routing, timing could be improved by activating the respective DSP input register, which was not yet implemented.7 If multiple pipelines are used, a final adder is included to sum up all partial results to produce the actual weighted neuron input sum.
The input values for the DSPs are stored in a memory entity with external write control, therefore the external write sequence must match the expected input sequence. Input i is weighted by DSP i P in pipeline i mod P. Similarly, there are multiple weight memory blocks, realized with BRAM. Currently, there is one weight memory block per NU and pipeline stage, which spans all DSPs of that stage and NU. In the future, other architectures might be added, such as blocks per pipeline or per part of a pipeline, for frequency optimization for different layer architectures.
The neuron unit results are finally multicast to the neuron output signals which they produce during some cycle. A controller entity controls the selection of the weight memory and asserts signals indicating when a result is valid.

2D convolutional layers
Convolutional layers typically feature only few parameters, but a significant amount of the MAC operations of any network, due to the manifold application of the same kernel. Obviously, it is possible to apply kernels at multiple positions in parallel, as there are no direct dependencies between different output positions.8 However, naively using 'kernel units' which get all of the inputs required for a specific output position, but do not exploit any of the topology of convolutional layers, results in significant and mostly avoidable logic resource consumption, because the same input values need to be copied often and in many places.
For saving resources, we studied possibilities to make use of the shared inputs of neighboring kernel applications. Maximum input and weight reuse is possible if an entire output channel is computed per cycle, but this degrades the computational efficiency extremely for non-matching C. As compromise, we partitioned the output volume into 'rows' rather than channels. A row spans the entire width but has a fixed height and channel position. Thereby, it is possible to employ input sharing at least along the width axis. The base computational unit is therefore neither on channel nor on neuron level, but on row level, and is accordingly named 'row unit' (RU). This allows for a significantly more fine-grained unit allocation for arbitrary layer architectures and C, as now there 7Minor adaptations in the layer structure and controlling are necessary for that. 8There are partially shared inputs, however, and some very intelligent ways to reduce the computational cost, e.g. making use of the convolution theorem to implement the convolution as multiplication, but these often work well only for larger layers than possible here and sometimes require latency-introducing preparations, wherefore we did not focus on these.   C channel units for a given C. We also used that all result rows at a given height index require the same inputs, we consider such a group of rows a 'slice'. By letting multiple RUs process rows from neighboring slices as long as possible, data reuse can be employed along the height axis as well. By letting RUs process all rows of one slice before going to the next, the necessity for input loading/multiplexing can be further reduced, since inputs can remain the same during that period. These measures combined already allow for a substantial reduction of the resource usage and necessity for data loading.
In consequence, we developed a RU allocation scheme in which all RUs begin computing in the topmost channel and neighboring slices, then progress towards deeper channels, and move to the next set of slices when the last channel of the given slice block was computed (see figure 3a). A single row unit can process C rows, N RU,Sl = C D O complete slices and N RU,Rem = C mod D O remaining rows.
In the 'regular' case, there are at most as many slices as can be completely processed by the given amount of row units, i.e. N Sl ≤ N RU · N RU,Sl , where results from one slice are always computed by only a single RU. Some RUs might process one slice more than others, the former are attributed as 'long', the latter 'short'. In the 'irregular' case, it is necessary to also use at least some of the N RU,Rem remainder cycles of some row units, and the allocation scheme becomes much more complicated, see section A.1 in the appendix.
Code structure The main part of the 2D convolutional layer are the row units. These are provided with all of the weights and inputs they need. Internally, inputs are weighted and accumulated per input channel from top to bottom channel, and partial results are pipelined individually for each kernel position. At the end, all partial results are summed up by logic-based adders, to produce the final results of an entire output row. The top-to-bottom input scheme was used to reflect the output scheme, which primarily also works from top to bottom channel, also in the input scheme, to typically allow faster propagation between layers.
The rest of the layer was designed in an attempt to reduce the utilization (rather than frequency). A simplified structural example for the regular case is shown in figure 3b. There is one input buffer memory, which has external write controlling and stores each of the (2D multi-channeled) inputs exactly once. The buffer memory is followed by working memories, which have control inputs for selecting when data is written to them and from which range of inputs data is taken. There always is a 'long', and if 'short' RUs are present, also a 'short' working memory. The 'long' working memory contains all input slices required for the 'long' RUs. The 'short' working memory contains all extra input slices required for the 'short' RUs. We chose to create such a structure because the 'short' memory needs to load one set of inputs less from the buffer memory, and therefore this structure can reduce the resource utilization. In figure 3b, each working memory output signal corresponds to an input slice, i.e. this spans the entire width and all channels, but has a fixed height index (which varies over time, while different ranges of inputs are loaded into the working memories).
To save further resources, we grouped RUs which always process the same output channel during any given cycle together, and gave all of them only a single weight memory. In the regular case, this means that there is only one set of weight memories for all 'long' and one set for all 'short' RUs. The more complicated irregular case is described in appendix A.1.
As for the dense layer, the row unit results are multicast to all output positions where they are needed during any cycle (see allocation scheme), and the output row write enable signal is managed by a controller, which also controls the working memory input selection and write enabling and the weight memories.

2D max-pooling layers
For default-stride pooling layers, where there is no input sharing, it is neither necessary nor possible to resort to as elaborate schemes as for the convolutional layers. Since all inputs are required only once, there is no way of saving resources or input accesses, and there is no need to use complicated row allocation patterns. For simplicity reasons, the concept of output rows and row units was still maintained, but the row allocation was simply done from top to bottom channel, top to bottom slice in an interleaved manner (see figure 4a).
Contrary to the convolutional layers, it was already possible to implement different padding options for the pooling layer: These include padding 'valid' and 'same' (as in Keras), and a new mode 'unchanged'.9 Internally, padding is realized by the extension or truncation of the input signals. The latter will not increase the utilization significantly, as constant propagation will be detected and the corresponding logic parts removed/simplified by Vivado.
Code structure As for the Convolution, the basic units of this layer are the row units. A row unit gets all inputs required to compute a single output row. Since pooling usually does not extend over multiple channels, this means that a row unit only requires a range of input rows of a single channel for each cycle.
A structural schematic of the design is shown in figure 4b. Similar to the convolutional layers, a buffer memory is used, from which the row unit working memories loads a new input slice during 9'Valid' padding dismisses 'incomplete' pooling inputs at the upper edges, 'same' symmetrically extends the input space to only have complete pooling inputs, 'unchanged' extends only on the high-index edges.  each cycle. This process is driven by a controller entity, which also controls the output write enabling. Row unit results are again multicast to any position in which they are needed at some cycle.

Activation functions
Apart from the obvious linear/identity activation, we currently support the rectified linear unit (relu) activation, which is quite commonly used in (convolutional) deep neural networks, and has the advantage that it can be implemented with very little cost on FPGAs.
By explicitly instantiating on a design primitive level, we were able to guarantee a relu implementation that requires either B 2 LUTs or B − 1 FFs per relu unit working on B bit values. Other activations can be implemented in the future. One way of doing this is by either valuebased look-up tables or by value-derivative-based look-up tables. For example, with 16 bit result values, one could use the 8 most significant bits to look up an interpolation base value and a first derivative, and then add the first derivative multiplied by the least significant bits to the base value. With the Xilinx US+ architecture, both look-ups could be done at an approximate LUT cost of 2 · 2 8−6 · 8 = 64, the first two comes from two look-ups, the factor 2 8−6 from the cost for looking up one bit for an 2 8 -deep address space and the eight comes from an assumed eight bits which are looked up for the base value and first derivative. The multiplication of two 8 bit values and final addition would approximately cost 70 LUTs, leaving a total of ∼ 140 LUTs per activation unit to implement a relatively precise look-up with 256 sample points and linear interpolation in between.10 10The exact LUT cost depends on some details, but less than 200 is seen as reasonable assumption for many cases. If base value or derivative saturation occurs, even less LUTs would be required for the look-up. Our results also indicate that much less than 256 sample points are sufficiently precise for many activations.

Network creation toolkit 4.1 Layer synchronization considerations
When connecting dense layers to dense layers, it is only reasonable to select a pipeline parallelization P in the successor layer that equals the amount of neuron units N NU in the predecessor layer. By that, every input is used exactly when it is made available, and no extra buffering structures or delays are necessary.
When connecting layers with 'arbitrary' input and output schemes, there is no sense in starting the second layer earlier than necessary to run without interruption or later than that. This means that it is necessary to compute the minimum start delay that allows uninterrupted computation of the successor layer. Minimum start delay determination can easily be done if the output and input pattern of the involved layers are known, i.e. at which cycle which results/inputs are produced/needed. By element-wise subtraction of the 'needed' pattern from the 'available' pattern, one can take the largest positive value as minimum necessary delay for continuous computation.
However, with a given start delay, it might happen for some layer sequences that an input is already overwritten before it is needed for the last time.11 These cases can be solved by the introduction of individual extra delays for all affected input values. The extra delay for each input value can be computed by adding the start delay to the input 'needed' scheme, adding C − 1 to the 'available' scheme, subtracting the latter from the former and then introducing the respective extra delay for all values which yield a positive result. This does not increase the layer-to-layer latency.
Further delays that need to be taken into account are possible extra delays introduced by the application of activation functions and delay offsets between the layer enable signal and when that layer actually expects the first input. The latter do not influence the network latency itself, but need to be considered for asserting the layer enable signals at the right moments.
In the special case of flattening, it is necessary to 'regularize' the dense layer input, because other than the convolutional and pooling layers, the dense layer working memory is written to completely externally. This requires an extra auxiliary layer resulting in one extra cycle delay. The same effect could in the future be obtained by re-ordering the inputs/weights of the dense layer, s.t. the inputs which are available first after flattening are also required first. This was not yet implemented because the regularization approach is more general, and it was understood only lately that a mere re-ordering would be sufficient, if the dense layer parallelization P is set to the amount of simultaneously produced result values from the last 2D layer.

Network creation
The starting point for the network creation is a trained Keras network. Supported network features are the previously described layers, layer features and activations. An arbitrary sequence of 2D multi-channeled layers can be followed by flattening and an arbitrary sequence of dense layers, or a dense-only network can also be used.
After specifying some extra design parameters, such as precision information, the toolkit can be used for creating the VHDL network top file, auxiliary package file for configuration constants, 11For example the 2D convolution might have multiple load operations from its buffer memory and can therefore create such a situation. network simulation testbench file and network data files for initializing control and weight memories from a Keras network. The VHDL files can finally be included in a project together with the network library files and are ready for use. Then, it is only necessary to load the respective weights and controller data into the memories during execution, after which inference is possible.

Results
Given our target design parameters, i.e. 40 MHz data frequency, at most 600 MHz to 900 MHz processing frequency and the Xilinx XCVU9P FPGA, we considered a possible network size of at most multiple 10 4 MACs, to constrain ourselves to reasonable layer and network sizes for further studies.
For any design, timing closure was characterized by the 'worst negative slack' (WNS), which can be thought of as difference between the signal propagation time estimated by Vivado and the target clock period. Timing characteristics apart from the WNS were met for all implemented designs. All designs were implemented with the 'out of context synthesis'. The bit lengths of the layer input values and results were set to 16, and where occurring weights were set to 12 bits length.

Isolated layers
All implementations were done with a target clock period of 1.563 ns, reflecting the f P = 640 MHz processing frequency that are ideally achievable even with speed grade 1 US+ DSPs, which corresponds to C = 16 for an assumed data rate of 40 MHz, as in the ATLAS experiment. Therefrom, we derived the minimum processing clock period T P,min = T P − T WNS of a design. Designs which failed with a large WNS might show better results for larger T P , because for better-suited T P , Vivado can typically reach better T P,min .

Dense layers
We implemented dense layers for N I , N N ∈ {8, 16, 24, 32, 50, 64, 75, 100, 128}, C ∈ {10, 16}. As expected, the DSP utilization represents what was designed for, i.e. N I · N NU . BRAM utilization is at most 0.5 · N I P · N NU · P 3 + 0.5, since for P > 3 and 12 bit weights, multiple BRAM (sub-)units are necessary for most stage weight memories. For larger layers, this corresponds to approximately one BRAM per 5 DSPs, which is tolerable given the resource ratios in the US+ device family. For all of these designs, at most 4 LUTs and 23 FFs were required per DSP, with less than 10 FFs per DSP in almost 75% of the designs. Compared to ∼ 170 LUTs and ∼ 340 FFs per DSP in the XCVU9P FPGA, both values are negligible. Layer size in terms of MACs against maximum frequency is shown in figure 5. Depending on the device choice, it is possible to have up to multiple thousand MACs in a layer, before the implemented design rather than the device specification becomes limiting.

2D convolutional layers
The exact structure and therefore utilization of 2D convolutional layers depends strongly on the architectural parameters. To give a good overview, we implemented for some hand-picked parameters, to ensure all architectural features are covered at least once, and did a grid implementation for H I × W I ∈ {(10, 10),   Maximum layer frequency depending on layer size and FFs to DSPs, we achieved 6 to 61 (with 75% below 25, and 95% below 40) and 10 to 77 (with 75% below 40 and 90% below 50), which is still a good result given the available resource ratios. A plot of layer size in terms of MACs against maximum frequency is shown in figure 5. It turned out that for comparable size, convolutional layers typically reach lower maximum frequencies than dense layers. This is a result of the optimization for utilization rather than timing at a time when the resource consumption was not yet known. Critical signal paths could already be identified, and we expect to be able to make it possible to trade a higher frequency for more utilization in the future. Hence, pooling layer utilization is negligible for computationally reasonable input sizes, with even only up to 52k LUTs and 130k FFs needed for as many as 6400 inputs.12 Layer size in terms of inputs against maximum frequency is shown in figure 5. Small layers met the target frequency of 640 MHz without any problems, some larger designs fell below that line, but not significantly. Future alternative designs could yield increased 12That many inputs would already be difficult to handle with convolutional layers, depending on the exact layer parameters. Relative Timing Closure Depending on Network Multiplication Count Figure 6: Network frequency closure depending on network size and target processing frequency. 'Frequency closure' is defined as min(1, T P,min /T P ), i.e. thes ratio between targeted processing frequency and maximum processing frequency according to Vivado.
frequencies, but this was not yet studied, as the an improved implementation of 2D convolutions appears more urgent.

Example networks
We tested four basic network architectures, where in the following we use I, C, P, F and D for 2D input, 2D convolutional, 2D pooling, flattening and dense layers respectively. Architecture Arc A is I-C-P-F-D-D, Arc B is I-C-P-C-F-D-D, Arc C is I-C-P-C-F-D-D-D and Arc D is I-C-C-C-F-D-D-D.
For each architecture, we trained various networks with varying layer parameters for the MNIST digit recognition task. The last dense layer always had 10 neurons to classify the ten digits. A range of values for C was also scanned for each network, where as data frequency f D we chose 40 MHz, and as (target) processing frequency C · f D . We used a precision choice of 6.8 and 2.8 for values and weights, for which at most a sub-percent order of classification accuracy loss was observed, compared to CPU inference using float32 as data type. Some of the implementation results are shown in table 1. For the given precision choice, between 7 and 50 LUTs were needed per DSP (among all networks, including those which completely failed timing and are not shown in table 1), with the tendency of less LUTs per DSP for larger networks, and less than 30 LUTs per DSP in 83% of the cases. Similarly, between 12 and 90 FFs were needed per DSP, with less than 50 FFs in 80% of the cases. The absolute BRAM utilization depends largely on the amount of inputs and neurons in the dense layers, but does never exceed 20% of the absolute DSP utilization, and is significantly lower in most cases.
The larger networks (> 15k MACs) did not yet meet the timing, which could already be expected based on the results for the individual layers. For all of the networks below that size, Table 1: Implementation results for some example networks. Activation 'relu' for all but the last layer, which had a 'linear' activation and always N N = 10. Input size 14 × 14, 1 channel, if not stated otherwise. NS left out where no NS occurred, i.e. timing was met for these designs. For convolutional layers, the kernel shape (H K × W K × N K ) was specified, for pooling layers the pooling area (H P × W P ) and for dense layers the number of neurons N N was specified. DSP efficiency refers to the relative amount of non-idling DSP cycles.  Networks which met timing are marked with a plus, networks that failed timing are marked with a cross at least one value for C was found for which the network met the timing. The findings regarding the relation between network size and timing closure are also shown in figure 6. One interesting observation is that the relation between timing closure and C is not nearly 'continuous', i.e. there might be designs with a large timing violation for one choice of C, and little violation or even timing closure for neighboring values of C, or vice-versa. This is no surprise, as C has a large influence on the structure of the design, but it shows that it might be necessary to explore different settings to find a working design for medium-sized networks. However, it can still be seen that in tendency, timing closure is more difficult to reach for larger networks. For the future, it is expected that frequency improvements for the individual layers (especially for the convolutional layer, see section 5.1) will make it possible to reach timing closure at least for networks which failed only closely. If then layer-to-layer paths become critical, it is an option to add extra register stages between layers to relieve the placement and routing efforts.
Network latencies (both in terms of real time latency and cycles) are put into relation to network size in figure 7a. The two dominant factors to the network latency are the amount of layers and the processing frequency. For the four-layer networks with architecture Arc A , a latency of ∼ 100 ns is obtained, after which all results are available. The larger networks which just failed timing closure required up to ∼ 300 ns to produce all results. Figure 7b shows what latencies would be obtained depending on C if timing closure was always achieved for networks with up to ∼ 50 k MACs, which especially shows that with increasing C, a significant latency reduction would become possible. It is expected that with future design improvements, the network latencies will approximately remain the same (or even improve) in terms of cycles, but improve in terms of real time latency due to frequency improvements, which would be represented by figure 7b.
Regarding the DSP efficiency, our design is (expectedly) not always able to adapt efficiently for very small networks. For larger networks, where the intended design works well, efficiencies of more than 90% can be reached. Please note that this is not yet completely visible in table 1, because the networks which are shown there are still relatively small, although an on average increased DSP efficiency can also be observed there.
We also switched the 'relu' activation implementation between LUT-or FF-based for some designs, but did not see a consistent effect on timing, only resource usage changed accordingly between LUTs and FFs. Vivado might not always be able to make use of the improved placement capabilities that come from the extra register layer. Additionally, we implemented one layer for significantly increased precision, and observed an almost linearly increased utilization and slightly worse timing characteristics, as expected.

Summary and outlook
We were able to develop a neural network implementation framework that is suitable for use within detector triggers, provides a substantial inference performance even at incoming data rates of many MHz, and requires only tens to few hundreds of nanoseconds for inference for reasonable network sizes by specifically designing neural network layers for the special constraints of the trigger environment. By embedding our developments into an easy-to-use toolkit, the overhead of implementing efficient, high-performance neural networks within trigger systems with such constraints has been greatly reduced, as no expert knowledge about FPGA implementation and neural networks is needed any more. Now, implementing a neural network on trigger FPGAs, for example in the ATLAS detector, can be as simple as running our script on an already trained Keras network.
In the future, we intend to further improve and extend our work, with already a lot of useful features in our current scope. Among these are various changes to the current layer implementations, which target network latency and maximum frequency, but also functional extensions such as support for neuron bias, new activation functions and new layer types, such as transposed 2D convolutions for upsampling. We also have toolkit extensions planned, such as layer emulation for faster network benchmarking and automated precision recommendations. Further in the future, we see potential to significantly increase the overall performance, for example by implementing support for sparse weights, but also by changes to how the arithmetic resources of the FPGAs are used.