Investigating the Dirac operator evaluation with FPGAs

In recent years the computational capacity of single Field Programmable Gate Array (FPGA) devices, as well as their versatility, has increased significantly. Together with High Level Synthesis frameworks, which allow such devices to be programmed in a high-level language like C++, this makes modern FPGAs a serious candidate as building blocks of a general-purpose High Performance Computing solution. In this contribution we describe benchmarks which we performed using a Lattice QCD code, a highly compute-demanding academic HPC code for elementary particle simulations. We benchmark the performance of a single FPGA device running in two modes: using the external or the embedded memory. We discuss both approaches in detail using the Xilinx U250 device and provide estimates for the necessary memory throughput and the minimal amount of resources needed to deliver optimal performance, depending on the available hardware platform.


Quantum Chromodynamics
Quantum Chromodynamics is the theory describing the interactions of quarks and gluons, explaining why they form bound states such as protons and neutrons. One of the characteristic features of this theory is that in the low-energy regime quarks and gluons form a strongly coupled system. As a consequence it is difficult to extract predictions for the properties of such a system from first principles. Up to now the only available computational tool allowing for such calculations is the numerical (Monte Carlo) simulation of a discretized version of the theory, called Lattice Quantum Chromodynamics (LQCD). Traditionally, physicists working in the field of LQCD have searched for the most performant vector machines consisting of a large number of compute nodes, and have designed many new HPC solutions: QCDOC [1], APE [2], QPACE [4], just to name a few. Currently GPU and ARM processors are considered for the next generation of supercomputing machines, and it is an open question whether FPGA devices could also be used as an alternative.
In the discretized version of Quantum Chromodynamics the basic degrees of freedom are associated with each point of a four-dimensional grid representing a finite volume of four-dimensional space-time. The sizes of such volumes vary from V = 10^6 up to V = 10^8 points. The most compute-intensive part of the simulations is the inversion of the Dirac matrix, which is of size 24V × 24V. The matrix has a sparse structure because it describes nearest-neighbour interactions. The Dirac matrix D(n,m)^{AB}_{\alpha\beta} acting on the vector ψ(m) can be written down as follows [6]

\[
D(n,m)^{AB}_{\alpha\beta}\,\psi^{B}_{\beta}(m) = (m_q + 4)\,\psi^{A}_{\alpha}(n)
- \frac{1}{2}\sum_{\mu=1}^{4}\Big[ U^{AB}_{\mu}(n)\,\big(P^{-}_{\mu}\big)_{\alpha\beta}\,\psi^{B}_{\beta}(n+\hat{\mu})
+ U^{\dagger\,AB}_{\mu}(n-\hat{\mu})\,\big(P^{+}_{\mu}\big)_{\alpha\beta}\,\psi^{B}_{\beta}(n-\hat{\mu}) \Big], \qquad (1)
\]

where P^{\pm}_{\mu} are real-valued 4 × 4 constant matrices, m_q is a real parameter corresponding to the quark mass, and μ labels directions in the four-dimensional space-time. Repeated indices are summed within the ranges α, β = 1, …, 4 and A, B = 1, 2, 3. For unexplained notation please see [6] or [7].

The most elementary computational block is the evaluation of a single stencil, i.e. the evaluation of the right-hand side of Eq. (1) for a given value of the index n. Note that the coefficients of the D(n,m)^{AB}_{\alpha\beta} matrix differ for each m, i.e. the complex-valued 3 × 3 matrices U and the complex-valued 3-element vectors ψ depend on the position m. Therefore each stencil involves loading eight U matrices and nine spinor fields from the neighbouring lattice sites, which corresponds to 360 input words; in double precision this amounts to 2880 input bytes. One can exploit the structure of the SU(3) matrices and parametrize them in terms of 10 input words each, instead of 18 in the naive formulation (9 real and 9 imaginary entries). We return to this point in Section 3.2. The U × ψ matrix-vector multiplications require 1464 floating point operations for complex additions and multiplications. One of the simplest algorithms allowing to invert such a matrix is the iterative conjugate gradient algorithm, which we briefly describe in the next section.
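To make the data volume per stencil concrete, the following plain C++ sketch (our illustration, not code from the benchmark; the names SU3, Spinor and words are hypothetical) mirrors the fields described above, so that the word counts quoted in the text follow directly from the type sizes:

```cpp
#include <complex>
#include <cstddef>
#include <iostream>

// One gauge link: complex-valued 3x3 matrix -> 18 real words.
struct SU3 { std::complex<double> c[3][3]; };
// One spinor: 4 spin x 3 colour complex components -> 24 real words.
struct Spinor { std::complex<double> s[4][3]; };

static_assert(sizeof(SU3) == 18 * sizeof(double), "18 words per link");
static_assert(sizeof(Spinor) == 24 * sizeof(double), "24 words per spinor");

// Input words of one stencil: 8 links + 9 spinors.
constexpr std::size_t words(std::size_t words_per_link) {
  return 8 * words_per_link + 9 * 24;
}

int main() {
  std::cout << words(18) << " words\n";                  // 360 (naive links)
  std::cout << words(18) * sizeof(double) << " bytes\n"; // 2880 in double precision
  std::cout << words(10) << " words\n";                  // 296 (reduced links)
}
```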

Conjugate gradient algorithm
Since November 2017 a new ranking of supercomputers has been published by the TOP500 organization, based on the HPCG benchmark. In this benchmark a sparse matrix is inverted using a conjugate gradient algorithm, in contrast to the traditionally used Linpack benchmark, where the employed matrix is dense. The argument behind the HPCG benchmark is that in many cases sparse matrix computations are more representative of the variety of HPC applications run on a supercomputer. Indeed, an iterative solver of the conjugate gradient type is at the heart of Monte Carlo simulations of QCD. In this work we consider an improved version of the conjugate gradient algorithm which allows us to test different floating and fixed point precisions without deterioration of the final solution. The algorithm intertwines iterations in low and high precision, working mainly in low precision and correcting the possible systematic error with a high precision iteration.
Our algorithm follows the one suggested in Ref. [5] and is shown in listing 1.
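Listing 1 itself is not reproduced here, so the following minimal sketch illustrates only the general idea under our own assumptions: an iterative-refinement scheme in which a cheap low-precision conjugate gradient solve is wrapped in an outer loop that recomputes the true residual in high precision and corrects the accumulated error. The operator apply_D (a stand-in symmetric positive definite stencil) and all names are hypothetical placeholders, not the code of Ref. [5].

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using VecD = std::vector<double>;
using VecF = std::vector<float>;

// Hypothetical matrix-free operator y = D x (a simple SPD stand-in stencil).
template <typename V>
void apply_D(const V& x, V& y) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i)
    y[i] = 4 * x[i] - (i > 0 ? x[i - 1] : 0) - (i + 1 < n ? x[i + 1] : 0);
}

template <typename V>
double dot(const V& a, const V& b) {
  double s = 0;
  for (std::size_t i = 0; i < a.size(); ++i) s += double(a[i]) * b[i];
  return s;
}

// Plain CG in low (float) precision: approximately solves D e = r.
void cg_low(const VecF& r, VecF& e, int iters) {
  const std::size_t n = r.size();
  VecF p = r, Ap(n), res = r;
  std::fill(e.begin(), e.end(), 0.f);
  double rr = dot(res, res);
  for (int k = 0; k < iters && rr > 0; ++k) {
    apply_D(p, Ap);
    float alpha = float(rr / dot(p, Ap));
    for (std::size_t i = 0; i < n; ++i) { e[i] += alpha * p[i]; res[i] -= alpha * Ap[i]; }
    double rr_new = dot(res, res);
    float beta = float(rr_new / rr);
    rr = rr_new;
    for (std::size_t i = 0; i < n; ++i) p[i] = res[i] + beta * p[i];
  }
}

// Mixed-precision solve: low-precision CG corrected by high-precision residuals.
void solve_mixed(const VecD& b, VecD& x, double tol) {
  const std::size_t n = b.size();
  VecD r(n), Dx(n);
  VecF rf(n), ef(n);
  for (int outer = 0; outer < 100; ++outer) {
    apply_D(x, Dx);                                    // high-precision residual
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Dx[i];
    if (std::sqrt(dot(r, r)) < tol * std::sqrt(dot(b, b))) return;
    for (std::size_t i = 0; i < n; ++i) rf[i] = float(r[i]);
    cg_low(rf, ef, 50);                                // cheap low-precision solve
    for (std::size_t i = 0; i < n; ++i) x[i] += ef[i]; // high-precision correction
  }
}

int main() {
  VecD b(64, 1.0), x(64, 0.0);
  solve_mixed(b, x, 1e-10);
}
```

The design point is that almost all arithmetic happens inside cg_low in float, while the double-precision outer loop guarantees that the final residual meets the requested tolerance.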

Kernel description
We briefly summarize the FPGA implementation of the main kernel function. The function involves a loop over a subvolume and an evaluation of the stencil for each site of the lattice. We follow what was presented in Ref. [7].
All operations involved in the evaluation of a single stencil are shown graphically in figure 1. The evaluation naturally splits into 4 stages. The clock cycles quoted there provide an estimate of the amount of time needed to complete each stage.
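To indicate how such a kernel is typically expressed in an HLS C++ framework, a sketch under our assumptions (a 12 × 8^3 subvolume; hypothetical names dirac_kernel and compute_stencil; not the code of Ref. [7]) could look as follows, with one stencil evaluation pipelined per lattice site:

```cpp
#include <complex>

constexpr int VOL = 12 * 8 * 8 * 8;              // assumed subvolume

struct SU3    { std::complex<float> c[3][3]; };  // 18 words per link
struct Spinor { std::complex<float> s[4][3]; };  // 24 words per spinor

// Hypothetical stencil body; for brevity only the diagonal (mass) term of
// Eq. (1) is written out, the eight hopping terms are indicated in comments.
Spinor compute_stencil(const SU3 links[8], const Spinor nbr[9], float mq) {
  Spinor r{};
  for (int a = 0; a < 4; ++a)
    for (int A = 0; A < 3; ++A)
      r.s[a][A] = (mq + 4.0f) * nbr[0].s[a][A];  // nbr[0] is the central site
  // ... minus 1/2 * sum_mu [ U_mu(n) P_mu^- psi(n+mu)
  //                        + U_mu^dag(n-mu) P_mu^+ psi(n-mu) ],
  // consuming links[0..7] and nbr[1..8].
  (void)links;
  return r;
}

// Main kernel: loop over the subvolume, one stencil per site.
void dirac_kernel(const SU3* U, const Spinor* in, Spinor* out, float mq) {
  for (int n = 0; n < VOL; ++n) {
#pragma HLS PIPELINE II=1
    SU3 links[8] = {};
    Spinor nbr[9] = {};
    // In the real kernel the 8 links and 9 spinors of site n are gathered
    // here from U and in; omitted for brevity.
    (void)U; (void)in;
    out[n] = compute_stencil(links, nbr, mq);
  }
}
```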

Two approaches
There are two approaches one can follow in order to provide the kernel with the required data. One can divide the entire problem into parts small enough that the complete set of data for a single part fits into the BRAM memory of the device. Alternatively, one can store the entire set of data in the DDR die attached to the programmable logic and stream the data through the link. We discuss the performance of both solutions below.

Smaller lattice stored in BRAM memory
This is the approach we followed in Ref. [7]. We showed that lattices up to 12 × 8^3 in double precision can fit into the internal memory of the programmable logic of the FPGA devices currently available on the market. In figure 2 we show the required number of URAM blocks for a given lattice size, for single and double precision. As can be seen in that figure, the storage requirements do not scale linearly with the volume: in order to allow the compiler to take advantage of the natural parallelism of FPGA devices, it is crucial to distribute the data in the PL over as many separate local memory blocks as possible. This is due to the fact that in a single PL clock cycle only one memory element can be read from a given BRAM block. The computation of a single stencil needs eight different U matrices, and we impose that they are stored separately. Although this requires duplicating the amount of stored data, since the matrices U(n) and U†(n) are kept in separate blocks, the gain is considerable: the stencil evaluation can be fully pipelined, i.e. the hardware block can accept new input data at each clock cycle. The resulting performance simulated in software is 812 GFLOPs for single precision and 406 GFLOPs for double precision with the PL running at 300 MHz.
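In HLS C++ the storage strategy can be sketched as follows (our illustration; the array names and the use of complete partitioning over the direction index are assumptions, not the code of Ref. [7]). Keeping U(n) and U†(n) in separate, partitioned arrays duplicates data but gives the scheduler enough memory ports to read all eight link matrices of a stencil in one clock cycle:

```cpp
#include <complex>

constexpr int VOL = 12 * 8 * 8 * 8;   // 12 x 8^3 subvolume from the text

struct SU3 { std::complex<double> c[3][3]; };

void dirac_kernel(/* host streams omitted */) {
  // Forward links U_mu(n) and duplicated daggered links U_mu^dag(n - mu)
  // live in separate arrays; partitioning over the direction index maps
  // each direction to its own BRAM/URAM blocks.
  static SU3 U_fwd[4][VOL];
#pragma HLS ARRAY_PARTITION variable=U_fwd complete dim=1
  static SU3 U_bwd[4][VOL];
#pragma HLS ARRAY_PARTITION variable=U_bwd complete dim=1

  for (int n = 0; n < VOL; ++n) {
#pragma HLS PIPELINE II=1
    // ... read U_fwd[mu][n] and U_bwd[mu][n] for mu = 0..3 in the same
    // clock cycle and evaluate the stencil; the loop then achieves an
    // initiation interval of one clock cycle.
  }
}
```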

Larger lattice streamed from the DDR memory
The other way to operate on larger data sets is to keep the data in the DDR die attached to the programmable logic and process the data in a streaming mode. This was investigated on the Maxeler system in Ref. [9]. The U matrices and ψ spinors are prepared beforehand into sets corresponding to consecutive stencils and are streamed continuously from the DDR into the logic. The limitation of this solution is the throughput of the memory link between the DDR and the logic. Using SDAccel and an OpenCL implementation of the CG algorithm we verified that on the Xilinx U250 device one can send 256 B per clock cycle from the DDR memory to the PL part, working at a frequency of 300 MHz. Four channels are available, aggregating to 77 GBps of throughput. In order to decrease the amount of data transferred we change the representation of the U matrices [3]: we use a 10-parameter parametrization, trading two more parameters than the minimal representation in exchange for avoiding the computation of trigonometric functions in the programmable logic. The reduced set of data translates to an initiation interval of 5 and 9 clock cycles for the compute kernel for single and double precision respectively, i.e. the programmable logic has to wait 5/9 clock cycles to gather enough data to start a new computation. The performance in that case would be approximately equal to 86 and 46 GFLOPs respectively, which is comparable to the one quoted in Ref. [9] on the Maxeler system. However, if we also count the additional operations needed to recover the U matrices from their reduced form, the achieved sustained performance reaches 194 GFLOPs for single precision. In figure 3 we show how the required throughput depends on the initiation interval; the calculation assumes the reduced form of the U matrices. The smaller the initiation interval, the shorter the time in which the data has to be transferred. We show the throughput estimates for single and double floating point precision. Knowing the throughput between the DDR and the programmable logic on a given device, one can easily read off the corresponding minimal initiation interval and hence the resulting performance, which is shown in figure 4 for both the single and double precision case.
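The initiation intervals and performance figures quoted above can be cross-checked with a few lines of arithmetic (our own back-of-the-envelope estimate: 8 × 10 link words plus 9 × 24 spinor words per stencil, 256 B per clock cycle, 1464 flops per stencil at 300 MHz):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Reduced stencil payload: 8 links x 10 words + 9 spinors x 24 words.
  const int words = 8 * 10 + 9 * 24;        // = 296 words per stencil
  const double bytes_per_cycle = 256.0;     // U250: 4 DDR channels at 300 MHz
  for (int wsize : {4, 8}) {                // single / double precision
    const double cycles = words * wsize / bytes_per_cycle;  // 4.625 / 9.25
    const long ii = std::lround(cycles);    // ~5 / ~9 cycles, as in the text
    // One 1464-flop stencil every ii cycles at 300 MHz:
    const double gflops = 1464.0 * 0.3 / ii;  // ~88 / ~49, close to the
                                              // ~86 / ~46 GFLOPs quoted above
    std::printf("%d-byte words: II ~ %ld cycles, ~%.0f GFLOPs\n",
                wsize, ii, gflops);
  }
}
```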
Finally, in figure 5 we show how the hardware resource consumption depends on the initiation interval for single and double floating point precision, as well as for a more FPGA-friendly 32-bit fixed point data format. In this streaming scenario one can relax the initiation interval of one clock cycle imposed in the first approach: with the memory throughput being the bottleneck, one can implement the kernel with a larger initiation interval, because in any case several clock cycles are needed to collect all the necessary data for a single stencil computation. The figure shows an indicative percentage of all available resources, counting together all DSP, LUT and BRAM blocks. We see that in the described case, where the memory throughput imposes an initiation interval of 5 clock cycles, the compute kernel uses only 20% of the available resources for double precision.
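To indicate why a 32-bit fixed point format is considered more FPGA friendly, the sketch below (our illustration, emulating a Q8.24 format in plain C++ rather than using a vendor-specific fixed point type) shows a complex multiply-accumulate in which every multiplication becomes an integer operation that maps directly onto DSP slices, with no floating point cores required:

```cpp
#include <cstdint>
#include <cstdio>

// Q8.24 fixed point: 8 integer bits, 24 fractional bits in an int32_t.
constexpr int FRAC = 24;
using fix32 = int32_t;

fix32 from_double(double x) { return fix32(x * (1 << FRAC)); }
double to_double(fix32 x)   { return double(x) / (1 << FRAC); }

// Fixed-point multiply: widen to 64 bits, then shift back.
fix32 mul(fix32 a, fix32 b) { return fix32((int64_t(a) * b) >> FRAC); }

// Complex multiply-accumulate acc += a * b; each mul() maps onto integer
// DSP slices, which is cheaper than instantiating a floating point core.
struct cfix { fix32 re, im; };
cfix cmac(cfix acc, cfix a, cfix b) {
  acc.re += mul(a.re, b.re) - mul(a.im, b.im);
  acc.im += mul(a.re, b.im) + mul(a.im, b.re);
  return acc;
}

int main() {
  cfix a{from_double(0.5), from_double(-1.25)};
  cfix b{from_double(2.0), from_double(0.75)};
  cfix acc = cmac({0, 0}, a, b);
  std::printf("(%f, %f)\n", to_double(acc.re), to_double(acc.im)); // (1.9375, -2.125)
}
```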
The presented results allow one to understand the various constraints limiting the performance of the investigated kernel on FPGA devices. Starting with the embedded memory scenario: the problems analyzed in practice are larger by a factor of the order of 4096. One would then presumably use that number of FPGA devices running in parallel, exchanging boundary data directly from and to the programmable logic through the embedded transceivers. On the other hand, in the external memory scenario in principle the entire set of data could be stored in the DDR. However, the wall clock time to solution on a single FPGA device would be impractically long; in that case one would also resort to a many-node system where the computations could be sped up by running them in parallel. In principle neither of the two scenarios is obviously superior: the number of required nodes can differ between the two solutions, and the details depend essentially on the memory throughput of the FPGA device used. With the numbers provided above such estimates can be put on solid ground.

Conclusions
In this contribution we discussed the applicability of FPGA devices to High Performance Computing solutions. As a benchmark we used an academic code for Monte Carlo simulations of Quantum Chromodynamics. On traditional computer architectures this code is memory bound, due to the unfavorable ratio of the amount of data to be loaded to the number of floating point operations executed by the most elementary kernel function. On the available programmable logic hardware the problem turns out to be memory bound in the scenario where the data is streamed from the DDR die, a situation which will be considerably improved with the arrival of Xilinx Alveo U280 cards with a 480 GB/s memory bandwidth between the memory and the programmable logic. In the scenario where the data is stored in the embedded memory, the limitation is the available size of the internal memory. Both cases appear to be scalable and thus offer a viable proposal for a larger scale infrastructure.