Hardware Implementation of Low-Latency 32-bit Floating-Point Reciprocal

As the speed requirements of imaging and communications systems increase, the latency requirements of digital circuits also become stringent. Due to such tight latency or timing requirements, large-stage pipelined circuits need to be redesigned to meet the low-latency requirements. Most modern imaging and communications systems rely on digital signal processing (DSP) that compute complex mathematical operations. The emergence of powerful and low-cost field programmable gate array (FPGA) devices with hundreds of arithmetic multipliers has enabled many such DSP hardware applications, traditionally implemented only as software solutions. The reciprocal square root algorithm is a popular technique for computing square roots, used widely in many software applications. This paper shows how this algorithm can be implemented efficiently on hard-ware, and is suitable for low-latency mathematically-intensive applications. Using a low-cost FPGA device, the algorithm takes up less than 1000 look-up-tables (LUTs), which on an Artix XC7A200T device, translates to less than 1% of all the LUT resources in the chip.


Introduction
Many signal processing and imaging applications de-pend on the computation of non-linear equations. The square root is one such nonlinear operation that is used by such applications. In image and video processing, the square root is used to calculate the magnitude of the gradient of an image, root mean square error, vector normalisation, among others.
Most square root computations are expensive and takes many iterations to complete. In hardware, these iterations translate to clock cycles. A highly-iterative operation such as those using Newton-Raphson algorithms will require a high-latency hardware circuit. Therefore, such hardware are not suitable to be used in low-latency applications such as high-speed communications and image processing. With an already high system clock frequency, increasing the frequency even further will most likely result in timing issues on hardware.
Pipelining the circuit just partially works around this problem. Due to fast-changing input data, the result from the computation of a highly-iterative high-latency circuit could be available only after the next request is received. Each request to compute usually comes with a new set of data. These fast changes in the input data while the computation is still in progress may corrupt the final result which is yet to be computed. To work around this problem, more storage logic is being added to buffer the incoming data, which in turn increases the area and power consumption, slows down the entire circuit, and increases the complexity unnecessarily Figure 1 illustrates this problem in a timing diagram. This problem is compounded if the result from the computation is available only after two or more such requests is received.

Square Root Algorithms
There are several techniques to compute the square root. The floating-point square root based on Taylor series expansion using lookup tables and multipliers is discussed in detail by Wang [1] and Ercegovac [2] et al. Iştoan [3] de-scribes several techniques for fixedpoint implementations of the square root. Li and Chu [4] describes the FPGA im-plementation of the single-precision floating-point square root, while Nanhe et al. [5] describes both fixed-and floating-point square root implementations. These square root implementations are based on concepts such as the non-restoring algorithm [4], tabulateand-multiply, bipartite, and Taylor series expansion [1][2][3]. The Newton-Raphson iterations is a common technique to improve the precision of the result, and usually acts as the final step in many of these techniques, including for the method described in this paper. This paper uses the techniques described by Lomont [6] and Robertson [7]. However, most of these techniques were implemented as software algorithms. Zafar S [8] presented a similar FPGA-based implementation which took 12 clock cycles to complete the processing. Here, we present a low-latency FPGA-based hardware implementation which has only 3-cycle latency. This algorithm involves multiplication, bit-shifts, and subtraction from a "magic" constant. This magic constant was algebraically derived by Robertson, which confirms the same value computed numerically by Lomont in his earlier paper.
Our magic constant also uses the value 0x5f375a86 rather than 0x5f3759df as was used in Zafar's paper. The constant 0x5f3759df was originally used in the Quake III program, and was proven by these authors as being less accurate compared with using new constant. In our paper, we shall refer to this algorithm as the fast reciprocal square root (FRSR).

The Newton-Raphson iterations
The Newton-Raphson method is an iterative technique that takes a current approximation, y n , and returns a new approximation y n+1 , aimed at estimating the final result more accurately. Equation 1 shows how a new estimate is being computed.
Where f(y) = 0 is the function to be approximated, and f _ is the firstorder derivative of f.
Performing this iteration on a reciprocal square root function which yields the estimate y n+1 as

The Reciprocal Square Root
In the IEEE 754 floating-point system, a 32-bit single-precision floating-point number f can be physically represented as shown in Figure 2. The representation consists of three parts: the sign bit s, the exponent field q, and the mantissa field M.
We can write f in equation form, as shown in Equation 4.
Where s denotes the sign and indicates whether the mantissa, is even or odd, m denotes the normalised mantissa, and 2 q-127 denotes the biased exponent. The normalised mantissa component is calculated from the mantissa field represented physically as shown in Figure 2.
The reciprocal square root of f can be written as Let zq0 represent the mantissa field of Equation 6 1 0 When q is even, q0 = 0. We have When q is odd, q0 = 1, giving us Depending on whether the physical mantissa M is large or small, the estimate ŷ can be found [6]. Equation10 estimates the mantissa field of ( ) y x . Because this approximation works for all valid exponents, ( ) y x can be considered the general approximation for the reciprocal square root 1/ x without needing different approximations for the mantissa and the exponent fields of a float. , The Newton-Raphson step of Equation 3 can be applied to ˆ( ) The estimate ŷ is then minimised to give the least possible error relative to the actual reciprocal square root function1/ x . The maximum relative error for ŷ can be expressed as where x = f(t).
Robertson and Lomont [6,7] notes that the optimal t value corresponds to the floating-point constant value of 0x5f375a86. This "magic" constant, and earlier approximations of this constant, is being used frequently in many software algorithms that require fast calculations of the reciprocal square root, such as computer graphics. Our proposed hardware implementation also makes use of this constant.

Hardware Architectures of the Fast Reciprocal Square Root
Low-cost Output-Registered Implementation of the FRSR Figure 3 shows the proposed hardware architecture of our reciprocal square root algorithm. The "magic" constant is implemented as pull-ups and pull-downs in hardware. The input data n is bit shifted by one bit to the right, before being subtracted from the "magic" constant. The bit-shift operation in hardware just requires rewiring the logic to access the appropriate bits. Accessing a register set starting from the first index instead of the zeroth index essentially gives a shiftright-by-one result. The output of the subtractor provides a very good initial guess of the reciprocal square root of n.
If the Newton-Raphson iterations are enabled, the result from the subtractor is multiplied with itself to produce the square of itself, with the input n, and with a floating-point constant 0.5f. The multiplication is performed with floating-point inputs. Depending upon the speed of the target hardware, the multipliers may or may not be pipelined.  The result from this multiplication is then fed to another subtractor, where this result is subtracted from another floating-point constant 1.5f. The output of this subtractor is then multiplied with the result from the first subtractor to produce the reciprocal square root estimate. This result may be looped back to become the new value of n if more iterations are required. Finally, the reciprocal square root result can also be multiplied by the input n to produce the square root estimate. This final result can be clocked by a flip-flop before being read by other logic.

Pipelined implementation of the FRSR
The same circuit of Figure 3 can also be pipelined if some of the computational elements are taking a long time to complete. An example implementation is shown in Figure 4. In our case, we were using a low-cost Artix 7 FPGA which requires the use of Xilinx's floatingpoint multiplier core. The simulation of the multi-stage pipelined implementation is shown in Figure 5.

Results
The latency of our simple low-cost output-registered FRSR is only three clock cycles, mainly contributed by the floating-point subtraction core. Running on a 50-MHz clock, we managed to achieve a timing margin of 1.65ns. Our implementation was designed with the VHDL language, and simulated with Mentor Graphics Modelsim. We have implemented the design on a low-cost Xilinx Artix 7 XC7A200TFBG676 FPGA device, using Xilinx Vivado's default synthesis and place-androute settings. The software version used was Vivado 2017.4.
In the case where the input n is a 32-bit unsigned integer, as is the case for our image processing application, the Newton-Raphson iterations can be omitted while still maintaining an accuracy of 0.3. This is especially useful in applications where accuracy is not as important as speed, and a rough estimate of the reciprocal square root is sufficient. If more precision is required, the final output can still be fed back to the input n to go through a couple more iterative cycles. An accuracy of 1.6% (0.016) can be achieved for unsigned integer inputs after one Newton-Raphson iteration, while 0.01% (0.0001) accuracy can be achieved after two iterations. The design can also accept floating-point input, where the accuracy would follow the ones reported by Lomont and Robertson [6,7]. Figure 5 shows the simulations of our FRSR implementations. For basic verification on hardware, we used Xilinx's ChipScope integrated logic analyzer to acquire real-time waveforms from our development board. Table 1 shows the resource utilisation of our implementation compared against previous implementations. In our paper, we have implemented our floating-point algorithm using the 32-bit IEEE singleprecision format 32(8,23). Wang [1] noted in her 2007 paper that her 32-bit floating-point implementation takes up 351 slices (LUTs) (1%), 3 block RAMs (2%), and 9 embedded multipliers (6%), on a Xilinx Virtex II XC2V6000 FPGA. Now, with modern FPGAs packing lots of resources, our implementation takes up only 635 LUTs which translates to a mere 0.47% of a low-cost Artix 7 XC7A200TFBG676 FPGA. Our implementation does not use any block RAMs since there is no need for a look-up table in our square root implementation. We use 12 DSP multipliers, which on a modern low-cost FPGA such as the Artix 7, translates to only 1.62% of the entire DSP resources within the FPGA.

Functional simulations and hardware synthesis
Our selection of DSPs and LUTs when generating the floating-point cores enable us to achieve very low latency in our design. The only delay was contributed by the three-cycle latency from the floatingpoint subtraction. Table 2 show the latency of the design, compared against other implementations or algorithms.

Discussion
Many digital signal processing (DSP) and digital imaging hardware avoid implementing the square root function due to its perceived complexity [9][10][11][12][13][14]. As the demand for faster processing increases, latency is also becoming an issue that increasingly needs to be resolved. A low-latency reciprocal square root implementation on hardware could help many computationally-intensive DSP hardware main-tain its precision while also complete computations in the fewest clock cycles possible. We have presented a feasible and inexpensive square root and reciprocal square root circuit suitable for implementation in low-latency FPGA or ASIC hardware.
The FRSR algorithm has been traditionally used in software requiring high-speed computations of the square root, such as graphicsintensive video games. Li and Chu [4] have shown two hardware implementations of their floating-point non-restoring square root algorithms, whose performance results have been highly cited. The first low-cost iterative version, requires 25 clock cycles (latency) for the result to be computed. The second pipelined version, requires 15 clock cycles for the result to be computed. For the pipelined version, it is mentioned the issue rate is 1 clock cycle, which means new data can be issued as   well as new results are available every clock cycle, although there is still a 15-cycle delay between an input data and its corresponding result. Zafar and Adapa [8] presented an implementation of their reciprocal square root algorithm based on the fast inverse square root algorithm as well, however, their implementation requires 12 clock cycles to complete [15][16][17].
As we apply this algorithm on FPGA hardware, we had to ensure that the floating-point FPGA cores were inferred correctly by the tool. This was done by characterising the functionality of the floating-point cores in the laboratory. We observed incorrect physical behaviour of the floating-point basic calculations when we synthesised our design by allowing the synthesis tool to automatically infer the floating-point cores from the standard VHDL-2008 floating-point library. After directly instantiating the Xilinx floating-point cores, correct behaviour was observed in the lab. The results presented are using the directlyinstantiated cores which exhibit the intended behaviour [18][19][20].
Although the VHDL language provides very useful fixed-and floating-point packages suitable for DSP-heavy applications, special attention needs to be made to ensure that these packages are supported by tool vendors. A successful synthesis that produces no errors cannot be assumed to exhibit the correct behaviour, and careful lab characterisation is necessary to ensure whether or not the tools are synthesising these packages correctly [21].

Conclusion
The fast reciprocal square root algorithm, popularly used in many software applications, has been shown to be a good alternative for lowlatency hardware. While there is a slight increase in logic resources and area, this increase is negligible as the density of modern FPGAs keeps the utilisation percentage for LUTs below 1% for a low-cost Artix device. Although Zafar and Adapa is using a similar algorithm as that shown in our paper, the solution presented was shown to be more expensive and complex with the usage of 1939 LUTs. We have  Output-registered FRSR 3 *Only data using 32-bit fixed and single-precision 32(8,23) floating-point formats are shown. shown that it is feasible and cost-effective to implement the same FRSR algorithm on hardware with much less logic resources (635 LUTs), and also with much lower latency (3 clock cycles).