FPGA Implementations of Feed Forward Neural Network by using Floating Point Hardware Accelerators

This paper analyzes different solutions for implementing a Neural Network architecture on an FPGA design by using floating point accelerators. In particular, two different implementations are investigated: a high level solution that creates a neural network on a soft processor design, with different strategies for enhancing the performance of the process; and a low level solution, achieved by a cascade of floating point arithmetic elements. Comparisons of the achieved performance, in terms of both execution time and FPGA resources employed by the architectures, are presented.


Introduction
Field Programmable Gate Array (FPGA) designs are very common in the field of computational electronics [1], [2], [3]. Digital Signal Processing (DSP) models, often analyzed in high level environments, show heavy restraints on performance once implemented on embedded systems, whose bottleneck is, despite the ongoing advances in Floating Point Unit (FPU) development, the low number of floating point operations per second (FLOPS) [4]. Compared to a microcontroller implementation (based on the sequential execution of instructions by the CPU), the nature of an FPGA design exploits the concepts of customization and parallelization to enhance the throughput of a computational system [5]. Customization allows the designer to create, through a Hardware Description Language (HDL), the internal architecture of the system down to Register Transfer Level (RTL), in effect defining a flexible Application Specific Integrated Circuit (ASIC). Parallelization spreads modular and sequential algorithms on a parallel interface, improving the throughput of complex algorithms by a multiplicative factor [6].
Neural Networks in embedded systems are frequently implemented on microcontroller units [7], [8]. A neural network implementation on a microcontroller, even when built with simple integer arithmetic, lacks the performance enhancement of a parallel design [9]. The choice of implementing a neural network architecture on FPGA benefits from customization and parallelization in different ways.
Very large Feed Forward Neural Networks (FFNN), especially if designed to work with floating point (FP) precision, perform a large number of elementary products and sums. Moreover, for each neuron of the FFNN within the hidden layers, a non-linear function computation is required to determine the activation value of the neuron. Without dedicated FP hardware, such computations can hinder the whole performance of the system, making the design difficult to use in critical applications like real-time control systems [10].
The concept of parallelization is implicit in the high performance of the solutions explained above: an RTL-defined LUT can compute an arbitrarily complex operation in a few clock cycles, assuming the memory of the system can contain the values. The same can be said for the arithmetic units, which can exploit powerful pipelines to speed up the calculation. The number of interconnections between the neurons, however, grows exponentially with the size (in terms of inputs and outputs) of the network. It is possible to reduce the complexity of the FFNN by splitting a Multiple Input Multiple Output (MIMO) FFNN into smaller and simpler Single Input Single Output (SISO) FFNNs that can be easily processed in parallel by means of multivariate function decomposition [22], [23].

The Feed Forward Neural Network
The Feed Forward Neural Network implemented in this paper is a SISO Feed Forward Neural Network, composed of a single hidden layer of 10 neurons with a nonlinear activation function, Logsig (Eq. 1) and/or Tansig (Eq. 2):

logsig(x) = 1 / (1 + e^(-x))    (1)

tansig(x) = 2 / (1 + e^(-2x)) - 1    (2)

This architecture was chosen for the easiness of the training process and the modularity of the structure: indeed, it is possible to face MIMO problems by using SISO FFNNs as described in [22]. The FFNN was created and trained in the Matlab environment. The normalization of the inputs and outputs was disabled and the activation function of the output layer was a pure linear function.

Implementation on Nios II/f Soft Processor
The first solution attempted to implement the network on FPGA makes use of the soft core processor Nios II/f, released by Altera as an encrypted core. This core can be synthesized with as few as 1600 logic elements (LE) and supports a maximum frequency of 140 MHz [24], [25].
After synthesis and programming of the FPGA device, the soft core itself can be programmed and debugged in C using a JTAG tool chain running inside an Eclipse environment. This soft core processor supports hardware integer multiplication and division, and up to 255 custom instructions definable by the designer. These custom instructions can be defined at RTL level using Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL) or Verilog code, and are synthesized as parallel blocks of the internal Nios II Arithmetic Logic Unit (ALU), as shown in Fig. 1. When a custom instruction is called from the instruction memory of the Nios II, the operands are transferred to the custom logic and, according to the type of custom instruction (combinatorial or sequential), the result is collected after a definite number of clock cycles [26].

Overall System Description
The design proposed in this section is based on the Nios II/f core, modified to include a Floating Point ALU and two custom instructions. The system works with a 100 MHz clock, which is replicated by means of a PLL with a phase shift of −3 ns to control an external 8 Mb SDRAM [27]. As shown in Fig. 2, the processor was equipped with a standard JTAG interface for programming and a Performance Counter to determine the execution time of the implemented code. The Floating Point ALU was the standard block from the library released by Altera as a part of the Quartus II environment. Two Activation Function LUTs were created in VHDL (one for the Tansig and one for the Logsig) and imported into the design as user-made custom instructions.

Use of LUTs for Computing the Activation Function
The main performance bottleneck for neural networks using floating point arithmetic lies in the activation function computation for the hidden layer. Computing this function using "full precision" software functions is often too slow for time critical applications [28]. Instead of calculating the activation function, an alternative solution is to sample it, loading the obtained values into a LUT [18], [19], [20], [21]. In the present paper, the function was not sampled with a uniform and constant spacing between the sampling points. This is because the activation function assumes almost constant values near the saturation points, making it wasteful to choose a fine sampling in their proximity. On the other hand, near the origin the slope of the function is very high, and a finer sampling may help in reducing the quantization error. In [19] only two kinds of spacing are used: a fine one near the origin, and a wide one near the saturation branches. In this work a different approach is proposed: the distance between a sample and the following one is inversely proportional to the slope of the function at the sampled point.
This yields a finer sampling near the origin, gradually getting wider near the saturation points. The Logsig function was sampled with 256 values between −16 and +16, while the Tansig, being an odd function, was sampled for positive arguments only, with 256 values between 0.2 and the saturation point. Using these values, a VHDL combinatorial code was written and simulated in the Altera ModelSim environment for RTL analysis.
The implemented block has a single floating point input, which is split into sign, exponent and mantissa. Through the use of a suitable IF-THEN-ELSE chain, the input value addresses a specific entry in the LUT, which is propagated as output. If the input value magnitude is bigger than the saturation values, a suitable constant value is propagated as output. Since the Tansig, near the origin, can be approximated by the bisector of the first quadrant, values smaller than 0.2 are directly propagated to the output (thus approximating the function linearly). The synthesis result of this IF-THEN-ELSE structure is a very long chain of comparators. Propagation of the signal through this chain can be long, so a tunable delay of 4 clock cycles was introduced to ensure result stability (the delay is controlled by a simple counter that can be modified to suit the size of the LUT).
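In software terms, the synthesized comparator chain behaves like the sketch below: each comparison against a breakpoint selects one LUT entry, and inputs past the last breakpoint return the saturation constant. Names (`lut_activation`, `bp`) are hypothetical.

```c
/* Comparator-chain lookup mirroring the IF-THEN-ELSE structure:
 * bp[]  - the n (non-uniformly spaced) breakpoint abscissae, ascending;
 * lut[] - the n stored activation values;
 * sat   - the constant propagated beyond the saturation point. */
float lut_activation(float x, const float bp[], const float lut[],
                     int n, float sat)
{
    if (x >= bp[n - 1])
        return sat;                 /* magnitude beyond saturation */
    for (int i = 0; i < n; ++i)     /* chain of comparators */
        if (x < bp[i])
            return lut[i];
    return sat;
}
```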

Polynomial Fitting
The basic operations of floating point math are greatly accelerated by the presence of a Floating Point ALU (about 10 times faster [29]). Thus, besides speeding up the Multiply-Accumulate part of the FFNN, this hardware module can be used to compute a polynomial approximation of the activation function. A group of second-degree polynomials was chosen to fit the activation functions. The coefficients of the polynomials were determined in the Matlab environment through the use of the Curve Fitting Tool. Both functions were fitted only for positive arguments.
For the Logsig polynomial fitting, a function (denoted as 5PY-L), composed of the superposition of 5 second-degree polynomials, has been implemented. Even if the Logsig function is not odd, a partial symmetry is present. This was exploited for its negative arguments: first, the value of the function is calculated considering the absolute value of the input; then, if the input is negative, the calculated value is subtracted from 1. For the Tansig polynomial fitting, two functions, composed of 4 and 5 second-degree polynomials, have been implemented, respectively denoted as 4PY-T and 5PY-T. This time, since the Tansig is an odd function, the argument is considered in absolute value, and the sign is directly propagated to the output.
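The symmetry tricks above can be sketched as follows. The segment coefficients are placeholders, not the paper's fitted values; for negative inputs the Logsig uses logsig(−x) = 1 − logsig(x), while the odd Tansig simply propagates the sign.

```c
#include <math.h>

/* One second-degree segment a*x^2 + b*x + c, valid for x in [lo, hi). */
typedef struct { float lo, hi, a, b, c; } seg_t;

/* Evaluate the piecewise fit for a non-negative argument; past the last
 * segment the function is considered saturated at 1. */
static float eval_segs(float x, const seg_t *s, int n)
{
    for (int i = 0; i < n; ++i)
        if (x >= s[i].lo && x < s[i].hi)
            return s[i].a * x * x + s[i].b * x + s[i].c;
    return 1.0f;
}

/* Logsig fitted for positive arguments only; partial symmetry
 * logsig(-x) = 1 - logsig(x) covers negative inputs. */
float logsig_poly(float x, const seg_t *s, int n)
{
    float v = eval_segs(fabsf(x), s, n);
    return (x < 0.0f) ? 1.0f - v : v;
}

/* Tansig is odd: the sign of the input is propagated to the output. */
float tansig_poly(float x, const seg_t *s, int n)
{
    float v = eval_segs(fabsf(x), s, n);
    return (x < 0.0f) ? -v : v;
}
```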

Test Results and Considerations
The design was used to simulate a FFNN trained on the function y = x², and was tested on a vector of 2048 linearly spaced inputs between −5 and +5. The results are shown in Tab.

NN Core Implementation
In the following part of this paper a solution based on a low level architecture is presented. The proposed design was used for the implementation of the same FFNN previously described.

Overall System Description
The proposed design is an arithmetic core (see Fig. 3) composed of high performance floating point arithmetic blocks developed by Altera, whose data flow is controlled by a Finite State Machine (FSM) written in VHDL. The arithmetic core is composed of 3 blocks: a multiplier-accumulator (MAC), an activation function block, and a feedback RAM. These three blocks constitute a suitable base to build a Neural Network [30]. The first block computes, for each neuron, the weighted sum of the inputs.
The second block takes the results of the first block as inputs, and computes the activation values for the hidden layer. The third block, receiving the output from the activation function block, stores the values from the hidden layer. These values are then sent through a MUX back into the MAC block for the output layer computation. Both input and output data of the FFNN are stored in RAM blocks that are accessible through the JTAG interface using the Quartus II software. The whole core and the data banks are controlled by a free running 2-process Finite State Machine, the "Time Machine", using data flow control signals and address registers. The internal data flow of the core is regulated by a number of 32-bit wide MUXes and D-type flip-flops (DFFs). The design was implemented on an EP2C20F484C7 Cyclone II FPGA mounted on a DE2 Development Board. After synthesis and fitting, the full design occupied about 5000 logic elements (LE) and all the 52 hardware multipliers present on the FPGA.

Data Flow of the Arithmetic Core
The computation of the arithmetic core begins by loading the first sample from the Input Data Bank into the MAC block. The core holds in its internal memory the weights and biases of the FFNN. This memory is addressed directly by the Time Machine control block. While the MAC is computing the hidden layer, each neuron has a bias value that must be added to the weighted input. This bias value is preloaded into the 32-bit DFF accumulator through the Bias MUX. Inputs and weights are multiplied and the results are added to the preloaded bias (see Fig. 4). Since the hidden layer has only one input, the MAC is then complete for the first neuron, and the result is propagated to the next block, where the activation function is computed. In this block, a logical NOT is applied to the MSB of the input, changing its sign. The result is sent to an exponential arithmetic block whose output is connected to an adder that sums the result with the constant value of 1.
The result is then inverted, and the activation value of the first neuron is finally written to the Feedback RAM. This operation is repeated for the 10 neurons, filling the RAM with the activation values of the hidden layer. Then the Time Machine switches the Layer Select MUX so that the MAC block is connected to the Feedback RAM. The bias of the output neuron is preloaded into the accumulator, and the MAC computes the weighted sum of all the activation values from the hidden layer. This is the output result of the network, and it is saved in the Output Data Bank.

Time Machine FSM
Data processing from input to output needs to be managed by some sort of control block, responsible for synchronizing the data flow and, where needed, performing memory addressing. In a traditional programming language, like C, a popular approach to create such a controller is to use a finite state machine (FSM). In its simplest form, a FSM is a set of code blocks, each identifying a particular function (e.g. "load data from RAM", "sum input A and input B", "transpose array C"), inside a switch/case structure. If the FSM is the sole controller of the system, the switch/case structure is confined in an endless loop. The variable controlling the switch is updated at the end of each code block, ensuring that every time the switch/case is evaluated the FSM will execute a specific code block (i.e. it will be in a known and definite state). This rather simple approach is not as straightforward in HDL languages, since the code is not executed by a processor and is thus not inherently sequential.
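A minimal C illustration of the switch/case idiom described above (state names and the per-neuron cycle are illustrative; a real controller would loop endlessly rather than terminate):

```c
/* Switch/case FSM: each case is one state; the state variable is
 * updated at the end of each block, so every evaluation of the switch
 * lands in a known and definite state. Returns the number of state
 * transitions executed, for inspection. */
enum state { LOAD, MAC, ACTIVATE, STORE, DONE };

int run_fsm(int n_neurons)
{
    enum state s = LOAD;
    int neuron = 0, steps = 0;
    while (s != DONE) {                      /* endless loop in a real controller */
        switch (s) {
        case LOAD:     s = MAC;      break;  /* "load data from RAM" */
        case MAC:      s = ACTIVATE; break;  /* weighted sum + bias */
        case ACTIVATE: s = STORE;    break;  /* activation function */
        case STORE:                          /* write feedback RAM */
            s = (++neuron < n_neurons) ? MAC : DONE;
            break;
        default:       s = DONE;     break;
        }
        ++steps;
    }
    return steps;
}
```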
Hardware emulating the processor's sequential behaviour must be created. A possible approach, proposed in [31], is to create an instruction counter whose value is increased at every clock edge. By using a net of comparators, when a particular value is assumed by the instruction counter, specific logic functions (states) are executed. Creating the FSM in this way grants an important advantage: since the instruction counter is updated on the clock edge, the FSM can work synchronously with the other elements in the design. This is very important when some blocks in the design have definite input-output delays, since the FSM can be programmed to remain in a "wait" state until the output is ready to be propagated to the next block. In VHDL this architecture can be defined by the use of two code blocks (processes), one sequential and one combinatorial.
The first one is responsible for increasing the instruction counter at every clock edge, and is synthesized as a counter register. The second one is responsible for decoding the instruction counter into actual logic signals, and is synthesized as a network of comparators. The cycle of operations performed by the FSM is finite: once the last operation is performed (i.e. the last output value has been loaded into the Output Data Bank), the FSM resets and starts over. With a 50 MHz clock, the computation of a single sample takes about 150 µs.

Solutions Comparison
In the Tab.

Conclusions and Future Works
Two possible designs to implement a neural network in an FPGA environment were presented. The first design, taking advantage of the Nios II soft processor, used hardware accelerators to speed up both the computation of the elementary products of the neurons and the computation of the nonlinear activation functions for the hidden layer. By exploiting the soft processor hardware acceleration for floating point operations, an alternative polynomial approximation of the activation functions was implemented and tested for performance.
The second design proposed is composed of a chain of arithmetic units timed and coordinated by a VHDL state machine, which implemented a full precision floating point computation at a fraction of the execution time. The results acquired from this work can lead to a new form of neural network implementation on FPGA. The low level arithmetic chain implemented in the NN Core design could be split and included inside two custom instructions of a soft processor, hence combining the speed of the low level design with the flexibility of a C-programmable environment. This could benefit the design by allowing the inclusion of standard interfaces (like JTAG or I2C), useful for many applications (see for example [32], [33]), while retaining RTL-wise control of the data flow.
In the hypothesis of using the network as a form of DSP for smart sensors or control systems, the floating point precision could be traded for a faster and smaller fixed-point or integer based system [34], [35]. Moreover, an improvement of the whole system can always be obtained if more complex and robust optimization algorithms [36], [37] are used to reduce the size of the implemented Neural Networks.

Fig. 1: Implementation of custom logic in the Nios II ALU.
In Tab. 4 and Tab. 5 the resources, in terms of dedicated Combinatorial and Register logic (LC Comb. and LC Reg.), are shown. The high level solution is expensive in terms of resource usage, peaking at 15 098 logic elements (LE) if both LUTs are implemented.
Tab. 5: Best performance comparison.