Neurocomputing, Volume 456, 7 October 2021, Pages 421–435

A generic FPGA-based hardware architecture for recursive least mean p-power extreme learning machine

https://doi.org/10.1016/j.neucom.2021.05.069

Abstract

Recursive least mean p-power extreme learning machine (RLMP-ELM) is a recently proposed online machine learning algorithm that provides robust online prediction for datasets contaminated by noises of different statistics. To enable the use of RLMP-ELM in real-world embedded systems, a generic serial FPGA-based hardware architecture of RLMP-ELM is presented in this paper. The entire hardware architecture comprises three serial processing modules, which are implemented parameterizably and can be adapted to different application requirements. The hardware framework is organized serially, but parallelization efforts are focused on the processes with high computational complexity, identified by analysing potential inter-task dependencies. To overcome the memory bandwidth limitation, block RAMs and a ping-pong on-chip buffer are employed to improve computational throughput. Validation experiments are performed on five datasets with different p values. The accuracy results show that our FPGA implementation achieves accuracy similar to a 64-bit floating-point software implementation. We also report the hardware performance of the proposed architecture and compare it with existing implementations. The results show that our hardware architecture offers an excellent balance among accuracy, logic occupation and hardware performance.

Introduction

Neural networks have the capability to approximate complex nonlinear mappings directly from input–output data and have been widely applied in a variety of fields. One of the active research areas related to neural networks is extreme learning machines (ELMs), which have been investigated thoroughly due to their universal approximation capability and computational efficiency. The original ELM [1] requires all training samples to be available for the training phase and is therefore a batch learning algorithm. To deal with online learning problems, where the complete set of data is usually not available at once but is presented sequentially, the online sequential ELM (OS-ELM) [2] and its various improvements [3], [4] have been developed.
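For reference, the standard single-hidden-layer ELM of [1] can be summarized as follows (our restatement in common notation): the hidden parameters (w_i, b_i) are assigned randomly and only the output weights β are learned by least squares.

```latex
% ELM with L hidden nodes and activation g; H is the N x L
% hidden-layer output matrix for N training samples, T the targets,
% and H^\dagger the Moore-Penrose pseudoinverse of H.
f(\mathbf{x}) = \sum_{i=1}^{L} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x} + b_i),
\qquad \mathbf{H}\boldsymbol{\beta} = \mathbf{T},
\qquad \hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}
```

OS-ELM replaces the batch pseudoinverse with a recursive least-squares update so that β can be refreshed as each new sample arrives.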

In practical applications, datasets are often contaminated by large stochastic noises with different statistical characteristics, such as uniform, Gaussian, impulsive, or mixed distributions. To strengthen robustness against such random noises, a recursive least mean p-power extreme learning machine (RLMP-ELM) [5] was recently proposed, exploiting the merits of the least mean p-power (LMP) criterion in handling random noises of lower-order or higher-order statistics under different p values. As a generalized variant of the ELM/OS-ELM algorithm, RLMP-ELM not only retains the advantage of high computational efficiency, but also yields better learning performance on problems with noises of different statistics. With this noise rejection capability, RLMP-ELM has great potential for real applications. However, the current work only provides a software implementation on a general-purpose processor. To enable its real-time applications, a hardware implementation that executes the algorithm in its entirety needs to be addressed.

The demand for implementing neural networks in hardware systems is increasing in a variety of applications where real-time computation is expected. This has motivated considerable research on deploying neural networks on different parallel computing platforms such as field-programmable gate arrays (FPGAs) [6], [7], [8], graphics processing units (GPUs) [9], [10], and application-specific integrated circuits (ASICs) [11]. Compared with GPU deployment, FPGAs and ASICs can achieve at least moderate performance with lower power consumption. ASICs, however, have a longer development cycle and limited flexibility. By comparison, FPGAs combine a parallel architecture, reprogrammability, and high energy efficiency, making them well suited for neural network deployment.

Currently, many FPGA-based accelerators have been developed for neural networks. The work in [12] presented a high-performance FPGA architecture consisting of multiple dedicated hardware processing cores for accelerating restricted Boltzmann machines (RBMs). In [13], the authors proposed efficient hardware architectures to accelerate deep convolutional neural network (CNN) models using the parallel fast finite impulse response algorithm (FFA). The authors in [14] designed a CNN architecture based on the residue number system (RNS) to reduce hardware cost. The work in [15] presented a binarized neural network for FPGAs, which drastically prunes unused hidden nodes to reduce hardware resource consumption. Besides FPGA-based implementations of deep neural networks, there are also many accelerators for shallow neural networks. The work in [16] designed two different architectures that support low-power prediction with random basis neural networks. In [17], the authors designed a low-cost, high-speed FPGA-based implementation of a spiking neural network. The work in [18] presented an FPGA-based accelerator that focuses on the implementation of the prediction phase. These studies achieve excellent hardware performance for different neural network models, but they target specific neural network frameworks. Once the application scenario changes, the framework has to be changed accordingly, which greatly limits the portability of these hardware architectures.

Considering computational efficiency and resource utilization, most existing FPGA-based implementations tend to exploit memory bandwidth and computing parallelism. To overcome the memory bandwidth limitation, the accelerators in [19], [20] focused on neural network implementations that store the parameters in on-chip memory. The work in [21] cached the inputs and weights of the neural network in on-chip memory to reduce the external memory bandwidth requirement, and different optimization techniques were used to efficiently reuse the fetched data. In [22], the authors stored intermediate variables in on-chip memory and proposed a novel data reuse and storage scheme that avoids transferring intermediate data between internal and external memory. However, as the number of neural network parameters grows, the on-chip memory solution becomes limited. Other works store massive parameters in external memory. In [23], the authors used external Double Data Rate (DDR) memory to store all of the weights. The works in [24], [25] also used on-board DDR3 memory to provide data access for the FPGA. Data access through external memories avoids the limitation of on-chip memory, but it leads to low computational throughput and hinders acceleration performance; thus, efficient on-chip memory reuse is needed. Regarding computing parallelism for different neural networks, in [26] an efficient computing engine was optimized for the high parallelism of CNN computation. The work in [27] designed a low-power and compact hardware implementation of a large parallel array of random feature extractors (RFEs), where a parallel strategy is applied in the hardware architecture. In [28], the authors proposed three hardware architectures for the original ELM, one sequential and two parallel. Similarly, the work in [29] presented a parallel hardware architecture for OS-ELM by considering its potential inter-task dependencies. Consequently, a reasonable division of computation parallelism is also needed to improve computational efficiency.

To sum up, these studies implement different neural network algorithms efficiently, but how to provide a generic, flexible hardware architecture that adapts to different application requirements has not been properly solved. Several studies address the memory bandwidth limitation by utilizing on-chip or external memory, but both approaches have drawbacks, and reasonable memory usage is still a work in progress. In addition, different parallel strategies should be defined in advance according to the characteristics of the neural network algorithm to achieve efficient acceleration.

For the FPGA-based hardware implementation of RLMP-ELM, there are three challenges to real-time, high-performance deployment: 1) the high computational complexity of the initialization phase and the online learning phase greatly slows down the training process; 2) the multiple intermediate results and the large number of parameters in the training process require large storage space, particularly in the online learning phase; 3) the noise rejection capability for different statistical characteristics in different application scenarios requires the hardware implementation of RLMP-ELM to be parameterizable and reconfigurable through simple modification of implementation parameters. These factors hinder the real-time computational performance and the widespread deployment of RLMP-ELM on embedded devices.

To tackle these problems, we present a generic fully parameterizable hardware architecture of RLMP-ELM to implement on-chip learning and prediction. The main contributions of this work are:

  • A generic reconfigurable hardware architecture of RLMP-ELM is proposed, so that the hardware can be adapted to different application scenarios by modifying configurable parameters. The entire hardware architecture contains a series of scalable computational blocks, including matrix multiplication (MM), matrix inversion (MI) and matrix adder/subtractor (MA/MS), which can be reused and migrated to other network topologies such as ELM and OS-ELM.

  • The proposed hardware architecture improves the computational efficiency of RLMP-ELM by dividing the necessary operations into subtasks to enable multiple parallel operations. In detail, two task-parallel processes, in the initialization phase and the sequential learning phase, are implemented by the proposed architecture through a careful analysis of the potential inter-task dependencies of RLMP-ELM.

  • An on-chip storage reuse scheme based on different types of block RAM and a ping-pong on-chip buffer is used in the RLMP-ELM architecture, which avoids transferring intermediate data between internal and external memory. The block RAMs and ping-pong on-chip buffers are reused iteratively once the stored data are no longer needed; a minimal sketch of the ping-pong idea is given after this list.
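To illustrate the ping-pong buffering principle mentioned above, the following is a minimal behavioural sketch in Python (the class and buffer depth are ours for illustration, not the paper's register-transfer-level design): one bank is filled by the producer while the consumer reads the other, so data transfer and computation overlap instead of serializing.

```python
import numpy as np

class PingPongBuffer:
    """Behavioural model of a ping-pong (double) buffer: the producer
    fills one bank while the consumer reads the other, then the two
    banks swap roles."""

    def __init__(self, depth):
        self.banks = [np.zeros(depth), np.zeros(depth)]
        self.write_sel = 0  # index of the bank currently being written

    def write(self, data):
        self.banks[self.write_sel][:len(data)] = data

    def swap(self):
        self.write_sel ^= 1  # flip the roles of the two banks

    def read(self):
        return self.banks[self.write_sel ^ 1]  # the bank not being written

# Usage sketch: stream data chunks; each freshly filled bank becomes
# readable right after the swap, while the other bank is refilled.
buf = PingPongBuffer(depth=8)
for chunk in (np.arange(8.0), 2 * np.arange(8.0)):
    buf.write(chunk)  # fill the write bank
    buf.swap()        # make it the read bank
    print(buf.read())
```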

The remainder of this paper is organized as follows. Section 2 provides background on the RLMP-ELM algorithm and discusses its key steps. Section 3 describes the proposed hardware architecture of RLMP-ELM. Analysis and discussion of the results are presented in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Brief Review of RLMP-ELM

The recursive least mean p-power extreme learning machine (RLMP-ELM) is an extension of the recursive least squares (RLS) algorithm with the least mean p-power (LMP) criterion as its cost function. In contrast, the original ELM/OS-ELM adopts the mean square error (MSE) criterion as the cost function. Note that the LMP criterion includes the MSE criterion as a special case when p = 2. Many studies have pointed out that the LMP criterion has some useful properties [30], [31], such that it may
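For concreteness, the LMP criterion for the prediction error e can be stated as follows (our restatement of the standard definition from the LMP literature, not a formula quoted from this snippet):

```latex
% LMP cost for the prediction error e = t - h(x) * beta; the MSE
% criterion is recovered as the special case p = 2.
J_{\mathrm{LMP}}(\boldsymbol{\beta}) = \mathbb{E}\!\left[\,|e|^{p}\,\right],
\qquad p > 0
```

Intuitively, choosing p < 2 de-emphasizes large (e.g., impulsive) errors, while p > 2 emphasizes them, which is why tuning p lets RLMP-ELM match noises of different lower-order or higher-order statistics.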

System Architecture

The proposed generic hardware architecture of RLMP-ELM is illustrated in Fig. 1. The hardware structure is based on five main blocks: the initialization module, the sequential learning module, the prediction module, the overall phase controller and the on-chip RAM memories. The system architecture is organized in a serial manner: first, the initialization module is used to obtain the initial weights β0 and the inverse matrix R0^-1. Next, the sequential learning module updates the weights βk for newly arriving data.
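A high-level behavioural sketch of this serial flow is given below in NumPy (single-output case). The initialization follows the usual OS-ELM least-squares form, and the sequential step is our reconstruction of an RLS-style recursion with an LMP error weight |e|^(p-2); the exact recursion of RLMP-ELM is given in [5], and this sketch is illustrative, not the paper's fixed-point Verilog design.

```python
import numpy as np

def hidden_layer(X, W, b):
    # Random-feature hidden layer with sigmoid activation (ELM style).
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def initialize(H0, t0):
    # Initialization module: batch least squares on the first chunk,
    # as in OS-ELM; returns R0^-1 and beta0.
    R_inv = np.linalg.inv(H0.T @ H0)
    beta = R_inv @ H0.T @ t0
    return R_inv, beta

def sequential_update(R_inv, beta, h, t, p, eps=1e-8):
    # Sequential learning module: weighted-RLS step where the gain is
    # scaled by the LMP weight |e|^(p-2) (p = 2 reduces to plain RLS).
    # This update is our reconstruction; see [5] for the exact form.
    h = h.reshape(1, -1)
    e = float(t - h @ beta)                 # prediction error
    w = (abs(e) + eps) ** (p - 2.0)         # LMP weighting
    Rh = R_inv @ h.T                        # column vector
    gain = Rh / (1.0 / w + float(h @ Rh))   # Sherman-Morrison gain
    R_inv = R_inv - gain @ (h @ R_inv)      # rank-1 inverse update
    beta = beta + gain.ravel() * e          # weight update
    return R_inv, beta

# Usage sketch: initialize on a chunk, then learn sample by sample.
rng = np.random.default_rng(0)
n0, L, d, p = 40, 20, 3, 1.5                # p < 2: robust to outliers
W, b = rng.normal(size=(d, L)), rng.normal(size=L)
X0, t0 = rng.normal(size=(n0, d)), rng.normal(size=n0)
R_inv, beta = initialize(hidden_layer(X0, W, b), t0)
for x, t in zip(rng.normal(size=(10, d)), rng.normal(size=10)):
    R_inv, beta = sequential_update(R_inv, beta, hidden_layer(x, W, b), t, p)
```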

Performance Analysis and Evaluation

In this section, the performance of the proposed RLMP-ELM hardware architecture is evaluated. The architecture is coded in Verilog. The Xilinx Vivado Design Suite 2017.3 is used as the integrated development platform for simulation, synthesis, implementation, and reporting of resource and power consumption. The device used in this work is a Xilinx Kintex UltraScale XCKU115-flvb2104-2-e, and the analysis and results are based on a fixed-point FPGA implementation. We evaluate the proposed
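The paper compares the fixed-point FPGA results against a 64-bit floating-point software reference. To illustrate how such a comparison can be prototyped in software, here is a generic quantization sketch in Python (the Q8.16 word length below is a placeholder of ours, not the paper's chosen format):

```python
import numpy as np

def to_fixed(x, int_bits=8, frac_bits=16):
    # Quantize to a signed fixed-point format Q(int_bits).(frac_bits)
    # with rounding and saturation, emulating FPGA arithmetic.
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

# Compare a fixed-point dot product against the float64 reference.
rng = np.random.default_rng(42)
x, w = rng.normal(size=256), rng.normal(size=256)
ref = x @ w                                  # 64-bit floating point
fxp = to_fixed(to_fixed(x) @ to_fixed(w))    # quantized operands/result
print("absolute error:", abs(ref - fxp))
```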

Conclusion

In this paper, we propose a generic serial hardware architecture for the RLMP-ELM algorithm based on FPGA. The hardware architecture of RLMP-ELM includes three serial processing modules, which are implemented parameterizably and can be adapted to different application requirements by modifying implementation parameters. The proposed hardware architecture efficiently utilizes task-parallel operations to reduce the computational cost by exploiting potential inter-task dependencies. The storage

CRediT authorship contribution statement

Hui Huang: Software, Validation, Formal analysis, Writing - original draft. Jing Yang: Conceptualization, Methodology. Hai-Jun Rong: Supervision, Project administration, Writing - review & editing. Shaoyi Du: Resources, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Hui Huang received the B.Eng. degree in engineering mechanics from the Xi’an University of Architecture and Technology, Xi’an, China, in 2016. He is currently working toward the Ph.D. degree in the School of Aerospace, Xi’an Jiaotong University, Xi’an, China.

His research interests include embedded systems development, FPGAs, neural networks, and pattern recognition.

References (36)

  • J. Yang et al., Random neural Q-learning for obstacle avoidance of a mobile robot in unknown environments, Adv. Mech. Eng. (2016).
  • Q. Xiao, Y. Liang, L. Lu, S. Yan, Y.-W. Tai, Exploring heterogeneous algorithms for accelerating deep convolutional...
  • S.I. Venieris, C.-S. Bouganis, FPGAConvNet: Automated mapping of convolutional neural networks on FPGAs, in: Proc. 2017...
  • Y. Xu, J. Jiang, J. Jiang, Z. Liu, J. Xu, Fixed-point evaluation of extreme learning machine for...
  • T. Chen et al., DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning, ACM SIGPLAN Notices (2014).
  • D.L. Ly, P. Chow, A high-performance FPGA architecture for restricted Boltzmann machines, in: Proc. ACM International...
  • J. Wang et al., Efficient hardware architectures for deep convolutional neural network, IEEE Trans. Circuits Syst. I Regul. Pap. (2018).
  • N.I. Chervyakov, P.A. Lyakhov, M.A. Deryabin, Residue number system-based solution for reducing the hardware cost of a...

Jing Yang received the B.Eng. and M.Eng. degrees in control science and engineering and the Ph.D. degree in pattern recognition and intelligent systems from Xi’an Jiaotong University, China, in 1999, 2004, and 2010, respectively. Since 1999, she has been a Lecturer with the Automation Department, Xi’an Jiaotong University. She is currently a member of the project team on intelligent vehicles with the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University. Her main research interests include autonomous vehicle control, intelligent control, robot motion planning, and neural networks.

Hai-Jun Rong (M’14) received the B.Eng. degree in precision instruments from Xi’an Technological University, Xi’an, China, in 2000, the M.Eng. degree in control theory and control engineering from Xi’an Jiaotong University, Xi’an, China, in 2003, and the Ph.D. degree in intelligent control from Nanyang Technological University, Singapore, in 2008.

From December 2006 to October 2008, she was a Research Associate and then a Research Fellow at Nanyang Technological University. Since then, she has been an Associate Professor in the School of Aerospace, Xi’an Jiaotong University. She is an Associate Editor of the Evolving Systems journal (Springer). Her research interests include neural networks, fuzzy systems, pattern recognition, and intelligent control.

Shaoyi Du received double bachelor’s degrees in computational mathematics and computer science, the M.S. degree in applied mathematics, and the Ph.D. degree in pattern recognition and intelligent systems from Xi’an Jiaotong University, Xi’an, China, in 2002, 2005, and 2009, respectively.
