Resource optimised reconfigurable modular parallel pipelined stochastic approximation-based self-tuning regulator architecture with reduced latency

Present self-tuning regulator architectures based on recursive least-squares estimation are computationally expensive and require a large amount of resources and time to generate the first control signal. The computational bottlenecks arise from the calculations in the estimation stage, the successive stages of matrix multiplication and the number of intermediate variables at each iteration. This precludes their use in applications that require fast response times and in those that run on embedded computing platforms with low-power or low-cost requirements and constrained resources. This study proposes a new modular, parallel, pipelined stochastic approximation-based self-tuning regulator architecture that reduces the time required to generate the first control signal, the resource usage and the number of intermediate variables. Fast matrix multiplication, pipelining and high-speed arithmetic function implementations were used to improve performance. Implementation results demonstrate that the proposed architecture improves control signal generation time by 38% and reduces resource usage by 41% in terms of multipliers and 44.4% in terms of adders compared with the best existing related work, opening up new possibilities for the application of online embedded self-tuning regulators.


Introduction
A self-tuning regulator (STR) is an important adaptive-control strategy for many applications. There have been several efforts to translate innovation in STR algorithms into practical implementations [1-6]. These were realised using microcomputers or microcontrollers and single or multiple digital signal processors (DSPs), and faced problems in terms of performance and accuracy of computation. Processing speed could be enhanced by using high-performance DSPs or multiprocessor schemes, but their cost is high and exceeds the benefits they bring [7]. General-purpose computer hardware handles different tasks with very different computing patterns and trades resource efficiency for multitasking; it offers a very limited variety of architectures. Designing a custom computing data path and a custom memory subsystem matched to that data path makes it possible to provide just the memory bandwidth needed to keep the arithmetic units busy all the time, to synchronise independent computations perfectly, to avoid or minimise contention on the computing and memory resources and to achieve high computational efficiency. Redundant speculative circuits, which burn power unnecessarily, can be avoided completely by developing computing hardware specifically for a task. A custom-optimised architecture will extend the applicability of adaptive control to systems with tight real-time requirements in terms of fast dynamic response. With the advent of the field programmable gate array (FPGA), designers can develop a fully reconfigurable hardware architecture dedicated to the control algorithm. In applications where the required real-time capabilities cannot be ensured by software solutions, reconfigurable computers are very useful [8][9][10][11][12][13]. The cost per implemented function of an FPGA is also very low, even though it is more expensive initially.
In addition, other main advantages of FPGA-based reconfigurable technology are its continually increasing density and the design flexibility of software, with timing performance close to that of application-specific integrated circuits.
To overcome the shortcomings of existing approaches to STR implementation, FPGA-based architectures such as the scalar-based direct algorithm mapping-STR (SBDAM-STR) and the multi-stage matrix multiplication-STR (MMM-STR) were developed [14][15][16][17]. The SBDAM-STR architecture uses a huge amount of resources, which precludes its use in applications that run on resource-constrained platforms. The MMM-STR architecture, which uses recursive least squares (RLS) identification in its estimation stage, takes a long time to generate the first control signal (79 clock cycles (ccs)), which precludes its use in applications that could benefit considerably from its advantages, especially those that require fast response times. Moreover, the approach involves many intermediate variables and five time-consuming matrix multiplications in every iteration. Owing to the advent of reconfigurable architectures, a stochastic approximation-based STR is a promising candidate. In this paper, a parallel, modular and pipelined stochastic approximation-based STR is proposed, which has lower latency in control signal generation than the best existing related work along with minimum resource usage. A distinguishing feature of the proposed architecture is the reduction in latency of control signal generation.
The rest of this paper is organised as follows. Section 2 discusses the design of a stochastic approximation-based STR. Section 3 presents the design flow of the stochastic approximation-based STR. Functional verification is presented in Section 4. Hardware implementation is discussed in Section 5. Results of implementation and discussion are given in Section 6. Section 7 summarises this work.

Stochastic approximation-based STR design
The schematic diagram of an FPGA-based stochastic STR is given in Fig. 1.

Stochastic approximation
An important part of the STR is the estimation block. To reduce the error in the estimation of parameters, stochastic approximation-based adaptation of the model parameters can be used [18,19]. The stochastic approximation is of the form given in (1), where $a_n = [a_0\ a_1 \ldots a_m\ b_1\ b_2 \ldots b_n]$, $n = 1, 2, \ldots$ and $k$ is a positive constant.
Let $L$ depend on the observation noise and be proportional to $k^2$, and let $v = \min(2w - 1, 2)$. For $r_n = \hat{a}_n - a_n$, the mean-square estimation error $\|r_n\|^2$ is bounded by a quantity $x_n$ for $p - n \to \infty$, with the variations of $a_n$ vanishing asymptotically at a rate $n^{-w}$, where $w > 1$, provided that the joint density functions of $\eta_n$ and $Y_n$ satisfy the condition in [19]. The estimator converges in the mean according to (2). In the equations describing the asymptotic behaviour, $\beta = \inf_n \beta_n$, where $\beta_n$ denotes the smallest eigenvalue of a matrix considered to be positive definite, $\lambda_1, \lambda_2, \ldots, \lambda_s$ are its eigenvalues and $Y_n$ is $s$-dimensional. From (8) and (9), $\lambda_1$ is chosen as $1/s$, a conservative estimate of $\beta$, assuming all eigenvalues to be equal. $k$ is chosen equal to the number of unknown parameters. The process starts at $n = N_0$, where $N_0$ is given by (7).
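A minimal sketch of one estimator iteration may help fix ideas. It assumes the update takes the normalised form $\hat{a}_{n+1} = \hat{a}_n - \frac{k}{n\|Y_n\|^2} Y_n(\hat{a}'_n Y_n - y_n)$; this form is an assumption, chosen to match the innovation and norm computed by n_module and nrm_module, and may differ from the exact expression in (1). The function names, the noise-free Gaussian test signal, the parameter values and the start index `n0` (standing in for $N_0$) are hypothetical.

```python
import math
import random


def sa_update(a_hat, Y, y, n, k):
    """One assumed stochastic-approximation step with gain k / (n * ||Y||^2)."""
    innovation = sum(ai * Yi for ai, Yi in zip(a_hat, Y)) - y  # (a_hat' Y_n - y_n)
    norm_sq = sum(Yi * Yi for Yi in Y)                         # ||Y_n||^2
    if norm_sq == 0.0:
        return list(a_hat)  # sample carries no information; keep the estimate
    gain = k / (n * norm_sq)
    return [ai - gain * Yi * innovation for ai, Yi in zip(a_hat, Y)]


def error_norm(a_hat, true_a):
    """Euclidean norm of the estimation error r_n = a_hat - a."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(a_hat, true_a)))


def estimate(true_a, steps, n0=1, seed=0):
    """Run the assumed estimator on noise-free data y_n = a' Y_n."""
    rng = random.Random(seed)
    k = len(true_a)          # k equals the number of unknown parameters
    a_hat = [0.0] * k
    for n in range(n0, n0 + steps):
        Y = [rng.gauss(0.0, 1.0) for _ in range(k)]  # hypothetical regressor
        y = sum(ai * Yi for ai, Yi in zip(true_a, Y))
        a_hat = sa_update(a_hat, Y, y, n, k)
    return a_hat


if __name__ == "__main__":
    true_a = [-1.6065, 0.6065, 0.1065, 0.0902]  # illustrative values only
    a_hat = estimate(true_a, steps=2000)
    print("estimation error:", error_norm(a_hat, true_a))
```

The decreasing gain $k/n$ is what makes the scheme cheap: each iteration needs one inner product, one norm and one scaled vector addition, with no covariance matrix to propagate as in RLS.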

Minimum-degree pole placement algorithm
Let A and B, which do not have any common factors, denote polynomials in either the forward shift operator q or the differential operator. A process can then be described by the single-input, single-output system (10) [1], and a general linear controller by the control law (11), where $u_c$ denotes the command signal and R, S and T are polynomials. The control law represents a feedforward with the transfer operator T/R and a negative feedback with the transfer operator −S/R. Eliminating u between (10) and (11) gives the relations (12) and (13) for the closed-loop system. The closed-loop characteristic polynomial is thus given by (14), called the Diophantine equation, in which the polynomial $A_c$ is a design parameter that gives the desired properties of the closed-loop system. A and B are assumed not to have common factors so that the equation has a solution. Equation (14) determines only the polynomials R and S.
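For reference, the relations referred to as (10) to (14) can be written in the standard textbook form of pole placement design. This is a sketch under the assumption of a disturbance-free process; the exact expressions used by the authors, in particular any disturbance terms, may differ.

```latex
\begin{align}
A\,y(t) &= B\,u(t) \tag{10}\\
R\,u(t) &= T\,u_c(t) - S\,y(t) \tag{11}\\
y(t) &= \frac{BT}{AR + BS}\,u_c(t) \tag{12}\\
u(t) &= \frac{AT}{AR + BS}\,u_c(t) \tag{13}\\
AR + BS &= A_c \tag{14}
\end{align}
```

Substituting $u$ from (11) into (10) gives $(AR + BS)\,y = BT\,u_c$, which is where the characteristic polynomial $AR + BS$ in (14) comes from.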
To determine the polynomial T in the controller given by (11), the desired response to the command signal $u_c$ is described by a model with characteristic polynomial $A_m$. From (12) and (13), a model-following condition must then hold. Perfect model following is achieved if the error is zero for all command signals. The polynomial B is factorised as $B = B^+ B^-$, where $B^+$ is a monic polynomial whose zeros are stable and well damped, so that they can be cancelled by the controller, and $B^-$ corresponds to poorly damped or unstable factors that cannot be cancelled. $B^+$ is a factor of $A_c$. Furthermore, $A_m$ must also be a factor of $A_c$, as can be seen from (46). The closed-loop characteristic polynomial then takes the form $A_c = A_0 A_m B^+$. Since $B^+$ is a factor of both B and $A_c$, it follows from (14) that it also divides R. Hence $R = R' B^+$, and the Diophantine equation (14) becomes $A R' + B^- S = A_0 A_m$. From (16) to (19), $R'$ is the quotient and $b_0 S$ is the remainder on dividing $A_0 A_m$ by A.
Consider a discrete-time process expressed by a pulse transfer function in which $a_1$, $a_2$, $b_0$ and $b_1$ are the process parameters [1, 2, 14]. They are estimated by the stochastic approximation algorithm [18]. Consider a desired closed-loop system in which q is the forward shift operator and $a_{m1}$, $a_{m2}$ and $b_{m0}$ are the parameters of the closed-loop system. This model satisfies the compatibility conditions. The polynomial B is factorised into $B^+$ and $B^-$. As the process is of second order, the polynomials R, S and T are of first order, and the polynomial $R'$ is thus of degree zero. As the polynomial is monic, $R' = 1$. As $\deg B^+ = 1$, it follows from the compatibility conditions that $\deg A_0 = 0$. Choose $A_0 = 1$. The Diophantine equation (25) is then
$$(q^2 + a_1 q + a_2) \times 1 + b_0 (s_0 q + s_1) = q^2 + a_{m1} q + a_{m2}. \tag{32}$$
Equating coefficients of equal powers of q gives $a_1 + b_0 s_0 = a_{m1}$ and $a_2 + b_0 s_1 = a_{m2}$. If $b_0 \neq 0$, the solution is $s_0 = (a_{m1} - a_1)/b_0$ and $s_1 = (a_{m2} - a_2)/b_0$. The controller is thus characterised by the polynomials R, S and T, which implement the control law (11).
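The second-order design can be checked numerically. The sketch below solves (32) for $s_0$ and $s_1$ and simulates the closed loop against the reference model. Beyond what (32) states, it assumes the process $H(q) = (b_0 q + b_1)/(q^2 + a_1 q + a_2)$, the model $H_m(q) = b_{m0} q/(q^2 + a_{m1} q + a_{m2})$, and the zero-cancelling choices $R = q + b_1/b_0$ and $T = (b_{m0}/b_0) q$. These forms, the function names and the numerical values are illustrative assumptions, not taken from the paper's tables.

```python
def design_controller(a1, a2, b0, b1, am1, am2, bm0):
    """Return (r1, s0, s1, t0) for R = q + r1, S = s0*q + s1, T = t0*q."""
    if b0 == 0.0:
        raise ValueError("b0 must be non-zero for the solution to exist")
    s0 = (am1 - a1) / b0  # from a1 + b0*s0 = am1
    s1 = (am2 - a2) / b0  # from a2 + b0*s1 = am2
    r1 = b1 / b0          # assumption: R = B+ = q + b1/b0 (process zero cancelled)
    t0 = bm0 / b0         # assumption: T = (bm0/b0) * q
    return r1, s0, s1, t0


def past(seq, i):
    """Value of a signal at index i, taken as zero before the start."""
    return seq[i] if i >= 0 else 0.0


def simulate(process, model, steps=50):
    """Unit-step response of the closed loop (y) and of the model (ym)."""
    a1, a2, b0, b1 = process
    am1, am2, bm0 = model
    r1, s0, s1, t0 = design_controller(a1, a2, b0, b1, am1, am2, bm0)
    uc = [1.0] * steps
    y, ym, u = [0.0] * steps, [0.0] * steps, [0.0] * steps
    for t in range(steps):
        # Process: y(t) = -a1 y(t-1) - a2 y(t-2) + b0 u(t-1) + b1 u(t-2)
        y[t] = (-a1 * past(y, t - 1) - a2 * past(y, t - 2)
                + b0 * past(u, t - 1) + b1 * past(u, t - 2))
        # Reference model: ym(t) = -am1 ym(t-1) - am2 ym(t-2) + bm0 uc(t-1)
        ym[t] = (-am1 * past(ym, t - 1) - am2 * past(ym, t - 2)
                 + bm0 * past(uc, t - 1))
        # Control law R u = T uc - S y, written in delayed form
        u[t] = (-r1 * past(u, t - 1) + t0 * uc[t]
                - s0 * y[t] - s1 * past(y, t - 1))
    return y, ym


if __name__ == "__main__":
    process = (-1.6065, 0.6065, 0.1065, 0.0902)  # illustrative a1, a2, b0, b1
    model = (-1.3205, 0.4966, 0.1761)            # illustrative am1, am2, bm0
    y, ym = simulate(process, model)
    print("max |y - ym| =", max(abs(p - m) for p, m in zip(y, ym)))
```

With known parameters the closed-loop output coincides with the model output up to rounding. In the self-tuning regulator the same formulas are evaluated at every iteration with the estimated parameters in place of the true ones.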

Design flow of stochastic approximation-based STR
The flow of the design algorithm for the stochastic approximation-based STR is shown in Fig. 2. The estimation error is updated and reduced at every iteration using adaptation with stochastic approximation.

Functional verification
Functionality of the STR was verified in MATLAB, version 7.9.0.529 (R2009b). As shown in Fig. 3a, a square-wave command signal of unit amplitude was used. Initially, the process output oscillates because of the estimation error, as shown in Fig. 3b. As shown in Fig. 3c, the estimated parameters converge to the true parameters. Table 1 shows the estimated and true parameters of the process. Fig. 3d shows that the implemented controller tracks the command signal. The performance of the STR is satisfactory and sufficient for hardware design.

Hardware implementation
The STR architecture was partitioned into three modules: the stochastic estimation module (stoch_est_module), the controller design module (cntrl_design_module) and the control law module (cntrl_law_module), as shown in Fig. 4. Furthermore, stoch_est_module was divided into three sub-modules, namely n_module, nrm_module and est_module.
Global architecture: The global architecture of the STR is shown in Fig. 4 and has three major modules: stoch_est_module, cntrl_design_module and cntrl_law_module. The process parameters estimated by stoch_est_module were loaded into cntrl_design_module, where the controller design parameters were calculated. The cntrl_law_module generates the control signal. Temporary registers were loaded using the enable signals en_1-en_4 and the write signals w_1-w_4. The different parts were triggered by signals t_1-t_6. The first control signal was generated after 57 ccs. Thereafter, with the pipeline full, a control signal was generated at every rising edge of the clock. The design flow of the pipelined design is shown in Fig. 5; signals 1-6 correspond to t_1-t_6.
Stoch_est_module: Two matrix multiplications were used in n_module, which computes $(\hat{a}'_n Y_n - y_n)$. The matrix multiplications in n_module were performed using the MMM architecture. Most of the time is consumed by the matrix computation involved in every iteration of the estimation stage (1). The generalised MMM architecture for multiplying n matrices of size (N×N), built around a processing element, was used to reduce the computation time required for matrix multiplications [14]. nrm_module computes the norm. Data from n_module and nrm_module were loaded into est_module, and the process parameters were estimated online by stoch_est_module. The flowchart of the pipelined stochastic approximation architecture is shown in Fig. 6. The different modules of the stochastic approximation architecture are shown in Figs. 7a-c.
Data controller module: The most critical part of the architecture is the data controller module, which triggers all processes at the appropriate clock instants. The data controller module is shown in Fig. 8. Control signals were generated by counters (c_1-c_6), and the address lines were generated by finite state machines. For example, n_module in Fig. 4 completed its task at the 20th clock instant and est_module was triggered at the same instant by the data controller. Without this synchronisation, the entire computation would be ineffective. Table 2 lists the hardware requirements.
This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)

Implementation results and discussion
Implementation results of the developed architecture are presented and discussed in this section.

FPGA implementation
The architecture was designed in Verilog. Verification and simulation were done in the Xilinx ISE simulator. The hardware was mapped onto a Xilinx Virtex-4 FPGA device. Fig. 9 shows that the first control signal was generated after 57 ccs.
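The reported timing (first control signal after 57 ccs, then one control signal per rising clock edge once the pipeline is full) can be modelled in a few lines. This is a sketch under the assumption that the first output becomes available exactly at cycle 57 and that the pipeline never stalls; the function name and the cycle-counting convention are hypothetical.

```python
def control_samples_ready(clock_cycles, first_output_latency=57):
    """Number of control signals produced after a given number of clock cycles.

    Assumes the first output appears at cycle `first_output_latency` and one
    further output appears on every following clock edge (full pipeline).
    """
    if clock_cycles < first_output_latency:
        return 0
    return clock_cycles - first_output_latency + 1


if __name__ == "__main__":
    for cycles in (56, 57, 60, 100):
        print(cycles, "cycles ->", control_samples_ready(cycles), "control signals")
```

The model separates the two figures of merit of a pipelined design: the fill latency, paid once, and the steady-state throughput of one result per clock.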

Performance comparison
The performance comparison was made in terms of control signal generation time and resource usage. In the SBDAM approach [15][16][17], the equations governing the recursive least-squares algorithm were decomposed into scalar equations and implemented directly with multipliers and adders. This approach is advantageous in applications that require very precise estimation and dynamic-data range. However, it requires huge hardware resources and incurs extra hardware cost [14]. The hardware requirements were reduced using the MMM-based RLS architecture [14]. However, its control signal generation time is long and it still uses a large amount of resources, which precludes its use in systems that have fast dynamic response or that have to work on resource-constrained platforms. In the proposed architecture, control signal generation time is improved by 38% compared with the best existing related work [14], as shown in Table 3. A comparison of the hardware requirements of the different architectures is shown in Table 4, which reveals that the proposed approach reduces the hardware requirements by 41% in terms of multipliers and 62.5% in terms of adders, with a comparable number of dividers, compared with the best existing related work [14]. Compared with [15][16][17], the hardware requirements are reduced by 2.5× in terms of multipliers, 5× in terms of adders and 3.3× in terms of dividers. The processing speed is compared in Table 5. For complex algorithms, software implementations are slower still, and two to three orders of magnitude better performance can be obtained using parallel and pipelined hardware implementations [20].
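The latency figures quoted in the text can be related to the 38% improvement with a short calculation. The sketch assumes that the 79 ccs reported for MMM-STR and the 57 ccs of the proposed design are the quantities compared in Table 3 and that both designs run at the same clock frequency. Under these assumptions, the 38% figure appears to correspond to the speed-up ratio rather than to the fractional reduction in cycle count. The function names are hypothetical.

```python
def speedup_percent(baseline_cycles, new_cycles):
    """Improvement expressed as a speed-up: (baseline / new - 1) * 100."""
    return (baseline_cycles / new_cycles - 1.0) * 100.0


def reduction_percent(baseline_cycles, new_cycles):
    """Improvement expressed as cycles saved relative to the baseline."""
    return (baseline_cycles - new_cycles) / baseline_cycles * 100.0


if __name__ == "__main__":
    mmm_str_cycles, proposed_cycles = 79, 57
    print(f"speed-up:  {speedup_percent(mmm_str_cycles, proposed_cycles):.1f}%")
    print(f"reduction: {reduction_percent(mmm_str_cycles, proposed_cycles):.1f}%")
```

Stating which of the two definitions is meant avoids ambiguity when comparing against other architectures.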

Conclusion
A novel FPGA-based parallel and pipelined STR architecture based on stochastic estimation, with faster control signal generation and reduced resource usage, is proposed to overcome the shortcomings of the STR algorithm in real-time online applications. Compared with the best existing related work, the time for control signal generation was reduced by 38% and the resource usage was reduced by 44.4% in terms of adders and 41% in terms of multipliers. Moreover, the number of matrix multiplications and the number of intermediate variables involved at each iteration were reduced. Fast matrix multiplication, pipelining and high-speed arithmetic function implementations were used to improve performance. The proposed architecture also has high processing speed and can be used in applications that require fast response times and in those that run on embedded computing platforms with low-power or low-cost requirements and constraints on resource usage.