EURASIP Journal on Applied Signal Processing 2005:4, 575–587
© 2005 Hindawi Publishing Corporation

Simplifying Physical Realization of Gaussian Particle Filters with Block-Level Pipeline Control

We present an efficient physical realization method of particle filters for real-time tracking applications. The methodology is based on block-level pipelining, where data transfer between processing blocks is effectively controlled by autonomous distributed controllers. Block-level pipelining maintains the inherent operational concurrency within the algorithm for high-throughput execution. The proposed use of controllers, via parameter reconfiguration, greatly simplifies the overall controller structure and alleviates potential speed bottlenecks that may arise due to the complexity of the controller. A Gaussian particle filter for the bearings-only tracking problem is realized based on the presented methodology. For demonstration, the individual coarse-grain processing blocks comprising the particle filter are synthesized using a commercial FPGA. From the execution characteristics obtained from the implementation, the overall controller structure is derived according to the methodology, and its temporal correctness is verified using Verilog and SystemC.


INTRODUCTION
Particle filters have been studied theoretically, and their feasibility has been demonstrated in the literature [1,2,3,4,5,6,7]. They perform extremely well in estimating unknown variables in many statistical signal processing problems, including channel estimation in wireless communication and tracking variables in bearings-only tracking applications [8,9,10,11]. However, realizing particle filters in hardware is not trivial. The algorithms consist of many complex processing blocks that execute partly in parallel and partly in sequence. A direct one-to-one mapping from the operations of the algorithm to processing blocks in hardware may result in a complicated overall controller structure. Moreover, generating an efficient controller may become tedious, or even practically intractable, mainly due to the sheer complexity of the algorithms. Hence, the main design issue in realizing particle filters is the development of an overall controller that fully maintains the concurrency of the filtering operations. The objective of this paper is to present a design methodology that simplifies the overall physical design process in particle filter realization.
At the algorithmic level, particle filters work on blocks of data as frames. Such systems possess two distinctive execution characteristics. First, they can be represented as coarse data-flow graphs whose nodes (or blocks) can be executed concurrently [12]. While the nodes differ in complexity and granularity, the data-flow graph can be clearly represented as a function of data dependency. Second, each node in the data-flow graph processes a set of data in every iteration cycle. Depending on the specific application, the data set can be large, requiring a significant amount of buffering. Considering the data flow within these applications, a two-level hierarchy is often evident: at the global level, data frames are processed as a unit in a sequence of logic blocks, and at the local level, elements within a frame are processed in a loop within each block. We take advantage of these characteristics in the proposed design methodology.
Two approaches to controller design are possible: centralized and distributed. In the centralized approach, a single large controller generates the control signals for all the units. While [13] takes a centralized approach, arguing on the grounds of area overhead, [14,15] take a distributed approach in which control is localized to the units, which operate by transferring information between them. We follow a distributed approach in this paper. Our methodology differs from other distributed approaches in that our design targets coarse-grained processing blocks. This has the benefit that a processing block can be designed as a core by another party. Moreover, we provide the flexibility of changing a controller locally with minimum information and overhead, as long as the execution characteristics of the processing block are known. Such a benefit is not possible with the centralized approach, where a small change in logic translates into a redesign of the entire controller. A key benefit of the methodology is that the local controllers are reconfigurable through a set of basic parameters and are completely independent of one another. With this systematic method, it is possible to provide operational predictability in the overall system design.
In this paper, we demonstrate an efficient controller design methodology for temporal correctness, and we do not consider the finite precision effects inherent in the processing block design. The finite precision issue can be incorporated into the design by applying proper scaling and providing sufficient bits for the number representation within each processing block. We note that such incorporation does not alter the presented controller design methodology. We consider the Gaussian particle filter (GPF) as the design study. The correctness of our designs is verified with Verilog and SystemC simulations. The method can be applied to other systems possessing similar execution characteristics.
The remainder of this paper has five sections. Section 2 discusses the background of particle filtering and describes the GPF algorithm in detail. Section 3 addresses the proposed design methodology and design issues, including two-level pipelining and controller synthesis. The design study for the GPF is presented in Section 4. The design is evaluated in Section 5. Finally, our contributions are summarized in Section 6.

Background
Particle filters are used in problems that are represented using dynamic state-space (DSS) models. These models involve a state equation, which shows how the state evolves with time, and an observation equation, which relates the noisy observations to the state. These equations have the following form:

$$x_n = f_n(x_{n-1}, u_n), \qquad y_n = g_n(x_n, v_n), \tag{1}$$

where n ∈ N is a discrete-time index, x_n is a signal vector of interest, and y_n is a vector of observations. The symbols u_n and v_n are noise vectors, and f_n and g_n are a signal transition function and a measurement function, respectively, which are assumed known. In a particle filtering framework, the objective is to estimate recursively in time the signal x_n, for all n, from the observations y_{1:n}, where y_{1:n} = {y_1, y_2, ..., y_n}.
Particle filters base their operation on representing relevant densities by particles (samples) drawn independently from a normalized importance function (IF) π(x_n | x_{n−1}, y_{1:n}) that has the same support as the posterior density. Each sample has a weight associated with it. Accordingly, if the samples {x^{(m)}_{0:n}; m = 1, 2, ..., M}¹ are drawn from the importance density π(x_{0:n} | y_{1:n}) and represent the mth stream of particles of the unobserved state of the system, and y_{1:n} is the sequence of observed data, then an estimate of the posterior expectation I(f_n) of the function f_n(x_{0:n}), defined by

$$I(f_n) = \int f_n(x_{0:n})\, p(x_{0:n} \mid y_{1:n})\, dx_{0:n}, \tag{2}$$

can be obtained by the following expression [16,17]:

$$\hat{I}_M(f_n) = \frac{\sum_{m=1}^{M} f_n\big(x^{(m)}_{0:n}\big)\, w^{(m)}}{\sum_{m=1}^{M} w^{(m)}}, \tag{3}$$

where w^{(m)} is the nonnormalized weight of the mth particle.

Densities that play a critical role in sequential signal processing are the filtering density, p(x_n | y_{1:n}), and the predictive density, p(x_{n+l} | y_{1:n}), l ≥ 1. While sample importance resampling filters (SIRFs) operate by propagating the desired densities recursively in time, GPFs operate by approximating the desired densities as Gaussians. Hence, only the mean and the variance of the densities are propagated recursively in time. In brief, GPFs are a class of Gaussian filters in which Monte Carlo (PF-based) methods are employed to obtain estimates of the mean and covariance of the relevant densities, and these estimates are recursively updated in time [18,19]. Propagating only the first two moments instead of the whole particle set significantly simplifies the parallel implementation of the GPF. Even though approximating the filtering and predictive densities by unimodal Gaussian distributions restricts the application of GPFs, there is still a broad class of models for which this approximation is valid.
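The weighted estimate with nonnormalized weights can be illustrated with a small scalar example. The following is a minimal Python sketch, not the paper's implementation: the function names, the Gaussian target and proposal, and the particle count are all assumptions chosen for illustration. The target density is deliberately scaled by an arbitrary constant to show that the ratio form does not require normalized weights.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a scalar Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def is_estimate(f, target, proposal_pdf, proposal_draw, num_particles, rng):
    """Ratio estimate sum(f(x) * w) / sum(w) with nonnormalized weights w."""
    num = 0.0
    den = 0.0
    for _ in range(num_particles):
        x = proposal_draw(rng)
        w = target(x) / proposal_pdf(x)  # weight known only up to a constant
        num += f(x) * w
        den += w
    return num / den


rng = random.Random(0)
estimate = is_estimate(
    f=lambda x: x,                                      # estimate the mean
    target=lambda x: 5.0 * gauss_pdf(x, 1.0, 1.0),      # unnormalized N(1, 1)
    proposal_pdf=lambda x: gauss_pdf(x, 0.0, 2.0),      # importance function
    proposal_draw=lambda r: r.gauss(0.0, 2.0),
    num_particles=20000,
    rng=rng,
)
```

Because the same constant appears in the numerator and the denominator, the estimate should settle near the true mean of 1 regardless of the scaling of the target.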
The GPF can be significantly simplified when the prior density is used as the IF. This means that π(x_n | x_{n−1}, y_{1:n}) is given by p(x_n | x_{n−1}). In this case, the GPF performs the steps presented in Algorithm 1.
In this paper, particle filters are applied to the bearings-only tracking problem illustrated in Figure 1, where two positions of the object at time instants n and n + 1 are shown.

¹ The notation x

Algorithm 1: GPF algorithm with the prior density as the IF.

The measurements taken by the sensor to track the object are the bearings, or angles (z_n), with respect to the sensor, taken at fixed intervals. The range of the object, that is, its distance from the sensor, is not measured. The unknown states of interest are the position and velocity of the tracked object in the Cartesian coordinate system (x_n = [x_n, ẋ_n, y_n, ẏ_n]^T).
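To make the measurement model concrete, the following Python sketch computes a bearing from a state holding position and velocity. It assumes the sensor sits at the origin and that the bearing is the four-quadrant angle of the position; both are assumptions for illustration, as are the numeric states. It shows why range is unobservable from a single bearing: two objects on the same ray produce the same measurement.

```python
import math


def bearing(x, y):
    """Angle (radians) of an object at (x, y) seen from a sensor at the origin."""
    return math.atan2(y, x)


# Hypothetical states ordered as [x, x velocity, y, y velocity].
state_near = [1.0, 0.1, 1.0, 0.1]
state_far = [3.0, -0.2, 3.0, 0.4]

z_near = bearing(state_near[0], state_near[2])
z_far = bearing(state_far[0], state_far[2])  # same ray, three times farther away
```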

Algorithm partition strategy for physical design
Particle filtering involves many complex arithmetic computations and data accesses. In order to provide a high-speed particle filter realization, the overall algorithm is partitioned into several modules. Each module is designed to maximize loop fusion so that data storage between modules is minimized. When DSPs are used, loop fusion reduces the memory requirement during the execution of the algorithms. From a concurrent physical realization point of view, such partitioning also minimizes the computational latency that may be caused by pipelining with registers.
To fully utilize the locality of the proposed design method, which is explained in Section 3, we follow two key strategies in defining the modules. First, each module is derived so as to eliminate control signal dependency between modules. Only data transfer between the modules is allowed in the algorithm. If a control dependency exists between two modules, we combine them into a single module. Thus, each module is designed as an independent processing block in which control dependency is eliminated. Second, we design each module so that the amounts of data consumed and produced between the modules are deterministic. With a deterministic number of data transfers between the modules, a set of simple distributed controllers can be derived. We describe a partitioned GPF algorithm for tracking in the following section and block-level pipelining in Section 3.

Algorithm 2: Processing one observation by a GPF.

Algorithm 3: Algorithm for drawing conditioning particles for the bearings-only tracking example.

GPF particle filter algorithm for tracking
The details of the GPF algorithm are discussed in [18,20]. In Algorithm 2, a GPF code for processing one observation is presented. Drawing conditioning particles, particle generation, particle update, covariance calculation, mean calculation, and Cholesky decomposition are the operations specific to the GPF algorithm. Note that each element of the GPF is in the critical path.
Drawing conditioning particles (x̃, Ṽx, ỹ, Ṽy) is presented in Algorithm 3. Conditioning particles are drawn from a Gaussian distribution with parameters (µ, Var), where µ is a 4 × 1 matrix and Var is a 4 × 4 triangular matrix. In order to draw Gaussian random numbers, a matrix S is necessary, where Var = S · S^T.
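A common way to draw such Gaussian particles from a mean µ and a lower-triangular factor S (with Var = S · S^T) is to compute µ + S r for a vector r of independent standard normal samples. The Python sketch below illustrates that idea; the function name and the numeric values are hypothetical, and it is not a transcription of Algorithm 3.

```python
import random


def draw_conditioning_particle(mu, s_lower, rng):
    """Return mu + S r, where r has independent standard normal entries.

    Only the lower triangle of S (entries with column <= row) is used.
    """
    r = [rng.gauss(0.0, 1.0) for _ in mu]
    return [
        mu[i] + sum(s_lower[i][j] * r[j] for j in range(i + 1))
        for i in range(len(mu))
    ]


# Hypothetical two-dimensional example: Var = S S^T = [[4, 2], [2, 2]].
rng = random.Random(1)
mu = [10.0, -1.0]
s_lower = [[2.0, 0.0], [1.0, 1.0]]
particles = [draw_conditioning_particle(mu, s_lower, rng) for _ in range(5)]
```

In hardware, each row of this product would map to a fixed set of multipliers and adders, so rows with more nonzero entries of S need more operators than rows with fewer.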
The generation of particles is performed by drawing them from the importance density in the sampling step. In Algorithm 4, the sampling step for bearings-only tracking is presented. The input arguments are the states of the particles obtained from the update state step of the previous time instant (X̃ = {x̃, Ṽx, ỹ, Ṽy}). For the sake of simplicity of implementation, the prior density of the state is selected as the importance density. The output of the particle generation (sampling step) is a new vector of states X = {x, Vx, y, Vy}.

Algorithm 5: (w, S_M) = GPF_PU(x, y, w, z): (i) calculation of weights and their sum.
One possible realization of the particle update (importance step), which is used for weight calculations, is shown in Algorithm 5. The particle update consists of two substeps: weight calculation and normalization. First, the weights are evaluated up to a proportionality constant; subsequently, they are normalized. The input arguments are the observation z, the arrays of states x and y, and the updated weights from the previous time instant, w.
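The two substeps can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 5: the Gaussian bearing-noise likelihood, the parameter sigma_v, the sensor at the origin, and the function names are all hypothetical. The sketch returns the nonnormalized weights together with their sum and performs normalization as a separate step.

```python
import math


def particle_update(x, y, w_prev, z, sigma_v):
    """Weight each particle by an assumed Gaussian likelihood of the bearing z.

    Returns the nonnormalized weights and their sum.
    """
    w = []
    for xm, ym, wm in zip(x, y, w_prev):
        err = z - math.atan2(ym, xm)
        err = math.atan2(math.sin(err), math.cos(err))  # wrap the angle error
        w.append(wm * math.exp(-0.5 * (err / sigma_v) ** 2))
    return w, sum(w)


def normalize(w, s_m):
    """Divide every weight by the accumulated sum."""
    return [wm / s_m for wm in w]


# Hypothetical particles: the first lies exactly on the observed bearing.
x = [1.0, 1.0, 0.0]
y = [1.0, 0.0, 1.0]
w, s_m = particle_update(x, y, [1.0, 1.0, 1.0], z=math.pi / 4.0, sigma_v=0.1)
w_norm = normalize(w, s_m)
```

Keeping the sum separate from the weights mirrors a streaming design in which the division can be deferred until all M weights have been accumulated.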
The covariance calculation is shown in Algorithm 6. The covariance coefficients are first updated in the loop, and then the final calculation is performed. Cholesky decomposition is shown in Algorithm 7. The purpose of this step is to calculate the square root of the covariance matrix so that Var = S · S^T. As opposed to the other steps, this step is not performed in the loop, and its complexity does not depend on the number of particles.
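For reference, the textbook Cholesky recurrence that produces a lower-triangular S with Var = S · S^T can be sketched in Python as below. This is the standard algorithm, not a transcription of Algorithm 7, and the function name and test matrix are hypothetical. The work depends only on the matrix dimension and requires square roots and divisions, which is consistent with the step being independent of the number of particles.

```python
import math


def cholesky_lower(var):
    """Return lower-triangular S such that var = S S^T (var symmetric positive definite)."""
    n = len(var)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = var[i][j] - sum(s[i][k] * s[j][k] for k in range(j))
            if i == j:
                s[i][j] = math.sqrt(acc)   # diagonal entry: one square root
            else:
                s[i][j] = acc / s[j][j]    # off-diagonal entry: one division
    return s


# Hypothetical 2 x 2 covariance matrix.
var = [[4.0, 2.0], [2.0, 3.0]]
s = cholesky_lower(var)
```

For a 4 × 4 matrix, the lower triangle holds 10 entries, so only 10 values of S need to be produced and transferred.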
The computation of the output estimates is shown in Algorithm 8.

Block-level pipelining of loop-based algorithms
The basis of the particle filter design is regular data flow with block-level pipelining. Block-level pipelining is a hardware realization of the data-flow model. Block-level pipelining, which is elaborated below, guarantees high throughput by maintaining operational concurrency among the processing blocks. The design incorporates the block-like operations involved in particle filtering by introducing two-level pipelining.

Algorithm 8: Calculation of mean (computation of estimates).
The main purpose of pipelining at the block level is to insert a buffer with an associated controller in order to encapsulate architectural parameters such as latency and rate differences between a pair of processing blocks. Through block-level pipelining, we can achieve three key objectives. First, it is possible to maintain the concurrency of each processing block while providing correct synchronization between processing blocks for proper execution. Second, since the control signals, data, and clock become local, hardware implementation is much easier in terms of maintaining performance by minimizing clock skews and data routing. Any change in logic affects only its own buffer configuration and controller, so that reconfigurable design and/or core reuse is possible. Third, since the entire design is centered around the buffers, performance mismatches between memory and logic are minimized. These objectives are the basis of the buffer-centric design methodology that we present in this paper.
Basically, data dependencies that may arise due to rate differences are handled by inserting a buffer between any pair of processing blocks, as shown in Figure 2. To provide maximum flexibility of the controller, we simplify the design by assuming that all the components are controlled and synchronized by a single global clock. Buffers (BUF) are used as the basic block-level pipelining elements. Data are transferred by concurrent read and write accesses to these buffers at different address locations. The difference between the write and read addresses can be determined from the rate differences of the processing blocks as well as from the data dependency of the functions. Since the read and write operations use different addresses, there is no address conflict, and the concurrency among the processing blocks is preserved. Each BUF_i can have a different offset, as required by the logic operation writing to or reading from it. The overall operation is logically viewed as a series of buffer-to-buffer operations separated by the latency introduced by the logic implementation of the processing blocks. The data path may be recursive. One key benefit of block-level pipelining is that each block can execute concurrently. Such a systematic view of the buffers provides an opportunity to create a memory-centric platform.
Consider a set of parameters, denoted by L_i, M_ij, and offset_ij, indicated in Figure 2. In the figure, the three processing blocks have different latencies, and the same number of data is specified for every pair of processing blocks. The values of the latencies are obtained from the implementation of the processing blocks, the buffer write-read offset is determined from the operational data dependency of the pair of processing blocks, and the data size M_ij is obtained from the function specification of the processing blocks. Each processing block in Figure 2 produces and consumes data at the same rate. The buffer size must be larger than the value of the corresponding offset_ij.
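The requirement that the buffer be larger than its offset can be checked with a small behavioral model. The Python sketch below is an illustration only: the class DualPortBuffer, the helper stream_through, and the convention that the write precedes the read within a cycle are assumptions, not details taken from the paper. The writer stores one item per cycle, and the reader starts a fixed number of cycles later and then also consumes one item per cycle.

```python
class DualPortBuffer:
    """Circular memory with independent write and read address counters."""

    def __init__(self, size):
        self.mem = [None] * size
        self.size = size
        self.wr_addr = 0
        self.rd_addr = 0

    def write(self, value):
        self.mem[self.wr_addr] = value
        self.wr_addr = (self.wr_addr + 1) % self.size

    def read(self):
        value = self.mem[self.rd_addr]
        self.rd_addr = (self.rd_addr + 1) % self.size
        return value


def stream_through(data, offset, size):
    """Write one item per cycle; begin reading `offset` cycles after the first write."""
    buf = DualPortBuffer(size)
    out = []
    for cycle in range(len(data) + offset):
        if cycle < len(data):
            buf.write(data[cycle])
        if cycle >= offset:
            out.append(buf.read())
    return out


data = list(range(10))
ok = stream_through(data, offset=3, size=4)       # size > offset
broken = stream_through(data, offset=3, size=3)   # size == offset
```

With size greater than the offset, the stream comes out intact; with size equal to the offset, the writer wraps around and overwrites an item before it has been read.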
We now illustrate the operation of block-level pipelining. Consider the nonrecursive case, where the input comes to processing block 1 and the output is generated by processing block 3. First, data are generated by processing block 1 after latency L_1. Then, M_12 data are serially written to the first buffer. After offset_12, processing block 2 starts processing and generates data for the second buffer after latency L_2. A similar process takes place at the other processing blocks and buffers. Note that for nonrecursive block-level pipelining, the last buffer is not needed. As long as data dependency is enforced, all the processing blocks process data concurrently without any stoppage.
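The sequence of events described above can be tabulated with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, the latencies, offsets, and block size are made-up numbers, and every block is assumed to produce one item per cycle.

```python
def pipeline_schedule(latencies, offsets, block_size, start=0):
    """Return, per block, the cycle of its first input, first output, and last output.

    latencies[i] is L of block i + 1; offsets[i] is the buffer offset between
    block i + 1 and block i + 2.
    """
    events = []
    t = start
    for i, lat in enumerate(latencies):
        first_out = t + lat
        events.append({
            "block": i + 1,
            "first_in": t,
            "first_out": first_out,
            "last_out": first_out + block_size - 1,
        })
        if i < len(offsets):
            t = first_out + offsets[i]  # next block starts after the buffer offset
    return events


# Hypothetical values: three blocks, M = 100 items per frame.
schedule = pipeline_schedule(latencies=[3, 5, 2], offsets=[1, 4], block_size=100)
```

In this example, block 3 receives its first input long before block 1 has emitted its last output, which is the overlap that block-level pipelining is meant to preserve.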

Parameter extraction for controller design
The operations in block-based processing are viewed as buffer-to-buffer operations with coarse-grained processing blocks operating between them. The primary parameters that decide the controller structure and the overall physical realization are as follows.
(1) Logic latency (L_i). This is the latency of the processing block, which must be accounted for so that data validity is preserved. Fine-grain pipelining introduces this latency, where L_i is the pipeline depth of the data-producing block.
(2) Write offset (nw_ij). This is the offset between the read of the previous buffer and the write of the current buffer after eliminating the logic latency. It supports a producing processing block with delayed data generation.
(3) Read offset (nr_ij). This is the offset between the write into and the read from the buffer. This value represents the data dependency between the producer and the consumer. We consider a dual-port buffer, in which the value of nr_ij represents the difference between the write and the read of the buffer.
(4) Delay factor (D_ij). This is a delay between the write and the read of the current buffer, used in case of rate mismatch or an irregular data stream.
(5) Block size (M_ij). This characterizes the size of the data produced and consumed by the processing blocks. Along with nr_ij, it determines the maximum storage requirement of each buffer.
(6) Consuming rate (C_i). This is the rate of data consumption by the ith processing block.
(7) Producing rate (P_i). This is the rate of data production by the ith processing block.
(8) Processing rate (F_i). This is the processing speed of the ith processing block.
The parameters (M_ij, nr_ij, nw_ij) are derived from the functional (algorithmic) data-flow description. The parameters (L_i, C_i, P_i, F_i, D_ij) are derived from the implementation of the processing blocks. From these two sets of parameters, we can create two tables: the node information table (NIT) and the edge information table (EIT). Figure 3 illustrates the usage of these tables in the design. A buffer controller is created for each entry in the EIT. From these two tables, the buffer controllers and the global controller are designed. The generation of the tables will be discussed later.
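One possible in-memory representation of the two tables is sketched below in Python. The field layout is an assumption: the text lists which parameters exist but not how the tables are organized, so placing D_ij in the edge table, the record names NodeInfo and EdgeInfo, and the numeric values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """One NIT row: parameters of a processing block."""
    name: str
    latency: int          # L_i
    consume_rate: float   # C_i
    produce_rate: float   # P_i
    process_rate: float   # F_i


@dataclass(frozen=True)
class EdgeInfo:
    """One EIT row: parameters of the buffer between two blocks."""
    producer: str
    consumer: str
    block_size: int       # M_ij
    read_offset: int      # nr_ij
    write_offset: int     # nw_ij
    delay: int            # D_ij (assumed here to be stored per edge)


def controller_config(nit, edge):
    """Collect the parameters that configure one buffer controller."""
    producer = nit[edge.producer]
    return {
        "write": (producer.latency, edge.write_offset),  # (L_i, nw_ij)
        "read": (edge.read_offset, edge.delay),          # (nr_ij, D_ij)
        "block_size": edge.block_size,
    }


# Hypothetical entries for a producer "A" feeding a consumer "B".
nit = {"A": NodeInfo("A", 5, 1.0, 1.0, 1.0), "B": NodeInfo("B", 3, 1.0, 1.0, 1.0)}
eit = [EdgeInfo("A", "B", block_size=100, read_offset=1, write_offset=0, delay=0)]
configs = [controller_config(nit, e) for e in eit]
```

Generating one configuration per EIT row reflects the statement that a buffer controller is created for each entry in the EIT.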

Buffer controller design and synchronization
In the design methodology, the buffer controller is a key element. A block diagram of the buffer controller is shown in Figure 4. The buffer controller consists of a concurrent controller and a dual-ported memory. The concurrent controller has two logic sections: read control and write control [21]. These two controls may be driven by separate clocks and are programmable, so that the same buffer controller can be used throughout the design by varying its parameters. Basically, the write logic section is configured by L_i and nw_ij, and the read logic section is configured by D_ij and nr_ij. Note that these parameters are derived from the data-flow structure and the processing blocks. Thus, the buffer controllers do not depend on one another to operate correctly; they are autonomous and local.
Consider the buffer controller i. When this buffer controller is activated, both the write and read logic sections execute concurrently. Initiation of the write section indicates that data have arrived at the processing block that is connected to this buffer as a producer. The actual data computed by the producing processing block are valid at the buffer controller after waiting for L_i. The write logic section does not write these L_i invalid data from the producer. This guarantees that the valid data stream is received correctly if the producer is purely pipelined hardware. However, it is also possible that the processing block needs a finite amount of computation time regardless of the pipeline depth (i.e., delayed data generation by the processing block). To support this type of processing block, we use one more parameter, nw_ij. After this wait period (L_i + nw_ij), the data are written to the buffer. Once correct data samples start to be written to the buffer, the read process is started by the read logic section. The parameter nr_ij represents the offset between writing data to and reading data from the buffer. This parameter supports data dependency. Even if there is no data dependency, it is possible that the data generation rate of the producer differs from the data consumption rate of the consumer.
To support rate mismatch between two processing blocks connected by the buffer controller, we use another parameter, D_ij. After this wait period (max[nr_ij, D_ij]), the data are read from the buffer. Once all the data are read out from the buffer, the buffer controller suspends itself until it gets a signal from the outside. Thus, the write logic section is configured by (L_i, nw_ij) and the read logic section is configured by (nr_ij, D_ij). The same buffer controller is used to support different data transfer characteristics by modifying these parameters. Figure 5 illustrates the timing relationship of a buffer controller, where the index ij represents a buffer controller placed between the ith and jth processing blocks. Note that there are three key synchronization points among the buffer controllers, indicated by sync1, sync2, and sync3. These three synchronization points correspond to start_time_ij, write_start_ij, and read_start_ij. The start of the write waiting process is synchronized with the start of the read process of the previous buffer controller, indexed as ki. And the start of the read process is synchronized with the start of the write waiting process of the same buffer controller. These are the conditions that must be satisfied in order to derive the parameters of the NIT. They have the following relationships:

$$\begin{aligned}
\text{start\_time}_{ij} &= \text{read\_start}_{ki},\\
\text{write\_start}_{ij} &= \text{start\_time}_{ij} + L_i + nw_{ij},\\
\text{read\_start}_{ij} &= \text{write\_start}_{ij} + \max\left[nr_{ij}, D_{ij}\right].
\end{aligned} \tag{4}$$

Thus, we can synchronize the processing blocks and vary the execution time using these parameters. Moreover, we can delay the entire operation by adjusting the value of nr_ij appropriately.
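The three timing relationships can be evaluated directly for a chain of buffer controllers. The following Python sketch does so; the function names and the parameter values are hypothetical, and time is counted in clock cycles.

```python
def sync_points(read_start_prev, latency, nw, nr, delay):
    """Return (start_time, write_start, read_start) for one buffer controller."""
    start_time = read_start_prev                 # synchronized with the previous read
    write_start = start_time + latency + nw      # wait for L_i + nw_ij
    read_start = write_start + max(nr, delay)    # wait for max[nr_ij, D_ij]
    return start_time, write_start, read_start


def chain_sync_points(edges, first_start=0):
    """Apply sync_points along a chain; each edge is (L_i, nw_ij, nr_ij, D_ij)."""
    t = first_start
    points = []
    for latency, nw, nr, delay in edges:
        pts = sync_points(t, latency, nw, nr, delay)
        points.append(pts)
        t = pts[2]  # the next controller starts at this controller's read start
    return points


# Hypothetical two-buffer chain.
points = chain_sync_points([(3, 0, 1, 0), (5, 2, 0, 4)])
```

Increasing nr_ij of the first buffer in this chain shifts every later synchronization point by the same amount, which is how the entire operation can be delayed with a single parameter.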

Relationship between buffer controllers
Since the buffer controllers must nevertheless be synchronized with one another, we illustrate the relationship between the global controller and the buffer controllers. An overview of the overall control scheme, showing the interaction between the global control and the buffer controls, is given in Figure 6. The global control generates a single clock to synchronize all the buffer controllers. In addition to the global clock, the global control generates basic timing information for each buffer controller. The global controller is a counter that generates a set of periodic start signals, one for each buffer controller, as illustrated in Figure 7.
The figure shows three such timing signals for the three buffers in Figure 6. For each buffer controller, the start timing signal indicates the start of one iteration of the buffer controller. Successive buffer controllers are separated by start signals. The time durations between the start signals are controlled by the parameters described in the previous section.
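A counter-based global controller of this kind can be modeled behaviorally in a few lines of Python. The sketch below is an illustration only: the generator name, the buffer names, the start times, and the iteration period are hypothetical values, not figures from the design.

```python
def global_controller(start_times, period, num_cycles):
    """Yield (cycle, names) whenever the wrapping counter matches a start time.

    start_times maps a buffer controller name to the counter value at which
    its start signal is asserted within each iteration period.
    """
    for cycle in range(num_cycles):
        count = cycle % period  # free-running counter that wraps every period
        fired = [name for name, t in start_times.items() if t == count]
        if fired:
            yield cycle, fired


# Hypothetical start times for three buffer controllers, period of 20 cycles.
start_times = {"BUF1": 0, "BUF2": 4, "BUF3": 15}
pulses = list(global_controller(start_times, period=20, num_cycles=40))
```

Each buffer controller only needs its own start pulse; everything else it does is determined by its local parameters.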

Processing blocks in GPF design
In the design, we first define the operation of each processing block and then the data-flow structure. Each processing block may have its own local controller for its operation. From the processing block design and the data-flow structure, we derive the EIT and the NIT. Finally, the buffer controllers and the global controller are derived and designed. Although maintaining computation accuracy is very important, we do not consider the finite precision effects of the filters. Instead, we focus our study on the execution timing relationship between the processing blocks and the controllers. The actual finite precision effects can be incorporated into the design later without affecting the overall controller.
The GPF has 5 major computational units: condition particle generation, particle generate, particle update, mean and covariance calculation, and central unit. The accompanying figure shows the actual data-flow graph of the GPF under consideration. In [20], it is shown that the loops in the GPF can be fused. Thus, all the steps except the Cholesky decomposition and the variance calculation in the second part of Algorithm 6 can be executed inside one loop of M iterations. The Cholesky decomposition and the final variance calculation are sequential; their complexity is fixed and does not depend on the number of particles.

Condition particle generation (CPG)
In the CPG processing block, the decomposed covariance matrix S and the mean µ obtained from the CU processing block are used for the calculation of conditioning particles. The matrix S is a triangular 4 × 4 matrix, so the number of data transferred from the CU processing block is 10 (not 16). All the multipliers are pipelined, and they operate concurrently, producing M conditioning particles. Since the outputs (x̃, Ṽx, ỹ, Ṽy) are computed using different numbers of operators, we have to introduce an additional delay, different for each state, so that all the conditioning particles appear at the output at the same time instant. The CPG requires 4 random number generators [22].
In the CPG processing block, there are 2 input buffers for (µ, S) from the CU processing block and 4 output buffers for (x, Ṽx , ỹ, Ṽy ) to the PG processing block.The data size of mean µ is 4 and of the decomposed covariance S is 10.These data are generated sequentially to save interconnect buses.Internally, these data are used in parallel.The output data size is M. Initially, the mean and the decomposed covariance elements are obtained externally and not from the CU.

Particle generate (PG)
In the PG processing block, there are 4 buffers associated with the input vectors (x̃, Ṽx, ỹ, Ṽy) and 4 buffers associated with the output vectors (x, Vx, y, Vy). The input vectors stored in the input buffers are generated by the CPG processing block. In addition, 2 more buffers are associated with (x, y) to be used by the PU processing block. The data sizes of the inputs and outputs are M. Because they are vectors, the buffer controller parameters are identical.
The arithmetic operations of the particle generation are described in Algorithm 4. The outputs are computed in parallel. In the PG processing block, there are 4 noise generators. For the generation of noise samples, we use the Box-Muller approach for efficient FPGA implementation [22]. The noise generation combines a lookup table with arithmetic logic so that the large latency of the arithmetic logic unit is eliminated.

Particle update (PU)
The arithmetic operations of the particle update are illustrated in Algorithm 5. The main arithmetic operations are multiplications, divisions, the trigonometric function arctan(·), and the exponential function exp(·). Unrolled CORDIC [23,24] is used as the operator for arctan(·) and exp(·) because of its regular implementation structure. For each arctan(·) operation, a constant value π/2 and a multiplexer are used to correct the angle-fault problem in the algorithm for arctan(·). The purpose of this correction is to resolve the ambiguity between (−x, y) and (x, −y) (i.e., the arctan(·) value is identical even though the physical locations are different). Since there is no data dependency between the PG and PU processing blocks, the PU processing block computes its outputs as soon as the data are available in the input buffers.
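The quadrant ambiguity and one common way of correcting it can be illustrated with a floating-point model of vectoring-mode CORDIC. The Python sketch below is an assumption-laden illustration, not the authors' circuit: it uses the textbook pre-rotation by ±π/2 when x is negative (a constant plus a selection, in the spirit of the multiplexer described above), 16 iterations, and hypothetical names. Each loop iteration corresponds to one stage of an unrolled implementation.

```python
import math

# Elementary rotation angles atan(2^-i), one per CORDIC stage.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(16)]


def cordic_arctan(x, y):
    """Four-quadrant angle of (x, y) via vectoring-mode CORDIC."""
    if x < 0.0:
        # Pre-rotate by -pi/2 or +pi/2 so that the vector has x >= 0,
        # and start the angle accumulator at the opposite constant.
        if y >= 0.0:
            x, y, z = y, -x, math.pi / 2.0
        else:
            x, y, z = -y, x, -math.pi / 2.0
    else:
        z = 0.0
    for i, angle in enumerate(ATAN_TABLE):
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        if y >= 0.0:
            x, y, z = x + y * shift, y - x * shift, z + angle
        else:
            x, y, z = x - y * shift, y + x * shift, z - angle
    return z


angle_a = cordic_arctan(-1.0, 1.0)   # second quadrant
angle_b = cordic_arctan(1.0, -1.0)   # fourth quadrant
```

A plain arctan(y/x) returns the same value for both points because the ratio y/x is identical; with the pre-rotation, the two results differ by π.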
The PU processing block also takes z(n) as an external observation input. During iteration n, the value of z(n) does not change. The PU processing block computes the rest of the weight computation process, as illustrated in Algorithm 5. These weights are not normalized, and they are accumulated to generate their sum (denoted by S_M in the algorithm) at the end of the weight calculation. The sum is used in the weight normalization step in the MCC processing block.
It is important to note that, even though buffers are shown at the outputs of these processing blocks in the figure, they can be replaced by pipeline registers in the actual implementation when the data are used immediately by the succeeding processing block. In this case, the latencies of those blocks are added to the last buffer controller.

Mean and covariance calculate (MCC)
In the MCC processing block, the partial 4 × 4 covariance matrix Var and the 4 × 1 mean vector µ are calculated. The MCC computes only the first part (the loop) of Algorithm 6; the rest is computed in the CU processing block. The number of multiplication operations during the mean and covariance operation is equal to N_s + N_s(N_s + 1)/2 = (N_s^2 + 3N_s)/2 for the GPF, where N_s represents the dimensionality of the model. For the bearings-only tracking problem with N_s = 4, this amounts to 14 multiplications. All 14 outputs are accumulated, so 14 accumulators are necessary. All the blocks operate concurrently on M particles.
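The operation count can be checked, and one way in which it might arise can be shown, with a short Python sketch. The loop structure below is an assumption made for illustration (weight times state for the mean terms, then weighted state times state for the lower triangle of the covariance terms); the function names and sample values are hypothetical, and the final normalization is left out because it is performed after the loop.

```python
def mcc_multiplications(ns):
    """Multiplications per particle: ns mean terms plus ns(ns + 1)/2 covariance terms."""
    return ns + ns * (ns + 1) // 2


def accumulate_moments(particles, weights):
    """Accumulate weighted first and second moments over all particles."""
    ns = len(particles[0])
    mean_acc = [0.0] * ns
    cov_acc = [[0.0] * ns for _ in range(ns)]  # only the lower triangle is filled
    for p, w in zip(particles, weights):
        wp = [w * v for v in p]                # ns multiplications
        for i in range(ns):
            mean_acc[i] += wp[i]
            for j in range(i + 1):
                cov_acc[i][j] += wp[i] * p[j]  # ns(ns + 1)/2 multiplications in total
    return mean_acc, cov_acc


count = mcc_multiplications(4)
mean_acc, cov_acc = accumulate_moments([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```

With this structure, the number of accumulators equals the number of multiplications, since every product feeds exactly one running sum.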
In the MCC processing block, there are 6 input buffers: for (x, Vx, y, Vy) from the PG processing block and for (w, sum) from the PU processing block. The mean vector µ is normalized by the sum at the end of the operation. This requires only 4 divisions for the mean vector. There are 2 output buffers, for µ and Var, to the CU processing block. These outputs are serialized. The µ from the MCC processing block is the output of the GPF. Thus, the output generation block for the GPF converts the serially received µ to a parallel format.

Central unit (CU)
The inputs and the outputs of the CU are produced once during the sampling period. The CU processing block executes Algorithm 7. In addition, the CU processes the second part of Algorithm 6. Because of special functions such as division and square root, we design for time-multiplexing of operators. Since only a small number of data is to be computed, the delay incurred by this unit is not significant. The outputs are buffered within the processing block before being read out to the buffer controller for synchronization purposes. In the CU processing block, there are 2 input buffers for (µ, Var) from the MCC processing block. There is 1 buffer for S to the CPG processing block. These outputs are serialized.

Output generation (OG)
The OG processing block is a simple serial-to-parallel transformation logic.

Concurrency and execution time of GPF
The timing diagram illustrating the major block operations is shown in Figure 9. The value n shown in each box indicates the iteration index. As shown in the timing diagram, the executions of the CPG, PG, PU, MCC, and CU processing blocks overlap. When the CPG processing block starts its execution, the buffer starts to write after L_CPG. As soon as the first data item is written, the buffer starts the read for the PG processing block. The PU processing block can then start, and the MCC processing block follows. Since the MCC has to normalize the mean, the CU waits M cycles for the value sum to become available. As soon as the CU generates its data, the next iteration of the CPG can proceed. The external input is synchronized with the start of the PU. The output is generated from the MCC processing block. Thus, the minimum iteration period of the GPF is T_GPF = (M + L_GPF) · T_clk, where L_GPF = L_CPG + L_PG + L_PU + L_MCC + L_CU, and L_CU includes the latency due to hardware and the delayed output generation resulting from time-multiplexing within the CU processing block.
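The iteration period formula can be evaluated as in the Python sketch below. The total of 155 cycles for L_GPF is the value quoted for this design in the realization discussion; the way it is split across the five blocks, the 10 ns clock period, and the function names are hypothetical and serve only to make the arithmetic concrete.

```python
def total_latency(l_cpg, l_pg, l_pu, l_mcc, l_cu):
    """L_GPF as the sum of the five block latencies, in clock cycles."""
    return l_cpg + l_pg + l_pu + l_mcc + l_cu


def iteration_period(num_particles, l_gpf, t_clk):
    """Minimum iteration period T_GPF = (M + L_GPF) * T_clk."""
    return (num_particles + l_gpf) * t_clk


# Hypothetical split of the latency; only the total of 155 comes from the text.
l_gpf = total_latency(l_cpg=20, l_pg=10, l_pu=60, l_mcc=15, l_cu=50)
period_ns = iteration_period(num_particles=1000, l_gpf=l_gpf, t_clk=10)  # t_clk in ns
```

For large M the latency term becomes a small fraction of the period, so the throughput approaches one particle per clock cycle.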

GPF realization
The data-flow graph of the GPF for the bearings-only tracking problem is constructed as shown in Figure 10. The figure shows both the processing blocks and the buffers.
Table 1 tabulates the primary parameters of each processing block for the GPF. Similarly, the GPF speed is limited by the speed of the CORDICs in the PU. Table 3 summarizes the parameters of all buffer controllers. These parameters are derived from the above two tables. For each buffer, the start time of the buffer controller and the period of the buffer iteration are given. The start-time parameters are computed with respect to $f_{clock}$. The external input is synchronized with the start of the PU1.
Note that there are several key synchronization points. First, the buffer controllers for E6 and E7 have identical read beginnings. Second, the buffer controllers for E8 and E10 also have identical read beginnings. Third, the buffer controllers for E7, E8, and E9 have identical write beginnings. Fourth, the buffer controllers for E2 and E3 have identical start times and write beginnings. Fifth, the buffer controllers for E3 and E4 have the same read beginning. The iteration period is indicated by the reset time instant. From the previous discussion of the timing diagram, the iteration period is $M + L_{GPF} = M + 155$, which includes the CU computation latency and the values of $nr_{Ei}$ and $nw_{Ei}$ on the critical path.
Table 4 illustrates buffer controller usages and configuration parameters for the global controller. The buffer size for each buffer controller is also indicated. The factor of 4 concurrency, both parallelism and pipelining cannot be fully exploited. When the performances of the GPF and the SIRF are compared, the GPF is approximately twice as fast as the SIRF when the number of particles exceeds 1000. This is because the execution time of the SIRF, when the proposed methodology is followed, is proportional to $2M$, whereas that of the GPF is proportional to $M$. For fewer particles, however, the SIRF and the GPF are close in performance, since the latency of the GPF due to pipelining is much larger than that of the SIRF. When both algorithms are executed on a DSP, their performances are very close. While the SIRF takes more time when concurrency is exploited in the FPGA design, the GPF takes more cycles due to complicated arithmetic computations.
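
The dependence of this comparison on the number of particles can be illustrated with a simple cycle-count model in Python. The model assumes that the SIRF needs $2M$ cycles plus a fixed latency and the GPF needs $M$ cycles plus a fixed latency; the GPF latency of 155 cycles is the value reported above, whereas the SIRF latency of 20 cycles and the function name `sirf_to_gpf_cycle_ratio` are hypothetical placeholders.

```python
def sirf_to_gpf_cycle_ratio(num_particles: int,
                            gpf_latency: int = 155,
                            sirf_latency: int = 20) -> float:
    """Ratio of SIRF cycles to GPF cycles under a simple linear model.

    Assumed model: SIRF ~ 2M + latency, GPF ~ M + latency.
    `sirf_latency` is a hypothetical value, not taken from the paper.
    """
    sirf_cycles = 2 * num_particles + sirf_latency
    gpf_cycles = num_particles + gpf_latency
    return sirf_cycles / gpf_cycles


# The ratio grows with M and approaches 2 as the fixed latencies
# become negligible relative to the particle count.
ratios = {m: sirf_to_gpf_cycle_ratio(m) for m in (100, 1000, 10000)}
```

Under these assumptions the GPF latency dominates for small $M$, so the two filters perform similarly, while for large $M$ the ratio tends toward the factor of 2 set by the $2M$ versus $M$ terms.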
The proposed block-level pipelining design maximizes buffer controller usage and minimizes dynamic reconfiguration effort. The design is not penalized by the amount of memory used for buffers, since those buffers can be viewed as a collection of pipeline registers (i.e., these registers are needed in any design with a standard design flow, whether the design is based on a distributed or a centralized scheme). These pipeline registers cannot be eliminated if maximum throughput is of utmost concern. Since we are eliminating the use of individual pipeline registers, lower overall clock loading can be achieved.
A particle filter based on the proposed methodology supports parameter changes at run time because the design methodology ensures flexibility. Moreover, the design can be extended to support a wide range of particle filters, including parallel operation. The main novelty of the architecture is that the buffer controller guarantees correct operation while maintaining maximum throughput. In addition, because very little information is associated with each structure (i.e., one set of data for each buffer controller and the global controller), the reconfiguration time is almost nonexistent (i.e., all the parameters for the controllers and the structural switch can be loaded within a few clock cycles, or in as little as one cycle if loaded simultaneously).

CONCLUSIONS
This paper introduces a simple design methodology for generating an overall controller for a particle filter realization. We have demonstrated that the proposed method is very effective when the data-flow structure is well defined by taking into account the execution characteristics of the processing blocks. The entire design followed a block-level pipelining approach in which controllers can easily be derived from the data flow and the parameters of the processing blocks. We have considered the GPF for validating our design methodology. The overall temporal correctness has been verified using Verilog and SystemC. The method is important because the execution characteristics of the processing blocks can be changed with minimal change in the controller structure.

Figure 1: Illustration of the tracking problem.

Figure 2: Illustration of the block-level pipelining structure of data flow. A possible recursion is also illustrated with a dotted connection between processing blocks 3 and 1.

Figure 3: Design flow of the controller and overall physical realization.

Figure 4: Block diagram of a buffer controller consisting of read and write processes.

Figure 8: Data-flow graph of the GPF.
For E4 and E5, $nw_{E4}$ and $nw_{E5}$ are 2 and $M + 1$, respectively. At E8, the read operation for the CPG will be delayed by $nr_{E8}$, which is $nr_{E6} + L_{CU} + nw_{E10} + nr_{E10} = 78$. This will synchronize both µ and S at the CPG processing block. At E10, $nw_{E10} = L_{CUt} = 75$, which corresponds to the time taken by the CU processing block to generate the first data. For E6, E7, E8, and E9, $nw_{Ei} = M$. The values of $D_{Ei}$ are all 1 since there is no rate mismatch. Table