1 Introduction

Dynamic state estimation, which is basically concerned with estimating a latent state that evolves over time from a sequence of observations in the presence of noise, clutter, and disturbances, is of central interest in fields of signal/information processing and control. It has a broad range of applications related to detection, positioning, monitoring, tracking, navigation, and robotics. The rapid development of sensors and ever-increasing proliferation of smartphones, mobile robots, and unmanned vehicles have further increased the interest in the topic.

Estimation has a long research history, although it was the Kalman filter (KF) (Kalman, 1960) that thrived the field and initiated modern estimation study. Historical ‘giants’ of estimation include Gauss and Legendre who independently invented the theory of least square estimation in 1795 and 1806, respectively, which anticipates most of the modern-day approaches to estimation problems, Fisher who introduced the maximum likelihood method in 1912, Kolmogorov and Wiener who established the statistical foundation for interpolation and extrapolation, filtering and prediction in 1940 and 1942, respectively, and Bode and Shannon who proposed the state-space model among many others in 1950; please refer to retrospective reviews offered by Sorenson (1970), Grewal and Andrews (2014), and Singpurwalla et al. (2017). It was the interpretation of KF from a Bayesian prior to posterior viewpoint (Ho and Lee, 1964; Lindley and Smith, 1972) that opened the floodgate for both statisticians and engineers to advance the state of the art of filtering. Considerable efforts have since been devoted to both linear and nonlinear time series state-space models in a wide range of realms.

However, for a general nonlinear stochastic process with few exceptions, approximation has to be resorted to. The approximation can be parametric, non-parametric, or a mixture of both. In the non-parametric case, the target probability density function (PDF) can be approximated with Monte Carlo approaches based on random sampling, such as the particle filter (PF) (Arulampalam et al., 2002; Cappé et al., 2007; del Moral and Arnaud, 2014; Bugallo et al., 2017), and grid-based approaches (Gerstner and Griebel, 1998; Šimandl et al., 2006; Kalogerias and Petropulu, 2016) based on a finite discrete state space. In the parametric case, PDF is represented by a family of functions that are fully characterized by certain parameters such as Gaussian approximation (GA) and Gaussian mixture (GM) filters. They are collectively referred to as ‘parametric filters’ in this paper, of which moment matching to the Bayes prior and posterior is the key. They form the backbone for general time series filter design and are the focus of this survey.

There have been many excellent tutorials, surveys, and textbooks, primarily in the context of non-linearity (Nørgaard et al., 2000; Wu et al., 2006; Crassidis et al., 2007; Hendeby, 2008; Šimandl and Duník, 2009; Li and Jilkov, 2012; Patwardhan et al., 2012; Morelande and García-Fernández, 2013; Stano et al., 2013; Duník et al., 2015; García-Fernández and Svensson, 2015; Huber, 2015; Roth et al., 2016; Särkkä et al., 2016; Afshari et al., 2017) or on some sub-topics such as noise covariance metrics estimation (Duník et al., 2017b) and circular Bayes filtering (Kurz et al., 2016). However, some important issues have not been addressed or only addressed briefly, including: (1) a unifying framework to analyze the common essences of different filters; (2) very informative observation systems (i.e., observation noise is insignificant); (3) the classification of multimodal systems, intractable uncertainties, and constraints.

These issues will form the key part of our review, complementing the existing work. To minimize overlap with these studies, common contents will not be addressed. A comprehensive overview is still nigh impossible. Instead, we base our review on a transparent and concise framework termed ‘approximate Gaussian conjugacy (AGC)’. That is, all reviewed work arguably aims at maintaining, or approximating to be more precise, a closed-form Markov-Bayes recursion from a GA/GM prior to a GA/GM posterior, to deal with the challenges due to nonlinearity, multimodality, intractable uncertainty, and constraint. By doing so, different efforts are organized along the same line. To go beyond a pure review, we also include discussions on alternatives to the first-order hidden Markov model (HMM) and on filter evaluation regarding computing speed, with our new thoughts. All of these strive to give a concise albeit admittedly subjective overview of the state of the art, highlight several significant issues that can easily be ignored, and shed some light on the future research trend.

2 Basis of sequential Bayesian inference

2.1 Markov-Bayes recursion

The time-series (a.k.a. sequential) Bayesian inference is carried out by constructing the posterior PDF of the latent state based on the observation series and a prior model knowledge of the system. Using the posterior distribution, one can make state inference, typically finding the value that maximizes the posterior (namely ‘maximum a posteriori (MAP) estimation’) or the value that minimizes a cost function (e.g., mean square error (MSE)).

To be more specific, the dynamic state estimation problem cast in measurable state space \({\mathcal X}\) and observation space \({\mathcal Y}\) can be formulated in a discrete-time state-space model (SSM) with additive noises:

$${x_t} = {f_t}({x_{t - 1}}) + {u_t} + {v_t},$$
((1))
$${y_t} = {h_t}({x_t}) + {w_t},$$
((2))

where \({x_t},{u_t} \in {\mathcal X}\) denote the state and the input, respectively, \({y_t} \in {\mathcal Y}\) denotes the observation, and \({v_t} \in {\mathcal X},{w_t} \in {\mathcal Y}\) denote the additive noises affecting state function f t and observation function h t at time instant t ∈ ℕ, respectively.

Note that state process model (1) shall be written in a differential form for the continuous time case, so does observation function (2) in a rare case (Ghoreyshi and Sanger, 2015).

We consider a process {(x t , y t )∣t ≥ 1}, where {x t ∣t ≥ 1} is a first-order HMM/Markov chain on \({\mathcal X}\), and each observation \({y_t} \in {\mathcal Y}\) is conditionally independent of the rest of the process given x t . This reads (1) \(p({x_{0:t}}) = p({x_0})\prod\nolimits_{k = 1}^t p ({x_k}|{x_{k - 1}})\) and (2) \(p({y_{0:t}}|{x_{0:t}}) = \prod\nolimits_{k = 0}^t p ({y_k}|{x_k})\). As such, the filtering posterior is given by performing prediction and correction recursively. The prediction step combines the previous filtering distribution p(xt−1y0:t−1) with state transition p(x t xt−1, y0:t−1), as

$$\begin{array}{*{20}c} {p({x_t}|{y_{0:t - 1}})\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad } \\ { = \int {p({x_{t - 1}}|{y_{0:t - 1}})p({x_t}|{x_{t - 1,}}{y_{0:t - 1}})d{x_{t - 1}}.} } \\ \end{array} $$
((3))

This one-step forecast (a.k.a. the Chapman-Kolmogorov equation) forms the prior distribution (called ‘the prior’ hereafter). Next, given a new observation y t , the prior will be updated by the Bayes rule, resulting in the Bayes posterior distribution (called ‘the posterior’ hereafter), i.e.,

$$p({x_t}|{y_{0:t}}) = {{p({y_t}|{x_t})p({x_t}|{y_{0:t - 1}})} \over {\int {p({y_t}|{x_t})p({x_t}|{y_{0:t - 1}})} {\rm{d}}{x_t}}},$$
((4))

where p(y t x t ) is the likelihood function.

Given the posterior in Eq. (4), the expected a posteriori (EAP) estimate of state x t conditioned on all observations y0:t is given by

$$\hat x_t^{{\rm{EAP}}} \buildrel \Delta \over = E[{x_t}|{y_{0:t}}] = \int {{x_t}p} ({x_t}|{y_{0:t}}){\rm{d}}{x_t},$$
((5))

which also gives the minimum MSE (MMSE) estimation of optimality defined on the second-order statistics. Alternatively, the MAP estimate (García-Fernández and Svensson, 2015) is given by

$$\hat x_t^{{\rm{MAP}}}\, \buildrel \Delta \over = \,{\rm{arg}}\,\,\underset{{x_t}}{{\rm{max}}} \,\,p({x_t}|{y_{0:t}}).$$
((6))

Different from the prevalent MSE criterion, it might be of interest to base the lost function on some other criteria, such as the maximum correntropy criterion (MCC) (Liu et al., 2007) which has advantages in handling impulsive non-Gaussian noises, thanks to using higher-order statistics information. Correspondingly, a new class of linear KFs (Chen and Principe, 2012; Wu et al., 2015; Chen et al., 2017) have been developed. Generally, there are cases where robustness (i.e., adaptability to outliers, system errors, and disturbances) is preferable to optimality, leading to various robust filtering algorithms (see Section 5.5).

Without loss of generality, one typical iteration process of a recursive filter can be illustrated in Fig. 1. One of the main reasons for the popularity of HMMs is the friendly first-order assumption that states are conditionally independent given the previous state. This facilitates forward-backward inference for model learning and parameter estimation, but also severely limits the temporal dependencies that can be modeled. Some alternatives will be presented in Section 7.

Fig. 1
figure 1

Block-diagram of the evolution of a recursive filter of the prediction-updating format

2.2 Bayesian Cramér-Rao lower bound

It is theoretically pivotal to derive the performance bounds on estimation errors when estimating parameters of interest in a given model, and developing estimators to achieve these limits. When the parameters to be estimated are deterministic, a popular approach is to bound MSE achievable within the class of unbiased estimators. The Cramér-Rao lower bound (CRLB), given by the inverse of the Fisher information matrix, provides the optimum performance for any unbiased estimator of a fixed parameter on the variance of estimation error (Appendix A). However, it is important to note that:

Highlight 1 CRLB limits only the variance of unbiased estimators, and a lower MSE can be obtained by allowing for a bias in the estimation, when ensuring that the overall estimation error is reduced (Stoica and Moses, 1990; Eldar, 2008).

van Trees (1968) presented an analogous MSE bound for a random parameter, the posterior CRLB, which is also referred to as the ‘Bayesian CRLB (BCRLB)’. An elegant recursive approach was developed by Tichavsky et al. (1998) to calculate the sequential BCRLB based on the posterior distribution for a general discrete-time nonlinear filtering problem that avoids Gaussian assumptions. However, in general, BCRLB has no closed-form expressions in nonlinear systems. As such, a large body of alternative Bayesian bounds has been proposed (van Trees and Bell, 2007; Zuo et al., 2011; Zheng et al., 2012; Fritsche et al., 2016).

On BCRLB, there are two points worth noting. First, the unconditional BCRLB is determined by only the system dynamic model, system observation model, and the prior knowledge regarding the system state at the initial time. It is thus independent of any specific realization of the system state. However, for constrained estimation problems, the corresponding constrained CRLB (Gorman and Hero, 1990) can be lower than the unconstrained version, thanks to the additional constraint information about the parameter. Some attempts have been made to include the information obtained from observations by incorporating the tracker’s information into the calculation of BCRLB; please refer to Zuo et al. (2011), Fritsche et al. (2016), and the references therein for details.

Second, in the Bayesian setting, both the state and observation sequences are random quantities on which CRLB/BCRLB is based. However, in the majority of practical setups, particularly in the context of tracking, positioning, and localization, only a single state sequence, such as a trajectory of an aircraft or a ground vehicle, is of interest. In these situations, the estimator performance shall be evaluated based on the MSE matrix conditioned on a specific state sequence, for which the general BCRLB does not provide a lower bound (Fritsche et al., 2016). Instead, it was shown that:

Highlight 2 KF is biased conditionally with a nonzero process noise realization in the (deterministic) state sequence and is not an efficient estimator in a conditional sense, even in a linear Gaussian system.

2.3 Gaussian conjugacy

Some important properties of the Gaussian distribution are notable. Given only the first two moments, the Gaussian distribution makes the least assumptions about the true distribution in the maximum entropy sense and minimizes the Fisher information over the class of distributions with a bounded variance (Kim and Shevlyakov, 2008). As a general example, by denoting θ as the parameter vector, w as the noise, and y = x θ + w as the random observation model, we have the following property (Stoica and Babu, 2011; Park et al., 2013):

Highlight 3 Among all possible distributions of observation noise w with a fixed covariance matrix, the CRLB for x attains its maximum when w is Gaussian; i.e., the Gaussian scenario is the ‘worst case’ for estimating x.

More importantly, the Gaussian variable is self-conjugate. That is, if the likelihood function is Gaussian, choosing a Gaussian/GM prior over the mean will ensure that the posterior distribution is also Gaussian/GM without using any approximation. We refer to this as strict Gaussian conjugacy in this paper. Please refer to Murphy (2007) for more conjugate priors related to Gaussian distribution. For example, the inverse Wishart distribution provides a conjugate prior for the covariance matrix of a Gaussian distribution with a known mean.

Based on conjugate prior, the Bayes prior and posterior can be computed in a closed form. More precisely, since the Gaussian PDF is determined uniquely by its first moment (mean) and the second moment (covariance), the Gaussian conjugacy will render recursive computations of the Bayes prior and posterior in the simple manner of recursive algebraic computing of the mean and covariance of the conditional PDFs, namely ‘moment matching’. Such a conjugacy is very engineering-friendly, especially when computing time is considered (see Section 7.2), and forms the essence for sequential closed-form recursion.

The strict Gaussian conjugacy, however, requires both state transition function f t and observation function h t be linear, and inputs u t and noises v t and w t be unconditionally/white Gaussian/GM (independent of the state). Then, the optimal, conjugate solution is given by KF (or a mixture of KFs in case of GM filtering), as shown in Appendix B. Any violation of these requirements will lead to a non-Gaussian/GM posterior and destroy the closed-form Gaussian recursion. Also, all the parameters need to be known a priori. These requirements are fastidious and unrealistic in most realistic systems. To retain AGC, approximation has to be applied for easing the challenge from nonlinearity (regarding both functions f t and h t ), multimodal posterior, intractable system uncertainties (primarily regarding noises v t and w t and input u t ), and constraints, which will be addressed in Sections 3, 4, 5, and 6, respectively.

3 Nonlinearity

Nonlinearity appearing in the system functions forms a pivotal and explicit challenge to the Gaussian conjugacy simply because a Gaussian distribution after nonlinear transformation will be no more Gaussian. A considerable number of approximation approaches have been developed to account for nonlinearity. These approaches can be primarily classified into two categories, approximating either the nonlinear function or the nonlinear-transformed PDFs. The former, with typical examples of extended KF (EKF), modal KF (Mohammaddadi et al., 2017), divided difference filter (Nørgaard et al., 2000; Wang et al., 2017), and Fourier-Hermite KF (Sarmavuori and Särkkä, 2012), seeks functions’ approximation using polynomial expansions (e.g., Taylor series, Fourier-Hermite series, Stirling’s interpolation, or Modal series). The latter, with representative examples of unscented KF (UKF) (Julier and Uhlmann, 2004), Gauss-Hermite filter and central difference filter (Ito and Xiong, 2000), cubature KF (CKF) (Arasaratnam and Haykin, 2009; Jia et al., 2013), sparse-grid quadrature filter (Arasaratnam and Haykin, 2008; Jia et al., 2012), stochastic integration filter (Duník et al., 2013), and iterated posterior linearization filter (IPLF) (García-Fernández et al., 2015b; Raitoharju et al., 2017), is based on a set of deterministically chosen weighted sigma points. It was shown by Särkkä et al. (2016) that many sigma-point methods can be interpreted as Gaussian quadrature based methods. They calculated the posterior PDF in a local sense; therefore, the methods are also referred to as the local numerical approximation approach. An alternative to deterministic sampling for approximating an arbitrary PDF is random sampling (e.g., the popular mixture KF (Chen and Liu, 2000), ensemble KF (Evensen, 2003; Roth et al., 2017b), Monte Carlo KF (Song, 2000), and Gaussian/GM PF (Kotecha and Djurić, 2003a; 2003b)), which still strives to maintain AGC. This allows asymptotically exact integral evaluation, albeit with much higher computational complexity. Like PF, these approaches are referred to as the global numerical approximation approach.

All of these GA filters have triggered tremendous further developments. For instance, UKF has perhaps gained the most approval, whereas it may suffer from numerical instability (e.g., may have a negative weight for the center point) (Arasaratnam and Haykin, 2009; Jia et al., 2013), systematic error (Duník et al., 2013), and nonlocal sampling problem for high-dimensional applications (Chang et al., 2013); refer to Adurthi et al. (2017). These problems, together with parameter setting strategies (Straka et al., 2014; Zhang et al., 2015; Scardua and da Cruz, 2017) and constrained filtering (see Section 5), have led to ever-increasing further developments for deterministic sampling-based filtering. Meanwhile, various measures of the degree of nonlinearity/non-Gaussianity have been developed (not limited to state estimation); see the review offered by Liu and Li (2015) and Duník et al. (2016). This provides a principle to select a nonlinear filter from many, according to the property of the problem.

To better exploit the information about the state from the same measurement sequence, different local filters that extract different portions of the system information can be employed to linearize the same nonlinear functions and the results combined for a better accuracy. This is called the ‘cooperative local (or Gaussian) filter design’ approach (Duník et al., 2017a), which resembles the idea of multiple conversion approach (Lan and Li, 2017), which jointly uses multiple nonlinear filters based on a weighted sum of several sub-functions of the (same) measurement (each sub-function corresponds to one filter).

While general nonlinear filtering has been well elaborated and reviewed from various viewpoints, we focus on two interesting subtopics.

3.1 Converted measurement filtering

The unconditional noise requirement (i.e., the noises are white and independent of the state) may not be fulfilled strictly in practice. This relaxation is particularly useful when the state model is linear and Gaussian when the measurement model is nonlinear but can be converted to a linear (namely ‘injective’) one. Such a linear-dynamic nonlinear-observation system is very common in target tracking and robot positioning realms. Although converting the nonlinear measurement to the state space yields a non-Gaussian uncertainty for sure, the system will become linear, enabling the use of a linear filter, namely ‘converted measurement filtering (CMF)’. It was first introduced by Lerro and Bar-Shalom (1993). Obviously, nonlinear conversion will lead to a (pseudo-measurement) noise which is state-dependent and non-Gaussian, even the original noise is state independent and white Gaussian. Therefore, a critical issue involved is to determine the unbiased mean and covariance of the observation noise after conversion (Bordonaro et al., 2014; Lan and Li, 2015), entailing correct moment matching.

A review of algebraic approaches for the Gaussian noise related debiasing was delivered by Bordonaro et al. (2014). To handle originally non-Gaussian noises, Monte Carlo sampling can be used for general conversion (Li et al., 2016a). A recent work of Bordonaro et al. (2017) converted range, bearing, and range rate collaboratively to Cartesian position and velocity, permitting the use of CMF with a poor angle accuracy. When the noise is multiplicative (namely ‘dependent’) on the state, the conversion will need knowledge of the state. For example, a maximum likelihood estimator was used by Wang et al. (2012) to remove the distance-sensing nonlinearity in case of hybrid additive and multiplicative noises. However, we note that in many cases the measurement model is non-injective, e.g., a bearing observation of the target in the planar space, preventing CMF unless multiple sensors are used jointly to make the observation (in the form of observation matrix) determined or over determined (Li et al., 2017a).

The state of the art (Liu et al., 2013; Lan and Li, 2015) has demonstrated that proper ‘uncorrelated conversion’ of the nonlinear measurement can mine more information from the measurement information for better filtering accuracy, compared to the original measurement. This leads to an updating protocol which is based on linear combination of the original measurement and its uncorrelated conversions. However, it was further pointed out by García-Fernández et al. (2015a) that CMF works better, particularly for informative systems but not for non-informative systems that have a large measurement noise variance. Therefore, an interacting mechanism is advocated to switch between an unscented linear CMF and a normal unscented nonlinear filter.

3.2 Very informative observation

Dramatically fast and ever-increasing escalation has been seen on computers and sensors including radar, camera, and sonar. It is fair to say, what we have today is totally different from that when Kalman invented the KF. Either high-precision sensors or high-dimensional observations due to the joint use of multiple/massive moderate sensors are supposed to remarkably benefit our estimation by providing a very informative observation (VIO). Unfortunately, advanced KFs may not always outperform the basic KF in such cases. Instead, it turns out that (Morelande and García-Fernández, 2013; García-Fernández et al., 2015b):

Highlight 4 For sufficiently precise measurements, none of the KF variants, including KF itself, are based on an accurate approximation of the joint density. Conversely, for imprecise measurements, all KF variants accurately approximate the joint density, and therefore the posterior density. Differences among the KF variants become evident for moderately precise measurements.

Therefore, seeking increasingly accurate AGC approximations can be of limited benefit in a VIO system. Instead, an SBI filter may just lose to the observation-only (O2) inference that directly converts the observation to the state space (Li et al., 2016a), equal to using a uniform/non-informative prior. It is important to note that the default formulation of most filters omits the bias propagated in the prior by taking unbiasedness as granted, which is naive at best to be true in the real world. Indeed, the bias (due to either mis-modeling or over/improper approximation) in the prior is the key leading to the defeat of a filter, especially in a VIO system.

A VIO SSM is given in Appendix C. It was first proposed by van der Merwe et al. (2000) and has since been widespread for filter test. However, on it the simple O2 inference can beat EKF/UKF, unscented PF, etc., by orders of magnitude in terms of both accuracy and computational speed. In particular, prominent attention is desired as sensors are deployed with a gradually increasing quantity (with higher precision) or quality (joint use of massive sensors) (Li et al., 2017a; 2018a) nowadays, popularizing VIO in reality. Therefore, we have the following note (Li et al., 2016a; 2017a):

Highlight 5 While BCRLB sets a best line (in the sense of MMSE) that any unbiased sequential estimator can at maximum achieve, the O2 inference sets the bottom line that any ‘effective’ estimator shall at worst achieve.

As a compromise, iterative algorithms may be applied to repeatedly leverage the informative observation. The first iterated EKF (IEKF) (Jazwinski, 1970) implements the first-order Taylor series expansion (TSE) of the observation function repeatedly for posterior updating to avoid filtering divergence due to the one-time first-order TSE truncation. IEKF produces a sequence of mean estimates. It was shown in Bell and Cathey (1993) to be equivalent to the Gauss-Newton (GN) algorithm for computing the MAP estimate. IEKF performs well when the true posterior is close to being Gaussian; however, convergence of the GN algorithm is not guaranteed. Furthermore, a generalized iterated KF (Hu et al., 2015) for nonlinear stochastic discrete-time estimation with state-dependent observation noise adopts the Newton-Raphson iterative optimization steps, yielding an approximate MAP estimate of the states. With a high relevance, IPLF (García-Fernández et al., 2015b; Raitoharju et al., 2017) uses statistical linear regression instead of the first-order TSE for a better linearization, and iterates a posterior estimate updating.

More implementations for iterated/repeated observation (or its conversion) updating have been realized on different Gaussian filters (Zhan and Wan, 2007; Zanetti, 2012; Steinbring and Hanebeck, 2014; Huang et al., 2016b). These have a close connection to the concepts of progressive correction (Oudjane and Musso, 2000) and progressive Bayes (Hanebeck et al., 2003), both of which strive to apply Bayes updating in a progressive manner and aforementioned uncorrelated augmentation (Liu et al., 2013; Lan and Li, 2015; 2017). In fact, the idea of emphasizing the observation when it is very informative has also inspired the development of random-sampling based filters such as annealed/unscented PFs (van der Merwe et al., 2000; Godsill and Clapp, 2001), particle flow filter (Daum and Huang, 2010), and feedback PF (Yang et al., 2016), and some (re)sampling approaches (Li et al., 2015a; 2015b). There has been a burgeoning passion in applying data-driven techniques to enhance filtering in VIO systems; refer to Mitter and Newton (2003), Ma and Coleman (2011), and Nurminen et al. (2017) for other attempts. Parameter learning for VIO systems was also studied in Svensson et al. (2017).

These data-driven approaches essentially weaken the impact of the prior and converge to the O2 inference. A rigorous criterion on the optimal trade-off between the prior and the data in forming the posterior in these approaches seems still missing. In the existing work, the convergence has been identified primarily by monitoring the Kalman gain as compared with a specified ad-hoc threshold.

However, when the observation is not so informative, it turns out to be a bad idea to emphasize the observation, as quantitatively demonstrated in Li et al. (2016a). Therefore, particular caution should be exercised.

4 Multimodality

4.1 Gaussian mixture

Based on the Wiener approximation theorem, any distribution can be expressed as, or approximated sufficiently well by, a finite sum of known Gaussian distributions, called ‘GM’. Mixture distribution may arise from stochastically switched Gaussian systems (such as the maneuvering dynamics as addressed in Section 4.2), systems with multimodal state (e.g., concurrent multiple targets), multimodal observation (e.g., radar observations often exhibit bimodal properties due to secondary radar reflections), or systems with long-tailed stochastic behavior or noise, to name a few.

The posterior Eq. (4) in the manner of a GM of M t components at time t can be written as

$$p({x_t}|{y_{0:t}}) = \sum\limits_{i = 1}^{{M_t}} {\omega _t^{(i)}{\mathcal N}({x_t};\hat x_t^{(i)},P_t^{(i)}),} $$
((7))

where \(\omega _t^{(i)} > 0\) is the weight of the ith Gaussian component which satisfies \(\sum\nolimits_{i = 1}^{{M_t}} {\omega _t^{(i)} = 1} \) in general but is not in the finite set statistics-based multi-target intensity cases (Vo and Ma, 2006; Mahler, 2014).

Assuming that the noise sequences have a uniformly convergent series expression in terms of known Gaussian distributions, a number of Gaussian terms with known moments can be used to develop an MMSE filtering algorithm, namely ‘Gaussian mixture filtering (GMF)’ (Sorenson and Alspach, 1971; Faubel et al., 2009; Ali-Loytty, 2010). Each Gaussian component may be updated based on different nonlinear filter updating rules. For linear dynamic systems with GM noises, GMF provides the MMSE state estimate by tracking the GM posterior. The analytic lower and upper MMSE bounds of linear dynamic systems with GM noise statistics were analyzed in Pishdad and Labeau (2015). It has been shown that for highly multimodal GM noise distributions, the bounds and MMSE will converge, and relevant statistics such as mean or covariance can be derived in a closed form. In addition, taking system constraints into account, projection based GM-UKF (Ishihara and Yamakita, 2009), GMF (Duník et al., 2010), and density truncation based GM-UKF (Straka et al., 2012) have been developed. Constrained filtering will be addressed separately in Section 6.

Obviously, the mixture size lies in the core of the trade-off between computing efficiency and filter accuracy. Many sophisticated or straightforward algorithms have been proposed for adapting/reducing the number of components in GM. For an adaptive GM, two different approaches have been proposed: adapting the weight of each Gaussian component by minimizing the propagation error committed in GM approximation (Ito and Xiong, 2000; Terejanu et al., 2011) and splitting the Gaussian components during the propagation based on nonlinearity-induced distortion (DeMars et al., 2013). Both require online optimizations, which, however, will add to the overall computational cost. Instead, mixture reduction (MR) is more practically useful. It is typically realized in the manner of GM merging and pruning.

The first systematic GM merging scheme established by Salmond (1990) is perhaps still the most widely used protocol (Faubel et al., 2009; Ali-Loytty, 2010) due to its computing simplicity and provable efficiency in practice. In fact, it is deemed a type of ‘conservative’ fusion (Reece and Roberts, 2010) and its covariance-fusion part can be further optimized for a smaller trace, leading to the so-called ‘optimal mixture reduction (OMR)’ (Li et al., 2018b). To be more specific, for a GM \(\sum\nolimits_{i = 1}^{{M_t}} {\omega _t^{(i)}} {\mathcal N}({x_t};\hat x_t^{(i)},P_t^{(i)})\), the OMR scheme fuses its components into a single weighted Gaussian component \({\omega _{{\rm{OMR}}}}{\mathcal N}({x_t};{\hat x_{{\rm{OMR}}}},{P_{{\rm{OMR}}}})\) as follows:

$${\omega _{{\rm{OMR}}}} = \sum\limits_{i = 1}^{{M_t}} {\omega _t^{(i)}} ,$$
((8))
$${\hat x_{{\rm{OMR}}}} = {{\sum\limits_{i = 1}^{{M_t}} {\omega _t^{(i)}} \hat x_t^{(i)}} \over {{\omega _{{\rm{OMR}}}}}},$$
((9))
$${P_{{\rm{OMR}}}} = \underset{\tilde P_t^{(i)}}{{\rm{argmintr}}} \,\left( {\tilde P_t^{(i)}} \right),$$
((10))

where, to ensure conservativeness, the adjusted covariance matrix is given by

$$\tilde P_t^{(i)} = P_t^{(i)} + \left( {{{\hat x}_{{\rm{OMR}}}} - \hat x_t^{(i)}} \right){\left( {{{\hat x}_{{\rm{OMR}}}} - \hat x_t^{(i)}} \right)^{\rm{T}}}.$$
((11))

The only difference to Salmond’s approach is on the fused covariance as \({P_{{\rm{Salmond}}}} = {{\sum\nolimits_{i = 1}^{{M_t}} {\omega _t^{(i)}\tilde P_t^{(i)}} } \over {{\omega _{{\rm{OMR}}}}}}\), which has a larger trace than Eq. (10). A more general principle for MR is to minimize the discrepancy between the original and the reduced mixtures (Crouse et al., 2011), for which two typical metrics are the integral square error (ISE) and Kullback-Leiber divergence (KLD). The KLD of GM-PDF before MR p(x) from that after MR q(x), denoted by DKL(pq), is an asymmetric measure of the information lost when q(x) is used to approximate p(x), which reads

$$\begin{array}{*{20}c} {{D_{{\rm{KL}}}}(p\parallel q) \buildrel \Delta \over = \int {p(x){\rm{ln}}} {{p(x)} \over {q(x)}}{\rm{d}}x\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \,\,} \\ { = \int {p(x){\rm{ln}}} \,p(x){\rm{d}}x - \int {p(x){\rm{ln}}\,q(x){\rm{d}}x} .} \\ \end{array} $$
((12))

As the first term completely relies on PDF before MR, minimizing KLD in Eq. (12) is equivalent to maximizing the second ∫p(x) ln q(x)dx. As a non-parametric distance, the ISE between p(x) and q(x) reads

$${D_{{\rm{ISE}}}}(p\parallel q) \buildrel \Delta \over = \int {{{(p(x) - q(x))}^2}{\rm{d}}x.} $$
((13))

The ISE approach was first proposed for MR in the context of multiple hypothesis tracking in Williams and Maybeck (2006). It has inspired further development (Chen et al., 2010) and the normalized ISE (Petrucci, 2005). One distinctive feature of the method is the availability of exact analytical expressions for GMs. However, cost function (13) is a complicated multimodal function with many local minima; hence, gradient-based methods cannot guarantee convergence to the global minimum, unless the initialization point happens to be close to the global minimum (Williams and Maybeck, 2006). In contrast, the Kullback-Leibler reduction method (Runnalls, 2007) minimizes an upper bound on the KLD between the original mixture and the reduced mixture. It appears to perform better in terms of slimming GM, and has led to several further developments (Schieferdecker and Huber, 2009; Ardeshiri et al., 2015; Raitoharju et al., 2017).

In contrast to the above MR schemes that gradually reduce the mixture to a desired size via merging and pruning, the algorithm given in Huber and Hanebeck (2008) gradually adds new components to a mixture starting from a single component. This method, however, could be beaten in terms of ISE by simpler approaches based on clustering (Schieferdecker and Huber, 2009).

MR is actually a key part of many multi-hypothesis based approaches such as multi-hypothesis tracker (Reece and Roberts, 2010). It has also been applied to distributed information fusion for consensus (Li et al., 2018b).

4.2 Maneuver

Maneuver is an important concept particularly in the context of target tracking. It generally refers to time varying target dynamical mode/model. Maneuvering target tracking (MTT) is essentially a hybrid estimation problem consisting of continuous-state (base-state) estimation and discrete-state (mode) decision. A straightforward solution to MTT is given by handling maneuvers and random process noises jointly by a white, colored, or heavy tailed noise process (Gordon et al., 2003; Ru et al., 2009; Guo et al., 2015). This allows converting the MTT problem into that of state estimation in presence of non-stationary process noise with unknown statistics. This extended process noise approach primarily applies to insignificant maneuver.

The prevalent, considered-standard framework to describe the maneuvering state dynamics is the so-called ‘jump Markov system (JMS)’, in which the target dynamical model switches/jumps from one HMM to another. Simply put, there are two primary types of JMS methods: the decision-based singlemodel (SM) method (Zhou and Frank, 1996; Li and Jilkov, 2002) and the multiple-model (MM) method (Li and Jilkov, 2005). In the former, the filter is adaptive and is operated on the basis of the model selected during the decision process, and consequently the hybrid estimation problem is solved by combining state estimation with an explicit model decision. In this regard, timely detection of the target maneuver, namely the ‘model adaptation of the filter’, is key (Ru et al., 2009). Once it fails to do so and a wrong model is used, the performance of the filter will degrade significantly.

Considering ‘not putting all the eggs in one basket’, the MM method employs a bank of maneuver models to describe the time-varying motion and runs a bank of elemental filters based on these models, each being associated with a probability. The final estimate is given by the weighted results of these sub-filters. The most representative MM method is the interacting MM (IMM) algorithm and variable-structure IMM estimators (Li and Bar-Shalom, 1996; Li and Jilkov, 2005; Lan J et al., 2013; Granström et al., 2015). An idea similar to IMM has also been developed in PFs, e.g., in Martino et al. (2017). The number of models in IMM is fixed, whereas in variable-structure IMM it can be selected adaptively from a broad set of candidate models. Operating multiple models in parallel can be very computationally costly; however, it can still be insufficient when the real model parameters vary in a continuous space (Xu et al., 2016), or oppositely, too many models become as bad as too few models.

In either way, model decision/adaption delay is inevitable (Fan et al., 2011). It behaves as the delay of maneuver detection in the SM methods and as the time of probability convergence to the true model in the MM methods.

Highlight 6 Many adaptive-model approaches proposed for MTT may show superiority when the target indeed maneuvers but performs disappointingly or even significantly worse than those without using an adaptive model, when there is actually no maneuver. We call this ‘over-reaction due to adaptability’.

To combat these problems, the target motion can be described by a continuous-time trajectory function as in Eq. (18), and thereby the MTT problem can be formulated as an optimization problem to find a trajectory function best fitting the sensor data, e.g., in the sense of least squares of the fitting error (in Section 7.1). The fitting approach needs neither ad-hoc maneuver detection nor MM design and is therefore computationally reliable and fast (Li et al., 2017d). It is particularly well-suited to a class of smoothly maneuvering targets such as passenger aircraft, ships, and trains, where no abrupt and significant maneuvering should occur for the passengers’ safety, and most often, the carrier moves on a predefined smooth route.

5 Intractable uncertainty

5.1 Classification of uncertainties

Besides system functions f t and h t which are often considered deterministic, either known or unknown, there are three key variables whose statistics need to be specified properly for setting up a filter, including control input u t (which can be considered either deterministic or stochastic), state process noise v t , and observation noise w t . All of these constitute the uncertainty of the system and the core of the stochastic process. On one hand, if their statistics are unknown, they have to be estimated concurrently with the hidden states using available sensor observations, referred to as simultaneous state and parameter estimation or adaptive filtering. This is a challenging task since in many cases direct observation of certain parameters is very expensive or difficult, if not impossible (Ghahremani and Kamwa, 2011), or the observation itself contains significant intractable uncertainties such as outlier, clutter, and misdetection, to be explained below. On the other hand, they may conflict with the unconditionally Gaussian system requirement, for which proper remedies have to be taken for AGC.

In the most common case, observation function h t is given a priori. However, it is also often that the position of the sensor is unknown (and time-varying) or there is a sensor bias, and the position of the sensor needs to be estimated simultaneously with that of the target. This is often referred to as joint sensor localization/registration and target tracking, e.g., in Guo et al. (2016). Besides the maneuvering model, there are various specific problems where only a part of the parameters involved in the system function vary and need to be estimated, such as resistance in motor systems and aerodynamic parameters in UAV. Unlike discrete maneuvers, these parameters may not change in a jump manner but in the continuous space. They are generally related to system identification, out the scope of our survey.

An emerging tool for non-parametric statespace modeling called ‘Gaussian process (GP)’ regression (Rasmussen and Williams, 2005), which represents the unknown system function (either transition function f t or observation function h t ) by a random function (namely, GP SSM) and infers the posterior distribution of the function from data, is very different from the PDF approximation addressed in this paper. A GP is a distribution over functions. It is fully specified by a mean and a covariance function encoding basic structural assumptions of the class of functions to be modeled, e.g., smoothness and periodicity. GP gains an increasing importance in machine learning (Rasmussen and Williams, 2005), state estimation (Deisenroth et al., 2012; Frigola-Alcade, 2015; Särkkä et al., 2016), parameter/model learning processing (Wang et al., 2008; Ko and Fox, 2009), etc., when it is difficult to find an accurate parametric form of the system function. It is interesting to recognize that GP can be broadly classified into our AGC framework (i.e., from a GP prior to a GP posterior), to accommodate more general likelihood functions.

Overall, the major intractable uncertainties involved for an adaptive filter design based on SSM can be classified as shown in Fig. 2. To avoid distracting our attention on filters under AGC, we will leave aside the uncertainty issues caused by the (either partly or entirely) unknown system functions or abnormal observation data in this paper. What follows will focus on estimation or approximation of the statistics of inputs u t , state process noise w t , and observation noise w t when they are either unknown or non-Gaussian/correlated.

Fig. 2
figure 2

Classification of major uncertainties involved in the state-space model (SSM)

Only the highlighted three sub-topics are accommodated in this review while the others are well addressed in existing surveys (e.g., unknown noise covariance estimation in Duník et al. (2017b)), are rare cases (e.g., unknown observation function), or do not form the key theme of our review (e.g., correlated noises, outliers, clutter, and misdetection)

5.2 Unknown input

The models and/or models’ parameters may deviate from their nominal values by an unknown constant or time-varying bias, which are called ‘unknown inputs (UIs)’. The corresponding filtering problem in the presence of UI is termed ‘UI filtering (UIF)’. UI may appear in both state dynamics and measurement models (including sensor bias), although we model only inputs in the state dynamic model in Eq. (1). Based on a priori assumptions made on UI, existing UIF algorithms can be broadly categorized into three main classes.

5.2.1 Noise interpretation of the unknown input

This approach is simply modeling UI by a zero-mean Gaussian noise with a usually large, stationary, or time-varying (Liang et al., 2004) covariance. However, this assumption is often violated, leading to an adverse filtering performance such as instability (Azam et al., 2015). This is because UI is typically a non-stationary process (i.e., signal with an arbitrary type and magnitude) and cannot be well captured by a stationary and zero-mean random noise.

5.2.2 Known unknown input dynamics

In this category, UI is approximately known dynamics and unknown initialization. This approach can accommodate several types of UIs such as unknown constant, ramp, polynomials in time, sinusoids, or their combinations (Su et al., 2016). A common approach is to augment UI (or the state of its dynamics) into the state variable, resulting in an augmented system for which conventional filters can be adopted, namely augmented KF (AKF) (Mayne, 1963; Su and Chen, 2017). To reduce the computation cost of AKF, Friedland (1969) proposed a two-stage KF to decouple AKF into a state sub-filter and a UI sub-filter. It was further extended and optimized by Hsieh (2000) and generalized to the optimal multi-stage KF by Chen and Hsieh (2000). An expectation-maximization (EM) based iterative optimization framework, which treats unknown covariances as missing data, was proposed by Bavdekar et al. (2011) for joint state estimation and parameter identification, and similarly by Lan H et al. (2013) for stochastic systems with UIs in both the process and measurement models.

Note that the augmented system is typically nonlinear even though the original one is linear. Also, a mean estimation error (or bias) may appear when the assumed UI dynamics is not fulfilled due to any mismatch, e.g., abrupt maneuver in target tracking (Bogler, 1987) and fast time-varying disturbances in disturbance observer based control (Kim and Rew, 2013). To combat this, an appropriate covariance matrix for the noise term in UI dynamics is the key for a trade-off between estimation bias and accuracy (Azam et al., 2015).

5.2.3 Unknown input dynamics

In this category, no specific dynamics is assumed on UI. The original work of this kind (Kitanidis, 1987) is based on the MSE unbiased estimation (i.e., minimizing the trace of the state error covariance matrix under the unbiased algebra constraint). Various properties for the developed filters have been investigated successively, including the existence condition (Darouach and Zasadzinski, 1997), asymptotic stability (Fang and de Callafon, 2012), and global optimality (Cheng et al., 2009). Later, this approach has been extended to the case with direct feed-through of UI (Cheng et al., 2009), simultaneous input and state filtering including recursive three-step filter (RTSF) (Gillijns and Moor, 2007; Hsieh, 2009), and filtering with partial information on the input (Su et al., 2015b). Recently, its relationship with the classical KF has also been rigorously established by Li (2013) and Su et al. (2015a) in terms of existence, optimality, and asymptotic stability by assuming that the inputs are available at an aggregate level.

In comparison to AKF, this approach could lead to unbiased estimation, while it is more sensitive to sensor noise due to the lack of a priori UI dynamics information. Another point worth mentioning is the existence condition. A necessary condition of AKF is the detectability of augmented matrix pair, while strong detectability is usually required in approaches without information of UI dynamics (Yong et al., 2016), which are slightly stricter.

Recent work is more focused on how to accommodate prior information on UI or unknown parameters so that both the state and UI filtering performance can be improved. For example, amplitude and equality constraints are considered in fault diagnosis and traffic management (Li, 2013; Su et al., 2015b), respectively. It should be highlighted that the extra information on UI stems from the experience or knowledge of the designers. A better alternative is to learn from massive historical data. To this end, clustering and classification were exploited by Yi et al. (2016) to model vehicle acceleration for a better situation awareness performance. Another open problem comes from hybrid UIs, such as a linear combination of dynamic, random, and deterministic UIs (Liang et al., 2008) or to be more challenging, different UI switching.

5.3 Unknown noise

There is a large body of literature on noise covariance estimation in both state and observation equations. Interested readers can refer to a cutting-edge, comprehensive survey offered by Duník et al. (2017b). A remarkable result which appeared recently (Ristic et al., 2017) states that:

Highlight 7 The theoretically best achievable second-order error performance, namely CRLB, in target state estimation is independent of knowledge (or the lack of it) of the observation noise variance.

This is in accordance with the results in Djurić and Miguez (2002) which demonstrated that the noise covariances are unnecessary in estimation, as they can be integrated out. More surprisingly, it was shown that the filters which do not use the true value of observation noise variance but instead estimate it online, can achieve the theoretical bound, while the CKF, which uses the true value of the Gaussian observation noise variance, cannot. An explanation for this is that the filters that estimate the observation noise variance online are able to distinguish the accurate bearing observations from inaccurate ones and adapt their Kalman gains accordingly, resulting in an overall better tracking performance. This finding is interesting as it raises a puzzle: is it a real advantage if the filter knows the true observation noise statistics?

5.4 Non-Gaussian or non-white noise: heavy tail, correlation, and dependence

Gaussian distribution is simply incompetent in modeling outliers (because of clutter, impulsive noise, glint noise, unreliable sensors, etc.), skewness, heavy tails, and bounded support. In addition to the aforementioned GM, a pragmatic way to approach outliers and skewed observation noise is to assume heavy-tailed noise (also called ‘glint noise’), for which elliptically contoured distributions, such as Student’s t-distribution (Girón and Rojano, 1994; Tipping and Lawrence, 2005; Loxam and Drummond, 2008; Aravkin et al., 2012; Piché et al., 2012; Roth et al., 2013; Nurminen et al., 2015) and Lévy distribution (Sornette and Ide, 2001; Gordon et al., 2003), turn out to be helpful.

The Student’s t-distribution has been demonstrated to be less sensitive to outliers than the Gaussian distribution, thereby enjoying a better robustness while retaining the minimum variance optimality of KF. Either the process noise or the observation noise can be modeled as Student’s t-distribution (Aravkin et al., 2012), while the latter takes a majority in the literature. Based on Student’s t observation noise assumption, the Bayesian filtering and smoothing recursions were developed for linear systems in Piché et al. (2012) and Roth et al. (2017a), based on which different parametric filters can be implemented. Student’s t-mixture filter was also developed by Loxam and Drummond (2008).

While both Student’s t-distribution and the Gaussian distribution belong to the family of elliptically contoured distributions, the Gaussian approximation to the posterior PDF is more reasonable than the Student’s t approximation with a fixed degree of freedom (DOF) parameter for the case of moderate contaminated process and observation noises (Huang et al., 2017). In this sense, GM might be a better alternative (Bilik and Tabrikian, 2010), given a proper MR-management. For a t-distributed observation noise with heavy tails, while CRLB significantly underestimates the optimal MSE, KF has a significantly larger MSE (Piché, 2016).

There are actually at least two other intractable uncertainties leading to non-white noises, such as colored noises due to noise correlation in the time direction (Wang et al., 2015) and multiplicative noises due to their dependence on the state (Spinello and Stilwell, 2010; Agamennoni and Nebot, 2014; Wang et al., 2014; Huang et al., 2015; Liu, 2015; Huang et al., 2016a). Noise correlation could occur at the same time instant or one time step apart (or more complicated, multiple time steps apart). Interested readers can refer to the provided references.

5.5 Robust filtering

Another filtering optimality is regarding the adaptability against a class of more significant uncertainties such as clutter, disturbances/outliers, and misdetection, termed ‘robust filtering’. These uncertainties can be classified as ‘abnormal noise’ to the system, which are unfortunately too ‘strong’ to be effectively handled by the aforementioned maneuvering/adaptive model, noise estimation methods, or heavy-tailed/correlated noise modeling approaches. Instead, robust filtering technologies such as Huber’s M (maximum-likelihood-type)-estimation that can detect clutter in either state processes or observations (Koch and Yang, 1998; Yang et al., 2001; Zhang et al., 2016) or the H-infinity/H filter (Simon, 2006) that can handle arbitrary (unknown) noise of bounded energy, are required, to name a few.

A filter is called robust if the actual error variances guarantee a minimum upper bound for all admissible uncertainties. This variant research theme has been stimulated by the increased interest in robust control theory and has received a lot of attention in 1990s and early 2000s with the development of convex optimization. Some robust Gaussian filters have been reviewed in Simon (2006) and Afshari et al. (2017). Recent attention on robust filtering turns to the sensor network and practical considerations such as missing data, communication delay (Dong et al., 2010), and distributed fusion (Qi et al., 2014). It is out of the focus of this review; however, we have the following observation to highlight the core differences between robust filtering and MMSE filtering:

Highlight 8 Robust filtering is much more related to robustness with respect to statistical variations than it is to optimality, with respect to a specified statistical model. Typically, the worst case estimation error rather than MSE needs to be minimized in a robust filter. As a result, robustness is usually achieved by sacrificing the performance in terms of other criteria, such as MSE and computing efficiency.

6 Constraints

There are two basic types of constraints: physical constraints reflecting limits to physical state variables (such as positivity of mass or pressure and limitation of speed or angle) and design constraints which represent desired operating limits (such as technological limitations or geometric considerations of the system). The constraint can be on either the state or the observation. While a few constraints are completely quantifiable in statistics, the most are typically given as contextual/‘soft’ information (Snidaro et al., 2015) that has to be converted to an equality or inequality for use in the filter. For example, an equality constraint between the state variables can be written as a function:

$$C({x_t}) = 0$$
((14))

The constraint can be taken into account at different inference stages, corresponding to three different types of strategies for constraining, i.e., in a bottom-up order: (1) modeling stage, (2) filtering stage, and (3) output stage.

6.1 Equality and inequality

6.1.1 Constrained system modeling

When the equality is defined between dimensions of the state in Eq. (14), the state can be converted to a lower-dimensional unconstrained state, by representing part of the state vector as a linear function of the remaining part, as governed by the equality constraint (Wen and Durrant-Whyte, 1992). The dimension reduction can also be achieved through null space decomposition (Hewett et al., 2010), in which an orthogonal factorization is used to decompose the constrained state estimation problem into stochastic and deterministic components, which are then solved separately. In contrast, the equality constraint can also be appended to the observation equation by creating an additional deterministic pseudo-observation (Tahk and Speyer, 1990; Duan and Li, 2013) from constraint (14) as follows:

$${y_t} = C({x_t}),$$
((15))

with the pseudo-observation always treated as of mean 0 and variance 0.

The pseudo-observation model will increase the observation dimension and thereby increase the size of the matrix that needs to be inverted in Kalman gain computation. It will also lead to a singular covariance matrix, which may cause numerical problems. More importantly, in Eq. (15), the state is not guaranteed to obey the constraint, inappropriate for strict mathematical constraints.

6.1.2 Constrained estimation process

Instead of modifying the system models that will either increase or reduce the problem dimensions, an alternative systematic approach is to take into account the constraints during the filtering process, e.g., designing equality constrained dynamic systems based on which the filter estimate fulfills the constraints automatically (Xu et al., 2013; Duan and Li, 2015), to provide constrained point estimates together with constrained covariance matrices in some cases. As a representative example, the moving horizon estimation (MHE) filter minimizes the mean square error while satisfying the constraint (Ishihara and Yamakita, 2009). However, it is computationally intensive for larger horizons and nonlinearities in the observation equation or constraint.

It is important to note that, under the constrained dynamics, the state process noise is state-dependent in general (Duan and Li, 2015). Simply, the Gaussian distribution has an infinite tail, which does not hold in limited/constrained state spaces.

6.1.3 Constrained estimates

If neither the system models nor the filters are modified to accommodate the constraint, the last thing that can be done is to adjust the final estimate(s) produced by the unconstrained filter based on unconstrained system models. This can be done in two ways, either projecting the state space outside the constraint into the constrained area, or truncating the unconstrained conditional PDF of the state so that only the part residing in the constrained area is preserved and the remainder is set to zero.

The methods proposed by Julier and LaViola (2007), Ko and Bitmead (2007), and Kandepu et al. (2008) project the unconstrained estimate onto the constraint subspace by a projection function p(x t ) satisfying (refer to Eq. (14))

$$C(p({x_t})) = 0$$
((16))

for all values of x t .

The simplest projection approach is called ‘clipping’, which moves point estimates lying outside the constrained region to the boundary (Kandepu et al., 2008). In Ko and Bitmead (2007), the projected KF was extended from discrete time to continuous time and from linear constraints to nonlinear constraints. In Julier and LaViola (2007), the projection method was used twice: one to constrain the entire distribution and the other the statistics of the distribution. Simon (2010) analyzed three different ways by which the KF solution can be projected onto the state constraint surface.

Instead of revising the point-estimate with respect to the constraint, it is more theoretically sound to modify the conditional PDF of the state estimate, typically the first two moments of PDF. This is referred to as the truncation approach, in which the shape of the conditional PDF within the constrained region is preserved. This provides generally high-quality estimates with moderate computational demands (Teixeira et al., 2010). In this manner, linear (Simon, 2006) and nonlinear (Straka et al., 2012) inequality constraints were considered.

Nonlinear equality constraints differ from the linear case due to two sources of errors: truncation errors because of nonlinear transformation of PDF and base point errors because the filter linearizes around the estimated value of the state rather than the true value (Julier and LaViola, 2007). To overcome these deficiencies, the second-order TSE was used by Yang and Blasch (2009) to gain a higher accuracy than the first-order linearization, and the so-called ‘smoothly constrained KF’ was proposed (Geeter et al., 1997), which transforms hard constraints into soft ones and provides an exponential weighting term that progressively tightens the constraints.

Although the pseudo-observation and projection methods share the same property which allows projecting the state estimate to the constraint surface, they are qualitatively different. The former uses KF’s linear update rule. Therefore, it is linear and its parameters are chosen to minimize the MSE estimate. The latter can use any projection operator consistent with the constraint. Illustrations of both approaches can be found in Julier and LaViola (2007).

6.2 Circular statistics

Circular estimation is involved when the state or the observation is subject to periodic quantities such as angle, orientation, or direction. It exists in an enormous number of periodic phenomena. The shifted Rayleigh filter (Clark et al., 2007) is a moment matching algorithm that exploits the essential structure of the nonlinearities present in bearings-only tracking, and generates the exact posterior given a Gaussian prior. Instead of suboptimal constrained filtering that treats the periodic character as a constraint, the more reliable and systematic solution shall be based on circular/directional statistics; please refer to Kurz et al. (2016) for an excellent survey on circular Bayes filtering.

A straightforward projection of the standard one-dimensional (1D) Gaussian distribution to the circular state space is wrapping the Gaussian distribution around the unit circle and adding up all probabilities wrapped to the same point (Fig. 3), namely the ‘wrapped normal (WN) distribution’, of which the PDF can be immediately given as

$${p_{{\rm{WN}}}}(\theta ;\mu ,\sigma ) = {1 \over {\sqrt {2\pi } \sigma }}\sum\limits_{k = - \infty }^\infty {{\rm{exp}}\left( { - {{{{(\theta - \mu + 2k\pi )}^2}} \over {2{\sigma ^2}}}} \right)} ,$$
((17))

where the circular variable θ ∈ [0, 2π), k ∈ ℕ, and parameters for location (μ ∈ [0, 2π)) and for concentration (σ > 0) resemble the mean and standard deviations of the corresponding Gaussian distribution, respectively.

Fig. 3
figure 3

A wrapped normal distribution (red) is obtained by wrapping a Gaussian distribution (blue) with μ = 0 and σ = 2 around the unit circle (black) centralized around origin, where [x1, x2]T gives the position on the circle (References to color refer to the online version of this figure)

7 New thoughts

7.1 Limitations of HMM and alternatives

Despite their popularity, HMMs are believed to be poor for modeling speech due to the restrictive conditional independence assumption, including the Markovian state \(p({x_{0:t}}) = \) \(p({x_0})\prod\nolimits_{k = 1}^t p ({x_k}|{x_{k - 1}})\) and conditional independence of observations \(p({y_{0:t}}|{x_{0:t}}) = \prod\nolimits_{k = 0}^t p ({y_k}|{x_k})\).

Extensions have been sought to break through either limitation. The first is to introduce additional latent variables that allow more complex inter-state dependencies to be modeled, such as factor-analyzed HMM, switching linear dynamical systems (Rosti and Gales, 2003), and segmental models (Ostendorf et al., 1996). The second allows explicit dependencies between observations such as buried Markov models (Bilmes, 1999), mixed memory models (Saul and Jordan, 1999), trajectory-HMM (Zen et al., 2007), and conditional Markov chains (Bielecki et al., 2017), to name a few.

Different from the stochastic modeling of the state process, a series of non-sequential/optimization based estimation and forecasting methods, particularly in the area of chaotic systems and weather forecasting applications, have been presented (Judd and Stemler, 2009; Smith et al., 2010; Judd, 2015) to avoid the use of state transition noise v t in Eq. (1). In fact, similar deterministic Markov models have been applied in noise reduction methods (Kostelich and Schreiber, 1993), MHE (Michalska and Mayne, 1995), and the GN filter (Nadjiasngar and Inggs, 2013). Interestingly, Judd’s shadowing filter yields more reliable and even more accurate results than the Bayesian filters when nonlinearity is significant while the noise is largely observational (Judd and Stemler, 2009), or when the objects do not display any significant random motion at the length and the time scales of interest (Judd, 2015). The GN filter that models the state transition by a deterministic differential equation is proven to be Cramér-Rao consistent (yielding minimum variance) (Morrison, 2012). These approaches emphasize the deterministic part of the system and frame the estimation problem as optimization, which has advantages in dealing with constraints.

Highlight 9 The standard structure of recursive filtering is based on infinite impulse response (IIR); namely, all the observations prior to the present time have an effect on the state estimate at the present time. Therefore, the filter suffers from legacy errors.

As such, once a bias is made, due to whether erroneous modeling, outliers, or too much approximation, it can hardly be removed. Critically, the filter can diverge (namely deviate dramatically from the true signal) (Carrassi et al., 2017) due to the accumulation of underestimated errors. To combat this, several Kalman-like finite impulse response (FIR) estimators have been proposed (Kwon et al., 1999; Liang et al., 2004; Zhao et al., 2016a; 2016b), and proven to be superior to the standard KF in certain cases, such as when the noise covariances and initial conditions are not known exactly and the noise is not white. The FIR filter has a similar idea to MHE on limiting the use of legacy information.

Moreover, particularly in the context of target tracking, positioning, and localization, it is not so clear how to optimally use some important but fuzzy information such as a context ‘the trajectory is smooth’ or ‘the trajectory passes closely to x0 at time t0’. This type of information is akin to the aforementioned soft constraint (Simon, 2010); however, the difference is obvious: soft constraints are usually referred to as a condition that is exactly defined as in Eqs. (14)(16) but does not need to be fulfilled strictly while the fuzzy linguistic information addressed here prevents quantitative definition like so.

Given these considerations, Li et al. (2017c) proposed to use a trajectory function to replace HMM for describing the state function, i.e.,

$${x_t} = f(t),$$
((18))

where f(t) is a deterministic trajectory function of time t (FoT) defined in the state-time domain.

Considering that any trajectory can be represented by an FoT to an arbitrary accuracy, formulation (18) is quite general and versatile. Now, the state estimation problem is reformulated as a trajectory function estimation problem, which is finding a deterministic trajectory that best explains the time series observations in the underlying time-window [k1, k2] that may move forward or extend-in-size with time, conditioned on a priori model information. Once FoT estimate F(t) is obtained, the state at any time t in the effective fitting time window (EFTW) [K1, K2] (that does not have to be an integer) can be estimated, namely

$${\hat x_t} = F(t),\quad \forall t \in [{K_1},{K_2}],$$
((19))

where EFTW [K1, K2] at least covers sampling time window [k1, k2], namely K1 ≤ k1, k2K2.

To incorporate any model information, the trajectory function may be more precisely specified as \(F(t;{C_k}) \in {\mathfrak F}\), where \({\mathfrak F}\) limits itself to a finite set of specific functions, such as a set of polynomials of no more than three orders, and C k is the parameter set to be estimated at discrete time instant k (when new sensor data arrive). To be more precise, one may define a penalty factor Ω(C k ) on the model fitting error as a measure of the disagreement of the fitting function to the model constraint a priori, e.g.,

$$\Omega ({C_k}) \buildrel \Delta \over = \left\| {F({t_0};{C_k}) - {x_0}} \right\|,$$
((20))

which is to measure the mismatch between the fitting trajectory and known state x0 at time t0 given a priori, where ∥ab∥ is a measure of the distance between a and b such as the square error.

Then, combining observation function (2), prior constraint (20), and trajectory FoT Eq. (18) leads to an optimization problem for jointly minimizing the data and the model fitting errors, i.e.,

$$\mathop {{\rm{argmin}}}\limits_{F(t;{C_k})} \,\,\sum\limits_{t = {k_1}}^{{k_2}} {\left\| {{y_t} - {h_t}(F(t;{C_k}),{{\bar v}_t})} \right\|} {w_t} + \lambda \Omega ({C_k}),$$
((21))

where:

  1. 1.

    λ > 0 controls the trade-off between the data fitting error and the model fitting error.

  2. 2.

    w t is the weight assigned on the data at time t to account for the time-varying uncertainty, e.g., according to the covariance of v t if known. That is, in the LS sense \({\left\| {\,\,e\,\,} \right\|_{{w_t}}}: = {e^{\rm{T}}}{({\rm{Cov[}}{v_t}{\rm{]}})^{ - 1}}e\) is a Mahalanobis distance. Alternatively, a scalar fading factor can also be considered in the weight design, such as \({\left\| {\,\,e\,\,} \right\|_{{w_t}}}: = {\beta ^{k - 1}}\left\| {\,\,e\,\,} \right\|\,\,(0 < \beta < 1)\), to emphasize the newest data by assigning lower weights to history data.

  3. 3.

    \({\bar v_t}\) is a parameter to compensate for the observation error (if anything is known) and can be specified as the noise mean E[v t ] if known or otherwise as zero by assuming the sensor unbiased.

  4. 4.

    As default, k2 = k, ensuring that the newest observation data are used while k1 can either be fixed (i.e., the length of the time window [k1, k2] will increase with that of k2) or move with k2 (namely, a sliding time window).

One may observe that the key difference between formulation (21) and the Markov-Bayes optimization is that the former defines the ‘model error’ more flexibly. As an advantage, the FoT motion model (18) not only eases the restrictive independence assumption among time series states but also relaxes the chronological, uniform-incoming requirement posed on the observation series. As such, neither missing detection/delayed data, nor irregular sensor revisit frequency will be so challenging as in a Markov-Bayes estimator (Li et al., 2017c). More importantly, the fitting framework accommodates poor prior information on the target dynamics or even on the sensor observation statistics. However, how to obtain the statistical property of the estimate in these situations is still an open problem.

7.2 Filter evaluation: on computing speed

So far, we have omitted the computing complexity of different estimators, which, however, is key in many real-word applications. To set up a filter, we must be clear that the affordable filter iteration interval is determined by the duration between adjacent observations. That is, the filter updating speed must be higher or at least equal to the sensor revisit speed; otherwise, some sensor data will be missed/delayed.

When the filter updating speed is much faster than the sensor revisit speed, there will be some idle time at each filter iteration before the next sensor data arrives. This time can be used for additional computation such as smooth the estimate series made so far (Li et al., 2016b) by revising preceding estimates including the estimate that has just been made. Or more straightforwardly, adjust the filter a priori to properly include more computation (such as using higher order polynomial expansions or a larger number of sampling data-points, or jointly exploiting multiple filters for cooperation), to reduce the idle time while obtaining a better estimation. Here, we note that employing more complicated computation or even more information does not always mean a better accuracy benefit—recall the VIO example given in Appendix C. Interestingly, similar effect of ‘less-is-more’ appears in cognitive science (Gigerenzer and Brighton, 2009).

On the opposite, when the sensor revisit rate is higher than the filtering iteration rate, or high enough to always provide newest observation, it will be another story. In such a situation, a faster updating filter has the advantage of using more sensor data and suffering from smaller state transition uncertainty. For example, in real-time visual tracking based on a high speed video stream, the video can be divided into a sufficient number of frames. The more frames used, the fewer differences between successive frames. Both more frames and fewer process noises can help track the content in the video. All of these will very likely lead to the conclusion that a faster filter has a better estimation performance.

Unfortunately, computing speed is often treated as a pure engineering issue and is overlooked by theoretical scientists. Instead, different filters are usually compared and evaluated based on the same iteration rate, disregarding the real filter updating rate. These pure simulations may be beyond reproach, but the indication makes sense only in very limited real-world scenarios. In general, no matter whether the sensor revisit rate is high or low, it is unfair to force a computationally faster filter to wait (for a slower filter to have the same updating rate for comparison). It should always be updated as fast as possible for maximally and timely using more sensor data if possible, or additional calculation should be carried out, such as smoothing to improve its estimation (before new sensor data arrive). In either way, we assert that:

Highlight 10 (Computing speed matters) Disregarding this key issue may lead to endlessly seeking complicated modeling and/or filtering strategies for a fantastically better result, which may never come true in reality.

To illustrate this, we consider one case involved in sampling-based filters. In a common simulation setup as addressed above (i.e., setting all parameters disregarding the computing speed of the filter), more samples tend to yield a better estimation accuracy almost for sure. This, however, cannot be guaranteed at all in reality, since further increasing the number of samples will increase the computational load, reduce the filtering iteration speed, and therefore increase the state transition interval and the corresponding process noises. Even some sensor data may be missed when the filter updating rate turns out to be smaller than the sensor revisit rate. Finally, it may reduce the estimation accuracy more than it can improve. This fact will overturn the simulation indication. Bearing this in mind, it is not always a good idea to develop complicated filters, because it not only is computationally costly, but also may lead to no accuracy gain.

8 Conclusions and final remarks

Advances in time series parametric filters have been reviewed in four major categories, including nonlinearity (especially VIO nonlinear systems), multimodality (including GM filtering and MTT), intractable uncertainties (including unknown and non-Gaussian inputs/noise), and constraints. We pointed out that the key concept behind the work is AGC. A few important points have been given in highlights, as well as some of our thoughts on HMM and practical filter evaluation. To avoid overlap with existing review/surveys, several important topics such as noise estimation and circular statistics-based filters were not touched.

Instead of addressing some applications of these filters, we put our focus on the common and general theories and algorithm designs. However, we noted that efficient filter design should be based on the specific problem characteristics and requirements; e.g., estimation in robotics is very different to that in geosciences, and the problem of fault diagnosis is very different to that of target tracking. However, one thing is for sure: VIO plays progressively more important roles in all realms due to the revolutionary development of sensors and their massive deployment.

In addition, to avoid an over-wide discussion, another two major subfields regarding time series parametric filtering were not addressed either, including:

  1. 1.

    sensor network related distributed fusion and Bayesian filtering (Li et al., 2017b), in the presence of imperfect sensor data such as correlation and communication delay;

  2. 2.

    finite set statistics (Mahler, 2014) based multi-target filtering, especially regarding multisensor multi-target scenarios in the presence of mis-detection and false alarm.

These two topics are closely related and have gained increasing interest. In particular, the rapid development of sensors and their joint deployment, e.g., large-scale wireless sensor networks, provide a foundation for new paradigms to address the challenges that arise in harsh environments. As a consequence, the signal processing community starts to manifest increasing eagerness in novel data fusion/mining methods such as clustering, data fitting, and model learning, including the mentioned GP regression, for incorporating advanced statistical tools and rich sensor data to gain a substantial performance enhancement.