A Unification of Weighted and Unweighted Particle Filters

Particle filters (PFs), which are successful methods for approximating the solution of the filtering problem, can be divided into two types: weighted and unweighted PFs. It is well known that weighted PFs suffer from weight degeneracy and the curse of dimensionality. To sidestep these issues, unweighted PFs have been gaining attention, though they have their own challenges. The existing literature on these two types of PFs is based on distinct approaches. To establish a connection, we put forward a framework that unifies weighted and unweighted PFs in the continuous-time filtering problem. We show that the stochastic dynamics of a particle system described by a pair process, representing particles and their importance weights, should satisfy two necessary conditions in order for its distribution to match the solution of the Kushner--Stratonovich equation. In particular, we demonstrate that the bootstrap particle filter (BPF), which relies on importance sampling, and the feedback particle filter (FPF), which is an unweighted PF based on optimal control, arise as special cases of a broad class and that there is a smooth transition between the two. The freedom in designing the PF dynamics opens up potential ways to address the existing issues in the aforementioned algorithms, namely weight degeneracy in the BPF and gain estimation in the FPF.


1. Introduction.
1.1. Filtering problem. The goal of filtering is to dynamically estimate a latent variable from noisy observations. We consider the continuous-time nonlinear filtering problem (in one dimension for the sake of simplicity), where $X_t \in \mathbb{R}$ is the hidden process satisfying an Itô stochastic differential equation (SDE) and $Y_t \in \mathbb{R}$ is the observation process evolving according to an Itô SDE which depends on $X_t$:
\[ dX_t = f(X_t, t)\,dt + g(X_t, t)\,dB^X_t, \qquad X_0 \sim P_0, \tag{1.1} \]
\[ dY_t = h(X_t, t)\,dt + dB^Y_t, \qquad Y_0 = 0, \tag{1.2} \]
where $B^X_t, B^Y_t \in \mathbb{R}$ are independent Brownian motions (BMs) and $f(x,t)$, $g(x,t)$, and $h(x,t)$ are (known) functions that map $\mathbb{R} \times \mathbb{R}_{\ge 0} \to \mathbb{R}$ and are called the drift, diffusion, and observation function, respectively. The initial condition $X_0$ is independent of the BMs and has (known) distribution $P_0$ with finite second moment, and $Y_0 = 0$ (no observation at first).

Let $\mathcal{B}(\mathbb{R})$ be the Borel $\sigma$-algebra on $\mathbb{R}$. The filtering problem is to find the (regular) conditional distribution of the hidden process $X_t$ given the history of observations $\mathcal{F}^Y_t := \sigma(Y_s : 0 \le s \le t)$, that is, $P_t(B) := \Pr(X_t \in B \mid \mathcal{F}^Y_t)$ for any Borel subset $B \in \mathcal{B}(\mathbb{R})$. This distribution is referred to as the filtering distribution; in the statistics literature it is also called the posterior distribution, while $\Pr(X_t \in B)$ is called the prior distribution. We let $p(x,t)$ denote the probability density function corresponding to $P_t$ with respect to the Lebesgue measure, if it exists.
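To make the model concrete, the following is a minimal Euler-Maruyama sketch of a signal-observation pair. It assumes that the observation equation has unit-intensity noise, $dY_t = h(X_t,t)\,dt + dB^Y_t$; the function name `simulate_model`, the step count, and the parameter values are hypothetical choices for illustration only.

```python
import numpy as np

def simulate_model(f, g, h, x0, T=1.0, n_steps=1000, seed=0):
    """Euler-Maruyama discretization of the hidden state X and observation Y."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    y = np.empty(n_steps + 1)
    x[0], y[0] = x0, 0.0  # Y_0 = 0: no observation at first
    for n in range(n_steps):
        t = n * dt
        # Independent Brownian increments for the signal and the observation.
        dBx, dBy = rng.normal(0.0, np.sqrt(dt), size=2)
        x[n + 1] = x[n] + f(x[n], t) * dt + g(x[n], t) * dBx
        y[n + 1] = y[n] + h(x[n], t) * dt + dBy  # assumed unit observation noise
    return x, y

# Linear-Gaussian illustration with hypothetical parameters a, b, c.
a, b, c = -0.5, 1.0, 2.0
x, y = simulate_model(lambda x, t: a * x, lambda x, t: b, lambda x, t: c * x, x0=1.0)
```

The increments `np.diff(y)` can then serve as synthetic observation increments $dY_t$ for a discretized filter.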
Throughout the paper we assume that $f$, $g$ satisfy the conditions for the well-posedness of SDEs (e.g., locally Lipschitz in $x$ uniformly in $t$; see, e.g., [12, Theorem 5.4]). In addition, to ensure the existence of the filtering density $p$, we either assume that $g^2 \ge \delta > 0$ is bounded below by a constant and that $f \in C_b^{1,0}$, $g^2 \in C_b^{2,0}$, and $h \in C_b^{1,0}$, or we consider the linear-Gaussian case. Here $C_b^{\kappa,j}$ denotes the space of bounded continuous functions with bounded partial derivatives up to order $\kappa$ in $x$ and $j$ in $t$ (see [2, Theorems 7.11 and 7.17] for details on existence, uniqueness, and smoothness of $p$).
The stochastic processes in this paper are defined on a filtered probability space $(\Omega, \mathcal{F}, P, (\mathcal{F}_t)_{t \ge 0})$ (satisfying the usual conditions) and are assumed to be progressively measurable (hence adapted) with respect to the filtration $(\mathcal{F}_t)_{t \ge 0}$. We further denote by $L^\kappa(0,T)$ the space of processes $(F_t)_{0 \le t \le T}$ with $E[\int_0^T |F_t|^\kappa\,dt] < \infty$. All SDEs are in the Itô sense, and primes ($'$) denote the (partial) derivative with respect to $x$.

1.2. Formal solution.
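It is well known that the evolution of $p$ is described by the Kushner-Stratonovich equation (KSE) [14,23]
\[ dp(x,t) = \mathcal{L}_t^\dagger p(x,t)\,dt + p(x,t)\,\big(h(x,t) - \hat{h}_t\big)\big(dY_t - \hat{h}_t\,dt\big), \tag{1.3} \]
where $\mathcal{L}_t^\dagger\,\cdot = -\frac{\partial}{\partial x}\big(f(x,t)\,\cdot\big) + \frac{1}{2}\frac{\partial^2}{\partial x^2}\big(g^2(x,t)\,\cdot\big)$ is the adjoint Fokker-Planck operator and $\hat{h}_t := E[h(X_t,t) \mid \mathcal{F}^Y_t] = \int_{\mathbb{R}} h(x,t)\,p(x,t)\,dx$. The initial condition is assumed to be $p(x,0) = p_0(x) \in C^2$. The KSE consists of two parts: the first part is associated with the prior dynamics given by the Fokker-Planck equation, while the second part can be interpreted as a correction resulting from observations, which is proportional to the so-called innovation term $(dY_t - \hat{h}_t\,dt)$.

Equation (1.3) can be equivalently converted into the evolution of a given statistic using integration by parts. Let $\varphi \in C^2$ be such that $E[|\varphi(X_t)|] < \infty$ for all $t \ge 0$. The conditional expectation of $\varphi(X_t)$ evolves as
\[ dE[\varphi(X_t) \mid \mathcal{F}^Y_t] = E[\mathcal{L}_t\varphi(X_t) \mid \mathcal{F}^Y_t]\,dt + \big(E[\varphi(X_t)h(X_t,t) \mid \mathcal{F}^Y_t] - \hat{h}_t\,E[\varphi(X_t) \mid \mathcal{F}^Y_t]\big)\big(dY_t - \hat{h}_t\,dt\big), \tag{1.4} \]
where $\mathcal{L}_t\,\cdot = f(x,t)\frac{\partial}{\partial x}\,\cdot + \frac{1}{2}g^2(x,t)\frac{\partial^2}{\partial x^2}\,\cdot$ is the generator of the process $X_t$.

Example 1 (linear-Gaussian case). We will use this simple case throughout the paper as an illustration of key concepts. The linear-Gaussian case is characterized by linear drift terms and additive noise, as well as a Gaussian initial distribution:
\[ f(x,t) = ax, \qquad g(x,t) = b, \qquad h(x,t) = cx, \tag{1.5} \]
where $a, b, c \in \mathbb{R}$. In this case, the KSE (1.3) can be solved in closed form. We have $X_t \mid \mathcal{F}^Y_t \sim \mathcal{N}(\hat{\mu}_t, \hat{\rho}_t)$, given by the classical Kalman-Bucy result [11]
\[ d\hat{\mu}_t = a\hat{\mu}_t\,dt + c\hat{\rho}_t\,(dY_t - c\hat{\mu}_t\,dt), \tag{1.6} \]
\[ d\hat{\rho}_t = \big(2a\hat{\rho}_t + b^2 - c^2\hat{\rho}_t^2\big)\,dt. \tag{1.7} \]
The coefficient of the innovation term in (1.6) is called the Kalman gain $\hat{K}(t) := c\hat{\rho}_t$.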
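As an illustration of the linear-Gaussian case, the sketch below integrates the Kalman-Bucy mean equation (1.6) with an Euler scheme. It assumes the standard Riccati equation $d\hat\rho_t = (2a\hat\rho_t + b^2 - c^2\hat\rho_t^2)\,dt$ for the variance (unit observation noise); the function name `kalman_bucy`, the parameter values, and the placeholder data are hypothetical.

```python
import numpy as np

def kalman_bucy(dY, dt, a, b, c, mu0, rho0):
    """Euler discretization of the Kalman-Bucy mean and variance equations."""
    mu, rho = mu0, rho0
    mus, rhos = [mu], [rho]
    for dy in dY:
        gain = c * rho  # Kalman gain c * rho_t
        # Both updates use the values from the start of the step.
        mu, rho = (mu + a * mu * dt + gain * (dy - c * mu * dt),
                   rho + (2 * a * rho + b**2 - c**2 * rho**2) * dt)
        mus.append(mu)
        rhos.append(rho)
    return np.array(mus), np.array(rhos)

# Hypothetical parameters; the increments below are placeholder data (pure noise).
a, b, c, dt = -0.5, 1.0, 2.0, 1e-3
rng = np.random.default_rng(0)
dY = rng.normal(0.0, np.sqrt(dt), size=5000)
mus, rhos = kalman_bucy(dY, dt, a, b, c, mu0=0.0, rho0=1.0)

# Positive root of 2 a rho + b^2 - c^2 rho^2 = 0, the stationary variance.
rho_inf = (a + np.sqrt(a**2 + b**2 * c**2)) / c**2
```

The variance recursion does not depend on the data, so `rhos` converges to `rho_inf` regardless of the observation path; only the mean is driven by the innovation.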
In contrast to the linear-Gaussian case, for most signal and observation models (1.3) does not have a closed-form solution. Likewise, (1.4), when applied to the moments of the filtering distribution, gives rise to a closure problem in which the evolution of the $n$th moment $E[X_t^n \mid \mathcal{F}^Y_t]$ generally depends on higher-order moments. Therefore, the KSE is only a "formal" solution to the filtering problem and needs to be approximated numerically in practice. Among the numerical methods, particle filters (PFs) have been widely and successfully applied because of their versatility. Below, they will be presented within a broader framework; for more detailed surveys of PFs see, e.g., [2, section 8.6] and the tutorials [7,15].
1.3. Particle filters. These methods aim to approximate the filtering distribution by the empirical distribution of a particle system, which in full generality is a triangular array of random variables [5]
\[ \big\{\big(S_t^{(i,N)}, w_t^{(i,N)}\big)\big\}_{i=1,\dots,N}, \qquad N \in \mathbb{N}, \tag{1.8} \]
where $S_t^{(i,N)}$, for some fixed $t$, are samples, also called particles, and $w_t^{(i,N)}$ are their corresponding importance weights, which without loss of generality are assumed to be normalized. To justify such a method, one has to study the $N$-particle system and show that the sequence of empirical distributions converges (at least in a weak sense) to the filtering distribution as $N \to \infty$. In this article, however, we use a mean-field-limit approach; i.e., we study abstract "particle systems" characterized by a pair process denoted by $(S_t, W_t)$. The following definition is modified from [5, section 2.1].
Definition 1 (targeting condition). The particle system described by a pair process $(S_t, W_t)$, representing particles and their weights respectively, is said to target the filtering distribution in the filtering problem (1.1)-(1.2) at time $t$ if and only if
\[ E[W_t\,\varphi(S_t) \mid \mathcal{F}^Y_t] = E[\varphi(X_t) \mid \mathcal{F}^Y_t] \quad \text{a.s.} \tag{1.9} \]
for all measurable test functions $\varphi$ for which the expectations exist.

It should be noted that weights for a given $S_t$ are not unique. For instance, if the pair $(S_t, W_t)$ satisfies the targeting condition, then any pair $(S_t, W_t + V_t)$ is also a solution, where $V_t$ is a stochastic process independent of $S_t$ with zero conditional mean and finite second moment. In subsection 2.1, we further clarify this nonuniqueness.
To have the targeting condition over a period of time, i.e., a dynamic version of (1.9), we are interested in the time-evolution of $(S_t, W_t)$. As we shall see later in subsection 2.2, this naturally leads to McKean-Vlasov SDEs, in which the coefficients in the dynamics of $(S_t, W_t)$ become dependent on the targeted distribution. In numerical implementations, the left-hand side of (1.9) is approximated by a Monte Carlo estimate
\[ E[W_t\,\varphi(S_t) \mid \mathcal{F}^Y_t] \approx \frac{1}{N}\sum_{i=1}^N W_t^{(i)}\varphi\big(S_t^{(i)}\big), \qquad M := \sum_{i=1}^N W_t^{(i)}, \]
where $(S_t^{(i)}, W_t^{(i)})$, $i = 1,\dots,N$, are samples of the pair process and $M$ is of order $N$. If the samples are chosen appropriately (e.g., i.i.d.), the Monte Carlo estimate converges to the conditional expectation by a law of large numbers and yields asymptotic consistency of the PF. The normalized weights are then given by $w_t^{(i,N)} = W_t^{(i)}/M$. Thus, in theory, if we combine Definition 1 with appropriate sampling, we obtain an asymptotically exact filter. In practice, however, propagation of the samples requires the McKean-Vlasov terms to be estimated based on the current sample. This estimation problem is nontrivial and introduces correlations between samples, which complicates the convergence analysis. Although interesting and worthwhile, the issues of estimation and convergence are not within the scope of the present article. Instead, we focus on the characterization of abstract particle systems.
A particle filter is said to be unweighted if $W_t = 1$ for all $t$; otherwise, it is called weighted. Observe that, in particular, (1.9) implies (by setting $\varphi = 1$) that $E[W_t \mid \mathcal{F}^Y_t] = 1$ a.s., and due to the nonnegativity of the variance of $W_t$, we also have $E[W_t^2 \mid \mathcal{F}^Y_t] \ge 1$ a.s. If $W_t$ deviates significantly from unity, the Monte Carlo variance of the weighted average is larger than it would be for an average with equal weights. This can be measured by the effective sample size, a commonly used approximation of which (see, e.g., [17, section 2]) is given by
\[ N_{\mathrm{eff}} := \frac{1}{\sum_{i=1}^N \big(w_t^{(i,N)}\big)^2} = \frac{M^2}{\sum_{i=1}^N \big(W_t^{(i)}\big)^2}, \tag{1.10} \]
where $M$ denotes the sum of the unnormalized weights $W_t^{(i)}$. As explained before, by a law of large numbers we have $M/N \to 1$ as $N \to \infty$, and therefore
\[ \frac{N_{\mathrm{eff}}}{N} \to \frac{1}{E[W_t^2 \mid \mathcal{F}^Y_t]}. \tag{1.11} \]
Hence the effective sample size stays close to $N$ only if the conditional variance $E[W_t^2 \mid \mathcal{F}^Y_t] - 1$ remains close to zero. In order to assess the degeneracy of a PF algorithm, we are essentially interested in minimizing the unconditional variance of the importance weights, $\mathrm{Var}[W_t] = E[W_t^2] - 1$, taking all possible realizations of the observation process into account [6]. In subsection 2.1, we will establish the minimum-variance weight for a fixed particle distribution.

We now review two well-known examples of PFs within the framework above.

• The bootstrap particle filter (BPF). The BPF is a weighted PF that was originally introduced by [10] and is widely used in discrete-time filtering [7]. Here we present its continuous-time formulation (see, e.g., [2, chapter 9] or [15, section 6.1]). The particles in this filter move with the same law as the hidden process, thereby being distributed according to the prior. The weight dynamics must consequently include observations in such a way that the weighted particles are distributed according to the posterior. The evolution of the particle system denoted by $(S^B_t, W^B_t)$ reads as
\[ dS^B_t = f(S^B_t, t)\,dt + g(S^B_t, t)\,dB_t, \qquad S^B_0 \sim P_0, \tag{1.12} \]
\[ dW^B_t = W^B_t\,\big(h(S^B_t, t) - \hat{h}_t\big)\big(dY_t - \hat{h}_t\,dt\big), \qquad W^B_0 = 1, \tag{1.13} \]
where $B_t$ is a BM independent of $\{B^X_t, B^Y_t, X_0, S^B_0\}$ (note that in practice, particles are driven by independent BMs).
The usual derivation of the BPF is based on a change of probability measure, in which $W^B_t$ is the evaluation of the Radon-Nikodym derivative $dP/dQ$ with $X_t$ replaced by $S^B_t$, where $P$ is the (original) coupled measure of the system $(X_t, Y_t)$ and $Q$ is a new measure under which $X_t$ and $Y_t$ are independent, the dynamics of $X_t$ remains unchanged, and $Y_t$ corresponds to a BM, $dY_t = dB^Y_t$ [2, chapter 9]:
\[ W^B_t \propto \exp\Big(\int_0^t h(S^B_s, s)\,dY_s - \frac{1}{2}\int_0^t h^2(S^B_s, s)\,ds\Big), \tag{1.14} \]
normalized such that $E[W^B_t \mid \mathcal{F}^Y_t] = 1$. Girsanov's theorem and Itô's formula then give rise to (1.13). Implementing the dynamics (1.12)-(1.13) in practice is straightforward. However, the BPF suffers from weight degeneracy; that is, most of the weights become negligibly small and only a few of them remain significant, an issue which becomes even more severe in high dimensions [24]. A common practice to overcome this issue is to periodically resample the particles, in which case (1.12)-(1.13) merely describe the evolution of the particle system between the resampling times. Although resampling techniques are key ingredients of weighted PFs, they are not the focus of the current work. Instead, we attempt to find ways to alleviate the weight collapse itself.
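The BPF described above can be sketched in discrete time as follows. This is only an illustrative Euler scheme, not a definitive implementation: it assumes unit observation noise, uses the exponential (Girsanov-type) form of the weight increment followed by renormalization instead of a direct Euler step of the weight SDE, and estimates nothing beyond the normalizing constant from the sample. The names `bpf_step` and `effective_sample_size` and all parameter values are hypothetical.

```python
import numpy as np

def bpf_step(S, W, dy, dt, t, f, g, h, rng):
    """One time step of a bootstrap particle filter without resampling (sketch)."""
    h_vals = h(S, t)
    # Weight increment exp(h dY - h^2 dt / 2); keeps weights strictly positive.
    W = W * np.exp(h_vals * dy - 0.5 * h_vals**2 * dt)
    W = W / np.mean(W)  # renormalize so that the sample mean of W equals one
    # Particles follow the prior dynamics, each with its own Brownian increment.
    S = S + f(S, t) * dt + g(S, t) * rng.normal(0.0, np.sqrt(dt), size=S.shape)
    return S, W

def effective_sample_size(W):
    """(sum W)^2 / sum W^2, i.e., one over the sum of squared normalized weights."""
    return np.sum(W) ** 2 / np.sum(W ** 2)

# Hypothetical linear-Gaussian usage with a single observation increment.
rng = np.random.default_rng(0)
a, b, c = -0.5, 1.0, 2.0
S, W = rng.normal(size=500), np.ones(500)
S, W = bpf_step(S, W, dy=0.01, dt=1e-3, t=0.0,
                f=lambda s, t: a * s, g=lambda s, t: b, h=lambda s, t: c * s, rng=rng)
ess = effective_sample_size(W)
```

Monitoring `ess` over many steps shows the weight degeneracy discussed above: without resampling it typically decays well below the number of particles.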
• The feedback particle filter (FPF). The FPF is an unweighted PF that was introduced in [30], initially motivated by mean-field optimal control. In contrast to the BPF, this filter does not have weight dynamics. Instead, the particles must incorporate the observations and interact with each other so that they can target the filtering distribution by themselves. The key idea is to add a correction term, also called the control input, to the prior dynamics of particles,
\[ dS^F_t = f(S^F_t, t)\,dt + g(S^F_t, t)\,dB_t + U(S^F_t, t)\,dt + K(S^F_t, t)\,dY_t, \tag{1.15} \]
\[ W^F_t = 1, \tag{1.16} \]
and find the unknown functions $U$ and $K$ such that the conditional density of $S^F_t$ given $\mathcal{F}^Y_t$ solves the KSE (1.3). Here again the initial condition is $S^F_0 \sim P_0$ and $B_t$ is a BM independent of $\{B^X_t, B^Y_t, X_0, S^F_0\}$. The paper [30] shows that $U(x,t)$ and $K(x,t)$ under certain technical assumptions must satisfy
\[ \frac{\partial}{\partial x}\big(K p\big) = -\big(h - \hat{h}_t\big)\,p, \tag{1.17} \]
\[ U = -\frac{1}{2}K\big(h + \hat{h}_t\big) + \frac{1}{2}K K'. \tag{1.18} \]
The main challenge is to find the so-called gain function $K$, which in turn depends on $p$. In the multidimensional case, (1.17) does not have a unique solution because any solution $K$ can generate another solution by adding a divergence-free vector field. A commonly used solution is obtained (uniquely) by restricting $K$ to be of gradient form [28], where (1.17) then becomes a weighted Poisson equation. Different assumptions on $K$ are one aspect of how different PFs arise. Although we are aware of similar filters, as noted in [18], that require solving equations like (1.17) and have been referred to as "particle flow filters," we refer to the algorithm above as the FPF. While fixing $K$ to be in gradient form is useful to pick a gain in practice and makes the boundary value problem accessible to a range of numerical approximations such as the RKHS method [19], its necessity is not well justified from a theoretical perspective. This lack of justification is especially striking on smooth manifolds without a Riemannian metric given a priori, where the gradient field depends on the chosen metric [1,25]. One final point to notice is that the FPF does not require any resampling procedure, which is an advantage compared to the BPF. Moreover, [24] shows numerically that the FPF is less prone to the curse of dimensionality (COD) than the BPF.

Table 1. A comparison between the BPF and the FPF as they have been treated in the literature so far. Each has its strengths (+) and weaknesses (−).
\begin{tabular}{ll}
Bootstrap particle filter [10] & Feedback particle filter [30] \\
a weighted particle filter & an unweighted particle filter \\
based on change of probability measure & motivated by optimal control \\
asymptotically exact (+) & asymptotically exact (+) \\
suffers from weight degeneracy (−) & requires gain estimation (−) \\
exhibits the COD (−) & potentially avoids the COD (+)
\end{tabular}
Table 1 summarizes the comparison between the BPF and FPF.
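For comparison with the BPF, the sketch below shows an FPF in the linear-Gaussian case only. It assumes that, for a Gaussian density with variance $\rho$, the gain equation is solved by the constant $K = c\rho$ and that the Itô drift correction is $U = -\frac{1}{2}K(h + \hat h_t)$ (the $KK'$ term vanishes for constant $K$); mean and variance are replaced by their empirical estimates. The name `linear_fpf` and all parameter values are hypothetical.

```python
import numpy as np

def linear_fpf(dY, dt, a, b, c, S0, rng):
    """Feedback particle filter for f = a x, g = b, h = c x (Gaussian closure, sketch)."""
    S = S0.copy()
    for dy in dY:
        mu, rho = S.mean(), S.var()  # empirical estimates of the mean-field terms
        K = c * rho                  # gain, constant in x for a Gaussian density
        dB = rng.normal(0.0, np.sqrt(dt), size=S.shape)
        # Ito form: K dY + U dt with U = -K (h + h_hat) / 2, here h + h_hat = c (S + mu).
        S = S + a * S * dt + b * dB + K * (dy - 0.5 * c * (S + mu) * dt)
    return S

rng = np.random.default_rng(1)
a, b, c, dt, n_steps = -0.5, 1.0, 2.0, 1e-3, 3000
x, dY = rng.normal(), np.empty(n_steps)
for n in range(n_steps):  # synthetic observation increments from a simulated signal
    dY[n] = c * x * dt + rng.normal(0.0, np.sqrt(dt))
    x += a * x * dt + b * rng.normal(0.0, np.sqrt(dt))

S = linear_fpf(dY, dt, a, b, c, rng.normal(size=2000), rng)
rho_inf = (a + np.sqrt(a**2 + b**2 * c**2)) / c**2  # stationary Kalman-Bucy variance
```

No weights appear anywhere: the observations enter through the particle dynamics, and the empirical variance `S.var()` should settle near `rho_inf` up to sampling error.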

1.4. Motivation and contribution.
So far we have seen two well-known PFs, each of which was originally based on a distinct approach, yet they both satisfy the targeting condition (Definition 1). Given the comparison between these methods, it is still unknown whether combining the strengths of both types is possible. As a step towards this goal, we provide a unified treatment of a large family of weighted and unweighted PFs that is based solely on the targeting condition. Specifically, we do the following:
• We first characterize the weights $W_t$ in terms of a Radon-Nikodym derivative of the marginal distributions over $S_t$ and find the optimal (i.e., minimum-variance) weight $W^*_t$ for a fixed particle distribution (see Theorem 2). This result also sheds some light on the nonuniqueness of importance weights in particle filtering.
• We then introduce a general dynamics for the particle system $(S_t, W_t)$ expressed as a system of SDEs and obtain necessary conditions on its coefficients for targeting the filtering distribution (see Theorem 3, referred to as the "unifying theorem"). The necessary conditions are a system of ordinary differential equations (ODEs) in one dimension, which becomes a system of partial differential equations (PDEs) in higher dimensions. The results also hold in the unconditional case (see Corollary 4).
• As a result of the unifying theorem, we derive a class of PFs which encompasses the BPF and FPF with a smooth transition between them, thereby indicating that these methods are not different in their nature (see Proposition 5).
• The optimal importance weight $W^*_t$ from the first theorem is studied in the context of the unifying theorem, and its evolution $dW^*_t$ is derived (see Proposition 6).
• Finally, we outline two potential applications of the unifying theorem in section 3, namely compensating for gain estimation errors with weight dynamics and providing freedom to alleviate the weight degeneracy.
For the sake of simplicity, we develop the results in the one-dimensional setting. In principle, the results can be generalized to the multidimensional case; however, further consideration is required, as more freedom emerges in higher dimensions.

1.5. Related work.
A unifying framework for discrete- and continuous-time filtering (as a Bayesian formulation of the data assimilation problem) from the perspective of couplings, optimal transport, and Schrödinger bridges is proposed in [20]. Similarly, in [18], three types of unweighted PFs, including the FPF, are unified in the framework of McKean-Vlasov SDEs. However, to the best of our knowledge, a unification of weighted and unweighted approaches such as this one has not been attempted. This question is also loosely connected to the concept of proposal distributions in discrete-time particle filters. From the perspective of proposal distributions in the sequential Monte Carlo (SMC) sampling literature, the BPF can be viewed as the filter for which the proposal is equal to the prior transition density of the hidden state. Nevertheless, it is unclear how to view an optimal proposal distribution as in [6, section D] in relation to continuous-time filters, in particular the FPF. It was pointed out in [24, section 3.2] that the optimal proposal becomes trivial in the continuous-time limit. The same paper also posed the broader question of how to reconcile the dynamics of the FPF with the importance weights as Radon-Nikodym derivatives of path measures. The present paper answers this question by adopting a change-of-measure approach that generalizes the path-measure framework adopted in the literature on continuous-time PFs.
This paper also touches on the notion of nonuniqueness in designing PFs, which has been discussed in the literature but mainly restricted to the linear-Gaussian case and freedom of the particle movements. For example, [1] provides a systematic exploration of the nonuniqueness within the class of linear FPF in terms of gauge transforms and [27] studies the nonuniqueness of the feedback control law in particle dynamics for different types of ensemble Kalman filters. The aforementioned paper [18] also explores the nonuniqueness in unweighted PFs and gives a general formulation from which existing filters can be obtained as special cases by making specific assumptions on the form of the coefficients. In this work, we examine the nonuniqueness in a more general setting that includes weight dynamics.
2.1. A characterization of weights in terms of the Radon-Nikodym derivatives. The approach used to derive the importance weights in the BPF (1.14) is difficult to reconcile with filters such as the FPF, in which observation terms are included in the particle dynamics. This explicit $dY$ term makes the measure of the particles singular with respect to the measure of the hidden process (see [24, section 3.2] for a discussion of this issue). Here we demonstrate a general relationship between the process $W_t$ and the Radon-Nikodym derivative of the marginal distributions over $S_t$, which is consistent with both the BPF and FPF, and we also provide the minimum-variance choice for $W_t$ (for a fixed particle distribution).
Theorem 2. Let $(S_t, W_t)$ be a pair process characterizing an abstract particle system. Denote by $Q_t$ the conditional distribution of $S_t$ given $\mathcal{F}^Y_t$. At any time $t$, under the condition that the particle system described by $(S_t, W_t)$ targets $P_t$, according to Definition 1, we have the following:
I. The distribution $P_t$ is absolutely continuous with respect to the distribution $Q_t$. In particular, the Radon-Nikodym derivative $dP_t/dQ_t$ exists, and
\[ E[W_t \mid S_t, \mathcal{F}^Y_t] = \frac{dP_t}{dQ_t}(S_t) \quad \text{a.s.} \tag{2.1} \]
II. The weight $W^*_t := \frac{dP_t}{dQ_t}(S_t)$ has minimum variance among all weights $W_t$ for which $(S_t, W_t)$ targets $P_t$.

The proof is given in subsection 5.1, and by reversing the arguments in the proof it is clear that (2.1) is also sufficient to guarantee the targeting condition (1.9). Notice also that if $P_t$ and $Q_t$ have densities $p(x,t)$ and $q(x,t)$ with respect to the Lebesgue measure, then (2.1) turns into
\[ E[W_t \mid S_t, \mathcal{F}^Y_t] = \frac{p(S_t, t)}{q(S_t, t)} \quad \text{a.s.} \tag{2.2} \]
As we observed in subsection 1.3, the importance weights are not unique. It is now also evident from (2.1) that there are many solutions which are not necessarily optimal. A large class of suboptimal choices for $W_t$ is of the form $(d\tilde{P}_t/d\tilde{Q}_t)(Z_t, S_t)$, where $\tilde{P}_t$ and $\tilde{Q}_t$ are respectively the joint conditional (on $\mathcal{F}^Y_t$) distributions of $(Z_t, X_t)$ and $(Z_t, S_t)$, with $Z_t$ being some arbitrary $\mathcal{F}_t$-measurable process such that the Radon-Nikodym derivative exists, and the second marginals of $\tilde{P}_t$ and $\tilde{Q}_t$ agree, respectively, with $P_t$ and $Q_t$. For example, we can let $Z_t = S_0$, which takes some information from the past into account. The weights in the BPF (1.14) are also of this form, as by the disintegration of measures and independence of $(X_t)_{t \ge 0}$ and $(Y_t)_{t \ge 0}$ under the measure $Q$, [...] and hence the conditional expectation of the (normalized) Radon-Nikodym derivative of the full measures (after replacing $X_t$ by $S^B_t$) [...] which $W_t$ as a pathwise Radon-Nikodym derivative does not make sense.

Theorem 2 says that the conditional expectation of $W_t$ can always be interpreted as a Radon-Nikodym derivative and is a version of (i.e., a.s. equal to) the density of the filtering distribution $P_t$ with respect to the particle distribution $Q_t$. It also shows that in the BPF, $W^B_t$ is different from the optimum $W^*_t$ given that we fix the particle dynamics to the prior dynamics. This is not surprising in view of the well-known degeneracy problem of the BPF (see [24] and the references therein).
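The nonuniqueness of the weights and the role of the density ratio can be illustrated in a static setting. The sketch below is a hypothetical example, not taken from the paper: particles are drawn from $Q = \mathcal{N}(0, 2)$, the target is $P = \mathcal{N}(1, 1)$, and two valid weights are compared, the density ratio $p/q$ evaluated at the particles and the same weight perturbed by independent zero-mean noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def gauss_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical static example: particles S ~ Q = N(0, 2), target P = N(1, 1).
S = rng.normal(0.0, np.sqrt(2.0), size=N)
W_star = gauss_pdf(S, 1.0, 1.0) / gauss_pdf(S, 0.0, 2.0)  # density ratio p/q at S

# Another weight satisfying the same targeting identity: add noise V that is
# independent of S and has zero mean (individual weights may become negative).
V = rng.normal(0.0, 0.5, size=N)
W_noisy = W_star + V

phi = lambda x: x ** 2
est_star = np.mean(W_star * phi(S))    # estimates E_P[X^2] = 1 + 1 = 2
est_noisy = np.mean(W_noisy * phi(S))  # estimates the same quantity
var_star, var_noisy = W_star.var(), W_noisy.var()
```

Both estimators are consistent for the same expectation, but the perturbed weight has a strictly larger variance, which is the sense in which the density ratio is the preferred choice for a fixed particle distribution.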
2.2. The unifying theorem. Considering the pair process $(S_t, W_t)$, which represents particles and their importance weights, respectively, as explained in subsection 1.3, we are trying to find conditions on the dynamics of $(S_t, W_t)$ that allow the particle system to target the filtering distribution, as defined in Definition 1. Here we make an ansatz that the stochastic dynamics of the particle system takes the following general form:
\[ dS_t = u(S_t, t)\,dt + k(S_t, t)\,dY_t + v(S_t, t)\,dB_t, \tag{2.3} \]
\[ dW_t = W_t\,\big(\gamma(S_t, t)\,dt + \varepsilon(S_t, t)\,dY_t + \zeta(S_t, t)\,dB_t\big), \tag{2.4} \]
where $B_t$ is a BM. From this starting point, our goal is to find the conditions that the unknown functions $\{u, k, v, \gamma, \varepsilon, \zeta\}$ should satisfy. To derive the results, certain assumptions are required as listed below:
(i) Regularity conditions ensure well-posedness of the system of SDEs (2.3)-(2.4), together with (1.1)-(1.2) (see, e.g., [12, Theorem 6.30]).
(ii) The initial condition is $S_0 \sim P_0$ and $W_0 = 1$.
The second key result of our work is the following:

Theorem 3 (unifying theorem). Consider the filtering problem (1.1)-(1.2) with filtering density $p(x,t)$, which satisfies the KSE (1.3). Let $(S_t, W_t)$ be a pair process, representing particles and their importance weights, respectively, that evolves according to the dynamics (2.3)-(2.4) under the assumptions (i)-(v). Then if the particle system described by $(S_t, W_t)$ targets the filtering distribution for all $0 < t < T$, according to Definition 1, the functions $\{u(x,t), k(x,t), v(x,t), \gamma(x,t), \varepsilon(x,t), \zeta(x,t)\}$ satisfy the following equations for all $0 \le t < T$:
\[ \frac{\partial}{\partial x}\big(k p\big) = -\big(h - \hat{h}_t - \varepsilon\big)\,p, \tag{2.5} \]
\[ \frac{\partial}{\partial x}\big((u - f + k\varepsilon + v\zeta)\,p\big) - \frac{1}{2}\frac{\partial^2}{\partial x^2}\big((k^2 + v^2 - g^2)\,p\big) = \big(\gamma + \hat{h}_t(h - \hat{h}_t)\big)\,p; \tag{2.6} \]
in addition, the functions $\{\varepsilon(x,t), \gamma(x,t)\}$ have zero mean under the filtering distribution for all $0 \le t < T$, that is,
\[ \int_{\mathbb{R}} \varepsilon(x,t)\,p(x,t)\,dx = 0, \qquad \int_{\mathbb{R}} \gamma(x,t)\,p(x,t)\,dx = 0. \tag{2.7} \]
The proof appears in subsection 5.2. In short, the results follow from the fact that the terms multiplying $dY_t$ and $dt$ on both sides of $dE[W_t\varphi(S_t) \mid \mathcal{F}^Y_t] = dE[\varphi(X_t) \mid \mathcal{F}^Y_t]$ should be equal a.s., regardless of $\varphi$. In particular, considering $\varphi = 1$ leads to the last statement (2.7), while considering the class of compactly supported test functions $\varphi \in C_k^2$ yields the system of ODEs (2.5)-(2.6), which become PDEs in higher dimensions. Note that through these equations, the coefficients in the particle system dynamics depend on the targeted distribution. This means that the system of SDEs (2.3)-(2.4) is of McKean-Vlasov type. Recall that based on formula (1.9), the distribution targeted by a particle system is obtained by evaluating the left-hand side of this formula with an indicator function as a test function $\varphi$.
Compared to the BPF (1.12)-(1.13) and FPF (1.15)-(1.16), our ansatz (2.3)-(2.4) is clearly more general in several aspects, for example the presence of dB t and dY t in both dynamics. Thus, we may refer to this approach as "hybrid particle filter." As we shall show in subsection 2.4, the presence of dB t in the weight process actually decreases its variance. Here the particle dynamics is still supposed to not involve W t explicitly and the coefficients in the weight dynamics are intentionally chosen to be linear in W t . This choice has an advantage, as can be seen in the proof of the theorem. Specifically, it allows us to use the targeting assumption in order to convert conditional expectations appearing in dE[W t φ(S t )|F Y t ] to posterior expectations. The theorem above, while providing only necessary conditions for targeting the filtering distribution, sheds light on the freedom in choosing the coefficients of the particle and weight dynamics. It is easy to verify that the setting {u = f , k = 0, v = g, ζ = 0} yields the BPF (1.12)-(1.13). We demonstrate in subsection 2.3 that the FPF also satisfies the necessary conditions. Note that the first equation (2.5) is similar to the gain equation (1.17) in the FPF except the extra term εp, which arises here due to the nonzero weight dynamics. This freedom might help us compensate for the gain estimation errors with weight dynamics, as will be outlined in subsection 3.1.
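The claim that the BPF and the FPF satisfy the necessary conditions can be checked by hand. The following worked sketch assumes that the ansatz has the form $dS_t = u\,dt + k\,dY_t + v\,dB_t$, $dW_t = W_t(\gamma\,dt + \varepsilon\,dY_t + \zeta\,dB_t)$ and that the conditions (2.5)-(2.6) take the form obtained by matching the $dY_t$ and $dt$ terms against the KSE; this form is a reconstruction under those assumptions and may differ in presentation from the statement of Theorem 3.

```latex
% Assumed form of the necessary conditions (reconstruction):
\begin{align*}
\partial_x (k p) &= -(h - \hat h_t - \varepsilon)\, p, \\
\partial_x\big((u - f + k\varepsilon + v\zeta)\, p\big)
  - \tfrac12\, \partial_x^2\big((k^2 + v^2 - g^2)\, p\big)
  &= \big(\gamma + \hat h_t (h - \hat h_t)\big)\, p .
\end{align*}
% BPF: u = f, k = 0, v = g, \zeta = 0.
% The first equation gives \varepsilon = h - \hat h_t; the left-hand side of the
% second vanishes, so \gamma = -\hat h_t (h - \hat h_t). Hence
\begin{equation*}
dW_t = W_t\, (h - \hat h_t)\,(dY_t - \hat h_t\, dt).
\end{equation*}
% FPF: \varepsilon = \gamma = \zeta = 0, v = g, k = K with \partial_x (K p) = -(h - \hat h_t) p.
% Using \partial_x (K^2 p) = K K' p - K (h - \hat h_t) p and integrating the second
% equation once (boundary terms assumed to vanish):
\begin{align*}
(u - f)\, p &= \tfrac12\, \partial_x (K^2 p) + \hat h_t\, \partial_x^{-1}\!\big[(h - \hat h_t) p\big]
             = \tfrac12 K K' p - \tfrac12 K (h - \hat h_t)\, p - \hat h_t K p, \\
u &= f - \tfrac12 K (h + \hat h_t) + \tfrac12 K K' .
\end{align*}
```

Under these assumptions the BPF weight equation and the Itô form of the FPF control term are both recovered from the same pair of conditions.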
We close this subsection by pointing out that our results also hold in the unconditional setting, i.e., for a particle system targeting the solution of the Fokker-Planck equation. In particular, if we set h = 0, then the observation process Y t does not provide any information about X t and the KSE (1.3) reduces to the Fokker-Planck equation. The corollary below follows immediately from Theorem 3 and shows that even in this setting, there exists intrinsic freedom in constructing the dynamics of the particle system while keeping its distribution invariant.
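Corollary 4. Consider the stochastic process $X_t$ satisfying the SDE (1.1), and let $\bar{p}(x,t)$ denote the probability density function of $X_t$, which satisfies the Fokker-Planck equation. Let $(S_t, W_t)$ be a pair process, representing particles and their importance weights, respectively, that evolves according to the dynamics below:
\[ dS_t = u(S_t, t)\,dt + v(S_t, t)\,dB_t, \qquad dW_t = W_t\,\big(\gamma(S_t, t)\,dt + \zeta(S_t, t)\,dB_t\big), \]
under the assumptions (i)-(v), if applicable. Then if the particle system described by $(S_t, W_t)$ targets the distribution of $X_t$ for all $0 < t < T$, the functions $\{u(x,t), v(x,t), \gamma(x,t), \zeta(x,t)\}$ satisfy the following equation for all $0 \le t < T$:
\[ \frac{\partial}{\partial x}\big((u - f + v\zeta)\,\bar{p}\big) - \frac{1}{2}\frac{\partial^2}{\partial x^2}\big((v^2 - g^2)\,\bar{p}\big) = \gamma\,\bar{p}; \]
in addition, the function $\gamma(x,t)$ satisfies $\int_{\mathbb{R}} \gamma(x,t)\,\bar{p}(x,t)\,dx = 0$ for all $0 \le t < T$.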

2.3. A class of particle filters.
Our goal now is to introduce a class of PFs within the result of Theorem 3 that encompasses the BPF as well as the FPF. This demonstrates how these seemingly different methods can be derived from the same framework. Observe that in (2.5), ε = h −ĥ t implies k = 0 while ε = 0 yields (1.17). Thus, a choice for ε that linearly interpolates between h −ĥ t and 0 makes it possible to have a smooth transition between the BPF and FPF, though it does not simplify the gain equation. The next proposition states the result.
Proposition 5. Under the same assumptions as in Theorem 3, and assuming that (2.5) holds, a particular solution to (2.6) is given by the following class: whereε ∈ C 1,0 is an arbitrary function with zero mean under the filtering distribution such that ε meets the requirements of (i), {α, β, η} are free parameters in R, either constant or continuously time-varying, and ϑ 1 , ϑ 2 are defined as follows: The proof is simple, and it is given in subsection 5.3. In particular, if in the class above, we fix {ε = 0, ζ = 0, β = 0} and let the others {α, η} be free, we obtain a subclass that interpolates the BPF (when η = 1) and the FPF (when η = 0, α = 0). The parameter η can be interpreted as the "observation parameter," which determines how much the observation process is incorporated into the particle or weight dynamics. We refer to α as the "drift parameter," which only appears in drift functions u, γ. Last, we call β the "diffusion parameter" since it controls the magnitude of the diffusion coefficient v. Among these, η is the most relevant one as far as filtering is concerned.
It should be noted that not all parameter choices for {α, β, η} give rise to a "practical" particle filter. In practice, one has to check other criteria, for example the nondegeneracy of the particle distribution and the stability of the system. This can be seen explicitly in the linear-Gaussian case below.
Example 3 (linear-Gaussian case, continued). Consider the three-parameter {α, β, η} subclass of particle filters given by Proposition 5 after settingε = 0 and ζ = 0 for the linear-Gaussian setting (1.5) with the filtering distribution (1.6)-(1.7). Then the (unweighted) particle distribution reads as The derivation is briefly explained in subsection 5.4. Observe how particle distribution interpolates between the prior distribution (when η = 1) and the posterior distribution (when η = 0, α = 0). It is easy to confirm that in the latter case, where β remains as the only free parameter, the PF resulting from Proposition 5 corresponds to the unweighted linear PF stated in [1, equation 17], with one BM, if it is indeed rewritten in terms of v. Notably, v = 0 recovers the deterministic linear FPF introduced in [26].
2.4. Stochastic differential of the optimal weight. We saw in subsection 2.1 that weights for a fixed particle distribution $Q_t$ are not unique and that $W^*_t := \frac{dP_t}{dQ_t}(S_t)$ is the one that minimizes the variance. In the context of the unifying theorem, which assumes additional constraints regarding the time-evolution of the particle system, this nonuniqueness means that if we fix the particle dynamics, there are many possibilities for the weight dynamics. In other words, if we fix the functions $\{u, k, v\}$, we are then left with three unknowns $\{\gamma, \varepsilon, \zeta\}$ but only two equations (2.5)-(2.6) to constrain them. Here we demonstrate that for each choice of the dynamics of $S_t$ according to (2.3), the dynamics of the optimal weight $W^*_t$ also takes the form of (2.4), whose coefficients, denoted by $\{\gamma^*, \varepsilon^*, \zeta^*\}$, are given by the proposition below.
The proof, which appears in subsection 5.5, has a straightforward idea, but requires lengthy calculations to find the stochastic differential of $W^*_t := \frac{p(S_t, t)}{q(S_t, t)}$ based on the Kunita-Itô-Wentzell formula (see [13, Theorem 1.1]), as the functions $p(x,t)$ and $q(x,t)$ satisfy SPDEs and their ratio is evaluated at $S_t$, which solves an SDE.
The presence of the noise term $\zeta^*(S_t, t)\,dB_t$ in (2.18) might appear counterintuitive in view of the goal to minimize the variance of $W_t$. Indeed, adding an independent BM term would only increase the variance. However, the BM appearing in (2.18) is the same as that driving the process $S_t$, introducing correlation between $W^*_t$ and $S_t$, which consequently helps $W^*_t$ to achieve the minimal variance possible given the dynamics of $S_t$. In particular, if we restrict the particle evolution to the prior dynamics (i.e., $u = f$, $k = 0$, $v = g$), then Proposition 6 results in
\[ dW^*_t = W^*_t\,\big[\big(h(S_t, t) - \hat{h}_t\big)\big(dY_t - \hat{h}_t\,dt\big) + \lambda^*(S_t, t)\,dt + \zeta^*(S_t, t)\,dB_t\big], \tag{2.20} \]
where $\zeta^*$ is given by (2.19) with $v = g$ and $\lambda^*$ is the remaining drift coefficient [...] (2.21). Notice that the presence of an additional term in (2.20) compared to the weight process in the BPF (1.13) indicates that the importance sampling used in the BPF is not optimal. Nonetheless, this correction term is not easy to evaluate in practice since the functions $\lambda^*$, $\zeta^*$, which involve the densities $p$, $q$ explicitly, must be estimated from samples. In the linear-Gaussian case, however, the analytical computation is possible.
Example 4 (linear-Gaussian case, continued). Let the particle process evolve as the prior dynamics, and denote by N (µ t , ρ t ) its (unweighted) distribution. Then (2.20)-(2.21) can be computed analytically and yield the following PF:

3. Applications.
3.1. Compensating for gain approximation with weight dynamics. The gain function $K$ in the FPF is a solution to (1.17), which is fixed; i.e., it depends only on the model. In the unifying theorem, however, the function $k$, which plays a role similar to that of $K$, is a solution to (2.5), which now contains a term $\varepsilon$ that can be chosen freely. Observe that $\frac{\partial}{\partial x}\big((k - K)p\big) = \varepsilon p$, which means that any deviation of $k$ from $K$ in the particle dynamics is compensated by $\varepsilon$ in the weight dynamics, so that the particle system can still target the filtering distribution. Recall from condition (2.7) that the mean of $\varepsilon$ is zero under the filtering distribution. The variance of $\varepsilon$ under the filtering distribution can therefore serve as an "error" measuring the difference between $k$ and $K$. Moreover, smaller values of $\varepsilon$ are preferable: to keep the weights close to unity, the coefficients in the weight dynamics, including $\varepsilon$, should remain as close to zero as possible.
There are now two approaches to exploiting this freedom in (2.5): (1) first set $\varepsilon$ and then solve the equation for $k$, or (2) set $k$ and find $\varepsilon$ afterwards. In (1), the presence of $\varepsilon$ allows us to modify the equation to some extent, which might help us to use a simpler gain estimation method and compensate for it by an appropriate weight dynamics. We leave this possibility for future research. In (2), which we explore in more detail, instead of setting $k$ explicitly, we may restrict ourselves to a specific class of functions denoted by $\mathcal{K}$ and pose the variational problem (3.1) for a fixed $t$ and $p$. The second constraint in (3.1) comes from (2.5), and $k$ must also satisfy the regularity assumptions required for Theorem 3. The objective functional $I$ can be interpreted as the square of the Fisher-Rao norm $\|\dot p\|_{FR}^2$ of the fictional change in the distribution corresponding to the continuity equation $\dot p + \frac{\partial}{\partial x}\big((k - K)p\big) = 0$ (here we omit the time dependence of $k$ and $K$, and the dot relates to a fictional time associated with the flow of the fixed vector field $k - K$). This contrasts with the use of the infinitesimal Wasserstein-2 or Otto's norm $\|\dot p\|_{W_2}^2 = \int (k(x) - K(x))^2 p(x)\,dx$ (see [16]). As an example, we consider the case when $\mathcal{K}$ is the set of constant functions. In this case, the minimization of $\|\dot p\|_{W_2}^2$ over $\mathcal{K}$ gives rise to the standard constant gain approximation in the FPF (see [28, Example 2 and Remark 5]), and analogously, the problem (3.1) admits a simple analytical solution, which we present in the proposition below and call the Fisher-optimal constant gain for the aforementioned reasons.
Proposition 7 (Fisher-optimal constant gain approximation). Consider (2.5) in Theorem 3, and let $\mathcal{K}$ be the set of functions independent of $x$ with elements $\bar k(t)$, i.e., $\bar k(t) \in \mathcal{K} = \{k(x, t) : k'(x, t) = 0 \ \forall x \in \mathbb{R}\}$. Then the solution $\bar k^*(t)$ to the variational problem (3.1) at time $t$ is given by (3.2), and the corresponding $\varepsilon$, which can be determined only up to a $P_t$-null set, is $\varepsilon^*(x, t) := h(x, t) - \hat h_t + \bar k^*(t)\,\psi(x, t)$, where we have defined $\psi(x, t) := p'(x, t)/p(x, t)$ over the interior of the support of $p$ and $\psi(x, t) := 0$ otherwise.
The proof is given in subsection 5.6. Note that $E[\psi^2(X_t, t) \mid \mathcal{F}_t^Y]$ is itself a Fisher information, namely that of the one-parameter model $p_\theta(x, t) = p(x - \theta, t)$ at $\theta = 0$.
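The minimization behind Proposition 7 can be sketched in a few lines. The sketch below is a reconstruction for illustration, not the proof from subsection 5.6. It assumes that (2.5) reads $\frac{\partial}{\partial x}(kp) = -(h - \hat h_t - \varepsilon)p$ (which is consistent with the expression for $\varepsilon^*$ above), that the objective in (3.1) is $I = E[\varepsilon^2(X_t, t) \mid \mathcal{F}_t^Y]$, and that boundary terms vanish when integrating by parts. All expectations are conditional on $\mathcal{F}_t^Y$.

```latex
\begin{align*}
% constant gain \bar k inserted into the assumed form of (2.5)
\bar k\, p' = -(h - \hat h_t - \varepsilon)\, p
  \quad &\Longrightarrow \quad
  \varepsilon = h - \hat h_t + \bar k\, \psi, \qquad \psi = p'/p, \\
% objective as a quadratic function of \bar k
I(\bar k) &= E\big[(h - \hat h_t + \bar k\, \psi)^2\big], \\
% first-order condition
0 = \tfrac{1}{2}\, I'(\bar k) &= E\big[(h - \hat h_t)\,\psi\big] + \bar k\, E[\psi^2], \\
% E[\psi] = \int p'\,dx = 0, then integration by parts on \int h\, p'\,dx
\bar k^* &= -\frac{E[h\,\psi]}{E[\psi^2]} = \frac{E[h']}{E[\psi^2]}.
\end{align*}
```

Under these assumptions, taking $h(x) = cx$ and a Gaussian filtering density with variance $\rho_t$ gives $E[h'] = c$ and $E[\psi^2] = 1/\rho_t$, hence $\bar k^* = c\rho_t$, which agrees with the Kalman gain stated in Example 5.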
Example 5 (linear-Gaussian case, continued). It is easy to verify that in the linear-Gaussian case (1.5), $\bar k^*(t)$ from (3.2) yields the Kalman gain, which in turn makes $\varepsilon^*(x, t)$ and thus the objective functional zero. Indeed, a direct computation gives the Kalman gain $\bar k^*(t) = c\rho_t$.
In general (i.e., the nonlinear or non-Gaussian case), $\psi(x, t)$ and the Fisher information $E[\psi^2(X_t, t) \mid \mathcal{F}_t^Y]$ need to be estimated from samples. Estimators for these quantities can be found, e.g., in [3,4,8,22]. The proposition above can be generalized to richer classes of functions $\mathcal{K}$, e.g., functions of the form $k(x, t) = \sum_j a_j(t)\,\phi_j(x, t)$, where $\{\phi_j\}_j$ is an appropriate set of basis functions and $\{a_j\}_j$ are some coefficients.
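To make the sample-based computation concrete, the following minimal sketch estimates a constant gain of the form $-E[(h - \hat h)\psi]/E[\psi^2]$ from samples. It is an illustration under stated assumptions, not the estimators of [3,4,8,22]: the samples are assumed to be unweighted draws from a Gaussian filtering distribution, the score $\psi$ is taken in closed form instead of being estimated, and the function name and the parameters `mu`, `rho`, `c` are hypothetical.

```python
import numpy as np


def fisher_optimal_constant_gain(h_vals, psi_vals):
    """Sample estimate of -E[(h - h_hat) * psi] / E[psi^2].

    h_vals:   observation function evaluated at the samples
    psi_vals: score p'/p evaluated at the samples (assumed known here)
    """
    h_centered = h_vals - h_vals.mean()
    return -np.mean(h_centered * psi_vals) / np.mean(psi_vals ** 2)


# Hypothetical linear-Gaussian setting: posterior N(mu, rho), h(x) = c * x.
rng = np.random.default_rng(0)
mu, rho, c = 1.0, 0.5, 2.0
x = rng.normal(mu, np.sqrt(rho), size=200_000)

psi = -(x - mu) / rho          # score of the Gaussian density
h = c * x

k_star = fisher_optimal_constant_gain(h, psi)
eps_star = h - h.mean() + k_star * psi   # residual weight coefficient

print("estimated gain:", k_star, " reference c * rho:", c * rho)
print("mean square of eps_star:", np.mean(eps_star ** 2))
```

In this linear-Gaussian setting the estimate should be close to $c\rho$ and the residual $\varepsilon^*$ close to zero. For a nonlinear $h$ or a non-Gaussian posterior, the closed-form score would have to be replaced by an estimator.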

3.2. Providing freedom to alleviate weight degeneracy. Weight decay is a major issue for weighted PFs. As discussed in subsection 1.3, to reduce weight degeneracy, $E[W_t^2]$ should remain small. In subsections 2.1 and 2.4, we derived the optimal importance weight $W_t^*$ and its stochastic dynamics $dW_t^*$ under the constraint that the particle distribution is given, i.e., the particle dynamics is fixed. Here we would like to derive a general formula for $dE[W_t^2]$ corresponding to weight dynamics of the form (2.4) in terms of $\{\gamma, \varepsilon, \zeta\}$. By Itô's formula, we have $dW_t^2 = W_t^2\big[(2\gamma + \varepsilon^2 + \zeta^2)\,dt + 2\varepsilon\,dY_t + 2\zeta\,dB_t\big]$, where all coefficients are evaluated at $(S_t, t)$. Applying Fubini's theorem to the integral form of the SDE above would suffice for deriving $dE[W_t^2]$. However, the presence of $W_t^2$, which multiplies functions of $S_t$, as well as of $dY_t$, which in turn depends on the hidden process $X_t$, makes the analysis of $dE[W_t^2]$ complicated. It is interesting to note that studying $d\log(W_t)$ avoids the first issue because it no longer involves $W_t$ explicitly. Observe that by Itô's formula $d\log(W_t) = \gamma(S_t, t)\,dt + \varepsilon(S_t, t)\,dY_t + \zeta(S_t, t)\,dB_t - \frac{1}{2}\varepsilon^2(S_t, t)\,dt - \frac{1}{2}\zeta^2(S_t, t)\,dt$. (3.6) To overcome the second issue, it turns out that using the fact that the innovation process is a BM yields a more useful formula, as follows.
Proposition 8. Consider the filtering problem (1.1)-(1.2). Let $(S_t, W_t)$ be a pair process such that $W_t$ solves the SDE (2.4) under the assumptions (ii), (iii), and (v) while $S_t$ is a progressively measurable process with respect to
The proof appears in subsection 5.7. Notice that due to the nonnegativity of $\mathrm{Var}[W_t]$, we have $E[W_t^2] \ge 1$, and due to $\log(x) \le x - 1$, we have $E[\log(W_t)] \le 0$. It can be shown (by counterexample) that $E[\log(W_t)]$ is not necessarily a proper measure of degeneracy. However, the formulas above are insightful for investigating the possibility of exploiting the freedom given by the terms $\gamma, \varepsilon, \zeta$ in order to find other steady-state solutions for $E[W_t^2]$ besides the minimum-variance solution in Proposition 6. We defer this investigation to future research, but to illustrate the usefulness of the formulas above, we apply them to the weights in the BPF (1.13), which gives (3.9). Equation (3.9) implies that in the BPF, we always have $dE[(W_t^B)^2] \ge 0$, and hence the number of effective particles will inevitably decay over time. Note also that if the observation process is $m$-dimensional, the drift coefficient on the right-hand side of (3.9) changes accordingly, and therefore the time constant of the weight decay will scale with $1/m$ (compare the analysis with [24, section 3.1.1]).
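The decay of the number of effective particles in a weighted filter without resampling can be illustrated numerically. The sketch below is only an illustration under assumptions that are not taken from the paper: a hypothetical model with $f(x) = -x$, $g = 1$, $h(x) = x$, an Euler-Maruyama discretization, particles that follow the prior dynamics, and the standard unnormalized log-likelihood update $h(S)\,dY - \frac{1}{2}h(S)^2\,dt$ followed by normalization, in place of the normalized weight SDE (1.13). The effective sample size is computed as $1/\sum_i w_i^2$.

```python
import numpy as np


def simulate_bpf_ess(n_particles=500, t_end=5.0, dt=0.01, seed=0):
    """Track the effective sample size of importance weights without resampling.

    Hypothetical model: dX = -X dt + dB^X, dY = X dt + dB^Y.
    Particles follow the prior dynamics; only the weights see the data.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(t_end / dt))
    sqdt = np.sqrt(dt)

    x = rng.normal()                     # hidden state, X_0 ~ N(0, 1)
    s = rng.normal(size=n_particles)     # particles drawn from the same prior
    log_w = np.zeros(n_particles)        # unnormalized log-weights
    w = np.full(n_particles, 1.0 / n_particles)

    ess = np.empty(n_steps + 1)
    ess[0] = n_particles
    for i in range(n_steps):
        # Observation increment generated by the hidden state.
        dy = x * dt + sqdt * rng.normal()
        # Log-likelihood increment, evaluated at the particle positions
        # at the start of the step (Ito convention).
        log_w += s * dy - 0.5 * s ** 2 * dt
        # Euler-Maruyama step for the hidden state and the particles.
        x += -x * dt + sqdt * rng.normal()
        s += -s * dt + sqdt * rng.normal(size=n_particles)
        # Normalize in a numerically stable way and record the ESS.
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        ess[i + 1] = 1.0 / np.sum(w ** 2)
    return ess, w


ess, w = simulate_bpf_ess()
print("ESS at t = 0:", ess[0])
print("ESS at final time:", ess[-1])
```

A single path of the effective sample size need not be monotone, since the statement $dE[(W_t^B)^2] \ge 0$ concerns an expectation. Averaging `ess` over several seeds gives a smoother picture of the decay.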
4. Discussion. Existing particle filters fall into two distinct types. Unweighted PFs (such as the FPF) assimilate new data by moving around particles while keeping the weights associated to each particle fixed, whereas weighted PFs (such as the BPF) assimilate new data by reweighing particles. In this paper, we proposed a unifying framework for these types of PFs. Our proposed hybrid filter allows particles to be moved as well as reweighed in response to new observations. This gives a lot of freedom on how to design a PF. This freedom obviously needs to be constrained in order to make sure that the empirical distribution of weighted particles effectively converges to the filtering distribution (i.e., to be asymptotically exact). The necessary conditions are summarized by (2.5)-(2.6) in the unifying theorem.
Even after having constrained the freedom of the particle and weight dynamics to satisfy the targeting condition, there is still substantial freedom that could be exploited. An interesting extension of the present work will be to determine whether there exists a "sweet spot" where the strengths of unweighted PFs (i.e., the absence of the weight collapse problem) could be combined with the strengths of simple weighted PFs such as the BPF (where the solution of (2.5)-(2.6) is trivial). Practically, this would amount to defining a cost function which combines the cost associated to the severity of the weight decay as well as the cost for computing the solutions of (2.5)-(2.6). We leave this question for further work.
Another extension of the present work would be to relax the strong assumptions made in the SDEs (2.3)-(2.4). Indeed, in (2.3), it is assumed that the particle dynamics does not involve W t explicitly and (2.4) assumes that the coefficients in the weight dynamics are linear in W t . In this construction, S t is the main process, but W t can be thought of as an auxiliary process driven by S t (recall (3.6)) and transforming the distribution of S t into the filtering distribution. These assumptions were made to simplify the expression for the necessary conditions in the unifying theorem. However, if we relax them, we will obtain even more freedom on the choice of particle and weight dynamics that could be leveraged to increase the chance of obtaining a sweet spot as described in the previous paragraph. For example, we could enforce a drift term in the weight dynamics that pulls the weight back to unity and compensate for this specific choice by appropriate functions in the particle dynamics. This could be interpreted as a smooth resampling procedure, unlike classical resampling, where particles and weights have the undesirable feature of changing abruptly at the resampling times.
Finally, this paper is entirely focused on necessary conditions that the hybrid particle filter needs to satisfy in order to target the filtering distribution. An obvious next question will be to determine sufficient conditions that need to be satisfied by the model such that (2.5)-(2.6) guarantee that the targeting condition is met for an extended period of time (i.e., the converse of Theorem 3). The main difficulty will be to establish conditions under which the solutions of (2.5)-(2.6) make the SDEs (2.3)-(2.4) well-posed. This question has been partly addressed in [18] for some special cases but remains open in the general case considered here. Furthermore, additional work is required to establish sufficient conditions under which a PF with a finite number N of particles has a uniformly bounded error. Indeed, at the end of subsection 2.3, we saw that in a simple linear-Gaussian case, the variance of the unweighted particles can grow exponentially if a specific parameter is above a given threshold (which is not recognized by the necessary conditions). This feature is clearly undesirable when the number of particles is finite since it implies that the number of samples that effectively support the filtering distribution decreases with time, which is similar to what occurs in the BPF. It would therefore be desirable to derive sufficient conditions that guarantee the stability of the filter.
Proof of Theorem 2. I. The targeting condition (1.9), applied to all integrable functions $\varphi$, shows the first claim (2.1). II. This follows directly from Lemma 9 just below.
Lemma 9. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $W^*$ a $\mathcal{G}$-measurable random variable defined on this space. Suppose $\mathcal{G} \subseteq \mathcal{F}$ and $E[(W^*)^2] < \infty$. Then $W^*$ is the (a.s. unique) solution to the optimization problem
Proof. $E[W^2]$ can be written as
In other words, $W_{\mathrm{opt}} = W^*$ a.s.
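The argument behind Lemma 9 can be sketched as follows. This is a plausible reconstruction, not the original display: it assumes that the optimization problem is the minimization of $E[W^2]$ over all $\mathcal{F}$-measurable, square-integrable $W$ subject to the constraint $E[W \mid \mathcal{G}] = W^*$.

```latex
\begin{align*}
% decompose W around W^* and expand the square
E[W^2] &= E\big[(W - W^*)^2\big] + 2\,E\big[W^*(W - W^*)\big] + E\big[(W^*)^2\big] \\
% tower property: W^* is G-measurable and E[W - W^* | G] = 0 under the constraint
       &= E\big[(W - W^*)^2\big] + 2\,E\big[W^*\,E[W - W^* \mid \mathcal{G}]\big] + E\big[(W^*)^2\big] \\
       &= E\big[(W - W^*)^2\big] + E\big[(W^*)^2\big] \;\ge\; E\big[(W^*)^2\big],
\end{align*}
```

with equality if and only if $W = W^*$ a.s. Under the assumed constraint, this is the usual orthogonality property of conditional expectation in $L^2$.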

5.2. Proof of the unifying theorem. This subsection is devoted to the proof of Theorem 3. A key ingredient of the proof is the next lemma, whose first two assertions come from Lemma 2 in [29]. It allows us not only to interchange conditional expectations with integrals but also to adapt the σ-algebra accordingly, which makes it distinct from, and stronger than, the standard conditional Fubini theorem.
Lemma 10. Take the system of SDEs (2.3)-(2.4) for the pair process $(S_t, W_t)$ under the assumptions (ii)-(v). Let $F(x, w, t)$ be an $\mathbb{R}$-valued measurable function such that $F(S_\cdot, W_\cdot, \cdot) \in L^2(0, t)$. Then t ∨σ(S 0 )-measurable. Statements (5.12) and (5.13) then follow directly from Lemma 2 in [29]. The last claim (5.14) is similar to the zero-mean property of the Itô integral and has a similar proof, yet it involves conditional expectation. For the Itô integral, we have where we used the fact that the Brownian motion increment $B_{t_{i+1}} - B_{t_i}$ is independent of $\mathcal{F}_t^Y$ as well as of the random values $S_{t_i}$ and $W_{t_i}$ at time $t_i$, and that its mean is zero.
Given the lemma above, we are now ready to prove Theorem 3.
Proof of Theorem 3. By the targeting assumption for all $0 < t < T$ as well as the initial condition, we know that (5.18) holds a.s. for any measurable test function $\varphi$ with $E[|\varphi(X_t)|] < \infty$, in particular for any integrable $\varphi \in C^2$, which allows us to write an equality for the stochastic differential of these processes in the Itô sense, namely (5.19). The right-hand side of (5.19) is given by (1.4), which consists of two terms multiplying $dY_t$ and $dt$. The plan is to compute the left-hand side in terms of the unknown functions $\{u, k, v, \gamma, \varepsilon, \zeta\}$ from the SDEs (2.3)-(2.4). It turns out that the left-hand side also consists of two terms multiplying $dY_t$ and $dt$, since we take the conditional expectation with respect to $\mathcal{F}_t^Y$, and thus terms multiplying $dB_t$ vanish. Two fundamental ODEs (or PDEs in higher dimensions) finally follow from the fact that the terms multiplying $dY_t$ and $dt$ (more precisely, $dB_t^Y$ and $dt$) on each side of (5.19) are equal a.s. regardless of $\varphi$. The proof is structured in three steps: (a) We first find the stochastic differential $d(W_t \varphi(S_t))$ from the SDEs (2.3)-(2.4) using Itô's formula. (b) In order to obtain $dE[W_t \varphi(S_t) \mid \mathcal{F}_t^Y]$, we write the result of the previous step in integral form, take the conditional expectation of both sides with respect to $\mathcal{F}_t^Y$, use Lemma 10 to interchange conditional expectations with integrals, and finally turn the result back into differential form. (c) We use the targeting assumption (5.18) to convert the conditional expectations that involve $(W_t, S_t)$ into posterior expectations (this becomes possible because of the special form that we take for the stochastic dynamics of the particle system). Finally, we investigate the implications of the equalities resulting from (5.19) for some class of test functions $\varphi$. Some aspects of our proof are inspired by the usual proof of the Fokker-Planck equation (see, e.g., [21, Proof of Theorem 5.4]) and the derivation of the FPF [29,30].
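Step (a) can be sketched explicitly under assumed forms of the SDEs. The sketch is illustrative and rests on the following assumptions, none of which are displayed in this section: that (2.3) reads $dS_t = u\,dt + k\,dY_t + v\,dB_t$, that (2.4) reads $dW_t = W_t(\gamma\,dt + \varepsilon\,dY_t + \zeta\,dB_t)$ (the form consistent with (3.6)), that $B_t$ is independent of $Y_t$, and that $d\langle Y \rangle_t = dt$. All coefficients are evaluated at $(S_t, t)$.

```latex
\begin{align*}
% Ito's formula for the test function along the particle path
d\varphi(S_t) &= \varphi'\,(u\,dt + k\,dY_t + v\,dB_t)
                 + \tfrac{1}{2}\,\varphi''\,(k^2 + v^2)\,dt, \\
% quadratic covariation between the weight and the test function
d\langle W, \varphi(S) \rangle_t &= W_t\,\varphi'\,(\varepsilon k + \zeta v)\,dt, \\
% Ito product rule: d(W phi) = phi dW + W dphi + d<W, phi>
d\big(W_t\,\varphi(S_t)\big) &= W_t \Big[ \big( \gamma\,\varphi + (u + \varepsilon k + \zeta v)\,\varphi'
                 + \tfrac{1}{2}(k^2 + v^2)\,\varphi'' \big)\,dt \\
              &\qquad\quad + (\varepsilon\,\varphi + k\,\varphi')\,dY_t
                 + (\zeta\,\varphi + v\,\varphi')\,dB_t \Big].
\end{align*}
```

Under these assumptions, the $dB_t$ term drops out after conditioning on $\mathcal{F}_t^Y$, and the $dY_t$ and $dt$ coefficients are the ones to be matched against (1.4) in steps (b) and (c).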
Proof of Proposition 5. Equation (2.6) is a second-order ODE consisting of three parts. A trivial solution to this equation can be obtained by setting each part to zero. Alternatively, one can use (2.5) to first change the form of (2.6) and then set each part to zero. Specifically, the second-order term $\frac{\partial^2}{\partial x^2}(k^2 p)$ in (2.6) can be converted into a first-order derivative using (2.5), which gives (5.37). Likewise, the zero-order term $(h - \hat h_t)\hat h_t p$ in (2.6) can be written as a first-order derivative using (2.5), which gives (5.38). To obtain a general expression, we may simply split
$\frac{\partial^2}{\partial x^2}(k^2 p) = \beta \frac{\partial^2}{\partial x^2}(k^2 p) + (1 - \beta) \frac{\partial^2}{\partial x^2}(k^2 p)$, (5.39)
$(h - \hat h_t)\hat h_t p = \alpha (h - \hat h_t)\hat h_t p + (1 - \alpha)(h - \hat h_t)\hat h_t p$, (5.40)
where $\alpha, \beta \in \mathbb{R}$ are free parameters, which can be (continuous) functions of $t$ but not of $x$, because we would like to be able to take them inside the derivatives in the next calculation steps. In the equations above, we let the first parts remain unchanged, while for the second parts, we use (5.37) and (5.38). Equation (2.6) then becomes (5.41). By separately setting each of the three terms in (5.41) to zero, we obtain a particular solution. Furthermore, we wish to derive the results for an $\varepsilon$ that consists of an interpolation. As we discussed in subsection 2.3, the choice of $\varepsilon$ that linearly interpolates between $h - \hat h_t$ and $0$ is a relevant choice for our purpose. We also let $\varepsilon$ include a free function (so as not to restrict ourselves to just an interpolation). So we take $\varepsilon = \eta (h - \hat h_t) + \tilde\varepsilon$, where $\eta$ is a free parameter like $\alpha, \beta$, and $\tilde\varepsilon$ is an arbitrary function with zero mean under the posterior distribution (to guarantee the condition (2.7)). To sum up, plugging this $\varepsilon$ into the modified ODE (5.41) and setting each term to zero yields the results of this proposition.