Asymptotically Optimal Procedures for Sequential Joint Detection and Estimation

We investigate the problem of jointly testing multiple hypotheses and estimating a random parameter of the underlying distribution in a sequential setup. The aim is to jointly infer the true hypothesis and the true parameter while using on average as few samples as possible and keeping the detection and estimation errors below predefined levels. Based on mild assumptions on the underlying model, we propose an asymptotically optimal procedure, i.e., a procedure that becomes optimal when the tolerated detection and estimation error levels tend to zero. The implementation of the resulting asymptotically optimal stopping rule is computationally cheap and, hence, applicable for high-dimensional data. We further propose a projected quasi-Newton method to optimally choose the coefficients that parameterize the instantaneous cost function such that the constraints are fulfilled with equality. The proposed theory is validated by numerical examples.


I. INTRODUCTION
There exist various scenarios in which hypothesis testing and parameter estimation occur in a coupled way and the outcome of both is important. That is, one would like to decide among several hypotheses and, depending on the decision, estimate some parameter of the model under the selected hypothesis. In radar, for example, one would like to detect the presence of a target and, if a target is present, to estimate, e.g., its position or velocity [1]. To perform dynamic spectrum access in cognitive radio, the secondary user has to detect the primary user and to estimate the possible interference [2]. Detecting the presence of a signal and estimating the channel is of interest in many communication scenarios [3]. There exist many more applications in which such scenarios arise, for example, biomedical engineering [4], [5], speech processing [6], visual inference [7] and changepoint detection [8].
The problem of treating hypothesis testing and parameter estimation jointly dates back to the late 1960s. Middleton and Esposito used a Bayesian framework to find a jointly optimal solution [9]. That framework was later extended to multiple hypotheses [10]. After a period of declining interest, this problem has regained attention since 2000 [1]-[8].
In many applications, one would like to perform inference using as few samples as possible while maintaining its quality at the same time. These requirements can, for example, be caused by constraints on the latency or the power consumption of a system. In the late 1940s, Abraham Wald introduced the field of sequential analysis with the invention of the sequential probability ratio test (SPRT) [11]. In sequential inference, data is collected until one is confident enough about the phenomenon of interest. One advantage of sequential methods is that they use on average significantly fewer samples than their fixed-sample-size counterparts while providing the same inference quality. An overview of sequential detection and sequential estimation methods is given, for example, in [12] and [13], respectively.

Dominik Reinhard and Abdelhak M. Zoubir are with the Signal Processing Group, Technische Universität Darmstadt, Merckstraße 25, 64283 Darmstadt, Germany. E-mail: {reinhard, zoubir}@spg.tu-darmstadt.de. Michael Fauß is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA. E-mail: mfauss@princeton.edu. The work of Michael Fauß was supported by the German Research Foundation (DFG) under grant number 424522268.
Treating the problem of joint detection and estimation with a fixed number of samples only allows one to trade off detection accuracy against estimation accuracy. Contrary to this, the additional degree of freedom in a sequential setup, i.e., the number of used samples, allows one to control the detection and estimation errors individually. This is important because detection and estimation often impose contradicting requirements on the data. Hence, a two-step procedure, e.g., a sequential detector followed by an optimal estimator, would allow one to control the detection errors, but would most probably result in very high estimation errors, which is not desired. This is illustrated by an example. As mentioned before, detecting the presence of a target and estimating its position is a typical problem in radar, in which both the detection performance and the estimation accuracy are important. If the radar system observes a peak with a very high amplitude that points in the direction of the target, the target's presence can be detected reliably, but its position cannot be estimated accurately if the peak is too broad. Conversely, if the system observes a very narrow peak of small amplitude, the position of the target can be estimated accurately, but its presence cannot be detected reliably. Therefore, the procedure has to gather more samples until it observes a very narrow peak of high amplitude, i.e., until the target can be detected reliably and its position can be estimated accurately. Note that this is an oversimplification to highlight the underlying fundamental joint detection and estimation problem.

A. State-of-the-Art
There is little prior work on joint detection and parameter estimation in a sequential framework. Although there exist sequential hypothesis tests that include an estimation step, such as the generalized SPRT or the adaptive SPRT, estimation only occurs as an intermediate step and the primary interest lies in the outcome of the test. More details on the generalized SPRT and the adaptive SPRT can be found in [12, Section 5.4].
Simultaneous detection and estimation in a sequential setting was first mentioned in [14]. However, that work considers only the sequential updating of a sufficient statistic for the problem of joint detection and estimation rather than presenting a solution of the joint inference problem.
Sequential joint detection and state estimation of Markov signals was investigated in [15], [16]. However, although estimating the true state is of primary interest, these works do not account for the uncertainty about the true state when choosing the number of used samples. A sequential procedure that overcomes this problem was proposed for a related problem in [17].
The problem of optimal sequential joint detection and estimation was first investigated by Yılmaz et al. [18]. In that work, a binary hypothesis test was performed first and, in case a decision was made in favor of the alternative, a random parameter was estimated. The method in [18] was extended to multiple hypotheses [19] and applied to joint spectrum sensing and channel estimation [2]. In the works by Yılmaz et al., a weighted sum of detection and estimation errors is used as the cost function for the joint inference problem. This cost function is kept below a predefined level while the number of used samples is minimized for every set of observations.
In [20], we proposed a novel framework for optimal sequential joint detection and estimation. In that framework, the aim is to minimize the expected number of samples while keeping the detection and estimation errors below predefined levels. That framework was later extended to multiple hypotheses [21], [22] and applied to joint signal detection and signal-to-noise ratio estimation [23] as well as to joint symbol decoding and noise power estimation [21], [22]. Sequential joint detection and estimation in distributed sensor networks was investigated in [24], [25].
Although the procedures in [20], [22] are strictly optimal, i.e., there is no other procedure that uses fewer samples on average and fulfills the constraints, their design can be computationally costly since their implementation requires a discretization of the state space. To overcome this, we presented an asymptotically pointwise optimal procedure [26], i.e., a procedure that becomes optimal when the tolerated detection and estimation error levels become sufficiently small. However, the work in [26] relies on strict assumptions: the hypotheses only differ in the prior of the random parameter, and the data, conditioned on the random parameter, is independent and identically distributed (iid) with a distribution from the exponential family.

B. Contributions and Outline
In this work, we propose an asymptotically optimal (AO) procedure for the problem of sequential joint detection and estimation under mild assumptions on the underlying model. Contrary to our previous works [20], [22]-[25], the proposed procedure is easy to implement, while achieving performance similar to that of the strictly optimal procedure [20], [22].
The contributions of this manuscript are as follows. First, we present an asymptotically optimal stopping rule for sequential joint detection and estimation that depends on a set of parameters. Next, we show that the gradients of the AO cost function are asymptotically proportional to the corresponding error metrics, i.e., they become proportional as the nominal detection and estimation error levels tend to zero. Based on this connection, it is then shown that the search for the optimal coefficients, i.e., the coefficients for which the procedure hits the nominal error levels, can be formulated as a convex optimization problem. Finally, a projected quasi-Newton method to solve this problem is proposed.
This work is structured as follows. In Section II, the notation and the underlying assumptions are presented. Moreover, some fundamentals of sequential joint detection and estimation are revisited, the performance measures are introduced and a detailed problem formulation is given. For the sake of completeness and to introduce the relevant quantities, the reduction of the original problem to an optimal stopping problem is sketched in Section III. In Section IV, the proposed AO policy is introduced and rigorous proofs of optimality are given. The optimal choice of the coefficients parameterizing the AO policy, along with a projected quasi-Newton algorithm, is presented in Section V. To validate the proposed theory, numerical examples are given in Section VI. Conclusions are drawn in Section VII.

II. PROBLEM FORMULATION
Let X = (X_n)_{n>0} be a sequence of random variables that is generated under one out of M different hypotheses H_m, m ∈ {1, . . ., M}. Under each hypothesis, the distribution of the data is parameterized by a real random parameter Θ_m ∈ R^{K_m} with known distribution. The occurrence of the hypotheses is also random with known prior probabilities. The sequence of random variables is assumed to be conditionally iid, i.e., the random variables X | H_m, θ_m are iid. Hence, one can express the M different hypotheses as

H_m : X_n ∼ p(x | H_m, θ_m), Θ_m ∼ p(θ_m | H_m),

where m ∈ {1, . . ., M}. Generally speaking, the true hypothesis may affect both the distribution of the data given the parameter and the prior distribution of the parameter. Since the hypotheses are composite, we are, besides deciding in favor of the true hypothesis, also interested in inferring the parameter that generates the sequence X. Using an optimal procedure for the hypothesis test together with an optimal estimator does not necessarily result in overall optimal performance [27]. Therefore, the problems of detection and estimation have to be considered jointly. Moreover, the sequence X is observed sample by sample and one stops as soon as one is confident enough about the true hypothesis and the true parameter. Hence, the problem is one of sequential joint detection and estimation.
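To make the generative model concrete, the following sketch draws a hypothesis according to its prior probability, draws a parameter from the corresponding prior, and then generates conditionally iid observations. The two uniform priors and the Gaussian likelihood in the usage example are hypothetical placeholders, not the model used later in the paper.

```python
import random

def draw_sample_path(n, priors, likelihood, rng):
    """Draw (hypothesis, parameter, observations) from the hierarchical model.

    priors:     list of (prior_prob, sample_theta) pairs, one per hypothesis.
    likelihood: function (theta, rng) -> one observation X_i.
    Hypothesis and parameter are drawn once and stay fixed (Assumption A1);
    given (H_m, theta_m), the observations are iid.
    """
    u, acc = rng.random(), 0.0
    for m, (p, _) in enumerate(priors):
        acc += p
        if u < acc:
            break
    theta = priors[m][1](rng)
    x = [likelihood(theta, rng) for _ in range(n)]
    return m, theta, x

# Hypothetical example: two equiprobable hypotheses, Gaussian likelihood,
# priors with disjoint supports for the mean.
rng = random.Random(0)
priors = [(0.5, lambda r: r.uniform(-1.0, 0.0)),   # H_1: mean in [-1, 0)
          (0.5, lambda r: r.uniform(0.0, 1.0))]    # H_2: mean in [0, 1)
m, theta, x = draw_sample_path(5, priors, lambda th, r: r.gauss(th, 2.0), rng)
```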
Before a more technical problem formulation is provided, the notation and assumptions are summarized and some fundamentals on sequential joint detection and estimation are introduced.

A. Notations and Assumptions
To keep the notation simple, the integration domain is not indicated explicitly for integrals taken over the entire domain, e.g., ∫ f(x) dx stands for ∫_X f(x) dx. Moreover, the dependency of functions, such as estimators, on the observations is dropped for the sake of a compact notation. The indicator function of the event A is denoted by 1_A, the trace of a matrix is denoted by Tr(·), and x_{1:n} = (x_1, . . ., x_n). The Kullback-Leibler divergence (KL divergence) and Fisher's information matrix under H_m are denoted by D_KL(p(x) ‖ q(x)) and I_m(θ_m), respectively.
The following assumptions are made throughout the paper:

A1 The true parameter and the true hypothesis stay constant during the observation period.

A2 The variances of the parameters Θ_m = [Θ_m^1, . . ., Θ_m^{K_m}] are positive and finite, i.e., 0 < Var[Θ_m^k] < ∞ for all k ∈ {1, . . ., K_m}.

A3 For Fisher's information under hypothesis H_m, it holds that P(0 < Tr(I_m^{-1}(θ_m)) < ∞) = 1.

A4 For all m ≠ k and all θ_k, it holds that 0 < D_KL(p(x | H_m, θ_m) ‖ p(x | H_k, θ_k)) < ∞.

A5 Let θ_m denote the true parameter and let U denote an open neighborhood of θ_m. Then, there exists a square integrable function f_{θ_m}(x) such that for all θ, θ′ ∈ U it holds that |log p(x | H_m, θ) − log p(x | H_m, θ′)| ≤ f_{θ_m}(x) ‖θ − θ′‖. In other words, log p(x | H_m, θ) is locally Lipschitz continuous over U.

A6 The KL divergence has a second-order Taylor expansion around θ_m, i.e., D_KL(p(x | H_m, θ_m) ‖ p(x | H_m, θ)) = ½ (θ − θ_m)^T I_m(θ_m) (θ − θ_m) + R(θ) with |R(θ)| ≤ c ‖θ − θ_m‖³, where c is some constant.

Assumption A4 guarantees the proper convergence of p(H_m | θ_m, x_{1:n}), whereas Assumptions A5 and A6 guarantee the proper convergence of p(θ_m | H_m, x_{1:n}). The validity of these assumptions is shown theoretically for some cases in the supplement [28].

B. Fundamentals of Sequential Joint Detection and Estimation
To solve a joint detection and estimation problem, one has to find a decision rule as well as a set of estimators, i.e., one estimator under each hypothesis. The decision rule δ_n ∈ {1, . . ., M} maps the observations to a decision in favor of a particular hypothesis, and the estimators θ̂_{m,n} map the observations to a point in the state space of a particular parameter. As indicated by the subscript, the decision rule and the estimators depend on the number of samples n. In a sequential framework, the number of samples is not given a priori, but depends on the observations themselves. More precisely, one stops sampling as soon as the certainty about the true hypothesis and the true parameter is strong enough. Therefore, a stopping rule Ψ_n ∈ {0, 1}, which maps the observations to the decision whether to stop or to continue sampling, has to be found. The run-length τ, i.e., the number of used samples, can hence be defined as τ = min{n ≥ 1 : Ψ_n(x_{1:n}) = 1}. Since the run-length depends on the random data, it is itself a random variable.
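The interplay of stopping rule, decision rule and estimators can be summarized in a generic policy loop. The following sketch is purely illustrative; the stopping, decision and estimation rules passed in the usage example are placeholder functions, not the optimal rules derived later.

```python
def run_policy(stream, stop_rule, decide, estimators, n_max=10**6):
    """Apply a sequential policy: sample until the stopping rule fires,
    then decide and estimate under the chosen hypothesis.

    stream:        iterator yielding observations x_1, x_2, ...
    stop_rule:     Psi_n, maps x_{1:n} to True (stop) or False (continue).
    decide:        delta_n, maps x_{1:n} to a hypothesis index m.
    estimators:    estimators[m] maps x_{1:n} to an estimate of theta_m.
    Returns (run-length tau, decision, estimate).
    """
    x_1n = []
    for x in stream:
        x_1n.append(x)
        # tau = min{n : Psi_n = 1}; n_max is only a safety cap for this sketch
        if stop_rule(x_1n) or len(x_1n) >= n_max:
            m = decide(x_1n)
            return len(x_1n), m, estimators[m](x_1n)
    raise RuntimeError("stream exhausted before the stopping rule fired")

# Toy usage with placeholder rules: stop after three samples, decide by the
# sign of the running mean, estimate the mean under the first hypothesis.
data = iter([1.2, 0.8, 1.1, 0.9, 1.0])
tau, m, est = run_policy(
    data,
    stop_rule=lambda x: len(x) >= 3,
    decide=lambda x: 0 if sum(x) / len(x) > 0 else 1,
    estimators={0: lambda x: sum(x) / len(x), 1: lambda x: 0.0},
)
```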
In what follows, the collection of the stopping rule, the decision rule and the estimators is referred to as a policy, defined as π = (Ψ_n, δ_n, θ̂_{1,n}, . . ., θ̂_{M,n})_{n>0}, and the set of all feasible policies is denoted by Π.
In this work, we use, similarly to [20], [22], the probability of falsely rejecting a hypothesis and the mean-squared error (MSE) to quantify the detection and estimation errors, respectively. The former is defined as

α_m(π) = P({δ_τ ≠ m} | H_m),

with {δ_n = m} := {x_{1:n} : δ_n(x_{1:n}) = m}. For the latter, we set the estimation error to zero in case of a wrong decision, which can be written as

β_m(π) = E[ ‖Θ_m − θ̂_{m,τ}‖² 1_{{δ_τ = m}} | H_m ].
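A Monte Carlo sketch of the two performance measures, with the squared estimation error set to zero whenever the decision is wrong, as described above; the sample runs in the usage example are hypothetical.

```python
def mc_error_metrics(runs, m):
    """Estimate the detection error alpha_m and the estimation error beta_m
    from Monte Carlo runs of a policy generated under H_m.

    runs: list of (decision, theta_true, theta_hat) tuples.
    alpha_m: relative frequency of deciding against H_m.
    beta_m:  average squared error, counted as zero when the decision is wrong.
    """
    n = len(runs)
    alpha = sum(1 for d, _, _ in runs if d != m) / n
    beta = sum((th - est) ** 2 for d, th, est in runs if d == m) / n
    return alpha, beta

# Hypothetical runs under H_0 (index 0): three correct decisions, one error.
runs = [(0, 1.0, 1.1), (0, 1.0, 0.9), (1, 1.0, 0.0), (0, 1.0, 1.0)]
alpha0, beta0 = mc_error_metrics(runs, m=0)
```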

C. Sequential Joint Detection and Estimation as an Optimization Problem
As mentioned before, the aim is to design a sequential scheme that uses on average as few samples as possible while keeping the detection and estimation errors below predefined levels. More formally, the design problem can be formulated as the constrained optimization problem

min_{π ∈ Π} E[τ] subject to α_m(π) ≤ ᾱ_m and β_m(π) ≤ β̄_m for all m ∈ {1, . . ., M},   (2)

where ᾱ_m ∈ (0, 1) and β̄_m ∈ (0, ∞) are the maximum tolerated detection and estimation errors, respectively. Deriving and implementing the estimators and the decision rule that solve (2) is quite straightforward, whereas it becomes challenging for the corresponding stopping rule. Therefore, the constrained problem in (2) is solved in the following steps. First, it is reduced to an optimal stopping problem in Section III. Next, an asymptotically optimal stopping rule, which depends on a set of parameters, is derived in Section IV. Finally, in Section V, a systematic way to choose these parameters such that all constraints of (2) are fulfilled with equality is presented. Note that although the constraints are fulfilled with equality, the proposed method does not minimize the expected number of samples exactly. However, as shown in Section VI, the proposed method requires only slightly more samples on average than its optimal counterpart, i.e., its performance is close to optimal.

III. REDUCTION TO AN OPTIMAL STOPPING PROBLEM
In order to reduce (2) to an optimal stopping problem, it is first converted to an unconstrained problem whose objective is a weighted sum of the average run-length and the detection and estimation errors. This unconstrained problem acts as an auxiliary problem. Next, the unconstrained optimization problem is solved with respect to the decision rule and the estimators to end up with an optimal stopping problem. Although these steps were already presented in [20], [22], this section is included for the sake of completeness and to introduce the quantities required in the subsequent sections.
The unconstrained version of (2) is given by

min_{π ∈ Π} E[τ] + Σ_{m=1}^{M} λ_m α_m(π) + Σ_{m=1}^{M} μ_m β_m(π),   (3)

where λ_m, μ_m > 0, m ∈ {1, . . ., M}, are cost coefficients, which are assumed to be fixed for now. For suitably chosen λ_m, μ_m, m ∈ {1, . . ., M}, the solution of (3) also solves (2). Problem (3) is first minimized with respect to the decision rule and then with respect to the estimators. The optimal estimators and the optimal decision rule are given by

θ̂_{m,n} = E[Θ_m | H_m, x_{1:n}],   δ_n = arg min_{m ∈ {1, . . ., M}} D_{m,n}(x_{1:n}),   (4)

where D_{m,n}(x_{1:n}) denotes the cost for deciding in favor of H_m at time n. Let the error covariance matrix be defined as

Σ_m(x_{1:n}) = E[(Θ_m − θ̂_{m,n})(Θ_m − θ̂_{m,n})^T | H_m, x_{1:n}].

Then, the cost D_{m,n}(x_{1:n}) is defined as

D_{m,n}(x_{1:n}) = Σ_{j ≠ m} λ_j p(H_j | x_{1:n}) + μ_m p(H_m | x_{1:n}) Tr(Σ_m(x_{1:n})).

Finally, we end up with the optimization problem

min_τ E[τ + g(x_{1:τ})]   (5)

with the cost function

g(x_{1:n}) = min{D_{1,n}(x_{1:n}), . . ., D_{M,n}(x_{1:n})}.   (6)

Solving (5) is usually a challenging task in the design of optimal sequential procedures. In [20], [22], for example, the problem is truncated, i.e., the maximum number of samples is restricted to a fixed constant N, and the truncated optimal stopping problem is solved via backward induction.
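As a sketch, the stopping cost can be evaluated from the posterior probabilities and the posterior error covariances. The cost form used below (misclassification terms weighted by λ_j plus a weighted posterior MSE term) is an assumption made for illustration and need not match the paper's cost verbatim; the numbers in the usage example are arbitrary.

```python
def stopping_cost(post_probs, trace_cov, lam, mu):
    """Per-hypothesis stopping costs D_{m,n} and their minimum g(x_{1:n}).

    Assumed form (a sketch, not necessarily the paper's Eq. (4) verbatim):
      D_{m,n} = sum_{j != m} lam[j] * p(H_j | x_{1:n})          # detection part
              + mu[m] * p(H_m | x_{1:n}) * Tr(Sigma_m)          # estimation part
    post_probs[m] = p(H_m | x_{1:n}), trace_cov[m] = Tr(Sigma_m(x_{1:n})).
    """
    M = len(post_probs)
    D = [sum(lam[j] * post_probs[j] for j in range(M) if j != m)
         + mu[m] * post_probs[m] * trace_cov[m]
         for m in range(M)]
    return D, min(D)   # g = min_m D_{m,n}, the cost for stopping now

# Arbitrary illustrative values for two hypotheses.
D, g = stopping_cost(post_probs=[0.9, 0.1], trace_cov=[0.2, 0.5],
                     lam=[10.0, 10.0], mu=[5.0, 5.0])
```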
In this work, however, we do not seek an optimal solution, but rather stopping rules that are optimal in an asymptotic sense. The concept of AO policies is introduced in the next section.

IV. ASYMPTOTICALLY OPTIMAL PROCEDURES
Before it can be shown how to define an AO stopping rule in the context of this work, some fundamentals on AO stopping rules have to be revisited. For this, we consider a different model that is widely used in the context of (Bayesian) sequential inference [13], [29], [30]. Let

Z_n(c) = Y_n + n c,  n > 0,   (7)

be a sequence of random variables, where c > 0 is the cost for observing one sample and Y_n is some random variable. The aim is to find an optimal stopping time for model (7), i.e., a stopping time τ that minimizes E[Z_τ(c)]. Deriving a strictly optimal stopping time, and, hence, a strictly optimal stopping rule that generates this stopping time, can be computationally demanding. If the maximum number of samples is restricted, the optimal stopping rule can, as, e.g., in [20], [22], be calculated via dynamic programming, which becomes computationally infeasible with increasing dimensionality. For the solution of high-dimensional optimal stopping problems, see, for example, [31] and the references therein. To overcome this, asymptotically optimal stopping rules are considered here, that is, stopping rules that become optimal when the cost per sample tends to zero, i.e., c → 0. According to [30], a stopping rule is called asymptotically optimal (AO) if there exists no other feasible stopping rule that results on average in smaller costs as the cost per sample tends to zero. The formal definition is stated below.
Definition 1 (Bickel and Yahav [30]). Consider the model in (7). Then, a stopping time t is asymptotically optimal if

lim_{c → 0} E[Z_t(c)] / inf_{τ ∈ T} E[Z_τ(c)] = 1,

where T is the set of all feasible stopping times.
Contrary to model (7), in which the cost per observation c is the free parameter, the model in (5) has a fixed unit cost per observation. Instead, the cost function g(x_{1:n}) in (5) is parameterized by the coefficients λ_m, μ_m ≥ 0, m ∈ {1, . . ., M}. As outlined in [20], [22], these coefficients can be chosen systematically such that the resulting policy also solves (2). Since the nominal detection and estimation error levels are the free parameters during the design process, it is more intuitive to formulate the asymptotic regime in terms of these levels rather than in terms of the cost coefficients. That is, the term asymptotically refers to the case in which the levels of all constraints tend to zero, which is equivalent to max{ᾱ_1, . . ., ᾱ_M, β̄_1, . . ., β̄_M} → 0. Loosely speaking, tightening a constraint implies increasing the corresponding coefficients. In [20, Appendix F], it is shown that some constraints may be implicitly fulfilled by other constraints and that, in this case, the corresponding coefficient becomes zero. However, in this work, it is assumed that all coefficients are strictly positive. Nevertheless, the coefficients can be chosen arbitrarily small and, hence, the optimal policy can be approximated arbitrarily well.
In this context, a stopping time t is called AO if

E[Z_t(c)] / inf_{τ ∈ T} E[Z_τ(c)] → 1 as max{ᾱ_1, . . ., ᾱ_M, β̄_1, . . ., β̄_M} → 0,   (8)

where T is the set of all feasible stopping times.
Before the AO stopping rule is derived, a variant of (5) is introduced. Let c = (max{λ_1, . . ., λ_M, μ_1, . . ., μ_M})^{-1} and

ḡ(x_{1:n}) = c g(x_{1:n}),   (9)

which amounts to scaling the coefficients to λ̄_m = c λ_m and μ̄_m = c μ_m. Then, (5) is equivalent to finding an optimal stopping rule for

min_τ E[c τ + ḡ(x_{1:τ})].   (10)

As (10) and (5) differ only up to a constant, finite, and non-zero scaling factor, they share the same minimizer. The reason for considering this problem will become clear in the sequel.
As a first step, it has to be shown that the cost function ḡ(x_{1:n}) tends to zero almost surely as the number of observations tends to infinity. Moreover, it has to be shown that nḡ(x_{1:n}) converges to a finite and non-zero random variable. This is stated in the following theorem.
Theorem 1. Let ḡ(x_{1:n}) be as defined in (9). Then, when λ_m, μ_m > 0, m ∈ {1, . . ., M}, and the assumptions stated in Section II-A are fulfilled, it holds that
1) ḡ(x_{1:n}) is positive with probability one,
2) ḡ(x_{1:n}) → 0 almost surely as n → ∞,
3) nḡ(x_{1:n}) → G almost surely, where G = μ̄_m Tr(I_m^{-1}(θ_m)), I_m^{-1}(θ_m) is the inverse of Fisher's information matrix and (H_m, θ_m) is the true hypothesis and parameter pair. It further holds that G is positive and finite with probability one.

Before we can prove Theorem 1, two auxiliary results about the convergence of the posterior probability p(H_m | x_{1:n}) and the error covariance matrix Σ_m are introduced.

Lemma 1. Let (X_n)_{n>0} be defined as above and generated under H_m. Then, it holds that p(H_j | x_{1:n}) → 0 almost surely for all j ≠ m. Moreover, the posterior distribution converges at an exponential rate.
The proof of Lemma 1 is given in Appendix A.
Lemma 2. Let (X_n)_{n>0} be defined as above and generated under H_m with true parameter θ_m. Then, it holds that n Σ_m(x_{1:n}) → I_m^{-1}(θ_m) almost surely.

Proof of Theorem 1. First, the statement that ḡ(x_{1:n}) is positive with probability one is proven. The function ḡ(x_{1:n}) is defined as

ḡ(x_{1:n}) = min{D̄_{1,n}(x_{1:n}), . . ., D̄_{M,n}(x_{1:n})} with D̄_{m,n} = c D_{m,n}.

The posterior probabilities as well as the posterior variances, i.e., the diagonal elements of the matrix Σ_m, are positive. It further holds that μ̄_m is positive. Hence, it follows that D̄_{m,n}(x_{1:n}) is positive with probability one and, since ḡ(x_{1:n}) is the minimum of all D̄_{m,n}(x_{1:n}), m ∈ {1, . . ., M}, ḡ(x_{1:n}) is also positive with probability one.
To prove Statement 2, we assume that the sequence x_{1:n} is generated under hypothesis H_m and consider the limit of the sequence D̄_{m,n}(x_{1:n}). Since the logarithm is a continuous function, we can, by applying the continuous mapping theorem [32], [33], establish the almost sure convergence of the logarithm of this sequence. Applying, again, the continuous mapping theorem yields the limit of D̄_{m,n}(x_{1:n}) itself. Thus, we can conclude that ḡ(x_{1:n}) → 0 almost surely.

It is left to prove Statement 3, i.e., that nḡ(x_{1:n}) converges almost surely to a random variable G that is positive and finite with probability one. Again, let H_m and θ_m denote the hypothesis and the parameter under which the sequence x_{1:n} is generated. Lemma 1 states that p(H_j | x_{1:n}) → 0 almost surely for all j ≠ m and that the posterior probabilities converge at an exponential rate. Therefore, it follows that the detection part of D̄_{m,n}(x_{1:n}) vanishes even after scaling by n. According to Lemma 1 and Lemma 2, it holds that n p(H_m | x_{1:n}) Tr(Σ_m(x_{1:n})) → Tr(I_m^{-1}(θ_m)) almost surely. From the continuous mapping theorem, and by applying the exponential function, we conclude that n D̄_{m,n}(x_{1:n}) converges almost surely. According to the definition of ḡ(x_{1:n}), it then follows that nḡ(x_{1:n}) → G := μ̄_m Tr(I_m^{-1}(θ_m)). It is now left to show that G is positive and finite with probability one. According to Assumption A3, the trace of the inverse of Fisher's information matrix is positive and finite with probability one. Moreover, as μ̄_m, m ∈ {1, . . ., M}, are positive, we can conclude that 0 < G < ∞ with probability one.

In order to propose an AO stopping rule for the problem of sequential joint detection and estimation, it has to be shown that the expected value of ḡ(x_{1:n}) exists and is finite for all n > 0. This is stated in the following lemma.
Lemma 3. Let ḡ(x_{1:n}) be as defined above. Then, under the assumptions stated in Section II-A, it holds that E[ḡ(x_{1:n})] < ∞ for all n > 0.

A proof is given in Appendix C. Based on these properties, one can show that the stopping rule

stop as soon as g(x_{1:n}) ≤ n   (12)

is AO. This is stated in the following theorem.

Theorem 2. Assume that all coefficients λ_m, μ_m, m ∈ {1, . . ., M}, are positive. Then, the stopping rule (12) is AO in the sense of (8).
Proof. To prove this theorem, it is first shown that the stopping rule

stop as soon as ḡ(x_{1:n}) ≤ c n   (13)

is AO in the sense of Definition 1. In order for (13) to be AO, the conditions in [30, Theorem 2.1] and [30, Theorem 3.1] have to be fulfilled. As λ_m, μ_m > 0, m ∈ {1, . . ., M}, the conditions of [30, Theorem 2.1] are fulfilled by Theorem 1. Hence, it is left to show that the condition stated in [30, Theorem 3.1], i.e.,

sup_{n>0} n E[ḡ(x_{1:n})] < ∞,   (14)

is fulfilled. According to Lemma 3, the expectation of ḡ(x_{1:n}) is finite. This implies that (14) holds as long as n is finite. In Theorem 1, it is shown that nḡ(x_{1:n}) converges to a random variable G that is finite with probability one, which implies that n E[ḡ(x_{1:n})] < ∞ for n → ∞. Therefore, (14) is true and (13) is AO in the sense of Definition 1 if the cost coefficients λ_m, μ_m are finite for all m ∈ {1, . . ., M}. Since λ̄_m and μ̄_m stay positive and finite when the tolerated detection and estimation error levels tend to zero, the stopping rule (13) is AO in the sense of Definition 1.
It is left to show that ᾱ_m, β̄_m → 0 implies λ_m, μ_m → ∞. When using the optimal decision rule and the optimal estimators, α_m, β_m → 0 can only be achieved if the number of samples tends to infinity, i.e., τ → ∞, since we assume non-zero and finite Fisher's information and KL divergences according to Assumptions A3 and A4. When using the proposed stopping rule stated in (12), an infinite run-length can only be achieved if the cost for stopping tends to infinity, i.e., g(x_{1:n}) → ∞ for all x_{1:n}. From the definition

g(x_{1:n}) = min{D_{1,n}(x_{1:n}), . . ., D_{M,n}(x_{1:n})},

one can see that g(x_{1:n}) → ∞ for all x_{1:n} holds if and only if D_{m,n}(x_{1:n}) → ∞ for all m ∈ {1, . . ., M} and all x_{1:n}. From (4), one can see that D_{m,n}(x_{1:n}) → ∞ implies λ_m, μ_m → ∞, since the posterior probabilities are finite by definition and Tr(Σ_m(x_{1:n})) is finite according to Assumption A3. Hence, a procedure can only meet nominal error levels ᾱ_m, β̄_m → 0 if λ_m, μ_m → ∞. By the definition of c, it follows that c → 0 as the nominal detection and estimation error levels tend to zero. Hence, the stopping rule (12) is AO in the sense of (8).
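A rule of the Bickel-Yahav type stops as soon as the posterior cost drops below the accumulated sampling cost. The following sketch assumes this form of the rule and feeds it a synthetic cost sequence Y_n = V/n, mimicking the 1/n decay established in Theorem 1; both the cost sequence and the per-sample cost are hypothetical.

```python
def apo_stopping_time(costs, c):
    """Bickel-Yahav style asymptotically (pointwise) optimal stopping:
    with per-sample cost c and posterior cost Y_n, stop at the first n with
    Y_n <= n * c (a sketch of the assumed form of the AO rule).

    costs: sequence Y_1, Y_2, ... of posterior stopping costs.
    Returns the stopping time tau (1-indexed) or None if never triggered.
    """
    for n, y in enumerate(costs, start=1):
        if y <= n * c:
            return n
    return None

# Synthetic example: Y_n = V / n with V = 2, so the rule should stop near
# sqrt(V / c), here sqrt(2 / 0.02) = 10.
c = 0.02
costs = [2.0 / n for n in range(1, 100)]
tau = apo_stopping_time(costs, c)
```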

V. OPTIMAL CHOICE OF THE COST COEFFICIENTS
In the previous section, it was assumed that all coefficients tend to infinity such that the proposed procedure is AO. However, the ultimate goal is to choose the coefficients such that all constraints in (2) are fulfilled and, assuming that the resulting coefficients are sufficiently large, the procedure parameterized by these coefficients is close to optimal. This section addresses the problem of how to select the coefficients optimally, i.e., such that all constraints are fulfilled.
In [20], [22], we have shown that there is a strong connection between the derivatives of the optimal cost function with respect to these coefficients and the performance measures. However, it is not obvious that such a relation also exists for the AO stopping rule proposed in this work.
In what follows, let Ψ_n denote the optimal stopping rule presented in [22] and let τ denote the corresponding stopping time. The stopping rule and the stopping time of the AO procedure are denoted by Ψ°_n and τ°, respectively. The connection between the derivatives of the AO solution of the optimal stopping problem with respect to the coefficients and the performance measures is stated in the following theorem.

Theorem 3. Let Ψ°_n denote the AO stopping rule defined in (12) and τ° the corresponding stopping time. Then, for max{ᾱ_1, . . ., ᾱ_M, β̄_1, . . ., β̄_M} → 0, the partial derivatives of the AO cost function with respect to λ_m and μ_m become proportional to α_m(π°) and β_m(π°), respectively.

The proof of Theorem 3 is outlined in Appendix D. Based on the result stated in Theorem 3, we can proceed as in [20], [22] and obtain the optimal cost coefficients via

max_{λ, μ > 0} E[τ°] + Σ_{m=1}^{M} λ_m (α_m(π°) − ᾱ_m) + Σ_{m=1}^{M} μ_m (β_m(π°) − β̄_m).   (15)

As it is not trivial to see that strong duality between (2) and (15) holds asymptotically, this is established in the next theorem.

[Algorithm 1: Projected quasi-Newton method for solving (15); the individual steps are described in the remainder of this section.]

Theorem 4. Let λ*, μ* denote the solution of (15). Then, for the AO policy parameterized by these coefficients, the constraints of (2) are fulfilled with equality as max{ᾱ_1, . . ., ᾱ_M, β̄_1, . . ., β̄_M} → 0. That is, a solution of (15) also solves (2) asymptotically.
The proof of Theorem 4 can be found in Appendix E. In order to solve (15), we propose a projected quasi-Newton method as summarized in Algorithm 1.
The gradients of (15) with respect to λ and μ, which depend on the detection and estimation error levels, are given in (30). As the AO procedure is not truncated, calculating the gradients directly is not possible. Therefore, the gradients have to be estimated via Monte Carlo simulations. Let α̂_m and β̂_m denote the Monte Carlo estimates of α_m and β_m, respectively. Then, the estimates of the gradients become

α̂_m − ᾱ_m   (16)

and

β̂_m − β̄_m.   (17)

Although calculating the gradients via Monte Carlo simulations might seem computationally demanding, the required quantities are also needed for evaluating the objective in (15). Hence, estimating the gradients comes at no additional computational cost compared to evaluating the objective. The estimated gradients are then used to update the inverse of the Hessian matrix H via the Broyden-Fletcher-Goldfarb-Shanno (BFGS) rule [34, Eq. (6.17)]. To obtain the cost coefficients at the current iteration, the ones from the previous iteration are updated via the step H^(k) ∇^(k). Usually, one would use a line search [34, Chapter 3] to obtain the optimal, or at least a sufficiently good, step size. However, evaluating the objective of (15) is very costly and, hence, performing a line search is not suitable. Therefore, the step size is set to one, which gives sufficiently good results for the examples considered in this work. General rules for step size selection can be found in the literature, but a more detailed discussion is beyond the scope of this paper. Finally, to ensure that the resulting coefficients are strictly positive, they are projected onto the set [ε, ∞), where ε is a small positive number, e.g., ε = 10^{-12}.
These steps are repeated until convergence, i.e., until

|α̂_m − ᾱ_m| ≤ ε_α^m and |β̂_m − β̄_m| ≤ ε_β^m   (18)

hold for all m ∈ {1, . . ., M}. The positive tolerances ε_α^m and ε_β^m have to be set by the designer in advance. The stopping criterion depends only on the estimated error levels α̂_m, β̂_m, which are required in every iteration anyway, as well as on the nominal error levels ᾱ_m, β̄_m and the tolerances ε_α^m, ε_β^m. Therefore, (18) can be evaluated straightforwardly without further calculations.
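The following sketch mirrors the structure of the projected quasi-Newton iteration under stated assumptions: it minimizes the negated objective with unit-step BFGS updates of the inverse Hessian (Nocedal and Wright style update, as referenced via [34, Eq. (6.17)]) and projects the coefficients onto [ε, ∞). A deterministic gradient oracle stands in for the Monte Carlo gradient estimates (16), (17); the quadratic test function in the usage example is hypothetical.

```python
def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse Hessian:
    H+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T, rho = 1 / (y^T s)."""
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    A = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)]
         for i in range(n)]                                  # I - rho s y^T
    AH = [[sum(A[i][k] * H[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]                                 # A H
    AHAT = [[sum(AH[i][k] * A[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]                               # A H A^T
    return [[AHAT[i][j] + rho * s[i] * s[j] for j in range(n)]
            for i in range(n)]

def projected_quasi_newton(grad, x0, iters=50, eps=1e-12):
    """Unit-step quasi-Newton descent with projection onto [eps, inf);
    grad plays the role of the Monte Carlo gradient estimates in the text
    (here written as minimization of the negated objective)."""
    n = len(x0)
    x = [max(v, eps) for v in x0]
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(iters):
        if max(abs(v) for v in g) < 1e-10:        # converged
            break
        step = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        x_new = [max(xi + si, eps) for xi, si in zip(x, step)]  # projection
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if sum(si * yi for si, yi in zip(s, y)) > 1e-16:  # curvature condition
            H = bfgs_inverse_update(H, s, y)
        x, g = x_new, g_new
    return x

# Hypothetical usage: minimize (x0 - 1)^2 + (x1 - 2)^2 over [eps, inf)^2.
coeffs = projected_quasi_newton(
    lambda v: [2.0 * (v[0] - 1.0), 2.0 * (v[1] - 2.0)], [0.5, 0.5])
```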

VI. NUMERICAL RESULTS
In this section, we apply the proposed theory to two problems. The first one is simple and is used to show the basic properties of the resulting policy and to compare it to the optimal policy from [22]. The second example is used to show the applicability of the proposed method to more complex problems.

A. Benchmarking Methods
There exists only little related work in the field of sequential joint detection and estimation, and the optimal procedure proposed in [22] is only applicable if a low-dimensional representation of the data exists. Therefore, we use, as in, e.g., [22], a two-step procedure as a reference for comparison. That is, we use a sequential detector followed by a minimum mean-squared error (MMSE) estimator. Although there exist different sequential tests for multiple hypotheses [12], [35], we resort to the Matrix Sequential Probability Ratio Test (MSPRT) [12, Section 4.1] due to its easy implementation and nice asymptotic properties. The MSPRT uses the pairwise log-likelihood ratios between H_m and H_j for m ≠ j, i.e.,

L_{mj}(n) = log( p(x_{1:n} | H_m) / p(x_{1:n} | H_j) ).

The stopping rule and the decision rule of the MSPRT are given by [12, Eqs. (4.3) and (4.4)]: the procedure stops at the first n at which, for some m, L_{mj}(n) exceeds the threshold A_{mj} for all j ≠ m, and it then decides in favor of this H_m. The stopping rule and the decision rule are hence parameterized by a set of thresholds A_{mj}, m, j ∈ {1, . . ., M}. In order to keep the error probabilities P({δ_τ ≠ m} | H_m) below ᾱ_m for all m ∈ {1, . . ., M}, the thresholds can be chosen as in [12, Eq. (4.9)]. The two-step procedure is neither optimal nor asymptotically optimal for the problem of sequential joint detection and estimation, as its stopping rule is determined by the MSPRT, which does not take any estimation error into account. However, the MMSE estimator is the optimal estimator in the MSE sense and, under certain conditions, the MSPRT is asymptotically optimal. Hence, the two-step procedure uses an asymptotically optimal detector followed by an optimal estimator. Nevertheless, the overall performance is not optimal, as will be demonstrated by the examples in this section. This emphasizes the benefits of a joint design.
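A minimal MSPRT sketch along the lines described above: cumulative pairwise log-likelihood ratios are compared against a threshold matrix, which is taken as an input here rather than computed via [12, Eq. (4.9)]; the Gaussian example and the threshold value 2.0 are hypothetical.

```python
import math

def msprt(loglike_seq, thresholds):
    """Matrix SPRT sketch: stop at the first n at which, for some m, the
    pairwise log-likelihood ratios L_mj(n) = log p(x_{1:n}|H_m)/p(x_{1:n}|H_j)
    exceed the thresholds A[m][j] for all j != m; decide for that m.

    loglike_seq: iterable of per-sample log-likelihood vectors
                 [log p(x_n | H_1), ..., log p(x_n | H_M)].
    Returns (tau, decision) or (None, None) if the stream ends first.
    """
    M = len(thresholds)
    L = [0.0] * M                       # cumulative log-likelihoods
    for n, ll in enumerate(loglike_seq, start=1):
        L = [a + b for a, b in zip(L, ll)]
        for m in range(M):
            if all(L[m] - L[j] >= thresholds[m][j] for j in range(M) if j != m):
                return n, m             # stop and decide H_m
    return None, None

# Toy usage: two Gaussian hypotheses N(0,1) vs N(1,1), data favoring the second.
def ll(x):  # per-sample log-likelihoods under both hypotheses
    return [-0.5 * x ** 2 - 0.5 * math.log(2 * math.pi),
            -0.5 * (x - 1.0) ** 2 - 0.5 * math.log(2 * math.pi)]

A = [[2.0, 2.0], [2.0, 2.0]]            # hypothetical symmetric thresholds
tau, d = msprt((ll(x) for x in [1.1, 0.9, 1.0, 1.2, 0.8, 1.0]), A)
```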

B. Shift-in-Mean Scenario
The first example is used to show the basic properties of the proposed policy and to highlight the differences between the AO policy and the optimal policy. The same scenario was already used in [22] for the optimal policy. The aim is to decide among three hypotheses with equal prior probabilities. Under all hypotheses, the likelihood, i.e., the distribution of the data conditioned on the random parameter, is a Gaussian distribution with variance σ². The three hypotheses differ only in the prior distribution of the mean, and the priors have disjoint supports. More formally, the hypotheses are defined in terms of N(θ, σ²), the normal distribution with mean θ and variance σ², U(l, u), the uniform distribution on the interval [l, u), and Gam(a, b), the Gamma distribution with shape parameter a and scale parameter b. The variance of the Gaussian distribution is set to σ² = 4.
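A minimal data-generation sketch for such a scenario is given below. The exact prior parameters from [22] are not reproduced in the text, so the supports used here are hypothetical placeholders; they are only required to be disjoint, as the scenario demands.

```python
import numpy as np

def sample_scenario(rng, n, sigma2=4.0):
    """Draw (hypothesis, mean parameter, data) for a shift-in-mean model.

    The prior supports below are illustrative placeholders (the exact
    parameters of [22] are not given here); they are chosen disjoint,
    as required by the scenario.
    """
    m = rng.integers(3)  # three hypotheses with equal prior probability
    if m == 0:
        theta = rng.uniform(-3.0, -1.0)                 # U(l, u), hypothetical bounds
    elif m == 1:
        theta = rng.uniform(-1.0, 1.0)                  # hypothetical bounds
    else:
        theta = 1.0 + rng.gamma(shape=2.0, scale=1.0)   # shifted Gamma, hypothetical
    x = rng.normal(theta, np.sqrt(sigma2), size=n)      # Gaussian likelihood
    return m, theta, x
```

Conditioned on the hypothesis, the data are i.i.d. Gaussian with variance σ² = 4 and a random mean drawn from the corresponding prior.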
The aim is to design a sequential scheme that jointly infers the true hypothesis and the true parameter. The detection errors should be limited to ᾱ_1 = ᾱ_2 = ᾱ_3 = 0.05 and the estimation errors to β̄_1 = 0.2, β̄_2 = 0.15, β̄_3 = 0.1.
In addition to the two-step procedure, we use the optimal scheme presented in [22] for benchmarking purposes. As discussed before, optimal policies are only known for truncated schemes; in this example, the truncation length is set to 100 samples, whereas there is no restriction on the number of samples for the AO scheme. Moreover, in order to design the optimal procedure and to visualize both policies, a sufficient statistic in the sense of [22, Assumption A3] has to be found. For the shift-in-mean scenario, the sufficient statistic, which serves as a low-dimensional representation of the data, is the sample mean x̄_n = (1/n) Σ_{i=1}^{n} x_i; it captures all relevant information about the true hypothesis, the true parameter and the next sample. We refer to [22, Section 6.2] for details and for a description of the remaining parameters used during the design of the optimal scheme.
The optimal parameters of the AO policy are obtained via Algorithm 1, where 10^6 Monte Carlo runs are used to estimate the gradients. Moreover, the initial cost coefficients are set to λ_m^(0) = µ_m^(0) = 100 and the tolerances to ε_α^m = ε_β^m = 0.005 for all m ∈ {1, 2, 3}. The quasi-Newton method converged after 9 iterations. Before we present the results of the AO policy, we show that the assumptions are fulfilled for the problem at hand. The proof that Assumptions A3 to A6 hold is given in the supplement [28, Appendix A]. As the moments of θ_m | H_m are finite at least up to order two, the proposed policy is AO.
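The structure of the coefficient search can be sketched as follows. This is not Algorithm 1 itself, only a generic projected quasi-Newton step with a diagonal secant curvature estimate; the projection onto the nonnegative orthant reflects the fact that the cost coefficients λ_m, µ_m must stay nonnegative. All names and the curvature scheme are assumptions made for illustration.

```python
import numpy as np

def projected_quasi_newton(grad, x0, iters=50, lr=1.0, tol=1e-8):
    """Structural sketch of a projected quasi-Newton search.

    grad : callable returning the (possibly Monte Carlo estimated) gradient
    x0   : initial coefficient vector, e.g., all entries equal to 100
    The diagonal secant update below stands in for a full quasi-Newton
    (e.g., BFGS) curvature estimate and is an illustrative simplification.
    """
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    h = np.ones_like(x)                                # diagonal curvature estimate
    for _ in range(iters):
        x_new = np.maximum(x - lr * g / h, 0.0)        # step + projection onto x >= 0
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        mask = np.abs(s) > 1e-12
        h[mask] = np.clip(y[mask] / s[mask], 1e-3, 1e3)  # secant curvature update
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x, g = x_new, g_new
    return x
```

In the paper, the gradient values are themselves Monte Carlo estimates, which is why the constraints can only be met up to the stated tolerances ε_α^m, ε_β^m.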
To validate the AO scheme and to compare it with its competitors, a Monte Carlo simulation with 10^6 runs is performed. The results are summarized in Table I. Table Ia lists the detection and estimation errors as well as the nominal levels. One can directly see that the AO procedure and the optimal procedure fulfill the constraints with equality, despite the Monte Carlo uncertainty during the design process. For the two-step procedure, however, the empirical detection errors are much smaller than the nominal ones, whereas the empirical estimation errors violate the constraints. This is caused by the fact that the stopping time of the two-step procedure is determined by the stopping rule of the MSPRT, which does not take any uncertainty about the true parameter into account. From the average run-lengths summarized in Table Ib, one can see that the AO and the optimal procedure use almost the same average number of samples.
Fig. 1 shows the AO and the optimal policy, where the filled areas correspond to the AO policy and the dashed lines to the boundaries of the optimal policy, that is, the boundaries between the different regions in the state space, e.g., the region in which the procedure continues sampling and the one in which it stops and decides in favor of H_1. Although the exact shapes of the two policies differ, their general shape is similar, and the AO policy looks like a smoothed version of the optimal one. Even though no restriction was imposed on the maximum number of samples for the AO policy, the number of samples it uses is limited: the corridor between the regions for stopping and deciding in favor of H_1/H_2 closes at around 40 samples, and the corridor between the regions for stopping and deciding in favor of H_2/H_3 closes at around 32 samples. Contrary to this, the corridors in which the optimal scheme continues sampling persist until the maximum number of samples is reached.

C. Joint Symbol Decoding and Noise Power Estimation
The second example illustrates how to apply the proposed theory to real-world problems for which designing a strictly optimal procedure becomes highly challenging. Here, the proposed theory is applied to the problem of joint symbol decoding and noise power estimation. More precisely, we consider a 16-QAM symbol that is transmitted over an additive white Gaussian noise (AWGN) channel with random noise power. The aim is to simultaneously infer the transmitted symbol and the noise power using as few samples as possible. The inference of the transmitted symbol, i.e., the symbol decoding, is formulated as a hypothesis test. In the signal model for the complex-valued received signal, A ∈ {A_1, . . ., A_16} is the transmitted symbol and v_n is the noise process, which is assumed to follow a circularly symmetric Gaussian distribution with zero mean and variance σ². Hence, the conditional distribution of the received signal is CN(A_m, σ²), the circularly symmetric Gaussian distribution with mean A_m and variance σ², whose probability density function (pdf) is p(x | H_m, σ²) = (πσ²)^{-1} exp(−|x − A_m|²/σ²). The noise power is modelled by an inverse Gamma distribution. Finally, the hypotheses are formulated in terms of IGam(a, b), the inverse Gamma distribution with shape parameter a and scale parameter b, where m ∈ {1, . . ., 16}. The pdf of the inverse Gamma distribution is [36, Definition 8.22] p(σ²) = (b^a/Γ(a)) (σ²)^{−a−1} exp(−b/σ²), where Γ(•) denotes the Gamma function.
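The signal model can be sketched as follows. The unit-average-power normalization of the constellation is an assumption (the paper does not specify its scaling), and the helper names are hypothetical.

```python
import numpy as np

def make_qam16():
    """16-QAM constellation, normalized to unit average power
    (the normalization is an assumption made here for illustration)."""
    pts = np.array([a + 1j * b for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)])
    return pts / np.sqrt(np.mean(np.abs(pts) ** 2))

def generate(rng, n, a=2.1, b=0.9):
    """Draw the noise power from IGam(a, b), pick a symbol uniformly, and
    emit n observations x_i = A_m + v_i with v_i ~ CN(0, sigma^2)."""
    const = make_qam16()
    # If Y ~ Gamma(shape a, rate b), then 1/Y ~ IGam(a, b)
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b)
    m = rng.integers(16)
    # circularly symmetric noise: independent real/imag parts, variance sigma2/2 each
    noise = rng.normal(0.0, np.sqrt(sigma2 / 2), (n, 2)) @ np.array([1.0, 1j])
    return m, sigma2, const[m] + noise
```

The shape and scale values a = 2.1 and b = 0.9 match the simulation parameters stated further below; since a > 2, the prior on the noise power has finite variance.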
In the supplement [28, Appendix B], it is shown that Assumptions A3 to A6 hold. Furthermore, if a > 2, the prior distribution of the noise power has finite variance. Hence, the proposed policy is AO.
One could, at least in theory, design an optimal sequential procedure since a sufficient statistic can serve as a low-dimensional representation of the data in the sense of [22, Assumption A3]. However, although this sufficient statistic exists and is relatively low-dimensional, designing the optimal scheme on the discretized three-dimensional state space is unlikely to be feasible in practice. Therefore, we only present results for the AO and the two-step procedure.
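For this Gaussian model, a natural candidate for the low-dimensional statistic is the running sum of the samples together with the running sum of their squared magnitudes; treating this pair (plus the time index) as the sufficient statistic is an assumption here, consistent with the three-dimensional state space mentioned in the text.

```python
def update_stat(stat, x_new):
    """Recursive update of a candidate low-dimensional statistic for the
    complex Gaussian model: (sum of samples, sum of squared magnitudes,
    sample count). Its sufficiency is an assumption made for illustration.
    """
    s, q, n = stat
    return (s + x_new, q + abs(x_new) ** 2, n + 1)
```

Because the update is recursive, the full sample history never needs to be stored, which is what makes the stopping rule cheap to evaluate even for long observation sequences.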
The following parameters are used in this simulation. The shape and scale parameters of the prior distribution of the noise power are set to 2.1 and 0.9, respectively. The constellation diagram of the QAM symbols is depicted in Fig. 2. The nominal detection and estimation error levels are set to 0.01. The gradients in Algorithm 1 are estimated by 10^6 Monte Carlo runs. Moreover, the initial set of cost coefficients and the tolerances are set to λ
Fig. 3: Joint symbol decoding and noise power estimation: simulation results.
To evaluate the performance of the designed procedure, 10^6 Monte Carlo runs are performed with the AO policy and the two-step procedure. The results are summarized in Fig. 3. It can be seen from Fig. 3a and Fig. 3b that the AO procedure hits the nominal detection and estimation error levels within the design tolerances. Contrary to this, the detection errors of the two-step procedure are much smaller than the tolerated 0.01, while its estimation errors are far above the tolerated levels. This is again due to the fact that the MSPRT, which defines the stopping time of the two-step procedure, only takes the uncertainty about the true hypothesis into account and ignores the uncertainty about the noise power. From the average run-lengths depicted in Fig. 3c, one can see that the two-step procedure is much faster than the AO one; however, this comes at the cost of severely violating the estimation constraints.
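A Monte Carlo evaluation of this kind can be organized with a small harness like the one below. The interface (a `run_once` callable returning the true hypothesis, the decision, the squared estimation error and the stopping time) is a placeholder, not the paper's implementation, and the run count is reduced from the 10^6 used in the experiments.

```python
import numpy as np

def evaluate(run_once, n_runs=10_000, seed=0):
    """Estimate detection error, MSE and average run-length by Monte Carlo.

    run_once(rng) must return (true_m, decided_m, squared_err, tau);
    this interface is an assumption made for illustration.
    """
    rng = np.random.default_rng(seed)
    res = [run_once(rng) for _ in range(n_runs)]
    true_m, dec_m, se, tau = map(np.asarray, zip(*res))
    return {
        "detect_err": np.mean(true_m != dec_m),   # empirical detection error
        "mse": np.mean(se),                       # empirical estimation error
        "avg_run_length": np.mean(tau),           # average number of samples
    }
```

Such a harness makes it straightforward to compare the empirical errors against the nominal levels ᾱ_m, β̄_m and to check the design tolerances.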

VII. CONCLUSIONS
We have proposed an asymptotically optimal procedure for the problem of sequential joint detection and estimation. The procedure is designed such that it fulfills constraints on the detection and estimation errors and, as the nominal detection and estimation error levels tend to zero, minimizes the expected number of used samples. The proposed asymptotically optimal stopping rule is obtained by thresholding the instantaneous cost function, which is parameterized by a set of cost coefficients. It has further been shown that, similarly to the strictly optimal procedures, there exists a strong connection between the derivatives of the solution of the optimal stopping problem with respect to the cost coefficients and the corresponding performance measures. By exploiting this connection, a projected quasi-Newton method to optimally choose these coefficients has been introduced. To validate the proposed theory, two numerical examples have been presented. The first example illustrates the differences to the strictly optimal policy and shows that the expected number of used samples increases only marginally when the asymptotically optimal procedure is used instead. The second example demonstrates the applicability to a high-dimensional scenario.
In both numerical examples, the proposed method meets the required detection and estimation performance, whereas a two-step procedure using a sequential detector followed by an MMSE estimator fails to meet the required estimation accuracy.

A. Proof of Lemma 1
We first consider a particular sequence of random variables; according to the strong law of large numbers, its sample mean converges almost surely. In what follows, we use a set of shorthand notations for the involved densities. The right-hand side of (20) can then be written in terms of the entropy of X with pdf p_m(x).
Hence, the stated limit follows. Next, the posterior probability of hypothesis H_m is rewritten in a suitable form.
Taking the logarithm and letting n → ∞ yields the stated limit. According to Assumption A4, the KL divergence D_KL(p_m ‖ p_{m'}) is positive for m ≠ m', which implies that the corresponding terms vanish. Applying this result to the last term in (21), and then applying the exponential function to (21), gives the desired expression. From (22), it can be seen that the posterior probabilities decay exponentially as n increases.
For the numerator, an analogous limit holds. Finally, the claim of the lemma follows, which concludes the proof.
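The exponential decay of the posterior probability of a false hypothesis, which is the essence of Lemma 1, can be checked numerically in a toy model. The two-hypothesis Gaussian setup below (means 0 and 1, unit variance, equal priors) is an illustrative assumption, not the model of the paper.

```python
import numpy as np

def wrong_posterior(x):
    """Posterior probability of a wrong simple hypothesis in a toy
    two-hypothesis Gaussian model (mean 0 vs. mean 1, unit variance,
    equal priors); the data x are assumed to come from the mean-0 model.
    Computed in the log-domain for numerical stability.
    """
    x = np.asarray(x, dtype=float)
    ll0 = -0.5 * np.sum(x ** 2)          # log-likelihood under H0 (true)
    ll1 = -0.5 * np.sum((x - 1.0) ** 2)  # log-likelihood under H1 (wrong)
    return 1.0 / (1.0 + np.exp(ll0 - ll1))
```

For data generated under the true hypothesis, the log-likelihood ratio grows linearly in n (with slope given by the KL divergence), so the wrong posterior decays exponentially, in line with (22).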

B. Proof of Lemma 2
The proof of Lemma 2 roughly follows from the Bernstein–von Mises theorem. However, there exist different statements of this theorem that are based on different, sometimes very technical, assumptions. In this proof, we apply a misspecified version of the Bernstein–von Mises theorem [37] since it is based on easily verifiable conditions. Let p̃_m be the pdf of the true sampling distribution. Under the assumption that the sampling distribution is not necessarily part of the assumed model, it is stated in [37] that, under certain conditions that will be explained shortly, the posterior converges to a normal distribution centered at some suitable estimator θ̂_{m,n}, where θ* is the parameter that minimizes the KL divergence D_KL(p̃_m(x) ‖ p(x | H_m, θ_m)). Assuming that there is no model mismatch, it follows that θ* = θ_m. It can be shown that in this case the matrix in (23) becomes Fisher's information matrix.
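The Bernstein–von Mises behavior invoked here, posterior covariance shrinking like n^{-1} I^{-1}(θ), can be seen exactly in a conjugate toy model. The Gaussian-mean model below (known noise variance, Gaussian prior) is an illustrative assumption with a closed-form posterior.

```python
def posterior_var(n, sigma2=1.0, tau2=100.0):
    """Exact posterior variance of the mean in a conjugate Gaussian model:
    x_i ~ N(theta, sigma2) with prior theta ~ N(0, tau2).
    As n grows, this approaches sigma2 / n = n^{-1} I^{-1}(theta),
    illustrating the Bernstein-von Mises limit used in the proof.
    """
    return 1.0 / (1.0 / tau2 + n / sigma2)
```

The prior's influence (the 1/tau2 term) is washed out at rate n^{-1}, which is precisely why the posterior variance term Tr(Σ_m) in the instantaneous cost becomes negligible for large n.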

C. Proof of Lemma 3
From the definition of ḡ(x_{1:n}), an upper bound follows whose first term can be bounded directly. The expectation in the second term of (24) can be rewritten as an integral, which, by using the definition of the posterior variance and expanding the square, can be expanded further. Due to Jensen's inequality [38], the last term in (26) is upper bounded, and hence we obtain an upper bound for the integral in (25). This bound is finite according to Assumption A2 for all m ∈ {1, . . ., M} and, hence, (25) is finite. Finally, as the coefficients λ̄_m, μ̄_m are finite for all m ∈ {1, . . ., M}, we conclude that the expectation of D̄_{m,n}(x_{1:n}), and hence also the expectation of ḡ(x_{1:n}), is finite.

D. Proof of Theorem 3
Assume that both schemes, the optimal one presented in [20], [21] and the AO scheme, are truncated, i.e., the number of samples is restricted not to exceed N. A scheme with stopping rule Ψ can then be characterized by a set of cost functions, where V_n(x_{1:n}; Ψ) describes the cost of the optimal stopping problem when using the stopping rule Ψ after observing x_{1:n}; in particular, V_0 is the overall cost of the scheme. Moreover, the definition of the AO stopping rule in (8) implies the convergence stated in (28) as max{ᾱ_1, . . ., ᾱ_M, β̄_1, . . ., β̄_M} → 0, where τ and τ° denote the stopping times of the optimal and the AO scheme, respectively. Define the sequence of non-negative functions ΔV_n(x_{1:n}) = V_n(x_{1:n}; Ψ°) − V_n(x_{1:n}; Ψ). The convergence stated in (28) implies that ΔV_0 → 0. For an arbitrary n, the first two terms in (29) vanish if the stopping rule at time n converges, whereas the last term vanishes if and only if ΔV_{n+1} → 0 almost everywhere. The latter implies the convergence of the stopping rules and the cost functions at time n + 1, which in turn only holds if the stopping rules at time n + 1 converge, i.e., Ψ°_{n+1} − Ψ_{n+1} → 0, and the cost functions at time n + 2 converge almost everywhere. Iterating this argument, the stopping rules at times n, . . ., N have to converge so that ΔV_n → 0 holds almost everywhere. In particular, (28) implies that the stopping rules converge for all n ≥ 0.
For an arbitrary stopping rule Ψ, the gradient of the cost function can, similarly to [21, Lemma IV.2.], be written in closed form. Using similar arguments as above, it can be shown that the gradient converges if and only if the stopping rules converge. As it was shown previously that the stopping rules converge almost everywhere, the gradients converge as well.
According to [21, Theorem IV.1.], a corresponding identity holds and, as the gradients converge, this relationship also holds for the AO stopping rule in the asymptotic case. In this proof, truncated versions of the strictly optimal and the AO procedure were considered. As there is no restriction on the maximum number of samples for the AO procedure, the limit N → ∞ has to be taken; however, this does not affect the above derivations. This completes the proof.
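The backward recursion underlying the truncated cost functions V_n can be illustrated on a drastically simplified, state-independent problem. The stopping costs g[n] and the per-sample cost c below are placeholders; the paper's cost functions depend on the observed data x_{1:n}, which is omitted here.

```python
import numpy as np

def backward_induction(g, c, N):
    """Toy backward induction for a truncated optimal stopping problem:
    g[n] is a (state-independent, for simplicity) cost of stopping at
    time n and c the cost per additional sample, so that
        V_n = min(g[n], c + V_{n+1}),   V_N = g[N].
    This only illustrates the recursion defining V_n(x_{1:n}; Psi),
    not the paper's state-dependent cost functions.
    """
    V = np.empty(N + 1)
    V[N] = g[N]                       # at the truncation point one must stop
    for n in range(N - 1, -1, -1):
        V[n] = min(g[n], c + V[n + 1])
    return V
```

The scheme stops at the first n where the stopping cost g[n] attains the minimum, which mirrors how the optimal stopping rule compares the instantaneous cost against the expected cost of continuing.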

E. Proof of Theorem 4
For sufficiently small nominal error levels, i.e., such that Theorem 3 holds, the gradients of the objective in (15) can be evaluated. According to (8), we can conclude that the procedure parameterized by the solution of (15) fulfills all constraints with equality and asymptotically uses on average as few samples as possible, i.e., it solves (2) asymptotically.
Asymptotically Optimal Procedures for Sequential Joint Detection and Estimation: Supplementary Material

B Proof of Assumptions A3 to A6 for the QAM Scenario
Address correspondence to Dominik Reinhard, Signal Processing Group, Technische Universität Darmstadt, Merckstraße 25, 64283 Darmstadt, Germany; e-mail: reinhard@spg.tu-darmstadt.de
To show that Assumption A3 holds, the second derivative of the log-likelihood function has to be calculated, i.e., ∂²/∂(σ²)² log p(x | H_m, σ²) = (σ²)^{-2} − 2|x − A_m|² (σ²)^{-3}. Taking the negative conditional expectation of the second-order derivative yields Fisher's information, i.e., I_m(σ²) = (σ²)^{-2}. For Fisher's information, it holds that 0 < I_m(σ²) < ∞ except on a P-null set, i.e., with probability one. Therefore, Assumption A3 is fulfilled.
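The value I_m(σ²) = (σ²)^{-2} can be checked numerically. For x ~ CN(A_m, σ²), the squared magnitude |x − A_m|² is exponentially distributed with mean σ², which the sketch below exploits; the function name and setup are illustrative.

```python
import numpy as np

def fisher_info_mc(sigma2, n=200_000, seed=0):
    """Monte Carlo check of I_m(sigma^2) = (sigma^2)^{-2} for the
    circularly symmetric Gaussian likelihood: the score with respect to
    sigma^2 is -1/sigma^2 + |x - A_m|^2 / sigma^4, and E[score^2] is
    estimated from samples.
    """
    rng = np.random.default_rng(seed)
    # |x - A_m|^2 is exponential with mean sigma2 under CN(A_m, sigma2)
    r2 = rng.exponential(sigma2, n)
    score = -1.0 / sigma2 + r2 / sigma2 ** 2
    return np.mean(score ** 2)
```

For σ² = 2 the estimate should be close to 1/σ⁴ = 0.25, confirming the closed-form expression used in the proof.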
To prove Assumption A4, the KL divergence D_KL(p(x | H_m, σ²) ‖ p(x | H_k, σ²)) has to be calculated, which is given by |A_m − A_k|²/σ². As the means A_m and A_k are deterministic and not equal, Assumption A4 holds with probability one. Contrary to the shift-in-mean scenario, the KL divergence D_KL(p(x | H_m, σ²) ‖ p(x | H_m, σ̃²)) also has to be considered; it is given by log(σ̃²/σ²) + σ²/σ̃² − 1, and it can be shown that it admits a second-order Taylor expansion in the sense of Assumption A6. The derivative of the log-likelihood function with respect to σ² is continuous and bounded on any interval excluding zero. In order to fulfill Assumption A5, this derivative must be bounded almost surely in a neighborhood U of the true variance. As σ² = 0 occurs with probability zero and 0 ∈ U also occurs with probability zero, this condition holds almost surely. The function f_{θ_m}(x) can then be set to max_{σ²∈U} |∂/∂σ² log p(x | H_m, σ²)|.

Fig. 1: Shift-in-Mean scenario: Comparison of the AO policy and the strictly optimal policy.
Fig. 2: Constellation diagram of the QAM symbols.

Dominik Reinhard¹, Michael Fauß², and Abdelhak M. Zoubir¹
¹ Signal Processing Group, Technische Universität Darmstadt, 64283 Darmstadt, Germany
² Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA

A Proof of Assumptions A3 to A6 for the Shift-in-Mean Scenario
Fisher's information is given by I_m(θ_m) = (σ²)^{-1}, for which 0 < I_m(θ_m) < ∞ holds with probability one. Hence, Assumption A3 is fulfilled. The Kullback-Leibler divergence (KL divergence) is calculated as D_KL(p(x | H_m, θ_m) ‖ p(x | H_k, θ_k)) = (θ_m − θ_k)²/(2σ²). As the prior distributions have disjoint supports, the KL divergence is positive whenever k ≠ m; hence, Assumption A4 holds. Moreover, the KL divergence has a quadratic form, which means that it admits a second-order Taylor expansion in the sense of Assumption A6. It is left to show that Assumption A5 holds. The log-likelihood function is given by
log p(x | H_m, θ) = −0.5 log(2πσ²) − (x − θ)²/(2σ²). (S-A.1)
It suffices to show that the absolute value of the first derivative of the log-likelihood function is bounded in a neighborhood of θ_m. The score, i.e., the first derivative of the log-likelihood function, is
∂/∂θ log p(x | H_m, θ) = (x − θ)/σ².
Let U denote the neighborhood of the true parameter θ_m; then, for all θ ∈ U, the score is continuous and bounded. Hence, by setting f_{θ_m}(x) = max_{θ∈U} |(x − θ)/σ²|, Assumption A5 is fulfilled.
Next, the product p(H_m | x_{1:n}) Tr(Σ_m) needs closer inspection. According to Lemma 1, p(H_m | x_{1:n}) decays exponentially for the false hypotheses as n increases.

TABLE I: Shift-in-mean scenario: simulation results.
According to [37, Lemma 2.1.], the posterior distribution of Θ_m converges in total variation to a normal distribution with covariance matrix n^{-1} I_m^{-1}(θ_m) if Assumption A5 and Assumption A6 hold. Therefore, we can conclude that Tr(Σ_m) behaves asymptotically like n^{-1} Tr(I_m^{-1}(θ_m)) almost surely.