Detecting a Vector Based on Linear Measurements

We consider a situation where the state of a system is represented by a real-valued vector. Under normal circumstances, the vector is zero, while an event manifests as non-zero entries in this vector, possibly only a few. Our interest is in the design of algorithms that can reliably detect events (i.e., test whether the vector is zero or not) with the least amount of information. We place ourselves in a situation, now common in the signal processing literature, where information about the vector comes in the form of noisy linear measurements. We derive information bounds in an active learning setup and exhibit some simple near-optimal algorithms. In particular, our results show that the task of detection within this setting is at once much easier and simpler than, and markedly different from, the tasks of estimation and support recovery.


Introduction
We consider a situation where the state of a system is represented by a real-valued vector x ∈ R^n. Under normal circumstances, the vector x is zero, while an event manifests as non-zero entries in x, possibly only a few. Our interest is in the design of algorithms that reliably detect events, i.e., test whether x = 0 or x ≠ 0, with the least amount of information. We assume that we may learn about x via noisy linear measurements of the form

y_i = ⟨a_i, x⟩ + z_i,   i = 1, . . . , m,   (1)

where the measurement vectors a_i have Euclidean norm bounded by 1 and the noise terms z_i are i.i.d. standard normal. Assuming that we may take only a limited number of linear measurements, the engineering challenge is in choosing them so as to minimize the false alarm and missed detection rates. We derive information bounds, establishing some fundamental detection limits relating the signal strength and the number of linear measurements. The bounds we obtain apply to all adaptive schemes, where the ith measurement vector a_i may be chosen based on the past measurements, i.e., as a function of (a_1, y_1, . . . , a_{i−1}, y_{i−1}).
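For concreteness, the measurement model (1) can be simulated as follows. This is a minimal sketch; the values of n and m, the helper name `measure`, and the choice of design are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure(x, A):
    """Return noisy linear measurements y_i = <a_i, x> + z_i, z_i i.i.d. N(0, 1).

    The rows a_i of A are the measurement vectors; the model requires
    each to have Euclidean norm at most 1.
    """
    A = np.atleast_2d(A)
    assert np.all(np.linalg.norm(A, axis=1) <= 1 + 1e-9)
    return A @ x + rng.standard_normal(A.shape[0])

n, m = 1000, 50
x = np.zeros(n)                               # null hypothesis: no event
A = np.tile(np.ones(n) / np.sqrt(n), (m, 1))  # m copies of the unit-norm constant vector
y = measure(x, A)
print(y.shape)
```

Adaptive schemes fit the same interface: each new row of A may be computed from the measurements already collected.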

Related work
Learning as much as possible about a vector based on a few linear measurements is one of the central themes of compressive sensing (CS) [4,5,8]. Most of this literature, as it relates to signal processing, has focused on the tasks of estimation and support recovery. Particularly in surveillance situations, however, it makes sense to perform detection before estimation because, as we shall confirm, reliable detection is possible at much lower signal-to-noise ratios or, equivalently, with far fewer linear measurements than estimation. This can be achieved with much greater implementation ease and much lower computational cost than standard CS methods based on convex programming. The literature on the detection of a high-dimensional signal is centered around the classical normal mean model, based on observations y_i = x_i + z_i, where the z_i's are i.i.d. standard normal. In this model, only one noisy observation is available per coordinate, so that some assumptions are necessary and the most common one, by far, is that the vector x = (x_1, . . . , x_n) is sparse. This setting has attracted a fair amount of attention [7,15,16], with recent publications allowing adaptive measurements [12]. More recently, a few papers [2,10,14] extended these results to testing for a sparse coefficient vector in a linear system with the aim of characterizing the detection feasibility. These papers work with designs having low mutual coherence, for example, assuming that the a_i's are i.i.d. multivariate normal. As we shall see below, such designs are not always desirable. We also mention [13], which assumes that an estimator x̂ of x is available and examines the performance of a test based on x̂; and [17], which proposes a Bayesian approach for the detection of sparse signals in a sensor network for which the design matrix is assumed to have some polynomial decay in terms of the distance between sensors.
We mention that the present paper may be seen as a companion to [1], which considers the tasks of estimation and support recovery in the same setting.

Notation and terminology
Our detection problem translates into a hypothesis testing problem H_0 : x = 0 versus H_1 : x ∈ X, for some subset X ⊂ R^n \ {0}. A test procedure based on m measurements of the form (1) is a binary function of the data, i.e., T = T(a_1, y_1, . . . , a_m, y_m), with T = ε ∈ {0, 1} indicating that T favors H_ε. The (worst-case) risk of a test T is defined as

γ(T) := P_0(T = 1) + sup_{x ∈ X} P_x(T = 0),

where P_x denotes the distribution of the data when x is the true underlying vector. With a prior π on the set of alternatives X, the corresponding average (Bayes) risk is defined as

γ_π(T) := P_0(T = 1) + E_π[P_x(T = 0)],

where E_π denotes the expectation under π. Note that for any prior π and any test procedure T,

γ(T) ≥ γ_π(T).   (2)

For a vector a = (a_1, . . . , a_k), |a| := Σ_j |a_j| denotes its ℓ1 norm, ‖a‖ := (Σ_j a_j²)^{1/2} its Euclidean norm, and a^T its transpose. For a matrix M, ‖M‖_op := sup_{‖a‖ ≤ 1} ‖Ma‖ denotes its operator norm.
Everywhere in the paper, x = (x_1, . . . , x_n) denotes the unknown vector, while 1 denotes the vector with all coordinates equal to 1, its dimension implicitly given by the context.

Content
In Section 2 we focus on vectors x with non-negative coordinates. This situation leads to an exceedingly simple, yet near-optimal procedure based on a measurement scheme that is completely at odds with what is commonly used in CS. In Section 3 we treat the case of a general vector x and derive another simple, near-optimal procedure. In both cases, the methods we suggest are non-adaptive, in the sense that the measurement vectors are chosen independently of the observations, yet perform nearly as well as any adaptive method. In Section 4 we discuss our results and important extensions, particularly to the case of structured signals.

Vectors with non-negative entries
Vectors with non-negative entries may be relevant in image processing, for example, where the object to be detected is darker (or lighter) than the background. As we shall see, detecting such a vector is essentially straightforward in every respect. In particular, the use of low-coherence designs is counter-productive in this situation. The first thing that comes to mind, perhaps, is gathering strength across coordinates by measuring x with the constant vector 1/ √ n. And, with a budget of m measurements, we simply take this measurement m times.
Consider then the test that rejects when

Σ_{i=1}^m y_i > τ √m,

where τ is some critical value. Its risk against a vector x is equal to

Φ̄(τ) + Φ(τ − √(m/n) |x|),

where Φ is the standard normal distribution function and Φ̄ := 1 − Φ. In particular, if τ = τ_m → ∞, this test has vanishing risk against alternatives satisfying √(m/n)|x| − τ_m → ∞.
Proposition 1. Since we may choose τ_m → ∞ as slowly as we wish, in essence, the simple sum test based on repeated measurements with the constant vector has vanishing risk against alternatives satisfying √(m/n)|x| → ∞.
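The sum test is straightforward to try in simulation. The sketch below estimates its false alarm rate and power by Monte Carlo; all parameter values (n, m, τ, S, µ) are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sum_test(x, m, tau):
    """Measure m times with a = (1/sqrt(n)) 1; reject when sum_i y_i > tau * sqrt(m)."""
    n = len(x)
    y = x.sum() / np.sqrt(n) + rng.standard_normal(m)  # y_i = <a, x> + z_i
    return y.sum() > tau * np.sqrt(m)

n, m, tau, reps = 10_000, 100, 3.0, 2000
S, mu = 50, 2.0
x = np.zeros(n)
x[:S] = mu        # here sqrt(m/n) * |x| = 0.1 * 100 = 10, well above tau
false_alarm = np.mean([sum_test(np.zeros(n), m, tau) for _ in range(reps)])
power = np.mean([sum_test(x, m, tau) for _ in range(reps)])
print(false_alarm, power)
```

With these values, the false alarm rate is close to Φ̄(3) ≈ 0.0013 and the power is essentially 1, in line with the risk formula above.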
Proof. The result is a simple consequence of the fact that m^{−1/2} Σ_{i=1}^m y_i ∼ N(√(m/n) |x|, 1).

Although the choice of measurement vectors and the test itself are both exceedingly simple, the resulting procedure comes close to achieving the best possible performance in this particular setting, as the following information bound reveals.

Theorem 1. Let X(µ, S) denote the set of vectors in R^n having exactly S non-zero entries, all equal to µ > 0. Based on m measurements of the form (1), possibly adaptive, any test for H_0 : x = 0 versus H_1 : x ∈ X(µ, S) has risk at least 1 − (µS/2)√(m/n). In particular, the risk against alternatives x ∈ X(µ, S) with √(m/n)|x| = √(m/n) Sµ → 0 goes to 1 uniformly over all procedures.
Proof. The standard approach to deriving uniform lower bounds on the risk is to put a prior on the set of alternatives and use (2). We simply choose the uniform prior on X(µ, S), which we denote by π. The hypothesis testing problem reduces to H_0 : x = 0 versus H_1 : x ∼ π, for which the likelihood ratio test is optimal by the Neyman-Pearson fundamental lemma. The likelihood ratio is defined as

L := dP_π / dP_0 = E_π [dP_x / dP_0],

where E_π denotes the expectation with respect to π, and the related test is T = 1{L > 1}. It has risk equal to

γ_π(T) = 1 − ‖P_π − P_0‖_TV,   (3)

where P_π := E_π[P_x], the π-mixture of the P_x, and ‖·‖_TV is the total variation distance. By Pinsker's inequality,

‖P_π − P_0‖_TV ≤ √(K(P_0, P_π)/2),   (4)

where K(P_0, P_π) denotes the Kullback-Leibler divergence. We have

K(P_0, P_π) = E_0[−log L]
 ≤ E_π E_0[−log(dP_x / dP_0)]
 = E_π E_0[ Σ_{i=1}^m ( ⟨a_i, x⟩²/2 − y_i ⟨a_i, x⟩ ) ]
 = (1/2) Σ_{i=1}^m E_0[ a_i^T C a_i ]
 ≤ (m/2) ‖C‖_op,

where C = (c_jk) := E_π(xx^T). The first line is by definition; the second is by definition of dP_x / dP_0, by the application of Jensen's inequality justified by the convexity of x ↦ −log x, and by Fubini's theorem; the third is by the independence of a_i, y_i and x (under P_0), and by the fact that E_0(y_i) = 0; the fourth is by the independence of a_i and x (under P_0) and by Fubini's theorem; the fifth is because ‖a_i‖ ≤ 1 for all i.
Since under π the support of x is chosen uniformly at random among subsets of size S, we have

c_jj = µ² S/n for all j,   c_jk = µ² S(S − 1)/(n(n − 1)) for all j ≠ k.

This simple matrix has operator norm ‖C‖_op = µ² S²/n. Coming back to the divergence, we therefore have K(P_0, P_π) ≤ m µ² S²/(2n), and returning to (3) via (4), we bound the risk of the likelihood ratio test from below by 1 − (µS/2)√(m/n).

With Proposition 1 and Theorem 1, we conclude that the following is true in a minimax sense: Reliable detection of a nonnegative vector x ∈ R^n from m noisy linear measurements is possible if √(m/n)|x| → ∞ and impossible if √(m/n)|x| → 0.
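The key quantity in the proof, the operator norm of C = E_π(xx^T), is easy to check numerically. The following sketch, with arbitrary small values of n, S and µ, confirms that ‖C‖_op = µ²S²/n for the matrix with the coefficients given above.

```python
import numpy as np

n, S, mu = 20, 4, 1.5
# Under the uniform prior on supports of size S:
#   c_jj = mu^2 S/n,   c_jk = mu^2 S(S-1)/(n(n-1)) for j != k.
diag = mu**2 * S / n
off = mu**2 * S * (S - 1) / (n * (n - 1))
C = np.full((n, n), off)
np.fill_diagonal(C, diag)
op_norm = np.linalg.norm(C, 2)  # spectral norm = largest eigenvalue (C is PSD)
print(op_norm, mu**2 * S**2 / n)
```

The top eigenvector is (a multiple of) the all-ones vector, with eigenvalue diag + (n − 1)·off = µ²S²/n.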

General vectors
When dealing with arbitrary vectors, the measurement vector 1/√n may not be appropriate. In fact, the resulting procedure is completely insensitive to vectors x such that ⟨1, x⟩ = 0. Nevertheless, if one selects a measurement vector a from the Bernoulli ensemble, i.e., with independent entries taking values ±1/√n with equal probability, then on average |⟨a, x⟩| is of order ‖x‖/√n. This is true when the number of non-zero entries in x grows with the dimension n; if we repeat the process a few times, it becomes true for any fixed vector x.
Proposition 2. Draw h_m measurement vectors a^(1), . . . , a^(h_m) independently from the Bernoulli ensemble, take m/h_m measurements with each, and let T_k denote √(m/h_m) times the average of the measurements taken with a^(k), so that T_k ∼ N(√(m/h_m) ⟨a^(k), x⟩, 1). Consider the test that rejects when

Σ_{k=1}^{h_m} (T_k² − 1) > τ_m √h_m,   (10)

where τ_m → ∞. When h_m → ∞, its risk against a vector x, averaged over the Bernoulli ensemble, vanishes whenever (m/n)‖x‖² ≥ 2 τ_m √h_m.

Since we may take h_m and τ_m increasing as slowly as we please, in essence, the test is reliable when (m/n)‖x‖² → ∞. Compared with repeatedly measuring with the constant vector 1/√n as studied in Proposition 1, there is a substantial loss in power when |x|² is much larger than ‖x‖². For example, when x has S non-zero entries all equal to µ > 0, |x|² = S ‖x‖².
Proof. For simplicity, assume that m/h_m is an integer and fix x throughout. For short, let θ_k := √(m/h_m) ⟨a^(k), x⟩, so that, given the design, the T_k's are independent with T_k ∼ N(θ_k, 1). Under the null, each T_k² − 1 has mean 0 and variance 2. Therefore, by Chebyshev's inequality, the probability of (10) under the null is bounded from above by 3/τ_m² → 0. Similarly, since Σ_k θ_k² concentrates around its mean (m/n)‖x‖² over the Bernoulli ensemble, the probability of (10) not happening under an alternative x satisfying (m/n)‖x‖² ≥ 2 τ_m √h_m is, by Chebyshev's inequality again, bounded from above by a quantity tending to 0 as h_m → ∞.

Again, this relatively simple procedure nearly achieves the best possible performance.
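The procedure of Proposition 2 can be simulated as follows. The statistic coded here follows the description above (h_m Bernoulli vectors, m/h_m repetitions each); the parameter values are illustrative assumptions, and the example signal has entries of both signs, so that ⟨1, x⟩ = 0 and the sum test of Proposition 1 would be powerless.

```python
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_test(x, m, h, tau):
    """Draw h Bernoulli vectors, measure each m/h times; reject when
    sum_k (T_k^2 - 1) > tau * sqrt(h)."""
    n, r = len(x), m // h
    stat = 0.0
    for _ in range(h):
        a = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)  # Bernoulli ensemble
        ybar = a @ x + rng.standard_normal(r).mean()       # average of r measurements
        T = np.sqrt(r) * ybar                              # T ~ N(sqrt(m/h) <a, x>, 1)
        stat += T**2 - 1
    return stat > tau * np.sqrt(h)

n, m, h, tau, reps = 1000, 1000, 10, 3.0, 500
x = np.zeros(n)
x[:20], x[20:40] = 1.0, -1.0   # mixed signs, <1, x> = 0; here (m/n)||x||^2 = 40
null_rate = np.mean([bernoulli_test(np.zeros(n), m, h, tau) for _ in range(reps)])
power = np.mean([bernoulli_test(x, m, h, tau) for _ in range(reps)])
print(null_rate, power)
```

Under the null, the statistic is a centered chi-square, so the false alarm rate is small; against this mixed-sign alternative the test still rejects most of the time.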
Theorem 2. Let X_±(µ, S) denote the set of vectors in R^n having exactly S non-zero entries, all equal to ±µ. Based on m measurements of the form (1), possibly adaptive, any test for H_0 : x = 0 versus H_1 : x ∈ X_±(µ, S) has risk at least 1 − (µ/2)√(Sm/n).
Proof. Again, we choose the uniform prior on X_±(µ, S). The proof is then completely parallel to that of Theorem 1, now with C = µ²(S/n) I, since the signs of the nonzero entries of x are i.i.d. Rademacher, so that ‖C‖_op = µ² S/n.
With Proposition 2 and Theorem 2, we conclude that the following is true in a minimax sense: Reliable detection of a vector x ∈ R^n from m noisy linear measurements is possible if √(m/n)‖x‖ → ∞ and impossible if √(m/n)‖x‖ → 0.

Discussion
In this short paper, we tried to convey some very basic principles about detecting a high-dimensional vector with as few linear measurements as possible. First, when the vector has non-negative entries, repeatedly sampling with the constant vector 1/√n is near-optimal. Second, when the vector is general but sparse, repeatedly sampling with a few measurement vectors drawn from a standard random (e.g., Bernoulli) ensemble is also near-optimal. In both cases, choosing the measurement vectors adaptively does not bring a substantial improvement. Moreover, sparsity does not help, in the sense that the detection rates depend on x only through |x| and ‖x‖, respectively.

A more general adaptive scheme
Suppose we may take as many linear measurements of the form (1) as we please (possibly an infinite number), with the only constraint being on the total measurement energy:

Σ_i ‖a_i‖² ≤ m.   (11)

(Note that m is no longer constrained to be an integer.) This is essentially the setting considered in [11,12], and clearly, the setup we studied in the previous sections satisfies this condition. So what can we achieve with this additional flexibility? In fact, the same results apply. The lower bounds in Theorem 1 and Theorem 2 are proved in exactly the same way: the constraints on the number and norms of the measurement vectors enter the proofs only through the bound Σ_i E_0[a_i^T C a_i] ≤ m ‖C‖_op, which continues to hold under (11). Of course, Proposition 1 and Proposition 2 apply since the measurement schemes used there satisfy (11). However, in this special case they could be simplified. For instance, in Proposition 1 we could take a single measurement with the vector √(m/n) 1.

Detecting structured signals
The results we derived are tailored to the case where x has no known structure. What if we know a priori that the signal x has some given structure? The most emblematic case is when the support of x is an interval of length S. In the classical setting where each coordinate of x is observed once, the scan statistic (aka generalized likelihood ratio test) is the tool of choice [3]. How does the story change in the setting where adaptive linear measurements in the form of (1) can be taken?
Perhaps surprisingly, knowing that x has such a specific structure does not help much. Indeed, Theorem 1 and Theorem 2 are proved in the same way. In the case of non-negative vectors, we use the uniform prior on vectors with support an interval of length S and nonzero entries all equal to µ, and the proof is identical, except for the matrix C, which now has coefficients c_jk = µ² max(S − |j − k|, 0)/n for all j, k. Because C is symmetric, we have

‖C‖_op ≤ max_j Σ_k |c_jk| ≤ µ² S²/n,

which is exactly the same bound as before. In the general case, the arguments are really identical, except that we use the uniform prior on vectors with support an interval of length S and nonzero entries all equal to µ in absolute value, with i.i.d. signs. (Here the matrix C is exactly the same as in the proof of Theorem 2.) Of course, Proposition 1 and Proposition 2 apply here too, so the conclusions are the same. Here too, these conclusions hold in the more general setup with measurements satisfying (11).
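The bound on ‖C‖_op for interval-supported signals is easy to verify numerically. The sketch below, with arbitrary values of n, S and µ, checks that the operator norm of the banded matrix is at most its maximal absolute row sum, which equals µ²S²/n.

```python
import numpy as np

n, S, mu = 30, 5, 1.0
# c_jk = mu^2 max(S - |j - k|, 0) / n  (interval support, uniform prior)
j = np.arange(n)
C = mu**2 * np.maximum(S - np.abs(j[:, None] - j[None, :]), 0) / n
op_norm = np.linalg.norm(C, 2)
max_row_sum = np.abs(C).sum(axis=1).max()  # bounds the op norm of a symmetric matrix
print(op_norm, max_row_sum, mu**2 * S**2 / n)
```

For an interior row, the row sum is (µ²/n)(S + 2 Σ_{d=1}^{S−1}(S − d)) = µ²S²/n, recovering the bound used in the proof.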
To appreciate how powerful the ability to take linear measurements of the form (1) under the constraint (11) really is, let us stay with the same task of detecting an interval of length S with a positive mean. On the one hand, we have the simple test based on Σ_i y_i studied in Proposition 1. On the other hand, suppose we measure each coordinate separately with equal energy, collecting the observations

y_j = √(m/n) x_j + z_j,   j = 1, . . . , n,   (13)

and compute the scan statistic

max_I S^{−1/2} Σ_{j∈I} y_j,

where the maximum is over intervals I of length S. While the former requires √(m/n)|x| → ∞ to be asymptotically powerful, the scan statistic requires lim inf √(m/n)|x| · (S log₊(n/S))^{−1/2} ≥ √2, where log₊(x) := max(log x, 1). With observations provided in the form of (13), this is asymptotically optimal [3]. Note that (13) is a special case of (11). Hence, the ability to take measurements of the form (1) allows one to detect structured signals that are potentially much weaker, without a priori knowledge of the structure and with much simpler algorithms. Hardware that is able to take linear measurements such as (1) is currently being developed [9].
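For comparison, here is a minimal sketch of the scan statistic computed from per-coordinate observations as in (13); the dimensions, amplitude and signal placement are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def scan_statistic(y, S):
    """Max over intervals I of length S of sum(y[I]) / sqrt(S)."""
    window_sums = np.convolve(y, np.ones(S), mode="valid")  # all interval sums
    return window_sums.max() / np.sqrt(S)

n, m, S, mu = 500, 500, 20, 1.0
x = np.zeros(n)
x[100:100 + S] = mu                              # interval-supported signal
y = np.sqrt(m / n) * x + rng.standard_normal(n)  # per-coordinate scheme (13)
stat = scan_statistic(y, S)
print(stat)
```

The window containing the signal contributes a term of mean √(m/n) µ √S, which here is comfortably above the typical size of the maximum over pure-noise windows.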

A comparison with estimation and support recovery
The results we obtain for detection are in sharp contrast with the corresponding results in estimation and support recovery. Though, by definition, detection is always easier, in most other settings it is not that much easier. For example, take the normal mean model described in the Introduction, assuming x is sparse with S coefficients equal to µ > 0. In the regime where S = n^{1−β}, β ∈ (1/2, 1), detection is impossible when µ ≤ √(2r log n) with r < ρ_1(β), while support recovery is possible when µ ≥ √(2r log n) with r > ρ_2(β), for fixed functions ρ_1, ρ_2 : (1/2, 1) → (0, ∞) [7,15,16]. So the difference is a constant factor in the per-coordinate amplitude. In the setting we consider here, we are able to detect at a much smaller signal-to-noise ratio than what is required for estimation or support recovery, which nominally require at least m ≥ S measurements regardless of the signal amplitude, where S is the number of nonzero entries in x. In fact, [1] shows that reliable support recovery is impossible unless µ is of order at least √(n/m). In detection, however, we saw that m = 1 measurement may suffice if the signal amplitude is large enough, and the amplitude needed can be smaller than √(n/m) by a factor of S or √S in the nonnegative and general cases, respectively. Therefore, having the ability to take linear measurements of the form (1) in a surveillance setting, it makes sense to perform detection as described here before estimation (identification) or support recovery (localization) of the signal.

Possible improvements
Though we provided simple algorithms that nearly match the information bounds, there might be room for improvement. For one thing, it might be possible to reliably detect when, say, √(m/n)|x| is sufficiently large (in the case where x_j ≥ 0 for all j) without necessarily tending to infinity. A good candidate for this might be the Bayesian algorithm proposed in [6].
More importantly, in the general case of Section 3, we might want to design an algorithm that detects any fixed x with high probability, without averaging over the measurement design. This averaging may be interpreted in at least two ways: (A1) If we were to repeat the experiment many times, each time choosing new measurement vectors and corrupting the measurements with new noise, then for a fixed vector x, in most instances the test would be accurate. (A2) If the measurement vectors are drawn once and then held fixed, then for a given amplitude the test is accurate for most vectors x, though not necessarily for all sign configurations.
Interpretation (A1) is controversial, as we do not repeat the experiment, which would amount to taking more samples. And interpretation (A2) raises the issue of robustness to any sign configuration. One way, and the only way we know of, to ensure this robustness is to use a CS-like sampling scheme, i.e., choosing a_1, . . . , a_m in (1) such that the matrix with these rows satisfies RIP-like properties. This setting is studied in detail in [2], which in a nutshell says the following. Take measurement vectors from the Bernoulli ensemble, say, but hold the measurement design fixed. This is just a way to build a measurement matrix satisfying the RIP and with low mutual coherence. In particular, this requires that m be of order at least S log n, though what follows assumes that m ≫ S(log n)³. Based on such measurements, the test based on Σ_i y_i² is able to detect when (√m/n)‖x‖² → ∞, which is more stringent than what is required in Proposition 2; while the test based on max_{j=1,...,n} |Σ_i a_ij y_i| is able to detect when lim inf √(m/n) max_j |x_j| (log n)^{−1/2} > √2, which, except for the log factor, is what is required for support recovery. And this is essentially optimal, as shown in [2].