Recovering Block-structured Activations Using Compressive Measurements

We consider the problems of detection and localization of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive (linear) measurements. This is closely related to the problem of compressed sensing, where the task is to estimate a sparse vector using a small number of linear measurements. In contrast to results in compressed sensing, where it has been shown that neither adaptivity nor contiguous structure help much, we show that for reliable localization the magnitude of the weakest signals that can be recovered is strongly influenced by both structure and the ability to choose measurements adaptively, while for detection neither adaptivity nor structure reduces the requirement on the magnitude of the signal. We characterize the precise tradeoffs between the various problem parameters, the signal strength, and the number of measurements required to reliably detect and localize the block of activation. The sufficient conditions are complemented with information-theoretic lower bounds.


Introduction
Compressive measurements provide a very efficient means of recovering signals that are sparse in some basis or frame. Specifically, several papers, including Candès and Tao (2006), Donoho (2006), Candès and Tao (2007), and Candès and Wakin (2008), have shown that it is possible to recover, in an ℓ_2 sense, a k-sparse vector in n dimensions using only O(k log n) incoherent compressive measurements, instead of measuring all of the n coordinates. This is a novel and important paradigm with applications in a wide range of scientific areas. Along with ℓ_2 recovery, researchers have also considered the problems of detection and localization of a sparse signal corrupted by additive noise, the former task logically preceding the latter. The problem of detection is to test whether all components of the vector are zero. Duarte et al. (2006), Haupt and Nowak (2007) and Arias-Castro (2012) studied detection of sparse vectors from compressive measurements, while Arias-Castro et al. (2011b) identifies conditions for successful detection in sparse linear regression (see also Ingster et al. (2010)).
The problem of localization is to identify the coordinates of the non-zero elements of a signal. Wainwright (2009a) and Wainwright (2009b) studied information theoretic limits and localization properties of the LASSO procedure. More recently, researchers have contributed two important refinements: 1) considering a sparse structured signal, such as a signal consisting of adjacent coordinates or a block (Baraniuk et al., 2010, Arias-Castro et al., 2011a, Soni and Haupt, 2011), and 2) allowing for the possibility of taking adaptive measurements, i.e., where subsequent measurements are designed based on past observations (see, e.g., Candès and Davenport, 2011, Arias-Castro et al., 2011a, Haupt et al., 2009, Davenport and Arias-Castro, 2012, Malloy and Nowak, 2012). However, almost all of this work has focused on recovery or detection of (structured or unstructured) sparse data vectors from (passive or adaptive) compressed measurements.
In this work we focus on the unexplored problems of detection and localization for data matrices from compressive measurements. We are concerned with signals that are both sparse and highly structured, taking the form of a sub-matrix of a larger matrix with contiguous row and column indices. Data matrices have been considered in the context of low-rank matrix completion (see, e.g., Negahban and Wainwright, 2011, Koltchinskii et al., 2011), where recovery in Frobenius norm is studied. The problems of detection and localization for data matrices that are observed directly were studied previously; see, for example, Sun and Nobel (2010), Kolar et al. (2011), Butucea and Ingster (2011), Butucea et al. (2013), Bhamidi et al. (2012). However, compressive measurement schemes were not investigated. If the activation is unstructured, the treatment of data matrices is exactly equivalent to the treatment of data vectors. However, in the structured case the problem is rather different, as we will show. Data matrices with signals that are both sparse and highly structured form a natural model for several real-world activations, such as a group of genes (belonging to a common pathway, for instance) co-expressed under the influence of a set of similar drugs (Yoon et al., 2005), groups of patients exhibiting similar symptoms (Moore et al., 2010), or sets of malware with similar signatures (Jang et al., 2011). However, in many of these applications, it is difficult to measure, compute or store all the entries of the data matrix. For example, measuring expression levels of all genes under all possible drugs is expensive, and recording the signature of each individual malware is computationally demanding, as it might require stepping through the entire malware code. However, if we have access to linear combinations of matrix entries (i.e., compressive measurements), such as the combined expression of multiple genes under the influence of multiple drugs, then we might need to make and store only a few such measurements, while still being able to infer the existence or location of the activated block of the data matrix. Thus, the goal is to detect or recover the activated block (the set of co-expressed genes and drugs, or malware with similar signatures) using only a few compressive measurements of the data matrix, instead of observing the entire data matrix directly. We consider both passive (non-adaptive) and active (adaptive) measurements. The non-adaptive measurements are random or pre-specified linear combinations of matrix entries. In other cases, such as mixing drugs, we might be able to adapt the measurement process by using feedback to sequentially design linear combinations that are more informative.
Extensions to a setup where there is a non-contiguous sub-matrix or block of activation are also interesting, but beyond the scope of this paper. Sun and Nobel (2010), Butucea and Ingster (2011), Butucea et al. (2013), Bhamidi et al. (2012) and Kolar et al. (2011) study a problem where a large noisy matrix is observed directly, i.e., not through compressed measurements, and the block of activation is non-contiguous. In such a setting, tight upper and lower bounds are derived for the localization problem. However, passive and adaptive compressive measurement schemes were not investigated.

Table 1: Summary of known results for the sparse vector case, where the length of the vector is n and the number of active elements is k. The number of measurements is m and µ/σ represents the SNR per activated element.

              Detection                  Localization
  Passive     µ/σ ≍ √(n/(m k^2))         µ/σ ≍ √(n log n/m)
              (Arias-Castro, 2012)       (Wainwright, 2009a,b)
  Active      µ/σ ≍ √(n/(m k^2))         µ/σ ≍ √(n/m)
              (Arias-Castro, 2012)       (Davenport and Arias-Castro, 2012;
                                          Malloy and Nowak, 2012)
Summary of our contributions. Using information-theoretic tools, we establish lower bounds on the minimum number of compressive measurements and the weakest signal-to-noise ratio (SNR) needed to detect the presence of an activated block of positive activation, as well as to localize the activated block, using both non-adaptive and adaptive measurements. We also demonstrate minimax optimal upper bounds through detectors and estimators that guarantee consistent detection and localization of weak block-structured activations using few non-adaptive and adaptive compressive measurements.
Our results indicate that adaptivity and structure play a key role and provide significant improvements over the non-adaptive and unstructured cases for localization of the activated block in the data matrix setting. This is unlike the vector case, where contiguous structure and adaptivity have been shown to provide minor, if any, improvement. We describe the results for the sparse vector case in the related work section below. A summary of the SNR needed for detection and localization of an unstructured sparse vector using passive and adaptive compressive measurements is given in Table 1.
In our setting we take compressive measurements of a data matrix of size n = n_1 × n_2, the activated block is of size k = k_1 × k_2 with minimum SNR per entry of µ/σ, and we have a budget of m compressive measurements, with each measurement matrix constrained to have unit Frobenius norm. Table 2 describes our main findings (paraphrasing for clarity) and compares the scalings under which passive and active detection and localization are possible.
For detection, akin to the vector setting, structure and adaptivity play no role. The structured data matrix setting requires an SNR scaling of √(n_1 n_2/(m k_1^2 k_2^2)) for both non-adaptive and adaptive cases, which is the same as the SNR needed to detect a k_1 k_2-sparse non-negative vector of length n_1 n_2, as demonstrated in Arias-Castro (2012). Thus, neither the structure of the activation pattern nor the power of adaptivity offers an advantage in the detection problem.
For localization of the activated block, the structured data matrix setting requires an SNR scaling as √(n_1 n_2/(m min(k_1, k_2))) using non-adaptive compressive measurements. In contrast, the unstructured setting requires a higher SNR of √(n_1 n_2 log(n_1 n_2)/m), for a sufficiently large number of measurements m. Using adaptive compressive measurements, the structured setting requires an SNR scaling of only max{√(n_1 n_2/(m k_1^2 k_2^2)), √(1/(m min(k_1, k_2)))} for the weakest entry in the data matrix. In contrast, for the sparse vector case, Arias-Castro et al. (2011a) showed that adaptive compressive measurements cannot localize the non-zero components if the SNR is smaller than √(n_1 n_2/m). A matching upper bound was provided using compressive binary search in Davenport and Arias-Castro (2012) and Malloy and Nowak (2012) for localization of a single non-zero entry in the vector. Thus, exploiting the structure of the activations and designing adaptive linear measurements can both yield significant gains if the activation corresponds to a contiguous block in a data matrix.

Table 2: Summary of main findings, where the size of the matrix is n_1 × n_2 and the size of the activation block is k_1 × k_2. The number of measurements is m and µ/σ represents the SNR per element of the activated block.

              Detection                          Localization
  Passive     µ/σ ≍ √(n_1 n_2/(m k_1^2 k_2^2))   µ/σ ≍ √(n_1 n_2/(m min(k_1, k_2)))
              (Theorems 1 and 2)                 (Theorems 3 and 4)
  Active      µ/σ ≍ √(n_1 n_2/(m k_1^2 k_2^2))   µ/σ ≍ max{√(n_1 n_2/(m k_1^2 k_2^2)), √(1/(m min(k_1, k_2)))}
              (Theorems 1 and 2)                 (Theorems 6 and 7)
Related Work. Our work builds on a number of fairly recent contributions on detection, localization and recovery of a sparse and weak unstructured signal by adaptive compressive measurements. In Arias-Castro et al. (2011a), the authors show that the adaptive compressive scheme offers improvements over the passive scheme which, in terms of the mean-squared error (MSE) and localization, are limited to a log(n) factor. The authors also provide a general proof strategy for minimax analysis under adaptive measurements. Arias-Castro (2012) further applies this strategy to the problem of detecting unstructured and structured sparse, weak vector signals under compressive adaptive measurements. Malloy and Nowak (2012) show that a compressive version of standard binary search achieves minimax performance for localization in a one-sparse vector. The work of Wainwright (2009b), which is based on analyzing the performance of an exhaustive search procedure under passive measurements, is relevant to our analysis of passive localization. Our analysis provides a generalization of these results to the case of a structured signal embedded as a small contiguous block in a large matrix.
While in this paper we focus on detection and localization, some other papers have considered estimation of sparse vectors in the MSE sense using adaptive compressive measurements. For example, Arias-Castro et al. (2011a) establishes fundamental lower bounds on the MSE in a linear regression framework, while Haupt et al. (2009) demonstrates upper bounds using compressive distilled sensing. Baraniuk et al. (2010) and Soni and Haupt (2011) have analyzed different forms of structured sparsity in the vector setting, e.g., where the non-zero locations in a data vector form non-overlapping or partially-overlapping groups, or are tree-structured.
Finally, Negahban and Wainwright (2011) and Koltchinskii et al. (2011) have considered a measurement model identical to ours in the setting of low-rank matrix completion, but in that setting the matrix under consideration is not assumed to be a structured sparse matrix and the theoretical guarantees are with respect to the Frobenius norm. Furthermore, Kolar et al. (2011) illustrate that penalization using the sum of the nuclear and ℓ_1 norms cannot be used for localization in a related model.
When the data matrix is observed directly, Butucea and Ingster (2011) study the problem of detection, while Kolar et al. (2011) and Butucea et al. (2013) study the problem of localization. Sun and Nobel (2010) and Bhamidi et al. (2012) characterize the largest average submatrices of the data matrix under the null hypothesis that the signal is not present. Results in those papers do not carry over to a setting where a data matrix is accessed through compressive measurements, as already seen in the vector case (Arias-Castro, 2012).
The rest of this paper is organized as follows. We describe the problem setup and notation in Section 2. We study the detection problem in Section 3, for both adaptive and non-adaptive schemes. Section 4 is devoted to non-adaptive localization, while Section 5 is focused on adaptive localization. Finally, in Section 6 we present and discuss some simulations that support our findings. The proofs are given in the Appendix.

Preliminaries
Let A ∈ R^{n_1 × n_2} be a signal matrix with unknown entries. We are interested in a highly structured setting where a contiguous block of the matrix A of size (k_1 × k_2) has entries all equal to µ > 0, while all the other elements of A are equal to zero. We denote the coordinate set of all contiguous blocks of size (k_1 × k_2) by

B := { {r, ..., r + k_1 − 1} × {s, ..., s + k_2 − 1} : 1 ≤ r ≤ n_1 − k_1 + 1, 1 ≤ s ≤ n_2 − k_2 + 1 },   (2.1)

so that, denoting the activated block by B^* ∈ B, the entries of A are a_ij = µ 1I{(i, j) ∈ B^*}, where 1I is the indicator function. Some of our results extend to the case when the activation is positive, but not constant on B^*, as we discuss below. Note that we assume the size (k_1 × k_2) is known.
We consider the following observation model, under which m noisy linear measurements of A are available:

y_i = ⟨X_i, A⟩ + ε_i,   i = 1, ..., m,   (2.2)

where ⟨X_i, A⟩ = Σ_{a,b} X_i,ab a_ab, ε_1, ..., ε_m iid∼ N(0, σ^2) with σ > 0 known, and the sensing matrices (X_i)_{i∈[m]} are normalized to satisfy either ||X_i||_F ≤ 1 or E||X_i||_F^2 ≤ 1, so that each measurement expends a bounded amount of energy. These are similar assumptions to those made in Davenport and Arias-Castro (2012) and Candès and Davenport (2011).
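For concreteness, the following minimal sketch simulates the observation model in Eq. (2.2) with Gaussian sensing matrices normalized so that E||X_i||_F^2 = 1 (the ensemble used for passive localization in Section 4); the function names and parameter values are illustrative, not from the paper.

```python
import numpy as np

def make_signal(n1, n2, k1, k2, r, s, mu):
    """Signal matrix A with a (k1 x k2) block of entries mu whose top-left
    corner is at (r, s); all other entries are zero."""
    A = np.zeros((n1, n2))
    A[r:r + k1, s:s + k2] = mu
    return A

def compressive_measurements(A, m, sigma, rng):
    """m noisy linear measurements y_i = <X_i, A> + eps_i, with sensing
    matrices having i.i.d. N(0, 1/(n1*n2)) entries so that E||X_i||_F^2 = 1."""
    n1, n2 = A.shape
    X = rng.normal(scale=np.sqrt(1.0 / (n1 * n2)), size=(m, n1, n2))
    y = np.einsum("iab,ab->i", X, A) + rng.normal(scale=sigma, size=m)
    return X, y

rng = np.random.default_rng(0)
A = make_signal(n1=64, n2=64, k1=8, k2=8, r=10, s=20, mu=2.0)
X, y = compressive_measurements(A, m=100, sigma=1.0, rng=rng)
```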
Under the observation model in Eq. (2.2), we study two tasks: (1) detecting whether a contiguous block of positive signal exists in A, and (2) identifying the block B^*, that is, the localization of B^*. We develop efficient algorithms for these two tasks that provably require the smallest number of measurements, as explained below. The algorithms are designed for one of two measurement schemes: (1) the measurement scheme can be implemented in an adaptive or sequential fashion, that is, actively, by letting each X_i be a (possibly randomized) function of (y_j, X_j)_{j∈[i−1]}, or (2) the measurement matrices are chosen all at once, ignoring the outcomes of previous measurements, that is, passively.
Detection. The detection problem concerns checking whether a positive contiguous block exists in A. As we will show later, we can detect the presence of a contiguous block with a much smaller number of measurements than is required for localizing its position. Formally, detection is a hypothesis testing problem with a composite alternative of the form

H_0 : A = 0   versus   H_1 : a_ij = µ 1I{(i, j) ∈ B} for some B ∈ B.   (2.3)

A test T is a measurable function of the observations (y_i)_{i∈[m]} and the measurement matrices (X_i)_{i∈[m]}, which takes values in {0, 1}, with T = 1 if the null hypothesis is rejected and T = 0 otherwise. For any test T, we define its risk as

R(T) := P_0(T = 1) + max_{B∈B} P_B(T = 0),

where P_0 and P_B denote the joint probability distributions of (y_i, X_i)_{i∈[m]} under the null hypothesis and when the activation pattern is B, respectively. The risk R(T) measures the maximal sum of type I and type II errors over the set of alternatives. The overall difficulty of the detection problem is quantified by the minimax risk R^* := inf_T R(T), where the infimum is taken over all tests. For a sufficiently small SNR, the minimax risk is bounded away from zero, which implies that no test can reliably distinguish H_0 from H_1. In Section 3 we will precisely characterize the boundary for the SNR µ/σ below which no test can distinguish H_0 and H_1.
Localization. The localization problem concerns the recovery of the true activation pattern B^*. Let Ψ be an estimator of B^*, i.e., a measurable function of (y_i, X_i)_{i∈[m]} taking values in B. We define the risk of any such estimator as

R(Ψ) := max_{B∈B} P_B(Ψ ≠ B),

and the minimax risk of the localization problem is the minimal risk over all such estimators Ψ. As in the detection task, the minimax risk specifies the minimal risk of any localization procedure. By standard arguments, the evaluation of the minimax localization risk also proceeds by first reducing the localization problem to a hypothesis testing problem (see, e.g., Tsybakov, 2009, for details).
Below we will provide a sharp characterization, through information-theoretic lower bounds and tractable estimators, of the minimax detection and localization risks as functions of the tuple (n_1, n_2, k_1, k_2, m, µ, σ), for both the active and passive sampling schemes. Our results identify precisely both the minimal SNR given a budget of m possibly adaptive measurements, and the minimal number of measurements m for a given SNR, in order to achieve successful detection and localization.
Along with a careful and detailed minimax analysis, we also describe procedures for detection and localization in both the active and passive case whose risks match the minimax rates.

Detection of contiguous blocks
In this section, we derive minimax rates for detection.

Lower bound
The following theorem gives a lower bound on the SNR needed to distinguish H 0 and H 1 .
The lower bound on possibly adaptive procedures is established by analyzing the risk of the (optimal) likelihood ratio test under a uniform prior over the alternatives. Careful modifications of standard arguments are necessary to account for adaptivity. We closely follow the approach of Arias-Castro (2012), who established the analogue of Theorem 1 in the vector setting.

Upper bound
We now demonstrate the sharpness of the result established in the previous section. We choose the sensing matrices passively as X_i = (n_1 n_2)^{−1/2} 1_{n_1} 1_{n_2}^T, i ∈ [m], that is, every entry of every sensing matrix equals (n_1 n_2)^{−1/2}, and we reject the null hypothesis when the standardized sum of the measurements, (m σ^2)^{−1/2} Σ_{i∈[m]} y_i, exceeds a suitable threshold; this is the test T defined in Eq. (3.1).
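As an illustration, here is a minimal simulation of such a passive detection test; the rejection threshold (a crude Gaussian tail quantile) and all names are our own choices, not taken verbatim from Eq. (3.1).

```python
import numpy as np

def passive_detect(A, m, sigma, delta, rng):
    """Sketch of the passive detection test: every sensing matrix has all
    entries equal to (n1*n2)**-0.5, so each measurement is N(<X, A>, sigma^2),
    with <X, A> = k1*k2*mu/sqrt(n1*n2) under the alternative. H0 is rejected
    when the standardized sum of the measurements is too large."""
    n1, n2 = A.shape
    mean = A.sum() / np.sqrt(n1 * n2)             # <X_i, A>, identical for all i
    y = mean + rng.normal(scale=sigma, size=m)    # y_i = <X_i, A> + eps_i
    stat = y.sum() / (sigma * np.sqrt(m))         # standard normal under H0
    return stat > np.sqrt(2 * np.log(1 / delta))  # false alarm prob. <= delta
```

The test succeeds once the per-measurement signal k_1 k_2 µ/√(n_1 n_2), amplified by √m through averaging, dominates σ, which recovers the detection rate √(n_1 n_2/(m k_1^2 k_2^2)).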
The results of Theorem 1 and Theorem 2 establish that the minimax rate for detection under the model in Eq. (2.2) is µ/σ ≍ √(n_1 n_2/(m k_1^2 k_2^2)), under the (mild) assumption that k_1 ≤ cn_1 and k_2 ≤ cn_2 for some constant 0 < c < 1. It is worth pointing out that the structure of the activation pattern does not play any role in the minimax detection problem, since the rate matches the known bounds for detection in the unstructured vector case (Arias-Castro, 2012). We will contrast this with the localization problem below. Furthermore, the procedure that achieves the adaptive lower bound (up to constants) is non-adaptive, indicating that adaptivity cannot help much in the detection problem.
We also note that the results established in this section continue to hold when the activation is positive, but not constant on B^*, with min_{(i,j)∈B^*} a_ij replacing µ.

Localization from passive measurements
In this section, we address the problem of estimating a contiguous block of activation B^* from noisy linear measurements as in Eq. (2.2), when the measurement matrices (X_i)_{i∈[m]} are independent with i.i.d. entries having a N(0, (n_1 n_2)^{−1}) distribution. The variance of the elements is set so that E||X_i||_F^2 = 1.

Lower bound
The following theorem gives a lower bound on the SNR needed for any procedure to localize B * .
Theorem 3. There exist positive constants C, α > 0, independent of the problem parameters, such that if

µ/σ ≤ C max{ √(n_1 n_2/(m min(k_1, k_2))), √(n_1 n_2 log max(n_1 − k_1, n_2 − k_2)/(m k_1 k_2)) },

then the risk of any estimator based on m passive measurements is at least α.

The proof is based on a standard technique described in Chapter 2.6 of Tsybakov (2009). We start by identifying a subset of matrices that are hard to distinguish. Once a suitable finite set is identified, tools for establishing lower bounds on the error in multiple-hypothesis testing can be directly applied. These tools only require computing the Kullback-Leibler (KL) divergence between the induced distributions, which in our case are two multivariate normal distributions.
The two terms in the lower bound feature two aspects of our construction: the first term arises from considering two matrices that overlap considerably, while the second term arises from considering matrices that do not overlap at all, of which there are possibly a very large number. These constructions and calculations are described in detail in the Appendix.

Upper bound
We will investigate a procedure that searches over all contiguous blocks of size (k_1 × k_2), as defined in Eq. (2.1), and outputs the one minimizing the squared error. Specifically, for each candidate block B ∈ B, let

f(B) := min_{µ'∈R} Σ_{i∈[m]} ( y_i − µ' Σ_{(a,b)∈B} X_i,ab )^2,   (4.1)

where X_i,ab denotes the element in row a and column b of the i-th sensing matrix. Then the estimated block B̂ is defined as

B̂ := argmin_{B∈B} f(B).   (4.2)

Note that the minimization problem above requires solving O(n_1 n_2) univariate regression problems and can be implemented efficiently for reasonably large matrices.
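A compact sketch of this estimator (our own vectorized implementation under the Gaussian measurement ensemble above; helper names are illustrative):

```python
import numpy as np

def localize_exhaustive(X, y, k1, k2):
    """Sketch of the estimator in Eq. (4.2): for every contiguous (k1 x k2)
    block B, regress y on z_i(B) = sum of X_i over B, and return the block
    with the smallest residual sum of squares."""
    m, n1, n2 = X.shape
    # z[i, r, s] = sum of X_i over the block with top-left corner (r, s),
    # computed for all blocks at once via 2-D cumulative sums.
    S = np.pad(X.cumsum(axis=1).cumsum(axis=2), ((0, 0), (1, 0), (1, 0)))
    z = S[:, k1:, k2:] - S[:, :-k1, k2:] - S[:, k1:, :-k2] + S[:, :-k1, :-k2]
    # Univariate least squares: f(B) = ||y||^2 - (y.z)^2/||z||^2, so
    # minimizing f(B) is the same as maximizing (y.z)^2/||z||^2.
    num = np.einsum("i,irs->rs", y, z) ** 2
    den = np.einsum("irs,irs->rs", z, z)
    score = num / np.maximum(den, 1e-12)
    r, s = np.unravel_index(score.argmax(), score.shape)
    return r, s  # top-left corner of the estimated block
```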
The following result characterizes the SNR needed for B̂ to correctly identify B^*.
Theorem 4. There exist positive constants C_1, C_2 > 0, independent of the problem parameters, such that if the SNR µ/σ exceeds the lower-bound threshold of Theorem 3 inflated by a factor of C_1 √(log max(k_1, k_2)), then B̂ = B^* with high probability.

Comparing to the lower bound in Theorem 3, we observe that the procedure outlined in this section achieves the lower bound up to constants and a log(max(k_1, k_2)) factor. Under the scaling max(k_1, k_2) ≥ log max(n_1 − k_1, n_2 − k_2), we obtain that the passive minimax rate for localization of the active block B^* is µ ≍ O(σ √(n_1 n_2/(m min(k_1, k_2)))). In this and subsequent uses, the O notation hides a log max(k_1, k_2) factor.
This establishes that the SNR needed for passive localization is considerably larger than the bound we saw earlier for passive detection. This should be contrasted with the unstructured normal means problem, where the bounds for localization and detection differ only in constants (Donoho and Jin, 2004).
The block structure of the activation allows us, even in the passive setting, to localize much weaker signals. A straightforward adaptation of results on the LASSO (Wainwright, 2009a) suggests that if the non-zero entries are spread out (say, at random), then we would require µ ≍ O(σ √(n_1 n_2 log(n_1 n_2)/m)) for localization. One could extend the analysis in this section to data matrices with non-constant activation, as in Wainwright (2009b). Furthermore, one can adapt to the unknown size of the activation block. In particular, one can perform the exhaustive search procedure over all possible sizes of activation blocks: let B_{k_1,k_2} denote the coordinate set of all contiguous blocks of size k_1 × k_2; then the estimated block B̂ := argmin_{(k_1,k_2)} min_{B∈B_{k_1,k_2}} f(B) adapts to the unknown size of the activation if the signal strength satisfies the condition in Theorem 4. This can be verified by small modifications to the proof of Theorem 4.

The non-contiguous case
Suppose that the block of activation B^* belongs to the collection

B̃ := { R × S : R ⊆ [n_1], |R| = k_1, S ⊆ [n_2], |S| = k_2 },

so that the activation block is not necessarily a contiguous block. This collection contains less structure than the collection B, but we can still localize much weaker signals compared to the completely unstructured case. Slight modifications of the proofs of Theorem 3 and Theorem 4 yield the following.
Theorem 5. Let B̂ := argmin_{B∈B̃} f(B). There exists a constant C_1 such that if the signal strength satisfies the condition in Eq. (4.3), then B̂ = B^* with high probability; a complementary lower bound is given in Eq. (4.4).

Therefore, we conclude that even without contiguous blocks, the additional structure helps for the problem of localization.

Localization from active measurements
In this section, we study localization of B^* using adaptive procedures, that is, procedures in which the measurement matrix X_i may be a function of (y_j, X_j)_{j∈[i−1]}.

Lower bound
A lower bound on the SNR needed for any active procedure to localize B * is given as follows.
Theorem 6. Fix any 0 < α < 1. Given m adaptively chosen measurements, if

µ/σ ≤ C_α max{ √(n_1 n_2/(m k_1^2 k_2^2)), √(1/(m min(k_1, k_2))) }

for a small enough constant C_α > 0, then the risk of any localization procedure is at least α.

The proof is based on information-theoretic arguments applied to specific pairs of hypotheses that are hard to distinguish. The two terms in the lower bound reflect the two important sources of hardness of the problem of localization. The first term reflects the difficulty of approximately localizing the block of activation; this term grows at the same rate as the detection lower bound, and its proof is similar. Given a coarse localization of the block, we still need to exactly localize the block; the hardness of this problem gives rise to the second term in the lower bound. That term is independent of n_1 and n_2, but has a considerably worse dependence on k_1 and k_2.

Upper bound
The upper bound is established by analyzing the procedures described in Algorithms 1 and 2 for approximate and exact localization, respectively. Algorithm 1 is used to approximately locate the activation block; that is, it locates an (8k_1 × 8k_2) block that contains the activation block with high probability. The algorithm essentially performs compressive binary search (Davenport and Arias-Castro, 2012) on a collection of non-overlapping blocks that partition the matrix. It is run on four collections, D_1, D_2, D_3 and D_4, defined as follows: D_1 is a partition of the matrix into disjoint blocks of size (2k_1 × 2k_2); D_3 is a similar partition shifted down by k_1 rows; D_4 is shifted to the right by k_2 columns; and D_2 is shifted both down by k_1 rows and to the right by k_2 columns. Figure 1 illustrates this.
Notice that one of these collections must include a block that contains the full block of activation; the sketch below makes this concrete. Algorithm 1, applied four times, returns four blocks, one of which, as we show, contains the full activation block with high probability.
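The following illustrative check (not part of the paper's algorithms) verifies the claim: in each dimension, a block that straddles a cell boundary of the unshifted partition cannot straddle a boundary of the partition shifted by half a cell, and the two dimensions can be handled independently.

```python
import itertools

def containing_block(r, s, k1, k2):
    """For a (k1 x k2) activation block with top-left corner (r, s), find a
    shift (dr, dc) in {0, k1} x {0, k2} such that the block lies inside a
    single (2*k1 x 2*k2) cell of the correspondingly shifted partition."""
    for dr, dc in itertools.product((0, k1), (0, k2)):
        R = ((r - dr) // (2 * k1)) * (2 * k1) + dr   # cell row corner
        C = ((s - dc) // (2 * k2)) * (2 * k2) + dc   # cell column corner
        if r + k1 <= R + 2 * k1 and s + k2 <= C + 2 * k2:
            return (dr, dc), (R, C)

# e.g. containing_block(13, 6, 4, 4) returns ((4, 4), (12, 4)): the cell of
# the doubly shifted partition with corner (12, 4) contains the block.
```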
Algorithm 2 is used next to precisely locate the activation block within one of the four coarser blocks identified by Algorithm 1. Algorithm 2 itself works in several stages: in the first stage, the procedure repeatedly measures a small number of columns, exactly one of which is active, to identify the active column with high probability. The next stage finds the first non-active column to the left and to the right by testing columns using a binary search (halving) procedure. In this way, all the active columns are located. Finally, Algorithm 2 is repeated on the rows to identify the active rows.
The following theorem states that Algorithm 1 and Algorithm 2 succeed in localization of the active block with high probability if the SNR is large enough; up to constants, the requirement is

µ/σ ≥ C max{ √(n_1 n_2/(m k_1^2 k_2^2)), √(log max(k_1, k_2)/(m min(k_1, k_2))) }.

The exact constants appear in the proof of Theorem 7. As before, the O hides a log max(k_1, k_2) factor, and our upper bound matches the lower bound up to this factor. It is worth noting that for small activation blocks (when the first term dominates) our active localization procedure achieves the detection limits. This is the best result we could hope for. For larger activation blocks, the lower bound indicates that no procedure can achieve the detection rate. The active procedure still remains significantly more efficient than the passive one, and even in this case it is able to localize signals that are weaker by a (large) √(n_1 n_2) factor. This is not the case for compressed sensing of vectors, as shown in Arias-Castro et al. (2011a). The great potential for gains from adaptive measurements is clearly seen in our model, which captures the fundamental interplay between structure and adaptivity.

Experiments
In this section, we perform a set of simulation studies to illustrate the finite sample performance of the proposed procedures. We let n_1 = n_2 = n and k_1 = k_2 = k. Theorem 4 and Theorem 7 characterize the SNR needed for the passive and active identification of a contiguous block, respectively. We demonstrate that the scalings predicted by these theorems are sharp by plotting the probability of successful recovery against the appropriately rescaled SNR and showing that the curves for different values of n and k line up.

Experiment 1. Figure 2 shows the probability of successful localization of B^* using B̂ defined in Eq. (4.2), plotted against the rescaled SNR n^{−1}√(km) · (µ/σ), where the number of measurements is m = 100. Each plot in Figure 2 represents a different relationship between k and n: in the first plot, k = Θ(log n); in the second, k = Θ(√n); and in the third, k = Θ(n). The dashed vertical line denotes the threshold position for the scaled SNR at which the probability of success exceeds 0.95. We observe that irrespective of the problem size and the relationship between n and k, Theorem 4 tightly characterizes the minimum SNR needed for successful identification.

Experiment 2. Figure 3 shows the probability of successful localization of B^* using the procedure outlined in Section 5.2, with m = 500 adaptively chosen measurements, plotted against the rescaled SNR. The SNR is scaled by n^{−1} k^2 √m in the first two plots, where k = Θ(log n) and k = Θ(√n) respectively, while in the third plot the SNR is scaled by √(mk/log k), as k = Θ(n). The dashed vertical line denotes the threshold position for the scaled SNR at which the probability of success exceeds 0.95. We observe that Theorem 7 sharply characterizes the minimum SNR needed for successful identification.
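A schematic version of Experiment 1, reusing the illustrative helpers make_signal, compressive_measurements and localize_exhaustive sketched earlier (all parameter values are ours, chosen only to show the shape of the experiment):

```python
import numpy as np

def success_curve(n, k, m=100, sigma=1.0, n_trials=20, scaled_snrs=(1, 2, 4, 8)):
    """Empirical probability of exact localization as a function of the
    rescaled SNR n**-1 * sqrt(k*m) * (mu/sigma) used in Figure 2."""
    rng = np.random.default_rng(0)
    probs = []
    for s in scaled_snrs:
        mu = s * sigma * n / np.sqrt(k * m)   # invert the rescaling
        hits = 0
        for _ in range(n_trials):
            r, c = rng.integers(0, n - k + 1, size=2)
            A = make_signal(n, n, k, k, r, c, mu)
            X, y = compressive_measurements(A, m, sigma, rng)
            hits += localize_exhaustive(X, y, k, k) == (r, c)
        probs.append(hits / n_trials)
    return probs
```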

A Proofs of Main Results
In this appendix, we collect proofs of the results stated in the paper. Throughout the proofs, we denote by c_1, c_2, ... positive constants that may change their value from line to line.

A.1 Proof of Theorem 1
We lower bound the Bayes risk of any test T. Recall the null and alternative hypotheses defined in Eq. (2.3). We consider a uniform prior π over the alternatives and bound the average risk, which provides a lower bound on the worst-case risk of T. Under the prior π, the hypothesis testing problem becomes one of distinguishing two simple hypotheses, H_0 and the mixture H_1, and the likelihood ratio test is optimal by the Neyman-Pearson lemma. The likelihood ratio factorizes across measurements: decomposing the joint probabilities by the chain rule, the conditional distribution of X_i given (y_j, X_j)_{j∈[i−1]} cancels from the ratio, since the sampling strategy (whether active or passive) is the same irrespective of the true hypothesis.
The likelihood ratio can be further simplified, and the average risk of the likelihood ratio test is determined by the total variation distance between the mixture of alternatives and the null. By Pinsker's inequality (Tsybakov, 2009), it suffices to bound the KL divergence between the two distributions; the first inequality in this step follows by applying Jensen's inequality followed by Fubini's theorem. The resulting bound is expressed in terms of a matrix C, whose entries are described via the invertible map τ from a linear index in {1, ..., n_1 n_2} to an entry of A. To bound the operator norm of C, we make two observations. First, because of the contiguous structure of the activation pattern, any row of C contains at most k_1 k_2 non-zero entries. Second, each non-zero entry of C is bounded in magnitude. Combining the two observations bounds the operator norm of C, and hence the KL divergence, which proves the lower bound on the minimax risk.

A.2 Proof of Theorem 2
The theorem now follows from an application of standard Gaussian tail bounds in Eq. (B.1).

A.3 Proof of Theorem 6
The proof will proceed via two separate constructions.At a high level these constructions are intended to capture the difficulty of exactly and approximately localizing the activation block.
Construction 1 - approximate localization: Let us define three distributions: P_0, corresponding to no bicluster; P_1, which is a uniform mixture over the distributions induced by having the top-left corner of the bicluster in the left half of the matrix; and P_2, which is a uniform mixture over the distributions induced by having the top-left corner of the bicluster in the right half of the matrix.
We first upper bound the total variation distance between P_1 and P_2. This results directly in a lower bound for the problem of distinguishing whether the top-left corner of the bicluster is in the left or right half of the matrix, which in turn is a lower bound for the localization of the bicluster. By the triangle inequality and Pinsker's inequality, the squared total variation distance between P_1 and P_2 is at most KL(P_0, P_1) + KL(P_0, P_2). Notice that KL(P_0, P_1) is exactly the quantity we have to upper bound to produce a lower bound on the signal strength for detecting whether a block of activation is in the left half of the matrix or not. At least from a lower bound perspective, this reduces the problem of localization to that of detection. We can now apply a slight modification of the proof of Theorem 1 to bound these KL divergences, and conclude that the minimax risk R of distinguishing P_1 from P_2 is bounded away from zero unless the SNR exceeds the first term of the lower bound.

Construction 2 - exact localization: Without loss of generality, we assume k_1 ≤ k_2. Consider two distributions P_1 and P_2, where P_1 is induced by the matrix A_1 when the activation block is B_1 = {1, ..., k_1} × {1, ..., k_2}, and P_2 is induced by the matrix A_2 whose activation block is shifted by a single column, so that A_1 − A_2 has at most 2k_1 non-zero entries. Following the same argument as in the proof of Theorem 1 and applying the Cauchy-Schwarz inequality, the KL divergence between P_1 and P_2 is of order at most m µ^2 k_1/σ^2. Together with a similar construction for the case k_2 ≤ k_1, and once again applying Pinsker's inequality, this yields the second term of the lower bound. Combining the approximate and exact localization bounds completes the proof for any 0 < α < 1.

A.4 Proof of Theorem 3

Without loss of generality, we assume k_1 ≤ k_2. The hard instances are of two kinds, mirroring the two terms in the bound: a pair of distributions P_1 and P_2, where P_1 is induced by the matrix A_1 with activation block B_1 = {1, ..., k_1} × {1, ..., k_2} and P_2 is induced by a matrix A_2 whose activation block overlaps B_1 except in a single column, and a large collection of matrices whose activation blocks are pairwise disjoint. The required KL divergences between the induced multivariate normal distributions are computed as in Eq. (A.1) and Eq. (A.2), and the multiple-hypothesis testing tools from Chapter 2.6 of Tsybakov (2009) complete the proof.

A.5 Proof of Theorem 4

To analyze the term T_2, we condition on X. Define the event E(η) on which the relevant empirical quantities concentrate around their expectations; using the concentration results in Appendix B, E(η) holds with high probability. On the event E(η), we obtain the bound in Eq. (A.7). Combining Eq. (A.6) and Eq. (A.7) completes the proof.

A.6 Proof of Theorem 7
As with the lower bound, the localization algorithm and its analysis are naturally divided into two phases: an approximate localization phase and an exact localization one. We will analyze each of these in turn. To ease presentation, we will assume n_1 is a dyadic multiple of 2k_1 and n_2 a dyadic multiple of 2k_2. Straightforward modifications are possible when this is not the case.
Approximate localization: The approximate localization phase proceeds by a modification of the compressive binary search (CBS) procedure of Malloy and Nowak (2012) (see also Davenport and Arias-Castro (2012)) on the matrix A.
We will run this modified CBS procedure four times, on the four collections of blocks D_1, D_2, D_3 and D_4 defined in Section 5.2. Notice that the entire block of activation is always fully contained in one of these blocks. The output of the CBS procedure when run on these four collections is four blocks, one from each collection. We define an approximate localization error to be the event in which none of the returned blocks fully contains the block of activation.
Without loss of generality, let us assume that the activation block is fully contained in some block from the first collection. Once we have fixed the collection of blocks, the CBS procedure is invariant to reordering of the blocks, so without loss of generality we can consider the case when the activation block is contained in B_11.
The analysis proceeds exactly as in Malloy and Nowak (2012). We only outline the differences arising from having a block of activation, as opposed to a single activated entry in a vector, and refer the reader to Malloy and Nowak (2012) for the details.
The binary search procedure on the first collection of blocks proceeds for s_0 rounds. We can bound the probability of error of the procedure by a union bound over the rounds. Recall the measurement allocation scheme: for m ≥ 2s_0,

m_s ≡ ⌊(m − s_0) s 2^{−s−1}⌋ + 1,

and observe that Σ_{s=1}^{s_0} m_s ≤ m. Using the Gaussian tail bound, we conclude that P_e ≤ δ. We apply this procedure 4 times (once on each collection).
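The budget constraint Σ_s m_s ≤ m follows from Σ_{s≥1} s 2^{−s−1} = 1; a two-line check of the allocation (illustrative code, not from the paper):

```python
# CBS measurement allocation: m_s = floor((m - s0) * s / 2**(s+1)) + 1,
# which satisfies sum_s m_s <= (m - s0) * 1 + s0 = m.
def allocation(m, s0):
    return [(m - s0) * s // 2 ** (s + 1) + 1 for s in range(1, s0 + 1)]

assert all(sum(allocation(m, s0)) <= m
           for m in (100, 500, 1000) for s0 in (1, 5, 10) if m >= 2 * s0)
```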
Let us revisit what we have shown so far: if µ is large enough, then one of the four runs of the CBS procedure will return a block of size (2k_1 × 2k_2) which fully contains the block of activation, with probability at least 1 − 4δ.
Exact localization: We collect all the rows and columns returned by the 4 runs of the CBS procedure. In the 1 − 4δ probability event described above, we have a block of size at most (8k_1 × 8k_2) which contains the full block of activation (for simplicity we disregard the fact that we know the block is actually in one of two (4k_1 × 4k_2) blocks; i.e., we assume the worst case that none of the returned blocks overlap in their rows or columns, and we explore the off-diagonal blocks).
Let us devote 8m measurements to identifying the active column amongst these. The procedure is straightforward: measure each column m times, and pick the one that has the largest total signal.
It is easy to show that the m measurements of the active column sum to a draw from N(√(k_1/8) µ m, m σ^2), while the non-active columns result in draws from N(0, m σ^2).
Using the same Gaussian tail bound as before, it is easy to show that if µ is sufficiently large, we successfully find the active column with probability at least 1 − δ.
So far, we have identified an active column and localized the columns of the activation block to one of 2k_2 columns. We will use m more measurements to find the remaining active columns. Rather than test each of the 2k_2 columns, we will do a binary search. This will require us to test at most t ≡ 2⌈log k_2⌉ ≤ 3 log k_2 columns, and we will devote m/(3 log k_2) measurements to each column. We threshold the summed measurements for each tested column at

√( (2 m σ^2/(3 log k_2)) log(3 log k_2/δ) ),

and declare a column active if its total exceeds this threshold.
It is easy to show that this binary search procedure successfully finds all active columns with probability at least 1 − δ if

µ ≥ √( (32 σ^2 log k_2/(m k_1)) log(3 log k_2/δ) ).

We repeat this procedure to identify the active rows.
Putting everything together, the total number of measurements used is:

1. Four rounds of CBS: 4m.
2. Identifying the first active column and the first active row: 16m.
3. Identifying the remaining active rows and columns: 2m.

This is a total of 22m measurements. Each of these steps fails with probability at most δ, for a total failure probability of 8δ. This matches the lower bound up to log k factors.
A.7 Proof of Eq. (4.3) and Eq. (4.4)

The proof of Eq. (4.3) follows the same lines as the proof of Theorem 4. The argument given in the proof of Theorem 2 in Kolar et al. (2011) yields Eq. (4.3) provided m ≥ C log max{ (n_1 choose k_1), (n_2 choose k_2) }. The proof of Eq. (4.4) follows the proof of Theorem 1 in Kolar et al. (2011), with the appropriate KL divergences derived in Eq. (A.1) and Eq. (A.2).

B Collection of concentration results
In this section, we collect useful results on tail bounds for various random quantities used throughout the paper. We start by stating lower and upper bounds on the survival function of the standard normal random variable. Let Z ∼ N(0, 1) be a standard normal random variable. Then, for t > 0,

(1/√(2π)) (t/(1 + t^2)) e^{−t^2/2} ≤ P(Z > t) ≤ (1/√(2π)) (1/t) e^{−t^2/2}.   (B.1)

B.1 Tail bounds for Chi-squared variables
Throughout the paper we will often use one of the following tail bounds for central χ^2 random variables. These are well known, and proofs can be found in the original papers.

Figure 1: The collection of blocks D_1 is shown in solid lines and the collection D_2 is shown in dashed lines. The collections D_3 and D_4 overlap with these and are not shown. The (k_1 × k_2) block of activation is shown in red.

Figure 2: Probability of success with passive measurements (averaged over 100 simulation runs).

Figure 3: Probability of success with adaptively chosen measurements (averaged over 100 simulation runs).

References

S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069-1097, 2011.

A. Soni and J. Haupt. Efficient adaptive compressive sensing using sparse hierarchical learned dictionaries. arXiv:1111.6923, 2011.

X. Sun and A. B. Nobel. On the maximal size of large-average and ANOVA-fit submatrices in a Gaussian random matrix. arXiv e-prints, September 2010.