Exact Asymptotics for the Scan Statistic and Fast Alternatives

We consider the problem of detecting a rectangle of activation in a d-dimensional grid of sensors with noisy measurements. This has applications to massive surveillance projects and anomaly detection in large datasets, in which one detects anomalously high measurements over rectangular regions or, more generally, blobs. Recently, the asymptotic distribution of a multiscale scan statistic was established by Kabluchko (2011) under the null hypothesis, using non-constant boundary crossing probabilities for locally stationary Gaussian random fields derived by Chan and Lai (2006). Using a similar approach, we derive the exact asymptotic level and power of four variants of the scan statistic: an oracle scan that knows the dimensions of the activation rectangle; the multiscale scan statistic just mentioned; an adaptive variant; and an epsilon-net approximation to the latter, in the spirit of Arias-Castro et al. (2005). This approximate scan runs in time near-linear in the size of the grid and achieves the same asymptotic power as the adaptive scan. We complement our theory with some numerical experiments.


Introduction
Detecting anomalies in networks is important in a number of areas, such as sensor arrays (Brennan et al., 2004; Culler et al., 2004), digital images (incl. satellite, medical, etc.) (Caron et al., 2002; James et al., 2001; McInerney and Terzopoulos, 1996; Moon et al., 2002; Pozo et al., 1997), syndromic surveillance systems (Heffernan et al., 2004; Rotz and Hughes, 2004; Wagner et al., 2001), and many more. The scan statistic (Kulldorff, 1997) is by far the most popular approach, and is given different names in engineering, such as the method of matched filters or deformable templates (McInerney and Terzopoulos, 1996). It was perhaps first introduced for finding patterns in point clouds (Glaz et al., 2001; Naus, 1965) and is now applied to any setting where the goal is to detect a "localized" anomaly. In statistics, it corresponds to the generalized likelihood ratio test after a particular model is assumed, and as such is even more widely applicable, being the most omnibus approach to hypothesis testing.
Focusing on the detection of anomalies in networks, which includes spatiotemporal data, first-order theoretical performance bounds are established in a small number of papers, such as (Arias-Castro et al., 2011, 2005; Desolneux et al., 2003; Walther, 2010). More refined results establishing weak convergence are even fewer. Jiang (2002) considers the scan over rectangles in a grid of independent random variables with negative expectation, while Boutsikas and Koutras (2006) study the scan over intervals of given length in a Bernoulli sequence. Both works rely heavily on the Chen-Stein Poisson approximation. Still in the context of the one-dimensional lattice, but now with standard normal random variables, Siegmund and Venkatraman (1995) provide a weak convergence result for the normalized scan over all intervals. Concretely, suppose that y(1), . . ., y(n) are iid standard normal, and define the corresponding scan statistic. Then Siegmund and Venkatraman (1995) show that, for all τ ∈ R, the suitably normalized scan converges weakly, for some numeric constant κ. This was recently extended to higher dimensions, for scans over hypercubes and hyperrectangles, by Kabluchko (2011). Formally, define [n] = {1, . . ., n} and assume that (y(i) : i ∈ [n]^d) are iid standard normal. Let R denote the class of all discrete hyperrectangles of [n]^d and define the scan over R as the maximum of y[R]/√|R| over R ∈ R, where y[R] is the sum of y over R and |R| denotes the number of nodes in R, equal to ∏_j (b_j − a_j + 1) when R = ×_j [a_j, b_j]. Kabluchko (2011) shows that, for all τ ∈ R, an analogous limit holds, for some constant κ depending only on the dimension d. These results allow, in theory, to control the (asymptotic) level of the test based on the scan statistic if, indeed, the data is iid standard normal when no anomaly is present and an anomaly comes in the form of a rectangle with elevated mean. This is what we assume throughout the paper.
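To make the definition above concrete, the following is a brute-force computation of the scan statistic on a tiny grid. This is a sketch only: the normalization y[R]/√|R| is our reading of the scan defined above, and enumerating all rectangles is O(n^{2d}) in count, so this is for illustration, not for use at realistic grid sizes.

```python
import itertools
import numpy as np

def scan_statistic(y):
    """Brute-force multiscale scan of a d-dimensional array y:
    the maximum of y[R] / sqrt(|R|) over all discrete rectangles R,
    where y[R] is the sum of y over R."""
    n = y.shape
    best = -np.inf
    # a rectangle is a product of intervals [a_j, b_j] with a_j <= b_j
    corners = [itertools.combinations_with_replacement(range(m), 2) for m in n]
    for R in itertools.product(*corners):
        sl = tuple(slice(a, b + 1) for a, b in R)
        size = np.prod([b - a + 1 for a, b in R])
        best = max(best, y[sl].sum() / np.sqrt(size))
    return best

rng = np.random.default_rng(0)
y = rng.standard_normal((8, 8))
z = scan_statistic(y)
```

Since single-cell rectangles are included, the scan statistic always dominates the largest single observation.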
Contribution 1. We establish a weak convergence result when an anomaly is present, which, in theory, allows for precise (asymptotic) power calculations.
Besides the scan statistic (1), we study other variants. One of them, already considered in (Arias-Castro et al., 2011; Walther, 2010), is based on a finer normalization for the scans at different scales. In detail, define the class R(h) of rectangles with shape h ∈ [n]^d as in (3), and let Z_{n,h} denote the scan over R(h), defined as in (1) but with R(h) in place of R. We consider the test that rejects if there is h such that Z_{n,h} ≥ u_{n,h}(τ), for some explicit critical values u_{n,h}(τ) defined later. We refer to this procedure as the (scale or shape) adaptive scan. We note that in the first-order analyses found in (Arias-Castro et al., 2011; Walther, 2010), u_{n,h}(τ) only depends on ‖h‖_1 := h_1 + • • • + h_d, which is not quite true in our situation.
Both the scan and the adaptive scan are computationally intensive. With proper implementation, they can be computed in O(n^{2d}) basic operations, which may nevertheless be prohibitive for scans over large networks. For example, a typical 2D digital image is of size n × n, where n is on the order of 10^3, resulting in a computational complexity of order 10^12 basic operations. Aware of that, Arias-Castro et al. (2011, 2005) and Walther (2010) propose to approximate the scan statistic by scanning over a subset of rectangles that is sufficiently dense in R. For a given metric δ over R, we say that R_ε is an ε-covering if, for all R ∈ R, there is R' ∈ R_ε such that δ(R, R') ≤ ε. Both Arias-Castro et al. (2011, 2005) and Walther (2010) construct different ε-coverings which can be scanned in roughly O(n^d) basic operations, and show that, when ε = ε_n → 0 sufficiently slowly, scanning over an ε-covering yields the same first-order asymptotic performance.
Contribution 3. We establish weak convergence results for the adaptive scan over a given ε-covering, both when an anomaly is absent and when it is present. We also construct a new ε-covering and design an efficient way to scan over it using on the order of n^d basic operations when ε is not too small.
As a benchmark we consider an oracle which knows the shape h⋆ of the anomalous rectangle (but is ignorant of its location) and therefore only scans over rectangles with that shape, meaning, over R(h⋆).
Contribution 4. We establish weak convergence results for the oracle scan, both when an anomaly is absent and when it is present.
We note that our method of proof is largely borrowed from Kabluchko (2011), whose approach is based on extensive work of Chan and Lai (2006) on the extrema of Gaussian random fields, and related topics, and the Chen-Stein Poisson approximation (Arratia et al., 1989).
We complement our theoretical findings with some numerical experiments that we performed to compare these various methods, meaning, the oracle scan, the scan, the adaptive scan, and the adaptive scan over an ε-covering.
The rest of the paper is organized as follows. In Section 2, we set the framework and state the theoretical results announced above, and in Section 3, we present the results of our numerical experiments. We briefly discuss some extensions and open problems in Section 4, while the technical proofs are gathered in Section 5.
Before we continue, we pause to introduce some notation. We already used the notation [n] = {1, . . ., n} for any positive integer n. Cartesian products of sets are denoted with the × operator and, for a set A and integer k ≥ 1, A^k = A × • • • × A, k times. All vectors are bolded and scalars are not. Some special vectors are 0 = (0, . . ., 0), 1 = (1, . . ., 1), and the jth canonical basis vector e_j = (0, . . ., 0, 1, 0, . . ., 0) with the 1 in the jth component. A vector a in dimension d will have components denoted a_1, . . ., a_d. The rectangle with endpoints a and b is denoted [a, b]. The symbol • indicates the component-wise product for vectors and matrices, and division between vectors, denoted a/b, is component-wise. The Lebesgue measure on R^d will be denoted by λ. For a discrete set R, |R| denotes its cardinality. For a vector a, ‖a‖ and ‖a‖_1 denote its Euclidean and ℓ_1 norms, respectively. For a set A, I{A} (sometimes 1_A) will denote the indicator of A. We use Bachmann-Landau notation to compare infinite sequences; for example, a_n = o(b_n) means a_n/b_n → 0. For stochastic sequences, if the convergence is in probability then this is denoted by a subscript, as in a_n = o_P(b_n).

Model, methodology, and theory
We assume that we are given one snapshot of measurements from a sensor array in d-dimensional space. This array is arranged by placing one sensor at each grid point in [n]^d. An important example is that of digital images, from CCD or CMOS cameras, or other modalities such as MRI. It also encompasses video (incl. fMRI), by letting one dimension represent time (in some unit), although the time dimension is often treated in a special way.
We denote the measurement at sensor i ∈ [n]^d by y(i) and model this as a signal with additive white Gaussian noise, y(i) = x(i) + ξ(i), where x is the signal and ξ is white standard normal noise, or in vector notation, y = x + ξ, where y, x ∈ R^{n^d} and ξ is a standard normal vector of the same dimension. We address the problem of deciding whether the signal x is nonzero, formalized as the following hypothesis testing problem: H_0 : x = 0 versus H_1 : x ∈ X_µ, for some parameter µ, which will be interpreted as the signal size, and some class X_µ ⊂ R^{n^d} parametrized by µ and with the property 0 ∉ X_µ. While H_0 represents 'business as usual', H_1 would indicate that there is some anomalous activity, here modeled by x.
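The model above is easy to simulate. The sketch below draws from the null and from the flat-signal alternative studied later in the paper (the per-pixel bump µ/√|R⋆| is an assumption matching the indicator-of-a-rectangle signal used in the experiments; `sample_model` and its arguments are illustrative names, not the paper's notation).

```python
import numpy as np

def sample_model(n, d, mu, shape, loc, rng):
    """Draw y = x + xi on the grid [n]^d: standard normal noise plus a
    flat signal mu / sqrt(|R*|) on a rectangle R* with the given shape
    and top-left corner loc.  mu = 0 gives the null H_0."""
    y = rng.standard_normal((n,) * d)
    sl = tuple(slice(a, a + h) for a, h in zip(loc, shape))
    y[sl] += mu / np.sqrt(np.prod(shape))
    return y

rng = np.random.default_rng(1)
y0 = sample_model(64, 2, 0.0, (8, 4), (10, 20), rng)  # null: mu = 0
y1 = sample_model(64, 2, 6.0, (8, 4), (10, 20), rng)  # alternative
```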
We address the situation where the signal has substantial 'energy' over a rectangle of unknown shape. Given a signal y = (y(i) : i ∈ [n]^d), the Z-score of a rectangle R is defined in (6). Recalling the class R(h) of rectangles with shape h defined in (3), let X_µ(h) denote the corresponding set of signals. Rectangles are useful in practice because of their ease of interpretation and implementation, and also because they are building blocks for more complicated shapes. They are also more amenable to a sharp asymptotic analysis, which is the focus of this paper. See the discussion in Section 4. Let h⋆ denote the shape of the rectangle of activation, defined as the shape h such that supp(x) ∈ R(h). For the sake of clarity and ease of analysis, we assume that we are given integers bracketing the scales under consideration. We know that, under the alternative, the signal is elevated over a rectangle in R. Our analysis is asymptotic with respect to the grid size diverging to infinity, n → ∞. While the grid dimension d remains fixed, µ, h, and h⋆ are allowed to depend on n. In fact, throughout this paper we assume that the smallest permitted scale is ω(log n), to avoid special cases and complications that arise when including very small rectangles in the scan.
As mentioned in the Introduction, we will use the oracle scan as a benchmark. Instead of (5), the oracle, which knows the shape h⋆, is faced with the simpler alternative (10). We take the asymptotic Neyman-Pearson approach, in which we control the asymptotic probability of type I error (aka false rejection). Consider a test T(y) which evaluates to 1 if it rejects H_0 and 0 otherwise. Throughout, we assume that a level α ∈ (0, 1) is given and we control the tests at the exact asymptotic level α, which means that lim_{n→∞} P_0(T(y) = 1) = α, where P_0 denotes the distribution of y under H_0. The left-hand side is called the asymptotic size of the test T. For all of the test statistics that we study, we provide a threshold that gives us such a type I error control. Once the size of the test is under control, we examine the power of the test. We choose to focus on the minimum power over the set of alternatives, which in the limit is lim inf_{n→∞} inf_x P_x(T(y) = 1), where P_x denotes the distribution of y under model (4).
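Size and power as defined above can be estimated by Monte Carlo for any test statistic. The sketch below does this for a toy oracle-type statistic at a fixed alternative (the helper names and the choice of threshold 1.645, the standard normal 0.95 quantile, are ours for illustration, not the paper's critical values).

```python
import numpy as np

def mc_size_and_power(statistic, threshold, n, d, mu, shape, reps, seed=0):
    """Monte Carlo estimates of size P_0(T = 1) and power P_x(T = 1) for
    the test that rejects when statistic(y) >= threshold, under the flat
    signal mu / sqrt(|R*|) on a rectangle placed at the origin."""
    rng = np.random.default_rng(seed)
    sl = tuple(slice(0, h) for h in shape)
    bump = mu / np.sqrt(np.prod(shape))
    rej0 = rej1 = 0
    for _ in range(reps):
        noise = rng.standard_normal((n,) * d)
        rej0 += statistic(noise) >= threshold       # draw from H_0
        noise[sl] += bump
        rej1 += statistic(noise) >= threshold       # same draw, shifted
    return rej0 / reps, rej1 / reps

# toy statistic: sum over the true 4 x 4 rectangle, normalized by sqrt(16)
stat = lambda y: y[:4, :4].sum() / 4.0
size, power = mc_size_and_power(stat, 1.645, 32, 2, 6.0, (4, 4), reps=200)
```

Under H_0 this statistic is standard normal, so the estimated size should be near 0.05; under the alternative its mean is µ = 6, so the power is essentially one.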

The Oracle scan
When tasked with finding a rectangle of activation in a d-dimensional lattice, the problem is made easier if one knows the precise shape of the active rectangle. Having access to an oracle that provides the shape of the anomalous region simplifies the alternative down to (10). In this situation, one would naturally restrict the scan to rectangles with shape h⋆. We called this procedure the oracle scan in the Introduction. Given a critical value u, the oracle scan test is defined in (11).

Asymptotic theory
Define the critical value (12), where v_n is given in (13). Given a level α ∈ (0, 1), we choose τ as in (14).

Theorem 1. Suppose that min_j h⋆_j = ω(log n). The oracle scan test (11) with critical value (12) and τ chosen as in (14) has exact asymptotic size α.

Let Φ denote the survival function of the standard normal distribution.
Theorem 2. Suppose that min_j h⋆_j = ω(log n). The oracle scan test (11) with critical value (12)-(14) has the following asymptotic power, where v_n is defined in (13).

Computational complexity
While a naive implementation runs in O(n^d ∏_j h_j) time, the oracle scan can be computed in O(n^d log n) time using the Fast Fourier Transform (FFT), which is generally faster when the h_j's are not too small. Specifically, let b_h be the boxcar function with shape h, namely the indicator of a rectangle of shape h, and let * denote the convolution operator. Computing the convolution y * b_h amounts to computing (y[R] : R ∈ R(h)), and using the FFT, this convolution can be computed in O(n^d log n) time. The oracle scan test is then based on the maximum of y * b_h.
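As a concrete sketch of the oracle scan, the block below computes all window sums (y[R] : R ∈ R(h)) at once. Note a swap: instead of the FFT convolution described above, it uses an integral image (cumulative sums plus inclusion-exclusion), a NumPy-only alternative with comparable cost that gives exactly the same maxima; `oracle_scan` is our illustrative name.

```python
import numpy as np
from itertools import product

def oracle_scan(y, h):
    """Oracle scan over R(h): max over all placements of a shape-h
    rectangle of y[R] / sqrt(|R|), via an integral image."""
    d = y.ndim
    S = y
    for ax in range(d):                       # cumulative sums along each axis
        S = np.cumsum(S, axis=ax)
    S = np.pad(S, [(1, 0)] * d)               # prepend zeros for differencing
    out = np.zeros([y.shape[j] - h[j] + 1 for j in range(d)])
    # inclusion-exclusion over the 2^d corners of each window
    for signs in product((0, 1), repeat=d):
        sl = tuple(slice(h[j] * s, h[j] * s + out.shape[j])
                   for j, s in enumerate(signs))
        out = out + ((-1) ** (d - sum(signs))) * S[sl]
    return out.max() / np.sqrt(np.prod(h))

rng = np.random.default_rng(2)
y = rng.standard_normal((32, 32))
z = oracle_scan(y, (8, 4))

# brute-force check on the same data
z_naive = max(y[i:i + 8, j:j + 4].sum()
              for i in range(25) for j in range(29)) / np.sqrt(32)
```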

The multiscale scan
Perfect knowledge of the shape of the true rectangle of activation is rare. A simple solution to this problem is to scan over all rectangles in the class R and report the largest observed Z-score. Formally, given a critical value u, the multiscale scan test is given in (15). This is the test based on the scan statistic as defined in (1), except that R is now defined as in (8).
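The multiscale scan is simply the maximum of the per-shape scans over all admitted shapes. The sketch below loops over shapes between an assumed smallest scale and n/e and scans each shape with a sliding window; it is illustrative and does not use the faster FFT implementation discussed in the computational-complexity subsection (`multiscale_scan` and `h_min` are our names).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def multiscale_scan(y, h_min):
    """Multiscale scan of a 2D array: max over shapes (h1, h2) with
    h_min <= h1, h2 <= n/e of the per-shape scan Z_{n,h}."""
    n = y.shape[0]
    h_max = int(n / np.e)
    best = -np.inf
    for h1 in range(h_min, h_max + 1):
        for h2 in range(h_min, h_max + 1):
            # all window sums of shape (h1, h2), then normalize by sqrt(|R|)
            w = sliding_window_view(y, (h1, h2)).sum(axis=(-2, -1))
            best = max(best, w.max() / np.sqrt(h1 * h2))
    return best

rng = np.random.default_rng(3)
z_m = multiscale_scan(rng.standard_normal((40, 40)), 4)
```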

Asymptotic theory
Define the critical value (16). Kabluchko (2011, Th 1.2) establishes the asymptotic size of the multiscale scan test when the scan ranges over all scales from 1 to n. We do the same when the smallest scale is ω(log n) and the largest is at most n/e. We note that, because of this, the critical value (16) that we use is different from the one in (2) that Kabluchko (2011) uses: the constants denoted by κ in the two places are in fact different, and the (4d − 1) factor in (16) is a (2d − 1) factor in (2).
Compared with the oracle scan test (see Theorem 2), the multiscale scan test (at the same level) has strictly less asymptotic power in general. For example, suppose that the smallest scale grows like n^a and h⋆_j grows like n^b for all j, for some fixed 0 < a < b < 1. In that case, to have power tending to one, the oracle scan requires a strictly weaker condition on the signal size µ.

Computational complexity
Using the FFT, the multiscale scan statistic can be computed in O((n^{2d}/h^d) log n) time, since each shape can be scanned in O(n^d log n) time as we saw in Section 2.1.2, and there are O(n^d/h^d) shapes in total in R.

The adaptive multiscale scan
Because the multiscale scan uses the same threshold u_m for all rectangle shapes, it ignores the fact that detecting small rectangles (at the finer scales) is more difficult than detecting large rectangles. The approach advocated in (Arias-Castro et al., 2011; Walther, 2010) is a refinement of the multiscale scan in that a different threshold is used at each scale (i.e., rectangle shape). Formally, given (possibly) shape-dependent critical values u_h, the adaptive multiscale scan test is given in (18). If in fact u_h does not depend on h, then this reduces to the multiscale scan test (15).
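A sketch of the adaptive test: reject as soon as some per-shape scan exceeds its shape-dependent critical value. The critical-value function `crit` below is a user-supplied stand-in; the toy formula in the usage example is loosely modeled on the dominant term of (20) and is an assumption for illustration, not the paper's exact critical values (19).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def adaptive_scan_rejects(y, h_min, tau, crit):
    """Adaptive multiscale scan on a 2D array: reject if, for some shape
    h = (h1, h2), the per-shape scan Z_{n,h} exceeds crit(n, h, tau)."""
    n = y.shape[0]
    h_max = int(n / np.e)
    for h1 in range(h_min, h_max + 1):
        for h2 in range(h_min, h_max + 1):
            w = sliding_window_view(y, (h1, h2)).sum(axis=(-2, -1))
            z = w.max() / np.sqrt(h1 * h2)
            if z >= crit(n, (h1, h2), tau):
                return True
    return False

# toy shape-dependent threshold: sqrt(2 sum_j log(n/h_j)) + tau
# (an assumption for illustration only)
crit = lambda n, h, tau: np.sqrt(2 * sum(np.log(n / hj) for hj in h)) + tau

rng = np.random.default_rng(4)
rej = adaptive_scan_rejects(rng.standard_normal((40, 40)), 4, 3.0, crit)
```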

Asymptotic theory
Define the shape-dependent critical value (19), where v_{n,h} is as in (20).

Theorem 5. Suppose that h = ω(log n). The adaptive multiscale scan test (18) with critical value (19)-(14) has exact asymptotic size α.

Theorem 6. Suppose that min_j h⋆_j = ω(log n). The adaptive multiscale scan test (18) with critical value (19)-(14) has the following asymptotic power, where v_{n,h} is defined in (20).
The adaptive multiscale scan test (at the same level) happens to achieve the same asymptotic power as the oracle scan (see Theorem 2) in the important case where h⋆ is not too large. Indeed, suppose for example that min_j h⋆_j = O(n^b) for some fixed 0 < b < 1. Letting v_n denote the quantity in (13), we obviously have v_{n,h} ≥ v_n, and the two are asymptotically equivalent in this regime.

Computational complexity
The computational cost of the adaptive multiscale scan is the same as that of the multiscale scan, i.e., O((n^{2d}/h^d) log n) time.

Approximate adaptive multiscale scan
The computational complexity of the adaptive multiscale scan, which is quadratic in the grid size, may be prohibitive in some situations. We now provide an algorithm that has nearly linear computation time while achieving the same asymptotic power. Inspired by the multiscale approximation developed in (Arias-Castro et al., 2011, 2005; Walther, 2010), we accomplish this by effectively scanning only over a subset of the rectangles that forms an ε-covering of R. Recall the definition of ξ in (4). We use the canonical metric for the Gaussian random field {ξ[R], R ∈ R}, which is given in (21). Given an ε-covering R_ε and (possibly) shape-dependent critical values u_h, the ε-adaptive multiscale scan test is defined analogously to (18), with R_ε in place of R.
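The canonical metric of a standardized Gaussian field can be written through overlaps: with correlation ρ = |R ∩ R'| / √(|R| |R'|) between the normalized sums over R and R', the distance is √(2 − 2ρ). This overlap form is our reading of (21), so treat the sketch below as an assumption-labeled illustration (rectangles are passed as lists of per-axis index intervals [a_j, b_j]).

```python
import numpy as np

def canonical_dist(R0, R1):
    """Canonical metric between two discrete rectangles, assuming
    delta(R0, R1)^2 = E (xi[R0]/sqrt|R0| - xi[R1]/sqrt|R1|)^2
                    = 2 - 2 |R0 ∩ R1| / sqrt(|R0| |R1|)."""
    inter = 1
    s0 = s1 = 1
    for (a0, b0), (a1, b1) in zip(R0, R1):
        inter *= max(0, min(b0, b1) - max(a0, a1) + 1)
        s0 *= b0 - a0 + 1
        s1 *= b1 - a1 + 1
    rho = inter / np.sqrt(s0 * s1)
    return np.sqrt(max(0.0, 2.0 - 2.0 * rho))

d_same = canonical_dist([(0, 7), (0, 3)], [(0, 7), (0, 3)])    # identical
d_disj = canonical_dist([(0, 3), (0, 3)], [(10, 13), (0, 3)])  # disjoint
```

Identical rectangles are at distance 0; disjoint ones are at the maximal distance √2.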

Asymptotic theory
Ideally, we would like to select ε small enough (in fact, decreasing with n) that the ε-adaptive multiscale scan statistic has asymptotically the same distribution as the (full) adaptive multiscale scan statistic. As it turns out, it is sufficient that ε √(log n) → 0 for this to occur. We will find that, with this choice of ε, it is possible to construct an algorithm that performs an ε-covering scan in near-linear time.
Consider critical values of the form (19)-(14) and define the P-value ᾱ_{n,h}(z) = inf{α ∈ (0, 1) : z ≥ u_{n,h}(τ_α)}. (23) Then the P-value associated with the adaptive multiscale scan test is given in (24), and, analogously, the P-value associated with the ε-adaptive multiscale scan test is given in (25).

Theorem 7. Consider the P-value for the multiscale scan or the adaptive multiscale scan, and its ε-covering analog, defined in (24)-(25), respectively. Assuming ε √(log n) → 0, the two P-values are asymptotically equivalent. This implies that any such ε-scan test enjoys the same asymptotic size and power as the corresponding full scan, established in Theorems 3 and 4 for the multiscale scan, and in Theorems 5 and 6 for the adaptive multiscale scan.

Implementation and computational complexity
The computational complexity of a scan over an ε-covering depends, of course, on how the ε-covering is designed. We refer the reader to (Arias-Castro et al., 2011, 2005; Walther, 2010) for some existing examples in the literature. We design here another ε-covering, which we find easier to scan over in practice. Specifically, assuming for convenience that n is a power of 2, we consider the class R_ε defined in (26).

Proposition 8. Suppose that ε^2 h → ∞ as n → ∞. When n is large enough, R_ε defined in (26) is indeed an ε-covering of R for the metric δ defined in (21).
Algorithm 1 gives an efficient implementation of a scan over R_ε. As in (Arias-Castro et al., 2005), we start by summing y over dyadic rectangles, which are defined as rectangles whose side lengths are powers of 2. Formally, let dyad_a denote the result of summing y over all rectangles of shape 2^a with the top-left corner at a multiple of 2^a, thought of as a field over the grid [n 2^{-a}]. Using dynamic programming, computing {dyad_a : a ∈ [log_2 n]^d} can be done in time O(n^d). This 'coarsification' allows us to quickly form spatial approximations to the full spatial scan for a specific shape h, for a given dyadic scale a ∈ [log_2 n]^d and a given location and scale. We note that the P-values that appear on Line 14 of Algorithm 1 can be defined in any way, and in particular could be based on other model assumptions. Put differently, the sole purpose of Algorithm 1 is to compute the P-value (25) for a given set of critical values in (23), which can be completely arbitrary.
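The coarsification step above can be sketched as a dynamic program: each dyad_a is obtained from a neighbor with one coordinate of a reduced by 1, by summing adjacent pairs along that axis, so the total work is O(n^d). This is a minimal illustration of the dyadic-sums step, not the full Algorithm 1 (`dyadic_sums` is our name; n is assumed a power of 2).

```python
import numpy as np
from itertools import product

def dyadic_sums(y):
    """For every dyadic scale vector a, compute dyad_a: the sums of y over
    aligned blocks of shape 2^a (top-left corners at multiples of 2^a)."""
    d = y.ndim
    K = int(np.log2(y.shape[0]))
    dyad = {(0,) * d: y.astype(float)}
    # process scale vectors in order of total scale, so the predecessor
    # (one coordinate reduced by 1) is always available
    for a in sorted(product(range(K + 1), repeat=d), key=sum):
        if a == (0,) * d:
            continue
        j = next(i for i in range(d) if a[i] > 0)
        prev = a[:j] + (a[j] - 1,) + a[j + 1:]
        p = dyad[prev]
        sl0 = [slice(None)] * d; sl0[j] = slice(0, None, 2)
        sl1 = [slice(None)] * d; sl1[j] = slice(1, None, 2)
        dyad[a] = p[tuple(sl0)] + p[tuple(sl1)]   # sum adjacent pairs
    return dyad

y = np.arange(16.0).reshape(4, 4)
dy = dyadic_sums(y)
```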
Proposition 9. Suppose that ε^2 h → ∞ as n → ∞. When n is large enough, Algorithm 1 performs a scan over R_ε defined in (26).
Proposition 10. Algorithm 1 requires on the order of max{n^d, ε^{-4d} (n/h)^d log n} basic operations.
For example, if h = n^a for some fixed a ∈ (0, 1) and ε = (log n)^{-1} (which is allowed by Theorem 7), then the computational complexity of ε-AdaScan is of order O(n^d), which is precisely linear in the grid size.
Algorithm 1 Implementation of the ε-adaptive multiscale scan over the ε-covering defined in (26). n is assumed to be a power of 2 for convenience. The P-values can be as in (23) or completely arbitrary, for example based on a different parametric model.

Numerical experiments
In this section, we discuss some findings from simulation experiments. In each of the following experiments, we generate observations that conform to our assumptions, namely that the random field y is drawn according to (4) and that there is a rectangular activation under H_1 as in (5). We will consider three questions.
1. For finite n, does the adaptive test T_a(y) have appreciably superior power compared to the multiscale test T_m(y)?
2. For finite n, do the theoretically-derived thresholds (16) and (19) control the level of the tests T_m(y) and T_a(y) as desired?
3. What is the trade-off between computation time and statistical power as we vary ε in the ε-adaptive scan (Algorithm 1)?
In all our experiments below, we consider the case of a discrete image (d = 2), and the signal under the alternative is proportional to the indicator function of a rectangle, i.e., x(i) = µ/√|R⋆| for i ∈ R⋆ and 0 otherwise, for some rectangle R⋆. For the multiscale and adaptive scans, we set the smallest scale h = 6. The first experiment addresses the effect that adapting has on the statistical power. We consider a 256 × 256 image (n = 256). We simulate 400 times from both the null H_0 and each instance of the alternative H_1. Under H_1, we set µ = 6 and consider three rectangle sizes (34 × 81, 34 × 38, and 18 × 15), with the location of the activation rectangle chosen uniformly at random. For each method, we estimate the false discovery rate and the true discovery rate, namely the fraction of the 400 simulations drawn from H_0 that were rejected and the fraction drawn from H_1 that were rejected, respectively, and plot them as the parameter τ varies, producing a receiver operating characteristic (ROC) curve. For each rectangle, we compare four methods: the oracle that scans at the scale of R⋆, the multiscale scan, the adaptive scan, and a modified adaptive scan in which v_{n,h} is replaced by v^{mod}_{n,h} = √(2 Σ_j log(n/h_j)).
Notice that v^{mod}_{n,h} is the dominating term in (20). Our findings (Figure 1) indicate that the adaptive scan test only marginally outperforms the multiscale scan test, while the modified adaptive test brings a more significant improvement. All these tests come closer to the oracle test as the size of the activation rectangle increases. Because the computational complexity of these tests is O(n^{2d}), evaluating the performance on significantly larger images was not feasible. The conclusion that we can draw from this is that, for images of moderate size, the effects of the lower-order terms in the adaptive test inhibit the gains in power that we expect from Theorem 5. We also provide quantile-quantile plots of the P-value statistics against the uniform distribution on (0, 1). The motivation for this is to assess whether the P-values computed based on the thresholds (16) and (19) are accurate. Our asymptotic theory (Theorems 1, 3, and 5) predicts that this is the case in the large-sample limit n → ∞. We see that the P-values tend to be over-estimated (Figure 2), so that they produce more conservative tests. In these experiments, we vary the image size to be 128 × 128, 256 × 256, and 512 × 512, with h = 4, 6, 8, respectively, and run 400 simulations from H_0. In finite samples, it is clear that our theory provides thresholds that are overly conservative. Based on this, we suggest using the adaptive scan P-value as a test statistic, calibrated by sampling from H_0 or by a permutation test, to form a P-value.
The final set of experiments is intended to demonstrate the performance of Algorithm 1 and to highlight the computational and statistical tradeoffs involved. We derive ROC curves for Algorithm 1 by applying it to 480 simulations with two different image sizes, 256 × 256 and 512 × 512 pixels (Figure 3), with parameters h = 6, 12 and µ = 4, 5, and active rectangles of size 61 × 47 and 82 × 35, respectively. We selected the values of ε by making 8d/ε^2 equal to the integers 1, . . ., 6 and selecting from these 4 representative curves. We interpret the results to mean that, as ε decreases, the performance quickly converges to the optimal ROC curve. We also considered the running time as ε changes (Figure 4). In this simple experiment we find that, while it is advantageous to have a small ε to increase the power, the improvements in power are generally outweighed by the additional computational burden.
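The ROC curves described in this section can be computed from simulated statistics alone. The sketch below builds an empirical ROC curve by sweeping the threshold τ over the pooled simulated values (the name `roc_curve` and the toy shifted-normal statistics are ours for illustration).

```python
import numpy as np

def roc_curve(stats_h0, stats_h1):
    """Empirical ROC: for each threshold tau (swept over all observed
    values, largest first), the fraction of H_0 simulations rejected
    against the fraction of H_1 simulations rejected."""
    taus = np.sort(np.concatenate([stats_h0, stats_h1]))[::-1]
    fpr = np.array([(stats_h0 >= t).mean() for t in taus])
    tpr = np.array([(stats_h1 >= t).mean() for t in taus])
    return fpr, tpr

rng = np.random.default_rng(5)
s0 = rng.standard_normal(400)          # test statistic under H_0
s1 = rng.standard_normal(400) + 2.0    # test statistic under H_1 (shifted)
fpr, tpr = roc_curve(s0, s1)
```

Both rates are nondecreasing in the sweep and reach 1 at the smallest threshold; a statistic with power shows the true rate dominating the false rate.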

Discussion
We briefly discuss some generalizations and refinements of our work here.
More general signals. In this paper we work in a context where the signal has substantial energy over some rectangle of unknown shape and location. This motivates the scans over classes of rectangles that we define and study in detail. Although one could scan over more complicated shapes, to increase power, as done for example in (Arias-Castro et al., 2011, 2005; Duczmal et al., 2006; Kulldorff et al., 2006), the implementation of such scans is generally very complicated and often relies on ad hoc search methods; also, the asymptotic analysis becomes much more complicated. In contrast, scanning over rectangles can be done efficiently and the mathematical analysis can be carried to exact asymptotics, as we show here; see also (Arias-Castro et al., 2005; Walther, 2010). Moreover, rectangles are building blocks for more complicated shapes and are representative of 'thick' or 'blob-like' shapes; see (Arias-Castro et al., 2011).
Signals of arbitrary sign. For concreteness and ease of exposition, we consider signals that are implicitly positive over a rectangle. This can be seen from our definition of the Z-score in (6). This would not be the most appropriate definition when one is expecting signals of arbitrary sign, for example, when the signal x is such that the x(i) are iid normal with zero mean and variance τ over some rectangle R. In that case, assuming R is asymptotically large, we have x[R] ∼ N(0, τ), which is negative with probability one half. For such a signal, the most natural scans are based on chi-squared scores.

Other parametric models. Obtaining similar results for other parametric models may be of interest, for example, in epidemiology, where the Poisson distribution is used to model counts and would replace the Gaussian distribution here. Arias-Castro et al. (2011) extend their first-order analysis to distributions with a finite moment generating function, proving that the bounds obtained under normality still apply as long as h ≫ log n. It is possible that a similar phenomenon (essentially due to the Central Limit Theorem) applies at a more refined level, and that our results apply to such distributions, again, as long as h is sufficiently large.
Dependencies.A more involved extension of our results would be to allow the observations y(i), i ∈ [n] d to be dependent.The results of Chan and Lai (2006) upon which Kabluchko's arguments (and therefore ours too) are founded apply unchanged to the setting where short-range dependencies are present.So, in principle, an extension of our work in that direction is possible following similar lines.But we did not pursue this here for the sake of concreteness and conciseness of presentation.

Proofs
Our method of proof is largely based on (Kabluchko, 2011), which relies on the work of Chan and Lai (2006) on the extrema of Gaussian random fields and the Chen-Stein Poisson approximation (Arratia et al., 1989).
Signals that are indicators of rectangles. We will focus the remainder of the paper on signals x that are proportional to the indicator of a rectangle. This is asymptotically the most difficult case for all the scans that we consider. Indeed, we show in this section that the limits in Theorems 2, 4, and 6 hold when the signal is µ|R⋆|^{-1/2} 1_{R⋆}, while for a more general signal x such that x[R⋆] ≥ µ, these are seen to hold as lower bounds when taking the limit inferior. Together, this shows that the minimax asymptotics stated in Theorems 2, 4, and 6 hold.
We will often leave n implicit, but even then, all the limits are with respect to n → ∞, unless otherwise stated.

Locally stationary Gaussian random fields
We begin the proof section with an introduction to some theory for locally stationary Gaussian random fields (GRFs), particularly their smoothness and extreme value properties. Throughout this work, we approximate the discrete GRF given by {ξ[R], R ∈ R} with a continuous version. For that, define the continuous analog to R. Let Ξ be a (canonical) Gaussian white noise on R^d, meaning a random measure on the Borel sets of R^d such that, for any integer k ≥ 1 and any Borel sets R_1, . . ., R_k, the variables Ξ(R_1), . . ., Ξ(R_k) are jointly Gaussian, with zero mean and Cov(Ξ(R_i), Ξ(R_j)) = λ(R_i ∩ R_j), where λ(R) denotes the Lebesgue measure of R when R is a continuous rectangle. This GRF is denoted Ξ henceforth. It has zero mean and a covariance structure that is invariant with respect to translations and scalings. Following the approach taken by Kabluchko (2011), we approximate the discrete GRF ξ with its continuous counterpart Ξ. Therefore, we will be interested in the excursion probabilities of Ξ, which will require an introduction to locally stationary GRFs. For convenience, consider the parametrization of the rectangles R via the one-to-one map w = (h, t) → R(w) := [t, t + h] for w ∈ (0, ∞)^{2d}. We then use the shorthand Ξ(w) = Ξ[R(w)] for w ∈ (0, ∞)^{2d}. In this way, Ξ can be thought of as a GRF over (0, ∞)^{2d} with a covariance structure given, for pairs (h, t), (g, s) ∈ (0, ∞)^{2d}, in terms of x_+ = x ∨ 0. Furthermore, define W as the set of shapes and locations that correspond to rectangles in R. This describes the continuous version of the data under the null distribution H_0. Under the alternative, there is a signal, and the continuous counterpart to the discrete GRF has mean m(w) = µ Cov(Ξ(w), Ξ(w⋆)), (30) where w⋆ = (h⋆, t⋆) is the scale and location of the active rectangle. (Recall that under the alternative we are considering the signal µ|R⋆|^{-1/2} 1_{R⋆}.) We are now prepared to review some relevant results on boundary crossing probabilities of locally stationary GRFs.

Boundary crossing probabilities for locally stationary GRFs
In order to analyze the GRF Ξ we must introduce the notion of local stationarity and the tangent process.The definitions below are given in (Chan and Lai, 2006;Qualls and Watanabe, 1973), and utilized in (Kabluchko, 2011).
We note that we work in dimension p = 2d, except when analyzing the oracle scan, in which case p = d, because the shape h⋆ is given. Given K ⊂ R^p and γ > 0, define the γ-enlargement [K]_γ of K. Let S^{p-1} denote the unit sphere in R^p. We say that the GRF Ξ is locally stationary over the set W if, for W within the domain of Ξ, there exist α ∈ (0, 2], γ > 0, and a slowly varying function L, where the r_w : S^{p-1} → R_+ are continuous functions. For such processes, the local structure is defined accordingly, and we say that the local structure is homogeneous of order α. Let the tangent process at w ∈ W be {H_w(u)}_{u ∈ R^p}, defined as the GRF with the corresponding covariance. The high excursion intensity has been shown to exist within (0, ∞) in (Chan and Lai, 2006, Lem 5.2), which in fact states that this convergence is uniform in w ∈ W. Kabluchko (2011) proves the following result by observing that Ξ has the same local structure as a tensor product of normalized differences of Brownian motions.
Lemma 11 (Kabluchko (2011)). The GRF Ξ is locally stationary over W with α = 1 and L(x) = 1, with local structure and high excursion intensity as given.

Define the function that appears in the boundary conditions of Lemma 13 below.

Lemma 12 (Chan and Lai (2006), Th 2.1). Let K ⊂ W ⊂ R^p be fixed, bounded, and Jordan measurable, and let Ξ be a GRF that is locally stationary over W with homogeneity α and high-excursion intensity Λ; then the stated boundary crossing asymptotics hold.

Chan and Lai (2006) generalized this theorem for calculating the non-constant boundary crossing probability of a locally stationary GRF, which we will use in our analysis of the adaptive scan. In fact, (Chan and Lai, 2006, Th 2.8) allows a non-constant boundary, a set that grows with n, and holds for non-Gaussian random fields. We specialize the theorem to our setting.
Lemma 13 (Chan and Lai (2006), Th 2.8). Let Ξ be a GRF that is locally stationary over W with homogeneity α and high-excursion intensity Λ, and take a fixed, bounded, and Jordan measurable set K such that [K]_γ ⊂ W for some γ > 0. Let (c_ζ : ζ ∈ (0, 1)) be a family of real-valued functions defined on W satisfying conditions (32) and (33) for some fixed γ_0 > 0. Then the corresponding non-constant boundary crossing approximation holds.

Lemma 13 differs from the statement in (Chan and Lai, 2006, Th 2.8), which includes additional conditions. This is because we assume that K is fixed and Ξ is Gaussian. Their conditions (2.16) and (2.18) are precisely (32) and (33), while their condition (2.14) is trivially true for fixed K. In the proof of (Chan and Lai, 2006, Th 2.1), their conditions (A1)-(A5) were shown to hold for locally stationary GRFs, and as a consequence so do (B1)-(B5), since the process is exactly Gaussian and the domain K is fixed.

Approximating Ξ with an ε-covering
In this section we state and prove results on the covering properties of W and the continuity of Ξ. The metric δ over R introduced in (21) translates into the following metric on W (we overload the notation): δ(w_0, w_1) = δ(R(w_0), R(w_1)), for all w_0, w_1 ∈ W.
An ε-covering of W is defined analogously. To be sure, it is a subset W_ε ⊂ W such that, for all w ∈ W, there is w′ ∈ W_ε such that δ(w, w′) ≤ ε. The ε-covering number for the metric space (W, δ), denoted N(W, δ, ε), is the cardinality of a smallest ε-covering of W for δ, and log N(W, δ, ε) is the ε-entropy of W.
Lemma 14. For any ε > 0, the ε-entropy of W satisfies the bound displayed below.

Proof. Let (h, t), (g, s) ∈ W. Starting with (21) and (28), and using a bound that follows from the union bound or a simple recursion, we have δ((h, t), (g, s)) ≤ 2 Σ_j θ((h_j, t_j), (g_j, s_j)), where θ is the per-coordinate analogue of δ. Because of this and W ⊂ [h̲, n]^d × [0, n]^d, it suffices to construct an (ε²/2d)-covering of each [h̲, n] × [0, n] with respect to θ. (We define a covering in θ analogously, although θ is not necessarily a metric.) We divide by h̲ everywhere, so that we may focus on [1, T] × [0, T], where T = n/h̲, by the scale invariance of θ. Fix α ∈ (0, 1), and cover the scale coordinate with a geometric grid and the location coordinate with a grid whose spacing matches the local scale. The cardinality of the resulting covering is then bounded using the facts that log(1 + x) ≥ x² log 2 for all x ∈ [0, 1] and t ≥ log t / log 2 for all t ≥ 1. When we choose α = 4d/(4d + ε²), the tensor product of these coverings, repeated over j = 1, . . ., d, is an ε-covering of W of the stated cardinality.

We use this bound on the entropy, together with a continuity property of Ξ, to bound sup Ξ. This bound will be crude relative to the asymptotic guarantees that are the focus of this work, but is necessary as a lemma. For an ε-covering W_ε ⊂ W, define the interpolated GRF Ξ_ε over W, with value at w ∈ W given by Ξ_ε(w) = Ξ(w′), where w′ = argmin{δ(w_0, w) : w_0 ∈ W_ε}; if the minimizer is not unique, choose one arbitrarily. For a real-valued function f over W, let ‖f‖_∞ = sup_{w∈W} |f(w)|.
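As a concrete illustration of the ε-net machinery (our own sketch, not the paper's construction), the following builds a crude one-dimensional net of the scale-location strip on a geometric grid in the scale h and a matching grid in location, and evaluates a field on the net by nearest-neighbor lookup, as in the definition of Ξ_ε. The metric `dist` is only a stand-in for δ, assumed here to behave like a log-scale gap plus a rescaled location gap.

```python
import numpy as np

def geometric_net(T, eps):
    """Net for [1, T] x [0, T]: geometric grid in the scale h,
    location grid with spacing proportional to eps**2 * h."""
    net = []
    h = 1.0
    while h <= T:
        step = eps**2 * h               # coarser locations at larger scales
        net.extend((h, t) for t in np.arange(0.0, T + step, step))
        h *= 1.0 + eps**2               # geometric grid in scale
    return net

def dist(w0, w1):
    # Stand-in for the metric delta (an assumption, not the paper's formula).
    (h0, t0), (h1, t1) = w0, w1
    return abs(np.log(h0) - np.log(h1)) + abs(t0 - t1) / max(h0, h1)

def interpolate(field, net, w):
    """Xi_eps(w) = Xi(w'), with w' the nearest net point in the metric."""
    i = min(range(len(net)), key=lambda k: dist(net[k], w))
    return field(net[i])

net = geometric_net(T=16.0, eps=0.5)
val = interpolate(lambda w: w[0] + w[1], net, (3.0, 7.0))
assert np.isfinite(val)
```

The net cardinality grows polynomially in T and 1/ε, which is the qualitative content of the entropy bound above.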
Lemma 15. Consider the GRF Ξ introduced in Section 5.1. In our context, it has the following properties.
1. The supremum of Ξ has the behavior displayed below.
2. Let U ⊂ W be such that there exists a constant C > 0 with the property that max_j |t_j − s_j| ≤ C and max_j |log h_j − log g_j| ≤ C for all (h, t), (g, s) ∈ U. Then sup_{w∈U} Ξ(w) = O_P(1).
3. Let Ξ_ε be an interpolated GRF built on an ε-covering of W, where ε < 1. Then ‖Ξ − Ξ_ε‖_∞ admits the stated bound.

Proof. We prove each part in turn.
Part 2. This can be proven by noticing that, for any ε, the entropy of U satisfies log N(U, δ, ε) ≤ C_0 log(1/ε) for some constant C_0, using a construction analogous to that used in the proof of Lemma 14. The rest follows as in the proof of Part 1.

Part 3. As before, let T = n/h̲. By the definition of Ξ_ε, applying Dudley's theorem and Lemma 14 bounds the right-hand side, and the result follows by Markov's inequality.
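To make the Dudley step concrete (an illustrative computation with made-up constants, not the paper's bound), one can numerically evaluate the entropy integral C ∫ √(log N(ε)) dε for an entropy of the polynomial form that Lemma 14 produces:

```python
import math

def dudley_bound(log_N, diam, n_grid=10_000):
    """Numerically evaluate C * integral_0^diam sqrt(log N(eps)) d eps,
    the Dudley entropy-integral bound (C = 12 in one common statement)."""
    total = 0.0
    for k in range(1, n_grid + 1):
        eps = diam * k / n_grid
        total += math.sqrt(max(log_N(eps), 0.0)) * (diam / n_grid)
    return 12.0 * total

# Hypothetical entropy log N(eps) = 2d * log(C*T/eps^2), mimicking the
# polynomial covering bound of Lemma 14 (d, T, C are illustrative values).
d, T, C = 2, 100.0, 8.0
bound = dudley_bound(lambda e: 2 * d * math.log(C * T / e**2), diam=1.0)
assert bound > 0.0
```

The integral converges despite the logarithmic blow-up at ε → 0, which is why the resulting control on E sup Ξ is finite even though it is crude.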
We will now analyze the P-values resulting from our various scan statistics via their Lipschitz property. This will allow us to show that, if ε decreases quickly enough, the P-value of each test evaluated over an ε-covering is asymptotically indistinguishable from the P-value evaluated over the entire set W. In the end, we will have proven Theorem 7, but these results will also be useful in proving other results. For convenience, we work with τ instead of α, related by (14). For each scan statistic, let τ̂ denote the value of τ at which the scan statistic equals its threshold ((12), (16), or (19)). It takes the form (36), where m is defined in (30) (with m ≡ 0 under H_0), while a, b, and W′ ⊂ Z^{2d} ∩ W depend on which scan statistic we are considering. We will relate the statistic τ̂ to the random variable τ* defined in (37).

Lemma 16. Suppose there are constants L > 0 and ε_0 > 0 such that the Lipschitz condition (38) holds for all w ∈ W, and that ε satisfies (39). Then |τ̂ − τ*| = o_P(1).

Proof. The claim follows from the continuity bounds of Lemma 15. From this, we conclude.
For the oracle and multiscale scan statistics, a and b are constant in w, and so they are trivially Lipschitz. For the adaptive multiscale scan, we verify below that they indeed satisfy (38).
The following lemma allows us to approximate a discrete scan with its continuous counterpart.
Lemma 18. W′ = W ∩ Z^{2d} is a √(4d/h̲)-covering of W with respect to δ.

Proofs: main results
The following lemma will allow us to derive the asymptotic threshold from the excursion probabilities established in the proofs that follow.
Lemma 19. Let s and t be constants, let (η_m) be a sequence tending to 0, and define the sequence displayed below. Then the stated asymptotic equivalence holds.

Proof. Expanding the logarithm of the quantity in question gives the stated expansion, and the result follows by applying the exponential.

Proof of Theorem 1
Since the shape h is given, in this section we let Ξ(t) = Ξ(h, h • t), indexed only by the spatial parameter t ∈ T = ×_{j=1}^d [0, T_j], where T_j = n/h_j. This is after rescaling, where we divide the jth coordinate by h_j. Specifically, the reparametrized GRF has zero mean and covariance structure determined by R(t) := [t, t + 1]. The GRF Ξ restricted to T is stationary, thus it is locally stationary over T, but in p = d dimensions. Moreover, it has the local structure C_t(s) = ‖s‖_1, obtained by evaluating the local structure in Lemma 11 in the case h = 1 and g = 0. Hence, we know that it is homogeneous of order α = 1 with L = 1 and r_t(u/‖u‖) = ‖u‖_1/‖u‖. Due to the restriction to T, the tangent process of {Ξ(t)}_{t∈T} must also be restricted to T. This will alter the high-excursion intensity from that given in Lemma 11, which we derive next.
In order to prove Lemma 11, Kabluchko (2011) developed a technique for analyzing the tangent process using sums of independent Brownian motions. We use the same approach. First, note that a version of the tangent process is given by a tensor sum of processes V_j, where the V_j are independent standard Brownian motions with drift −|s_j|/√2. (Notice that, when calculating the high excursion intensity Λ, the tangent process is restricted to the positive orthant.) To see that U is indeed a version of the tangent process, one checks that the means and covariances agree; here H_1 = 1 is Pickands' constant for α = 1 (Pickands, 1969). We may now apply Lemma 12, and the high excursion probability follows. Until further notice, we take u to be the critical value (12). Recall that λ is the Lebesgue measure, here on R^d. Define the index set I below, and consider the events E_i = {sup_{t∈R(i)} Ξ(t) > u} for i ∈ I. Notice that, by translational invariance and an application of (41), together with Lemma 19, we obtain the asymptotic equivalences (42) and (43).
We will now establish a Poisson limit for the above process over the entire set T based on finite-range dependence. Two events E_i and E_{i′} are independent whenever the rectangles R(i) and R(i′) are sufficiently separated. Hence, by (42) and (43), and the fact that |I| = O(∏_j T_j), we obtain the first moment bound. Now take i ∈ I and i′ ∈ B_i. By (42) and (43), arguing as in (43) except that λ(R(i) ∪ R(i′)) = 2 when i′ ≠ i, we also obtain the corresponding bound on joint probabilities. This holds uniformly over i by translation invariance (translating the whole blanket set B_i), and also uniformly over i′ in the blanket because there are at most 3^d of these. Finally, by (42) and (43), the Poisson approximation result stated in (Arratia et al., 1989, Th 1) applies in our context, from which we derive P(∩_{i∈I} E_i^c) → e^{−e^{−τ}}. In exactly the same way, we can also derive the analogous limit over U_η. Thus, in probability, μ − sup_{t∈U\U_η} Υ(t) → ∞, implying the corresponding statement for the scan restricted to U \ U_η. We then obtain the sandwich bounds, where the inequality is by independence of (Ξ(t), t ∈ U_η) and (Ξ(t), t ∈ T \ U), and the convergence is by (45), (46), and (47). We conclude that the continuous oracle scan has the stated limiting distribution, and by Lemma 16 and Lemma 18, we find that this holds for the discrete scan statistic as long as h̲ = ω(log n), so that (39) is satisfied.
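The Poisson clumping heuristic underlying this argument can be checked empirically in its simplest setting (a toy illustration with made-up parameters, not part of the proof): for independent Gaussians — the degenerate case of finite-range dependence — the number of high-level exceedances is approximately Poisson, so P(max ≤ u) ≈ exp(−N · P(Z > u)).

```python
import numpy as np
from math import erf, sqrt, exp

rng = np.random.default_rng(0)

def std_normal_sf(u):
    """P(Z > u) for a standard normal Z."""
    return 0.5 * (1.0 - erf(u / sqrt(2.0)))

# Toy parameters (illustrative only): N i.i.d. N(0,1) variables, level u.
N, u, reps = 2000, 3.0, 2000
lam = N * std_normal_sf(u)          # expected number of exceedances
poisson_pred = exp(-lam)            # Poisson prediction for P(max <= u)
emp = np.mean(rng.standard_normal((reps, N)).max(axis=1) <= u)
assert abs(emp - poisson_pred) < 0.05
```

In the proof the same logic is applied to weakly dependent block events E_i via the Chen-Stein bound of Arratia et al., with the dependence confined to the blanket sets B_i.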

Proof of Theorem 3
We now redefine u as the critical value in (16). We assume that we are under the null. Applying (Kabluchko, 2011, Th 1.4), with a ← 1 and n ← n/h̲, we obtain the limiting distribution. We then invert this to get lim_{n→∞} P(sup_{w∈W∩Z^{2d}} ξ[R(w)] ≥ u) = α.

Proof of Theorem 4
We assume that we are under the alternative. The arguments are essentially identical to those in Section 5.3.2, except that this time both the scale and the location vary. In particular, we work with W′ = W ∩ Z^{2d}, with w* denoting the true scale and location of the rectangle of activation. The remainder of the proof is exactly the same.

Preliminaries
The lemmata stated and proved in this section will be used to prove Theorems 5 and 6. Until further notice, u(h) (or u_n(h), if we choose not to suppress the dependence on n) denotes the critical value defined in (19), while v(h) = v_{n,h} denotes the function of h in (20). The parameter τ remains fixed throughout.
The following technical lemma is used throughout this section.
Lemma 20. For h with n/h ≥ n/h̄ ≥ e, the function h ↦ u_n(h) is locally Lipschitz with respect to δ.

Proof. Recall the notation introduced in (36), where we now abbreviate a(h) = a((h, t)) and b(h) = b((h, t)); these functions are specified in Lemma 17. By Lemma 17, |a(h) − a(h′)| ≤ Lδ(w′, w) for some L > 0 and δ(w′, w) ≤ ε_0. Working with the second term, we use the fact that a(h), a(h′) ≥ 1, because n/h ≥ n/h̄ ≥ e by assumption. Working with the third term, we use the fact that there exists a C such that b(h)/a(h) ≤ C for all allowed h. Combining these, we find that u(h) is indeed (locally) Lipschitz with respect to δ, with constant L′ = L(2 + τ + C).
We will introduce some notation for the following lemmata. Let h ∈ [eh̲, h̄]^d, define T(h) = ×_{j=1}^d h_j[⌊n/h_j⌋], and let t ∈ T(h). Define the set K_{(h,t)} and the event E_{(h,t)} accordingly.

Lemma 21. Let k ∈ Z_+^d be fixed in n, define h = h̲e^k, and let t ∈ T(h). Then the stated asymptotic equivalence holds.

Proof. First, by location invariance, P(∃(g, s) ∈ K_{(h,t)} : Ξ(g, s) > u_n(g)) = P(∃(g, s) ∈ K_{(h,h)} : Ξ(g, s) > u_n(g)).
Also, because δ is scale invariant for (g_0, s_0), (g_1, s_1) ∈ W, rescaling the set W by h^{−1} does not change the covariance structure of Ξ[R]. Let Λ(h) = 4^{−d} ∏_{i=1}^d h_i^{−2} be the high excursion intensity from Lemma 11, and set c = 4^d √(2π). We now check the conditions of Lemma 13. Here the boundary functions are u_n(h) defined in (19), and they satisfy (32) with ζ = (log n)^{−3/2}. To see this, first notice that K_0 is fixed and Jordan measurable. By (35), for any fixed γ > 0 small enough and w_0 = (g_0, s_0), the relevant increment is bounded by a small constant C_γ times the distance. Thus, by Lemma 20, (32) holds for ζ small enough. We also have a bound that holds uniformly over such w_0, w_1, which verifies (33). Furthermore, recalling that, in the notation of Lemma 13, we have α = 1 and p = 2d, we finally get (32) in full. Hence, we have established the conditions of Lemma 13. Applying it, we obtain an integral of u_n(hg)^{4d−1} e^{−u_n(hg)²/2} over g.
We have from Lemma 19 that u_n(hg)^{4d−1} e^{−u_n(hg)²/2} ∼ c e^{−τ} ∏_j (hg_j)/(n log²(eg_j)), which gives the asymptotics of the integrand. By Lemma 19 this holds uniformly over n, so the integrand is dominated, and by dominated convergence the limit may be taken inside the integral. We have our result because of the resulting identity.

Lemma 22. Resume the notation of Lemma 21. There exists a constant C > 0 not depending on n, k, or t (but possibly depending on τ or d) such that the stated bound holds.

Proof. Let w = (h, t). By (49) and scale invariance, for some constant C_1 > 0, we obtain a bound by an application of Lemma 12. On the one hand, using the form of Λ given in Lemma 11, we evaluate the intensity term. On the other hand, by Lemma 19 there is a constant C_2 (not depending on k, n, or t) bounding the boundary term for all (g, s), where C_3 is some constant, using the fact that min_{h′∈[e^{−1}h, h]} v(h′) = v(h) for h_j ≥ eh̲ for all j. We conclude that there exists a constant C_4 such that the stated bound holds for all such w.

Lemma 23. For fixed A > 0, with U_A defined below, P(∪_{w∈I} E_w) converges to the stated limit as n → ∞.

Proof. Resume the notation and definitions of Lemma 22. We partition the space W into blocks in the scale and location parameters, defining the index set I by translation invariance. By Lemma 21, the probability of each block event is asymptotically determined. We partition the set U_A into the blocks {K_w : w ∈ I} and then use the Chen-Stein Poisson approximation to derive P(∪_{w∈I} E_w). Two events E_{(h,t)} and E_{(g,s)} are independent if |t_j − s_j| > 2(h_j ∨ g_j) for some j ∈ [d]. Consider then the 'blanket' sets B_w, whose cardinality is bounded by a constant depending only on d and A. By the same arguments underlying the proof of Lemma 21, and by Lemma 22 uniformly over w′ ∈ B_w, the sum of joint occurrence probabilities is negligible.

Again, by translation invariance and the fact that both bounds are uniform over the blanket, this shows that the events E_w over I have finite-range dependence. Hence, by (Arratia et al., 1989, Th 1), the Poisson approximation holds. This also holds with I′ in place of I, with lim_{n→∞} M unaffected. So the proof is complete.
Lemma 24. With U_A defined in Lemma 23, we also have the corresponding bound on the complement.

Proof. We keep the same notation as in the previous proof. Define the event E_A accordingly; note that E_A depends on n via u(h) in (19). By the union bound and (52), using translation invariance and Lemma 21, the probability is bounded by a sum of terms p_{n,k}. By Lemma 22, there is a dominating sequence (D_k), independent of n and summable over Z_+^d, satisfying p_{n,k} ≤ D_k. Thus, we can apply dominated convergence and conclude. Since the RHS tends to zero as A → ∞, the proof is complete.

Proof of Theorem 5
By Lemma 23 and Lemma 24, the limit over all of W follows. Hence, the random variable τ* defined in (37) satisfies the corresponding convergence. Now, we may apply Lemma 16 with Lemma 18 to obtain that the statistic τ̂ defined in (36) satisfies |τ̂ − τ*| = o_P(1) and, therefore, has the same limit.
By Lemma 16 and Lemma 18, we find that this also holds when W is replaced by W′, as long as h̲ = ω(log n). From this we conclude as in Section 5.3.2.

Proof of Theorem 7
For the adaptive multiscale scan, Lemma 17 allows us to apply the conclusion of Lemma 16 to the critical value (19). For the multiscale scan statistic, Lemma 16 applies to the constant critical value (16).
Let τ̂ be the result of the scan over the discrete set W ∩ Z^{2d}, for either the multiscale or adaptive multiscale scan, and let τ̂_ε be the scan over the ε-covering. Then, by Lemma 18, W ∩ Z^{2d} is an ε-covering of W for ε = √(4d/h̲) = o((log n)^{−1/2}). Thus, we may apply Lemma 16, unless μ = ω(√(log(n/h̲))).

Proof of Proposition 8
We now show that R_ε is an ε-covering of R. Specifically, for each (h, t) ∈ W, we construct (g, s) such that R(g, s) ∈ R_ε and δ(R(h, t), R(g, s)) ≤ ε. Take a_j = ⌊log_2(…)⌋ ≥ a̲, for each j ∈ [d], and define g_j = argmin{|h − h_j| : h ∈ 2^{a_j} Z_+}, s_j = argmin{|s − t_j| : s ∈ 2^{a_j} Z_+}.
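The rounding step can be sketched as follows (our own illustration with a hypothetical choice of a_j, since the exact exponent is set by the net construction): each scale h_j and location t_j is snapped to the nearest positive multiple of 2^{a_j}.

```python
def round_to_dyadic(x, a):
    """Nearest positive multiple of 2**a to x; any nearest point
    works for the covering argument."""
    step = 1 << a
    k = max(1, round(x / step))   # Z_+ here means positive multiples
    return k * step

# Hypothetical exponent a_j = 2, so we snap to multiples of 4.
g = round_to_dyadic(13.0, 2)   # nearest multiple of 4 to 13
s = round_to_dyadic(7.0, 2)    # nearest multiple of 4 to 7
assert abs(g - 13.0) <= 2.0 and abs(s - 7.0) <= 2.0
```

The rounding error in each coordinate is at most half the grid step 2^{a_j}, which is what the bound δ(R(h, t), R(g, s)) ≤ ε quantifies after rescaling.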

Proof of Proposition 9
First, we establish that, for a ∈ {a̲, . . ., ā}^d, the identity below holds. An induction on a_1, based on the recursion in Line 5, gives dyad_a(t) = Σ_i y(2^a • t + i), for all t ∈ ×_j [⌊n/2^{a_j}⌋].
Based on this, together with (55), we can see that the statistic ŝ in Algorithm 1 is equivalently expressed as the maximum of y[2^a • t, 2^a • (t + f)], confirming that Algorithm 1 does scan over R_ε.
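A minimal one-dimensional sketch of the dyadic aggregation that the induction describes (our own illustration with hypothetical names; Algorithm 1's actual indexing may differ): the level-a array is built by pairwise summing the level a−1 blocks, so dyad[a][t] holds the sum of y over the block of side 2^a starting at 2^a·t.

```python
import numpy as np

def build_dyad(y):
    """dyad[a][t] = sum of y over the block [2^a * t, 2^a * (t+1)),
    via the pairwise recursion dyad[a][t] = dyad[a-1][2t] + dyad[a-1][2t+1]."""
    dyad = [np.asarray(y, dtype=float)]          # level a = 0: the raw data
    while len(dyad[-1]) >= 2:
        prev = dyad[-1]
        m = len(prev) // 2
        dyad.append(prev[:2 * m:2] + prev[1:2 * m:2])  # pairwise sums
    return dyad

y = np.arange(8, dtype=float)        # data 0..7
dyad = build_dyad(y)
assert dyad[1][0] == 0 + 1           # block [0, 2)
assert dyad[2][1] == 4 + 5 + 6 + 7   # block [4, 8)
```

Each level halves the array, so the total work is n + n/2 + n/4 + … = O(n), the one-dimensional analogue of the O(n^d) count below.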

Figure 4: (Running time) Same setting as in Figure 3. Here we plot the running time in seconds on a 2.80 GHz virtual CPU as a function of ε. The line indicates the average time, the triangles are the 5th and 95th percentiles, and the error bars extend from the minimum to the maximum over the 480 simulations.
The construction of dyad takes O(n^d) operations. Indeed, the computation of dyad_a(t) over a ∈ [log_2 n]^d \ {1}^d and t ∈ [⌊n/2^a⌋] is done from Line 2 to Line 7 in Algorithm 1, and is easily seen to require on the order of Σ_{a∈[log_2 n]^d} ∏_j (n/2^{a_j}) ≤ n^d (Σ_{a≥1} 2^{−a})^d = n^d basic operations. Second, defining a_+ = Σ_{j=1}^d a_j, the convolution dyad_a *_b f takes O(n^d 2^{−a_+} log n) operations with the FFT, since the convolution happens on a grid of size ∏_j (n/2^{a_j}) = n^d 2^{−a_+}. Therefore, the computation on Line 13 requires O(n^d 2^{−a_+} log n) basic operations. Hence, once dyad is computed, computing α requires on the order of Σ_{a∈[a̲,ā]^d} ∏_j |F_j| · n^d 2^{−a_+} log n = O(ε^{−2d} n^d log n) Σ_{a≥a̲} 2^{−ad} = O(ε^{−2d} n^d 2^{−da̲} log n), with 2^{−a̲} = O(1/(ε²h̲)) since h̲ ≥ 1. From this, we conclude.
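The FFT-convolution step can be illustrated as follows (a self-contained sketch, not Algorithm 1 itself; a NumPy FFT product stands in for the convolution on the coarse grid of size n/2^a):

```python
import numpy as np

def fft_convolve(x, f):
    """Linear convolution of x with filter f via FFT, truncated to len(x).
    Cost O(m log m) with m = len(x) + len(f) - 1, versus O(len(x) * len(f))
    for the direct sum."""
    m = len(x) + len(f) - 1
    out = np.fft.irfft(np.fft.rfft(x, m) * np.fft.rfft(f, m), m)
    return out[: len(x)]

# Convolving the level-a dyadic sums (coarse grid of size n / 2^a) with a
# small filter: the grid shrinks by 2^{-a}, which is the source of the
# n^d 2^{-a_+} log n term in the operation count (1-D stand-in here).
n, a = 64, 3
coarse = np.arange(n >> a, dtype=float)      # stand-in for dyad_a
filt = np.array([1.0, 1.0])                  # stand-in for the filter f
out = fft_convolve(coarse, filt)
direct = np.convolve(coarse, filt)[: len(coarse)]
assert np.allclose(out, direct)
```

Summing the per-scale FFT costs over the geometric range of scales is what collapses the total to near-linear in the grid size.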