Mixing and cut-off in cycle walks

Given a sequence $(\mathfrak{X}_i, \mathscr{K}_i)_{i=1}^\infty$ of Markov chains, the cut-off phenomenon describes a period of transition to stationarity which is of asymptotically lower order than the mixing time. We study mixing times and the cut-off phenomenon in the total variation metric for random walk on the groups $\mathbb{Z}/p\mathbb{Z}$, $p$ prime, with driving measure uniform on a symmetric generating set $A_p \subset \mathbb{Z}/p\mathbb{Z}$.


Introduction
The mixing analysis of random walk on a finite abelian group is a classical problem of probability theory, with widespread applications; the Ehrenfest urn and sandpile models of statistical mechanics are motivating examples [8,17,26]. Among the early results in this area is a theorem of Greenhalgh [15], which shows that for a generating set of size $k$ contained in $\mathbb{Z}/n\mathbb{Z}$, the mixing time of the corresponding random walk satisfies $t_{\mathrm{mix}} \gg_k n^{\frac{2}{k-1}}$. A set of size $k$ with mixing time bounded by $\ll_k n^{\frac{2}{k-1}} \log n$ is also exhibited. Dou, Hildebrand and Wilson [13], [16], [28] consider the mixing of measures driven by typical generating sets on cyclic and more general groups. Among the results of [16] is that typical generating sets of size $k = (\log n)^a$, $a > 1$, produce a random walk satisfying the cut-off phenomenon. We confine our attention to cyclic groups and symmetric generating sets which are smaller than logarithmic size in the order of the group, and prove a number of refined results on the mixing behavior. Our results are in a similar spirit to those of Diaconis and Saloff-Coste [5], proven in the more general context of random walk on groups of polynomial growth, but in narrowing our focus we emphasize strong uniformity in the number of generators of the random walk. Note that in the context of random walk on nilpotent groups, the mixing of the walk projected to the abelianization often controls the mixing in the group as a whole, see [14], [9].
To briefly summarize the results, Theorem 1 gives spectral upper and lower bounds for the mixing time in a sharper form than previous results which have appeared in the literature. A natural conjecture regarding random walk on a connected graph is that the total variation mixing time is bounded by the maximum degree times the diameter squared. A highlight of our work is Theorem 2, which verifies the conjecture for the mixing time of random walk on the Cayley graph of $\mathbb{Z}/p\mathbb{Z}$ with a small symmetric generating set. Theorem 3 gives a lower bound for the period of transition to uniformity relative to the mixing time, that is, a lower bound on the cut-off window. Theorem 4 determines the generic and worst case mixing behavior for a sequence of typical symmetric random walks. We conclude by analyzing the mixing time of a walk which may be considered an approximate embedding of the hypercube $(\mathbb{Z}/2\mathbb{Z})^d$ into the cycle, demonstrating a cut-off phenomenon.
1.1. Precise statement of results. Let $P$ be the set of primes. Given $p \in P$ let $A \subset \mathbb{Z}/p\mathbb{Z}$ be symmetric ($x \in A$ if and only if $-x \in A$), lazy ($0 \in A$) and generating ($|A| > 1$). Write $\mathscr{A}(p)$ for the collection of symmetric, lazy, generating subsets of $\mathbb{Z}/p\mathbb{Z}$, and for $k \in \mathbb{Z}_{>0}$ write $\mathscr{A}(p, k) \subset \mathscr{A}(p)$ for those sets of size $2k + 1$. Given $A \in \mathscr{A}(p)$ let $\mu_A$ denote the uniform measure on $A$. The distribution at step $n \ge 1$ of random walk driven by $\mu_A$ is given by the convolution power $\mu_A^{*1} = \mu_A$, $\mu_A^{*n} = \mu_A^{*(n-1)} * \mu_A$, $n > 1$. As $n \to \infty$, $\mu_A^{*n}$ converges to the uniform measure $U_{\mathbb{Z}/p\mathbb{Z}}$ on $\mathbb{Z}/p\mathbb{Z}$, and we consider the asymptotic behavior of this convergence for large $p$. In particular, we study the behavior of these walks as $k = k(p)$ varies as a function of $p$, and as $A$ varies in the set $\mathscr{A}(p, k)$.
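As a concrete illustration of the convolution powers $\mu_A^{*n}$, the following sketch computes the walk's distribution by repeated convolution on $\mathbb{Z}/p\mathbb{Z}$; the prime and generating set below are assumed, purely illustrative choices.

```python
# Illustrative sketch: the law mu_A^{*n} of the walk on Z/pZ, computed by
# repeated convolution. The prime p and the set A are assumed examples.
p = 13
A = [0, 1, -1, 3, -3]                     # symmetric, lazy, generating; k = 2

mu = [0.0] * p
for a in A:
    mu[a % p] += 1.0 / len(A)             # mu_A = uniform measure on A

def convolve(f, g):
    # (f * g)(x) = sum_y f(y) g(x - y) over Z/pZ
    return [sum(f[y] * g[(x - y) % p] for y in range(p)) for x in range(p)]

dist = mu[:]                              # mu_A^{*1}
for _ in range(49):
    dist = convolve(dist, mu)             # now mu_A^{*50}
```

After 50 steps the distribution of this small example is numerically indistinguishable from $U_{\mathbb{Z}/p\mathbb{Z}}$, in line with the convergence described above.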
Given a measure space $(X, \mathcal{B})$, a norm $\|\cdot\|$ on the space $M(X)$ of probability measures on $X$, a Markov chain $P^n(\cdot)$ with stationary measure $\nu \in M(X)$, and $0 < \epsilon < 1$, define the $\epsilon$-mixing time $t_{\mathrm{mix}}(\epsilon) = \inf\left\{n : \sup_{\mu \in M(X)} \|P^n(\mu) - \nu\| \le \epsilon\right\}$ and the standard mixing time $t_{\mathrm{mix}} = t_{\mathrm{mix}}\left(\frac{1}{e}\right)$. In the cases considered $X$ is a (finite, compact, locally compact) abelian group and, due to the symmetry of the walk, it is sufficient to take for $\mu$ the point mass at 0. Of primary interest is the total variation norm, which for $\mu, \nu \in M(X)$ is given by $\|\mu - \nu\|_{\mathrm{TV}(X)} = \sup_{B \in \mathcal{B}} |\mu(B) - \nu(B)|$. The mixing time with respect to this norm is indicated $t_{\mathrm{mix}}^1$. Two further important parameters in considering reversible Markov chains are the spectral gap of the transition kernel, $\mathrm{gap} = 1 - \sup\{|\lambda| : \lambda \in \mathrm{spec}(P) \setminus \{\pm 1\}\}$, and the relaxation time $t_{\mathrm{rel}} = \frac{1}{\mathrm{gap}}$. In stating our results we let $\tau_0$ denote the ratio of the total variation mixing time to the relaxation time for the standard Gaussian diffusion on $\mathbb{R}/\mathbb{Z}$. In the context of random walk on $\mathbb{Z}/p\mathbb{Z}$ with small symmetric generating sets, the relaxation and total variation mixing times are related as follows.

Theorem 1. Let $p$ be prime, let $1 \le k \le \frac{\log p}{\log\log p}$ and let $A \in \mathscr{A}(p, k)$. Denote $t_{\mathrm{rel}}$, $t_{\mathrm{mix}}^1$ the relaxation time and total variation mixing time of $\mu_A$ on $\mathbb{Z}/p\mathbb{Z}$. We have $\frac{\tau_0 e}{4\pi} p^{\frac{2}{k}} \lesssim_k \tau_0\, t_{\mathrm{rel}} \lesssim_p t_{\mathrm{mix}}^1 \lesssim_k 0.163\, k\, t_{\mathrm{rel}}$. Also, uniformly in $k$, $2k+1$

Remark. The relationship $t_{\mathrm{mix}}^1 \gtrsim \tau_0\, t_{\mathrm{rel}}$ exhibits Gaussian diffusion on $\mathbb{R}/\mathbb{Z}$ as asymptotically extremal for the ratio between the mixing and relaxation times.
Remark. The lower bound gives an explicit dependence on $k$ in Greenhalgh's theorem. An upper bound of this type may be extracted from [5], Theorem 1.2, but the $k$ dependence there is, in the worst case, exponential.
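The spectral quantities above are directly computable for a given walk, using the standard fact that the eigenvalues of a symmetric walk on an abelian group are the Fourier coefficients of its driving measure (recalled in the proof of Lemma 7 below). The parameters in this sketch are assumed, illustrative choices.

```python
# Sketch (assumed example): the eigenvalues of the transition kernel of the
# symmetric walk on Z/pZ are the real Fourier coefficients of mu_A; the
# spectral gap and relaxation time t_rel = 1/gap follow.
import math

p = 13
A = [0, 1, -1, 3, -3]

def eigenvalue(x):
    # hat(mu_A)(x) = (1/|A|) sum_{a in A} cos(2 pi a x / p)
    return sum(math.cos(2 * math.pi * a * x / p) for a in A) / len(A)

eigs = [eigenvalue(x) for x in range(p)]
gap = 1.0 - max(abs(e) for x, e in enumerate(eigs) if x != 0)
t_rel = 1.0 / gap
```

The trivial character ($x = 0$) gives eigenvalue 1, and the gap is governed by the nontrivial character closest to it.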
Theorem 1 relates the mixing time to spectral data, but in some cases it is more desirable to understand the mixing time geometrically. Given a symmetric generating set $A \subset \mathbb{Z}/p\mathbb{Z}$, denote by $\mathscr{C}(A, p)$ the Cayley graph with vertex set $V = \mathbb{Z}/p\mathbb{Z}$ and edge set $E = \{(n_1, n_2) \in (\mathbb{Z}/p\mathbb{Z})^2 : n_1 - n_2 \in A\}$. Write $\mathrm{diam}(\mathscr{C}(A, p))$ for the graph-theoretic diameter of $\mathscr{C}(A, p)$. Since $\mathbb{Z}/p\mathbb{Z}$ is abelian, there is a more geometric notion of diameter, $\mathrm{diam}_{\mathrm{geom}}(\mathscr{C}(A, p)) = \max_{x \in \mathbb{Z}/p\mathbb{Z}} \min\left\{\|n\|_2 : n \in \mathbb{Z}^k,\ \exists a \in A^k,\ n \cdot a \equiv x \bmod p\right\}$.
One has (the second inequality is given in Lemma 10) $\mathrm{diam}(\mathscr{C}(A, p)) \ge \mathrm{diam}_{\mathrm{geom}}(\mathscr{C}(A, p)) \gg \sqrt{\frac{t_{\mathrm{rel}}}{k}}$.
Random walk driven by µ A on Z/pZ may be interpreted as random walk on C (A, p) in which at each step the walker chooses a uniform edge leaving its current position.
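The two notions of diameter can be compared concretely. The sketch below (assumed small example) computes the graph diameter of $\mathscr{C}(A, p)$ by breadth-first search and the geometric diameter by brute force over representations $n \cdot a \bmod p$.

```python
# Sketch (assumed example): graph diameter of C(A, p) via BFS, and geometric
# diameter via brute force over integer representations n . gens mod p.
from collections import deque
import itertools, math

p = 13
A = [0, 1, -1, 3, -3]
gens = [1, 3]                         # one representative per symmetric pair

# Graph diameter: BFS from 0 suffices since the graph is vertex-transitive.
dist = {0: 0}
q = deque([0])
while q:
    v = q.popleft()
    for a in A:
        w = (v + a) % p
        if w not in dist:
            dist[w] = dist[v] + 1
            q.append(w)
graph_diam = max(dist.values())

# Geometric diameter: min ||n||_2 over n in Z^2 with n . gens = x (mod p).
best = {x: math.inf for x in range(p)}
for n1, n2 in itertools.product(range(-p, p + 1), repeat=2):
    x = (n1 * gens[0] + n2 * gens[1]) % p
    best[x] = min(best[x], math.hypot(n1, n2))
geom_diam = max(best.values())
```

For this example the graph diameter dominates the geometric diameter, as the displayed inequality requires.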
Theorem 2. Let $p$ be an odd prime and let $A \in \mathscr{A}(p)$ with $|A| = 2k + 1$, $1 \le k \le \frac{\log p}{\log\log p}$. The mixing time $t_{\mathrm{mix}}^1$ of random walk driven by $\mu_A$ satisfies, as $p \to \infty$, $t_{\mathrm{mix}}^1 \ll k \cdot \mathrm{diam}_{\mathrm{geom}}(\mathscr{C}(A, p))^2$.
Remark. In the context of random walk on a cycle, Theorem 2 refines in two ways the much more general Theorem 1.2 of [5], which applies in the context of groups of moderate growth. The dependence on the number of generators $k$ there is, in the worst case, exponential. Also, we replace the diameter there with the smaller geometric diameter here. See also [27].
Given a sequence of triples $(X_i, P_i, \nu_i)_{i=1}^\infty$, where $X_i$ is a measure space and $P_i$ is a Markov kernel on $X_i$ which has $\nu_i \in M(X_i)$ as its stationary distribution, the sequence exhibits the cut-off phenomenon in total variation if, for all $0 < \epsilon < \frac{1}{2}$, $\lim_{i\to\infty} \frac{t_{\mathrm{mix},i}^1(\epsilon)}{t_{\mathrm{mix},i}^1(1-\epsilon)} = 1$. The cut-off phenomenon is frequently observed in natural families of Markov chains, including the hypercube walk of [8] and riffle shuffling viewed as a random walk on the symmetric group [1]. Especially in total variation, the cut-off phenomenon is still imperfectly understood, so that there is significant interest in deciding its occurrence in specific examples; see for instance [10], [12], [4], [2], [21]. One necessary condition for cut-off in total variation to occur is $\frac{t_{\mathrm{mix}}^1}{t_{\mathrm{rel}}} \to \infty$; see Chapter 18.3 of [20]. In particular, by Theorem 1 any sequence of walks generated by $\{A_p \bmod p \subset \mathbb{Z}/p\mathbb{Z}\}_{p \in P}$ for which $|A_p|$ remains bounded does not have cut-off, a result first obtained in [5]. We give a different proof of this result, found independently by the author, which gives further information on the period of transition to uniformity.
Theorem 3. Let $p \ge 3$ be prime, let $1 \le k \le \frac{\log p}{\log\log p}$ and let $A \in \mathscr{A}(p, k)$. For any $0 < \epsilon < \frac{1}{e}$ the total variation mixing times of $\mu_A$ on $\mathbb{Z}/p\mathbb{Z}$ satisfy $t_{\mathrm{mix}}^1(\epsilon) - t_{\mathrm{mix}}^1(1 - \epsilon) \gg_\epsilon \frac{t_{\mathrm{mix}}^1}{k}$.

In contrast to Theorem 3, our next theorem shows that the generic behavior when $|A_p|$ grows slowly is for there to be a sharp transition to uniformity with infrequent exceptions.
Theorem 4. Let $k : P \to \mathbb{Z}_{>0}$ tend to $\infty$ with $p$ in such a way that $k(p) \le \frac{\log p}{\log\log p}$. Let sets $\{A_p \bmod p\}_{p \in P}$ be chosen independently, with $A_p$ chosen uniformly from $\mathscr{A}(p, k(p))$. The following hold with probability 1.
In particular, the cut-off phenomenon does not occur for $(\mathbb{Z}/p\mathbb{Z}, \mu_{A_p}, U_{\mathbb{Z}/p\mathbb{Z}})_{p \in P}$.
(3) For any sequence $\{\epsilon(p)\}_{p \in P} \subset \mathbb{R}_{>0}$ satisfying $\epsilon(p) k(p) \to \infty$ there is a density 1 subset $P_0 \subset P$ such that, in the family $(\mathbb{Z}/p\mathbb{Z}, \mu_{A_p}, U_{\mathbb{Z}/p\mathbb{Z}})_{p \in P_0}$, we have, as $p$ increases through $P_0$, In particular, the cut-off phenomenon occurs.
Remark. Since $\sum_p \frac{1}{p} = \infty$, items (1) and (3) of Theorem 4 demonstrate that, almost surely among a sequence of walks, infinitely often there are slowly mixing walks which are slower than the typical behavior by a factor of $\gg \frac{p^{\frac{2}{k(p)}}}{k(p)}$.

Remark. Item (3) of Theorem 4 gives a cut-off sequence with, for $0 < \epsilon < \frac{1}{2}$, period of transition between $t_{\mathrm{mix}}^1(1 - \epsilon)$ and $t_{\mathrm{mix}}^1(\epsilon)$ of length $O_\epsilon\left(\frac{t_{\mathrm{mix}}^1}{\sqrt{k}}\right)$. While this is longer than the lower bound $\frac{t_{\mathrm{mix}}^1}{k}$ given in Theorem 3, it is much shorter than the true transition period for many known examples exhibiting cut-off. For instance, the transition period of random walk on the hypercube is shorter than the mixing time by a factor which is logarithmic in the number of generators.
Our proofs of Theorems 1-4 approximate the distribution of random walk on the cycle $\mathbb{Z}/p\mathbb{Z}$ with that of a Gaussian diffusion on $\mathbb{R}^k/\Lambda$, where $\Lambda$ is a co-volume $p$ lattice. In making the transition between these models we use the following quantitative normal approximation lemma, for which we do not know an easy reference in the literature. A proof is included in Appendix A.
$e^{-\frac{x^2}{2\sigma^2}}$, the standard Gaussian density. As $n \to \infty$ we have

After transition to the diffusion model, the measure on lattices induced from the random choice in Theorem 4 is close to the uniform measure on the (rescaled) $p$-Hecke points, which are the index $p$ sublattices of $\mathbb{Z}^k$. It is known that, after rescaling to volume 1, as $p \to \infty$ these lattices become equidistributed with respect to the induced Haar measure in the space $\mathrm{SL}_k(\mathbb{Z})\backslash\mathrm{SL}_k(\mathbb{R})$ of all volume 1 lattices in $\mathbb{R}^k$. Statistics regarding correlations of vectors in a random lattice are well known; see for instance [25] for a modern treatment. Although we estimate somewhat different quantities, the results considered there may be useful in understanding our argument.
We conclude by giving an example of random walk on the cycle which has cut-off. This may be considered an approximate embedding of the classical hypercube walk into the cycle.
Theorem 6. For $p \in P$ let $\ell_2(p) = \lceil \log_2 p \rceil$ (logarithm base 2) and let the power-of-2 set be $A_{2,p} = \{0, \pm 1, \pm 2, \ldots, \pm 2^{\ell_2(p)-1}\} \subset \mathbb{Z}/p\mathbb{Z}$. The power-of-2 walk $(\mathbb{Z}/p\mathbb{Z}, \mu_{A_{2,p}}, U_{\mathbb{Z}/p\mathbb{Z}})_{p \in P}$ has cut-off in total variation at the mixing time.

1.2. Discussion of method. Our arguments view random walk on the cycle $\mathbb{Z}/p\mathbb{Z}$ with symmetric generating set $A$, $|A| = 2k + 1$, as random walk on an index $p$ quotient of $\mathbb{Z}^k$, in which a standard basis vector is assigned to each non-zero symmetric pair $\{x, -x\}$ of generators. The index $p$ lattice is the set $\Lambda = \{n \in \mathbb{Z}^k : \sum_x n_x x \equiv 0 \bmod p\}$. In the case of Theorem 6 the corresponding lattice is approximately cubic, and the argument is a perturbation of the Fourier analytic analysis of the hypercube walk in [11]. In particular, the mixing time and cut-off are the same in total variation and in $L^2$. For $k \le \frac{\log p}{\log\log p}$, a random index $p$ lattice gives a mixing time in total variation which is less than the $L^2$ mixing time by a constant factor, and thus the $L^2$ methods of proving cut-off are not immediately suitable. Thus in our first four theorems the arguments are made initially in the time domain, by first applying Lemma 5 to replace the discrete random walk with a diffusion on $\mathbb{R}^k/\Lambda$. This initial step is the reason for the restriction on the size of $k$, since the corresponding approximation fails for $k > (1 + \epsilon)\frac{\log p}{\log\log p}$. For larger $k$ there is a standard method of correcting the approximation using the saddle point method, but we have not made an attempt to do so.
After having made the Gaussian approximation, Theorem 1 combines standard spectral estimates with bounds for the shortest vector in a lattice (the lower bound) and for sphere packing (the upper bound). Theorem 2 goes through in time domain, using convexity. Theorem 3 goes through in time domain, and uses an estimate for the derivative of the density in time.
Parts (1) and (2) of Theorem 4 study rare events in which the random lattice is essentially one dimensional due to the presence of many short vectors. We study these cases in frequency space. The dual lattice of an index $p$ lattice of $\mathbb{Z}^k$ is $\Lambda^\vee = \mathbb{Z}^k + \ell$ where $\ell$ is a line. We are able to show that with high probability the large Fourier coefficients arise from frequencies which are small multiples of a single vector. The analysis restricts attention to primitive vectors and their multiples by Farey fractions modulo $p$, which are residues $bq^{-1} \bmod p$ in which $b$ and $q$ are bounded. Part (3) of Theorem 4 is proven in the time domain again. After removing a small $L^1$ error, the modified density may be estimated using a variance bound. In particular, our argument requires averages concerning pairs of short vectors in a random lattice which are discrete analogues of the averages performed by Siegel and Rogers [23], [22] regarding the distribution of vectors in a random lattice.

Abelian groups are prevalent in arithmetic, and there would be interest in extending the results to random walks on more general abelian groups. The class group of an imaginary quadratic field grows like the discriminant to the power $\frac{1}{2} + o(1)$, so a reweighting of Theorem 4 with roughly $d$ groups of order $d$ would be of interest. The techniques presented should translate without any great difficulty to studying random walk on cycles of composite order. The general case has not been considered, but see [28] for a study of random random walk on the hypercube.
To model abelian sandpiles, asymmetric generating sets should be considered.
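For concreteness, the power-of-2 set of Theorem 6 can be written down directly; the prime below is an assumed example.

```python
# Sketch: the power-of-2 set A_{2,p} = {0, +-1, +-2, ..., +-2^{l2(p)-1}} mod p,
# for an assumed example prime p.
import math

p = 101
l2 = math.ceil(math.log2(p))              # l2(p) = ceil(log_2 p)
A2 = {0}
for j in range(l2):
    A2.add(2 ** j % p)                    # +2^j
    A2.add(-(2 ** j) % p)                 # -2^j
```

Each symmetric pair $\{2^j, -2^j\}$ plays the role of one coordinate direction of the hypercube $(\mathbb{Z}/2\mathbb{Z})^{\ell_2(p)}$, which is the sense in which this walk approximately embeds the hypercube walk in the cycle.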

Notation and conventions
Given groups $G, H$, $H < G$ indicates that $H$ is a subgroup of $G$ and $[G : H]$ denotes the index. $S_k$ is the symmetric group on $k$ letters and we write $(\mathbb{Z}/2\mathbb{Z})^k \rtimes S_k = O_k(\mathbb{Z})$ for the $k \times k$ orthogonal group over $\mathbb{Z}$. For a ring $R = \mathbb{Z}, \mathbb{Z}/p\mathbb{Z}$, $\mathrm{GL}_n(R)$ and $\mathrm{SL}_n(R)$ are the usual linear groups with entries in $R$. We denote $e(x) = e^{2\pi i x}$ the standard additive character on $\mathbb{R}/\mathbb{Z}$.
Given a measure space $(X, \mathcal{B})$, $M(X)$ indicates the Borel probability measures on $X$. When $X$ is a finite set, $U_X$ denotes the uniform probability measure on $X$, and when $X$ is a compact abelian group, $U_X$ denotes the probability Haar measure. In either case expectation and variance with respect to $U_X$ are indicated $\mathbb{E}_X$ and $\mathrm{Var}_X$.

$\|\cdot\|_{\mathrm{TV}(X)}$ indicates the total variation norm on $M(X)$.

Unless otherwise stated, $\|\cdot\|$ indicates the $\ell^2$-norm on $\mathbb{R}^k$, $k \ge 1$; $\|\cdot\|_p$ denotes the $\ell^p$ norm, $p \ge 1$; and $\|\cdot\|_{(\mathbb{R}/\mathbb{Z})^k}$ denotes the $\ell^2$ distance to the nearest integer lattice point.
The ambient dimension will be clear from the context. If $p$ is not stated, $\ell^2$ is assumed. Given a further parameter $0 < \tau < 1$, $S(x, R, \tau)$ indicates a spherical shell about $x$. For $k \ge 1$, $R_k = \frac{\Gamma\left(\frac{k}{2}+1\right)^{\frac{1}{k}}}{\sqrt{\pi}}$ is the radius of an $\ell^2$ ball of unit volume in $\mathbb{R}^k$. One may check that $R_k > \sqrt{\frac{k}{2\pi e}}$ for all $k \ge 1$.
For $k \ge 1$, given $x \in \mathbb{R}^k$ and $\sigma \in \mathbb{R}_{>0}$, $\eta_k(\sigma, x)$ denotes the density at $x$ of a symmetric centered Gaussian distribution scaled by $\sigma$, $\eta_k(\sigma, x) = \frac{1}{(2\pi\sigma^2)^{\frac{k}{2}}} e^{-\frac{\|x\|_2^2}{2\sigma^2}}$. By default, quantities considered depend upon a large prime parameter $p$ varying over a set of primes $P_0$. We use the Vinogradov notation $A \ll B$ with the same meaning as $A(p) = O(B(p))$. $A \asymp B$ means $A \ll B$ and $B \ll A$. For positive parameters $A, B$, $A \sim B$ means $\lim_{p\to\infty} \frac{A(p)}{B(p)} = 1$, and $A \lesssim B$, resp. $A \gtrsim B$, means $\limsup \frac{A(p)}{B(p)} \le 1$, resp. $\liminf \frac{A(p)}{B(p)} \ge 1$. We also use the non-standard notation, already introduced in the introduction, $A \lesssim_x B$, with the meaning that there is a non-increasing function $f$:

Background
This section collects together several statements regarding classical probability theory and lattice theory on R k , k ≥ 1.
2.1. Classical probability. See [6] for background regarding random walk on a group and [20] for a thorough treatment of Markov chains. We have provided proofs of the statements which we use for the reader's convenience.
We have already introduced the total variation distance between two probability measures $\mu, \nu$ on a measure space $(X, \mathcal{B})$, $\|\mu - \nu\|_{\mathrm{TV}(X)} = \sup_{B \in \mathcal{B}} |\mu(B) - \nu(B)|$. In the case when $\mu$ has a density with respect to $\nu$, an equivalent characterization is $\|\mu - \nu\|_{\mathrm{TV}(X)} = \frac{1}{2}\left\|\frac{d\mu}{d\nu} - 1\right\|_{L^1(d\nu)}$. When $\mu$ is the distribution of a Markov chain with stationary measure $\nu$, define the $L^2(d\nu)$ distance to stationarity by $\frac{1}{2}\left\|\frac{d\mu}{d\nu} - 1\right\|_{L^2(d\nu)}$, with the convention that the norm is infinite if $\frac{d\mu}{d\nu}$ is not in $L^2(d\nu)$. The factor of $\frac{1}{2}$ is for consistency with the interpretation of total variation distance as half the $L^1(d\nu)$ norm. For $\epsilon > 0$ denote $t_{\mathrm{mix}}^2(\epsilon)$ the $\epsilon$-mixing time of the $L^2(d\nu)$ norm.

Lemma 7. Convolution with a probability measure is a contraction in the total variation norm. Also, given a symmetric probability measure $\mu$ on a finite or compact abelian group $G$, for any $0 < \epsilon < 1$ the total variation mixing time of random walk driven by $\mu$ satisfies $t_{\mathrm{rel}} \log \frac{1}{2\epsilon} \le t_{\mathrm{mix}}^1(\epsilon) \le t_{\mathrm{mix}}^2(\epsilon)$, and $\frac{2\pi^2}{27}\epsilon^3\, t_{\mathrm{rel}} \lesssim t_{\mathrm{mix}}^1(1 - \epsilon)$ as $\epsilon \downarrow 0$.

Proof. The contraction property follows from the triangle inequality.
To prove $t_{\mathrm{mix}}^1(\epsilon) \le t_{\mathrm{mix}}^2(\epsilon)$, use the $L^1$ characterization of the total variation metric and Cauchy-Schwarz, $\left\|\frac{d\mu}{d\nu} - 1\right\|_{L^1(d\nu)} \le \left\|\frac{d\mu}{d\nu} - 1\right\|_{L^2(d\nu)}$. To prove the lower bounds regarding $t_{\mathrm{rel}}$, observe that the eigenvalues of the transition kernel for the random walk are given by $\hat\mu(\chi) = \int_G \chi\, d\mu$, $\chi \in \hat{G}$, where $\hat{G}$ denotes the set of characters of $G$. Let $\chi_1$ generate the spectral gap. Since $\|\chi_1\|_\infty \le 1$, we have, for any $n \ge 1$, $(1 - \mathrm{gap})^n = |\hat\mu(\chi_1)|^n = \left|\int_G \chi_1\, d(\mu^{*n} - U_G)\right| \le 2\|\mu^{*n} - U_G\|_{\mathrm{TV}}$, so that the first mixing time bound follows by taking logarithms.
Define the standard symmetric centered normal distribution on $\mathbb{R}^k$ scaled by $\sigma \in \mathbb{R}_{>0}$ to be the measure with density $\eta_k(\sigma, x)$. For $t \in \mathbb{R}_{>0}$, $\eta_k(\sqrt{t}\sigma, x)$ is its $t$-fold convolution power. We use several results regarding concentration of the Gaussian measure.
Lemma 8. Let $k \ge 1$ and $\sigma > 0$. There are positive constants $C, \{C_p\}_{2 \le p < \infty}$ such that, for any $t > C$, and, for all $t > 0$, for all $2 \le p < \infty$,

Proof. All quantities scale with $\sigma$, so we may assume $\sigma = 1$. Let $\gamma_k$ denote the measure on $\mathbb{R}^k$ with density $\gamma_k(x) = \eta_k(1, x)$. Since $\|\cdot\|_p$ is 1-Lipschitz on $(\mathbb{R}^k, \|\cdot\|_2)$ for $p \ge 2$, Talagrand's inequality ([19], p. 21) gives, for any $t > 0$, that the $\|\cdot\|_p$ norm deviates from its median $M_p$ by more than $t$ with probability at most $2e^{-\frac{t^2}{2}}$. The first statement follows, since the mean, root mean square, and median of $\|\cdot\|_2$ differ by constants, as is evident from the concentration around the median. The second statement follows since $M_p \ll_p k^{\frac{1}{p}}$.
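The concentration of the Gaussian norm asserted in Lemma 8 is easy to see numerically; the dimension, sample size, and window below are assumed parameters for a Monte Carlo sketch.

```python
# Monte Carlo illustration (assumed parameters): for x drawn from a standard
# Gaussian on R^k, ||x||_2 concentrates in an O(1)-width window around
# sqrt(k), independently of the dimension.
import math, random

random.seed(0)
k, trials = 400, 2000
norms = []
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(k)]
    norms.append(math.sqrt(sum(v * v for v in x)))

mean_norm = sum(norms) / trials
inside = sum(1 for r in norms if abs(r - math.sqrt(k)) < 3.0) / trials
```

Here essentially all of the mass of the norm lies within a constant-width window about $\sqrt{k} = 20$, the dimension-free concentration used repeatedly below.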

2.2. Lattices. Siegel's Lectures on the Geometry of Numbers [24] are a recommended reference. A lattice $\Lambda < \mathbb{R}^k$ is a discrete finite co-volume subgroup of $\mathbb{R}^k$. Write $\mathrm{vol}(\Lambda) = \int_{\mathbb{R}^k/\Lambda} dx$ for its co-volume. Fixing the usual inner product $\langle \cdot, \cdot \rangle$ on $\mathbb{R}^k$, the dual lattice of a lattice $\Lambda$ is $\Lambda^\vee = \{y \in \mathbb{R}^k : \forall \lambda \in \Lambda,\ \langle y, \lambda \rangle \in \mathbb{Z}\}$. This satisfies $\mathrm{vol}(\Lambda) \cdot \mathrm{vol}(\Lambda^\vee) = 1$. For instance, the dual lattice to $\Lambda = 2\mathbb{Z}$ is $\frac{1}{2}\mathbb{Z}$. More generally, if $\Lambda = Q\mathbb{Z}^k$ for some $Q \in \mathrm{GL}_k(\mathbb{R})$, then $\Lambda^\vee = (Q^{-1})^t \mathbb{Z}^k$. We reserve $\lambda^*$ for the shortest non-zero vector of $\Lambda^\vee$.
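The duality $\Lambda = Q\mathbb{Z}^k \Rightarrow \Lambda^\vee = (Q^{-1})^t\mathbb{Z}^k$ and the volume identity can be checked in a minimal two-dimensional sketch; the basis matrix below is a hypothetical example.

```python
# Minimal 2-D sketch (hypothetical basis Q, columns = lattice basis):
# if Lambda = Q Z^2 then Lambda_dual = (Q^{-1})^T Z^2, and
# vol(Lambda) * vol(Lambda_dual) = 1.
Q = [[2.0, 1.0],
     [0.0, 3.0]]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv_transpose2(M):
    d = det2(M)
    # (M^{-1})^T for a 2x2 matrix M
    return [[M[1][1] / d, -M[1][0] / d],
            [-M[0][1] / d, M[0][0] / d]]

Qd = inv_transpose2(Q)
vol = abs(det2(Q))                    # co-volume of Lambda
vol_dual = abs(det2(Qd))              # co-volume of Lambda_dual
# dual pairing: <i-th basis vector, j-th dual basis vector> = delta_ij
pair = [[Q[0][i] * Qd[0][j] + Q[1][i] * Qd[1][j] for j in range(2)]
        for i in range(2)]
```

The pairing matrix being the identity verifies that the columns of $(Q^{-1})^t$ are exactly the dual basis.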
Given a lattice $\Lambda < \mathbb{R}^k$, its norm-minimal fundamental domain (Voronoi cell) is $F(\Lambda) = \{x \in \mathbb{R}^k : \forall \lambda \in \Lambda,\ \|x\|_2 \le \|x - \lambda\|_2\}$. One may choose a set $F_0(\Lambda) \subset F(\Lambda)$ such that every $x \in \mathbb{R}^k/\Lambda$ has a unique representative in $F_0(\Lambda)$.
Minkowski's geometry of numbers gives an upper bound for the shortest non-zero vector in a lattice.
Theorem 9 (Minkowski's Theorem). Let $\Lambda \subset \mathbb{R}^k$ be a lattice and let $C$ be a convex symmetric body, i.e. $x \in C \Leftrightarrow -x \in C$. If $\mathrm{vol}(C) > 2^k \mathrm{vol}(\Lambda)$ then $C$ contains a non-zero vector of $\Lambda$. In particular, the shortest non-zero vector of $\Lambda$ has length at most $2\left(\frac{\mathrm{vol}(\Lambda)}{\mathrm{vol}(B_{2,k}(0,1))}\right)^{\frac{1}{k}} \sim \sqrt{\frac{2k}{\pi e}}\, \mathrm{vol}(\Lambda)^{\frac{1}{k}}$, with the asymptotic holding as $k \to \infty$.
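The shortest-vector consequence of Theorem 9 is easy to evaluate numerically; the sketch below implements the standard bound $2(\mathrm{vol}(\Lambda)/\mathrm{vol}(B_{2,k}(0,1)))^{1/k}$, with assumed test parameters.

```python
# Numeric form of Minkowski's bound: the shortest non-zero vector of a
# covolume-V lattice in R^k has length at most 2 * (V / vol(B_k))^{1/k},
# where vol(B_k) = pi^{k/2} / Gamma(k/2 + 1) is the unit-ball volume.
import math

def minkowski_bound(k, covol):
    unit_ball_vol = math.pi ** (k / 2) / math.gamma(k / 2 + 1)
    return 2.0 * (covol / unit_ball_vol) ** (1.0 / k)
```

For $\mathbb{Z}^k$ (covolume 1, shortest vector of length 1) the bound is at least 1, and for large $k$ it grows like $\sqrt{2k/(\pi e)}$, matching the asymptotic in Theorem 9.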
For lattice Λ, the diameter of the norm-minimal fundamental domain and the shortest non-zero vector in the dual lattice are related as follows.
Lemma 10. Let $\Lambda$ be a lattice with norm-minimal fundamental domain $F$ and dual lattice $\Lambda^\vee$. Let $\lambda^*$ be the shortest non-zero vector in $\Lambda^\vee$. We have $\mathrm{diam}(F) \ge \frac{1}{\|\lambda^*\|_2}$. The diameter is at least as large as $2\|x\|_2$ for any $x \in F$.
The following is an easy estimate for the number of lattice points contained in a ball.
We also use the following estimate counting lattice points of a more general lattice.
Lemma 12. Let $\Lambda < \mathbb{R}^k$ be a lattice with shortest non-zero vector $\lambda^*$. For any $t \ge 1$,

Proof. This follows from [18]; see [3] for a nice exposition and related results. We sketch the argument. Write $B_{2,j}$ for an $\ell^2$ ball in $\mathbb{R}^j$. By rescaling we may assume $\|\lambda^*\|_2 = 1$. View $\mathbb{R}^k$ as a hyperplane through zero in $\mathbb{R}^{k+1}$, and consider the ball $\tilde{B} = B_{2,k+1}(0, t)$ in $\mathbb{R}^{k+1}$. Project $\Lambda \cap B_{2,k}(0, t)$ orthogonally onto the boundary of $\tilde{B}$. The points remain 1-spaced and thus satisfy an angular spacing of at least $\theta = 2\sin^{-1}\left(\frac{1}{2t}\right)$. Let, as in [18], $A(n, \theta)$ denote the size of the largest set $S \subset S^{n-1}$ which is separated by angle $\theta$ as above. Thus $|\Lambda \cap B_{2,k}(0, t)| \le A(k+1, \theta)$. The claimed estimate for $A(k + 1, \theta)$ is the main result of [18].
Quotienting commutes with convolution and contracts the total variation norm. For $x \in \mathbb{R}^k$, $t > 0$ and a lattice $\Lambda < \mathbb{R}^k$, define the theta function $\Theta(x, t; \Lambda) = \sum_{\lambda \in \Lambda} \eta_k(\sqrt{t}, x + \lambda)$, the density of the time $t$ Gaussian diffusion on $\mathbb{R}^k/\Lambda$. This has a representation in frequency space as $\Theta(x, t; \Lambda) = \frac{1}{\mathrm{vol}(\Lambda)} \sum_{\lambda^\vee \in \Lambda^\vee} e^{-2\pi^2 t \|\lambda^\vee\|_2^2} e(\langle \lambda^\vee, x \rangle)$. To check the expansion, Fourier expand $\Theta$ in the orthonormal basis $\left\{\mathrm{vol}(\Lambda)^{-\frac{1}{2}} e(\langle \lambda^\vee, x \rangle)\right\}_{\lambda^\vee \in \Lambda^\vee}$ for $L^2(\mathbb{R}^k/\Lambda)$ (this is the usual proof of the Poisson summation formula). In the case of a cubic lattice, where for some $\alpha \in \mathbb{R}_{>0}$, $\Lambda = \alpha\mathbb{Z}^k$, the theta function is particularly pleasant.
Lemma 13. The one dimensional theta function $\Theta(x, t; \alpha\mathbb{Z})$ satisfies the following estimates.

Proof. The factorization is immediate from the definition of $\Theta$. The first estimate for $\Theta(x, t; \alpha\mathbb{Z})$ is the result of pulling out the largest term and bounding the remaining terms by a geometric progression. For the second, apply the Poisson summation formula and bound the $n \neq 0$ terms by a geometric progression.
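The Poisson summation step can be checked numerically; the sketch below evaluates the one dimensional theta function both as a lattice sum of Gaussian densities and through its frequency-space expansion, with assumed test values and the Gaussian normalization $\eta_1(\sqrt{t}, x)$ used above.

```python
# Sketch: Theta(x, t; alpha Z) evaluated directly as a lattice sum of
# Gaussian densities, and via Poisson summation; the two must agree, and
# for large t the function tends to 1/alpha (the uniform density).
import math

def theta_direct(x, t, alpha, N=200):
    # sum_n eta_1(sqrt(t), x + alpha n)
    return sum(math.exp(-(x + alpha * n) ** 2 / (2 * t))
               for n in range(-N, N + 1)) / math.sqrt(2 * math.pi * t)

def theta_poisson(x, t, alpha, N=200):
    # (1/alpha) sum_m exp(-2 pi^2 t m^2 / alpha^2) cos(2 pi m x / alpha)
    return sum(math.exp(-2 * math.pi ** 2 * t * m * m / alpha ** 2) *
               math.cos(2 * math.pi * m * x / alpha)
               for m in range(-N, N + 1)) / alpha
```

The frequency-side sum converges much faster for large $t$, which is exactly why the spectral expansion is the convenient form near the mixing time.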

Identification between generating sets and lattices.
Our proofs of Theorems 1-4 approximate random walk on Z/pZ with symmetric generating set A, |A| = 2k + 1 with a Gaussian diffusion on R k /Λ where Λ is a co-volume p lattice. The reduction is as follows.
Let $O_k(\mathbb{Z}) \cong (\mathbb{Z}/2\mathbb{Z})^k \rtimes S_k$ be the orthogonal group over $\mathbb{Z}$ consisting of signed $k \times k$ permutation matrices, which acts naturally on $\mathbb{R}^k$. Let $L(p, k)$, resp. $\mathscr{L}(p, k)$, be the set of index-$p$ lattices of $\mathbb{Z}^k$, resp. those lattices up to $O_k(\mathbb{Z})$-equivalence. The action is matrix multiplication on the left applied to lattice vectors. Define an action of $O_k(\mathbb{Z})$ on generating sets by interpreting the factors of $(\mathbb{Z}/2\mathbb{Z})^k$ as flipping signs, and the factor of $S_k$ as rearranging the order of the coordinates in the vector. Evidently the action is free, so that uniform measure on $A(p, k)$ descends to uniform measure on $\mathscr{A}(p, k)$.
$\mathbb{F}_p^\times$ acts freely on $A(p, k)$ by dilating all coordinates simultaneously. $\mathbb{F}_p^\times\backslash A(p, k)$ and $L_0(p, k)$ are in bijection via the map $A = (a_1, \ldots, a_k) \mapsto \Lambda(A) = \{n \in \mathbb{Z}^k : \sum_i n_i a_i \equiv 0 \bmod p\}$. The map in the reverse direction recovers the class of $a$ from the dual lattice, via $\Lambda^\vee = \mathbb{Z}^k + \mathbb{Z}\frac{a}{p}$. It follows that uniform measure on $A(p, k)$ pushes forward to uniform measure on $L_0(p, k)$.
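The index-$p$ claim for $\Lambda(A)$ can be checked directly: the lattice meets the fundamental box $[0, p)^k$ in exactly $p^{k-1}$ points. The generators below are an assumed small example.

```python
# Sketch (assumed small example): for generators a = (a_1, ..., a_k), the
# lattice Lambda = { n in Z^k : n . a = 0 (mod p) } has index p in Z^k,
# so it meets the box [0, p)^k in exactly p^{k-1} points.
import itertools

p, a = 13, (1, 3)
count = sum(1 for n in itertools.product(range(p), repeat=len(a))
            if sum(ni * ai for ni, ai in zip(n, a)) % p == 0)
```

The count $p^{k-1}$ out of $p^k$ box points is exactly the statement that $\Lambda(A)$ has index $p$.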
$O_k(\mathbb{Z})$ acts on $L_0(p, k)$, and we obtain a map $\mathbb{F}_p^\times\backslash\mathscr{A}(p, k) \xrightarrow{\phi} \mathscr{L}_0(p, k)$ which we write as $\Lambda(A)$. Note that the joint action of $\mathbb{F}_p^\times \times O_k(\mathbb{Z})$ on $A(p, k)$ need not be free, but this will not concern us. We write $U_{\mathscr{L}}$, $U_{\mathscr{L}_0}$ for uniform measure on $\mathscr{L}$ and $\mathscr{L}_0$.
The above observations imply that we may sample the laws of $\mu_A^{*n}$ with $A$ chosen according to $U_{\mathscr{A}(p,k)}$ by instead sampling the laws of $(\nu_k^{*n})_\Lambda$ with $\Lambda$ drawn according to $U_{\mathscr{L}_0(p,k)}$.
Combining this discussion with Minkowski's theorem has the following consequence.
Lemma 14. Let $p$ be a large prime, let $1 \le k < \frac{2\log p}{\log\log p}$ and let $A \in \mathscr{A}(p, k)$. Let $\Lambda < \mathbb{Z}^k$ be any lattice in the class of $\Lambda(A) \in \mathscr{L}$, and let $\ell(A)$ denote the length of the shortest non-zero vector of $\Lambda^\vee$. The relaxation time of random walk driven by $\mu_A$ on $\mathbb{Z}/p\mathbb{Z}$ satisfies $t_{\mathrm{rel}} \sim \frac{2k+1}{4\pi^2 \ell(A)^2}$.

Proof. The characters of $\mathbb{Z}^k/\Lambda$ are given by the dual group, $\Lambda^\vee/\mathbb{Z}^k$. Let $\lambda^* = (\lambda_1, \ldots, \lambda_k)$ be a vector of minimal length in $\Lambda^\vee \setminus \{0\}$. The claim follows on noting that the spectral gap is given by $1 - \frac{1}{2k+1}\left(1 + 2\sum_{j=1}^k \cos(2\pi\lambda_j)\right) = \frac{4\pi^2\|\lambda^*\|_2^2}{2k+1} + O\left(\frac{\|\lambda^*\|_2^4}{k}\right)$. The error is of lower order since $\lambda^*$ is small.

Lemma 5 from the introduction has the following consequence.
We have by two applications of the triangle inequality. The bound now follows from Lemma 5.
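The identification of characters of $\mathbb{Z}^k/\Lambda$ with dual-lattice classes used in the proof of Lemma 14 can be sanity-checked numerically: for the assumed example below, the eigenvalue at frequency $x$ on $\mathbb{Z}/p\mathbb{Z}$ coincides with the value computed from the class of $x a/p$ in $\Lambda^\vee/\mathbb{Z}^k$.

```python
# Sketch (assumed example a = (1, 3), p = 13, k = 2): the nontrivial
# eigenvalues of the walk, computed on Z/pZ and from dual-lattice classes
# lambda = x * a / p mod Z^k, coincide.
import math

p, a = 13, (1, 3)
k = len(a)

def eig_cycle(x):
    # (1 + 2 sum_j cos(2 pi a_j x / p)) / (2k + 1)
    return (1 + 2 * sum(math.cos(2 * math.pi * aj * x / p) for aj in a)) / (2 * k + 1)

def eig_dual(x):
    lam = [(aj * x / p) % 1.0 for aj in a]   # class of x a / p in [0, 1)^k
    return (1 + 2 * sum(math.cos(2 * math.pi * lj) for lj in lam)) / (2 * k + 1)

max_diff = max(abs(eig_cycle(x) - eig_dual(x)) for x in range(1, p))
```

Agreement here reflects that reducing the frequency modulo $\mathbb{Z}^k$ leaves each cosine unchanged, which is the content of the identification.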
Combining the pieces above we prove the following lemma which is the main reduction in this section.
Proof. By Minkowski's geometry of numbers, the shortest non-zero vector in the dual lattice $\Lambda(A)^\vee$ has length $\ll \sqrt{k}\, p^{-\frac{1}{k}}$, so that Lemmas 7 and 14 give for the discrete walk $t_{\mathrm{mix}}^1(\epsilon) \gg t_{\mathrm{rel}} \gg p^{\frac{2}{k}}$.

Mixing time estimates
Let $p$ be prime, $A \in \mathscr{A}(p)$ with $|A| = 2k + 1$ and $1 \le k \le \frac{\log p}{\log\log p}$. Let $\Lambda = \Lambda(A)$ be any lattice associated to $A$ in $\mathbb{Z}^k$, as above.
Proof of Theorem 1. Theorem 1 is contained in the set of estimates which follow. Combining Lemma 14 and Minkowski's theorem gives the lower bound, while the comparison with $t_{\mathrm{rel}}$ is given in Lemma 7. To replace $(1 - \log 2)$ with the larger constant $\tau_0$, consider the theta function $\Theta\left(x, \frac{2t}{2k+1}; \Lambda\right)$, which has asymptotically the same relaxation time as $\mu_A$ by Lemma 14. Let $\lambda^*$ be a shortest non-zero vector in the dual lattice, and consider the function found by projecting $\Theta$ in frequency space onto the line determined by $\lambda^*$. Equivalently, identify $\mathbb{R}^{k-1}$ with $\mathbb{R}^k \cap (\lambda^*)^\perp$ and let $\eta_{k-1}(T, \cdot)$ denote a Gaussian of covariance matrix $T^2 I$ on this space. Write $\lambda \in \Lambda^\vee$ as $\lambda = \lambda_1 + \lambda_2$, where $\lambda_1$ is the projection to the span of $\lambda^*$ and $\lambda_2$ is orthogonal to $\lambda^*$. One has, for $T > 0$, and thus the convergence is uniform in $x$, as the error at $T$ is dominated by the case in which $x$ is orthogonal to $\lambda^*$, so that all the terms are positive. This justifies exchanging the limit and integral in the following calculation. Let $F$ be a fundamental domain for $\mathbb{R}^k/\Lambda$.

Applying the triangle inequality, since the latter distance is monotonically decreasing and smooth, and since, for $n \ge t_{\mathrm{mix}}^1$, by Lemma 16, it follows that $t_{\mathrm{mix}}^1 \gtrsim \tau_0\, t_{\mathrm{rel}}$. To give the spectral upper bound for $t_{\mathrm{mix}}^1$, again consider instead the distance from uniformity of $\Theta\left(\cdot, \frac{2n}{2k+1}; \Lambda\right)$ on $\mathbb{R}^k/\Lambda$. For $t > 0$, writing the sum as a Stieltjes integral, then integrating by parts, the right hand side may be bounded; see Lemma 12. The maximum of $\frac{F(s)}{s^2}$ in $s \ge 1$ occurs at $s = 1.260816271(1)$ with maximum $< 0.324908241$, and $\frac{F(s)}{s^2} \to 0$ as $s \to \infty$. Thus, choosing $2\tau = (0.325 + \tilde{\epsilon}(k))k$ for an appropriate function $\tilde{\epsilon}(k)$ tending to 0 as $k \to \infty$, the $L^2$ distance is negligible, so that $\tau t_{\mathrm{rel}}$ is an upper bound for $t_{\mathrm{mix}}^2 \ge t_{\mathrm{mix}}^1$.
3.1. Geometric mixing time bound, proof of Theorem 2. Let $p$, $A$ and $\Lambda$ be as above, and let $F$ be the Voronoi cell for $\mathbb{R}^k/\Lambda$. Note that $\mathbb{Z}^k \cap F$ contains a system of representatives for $\mathbb{Z}^k/\Lambda$, and that the Cayley graph $\mathscr{C}(A, p)$ is isomorphic to the Cayley graph of $\mathbb{Z}^k/\Lambda$ with generating set $\{0, \pm e_i : 1 \le i \le k\}$.

Proof of Theorem 2. Write $D = \mathrm{diam}_{\mathrm{geom}}(\mathscr{C}(A, p))$ and assume, as we may, that $t > kD^2$.
In view of Lemma 10, which proves $D \ge \frac{1}{\ell(A)}$, we have $t \gg t_{\mathrm{rel}}$, and thus the comparison of Lemma 16 applies, so we will estimate the right hand side.
Since, for any $x, t$, $\mathbb{E}_{y \in F}\, \Theta\left(x + y, \frac{2t}{2k+1}; \Lambda\right) = \frac{1}{p}$, we may estimate using the triangle inequality. Fold together the sum over $\lambda$ and the integral over $x$, then integrate away all directions in $x$ orthogonal to $y$ to obtain the last estimate.

Transition window bound, proof of Theorem 3
We prove the following somewhat more general theorem.
Theorem 17. Let $p$ be a large prime and let $k \le \frac{\log p}{\log\log p}$. Let $A \subset \mathbb{Z}/p\mathbb{Z}$ be a lazy symmetric generating set of size $|A| = 2k + 1$. For any $1 > \epsilon_1 > \epsilon_2 > 0$, for all $n < \exp\left(\frac{2\epsilon_2}{k}\right) t_{\mathrm{mix}}^1(\epsilon_1)$,

Proof. Let $\Lambda < \mathbb{Z}^k$ be any lattice representing the class of $\Lambda(A) \in \mathscr{L}$. By Lemma 16 we may replace $\mu_A^{*n}$ with the corresponding diffusion. Write $n = \sigma t_{\mathrm{mix}}^1(\epsilon_1)$. Differentiating under the sum in the theta function gives a derivative bound, and thus a bound for $\sigma > \sigma_0$. Differentiate under the integral, then apply (4.1) and finally drop the restriction to $P(\sigma_0)$ to obtain the estimate (4.3). Note that $k = o\left(t_{\mathrm{mix}}^1(\epsilon_1)\right)$. Applying (4.3) with $\sigma_0 = 1 - \frac{1}{t_{\mathrm{mix}}^1(\epsilon_1)}$ and $\sigma = 1$, which corresponds to the random walk at the mixing time and the step before, we deduce the first bound. Applying (4.3) again, but now with $\sigma_0 = 1$, $\sigma = \exp\left(\frac{2\epsilon_2}{k}\right)$, we obtain the claimed estimate.

Random random walk, proof of Theorem 4
We record several facts regarding the uniform measure $U_L$ on the set $L(p, k)$ of index $p$ lattices in $\mathbb{Z}^k$.

Lemma 18. When $\Lambda$ is chosen uniformly from $L(p, k)$, the dual lattice $\Lambda^\vee$ has the distribution of $\mathbb{Z}^k + \mathbb{Z}\frac{v}{p}$, where $v$ is a uniform random vector in $(\mathbb{Z}/p\mathbb{Z})^k \setminus \{0\}$. When $\Lambda$ is chosen uniformly from $L_0(p, k)$, the dual lattice $\Lambda^\vee$ has the distribution of $\mathbb{Z}^k + \mathbb{Z}\frac{v}{p}$, where $v$ is chosen uniformly subject to further conditions.

Proof. In the case of $L(p, k)$, the structure follows from $[\mathbb{Z}^k : \Lambda] = p$ and $\Lambda^\vee < \frac{1}{p}\mathbb{Z}^k$, while the uniformity follows from the fact that $\mathrm{SL}_k(\mathbb{Z}/p\mathbb{Z})$ acts transitively on the space of dual lattices. This holds since any non-zero vector may be completed to a basis for $(\mathbb{Z}/p\mathbb{Z})^k$.
The further conditions imposed in the case of $L_0(p, k)$ are those necessary to ensure that $\Lambda$ does not contain a vector $\lambda$ with $\|\lambda\|_2^2 \in \{1, 2\}$.

Lemma 19. Let $p$ be prime, let $k \ge 2$ and let $v \neq w \in \mathbb{Z}^k$. We have, in particular,

Proof. These follow immediately from the distribution of the dual group.
5.1. Summary of argument. As the calculations in the remainder of this section are somewhat involved, we pause to sketch the main ideas. Theorem 4 has three claims, the first two of which consider the worst case mixing time behavior, with the third considering typical behavior. When considering the walk as a diffusion on $\mathbb{R}^k/\Lambda$, where $\Lambda$ is a lattice, the spectrum of the transition kernel is determined by the dual lattice $\Lambda^\vee$. In general it is difficult to work on the spectral side due to the high concentration of eigenvalues near the spectral gap, but in the worst case regime we are able to show that, whenever the slow mixing behavior persists, the dual lattice is essentially one dimensional. When this occurs the mixing and relaxation times are proportional and we obtain a slow transition.
In typical behavior the walk has a sharp transition to uniformity. The analysis in this regime consists of separate arguments estimating the distance to uniformity at times $(1 \pm \epsilon)t_{\mathrm{mix}}^1$. When considering the walk at time $(1 - \epsilon)t_{\mathrm{mix}}^1$ we study the diffusion $\Theta\left(x, \frac{2t}{2k+1}; \Lambda\right)$ on the norm-minimal fundamental domain $F(\Lambda)$ for $\mathbb{R}^k/\Lambda$. For a particular lattice $\Lambda$, $F(\Lambda)$ is a highly complex convex body determined by a number of hyperplanes, but in a statistical sense, for the purpose of the lower bound, $F(\Lambda)$ behaves very much like the volume $p$ ball of $\mathbb{R}^k$ centered at the origin. A Gaussian in $\mathbb{R}^k$ centered at the origin is concentrated on a thin spherical shell (see Lemma 8), and the mixing time is essentially the time needed for this spherical shell to expand to the boundary of the volume $p$ ball. At time $(1 - \epsilon)t_{\mathrm{mix}}^1$ we are then able to show that the diffusion is typically concentrated on a small measure part of $F(\Lambda)$.
For the upper bound at time $(1 + \epsilon)t_{\mathrm{mix}}^1$, we note that $p\mathbb{Z}^k < \Lambda$, and we show that the distribution of values of $\Theta\left(x, \frac{2t}{2k+1}; \Lambda\right)$ is concentrated near 1 when $x$ is chosen uniformly from $\mathbb{R}^k/p\mathbb{Z}^k$ and $\Lambda$ is chosen at random from $L(p, k)$. This is the most delicate part of the argument. For instance, it is not sufficient to consider the expectation of $\left|\Theta\left(x, \frac{2t}{2k+1}; \Lambda\right) - 1\right|^2$, as this gives an upper bound which is too weak, so we split $\Theta$ into an $L^2$-concentrated piece $\Theta_M$ plus a small $L^1$ error $\Theta_E$.

5.2. Slow mixing behavior. We prove Theorem 4 in two parts. In this section we prove parts (1) and (2), which concern rare slow mixing walks. In Section 5.3 we prove part (3) regarding the typical behavior. The main estimate regarding slow mixing behavior is the following theorem.
Theorem 20. Let $p$ be a large prime, and let $k = k(p)$ tend to $\infty$ with $p$ in such a way that $k \le \frac{\log p}{\log\log p}$. For any $\delta > 0$, for all $p$ sufficiently large, the following hold uniformly. (1) (2) Let, as in Theorem 4, $\tau_0$ be the ratio between total variation mixing time and relaxation time for Gaussian diffusion on $\mathbb{R}/\mathbb{Z}$. For any $C \ge 1$,

Deduction of Theorem 4, parts (1) and (2). Before giving the proof of the theorem we prove an auxiliary claim. Let $\delta > 0$ be an arbitrarily small fixed quantity. We claim that, with probability 1, only finitely many of the events occur. Thus, combining (5.1) and (5.2), the first term is handled with (5.2) and covers the range $t_{\mathrm{rel}} \ll p^{\frac{4}{k}}(\log p)^{\frac{2}{k}}$, the worst case occurring when $t_{\mathrm{rel}} \ll \frac{p^{\frac{4}{k}}}{k}$ is minimized. Note $k \le \frac{\log p}{\log\log p}$, from which it follows that the relevant series converges, so that the claim holds by the Borel-Cantelli Lemma.
We now prove the Theorem.
(1) Replace $\rho(p)$ with $\rho(p) := \max\left(\rho(p), p^{\frac{1}{k}}\right)$, without altering the divergence of $\sum_p \rho(p)^{-k}$. Estimating with (5.1), by Borel-Cantelli, with probability 1 there is an infinite sequence $P_0 \subset P$ such that the event holds for $p \to \infty$ through $P_0$. The above remarks guarantee that, for this sequence, $t_{\mathrm{mix}}^1(p) \sim \tau_0 t_{\mathrm{rel}}(p)$. (2) Let $\delta > 0$ be fixed. Estimate with (5.1) to obtain that, with probability 1, the bound holds for all but finitely many $p$. Since $\rho(p) \ge p^{\frac{1}{k}}$ eventually, the remarks above imply the claim with probability 1 for all but finitely many $p$.
In proving Theorem 20 we introduce two commonly used pieces of terminology from the theory of lattices. Let $p$ be a prime and let $k \ge 1$. Say that $\lambda \in \mathbb{Z}^k$ is primitive if the greatest common divisor of its coordinates is 1, and say that $\lambda \in \mathbb{Z}^k$ is reduced (at $p$) if $\lambda \in \left(-\frac p2, \frac p2\right)^k$. Any class $\lambda \in (\mathbb{Z}/p\mathbb{Z})^k$ has a unique reduced representative $r(\lambda) \in \mathbb{Z}^k$.
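To make the terminology concrete, here is a small Python sketch computing the reduced representative and testing primitivity; the function names `reduced` and `is_primitive` are ours, not the paper's.

```python
from math import gcd
from functools import reduce

def reduced(lam, p):
    # Unique representative r(lam) of lam mod p with coordinates in (-p/2, p/2).
    out = []
    for x in lam:
        r = x % p
        if r > p // 2:
            r -= p
        out.append(r)
    return tuple(out)

def is_primitive(lam):
    # A vector in Z^k is primitive when the gcd of its coordinates is 1.
    return reduce(gcd, (abs(x) for x in lam)) == 1
```

For example, with $p = 7$ the class $(6, 9) \bmod 7$ has reduced representative $(-1, 2)$.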
Our proof of Theorem 20 depends upon the following two estimates, the first of which bounds a mean concerning pairs of short vectors in the dual space.
Proposition 21. Let $\delta > 0$ be a fixed constant. Let $p$ and $k(p)$ tend to $\infty$ in such a way that $k \le \frac{\log p}{\log\log p}$. Let $\delta p$

Remark. This proposition should be interpreted as expressing the approximate independence of the appearance of a pair of short primitive vectors in the dual space.
Proof. It is enough to estimate with respect to $U_{L(p,k)}$, since this introduces a relative error $1 + O\bigl(\frac{k^2}{p}\bigr)$, which is smaller than the error claimed. Let $S \subset (\mathbb{Z}/p\mathbb{Z})^k \times \mathbb{Z}/p\mathbb{Z}$ denote the set of pairs $(\lambda, a)$ such that $\lambda \in (\mathbb{Z}/p\mathbb{Z})^k$, $a \in \mathbb{Z}/p\mathbb{Z} \setminus \{0, \pm1\}$ and both reduced vectors $r(\lambda)$ and $r(a\lambda)$ are primitive. Also, for $\lambda \in (\mathbb{Z}/p\mathbb{Z})^k$, denote by $S(\lambda) \subset \mathbb{Z}/p\mathbb{Z}$ the fiber over $\lambda$.
Lemma 18 gives

To briefly explain this formula, the factor of $p^{2k}$ results from scaling both the variable and the standard deviation in the Gaussians by $p$. The condition $\lambda_1 \equiv a\lambda_2 \bmod p\mathbb{Z}^k$ for some $a$ follows from the characterization of $\Lambda^\vee$. The error term $o(1)$ covers summation over pairs $\lambda_1, \lambda_2$ for which one of $\lambda_1, \lambda_2$ is not reduced but both are primitive. The summation in this case is bounded by, for some $c > 0$ and all $B > 0$,

We make several modifications to the sum of (5.4) which make it easier to estimate. First we may exclude from $S$ any pairs $(\lambda, a)$ for which $\max \cdots \rho p$. To obtain this, note that the cardinality of the summation set is $O(p^{k+1})$, since we have replaced summation over $\lambda_2$ with summation over $a$. Thus it suffices to show that for excluded pairs, $\Phi_1(\lambda)\Phi_C(a\lambda) \ll_B p^{-2k-2-B}$; to see this, note that $\Phi$ is controlled by the contribution of the summand nearest 0. Let $S'$ be those choices of $(\lambda, a)$ which remain.

Denote by $F(Q)$ the collection of Farey fractions modulo $p$ (the definition is non-standard since the numerator and denominator are bounded by different quantities),

We claim that for any reduced $\lambda$, $S'(\lambda \bmod p) \subset \mathbb{Z}/p\mathbb{Z} \setminus F(Q)$. Indeed, suppose otherwise and let $a = bq^{-1} \in S'(\lambda \bmod p) \cap F(Q)$. Let $\eta \equiv a\lambda \bmod p$ with $\eta$ reduced. Then $b\lambda \equiv q\eta \bmod p$, but the norm condition implies that in fact $b\lambda = q\eta$, which contradicts primitivity. Replace $S'(\lambda)$ with $\mathbb{Z}/p\mathbb{Z} \setminus F(Q)$ and complete the sum over $\lambda$ to obtain

Applying Plancherel on $(\mathbb{Z}/p\mathbb{Z})^k$, we obtain

All but one term from the sum over $n$ is negligible, and we obtain, for any $\epsilon > 0$,

Due to the decay in the exponential, we may truncate summation over $a$ and $\xi$ to norm $\ll p^{\frac1k+\epsilon}$ with negligible error. From $\xi = 0$ pull out a term $\sim p^2$. To treat the remaining terms, suppose $k \ge 3$, and let $\xi = q\xi^0$ for $q \in \mathbb{Z}_{>0}$ and $\xi^0$ primitive. Write $a\xi \equiv \zeta \bmod p$, where $\|\zeta\|_{\mathbb{R}^k} \ll_\epsilon p^{\frac1k+\epsilon}$. It follows that for $1 < i \le k$, $\xi^0_1\zeta_i \equiv \xi^0_i\zeta_1 \bmod p$, and in fact $\xi^0_1\zeta_i = \xi^0_i\zeta_1$, so $\zeta = b\xi^0$ for some $b \in \mathbb{Z}$.
The sum is thus bounded by

We may estimate this sum crudely by truncating summation over $b, q$ at $|b|, |q|$

For all such $b, q$, summation over $\xi$ is bounded by (see Lemma 13)

Next we determine the distribution of the shortest vector in the dual lattice. Recall that $R_k$ is the radius of a ball of volume 1.

Proposition 22. Let $\delta > 0$ be a fixed constant, and let $p$, $k$ and $\rho$ be such that $k \le \frac{\log p}{\log\log p}$, and $\delta p$

Given $\Lambda \in L_0(p,k)$, denote by $\lambda^*$ the shortest non-zero vector of the dual lattice. One has

Proof. By Lemma 11,

By counting vectors $\lambda$ with $\lambda_1 = 0$ or $\lambda_1 = \pm\lambda_2$ one finds

Let $0 < \tau < 1$ and observe that for all $(1-\tau)$

Choosing $C = 1$ in Proposition 21 and inserting these bounds, one finds

by subtracting the contribution to (5.6) from lattices with pairs of primitive short vectors, and accounting for the factor of 2 from counting $\pm\lambda^*$.
Proof of Theorem 20. The estimate (5.1) regarding the distribution of $t_{\mathrm{rel}}$ follows from

This majorant is used in what follows. Let
denote the projection of $\Theta\bigl(x, \frac{2t}{2k+1}; \Lambda\bigr)$ in frequency space onto the line determined by $\lambda^*$. If

Apply Cauchy-Schwarz to obtain

The latter sum may be written as

Since $t \gg Ct_{\mathrm{rel}} \asymp \frac{Ck}{\|\lambda^*\|_2^2} \asymp C\rho^2 p^{\frac2k}$ (take $C \asymp 1$ in the case of the second estimate of (5.2)), there is $c \asymp C$ such that

Applying Proposition 21,

and thus, by specializing to $\lambda_1 = \lambda^*$ and applying Markov's inequality,

This verifies (5.2).

5.3.
Analysis of typical mixing behavior. We turn to the analysis of the mixing behavior for $A$ in the bulk of $A(p,k)$, proving the following theorem.
We can now conclude our proof of Theorem 4.

Deduction of Theorem 4, part (3).
For each $j = 1, 2, \ldots$, let $E(p, j)$ be the event that

For a fixed $p$, the events $E(p, j)$ are nested in $j$. For each $j \in \mathbb{Z}_{>0}$, let $N_j$ be minimal such that for all $p > N_j$, $U_{A(p,k)}[E(p,j)] \ge 1 - 2^{-j}$. This is finite by Theorem 23. Define $E^*(p) = \bigcap_{j : N_j < p} E(p, j)$ and let $p \in P_0$ if and only if $E^*(p)$ occurs. Since $U_{A(p,k)}[E^*(p)] \to 1$ as $p \to \infty$ and the events are independent, $P_0$ has density 1 with probability 1, as desired.
In the remainder of this section we shall frequently be concerned with counting lattice points within Euclidean balls $B_2(x, R) \subset \mathbb{R}^k$. It is useful to bear in mind that the radius $R_k$ of a ball of unit volume in $\mathbb{R}^k$ satisfies

Let $\epsilon = \epsilon(p)$ be as in the theorem and set $\delta = \frac12\bigl(1 - \sqrt{1-\epsilon}\bigr)$. Recall that, given a lattice $\Lambda < \mathbb{R}^k$, $F(\Lambda)$ is the norm-minimal fundamental domain of $\Lambda$,

Let $k = k(p)$ and set $t = t(p, k) = (1-\epsilon)t_{\mathrm{mix}}$

Lemma 24. As $k, p \to \infty$ in such a way that $k \le \frac{\log p}{\log\log p}$ we have

Since $\delta \asymp \epsilon$ as $\epsilon \downarrow 0$, $(5.12) = 1 - o(1)$ follows from concentration of the norm of a Gaussian vector on scale $\frac{1}{\sqrt k}$ times its median length, see Lemma 8. We estimate

For $k$ sufficiently large, any $\lambda$ counted in the expectation satisfies $\|\lambda\| < p$, and thus, by Lemma 19,

For any $x \in \mathbb{R}^k$, any lattice point $\hat x \in \mathbb{Z}^k$ which is a vertex of the unit lattice cube containing $x$ satisfies

and thus

Proof of Theorem 23, lower bound. For $n \ge \frac{t(p,k)}{2}$, Lemma 15 gives

while, for all $n < t(p, k)$,

By Lemma 24, the expectation of the integral against $\Theta$ is $1 - o(1)$, while the expectation of the integral against $\frac1p$ is bounded by

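As an aside, the unit-volume ball radius invoked at the start of this section can be computed directly from the volume formula $\mathrm{vol}\,B_2(0,R) = \pi^{k/2}R^k/\Gamma(\frac k2+1)$; the standard asymptotic is $R_k \sim \sqrt{k/(2\pi e)}$. A brief Python sketch (the function name is ours):

```python
import math

def unit_ball_radius(k):
    # Solve pi^(k/2) R^k / Gamma(k/2 + 1) = 1 for R, using lgamma to avoid
    # overflow of Gamma(k/2 + 1) for large k.
    return math.exp(math.lgamma(k / 2 + 1) / k) / math.sqrt(math.pi)
```

For $k = 2$ this returns $1/\sqrt\pi$, the radius of the unit-area disc, and for large $k$ the ratio $R_k/\sqrt k$ approaches $1/\sqrt{2\pi e} \approx 0.242$.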
Proof of Theorem 23, upper bound. The main proposition of the upper bound is as follows.
Proposition 25. Let $p$ and $k(p)$ tend to $\infty$ in such a way that $k \le \frac{\log p}{\log\log p}$, and let

For any fixed $\delta > 0$, (5.14)

Deduction of Theorem 23, upper bound. For any $\Lambda \in L_0(p, k)$ we have $p\mathbb{Z}^k < \Lambda$, and thus

We use several times the estimate, for $x \in S(0, \sqrt t, \tau)$,

Lemma 26. For all $x \in (\mathbb{R}/p\mathbb{Z})^k$,
Proof. If $p$ is sufficiently large then there is at most one point of $p\mathbb{Z}^k$ contained in $\Lambda_c(x)$, and so (5.15) gives

Let $v \in \mathbb{R}^k$ be a unit vector, and let $D_v$ denote the directional derivative in the $x$ variable in direction $v$. For any $\lambda \in S(x, \sqrt t, 2\tau)$ we have

In particular, for any $y \in \left[-\frac12, \frac12\right]^k$, since $\|y\|_2 \le \frac{\sqrt k}{2}$, we have

Thus the sum may be approximated with an integral, and the result follows.
Lemma 27. We have the following estimates.
and for $k > 2$,

Proof. The evaluations of the means follow from Lemma 26.
In evaluating the variance term, we write, for $\lambda_1, \lambda_2 \in \mathbb{Z}^k$, $\lambda_1 \sim \lambda_2$ if $\lambda_2 \equiv a\lambda_1 \bmod p$ for some $a \not\equiv 0, \pm1 \bmod p$. We have the following evaluations (see Lemma 19):

The variance thus evaluates to

The term (5.17) captures $\lambda_1 = \lambda_2$. Replacing one Gaussian by the bound (5.15) and then estimating as for the mean of $\Theta_M$ gives a bound for this term of

The error term $O(p^{-k})$ of (5.16) may be bounded by omitting the restriction on $\lambda_2 - x$ and summing over $\lambda_2$, the summation being bounded by a constant. The remaining summation over $\lambda_1$ and integral over $x$ are then evaluated as for the mean, and give an error of $O(p^{-k})$.
It remains to treat those terms from (5.16) with $\lambda_1 \sim \lambda_2$. Let $R(\tau) = 2(1+\tau)\sqrt t$. Any $\lambda_1 \sim \lambda_2$ contributing to the variance satisfies $\lambda = \lambda_1 - \lambda_2 \in B(0, R(\tau)) \setminus \{0\}$ and $\lambda_1 \equiv (a+1)\lambda \bmod p\mathbb{Z}^k$, $\lambda_2 \equiv a\lambda \bmod p\mathbb{Z}^k$ for some $a \bmod p$. Arranging the summation over $\lambda$ and $a$, we find that the contribution of terms with $\lambda_1 \sim \lambda_2$ to (5.16) is bounded by (by expanding the integral, this is now independent of $a$, which we pull out)

The total number of such $\lambda$ is $\ll 2^k(1+\tau)^k(1+\epsilon)^k p$ by estimating with the volume of the ball, see Lemma 11. Putting in the bound (5.15) for one Gaussian and integrating the second over all of $\mathbb{R}^k$, we obtain an estimate from the terms with $\lambda_1 \sim \lambda_2$ of $\ll \frac{8^k}{p^k}$.
Proof of Proposition 25. Consider separately the cases

and apply Markov's inequality.

6. The power-of-2 random walk

6.1.
A Chebyshev cut-off criterion. We begin by describing a commonly used second moment method for proving cut-off, which we apply in analyzing the power-of-2 random walk. The following is a variant of the lower bound method from [11]; see also Wilson's lemma in [20].
Given a probability measure $\mu$ on $\mathbb{Z}/p\mathbb{Z}$ and a frequency $\xi \in \mathbb{Z}/p\mathbb{Z}$, define the Fourier coefficient of $\mu$ at $\xi$ to be $\hat\mu(\xi) = \sum_{x \bmod p} \mu(x) e\bigl(\frac{x\xi}{p}\bigr)$. Define, as before, the $L^2$ mixing time by

and the spectral gap $\mathrm{gap} = 1 - \max_{0 \ne \xi \in \mathbb{Z}/p\mathbb{Z}} |\hat\mu(\xi)|$.
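In concrete terms, the Fourier coefficients and the spectral gap of a measure on $\mathbb{Z}/p\mathbb{Z}$ can be computed directly; a small sketch (function names ours, and the sign convention in the exponential is an assumption that does not affect the gap):

```python
import cmath

def fourier_coeff(mu, p, xi):
    # mu_hat(xi) = sum_x mu(x) e^{2 pi i x xi / p}, with mu a dict x -> mass.
    return sum(w * cmath.exp(2j * cmath.pi * x * xi / p) for x, w in mu.items())

def spectral_gap(mu, p):
    # gap = 1 - max over nonzero frequencies of |mu_hat(xi)|.
    return 1 - max(abs(fourier_coeff(mu, p, xi)) for xi in range(1, p))
```

For the lazy walk with $A = \{0, \pm1\}$ on $\mathbb{Z}/5\mathbb{Z}$ one has $\hat\mu(\xi) = \frac13(1 + 2\cos\frac{2\pi\xi}{5})$, giving a gap of about $0.4607$.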
Proposition 28. Let $\{A_p \subset \mathbb{Z}/p\mathbb{Z}\}_{p \in P}$ be a sequence of symmetric, lazy, generating sets for $\mathbb{Z}/p\mathbb{Z}$, with $\mu_{A_p}$ the corresponding uniform probability measure. Assume that the spectral gap tends to 0 with increasing $p$. Suppose the following holds for each fixed $\epsilon > 0$. For each $p \in P$ there exists a symmetric subset $0 \in B_p \subset \mathbb{Z}/p\mathbb{Z}$ such that, as $p \to \infty$,
• For all $\xi \in B_p$,
• For all $n < (1-\epsilon)t_{\mathrm{mix}}^2(p)$,
Then the sequence $\{(\mathbb{Z}/p\mathbb{Z}, \mu_{A_p}, U_{\mathbb{Z}/p\mathbb{Z}})\}$ converges to uniform in total variation distance with a cut-off at $t_{\mathrm{mix}}^1(p) \sim t_{\mathrm{mix}}^2(p)$ if and only if the condition
(6.4) $t_{\mathrm{mix}}^2(p)\,\mathrm{gap}(p) \to \infty$ as $p \to \infty$
is satisfied.
Remark. The condition (6.3) is in fact equivalent to (6.5) by the following application of Cauchy-Schwarz:

Proof of Proposition 28. Since $t_{\mathrm{mix}}^1 \le t_{\mathrm{mix}}^2$, if the condition $\mathrm{gap}(p) \cdot t_{\mathrm{mix}}^2(p) \to \infty$ fails then there is no cut-off in total variation, so we may assume that this condition holds. Let $\epsilon > 0$ be fixed. For $n > (1+\epsilon)t_{\mathrm{mix}}^2$, by Cauchy-Schwarz,

Writing $E_\mu$, $\mathrm{Var}_\mu$ for expectation and variance with respect to a probability measure $\mu$, we have

since $0 \in B_p$, and

and

It follows by condition (6.3) that (6.11)

Let $X_p$ be the subset of $\mathbb{Z}/p\mathbb{Z}$ defined by

By (6.9) and condition (6.2),

Hence Chebyshev's inequality, (6.7), (6.8) and (6.12) imply

while Chebyshev, (6.11) and (6.12) imply
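The Cauchy-Schwarz comparison between total variation and $L^2$ distance used in the proof can be checked numerically on small examples. A sketch (all function names ours) of the standard upper bound lemma $4\|\mu^{*n} - U\|_{TV}^2 \le \sum_{\xi \ne 0} |\hat\mu(\xi)|^{2n}$ on $\mathbb{Z}/p\mathbb{Z}$:

```python
import cmath

def convolve(mu, nu, p):
    # Convolution of two measures on Z/pZ, each given as a dict x -> mass.
    out = {}
    for x, a in mu.items():
        for y, b in nu.items():
            z = (x + y) % p
            out[z] = out.get(z, 0.0) + a * b
    return out

def tv_to_uniform(mu, p):
    # ||mu - U||_TV = (1/2) sum_x |mu(x) - 1/p|.
    return 0.5 * sum(abs(mu.get(x, 0.0) - 1 / p) for x in range(p))

def l2_bound(mu, p, n):
    # (1/2) sqrt(sum_{xi != 0} |mu_hat(xi)|^{2n}), an upper bound for
    # the total variation distance of mu^{*n} from uniform.
    s = sum(
        abs(sum(w * cmath.exp(2j * cmath.pi * x * xi / p) for x, w in mu.items()))
        ** (2 * n)
        for xi in range(1, p)
    )
    return 0.5 * s ** 0.5
```

Running the lazy $\{0, \pm1\}$ walk on $\mathbb{Z}/5\mathbb{Z}$, the bound dominates the exact total variation distance at every step, and both decay geometrically at rate governed by the gap.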

6.2.
Proof of Theorem 6, lower bound. Recall that we set $\ell = \ell_2(p) = \lceil \log_2 p \rceil$ and

We prove the lower bound of Theorem 6 conditional on $t_{\mathrm{mix}}^2 \sim \frac{\ell\log\ell}{2c_0}$, which is proven in the next section. The proof of the lower bound is a reduction to the conditions of Proposition 28.
Let $J = o(\log\log p)$ be a parameter. With an eye toward applying Proposition 28, set

We have $\ell \le |S| \le 2\ell$. For each $s \in S$ write $\frac sp$ in binary, $\frac sp = *.s_1s_2s_3\ldots$.
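The binary digits $s_1 s_2 s_3 \ldots$ of $\frac sp$ are produced by long division; a minimal sketch (function name ours):

```python
def binary_digits(s, p, n):
    # First n binary digits of the fractional part of s/p: s/p = *.s1 s2 s3 ...
    digits = []
    r = s % p
    for _ in range(n):
        r *= 2
        digits.append(r // p)  # next digit is 1 exactly when 2r >= p
        r %= p
    return digits
```

For example, $\frac15 = 0.00110011\ldots_2$; the digit string is eventually periodic with period dividing the multiplicative order of $2 \bmod p$.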
Partition $S$ into $2^{J+1}$ sets $S_1, S_2, \ldots, S_{2^{J+1}}$ according to the digits $s_\ell s_{\ell+1}\cdots s_{\ell+J}$. To each pair $s \ne s' \in S_i$ we obtain $r = s - s' \in B_p$. The multiplicity with which a given such $r$ arises in this way is $O(1)$. Hence

By Cauchy-Schwarz,

For all but $O(J^3\ell)$ pairs $\xi_1 \ne \xi_2 \in B_p$,

By excluding at most $O(J\ell^3)$ quadruples $(j_1, j_2, j_3, j_4)$ we may assume $j_i > J$ for all $i$ and $|j_i - j_k| \ge J$ for all $i \ne k$ in $\{1, 2, 3, 4\}$. Then

For the second statement, by excluding $O(J^2\ell^2)$ tuples $(j_1, j_2, j_3, j_4)$ we may assume that three of $j_1, j_2, j_3, j_4$ are larger than $J$ and mutually separated by at least $J$. One argues as before, using the additional calculation that for $1 < j < J$,

which holds since for all $i, j > 0$,

The third statement is similar.
Proof of Theorem 6, lower bound. Let $\epsilon > 0$ be given, and suppose that $n < (1-\epsilon)\frac{\ell\log\ell}{2c_0}$. Set $J = 2\log\log\ell$ and define $B_p$ as above. It suffices to show that $B_p$ satisfies conditions (6.2) and (6.5) of Proposition 28.
To check (6.5), split $\xi_1, \xi_2 \in B_p$ according as $\xi_1 = \xi_2$, or $\xi_1, \xi_2$ fall into one of the several cases enumerated in Lemma 31. This gives

By Lemma 29, $|B_p| = \ell^{2-o(1)}$, and thus all but the second term is an error term. Condition (6.5) holds, since

6.3.
Proof of Theorem 6, upper bound. We prove the following somewhat more precise estimate.
Proposition 32. For all $0 < \beta < \log\ell$ and all $n \ge \frac{\ell}{2c_0}(\log\ell + \beta)$ we have

Remark. The second term results from a discrepancy between the eigenvalue generating the spectral gap and the bulk of the large spectrum, which determines the mixing time.
With more effort, the factor of log ℓ could be removed.
The proof uses the following frequently used application of the Cauchy-Schwarz inequality; see [6] for an introduction to these types of estimates, and also [7].
Lemma 33. Let $\mu$ be a probability measure on a finite abelian group $G$. We have the upper bound

In particular,

Proof. We have

Hence, by Cauchy-Schwarz,

The above lemma reduces the problem to estimating the size of the Fourier coefficients $\hat\mu_{A_{2,p}}(\xi)$. In estimating these coefficients it will be convenient to use the following modified binary expansion of $\frac{\xi}{p}$.

Lemma 34. Let $p \ge 3$ be prime. For each $0 \not\equiv \xi \bmod p$ there is an increasing sequence

This representation is unique.
Proof. Write $-\frac{\xi}{p}$ in binary as $*.s_1s_2s_3\ldots$ with each $s_i \in \{0, 1\}$, then write

$\xi$ is obtained by a left shift, and then the subtraction is performed bitwise. The uniqueness follows because any two distinct such representations $(\epsilon,$

and otherwise is the least integer which appears in the symmetric difference $\{i_j\}\,\Delta\,\{i'_j\}$.

6.3.1. Index sequences. We introduce several notions which will be useful in the remainder of the argument. Given a real parameter $J > 0$, define a $J$-sequence of non-negative integers to be an ordered set $A \subset \mathbb{Z}_{\ge0}$, with members enumerated $A = a_1 < a_2 < \cdots$, such that any pair of consecutive elements differ by at most $J$. $|A|$ denotes the cardinality. Set $i(A) = a_1$, $t(A) = \sup(A)$. A $J$-sequence with $a_1 = 0$ is called normalized. Given a $J$-sequence $A = a_1 < a_2 < \cdots$, its offset sequence is the normalized $J$-sequence $A' = 0 < a_2 - a_1 < a_3 - a_1 < \cdots$. For instance, $1, 3, 7, 8, 10, 14$ is a 4-sequence with offset sequence $0, 2, 6, 7, 9, 13$. A $J$-sequence is called non-trivial if it contains a pair of elements that differ by more than 1. We denote by $\mathscr{J}$ the set of $J$-sequences, by $\mathscr{J}_0$ the set of normalized $J$-sequences, and by $\mathscr{J}'_0 = \mathscr{J}_0 \setminus \{\{0\}, \{0, 1\}\}$ the set of non-trivial normalized $J$-sequences.
A $J$-sequence $A$ contained in a sequence $B \subset \mathbb{Z}_{\ge0}$ is called a $J$-subsequence. We say that a $J$-subsequence $A \subset B$ is maximal if it is not properly contained in another $J$-subsequence $A' \subset B$. Given the parameter $J$, one easily checks that any $B \subset \mathbb{Z}_{\ge0}$ has a unique partition into maximal $J$-subsequences. For instance, in the first sequence above, $1, 3;\ 7, 8, 10;\ 14$ is a partition into maximal 2-subsequences.
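The unique partition into maximal $J$-subsequences is computed greedily: scan left to right and start a new part whenever consecutive elements differ by more than $J$. A Python sketch (function name ours):

```python
def maximal_J_subsequences(b, J):
    # Partition the sorted sequence b into its maximal J-subsequences.
    parts = [[b[0]]]
    for prev, cur in zip(b, b[1:]):
        if cur - prev > J:
            parts.append([cur])  # gap exceeds J: a new maximal part begins
        else:
            parts[-1].append(cur)
    return parts
```

This reproduces the partition in the example above: with $J = 2$ the sequence $1, 3, 7, 8, 10, 14$ splits as $1, 3;\ 7, 8, 10;\ 14$.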
We write $\mathscr{C}(B)$ for the set of maximal $J$-subsequences of $B$. The $J$-sequences in $\mathscr{C}(B)$ are $J$-separated in the sense that if $A_1 \ne A_2 \in \mathscr{C}(B)$ and

The sequences in $\mathscr{C}(B)$ are naturally ordered by, for $A_1, A_2 \in \mathscr{C}(B)$, $A_1 < A_2$ if and only if for any

In the remainder of the argument we think of the non-zero bits in the expansion of $\frac{\xi}{p}$ above as partitioned into maximal $J$-sequences. These $J$-separated parts do not interact significantly in calculating the Fourier transform; the argument that follows quantifies the interaction. Let $J \ge \log_2\ell$ be a parameter. Given $\xi \bmod p$, represent $\xi$ as $(\mathscr{I}(\xi), \epsilon(\xi))$ as above. Truncate $\mathscr{I}(\xi)$ to $\mathscr{I}'(\xi) = \mathscr{I}(\xi) \cap (0, \ell]$ (note that $\epsilon$ and $\mathscr{I}'$ determine $\xi$) and set
(6.14) $\sigma(\xi) = |\mathscr{I}'(\xi)|$, $\mathscr{C}(\xi) = \mathscr{C}(\mathscr{I}'(\xi))$.
We call $\mathscr{C}(\xi)$ the set of clumps of $\xi$, each clump being a $J$-sequence. If there exists

We write $\mathscr{C}_{\mathrm{init}}(\xi)$, $\mathscr{C}_{\mathrm{fin}}(\xi)$ for the initial and final clump, with the convention that $\mathscr{C}_{\mathrm{init}} = \emptyset$ if there is no initial clump, and similarly for $\mathscr{C}_{\mathrm{fin}}$. A clump is typical if it is neither initial nor final, and $\mathscr{C}_0(\xi) \subset \mathscr{C}(\xi)$ is the subset of typical clumps. Given a frequency $\xi$, define the savings of $\xi$ to be
(6.15) $\mathrm{sav}(\xi) = 2\ell + 1$

For a typical clump $C \in \mathscr{C}_0(\xi)$ also define

Lemma 35. We have

Proof. Since the clumps $C \in \mathscr{C}$ are $J$-separated, we have

where in the last two sums we specialize to $j = i - 1$, and note that for any fixed $l$

In a similar spirit we have the following crude estimate for savings.
$\ldots, i_j$, i.e. by shifting $i_1$ to the place adjacent to $i_2$.
The previous two lemmas imply the following one.
Proof. By a sequence of steps in which we either (i) move the first index of $C$ adjacent to the second, or (ii) delete the first, we reduce to the case of $C_0$ containing a single element, which satisfies $\mathrm{sav}$

We collect together several easy combinatorial estimates. Given a frequency $\xi$ we are most interested in typical clumps $C \in \mathscr{C}_0(\xi)$ which consist of a single index, or a pair of adjacent indices. Let the number of these be $x_1(\xi)$ and $x_2(\xi)$, respectively. Let $x_3(\xi) = |\mathscr{C}_0(\xi)| - x_1(\xi) - x_2(\xi)$ be the number of non-trivial clumps in $\mathscr{C}_0(\xi)$, and let $m = \sigma(\xi) - |\mathscr{C}_{\mathrm{init}}| - |\mathscr{C}_{\mathrm{fin}}| - x_1(\xi) - 2x_2(\xi)$ be the number of indices contained in the clumps counted in $x_3(\xi)$.
Given $m \ge 0$ and $x_3 \ge 0$, let

be the collection of $x_3$-tuples of non-trivial normalized $J$-sequences of total cardinality $m$. Given initial and final clumps $\mathscr{C}_{\mathrm{init}}$ and $\mathscr{C}_{\mathrm{fin}}$, $T \in \mathscr{T}(m, x_3)$ and integers $x_1, x_2 \ge 0$, let $N(\mathscr{C}_{\mathrm{init}}, \mathscr{C}_{\mathrm{fin}}, x_1, x_2, T)$ denote the number of $\xi$ with initial clump $\mathscr{C}_{\mathrm{init}}$, final clump $\mathscr{C}_{\mathrm{fin}}$, $x_1$ typical clumps with a single index, $x_2$ typical clumps which consist of a pair of consecutive indices and $x_3$ non-trivial typical clumps, whose offsets taken in order are given by $T$. For any $j \ge 0$, let $I(j)$ (resp. $F(j)$) be the number of $J$-sequences on $j$ indices which may appear as the initial (resp. final) clump of $\mathscr{I}(\xi)$, $\xi \in \mathbb{Z}/p\mathbb{Z} \setminus \{0\}$.
Lemma 41. Let $x_1, x_2, x_3, m, T$ be as above and let $\mathscr{C}_{\mathrm{init}}, \mathscr{C}_{\mathrm{fin}}$ be any initial and final clumps (possibly empty). We have the bounds

and, for any $T \in \mathscr{T}(m, x_3)$,

Also, for any $j \ge 0$,

Proof. To bound $|\mathscr{T}|$, neglecting $x_3$ and the non-triviality condition, choose for each index $1 \le j < m$ a distance $1 \le d(j) \le J+1$ between $j$ and $j+1$ in the arrangement, with a distance of $J+1$ indicating that a new clump begins with $j+1$. Similarly, the bound for $I(j)$ follows on choosing a first index in one of at most $J$ ways, and then choosing sequentially the distances between consecutive indices. For $F(j)$, count from the back instead.
The bound for $N(\mathscr{C}_{\mathrm{init}}, \mathscr{C}_{\mathrm{fin}}, x_1, x_2, T)$ follows on choosing a first index for each clump, the factor of 2 coming from choosing the sign.
Our results on savings may be summarized as follows.
By Lemma 37, again for some $c > 0$,

Conditioning on $x_1(\xi), x_2(\xi), x_3(\xi), m$ as in Lemma 41 and $i = |\mathscr{C}_{\mathrm{init}}|$, $f = |\mathscr{C}_{\mathrm{fin}}|$, we find

Inserting the estimates for $|\mathscr{T}|$ and $N$ from Lemma 41, we obtain

Assume that $\frac{\log\ell}{2^J} = o(1)$. Then, when $m \ge 1$, we find that the sum over

The terms for which

Choose $2^J = \ell$ to complete the proof.
Appendix A. Local limit theorem on $\mathbb{R}^k$

For $k > 1$ recall that we define the measure on $\mathbb{Z}^k$,

and that we write

for the density of the centered standard normal distribution on $\mathbb{R}^k$. In this appendix we prove Lemma 5, which we recall for convenience.
Lemma. Let $n, k(n) \ge 1$ with $k^2 = o(n)$ for large $n$. As $n \to \infty$ we have

We actually prove a stronger estimate, a local limit theorem on $\mathbb{R}^k$ for which we do not know an easy reference.
Proof of Lemma 5. We have, for any $A, \delta > 0$, and for some $C > 0$ (see Lemma 8),

so it suffices to estimate the difference

For $x \in \mathbb{Z}^k$ satisfying this upper bound and for $y \in \left[-\frac12, \frac12\right)^k$,

We claim that for all $\|x\|_2^2 \ll \frac{2kn}{2k+1} + \frac{n\log n}{\sqrt k}$,

To see that this suffices for the proof, let

The proof of Lemma 43 is a standard application of the saddle point method. As there are several intermediate lemmas, it may help the reader to skip ahead and first read the eventual proof. Associate to $\nu_k$ the generating function
$$f(z_1, \ldots, z_k) = \frac{1}{2k+1}\left(1 + z_1 + z_1^{-1} + \cdots + z_k + z_k^{-1}\right),$$
so that $\nu_k(\alpha) = C_\alpha[f]$, where for a Laurent series in multiple variables
$$g(z_1, \ldots, z_k) = \sum_{n_1, \ldots, n_k = -\infty}^{\infty} a_{n_1, \ldots, n_k}\, z_1^{n_1} \cdots z_k^{n_k}$$
we write $C_\alpha[g] = a_\alpha$. The generating function associated to $\nu_k^{*n}$ is thus $f^n$.
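In the simplest case $k = 1$, the identification $\nu_1^{*n}(\alpha) = C_\alpha[f^n]$ with $f(z) = \frac13(1 + z + z^{-1})$ can be checked by direct convolution; a sketch in exact rational arithmetic (function name ours):

```python
from fractions import Fraction

def nu1_power(n):
    # n-fold convolution of nu_1 (steps -1, 0, +1, each of mass 1/3) on Z;
    # equivalently, the Laurent coefficients of f(z)^n with f = (1 + z + 1/z)/3.
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for x, w in dist.items():
            for step in (-1, 0, 1):
                new[x + step] = new.get(x + step, Fraction(0)) + w / 3
        dist = new
    return dist
```

For instance, two steps give mass $\frac13$ at 0 (three of the nine step pairs return to the origin) and mass $\frac19$ at $\pm2$.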
As $f_0(0)$ is bounded below, the sequence necessarily converges.
To verify the asymptotics, note that $f_0(0) = O(1)$ leads to $R_j + \frac{1}{R_j} = 2 + f_0(0)\alpha_j$

Lemma 46. Let $n, k(n) \in \mathbb{Z}_{>0}$ with $k(n)^2 = o(n)$ as $n \to \infty$. Let $\alpha \in \mathbb{Z}^k$ and assume $\|\alpha\|_2^2 \le n\left(1 + \frac{\log n}{\sqrt k}\right)$ and $\|\alpha\|_4^4 \ll \frac{n^2}{k}\left(1 + \frac{\log n}{\sqrt k}\right)$. Let $R_j$ be determined by the saddle point equations (A.3). For $\theta \in D_{\mathrm{sm}}$ we have

Proof. We have

At the saddle point, the first derivatives vanish. The mixed derivatives are evaluated by plugging in

We have

The triple derivatives are estimated by Taylor expanding $e(\theta)$ to degree 1 in the numerator, using $R_j - \frac{1}{R_j} \ll \frac{k\alpha_j}{n}$ and $R_j + \frac{1}{R_j},\ f_0(\theta) \asymp 1$.