Quantitative bounds for Gowers uniformity of the M\"obius and von Mangoldt functions

We establish quantitative bounds on the $U^k[N]$ Gowers norms of the M\"obius function $\mu$ and the von Mangoldt function $\Lambda$ for all $k$, with error terms of shape $O((\log\log N)^{-c})$. As a consequence, we obtain quantitative bounds for the number of solutions to any linear system of equations of finite complexity in the primes, with the same shape of error terms. We also obtain the first quantitative bounds on the size of sets containing no $k$-term arithmetic progressions with shifted prime difference.


Introduction
Throughout this paper we fix an integer $k \geq 1$, and let $N > 1$ be a real parameter that is assumed to be sufficiently large depending on $k$. We will also make frequent use of the somewhat smaller quantity
$$Q := \exp(\log^{1/10} N), \tag{1.1}$$
for instance by sieving out multiples of all primes less than $Q$. We use $c$ to denote various small positive constants depending on $k$ that are allowed to vary from line to line, or even within the same line. All the constants in our asymptotic notation$^1$ are permitted to depend on $k$. The implied constants will be effective, except when otherwise stated.

In this paper we will be interested in quantitatively controlling the Gowers norm uniformity of the Möbius function $\mu$ and the von Mangoldt function $\Lambda$ on the interval $[N] := \{n \in \mathbb{N} : 1 \leq n \leq N\}$, as well as various related statistics. Our methods can extend to some other arithmetic functions, such as sufficiently "non-pretentious" bounded multiplicative functions, but we focus on the classical functions $\mu, \Lambda$ here for ease of exposition. Such quantitative control on the Gowers norms will be used to quantify the asymptotics for linear equations in primes obtained in [20].
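We begin by recalling the definition of the Gowers uniformity norms, first introduced by Gowers in [12]; we largely follow the notation of [20, Appendix B] here, except that we will find it convenient to work with both normalized and unnormalized Gowers norms.

Definition 1.1 (Gowers norms). Let $k \geq 1$ be a natural number.

(i) For $h = (h_1,\dots,h_k) \in G^k$ and $\omega = (\omega_1,\dots,\omega_k) \in \{0,1\}^k$ we write $\omega \cdot h := \omega_1 h_1 + \dots + \omega_k h_k$ and $|\omega| := \omega_1 + \dots + \omega_k$. We often identify $G^{k+1}$ with $G \times G^k$; thus for instance the assertion $(n,h) \in G^{k+1}$ means that $n \in G$ and $h \in G^k$.

(ii) If $f : G \to \mathbb{C}$ is a finitely supported function on an additive group $G$, we define the (unnormalized) Gowers uniformity norm $\|f\|_{\tilde U^k(G)}$ to be the quantity
$$\|f\|_{\tilde U^k(G)} := \Big( \sum_{(n,h) \in G^{k+1}} \prod_{\omega \in \{0,1\}^k} \mathcal{C}^{|\omega|} f(n + \omega \cdot h) \Big)^{1/2^k},$$
where $\mathcal{C} : z \mapsto \overline{z}$ denotes complex conjugation. If $G$ is finite, we then define the normalized norm
$$\|f\|_{U^k(G)} := \|f\|_{\tilde U^k(G)} / \|1\|_{\tilde U^k(G)}.$$

(iii) For any function $f : \mathbb{Z} \to \mathbb{C}$ and natural number $N$, we define the local (normalized) Gowers uniformity norm
$$\|f\|_{U^k[N]} := \|f 1_{[N]}\|_{\tilde U^k(\mathbb{Z})} / \|1_{[N]}\|_{\tilde U^k(\mathbb{Z})},$$
where $1_{[N]}$ is the indicator function of $[N]$.

$^1$ See Section 3 for a more detailed description of the asymptotic notation conventions used in this paper.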
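For small finite groups the definition can be checked by brute force. The following Python sketch evaluates the normalized norm on the cyclic group $\mathbb{Z}/N\mathbb{Z}$, assuming the standard form of the definition from [20, Appendix B]: the average over $(n,h)$ of the product over $\omega \in \{0,1\}^k$ of $f(n + \omega \cdot h)$, conjugated when $|\omega|$ is odd, raised to the power $1/2^k$. The function name `gowers_norm_cyclic` and the choice of cyclic group are illustrative assumptions, not part of the paper.

```python
import itertools


def gowers_norm_cyclic(f, k):
    """Normalized U^k norm of f on Z/NZ, where f is a list of N numbers.

    Brute force over all (n, h_1, ..., h_k); the cost is N^(k+1) * 2^k,
    so this is only meant for tiny examples.
    """
    N = len(f)
    total = 0.0
    for n in range(N):
        for h in itertools.product(range(N), repeat=k):
            term = 1.0
            for omega in itertools.product((0, 1), repeat=k):
                shift = sum(o * hi for o, hi in zip(omega, h))
                value = f[(n + shift) % N]
                if sum(omega) % 2 == 1:
                    # apply complex conjugation when |omega| is odd
                    value = value.conjugate()
                term *= value
            total += term
    average = total / N ** (k + 1)
    # the average is a non-negative real in exact arithmetic; abs() absorbs rounding
    return abs(average) ** (1.0 / 2 ** k)
```

For example, the character $n \mapsto (-1)^n$ on $\mathbb{Z}/4\mathbb{Z}$ has mean zero, so its $U^1$ norm vanishes, but its $U^2$ norm equals $1$, since every product over the four vertices of a parallelogram equals $1$.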
Thus for instance
$$\|f\|_{U^1[N]} = |\mathbb{E}_{n \in [N]} f(n)|, \qquad \|f\|_{U^2[N]} \asymp N^{1/4} \Big( \int_0^1 |\mathbb{E}_{n \in [N]} f(n) e(-n\theta)|^4 \, d\theta \Big)^{1/4},$$
where throughout this paper we use the averaging notation $\mathbb{E}_{n \in A} f(n) := \frac{1}{\#A} \sum_{n \in A} f(n)$ for finite non-empty sets $A$, where we adopt the usual asymptotic notation (see Section 3), and $e(\theta) := e^{2\pi i \theta}$. While we will permit the functions $f$ to be complex-valued for compatibility with previous literature (particularly those that invoke the circle method), in this paper we will deal almost exclusively with real-valued functions. As is well known, the Gowers uniformity norms are indeed norms for $k \geq 2$, and seminorms for $k = 1$; see for instance [20, Appendix B]. In particular, they obey the triangle inequality
$$\|f + g\|_{U^k[N]} \leq \|f\|_{U^k[N]} + \|g\|_{U^k[N]} \tag{1.2}$$
(and similarly for the other variants of the Gowers norms in Definition 1.1), which we will rely on frequently in this paper.

The Möbius pseudorandomness principle (see e.g., [28, p. 338]) informally makes the prediction $\mu(n) \approx 0$ in the metric given by the Gowers norms $U^k[N]$. Similarly, the usual modification of the Cramér random model [5], as refined by Granville [14] in order to take into account the distribution at primes below some threshold $w$, makes the prediction $\Lambda(n) \approx \Lambda_{\text{Cramér},w}(n)$ for various small $2 \leq w \ll N$, where $\Lambda_{\text{Cramér},w} : \mathbb{Z} \to \mathbb{R}$ is the function
$$\Lambda_{\text{Cramér},w}(n) := \frac{P(w)}{\varphi(P(w))} 1_{(n,P(w))=1} = \prod_{p<w} \Big( \frac{p}{p-1} 1_{p \nmid n} \Big),$$
where $P(w) := \prod_{p<w} p$ is the primorial$^2$ of $w$, with $\varphi$ the Euler totient function and $(n, P(w))$ the greatest common divisor of $n$ and $P(w)$. Thus for instance $\Lambda_{\text{Cramér},2} = 1$ (which corresponds to the original model of Cramér). The precise choice of the parameter $w$ is not too important, as can be shown by the following standard sieve-theoretic calculation:

Proposition 1.2 (Gowers norm stability of the Cramér model). If $2 \leq w, z \leq Q$, then
$$\|\Lambda_{\text{Cramér},w} - \Lambda_{\text{Cramér},z}\|_{U^k[N]} \ll w^{-c} + z^{-c}.$$

We establish this proposition in Section 5. In our applications it will be convenient to focus on the Cramér models $\Lambda_{\text{Cramér},w}$, $\Lambda_{\text{Cramér},z}$ with $w = \log^{\kappa} N$, $z = Q$, for $\kappa > 0$ a sufficiently small constant which may depend on $k$ (usually we can take $\kappa = 1/100$).
However, using Proposition 1.2 it is not difficult to also work with other suitable choices of parameters if desired, at least up to logarithmic decay (and probably up to pseudopolynomial decay$^3$ as well; see Remark 5.4).
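The Cramér model is simple to compute directly. The sketch below is a minimal illustration, assuming the convention $p < w$ for the primes being sieved and an integer threshold $w$; the helper names `primes_below` and `cramer_model` are hypothetical and chosen only for this example. It returns $\prod_{p<w} \frac{p}{p-1}$ when $n$ has no prime factor below $w$ and $0$ otherwise.

```python
def primes_below(w):
    """List of primes p with p < w (trial division; fine for small w)."""
    return [p for p in range(2, w)
            if all(p % q for q in range(2, int(p ** 0.5) + 1))]


def cramer_model(n, w):
    """Value of the Cramer model with threshold w at the integer n.

    Equals prod_{p < w} p/(p-1) if n is coprime to every prime p < w,
    and 0 otherwise.
    """
    value = 1.0
    for p in primes_below(w):
        if n % p == 0:
            return 0.0
        value *= p / (p - 1)
    return value
```

With $w = 2$ there are no primes to sieve and the model is identically $1$, matching the original model of Cramér. With $w = 6$ the model takes the value $30/8 = 3.75$ on the eight residue classes coprime to $30$, so it has mean exactly $1$ over a full period, which is the normalization one wants for a model of $\Lambda$.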
(ii) (Logarithmic and strongly logarithmic $U^2$ uniformity) We have for all $A > 0$ and all $2 \leq w \leq Q$.

$^2$ In some texts the constraint $p \leq w$ is used in place of $p < w$; the precise convention is not too important for our applications, but the choice $p < w$ is consistent with the conventions in [10].

$^3$ By a pseudopolynomially decaying function we mean one that decays faster than $\exp(-\log^c N)$ for some $c > 0$.
In the asymptotic notation superscripted with ineff, the implied constants are permitted to be ineffective.
A short deduction of this theorem from results stated in the literature is given in Appendix B for the sake of completeness.
The first main objective of this paper is to quantify (and make effective) the qualitative rate of decay $o^{\mathrm{ineff}}(1)$ in Theorem 1.3(iii). We are able to obtain doubly logarithmic bounds which are weaker than the $k = 2$ logarithmic bound in Theorem 1.3(ii) only by a single additional logarithm: This is new for $k \geq 3$; henceforth we will assume $k \geq 2$ in our arguments to avoid some minor degeneracies.
For later use, we also state a version of Theorem 1.4 for $\Lambda$ where the $W$-trick has been implemented. Corollary 1.5 ($W$-tricked quantitative Gowers uniformity). Let $w = (\log\log N)^{1/2}$ and $W = \prod_{p \leq w} p$. Then for $k \geq 2$ we have In Corollary 1.5, unlike in Theorem 1.4, the size of $w$ turns out to be important. Indeed, if we had $w / \log\log N \to \infty$, then for all we know there could be a Siegel zero to some modulus $q \leq Q$ such that all its prime factors divided $W$, and this would bias the main term $1$ in Corollary 1.5; cf. Theorem 2.6.
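To get a feel for how slowly the parameters in Corollary 1.5 grow, one can compute $w = (\log\log N)^{1/2}$ and $W = \prod_{p \leq w} p$ numerically. The sketch below takes $\log N$ rather than $N$ as input so that very large $N$ can be handled; the function name `w_trick_modulus` is an illustrative assumption.

```python
import math


def w_trick_modulus(log_N):
    """Return (w, W) with w = (log log N)^(1/2) and W the product of primes p <= w.

    The input is log N (natural logarithm), which must exceed 1 so that
    log log N is positive.
    """
    w = math.sqrt(math.log(log_N))
    W = 1
    for p in range(2, int(w) + 1):
        # trial division primality check, adequate for these tiny ranges
        if all(p % q for q in range(2, int(p ** 0.5) + 1)):
            W *= p
    return w, W
```

For instance, $\log N = 230$ (roughly $N = 10^{100}$) gives $w \approx 2.33$ and $W = 2$, while reaching $W = 6$ requires $\log\log N \geq 9$, that is $\log N \geq e^9 \approx 8103$.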

Applications to linear equations in primes and to progressions with shifted prime difference
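The main application of the qualitative uniformity result (1.5) in [20] was to obtain qualitative asymptotics on linear equations in the primes; now using Theorem 1.4 we can make that result quantitative.

Theorem 1.6 (Quantitative linear equations in primes). Let $N, d, t, L$ be positive integers, and let $\Psi = (\psi_1, \dots, \psi_t)$ be a system of affine-linear forms $\psi_i : \mathbb{Z}^d \to \mathbb{Z}$ of the form $\psi_i(n) = n \cdot \dot\psi_i + \psi_i(0)$, where $\dot\psi_i \in \mathbb{Z}^d$ and $\psi_i(0) \in \mathbb{Z}$ are such that $|\dot\psi_i| \leq L$ and $|\psi_i(0)| \leq LN$. Suppose that no two of the $\dot\psi_i$ are linearly dependent. Let $\Omega \subset [-N, N]^d$ be a convex body. Then
$$\sum_{n \in \Omega \cap \mathbb{Z}^d} \prod_{i=1}^t \Lambda(\psi_i(n)) = \beta_\infty \prod_p \beta_p + O_{t,d,L}\big( N^d (\log\log N)^{-c} \big) \tag{1.6}$$
as $N \to \infty$, where $c = c_{t,d,L} > 0$ depends only on $t, d, L$, $\Lambda$ is extended by zero to the integers, $\beta_\infty$ is the Archimedean factor $\beta_\infty = \mathrm{vol}(\Omega \cap \Psi^{-1}(\mathbb{R}^t_{>0}))$, and for each prime $p$, $\beta_p$ is the local factor
$$\beta_p := \mathbb{E}_{n \in (\mathbb{Z}/p\mathbb{Z})^d} \prod_{i=1}^t \frac{p}{p-1} 1_{\psi_i(n) \neq 0}$$
(viewing each $\psi_i$ also as an affine map from $(\mathbb{Z}/p\mathbb{Z})^d$ to $\mathbb{Z}/p\mathbb{Z}$).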
Note that, in the language of [20], the assumption that the $\dot\psi_i$ are pairwise linearly independent is equivalent to these forms having "finite Cauchy-Schwarz complexity".
In [20], the result of Theorem 1.6 was established with the qualitative error term $o^{\mathrm{ineff}}_{t,d,L}(N^d)$ in (1.6) (initially under the hypotheses of the Möbius and nilsequences conjecture and the inverse Gowers-norm conjecture, but these were later proved in [21], [25]).
We outline the (rather straightforward) details of the deduction of Theorem 1.6 from Theorem 1.4 in Section 9.
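The local factors in Theorem 1.6 can be computed by exhaustive search for small primes. The sketch below assumes the form of the local factor used in [20], namely $\beta_p = \mathbb{E}_{n \in (\mathbb{Z}/p\mathbb{Z})^d} \prod_{i=1}^t \frac{p}{p-1} 1_{\psi_i(n) \neq 0}$; the function name `local_factor` and the encoding of each form as a pair (coefficient tuple, constant term) are illustrative choices made here.

```python
import itertools
from fractions import Fraction


def local_factor(p, forms, d):
    """Local factor beta_p for a system of affine-linear forms in d variables.

    forms is a list of pairs (coeffs, const): coeffs is a length-d tuple of
    integers and const an integer, representing n -> n . coeffs + const.
    """
    t = len(forms)
    count = 0
    for n in itertools.product(range(p), repeat=d):
        # keep n if no form vanishes modulo p
        if all((sum(a * x for a, x in zip(coeffs, n)) + const) % p != 0
               for coeffs, const in forms):
            count += 1
    return Fraction(p, p - 1) ** t * Fraction(count, p ** d)


# three-term progressions n, n + r, n + 2r in the variables (n, r)
three_term_ap = [((1, 0), 0), ((1, 1), 0), ((1, 2), 0)]
```

For the three-term progression system this gives $\beta_2 = 2$, $\beta_3 = 3/4$ and $\beta_5 = 15/16$. Exact rational arithmetic is used so that the values can be compared with closed formulas without rounding issues.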
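Example 1.7. In [20, Example 8] it is shown that the number of (increasing) arithmetic progressions of primes of a given length $k \geq 2$ in $[N]$ is equal to
$$\Big( \frac{1}{2(k-1)} \prod_p \beta_p + o^{\mathrm{ineff}}(1) \Big) \frac{N^2}{\log^k N},$$
where $\beta_p := \frac{1}{p} \big( \frac{p}{p-1} \big)^{k-1}$ for $p \leq k$ and $\beta_p := \big( 1 - \frac{k-1}{p} \big) \big( \frac{p}{p-1} \big)^{k-1}$ for $p \geq k$. Using Theorem 1.6 in place of the main theorem of [20], the qualitative error term $o^{\mathrm{ineff}}(1)$ can now be improved to the doubly logarithmic error $O((\log\log N)^{-c})$. This is new for $k \geq 4$.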
Another application of Theorem 1.6 is to sets containing no progressions with shifted prime difference$^4$. It was shown by Sárközy [38] that (for $N$ large) any subset of $[N]$ of size $\gg N$ contains a pattern of the form $x, x + p - 1$ with $p$ a prime. After several improvements [31], [36], [45], the current best known quantitative version of this, proved recently by Green [15], is that any subset of $[N]$ of size $\geq N^{1-c}$ contains a pattern of this form. Sárközy's theorem was later generalized to longer progressions by Frantzikinakis-Host-Kra [9], and Wooley-Ziegler [46], who showed that, for any $k \geq 3$ and $N$ large enough in terms of $k$, any subset of $[N]$ of size $\gg N$ contains a pattern of the form $x, x + p - 1, x + 2(p-1), \dots, x + (k-1)(p-1)$ with $p$ a prime, that is, a $k$-term arithmetic progression with shifted prime difference. These proofs however did not provide quantitative bounds for the density of a set avoiding $k$-term progressions with shifted prime difference. Using our main theorem, we can now obtain the first quantitative bound for this problem.

Theorem 1.8 (A quantitative bound for sets missing progressions with shifted prime difference). Let $k \geq 3$, and let $N$ be large enough in terms of $k$. Then any subset of $[N]$ of size $\geq N (\log\log\log\log N)^{-c}$ contains a $k$-term arithmetic progression whose common difference is a shifted prime of the form $p - 1$. Moreover, if $k = 4$, one can replace $N (\log\log\log\log N)^{-c}$ with $N (\log\log\log N)^{-c}$ above, and if $k = 3$, one can replace it with $N \exp(-(\log\log\log N)^c)$.
The proof of this is given in Section 10. Remark 1.9. It is likely that one can similarly now make other qualitative consequences of (1.5) quantitative. Certainly the version of the generalized Hardy-Littlewood conjecture in [20,Conjecture 1.2] (in the finite complexity case) can now be made quantitative, with doubly logarithmic savings, in a manner perfectly analogous to Theorem 1.6, as can the version of the main theorem in [20, Theorem 1.8]; we omit the details. The more recent asymptotics on linear inequalities in primes in [44] are also likely to now have a doubly logarithmic quantitative version, but we do not pursue this matter here.
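The pattern in Theorem 1.8 is easy to search for in a concrete finite set. The following sketch is a naive exhaustive search and is not related to the proof; the names `is_prime` and `find_shifted_prime_progression` are hypothetical. It returns the first pair $(x, p)$, in increasing order of $x$ and then of $p$, such that $x, x + (p-1), \dots, x + (k-1)(p-1)$ all lie in the given set, or `None` if there is no such pair.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    return n >= 2 and all(n % q for q in range(2, int(n ** 0.5) + 1))


def find_shifted_prime_progression(A, k):
    """Search A for a k-term progression with common difference p - 1, p prime.

    Returns (x, p) for the first progression found, or None if A has none.
    Assumes A is a finite collection of positive integers and k >= 2.
    """
    S = set(A)
    if not S:
        return None
    top = max(S)
    for x in sorted(S):
        for p in range(2, top + 2):
            d = p - 1
            if x + (k - 1) * d > top:
                break  # larger differences leave the range of the set
            if is_prime(p) and all(x + j * d in S for j in range(k)):
                return x, p
    return None
```

For example, in the set $\{1, 4, 7, 10, 13\}$ the progression $1, 4, 7$ does not qualify, because its difference $3$ equals $p - 1$ only for $p = 4$; the search instead finds $1, 7, 13$ with $p = 7$.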
Lastly, one can also use Theorem 1.6 to quantify a result of the authors [41] on the logarithmically averaged Chowla conjecture for odd order correlations (whose proof relied on the Gowers uniformity of $\Lambda$). A back of the envelope calculation suggests that one could quantify the error term there, for fixed odd $k \geq 3$, to triply logarithmic; thus,
$$\frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n + h_1) \cdots \mu(n + h_k)}{n} \ll (\log\log\log x)^{-c} \tag{1.7}$$
for any fixed integers $0 \leq h_1 < \dots < h_k$ (and the same with the Liouville function in place of $\mu$). Very briefly, by the entropy decrement argument [41, Theorem 3.1] one can locate a scale $\exp((\log\log x)^{1/2}) \leq P \leq \log x$ such that the left-hand side of (1.7) can be replaced up to triply logarithmic error term with
$$\frac{(-1)^k}{\log\log P} \sum_{p \leq P} \frac{1}{p} \cdot \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n + p h_1) \mu(n + p h_2) \cdots \mu(n + p h_k)}{n}.$$
One would then split the $p$ sum into dyadic scales and proceed as in [41] by replacing the average over primes $p$ with an average over $w$-rough integers, using Theorem 1.6 and a quantitative version of the generalized von Neumann theorem as a substitute for Theorem 1.3, producing an admissible $O((\log\log P)^{-c})$ error term. The triply logarithmic error terms at this step are much worse than any other error terms arising in the rest of the proof, therefore leading to (1.7). We leave the details to the interested reader.
Discussion and set-up of the proof
Until recently, there were two main obstacles to achieving the sort of quantitative (and effective) bound stated in Theorem 1.4. Firstly, the first proofs of the inverse conjecture for the Gowers norms in the large $k$ regime $k \geq 5$ were ineffective (using tools such as nonstandard analysis) and did not provide any quantitative dependence of constants. Secondly, in order to overcome certain logarithmic losses in the estimates, it was necessary to invoke Siegel's theorem to control the correlation of the Möbius function with nilsequences, and the decay rate in the $o(1)$ bounds in Theorem 1.3(iii) then depended on the rate at which the constants in Siegel's theorem $|L(1, \chi)| \gg^{\mathrm{ineff}}_{\varepsilon} q^{-\varepsilon}$ depended on $\varepsilon$, which is completely ineffective with known methods.
The first issue was resolved recently with the quantitative inverse theorem of Manners [32], which provided a good quantitative dependence on all parameters in the inverse theory of Gowers norms. To resolve the second issue, we perform the technique of isolating out the contribution of a potential Siegel zero to obtain more refined approximations to the arithmetic functions µ, Λ. To make this precise we introduce some notation: Recall that the quantity Q was defined in (1.1).
(i) We define a $Q$-Siegel zero to be a real number $1 - \frac{c_0}{\log Q} < \beta < 1$ for which there exists a primitive real Dirichlet character $\chi_{\mathrm{Siegel}}$ (which we call the $Q$-Siegel character) of conductor $q_{\mathrm{Siegel}} < Q$ such that $L(\beta, \chi_{\mathrm{Siegel}}) = 0$, where $L(s, \chi)$ denotes the Dirichlet $L$-function associated to $\chi$. Here $c_0$ is a sufficiently small absolute constant (and henceforth all implied constants are permitted to depend on $c_0$). Note from the Landau-Page theorem (see e.g., [34, Corollary 11.10]) that if a $Q$-Siegel zero exists, then it is unique (and similarly for the $Q$-Siegel character), and the zero $\beta$ is simple (so that $L'(\beta, \chi_{\mathrm{Siegel}}) \neq 0$).

(ii) We define the $Q$-Siegel model $\Lambda_{\mathrm{Siegel}}$ for the von Mangoldt function $\Lambda$ to be if no $Q$-Siegel zero exists, and $\mu'$ is the function $\alpha$ is the quantity and $\mu_{\mathrm{local}} * \mu'$ is the Dirichlet convolution of $\mu_{\mathrm{local}}$ and $\mu'$: (Note from the supports of $\mu_{\mathrm{local}}, \mu'$ that at most one term in this sum is non-zero for any given $n$.)

The significance of these models is that $\Lambda$ and $\Lambda_{\mathrm{Siegel}}$ have very nearly the same statistics on arithmetic progressions (with error terms that improve over the main term by pseudopolynomial factors $O(\exp(-\log^c N))$, which are superior to the strongly logarithmic gains $O^{\mathrm{ineff}}_A(\log^{-A} N)$ provided by the Siegel-Walfisz theorem), and similarly for $\mu$ and $\mu_{\mathrm{Siegel}}$. Indeed, in Section 7 we will show the following estimates:

Proposition 2.2 (Pseudopolynomial equidistribution in arithmetic progressions). For any arithmetic progression $P \subset [N]$, we have
$$\sum_{n \in P} (\mu - \mu_{\mathrm{Siegel}})(n) \ll N \exp(-c \log^{1/10} N) \tag{2.4}$$
and
$$\sum_{n \in P} (\Lambda - \Lambda_{\mathrm{Siegel}})(n) \ll N \exp(-c \log^{1/10} N). \tag{2.5}$$

The construction of $\mu_{\mathrm{Siegel}}$ appears to be complicated, but it is a multiplicative construction and can be justified as follows. If $\chi$ is any character induced from $\chi_{\mathrm{Siegel}}$ of some period $q \mid [q_{\mathrm{Siegel}}, P(Q)]$, a short calculation reveals the Euler products whenever $\mathrm{Re}(s) > 1$.
One can then check that the meromorphic continuations of the two Dirichlet series (2.6), (2.7) both have a simple pole at s = β with the same residue (and when χ is not induced from χ Siegel there is no such pole), which helps justify why we expect µ Siegel to be a good approximation to µ. We experimented with simpler models to µ than µ Siegel , but in order to get the pseudopolynomial error terms exp(− log c N ) in (2.4) it seems essential that the model µ Siegel behaves almost identically to µ with respect to primes p as large as exp(log c N ), which necessitates a complicated construction such as (2.1). We remark that a similar (though slightly less refined) approximant λ Siegel to the Liouville function λ was introduced by Germán and Katai in [11], and recently used in [1] to establish Chowla's conjecture in the presence of a Siegel zero.
Proof. All of these bounds are either trivial or immediate consequences of Mertens' theorem, except for the bound on µ Siegel , which would follow if the quantity α in (2.3) were bounded. This turns out to follow from standard bounds on the L-function L(s, χ Siegel ) near a Q-Siegel zero β; see Lemma 5.5.
In view of Proposition 1.2 and the triangle inequality (1.2), Theorem 1.4 then follows from the following two statements.
Theorem 2.5 (Siegel corrections are logarithmically Gowers uniform). We have
$$\|\mu_{\mathrm{Siegel}}\|_{U^k[N]}, \ \|\Lambda_{\mathrm{Siegel}} - \Lambda_{\text{Cramér},Q}\|_{U^k[N]} \ll q_{\mathrm{Siegel}}^{-c},$$
with the convention that the expression $q_{\mathrm{Siegel}}^{-c}$ vanishes when no $Q$-Siegel zero exists.

Theorem 2.6 (Doubly logarithmic uniformity of Möbius and von Mangoldt, II). We have
$$\|\mu - \mu_{\mathrm{Siegel}}\|_{U^k[N]} \ll (\log\log N)^{-c} \tag{2.10}$$
and
$$\|\Lambda - \Lambda_{\mathrm{Siegel}}\|_{U^k[N]} \ll (\log\log N)^{-c}. \tag{2.11}$$

Theorem 2.5 is an application of sieve-theoretic methods, smooth number estimates and the Weil bound, and is established in Section 5.2. The main difficulty is to establish Theorem 2.6. In principle, one can directly apply the quantitative inverse theory of Manners [32], and reduce matters to controlling the correlation of $\mu - \mu_{\mathrm{Siegel}}$, $\Lambda - \Lambda_{\mathrm{Siegel}}$ with nilsequences arising from nilmanifolds (although in the case of $\Lambda - \Lambda_{\mathrm{Siegel}}$ we have the obstacle that the function is unbounded; the resolution of this is discussed below). Indeed, in Section 7 we will establish the following bounds that significantly extend the bounds in Proposition 2.2:

Theorem 2.7 (Pseudopolynomial orthogonality of Möbius and von Mangoldt with nilsequences). Let $\varepsilon > 0$ and $k \geq 1$. Let $c_1(\varepsilon) > 0$ be small enough in terms of $\varepsilon$. Then the bounds (2.12) and (2.13) hold whenever $P \subset [N]$ is an arithmetic progression, $G/\Gamma$ is a filtered nilmanifold of degree $k - 1$, dimension at most $(\log\log N)^{c_1(\varepsilon)}$, and complexity at most $\exp(\log^{c_1(\varepsilon)} N)$, $F : G/\Gamma \to \mathbb{C}$ is a 1-bounded Lipschitz function$^5$ of Lipschitz constant at most $\exp(\log^{1/10 - \varepsilon} N)$, and $g : \mathbb{Z} \to G$ is a polynomial map. (The relevant definitions of filtered nilmanifolds, etc., are reviewed in Definition 6.1.)

Remark 2.8. If one redefined the Siegel models $\mu_{\mathrm{Siegel}}$, $\Lambda_{\mathrm{Siegel}}$ by assigning the parameter $Q$ the larger value $\exp((\log N)^{1/2})$, one could inspect that the exponent of logarithm in (2.12) and (2.13) (and in particular in Proposition 2.2) could be increased to $1/2 - \varepsilon$, hence essentially matching the shape of the error term in the classical prime number theorem.
For this modification, one would have to tweak the exponents in Section 5 a little; in particular in Proposition 5.2 the exponents 3/5 and 4/5 would have to be replaced with 1/2. As the precise value of the exponent has very little influence on our bounds, we leave the details of this strengthening to the interested reader.
For sake of comparison, in [21] the strongly logarithmic bound A,M log −A N was established for any A > 0 assuming that the dimension and complexity of G/Γ and the Lipschitz constant of F were all bounded by M ; using this bound, in [20] the qualitative bound (1) was shown for the same type of nilsequences F (g(n)Γ), where W = P (w) for some w = w(N ) growing sufficiently slowly to infinity with N and any b ∈ [W ] coprime to W . With a little additional effort, the latter bound then also implies the qualitative bound for these nilsequences and arbitrary arithmetic progressions P ⊂ [N ]. The arguments relied upon (and in fact imply) the Siegel-Walfisz theorem and thus could not give error terms better than strongly logarithmic, which would be unsuitable for our applications (particularly those involving the von Mangoldt function). It is therefore necessary to account for the correction terms µ Siegel , Λ Siegel − Λ Cramér,Q to avoid any appeal to the Siegel-Walfisz theorem and to improve the bounds to be of pseudopolynomial type, despite the fact (from Theorem 2.5) that these correction terms are already logarithmically small in the Gowers norm sense. Our proof of Theorem 2.7 will broadly follow the same strategy as that in [21], relying on Proposition 2.2 in the "major arc" case and on decomposition into "Type I" and "Type II" sums, followed by Cauchy-Schwarz and an appeal to the equidistribution theory of nilmanifolds, in the "minor arc" case. A key new feature, compared to previous work, is that the dimension of the nilsequences is no longer bounded, but grows at a roughly doubly logarithmic rate in N . Because of this, we are forced to perform a careful accounting on the dependence on dimension in the aforementioned equidistribution theory, and in particular ensure that the bounds only depend at most doubly exponentially on the dimension. 
This is in fact one of the main reasons why our bounds in Theorem 1.4 are limited to be doubly logarithmic in nature; see Remarks 2.9, 6.4 below.
The estimate (2.10) can be directly obtained from (2.12) using the inverse theorem of Manners [32], which we review in Section 6; note that this theorem basically applies a double logarithm to the quantitative bounds, which is why the pseudopolynomial type terms in Theorem 2.7 are reduced to doubly logarithmic type terms in Theorem 2.6. For the von Mangoldt estimate (2.11), we encounter the familiar problem that Λ − Λ Siegel is not bounded (see Lemma 2.4), so that Manners' quantitative inverse theorem does not immediately apply. In [20], this difficulty was resolved at the qualitative level by first using the "W -trick" of passing to an arithmetic progression {W n + b : n ∈ N} for some W = P (w) and some w growing slowly with N , and then dominating (an appropriately normalized version of) the von Mangoldt function on that progression by a divisor sum ν of Goldston-Yıldırım type that obeyed some "pseudorandomness" conditions. This enabled one to then apply a transference principle that roughly speaking allowed one to behave "as if" the normalized von Mangoldt function was bounded on this progression, at least for the purposes of applying an inverse theorem for the Gowers norms.
Here the biggest source of quantitative inefficiency is the transference principle, as the first few proofs of this principle [18], [20], [13], [35] involved the Weierstrass approximation theorem, quantitative versions of which can generate exponential type losses. However, in [4] (see also [3]), Conlon, Fox, and Zhao introduced the method of densification, which they used to obtain a transference principle in the context of Szemerédi-type theorems that involved only polynomial dependencies on the bounds (and they also relaxed the pseudorandomness hypotheses on the enveloping sieve ν by dropping the so-called "correlation condition"). As it turns out, the densification method can be adapted to inverse theorems as well with efficient quantitative bounds, at least when the correlation in the inverse theorem enjoys polynomial bounds; we formalize this observation (which seems to be of independent interest) as Theorem 8.1. Fortunately for us, the arguments of Manners in [32, §5] already provide such a polynomial bound. Using our quantitative transference result for the inverse theorem, it becomes a relatively routine matter to derive (2.11) from (2.13), after making various necessary quantitative refinements (for instance, the parameter w will now be taken to be of the shape log ε N for some small ε > 0, rather than growing in some unspecified slow fashion with N ). This will all be performed in Section 8. Remark 2.9. Perhaps surprisingly, the bounds in Theorem 1.4 are not significantly improved if one assumes the generalized Riemann hypothesis; some pseudopolynomial bounds can now be sharpened to polynomial bounds (such as Theorem 2.7), but for the logarithmic and doubly logarithmic bounds only minor improvements in the unspecified constants c are available under GRH (though of course in this case any terms involving Q-Siegel zeroes can simply be deleted). 
On the other hand, it is tempting to conjecture that the doubly logarithmic bounds in our main results can be improved to logarithmic, given that several of the key estimates already have this quality of error term or better. This is particularly appealing in the $k = 3$ case where we have quite a good inverse $U^3$ theorem [17]. The main difficulty is that to achieve this goal, it appears that one needs an equidistribution theory for 2-step nilmanifolds (or quadratic bracket polynomials) that involves exponents that are merely polynomial in the dimension of the nilmanifold (or complexity of the bracket polynomial) rather than exponential. In analogy with the well known quadratic Diophantine approximation theory of Schmidt [39], it seems reasonable to expect such a theory to be feasible$^6$, but we will not pursue this matter here. On the other hand, we note that by combining Theorem 2.7 with the circle method one can obtain pseudopolynomial bounds of the shape $\exp(-c \log^c N)$, and one could optimistically conjecture that such pseudopolynomial (or even polynomial) bounds are also true for higher Gowers norms as well (such bounds would follow from a sufficiently uniform version of the Hardy-Littlewood prime tuples conjecture).

$^6$ Another option is to exploit the improved dimension bounds for the inverse $U^3$ theory now available [37], using the equivalences from [26]. Since the initial release of this preprint, this option has in fact been carried out by Leng [30], who significantly improved the $(\log\log N)^{-c}$ type bounds in Theorem 2.6 to $\exp(-\log^c N)$ type bounds in the $k = 3$ case.

Acknowledgments
TT was supported by a Simons Investigator grant, the James and Carol Collins Chair, the Mathematical Analysis & Application Research Fund Endowment, and by NSF grant DMS-1764034. JT was supported by a Titchmarsh Fellowship and funding from the European Union's Horizon Europe research and innovation programme under Marie Skłodowska-Curie grant agreement no. 101058904. We thank the anonymous referee for a careful reading of the paper and for numerous helpful corrections. We thank Sean Prendiville for helpful discussions, and Andrew Granville, Wataru Kai and James Leng for corrections.

Notation
As stated in the introduction, throughout this paper we fix an integer $k \geq 1$, and assume $N$ is a positive real number that is sufficiently large depending on $k$ (and $Q$ is given in terms of $N$ by (1.1)). We abbreviate $\{n \in \mathbb{N} : 1 \leq n \leq N\}$ as $[N]$ (even when $N$ is not an integer). We use the asymptotic notation $X \ll Y$, $Y \gg X$, or $X = O(Y)$ to denote an estimate of the form $|X| \leq CY$ for some constant $C > 0$. If $C$ depends on additional parameters, we indicate this by subscripts, for instance $X \ll_A Y$ or $X = O_A(Y)$ when $C$ is allowed to depend on a parameter $A$. However, as all of our constants will depend on the fixed parameter $k$, we omit this parameter from this subscripting notation. Unless otherwise specified, the constants will depend in an effective fashion on the parameters; on the rare occasions (mostly involving citing previous literature) in which ineffective constants are used, we will use the superscript $\mathrm{ineff}$ to indicate this. We write $X \asymp Y$ as an abbreviation for $X \ll Y \ll X$, subject to the same subscripting and superscripting conventions as before. If $X, Y$ depend on an additional parameter $N$, we write $X = o(Y)$ as $N \to \infty$ to denote the claim that $|X| \leq c(N) Y$ for some quantity $c(N)$ that goes to zero as $N \to \infty$, again subject to the same subscripting and superscripting conventions as before. As stated in the introduction, we use $c$ to denote various small positive constants depending on $k$ that can vary from line to line.
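We often refer to the following hierarchy of decay estimates, in increasing order of strength:
• Qualitative (and ineffective) decay, in which $X = o^{\mathrm{ineff}}(Y)$;
• Doubly logarithmic decay, in which $X \ll (\log\log N)^{-c} Y$;
• Logarithmic decay, in which $X \ll \log^{-c} N \cdot Y$;
• Strongly logarithmic decay, in which $X \ll_A \log^{-A} N \cdot Y$ for any $A > 0$ (this is a typical shape for bounds obtained using the Siegel-Walfisz theorem);
• Pseudopolynomial decay, in which $X \ll \exp(-c \log^c N) Y$; and
• Polynomial decay, in which $X \ll N^{-c} Y$.
As the terminology suggests, pseudopolynomial decay will be a satisfactory substitute for polynomial decay in many of our arguments.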
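The gaps between the levels of this hierarchy are large, as a quick numerical comparison shows. The sketch below evaluates the decay factors used in the paper, namely doubly logarithmic $(\log\log N)^{-c}$, logarithmic $\log^{-c} N$, strongly logarithmic $\log^{-A} N$, pseudopolynomial $\exp(-c \log^c N)$ and polynomial $N^{-c}$, as functions of $\log N$. The values $c = 1/2$ and $A = 2$ are arbitrary illustrative choices (in the paper $c$ is an unspecified small constant and $A$ is arbitrary), and the function name `decay_rates` is hypothetical.

```python
import math


def decay_rates(log_N, c=0.5, A=2.0):
    """Decay factors at a given value of log N, listed from weakest to strongest.

    c and A are illustrative placeholders for the constants in the text.
    """
    L = log_N
    return {
        "doubly logarithmic": math.log(L) ** (-c),
        "logarithmic": L ** (-c),
        "strongly logarithmic": L ** (-A),
        "pseudopolynomial": math.exp(-c * L ** c),
        "polynomial": math.exp(-c * L),  # N^(-c) = exp(-c log N)
    }
```

At $\log N = 1000$ the doubly logarithmic factor is still about $0.38$, whereas the pseudopolynomial factor is already of order $10^{-7}$. For a fixed $A$ the strongly logarithmic factor can be smaller than the pseudopolynomial one at moderate $N$, but it is eventually overtaken as $N \to \infty$.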
We use 1 E to denote the indicator function of a set E, thus 1 E (n) equals 1 when n ∈ E and 0 otherwise. We also use 1 S to denote the indicator of a statement S, thus 1 S equals 1 when S is true and 0 otherwise.
If A is a finite set, we use #A to denote its cardinality. All sums and products over the variable p are understood to be over primes, and similarly all sums and products over variables such as n or d are understood to be over natural numbers, unless otherwise indicated.

Some lemmas on Gowers norms
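We state here a few lemmas concerning the Gowers norms that will be used later on. In addition to the triangle inequality (1.2), we shall also often use the closely related Gowers-Cauchy-Schwarz inequality
$$\Big| \mathbb{E}_{(n,h) \in G^{k+1}} \prod_{\omega \in \{0,1\}^k} \mathcal{C}^{|\omega|} f_\omega(n + \omega \cdot h) \Big| \leq \prod_{\omega \in \{0,1\}^k} \|f_\omega\|_{U^k(G)} \tag{4.1}$$
for any finite additive group $G$ and any functions $f_\omega : G \to \mathbb{C}$ for $\omega \in \{0,1\}^k$; see for instance [20, Lemma B.2]. For arbitrary additive groups, we also have the non-normalized variant
$$\Big| \sum_{(n,h) \in G^{k+1}} \prod_{\omega \in \{0,1\}^k} \mathcal{C}^{|\omega|} f_\omega(n + \omega \cdot h) \Big| \leq \prod_{\omega \in \{0,1\}^k} \|f_\omega\|_{\tilde U^k(G)} \tag{4.2}$$
for finitely supported functions $f_\omega : G \to \mathbb{C}$.

Observe that the Gowers norms behave well with respect to tensor products: if $f_1 : G_1 \to \mathbb{C}$, $f_2 : G_2 \to \mathbb{C}$ are finitely supported functions on additive groups $G_1, G_2$, then a short computation reveals that
$$\|f_1 \otimes f_2\|_{\tilde U^k(G_1 \times G_2)} = \|f_1\|_{\tilde U^k(G_1)} \|f_2\|_{\tilde U^k(G_2)} \tag{4.3}$$
for any $k \geq 1$, where $(f_1 \otimes f_2)(n_1, n_2) := f_1(n_1) f_2(n_2)$. We now develop a variant of this identity (4.3). We localize the Gowers norm to cosets $a + H$ of a subgroup $H$ of an additive group $G$ as follows: if $k \geq 1$ and $f : G \to \mathbb{C}$ is finitely supported, we define
$$\|f\|_{\tilde U^k(a+H)} := \|f(a + \cdot)\|_{\tilde U^k(H)}.$$
Note that this definition does not depend on the choice of coset representative. We have the following convenient Fubini type inequality (which is reasonably well known "folklore", although the only explicit prior reference to such an inequality that we are aware of is [2, Lemma 4.3]):

Lemma 4.1 (Fubini type inequality). Let $k \geq 1$, let $G$ be an additive group, let $H$ be a subgroup of $G$, and let $f : G \to \mathbb{C}$ be a finitely supported function. For each coset $a + H$ in the quotient group $G/H$, let $F(a + H)$ denote the quantity
$$F(a + H) := \|f\|_{\tilde U^k(a+H)};$$
note that $F : G/H \to \mathbb{C}$ is also a finitely supported function. Then we have
$$\|f\|_{\tilde U^k(G)} \leq \|F\|_{\tilde U^k(G/H)}. \tag{4.4}$$

Informally, this lemma asserts that to bound the $U^k(G)$ norm of a function $f$, one can first evaluate the $\tilde U^k$ norm along the various cosets of $H$, and then compute the $\tilde U^k$ norm of the numbers obtained in that fashion. If $G, H$ are finite we can obtain similar claims for the normalized $U^k$ norms in the obvious fashion. Note that the Fubini-Tonelli theorem establishes a similar claim for the $\ell^1$ (or more generally $\ell^p$) norms (and in this case one has equality in (4.4) instead of inequality). One can also verify that (4.4) is consistent with (4.3).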
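The tensor product identity can be verified numerically on small groups. The sketch below assumes the multiplicative form of the identity, $\|f_1 \otimes f_2\|_{\tilde U^k(G_1 \times G_2)} = \|f_1\|_{\tilde U^k(G_1)} \|f_2\|_{\tilde U^k(G_2)}$, and checks it after raising both sides to the power $2^k$, for real-valued functions on $\mathbb{Z}/2\mathbb{Z}$ and $\mathbb{Z}/3\mathbb{Z}$. The function name `gowers_power`, the dictionary encoding of functions and the particular test functions are illustrative assumptions.

```python
import itertools


def gowers_power(f, moduli, k):
    """Return the 2^k-th power of the unnormalized U^k norm of a real-valued f.

    The group is Z/m_1 x ... x Z/m_r with moduli = (m_1, ..., m_r), and f is
    a dict mapping each group element (a tuple of residues) to a real number.
    No conjugation is needed because f is assumed real-valued.
    """
    points = list(itertools.product(*(range(m) for m in moduli)))
    total = 0
    for n in points:
        for h in itertools.product(points, repeat=k):
            term = 1
            for omega in itertools.product((0, 1), repeat=k):
                # the group element n + omega . h, computed coordinatewise
                x = tuple(
                    (n[j] + sum(o * hi[j] for o, hi in zip(omega, h))) % moduli[j]
                    for j in range(len(moduli))
                )
                term *= f[x]
            total += term
    return total


f1 = {(0,): 1, (1,): -2}
f2 = {(0,): 3, (1,): 1, (2,): -1}
tensor = {(a, b): f1[(a,)] * f2[(b,)] for a in range(2) for b in range(3)}
lhs = gowers_power(tensor, (2, 3), 2)
rhs = gowers_power(f1, (2,), 2) * gowers_power(f2, (3,), 2)
```

Integer-valued test functions are used so that the two sides can be compared exactly. The identity holds because the sum over $(n, h)$ in the product group factors as a product of the corresponding sums over each factor.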
Proof. From Definition 1.1 we have Consider the contribution to the right-hand side where n lies in a coset a + H and h i lies in a coset b i + H for i = 1, . . . , k. By the Gowers-Cauchy-Schwarz inequality (4.2), this contribution can be bounded in magnitude by Summing over all choices of a, b and applying Definition 1.1 again, we conclude that As a corollary of this inequality, we can estimate the Gowers norm of a function on [N ] in terms of its values on various arithmetic progressions: for all b ∈ [W ] coprime to W and some A > 0. Then one has Proof. We extend f by zero to the integers Z and work with the unnormalized Gowers norms. Since for all b ∈ [W ] coprime to W , and it will suffice to show that Applying Lemma 4.1 with G = Z and H = W Z, and normalizing the Gowers norms, it suffices to show that Expressing W as the product of primes p vp(W ) and using the Chinese remainder theorem and (4.3) repeatedly, the left-hand side can be written as .
However, direct computation using the inclusion-exclusion principle shows that The claim follows.
Next, we give a variant of the triangle inequality that estimates a Gowers norm based on the greatest common divisor with a fixed modulus.
The key point here is the presence of the factor 1 d , which ensures that the summation over d can be estimated manageably.
Proof. We extend f by zero outside of [N ]. From Definition 1.1, it suffices to show the unnormalized estimate The left-hand side can be written as where the dual function F (n) is defined as We split this sum in terms of the value of (n, q) as By the triangle inequality, it thus suffices to show that for each d|q. Decomposing h 1 , . . . , h k in the definition of F (n) into cosets mod d, the left-hand side may be written as By the Gowers-Cauchy-Schwarz inequality (4.1), and noting that f is bounded by Summing over all the d k choices of b, we thus obtain and the claim follows after a little algebra.
Some sieve theory

The Cramér model
In this section we use some standard sieve-theoretic tools to establish several estimates involving the Cramér models Λ Cramér,w , some of which will also be useful in controlling the Siegel models Λ Siegel , µ Siegel in later sections.
We first recall a form of the fundamental lemma of sieve theory (arising from an analysis of the beta sieve).
Lemma 5.1 (Fundamental lemma of sieve theory). Let (a n ) n∈Z be a collection of nonnegative reals, let κ > 0, z ≥ 2, and D ≥ z 9κ+1 . Let g : N → [0, 1) be a multiplicative function obeying the estimates for all 2 ≤ w ≤ z and some K > 0. Suppose that for every d ≤ D dividing P (z) one has the formula for some X > 0 and some remainder r d . Then one has n (n,P (z))=1 a n = X Proof. See [10, Theorem 6.9].
In our applications, the ratio $s = \log D / \log z$ will grow at a logarithmic rate, leading to pseudopolynomial accuracy when applying the fundamental lemma.
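As a hedged numerical illustration of the type of statement made by the fundamental lemma, the following Python sketch treats only the simplest special case $a_n = 1_{1 \le n \le X}$, $g(d) = 1/d$, $\kappa = 1$, and compares the sifted count with the main term $X \prod_{p<z} (1 - 1/p)$. The function names are illustrative and do not come from the paper; the brute-force approach is only suitable for small $z$.

```python
from fractions import Fraction
from math import gcd

def primes_below(z):
    """Return the list of primes p < z by trial division (fine for small z)."""
    return [p for p in range(2, z)
            if all(p % q for q in range(2, int(p ** 0.5) + 1))]

def sifted_count(X, z):
    """Count 1 <= n <= X with (n, P(z)) = 1, where P(z) is the product of primes p < z."""
    P = 1
    for p in primes_below(z):
        P *= p
    return sum(1 for n in range(1, X + 1) if gcd(n, P) == 1)

def main_term(X, z):
    """Main term X * V(z), with V(z) = prod_{p<z} (1 - g(p)) and g(p) = 1/p (an assumed model case)."""
    V = Fraction(1)
    for p in primes_below(z):
        V *= 1 - Fraction(1, p)
    return X * V
```

When $X$ is a multiple of $P(z)$ the remainders $r_d$ all vanish and the two quantities agree exactly; otherwise, in this toy case, they differ by less than the number of divisors of $P(z)$.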
Using the fundamental lemma we can obtain satisfactory estimates (with pseudopolynomial accuracy) for counting linear equations in the Cramér model (compare with Theorem 1.6).
for someψ i ∈ Z m and ψ i (0) ∈ Z. Assume that the linear coefficientsψ 1 , . . . ,ψ t ∈ Z m are all pairwise linearly independent and have magnitude at most exp(log 3/5 N ) (say). Then for any 2 ≤ z ≤ Q, one has for some c > 0 depending only on t, m, where for each p, β p is the local factor where ψ i is also viewed as a map from (Z/pZ) m to Z/pZ in the obvious fashion.
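The displayed formula for the local factor is not reproduced above, so the following Python sketch assumes the standard form $\beta_p = \mathbb{E}_{n \in (\mathbb{Z}/p\mathbb{Z})^m} \prod_{i=1}^t \frac{p}{p-1} 1_{\psi_i(n) \neq 0 \ (p)}$. This assumption is consistent with the values quoted later in the text for a single form $n \mapsto qn + a$ (namely $0$, $1$, or $\frac{p}{p-1}$). The function name and the encoding of forms as `(coeffs, const)` pairs are hypothetical.

```python
from fractions import Fraction
from itertools import product

def local_factor(p, forms):
    """Assumed local factor beta_p = E_{n in (Z/pZ)^m} prod_i (p/(p-1)) * 1[psi_i(n) != 0 mod p].

    Each form is a pair (coeffs, const) representing psi(n) = coeffs . n + const.
    """
    m = len(forms[0][0])
    weight = Fraction(p, p - 1) ** len(forms)
    good = 0
    # Enumerate all n in (Z/pZ)^m and count those at which no form vanishes mod p.
    for n in product(range(p), repeat=m):
        if all((sum(c * x for c, x in zip(coeffs, n)) + const) % p != 0
               for coeffs, const in forms):
            good += 1
    return weight * Fraction(good, p ** m)
```

For instance, `local_factor(3, [((1,), 0), ((1,), 2)])` evaluates the factor at $p = 3$ for the pair of forms $n, n+2$.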
Proof. Without loss of generality we may assume that N is sufficiently large depending on t, m; we now allow all implied constants to depend on t, m. For any d dividing P (z), let g(d) ∈ [0, 1] denote the quantity , with the convention that g(d) = 0 if d does not divide P (z). In particular we have for all p < z. From the Chinese remainder theorem we see that g is multiplicative. Suppose first that g(p) = 1 for some p < z, then β p = 0 and k i=1 Λ Cramér,z (ψ i (n)) is identically zero. Thus the proposition is trivial in this case, so we may assume that g(p) < 1 for all p. From construction we then have the crude bound Also, from construction we see that for any two distinct linear forms ψ i , ψ j , there is a positive integer A ij = exp(O(log 3/5 N )) such thatψ i ,ψ j are linearly independent in (Z/pZ) k whenever p does not divide A ij (indeed, one can take A ij to be one of the nonzero coefficients of the wedge product ofψ i andψ j ). If we let A = exp(O(log 3/5 N )) be the product of all the A ij , we conclude in particular that whenever p does not divide A, hence by the inclusion-exclusion formula (or Bonferroni inequalities) we have whenever p does not divide A. In particular we have unless p divides A (using (5.4) to handle the case when p is bounded). For p dividing A, and hence by Mertens theorem the axiom (5.1) is obeyed with κ = t and some K = O(exp(O(log 3/5 N ))).
We introduce the sequence Observe that the a n are non-negative with

Set
D := exp(log 9/10 N ). For any d ≤ D dividing P (z), we have The condition d| t i=1 ψ i ( n) restricts d to g(d)d m cosets of (dZ) m . Applying a volume packing argument using [20,Corollary A.2] gives n∈Ω∩Z m and hence axiom (5.2) is obeyed with X := vol(Ω) and some We can then simplify the right-hand side using (5.3) and Mertens' theorem to As a first application of this estimate, we have good estimates (basically of logarithmic type) for the Cramér model in the Gowers norm.
We can rewrite the desired estimate (after adjusting c appropriately) as where Ω is the convex body of tuples (n, h) ∈ R k+1 such that for all ω ∈ {0, 1} k . By inclusion-exclusion, it suffices to establish the bounds for all subsets S ⊂ {0, 1} k . Applying Proposition 5.2 (and Mertens' theorem), the left-hand side is equal to (in fact there is plenty of room to spare in the error term), where If p < w, then W vanishes modulo p and b is coprime to p, and hence By the inclusion-exclusion argument used to establish (5.5) one has Since vol(Ω) (N ) k+1 , the claim follows.
is trivial, so we may assume N is large enough that log 1/100 N > 2).
Remark 5.4. With more effort it may be possible to delete the $\log^{-c} N$ term in (1.3), but we will not need to do so here, as several other error terms in our analysis are of the same order of magnitude as $\log^{-c} N$, or worse.

Controlling the Siegel correction
Now suppose that there is a Q-Siegel zero β, with associated quadratic character χ Siegel and conductor q Siegel . In this subsection we combine the previous sieve-theoretic estimates with Weil sum estimates to obtain good control on the Siegel models Λ Siegel , µ Siegel .
We begin with some basic estimates on the Q-Siegel zero β and the Q-Siegel conductor q Siegel . As χ Siegel is a primitive real character, q Siegel must be either square-free, four times a square-free number, or eight times a square-free number. From construction one has the upper bound $q_{\mathrm{Siegel}} \le Q = \exp(\log^{1/10} N)$. From [7, Chapter 14, (12)] one has the estimate $1 - \beta \gg q_{\mathrm{Siegel}}^{-1/2} \log^{-2} q_{\mathrm{Siegel}}$, which when combined with the upper bound $1 - \beta \ll \frac{1}{\log Q} \ll \log^{-1/10} N$ gives the lower bound One could improve this lower bound using Siegel's theorem to strongly logarithmic, but we will not do so here in order to keep the estimates effective. In particular, any bound of the shape $O(q_{\mathrm{Siegel}}^{-c})$ will lead to logarithmic decay. From [34, Theorem 2.9] we observe the doubly logarithmic bound log log q Siegel log log N.
Next, we show that the quantity α in (2.3) is bounded, which was the missing step needed to establish Lemma 2.4: Lemma 5.5. We have $\alpha \ll 1$. In particular, Lemma 2.4 holds.
Proof. Consider the meromorphic function This function has a simple pole at β with residue and no other poles in the disk {s : [34,Theorem 11.3]. By Mertens' theorem, it thus suffices to establish the bound By the residue theorem, it suffices to show that followed by the triangle inequality to estimate thanks to Mertens' theorem. For more general points s on this circle, we have from [34,Theorem 11.4 on the entire circle; integrating this and using (5.9), we obtain (5.8) as required.
From [34,Theorem 11.4] we have and L(s, χ Siegel ) |s − β| for s = β sufficiently close to β; multiplying the two estimates and taking limits as s → β, we also obtain the bound We can view χ Siegel as a function on Z/q Siegel Z. Crucially, it exhibits some cancellation in the Gowers norms (of polynomial type in q Siegel , and hence of logarithmic type in N ): Lemma 5.6 (Gowers norm cancellation). For any ε > 0, we have Proof. By the Chinese remainder theorem, we can express Z/q Siegel Z as the product of prime cyclic groups Z/pZ of odd order, as well as Z/2 j Z for some 0 ≤ j ≤ 3. The quadratic character χ Siegel can then be expressed as the tensor product of quadratic characters on these groups. Using (4.3) and the divisor bound, it thus suffices to show that for all odd primes p, with χ the quadratic character on Z/pZ. By Definition 1.1, this is equivalent to The contribution of any given tuple h ∈ (Z/pZ) k to the left-hand side is trivially bounded by O(p −k ). When the dot products ω · h are all distinct, the Weil bounds (see e.g., [28,Corollary 11.24]) give instead the bound O(p −k−1/2 ). Since there are p k tuples h and collisions between the ω · h only occur for O(p k−1 ) of these tuples, the claim follows.
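To make the polynomial saving concrete in the simplest case, the following Python sketch computes $\|\chi\|_{U^2(\mathbb{Z}/p\mathbb{Z})}^4$ by brute force for the quadratic character $\chi$ modulo a small odd prime $p$. It covers only $k = 2$, where a standard Gauss sum computation (not taken from the text above) gives the exact value $(p-1)/p^2$; it does not implement the Weil bound argument used for general $k$. The function names are illustrative.

```python
from fractions import Fraction

def legendre(a, p):
    """Quadratic character mod an odd prime p, via Euler's criterion."""
    r = pow(a % p, (p - 1) // 2, p)
    return -1 if r == p - 1 else r  # r is 0, 1, or p - 1

def u2_norm_fourth_power(p):
    """Brute-force E_{n,h1,h2} chi(n) chi(n+h1) chi(n+h2) chi(n+h1+h2) over Z/pZ."""
    chi = [legendre(a, p) for a in range(p)]
    total = 0
    for n in range(p):
        for h1 in range(p):
            for h2 in range(p):
                total += (chi[n] * chi[(n + h1) % p]
                          * chi[(n + h2) % p] * chi[(n + h1 + h2) % p])
    # chi is real-valued, so no complex conjugations are needed.
    return Fraction(total, p ** 3)
```

In particular $\|\chi\|_{U^2(\mathbb{Z}/p\mathbb{Z})}$ is of size about $p^{-1/4}$ in this case, which is the kind of power saving in the modulus that Lemma 5.6 asserts.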
We can now use this cancellation to prove Theorem 2.5.
Proof of Theorem 2.5. We may assume N is sufficiently large depending on k, and allow all implied constants to depend on k. Obviously we may assume that a Q-Siegel zero exists, as the claim is trivial otherwise. We first establish (2.9). It suffices to show the polynomial (in q Siegel ) bound where (·) β−1 denotes the function n → n β−1 . By the fundamental theorem of calculus, we have Substituting ( Siegel for some c > 0 and all 1 ≤ M ≤ N , where Ω = Ω M is the convex body Splitting n, h 1 , . . . , h k into cosets of q Siegel , we can write the left-hand side of (5.14) as Applying Proposition 5.2 (with N replaced by N/q Siegel ), we can estimate Because of the χ Siegel factor in (5.15), we can restrict attention to the case where a + ω · b is coprime to q Siegel . This implies that β p = ( p p−1 ) 2 k when p|q Siegel . When p q Siegel , we can shift n, h by a/q Siegel , b/q Siegel respectively (performing the division over the field Z/pZ) to simplify In particular the β p are not dependent on a, b, q Siegel . Summing in a, b, we can thus write the left-hand side of (5.14) as The error term is certainly negligible. From Lemma 5.6 we have (say), and we can of course bound vol(Ω) N k+1 . Finally, direct calculation shows that thanks to (5.7). Putting these estimates together, we obtain the claim (2.9). Now we establish (2.8), which is a similar calculation but a little more involved because of the µ local factor. By Lemma 4.3, (5.6) it suffices to show that We rewrite this estimate as where D is the set of all d |P (Q) with (d , q Siegel ) = 1. By Lemma 5.5, it thus suffices to show that Using (5.12), (5.13) and Minkowski's integral inequality, it suffices to show for any M ≥ 1, where We decompose D = D ≤ ∪ D > , where D ≤ are those d ∈ D with d ≤ exp(log 1/2 N ) (say) and D > are those d ∈ D with d > exp(log 1/2 N ). We first dispose of the contribution of the large d , i.e. those that satisfy d ∈ D > . 
Their contribution to the expression inside the norm on the left-hand side of (5.18) is supported on a set of numbers n of size We can expand out the left-hand side as where Ω is the set of all tuples (n, h) ∈ Z k+1 such that n + ω · h ∈ [N/d] for all ω ∈ {0, 1} k . Meanwhile, using the pointwise bound (reflecting the fact that every number n has a unique decomposition n = d (n/d ) where d |P (Q) and (n/d , P (Q)) = 1) one has .
Hence it will suffice to show that The constraints 1 d ω |n+ω· h restrict (n, h) to some finite union of cosets (a, b) + DZ k+1 of DZ k+1 where D := ω∈{0,1} k d ω , with the property that d ω divides a + ω · b for all ω ∈ {0, 1} k . Note from construction that D is coprime to q Siegel and of size O(exp(O(log 1/2 N ))). So, denoting for brevity Ω (a, b) := Ω ∩ ((a, b) + DZ k+1 ), it will suffice to show that for all such cosets (a, b) + DZ k+1 . Using Proposition 5.2 and some elementary rescaling, we have .
If any of theβ p vanish then both sides of (5.19) vanish and we are done. For p not dividing D we have the crude bound and hence the right-hand side of (5.19) is comparable to q −c Siegel (N/D) k+1 p<Qβ p . Next, we partition the left-hand side of (5.19) as We can restrict attention to those (r, s) for which r + ω · s is coprime to q Siegel for all ω ∈ {0, 1} k , since otherwise the product in (5.23) vanishes. Under this assumption, we can apply Proposition 5.2, the Chinese remainder theorem, and some further rescaling (using the fact that D, q Siegel are coprime), to conclude that where Ω := (n, h) ∈ Ω : Note the main term here is independent of r, s. In particular, we can rewrite (5.23) as and so the first term in (5.24) is also acceptable.

The Manners inverse theorem
We are now ready to state a version of the inverse theorem of Manners [32], though formulated in a slightly different language (in particular, using the complexity notions from [22] rather than [32]). Definition 6.1 (Nilmanifolds). Let s ≥ 1 be an integer, and let M > 0. A (filtered) nilmanifold G/Γ of degree s and complexity at most M consists of the following data: (i) A nilpotent connected and simply connected Lie group G of some dimension m, which can be identified with its Lie algebra log G via the exponential map exp : log G → G or its inverse log : G → log G; (iv) A linear basis X 1 , . . . , X dimG of log G, known as a Mal'cev basis (of the second kind). We require this data to obey the following axioms: (a) For 1 ≤ i, j ≤ dim(G), one has c ijk X k for some rational numbers c ijk with numerator and denominator bounded in magnitude by M .
(c) The subgroup Γ consists of all elements of the form exp(t 1 X 1 ) · · · exp(t dimG X dimG ) with t 1 , . . . , t dimG ∈ Z. This data defines a metric on G/Γ as described in [22,Definition 2.2], as well as the notion of a polynomial map g : Z → G, defined in [22,Definition 1.8].
A function f : X → C is said to be 1-bounded if |f (n)| ≤ 1 for all n ∈ X. Proof. By Bertrand's postulate we can find a prime $N'$ such that $10N \le N' \le 20N$. If we embed [N ] into the cyclic group $\mathbb{Z}/N'\mathbb{Z}$ and extend f by zero we may view f as a 1-bounded function on $\mathbb{Z}/N'\mathbb{Z}$, and a brief calculation reveals that We now apply [32, Theorem 1.1.2] with s := k − 1 to produce the required data G/Γ, g, F , X i , save for two differences. Firstly, the polynomial g is described as a map from $\mathbb{Z}/N'\mathbb{Z}$ to G/Γ rather than from Z to G, but one can lift the map from the former to the latter using [32, Proposition C.17]. Secondly, instead of axiom (a) of Definition 6.1, the basis elements X i are required to obey a decomposition for some integers a ijl bounded in magnitude by some bound $M_0 \ll \exp\exp(O(1/\delta^{O(1)}))$, where the product is taken from left to right. However, as briefly noted in [32, §C.2], one can pass from this control (6.2) to the control (6.1) (with M a suitable polynomial of M 0 ), as follows. For any 1 ≤ a ≤ k − 1, we let P (a) denote the claim that one has (6.1) with M of the form $\exp\exp(O(1/\delta^{O(1)}))$ whenever one of X i , X j lies in log G a . The claim P (a) is certainly true for a = k − 1 since log G k−1 is central, and we will be done if P (1) is true, so it suffices by downward induction (with at most k − 2 steps) to show that P (a + 1) implies P (a) for any 1 ≤ a ≤ k − 2, where the implied constants in the O() notation are allowed to vary with each step of the induction. Call a rational number good if its numerator and denominator are bounded in magnitude by $\exp\exp(O(1/\delta^{O(1)}))$. If one of X i , X j lies in log G a , then from (6.2), the induction hypothesis, and the Baker-Campbell-Hausdorff formula we see that for some good rationals c ijl (and furthermore one can restrict to those X k lying in log G a+1 ). 
On the other hand, a further application of Baker-Campbell-Hausdorff reveals that log[exp(X i ), exp(X j )] is equal to [X i , X j ] plus O k (1) additional terms, which consist of a good rational number times an iterated Lie bracket formed by starting with [X i , X j ] and taking the Lie bracket with either X i or X j one or more times (but no more than O(1) times in all). Inverting this formula, we can then write [X i , X j ] as log[exp(X i ), exp(X j )] plus O(1) additional terms, which consist of a good rational number times an iterated Lie bracket formed by starting with log[exp(X i ), exp(X j )] and taking the Lie bracket with either X i or X j one or more times (but no more than O(1) times in all). Using (6.3) and the induction hypothesis P (a + 1) repeatedly, we conclude P (a), thus closing the induction. Remark 6.3. As noted in [32], improved bounds are available for k ≤ 4 [17,24], but we will not be able to take advantage of these bounds due to inefficiencies elsewhere in the arguments (in particular, our nilsequence equidistribution theory involves exponents that are exponential in the dimension rather than polynomial).
From Lemma 2.4 we see that the function µ − µ Siegel can be made 1-bounded by multiplying by a small absolute constant. Applying Theorem 6.2 in the contrapositive (setting δ equal to a small power of $(\log\log N)^{-1}$), we conclude that the bound (2.10) is an immediate consequence of (2.12). The same argument does not work directly for Λ − Λ Siegel due to the additional factor of log N in the pointwise bounds, but we will be able to get around this in Section 8 by employing the densification technology of Conlon, Fox, and Zhao [4]. Assuming this for the moment, the only remaining step needed to establish Theorem 1.4 is to prove Theorem 2.7, to which we now turn. Remark 6.4. When k = 3, one can appeal to the quantitative inverse theorem in [17] instead of Theorem 6.2, and when k = 4 one can use the fact that Manners proved in [32] a stronger form of Theorem 6.2 for k = 4 than for k ≥ 5. If one does so, one eventually finds that one would be able to improve the doubly logarithmic bounds in Theorem 1.4 for k ≤ 4 to singly logarithmic, provided that one could increase the bound on the dimension of G/Γ in Theorem 2.7 from $(\log\log N)^{c_1}$ to $\log^{c_1} N$. Unfortunately, our equidistribution theory on nilmanifolds is currently not satisfactory at this high a dimension, although in principle it is conceivable that some variant of the methods of Schmidt [39] could resolve this issue. We will not pursue this question further here.

Orthogonality to nilsequences
In this section we prove Theorem 2.7. We begin by establishing Proposition 2.2, which will be used to establish the "major arc" case of Theorem 2.7.
Proof. (Proof of Proposition 2.2) We adopt the convention that any factor involving the Q-Siegel character χ Siegel is deleted if no such character exists. Any arithmetic progression P ⊂ [N ] can be expressed in the form {N < n ≤ N : n = a (q)} for some 1 ≤ a ≤ q and 0 < N ≤ N ≤ N . By the triangle inequality, it thus suffices to establish the bounds for any 1 ≤ a ≤ q and 0 < N ≤ N .
If q > exp(c 2 log 1/10 N ) for any constant c 2 > 0 then the triangle inequality (and Lemma 2.4) give the desired bounds after adjusting the value of c, so we may assume that q ≤ exp(c 2 log 1/10 N ) for some small absolute constant c 2 . In particular q ≤ Q. Similarly we may assume N ≥ N exp(−c 2 log 1/10 N ).
We begin with ( Therefore, it will certainly suffice from the triangle inequality to show for 1 ≤ a ≤ q ≤ exp(log 3/5 N ) that 9 If (a, q) > 1 then (a, q) will be divisible by some prime p ≤ q < Q, in which case β p = 0 and the claim follows. If instead (a, q) = 1, then β p = 1 for all p < Q not dividing q, and β p = p p−1 for all p < Q dividing q, and the claim (7.3) follows. Now we show (7.4). We may of course assume there is a Q-Siegel zero, in which case (by Definition 2.1(ii)) our task is to show that The right-hand side vanishes if (a, q) > 1, and also vanishes if q > q due to the orthogonality properties of Dirichlet characters. If instead (a, q) = 1 and q = q then the right-hand side is equal to M φ(q) χ Siegel (a), and the claim (7.4) follows. Now we turn to (7.2). We first do an easy reduction to the case of primitive residue classes. Let d := (a, q). Observe that for any natural number n one has µ(dn) = µ(d)µ(n)1 (n,d)=1 and also from Definition 2.1(ii) we similarly have µ Siegel (dn) = µ(d)µ Siegel (n)1 (n,d)=1 and thus ).

(7.5)
Since $d \le q \le \exp(c_2 \log^{1/10} N)$, it thus suffices to establish the pseudopolynomial decay estimate for all 1 ≤ b ≤ d coprime to d (where the constant c here is uniform in $c_2$). Writing $q' := [q/d, d]$, we see from the Chinese remainder theorem that the constraints n = a/d (q/d); n = b (d) are either inconsistent, or constrain n to precisely one primitive residue class $a' \ (q')$ with $(a', q') = 1$. Thus it suffices to show the pseudopolynomial decay bound N exp(−c log 1/10 N ) whenever $1 \le N' \le N$ and $1 \le a' \le q' \le \exp(2c_2 \log^{1/10} N)$ with $(a', q') = 1$.
When there is no Q-Siegel zero the claim is immediate from [34, Exercise 11.3.12] (modified slightly due to our slightly different definition of a Siegel zero). Now suppose that there is a Q-Siegel zero. The result previously cited in [34, Exercise 11.3.12] (again modified slightly to account for our slightly different notion of Siegel zero) then gives the pseudopolynomially accurate asymptotic where $\chi_{q'}(n) := \chi_{\mathrm{Siegel}}(n) 1_{(n,q')=1}$ is the character of modulus $q'$ induced from χ Siegel when $q'$ is a multiple of q Siegel . Note that and thus by the product rule (and the fact that L(β, χ Siegel ) = 0) We conclude that It will thus suffice to establish the corresponding pseudopolynomially accurate asymptotic for µ Siegel . It suffices to establish the variant estimate (7.7) +O(N exp(−c log 1/10 N )) (say) whenever $1 \le a' \le q' \le \exp(O(\log^{1/10} N))$ with $(a', q') = 1$ and $q_{\mathrm{Siegel}} \mid q'$. Indeed, this estimate immediately implies (7.6) when q Siegel divides $q'$, and when q Siegel does not divide $q'$, one splits up the primitive residue class $a' \ (q')$ into primitive residue classes modulo $[q', q_{\mathrm{Siegel}}]$ on the support of µ Siegel , applies (7.7) to each such class, and sums, using the orthogonality of Dirichlet characters to cancel out the main term. We use Definition 2.1 to expand the left-hand side of (7.7) as d∈D µ(d) where D consists of all the factors d of P (Q) with $(d, q') = 1$. As in the proof of (5.18), we can decompose D = D ≤ ∪ D > , where D ≤ are those d ∈ D with d ≤ exp(log 1/2 N ) (say) and D > are those d ∈ D with d > exp(log 1/2 N ). The contribution of D > can be disposed of by the same argument used to prove (5.18), so it remains to show that By Definition 2.1, we have Applying (7.4), as well as Lemma 5.5, we can write this as up to acceptable error terms. Canceling some terms, it thus suffices to show that A standard Euler product calculation using (2.3) gives We return now to the proof of Theorem 2.7. 
Throughout this section we assume that $\epsilon > 0$ is fixed and small in terms of k, and that $c_1(\epsilon) > 0$ is sufficiently small depending on k (and we reserve the right to decrease $c_1(\epsilon)$ later in the argument as necessary). We can assume that N is sufficiently large depending on $c_1(\epsilon)$, k, as the claim is trivial otherwise. Let P , G/Γ, F , g be as in that theorem. We use $m = O((\log\log N)^{c_1(\epsilon)})$ to denote the dimension of G; to avoid some minor notational issues we will assume that m ≥ 2 (as can be achieved trivially by adding some dummy dimensions).
We repeat the arguments from [21], but now performing a more quantitative accounting of the dependence on constants (particularly on the dimension). We first use a dimension-uniform version of the factorization theorem in [22,Theorem 1.19], which we establish in Theorem A.6. We apply that theorem with M 0 := exp(log 1/10− /2 N ) and We can partition the arithmetic progression P into O(M m O(1) ) components P , such that on each of these components the periodic function γ(n)Γ is equal to an M -rational constant γ P Γ, and the smooth sequence ε differs by at most O(M −m C ) from a constant ε P ∈ G of distance at most M from the origin, for a large constant C. We can also normalize γ P to be distance O(M m O(1) ) from the origin. From this and the Lipschitz nature of F , we see (for C large enough) that for n ∈ P . By for all of the progressions P , where the implied constants in the O(1) notation on the right-hand sides of the estimates can be taken to be uniform in for sufficiently small. We introduce the conjugated group G P := γ −1 P G γ P and conjugated polynomial g P := γ −1 P g γ P that takes values in G P , and the normalized function where the integral is with respect to the Haar probability measure on G P /(G P ∩ Γ) (which we can view as a subnilmanifold of G/Γ). Using Proposition 2.2 to dispose of the contribution of the constant G P /(G P ∩Γ) F (ε P γ P ·) (which can be viewed as the "major arc" contribution to these correlations), we are reduced to establishing the bounds The advantages of this reduction are that the function F P is not only 1-bounded and O(M m O(1) )-Lipschitz (with respect to the Mal'cev basis of G P /(G P ∩ Γ), which is a filtered nilmanifold of complexity O(M m O(1) )), but it also has mean zero. By repeating the arguments from [21, p. 
547] and keeping track of the constants, we see that the polynomial sequence g P is totally 1/M A/m O(1) -equidistributed (note that multiplicative factors of exp(exp(m O(1) )) can be absorbed into the M A/m O(1) denominator, and that all the O m (1) exponents appearing in this portion of [21] (and [22]) are polynomial in m). We can use the Gowers uniformity of χ Siegel to obtain the following bound on the Siegel terms, which is acceptable when q Siegel is large enough: Proof. We apply [20, Proposition 11.2], noting that all bounds can be shown to be polynomial in the parameters M, ε with exponents that are polynomial in the dimension m (the argument as stated in that paper appeals to the Stone-Weierstrass theorem and the Arzelà-Ascoli theorem, but this can be replaced by more quantitative approximation results without difficulty, such as [19, Lemma A.9], combined with standard smooth partitions of unity to allow one to work on regions such as the unit cube rather than on the original nilmanifold), to decompose F P (g P (n)Γ) = F 1 (n) + F 2 (n)
where F 1 obeys the dual norm bound for any f : [N ] → C, and F 2 obeys the pointwise bound for all n ∈ [N ]. Here 0 < ε ≤ 1 is a parameter that we are at liberty to choose. By Theorem 2.5, the functions µ Siegel , Λ Siegel − Λ Cramér,Q already have a U k [N ] norm of O(q −c Siegel ); a standard Fourier expansion of 1 P (n) in terms of additive characters and the triangle inequality then show that the truncated versions 1 P µ Siegel , Siegel ) (note that any logarithmic factors can be easily absorbed into the M O(1) factor). Applying the above decomposition as well as Lemma 2.4, we see that and the claim then follows by a suitable choice of ε (noting that the log N factor can be absorbed into the M factor).
Based on this proposition, we may now delete the Q-Siegel zero contributions except in the regime where where C 1 is a large constant depending on k (but not on $\epsilon$) that we are at liberty to choose; we can also assume N to be sufficiently large depending on C 1 (as well as k and $\epsilon$). To simplify the notation we assume henceforth that the Q-Siegel zero exists and obeys (7.9); the remaining cases follow by a simplified version of the same argument that deletes all the steps and terms that treat the contribution of the Q-Siegel zero. It will now suffice to obtain estimates of the form where the implied constants do not depend on C 1 . To treat these sums, we make the following standard Vaughan-type decompositions.
Proof. For Λ we can use the familiar Vaughan identity [42] where a d := bc=d: b,c≤N 1/3 µ(b)Λ(c) and b w := c|w: c>N 1/3 µ(c). The first term is negligible, the second term is a Type I sum (restricting to [N ]), and the fourth term is a Type II sum; the third term can be converted to a convex combination of Type I sums by using the fundamental theorem of calculus to write log n d = log N − To handle Λ Siegel , it suffices (using the estimate P (Q)/φ(P (Q)) (log N ) O(1) coming from Mertens' theorem) to show that the functions (7.10) n → 1 (n,P (Q))=1 and n → n β−1 1 (n,P (Q))=1 χ Siegel (n) can be expressed in the desired form (absorbing all the constant factors into the divisorbounded coefficients). But if λ + d , λ − d are the upper and lower linear sieve coefficients, respectively, with level D = Q (log N ) 1/2 and sifting parameter Q, one can write 1 (n,P (Q)) − d≤D λ ± d 1 d|n N exp(−10 log 1/2 N ) (say). Therefore, one can express (7.10) as a Type I sum plus a negligible sum error, and by multiplying by χ Siegel one can then express n → 1 (n,P (Q))=1 χ Siegel (n) as a twisted Type I sum plus a negligible sum error. Indeed in these cases one can lower the N 2/3 threshold on d to something much smaller, such as exp(O(log 3/5 N )). Finally, the n β−1 weight can be handled using the fundamental theorem of calculus identity (5.12). Now we turn to µ Siegel = µ local * µ . From the previous discussion and Lemma 5.5, µ is already expressible as a convex combination of twisted Type I sums plus a negligible error (where d can be constrained to be at most exp(O(log 3/5 N ))). We can then convolve by µ local 1 [exp(5 log 1/2 N )] and conclude that µ local 1 [exp(5 log 1/2 N )] * µ is also expressible as a convex combination of twisted Type I sums plus a negligible error (note that the values of d encountered stay well below the threshold N 2/3 ). 
Finally, the remaining term µ local (1 − 1 [exp(5 log 1/2 N )] ) * µ can be seen to be negligible by the same arguments used to dispose of the D > contributions to (5.18).
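As a hedged sanity check on the Vaughan identity used for $\Lambda$, the following Python sketch verifies numerically the standard form of the identity with cutoffs $U, V$: for all $n \ge 1$, $\Lambda(n) = \Lambda(n) 1_{n \le V} + \sum_{b \mid n,\, b \le U} \mu(b) \log\frac{n}{b} - \sum_{bc \mid n,\, b \le U,\, c \le V} \mu(b)\Lambda(c) + \sum_{bc \mid n,\, b > U,\, c > V} \mu(b)\Lambda(c)$. The paper takes $U = V = N^{1/3}$; the small cutoffs and the function names below are illustrative choices only, and the ordering of the four terms here need not match the ordering used in the proof.

```python
from math import log, isclose

def mobius(n):
    """Moebius function by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # p^2 divided the original n
            result = -result
        p += 1
    return -result if n > 1 else result

def von_mangoldt(n):
    """Lambda(n) = log p if n is a power of the prime p, and 0 otherwise."""
    if n < 2:
        return 0.0
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return log(p) if n == 1 else 0.0
        p += 1
    return log(n)  # n itself is prime

def vaughan_rhs(n, U, V):
    """Right-hand side of Vaughan's identity with cutoffs U and V."""
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    term1 = von_mangoldt(n) if n <= V else 0.0
    term2 = sum(mobius(b) * log(n / b) for b in divisors if b <= U)
    # Sums over pairs (b, c) with b*c dividing n, i.e. c dividing n/b.
    term3 = sum(mobius(b) * von_mangoldt(c)
                for b in divisors if b <= U
                for c in divisors if c <= V and (n // b) % c == 0)
    term4 = sum(mobius(b) * von_mangoldt(c)
                for b in divisors if b > U
                for c in divisors if c > V and (n // b) % c == 0)
    return term1 + term2 - term3 + term4
```

Comparing `vaughan_rhs(n, 6, 6)` with `von_mangoldt(n)` for $n \le 300$ (up to floating point error) gives a quick check that the decomposition has been transcribed correctly.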
The contributions of the negligible sums to the previous estimates are acceptable from the triangle inequality. By a further application of the triangle inequality, it thus suffices to establish the bound whenever f is a Type I sum, a twisted Type I sum, or a Type II sum. The Type I and Type II sums were already essentially treated in [21, §3], and it turns out that the methods also easily extend to cover the twisted Type I case. We briefly review the argument as follows. We begin with the twisted Type I case; the Type I case is treated by a simplification of the argument that deletes the role of the Q-Siegel character, and is omitted here (and in any case would follow closely the treatment in [21, §3]). Suppose that we have (7.12) n∈P f (n)F P (g P (n)Γ) ≥ δN for some 0 < δ < 1 M q Siegel and a twisted Type I sum f . By the definition of such sums and the triangle inequality, this implies that and hence by dyadic decomposition there exists 1 ≤ D ≤ N 2/3 such that D≤d≤2D n∈P ∩dZ Since the inner sum is O(N/D), we conclude that n∈P ∩dZ ]. For such a d, we partition into residue classes modulo dq Siegel and use the triangle inequality to conclude that for some 1 ≤ N d ≤ N/D and 1 ≤ a d ≤ q Siegel (note that all q Siegel factors can be absorbed into the δ O(1) factor). Applying Theorem A.3, we can then find a horizontal character η d of G with where the · C ∞ is defined in [22,Definition 2.7]. The parameter a d is annoying, but we can remove 11 it by applying [22,Lemma 8.4] to conclude that for some η d that continues to obey (7.13). The total number of such η d is O(δ − exp(m O(1) ) ). Thus by the pigeonhole principle, we can find one such horizontal character η such that ]. If we expand out the polynomial (7.14) η • g P (q Siegel n) = β k n k + · · · + β 0 mod 1 for some real numbers β 0 , . . . , β k , then by applying [21, Lemma 3.2] we conclude that there is a positive integer q = O(1) such that for all j = 0, . . . 
, k, where x R/Z denotes the distance to the nearest integer. Applying a Waring-type result from [21, Lemma 3.3], we then have for each j = 0, . . . , k that . Applying Vinogradov's lemma [21,Lemma 3.4], and clearing denominators, we then conclude that there is a positive integer K δ exp(m O(1) ) such that for all j = 0, . . . , k, and thus by (7.14) On the other hand, g P is totally 1/M A/m O(1) -equidistributed. Arguing as in [21, §3] and noting that all exponents of the form O m (1) are in fact polynomial in m, these two facts are incompatible unless which (when combined with the constraint δ ≤ 1 M q Siegel ) gives the desired bound (7.11). For the Type II case, we can again start by assuming (7.12) for some 0 < δ < 1 M and some Type II sum f . The contribution of those n less than δ C N for a large absolute constant C can easily be seen to be negligible, so one can assume without loss of generality that |P | lies in the interval [δ C N, N ]. One has for some divisor-bounded a d , b w , and then after some dyadic decomposition and Cauchy-Schwarz (cf., [19,Proposition 7.2]) one can find One now repeats the arguments used to treat the Type II case in [21, §3] more or less verbatim (noting that all exponents are of order exp(m O(1) ) at worst) to obtain a contradiction to the total 1/M A/m O(1) -equidistribution of g P unless (7.15) holds, and we again obtain (7.11) as desired. This concludes the proof of Theorem 2.7.

Applying densification
We now use densification methods to establish a general transference principle (which seems of independent interest) that converts inverse theorems for the Gowers norms for 1-bounded functions to inverse theorems for Gowers norms for ν-bounded functions for various "pseudorandom" weights ν. Our pseudorandomness condition will be relatively mild (a U 2k estimate on ν − 1), and the losses in the transference argument will only be polynomial in nature. However, one drawback of the theorem is that the input inverse theorem must also have polynomial bounds.
In Subsection 8.2, we will use Theorem 8.1 to complete the proof of Theorem 2.6 in the von Mangoldt case.

Transferring inverse theorems
Theorem 8.1 (Transference principle for U k inverse theorems). Let k ≥ 2 be fixed. Let G = (G, +) be a finite abelian group. Suppose that for every 0 < δ ≤ 1/2 there is a family Ψ δ of 1-bounded functions ψ : G → C, non-increasing in δ and closed under translations and complex conjugation, obeying the following U k inverse theorem: for some B > 0. Let C 0 be sufficiently large depending on k, let 0 < δ ≤ 1/2, and let ν : G → R + be a weight with Then there exists ψ 1 , . . . , We remark that this theorem strengthens a similar result in [8], in that the class Ψ δ is allowed to be more general than the space of "dual functions", and the bounds are polynomial in nature rather than qualitative.
We now begin the proof of this theorem. Let the notation and hypotheses be as in Theorem 8.1. From (8.2) we have where f 0 = f , and all the other f ω : G → C are either equal to f or its complex conjugate.
The key step is Proposition 8.2 (Densification of a single factor). Suppose that the bound (8.3) holds for some $(\nu+1)$-bounded functions $f_\omega$, $\omega \in \{0,1\}^k$. Let $\omega_0 \in \{0,1\}^k$. Then we have Indeed, after applying this proposition $2^k - 1$ times starting with (8.3), we conclude that for some $\psi_\omega \in \Psi_{\delta^{O(1)}}$ for all $\omega \in \{0,1\}^k \setminus \{0\}^k$ (one can use the non-increasing nature of $\Psi_\delta$ in $\delta$ to make the implied constant in $O(1)$ uniform in $\omega$). In particular, by the pigeonhole principle there exist $h_1, \dots, h_k \in G$ such that giving Theorem 8.1 thanks to the translation and conjugation invariance of $\Psi_{\delta^{O(1)}}$. It remains to prove Proposition 8.2. By relabeling we may assume $\omega_0 = 0^k$. By replacing $\nu$ with $\frac{\nu+1}{2}$ (and adjusting $C_0$ if necessary), and then rescaling by various factors of 2, we may assume that the $f_\omega$ are $\nu$-bounded rather than $(\nu+1)$-bounded. Now we adapt the arguments of Conlon-Fox-Zhao [4]. We have Since $f_{0^k}$ is $\nu$-bounded, we conclude from Cauchy-Schwarz that we conclude that Next we claim that We can write the left-hand side of (8.5) as where we have for $\omega \in \{0,1\}^k \setminus \{0\}^k$, and $f_\omega(x) := 1$ for all other $\omega \in \{0,1\}^{2k}$ not covered by the preceding definitions. By the Gowers-Cauchy-Schwarz inequality (4.1), we thus have and the claim now follows from (8.1) and the triangle inequality. From (8.4), (8.5) and the triangle inequality we conclude (for $C_0$ large enough) that The function $F$ is not quite bounded. However, as the $f_\omega$ are all $\nu$-bounded, we certainly have the pointwise bound $|F| \le D\nu$, where $D\nu$ is the dual function We observe the moment estimates for $j = 0, 1, 2$. We just prove this for $j = 2$, as the $j = 0, 1$ claims are similar (and easier). We can expand for $\omega \in \{0,1\}^k \setminus \{0\}^k$, and $g_\omega(x) := 1$ for all other $\omega \in \{0,1\}^{2k}$ not covered by the preceding definitions. We split each $g_\omega$ that is of the form $\nu$ into $1$ and $\nu - 1$.
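The effect of splitting each factor $\nu$ into $1$ and $\nu - 1$ can be sketched as follows. This is a generic illustration rather than a reconstruction of the omitted displays: the index set $A \subseteq \{0,1\}^{2k}$ (the positions carrying a factor of $\nu$) and the quantity $\Lambda_A$ are names introduced here for convenience, and we assume $\|\nu - 1\|_{U^{2k}(G)} \le 1$.

```latex
% Hypothetical notation: A is the set of cube vertices carrying a factor \nu.
\[
  \Lambda_A := \mathbb{E}_{x \in G,\ h \in G^{2k}} \prod_{\omega \in A} \nu(x + \omega \cdot h)
  = \sum_{B \subseteq A} \mathbb{E}_{x \in G,\ h \in G^{2k}} \prod_{\omega \in B} (\nu - 1)(x + \omega \cdot h).
\]
% The term B = \emptyset equals 1.  For B \neq \emptyset, apply Gowers--Cauchy--Schwarz
% with the function 1 (of U^{2k}(G) norm 1) at every vertex outside B:
\[
  \Bigl| \mathbb{E}_{x,h} \prod_{\omega \in B} (\nu - 1)(x + \omega \cdot h) \Bigr|
  \le \|\nu - 1\|_{U^{2k}(G)}^{|B|}.
\]
% Summing over the nonempty B and using \|\nu - 1\|_{U^{2k}(G)} \le 1:
\[
  |\Lambda_A - 1| \le \bigl(1 + \|\nu - 1\|_{U^{2k}(G)}\bigr)^{|A|} - 1
  \le 2^{|A|}\, \|\nu - 1\|_{U^{2k}(G)}.
\]
```

Since $|A| \le 2^{2k}$ depends only on $k$, the loss $2^{|A|}$ is absorbed into the implied constants.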
Applying the triangle inequality (1.2) and the Gowers-Cauchy-Schwarz inequality (4.1), we can thus write and the claim follows from (8.1). From (8.7) we have Hence by (8.6) and the triangle inequality we have We rewrite the left-hand side as for $\omega \in \{0,1\}^k \setminus \{0\}^k$. The $f^*_\omega$ all have $U^k(G)$ norm of at most $\|\nu\|_{U^k(G)} \ll 1$ thanks to (8.1), hence by the Gowers-Cauchy-Schwarz inequality (4.1) one has Applying the hypothesis in Theorem 8.1(i), we conclude that there exists . On the other hand, from Cauchy-Schwarz we have thanks to (8.8), (8.9). Hence by the triangle inequality (for $C_0$ large enough) we have But this rearranges to give the conclusion of Proposition 8.2. The proof of Theorem 8.1 is now complete. We now combine this theorem with Manners' inverse theorem to obtain Theorem 8.3 (Transferred inverse theorem). Let $0 < \delta < 1/2$, and let ν : for some constant $C_0$ that is sufficiently large depending on $k$. Let $f : [N] \to \mathbb{C}$ be a $\nu$-bounded function such that Proof. As in the proof of Theorem 6.2, we pick a prime $N'$ with $10N \le N' \le 20N$ and extend $f$ by zero to $\mathbb{Z}/N'\mathbb{Z}$; we also extend $\nu$ by $1$ to $\mathbb{Z}/N'\mathbb{Z}$, and observe that $\|\nu - 1\|_{U^{2k}(\mathbb{Z}/N'\mathbb{Z})} \ll \delta^{C_0}$. To apply Theorem 8.1, we will need an inverse theorem that has polynomial correlation bounds. This is not directly provided by Theorem 6.2; however, such an inverse theorem does appear in the work of Manners [32]. Indeed, we see from [32, Lemmas 5.4.1, 5.5.1] (applying [32, Lemma 5.5.1] inductively, as in [32, p. 102]), that if $f : \mathbb{Z}/N'\mathbb{Z} \to \mathbb{C}$ is 1-bounded with $\|f\|_{U^k(\mathbb{Z}/N'\mathbb{Z})} \ge \delta$, then there exists a 1-bounded function $\psi : \mathbb{Z}/N'\mathbb{Z} \to \mathbb{C}$ with the polynomial correlation bound such that $\psi$ is of the form with $T \ll \exp(\exp(\delta^{-O(1)}))$, the $\alpha_i$ complex numbers with $|\alpha_i| \le 1$, and for each $i$, $G_i/\Gamma_i$ is a filtered nilmanifold of degree $k - 1$, dimension $O(\delta^{-O(1)})$, and complexity at most $\exp\exp(O(1/\delta^{O(1)}))$, $F_i : G_i/\Gamma_i \to \mathbb{C}$ is a 1-bounded Lipschitz function of Lipschitz constant at most $\exp\exp(O(1/\delta^{O(1)}))$, and $g_i : \mathbb{Z} \to G_i$ is a polynomial map with $g_i \Gamma_i$ periodic with period $N'$. Let us call the collection of all such $\psi$ (with appropriate choices of implied constants) F δ ; note that this collection is invariant under translation and complex conjugation. We may now apply Theorem 8.1 to the $\nu$-bounded function $f$ in the hypotheses of this theorem, and conclude that there exist ψ 1 , . . . , Applying the pigeonhole principle, and taking the tensor product of various nilsequences, we conclude a correlation

Completing the proof of the main theorem
Now we can show how the bound (2.11) in Theorem 2.6 follows from the bound (2.13) given by Theorem 2.7. This will complete the proof of Theorem 2.6 and hence that of Theorem 1.4. We begin with an application of the "W -trick". Let W := P (log ε N ), where ε > 0 is a small constant depending on k to be chosen later; we may assume that N is sufficiently large depending on ε. Observe that the set {n ∈ [N ] : (n, W ) = 1} contains the entire support of Λ Siegel , as well as the support of Λ except for O(log O(1) N ) numbers which give a negligible contribution to the U k [N ] norm. Thus it will suffice to show the doubly logarithmic decay bound (log log N ) −c .
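The count of integers excluded by the $W$-trick can be checked directly. The following is a short sketch, assuming that $P(w)$ denotes the product of the primes less than $w$ and writing $w := \log^{\varepsilon} N$; the symbol $E$ for the excluded set is introduced here only for illustration.

```latex
% E: integers in the support of \Lambda sharing a factor with W (hypothetical notation).
\[
  E := \{ n \in [N] : \Lambda(n) \neq 0,\ (n, W) > 1 \}
     = \{ p^j \le N : p < w,\ j \ge 1 \}.
\]
% Each prime p < w admits at most \log N / \log 2 exponents j with p^j \le N.
\[
  |E| \le \pi(w) \cdot \frac{\log N}{\log 2}
      \le \frac{\log^{1+\varepsilon} N}{\log 2}
      = O(\log^{O(1)} N).
\]
```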
By Corollary 4.2, this will follow once we show that Fix b. Now we use a quantitative variant of the well known fact (see [18]) that φ(W ) W Λ − 1 can be bounded by a pseudorandom weight, but now observing that we can attain logarithmic accuracy in the pseudorandomness bound.
is $C\nu$-bounded for some $C = O(1)$ depending only on $k$ and some ν : Proof. By the triangle inequality (1.2), it suffices to establish this claim for $\frac{\phi(W)}{W} \Lambda(W \cdot + b)$ and $\frac{\phi(W)}{W} \Lambda_{\mathrm{Siegel}}(W \cdot + b)$ separately. In the latter case, we see from Definition 2.1 that and the claim in this case follows from Corollary 5.3. Now we turn to $\frac{\phi(W)}{W} \Lambda(W \cdot + b)$. Here we can basically follow the analysis of Goldston-Yıldırım correlation estimates from [20, Appendix D], though with a slightly more careful accounting in order to obtain suitable estimates. We choose a smooth function $\chi : \mathbb{R} \to \mathbb{R}_{\ge 0}$ supported on $[-2, 2]$ that equals $1/2$ on $[-1, 1]$ with $\int_1^2 \chi'(x)^2 \, dx = 1$. We set $R := N^{\gamma}$ for some sufficiently small constant $0 < \gamma < 1/2$ depending only on $k$ (and independent of $\varepsilon$). Following [20, Appendix D], we introduce the truncated divisor sum From [20, Lemma D.2] and the choice of $\chi$, the sieve factor $c_{\chi,2} = \int_0^\infty |\chi'(x)|^2 \, dx$ associated to this divisor sum via [20, Definition D.1] is simply (8.12) $c_{\chi,2} = 1$.
We then set Let $\Lambda'$ be the restriction of $\Lambda$ to those primes greater than $R^2$. It is not difficult to see that the error $\frac{\phi(W)}{W} \Lambda(W \cdot + b) - \frac{\phi(W)}{W} \Lambda'(W \cdot + b)$ (supported on primes up to $R^2$, as well as powers of primes, and bounded in size by . By the definition of $\nu$ in (8.13) and the fact that $\chi(0) = 1/2$, we easily verify the pointwise bound for all $n$. It will thus suffice to show the logarithmic decay bound Expanding out the left-hand side, it suffices to show that for all subsets $S$ of $\{0,1\}^{2k}$, where $\Omega \subset \mathbb{R}^{2k+1}$ is the convex body Suppose that we directly apply the estimate 12 in [20, Theorem D.3], using (8.12) to eliminate the role of the sieve factors. Then we can express the left-hand side of (8.14) as where $\beta_p$ are the usual local factors $X$ is the quantity $X := \sum_{p \in P} p^{-1/2}$ and $P$ is the set of primes $p$ which are "exceptional" in the sense that at least two of the affine forms for $\omega \in \{0,1\}^{2k}$ are linearly dependent modulo $p$.
Since $W = P(\log^{\varepsilon} N)$, one has $\beta_p = \left(\frac{p}{p-1}\right)^{\#S}$ for $p < \log^{\varepsilon} N$, while from the inclusion-exclusion calculation used in the proof of Proposition 5.2 one has $\beta_p = 1 + O(1/p^2)$ for $p \ge \log^{\varepsilon} N$. Thus Since $\mathrm{vol}(\Omega) \ll (N/W)^{2k+1}$, the main term in (8.15) is acceptable. If it were not for the $e^{O(X)}$ term, the error term in (8.15) would similarly be acceptable; unfortunately, as defined in [20, Appendix D], the exceptional primes consist precisely of all the primes $p$ up to $\log^{\varepsilon} N$, and this would ostensibly lead to an unacceptably large error term in (8.15). However, an inspection of the proof of [20, Proposition D.4] reveals that the $e^{O(X)}$ loss arises from three sources. One is from the crude bound  , and the error term in (8.15) is now also acceptable, giving the claim.
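The product of the local factors described in this proof can be evaluated as in the following sketch. It assumes that $P(w)$ denotes the product of the primes less than $w$, with $w := \log^{\varepsilon} N$, so that $W/\phi(W) = \prod_{p < w} \frac{p}{p-1}$; the final shape is offered as a plausible reading of the computation, not as a quotation of the omitted display.

```latex
% Split the Euler product at w = \log^{\varepsilon} N.
\[
  \prod_p \beta_p
  = \prod_{p < w} \Bigl( \frac{p}{p-1} \Bigr)^{\#S}
    \prod_{p \ge w} \Bigl( 1 + O\bigl(p^{-2}\bigr) \Bigr)
  = \Bigl( \frac{W}{\phi(W)} \Bigr)^{\#S}
    \exp\Bigl( O\Bigl( \sum_{p \ge w} p^{-2} \Bigr) \Bigr).
\]
% The tail is controlled by \sum_{p \ge w} p^{-2} \le \sum_{n \ge w} n^{-2} \ll 1/w.
\[
  \prod_p \beta_p
  = \bigl( 1 + O(\log^{-\varepsilon} N) \bigr) \Bigl( \frac{W}{\phi(W)} \Bigr)^{\#S}.
\]
```

The factor $(W/\phi(W))^{\#S}$ is exactly what the normalisation $\frac{\phi(W)}{W}$ attached to each of the $\#S$ copies of the weight is designed to cancel.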
Proof of Theorem 2.6 for Λ. Combining Proposition 8.4 with (the contrapositive of) Theorem 8.3, we see that it suffices to show (for a sufficiently small constant $c_1 > 0$) that one has the pseudopolynomial bound whenever $G/\Gamma$ is a (filtered) nilmanifold of degree $k - 1$, dimension at most $(\log\log N)^{c_1}$ and complexity at most $\exp(\log^{c_1} N)$, $F : G/\Gamma \to \mathbb{C}$ is a 1-bounded Lipschitz function of Lipschitz constant at most $\exp(\log^{c_1} N)$, and $g : \mathbb{Z} \to G$ is a polynomial map. Using [ for some $c > 0$ independent of $\varepsilon$, and the claim (8.21) then follows for $\varepsilon$ small enough. This (finally!) completes the proof of Theorem 2.6, and hence that of Theorem 1.4.
We can now quickly deduce Corollary 1.5 from our main theorem.
Proof of Corollary 1.5. Let w = (log log N ) 1/2 . By Theorem 2.5, we have log −c N. for any function f and any ξ ∈ R, we deduce from (8.22) that Let w = log ε N where ε is as in Subsection 8.2. Also let W = p≤w p. Then by Corollary 4.2 and (8.11) we have Now the claim follows by combining this with (8.23), (8.24) and applying the triangle inequality for Gowers norms.

Quantitative linear equations in primes result
In this section we sketch the derivation of Theorem 1.6 from Theorem 1.4. The arguments follow those in [20] extremely closely, and we will assume familiarity with those arguments in this section.
In [20, §4], the qualitative version of Theorem 1.6 was derived from [20, Theorem 4.5] using some elementary linear algebra and convex geometry. The same arguments, replacing all qualitative decay terms with doubly logarithmic ones instead, show that Theorem 1.6 will follow if one shows the following. Next, we apply the W -trick arguments in [20, §5], setting w equal 13 to (log log N ) η for a sufficiently small η > 0 depending on s, d, t rather than the more conservative choice of log log log N . These arguments then reduce matters to showing where W := P (w) and Λ is the restriction of Λ to the primes. From Corollary 1.5, we have the doubly logarithmic bound s,η (log log N ) −cη for some c > 0 depending only on s (and assuming as we may that N is sufficiently large depending on s, d, t, η). On the other hand, a routine modification of Proposition 8.4 (see also [20,Proposition 6.4]) reveals that for any D, the function 1 + Λ b 1 ,W + · · · + Λ bt,W on the interval [N 3  N ) −c ) type terms, noting that all the functions denoted κ in that appendix can be taken to be polynomial in nature; we leave the details to the interested reader.
Remark 9.3. It seems likely that one can improve Theorem 1.6 further, by allowing the parameter L to be as large as (log log N ) c with uniform control on error terms; one may even be able to handle significantly larger values of the linear coefficients ψ i than this by incorporating the various methods used in this paper. We will not pursue such refinements here, however.

Arithmetic progressions with shifted prime difference
In this section we prove Theorem 1.8.

13 Note that for this choice of w, the prime number theorem in arithmetic progressions of modulus W = P (w) has an effective error term with good decay, as we can use the effective lower bounds on L(1, χ) in this case rather than Siegel's theorem. It should however be possible to work with larger choices of w by incorporating the contribution of a Q-Siegel zero, as is done elsewhere in this paper.
Proof of Theorem 1.8. In what follows, let $\Lambda'$ stand for the von Mangoldt function restricted to the primes. Let $A \subset [N]$ be any set with $|A| \ge \delta N$ and $\delta = (\log\log\log\log N)^{-c}$ for small enough $c > 0$ depending on $k$. Let $w = (\log\log N)^{1/2}$, and let $W = \prod_{p \le w} p$. By the pigeonhole principle, we can pick $1 \le b \le W$ such that $A' := \{n : Wn + b \in A\}$ has size $\ge \delta N/W$. Then the count of $k$-term arithmetic progressions in $A'$ with shifted prime difference is Note that we have the trivial bound $\sum_{n \le N} |\Lambda(n) - \Lambda'(n)| \ll N^{1/2} \log N$. Using this and our quantitative Gowers uniformity result in the form of Corollary 1.5, we have for some $c > 0$ depending on $k$. Therefore, by applying the generalized von Neumann theorem for pseudorandomly majorized functions [20, Theorem 7.1] (with similar remarks on quantitative error terms as in the proof of Theorem 1.6), we see that $T$ is equal to We have $c(k, \delta) \gg \exp(-\exp(\delta^{-C}))$ for some $C \ge 1$ (depending on $k$) by Gowers's bound $N_k(\rho) \ll \exp(\exp(\rho^{-C}))$, proved in [12]. Now, if $c$ is chosen small enough in the definition of $\delta$, we have $c(k, \delta) \gg (\log\log N)^{-o(1)}$, which proves the statement of the theorem for $k \ge 5$. For $k = 4$, the same argument works, except that we now use the bound $N_4(\rho) \ll \exp(\rho^{-C})$ from [23] to get $c(4, \delta) \gg \exp(-C\delta^{-C})$, which enables taking $\delta = (\log\log\log N)^{-c}$ for some $c > 0$. Finally, for $k = 3$, using the very recent bound [29] $N_3(\rho) \ll \exp((\log(1/\rho))^C)$ we have $c(3, \delta) \gg \exp(-C(\log(1/\delta))^C)$, which enables taking $\delta = \exp(-(\log\log\log N)^c)$ for some $c > 0$.
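The choice of $\delta$ in the general case (where Gowers's bound is used) can be verified by the following short computation. The iterated-logarithm notation $\log_j N$ (with $\log_1 N = \log N$ and $\log_{j+1} N = \log \log_j N$) is introduced here for brevity, and the constant $c$ is assumed to satisfy $cC < 1$.

```latex
% Take \delta = (\log_4 N)^{-c} with cC < 1, so that \delta^{-C} = (\log_4 N)^{cC}.
\[
  \exp\bigl(\delta^{-C}\bigr)
  = \exp\bigl( \log_4 N \cdot (\log_4 N)^{cC - 1} \bigr)
  = (\log_3 N)^{(\log_4 N)^{cC - 1}}
  = (\log_3 N)^{o(1)}.
\]
% In particular \exp(\delta^{-C}) = o(\log_3 N), and therefore
\[
  c(k, \delta) \gg \exp\bigl( -\exp(\delta^{-C}) \bigr)
  \ge \exp\bigl( -o(\log_3 N) \bigr)
  = (\log_2 N)^{-o(1)}.
\]
```

The same computation with one fewer exponential explains why the $k = 4$ case tolerates $\delta = (\log_3 N)^{-c}$.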

A Quantitative Leibman theory with explicit dimension dependence
In this appendix we refine the equidistribution theory on nilmanifolds from [22], tracking more carefully the dependence on dimension m (but allowing all constants to depend on the degree d, which in our context will be equal to k − 1). The key point is that all bounds will be at most double exponential in this dimension parameter, basically because the arguments rely on applying the Cauchy-Schwarz inequality (or variants such as the van der Corput inequality) a number of times that is polynomial in the dimension. (Many of the estimates here require only single exponential dependence on m at worst, but the induction on dimension we use only closes if we allow double exponential dependence.) In order to improve this double exponential dependence it would seem necessary to adopt a different approach to equidistribution that relies less heavily on repeated applications of the Cauchy-Schwarz inequality. We freely use the notation from [22], and let m be a dimensional parameter. To conveniently track bounds that depend in double-exponential fashion on the dimension we adopt the following notation. For any 0 < δ < 1/2 let poly m (δ) be any quantity lower . Let m ≥ m * ≥ 0 be integers, 0 < δ < 1/2, N ≥ 1. Let G/Γ be a filtered nilmanifold of degree d, nonlinearity dimension m * (defined in [22, Section 7]), and complexity at most 1/δ. Let g : Z → G be a polynomial sequence.
If (g(n)Γ) n∈ [N ] is not δ-equidistributed then there exists a horizontal character η with 0 < |η| ≤ δ − exp((m+m * ) C d ) such that where C d is a sufficiently large constant depending only on d.
We now prove this theorem. We assume inductively that the claim has already been established for smaller values of d, or for the same value of d and smaller values of m * . Henceforth we refine the poly m notation by permitting the implied constants to depend on the constant C d−1 , but not on C d .
By repeating the reductions after [22, (7.2)] we may assume that g(0) = id G and |ψ(g(1))| ≤ 1, where ψ : G → R m is the Mal'cev coordinate map. Continuing the argument down to [22, (7.8)] we conclude that |E n∈[N ] F h (g h (n)Γ )| poly m (δ) with F h , g h , Γ defined as in [22]. One can rather tediously verify that all the estimates in [22, Appendix A] can be refined by replacing all estimates of the form X m Q Om(1) Y with X poly m (Q)Y . As a consequence we can refine [22, Lemma 7.4] (by exact repetition of the proof) to Lemma A.4 (Rationality bounds for the relative square). There is a poly m (1/δ)-rational Mal'cev basis X for G /Γ adapted to the filtration (G ) • with the property that ψ X (x, x ) is a polynomial of degree O(1) with rational coefficients of height poly m (1/δ) in the coordinates ψ(x), ψ(x ). With respect to the metric d X we have F h Lip poly m (1/δ) uniformly in h.
Continuing the arguments down to [22,Lemma 7.5], one can find horizontal characters η 1 : G → R/Z , η 2 : G 2 → R/Z with η 2 annihilating [G, G 2 ] and |η 1 |, |η 2 | poly m (1/δ) such that the character η : G → R/Z defined by η(g , g) := η 1 (g) + η 2 (g g −1 ) is such that η • g h C ∞ ([N ]) poly m (1/δ) for poly m (δ)N values of h ∈ [N ]. Continuing the argument down to [22, (7.16)], and using the induction hypothesis for Theorem A.3 (with d replaced by d − 1, and m, m * replaced by quantities not exceeding 2m), we can find 1 ≤ q poly m (1/δ) such that , a subgroup G ⊂ G which is M -rational with respect to X , and a decomposition g = εg γ with ε, g , γ : Z → G polynomials such that for any 2 < q < ∞, and the claim now follows from the circle method and Hölder's inequality. Finally, for (iii), we see from Proposition 1.2 and (1.2) that we may assume that w grows sufficiently slowly in N , and then the bounds in (iii) follow easily from the main theorems in [20] as well as Corollary 4.2, after inserting the resolution of the inverse conjecture for the Gowers norms (first proven in [25]) and the strong orthogonality of the Möbius function to nilsequences (first proven in [21]).
Remark B.1. An alternate approach to (1.4) proceeds by comparing Λ(n) = − d|n µ(d) log d first with a truncated divisor sum Λ (n) := − d|n:d≤N c 1 µ(d) log d for some small absolute constant c 1 > 0, and establishing the strongly logarithmic estimate from the circle method (here we can use a Plancherel bound analogous to (B.1) that loses a factor of log N , thus avoiding the need to invoke the restriction theory from [16]), and the logarithmic estimate log −c N from sieve theory with (say) w = log 1/100 N , and then applying the triangle inequality (1.2); we leave the details to the interested reader. In this paper we found the Cramér models Λ Cramér,w to be slightly more convenient technically to work with than the truncated divisor sum model Λ , and therefore made no further use of Λ here.