Minimax lower bounds for function estimation on graphs

We study minimax lower bounds for function estimation problems on large graphs when the target function is smoothly varying over the graph. We derive minimax rates in the context of regression and classification problems on graphs that satisfy an asymptotic shape assumption, under a smoothness condition on the target function, both formulated in terms of the graph Laplacian.

1. Introduction. In recent years there has been substantial interest in high-dimensional estimation and prediction problems on large graphs. These can in many cases be seen as high-dimensional or nonparametric regression or classification problems in which the goal is to learn a "smooth" function on a given graph. Various methods have been proposed to deal with such problems, motivated by a variety of applications. Any sensible method employs some form of regularisation that takes the geometry of the graph into account. Examples of methods that have been considered include penalised least squares regression using a Laplacian-based penalty (e.g. Ando and Zhang (2007); Belkin et al. (2004); Kolaczyk (2009); Smola and Kondor (2003); Zhu and Hastie (2005)), penalisation using the total variation norm (e.g. Sadhanala et al. (2016)) and Bayesian regularisation (e.g. Hartog and van Zanten (2016), Bertozzi et al. (2017), Kirichenko and van Zanten (2017)).
There exist only a few papers that study theoretical aspects of the performance of nonparametric estimation procedures on graphs. Early references are Belkin et al. (2004), in which a theoretical analysis of a Tikhonov regularisation method is conducted in terms of algorithmic stability, and Johnson and Zhang (2007) and Ando and Zhang (2007), who consider subsampling schemes for estimating a function on a graph. More recently, convergence rates have been obtained by Sadhanala et al. (2016) in the context of regression on a regular grid using total variation penalties and by Kirichenko and van Zanten (2017) for nonparametric Bayes procedures for regression and classification on more general graphs. The paper Sadhanala et al. (2016) also establishes minimax lower bounds for regression problems on grids.
In this paper we derive new minimax results for regression and binary classification on graphs, exhibiting the best possible rates that can be attained uniformly over certain classes of "smooth" functions on graphs. We consider simple undirected graphs that satisfy an assumption on their "asymptotic geometry", formulated in terms of the graph Laplacian. This assumption, which is recalled in the next section, was introduced in Kirichenko and van Zanten (2017). The geometry assumption attaches a parameter r to a graph which essentially describes how the eigenvalues of its Laplacian behave. It was illustrated in Kirichenko and van Zanten (2017) that it is satisfied for many graphs of interest. Theoretically it can be shown to hold for instance for regular grids and tori of arbitrary dimensions. Moreover, for a given graph it can be verified empirically whether the assumption is reasonable and what the corresponding graph parameter r is. In Kirichenko and van Zanten (2017) this was done both for simulated "small world" graphs and for real protein-protein interaction graphs. The geometry parameter r appears in the minimax lower bounds we derive. The other key ingredient is the regularity β of the function that is being estimated, defined in a suitable manner. We introduce a Sobolev-type smoothness condition on the target function, using the graph Laplacian again to quantify smoothness. The geometry of the underlying graph and the smoothness of the target function together determine the minimax rate through the geometry parameter r and the smoothness parameter β. We have chosen our setup and normalisations in such a way that the optimal rates over balls of smooth functions that we obtain are of the usual form n^{−β/(r+2β)}. This shows that the geometry parameter r can be interpreted as some kind of "dimension" of the graph. In the regular grid case it is indeed precisely the dimension of the grid, as shown in Kirichenko and van Zanten (2017). However, the result holds for much more general graphs as well. In particular, the geometry parameter r does not need to be an integer.
For the sake of completeness we give two-sided results, that is, we also exhibit estimators that achieve the lower bounds, showing that the bounds are tight. However, these estimators are non-adaptive, in the sense that they depend on the smoothness parameter β, which will typically not be accessible in realistic settings. More interestingly perhaps, the lower bounds match the upper bounds we obtained in Kirichenko and van Zanten (2017). This shows that the nonparametric Bayes procedures we proposed in the latter paper are smoothness-adaptive and rate-optimal. We note however that the procedure exhibited in Kirichenko and van Zanten (2017) that is adaptive over the whole range of regularities β > 0 is only rate-optimal up to a logarithmic factor. It might be of interest to study the possibility of procedures that achieve exactly the correct rate.
In the next section we introduce the general framework. Specifically, we define the geometry condition on the graph and the smoothness condition on the target function. In Section 3 we describe the regression and classification problems on a graph and present our results on the minimax rates for those problems. The mathematical proofs are given in Section 4.

2. Setting. Let G be a connected simple undirected graph with vertices labelled 1, . . ., n. Let A be its adjacency matrix, i.e. A_{ij} is 1 or 0 according to whether or not there is an edge between vertices i and j. Let D be the diagonal matrix with element D_{ii} equal to the degree of vertex i. Let L = D − A be the Laplacian of the graph.
A function f on the (vertices of the) graph is simply a function f : {1, . . ., n} → R. We measure distances and norms of functions using the norm ‖·‖_n defined by ‖f‖_n² = n^{−1} Σ_{i=1}^n f²(i). The corresponding inner product of two functions f and g is denoted by ⟨f, g⟩_n. The Laplacian is nonnegative definite (Cvetković et al. (2010)). Hence we can order the Laplacian eigenvalues by magnitude and denote them by 0 = λ_0 ≤ λ_1 ≤ · · · ≤ λ_{n−1}. (The smallest one always equals 0 and, since the graph is connected, the second one is positive; see for instance Cvetković et al. (2010).) We fix a corresponding sequence of eigenfunctions ψ_i, orthonormal with respect to the inner product ⟨·, ·⟩_n. We derive our results under an asymptotic geometry assumption on the graph, first introduced in Kirichenko and van Zanten (2017), formulated in terms of the Laplacian eigenvalues.
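For concreteness, the objects just defined are straightforward to compute numerically. The following sketch (our own illustration, not part of the paper) builds the Laplacian of a small path graph and an eigenbasis normalised so that ‖ψ_j‖_n = 1:

```python
import numpy as np

def laplacian_eigs(A):
    """Eigendecomposition of L = D - A, with eigenvectors rescaled so that
    ||psi_j||_n = 1, i.e. n^{-1} sum_i psi_j(i)^2 = 1."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    L = D - A
    lam, U = np.linalg.eigh(L)   # eigenvalues in increasing order
    psi = np.sqrt(n) * U         # Euclidean unit columns -> ||.||_n unit columns
    return lam, psi

# Path graph on 5 vertices: edges between consecutive vertices.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
lam, psi = laplacian_eigs(A)
assert abs(lam[0]) < 1e-10                        # smallest eigenvalue is 0
assert lam[1] > 0                                 # connected: second one positive
assert abs(np.mean(psi[:, 1] ** 2) - 1) < 1e-10   # ||psi_1||_n = 1
```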
Assumption. We say that the geometry assumption is satisfied with parameter r ≥ 1 if there exist i_0 ∈ N, κ ∈ (0, 1] and C_1, C_2 > 0 such that for all n large enough,

C_1 (i/n)^{2/r} ≤ λ_i ≤ C_2 (i/n)^{2/r}   for all i ∈ {i_0, . . ., κn}.

Very roughly speaking the condition means that asymptotically, or from "far away", the graph looks like an r-dimensional grid with n vertices. From Kirichenko and van Zanten (2017) we know the assumption is satisfied for d-dimensional grids with r equal to the dimension d; hence our results on the minimax rates include the usual statements for regression and classification with regular, fixed design. We stress however that the constant r does not need to be a natural number. For given graphs the parameter r can be calculated numerically. For example, in Kirichenko and van Zanten (2017) a Watts–Strogatz "small world" graph is considered which satisfies the condition with r equal to 1.4. Observe that we do not assume the existence of a "limiting manifold" for the graph as n → ∞. See Kirichenko and van Zanten (2017) for more discussion of the geometry assumption and more examples.
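The empirical verification mentioned above can be sketched as follows: under the assumption, log λ_i ≈ const + (2/r) log(i/n) over a mid-range of indices, so r can be read off from a least-squares slope. The code below is our own illustration for the path graph, whose eigenvalues are known in closed form and for which r = 1:

```python
import numpy as np

# Laplacian eigenvalues of the path graph with n vertices (closed form):
# lambda_i = 2 - 2 cos(pi * i / n) ~ (pi * i / n)^2, so r = 1.
n = 2000
i = np.arange(1, n)
lam = 2 - 2 * np.cos(np.pi * i / n)

# Fit log(lambda_i) ~ const + (2/r) * log(i/n) over a mid-range of indices.
lo, hi = 10, n // 10
slope, _ = np.polyfit(np.log(i[lo:hi] / n), np.log(lam[lo:hi]), 1)
r_hat = 2 / slope
assert abs(r_hat - 1.0) < 0.05   # recovers the geometry parameter r = 1
```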
We describe the smoothness of the function of interest by assuming it belongs to a Sobolev-type ball of the form

(2.1) H^β(Q) = { f : {1, . . ., n} → R : ⟨f, (I + (n^{2/r} L)^β) f⟩_n ≤ Q }.

This should be viewed as the natural discrete graph version of the usual notion of a Sobolev ball of functions on [0, 1]^r. (This is most easily seen in the case of the path graph and r = 1, as illustrated in Example 3.1 of Kirichenko and van Zanten (2017).) The particular normalisation, which depends on the geometry parameter r, ensures non-trivial asymptotics. Again, we stress that we do not assume that the functions on the graph are discretised versions of certain continuous objects on a "limiting manifold".
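Membership in such a ball is easy to evaluate numerically by passing to the Laplacian eigenbasis. Below is a minimal sketch (our own illustration), assuming the ball is of the form ⟨f, (I + (n^{2/r} L)^β) f⟩_n ≤ Q:

```python
import numpy as np

def sobolev_norm_sq(f, L, r, beta):
    """<f, (I + (n^{2/r} L)^beta) f>_n for a function f on the graph,
    computed via the spectral decomposition of L."""
    n = len(f)
    lam, U = np.linalg.eigh(L)
    lam = np.clip(lam, 0, None)        # guard against tiny negative eigenvalues
    coef = (np.sqrt(n) * U).T @ f / n  # f_j = <f, psi_j>_n
    a2 = 1 + (n ** (2 / r) * lam) ** beta
    return float(np.sum(a2 * coef ** 2))

# Path graph (r = 1): a constant function has zero Laplacian energy,
# so its Sobolev norm equals its squared ||.||_n-norm.
n = 50
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
const = np.ones(n)
assert abs(sobolev_norm_sq(const, L, r=1, beta=2) - 1.0) < 1e-8
```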
3. Main results. Now that we have introduced ways to quantify the graph geometry and the regularity of the target function, we can formulate our minimax results for regression and classification. In both cases, G = G_n will be a connected simple undirected graph with vertices 1, . . ., n, satisfying the geometry assumption for some r ≥ 1. The target function will be a regression function that is observed with additive Gaussian noise in the regression case, and a binary regression function in the classification case.
In the regression case we assume that we have observations Y = (Y_1, . . ., Y_n) at the vertices of the graph satisfying

Y_i = f(i) + σξ_i,   i = 1, . . ., n,

where the ξ_i are independent standard Gaussians, σ > 0 and f : {1, . . ., n} → R is the unknown function of interest. We denote the corresponding distribution of Y by P_f and the associated expectation by E_f. Our main result in this setting is the following.

Theorem 3.1 (Regression). Suppose that the graph satisfies the geometry assumption for r ≥ 1. Then for all β, Q > 0,

inf_f̂ sup_{f ∈ H^β(Q)} E_f ‖f̂ − f‖_n² ≍ n^{−2β/(2β+r)},

where the infimum is taken over all estimators f̂ = f̂(Y_1, . . ., Y_n).
The theorem shows that the minimax rate for the regression problem on the graph is equal to n^{−β/(2β+r)}. We obtain the upper bound on the rate by constructing a projection estimator f̂ for which

sup_{f ∈ H^β(Q)} E_f ‖f̂ − f‖_n² ≲ n^{−2β/(2β+r)}.

The proof shows that this estimator depends on the regularity level of the target function and therefore is not adaptive. An adaptive (Bayesian) procedure is exhibited in Kirichenko and van Zanten (2017).
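A projection estimator of this type can be sketched as follows (our own illustration, not the paper's exact construction): keep the first N of the order n^{r/(2β+r)} empirical eigencoefficients Z_j = ⟨Y, ψ_j⟩_n and set the rest to zero. Note that the truncation level depends on the typically unknown β:

```python
import numpy as np

def projection_estimator(Y, L, r, beta):
    """Project the observations onto the first N ~ n^{r/(2 beta + r)}
    Laplacian eigenfunctions; depends on the smoothness parameter beta."""
    n = len(Y)
    lam, U = np.linalg.eigh(L)
    psi = np.sqrt(n) * U                    # ||psi_j||_n = 1 for every column
    N = max(1, int(np.ceil(n ** (r / (2 * beta + r)))))
    Z = psi.T @ Y / n                       # Z_j = <Y, psi_j>_n
    Z[N:] = 0.0                             # keep only the first N coefficients
    return psi @ Z

# Noisy observations of a slowly varying function on the path graph.
rng = np.random.default_rng(0)
n = 200
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
f = np.cos(np.pi * np.arange(n) / n)        # smooth target function
Y = f + 0.5 * rng.standard_normal(n)
f_hat = projection_estimator(Y, L, r=1, beta=2)
mse_raw = np.mean((Y - f) ** 2)
mse_hat = np.mean((f_hat - f) ** 2)
assert mse_hat < mse_raw                    # smoothing reduces the risk
```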
In the binary classification case we assume that the data Y_1, . . ., Y_n are independent {0, 1}-valued variables, observed at the vertices of the graph. In this case the goal is to estimate the binary regression function ρ, or "soft label function", on the graph, defined by ρ(i) = P(Y_i = 1), i = 1, . . ., n. The function ρ, of course, determines the distribution of the data, which we therefore denote by P_ρ. Again, the associated expectations are denoted by E_ρ.
Technically the classification case is slightly more demanding. In contrast to the regression case we also have to impose conditions on the Laplacian eigenfunctions ψ_j. Moreover, we impose the regularity condition not directly on the binary regression function ρ, but on a suitably transformed version of it, so that it maps into R instead of (0, 1). Concretely, we fix a differentiable link function Ψ : R → (0, 1) such that Ψ′/(Ψ(1 − Ψ)) is uniformly bounded and Ψ′ > 0 everywhere. Note that for instance the sigmoid, or logistic, link Ψ(f) = 1/(1 + exp(−f)) satisfies this condition. Under these conditions the inverse Ψ^{−1} : (0, 1) → R is well defined. In the classification setting the regularity condition will be formulated in terms of Ψ^{−1}(ρ).
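For the logistic link the boundedness is immediate, since its derivative factorises through Ψ(1 − Ψ); a one-line check (our own remark):

```latex
\Psi(x) = \frac{1}{1+e^{-x}}, \qquad
\Psi'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \Psi(x)\bigl(1-\Psi(x)\bigr),
\qquad\text{so}\qquad
\frac{\Psi'}{\Psi(1-\Psi)} \equiv 1 .
```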
Recall from Section 2 that ψ_j is a sequence of eigenfunctions (or vectors) of the graph Laplacian L that is orthonormal with respect to the norm ‖·‖_n.
In particular, the eigenfunctions ψ_j are normalised such that ‖ψ_j‖_n = 1 for every j. In the following theorem we assume in addition that they are uniformly bounded by a common constant C > 0, which is independent of n, i.e. that |ψ_j(i)| ≤ C for every i and j. We need this technical assumption in the proof of the lower bound.
Theorem 3.2 (Classification). Suppose that the graph satisfies the geometry assumption for r ≥ 1 and that the Laplacian eigenfunctions are uniformly bounded by a common constant C > 0, independent of n. Let Ψ : R → (0, 1) be a differentiable link function as above with inverse Ψ^{−1}. Then for all Q > 0 and β ≥ r/2,

inf_ρ̂ sup_{ρ : Ψ^{−1}(ρ) ∈ H^β(Q)} E_ρ ‖ρ̂ − ρ‖_n² ≍ n^{−2β/(2β+r)},

where the infimum is taken over all estimators ρ̂ = ρ̂(Y_1, . . ., Y_n).
Also in this case the proof of the theorem provides an estimator that achieves the lower bound, but that estimator depends on β and hence is nonadaptive. Adaptive, rate-optimal (Bayes) procedures have been exhibited for this classification setting as well in Kirichenko and van Zanten (2017). Note that compared to the regression case there is an extra technical requirement β ≥ r/2. Additionally, we assume boundedness of the Laplacian eigenfunctions. In principle, for a specific case this can be verified numerically. For regular grids of arbitrary dimensions it is straightforward to see that this condition is fulfilled.
Consider for instance the path graph with n vertices, whose Laplacian has eigenvalues λ_j = 2(1 − cos(πj/n)) and corresponding eigenvectors ψ̃_j(i) = cos(πj(i − 1/2)/n), j = 0, . . ., n − 1. For the ‖·‖_n-norm of the jth eigenvector we then have

‖ψ̃_j‖_n² = n^{−1} Σ_{i=1}^n cos²(πj(i − 1/2)/n).

By well known trigonometric identities we have, for any x ∈ R, cos²(x) = (1 + cos(2x))/2, and Σ_{i=1}^n cos(2πj(i − 1/2)/n) = 0 for j = 1, . . ., n − 1. It follows that for any j = 1, . . ., n − 1, ‖ψ̃_j‖_n² = 1/2. Notice that ‖ψ̃_0‖_n² = 1. So we see that indeed, the normalised eigenvectors ψ_j of the Laplacian of the path graph are uniformly bounded by a common constant, namely √2. The eigenvectors of a Cartesian product of two graphs are equal to the Kronecker products of pairs of eigenvectors associated with the Laplacians of those graphs. Since the grids of higher dimensions are products of path graphs, they satisfy the condition as well.
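This boundedness claim is easy to check numerically; the sketch below (our own illustration) verifies that the ‖·‖_n-normalised eigenvectors of the path-graph Laplacian are bounded by √2:

```python
import numpy as np

# Path graph Laplacian and its ||.||_n-orthonormal eigenvectors.
n = 300
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)
psi = np.sqrt(n) * U   # ||psi_j||_n = 1 for every column

# Entries of the normalised eigenvectors are uniformly bounded by sqrt(2),
# independently of n (the constant eigenvector is bounded by 1).
assert np.max(np.abs(psi)) <= np.sqrt(2) + 1e-8
```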

4. Proofs.
4.1. Proof of Theorem 3.1. In the regression case we first expand the observations in the eigenbasis of the graph Laplacian, which brings the problem into the setting of the sequence formulation of the white noise model. Then we adapt techniques from the proof of Pinsker's theorem as it is given, for instance, in Tsybakov (2009).

4.1.1. Preliminaries. Let Y = (Y_1, . . ., Y_n), ξ = (ξ_1, . . ., ξ_n), and let ψ_i be the orthonormal eigenfunctions of the graph Laplacian. Denote ξ̃_i = ⟨ξ, ψ_i⟩_n and observe that the ξ̃_i are centred Gaussian with E ξ̃_i ξ̃_j = n^{−1} δ_{ij}. The inner products Z_i = ⟨Y, ψ_i⟩_n satisfy the following relation for i = 0, . . ., n − 1:

Z_i = f_i + σξ̃_i,

where the f_i are the coefficients in the decomposition f = Σ_{i=0}^{n−1} f_i ψ_i of the target function. Hence, the minimax rates for the original problem are of the same order as the minimax rates for the problem of recovering f = (f_0, . . ., f_{n−1}), given the observations

(4.1) Z_i = f_i + εζ_i, i = 0, . . ., n − 1,

where the ζ_i are independent standard Gaussian and ε = σ/√n. To avoid confusion we define general ellipsoids on the space of coefficients,

(4.2) B_n(Q) = { f ∈ R^n : Σ_{j=0}^{n−1} a_j² f_j² ≤ Q },

for an arbitrary sequence a_j > 0 and some finite constant Q > 0. For a function f in the Sobolev-type ball H^β(Q) its vector of coefficients belongs to B_n(Q) with a_j² = 1 + n^{2β/r} λ_j^β. In order to prove the theorem it is sufficient to show that

(4.3) inf_f̂ sup_{f ∈ B_n(Q)} E_f Σ_{j=0}^{n−1} (f̂_j − f_j)² ≍ n^{−2β/(2β+r)}.
We are going to follow the proof of Pinsker's theorem (see for example Tsybakov (2009)), which studies a similar case in the setting of the Gaussian white noise model on the interval [0, 1]. The proof requires some modifications arising from the nature of our problem. The main differences with the usual lower bound result over Sobolev balls in the infinite sequence model are that we only have n observations and that our ellipsoids B_n(Q) have a special form.
In order to proceed we first consider the problem of obtaining minimax rates in the class of linear estimators. We introduce Pinsker's estimator and recall the linear minimax lemma showing that Pinsker's estimator is optimal in the class of linear estimators. The risk of a linear estimator f̂(l) = (l_0 Z_0, . . ., l_{n−1} Z_{n−1}) with l = (l_0, . . ., l_{n−1}) ∈ R^n is given by

E_f Σ_{j=0}^{n−1} (l_j Z_j − f_j)² = Σ_{j=0}^{n−1} ( (1 − l_j)² f_j² + ε² l_j² ).

For large n we introduce the following equation with respect to the variable x:

(4.4) (ε²/x) Σ_{j=0}^{n−1} a_j (1 − x a_j)_+ = Q.

Suppose there exists a unique solution x of (4.4). For such a solution, define a vector of coefficients l′ consisting of the entries

(4.5) l′_j = (1 − x a_j)_+, j = 0, . . ., n − 1.

The linear estimator f̂ = f̂(l′) is called the Pinsker estimator for the general ellipsoid B_n(Q). The following lemma, which appears as Lemma 3.2 in Tsybakov (2009), shows that the Pinsker estimator is a linear minimax estimator.
Lemma 4.1 (Linear minimax lemma). Suppose that B_n(Q) is a general ellipsoid defined by (4.2) with Q > 0 and a positive set of coefficients {a_j}_{j=0}^{n−1}. Suppose there exists a unique solution x of (4.4) and suppose that the associated coefficients l′_j defined by (4.5) satisfy ε² Σ_{j=0}^{n−1} l′_j < ∞. Then the linear minimax risk satisfies

inf_{l ∈ R^n} sup_{f ∈ B_n(Q)} E_f Σ_{j=0}^{n−1} (l_j Z_j − f_j)² = ε² Σ_{j=0}^{n−1} (1 − x a_j)_+.

In order to be able to apply Lemma 4.1 in our graph setting we need the following technical lemma.
Lemma 4.2. Consider the ellipsoid B_n(Q) defined by (4.2) with Q > 0 and a_j² = 1 + n^{2β/r} λ_j^β. Then, as n → ∞, we have the following.

(i) There exists a solution x of (4.4) which is unique and satisfies x ≍ n^{−β/(2β+r)}.
(ii) The quantity ε² Σ_{j=0}^{n−1} (1 − x a_j)_+ is of the order n^{−2β/(2β+r)}.
(iii) The coefficients v_j² = (ε²/x) a_j^{−1} (1 − x a_j)_+ satisfy Σ_{j=0}^{n−1} v_j² ≳ n^{−2β/(2β+r)}.
Proof. (i) According to Lemma 3.1 from Tsybakov (2009), for large enough n and for an increasing positive sequence a_j with a_n → +∞ as n → ∞, there exists a unique solution of (4.4) given by

x = (ε² Σ_{j=0}^{N} a_j) / (Q + ε² Σ_{j=0}^{N} a_j²),

where N = max{ m : A_m ≤ Q } and A_m = ε² Σ_{j=0}^{m} a_j (a_m − a_j). Consider N defined above. Since the geometry condition on the graph is satisfied, for j = i_0, . . ., κn the eigenvalues of the graph Laplacian can be bounded in the following way:

C_1 (j/n)^{2/r} ≤ λ_j ≤ C_2 (j/n)^{2/r},

so that a_j² = 1 + n^{2β/r} λ_j^β ≍ j^{2β/r}. If m < κn, then for some K_1 > 0 we have A_m ≥ K_1 n^{−1} m^{1+2β/r}. Since A_m is an increasing function of m, there exists K_2 > 0 such that for all m > K_2 n^{r/(2β+r)} it holds that A_m > Q. In a similar manner we can show that there exists K_3 > 0 such that for all m < K_3 n^{r/(2β+r)} it holds that A_m < Q. This leads us to the conclusion that N ≍ n^{r/(2β+r)}. Then equation (4.4) has a unique solution that satisfies x ≍ n^{−β/(2β+r)}.
(ii) Since the graph G satisfies the geometry assumption, we deduce from (i) that for some K_4 > 0,

ε² Σ_{j=0}^{n−1} (1 − x a_j)_+ ≤ ε² N ≤ K_4 n^{−2β/(2β+r)}.

On the other hand, since x a_j is bounded away from 1 for j ≤ N/2, we also have ε² Σ_{j=0}^{n−1} (1 − x a_j)_+ ≳ ε² N ≍ n^{−2β/(2β+r)}. Hence ε² Σ_{j=0}^{n−1} (1 − x a_j)_+ ≍ n^{−2β/(2β+r)}.
(iii) Note that for j > N we have v_j² = 0. We also know that x a_j ≤ 1 for j ≤ N, so that for these j, v_j² = (ε²/x) a_j^{−1} (1 − x a_j)_+ ≥ ε² (1 − x a_j)_+. Together with (ii) this yields Σ_{j=0}^{n−1} v_j² ≳ n^{−2β/(2β+r)}.
This finishes the proof of the lemma.
4.1.2. Proof of the upper bound on the risk. Recall that we only need to provide the upper bound in (4.3). Consider the Pinsker estimator f̂ = (l′_0 Z_0, . . ., l′_{n−1} Z_{n−1}) with l′_j = (1 − x a_j)_+, where a_j² = 1 + n^{2β/r} λ_j^β and x is the unique solution of (4.4). Using Lemma 4.1 and Lemma 4.2 we conclude that the Pinsker estimator satisfies

sup_{f ∈ B_n(Q)} E_f Σ_{j=0}^{n−1} (f̂_j − f_j)² ≲ n^{−2β/(2β+r)}.
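Numerically, the Pinsker weights can be computed by solving (4.4) with a bisection search. The sketch below is our own illustration: it assumes (4.4) in the form (ε²/x) Σ_j a_j (1 − x a_j)_+ = Q and uses coefficients a_j² = 1 + j^{2β/r}, mimicking the graph case:

```python
import numpy as np

def pinsker_weights(a, eps2, Q):
    """Solve (eps2 / x) * sum_j a_j * (1 - x a_j)_+ = Q for x by bisection
    and return the weights l_j = (1 - x a_j)_+."""
    def lhs(x):
        return (eps2 / x) * np.sum(a * np.clip(1 - x * a, 0, None))
    lo, hi = 1e-12, 1.0 / a.min()   # lhs is decreasing in x on this interval
    for _ in range(200):            # bisection: lhs(lo) > Q >= lhs(hi)
        mid = 0.5 * (lo + hi)
        if lhs(mid) > Q:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return np.clip(1 - x * a, 0, None)

# Ellipsoid coefficients a_j^2 = 1 + j^{2 beta / r} (graph-like behaviour).
n, beta, r = 500, 2.0, 1.0
j = np.arange(n)
a = np.sqrt(1 + j.astype(float) ** (2 * beta / r))
l = pinsker_weights(a, eps2=1.0 / n, Q=1.0)
assert np.all((0 <= l) & (l <= 1))   # shrinkage weights
assert l[0] > 0 and l[-1] == 0       # low frequencies kept, high frequencies killed
```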
4.1.3. Proof of the lower bound on the risk. We follow the usual steps of the general reduction scheme for obtaining minimax rates (see e.g. Chapter 3 of Tsybakov (2009) for details). First, we note that it is sufficient to only take into account the first N coefficients in the decomposition of the target function, where N ≍ n^{r/(2β+r)} is as in the proof of Lemma 4.2. Indeed, if we denote the minimax risk by R_n, i.e.

R_n = inf_f̂ sup_{f ∈ B_n(Q)} E_f Σ_{j=0}^{n−1} (f̂_j − f_j)²,

and define f^{(N)} = (f_0, . . ., f_{N−1}), then restricting the supremum to functions with f_j = 0 for j ≥ N can only decrease the risk. Next we follow the usual step of bounding the minimax risk by a Bayes risk. Consider the density μ(f^{(N)}) = Π_{j=0}^{N−1} μ_{s_j}(f_j) with respect to the Lebesgue measure on R^N. Here s_j = (1 − δ)v_j² for some δ ∈ (0, 1) and μ_σ denotes the density of the Gaussian distribution with mean 0 and variance σ². We can then bound the minimax risk from below by the Bayes risk,

(4.9) R_n ≥ inf_f̂ ∫ E_f Σ_{j=0}^{N−1} (f̂_j − f_j)² μ(f^{(N)}) df^{(N)} − r_n,

where r_n is a remainder term controlled by the prior probability of the complement of the ellipsoid. From the proof of Pinsker's theorem, see Tsybakov (2009), we get for some K > 0 the bounds

inf_f̂ ∫ E_f Σ_{j=0}^{N−1} (f̂_j − f_j)² μ(f^{(N)}) df^{(N)} ≥ (1 − δ) Σ_{j=0}^{N−1} ε² s_j/(ε² + s_j) ≥ K (1 − δ)² ε² Σ_{j=0}^{N−1} (1 − x a_j)_+,

while the remainder term r_n is of smaller order. Using the results of Lemma 4.2 we conclude that R_n ≳ n^{−2β/(2β+r)}.
4.2. Proof of Theorem 3.2. In order to prove the result in the classification case we use Fano's lemma and the usual general scheme for reducing a minimax estimation problem to a minimax testing problem (see for instance Tsybakov (2009) again).

4.2.1. Proof of the upper bound on the risk. We define the estimator that gives us an upper bound on the minimax risk based on the estimator f̂ which has been introduced in Subsection 4.1.2 of the proof of Theorem 3.1. By the reasoning given in the aforementioned subsection and using the properties of the link function Ψ, we can see that

sup_{ρ : Ψ^{−1}(ρ) ∈ H^β(Q)} E_ρ ‖ρ̂ − ρ‖_n² ≲ n^{−2β/(2β+r)}.

4.2.2. Proof of the lower bound on the risk. The proof of the lower bound on the risk is based on a corollary of Fano's lemma, see, for instance, Corollary 2.6 in Tsybakov (2009). Observe that by Markov's inequality, for any soft label functions ρ_0, . . ., ρ_M with M ∈ N and any C > 0,

inf_ρ̂ sup_ρ E_ρ ‖ρ̂ − ρ‖_n² ≥ C n^{−2β/(2β+r)} inf_ρ̂ max_{0≤j≤M} P_{ρ_j}( ‖ρ̂ − ρ_j‖_n² ≥ C n^{−2β/(2β+r)} ).

By the corollary of Fano's lemma the maximum of the probabilities on the right-hand side is bounded away from zero for suitably separated ρ_0, . . ., ρ_M, provided the corresponding measures P_j = P_{ρ_j} satisfy, for some 0 < α < 1/8,

(4.11) (1/M) Σ_{j=1}^{M} K(P_j, P_0) ≤ α log M.
We will select M vectors of coefficients θ^{(j)} such that the probability measures corresponding to ρ_j = Ψ(f_{θ^{(j)}}) will satisfy (4.11), where Ψ is the link function.
In order to show that the P_j satisfy (4.11) we present a technical lemma. In the classification setting the Kullback–Leibler divergence K(·, ·) satisfies

K(P_{ρ_1}, P_{ρ_2}) = Σ_{i=1}^{n} ( ρ_1(i) log(ρ_1(i)/ρ_2(i)) + (1 − ρ_1(i)) log((1 − ρ_1(i))/(1 − ρ_2(i))) ).

Lemma 4.3. If Ψ′/(Ψ(1 − Ψ)) is bounded, then there exists c > 0 such that for any v_1, v_2 ∈ R^n we have

K(P_{Ψ(v_1)}, P_{Ψ(v_2)}) ≤ c n ‖v_1 − v_2‖_n².

Proof. For every x ∈ R consider the function g_x : R → R defined as

g_x(y) = Ψ(x) log(Ψ(x)/Ψ(y)) + (1 − Ψ(x)) log((1 − Ψ(x))/(1 − Ψ(y))).

Its derivative equals

g_x′(y) = (Ψ′(y)/(Ψ(y)(1 − Ψ(y)))) (Ψ(y) − Ψ(x)).

Then by Taylor's theorem we can see that

|g_x(y)| ≤ sup_{v ∈ [x,y] ∪ [y,x]} (Ψ′(v)/(Ψ(v)(1 − Ψ(v)))) · sup_{v ∈ [x,y] ∪ [y,x]} Ψ′(v) · (x − y)².
The statement of the lemma follows.
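For the logistic link the bound of the lemma can be made explicit: there Ψ′/(Ψ(1 − Ψ)) ≡ 1 and Ψ′ ≤ 1/4, and integrating the Taylor bound contributes an extra factor 1/2, giving K(Ber(Ψ(x)), Ber(Ψ(y))) ≤ (x − y)²/8 per vertex. A quick numerical check (our own illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# For the logistic link, Psi' = Psi(1 - Psi) <= 1/4, and the Taylor
# argument gives KL(Ber(Psi(x)), Ber(Psi(y))) <= (x - y)^2 / 8.
xs = np.linspace(-6, 6, 121)
for x in xs:
    for y in xs:
        if x == y:
            continue
        kl = kl_bernoulli(sigmoid(x), sigmoid(y))
        assert kl <= (x - y) ** 2 / 8 + 1e-12
```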
Observe that this bound does not depend on j. Hence, summing the resulting bounds over j, we can choose δ > 0 small enough such that the condition (4.11) is satisfied with some 0 < α < 1.
This completes the proof of the theorem.