Estimating a smooth function on a large graph by Bayesian Laplacian regularisation

We study a Bayesian approach to estimating a smooth function in the context of regression or classification problems on large graphs. We derive theoretical results that show how asymptotically optimal Bayesian regularization can be achieved under an asymptotic shape assumption on the underlying graph and a smoothness condition on the target function, both formulated in terms of the graph Laplacian. The priors we study are randomly scaled Gaussians with precision operators involving the Laplacian of the graph.


Introduction.
1.1.Learning a smooth function on a large graph.There are various problems arising in modern statistics that involve making inference about a "smooth" function on a large graph.The underlying graph structure in such problems can have different origins.Sometimes it is given by the context of the problem.This is typically the case, for instance, in the problem of making inference on protein interaction networks (e.g.Sharan et al. (2007)) or in image interpolation problems (Liu et al. (2014)).In other cases the graph is deduced from the data in a preliminary step, as is the case with similarity graphs in label propagation methods (e.g.Zhu and Ghahramani (2002)).Moreover, the different problems that arise in applications can have all kinds of different particular features.For example, the available data can be indexed by the vertices or by the edges of the graph, or both.Also, in some applications only partial data are available, for instance only part of the vertices are labeled (semi-supervised problems).Moreover, both regression problems and classification problems arise naturally in different applications.
Despite all these different aspects, many of these problems and the methods that have been developed to deal with them have a number of important features in common.In many cases the graph is relatively "large" and the function of interest can be viewed as "smoothly varying" over the graph.Consequently, most of the proposed methods view the problem as a high-dimensional or nonparametric estimation problem and employ some regularisation or penalization technique that takes the geometry of the graph into account and that is thought to produce an appropriate bias-variance trade-off.
In this paper we set up the mathematical framework that allows us to study the performance of nonparametric function estimation methods on large graphs.We do not treat all the variants exhaustively, instead we consider two prototypical problems: regression, where the function of interest f is a function on the vertices of the graph that is observed with additive noise, and binary classification, where a label 0 or 1 is observed at each vertex and the object of interest is the "soft label" function f whose value at a vertex v is the probability of seeing a 1 at v. We assume the underlying graph is "large", in the sense that it has n vertices for some "large" n.Our theoretical results deal with the situation that this number n tends to infinity.Although for finite n the graph has a fixed size and we essentially just have to estimate a Euclidean vector in R n , it is useful to view the problem as high-dimensional or even nonparametric.
Despite the finite structure, it is intuitively clear that the "smoothness" of f , defined in a suitable manner, will have an impact on the difficulty of the problem and on the results that can be attained.Indeed, consider the extreme case of f being a constant function.Then estimating f reduces to estimating a single real number.In the regression setting, for instance, this means that under mild conditions the sample mean gives a √ n-consistent estimator.In the other extreme case of a completely unrestricted function there is no way of making any useful inference.At best we can say that in view of the James-Stein effect we should employ some degree of shrinking or regularisation.However, if no further assumptions are made, nothing can be said about consistency or rates.We are interested in the question what we should do in the intermediate situation that f has some "smoothness" between these two extremes.
Another aspect that will have a crucial impact on the problem, in addition to the regularity of f , is the geometry of the graph.Indeed, regular grids of different dimensions are special cases of the graphs we shall consider, and we know from existing theory that the best attainable rates for estimating a smooth function on a grid depends on the dimension of the grid.More generally, the geometry of the graph will influence the complexity of the spaces of "smooth" functions on the graph, and hence the performance of statistical or learning methods.
1.2.Laplacian regularisation.Several approaches to learning functions on graphs that have been explored in the literature involve regularisation using the Laplacian matrix associated with the graph (see, for example, Belkin et al. (2004), Smola and Kondor (2003), Hein (2006), Ando and Zhang (2007), Zhu et al. (2003), Huang et al. (2011)).The graph Laplacian is defined as L = D − A, where A is the adjacency matrix of the graph and D is the diagonal matrix with the degrees of the vertices on the diagonal.When viewed as a linear operator, the Laplacian acts on a function f on the graph as (1.1) where we write i ∼ j if vertices i and j are connected by an edge.Several related operators are routinely employed as well, for instance, the normalized Laplacian L = D −1/2 LD −1/2 .We will continue to work with L in this paper, but much of the story goes through if L is replaced by such a related operator, after minor adaptations.
For a function f on the graph the Laplacian norm is given by j∼i (f (i) − f (j)) 2 .Clearly, the Laplacian norm of f quantifies how much the function f varies when moving along the edges of the graph.Therefore, several papers have proposed regularisation or penalization using this norm, as well as generalizations involving powers of the Laplacian or other functions, for instance, exponential ones.See, for example, Belkin et al. (2004) or Smola and Kondor (2003) and the references therein.There exist only few papers that study theoretical aspects of the performance of such methods.We mention, for example, Belkin et al. (2004), in which a theoretical analysis of a Tikhonov regularisation method is conducted in terms of algorithmic stability.Johnson and Zhang (2007) consider sub-sampling schemes for estimating a function on a graph.
The existing papers have different viewpoints than ours and do not study how the performance depends on (the combination of) graph geometry and function regularity.Our aim is to develop a framework which makes such a theoretical study of Laplacian regularisation methods possible and to derive some first asymptotic results that exhibit methods that perform well from the point of view of convergence rates and adaptation to regularity.1.3.Bayesian regularisation.We investigate Bayesian regularisation approaches, where we consider two types of priors on functions on graphs.The first type performs regularisation using a power of the Laplacian.This can be seen as the graph analogue of Sobolev norm regularisation of functions on "ordinary" Euclidean spaces.The second type of priors uses an exponential function of the Laplacian.This can be viewed as the analogue of the popular squared exponential prior on functions on Euclidean space or its extension to manifolds, as studied by Castillo et al. (2014).In both cases we consider hierarchical priors with the aim of achieving automatic adaptation to the regularity of the function of interest.
To assess the performance of our Bayes procedures we take an asymptotic perspective.We let the number of vertices of the graph grow and ask at what rate the posterior distribution concentrates around the unknown function f that generates the data.We make two kinds of assumptions.Firstly, we assume that f has a cer-tain degree of regularity β, defined in suitable manner.The smoothness β is not assumed to be known though, we are aiming at deriving adaptive results.
Secondly, we make an assumption on the asymptotic shape of the graph.In recent years, various theories of graph limits have been developed.Most prominent is the concept of the graphon, e.g.Lovász and Szegedy (2006) or the book of Lovasz (2012).More recently this notion has been extended in various directions, see, for instance, Borgs et al. (2014) and Chung (2014).However, the existing approaches are not immediately suited in the situations we have in mind, which involve graphs that are sparse in nature and are "grid-like" in some sense.Therefore we take an alternative approach and describe the asymptotic shape of the graph through a condition on the asymptotic behaviour of the spectrum of the Laplacian.To be able to derive concrete results we essentially assume that the smallest eigenvalues λ n,i of L satisfy for some r ≥ 11 .Very roughly speaking, this means that asymptotically, or "from a distance", the graph looks like an r-dimensional grid with n vertices.As we shall see, the actual grids are special cases (see Example 2.1), hence our results include the usual statements for regression and classification on these classical design spaces.However, the setting is much more general, since it is really only the asymptotic shape that matters.For instance, a 2 by n/2 ladder graph asymptotically also looks like a path graph, and indeed we will see that it satisfies our assumption for r = 1 as well (Example 2.3).Moreover, the constant r in (1.2) does not need to be a natural number.We will see, for example, at least numerically, that there are graphs whose geometry is asymptotically like that of a grid of non-integer "dimension" r in the sense of condition (1.2).
We stress that we do not assume the existence of a "limiting manifold" for the graph as n → ∞.We formulate our conditions and results purely in terms of intrinsic properties of the graph, without first embedding it in an ambient space.In certain cases in which limiting manifolds do exist (e.g. the regular grid cases) our type of asymptotics can be seen as "infill asymptotics" (Cressie (1993)).For a simple illustration, see Example 3.1.However, in applied settings (see, for instance, Example 2.7) it is typically not clear what a suitable ambient manifold could be, which is why we choose to avoid this issue altogether.
In the recent paper Hartog and van Zanten (2016) the theoretical results we present in this paper are investigated numerically and serve as a guideline for the tuning of practical Bayesian regularisation methods.Several concrete examples are considered, both for simulated data and for real data problems.
1.4.Organisation.The rest of the paper is organized as follows.In the next section we present our geometry assumption and give examples of graphs that satisfy it, either theoretically or numerically.In Section 3 we introduce two families of priors on functions on graphs.We present theorems that quantify the amount of mass that the priors put on neighbourhoods of "smooth" functions and quantify the complexity of the priors in terms of metric entropy.Section 4 contains the proofs of these general results and in Section 5 they are used to derive theorems about posterior contraction in nonparametric regression and binary classification.We end with some concluding remarks in Section 6.
2. Asymptotic geometry assumption on graphs.In this section we formulate our geometry assumption on the underlying graph and give several examples.
2.1.Graphs, Laplacians and functions on graphs.Let G be a connected, simple (i.e.no loops, multiple edges or weights), undirected graph with n vertices labelled 1, . . ., n.Let A be its adjacency matrix, i. e.A ij is 1 or 0 according to whether or not there is an edge between vertices i and j.Let D be the diagonal matrix with element D ii equal to the degree of vertex i.Let L = D − A be the Laplacian of the graph.We note that strictly speaking, we will be considering sequences of graphs G n with Laplacians L n and we will let n tend to infinity.However, in order to avoid cluttered notation, we will omit the subscript n and just write G and L throughout.
A function f on the (vertices of the) graph is simply a function f : {1, . . ., n} → R. Slightly abusing notation we will write f both for the function and for the associated vector of function values (f (1), f (2), . . ., f (n)) in R n .We measure distances and norms of functions using the norm • n defined by f 2 n = n −1 n i=1 f 2 (i).The corresponding inner product of two functions f and g is denoted by Again, in our results n will be varying, so when we speak of a function f on the graph G we are, strictly speaking, considering a sequence of functions f n .Also, in this case the subscript n will usually be omitted.The Laplacian L is positive semi-definite and symmetric.It easily follows from the definition that its smallest eigenvalue is 0 (with eigenvector (1, . . ., 1)).The fact that G is connected implies that the second smallest eigenvalue, the so-called algebraic connectivity, is strictly positive (e.g.Cvetković et al. (2010)).We will denote the Laplacian eigenvalues, ordered my magnitude, by Again we will usually drop the first index n and just write λ i for λ n,i .We fix a corresponding sequence of eigenfunctions ψ i , orthonormal with respect to the inner product •, • n .
2.2.Asymptotic geometry assumption.As mentioned in the introduction, we will derive results under an asymptotic shape assumption on the graph, formulated in terms of the Laplacian eigenvalues.To motivate the definition we note that the ith eigenvalue of the Laplacian of an n-point grid of dimension d behaves like (i/n) 2/d (see Example 2.1 ahead).We will work with the following condition.
CONDITION.We say that the geometry condition is satisfied with parameter r ≥ 1 if there exist i 0 ∈ N, κ ∈ (0, 1] and C 1 , C 2 > 0 such that for all n large enough, , for all i ∈ {i 0 , . . ., κn}.
Note that this condition only restricts a positive fraction κ of the Laplacian eigenvalues, namely the κn smallest ones.Moreover, we don't need restrictions on the first finitely many eigenvalues.We remark that if the geometry condition is fulfilled, then by adapting the constant C 1 we can ensure that the lower bound holds, in fact, for all i ∈ {i 0 , . . ., n}.To see this, observe that for n large enough and κn < i ≤ n we have .
For the indices i < i 0 it is useful to note that we have a general lower bound on the first positive eigenvalue λ 1 , hence on λ 2 , . . ., λ i 0 as well.Indeed, by Theorem 4.2 of Mohar (1991a) we have Note that this bound also implies that our geometry assumption can not hold with a parameter r < 1, since that would lead to contradictory inequalities for λ i 0 .
We first confirm that the geometry condition is satisfied for grids and tori of different dimensions.EXAMPLE 2.1 (Grids).For d ∈ N, a regular d-dimensional grid with n vertices can be obtained by taking the Cartesian product of d path graphs with n 1/d vertices (provided, of course, that this number is an integer).Using the known expression for the Laplacian eigenvalues of the path graph and the fact that the eigenvalues of products of graphs are the sums of the original eigenvalues, see, for instance, Theorem 3.5 of Mohar (1991b), we get that the Laplacian eigenvalues of the ddimensional grid are given by 4 sin 2 πi 1 2n where i k = 0, 1, 2, . . ., n 1/d − 1 for every k = 1, . . ., d.By definition there are i + 1 eigenvalues less or equal than the ith smallest eigenvalue λ i .Hence, for a constant c > 0, we have: The sum on the right gives the number of lattice points in a sphere of radius For our purposes it suffices to use crude upper and lower bounds for this number.By considering, for instance, the smallest hypercube containing the sphere and the largest one inscribed in it, it is easily seen that the number of lattice points is bounded from above and below by a constant times R d .We conclude that for the d-dimensional grid we have λ i (i/n) 2/d for every i = 0, . . ., n − 1.In particular, the geometry condition is fulfilled with parameter r = d.EXAMPLE 2.2 (Discrete tori).For graph tori we can follow the same line of reasoning as for grids.A d-dimensional torus graph with n vertices can be obtained as a product of d ring graphs with n 1/d vertices.Using the known explicit expression of the Laplacian eigenvalues of the ring we find that the d-dimensional torus graph satisfies the geometry conditions with parameter r = d as well.
The following lemma lists a number of operations that can be carried out on a graph without loosing the geometry condition.
LEMMA 2.1.Suppose that G = G n satisfies the geometry assumption with parameter r.Then the following graphs satisfy the condition with parameter r as well: (i) The cartesian product of G with a connected simple graph H with a finite number of vertices (independent of n).(ii) The graph obtained by augmenting G with finitely many edges (independent of n), provided it is a simple graph.(iii) The graph obtained from G by deleting finitely many edges (independent of n), provided it is still connected.(iv) The graph obtained by augmenting G with finitely many vertices and edges (independent of n), provided it is a simple connected graph.

PROOF. (i)
. Say H has m vertices and let its Laplacian eigenvalues be denoted by 0 = µ 0 , . . ., µ m .Then the product graph has mn vertices and it has Laplacian eigenvalues λ i + µ j , i = 0, . . ., n − 1, j = 0, . . ., m − 1 (see Theorem 3.5 of Mohar (1991b)).In particular, the first n eigenvalues are the same as those of G. Hence, since G satisfies the geometry condition, so does the product of G and H.
(ii) and (iii).These statements follow from the interlacing formula that asserts that if G + e is the graph obtained by adding the edge e to G, then See, for example, Theorem 3.2 of Mohar (1991b) or Theorem 7.1.5 of Cvetković et al. (2010).
(iv).Let v and e be a vertex and an edge that we want to connect to G. Denote G v a disjoint union of G and v, and by G the graph obtained by connecting edge e to v and an existing vertex of G.By Theorem 3.1 from Mohar (1991b) we know that the eigenvalues of G v are 0, 0, λ 1 (G), λ 2 (G), . . ., λ n−1 (G).Using Theorem 3.2 of Mohar (1991b) The result follows from this observation.EXAMPLE 2.3 (Ladder graph).A ladder graph with n vertices is the product of a path graph with n/2 vertices and a path graph with 2 vertices.Hence, by part (i) of Lemma 2.1 and Example 2.1 it satisfies the geometry condition with parameter r = 1.EXAMPLE 2.4 (Lollipop graph).The so-called lollipop graph L m,n is obtained by attaching a path graph with n vertices with an additional edge to a complete graph with m vertices.If m is constant, i.e. independent of n, then according to parts (ii) and (iv) of the preceding lemma this graph satisfies the geometry condition with r = 1.
In the examples considered so far it is possible to verify theoretically that the geometry condition is fulfilled.In a concrete case in which the given graph is not of such a tractable type, numerical investigation of the Laplacian eigenvalues can give an indication as to whether or not the condition is reasonable and provide the appropriate value of the parameter r.A possible approach is to plot log λ i against log(i/n).If the geometry condition is satisfied with parameter r, the κ × 100% left mosts points in this plot should approximately lie on a straight line with slope 2/r, except possibly a few on the very left. -6 -5 Plot of log λi against log(i/n) for the 20 × 20 grid.Fitted line has slope 1.0, corresponding to r = 2.0 in the geometry assumption.
Our focus in this paper is not on numerics, but it is illustrative to consider a few numerical examples in order to get a better idea of the types of graphs that fit into our framework.EXAMPLE 2.5 (Two-dimensional grid, numerically).Figure 1 illustrates the suggested numerical approach for a two-dimensional, 20×20 grid.The dashed line in the left panel is fitted to the left-most 35% of the points in the plot, discarding the first three points on the left.In accordance with Example 2.1 this line has slope 1.0.EXAMPLE 2.6 (Watts-Strogatz 'small world' graph).In our second numerical example we consider a graph obtained as a realization from the well-known random graph model of Watts and Strogatz (1998).Specifically, we consider in the first step a ring graph with 200 vertices.In the next step every vertex is visited and the edges emanating from the vertex are rewired with probability p = 1/4, meaning that with probability 1/4 they are detached from the neighbour of the current vertex and attached to another vertex, chosen uniformly at random.In the right panel of Figure 2 a particular realization is shown.Here we have only kept the largest connected component, which has 175 vertices in this case.On the left we have exactly the same plot as described in the preceding example for the grid case.The plot indicates that it is not unreasonable to assume that the geometry condition holds.The value of the parameter r deduced from the slope of the line equals 1.4 for this graph.
EXAMPLE 2.7 (Protein interaction graph).In the final example we consider a graph obtained from the protein interaction graph of baker's yeast, as described in detail in Section 8.5 of Kolaczyk (2009).The graph, shown in the right panel of Figure 3, describes the interactions between proteins involved in the communication between a cell and its surroundings.Also for this graph it is true that with a few exceptions, the points corresponding to the 35% smallest eigenvalues lie approximately on a straight line.The same procedure as followed in the other examples gives a value r = 2.1 for the parameter in the geometry assumption.
3. General results on prior concentration.We consider two different priors on functions on graphs.The first corresponds to regularisation using a power of the Laplacian, the second one uses an exponential function of the Laplacian.In this section we present two general results which quantify both the mass that these priors give to shrinking • n -neighbourhoods of a fixed function f 0 , and the complexity of the support of the priors, measured in terms of metric entropy2 .In the FIG 3. Plot of log λi against log(i/n) for the Protein interaction graph in the right panel.Fitted line has slope 0.94, corresponding to r = 2.1 in the geometry assumption.
next section we will combine these results with known results from Bayesian nonparametrics theory to deduce convergence rates and adaptation for nonparametric regression and classification problems on graphs.
Our results assume that the geometry condition holds for some r ≥ 1.The mass a prior puts near f 0 will depend on the "regularity" of the function, defined in a suitable manner.Specifically, we will assume it belongs to a Sobolev-type ball of the form for some β, C > 0 (independent of n).The particular normalisation, which depends on the geometry parameter r, ensures non-trivial asymptotics.This is confirmed in Kirichenko and van Zanten (2017), in which minimax lower bounds are presented which complement the rate results of the present paper.
It is illustrative to consider the assumption in a bit more detail in the simple case of the path graph.The example shows in particular that we have chosen the "correct" normalisation in the definition of the smoothness class.EXAMPLE 3.1 (Path graph).Consider a path graph G with n vertices, which we identify with the points i/n in the unit interval, i = 1, . . ., n.As seen in Example 2.1, this graph satisfies the geometry condition with parameter r = 1.Hence, in this case the collection of functions H β (C) is given by To understand when a (sequence of) function(s) belongs to this space, say for β = 1, let f n be the restriction to the grid {i/n, i = 1, . . .n} of a fixed function f defined on the whole interval [0, 1].The assumption that f n ∈ H 1 (C) then translates to the requirement that The first term on the left is a Riemann sum which approximates the integral 1 0 f 2 (x) dx.If f is differentiable, then for the second term we have, for large n, which is a Riemann sum that approximates the integral 1 0 (f (x)) 2 dx.Hence in this particular case the space of functions H 1 (C) on the graph is the natural discrete version of the usual Sobolev ball Definition (3.1) is a way of describing "β-regular" functions on a general graph satisfying the geometry condition, without assuming the graph or the function on it are discretised versions of some "continuous limit".
The first family of priors we consider penalize the higher order Laplacian norm of the function of interest.This corresponds to using a Gaussian prior with a power of the Laplacian as precision matrix (inverse covariance).(We note that since the Laplacian always has 0 as an eigenvalue, it is not invertible.We remedy this by adding a small multiple of the identity matrix I to L.) The larger the power of the Laplacian used, the more "rough" functions on the graph are penalized.The power is regulated by a hyperparameter α > 0 which can be seen as describing the "baseline regularity" of the prior.To enlarge the range of regularities for which we obtain good contraction rates in the statistical results, we add a multiplicative hyperparameter which we endow with a suitable hyperprior.In (3.2) we assume an exact standard exponential distribution, but inspection of the proof shows that the range of priors for which the result holds is actually larger.To keep the exposition clean we omit these details.THEOREM 3.2 (Power of the Laplacian).Suppose the geometry assumption holds for r ≥ 1.Let α > 0 be fixed and assume that f 0 ∈ H β (C) for some C > 0 and 0 < β ≤ α + r/2.Let the random function f on the graph be defined by Then there exists a constant K 1 > 0 and for all K 2 > 1 there exist Borel measurable subsets B n of R n such that for every sufficiently large n, where ε n is a multiple of n −β/(2β+r) .
Note that in this theorem we obtain the rate n −β/(2β+r) for all β in the range (0, α + r/2].In the statistical results presented in Section 5 this translates into rateadaptivity up to regularity level α + r/2.So by putting a prior on the multiplicative scale we achieve a degree of adaptation, but only up to an upper bound that is limited by our choice of the hyperparameter α.A possible solution is to consider other functions of the Laplacian instead of using a power of L in the prior specification.Here we consider usage of an exponential function of the Laplacian.We include a "lengthscale" or "bandwidth" hyperparameter that we endow with a prior as well for added flexibility.This prior can be seen as the analogue of the prior used in Castillo et al. (2014) in the context of function estimation on manifolds, which in turn is a generalization of the popular squared exponential Gaussian prior used for estimation functions on Euclidean domains (e.g.Rasmussen and Williams (2006)).However, we stress again that we do not rely on an embedding of our graph in a manifold or the existence of a "limiting manifold".
In the next theorem there is indeed no restriction on the range of the smoothness β.We remark however that we obtain an additional logarithmic factor is the rate.Technically this is a consequence of the larger "complexity" of the support of this prior.THEOREM 3.3 (Exponential of the Laplacian).Suppose the geometry assumption holds for r ≥ 1. Assume that f 0 ∈ H β (C) for some C > 0 and β > 0. Let the random function f on the graph be defined by Then there exists a constant K 1 > 0 and for all K 2 > 1 there exist Borel subsets B n of R n such that for every sufficiently large n, (3.11)where ε n = (n/ log 1+r/2 n) −β/(2β+r) and εn = ε n log 1/2+r/4 n.
4. Proofs of Theorems 3.2 and 3.3.Recall that we identify functions on the graph with vectors in R n .In both cases we have that given c, the random vector f is a centered n-dimensional Gaussian random vector.We view this as a Gaussian random element in the space (R n , • n ).The corresponding RKHS H c is the entire space R n , and the corresponding RKHS-norm is given by where Σ c is the covariance matrix of f | c. (See e.g.van der Vaart and van Zanten (2008b) for the definition and properties of the RKHS.)Recall that the ψ i are the eigenfunctions of L, normalised with respect to the norm • n .They are then also eigenfunctions of Σ −1 c in both cases.We denote the corresponding eigenvalues by µ i .
The Gaussian N (0, Σ c ) admits the series representation (4.1) where Z 1 , . . ., Z n are standard normal variables.In particular the functions ψ i / √ nµ i form an orthonormal basis of the RKHS H c .Hence, the ordinary • nnorm and the RKHS-norm of a function h with expansion h = h i ψ i are given by (4.2) We denote the unit ball of the RKHS by 4.1.Proof of Theorem 3.2.In this case Σ −1 c = ((n/c) 2/r (L + n −2 I)) α+r/2 is the precision matrix of f given c and the eigenvalues of Σ −1 c are given by . 4.1.1.Proof of (3.4).By Lemma 5.3 of van der Vaart and van Zanten (2008b), it follows from Lemmas 4.1 and 4.2 ahead that under the conditions of the theorem and for ε = ε n = n −β/(r+2β) and c = c n satisfying √ nε By conditioning, it is then seen that LEMMA 4.1.For n large enough and ε > 0 and ε √ n/c (α+r/2)/r small enough, PROOF.By the series representation (4.1) we have P ( Recall from Section 2.2 that we can assume without loss of generality that we have the lower bounds These translate into lower bounds for the µ i from which it follows that for ε > 0, where we write p = α + r/2.By Corollary 4.3 from Dunker et al. (1998), the last factor in the last line is bounded form below by n is small enough.By the triangle inequality and independence, the first factor is bounded from below by Hence, for c −p/r ε √ n small enough the probability is further bounded from below by const × c −p/r n (α+r−pr)/r ε i 0 .
Combining the bounds for the separate factors we find that, for c −p/r ε √ n small enough, .
Since r ≥ 1 the first term on the right is smaller than a constant times the second PROOF.We use an expansion f = f i ψ i , with ψ i the orthonormal eigenfunctions of the Laplacian.In terms of the coefficients the smoothness assumption reads (1 + n 2β/r λ β i )f 2 i ≤ C 2 .Now for I to be determined below, consider h = i≤I f i ψ i .In view of (4.4)-(4.5)we have, for I large enough, The condition on ε ensures that for the choice of I made above and n large enough, i 0 ≤ I ≤ κn.Hence, by (4.4)-(4.5),h 2 H c is bounded by a constant times the right-hand side of (4.6).r+2β) again and c n , M n are the sequences to be determined below.By Lemma 4.3 we have

Proof of (3.5) and (3.6). Define
where p = α + r/2 again.For M n = M nε 2 n and c this is bounded by a constant times nε 2 n , which proves (3.6).It remains to show that for given K 2 > 1, the constants M and N can be chosen such that (3.5) holds.We have For c ≤ c n we have the inclusion H c 1 ⊆ H cn 1 .Hence, by the Borell-Sudakov inequality, we have for c ≤ c n that where Φ is the cdf of the standard normal distribution.By Lemma 4.1 the small ball probability in this expression is for c ) for some K > 0. Using the bound Φ −1 (y) ≥ −((5/2) log(1/y)) 1/2 for y ∈ (0, 1/2), it follows that for c ≤ c n , LEMMA 4.3.For n large enough and c, ε > 0 we have PROOF.By expanding the RKHS elements in the eigenbasis of the Laplacian and taking into account the relations (4.2) we see that the problem is to bound the entropy log N (ε, A, • ), where Using the bounds (4.4)-(4.5), it follows that with p = α + r/2 and R = c p/r n −(α+r)/r we have the inclusions By using the well-known entropy bounds for balls in R i 0 and ellipsoids in 2 we deduce from this that for ε > 0, The proof is completed by recalling the expressions for p and R.
LEMMA 4.4.If ε → 0, c is bounded away from 0 and c/ε 2 → ∞, then PROOF.Again the series representation of the Gaussian law of f | c gives , where the Z i are independent standard normal random variables.By the lower bounds (4.4)-(4.5), it follows that The first probability in the last line is bounded from below by Since the quantity on the right of the inequality in this probability becomes arbitrarily small under de conditions of the lemma, this is further bounded form below by a constant times ε i 0 exp(i 0 ((1/2)C 1 n (2−2r)/r c −2/r )).
For the second probability we use Theorem 6.1 of Li and Shao (2001).This asserts that if a k > 0 and a k < ∞, then as ε → 0 where γ a = γ a (ε) is uniquely determined, for ε > 0 small enough, by the equation (4.9) We apply (4.8) with a i = exp(−C 1 (i/c) 2/r ).
We first determine bounds for γ a .Note that in our case the terms in the sum S on the right of (4.9) are decreasing in i.It follows that we have the bounds A change of variables shows that the integral on the right equals where Li s (z) denotes the polylogarithm.By Wood (1992) as γ a → ∞.Hence for large γ a , we have the upper bound S ≤ const × cγ a −1 log r/2 γ a .It is easily seen that we have a lower bound of the same order, so that ε 2 c log r/2 γ a γ a .
Under our condition that ε 2 /c → 0 this holds if and only if Next we consider the sums appearing on the right of (4.8).To bound log(1 + 2a i γ a ) ≤ log(1 + 2 exp(−C 1 (i/c) 2/r )γ a ) we consider the index I = c(log γ a /C 1 ) r/2 , which is determined such that a I γ a = 1.Note that for m > 0, we have a mI γ a = a m 2/r I γ a = γ 1−m r/2 a .We first split up the sum, writing The first sum on the right is bounded by a multiple of I log γ a .The second one we split into blocks of length I.This gives Hence, we have log(1 + 2a i γ a ) c log 1+r/2 γ a .For the other sum appearing in (4.8) we have The proof is completed by combining all the bounds we have found.
PROOF.We use an expansion f = f i ψ i , with ψ i the orthonormal eigenfunctions of the Laplacian.We saw in the proof of Lemma 4.2 that if we define h = i≤I f i ψ i for I = const × ε −r/β , then h − f n ≤ ε.By (4.2), the RKHSnorm of h satisfies in this case The condition on ε ensures that for the choice of I made above and n large enough, i 0 ≤ I ≤ κn.Hence, by (4.4)-(4.5),h 2 H c is bounded by a constant times the right-hand side of (4.10).4.2.2.Proof of (3.10)-(3.11).Define B n := M n H cn 1 + ε n B 1 , where ε n is as above and M n and c n are determined below.
For (3.10) we first note again that Exactly as in the proof of (3.5), the Borell-Sudakov inequality implies that for c ≤ c n , By Lemma 4.4 the small ball probability on the right is lower bounded by It follows that for c ≤ c n , for some K > 0. For a given K 2 > 0, choosing M n a large multiple of (c n log 1+r/2 (c n /ε 2 n )) 1/2 we find that, for large n, If K 2 > 0 is a given constant, then for c n a large enough multiple of nε 2 n , this is bounded by exp(−K 2 nε 2 n ).For these choices of M n and c n , Lemma 4.6 implies that the entropy satisfies, for any εn ≥ ε n , . This proves that (3.11) holds for εn = ε n log 1/2+r/4 n.
LEMMA 4.6.Let ε, c > 0 be such that c log r/2 (1/ε) → ∞ as n → ∞.Then for n large enough, PROOF.We need to bound the metric entropy of the set relative to the Euclidean norm • .Set I = (2/C 1 ) r/2 c log r/2 (1/ε).Under the assumption of the lemma this is larger than i 0 , hence by (4.4)-(4.5)we have exp(−(n/c) 2/r λ I ) ≤ ε 2 .It follows that if for x ∈ A we define the projection x I by x I = (x 1 , . . ., x I , 0, 0, . ..), then Moreover, we have x I ≤ 1.By the triangle inequality, it follows that if the points x 1 , . . ., x N form an ε-net for the unit ball in R I , then the points x1 , . . ., xN in R n obtained by appending zeros to the x j form a 2ε-net for A. Hence, N (2ε, A, • ) ε −I .The proof is completed by recalling the expression for I.

Function estimation on graphs.
In this section we translate the general Theorems 3.2 and 3.3 into results about posterior contraction in nonparametric regression and binary classification problems on graphs.Since the arguments needed for this translation are very similar to those in earlier papers, we omit full proofs and just give pointers to the literature.5.1.Nonparametric regression.As before we let G be a connected simple undirected graph with vertices 1, 2, . . ., n.In the regression case we assume that we have observations Y 1 , . . ., Y n at the vertices of the graph, satisfying (5.1) where f 0 is the function on G that we want to make inference about and ε i are independent N (0, σ 2 )-distributed error variables, for some σ > 0. We assume that the underlying graph satisfies the geometry assumption with some parameter r ≥ 1.As prior on the regression function f we then employ one of the two priors described by (3.2)-(3.3)or (3.7)-(3.8).If σ is unknown, we assume it belongs to a compact interval [a, b] ⊂ (0, ∞) and endow it with a prior with a positive, continuous density on [a, b], independent of the prior on f .For a given prior Π, the corresponding posterior distribution on f is denoted by Π(• | Y 1 , . . ., Y n ).For a sequence of positive numbers ε n → 0 we say that the posterior contracts around f 0 at the rate ε n if for all large enough M > 0, Π(f : Here the convergence is in probability under the law P f 0 corresponding to the data generating model (5.1).
The usual arguments allow us to derive the following statements from Theorems 3.2 and 3.3.See, for instance, van der Vaart and van Zanten (2008a) or de Jonge and van Zanten (2013) for details.THEOREM 5.1 (Nonparametric regression).Suppose the geometry assumption holds for r ≥ 1. Assume that f 0 ∈ H β (C) for β, C > 0.
Observe that since the priors do not use knowledge of the regularity β of the regression function, we obtain rate-adaptive results.For the power prior the range of regularities that we can adapt to is bounded by α + r/2, where α is the hyper parameter describing the "baseline regularity" of the prior.In the case of the exponential prior the range is unbounded.This comes at the modest cost of having an additional logarithmic factor in the rate.
In Kirichenko and van Zanten (2017) minimax lower bounds are presented which complement the rate results of the present paper.These show that the rates obtained are sharp in the present setting (up to a logarithmic factor in the exponential case).For the regular grid case this is basically also clear from existing lower bound results, since our setup includes the regular grids (Example 2.1) and since our smoothness condition corresponds to ordinary Sobolev regularity in those cases (Example 3.1).5.2.Nonparametric classification.We can derive the analogous results in the classification problem in which we assume that the data Y 1 , . . ., Y n are independent {0, 1}-valued variables, observed at the vertices of the graph.In this case the goal is to estimate the binary regression function p 0 , or "soft label function" on the graph, given by p 0 (i) = P 0 (Y i = 1).
We consider priors on p constructed by first defining a prior on a real-valued function f by (3.2)-(3.3)or (3.7)-(3.8)and then setting p = Ψ(f ), where Ψ : R → (0, 1) is a suitably chosen link function.We will assume that Ψ is a strictly increasing, differentiable function onto (0, 1) such that Ψ /(Ψ(1 − Ψ)) is uniformly bounded.Note that for instance the sigmoid, or logistic link Ψ(f ) = 1/(1 + exp(−f )) satisfies this condition.Under our conditions the inverse Ψ −1 : (0, 1) → R is well defined.In this classification setting the regularity condition will be formulated in terms of Ψ −1 (p 0 ).This is natural, since the prior is defined in terms of Ψ −1 (p) as well.Also in this case we denote the posterior corresponding to a prior Π by Π(• | Y 1 , . . ., Y n ) and we say that the posterior contracts around p 0 at the rate ε n if for all large enough M > 0, Π(p : as n → ∞. To derive the following result from Theorems 3.2 and 3.3 we can argue along the lines of the proof of Theorem 3.2 of van der Vaart and van Zanten (2008a).Some adaptations are required, since in the present case we have fixed design points.However, the necessary modifications are straightforward and therefore omitted.THEOREM 5.2 (Classification).Suppose the geometry assumption holds for r ≥ 1.Let Ψ : R → (0, 1) be onto, strictly increasing, differentiable and suppose that Ψ /(Ψ(1 − Ψ)) is uniformly bounded.Assume that Ψ −1 (p 0 ) ∈ H β (C) for β, C > 0.
(i) (Power of the Laplacian.)If the prior on p is given by the law of Ψ(f ), for f given by (3.2)-(3.3)for α > 0 and β ≤ α + r/2, then the posterior contracts around p 0 at the rate n −β/(r+2β) .(ii) (Exponential of the Laplacian.)If the prior on p is given by the law of Ψ(f ), for f given by (3.7)-(3.8),then the posterior contracts around f 0 at the rate n −β/(r+2β) log κ n for some κ > 0.
6. Concluding remarks.We have introduced a framework for studying the performance of methods for nonparametric function estimation on large graphs.We have proposed assumptions on the geometry of the underlying graph and the regularity of the function formulated in terms of the Laplacian of the graph.Moreover, we have exhibited nonparametric Bayes methods that achieve good convergence rates and that adapt to the unknown regularity of the function of interest.
We have purposely focused on the building up a new framework and deriving a few representative results within that framework and have not yet attempted to explore every possible extension.As a result, extensions and generalizations are possible in a variety of directions.
First of all, it is of interest to study other procedures than just the Bayesian methods with priors (3.2)-(3.3)or (3.7)-(3.8).For instance, empirical Bayes procedures for choosing the hyperparameter c might computationally be more favorable than hierarchical Bayes.Studying the performance of such procedures is possible within the framework of Rousseau and Szabo (2015).In turn, having results for empirical Bayes will allow us to extend the range of priors on c for which we can prove that the hierarchical procedures give good results.
Secondly, results on uncertainty quantification would be valuable.Bayes procedures provide a natural method for quantifying uncertainty through the spread of the posterior distribution.However, it has become clear that in general the question of whether or not Bayesian credible sets can be interpreted as (frequentist) confidence sets is a delicate matter in nonparametric settings (e.g.Szabó et al. (2015)).It would be desirable to have more insight in this issue in the graph setting.
On the level of the geometry assumption, several extensions might be of interest.For instance, instead of a single parameter r governing the "dimension" of the graph it might be interesting to consider frameworks allowing graphs which are less homogenous.When estimating a function on some sub-region of a graph, one would expect that the rates should only depend on the local geometry of the graph in that region.It would be of interest to make such statements mathematically precise and to exhibit procedures with good local properties.More generally, recent numerical work has shown that Bayesian Laplacian regularisation can work quite well in practice on graphs that do not satisfy our geometry assumption, see Hartog and van Zanten (2016).To understand this theoretically our current mathematical results are too limited.
A final possible generalization that we want to mention is to the setting of weighted graphs.This is of interest, since in many applications it is natural to work with weighted graphs to quantify the similarity between vertices.We expect that with additional work our results can be extended to that setting.

FIG 2 .
FIG 2. Plot of log λi against log(i/n) for the Watts-Strogatz graph in the right panel.Fitted line has slope 1.42, corresponding to r = 1.4 in the geometry assumption.