Basic models and questions in statistical network analysis

Extracting information from large graphs has become an important statistical problem since network data is now common in various fields. In this minicourse we will investigate the most natural statistical questions for three canonical probabilistic models of networks: (i) community detection in the stochastic block model, (ii) finding the embedding of a random geometric graph, and (iii) finding the original vertex in a preferential attachment tree. Along the way we will cover many interesting topics in probability theory such as Pólya urns, large deviation theory, concentration of measure in high dimension, entropic central limit theorems, and more.


Example 1.1 (Symmetric communities).
A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities. This is modeled by the SBM with $p_i = 1/k$ for all $i \in [k]$ and $Q_{i,j} = a$ if $i = j$ and $Q_{i,j} = b$ otherwise, with $a > b > 0$.
We write G ∼ SBM(n, p, Q) for a graph generated according to the SBM without the hidden vertex labels revealed. The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation. There are various notions of success that are worth studying.
• Weak recovery (also known as detection). An algorithm is said to weakly recover or detect the communities if it outputs a partition of the nodes which is positively correlated with the true partition, with high probability (whp).
• Partial recovery. How much can be recovered about the communities? An algorithm is said to recover communities with an accuracy of $\alpha \in [0, 1]$ if it outputs a labelling of the nodes which agrees with the true labelling on a fraction $\alpha$ of the nodes whp. An important special case is when only $o(n)$ vertices are allowed to be misclassified whp, known as weak consistency or almost exact recovery.
• Exact recovery (also known as recovery or strong consistency). The strongest notion of reconstruction is to recover the labels of all vertices exactly whp. When this is not possible, it can still be of interest to understand which communities can be exactly recovered, if not all; this is sometimes known as "partial-exact-recovery".
In all the notions above, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.
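The model and the notion of agreement up to relabelling can be made concrete with a minimal sketch. The function names `sample_sbm` and `agreement` are illustrative choices, not taken from the text, and the brute-force search over all $k!$ relabellings is only sensible for small $k$.

```python
import itertools
import random


def sample_sbm(n, p, Q, rng):
    """Sample hidden labels and an edge set from SBM(n, p, Q).

    p: list of community prior probabilities (summing to 1).
    Q: symmetric k x k list of lists of edge probabilities in [0, 1].
    """
    k = len(p)
    labels = rng.choices(range(k), weights=p, k=n)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            # Each pair is connected independently, with a probability
            # that depends only on the two community labels.
            if rng.random() < Q[labels[u]][labels[v]]:
                edges.add((u, v))
    return labels, edges


def agreement(true_labels, est_labels, k):
    """Fraction of agreeing labels, maximized over relabellings of the k communities."""
    best = 0
    for perm in itertools.permutations(range(k)):
        matches = sum(1 for s, t in zip(true_labels, est_labels) if perm[t] == s)
        best = max(best, matches)
    return best / len(true_labels)
```

With this definition, an estimate that swaps the names of two communities but induces the correct partition has agreement 1.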
The different notions of recovery naturally lead to studying different regimes of the parameters. It is well known that in the Erdős-Rényi random graph (which is the SBM with a single block or, equivalently, the SBM where all the connection probabilities are equal) the threshold connection probability for the graph to have a giant component is $1/n$ (see [32]) and the threshold connection probability for it to be connected is $\ln(n)/n$ (see [31]). These results suggest what the relevant regimes are for the different notions of recovery in the SBM. For weak recovery to be possible, many vertices in all but one community should be non-isolated (in the symmetric case this means that there should be a giant component), requiring the edge probabilities to be $\Omega(1/n)$. For exact recovery, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be $\Omega(\ln(n)/n)$. In these regimes it is natural to scale the edge probability matrices accordingly, i.e., to consider $\mathrm{SBM}(n, p, Q/n)$ or $\mathrm{SBM}(n, p, \ln(n) Q/n)$, where $Q \in \mathbb{R}_+^{k \times k}$.
There has been much work in the past few years on understanding the fundamental limits to recovery under the various notions discussed above. For weak recovery there is a sharp phase transition, the threshold of which was first conjectured in [26]. This was proven first for two symmetric communities [47,48] and then for multiple communities [4]. Partial recovery is less well understood, and finding the fraction of nodes that can be correctly recovered for a given set of parameters is an open problem; see [49] for results in this direction for two symmetric communities.
In this lecture we are interested in exact recovery, for which Abbe and Sandon gave the value of the threshold for the general SBM, and showed that a quasilinear time algorithm works all the way to the threshold [3] (building on previous work that determined the threshold for two symmetric communities [2,50]). The remainder of this lecture is an exposition of their main results and a few of the key ideas that go into proving and understanding them.

From exact recovery to testing multivariate Poisson distributions
Recall that we are interested in the logarithmic degree regime for exact recovery, i.e., we consider $G \sim \mathrm{SBM}(n, p, \ln(n) Q/n)$, where $Q \in \mathbb{R}_+^{k \times k}$ is independent of $n$. We also assume that the communities have linear size, i.e., that $p$ is independent of $n$, and $p_i \in (0, 1)$ for all $i$. Our goal is to recover the labels of all the vertices whp.
As a thought experiment, imagine that not only is the graph G given, but also all vertex labels are revealed, except for that of a given vertex v ∈ V . Is it possible to determine the label of v?
Understanding this question is key for understanding exact recovery, since if the error probability of this is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but o(n) labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of v given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.
Given the labels of all vertices except $v$, the information we have about $v$ is the number of nodes in each community it is connected to. In other words, we know the degree profile $d(v)$ of $v$, where, for a given labelling of the graph's vertices, the $i$-th component $d_i(v)$ is the number of edges between $v$ and the vertices in community $i$.
The distribution of the degree profile $d(v)$ depends on the community that $v$ belongs to. Recall that the community sizes are given by a multinomial distribution with parameters $n$ and $p$, and hence the relative size of community $i \in [k]$ concentrates on $p_i$. Thus if $\sigma_v = j$, the degree profile $d(v) = (d_1(v), \dots, d_k(v))$ can be approximated by independent binomials, with $d_i(v)$ approximately distributed as $\mathrm{Bin}(n p_i, \ln(n) Q_{i,j}/n)$, where $\mathrm{Bin}(m, q)$ denotes the binomial distribution with $m$ trials and success probability $q$. In this regime, the binomial distribution is well approximated by a Poisson distribution of the same mean. In particular, Le Cam's inequality gives that
\[ \mathrm{TV}\left( \mathrm{Bin}(na, \ln(n) b/n), \mathrm{Poi}(ab \ln(n)) \right) \le \frac{2ab^2 (\ln(n))^2}{n}, \]
where $\mathrm{Poi}(\lambda)$ denotes the Poisson distribution with mean $\lambda$, and $\mathrm{TV}$ denotes the total variation distance. (Recall that the total variation distance between two random variables $X$ and $Y$ taking values in a finite space $\mathcal{X}$ with laws $\mu$ and $\nu$ is defined as $\mathrm{TV}(\mu, \nu) \equiv \mathrm{TV}(X, Y) = \frac{1}{2} \sum_{x \in \mathcal{X}} |\mu(x) - \nu(x)|$.) Using the additivity of the Poisson distribution and the triangle inequality, we get that
\[ \mathrm{TV}\left( \mathcal{L}(d(v)), \mathrm{Poi}\left( \ln(n) \sum_{i \in [k]} p_i Q_{i,j} e_i \right) \right) = O\left( \frac{(\ln(n))^2}{n} \right), \]
where $\mathcal{L}(d(v))$ denotes the law of $d(v)$ conditionally on $\sigma_v = j$ and $e_i$ is the $i$-th unit vector.
Thus the degree profile of a vertex in community $j$ is approximately Poisson distributed with mean $\ln(n) \sum_{i \in [k]} p_i Q_{i,j} e_i$. Defining $P = \mathrm{diag}(p)$, this can be abbreviated as $\ln(n) (PQ)_j$, where $(PQ)_j$ denotes the $j$-th column of the matrix $PQ$. We call the quantity $(PQ)_j$ the community profile of community $j$; this is the quantity that determines the distribution of the degree profile of vertices from a given community.
Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between $k$ multivariate Poisson distributions. The prior on the label of $v$ is given by $p$, and we get to observe the degree profile $d(v)$, which comes from one of $k$ multivariate Poisson distributions, which have mean $\ln(n)$ times the community profiles $(PQ)_j$, $j \in [k]$.
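The quality of the Poisson approximation in the logarithmic degree regime can be checked numerically. The sketch below computes the total variation distance between $\mathrm{Bin}(na, \ln(n) b/n)$ and the Poisson distribution of the same mean, and compares it with the bound $2ab^2(\ln(n))^2/n$ quoted above. The helper names and the parameter values used in the usage line are illustrative assumptions, and $na$ is rounded to an integer.

```python
import math


def binom_pmf(m, q, x):
    """Bin(m, q) probability of x, in log-space to avoid overflow (requires 0 < q < 1)."""
    log_pmf = (math.lgamma(m + 1) - math.lgamma(x + 1) - math.lgamma(m - x + 1)
               + x * math.log(q) + (m - x) * math.log1p(-q))
    return math.exp(log_pmf)


def poisson_pmf(lam, x):
    """Poisson(lam) probability of the non-negative integer x, in log-space."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def tv_binom_poisson(m, q):
    """Total variation (half the L1 distance) between Bin(m, q) and Poi(m q)."""
    lam = m * q
    l1 = 0.0
    poisson_mass = 0.0
    for x in range(m + 1):
        po = poisson_pmf(lam, x)
        l1 += abs(binom_pmf(m, q, x) - po)
        poisson_mass += po
    # Poisson mass above m has no binomial counterpart.
    l1 += max(0.0, 1.0 - poisson_mass)
    return 0.5 * l1


def le_cam_check(n, a, b):
    """Return (TV distance, bound from the text) for Bin(na, ln(n) b / n)."""
    m = round(n * a)
    q = b * math.log(n) / n
    bound = 2 * a * b**2 * math.log(n) ** 2 / n
    return tv_binom_poisson(m, q), bound


tv, bound = le_cam_check(1000, 0.5, 2.0)
```

In this example the computed distance is well below the bound, in line with the claim that the degree profile is approximately Poisson.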

Testing multivariate Poisson distributions
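We now turn to understanding the testing problem described above; the setup is as follows. We consider a Bayesian hypothesis testing problem with $k$ hypotheses. The random variable $H$ takes values in $[k]$ with prior given by $p$, i.e., $P(H = j) = p_j$. We do not observe $H$, but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of $H$: given $H = j$, the mean is $\lambda(j) \in \mathbb{R}_+^k$. In short:
\[ D \mid H = j \sim \mathrm{Poi}(\lambda(j)), \qquad j \in [k]. \]
In more detail:
\[ P(D = d \mid H = j) = P_{\lambda(j)}(d), \qquad d \in \mathbb{Z}_+^k, \]
where
\[ P_{\lambda(j)}(d) = \prod_{i \in [k]} P_{\lambda_i(j)}(d_i) \qquad \text{and} \qquad P_{\lambda_i(j)}(d_i) = \frac{\lambda_i(j)^{d_i}}{d_i!} e^{-\lambda_i(j)}. \]
Our goal is to infer the value of $H$ from a realization of $D$. The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing $D = d$, selects
\[ \arg\max_{j \in [k]} P(D = d \mid H = j) \, p_j \]
as an estimate for the value of $H$, with ties broken arbitrarily. Let $P_e$ denote the error of the MAP estimator. One can think of the MAP estimator as a tournament of $k - 1$ pairwise comparisons of the hypotheses: if $P(D = d \mid H = i) \, p_i > P(D = d \mid H = j) \, p_j$ then the MAP estimate is not $j$. The probability that one makes an error during such a comparison is exactly
\[ P_e(i, j) := \sum_{x \in \mathbb{Z}_+^k} \min\left\{ P(D = x \mid H = i) \, p_i, \; P(D = x \mid H = j) \, p_j \right\}. \tag{1.1} \]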
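The MAP rule for this testing problem is short enough to write out. The following is a minimal sketch, assuming all Poisson means are strictly positive; the names `log_poisson_pmf` and `map_estimate` are illustrative, and ties are broken in favour of the smallest index, which is one admissible choice of arbitrary tie-breaking.

```python
import math


def log_poisson_pmf(mean, d):
    """Log-probability of the integer vector d under independent Poissons with means `mean`."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for lam, x in zip(mean, d))


def map_estimate(d, prior, means):
    """Index j maximizing P(D = d | H = j) * p_j, computed on the log scale."""
    scores = [math.log(pj) + log_poisson_pmf(lam, d) for pj, lam in zip(prior, means)]
    return max(range(len(prior)), key=lambda j: scores[j])


# Two hypotheses with mirrored mean vectors: an observation with many
# neighbours in the first coordinate points to hypothesis 0.
estimate = map_estimate([9, 1], [0.5, 0.5], [[10.0, 2.0], [2.0, 10.0]])
```

In the community recovery setting, `d` would be the degree profile of a vertex, `prior` the vector $p$, and `means[j]` the vector $\ln(n)(PQ)_j$.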

M. Rácz and S. Bubeck
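For finite $k$, the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., $\max_{i,j} P_e(i, j)$. In particular, we have that
\[ \frac{1}{k-1} \sum_{i < j} P_e(i, j) \le P_e \le \sum_{i < j} P_e(i, j). \tag{1.2} \]
Thus we desire to understand the magnitude of the error probability $P_e(i, j)$ in (1.1) in the particular case when the conditional distribution of $D$ given $H$ is a multivariate Poisson distribution with mean vector on the order of $\ln(n)$. The following result determines this error up to first order in the exponent.

Lemma 1.2 (Abbe and Sandon [3]). For any $c_1, c_2 \in (0, \infty)^k$ with $c_1 \neq c_2$ and $p_1, p_2 > 0$, we have
\[ \sum_{x \in \mathbb{Z}_+^k} \min\left\{ P_{\ln(n) c_1}(x) \, p_1, \; P_{\ln(n) c_2}(x) \, p_2 \right\} = O\left( n^{-D_+(c_1, c_2) - \frac{\ln \ln(n)}{2 \ln(n)}} \right), \tag{1.3} \]
\[ \sum_{x \in \mathbb{Z}_+^k} \min\left\{ P_{\ln(n) c_1}(x) \, p_1, \; P_{\ln(n) c_2}(x) \, p_2 \right\} = \Omega\left( n^{-D_+(c_1, c_2) - \frac{k \ln \ln(n)}{2 \ln(n)}} \right), \tag{1.4} \]
where
\[ D_+(c_1, c_2) = \max_{t \in [0,1]} \sum_{i \in [k]} \left( t c_1(i) + (1-t) c_2(i) - c_1(i)^t c_2(i)^{1-t} \right). \tag{1.5} \]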
We do not go over the proof of this statement (which we leave to the reader as a challenging exercise), but we provide some intuition in the univariate case. Consider two Poisson distributions $P_\lambda$ and $P_\mu$ with means $\lambda = 20$ and $\mu = 30$, respectively. Observe that $\min\{P_\lambda(x), P_\mu(x)\}$ decays rapidly away from $x_{\max} := \arg\max_{x \in \mathbb{Z}_+} \min\{P_\lambda(x), P_\mu(x)\}$, so we can obtain a good estimate of the sum $\sum_{x \in \mathbb{Z}_+} \min\{P_\lambda(x), P_\mu(x)\}$ by simply estimating the term $\min\{P_\lambda(x_{\max}), P_\mu(x_{\max})\}$. Now observe that $x_{\max}$ must satisfy $P_\lambda(x_{\max}) \approx P_\mu(x_{\max})$; after some algebra this is equivalent to $x_{\max} \approx \frac{\lambda - \mu}{\log(\lambda/\mu)}$. Let $t^*$ denote the maximizer in the expression of $D_+(\lambda, \mu)$ in (1.5). By differentiating in $t$, we obtain that $t^*$ satisfies $\lambda - \mu - \log(\lambda/\mu) \cdot \lambda^{t^*} \mu^{1-t^*} = 0$, and so $\lambda^{t^*} \mu^{1-t^*} = \frac{\lambda - \mu}{\log(\lambda/\mu)}$. Thus we see that $x_{\max} \approx \lambda^{t^*} \mu^{1-t^*}$, from which, after some algebra (Stirling's formula), we get that
\[ P_\lambda(x_{\max}) \approx P_\mu(x_{\max}) \approx \exp\left( -D_+(\lambda, \mu) \right), \]
up to factors that are polynomial in $x_{\max}$. The proof of (1.4) in the multivariate case follows along the same lines: the single term corresponding to the analogous maximizing point gives the lower bound. For the upper bound of (1.3) one has to show that the other terms do not contribute much more.
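The univariate intuition can be checked numerically for $\lambda = 20$ and $\mu = 30$. The sketch below locates the maximizer of $\min\{P_\lambda(x), P_\mu(x)\}$ by brute force and compares it with the closed-form value $(\lambda - \mu)/\log(\lambda/\mu)$; the function names and the truncation point `x_limit` are assumptions made for illustration.

```python
import math


def poisson_pmf(lam, x):
    """Poisson(lam) probability of the non-negative integer x, in log-space."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def overlap_peak(lam, mu, x_limit=500):
    """Return (x_max, peak value, total sum) of min{P_lam(x), P_mu(x)} over 0 <= x < x_limit."""
    values = [min(poisson_pmf(lam, x), poisson_pmf(mu, x)) for x in range(x_limit)]
    x_max = max(range(x_limit), key=lambda x: values[x])
    return x_max, values[x_max], sum(values)


def d_plus_univariate(lam, mu):
    """Maximizer t* and value of max over t of (t lam + (1 - t) mu - lam^t mu^(1 - t)), lam != mu."""
    log_ratio = math.log(lam / mu)
    geo_mean = (lam - mu) / log_ratio  # equals lam^{t*} mu^{1 - t*}
    t_star = math.log(geo_mean / mu) / log_ratio
    return t_star, t_star * lam + (1 - t_star) * mu - geo_mean


x_max, peak, total = overlap_peak(20.0, 30.0)
t_star, d_plus = d_plus_univariate(20.0, 30.0)
```

For these means the brute-force maximizer lies within one unit of $(\lambda - \mu)/\log(\lambda/\mu) \approx 24.66$, and $\log$ of the peak value agrees with $-D_+(\lambda, \mu)$ once the Stirling correction $-\frac{1}{2}\log(2\pi x_{\max})$ is included. The agreement between the peak and the full sum is on the exponential scale only, which is what Lemma 1.2 quantifies.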

Exercise 1.2. Prove Lemma 1.2.
Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity $D_+$ in (1.5). The discussion in Section 1.2 then implies that $D_+$ plays an important role in the threshold for exact recovery. In particular, it intuitively follows from Lemma 1.2 that a necessary condition for exact recovery should be that
\[ \min_{i, j \in [k], \, i \neq j} D_+\left( (PQ)_i, (PQ)_j \right) \ge 1. \]
Suppose on the contrary that $D_+((PQ)_i, (PQ)_j) < 1$ for some $i$ and $j$. This implies that the error probability in the testing problem is $\Omega(n^{\varepsilon - 1})$ for some $\varepsilon > 0$ for all vertices in communities $i$ and $j$. Since the number of vertices in these communities is linear in $n$, and most of the hypothesis testing problems are approximately independent, one expects there to be no error in the testing problems with probability at most
\[ \left( 1 - \Omega(n^{\varepsilon - 1}) \right)^{\Omega(n)} = \exp\left( -\Omega(n^{\varepsilon}) \right) = o(1). \]

Chernoff-Hellinger divergence
Before moving on to the threshold for exact recovery in the general SBM, we discuss connections of $D_+$ to other, well-known measures of divergence. Writing
\[ D_t(\mu, \nu) := \sum_x \left( t \mu(x) + (1-t) \nu(x) - \mu(x)^t \nu(x)^{1-t} \right), \]
we have $D_+(\mu, \nu) = \max_{t \in [0,1]} D_t(\mu, \nu)$. For any fixed $t$, $D_t$ can be written as
\[ D_t(\mu, \nu) = \sum_x \nu(x) f_t\left( \frac{\mu(x)}{\nu(x)} \right), \qquad \text{where } f_t(x) = 1 - t + tx - x^t, \]
which is a convex function. Thus $D_t$ is an $f$-divergence, part of a family of divergences that generalize the Kullback-Leibler (KL) divergence (also known as relative entropy), which is obtained for $f(x) = x \ln(x)$. The family of $f$-divergences with convex $f$ share many useful properties, and hence have been widely studied in information theory and statistics. The special case $D_{1/2}(\mu, \nu) = \frac{1}{2} \left\| \sqrt{\mu} - \sqrt{\nu} \right\|_2^2$ is known as the Hellinger divergence. The Chernoff divergence is defined as $C^*(\mu, \nu) = \max_{t \in (0,1)} -\log \sum_x \mu(x)^t \nu(x)^{1-t}$, and so if $\mu$ and $\nu$ are probability vectors, then $D_+(\mu, \nu) = 1 - e^{-C^*(\mu, \nu)}$. Because of these connections, Abbe and Sandon termed $D_+$ the Chernoff-Hellinger divergence.
While the quantity $D_+$ still might seem mysterious, even in light of these connections, a useful point of view is that Lemma 1.2 gives $D_+$ an operational meaning.

Characterizing exact recoverability using CH-divergence
Going back to the exact recovery problem in the general SBM, let us jump right in and state the recoverability threshold of Abbe and Sandon: exact recovery in $\mathrm{SBM}(n, p, \ln(n) Q/n)$ is possible if and only if the CH-divergence between all pairs of community profiles is at least 1.

Theorem 1.3 (Abbe and Sandon [3]). Let $k \in \mathbb{Z}_+$ denote the number of communities, let $p \in (0, 1)^k$ with $\|p\|_1 = 1$ denote the community prior, let $P = \mathrm{diag}(p)$, and let $Q \in (0, \infty)^{k \times k}$ be a symmetric $k \times k$ matrix with no two rows equal. Exact recovery is solvable in $\mathrm{SBM}(n, p, \ln(n) Q/n)$ if and only if
\[ \min_{i, j \in [k], \, i \neq j} D_+\left( (PQ)_i, (PQ)_j \right) \ge 1. \tag{1.6} \]
This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.
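The condition of the theorem is easy to evaluate numerically. The sketch below assumes the CH-divergence is $D_+(\mu, \nu) = \max_{t \in [0,1]} \sum_x (t\mu(x) + (1-t)\nu(x) - \mu(x)^t \nu(x)^{1-t})$ and uses a ternary search over $t$, which is valid because the objective is concave in $t$. The function names are illustrative, and the numerical check is only a finite-precision proxy for the inequality in the theorem, so it is not meaningful exactly at the threshold.

```python
import math


def d_t(mu, nu, t):
    """D_t(mu, nu) = sum over x of (t mu(x) + (1 - t) nu(x) - mu(x)^t nu(x)^(1 - t))."""
    return sum(t * m + (1 - t) * v - m**t * v ** (1 - t) for m, v in zip(mu, nu))


def ch_divergence(mu, nu, iters=200):
    """D_+(mu, nu): maximize the concave map t -> D_t(mu, nu) over [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if d_t(mu, nu, m1) < d_t(mu, nu, m2):
            lo = m1
        else:
            hi = m2
    return d_t(mu, nu, (lo + hi) / 2)


def exact_recovery_solvable(p, Q):
    """Check that D_+((PQ)_i, (PQ)_j) >= 1 for all i != j, with (PQ)_j the j-th column of diag(p) Q."""
    k = len(p)
    profiles = [[p[i] * Q[i][j] for i in range(k)] for j in range(k)]
    return all(
        ch_divergence(profiles[i], profiles[j]) >= 1
        for i in range(k)
        for j in range(i + 1, k)
    )


# Two symmetric communities with within-community parameter a and
# between-community parameter b (values chosen for illustration).
strong = exact_recovery_solvable([0.5, 0.5], [[16.0, 1.0], [1.0, 16.0]])
weak = exact_recovery_solvable([0.5, 0.5], [[4.0, 1.0], [1.0, 4.0]])
```

For two symmetric communities the maximum is attained at $t = 1/2$ and the divergence equals $(\sqrt{a} - \sqrt{b})^2/2$, so the first parameter choice gives 4.5 and the second gives 0.5.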

Example 1.4 (Symmetric communities). Consider again $k$ symmetric communities, that is, $p_i = 1/k$ for all $i \in [k]$, $Q_{i,j} = a$ if $i = j$, and $Q_{i,j} = b$ otherwise, with $a, b > 0$ and $a \neq b$. Then condition (1.6) reads $(\sqrt{a} - \sqrt{b})^2 / k \ge 1$, i.e., exact recovery is solvable if and only if $|\sqrt{a} - \sqrt{b}| \ge \sqrt{k}$. We note that in this case $D_+$ is the same as the Hellinger divergence.
Let us now see why condition (1.6) suffices. When it holds, Lemma 1.2 implies that the MAP estimate in the hypothesis testing problem between Poisson distributions errs with probability $o(1/n)$. Thus if the setting of the thought experiment described in Section 1.2 applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is $o(1)$ by a union bound. However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only $o(n)$ vertices being misclassified. The details of this partial reconstruction procedure would require a separate lecture (in brief, it determines whether two vertices are in the same community or not by looking at how their $\log(n)$-size neighborhoods interact), so we will take this for granted; we refer the interested reader to [3] for more. It is possible to show that there exists a constant $\delta$ such that if one estimates the label of a vertex $v$ based on classifications of its neighbors that are wrong with probability $x$, then the probability of misclassifying $v$ is at most $n^{\delta x}$ times the probability of error if all the neighbors of $v$ were classified correctly. The issue is that the standard partial recovery algorithm has a constant error rate for the classifications, thus the error rate of the degree profiling step could be $n^c$ times as large as the error in the hypothesis testing problem, for some $c > 0$. This is an issue when $\min_{i \neq j} D_+((PQ)_i, (PQ)_j) < 1 + c$.
To get around this, one can do multiple rounds of more accurate classifications. First, one obtains a partial reconstruction of the labels with an error rate that is a sufficiently low constant. After applying the degree-profiling step to each vertex, the classification error at each vertex is now $O(n^{-c})$ for some $c > 0$. Hence after applying another degree-profiling step to each vertex, the classification error at each vertex will now be at most $n^{\delta \times O(n^{-c})} \times o(1/n) = o(1/n)$. Thus applying a union bound at this stage we can conclude that all vertices are correctly labelled whp.
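The arithmetic behind the last bound can be spelled out as follows; here $C$ stands for the implicit constant in $O(n^{-c})$, a symbol introduced only for this worked step.

```latex
\[
  n^{\delta \cdot C n^{-c}}
  = \exp\!\left( \delta C \, n^{-c} \ln n \right)
  \longrightarrow 1
  \quad \text{as } n \to \infty,
  \qquad \text{so} \qquad
  n^{\delta \times O(n^{-c})} \times o(1/n) = O(1) \times o(1/n) = o(1/n).
\]
```

In other words, once the neighbors are misclassified with polynomially small probability, the multiplicative penalty $n^{\delta x}$ is bounded, and the error of the second degree-profiling step is of the same order as in the idealized thought experiment.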

Impossibility
The necessity of condition (1.6) was already described at a high level at the end of Section 1.3. Here we give some details on how to deal with the dependencies that arise.
Assume that (1.6) does not hold, and let $i$ and $j$ be two communities that violate the condition, i.e., for which $D_+((PQ)_i, (PQ)_j) < 1$. We want to argue that vertices in communities $i$ and $j$ cannot all be distinguished, and so any classification algorithm has to make at least one error whp. An important fact that we use is that the lower bound (1.4) arises from a particular choice of degree profile that is likely for both communities. Namely, define the degree profile $x = (x_1, \dots, x_k)$ by
\[ x_\ell = \left\lfloor (PQ)_{\ell,i}^{t^*} (PQ)_{\ell,j}^{1-t^*} \ln(n) \right\rfloor \]
for every $\ell \in [k]$, where $t^* \in [0, 1]$ is the maximizer in $D_+((PQ)_i, (PQ)_j)$, i.e., the value for which $D_+((PQ)_i, (PQ)_j) = D_{t^*}((PQ)_i, (PQ)_j)$. Then Lemma 1.2 tells us that for any vertex in community $i$ or $j$, the probability that it has degree profile $x$ is at least
\[ \Omega\left( \frac{n^{-D_+((PQ)_i, (PQ)_j)}}{(\ln(n))^{k/2}} \right), \]
which is at least $\Omega(n^{\varepsilon - 1})$ for some $\varepsilon > 0$ by assumption.
To show that this holds for many vertices in communities $i$ and $j$ at once, we first select a random set $S$ of $n/(\ln(n))^3$ vertices. Whp the intersection of $S$ with any community $\ell$ is within $\sqrt{n}$ of the expected value $p_\ell n/(\ln(n))^3$, and furthermore a randomly selected vertex in $S$ is not connected to any other vertex in $S$. Thus the distribution of a vertex's degree profile excluding connections to vertices in $S$ is essentially a multivariate Poisson distribution as before. We call a vertex in $S$ ambiguous if for each $\ell \in [k]$ it has exactly $x_\ell$ neighbors in community $\ell$ that are not in $S$. By Lemma 1.2 we have that a vertex in $S$ that is in community $i$ or $j$ is ambiguous with probability $\Omega(n^{\varepsilon - 1})$. By definition, for a fixed community assignment and choice of $S$, there is no dependence on whether two vertices are ambiguous. Furthermore, due to the choice of the size of $S$, whp there are at least $\ln(n)$ ambiguous vertices in community $i$ and at least $\ln(n)$ ambiguous vertices in community $j$ that are not adjacent to any other vertices in $S$. These $2 \ln(n)$ vertices are indistinguishable, so no algorithm classifies all of them correctly with probability greater than $1 / \binom{2 \ln(n)}{\ln(n)}$, which tends to 0 as $n \to \infty$.

The finest exact partition recoverable
We conclude by mentioning that this threshold generalizes to finer questions. If exact recovery is not possible, what is the finest partition that can be recovered? We say that exact recovery is solvable for a community partition $[k] = \sqcup_{s=1}^{t} A_s$, where each $A_s$ is a subset of $[k]$, if there exists an algorithm that whp assigns to every vertex an element of $\{A_1, \dots, A_t\}$ that contains its true community. The finest partition that is exactly recoverable can also be expressed using CH-divergence in a similar fashion. It is the largest collection of disjoint subsets such that the CH-divergence between these subsets is at least 1, where the CH-divergence between two subsets is defined as the minimum of the CH-divergences between any two community profiles in these subsets.

Theorem 1.5 (Abbe and Sandon [3]). Under the same settings as in Theorem 1.3, exact recovery is solvable in $\mathrm{SBM}(n, p, \ln(n) Q/n)$ for a partition $[k] = \sqcup_{s=1}^{t} A_s$ if and only if
\[ D_+\left( (PQ)_i, (PQ)_j \right) \ge 1 \]
for every $i$ and $j$ in different subsets of the partition.
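One way to read this criterion algorithmically: any two communities whose profiles have CH-divergence below 1 must share a block, so the finest admissible partition is obtained by merging such pairs and taking the transitive closure. The sketch below implements this reading with a small union-find structure. The function name and the convention that the pairwise divergences are supplied as a precomputed symmetric matrix are assumptions made for illustration.

```python
def finest_recoverable_partition(divergence):
    """Group communities so that any two in different blocks have divergence >= 1.

    divergence[i][j] is assumed to hold the CH-divergence between the
    community profiles of communities i and j (a symmetric k x k matrix).
    Returns the blocks as a sorted list of sorted lists of indices.
    """
    k = len(divergence)
    parent = list(range(k))

    def find(i):
        # Find the representative of i, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            if divergence[i][j] < 1:
                # Communities i and j cannot be told apart: merge their blocks.
                parent[find(i)] = find(j)

    blocks = {}
    for i in range(k):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())
```

Note that merging is transitive: if communities 0 and 1 are too close and so are 1 and 2, then all three end up in one block even when communities 0 and 2 are well separated.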

Outlook: open problems and extensions
There has been a huge recent interest in the SBM, initiated by the work of Decelle et al. [26], which has led to enormous advancements in this area. Nonetheless, plenty of exciting avenues for further research remain. We refer the reader to the recent survey by Abbe [1] for a thorough treatment of the SBM and recent developments, including a more detailed discussion of open problems; here we mention just a few of them.
In this lecture we discussed exact recovery in the case of linear-size communities. A natural open problem is to understand what happens in the case of sublinear-size communities: where do the phase transitions occur?
In the case of weak recovery, Abbe and Sandon [4] recently showed that in the case of k ≥ 4 communities, it is possible to information-theoretically recover the communities below the so-called KS threshold. Is it possible to locate the precise information-theoretic threshold in this setting? Is it possible to show that the computational threshold is the KS threshold, thereby proving that there exists an information-computation gap?
In many applications, communities are not disjoint but overlapping. How well can we detect overlapping communities? What are the fundamental limits? To understand the utility of the SBM we must also study how robust the obtained results are to changes in the model. What can we say about variants of the SBM, such as including degree corrections or allowing adversaries to modify the graph?

Lecture 2: Estimating the dimension of a random geometric graph on a high-dimensional sphere
Many real-world networks have strong structural features and our goal is often to recover these hidden structures. In the previous lecture we studied the fundamental limits of inferring communities in the stochastic block model, a natural generative model for graphs with community structure. Another possibility is geometric structure. Many networks coming from physical considerations naturally have an underlying geometry, such as the network of major roads in a country. In other networks this stems from a latent feature space of the nodes. For instance, in social networks a person might be represented by a feature vector of their interests, and two people are connected if their interests are close enough; this latent metric space is referred to as the social space [36]. In such networks the natural questions probe the underlying geometry. Can one detect the presence of geometry? If so, can one estimate various aspects of the geometry, e.g., an appropriately defined dimension? In this lecture we study these questions in a particularly natural and simple generative model of a random geometric graph: $n$ points are picked uniformly at random on the $d$-dimensional sphere, and two points are connected by an edge if and only if they are sufficiently close. We are particularly interested in the high-dimensional regime, motivated by recent advances in all areas of applied mathematics, and in particular statistics and learning theory, where high-dimensional feature spaces are becoming the new norm. While the low-dimensional regime has been studied for a long time in probability theory [51], the high-dimensional regime brings about a host of new and interesting questions.

A simple random geometric graph model and basic questions
Let us now define more precisely the random geometric graph model we consider and the questions we study. In general, a geometric graph is such that each vertex is labeled with a point in some metric space, and an edge is present between two vertices if the distance between the corresponding labels is smaller than some prespecified threshold. We focus on the case where the underlying metric space is the Euclidean sphere $S^{d-1} = \{ x \in \mathbb{R}^d : \|x\|_2 = 1 \}$, and the latent labels are i.i.d. uniform random vectors in $S^{d-1}$. We denote this model by $G(n, p, d)$, where $n$ is the number of vertices and $p$ is the probability of an edge between two vertices ($p$ determines the threshold distance for connection). This model is closely related to latent space approaches to social network analysis [36], and it has applications to remote sensing and finance as well [27]. Locally sparsified versions of related random geometric graphs also serve as models for wireless networks, see, e.g., [15,16,17] and the references therein.
Slightly more formally, $G(n, p, d)$ is defined as follows. Let $X_1, \dots, X_n$ be independent random vectors, uniformly distributed on $S^{d-1}$. In $G(n, p, d)$, distinct vertices $i \in [n]$ and $j \in [n]$ are connected by an edge if and only if
\[ \langle X_i, X_j \rangle \ge t_{p,d}, \]
where the threshold value $t_{p,d} \in [-1, 1]$ is such that $P(\langle X_1, X_2 \rangle \ge t_{p,d}) = p$. For example, when $p = 1/2$ we have $t_{p,d} = 0$ by symmetry.
The most natural random graph model without any structure is the standard Erdős-Rényi random graph $G(n, p)$, where any two of the $n$ vertices are independently connected with probability $p$.
We can thus formalize the question of detecting underlying geometry as a simple hypothesis testing question. The null hypothesis is that the graph is drawn from the Erdős-Rényi model, while the alternative is that it is drawn from $G(n, p, d)$. In brief:
\[ H_0 : G \sim G(n, p), \qquad H_1 : G \sim G(n, p, d). \]
To understand this question, the basic quantity we need to study is the total variation distance between the two distributions on graphs, $G(n, p)$ and $G(n, p, d)$, denoted by $\mathrm{TV}(G(n, p), G(n, p, d))$; recall that the total variation distance between two probability measures $P$ and $Q$ is defined as $\mathrm{TV}(P, Q) = \frac{1}{2} \|P - Q\|_1 = \sup_A |P(A) - Q(A)|$. We are interested in particular in the case when the dimension $d$ is large, growing with $n$.
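Both hypotheses are easy to sample from when $p = 1/2$. The sketch below assumes that two vertices of the geometric graph are connected exactly when the inner product of their latent vectors is non-negative, which by symmetry gives edge probability $1/2$. The function names are illustrative, and independent Gaussian vectors are used in place of uniform points on the sphere, which is harmless here because normalization does not change the sign of an inner product.

```python
import random


def sample_geometric_graph(n, d, rng):
    """Adjacency matrix of G(n, 1/2, d): connect i and j iff <X_i, X_j> >= 0."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(a * b for a, b in zip(points[i], points[j]))
            if inner >= 0:
                adj[i][j] = adj[j][i] = 1
    return adj


def sample_erdos_renyi(n, p, rng):
    """Adjacency matrix of G(n, p): each pair is connected independently with probability p."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj


rng = random.Random(0)
geometric = sample_geometric_graph(8, 3, rng)
null_graph = sample_erdos_renyi(8, 0.5, rng)
```

In dimension $d = 1$ the geometry is extreme: the latent points are signs, and the graph is a disjoint union of two cliques, which is trivially distinguishable from an Erdős-Rényi graph.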
It is intuitively clear that if the geometry is too high-dimensional, then it is impossible to detect it, while a low-dimensional geometry will have a strong effect on the generated graph and will be detectable. How fast can the dimension grow with n while still being able to detect it? Most of this lecture will focus on this question.
If we can detect geometry, then it is natural to ask for more information. Perhaps the ultimate goal would be to find an embedding of the vertices into an appropriate dimensional sphere that is a true representation, in the sense that the geometric graph formed from the embedded points is indeed the original graph. More modestly, can the dimension be estimated? We touch on this question at the end of the lecture.

The dimension threshold for detecting underlying geometry
The high-dimensional setting of the random geometric graph $G(n, p, d)$ was first studied by Devroye, György, Lugosi, and Udina [27], who showed that if $n$ is fixed and $d \to \infty$, then $\mathrm{TV}(G(n, p), G(n, p, d)) \to 0$, that is, geometry is indeed lost in high dimensions. More precisely, they show that this convergence happens when $d \gg n^7 2^{n^2/2}$. This follows by observing that for fixed $n$, the multivariate central limit theorem implies that as $d \to \infty$, the suitably rescaled inner products of the latent vectors converge in distribution to a standard Gaussian:
\[ \left( \sqrt{d} \, \langle X_i, X_j \rangle \right)_{1 \le i < j \le n} \Rightarrow \left( Z_{i,j} \right)_{1 \le i < j \le n}, \]
where the $Z_{i,j}$ are i.i.d. standard normal random variables. The Berry-Esseen theorem gives a convergence rate, which then allows one to show that for any graph $G$ on $n$ vertices, $|P(G(n, p) = G) - P(G(n, p, d) = G)| = O(n^7/d)$; the factor of $2^{n^2/2}$ comes from applying this bound to every term in the $L^1$ distance.
However, the result above is not tight, and we seek to understand the fundamental limits to detecting underlying geometry. The dimension threshold for dense graphs was recently found in [19], and it turns out that it is $d \approx n^3$, in the following sense.

Theorem 2.1 (Bubeck, Ding, Eldan, Rácz [19]). Let $p \in (0, 1)$ be fixed. Then
\[ \mathrm{TV}(G(n, p), G(n, p, d)) \to 0 \quad \text{if } d \gg n^3, \tag{2.2} \]
\[ \mathrm{TV}(G(n, p), G(n, p, d)) \to 1 \quad \text{if } d \ll n^3. \tag{2.3} \]
Moreover, in the latter case there exists a computationally efficient test to detect underlying geometry (with running time $O(n^3)$).
Most of the lecture will be devoted to understanding this theorem. At the end we will consider this same question for sparse graphs (where p = c/n), where determining the dimension threshold is an intriguing open problem.

The triangle test
A natural test to uncover geometric structure is to count the number of triangles in $G$. Indeed, in a purely random scenario, vertex $u$ being connected to both $v$ and $w$ says nothing about whether $v$ and $w$ are connected. On the other hand, in a geometric setting this implies that $v$ and $w$ are close to each other due to the triangle inequality, thus increasing the probability of a connection between them. This, in turn, implies that the expected number of triangles is larger in the geometric setting, given the same edge density. Let us now compute what this statistic gives us. For a graph $G$, let $A$ denote its adjacency matrix, i.e., $A_{i,j} = 1$ if vertices $i$ and $j$ are connected, and 0 otherwise. Then $T_G(i, j, k) := A_{i,j} A_{i,k} A_{j,k}$ is the indicator variable that three vertices $i$, $j$, and $k$ form a triangle, and so the number of triangles in $G$ is
\[ T(G) := \sum_{\{i,j,k\} \in \binom{[n]}{3}} T_G(i, j, k). \]
By linearity of expectation, for both models the expected number of triangles is $\binom{n}{3}$ times the probability of a triangle between three specific vertices. For the Erdős-Rényi random graph the edges are independent, so the probability of a triangle is $p^3$, and thus we have
\[ \mathbb{E}\left[ T(G(n, p)) \right] = \binom{n}{3} p^3. \]
For $G(n, p, d)$ it turns out that for any fixed $p \in (0, 1)$ we have
\[ P\left( T_{G(n,p,d)}(1, 2, 3) = 1 \right) \approx p^3 \left( 1 + \frac{C_p}{\sqrt{d}} \right) \tag{2.4} \]
for some constant $C_p > 0$, which gives that
\[ \mathbb{E}\left[ T(G(n, p, d)) \right] \approx \binom{n}{3} p^3 \left( 1 + \frac{C_p}{\sqrt{d}} \right). \]
Showing (2.4) is somewhat involved, but in essence it follows from the concentration of measure phenomenon on the sphere, namely that most of the mass on the high-dimensional sphere is located in a band of $O(1/\sqrt{d})$ around the equator. We sketch here the main intuition for $p = 1/2$, which is illustrated in Figure 5. Let $X_1$, $X_2$, and $X_3$ be independent uniformly distributed points in $S^{d-1}$. Then
\[ \begin{aligned} P\left( T_{G(n,1/2,d)}(1, 2, 3) = 1 \right) &= P\left( \langle X_1, X_2 \rangle \ge 0, \langle X_1, X_3 \rangle \ge 0, \langle X_2, X_3 \rangle \ge 0 \right) \\ &= P\left( \langle X_2, X_3 \rangle \ge 0 \mid \langle X_1, X_2 \rangle \ge 0, \langle X_1, X_3 \rangle \ge 0 \right) P\left( \langle X_1, X_2 \rangle \ge 0, \langle X_1, X_3 \rangle \ge 0 \right) \\ &= \frac{1}{4} \times P\left( \langle X_2, X_3 \rangle \ge 0 \mid \langle X_1, X_2 \rangle \ge 0, \langle X_1, X_3 \rangle \ge 0 \right), \end{aligned} \]
where the last equality follows by independence. So what remains is to show that this latter conditional probability is approximately $1/2 + c/\sqrt{d}$ for some constant $c > 0$. To compute this conditional probability what we really need to know is the typical angle between $X_1$ and $X_2$. By rotational invariance we may assume that $X_1 = (1, 0, 0, \dots, 0)$, and hence $\langle X_1, X_2 \rangle = X_2(1)$, the first coordinate of $X_2$. One way to generate $X_2$ is to sample a $d$-dimensional standard Gaussian and then normalize it by its length. Since the norm of a $d$-dimensional standard Gaussian is very well concentrated around $\sqrt{d}$, it follows that $X_2(1)$ is on the order of $1/\sqrt{d}$. Conditioned on $X_2(1) \ge 0$, this typical angle gives the boost in the conditional probability that we see. See Figure 5 for an illustration.
Thus we see that the boost in the number of triangles in the geometric setting is $\Theta(n^3/\sqrt{d})$ in expectation:
\[ \mathbb{E}\left[ T(G(n, p, d)) \right] - \mathbb{E}\left[ T(G(n, p)) \right] \approx \binom{n}{3} \frac{C_p \, p^3}{\sqrt{d}}. \]

To be able to tell apart the two graph distributions based on the number of triangles, the boost in expectation needs to be much greater than the standard deviation.
Putting together Exercises 2.1 and 2.2 we see that $\mathrm{TV}(G(n, p), G(n, p, d)) \to 1$ if $n^3/\sqrt{d} \gg \sqrt{n^4}$, which is equivalent to $d \ll n^2$.

Signed triangles are more powerful
While triangles detect geometry up until $d \ll n^2$, are there even more powerful statistics that detect geometry for larger dimensions? One can check that longer cycles also only work when $d \ll n^2$, as do several other natural statistics. Yet it turns out that the underlying geometry can be detected even when $d \ll n^3$. The simple idea that leads to this improvement is to consider signed triangles. We have already noticed that triangles are more likely in the geometric setting than in the purely random setting. This also means that induced wedges (i.e., when there are exactly two edges among the three possible ones) are less likely in the geometric setting. Similarly, induced single edges are more likely, and induced independent sets on three vertices are less likely in the geometric setting. Figure 6 summarizes these observations.
The signed triangles statistic incorporates these observations by giving the different patterns positive or negative weights. More precisely, we define
\[ \tau(G) := \sum_{\{i,j,k\} \in \binom{[n]}{3}} \tau_G(i, j, k), \qquad \tau_G(i, j, k) := (A_{i,j} - p)(A_{i,k} - p)(A_{j,k} - p). \]
The key insight motivating this definition is that the variance of signed triangles is much smaller than the variance of triangles, due to the cancellations introduced by the centering of the adjacency matrix: the $\Theta(n^4)$ term vanishes, leaving only the $\Theta(n^3)$ term.
On the other hand it can be shown that
\[ \mathbb{E}\left[ \tau(G(n, p, d)) \right] \ge c_p \frac{n^3}{\sqrt{d}} \tag{2.5} \]
for some constant $c_p > 0$, while $\mathbb{E}[\tau(G(n, p))] = 0$ by independence of the centered entries, so the gap between the expectations remains. Furthermore, it can also be shown that the variance also decreases for $G(n, p, d)$ and we have
\[ \mathrm{Var}\left( \tau(G(n, p, d)) \right) = O\left( n^3 + \frac{n^4}{d} \right). \tag{2.6} \]
Putting everything together and using Exercise 2.2 for the signed triangles statistic $\tau$, we get that $\mathrm{TV}(G(n, p), G(n, p, d)) \to 1$ if $n^3/\sqrt{d} \gg \sqrt{n^3 + n^4/d}$, which is equivalent to $d \ll n^3$. This concludes the proof of (2.3) from Theorem 2.1.
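Both statistics are simple to compute from an adjacency matrix. The sketch below assumes that the signed triangle statistic is the sum over triples of the product of the three centered adjacency entries, $(A_{i,j} - p)(A_{i,k} - p)(A_{j,k} - p)$, consistent with the signs described above. The function names are illustrative, and the direct enumeration of triples takes time $O(n^3)$.

```python
from itertools import combinations


def triangle_count(adj):
    """Number of triangles in the graph with 0/1 adjacency matrix adj."""
    n = len(adj)
    return sum(
        adj[i][j] * adj[i][k] * adj[j][k]
        for i, j, k in combinations(range(n), 3)
    )


def signed_triangle_count(adj, p):
    """Sum over triples of the product of the three centered entries (A - p)."""
    n = len(adj)
    return sum(
        (adj[i][j] - p) * (adj[i][k] - p) * (adj[j][k] - p)
        for i, j, k in combinations(range(n), 3)
    )


# Weights of the four induced patterns on three vertices for p = 1/2:
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
wedge = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
single_edge = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
weights = [signed_triangle_count(g, 0.5) for g in (triangle, wedge, single_edge, empty)]
```

The four weights alternate in sign (positive for triangles and single edges, negative for wedges and independent sets), matching the patterns that are more or less likely in the geometric setting.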

Barrier to detecting geometry: when Wishart becomes GOE
We now turn to proving (2.2), which, together with (2.3), shows that the threshold dimension for detecting geometry is $n^3$. This also shows that the signed triangle statistic is near-optimal, since it can detect geometry whenever $d \ll n^3$. There are essentially three main ways to bound the total variation of two distributions from above: (i) if the distributions have nice formulas associated with them, then exact computation is possible; (ii) through coupling the distributions; or (iii) by using inequalities between probability metrics to switch the problem to bounding a different notion of distance between the distributions. Here, while the distribution of $G(n, p, d)$ does not have a nice formula associated with it, the main idea is to view this random geometric graph as a function of an $n \times n$ Wishart matrix with $d$ degrees of freedom (i.e., a matrix of inner products of $n$ $d$-dimensional Gaussian vectors), denoted by $W(n, d)$. It turns out that one can view $G(n, p)$ as (essentially) the same function of an $n \times n$ GOE random matrix (i.e., a symmetric matrix with independent Gaussian entries on and above the diagonal), denoted by $M(n)$. The upside of this is that both of these random matrix ensembles have explicit densities that allow for explicit computation. We explain this connection here in the special case of $p = 1/2$ for simplicity; see [19] for the case of general $p$.
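Recall that if $Y_1$ is a standard normal random vector in $\mathbb{R}^d$, then $Y_1 / \|Y_1\|$ is uniformly distributed on the sphere $\mathbb{S}^{d-1}$. Consequently we can view $G(n,1/2,d)$ as a function of an appropriate Wishart matrix, as follows. Let $Y$ be an $n \times d$ matrix where the entries are i.i.d. standard normal random variables, and let $W \equiv W(n,d) = YY^T$ be the corresponding $n \times n$ Wishart matrix. Writing $Y_i$ for the $i$th row of $Y$, we have $W_{ii} = \|Y_i\|^2$, so the inner product of the normalized rows $i$ and $j$ is $W_{ij}/\sqrt{W_{ii} W_{jj}}$, which has the same sign as $W_{ij}$. Thus the matrix $A = H(W)$, defined by $A_{ij} = 1$ if $W_{ij} \ge 0$ and $i \ne j$, and $A_{ij} = 0$ otherwise, has the same law as the adjacency matrix of $G(n,1/2,d)$. Similarly, let $M(n)$ be a GOE matrix, with standard normal entries above the diagonal and normal entries of variance 2 on the diagonal. Since $H$ only depends on the signs of the off-diagonal entries, both $H(M(n))$ and $H(M(n,d))$, where $M(n,d) := \sqrt{d}\, M(n) + d I_n$, have the same law as the adjacency matrix of $G(n,1/2)$. Consequently
$$\mathrm{TV}(G(n,1/2,d), G(n,1/2)) \le \mathrm{TV}(W(n,d), M(n,d)).$$

The densities of these two random matrix ensembles are well known. Let $\mathcal{P} \subset \mathbb{R}^{n^2}$ denote the cone of positive semidefinite matrices. When $d \ge n$, $W(n,d)$ has the following density with respect to the Lebesgue measure on $\mathcal{P}$:
$$f_{n,d}(A) := \frac{(\det A)^{\frac{1}{2}(d-n-1)} \exp\left(-\frac{1}{2} \mathrm{Tr}(A)\right)}{2^{\frac{1}{2} dn}\, \pi^{\frac{1}{4} n(n-1)} \prod_{i=1}^{n} \Gamma\left(\frac{1}{2}(d+1-i)\right)},$$
where $\mathrm{Tr}(A)$ denotes the trace of the matrix $A$. It is also known that the density of a GOE random matrix with respect to the Lebesgue measure on $\mathbb{R}^{n^2}$ is
$$g_n(A) := \frac{\exp\left(-\frac{1}{4} \mathrm{Tr}(A^2)\right)}{(2\pi)^{\frac{1}{4} n(n+1)}\, 2^{\frac{n}{2}}},$$
and so the density of $M(n,d)$ with respect to the Lebesgue measure on $\mathbb{R}^{n^2}$ is
$$g_{n,d}(A) := \frac{\exp\left(-\frac{1}{4d} \mathrm{Tr}\left((A - d I_n)^2\right)\right)}{(2\pi d)^{\frac{1}{4} n(n+1)}\, 2^{\frac{n}{2}}}.$$

These explicit formulas allow for explicit calculations. In particular, one can show that the log-ratio of the densities is $o(1)$ with probability $1 - o(1)$ according to the measure induced by $M(n,d)$. This follows from writing out the Taylor expansion of the log-ratio of the densities and using known results about the empirical spectral distribution of Wigner matrices (in particular that it converges to a semi-circle law). The outcome of the calculation is the following result, proven independently and simultaneously by Bubeck et al. and Jiang and Li.

Theorem 2.2. If $d/n^3 \to \infty$, then $\mathrm{TV}(W(n,d), M(n,d)) \to 0$.

We conclude that it is impossible to detect underlying geometry whenever $d \gg n^3$.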
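The matrix point of view is easy to experiment with numerically. The sketch below assumes NumPy and assumes that for $p = 1/2$ two vertices of the geometric graph are connected exactly when the corresponding inner product is nonnegative; the names `threshold_map`, `sample_wishart`, and `sample_goe` are ours.

```python
import numpy as np


def threshold_map(mat: np.ndarray) -> np.ndarray:
    """Off-diagonal entry (i, j) becomes 1 if mat[i, j] >= 0 and 0 otherwise."""
    adj = (np.asarray(mat) >= 0).astype(int)
    np.fill_diagonal(adj, 0)
    return adj


def sample_wishart(n: int, d: int, rng: np.random.Generator) -> np.ndarray:
    """Gram matrix Y Y^T of n independent standard Gaussian vectors in R^d."""
    y = rng.standard_normal((n, d))
    return y @ y.T


def sample_goe(n: int, rng: np.random.Generator) -> np.ndarray:
    """Symmetric Gaussian matrix: variance 1 off the diagonal, 2 on it."""
    g = rng.standard_normal((n, n))
    return (g + g.T) / np.sqrt(2.0)


rng = np.random.default_rng(1)
geometric_adj = threshold_map(sample_wishart(6, 50, rng))  # law of G(6, 1/2, 50)
gnp_adj = threshold_map(sample_goe(6, rng))                # law of G(6, 1/2)
```

Since the same deterministic map is applied to both matrices, the total variation distance between the two graphs is at most that between the two matrix ensembles.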

Estimating the dimension
Until now we discussed detecting geometry. However, the insights gained above allow us to also touch upon the more subtle problem of estimating the underlying dimension d.
Dimension estimation can also be done by counting the "number" of signed triangles as in Section 2.4. However, here it is necessary to have a bound on the difference of the expected number of signed triangles between consecutive dimensions; the lower bound of (2.5) is not enough. Still, we believe that the right hand side of (2.5) should give the true value of the expected value for an appropriate constant $c_p$, and hence we expect to have that
$$\mathbb{E}[\tau(G(n,p,d))] - \mathbb{E}[\tau(G(n,p,d+1))] = \Theta\left(\frac{n^3}{d^{3/2}}\right). \tag{2.7}$$
Thus, using the variance bound in (2.6), we get that dimension estimation should be possible using signed triangles whenever $n^3/d^{3/2} \gg \sqrt{n^3 + n^4/d}$, which is equivalent to $d \ll n$. Showing (2.7) for general $p$ seems involved; Bubeck et al. showed that it holds for $p = 1/2$, which can be considered as a proof of concept. We thus have the following.

Theorem 2.3 (Bubeck, Ding, Eldan, Rácz [19]). There exists a universal constant $C > 0$ such that for all integers $n$ and $d_1 < d_2$, one has
$$\mathrm{TV}(G(n,1/2,d_1), G(n,1/2,d_2)) \ge 1 - C \left(\frac{d_1}{n}\right)^2.$$

This result is tight, as demonstrated by a result of Eldan [29], which states that when $d \gg n$, the Wishart matrices $W(n,d)$ and $W(n,d+1)$ are indistinguishable. By the discussion in Section 2.5, this directly implies that $G(n,1/2,d)$ and $G(n,1/2,d+1)$ are indistinguishable.

Theorem 2.4 (Eldan [29]). There exists a universal constant $C > 0$ such that for all integers $n < d$,
$$\mathrm{TV}(W(n,d), W(n,d+1)) \le C \sqrt{\left(\frac{d+1}{d-n}\right)^2 - 1}.$$

The mysterious sparse regime
The discussion so far has focused on dense graphs, i.e., assuming $p \in (0,1)$ is constant, where Theorem 2.1 tightly characterizes when the underlying geometry can be detected. The same questions are interesting for sparse graphs as well, where the average degree is constant or slowly growing with $n$. However, since there are so few edges, this regime is much more challenging. It is again natural to consider the number of triangles as a way to distinguish between $G(n, c/n)$ and $G(n, c/n, d)$. A calculation shows that this statistic works whenever $d \ll \log^3(n)$. In contrast with the dense regime, in the sparse regime the signed triangle statistic $\tau$ does not give significantly more power than the triangle statistic $T$. This is because in the sparse regime, with high probability, the graph does not contain any 4-vertex subgraph with at least 5 edges, which is where the improvement comes from in the dense regime.
The authors of [19] also conjecture that $\log^3(n)$ is the correct order where the transition happens.

Conjecture 2.6 (Bubeck, Ding, Eldan, Rácz [19]). Let $c > 0$ be fixed and assume $d/\log^3(n) \to \infty$. Then
$$\mathrm{TV}(G(n, c/n), G(n, c/n, d)) \to 0.$$

The main reason for this conjecture is that, when $d \gg \log^3(n)$, $G(n, c/n)$ and $G(n, c/n, d)$ seem to be locally equivalent; in particular, they both have the same Poisson number of triangles asymptotically. Thus the only way to distinguish between them would be to find an emergent global property which is significantly different under the two models, but this seems unlikely to exist. Proving or disproving this conjecture remains a challenging open problem. The best known bound is $n^3$ from (2.2) (which holds uniformly over $p$).

Outlook: open problems and extensions
High-dimensional random geometric graphs promise to be a rich source of exciting open problems. The main problem that this lecture leaves open is understanding the critical dimension in the sparse setting; see Conjecture 2.6 above. More generally, it is interesting to understand how the critical dimension depends on $p$.
Understanding the robustness of these results is important; this is also the topic of the next lecture, in the setting of the Wishart to GOE transition for random matrices. Recently Eldan and Mikulincer studied the effect of anisotropy on the power of detecting geometry in random geometric graphs [30]. The authors introduce new notions of dimensionality and prove a theorem similar to Theorem 2.1 with appropriate upper and lower bounds on the "effective critical dimension".
Perhaps the ultimate goal is to find good representations of network data, and hence to faithfully embed the graph of interest into an appropriate metric space. While the worst-case version of this problem (recognizing if a graph can be realized as a geometric graph) is known to be NP-hard [14], the probabilistic setting should provide lots of opportunities for exciting research.

Lecture 3: Introduction to entropic central limit theorems and a proof of the fundamental limits of dimension estimation in random geometric graphs
Recall from the previous lecture that the dimension threshold for detecting geometry in $G(n,p,d)$ for constant $p \in (0,1)$ is $d = \Theta(n^3)$. What if the random geometric graph model is not $G(n,p,d)$? How robust are the results presented in the previous lecture? We have seen that the detection threshold is intimately connected to the threshold of when a Wishart matrix becomes GOE. Understanding the robustness of this result on random matrices is interesting in its own right, and this is what we will pursue in this lecture. Doing so also gives us the opportunity to learn about the fascinating world of entropic central limit theorems.

Setup and main result: the universality of the threshold dimension
Let $\mathbb{X}$ be an $n \times d$ random matrix with i.i.d. entries from a distribution $\mu$ that has mean zero and variance 1. The $n \times n$ matrix $\mathbb{X}\mathbb{X}^T$ is known as the Wishart matrix with $d$ degrees of freedom. As we have seen in the previous lecture, this arises naturally in geometry, where $\mathbb{X}\mathbb{X}^T$ is known as the Gram matrix of inner products of $n$ points in $\mathbb{R}^d$. The Wishart matrix also appears naturally in statistics as the sample covariance matrix, where $d$ is the number of samples and $n$ is the number of parameters. We refer to [21] for further applications in quantum physics, wireless communications, and optimization. We consider the Wishart matrix with the diagonal removed, and scaled appropriately:
$$W_{n,d} := \frac{1}{\sqrt{d}} \left( \mathbb{X}\mathbb{X}^T - \mathrm{diag}\left(\mathbb{X}\mathbb{X}^T\right) \right).$$
In many applications (such as to random graphs, as we have seen in the previous lecture) the diagonal of the matrix is not relevant, so removing it does not lose information. Our goal is to understand how large the dimension $d$ has to be so that $W_{n,d}$ is approximately like $G_n$, which is defined as the $n \times n$ Wigner matrix with zeros on the diagonal and i.i.d. standard Gaussians above the diagonal. In other words, $G_n$ is drawn from the Gaussian Orthogonal Ensemble (GOE) with the diagonal replaced with zeros. A simple application of the multivariate central limit theorem gives that if $n$ is fixed and $d \to \infty$, then $W_{n,d}$ converges to $G_n$ in distribution. The main result of Bubeck and Ganguly [21] establishes that this holds as long as $d \gg n^3$ (up to logarithmic factors) under rather general conditions on the distribution $\mu$.

Theorem 3.1 (Bubeck and Ganguly [21]). If the distribution $\mu$ is log-concave and $d/(n^3 \log^2(d)) \to \infty$, then
$$\mathrm{TV}(W_{n,d}, G_n) \to 0. \tag{3.1}$$
On the other hand, if $\mu$ has a finite fourth moment and $d/n^3 \to 0$, then
$$\mathrm{TV}(W_{n,d}, G_n) \to 1. \tag{3.2}$$

This result extends Theorems 2.1 and 2.2, and establishes $n^3$ as the universal critical dimension (up to logarithmic factors) for sufficiently smooth measures $\mu$: $W_{n,d}$ is approximately Gaussian if and only if $d$ is much larger than $n^3$. For random graphs, as seen in Lecture 2, this is the dimension barrier to extracting geometric information from a network: if the dimension is much greater than the cube of the number of vertices, then all geometry is lost. In the setting of statistics this means that the Gaussian approximation of a Wishart matrix is valid as long as the sample size is much greater than the cube of the number of parameters. Note that for some statistics of a Wishart matrix the Gaussian approximation is valid for much smaller sample sizes (e.g., the largest eigenvalue behaves as in the limit even when the number of parameters is on the same order as the sample size [42]).
To distinguish the random matrix ensembles, we have seen in Lecture 2 that signed triangles work up until the threshold dimension in the case when $\mu$ is standard normal. It turns out that the same statistic works in this more general setting; when the entries of the matrices are centered, this statistic can be written as $A \mapsto \mathrm{Tr}(A^3)$. Similarly to the calculations in Section 2.4, one can show that under the two measures $G_n$ and $W_{n,d}$, the mean of $\mathrm{Tr}(A^3)$ is $0$ and $\Theta(n^3/\sqrt{d})$, respectively, whereas the variances are $\Theta(n^3)$ and $\Theta(n^3 + n^5/d^2)$, respectively. Then (3.2) follows by an application of Chebyshev's inequality. We leave the details as an exercise for the reader.
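To make the statistic concrete, here is a minimal sketch assuming NumPy and assuming the scaling $W_{n,d} = (\mathbb{X}\mathbb{X}^T - \mathrm{diag}(\mathbb{X}\mathbb{X}^T))/\sqrt{d}$. The entry distribution used in the example (uniform on $[-\sqrt{3}, \sqrt{3}]$, which is log-concave with mean zero and unit variance) and the names `wishart_no_diagonal` and `cubic_trace` are our choices for illustration only.

```python
import numpy as np


def wishart_no_diagonal(x: np.ndarray) -> np.ndarray:
    """Scaled Gram matrix of the rows of x with the diagonal set to zero."""
    n, d = x.shape
    gram = (x @ x.T).astype(float)
    np.fill_diagonal(gram, 0.0)
    return gram / np.sqrt(d)


def cubic_trace(a: np.ndarray) -> float:
    """Tr(A^3); for symmetric A with zero diagonal this is 6 times the
    sum over unordered triples {i,j,k} of A_ij * A_jk * A_ik."""
    return float(np.trace(a @ a @ a))


rng = np.random.default_rng(2)
n, d = 30, 100
x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, d))
print(cubic_trace(wishart_no_diagonal(x)))
```

Comparing the output with the same statistic evaluated on a symmetric Gaussian matrix with zero diagonal illustrates the shift in the mean when $d$ is small relative to $n^3$.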
We note that for (3.1) to hold it is necessary to have some smoothness assumption on the distribution $\mu$. For instance, if $\mu$ is purely atomic, then so is the distribution of $W_{n,d}$, and thus its total variation distance to $G_n$ is 1. The log-concave assumption gives this necessary smoothness, and it is an interesting open problem to understand how far this can be relaxed.

Pinsker's inequality: from total variation to relative entropy
Our goal is now to bound the total variation distance $\mathrm{TV}(W_{n,d}, G_n)$ from above. In the general setting considered here there is no nice formula for the density of the Wishart ensemble, so $\mathrm{TV}(W_{n,d}, G_n)$ cannot be computed directly. Coupling these two random matrices also seems challenging.
In light of these observations, it is natural to switch to a different metric on probability distributions that is easier to handle in this case. We refer the reader to the excellent paper [34] which gathers ten different probability metrics and many relations between them. Here we use Pinsker's inequality to switch to relative entropy:
$$\mathrm{TV}(W_{n,d}, G_n)^2 \le \frac{1}{2} \mathrm{Ent}(W_{n,d} \,\|\, G_n), \tag{3.3}$$
where $\mathrm{Ent}(W_{n,d} \,\|\, G_n)$ denotes the relative entropy of $W_{n,d}$ with respect to $G_n$. In the following subsection we provide a brief introduction to entropy; the reader familiar with the basics can safely skip this. We then turn to entropic central limit theorems and techniques involved in their proof, before finally coming back to bounding the right hand side in (3.3).
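Pinsker's inequality is easy to check numerically on a finite space. The following sketch assumes the convention $\mathrm{TV}(P,Q) = \frac{1}{2}\sum_x |P(x) - Q(x)|$ and relative entropy measured in nats; the function names are ours.

```python
import math


def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)| on a finite space."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def relative_entropy(p, q):
    """Ent(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def pinsker_bound(p, q):
    """Upper bound on TV(P, Q) given by Pinsker's inequality."""
    return math.sqrt(0.5 * relative_entropy(p, q))


p = [0.5, 0.5]
q = [0.9, 0.1]
print(total_variation(p, q), pinsker_bound(p, q))
```

The first printed number never exceeds the second, whatever pair of distributions is plugged in.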

A brief introduction to entropy
The entropy of a discrete random variable $X$ taking values in $\mathcal{X}$ is defined as
$$H(X) \equiv H(p) := -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$
where $p$ denotes the probability mass function of $X$. The log is commonly taken to have base 2, in which case entropy is measured in bits; if one considers the natural logarithm $\ln$ then it is measured in nats. Note that entropy is always nonnegative, since $p(x) \le 1$ for every $x \in \mathcal{X}$. This is a measure of uncertainty of a random variable. It measures how much information is required on average to describe the random variable. Many properties of entropy agree with the intuition of what a measure of information should be. A useful way of thinking about entropy is the following: if we have an i.i.d. sequence of random variables and we know that the source distribution is $p$, then we can construct a code with average description length $H(p)$.

Example 3.2. If $X$ is uniform on a finite space $\mathcal{X}$, then $H(X) = \log |\mathcal{X}|$.
The conditional entropy of $Y$ given $X$ is defined as
$$H(Y \mid X) := \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x),$$
where $p$ denotes the probability mass function of $X$ and $H(Y \mid X = x)$ is the entropy of the random variable $Y$ conditioned on the event $\{X = x\}$. It measures how much information is required on average to describe the random variable $Y$, given that the value of the random variable $X$ is known. For continuous random variables the differential entropy is defined as
$$h(X) \equiv h(f) := -\int f(x) \log f(x)\, dx,$$
where $f$ is the density of the random variable $X$.

Example 3.3. If X is uniform on the interval [0, a], then h(X) = log (a).
If $X$ is Gaussian with mean zero and variance $\sigma^2$, then $h(X) = \frac{1}{2} \log\left(2\pi e \sigma^2\right)$. Note that these examples show that differential entropy can be negative. One way to think of differential entropy is to think of $2^{h(X)}$ as "the volume of the support".
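The relative entropy (also known as Kullback-Leibler divergence) of two distributions $P$ and $Q$ on a discrete space $\mathcal{X}$ is defined as
$$D(P \,\|\, Q) := \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$
For two distributions with densities $f$ and $g$ the relative entropy is defined as
$$D(f \,\|\, g) := \int f(x) \log \frac{f(x)}{g(x)}\, dx.$$
Relative entropy is always nonnegative; this follows from Jensen's inequality.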
Relative entropy can be interpreted as a measure of distance between two distributions, although it is not a metric: it is not symmetric and it does not obey the triangle inequality. It can be thought of as a measure of inefficiency of assuming that the source distribution is $q$ when it is really $p$. If we use a code for distribution $q$ but the source is really from $p$, then we need $H(p) + D(p \,\|\, q)$ bits on average to describe the random variable. In the following we use $\mathrm{Ent}$ to denote all notions of entropy and relative entropy. We also slightly abuse notation and interchangeably use a random variable or its law in the argument of entropy and relative entropy.
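The coding interpretation can be verified on a small example. The sketch below works in bits and assumes idealized code lengths $-\log_2 q(x)$, ignoring rounding to integer lengths; the function names and the two distributions are ours.

```python
import math


def entropy_bits(p):
    """H(p) in bits."""
    return -sum(a * math.log2(a) for a in p if a > 0)


def relative_entropy_bits(p, q):
    """D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def expected_code_length(p, q):
    """Average of the idealized code lengths -log2 q(x) when x is drawn from p."""
    return -sum(a * math.log2(b) for a, b in zip(p, q) if a > 0)


p = [0.5, 0.25, 0.25]   # true source distribution
q = [0.25, 0.25, 0.5]   # distribution the code was designed for
print(entropy_bits(p), relative_entropy_bits(p, q), expected_code_length(p, q))
```

For this pair the three quantities are 1.5, 0.25, and 1.75 bits, so the mismatch between $p$ and $q$ costs a quarter of a bit per symbol.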
Entropy and relative entropy satisfy useful chain rules; we leave the proof of the following identities as an exercise for the reader. For entropy we have:
$$\mathrm{Ent}(X_1, X_2) = \mathrm{Ent}(X_1) + \mathrm{Ent}(X_2 \mid X_1).$$
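For relative entropy we have:
$$\mathrm{Ent}\left((Y_1, Y_2) \,\|\, (Z_1, Z_2)\right) = \mathrm{Ent}(Y_1 \,\|\, Z_1) + \mathbb{E}_{y \sim \lambda_1} \mathrm{Ent}\left(Y_2 \mid Y_1 = y \,\|\, Z_2 \mid Z_1 = y\right), \tag{3.4}$$
where $\lambda_1$ denotes the law of $Y_1$.

Let $\varphi$ denote the density of $\gamma_n$, the $n$-dimensional standard Gaussian distribution, and let $f$ be an isotropic density with mean zero, i.e., a density for which the covariance matrix is the identity $I_n$. Then
$$0 \le \mathrm{Ent}(f \,\|\, \varphi) = \int f \log f - \int f \log \varphi = \int f \log f - \int \varphi \log \varphi = \mathrm{Ent}(\varphi) - \mathrm{Ent}(f),$$
where the second equality follows from the fact that $\log \varphi(x)$ is quadratic in $x$, and the first two moments of $f$ and $\varphi$ are the same by assumption. We thus see that the standard Gaussian maximizes entropy among isotropic densities.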

An introduction to entropic CLTs
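At this point we are ready to state the entropic central limit theorem. The central limit theorem states that if $Z_1, Z_2, \dots$ are i.i.d. real-valued random variables with zero mean and unit variance, then $S_m := (Z_1 + \dots + Z_m)/\sqrt{m}$ converges in distribution to a standard Gaussian random variable as $m \to \infty$. There are many other senses in which $S_m$ converges to a standard Gaussian, the entropic CLT being one of them.

Theorem 3.4 (Entropic CLT). Let $Z_1, Z_2, \dots$ be i.i.d. real-valued random variables with zero mean and unit variance, and let $S_m := (Z_1 + \dots + Z_m)/\sqrt{m}$. If $\mathrm{Ent}(Z_1 \,\|\, \varphi) < \infty$, then
$$\mathrm{Ent}(S_m) \to \mathrm{Ent}(\varphi)$$
as $m \to \infty$. Moreover, the entropy of $S_m$ increases monotonically, that is, $\mathrm{Ent}(S_m) \le \mathrm{Ent}(S_{m+1})$ for every $m \ge 1$.

The condition $\mathrm{Ent}(Z_1 \,\|\, \varphi) < \infty$ is necessary for an entropic CLT to hold; for instance, if the $Z_i$ are discrete, then $h(S_m) = -\infty$ for all $m$.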
The entropic CLT originates with Shannon in the 1940s and was first proven by Linnik [45] in 1959 (without the monotonicity part of the statement). The first proofs that gave explicit convergence rates were given independently and at roughly the same time by Artstein, Ball, Barthe, and Naor [7,5,6], and Johnson and Barron [41] in the early 2000s, using two different techniques.
The fact that $\mathrm{Ent}(S_1) \le \mathrm{Ent}(S_2)$ follows from the entropy power inequality, which goes back to Shannon [56] in 1948. This implies that $\mathrm{Ent}(S_m) \le \mathrm{Ent}(S_{2m})$ for all $m \ge 1$, and so it was naturally conjectured that $\mathrm{Ent}(S_m)$ increases monotonically. However, proving this turned out to be challenging. Even the inequality $\mathrm{Ent}(S_2) \le \mathrm{Ent}(S_3)$ was unknown for over fifty years, until Artstein, Ball, Barthe, and Naor [5] proved in general that $\mathrm{Ent}(S_m) \le \mathrm{Ent}(S_{m+1})$ for all $m \ge 1$.
In the following we sketch some of the main ideas that go into the proof of these results, in particular following the techniques of Artstein, Ball, Barthe, and Naor [7,5,6].

From relative entropy to Fisher information
Our goal is to show that some random variable $Z$, which is a convolution of many i.i.d. random variables, is close to a Gaussian $G$. One way to approach this is to interpolate between the two. There are several ways of doing this; for our purposes interpolation along the Ornstein-Uhlenbeck semigroup is most useful. Define
$$P_t Z := e^{-t} Z + \sqrt{1 - e^{-2t}}\, G$$
for $t \in [0, \infty)$, where $G$ is a standard Gaussian independent of $Z$, and let $f_t$ denote the density of $P_t Z$. We have $P_0 Z = Z$ and $P_\infty Z = G$. This semigroup has several desirable properties. For instance, if the density of $Z$ is isotropic, then so is $f_t$. Before we can state the next desirable property that we will use, we need to introduce a few more useful quantities.
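The interpolation is simple to simulate. The sketch below assumes NumPy and assumes the standard parametrization $P_t Z = e^{-t} Z + \sqrt{1 - e^{-2t}}\, G$ with $G$ a standard Gaussian independent of $Z$; the function name `ou_interpolate` and the choice of a uniform starting distribution are ours.

```python
import numpy as np


def ou_interpolate(z: np.ndarray, t: float, rng: np.random.Generator) -> np.ndarray:
    """Samples of P_t Z given samples z of Z, using fresh independent Gaussians."""
    g = rng.standard_normal(z.shape)
    return np.exp(-t) * z + np.sqrt(1.0 - np.exp(-2.0 * t)) * g


rng = np.random.default_rng(4)
# Z uniform on [-sqrt(3), sqrt(3)]: mean zero, unit variance, far from Gaussian.
z = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=100_000)
for t in (0.0, 0.5, 2.0):
    zt = ou_interpolate(z, t, rng)
    # The variance stays close to 1, while the fourth moment moves from
    # 9/5 (uniform) towards 3 (Gaussian) as t grows.
    print(t, zt.var(), np.mean(zt ** 4))
```

The weights satisfy $e^{-2t} + (1 - e^{-2t}) = 1$, which is why unit variance (and, in higher dimensions, isotropy) is preserved along the whole path.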
For a density function $f : \mathbb{R}^n \to \mathbb{R}_+$, let
$$I(f) := \int \frac{\nabla f(x) \left(\nabla f(x)\right)^T}{f(x)}\, dx$$
be the Fisher information matrix. Let $\mu$ denote the mean of the density $f$. The Cramér-Rao bound states that for any unbiased estimator $\hat{\mu}$ of $\mu$, the covariance satisfies $\mathrm{Cov}(\hat{\mu}) \succeq I(f)^{-1}$. It is sometimes more convenient to work with the Fisher information distance, defined as
$$J(f) := \mathrm{Tr}(I(f)) - \mathrm{Tr}\left(\mathrm{Cov}(f)^{-1}\right).$$
Similarly to the discussion above, one can show that the standard Gaussian minimizes the Fisher information among isotropic densities, and hence the Fisher information distance is always nonnegative. Now we are ready to state the De Bruijn identity [57], which characterizes the change of entropy along the Ornstein-Uhlenbeck semigroup via the Fisher information distance:
$$\partial_t \mathrm{Ent}(f_t) = J(f_t).$$
This implies that the relative entropy between $f$ and $\varphi$, which is our quantity of interest, can be expressed as follows:
$$\mathrm{Ent}(f \,\|\, \varphi) = \mathrm{Ent}(\varphi) - \mathrm{Ent}(f) = \int_0^\infty J(f_t)\, dt. \tag{3.5}$$
Thus our goal is to bound the Fisher information distance $J(f_t)$.

Bounding the Fisher information distance
We first recall a classical result by Blachman [11] and Stam [57] that shows that Fisher information decreases under convolution.
Theorem 3.5 (Blachman [11]; Stam [57]). Let $Y_1, \dots, Y_d$ be independent random variables taking values in $\mathbb{R}$, and let $a \in \mathbb{R}^d$ be such that $\|a\|_2 = 1$. Then
$$I\left(\sum_{i=1}^d a_i Y_i\right) \le \sum_{i=1}^d a_i^2\, I(Y_i).$$

In the i.i.d. case, this bound becomes $\|a\|_2^2\, I(Y_1)$. Artstein, Ball, Barthe, and Naor [7,5] gave the following variational characterization of the Fisher information, which gives a particularly simple proof of Theorem 3.5.

Theorem 3.6 (Variational characterization of Fisher information [7,5]). Let $w : \mathbb{R}^d \to (0, \infty)$ be a sufficiently smooth density on $\mathbb{R}^d$, let $a \in \mathbb{R}^d$ be a unit vector, and let $h$ be the marginal of $w$ in direction $a$. Then we have
$$I(h) \le \int_{\mathbb{R}^d} \left( \frac{\mathrm{div}(pw)(x)}{w(x)} \right)^2 w(x)\, dx \tag{3.6}$$
for any continuously differentiable vector field $p : \mathbb{R}^d \to \mathbb{R}^d$ with the property that for every $x$, $\langle p(x), a \rangle = 1$. Moreover, if $w$ satisfies $\int \|x\|^2 w(x)\, dx < \infty$, then there is equality for some suitable vector field $p$.
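The Blachman-Stam theorem follows from this characterization by taking the constant vector field $p \equiv a$. Then we have $\mathrm{div}(pw) = \langle \nabla w, a \rangle$, and so the right hand side of (3.6) becomes $a^T I(w)\, a$, where recall that $I$ is the Fisher information matrix. In the setting of Theorem 3.5 the density $w$ of $(Y_1, \dots, Y_d)$ is a product density: $w(x_1, \dots, x_d) = f_1(x_1) \cdots f_d(x_d)$, where $f_i$ is the density of $Y_i$. Consequently the Fisher information matrix is diagonal, $I(w) = \mathrm{diag}(I(f_1), \dots, I(f_d))$, and thus $a^T I(w)\, a = \sum_{i=1}^d a_i^2 I(f_i)$, concluding the proof of Theorem 3.5 using Theorem 3.6.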
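As a quick sanity check of Theorem 3.5, consider the Gaussian case, where everything is explicit. The computation below is a sketch under the assumption that the $Y_i \sim N(0, \sigma_i^2)$ are independent; the symbols $\sigma_i$ are ours.

```latex
\begin{align*}
I(Y_i) &= \frac{1}{\sigma_i^2},
\qquad
\sum_{i=1}^d a_i Y_i \sim N\Big(0, \sum_{i=1}^d a_i^2 \sigma_i^2\Big), \\
I\Big(\sum_{i=1}^d a_i Y_i\Big)
&= \frac{1}{\sum_{i=1}^d a_i^2 \sigma_i^2}
\le \sum_{i=1}^d a_i^2 \cdot \frac{1}{\sigma_i^2}
= \sum_{i=1}^d a_i^2\, I(Y_i).
\end{align*}
```

The inequality is Jensen's inequality for the convex function $x \mapsto 1/x$ with weights $a_i^2$, which sum to one; equality holds exactly when all the $\sigma_i^2$ with $a_i \neq 0$ coincide.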
Given the characterization of Theorem 3.6, one need not take the vector field to be constant; one can obtain more by optimizing over the vector field. Doing this leads to the following theorem, which gives a rate of decrease of the Fisher information distance under convolutions.

Theorem 3.7 (Artstein, Ball, Barthe, and Naor [7,5,6]). Let $Y_1, \dots, Y_d$ be i.i.d. random variables with a density having a positive spectral gap $c$. Then for any $a \in \mathbb{R}^d$ with $\|a\|_2 = 1$ we have that
$$J\left(\sum_{i=1}^d a_i Y_i\right) \le \frac{2 \|a\|_4^4}{c + (2 - c) \|a\|_4^4}\, J(Y_1).$$

When $a = \frac{1}{\sqrt{d}} \mathbf{1}$, we have $\|a\|_4^4 = 1/d$, so the factor on the right hand side is $O(1/d)$, and thus using (3.5) we obtain a rate of convergence of $O(1/d)$ in the entropic CLT. A result similar to Theorem 3.7 was proven independently and roughly at the same time by Johnson and Barron [41] using a different approach involving score functions.

A high-dimensional entropic CLT
The techniques of Artstein, Ball, Barthe, and Naor [7,5,6] generalize to higher dimensions, as was recently shown by Bubeck and Ganguly [21]. A result similar to Theorem 3.7 can be proven, from which a high-dimensional entropic CLT follows, together with a rate of convergence, by using (3.5) again.

Theorem 3.8 (Bubeck and Ganguly [21]). Let $Y \in \mathbb{R}^d$ be a random vector with i.i.d. entries from a distribution $\nu$ with zero mean, unit variance, and spectral gap $c \in (0, 1]$. Let $A \in \mathbb{R}^{n \times d}$ be a matrix such that $AA^T = I_n$, the $n \times n$ identity matrix. Let $\varepsilon = \max_{i \in [d]} (A^T A)_{i,i}$ and $\zeta = \max_{i,j \in [d],\, i \ne j} |(A^T A)_{i,j}|$. Then
$$\mathrm{Ent}(AY \,\|\, \gamma_n) \le n \min\left\{ \frac{2 (\varepsilon + \zeta^2 d)}{c},\, 1 \right\} \mathrm{Ent}(\nu \,\|\, \gamma_1),$$
where $\gamma_n$ denotes the standard Gaussian measure in $\mathbb{R}^n$.

To interpret this result, consider the case where the matrix $A$ is built by picking rows one after the other uniformly at random on the Euclidean sphere in $\mathbb{R}^d$, conditionally on being orthogonal to previous rows (to satisfy the isotropicity condition $AA^T = I_n$). We then expect to have $\varepsilon \simeq n/d$ and $\zeta \simeq \sqrt{n}/d$ (we leave the details as an exercise for the reader), and so Theorem 3.8 tells us that $\mathrm{Ent}(AY \,\|\, \gamma_n) \lesssim n^2/d$.

Back to Wishart and GOE
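We now turn our attention back to bounding the relative entropy $\mathrm{Ent}(W_{n,d} \,\|\, G_n)$ between the $n \times n$ Wishart matrix with $d$ degrees of freedom (with the diagonal removed), $W_{n,d}$, and the $n \times n$ GOE matrix (with the diagonal removed), $G_n$; recall (3.3). Since the Wishart matrix contains the (scaled) inner products of $n$ vectors in $\mathbb{R}^d$, it is natural to relate $W_{n+1,d}$ and $W_{n,d}$, since the former comes from the latter by adding an additional $d$-dimensional vector to the $n$ vectors already present. Specifically, we have the following:
$$W_{n+1,d} = \begin{pmatrix} W_{n,d} & \frac{1}{\sqrt{d}} \mathbb{X} X \\ \frac{1}{\sqrt{d}} (\mathbb{X} X)^T & 0 \end{pmatrix},$$
where $X$ is a $d$-dimensional random vector with i.i.d. entries from $\mu$, which are also independent from $\mathbb{X}$. Similarly we can write the matrix $G_{n+1}$ using $G_n$:
$$G_{n+1} = \begin{pmatrix} G_n & \gamma_n \\ \gamma_n^T & 0 \end{pmatrix}.$$
This naturally suggests using the chain rule for relative entropy and bounding $\mathrm{Ent}(W_{n,d} \,\|\, G_n)$ by induction on $n$. By (3.4) we get that
$$\mathrm{Ent}(W_{n+1,d} \,\|\, G_{n+1}) = \mathrm{Ent}(W_{n,d} \,\|\, G_n) + \mathbb{E}_{W_{n,d}}\left[ \mathrm{Ent}\left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \mid W_{n,d} \,\Big\|\, \gamma_n \right) \right].$$
By convexity of the relative entropy we also have that
$$\mathbb{E}_{W_{n,d}}\left[ \mathrm{Ent}\left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \mid W_{n,d} \,\Big\|\, \gamma_n \right) \right] \le \mathbb{E}_{\mathbb{X}}\left[ \mathrm{Ent}\left( \tfrac{1}{\sqrt{d}} \mathbb{X} X \mid \mathbb{X} \,\Big\|\, \gamma_n \right) \right].$$
Thus our goal is to understand and bound $\mathrm{Ent}(AX \,\|\, \gamma_n)$ for $A \in \mathbb{R}^{n \times d}$, and then apply the bound to $A = \frac{1}{\sqrt{d}} \mathbb{X}$ (followed by taking expectation over $\mathbb{X}$). This is precisely what was done in Theorem 3.8, the high-dimensional entropic CLT, for $A$ satisfying $AA^T = I_n$. Since $A = \frac{1}{\sqrt{d}} \mathbb{X}$ does not necessarily satisfy $AA^T = I_n$, we have to correct for the lack of isotropicity. This is the content of the following lemma, the proof of which we leave as an exercise for the reader.

Lemma 3.9. Let $A \in \mathbb{R}^{n \times d}$ and $Q \in \mathbb{R}^{n \times n}$ be such that $QA(QA)^T = I_n$. Then for any isotropic random variable $X$ taking values in $\mathbb{R}^d$ we have that
$$\mathrm{Ent}(AX \,\|\, \gamma_n) = \mathrm{Ent}(QAX \,\|\, \gamma_n) + \frac{1}{2} \mathrm{Tr}(AA^T) - \frac{n}{2} + \log |\det(Q)|. \tag{3.7}$$

We then apply this lemma with $A = \frac{1}{\sqrt{d}} \mathbb{X}$ and $Q = \left(\frac{1}{d} \mathbb{X}\mathbb{X}^T\right)^{-1/2}$. Observe that $\mathbb{E}\, \mathrm{Tr}(AA^T) = \frac{1}{d} \mathbb{E}\, \mathrm{Tr}(\mathbb{X}\mathbb{X}^T) = \frac{1}{d} \times n \times d = n$, and hence in expectation the middle two terms of the right hand side of (3.7) cancel each other out.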
The last term in (3.7), $\log|\det(Q)| = -\frac{1}{2} \log \det\left(\frac{1}{d} \mathbb{X}\mathbb{X}^T\right)$, together with the middle two terms, should be understood as the relative entropy between a centered Gaussian with covariance given by $\frac{1}{d} \mathbb{X}\mathbb{X}^T$ and a standard Gaussian in $\mathbb{R}^n$. Controlling the expectation of this term requires studying the probability that $\mathbb{X}\mathbb{X}^T$ is close to being non-invertible, which requires bounds on the left tail of the smallest singular value of $\mathbb{X}$. Understanding the extreme singular values of random matrices is a fascinating topic, but it is outside of the scope of these notes, and so we refer the reader to [21] for more details on this point.
Finally, the high-dimensional entropic CLT can now be applied to see that $\mathrm{Ent}(QAX \,\|\, \gamma_n) \lesssim n^2/d$. From the induction on $n$ we get another factor of $n$, arriving at $\mathrm{Ent}(W_{n,d} \,\|\, G_n) \lesssim n^3/d$. We conclude that the dimension threshold is $d \approx n^3$, and the information-theoretic proof that we have outlined sheds light on why this threshold is $n^3$.
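The bookkeeping behind the extra factor of $n$ can be sketched as follows. We ignore logarithmic factors and the correction terms coming from the lack of isotropicity, and we assume that the step from $k$ to $k+1$ points costs at most $C k^2/d$ in relative entropy; the constant $C$ is ours.

```latex
\begin{align*}
\mathrm{Ent}(W_{n,d} \,\|\, G_n)
&= \sum_{k=1}^{n-1} \Big( \mathrm{Ent}(W_{k+1,d} \,\|\, G_{k+1}) - \mathrm{Ent}(W_{k,d} \,\|\, G_k) \Big) \\
&\le \sum_{k=1}^{n-1} \frac{C k^2}{d}
= \frac{C (n-1) n (2n-1)}{6 d}
\le \frac{C n^3}{3 d}.
\end{align*}
```

Here we used that $W_{1,d}$ and $G_1$ are both the $1 \times 1$ zero matrix, so the sum telescopes from zero. Combined with Pinsker's inequality, this shows that the total variation distance vanishes once $d$ is much larger than $n^3$.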

Outlook: open problems and extensions
The work of Bubeck and Ganguly [21] leaves open many questions; we mention here some of them and refer to [21] for more. Perhaps the main open problem is to find the optimal conditions under which the phase transition described in Theorem 3.1 holds. For instance, does it hold if the rows (or columns) of $\mathbb{X}$ are i.i.d. from a log-concave distribution in $\mathbb{R}^d$ (or $\mathbb{R}^n$)? Several estimates still work under this assumption, but it seems that the induction step of the proof requires the independence of the entries of $\mathbb{X}$.
Another interesting question is whether Eldan's result on Wishart matrices, Theorem 2.4, can be extended to Wishart matrices with more general entries. One difficulty is that the tools described in this lecture are based on measuring relative entropy with respect to a standard Gaussian (since this maximizes entropy), while the problem concerns only Wishart matrices.
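A different direction of inquiry is to study higher order interactions. Denoting the $i$th column of $\mathbb{X}$ by $X_i$, we can write $\mathbb{X}\mathbb{X}^T = \sum_{i=1}^d X_i X_i^T = \sum_{i=1}^d X_i^{\otimes 2}$. Now for any $k \in \mathbb{N}$ one may consider the distribution $W^{(k)}_{n,d}$ of the order-$k$ tensor $\sum_{i=1}^d X_i^{\otimes k}$, appropriately scaled. How large does $d$ need to be as a function of $n$ and $k$ so that $W^{(k)}_{n,d}$ is approximately Gaussian?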

Lecture 4: Confidence sets for the root in uniform and preferential attachment trees
In the previous lectures we studied random graph models with community structure and also models with an underlying geometry. While these models are important and lead to fascinating problems, they are also static in time. Many real-world networks are constantly evolving, and their understanding requires models that reflect this. This point of view brings about a host of new interesting and challenging statistical inference questions that concern the temporal dynamics of these networks. In the last lecture we will study such questions: given the current state of a network, can one infer the state at some previous time? Does the initial seed graph have an influence on how the network looks at large times? If so, is it possible to find the origin of a large growing network? We will focus in particular on this latter question. More precisely, given a model of a randomly growing graph starting from a single node, called the root, we are interested in the following question. Given a large graph generated from the model, is it possible to find a small set of vertices for which we can guarantee that the root is in this set with high probability? Such root-finding algorithms can have applications to finding the origin of an epidemic or a rumor.

Models of growing graphs
A natural general model of randomly growing graphs can be defined as follows. For n ≥ k ≥ 1 and a graph S on k vertices, define the random graph G(n, S) by induction. First, set G(k, S) = S; we call S the seed of the graph evolution process. Then, given G(n, S), G(n + 1, S) is formed from G(n, S) by adding a new vertex and some new edges according to some adaptive rule. If S is a single vertex, we write simply G(n) instead of G(n, S). There are several rules one can consider; here we study perhaps the two most natural rules: uniform attachment and preferential attachment. Moreover, for simplicity we focus on the case of growing trees, where at every time step a single edge is added.
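Uniform attachment trees are perhaps the simplest model of randomly growing graphs and are defined as follows. For $n \ge k \ge 1$ and a tree $S$ on $k$ vertices, the random tree $\mathrm{UA}(n, S)$ is defined by induction. First, let $\mathrm{UA}(k, S) = S$. Then, given $\mathrm{UA}(n, S)$, $\mathrm{UA}(n+1, S)$ is formed from $\mathrm{UA}(n, S)$ by adding a new vertex $u$ and a new edge $uv$, where $v$ is selected uniformly at random among the vertices of $\mathrm{UA}(n, S)$, independently of all past choices.

In preferential attachment the vertex is chosen with probability proportional to its degree [46,8,13]. For a tree $T$ denote by $d_T(u)$ the degree of vertex $u$ in $T$. For $n \ge k \ge 2$ and a tree $S$ on $k$ vertices we define the random tree $\mathrm{PA}(n, S)$ by induction. First, let $\mathrm{PA}(k, S) = S$. Then, given $\mathrm{PA}(n, S)$, $\mathrm{PA}(n+1, S)$ is formed from $\mathrm{PA}(n, S)$ by adding a new vertex $u$ and a new edge $uv$ where $v$ is selected at random among vertices in $\mathrm{PA}(n, S)$ according to the following probability distribution:
$$\mathbb{P}\left(v = i \mid \mathrm{PA}(n, S)\right) = \frac{d_{\mathrm{PA}(n,S)}(i)}{2(n-1)}.$$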
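Both growth rules take only a few lines to simulate. The sketch below uses the Python standard library and assumes that the seed is given as the edge list of a tree on vertices $0, \dots, k-1$; the function name `grow_tree` and this representation are ours.

```python
import random


def grow_tree(seed_edges, n, rule, rng):
    """Grow a seed tree to n vertices and return the list of edges.

    rule is "uniform" (attach to a uniformly random existing vertex) or
    "preferential" (attach with probability proportional to degree).
    """
    edges = list(seed_edges)
    k = len(edges) + 1  # a tree with k vertices has k - 1 edges
    if rule == "preferential" and k < 2:
        raise ValueError("preferential attachment needs a seed with an edge")
    # Every vertex appears in this list once per incident edge, so a uniform
    # draw from it picks a vertex with probability degree / (2 * #edges).
    endpoints = [v for edge in edges for v in edge]
    for u in range(k, n):
        if rule == "uniform":
            v = rng.randrange(u)
        elif rule == "preferential":
            v = rng.choice(endpoints)
        else:
            raise ValueError("unknown rule: " + rule)
        edges.append((u, v))
        endpoints.extend((u, v))
    return edges


rng = random.Random(0)
ua_tree = grow_tree([], 1000, "uniform", rng)              # seed: single vertex
pa_tree = grow_tree([(0, 1)], 1000, "preferential", rng)   # seed: single edge
```

Keeping the list of edge endpoints makes each preferential attachment step a single uniform draw, at the cost of storing $2(n-1)$ integers.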

Questions: detection and estimation
The most basic questions to consider are those of detection and estimation. Can one detect the influence of the initial seed graph? If so, is it possible to estimate the seed? Can one find the root if the process was started from a single node? We introduce these questions in the general model of randomly growing graphs described above, even though we study them in the special cases of uniform and preferential attachment trees later.
The detection question can be rephrased in the terminology of hypothesis testing. Given two potential seed graphs $S$ and $T$, and an observation $R$ which is a graph on $n$ vertices, one wishes to test whether $R \sim G(n, S)$ or $R \sim G(n, T)$. The question then boils down to whether one can design a test with asymptotically (in $n$) nonnegligible power. This is equivalent to studying the total variation distance between $G(n, S)$ and $G(n, T)$, so we naturally define
$$\delta(S, T) := \lim_{n \to \infty} \mathrm{TV}(G(n, S), G(n, T)),$$
where $G(n, S)$ and $G(n, T)$ are random elements in the finite space of unlabeled graphs with $n$ vertices. This limit is well-defined because $\mathrm{TV}(G(n, S), G(n, T))$ is nonincreasing in $n$ (since if $G(n, S) = G(n, T)$, then the evolution of the random graphs can be coupled such that $G(n', S) = G(n', T)$ for all $n' \ge n$) and always nonnegative.
If the seed has an influence, it is natural to ask whether one can estimate $S$ from $G(n, S)$ for large $n$. If so, can the subgraph corresponding to the seed be located in $G(n, S)$? We study this latter question in the simple case when the process starts from a single vertex called the root. A root-finding algorithm is defined as follows. Given $G(n)$ and a target accuracy $\varepsilon \in (0, 1)$, a root-finding algorithm outputs a set $H(G(n), \varepsilon)$ of $K(\varepsilon)$ vertices such that the root is in $H(G(n), \varepsilon)$ with probability at least $1 - \varepsilon$ (with respect to the random generation of $G(n)$).
An important aspect of this definition is that the size of the output set is allowed to depend on ε, but not on the size n of the input graph. Therefore it is not clear that root-finding algorithms exist at all. Indeed, there are examples when they do not exist: consider a path that grows by picking one of its two ends at random and extending it by a single edge. However, it turns out that in many interesting cases root-finding algorithms do exist. In such cases it is natural to ask for the best possible value of K(ε).
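The path example can be made quantitative. In the sketch below we assume that each of the $n-1$ growth steps extends the left or the right end with probability $1/2$, so that the position of the root along the path is binomial; the best a set of $K$ positions can do is to collect the $K$ largest binomial probabilities. The function name `best_coverage` is ours.

```python
import math


def best_coverage(n: int, k: int) -> float:
    """Largest probability with which k positions of a path on n vertices
    can contain the root, when each step extends a uniformly random end."""
    m = n - 1  # number of growth steps
    # The root sits at distance j from the left end with probability C(m, j) / 2^m.
    pmf = [math.comb(m, j) / 2 ** m for j in range(m + 1)]
    return sum(sorted(pmf, reverse=True)[:k])


for n in (11, 101, 1001):
    print(n, best_coverage(n, 5))
```

For fixed $K$ the coverage decays like $K/\sqrt{n}$, so no set whose size is independent of $n$ can contain the root with probability bounded away from zero.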

The influence of the seed
Consider distinguishing between a preferential attachment tree started from a star with 10 vertices, $S_{10}$, and a preferential attachment tree started from a path with 10 vertices, $P_{10}$. Since the preferential attachment mechanism incorporates the rich-get-richer phenomenon, one expects the degree of the center of the star in $\mathrm{PA}(n, S_{10})$ to be significantly larger than the degree of any of the initial vertices in the path in $\mathrm{PA}(n, P_{10})$. This intuition guided Bubeck, Mossel, and Rácz [22] when they initiated the theoretical study of the influence of the seed in preferential attachment trees. They showed that this intuition is correct: the limiting distribution of the maximum degree of the preferential attachment tree indeed depends on the seed. Using this they were able to show that for any two seeds $S$ and $T$ with at least 3 vertices and different degree profiles we have $\delta_{\mathrm{PA}}(S, T) > 0$.
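However, statistics based solely on degrees cannot distinguish all pairs of nonisomorphic seeds. This is because if $S$ and $T$ have the same degree profiles, then it is possible to couple $\mathrm{PA}(n, S)$ and $\mathrm{PA}(n, T)$ such that they have the same degree profiles for every $n$. In order to distinguish between such seeds, it is necessary to incorporate information about the graph structure into the statistics that are studied. This was done successfully by Curien, Duquesne, Kortchemski, and Manolescu [25], who analyzed statistics that measure the geometry of large degree nodes. These results can be summarized in the following theorem.

Theorem 4.1. The seed has an influence in preferential attachment trees in the following sense. For any trees $S$ and $T$ that are non-isomorphic and have at least 3 vertices, we have $\delta_{\mathrm{PA}}(S, T) > 0$.

In the case of uniform attachment, degrees do not play a special role, so initially one might even think that the seed has no influence in the limit. However, it turns out that the right perspective is not to look at degrees but rather the sizes of appropriate subtrees (we shall discuss such statistics later). By extending the approach of Curien et al. [25] to deal with such statistics, Bubeck, Eldan, Mossel, and Rácz [20] showed that the seed has an influence in uniform attachment trees as well.

Theorem 4.2. The seed has an influence in uniform attachment trees in the following sense. For any trees $S$ and $T$ that are non-isomorphic and have at least 3 vertices, we have $\delta_{\mathrm{UA}}(S, T) > 0$.

Question 4.3. How common is the phenomenon observed in Theorems 4.1 and 4.2? Is there a natural large class of randomly growing graphs for which the seed has an influence, that is, models where for any two seeds $S$ and $T$ (perhaps satisfying an extra condition) we have $\delta(S, T) > 0$?

The extra condition mentioned in the question could be model-dependent, but should not be too restrictive. It would be fascinating to find a natural model where the seed has no influence in a strong sense. Even for models where the seed does have an influence, proving the statement in full generality is challenging and interesting.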

Finding Adam
These theorems about the influence of the seed open up the problem of finding the seed. Here we present the results of Bubeck, Devroye, and Lugosi [18] who first studied root-finding algorithms in the case of uniform attachment and preferential attachment trees.
They showed that root-finding algorithms indeed exist for preferential attachment trees and that the size of the best confidence set is polynomial in $1/\varepsilon$. More precisely, write $K(\varepsilon)$ for the size of the confidence set that a root-finding algorithm outputs so as to contain the root with probability at least $1 - \varepsilon$. Then (Theorem 4.4) there exists a root-finding algorithm for preferential attachment trees with $K(\varepsilon) \le c \log^2(1/\varepsilon)/\varepsilon^4$ for some finite constant $c$. Furthermore, there exists a positive constant $c$ such that any root-finding algorithm for preferential attachment trees must satisfy $K(\varepsilon) \ge c/\varepsilon$.

They also showed the existence of root-finding algorithms for uniform attachment trees (Theorem 4.5). In this model, however, there are confidence sets whose size is subpolynomial in $1/\varepsilon$. Moreover, the size of any confidence set has to be at least superpolylogarithmic in $1/\varepsilon$.

These theorems show an interesting quantitative difference between the two models: finding the root is exponentially more difficult in preferential attachment than in uniform attachment. While this might seem counter-intuitive at first, the reason behind this can be traced back to the rich-get-richer phenomenon: the effect of a rare event where not many vertices attach to the root gets amplified by preferential attachment, making it harder to find the root.
In the remaining part of this lecture we explain the basic ideas that go into proving Theorems 4.4 and 4.5 and prove some simpler special cases. Before we do so, we give a primer on Pólya urns, whose variants appear throughout the proofs. If the reader is familiar with Pólya urns, then the following subsection can be safely skipped.

Pólya urns: the building blocks of growing graph models
While uniform attachment and preferential attachment are arguably the most basic models of randomly growing graphs, the evolution of various simple statistics, such as degrees or subtree sizes, can be described using even simpler building blocks: Pólya urns. This subsection gives a brief introduction to the well-studied world of Pólya urns, while simultaneously showing examples of how these urn models show up in uniform attachment and preferential attachment.

The classical Pólya urn
The classical Pólya urn [28] starts with an urn filled with $b$ blue balls and $r$ red balls. Then at every time step you put your hand in the urn, without looking at its contents, and take out a ball sampled uniformly at random. You observe the color of the ball and put it back into the urn together with another ball of the same color. This process is illustrated in Figure 8. We are interested in the fraction of blue and red balls in the urn at large times. Let $X_n$ denote the number of blue balls in the urn when there are $n$ balls in the urn in total; initially we have $X_{b+r} = b$. Furthermore, let $x_n = X_n/n$ denote the fraction of blue balls when there are $n$ balls in total.
Let us start by computing the expected increase in the number of blue balls at each time step:
$$\mathbb{E}\left[X_{n+1} \mid X_n\right] = X_n + \frac{X_n}{n} = \left(1 + \frac{1}{n}\right) X_n.$$
Dividing this by $(n + 1)$ we obtain that
$$\mathbb{E}\left[x_{n+1} \mid \mathcal{F}_n\right] = x_n,$$
where $\mathcal{F}_n$ denotes the filtration of the process up until time $n$ (when there are $n$ balls in the urn); since $X_n$ is a Markov process, this is equivalent to conditioning on $X_n$. Thus the fraction of blue balls does not change in expectation; in other words, $x_n$ is a martingale. Since $x_n$ is also bounded ($x_n \in [0, 1]$), it follows that $x_n$ converges almost surely to a limiting random variable. Readers not familiar with martingales should not be discouraged, as it is simple to see heuristically that $x_n$ converges: when there are $n$ balls in the urn, the change in $x_n$ is on the order of $1/n$, which converges to zero fast enough that one expects $x_n$ to converge.

Our next goal is to understand the limiting distribution of $x_n$. First, let us compute the probability of observing the first five draws as in Figure 8, starting with a blue ball, then a red, then two blue ones, and lastly another red: this probability is
$$\frac{3}{5} \times \frac{2}{6} \times \frac{4}{7} \times \frac{5}{8} \times \frac{3}{9}.$$
Notice that the probability of obtaining 3 blue balls and 2 red ones in the first 5 draws is the same regardless of the order in which we draw the balls. This property of the sequence of draws is known as exchangeability and has several useful consequences (most of which we will not explore here). It follows that the probability of seeing $k$ blue balls in the first $n$ draws takes on the following form:
$$\mathbb{P}\left(X_{n+b+r} = b + k\right) = \binom{n}{k} \frac{b(b+1)\cdots(b+k-1) \times r(r+1)\cdots(r+n-k-1)}{(b+r)(b+r+1)\cdots(b+r+n-1)}.$$
From this formula one can read off that $X_{n+b+r} - b$ is distributed according to the beta-binomial distribution with parameters $(n, b, r)$. An alternative way of sampling from the beta-binomial distribution is to first sample a probability $p$ from the beta distribution with parameters $b$ and $r$ (having density $x \mapsto \frac{\Gamma(b+r)}{\Gamma(b)\Gamma(r)} x^{b-1}(1-x)^{r-1}$ on $[0, 1]$), and then, conditionally on $p$, sample from the binomial distribution with $n$ trials and success probability $p$.
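The exchangeability claim and the beta-binomial formula can be checked with exact rational arithmetic. The sketch below assumes, as the displayed product suggests, that the urn of Figure 8 starts with $b = 3$ blue and $r = 2$ red balls; the function names are ours and purely illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def sequence_probability(draws, b, r):
    """Exact probability of one specific sequence of draws from a Polya urn.

    `draws` is an iterable of "B" (blue) and "R" (red); the urn starts with
    b blue and r red balls, and each drawn ball is returned with one extra
    ball of the same color.
    """
    prob = Fraction(1)
    blue, red = b, r
    for color in draws:
        total = blue + red
        if color == "B":
            prob *= Fraction(blue, total)
            blue += 1
        else:
            prob *= Fraction(red, total)
            red += 1
    return prob


def orderings_have_equal_probability(draws, b, r):
    """Check exchangeability: every reordering has the same probability."""
    probs = {sequence_probability(p, b, r) for p in set(permutations(draws))}
    return len(probs) == 1


def blue_count_pmf(n, k, b, r):
    """P(k blue balls among the first n draws), via the product formula."""
    numerator = Fraction(1)
    for i in range(k):
        numerator *= b + i          # b (b+1) ... (b+k-1)
    for i in range(n - k):
        numerator *= r + i          # r (r+1) ... (r+n-k-1)
    denominator = 1
    for i in range(n):
        denominator *= b + r + i    # (b+r) (b+r+1) ... (b+r+n-1)
    return comb(n, k) * numerator / denominator


if __name__ == "__main__":
    print(sequence_probability("BRBBR", 3, 2))
    print(orderings_have_equal_probability("BBBRR", 3, 2))
    print([blue_count_pmf(5, k, 3, 2) for k in range(6)])
```

Using `Fraction` avoids floating-point error, so equality of the probabilities across orderings is tested exactly rather than up to a tolerance.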
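Conditionally on $p$, the strong law of large numbers applied to the binomial distribution thus tells us that $(X_{n+b+r} - b)/n$ converges almost surely to $p$. Since $x_{n+b+r} = X_{n+b+r}/(n+b+r)$ and $(X_{n+b+r} - b)/n$ differ by a quantity that tends to zero as $n \to \infty$, it follows that $x_n \to p$ almost surely. We have thus derived the following result: the fraction of blue balls $x_n$ converges almost surely to a limiting random variable with the $\mathrm{Beta}(b, r)$ distribution.

The classical Pólya urn shows up in uniform attachment trees as follows. Fix an edge of the seed tree, such as edge $e$ in tree $S$ in Figure 9, with endpoints $v$ and $v_r$. This edge partitions the tree into two parts on either side of the edge: a subtree under $v$ and a subtree under $v_r$. The sizes of these subtrees (i.e., the number of vertices they contain) evolve exactly like the classical Pólya urn described above (in the example depicted in Figure 9 we have $b = 6$ and $r = 2$ initially).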
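As a rough numerical illustration of this correspondence, the sketch below grows uniform attachment trees from a seed in which 6 vertices lie on one side of a fixed edge and 2 on the other, and compares the empirical mean and variance of the fraction on the first side with those of the $\mathrm{Beta}(6, 2)$ distribution. The function names, the tree size, and the number of trials are illustrative assumptions. Only the side of each vertex is tracked, since a new vertex always lands on the side of the vertex it attaches to.

```python
import random


def ua_side_fraction(seed_sides, n, rng):
    """Grow a uniform attachment tree to n vertices and return the fraction
    of vertices on side 0 of a fixed seed edge.

    `seed_sides[i]` is 0 or 1 according to the side of the edge on which
    seed vertex i lies.
    """
    sides = list(seed_sides)
    while len(sides) < n:
        parent = rng.randrange(len(sides))  # uniformly random existing vertex
        sides.append(sides[parent])         # the new vertex joins that side
    return sides.count(0) / n


def beta_mean(b, r):
    """Mean of the Beta(b, r) distribution."""
    return b / (b + r)


def beta_variance(b, r):
    """Variance of the Beta(b, r) distribution."""
    return b * r / ((b + r) ** 2 * (b + r + 1))


if __name__ == "__main__":
    rng = random.Random(1)
    seed_sides = [0] * 6 + [1] * 2
    fractions = [ua_side_fraction(seed_sides, 500, rng) for _ in range(2000)]
    mean = sum(fractions) / len(fractions)
    var = sum((f - mean) ** 2 for f in fractions) / len(fractions)
    print("empirical mean, variance:", mean, var)
    print("Beta(6, 2) mean, variance:", beta_mean(6, 2), beta_variance(6, 2))
```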

Multiple colors
A natural generalization is to consider multiple colors instead of just two. Let $m$ be the number of colors, let $X_n = (X_{n,1}, \dots, X_{n,m})$ denote the number of balls of each color when there are $n$ balls in the urn in total, and let $x_n = X_n/n$. Assume that initially there are $r_i$ balls of color $i$. In this case the fraction of balls of each color converges to the natural multivariate generalization of the beta distribution: the Dirichlet distribution. The Dirichlet distribution with parameters $(r_1, \dots, r_m)$, denoted $\mathrm{Dir}(r_1, \dots, r_m)$, has density
$$f(x_1, \dots, x_m) = \frac{\Gamma\left(\sum_{i=1}^m r_i\right)}{\prod_{i=1}^m \Gamma(r_i)} \prod_{i=1}^m x_i^{r_i - 1}$$
on the simplex $\{x \in [0,1]^m : \sum_{i=1}^m x_i = 1\}$. It has several natural properties that one might expect, for instance the aggregation property: if one groups coordinates $i$ and $j$ together, then the resulting distribution is still Dirichlet, with parameters $r_i$ and $r_j$ replaced by $r_i + r_j$. This also implies that the univariate marginals are beta distributions. The convergence result for multiple colors follows similarly to the one for two colors, so we simply state the result: $x_n$ converges almost surely to a limiting random variable with the $\mathrm{Dir}(r_1, \dots, r_m)$ distribution.

A Pólya urn with multiple colors shows up in uniform attachment trees when the tree is partitioned into several subtrees. If we pick a subtree of $m$ vertices, as highlighted in bold in Figure 10, and remove its edges, then the tree is partitioned into $m$ subtrees. The sizes of these subtrees (i.e., the number of vertices they contain) evolve exactly like the classical Pólya urn with $m$ colors described above.
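The Dirichlet limit and its aggregation property can be probed by simulation. The following sketch uses three colors with initial counts $(1, 2, 3)$, an arbitrary choice made for illustration, and checks the mean of each coordinate as well as the variance obtained by merging the first two colors, which should be close to that of $\mathrm{Beta}(3, 3)$. All names are hypothetical.

```python
import random


def polya_urn_fractions(initial, draws, rng):
    """Run an m-color Polya urn and return the final fraction of each color.

    `initial[i]` is the starting number of balls of color i. At each step a
    uniformly random ball is drawn and returned with one extra ball of the
    same color.
    """
    balls = [color for color, count in enumerate(initial) for _ in range(count)]
    for _ in range(draws):
        balls.append(rng.choice(balls))  # duplicate a uniformly chosen ball
    total = len(balls)
    return [balls.count(color) / total for color in range(len(initial))]


def mean_and_variance(values):
    """Empirical mean and (population) variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


if __name__ == "__main__":
    rng = random.Random(2)
    initial = (1, 2, 3)
    samples = [polya_urn_fractions(initial, 300, rng) for _ in range(2000)]
    for color, r_i in enumerate(initial):
        mean, _ = mean_and_variance([s[color] for s in samples])
        print("color", color, "mean", mean, "expected", r_i / sum(initial))
    # Aggregation: colors 0 and 1 merged should look like Beta(1 + 2, 3).
    _, merged_var = mean_and_variance([s[0] + s[1] for s in samples])
    print("merged variance", merged_var, "Beta(3, 3) variance", 1 / 28)
```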

Adding multiple balls at a time
It is also natural to consider adding more than one extra ball at each time step. The effect of this is to change the parameter of the limiting Dirichlet distribution: if $k$ extra balls of the drawn color are added at each time step, then the fraction of balls of each color converges almost surely to a limiting random variable with the $\mathrm{Dir}(r_1/k, \dots, r_m/k)$ distribution.

Example 4.11. Pólya urns where two balls of the same color are added at each time step appear in preferential attachment trees as follows. Consider partitioning the tree into $m$ subtrees as in Figure 10, but now define the size of a subtree to be the sum of the degrees of the vertices in it. Consider which subtree the new incoming vertex attaches to. In the preferential attachment process each subtree is picked with probability proportional to its size and, whichever subtree is picked, the sum of the degrees (i.e., the size) increases by 2 due to the new edge. Thus the subtree sizes evolve exactly according to a Pólya urn as described above with $k = 2$.
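The effect of adding $k$ balls at a time can be seen in a small experiment. The sketch below starts from two colors with two balls each (an arbitrary illustrative choice) and compares $k = 1$, where the limiting fraction should be $\mathrm{Beta}(2, 2)$, with $k = 2$, where it should be $\mathrm{Beta}(1, 1)$, i.e., uniform on $[0, 1]$. The function names are ours.

```python
import random


def draw_color(counts, rng):
    """Pick a color with probability proportional to its current count."""
    u = rng.random() * sum(counts)
    for color, count in enumerate(counts):
        if u < count:
            return color
        u -= count
    return len(counts) - 1  # guard against floating-point round-off


def urn_fractions(initial, k, draws, rng):
    """Polya urn in which k extra balls of the drawn color are added."""
    counts = list(initial)
    for _ in range(draws):
        counts[draw_color(counts, rng)] += k
    total = sum(counts)
    return [c / total for c in counts]


def dirichlet_marginal_variance(a, total):
    """Variance of a Dirichlet coordinate with parameter a, parameter sum total."""
    return a * (total - a) / (total ** 2 * (total + 1))


def empirical_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    rng = random.Random(5)
    for k in (1, 2):
        xs = [urn_fractions((2, 2), k, 300, rng)[0] for _ in range(2000)]
        predicted = dirichlet_marginal_variance(2 / k, 4 / k)
        print("k =", k, "empirical", empirical_variance(xs), "predicted", predicted)
```

The larger variance for $k = 2$ reflects the fact that early draws have a greater effect on the composition of the urn when more balls are added per step.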

More general urn models
More generally, one can add some number of balls of each color at each time step. The replacement rule is often described by a replacement matrix of size $m \times m$, where the $i$th row of the matrix describes how many balls of each color to add to the urn if a ball of color $i$ is drawn. The urn models studied above correspond to replacement matrices that are a constant multiple of the identity. The literature on general replacement matrices is vast and we do not intend to discuss it here; our goal is just to describe the simple case when the replacement matrix is $\begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix}$. We refer to [38] for detailed results on triangular replacement matrices, and to the references therein for more general replacement rules.
The urn model with replacement matrix $\begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix}$ can also be described as the classical Pólya urn with two colors as described in Section 4.5.1, but in addition a blue ball is always added at each time step. It is thus natural to expect that there will be many more blue balls than red balls in the urn at large times. It turns out that the number of red balls at time $n$ scales as $\sqrt{n}$ instead of linearly in $n$. This is a special case of what is proved in [38].

Example 4.13. The evolution of the degree of any given vertex in a preferential attachment tree can be understood through such a Pólya urn. More precisely, fix a vertex $v$ in the tree, let $Y_n$ denote the degree of $v$ when there are $n$ vertices in total, and let $X_n$ denote the sum of the degrees of all other vertices. Then $(X_n, Y_n)$ evolves exactly according to a Pólya urn with replacement matrix $\begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix}$. This implies that the degree of any fixed vertex scales as $\sqrt{n}$ in the preferential attachment tree.
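The $\sqrt{n}$ scaling can be illustrated in two ways. Since the total number of balls grows deterministically by 2 per step, the expected number of red balls is multiplied by $1 + 1/(\text{total})$ at each step, which gives an exact product formula; this can be compared with direct simulation. The sketch below is a minimal illustration with hypothetical function names, starting from one blue and one red ball (an assumption made for concreteness). Quadrupling the number of steps should roughly double the number of red balls.

```python
import random


def triangular_urn_red_counts(blue, red, checkpoints, rng):
    """Simulate the urn with replacement matrix [[2, 0], [1, 1]].

    Drawing blue adds 2 blue balls; drawing red adds 1 blue and 1 red.
    Returns {step: number of red balls} for each requested checkpoint.
    """
    targets = set(checkpoints)
    result = {}
    for step in range(1, max(checkpoints) + 1):
        if rng.random() * (blue + red) < red:  # a red ball is drawn
            blue += 1
            red += 1
        else:                                   # a blue ball is drawn
            blue += 2
        if step in targets:
            result[step] = red
    return result


def expected_red(blue, red, steps):
    """Exact expected number of red balls after `steps` draws."""
    total = blue + red
    expectation = float(red)
    for _ in range(steps):
        expectation *= 1 + 1 / total  # E[red] grows by the factor 1 + 1/total
        total += 2                    # two balls are added at every step
    return expectation


if __name__ == "__main__":
    rng = random.Random(3)
    runs = [triangular_urn_red_counts(1, 1, [1000, 4000], rng) for _ in range(400)]
    mean_1000 = sum(r[1000] for r in runs) / len(runs)
    mean_4000 = sum(r[4000] for r in runs) / len(runs)
    print("simulated ratio:", mean_4000 / mean_1000)
    print("exact ratio:", expected_red(1, 1, 4000) / expected_red(1, 1, 1000))
```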

Proofs using Pólya urns
With the background on Pólya urns covered, we are now ready to understand some of the proofs of the results concerning root-finding algorithms from [18].

A root-finding algorithm based on the centroid
We start by presenting a simple root-finding algorithm for uniform attachment trees. This algorithm is not optimal, but its analysis is simple and highlights the basic ideas.
For a tree $T$, if we remove a vertex $v \in V(T)$, then the tree becomes a forest consisting of disjoint subtrees of the original tree. Let $\psi_T(v)$ denote the size (i.e., the number of vertices) of the largest component of this forest. For example, in Figure 9 if we remove $v_r$ from $S$, then the tree breaks into a singleton and a star consisting of 6 vertices; thus $\psi_S(v_r) = 6$. A vertex $v$ that minimizes $\psi_T(v)$ is known as a centroid of $T$; one can show that there can be at most two centroids. We define the confidence set $H_\psi$ by taking the set of $K$ vertices with smallest $\psi$ values.

Theorem 4.14. [18] The centroid-based $H_\psi$ defined above is a root-finding algorithm for the uniform attachment tree. More precisely, if $K \ge \frac{5}{2} \frac{\log(1/\varepsilon)}{\varepsilon}$, then
$$\liminf_{n \to \infty} \mathbb{P}\left(1 \in H_\psi\left(\mathrm{UA}(n)^\circ\right)\right) \ge 1 - \frac{4\varepsilon}{1 - \varepsilon},$$
where $1$ denotes the root, and $\mathrm{UA}(n)^\circ$ denotes the unlabeled version of $\mathrm{UA}(n)$.
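For the other term, let $T_{j,K}$ denote the tree containing vertex $j$ in the forest obtained by removing all edges between the first $K$ vertices $\{1, \dots, K\}$. First observe that for any $i > K$ we have
$$\psi(i) \ge \min_{1 \le k \le K} \sum_{j=1, j \ne k}^{K} |T_{j,K}|.$$
Now using the results on Pólya urns from Section 4.5 we have that for every $k$ such that $1 \le k \le K$, the random variable $\frac{1}{n} \sum_{j=1, j \ne k}^{K} |T_{j,K}|$ converges in distribution to the $\mathrm{Beta}(K - 1, 1)$ distribution. Hence by a union bound we have that
$$\limsup_{n \to \infty} \mathbb{P}\left(\exists\, i > K : \psi(i) \le (1 - \varepsilon) n\right) \le K (1 - \varepsilon)^{K-1}.$$
Putting together the two bounds gives that
$$\limsup_{n \to \infty} \mathbb{P}\left(1 \notin H_\psi\left(\mathrm{UA}(n)^\circ\right)\right) \le 2\varepsilon + K (1 - \varepsilon)^{K-1},$$
where $2\varepsilon$ bounds the first term, the limiting probability that $\psi(1) \ge (1 - \varepsilon) n$. This concludes the proof due to the assumption on $K$.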
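The role of the assumption on $K$ can be made concrete with a few numbers. The sketch below assumes that the union bound is applied at the threshold $1 - \varepsilon$, so that each of the $K$ limiting $\mathrm{Beta}(K - 1, 1)$ variables falls below the threshold with probability $(1 - \varepsilon)^{K-1}$ (the distribution function of $\mathrm{Beta}(a, 1)$ is $x \mapsto x^a$). It then evaluates $K (1 - \varepsilon)^{K-1}$ at the smallest integer $K$ allowed by Theorem 4.14. The function names are illustrative.

```python
import math


def min_confidence_set_size(eps):
    """Smallest integer K with K >= (5/2) * log(1/eps) / eps."""
    return math.ceil(2.5 * math.log(1 / eps) / eps)


def beta_tail(K, eps):
    """P(B <= 1 - eps) for B ~ Beta(K - 1, 1), whose CDF is x ** (K - 1)."""
    return (1 - eps) ** (K - 1)


def union_bound(K, eps):
    """Union bound over the K limiting Beta(K - 1, 1) random variables."""
    return K * beta_tail(K, eps)


if __name__ == "__main__":
    for eps in (0.5, 0.3, 0.1, 0.01, 0.001):
        K = min_confidence_set_size(eps)
        print(eps, K, union_bound(K, eps))
```

For these values of $\varepsilon$ the union bound is of order $\varepsilon$, which is what makes the choice of $K$ in the theorem sufficient.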
The same estimator $H_\psi$ works for the preferential attachment tree as well, if one takes $K \ge C \frac{\log^2(1/\varepsilon)}{\varepsilon^4}$ for some positive constant $C$. The proof mirrors the one above, but involves a few additional steps; we refer to [18] for details.
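The centroid-based confidence set $H_\psi$ is straightforward to compute. The following Python sketch is one possible implementation under our own conventions, not code from [18]: the tree is stored as a parent array, the root is vertex 0 (rather than 1 as in the text), and ties in $\psi$ are broken at random so that the vertex labels, which an algorithm working on the unlabeled tree may not use, do not leak into the output.

```python
import math
import random


def ua_tree(n, rng):
    """Uniform attachment tree as a parent array; vertex 0 is the root."""
    return [-1] + [rng.randrange(i) for i in range(1, n)]


def psi_values(parent):
    """psi(v) = size of the largest component left after removing v.

    Assumes parent[v] < v for every v >= 1, so that visiting vertices in
    decreasing order processes all children before their parent.
    """
    n = len(parent)
    size = [1] * n            # size of the subtree hanging below each vertex
    largest_child = [0] * n   # largest subtree among the children
    for v in range(n - 1, 0, -1):
        p = parent[v]
        size[p] += size[v]
        largest_child[p] = max(largest_child[p], size[v])
    # Removing v leaves its child subtrees and the rest of the tree.
    return [max(n - size[v], largest_child[v]) for v in range(n)]


def centroid_confidence_set(parent, K, rng):
    """The K vertices with smallest psi, with ties broken at random."""
    psi = psi_values(parent)
    order = sorted(range(len(parent)), key=lambda v: (psi[v], rng.random()))
    return set(order[:K])


def confidence_set_size(eps):
    """Smallest integer K allowed by Theorem 4.14."""
    return math.ceil(2.5 * math.log(1 / eps) / eps)


if __name__ == "__main__":
    rng = random.Random(4)
    K = confidence_set_size(0.2)
    trials = 200
    hits = sum(0 in centroid_confidence_set(ua_tree(300, rng), K, rng)
               for _ in range(trials))
    print("K =", K, "empirical success rate:", hits / trials)
```

The parent-array representation makes all $\psi$ values computable in a single backward pass, so the whole procedure runs in $O(n \log n)$ time, dominated by the sort.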
For uniform attachment the bound on K given by Theorem 4.14 is not optimal. It turns out that it is possible to write down the maximum likelihood estimator (MLE) for the root in the uniform attachment model; we do not do so here, see [18]. One can view the estimator H ψ based on the centroid as a certain "relaxation" of the MLE. By constructing a certain "tighter" relaxation of the MLE, one can obtain a confidence set with size subpolynomial in 1/ε as described in Theorem 4.5. The analysis of this is the most technical part of [18] and we refer to [18] for more details.

Lower bounds
As mentioned above, the MLE for the root can be written down explicitly. This aids in showing a lower bound on the size of a confidence set. In particular, Bubeck et al. [18] define a set of trees whose probability of occurrence under the uniform attachment model is not too small, yet the MLE provably fails, giving the lower bound described in Theorem 4.5. We refer to [18] for details.
On the other hand, for the preferential attachment model it is not necessary to use the structure of the MLE to obtain a lower bound. A simple symmetry argument suffices to show the lower bound in Theorem 4.4, which we now sketch.
First observe that the probability of error for the optimal procedure is nondecreasing with n, since otherwise one could simulate the process to obtain a better estimate. Thus it suffices to show that the optimal procedure must have a probability of error of at least ε for some finite n. We show that there is some finite n such that with probability at least 2ε, the root is isomorphic to at least 2c/ε vertices in PA(n). Thus if a procedure outputs at most c/ε vertices, then it must make an error at least half the time (so with probability at least ε).

Outlook: open problems and extensions
There are many open problems and further directions that one can pursue; the four main papers we have discussed [22,25,20,18] alone contain 20 open problems and conjectures. For instance, can the bounds on the size of the optimal confidence set be improved and ultimately tightened? What about other tree growth models? What happens when we lose the tree structure and consider general graphs, e.g., by adding multiple edges at each time step?
When the tree growth model is not as combinatorial as uniform attachment or preferential attachment, then other techniques might be useful. In particular, many tree growth models can be embedded into continuous time branching processes and then the full machinery of general branching processes can be brought to the fore and applied; see [53,9] and the references therein for such results. This approach can also be used to obtain finite confidence sets for the root, as demonstrated recently in [40] for sublinear preferential attachment trees.
A closely related problem to those discussed in this lecture is that of detecting the source of a diffusion spreading on an underlying network. The results are very similar to those above: the rumor source can be efficiently detected in many settings, see, e.g., [54,55,44]. A different twist on this question is motivated by anonymous messaging services: can one design protocols for spreading information that preserve anonymity by minimizing the probability of source detection? Fanti et al. [33] introduced a process, termed adaptive diffusion, that indeed achieves this goal. Understanding the tradeoffs between privacy and other desiderata is timely and should lead to lots of interesting research.