Finding the seed of uniform attachment trees

A uniform attachment tree is a random tree that is generated dynamically. Starting from a fixed "seed" tree, vertices are added sequentially by attaching each vertex to an existing vertex chosen uniformly at random. Upon observing a large (unlabeled) tree, one wishes to find the initial seed. We investigate to what extent seed trees can be recovered, at least partially. We consider three types of seeds: a path, a star, and a random uniform attachment tree. We propose and analyze seed-finding algorithms for all three types of seed trees.


Introduction
Dynamically growing networks represent complex relationships in numerous areas of science. In a rapidly increasing number of applications, one does not observe the entire growth process; only a present-day snapshot of the network is available. Based on this snapshot, one wishes to infer various properties of the past of the network. Such problems belong to the area that may be termed network archeology; see Navlakha and Kingsford [16].
The simplest dynamically grown networks are trees that are grown by attaching vertices sequentially to the existing tree at random, according to a certain rule. In the uniform attachment model, at each step, an existing vertex is selected uniformly at random, and a new vertex is attached to it by an edge. When the process is initialized from a single vertex, this procedure gives rise to the well-studied uniform random recursive tree, see Drmota [9]. In preferential attachment models (such as plane-oriented recursive trees) existing vertices with higher degrees are more likely to be chosen to be attached to. In this paper we consider randomly growing uniform attachment trees that are grown from a fixed seed. Thus, initially, the tree is a given fixed (small) tree and further vertices are attached according to the uniform attachment process.
Several papers consider the problem of finding the initial vertex (or root) in a randomly growing tree started from a single vertex, see Brautbar and Kearns [3], Borgs, Brautbar, Chayes, Khanna, and Lucier [1], Frieze and Pegden [10], Shah and Zaman [19,18], Bubeck, Devroye, and Lugosi [4], Jog and Loh [14,13] for various models. Randomly growing trees started from an initial seed tree were considered by Bubeck, Mossel, and Rácz [6], Bubeck, Eldan, Mossel, and Rácz [5], and Curien, Duquesne, Kortchemski, and Manolescu [7]. These papers prove that in uniform and preferential attachment models, for any pair of possible seed trees, one may construct a hypothesis test that decides which of the two seeds generated the observed tree, with a probability of error strictly smaller than 1/2, regardless of the size of the observed tree.
In this paper we consider the problem of finding the seed tree (of known structure) in a large observed tree. The questions we seek to answer are: (1) to what extent is it possible to identify the seed tree? (2) what is the role of the structure of the seed in the difficulty of the reconstruction problem? While we are far from completely answering these questions, this paper contributes to the understanding of these problems. In particular, we consider three types of possible seed trees, namely paths, stars, and random uniform recursive trees. For each of these examples, we present algorithms to recover, at least partially, the seed tree. In all cases, partial recovery is possible, with any prescribed probability of error, regardless of the size of the observed tree. However, the difficulty of the recovery depends heavily on the structure of the tree. Paths and stars are considerably easier to find than uniform random recursive trees.
In Section 2 we introduce the mathematical model and state the main results. The proofs of all results are presented in Section 3.

Setup and results
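Let $\ell \ge 1$ be a positive integer and let $S_\ell$ be a tree (i.e., a connected acyclic graph) on the vertex set $\{1, \ldots, \ell\}$. Let $n > \ell$ be another positive integer. We say that a random tree $T_n$ on the vertex set $\{1, \ldots, n\}$ is a uniform attachment tree with seed $S_\ell$ if it is generated as follows:
• $T_\ell = S_\ell$;
• for $i = \ell + 1, \ldots, n$, the tree $T_i$ is obtained from $T_{i-1}$ by joining vertex $i$ by an edge to a vertex of $T_{i-1}$ chosen uniformly at random, independently of all previous choices.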
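The growth process described above can be simulated in a few lines. The following is a minimal sketch, not part of the original paper: the function names `grow_uniform_attachment`, `path_seed`, and `star_seed` are hypothetical, and the tree is stored simply as a list of edges.

```python
import random


def path_seed(ell):
    """Edges of a path on vertices 1..ell, labelled in order along the path."""
    return [(i, i + 1) for i in range(1, ell)]


def star_seed(ell):
    """Edges of a star on vertices 1..ell whose center is vertex 1."""
    return [(1, i) for i in range(2, ell + 1)]


def grow_uniform_attachment(seed_edges, ell, n, rng):
    """Grow a uniform attachment tree on vertices 1..n from a seed on 1..ell.

    seed_edges: list of (u, v) pairs forming a tree on {1, ..., ell}
                (only the edge count is checked here).
    rng:        a random.Random instance, so that runs are reproducible.
    Returns the list of all n - 1 edges; new edges are stored as (parent, child).
    """
    if len(seed_edges) != ell - 1:
        raise ValueError("a tree on ell vertices has exactly ell - 1 edges")
    edges = list(seed_edges)
    for i in range(ell + 1, n + 1):
        # Vertex i picks its neighbour uniformly among the i - 1 existing vertices.
        parent = rng.randint(1, i - 1)
        edges.append((parent, i))
    return edges


if __name__ == "__main__":
    tree = grow_uniform_attachment(path_seed(5), ell=5, n=20, rng=random.Random(0))
    print(len(tree), "edges")
```

In the seed-finding problem the labels produced by this simulation are hidden from the algorithm; they are kept here only so that the output of a seed-finding procedure can be compared with the true seed $\{1, \ldots, \ell\}$.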
The problem we study in this paper is the following. Suppose one observes a tree $T_n$ generated by the uniform attachment process with seed $S_\ell$ but with the vertex labels hidden. The goal is to find the seed tree $S_\ell$ in the observed unlabeled tree. More precisely, given a target accuracy $\epsilon \in (0, 1)$, a seed-finding algorithm of the first kind outputs a set $H_1(T_n, \epsilon)$ of vertices of size $k_\ell \le \ell$ such that, with probability at least $1 - \epsilon$, $H_1(T_n, \epsilon) \subset S_\ell$, that is, all elements of $H_1(T_n, \epsilon)$ are vertices of the seed tree $S_\ell$. (Here, with a slight abuse of notation, we identify the seed $S_\ell$ with its vertex set $\{1, \ldots, \ell\}$.) Similarly, a seed-finding algorithm of the second kind outputs a set $H_2(T_n, \epsilon)$ of vertices of size $k_\ell \ge \ell$ such that, with probability at least $1 - \epsilon$, $S_\ell \subset H_2(T_n, \epsilon)$, that is, $H_2(T_n, \epsilon)$ contains all vertices of the seed tree $S_\ell$.
In both cases, one would like to have $k_\ell$ as close to $\ell$ as possible, even for small values of $\epsilon$.
Bubeck, Devroye, and Lugosi [4] considered the case $\ell = 1$, that is, when the seed tree is a single vertex, for seed-finding algorithms of the second kind. Thus, the aim of the seed-finding algorithm is to find the root of the observed tree. Their main finding is that, for all $\epsilon$, the optimal value of $k_1$ stays bounded as the size $n$ of the observed tree goes to infinity. They also show that there exist seed-finding algorithms of the second kind such that $k_1 = o(\epsilon^{-a})$ for all $a > 0$.
In this paper we show that, if $\ell$ is sufficiently large (depending on $\epsilon$), then $k_\ell$ may be made proportional to $\ell$ for seed-finding algorithms of the second kind, and we make similar statements for $k_\ell$ for certain seed-finding algorithms of the first kind. How the required value of $\ell$ depends on $\epsilon$ and what the achievable proportions are depend heavily on the structure of the seed. We consider three prototypical examples of seeds:
• A path $P_\ell$ on $\ell$ vertices is a tree that has exactly two vertices of degree one and $\ell - 2$ vertices of degree two.
• A star $E_\ell$ on $\ell$ vertices is a tree that has $\ell - 1$ vertices of degree one and one vertex of degree $\ell - 1$.
• The third example we consider is when the seed $S_\ell$ is a uniform random recursive tree on $\ell$ vertices. In this case the proposed seed-finding algorithm does not need to know the structure of the tree. Thus, this example may be considered as a generalization of the root-finding problem studied in [4]. Here, instead of trying to locate the root of the tree, the goal is to find the first $\ell$ generations of the observed uniform random recursive tree $T_n$.
In what follows we present the main findings of the paper that establish the existence of seed-finding algorithms that are able to recover a constant fraction of the seed if it is a uniform random recursive tree. If the seed is either a path or a star, then the situation is even better as one can recover almost the entire seed.
Importantly, all bounds established below are independent of the size $n$ of the observed tree, meaning that (partial) reconstruction of the seed is possible regardless of how large the observed tree $T_n$ is.

Finding the seed when it is a path
We begin with the case when the seed is a path.
Theorem 1. Let $\epsilon \in (0, 1)$ and $\gamma \in (0, 1)$ and let
$$\ell \ge \max\left( \frac{2e^2}{\gamma} \log\frac{1}{\epsilon},\ \frac{2e^2}{\gamma} \log(4e^2) \right)$$
be a positive integer. Then for all $n \ge \ell$ sufficiently large, if $T_n$ is a uniform attachment tree with seed $S_\ell = P_\ell$ (a path of $\ell$ vertices), then there exists a seed-finding algorithm that outputs a vertex set $H_n \subset \{1, \ldots, n\}$ with $|H_n| \ge (1 - \gamma)\ell$ such that
$$\mathbb{P}\{H_n \subset P_\ell\} \ge 1 - \epsilon.$$
The theorem states that, for any fixed $\gamma > 0$, if the size $\ell$ of the seed path is at least of the order of $\log(1/\epsilon)$, then there exists an algorithm that finds all but a $\gamma$-fraction of the seed path, regardless of how large the observed tree $T_n$ is. Note that the required length of the path is merely logarithmic in $1/\epsilon$. In fact, this dependence is essentially best possible. The following result shows that if the seed path has fewer than $\log(1/\epsilon)/\log\log(1/\epsilon)$ vertices, then any seed-finding algorithm must miss at least half of the seed, with probability greater than $\epsilon$.
Theorem 2. Suppose that $T_n$ is a uniform attachment tree with seed $S_\ell = P_\ell$ for $\ell \le \frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}$. Then, for all $n \ge 2\ell$, any seed-finding algorithm that outputs a vertex set $H_n$ of size $\ell$ has
$$\mathbb{P}\{|H_n \cap P_\ell| \le \ell/2\} \ge \epsilon.$$

Finding the seed when it is a star
Next we state our results for the case when the seed tree is a star $E_\ell$ on $\ell$ vertices.
Theorem 3. There exists a positive numerical constant $C$ such that the following holds. Let $\epsilon \in (0, 1)$ and $\gamma \in (0, 1)$ and let $\ell \ge \max(C, 8/\gamma) \log(1/\epsilon)$ be a positive integer. Then for all $n \ge \ell$ sufficiently large, if $T_n$ is a uniform attachment tree with seed $S_\ell = E_\ell$ (a star of $\ell$ vertices), then there exists a seed-finding algorithm that outputs a vertex set $H_n \subset \{1, \ldots, n\}$ with $|H_n| \le (1 + \gamma)\ell$ such that
$$\mathbb{P}\{E_\ell \subset H_n\} \ge 1 - \epsilon.$$
Once again, the order of magnitude for the required size of the seed star is essentially optimal as a function of $\epsilon$. The proof of the next theorem is similar to that of Theorem 2 and thus it is omitted.
Then, for all $n \ge 2\ell$, any seed-finding algorithm that outputs a vertex set $H_n$ of size $\ell$ has

Finding the first generations
Finally, we consider the case when the seed tree is a uniform random recursive tree on $\ell$ vertices. Unlike in the previous two examples, here the seed-finding algorithm does not "know" the exact structure of the seed. This model may be equivalently formulated as follows: starting from a single vertex, one grows a uniform random recursive tree $T_n$ of $n$ vertices. Upon observing $T_n$ (without vertex labels), one's aim is to recover as much of the tree $T_\ell$ (containing vertices attached in the first $\ell$ generations) as possible. The next theorem establishes the existence of a seed-finding algorithm of the first kind that identifies an $\Omega(1/\log(1/\epsilon))$ fraction of the vertices of the seed $T_\ell$ with probability at least $1 - \epsilon$, whenever $\ell$ is at least proportional to $\log^3(1/\epsilon)$. One should note that this result is weaker than the ones obtained for seed paths and seed stars above in two ways. First, unlike in the cases of Theorems 1 and 3, here we cannot guarantee that almost all of the seed tree is identified, but only a fraction of it whose size depends on $\epsilon$, although in a mild manner. Second, the size of the seed tree needs to be somewhat larger as a function of $\epsilon$ than before. While in the previous cases $\ell$ needed to be logarithmic in $1/\epsilon$, now it needs to scale as $\log^3(1/\epsilon)$. Below we show that to some extent these weaker results are inevitable and that finding the seed tree $T_\ell$ is inherently harder than finding more structured seed trees such as stars and paths.
Our main positive result is as follows.
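Theorem 5. Let $T_n$ be a uniform random recursive tree on $n$ vertices and let $\epsilon > 0$ and $\ell \ge 1$. Let $a = 2\log(4\ell^2/\epsilon) + 1$. If $\ell$ is so large that $\ell \ge 64a^2 \log(22a/\epsilon)$, then there exists a seed-finding algorithm that outputs a vertex set $H_n \subset \{1, \ldots, n\}$ with $|H_n| \ge \ell/(3a)$ such that, for all sufficiently large $n$,
$$\mathbb{P}\{H_n \subset T_\ell\} \ge 1 - \epsilon.$$
Note that the condition for $\ell$ is satisfied for $\ell \ge C \log^3(1/\epsilon)$ for a constant $C$.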
Next we show that, regardless of how large $\ell$ is, for $n$ sufficiently large any seed-finding algorithm of the first kind needs to output a set of vertices whose size is at most $c\ell$, where $c$ is strictly smaller than 1. Similarly, any seed-finding algorithm of the second kind needs to output a set of vertices whose size is at least $C\ell$, where $C > 1$.
In other words, when the seed tree is a uniform random recursive tree, the problem of finding it is strictly harder than finding a seed path or a seed star, in the sense that no algorithm can match the performance established in Theorem 1 or Theorem 3. Note, however, that there remains a gap between the performance bound of Theorem 5 and the impossibility bound of Theorem 6 below, as the size of the vertex set in the seed found by the algorithm of Theorem 5 is only guaranteed to be of the order of $\ell/\log(1/\epsilon)$, a linear fraction of $\ell$ but one that depends on $\epsilon$.
The impossibility results mentioned above follow from the fact that, at time $2\ell$, a linear fraction of the vertices of the seed $T_\ell$ become indistinguishable from vertices that arrive between time $\ell + 1$ and $2\ell$. To make the statement precise, we need a few definitions.
In a uniform random recursive tree $T_\ell$, we call a vertex a singleton if it is a leaf and it is the only descendant of its parent vertex. Now consider a vertex $v$ in $T_\ell$ and its position in the tree $T_{2\ell}$. We say that $v$ is a camouflaging vertex if
• $v$ is the parent of a singleton $d$ in $T_\ell$;
• in $T_{2\ell}$, the vertex $v$ has exactly two descendants: $d$ and a vertex $w \in \{\ell + 1, \ldots, 2\ell\}$ that is also attached to $v$.
Clearly, at time $2\ell$, and therefore at any time $n \ge 2\ell$, the two descendants $d$ and $w$ of any camouflaging vertex $v$ are indistinguishable. Let $G_\ell$ denote the number of camouflaging vertices. Thus, if a seed-finding algorithm outputs a vertex set that contains $(1 - \gamma)\ell$ vertices of the seed, then one must have $G_\ell < \gamma\ell$. The next theorem shows that $\gamma \ge 1/384$ with high probability.
Theorem 6. For any $\ell \ge 1$ and for any $t \ge 0$,
$$\mathbb{E}G_\ell \ge \frac{\ell}{384} \quad\text{and}\quad \mathbb{P}\{G_\ell \le \mathbb{E}G_\ell - t\} \le e^{-t^2/(2\ell)}.$$

Proofs
In this section we present the proofs of all theorems. The construction of all seed-finding algorithms uses a simple notion of centrality that we recall first.

Centrality
Let $T$ be a tree with vertex set $V(T)$. A rooted tree $(T, v)$ is the tree $T$ with a distinguished vertex $v \in V(T)$. For a vertex $u \in V(T)$, denote by $(T, v)_{u\downarrow}$ the rooted subtree of $T$ whose root is $u$ and whose vertex set contains all vertices $w$ of $V(T)$ such that the (unique) path connecting $w$ and $v$ in $T$ contains $u$.
Given a tree $T$, the anti-centrality of a vertex $v \in V(T)$ is defined by
$$\psi(v) = \max_{u \in V(T) \setminus \{v\}} |(T, v)_{u\downarrow}|.$$
Thus, $\psi(v)$ is the size of the largest subtree hanging from $v$ when the tree $T$ is rooted at $v$. Note that leaves of a tree $T$ have the largest anti-centrality with $\psi(v) = |V(T)| - 1$. We say that $v$ is at least as central as $u$ if $\psi(v) \le \psi(u)$. For a positive integer $k$, we denote by $H_\psi(k)$ the set of $k$ vertices of $T$ with smallest anti-centrality, where ties may be broken arbitrarily.
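As an illustration of the definition, the anti-centrality of every vertex can be computed with a single depth-first traversal: after rooting the tree anywhere, the branches hanging from a vertex are its child subtrees plus the remainder of the tree on the side of its parent. The sketch below is ours, not the paper's; the names `anti_centrality` and `most_central` are hypothetical, the tree is assumed to have at least two vertices, and ties are broken by vertex label.

```python
from collections import defaultdict


def anti_centrality(edges):
    """Return {v: psi(v)}, where psi(v) is the size of the largest component
    left after deleting v from the tree given by `edges` (at least one edge)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    n = len(adj)
    root = next(iter(adj))
    # Iterative DFS: record a parent map and a preorder of the vertices.
    parent = {root: None}
    order = []
    stack = [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w != parent[u]:
                parent[w] = u
                stack.append(w)
    # Subtree sizes relative to `root`; reversed preorder visits children first.
    size = {u: 1 for u in adj}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    psi = {}
    for u in adj:
        branches = [size[w] for w in adj[u] if w != parent[u]]
        branches.append(n - size[u])  # the branch on the side of the parent
        psi[u] = max(branches)
    return psi


def most_central(edges, k):
    """The k vertices with smallest anti-centrality (ties broken by label)."""
    psi = anti_centrality(edges)
    return sorted(psi, key=lambda v: (psi[v], v))[:k]
```

On a path with five vertices labelled in order, the middle vertex has anti-centrality 2 and the two end vertices have anti-centrality 4, in line with the remark that leaves are the least central vertices.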
This notion of centrality played a crucial role in some of the root-finding algorithms of [4]. We refer to Jog and Loh [14,13] for a study of this notion in various random tree models, including uniform random recursive trees.

Proof of Theorem 1
Let $\epsilon$, $\gamma$, and $\ell$ be as in the assumptions of the theorem. We may assume, without loss of generality, that $\gamma\ell/2$ is an integer. We analyze a simple seed-finding algorithm that achieves the performance stated in the theorem. The proposed algorithm simply takes the $(1 - \gamma)\ell$ most central vertices, as measured by the function $\psi$ defined in Section 3.1.
Formally, let $k_\ell = (1 - \gamma)\ell$ and define $H_n = H_\psi(k_\ell)$ to be the set of $k_\ell$ most central vertices of the observed tree $T_n$.
It suffices to prove that, for all sufficiently large $n$, with probability at least $1 - \epsilon$, all vertices of $T_n$ not in the seed $P_\ell$ are less central than any vertex in $P_\ell$ whose distance to the leaves of $P_\ell$ is at least $\gamma\ell/2$, that is, (Recall that the vertex set of the seed $P_\ell$ is $\{1, \ldots, \ell\}$.) Let $C_1, \ldots, C_\ell$ denote the components of the forest obtained by removing the edges of $P_\ell$ from $T_n$ such that $k \in C_k$ for $k = 1, \ldots, \ell$. Then To bound the probabilities on the right-hand side, suppose, without loss of generality, that $k \le j$. (The case $k > j$ is analogous.) Let $v \in C_k \setminus \{k\}$ be such that $\psi(v) \le \psi(j)$, and let $u$ be a vertex connected to $v$ such that $(T, v)_{u\downarrow}$ is maximal (i.e., $\psi(v) = |(T, v)_{u\downarrow}|$). Then there are two possibilities: By this observation, we have Now let $t = \gamma/e^2$. Then the right-hand side of the inequality above may be bounded further by To understand the behavior of the probabilities on the right-hand side, note that, for any $k = 1, \ldots, \ell - 1$, $\sum_{i=1}^{k} |C_i|$ is just the number of red balls after taking $n - \ell$ samples in a standard Pólya urn initialized with $k$ red and $\ell - k$ blue balls. This implies that We may bound the expression on the right-hand side by where we used Stirling's formula and the choice $t = \gamma/e^2$. Putting everything together, we have that under our conditions for $\ell$, as desired.
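The Pólya urn connection used in the proof can be checked numerically. Each new vertex joins one of the components $C_1, \ldots, C_k$ with probability proportional to their total size, which is exactly the draw rule of a two-colour urn. The sketch below is our own illustration, with hypothetical function names: it computes the exact law of the number of red balls added by dynamic programming and compares it with the standard beta-binomial formula for Pólya urns.

```python
from fractions import Fraction
from math import comb


def polya_red_distribution(red, blue, draws):
    """Exact law of the number of red balls ADDED after `draws` steps of a
    standard Polya urn started with `red` red and `blue` blue balls."""
    dist = {0: Fraction(1)}
    for step in range(draws):
        total = red + blue + step  # number of balls before this draw
        nxt = {}
        for added, p in dist.items():
            p_red = Fraction(red + added, total)
            nxt[added + 1] = nxt.get(added + 1, 0) + p * p_red
            nxt[added] = nxt.get(added, 0) + p * (1 - p_red)
        dist = nxt
    return dist


def rising(x, j):
    """Rising factorial x (x + 1) ... (x + j - 1)."""
    out = 1
    for i in range(j):
        out *= x + i
    return out


def beta_binomial_pmf(j, draws, red, blue):
    """P{j red balls added in `draws` draws} for the same urn."""
    numerator = comb(draws, j) * rising(red, j) * rising(blue, draws - j)
    return Fraction(numerator, rising(red + blue, draws))


if __name__ == "__main__":
    # Seed path with ell = 5, first k = 2 components red, 6 vertices added.
    law = polya_red_distribution(red=2, blue=3, draws=6)
    print({j: str(p) for j, p in sorted(law.items())})
```

In the notation of the proof, `red` plays the role of $k$, `blue` the role of $\ell - k$, and the number of red balls added is $\sum_{i=1}^{k} |C_i| - k$.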

Proof of Theorem 2
Let $E$ be the event that either (1) vertex $i$ attaches to vertex $i - 1$ for all $i = \ell + 1, \ldots, 2\ell$, or (2) vertex $\ell + 1$ attaches to vertex 1 and, for all $i = \ell + 2, \ldots, 2\ell$, vertex $i$ attaches to vertex $i - 1$. On this event, $T_{2\ell}$ is a path of $2\ell$ vertices such that the seed $P_\ell$ is at one of the two ends of $T_{2\ell}$. The probability of this event is On this event, for $n \ge 2\ell$, for any seed-finding algorithm, the first and second halves of the path $T_{2\ell}$ are indistinguishable. At least one of the two halves of $T_{2\ell}$ is such that $H_n$ intersects that half in at most $\ell/2$ vertices. Thus (conditionally on $E$), the algorithm misses at least half of the seed path with probability at least 1/2. Hence
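The probability of the event $E$ is easy to evaluate exactly. The following sketch is our own computation rather than a formula quoted from the paper. It assumes that the seed path is labelled $1, \ldots, \ell$ in order along the path and that $\ell \ge 2$, so that the two cases defining $E$ are disjoint; the function name is hypothetical.

```python
from fractions import Fraction
from math import factorial


def prob_path_extension(ell):
    """P(E) for ell >= 2: the ell new vertices extend the seed path at one end.

    Assumes the seed path is labelled 1, ..., ell in order along the path.
    """
    if ell < 2:
        raise ValueError("the two cases of E are disjoint only for ell >= 2")
    one_side = Fraction(1)
    for i in range(ell + 1, 2 * ell + 1):
        # Vertex i must choose one prescribed vertex among the i - 1 present.
        one_side *= Fraction(1, i - 1)
    # Cases (1) and (2) are disjoint and have the same probability.
    return 2 * one_side


if __name__ == "__main__":
    for ell in (2, 3, 5, 8):
        print(ell, prob_path_extension(ell))
```

Under these assumptions the product telescopes to $2(\ell - 1)!/(2\ell - 1)!$, which exceeds $2(2\ell)^{-\ell}$ since each factor is larger than $1/(2\ell)$. This is the reason the lower bound is only effective for seed paths of length of order $\log(1/\epsilon)/\log\log(1/\epsilon)$.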

Proof of Theorem 3
Let $k_\ell = (1 + \gamma)\ell$. Again, we may assume that $k_\ell$ is an integer. The seed-finding algorithm we propose is slightly different. It is specifically tailored to the case when the seed tree to be found is a star. Let $v^*_n = \operatorname{argmin}_{i=1,\ldots,n} \psi(i)$ be the most central vertex of $T_n$. We define $H_n$ as the set of vertices that includes $v^*_n$ and the $k_\ell - 1$ neighbors $j$ of $v^*_n$ in $T_n$ with the largest values of $|(T_n, v^*_n)_{j\downarrow}|$. In other words, the algorithm outputs the most central vertex $v^*_n$ and those neighbors whose subtree away from $v^*_n$ is largest.
First we recall that by Jog and Loh [14, Theorem 4], there exists a numerical constant $C$ such that, if $\ell \ge C \log(1/\epsilon)$ and the uniform attachment tree is initialized with a star $E_\ell$ as seed of $\ell$ vertices and central vertex 1, then
$$\mathbb{P}\{v^*_n = 1 \text{ for all } n \ge \ell\} \ge 1 - \frac{\epsilon}{2},$$
that is, with probability at least $1 - \epsilon/2$, the center of the seed star remains the most central vertex of $T_n$ for all $n$.
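The star-finding rule just described can be sketched as follows. This is a deliberately naive illustration of ours (quadratic time, hypothetical names `branch_sizes` and `find_star_seed`), assuming the tree is given as an edge list with at least two vertices and that ties are broken by vertex label.

```python
from collections import defaultdict


def branch_sizes(adj, center):
    """For each neighbour j of `center`, the number of vertices in the
    component containing j once `center` is deleted."""
    sizes = {}
    for j in adj[center]:
        seen = {center, j}
        stack = [j]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes[j] = len(seen) - 1  # do not count `center` itself
    return sizes


def find_star_seed(edges, k):
    """Return the most central vertex followed by its k - 1 neighbours
    with the largest branches."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    adj = dict(adj)
    # Anti-centrality psi(v): size of the largest branch hanging from v.
    psi = {v: max(branch_sizes(adj, v).values()) for v in adj}
    center = min(psi, key=lambda v: (psi[v], v))
    sizes = branch_sizes(adj, center)
    ranked = sorted(sizes, key=lambda j: (-sizes[j], j))
    return [center] + ranked[:k - 1]


if __name__ == "__main__":
    # Star seed {1, 2, 3, 4, 5} with a few later vertices attached to leaves.
    tree = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (3, 8)]
    print(find_star_seed(tree, 5))
```

A linear-time version would reuse subtree sizes from a single traversal; the naive form is kept here so that each step mirrors the description in the text.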
Let $v_1 \le v_2 \le \cdots$ be the vertices that are attached to vertex 1 (i.e., to the center of the seed star $E_\ell$) in the uniform attachment process. (Thus, $v_1 > \ell$.) In view of the above-mentioned result of Jog and Loh, it suffices to show that for all $n$ sufficiently large, all vertices $v_j$ with $j > \gamma\ell$ have $|(T_n, 1)_{v_j\downarrow}|$ smaller than $|(T_n, 1)_{i\downarrow}|$ for all vertices $i$ in the seed star $E_\ell$, with probability at least $1 - \epsilon/2$. Thus, writing $g(i) = |(T_n, 1)_{i\downarrow}|$, we need to prove that lim sup To prove (3.2), first we write where we take $m = \lfloor e^{\gamma\ell/4} \rfloor$. The first term on the right-hand side is the probability that more than $\gamma\ell$ vertices are attached to vertex 1 up to time $m$. In order to bound this probability, denote by $X_t$, for $t \ge \ell$, the number of vertices attached to vertex 1 between time $\ell + 1$ and $t$. Thus, $X_\ell = 0$ and is a martingale with respect to the filtration generated by $X_\ell, X_{\ell+1}, \ldots$. Denote the corresponding martingale difference sequence by j · E e m j=ℓ+1 Z j e γℓ . (3.4) In order to bound the right-hand side, observe that We proceed by writing Now fix $i \in \{2, \ldots, \ell\}$ and notice that $\max_{v_j > m} g(j)$ is bounded by the number of vertices $A$ attached to the tree formed by vertex 1 and all vertices in the subtrees $(T_n, 1)_{j\downarrow}$ for $j > m$ such that vertex $j$ is attached to vertex 1.
Denoting $B = g(i)$ and $C = n - A - B$, note that, conditioned on the tree $T_m$, the triple $(A, B, C)$ behaves as the number of red, blue, and white balls in a Pólya urn in which initially (i.e., at time $m$) there is one red ball, $B_m = |(T_m, 1)_{i\downarrow}|$ blue balls, and $m - 1 - |(T_m, 1)_{i\downarrow}|$ white balls. Hence, for each $i = 2, \ldots, \ell$, we have In order to bound the second term on the right-hand side, note that by the standard theory of Pólya urns, $B_m$ has a beta-binomial distribution with parameters $(m, 1, \ell - 1)$. Thus, $B_m$ is distributed as a binomial random variable $\mathrm{Bin}(m, \pi)$ where the parameter $\pi$ is an independent $\mathrm{Beta}(1, \ell - 1)$ random variable. Thus, (by a standard binomial estimate and expressing the beta distribution) $\le e^{-m\epsilon/(128\ell^2)} + \frac{\epsilon}{16\ell}$ (by the Bernoulli inequality) whenever $\ell > \frac{4}{\gamma}\left( \log(1/\epsilon) + \log\log(8\ell/\epsilon) + \log(128\ell^2) \right)$. To finish the proof it remains to show that lim sup But this follows from the fact that this limiting probability is bounded by the probability that a $\mathrm{Beta}(1, m\epsilon/(32\ell^2))$ random variable is greater than 1/2, which is at most $2^{-m\epsilon/(32\ell^2)}$. Since $m = \lfloor e^{\gamma\ell/4} \rfloor$, this is bounded by $\epsilon/(8\ell)$ for $\ell > (8/\gamma \vee C) \log(1/\epsilon)$, as desired.

Proof of Theorem 5
Fix $\epsilon \in (0, 1)$ and define $a = 2\log(4/\epsilon) + 1$ and $k_\ell = \ell/(3a)$. A seed-finding algorithm with the desired property simply selects the $k_\ell$ most central vertices. (Again, for simplicity of the presentation, we assume that $k_\ell$ is an integer.) With the notation introduced at the beginning of this section, we define $H_n = H_\psi(k_\ell)$. We need to show that the $k_\ell$ most central vertices of $T_n$ are in $T_\ell$ with probability at least $1 - \epsilon$ for all sufficiently large $n$.
The strategy of our proof is as follows. First we show that, with probability at least $1 - \epsilon/2$, the seed $T_\ell$ contains at least $k_\ell$ "deep" vertices. Then we prove that for all $n$ sufficiently large, all deep vertices of $T_\ell$ are more central in $T_n$ than any vertex outside of the seed $T_\ell$.
We call a vertex $v \in T_\ell$ deep if it has at least $a$ descendants, that is, if Denote by $A_\ell$ the set of all deep vertices of $T_\ell$. Noticing that (3.6) (3.5) follows from inequality (4.1) in the Appendix under the condition $\ell \ge 64a^2 \log(22a/\epsilon)$.
It remains to prove (3.6). To this end, for $i \in \{1, \ldots, \ell\}$, denote by $C_i$ the component of vertex $i$ in the forest obtained by removing the edges of $T_\ell$ from $T_n$. Then Now fix $T_\ell$ and vertices $k \in \{1, \ldots, \ell\}$ and $u \in A_\ell$. For any vertex $v \in C_k \setminus \{k\}$ such that $\psi(v) \le \psi(u)$, there are two possibilities: (1) either the largest subtree of $T_n$ rooted at $v$ is inside $C_k$, in which case $|C_k| \ge \sum_{i \ne k} |C_i|$; (2) or the largest subtree of $T_n$ rooted at $v$ is Since $u \in A_\ell$, this means that the left-hand side is dominated by the number of red balls in a standard Pólya urn after $n - \ell$ draws initialized with at least $a$ red, one blue, and $n - a - \ell - 1$ white balls; while $|C_k|$ behaves like the number of blue balls in the same urn.
By the same calculations as in the proof of Theorem 1, the probability of case (1) may be bounded by Similarly, the probability of case (2) satisfies lim sup by our choice $a = 2\log(\ell^2/\epsilon) + 1$. This concludes the proof of (3.6) and hence that of Theorem 5.

Proof of Theorem 6
We prove the lower bound for the expected number of camouflaging vertices by induction. To this end, fix a singleton $d$ and its parent $v$ in $T_\ell$. For $j \ge \ell$, let Observe that $E^{(v)}_{2\ell}$ is the event that $v$ is a camouflaging vertex. Consider the sequences j occurs and the vertex $j + 1$ is neither attached to $d$ nor to $d'$, or if $d$ is a singleton of $T_j$ and vertex $j + 1$ is attached to $v$. Thus Multiplying both sides by $j(j - 1)$, we get Summing over $j = \ell + 1, \ldots, 2\ell - 1$, which implies that Note that, for $j \in \{\ell + 1, \ldots, 2\ell - 1\}$, $\ge \exp(4\log\ell - 4\log j) > \frac{\ell^4}{(2\ell)^4} = \frac{1}{16}$, and therefore Let $P_\ell$ be the set of vertices in $T_\ell$ that are parents of a singleton. Then which implies that $\mathbb{E}G_\ell \ge \frac{1}{64}\mathbb{E}|P_\ell|$. It remains to bound the expected number of singletons $\mathbb{E}|P_\ell|$ in the uniform random recursive tree $T_\ell$. Write $S_k = |P_k|$ and note that $S_k$ equals the number of parents of singletons in $T_k$.
When a new vertex is attached to the tree $T_k$, we lose one singleton if the new vertex is attached to the parent of a singleton. This happens with probability $S_k/k$. If the new vertex is attached to a singleton, then the number remains the same. If the new vertex is attached to some vertex that is neither a leaf nor a parent of a singleton, then the number of singletons also remains unchanged. Finally, if the new vertex is attached to a leaf that is not a singleton, the number of singletons increases by 1. Thus, denoting the number of leaves of $T_k$ by $L_k$,
$$\mathbb{E}[S_{k+1} \mid T_k] = S_k - \frac{S_k}{k} + \frac{L_k - S_k}{k}.$$
Taking expectations and using the fact that $\mathbb{E}L_k = k/2$, we have that $\mathbb{E}S_\ell = \ell/6$. Summarizing, the expected number of camouflaging vertices satisfies
$$\mathbb{E}G_\ell \ge \frac{1}{64} \cdot \frac{\ell}{6} = \frac{\ell}{384}.$$
We prove the second inequality of Theorem 6 using the bounded differences inequality of McDiarmid [15] (see also [2, Theorem 6.2]).
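The value $\mathbb{E}S_\ell = \ell/6$ obtained from the case analysis above can be checked exactly for small $\ell$ by enumerating all $(\ell - 1)!$ equally likely recursive trees. The sketch below is our own sanity check with hypothetical function names; it counts vertices that have exactly one child, that child being a leaf. Based on this enumeration, the identity holds from $\ell = 3$ on (for $\ell = 2$ the count is always 1).

```python
from fractions import Fraction
from itertools import product


def count_singleton_parents(parents):
    """Number of parents of singletons in a recursive tree rooted at vertex 1.

    parents[i] is the vertex to which vertex i + 2 was attached.
    """
    children = {}
    for child, par in enumerate(parents, start=2):
        children.setdefault(par, []).append(child)
    count = 0
    for kids in children.values():
        # Exactly one child, and that child has no children of its own.
        if len(kids) == 1 and kids[0] not in children:
            count += 1
    return count


def expected_singleton_parents(ell):
    """Exact E S_ell, averaging over all (ell - 1)! recursive trees."""
    choices = [range(1, i) for i in range(2, ell + 1)]  # vertex i picks from 1..i-1
    total = 0
    trees = 0
    for parents in product(*choices):
        total += count_singleton_parents(parents)
        trees += 1
    return Fraction(total, trees)


if __name__ == "__main__":
    for ell in range(3, 8):
        print(ell, expected_singleton_parents(ell))
```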
Observe that given $T_\ell$, there is a bijection between the set of recursive trees of size $2\ell$ containing $T_\ell$ as a subgraph and the set $\mathcal{S} = [\ell] \times \cdots \times [2\ell - 1]$. The bijection is simply given by associating the vector $\kappa = (a_{\ell+1}, \ldots, a_{2\ell})$ with the recursive tree $T(\kappa)$ in which each vertex $k \in [\ell + 1, 2\ell]$ is attached to the vertex $a_k$, starting from $T_\ell$ until obtaining $T_{2\ell}$. Then we may consider the set $\mathcal{S}$ as the set of recursive trees with $2\ell$ vertices that contain $T_\ell$ as a subtree.
Importantly, the components of $\kappa$ that represent the uniform random recursive tree $T_{2\ell}$ are independent random variables.
Given $T_\ell$, consider the function $g : \mathcal{S} \to \mathbb{R}$ such that $g(T_{2\ell})$ is the number of camouflaging vertices.
By the bounded differences inequality, it suffices to show that, given $T, T' \in \mathcal{S}$, if $T$ and $T'$ differ by exactly one coordinate, then $|g(T) - g(T')| \le 2$.
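For reference, the following display spells out how a bounded difference of 2 translates into a tail bound. It uses the standard one-sided form of the bounded differences inequality; the symbols $X_i$ and $c_i$ are our notation, with $X_i = a_{\ell+i}$ the $\ell$ independent attachment choices and $c_i = 2$ the sensitivity of $g$ to each of them.

```latex
% Bounded differences inequality: if X_1, ..., X_m are independent and
% changing the i-th coordinate changes g by at most c_i, then for t >= 0
\[
  \mathbb{P}\bigl\{ g - \mathbb{E} g \le -t \bigr\}
  \le \exp\!\left( - \frac{2t^2}{\sum_{i=1}^{m} c_i^2} \right).
\]
% Here m = \ell and c_i = 2 for every i, so that
\[
  \mathbb{P}\bigl\{ G_\ell \le \mathbb{E} G_\ell - t \bigr\}
  \le \exp\!\left( - \frac{2t^2}{4\ell} \right)
  = \exp\!\left( - \frac{t^2}{2\ell} \right).
\]
```

The bound holds conditionally on $T_\ell$, since the independence of the coordinates of $\kappa$ is what the inequality requires.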
Devroye's proof is based on representing $L_{k,n}$ as a sum of $(k + 1)$-dependent indicator random variables and on a central limit theorem of Hoeffding and Robbins [11] for such sums. In this paper we need a non-asymptotic version of Devroye's theorem. Quantitative, Berry-Esseen-type versions of the
Proposition 1. If $L_{k,n}$ denotes the number of vertices with $k$ descendants in a uniform random recursive tree on $n > k + 1$ vertices, then for all $t > 0$,
$$\mathbb{P}\{L_{k,n} \ge \mathbb{E}L_{k,n} + t\} \le \exp\left( \frac{-8t^2(k + 2)}{25\left(n + (k + 1)(k + 2)t/3\right)} \right)$$
and
$$\mathbb{P}\{L_{k,n} \le \mathbb{E}L_{k,n} - t\} \le \exp\left( \frac{-8t^2(k + 2)}{25n} \right).$$
Note that the number of vertices with at least $k$ descendants, $M_{k,n} = \sum_{i=k}^{n-1} L_{i,n} = n - \sum_{i=0}^{k-1} L_{i,n}$, has expected value $\mathbb{E}M_{k,n} = \mathbb{E}$