Consistent Estimation in General Sublinear Preferential Attachment Trees

We propose an empirical estimator of the preferential attachment function $f$ in the setting of general preferential attachment trees. Using a supercritical continuous-time branching process framework, we prove the almost sure consistency of the proposed estimator. We perform simulations to study the empirical properties of our estimators.


Introduction
After the conception of the scale-free phenomenon in Barabási's series of seminal work ; Barabási et al. ( , 2000), scientists from numerous disciplines have made discoveries that support the ubiquity of realworld scale-free networks. Together with the notion of small-world networks, these discoveries mark the emergence of network science at the turn of the century. Preferential attachment models, which received their modern conception in , have become popular, because they are one of the few generative models that give rise to scale-free behavior.
Consider the following dynamical network model. The network starts at stage n = 2 with two nodes v 1 and v 2 connected by a single edge. Next it evolves recursively by adding nodes v 3 , v 4 . . ., which each connect by a single edge to a single node of the existing network. The incoming nodes choose the node to which they connect by a probabilistic mechanism. Given the network with nodes v 1 , . . . , v n having degrees d 1 (n), . . . , d n (n), the node v n+1 connects to the existing node v i ∈ {v 1 , . . . , v n } with probability proportional to f d i (n) , i.e., with probability .
Here f : N + → R + is a given function, which we call the preferential attachment (pa) function, and we refer to f (d i (n)) as the preference for node v i at stage n.
The pa function f is typically assumed to be non-decreasing, so that nodes of higher degrees inspire more incoming connections. This explains the name preferential attachment model. After the incoming node v n+1 has made its choice, the network evolves to the stage of n + 1 nodes and the scheme repeats with the set of existing nodes {v 1 , . . . , v n+1 } and the incoming node v n+2 . The recursive procedure may be repeated to reach any number of nodes. The empirical degree distribution P k (n) is defined as the proportion of nodes of degree k at time n: P k (n) = 1 n i∈{1,...,n} In the case that f is affine with f (k) = k + δ, it is well known that P k (n) → p k almost surely, as n → ∞, for any fixed k, for a limit p k that satisfies p k ∼ c δ k −3−δ as k → ∞, for some constant c δ (where ∼ means that the quotient of the two sides tends to 1; see Móri (2002); van der Hofstad (2017)). Thus in this scenario the limiting degree distribution follows a power law with exponent 3+δ, and the scale-free phenomenon occurs. In particular, for f (k) = k the exact asymptotic degree distribution can be worked out to be p k = 4/(k(k +1)(k +2)), and the power-law exponent is 3 (Móri (2002); Bollobás et al. (2001)).
For pa functions f that are not affine, there are roughly two possible cases: super-and sublinear. If f grows faster than linearly (the super-linear case; more precisely if k 1/f (k) < ∞), then one node will function as a hub (or a star) that connects to a large fraction of all nodes, if not virtually to all the nodes (Oliveira and Spencer (2005); Krapivsky and Redner (2001)). In this case the continuous-time branching process that describes the pa tree (see Section 3) no longer has a Malthusian parameter, and the precise behavior of the pa model is not well understood.
In this paper we focus on the case of sublinear pa functions f . This includes the affine case, but also strictly sublinear cases, which lead to a variety of possible limiting degree distributions. Next to power laws these include for instance power-laws with exponential truncation, which arise when the pa function f (k) becomes constant for large enough k. In general, the limiting degree distribution may be much lighter tailed than in the affine case, corresponding to rarer occurrence of nodes of high degrees (see Rudas et al. (2007)). Such scenarios have been reported frequently in empirical work on real-world networks.
The paper is concerned with the problem of statistical estimation of a pa function from an observed realization of a network. Most empirical work to date has focused on estimating a power law exponent, which is presumed to describe the data. However, despite the seeming omni-presence of power laws, studying them is often problematic. The numbers of nodes of high degrees in an observed network usually exhibit large variations and irregular behavior as a function of the degree. This is to be expected, as they are rare, but makes it hard to determine whether there is a power law at all. In applications where an affine pa model and ensuing power law are in doubt, it might be wise to estimate the pa function f using a more general model instead. In this paper we focus on the set of general sublinear pa functions. Besides being of interest in itself, fitting a general sublinear function will also shed light on the fit of an affine function, as the general estimate of f may or may not resemble an affine function. In other words, estimating f in our nonparametric setting will also help validate the modelling of power laws in pa networks.
The main contribution of our work is to propose an empirical estimator of a general pa function, and to prove its consistency. Statistical estimation in a pa network is not a conventional statistical problem that admits a standard asymptotic analysis, in two ways. First, although Markovian, the growth of the tree in the pa scheme (i.e., each transition of the Markov process) depends on the full history of the evolution, where the dependence on the history may be long range. Bubeck et al. (2015) even show that the influence of the so-called seed graph-the initial configuration from which the preferential attachment graph starts to grow-does not vanish as the network size tends to infinity. Second, for practical purposes it is desirable to base a statistical estimator on the current day snapshot of the network and not on the evolution of the tree, as this will often not be observed. Thus we observe only the last realization of the Markov chain. Perhaps the most surprising part in our study is that indeed one can consistently estimate a pa function from this final snapshot.
The main mathematical tool of the paper is the theory of branching processes, as introduced first in this context by Rudas et al. (2007). Given nodes v 1 , v 2 , . . . , v n with degrees (d i (n)) n i=1 and preferences f (d i (n)) n i=1 , we need 3982 F. Gao et al. the total preference n i=1 f (d i (n)) to normalize the multinomial distribution on the nodes. In the affine case that f (k) = k + δ, the total preference is deterministic and takes the form n i=1 (d i (n) + δ) = nδ + 2n. This property allows to study the limiting degree distribution with simple recursions on the degree evolution, and is also handy for the study of statistical estimators, as shown in Gao and van der Vaart (2017). However, in the case of general attachment functions, this ceases to hold and the total preference is an involved random quantity that depends on the entire history of the network evolution. Rudas et al. (2007) overcome this difficulty by embedding the pa model in a continuous-time branching-process framework, in which each individual has children according to a pure birth process with birth rate f (k), if k is the current number of children. This embedding takes care of the normalization by the total preference. In the continuous-time dynamics the size of the network is random at any given time, but the process reduces to the pa network at the stopping times where the tree reaches a given size. Rudas et al. (2007) apply results from the classical work of branching processes dating back to the 1970s and 1980s (see Jagers (1975);Nerman (1981)) to prove that the empirical degree proportions P k (n) converge to the limits p k as in (13) almost surely, for any fixed k. Bhamidi (2007) exploits this continuous-time embedding to analyze various statistics associated with different models of preferential attachment and is a nice reference for related subjects. We use similar arguments to derive the asymptotic consistency of our estimator. This paper is organized as follows. In Section 2 we give the intuition behind the estimator and present our main result on its consistency. Section 3 introduces the terminology of branching processes and gives a random tree model that is equivalent to the evolution of pa networks. We prove the main consistency result in Section 4. In the last section, we present a simulation study on the performance of the proposed empirical estimators in different settings and discuss our observations. Several simulation studies are carried out in order to uncover a more detailed picture of the properties of the empirical estimator, the most interesting one being that the estimator seems to be asymptotically normal with a √ n rate.

Construction of the Empirical Estimator and Main Result
The goal of this paper is to provide an estimate of the pa function upon observing the pa tree. In this section, the empirical estimator is first derived assuming knowing the degrees of nodes that have gotten attached in the history (and we do not label the nodes by the order of births), but as it will turn out in the later part of the section, the proposed estimate only depends on the degree distribution of the final snapshot, with no need of any historical information. Suppose we have a pa tree of n nodes and n is large enough so that the limiting degree distribution (p k ) ∞ k=1 is "close" to the empirical degree distribution (P k (n)) ∞ k=1 . Suppose a new node comes in and needs to pick an existing node to attach to according to the pa rule associated with the pa function f . Let N k (n) denote the number of nodes of degree k (which is close to np k in the limiting regime). Then, the probability of choosing an existing node of degree k is We are interested in the quantity f (k) for each k ≥ 1. However, f is only identifiable up to scale and the denominator on the right hand side of the above display ∞ j=1 f (j)p j is a constant only depending on the pa function f . If we multiply the above display by an extra factor n/N k (n) ≈ 1/p k , then we get a rescaled version of f (k) where P j (n) = N j (n)/n is the proportion of nodes of degree j among all the n nodes. Henceforth, we define r k as the rescaled version of f (k) for each k and summarize the above heuristic as Here we note that while k → f (k) is not uniquely defined (as multiplying by a non-zero constant gives rise to the same attachment rules), the quantity r k is unique as it is normalized such that k r k p k = 1. We aim for an empirical estimator that mimics the above equation. This estimation, furthermore, should also work in the non-limiting regime. Note that n/N k (n) is always readily available in any network, so it suffices to estimate the probability of the incoming node choosing an existing node of degree k. This probability can be estimated by counting the number of times that the incoming node chooses an existing node of degree k during the evolution of the pa network.
Let us now work the above heuristic out in a more formal way. Let N →k (n) denote the number of times that a node is attached to a node of degree k. Further, denote the number of nodes of degree k in the pa network at time n by N k (n). The empirical estimatorr k (n), which hereafter will be abbreviated as the ee, is then defined asr Let N >k (n) = j>k N j (n) denote the number of nodes of degree strictly larger than k at time n. For the pa networks considered here, we have the following crucial observation: Proof. Observe that N →k (n) counts the number of times that the incoming node chooses an existing node of degree k to connect to up to time n. Note that, if a node was chosen to be connected to as a node of degree k at some point before time n, its degree at time n is at least k + 1. On the other hand, we notice the node degree may only jump from 1 to 2, 2 to 3, . . . , k to k + 1, etc. Therefore, if a node has degree strictly larger than k, it must have been chosen to be connected to as a node of degree k at some time. This gives the equality as in the statement of the lemma.
In the light of the above observation, we note that (1) is equivalent tô We give the main result of this paper in the following theorem, which applies to pa functions satisfying the following condition. Given a function f : Then the theorem assumes that the range of ρ f contains an open neighborhood of 1. In Section 3 below we note that this condition is satisfied by most sublinear pa functions f , whereas for pa functions that increase faster than linear ρ f (λ) will typically be infinite for every λ > 0 and the condition fails. Theorem 2. If the range of the function ρ f attached to the true pa function f contains an open neighborhood of 1, then the estimatorr k (n) defined in (2) is consistent almost surely, i.e., for any k, where a.s.
The proof of the theorem is deferred to Section 4. The estimator (2) can be computed from the network as observed at time n, without needing access to the evolution of the network up to this time. This is important when modelling a real-world network as a pa network, because in real-world applications it is often difficult, expensive, or even impossible to recover the evolution history of a network.
The form of the empirical estimatorr k (n) in (2) sparks some philosophical considerations. Consider the degree of a node as a measure of its wealth, i.e., the more neighbors, the richer. Suppose that a node of degree k asks how likely it is to get richer, i.e., to receive an extra connection. This is equivalent to asking for an estimate of f (k). The estimator (2) counts the number of the richer nodes N >k (n) and the number of nodes at the same level N k (n), and returns the quotient of these numbers as an estimate of f (k) (up to normalization). If you live in the world of these nodes and wonder about your chance of moving up, then you might naturally come up with the aforementioned ratio. The higher the number of people above you relative to the number of people sharing your rank, the better your chance to move up.

Borrowing strength from branching processes
In this section we introduce the terminology needed to reformulate the pa model in the language of branching processes, similar to Rudas et al. (2007). As the pa function is no longer affine, the conventional (and somewhat elementary) techniques, e.g., the martingale method as in (van der Hofstad, 2017, Chapter 8), do not work anymore. However, the supercritical branching processes observed at the random stopping times where their size is fixed, designed first by Rudas et al. (2007), are (almost) equivalent to the pa trees, which in turn enables us to study the pa trees with the well-established results of general branching processes.

Rooted ordered tree
The pa network is a rooted ordered tree, which can be described as an evolving genealogical tree, where the nodes are individuals and the edges are parent-child relations. The usual notation for the nodes are ∅ for the root of the tree and l-tuples (i 1 , . . . , i l ) of positive natural numbers i j ∈ N + for the other nodes (l ∈ N + ). The children of the root are labeled (1), (2), . . ., and in general x = (i 1 , . . . , i l ) denotes the i l -th child of the i l−1 -th child of · · · of the i 1 -th child of the root. Thus the set of all possible individuals is For x = (x 1 , . . . , x k ) and y = (y 1 , . . . , y l ) the notation xy is shorthand for the concatenation (x 1 , . . . , x k , y 1 , . . . , y l ), and, in particular, xl = (x 1 , . . . , x k , l).
Since the edges of the tree can be inferred from the labels of the nodes (i 1 , . . . , i l ), a rooted ordered tree can be identified with a subset G ⊂ I . (Not every subset corresponds to a rooted ordered tree, as the labels need to satisfy the compatibility conditions that for every (x 1 , . . . , x k ) ∈ G we have (x 1 , . . . , x k−1 ) ∈ G (parent must be in tree) as well as (x 1 , . . . , x k − 1) ∈ G if x k ≥ 2 (older sibling must be in tree).) The set of all finite rooted ordered trees is denoted by G. In this terminology and notation the degree of node x ∈ G that is not the root is the number of its children in G plus 1 (for its parent), given by deg(x, G) = |{l ∈ N + | xl ∈ G}| + 1.

Branching process
The evolution in time of the genealogical tree is described through stochastic processes ξ x (t) t≥0 , one for each individual x ∈ I . The random point process ξ x on [0, ∞), given the ages of the parent x at the births of its children, describes the node x giving births to its children. The birth time σ x of individual x in calendar time is defined recursively, by setting σ ∅ = 0 (the root is born at time zero) and Thus the l-th child of x is born at the birth time of x plus the time of the l-th event in ξ x . It is assumed that the birth processes ξ x for different x ∈ I are iid. This is the defining property of a continuous-time branching process. Formally, we may define all processes ξ x on the product probability space where every (Ω x , B x , P x ) is an independent copy of a single probability space (Ω 0 , B 0 , P 0 ) and every ξ x is defined as ξ x (ω) = ξ(ω x ) if ω = (ω x ) x∈I ∈ Ω, for ξ a given point process defined on (Ω 0 , B 0 , P 0 ). We identify the point process ξ with the process ξ(t) giving the number of points in [0, t], for t ≥ 0, and write μ(t) = E[ξ(t)] for its intensity measure, which is often called the reproduction function in this context.
Besides the reproductive process ξ x we also attach a random characteristic φ x to every individual x ∈ I . This is also a stochastic process φ x (t) t≥0 , which we take non-negative, measurable and separable. For simplicity, we define φ x (t) = 0 for t < 0. We then proceed to define If φ x (t) is viewed as a characteristic of individual x when x has age t, then the variable φ x (t − σ x ) is the characteristic of individual x at calendar time t, and Z φ t is the sum of all such characteristics over the individuals that are alive at time t (i.e., individuals x for which σ x ≤ t).
The characteristics φ x are assumed independent and identically distributed for different individuals x, as the reproductive processes. Formally this may be achieved by defining φ x (ω) = φ(ω x ) if ω = (ω x ) x∈I ∈ Ω, for a given stochastic process φ on (Ω 0 , B 0 , P 0 ). This allows the two processes ξ x and φ x attached to a given individual to be dependent. In fact, we shall be interested in the choices, for a given natural number k, For the first characteristic the variable Z φ t = Z 1 t is equal to the total number of individuals born up to time t, for the second it equals the total number of those individuals with exactly k − 1 (and hence of degree k), and for the second more than k − 1 children at time t (hence of degree > k). As we will soon see, we are interested in the degree distribution of the network and the degree of a node is defined to be its number of children (ξ(t)) plus one, this is why k − 1 appears instead of k.
The following combines Theorems 5.4 and 6.3 of Nerman (1981) (also compare Theorem A from Rudas et al. (2007)). Proposition 3. Consider a supercritical, Malthusian branching process with Malthusian parameter λ * as in (7) and satisfying (8), and two associated bounded characteristics φ and ψ. Then there exists a random variable Y ∞ depending on ξ only such that almost surely on the event that the total population size Z 1 t tends to infinity, as t → ∞, t 0 e −λs ξ(ds), then Y ∞ can be taken strictly positive, and almost surely on the event that the total population size Z 1 t tends to infinity, as t → ∞. The convergence (10) is true, more generally, if ∞ 0 e −λt μ(dt) < ∞, for some number λ < λ * .
Proof. Assertion (9) follows by Theorem 5.4 of Nerman (1981). Note that Condition 5.1 of the latter theorem is satisfied because of (8) (see (5.7) of Nerman (1981)), while Condition 5.2 follows by the assumed boundedness of φ and ψ. By Proposition 1.1 and (3.10) in the same reference, the convergence of the integral E λ * [ξ(∞) log + λ * ξ(∞)] implies that the variable Y ∞ is strictly positive almost surely on the event that the total population size Z 1 t tends to infinity. Then (10) follows by applying (9) to both e −λ * t Z φ t and e −λ * t Z ψ t , and taking the quotient. The last assertion of the proposition follows from Theorem 6.3 of Nerman (1981).

The continuous random tree model
To connect back to the pa model, given a pa function f , we now define the process ξ as a pure birth process with birth rate equal to f (ξ(t) + 1), i.e., the continuous-time Markov process with state space being the non-negative integers and the only possible transitions given by The genealogical tree is then also a Markov process on the state space G. The initial state of the process is the root {∅} of the tree, and the jumps of this process correspond to an individual x ∈ G giving birth to a child, which is then incorporated in the tree as an additional node. In the preceding notation this means that the process can jump from a state G to a state of the form G ∪{xk}, where necessarily x ∈ G and k = deg(x, G) is the number of children that x already has in the tree plus 1. This jump is made with rate f (deg(x, G)), since according to (11) with ξ = ξ x the individual x gives birth to a new child with rate f (k) if x already has k − 1 children. The description in terms of rates means more concretely that given the current state G, the Markov process can jump to the finitely many possible states G ∪ {xk}, x ∈ G and k = deg(x, G), and it chooses between these states with probabilities Furthermore, the waiting time in state G to the next jump is an exponential variable with intensity equal to the total preference The continuous-time scale of the process is not essential to us, but it is convenient for our calculations. We shall use that when t → ∞ the continuous-time tree visits the same states (trees) as the pa model, and taking limits as t → ∞ is equivalent to taking limits in the pa model as the number of nodes increases to infinity almost surely. In order to apply Proposition 3 in our setting we need to verify its conditions on the birth process ξ and the reproduction function μ(t) = E[ξ(t)], and determine the Malthusian parameter. The events of the pure birth process (11) can be written as T 1 < T 1 +T 2 < T 1 +T 2 +T 3 < · · · , where (T k ) ∞ k=1 are independent random variables exponentially distributed with rates (f (k)) ∞ k=1 . The total number of births ξ(t) = 1 (0,t] (u) ξ(du) at time t is equal to ∞ l=1 1 (0,t] (T 1 + · · · + T l ), which clearly tends to infinity almost surely as t → ∞. Furthermore, we have for ρ f defined in (3). The Malthusian parameter λ * is defined as the argument where the function ρ f in the display equals one. The terms of the series defining this function are nonnegative, and strictly decreasing and convex in λ.
Hence if ρ f is finite for some λ > 0, then it is finite and continuous on an interval (λ, ∞) and tends to zero as λ → ∞, by the dominated convergence theorem.
If lim λ↓λ ρ f (λ) > 1, then λ * exists and is interior to the interval (λ, ∞). It is shown in Lemma 1 on page 200 of Rudas et al. (2007) that in this case the associated birth process ξ satisfies E[ λ * ξ(∞)] 2 < ∞ as soon as f (k) → ∞. Thus all conditions of Proposition 3 are satisfied if lim λ↓λ ρ f (λ) > 1, and this is equivalent to 1 being an inner point of the range of ρ f .
We study the Malthusian processes in the following two cases of pa functions, so as to illuminate the related assumptions.
To see that ρ f is finite on (0, ∞), note that ρ f is increasing in f , so that it suffices to prove the claim in the special case that f (k) = (k + δ) β . Since For f (k) = (k + δ) β , we can choose K = λ 1/β and then find that the right side is bounded above by a multiple of e −λ(l+δ) 1−β /(2−2β) . This sums finitely with respect to l, and hence ρ f (λ) < ∞.
When applied to the root node x = ∅, the right side of formula (5) gives the degree plus 1 and not the degree, as the root v 1 in the pa model in Section 1 does not have a parent. The preceding branching model then can be viewed as the description of an alternative pa model in which the root is also considered to have a parent (say v 0 ), whom, however, will never give birth to other children than the initial one. This approach is followed by Rudas et al. (2007). Alternatively, to match the continuous-time branching process exactly to the original description of the pa model, the birth process ξ ∅ of the root should be defined by (11) but with f (k)dt + o(dt) instead of f (k + 1)dt + o(dt) on the right side. Intuitively, this replacement makes little difference for our asymptotics. However, as the birth processes will then not be identically distributed, a direct reference to Nerman (1981) will be impossible. Below we solve this by running two separate, independent branching processes from each of the two starting nodes v 1 and v 2 .

Consistency of the Empirical Estimators
For completeness, we present a result without proof from Rudas et al. (2007) giving the limiting degree distribution of the pa model. Proposition 4. If the range of the function ρ f attached to the true pa function f contains an open neighborhood of 1, then as n → ∞, the empirical degree distribution P k (n) converges almost surely for any k to some limit p k , i.e., where the empty product is defined to be 1, so that p 1 = λ * / λ * + f (1) . It follow from (13) Proof of Theorem 2. We embed the evolution of the pa model within the continuous time branching process framework described in Section 3. As explained at the end of the latter section a straightforward embedding gives a slightly different pa model, in which the degree of the root node is counted one higher than in the original pa model. We first give the proof for the adapted pa model, and next explain how this proof can be adapted to treat the original pa model. The degree deg(x, t) of node x in the continuous time branching tree at time t relates to its associated reproductive process as deg(x, t) = ξ x (t − σ x ) + 1. Therefore the total number of nodes of degree strictly larger than k present in the tree at time t, and the total number of nodes of degree k are given by These are the processes Z φ t and Z ψ t corresponding to the characteristics φ(t) = 1 {ξ(t)+1>k} and ψ(t) = 1 {ξ(t)+1=k} , respectively. It follows that almost surely, as t → ∞, by the second assertion of Proposition 3. When evaluated at the (random) time t such that the total population size Z 1 t is equal to n, the left side of the display gives the empirical estimator (2). Since these random times tend to infinity almost surely as n → ∞, we conclude that the empirical estimator converges almost surely to the right side of the preceding display. To complete the proof we must identify this right side as for h k the density of T 1 + · · · + T k , we have by Fubini's theorem (or partial integration), Furthermore, writing P ξ(t) = k − 1 as the difference of the preceding with k − 2 and k − 1, we obtain At λ = λ * the right hand side of (17) is the same as the limiting proportion p k in (13), while the right hand side of (16) can be seen to be p >k : = j>k p j , with the help of Fubini's theorem. Therefore, their quotient is r k by the succeeding Lemma 5. This concludes the proof for the adapted pa model in which the root has degree 1 higher than in the original pa model. The original model starts with two connected nodes v 1 and v 2 , which initially both have degree 1. We start two independent branching processes of the type described in Section 3 and attach these to the nodes v 1 and v 2 as their roots, thus forming a single tree. This union of the two processes evolves as the original pa model. If Z φ t,i and Z ψ t,i , for i ∈ {1, 2}, are the processes counting the number of nodes of degree strictly larger than k and equal to k at time t in the two branching processes, then Z φ t,1 + Z φ t,2 and Z ψ t,1 + Z ψ t,2 are the number of such nodes in the union of the two branching processes, and almost surely, as t → ∞, by the first assertion of Proposition 3. The almost sure positivity of the variables Y i,∞ allows to cancel them from the right side, whence the limit is the same as before. These plots suggest the following observations: The estimator is consistent, as our theorem shows. The quality improves when we have more nodes, hence more observations. For a fixed number of nodes, the quality of the estimator deteriorates fast when k increases, exemplified by the substantial variability compared to the ground truth. Even if the ee has a large variance for large k's, the sample median ofr k in each degree k is still remarkably close to the truth. For a fixed number of nodes it appears that, when the ee makes larger errors, it is overestimating.
The ee is not automatically monotone. However, we can slightly modify the estimator so that it is still consistent but always gives monotone results (cf. Chernozhukov et al. (2009)).
To summarize, the estimator performs as proven in our main result in Theorem 2, but the exact performance depends on the true pa function and the degrees of interest-if the true pa function increases slowly with respect to the degree, then it is easier to estimate the preference of low degrees, and harder to estimate the preference of high degrees and vice versa.

Sample variance study
We again run 1000 simulations of trees with pa functions of f (1) , f (2) , f (3) , but now only simulate networks of size 1,000,000. We apply the ee to each simulated network and calculate the sample variance of the 1000 estimates for each given degree up to 70, if the estimate is defined. The sample variances are plotted against the degrees in Figure 2. Denote the sample variance of the ee for degree k by s k . Inspection of these plots reveal the following: It appears that log s k grows polynomially with respect to log k. For the affine pa function f (1) , it looks like log s k is an affine function of log k. The sample variance s k characterizes, to a certain extent, the difficulty of estimating r k : -For small k's, we see that s (3) k < s (2) k < s (1) k . -Then at about k = 17, the blue line of s (3) k first crosses the green line of s (2) k , i.e., s (2) k < s (3) k < s (1) k . -The blue line of s (3) k crosses the red line of s (1) k at around k = 18. This means s (2) k < s (1) k < s (3) k . -The green line of s (2) k crosses the red line of s (1) k at approximately k = 35, so from that point on s (1) k < s (2) k < s (3) k . On the one hand, for small k's, the slower f grows with k, the easier it is to estimate r k . This is reflected in the observation that slower f yield lower sample variance for small values of k. On the other hand, for large k's, the faster f grows with k, the easier it is to estimate r k . The shapes of curves of k → s k for different f 's seems to indicate that the faster f grows with k, the slower log s k grows with log k. The seemingly affine relations might be a consequence of the limiting power-law distribution, but it is unclear to us how these relate precisely.
The above observations seem intuitive because for f that grows fast, there are more nodes of high degrees, and this can be expected to yield better results in estimating the preferences of higher degrees. However, as the total number of nodes is fixed, more nodes of high degrees mean fewer nodes of low degrees. This results in larger variances in estimating r k for small k's.

Rate and asymptotic normality
We may wonder what is the asymptotic distribution ofr k (n) is, for any fixed k and after proper rescaling, when n → ∞. To answer this question, we discuss some simulation results here.
We fix the number of nodes to be 1,000,000 in all simulated networks for each pa function. Then we look at the ee's in each simulation. For each f , we plot the qq-plot of each estimator for k = 2, 3, 4 against the normal distribution. The results are summarized in Figure 3. Since the number of nodes is one million, we expect that the limiting distribution should have kicked in, assuming there is indeed a limiting distribution. The qq-plots indeed indicate that the ee's have asymptotic normal distributions.
We conjecture that, for fixed k, where σ 2 k only depend on f and k and d − → denotes convergence in distribution. To validate this conjecture, we perform the following simulation study. We fix the pa function to be f (2) and run 1,000 simulations for each of the three different network sizes 10,000, 100,000 and 1,000,000 and study the estimators of the preference on degree 2 only. If (19) is true, then the distribution ofr 2 (n) should stabilize after rescaling them with √ n. We summarize the results in Figure 4. The density estimations on the data of √ n(r 2 (n) − r 2 ) for the three different values of n are displayed in Figure 5. The fact that both the sample variances, Fig 3. QQ-Plots ofr 2 (n),r 3 (n) andr 4 (n) with n = 10 6 for f (1) , f (2) and f (3) The rows correspond to the pa functions f (1) , f (2) and f (3) , respectively. The columns correspond to the degree k = 2, 3, 4, on which we conduct our ee study, respectively. histograms and density plots look rather stable after the √ n-rescaling provides further evidence towards the conjecture in (19).

Discussion and open problems
In this paper, we have proposed an empirical estimator for the preferential attachment function in the setting of preferential attachment trees with (sub)linear preferential attachment functions. We rely on an embedding result that views the preferential attachment model as a continuous-time branching process observed at the moments where the branching process has a fixed size. We now discuss some open problems.
Beyond the tree setting. It would be of interest to extend our analysis to settings where nodes enter the network with more than one edge. This corre-  sponds to general preferential attachment graphs. In this case, the embedding results no longer hold, making the analysis substantially harder.
The limits of consistency. Our main result in Theorem 2 shows that the ee is consistent for every fixed k. For which k = k n → ∞ does consistency hold, and how does this depend on the pa function f ? For which norms on f does consistency hold?
Asymptotic normality. Figures 3, 4 and 5 suggest asymptotic normality of the ee for fixed values of k at the rate √ n. How can such a statement be proved? It will also be interesting to study the convergence for increasing values of k. We expect the variance σ 2 k in (19) to increase with k. This raises a problem of bias-variance trade-off when estimating the pa function for large degrees, much as in ordinary nonparametric estimation.