Fitting the Linear Preferential Attachment Model

Preferential attachment is an appealing mechanism for modeling power-law behavior of the degree distributions in directed social networks. In this paper, we consider methods for fitting a 5-parameter linear preferential attachment model to network data under two data scenarios. In the case where the full history of the network formation is given, we derive the maximum likelihood estimator (MLE) of the parameters and show that it is strongly consistent and asymptotically normal. In the case where only a single-time snapshot of the network is available, we propose an estimation method which combines the method of moments with an approximation to the likelihood. The resulting estimator is also strongly consistent and performs quite well compared to the MLE. We illustrate both estimation procedures through simulated data, and explore the usage of this model in a real data example.


Introduction
The preferential attachment mechanism, in which edges and nodes are added to the network based on probabilistic rules, provides an appealing description for the evolution of a network. The rule for how edges connect nodes depends on node degree; large-degree nodes attract more edges. The idea is applicable to both directed and undirected graphs and is often the basis for studying social networks, collaborator and citation networks, and recommender networks. Elementary descriptions of the preferential attachment model can be found in [5] while more mathematical treatments are available in [2,4,21]. Also see [10] for a statistical survey of methods for network data, [18] for consideration of statistics of an undirected network and [24] for asymptotics of directed exponential random graph models. Limit theory for estimates of an undirected preferential attachment model was considered in [6].
For many networks, empirical evidence supports the hypothesis that in- and out-degree distributions follow a power law. This property has been shown to hold in linear preferential attachment models, which makes preferential attachment an attractive choice for network modeling [3,4,11,12,21]. While the marginal degree power laws in a simple linear preferential attachment model were established in [3,11,12], the joint regular variation (see [15,16]), which is akin to a joint power law, was only recently established [17,19]. In addition, it was shown in [22] that the joint probability mass function of the in- and out-degrees is multivariate regularly varying. This is a key result as the degrees of a network are integer-valued.
In this paper, we discuss methods of fitting a simple linear preferential attachment model, which is parametrized by $\theta = (\alpha, \beta, \gamma, \delta_{in}, \delta_{out})$. The first three parameters, $\alpha, \beta, \gamma$, correspond to probabilities of the three scenarios for adding an edge and hence sum to 1, i.e., $\alpha + \beta + \gamma = 1$. The other two, $\delta_{in}$ and $\delta_{out}$, are tuning parameters related to growth rates. The tail indices of the marginal power laws for the in- and out-degrees can be expressed as explicit functions of $\theta$ (see (2.5) and (2.6) below). The graph $G(n) = (V(n), E(n))$, where $V(n)$ is the set of nodes and $E(n)$ is the set of edges at the $n$th iteration, evolves based on postulates that describe how new edges and nodes are formed. This construction of the network is Markov in the sense that the probabilistic rules for obtaining $G(n+1)$ once $G(n)$ is known do not require prior knowledge of earlier stages of the construction.
The Markov structure of the model allows us to construct a likelihood function based on observing $G(n_0), G(n_0+1), \dots, G(n_0+n)$. After deriving the likelihood function, we show that it has a unique maximum at $\hat\theta = (\hat\alpha, \hat\beta, \hat\gamma, \hat\delta_{in}, \hat\delta_{out})$ and that the resulting maximum likelihood estimator is strongly consistent and asymptotically normal. The normality is proved using a martingale central limit theorem applied to the score function. The limiting distribution also reveals that $(\hat\alpha, \hat\beta, \hat\gamma)$, $\hat\delta_{in}$, and $\hat\delta_{out}$ are asymptotically independent. From these results, asymptotic properties of the MLE for the power law indices can be derived.
For some network data, only a snapshot of the nodes and edges is available at a single point in time, that is, only G(n) is available for some n. In such cases, we propose an estimation procedure for the parameters of the network using an approximation to the likelihood and method of moments. This also produces strongly consistent estimators. These estimators perform reasonably well compared to the MLE where the entire evolution of the network is known but predictably there is some loss of efficiency.
We illustrate the estimation procedure for both scenarios using simulated data. Simulation plays an important role in the process of modeling networks since it provides a way to assess the performance of model fitting procedures in the idealized setting of knowing the true model. Also, after fitting a model to real data, simulation provides a check on the quality of fit. Departures from model assumptions can often be detected via simulation of multiple realizations from the fitted network. Hence it is important to have efficient simulation algorithms for producing realizations of the preferential attachment network for a given set of parameter values. We adopt a simulation method, learned from Joyjit Roy, that was inspired by [1] and is similar to that of [20].
Our fitting methods are implemented in a real data setting using the Dutch Wiki talk network [14]. While one should not expect the simple 5-parameter (later extended to 7-parameter) linear preferential attachment model to fully explain a network with millions of edges, it does provide a reasonable fit to the tail behavior of the degree distributions. We are also able to detect important structural features in the network through fitting the model over separate time intervals.
Often it is difficult to believe in the existence of a true model, especially one whose parameters remain constant over time. Allowing, as we do, a preferential attachment model with only a few parameters and no possibility for node removal may seem simplistic and unrealistic for social network data. Of course, preferential attachment is only one mechanism for network formation and evidence for its use in fields outside data networks is mixed [8,9] and we restrict attention to linear preferential attachment. Even imperfect models have the potential to capture salient properties in the data, such as heavy-tailedness of the in-degree and out-degree distributions, and to identify departures from model assumptions. While maximum likelihood estimation is essentially the gold standard for cases when the underlying model is a good representation of the data, it may perform poorly in case the model is far from being appropriate. In forthcoming work, we consider a semi-parametric estimation approach for network models that exhibit heavy-tailed degree distributions. This alternative estimation methodology borrows ideas from extreme value theory.
The rest of the paper is structured as follows. In Section 2, we formulate the linear preferential attachment network model and present an efficient simulation method for the network. Sections 3 and 4 give parameter estimators when the full history is known and when only a single snapshot in time is available, respectively. We test these estimators against simulated data in Section 5 and then explore the Wiki talk network in Section 6.

Model specification and simulation
In this section, we present the linear preferential attachment model in detail and provide a fast simulation algorithm for the network.

The linear preferential attachment model
The directed edge preferential attachment model [3,12] constructs a growing directed random graph $G(n) = (V(n), E(n))$ whose dynamics depend on five non-negative real numbers $\alpha, \beta, \gamma, \delta_{in}$ and $\delta_{out}$, where $\alpha + \beta + \gamma = 1$ and $\delta_{in}, \delta_{out} > 0$. To avoid degenerate situations, assume that each of the numbers $\alpha, \beta, \gamma$ is strictly smaller than 1. We obtain a new graph $G(n)$ by adding one edge to the existing graph $G(n-1)$ and index the constructed graphs by the number $n$ of edges in $E(n)$. We start with an arbitrary initial finite directed graph $G(n_0)$ with at least one node and $n_0$ edges. For $n > n_0$, $G(n) = (V(n), E(n))$ is a graph with $|E(n)| = n$ edges and a random number $|V(n)| = N(n)$ of nodes. If $u \in V(n)$, $D^{(n)}_{in}(u)$ and $D^{(n)}_{out}(u)$ denote the in- and out-degree of $u$ respectively in $G(n)$. There are three scenarios that we call the $\alpha$, $\beta$ and $\gamma$-schemes, which are activated by flipping a 3-sided coin whose outcomes are 1, 2, 3 with probabilities $\alpha, \beta, \gamma$. More formally, we have an iid sequence of multinomial random variables $\{J_n, n > n_0\}$ with cells labelled 1, 2, 3 and cell probabilities $\alpha, \beta, \gamma$. Then the graph $G(n)$ is obtained from $G(n-1)$ as follows.
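• If $J_n = 1$ (with probability $\alpha$), append to $G(n-1)$ a new node $v \in V(n) \setminus V(n-1)$ and an edge $(v, w)$ leading from $v$ to an existing node $w \in V(n-1)$. Choose the existing node $w \in V(n-1)$ with probability depending on its in-degree in $G(n-1)$:
\[ P[\text{choose } w \in V(n-1)] = \frac{D^{(n-1)}_{in}(w) + \delta_{in}}{n - 1 + \delta_{in} N(n-1)}. \tag{2.1} \]
• If $J_n = 2$ (with probability $\beta$), add a directed edge $(v, w)$ to $E(n-1)$ with $v \in V(n-1) = V(n)$ and $w \in V(n-1) = V(n)$. The existing nodes $v, w$ are chosen independently from the nodes of $G(n-1)$ with probabilities
\[ P[\text{choose } (v, w)] = \left( \frac{D^{(n-1)}_{out}(v) + \delta_{out}}{n - 1 + \delta_{out} N(n-1)} \right) \left( \frac{D^{(n-1)}_{in}(w) + \delta_{in}}{n - 1 + \delta_{in} N(n-1)} \right). \]
• If $J_n = 3$ (with probability $\gamma$), append to $G(n-1)$ a new node $w \in V(n) \setminus V(n-1)$ and an edge $(v, w)$ leading from the existing node $v \in V(n-1)$ to the new node $w$. Choose the existing node $v \in V(n-1)$ with probability
\[ P[\text{choose } v \in V(n-1)] = \frac{D^{(n-1)}_{out}(v) + \delta_{out}}{n - 1 + \delta_{out} N(n-1)}. \tag{2.2} \]
Note that this construction allows the possibility of having self loops in the case where $J_n = 2$, but the proportion of edges that are self loops goes to 0 as $n \to \infty$. Also, multiple edges are allowed between two nodes.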

Power law of degree distributions
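Given an observed network with $n$ edges, let $N_{ij}(n)$ denote the number of nodes with in-degree $i$ and out-degree $j$. If the network is generated from the linear preferential attachment model described above, then from [3], there exists a proper probability distribution $\{f_{ij}\}$ such that almost surely
\[ \frac{N_{ij}(n)}{N(n)} \to f_{ij}, \qquad n \to \infty. \tag{2.3} \]
Since $N(n)/n \to 1 - \beta$ almost surely, this is equivalent to $N_{ij}(n)/n \to (1-\beta) f_{ij} =: p_{ij}$. Consider the limiting marginal in-degree distribution $p^{in}_i := \sum_j p_{ij}$. It is calculated from [3, Equation (3.10)] that
\[ p^{in}_0 = \frac{\alpha}{1 + a_1(\delta_{in})\,\delta_{in}}, \]
and for $i \ge 1$,
\[ p^{in}_i = \frac{\Gamma(i + \delta_{in})\, \Gamma(1 + \delta_{in} + a_1(\delta_{in})^{-1})}{\Gamma(i + 1 + \delta_{in} + a_1(\delta_{in})^{-1})\, \Gamma(1 + \delta_{in})} \left( \frac{\alpha \delta_{in}}{1 + a_1(\delta_{in})\,\delta_{in}} + \frac{\gamma}{a_1(\delta_{in})} \right), \]
where
\[ a_1(\lambda) := \frac{\alpha + \beta}{1 + \lambda(1 - \beta)}. \]
Hence, as long as $\alpha\delta_{in} + \gamma > 0$,
\[ p^{in}_i \sim C_{in}\, i^{-\iota_{in}}, \qquad i \to \infty, \tag{2.4} \]
for some finite positive constant $C_{in}$, where the power index is
\[ \iota_{in} := 1 + \frac{1 + \delta_{in}(\alpha + \gamma)}{\alpha + \beta}. \tag{2.5} \]
Similarly, the limiting marginal out-degree distribution has the same property: $p^{out}_j \sim C_{out}\, j^{-\iota_{out}}$ as $j \to \infty$, as long as $\gamma\delta_{out} + \alpha > 0$, for $C_{out}$ positive and
\[ \iota_{out} := 1 + \frac{1 + \delta_{out}(\alpha + \gamma)}{\beta + \gamma}. \tag{2.6} \]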
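As a quick numerical check of the tail indices, the following minimal sketch evaluates them for a given parameter vector. It assumes the standard expressions for this model, $\iota_{in} = 1 + (1 + \delta_{in}(\alpha+\gamma))/(\alpha+\beta)$ and $\iota_{out} = 1 + (1 + \delta_{out}(\alpha+\gamma))/(\beta+\gamma)$; the function name `tail_indices` is ours and purely illustrative.

```python
def tail_indices(alpha, beta, gamma, delta_in, delta_out):
    """Return (iota_in, iota_out), the marginal power-law indices.

    Assumes the formulas stated in the lead-in; `tail_indices` is a
    hypothetical helper name, not part of the paper.
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    if alpha + beta <= 0 or beta + gamma <= 0:
        raise ValueError("alpha + beta and beta + gamma must be positive")
    iota_in = 1.0 + (1.0 + delta_in * (alpha + gamma)) / (alpha + beta)
    iota_out = 1.0 + (1.0 + delta_out * (alpha + gamma)) / (beta + gamma)
    return iota_in, iota_out


iota_in, iota_out = tail_indices(0.3, 0.5, 0.2, 2.0, 1.0)
```

For $(\alpha, \beta, \gamma, \delta_{in}, \delta_{out}) = (0.3, 0.5, 0.2, 2, 1)$ this gives $\iota_{in} = 3.5$ and $\iota_{out} \approx 3.14$.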

Simulation algorithm
We describe an efficient simulation procedure for the preferential attachment network given the parameter values $(\alpha, \beta, \gamma, \delta_{in}, \delta_{out})$, where $\alpha + \beta + \gamma = 1$. The simulation cost of the algorithm is linear in time. This algorithm, which was provided by Joyjit Roy during his graduate work at Cornell University, is presented below for completeness. Note that this simulation algorithm is specifically designed for the case where the preferential attachment probabilities (2.1)-(2.2) are linear in the degrees. A similar idea for the simulation of the Yule-Simon process appeared in [20]. Efficient simulation methods for the case where the preferential attachment probabilities are non-linear are studied in [1], where their algorithm trades some efficiency for the flexibility to model non-linear preferential attachment.
Using the notation from the introduction, at time $t = 0$, we initiate with an arbitrary graph $G(n_0) = (V(n_0), E(n_0))$ of $n_0$ edges, where the elements of $E(n_0)$ are represented in the form $(v^{(1)}_i, v^{(2)}_i)$, $i = 1, \dots, n_0$, with $v^{(1)}_i, v^{(2)}_i$ denoting the outgoing and incoming vertices of the edge, respectively. To grow the network, we update the network at each stage from $G(n-1)$ to $G(n)$ by adding a new edge $(v^{(1)}_n, v^{(2)}_n)$. Assume that the nodes are labeled using positive integers starting from 1 according to the time order in which they are created, and let the random number $N(n) = |V(n)|$ denote the total number of nodes in $G(n)$.
Let us consider the situation where an existing node is to be chosen from $V(n)$ as the vertex of the new edge. Naively sampling from the multinomial distribution requires $O(N(n))$ evaluations, where $N(n)$ increases linearly with $n$. Therefore the total cost to simulate a network of $n$ edges is $O(n^2)$. This is significantly burdensome when $n$ is large, which is usually the case for observed networks. Algorithm 1 describes a simulation algorithm which uses the alias method [13] for node sampling. Here sampling an existing node from $V(n)$ requires only constant execution time, regardless of $n$. Hence the cost to simulate $G(n)$ is only $O(n)$. This method allows generation of a graph with $10^7$ nodes on a personal laptop in less than 5 seconds.
To see that the algorithm indeed produces the intended network, it suffices to consider the case of sampling an existing node from $V(n-1)$ as the incoming vertex of the new edge. In the function Node Sample in Algorithm 1, we generate

Function Node Sample
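Input: $E(t)$, the edge list up to time $t$; $j \in \{1, 2\}$, the type of node to be sampled, representing outgoing and incoming nodes, respectively; $\delta \in \{\delta_{in}, \delta_{out}\}$, the offset parameter.
Output: the sampled node, $v$.
Then
\[ P(v = w) = \frac{D^{(n-1)}_{in}(w) + \delta_{in}}{n - 1 + \delta_{in} N(n-1)}, \]
which corresponds to the desired selection probability (2.1).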
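The constant-time node selection can be sketched as follows. This is a minimal illustration and is not claimed to reproduce Algorithm 1 or the alias method of [13]: it assumes a simple mixture trick in which, with probability $t/(t + \delta N)$, the relevant endpoint of a uniformly chosen edge is returned (probability proportional to degree), and otherwise a uniformly chosen node is returned (the offset part). The names `node_sample` and `simulate_pa`, and the initial graph of one node with a self-loop, are assumptions made for the example.

```python
import random


def node_sample(edges, n_nodes, j, delta, rng):
    """Draw an existing node v with probability (deg_j(v) + delta) / (t + delta * n_nodes).

    edges: list of (source, target) pairs, nodes labeled 1..n_nodes.
    j = 1 uses out-degree (source endpoint), j = 2 uses in-degree (target endpoint).
    """
    t = len(edges)
    w = rng.uniform(0.0, t + n_nodes * delta)
    if w < t:
        # Endpoint of a uniformly chosen edge: probability proportional to degree.
        return edges[int(w)][j - 1]
    # Uniformly chosen node: contributes the offset delta to every node.
    return min(int((w - t) / delta), n_nodes - 1) + 1


def simulate_pa(n_edges, alpha, beta, delta_in, delta_out, seed=0):
    """Grow the network to n_edges edges; gamma = 1 - alpha - beta."""
    rng = random.Random(seed)
    edges = [(1, 1)]  # assumed initial graph G(n_0): one node with a self-loop
    n_nodes = 1
    while len(edges) < n_edges:
        u = rng.random()
        if u < alpha:  # alpha-scheme: new node points to an existing node
            w = node_sample(edges, n_nodes, 2, delta_in, rng)
            n_nodes += 1
            edges.append((n_nodes, w))
        elif u < alpha + beta:  # beta-scheme: edge between existing nodes
            v = node_sample(edges, n_nodes, 1, delta_out, rng)
            w = node_sample(edges, n_nodes, 2, delta_in, rng)
            edges.append((v, w))
        else:  # gamma-scheme: existing node points to a new node
            v = node_sample(edges, n_nodes, 1, delta_out, rng)
            n_nodes += 1
            edges.append((v, n_nodes))
    return edges, n_nodes


edges, n_nodes = simulate_pa(2000, 0.3, 0.5, 2.0, 1.0, seed=1)
```

Each call to `node_sample` costs $O(1)$ because it only indexes into the edge list, so the whole simulation is linear in the number of edges.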

Parameter estimation: MLE based on the full network history
In this section, we estimate the preferential attachment parameter vector θ = (α, β, δ in , δ out ) under two assumptions about what data is available. In the first scenario, the full evolution of the network is observed, from which the likelihood function can be computed. The resulting MLE is strongly consistent and asymptotically normal. For the second scenario, the data only consist of one snapshot of the network with n edges, without the knowledge of the network history that produced these edges. For this scenario we give an estimation approach through approximating the score function and moment matching, which produces parameter estimators that are also strongly consistent but less efficient than those based on the full evolution of the network. In both cases, the estimators are uniquely determined.

Likelihood calculation
Assume the network begins with the graph $G(n_0)$ (consisting of $n_0$ edges) and then evolves according to the description in Section 2.1 with parameters $(\alpha, \beta, \delta_{in}, \delta_{out})$, where $\delta_{in}, \delta_{out} > 0$ and $\alpha, \beta$ are non-negative probabilities. The parameter $\gamma$ is implicitly defined by $\gamma = 1 - \alpha - \beta$. To avoid trivial cases, we will also assume $\alpha, \beta, \gamma < 1$ for the rest of the paper. For maximum likelihood estimation we restrict the parameter space for $\delta_{in}, \delta_{out}$ to be $[\epsilon, K]$, for some sufficiently small $\epsilon > 0$ and large $K$. In particular, the true values of $\delta_{in}, \delta_{out}$ are assumed to be contained in $(\epsilon, K)$.
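For $t = n_0 + 1, \dots, n$, let $e_t = (v^{(1)}_t, v^{(2)}_t)$ be the newly created edge when the random graph evolves from $G(t-1)$ to $G(t)$. We sometimes refer to $t$ as the time rather than the number of edges.
Assume we observe the initial graph $G(n_0)$ and the edges $\{e_t\}_{t=n_0+1}^{n}$ in the order of their formation. For $t = n_0 + 1, \dots, n$, the values of the following variables are known: the scheme $J_t$, the number of nodes $N(t-1)$, and the degrees $D^{(t-1)}_{in}(v^{(2)}_t)$ and $D^{(t-1)}_{out}(v^{(1)}_t)$ of the endpoints of $e_t$ in $G(t-1)$. Then the likelihood function is
\[ L(\alpha, \beta, \delta_{in}, \delta_{out}) = \prod_{t=n_0+1}^{n} \left( \alpha\, \frac{D^{(t-1)}_{in}(v^{(2)}_t) + \delta_{in}}{t - 1 + \delta_{in} N(t-1)} \right)^{1_{\{J_t = 1\}}} \left( \beta\, \frac{D^{(t-1)}_{in}(v^{(2)}_t) + \delta_{in}}{t - 1 + \delta_{in} N(t-1)} \cdot \frac{D^{(t-1)}_{out}(v^{(1)}_t) + \delta_{out}}{t - 1 + \delta_{out} N(t-1)} \right)^{1_{\{J_t = 2\}}} \left( (1 - \alpha - \beta)\, \frac{D^{(t-1)}_{out}(v^{(1)}_t) + \delta_{out}}{t - 1 + \delta_{out} N(t-1)} \right)^{1_{\{J_t = 3\}}}, \tag{3.1} \]
and the log likelihood function is
\[ \log L = \log\alpha \sum_{t} 1_{\{J_t = 1\}} + \log\beta \sum_{t} 1_{\{J_t = 2\}} + \log(1 - \alpha - \beta) \sum_{t} 1_{\{J_t = 3\}} + \sum_{t} \log\big(D^{(t-1)}_{in}(v^{(2)}_t) + \delta_{in}\big) 1_{\{J_t \in \{1,2\}\}} + \sum_{t} \log\big(D^{(t-1)}_{out}(v^{(1)}_t) + \delta_{out}\big) 1_{\{J_t \in \{2,3\}\}} - \sum_{t} \log\big(t - 1 + \delta_{in} N(t-1)\big) 1_{\{J_t \in \{1,2\}\}} - \sum_{t} \log\big(t - 1 + \delta_{out} N(t-1)\big) 1_{\{J_t \in \{2,3\}\}}, \tag{3.2} \]
where all sums run over $t = n_0 + 1, \dots, n$. The score functions for $\alpha, \beta, \delta_{in}, \delta_{out}$ are calculated as follows:
\[ \frac{\partial}{\partial\alpha} \log L = \frac{1}{\alpha} \sum_{t} 1_{\{J_t = 1\}} - \frac{1}{1 - \alpha - \beta} \sum_{t} 1_{\{J_t = 3\}}, \tag{3.3} \]
\[ \frac{\partial}{\partial\beta} \log L = \frac{1}{\beta} \sum_{t} 1_{\{J_t = 2\}} - \frac{1}{1 - \alpha - \beta} \sum_{t} 1_{\{J_t = 3\}}, \tag{3.4} \]
\[ \frac{\partial}{\partial\delta_{in}} \log L = \sum_{t} \frac{1_{\{J_t \in \{1,2\}\}}}{D^{(t-1)}_{in}(v^{(2)}_t) + \delta_{in}} - \sum_{t} \frac{N(t-1)\, 1_{\{J_t \in \{1,2\}\}}}{t - 1 + \delta_{in} N(t-1)}, \tag{3.5} \]
\[ \frac{\partial}{\partial\delta_{out}} \log L = \sum_{t} \frac{1_{\{J_t \in \{2,3\}\}}}{D^{(t-1)}_{out}(v^{(1)}_t) + \delta_{out}} - \sum_{t} \frac{N(t-1)\, 1_{\{J_t \in \{2,3\}\}}}{t - 1 + \delta_{out} N(t-1)}. \]
Note that the score functions (3.3), (3.4) for $\alpha$ and $\beta$ do not depend on $\delta_{in}$ and $\delta_{out}$. One can show that the Hessian matrix of the log-likelihood for $(\alpha, \beta)$ is negative definite. Setting (3.3) and (3.4) to zero gives the unique MLE estimates for $\alpha$ and $\beta$,
\[ \hat\alpha^{MLE} = \frac{1}{n - n_0} \sum_{t=n_0+1}^{n} 1_{\{J_t = 1\}}, \tag{3.6} \]
\[ \hat\beta^{MLE} = \frac{1}{n - n_0} \sum_{t=n_0+1}^{n} 1_{\{J_t = 2\}}. \tag{3.7} \]
These estimates are strongly consistent by applying the strong law of large numbers for the $\{J_t\}_{t \ge n_0+1}$ sequence.
Next, consider the first term of the score function for $\delta_{in}$ in (3.5), and we have
\[ \sum_{t} \frac{1_{\{J_t \in \{1,2\}\}}}{D^{(t-1)}_{in}(v^{(2)}_t) + \delta_{in}} = \sum_{i=0}^{\infty} \frac{1}{i + \delta_{in}} \sum_{t} 1_{\{D^{(t-1)}_{in}(v^{(2)}_t) = i,\; J_t \in \{1,2\}\}}, \]
where the event $\{D^{(t-1)}_{in}(v^{(2)}_t) = i,\ J_t \in \{1,2\}\}$ means that the in-degree of $v^{(2)}_t$ is $i$ at time $t-1$ and is augmented to $i+1$ at time $t$. For each $i \ge 1$, such an event happens at some stage $t \in \{n_0+1, n_0+2, \dots, n\}$ only for those nodes with in-degree $\le i$ at time $n_0$ and in-degree $> i$ at time $n$. Let $N_{ij}(n)$ denote the number of nodes with in-degree $i$ and out-degree $j$ at time $n$, and $N^{in}_i(n)$ and $N^{in}_{>i}(n)$ the number of nodes with in-degree equal to $i$ and greater than $i$, respectively, i.e.,
\[ N^{in}_i(n) := \sum_{j=0}^{\infty} N_{ij}(n), \qquad N^{in}_{>i}(n) := \sum_{k > i} N^{in}_k(n). \]
Then for $i \ge 1$,
\[ \sum_{t} 1_{\{D^{(t-1)}_{in}(v^{(2)}_t) = i,\; J_t \in \{1,2\}\}} = N^{in}_{>i}(n) - N^{in}_{>i}(n_0). \]
On the other hand, when $i = 0$, $\{D^{(t-1)}_{in}(v^{(2)}_t) = 0,\ J_t \in \{1,2\}\}$ occurs for some $t$ if and only if all of the following three events happen: (i) the node $v^{(2)}_t$ had in-degree 0 at time $n_0$ or was created after time $n_0$; (ii) it has in-degree $> 0$ at time $n$; (iii) it was not created under the $\gamma$-scheme (otherwise it would have been born with in-degree 1).
This implies:
\[ \sum_{t} 1_{\{D^{(t-1)}_{in}(v^{(2)}_t) = 0,\; J_t \in \{1,2\}\}} = N^{in}_{>0}(n) - N^{in}_{>0}(n_0) - \sum_{t} 1_{\{J_t = 3\}}. \tag{3.8} \]
Setting the score function (3.5) for $\delta_{in}$ to 0 and dividing both sides by $n - n_0$ leads to
\[ \sum_{i=0}^{\infty} \frac{\big(N^{in}_{>i}(n) - N^{in}_{>i}(n_0)\big)/(n - n_0)}{i + \delta_{in}} - \frac{\frac{1}{n - n_0} \sum_{t} 1_{\{J_t = 3\}}}{\delta_{in}} - \frac{1}{n - n_0} \sum_{t} \frac{N(t-1)\, 1_{\{J_t \in \{1,2\}\}}}{t - 1 + \delta_{in} N(t-1)} = 0, \tag{3.9} \]
where the only unknown parameter is $\delta_{in}$. In Section 3.2, we show that the solution to (3.9) actually maximizes the likelihood function in $\delta_{in}$. Similarly, the MLE for $\delta_{out}$ can be solved from
\[ \sum_{j=0}^{\infty} \frac{\big(N^{out}_{>j}(n) - N^{out}_{>j}(n_0)\big)/(n - n_0)}{j + \delta_{out}} - \frac{\frac{1}{n - n_0} \sum_{t} 1_{\{J_t = 1\}}}{\delta_{out}} - \frac{1}{n - n_0} \sum_{t} \frac{N(t-1)\, 1_{\{J_t \in \{2,3\}\}}}{t - 1 + \delta_{out} N(t-1)} = 0, \tag{3.10} \]
where $N^{out}_{>j}(n)$ is defined in the same fashion as $N^{in}_{>i}(n)$.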
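Hence by the factorization theorem, $N(n_0)$, $(J_t)_{t=n_0+1}^{n}$, $(N^{in}_{>i}(n))_{i \ge 0}$ and $(N^{out}_{>j}(n))_{j \ge 0}$ are sufficient statistics for $(\alpha, \beta, \delta_{in}, \delta_{out})$.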
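The full-history estimates can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the history is given as triples `(J, v, w)` with `J` in {1, 2, 3} and `(v, w)` the new directed edge; $\hat\alpha, \hat\beta$ are the sample proportions of the schemes; and $\hat\delta_{in}$ is found by bisection on the score $\sum_t 1_{\{J_t \in \{1,2\}\}} \big[ 1/(D^{(t-1)}_{in}(v^{(2)}_t) + \delta) - N(t-1)/(t - 1 + \delta N(t-1)) \big]$ over $[\epsilon, K]$. The function name `mle_full_history` and the choice of bisection are ours, not the authors'.

```python
from collections import defaultdict


def mle_full_history(init_edges, n_init_nodes, history, eps=1e-3, K=100.0):
    """Return (alpha_hat, beta_hat, delta_in_hat) from a fully observed history.

    init_edges: edges (v, w) of the initial graph G(n_0).
    history: list of (J, v, w), in order of formation, J in {1, 2, 3}.
    """
    indeg = defaultdict(int)
    for _, w in init_edges:
        indeg[w] += 1
    n_nodes, t_prev = n_init_nodes, len(init_edges)
    counts = {1: 0, 2: 0, 3: 0}
    terms = []  # (in-degree of target before the step, N(t-1), t-1) when J in {1, 2}
    for J, v, w in history:
        counts[J] += 1
        if J in (1, 2):
            terms.append((indeg[w], n_nodes, t_prev))
        if J in (1, 3):
            n_nodes += 1  # alpha- and gamma-schemes create one new node
        indeg[w] += 1
        t_prev += 1
    m = len(history)
    alpha_hat, beta_hat = counts[1] / m, counts[2] / m

    def score(d):
        return sum(1.0 / (k + d) - N / (t + d * N) for k, N, t in terms)

    lo, hi = eps, K
    if score(lo) <= 0:  # likelihood decreasing on [eps, K]
        return alpha_hat, beta_hat, lo
    if score(hi) >= 0:  # likelihood increasing on [eps, K]
        return alpha_hat, beta_hat, hi
    for _ in range(80):  # bisection keeps score(lo) > 0 >= score(hi)
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return alpha_hat, beta_hat, 0.5 * (lo + hi)


toy_history = [(2, 1, 3), (2, 1, 2), (2, 1, 2), (3, 1, 4)]
estimates = mle_full_history([(1, 2)], 3, toy_history)
```

The estimate of $\delta_{out}$ follows by the mirror computation, tracking out-degrees of the source node for $J_t \in \{2, 3\}$.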

Consistency of MLE
We remarked after (3.6) and (3.7) that $\hat\alpha^{MLE}$ and $\hat\beta^{MLE}$ converge almost surely to $\alpha$ and $\beta$. We now prove that the MLE of $(\delta_{in}, \delta_{out})$ is also strongly consistent. Note that if we initiate the network with $G(n_0)$ (for both $n_0$ and $N(n_0)$ finite), then almost surely for all $i, j \ge 0$,
\[ \frac{N^{in}_{>i}(n_0)}{n} \to 0, \qquad \frac{N^{out}_{>j}(n_0)}{n} \to 0, \qquad n \to \infty. \]
So for simplicity, we assume that the graph is initiated with finitely many nodes and no edges, that is, $n_0 = 0$ and $N(0) \ge 1$. In particular, these assumptions imply the sum of the in-degrees at time $n$ is equal to $n$.
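Let $\Psi_n(\cdot), \Phi_n(\cdot)$ be the functional forms of the terms in the log-likelihood function (3.2) involving $\delta_{in}$ and $\delta_{out}$ respectively, normalized by $1/n$, i.e.,
\[ \Psi_n(\lambda) := \sum_{i=0}^{\infty} \frac{N^{in}_{>i}(n)}{n} \log(i + \lambda) - \Big( \frac{1}{n} \sum_{t=1}^{n} 1_{\{J_t = 3\}} \Big) \log\lambda - \frac{1}{n} \sum_{t=1}^{n} \log\big(t - 1 + \lambda N(t-1)\big) 1_{\{J_t \in \{1,2\}\}}, \]
\[ \Phi_n(\mu) := \sum_{j=0}^{\infty} \frac{N^{out}_{>j}(n)}{n} \log(j + \mu) - \Big( \frac{1}{n} \sum_{t=1}^{n} 1_{\{J_t = 1\}} \Big) \log\mu - \frac{1}{n} \sum_{t=1}^{n} \log\big(t - 1 + \mu N(t-1)\big) 1_{\{J_t \in \{2,3\}\}}, \]
and write $\psi_n := \Psi_n'$ and $\varphi_n := \Phi_n'$ for their derivatives. The following theorem gives the consistency of the MLE of $\delta_{in}$ and $\delta_{out}$.
Theorem 3.2. Suppose $\delta_{in}, \delta_{out} \in (\epsilon, K)$. Define
\[ \hat\delta^{MLE}_{in} := \arg\max_{\epsilon \le \lambda \le K} \Psi_n(\lambda), \qquad \hat\delta^{MLE}_{out} := \arg\max_{\epsilon \le \mu \le K} \Phi_n(\mu). \]
Then these are the MLE estimators of $\delta_{in}, \delta_{out}$ and they are strongly consistent; that is, $\hat\delta^{MLE}_{in} \to \delta_{in}$ and $\hat\delta^{MLE}_{out} \to \delta_{out}$ almost surely as $n \to \infty$.
Proof. Let us consider a limit version of $\psi_n$:
\[ \psi(\lambda) := \sum_{i=0}^{\infty} \frac{p^{in}_{>i}(\delta_{in})}{i + \lambda} - \frac{\gamma}{\lambda} - \frac{(\alpha + \beta)(1 - \beta)}{1 + \lambda(1 - \beta)}. \]
Here we write $p^{in}_i(\delta_{in})$ to emphasize the dependence on $\delta_{in}$. In Lemmas A.1 and A.2, provided in the appendix, it is shown that $\psi(\cdot)$ has a unique zero at $\delta_{in}$, where $\psi(\lambda) > 0$ when $\lambda < \delta_{in}$ and $\psi(\lambda) < 0$ when $\lambda > \delta_{in}$, and
\[ \sup_{\lambda \in [\epsilon, K]} |\psi_n(\lambda) - \psi(\lambda)| \to 0 \quad \text{almost surely}. \]
Since $\psi$ is continuous, for any $\kappa > 0$ arbitrarily small, there exists $\varepsilon_\kappa > 0$ such that
\[ \psi(\lambda) > \varepsilon_\kappa \ \text{for } \lambda \in [\epsilon, \delta_{in} - \kappa], \qquad \psi(\lambda) < -\varepsilon_\kappa \ \text{for } \lambda \in [\delta_{in} + \kappa, K]. \]
These jointly indicate that, almost surely for all large $n$, $\psi_n > 0$ on $[\epsilon, \delta_{in} - \kappa]$ and $\psi_n < 0$ on $[\delta_{in} + \kappa, K]$, so that the maximizer of $\Psi_n$ lies within $\kappa$ of $\delta_{in}$.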

Asymptotic normality of MLE
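In the following theorem, we establish the asymptotic normality for the MLE estimator $\hat\theta^{MLE}_n = (\hat\alpha^{MLE}, \hat\beta^{MLE}, \hat\delta^{MLE}_{in}, \hat\delta^{MLE}_{out})$.

Theorem 3.3. Let $\hat\theta^{MLE}_n$ be the MLE estimator for $\theta$, the parameter vector of the preferential attachment model. Then
\[ \sqrt{n}\,\big(\hat\theta^{MLE}_n - \theta\big) \Rightarrow N\big(0, \Sigma(\theta)\big), \]
where
\[ \Sigma^{-1}(\theta) = I(\theta) := \begin{pmatrix} \frac{1-\beta}{\alpha(1-\alpha-\beta)} & \frac{1}{1-\alpha-\beta} & 0 & 0 \\ \frac{1}{1-\alpha-\beta} & \frac{1-\alpha}{\beta(1-\alpha-\beta)} & 0 & 0 \\ 0 & 0 & I_{in} & 0 \\ 0 & 0 & 0 & I_{out} \end{pmatrix}, \]
with
\[ I_{in} := \sum_{i=0}^{\infty} \frac{p^{in}_{>i}}{(i + \delta_{in})^2} - \frac{\gamma}{\delta_{in}^2} - \frac{(\alpha + \beta)(1 - \beta)^2}{(1 + \delta_{in}(1 - \beta))^2}, \tag{3.13} \]
\[ I_{out} := \sum_{j=0}^{\infty} \frac{p^{out}_{>j}}{(j + \delta_{out})^2} - \frac{\alpha}{\delta_{out}^2} - \frac{(\beta + \gamma)(1 - \beta)^2}{(1 + \delta_{out}(1 - \beta))^2}. \tag{3.14} \]
In particular, $I(\theta)$ is the asymptotic Fisher information matrix for the parameters, and hence the MLE estimator is efficient.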
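Proof. The score function for $\delta_{in}$ can be written as $S_n(\delta_{in}) = \sum_{t=1}^{n} u_t(\delta_{in})$, where $u_t$ is defined by
\[ u_t(\lambda) := \frac{1_{\{J_t \in \{1,2\}\}}}{D^{(t-1)}_{in}(v^{(2)}_t) + \lambda} - \frac{N(t-1)\, 1_{\{J_t \in \{1,2\}\}}}{t - 1 + \lambda N(t-1)}. \]
The MLE estimator $\hat\delta^{MLE}_{in}$ can be obtained by solving $\sum_{t=1}^{n} u_t(\lambda) = 0$. To establish
\[ \sqrt{n}\,\big(\hat\delta^{MLE}_{in} - \delta_{in}\big) \Rightarrow N\big(0, I_{in}^{-1}\big), \]
where $I_{in}$ is as defined in (3.13), it suffices to show the following two results:
\[ \text{(i)} \quad n^{-1/2} \sum_{t=1}^{n} u_t(\delta_{in}) \Rightarrow N(0, I_{in}), \qquad \text{(ii)} \quad n^{-1} \sum_{t=1}^{n} \dot u_t(\hat\lambda_n) \to -I_{in} \ \text{in probability}, \]
where $\hat\lambda_n$ lies between $\hat\delta^{MLE}_{in}$ and $\delta_{in}$. These are proved in Lemmas A.3 and A.4 in the appendix, respectively.
To establish the joint asymptotic normality of the MLE estimator $\hat\theta^{MLE}_n$, write
\[ S_n(\theta) := \big(S_n(\alpha), S_n(\beta), S_n(\delta_{in}), S_n(\delta_{out})\big)^T, \]
where $S_n(\alpha), S_n(\beta), S_n(\delta_{in}), S_n(\delta_{out})$ are the score functions for $\alpha, \beta, \delta_{in}, \delta_{out}$, respectively. A multivariate Taylor expansion gives
\[ 0 = S_n(\hat\theta^{MLE}_n) = S_n(\theta) + \dot S_n(\theta^*_n)\,\big(\hat\theta^{MLE}_n - \theta\big), \tag{3.17} \]
where $\dot S_n$ denotes the Hessian matrix of the log-likelihood function $\log L(\theta)$, and $\theta^*_n = \theta + \xi \circ (\hat\theta^{MLE}_n - \theta)$ for some vector $\xi \in [0,1]^4$, with $\circ$ denoting the Hadamard product. From Remark 3.1, the likelihood function $L(\theta)$ can be factored into
\[ L(\theta) = f_1(\alpha, \beta)\, f_2(\delta_{in})\, f_3(\delta_{out}), \]
so that $\dot S_n$ is block diagonal. Note that $(S_n(\alpha), S_n(\beta))$, $S_n(\delta_{in})$, $S_n(\delta_{out})$ are pairwise uncorrelated. As an example, observe that
\[ E\big[S_n(\alpha)\, S_n(\delta_{in})\big] = 0. \]
Using the Cramér-Wold device, the joint convergence of $S_n(\theta)$ follows easily, i.e.,
\[ n^{-1/2} S_n(\theta) \Rightarrow N\big(0, I(\theta)\big). \tag{3.18} \]
From here, the result of the theorem follows from (3.17) and (3.18).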
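As a sanity check on the $(\alpha, \beta)$ part of the limit, note that $\hat\alpha^{MLE}$ and $\hat\beta^{MLE}$ are sample proportions of the iid multinomial indicators $\{J_t\}$. The display below is the standard multinomial central limit theorem together with the inverse of its covariance matrix, written with $\gamma = 1 - \alpha - \beta$; it is added here as an illustration and is not part of the authors' proof.

```latex
\sqrt{n}\begin{pmatrix} \hat\alpha^{MLE} - \alpha \\ \hat\beta^{MLE} - \beta \end{pmatrix}
\Rightarrow N\!\left( 0,\;
\begin{pmatrix} \alpha(1-\alpha) & -\alpha\beta \\ -\alpha\beta & \beta(1-\beta) \end{pmatrix} \right),
\qquad
\begin{pmatrix} \alpha(1-\alpha) & -\alpha\beta \\ -\alpha\beta & \beta(1-\beta) \end{pmatrix}^{-1}
= \frac{1}{\alpha\beta\gamma}
\begin{pmatrix} \beta(1-\beta) & \alpha\beta \\ \alpha\beta & \alpha(1-\alpha) \end{pmatrix}
= \begin{pmatrix} \frac{1}{\alpha} + \frac{1}{\gamma} & \frac{1}{\gamma} \\[2pt] \frac{1}{\gamma} & \frac{1}{\beta} + \frac{1}{\gamma} \end{pmatrix}.
```

The determinant used in the middle step is $\alpha\beta\big[(1-\alpha)(1-\beta) - \alpha\beta\big] = \alpha\beta\gamma$. For example, with $\alpha = 0.3$ the asymptotic variance of $\hat\alpha^{MLE}$ is $0.21/n$.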

Parameter estimation based on one snapshot
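Based only on the single snapshot $G(n)$, we propose a parameter estimation procedure. We assume that the choice of the snapshot does not depend on any endogenous information related to the network. The snapshot merely represents a point in time where the data is available. Since no information on the initial graph $G(n_0)$ is available, we merely assume $n_0$ and $N(n_0)$ are fixed and $n \to \infty$. Among the sufficient statistics for $(\alpha, \beta, \delta_{in}, \delta_{out})$ derived in Remark 3.1, $(N^{in}_{>i}(n))_{i \ge 0}$ and $(N^{out}_{>j}(n))_{j \ge 0}$ are computable from $G(n)$, but the $(J_t)_{t=1}^{n}$ are not. However, when $n$ is large, we can use the following approximations according to the proof of Lemma A.2:
\[ \frac{1}{n} \sum_{t} 1_{\{J_t = 3\}} \approx 1 - \alpha - \beta, \qquad \frac{1}{n} \sum_{t} \frac{N(t-1)\, 1_{\{J_t \in \{1,2\}\}}}{t - 1 + \delta_{in} N(t-1)} \approx \frac{(\alpha + \beta)(1 - \beta)}{1 + \delta_{in}(1 - \beta)}. \]
Substituting in (3.9), we estimate $\delta_{in}$ in terms of $\alpha$ and $\beta$ by solving
\[ \sum_{i=0}^{\infty} \frac{N^{in}_{>i}(n)/n}{i + \delta_{in}} - \frac{1 - \alpha - \beta}{\delta_{in}} - \frac{(\alpha + \beta)(1 - \beta)}{1 + (1 - \beta)\delta_{in}} = 0. \tag{4.1} \]
Note that a strongly consistent estimator of $\beta$ can be obtained directly from $G(n)$:
\[ \tilde\beta = 1 - \frac{N(n)}{n}. \tag{4.2} \]
To obtain an estimate for $\alpha$, we make use of the recursive formula for $\{p^{in}_i\}$ in (A.1a):
\[ p^{in}_0 = \frac{\alpha}{1 + a_1(\delta_{in})\,\delta_{in}}, \qquad a_1(\lambda) = \frac{\alpha + \beta}{1 + \lambda(1 - \beta)}, \]
and replace $p^{in}_0$ by $N^{in}_0(n)/n$ for large $n$,
\[ \frac{N^{in}_0(n)}{n} = \frac{\alpha}{1 + \dfrac{(\alpha + \beta)\,\delta_{in}}{1 + (1 - \beta)\delta_{in}}}. \tag{4.3} \]
Plug the strongly consistent estimator $\tilde\beta$ into (4.1) and (4.3), and we claim that solving the system of equations
\[ \sum_{i=0}^{\infty} \frac{N^{in}_{>i}(n)/n}{i + \delta_{in}} - \frac{1 - \alpha - \tilde\beta}{\delta_{in}} - \frac{(\alpha + \tilde\beta)(1 - \tilde\beta)}{1 + (1 - \tilde\beta)\delta_{in}} = 0, \tag{4.4a} \]
\[ \frac{N^{in}_0(n)}{n} = \frac{\alpha}{1 + \dfrac{(\alpha + \tilde\beta)\,\delta_{in}}{1 + (1 - \tilde\beta)\delta_{in}}}, \tag{4.4b} \]
gives the unique solution $(\tilde\alpha, \tilde\delta_{in})$ which is strongly consistent for $(\alpha, \delta_{in})$. The parameters $\tilde\delta_{out}$ and $\tilde\gamma$ can be estimated by a mirror argument. We summarize the estimation procedure for $(\alpha, \beta, \gamma, \delta_{in}, \delta_{out})$ from the snapshot $G(n)$ as follows:
1. Estimate $\beta$ by $\tilde\beta = 1 - N(n)/n$.
2. Obtain $\tilde\delta^0_{in}$ by solving the single equation in $\delta_{in}$ that results from eliminating $\alpha$ between the two equations (i.e., matching (4.4a) and (4.4b)).
3. Estimate $\alpha$ by $\tilde\alpha^0$, obtained from (4.4b) with $\delta_{in} = \tilde\delta^0_{in}$.
4. Obtain $\tilde\delta^0_{out}$ by solving the mirror-image equation for the out-degrees.
5. Estimate $\gamma$ by $\tilde\gamma^0$, obtained from the mirror image of (4.4b) with $\delta_{out} = \tilde\delta^0_{out}$.
Note that even though all three estimators $\tilde\alpha^0, \tilde\beta, \tilde\gamma^0$ are strongly consistent and hence $\tilde\alpha^0 + \tilde\beta + \tilde\gamma^0 \to 1$ almost surely, Steps 1-5 do not necessarily imply the strict equality $\tilde\alpha^0 + \tilde\beta + \tilde\gamma^0 = 1$.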
We recommend adding two further steps (Steps 6 and 7) that re-normalize the probability estimates so that $\tilde\alpha + \tilde\beta + \tilde\gamma = 1$, to overcome this defect.
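The in-degree half of the snapshot procedure (the estimates of $\beta$, $\delta_{in}$ and $\alpha$ before re-normalization) can be sketched as below. This is a minimal illustration under explicit assumptions: it takes the moment equation $N^{in}_0(n)/n = \alpha / \big(1 + (\alpha + \tilde\beta)\delta/(1 + (1 - \tilde\beta)\delta)\big)$, solves it for $\alpha$ as a function of $\delta$, substitutes into $\sum_i (N^{in}_{>i}(n)/n)/(i + \delta) - (1 - \alpha - \tilde\beta)/\delta - (\alpha + \tilde\beta)(1 - \tilde\beta)/(1 + (1 - \tilde\beta)\delta) = 0$, and finds the root by bisection. The function name `snapshot_in_estimates`, the bracket `[lo, hi]`, and the use of bisection are ours.

```python
def snapshot_in_estimates(in_counts, n_edges, n_nodes, lo=1e-3, hi=100.0):
    """Return (alpha0, beta, delta_in0) from one snapshot.

    in_counts: dict mapping in-degree i (including 0) to the number of nodes
    with that in-degree; n_edges = n; n_nodes = N(n).
    """
    beta = 1.0 - n_nodes / n_edges
    p0 = in_counts.get(0, 0) / n_edges
    # tails[i] = N_{>i}(n) / n for i = 0, ..., max in-degree - 1
    tails, remaining = [], n_nodes
    for i in range(max(in_counts)):
        remaining -= in_counts.get(i, 0)
        tails.append(remaining / n_edges)

    def alpha_of(d):
        # Moment equation for in-degree-0 nodes, solved for alpha given d.
        a = d / (1.0 + (1.0 - beta) * d)
        return (p0 + beta) / (1.0 - p0 * a) - beta

    def h(d):
        alpha = alpha_of(d)
        s = sum(f / (i + d) for i, f in enumerate(tails))
        return (s - (1.0 - alpha - beta) / d
                - (alpha + beta) * (1.0 - beta) / (1.0 + (1.0 - beta) * d))

    if h(lo) <= 0 or h(hi) >= 0:
        raise ValueError("no sign change on [lo, hi]; adjust the bracket")
    for _ in range(80):  # bisection keeps h(lo) > 0 >= h(hi)
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    delta = 0.5 * (lo + hi)
    return alpha_of(delta), beta, delta
```

The out-degree half is the mirror image: swap in-degrees for out-degrees and the roles of $\alpha$ and $\gamma$.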

Simulation study
We now apply the estimation procedures described in Sections 3 and 4 to simulated data, which allows us to compare the estimation results using the full history of the network with that using just one snapshot. Algorithm 1 is used to simulate realizations of the preferential attachment network.

MLE
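For the scenario of observing the full history of the network, we simulated 5000 independent replications of the preferential attachment network with $10^5$ edges under the true parameter values
\[ \theta = (\alpha, \beta, \delta_{in}, \delta_{out}) = (0.3, 0.5, 2, 1). \tag{5.1} \]
For each realization, the MLE of the parameters was computed and normalized using an estimate $\hat\Sigma$ of the asymptotic covariance matrix $\Sigma$. The entries of $\hat\Sigma$ are given by an explicit formula in terms of the estimated parameters and the plug-in versions of $I_{in}$ and $I_{out}$; see (3.13) and (3.14). By the strong consistency of the MLEs combined with Lemma A.2, we have that $\hat\Sigma \to \Sigma$ almost surely. The QQ-plots of the normalized MLEs are shown in Figure 5.1, all of which line up quite well with the $y = x$ line (the red dashed line). This is consistent with the asymptotic theory described in Theorem 3.3. Confidence intervals for $\theta$ can be obtained using this theorem. Given a single realization, an approximate $(1 - \varepsilon)$-confidence interval for $\theta_i$, $i = 1, \dots, 4$, is
\[ \big(\hat\theta^{MLE}_n\big)_i \pm z_{\varepsilon/2} \sqrt{\hat\Sigma_{ii}/n}, \]
where $z_{\varepsilon/2}$ is the upper $\varepsilon/2$ quantile of $N(0, 1)$.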

One snapshot
We used the same simulated data as in Section 5.1 to obtain parameter estimates $\tilde\theta_n := (\tilde\alpha, \tilde\beta, \tilde\delta_{in}, \tilde\delta_{out})$ through only the final snapshot, i.e., the set of directed edges without timestamps, following the procedure described at the end of Section 4. For the purpose of comparison with MLE, Figure 5.2 gives the QQ-plots for the normalized estimates from the snapshots using the same standardizations as for the MLEs, i.e.,
\[ \frac{(\tilde\theta_n)_i - \theta_i}{\sqrt{\hat\Sigma_{ii}/n}}, \qquad i = 1, \dots, 4, \]
where $(\tilde\theta_n)_i$ denotes the $i$-th component of $\tilde\theta_n$. Again, the fitted lines in blue are the traditional QQ-lines and the red dashed lines are the $y = x$ line. The QQ-plot for $\tilde\beta$ exhibits the same shape as for $\hat\beta^{MLE}$, since the two estimates are identical. From Figure 5.2, we see that the snapshot estimates of all four parameters are consistent and approximately normal, i.e., the QQ-plots are linear. However, the slopes of the QQ-lines for $\tilde\alpha, \tilde\delta_{in}, \tilde\delta_{out}$ are much steeper than the diagonal line, indicating a loss of efficiency for $\tilde\theta_n$ compared with $\hat\theta^{MLE}_n$. Indeed the estimator variance is inflated for all parameters except for $\beta$, where $\tilde\beta$ coincides with the true MLE. This is as expected since knowing only the final snapshot provides far less information than the whole network history.
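Recall that for a consistent estimator $T_n$ of a one-dimensional parameter $\theta$ constructed from a random sample of size $n$, the asymptotic relative efficiency (ARE) of $T_n$ is defined by
\[ \mathrm{ARE}(T_n) := \lim_{n \to \infty} \frac{\mathrm{Var}(T^*_n)}{\mathrm{Var}(T_n)}, \]
where $T^*_n$ denotes the asymptotically efficient estimator. We may compute the AREs for the snapshot parameter estimates by
\[ \mathrm{ARE}\big((\tilde\theta_n)_i\big) \approx \frac{\widehat{\mathrm{Var}}\big((\hat\theta^{MLE}_n)_i\big)}{\widehat{\mathrm{Var}}\big((\tilde\theta_n)_i\big)}, \qquad i = 1, \dots, 4, \]
where $\widehat{\mathrm{Var}}$ denotes the sample variance of the parameter estimate based on the 5000 replications. Note that $\mathrm{ARE}(\tilde\beta) = 1$ since $\tilde\beta = \hat\beta^{MLE}$.
Given a single realization, the variances of the snapshot estimates can be estimated through resampling as follows. Using the estimated parameter $\tilde\theta_n$, simulate $10^4$ independent bootstrap replicates of the network with $n = 10^5$ edges. For each simulated network, the snapshot estimate, $\tilde\theta^*_n := (\tilde\alpha^*, \tilde\beta^*, \tilde\delta^*_{in}, \tilde\delta^*_{out})$, is computed. The sample variance of these $10^4$ snapshot estimates can then be used as an approximation for the variance of $\tilde\theta_n$ so that, assuming asymptotic normality, a $(1 - \varepsilon)$-confidence interval for $\theta_i$ can be approximated by
\[ (\tilde\theta_n)_i \pm z_{\varepsilon/2} \sqrt{\widehat{\mathrm{Var}}\big((\tilde\theta^*_n)_i\big)}, \]
where $z_{\varepsilon/2}$ is the upper $\varepsilon/2$ quantile of $N(0, 1)$.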
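Both computations above reduce to sample variances, as the following minimal sketch shows. It assumes the replicate estimates of a single parameter are already available as lists of floats; the helper names `are` and `normal_ci` are ours.

```python
from statistics import NormalDist, variance


def are(mle_reps, snapshot_reps):
    """Empirical ARE: sample variance of the MLE over that of the snapshot estimate."""
    return variance(mle_reps) / variance(snapshot_reps)


def normal_ci(estimate, bootstrap_reps, eps=0.05):
    """Approximate (1 - eps) interval: estimate +/- z_{eps/2} * bootstrap std. dev."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)
    half_width = z * variance(bootstrap_reps) ** 0.5
    return estimate - half_width, estimate + half_width


ratio = are([1.0, 2.0, 3.0], [0.0, 2.0, 4.0])  # sample variances 1 and 4
```

In practice `bootstrap_reps` would hold the snapshot estimates computed from networks simulated under the fitted parameters.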

Sensitivity test
Now we investigate the sensitivity of our estimates while values of the parameters (n, α, β, δ in , δ out ) are allowed to vary. First consider the impact of n, the number of edges in the network. To do so we held the parameters fixed with values given by (5.1): (α, β, δ in , δ out ) = (0.3, 0.5, 2, 1) and varied the value of n. The QQ-plots (not presented) for standardized estimates using both full MLE and one-snapshot methods were produced to check the asymptotic normality. When n = 500, 1000, diagnostics revealed departures from normality for both the MLE and the snapshot estimates. However, after increasing n to 10000, estimates obtained from both approaches appeared normally distributed as expected.
For each value of $n$ in Table 5.1, 5000 replicates of the network with $n$ edges and parameters $\theta = (0.3, 0.5, 2, 1)$ were generated. For each realization, the MLEs $\hat\theta^{MLE}_n$ were computed using the full history of the network and the one-snapshot estimates $\tilde\theta_n$ were obtained using the 7-step snapshot method proposed in Section 4, pretending that only the last snapshot $G(n)$ was available. The means of these two estimators are recorded in Table 5.1. There is little bias for both estimates of $\alpha$ and $\beta$, even for small values of $n$. On the other hand, there is some bias for the estimated $\delta_{in}$ and $\delta_{out}$ for $n \le 5000$. The magnitudes of the biases for both types of estimates decrease as $n$ increases. Also the AREs of the snapshot estimator stay within a narrow band as $n$ increases.
Next we held $(n, \delta_{in}, \delta_{out}) = (10^5, 2, 1)$ fixed and experimented with various values of $(\alpha, \beta)$ in Table 5.2. For each choice of $(\alpha, \beta)$, 5000 independent realizations of the network were generated and the means of the MLE $\hat\theta^{MLE}_n$ and the one-snapshot estimates $\tilde\theta_n$ were recorded. Overall, the biases for $\hat\theta^{MLE}_n$ are remarkably small for virtually all combinations of parameter values, except for those parameter choices where one of $(\alpha, \beta)$ is extremely small. The biases for the snapshot estimates $\tilde\theta_n$ exhibit a similar property, but the magnitudes of the biases are consistently larger than those in the MLE case.
In general, the snapshot estimators are able to achieve 20%-50% efficiency over the range of parameters considered. The loss of efficiency might be less than one would expect given the substantial reduction in the data available to produce the snapshot estimates. It is worth noting that in the case where $(\alpha, \beta) = (0.7, 0.2)$, the efficiencies of the snapshot estimators for $\alpha$ and $\delta_{in}$ are much larger (0.73 and 0.79, respectively). A heuristic explanation for this increase is that the parameter $\gamma = 1 - \alpha - \beta = 0.1$ is relatively small. By the implicit constraints used for the snapshot estimates, we have
\[ \tilde\alpha + \tilde\gamma = 1 - \tilde\beta = 1 - \hat\beta^{MLE} = \hat\alpha^{MLE} + \hat\gamma^{MLE}, \]
that is, the snapshot estimate of the sum $\alpha + \gamma$ is the same as the MLE for the sum. Now if $\gamma$ is small, one would expect the resulting estimates of $\gamma$ to also be small so that $\tilde\alpha$ would be nearly the same as $\hat\alpha^{MLE}$. Hence the ARE would be close to 1. On the other hand, in the case of a larger $\gamma$, see the bottom row of Table 5.2 in which $\gamma = 0.6$, the ARE for $\tilde\alpha$ is not as large (0.42), but the ARE for $\tilde\delta_{out}$ is larger (0.63).

Real network example
In this section, we explore fitting a preferential attachment model to a social network. As an illustration, we chose the Dutch Wiki talk network dataset, available on KONECT [14] (http://konect.uni-koblenz.de/networks/wiki_talk_nl). The nodes represent users of Dutch Wikipedia, and an edge from node A to node B refers to user A writing a message on the talk page of user B at a certain time point. The network consists of 225,749 nodes (users) and 1,554,699 edges (messages). All edges are recorded with timestamps.
In order to accommodate all the edge formation scenarios appearing in the dataset, we extend our model by appending the following two interaction schemes ($J_n = 4, 5$) in addition to the existing three ($J_n = 1, 2, 3$) described in Section 2.1.
These scenarios have been observed in other social network data, such as the network that models Facebook wall posts (http://konect.uni-koblenz.de/networks/facebook-wosn-wall). They occur in small proportions and can be easily accommodated by a slight modification in the model fitting procedure.
The new model has parameter vector $(\alpha, \beta, \gamma, \xi, \delta_{in}, \delta_{out})$, and $\rho$ is implicitly defined through $\rho = 1 - (\alpha + \beta + \gamma + \xi)$. Similar to the derivations in Section 3, the MLE estimators for $\alpha, \beta, \gamma, \xi$ are the observed proportions of edges created under scenarios $J_n = 1, 2, 3, 4$, respectively, and $\delta_{in}, \delta_{out}$ can be obtained through solving the corresponding score equations.
We first naively fit the linear preferential attachment model to the full network using MLE. To evaluate the goodness-of-fit of the resulting estimates $(\hat\alpha, \hat\beta, \hat\gamma, \hat\xi, \hat\rho, \hat\delta_{in}, \hat\delta_{out})$, 20 network realizations were simulated from the fitted model. We overlaid the empirical in- and out-degree frequencies of the original network with those of the simulations. If the model fits the data well, the degree frequencies of the data should lie within the range formed by those of the simulations, which gives an informal confidence region for the degree distributions. From Figure 6.1, we see that while the data roughly agrees with the simulations in the out-degree frequencies, the deviation in the in-degree frequencies is noticeable.
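The informal envelope check described above can be sketched as follows. This is a minimal illustration, assuming each network is an edge list of `(source, target)` pairs; the helper names `in_degree_freq` and `outside_envelope` are ours, and nodes with in-degree 0 are ignored because they do not appear as targets.

```python
from collections import Counter


def in_degree_freq(edges):
    """Map each in-degree i >= 1 to the number of nodes with that in-degree."""
    in_degree = Counter(target for _, target in edges)
    return Counter(in_degree.values())


def outside_envelope(data_freq, sim_freqs):
    """Degrees at which the data frequency leaves the [min, max] range of the simulations."""
    flagged = []
    for i in sorted(data_freq):
        sim_values = [freq.get(i, 0) for freq in sim_freqs]
        if not min(sim_values) <= data_freq[i] <= max(sim_values):
            flagged.append(i)
    return flagged


data_freq = in_degree_freq([(1, 2), (3, 2), (1, 3)])  # node 2 has in-degree 2, node 3 has 1
```

Here `sim_freqs` would be the list of `in_degree_freq` results for the 20 simulated networks; the out-degree check is identical with sources in place of targets.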
To better understand the discrepancy in the in-degree frequencies, we examined the link data and their timestamps and discovered bursts of messages originating from certain nodes over small time intervals. According to Wikipedia policy [23], certain administrating accounts are allowed to send group messages to multiple users simultaneously. These bursts presumably represent broadcast announcements generated from these accounts. These administrative broadcasts can also be detected if we apply the linear preferential attachment model to the network in local time intervals. We divided the total time frame into consecutive subintervals of varying length, each containing the formation of $10^4$ edges. The number $10^4$ is chosen to ensure good asymptotics as shown in Table 5.1. This process generated 155 networks. For each of the 155 datasets, we fit a preferential attachment model using MLE. The resulting estimates $(\hat\delta_{in}, \hat\delta_{out})$ are plotted against the corresponding timeline on the upper left panel of Figure 6.2. Notice that $\hat\delta_{in}$ exhibits large spikes at various times. Recall from (2.1) that a large value of $\delta_{in}$ indicates that the probability of an existing node $v$ receiving a new message becomes less dependent on its in-degree, i.e., previous popularity. These spikes appear to be directly related to the occurrences of group messages. This plot is truncated after the day 2016/3/16, on which a massive group message of size 48,957 was sent and the model can no longer be fit.
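The windowing step can be sketched as follows. This is a minimal illustration, assuming each edge is a `(source, target, timestamp)` triple and that a trailing block with fewer than `size` edges is discarded; the function name `edge_windows` is ours.

```python
def edge_windows(edges, size=10_000):
    """Split time-ordered edges into consecutive blocks of exactly `size` edges."""
    ordered = sorted(edges, key=lambda e: e[2])  # sort by timestamp
    n_full = len(ordered) // size
    return [ordered[k * size:(k + 1) * size] for k in range(n_full)]


blocks = edge_windows([(1, 2, t) for t in range(25)], size=10)  # two blocks of 10 edges
```

With 1,554,699 edges and blocks of $10^4$ edges, integer division gives 155 full blocks, matching the number of networks reported above.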
We identified 37 users who have sent, at least once, 40 or more consecutive messages in the message history. This is evidence that group messages were sent by these users. We presume these nodes are administrative accounts; they are responsible for about 30% of the total messages sent. Since their behavior cannot be regarded as normal social interaction, we excluded messages from these accounts from the dataset in our analysis. We then also removed nodes with zero in- and out-degrees.
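One way to flag such accounts is sketched below. It is only an illustration under an assumed reading of "consecutive": a run is a maximal stretch of messages in global timestamp order that all come from the same sender. The edge format `(sender, recipient, timestamp)` and the name `burst_senders` are ours, and the exact rule used in the analysis may differ.

```python
from itertools import groupby


def burst_senders(edges, min_run=40):
    """Senders with at least one run of `min_run` or more consecutive messages."""
    ordered = sorted(edges, key=lambda e: e[2])  # global time order (stable sort)
    flagged = set()
    for sender, run in groupby(ordered, key=lambda e: e[0]):
        if sum(1 for _ in run) >= min_run:
            flagged.add(sender)
    return flagged


def drop_senders(edges, senders):
    """Remove all messages sent by the flagged accounts."""
    return [e for e in edges if e[0] not in senders]
```

After `drop_senders`, nodes left with zero in- and out-degree no longer appear in the edge list, which corresponds to the removal of isolated nodes described above.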
The re-estimated parameters after the data cleaning are displayed in the other three panels of Figure 6.2. Here all parameter estimates are quite stable through time.
The reduced network now contains 112,919 nodes and 1,086,982 edges, to which we fit the linear preferential attachment model. The degree distributions of the data and 20 simulations from the fitted model are again displayed in Figure 6.3. The out-degree distribution of the data agrees reasonably well with the simulations. For the in-degree distribution, the fit is better than that for the entire dataset (Figure 6.1). However, for smaller in-degrees, the fitted model over-estimates the in-degree frequencies. We speculate that in many social networks, the out-degree is in line with that predicted by the preferential attachment model. An individual node would be more likely to reach out to others if having done so many times previously. For in-degrees, the situation is complicated and may depend on a multitude of factors. For instance, the choice of recipient may depend on the community that the sender is in, the topic being discussed in the message, etc. As an example, a group leader might send messages to his/her team on a regular basis. Such examples violate the base assumptions of the preferential attachment model and could result in the deviation between the data and the simulations.
Next we consider the estimation method of Section 4 applied to a single snapshot of the data. In order to implement this procedure, we donned blinders and assumed that our dataset consists only of the information of the wiki data at the last timestamp. That is, information about administrative broadcasts, and other aspects of the data learned by looking at the previous history of the data, are unavailable. In particular, we would have no knowledge of the existence of the two additional scenarios corresponding to $J_n = 4, 5$. With this in mind, we fit the three-scenario model using the methods in Section 4.
The comparison of the degree distributions between the data and simulations from the fitted model is displayed in Figure 6.4 and is not too dissimilar to the plots in Figure 6.1 that are based on maximum likelihood estimation using the full network data. In particular, the out-degree distribution is matched reasonably well, but the fitted model does a poor job of capturing the in-degree distribution.
We see from this example that while the linear preferential attachment model is perhaps too simplistic for the Wiki talk network dataset, it has the ability to illuminate some gross features, such as the out-degrees, as well as to capture important structural changes such as the group message behavior. Consequently, despite its limitation, this model may be used as a building block for more flexible models. Modifications to the existing model formulation and more careful analysis of change points in parameters are directions for future research.
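We show the uniform convergence of $\psi_n$ to $\psi$ in the next lemma.
Proof. By the definition of $\psi$, $p^{in}_{>i}(\delta_{in})$ is a function of $\delta_{in}$ and is a constant with respect to $\lambda$. Hence we suppress the dependence on $\delta_{in}$ and simply write it as $p^{in}_{>i}$ when considering the difference $\psi_n - \psi$ as a function of $\lambda$:
\[ \psi_n(\lambda) - \psi(\lambda) = \sum_{i=0}^{\infty} \frac{N^{in}_{>i}(n)/n - p^{in}_{>i}}{i + \lambda} - \frac{1}{\lambda} \Big( \frac{1}{n} \sum_{t=1}^{n} 1_{\{J_t = 3\}} - \gamma \Big) - \Big( \frac{1}{n} \sum_{t=1}^{n} \frac{N(t-1)\, 1_{\{J_t \in \{1,2\}\}}}{t - 1 + \lambda N(t-1)} - \frac{(\alpha + \beta)(1 - \beta)}{1 + \lambda(1 - \beta)} \Big). \]
Thus, $\sup_{\lambda \in [\epsilon, K]} |\psi_n(\lambda) - \psi(\lambda)|$ is bounded by the sum of the suprema over $\lambda \in [\epsilon, K]$ of the absolute values of the three terms on the right side. For the first term, note that for all $i \ge 0$,
\[ i\, N^{in}_{>i}(n) \le \sum_{k > i} k\, N^{in}_k(n) \le \sum_{k=0}^{\infty} k\, N^{in}_k(n) = n, \]
since the assumption on initial conditions implies the sum of in-degrees at $n$ is $n$. Therefore $N^{in}_{>i}(n)/n \le i^{-1}$ for $i \ge 1$, and it then follows that, for any $M \ge 1$,
\[ \sup_{\lambda \in [\epsilon, K]} \Big| \sum_{i=0}^{\infty} \frac{N^{in}_{>i}(n)/n - p^{in}_{>i}}{i + \lambda} \Big| \le \sum_{i=0}^{M} \frac{\big| N^{in}_{>i}(n)/n - p^{in}_{>i} \big|}{i + \epsilon} + \sum_{i > M} i^{-2} + \sum_{i > M} \frac{p^{in}_{>i}}{i}. \]
Note that the last two terms on the right side can be made arbitrarily small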