Testing the hypothesis of preferential attachment in social network formation

The hypothesis of preferential attachment (PA) - whereby better connected individuals make more connections - is hotly debated, particularly in the context of epidemiological networks. The simplest models of PA, for example, are incompatible with the eradication of any disease through population-level control measures such as random vaccination. Typically, evidence has been sought for the presence or absence of preferential attachment via asymptotic power-law behaviour. Here, we present a general statistical method to test directly for evidence of PA in count data and apply this to data for contacts relevant to the spread of respiratory diseases. We find that while standard methods for model selection prefer a form of PA, careful analysis of the best fitting PA models allows for a level of contact heterogeneity that in fact allows control of respiratory diseases. Our approach is based on a flexible but numerically cheap likelihood-based model that could in principle be applied to other integer data where the hypothesis of PA is of interest.


Contact heterogeneity in infectious disease epidemiology
Infectious pathogens that spread via contact between people are a major cause of human disease, driving attempts to understand their epidemiology []. Much theoretical work on infectious disease dynamics has been focused on the role of heterogeneity in the human population [], which is often conceptualised as a network of epidemiologically relevant Perhaps the most important quantity in any infectious disease outbreak is the basic reproductive ratio, R  , which is defined verbally as the expected number of secondary cases generated by an average primary case early in the epidemic. An epidemic is possible exactly when R  > , and typically the efforts required to control such an outbreak grow monotonically with R  [, ]. In the simplified scenario where each individual picks an integer number of contacts K from the same degree distribution, but transmission is otherwise homogeneous, This dependence of the basic reproductive ratio on the second moment of the degree distribution has been a 'textbook' result for some time [], however work by Pastor-Satorras and Vespignani [] and May and Lloyd [] raised the question of what might happen for large, or asymptotically divergent, second moments. Such questions can be posed and answered at different levels of mathematical rigour [] however in the context of epidemiology it is clear that a highly variable degree distribution is associated with the epidemiologically unrealistic scenario that even the most weakly transmissible pathogen can cause an epidemic, and as a corollary that control of any infectious disease through untargeted vaccination would be impossible.

Data
Of course, whether such a theoretical possibility matters for the study of infectious diseases depends on the actual variance in degree for epidemiologically relevant contacts. Pr(node has k links) ≈ k -γ .
(  ) As is the case for almost all biological data, there is much more complexity in the data than such a simple parametric relationships as () would imply. For example, Leigh Brown et al.
[] found that while a power-law was a better functional form than the negative binomial for sexual contacts, the richer Waring distribution was preferable to either. What is hard to dispute, however, is that as better quality data on epidemiologically relevant contacts is obtained the evidence consistently points to a very high level of variance; for example, Read et al. [] were able to validate the high numbers of contacts reported by some study participants through direct (rather than postal) surveying. These empirical observations of high heterogeneity in contact number, together with theoretical results about R  , present a paradox for infectious disease epidemiology: is the extreme heterogeneity in observed contact patterns indicative of PA and does that imply that R  >  for almost any finite level of person-to-person transmissibility meaning that our theoretical understanding of infectious disease epidemiology is somehow severely lacking?

Preferential attachment and power laws in empirical data
Recent years have seen a debate about the level of heterogeneity that exists in a variety of observed networks. A particularly influential paper by Barabási and Albert [] considered a model of network formation in which many new nodes are added to a small existing network. These new nodes connect preferentially to nodes that have more links in the existing network, leading to the asymptotic result () with γ = . In this way preferential attachment is intimately linked with, but not always equivalent to, asymptotic power-law behaviour.
Simple power-law relationships have been claimed for numerous real-world systems, and a critical review of these claims by Clauset  The debate around presence or absence of power laws in real data continues, perhaps most strongly in the context of networks. For example, Barabási [] writes that preferential attachment is network science's "most profuse concept, " and that "the impact of preferential attachment is hard to miss. " At the same time, Stumpf and Porter [] argue that "most reported power laws lack statistical support and mechanistic backing. "

Testing preferential attachment directly
In this work, we attempt to test the hypothesis of preferential attachment in social contact data directly, rather than via asymptotic power law behaviour. We make use of previously collected data on social encounters specifically designed to measure heterogeneity in numbers of contacts amongst the British population, and fit mechanistic models of different complexity to these data. We determine that models with significant levels of preferential attachment have better evidential support from the data than models without.

Social Contact Survey data
A cross-sectional study was conducted between May  and October , recruiting households and individuals through postal and online questionnaires, supported by a large random-address mailshot and a modest online and media promotion [, ]. Questionnaires asked respondents to report on the number of distinct individuals they encountered the previous day: their contacts. Respondents were able to report contacts either as individuals or as members of a group with a reported size. Allowing the reporting of groups of individuals was a deliberate methodological design to permit the easy reporting of large numbers of contacts, to avoid the approach taken by previous studies [], which imposed a high burden on respondents with large number of contacts, and to ensure the best capture of the right-hand tail of the degree distribution. In general, we expect that such data will become increasingly available due to the epidemiological importance of this tail (e.g. the study of Read et al. []).
In total, completed questionnaires were received from , participants in Great Britain, , of which were from postal surveys. There was some bias in demographical representation, most notably younger age groups and males were generally underrepresented (see Danon et al. [] for more details). The data is available at http://wrap. warwick.ac.uk//.

Generalised preferential attachment
As noted by Durrett [], Barabási [], and Simkin and Roychowdhury [], the basic idea behind the preferential attachment model is close to the population model of Yule []. We consider a Yule-like stochastic process described precisely as follows. In a population of N individuals indexed by i each individual has an integer-valued random variable K i (t) for its number of contacts. Individual i starts with K i () =  and makes new contacts over a time period T i , which is given by a positive real-valued random variable with probability density function ρ(t). The generation of new social contacts is modelled by a continuoustime Markov chain with the following events and rates: We take the preferential attachment hypothesis PA to be stated mathematically as Writing p k (t) for the probability that K i = k at time t < T i , we can use the method of characteristics to derive an expression for the probability generating function of K i , From this, we can derive expressions for the probability mass function, where κ t is a function of t but not k and the asymptote holds as k becomes large. This is not a simple power-law relationship, and so the asymptotic behaviour of the moments is not determined by the power-law exponent, but rather through the moment generating function M(t, z) = g(t, e z ), z ∈ (-∞, ], such that the rth moment of the degree distribution, conditional on t < T, is In particular, Then accounting for the randomness of the times, the rth moment of the degree distribution will be We will be interested in the empirical evidence for whether such moments converge or diverge, in light of the epidemiological relationship ().

Phase-type holding times
The question is then posed as to an appropriate distribution from which to draw the holding times {T i } for the amount of time spent making new contacts on the day for which individuals provide data. In previous work [] on a related model of contact formation we considered holding times T i that were log-normally distributed. This provided a good fit to data, but was computationally intensive and lacked a mechanistic interpretation. We therefore consider here a class of distributions for the holding times that is highly flexible, but which has analytic and numerical benefits -the distributions of phase type []. Phase-type distributions are dense in the space of positive-valued probability distributions [], meaning that they can be made arbitrarily close to any other distribution. They have a mechanistic interpretation and allow for analytic manipulations that greatly reduce the numerical cost of likelihood evaluation. The basic idea behind the model is shown in Figure . A set of phases is indexed by a, b = , . . . , m; the probability of starting in phase a is ν a (meaning these parameters must sum to unity); the rate of stopping making new social contacts is μ a for an individual in phase a; and the rate of moving from phase a to phase b is Q a,b . Note that different rates of making contacts in different phases are not realistically distinguishable from different times spent and so are not included as parameters. The phases have a mechanistic interpretation as the activities that individuals undertake on a given day.
In this model, the probability density function for the holding time is given by the general expression where: From the expressions (), (), () and () above, in particular through inspection of the form of the moment generating function, it is clear that the rth moment of the degree distribution will involve a term like where I is the identity matrix. Let A = M + rτ I; this matrix is triangular and so its eigenvalues are equal to its diagonal elements; in particular the ath eigenvalue of A is λ a = -M a + rτ . If we let R be a matrix whose ath column is the ath right eigenvector of A and L be a matrix whose ath row is the ath left eigenvector of A then where D is a diagonal matrix whose ath diagonal element is e λ a t . The integral I r therefore converges exactly when ∀a, λ a < , which implies that the condition for divergence of the rth moment is In general, however, combination of () and () is not the most numerically efficient method for calculation of the overall probability mass function for final number of contacts K i (T i ) and a different approach is needed.

Numerically efficient model solution
The model as described above can be solved in a numerically efficient manner using the spectral methods for continuous-time Markov chains developed by Bailey []. We consider the limit as the population size N → ∞ and write down ordinary differential equations (ODEs) for the proportion of the population in phase a and with k social contacts at time t, p a,k (t). These ODEs take the form where f k is the rate at which individuals with k social contacts make new contacts, given in (). We are then interested in d k , the probability mass function for a randomly selected individual having made k social contacts by the end of the process. A series of manipulations directly analogous to those of Bailey [] shows that Applying Laplace transformation to () subject to the initial condition p a,k () = ν a δ k, and taking the frequency-space limit s ↓  then leads to a set of linear equations for d k that are triangular and so can be evaluated directly without numerically costly matrix inversion: These equations are at the root of the numerical efficiency of our model. Note that we use Laplace transformation mainly for technical reasons and our results could be obtained by directly integrating () if one were not concerned by all quantities being rigorously defined.

Model likelihood, fitting and selection
We assume a vector of data y = (y k ), where y k is the number of individuals reporting k social contacts when surveyed. A model M is therefore specified by a number of phases m and the presence or absence of PA, meaning the general parameters are θ = (τ , ν a , μ a , Q a,b ), with τ present only if there is PA. The number n of individuals sampled from the British population N is and so it is appropriate to assume that each individual picks a number of contacts independently from the distribution with pmf given by d k as in (). Accounting for the censoring of zero contacts in the real data, we definẽ meaning that the overall likelihood function is then given by Note that the combinatorial factors do not depend on the parameters, and so need not be calculated during model fitting.
We consider the use of the likelihood function () using standard statistical methodology. Numerical maximum likelihood estimation was performed using simulated annealing run from multiple starting points to ensure the global optimum was obtained. Model selection was performed using AIC [] and BIC [], as well as likelihood ratio tests [] on pairs of models where this test was informative. This was done since each approach involves different trade-offs between model fit and complexity, and to check that our conclusions about PA are not overly sensitive to the precise method used. Uncertainty in model parameters was quantified using confidence intervals obtained through bootstrapping the data, and uncertainty in model outputs such as the predicted degree distribution was quantified using a parametric bootstrap. Table  shows the models we fitted, their number of parameters, AIC/BIC relative to the minimum, and the first moment that diverges according to (). Figure  shows the results of performing likelihood ratio tests. These show that AIC prefers a -phase model with PA as do likelihood ratio tests for any significance level between .% and %. BIC penalises complex models more severely and therefore selects a -phase model with PA.

Results and discussion
Therefore, regardless of the number of phases selected by different approaches to model selection, we see that the models with PA are preferred over models without. Figure  shows the predictions of the models preferred by different selection criteria, as well as the models with the same number of phases but no PA, against real data. We see in the left column of plots that for the -phase models, the main difference is in the tail of the distribution as we would expect. In the -phase models shown in the right column of plots, the model without PA also smooths over features in the bulk of the distribution compared to the model with PA. Preferred models are shown using square brackets and bold type. and if we set τ =  but leave the other parameters at their fitted values, then the total number of contacts per person is reduced to % of its original value. This shows that in both of these models, we can attribute a substantial fraction of the contacts to PA.
We also calculate that the second moment does not diverge in any of the fitted models, which helps to resolve the epidemiological paradox that we introduced at the start of this paper. PA is empirically supported, and is also mechanistically plausible since existing social contacts give more opportunities for future social contact. Combined with a sufficiently detailed phase-based mechanistic model of the contexts in which social contacts are made, however, PA does not imply a divergent second moment for the distribution of contacts relevant for the spread of directly transmitted infections. This means that our understanding of how basic epidemiological quantities like the basic reproductive ratio, R  , are related to contact networks does not need to be revised in the light of empirical evidence. As a final observation, we believe that as computational resources for fitting models to data improve, it will in general be easier to test the hypothesis of PA directly in all kinds of data, rather than looking for asymptotic power laws. count, LL is the log-likelihood given the parameters and P is a the probability vector for the observed counts given the parameters.