Predicting Adolescent Social Networks to Stop Smoking in Secondary Schools

Social networks are increasingly being investigated in the context of individual behaviours. Research suggests that friendship connections have the ability to influence individual actions, change personal opinions and subsequently impact upon personal wellbeing. This paper explores the effect of individual friendship selection decisions, and the impact they may have on the overall evolution of a social network. Using data from a large smoking cessation programme in secondary schools, an agent based simulation aiming to predict the evolution of the adolescent social networks is created. The simulation uses existing friendship selection algorithms from link prediction literature, along with a new approach to link prediction, termed PageRank-Max. This new algorithm is based upon the optimisation of an individual's eigen-centrality, and is found to be more successful than existing methods at predicting the future state of an adolescent social network. This research highlights the importance of eigen-centrality in adolescent friendship decisions, and the use of agent-based simulation to conduct behavioural investigations. Furthermore, it provides a proof-of-concept for targeted interventions driven by social network analysis, demonstrating the utility of using emerging sources of social network data for public health interventions, such as those addressing tobacco use, which is a major global health challenge.


Introduction
Investigation into individual behaviours in relation to social networks has experienced substantial growth in recent years. This is in part due to the availability of social network data as a result of social networking sites such as Facebook, Twitter and Google+, and the computing advancements that allow for the exploration of such large data sets (Kwak et al., 2010; Mislove et al., 2008; Salter-Townshend, 2012). This paper is concerned with the individual decisions that cause social network evolution in adolescents, and applies this to data from a large smoking cessation programme in secondary schools. Smoking is a major global health challenge and tobacco use is said to kill 6 million people worldwide per year (World Health Organisation, 2015). More than 5 million of those deaths are the result of direct tobacco use, while more than 600 000 are the result of non-smokers being exposed to second-hand smoke. Secondary schools are a common point at which people start smoking with, for example, two-thirds of smokers in the UK starting before the age of 18 (Action on Smoking and Health (ASH), 2016). Quitting smoking is notoriously difficult; among all current U.S. adult cigarette smokers, nearly 7 out of every 10 (68.8%) reported that they wanted to quit but were so far unable to do so (Centers for Disease Control and Prevention, 2016). Smoking increases the risk of serious health problems, many diseases, and death (Centers for Disease Control and Prevention, 2014).
The theory of friendship decisions amongst adolescents has been widely researched, with factors such as proximity (Festinger et al., 1950), reciprocation (Parker and Seal, 1996) and similarity (McPherson et al., 2001) discussed as important. Often studies such as these are based on qualitative evidence, with scientific experts drawing conclusions based on retrospective analysis. Our research discusses the development of an Agent Based Simulation (ABS) model which allows for the testing of behavioural theory relating to friendship. Through the use of specifically selected algorithms, drawn from the link prediction literature, a prediction of the future state of a social network can be made. The predicted future social network may then be compared with the real social network for accuracy, with conclusions drawn around the implemented behavioural theory.
Simulation provides a tool to explore the evolution of a system, scrutinise theory and evaluate potential outcomes. Within the domain of OR, simulation is a core tool utilised for research, lending itself to applications such as manufacturing, defence and healthcare (Pidd, 2004). ABS is a particular paradigm of simulation, which aims to take an individualistic view of system evolution (An, 2012). ABS is a micro-simulation technique, modelling the individual behaviours of specific objects in a system to understand the emergent global phenomena (Niazi and Hussain, 2011).
ABS investigations related to social networks have covered a variety of topics. Epidemiology in particular has adopted ABS techniques to explore the spread of infectious diseases through networks, including HIV spread in Amsterdam (Mei et al., 2010), Influenza in a metropolitan social network (Mao, 2014) and H1N1 on a Chinese university campus (Mei et al., 2010). ABS has also been used in the investigation of network structure, as opposed to its effects, although the number of papers in this area is far fewer. Pujol et al. (2002) use agents to extract reputation in a social network topology, Han et al. (2014) explore hierarchical geographical network structures and Bernstein and O'Brien (2013) use ABS to generate 'realistic' social network data sets; however, these studies do not utilise empirical social network data. Given the individual perspective of ABS, and the ability to quantify the impact on a system of the interactions of its constituent parts, ABS appears an appropriate method to explore the behavioural factors influencing the evolution of adolescent social networks.
The motivation to adopt a quantitative simulation-based research approach to adolescent friendships, as presented in this paper, is that it appears to be an unexplored niche in social network literature. More specifically, the ability to implement link prediction methods in an ABS framework for adolescent social networks provides a novel contribution to the literature. Furthermore, we provide a proof-of-concept for targeted interventions driven by social network analysis, demonstrating the utility of using emerging sources of social network data for public health interventions.
This research also contributes to the growing body of work in Behavioural Operational Research (BOR), which is defined as the study of behavioural aspects related to the use of OR methods in modelling, problem solving and decision support (Hämäläinen et al., 2013). BOR may broadly be considered within three categories: behaviour in models (methods), behaviour with models (actors) and behaviour beyond models (praxis) (Franco and Hämäläinen, 2017). Our work is firmly grounded in incorporating behaviour within models (methods). Furthermore, as comprehensive reviews of the application of OR to healthcare (Brailsford et al., 2009; Hulshof et al., 2012) reveal, relatively little prior consideration has been devoted to behavioural aspects in this field. Hence this paper also aims to demonstrate the use of BOR for healthcare applications.
The remainder of the paper is structured as follows. In Section 2 we introduce the data from the smoking in schools programme. Section 3 outlines the chosen network structures utilised within this research, whilst link prediction methods are introduced in Section 4. The developed ABS is described in Section 5. A new method for link prediction, PageRank-Max, is proposed in Section 6, validated in Section 7, and compared against the other methods in the results in Section 8. Conclusions are drawn in Section 9.

Case Study
There are significant global challenges to reducing smoking from a public health perspective. The World Health Organization (WHO) has created the Tobacco Free Initiative (TFI) which aims to "reduce the global burden of disease and death caused by tobacco, thereby protecting present and future generations from the devastating health, social, environmental and economic consequences of tobacco use and exposure to tobacco smoke" (World Health Organisation, 2016). Many of the TFI's actions are aimed at adolescents, given that this is a common time in life at which people start smoking. It is therefore vital to intervene at this age given the addictive nature of tobacco and the longer-term health effects.
Our conceptual approach to the problem is to predict social networks in order to support more targeted interventions that reduce the uptake of smoking amongst adolescents. The case study data is taken from "A Stop Smoking in Schools Trial" (ASSIST) and explores the effects of social networks upon attitudes toward adolescent smoking, with a view to informing potential cessation proliferation methods. Formed through a joint venture between Cardiff University Institute of Society, Health and Ethics and The Department of Social Medicine at the University of Bristol, UK, ASSIST was designed as a peer-led intervention, formulated around the 'Gay Hero' work of Kelly (Kelly et al., 1992). Schools from across the West of England and South Wales were recruited to the study through stratified randomisation, following a cohort of Year 8 students (12-13 year olds) over the course of a three and a half year period (Holliday, 2006).
Three waves of social network data were collected at one year intervals for 18 schools in the study. Each participant was asked to name up to six other students with whom they shared a friendship. From this data, a school based social network may be constructed, describing friendship evolution over the course of the study. The students' ability to identify only up to six friendships may be considered a limitation of the study; however, the work of Kirke (1996) and Pearson and Michell (2000) suggests that friendships ranked below the top six connections do not carry equal significance. Additionally, the average number of friendship nominations in the data across the three time points was calculated as 3.8 ($T_1$), 4.3 ($T_2$) and 3.8 ($T_3$), suggesting students often did not opt to maximise their number of friendship nominations. Given that the objective of this research is to predict social network structure to identify future influence, the friendship nomination limit is unlikely to substantially impact the conclusions of this research.
From the 18 schools, 12 are classified as control and 6 as intervention. Socially prominent individuals identified in the adolescent social networks within the intervention schools were given training to diffuse a 'stop smoking' message to their peers (Audrey et al., 2004). An example of the data from one school may be observed in Figures 1 and 2, demonstrating the evolution of the social network over time (the friendship network at years 1 and 2 is shown in Figures 1 and 2 respectively).
Figures 1 and 2 show network patterns and evolutions that were seen in many of the control schools. That is, over time the prevalence of smoking increases and smokers tend to cluster together as friends. The findings of Campbell et al. (2008) suggest a reduced smoking prevalence in intervention schools in the early stages of the trial. Overall, the researchers concluded that ASSIST was a success, providing a cost-effective method for increasing adolescent smoking cessation (Hollingworth et al., 2012).

Network Structures
This section introduces the essential graph theoretic and network science definitions that are used to inform the research in the development of the ABS (Section 5). As our study is concerned with the investigation of social networks, and ultimately the development of a new algorithm to predict social network evolution, the relevant metrics to analyse and interpret network structures are required.
An undirected graph is defined as a pair $G = (V, E)$ of sets such that $E$ is a subset of the unordered pairs of $V$, where $V$ is the set of vertices (or nodes) and $E$ represents the set of edges (or links). A directed graph (or digraph) may be defined in the same manner, except that $E$ is a subset of the ordered pairs of $V$.
The order of $G$ is defined as the number of elements in the set of vertices $V$, denoted by $|G|$; thus $|G| = |V(G)|$ (Bollobas, 2013). For simplicity, the number of vertices for any particular graph $G$ shall be referred to as $n$. A social network may be represented as a directed or undirected graph. A directed graph offers a rich source of information, both in terms of the qualitative implications of friendship, and the quantitative metrics of network calculation.
For an undirected graph, an edge $\{i, j\}$ links the vertices $v_i$ and $v_j$ and may be represented by $ij$. A directed network edge preserves the order in which a link is made, such that an edge $(i, j)$ implies a link from $v_i$ to $v_j$, denoted by $i \to j$; it therefore cannot be assumed that the link $j \to i$ exists. A number of the metrics defined later require the maximum number of edges ($e_{max}$) of a graph; for an undirected graph $e_{max} = n(n-1)/2$, and for a directed graph $e_{max} = n(n-1)$.
With the basic elements of a graph defined, we next introduce four commonly used network metrics that we utilise (in the results, Section 8) to compare the performance of predicting social network evolution under different link prediction measures (which will be defined in later sections).

Average Degree
The degree of a vertex $v_i$ is denoted as $\deg(v_i)$ and represents the number of incident edges of $v_i$. In a directed network, these may be separated further in terms of in-degree $\deg_{in}(v_i)$ and out-degree $\deg_{out}(v_i)$, defined as the count of the inward links and outward links of $v_i$ respectively (Newman, 2003).
In terms of network cohesion, and as a representation of the graph as a whole, the average vertex degree may therefore be calculated by:
$$\overline{\deg}(G) = \frac{1}{n} \sum_{i=1}^{n} \deg(v_i) \qquad (1)$$
where $\deg(v_i)$ is replaced by the directed network equivalent (if required).
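As a minimal sketch of the average degree calculation, the Python fragment below takes the mean of the out-degrees and of the in-degrees of a small directed network. The four-node adjacency matrix `ADJ` and all function names are illustrative assumptions, not the ASSIST data or the authors' implementation.

```python
# Hypothetical directed friendship network (an assumption for illustration):
# ADJ[i][j] == 1 means node i nominates node j as a friend.
NODES = ["A", "B", "C", "D"]
ADJ = [
    [0, 1, 1, 1],  # A -> B, C, D
    [1, 0, 0, 0],  # B -> A
    [1, 1, 0, 0],  # C -> A, B
    [0, 1, 1, 0],  # D -> B, C
]


def out_degrees(adj):
    """Number of outgoing links per node (row sums)."""
    return [sum(row) for row in adj]


def in_degrees(adj):
    """Number of incoming links per node (column sums)."""
    return [sum(col) for col in zip(*adj)]


def average_degree(degrees):
    """Mean of a degree sequence, i.e. (1/n) * sum of deg(v_i)."""
    return sum(degrees) / len(degrees)


avg_out = average_degree(out_degrees(ADJ))
avg_in = average_degree(in_degrees(ADJ))
```

In a directed network the two averages coincide, since every link contributes one out-degree and one in-degree.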

Reciprocity
A directed graph's in-degrees and out-degrees allow for incident edges to be unreciprocated. In terms of a social network, this could mean the node $v_i$ extending a link to $v_j$ without the link $j \to i$ being in existence. This provides a representation of network cohesion, termed reciprocity.
A reciprocated tie is one in which, for the vertices $v_i$ and $v_j$, the links $i \to j$ and $j \to i$ both exist. The overall reciprocity of the directed graph $G$ is said to be:
$$r = \frac{|L|}{|E|} \qquad (2)$$
where $L$ is the set of edges involved in reciprocal ties. As such, $r \in [0, 1]$, meaning that $r = 1$ signifies a fully reciprocated graph (Newman et al., 2002). Both average degree and reciprocity are used in this paper to measure network cohesion.
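The reciprocity measure can be sketched as the share of directed edges whose reverse edge also exists. The adjacency matrix below is a hypothetical four-node example and the function name is an assumption made for illustration.

```python
# Hypothetical directed network (illustrative assumption, not the ASSIST data):
# ADJ[i][j] == 1 means a link i -> j exists.
ADJ = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]


def reciprocity(adj):
    """Fraction of directed edges i -> j for which j -> i also exists."""
    n = len(adj)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    edges = sum(1 for i, j in pairs if adj[i][j])
    reciprocal = sum(1 for i, j in pairs if adj[i][j] and adj[j][i])
    # An empty graph has no ties to reciprocate; 0.0 is returned by convention here.
    return reciprocal / edges if edges else 0.0


r = reciprocity(ADJ)
```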

Transitivity Ratio
For a directed graph, a transitive triple is defined to be a sequence of edges such that $i \to j$, $j \to k$ and $i \to k$ exist (Wasserman, 1994). A subgraph is defined as $G' = (V', E')$ with $V' \subseteq V$ and $E' \subseteq E$. In an undirected graph, a triangle may be considered as a complete subgraph containing three nodes of $G$, where the number of triangles containing $v_i$ is defined to be $\delta(v_i)$ (Schank and Wagner, 2005), so that the number of triangles in $G$ is $\delta(G) = \frac{1}{3} \sum_{i} \delta(v_i)$. The number of all possible triangles in $G$ (connected triples of nodes) is denoted by $\tau(G)$, therefore the transitivity ratio $T(G)$ may be calculated by:
$$T(G) = \frac{3\,\delta(G)}{\tau(G)}$$
For a directed graph, edges are converted into undirected associations (Luce and Perry, 1949).
This measurement calculates the proportion of "closed triangles" of nodes, in relation to all connected triples of nodes. This gives a representation of how clustered the network is, offering an indication of mutual relations. Other interpretations of graph transitivity have been suggested; for example, the global clustering coefficient and the local clustering coefficient (Watts and Strogatz, 1998), both of which are said to suffer from bias (Soffer and Vázquez, 2005). Given its simple and effective nature, coupled with the avoidance of inherent bias associated with other methods, the transitivity ratio has been selected as the metric of choice for quantifying network clustering within this research.
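A small sketch of the transitivity ratio, computed as closed connected triples over all connected triples after treating every edge as undirected. The example edge list (a triangle with one pendant node) and the function name are hypothetical.

```python
from itertools import combinations


def transitivity_ratio(edges):
    """Closed connected triples divided by all connected triples.

    Each triangle is counted once per centre node, so the numerator
    equals three times the number of triangles.
    """
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # self-links are ignored
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)  # direction is discarded

    triples = 0
    closed = 0
    for centre, nbrs in neighbours.items():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1  # a - centre - b is a connected triple
            if b in neighbours[a]:
                closed += 1  # the triple is closed by the edge a - b
    return closed / triples if triples else 0.0


# Hypothetical example: triangle 0-1-2 plus a pendant node 3 attached to 2.
EXAMPLE_EDGES = [(0, 1), (1, 2), (0, 2), (2, 3)]
t_ratio = transitivity_ratio(EXAMPLE_EDGES)
```

In this example there is one triangle and five connected triples, giving a ratio of 3/5.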

Average Path Length
Travelling a sequence of nodes via a graph's incident edges is described as navigating a path. A path is a graph $P$ of the form $V(P) = \{v_0, v_1, \dots, v_l\}$, $E(P) = \{v_0 v_1, v_1 v_2, \dots, v_{l-1} v_l\}$. The end vertices are $v_0$ and $v_l$, therefore the path may be denoted by $v_0 - v_l$. In a directed graph, the direction of the edges dictates the direction of the path (Bollobas, 2013).
The path of a network plays an important role in the description of reachability between nodes. For example, if a path exists between the nodes $v_i$ and $v_j$ then these nodes are said to be reachable (Holme, 2005). In a fully connected graph, every node is reachable. Social networks are unlikely to ever achieve complete reachability, even less so if the network is directed (Barabási et al., 2000). To garner an overall picture of the reachability between pairs of nodes, one must consider the geodesic, the shortest path connecting two vertices $v_i$ and $v_j$ (Harary, 1994).
The average path length (APL) $l_G$ for $G$ is described as the sum of the shortest distances between the nodes $v_i$ and $v_j$, denoted as $d(v_i, v_j)$, divided by the maximum possible number of edges ($e_{max}$) (Newman, 2001):
$$l_G = \frac{1}{e_{max}} \sum_{i \neq j} d(v_i, v_j)$$
where the sum runs over pairs of distinct vertices (ordered pairs in a directed graph, unordered pairs in an undirected graph). APL is a robust measurement of network topology, often quoted as the main factor in the classification of network type (Fronczak et al., 2004).
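The APL of a directed network can be sketched with a breadth-first search from every node, summing the shortest distances and dividing by $n(n-1)$. The adjacency matrix is a hypothetical example, and the treatment of unreachable pairs (they simply add nothing to the sum) is an assumption of this sketch rather than something the text specifies.

```python
from collections import deque

# Hypothetical directed network (illustrative assumption):
# ADJ[i][j] == 1 means a link i -> j exists.
ADJ = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]


def shortest_distances(adj, source):
    """Breadth-first search returning {node: distance} for reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, linked in enumerate(adj[u]):
            if linked and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Sum of geodesic distances over ordered pairs, divided by n(n-1).

    Assumption: unreachable pairs contribute nothing to the sum.
    """
    n = len(adj)
    e_max = n * (n - 1)  # maximum number of edges in a directed graph
    total = 0
    for source in range(n):
        dist = shortest_distances(adj, source)
        total += sum(d for node, d in dist.items() if node != source)
    return total / e_max


apl = average_path_length(ADJ)
```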

Link Prediction Algorithms
Link prediction is the process of attempting to foresee connections that are yet to be established (Liben-Nowell and Kleinberg, 2007). Given a graph $G_t(V, E)$ of $n$ nodes/vertices ($V$ with vertices $v_i$) and a set of links/edges ($E$ with edges $e_i$) at time $t$, an attempt is made to arrive at $G_{t+1}$ through the evaluation of possible new edges $e_{i,j}$ between vertices $v_i$ and $v_j$.
Link prediction algorithms have a variety of applications, including: optimisation of website navigation (Zhu et al., 2004), the recommendation of content to web users (recommender systems) (Huang et al., 2005), and the acceleration of academic collaboration (Farrell et al., 2005). Methods employed in conjunction with the link prediction problem include machine learning (Goldenberg et al., 2003; Hasan et al., 2006), Markov methods (Taskar et al., 2003; Domingos and Richardson, 2007) and statistical inference (Popescul and Ungar, 2003). It is widely accepted that the task of accurately predicting links is difficult (Getoor, 2003; Taskar et al., 2003), in part due to the a priori probability of a link being small (Getoor and Diehl, 2005).
The seminal paper of Liben-Nowell and Kleinberg (2007) discusses link prediction specifically applied to social networks, by applying a range of widely accepted methods to predict new academic collaboration data. This review was later augmented by Lü and Zhou (2011), expanding the algorithms tested and using alternative collaboration data. However, the motivations for academic collaboration may be different to the friendship selection methods of adolescents. Additionally, the analysis does not consider the impact of the links being formed and thus influencing the formation of other links in the network. The possibility of breaking links is also absent from the work of Liben-Nowell and Kleinberg (2007) and Lü and Zhou (2011), as the research was concerned only with the formation of new collaborations. Thus, the ability to implement link prediction methods in an ABS framework for adolescent social networks provides a novel contribution to the literature. Four prediction methods have been selected for the purpose of our research: Adamic/Adar (Section 4.1), Katz (Section 4.2), SAB Modelling (Section 4.3) and PageRank (Section 4.4). These methods have been selected for comparison with a newly developed algorithm that we propose in Section 6, PageRank-Max. A summary of each method now follows, aided by an example based upon the illustrative network shown in Figure 3 (where appropriate).

Adamic/Adar
The Adamic/Adar (AA) method was originally developed to quantify how similar webpages were in terms of content, specifically focusing upon personal web pages; Adamic and Adar (2003) theorised that if the content of two pages is similar, a connection between them is more likely to appear. The authors based their theory upon the notion that friends tend to be similar to one another (Feld, 1981; Carley, 1991), therefore making connections more probable.
To perform the AA method, the neighbourhood, $\Gamma(i)$, of each individual, $i$, is required; $\Gamma(i)$ being the set of individuals with whom $i$ shares a connection. A score is calculated for each link ($ij$) that is not present (unobserved) in the network, such that:
$$\mathrm{Score}[i,j] = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log |\Gamma(z)|}$$
where $z$ is a vertex connected to both $i$ and $j$.
The AA score for ij is therefore based upon the number of connections an individual z (who is a friend of both i and j) possesses.If z has a small number of connections, then having z as a common neighbour of both i and j is rarer than if z had a high number of connections.As such, rarer common neighbours increase Score[i,j] meaning that a link between i and j is more likely.
The following example illustrates the mechanism by which AA makes a link prediction:
• Taking the social network of Figure 3
• The scores for the remaining unobserved links (B → D, C → D and D → A) are also calculated. The resultant scores are ranked and the links with the highest scores are most likely to develop according to the AA link prediction method.
The example presented is conducted upon a directed network; however, the AA method does not consider the effect of reciprocation, a reciprocated tie being one in which the links i → j and j → i both exist, as previously defined for equation (2). Returning to our example, the calculated Score[B, C] for the unobserved link B → C does not consider that the link C → B exists; this ignores the fact that agent B may wish to reciprocate the link with C, basing the strength of the "relation" purely upon the size of the neighbourhood of A.
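The AA scoring step can be sketched as below. The network `LINKS` is a hypothetical four-node example chosen only so that B → C, B → D, C → D and D → A are the unobserved links named in the text; it is not taken from Figure 3 itself. Two further assumptions are made: the neighbourhood of a node is taken to be its set of outgoing links, and a common neighbour with a single connection is skipped because log(1) = 0 would otherwise cause a division by zero.

```python
import math

# Hypothetical network (assumption): LINKS[i] is the set of nodes i links to.
LINKS = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"B", "C"},
}


def adamic_adar(gamma, i, j):
    """Sum of 1 / log|gamma(z)| over the common neighbours z of i and j."""
    score = 0.0
    for z in gamma[i] & gamma[j]:
        size = len(gamma[z])
        if size > 1:  # assumption: neighbours with log(size) == 0 are skipped
            score += 1.0 / math.log(size)
    return score


# Every ordered pair without an existing link is a candidate (unobserved) link.
unobserved = [(i, j) for i in LINKS for j in LINKS if i != j and j not in LINKS[i]]
scores = {pair: adamic_adar(LINKS, *pair) for pair in unobserved}

# Candidate links ranked from most to least likely under AA.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions Score[B, C] depends only on the size of the neighbourhood of A, which mirrors the point made in the text that reciprocation of the existing link C → B plays no part in the score.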

Katz
Developed by Katz (Katz, 1953) as a method to identify individuals of status within a group "free from the deficiencies of popularity contest procedures", the method examines not only the number of "popularity votes" an agent receives, but also the popularity of the voting individuals.As such, Katz argues that a more accurate perception of high status individuals in a group may be garnered.With respect to link prediction, the popularity votes referred to by Katz may be considered as connections in a network.
To perform the Katz method, the sociomatrix, $X$, of a network is required. It is well known that the paths between individuals in a social network may be found by exploiting the powers of the relevant adjacency matrices (Festinger, 1949). For matrices with binary entries (such as $X$), non-zero elements $x^{(2)}_{ij}$ of the matrix $X^2$ indicate the number of paths of length two present between agents $i$ and $j$; similarly, a non-zero element $x^{(3)}_{ij}$ of the matrix $X^3$ indicates the number of paths of length three between agents $i$ and $j$, with higher powers having corresponding interpretations. In terms of link prediction, a score for an unrealised link $i \to j$ is calculated as:
$$\mathrm{Score}[i,j] = \sum_{l \geq 1} \varphi^{l} \, |\mathrm{paths}^{\langle l \rangle}_{i,j}|$$
where $|\mathrm{paths}^{\langle l \rangle}_{i,j}|$ represents the number of paths of length $l$ between $i$ and $j$, and $\varphi$ is the selected dampening factor. The selection of $\varphi$ must satisfy the condition $\varphi < 1$, with $1/\varphi$ being the smallest integer value greater than the largest eigenvalue of matrix $X$.
The Katz method, much like the AA method, assumes undirected network connections, with the underlying concept assuming that popular individuals are more likely to connect with one another, shortening the overall average shortest path length of the network. To illustrate the calculation of the Katz method, an example using the social network of Figure 3 is as follows:
• For the calculation of the Katz method, the 4×4 sociomatrix $X$ of Figure 3 is required, in which $x_{2,4}$, $x_{3,4}$ and $x_{4,1}$ are zero, indicating the potentially unobserved links.
• As the number of agents $n = 4$, the maximum path length for an indirect connection between agents is 3. Therefore the powers of $X$ up to $n - 1$ are calculated.
• The value $\varphi$ is selected by finding the maximal eigenvalue ($\lambda$) of $X$. As $\lambda = 1.950$ (3 d.p.), the value of $1/\varphi$ is taken to be 2, allowing $\varphi = 0.5$; this satisfies the requirements of $\varphi < 1$ and $1/\varphi$ being the smallest integer value greater than the characteristic root of $X$.
• Taking once again the unobserved link of B → C, the Score[B,C] is calculated from these matrix powers and $\varphi$.
• The remaining unobserved link scores are calculated in the same manner and ranked accordingly. The links with the highest scores are those which are most likely to occur at a subsequent timestep.
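The Katz calculation outlined above can be sketched as a truncated, damped sum of matrix powers. The sociomatrix `X` below is a hypothetical four-node network chosen so that B → C, B → D, C → D and D → A are absent; it is an assumption and not a reproduction of Figure 3. The value φ = 0.5 and the truncation at length n − 1 = 3 follow the worked example in the text. Note that matrix powers count walks, which may revisit nodes.

```python
# Hypothetical sociomatrix (assumption); rows and columns are ordered A, B, C, D
# and X[i][j] == 1 means a link i -> j exists.
X = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_scores(x, phi, max_length):
    """Sum over l = 1..max_length of phi**l times the (i, j) entry of x**l."""
    n = len(x)
    scores = [[0.0] * n for _ in range(n)]
    power = x
    for length in range(1, max_length + 1):
        if length > 1:
            power = mat_mul(power, x)  # power now holds x**length
        for i in range(n):
            for j in range(n):
                scores[i][j] += (phi ** length) * power[i][j]
    return scores


A, B, C, D = range(4)
katz = katz_scores(X, phi=0.5, max_length=len(X) - 1)
score_bc = katz[B][C]  # score for the unobserved link B -> C
```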

Stochastic Actor Based
The Stochastic Actor Based (SAB) modelling approach is not a static method such as those of AA and Katz. Rather, Snijders (1996) defines the SAB approach to be a class of models for longitudinal network data, with 'actors' within the network utilising heuristics to optimise their individual goals, subject to a selection of constraints. Discrete observations of a network are explored, with the evolution of social ties from $G_t$ to $G_{t+1}$ a result of many small changes occurring between the specified time periods (Carrington, 2005); the observed networks are assumed to be the result of a Markov process in continuous time.
Consider $T$ observations of a social network, represented as the adjacency matrices $X_t$ for $t = 1, \dots, T$, each observation containing the same set of $n$ actors. Evolution of the network is solely modelled from the point of inception $X_1$, with the evolution to $X_1$ not being considered. The actions of actors within the network at $t$ are simulated, with changes in friendship ties based upon actor-specific personal objective functions; the process attempts to model the micro-changes necessary to arrive at the network of $t + 1$. The complete SAB algorithm (Snijders, 1996) is rather detailed and complex in its implementation, so in the interests of space is not repeated here.

PageRank
The PageRank (PR) algorithm was developed by Brin and Page (1998), the founders of Google. PR analyses the link structure of a network, taking into consideration not only the number of links to a node, but also the importance of the node sending the outward link. The PR ($w_i$) for each node $i$ is such that $w_i \geq 0$, and $w_j > w_k$ indicates $j$ is a more important node than $k$. If $\bar{H}_i$ denotes the set of nodes that link to $i$, and $H_i$ the set of nodes linked outwardly from $i$, then the PR $w_i$ is calculated as:
$$w_i = \sum_{j \in \bar{H}_i} \frac{w_j}{|H_j|}$$
The calculation of $w_i$ is recursive and can be initiated with any selected initial importance scores, iterating until convergence. The calculation of the PR may be interpreted as a random walk on a graph; in the context of the internet, a "random surfer" clicks on webpage links at random, with the resultant probability of arriving at a page defined as its PR.
The "random surfer" calculation of PR is useful when importance scores are necessary for large graphs (such as the internet), whereby the adjacency matrix of connections $X$ is unobtainable. However, if $X$ is known, an adjusted matrix ($M$) may be calculated with $m_{ij} = 1/|H_j|$ if the link $j \to i$ exists and $m_{ij} = 0$ otherwise. The PR calculation may then be expressed as a system of linear equations $Mw = w$, with the problem reduced to finding the principal eigenvector of the matrix $M$. Due to the properties of $M$, it is possible to find an eigenvalue $\lambda = 1$ which generates a unique positive eigenvector; this eigenvector being the vector of PageRanks (Page and Brin, 1999).
The matrix $M$ is defined as column stochastic if each element $m_{ij} \geq 0$ and the sum of each column is 1; this ensures the existence of $\lambda = 1$. However, this does not guarantee the existence of a unique $\lambda$ necessary for ranking, therefore other requirements of $M$ need to be satisfied. From the Perron-Frobenius theorem (Meyer, 2000), a column stochastic matrix $M$ that is irreducible with $m_{ij} \geq 0$ generates:
• an eigenvalue $\lambda > 0$ with corresponding eigenvector $v > 0$;
• the existence of a dominant eigenvalue $\lambda_1$, such that $|\lambda_i| \leq \lambda_1$ for every other eigenvalue $\lambda_i$;
• all eigenvectors $\geq 0$ are a multiple of $v$.
Therefore, M also needs to satisfy the condition of irreducibility, whereby M cannot be placed into block-upper triangular form through a series of permutations.M may become reducible if disconnected clusters of nodes exist in the network.Furthermore, nodes with an inward link but no outward links, termed as "dangling nodes", also affect the necessary requirements for a unique vector of PageRanks (Ipsen and Selee, 2008).
To ensure the successful calculation of the PR vector, $M$ is required to represent a strongly connected graph; a graph being strongly connected if a path from any given node $i$ to any other node $j$ exists. Performing the PR calculation upon a strongly connected graph is not always possible, as is the case for both web pages and social networks. As such, calculation of a new matrix $\hat{M}$ is required:
$$\hat{M} = dM + (1 - d)Q$$
where $Q$ is the matrix with every element equal to $1/n$ and $d$ is the 'dampening factor', ensuring that $\hat{m}_{ij} \geq (1 - d)/n > 0$, which satisfies the required conditions; $d$ is generally selected to be 0.85 (Bryan, 2006). The principal eigenvector of $\hat{M}$ is calculated, returning the required PR.
To illustrate PR, the following example is conducted upon the network of Figure 3, using the number of outward links for each agent:
• The matrix $M$ is calculated, where $m_{ij} = 1/|H_j|$ if the link $j \to i$ exists and $m_{ij} = 0$ otherwise.
• Taking $d = 0.85$ with $n = 4$, the adjusted matrix incorporating the dampening factor is calculated.
• The adjusted matrix is in the form that allows for the calculation of the PR vector, which is found as the eigenvector corresponding to its dominant eigenvalue.
• Hence, the PageRank of each node is found. As node A has the highest PageRank, it is therefore the most "important" node in the network.
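A damped PageRank calculation of this kind can be sketched with power iteration, which repeatedly applies the damped, column stochastic matrix to a vector that sums to one. The adjacency matrix is a hypothetical four-node network (an assumption, not Figure 3 itself), and the sketch assumes every node has at least one outgoing link, so dangling nodes are rejected rather than handled.

```python
# Hypothetical directed network (assumption); rows and columns are ordered
# A, B, C, D and ADJ[j][i] == 1 means a link j -> i exists.
ADJ = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]


def pagerank(adj, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on d*M + (1-d)*Q, with m_ij = 1/outdeg(j) if j -> i."""
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    if any(k == 0 for k in out_deg):
        raise ValueError("dangling nodes are not handled in this sketch")

    w = [1.0 / n] * n  # any initial vector summing to one may be used
    for _ in range(max_iter):
        new_w = [
            (1.0 - d) / n
            + d * sum(w[j] / out_deg[j] for j in range(n) if adj[j][i])
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w
        w = new_w
    return w


NODES = ["A", "B", "C", "D"]
ranks = dict(zip(NODES, pagerank(ADJ)))
most_important = max(ranks, key=ranks.get)
```

Because the damped matrix remains column stochastic, the iterates keep summing to one, which is why the uniform term can be written as (1 − d)/n.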

Agent Based Simulation
This section describes the developed Agent Based Simulation (ABS), reported here using the revised Overview, Design concepts, Details (ODD) protocol for Agent Based Models (Grimm et al., 2010).

Purpose
The aim of the proposed simulation is to take the ASSIST data and simulate the evolution of the adolescent social networks over time, with an attempt to understand the process by which connections are modified.

Entities, State Variables and Scales
The simulation is Java based, Java being an object-orientated programming (OOP) language. As such, the simulation is structured to have a 'Main' class, where the methods necessary for running the simulation are executed, and an 'Agent' class, whereby each instance of Agent represents an individual from the ASSIST data.
Each Agent object has a variable (an array list) relating to the individual's connections, with access to a global array (sociomatrix) containing the adjacency matrix of all links for the school being simulated.When an update occurs, the changing 'Agent' object (searching agent) updates its own link information variable and the global adjacency matrix.
Agents have other attributes, such as:
• Age: calculated from 'Date of Birth', and increases as time progresses;
• Sex: drawn directly from data;
• Smoking Level: the self-reported smoking level classified on a scale from one to six, where a smoking value of one indicates 'never smoked' and six represents 'more than 6 cigarettes a week'. Again, drawn directly from data.
While these attributes are recorded in the simulation, they do not impact the progression of the model and have been included for ease in future work. Time is measured on a scale of weeks and there is no spatial representation at this stage, as only the change in connections is of importance.

Process, Overview and Scheduling
The following step-by-step guide describes how a link prediction is made in the ABS:
• On initialisation, a sociomatrix ($X$) and number of link changes ($\varepsilon$) are read from the database, giving a network rate of change $\rho = 1/\varepsilon$;
• At time $t$ an event occurs, with the time between events being negatively exponentially distributed with parameter $\rho$;
• The event signifies that an agent must make a change to their outgoing links, the agent making the change being selected uniformly at random (termed the 'searching agent');
• The randomly selected agent $i$ (searching agent) receives a "message" telling them they must make a change, the change made being based upon the maximisation of $i$'s personal objective function $f_i$;
• Agent $i$ iterates through the link changes offered by the selected link prediction method (the 'testing agents') from amongst the 4 methods described in Section 4, finding their maximum $f_i$;
• Agent $i$ makes one change to their outgoing links, updating $X$ accordingly;
• The process repeats until stopping conditions are satisfied, subsequent agents making use of the updated links from previous agents to make their decisions.
The advancement of the simulation may therefore be interpreted as having a Discrete Event Simulation (DES) structure, as the system decides when events will occur and which agent must make a change. An important deviation from the DES structure is that the changes made to the system are agent-based decisions, the agents selecting the friendship option that most suits them (through their personal objective function). As a result, agent j must consider the changes made previously by agent i; this means agent j's decisions may be affected by those of i, potentially changing j's overall decision. Modelling friendship changes in this manner means that individual decisions affect the system as a whole, with individual connection decisions affecting future connections in the network. The simulation may therefore be thought of as an ABS with discrete-event-based timing; a diagram of the logic is visible in Figure 4.


Design Concepts
The simulation makes use of six different approaches to link prediction:
• Random: the searching agent randomly selects any other agent in the simulation for connection. If a connection already exists, the connection is broken. This method is included to provide a baseline for the other link prediction methods evaluated;
• Adamic/Adar, Katz, SAB and PageRank: these methods have been described in Section 4, the associated algorithms being implemented in the simulation;
• PageRank-Max: a novel approach to link prediction, based on the PageRank approach. A detailed discussion of this new algorithm may be found in Section 6.

Initialisation
On initialisation, the simulation is required to create multiple instances of the Agent class. The user must decide the school and timestep for prediction, the simulation accessing the ASSIST database and querying the relevant tables through the use of SQL. The sociomatrix X and number of changes ε are saved as global variables, with the number of Agent objects created based on the information within X. A separate data table is accessed, containing the properties of the individuals to be simulated (such as unique id); the information is then applied, giving each agent an identity.
Each agent then accesses the row in X that represents their connections, storing the agents to whom they send an outlink within a local variable. The network is then drawn for visualisation purposes, the graphics being able to update each time an agent makes a change. With the initialisation process complete, a representation of the school network (at the designated time) is present, and the simulation is able to commence.

Input Data
The ASSIST data provides multiple observations of a school social network; the predictions made may therefore be assessed against real data at later time periods, giving an insight into the accuracy of the predictions. Three waves of data are available (T1, T2 and T3); as such, two predictions can be made: T1 to T2 and T2 to T3.
An Access database has been created for use with the simulation, holding information regarding friendship ties and basic student information.The database contains a separate table relating to the adjacency matrix of social ties, for each school at each time step; this allows individual schools to be modelled separately with ease.

Software
The software used for the ABS is AnyLogic 6 (AnyLogic, 2002), given the ease with which it can connect to databases; this is a requirement when inputting the ASSIST data into the model to create the social network structures. Furthermore, AnyLogic offers the user the ability to expand its basic functionality with Java, which streamlines the coding of link prediction methods into the simulation. The source code is available at (Fetta et al., 2017).

PageRank-Max
Given the potential importance of centrality in message diffusion within a social network, it stands to reason that centrality may also be of importance to the individuals comprising the social network. We propose a new link prediction algorithm, the PageRank-Max (PR-Max) method, which provides an individual perspective of centrality, a searching agent altering its connections based upon the personal optimisation of its own eigen-centrality.
The PR-Max method seeks to find the connection that may improve an agent's own PR. On receipt of a message from the environment, the changing agent (i) begins iterating through all agents in the network as follows:
• Agent j is selected for testing;
• The connection from i to j is altered, either by forming a link or breaking an existing link;
• Agent i's PR is calculated and stored as f_{i,j};
• The connection change is reversed;
• The process repeats.
Once all possible changes to i's connections are assessed, the greatest value of f_{i,j} is selected and the associated connection change is made. The PR-Max method works much like the SAB method, testing the result of an actual change to the network; however, it does not require the creation of a model prior to use, as a transformation of the sociomatrix is its sole requirement. The simplicity of the PR calculation means that the PR-Max method also does not require two waves of network data, being able to predict changes in the network without prior knowledge of its evolution; a diagram of the PR-Max logic is present in Figure 5.
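The PR-Max steps above can be illustrated with a small sketch. The PageRank routine here is a generic power iteration with a damping factor of 0.85 and even redistribution of rank from agents with no outgoing links; both choices are assumptions for illustration, as are the function names, and they may differ from the PR variant used in the simulation.

```python
def pagerank(matrix, damping=0.85, iterations=100):
    """Power-iteration PageRank on a 0/1 adjacency matrix.

    matrix[i][j] == 1 means a link from agent i to agent j.
    The damping value and the dangling-node rule are assumptions.
    """
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            out = sum(matrix[i])
            if out == 0:
                # Agent with no out-links: spread its rank evenly.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if matrix[i][j]:
                        new[j] += damping * rank[i] / out
        rank = new
    return rank

def pr_max_choice(matrix, i):
    """Return (j, f_ij) for the single link toggle i -> j that
    gives agent i its highest PageRank."""
    best_j, best_value = None, float("-inf")
    for j in range(len(matrix)):
        if j == i:
            continue
        matrix[i][j] ^= 1             # form or break the link i -> j
        value = pagerank(matrix)[i]   # f_{i,j}
        matrix[i][j] ^= 1             # reverse the change
        if value > best_value:
            best_j, best_value = j, value
    return best_j, best_value
```

Recomputing PR for every candidate is costly for large networks, but it keeps the sketch close to the description: each candidate change is actually applied, scored and then undone, so the sociomatrix is left unchanged until the best change is committed.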
The PR of a webpage decides the ordering in which it is displayed on Google (following a search query). Users are said to be able to manipulate their webpage's PR by making educated link choices (Malaga, 2008), with the PR-Max method aiming to demonstrate this in the context of social relations. Researchers have attempted link prediction through the use of a 'Personalised PageRank' (Chen, 2012; Yung, 2012), which orders pages differently depending on what a specific user may find more relevant. In terms of link prediction, this means that the PR is calculated differently depending upon the specific searching agent seeking to make a new connection; this calculation process does not consider optimising an agent's own PR, which we consider in PR-Max.

While the careful selection of outward links is said to be important, removal of specific links has also been shown to have an effect on PR (de Kerchove et al., 2008); this gives the PR-Max method a sensitivity to link disconnection. The AA, Katz and basic PR implementations do not demonstrate such explicit consideration of link disconnection, their focus being predominantly upon the prediction of new connections. Although the SAB method does account for disconnection, this is subject to the model generated prior to simulation. Therefore, the PR-Max method may be able to capture elements of network evolution more naturally. The performance of PR-Max is compared to the other link prediction methods in Section 8.

Validation
To gain confidence in the output of the simulation, validation and verification procedures have been conducted. As the simulation is attempting to validate social theories around how adolescents connect, the output of the simulation is in itself an evaluation of its validity. This is made evident by evaluating the accuracy of the link prediction approaches against the empirical social network data, discussed further in Section 8. The following sections describe additional elements of the validation process, prior to assessing the accuracy of the results: verification (Section 7.1), distributions and random sampling (Section 7.2), warm-up period (Section 7.3), number of runs (Section 7.4) and experimentation specification (Section 7.5).

Verification
Verification is described as a micro-check of the model, where a test of each individual element is performed. During the creation process, regular checks of the code were carried out, attempting to ensure the proper implementation of the designated logic. For each of the LP method implementations, the associated calculation of the objective function was performed to ensure calculations matched. Network visualisation was also used to verify consistency with the predictions made.

Distributions and Random Sampling
Statistical distributions and random sampling are used throughout the simulation, the values being derived from AnyLogic's own built-in engine. Sampling of random numbers uses AnyLogic's default random number generator, which is an instance of the 'Random' Java class; this is a Linear Congruential Generator (AnyLogic, 2002). During the verification process, a number of runs were performed to assess the average number of changes in a selected school network; the confidence interval was calculated, and as the actual number of changes from the data fell within the bounds of the confidence interval, the distribution was said to be acting appropriately. During all testing and result generation, common random numbers are implemented between scenarios.

Warm-Up Period
The starting conditions of the simulation (for a selected school at a given time point) are provided by the initial sociomatrix, which is read during the initialisation procedure. As such, a warm-up period is not required, as the agents begin with the required set-up of connections.

Replications
As the simulation has various elements which include variability, multiple simulation runs are required. This work makes use of the confidence interval approach (Robinson, 2004) to select the number of replications, based on outcome-based precision criteria. Using the CI method, the required number of runs (η) is calculated as:
$$\eta = \left( \frac{100 \, S \, t_{n-1,\alpha/2}}{d \, \bar{x}} \right)^{2}$$
where $\bar{x}$ and $S$ are the sample mean and standard deviation (respectively), $d$ the desired percentage deviation of the confidence interval about the mean, and $t_{n-1,\alpha/2}$ the value from the Student's t-distribution with $n - 1$ degrees of freedom and significance level $\alpha$ (Robinson, 2004).
A selection of network measures was used to assess the variability in the outcome-based metrics. With a significance level α = 0.05, the greatest number of runs required was 9.49. Additionally, as a 'rule of thumb', Law and Kelton (1999) suggest that a minimum of around 3-5 replications is required; should too many replications be selected, this wastes valuable running time and computing resources. Given the identified maximum, 10 replications have been selected. This is greater than the rule of thumb, but does not appear excessive.
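As a worked illustration of the confidence interval method for choosing the number of replications, the sketch below evaluates n = (100 S t / (d x̄))² and rounds up to whole runs. The function name and the pilot values are hypothetical and are not taken from the study; the t value 2.776 is the two-sided 5% critical value for 4 degrees of freedom, and is passed in directly because the Python standard library has no t-distribution quantile function.

```python
import math

def required_replications(mean, sd, t_value, d_percent):
    """Replications needed so that the confidence interval half-width
    is within d_percent of the mean: (100 * S * t / (d * mean)) ** 2."""
    return (100.0 * sd * t_value / (d_percent * mean)) ** 2

# Hypothetical pilot of 5 runs: mean 4.0, standard deviation 0.3,
# t_{4, 0.025} = 2.776, target deviation of 5% about the mean.
n_raw = required_replications(mean=4.0, sd=0.3, t_value=2.776, d_percent=5.0)
n_runs = math.ceil(n_raw)   # round up to a whole number of runs
```

The same rounding step takes the reported maximum of 9.49 to the 10 replications used in this work.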

Experimentation Specification
The investigation was conducted across eight Intel i3-2120 dual-core machines, each with 8GB RAM. Each set-up of the simulation was run on an individual machine, parallelised to make use of the dual cores. An example set-up on a machine would be: School 12, PageRank-Max, T1-T2. The approximate runtime for each link prediction method is shown in Table 1.

Results
The previous sections have described the creation of an ABS to predict social network evolution, implementing five separate link prediction methods: Adamic/Adar (AA), Katz, Stochastic Actor Based (SAB) Models, PageRank (PR) and PageRank-Max (PR-Max). This section discusses the results produced from evaluating each of these methods, across the breadth of the ASSIST network school data, for four different key network statistics: Transitivity, Average Degree, Reciprocity and Average Path Length (APL).
For each of the control schools, a prediction is made from T1 to T2 and T2 to T3. The predicted networks at T2 and T3 are compared with the real data to evaluate their accuracy. The presentation of results is structured as follows: the precision of each algorithm in predicting the correct links is discussed in Section 8.1 and the individual network structures are presented in Section 8.2.

Precision analysis
The first method to evaluate the T2 and T3 predictions is that of precision. The precision metric was first proposed by Cleverdon (1972) and has been used in the context of link prediction methods (Lü and Zhou, 2011). Precision evaluates the number of correct predictions, y_c, relative to the number of predictions made, y_p, such that the precision is y_c / y_p. To benchmark performance, we also generate a network based upon link predictions made at random (the random method). The precision is expressed as a percentage improvement over predictions made at random; positive values indicate an improvement in correct predictions, while negative values indicate a reduction. Ten runs of the random method for each school network are performed to generate the random predictions. Also of interest is the number of missed predictions, which counts the friendship changes not made in the predicted networks of T2 and T3 when a friendship change has actually occurred in the real data. The missed predictions are also expressed in terms of an increase compared to the random method, negative values indicating fewer predictions missed. Therefore, two metrics are calculated for each predicted network: the percentage increase of correct and missed link predictions over the random method.
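The precision measure can be illustrated with a short sketch. The function names and the toy link changes are hypothetical, and the percentage-over-random calculation shown (relative difference from the random baseline) is an assumed reading of how the improvement is expressed.

```python
def precision(predicted, actual):
    """Share of predicted link changes that also occur in the
    observed data, i.e. y_c / y_p."""
    predicted, actual = set(predicted), set(actual)
    if not predicted:
        return 0.0
    return len(predicted & actual) / len(predicted)

def pct_over_random(method_value, random_value):
    """Percentage change of a method's score relative to the
    random baseline (assumed definition)."""
    return 100.0 * (method_value - random_value) / random_value

# Toy example: directed link changes written as (sender, receiver).
predicted_changes = {(0, 1), (0, 2), (1, 2), (2, 0)}
observed_changes = {(0, 1), (1, 2), (2, 1)}
p = precision(predicted_changes, observed_changes)   # 2 correct of 4 made
improvement = pct_over_random(p, random_value=0.25)
```

In this toy case two of the four predicted changes are observed, giving a precision of 0.5; against a hypothetical random baseline of 0.25 this is a 100% improvement.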
Table 2 displays the average precision classified by method at each timestep. Each method is then ranked in terms of its precision performance; ranks are displayed in Table 3. Values that are significantly different from random at the 95% level, following an independent samples t-test for parametric data or a Mann-Whitney test for non-parametric data, are highlighted and starred.
The boxplots shown in Figure 6 and Figure 7 display the correct prediction scores at T2 and T3 respectively. They demonstrate the higher proportion of correct predictions for the highlighted and starred. Each method is then ranked by structural measure, values with the lowest AES achieving the highest ranks (Table 5).
The differences in AES between time steps are apparent from Table 4. The APL is predicted significantly differently across all methods, with predictions being worse at T3 than at T2 for the AA (139.01), Katz (99.05), SAB (26.30) and PR (24.74) methods; however, AES is reduced for PR-Max at T3 (12.78), indicating a significant improvement in predictions. AES for average degree is also significantly different between T2 and T3, with AA (29.25) and Katz (20.01) increasing; once again, PR-Max values improve at T3, with the AES value decreasing significantly.
The AES values indicate an improvement in the PR-Max structural accuracy at T3. This is further reinforced by the ranks of Table 5, which demonstrate a movement of out-degree and APL predictions from last place (5) at T2 to first place (1) at T3. When the harmonic mean of the individual rankings is taken for each method, PR-Max is placed first across both time steps (T2: 1.7, T3: 1.0); however, at T2 this is very closely followed by the Katz method (1.8).
The precision analysis of Section 8.1 placed the Katz method fourth overall at both T2 and T3. However, it would appear that the method performs well in terms of structure at T2, ranking first in APL AES and second for transitivity and average out-degree. This suggests that, while the specific links in the predicted networks may not be accurate, the overall network structure generated is more representative than that of other link prediction methods, only being outperformed by PR-Max in terms of transitivity and reciprocity. The findings demonstrate the importance of considering the predicted network structure when discussing link prediction methods, potentially providing further insight than simply considering precision.
Overall, the structural performance analysis of the methods has reinforced many of the conclusions from Section 8.1. There would appear to be differences in the performance of methods at T2 and T3, suggesting an underlying change in the friendship mechanisms of adolescents within the ASSIST data. Further evidence of the strength of the PR-Max method (in predicting network evolution) is also provided, the method performing particularly well at T3.

Conclusions
Using simulation as a tool to explore theories around behaviours
This paper has outlined the development of a simulation-based framework, incorporating link prediction algorithms, for application to adolescent social network data. The simulation employed four existing link prediction methods (Adamic/Adar, Katz, SAB models and PageRank) and developed a new method, PR-Max, based upon the optimisation of an agent's eigen-centrality.
The existing methods were chosen due to their success in a wealth of prior applications, with the PR-Max method being developed to provide an alternative perspective of status. A limitation of the study may be the selection of only five methods to explore in depth. However, given the rigorous selection process applied to the chosen methods, it was felt that an appropriate representation of the most widely used methods was presented.
The social network analysis offers novel contributions to both link prediction and simulation literature. Although the SAB method uses simulation as an underlying tool for the generation of statistical models, this work is seemingly the first study to structure the link prediction problem within an ABS framework. The development of the PR-Max method also provides a new approach to link prediction, whereby agents use eigen-centrality to actively improve their current social situation. Furthermore, this investigation expands the current literature relating to social applications of simulation, signalling a potential future direction for ABS research.

PageRank-Max is an effective predictor of future social structure, which suggests status is important in friendship selection
This analysis has concluded that the proposed PR-Max method was the most successful (of those tested) in predicting the evolution of adolescent friendships, in terms of both precision and network structure.
The PR-Max method highlighted that status may be a key factor in the evolution of adolescent social networks, especially as the individuals mature. This identifies status (an interpretation of eigen-centrality) as a key focus for future investigations of adolescent social networks. It suggests that a salient part of the adolescent friendship-making process is the befriending of an individual who will likely increase one's own status. This contributes to literature describing adolescent social connection and may impact future adolescent peer diffusion studies.
A further relevant feature of the PR-Max method is the process by which links were broken, with agents removing connections that negatively impacted upon their eigen-centrality. This suggests friendship degradation is an important factor in social connections, with adolescent social networks continually evolving over time. The results demonstrated the abilities of a simulation-based link prediction structure in gaining insights unobtainable by conventional social network analysis.

Implications for policy makers and public health managers
Smoking is a major global health challenge, with 6 million deaths from tobacco use worldwide per year. Secondary schools are a common point at which people start smoking, so it is vital to intervene at this age given the addictive nature of tobacco and the longer-term health effects. Our conceptual approach and contribution to the problem is in providing a proof-of-concept for targeted interventions driven by social network analysis. We demonstrate the utility of using emerging sources of social network data for public health interventions.
The ASSIST programme was shown to provide a cost-effective method for reducing adolescent smoking rates. The ASSIST programme resulted in a 2.1% reduction in smoking prevalence at 2 years, and the incremental cost per student not smoking was £1,500. The intervention also affected students' beliefs about longer-term smoking behaviour, with a lower proportion of students in the intervention schools believing that they would be a smoker at age 16 years. The ASSIST findings, if extrapolated to all 12-year-old students in the UK, would cost £38m but would result in 20,400 fewer adolescent smokers at age 14 years. Placing these results in a broader context, NHS expenditure on treating lung cancer in 2010 was £261 million in England alone (Hollingworth et al., 2012).
The research has demonstrated, through the use of ABS and link prediction methods, the potential importance of eigen-centrality in adolescent friendship selection. As such, these learnings may be fed back into future peer-led interventions to aid in the selection of appropriate peer supporters. Furthermore, it provides encouraging results in demonstrating the ability to predict forward and identify highly connected individuals in a social network, informing policymakers whom to target to diffuse positive public health messages.

Figure 1: Social network at T1; dark nodes indicate smokers.

Figure 2: Social network at T2; dark nodes indicate smokers.

Figure 3: Example network for illustration of link prediction algorithms.
, the unobserved links are identified as: B → C, B → D, C → D and D → A.
• Taking the unobserved link B → C, examining the friendships of B and C gives the neighbourhoods Γ(B) = {A} and Γ(C) = {A, B}, respectively.
• As both Γ(B) and Γ(C) contain agent A, A is identified as the only common neighbour of agents B and C.
• Agent A has three outward links, as such |Γ(A)| = 3 and therefore Score[B,C] = 0.910 (3 d.p.).
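The score in this worked example can be checked with a short sketch of the Adamic/Adar calculation, which sums 1 / ln|Γ(z)| over the common neighbours z of the two agents. The neighbourhoods below are inferred from the text above on the assumption that the example network contains every directed link among agents A to D except the four listed as unobserved; the function name is hypothetical, and skipping common neighbours with a single outward link (to avoid dividing by ln 1 = 0) is an added assumption.

```python
import math

# Out-neighbourhoods inferred from the worked example (an assumption):
# all directed links among A-D except B->C, B->D, C->D and D->A.
gamma = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"B", "C"},
}

def adamic_adar(x, y):
    """Sum of 1 / ln|Gamma(z)| over common neighbours z of x and y.

    Neighbours with a single outward link are skipped, since ln(1) = 0.
    """
    common = gamma[x] & gamma[y]
    return sum(1.0 / math.log(len(gamma[z])) for z in common if len(gamma[z]) > 1)

score_bc = round(adamic_adar("B", "C"), 3)   # 1 / ln(3)
```

With A as the only common neighbour and |Γ(A)| = 3, the score is 1 / ln(3), which agrees with the value of 0.910 quoted above.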

Figure 4: Simulation logic describing the timing and agent-based decisions.

Figure 5: Updated simulation logic describing the process of the PR-Max method.

Table 1: Model runtimes (minutes) by link prediction method.

Table 2: Average of all school networks at T2 and T3, displaying the percentage increase over random predictions. Highlighted and starred values are significantly different at the 95% level. Columns: Time, Measure, Adamic/Adar, Katz, SAB Model, PageRank, PageRank-Max.

Table 3: Ranked average precision values.

Figure 6: Box plot of correct prediction proportions for each method at T2. Whiskers extend 1.5 times the height of the box, with circular points indicating outliers. Starred points indicate extreme outliers.

Figure 7: Box plot of correct prediction proportions for each method at T3. Whiskers extend 1.5 times the height of the box, with circular points indicating outliers. Starred points indicate extreme outliers.

Table 4: AES for each link prediction method; highlighted values are significantly different at the 95% level.

Table 5: AES ranks for each link prediction method.