Evolution Model of Spatial Interaction Network in Online Social Networking Services

The development of online social networking services provides a rich source of data of social networks including geospatial information. More and more research has shown that geographical space is an important factor in the interactions of users in social networks. In this paper, we construct the spatial interaction network from the city level, which is called the city interaction network, and study the evolution mechanism of the city interaction network formed in the process of information dissemination in social networks. A network evolution model for interactions among cities is established. The evolution model consists of two core processes: the edge arrival and the preferential attachment of the edge. The edge arrival model arranges the arrival time of each edge; the model of preferential attachment of the edge determines the source node and the target node of each arriving edge. Six preferential attachment models (Random-Random, Random-Degree, Degree-Random, Geographical distance, Degree-Degree, Degree-Degree-Geographical distance) are built, and the maximum likelihood approach is used to do the comparison. We find that the degree of the node and the geographic distance of the edge are the key factors affecting the evolution of the city interaction network. Finally, the evolution experiments using the optimal model DDG are conducted, and the experiment results are compared with the real city interaction network extracted from the information dissemination data of the WeChat web page. The results indicate that the model can not only capture the attributes of the real city interaction network, but also reflect the actual characteristics of the interactions among cities.


Introduction
With the rapid development of the Internet, smart phones, and information technology, online social networking services such as Facebook, Twitter, Sina Weibo, and WeChat have developed rapidly. These platforms facilitate the interactions among users and accelerate the dissemination of emotions and opinions contained in the information. Meanwhile, these platforms provide a rich source of social media including geospatial information for the research of social networks [1][2][3][4]. The interactions of users in social networks usually manifest as the viewing and forwarding of information. More and more research shows that geographical space, which seems to be a bridge between online and offline, affects the interactions of users in social networks [5][6][7].
in the process of information dissemination in social networks, where nodes represent cities and edges represent interactions among cities. We consider the evolution model of the city interaction network from the perspective of the edge, that is how each edge is added to the city interaction network. A evolution model for describing the interactions among cities is established. The evolution model consists of two core processes: the edge arrival and the preferential attachment of the edge. The edge arrival model arranges the arrival time of each edge; the model of preferential attachment of the edge determines the source node and the target node of each arriving edge. Six preferential attachment models (Random-Random, Random-Degree, Degree-Random, Geographical distance, Degree-Degree, Degree-Degree-Geographical distance) are built, and the maximum likelihood approach is used to do the comparison. Finally, the evolution experiments using the optimal model (Degree-Degree-Geographical distance) are conducted, and the experiment results are compared with the real city interaction network extracted from the information dissemination data of the WeChat web page.
Preferential attachment of edges: The preferential attachment model assumes that when a new node joins the network, it creates a constant number of edges, where the selection of the target node for each edge is proportional to the degree of the node [33]. In addition to degree, the node age and geographic distance of the edge can be applied to the preferential attachment model [34]. This paper considers the evolution of the network from the perspective of the edge. Therefore, when an edge is added to the network, the source node and the target node are selected according to preferential attachment of edges.
Evaluation by the maximum likelihood: The maximum likelihood approach is usually used to compare a series of models numerically and select the best model (and parameters) to interpret the data [35]. As our understanding of real-world networks improves, likelihood remains unchanged, while the generative models improve to incorporate the new understanding. Success in modeling can therefore be effectively tracked [34]. The maximum likelihood approach is widely used to estimate network model parameters [35][36][37] and select the optional model [34,38]. Therefore, this paper uses the maximum likelihood approach to evaluate and compare different network evolution models based on empirical data.
WeChat: WeChat is one of the most popular social networking platforms in China. As of the second quarter of 2016, WeChat has covered more than 94% smart phones in China, with 0.8 billion monthly active users. WeChat has powerful social functions and a large number of users, and WeChat has integrated almost all aspects of people's lives, including payment, location-based services, shopping, games, and entertainment. Therefore, WeChat is an appropriate system to study the characteristics and evolution mechanism of the spatial interaction network in social networks.
The rest of this paper is organized as follows: the second section introduces the dissemination data of the WeChat web page and constructs the city interaction network. The third section introduces the evolution model of the city interaction network. In the fourth section, the maximum likelihood method is used to evaluate the six preferential attachment models and to select the optimal model and parameters. In the fifth section, the optimal model is used for network evolution, and the obtained evolutionary network is compared with the real city interaction network. The potential biases and model extension are discussed in the sixth section, and the seventh section is the conclusion.

Dataset
WeChat provides three basic functions: instant messaging (including single and group chat), moments (where users publish, comment, and forward information), and official accounts (including subscription accounts and service accounts). Users can interact with their friends by posting text, voice, pictures, emoticons, location, video, web links, and other information. This paper studies the dissemination data of the WeChat web page (HTML5) collected by third-party service companies.
The recording process of the WeChat web page data can be described as: when a web page with a certain theme is created and published by the creator through the official accounts, the content of this web page can be viewed by other users. Users who view the web page can send it to their moments or WeChat friends, or not forward it. Thus, the users who view (or forward) and the users who are viewed (or forwarded) are recorded.
The dissemination data of WeChat web page were obtained, and the time span of the data was from 2-8 July 2016. There were 622,637 records in total, and each record can be represented by a six-tuple <pageID, sourceID, targetID, type, time, ip>, where pageID represents the unique identity of the web page, sourceID and targetID represent the unique identity of the user, type represents the behavior type of target, including viewing and forwarding, time represents the time when the behavior of targetID occurs, and ip represents the IP address of targetID. In order to protect the privacy of users, web page identity and user identity were anonymized.

City Interaction Network
Most of the researches related to geography use self-reported data to identify the location of users, which is often inaccurate. By locating users with IP addresses, the errors of self-reported data can be avoided. Song et al. analyzed several large IP address databases, including the Chunzhen IP address database, the Taobao IP address database, the Sina IP address database, and the Baidu IP address database [39]. They found that the four IP address databases were quite different, and when the administrative division level was lower, the coverage rate and coincidence rate of IP address databases would decrease, while the data availability would also decrease. However, considering the coverage rate and coincidence rate of the four IP address databases, they believed that the credibility of the Taobao IP database was the highest. Therefore, the Taobao IP address database was used in our work to locate the IP address in the data to the corresponding cities in China. Finally, the IP address in the data was located in 34 provincial divisions of China (including 23 provinces, 4 municipalities, 5 autonomous regions, and 2 special administrative regions), a total of 372 cities. The number of cities corresponding to each provincial division is shown in Table 1.  Figure 1 shows the active frequency of users in each provincial division. The active frequency of a province is the number of users located in that province. The active frequency was more in the east and less in the west. The top three provincial divisions with the highest frequency were Shandong, Henan, and Guangdong, and the active frequency of Xizang, Xinjiang, and Taiwan was low. This fully reflects that information interaction is affected by political, economic, cultural, geographical, and demographic factors. Based on the data of the web page dissemination in WeChat, the city interaction network G t = (V, E t , W t ) can be constructed. G t is a dynamic directed network, V = {v 1 , v 2 , v 3 , · · · , v N } is the set of nodes in the network, representing cities of China, and the number of nodes is N; E t = {e 1 , e 2 , e 3 , · · · , e M t } is the set of edges of the network from Time 0-t, representing the interactions among cities, and the number of edges is M t ; W t = {w 1 , w 2 , w 3 , · · · , w M t } is the weight set of edges in the network from Time 0-t, representing the number of interactions among cities. The dynamics of the city interaction network G t is reflected in the changes of the edge and weight. We took the cities in Shandong province as an example to elaborate the construction process of the city interaction network. At t = 0, G t is a network containing only 17 isolated nodes (the number of cities in Shandong province). When a WeChat web page is published by a user in Jinan and users in Dezhou view or forward this web page, then a directed edge from Jinan to Dezhou is established. The weight of the directed edge is the number of Dezhou users viewing the web page. With the dissemination of the web page, it was assumed that the interaction network one day later is as shown in Figure 2. At this time, the number of nodes in the interaction network was N = 17, and the number of edges was M t = 22 (bidirectional edges are denoted as two edges), where t = 1 (day). The city interaction network in this paper allows self-connected edges, which represents the interactions in the same city. Take the starting time of data (2 July 2016 00:00) as the time t = 0, and construct the city interaction network. The time span of the network is T. Table 2 lists the basic properties of the network G T , including the number of nodes, number of edges, number of self-connected edges, average degree of nodes, density, average clustering coefficient, and average shortest path length. According to the basic properties of the network G T listed in Table 2, an overall understanding of the interaction among cities was obtained through the dissemination of WeChat web page. The network involved 372 nodes and 30,438 edges, which indicates that not every two nodes had connected edges. On average, each node only had connections with 163.65 nodes, and the density of the network was only 0.22. It can be seen that although WeChat has a large number of users in China and covers all cities, each city will not interact with all other cities in the short term. The average shortest path length of the network was 1.73, which means that the average hop from one node to another node was 1.73. There were 353 self-connected edges in the network, and only 19 nodes had no self-connected edges. A total of 622,637 interaction records were recorded, among which, 350,578 records were the interactions in the same city, accounting for 56%. It can be seen that users were more inclined to interact with users in the same city. Figure 3 shows the number of non-isolated nodes and the number of edges in the city interaction network as a function of time. Figure 3a shows the number of non-isolated nodes in the city interaction network as a function of time. Non-isolated nodes represent the nodes that have interacted with other nodes. In the initial stage, the number of non-isolated nodes grew rapidly, and the growth became slow until the number of nodes was close to N. Figure 3b shows the number of edges in the city interaction network as a function of time. The number of edges in the network kept increasing, but due to the limitation of the number of nodes, the growth of the number of edges gradually slowed down. In the case where the number of non-isolated nodes in the network was almost constant, the number of edges still kept growing. This also reflects the limitations of the evolution of the city interaction network from the perspective of nodes.

Notation
Let Z denote the set of edges to be added to the network, t(z), z ∈ Z the time when an edge z is added to the network, and z t u,v an edge z added to the network at time t, and its source node and target node are connected to node u and node v respectively. Let k t (v) denote the degree of node v at time t and d(u, v) denote the geography distance between node u and node v.

Evolution Model
We consider the evolution model of the city interaction network from the perspective of the edge. The model consists of two core processes: the edge arrival and the preferential attachment of the edge. The edge arrival determines the arrival time of each edge; the preferential attachment of the edge determines the source node and the target node of each arriving edge.
For an edge z, it is composed of a node pair: where V represents the node set and does not change with the network evolution. Assuming that the arrival time of the edges is a function of time in ∆t, then the arrival time of each edge in ∆t will be arranged, and all edges can be expressed in the time sequence according to the arrival time: where C is the length of the sequence Z, and Formula (3) guarantees the time-ordered arrival of the edges. Select the source node u and the target node v from node set V according to a certain preferential attachment for the edge arriving at time t: where X(Θ) represents a distribution function and Θ is the parameter of the distribution function. Finally, the network evolution is realized by updating the edge and weight. The edge arrival and preferential attachment of the edge are described in detail below. Figure 4 shows the interaction quantity among cities of the data (each record represents an interaction) as a function of time. In the figure, each data point represents the interaction quantity among cities from time t = 0 to the current time, and the red line is the fitting of the function. It can be seen from the figure that the interaction quantity was a linear function of time, which satisfies f (t) = 4025t − 4.51e4, and the time unit is hours. Since each edge represents the interaction among nodes, f (t) can be used to describe the number of arriving edges. Thus, the number of edges added to the network per unit time is a constant ε = 4025, and the time interval for each arriving edge is t i − t i−1 = 1/ε, i = 2, 3, · · · , C. Let the time of the first arrived edge be t 1 = 0, so that the time of each arriving edge is determined.

Preferential Attachment of the Edge
In this paper, the evolution of the city interaction network is considered from the perspective of the edge. Therefore, when an edge is added to the network, its source node and the target node will be selected according to a certain mechanism. This selection mechanism is called preferential attachment of the edge. Here, six different preferential attachment models are considered in this paper: Random-Random (RR): for the arrived edge at time t, two nodes are randomly selected from the node set V as its source node and the target node, respectively: Random-Degree (RD): for the arriving edge at time t, a node is randomly selected from the node set V as its source node, and the selection of its target node is proportional to the degree of nodes in the network: Degree-Random (DR): for the arrived edge at time t, a node is randomly selected from the node set V as its target node, and the selection of its source node is proportional to the degree of nodes in the network: Geographical distance (G): for the arrived edge at time t, the selection of its source node and target node is proportional to the geographical distance between the two nodes: Degree-Degree (DD): for the arrived edge at time t, the selection of its source node and target node is proportional to the degree of the nodes in the network. The degree index for the source node is α, and the degree index for the target node is β: Degree-Degree-Geographical distance (DDG): for the arrived edge at time t, the selection of its source node and target node is proportional to the degree of the nodes in the network and to the geographical distance between the source node and the target node. The degree index for the source node is α; the degree index for the target node is β; and the distance index is γ:

Evaluation
In this section, a quantitative approach is applied to compare the accuracies of different preferential attachment models. The network is often considered to be the result of an evolutionary random process that drives its growth, including new nodes and new edges [35]. Given real data about network evolution, the extent to which the assumptions of a model are supported by the data using the maximum likelihood approach can be tested. The maximum likelihood approach is usually used to compare a series of models numerically and to select the best model (and parameters) to interpret the data. Estimating the likelihood of a preferential attachment model M involves considering each arriving edge z t and computing the likelihood P M (z t u,v ) that the edge z t selects the actual source node u and the actual target node v according to the model M. Therefore, the likelihood of network G T generated by model M can be expressed as: To obtain better numerical accuracy, the log-likelihood is used in this paper: Since the city interaction network had self-connecting edges, which represents the interaction in the same city, we assumed that the distance of self-connecting edges was 20 kilometers (consider each city contour as a circle, and 20 kilometers is the approximate average of the radius of all cities). Figure 5 shows the relationship between the log-likelihood of models and different parameters. The RR model had no parameters, and its log-likelihood was a constant −3,185,899. In addition to the RR model, the log-likelihoods of the other five models were all convex functions of the model parameters, so the maximum likelihood of each model can be found to estimate the best parameters of the model. Table 3 lists the maximum log-likelihood of different preferential attachment models and the optimal parameters under the maximum log-likelihood. It can be seen from Figure 5 that, under the same parameter, the log-likelihood of the RD model and DR model was approximately equal. This reflects that the RD model and DR model had similar effects on the network evolution, and the selection of the source node and the target node was equal. Figure 5d also reflects this point. Figure 5c shows the relationship between the log-likelihood and parameter γ of G model, and its maximum log-likelihood was significantly higher than that of the RR model, RD model, DR model, and DD model, indicating that the distance played an important role in the evolution of the city interaction network. The DDG model considered both the node degree and the geography distance among nodes in the network evolution process. It can be seen that the maximum log-likelihood of DDG model was the highest, which was 22% higher than that of the DD model and 11% higher than that of the G model. In addition, in the DD model, when α = 1.0, β = 1.0, its log-likelihood was the maximum. In the G model, when γ = −1.6, its log-likelihood was the maximum. The DDG model, which considered the node degree and the geography distance, obtained the maximum likelihood when α = 0.6, β = 0.6, γ = −1.5. This indicates that the distance made the degree of the node less important. Then, we applied the DDG preferential attachment model with parameters α = 0.6, β = 0.6, γ = −1.5 to the evolution of the city interaction network.

Network Evolution
In order to verify the city interaction network model and the evolution process of the network, network evolution experiments were conducted. We considered the real network G 3T/4 from 2-4 July 2016 and evolved it from t = 3 4 T until t = T. Specifically, the edge arrival model was used to determine the edges arriving at time t ∈ [ 3 4 T, T]. For each arriving edge, the DDG preferential attachment model was used to select its source node and target node. Finally, the evolutionary network G T with the same time length as the real network G T was obtained. G T and G T were analyzed by the comparison of the statistical characteristics and community structure of the network. Figure 6 shows the statistical characteristics of real network G T and evolutionary network G T . Figure 6a,b are considered from the edge properties. Figure 6a shows the weight distribution of the edges. It can be seen that the weight distributions of the real network and the evolutionary network followed the power-law distribution. The weight distribution of real network G T was fitted as shown in the dotted black line. The power exponents of the weight distributions of real network and evolutionary network were 1.92 and 1.99, respectively (the weight distributions of the real network and evolutionary network approximately overlapped, so the fit line of the weight distribution of the evolutionary network is not drawn). The weight of the edge represents the interaction among cities, and the power-law distribution of the weight distribution reflects that only a few cities had frequent interactions, while the interactions among most cities was very small. Figure 6b shows the geographical distance distribution of edges. The geographical distance distribution of edges is a property that connects the network with geographical space. Most of the interactive distances among cities were about 100 km. As the distance continued to increase, the probability of interaction became smaller. In addition, 20 km was also the high-frequency distance of city interaction (the distance was denoted as 20 km if the interaction occurred in the same city), indicating that the interaction in the same city occupied a large proportion.    Table 3. The maximum log-likelihood of different preferential attachment models and the optimal parameters under the maximum log-likelihood.  Figure 6c shows the node weight distribution; the horizontal ordinate is the node number, and the numbering order is arranged in descending order of node weight. The node weight of a node is the sum of all the weights of edges connected with the node, which reflects the interactions between the node and its neighbor nodes. Figure 6d shows the betweenness centrality distribution of nodes; the horizontal ordinate is the node number, and the numbering order is arranged in descending order of the betweenness centrality of nodes. The betweenness centrality is to measure the importance of a node to connect with other nodes. By comparing the real network G T with the evolutionary network G T , it can be found that the node weight and betweenness centrality of some nodes in the evolutionary network were obviously higher or lower than the real network, but the overall trend was consistent with the real network. The provincial capital is the economic, political, and cultural center of a province, which is also reflected in the city interaction network. In the real network shown in Figure 6c,d, provincial capitals have relatively high node weight and betweenness centrality, such as Beijing, Shanghai, Guangzhou, Suzhou, Tianjin, and Hangzhou, which can also be reflected in the evolutionary network. Figure 6e shows the relationship between node degree and node weight. Figure 6f shows the relationship between node degree and node betweenness centrality. The greater the degree of nodes, the greater the node weight and the betweenness centrality. For the real network G T and evolutionary network G T , two community detection methods, Louvain [40] and Infomap [41], were used to extract the community structure of the network, and the Normalized Mutual Information (NMI) was used to evaluate the results of community detection. The evaluation results are shown in Table 4. G T − PAD represents the comparison between the community structure of real networks and the provincial administrative divisions in China; G T − G T represents the comparison between the community structure of the real network and that of the evolutionary network. It can be found that the community structure of the real network was consistent with the administrative division to a certain extent, and it also shows the influence of the distance factor on the interactions among cities. In addition, the community structure of evolutionary network and real network was also similar, which indicates that the preferential attachment model in this paper can describe the emergence of community to a certain extent. This is mainly because the distance factor was considered in the model, so that cities in the same province were easily connected and formed communities. In general, the evolutionary network can be well matched with the real network, which reflects that the model can not only capture the properties of the real city interaction network, but also reflect the geographical characteristics of the interactions among cities. Table 4. Evaluation results of community detection in undirected networks. G T represents the real network, PAD represents Provincial Administrative Divisions in China, and G T represents the evolutionary network.

Potential Biases
In this paper, the evolution of the city interaction network was modeled and analyzed by using the interactive data formed in the process of information dissemination. There is no doubt that the use of one dataset to explain the results is not complete enough. Since our model was data-driven, the edge arrival model and maximum likelihood method were data-dependent. For the edge arrival model, different spatial interactive data may have different situations. The selection of model parameters in this paper was based on the method of maximum likelihood. The optimal parameters of the model can be found using real data. Therefore, different datasets will lead to different optimal parameters of the model. The evolution model was evaluated by comparing the structure characteristics of the evolutionary network and the real network. From the results, the model can capture the properties of the real city interaction network, but this is only limited to the city interaction network formed in the process of information dissemination. In the process of information dissemination, the interaction of information enables people to express their emotions and opinions. It is helpful to understand people's emotional tendency by considering the semantic characteristics of interactive information in the spatial interaction network.
Moreover, compared with cities in other countries, Chinese cities have some specificities. (1) China is a vast country, and the distance between cities is relatively large, making distance factors play an important role in the interactions of cities. (2) The distribution of Chinese cities shows a convergent pattern, which is different from Western countries. As a result, China has many large cities with large populations, such as Beijing, Shanghai, and Guangzhou. (3) The provincial administrative divisions in China are established around large cities, and the cities within the province are more likely to interact. The higher the level of political and economic development of the city, the more obvious the interaction. (4) China has a large population and a high Internet penetration rate, which makes information spread rapidly and widely. The results of this paper were obtained in this context. However, if the background were changed to some countries with a relatively small scale and the development levels of cities within the country were similar to each other, the influence of the distance factor on the interactions among cities may not be well reflected. Therefore, different countries have influence on the settings of the model.

Model Extension
The preferential attachment model in this paper belongs to a link prediction model based on the similarity of the network structure. Essentially speaking, a model for link prediction makes a guess about the factors resulting in the existence of links, which is actually what an evolving model wants to show. Up to now, the studies of link prediction overwhelmingly emphasized undirected networks. However, the study of link prediction in directed networks is inadequate [42].
The current common method for extending the technology applied to undirected networks to directed networks is to divide the degrees into outdegree and indegree, such as community detection [43][44][45][46]. According to this ideas, our model can be extended to directed networks. Take the DDG model as an example: the model can be extended to a directed network: Directed-Degree-Degree-Geographical distance (DiDDG): for the arriving edge at time t, the selection of its source node is proportional to the out-degree of the nodes in the network; the selection of its target node is proportional to the in-degree of the nodes; meanwhile, the selection of its source node and target node is proportional to the geographical distance between the source node and the target node. The degree index for the source node is α; the degree index for the target node is β; and the distance index is γ: In the modified model, the degree is divided into the out-degree and in-degree for consideration, so that the probability of connecting an edge between node u and node v will vary depending on the direction of the edge.

Conclusions
This paper studied the evolution mechanism of the city interaction network formed in the process of information dissemination in social networks, where nodes represent cities and edges represent interactions among cities. We considered the evolution model of the city interaction network from the perspective of the edge. In the model, the nodes were fixed, and the evolution process of the edge consisted of two core processes: the edge arrival and the preferential attachment of the edge. The model of edge arrival determines the arrival time of each edge; the model of preferential attachment of the edge determines the source node and the target node of each arriving edge. Six preferential attachment models were considered, and the comparison was done by the maximum likelihood approach. We found that the degree of the node and the geographic distance of the edge were the key factors affecting the evolution of the city interaction network. The DDG preferential attachment model, which considered both the node degree and the geographical distance among nodes in the network evolution process, was the best of the six models. Finally, we conducted the evolution experiments using the most optimal model and compared it with the real city interaction network extracted from the information dissemination data of the WeChat web page. By comparing the weight, geographical distance, node weight, and betweenness centrality of the real network and the evolutionary network, it was found that the evolutionary network could be well matched to the real network, which reflects that the model can describe the actual characteristics of the interactions among cities. Our research is of great significance for providing location-based business services, planning and managing communication network facilities, and formulating regional economic development policies.
However, there are still some limitations in our work. On the one hand, the evolution process of the city interaction network is affected by a variety of factors, such as politics, economy, population, etc. A comprehensive comparative analysis of the effects of these factors plays a significant role in the evolution model. These factors should be considered in the evolution model in future work. On the other hand, our work was verified by the real dissemination data of the WeChat web page; whether the model is applicable to the evolution of other spatial interaction networks still needs to be further verified.