Effective attributed network embedding with information behavior extraction

Network embedding has shown its effectiveness in many tasks, such as link prediction, node classification, and community detection. Most attributed network embedding methods consider topological features and attribute features to obtain a node embedding but ignore its implicit information behavior features, including information inquiry, interaction, and sharing. Ignoring these features can degrade performance in downstream applications. In this article, we propose a novel network embedding framework, named information behavior extraction (IBE), that incorporates nodes' topological features, attribute features, and information behavior features within a joint embedding framework. To design IBE, we use an existing embedding method (e.g., SDNE, CANE, or CENE) to extract a node's topological features and attribute features into a basic vector. Then, we propose a topic-sensitive network embedding (TNE) model to extract a node's information behavior features and generate information behavior feature vectors. In our TNE model, we design an importance score rating (ISR) algorithm, which considers both the effect of a node's topic-based community and its interactions with adjacent nodes to capture the node's information behavior features. Eventually, we concatenate a node's information behavior feature vector with its basic vector to get its ultimate joint embedding vector. Extensive experiments demonstrate that our method achieves significant and consistent improvements over several state-of-the-art embedding methods on link prediction.


INTRODUCTION
Network embedding (NE), aiming to map nodes of networks into a low-dimensional vector space, has proved extremely useful in many applications, such as node classification (Perozzi, Al-Rfou & Skiena, 2014; Tang et al., 2015), node clustering (Cao, Lu & Xu, 2015), and link prediction (Grover & Leskovec, 2016). A number of network embedding models have been proposed to learn low-dimensional vectors for nodes by leveraging their structure and attribute information in the network. For example, spectral clustering is an early method for learning node embeddings, including models such as DGE (Perrault-Joncas & Meila, 2011), LE (Wang, 2012), and LLE (Roweis & Saul, 2000).

[Figure 1: Node information behavior across multiple topic-based communities (C0, C1, and C2 are topic-based communities; v0, v1, v2, v3 are nodes, with (v0, v3) ∈ C0, v1 ∈ C1, v2 ∈ C2). Nodes interact within a community, such as nodes v0 and v3. Nodes in different communities may also interact with each other, such as nodes v0 and v1, v1 and v2, and v2 and v3. Because of the bridge nodes v0 and v2, nodes v1 and v3 may have a link that is not represented in the current network. DOI: 10.7717/peerj-cs.1030/fig-1]

In networks with multiple topic-based communities, nodes interact both intra-community (such as nodes v0 and v3) and inter-community (such as nodes v0 and v1, v1 and v2, and v2 and v3). This means that a node may communicate and share information on various topics when interacting with neighboring nodes of different communities, building bridges among nodes that are not directly connected; for example, nodes v1 and v3 may have a link, unobserved in the network, because node v0 or v2 acts as a bridge. These information behaviors are therefore important features, and representation vectors learned without them are incomplete. However, existing embedding methods cannot cope with the information behavior of nodes.

To tackle the above-identified problems, we make the following contributions: (1) We demonstrate the importance of integrating structure features, attribute features, and nodes' information behavior features in attributed networks. (2) We propose a joint embedding framework, IBE, that adds the information behavior feature vector to a basic vector to obtain a final joint embedding vector, which has never been considered in the literature. The basic vector is generated by one of the existing embedding methods. Within the framework, we design an algorithm, ISR, to generate a topic-sensitive vector for a given topic, and then we obtain information behavior feature vectors by transposing the topic-sensitive embedding matrix composed of all topic-sensitive vectors. (3) We conduct extensive experiments on real-world information networks. Experimental results prove the effectiveness and efficiency of the proposed ISR algorithm and IBE framework.
The rest of the article is organized as follows. "Related Work" discusses several related works. We provide some definitions and problem formulation in "Problem Definition". "Our Approach" presents in detail our proposed IBE framework and ISR algorithm. We then show experimental results in "Experiments" before concluding the article in "Conclusion and Future Work".

RELATED WORK
In the last few years, a large number of NE models have been proposed to learn node embeddings efficiently. These methods can be classified into two categories based on whether they use structural information alone or together with attributes: (1) SNE methods, which consider purely structural information; and (2) ANE methods, which consider both structural information and attributes. In this section, we briefly review related work in these two categories.

SNE methods
DeepWalk (Perozzi, Al-Rfou & Skiena, 2014) employs Skip-Gram (Mikolov et al., 2013) to learn the representations of nodes in the network. It uses random selection of nodes and truncated random walks to generate random walk sequences of fixed length. These sequences are then fed to the Skip-Gram model to learn the distributed node representations. LINE (Tang et al., 2015) studies the problem of embedding very large information networks into low-dimensional vector spaces. Node2vec (Grover & Leskovec, 2016) improves the random walk strategy and achieves a balance between BFS and DFS. SDNE (Wang, Cui & Zhu, 2016) proposes a semi-supervised deep model, which can learn a highly nonlinear network structure. It combines the advantages of first-order and second-order estimation to represent the global and local structural attributes of the network. Besides, there are many other SNE methods; Goyal & Ferrara (2017) provide a systematic analysis of various structural graph embedding models and explain their differences. Nevertheless, these methods fully utilize structural information but do not consider attribute information.

ANE methods
CANE (Tu et al., 2017) proposes a network embedding approach that considers both context-free and context-aware node text information. CENE (context-enhanced network embedding) (Sun et al., 2016) regards text content as a special kind of node and leverages both structural and textual information to learn network embeddings. TopicVec (Li et al., 2016) proposes to combine the word embedding pattern and the document topic model. JMTS (Alam, Ryu & Lee, 2016) proposes a domain-independent topic sentiment model to integrate topic semantic information into embeddings. ASNE (Liao et al., 2018) adopts a deep neural network framework to model the complex interrelations between structural information and attributes. It learns node representations from social network data by leveraging both structural and attribute information. ABRW (Hou, He & Tang, 2018) reconstructs a unified denser network by fusing structural information and attributes for information enhancement. It employs a weighted random-walk-based network embedding method for learning node embeddings and addresses the challenges of embedding incomplete attributed networks. AM-GCN (Wang et al., 2020) is able to fuse topological structures and node features adaptively. Moreover, there exist quite a few survey papers (Zhang et al., 2020; Daokun et al., 2017; Peng et al., 2017; Cui et al., 2019), which provide a comprehensive, up-to-date review of state-of-the-art network representation learning techniques. They cover not only early work on preserving network structure, but also a new surge of work incorporating node content and node labels.
Obtaining network embeddings that account for the attributes of local context and topic is challenging due to its complexity. Quite a few works have addressed this issue; however, none of them consider node information behavior features in attributed networks.

PROBLEM DEFINITION
In this section, we present the necessary definitions and formulate the problem of link prediction in attributed networks.
Definition 1 (Networks) A network can be represented as a graph G = (V, E, Δ, A), where V = {v_0, v_1, …, v_(|V|−1)} is the set of nodes and |V| is the total number of nodes in G. E ⊆ V × V is the set of edges between the nodes. Δ = {δ_0, δ_1, …, δ_(τ−1)} is the set of topics, where each δ represents a topic and also serves as a topic-based node label identifying the topic of a node, and τ is the total number of topics. A is a function that associates each node in the network with a set of attributes, denoted A(v).
Definition 2 (Adjacent-node set and node degree) The adjacent-node set of a node v ∈ V is defined as N_v = {v′ : (v, v′) ∈ E}. The degree of node v, denoted v_degree, is the number of nodes in the adjacent-node set of v.
Definition 3 (Topic-based community) Each node v has topic-based labels identifying the topics it belongs to. A topic-based community is the node set consisting of the nodes with the same topic-based label; we define the topic-based community for topic δ ∈ Δ as C_δ = {v ∈ V : v has label δ}, and the number of nodes in C_δ is denoted |C_δ|. The topic-based community set is represented as C_Δ = {C_δ : δ ∈ Δ}, the set of all topic-based communities. Note that we assume each node has at least one topic-based label, and one node can belong to several topic-based communities.
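To make Definitions 1–3 concrete, the following sketch builds topic-based communities from per-node topic labels; the node IDs and topic names are hypothetical, and a node with several labels (like v0 below) lands in several communities:

```python
# Build topic-based communities C_delta from per-node topic labels.
# Node IDs and topic names are illustrative, not from the paper's datasets.
from collections import defaultdict

node_topics = {
    0: {"sports", "music"},   # v0 carries two topic-based labels
    1: {"music"},
    2: {"movies"},
    3: {"sports"},
}

communities = defaultdict(set)  # C_delta: topic label -> set of member nodes
for node, topics in node_topics.items():
    for topic in topics:
        communities[topic].add(node)

print(dict(communities))
```

Here `communities["sports"]` is C_sports = {0, 3}, and |C_sports| = 2 is the quantity |C_δ| used in the community score later on.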
Definition 4 (Information behavior) Information behavior is an individual node's action in a topic category, including information inquiry, information access, information interaction, information sharing, etc.
Message passing, as in GCN, is a recursive process of propagating global information. Different from the message-passing mechanism, information behavior is a process of local, independent information aggregation.
Definition 5 (Importance score) Given a topic δ, the importance score x_i (0 ≤ i < |V|) of node v_i is computed as

x_i = β · m_i + (1 − β) · s_i, (1)

where 0 ≤ β ≤ 1 is a hyper-parameter, and m_i and s_i are the adjacent score and community score of node v_i, respectively. v_i's adjacent score m_i is defined as the weighted importance score of its adjacent nodes:

m_i = Σ_{v_k ∈ N_i} x_k / v_k_degree,

where x_k is the importance score of v_k, v_k_degree is the degree of node v_k, and N_i is the adjacent-node set of node v_i. Moreover, v_i's community score s_i with respect to the topic δ is defined as

s_i = 1/|C_δ| if v_i ∈ C_δ, and s_i = 0 otherwise,

where C_δ is a topic-based community and |C_δ| is the number of nodes in C_δ. The importance score x_i of node v_i reflects the interaction between v_i and its adjacent nodes N_i, as well as the level of correlation between v_i and its topic-based community C_δ (δ ∈ Δ).
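A single importance-score update from Definition 5 can be sketched as follows. Note that the β-weighted combination x_i = β·m_i + (1 − β)·s_i is our reading of the text (the equation itself is garbled in the source), and the toy graph is hypothetical:

```python
# One update of the importance score x_i = beta*m_i + (1-beta)*s_i.
# The beta-weighted combination is a reconstruction from the surrounding text.
def importance_score(i, x, adj, degree, community, beta=0.85):
    # Adjacent score: neighbors' scores, each weighted by 1/degree
    m_i = sum(x[k] / degree[k] for k in adj[i])
    # Community score: 1/|C_delta| if v_i belongs to the community, else 0
    s_i = 1.0 / len(community) if i in community else 0.0
    return beta * m_i + (1.0 - beta) * s_i

# Toy star graph: v0 linked to v1 and v2 (all values illustrative)
adj = {0: [1, 2], 1: [0], 2: [0]}
degree = {0: 2, 1: 1, 2: 1}
x = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # uniform 1/|V| initialization
print(importance_score(0, x, adj, degree, community={0, 2}))
```

With β = 0.85, node v0 gets m_0 = x_1 + x_2 = 2/3 from its degree-1 neighbors and s_0 = 1/2 from its two-node community.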
Definition 6 (Topic-sensitive vector) Given a topic δ, the importance scores of all nodes form a topic-sensitive vector γ_δ = (x_0, x_1, …, x_(|V|−1)). The learning process of the topic-sensitive vector is an iterative process for computing the importance scores of all nodes. Each x_i is initialized to 1/|V|. Each iteration is a one-order aggregation of adjacent scores and the community score, and a new value 1/|C_δ| is added to x_i for each node v_i in C_δ. After a number of iterations (i.e., higher-order aggregations), the ratios between the x_i of the nodes stabilize, but each x_i continues to grow due to the continuous addition of v_i's community score s_i. So, after each iteration, we normalize the x_i of every node by

x̂_i = x_i / Σ_j x_j.

In this way, the x_i obtained from this iteration process eventually converge.

Attributed network embedding. Given an attributed network G = (V, E, Δ, A), our goal is to extract node information behavior features and learn an information behavior feature vector z_v^I ∈ R^d (d ≪ |V|) for each node v, such that the similarity between z_v^I and z_v′^I reflects the information behavior similarity of the two nodes v, v′ ∈ V. After that, the node information behavior feature vector z_v^I is concatenated with the node basic vector z_v^B ∈ R^d′ (d′ ≪ |V|), generated by one of the existing embedding methods, to get the ultimate joint embedding vector:

z_v = [z_v^B ‖ z_v^I], (2)

where [·‖·] denotes concatenating two vectors end to end. Nodes with similar network-structure features, node-attribute features, and information-behavior features are close to each other in the embedding space R^(d+d′).

OUR APPROACH
In this section, we introduce our method of information behavior feature extraction. We first propose our framework, IBE (information behavior extraction framework), which elaborates the components of node joint embedding vectors. Then, we present the TNE model (topic-sensitive network embedding model), which describes the process of generating information behavior feature vectors.

Information behavior extraction framework (IBE)
The information behavior features generated based on topics are complementary to the features of existing models. As shown in Fig. 2, the data sources of the IBE framework consist of two parts. One is the network embedding Z^B generated by one of the existing embedding methods, and the other is Z^I, where Z^B ∈ R^(|V|×d′) and Z^I ∈ R^(|V|×d) (d′, d ≪ |V|) are embedding matrices consisting of the embedding vectors z_v^B and z_v^I of the nodes in V, respectively. We linearly concatenate the embedding matrices Z^B and Z^I to generate a joint embedding matrix Z, which can be used for link prediction, recommendation, and other tasks in attributed networks.

Topic-sensitive network embedding model (TNE)
In this section, we present the process of extracting node information behavior features. As shown in Fig. 3, the TNE model consists of two parts: the ISR algorithm (importance score rating algorithm) and a topic-sensitive embedding matrix (Γ) transposition step.

Importance score rating algorithm (ISR)
The ISR algorithm computes the importance scores of all nodes (illustrated by Eq. (1) and Definition 5) and generates a topic-sensitive vector under a given topic. We first input the raw data, including the node set V, the adjacent-node sets N_v, and the topic-based community C_δ, to ISR (see Algorithm 1), and then simulate the iteration process of node information behavior under the given topic δ. When the importance scores of all nodes stabilize, the iteration terminates. We use a loss value as the metric for iteration termination, calculated as

loss = Σ_{i=0}^{|V|−1} |x_i − x_i′|,

where x_i is the current iteration's importance score for node v_i and x_i′ is the importance score from the previous iteration. After the iteration is done, we obtain a |V|-dimensional topic-sensitive vector γ_δ = (x_0, x_1, …, x_(|V|−1)) (illustrated by Definition 6), consisting of the importance scores of all nodes under the given topic δ ∈ Δ.
Given a topic δ, the computation of the topic-sensitive vector γ_δ proceeds as follows (see Algorithm 1). The inputs are initialized as: C_δ ∈ C_Δ, a topic-based community with |C_δ| nodes; β = 0.85, a hyper-parameter imposing the ratio between m_i and s_i; every element of the list γ_δ (and of the temporary copy used between iterations) set to 1/|V| for the given topic δ ∈ Δ; and loss = 30, a large initial value so that the first iteration always runs.

1. For a node v_i, lines 6–8 compute its adjacent score m_i over all of its neighbors N_i, and lines 9–13 compute its community score s_i (s_i = 1/|C_δ| if v_i ∈ C_δ, and s_i = 0 otherwise). Using m_i and s_i, line 14 computes the importance score x_i of node v_i, which is stored and used in the next iteration.

2. The second-layer loop (lines 5, 14–15) assembles the importance scores of all nodes into the list γ_δ = [x_0, x_1, …, x_(|V|−1)], which is the topic-sensitive vector γ_δ ∈ R^(|V|).

3. The first-layer loop (lines 3, 20) controls the iterations.
In the ISR algorithm, an information behavior feature vector is generated without an increase in time complexity. Let L be the number of iterations, n the number of nodes in the attributed network, and v_degree the degree of node v. The time complexity of the ISR algorithm for generating a topic-sensitive vector γ_δ = (x_0, x_1, …, x_(|V|−1)) is O(L · n · v_degree). Because L · v_degree and n are of the same order of magnitude, the time complexity of the ISR algorithm is thus O(n²).
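The whole ISR iteration, with the 1/|V| initialization, the per-iteration sum-normalization, and the absolute-difference loss described above, can be sketched as follows. The convergence threshold `tol` and the iteration cap are our assumptions; the authors' actual implementation is in the linked repository:

```python
# Sketch of the ISR algorithm for a single topic delta.
# `tol`, `max_iter`, and the exact normalization are assumptions consistent
# with the description in the text, not the authors' exact code.
def isr(nodes, adj, community, beta=0.85, tol=1e-6, max_iter=100):
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}          # every score starts at 1/|V|
    loss = 30.0                               # large initial loss, as in Algorithm 1
    it = 0
    while loss > tol and it < max_iter:       # first-layer loop: iterations
        new_x = {}
        for i in nodes:                       # second-layer loop: all nodes
            m_i = sum(x[k] / len(adj[k]) for k in adj[i])          # adjacent score
            s_i = 1.0 / len(community) if i in community else 0.0  # community score
            new_x[i] = beta * m_i + (1.0 - beta) * s_i
        total = sum(new_x.values())
        new_x = {v: val / total for v, val in new_x.items()}       # normalize
        loss = sum(abs(new_x[v] - x[v]) for v in nodes)            # L1 change
        x = new_x
        it += 1
    return [x[v] for v in sorted(nodes)]      # topic-sensitive vector gamma_delta

# Toy 4-cycle v0-v1-v2-v3-v0 with community C_delta = {v0, v3}
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
gamma = isr(nodes=[0, 1, 2, 3], adj=adj, community={0, 3})
print(gamma)
```

On this symmetric toy graph the community members v0 and v3 end up with equal scores that are strictly higher than those of v1 and v2, which is the intended effect of the community score term.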

Topic-sensitive embedding matrix (Γ) transposing
For Δ = {δ_0, δ_1, …, δ_(τ−1)}, the Γ matrix transposition calls Algorithm 1 to obtain the topic-sensitive vector γ_δ for each topic δ ∈ Δ. After all τ topic-sensitive vectors are obtained, we combine them to form a topic-sensitive embedding matrix Γ ∈ R^(τ×|V|), whose rows are the vectors γ_δ. Ultimately, Z^I is obtained by transposing the Γ matrix, as illustrated by Eq. (4): Z^I = Γ^T. Each row of Z^I is an information behavior feature vector z_v^I for node v, and the dimension of z_v^I is τ.
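The stack-and-transpose step just described is a one-liner in NumPy; the shapes below are illustrative, not from the paper's datasets:

```python
import numpy as np

# tau topic-sensitive vectors, each of length |V| (random values as stand-ins)
tau, n_nodes = 3, 5
gamma = np.random.rand(tau, n_nodes)   # Gamma in R^{tau x |V|}: one row per topic

Z_I = gamma.T                          # Z^I = Gamma^T in R^{|V| x tau}
# Row v of Z_I is node v's information behavior feature vector z_v^I,
# so its dimension equals the number of topics tau.
print(Z_I.shape)
```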
Generating node joint embedding vectors based on IBE

Z^B is a basic embedding matrix trained by one of the existing embedding methods; each row of Z^B is a basic vector z_v^B for a node v. Before obtaining Z by Eq. (7) according to the IBE framework, we first enlarge Z^I or Z^B by λ (Eq. (5)) so that the element values of λ·Z^I and Z^B, or of Z^I and Z^B/λ, are of the same order of magnitude, where λ is calculated as follows.
λ = b̄ / x̄, (5)

where b̄ is the average of all elements in Z^B and x̄ is the average of all elements in Z^I. We then further enlarge the element values of whichever of Z^I and Z^B has the larger AUC (Hanley & McNeil, 1982) value by the weight coefficient α (Eq. (6)).
α = (auc(Z^I) / auc(Z^B))^ψ, (6)

where auc(·) is a function used to calculate the AUC value, and ψ is an amplification factor for the ratio auc(Z^I)/auc(Z^B).
Note that we should not instead shrink the element values of Z^I or Z^B to bring them to the same order of magnitude, because element values that are too small may produce invalid results. So, according to the values of the coefficients α and λ, we divide the methods of linearly concatenating Z^I and Z^B into the following four cases.
Z = [(α·λ·Z^I) ‖ Z^B]        if λ ≥ 1 and α ≥ 1,
Z = [(λ·Z^I) ‖ (Z^B/α)]      if λ ≥ 1 and α < 1,
Z = [(α·Z^I) ‖ (Z^B/λ)]      if λ < 1 and α ≥ 1,
Z = [Z^I ‖ (Z^B/(α·λ))]      if λ < 1 and α < 1, (7)

where the operator [·‖·] denotes concatenation, α is an enlarging coefficient that makes the joint embedding matrix Z more similar to whichever of Z^I and Z^B has the higher AUC value, and λ (Eq. (5)) is an enlargement factor adjusted so that the element values of λ·Z^I and Z^B, or of Z^I and Z^B/λ, are of the same order of magnitude.
For the case [(α·λ·Z^I) ‖ Z^B] (λ ≥ 1 and α ≥ 1) in Eq. (7), each row of Z is the final joint embedding vector z_v = [α·λ·z_v^I ‖ z_v^B] for node v based on the IBE framework. The other three cases of Eq. (7) have similar matrix representations.
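The scaling-and-concatenation step can be sketched as follows. Only the (λ ≥ 1, α ≥ 1) case is stated explicitly in the text; the other three branches are reconstructed by symmetry and should be treated as assumptions:

```python
import numpy as np

def joint_embedding(Z_I, Z_B, lam, alpha):
    """Linearly concatenate Z^I and Z^B after magnitude- and AUC-based scaling.

    lam enlarges the matrix with the smaller element magnitudes; alpha enlarges
    the matrix with the higher AUC. Only the (lam >= 1, alpha >= 1) branch is
    given explicitly in the paper; the rest are reconstructed by symmetry.
    """
    if lam >= 1 and alpha >= 1:
        left, right = alpha * lam * Z_I, Z_B
    elif lam >= 1 and alpha < 1:
        left, right = lam * Z_I, Z_B / alpha
    elif lam < 1 and alpha >= 1:
        left, right = alpha * Z_I, Z_B / lam
    else:  # lam < 1 and alpha < 1
        left, right = Z_I, Z_B / (alpha * lam)
    return np.hstack([left, right])   # row-wise concatenation [. || .]

Z_I = np.full((4, 3), 0.01)
Z_B = np.ones((4, 8))
lam = np.abs(Z_B).mean() / np.abs(Z_I).mean()   # Eq. (5): ratio of element averages
Z = joint_embedding(Z_I, Z_B, lam, alpha=1.2)
print(Z.shape)   # each row is a joint vector in R^(d + d')
```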

EXPERIMENTS
In this section, we describe our datasets, baseline models and present the experimental results to demonstrate the performance of the IBE framework in link prediction tasks. The source code and datasets can be obtained from https://github.com/swurise/IBE.

Datasets
In Table 1, we consider the following real-world network datasets. BlogCatalog (http://networkrepository.com/soc-BlogCatalog.php) is a social blog directory. The dataset contains 39 topic labels, 10,312 users, and 667,966 links. Zhihu is the largest online Q&A website in China; users follow each other and answer questions on this site. We randomly crawl 10,000 active users from Zhihu and take the descriptions of the topics they follow as text information (Tu et al., 2017). The topics of Zhihu are obtained with the fastText model (Joulin et al., 2016) and the ODP predefined topic categories (Haveliwala, 2002). fastText uses a hierarchical softmax based on the Huffman tree to improve the softmax classifier of CBOW, taking advantage of the fact that the class distribution is unbalanced (Joulin et al., 2016). Wiki contains 2,408 documents from 17 classes and 17,981 edges between them. Cora is a research paper citation network constructed by McCallum et al. (2000); after filtering out papers without text information, 2,277 machine learning papers are divided into seven categories and 36 subcategories in this network. Citeseer is divided into six communities, Agents, AI, DB, IR, ML, and HCI, with 4,732 edges between them. Similar to Cora, it records citing and cited relations between papers.

Baselines
To validate the performance of our approach, we employ several state-of-the-art network embedding methods as baselines to compare with our IBE framework. A number of existing embedding methods are introduced as follows.
CANE (Tu et al., 2017) learns context-aware embeddings for nodes with a mutual attention mechanism and extracts semantic relationship features between nodes. It jointly leverages network structure and textual information by regarding text content as a special kind of node. DeepWalk (Perozzi, Al-Rfou & Skiena, 2014) transforms a graph structure into a sample set of linear sequences of nodes using uniform sampling. These linear sequences are fed to the Skip-Gram model to learn the distributed node embeddings. HOPE (Ou et al., 2016) is a graph embedding algorithm that scalably preserves high-order proximities of large-scale graphs and captures asymmetric transitivity. LAP (Belkin & Niyogi, 2001) is a geometrically motivated algorithm for constructing a representation of data sampled from a low-dimensional manifold embedded in a higher-dimensional space. LINE (Tang et al., 2015) learns node embeddings in large-scale networks using first-order and second-order proximity between nodes. Node2vec (Grover & Leskovec, 2016) follows the same idea as DeepWalk, using random walk sampling to obtain sequences of node contexts, and then learns node embeddings using the word2vec method. GCN (Kipf & Welling, 2016) uses an efficient layer-wise propagation rule based on a localized first-order approximation of spectral graph convolutions. The GCN model is capable of encoding graph structure and node features in a scalable approach for semi-supervised learning.
GAT (Veličković et al., 2018) is a convolution-style neural network that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of methods based on graph convolutions. GAT enables implicitly specifying different weights to different nodes within a neighborhood.

Evaluation metrics and parameter settings
We randomly divide all edges into a training set and a testing set, and take the standard AUC score (area under the ROC curve) (Hanley & McNeil, 1982) as the evaluation metric for link prediction performance. AUC represents the probability that the nodes of a random unobserved link are more similar than those of a random nonexistent link. Because the number of topics differs across datasets, we use τ to denote the maximum number of topics for each dataset. In the concatenating weight coefficient α = [auc(Z^I) ÷ auc(Z^B)]^ψ, we set ψ = 4 unless otherwise specified.
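The AUC evaluation described above can be sketched as follows: score held-out (positive) and sampled nonexistent (negative) node pairs by embedding similarity, then compute the probability that a positive pair outscores a negative one. The dot-product scorer and the pair sampling are our assumptions, not necessarily the authors' exact protocol:

```python
import numpy as np

def pairwise_auc(Z, pos_pairs, neg_pairs):
    """AUC for link prediction on an embedding matrix Z.

    Positive pairs are held-out edges; negative pairs are sampled non-edges.
    Each pair is scored by the dot product of its node embeddings
    (the scorer choice is an assumption)."""
    pos = [float(Z[u] @ Z[v]) for u, v in pos_pairs]
    neg = [float(Z[u] @ Z[v]) for u, v in neg_pairs]
    # AUC = P(random positive pair scores above a random negative pair),
    # with ties counted as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))            # toy embedding matrix, |V|=6, dim 4
auc = pairwise_auc(Z, pos_pairs=[(0, 1), (2, 3)], neg_pairs=[(0, 5), (1, 4)])
print(auc)
```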

Experimental results
For the experiments, we evaluate the effectiveness and efficiency of IBE on five networks for the link prediction task. For each dataset, we compare the AUC values of the basic embedding matrix Z^B generated by one of the existing embedding methods, of the information behavior feature vectors Z^I, and of their joint embedding vectors Z generated by the IBE framework. We employ six state-of-the-art embedding methods as baselines, including Node2vec, DeepWalk, LINE, LAP, CANE, and HOPE, for comparison with their IBE extensions in the following experiments. Table 2 compares AUCs over the five datasets. By concatenating Z^B and Z^I linearly, the joint embedding vectors Z achieve the best performance. Especially on the BlogCatalog and Zhihu datasets, the AUC values of the joint embedding vectors Z are higher than their baselines by 14.6% and 27.6% on average, respectively. One reason may be that the average AUC values of the information behavior feature vectors Z^I are 81.7 and 82.4, respectively, more than 11% higher than the average over the other three datasets. The other reason is that the maximum topic numbers τ of BlogCatalog and Zhihu are larger than those of Cora and Citeseer. In Table 2, we can also see that most of the AUC values of the joint embedding vectors Z for the Wiki and Cora datasets exceed 90%; the reason is that the AUC values of their baselines are relatively high, most of them above 85%. On the Citeseer dataset in Table 2, the improvement of the AUC values of the joint embedding vectors Z over their baselines is smaller. This can be explained by the fact that the AUC values of the information behavior feature vectors Z^I are all low, around 55%, and the direct reason for the low AUC values is that the number of topics is too small. Because the number of topics is small, the topic subdivision of nodes is coarse, and the node labels cannot be classified in much detail.
In general, Table 2 shows that the AUC values of the concatenated embedding vectors Z are higher than those of Z^B and Z^I, which indicates that the concatenation method properly integrates the features of both parts.
Ablation experiments: To investigate the effectiveness of TNE, we perform several ablation studies. In Table 2, Z^B is the basic embedding generated by an existing model, and Z^I is obtained by the TNE model. The joint embedding Z is obtained by adding a Z^I to a Z^B, as shown in Fig. 2 and Eq. (2). We observe that the quality of the joint embedding Z is better than that of its Z^B alone.

Parameter sensitivity analysis
We further perform a parameter sensitivity analysis in this section; the results are summarized in Figs. 4 and 5. Due to space limitations, we only take the Wiki and Zhihu datasets as examples to examine how the topic number j (0 ≤ j ≤ τ) and the amplification factor ψ of vector concatenation affect the link prediction results.
The topic number j (0 ≤ j ≤ τ): In Fig. 4, we illustrate the relationship between the number of topics and link prediction, where node topics are selected in random order. When j = 0, Z^I does not exist and Z degenerates to Z^B. As shown in Fig. 4, as j increases from 1 to τ, Z^B is linearly combined with more topic-based feature dimensions from Z^I, and the AUC values keep changing. When the AUC values of Z^B are below 82% in Fig. 4A, the AUC values increase sharply as j increases. When the AUC values of Z^B are higher than a certain critical value, the AUC values increase more slowly or even stop growing as j increases. So, when the number of topics in a dataset is large, each node can be classified in detail by the topic classification labels, which helps to improve the AUC values using only a small number of topics. These results also show that the AUC values of the concatenated embedding vectors Z are higher than those of either concatenated part, that is, Z^I and Z^B, but they do not increase indefinitely.

[Table 2: The AUC values of Z^I, Z^B (baseline), and Z (baseline) on different datasets, where '(baseline)' distinguishes the various network embeddings Z^B from their extension embeddings Z; ψ = 4 unless otherwise specified.]
The amplification factor ψ of vector concatenation: ψ is an amplification factor for α (Eq. (6)), the weight coefficient that enlarges the element values of whichever of Z^I and Z^B has the larger AUC (Eq. (7)). From Fig. 5, we can see that the AUC values for ψ < 0 are lower than those for ψ > 0. The reason is that when ψ < 0, the weight coefficient α enlarges whichever of Z^I and Z^B has the smaller AUC value; as a result, the joint embedding Z is more similar to the one with the lower AUC value. When ψ is between 1 and 5, the prediction results are best. However, as ψ increases further, the AUC values decrease slightly and tend toward those of whichever of Z^I and Z^B has the larger AUC value.

CONCLUSION AND FUTURE WORK
This article has presented an effective network embedding framework, IBE, which can easily incorporate topology features, attribute features, and topic-based information behavior features into network embeddings. In IBE, we linearly combine Z^I and Z^B to generate the node joint embedding matrix Z. To obtain Z^I, we have proposed the TNE model to extract nodes' information behavior features. The model contains an ISR algorithm to generate the topic-sensitive embedding matrix (Γ) and a Γ matrix transposition step that transposes Γ into the information behavior feature matrix Z^I for the nodes. Experimental results on various real-world networks have shown the efficiency and effectiveness of the joint embedding vectors in link prediction. In the future, we plan to investigate other feature extraction methods that may better integrate with the TNE model. Moreover, we will further investigate how the TNE model works in heterogeneous information networks.