Network Embedding-Based Anomalous Density Searching for Multi-Group Collaborative Fraudsters Detection in Social Media

Abstract: Detecting collaborative fraudsters who manipulate opinions in social media is becoming extremely important for providing reliable information. However, the diversity among different groups of collaborative fraudsters presents a significant challenge to existing detection methods. These methods often detect collaborative fraudsters as the largest group of users who have the strongest relations with each other in the social media, and consequently overlook other groups of fraudsters that have strong user relations yet a small group size. This paper introduces a novel network embedding-based framework, NEST, and its instance, BEST, to address this issue. NEST detects multiple groups of collaborative fraudsters in two steps. In the first step, to disclose user collaboration, it represents users according to their social relations. In the second step, to identify the collaborative fraudsters, it detects the user groups with anomalously large group density in the representation space. BEST instantiates NEST by using a bipartite network embedding method to represent users and adopting a fast density group detection method based on the k-dimensional tree. Our experiments show that, compared to three state-of-the-art competitors, BEST (i) performs significantly better in detecting fraudsters on four real-world social media data sets, and (ii) effectively detects multiple groups of collaborative fraudsters.


Introduction
The reliability of social media content is becoming increasingly significant because social media heavily affects people every day. Unfortunately, a large proportion of social media content is posted by fraudsters who collaborate to manipulate social opinions, driven by huge profits and reputation incentives [Mukherjee, Venkataraman, Liu et al. (2013)]. As a result, effectively detecting such collaborative fraudsters is critical and of great business value [Akoglu, Chandy and Faloutsos (2013)].

Recent years have seen significant progress in fraudster detection. Current efforts mainly focus on extracting fraudster indicators and/or features from users' behavior [Mukherjee, Liu and Glance (2012); Ye and Akoglu (2015); Hooi, Shin, Song et al. (2017)] or users' posted content [Mukherjee, Venkataraman, Liu et al. (2013); Wang, Liu and Zhao (2017); You, Qian and Liu (2018)]. Because anomalous behavior and content are highly discriminative, these indicators and/or features have shown remarkable performance in detecting individual fraudsters [Rayana and Akoglu (2016)].

However, identifying fraudsters who manipulate opinions collaboratively is a challenging task. Specifically, collaborative manipulation poses two major challenges: (i) the content of collaborative fraudsters may not be anomalous, because the collaborative manipulation may dominate social opinions; (ii) professional fraudsters imitate the behavior of honest users to evade inspection [Hooi, Song, Beutel et al. (2016)]. These two challenges cause current behavior- and content-based fraudster detection methods to fail in detecting collaborative fraudsters.

To detect collaborative fraudsters, dense subgraph mining methods [Hooi, Song, Beutel et al. (2016); Hooi, Shin, Song et al. (2017); Wu, Hu, Morstatter et al. (2017); Liu, Hooi and Faloutsos (2017); Xiang, Shen, Qin et al. (2018); Xiang, Zhao, Li et al. (2018)] are the major solutions; they detect collaborative fraudsters according to their significant collaboration footprint. Specifically, dense subgraph mining methods always detect collaborative fraudsters as the largest group of users who have the strongest relations with each other in the social media. In this way, however, they may overlook other groups of fraudsters that have strong user relations yet a small group size. In reality, social media may contain multiple groups of collaborative fraudsters rather than only the largest one.

In this paper, we introduce a novel Network Embedding-based denSiTy subgraph mining (NEST for short) framework for multi-group collaborative fraudster detection in social media. Specifically, NEST first represents users according to their social relations to disclose user collaboration. In this process, users who have similar activities are embedded near each other in the representation space. NEST then detects the user groups with anomalously large group density in the representation space to identify the collaborative fraudsters. Accordingly, any group of collaborative fraudsters with many joint activities can be effectively detected.

Essentially, this detection procedure simultaneously tackles three challenges brought by collaborative fraudsters: content domination, behavior camouflage, and multiple fraudster groups, resulting in a robust and comprehensive detection result. In the first step, NEST solves the content domination and behavior camouflage problems by distilling user social relations, which are reflected in users' joint activities. The rationale is that collaborative fraudsters cannot avoid cooperating when they manipulate opinions. In the second step, NEST discovers fraudster groups by analyzing the outliers of group density in the representation space.
The intuition is that the joint activities of collaborative fraudsters must be more frequent than those of honest users, while the number of fraudsters is much smaller than the number of honest users.

We further implement NEST by proposing a Bipartite network Embedding-based fast denSiTy subgraph mining method based on the k-dimensional tree structure, termed BEST. Specifically, BEST first models the users and their activities as a bipartite network, as demonstrated in Fig. 1. In the bipartite network, the nodes on the two sides are users and activities, respectively, and a link indicates that a user participates in an activity. Then, to comprehensively capture user collaborations, BEST represents users by embedding both the explicit and implicit relations in the bipartite network. Lastly, to detect the collaborative fraudsters quickly, BEST builds a k-dimensional tree over the representation space and searches for the anomalous density groups based on the k-dimensional tree.

Accordingly, this paper makes two major contributions:
- We introduce a novel network embedding-based framework, NEST, for identifying collaborative fraudsters in social media. NEST represents users according to their social relations and detects fraudsters by analyzing the outliers of group density in the representation space. It results in more reliable and comprehensive collaborative fraudster detection compared to existing dense subgraph mining-based solutions.
- We instantiate NEST as an effective and efficient multi-group collaborative fraudster detection method, BEST, by introducing bipartite network embedding and k-dimensional tree-based anomalous density group searching. The bipartite network embedding captures both explicit and implicit user relations, and the k-dimensional tree-based method guarantees the efficiency of the density group search.

Extensive empirical results show that, compared to three state-of-the-art competitors, (i) BEST performs significantly better in detecting fraudsters on four large real-world social media data sets; and (ii) BEST effectively detects multiple groups of collaborative fraudsters.
Related work

Fraudster detection
Current efforts on fraudster detection can be roughly classified into two categories: individual characteristics-based methods and relational characteristics-based methods.

The individual characteristics-based methods use the user-posted content and/or the user's behavior to identify whether a user is a fraudster. The information used by these methods mainly includes the statistical and linguistic characteristics of the content [Li, Huang, Yang et al. (2011); Mukherjee, Kumar, Liu et al. (2013); Wang, Liu and Zhao (2017); You, Qian and Liu (2018)], and the historical actions of a user [Fei, Mukherjee, Liu et al. (2013); Mukherjee, Venkataraman, ]. These individual characteristics are designed as features for fraudster detection [Jindal and Liu (2008); Lim, Nguyen, Jindal et al. (2010); Zhao, Resnick and Mei (2015); Li, Fei, Wang et al. (2017)]. However, as evidenced by Hooi et al. [Hooi, Song, Beutel et al. (2016)], individual characteristics are not robust against collaborative fraudsters, who jointly manipulate social opinions and may imitate the behavior of honest users.

The relational characteristics-based methods capture user-activity, user-user, and activity-activity relations, typically via a graph [Pandit, Chau, Wang et al. (2007)]. They assume that fake reviews are produced by groups of fraudsters. Under this assumption, a group of fraudsters will have dense links to a group of manipulated activities (user-activity relation) [Akoglu, Chandy and Faloutsos (2013); Wang, Xie, Liu et al. (2011)], a group of fraudsters will co-occur in many activities (user-user relation) [Wu, Hu, Morstatter et al. (2017); Sun, Qu, Chakrabarti et al. (2005); Xu, Zhang, Chang et al. (2013)], and different manipulated activities will share linked fraudsters (activity-activity relation) [Hovy (2016)].

Although current methods show their strengths in disclosing fraudsters, most of them fail to discover multiple groups of collaborative fraudsters in a social network. In this paper, we propose a network embedding-based framework, NEST, to fill this gap in multi-group collaborative fraudster detection. NEST achieves more reliable and comprehensive detection by revealing the users within dense groups in its representation space, which embeds the users' social relationships.
In the first step, NEST models the social media data as a bipartite network $G = (U, A, E)$, where $U$ and $A$ are the nodes on the two sides of $G$, respectively, and $E \subseteq U \times A$ defines the inter-set edges. Here, each edge in $E$ carries a non-negative weight $w_{ij}$, reflecting the strength of the relation between a user $u_i$ and an activity $a_j$; $w_{ij}$ is zero if the user $u_i$ does not join the activity $a_j$. Accordingly, the weights in the bipartite network can be represented by an $n \times m$ matrix $W = [w_{ij}]$, where $n = |U|$ and $m = |A|$. Then, NEST learns an embedding function $f(\cdot): U \to \mathbb{R}^d$, which maps a user $u_i$ to a $d$-dimensional vector representation $\mathbf{u}_i$. The embedding function $f(\cdot)$ should capture the social relations of users in the bipartite network and embed them into the representation space.

In the second step, NEST finds the anomalous density groups in the user representation space and treats the users in these groups as collaborative fraudsters. Formally, NEST detects a set of collaborative fraudster groups

$$\mathcal{S} = \{\, S_i \mid |S_i| > \varepsilon,\; S_i = \{ u_j \in U \mid \mathrm{dist}(\mathbf{u}_i, \mathbf{u}_j) \le \eta \} \,\}, \tag{1}$$

where $\mathrm{dist}(\cdot, \cdot): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a distance measure built on the user representation space, $|\cdot|$ refers to a density measurement, and $\eta$ and $\varepsilon$ are two parameters which control the density range and the density anomalous degree, respectively.

Essentially, NEST embeds the collaboration footprints of users into a vector space where users who joined similar activities are located together. Therefore, the density of a group of users in the vector space reflects the degree of collaboration within this group: the larger the density of a group, the more its users collaborate. Because collaborative fraudsters may cooperate much more than honest users [Hooi, Song, Beutel et al. (2016)], NEST can effectively detect collaborative fraudsters by searching for the groups with anomalous density in the user representation space. Different from typical dense subgraph mining methods, which only disclose a single group of collaborative fraudsters, i.e., the users in the largest dense subgraph, NEST provides a more comprehensive detection result that contains multiple groups of collaborative fraudsters.

NEST generalizes well, since it can be instantiated with any network embedding method and any anomalous density group searching method. We introduce an instance of NEST in the next section and then verify its performance by empirical analyses.

A NEST instance: BEST
BEST instantiates NEST with a bipartite network embedding method tailored to social networks, and a k-dimensional tree-based anomalous density group searching method for efficient fraudster detection.

Figure 1: NEST framework. In the first step, NEST extracts a bipartite network from social media data and represents users in a vector space by embedding their social relations in the bipartite network. In the second step, NEST searches for the anomalous density groups of users in the representation space for collaborative fraudster detection. The detected collaborative fraudsters are illustrated with a grey background, and their corresponding groups are highlighted by dotted circles.

Bipartite network embedding
The network embedding reveals the social relations of a user and embeds them into the user's vector representation, which reflects the cooperation of users in social media. We introduce a bipartite network embedding method to jointly capture the explicit and implicit relations of users in social media.

Explicit relations embedding
The explicit relations refer to the direct links between users and activities, which reflect the activities a user joined. If two users always join similar activities, their similarity should be large in the representation space. To preserve the explicit relations, we keep the preference of users in their representation space. Specifically, we measure the preference of a user both in the social media and in the representation space, and make the preference in the representation space similar to that in the social media. For the preference measurement in social media, we consider the probability that a user joins an activity. Given the bipartite network, this probability can be calculated as follows:

$$P(i, j) = \frac{w_{ij}}{\sum_{e_{st} \in E} w_{st}},$$

where $w_{ij}$ is the weight of edge $e_{ij}$. The measurement reflects the preference distribution of users. We follow the setting of word2vec and use the sigmoid function to measure the interaction of a user and an activity in their representation space as a probability:

$$\hat{P}(i, j) = \frac{1}{1 + \exp(-\mathbf{u}_i^{\top} \mathbf{a}_j)},$$

where $\mathbf{u}_i$ and $\mathbf{a}_j$ are the embedding vectors of $u_i$ and $a_j$, respectively. Then, we adopt the KL-divergence to measure the difference between $P$ and $\hat{P}$, and optimize the user and activity representations to minimize it:

$$\min \; \mathrm{KL}(P \,\|\, \hat{P}) = \sum_{e_{ij} \in E} P(i, j) \log \frac{P(i, j)}{\hat{P}(i, j)}.$$
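The explicit-relation objective can be illustrated with a small pure-Python sketch. It assumes that the empirical probability of an edge is its weight divided by the total edge weight, and that the modelled probability is the sigmoid of the dot product of the two embedding vectors; the function names (`edge_probabilities`, `explicit_loss`) and the dictionary-based data layout are hypothetical and chosen only for illustration.

```python
import math


def edge_probabilities(weights):
    """Empirical P(i, j): edge weight divided by the total edge weight (assumed form)."""
    total = sum(weights.values())
    return {edge: w / total for edge, w in weights.items()}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def explicit_loss(weights, user_vecs, act_vecs):
    """Sum over observed edges of P * log(P / P_hat), the KL-style explicit objective.

    P_hat is not normalised over edges, so the value can be negative;
    lower values mean the embeddings agree better with the observed edges.
    """
    p = edge_probabilities(weights)
    loss = 0.0
    for (i, j), p_ij in p.items():
        dot = sum(a * b for a, b in zip(user_vecs[i], act_vecs[j]))
        loss += p_ij * math.log(p_ij / sigmoid(dot))
    return loss


# Toy bipartite network: (user index, activity index) -> weight.
toy_weights = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 1.0}
zero_vecs = [[0.0, 0.0], [0.0, 0.0]]
aligned_vecs = [[2.0, 0.0], [2.0, 0.0]]
loss_zero = explicit_loss(toy_weights, zero_vecs, zero_vecs)
loss_aligned = explicit_loss(toy_weights, aligned_vecs, aligned_vecs)
```

In this toy setting, embeddings whose dot products are large for the observed edges give a lower loss than all-zero embeddings, which is the behaviour a gradient-based optimizer would exploit.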

Implicit relations embedding
The implicit relations refer to the relations between users and activities that are not directly connected. For two users, if there exists a path between them in the bipartite network, they may have an implicit relation, and the weight of the path reflects the strength of this implicit relation. However, counting the paths between two nodes in a bipartite network has a very high complexity, which is impractical for social media. Inspired by DeepWalk [Perozzi, Al-Rfou and Skiena (2014)], we perform truncated random walks on the network to generate a corpus of nodes as random walk paths, which contain higher-order implicit relations between nodes. We move a step further and reconstruct the bipartite network $G$ as two networks, each of which contains only users ($G^{(u)}$) or only activities ($G^{(a)}$), and conduct random walks on these two transformed networks. This results in a stationary distribution of random walks on social media data [Gao, Chen, He et al. (2018)]. In $G^{(u)}$, $u_i$ and $u_j$ will have an edge [...]

BEST jointly considers the explicit and implicit relations embedding, forming the joint embedding objective function in Eq. (8), where $\alpha$, $\beta$ and $\gamma$ are the hyper-parameters that trade off the effects of the three components. This objective function can be effectively solved by stochastic optimization methods. By solving the objective function (8), BEST represents users in a vector space where the users' social relations have been embedded.
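The idea of walking on a user-only network can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the projected weight between two users is assumed to be the sum, over shared activities, of the product of their edge weights, and the next node of a walk is assumed to be drawn in proportion to edge weight. The names `project_users` and `truncated_walk` are hypothetical.

```python
import random
from collections import defaultdict


def project_users(weights):
    """Build a user-user network from (user, activity) -> weight edges.

    Assumption: the projected weight of (u_i, u_j) is the sum over shared
    activities a_k of w_ik * w_jk.
    """
    by_activity = defaultdict(list)
    for (user, activity), w in weights.items():
        by_activity[activity].append((user, w))
    graph = defaultdict(dict)
    for members in by_activity.values():
        for ui, wi in members:
            for uj, wj in members:
                if ui != uj:
                    graph[ui][uj] = graph[ui].get(uj, 0.0) + wi * wj
    return graph


def truncated_walk(graph, start, length, rng):
    """Weighted random walk that stops after `length` nodes or at a dead end."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1])
        if not neighbours:
            break
        nodes = list(neighbours)
        probs = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=probs, k=1)[0])
    return walk


toy_edges = {("u1", "a1"): 1.0, ("u2", "a1"): 2.0,
             ("u2", "a2"): 1.0, ("u3", "a2"): 3.0}
user_graph = project_users(toy_edges)
example_walk = truncated_walk(user_graph, "u1", 4, random.Random(0))
```

The resulting walks play the role of sentences in a skip-gram style model, so users that frequently co-occur on walks end up with similar embeddings.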

K-dimensional tree-based anomalous density group searching
To quickly search for the anomalous density groups, BEST first builds a k-dimensional tree (kd-tree for short) over the user representation space, and then estimates the density around each user in that space. Finally, it adopts the criterion Eq. (1) in NEST to identify the anomalous density groups.

Given the user representations $\mathbf{u} = \{\mathbf{u}_1, \cdots, \mathbf{u}_n\}$, BEST builds a kd-tree, $v$, by Algorithm 2. As illustrated in Fig. 2, the kd-tree $v$ is a binary tree storing the user representations with their structure information, which enables the fast searching of anomalous density groups. The steps of Algorithm 2 include:
- Split $\mathbf{u}$ into two subsets according to the median value $q$ in the $l$-th dimension of the points in $\mathbf{u}$. Let $\mathbf{u}^{(1)}$ be the set of points whose $l$-th dimension value is smaller than or equal to $q$, and let $\mathbf{u}^{(2)}$ be the set of the other points.
- Create a node $v$ storing $q$ in the $l$-th dimension, make $v_{left}$ the left child of $v$, and make $v_{right}$ the right child of $v$.
- Return $v$.
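As a rough illustration of the median-split construction, the following Python sketch builds a kd-tree recursively. The names `KDNode`, `build_kdtree`, and `leaf_points` are hypothetical, and cycling the split dimension with the tree depth is an assumption, since the text does not state how the dimension $l$ is selected.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class KDNode:
    dim: int = -1                      # split dimension l (unused on leaves)
    split: float = 0.0                 # median value q in that dimension
    left: Optional["KDNode"] = None    # subtree for points with value <= q
    right: Optional["KDNode"] = None   # subtree for points with value > q
    points: Optional[List[Sequence[float]]] = None  # set only on leaves


def build_kdtree(points, depth=0, leaf_size=1):
    """Recursively split the points at the median of one dimension."""
    if len(points) <= leaf_size:
        return KDNode(points=list(points))
    dim = depth % len(points[0])       # assumed rule: cycle through dimensions
    values = sorted(p[dim] for p in points)
    q = values[(len(values) - 1) // 2]  # lower median
    left = [p for p in points if p[dim] <= q]
    right = [p for p in points if p[dim] > q]
    if not right:
        # Every point is <= q in this dimension (ties); keep them in one leaf.
        return KDNode(points=list(points))
    return KDNode(dim, q,
                  build_kdtree(left, depth + 1, leaf_size),
                  build_kdtree(right, depth + 1, leaf_size))


def leaf_points(node):
    """Collect all points stored in the leaves below `node`."""
    if node.points is not None:
        return list(node.points)
    return leaf_points(node.left) + leaf_points(node.right)


sample = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
sample_tree = build_kdtree(sample)
```

Because each split halves the point set, the tree depth grows logarithmically with the number of users when the data contain few ties, which is what makes the later range searches fast.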

Density estimation
BEST estimates the density around each user in the representation space based on the kd-tree $v$ according to Algorithm 3, where the function SEARCHKDTREE($\mathbf{u}_i$, $v$, $\eta$) returns the set of users around the user $\mathbf{u}_i$ within the range $\eta$ based on the kd-tree $v$. Essentially, BEST estimates the density around a user by the number of users within a certain distance of that user in the representation space. If a user has a large density, the user should have many collaborations with others. Accordingly, BEST uses the density as important evidence for identifying collaborative fraudsters.
Algorithm 3: Density estimation based on kd-tree
Input: A set of points $\mathbf{u}$, the kd-tree $v$, and the range $\eta$.
Output: A set of densities around each user, $\rho$, and a set of user sets, $S$.
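The density estimation step can be sketched in pure Python as a range search on a kd-tree. This is an illustrative sketch rather than the paper's Algorithm 3: the compact tuple-based tree, the Euclidean distance, and the choice to count a user in its own neighbourhood are assumptions, and the names `build`, `search_kdtree`, and `estimate_density` are hypothetical.

```python
import math


def build(points, idx, depth=0):
    """kd-tree over point indices; a leaf is a list of indices,
    an inner node is a tuple (dim, q, left_subtree, right_subtree)."""
    if len(idx) <= 1:
        return idx
    dim = depth % len(points[0])
    vals = sorted(points[i][dim] for i in idx)
    q = vals[(len(vals) - 1) // 2]           # lower median
    left = [i for i in idx if points[i][dim] <= q]
    right = [i for i in idx if points[i][dim] > q]
    if not right:                            # ties: keep everything in one leaf
        return idx
    return (dim, q, build(points, left, depth + 1), build(points, right, depth + 1))


def search_kdtree(points, node, center, eta, found):
    """Append to `found` the indices of all points within distance eta of center."""
    if isinstance(node, list):
        found.extend(i for i in node if math.dist(points[i], center) <= eta)
        return found
    dim, q, left, right = node
    if center[dim] - eta <= q:               # the ball may reach values <= q
        search_kdtree(points, left, center, eta, found)
    if center[dim] + eta > q:                # the ball may reach values > q
        search_kdtree(points, right, center, eta, found)
    return found


def estimate_density(points, eta):
    """Return (densities, groups): neighbourhood size and members per user.
    Each user is counted in its own neighbourhood (an assumption)."""
    tree = build(points, list(range(len(points))))
    groups = [sorted(search_kdtree(points, tree, p, eta, [])) for p in points]
    return [len(g) for g in groups], groups


toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 9.0)]
toy_density, toy_groups = estimate_density(toy_points, eta=1.0)
```

The pruning tests skip whole subtrees that cannot intersect the search ball, so each query typically inspects far fewer points than a brute-force scan.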

Collaborative fraudsters detection
BEST detects collaborative fraudsters after estimating the density around users in the user representation space. Specifically, it treats a density larger than a threshold $\varepsilon$, e.g., five times the average density, as anomalous, and labels the users in such dense areas as fraudsters. The procedure is summarized in Algorithm 4.
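The thresholding rule can be written in a few lines of Python. The sketch below assumes that the densities and neighbourhood groups have already been computed (one entry per user) and that every user inside an anomalously dense neighbourhood is flagged; the function name `detect_fraudsters` and the `factor` parameter are hypothetical.

```python
def detect_fraudsters(densities, groups, factor=5.0):
    """Flag all users that lie in a neighbourhood whose density exceeds
    `factor` times the average density.

    densities[i] is the density around user i and groups[i] the list of
    users in that neighbourhood. Returns the threshold and the flagged set.
    """
    epsilon = factor * sum(densities) / len(densities)
    flagged = set()
    for density, group in zip(densities, groups):
        if density > epsilon:
            flagged.update(group)
    return epsilon, flagged


# Toy input: 10 tightly collaborating users among 100 users in total.
toy_densities = [10] * 10 + [1] * 90
toy_groups = [list(range(10))] * 10 + [[i] for i in range(10, 100)]
toy_epsilon, toy_flagged = detect_fraudsters(toy_densities, toy_groups)
```

Because the threshold is relative to the average density, several separate dense groups can exceed it at once, which is how more than one group of collaborators can be returned.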

Data sets
The experiments are carried out on two large-scale real-world social media data sets, namely the Yelp restaurant and Yelp hotel data sets used in Mukherjee et al. [Mukherjee, Venkataraman, Liu et al. (2013)]. All the activities in these data sets have been assigned authenticity labels by commercial filters.

Evaluation metrics
We evaluate performance with three metrics: precision, recall, and F-score. Precision is the fraction of true fraudsters among the detected fraudsters, while recall is the fraction of true fraudsters that have been detected among all true fraudsters. Precision and recall should be considered jointly since fraudster detection is an imbalanced problem [Luca and Zervas (2016)], i.e., fraudsters are far fewer than honest users. Thus, we use the F-score, which balances precision and recall, as an averaged indicator. A higher F-score indicates better performance of a fraudster detection method. We report these three metrics separately for the ground-truth honest user and fraudster classes to illustrate the performance on each category, and further average them to show the overall performance. We follow the literature [Wang, Liu and Zhao (2017)] and use the results of the Yelp commercial fraud filter to evaluate the performance. Because the Yelp commercial fraud filter only gives authenticity labels for activities, we transform these activity labels into honest or fraudster labels of users as the ground truth. Considering the per-user distribution of fraud activities assigned by the commercial filter, we assign the fraudster label to a user if more than 80% of the user's activities have been labeled as fraud. The rationale is that we need to filter out the false positives made by the commercial filter [Li, Chen, Liu et al. (2014)]. In other words, we assume that a user with a higher proportion of activities flagged as fraud is more likely to be a real fraudster.
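The labeling rule and the per-class metrics can be sketched as follows. The sketch assumes that the F-score is the harmonic mean of precision and recall (F1) and that the filter output is available as one boolean per activity; the names `label_users` and `precision_recall_f` are hypothetical.

```python
def label_users(activity_flags, threshold=0.8):
    """Map user -> True (fraudster) if more than `threshold` of the user's
    activities were labelled as fraud by the filter.

    activity_flags: user -> list of booleans, True meaning 'labelled fraud'.
    Users without activities are skipped.
    """
    return {user: sum(flags) / len(flags) > threshold
            for user, flags in activity_flags.items() if flags}


def precision_recall_f(truth, predicted, positive=True):
    """Precision, recall and F1 for the class `positive`
    (True = fraudster, False = honest user)."""
    tp = sum(1 for u in truth if truth[u] == positive and predicted[u] == positive)
    fp = sum(1 for u in truth if truth[u] != positive and predicted[u] == positive)
    fn = sum(1 for u in truth if truth[u] == positive and predicted[u] != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


toy_flags = {
    "a": [True] * 9 + [False],   # 90% flagged: fraudster
    "b": [True] * 4 + [False],   # exactly 80% flagged: not more than 80%
    "c": [False] * 3,            # nothing flagged: honest
}
toy_truth = label_users(toy_flags)
toy_pred = {"a": True, "b": True, "c": False}
fraud_scores = precision_recall_f(toy_truth, toy_pred, positive=True)
honest_scores = precision_recall_f(toy_truth, toy_pred, positive=False)
```

Averaging the fraudster and honest-user scores gives an overall figure that is less dominated by the much larger honest class than plain accuracy would be.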

Parameters settings
In the experiments, we set the parameters of BEST as follows. To balance the explicit and implicit social relations, we set the hyper-parameters $\alpha$, $\beta$, and $\gamma$ in the network embedding objective function Eq. (8) to 0.5, 0.25, and 0.25, respectively. We train the network embedding with Adam [Kingma and Ba (2014)], using embedding dimension 128 and batch size 32. For the density estimation, we set the distance range $\eta$ to 1. For the anomalous density detection, we set the threshold $\varepsilon$ to five times the average density. For the parameters of the compared methods, we use their recommended settings.

Evaluation of BEST effectiveness on fraudster detection

Experimental settings
BEST is compared with two state-of-the-art competitors, FRAUDAR [Hooi, Song, Beutel et al. (2016)] and HoloScope [Liu, Hooi and Faloutsos (2017)], in detecting collaborative fraudsters. Both competitors are based on dense subgraph mining, but they differ in how the graph is constructed.
- Fixed weighting dense subgraph mining-based method: FRAUDAR [Hooi, Song, Beutel et al. (2016)]. FRAUDAR detects fraudsters by dense subgraph mining. To detect camouflage and hijacked accounts, it adopts a fixed weighting strategy.
- Dynamic weighting dense subgraph mining-based method: HoloScope [Liu, Hooi and Faloutsos (2017)]. HoloScope uses information from graph topology and temporal spikes to detect groups of fraudsters, and employs a dynamic weighting approach to allow more accurate fraud detection.

Findings: BEST significantly improves fraudster detection performance, especially recall
The precision, recall and F-score of BEST, FRAUDAR, and HoloScope are reported in Tab. 1. Overall, BEST significantly outperforms the competitors. In terms of F-score, it improves on the best-performing competitor by 21.8% and 10.03% on the two data sets.

Evaluation of BEST-generated user representation quality

Experimental settings
We visualize the user representation in a two-dimensional space through t-SNE [Maaten and Hinton (2008)]. To evaluate the user representation quality, we plot the ground-truth label of each user at the user's position in the representation space. A high-quality user representation will yield a dense distribution for the collaborative fraudsters. The user representation generated by BEST is compared with that generated by JETB [Wang, Liu and Zhao (2017)], which is the state-of-the-art user representation method for fraudster detection.

Findings: the BEST-generated user representation embeds fraudsters into groups with anomalously high density
The user representations generated by BEST and JETB are visualized in Fig. 3. In the JETB-generated representation space, the users in high-density regions are not consistent with the ground-truth fraudster labels. In contrast, the density of the BEST-generated representation is consistent with the ground-truth fraudster distribution. This qualitatively illustrates that BEST effectively captures the social relations of users in social media, which is essential for collaborative fraudster detection.

Conclusion
This paper introduces a network embedding-based collaborative fraudster detection framework, NEST, and its instance, BEST. They perform an anomalous density searching procedure on a network embedding space, which enables the detection of multiple groups of collaborative fraudsters. Experiments on two large real-world data sets demonstrate that BEST performs substantially better than the state-of-the-art competitors.