A Semantic Community Detection Algorithm Based on Quantizing Progress

The semantic social network is a kind of network that contains a very large number of nodes and complex semantic information, and traditional community detection algorithms cannot produce ideal, cogent communities for it. To solve the problem of community detection in semantic social networks, we present a clustering community detection algorithm based on the PSO-LDA model. Since the semantic model is the LDA model, we use the Gibbs sampling method to map quantitative parameters from semantic information to semantic space. Then, we present a PSO strategy with the semantic relation to solve overlapping community detection. Finally, we establish semantic modularity (SimQ) for evaluating the detected semantic communities. The validity and feasibility of the PSO-LDA model and the semantic modularity are verified by experimental analysis.


Introduction
With the development of society and the improvement of science and technology, semantic social networks have developed rapidly, and many semantic networks, like Twitter and Weibo, have made a significant impact on our lives. In these networks, different individuals have different small social "worlds", which are called communities [1]. Thus, researchers pay attention to community detection not only to divide networks into modules but also to gain deep insight into the interesting properties of the semantic social network. In practical applications, semantic communities greatly benefit intelligent information retrieval, marketing management, individual service, and other information management domains [2]. So far, research on community detection mainly falls into the following three categories: topological community detection [3], community detection on overlapping construction [4], and semantic community detection.
The topological community detection represents the pioneering work, the goal of which is to study the topological structure and divide the social network into several separate networks. The representative algorithms include Modular Optimization [5], GN [6], and FN [7]. Then, researchers gradually focused on overlapping communities, which are closer to real networks than those studied previously. Therefore, CPM [8] was proposed to detect overlapping communities. Soon afterwards, community detection on overlapping construction received more attention in social networks and many representative algorithms were proposed, including LFM [9], EAGLE [10], COPRA [11], DEMON [12], and so forth. Neuman and Yair [13] proposed an agglomerative spectral clustering method with conductance and edge weights. In their method, the most similar nodes are agglomerated based on eigenvector space and edge weights. But this method is only suitable for nonsemantic social networks. Then, with the growing interest in semantic networks, semantic community detection came to researchers' attention. Yang and McAuley [14] proposed the CESNA model to detect communities by using edge structure and node attributes. This method leads to more accurate community detection as well as improved robustness in the presence of noise in the network structure. But when this method is applied to semantic networks, its performance is unstable. Reihanian and Ali [15] proposed a generic framework for overlapping community detection in social networks with special focus on rating-based social networks. This framework considers the information shared by the users in order to find meaningful communities. The most important feature of semantic communities is that the nodes in these communities not only have topological relationships but also have semantic context. Because semantic data mining must rely on text analysis, many semantic community detection algorithms apply the Latent Dirichlet Allocation (LDA) [16] model as the core model.
In the last few years, the analysis of semantic social networks has become popular. Most of the algorithms use the LDA model as the basic model. The SVM-DTW method proposed by Solera, Calderara, and Cucchiara [17] can work on hierarchical networks. This method has a simple structure and needs few input parameters, but the semantic context is not considered and the detected communities have little connection with the real semantic network. Li, Ming, and She [18] proposed the GRTM model, which not only models users' interests as latent variables through their information but also considers the connections between users as a result of their information. This method combines context analysis with topological analysis, and the detected communities are close to the real semantic social network, but it lacks a sampling step, which can produce fuzzy, irrelevant communities. Xiao and Liu [19] proposed the GLDA-FP model, which can be extended with a prediscretizing method that helps the LDA model detect topic evolution automatically, but the calculation required is large. The LCTA model proposed by Yin, Cao, and Gu [20] uses different topic distributions in different communities to make the model reasonable. This method achieves high accuracy, but the number of communities needs to be preset and some hidden parameters need to be set by experience.
In this paper, we propose a novel community detection algorithm for dividing nodes into clusters. The main characteristic of communities detected by this algorithm is that members of the same community have common or similar interests. We take into account the topic and keyword information in the text of individuals' words through the LDA model, then quantize semantic nodes, and map them into semantic space. Then, we obtain ideal virtual social communities by using the Particle Swarm Optimization algorithm. Last but not least, we build a novel modularity model and use the new function SimQ to evaluate the virtual social communities we obtain.
Compared with other models in semantic social networks, such as the Louvain method model [21] and the stochastic block model [22], the LDA model provides a probabilistic method and thus a firmer mathematical foundation. Then, considering the subsequent sampling, Gibbs sampling can give an accurate and powerful mathematical proof for the convergence and solution of the LDA model, which the other semantic models do not offer. Combined with the PSO algorithm, the probability function produced by the LDA model can be closely integrated with the inertia weight and the constriction factor of the particles [23]. For performance measurement, we propose a new community detection evaluation model based on semantic information using the cosine function, which enriches the classic semantic detection evaluation model.
The rest of the paper is organized as follows: Section 2 introduces the LDA model in semantic networks. Section 3 presents Gibbs sampling and the proposed algorithm. In order to verify our approach, we conducted extensive experiments on real datasets. Performance evaluation and experimental results are shown and discussed in Sections 4 and 5. Finally, in Section 6 we draw conclusions and envision further work.

Preliminaries
2.1. Community Detection Process. The problem of community detection is NP-hard [24], so solutions need to be initialized at the beginning and optimized iteratively toward the most satisfactory one. The main goal of detecting semantic communities is to form communities in which individuals share common interests and probably have similar characteristics [25]. So we propose a novel idea that focuses on the textual data of individuals' words. Given the complexity of community detection, we use the probabilistic graphical model LDA to design the network. This model has a clearly hierarchical structure [26], and the size of its parameter space is independent of the number of training documents.
First, we select topics and words from individuals' semantic information through the LDA model. Then, we map semantic nodes into semantic space via the Gibbs sampling method [27]. Last, in order to get more accurate communities, we use the Particle Swarm Optimization (PSO) algorithm to form semantic communities. The proposed community detection algorithm is explained in the following steps.

Similar Semantic Information Discovery.
Every individual uses different words, as each node has its own information contents in the semantic social network [28]. So we abstract the semantic context into topics, and then we extract keywords from the topics. Through the semantic information, we introduce distributions to constrain the messy context [29]. In this way, dividing communities in the semantic social network based on similar documents, topics, and keywords from social semantic contents makes the communities realistic [30]. The LDA probability model is shown in Figure 1.
In this section, we study the LDA model on information contents. The relevant mathematical symbols for illustrating the LDA model are given in Table 1. The LDA model assumes the following generative process for each node:
(1) $\theta \sim \mathrm{Dirichlet}(\alpha)$. The parameter $\theta$, which pertains to the topic distribution, is subject to the Dirichlet distribution with a priori parameter $\alpha$.
(2) $\varphi \sim \mathrm{Dirichlet}(\beta)$. The parameter $\varphi$, which pertains to the keyword distribution, is subject to the Dirichlet distribution with a priori parameter $\beta$.
(3) $z_i \mid \theta^{(d_i)} \sim \mathrm{Multinomial}(\theta^{(d_i)})$. The topic $z_i$ is subject to the multinomial distribution given the topic distribution probability $\theta^{(d_i)}$.
Here $\theta^{(d)}$ is the topic distribution probability vector over node $d$; $\varphi^{(z)}$ is the keyword distribution probability vector of topic $z$, with $\varphi^{(z)}_{w}$ meaning the probability of keyword $w$ specific to topic $z$, that is, $\varphi^{(z)}_{w_i} = P(w_i \mid z_i = z)$; $\alpha$ is the a priori parameter over the topic distribution probability specific to each node; and $\beta$ is the a priori parameter over the keyword distribution probability specific to a particular topic.
Figure 1: LDA probability model.
The process of forming the LDA model is shown in Algorithm 1. And  means the number of documents in the process.
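As an illustration, the generative process above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' Algorithm 1: the function name `generate_corpus`, its parameters, the symmetric priors, and the fixed document length are hypothetical choices made for the example, and the final keyword draw uses the standard LDA step in which a keyword is sampled from the keyword distribution of its topic.

```python
import numpy as np

def generate_corpus(n_docs, n_topics, vocab_size, doc_len, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # (1) theta^(d) ~ Dirichlet(alpha): one topic distribution per node/document.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    # (2) phi^(z) ~ Dirichlet(beta): one keyword distribution per topic.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, topics = [], []
    for d in range(n_docs):
        # (3) z_i | theta^(d) ~ Multinomial(theta^(d)): a topic for each position.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Standard LDA step (assumed here): keyword drawn from phi^(z_i).
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        topics.append(z)
        docs.append(w)
    return theta, phi, docs, topics

theta, phi, docs, topics = generate_corpus(
    n_docs=4, n_topics=3, vocab_size=10, doc_len=20, alpha=0.5, beta=0.1
)
```

Each row of `theta` and `phi` is a probability vector, so the rows sum to one; `docs[d][i]` is the keyword index generated at position `i` of node `d`.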

Gibbs Sampling and PSO Strategy
3.1. Gibbs Sampling. Gibbs sampling [31] is a simple case of Markov chain Monte Carlo (MCMC) [32]. It extracts a set of approximate samples from a Markov chain that is constructed to converge to a suitable target probability distribution, which makes it applicable to high-dimensional models [33] such as LDA. Because of the Markov chain property, the probability distribution function becomes the key to Gibbs sampling [34]. As for LDA in this text, we only sample topics in the semantic social network; that is, we only need to consider the hidden variable $z_i$. We denote by $z_{\neg i}$ the topic set excluding $z_i$ and by $w_{\neg i}$ the set of keywords excluding $w_i$, and we draw the posterior probability $P(z_i = k \mid z_{\neg i}, w)$. For each position $i$, we can find the corresponding keyword $w_i$. So the probability can be described as in the following equation.

$$P(z_i = k \mid z_{\neg i}, w) \propto P(z_i = k, w_i = t \mid z_{\neg i}, w_{\neg i})$$

When $z_i = k$ and $w_i = t$ ($t$ is one of the keywords in $W$; $k$, which corresponds to $t$, is one of the topics in $Z$), the probability $P(z_i = k, w_i = t \mid z_{\neg i}, w_{\neg i})$ only involves the conjugate distribution of the $m$-th document and the $k$-th topic under the Dirichlet-multinomial model.
The number of occurrences of the $t$-th keyword in the $k$-th topic, named $n_k^{[t]}$, can be shown as follows under the multinomial distribution.
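To make the sampling step concrete, the following is a minimal sketch of the widely used collapsed Gibbs sampler for LDA, in which each topic assignment is resampled from counts that exclude the current position. It is offered as an assumption about how such a sampler is usually written, not as the authors' exact procedure: the function name `gibbs_lda`, the symmetric priors `alpha` and `beta`, and the count arrays `n_dk`, `n_kt`, and `n_k` are hypothetical names introduced for this example.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of keyword indices."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))    # topic counts per document
    n_kt = np.zeros((n_topics, vocab_size))   # keyword counts per topic
    n_k = np.zeros(n_topics)                  # total keyword count per topic
    # Random initial topic assignment for every keyword position.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, t in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                # Remove position i so the counts describe z_{not i}, w_{not i}.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kt[k, t] -= 1
                n_k[k] -= 1
                # Unnormalized P(z_i = k | z_{not i}, w) for every topic k.
                p = (n_kt[:, t] + beta) / (n_k + vocab_size * beta) * (n_dk[d] + alpha)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kt[k, t] += 1
                n_k[k] += 1
    # Point estimates of the topic and keyword distributions from the counts.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kt + beta) / (n_k[:, None] + vocab_size * beta)
    return theta, phi, z

toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0]]
theta, phi, z = gibbs_lda(toy_docs, n_topics=2, vocab_size=6, n_iter=20)
```

The rows of `theta` can then serve as the quantized semantic vectors of the nodes, which is the role the semantic space plays in the rest of the method.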
Compared with other optimization algorithms, such as Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Simulated Annealing (SA), the PSO algorithm has two attractive features. Firstly, PSO optimizes the solution starting from the local optimum and runs fast, which makes the algorithm more adaptable to the evolution of networks. Secondly, particles in PSO can be mapped to nodes in the semantic network, and the process of finding the optimal solution in PSO is consistent with the birth process of a semantic community.
PSO generates a set of random solutions at system startup and uses iterative search to find optimal solutions [37]. In PSO, each candidate solution of the optimization problem is called a "particle", and each particle has its own fitness value. So we design a heuristic method to detect communities based on PSO. Each particle searches for the optimal solution by sharing social information with other individuals.
In PSO-LDA, LDA semantic features are put into PSO. We map each node in the semantic social network to a "particle" in PSO and map the semantic information vector of each node to the velocity of the corresponding particle. As for the fitness value, we use the information similarity function instead. In PSO, we assume that the nodes in the semantic social network simulate the behavior of a "bird flock", in which social sharing of information takes place and individuals gain from the discoveries and previous experience of all other nodes during the search for food [38]. Thus, each node (particle) in the semantic social network (swarm) is assumed to "fly" over the search space looking for promising regions of the landscape.
Step 1. Initialize all particles and let $t = 0$;
Step 2. Evaluate the fitness of each particle;
Step 3. Judge whether the termination criterion is satisfied. If $t > t_{\max}$, stop and jump to Final; otherwise refresh the variables according to the following steps;
Step 4. Refresh $p_i$ by comparing the current fitness of each particle with that of its own historical best position $p_i$; if the fitness gets smaller, replace $p_i$ with the current position;
Step 5. Refresh $p_g$ by comparing the current best fitness of all particles with that of the historical best position $p_g$ of the whole swarm; if the fitness gets smaller, replace $p_g$ with the current best position;
Step 6. Refresh $v_i^{t+1}$ and $x_i^{t+1}$ using Eq (12) and Eq (13);
Step 7. $t = t + 1$, return to Step 2;
Final.
In the search space, once the velocity $v_i^{t+1}$ is updated, the position $x_i$ of the $i$-th particle is changed as in the following equation.
The constriction factor manages and regulates the velocity's magnitude to maintain a balance between exploration and exploitation, and it can be calculated as follows:

The constriction factor influences the proposed algorithm; we discuss the issue in part 4. The pseudocode for PSO is described in Algorithm 2 [39].
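The loop in Steps 1 to 7 can be illustrated with the commonly used Clerc-Kennedy constriction-factor form of PSO. This sketch is an assumption and may differ from Eq (12) and Eq (13) of the paper: the names `constriction_factor` and `pso_minimize`, the acceleration coefficients `c1 = c2 = 2.05`, the initialization range, and the toy fitness function are hypothetical choices for the example. Fitness is minimized, matching the "gets smaller" rule in Steps 4 and 5.

```python
import numpy as np

def constriction_factor(c1=2.05, c2=2.05):
    """Clerc-Kennedy constriction factor; requires c1 + c2 > 4."""
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - np.sqrt(phi * phi - 4.0 * phi))

def pso_minimize(fitness, dim, n_particles=20, t_max=100, c1=2.05, c2=2.05, seed=0):
    rng = np.random.default_rng(seed)
    chi = constriction_factor(c1, c2)
    # Step 1: initialize positions and velocities.
    x = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
    v = np.zeros_like(x)
    # Step 2: evaluate fitness and record personal and global bests.
    p_best = x.copy()
    p_best_fit = np.array([fitness(xi) for xi in x])
    g = int(np.argmin(p_best_fit))
    g_best, g_best_fit = p_best[g].copy(), p_best_fit[g]
    # Step 3: iterate until the iteration budget is used up.
    for _ in range(t_max):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        # Step 6: constricted velocity update, then position update.
        v = chi * (v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x))
        x = x + v
        fit = np.array([fitness(xi) for xi in x])
        # Step 4: refresh personal bests where the fitness got smaller.
        improved = fit < p_best_fit
        p_best[improved] = x[improved]
        p_best_fit[improved] = fit[improved]
        # Step 5: refresh the global best.
        g = int(np.argmin(p_best_fit))
        if p_best_fit[g] < g_best_fit:
            g_best, g_best_fit = p_best[g].copy(), p_best_fit[g]
    return g_best, g_best_fit

best, best_fit = pso_minimize(lambda p: float(np.sum(p ** 2)), dim=2)
```

With `c1 = c2 = 2.05` the factor is roughly 0.73, which damps the velocity enough to keep the swarm from diverging while still allowing exploration. In PSO-LDA the toy fitness would be replaced by the information similarity function described in the text.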

Performance Measure
Generally speaking, performance measures for semantic social networks are mostly based on the topological structure. The EQ model proposed by Shen et al. [40] is widely used in evaluating overlapping communities, and it is described in the following equation:

$$EQ = \frac{1}{m} \sum_{i} \sum_{v \in C_i,\, w \in C_i} \frac{1}{O_v O_w} \left[ A_{vw} - \frac{k_v k_w}{m} \right]$$

Here $k_v$ is the degree of node $v$ and $k_w$ is the degree of node $w$, $m = \sum_v k_v$ is the total degree of the network, $A_{vw}$ is the element of the adjacency matrix of the network, $O_v$ is the number of communities to which node $v$ belongs and $O_w$ is the number of communities to which node $w$ belongs, and $C_i$ is the $i$-th community in the network. Since we use both topological structure and semantic context to detect communities, a novel evaluation model named SimQ, which adds information similarity to the topological evaluation index, is given by the following equation.
Here $v_i$ is the $i$-th node and $v_j$ is the $j$-th node, $O_{v_i}$ is the number of communities to which node $v_i$ belongs and $O_{v_j}$ is the number of communities to which node $v_j$ belongs, $m_1 = \sum_{v_i, v_j} A_{v_i v_j}$ is the total degree of the network, $A_{v_i v_j}$ is the element of the adjacency matrix of the network, and the range of value for  is (0, 1). As for the information similarity $s(v_i, v_j)$, we give a normal social graph $G = (V, E, I_{v_i / v_j}, s(v_i, v_j))$, where $V$ is the set of nodes in the network and $v_{i/j}$ is the $i$/$j$-th node; $E$ is the set of edges linking the graph nodes. The actual point of $s(v_i, v_j)$ is to measure the structural correlation of nodes and add semantic correlation components at the same time, which suits the basic characteristics of semantic communities. Each node $v_i$ is associated with an information vector $I_{v_i} = (I^{(i)}_1, I^{(i)}_2, \ldots, I^{(i)}_n)$; $s(v_i, v_j)$ is the information similarity of two neighbor nodes $i$ and $j$, where $n$ is the dimension of the social network. In our method, if the semantic components of two nodes are close, the angle between the projections of these two nodes in two-dimensional space is relatively small. Otherwise, the projection vectors point in clearly different directions.
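The two ingredients discussed in this section, a cosine-based information similarity and an overlap-aware modularity, can be sketched as follows. This is a hedged illustration rather than the paper's exact definition: `cosine_similarity` and `extended_modularity` are hypothetical names, the modularity follows the extended modularity of Shen et al. [40] written with the total degree as normalizer, and the optional `info` argument, which multiplies each pairwise term by the cosine similarity of the two information vectors, is purely an assumption about how similarity might enter a SimQ-style score.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two information vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def extended_modularity(adj, communities, info=None):
    """Overlap-aware modularity; `communities` is a list of node-index lists."""
    adj = np.asarray(adj, dtype=float)
    k = adj.sum(axis=1)          # degree of every node
    m = k.sum()                  # total degree of the network
    o = np.zeros(len(adj))       # number of communities each node belongs to
    for members in communities:
        for v in members:
            o[v] += 1
    total = 0.0
    for members in communities:
        for v in members:
            for w in members:
                term = (adj[v, w] - k[v] * k[w] / m) / (o[v] * o[w])
                if info is not None:
                    # ASSUMPTION: weight the topological term by the semantic
                    # similarity of the two nodes (not confirmed by the paper).
                    term *= cosine_similarity(info[v], info[w])
                total += term
    return total / m

# Two disjoint triangles, each forming one community.
adj = np.zeros((6, 6))
for v, w in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    adj[v, w] = adj[w, v] = 1
communities = [[0, 1, 2], [3, 4, 5]]
eq = extended_modularity(adj, communities)
```

When every node belongs to exactly one community, the unweighted score reduces to ordinary modularity; nodes shared between communities contribute with a weight reduced by their overlap counts.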

Experimental Results
In this part, we present and discuss the experiments on topic number analysis, evaluation criteria, real datasets, and different community detection algorithms, based on three datasets (the American College Football network dataset, the Krebs polbooks network dataset, and the dolphins network dataset).

The Analysis on Topics Number.
The number of topics, which is one of the input parameters of the PSO-LDA model, can influence the compactness of communities. So we choose the following three datasets to verify the effect of the topic number on the result: (1) The American College Football network is shown in Figure 2. This network, created by Newman, is a complex social network of the American College Football league. Nodes represent football teams, and an edge between two neighbor nodes indicates that the two teams have played a match. It contains 115 nodes and 616 edges.
(2) The Krebs polbooks network established by V. Krebs is shown in Figure 3. The nodes represent the politics books sold on Amazon. Generally, the books are approximately divided into three classes by political tendency. So in order to get the topic distribution, Newman collected the political tendency within 3 steps around each node.
(3) The dolphins network collected by Newman is shown in Figure 4. The dolphins network is made up of two families, including 62 nodes and 159 edges. We simulate the semantic information of each node so that it fits a Dirichlet distribution. In this section, we experiment with the topic number on the three datasets (football, polbooks, and dolphins). Figure 5 shows the comparison of EQ and SimQ on the three datasets with the topic number ranging over (1, 2, ..., 20). As the topic number grows and the topic distribution rises, the number of detected communities increases. In Figure 5, when the topic number reaches a certain level, the topic distribution tends to be stable, which results in an increment of communities. From the comparison of EQ and SimQ, these two performance measures tend to decrease as the topic number increases once the topic number has passed an optimal point. The optimal topic number is 6 in Figure 5.
To present the communities more intuitively, Figure 6 shows the detected communities of the three datasets when the topic number is 6, 12, and 18.

The Comparison on Different Optimization Algorithms.
In this section, we compare different optimization algorithms on the three network datasets above (dolphins, polbooks, and football). We compare the number of communities, the size of communities, runtime, and semantic concentration for the PSO algorithm, Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Simulated Annealing (SA). The result is shown in Figure 7. From Figure 7, we can see that the PSO algorithm produces more communities of smaller size than the others. As for runtime, the PSO algorithm runs a little better than ACO and SA. The semantic concentration [41] is a function for measuring and testing the degree of coagulation on a specific topic, and it is shown in the following equation:

Here the leading term is the performance measure of community links, and the indicator equals 1 if and only if nodes $i$ and $j$ belong to the same community and there is a link between $i$ and $j$. Compared with the similarity function SimQ, semantic concentration focuses on the stability of social groups in the local environment. But it should be noted that a higher value of one of these measures does not imply a higher value of the other, nor does it guarantee the best divisions; this is because the overlapping part of communities can affect the semantic cohesion. So the ideal community structure should suit both SimQ and semantic concentration, which also fits the performance measurement of overall optimization and local optimization. Compared with GA, ACO, and SA in Figure 7, the communities detected by PSO are slightly smaller and slightly more numerous, which is in accordance with the topic distribution. As for runtime, PSO runs a bit slower than ACO but much better than GA and SA. Figure 8 shows the four optimization algorithms run on the dolphins network, and, similarly to Figure 7, PSO works much better than the other algorithms on community detection.

The Comparison on Community Detection Algorithms.
Considering the bias in semantic community detection, we use classical nonsemantic algorithms to illustrate the issue, taking the football dataset as an example.
We choose GN, FN, LFM, and COPRA as classical nonsemantic algorithms, where LFM and COPRA are overlapping community detection algorithms. The EQ and SimQ of the algorithms above are listed in Table 2, and the detected communities are shown in Figure 10 for the football dataset.
From the result in Table 2, the EQ of the classical nonsemantic algorithms is higher than that of PSO-LDA (0.5132), but their SimQ is lower than that of PSO-LDA (0.4258). This suggests that the classical nonsemantic algorithms achieve a higher EQ in topological structure detection and a lower SimQ in semantic detection. Compared with semantic algorithms, classical nonsemantic algorithms are therefore biased in the way they obtain the ideal communities. On the one hand, we verify the performance of these algorithms; on the other hand, we use this experiment to verify the relation among EQ, SimQ, and semantic concentration. As for the results in Table 2, PSO-LDA performs better in SimQ and has high EQ, and PSO-LDA is higher than the other algorithms in semantic concentration. This means that PSO-LDA performs well in overall search (EQ and SimQ) and works better than the others in local search (semantic concentration).

dataset (extract 25000 nodes) (http://snap.stanford.edu/data/email-Enron.html). The EQ, SimQ, and number of detected communities of the datasets above, as detected by various algorithms, are reported in Table 3, with the topic number set to 6 for PSO-LDA. The histogram of EQ is shown in Figure 11 and that of SimQ in Figure 12. From Figures 11 and 12, the PSO-LDA model is more suitable for semantic community detection than the classical nonsemantic algorithms.

Conclusion
In this paper, we presented a novel community detection algorithm, PSO-LDA, that combines the topological structure with semantic information. It avoids presetting the number and the size of communities. Because Gibbs sampling solves the hidden parameters in the proposed model, the sampling result approaches the realistic state. The main contribution of this research is how to use different similarity measures to measure the similarity between nodes based on the topological structure and their semantic information. As for future work, we would apply the model in fields such as privacy protection and worm containment in semantic social networks.

Figure 2: The graph of football network.

Figure 5: The performance of detected communities with the topic number.

Figure 8: The comparison on different optimization algorithms on dolphins (the black nodes are overlapping nodes).

Figure 9: The diagrams of comparison on the constriction factor with EQ and SimQ.

Figure 12: The histogram of SimQ with various classical algorithms.

Table 1: The symbol description.
SYMBOL: DESCRIPTION
$N$: Number of keywords in semantic social network
$W$: Set of keywords in semantic social network; $w_i$ is the $i$-th keyword in $W$
$D$: Node set corresponding to keywords set $W$; $d_i$ is the $i$-th node in the semantic social network
$Z$: Topic set corresponding to keywords set $W$; $z_i$ is the $i$-th topic in semantic social network

Algorithm 1 (fragment): (1) Extract the keyword distribution, and $\varphi \sim \mathrm{Dirichlet}(\beta)$;

Table 2: The classical nonsemantic algorithms on EQ, SimQ, and semantic concentration.

The Comparison on the Constriction Factor with EQ and SimQ.
In this section, we compare EQ and SimQ over the three datasets. The run diagrams of EQ and SimQ on the three datasets are shown in Figure 9. From (16), we put the information similarity function $s(v_i, v_j)$ into SimQ, and $s(v_i, v_j) < 1$. So generally, the EQ curve tends to be higher than the SimQ curve. The maximum value of  in football

Table 3: The results of classical nonsemantic algorithms under various datasets.