Social Network Community Detection by Combining Self-Organizing Maps and Genetic Algorithms



Introduction
Social networks hold such a wealth of data that people from many fields try to exploit them for valuable information. Information coming from social networks is used in areas such as marketing, politics, economy, statistics, and education [1]. Community detection has drawn the attention of many researchers over the last few years [1,2]. Knowing the community structure of the individuals inside a social network helps target suitable people, for instance when running marketing campaigns or when trying to understand the opinion of a given social category. Social networks are made of individuals called nodes, such as profiles on Facebook or LinkedIn. Two kinds of features characterize every node: topological features and semantic features. The topological features are based on the links existing between nodes [1]. Nodes belonging to the same community are densely linked. Semantic features, in contrast, relate to information specific to each node, such as age, family, education, and comments. Nodes belonging to the same community generally share some common information. As semantic information is complex to extract, most works rely only on links to extract the communities.
Communities denote a collective behavior of nodes and involve nodes that are strongly linked. Within one community, nodes do not have the same importance. Some of them represent the core of the community and attract all the community nodes, whereas others are peripheral nodes located on the border of the community. Detecting communities in a social network is a complex task because nothing is known about the structure of the communities, their size, their core nodes, and so on.
In the literature [2], a social network is considered as a graph G = (V, E), where V is a set of nodes or vertices and E is a set of links, called edges, that connect two elements of V. Detecting communities means detecting subgraphs of nodes with strong interactions between them and little interaction with the other subgraphs. However, the main challenge is the lack of a clear quantitative criterion that can be used to delimit these subgraphs. This explains why most works extract communities in a sequential way: they compare nodes two by two. This paper proposes a different approach because it considers that grouping nodes into communities has to be achieved relative to all the nodes and not by finding direct similarities between them two by two. This study, therefore, proposes to detect agglomerations of nodes as a first step in the community detection process. Agglomerations are not necessarily the final communities. They represent the cores of the final communities. Once detected, these agglomerations compete to attract each other to produce the eventual partition into communities. This new vision ensures a high level of scalability, especially when dealing with big social networks. The proposed approach can help explore social media to detect political communities and help predict political elections. Moreover, social networks are becoming the most important market in the world, so the proposed approach may also be useful for business purposes. By detecting social network communities, it can be used to target a particular type of customer within social networks.
Our contributions can be summarized as follows: (i) We introduce the concept of community cores and use pattern recognition techniques, namely Growing Hierarchical Self-Organizing Maps (GHSOM), to detect them. (ii) We couple genetic algorithms with the GHSOM to extract the final communities. It is not a simple succession of steps. The genetic algorithm is tuned to work with the results of the GHSOM. This bias makes the genetic algorithm faster and more efficient.
The remainder of this paper is organized as follows. Section 2 reviews the works related to community detection, while Section 3 introduces our approach and contributions. Section 4 presents the results achieved by our approach. The major conclusions are drawn in the final section, together with some perspectives for our future research work.

Related Works
Review papers [3,4] classify community detection works into two perspectives: divisive and agglomerative approaches. Divisive approaches are top-down. They start with the entire graph and split it into partitions by removing edges. Agglomerative approaches, in contrast, are bottom-up; they start from vertices that are gradually merged to build communities. The best partition is the one that maximizes a given metric. Most of the proposed works are agglomerative.

Divisive Approaches.
In the divisive category, Girvan and Newman [5] proposed an approach based on the concept of edge betweenness, defined as the number of shortest paths between pairs of nodes that run along a given edge. Li et al. [6] also proposed a divisive method for community detection, namely local edge centrality (LEC). In the first phase, a weight is computed for each edge. The authors relied on the node dissimilarity degree and edge betweenness. Unimportant edges are deleted to obtain an initial partition of the network. After that, modularity optimization is used to get the final partition of the network.
Modularity [7] is used in many agglomerative works. It is a metric that has been widely used to characterize partition quality and was used by Clauset et al. [8]. In their approach, known as CNM, the authors start from isolated nodes, and the edges of the network are added progressively to increase the modularity. Blondel et al. [9] created the well-known Louvain method and also built their approach on modularity optimization. Neighbor nodes are grouped together through a repeated step, and at each step, the modularity is computed to evaluate the achieved gain. Džamić et al. [10] proposed a community detection system that maximizes the modularity function to find the best partition. Hoffman et al. [11] use Cohen's similarity measure for categorical data. After that, the clustering is performed using k-means. The number of k-means clusters ranges from 2 to N (the number of nodes). The best partition is the one that maximizes the modularity function.

Evolutionary-Based Approaches.
Several approaches use genetic algorithms to optimize an objective function and find the best community partition [12-18]. The most used objective function is the modularity of Newman [7]. Some of these approaches use more than one evaluation function; they are multiobjective [16]. Said et al. [19] proposed an approach that uses a genetic algorithm for detecting communities. The novelty suggested by the authors is a new way of generating the initial population and a new method for the mutation operation. The initial population is made up of nodes that have neighbors because putting isolated nodes in the solution space may increase the convergence time of the genetic algorithm. The mutation operation proposed by the authors is carried out on the selected solution and its neighbors.
Recently, Li et al. [20] proposed a community detection approach that uses attributes such as age, education background, hobby, and profession in addition to the topological structure. The community detection problem is transformed into a multicriteria optimization problem. To find the best community partition, they used multiobjective genetic algorithms.
In [21], the authors proposed an approach similar to genetic algorithms. They proposed an evolutionary method based on a fitness function that evaluates the quality of the partition. The authors also proposed new operators named the vertex substitute operator and the community substitute operator.
Qin et al. [22] worked in the same way as in [20]. They combined topology and content. An adaptive parameter is used to combine topology and content and to effectively control the impact of content on community discovery. In the same way and using another evolutionary technique, Rostami et al. [23] proposed a particle swarm optimization-based multiobjective approach to detect central nodes in medical datasets. Ben Romdhane et al. [24] proposed the concept of purity and density of communities to define an objective function. They use the ant colony technique to realize the random walk and to optimize this objective function. In the same way, Majbouri et al. [25] used the ant colony technique to predict information diffusion paths. They study and model the propagation routes. They cluster nodes, and the final information diffusion paths are predicted using the ant colony.
In [26], Cai et al. proposed an approach based on multiagent systems. Single nodes are associated with agents. The agents affiliated with a similar cluster should gradually assemble in their common state space. The authors used the concept of consensus or quasi-consensus of the motions of dynamical systems to make the final clustering.

Label Propagation Algorithm-Based Approaches.
The label propagation algorithm (LPA) [27-30] is another interesting technique, considered the fastest because it runs in near-linear time.
The LPA uses only the network structure to guide the exploration process. It is well adapted to large-scale networks; it neither uses a defined objective function nor requires any preliminary information on the existing communities. The label assigned to each node depends on the labels assigned to its neighbor nodes. The main drawback of this technique is that it does not provide a unique solution but an aggregate of many solutions.
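The label update described above can be illustrated with a minimal sketch. It is not the implementation of any cited work: the `neighbors` dictionary, the tie-breaking rule, and the stopping test (a node keeps its label when that label is already among the most frequent ones around it) are assumptions made for illustration.

```python
import random
from collections import Counter

def label_propagation(neighbors, rng, max_rounds=100):
    """Asynchronous label propagation on an undirected graph.

    neighbors: dict mapping each node to the list of its adjacent nodes.
    rng: a random.Random instance (visit order and ties are random,
         which is why different runs may return different partitions).
    """
    labels = {v: v for v in neighbors}  # every node starts with its own label
    nodes = list(neighbors)
    for _ in range(max_rounds):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not neighbors[v]:
                continue  # isolated node keeps its label
            counts = Counter(labels[u] for u in neighbors[v])
            best = max(counts.values())
            if counts[labels[v]] == best:
                continue  # current label is already a majority label
            winners = sorted(l for l, c in counts.items() if c == best)
            labels[v] = rng.choice(winners)
            changed = True
        if not changed:
            break
    return labels

# Hypothetical toy graph: two disjoint triangles.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
toy_labels = label_propagation(toy, random.Random(0))
```

Because the visiting order and the tie-breaking are random, repeated runs on a less clear-cut graph can yield different partitions, which is the drawback mentioned in the text.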
In [31], the authors also used the LPA to detect communities. They worked on complex attributed networks. From these networks, they develop a weighted graph. The weight of each node is computed using Laplacian centrality. The propagation of labels is proportional to the influence among the adjacent nodes. Nodes with higher influence in terms of structure and attributes update many tags. The community overlap propagation algorithm (COPRA) [32] is an overlapping community detection method derived from the label propagation algorithm. In the propagation process, the node label is determined based on the labels of adjacent nodes, and hence, a node may belong to many communities.

Spectral Graph Partitioning Approaches.
The spectral graph partitioning approaches are based on the eigenvectors of the Laplacian matrix. The eigenvector components with similar values represent the nodes that belong to the same community. In [33], Newman proposed the modularity matrix that is made up of the eigenvectors computed for the network. This enhancement leads to a spectral approach that returns better results than the classic modularity. Narantsatsral and Kang [34] proposed an agglomerative approach for social community detection. In this approach, the densely connected clusters are identified while agglomerating. Nodes are projected into an eigenvector space to be able to significantly distinguish between them.

Statistics-Based Approaches.
Li et al. [35] proposed a Markov cluster approach known as MCL. It uses simulations based on the concept of Markov chains to build a fast and scalable unsupervised Markov clustering algorithm. The order statistics local optimization method (OSLOM) [36] is an approach based on the local optimization of an evaluation function. The entire graph network is transformed into a network of subgraphs representing the communities. In addition, OSLOM can detect overlapping communities. Rosvall and Bergstrom [37] introduced a random walk-based approach for detecting communities known as Infomap. They consider social networks as a set of regularities (patterns). Through a random walk, they try to detect these regularities by finding the best path that maximizes the compactness and minimizes information loss.

Metric-Based Approaches.
In [38], the authors proposed an approach based on thread-level parallelism for the calculation of adding qualified neighbor nodes to the community. This approach operates over weighted networks with irregular topologies. Zardi et al. [39] proposed a hierarchical clustering. They define some metrics characterizing a good quality community partition. All these metrics are used to build an objective function that should be maximized. The nodes represent the initial communities, and they are merged progressively to detect the final community partition. C-Finder [40] is a local approach presented by Palla et al. whose main principle is to detect k-cliques inside the network. A k-clique is a small group of k nodes that are totally linked. Two cliques may form a community if they are adjacent, that is, if they have at least (k − 1) common nodes. Communities are made by merging the adjacent k-cliques. In the same way, Zhang et al. [41] addressed the problem of overlapping communities by detecting weak cliques and merging them. They proposed the Salton index to characterize node similarities, and the weak cliques detected were merged into larger communities whenever possible. In [42], the authors proposed an original detection algorithm based on fire propagation behavior. The approach works in two phases. The algorithm starts with a random node and simulates the effect of fire spread to aggregate nodes and constitute communities.

Influential Nodes Detection Based Approaches.
The identification of the most influential node in social media networks has received a lot of attention in the data mining community. It has become a crucial step in community detection approaches [43]. For instance, Chaabani and Akaichi [44] proposed an approach that operates in two steps. The first step aims at defining the communities and detecting the most important nodes in them. In the second step, the partition is defined, and the main communities are detected. The authors also introduced a function that measures the strength of the links to define the communities.

Problem Formulation.
Graphs are used to represent social networks. The graph's nodes represent the social actors, and its edges are the connections between the nodes. In our case, the social network is modeled as a graph G = (V, E), where V is a set of nodes or vertices and E is a set of links or edges connecting two elements of V. To represent a graph, we use the adjacency matrix A. If the network is made of N nodes, the graph is represented by the N × N adjacency matrix A, where the entry at position (i, j) is 1 if there is an edge from node i to node j, and 0 otherwise. Row i of the adjacency matrix represents the features of node i.
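As a small illustration of this representation, the following sketch builds the adjacency matrix of a hypothetical four-node network; the function name, the 0-based node indices, and the toy edge list are assumptions made for the example.

```python
def adjacency_matrix(n_nodes, edges, directed=False):
    """Build the N x N adjacency matrix A from a list of (i, j) edges."""
    a = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # friendship-like links are symmetric
    return a

# Hypothetical 4-node network used only for illustration.
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
A = adjacency_matrix(4, toy_edges)
features_of_node_2 = A[2]  # row 2 is the feature vector of node 2
```

Each row can then be fed to the self-organizing map as one N-dimensional input vector.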

Approach
Overview. Unlike the works proposed in the literature, we add a real initialization step in our work. We do not tackle the community clustering directly. The first step consists in detecting the intrinsic agglomerations containing core nodes. After that, these agglomerations can be merged to generate the final communities (Figure 1).

Detection of Agglomerations.
The first step involves identifying the skeleton of the communities, made of core nodes. Core nodes are generally nodes that are in the center of a community, and they are linked to most of the community nodes. We can distinguish them even without achieving community detection. They are generally located in a small remarkable agglomeration. Detecting them gives a better starting point for community detection than single nodes. The main advantage of self-organizing maps (SOM) is that they give an efficient way to explore unbalanced and complex structures. A SOM provides a bidimensional visualization of multidimensional data.
Moreover, we can use the neighborhood property in self-organizing maps to have a better understanding of the relationships between agglomerations of nodes. We have used this tool in our previous works, in different contexts, and the results were very encouraging [45,46].
In the literature, there are works that use only SOM [18] to detect communities. The results found were not satisfactory. A SOM cannot give the real borders of communities. It can only project input data on a bidimensional map. Moreover, in [47], a classical variant of SOM has been used to detect communities. This variant does not scale well, especially when dealing with large social networks.

SOM and GHSOM.
A self-organizing map is a set of connected neurons on which we map input elements represented by n-dimensional vectors $X = [x_1, x_2, \dots, x_n]$ [48] (see Figure 2). The input elements are linked to the neurons through weights $W_{ij}$ (see Figure 2). The neuron to which an input element is attached is called the winning neuron.
Self-organizing maps work as follows:
Step 1: The connection weights are randomly initialized.
Step 2: The winning neuron c is the neuron whose weight vector $W_j = [W_{1j}, \dots, W_{nj}]$ is the closest to the input X:
$$c = \arg\min_{j} \lVert X - W_j \rVert \quad (1)$$
Step 3: The weights of the winning neuron and its neighbors are updated at every iteration as follows:
$$W_j(t+1) = W_j(t) + a(t)\, h(t)\, \left[ X(t) - W_j(t) \right] \quad (2)$$
where t represents the time, a(t) is a variable decreasing with time, and h(t) represents a neighborhood function. The principle is to reduce the influence of the update as the neighborhood radius increases.
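The three steps above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in our experiments: the Euclidean distance, the Gaussian neighborhood function, and the names `winner`, `som_step`, `positions`, `alpha`, and `radius` are assumptions.

```python
import math

def winner(weights, x):
    """Step 2: index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def som_step(weights, positions, x, alpha, radius):
    """Step 3: move the winner and its map neighbours towards x.

    weights:   list of weight vectors, one per neuron (updated in place)
    positions: grid coordinates of each neuron on the 2-D map
    alpha:     learning rate a(t); radius: neighbourhood radius at time t
    """
    c = winner(weights, x)
    for j, w in enumerate(weights):
        grid_d = math.dist(positions[j], positions[c])
        # Gaussian neighbourhood (assumed form): 1 at the winner,
        # smaller for neurons that are farther away on the map.
        h = math.exp(-(grid_d ** 2) / (2 * radius ** 2))
        weights[j] = [wk + alpha * h * (xk - wk) for wk, xk in zip(w, x)]
    return c

# Hypothetical two-neuron map and one input vector.
som_weights = [[0.0, 0.0], [1.0, 1.0]]
som_positions = [(0, 0), (0, 1)]
som_winner = som_step(som_weights, som_positions, [1.0, 0.8], alpha=0.5, radius=1.0)
```

In a full training loop, `alpha` and `radius` would both shrink over the iterations so that late updates only fine-tune the winning neuron.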
The main limitation of the classic SOM is its static architecture. The size of the map must be defined initially. For small problems, it can be used with no significant effects. However, when we deal with big and complex data, specifying the size of the map becomes very important, and its exploration becomes very difficult. For all these reasons, we used another variant of the SOM called the Growing Hierarchical Self-Organizing Map (GHSOM) [49]. These maps proved their effectiveness with big data problems [50]. The GHSOM represents the input space more faithfully by arranging it according to the shape of the data and its structure. It grows both in hierarchical and horizontal ways. Instead of representing all the input space by one SOM, the data are represented by multiple layers with a hierarchical structure, where each layer includes an independent SOM (see Figure 3). The training process starts with one layer (layer 0). It consists of one neuron only. The weight vector representing this neuron is the average value of the input vectors. This vector is called $m_{01} = [w^1_1, w^1_2, \dots, w^1_n]$, where n is the dimension of the input space.
In layer 1, a 2 × 2 map is created and randomly initialized. It is trained by the standard SOM learning algorithm (see formula (2)). The GHSOM growth strategy is based on the mean quantization error metric computed for each map by averaging the quantization errors of the neurons of the map:
$$MQE_m = \frac{1}{u} \sum_{i} qe_i$$
where u refers to the number of units i contained in the SOM m. The quantization error $qe_i$ of the neuron i of the map measures the deviation between $m_i$, the vector representing the neuron i, and the d inputs having the neuron $m_i$ as winning neuron. The main idea of the growing process in GHSOM is that each layer represents a deviation of the input data. In other words, the GHSOM grows horizontally and vertically to reduce the deviation of the neuron of the previous layer to a given rate. The expansion is governed by criteria on these quantization errors. If the criterion is met on a specific neuron, called "e," on a given map, a new map is added to this neuron. The initial weights of the neurons of this new map are computed based on the weights of the neighbors of the neuron "e." The learning of the GHSOM and its expansion continue until the two criteria are no longer satisfied. Figure 4 shows a GHSOM used to detect team agglomerations inside the American college football data set.
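To make the growth strategy more concrete, the sketch below computes the quantization errors and tests two growth conditions. It assumes the usual GHSOM formulation of [49]: the quantization error is taken here as the mean distance between a neuron and the inputs it wins, a map keeps growing horizontally while its mean quantization error stays above a fraction `tau1` of the error of its parent neuron, and a neuron is expanded into a new map while its error stays above a fraction `tau2` of the layer-0 error. The names `tau1` and `tau2` and the exact form of both tests are assumptions, not values taken from the text.

```python
import math

def quantization_error(m_i, inputs):
    """qe_i: mean distance between neuron vector m_i and the d inputs it wins."""
    if not inputs:
        return 0.0  # a neuron that wins no input contributes no error
    return sum(math.dist(m_i, x) for x in inputs) / len(inputs)

def mean_quantization_error(neurons, assigned):
    """MQE_m: average of qe_i over the u units of the map."""
    qes = [quantization_error(m, xs) for m, xs in zip(neurons, assigned)]
    return sum(qes) / len(neurons)

def should_grow_horizontally(mqe_map, qe_parent, tau1):
    """Assumed test: the map still represents its parent neuron too coarsely."""
    return mqe_map >= tau1 * qe_parent

def should_expand_vertically(qe_unit, qe_layer0, tau2):
    """Assumed test: this neuron needs its own child map."""
    return qe_unit >= tau2 * qe_layer0

# Hypothetical two-neuron map with the inputs each neuron wins.
toy_neurons = [[0.0, 0.0], [2.0, 0.0]]
toy_assigned = [[[0.0, 1.0], [0.0, -1.0]], [[2.0, 2.0]]]
toy_mqe = mean_quantization_error(toy_neurons, toy_assigned)
```

Smaller thresholds produce larger and deeper maps, so they trade training time for a finer description of the agglomerations.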
This data set is made up of 115 teams organized in 12 conferences. The edges correspond to matches played during the 2000 season. The objective is to retrieve the 12 conferences. As can be seen in Figure 4, the GHSOM output is interesting because it detects almost all the conferences. However, this is a specific case in which the data set is not big, and hence, the GHSOM provides good results on the first attempt. In our framework, the GHSOMs are mainly used to provide only the starting point for detecting the communities.

Community Detection.
The detected agglomerations represent the skeleton of the future communities. They can be communities themselves, or they can form new communities by merging with other agglomerations. In the literature, there are many criteria that can be used to evaluate a community partition. The best known one is modularity. The modularity Q proposed by Girvan and Newman [7] is defined as follows:
$$Q = \sum_{c=1}^{n} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$$
where n is the number of detected communities, m is the total number of edges of the network, $l_c$ is the number of edges linking the nodes of the community c, and $l_c/m$ represents the fraction of links that fall inside the same community. $d_c$ is the total degree of the nodes of c. The value of Q ranges between "−1" and "1." A value close to "1" means that we have a good network partition.
Maximizing the modularity means maximizing the first term $\sum_{c=1}^{n} l_c/m$ and minimizing the second term $\sum_{c=1}^{n} (d_c/2m)^2$. Maximizing the first term means having densely intraconnected communities, while minimizing the second term means having sparsely interconnected communities.
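As a worked example, the modularity of a partition can be computed directly from an edge list. The sketch below assumes an undirected, unweighted graph and the usual Girvan-Newman form, in which the second term is subtracted from the first for each community; the function name and the toy network (two triangles joined by one edge) are illustrative assumptions.

```python
def modularity(edges, community_of):
    """Q = sum over communities c of l_c/m - (d_c / 2m)^2.

    edges:        list of undirected (u, v) pairs, each edge listed once
    community_of: dict mapping every node to its community label
    """
    m = len(edges)
    l = {}  # l_c: number of edges with both ends inside community c
    d = {}  # d_c: total degree of the nodes of community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        d[cu] = d.get(cu, 0) + 1
        d[cv] = d.get(cv, 0) + 1
        if cu == cv:
            l[cu] = l.get(cu, 0) + 1
    return sum(l.get(c, 0) / m - (d[c] / (2 * m)) ** 2 for c in d)

# Hypothetical network: two triangles linked by the edge (2, 3).
toy_graph = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}
q_two = modularity(toy_graph, two_groups)  # 2 * (3/7 - (7/14)^2) = 5/14
q_one = modularity(toy_graph, one_group)   # 7/7 - (14/14)^2 = 0
```

Splitting the network into its two triangles gives Q = 5/14, whereas putting every node in a single community gives Q = 0, which matches the intuition that the first partition is the better one.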
Genetic algorithms [51] are well known for their global search capability. We used them in many previous works [52-54], and they proved their efficiency. In our approach, the exploration of solutions is guided by the GHSOM. In fact, it makes little sense to put two agglomerations in the same community if they are not adjacent in the GHSOM. Proximity in the GHSOM means that the agglomerations share some common features and have strong relationships. Genetic algorithms have also been widely used [12-17] in community detection. However, our proposal is different. Our contribution lies in the fact that the initial population and the genetic operations are built by considering the layout of the agglomerations in the GHSOM. The solution (the chromosome) is represented as an integer array. Every gene of the array represents an agglomeration of the GHSOM. So, if the GHSOM detects N agglomerations, the chromosome contains N genes that can take values ranging from "1" to N. If the i-th and the j-th genes have the same value, the agglomerations i and j are in the same community. This representation is further explained by Figure 5. The network is made of 7 nodes that can be partitioned into two communities. The community partitioning may be represented by the chromosome C1 = {1, 1, 1, 2, 2, 2, 2} or the chromosome C2 = {5, 5, 5, 6, 6, 6, 6}. The values of the genes, the community identifiers, do not have a real meaning; they are simple labels.
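The encoding can be checked with a short sketch that decodes a chromosome into its communities; the function name `decode` and the 0-based gene indices are assumptions made for the example.

```python
def decode(chromosome):
    """Group gene indices by the community label stored in each gene."""
    communities = {}
    for index, label in enumerate(chromosome):
        communities.setdefault(label, []).append(index)
    # Sorting discards the labels themselves: only the grouping matters.
    return sorted(communities.values())

c1 = [1, 1, 1, 2, 2, 2, 2]
c2 = [5, 5, 5, 6, 6, 6, 6]
partition_1 = decode(c1)
partition_2 = decode(c2)
```

Both chromosomes decode to the same two groups, which illustrates that the gene values act only as labels.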

Initialization.
Creating an initial population consists in generating a set of randomly initialized chromosomes. Every gene of the chromosome is assigned a random community identifier. However, as mentioned above, the optimization process is guided by the GHSOM. So, when we initialize the chromosomes, only adjacent agglomerations can have the same community ID. This bias in the initial population makes the genetic algorithm converge faster and reduces the number of iterations.
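One possible way to obtain such a biased initialization is sketched below. It is an assumption about how the constraint could be enforced, not a description of our exact procedure: agglomerations that are adjacent on the GHSOM are merged at random with a union-find structure, so that two genes can share a community ID only if they are connected through a chain of adjacent agglomerations. The names `adjacent_pairs` and `p_join` are hypothetical.

```python
import random

def init_chromosome(n, adjacent_pairs, rng, p_join=0.5):
    """Random chromosome whose communities are connected on the GHSOM.

    n:              number of agglomerations (genes)
    adjacent_pairs: list of (a, b) pairs of agglomerations adjacent on the map
    rng:            a random.Random instance
    p_join:         probability of merging each adjacent pair (assumed knob)
    """
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in adjacent_pairs:
        if rng.random() < p_join:
            parent[find(a)] = find(b)
    # Community identifiers range from 1 to n, as in the encoding above.
    return [find(i) + 1 for i in range(n)]

# Hypothetical layout: agglomerations 0-1-2 form a chain, 3-4 form another.
toy_pairs = [(0, 1), (1, 2), (3, 4)]
population = [init_chromosome(5, toy_pairs, random.Random(seed)) for seed in range(20)]
```

With this layout, no chromosome of the population can place agglomeration 0 and agglomeration 3 in the same community because they are not linked on the map.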

Crossover.
The goal of crossover is to make two new chromosomes called children. These children represent two new solutions that are added to the solution space in the hope of increasing the fitness function. However, the classic crossover technique is not efficient for our encoding because the same community identifier in the two parent chromosomes may represent different communities. The crossover that we used is called one-way crossover and was introduced in [15]. Two chromosomes are selected: one is called the source, and the other is called the destination. From the source chromosome, we select one gene, and we look for the genes that have the same community ID. The community ID is transferred from the source chromosome to the destination one by replacing the corresponding genes in the destination with the same community ID. Following this procedure, we are sure that the communities are faithfully transferred between chromosomes. Figure 6 shows an example of a crossover operation. The target community ID is 1.
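The transfer performed by the one-way crossover can be sketched as follows. The function signature and the two parent chromosomes are hypothetical and are not those of Figure 6; in practice, the gene would be drawn at random.

```python
def one_way_crossover(source, destination, gene):
    """Copy the community of source[gene] into a copy of destination.

    Every position that carries the selected community ID in the source
    receives that ID in the child; all other genes come from destination.
    """
    community_id = source[gene]
    child = list(destination)
    for k, label in enumerate(source):
        if label == community_id:
            child[k] = community_id
    return child

# Hypothetical parents: the community with ID 1 is transferred.
src = [1, 1, 1, 2, 2, 2, 2]
dst = [3, 3, 4, 4, 4, 5, 5]
child = one_way_crossover(src, dst, gene=0)
```

Here the child becomes [1, 1, 1, 4, 4, 5, 5]. If the destination already used the transferred ID for another community, the two groups would be merged in the child, so an implementation may need to relabel the destination first; this sketch does not handle that case.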

Mutation.
In our genetic algorithm, the mutation is performed by selecting one node and changing its community ID to another, respecting the GHSOM neighborhood principle.
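A minimal sketch of such a mutation is given below, under the assumption that respecting the GHSOM neighborhood principle means that the mutated gene may only take the community ID of an element adjacent to it on the map. The `neighbors` structure and the function name are hypothetical.

```python
import random

def mutate(chromosome, neighbors, rng):
    """Give one randomly chosen gene the community ID of a map neighbour.

    neighbors[i] lists the elements adjacent to element i on the GHSOM.
    """
    child = list(chromosome)
    candidates = [i for i in range(len(child)) if neighbors[i]]
    if not candidates:
        return child  # nothing can be mutated without neighbours
    i = rng.choice(candidates)
    child[i] = child[rng.choice(neighbors[i])]
    return child

# Hypothetical chain layout 0 - 1 - 2 - 3 on the map.
toy_neighbors = [[1], [0, 2], [1, 3], [2]]
parent_genes = [1, 1, 2, 2]
mutant = mutate(parent_genes, toy_neighbors, random.Random(1))
```

The mutated gene may receive the ID it already had, in which case the chromosome is unchanged; a stricter variant could redraw until the ID actually changes.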

Complexity of Our Approach.
The complexity of the approach is crucial in social network community detection because of the large size of these networks. To evaluate the running time of our approach, we consider its two stages. The processing time of a GHSOM is proportional to the social network size n. Therefore, its complexity is on the order of O(n). The complexity of a genetic algorithm is O(g(nm + nm + n)), with g the number of generations, n the population size, and m the size of the individuals. Therefore, it is on the order of O(gnm). Therefore, the complexity of our approach is O (n)

Experiments
In this section, we detail a set of experiments that show the efficiency of our approach. The data sets used are real and artificial social networks.

Real Networks.
We tested our system on four real networks widely used in the literature. These data sets are Zachary's karate club network [55], Lusseau's network of bottlenose dolphins [56], the American college football network [5], and the political books network [7], as displayed in Table 1.

Zachary's Karate Club.
This network was made by Zachary. He studied the behavior of 34 members of a karate club for 2 years. He constructed a network of friendships between the members of the club, using a variety of measures to estimate the strength of ties between individuals [55]. He identified 2 communities of friendship in his network, as plotted in Figure 7.

Bottlenose Dolphins Network.
The bottlenose dolphins network results from a study of 62 bottlenose dolphins living in New Zealand. The study was made by Lusseau [56]. The nodes are dolphins, and the edges are relationships observed among the dolphins. The relationships are established by observation of statistically significant frequent associations. The number of edges in this network is 159. Two communities are clearly identified, as displayed in Figure 8.

American College Football Network.
The American college football network comes from United States college football [5]. The data set is made up of 115 teams organized in 12 conferences (see Figure 9). The 616 edges correspond to matches played by the teams against each other during the regular season of fall 2000.

Political Books Network.
In this network, nodes represent political books published in 2004 and purchased online through Amazon.com [7]. Two books are connected by an edge if they were frequently purchased together. The network is made up of four communities (see Figure 10). The number of nodes (books) is 105, and the number of edges is 441.
These approaches were selected for the following reasons. First, we compared our approach with Agrawal [12] because it uses genetic algorithms, and we used its genetic operators. The goal was to see the contribution of the GHSOM when used in conjunction with genetic algorithms. Second, we compared our approach with SOMSN [47] because it is the only approach that uses a self-organizing map to detect communities. Third, we compared our approach with MeanCD, Infomap, and CNM [8,37,44] because they are well known for their good performance (results and execution time). MeanCD is a recent approach and is based on influential node detection like our approach. Finally, for all these approaches, we have either the results that they achieved on the real networks mentioned above or their source code. The modularity value was used to evaluate the performance of each system. The results of this comparison are displayed in Table 1. Our system is denoted by SOMG.
When examining the obtained results, our work, MeanCD [44], and CNM [8] performed better than the others. Moreover, when we focus on the results obtained by the SOMSN system, we can conclude that using only self-organizing maps cannot generate good community partitions. Indeed, self-organizing maps can give the morphology of the communities, the skeleton of the communities, but not the whole community structures.
The results of genetic algorithms as implemented by Agrawal [12] are interesting. However, they did not perform as well as SOMG. When using the classical genetic algorithm implementation, the initialization and the genetic operations are achieved without any consideration of the structure of the social network. The process is completely random. The quality of the obtained solutions is impacted by the initialization.
CNM and Infomap [8,37] detect communities starting from isolated nodes through progressive node clustering. The clustering should increase the modularity. Although Infomap and CNM have an oriented clustering process, our approach performed better than them on all the social networks. In fact, using the GHSOM in the first step to initialize the communities made the community partitioning more efficient. The advantage of GHSOMs is that they provide a reliable initial partitioning. In fact, nodes located on the same neurons on the map certainly belong to the same community. This is an intrinsic property of the GHSOM. The GHSOM preserves the topology of social networks. The mapping of social networks preserves the relative distance between nodes. Nodes that are close to each other in the social network are mapped to adjacent neurons in the GHSOM.
MeanCD also operates in two steps like our approach SOMG; yet SOMG performs better. In SOMG, the detection of agglomerations is based on pattern recognition techniques. However, in MeanCD, the detection of agglomerations is based on measures computed on node pairs. Detecting agglomerations must be achieved relative to all the nodes and not by finding direct similarities between them two by two, which may lead to oversegmentation. According to Table 2, SOMG, Infomap, and CNM performed clearly better in terms of execution time than the other approaches. Infomap is the best because it is based on a random walk. This makes the complexity of the Infomap approach nearly linear and makes it quicker.

Artificial Networks.
Our work, MeanCD, and CNM, which achieved the best performance on real networks, have been tested on the LFR benchmark (Table 3). This benchmark was developed by Lancichinetti et al. [57].
In this study, the authors created software that generates a graph with a customized structure. Our purpose was to compare the three works on large networks having different structural features. Among these features, we mention the number of nodes N; the average degree of incoming edges k; the maximum degree of the incoming edges max k; the fraction between incoming and outgoing edges inside a community; the minimal community size min c and the maximal community size max c; and the mix parameter μ, which controls the fraction of edges between communities. A large value of μ corresponds to a network with a blurred community structure.
This experiment allowed us to test the scalability of our approach.
To measure the performance of the compared systems, we used the NMI measure instead of modularity. In fact, contrary to real networks, artificial ones have ground-truth partitions. For this reason, we used the normalized mutual information (NMI) proposed by Danon et al. [58]. The NMI value helps to compare an obtained partition A and the ground-truth partition B. By definition, when partitions A and B are totally independent, the NMI value is 0. However, if they match, the NMI value is 1.
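For illustration, the NMI between two partitions can be computed from their label lists as sketched below. The sketch follows the usual confusion-matrix form of the measure of Danon et al.; the function name and the handling of the degenerate single-community case are assumptions.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes.

    labels_a[k] and labels_b[k] are the community labels of node k in the
    obtained partition A and in the ground-truth partition B.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))  # confusion matrix entries
    mutual = sum(nij * math.log(nij * n / (count_a[a] * count_b[b]))
                 for (a, b), nij in joint.items())
    entropy = (sum(c * math.log(c / n) for c in count_a.values())
               + sum(c * math.log(c / n) for c in count_b.values()))
    if entropy == 0:
        return 1.0  # both partitions consist of a single community
    return -2 * mutual / entropy

# Hypothetical partitions of four nodes.
nmi_same = nmi([1, 1, 2, 2], [5, 5, 6, 6])         # same grouping, other labels
nmi_independent = nmi([1, 1, 2, 2], [1, 2, 1, 2])  # independent groupings
```

The first pair describes the same grouping with different labels, and the second pair is statistically independent, which corresponds to the two extreme cases mentioned in the text.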

Small Networks and Small Communities.
In the first test, we targeted the case of small networks (small number of nodes) and small communities (small number of nodes per community). We fixed the parameters as follows: the number of nodes N is set to 1,000, the community size C ∈ , and the average degree of nodes K is set to 25. The mix parameter μ ranges from 0.1 to 1. Results are displayed in Figure 11.

Small Networks and Large Communities.
In the second test, we targeted the case of small networks (small total number of nodes) and large communities (large number of nodes per community). We fixed the parameters as follows: the number of vertices N is set to 1,000, the community size C ϵ [100-250], and the average degree of nodes K is set to 25. The mix parameter μ ranges from 0.1 to 1. Results are displayed in Figure 12.

Large Network and Small Communities.
In the third test, we targeted the case of large networks (large total number of nodes) and small communities (small number of nodes per community). We fixed the parameters as follows: the number of vertices N is set to 10,000, the community size C ϵ , and the average degree of nodes K is set to 25. The mix parameter μ ranges from 0.1 to 1. Results are displayed in Figure 13.

Large Network and Large Communities.
In the fourth test, we targeted the case of large networks (large total number of nodes) and large communities (large number of nodes per community). We fixed the parameters as follows: the number of vertices N is set to 10,000, the community size C ϵ [100-250], and the average degree of nodes K is set to 25. The mix parameter μ ranges from 0.1 to 1. Results are displayed in Figure 14.

Performance and Comparison Results.
As shown in Figures 11-14, the NMI value decreases as the mixing parameter μ increases for all works. This is not surprising because it is easier for each system to detect communities in a social network with a clear community structure. In fact, when the mixing parameter becomes bigger, the structure of the network becomes blurred, and the communities become hardly distinguishable. However, for all the tests, the CNM and MeanCD performance decreases faster than that of our system. This indicates that our system is less sensitive to blurring. Our system can detect the communities' borders better than both the CNM and MeanCD. However, we consider that blurred networks remain a real limitation of our approach, and the fitness function needs more investigation. The current fitness function is suitable for distinguishable communities rather than overlapping communities.
In the case of small networks and large communities, all three works achieved comparable results. In this case, when the communities are distinguishable (that is, when the mixing parameter is less than 0.6), retrieving them is not a complex task. However, when the communities become smaller and precision becomes crucial, our approach performs clearly better, especially for values of the mixing parameter ranging from 0.3 to 0.6. The step of detecting agglomerations carried out by the GHSOM contributed considerably to discovering at least the cores of communities. On the contrary, the CNM failed to detect the essential part of each community and produced undersegmented communities. MeanCD performed better, and this is due to its initialization phase.
The results obtained for large networks are coherent with those obtained for the small ones. When the size of the communities decreases, our system performs better, and the gap in performance becomes even larger. For values of the mixing parameter ranging from 0.3 to 0.8, we performed clearly better. However, it is worth noticing that for all works, the performance decreases faster as the network grows larger.
CNM performed slightly better than our approach in terms of execution time (Table 4). This is due to the simplicity of the CNM approach, which starts from isolated nodes and progressively adds the edges of the network to increase the modularity. We have to improve our performance on this criterion. To reduce the execution time, we can parallelize the execution of the genetic algorithm. A parallel genetic algorithm uses multiple genetic algorithms to solve a single task: all of them work on the same task and, once they have finished, the best individual of each algorithm is selected.
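The parallel scheme described above can be sketched in Python as follows. This is a hedged illustration only: several independent genetic algorithms are started from different seeds, the best individual of each run is kept, and the overall best among them is then returned. The final selection among champions is our assumption. The toy OneMax fitness (count of 1-bits) stands in for the actual fitness function, and every name and parameter here (`run_ga`, `parallel_ga`, `GENOME_LENGTH`) is hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

GENOME_LENGTH = 32  # illustrative chromosome size


def fitness(genome):
    """Toy objective (OneMax): number of 1-bits. Stand-in fitness only."""
    return sum(genome)


def tournament(population, rng):
    """Binary tournament: pick two distinct individuals, keep the fitter."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(seed, generations=60, pop_size=30, mutation_rate=0.02):
    """Run one independent genetic algorithm and return its best individual."""
    rng = random.Random(seed)  # private generator, so runs do not interfere
    population = [[rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        next_population = [max(population, key=fitness)]  # elitism
        while len(next_population) < pop_size:
            parent1 = tournament(population, rng)
            parent2 = tournament(population, rng)
            cut = rng.randrange(1, GENOME_LENGTH)         # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]                    # bit-flip mutation
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


def parallel_ga(seeds, workers=4):
    """Run one GA per seed concurrently; return (overall best, champions)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        champions = list(pool.map(run_ga, seeds))
    return max(champions, key=fitness), champions


best, champions = parallel_ga([1, 2, 3, 4])
```

Threads keep the sketch short, but in standard CPython they do not speed up CPU-bound work. A process pool, or an implementation in the environment used for the experiments, would be needed to actually reduce the execution time.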

Conclusion
A two-stage system to detect communities inside social networks was proposed in this paper. The main idea of our approach was to start by detecting the cores of communities and then to refine them to obtain the final communities. The experimental results on real and artificial networks showed that starting by detecting community cores makes an important contribution. In fact, the two stages proposed in our system are complementary. The first stage, which consists in detecting the cores of communities through the GHSOM, was aimed at providing good initial conditions for the whole process. The second was a refining stage in which the genetic algorithms detected the final communities through an oriented process. The obtained results are encouraging and should stimulate future research. Overlapping communities and blurred networks constitute our first focus. The second focus will be parallelizing the execution of the genetic algorithms to reduce the execution time.

Data Availability
The test data used in this study have been taken from the website (https://www-personal.umich.edu/~mejn/netdata/). The implementation of the self-organizing map that we used can be downloaded from https://ifs.tuwien.ac.at/~andi/ghsom/. The implementation of genetic algorithms that we used can be downloaded from https://www.mathworks.com/matlabcentral/fileexchange/39021-basic-genetic-algorithm.

Conflicts of Interest
The author declares that he has no conflicts of interest.