Determining Community Structure and Modularity in Social Network using Genetic Algorithm

Research on determining community structure in complex networks has attracted a lot of attention in various applications, such as email networks, social networks, metabolic networks, airline networks, biological networks, information networks, technology networks, and computer networks. Determining community structure is popular because it makes it possible to analyze the structure and functionality of a network, where a community can be interpreted as a set of nodes that are closely related within the network. However, determining community structure by maximizing the modularity value is difficult. Therefore, many studies introduce new algorithms to solve the problems of determining community structure and maximizing modularity. A Genetic Algorithm can provide effective solutions by combining exploration and exploitation. It is a population-based computing method in which the best population is obtained through a process of random population selection, crossover, and mutation. This study focuses on a Genetic Algorithm with an added clean-up feature. The final result of this study is a comparison of the modularity values obtained when community structure is determined by the Genetic Algorithm, the Girvan and Newman Algorithm, and the Louvain Algorithm. The best modularity values were obtained using the Genetic Algorithm: 0.6833 for Zachary's karate club dataset, 0.7446 for the Bottlenose dolphins dataset, 0.7242 for the American college football dataset, and 0.5892 for the Books about US politics dataset.
Keywords: Community detection, genetic algorithm, social networks, community structure, modularity
IJCCS Vol. 14, No. 3, July 2020: 219-230. ISSN (print): 1978-1520, ISSN (online): 2460-7258.


METHODS
In this paper, a genetic algorithm is proposed to determine community structure in social and complex networks. The genetic algorithm is used as the search engine and employs network modularity as the fitness function to evolve the population. The algorithm is modified by inserting a clean-up process and eliminating the selection process. The genetic algorithm is described in detail below.

Population initialization
According to [33], an individual in the general sense is a single member of a group, while in genetic algorithms, according to [34], an individual expresses one solution. An individual is equivalent to a chromosome, which is a collection of genes. Genes can be binary, float, or combinatorial. According to [35], initial population generation is a process that produces a number of individuals randomly. The size of the population depends on the problem to be solved and the type of genetic operator that will be applied. After the population size is determined, the chromosomes in the population are initialized. Chromosome initialization is done randomly, while still respecting the solution domain and the problem constraints. Formula 1 generates a random population in binary representation.

$$\mathrm{IPOP} = \mathrm{round}\bigl(\mathrm{random}(N_{\mathrm{pop}}, N_{\mathrm{genes}})\bigr) \qquad (1)$$

Here IPOP holds the genes, each obtained by rounding a random number, and $\mathrm{random}(N_{\mathrm{pop}}, N_{\mathrm{genes}})$ generates random numbers in an array of size (population number) x (number of genes in each chromosome). The population generation routine begins by declaring the pop_size variable, the generation array, and the graph. The pop_size variable declares the number of nodes to be generated, the generation array stores the results of random node sampling, and the graph is used to find out how many nodes will be generated.
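The random binary population described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `init_population` and the `seed` parameter are hypothetical, and only the standard library is used.

```python
import random

def init_population(pop_size, n_genes, seed=None):
    """Return pop_size binary chromosomes, each holding n_genes genes.

    Each gene is a uniform random number in [0, 1) rounded to 0 or 1,
    which mirrors the "round(random(...))" idea of the text.
    """
    rng = random.Random(seed)  # seed is an assumption, added for reproducibility
    return [[round(rng.random()) for _ in range(n_genes)]
            for _ in range(pop_size)]

# Example: 4 chromosomes with 6 genes each
population = init_population(pop_size=4, n_genes=6, seed=1)
for chromosome in population:
    print(chromosome)
```

Passing a fixed seed makes repeated runs comparable, which helps when experiments are averaged over several runs.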

Fitness function
According to [36], the fitness value indicates the quality of a chromosome in the population and serves as a measurement tool: the greater the fitness value, the better the individual is as a potential solution. According to [37], the fitness value states how good a solution (individual) is and is used as a reference for reaching the optimal value in genetic algorithms. Modularity is a measure of the structure of a network or graph. Networks with high modularity have dense connections between nodes within a module or community and sparse connections between nodes in different modules or communities. Modularity is also often used to measure the quality of the community structure detected in a network, and the most popular modularity function is the one created by Newman [38]. Formula 2 is the modularity Q for undirected and unweighted networks [38].

$$Q = \sum_{s=1}^{k}\left[\frac{l_s}{m} - \left(\frac{d_s}{2m}\right)^{2}\right] \qquad (2)$$

Here $k$ is the number of communities (clusters), $l_s$ is the total number of edges in community $s$, $d_s$ is the total degree of the nodes in community $s$, and $m$ is the total number of edges in the network. The design is initiated by declaring a graph variable and then calculating modularity for each node.
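As a worked example of the modularity used as the fitness function, the sketch below evaluates Newman's Q for an undirected, unweighted graph, assuming the standard form in which each community contributes its share of internal edges minus the squared share of its total degree. The edge-list and label-list inputs, and the name `modularity`, are assumptions made for illustration.

```python
def modularity(edges, labels):
    """Newman modularity Q for an undirected, unweighted graph.

    edges  : list of (u, v) pairs, each undirected edge listed once
    labels : labels[i] is the community of node i
    """
    m = len(edges)
    internal = {}  # number of edges with both ends in community s
    degree = {}    # sum of node degrees in community s
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(s, 0) / m - (d / (2 * m)) ** 2
               for s, d in degree.items())

# Two triangles joined by a single bridge edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_single = modularity(edges, [0] * 6)            # everything in one community
print(q_split, q_single)
```

For this toy graph each triangle has 3 internal edges and total degree 7 out of m = 7 edges, so the split partition gives Q = 2 x (3/7 - 1/4) = 5/14, while a single community gives Q = 0.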

Mutation
According to [39], mutation is an important part of genetic algorithms because it reduces the chance of the search being trapped in local optima. According to [40], mutation replaces genes lost from the population through the selection process and allows genes that did not appear at population initialization to emerge. The child chromosomes are mutated by adding a very small random value (the mutation step size) with a low probability. There are several opinions about the value of the mutation rate. One is that a mutation rate of 1/n gives fairly good results; another holds that the mutation rate does not depend on the size of the population. The mutation process does not have to follow that scheme: an alternative is to mutate a number of genes equal to the mutation probability multiplied by the number of genes, with the positions of the mutated genes selected randomly [18,40,41].
A simple way to perform binary mutation is to replace one or several gene values of a chromosome. The mutation steps are as follows [42]:
Step 1) Count the number of genes in the population (length of chromosomes multiplied by population size).
Step 2) Randomly select the gene to be mutated.
Step 3) Determine the chromosome of the gene chosen to be mutated.
Step 4) Change the gene value (0 to 1, or 1 to 0) of the chromosome to be mutated.
The mutation process begins by declaring variables such as the graph and the offspring adjacency matrix, which is used to build the adjacency matrix for the offspring. After the adjacency matrix has been built, a chromosome index and a gene index are repeatedly drawn at random. If the chromosome index is the same as the random gene index, a new random gene is selected. If the two indices differ, the entry at the chromosome index and the random gene index is checked. If that entry is equal to 1, it is changed to 0, and so is the mirrored entry at the gene index and the chromosome index. If the entry is equal to 0, it is changed to 1, together with the mirrored entry.
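The bit-flip mutation on an adjacency matrix described above might look like the following sketch. It assumes a symmetric 0/1 matrix stored as a list of lists with at least two rows; the function name `mutate` and the `rng` parameter are hypothetical and are not taken from the study's code.

```python
import random

def mutate(adj, rng=random):
    """Flip one random off-diagonal entry of a symmetric 0/1 matrix.

    The chromosome (row) index and gene (column) index are drawn at
    random; the gene is re-drawn while it equals the chromosome index,
    and the mirrored entry is flipped too so the matrix stays symmetric.
    """
    n = len(adj)
    child = [row[:] for row in adj]  # copy, so the parent is left unchanged
    i = rng.randrange(n)
    j = rng.randrange(n)
    while j == i:                    # same index: select the gene again
        j = rng.randrange(n)
    child[i][j] = 1 - child[i][j]    # 1 becomes 0, 0 becomes 1
    child[j][i] = child[i][j]        # mirrored entry
    return child

adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
child = mutate(adj, random.Random(0))
```

Returning a copy rather than mutating in place keeps the parent available for a later comparison between parent and offspring.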

Crossover
According to [43], crossover is a very important process that produces a new chromosome by crossing two or more parent chromosomes, with the expectation that the new chromosome will be more efficient. According to [44], crossover (mating) is a genetic algorithm operator that involves two parents to form new chromosomes, allowing the offspring to contain parts of their parents and potentially to perform much better than their parents. The one-point crossover of [18] swaps the gene values of the chromosomes after a certain point and is usually applied to binary chromosome representations. In a one-point crossover, the crossover position k (k = 1, 2, ..., N-1), where N is the length of the chromosome, is selected randomly. Variables are exchanged between the chromosomes at this point to produce children. Figure 1 illustrates one-point crossover with a crossover probability of 0.9 [18]. The crossover process begins by declaring the offspring and probability variables. It then loops over every node or individual i in a generation and applies the specified crossover probability. Once the individual i and the gene position j are determined, individual i is crossed with individual i + 1 at gene j. If index i + 1 produces an error (it falls outside the population), individual i is instead crossed with individual i - 1 at gene j.
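A one-point crossover with the neighbor pairing described above can be sketched as follows. This is an illustration under assumptions, not the study's code: chromosomes are plain Python lists, and the names `one_point_crossover`, `crossover_population`, `p_cross`, and `rng` are hypothetical.

```python
import random

def one_point_crossover(parent_a, parent_b, p_cross=0.9, rng=random):
    """Swap the tails of two equal-length chromosomes after a random point k."""
    n = len(parent_a)
    if n < 2 or rng.random() >= p_cross:
        return parent_a[:], parent_b[:]   # no crossover: return copies
    k = rng.randint(1, n - 1)             # k in 1..N-1, both ends inclusive
    return parent_a[:k] + parent_b[k:], parent_b[:k] + parent_a[k:]

def crossover_population(pop, p_cross=0.9, rng=random):
    """Cross individual i with individual i + 1, or i - 1 for the last one."""
    offspring = []
    for i in range(len(pop)):
        mate = i + 1 if i + 1 < len(pop) else i - 1
        child, _ = one_point_crossover(pop[i], pop[mate], p_cross, rng)
        offspring.append(child)
    return offspring

a, b = [0] * 6, [1] * 6
child_a, child_b = one_point_crossover(a, b, p_cross=1.0, rng=random.Random(3))
print(child_a, child_b)
```

Checking the index before use replaces the error-then-fallback behavior of the text with an explicit bounds test, which gives the same pairing.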

Clean-up step
According to [27], the clean-up process created by [26] is an efficient way to correct nodes that have been placed in the wrong community, where the nodes belong to the parent and child vectors. Placement errors are detected from the fitness evaluation of the genetic algorithm. However, even when the overall fitness value is quite good, several nodes may still be misplaced without necessarily affecting the fitness of the community structure as a whole. The clean-up process is based on a metric called community variance and aims to reduce these placement errors.
According to [26], community variance is a node-based metric built on the idea that a community must contain more internal links than external links to other communities. In other words, most of the neighbors of a node must be in the same community as the node. [26] define the community variance $CV(i)$ of a node $i$ as the fraction of the neighbors of $i$ that belong to a different community than $i$ itself, and $CV(i)$ must be low for a good community structure. The community variance is computed as follows:

$$CV(i) = \frac{\sum_{(i,j) \in E} f(i,j)}{\deg(i)}, \qquad f(i,j) = \begin{cases} 1, & \mathrm{community}(i) \neq \mathrm{community}(j) \\ 0, & \text{otherwise} \end{cases}$$

Here $f(i,j)$ is 1 if the community of node $i$ is not the same as the community of node $j$ and 0 otherwise, $\deg(i)$ is the degree of node $i$, $E$ is the set of edges, and $\mathrm{community}(i)$ is the community of node $i$.
The clean-up process selects nodes at random. If the community variance of a selected node is greater than a threshold value, where the threshold is a constant obtained from calculation on a set of earlier nodes, the randomly chosen node is placed in the same community as its neighbors. If the threshold is not exceeded, no operation is performed on the nodes in the community [26].
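The clean-up idea can be illustrated with a short sketch. It assumes that community variance is the fraction of a node's neighbors lying in a different community, and that a node above the threshold is moved to the most common community among its neighbors; that relabeling rule, the default threshold of 0.5, and all function and parameter names are assumptions made here, not details confirmed by the text.

```python
import random
from collections import Counter

def community_variance(node, neighbors, labels):
    """Fraction of the node's neighbors that lie in a different community."""
    nbrs = neighbors[node]
    if not nbrs:
        return 0.0
    return sum(labels[j] != labels[node] for j in nbrs) / len(nbrs)

def clean_up(neighbors, labels, threshold=0.5, n_samples=None, rng=random):
    """Relabel randomly sampled nodes whose community variance is too high."""
    labels = list(labels)
    nodes = list(neighbors)
    for _ in range(len(nodes) if n_samples is None else n_samples):
        i = rng.choice(nodes)
        if community_variance(i, neighbors, labels) > threshold:
            # Assumed rule: adopt the majority community of the neighbors
            counts = Counter(labels[j] for j in neighbors[i])
            labels[i] = counts.most_common(1)[0][0]
    return labels

# Two triangles joined by the edge (2, 3); node 2 starts in the wrong community
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = [0, 0, 1, 1, 1, 1]
cv_node2 = community_variance(2, neighbors, start)
fixed = clean_up(neighbors, start, n_samples=200, rng=random.Random(0))
print(cv_node2, fixed)
```

Node 2 has two of its three neighbors in another community, so its variance of 2/3 exceeds the threshold and it is moved, while nodes 0 and 1 sit exactly at 0.5 and are left alone.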

Genetic algorithm framework
In general, community determination with the genetic algorithm starts by generating a population. The result of the generation is an array, which is converted into an adjacency matrix, and the adjacency matrix supplies the chromosomes and genes processed by the genetic algorithm. The modularity value is then calculated from the preprocessed graph data. Next, the algorithm carries out mutation, clean-up, crossover, clean-up, and the generation update. If the updated modularity value has not exceeded the value of the stopping variable, the algorithm returns to the mutation process and repeats until the modularity value exceeds that value, at which point the system stops automatically. The design of community structure determination using the genetic algorithm is shown in Figure 2.

Figure 2 Framework of genetic algorithm
Finally, the framework of the genetic algorithm is described as follows:
Step 1) Set $t = 0$, where $t$ denotes the generation number.
Step 2) Generate the initial population $P_0$ by randomly sampling $NP$ points from the search space.
Step 3) Compute the network modularity value of each individual in $P_0$.
Step 4) Perform the mutation operation (see the Mutation section for details) on each individual in $P_t$ and obtain the mutant vectors $V_t$.
Step 5) Correct the mistakes in each mutant vector in $V_t$ by executing the clean-up operation (see the Clean-up step section for details).
Step 6) Execute the modified one-point crossover (see the Crossover section for details) on each mutant vector in $V_t$ and generate the trial vectors $U_t$.
Step 7) Correct the mistakes in each trial vector in $U_t$ by executing the clean-up operation (see the Clean-up step section for details).
Step 8) Calculate the network modularity value of each trial vector in $U_t$.
Step 9) Compare each individual $x_i$ in $P_t$ with its trial vector $u_i$ in $U_t$ (i = 1, ..., NP) in terms of the network modularity value by following equation (2), and put the winner into the next population $P_{t+1}$.
Step 10) Update the generation by setting $t = t + 1$.
Step 11) If the termination criterion is not satisfied, go to Step 4; otherwise, stop and output the best individual in $P_t$.
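The framework steps above can be strung together in a compact sketch. It is a simplified stand-in rather than the study's implementation: individuals are assumed to be vectors of community labels (one label per node) instead of adjacency matrices, the termination criterion is assumed to be a fixed number of generations, the clean-up relabels a node to its neighbors' majority community, and every name and default value is hypothetical.

```python
import random
from collections import Counter

def modularity(edges, labels):
    """Newman modularity Q for an undirected, unweighted edge list."""
    m = len(edges)
    internal, degree = Counter(), Counter()
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[s] / m - (d / (2 * m)) ** 2 for s, d in degree.items())

def clean_up(neighbors, labels, rng, threshold=0.5):
    """Move sampled nodes with high community variance to the majority label."""
    labels = labels[:]
    for _ in range(len(labels)):
        i = rng.randrange(len(labels))
        nbrs = neighbors[i]
        if nbrs and sum(labels[j] != labels[i] for j in nbrs) / len(nbrs) > threshold:
            labels[i] = Counter(labels[j] for j in nbrs).most_common(1)[0][0]
    return labels

def run_ga(edges, n_nodes, pop_size=20, generations=100,
           p_mut=0.1, p_cross=0.9, seed=0):
    rng = random.Random(seed)
    neighbors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # Steps 1-3: random initial population and its modularity values
    pop = [[rng.randrange(n_nodes) for _ in range(n_nodes)] for _ in range(pop_size)]
    fit = [modularity(edges, x) for x in pop]
    for _ in range(generations):  # Steps 10-11: fixed generation budget
        # Step 4: mutation, each gene is redrawn with probability p_mut
        mutants = [[rng.randrange(n_nodes) if rng.random() < p_mut else g for g in x]
                   for x in pop]
        # Step 5: clean-up of the mutant vectors
        mutants = [clean_up(neighbors, v, rng) for v in mutants]
        # Step 6: one-point crossover with the next mutant (previous for the last)
        trials = []
        for i, v in enumerate(mutants):
            mate = mutants[i + 1] if i + 1 < pop_size else mutants[i - 1]
            k = rng.randint(1, n_nodes - 1)
            trials.append(v[:k] + mate[k:] if rng.random() < p_cross else v[:])
        # Step 7: clean-up of the trial vectors
        trials = [clean_up(neighbors, u, rng) for u in trials]
        # Steps 8-9: keep whichever of x_i and u_i has the higher modularity
        for i, u in enumerate(trials):
            q = modularity(edges, u)
            if q > fit[i]:
                pop[i], fit[i] = u, q
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

# Two triangles joined by one bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_labels, best_q = run_ga(edges, n_nodes=6, pop_size=10, generations=30, seed=1)
print(best_labels, best_q)
```

Because a trial vector replaces its parent only when its modularity is higher, the best modularity in the population never decreases from one generation to the next.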

Dataset

Zachary karate club networks
The Zachary karate club network was obtained from a karate club with 34 members. An internal conflict arose between the administrators and the coach of the club, which led the coach to create a new club with members of the original club. Represented as a graph, the Zachary karate club network has 34 nodes, 78 edges, and four communities [29].

RESULTS AND DISCUSSION
The implementation of this system uses the Python programming language. The equipment and materials used in this implementation are as follows:

Hardware
The hardware used in this study is presented in Table 1.

Software
The software used in this study is presented in Table 2.

Community | Members | Modularity
0 | 0, 3, 4, 8, 9, 10, 11, 13, 16, 19, 20, 22, 23, 24, 26, 27, 28, 30, 31, 35, 36, 37, 38, 40, 42, 43, 45, 47 |

The Genetic Algorithm was tested on Zachary's karate club network, the Bottlenose dolphins network, the American college football network, and the Books about US politics network. We performed the experiments 10 times, with 100 iterations per experiment, and recorded the average number of communities, the average processing time, the best processing time, the average modularity value, and the best modularity value. The test results, shown in Table 7, are compared with the Girvan and Newman algorithm and the Louvain algorithm.

... 36 for the average number of communities of the football dataset and the books dataset. For the Louvain algorithm, the average number of communities for the karate, dolphins, football, and books datasets is 4, 4, 10, and 4, respectively. The best processing times were all obtained using the Louvain algorithm: 0.0207 seconds for the karate dataset, 0.0426 seconds for the dolphins dataset, 0.0867 seconds for the football dataset, and 0.0901 seconds for the books dataset. The highest average modularity values for the karate and dolphins datasets were obtained using the genetic algorithm, at 0.4761 and 0.5749. For the football and books datasets, the highest average modularity values were obtained using the Louvain algorithm, at 0.6044 and 0.5265.

CONCLUSIONS
In this paper, we have introduced a genetic algorithm to determine community structure in complex networks. The proposed genetic algorithm uses a clean-up process, which effectively corrects the mistakes of putting nodes into wrong communities in both mutant and trial vectors and improves the search ability of the genetic algorithm. Based on the best modularity values, the genetic algorithm found 5 communities for Zachary's karate club dataset, 7 communities for the Bottlenose dolphins dataset, 10 communities for the American college football dataset, and 6 communities for the Books about US politics dataset. The Genetic Algorithm can be applied to increase the modularity value: on the Zachary's karate club, Bottlenose dolphins, American college football, and Books about US politics datasets it reached best modularity values of 0.6833, 0.7446, 0.7242, and 0.5892, respectively. These best modularity values are higher than those of the Girvan and Newman Algorithm and the Louvain Algorithm. The Genetic Algorithm takes a considerable amount of time to determine community structure. Its best processing time is 1.3584 seconds for the karate dataset, 4.6281 seconds for the dolphins dataset, 17.0402 seconds for the football dataset, and 14.7021 seconds for the books dataset. These processing times are much longer than those of the Louvain Algorithm and the Girvan and Newman Algorithm.