Efficient Densest Subgraphs Discovery in Large Dynamic Graphs by Greedy Approximation

Densest subgraph detection has become an important primitive in graph mining when analyzing communities and detecting events in a wide range of application domains. Developing efficient densest subgraph mining approaches that can handle both very large and continuously evolving graphs is a challenging and practically crucial research issue. Although large-scale or dynamic methods have been proposed to find the densest subgraphs, a promising method for graphs that are both large-scale and dynamically evolving is still lacking. In this paper, the problem is formulated and proved to be NP-Hard, an incremental greedy approximation approach is proposed, and its running time is shown to be O(m + n). To find the densest subgraph effectively by heuristically merging local densest subgraphs, firstly, the edge flow of a dynamic graph is divided into several subgraphs within a given period $T$. Secondly, a local candidate set is generated by local densest subgraph discovery. Thirdly, global densest subgraph candidates are collected by heuristic merging. Lastly, the densest subgraphs are induced from the global candidates under constraints by a static densest subgraph discovery algorithm. This incremental approach enables us to scale up existing densest subgraph discovery algorithms, which focus mainly on small and static graphs, so that they can handle very large dynamic graphs. Comprehensive experiments on real-world networks with billions of nodes show excellent improvements in efficiency and accuracy: the approach reduces running time by about 25% on average and yields a more accurate estimation of the structure of a graph, with more compact subgraphs, than the static method. It also performs well when dealing with graphs of varying densities.


I. INTRODUCTION
Massive graphs are commonly used to represent data and related information in a wide variety of domains. Examples include social networks (e.g., Twitter, Facebook), electronic commerce websites (e.g., Amazon), and bioinformatics (e.g., nucleus hierarchy). Nodes usually describe entities, such as Twitter accounts, objects, videos, or pictures, and edges commonly represent relationships or behaviours, such as "follow", "purchase", "share" or "like". The Densest Subgraph Problem (DSP for short) aims to discover a subset S of nodes that has the highest ratio of the number of edges to the number of nodes in S. It is a fundamental problem in graph mining. However, many challenges hinder the discovery of the densest subgraphs in massive graphs, especially when the graphs are dynamic and evolve over time. First, the scale of graphs can be very large, making it hard to fit them into memory in their entirety for processing. Although processing massive graphs takes polynomial time in theory, their sheer sizes make it computationally infeasible in practice [12], [13], [14], [15]. Second, graphs usually develop or evolve over time through link creations and deletions. It is very challenging for traditional densest subgraph mining algorithms to obtain the whole graph, and they therefore cannot process the graph completely.
To address the above-mentioned problems, an incremental approximation approach to discover the densest subgraphs more efficiently is presented in this paper. It first uses a sliding time window segmentation strategy to divide the vast-volume streaming graph into small graphs within a given period $T$. Second, the local densest subgraphs are detected by a density function after a preprocessing step that filters the least connected edges. Third, as the small graphs are processed, the approach incrementally merges the subgraphs to approximate the densest subgraphs by a greedy algorithm. Finally, the global densest subgraphs are detected by density functions.
The main contributions of this work are summarized as follows:
1) We give a proof that the problem is NP-Hard.
2) We propose a novel incremental framework to efficiently discover the densest subgraphs by heuristically merging local densest subgraphs. The merging algorithm greedily constructs the densest subgraph candidate set, preserving the main dense parts of the graph by pruning the least connected parts and adding the most connected parts. The running time is O(m + n).
3) We have conducted in-depth evaluations of the proposed method on six large-scale networks. The experimental results demonstrate the effectiveness and efficiency of the greedy approximation method. When processing more edges and nodes, our approach performs even more strongly: it reduces running time by about 25% on average over four density definitions on the six diverse datasets, and the execution time declines drastically with all subgraph detection algorithms when the data size reaches 0.6 billion, with decreases ranging from 300 s to 800 s. We compare the quality of subgraphs detected by our greedy approximation method and the static method through three measures: Cross Common Fraction (CCF), Jaccard Index (JI) and Normalized Mutual Information (NMI). Our approach outperforms the static method by 6%, 7%, and 1% on average, respectively. The influence of sparsity on synthetic graphs with different densities is also discussed; experiments show that the method is sparsity-friendly.
This paper is organized as follows. Section II discusses related work and motivations. In Section III, the densest subgraph problem (DSP) is defined, and a proof that it is NP-Hard is given. Section IV presents a greedy approximation approach for DSP together with a theoretical analysis of its running time. The proposed approach is evaluated in Section V through experiments over six large network datasets. Finally, Section VI concludes the paper.

II. RELATED WORK
A. MAIN STRUCTURES
Average density is one of the main structures for graph mining. Balalau [12] studied finding a set of dense subgraphs with limited overlap and devised approximation and heuristic algorithms to solve this NP-Hard problem. Xie [8] gave comparative research on overlapping community detection. Macgregor [7] proposed a one-pass approximation algorithm to find the densest subgraph where the graph is an unordered stream of edge deletions and insertions. Leskovec [31] empirically analyzed algorithms for finding communities in networks. Chen [32] put forward an algorithm to find multiple dense subgraphs in a sparse graph. K-core is usually used for core decomposition. Bonchi [9] designed an algorithm to efficiently process uncertain graphs through core decomposition; it aimed to reduce the number of exact density computations with both a parameter-free approximation and an exact algorithm [10]. A k-clique contains k(k − 1)/2 edges, and its average density is the largest possible. The researchers in [33], [34] identified top-k subgraphs with triangles that represent the dense regions of a large-scale graph in a parameter-free fashion in polynomial time. Ghasabeh [35] proposed a unified framework to compare and evaluate different algorithms for finding subgraphs in social networks. Valari [36] investigated the discovery of top-k dense subgraphs in both static and dynamic graphs. Other structures have also been used in many applications. A great many research works aim to reduce the size of huge-volume graphs while preserving desirable structures, such as connectivity [37], [38], [39], [40], backbone [41], distance [42], bi-simulation structure [43] and vocabulary subgraphs [44]. Although Koutra's paper managed to summarize six predefined structures (vocabulary subgraphs) in a million-node graph, it was not dedicated to finding dense subgraphs in large graphs. Moreover, these works preserved all the nodes and removed parts of the edges, so the dense parts of the original graph deteriorated.

B. DYNAMIC SETTINGS
The articles [45], [46] proposed a CONGEST-model-based dynamic approach for densest subgraph discovery in a distributed fashion. Angel et al. [47] studied densest subgraph maintenance on weighted graphs to find interesting events posted on Twitter. The top-k densest subgraph problem in dynamic collections is studied in [36], where the densest subgraph is removed recursively or stays in the collections.
This paper presents an incremental approximation approach to discover the densest subgraphs in large dynamic graphs more efficiently, one that greedily discovers local optima and gradually approximates the global densest subgraphs in the giant graph. The highlights of this method include a time window strategy for processing dynamic data, a greedy strategy for detecting the local densest subgraphs, an incremental approximation strategy for constructing candidate sets, and a selection strategy for revealing the densest subgraphs.

III. PROBLEM STATEMENT
In this section, the fundamental concepts of the densest subgraph problem (DSP) in sliding windows are defined first. Then, the challenges in solving this problem are discussed, and the densest subgraph merging problem in graph approximation is defined.
Let G(V, E) represent a graph, where a vertex v ∈ V denotes a user and an edge e ∈ E denotes a friendship in a social network. The numbers of nodes and edges in the graph are |V| = n and |E| = m, respectively. Clusters, partitions, subgraphs, and communities all describe the intrinsic structures of the graph, reflecting its densest parts. They are built from a set of non-empty node subsets and the corresponding edges. Subgraphs are represented by $Subgraphs = \{G_1, \cdots, G_{sn}\}$, where sn is the number of subgraphs and $\bigcup_{i=1}^{sn} G_i \subseteq G$. Fig. 1 illustrates the segmentation of the graph flow. Each $G_i$ in Fig. 1 can be viewed as a subgraph induced from G by a sliding time window. The notations used throughout this paper are listed in Table 1.
In this paper, we use three density functions, listed in Table 2. Average density focuses on the average density of a subgraph; k-clique selects the subgraphs of k nodes with the highest average density; and k-core is used to conduct core decomposition and find subgraphs whose nodes all have at least k links. γ is a parameter that specifies the percentage of nodes retained. The densest subgraph discovery problem in sliding time windows is formulated as follows. The streaming graph G is segmented into several subgraphs $G_i$ according to a time period T. $den(G_i)$ is applied to find local densest subgraphs in $G_i$. Then a heuristic merge algorithm (represented by $\bigcup_{i=1}^{sn}$) combines the local densest subsets, and a global densest subgraph candidate set is constructed. Finally, the maximum density subgraphs $G^*$ induced from the global candidate set by the density function are the densest subgraph approximations of the original graph G. Note that $G^*$ needs to satisfy a size constraint $|G^*| \ge \gamma |G|$.

We reduce the densest-k-subgraph problem (which is NP-hard [27]) to the densest at-least-k-subgraph problem. Suppose we are given a graph G and a parameter λ, and we wish to know whether a subgraph of size λ and density ≥ d exists. We construct a clique G′ of size $n^2$, where $|V(G)| = n$. Consider the union $G \cup G'$ and require a subgraph of size at least $n^2 + \lambda$ of maximum density. The solution S for $G \cup G'$ satisfies two properties: 1) $G' \subset S$; 2) S contains exactly λ nodes of G. From these two properties, $S \cap G$ is the maximum density subgraph of size λ in G, since otherwise we could obtain a better result by replacing the size-λ subgraph $S \cap G$ with the densest λ-subgraph. Thus, if the solution S returned by the algorithm has density $\ge \frac{\binom{n^2}{2} + \lambda d}{n^2 + \lambda}$, then G contains a size-λ subgraph of density ≥ d, and we obtain the densest subgraph. Therefore the densest at-least-k-subgraph problem is also NP-hard.
Proof of the two properties. 1) $G' \subset S$: Suppose not. Let $G' - S = T \ne \emptyset$, $|V(S)| = r$, and let the density of S be δ(S). Adding T to S introduces additional edges (edges incident on nodes in T), denoted by T′, so the new density is $\delta(S \cup T) = \frac{\delta(S)\, r + |T'|}{r + |T|}$. Since every node of T is adjacent to all other $n^2 - 1$ nodes of the clique G′, the added edges per added node exceed δ(S), so the density of $S \cup T$ is larger than the density of S, which contradicts the assumption that δ(S) is the maximum density. Therefore, G′ must be contained in S.
2) S contains exactly λ nodes of G: By the size constraint $|S| \ge n^2 + \lambda$ and property 1, S contains at least λ nodes of G. Let E(SG) denote the number of nodes of G in S. Nodes of G contribute far fewer edges to S than nodes of the clique G′ do; hence the density of S is a decreasing function of E(SG). Thus we have E(SG) = λ.

IV. INCREMENTAL GRAPH APPROXIMATION FOR DENSEST SUBGRAPHS DETECTION IN DYNAMIC GRAPHS
This section introduces our incremental graph approximation method in detail. First, the fundamental idea of the method is discussed. Then, the architecture and procedures of the method and the corresponding algorithms are given.

A. FUNDAMENTAL IDEA
Algorithm 1 High-Level Algorithm of Incremental Densest Subgraph Detection
Input: G, γ. Output: the densest subgraph approximation G*.
In graph streams, a small graph $G_i$ is first generated in a sliding time window T. Then, $G_i$ is processed to construct the local densest subgraph candidate set $S_i$ by a density function. Third, a heuristic merge algorithm combines $S_i$, i = 1, 2, . . . , sn, to form a global candidate set. Finally, the densest subgraph $G^*$ is detected from the global candidate set under the size constraint based on the density function.
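As a concrete illustration, the following Python sketch mirrors the control flow of Algorithm 1. The helpers `partition_stream`, `local_densest`, and `greedy_merge` are hypothetical stand-ins for the components sketched in the subsections below, and the final `densest_subgraph` step can be any static DSP routine (e.g., the greedy peeling sketched in Section IV-C); this illustrates the pipeline, not the exact implementation.

```python
import networkx as nx

def incremental_densest(edge_stream, period_T, sn, gamma):
    """Illustrative pipeline of Algorithm 1.

    edge_stream: iterable of (u, v, timestamp) tuples.
    period_T:    total time span covered by the stream.
    sn:          number of sliding time windows.
    gamma:       fraction of nodes kept in each candidate set.
    """
    gcs = nx.Graph()                      # global candidate set (GCS)
    # Step 1: segment the stream into window subgraphs G_1..G_sn.
    for G_i in partition_stream(edge_stream, period_T, sn):
        # Step 2: local densest-subgraph candidate set S_i.
        S_i = local_densest(G_i, gamma)
        # Step 3: heuristically merge S_i into the GCS.
        gcs = greedy_merge(gcs, S_i)
    # Step 4: induce G* from the GCS with a static DSP routine,
    # subject to the size constraint |G*| >= gamma * |G|.
    return densest_subgraph(gcs, gamma)
```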

B. ARCHITECTURE AND COMPONENTS
As illustrated in Fig. 2, the proposed approach includes four steps for solving DSP in continuous sliding windows: graph partition, local densest subgraphs generation, incremental approximation, and densest subgraph detection.

1) GRAPH PARTITION
In this step, we deal with a streaming graph with continuously arriving edges. We first divide the period T equally into sn sliding time windows and then partition the graph into sn streaming subgraphs $G_1, \ldots, G_i, \ldots, G_{sn}$, each associated with a sliding time window. We do not focus on the partition algorithm itself, which is an interesting and challenging problem in graph mining in its own right; we only apply an existing well-developed method to capture the subgraphs individually.
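A minimal sketch of this segmentation step, assuming the stream is an iterable of timestamped edges and that edges outside the observed period are simply dropped (both assumptions of this illustration, not requirements of the approach):

```python
from collections import defaultdict
import networkx as nx

def partition_stream(edge_stream, period_T, sn, t0=0.0):
    """Assign each timestamped edge to one of sn equal time windows.

    Edges outside [t0, t0 + period_T) are ignored; windows that
    receive no edges are omitted from the result.
    """
    width = period_T / sn
    windows = defaultdict(nx.Graph)
    for u, v, t in edge_stream:
        if t0 <= t < t0 + period_T:
            i = int((t - t0) // width)    # window index 0..sn-1
            windows[i].add_edge(u, v)
    return [windows[i] for i in sorted(windows)]
```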

2) LOCAL DENSEST SUBGRAPHS GENERATION
Within subgraph $G_i$, the local candidate set (LCS) is generated by discovering the local densest subgraphs $S_i$ under the constraint $|S_i| = \gamma |G_i|$, where γ is a parameter that controls the percentage of nodes retained.
The constraint $|S_i| = \gamma |G_i|$ makes the LCS different from the classical DSP problem in Equation 3: (i) when the size of the densest subgraph $G'_i$ exceeds the limit defined by parameter γ, the least connected parts are pruned; (ii) when $|G'_i| = \gamma |G_i|$, we accept all the nodes in $G'_i$, because it is an optimal local solution and the candidate set will be further processed; (iii) when $|G'_i| < \gamma |G_i|$, more nodes and edges, represented by $G''_i$, are added to the local candidate set to reach the size limit.
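The three cases can be sketched as follows, assuming a placeholder `densest_core` for any local densest-subgraph routine (for instance, the greedy peeling sketched in Section IV-C) and using node degree as a simple proxy for "least/most connected"; both choices are illustrative assumptions.

```python
import networkx as nx

def local_densest(G_i, gamma):
    """Build the local candidate set S_i with |S_i| = gamma * |G_i|.

    Sketch only: `densest_core` is any local densest-subgraph routine;
    degree serves as the connectivity proxy for pruning and growing.
    """
    budget = int(gamma * G_i.number_of_nodes())
    D = densest_core(G_i)                    # local densest subgraph G'_i
    nodes = set(D.nodes())
    if len(nodes) > budget:                  # case (i): prune least connected
        drop = sorted(nodes, key=G_i.degree)[: len(nodes) - budget]
        nodes -= set(drop)
    elif len(nodes) < budget:                # case (iii): add most connected
        rest = sorted(set(G_i) - nodes, key=G_i.degree, reverse=True)
        nodes |= set(rest[: budget - len(nodes)])
    # case (ii): len(nodes) == budget, accept G'_i as is
    return G_i.subgraph(nodes).copy()
```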

3) INCREMENTAL APPROXIMATION
After the two above-mentioned steps, the local candidate sets $S_1, \ldots, S_i, \ldots, S_j, \ldots, S_{sn}$ are established. The graph approximation algorithm then assembles these local candidate sets into the Global Candidate Set (GCS). The detailed description is given in Algorithm 2, and the flowchart is presented in Fig. 3.
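Since Algorithm 2 is not reproduced here, the following sketch shows one plausible merge policy consistent with the description above: union the candidate sets, then peel the least connected nodes if the GCS outgrows a budget. The `max_nodes` budget is a hypothetical parameter of this illustration, not a quantity from the paper.

```python
import networkx as nx

def greedy_merge(gcs, S_i, max_nodes=100_000):
    """Merge a local candidate set S_i into the global candidate set.

    Sketch: take the union, then, if the candidate set grows past a
    budget, repeatedly remove the least connected node (mirroring
    'pruning the least connected parts').
    """
    gcs = nx.compose(gcs, S_i)               # union of nodes and edges
    while gcs.number_of_nodes() > max_nodes:
        v = min(gcs.nodes(), key=gcs.degree)  # least connected node
        gcs.remove_node(v)
    return gcs
```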

4) DENSEST SUBGRAPHS DETECTION
This step aims at detecting the densest subgraphs from the simplified graph S by the density functions density(S) listed in Table 2, including edge density, k-core and k-clique. They illustrate "dense" from three aspects: the edge density defines the maximum average density of the graph and emphasizes the average density of subgraphs; the k-core finds subgraphs whose nodes have at least k edges and focuses on the connectivity of subgraphs; the k-clique discovers a group of k nodes that are directly connected to each other with k(k − 1)/2 edges and highlights the strongly connected components.
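Using networkx, the three density notions can be computed roughly as follows; this is a sketch, and the paper's exact density functions are those given in Table 2.

```python
import networkx as nx

def average_density(S):
    """Average (degree) density |E(S)| / |V(S)| used as den(S)."""
    return S.number_of_edges() / max(S.number_of_nodes(), 1)

def densest_by_kcore(S, k):
    """Maximal subgraph whose nodes all have at least k neighbours."""
    return nx.k_core(S, k)

def kcliques(S, k):
    """All cliques of exactly k nodes (each has k(k-1)/2 edges)."""
    out = []
    for c in nx.enumerate_all_cliques(S):  # yielded in nondecreasing size
        if len(c) > k:
            break
        if len(c) == k:
            out.append(c)
    return out
```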

C. THEORETICAL ANALYSIS
In this section, the running time is discussed. Let T(v, e) be the time required to find a minimum capacity cut in a graph with v nodes and e edges [48]. For the maximum density subgraph algorithm, there is only one loop, which is executed $\lceil \log((m+1)n(n+1)) \rceil = O(\log n)$ times. Inside the loop, finding the min-cut is the main step; the graph has n + 2 = O(n) nodes and 2m + 2n = O(m + n) edges, so the running time is $O(T(n, n+m)\log n)$. A previous algorithm [48] discovers a minimum capacity cut in a graph with k nodes in $O(k^3)$ steps; if this method is applied, a maximum density subgraph can be found in time $O(n^3 \log n)$. A faster min-cut approach improves the running time of the maximum density approach; e.g., Sleator's algorithm [48] has $T(v, e) = O(ve \log v)$. This min-cut algorithm yields an $O(n(n+m)\log n \log(n+m))$ bound, which is better for sparse graphs. A faster result [49] by a greedy approximation algorithm runs in O(m + n). Recent research [50] proposes an iterative peeling algorithm that outputs near-optimal and optimal solutions quickly by adding a few more passes to Charikar's greedy algorithm [49]. For Algorithm 1, there is only one loop, which is executed sn times. Inside the loop, $den(G_i)$ is the main step, which costs $O(n_i + m_i)$. For the graph $G = G_1 \cup \cdots \cup G_{sn}$, the running time is O(m + n).
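For reference, Charikar's greedy peeling [49] repeatedly removes a minimum-degree node and keeps the best intermediate subgraph, guaranteeing an average density at least half the optimum. The heap-based sketch below runs in O(m log n); bucket queues achieve the O(m + n) bound cited above.

```python
import heapq
import networkx as nx

def charikar_peel(G):
    """Greedy 1/2-approximation: repeatedly delete a minimum-degree node
    and remember the intermediate subgraph of maximum average density
    |E|/|V|. Node ids are assumed mutually comparable (heap tie-breaks).
    """
    H = G.copy()
    best_nodes = set(H)
    best_den = H.number_of_edges() / max(len(H), 1)
    heap = [(d, v) for v, d in H.degree()]
    heapq.heapify(heap)
    while len(H) > 1:
        d, v = heapq.heappop(heap)
        if v not in H or H.degree(v) != d:   # stale entry; skip it
            continue
        nbrs = list(H.neighbors(v))
        H.remove_node(v)
        for u in nbrs:                       # neighbours lost one edge
            heapq.heappush(heap, (H.degree(u), u))
        den = H.number_of_edges() / len(H)
        if den > best_den:
            best_den, best_nodes = den, set(H)
    return G.subgraph(best_nodes).copy(), best_den
```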

V. EXPERIMENTS
In this section, in-depth evaluations are conducted to demonstrate the efficiency, accuracy, and effectiveness of our incremental greedy approximation approach. The experiments are implemented on six large dynamic graphs with disordered edges and ground-truth communities. Data sets and experimental configurations are introduced first; then, the comprehensive evaluation and results are elaborated; after that, the findings from our experiments are summarized.

A. EXPERIMENTAL SETTINGS
To evaluate the performance of our approach in a consistent way, we carry out experiments on six data sets from SNAP [51]: LiveJournal, Friendster, Youtube, Orkut, DBLP, and Amazon. These data sets (Table 3) differ in density, number and scale of communities, clustering coefficient (CC for short), and number of triangles. CC is the average clustering coefficient, a widely accepted metric for measuring the quality of subgraphs. We measure the running time of both our approach and the three state-of-the-art density function algorithms, including average density, k-core, and k-clique, corresponding to the three density definitions. For subgraphs discovered by average density, the density threshold is set to the maximum average density. For communities induced by k-core and k-clique, we set k to 3 and 5, respectively.

B. IMPROVEMENT IN THE COMPUTATIONAL EFFICIENCY
The execution time of discovering the densest subgraphs with the four algorithms is improved dramatically on the six datasets by our greedy approximation approach compared with the static algorithm. Table 4 presents part of our simulation results. It also verifies the scalability of our approach and the static method with size n increasing from 0.3 million to 0.6 billion. When processing more edges and nodes, our approach performs even more strongly: it reduces running time by about 25% on average over four density definitions on the six diverse datasets, and the execution time declines drastically for all subgraph detection algorithms when the data size reaches 0.6 billion, with decreases ranging from 300 s to 800 s. Among the density algorithms, the processing time required by k-core is much longer than for the other algorithms. The k-clique algorithm shows relatively stable results when the parameter ranges from 3 to 5; when k is set to 3, it requires far more computation in theory than the larger value 5. Table 4 also reveals that our approach performs well when discovering subgraphs under diverse parameter-sensitive definitions, reducing execution time by at least 23% on average.
Our greedy approximation approach has largely reduced the execution times. This is graphically demonstrated in Figures 4-9, which show vivid comparisons of running times over the six data sets between our approach (marked in blue) and the static algorithm (marked in red): the execution time marked in blue is always less than the red one. (In average density, the threshold density is defined as the maximum edge density of a graph; in k-core, k is the maximum core number of a graph; in k-clique, k is set to 5 and 3, where 3-clique corresponds to counting triangles in the graph.) Overall, our greedy approximation approach reduces execution time by about 25% on average and performs even better as the size of the graph rockets. Our approach is also applicable to state-of-the-art algorithms to improve efficiency in diverse real-world applications.

C. IMPROVEMENT IN THE ACCURACY
In this section, we compare the quality of subgraphs detected by our greedy approximation method and the static method through three measures: Cross Common Fraction (CCF), Jaccard Index (JI) and Normalized Mutual Information (NMI). CCF [52] finds the maximal shared parts of the subgraphs between the induced ones and the real ones. Formally, it is defined as
$$CCF = \frac{1}{2}\left(\frac{1}{cn}\sum_{i=1}^{cn}\max_{j}|C_i \cap C'_j| + \frac{1}{cn'}\sum_{j=1}^{cn'}\max_{i}|C_i \cap C'_j|\right),$$
where cn and cn′ are the numbers of subgraphs from the discovery and in the original graph, respectively, and $C_i$ and $C'_j$ are the subgraphs induced by the algorithms and in the real communities, respectively.
The Jaccard Index (JI) [53] is frequently applied to measure similarity by classifying node pairs that could be clustered into the same subgraph or different subgraphs. It is defined as $JI = \frac{N_s}{N_s + N_{sd} + N_{ds}}$, where $N_s$ is the number of node pairs that are classified into the same subgraph both by the algorithm and in the original graph, $N_{sd}$ counts node pairs that are in the same subgraph originally but are divided into different subgraphs by the algorithm, and $N_{ds}$ vice versa.
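A direct (all-pairs) sketch of this computation over two node-to-subgraph labelings; the assignment of $N_{sd}$ versus $N_{ds}$ to the two directions follows our reading of the definition above.

```python
from itertools import combinations

def jaccard_index(pred, truth):
    """JI = N_s / (N_s + N_sd + N_ds) over all node pairs.

    pred, truth: dicts mapping node -> subgraph label.
    """
    n_s = n_sd = n_ds = 0
    for u, v in combinations(list(pred), 2):
        same_p = pred[u] == pred[v]
        same_t = truth[u] == truth[v]
        if same_p and same_t:
            n_s += 1
        elif same_t and not same_p:   # together in truth, split by algorithm
            n_sd += 1
        elif same_p and not same_t:   # split in truth, merged by algorithm
            n_ds += 1
    denom = n_s + n_sd + n_ds
    return n_s / denom if denom else 1.0
```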
Normalized Mutual Information (NMI) [54], [55] evaluates accuracy based on information theory. The NMI score is defined as
$$NMI = \frac{-2\sum_{i}\sum_{j} N_{ij}\log\frac{N_{ij}N_t}{N_{i.}N_{.j}}}{\sum_{i} N_{i.}\log\frac{N_{i.}}{N_t} + \sum_{j} N_{.j}\log\frac{N_{.j}}{N_t}},$$
where N is the confusion matrix, $N_{ij}$ is the overlap between a detected subgraph $S_i$ and a real one $S'_j$, $N_{i.}$ and $N_{.j}$ stand for the sums over row i and column j, respectively, and $N_t = \sum_i\sum_j N_{ij}$.

Figs. 10-12 and Table 5 indicate that subgraphs derived from our approach incorporating the k-core definition are more compact and denser. For Cross Common Fraction (CCF), subgraphs induced by our approach have 66% to 85% similarity with the original graph, whereas subgraphs detected by the static method have 59% to 80% similarity; our approach outperforms the static method by 6% on average. For the Jaccard Index (JI), considering node pairs in the same subgraph that are classified into the same subgraph, subgraphs discovered by our approach categorize node pairs into the same group with 67% to 84% accuracy, whereas the accuracy of the static method ranges from 54% to 77%; our approach outperforms the static method by close to 7% on average. For Normalized Mutual Information, our approach obtains higher scores than the static method: subgraphs detected by our approach score from 0.68 to 0.85 on the six datasets, while subgraphs detected by the static method score from 0.55 to 0.77; our approach performs better than the static method by 0.1 on average.
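As a practical cross-check of the NMI scores above, scikit-learn ships an NMI implementation over flat membership labels (its normalization may differ slightly from the confusion-matrix form given above); the label vectors here are toy values, not data from the experiments.

```python
from sklearn.metrics import normalized_mutual_info_score

truth_labels = [0, 0, 1, 2, 2]   # toy ground-truth memberships
pred_labels  = [0, 0, 1, 1, 2]   # toy detected memberships
print(normalized_mutual_info_score(truth_labels, pred_labels))
```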
Overall, the experimental results have shown that our greedy approximation approach with different density definitions presents a more accurate estimation of the structure of a graph, with more compact subgraphs, than the static method. When the graph size grows dramatically, our approach speeds up the subgraph discovery process while guaranteeing the quality of the derived subgraphs. The output subgraphs also remain compact when the graph size varies drastically.

D. INFLUENCE ON SPARSITY
The sensitivity of the proposed approach when applied to datasets with different network sparsity [56] is also a practical concern. In this experiment, synthetic networks with various average degrees (increasing from 10 to 70 at intervals of 10; note that |E| = η|V| and degree ≈ 2η) are generated by gradually increasing the density based on LiveJournal.
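The degree/density relation can be illustrated with a simple G(n, m) generator, where |E| = η|V| yields an average degree of about 2η; the actual LiveJournal-based generation procedure is not reproduced here, so this is only a sketch.

```python
import networkx as nx

def synthetic_with_avg_degree(n, avg_degree, seed=0):
    """Random graph with ~avg_degree: |E| = eta * |V|, degree ≈ 2 * eta."""
    eta = avg_degree / 2.0
    m = int(eta * n)
    return nx.gnm_random_graph(n, m, seed=seed)

for deg in range(10, 80, 10):       # average degrees 10..70, step 10
    G = synthetic_with_avg_degree(100_000, deg, seed=deg)
    print(deg, 2 * G.number_of_edges() / G.number_of_nodes())
```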
The variations of the total time cost are shown in Fig. 13. The growth trend of the proposed approach is consistent with that of the original algorithm shown in Fig. 4. We also study the accuracy of the relevant algorithms on datasets with various densities and present the results in Fig. 14. We present CC (Fig. 14) and CCF (Fig. 15) as representatives of accuracy and observe similar trends on the other measures. Our approach produces the expected communities with higher CC values as the density increases. From the perspective of CCF accuracy, our approach not only significantly outperforms the static algorithm but also shows more stable performance and is less sensitive to density. In addition, we find that the number of detected communities decreases as density increases.
An increase in network density generally results in a higher clustering coefficient. For most algorithms, it results in smaller detected communities, and in general our approach outperforms static algorithms in overall performance and stability.

E. SUMMARY OF THE COMPARISONS WITH THE STATIC METHOD
A summary of the comparisons between our greedy approximation approach and the static method for densest subgraph computation is tabulated in Table 6. It can be seen from Table 6 that the static method is relatively easier to use than our approach: our approach requires sliding window settings, while the static method needs no parameters. All other comparison items are in favour of our approach. In particular, our approach is suitable for DSP computation with streaming data, where the static method is not directly applicable. As the static method misses useful information in the data sets, our approach gives further improvements in accuracy. Also, compared to the static method, our approach is more computationally efficient, leading to much-improved scalability. Our approach uses existing DSP extraction techniques as underlying tools and incorporates them into the DSP computation, thus requiring dynamic updates and interactions with the DSP computation during the evolution of the sliding window for fixed or streaming data sets.
Our main novelty and significance are: 1) Our greedy approximation approach for densest subgraph discovery in very large and highly evolving graphs in sliding windows has been presented in this paper. 2) It helps understand and reveal the core structures of networks with fewer computations in a continuous fashion and thus reduces the total execution time; the running time is O(m + n). 3) The approach has been designed for handling large-scale streaming data by dealing with continuous sliding windows over edge streams. In particular, the approach grasps the densest subgraphs locally and constructs the global candidate set to reduce computational complexity by greedy approximation. 4) Besides efficiency and accuracy, the influence of sparsity on our approach is studied for solving real-world problems; the results show more stable performance.

VI. CONCLUSION
An incremental graph greedy approximation approach in sliding windows has been presented in this paper for finding the densest subgraphs from both large-scale and highly dynamic graphs. This helps reveal the densest communities of networks with less complicated computations in a continuous model and thus reduces the total processing time. The approach has been designed for handling large-scale dynamic streaming data by processing persistent sliding windows among edge streams. Particularly, the approach preserves the densest subgraphs locally and constructs the global candidate set to reduce computation complexity.
Experiments have been implemented incorporating this greedy approximation approach with four density definitions on six real-world graphs with ground truth. The results have demonstrated the efficiency and accuracy of the presented greedy approximation approach. They also show that the approach is robust to density variations. Therefore, the greedy approximation approach performs well in investigating large-scale networks.
The work in this paper opens the door to future research in DSP computing in three ways. From the perspective of intelligent systems, graph summarization and simplification based on new knowledge or features will bring a powerful evolution process for big-stream data. From the perspective of graph visualization, graphical representations of graph summarization and simplification, as well as of their evolution, are significant for better understanding the dynamic features of application networks. From the perspective of processing time, distributed and parallel computing in a cloud cluster will use more CPU and storage resources to achieve faster DSP computing.