Forman-Ricci flow for change detection in large dynamic data sets

We present a viable solution to the challenging question of change detection in complex networks inferred from large dynamic data sets. Building on Forman's discretization of the classical notion of Ricci curvature, we introduce a novel geometric method to characterize different types of real-world networks, with an emphasis on peer-to-peer networks. Furthermore, we adapt the classical Ricci flow, which has already proved to be a powerful tool in image processing and graphics, to the case of undirected and weighted networks. The application of the proposed method to peer-to-peer networks yields insights into topological properties and the structure of their underlying data.


Introduction
Complex networks are by now ubiquitous, both in everyday life and as mathematical models for a wide range of phenomena [Watts and Strogatz, 1998, Barabási and Albert, 1999, Albert and Barabási, 2002, Barabasi, 2016], with applications in such diverse fields as biology [Oltvai, 2004, Petri and et al., 2014], transportation and urban planning [Šubelj and Bajec, 2011, Barthélemy, 2011, Barabasi, 2016], social networks like Facebook and Twitter [Ellison et al., 2007], and, in its relevance dating back to early work on networks, communication and computer systems [Barabási and Albert, 1999]. The latter belong to the class of peer-to-peer networks, whose structure is characterized by information transfer between "peers". With novel geometric methods we attempt to analyze structural properties and dynamics of real-world networks, focusing on peer-to-peer networks as an exemplary use case.
While the body of theoretical research in the analysis of networks and related structures has been focused on the properties of the various (discrete) Laplacians (see [Banerjee and Jost, 2008] for an overview of the state of the art), a more geometric approach is pursued in [Sreejith and et al., ], namely examining the relation of Forman-Ricci curvature with other geometric network properties, such as the node degree distribution and the connectivity structure. (For a systematic comparison of various other network characteristics see [Sreejith and et al., ].) Based on this analysis, we suggest characterization schemes that yield insights into the dynamic structure of the underlying data, as described in the following section. The emphasis of the article lies on analyzing the class of peer-to-peer networks, of which we provide two examples: email communications [Kunegis, 2013] and information exchange with the file-exchange system Gnutella [J. Leskovec and Faloutsos, 2007, M. Ripeanu and Iamnitchi, 2002].
The main part of the paper introduces a novel change detection method for complex dynamic networks that exploits the Ricci flow on the edges with respect to the Forman curvature. Efficient implementations of the formalism enable structural analysis of big data, as we demonstrate with the Gnutella example. For this, we walk the reader through the analysis of sets of peer-to-peer networks with respect to change detection, providing an overview of the workflow of the method.
Future applications include the characterization of dynamic effects in various classes of real-world networks and the analysis of the underlying data, as well as the curation of related databases.

Methods
The change detection method introduced in this article is based on a network-analytic formulation of the Ricci flow using a discrete Ricci curvature on networks introduced by R. Forman [Forman, 2003]. In this section we define both the Forman-Ricci curvature and the Ricci flow on networks and demonstrate their ability to characterize complex dynamic systems and their underlying data using the example of a peer-to-peer network.

Forman-Ricci curvature on networks
In the classical context of smooth Riemannian manifolds (e.g. surfaces), Ricci curvature represents an important geometric invariant that measures the deviation of the manifold from being locally Euclidean by quantifying its volume growth rate. An essential property of this curvature is that it operates directionally along vectors. For our discrete setting it follows directly that Forman's curvature is associated with the discrete analog of those vectors, namely the edges of the network.
While Forman's approach defines Ricci curvature on the very general setting of n-dimensional cellular structures, we will concentrate on the simpler case of 1-dimensional weighted cellular spaces that can be represented as a weighted network graph. We will not consider higher dimensional cases here, since their technicalities would carry us well beyond the scope of the present paper and our intended applications. For a theoretical introduction see [Forman, 2003].

Fig 1.
Characterization of a weighted and undirected peer-to-peer network (email correspondence [Michalski and Palus, 2011, Kunegis, 2013]) with Forman-Ricci curvature. A: Network plot. Node sizes are scaled with respect to their node degrees. B: (Weighted) node degree distribution. C: Curvature map, representing the Forman-Ricci curvature along the network's edges. The curvature map consists of a heat map of a matrix where each entry (i, j) represents the Ricci curvature of the corresponding edge Ric(e = e(i, j)). Light yellow entries represent edges with low curvature, red entries those with high curvature. D: Histogram showing the distribution of the Forman-Ricci curvature.
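In the 1-dimensional case, Forman's Ricci curvature for a network edge is defined by the following combinatorial formula

$$\mathrm{Ric}_F(e) = \omega(e) \left( \frac{\omega(v_1)}{\omega(e)} + \frac{\omega(v_2)}{\omega(e)} - \sum_{e_{v_1} \sim e,\ e_{v_2} \sim e} \left[ \frac{\omega(v_1)}{\sqrt{\omega(e)\,\omega(e_{v_1})}} + \frac{\omega(v_2)}{\sqrt{\omega(e)\,\omega(e_{v_2})}} \right] \right) \qquad (1)$$

where
• $e$ denotes the edge under consideration that connects the nodes $v_1$ and $v_2$;
• $\omega(e)$ denotes the (positive) weight on the edge $e$;
• $\omega(v_1)$, $\omega(v_2)$ denote the (positive) weights associated with the nodes $v_1$ and $v_2$, respectively;
• $e_{v_1}$, $e_{v_2}$ denote the sets of edges connected to nodes $v_1$ and $v_2$, respectively.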
Note that in Equation (1) only edges parallel to a given edge e are taken into account, i.e. only edges that share a node with e. We highlight that, by its very definition, Forman's discrete curvature is associated with edges and is therefore ideally suited for the edge-based study of networks (connectivity, directionality). In particular, it does not require any technical artifice for extending a node curvature measure to edges, as some other approaches do. Nor is there any need to artificially generate and incorporate two- or higher-dimensional faces; such an approach would impose severe constraints on computability.

Additionally, a "good" discretization of Ricci curvature, such as Forman's proves to be [Forman, 2003], captures the Ricci curvature's essential characteristic of measuring the growth rate, a property that is of special interest in the context of dynamic networks. Therefore, the Forman-Ricci curvature represents a way of determining whether a network has the potential of infinite growth (negative curvature), or can only attain a maximal, and therefore computable, size (positive curvature). In particular, a network is flat, i.e. it has Forman-Ricci curvature equal to zero, if its growth and geodesic dispersion rates are similar to those of the Euclidean plane. This aspect represents a further motivation for studying the Ricci curvature of networks, since it allows one to distinguish numerically between expander-type networks of negative curvature, such as information networks, and small-world networks that are on average of strictly positive curvature (see also [Ni and et al., 2015]). We will further explore this in a forthcoming article [Weber et al., 2016b].
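The edge-based definition can be illustrated with a minimal Python sketch. It assumes that Equation (1) has the form commonly used in the Forman-Ricci literature, namely the node-weight terms $\omega(v)/\omega(e)$ minus a sum over the other edges sharing a node with e, all multiplied by $\omega(e)$. The function name `forman_curvature` and the dictionary-based graph representation are illustrative choices, not part of the original method description.

```python
import math


def forman_curvature(edge_w, node_w):
    """Forman-Ricci curvature of every edge of a simple undirected graph.

    edge_w: dict mapping frozenset({u, v}) -> positive edge weight
    node_w: dict mapping node -> positive node weight
    (Data layout is an assumption made for this sketch.)
    """
    # Collect, for every node, the edges that touch it.
    incident = {}
    for e in edge_w:
        for v in e:
            incident.setdefault(v, []).append(e)

    ric = {}
    for e, w_e in edge_w.items():
        total = 0.0
        for v in e:  # the two end nodes v_1 and v_2 of e
            w_v = node_w[v]
            total += w_v / w_e
            # Subtract the contribution of every other edge through v.
            for f in incident[v]:
                if f != e:
                    total -= w_v / math.sqrt(w_e * edge_w[f])
        ric[e] = w_e * total
    return ric
```

With all node and edge weights equal to 1, this expression reduces to 4 − deg(v_1) − deg(v_2), so each edge of a three-node path receives curvature 1 and each edge of a triangle receives curvature 0.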

Characterizing large data sets with Ricci curvature
We now want to explore the Forman curvature as a tool for characterizing real-world networks. Since this paper centers around peer-to-peer networks, we choose an example of email communication from [Kunegis, 2013]. In such network graphs (denoted G), nodes describe correspondents (peers) and edges the exchange of messages among the peers.
To characterize the network's structure with curvature, we have to impose normalized weighting schemes on both nodes and edges. Naturally, the busiest communicators should have the highest weights. Therefore we choose a combinatorial weighting scheme based on node degrees, i.e. the number of connections of each node v. Analogously, we want to weight extensively used communication channels (edges) higher than rarely used ones. For this we calculate the minimum path length l between each pair of connected nodes $(v_i, v_j)$ and impose edge weights derived from it. The motivation behind this choice of weights lies in the "small world" property (i.e. a maximum degree of separation of six [I. de Sola Pool and Kochen, 1978]) that has reportedly been found in real-world networks. To account for this, we only check for indirect connections up to a path length of six and scale the weights according to the distribution.

The resulting structure of the network is shown in Fig. 1.A; the size of the nodes reflects their weights. Using edge and node weights, we can now determine the Forman curvature distribution in the network. A curvature map (Fig. 1.C) provides a planar representation that highlights clusters and distinguished regions. The histogram (Fig. 1.D) shows the distribution of the curvature values, allowing for comparison with the node degree distribution (Fig. 1.B). By indicating a correlation between the two distributions, the results highlight the strong influence of node degree weights on the network's topology. This is consistent with observations in email communications: densely interconnected communities form around busy communication channels and active correspondents.
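The following Python sketch shows one way such weighting schemes could be set up. The concrete formulas are assumptions made for illustration only: node weights are taken as the degree divided by the maximum degree, and pair weights as 1/l for a shortest-path length l of at most six. The function names are hypothetical; only the ideas (degree-based node weights, path lengths capped at six) come from the text above.

```python
from collections import deque


def degree_node_weights(adj):
    """Assumed scheme: node degree divided by the maximum degree."""
    max_deg = max(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / max_deg for v, nbrs in adj.items()}


def path_lengths_up_to(adj, source, cutoff=6):
    """Breadth-first search: shortest path lengths from source, at most cutoff."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == cutoff:
            continue  # do not explore beyond the cutoff
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    del dist[source]
    return dist


def path_based_weights(adj, cutoff=6):
    """Assumed scheme: weight 1/l for node pairs at distance l <= cutoff."""
    weights = {}
    for v in adj:
        for u, length in path_lengths_up_to(adj, v, cutoff).items():
            weights[frozenset((u, v))] = 1.0 / length
    return weights
```

Here `adj` maps each node to a collection of its neighbours and is assumed to contain at least one edge. Capping the search at six steps keeps the breadth-first search local, which matters for networks of the size considered later.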

Ricci-flow with Forman curvature
As mentioned earlier, the Ricci flow as a powerful geometric tool was devised by Hamilton [Hamilton, 1986] and further developed by Perelman [Perelman, 2002, Perelman, 2003] in the course of his celebrated proof of the Poincaré conjecture. Since then, it has continued to be an active and productive field of study, both in terms of theoretical questions and for diverse practical applications, including work by Gu et al. (see, e.g. [Sarkar and et al., 2009]). Those mainly build on a combinatorial version introduced by Chow and Luo [Chow and Luo, 2003]. However, other discretizations of the flow, with reported applications in network and imaging sciences, are explored in the literature [Ni and et al., 2015, Saucan, 2014].
The classical Ricci flow is defined by

$$\frac{\partial g_{ij}}{\partial t} = -\mathrm{Ric}(g_{ij}) \qquad (6)$$

where $g_{ij}$ denotes the metric of the underlying manifold, here represented by the earlier introduced weighting scheme of the network's edges. Note that equation (6) shows that the Ricci flow evolves a manifold proportionally to its Ricci curvature, by "pushing" the regions of higher curvature faster. This is a fact that we exploit in our application to determine changes in dynamic (peer-to-peer) networks.
The reader might note the resemblance with the classical Laplace (or more precisely Laplace-Beltrami) flow that has become, by now, standard in Imaging and Graphics (see, e.g. [I. de Sola Pool and Kochen, 1978, Xu, 2004] and the references therein), defined as

$$\frac{\partial I}{\partial t} = \Delta I \qquad (7)$$

where I denotes the image, viewed as a parametrized surface in $\mathbb{R}^3$ (and ∆ denotes, as customary, the Laplacian). The resemblance is neither accidental nor superficial. Indeed, the Ricci curvature can be viewed as a Laplacian of the metric. We address the practical implications of this observation in the sequel; more theoretical questions on this matter are addressed in a follow-up article by the authors [Weber et al., 2016a].
In our discrete setting, lengths are replaced by the (positive) edge weights. Time is assumed to evolve in discrete steps and each "clock" (i.e. time step) has a length of 1. With these constraints the Ricci flow takes the form

$$\tilde{\gamma}(e) - \gamma(e) = -\mathrm{Ric}_F(\gamma(e)) \cdot \gamma(e) \qquad (8)$$

where $\tilde{\gamma}(e)$ denotes the new (updated) value of $\gamma(e)$, with $\gamma(e)$ being the original, i.e. given, one. In this context, we want to discuss a few issues and observations regarding this last equation:
1. At each iteration step (i.e. in the process of updating $\gamma(e)$ to $\tilde{\gamma}(e)$), the Forman curvature has to be recomputed for each edge e, since it depends on its respective weight $\gamma(e)$. This clearly increases the computational effort by magnitudes; however, the computational task is less formidable than it might appear at first.
2. As already stressed, we consider a discrete time model. Since for smoothing (denoising) a short time flow has to be applied, only a small number of iterations need be considered. The precise number of necessary iterations is to be determined experimentally. Even though a typical number can be found easily, best results may be obtained for slightly different numbers, depending on the network and on the type and level of the noise, of course.
3. Ollivier also devised a continuous flow [Ollivier, 2009], [Ollivier, 2010]. In the context of the present article, a continuous setting is not required, but for other types of networks, where the evolution is continuous in time, it might be preferable to implement the continuous variant, suitably adapted to the Forman curvature, rather than to Ollivier's one.
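The iteration described above can be sketched in a few lines of Python, assuming the unit-step update $\tilde{\gamma}(e) = \gamma(e) - \mathrm{Ric}_F(\gamma(e)) \cdot \gamma(e)$. The curvature is passed in as a callable (the hypothetical parameter `curvature_fn`), so that any implementation of the Forman curvature can be plugged in. The positivity check is an addition of this sketch, since edge weights are required to stay positive.

```python
def ricci_flow_step(edge_w, curvature_fn):
    """One unit time step of the discrete flow on the edge weights.

    edge_w:       dict mapping edge -> positive weight gamma(e)
    curvature_fn: callable taking the current edge weights and returning
                  a dict edge -> curvature (hypothetical interface)
    """
    # The curvature depends on the current weights, so it is
    # recomputed from scratch at every step (item 1 above).
    ric = curvature_fn(edge_w)
    new_w = {e: w - ric[e] * w for e, w in edge_w.items()}
    if min(new_w.values()) <= 0:
        raise ValueError("flow step produced a non-positive edge weight")
    return new_w


def ricci_flow(edge_w, curvature_fn, steps):
    """Short-time flow: apply a small number of unit steps (item 2 above)."""
    for _ in range(steps):
        edge_w = ricci_flow_step(edge_w, curvature_fn)
    return edge_w
```

Each step builds a new dictionary, so the given weights are left untouched and remain available for comparison with the evolved ones.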
In addition to the Ricci flow above, one can consider the scalar curvature flow, in which the weights of the edges $e_i = e_i(v, v_i)$ through a node v evolve according to the (Forman) scalar curvature $\mathrm{scal}_F(v)$ at that node.

Characterizing dynamic data with Ricci Flow
The Ricci flow introduced above can be utilized to characterize dynamic data. Given snapshots of a system at various (discrete) times $\{t_i\}$, we analyze the Ricci flow on the corresponding network representations. The flow yields insights into structural changes, providing a tool to identify "interesting" network regions. Applications include efficient change detection in large dynamic data representing complex systems, as described in the following section.

Let $(t_i)_{i \in I}$ be a discrete time series with step size ∆t. Consider a complex dynamic system of whose behavior we have snapshots at times $(t_i)$, represented as weighted network graphs $(G_i)_{i \in I}$. Let $t_i$ and $t_{i+1}$ be consecutive time points with corresponding graphs $G_i$ and $G_{i+1}$. The weighting schemes of the nodes and edges characterize the topological structure of the system at times $t_i$ and $t_{i+1}$. We can estimate the Ricci flow for the time step ∆t by iterating k = 1, ..., K times over

$$\tilde{\gamma}(e) - \gamma(e) = -\mathrm{Ric}_F(\gamma(e)) \cdot \gamma(e) \qquad (13)$$

where $\tilde{\gamma}(e)$ denotes the updated weight of the edge e, resulting in modified weighting schemes $\gamma_i^K$ and $\gamma_{i+1}^K$, respectively. We conjecture that the correlation between these weighting schemes characterizes the flow. Regions that were subject to significant change during the time ∆t can be identified by thresholding the resulting correlation matrix.
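The comparison step can be sketched as follows. How exactly the correlation matrix is formed is not spelled out above, so this Python sketch adopts one possible reading as an assumption: for each node, the row of smoothed weights to all other nodes is compared between the two snapshots by Pearson correlation, and nodes whose correlation falls below a threshold are flagged. The function names, the row-wise comparison and the treatment of constant rows are all illustrative choices.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        # Convention of this sketch for constant rows.
        return 1.0 if x == y else 0.0
    return cov / (sx * sy)


def changed_nodes(w_before, w_after, nodes, thresh=0.8):
    """Flag nodes whose smoothed weight rows correlate below thresh.

    w_before, w_after: dicts frozenset({u, v}) -> weight after K flow steps
    nodes:             common, ordered list of nodes of both snapshots
    """
    def row(weights, v):
        # Missing edges are treated as weight 0.
        return [weights.get(frozenset((v, u)), 0.0) for u in nodes if u != v]

    flagged = []
    for v in nodes:
        if pearson(row(w_before, v), row(w_after, v)) < thresh:
            flagged.append(v)
    return flagged
```

Two identical weighting schemes yield no flagged nodes, whereas a node whose weights are reordered between the snapshots is reported.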

Change detection with Ricci flow
A major challenge of modern data science lies in characterizing dynamic effects in large data sets, such as structural changes in discrete time series of "snapshots" of a system's state or in frequently updated large databases. Commonly, network graphs are inferred from the data to represent interactions and associations within it. In the case of peer-to-peer networks, the network describes the information flow (edges) between the peers (nodes).
We want to use the Forman-Ricci curvature to analyze dynamic changes in the structure of such interaction networks obtained from large data sets. Specifically, we want to characterize the information flow between the network's nodes using the previously introduced Ricci flow. Our method follows the formalism described in Section 1 and is schematically displayed in Fig. 2.
The analysis of the information flow can be used to detect changes or distinguished regions of activity in the data. More precisely, we take advantage of a property of the classical Ricci flow that is inherited by the suggested discretization, namely the faster evolution of regions with higher curvature. Thus changes occurring in these parts of the network will be emphasized and can be detected by characterizing the corresponding Ricci flow. In contrast to a mere comparison of the changes in curvature, this allows for analyzing the underlying dynamic effects and possibly predicting the network's future evolution. Applications include the curation of large open-access databases and the detection of rare events in experimental data, such as spiking neurons in measurements of neuronal activity.

Analysis of Gnutella peer-to-peer network
We consider a series of discrete time snapshots of a complex peer-to-peer system, represented as network graphs $(G_i)_{i \in I}$. To characterize the Ricci flow, we apply the earlier described formalism pairwise, i.e. we iterate (K = 10) over (13) for snapshots $(G_i, G_{i+1})$ at consecutive time points $t_i$ and $t_{i+1}$. Here we use the file-exchange service Gnutella as an example, analyzing the peer-to-peer networks resulting from exchange activities on two consecutive days (August 8 and August 9 2002, [J. Leskovec and Faloutsos, 2007, M. Ripeanu and Iamnitchi, 2002]). Fig. 3 shows the networks inferred from the data sets and the corresponding curvature maps with the distribution of Forman-Ricci curvature. We applied our change detection method with a correlation threshold of thresh = 0.8 to detect regions of significant changes, i.e. groups of peers with significant activity. The results are shown in Fig. 4, represented as a heat map of the thresholded correlation matrix. Light spots in the heat map correspond to network regions with large flow and structural changes, allowing for detection and localization of dynamic effects. The method provides a clear representation of dynamic changes in networks, especially in terms of the community structure of the network. Given the intrinsic community structure of a network (clusters), which can also be inferred from evaluating the curvature, one can examine the influence of the flow on this specific structural property.

Fig 2.
Schematic overview of the change detection workflow. A: We consider a set of "snapshots" of a system or database at discrete times $(t_i)_{i \in I}$ with step size ∆t. B: We infer unweighted (binary) networks from each snapshot that represent the structure of the underlying data. To simplify comparison, we normalize all networks. C: By superimposing weighting schemes based on "indirect connections", we extend the unweighted networks to weighted ones (see Section 1). D: We calculate the Ricci flow for each time step ∆t by iterating over the given scheme. E: For detection of changes, we compare the final (smoothed) weighting schemes after K iterations. F: To identify regions that were subject to significant change, we threshold the correlation matrix obtained in (E). Light regions in the resulting map indicate such regions.

Fig 3.
We analyze an Internet peer-to-peer network representing Gnutella file exchanges on two consecutive days (August 8 and August 9 2002, from [J. Leskovec and Faloutsos, 2007, M. Ripeanu and Iamnitchi, 2002]). A: Network plot (left) and curvature map (right) displaying the distribution of Forman curvature for a Gnutella snapshot on August 8. B: Analogous plots for a Gnutella snapshot from the following day, August 9.

Fig 4.
Change detection for the Gnutella peer-to-peer networks [J. Leskovec and Faloutsos, 2007, M. Ripeanu and Iamnitchi, 2002] with Ricci flow and a parameter choice of K = 10 and thresh = 0.8 (the relatively small number of iteration steps was chosen for the sake of computation time). The figure shows a heat map of the correlation matrix of the edges' Ricci curvature. Highlighted spots correspond to single edges and assortments of edges and nodes (clusters) with high activity (flow).
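Step B of the workflow, inferring an unweighted network from each snapshot, can be sketched in Python as follows. The sketch assumes that each snapshot is available as a plain-text edge list with one pair of node identifiers per line and comment lines starting with `#`; this file layout and both function names are assumptions of this illustration. Restricting the comparison to the nodes present in both snapshots is likewise only one possible way to make two networks comparable.

```python
def parse_edge_list(lines):
    """Build an undirected, unweighted adjacency map from edge-list text."""
    adj = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        u, v = line.split()[:2]
        if u == v:
            continue  # drop self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_nodes(adj_a, adj_b):
    """Sorted list of nodes that occur in both snapshots."""
    return sorted(set(adj_a) & set(adj_b))
```

Because the function accepts any iterable of lines, it can be applied to an open file object or to a list of strings.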

Discussion and Future Work
The dawning age of big data opens up great numbers of possibilities and perspectives to gain insights into the principles of nature, humanity and technology through collection and (statistical) analysis of data. However, the vast amount of available data challenges our analysis methods, leading to an increased need for automated tools that can perform rapid and efficient data evaluations.
Networks are an efficient and commonly used data representation, emphasizing interactions and associations. This representation is ideal for analyzing structures within the data under geometric aspects. We have used networks to characterize peer-to-peer systems and analyze dynamic effects in the information transfer between their peers. In particular, we introduced a method for detecting changes in these dynamics. Possible applications of this theoretical framework include denoising of experimental data and identification of "interesting" groups of data points and activities of network regions.
The geometric methods used in the present article build on R. Forman's work on Ricci curvature in networks and the corresponding Ricci flow. Future work, especially two follow-up articles by the authors [Weber et al., 2016a, Weber et al., 2016b], will expand the range of geometric tools used in the methods and develop a deeper understanding of theoretical aspects. In what follows, we want to name and discuss some extensions and theoretical considerations that we shall address in future work.
One important fact, whose implications we shall discuss later on, is the so-called Bochner-Weitzenböck formula (see, e.g. [Jost, 2011]), which relates graph Laplacian and Ricci curvature through an algebraic-geometric approach. More precisely, it prescribes a correction term for the standard Laplacian (or Laplace-Beltrami) operator, in terms of the curvature of the underlying manifold. Given that the Laplacian plays a key role in the heat equation (see, e.g. [Jost, 2011]), it is easy to gain some basic physical intuition behind the phenomenon of curvature and flow: the heat evolution on a curved metal plate differs from that on a planar one in a manner that is evidently dependent on the shape (i.e., curvature) of the plate. This suggests a number of possible future directions:
1. A task that is almost self-evident is to further experiment with very large data sets (numbers of data points in the order of ten thousand and more);
2. Another natural target is the use of our method on different types of networks, with special emphasis on biological networks;
3. A statistical analysis regarding the Ricci flow, similar to the one presented here and in [Sreejith and et al., ], should also be performed on various standard types of networks in order to confirm and calibrate the characterization and classifying capabilities of the Ricci curvature and flow.
Slightly more demanding are future experiments and comparisons with the related flows, namely
1. The Forman curvature versions of the scalar and Laplace-Beltrami flows. Especially the last one seems to be promising for network denoising, as applications of the analogous flow in image processing showed [Saucan and et al., 2008]. Moreover, the Forman-Ricci curvature comes naturally coupled with a fitting version of the so-called Bochner Laplacian (and with yet another, intrinsically connected, rough Laplacian). This aspect is subject to ongoing work and will be covered in a forthcoming paper by the authors.
2. As for the short time Ricci flow, statistical analysis should be undertaken to validate the classification potential of the long term Ricci flow.

A more ambitious, yet still feasible, future direction would be to explore network stability by considering the long time Forman-Ricci flow (as opposed to the short time one employed for denoising). This approach would exploit, in analogy with the smooth case [Perelman, 2002, Perelman, 2003], the propensity of the Ricci flow to preserve and quantify the overall, global geometry (i.e. curvature) and essential topology of the network. This would allow us to study the evolution of a network "under its own pressure" and to detect and examine catastrophic events such as virus attacks and denial of service attempts. Given the basic numerical simplicity of our method, this approach might prove to be an effective alternative to the Persistent Homology method (see, e.g. [Petri and et al., 2014]) for the 1-dimensional case of networks. Moreover, the Ricci flow does not need to appeal to higher dimensional structures (namely simplicial complexes) that are necessary for the Persistent Homology based applications, which brings clear computational advantages (see, e.g. the code described in [Mischaikow and Nanda, 2013]) as well as theoretical rigor. Furthermore, the Ricci flow defined here can be applied to weighted networks, whereas Persistent Homology requires unweighted complexes.