An iteration-based differentially private social network data release

Online social networks provide an unprecedented opportunity for researchers to analyze various social phenomena. These network data are normally represented as graphs, which contain much sensitive information about individuals. Publishing these graph data can violate users' privacy. Differential privacy is one of the most influential privacy models and provides a rigorous privacy guarantee for data release. However, existing works on graph data publishing cannot provide accurate results when releasing a large number of queries. In this paper, we propose a graph update method that transfers the query release problem to an iteration process, in which a large set of queries is used as the update criteria. Compared with existing works, the proposed method enhances the accuracy of query results. Extensive experiments show that the proposed solution outperforms two state-of-the-art methods, the Laplace method and the correlated method, in terms of Mean Absolute Error. This means that our method retains more utility of the queries while preserving privacy.


INTRODUCTION
With the significant growth of Online Social Networks (OSNs), the increasing volumes of data collected in those OSNs have become a rich source of insight into fundamental societal phenomena, such as epidemiology, information dissemination, and marketing. Much of this OSN data is in the form of graphs, which represent information such as the relationships between individuals. Releasing such graph data has enormous potential social benefits. However, the fact that graph data can be used to infer sensitive information about a particular individual [1] has raised concern among social network participants.
To deal with this problem, many privacy models and related algorithms have been proposed to preserve the privacy of graph data. Differential privacy is the most prevalent one because of its rigorous privacy guarantee. When differential privacy is adopted for graph data, the research problem is to design efficient algorithms that release statistics about the graph while satisfying the definition of differential privacy. Two notions have been proposed for graphs: Node Differential Privacy and Edge Differential Privacy. The former protects the nodes of the graph and the latter protects its edges.
Previous works have successfully achieved both node differential privacy and edge differential privacy when the number of queries is limited. For example, Hay et al. [9] implemented node differential privacy and pointed out the difficulties of achieving it. The works in [17,18] proposed publishing graph datasets using a dK-graph model. Chen et al. [4] considered the correlation between nodes and proposed a correlated release method for sparse graphs. However, these works suffer from a serious problem: as the number of queries increases, a large volume of noise is introduced. In the real world, we have to release a large number of queries for data mining, recommendation, or other purposes. The difficulty is that the privacy budget has to be divided into tiny pieces when the query set is large, so a large amount of noise is introduced into the published query answers. This paper focuses on releasing a large set of queries for graph data. Given a set of queries, we apply an iteration method to generate a synthetic graph that answers these queries accurately. We can consider the iteration process as a training procedure, in which the queries are training samples and the synthetic graph is the learned output model. Finally, we adopt the synthetic graph to answer this set of queries. As the training process consumes less privacy budget than the state-of-the-art methods, the total noise is diminished. The major research issue thus becomes how to design the iteration process that generates the synthetic graph.
The major contribution of this paper is to transfer the query release problem to an iteration-based training process. Specifically, we propose an iteration method, called Graph Update, to generate a synthetic graph that can answer a large number of queries accurately. Compared with the state-of-the-art methods, the Laplace method and the correlated method, it decreases the total amount of noise significantly.
The rest of the paper is organized as follows. We present the preliminaries in Section 2. Section 3 discusses the Graph Update method, the experimental results are presented in Section 5, and the conclusion follows in Section 6.

Notation
We consider a finite data universe X. A dataset D is an unordered set of n records from X, where each record r has d attributes. Two datasets D and D* are neighboring datasets if they differ in only one record. A query f is a function that maps a dataset D to an abstract range R, written f : D → R. A group of queries is denoted by $F = \{f_1, ..., f_m\}$, and $F(D)$ denotes $\{f_1(D), ..., f_m(D)\}$. We use the symbol m to denote the number of queries in F. The maximal difference in the results of query f is defined as the sensitivity s, which determines how much perturbation is required for the privacy-preserving answer. To this end, differential privacy provides a mechanism M, which is a randomized algorithm that accesses the database. The randomized output is denoted by a circumflex over the notation. For example, $\hat{f}(D)$ denotes the randomized answer of querying f on D.

Graph Notations
We model a social network as a simple undirected graph G = (V, E), where V is a set of nodes representing individuals and E is a set of edges representing relationships between individuals. Fig. 1 shows an example of a social network graph. The nodes are represented by circles and are connected with each other by edges represented by lines. The degree of a node is the number of its neighbours. Formally, we define the neighbourhood and the degree of a node $v \in V$ as follows.
Neighbourhood: $N(v) = \{u \in V : (u, v) \in E\}$
Degree: $D(v) = |N(v)|$
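As a small illustration of these definitions, the following sketch computes neighbourhoods and degrees from an undirected edge list. The function names and the toy graph are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict


def neighbourhoods(edges):
    """Map each node to the set of nodes it shares an edge with."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # a simple graph has no self-loops
        nbrs[u].add(v)
        nbrs[v].add(u)  # undirected: record both directions
    return dict(nbrs)


def degree(nbrs, node):
    """Degree of a node: the size of its neighbourhood (0 if isolated)."""
    return len(nbrs.get(node, set()))


# Hypothetical toy network with four individuals.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
toy_nbrs = neighbourhoods(toy_edges)
print({v: degree(toy_nbrs, v) for v in sorted(toy_nbrs)})
```

Storing neighbourhoods as sets keeps the graph simple (duplicate edges collapse), which matches the simple undirected graph assumed in this section.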

Differential Privacy
The target of differential privacy is to mask the difference in the answer of query f between neighboring datasets [7]. In $\epsilon$-differential privacy, the parameter $\epsilon$ is defined as the privacy budget [7], which controls the privacy guarantee level of mechanism M. A smaller $\epsilon$ represents stronger privacy. The formal definition of differential privacy is as follows.

Definition 1 ($\epsilon$-Differential Privacy) A randomized algorithm M gives $\epsilon$-differential privacy if, for any pair of neighboring datasets D and D*, and for every set of outcomes $\Omega$, M satisfies:
$\Pr[M(D) \in \Omega] \le \exp(\epsilon) \cdot \Pr[M(D^*) \in \Omega]$ (1)
Sensitivity is a parameter determining how much perturbation is required in the mechanism for a given privacy level.

Definition 2 (Sensitivity) [7] For a query f : D → R, the sensitivity of f is defined as
$s = \max_{D, D^*} \| f(D) - f(D^*) \|_1$ (2)
The Laplace mechanism adds Laplace noise to the true answer. The mechanism is defined as follows.

Definition 3 (Laplace mechanism) [7] Given a function f : D → R over a dataset D, Eq. 3 provides $\epsilon$-differential privacy:
$\hat{f}(D) = f(D) + \mathrm{Lap}(s/\epsilon)$ (3)
In graph data, we use G to represent D.
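The Laplace mechanism can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library; the function names are assumptions of this sketch, and the noise is sampled as the difference of two exponential draws, which follows a Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale): the difference of two i.i.d. Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return true_answer + Lap(sensitivity / epsilon)."""
    if epsilon <= 0 or sensitivity < 0:
        raise ValueError("epsilon must be positive and sensitivity non-negative")
    if rng is None:
        rng = random.Random()
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical usage: a degree of 12 released with s = 1 and epsilon = 0.5.
rng = random.Random(7)
print(laplace_mechanism(12, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller epsilon gives a larger noise scale s/epsilon, which is the trade-off between privacy and accuracy described above.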

Node Differential Privacy
Node differential privacy ensures the privacy of a query over two neighbouring graphs, where two neighbouring graphs can differ in up to all edges connected to one node. Hay et al. [9] first proposed the notion of node differential privacy and pointed out the difficulty of achieving it, even though it provides a strong privacy guarantee. Hay et al. [10] showed that query results were highly inaccurate for graph analysis because of the large noise.
Recently, a few works [5,11] have contributed to reducing sensitivity and returning accurate answers under node differential privacy. Although this is good progress, these algorithms are still hard to apply in the real world, and the most prevalent algorithms focus on edge differential privacy.
computer systems science & engineering

Edge Differential Privacy
Edge differential privacy means that adding or deleting a single edge between two nodes in the graph makes a negligible difference to the result of the query. The first differentially private computation over graph data with edge differential privacy appeared in [16], in which Nissim et al. counted the number of triangles in the graph. They introduced the concept of smooth sensitivity to calibrate the noise to a more local variant of sensitivity.
The work in [15] shared a differentially private graph topology based on the stochastic Kronecker graph generation model by perturbing the model parameters. However, the stochastic Kronecker generation model cannot capture the properties of a graph accurately because of its simple generation process.
The works in [17,18] published graph datasets using a dK-graph model. They applied the dK-series as the query function and added controllable noise based on the sensitivity parameter. Wang et al. [18] proved that the private dK-graph model can capture most graph properties more precisely and achieve better utility preservation. To reduce the noise added to the dK-series, Sala et al. [17] provided an algorithm that partitions the dK-series into clusters with similar degrees, which significantly reduces the sensitivity of each sub-series. However, it uses local sensitivity, which can reveal information and therefore cannot achieve strict privacy preservation [16].
A different approach was proposed in [19], which infers the network's structure via connection probabilities. The authors encoded the structural information of the social network by the connection probabilities between nodes instead of the presence or absence of edges, which reduces the impact of a single edge. Another work [2] made a reasonable hypothesis about the structure of the dataset to restrict the sensitivity of the query. However, those methods generate a large dense matrix, which is computationally infeasible for large social networks.
The work most similar to ours is that of Chen et al. [4], which shares the target of this paper: releasing a synthetic graph to publish a large set of queries. However, they focused on correlated queries on sparse graphs. When dealing with a large number of queries, the performance is not optimal.

Overview of Graph Update
The release method is an iteration-based algorithm, which is a prevalent release scenario in many applications [8]. Our proposed method is called the Graph Update method, as the key idea is to update a synthetic graph until all queries have been answered. Consider a social network graph G and a set of queries $F = \{f_1, ..., f_m\}$. Our goal is to release a set of query results $\hat{F}$ and a synthetic graph $\hat{G}$ to the public. Our general idea is to define an initial graph $\hat{G}_0$ and update it to $\hat{G}_{m-1}$ in m rounds according to the m queries in F. The release answers $\hat{F}$ and the synthetic graph $\hat{G}$ are generated during the iteration. During the process, four different types of query answers are involved in the iteration:
• True answer $a_t$: this is the real answer with which a graph responds to a query. We cannot release it directly, as this would raise privacy concerns. The true answer is normally used as the baseline to measure the utility loss of a privacy-preserving algorithm. In this paper, we use $a_t = f(G)$ to represent the true answer for a single query f, and $A_t = F(G) = \{a_{t1}, ..., a_{tm}\}$ to represent the answer set for a query set F.
• Noise answer $a_n$: when we add Laplace noise to a true answer, the result is the noise answer. The traditional Laplace method releases the noise answer directly. However, as mentioned in Section 1, this introduces a large amount of noise into the released result. We use $a_n = \hat{f}(G) = f(G) + \mathrm{Lap}(s/\epsilon)$ to represent a single query answer and $A_n = \hat{F}(G) = \{a_{n1}, ..., a_{nm}\}$ to represent an answer set.
• Synthetic answer $a_s$: this is the answer generated by a synthetic graph $\hat{G}$. We use $a_s = f(\hat{G})$ to represent a single query answer and $A_s = F(\hat{G}) = \{a_{s1}, ..., a_{sm}\}$ to represent an answer set.
• Release answer $a_r$: this is the answer finally released after the iteration. In the Graph Update method, the release answer set consists of noise answers and synthetic answers. We use $a_r = \hat{f}$ and $A_r = \hat{F} = \{a_{r1}, ..., a_{rm}\}$ to represent the single answer of a query and the answer set, respectively.
These four query answers control the graph update process. An overview of the method is presented in Figure 2. On the left side of the figure, the query set F is evaluated on G to obtain the true answer set $A_t$. Laplace noise is then added to $A_t$ to obtain the set of noise answers $A_n = \{a_{n1}, ..., a_{nm}\}$. Each noise answer $a_{ni}$ helps to update the initial graph $\hat{G}_0$ and produce a release answer $a_{ri}$. The method eventually outputs $A_r = \{a_{r1}, ..., a_{rm}\}$ and $\hat{G}_m$ as the final results. Compared with the traditional Laplace method, the proposed Graph Update method adds less noise. As some queries are answered by the synthetic graph, these query answers do not consume any privacy budget. Moreover, the synthetic graph can be applied to predict new queries without any privacy budget. Consequently, the Graph Update method can outperform the traditional Laplace method.

Graph Update Method
The Graph Update method works in three steps:
• Initialize the synthetic graph: as we only preserve edge privacy, we assume that the number and the labels of nodes are fixed. The synthetic graph is initialized as a fully connected graph with fixed nodes.
• Update the synthetic graph: the initial graph is updated according to the result of each query in F, until all queries in F have been used.
• Release query answers and synthetic graph: two types of answers, noise answers and synthetic answers, have the potential to be released. The synthetic graph is also released to the public.
Algorithm 1 (Steps 11 and 12)
11. Make all degrees in $\hat{G}$ round numbers.
12. Output $A_r = \{a_{r1}, ..., a_{rm}\}$ and $\hat{G}$.
vol 33 no 2 March 2018
Algorithm 1 gives a detailed description of the Graph Update method. In Step 1, the privacy budget is divided by m and assigned to each query in the set.
Step 2 initializes the graph $\hat{G}_0$ as a fully connected one. Then, for each query $f_i$ in the query set F, the algorithm computes the true answer $f_i(G)$ in Step 3. After that, the noise answer and the synthetic answer of $f_i$ are computed in Steps 4 and 5, respectively. Step 6 measures the distance between the noise answer and the synthetic answer. If the distance is larger than a threshold T, Step 7 releases the noise answer. Otherwise, the synthetic graph is updated by the Update Function in Step 8 and Step 9 releases the synthetic answer. This means the synthetic graph is applicable for answering the query, so in Step 10 we pass the current synthetic graph to the next round. This process is iterated until all queries in F have been processed. Finally, as the number of edges should be an integer, we round the degrees in Step 11, and the algorithm outputs $A_r$ and $\hat{G}$ in Step 12.
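The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: queries are modelled as callables that map a graph to a number, `update` is a hypothetical callable standing in for the Update Function, the absolute value of the distance is compared with T, and the branch structure follows the step-by-step description literally. Step 2 (building the initial graph) and Step 11 (rounding) are left to the caller.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def graph_update(graph, synthetic, queries, epsilon, threshold, update,
                 sensitivity=1.0, rng=None):
    """Return (release answers, final synthetic graph, budget spent)."""
    if rng is None:
        rng = random.Random()
    eps_step = epsilon / len(queries)          # Step 1: split the budget
    released, spent = [], 0.0
    for query in queries:
        a_t = query(graph)                     # Step 3: true answer, never released
        a_n = a_t + laplace_noise(sensitivity / eps_step, rng)  # Step 4
        a_s = query(synthetic)                 # Step 5: synthetic answer
        d = a_n - a_s                          # Step 6: signed distance
        if abs(d) > threshold:
            released.append(a_n)               # Step 7: release the noise answer
            spent += eps_step                  # accounting as in the privacy analysis
        else:
            synthetic = update(synthetic, query, d)  # Step 8 (assumed signature)
            released.append(a_s)               # Step 9: answer computed in Step 5
        # Step 10: `synthetic` is carried into the next round
    return released, synthetic, spent
```

Returning the spent budget alongside the answers makes it easy to see how many queries were answered from the synthetic graph for a given T.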
The parameter T is a threshold on the distance between $A_n$ and $A_s$. A larger T means fewer updates of the graph, and most of the answers in $A_r$ are synthetic answers. This consumes less privacy budget; however, when the synthetic graph is far from the original graph, the performance may not be optimal. A smaller T means the algorithm performs more updates of the graph and most of the answers in $A_r$ are noise answers. More privacy budget is consumed in this configuration. Consequently, the choice of T has an impact in different scenarios. We will determine the value of T experimentally in Section 5.

Algorithm 2 Update Function
Require: $\hat{G}$, f, d, θ (0 < θ < 1)

Update Function
Step 8 in Algorithm 1 involves an Update Function, which updates the synthetic graph $\hat{G}$ to a new graph $\hat{G}'$ according to the query answers. Specifically, the Update Function is controlled by the distance d between $a_n$ and $a_s$ of f. If $a_n$ is smaller than $a_s$, the synthetic graph has more edges than the original graph among the related nodes, and the Update Function has to delete some edges between the related nodes. Otherwise, the Update Function adds some edges to the synthetic graph. The related nodes are defined in Definition 4:

Definition 4 (Related Node)
For a query f and a graph G, the related nodes $V_f$ are all nodes that respond to the query f. We use the set $D(V_f)$ to denote the degrees of those nodes.
The number of edges of a node should be an integer. However, to adjust the degrees of the related nodes, we assign a weight θ (0 ≤ θ ≤ 1) to each edge. After the update, these weights are rounded to represent node edges. Algorithm 2 gives the details of the Update Function. In the first step, the function identifies the related nodes. If d > 0, which means the synthetic graph has fewer edges than the original one, the function increases the edge weights by θ in Step 2. If d ≤ 0, which means the synthetic graph has too many edges, the function diminishes those edges by θ in Step 3.
Step 4 merges the edges into the graph.
Step 5 outputs $\hat{G}'$.
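One possible reading of the Update Function is sketched below. It is an illustration under explicit assumptions rather than the paper's Algorithm 2: the synthetic graph is stored as a symmetric dictionary of edge weights, the related nodes are passed in by the caller, every edge incident to a related node is shifted by θ, and weights are clamped to [0, 1]. All function names are hypothetical.

```python
def complete_graph(nodes):
    """Fully connected weighted graph: every edge starts with weight 1.0."""
    return {u: {v: 1.0 for v in nodes if v != u} for u in nodes}


def weighted_degree(graph, node):
    """Degree query on the weighted graph: sum of incident edge weights."""
    return sum(graph[node].values())


def update_function(graph, related_nodes, d, theta):
    """Return an updated copy of `graph`; the input graph is not modified."""
    updated = {u: dict(nbrs) for u, nbrs in graph.items()}
    step = theta if d > 0 else -theta       # Steps 2-3: add or remove weight
    seen = set()                            # adjust each edge at most once
    for v in related_nodes:                 # Step 1: nodes touched by the query
        for u in updated[v]:
            edge = frozenset((u, v))
            if edge in seen:
                continue
            seen.add(edge)
            w = min(1.0, max(0.0, updated[v][u] + step))
            updated[v][u] = w               # Step 4: write the new weight back
            updated[u][v] = w               # keep the graph undirected
    return updated                          # Step 5


g0 = complete_graph(["a", "b", "c"])
g1 = update_function(g0, ["a"], d=-1.0, theta=0.25)
print(weighted_degree(g0, "a"), weighted_degree(g1, "a"))  # prints 2.0 1.5
```

Working on a copy keeps the previous synthetic graph available, which is convenient when comparing consecutive rounds; an in-place update would be an equally valid design.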

Privacy Analysis
This section compares the privacy of the traditional Laplace method and Graph Update. The sequential composition [14] of the privacy budget is applied, as stated in Lemma 1. Sequential composition accumulates the privacy budget of each step when a series of private steps is performed sequentially on a dataset.
Lemma 1 (Sequential Composition) Suppose a method $M = \{M_1, ..., M_m\}$ has m steps, and each step $M_i$ provides an $\epsilon'$ privacy guarantee. Then the sequence M provides $(m \cdot \epsilon')$-differential privacy.
For the traditional Laplace method, when answering F with m queries, $\epsilon$ is divided into m pieces and assigned to each query $f_i \in F$. Specifically, we have $\epsilon' = \epsilon/m$ and, for each query, the noise answer is $a_{ni} = f_i(G) + \mathrm{Lap}(s/\epsilon')$. According to sequential composition, the Laplace method preserves $(\epsilon' \cdot m)$-differential privacy, which is equal to $\epsilon$-differential privacy.
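To see why dividing the budget hurts accuracy, the following short derivation computes the expected error of one noise answer. It is a sketch that uses only the definitions above and the standard fact that a Laplace variable with scale b has expected absolute value b.

```latex
\epsilon' = \frac{\epsilon}{m}
\quad\Longrightarrow\quad
b = \frac{s}{\epsilon'} = \frac{m\,s}{\epsilon},
\qquad
\mathbb{E}\bigl[\,|a_{ni} - f_i(G)|\,\bigr]
  = \mathbb{E}\bigl[\,|\mathrm{Lap}(b)|\,\bigr]
  = b = \frac{m\,s}{\epsilon}.
```

For example, with s = 1, $\epsilon$ = 1 and m = 200, the expected absolute error per query is 200, so the noise of the Laplace method grows linearly with the number of queries.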
In the Graph Update method, the release answer set $A_r$ is the combination of noise answers $A_n$ and synthetic answers $A_s$. Only $A_n$ consumes privacy budget, while $A_s$ does not. In Algorithm 1, even though Step 4 adds Laplace noise to the true answer, the noisy result is not released directly. Only when the algorithm proceeds to Step 7, in which $a_n$ is released, does it consume privacy budget. Suppose j (0 ≤ j ≤ m) queries in F are answered by synthetic answers; the algorithm then preserves $((m - j) \cdot \epsilon')$-differential privacy. As $(m - j) \cdot \epsilon' \le m \cdot \epsilon'$, the Graph Update method preserves stricter privacy than the traditional Laplace method.
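As a worked illustration of this accounting, take the assumed values m = 100 queries, $\epsilon$ = 1, and j = 40 queries answered from the synthetic graph. These numbers are hypothetical and are not taken from the experiments.

```latex
\epsilon' = \frac{\epsilon}{m} = \frac{1}{100} = 0.01,
\qquad
(m - j)\,\epsilon' = 60 \times 0.01 = 0.6 \;\le\; m\,\epsilon' = 1 .
```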

Utility Analysis
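We apply Mean Absolute Error (MAE) as the utility measure of the query set on a graph. $MAE_r$ of the release answers $A_r$ is defined in Eq. 6:
$MAE_r = \frac{1}{m} \sum_{i=1}^{m} |a_{ri} - a_{ti}|$ (6)
Similarly, $MAE_n$ of the noise answers and $MAE_s$ of the synthetic answers are defined in Eq. 7 and Eq. 8, respectively:
$MAE_n = \frac{1}{m} \sum_{i=1}^{m} |a_{ni} - a_{ti}|$ (7)
$MAE_s = \frac{1}{m} \sum_{i=1}^{m} |a_{si} - a_{ti}|$ (8)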
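The MAE of any answer set against the true answers can be computed with a short helper. The sketch below is illustrative only; the function name and the sample values are assumptions.

```python
def mean_absolute_error(answers, true_answers):
    """Average of |answer - true answer| over all queries."""
    if not answers or len(answers) != len(true_answers):
        raise ValueError("need two non-empty answer lists of equal length")
    total = sum(abs(a - t) for a, t in zip(answers, true_answers))
    return total / len(answers)


# Hypothetical answers to three degree queries.
a_t = [4, 10, 7]          # true answers
a_r = [5.0, 8.5, 7.0]     # released answers
print(mean_absolute_error(a_r, a_t))  # (1 + 1.5 + 0) / 3
```

The same function yields the MAE of the noise answers or the synthetic answers when those answer sets are passed as the first argument.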
It is obvious that for the true answers $A_t$, the MAE is zero. $MAE_n$ represents the performance of the traditional Laplace method. A lower MAE implies better performance. The target of the Graph Update method is to achieve a lower $MAE_r$ under a fixed privacy budget. We use the simulated Figure 3 to illustrate the relationship between the MAE values and the size of the query set m.
In Figure 3, the x axis is the size of the query set and the y axis is the value of MAE. For the noise answers $A_n$, $MAE_n$ rises as m increases. We use a smooth line to represent $MAE_n$ in this simulated figure; in a real case, the line fluctuates because the noise is drawn from the Laplace distribution. $MAE_s$ decreases at the beginning as m increases. After it reaches its lowest point, $MAE_s$ begins to rise as m grows. This is because, as the graph is updated, the synthetic graph becomes more and more accurate and $MAE_s$ keeps decreasing. However, as the iteration procedure is controlled by the noise answers, the synthetic graph cannot become equal to the original graph, no matter how large m is. On the contrary, as m increases, more noise is introduced into the iteration and the synthetic graph drifts away from the original graph.
As $A_r$ is the combination of $A_n$ and $A_s$, $MAE_r$ of the release answers is determined by $MAE_s$ of the synthetic answers and $MAE_n$ of the noise answers. Figure 3 shows that $MAE_s$ falls below $MAE_n$ when the query size reaches $m_1$. We will use experiments to confirm the optimal MAE in Section 5. As random noise is introduced into the method, the points $m_1$ and $m_2$ can hardly be determined exactly. In a real case, they are ranges rather than exact points. In the Graph Update method, the parameter T is used to adjust the range.

EXPERIMENT AND ANALYSIS
This section evaluates the performance of the proposed Graph Update method by answering the following questions:
• How does the parameter T impact the performance of Graph Update?
Graph Update contains an essential parameter T that controls the released outputs. In the first part of the experiment, we test the impact of T in terms of Mean Absolute Error (MAE).
• How does Graph Update perform compared with the traditional Laplace method and other related methods?
The proposed Graph Update method aims to answer a large set of queries effectively. We investigate the performance of Graph Update on a set of queries and compare it with the traditional Laplace method and the Correlated method proposed by Chen et al. [4]. In addition, the performance is also measured under different privacy budgets.

Datasets and Configuration
The experiments involve four datasets, which are collected from the Stanford Network Analysis Platform (SNAP) [13]. In the experiments, we consider the degree query on nodes, which is similar to the count query on a relational dataset. To preserve edge privacy, the degree query has a sensitivity of 1, which means that deleting an edge has a maximum impact of 1 on the query result. The performance is measured by Mean Absolute Error (MAE), defined in Eq. 6.
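The sensitivity of a single degree query can be checked on a toy graph: removing one edge lowers the degree of each of its two endpoints by exactly 1 and leaves every other degree unchanged. The helper names and the toy edge list below are assumptions made for illustration.

```python
def degree_of(edges, node):
    """Count the edges of a simple undirected graph that touch `node`."""
    return sum(1 for u, v in edges if node in (u, v))


def max_degree_change(edges, removed_edge):
    """Largest change in any single node's degree when one edge is removed."""
    remaining = [e for e in edges if set(e) != set(removed_edge)]
    nodes = {n for e in edges for n in e}
    return max(abs(degree_of(edges, n) - degree_of(remaining, n)) for n in nodes)


toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(max_degree_change(toy_edges, ("c", "d")))  # prints 1
```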

Evaluation of Parameters
In Graph Update, T is a threshold that controls the released results and has a direct impact on the accuracy of the query results. We therefore investigate the impact of T on the utility. The parameter T is varied from 0.02 to 1 with a step of 0.02, with the size of the query set equal to 10 and the privacy budget fixed at 1. Fig. 4 shows that, at the beginning, MAE drops quickly as T increases. However, once T reaches a certain value, MAE reaches its minimum and then starts to increase. For example, as shown in Fig. 4a, MAE keeps decreasing until T = 0.1100, with MAE = 50.37 at its lowest point. After this, as T increases, MAE keeps rising. This trend can be observed in the other datasets. In Fig. 4b, MAE reaches its minimum when T = 0.2100 and remains stable until T ≥ 0.7100. After that, MAE keeps increasing. Fig. 4c and 4d show the same trend. This pattern shows the impact of T on the performance. At the beginning, when T is relatively small, increasing T decreases the number of update rounds, which means that privacy budget is saved and less noise is added to the query answers. Thus MAE keeps decreasing. However, when T reaches a certain value, the decreasing number of update rounds leads to an inaccurate synthetic graph. Consequently, we choose a suitable T for each dataset to achieve a minimal MAE. According to the results shown in Fig. 4, we choose T = 0.3100 for the ego-Facebook dataset, T = 0.3600 for the Wiki-Vote dataset, T = 0.2600 for the p2p-Gnutella08 dataset, and T = 0.4100 for the ca-GrQc dataset.
The parameter θ is another important parameter that affects Graph Update. To evaluate the impact of θ, we use 100 queries and vary it from 0.1 to 1. Fig. 5 shows that in all datasets, as θ increases at the beginning, the MAE of Graph Update decreases. However, after MAE reaches its lowest value, it begins to increase as θ grows. This trend means that when θ is too small, the graph cannot be fully updated within 100 queries. Consequently, MAE keeps decreasing as θ increases. In this range, a larger θ helps to update the graph within a limited number of queries. However, this decrease of MAE does not last long: when θ is large enough, MAE rises as θ increases. During this process, we can choose a suitable θ that minimizes MAE.
The ego-Facebook dataset in Fig. 5a shows that when θ reaches 0.0800, the minimum MAE is 70.100. This means that for this dataset, a proper θ could be 0.0800 when answering 100 queries. When θ reaches 0.1300, MAE increases sharply. From Fig. 5c and Fig. 5d, we can observe that θ can be 0.1 to 0.3 and 0.1 to 0.25 for those two datasets, respectively.

Performance Evaluation on Diverse Size of Query Sets
The performance of Graph Update is examined through comparison with the state-of-the-art Laplace method [6] and the Correlated method [4]. We vary the size of the query set from 1 to 200, with each query independent of the others. Parameters T and θ are set to the optimal values for each dataset, and $\epsilon$ is fixed at 1 for all methods. Figure 6 gives an overall picture of the performance of Graph Update compared with the other methods. First, we observe that as the size of the query set increases, the MAEs of all methods increase approximately linearly. This is because the queries are independent of each other and the privacy budget is divided equally among the queries. As the number of queries increases linearly, the noise added to each query answer grows linearly. Second, Figure 6 shows that Graph Update has a lower MAE than the other two methods, especially when the query set is large. As shown in Figure 6a, when the size of the query set is 200, the MAE of Graph Update is 99.8500, while the Laplace method has an MAE of 210.0020 and the Correlated method has an MAE of 135.2078; the MAE of Graph Update is thus 52.45% and 26.15% lower than those of the two methods, respectively. This trend can also be observed in Figures 6b, 6c and 6d. Graph Update performs better because part of the query answers do not consume any privacy budget, and noise is only added in the update procedure. Other methods, including the Laplace method, consume privacy budget when answering every query. The result shows the effectiveness of Graph Update in answering a large set of queries. Third, it is worth mentioning that when the size of the query set is limited, the proposed Graph Update does not necessarily outperform the Correlated method. Figure 6a shows that when the size is less than 20, the MAEs of Graph Update and the Correlated method are intermixed. This is because when the query set is limited, the synthetic graph cannot be fully updated and may differ substantially from the original graph. Therefore, Graph Update may not significantly outperform the other methods. This result shows that Graph Update is more suitable for scenarios that need to answer a large number of queries.
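The two percentages quoted for Figure 6a can be reproduced as relative reductions of MAE with respect to each baseline. This is a quick arithmetic check that uses only the values reported above.

```latex
\frac{210.0020 - 99.8500}{210.0020} \approx 0.5245,
\qquad
\frac{135.2078 - 99.8500}{135.2078} \approx 0.2615 .
```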

Performance Evaluation on Diverse Privacy Budgets
In addition, we test the performance of Graph Update with privacy budgets varying from 0.1 to 1 in steps of 0.1, and a query set of 100 queries.
It is observed that as $\epsilon$ increases, the MAE improves, which means that the lower the privacy preservation level, the better the utility. In Fig. 7a, the MAE of Graph Update is 1035.40 when $\epsilon$ = 0.1. Even though this preserves a strict privacy guarantee, the query answers are inaccurate and cannot be used in the real world. When $\epsilon$ = 0.7, the MAE drops to 144.0774, retaining an acceptable utility in the result. The same trend can be observed on the other datasets. For example, when $\epsilon$ = 0.7, the MAE is 141.7209 in Fig. 7b and 153.0225 in Fig. 7c. Both show great improvement compared to $\epsilon$ = 0.1. These results confirm that the utility is enhanced as the privacy budget increases.
We observe that the MAE decreases faster when $\epsilon$ ascends from 0.1 to 0.4 than when $\epsilon$ ascends from 0.4 to 1. This indicates that a larger utility cost is needed to achieve a higher privacy level ($\epsilon$ = 0.1). We also observe that Graph Update and the other methods perform stably when $\epsilon$ ≥ 0.7. This indicates that Graph Update is capable of retaining the utility for data release while satisfying a suitable privacy preservation requirement. The evaluation shows that the Graph Update method retains a higher accuracy than the other methods when answering large sets of queries, and its performance is significantly enhanced as the privacy budget increases. We can select a suitable privacy budget to achieve a better trade-off.

CONCLUSIONS
Nowadays, the privacy problem has attracted wide attention [3,12,20], especially for online social network data, which contain a massive amount of personal information. How to release social network data is a hot topic. However, existing methods cannot provide accurate results when releasing large numbers of queries because of the large amount of noise added to the query results. This paper proposed an iteration method that transfers the query release problem to an iteration-based update process, so as to provide a practical solution for publishing a sequence of queries with high accuracy. We evaluated our method on numerous graphs. Through extensive experiments on real datasets, we have shown that our method is effective and outperforms the Laplace method and the correlated method. In the future, we will consider more complex queries, such as cut queries and triangle queries, which can allow researchers to obtain more information from the dataset while still guaranteeing users' privacy.