Anomaly detection over differential preserved privacy in online social networks

The massive reach of social networks (SNs) has hidden their potential concerns, primarily those related to information privacy. Users increasingly rely on social networks for more than merely interactions and self-representation. However, social networking environments are not free of risks. Users are often threatened by privacy breaches, unauthorized access to personal information, and leakage of sensitive data. In this paper, we propose a privacy-preserving model that sanitizes the collection of user information from a social network utilizing restricted local differential privacy (LDP) to save synthetic copies of collected data. This model further uses reconstructed data to classify user activity and detect abnormal network behavior. Our experimental results demonstrate that the proposed method achieves high data utility on the basis of improved privacy preservation. Moreover, LDP sanitized data are suitable for use in subsequent analyses, such as anomaly detection. Anomaly detection on the proposed method’s reconstructed data achieves a detection accuracy similar to that on the original data.


Introduction
Information sharing platforms, such as online social networks (OSNs), have experienced remarkable growth and recognition in recent years. Notably, OSN platforms have direct access to the public and private data of their users [1]. In some cases, these data are shared with other parties to carry out analytical and social research. Although the release of social network data is considered a severe breach of privacy, OSN platforms reassure their users by anonymizing their data before sharing it. Unfortunately, data mining techniques can be used to infer sensitive information from released data. Therefore, it is necessary to sanitize network data before releasing it [2].
Moreover, an increasing number of attacks target personal user information on OSNs [3]. Thus, there is an urgent need for radical improvements in OSN security and privacy measures. Most previous studies on the preserved privacy of published data deal only with relational data and cannot be applied to social network data [4]. PLOS  Therefore, we have taken the initiative toward preserving privacy in social network data. For each user, we use an activity profile to represent his/her sequence of data. With our model, we aim to investigate the application of LDP to user activity logs. In this model, a data collection server uses a specific partitioning of privacy levels to create Laplacian random noise. However, not all user data are stored in SN repositories; only a predetermined set is selected amongst the salient points representing the data sequence. On the other hand, the data analyzer sub-model reverses these disrupted points to reconstruct the original stored data received from the repositories. Moreover, the data analyzer uses the resulting noisy data to detect anomalous behavior. The data analyzers in the proposed model utilize an extension of the conventional LDP to carry out anomaly detection on reconstructed SN data.
In this paper, our contributions are summarized as follows: 1. We propose a model that protects user privacy in SNs compared with other solutions where sensitive user information is poorly anonymized and can be inferred using data mining. We guarantee a stronger degree of privacy and a lower expected error caused by large data streams. Our privacy preserving model applies Laplace's probability distribution function (PDF) to generate random noise. To guarantee privacy for each user, this noise is calculated using the user's data. In addition, it protects not only user profiles but also user activity.
2. We achieve an improved estimation error of % 0.15 over direct LDP estimation [5]. In the direct application of LDP to data, the estimated error is linearly proportional to the size of the data set. Since SN data are highly scalable, the direct LDP approach results in relatively high expected error [6].
3. We conduct experiments on real-world datasets, showing that the proposed framework guarantees privacy and achieves modest overhead performance.
4. We significantly reduce analysis costs. In our algorithm, only selected data are sent to the detection model, which estimates the data required for classification.
This paper is organized as follows. In Section 2, we review related work in LDP privacy preservation in social networks. In Section 3, we formulate the problem and demonstrate a threat model. Section 4 introduces the scientific framework and preliminaries. The proposed model is explained in Section 5, followed by experimental results and a discussion in Section 6. We conclude and present our potential future work in Section 7.

Threat model
The data repositories in an SN collect and store everything related to its users. Logs may contain the user profiles, activities, and networks of other users and may also store information created without user involvement. Sometimes, the SN shares an anonymized version of this information with other parties for different purposes. Unfortunately, as several recent incidents have demonstrated, releasing even anonymized graphs may lead to the re-identification of individuals within the network [7] and the disclosure of confidential information, which has severe consequences for those involved.
Scientific analysis is known to be vulnerable to the identification of individuals and extraction of private data. However, when a specific breach of privacy was tackled with continuous research and proposed solutions, it was shown that data for analysis might be safely released under differential-privacy guarantees [5,8]. Since privacy preservation in social networks is a relatively new research area, little work has been produced on the application of LDP to user profile data and activity logs. Fig 1 shows the motivational scenario of this research. The SN platform collects a pervasive amount of data and information and immediately stores it in its repositories. The data and information are then shared with the governmental sector under certain agreements. The data may also be shared with analytical parties or even advertising companies to push specifically tailored digital advertisements.

Related work
Recently, the privacy of social network data has gained increasing attention and concern. Although these types of data are necessary for generating revenue and conducting social research, there is no guarantee that the implemented anonymization techniques will protect users' private information. In this section, we cover the state of the art application of local differential privacy (LDP) in social networks and other fields. LDP in social networks has become an alternative to simple graph anonymization and data aggregation. In one study, out-link privacy was implemented to protect information about individuals that is shared by other users. LDP has even been proposed to solve the non-uniformity problem in two-dimensional multimedia data [8]. Zhou et al. [9] claimed that calculating a standard deviation circle radius defines the divergence of a data grid and allows the dynamic allocation of noise. The results of their proposed model had lower relative errors than similar approaches, such as UG) algorithm. Kim et al. applied LDP to the collection of indoor positioning data and used differentiated data for estimating the density of a specified indoor area [10].
Recently, the application of LDP to crowdsourced data has received substantial attention [11][12][13]. In this context, LDP is mainly used to collect and build high dimensional data from distributed users [11]. These data are randomized using multi-variate joint distribution estimation on clusters of the dataset, and then the marginal distribution of these clusters is calculated to approximate a new dataset. When the model was tested, the Complexity Reduction Ratio (CRR) reached 0.512. In [12], an online aggregate monitoring framework was designed over infinite streams with a w-event privacy guarantee. The model was combined with a neural network to predict statistical values and test the utility of released data. The resulting mean absolute error (MAE) [0. [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16] and mean relative error (MRE) [0.2-0.6] indicate that the model improved the utility of real-time data publishing. The authors in [13] showed that LDP achieved an %89 assignment success rate in preserving the location of workers in Spatial Crowdsourcing (SC) through the random generation of 1000 work tasks from a dataset of 6100 users.
Privacy preservation in social networks is considered a relatively new research area. The work in [14][15][16][17][18] covers models generally dedicated to preserving privacy in social networks (SNs). The model in [14] tackles the problems of determining data ownership in an SN and the vulnerability of SN metrics to changes in network structure. The authors claim it is necessary to develop an algorithm (such as minimal spanning tree, degree distribution, etc.) to compute results based on differential privacy (DP). Accordingly, modeling the complete 1-neighborhood structure as background knowledge was proven to protect privacy. The model focused on data that could be inferred from neighboring data and provided accurate answers to aggregate queries [15]. In addition to content, the correlation of a SN was investigated in [16]. The described algorithm labels vertexes in the dataset, uses dense clusters to populate an adjacency matrix, and applies a data partition to the matrix to identify dense regions. Lastly, DP is applied to obtain a noisy adjacency matrix. However, LDP has not always been preferred by researchers to preserve the privacy of sensitive attributes in SNs. Cai et al. and Backstrom et al. [17,18] suggest that, although LDP is generally suitable for inherent data, it is not the best choice for preventing inference attacks.
Furthermore, in [19], experiments showed that no LDP algorithm could fully preserve the persistent homology of high dimensional network features or fulfill all network graph metrics. Some proposed solutions for this issue include using Merging Barrels and Consistency Inference [20], deep neural networks with %73 accuracy [21] and neighborhood randomization [22]. An opposing opinion in [23] emphasizes the LDP's ethical and logistical capacity to protect organic data. The authors demonstrate that LDP can produce a differentially private synthetic dataset to be publicly distributed when combined with other privacy-protecting techniques, such as Ullman's Private Multiplicative Weights.
Local Differential Privacy Obfuscation (LDPO) is a variation of LDP tailored for IoT. LDPO substitutes homomorphic encryption to distill and aggregate data at edge servers with decreased computational overhead. The model is distributed over devices and both edge and cloud servers and provides an accuracy of %90.45 when using 30 features through feature distillation [24].
In summary, as privacy concerns are being raised ever more frequently, several local differential privacy models have been suggested and proven in many application areas for protecting user privacy from untrusted entities.

Preliminaries
Local differential privacy LDP is a highly reliable and mathematically rigorous privacy standard [25,26] that injects randomized noise into collected data or query results to hide sensitive details in a dataset. Thus, regardless of the experience level of an attacker, he/she cannot infer any knowledge from differentially elicited data [8,27].

Definition (2) laplace mechanism
Using scientific data analysis, Dwork et al. [28] proposed the Laplace mechanism, which takes as inputs a database (or stream of data) D, function f, and privacy parameter ε (privacy budget) and returns the true output of f plus some Laplacian noise. This noise is drawn from a Laplace distribution with the probability density function where λ is determined by both GS(f) and the desired privacy level ε. Theorem 1: For any function f: D ! R d , the mechanism A for any dataset D 2 D, satisfies ε-differential privacy, where the noise, Lap DðfÞ ε À � ; is drawn from a Laplace distribution with a mean of zero and scale of Δ(f)/ε.

Dynamic / Adaptive division of privacy level
To divide the privacy level, a temporal scale must be introduced. Consider three successive SPs in ascending order: previous ðt s hÀ 1 ; x s hÀ 1 Þ; current ðt s h ; x s h Þ; and next ðt s hþ1 ; x s hþ1 Þ: Given system parameter α, we calculate the temporal scale μ h as [29,30] and, considering that privacy level ε is divided into (ε 1 , ε 2 , ε 3 ,. . .. . .,ε h ) such that ε = ∑ 1�h�r ε h and μ sum = ∑ 1�h�r μ h , we calculate the individual small privacy level as Average error rate The average error rate has been identified as follows.
Given the noisy salient points S ¼ ð 0 s 1 ; 0 s 2 ; . . . ::; 0 s r Þ from r users [29], we can estimate the average values of the original x n at time t n , which requires averaging all the noisy values of x n , as

Conjugate Bayesian method [31-33]
The Bayesian probability function is given as � where P(c|x) is the posterior probability, and P(c|x) is the likelihood; � P(c) is the class prior probability; � P(x) is the prediction prior probability.
A discrete random variable x is said to have a Poisson distribution with parameter λ > 0, if, for k = 0,1,2,. . ., the probability mass function of X is given by For numerical stability, the Poisson probability mass function should be evaluated as Choose P ij = Poisson (λ ij ) for the unknown rate parameter λ ij >0. Choose a gamma prior for λ ij as this ensures that the posterior predictive distribution for a future period is calculable as a simple ratio of Poisson gamma mass functions.

Applying the Dirichlet process
Given a measurable set S, a base probability distribution H and a positive real number α, the Dirichlet process DP(H,α) is a stochastic process whose sample path is a probability distribution over S. For any measurable finite partition of S : ðB i Þ n i¼1 , If X~DP(H,α), we have ðXðB 1 Þ . . . . . . XðB n ÞÞ � DirðaHðB 1 Þ . . . ::HðB n ÞÞ: ð12Þ The notation X~DP(H,α) indicates that the random variable X is distributed according to the distribution DP(H,α), i.e., according to a Dirichlet process with parameter base distribution H and real number α [31, 33].
The Dirichlet distribution of order K � 2 with parameters α 1 ,. . ...,a k > 0 has a probability density function with respect to the Lebesgue measure on the Euclidean space R k = 1 given by

Proposed approach
In this section, we describe the proposed scheme for sanitizing SN user activity logs using LDP. We then compare the results of applying anomaly detection to the original and reconstructed data. The model functions on two servers: a data collection server and a data-analyzing server. As shown in Fig 2, the data collection server represents each activity log as a data sequence. In each sequence, we determine specific salient points. After selecting these points, we use the user's data in addition to other parameters to create random noise. This noise is then added to the data to distort it from its original value. Finally, the data collection server stores it in data repositories.
In contrast, the data-analyzing server retrieves synthetic data from the repositories, reconstructs the original data streams, and searches the user's activity for abnormal behavior, as demonstrated in Fig 3. The privacy standard model in the first sub-model avoids high error rates when applying LDP to large datasets. This model essentially groups salient points that represent similar actions (increasing, decreasing, constant) together, then applies LDP to selected points in these groups. Thus, a relatively small number of points are processed.
Users spend significant time on SNs performing all kinds of activities, such as sending messages, posting, liking posts, disliking posts, performing audio or video calling, and so on. If we consider a user's activity log per single action, it shows active periods vs. non-active periods. If we consider the sending and receiving of messages as an activity, the plot for a particular user's stream of data increases on days where a greater number of messages are sent and/or received, decreases on days where fewer messages are sent or received and remains constant on idle days.
The model operates in the following order: Step 1: Calculate the salient points (SPs).
To obtain the representative points of a user's data sequence, we take the first order derivative of each value in his/her sequence at a specific timestamp. The user's sequence is represented as values collected at particular time intervals reflecting increasing, decreasing, or constant activity. In Fig 4, the user's calling activity is represented as a curve over a ten day time period.
Calculating the first order derivative allows us to determine increasing (derivative >0) and decreasing (derivative <0) periods. We exclude the constant periods where the user's activity value is the same. As explained in Algorithm 1, we store the points where the derivative of each value is not equal to zero. Step 2: Reduce the set of salient points to only those indicating the beginning of an increasing or decreasing period following the rule described in Table 1.
In situations where the data sequence is very long (thousands of values), or the data's time intervals are very small (seconds), the set of salient points will be considerably large and in need of further reduction. To achieve this, any successive points belonging to the same movement can be removed. Therefore, if three successive points belong to an increasing period, we merge their time interval, retaining only the beginning and end of the interval. We continue this reduction process until no two adjacent time intervals have the same movement. Algorithm 2 depicts these steps in detail.    Remove element at C_list (1, t_min) End while Step 3: Calculate the privacy level in uniform and adaptive distributions. Given the reduced set of salient points, we now partition the privacy level to generate random noise values. This step, uses Algorithm 3. The algorithm divides privacy ε into equal levels with each level ε i satisfying the condition In this step, we calculate a temporal scale for each salient point in the set that controls the privacy level of each point, thus regulating the amount of noise added to it. We use three timestamps representing the current, previous and next SP. We then calculate the temporal sum for all SPs in the sequence of a specific user. Then, we divide the privacy level based on this temporal sum. Algorithm 4 shows the steps of this procedure. Step 4: Add Laplacian noise to the selected salient points. If we consider the list of salient points SP i ¼ ððt s 1 ; x s 1 Þ; ðt s 2 ; x s 2 Þ; . . . ::; ðt s r ; x s r ÞÞ, we can obtain the noisy salient points S 0 P i ¼ ððt x s h is obtained using the probability distribution function (PDF) of the Laplacian distribution The Laplacian generated noise depends on the privacy level. Therefore, using a uniform distribution generates noise different from adaptive noise. The higher the value of the privacy level, the higher the generated noise is. Therefore, it differs from one user to another. Knowing that the Laplace distribution performs a simple translation, it perfectly fits with the definition of differential privacy. The steps are shown in Algorithm 5. Since uniform privacy levels are the same, the same PDF generates the noise, whereas, in adaptive privacy partitioning, different PDFs are used to create the noise for each SP. Each different PDF incorporates a different privacy level due to dynamic partitioning, and the PDFs' are independently calculated for each SP. Step 3: Store the 'noised' SP for analytical or other purposes. The repositories contain a sanitized representation of the SP with no indication of the original data.
Step 4: The requesting sub-model receives the noised SP and attempts to reconstruct the stream of data using linear estimation, as explained in the preliminaries section. The submodel uses the linear equation of a straight line to draw segments between every two points. The general equation is where a is the slope of the line, represented as the ( change in the yÀ value change in the x value ). The y-intercept parameter b is the intersection point between the linear line and the y-axis, which is represented as In our case, the slope is calculated as change in noisy points change in timestamps . Algorithm 6 shows the code steps. Step 5: For the activity dataset, the anomaly detection sub-model extracts the number of communications between pairs of nodes as a Bayesian counting process [31] and represents the number of interactions as weights assigned to communicating nodes in the network. The anomaly detection sub-model then applies Bernoulli, Markov chain and Dirichlet processes to find the nonparametric Bayesian inference.
Step 6: Perform individual-based analysis. In this step, we assume N ij (t) to be the adjacency of node i to node j at time t. The increments determine the out-degree and in-degree of node i, and we represent the number of outgoing nodes as while the incoming communications over time for individual i are represented as We then calculate the total activity by finding the degree sum of the network over time N .. (t).
Step 6: A sample of size n is selected from the population. The random variable of interest, X, is the number of anomalous individuals in the sample, while M is the number of anomalous individuals in the population, and N is the set of communicating individuals Anomaly detection in online social networks

Experimental setup
In this section, we describe the simulation process, including the dataset, parameters, and evaluation metrics. We explain the setup and discuss the results in the second sub-section. We conducted our experiments by applying the LDP privacy preservation distribution to a set of user activity sequences. The data sequence was adapted from the VAST Challenge 2008 [32]. The LDP model was applied to simulated cell phone data from the Mini Challenge on social network analysis. Data were collected from 400 individuals located at 30 locations in the network over a period of ten days. Fig 6 shows a simple visualization of the data after projecting the high dimensional communication log into low-dimensional points.
Each data stream represents a user's calling activity over ten days. Each day represents a timestamp. Fig 7 illustrates the data sequences (streams) of ten users.

Results and discussion
We applied the steps explained in the proposed approach in Section 5. We first determined the salient points in each user's data stream. The user's activity in Fig 8 does not contain a constant period (having the same number of calls), so all points are selected. However, the user in Fig 9 makes the same number of calls on the sixth and seventh days (dx i = 0). Since the first order derivative for timestamp 7 is zero, the salient point at this timestamp is removed. The same applies to timestamp 8; however, since it represents the beginning of a decreasing period, it is retained. The colored lines parallel to the y-axis represent the timestamps, and the intersection point between each line and the curve is a salient point.
The next step is to reduce the salient points to represent points at the beginning of an increasing or decreasing period. After reducing the set of salient points, we calculate the individual privacy level for each point using uniform or adaptive division and assign Epsilon values of 2 and 5. The Epsilon value is used to generate the random noise applied to the reduced set. We use mean μ = 0.8 scale b = 0.2 for the PDF function, and noise is added to each point in different data-set sizes: 50, 100, 200 and 400 users. Anomaly detection in online social networks Having created and stored the synthetic data on the data collection server, we next demonstrate the reconstruction process using linear estimation. Figs 12 and 13 show original and reconstructed data streams. In Fig 12, the red curve represents original user activity, and the blue curve represents the activity generated for uniform privacy levels. In Fig 13, the red curve represents the original data stream, and the yellow curve is the linear reconstruction for adaptive distributed privacy levels. Note that the reconstruction of data preserves the structure of the activity pattern, this is very important for anomaly detection.
We calculate the error rate for the combination of uniform-privacy division with linear estimation and adaptive-privacy division regenerated using linear estimation. Fig 14 plots the average error rate for the two different approaches on various data sizes. The error rate Anomaly detection in online social networks equation is where Avg(x d ) is the average of the actual values in the data stream for timestamp d, and Avgð 0 x d Þ is the average of the reconstructed values of the data stream for the same timestamp. We next apply the Bayesian anomaly detection technique to the reconstructed stream of users. In this experiment, we detect outliers with respect to the duration of calls between individuals. The duration variables are treated in the same manner as the calling activity described earlier. During the first analysis phase, the model checks all 30 locations for anomalous users to apply the multinomial model with the sequential Dirichlet process model with an uninformative negative binomial base measure [7].
We apply the Bernoulli process and Markov chain to all network users, with mean values of [0.63, 0.48] using a threshold of 0.05, to obtain a better understanding of the messaging patterns and their variability. This phase extracts the predictive P-values of the users from their communication patterns, as shown in Fig 15. The detection phase of the reconstructed data is the same as that of the original data. The same users have predictive p-values below the threshold and are flagged by the detection sub-model, which implies that the application of LDP to preserve data privacy succeeds in sanitizing the data. In addition, the data structure is maintained for further use by the anomaly detection sub-model.
In Fig 16, the abnormal activities peak on the eighth day, the same day the original activities peak, suggesting that the reconstructed data do not lower the performance of subsequent analyses, which can incorporate all the data into real-time anomaly detection.
As seen in the simulation results in Fig 16, the proposed model improves the estimation error while being applied to large-scale data. The model conducts anomaly detection on a subset of the data without disclosing the actual values, which guarantees privacy and reduces the cost of further analyses.

Conclusion and future work
In this paper, we presented a model for privacy preservation in social networks. The model sanitizes the collected data and sensitive information of SN users using LDP and then attempts to reconstruct the original sequences and perform analyses using sets of selected salient points. We conserve the social structure of each user's communication pattern. The error rate of the estimated data compared to the original data is acceptable for large datasets with small timeintervals. Our simulation results show that conducting anomaly detection on synthetic data results in determining the same anomalous users and activities as those in the original data. In the future, we plan to extend the proposed privacy model to include estimating noisy data with non-linear approximation.