A machine learning based approach for user privacy preservation in social networks

With the development of Internet technology, service providers can offer users personalized services that enrich the user experience. However, this often requires a large amount of users' private data. Meanwhile, the protection of private data and the evaluation of the risk of leaked datasets have become matters of great concern to many people. To resolve these issues, in this paper, we develop a machine learning-based approach in online social networks (OSNs) to efficiently correlate the leaked datasets and accurately learn millions of users' confidential information. Moreover, a trust evaluation model is developed in OSNs to identify malicious service providers and secure users' social activities via direct trust computing and indirect trust computing. Extensive experiments are conducted using real-world leaked datasets, and the results demonstrate the efficiency and effectiveness of the proposed approach in terms of user privacy protection and accuracy of privacy leakage evaluation.

to offer personalized and contextual services for users and promised to reshape our daily lives [1-3]. In current OSNs (e.g., Facebook, QQ, and WeChat), users are recognized by identities (e.g., real name, username, nickname, email address, and cellphone number) when engaging with these Internet and mobile social services [4-8]. However, many online services such as dating and shopping websites, and offline services such as grocery delivery, require users to provide some of their identification information. As reported, the confidential information of hundreds of millions of users, such as username, email, password, and network activities (e.g., which Tencent QQ groups they have joined), has been leaked from several Chinese websites including Tencent QQ over the past few years [9,10].
Different from existing security mechanisms developed for traditional Internet services, the privacy leakage issue of users in large-scale OSNs exhibits special features. On one hand, apart from the direct privacy leakage of a user due to his/her improper operations or network intrusions, private information can also be indirectly or unconsciously leaked by his/her friends or other third parties [11]. For example, the photo wall and public chat history of a user's friends may reveal the user's gender, age, and name. On the other hand, given the above identity information of one user, other confidential information (such as the gender and password) of this user can be inferred by misbehaving companies and hackers via data mining and in-depth data analytics [12]. Hence, an ongoing challenge is to protect users' privacy in OSNs while evaluating the potential risks of privacy violation after the identification information is leaked.
Existing works on privacy leakage of Internet users, however, cannot work well in the large-scale social Internet context. Firstly, existing practices in industry mainly rely on nicknames (or aliases) to protect the anonymity of Internet users, and the real names in OSNs are still vulnerable to privacy leakage [13-17]. For example, one user's information such as age and political affiliation can be accurately determined by a third party via aggregating information provided by the user's online friends, even when the user does not intend to make it available to the public. Secondly, the correlations of sensitive user attributes, which can be learned from the leaked datasets to build user profiles [18], are not well studied for preventing privacy leakage in large-scale OSNs. Thirdly, existing studies mainly focus on general Internet services, while few of them consider social features in the analysis of privacy leakage of public social Internet applications. Moreover, despite recent efforts in studying privacy leakage in OSNs, little attention has been given to evaluating the risk of leaked datasets. Therefore, it is still an open and vital issue to protect users' private information from misbehaving companies and hackers in an efficient fashion in large-scale OSNs.
In this paper, to resolve the above issues, we first develop a machine learning based approach for user privacy preservation and efficient evaluation of privacy leakage from leaked datasets via feature extraction and user attribute correlation. With the obtained users' attributes (e.g., real name, OSN identity (ID), age, gender, birth date, email address, and social relationship), a support vector machine (SVM) based prediction algorithm is devised to evaluate the potential privacy leakage in the presence of distinct Internet services offered by third parties. Moreover, we build a distributed trust evaluation model from direct trust and indirect trust calculation to filter and detect malicious OSN service providers with consideration of users' social features. Finally, we conduct extensive experiments to demonstrate the efficiency of the proposed approach in terms of detection and classification accuracy. The results also show that the learned user profiles help attackers to successfully launch a variety of attacks such as spoofing attacks and password guessing attacks. The main contributions of this paper are summarized as follows:
- We investigate user privacy preservation in OSNs from a machine learning perspective. To assess users' privacy violation from the public leaked datasets, we develop a user profiling system (UPS) to accurately correlate and learn user attributes. We collected 16 leaked datasets and studied the privacy issue of 611,140,530 users.
- We query search engines with the learned user profiles and obtain these users' other information which is publicly available on Internet applications such as Renren and Qzone. We develop an SVM-based prediction algorithm to evaluate the potential for malicious third parties to obtain a large number of users' attributes such as real name, gender, age, and social connections from the leaked datasets and online Internet services. Besides, by leveraging users' social features, a trust model is established to detect malicious social service providers.
- We conduct extensive experiments using real-world leaked datasets to demonstrate that the proposed approach can attain satisfactory detection accuracy and classification accuracy. In addition, the results also show that the learned user profiles can help attackers to successfully launch a variety of attacks such as spoofing attacks and password guessing attacks.
The rest of the paper is organized as follows. Section 2 summarizes related work. In Section 3, we describe the system overview. In Section 4, we introduce our methods to learn users' attributes from leaked datasets and online Internet services. In Section 5, we evaluate the performance of the proposed approach. Finally, we give discussions of the proposed approach in Section 6, and present concluding remarks in Section 7.

Privacy preservation in OSNs
Previous works on OSNs such as Facebook and Twitter focused on determining users' private information such as age, gender, sexual orientation, and political affiliation [6,13,14,16,17,19]. [20] matched user accounts across social networks based on username and display name to help build better user profiles. [21] observed and analysed the phenomenon that different generations prefer different names, and used it to infer users' age ranges from their names. [22] tried to classify users' age ranges with the 1-grams constructed from their tweets. [13,23,24] used the ages of friends to determine the age of a given user. The basic idea is to determine one user's private attributes by aggregating the information from the online friends of the user, although the user does not intend to make it available to the public. To deal with OSNs (i.e., microblogging) where age information is scarce, [25] proposed a framework that explores public content (i.e., tweets, microblogs) and interaction information to uncover the hidden ages of users. In addition to inferring users' confidential information, social network information has also been used for applications such as friend/interest recommendation [1,26-30], sentiment analysis [31-34], spammer detection [36,37], and user activity classification [38]. Distinguished from previous research on OSNs, our work utilizes SVM-based approaches to automatically identify the real names of users in anonymous OSNs, where users' private and public information is gathered via Social Engineering Engines (SEEs). To the best of our knowledge, we are the first to study the problem of automatically identifying users' Chinese real names in anonymous OSNs.

Security in OSNs
For publicly available passwords from Chinese and international websites in recent years, previous works focused on password security topics such as password guessing [5,8,35,39-43], password strength [2,4,7,44-48], and password creation policies [3,49-51]. [9,52,53] conducted measurement studies on the large-scale leaked Chinese password datasets. To be specific, Li et al. [9] studied the differences between passwords from Chinese and English speaking users. Wang et al. [53] made a substantial step forward in understanding the underlying distributions of passwords, and Ji et al. [52] conducted a large-scale measurement study on the crackability, correlation, and security of leaked passwords. Li et al. [54] studied the usage of personal information in passwords and its security implications, demonstrating that passwords may contain users' private information. Focusing on the passwords of Chinese web users, [42] and [55] showed the popular structures of Chinese passwords, in which Pinyin plays an important role, and also improved the password guessing efficiency. For the leaked dataset QQ GRP, You et al. [10] studied topology statistics (e.g., degree distribution) of the bipartite graph consisting of QQ accounts and QQ groups. The previous works on password security studied users' habits in choosing passwords and demonstrated the correlations between users' private information and passwords. However, they paid little attention to how to obtain the private information. We observe that the QQ email is the main form of users' emails in leaked datasets. Therefore, using the QQ emails (86.7% of QQ emails are directly constructed from QQ IDs), we can connect a user's private information leaked in different datasets (e.g., joined groups in the QQ GRP dataset and username in the Renren dataset). We also propose several methods to extract users' private information by correlating the leaked datasets, and show the serious potential risks to users whose private information is leaked in these datasets.
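The correlation step described above, linking records from different leaked datasets through QQ emails that embed the QQ ID, can be illustrated with a minimal sketch. The regular expression (a 5 to 11 digit ID followed by `@qq.com`), the function names, and the dictionary-based record layout are assumptions made for illustration and are not taken from the paper.

```python
import re

# Assumed pattern: a QQ email built directly from a numeric QQ ID.
# The 5-11 digit range is an illustrative assumption, not from the paper.
QQ_EMAIL = re.compile(r"^([1-9][0-9]{4,10})@qq\.com$", re.IGNORECASE)


def qq_id_from_email(email):
    """Return the QQ ID embedded in a QQ email address, or None."""
    match = QQ_EMAIL.match(email.strip())
    return match.group(1) if match else None


def correlate(email_records, qq_records):
    """Join records keyed by email with records keyed by QQ ID.

    email_records: dict mapping email -> dict of attributes
    qq_records:    dict mapping QQ ID -> dict of attributes
    Returns a dict mapping QQ ID -> merged attributes for matched users.
    """
    joined = {}
    for email, attrs in email_records.items():
        qq_id = qq_id_from_email(email)
        if qq_id is not None and qq_id in qq_records:
            joined[qq_id] = {**attrs, **qq_records[qq_id]}
    return joined
```

Emails that are not built from a QQ ID are simply skipped in this sketch, so only the directly constructed QQ emails contribute to the join.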

System model
In this section, we first give an overview of the system. Then, the datasets used are described. After that, we present several attacks on Internet users based on the learned personal information. Figure 1 shows the overview of our user profiling system (UPS). UPS learns users' attributes or profiling information from both leaked datasets and online public social information. In UPS, we first train an SVM-based classifier to learn users' profiles from social network datasets (e.g., identify the real name of a user from the names the user used in different groups). Besides, based on the public social information collected from online social platforms (e.g., Facebook and Tencent QQ), more private and sensitive user profiling information can be inferred in our UPS. For example, by using the information (i.e., group name and group introduction) of the groups that the user joined, the education background and interests of the user can be inferred. Furthermore, we also leverage the homophily property of social networks to predict the real age and gender of a user based on his/her friends in the same groups he/she joined. In this paper, we mainly learn family names, birth dates, cellphone numbers, citizen IDs' last 12 digits, email addresses, and user names from the email and password datasets.

Dataset descriptions
In recent years, large-scale leaks of user data in China have involved 406.2 million accounts in total, and most of the leaked datasets are publicly available on the Internet. Specifically, the information of hundreds of millions of Chinese Internet users, such as real name, email, password, gender, and age, could be easily obtained or inferred by third parties as well as attackers. In this work, two large-scale social network datasets (i.e., QQ GRP and QQ PSW), which were collected from Tencent QQ, are considered in the privacy leakage analysis. Both QQ GRP and QQ PSW are leaked datasets of QQ users on the Internet. The detailed descriptions of QQ GRP and QQ PSW are given below.
- The QQ GRP dataset includes 85.3 million QQ groups' information such as group ID, group name, group brief introduction, and their group members' names displayed in the QQ groups, genders, and ages. There are 318 million distinct QQ accounts in QQ GRP.
- The QQ PSW dataset includes 300 million QQ accounts' passwords and the corresponding QQ IDs and IPs. ...

Fig. 1 The overview of the user profiling system (UPS). Panel labels: leaked information, public information, user profiling, misbehaving Internet companies; attack types: password guessing, spoofing attacks, Internet fraud, phone fraud, ...

The proposed approach
In this section, we first analyse the potential attacks arising from password analysis and guessing on the leaked OSN datasets. Then, we develop novel methods to extract and learn users' attributes, in terms of real name and age, from the leaked datasets. The user attributes learned from the leaked datasets help attackers to collect these users' other information from social Internet services. Furthermore, a trust based secure online service provider (SP) selection mechanism is proposed via direct trust computing and indirect trust computing.

Password analysis and guessing
We observe that a large number of Chinese Internet users' passwords contain the users' personal information such as family name and birth date. Based on this knowledge, attackers can use methods such as [5] to leverage users' personal information learned from Internet online services and leaked datasets in Sections 4 and 5 to accelerate the speed of guessing the users' passwords. Moreover, in what follows we observe that users' personal information can also help attackers to identify and select users with weak passwords.
Let $\mathcal{X}$ be the universal set of all possible passwords. Then, any leaked dataset of passwords can be viewed as a subset of $\mathcal{X}$, denoted by $S$, with a specific probability distribution $P = \{p_i \mid i \in S\}$, where $p_i$ is the probability of a password $i \in \mathcal{X}$ appearing in $S$. Let $l = |S|$ and, without loss of generality, assume $p_1 \ge p_2 \ge \dots \ge p_l$ within $P$. We use the following metrics introduced in [2] to measure the password strength of users with different attributes such as gender and age. Here, the min-entropy is defined as
$$H_\infty(P) = -\log_2 p_1,$$
which denotes the worst-case security metric against an attacker.
Next, the guesswork $\hat{G}(P)$ measures the expected number of guesses needed to find the password of an account in the optimal guessing order (decreasing order of password probability), and $G(P)$ is the bit/entropy form of $\hat{G}(P)$. Formally, we have
$$\hat{G}(P) = \sum_{i=1}^{l} i \cdot p_i, \qquad G(P) = \log_2\big(2\hat{G}(P) - 1\big).$$
Besides, the β-success-rate $\lambda_\beta(P) = \sum_{i=1}^{\beta} p_i$ measures the expected success probability of finding the password of an account given β guesses.
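As a minimal sketch, the three strength metrics can be computed from an empirical list of passwords as follows. The function names are hypothetical. The bit form of the guesswork is taken as log2(2Ĝ − 1), the usual convention for converting guesswork to bits, which is an assumption about the exact form used in [2].

```python
import math
from collections import Counter


def password_distribution(passwords):
    """Empirical probabilities p_1 >= p_2 >= ... >= p_l of a password list."""
    counts = Counter(passwords)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def min_entropy(p):
    """Worst-case metric: -log2 of the most likely password's probability."""
    return -math.log2(p[0])


def guesswork(p):
    """Expected number of guesses in decreasing-probability order."""
    return sum(i * p_i for i, p_i in enumerate(p, start=1))


def guesswork_bits(p):
    """Bit form of the guesswork (assumed convention: log2(2*G - 1))."""
    return math.log2(2 * guesswork(p) - 1)


def beta_success_rate(p, beta):
    """Probability of success within the first beta guesses."""
    return sum(p[:beta])
```

With this convention, a uniform distribution over N passwords yields exactly log2(N) bits of guesswork, which makes the bit form directly comparable with the min-entropy.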

Real name identification
In this subsection, we focus on studying private information of 313 million QQ users in QQ GRP. Let $U_{QQ}$ denote the set of users, and let the QQ ID be the unique identifier of a QQ user. According to our analysis of special features of Chinese names, among the 341,826 QQ users in $U_{QQ}$, we observe that 199,664 (199,664/341,826 = 58.4%) users use their real names in at least one QQ group. Among the 750 million QQ display names in QQ GRP, 577 million names are classified as nicknames based on filtering rules defined on the length and the first and second characters of a name, which leaves about 174 million display names that these rules cannot differentiate. To tackle this problem, we develop a novel method RNI (Real Name Identification) to classify these names using both their content and OSN features. Formally, we let $u$ denote a QQ user in a QQ group and $dn$ denote the name it displays in the QQ group. For each QQ ID and name pair $(u, dn)$ of the 174 million display names in QQ groups, its feature vector $x = [e, s]^T$ is described as follows.
Firstly, let $e \in \mathbb{R}^h$ be the content features, a bag-of-words representation, where $h$ is the number of Chinese characters appearing in names. Secondly, we consider the OSN features $s \in \mathbb{R}^2$, which has two elements $s_1$ and $s_2$, measuring the tendency of $u$ to use $dn$ in nickname QQ groups and real name QQ groups, respectively. For each group $G_j$, we label some obvious nicknames by hand-crafted rules, i.e., by checking the length and the first and second characters of the display names. Let $t_j$ be the number of these labeled nicknames in $G_j$, and denote by $N(u, dn)$ the set of QQ groups that user $u$ joins with name $dn$. $s_1$ and $s_2$ of $(u, dn)$ are defined from the counts $t_j$ and the group sizes $|G_j|$ of the groups $G_j \in N(u, dn)$, where $|G_j|$ is the number of users belonging to group $G_j$. The core idea behind the usage of OSN features in our method RNI for identifying real names is that QQ users usually tend to use their real names in friendship-driven QQ groups such as classmate and colleague groups, while preferring to use nicknames in interest-driven groups in order to protect their privacy. For the QQ users in $U_{QQ}$, we have their real names, which are found in Hotel Guest, and all nicknames used by these QQ users in QQ GRP. Therefore, we construct a data set $LU_{QQ}$ with feature vectors $\{x_i, i = 1, \dots, m\}$, consisting of 199,664 real names (with label $y_i = 1$) and 225,419 nicknames (with label $y_i = -1$). In what follows, we build a supervised learning model with $LU_{QQ}$ as the training data, and then use the model to predict the labels of the remaining 174 million names displayed in QQ groups, which hand-crafted rules cannot differentiate.
Thirdly, for efficient real name identification, we build the RNI model based on SVM, due to its capacity of learning from a high-dimensional feature space:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, m, \tag{8}$$
where $\xi_i$ are slack variables and $C$ is a penalty parameter. After obtaining $w^*$ and $b^*$ by solving (8), for each QQ ID and name pair $(u, dn)$, we define its real name score as
$$r(u, dn) = w^{*T} x + b^*,$$
where $x$ is the feature vector of $(u, dn)$ defined previously.
In the testing stage, let $D_u$ denote the set of names that QQ user $u$ used in different QQ groups. The most likely true name in $D_u$ is
$$\pi_u = \arg\max_{dn \in D_u} r(u, dn).$$
Since $u$ has at most one real name, there are two cases when identifying real names and nicknames in $D_u$. Case 1: if the real name score of $\pi_u$ is larger than 0, $\pi_u$ is the real name and all names in $D_u \setminus \{\pi_u\}$ are nicknames. Case 2: otherwise, all names in $D_u$ are nicknames.
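The testing-stage rule above can be sketched in a few lines once a linear model has been trained. In the sketch below, the weight vector `w`, the bias `b`, the character vocabulary, and the OSN feature values are hypothetical placeholders; the score is taken as the linear SVM decision value w·x + b, which is an assumption about the exact form of the real name score.

```python
def content_features(name, vocabulary):
    """Bag-of-characters vector: count of each vocabulary character in name."""
    return [float(name.count(ch)) for ch in vocabulary]


def real_name_score(w, b, x):
    """Linear decision value w.x + b (assumed form of the real name score)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def identify_real_name(w, b, candidates):
    """Pick the real name among a user's display names, if any.

    candidates: dict mapping display name -> feature vector [e..., s1, s2].
    Returns the top-scoring name if its score is positive (Case 1),
    otherwise None, meaning all names are treated as nicknames (Case 2).
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda dn: real_name_score(w, b, candidates[dn]))
    return best if real_name_score(w, b, candidates[best]) > 0 else None
```

Because each user is assumed to have at most one real name, only the highest-scoring display name is ever reported, which trades some recall for fewer false real names per user.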

Age prediction
Generally, most QQ users provide their age information when setting up their profile page (either by providing a birth date or by giving a calculated age). However, not all ages are true, as some users give wrong ages. In this subsection, our target is to predict the real age of a user based on his/her friends in the same groups he/she joined. Intuitively, QQ groups (especially classmate groups) usually consist of users with similar ages due to the homophily property of social networks. For a group $G_j$, we first filter out users with intentionally wrong ages, i.e., very small or very large values. Formally, we define the set of users in $G_j$ with abnormal ages as $G^{ab}_j = \{u : u \in G_j \wedge (a_u \le 4 \vee a_u \ge 100)\}$. Then, we calculate the mean and standard deviation of user ages in $G_j - G^{ab}_j$ (excluding $G^{ab}_j$) as
$$\mu_j = \frac{1}{|G_j - G^{ab}_j|}\sum_{u \in G_j - G^{ab}_j} a_u, \qquad \sigma_j = \sqrt{\frac{1}{|G_j - G^{ab}_j|}\sum_{u \in G_j - G^{ab}_j} (a_u - \mu_j)^2}.$$
Users whose ages are more than $3\sigma_j$ above or below $\mu_j$ are added to $G^{ab}_j$ as
$$G^{ab}_j \leftarrow G^{ab}_j \cup \{u : u \in G_j - G^{ab}_j \wedge |a_u - \mu_j| > 3\sigma_j\}.$$
The expected average age for users in $G_j$ is
$$\mu_{j,\setminus u} = \frac{1}{n_{j,\setminus u}} \sum_{v \in G_j - G^{ab}_j,\ v \ne u} a_v,$$
where $n_{j,\setminus u} = |G_j - G^{ab}_j| - 1$ when $u \in G_j - G^{ab}_j$, and $n_{j,\setminus u} = |G_j - G^{ab}_j|$ when $u \in G^{ab}_j$. According to the observation that users in the same group have similar ages, we estimate the real age of a user $u$ as
$$\hat{a}_u = \sum_{j:\, u \in G_j} \rho_{u,j}\, \mu_{j,\setminus u},$$
where $\rho_{u,j}$ is a weight measuring how much a group $G_j$, to which user $u$ belongs, contributes to the estimation of $\hat{a}_u$. Intuitively, a group including more users with normal ages should contribute more than a small group. Also, a group with a smaller variance in ages should contribute more than a group with a larger variance. Therefore, $\rho_{u,j}$ is defined in terms of $\sigma_{j,\setminus u}$, where $\sigma_{j,\setminus u}$ denotes the standard error of the ages of the users in group $G_j - G^{ab}_j - \{u\}$. Classmate groups are more likely to consist of users with similar ages. Therefore, we introduce another estimation which follows the same steps, but using only classmate groups, i.e.,
$$\hat{a}^{class}_u = \sum_{j:\, u \in G_j,\ G_j \text{ is a classmate group}} \rho_{u,j}\, \mu_{j,\setminus u}.$$
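The age estimation procedure can be sketched as follows. The abnormal-age filter and the 3σ rule follow the description above. The group weight is an assumption: the sketch uses normalized inverse squared standard error, which favors larger groups and groups with smaller age variance, but the exact weight used by the authors is not recoverable from the text. The function names, the `eps` smoothing term, and the minimum of two remaining members are likewise illustrative choices.

```python
import math


def normal_members(group):
    """Drop abnormal ages (<= 4 or >= 100), then 3-sigma outliers.

    group: dict mapping member id -> displayed age.
    """
    kept = {v: a for v, a in group.items() if 4 < a < 100}
    if not kept:
        return {}
    mu = sum(kept.values()) / len(kept)
    sigma = math.sqrt(sum((a - mu) ** 2 for a in kept.values()) / len(kept))
    return {v: a for v, a in kept.items() if abs(a - mu) <= 3 * sigma}


def estimate_age(user, groups, eps=1e-6):
    """Weighted average of leave-one-out group mean ages for `user`.

    Weights are inverse squared standard errors (an assumed choice),
    normalized to sum to one. Returns None if no group is usable.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for group in groups:
        if user not in group:
            continue
        others = [a for v, a in normal_members(group).items() if v != user]
        if len(others) < 2:
            continue  # too few members to measure the spread
        mu = sum(others) / len(others)
        var = sum((a - mu) ** 2 for a in others) / len(others)
        weight = 1.0 / (var / len(others) + eps)
        weighted_sum += weight * mu
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None
```

The estimate for classmate groups only would reuse `estimate_age` on the subset of groups labeled as classmate groups.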

Trustworthy online service provider selection mechanism
In general, for a targeted Internet service (e.g., online chatting and online gaming), there exists a set of service providers (SPs) that can offer the same service to users with different quality of service (QoS). Let $\mathcal{M} = \{1, \dots, m, \dots, M\}$ denote the set of online service providers for the desired Internet service. A trust approach is employed to assess the trustworthiness of SPs in providing online services. Here, the trust that each user $u$ puts in service provider $m$ is constructed from two aspects, namely, the direct trust and the indirect trust [6]. Let $T^{direct}_{u,m}$ and $T^{indirect}_{u,m}$ denote the direct trust and indirect trust of user $u$ in SP $m$, respectively. Specifically, for each user, the direct trust originates from direct historical experience in interacting with SP $m$. Meanwhile, the indirect trust, which is also known as the social reputation, evolves from the aggregated experience of other users, such as the recommendations given by his/her friends. With the assistance of indirect trust evaluation, each user $u$ can acquire more profiling information about SP $m$, especially when their direct interactions are infrequent.

1) Direct Trust Evaluation.
The direct trust value of user $u$ in SP $m$ is associated with the user's satisfaction with the online service. Specifically, the service satisfaction degree is quantified by the rating value $ra_{u,m,h} \in [0, 1]$ that the user gives to the $h$-th service offered by SP $m$. Here, $ra_{u,m,h} = 1$ indicates an absolutely positive experience, while $ra_{u,m,h} = 0$ represents an absolutely negative one. Furthermore, let $N_{u,m}$ be the total number of services that user $u$ has obtained from SP $m$. Then, for user $u$, his/her direct trust value in SP $m$ can be calculated by the accumulative weighted sum of all historical ratings [15], i.e.,
$$T^{direct}_{u,m} = \frac{\sum_{h=1}^{N_{u,m}} \theta(\tau_h)\, ra_{u,m,h}}{\sum_{h=1}^{N_{u,m}} \theta(\tau_h)},$$
where $\theta(\tau_h)$ is the exponential time decay [6,58]. By utilizing exponential time fading, the weight of past experience is gradually reduced while recent experience is assigned a relatively higher weight. The explicit form of the exponential decay function is
$$\theta(\tau_h) = e^{-\alpha(\tau - \tau_h)},$$
where $\alpha$ denotes the exponential decay rate, $\tau_h$ is the $h$-th service time, and $\tau$ is the current time. Obviously, we have $T^{direct}_{u,m} \in [0, 1]$.
2) Indirect Trust Evaluation.
Users typically have different social relations such as friends, workmates, and strangers. Let $\mathcal{L} = \{1, \dots, l, \dots, L\}$ be the set of social relations among users. Besides, users with different social relations can have different trustworthiness degrees or social intimacy degrees [19]. We use $\phi_l > 0$ to represent the trustworthiness degree [6] or social intimacy degree [56] between two users with the $l$-th social relation. $\phi_l$ is a trust factor indicating to what extent intimacy exists between two users, which can be obtained similarly to [56,57]. A binary variable $q_{u,u',l}$ is utilized to denote whether two users have the $l$-th relation. Here, $q_{u,u',l} = 1$ if users $u$ and $u'$ have relation $l$; otherwise, $q_{u,u',l} = 0$. In general, adversarial users may give fake recommendations to mislead the trust evaluation process. Hence, the credibility of each rating needs to be assessed to capture the recommendation reliability.
For user $u$, the credibility value $\Upsilon_{u,u'}$ of the recommendation given by another user $u'$ is associated with their social relation and their rating similarity $\mathrm{sim}_{u,u'}$. Obviously, we have $\Upsilon_{u,u'} \in [0, 1]$. Generally, the higher the similarity between two users' rating profiles, the higher the credibility of the recommendation. Here, we employ the Pearson correlation coefficient (PCC) to calculate the similarity value between two users $u$ and $u'$ by comparing their ratings on the SPs from which both of them have received services, i.e.,
$$\mathrm{sim}_{u,u'} = \frac{\sum_{m \in \mathcal{M}_{u,u'}} (ra_{u,m} - \overline{ra}_u)(ra_{u',m} - \overline{ra}_{u'})}{\sqrt{\sum_{m \in \mathcal{M}_{u,u'}} (ra_{u,m} - \overline{ra}_u)^2}\ \sqrt{\sum_{m \in \mathcal{M}_{u,u'}} (ra_{u',m} - \overline{ra}_{u'})^2}}.$$
Here, $\mathcal{M}_{u,u'}$ represents the set of SPs which have offered online services to both users $u$ and $u'$. $ra_{u,m}$ and $ra_{u',m}$ denote the rating profiles of users $u$ and $u'$ on SP $m$, respectively. $\overline{ra}_u$ and $\overline{ra}_{u'}$ are the average ratings that user $u$ and user $u'$ have ever given, respectively. Therefore, for user $u$, his/her indirect trust value $T^{indirect}_{u,m}$ in SP $m$ is obtained by aggregating the recommendations of other users on SP $m$, taking the credibility $\Upsilon_{u,u'}$ of each recommendation into account. As a consequence, the global trust value of user $u$ in SP $m$ can be attained by combining the direct and indirect trust values. Here, we have
$$T_{u,m} = \omega\, T^{direct}_{u,m} + (1 - \omega)\, T^{indirect}_{u,m},$$
where $\omega > 0$ is a dynamic weight factor, which is calculated by $\omega = \frac{N_{u,m}}{1 + N_{u,m}}$. Note that as the number of interactions $N_{u,m}$ increases, $\omega$ increases, and so does the effect of direct trust in the global trust calculation. This conforms to the fact that direct trust is more reliable when there are enough interactions. Obviously, we have $T_{u,m} \in [0, 1]$. After trust assessment for each SP $m$, each user selects the SP with the highest global trust value to receive the Internet service. In the best case, the computational complexity of our proposed trust mechanism is $O(M)$, while in the worst case, the computational complexity is $O(U \cdot M)$. Here, $U$ and $M$ are the numbers of social users and online service providers, respectively.
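The building blocks of the trust mechanism can be sketched as below. The direct trust is computed as a normalized, exponentially decayed average of past ratings, which is an assumed form chosen so that the result stays in [0, 1]. The similarity is the Pearson correlation over commonly rated SPs, and the global trust uses the weight N/(1+N) described above. The credibility-weighted aggregation for indirect trust is not reproduced, since its exact form is not given in the text; all function names are hypothetical.

```python
import math


def direct_trust(ratings, now, alpha):
    """Time-decayed average rating (assumed normalized form).

    ratings: list of (service_time, rating) pairs with rating in [0, 1].
    """
    weights = [math.exp(-alpha * (now - t)) for t, _ in ratings]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * r for w, (_, r) in zip(weights, ratings)) / total


def pearson_similarity(ratings_u, ratings_v):
    """PCC over SPs rated by both users; each mean is over all own ratings.

    ratings_u, ratings_v: dicts mapping SP id -> rating.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[m] - mean_u) * (ratings_v[m] - mean_v) for m in common)
    den_u = math.sqrt(sum((ratings_u[m] - mean_u) ** 2 for m in common))
    den_v = math.sqrt(sum((ratings_v[m] - mean_v) ** 2 for m in common))
    return num / (den_u * den_v) if den_u and den_v else 0.0


def global_trust(direct, indirect, n_interactions):
    """Blend direct and indirect trust with weight N / (1 + N)."""
    omega = n_interactions / (1 + n_interactions)
    return omega * direct + (1 - omega) * indirect
```

With no interaction history, the weight is zero and the global trust falls back entirely on the indirect trust, which matches the motivation that recommendations matter most when direct interactions are infrequent.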

Ground truth dataset
The Hotel Guest dataset is regarded as a reliable information source, which serves as ground truth for our user profiling methods. Specifically, the Hotel Guest dataset includes a large number of real names of hotel guests, as well as other private information. Given the truthfulness of the Hotel Guest dataset, we use it to construct training data and evaluate the prediction accuracy on test data. To identify attack targets, we use the email address as the unique identifier of a person, who may leave the same email address when registering at different websites [42,59].

Result analysis
We use dataset $LU_{QQ}$ to train the model, and then use the model to predict the labels of the 174 million names displayed in QQ groups that hand-crafted rules cannot differentiate. We identify the real names of 128 million QQ IDs. To evaluate the correctness of these results, we evaluate our RNI model on data derived from $U_{QQ}$. We randomly sample 10% of QQ users in $U_{QQ}$, and use their real names and nicknames as training data. The names of the remaining QQ users in $U_{QQ}$ are used as test data. We repeat this process 10 times and report the average performance in terms of precision, recall, and F1, which are defined as
$$\text{precision} = \frac{\#\text{detected real names}}{\#\text{reported real names}}, \qquad \text{recall} = \frac{\#\text{detected real names}}{\#\text{real names}}, \qquad F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where the reported real names refer to the real names given by our method RNI, which may include false results (i.e., nicknames wrongly identified as real names). The results are shown in Table 1, which includes results when using only OSN features, only content features, or both. The precisions and recalls reported in Table 1 are quite high, which means that a large number of real names can be correctly identified and potentially be used to reveal more private information. The results also show that RNI based on content features exhibits higher recall but lower precision than RNI based on OSN features. When combining the two sets of features, the performance of RNI is improved with higher precision. Moreover, we look into the falsely detected real names that should be nicknames. We find that nearly 40% of them actually include real family names in the full nicknames. The above results reflect that RNI can accurately identify real names in QQ GRP, and that the real names of the 128 million QQ users we identified are highly reliable.
Figure 2 shows the age distribution of 313 million QQ users in QQ GRP. As seen in Fig. 2, it is a reasonable distribution. We can observe that 86% of QQ users are between 11 and 40, while nearly half of QQ users' ages are between 21 and 30. In addition, we can see that 7.8% of QQ users provide ages younger than 3 years, and 2.4% of users provide ages older than 100 years. It is clear that most of those who claim to be younger than 3 years or older than 100 years intentionally provide wrong ages.
In order to evaluate our estimation, we calculate the real ages of the 341,826 users in $U_{QQ}$ based on the birth dates shown in their citizen IDs in the Hotel Guest dataset. These real ages are accurate and are used as ground truth. Figure 3a shows the histogram of these users' real ages. For comparison, we also show their ages displayed in QQ. Comparing the displayed ages in QQ with the real ages, in age group [0-20] many QQ users set their ages older than their actual ages, while in age group [21-40] many users set their ages younger than their actual ages. Our estimation has a distribution close to the actual age distribution. Here, we estimate the age of a user $u$ as $\hat{a}^{class}_u$. We consider two errors:
- $\delta_{QQ}$ is the absolute difference between the actual age and the QQ displayed age;
- $\delta_{est}$ is the absolute difference between the actual age and our estimated age.
We can see that our method exhibits smaller errors. That is to say, we can make a good prediction of user ages based on their friends in the same groups, even when some of them intentionally set their ages to wrong values. We are interested in studying the users who intentionally provide a wrong age. Actually, about 10% of users in $U_{QQ}$ set their age to values smaller than 4 or larger than 99. Let $U_{ab}$ denote the set of these users who display wrong ages on QQ. Figure 3b shows the CCDF of the age estimation errors when applying our method to the users in $U_{ab}$. We can see that our method can estimate the ages of 80% and 90% of users in $U_{ab}$ with an error smaller than 5 and 10 years, respectively. The above results demonstrate the capability of our method for estimating QQ users' real ages. Finally, we used our method to estimate the real ages of all QQ users (not all of them are in $U_{QQ}$ and have ground truth of real ages). The distribution of estimated ages is shown in Fig. 2. Compared with the displayed ages, we have no estimation less than 10 or greater than 90, but have more in the interval [21-30]. This is because our method corrects the ages of many users in [21-30] who provided wrong ages.
Fig. 4 The fraction of QQ users revealing their attributes to strangers on the QQ chatting system
Next, we randomly sample 200 QQ IDs from QQ GRP, and search for these QQ IDs on the search engine of the QQ chatting system. As shown in Fig. 4, we observe that a large fraction of these QQ users make their attributes such as birth date and blood type publicly available to strangers on the QQ chatting system. In addition to the age and gender leaked in QQ GRP, we find that 31%, 50%, 69%, 27%, 73%, 77%, and 81% of the sampled 200 QQ users reveal their emails, blood types, birth dates, hometowns, locations, Chinese zodiac signs, and astrology zodiac signs to the public, respectively. Chinese zodiac signs and astrology zodiac signs can be used to determine the year and month of birth, which can be used to narrow the search criteria on other OSNs' search engines. Finally, we evaluate the performance of our method in mining personal information (e.g., family names, birth dates, cellphone numbers, citizen IDs' last 12 digits from their passwords, email addresses, and user names) on the ground truth dataset $U$. Table 2(b) shows the fraction of users we identified as using their personal information in their passwords/email addresses. The accuracy of our method is shown in Table 2(c). We can see that 60%, 43%, and 76% of the family names in Pinyin, birth dates, and cellphone numbers we learned from email addresses are correct, while 29%, 31%, 14%, and 38% of the family names in Pinyin, birth dates, cellphone numbers, and citizen IDs' last 12 digits we learned from passwords are correct. Table 2(d) shows the result of our method on the 335 million pairs of email addresses and passwords. We see that the family name is widely used in the email address (31%) and password (22%). Birth date and mobile number are often used in passwords, at around 10%. Such a significant fraction of detected personal information deserves attention. Users are advised not to use this information in their email addresses and passwords, as attackers can launch spoofing attacks on the 335 million email addresses with a considerable amount of correct personal information.

Discussion
Our work has value for future studies on both private information in OSNs and password security. Automatically identifying users' Chinese real names in anonymous OSNs enables us to identify users and correlate them with public or private information through Social Engineering Engines (SEEs). The information gathered through SEEs facilitates the inference of users' private information in OSNs and the generation of more precise embeddings of users for diverse applications such as friend/interest recommendation, sentiment analysis, spammer detection, and user activity classification. As for password security, we draw conclusions in terms of password strength and password guessing. First, our work can support further study on password strength for different groups of Chinese users. According to the QQ groups that users joined, we can obtain the users' profiles including age, educational background, interests, and so on. Based on the users' profiles, we are able to evaluate the password strength of different groups of people. This can help to find groups with low password strength and give them proper suggestions. Second, we propose methods to correctly mine users' private information, which has been demonstrated to be highly relevant to users' passwords. Therefore, our work helps to address the key challenge of password guessing, namely how to choose the most effective password candidates. Furthermore, our work can make a good prediction of some private information even when it is intentionally set to a wrong value.

Conclusion
To the best of our knowledge, this is the first work to comprehensively study the leaked datasets of recent years in China through social network analytics. We evaluate the risk of the leaked datasets and show that third parties such as attackers and misbehaving companies are able to successfully correlate these leaked datasets and accurately learn users' profiles such as real name, OSN IDs, age, gender, birth date, social connections, cellphone numbers, and citizen IDs. By studying these leaked datasets together, we find that private information leakage and password cracking facilitate each other. Regarding the mitigation of the risks of these leaked datasets, we noticed that the privacy settings given by some Internet service providers are weak for the users involved in these data leaks, and that more measures than changing passwords (e.g., removing the profiles from search engines) are needed to protect these users. Finally, we also give our suggestions for both Internet service providers and users.