A Multiuser Identification Algorithm Based on Internet of Things

With the rapid development of the Internet of Things (IoT) in 4G/5G deployments, the amount of network data generated by users has exploded. This growth has not only revolutionized daily life but has also allowed malicious actors to exploit these data to attack the privacy of ordinary users. Therefore, it is crucial to identify the entity users behind multiple virtual accounts. To address the low precision of user identification under the many-to-many matching mechanism, a random forest confirmation algorithm based on stable marriage matching (RFCA-SMM) is proposed in this study. It consists of three key steps. First, we employ the stable marriage matching model to calculate the similarity between multiple users and utilize a scoring model to calculate the overall similarity of the users, after which candidate matching pairs are selected. Second, we construct a random forest model trained on a set of user similarity vectors. Third, the candidate matching pairs undergo secondary confirmation by the random forest model, which both improves the precision of many-to-many user identification and protects private user data in the IoT. Extensive experiments demonstrate that the proposed algorithm improves the precision rate, recall rate, F-Measure (F1), and Area Under Curve (AUC).


Introduction
As 4G/5G technology continues to evolve, it provides people with more efficient performance and increased speed in order to meet higher standards of data services for more users. At the same time, the higher speed and more reliable transmission of mobile communication further promote the development of the Internet of Things (IoT) in the large-scale era. The amount of user data is constantly increasing in the IoT context. Leveraging user data to analyze the social behavior of users can enable the provision of a safer social environment. According to statistics, 42% of users use multiple social networks simultaneously [1]. The IoT integrates different social methods to meet people's different needs to the greatest extent possible [2]. For instance, RenRen and Sina Microblog are services used in China to share personal statuses and publish blogs anytime and anywhere. However, as there is no direct link between user data on these services, a complete social network map is difficult to obtain. Multiuser identification is therefore employed to identify users of multiple virtual accounts [3,4], allowing user data to be better protected in the IoT era.
User identification is also referred to as user matching. Many studies have addressed the user identification problem by examining user profile information attributes, primarily the user's personal information and published content, which includes username, geographical location, blog posts, etc. [5][6][7][8][9][10][11][12][13]. Although missing data is an issue in the process of filling in these attributes, they can still be filled in through the use of appropriate methods. Moreover, some attributes play an extremely important role in the process of user identification. Therefore, user identification based on user attribute information can better accomplish the work of identification. Other research on this topic focuses on the use of network topology for user identification. This research mainly relies on the user's circle of friends to identify a specific user [14][15][16][17][18][19]. The similarity between user accounts is judged by analyzing nodes between users. However, due to the heterogeneity of the network structure in practical applications, this method requires improvement in terms of its precision.

Related Works
Multiuser identification technology is significant in both research and practice in many important fields. Current studies of multiuser identification can be divided into three categories according to the way feature information is used: user identification based on user profile information, network structure, and user-generated information.

User Profile Information-Based User Identification.
Research based on user profile information for the purpose of solving user identification problems primarily focuses on personal information. The classification model is constructed, after which the corresponding matching strategy is used to complete user identification. Raad et al. [20] proposed the Friend of a Friend (FOAF) attribute matching strategy. Ye et al. [21] proposed an objective weighting method for user attributes to integrate user attribute information and complete the calculation of user profile similarity. Cortis et al. [8] proposed an identity recognition algorithm that assigns weights to individual attributes of user profile information and then calculates the similarity among attributes with reference to both grammatical and semantic aspects. Able et al. [22] aggregated user profile information in order to match users. Zamani et al. [23] took the user's unit, interests, and other attribute information into full consideration, integrating the similarity of multiuser attributes via the equal evaluation model and complex mixed training model; this improves the possibility of correctly identifying users owing to the personalized characteristics of many users' attribute information. Therefore, leveraging user attribute information to achieve user identification is a good choice.

Network Structure-Based User Identification.
Network structure-based studies on user identification mainly focus on recognizing identical users by examining the user's circle of friends. The user's friend relationships are easy to obtain, the problem of malicious imitation and forgery is less likely to occur, and the importance of information coupling of local topology on network development has been confirmed. Narayanan et al. [14] proved for the first time that user identification can be accomplished by relying on the topology structure of the network; however, the precision of the matching results required improvement. Bartunov et al. [15] proposed the construction of the objective function by combining attribute information and network structure information and then optimized the function to obtain the optimal matching pair. Cui et al. [16] integrated user profile information similarity and graph similarity to achieve mapping from an email network to a Facebook network. Liu et al. [17] proposed the HYDRA approach to modeling user behavior by employing user attributes and user-generated content. Korula et al. [18] abstracted the problem of user identification into a mathematical definition, arguing that different social networks are generated by user graph structure through probability and that the selection process of graph edges is one of approximate probabilities. Tan et al. [19] modeled users' social relations and mapped users to low-dimensional space to improve the efficiency of user identification in the network. However, there is heterogeneity between nodes in the actual network structure, and the influence of this heterogeneity is ignored in the calculation process; therefore, the precision of this method in the context of user identification requires improvement.

User-Generated Information-Based User Identification.
User identification based on user-generated information mainly relies on the content published by users. Now that the Internet of Things (IoT) has a close relationship with our daily lives, people can immediately post their own dynamic content and comment on the content posted by the friends around them. As user behavior habits are not easy to change and hide [24], these habits can correspond strongly with the characteristics of the users themselves. Therefore, the use of data mining algorithms to discover these hidden association rules [25] can greatly improve the recognition rate of user identification. Goga et al. [10] used the geographic locations, timestamps, and writing habits of users' published statuses for user identification purposes. Li et al. [13] proposed a supervised machine learning algorithm to solve the user identity recognition problem based on user-generated content. In recent years, the development of mobile communication technology has made a great contribution to the incorporation of geo-tagging when users publish their statuses. As the user's track of action is not easy to imitate, the application of geographical location attributes to user identification opens up a new method of identifying users. Cao et al. [26] proposed an identification method for processing multisource data by utilizing the co-occurrence frequency of two user trajectories. Hao et al. [27] proposed that user trajectories be transformed into sequences composed of multiple grids, which are in turn transformed into vectors by using a TF-IDF model, after which the similarity of user trajectories is calculated via cosine similarity. Han et al. [28] proposed that each geographic coordinate point should be represented as a corresponding semantic position. The user's trajectory can thus be represented by the text composed of the semantic positions, with the LDA model being used to represent the user's topic distribution; finally, the similarity of the user trajectories is calculated to determine whether the two users are the same. Therefore, the analysis of user behavior information for multiuser identification is ideal in this context.

Data Preprocessing
3.1. Filling Missing User Data. Data filling is commonly applied to user profile data processing. When a user registers an account, data may be missing for various reasons, such as privacy protection. Therefore, an appropriate filling method should be adopted for each type of data in each dimension, as follows:
(1) Similarity filling: user data is filled in by utilizing the relational degree [29] between the user with missing data and other users. For example, if user $i$ and user $j$ are friends, they generate social behaviors such as comments, reposts, and thumbs up on social networks, where a comment indicates that the content posted by the friend on the social network is explained, a repost indicates that the friends have similar interests, and a thumb up indicates agreement with the content posted by the friend. The relational degree between users is calculated from the behaviors "comment $C_{ij}$", "repost $R_{ij}$", and "thumb up $T_{ij}$" between users. The three types of user behavioral information are assigned the weights 3, 2, and 1, respectively. The $k$ users with the highest relational degree are selected, and the mode of their attribute values is used for filling. If the user with missing data has a low relational degree with other users, the data is not filled. The relational degree $D_{ij}$ is the weighted sum of the three behavior counts:
$$D_{ij} = 3C_{ij} + 2R_{ij} + T_{ij}$$
(2) Speculated filling: the missing data is inferred from other attributes. This method is mainly used for user gender filling. The blog posts published by the user best reflect the characteristics of the user's personality; thus, by using the user's blog post information, the Bayesian classification model [30] can be employed to accomplish user gender speculation.
The Bayesian classification model is constructed as follows:
$$P(M \mid w_i) = \frac{P(w_i \mid M)\,P(M)}{P(w_i)}$$
where $P(M \mid w_i)$ denotes the probability that the user is identified as male when the word $w_i$ occurs, $P(w_i \mid M)$ denotes the probability of the word $w_i$ occurring among all males, $P(M)$ denotes the probability of the user being male, and $P(w_i)$ denotes the probability of the occurrence of the word $w_i$.
Given the complexity of calculating $P(w \mid M)$ for a whole post $w = (w_1, w_2, \ldots, w_n)$, this article adopts the naïve conditional independence hypothesis. The formula is as follows:
$$P(w \mid M) = \prod_{i=1}^{n} P(w_i \mid M)$$
Therefore, the Bayesian classification model constructed is as follows:
$$P(M \mid w) = \frac{P(M)\,\prod_{i=1}^{n} P(w_i \mid M)}{P(w)}$$
The statistical results of the training set can be used to derive the probabilities required for the calculation in the Bayesian classification model. The prediction results of the corresponding attributes can be obtained via this model.
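The speculated filling step above can be illustrated with a minimal naive Bayes sketch. It is not the authors' implementation: the labels "M" and "F", the toy word tokens, the function names `train` and `predict`, and the add-one (Laplace) smoothing are all assumptions introduced here so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter


def train(posts):
    """posts: list of (label, words). Returns the counts the classifier needs."""
    label_counts = Counter(label for label, _ in posts)
    word_counts = {label: Counter() for label in label_counts}
    for label, words in posts:
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict(words, model):
    """Return the label with the highest posterior under the naive assumption."""
    label_counts, word_counts, vocab = model
    total_posts = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_posts in label_counts.items():
        # log P(label) + sum_i log P(w_i | label), with add-one smoothing (assumption)
        denom = sum(word_counts[label].values()) + len(vocab)
        logp = math.log(n_posts / total_posts)
        for w in words:
            logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical toy training set: each post is a label plus its word tokens.
model = train([("M", ["a", "b"]), ("F", ["c", "d"]), ("M", ["b", "e"])])
```

The denominator $P(w)$ is the same for every label, so the sketch compares only the numerators, in log space to avoid underflow on long posts.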

User Data Similarity Calculation.
Because user data in the IoT come in different formats, the user data format needs to be normalized before the similarity between attributes can be calculated in this study; the processed data is more suitable for similarity calculation. The relevant calculation methods are as follows:
(1) Dice coefficient [31]: strings can be divided into two categories, multivalued and single-valued. For the multivalued strings $s_a$ and $s_b$, the Dice coefficient is twice the size of their intersection divided by the sum of the numbers of elements of $s_a$ and $s_b$:
$$Dice(s_a, s_b) = \frac{2\,|s_a \cap s_b|}{|s_a| + |s_b|}$$
Example. For the two multivalued attribute strings "vivid music movie" and "movie travel", the intersection information is {"movie"}, so the similarity is 2 × 1/(3 + 2) = 0.4.
For single-valued attribute strings, the Dice coefficient is calculated as above, except that the intersection information is different.
(2) Levenshtein distance [32]: the number of character edit steps required to make two strings equal is used as an operational cost to measure the difference between the strings. The similarity of strings $s_a$ and $s_b$ is calculated as follows:
$$Sim(s_a, s_b) = 1 - \frac{LD(s_a, s_b)}{\max(|s_a|, |s_b|)}$$
where $LD(s_a, s_b)$ denotes the distance between the strings $s_a$ and $s_b$ and $\max(|s_a|, |s_b|)$ denotes the larger of the numbers of characters contained in the strings $s_a$ and $s_b$.
(3) Cosine similarity [33]: this is mainly used to compare vectors composed of user attributes. Assuming that $A$ and $B$ are two $n$-dimensional vectors, such that $A$ is $[A_1, A_2, \ldots, A_n]$ and $B$ is $[B_1, B_2, \ldots, B_n]$, the cosine of the angle $\theta$ between $A$ and $B$ is the similarity value between the vectors. The formula is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
The closer the cosine value is to 1 (that is, the closer the angle between the two vectors is to 0), the higher the similarity between the two users is. The closer the cosine value is to 0 (that is, the closer the angle is to 90°), the lower the user similarity is. By comparing the size of the cosine values, it can be determined whether the two accounts are identical.
(4) Term frequency-inverse document frequency (TF-IDF) [34]: this is mainly used to measure the importance of a certain word in the document and is often used to deal with multiword attribute fields such as personal profiles.The specific steps are as follows.
Step 1. Calculate the term frequency (TF) of each word in the document:
$$TF = \frac{n}{N}$$
where $n$ denotes the number of occurrences of a certain word and $N$ denotes the total number of words in the document.
Step 2. Calculate the inverse document frequency (IDF) of each word in the document:
$$IDF = \log\frac{D}{d + 1}$$
where $D$ denotes the total number of documents in the corpus, $d$ denotes the number of documents containing the word, and 1 is added to avoid cases in which the denominator is 0.
Step 3. Calculate the TF-IDF of each word in the document:
$$TF\text{-}IDF = TF \times IDF$$
Step 4. Select keywords in each document to construct a term frequency vector for calculating similarity.
Step 5. Calculate the similarity value by cosine similarity.
(5) User blog data similarity calculation: frequent item sets of user blog data are mined by means of frequent pattern mining to calculate user similarity. Due to the difference in the amounts of user-published content, the one-item set is also used as a calculation indicator in this study. "1" is added to avoid a high-frequency item set in the calculation of similarity. In the similarity formula, $sup_A(I_k)$ denotes the support degree count of the frequent item set $I_k$ of user $A$, $sup_B(I_k)$ denotes the support degree count of the frequent item set $I_k$ of user $B$, and $|I_k|$ denotes the item set number of $I_k$. The similarity threshold is set at 5,000 based on historical data. If $Sim_{AB} > 5000$, return "1"; otherwise, return "0".
(6) State timestamp similarity calculation: the time points at which users publish statuses also have certain personalized characteristics. The average dynamic number can be obtained by dividing the dynamic number generated by a user in a certain period of time by the user's total dynamic number. The average dynamic numbers are then used to form a user timestamp vector of 24 dimensions. The similarity is calculated; users are determined to be the same user when Sim < 0.1 according to the statistical results. The formula is expressed in terms of $a_i$ and $b_i$, which denote the average dynamic number of the $i$th time period of users $A$ and $B$.
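The string and vector measures in items (1) to (4) above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, multivalued strings are assumed to be whitespace-separated, the natural logarithm is assumed for IDF, and returning 0.0 for empty inputs is a convention chosen here.

```python
import math
from collections import Counter


def dice(a: str, b: str) -> float:
    """Dice coefficient of two multivalued strings (values split on whitespace)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0  # assumption: two empty attributes give no evidence of a match
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def levenshtein(s: str, t: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_sim(s: str, t: str) -> float:
    """1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest


def cosine(u, v) -> float:
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def tf_idf(doc, corpus):
    """TF-IDF weight of each word in doc; doc and corpus entries are word lists."""
    weights = {}
    for word, n in Counter(doc).items():
        d = sum(1 for other in corpus if word in other)
        # TF = n / N, IDF = log(D / (d + 1)) as in Steps 1 and 2
        weights[word] = (n / len(doc)) * math.log(len(corpus) / (d + 1))
    return weights
```

For instance, `dice("vivid music movie", "movie travel")` reproduces the worked example from item (1). Note that with the `d + 1` denominator a word that occurs in every document receives a slightly negative IDF, which is a property of the formula as stated rather than of this sketch.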

Multiuser Identification Method
4.1. Building User Similarity Vectors. Research and analysis of user data in the Internet of Things (IoT) context can assist in meeting people's network needs. However, some malicious users will employ the user data to attack normal users via the IoT. Therefore, it is necessary to identify and analyze IoT users.
In this study, the profile information and behavior information of user data are utilized to achieve multiuser identification. After filling in user profile information, the precision of user identification can be increased to a certain extent. User behavior information has the characteristic of being personalized, which allows for highlighting of the user's own behavior habits and is thereby conducive to the improvement of user identification precision. A reasonable similarity calculation method is used for the data of each dimension of the user. The data for each dimension is provided with a threshold value when calculating similarity. After comparing the calculated similarity of attributes with the set threshold, qualified results return "1"; otherwise, return "0". Thus, user similarity vectors composed of "0" and "1" can be formed and used for the input of subsequent algorithms.
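The thresholding described above can be sketched as follows. The function name `similarity_vector` and the sample values are hypothetical, and the sketch assumes every attribute is scored so that a larger value means more similar; a distance-like attribute such as the timestamp measure (same user when Sim < 0.1) would need its comparison reversed before being passed in.

```python
def similarity_vector(similarities, thresholds):
    """Turn per-attribute similarity values into a 0/1 user similarity vector."""
    if len(similarities) != len(thresholds):
        raise ValueError("one threshold is required per attribute")
    # An attribute contributes 1 only when its similarity exceeds its threshold.
    return [1 if s > t else 0 for s, t in zip(similarities, thresholds)]


# Hypothetical values for three attributes and their thresholds.
vector = similarity_vector([0.9, 0.2, 0.5], [0.8, 0.5, 0.5])
```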

Weight Analysis of User Attributes.
By calculating the similarity between user attributes, the whole similarity vector of user attributes can be obtained. As different user attributes have different influences on the degree of user recognition, it is necessary to calculate the weight of the attribute items. Figure 1 shows the performance of a single attribute in user identification. It can be clearly seen from Figure 1 that the URL and user name have different distributions of similarity between the same user and different users when user matching is performed. As these attribute items are highly distinguishable, the weight allocation should be relatively large. When users match in terms of their interests, the similarity distribution between the users is small, meaning that the weight distribution should also be small. Again, as each attribute has different effects on user identification, it is therefore necessary to assign corresponding weights to each attribute.

Weight Allocation Algorithm.
After the user data is preprocessed, multiple user attributes are determined. When determining the weighting coefficient of the similarity judgment of each attribute in the user data, the traditional expert subjective weighting method encounters the problem of poor robustness, while the objective weighting method relies too much on the existence of a large amount of sample data and therefore generalizes poorly. Therefore, this article proposes the posterior probability-based information entropy weight allocation algorithm.
Input: source network account user data information vector $V_A$; user data vectors $\{V_B\}_{i=1}^{n}$ for all accounts in the target network; user data vector $V_B$ of the account to be matched in the target network
Output: the final similarity $S_{AB} = w(i)\,Sim(V_A, V_B)$ between the two accounts $V_A$ and $V_B$
1: foreach $V_B$ in $\{V_B\}_{i=1}^{n}$
2: for i = 1 to n
3: Calculate the similarity $Sim(V_A, V_B) = (v_1^{AB}, v_2^{AB}, \ldots, v_n^{AB})$ of accounts A and B by using formulas (6) (7) (8) (11) (12) (13)
4: end
5: for i = 1 to n
6: Assign the attribute weights of the user data using equation (15)
7: end
8: Calculate the final similarity $S_{AB} = w(i)\,Sim(V_A, V_B)$ between the two accounts $V_A$ and $V_B$
9: return $S_{AB}$
Algorithm 1: User data similarity calculation.
In information theory, the entropy value reflects the degree of information disorder. The smaller its value is, the more orderly the information is, and the more valuable this attribute is; on the contrary, the more disordered the information is, the lower the value of this attribute is. Therefore, information entropy can be used to evaluate the effectiveness of the attributes used. According to the definition of information entropy, for any random variable $X$, the formula is as follows:
$$H(X) = -\sum_{x} p(x) \log p(x)$$
where $p(x)$ is the probability of each possible value of the attribute.
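As a small illustration of the entropy definition above, the following sketch estimates the entropy of an attribute from its observed values. The function name is hypothetical, and base-2 logarithms are an assumption, since the text does not fix the base.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of attribute values."""
    total = len(values)
    counts = Counter(values)
    # p(x) is estimated by the relative frequency of each observed value.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A perfectly ordered attribute has entropy 0; a 50/50 attribute has 1 bit.
ordered = entropy([1, 1, 1, 1])
balanced = entropy([0, 1, 0, 1])
```

Under the reading in the text, the attribute behind `ordered` would be treated as more valuable than the one behind `balanced`.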
In order to make the probability description of attributes more precise, more effective weights are assigned to each attribute. On the basis of information entropy, the posterior probability of user attributes is further calculated, which aids in improving the precision of user identification. By combining the posterior probability of each attribute with its information entropy, the attribute weight $w(i)$ of the user account is obtained. The user data information contains $n$ attribute items. The similarity vector of two accounts is $Sim(V_A, V_B) = (v_1^{AB}, v_2^{AB}, \ldots, v_n^{AB})$, where $v_i^{AB}$ represents the similarity between the $i$th attribute of user $A$ and the corresponding attribute of user $B$. If the similarity exceeds the threshold, output "1"; otherwise, output "0". Accordingly, the user similarity vector is a vector composed of "1" and "0". Therefore, the final user similarity vector is $S_{AB} = w(i)\,Sim(V_A, V_B)$. The process is summarized in Algorithm 1.

Random Forest Confirmation Algorithm Based on Stable Marriage Matching
4.4.1. Similarity Score. In order to improve the efficiency of the similarity calculation between users, this study adopts a stable marriage matching algorithm to perform the many-to-many matching calculation. The overall similarity of users is evaluated by means of the similarity score of user matching. The relevant formula is as follows:
$$Score = \sum_{i=1}^{n} w_i\, v_i^{AB}$$
where Score denotes the final score of the match, $w_i$ denotes the weight of the $i$th attribute of the user, and $v_i^{AB}$ denotes the similarity of the $i$th attribute of user A and user B. The higher the Score is, the more likely it is that the two accounts belong to the same user.
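Reading the score as a weighted sum of the 0/1 attribute similarities, it can be computed as in the sketch below. The function name and the example weights are hypothetical and are not taken from the paper's experiments.

```python
def match_score(weights, sim_vector):
    """Weighted sum of per-attribute similarities for one candidate pair."""
    if len(weights) != len(sim_vector):
        raise ValueError("weights and similarity vector must have equal length")
    return sum(w * v for w, v in zip(weights, sim_vector))


# Hypothetical weights for three attributes and a 0/1 similarity vector.
score = match_score([0.5, 0.3, 0.2], [1, 0, 1])
```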

Stable Marriage Matching Algorithm.
The scoring formula can evaluate the overall similarity of two users based on user data information. The higher the score, the more likely it is that the two users are the same user. The stable marriage matching algorithm uses the similarity score between users to select candidate matching pairs. If user data were calculated for all accounts, the computational complexity would be high. Therefore, it is necessary to obtain the candidate accounts by filtering the target accounts in the other network according to condition C: filter accounts by username. The specific steps involved are as follows.
Step 1. Each user on social network M is scored against each user on social network N by the scoring formula.
Step 2. Each user on M requests a match with the user on N who ranks first according to the score. If that user on N has already been matched, the user on N compares the current partner with the user who is requesting the match and keeps the one with the higher score as the other half of the matching pair.
Step 3. After Step 2 is complete, some users will still fail to be matched successfully. An unmatched user requests a match with the highest-ranked user among all users who have not yet rejected them, after which Step 2 is repeated until all users are matched.
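Steps 1 to 3 above follow the pattern of the Gale-Shapley deferred acceptance procedure, which can be sketched as below. This is a sketch under assumptions: `scores` is a hypothetical precomputed score matrix, both sides rank partners by the same score, ties are broken in favor of the current partner, and the username filter (condition C) is taken to have been applied beforehand.

```python
from collections import deque


def stable_match(scores):
    """scores[m][n] is the match score of account m on M with account n on N.

    Returns a dict mapping each matched N account index to its M partner.
    """
    n_m = len(scores)
    n_n = len(scores[0]) if scores else 0
    # Each M account ranks the N accounts from highest to lowest score.
    prefs = [sorted(range(n_n), key=lambda n: -scores[m][n]) for m in range(n_m)]
    next_choice = [0] * n_m      # position of the next N account to ask
    partner = {}                 # n -> m currently held by n
    free = deque(range(n_m))
    while free:
        m = free.popleft()
        if next_choice[m] >= n_n:
            continue             # m has been rejected by every N account
        n = prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = partner.get(n)
        if current is None:
            partner[n] = m
        elif scores[m][n] > scores[current][n]:
            partner[n] = m       # n prefers the new request
            free.append(current)
        else:
            free.append(m)       # n keeps its current partner
    return partner


# Account 1 on M displaces account 0 at N's account 0 (0.9 > 0.5).
pairs = stable_match([[0.5, 0.4], [0.9, 0.1]])
```

Because each M account asks each N account at most once, the loop makes at most |M| × |N| requests.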

Random Forest Algorithm.
The output of the stable marriage matching algorithm is a set of matching pairs. However, such results cannot be used directly, as they may contain poor matches. In order to solve this problem, a secondary confirmation by random forest is established in order to eliminate the negative influence of wrong matching results on the final results as far as possible.
There are many algorithms based on supervised learning, including logistic regression, SVM, Adaboost, etc. Random forest is selected as the final secondary confirmation algorithm in this study for the following reasons.
(1) There are 20 data dimensions used in this study, some of which may be linearly correlated. Linearly correlated dimensions not only play no positive role in the training of the supervised learning model but also impair the effect of the other, uncorrelated dimensions. In general machine learning model training, dimensionality reduction is applied to the data, which is a tedious process. However, random forests are not sensitive to multicollinearity, and the results are robust to missing and unbalanced data.
(2) While overfitting is always discussed in machine learning, it is not easy for the random forest model to produce the overfitting phenomenon owing to the randomness involved.
Figure 2 provides an overview of the construction process of the random forest validation model. The specific steps are as follows.
Step 2. Generate the decision tree.
(a) The number of prediction attributes in the training sample is 20, and $F = \lceil \sqrt{20} \rceil = 5$ attributes are randomly selected from the 20 prediction attributes to form a random feature subspace, which is the split attribute set of the current node of the decision tree. During the generation of the random forest model, $F$ remains unchanged.
(b) According to the decision tree generation algorithm, each node is split by selecting the optimal split attribute from the random feature subspace.
(c) Each tree grows completely without pruning. Finally, according to each training set $U_i$, the corresponding decision tree is generated as $h_i(U_i)$.
(e) Using the plurality voting method, the class output by the largest number of the K decision trees is used as the final classification result corresponding to sample X of the test set.
Figure 3 describes the process of RFCA-SMM. The input data of the algorithm is the user attribute data in the IoT. By using the stable marriage matching algorithm in combination with the scoring formula, the overall similarity between users can be obtained in order to select the candidate matching pairs. With the user similarity vector training set as input, the random forest model is constructed, and the candidate matches obtained are used as input data for confirmation and identification by the random forest. If the random forest identifies a candidate matching pair as not being the same user, the pair is marked as "unmatched"; by contrast, if the random forest confirms that the two accounts in the candidate pair belong to the same user, the final match result is generated. The algorithm flow of RFCA-SMM is represented by Algorithm 2.
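The three random ingredients of the construction described above (Bagging with replacement, the random feature subspace, and plurality voting) can be sketched with the standard library alone. Decision tree induction itself is omitted, the function names are hypothetical, and rounding $\sqrt{20}$ up to 5 follows the value quoted in the text.

```python
import math
import random
from collections import Counter


def bootstrap_samples(dataset, k, rng):
    """Draw k training subsets, each sampled with replacement from dataset."""
    return [[rng.choice(dataset) for _ in dataset] for _ in range(k)]


def random_subspace(n_features, rng):
    """Pick ceil(sqrt(n_features)) distinct feature indices for one node split."""
    size = math.ceil(math.sqrt(n_features))
    return sorted(rng.sample(range(n_features), size))


def plurality_vote(predictions):
    """Return the class predicted by the largest number of trees.

    Ties go to the class that appears first in predictions.
    """
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
subsets = bootstrap_samples(list(range(10)), k=3, rng=rng)
features = random_subspace(20, rng)
label = plurality_vote([1, 0, 1])
```

In practice an off-the-shelf random forest implementation would normally replace these helpers; the sketch only makes the sampling and voting steps concrete.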

Analysis of Experimental Results
In order to verify the effectiveness of the proposed algorithm, we use the five open datasets of foreign mainstream social networks provided in [35].
In this study, precision rate, recall rate, F-measure (F1), and AUC are used as evaluation criteria. The relevant formulae are as follows:
$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
AUC: the area under the Receiver Operating Characteristic (ROC) curve is directly calculated. The ROC curve takes the False Positive Rate (FPR) as the X-axis and the True Positive Rate (TPR) as the Y-axis. The formulae for these two values are as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
where $TP$ denotes the number of the same users that are correctly matched, $TN$ denotes the number of users that are unmatched and not the same, $FP$ denotes the number of users that are matched but are not the same, and $FN$ denotes the number of users that are not matched but are the same users.
4: if $V_s$ = NULL then
5: $V_s$ = UserSelect($S_M$, $S_N$)
6: $V_h$ = UserMatch($V_s$, $S_M$, $S_N$)
7: end if
8: ($V_s$, $V_h$) = Secondary confirmation($V_s$, $V_h$, R)
9: end while
10: pruning process
11: return R
Algorithm 2: RFCA-SMM (steps 4 to 11).
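The evaluation criteria can be computed from the four confusion counts as in the sketch below, writing TP, TN, FP, and FN for the counts of correctly matched same users, correctly unmatched different users, wrongly matched different users, and missed same users. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Precision, recall (equal to TPR), F1, and FPR from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts, not taken from the paper's experiments.
m = evaluation_metrics(tp=8, tn=5, fp=2, fn=2)
```

AUC is then the area under the curve traced by (FPR, TPR) pairs as the decision threshold varies.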

Comparison of Random Forest Model and Other Supervised Learning Models.
This article uses the random forest supervised learning model for the confirmation of candidate matching pairs. In order to verify the effectiveness of the proposed algorithm in obtaining the matching results, the random forest model and other supervised learning models are analyzed with reference to the evaluation indicators outlined above. The ratio of training set data to test set data is 3:1 and the number of users is 1000 pairs. The results are presented in Table 1 and Figure 4.
It can be seen from Figure 4 that these supervised learning algorithms have relatively good results; among them, the best performing algorithms are Random Forest and Logistic.Random Forest performs slightly better in precision rate and F1, while Logistic performs slightly better in recall rate and AUC.However, considering the modeling scenarios of these two supervised learning algorithms, Random Forest has an advantage over Logistic.Therefore, the effectiveness of the random forest model is also proven.

Comparison of RFCA-SMM and RCM Algorithm Results.
This section presents a comparative analysis of the RFCA-SMM and Ranking-based cross-matching (RCM) algorithms. The purpose of the RCM algorithm is to accurately find more matching pairs; it decomposes the identification of seed users into a step-by-step iterative process. In each iteration, the calculation process of the algorithm is divided into three substeps: account selection, account matching, and cross matching. The results of each iteration are accumulated to form a result set. The advantage of this algorithm is that it compares the results obtained in each iteration and selects the user account with a high score as the final result. However, the precision of the RCM algorithm is largely affected by the number of seed users (that is, the number of user pairs that are known to match). If it is not possible to know in advance which accounts belong to the same user (that is, the identity matches are untagged), then the RCM algorithm needs to be improved in terms of precision. Since the algorithm for user identification in this study is untagged, the experimental results of the two algorithms are analyzed using an unmarked data set, as shown in Table 2 and Figure 5.
It is clear from Figure 5 that the proposed RFCA-SMM algorithm is superior to the RCM algorithm in terms of precision, F1, and AUC in this study, although its performance is slightly lower than that of the RCM algorithm in terms of recall rate.The reason is that the proposed algorithm performs user-generated data processing, user attribute weight distribution, and secondary confirmation based on supervised learning compared with the RCM algorithm.It can be seen from Figure 5 that the final user matching results of the proposed algorithm achieve some improvement in the evaluation index compared with the RCM algorithm, mainly because the proposed algorithm (unlike RCM) does not take the social network structure into account [36].In summary, the algorithm proposed in this study improves the precision of user identification to a great extent.

Conclusions
The key features of 4G/5G technology, namely, low energy consumption and low delay, have laid the foundation for the development of the Internet of Things (IoT). Given the various other advantages of this technology, it can effectively promote the rapid development of the IoT industry chain. Since most of the information in the IoT is related to private user data, we propose a random forest confirmation algorithm based on stable marriage matching (RFCA-SMM). The candidate matching pairs are obtained by combining the stable marriage matching algorithm with the scoring model. In order to demonstrate the effectiveness of the random forest model, we compare it with several other supervised learning models on the evaluation indicators and then perform the secondary confirmation of candidate matching pairs via the random forest. Moreover, we conduct a comparative analysis of RFCA-SMM and RCM. The experimental results

Figure 2: Random forest model construction process.

Step 1. Acquire the training set.
(a) The original input training data set D comprises 20 prediction attributes and a classification label $Y$. The 20 prediction attributes are the user similarity vector $Sim(V_A, V_B) = (v_1^{AB}, v_2^{AB}, \ldots, v_n^{AB})$ obtained above, while the classification label is $Y \in \{1, 0\}$. A class label of "1" means that the users are the same, while "0" means that they are not the same.
(b) The original training data set D is randomly sampled K times with replacement via the Bagging method, and K new training subsets $U_1, U_2, \ldots, U_K$ are obtained.

Figure 3: Random forest confirmation based on stable marriage matching.

Algorithm 2: RFCA-SMM (input, output, and steps 1 to 3).
Input: account to be matched $V_s$; candidate matching account $V_h$; the set $S_M$ of accounts that have not been matched in social network M; the set $S_N$ of accounts that have not been matched in social network N
Output: the final set of matched pairs R
1: R = ∅
2: Initialize the unmatched queue
3: while $S_M \neq \emptyset$ and $S_N \neq \emptyset$ do

Figure 4: Comparison of several types of supervised learning.

Table 1: Comparison of several types of supervised learning.

Table 2: Comparison of RFCA-SMM and RCM results.