Privacy Information Security Classification for Internet of Things Based on Internet Data

Many privacy protection technologies have been proposed, but most of them are independent and aim at protecting one specific kind of privacy. The attributes of privacy itself have received little in-depth study. To minimize the damage and influence of privacy disclosure, the important and sensitive privacy should be preserved first when not all pieces of privacy can be preserved. This paper studies the attributes of privacy and proposes the privacy information security classification (PISC) model. PISC classifies privacy into four security classifications, each with its own security goal. The Google search engine is taken as the research platform to collect the related data. Based on the data from the search engine, we obtained the security classifications of 53 pieces of privacy.


Introduction
Cyber-physical systems (CPS) are engineered systems comprising interacting physical and computational components. Unlike more traditional embedded systems, a full-fledged CPS is typically designed as a network of interacting elements with physical input and output instead of as standalone devices. The Internet of Things (IoT) is a novel paradigm that is rapidly gaining ground in the scenario of modern wireless telecommunications. The basic idea of the Internet of Things is the pervasive presence around us of a variety of things or objects, such as radio-frequency identification (RFID) tags, sensors, actuators, and mobile phones, which, through unique addressing schemes, build a network and are able to interact with each other and cooperate with their neighbors to reach common goals [1]. Since the key point of IoT is to connect a variety of things into a network, the Internet of Things is a case of CPS.
Software design in CPS will have a very strong "systems engineering" flavor: the software implementation of various system functions (e.g., traction control in an automobile) becomes the primary method for realizing that function. This means that software cannot be designed independently of the system, and the system cannot be designed independently of the software [2]. So, when we build a CPS, we must consider the information and information security in the software.
The ability to collect personal private information is growing with the expansion of IoT capabilities, so the International Telecommunication Union (ITU) points out that user privacy is one of the important challenges of the Internet of Things (IoT) [3]. In CPS, privacy security is an important challenge as well.
In order to provide integrated services, sensors will be integrated into buildings, vehicles, and common environments, carried by people, and attached to animals so that they can communicate locally and remotely [4]. Moreover, since IoT could include various cameras, GPS, and RFID equipment, information about one's environment and behavior, such as location, movement speed, and physical signs (e.g., pulse, blood pressure, and disease), can be collected comprehensively and accurately [5].
CPS and IoT can bring great commercial and political interests for hackers [6][7][8]. As CPS/IoT technologies are widely used in various industries, national defense, military, and other areas of national interest, enormous commercial and political interests are at stake. Driven by those interests, hackers and Internet virus manufacturers will launch more targeted and harmful attacks.
The two points above pose a great threat to personal information security in the IoT. Unfortunately, the issue of privacy protection has not received enough attention. Many privacy protection technologies have been discussed recently [9][10][11][12][13], but most of them are independent and aim at protecting one certain kind of privacy, such as location. An IoT application involves different classifications of privacy, and it is costly to protect all of them. Because of the economic cost of system implementation and the impact on running efficiency, it is impossible to employ all the privacy protection technologies in some cases [12]. Hence, it is necessary to divide privacy into different classifications so that we can protect the important privacy with limited cost and technology.
Meanwhile, we found that privacy attributes have not been studied enough and that there is no objective method to classify privacy into different classifications. According to some privacy classification methods, the sensitivity level of a piece of privacy is determined by the owner's feeling [14]. However, different people may feel differently about the same piece of privacy, so that piece of privacy will have many sensitivity levels. This brings great difficulty to CPS software development. For this reason, we propose an objective method to classify privacy into different security levels.
The contributions of this paper include the following. (1) We focus on studying the attributes of privacy and define two attributes of privacy. (2) We propose the privacy information security classification (PISC) model. (3) We propose an objective method, based on big data from Google, to classify privacy information into different security levels.

Related Work
Facebook is the largest social network website, where people can share their photos and personal information. On Facebook, a photo can be set to different privacy levels [15]. There are five levels of privacy settings that a person can choose: everyone, friends of friends, friends, some friends, and only me. Different privacy levels mean different viewers of the photo. The Facebook privacy setting system is designed entirely for the Facebook website. Although it gives users the right to determine the viewers, Facebook does not itself assign a privacy level to the information or photo.
Information security and privacy classification (ISPC) is a classification schema to safeguard personal information and is mainly used in medical health. By ISPC, personal information is classified into 4 sensitivity classifications: high, medium, low, and unclassified [14]. Unclassified information is meant for the public to see and does not require any additional safeguards. Low level information is generally available to employees and approved nonemployees only. Medium level information is intended to be accessed by a specific group of employees only. High level information is extremely confidential and intended for access by named individuals or positions only. However, ISPC does not assign a sensitivity level to each piece of personal information. By ISPC, the sensitivity classification level is determined by the individual.
Corby studied how privacy is produced and divided privacy into static, dynamic, and derived types based on the concept of time and the source of privacy [16]. Static privacy describes who we are, significant property identifiers, and other tangible elements. Dynamic privacy is the data we create. Records of transactions we initiate constitute the bulk of dynamic data. Every charge card transaction, telephone call, and bank transaction is added to the collection of dynamic data. Derived privacy data is created by analyzing groups of dynamic transactions over time to build a profile of a person's behavior. Corby studied the source of privacy, but pieces of privacy of the same type, such as static privacy, can differ in confidentiality.
Goldberg focused on the anonymity of private information. He classified the open state of privacy information into verinymity, persistent pseudonymity, linkable anonymity, and unlinkable anonymity [17]. A verinym is a true name. It could mean the name printed on a government-issued birth certificate, driver's license, or passport, but not necessarily. Persistent pseudonymity means that the information is related to someone, but that person cannot be confirmed, as with an MSN account. Linkable anonymity means that the information is generated by a possible object, such as a phone call record, from which we can identify where the call came from but cannot verify who dialed it. With unlinkable anonymity, it is completely impossible to verify the source of the information. However, the anonymity of privacy is not the same as its confidentiality. According to Goldberg's study, the anonymity degree of a true name is the lowest, so the true name should be strictly protected. However, most people use their true names every day, so the confidentiality of the true name is obviously very low.
Wu studied privacy protection when a user requests a service in an Internet of Things application [18]. He combined the studies of Corby and Goldberg and proposed a secrecy state classification method for privacy. In Wu's model, there are four types of privacy: anonymous, semianonymous, semipublic, and public privacy. Anonymous privacy is the personal information. Semianonymous and semipublic privacy include the static data, dynamic data, and derived data. The norm of law is defined as public privacy. We think Wu's model separates the data from the person, whereas personal information in fact comprises static data, dynamic data, and derived data. Wu also mixed the concepts of privacy and personal information. We think public privacy does not exist.
Corby, Goldberg, and Wu sought to classify privacy by its source or state. Wu sought to classify privacy into different secrecy levels according to its source, but the secrecy degree of privacy does not depend on its source. None of them gave an objective method to classify privacy.
Lu et al. studied privacy information security classification in IoT based on data from the Baidu search engine [19]. We improved the computational method of that paper, and our classification result is different from that in [19]. Baidu is the biggest Chinese search engine. We found that the classification results obtained with Baidu are different from those obtained with the Google search engine. This suggests that, because of cultural differences, Western users and Chinese users worry differently about privacy and privacy disclosure.

Privacy Attributes Definition
3.1. Definition of Privacy. Regarding the meaning of "privacy," psychologists believe that privacy consciousness originates from the human sense of shame [20]. Personal information could be any of a large number of things, including personal name, age, shopping habits, nationality, email or IP address, physical address, and identity. Personal privacy is usually a very subjective concept, so different individuals understand privacy differently. Information that some people regard as private may be insignificant to others. For example, some women regard their age as private and will not tell it to others, but most men do not mind telling others their ages. In addition, privacy is closely associated with the situation and environment. People are willing to share their privacy with their friends in a private environment. Privacy may not be an entirely objective existence but a cognitive process, and everyone has his or her own privacy boundary and threshold. However, privacy reflects a general public psychology of human society; hence it still needs to be studied in accordance with the standards of the general public. Thus it is necessary to study privacy and privacy attributes in a more targeted way.
In order to protect privacy information, we must first clarify the meaning of privacy. The Oxford English Dictionary explains the meaning of privacy as (1) freedom from interference or public attention and (2) state of being alone or undisturbed. Privacy reflects a general public psychology of human society, namely, that a person does not want others to know his/her secrets.
In this paper, "privacy information" refers to personal data or transactions that can be stored in IoT/CPS systems, not to personal thoughts or feelings, which are usually not stored.

Privacy Attribute: Confidentiality.
A widely accepted meaning of privacy is confidentiality. Privacy confidentiality refers to the degree of secrecy of the information for a person. Privacy confidentiality reflects the severity of the consequences caused by privacy disclosure.
Some personal information, such as height, is a kind of privacy, but it is not hidden, because others can estimate someone's height from his/her figure. Disclosure of this kind of information would not seriously affect the individual. However, if some privacy items, such as personal property, physical defect, and specific disease, are leaked, the leak will bring serious economic loss or damage to the owner's reputation and social status. Therefore, the confidentiality of this kind of privacy is high. Sometimes the disclosure of such items will even affect the whole society. In privacy information protection, more attention should be paid to confidentiality than to integrity.

Privacy Attribute: Universality.
Privacy information is usually a subjective concept because different individuals understand privacy differently, as in the case of age. But some information, such as disease and physical defect, is private for all people. The universality of a piece of privacy is the proportion of people, among all people, who regard this information as their privacy. Since it is impossible to survey the whole society on whether a certain piece of information is regarded as private, our study is based on a survey of the Internet.

Privacy Attributes Study Based on Search Engine
4.1. Study Data from Search Engine. The goal of privacy information security classification is to ensure the safety of the important privacy under constrained security technology and cost. The importance of a piece of privacy is determined by comparing it with other pieces of privacy. With the development of the Internet, people have become used to seeking help through search engines, especially when they realize that their privacy has been leaked. For example, many people will search "What shall I do when my credit card information has been disclosed?" or "Is my home address privacy?" and so on. Therefore, searching behavior on a search engine can reflect people's worry about privacy disclosure [19]. Google is the largest search engine in the world. Using the Google search engine as our research platform, we can obtain the number of webpages for each piece of privacy. The number of webpages can be used to analyze the privacy attributes. In this study, we selected 53 pieces of privacy which are often used in IoT/CPS systems. The number of webpages is about 3 × 10^10.

Privacy Query Data.
To collect data on the webpages about a piece of privacy, we define two input templates as the query input to the search engine. The first step is to study the degree of privacy universality. The number of webpages related to a piece of privacy reflects its privacy universality. Input template 1 is designed to collect the number of webpages that contain the privacy.
Input Template 1: "The Name of a Piece of Privacy" "Privacy." Let $N_1(p)$ be the total number of query results for the privacy $p$ under input template 1. As $N_1(p)$ is very large, we normalize $N_1(p)$ by the Min-Max normalization method. $U(p_i)$ is the privacy universality of $p_i$, and it is equal to the normalized $N_1(p_i)$:

$$U(p_i) = \frac{N_1(p_i) - \min_j N_1(p_j)}{\max_j N_1(p_j) - \min_j N_1(p_j)}.$$

The second step is to get the number of webpages about the privacy confidentiality. The privacy confidentiality can be reflected by the queries which show users' concerns about privacy disclosure. For some pieces of privacy, such as "criminal records," the number of webpages containing "criminal records" is not very large, but criminal records are usually a personal secret. For other pieces of privacy, such as weight, the number of webpages containing the privacy is very large. However, someone's weight is semipublic information, so even if a person's weight is leaked, the leak will not bring serious consequences to the person. As a result, the confidentiality of weight is low. So, the privacy confidentiality is not determined by the number of webpages. The privacy confidentiality is the proportion of the queries that show users' concerns about privacy disclosure to all the webpages which are related to the privacy. Here, we define input template 2 to get the number of webpages about privacy confidentiality.
Input Template 2: "The Name of a Piece of Privacy" "Disclosure." Let $N_2(p)$ be the total number of query results for the privacy $p$ under input template 2. $N_2(p_i)/N_1(p_i)$ is the proportion of the queries that show users' concerns about privacy disclosure to all the related webpages, so it shows the degree of users' worry about the privacy disclosure. Let $C(p_i)$ be the privacy confidentiality of $p_i$:

$$C(p_i) = \frac{N_2(p_i)}{N_1(p_i)}.$$

Table 1 lists $U(p)$ and $C(p)$ for each piece of privacy. Although the confidentialities of some pieces of privacy, such as call records, disease, and affiliation, are high, the universalities of these pieces of privacy are low. This indicates that $U(p)$ and $C(p)$ are independent.
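The two-step computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that universality is the Min-Max normalized result count of input template 1 and that confidentiality is the ratio of the template 2 count to the template 1 count, as the text states. The function names and the result counts in the example are hypothetical and are not data from Table 1.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to normalize over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def privacy_attributes(counts):
    """Compute universality and confidentiality per privacy item.

    counts maps an item name to a pair (n1, n2):
      n1 = result count for template 1 ("<item>" "Privacy")
      n2 = result count for template 2 ("<item>" "Disclosure")
    """
    names = list(counts)
    n1_values = [counts[name][0] for name in names]
    universality = dict(zip(names, min_max_normalize(n1_values)))
    confidentiality = {}
    for name in names:
        n1, n2 = counts[name]
        # Guard against an item with no template 1 results.
        confidentiality[name] = n2 / n1 if n1 else 0.0
    return universality, confidentiality


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_counts = {
    "photo": (1_000_000, 20_000),
    "weight": (600_000, 6_000),
    "call records": (100_000, 30_000),
}
universality, confidentiality = privacy_attributes(example_counts)
```

With these made-up numbers, "call records" has the lowest universality but the highest confidentiality, which mirrors the kind of pattern the text reports for call records.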
Figure 1 shows the distribution of the privacy universality and privacy confidentiality, which reveals several features of privacy. There are some pieces of privacy with high privacy universality and low privacy confidentiality. There are also some pieces of privacy with high privacy confidentiality and low privacy universality. Moreover, the pieces of privacy whose confidentiality and universality are both low or medium form the largest group. However, the most important finding is that no piece of privacy has both high privacy confidentiality and high privacy universality. Phone call records, which indicate whom we contacted and where we were every day, are the most confidential piece of privacy, but only a small number of people regard them as their privacy. On the contrary, although most people regard photos as their privacy, they like to share their photos with their friends or even publish them on websites such as Facebook.

Determining the Security Levels of Privacy Attribute.
According to ISPC [14], the sensitivity level of a piece of a person's privacy is determined by that person's feeling, so a piece of privacy may have different sensitivity levels in ISPC. Highly sensitive privacy might be assigned a low sensitivity level because the person does not know the importance of that piece of privacy. We believe this will confuse system designers.
In our study, both the privacy confidentiality and the privacy universality have 4 levels: very high, high, medium, and low. We employ the k-means clustering algorithm to classify the confidentiality and universality values. k-means clustering is a widely used flat clustering algorithm. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Using the k-means clustering method, the confidentiality values and the universality values are each classified into 4 clusters. The bigger the value of the centroid of a cluster, the higher the confidentiality or universality level of the cluster. The referential boundary of each level of confidentiality and universality in our study is shown in Table 2.
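As an illustration of this step, the following Python sketch runs a one-dimensional k-means (Lloyd's algorithm) on a list of attribute values and names the clusters by centroid order. It is a sketch under stated assumptions rather than the procedure actually used in the study: the deterministic initialization, the function names `kmeans_1d` and `assign_levels`, and the sample values are all hypothetical, and the boundaries it produces are not those of Table 2.

```python
LEVELS = ["low", "medium", "high", "very high"]


def kmeans_1d(values, k=4, max_iter=100):
    """Cluster scalars with Lloyd's algorithm.

    Returns (centroids sorted ascending, labels), where label 0 is the
    cluster with the smallest centroid.
    """
    distinct = sorted(set(values))
    if k < 2 or len(distinct) < k:
        raise ValueError("need k >= 2 and at least k distinct values")
    # Assumed initialization: spread the starting centroids evenly
    # over the sorted distinct values so the result is reproducible.
    centroids = [distinct[(i * (len(distinct) - 1)) // (k - 1)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - centroids[j]))

    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v)].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated

    order = sorted(range(k), key=lambda j: centroids[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = [rank[nearest(v)] for v in values]
    return [centroids[j] for j in order], labels


def assign_levels(values):
    """Map each value to a level name; a larger centroid means a higher level."""
    _, labels = kmeans_1d(values, k=len(LEVELS))
    return [LEVELS[label] for label in labels]


# Hypothetical normalized attribute values.
sample = [0.00, 0.02, 0.30, 0.32, 0.60, 0.62, 0.98, 1.00]
sample_levels = assign_levels(sample)
```

The same function would be applied once to the confidentiality values and once to the universality values, since the text clusters the two attributes separately.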

Necessity of Classifying the Privacy.
Privacy protection is a basic service of CPS and IoT, so it can be used in a large number of IoT applications. Different IoT applications may not have exactly the same security requirements for privacy. The motivation for dividing privacy into different security levels is to ensure that the important privacy is protected correctly.
The conflict between information or privacy security and system availability always exists. If all pieces of privacy in an IoT system are protected by complex security technologies, the information processing time will increase seriously and the availability of the system will decrease. Meanwhile, the system development cost and time increase. For these reasons, it is impossible in some cases to employ privacy protection technologies to protect all pieces of privacy. Therefore, privacy should be classified into different levels according to its importance. Thus, the important pieces of privacy are protected by complex technologies and the other pieces are protected by simple technologies. Because the objective of classifying privacy is to ensure the security of the important privacy, the classification is called security classification.

Privacy Information Security Classification.
The security protection level of an IT system does not depend on the scale of the system, the cost of building it, or the service object, but rather on the damage caused by information disclosure [21]. The damage refers to the infringement of the interests of all the parties involved in the information disclosure. We define the source of damage to be the threat source. The relationship among the threat source, the privacy information, and the interests of all the parties is shown in Figure 2.
The security level of privacy information is not determined by the destruction of the privacy information itself but depends on the scope of the persons involved in the privacy disclosure and the degree of privacy confidentiality. This is the key to classifying privacy information. For instance, when an RFID card is physically destroyed by an adversary and cannot be used again, the privacy information in the card is not actually revealed.
We propose a privacy information security classification (PISC) model for privacy in IoT and CPS, in which the security level of the privacy information is determined by the consequence of the privacy disclosure, which includes two factors: the scope of the parties involved in the privacy disclosure and the degree of the confidentiality of the privacy.
The privacy universality is the proportion of people, among all people, who regard a piece of information as their privacy. In other words, the privacy universality indicates how many people think they are infringed when the information is disclosed. Therefore, the privacy universality can represent the scope of the parties involved in the privacy infringement. The privacy confidentiality indicates the importance of the privacy to its owner and the degree of secrecy.
By PISC, privacy is classified into 4 security levels or classifications according to both the privacy confidentiality and the privacy universality. The 4 security levels are high security, medium security, basic security, and low security. The security classification of a piece of privacy is determined by its confidentiality level and universality level, as shown in Table 3.
High Security Privacy. High security privacy is (1) the privacy whose confidentiality or universality is very high and (2) the privacy whose confidentiality and universality are both high.
Medium Security Privacy. Medium security privacy is (1) the privacy whose confidentiality is high and whose universality is medium or low and (2) the privacy whose universality is high and whose confidentiality is medium or low. The security requirement of this classification is lower than that of high security privacy.
Basic Security Privacy. Basic security privacy is (1) the privacy for which one of confidentiality and universality is medium and the other is low and (2) the privacy whose confidentiality and universality are both medium.
Low Security Privacy. Low security privacy is the privacy whose confidentiality and universality are both low. The security requirement of this classification is the lowest.
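The four definitions above amount to a small decision rule over the two attribute levels. The following Python sketch encodes that rule as read from the prose definitions; the function name `pisc_class` and the numeric ranking are illustrative choices, not part of the original model description.

```python
# Numeric rank for each attribute level, used only for comparison.
RANK = {"low": 0, "medium": 1, "high": 2, "very high": 3}


def pisc_class(confidentiality_level, universality_level):
    """Return the PISC security classification for a pair of attribute levels."""
    c = RANK[confidentiality_level]
    u = RANK[universality_level]
    top, bottom = max(c, u), min(c, u)
    if top == RANK["very high"] or bottom == RANK["high"]:
        # Either attribute is very high, or both are high.
        return "high security"
    if top == RANK["high"]:
        # Exactly one attribute is high; the other is medium or low.
        return "medium security"
    if top == RANK["medium"]:
        # Both medium, or one medium and the other low.
        return "basic security"
    return "low security"


# Example: high confidentiality with low universality.
example_class = pisc_class("high", "low")
```

Because the rule depends only on the larger and the smaller of the two levels, it is symmetric in confidentiality and universality, which matches the definitions given above.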

PISC Classification Results
Figure 3 shows the distribution of the four security classifications. The distribution is roughly symmetrical. There are more pieces of privacy in the bottom right than in the top left, which shows that the privacy universality affected the classification result more strongly than the privacy confidentiality did.
Table 4 shows the PISC classification result and the privacy source. It shows that the security classification of a piece of privacy does not depend on the source of the privacy. We found that most people agree that family and photo are private information. Call records, personal photo, and family information are high security privacy that needs to be protected first. 62.5% of the static privacy items are basic security or low security privacy. Because people often provide their static personal information to their companies or to institutions such as banks, static privacy is actually known in many places. 50% of the dynamic privacy items are medium security or high security privacy. This shows that the privacy which we create should be protected. For most people, the privacy in a high security level is more personal or more stable than that in a lower security level. For example, both house and salary are a person's property, but a house shows a person's wealth more than a salary does. As another example, a person changes his or her mobile phone number more frequently than his or her landline telephone number. Derived privacy is created by analyzing groups of dynamic transactions over time, so it is not created by the owner of the privacy. The security level of a piece of derived privacy is the highest security level among the pieces of dynamic transaction privacy that it analyzes.
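The rule for derived privacy stated at the end of this paragraph can be expressed as taking the maximum over an ordered list of security levels. The short Python sketch below illustrates it; the function name `derived_security_level` and the example inputs are hypothetical.

```python
# Security classifications ordered from weakest to strongest requirement.
SECURITY_ORDER = ["low security", "basic security", "medium security", "high security"]


def derived_security_level(source_levels):
    """Return the highest security level among the analyzed dynamic items."""
    if not source_levels:
        raise ValueError("derived privacy needs at least one source item")
    return max(source_levels, key=SECURITY_ORDER.index)


# Example: a profile derived from three dynamic records.
profile_level = derived_security_level(
    ["basic security", "high security", "low security"]
)
```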

Discussion
The data we used to classify the privacy were collected not from IoT or CPS systems but from Google. However, the data on a search engine are generated by people all over the world, so the data can reflect most people's feelings and judgments. Thus, the data are impartial and objective, and the classification results are objective. By classifying and organizing the privacy data in an IoT system, the system efficiency can be increased.

Security Goals
According to the harmful consequences caused by illegal access to privacy information or by a privacy leak, the privacy in the Internet of Things is divided into four classifications by security level. Different privacy protection technologies should be considered to achieve the specific security goal of each security classification. The following are the security goals of the four privacy security levels: (1) the security goal for low security privacy: the privacy in this classification is allowed to be unencrypted, but it cannot be accessed without authorization; (2) the security goal for basic security privacy: to encrypt the privacy in the storage medium and during transmission and to ensure that it cannot be accessed without authorization; (3) the security goal for medium security privacy: the protection abilities in goal 2, with different people using different encryption keys; (4) the security goal for high security privacy: the protection abilities in goal 3, with each access authorization valid for only one read or write.

Figure 1: The distribution of the privacy universality and privacy confidentiality.

Figure 2: The relationship among the threat source, the privacy information, and the interests of all parties.

Figure 3: The distribution of four classifications.

Table 2: The referential boundary of each level.

Table 3: Security classification decision by the privacy universality and the privacy confidentiality.