An Encrypted File Deduplication Scheme with Permission in Cloud Storage

An encrypted file deduplication scheme (EFD) can improve the storage space utilization of cloud storage and protect the privacy of the files stored there. However, if an enterprise stores its files in cloud storage that has deployed an encrypted file deduplication scheme without permission checking, the permissions of the enterprise files are destroyed, which causes security problems. This seriously limits the practical value of EFD and prevents it from being deployed in real cloud storage. To resolve this problem, we propose an encrypted file deduplication scheme with permission (EFDSP) and construct it by using hidden vector encryption (HVE). We analyze the security of EFDSP. The results show that EFDSP is secure and can prevent the online deduplication oracle attack. We implement EFDSP and conduct a performance evaluation. The results show that the performance of EFDSP is slightly inferior to that of SADS, which is the only existing encrypted file deduplication scheme with permission, but the performance gap decreases as the number of authorized users increases, and EFDSP overcomes the security weakness of SADS.


Motivation.
Recently, with the rapid development of network storage technology, cloud storage has become an important storage option. Because rental costs are low, outsourcing the files of an enterprise to cloud storage can reduce its management costs and improve its competitiveness. To prevent information leakage, an enterprise user usually stores its files in cloud storage in encrypted form. An encrypted file deduplication scheme can save the storage space and network bandwidth of cloud storage and improve its performance. However, in the enterprise application environment, employees of different departments have different permissions, and each employee can only access files according to his or her permission. If an encrypted file deduplication scheme does not support permission checking, it destroys the file permissions and causes security problems. Li et al. proposed a secure authorized deduplication scheme based on a hybrid cloud (SADS) [1]. SADS introduces a private cloud to preserve the user permissions and to generate a permission tag for a user when the user uploads a file. When the cloud storage performs the deduplication check for a user, it checks the user's deduplication permission; if the user does not have the deduplication permission, the user must upload the file even if the same file already exists in the cloud storage. Only when the user has the deduplication permission and the same file exists in the cloud storage can the cloud storage perform file deduplication. SADS achieves encrypted file deduplication, but it has three shortcomings:
(i) Firstly, each permission is represented by a private key. A user with multiple permissions must store multiple private keys secretly, which complicates the user's key management.
(ii) Secondly, when a user $u$ uploads a file $F$ or queries for a duplicate of $F$, the scheme needs to use $n$ permission keys to generate $n$ encrypted file tags for $F$ (if $u$ has been assigned $n$ permissions). The scheme therefore causes large network traffic.
Mathematical Problems in Engineering
(iii) Thirdly, there exists a security weakness in SADS. Assume Mike is an enterprise manager who manages department A and department B, so Mike has the permissions of department A and department B. Mike is also responsible for the finance department, so he has the finance department permission as well. If a cloud storage uses SADS to deduplicate its files, SADS uses the private keys of department A, department B, and the finance department to generate three encrypted file tags. As a result, the staff of department A and department B have the permission to deduplicate their files against the payslip file. Suppose Mike has uploaded Alice's payslip file to the cloud storage, and both Bob and Alice are employees of department B. Bob wants to learn the salary of Alice. He can use the following steps (called the online deduplication oracle attack) to attack SADS and obtain the salary information of Alice:
(a) Bob first forges Alice's payslip file $F$. $F$ is a low-entropy file with a fixed format. Bob knows the file format, or he even has a file of this kind, i.e., his own payslip. He also knows that Alice's salary is between 4000 and 4100; he just does not know the exact value. So Bob sets the salary value to 4000, 4001, ..., 4100 in turn and generates 100 files $F_1, F_2, \ldots, F_{100}$.
(b) Bob uploads $F_1, F_2, \ldots, F_{100}$ to the cloud storage one by one. If the cloud storage deduplicates the file when he uploads a file $F_i$ ($1 \le i \le 100$), Bob knows that the salary of Alice is the value in the uploaded file $F_i$.
Obviously, the attack succeeds because the authorization granularity of SADS is coarse. When Mike generates the encrypted file tags, he implicitly assigns the file deduplication permission to Bob, which bypasses the file permission check. Moreover, when the cloud storage checks for file deduplication, it only checks whether the encrypted file query tag of the uploaded file matches the encrypted file tags stored in the cloud storage; it does not check the user's permission. Therefore, we want to design a secure encrypted file deduplication scheme with permission that strengthens the deduplication permission check of the user and avoids the security issues of SADS.
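The attack can be illustrated with a toy simulation. The sketch below is not the SADS construction: it assumes a simplified model in which a deduplication tag is an HMAC of the file digest under a per-permission key and the store deduplicates on any tag match. The key values, the payslip format, the salary 4057, and the names `tag`, `ToyStore`, and `make_payslip` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical per-permission keys; in this toy model a user who holds a
# permission also holds its key.
KEYS = {"dept_A": b"key-A", "dept_B": b"key-B", "finance": b"key-F"}


def tag(permission, content):
    """Toy deduplication tag: HMAC of the file digest under a permission key."""
    digest = hashlib.sha256(content).digest()
    return hmac.new(KEYS[permission], digest, hashlib.sha256).digest()


class ToyStore:
    """Keeps tags only and deduplicates on any tag match (no per-user check)."""

    def __init__(self):
        self.tags = set()

    def upload(self, permissions, content):
        """Return True if the upload was deduplicated, False if it was stored."""
        new_tags = {tag(p, content) for p in permissions}
        if self.tags & new_tags:
            return True
        self.tags |= new_tags
        return False


def make_payslip(name, salary):
    # Assumed fixed, low-entropy payslip format.
    return f"PAYSLIP;name={name};salary={salary}".encode()


store = ToyStore()
# Mike holds three permissions, so three tags are stored for Alice's payslip.
store.upload(["dept_A", "dept_B", "finance"], make_payslip("Alice", 4057))

# Bob holds only the department B key and tries every candidate salary
# in the inclusive range 4000..4100.
recovered = None
for guess in range(4000, 4101):
    if store.upload(["dept_B"], make_payslip("Alice", guess)):
        recovered = guess
        break

print(recovered)
```

In this model the deduplication response acts as a yes/no oracle on the file content, which is why a small candidate space is enough for Bob to succeed.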

Our Contributions.
In this work, we study how to enable cloud storage to deduplicate a user's encrypted files without destroying their file permissions. We define the permission vector and the permission relation, use the permission vector to represent the user permissions, and use the permission relation to compare the permission levels of two users. We design an encrypted file deduplication scheme with permission (EFDSP), which overcomes the security weakness of SADS. In EFDSP, the file owner enables the cloud storage to perform deduplication when other users with the same or a higher permission level upload duplicate files to the cloud storage. Our contributions can be summarized as follows:
(i) Firstly, we discover a security weakness of SADS and propose an attack method against this scheme for low-entropy files.
(ii) Secondly, we propose an encrypted file deduplication scheme with permission, which enables cloud storage to deduplicate encrypted files without destroying the file permissions. In EFDSP, a user with a low permission level needs to upload the file even if a duplicate file exists in the cloud storage. EFDSP can prevent the online deduplication oracle attack and overcomes the security weakness of SADS.
(iii) Thirdly, we define the permission vector and the permission relation and use the permission vector, the permission relation, and hidden vector encryption to construct EFDSP.
(iv) Fourthly, we implement our scheme and conduct a performance evaluation, and the results demonstrate that our scheme is reasonable.
The paper is organized as follows. In Section 2, we present some preliminary knowledge. In Section 3, we describe and define the problem and define encrypted file deduplication with permission. The permission vector and the permission relation are defined in Section 4. In Section 5, we construct the encrypted file deduplication scheme with permission. In Section 6, we optimize EFDSP. In Section 7, we give some security analyses of EFDSP. In Section 8, we implement our scheme, conduct a performance evaluation, and present the evaluation results. In Section 9, we discuss related works. Finally, some conclusions are given in Section 10.

Hidden Vector Encryption.
Hidden vector encryption (HVE) was first proposed by Boneh and Waters [2]. Subsequently, Katz [3] and Park [4] proposed further HVE schemes. HVE is a kind of predicate encryption in which two attribute vectors are associated with the ciphertext and the tag, respectively. The ciphertext matches the tag only when the two vectors agree. HVE uses two character sets Σ and Σ*, where Σ* = Σ ∪ {*} and * is a wildcard. If a component of a vector is *, that component does not take part in the matching. HVE is mainly composed of four algorithms: key generation, data encryption, tag generation, and data query.
(i) In the key generation phase, the trusted authority (TA) assigns a public/private key pair $(PK, SK)$ to a receiver.
(ii) In the data encryption phase, the user selects a vector $x = (x_1, x_2, \ldots, x_l) \in \Sigma^l$ to describe its data $M$ and uses the receiver's public key $PK$ to encrypt the data and obtain the ciphertext $C$.
(iii) In the tag generation phase, the receiver first selects a vector $w = (w_1, w_2, \ldots, w_l) \in (\Sigma^*)^l$ to represent the query requirement and then uses its private key $SK$ to generate a query tag $T_W$. Finally, the receiver sends $T_W$ to the server.
(iv) In the data query phase, if $T_W$ matches $C$, the query outputs $M$, which is the plaintext of $C$. The matching condition is defined as follows: let $S(w)$ be the set of subscripts $i$ for which $w_i$ is not *, where $w_i$ is a component of $(w_1, w_2, \ldots, w_l)$. For two vectors $x$ and $w$, let $P_w(x)$ be the equality predicate that satisfies (1).
$$P_w(x) = \begin{cases} 1, & \text{if } x_i = w_i \text{ for all } i \in S(w), \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$
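The matching condition can be made concrete on plaintext vectors. The following minimal sketch evaluates only the predicate, that is, whether the data vector agrees with the query vector at every non-wildcard position; it performs no encryption, and the function names are illustrative assumptions rather than part of any HVE library.

```python
WILDCARD = "*"


def non_wildcard_indices(w):
    """Return the 1-based positions of w that are not the wildcard."""
    return {i for i, w_i in enumerate(w, start=1) if w_i != WILDCARD}


def hve_predicate(w, x):
    """Return 1 if x agrees with w at every non-wildcard position, else 0."""
    if len(w) != len(x):
        raise ValueError("vectors must have the same length")
    return int(all(x[i - 1] == w[i - 1] for i in non_wildcard_indices(w)))


print(hve_predicate("1*0*", "1100"))  # positions 1 and 3 agree
print(hve_predicate("1*0*", "1110"))  # position 3 differs
```

In an actual HVE scheme the server evaluates this predicate on the ciphertext and the tag without learning either vector; the sketch shows only what is being tested.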

Problem and Definition
3.1. The System Model.
To facilitate enterprise management, we introduce a permission server (PS) to manage the user permissions and a key generation server (KGS) to generate an encryption key for each uploaded file. With PS and KGS, the system model of cloud storage is shown in Figure 1. It consists of four kinds of entities: the users, a cloud storage, a permission server, and a key generation server. The permission server and the key generation server are deployed in the enterprise domain and are assumed to be absolutely secure. The cloud storage (CS) checks whether a duplicate file exists in the cloud storage and whether the user has the permission to deduplicate the file. If both conditions are met, the user does not need to upload the file, and the cloud storage server provides the user with a file pointer; otherwise, the user needs to upload the file.
When the system is initialized, the system administrator gives each user access permission according to the user's permission level. The system administrator can use the role-based method [5], that is, assign the permission to a user based on the role of the user. Suppose an IT company has only three types of employees: manager, project leader, and engineer. If a user $u$ is assigned the permission of the manager, then $u$ can access any file whose access role is the manager. Each file in the cloud storage has a file permission tag that describes its permission; the cloud storage performs deduplication only when another user with the same permission uploads a duplicate file.
Cloud storage provides its users with a data storage service. To reduce its storage costs, CS uses cross-user file deduplication to eliminate redundant files and stores only one copy of each file. As stated above, PS and KGS are deployed in the enterprise secure domain. PS is responsible for the user permission management and the file permission query, and it assists CS in performing the file permission check and the file deduplication. KGS is responsible for generating encryption keys: when a user needs to store a file in CS, it interacts with KGS to obtain an encryption key for the file.

Problem Formalization.
In this work, we study how to enable the cloud storage to deduplicate a user's encrypted file without destroying the file permissions. That is, we study how the file owner can allow the cloud storage to perform deduplication when other users with the same or a higher permission level upload a duplicate file to the cloud storage. We formalize the problem as follows.
When a user, say $u$, wants to upload a file $F$ to CS, it first interacts with KGS to get the encryption key $K_F$ for $F$, and then it interacts with PS. PS uses $H(F)$, the permission level of $u$, and its private key $SK_{PS}$ to generate $Q_F$ for $F$, where $Q_F$ is a permission query tag of $F$. After receiving $Q_F$, $u$ sends $Q_F$ to CS to query whether the encrypted file $C_F$ exists in the cloud storage. If $C_F$ exists in the cloud storage, $u$ does not need to upload $F$ and only needs to store the file pointer returned by CS; otherwise, $u$ first encrypts $F$ using $K_F$ to get $C_F$ and then uses its permission level and $PK_{PS}$, where $PK_{PS}$ is the public key of PS, to generate the encrypted file tag $T_F$ for $F$. Finally, $u$ sends $C_F$ and $T_F$ to the cloud storage.
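The message flow above can be sketched as follows. This is a toy model under loudly stated assumptions, not the EFDSP construction: the KGS key derivation is replaced by an HMAC, the cipher by an XOR keystream, and the query tag and file tag are sent in the clear and compared for equality, whereas EFDSP hides them with HVE. All class and function names are hypothetical.

```python
import hashlib
import hmac


def toy_encrypt(key, data):
    """XOR with a SHAKE-256 keystream; a placeholder for a real cipher."""
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class KGS:
    def __init__(self, secret):
        self._secret = secret

    def file_key(self, file_hash):
        # The same file hash always yields the same key.
        return hmac.new(self._secret, file_hash, hashlib.sha256).digest()


class PS:
    def __init__(self, vectors):
        self._vectors = dict(vectors)  # user name -> permission vector

    def query_tag(self, user, file_hash):
        # Toy tag in the clear; the real scheme hides these values.
        return (file_hash, self._vectors[user])


class CS:
    def __init__(self):
        self._tags = []         # stored file tags
        self._ciphertexts = []  # encrypted files, in the same order

    def check(self, query_tag):
        """Return the pointer of a matching entry, or None."""
        if query_tag in self._tags:
            return self._tags.index(query_tag)
        return None

    def store(self, file_tag, ciphertext):
        self._tags.append(file_tag)
        self._ciphertexts.append(ciphertext)
        return len(self._tags) - 1

    def retrieve(self, pointer):
        return self._ciphertexts[pointer]


def upload(user, vector, data, kgs, ps, cs):
    file_hash = hashlib.sha256(data).digest()
    key = kgs.file_key(file_hash)        # step 1: key from KGS
    tag = ps.query_tag(user, file_hash)  # step 2: query tag from PS
    pointer = cs.check(tag)              # step 3: duplicate check at CS
    if pointer is not None:
        return "deduplicated", pointer, key
    ciphertext = toy_encrypt(key, data)  # step 4: encrypt and upload
    return "uploaded", cs.store((file_hash, vector), ciphertext), key


kgs, ps, cs = KGS(b"demo secret"), PS({"alice": "0111", "bob": "0111"}), CS()
print(upload("alice", "0111", b"design doc", kgs, ps, cs)[0])
print(upload("bob", "0111", b"design doc", kgs, ps, cs)[0])
```

Because the key depends only on the file hash in this sketch, a user whose upload is deduplicated can still decrypt the stored ciphertext with the key it obtained in step 1.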

The EFDSP Scheme.
In order to solve the problem that we have formalized in Section 3.2, we design an encrypted file deduplication scheme with permission.
Definition 3 (EFDSP). An encrypted file deduplication scheme with permission is a tuple of algorithms as follows:
(i) The first algorithm takes the security parameter as input and outputs the public parameter.
(ii) The second algorithm takes the public parameter as input and outputs an identity-based key pair of CS, a private key/public key pair $(SK_{PS}, PK_{PS})$ of PS, and a signature/verification key pair of PS.
(iii) The third algorithm takes the hash value of the user file $F$ as input and outputs a file encryption key $K_F$ for $F$.
(iv) FileTagGeneration($F$, $V$, $PK_{PS}$) → $T_F$: this algorithm is run by $u$; it takes $F$, $V$, and $PK_{PS}$ as input and generates an encrypted file tag $T_F$ as output. $V$ is the permission level of the user, and $PK_{PS}$ is the public key of PS.
(v) FileQueryTagGeneration: this algorithm is run by PS; it takes $H(F)$, the identity of $u$, the private key $SK_{PS}$ of PS, and the signature private key of PS as input and outputs the query tag $Q_F$ together with the signature of $Q_F$.
The deduplication check algorithm is run by CS; it takes $H(F)$, $Q_F$, and the signature of $Q_F$ as input and outputs a flag $b$ and a file pointer $P_F$. CS uses $Q_F$ to match the stored encrypted file tags. If a tag matches, CS sets $b = 1$, lets $P_F$ be the file pointer of the corresponding encrypted file $C_F$, and adds $u$ to the file entry of $C_F$; otherwise, it sets $b = 0$ and assigns NULL to $P_F$.
(viii) Enc($K_F$, $F$) → $C_F$: this algorithm is run by $u$, and it uses $K_F$ to encrypt $F$ to generate $C_F$.
(ix) Dec($C_F$, $K_F$) → $F$: this algorithm is run by $u$, and it uses $K_F$ to decrypt $C_F$ to generate $F$.
(x) FileRetrieval($P_F$) → $C_F$: this algorithm is run by CS, and it uses $P_F$ to search the encrypted files in CS and returns the corresponding encrypted file $C_F$.
The interaction process of EFDSP is described in Figure 2.

The Threat Model.
Since PS is responsible for the user permission management and the file permission query, and KGS is responsible for generating encryption keys for users, we must assume that PS and KGS are absolutely secure and reliable. CS performs the tasks assigned to it honestly, but it is interested in the content of the users' files and tries to extract secret information from them, so we regard it as an honest-but-curious adversary [6]. Some users try to access files beyond their permissions. We also assume that all files stored in the cloud storage are confidential, so any information disclosure results in a very large loss to the user. Under these assumptions, there are two kinds of adversaries in the system.
(1) External adversary: it tries to obtain secret information from the cloud storage or tries to access files beyond its permission.
(2) Internal adversary: it can access the cloud storage easily and tries to get secret information from the encrypted file tags or the query tags.
Figure 2: The interaction process of EFDSP.

The Security Requirements.
According to the threat model described in Section 3.4, there exist four security requirements as follows:
(1) The confidentiality of the encrypted file tag: an unauthorized user, including the cloud storage server, cannot get the plaintext information from the encrypted file tags stored in the cloud storage server.
(2) The unforgeability of the encrypted file query tag: an unauthorized user should be prevented from getting or generating the encrypted file query tags because it has no appropriate permission. It is not allowed to collude with the cloud storage server to destroy the unforgeability of the query tags.
(3) The indistinguishability of the encrypted file query tag: a user cannot get any information from the query tags without querying the permission server, including the file content and the permissions.
(4) The confidentiality of the file: a user who does not own the files cannot obtain the plaintext from the files stored in the cloud storage server; that is, an adversary cannot retrieve and restore files that do not belong to it.

The Permission Vector and the Permission Relation
To represent the user permissions effectively, we define the permission vector in this section.
Definition 4 (permission vector). Let $P = (p_1, p_2, \ldots, p_N)$ be the collection of the system permissions, where 1 to $N$ are the sequence numbers of the permissions in the system. A permission vector $V$ is a binary vector of $N$ bits, which are numbered 1 to $N$ from left to right. $V(i)$ represents the permission $p_i$. If the value of $V(i)$ is 0, the permission $p_i$ is valid; otherwise, the permission $p_i$ is invalid.
Figure 3 is an example of role hierarchies given in [5]. It has four roles: programmer, test engineer, project member, and project supervisor. We can easily represent the permission of each role by using the permission vector. Let ITP = {programmer, test engineer, project member, project supervisor} be the basic permission set of the system. Because there are only four basic permissions, a 4-bit permission vector can represent the permission of each role; the sequence numbers of the four basic permissions in the permission vector are 1, 2, 3, and 4, respectively. The permissions of these roles can also be inherited in [5]. From Figure 3, the project supervisor owns the project supervisor permission and inherits the permissions of both the test engineer and the programmer. According to Definition 4, the permission vector of the project supervisor is 0010, the permission vector of the programmer is 0111, and the permission vector of the project member is 0001.
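The encoding in Definition 4 can be sketched in a few lines. The permission ordering follows the example above; the held-permission sets per role are assumptions chosen so that the resulting vectors agree with the ones stated in the text, and the names `PERMISSIONS`, `ROLES`, and `permission_vector` are hypothetical.

```python
# Ordering of the basic permissions, taken from the example in the text.
PERMISSIONS = (
    "programmer",
    "test engineer",
    "project member",
    "project supervisor",
)


def permission_vector(held, permissions=PERMISSIONS):
    """Encode held permissions as a bit string (0 = held, 1 = not held)."""
    unknown = set(held) - set(permissions)
    if unknown:
        raise ValueError(f"unknown permissions: {sorted(unknown)}")
    return "".join("0" if p in held else "1" for p in permissions)


# Held permissions per role; assumed so that the vectors match the example.
ROLES = {
    "project supervisor": {"programmer", "test engineer", "project supervisor"},
    "programmer": {"programmer"},
    "project member": {"programmer", "test engineer", "project member"},
}

for role, held in ROLES.items():
    print(role, permission_vector(held))
```

Note the inverted convention: a 0 bit means the permission is held, so a vector with more zeros describes a user with more permissions.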
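Definition 5 (permission relation). Let $V_{U_1}$ and $V_{U_2}$ be the permission vectors of two users $U_1$ and $U_2$.
(1) If $V_{U_1}(i) = 0$ for each $i$ where $V_{U_2}(i) = 0$, and there are 0 or more $i$ where $V_{U_1}(i) = 0$ and $V_{U_2}(i) = 1$, then we say the permission level of $U_1$ is higher than that of $U_2$. We use $V_{U_1} \ge V_{U_2}$ to denote it.
(2) If $V_{U_2}(i) = 0$ for each $i$ where $V_{U_1}(i) = 0$, and there are 0 or more $i$ where $V_{U_2}(i) = 0$ and $V_{U_1}(i) = 1$, then we say the permission level of $U_1$ is lower than that of $U_2$. We use $V_{U_1} \le V_{U_2}$ to denote it.
(3) If there exists $i$ where $V_{U_1}(i) = 0$ and $V_{U_2}(i) = 1$, and there exists $j$ where $V_{U_1}(j) = 1$ and $V_{U_2}(j) = 0$, then we say the permission level of $U_1$ is not equal to that of $U_2$. We use $V_{U_1} \ne V_{U_2}$ to denote it.
(4) If $V_{U_1}(i) = V_{U_2}(i)$ for each $i$, then we say the permission level of $U_1$ equals that of $U_2$. We use $V_{U_1} = V_{U_2}$ to denote it.
According to Figure 3, the permission vectors of the project supervisor, the programmer, and the project member are 0010, 0111, and 0001, respectively. If both Alice and Bob are programmers, the permission vectors of Alice and Bob are both 0111. According to Definition 5, we can get $V_{Alice} = V_{Bob}$ and $V_{projectsupervisor} \ge V_{Alice}$. With the definitions of the permission vector and the permission relation, we can define the permission equality predicate.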
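The permission relation can be checked mechanically by comparing the sets of held permissions (the positions whose bit is 0). The sketch below is one possible reading, with two assumptions: it separates strict "higher" and "lower" from "equal", whereas the relations in the text allow equality, and it reports the case the text calls "not equal" as "incomparable". The function names are hypothetical.

```python
def holds(vector):
    """Return the 1-based positions whose bit is 0 (permission held)."""
    return {i for i, bit in enumerate(vector, start=1) if bit == "0"}


def compare(v1, v2):
    """Classify the relation between two permission vectors of equal length."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    h1, h2 = holds(v1), holds(v2)
    if h1 == h2:
        return "equal"
    if h1 > h2:   # v1 holds everything v2 holds, and more
        return "higher"
    if h1 < h2:   # v2 holds everything v1 holds, and more
        return "lower"
    return "incomparable"  # each side holds something the other lacks


print(compare("0111", "0111"))  # two programmers
print(compare("0010", "0111"))  # project supervisor vs programmer
print(compare("0010", "0001"))  # project supervisor vs project member
```

The last call shows why the relation is only a partial order: the project supervisor and the project member each hold a permission that the other lacks.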

A Construction for EFDSP
We have defined EFDSP in Section 3.3. In this section, we use the efficient hidden vector encryption proposed by Park [4] and the permission vector defined in Section 4 to construct it. Let $H : \{0,1\}^* \to G_2$ and $H_1 : \{0,1\}^* \to G_1$ be two secure cryptographic hash functions, which are modeled as random oracles. Given the security parameter, our construction for EFDSP is as follows. The signature/verification key pair of PS is generated with DSA [7].
(iv) FileTagGeneration($F$, $V$, $PK_{PS}$) → $T_F$: the user first uses the secure cryptographic hash function $H$ to compute the hash value of $F$, then uses its permission level to generate the permission vector $V_u = (v_1, v_2, \ldots, v_n)$ according to Definition 4, and finally uses $PK_{PS}$, which is the public key of PS, to generate the encrypted file tag $T_F$ for $F$ according to (3).
(v) FileQueryTagGeneration: PS first gets the permission level of $u$ from its permission database, then generates a permission query vector for $u$ according to Definition 4, and finally uses its own private key $SK_{PS}$ to generate an encrypted file query tag for $F$ together with its signature. The concrete steps are as follows.
(a) PS gets the permission level of $u$ from its permission database and then generates the permission query vector $V = (v_1, v_2, \ldots, v_n)$ according to Definition 4. Let $I$ be the permission query index set; then $I = \{ i \mid 1 \le i \le n \}$.
(b) PS randomly selects two values from $\mathbb{Z}_p$ and, for each $i \in I$, generates four values in $\mathbb{Z}_p$ according to (4), using two parts of $SK_{PS}$, which is the private key of PS.

If the output of the query equals $H(F)$, then $F$ has already been sent to CS and $u$ has the deduplication permission for $F$. CS can perform deduplication for $u$: it sets $b = 1$, returns the file pointer of $C_F$, and adds the ID of $u$ to the corresponding file entry of $F$; otherwise, it sets $b = 0$ and returns NULL.

Optimization for EFDSP
Since EFDSP can only deduplicate files between users that have the same permissions, it has two shortcomings. Firstly, users with the high permission level can operate the files of users with the low permission level in the actual enterprise setting. However, EFDSP does not allow the cloud storage to perform deduplication between files of a user with high permission level and files of a user with low permission level, which violates the actual permission management in the enterprise setting, and it is not conducive to improving the deduplication efficiency. Secondly, during the generation of the encrypted file query tag in EFDSP, all the permission bits are involved in the computation which increases the computation cost.
In this section, we use the example in Figure 3 to illustrate how to optimize the permission query index subscript set to overcome the above shortcomings of EFDSP. The permission vector of the project supervisor, $V_{projectsupervisor}$, is 0010, and the permission vector of the programmer, $V_{programmer}$, is 0111. According to Definition 5, $V_{projectsupervisor} \ge V_{programmer}$; that is, the permission level of the project supervisor is higher than that of the programmer. Since 0 indicates that the user has a permission and 1 indicates that the user does not, when EFDSP compares the permission level of $U_1$ with that of $U_2$, it only needs to check whether the permissions that are not owned by $U_1$ are owned by $U_2$. If $U_2$ owns none of the permissions that $U_1$ lacks, the permission level of $U_1$ is higher than or equal to that of $U_2$. Otherwise, if $U_2$ owns a permission that $U_1$ lacks, the permission level of $U_1$ does not match that of $U_2$ (either the permission level of $U_1$ is lower than that of $U_2$, or the two levels are unequal). So when EFDSP compares the permission level of $U_1$ with that of $U_2$, it only needs to consider the bits of the permission vector of $U_1$ that are 1. For example, to compare the permission level of the project supervisor with that of the programmer, where $V_{projectsupervisor} = 0010$ and $V_{programmer} = 0111$, all bits of $V_{projectsupervisor}$ are 0 except bit 3, so EFDSP only needs to compare $V_{projectsupervisor}[3]$ with $V_{programmer}[3]$. Both bits are 1, so it can derive that $V_{projectsupervisor} \ge V_{programmer}$ and can determine that the files of the project supervisor can be deduplicated with the files of the programmer that are stored in the cloud storage.
If EFDSP wants to compare the permission level of the project supervisor with that of the project member, where $V_{projectsupervisor} = 0010$ and $V_{projectmember} = 0001$, all bits of $V_{projectsupervisor}$ are 0 except bit 3, so it only needs to compare $V_{projectsupervisor}[3]$ with $V_{projectmember}[3]$. Because $V_{projectsupervisor}[3] = 1$ and $V_{projectmember}[3] = 0$, the permission level of the project supervisor does not match that of the project member, and EFDSP can determine that the files of the project supervisor cannot be deduplicated with the files of the project member that are stored in the cloud storage.
That is to say, EFDSP only considers the bits of the permission vector of the query user that are 1. These bits form the set $J = \{ i \mid V(i) = 1 \}$, which is defined in (7). We call it the permission query index subscript set. If we replace the full index set in the FileQueryTagGeneration algorithm with $J$, then EFDSP can enable the cloud storage to perform deduplication between the files of a user with a high permission level and the files of a user with a low permission level, which improves its efficiency. In addition, to prevent all bits of a permission vector from being 0, the number of bits of the permission vector is required to be 2 more than the number of permissions, and EFDSP reserves the last two bits of the permission vector and sets them to 1.
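The optimized check can be sketched on plaintext bit strings. This only illustrates which positions are compared; in EFDSP the comparison happens inside the hidden vector encryption query, and the function names below are hypothetical.

```python
def query_index_set(query_vector):
    """1-based positions where the querying user's bit is 1 (not held)."""
    return [i for i, bit in enumerate(query_vector, start=1) if bit == "1"]


def may_deduplicate(query_vector, owner_vector):
    """True if the owner's bit is 1 at every position in the query index set."""
    if len(query_vector) != len(owner_vector):
        raise ValueError("vectors must have the same length")
    return all(
        owner_vector[i - 1] == "1" for i in query_index_set(query_vector)
    )


def with_reserved_bits(vector):
    """Append the two reserved bits, both set to 1."""
    return vector + "11"


print(query_index_set("0010"))                           # only bit 3 is compared
print(may_deduplicate("0010", "0111"))                   # supervisor vs programmer
print(may_deduplicate("0010", "0001"))                   # supervisor vs project member
print(query_index_set(with_reserved_bits("0000")))       # reserved bits keep the set non-empty
```

The reserved bits guarantee that the index set is never empty, so a user who holds every permission still produces a query with at least two compared positions.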

Security Analyses for EFDSP
In this section, we analyze EFDSP according to the security requirements discussed in Section 3.5. We analyze the correctness of EFDSP, the security of the encrypted file query tag, which includes unforgeability and indistinguishability, the confidentiality of the encrypted file tag, and the confidentiality of the encrypted file. Finally, we compare EFDSP with SADS [1].

The Correctness Analysis.
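To verify the correctness of EFDSP, we must verify the query process of the encrypted file query tag in EFDSP. In (6), consider the set of indices $i$ in the permission query index set at which the component of the permission vector in the encrypted file tag differs from the component of the permission query vector. If this set is empty, that is, if the two vectors agree on the permission query index set, then the formula in (6) outputs $H(F)$; otherwise, it does not output $H(F)$.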

The Security Analysis.
(1) The Unforgeability of the Encrypted File Query Tag Analysis. In EFDSP, the user passes the authentication of PS and sends $H(F)$ to PS. After receiving $H(F)$, PS first searches the permission database to find the permissions of the user and generates a permission query vector $V = (v_1, v_2, \ldots, v_n)$ for the user in accordance with Definition 4, and then PS uses its own private key to generate the query tag. Since the private key of PS is kept secret, no other party can generate a valid query tag, which ensures the unforgeability of the encrypted file query tag.
(2) The Indistinguishability of the Encrypted File Query Tag Analysis. The encrypted file query tag consists of five tag components (numbered 1 to 5), the permission query index set $J$, and $H(F)$. Since PS randomly selects two values from $\mathbb{Z}_p$ when it generates the query tag, we can regard components 3 and 4 as two random numbers. According to (4), the per-index values satisfy two linear relations that involve the two random values and two parts of the private key of PS, so we can regard these per-index values as random numbers, and then we can regard components 1, 2, and 5 as three random numbers. Because $H$ is a secure cryptographic hash function, we can also regard $H(F)$ as a random number. Although $J = \{ i \mid v_i = 1 \}$ is publicly released, it does not help a probabilistic polynomial time (p.p.t.) adversary distinguish the encrypted file query tag from a random number; moreover, thousands of files with the same permission exist in the cloud storage, which makes $J$ useless for distinguishing the encrypted file query tag from a random number. Thus, we can ensure the indistinguishability of the encrypted file query tag.
(3) The Confidentiality of Encrypted File Tag Analysis. In EFDSP, when a user needs to generate an encrypted file tag $T_F$ for the encrypted file $C_F$, it first uses its own permission level to generate the permission vector $V = (v_1, v_2, \ldots, v_n)$ according to Definition 4 and then uses $PK_{PS}$ to generate $T_F$, where $PK_{PS}$ is the public key of PS. When it computes $T_F$, it randomly selects two numbers from $\mathbb{Z}_p$. $T_F$ has components indexed 1, 2, (3,1), ..., (3,n), (4,1), ..., (4,n), 5, and 6. Since the two selected numbers are random, it is difficult for a p.p.t. adversary to distinguish $T_F$ from a random number, which ensures the confidentiality of $T_F$.
(4) The Confidentiality of the File Analysis. In EFDSP, for any file $F$, $C_F = E_{K_F}(F)$. $K_F$ is generated by the user performing a key generation protocol based on the BLS signature [8] with KGS. Since the protocol is secure, any p.p.t. adversary that does not own $F$ cannot learn $K_F$. We use AES as the encryption algorithm $E$, which is a secure algorithm; therefore, $C_F$ is secure. That is, a p.p.t. adversary who does not own $F$ cannot get $F$ from $C_F$.

The Online Deduplication Oracle Attack Analysis.
In EFDSP, when a user $u_i$ uploads a file $F$ to CS, $u_i$ uses its own permission vector and the public key of PS to generate an encrypted file tag. After that, CS can perform file deduplication only when a user whose permission level is equal to or higher than that of $u_i$ uploads the same file. Assume an adversary $A$, whose permission level is lower than that of $u_i$, uses the file deduplication of CS to launch the online deduplication oracle attack. $A$ first forges some candidate files for $F$ and then asks PS to generate encrypted file query tags for these files. PS uses its own private key, the permission vector of $A$, and the forged files to generate the query tags and gives them to $A$. $A$ sends these query tags to CS and observes whether CS performs file deduplication for the uploaded files in order to learn information about $F$. Because the permission level of $A$ is lower than that of $u_i$, CS does not perform file deduplication for these files and asks $A$ to upload them. In the end, $A$ cannot get any information about $F$ from CS. So EFDSP prevents the adversary from launching the online deduplication oracle attack.

Comparison with SADS.
Since SADS is the only existing encrypted file deduplication scheme with permission, we will compare EFDSP with SADS from the following aspects.
(i) In SADS [1], each permission is represented by a private key, and if a user has $n$ permissions, it needs to keep $n$ private keys secretly. However, in EFDSP, the user permissions are managed by a permission server, and the user only needs to store its own permission vector and the public key of the permission server.
(ii) In SADS, when a user uploads a file or queries a duplicate file, if the user is assigned $n$ permissions, the system needs to use $n$ private keys to generate $n$ encrypted file tags for the file. So the space complexity of the network traffic of this scheme is $O(n)$. In EFDSP, the encrypted file tag of $F$ has components indexed 1, 2, (3,1), ..., (3,n), (4,1), ..., (4,n), 5, and 6, so when a user uploads a file $F$, the space complexity of its network traffic is $O(n)$. However, the query tag of $F$ consists of five tag components, an index set, and $H(F)$, which has nothing to do with the number of permissions $n$. So when a user queries the duplicate file of $F$, EFDSP requires constant network traffic.
(iii) SADS has a security weakness, while EFDSP has overcome it. We use the example of the attack against SADS in Section 1 to show how EFDSP prevents such an attack. In EFDSP, a 5-bit vector represents the permissions. The first bit of the vector represents the permission of department A, the second bit represents the permission of department B, the third bit represents the permission of financial management, and the fourth and fifth bits are reserved and set to 1. Mike has the permissions of department A and department B, and because he is also responsible for financial management, he has the permission of the finance department, so his permission vector is 00011. Bob is an employee of department B, so his permission vector is 10111.
If Mike uploads the payslip file of Alice to the cloud storage, Mike uses the permission vector 00011 and the public key of the permission server to generate the encrypted file tag and uploads the encrypted file tag and the encrypted file to the cloud storage server. Both Bob and Alice are employees of department B, and Bob wants to know the salary of Alice. Since the payslip file has a fixed format and is a low-entropy file, Bob knows the file format or may even have such a file in hand, i.e., his own payslip. He also knows that the salary of Alice is between 4000 and 4100; he just does not know the exact value. Bob can set the salary item to 4000, 4001, ..., 4100 and forge 100 payslip files, and then upload the 100 files to the cloud storage one by one to trigger file deduplication and launch the online deduplication oracle attack. However, under EFDSP, before uploading these files Bob must obtain query tags for them and send the query tags to the cloud storage server. Since the permission level of Bob is lower than that of Mike, even if Bob uses the same file as Mike to generate the query tag, the cloud storage server does not perform file deduplication because of the permission level mismatch. Bob needs to upload all 100 files to the cloud storage, so he cannot tell which of his uploaded files is the real one; that is, Bob does not learn the wage information of Alice.
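The example can be replayed in a toy model. The sketch below assumes that the store sees file digests and permission vectors in the clear and applies the bit rule described in Section 6; EFDSP performs the equivalent check on encrypted tags. The payslip format, the salary 4057, and all names are hypothetical.

```python
import hashlib

MIKE = "00011"  # departments A and B plus finance; last two bits reserved
BOB = "10111"   # department B only


def permits(query_vector, owner_vector):
    """The owner's bit must be 1 wherever the querying user's bit is 1."""
    return all(o == "1" for q, o in zip(query_vector, owner_vector) if q == "1")


class PermissionAwareStore:
    def __init__(self):
        self.entries = []  # (file digest, owner permission vector)

    def upload(self, vector, content):
        """Return True if deduplicated, False if the file had to be stored."""
        digest = hashlib.sha256(content).digest()
        for stored_digest, owner_vector in self.entries:
            if stored_digest == digest and permits(vector, owner_vector):
                return True
        self.entries.append((digest, vector))
        return False


def make_payslip(name, salary):
    # Assumed fixed, low-entropy payslip format.
    return f"PAYSLIP;name={name};salary={salary}".encode()


store = PermissionAwareStore()
store.upload(MIKE, make_payslip("Alice", 4057))

# Bob tries every candidate salary in the inclusive range 4000..4100.
responses = [store.upload(BOB, make_payslip("Alice", s)) for s in range(4000, 4101)]
print(any(responses))

# A user at Mike's level still benefits from deduplication.
print(store.upload(MIKE, make_payslip("Alice", 4057)))
```

Every one of Bob's uploads is answered the same way, so the deduplication response no longer reveals which candidate matches the stored payslip.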

Experiments
The experimental system is composed of four PCs, which simulate the client, the permission server, the key generation server, and the cloud storage server. We use three kinds of files (txt, doc, and mp3) as the test data set, as shown in Table 1. We measure the computation costs of encrypted file tag generation, query tag generation, file encryption, duplicate file checking, and file transmission in EFDSP. We conduct experiments on four aspects (file size, file number, file duplication rate, and the number of users with the same permission) to analyze the performance of EFDSP; all experimental results are averages over 10 runs.
(1) The Performance Effect of File Size on EFDSP. As the file size affects encrypted file tag generation and file encryption in the deduplication scheme, we first test the performance effect of file size on EFDSP. We upload seven files of different sizes from file set 1 and file set 2 and record the time spent in each step. As the seven files are all different, CS does not perform deduplication; the results are shown in Figure 4. From the figure, we can see that file size has a great effect on key generation, encrypted file tag generation, and encryption, whose costs grow linearly with the file size.
(2) The Performance Effect of the File Number on EFDSP. We select 10 different files from file set 3 to perform 10 groups of experiments; before each group, we reinitialize the system so that no encrypted file deduplication takes place. In experiment group 1, we upload one file; in group 2, we upload two files; each subsequent group adds one more file, so that in group 10 we upload all 10 files. When each file group is uploaded, we record the time spent on each step. Figure 5 shows the effect of the file number on each step.
(3) The Performance Effect of File Repetition Rate on EFDSP. To evaluate the performance effect of the file repetition rate, we divide file set 3 into two different test sets, each containing 10 files of 10 MB. In each experiment, we first upload all the files in the first test set. In the second upload, we upload another 10 files: a share of them given by the repetition rate is selected from the first test set, and the remaining files are selected from the second test set. We then record the time spent on each step of the second upload. The experimental results are shown in Figure 6. From the figure, we can see that the time spent by EFDSP decreases as the file repetition rate increases. When the file repetition rate reaches 100%, it is not necessary to encrypt and upload any files. In that case, the time required to complete the 10 files of 10 MB is 32.98% of the time required when the repetition rate is 10%.
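The construction of the second upload batch in experiment (3) can be sketched as follows. This is a minimal illustration, not the authors' test harness: the function name `build_second_upload`, the file names, and the choice to take duplicates from the front of the first test set are all assumptions, since the text does not say which files are picked.

```python
def build_second_upload(first_set, second_set, repetition_percent):
    """Pick the files for the second upload round.

    Assumption: duplicates come from the front of first_set and fresh files
    from the front of second_set; the selection order is not specified in
    the text.
    """
    total = len(first_set)
    duplicates = total * repetition_percent // 100
    return first_set[:duplicates] + second_set[: total - duplicates]


# Hypothetical file names standing in for the two test sets of 10 files each.
first = [f"set1_file{i}" for i in range(10)]
second = [f"set2_file{i}" for i in range(10)]

batch = build_second_upload(first, second, 30)
print(sum(name in first for name in batch), "of", len(batch), "files are duplicates")
```

At a repetition rate of 100% the batch equals the first test set, which corresponds to the case where no file needs to be encrypted or uploaded again.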

(4) The Performance Comparison between EFDSP and SADS. SADS is the only existing encrypted file deduplication scheme with permission; to compare the performance of EFDSP and SADS, we perform the following experiment. We select 10 files of 10 MB from file set 3 as the test set and set up 6 users. We regard the first user as the file owner, who uploads the 10 files to the cloud storage server first; the 10 files of each of the other five users are exactly the same as the first user's files. We then configure the permissions of these users on the permission server so that, in successive experiments, one, two, three, four, and five of the other users have the same permission as the first user. The experimental results are shown in Figure 7.
The experimental results show that EFDSP is less efficient than SADS, but the gap narrows as the number of authorized users increases; moreover, EFDSP fixes the security weakness of SADS.

Related Works
Quinlan et al. proposed file deduplication to improve the storage space utilization of their document network storage system [10]. Applying file deduplication directly to cloud storage introduces security issues for the stored files. To demonstrate these security issues, Harnik et al. proposed three different kinds of attack methods [11].
To prevent these deduplication attacks, Halevi et al. proposed proof of ownership (POW) [12]. Some researchers have extended POW by improving its efficiency [13,14]. However, these POW schemes cannot prevent attacks against small-entropy files, so it is unrealistic to prevent all deduplication attacks in cloud storage by using POW alone. Because encrypting the same file with different keys produces different encrypted files, the cloud storage server cannot deduplicate encrypted files, so file encryption and file deduplication are incompatible to some extent. To solve this issue, Douceur et al. proposed convergent encryption [15].
In convergent encryption, the key is computed by applying a hash function to the file being encrypted, so different users who encrypt the same file obtain the same encrypted file. Storer et al. proposed a block-level encrypted file deduplication scheme in which the encryption key of each file block is determined by the contents of the block [16], but their scheme has difficulty resisting brute-force attacks. Bellare et al. proposed message-locked encryption based on convergent encryption [17] and designed a deduplication key generation protocol based on RSA signatures [18], but the efficiency of their key generation protocol is low because of the RSA signature. Armknecht et al. designed a server-assisted key generation protocol using BLS signatures that overcomes the shortcomings of the Bellare protocol [8].
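The content-derived key described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the construction of [15]: the key is the SHA-256 hash of the file, and a hash-based keystream stands in for a real cipher purely to keep the example self-contained; the function names are hypothetical.

```python
import hashlib


def content_key(data):
    """Derive the encryption key from the file content itself."""
    return hashlib.sha256(data).digest()


def keystream(key, length):
    """Toy keystream (SHA-256 in counter mode); a stand-in for a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(data):
    """Return (key, ciphertext); both are fully determined by the content."""
    key = content_key(data)
    ciphertext = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
    return key, ciphertext


def decrypt(key, ciphertext):
    """XOR with the same keystream recovers the plaintext."""
    stream = keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))


# Two users encrypt the same file independently and get the same ciphertext.
key_a, ct_a = encrypt(b"quarterly report")
key_b, ct_b = encrypt(b"quarterly report")
print(ct_a == ct_b)
print(decrypt(key_a, ct_a))
```

The same determinism that lets the server deduplicate ciphertexts also lets an attacker encrypt candidate plaintexts and compare the results, which is why small-entropy files are vulnerable to brute-force guessing.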
Xu et al. designed a secure client-side encrypted file deduplication scheme for cloud storage [19]. Subsequently, Kaaniche et al., Stanek et al., and Puzio et al. each proposed an encrypted file deduplication scheme for cloud storage [20][21][22]. To tackle the problem of encrypted file deduplication without relying on a trusted key generation server, Liu et al. [23] and Dang et al. [24] each proposed a secure encrypted file deduplication scheme that does not require additional servers, but their schemes do not support file permission. Li et al. proposed an encrypted file deduplication scheme that supports fuzzy search [25]. In all of the above-mentioned schemes, the user participates in encrypted file deduplication passively. Li et al. proposed an encrypted file deduplication scheme based on a hybrid cloud server that supports deduplication authorization [1], but user permission key management in their scheme is cumbersome, it requires relatively large storage space and network traffic, its authorization granularity is coarse, and it has a security weakness.

Conclusions
An enterprise can reduce its business costs by storing its files in cloud storage. In the enterprise application environment, all files have permissions. If the cloud storage uses an encrypted file deduplication scheme without permission support, it will destroy the enterprise file permissions and give rise to security issues. To solve this problem, Li et al. proposed a secure encrypted file deduplication scheme with permission based on a hybrid cloud, but their scheme has a security weakness. In this paper, we design an encrypted file deduplication model and construct an encrypted file deduplication scheme with permission (EFDSP) using the permission vector and HVE, and we optimize its performance. We analyze the security and the performance of EFDSP, and the results show that EFDSP satisfies the security requirements defined in Section 3.5. We implement EFDSP and conduct a performance evaluation. The experimental results show that the performance of EFDSP is slightly worse than that of SADS; however, the performance gap decreases as the number of authorized users increases. At the same time, EFDSP overcomes the security weakness of SADS. Liu et al. [23] and Dang et al. [24] each proposed a secure encrypted file deduplication scheme that does not rely on a trusted key generation server, but their schemes do not support file permission in deduplication. We will introduce their techniques into EFDSP in future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest related to this paper.