B+-Tree Based Multi-Keyword Ranked Similarity Search Scheme Over Encrypted Cloud Data

With the sustained evolution and rapid popularization of cloud computing, an ever-increasing number of individuals and enterprises are encouraged to outsource data to cloud servers to reduce management overhead and ease access. Privacy requirements demand that sensitive information be encrypted before outsourcing, which, on the other hand, diminishes the usability of the data and makes many efficient keyword search techniques designed for plaintext inapplicable. In this paper, we propose a secure multi-keyword ranked search scheme based on document similarity to address this problem. To support multi-keyword search and the ranking of search results, we adopt the vector space model and the TF-IDF model to generate index and query vectors. By introducing the secure kNN computation, index and query vectors can be encrypted to prevent cloud servers from obtaining sensitive frequency information. To improve efficiency, we adopt the $B^{+}$-tree as the basic structure of the index and construct a similar document collection for each document. Owing to this index structure, the search is more efficient than linear search. Extensive experiments on a real-world document collection demonstrate the feasibility and efficiency of the proposed solution.


I. INTRODUCTION
Cloud computing [1] has achieved extraordinary development over the past decade, in both the academic and industrial communities [2]. Moreover, it has been regarded as a brand-new model of technology infrastructure that is capable of organizing unlimited storage space and powerful computing capabilities, and of enabling users to enjoy pay-as-you-go, convenient services from a shared pool of configurable computing resources with excellent efficiency and minimal management overhead [3]-[5]. In addition, it can decrease the capital expenditure on hardware, software and personnel maintenance [22]. Hence, enterprises and individuals tend to outsource data to cloud servers because of these advantages [6].
Despite the tremendous advantages of cloud services, privacy concerns raised by outsourcing data, especially sensitive data (e.g., emails, personal travel data, and company transaction records), to cloud servers restrict the promotion and popularization of the emerging model. Cloud data may be misused by cloud service providers (CSPs) in an unauthorized way, even maliciously, since data owners are no longer directly in control of their data [24]. To achieve more effective application and broader deployment of cloud computing [8], [14], [15], data security and privacy are indispensable considerations that must be well addressed to avoid the monetary loss or damage to reputation that arises from cloud data leakage [9]. The general approach to protecting data confidentiality is cryptographic, such as encrypting data before outsourcing [10]. However, such methods increase the difficulty of data utilization, since many technologies applied to plaintext data, such as keyword-based information retrieval, are no longer suitable for ciphertext. Furthermore, downloading and decrypting all cloud data is infeasible, especially for large amounts of data [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Sedat Akleylek.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To decrease the impact of encryption on data availability, considerable effort has been put into devising efficient mechanisms for searching over encrypted cloud data. Some general-purpose methodologies based on fully homomorphic encryption [12] and oblivious RAMs [13] have been proposed to address the above problem, but the computation and communication overhead of these schemes is not acceptable for either cloud servers or users. Fortunately, many special-purpose methodologies based on searchable encryption (SE) have been put forward to satisfy different query requirements. However, among the schemes that have been proposed, single keyword search lacks the intelligence to support complex query demands, and the network traffic overhead of boolean search is excessive [16], [26]. In contrast, multi-keyword ranked search receives increasing attention due to its better practicability. Recently, some constructive schemes based on multi-keyword ranked search have been proposed to support intelligent and economical queries over encrypted cloud data. However, in most cases, these methodologies cannot simultaneously satisfy the requirements of search efficiency and data privacy protection. To address the problems mentioned above, in this paper we propose a secure and efficient multi-keyword ranked search scheme based on a B+-tree index, a structure that has been extensively applied in database systems. To support multi-keyword search, we combine the vector space model and the TF-IDF model in the process of generating index and query vectors. In addition, to improve the query efficiency for a better quality of experience, we incorporate the cosine similarity measure [17], [18], [21] into the index structure.
Owing to the particular structure of our index, the search scheme proposed in this paper achieves better-than-linear time overhead. Moreover, while preserving the accuracy of relevance score calculation between query vectors and index vectors, we introduce the secure kNN (k-nearest neighbour) computation [19], [20] to encrypt the vectors so as to resist statistical attacks from cloud servers. To defend against attacks launched by cloud servers under different threat models, we design two secure index schemes, i.e., the basic similarity-based multi-keyword ranked search (BSMRS) scheme and the enhanced similarity-based multi-keyword ranked search (ESMRS) scheme. The former guarantees the confidentiality of index and query vectors; the latter prevents cloud servers from obtaining sensitive frequency information and thus satisfies more stringent privacy protection requirements. Our contributions are summarized as follows:
1) We design a searchable encryption scheme that not only supports accurate multi-keyword ranked search but also ensures data privacy with little relevance score information leakage.
2) By incorporating the cosine similarity measure and constructing the keyword index tree based on the B+-tree, the search efficiency of the proposed scheme is improved significantly compared with [39] and [53].
3) Extensive experimental results demonstrate the feasibility and efficiency of the proposed scheme.
The rest of the paper is organized as follows. Section II introduces the related work. Then, we briefly introduce preliminaries, the system model, threat models, and design goals in Section III, followed by Section IV, which gives the specification of our schemes. Section V presents the security analysis. Experiments and performance evaluation are presented in Section VI. Section VII concludes the paper.

II. RELATED WORK
Searchable encryption (SE) has been extensively studied with the aim of formalizing security definitions and improving efficiency. It enables clients to outsource data in encrypted form to cloud servers and conduct keyword search over ciphertext. In accordance with differences of cryptography primitives, searchable encryption can be divided into public key searchable encryption [29], [55]- [58] and symmetric searchable encryption [27], [28], [30]. On the ground of the expensive computational overhead of public key searchable encryption, this paper mainly pays attention to symmetric searchable encryption.

A. SINGLE KEYWORD SEARCH
The first symmetric searchable encryption (SSE) scheme was proposed by Song et al. [27]. The cloud server in their scheme needs to traverse the entire document to determine whether it contains a specific keyword. Thus time complexity of search is linearly related to the number of documents in collection. Goh [28] proposed a standardized description of the security definition of SSE and constructed a secure index architecture on the basis of pseudo-random functions and Bloom filter to resist adaptive chosen keyword attack. However, the time complexity of their scheme is O(n). To further enhance security and search efficiency, SSE-1 and SSE-2 based on the inverted list were proposed by Curtmola et al. [30]. Such two schemes are more efficient than other works and can resist chosen-keyword attack and adaptive chosen-keyword attack respectively. However, the functionality of most of the above schemes is restricted to single keyword search.

B. MULTI-KEYWORD BOOLEAN SEARCH
To improve the query experience and enrich search functionality, many studies [23], [31]-[38] have been carried out to achieve multi-keyword boolean search, which enables users to retrieve the most appropriate documents by entering several query keywords. In conjunctive keyword search schemes [23], [31], [32], [38], only documents containing all keywords are returned. Among these works, the communication overhead of the scheme proposed by Golle et al. [31] is linear in the number of documents, and the scheme proposed by Cash et al. [38] supports large databases. Unlike conjunctive keyword search, disjunctive keyword search schemes [33], [34] return all documents containing one or more query keywords. To support conjunctive and disjunctive keyword search simultaneously, predicate search schemes were proposed [35]-[37]. However, these schemes are still unsatisfactory, since the search results depend only on whether keywords exist in a document, and thus they cannot provide a satisfactory result ranking functionality [39].
Consequently, some works have been proposed to handle multi-keyword ranked search with the advantage of bandwidth-saving.

C. MULTI-KEYWORD RANKED SEARCH
Because it enables more efficient and convenient search, multi-keyword ranked search is extensively used in the field of information retrieval, where it allows the most relevant documents to be retrieved in a short period of time. It estimates the relevance between query keywords and documents, and sends the top-k most relevant documents to users. Therefore, it can effectively reduce the communication overhead. Cao et al. [40] proposed a privacy-preserving multi-keyword ranked search scheme and demonstrated the security of the scheme. The searchable index in their scheme is constructed on the basis of the vector space model [41], and ''coordinate matching'' is selected as the similarity measure. The scheme is capable of ranking search results by the number of matched keywords. However, the search time is linear in the number of documents in the collection, since the cloud server must traverse the indexes of the whole document collection to determine the number of matched keywords for each query. Moreover, because the importance of different keywords is not considered, precision is lost. The vector space model and the TF-IDF model are combined in the multi-keyword ranked search scheme with better-than-linear search time complexity proposed by Sun et al. [5]. Moreover, the authors incorporate the cosine similarity measure into the index to provide similarity-based ranking. Although the efficiency is improved, the scheme is not accurate enough and is vulnerable in protecting data privacy. The scheme proposed by Orencik et al. [42] clusters similar documents by using LSH (locality-sensitive hashing) functions. The algorithm is appropriate for similarity search, but its ranking accuracy is not sufficient. Drawing on previous research methods and indicators, Xia et al. [39] proposed a ''Greedy Depth-first Search'' algorithm on the basis of a tree-based index. The efficiency of the scheme is better than that of earlier works and the precision is excellent. However, the overhead of search and the time complexity of trapdoor generation remain high. Zhang et al. [43] and Zhong et al. [3] each put forward a multi-keyword ranked search scheme, but the efficiency is not ideal.

III. PROBLEM FORMULATION

A. COSINE SIMILARITY MEASURE
In this paper, we adopt the cosine similarity measure [5], [25], [44] to calculate the similarity between plaintext documents represented as vectors. The closer the cosine value is to 1, the higher the similarity between two documents. The similarity between documents is calculated as follows:
$$\cos(P, V) = \frac{\sum_{i} P_i V_i}{\sqrt{\sum_{i} P_i^{2}}\,\sqrt{\sum_{i} V_i^{2}}} \tag{1}$$
where $P$ and $V$ are the vectors of two documents and $P_i$, $V_i$ denote their components.
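The cosine measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `cosine_similarity` and the zero-vector convention are assumptions, and the sample vectors are borrowed from the worked search example that accompanies the index-tree figure later in the paper.

```python
import math

def cosine_similarity(p, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(p) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(p, v))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_p == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_p * norm_v)

# Vectors taken from the paper's index-tree example (query and two node vectors).
query = (0.6, 0.2, 0.1, 0.6)
node_a = (0.5, 0.6, 0.2, 0.6)
node_b = (0.8, 0.6, 0.9, 0.4)

# The node whose vector points in a direction closer to the query wins.
closer = max((node_a, node_b), key=lambda x: cosine_similarity(x, query))
print(closer)
```

Because the measure depends only on direction, scaling either vector by a positive constant leaves the result unchanged.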

B. VECTOR SPACE MODEL AND TF-IDF MODEL
The vector space model, in combination with the TF-IDF model, is extensively employed to support efficient multi-keyword ranked search in the field of plaintext information retrieval [41], [45]. TF (term frequency) is used to evaluate the importance of a specific term (keyword) in a document: the more times a word appears in a document, the more important it is to this document. IDF (inverse document frequency) is used to measure the ability of a keyword to distinguish documents. If a keyword appears frequently in a document but rarely in other documents, the keyword discriminates well between documents. In the vector space model, each document is represented as a vector $V_u$, which is composed of the normalized TF values, in the corresponding document, of the keywords in the dictionary $W$. Similarly, each query is represented as a vector $V_q$ whose elements are the normalized IDF values of the query keywords. The dimensionality of index and query vectors equals the total number of keywords in the dictionary, and the relevance of a query to a document is quantitatively evaluated by the dot product of $V_u$ and $V_q$. The definition of the relevance computation function [39] is as follows:
$$\mathrm{Score}(V_u, V_q) = V_u \cdot V_q = \sum_{w_i \in W_q} TF_{w_i} \times IDF_{w_i} \tag{2}$$
where $W_q$ is the set of query keywords, $TF_{w_i}$ is the normalized TF value of keyword $w_i$, and $IDF_{w_i}$ is the normalized IDF value of keyword $w_i$.

If $u$ is an internal node of the index tree $I$, $TF_{w_i}$ is computed from the index vectors of its child nodes; if $u$ is a leaf node, it is computed from the index vectors of its document records. If $u$ is a document record, $TF_{w_i}$ is the normalized term frequency of $w_i$ in that document. In the query vector $V_q$, $IDF_{w_i}$ is the normalized inverse document frequency computed as in [46] from $N_{w_i}$, the number of documents that contain keyword $w_i$, and $N_d$, the total number of documents.
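To make the construction of index and query vectors concrete, the following sketch builds normalized TF and IDF vectors and evaluates the dot-product relevance score. The raw weights used here, $1 + \ln(\text{count})$ for TF and $\ln(1 + N_d / N_{w_i})$ for IDF, followed by division by the Euclidean norm, are an assumption borrowed from common practice in the cited literature, not formulas confirmed by this text. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def normalize(values):
    """Scale a vector to unit Euclidean length (all-zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in values))
    return [x / norm if norm else 0.0 for x in values]

def tf_vector(tokens, dictionary):
    """Index vector of one document: normalized TF per dictionary keyword.
    Assumed raw weight: 1 + ln(count) for keywords that occur, else 0."""
    counts = Counter(tokens)
    raw = [1.0 + math.log(counts[w]) if counts[w] else 0.0 for w in dictionary]
    return normalize(raw)

def idf_vector(query_keywords, dictionary, documents):
    """Query vector: normalized IDF for query keywords, 0 elsewhere.
    Assumed raw weight: ln(1 + N_d / N_w)."""
    n_d = len(documents)
    raw = []
    for w in dictionary:
        n_w = sum(1 for tokens in documents if w in tokens)
        in_query = w in query_keywords and n_w > 0
        raw.append(math.log(1.0 + n_d / n_w) if in_query else 0.0)
    return normalize(raw)

def relevance(v_u, v_q):
    """Relevance score: dot product of index vector and query vector."""
    return sum(a * b for a, b in zip(v_u, v_q))

# Toy collection (hypothetical data).
documents = [["cloud", "search", "cloud"], ["index", "tree"], ["cloud", "index"]]
dictionary = sorted({w for tokens in documents for w in tokens})
v_q = idf_vector({"cloud"}, dictionary, documents)
scores = [relevance(tf_vector(tokens, dictionary), v_q) for tokens in documents]
print(scores)
```

In this toy run the document that mentions the query keyword most prominently receives the highest score, and a document without the keyword scores zero.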

C. KEYWORD B+-TREE
The B+-tree [47] is one of the most widely used index structures for database systems and data-manipulation applications [48]. Solutions to the B+-tree are also often applied to other tree-like index structures. The keyword B+-tree stores data only in leaf nodes, which do not have children, while internal nodes store index vectors and pointers to the corresponding child nodes. The retrieval time of an index structure based on the B+-tree is proportional to the height of the tree. Compared with the red-black tree and the binary tree, the height of the B+-tree is lower. Therefore, we use the B+-tree to construct our index structure. A node $u$ of the tree carries the fields ID, $V_u$, $S$ and child. If $u$ is a document record, ID stores the document identity, $S$ is composed of the ID and $V_u$ of the $K$ documents most similar to the current document in the document collection $D$, and child is set to null. If $u$ is a leaf node or an internal node, ID and $S$ are set to null. If $u$ is a leaf node, $V_u$ is a vector of normalized TF values calculated from the index vectors of the document records it holds; if $u$ is an internal node, $V_u$ is calculated from the index vectors of its child nodes, whose number is bounded by $N$, the order of the B+-tree. The construction procedure, denoted as IndexGen(D, K), is explained in detail in Section IV.
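A possible in-memory shape for such nodes is sketched below. The field names mirror the description (ID, $V_u$, $S$, child), but the class layout and the helper names are hypothetical. The aggregation rule for leaf and internal nodes is an assumption: the element-wise maximum over the children's vectors, as used by the tree index of [39]. It is at least consistent with the paper's example, where the node vector (0.5, 0.6, 0.2, 0.6) dominates the record vector (0.5, 0.3, 0, 0.6) in every coordinate.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    vector: List[float]                      # V_u
    children: List["Node"] = field(default_factory=list)  # empty for a document record
    doc_id: Optional[int] = None             # ID, set only for document records
    # S: (ID, V_u) pairs of the K most similar documents, records only
    similar: List[Tuple[int, List[float]]] = field(default_factory=list)

def aggregate(children):
    """Assumed rule: each coordinate is the maximum over the children."""
    return [max(column) for column in zip(*(c.vector for c in children))]

def make_parent(children, order):
    """Build a leaf or internal node over at most `order` children."""
    if not 1 <= len(children) <= order:
        raise ValueError("a node holds between 1 and `order` children")
    return Node(vector=aggregate(children), children=list(children))

# Two document records grouped under one leaf of a tree of order 3.
d1 = Node(vector=[0.5, 0.3, 0.0, 0.6], doc_id=1)
d2 = Node(vector=[0.1, 0.6, 0.2, 0.1], doc_id=2)
leaf = make_parent([d1, d2], order=3)
print(leaf.vector)
```

With the maximum rule, the score of a node is an upper bound on the score of anything beneath it whenever the query vector is non-negative, which is what makes a top-down descent meaningful.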

D. THE SECURE kNN COMPUTATION
The secure kNN (k-nearest neighbour) computation, which was proposed by Wong et al. [19], is designed to calculate the Euclidean distance between a database record and a query vector and then select the k nearest database records. In the secure kNN computation, the secret key $K$ is composed of a randomly generated $m$-bit vector $S$ and two $(m \times m)$ invertible matrices $\{M_1, M_2\}$, where $S$ is regarded as a splitting indicator and $\{M_1, M_2\}$ are used to encrypt database records and query vectors, both of which are extended to $m$-dimensional vectors. The specific encryption process is introduced in Section IV. More details of the secure kNN computation can be found in [19].
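The mechanics of the key, the split and the two matrix encryptions can be sketched as follows. This is an illustration under stated assumptions rather than the scheme's reference code: the split convention (index vectors are split where $S[j] = 1$, query vectors where $S[j] = 0$), the trapdoor form $\{M_1^{-1} V_q', M_2^{-1} V_q''\}$ and all function names are assumptions in the spirit of [19]. Exact rational arithmetic (`fractions.Fraction`) is used only so that the score identity can be checked without rounding error, and the key tuple also caches the matrix inverses for convenience.

```python
import random
from fractions import Fraction

def mat_vec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def inverse(M):
    """Gauss-Jordan inverse over exact fractions; raises if M is singular."""
    n = len(M)
    A = [list(row) + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

def key_gen(m, rng):
    """K = {S, M1, M2}; each matrix is stored together with its inverse."""
    S = [rng.randint(0, 1) for _ in range(m)]
    mats = []
    while len(mats) < 2:
        M = [[Fraction(rng.randint(-9, 9)) for _ in range(m)] for _ in range(m)]
        try:
            mats.append((M, inverse(M)))
        except ValueError:
            pass  # singular draw, try again
    return S, mats[0], mats[1]

def split(vec, S, split_bit, rng):
    """Split vec into two shares: random shares summing to vec[j] where
    S[j] == split_bit, identical copies of vec[j] elsewhere."""
    a, b = [], []
    for x, s in zip(vec, S):
        if s == split_bit:
            r = Fraction(rng.randint(-1000, 1000), 100)
            a.append(r)
            b.append(x - r)
        else:
            a.append(x)
            b.append(x)
    return a, b

def encrypt_index(v, key, rng):
    S, (M1, _), (M2, _) = key
    a, b = split(v, S, 1, rng)
    return mat_vec(transpose(M1), a), mat_vec(transpose(M2), b)

def trapdoor(q, key, rng):
    S, (_, M1_inv), (_, M2_inv) = key
    a, b = split(q, S, 0, rng)
    return mat_vec(M1_inv, a), mat_vec(M2_inv, b)

def encrypted_score(index_ct, trap):
    return (sum(x * y for x, y in zip(index_ct[0], trap[0]))
            + sum(x * y for x, y in zip(index_ct[1], trap[1])))
```

The design point worth noticing is that the two split conventions are complementary: at every position exactly one of the two vectors is split, so the shares recombine to the plain product, while the matrices cancel in pairs ($M_1^{T}$ against $M_1^{-1}$, $M_2^{T}$ against $M_2^{-1}$).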

E. THE SYSTEM MODEL
As shown in Figure 1, the data owner, the data user and the cloud server are the three entities considered in this paper.
Data owner needs to construct a dictionary W, which is composed of distinct keywords extracted from document collection D before outsourcing so that the data availability can be maintained while protecting data privacy. And then, with the dictionary and document collection, an unencrypted index tree can be constructed. Finally, the data owner encrypts the document collection and index tree and outsources encrypted form of them to the cloud server.
Data user is able to obtain authorization to access a particular document from the data owner. In accordance with the search control mechanisms, the data user can generate a trapdoor $T$ with $t$ query keywords, and $k$ encrypted documents will be returned after the trapdoor is uploaded to the cloud server. Finally, with the shared secret key, the data user can decrypt the documents.
Cloud server is responsible for storing the encrypted document collection D and index tree I. After acquiring the trapdoor T , search is executed by the cloud server over the encrypted index tree I. To improve the retrieval accuracy and decrease network traffic, the cloud server ranks search results and only the top-k most relevant documents are returned to the data user.

F. THREAT MODELS
In this paper, we treat the data owner and the data user as entities that can be fully trusted, but the cloud server is regarded as ''honest-but-curious'', which reflects the view taken in most related works on secure search over encrypted cloud data [49]-[51]. ''Honest'' is defined as executing the instructions of the designated protocol correctly. ''Curious'' refers to inferring and analyzing the data received to gain additional insight. The threat models adopted in this paper are the two suggested by Cao et al. [40]. They differ primarily in terms of the information available to the cloud server.
Known ciphertext model. Information that is available to the cloud server in this model is restricted to encrypted document collection D, encrypted index tree I and encrypted query vector, i.e., trapdoor T . In other words, the attack that the cloud server can conduct is just ciphertext-only attack.
Known background model. The cloud server that utilizes this stronger model possesses a greater degree of knowledge, e.g., term frequency of a specific keyword, the correlation of trapdoors submitted by the data user and related statistical information of documents. The cloud server has the ability to deduce or even identify a keyword in a query with knowledge above [52].

G. DESIGN GOALS
The requirements that need to be satisfied include the following three aspects:
Accuracy-improved multi-keyword ranked search. Accurately retrieving the documents required by the data user is the most basic requirement. The scheme is not feasible if the documents returned by the cloud server are completely inconsistent with the expectation of the data user.
Search efficiency. The efficiency objective of the scheme is to reduce the search time complexity to better than linear by using the B+-tree as the index structure and constructing a similar document collection $S$ for each document.
Privacy-preserving. The document collection and trapdoor information involve privacy, so the scheme must take appropriate measures to prevent the cloud server from obtaining relevant information. The privacy protection requirements of main concern are the following:
• Index and query confidentiality. The cloud server must be adequately prevented from obtaining plaintext information about index vectors and trapdoors.
• Trapdoor unlinkability. The cloud server should not be able to identify whether two trapdoors are from the same query or not.
• Keyword privacy. The cloud server should not be able to infer whether a certain keyword is included in a query.
It is worth noting that protecting the access pattern, i.e., the sequence of documents returned to the data user, is not a design objective of the scheme, for efficiency reasons.

IV. THE PROPOSED SCHEMES
In this section, we first describe the basic similarity-based multi-keyword ranked search (BSMRS) scheme, which guarantees the confidentiality of index and query. To defend against attacks under a stronger threat model, i.e., the known background model, we propose a more secure scheme, i.e., the enhanced similarity-based multi-keyword ranked search (ESMRS) scheme.

A. BSMRS SCHEME
By introducing the secure kNN computation [19], the BSMRS scheme can be configured to satisfy privacy requirements within the known ciphertext model. Following are detailed descriptions of each algorithm in the scheme.
• K ← KeyGen(m) The algorithm is executed by the data owner to generate the secret key $K$, which includes a randomly generated $m$-bit secret vector $S$ and two $(m \times m)$ invertible matrices $M_1$ and $M_2$. The elements of $S$ are 0 or 1. Namely, $K = \{S, M_1, M_2\}$. The formal process is presented in Algorithm 1.
• I ← IndexGen(D, K) The algorithm is used to construct the encrypted index tree $I$. Figure 3 illustrates an index tree. It is worth noting that all data is stored in leaf nodes and ordered by key, so a splitting operation needs to be executed during insertion to preserve the order. The formal description of insertion is presented in Algorithm 4 and an example is shown in Figure 4. The encryption process is as follows: first, the data owner splits every index vector $V_u$ into two random vectors $\{V_u', V_u''\}$ according to the splitting indicator $S$, as illustrated in Figure 2, and each $u$ stores two encrypted index vectors $\{M_1^{T} V_u', M_2^{T} V_u''\}$. The formal process is presented in Algorithm 3.
• R ← Search(T, k, u) With the trapdoor $T$, the cloud server can calculate the relevance score between a node $u$ of the encrypted index tree $I$ and the query vector $V_q$ as in formula (2). Therefore, upon obtaining the trapdoor $T$, the cloud server performs the designated search operation (Algorithm 5, Search(T, k, u)) over the encrypted index tree $I$. The search benefits from the similar document collection, which is composed of the index vectors of the $K$ most similar documents of a given document: after finding the document $d_i$ with the largest relevance score to the trapdoor, the cloud server only needs to calculate the relevance scores of the similar documents of $d_i$ instead of continuing to access other nodes, because the similar document collection of $d_i$ contains the top-k most relevant documents. Therefore, the search efficiency is improved significantly. After selecting and ranking the top-k documents, the cloud server returns the query result $R$. It is worth noting that the relevance scores computed from encrypted vectors are identical to those computed from unencrypted vectors, i.e., the score obtained from the encrypted index vectors and the trapdoor equals $V_u \cdot V_q$; this follows from the construction of the secure kNN computation [19].

B. ESMRS SCHEME
In the BSMRS scheme, due to the introduction of the random split, non-deterministic encryption can be provided, which means that the same query vectors (e.g., identical query keywords) will be encrypted into different trapdoors. Besides, information outsourced to the cloud server is restricted to encrypted vectors and the calculation involved is only inner product operation. Accordingly, there is no information about particular keywords that can be disclosed. Therefore, the query unlinkability and the keyword privacy can be protected in the known ciphertext model. However, in the known background model, the cloud server is equipped with more knowledge. Moreover, the relevance score computed from V u and T is identical with that from V u and V q , thus the cloud server is capable of identifying same query requests in light of identical access paths and relevance scores, and distinguishing keywords according to distribution differences of keywords in the term frequency distribution histogram. Consequently, the query unlinkability and the keyword privacy are in danger [7]. To enhance security and satisfy more rigorous privacy requirements, the equality must be broken. Therefore, some tunable randomness is introduced into the procedure of relevance evaluating to disturb the score. Additionally, the randomness can be calibrated for the sake of efficiency, ranked search accuracy, and keyword privacy.
The ESMRS scheme is basically consistent with the BSMRS scheme in most aspects except that:

FIGURE 3.
In the construction procedure of the index tree, we first construct a similar document collection for each document and generate a leaf node as the root node. Then, we insert documents with the splitting operation. This diagram also shows the search process using a query vector, in which $V_q$ is equal to (0.6, 0.2, 0.1, 0.6) and k = 3 (the data user will receive three documents at last). Following the search scheme, the search begins from the root of the tree: the relevance score of (0.5, 0.6, 0.2, 0.6) to the query is 0.90, which is bigger than that of (0.8, 0.6, 0.9, 0.4); similarly, the relevance score of (0.5, 0.3, 0, 0.6) to the query is 0.98. Then, the algorithm calculates the relevance score of each similar document of $d_1$ and ranks them in descending order. Finally, $\{d_1, d_6, d_5\}$ are returned.

FIGURE 4.
An example of the inserting operation. Before inserting, the B+-tree, whose order is 3, is shown in (a). Now we try to insert a document with ID 10. First, we find that the leaf node that meets the condition is [8,9]. However, because the node is full, the document cannot be inserted directly, so it is necessary to split the node into [8] and [9]. Then the document is inserted into [9], and the ID 9 is inserted into the parent node [7,8]. At this time, the parent node is also full, so it cannot accept the insertion either and the tree needs to be reorganized globally. The tree after inserting is shown in (b).
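• I ← IndexGen(D, K) In this algorithm, the index vector $V_u$ is an $(m + \varepsilon)$-dimensional vector, and $V_u[j]$, $j = m + 1, \ldots, m + \varepsilon$, is set to a random value $\eta_j$.
• T ← TrapdoorGen($W_q$, K) Similar to the index vector $V_u$, the dimensionality of the query vector is increased to $(m + \varepsilon)$ before encryption as well. The difference is that a random number of the extended elements are set to 1, and the others are set to 0.
• R ← Search(T, k, u) After introducing these phantom terms, the final relevance score of index vector $V_u$ and $T$ equals $\sum_{j=1}^{m} V_u[j] V_q[j] + \sum_{v} \eta_v$, i.e., the original relevance score plus the sum of the random values $\eta_v$ at those extended positions $v \in \{m+1, \ldots, m+\varepsilon\}$ where the query vector is 1.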
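The effect of the phantom terms on the score can be checked with a short sketch. Extending the index vector with random values and the query vector with 0/1 flags means the extra part of the dot product is exactly the sum of the random values selected by the flags. The function names, the uniform sampling range passed in as `low` and `high`, and the choice of drawing at least one flag are assumptions made for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def extend_index_vector(v, eps, low, high, rng):
    """Append eps random phantom values eta_j to the m-dimensional index vector."""
    return list(v) + [rng.uniform(low, high) for _ in range(eps)]

def extend_query_vector(q, eps, rng):
    """Append eps flags to the query vector; a random number of them are 1."""
    ones = rng.randint(1, eps)
    flags = [1] * ones + [0] * (eps - ones)
    rng.shuffle(flags)
    return list(q) + flags

rng = random.Random(1)
v = [0.5, 0.3, 0.0, 0.6]
q = [0.6, 0.2, 0.1, 0.6]
ev = extend_index_vector(v, 5, 0.0, 0.02, rng)
eq = extend_query_vector(q, 5, rng)

# Noise actually picked up by this query: the eta values whose flag is 1.
noise = sum(eta for eta, flag in zip(ev[len(v):], eq[len(q):]) if flag == 1)
print(dot(ev, eq) - dot(v, q), noise)
```

Because the flags are redrawn for every trapdoor, two trapdoors built from the same keywords generally pick up different noise, which is what breaks the exact score equality of the BSMRS scheme.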

V. SECURITY ANALYSIS
In this section, we analyze the security of the ESMRS scheme. The security depends on the secure kNN computation.

A. SECURITY PROOF
Theorem: Due to the introduction of the random split, the scheme is capable of preventing the cloud server from decrypting ciphertext if it does not get the secret key $K$. Proof: For each index vector $V_u$, the cloud server knows the encrypted values $\{V_{ua}, V_{ub}\}$. Without the splitting indicator $S$, the cloud server has to treat $V_u'$ and $V_u''$ as two random $m$-dimensional vectors and set up the following equations: $V_{ua} = M_1^{T} V_u'$ and $V_{ub} = M_2^{T} V_u''$. The number of unknown variables in $V_u'$ and $V_u''$ is $2m$ and that in $M_1$ and $M_2$ is $2m^2$, but the number of equations is only $2m$. Therefore, the information known by the cloud server is not enough to recover the matrices $M_1$ and $M_2$. Basically, the cloud server is obliged to try out all configurations of splitting in order to solve for the matrices. Since there are $2^m$ possible splitting configurations, the introduction of the random split makes the scheme $2^m$ times more costly to attack. Accordingly, if $m$ is large enough, the cloud server is not able to decrypt the ciphertext without the secret key.

B. PRIVACY ANALYSIS

1) INDEX AND QUERY CONFIDENTIALITY
With the introduction of the random split, index vectors are encrypted by invertible matrices. Therefore, the cloud server is not able to deduce initial vectors without the secret key, which has been proved above. Moreover, the degree of difficulty of figuring out matrices is increased by introducing phantom terms. Consequently, index confidentiality can be protected. Based on the same principle, the query keywords are invisible to the cloud server as well.

2) TRAPDOOR UNLINKABILITY
The introduction of the random value $\eta_j$ enables the ESMRS scheme to generate different query vectors and obtain different relevance score distributions when search requests are identical. That is to say, the trapdoor unlinkability is enhanced. However, since access pattern protection is not a design objective of the proposed scheme, for efficiency reasons, the cloud server can take advantage of similarities in the query results of identical search requests. In the proposed ESMRS scheme, the value of $\eta_v$ can be adjusted to balance efficiency and privacy. The data user is able to make a trade-off between the two.

3) KEYWORD PRIVACY
By introducing the random value $\eta_j$ and setting a random number of extended elements of the query vector to 1, the part of the final relevance score contributed by the $\eta_j$ will not be identical even if the search requests are the same. In consideration of ranked search accuracy, every $\eta_j$ follows the same uniform distribution $U(\mu' - \xi, \mu' + \xi)$, whose mean is $\mu'$ and whose variance is $\sigma'^2 = \xi^2/3$. By the central limit theorem, the sum of $\omega$ independent $\eta_j$, i.e., $\sum \eta_j$, approximately follows the normal distribution $N(\mu, \sigma^2)$, where the expectation $\mu$ and the standard deviation $\sigma$ can be calculated as:
$$\mu = \omega \mu', \qquad \sigma^2 = \omega \sigma'^2 = \omega \xi^2 / 3$$
Thus, we can generate the random value $\eta_j$ according to $\mu' = \mu/\omega$ and $\xi = \sqrt{3/\omega}\,\sigma$. The standard deviation $\sigma$ can be considered as a trade-off parameter between security and ranked search accuracy. It is worth noting that $\sigma$ needs to be small for the sake of search accuracy, but a small $\sigma$ increases the risk that the cloud server obtains more statistical information about the original scores. Therefore, $\sigma$ can be adjusted to balance accuracy and privacy.
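The calibration of the phantom values can be tried out numerically. The sketch below derives $\mu'$ and $\xi$ from a target mean $\mu$, standard deviation $\sigma$ and number of summed terms $\omega$, then checks empirically that the sum of $\omega$ uniform draws has roughly that mean and standard deviation. The function names and the sample parameters are hypothetical; only the two conversion formulas come from the text.

```python
import math
import random
import statistics

def phantom_parameters(mu, sigma, omega):
    """Per-term uniform parameters: mu' = mu / omega, xi = sqrt(3 / omega) * sigma."""
    return mu / omega, math.sqrt(3.0 / omega) * sigma

def sample_noise(mu, sigma, omega, rng):
    """Sum of omega independent draws from U(mu' - xi, mu' + xi)."""
    mu_prime, xi = phantom_parameters(mu, sigma, omega)
    return sum(rng.uniform(mu_prime - xi, mu_prime + xi) for _ in range(omega))

rng = random.Random(2021)
samples = [sample_noise(1.0, 0.05, 40, rng) for _ in range(20000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

Raising $\sigma$ spreads the perturbed scores further apart from the true ones, which hides more but also reorders more results; lowering it does the opposite.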

VI. PERFORMANCE EVALUATION
The purpose of this section is to evaluate the performance of our proposed schemes by performing extensive experiments on a real-world document collection: the 20 Newsgroups data set [54]. We implement all algorithms mentioned above in Python on a 1.80 GHz Intel(R) Core(TM) processor running the Windows 10 operating system with 8.00 GB of RAM. The tests include 1) the precision and rank privacy of search, and 2) the efficiency of index construction, trapdoor generation and search.

A. PRECISION AND PRIVACY
As presented in Section IV, phantom terms are introduced to prevent the cloud server from linking identical search requests, for better data security. Therefore, the relevance scores between index vectors and trapdoors will not be exactly accurate. In the ESMRS scheme, two tunable factors (i.e., the number of phantom terms and the level of the random value) influence the precision and the rank privacy. Similar to related works, the ''precision'' $P_k$ is defined as [40]:
$$P_k = k'/k$$
where $k'$ is the number of the real top-k documents that the data user receives. Figure 5(a) shows that the precision of the ESMRS scheme fluctuates with the number of phantom terms and the level of the random value, and that with a small level of random value and a small number of phantom terms, the search capability is not influenced much. The definition of ''rank privacy'' is obtained from [40] as well:
$$P_k' = \sum_{i=1}^{k} |l_i - l_i'| \, / \, k^2$$
where $l_i$ is the rank number of document $i$ in the search results, and $l_i'$ is that in the real ranked documents. A larger rank privacy means better security (Figure 5).

Algorithm 5 Search(T, k, u)
    ...
        Search(T, k, MAX_CHILD);
    else
        Find the record whose relevance score with T is the largest;
        for each similar document in u.S do
            Calculate the relevance score;
        end
        Rank u and its similar documents in descending order according to relevance scores;
        Insert the top-k {ID, Score} into R;
    end
    return R;

B. EFFICIENCY

1) INDEX CONSTRUCTION
In index construction, the number of documents in the collection $D$ and the size of the keyword dictionary $W$ are the principal factors that influence the time overhead. Figure 6(a) shows that the time consumed to construct the index tree is basically linear in the number of documents. Figure 6(b) shows that, with a fixed document collection, the time overhead of constructing the index tree is proportional to the number of keywords in the dictionary. Due to the expansion of vector dimensionality, the ESMRS scheme consumes slightly more time than the BSMRS scheme in constructing the encrypted index tree. It is worth noting that index construction is a one-time operation. In this paper, we compare our schemes with the EDMRS scheme [39] and the DVMRS scheme [53]. The results show that the time overhead of our schemes is less than that of EDMRS and close to that of DVMRS as the size of the document collection increases, and is less than that of both as the size of the keyword dictionary increases.
Note that, when encrypting leaf nodes, we temporarily store the encrypted form of each plaintext index vector so that it can be reused later in the encryption process. Each index vector is therefore encrypted only once, and the number of similar documents has little impact on the time overhead of index construction, as shown in Figure 6(c). Moreover, the order of the index tree influences the time overhead to a certain extent, as shown in Figure 6(d).
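The reuse of encrypted index vectors during leaf encryption can be illustrated with a minimal sketch. It is not the authors' implementation: the function `encrypt_leaves`, the record layout, and the `CountingCipher` stand-in (which performs no real encryption and only counts calls) are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List, Tuple

Vector = Tuple[int, ...]


def encrypt_leaves(
    vectors: Dict[str, Vector],
    similar: Dict[str, List[str]],
    encrypt: Callable[[Vector], Vector],
) -> Dict[str, dict]:
    """Build encrypted leaf records, encrypting each document vector once."""
    # doc ID -> encrypted vector; kept only for the duration of construction
    cache: Dict[str, Vector] = {}

    def encrypted(doc_id: str) -> Vector:
        if doc_id not in cache:
            cache[doc_id] = encrypt(vectors[doc_id])
        return cache[doc_id]

    return {
        doc_id: {
            "vector": encrypted(doc_id),
            "similar": {s: encrypted(s) for s in similar.get(doc_id, [])},
        }
        for doc_id in vectors
    }


class CountingCipher:
    """Hypothetical stand-in for the real encryption; it only counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, vector: Vector) -> Vector:
        self.calls += 1
        return tuple(3 * x + 1 for x in vector)  # placeholder transform, not secure


vectors = {"d1": (5, 0, 0), "d2": (4, 1, 0), "d3": (0, 5, 0)}
similar = {"d1": ["d2", "d3"], "d2": ["d1", "d3"], "d3": ["d1", "d2"]}
cipher = CountingCipher()
leaves = encrypt_leaves(vectors, similar, cipher)
print(cipher.calls)  # one encryption per document, not per (document, neighbour) pair
```

In this toy run every document lists the other two as similar, yet the cipher is invoked only once per document, which is consistent with the observation that the size of the similar document collection has little effect on construction time.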

2) TRAPDOOR GENERATION
The trapdoor generation process includes a vector splitting operation and two matrix multiplications. Therefore, the time complexity is $O(\alpha^{2})$, where $\alpha = m + \varepsilon$. Figure 7(a) shows that the time overhead of generating trapdoors depends primarily on the number of keywords in the dictionary, since most of the time is spent encrypting the query vector and the dimensionality of that vector is determined by the size of the dictionary. Thus, the time overhead increases as the keyword dictionary grows. Moreover, the ESMRS scheme consumes more time than the BSMRS scheme because its vector dimensionality has been extended. Figure 7(b) indicates that the trapdoor generation time is almost unaffected by the number of query keywords.
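The cost structure described above (one split, then two matrix-vector products of size $\alpha \times \alpha$) can be sketched as follows. This is a minimal illustration of the standard secure kNN construction under stated assumptions, not the paper's exact procedure: the splitting convention (index split where the secret bit is 1, query split where it is 0), the toy 3-dimensional key, and all function names are assumptions made here, and the phantom terms and random value of the ESMRS scheme are omitted.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> Vector:
    """Matrix-vector product; O(alpha^2) for an alpha x alpha matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m: Matrix) -> List[List[float]]:
    return [list(col) for col in zip(*m)]


def split(v: Sequence[float], s: Sequence[int], split_bit: int,
          rng: random.Random) -> Tuple[Vector, Vector]:
    """Split v into two shares. Where s[j] == split_bit the two shares are
    random but sum to v[j]; elsewhere both shares equal v[j]."""
    first, second = [], []
    for value, bit in zip(v, s):
        if bit == split_bit:
            share = rng.uniform(-1.0, 1.0)
            first.append(share)
            second.append(value - share)
        else:
            first.append(value)
            second.append(value)
    return first, second


def gen_trapdoor(query, s, m1_inv, m2_inv, rng):
    """One split plus two matrix-vector products (assumed convention)."""
    q1, q2 = split(query, s, 0, rng)
    return mat_vec(m1_inv, q1), mat_vec(m2_inv, q2)


def encrypt_index(doc, s, m1, m2, rng):
    """Counterpart for index vectors, split on the opposite bit value."""
    d1, d2 = split(doc, s, 1, rng)
    return mat_vec(transpose(m1), d1), mat_vec(transpose(m2), d2)


def relevance(index, trapdoor) -> float:
    """Inner product of the encrypted pair; equals the plaintext score."""
    return sum(a * b
               for part_i, part_t in zip(index, trapdoor)
               for a, b in zip(part_i, part_t))


# Toy secret key (hypothetical values): bit vector and two invertible matrices.
S = [1, 0, 1]
M1 = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
M1_INV = [[1, -1, 1], [0, 1, -1], [0, 0, 1]]
M2 = [[1, 0, 0], [2, 1, 0], [0, 3, 1]]
M2_INV = [[1, 0, 0], [-2, 1, 0], [6, -3, 1]]

rng = random.Random(7)
index = encrypt_index([0.5, 0.2, 0.0], S, M1, M2, rng)
trapdoor = gen_trapdoor([1.0, 0.0, 1.0], S, M1_INV, M2_INV, rng)
score = relevance(index, trapdoor)  # matches the plaintext inner product up to rounding
```

Because both matrix-vector products touch every entry of an $\alpha \times \alpha$ matrix, the running time grows with the dictionary size rather than with the number of non-zero query keywords, which agrees with the behaviour reported for Figure 7.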

3) SEARCH EFFICIENCY
We improve the search efficiency in two ways: 1) introducing the $B^{+}$-tree as the basic structure to build the index tree, and 2) constructing a similar document collection for each document. The search process performed by the cloud server mainly consists of finding the document that is most relevant to the trapdoor and then ranking that document and its K most similar documents in descending order of their relevance scores with the trapdoor. The search algorithm terminates after the top-k documents are selected. We evaluate the search efficiency of our proposed schemes and compare them with the EDMRS scheme and the DVMRS scheme under different parameter settings. In particular, we study the effect of the size of the document collection and the cardinality of the keyword dictionary. In our schemes, the $B^{+}$-tree is the basic structure of the index tree: the height of the tree is $O(\log_N n)$ and N relevance score computations are performed on each layer, so the time complexity of search is $O(N \log_N n)$, which is better than linear. In addition, thanks to the similar document collection, fewer nodes need to be visited than in other schemes, which further improves the search efficiency. The results in Figure 8 demonstrate that our search scheme is significantly more efficient in terms of time overhead. In particular, the search efficiency of EDMRS and DVMRS drops noticeably as the number of documents and the cardinality of the keyword dictionary increase, while ours remains high. Note that, to balance accuracy and privacy, the number of phantom terms added to disturb the relevance score is 400 (10% of the number of keywords). The search efficiency is therefore not noticeably affected, so the curves of the BSMRS scheme and the ESMRS scheme in Figure 8 lie close together.
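The two-stage search described above (descend the tree to one leaf record, then rank that record together with its similar documents) can be sketched in a few lines. This is a simplified plaintext illustration, not the paper's algorithm: the `Record` and `Node` classes, the helper names, and in particular the choice to label each node with the element-wise maximum of its children's vectors are assumptions made here; the actual scheme operates on encrypted vectors.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass(eq=False)
class Record:
    doc_id: int
    vector: List[int]
    similar: List["Record"] = field(default_factory=list)  # similar document collection


@dataclass(eq=False)
class Node:
    vector: List[int]  # assumed: element-wise max over the subtree
    children: List["Node"] = field(default_factory=list)
    records: List[Record] = field(default_factory=list)  # filled only at leaves


def score(u: Sequence[int], q: Sequence[int]) -> int:
    return sum(a * b for a, b in zip(u, q))


def elementwise_max(vectors: Sequence[Sequence[int]]) -> List[int]:
    return [max(column) for column in zip(*vectors)]


def make_leaf(records: List[Record]) -> Node:
    return Node(elementwise_max([r.vector for r in records]), records=records)


def make_internal(children: List[Node]) -> Node:
    return Node(elementwise_max([c.vector for c in children]), children=children)


def search(root: Node, query: Sequence[int], k: int) -> List[Tuple[int, int]]:
    """Greedy descent: at most N score computations per layer, one path only."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: score(c.vector, query))
    best = max(node.records, key=lambda r: score(r.vector, query))
    candidates = [best] + best.similar
    ranked = sorted(
        ((r.doc_id, score(r.vector, query)) for r in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


d1, d2 = Record(1, [5, 0, 0]), Record(2, [4, 1, 0])
d3, d4 = Record(3, [0, 5, 0]), Record(4, [0, 4, 3])
d1.similar, d2.similar = [d2], [d1]
d3.similar, d4.similar = [d4], [d3]
root = make_internal([make_leaf([d1, d2]), make_leaf([d3, d4])])
result = search(root, [0, 1, 1], k=2)  # list of (doc_id, score), best first
```

Only one root-to-leaf path is followed, so the number of score computations is bounded by the tree order times the height, plus one per similar document. The greedy choice is a simplification and is not guaranteed to reach the globally best record in every tree.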
In conclusion, without losing the efficiency of index tree construction, we effectively improve the efficiency of search, which indicates that our scheme is feasible and efficient.

VII. CONCLUSION
In this paper, we study the efficiency and security issues of multi-keyword ranked search over encrypted cloud data and propose a secure and efficient search scheme. The scheme not only achieves accurate multi-keyword ranked search but also makes the search time better than linear. In terms of accuracy, the vector space model and the TF-IDF model are exploited to obtain accurate ranked search results. The secure kNN computation is incorporated to protect the scheme under two threat models. To improve the search efficiency, we construct the index tree based on the $B^{+}$-tree structure and construct a similar document collection for each document before encryption. A thorough security analysis shows that our proposed scheme is secure and privacy-preserving while maintaining the precision of multi-keyword ranked search. Extensive experimental results on a real-world document collection demonstrate the feasibility and efficiency of the scheme.
In the proposed scheme, the similar document collection increases the storage overhead to a certain extent. Therefore, in our future work, we will explore schemes that support better space efficiency.
LINLIN XUE received the B.E. degree in electronic information engineering and the Ph.D. degree in electromagnetic field and microwave technology from the University of Science and Technology of China, Anhui, China, in 2008 and 2013, respectively.
She was a Lecturer with the Zhejiang University of Technology, from 2013 to 2019. Since 2019, she has been a Lecturer with the Zhejiang University of Science and Technology. She has authored or coauthored over 20 journal articles and conference papers in her areas of expertise. Her current research interests include the modeling and simulation of photonics devices and subsystems.
HAIJIANG WANG received the M.S. degree from Zhengzhou University, in 2013, and the Ph.D. degree from Shanghai Jiao Tong University, in 2018. He is currently a Teacher with the School of Information and Electronic Engineering, Zhejiang University of Science and Technology. His research interests include cryptography and information security, in particular public-key encryption, attribute-based encryption, and searchable encryption.
LEI ZHANG received the M.S. degree from Tsinghua University, in 2006. He is currently a Teacher with the School of Information and Electronic Engineering, Zhejiang University of Science and Technology. His research interests include the communication and its security in the Internet of Things.
JINYING ZHANG is currently pursuing the degree with the Zhejiang University of Science and Technology. Her research interests include the application of blockchain in the medical field.
VOLUME 9, 2021