Fuzzy Search for Multiple Chinese Keywords in Cloud Environment

Abstract: With the continuous development of cloud computing and big data technology, cloud storage is used more and more widely, and a large amount of data is outsourced to public cloud servers, so the accompanying security problems are gradually emerging. Protecting the data privacy of users while still allowing efficient retrieval and use of the data is an urgent problem for cloud storage. Building on existing fuzzy search schemes for plaintext and encrypted data, this paper exploits the fuzzy sounds and polysemy that are characteristic of Chinese: it constructs fuzzy-sound variants of keywords through Chinese Pinyin and synonyms through Chinese-English translation, and establishes the fuzzy-sound and synonym sets of keywords. On this basis, this paper proposes a Chinese multi-keyword fuzzy search scheme for the cloud environment, which realizes fuzzy search over multiple Chinese keywords and protects the private key by using a pseudo-random function. Finally, security analysis and system experiments verify that the scheme has high security, good practicability, and a high search success rate.


Introduction
With the development and popularization of information technology, the number of data files that users and enterprises store locally keeps growing, and the pressure on local storage is increasing [Patnaik (2016)]. A local hardware failure or severe damage can greatly disrupt the use of data by users and enterprises, and may even cause important data to be lost permanently. Therefore, cloud storage services, with their advantages of convenience and cost saving, are popular with more and more users [Li and Zhang (2013)]. However, the use of cloud storage services is subject to some constraints. For example, data related to the trade secrets of enterprises must be protected from illegal use. As a result, the data is typically encrypted locally and then outsourced to a cloud storage server, which makes the data much harder to use [Fu, Shu and Sun (2014)]. Due to the limitations of network bandwidth and local storage capacity, it is impractical for users to download all the data and then decrypt it. In brief, the design of cloud data services that support both privacy protection and encrypted search is a research topic of great significance and practical value [Lucas, Seny and Fabian (2005)]. Researchers at home and abroad have obtained many results in keyword search over encrypted data. For example, Song et al. proposed ciphertext keyword search based on symmetric encryption in 2000 [Song, Wagner and Perrig (2000)]; Boneh et al. proposed ciphertext keyword search based on public key encryption in 2007 [Boneh and Waters (2007)]. In recent years, further progress has been made. For example, Li et al. used edit distance to quantify the similarity between keywords and built a new keyword-based fuzzy search technique on it [Li, Wu and Yuan (2016)]; Hore et al. proposed a range search method based on range grouping in 2012 [Hore, Mehrotra, Canim et al. (2012)].
From the above analysis, it can be seen that research on encrypted data search has achieved productive results, but many problems remain in semantic search and fault tolerance: the earlier keyword-based search methods focus only on exact or fuzzy matching of keywords, which is not fully applicable to the Chinese environment [Cao, Wang, Li et al. (2014)]. In addition to its glyph, a Chinese character has two parts: pinyin and meaning. Pinyin consists of initials and finals. Therefore, under normal circumstances, a Chinese keyword has many homophones, synonyms, and similar words. At present, there are few studies at home and abroad on Chinese keyword search with similar semantics and pronunciation [Wang, Cao and Li (2010)]. Hence, this paper proposes a keyword-based fuzzy search strategy for encrypted cloud data, and explores the method and execution efficiency of fuzzy search over Chinese fuzzy sounds and synonymous keywords in the cloud storage environment. We focus on implementing a Chinese fuzzy keyword search strategy that is suitable for the cloud environment and protects privacy. We propose a method to construct the synonym and homophone sets of Chinese keywords through inter-translation between Chinese and English and a fuzzy pinyin strategy. We provide an effective fuzzy keyword search scheme based on Chinese keywords for cloud data retrieval that protects the privacy of keywords. Fuzzy keyword search greatly improves the availability of the system: when the keyword entered by the user exactly matches a predefined keyword, the matching files are returned; when exact matching fails, the synonyms and homophones of the keyword are used to return the closest possible matching files. Specifically, the similarity between expressions in different languages is used to establish the similarity of Chinese keywords, which yields a new ciphertext retrieval technique.
Based on the homophone and synonym keyword sets, we propose an efficient fuzzy keyword search scheme. A rigorous security analysis shows that the proposed scheme is secure and privacy-preserving, which is the goal of the Chinese fuzzy keyword search scheme. The rest of the paper is organized as follows: Section 2 introduces the system model, the threat model, and our design goals, and briefly describes the necessary background for the techniques used in this paper. Section 3 shows a straightforward construction of the fuzzy keyword search scheme. Section 4 provides a detailed description of our proposed scheme, including the efficient construction of the fuzzy keyword set and the fuzzy keyword search scheme. Section 5 presents the security analysis. Section 6 presents the experimental analysis. Finally, Section 7 concludes the paper.

Problem descriptions

System and threat model
The architecture of the data storage service in the cloud environment is shown in Fig. 1. It contains three main entities: a data owner, a user, and a cloud service provider. The data owner can be an individual or enterprise user who stores the data file set $C = (F_1, F_2, \cdots, F_N)$ on the cloud server. A set of distinct keywords related to the file set C is predefined and expressed as $W = \{w_1, w_2, \cdots, w_p\}$. To ensure that sensitive data is not used by unauthorized persons, the data set C needs to be encrypted before it is outsourced to the cloud server. Since Chinese contains a large number of similar sounds and synonyms, the architecture needs to provide a fuzzy search function over fuzzy sounds and synonyms for encrypted data in order to improve the utilization efficiency and the retrieval success rate of cloud data. The data owner generates the private key sk for search requests and distributes it to other authorized users, such as team members or enterprise employees. Once the private key has been distributed, for any input keyword w, the authorized user uses the private key sk and a one-way function to convert the keyword to be searched into a search request (hereinafter referred to as a trapdoor) and submits it to the cloud server, so that the relevant file set can be searched securely. The cloud server carries out the search without decrypting the data and sends the resulting set of target files (denoted as $FID_w$) associated with the keyword w, or with a fuzzy sound or synonym of w, to the data searcher. This paper assumes that the cloud server in the cloud data service architecture is honest but curious: it correctly executes the specified protocol, but it tries to infer and analyze relevant information from the input of users. Therefore, we still follow the security definitions of traditional symmetric encryption when designing the synonymous keyword search scheme.
Except for the search results and search patterns, nothing else related to the stored files and indexes should be revealed.

Goal of the design
In order to realize a secure and efficient synonym keyword search over cloud data under the above model, this paper needs to achieve the following goals: 1) fuzzy keyword search function: design mechanisms that support efficient and correct fuzzy keyword search over outsourced cloud data; 2) security guarantee: prevent the cloud server from learning anything about the data files or keywords during the search process; 3) efficiency guarantee: achieve the above goals with the smallest possible use of storage, communication, and computing resources.
$W$: A collection of distinct keywords extracted from the file set C, expressed as a set of words, i.e., $W = \{w_1, w_2, \cdots, w_p\}$.
$I$: An index established for privacy-preserving fuzzy keyword search.
$T_w$: Trapdoor, i.e., a search request generated by a one-way function after the user inputs the search keyword w.
$FID_w$: The set of IDs of the files that contain the keyword w or one of its near-sounding or synonymous keywords.
$Enc(sk, \cdot)$, $Dec(sk, \cdot)$: Symmetric key encryption/decryption functions with semantic security.
Edit distance: Edit distance describes the similarity of strings. For two words $w_1$ and $w_2$, the edit distance $ed(w_1, w_2)$ is the minimum number of operations required to transform one into the other, where an operation adds, modifies, or deletes a character. For a given word w and integer d, $S_{w,d}$ denotes the set of similar words $w'$ satisfying $ed(w, w') \le d$.
$P_{w,d}$: Keyword set corresponding to the fuzzy pinyin of keyword w. For a given Chinese keyword w and an integer d, the fuzzy sounds of the pinyin of w form a set, and $P_{w,d}$ collects every keyword $w'$ whose pinyin is one of these fuzzy sounds and lies within edit distance d of the pinyin of w, i.e., $ed(w, w') \le d$ on the pinyin strings.
$Syn_w$: A collection of keywords synonymous with the keyword w. When different keywords $(w_1', w_2', \ldots)$ that describe the same thing in one language are converted into another language, they generally correspond to the same keyword; the set of these keywords is then the synonym set of the keyword w, where $trans(w) = trans(w')$ for each $w'$ in the set and $trans(\cdot)$ is the synonym conversion function. In this paper, synonym conversion is implemented between Chinese and English.
Fuzzy keyword search: Given a set of encrypted data files $C = (F_1, F_2, \cdots, F_N)$, a predefined set of distinct Chinese keywords $W = \{w_1, w_2, \cdots, w_p\}$, and a combination of multiple input search keywords $w_1$ and $w_2$, i.e., $\{w_1, w_2\}$, the fuzzy keyword search returns a file ID set $\{FID_{w'}\}$, where $w \in \{w_1, w_2\}$ and $w' \in Syn_w$ or $w' \in P_{w,d}$.

System framework

Representation of system framework
According to the above target analysis, the overall framework design of the system is shown in Fig. 1.
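The edit distance used in the preliminaries is the standard Levenshtein distance. As a minimal sketch (the function name `edit_distance` is an illustrative choice, not taken from the paper), it can be computed with row-by-row dynamic programming:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]
```

Applied to pinyin strings, `edit_distance("lin", "ning")` is 2 (substitute l with n, then append g), so with a threshold of 2 the two syllables count as similar.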

Overview of design
The goal of fuzzy search is to return as many relevant results as possible (including those for synonyms and fuzzy sounds) based on the set of keywords input by different users. However, such fuzzy search based on keyword synonyms and fuzzy words is very challenging over encrypted cloud data. For any two Chinese words in plaintext it is easy to decide whether they are fuzzy-sound variants or synonyms of each other, but it is difficult to find such similarity after they have been processed by a one-way function (such as a pseudo-random function or another encryption algorithm). Traditional encrypted search strategies work by equality comparison between user-submitted search trapdoors and the searchable encrypted index, which is not directly usable for the fuzzy search considered here [Chen, Shen, Hu et al. (2016); Cheang, Wang, Cai et al. (2018)].
In order to solve this problem, this paper proposes a two-step scheme to reduce the difficulty of fuzzy matching over cloud-encrypted data. In the first step, the data owner constructs a fuzzy keyword set on the client side. The set mainly includes three parts, namely the keywords, the Pinyin and fuzzy sounds of the keywords, and the English words corresponding to the keywords, together with the corresponding index information (a table of Chinese keywords and document IDs, a Chinese-English keyword comparison table, and a Pinyin and fuzzy-sound comparison table). In the second step, based on the fuzzy keyword set, a secure and efficient fuzzy search method is designed, which is elaborated in the following sections.
For data outsourced to the cloud, in addition to security issues, the user is most concerned with the efficiency of the operation. Therefore, this paper uses symmetric encryption as a searchable encryption framework.

Implementation of project
In designing the framework of the Chinese fuzzy keyword search scheme, this paper first considers how to establish the fuzzy keyword set, then analyzes how to generate the search request, and finally describes how to implement a secure and efficient search over encrypted data.

A fuzzy keyword set establishment
Establishing a set of keywords is a prerequisite for efficient fuzzy search. The fuzzy pinyin keyword set $P_{w,d}$ and the synonym set $Syn_w$ are generated from the keyword w and the similarity constraint d, where each $w' \in P_{w,d}$ satisfies $ed(w, w') \le d$. The specific implementation is as follows:

(1) Fuzzy pinyin keyword set establishment

The pinyin of Chinese characters is mainly composed of initials and finals, and the combination of initials and finals follows specific rules. The sets are as follows:

Initial set: {b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, w}
Single final set: {a, o, e, i, u, ü}
Compound final set: {ai, ei, ui, ao, ou, iu, ie, üe, er}
Front nasal final set: {an, en, in, un, ün}
Back nasal final set: {ang, eng, ing, ong}

Unlike the letters of English words, the initials and finals of Chinese Pinyin cannot be combined arbitrarily. For example, when the initial is b, the final can only come from a fixed set: {a, o, i, u, ai, ei, ao, ie, an, en, in, ang, eng, ing, ian, iao}. Moreover, the most common mistakes in Chinese Pinyin spelling are fuzzy sounds, such as confusing flat-tongue and retroflex initials, l and n in the initials, and front and back nasal finals, including ian and iang, and uan and uang. Therefore, the easiest way to create a fuzzy-sound keyword set is to enumerate the possible pinyin combinations and find the keywords whose pinyin equals one of these combinations. For example, assuming that the user gives d = 2 and the pinyin of the input keyword w is lin, the corresponding fuzzy keyword set $P_{w,d} = \{w_1', w_2', \ldots\}$ is generated according to the pinyin composition rules, and the pinyin of each $w'$ should be included in the set {lin, nin, ling, ning, *in}.

(2) Synonym set establishment

Many words in Chinese have the same or similar meanings, but these relations are not always reflected in a synonym dictionary, e.g., "计算机", "电脑" and "微机". These three Chinese words have the same meaning. When the user executes a cloud data search, the file IDs related to all three words should be returned to the user, but there is no good existing way to perform such a synonym comparison. By comparing how Chinese and English describe the same thing, we propose a method that realizes synonym conversion using language differences. For example, the English translation of each of the above three words is "computer", so if the English translations of two keywords are the same, the words are treated as synonymous. The implementation process is as follows: given the keyword w input by the user, the conversion function $trans(w)$ is executed, which translates w into the English word $w_e$; the Chinese keywords that translate into $w_e$ are then looked up in the Chinese-English comparison table and returned as the synonym set $Syn_w$. After the corresponding fuzzy-sound and synonym sets have been generated, w, $P_{w,d}$ and $Syn_w$ are encrypted by the encryption function $Enc(sk, \cdot)$ and sent to the cloud together with the encrypted files for storage.
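The enumeration of fuzzy pinyin combinations can be illustrated with a short sketch. It assumes a single syllable as input and a small, hand-picked list of confusable pairs (l/n, flat-tongue versus retroflex initials, and front versus back nasal finals); the names `FUZZY_INITIALS`, `FUZZY_FINALS` and `fuzzy_pinyin` are illustrative and not taken from the paper, and the sketch does not check that every generated syllable is a valid pinyin combination.

```python
# Initials from the initial set listed above; two-letter initials come
# first so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Assumed confusable pairs (an illustrative selection, not the paper's table).
FUZZY_INITIALS = [("l", "n"), ("z", "zh"), ("c", "ch"), ("s", "sh")]
FUZZY_FINALS = [("an", "ang"), ("en", "eng"), ("in", "ing"),
                ("ian", "iang"), ("uan", "uang")]


def split_syllable(syllable: str) -> tuple:
    """Split a pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # syllable without an initial, e.g. "an"


def variants(unit: str, pairs: list) -> set:
    """Return the unit itself plus every unit it is commonly confused with."""
    result = {unit}
    for left, right in pairs:
        if unit == left:
            result.add(right)
        elif unit == right:
            result.add(left)
    return result


def fuzzy_pinyin(syllable: str) -> set:
    """Enumerate the fuzzy-sound variants of one pinyin syllable."""
    initial, final = split_syllable(syllable)
    return {i + f
            for i in variants(initial, FUZZY_INITIALS)
            for f in variants(final, FUZZY_FINALS)}
```

For the syllable lin this yields lin, nin, ling and ning, which matches the concrete entries of the example set above; the wildcard entry `*in` is not reproduced by this sketch.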
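The translation-based synonym construction can be sketched as a lookup in a Chinese-English comparison table. The table below is a tiny hypothetical example (a real deployment would need a full bilingual dictionary), and the names `ZH_EN` and `synonym_set` are illustrative only.

```python
# Hypothetical Chinese-English comparison table, used only for illustration.
ZH_EN = {
    "计算机": "computer",
    "电脑": "computer",
    "微机": "computer",
    "网络": "network",
}


def synonym_set(keyword: str, table: dict = ZH_EN) -> set:
    """Return all Chinese keywords sharing the English translation of keyword."""
    english = table.get(keyword)
    if english is None:
        # Unknown keyword: it is only synonymous with itself.
        return {keyword}
    return {zh for zh, en in table.items() if en == english}
```

With this table, `synonym_set("电脑")` returns the three words that translate to "computer", while a keyword missing from the table is returned unchanged as a singleton set.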

Generate search request
After the user inputs the keywords $\{w_1, w_2\}$, the scheme carries out a fuzzy search and returns a corresponding set $\{FID_{w'}\}$ of file IDs, where $w \in \{w_1, w_2\}$ and $w' \in Syn_w$ or $w' \in P_{w,d}$. The generation of the search request is similar to the generation of the keyword index: for the input $w_1$ and $w_2$, the fuzzy pinyin and synonym generation functions are called to obtain the fuzzy pinyin keyword set $P_{w,d}$ and the synonym set $Syn_w$. A search trapdoor is generated from w, $P_{w,d}$, and $Syn_w$, all of which are encrypted, and is then submitted to the cloud server. This completes the generation of the search request.
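As a hedged illustration of trapdoor generation, the sketch below uses HMAC-SHA256 from the Python standard library as a stand-in for the pseudo-random function; the paper does not name a concrete function, and `trapdoor` and `search_request` are hypothetical helper names. The expanded keyword set (the keyword plus its fuzzy pinyin and synonym variants) is assumed to have been computed already.

```python
import hashlib
import hmac


def trapdoor(key: bytes, word: str) -> str:
    """Keyed one-way tag of a word; HMAC-SHA256 stands in for the PRF."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


def search_request(key: bytes, expanded_words: set) -> set:
    """Build the search request: one trapdoor per word in the expanded set."""
    return {trapdoor(key, word) for word in expanded_words}
```

The same key and word always produce the same trapdoor, which is what allows the server to match by equality and is also why the search pattern is visible to it.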

Fuzzy search scheme
In the cloud service system, in order to prevent the cloud from obtaining sensitive information, part of the work needs to be performed on the client side, e.g., building the search index and generating trapdoors. Searching a large amount of data is very resource-consuming and should be done by the cloud server. The execution flow of the keyword-based fuzzy search scheme over encrypted cloud data is as follows:

Scheme preprocessing stage:
(1) The data owner randomly selects two numbers as the private key and distributes the private key to the data users.
(3) The client uses the key corresponding to $w'$ to decrypt the file IDs, and calls $Dec(sk, \cdot)$ to decrypt the required files.
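The division of work between client and cloud server can be sketched as follows. This is a simplified illustration under several assumptions rather than the paper's construction: HMAC-SHA256 stands in for the pseudo-random function, the variant-generation function `expand` is supplied by the caller, file IDs are kept in plaintext (the scheme encrypts them), and results for multiple trapdoors are combined by union.

```python
import hashlib
import hmac
from typing import Callable, Dict, Iterable, Set


def prf(key: bytes, word: str) -> str:
    """Keyed one-way tag for a word (HMAC-SHA256 as a stand-in PRF)."""
    return hmac.new(key, word.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes,
                keyword_files: Dict[str, Set[str]],
                expand: Callable[[str], Set[str]]) -> Dict[str, Set[str]]:
    """Client side: map the tag of each keyword variant to its file IDs."""
    index: Dict[str, Set[str]] = {}
    for keyword, file_ids in keyword_files.items():
        # Index the keyword itself and every fuzzy or synonym variant.
        for variant in expand(keyword) | {keyword}:
            index.setdefault(prf(key, variant), set()).update(file_ids)
    return index


def server_search(index: Dict[str, Set[str]],
                  trapdoors: Iterable[str]) -> Set[str]:
    """Server side: equality lookup on tags only, no plaintext keywords."""
    hits: Set[str] = set()
    for tag in trapdoors:
        hits |= index.get(tag, set())
    return hits
```

The server only ever compares opaque tags, so a query for a synonym or fuzzy-sound variant succeeds because the client indexed that variant in advance, not because the server can relate the words to each other.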

Security analysis
In the encrypted search scheme designed in this paper, when the user submits the same search request, the cloud always returns the same search result. Although the cloud server does not see the underlying plaintext, it can still learn access patterns and search patterns from its interaction with the user. The scheme therefore ensures that nothing other than the access and search patterns is leaked. This section proves that the fuzzy search scheme designed in this paper meets the non-adaptive semantic security requirement [Raghavendra, Girish and Geeta (2018)]. The non-adaptive attack model only considers adversaries (e.g., cloud servers) who cannot choose search requests based on trapdoors and previous search results, because only users holding the authorized private key can generate search trapdoors [Wang, Chen, Li et al. (2017); Shen, Wang, Li et al. (2018)]. Below we introduce some concepts needed to analyze the security of fuzzy search schemes.
History: The interaction between the user and the cloud server, consisting of a set of files C and a set of keywords searched by the user, expressed as $H_q = (C, w_1, w_2, \cdots, w_q)$.
View: Given a history $H_q$ under the key $sk$, the cloud server can only see the encrypted history. The view $V_{sk}(H_q)$ includes: the index I of the file set C, the trapdoors $T_{w_1'}$ with $w_1' \in P_{w,d}$ and $T_{w_2'}$ with $w_2' \in Syn_w$ of the queried keywords, and the set of encrypted files, denoted as $\{e_1, \cdots, e_N\}$.
Trace: Given a history $H_q$ and an encrypted file set C, the trace $Tr(H_q)$ captures the precise information learned by the cloud server, including the size of the encrypted files and the search pattern $\Pi_q$, which is a symmetric matrix that stores the intersection of two sets $\Pi_1$ and $\Pi_2$. $\Pi_1$ and $\Pi_2$ are the intersection of the fuzzy-sound records and the intersection of the synonym records, respectively.
In general, the security strength of this scheme is reflected in the fact that the cloud server cannot distinguish its views of two histories with the same trace. In other words, the cloud server cannot extract more information than what is leaked in the queries (i.e., the trace), so the scheme is secure. The security of the fuzzy search scheme in this paper is stated in the following theorem. Since the fuzzy search scheme has been described above, the following conclusion applies equally to the instantiated fuzzy search.
Theorem: This fuzzy keyword search scheme satisfies non-adaptive semantic security.

Proof:
To prove semantic security, we construct a simulator S that, given the trace $Tr(H_q)$, simulates a view $V_q^*$ that is indistinguishable from the cloud server's view $V_{sk}(H_q)$ for any $q \in \mathbb{N}$, any $H_q$, and a randomly chosen $sk$. The security parameter $l$ of the pseudo-random function is known to S. For $q = 0$: due to the semantic security of symmetric encryption, an adversary cannot distinguish between $e_i$ and $e_i^*$, or between an encrypted index entry and its simulated counterpart. Due to the pseudo-randomness of the trapdoor generation function, the trapdoor of $w'$ cannot be distinguished from a random string either. Therefore, $V_{sk}(H_0)$ and $V_0^*$ are indistinguishable.
(2) S selects random strings in $\{0,1\}^l$, uses them with Enc to form the simulated index entries, and assigns them to the simulated trapdoors. (3) If the index has more records than have been simulated, S creates the remaining records in the index $I^*$ using the same process as for the simulated trapdoors. The correctness of the constructed view is easily demonstrated by searching the index with the simulated trapdoors. An attacker cannot distinguish between $V_{sk}(H_q)$ and $V_q^*$. Moreover, the simulated ciphertexts are produced by a symmetric encryption scheme whose semantic security guarantees that ciphertexts are indistinguishable. Indexes and trapdoors are also indistinguishable, by the properties of pseudo-random functions. Therefore, the theorem holds.

Experimental analysis

Time and space consumption
The fuzzy search scheme for Chinese keywords proposed in this paper adds synonyms and fuzzy-sound words of Chinese keywords to the existing search schemes. Therefore, the proposed scheme needs extra time and space to process keyword synonyms and fuzzy sounds in the preprocessing stage and the search stage. However, its time complexity and space consumption remain of the same order as those of the original scheme. The time complexity of index construction in the preprocessing stage depends only on the number of keywords, i.e., $O(|W|)$, and the size of the index also depends only on the number of keywords, i.e., $O(|W|)$. During the search, because the cloud server system supports multi-threading, synonyms and fuzzy words can be retrieved in parallel, so the added time is small, and the time complexity of the search is $O(|W|)$.

Experimental comparison
This paper uses the free cloud platform provided by Amazon as the experimental platform and uses journal and magazine literature as the search objects to verify the effectiveness and practicability of the scheme. When building an index, as the number of keywords increases, the CPU and memory usage gradually increases, and the time consumed increases accordingly. The time consumption of building an index is shown in Fig. 2. The success rate of keyword queries is shown in Fig. 4.

Figure 4: The comparison results of the success rate of keyword query

It can be seen from the above experimental results that the proposed scheme stores the synonyms and fuzzy words of keywords at the cost of a moderate increase in storage. Although the preprocessing and search times increase, the success rate of the search is greatly improved.

Conclusion
To meet the actual needs of users searching in Chinese in a cloud storage environment, this paper presents a multi-keyword fuzzy search strategy for encrypted cloud data. By constructing fuzzy-sound and synonym sets in the scheme, mismatches caused by fuzzy sounds and synonyms between the input text and the words that the user is looking for are handled well in the Chinese environment, and the pseudo-random function is used to effectively avoid information disclosure during the query process. Therefore, the scheme has high security, good practicability, and a high search success rate.