Privacy-preserving k-NN interpolation over two encrypted databases

Cloud computing enables users to outsource their databases and computing functionalities to a cloud service provider, avoiding the cost of maintaining private storage and computing infrastructure. It also provides universal access to data, applications, and services without location dependency. While cloud computing provides many benefits, it also raises a number of security and privacy concerns. Outsourcing data to a cloud service provider in encrypted form may help to overcome these concerns. However, encryption makes it difficult for the cloud service provider to perform some of the operations over the data that are required in query processing tasks. Among the techniques employed in query processing tasks, the k-nearest neighbor method draws attention due to its simplicity and efficiency, particularly on massive data sets. A number of k-nearest neighbor algorithms for query processing on a single encrypted database have been proposed. However, relying on a single database may create accuracy and reliability problems, whereas collaboration among different cloud service providers can yield more accurate and more reliable results in query processing. Motivated by this, we focus on the k-nearest neighbor (k-NN) problem over two encrypted databases. We introduce a secure two-party k-NN interpolation protocol that enables a query owner to extract the interpolation of the k-nearest neighbors of a query point from two different databases outsourced to two different cloud service providers. We also show that our protocol protects the confidentiality of the data and the query point, and hides data access patterns. Furthermore, we conducted a number of experiments to demonstrate the efficiency of our protocol. The results show that the running time of our protocol is linearly dependent on both the number of nearest neighbors and the data size.


INTRODUCTION
Due to its low cost, scalability and reliability, cloud computing has gained popularity in both the business and scientific communities. In addition to its benefits, it introduces new concerns that need to be addressed carefully (Krutz & Vines, 2010). One of the emerging issues in cloud computing is extracting knowledge from sensitive data while protecting the privacy of data owners, which is called privacy-preserving data mining (Agrawal & Srikant, 2000; Vaidya & Clifton, 2004). A privacy-preserving data mining method aims to provide data privacy using either data perturbation or cryptographic methods. Data perturbation-based models struggle with data quality issues, i.e. valuable statistical information might be lost. This may yield less accurate and less reliable results. On the other hand, cryptography-based models achieve the privacy of data owners by encrypting the data before outsourcing it to the cloud. However, this makes it challenging to perform the required operations over the encrypted data.
In addition to these facts, collaboration among different cloud service providers may also help them to create more accurate and reliable results in a privacy-preserving data mining method, i.e. more clouds can discover more knowledge than they can uncover on their own when they combine their data (Demir & Tugrul, 2018). There are some studies that propose privacy-preserving solutions for horizontally-partitioned databases to increase the total number of data samples with the goal of creating more accurate data mining models (Inan et al., 2007). In some cases, vertically-partitioned database solutions can be preferred to increase the number of attributes for the same instances (Skillicorn & McConnell, 2008). Institutions such as hospitals operating in different parts of a country may prefer the first choice. On the other hand, institutions such as banks and insurance companies may aggregate their data using the second choice.
In this study, we will examine the k-NN interpolation method that preserves the confidentiality of two different databases stored by two different cloud service providers. k-NN, categorized as a lazy learner, is a non-parametric method used for classification, clustering and interpolation which utilizes the idea that neighboring objects possess or display similar characteristics. Complex interpolation methods such as Kriging involve advanced operations and thus pose a great challenge to cloud computing. In addition, the high time requirements of such methods make them unsuitable in some scenarios such as healthcare applications. On the contrary, the simplicity and interpretability of the k-NN method make it an efficient tool for query processing tasks.

Our contribution
In this article, we introduce an efficient secure two-party k-NN (STPkNN) interpolation protocol that enables two different data owners to outsource their databases together with the query processing service to the cloud, and allows a query owner to extract the interpolation of the k-nearest neighbors of a query point from the encrypted databases. Our protocol preserves the confidentiality of data, assures the privacy of user's query point, and hides data access patterns.
The STPkNN protocol can be considered an extension to two-cloud settings of the $\mathrm{SkNN}_m$ protocol proposed in Elmehdwi, Samanthula & Jiang (2014), which enables a query owner to retrieve the k-nearest neighbors of a query point from a single encrypted database. Briefly, the $\mathrm{SkNN}_m$ protocol calculates the k-nearest neighbors iteratively by performing the following steps k times: (i) it finds the minimum of the Euclidean distances between the data records and the query point, (ii) it retrieves the nearest neighbor that corresponds to the index of the minimum distance, and excludes the corresponding distance from the remaining Euclidean distances. In two-cloud settings, on the other hand, the clouds have to share their local minimums of the Euclidean distances to decide on the global minimum, which corresponds to the index of the current nearest neighbor across the two databases, and remove that record from further iterations. However, it is not trivial to achieve this without revealing to any cloud which data record corresponds to the global minimum.
To this aim, we first propose two new security primitives, the Secure Transformation (ST) protocol and the Secure Bit-AND-OR (SBAOR) protocol, that enable the clouds to decide on the global minimum and exclude it from further calculations without revealing the data access pattern to any cloud. We show that both protocols protect the confidentiality of the input values, which are in encrypted form, i.e. no information about the input values is leaked to any party during the protocols, and the output is only revealed to one of the parties in the protocols. Briefly, the ST protocol allows the servers to securely transform the encryption of a record under one public key into an encryption of the same record under another public key. On the other hand, given the encryptions of two bit vectors x and y, the SBAOR protocol enables the servers to securely compute the negation of the logical disjunction of all bitwise multiplications $x_i \cdot y_i$ in encrypted form without revealing the bit vectors to any party.
By employing the ST and SBAOR protocols together with the other existing security protocols, we build our main protocol STPkNN that enables a query owner (QO) to extract the interpolation of the k-nearest neighbors of a query point chosen by QO from two different databases outsourced to two different cloud service providers. In the protocol, data owners encrypt their data before outsourcing them to the cloud service providers, and they do not participate in the STPkNN protocol. Thus, no information about the data is leaked to the cloud service providers during the protocol. Besides, our protocol guarantees that any record from both databases or any intermediate result generated in the protocol is not leaked to the cloud service providers. Also, it hides the data access pattern from both data owners and cloud service providers, i.e. the protocol does not reveal the information of which data records were used to produce the interpolation of k-nearest neighbors to any cloud service provider. On the other hand, the STPkNN protocol outputs the interpolation of k-nearest neighbors only to the query owner, and the query owner gets no information other than the interpolation.
We also conduct various experiments on two real-world datasets from the UCI machine learning repository, the cervical cancer (risk factors) dataset and the default of credit card clients dataset, to show the practicability of our protocol in real-world scenarios. The experimental evaluation shows that our protocol scales well for large datasets.

Related works
Due to its usefulness in many application scenarios such as classification, similarity search, and collaborative filtering, the problem of computing the k-nearest neighbors of a query point has gained a lot of attention in recent years. The early studies mostly focused on how to implement a secure k-NN method between data owner and clients without using cloud systems. Shaneck, Kim & Kumar (2009) proposed a privacy-preserving protocol that employs secure multiparty computation to compute k-NN in horizontally partitioned databases. They also showed how their protocol can be efficiently used in different applications such as outlier detection, classification, and clustering. Moreover, Qi & Atallah (2008) proposed a provably secure protocol for the single-step k-NN search problem that enjoys linear computation and communication complexity. Vaidya & Clifton (2005) introduced a privacy-preserving algorithm that performs top-k queries in vertically partitioned data. Additionally, Kantarcoğlu & Clifton (2004) proposed a method that privately calculates the k-NN classification over horizontally partitioned data in the distributed database model. Note that all of the above methods require the data owners to perform the necessary calculations to generate the result, and to return it directly to the query users. In our model, however, the data is outsourced to the cloud in encrypted form instead of being kept by the data owners, and all of the computation required to process k-NN queries is performed by the cloud.
The recent studies have mostly focused on solutions in cloud computing settings. Wong et al. (2009) proposed an asymmetric scalar-product-preserving encryption (ASPE) scheme that can be employed to construct a secure k-NN protocol. The protocol proposed in Wong et al. (2009) uses a distance comparison function instead of an exact distance calculation. However, the secret key in the protocol must be disclosed to the query users. Zhu, Huang & Takagi (2016) introduced a secure protocol that achieves k-NN query processing on encrypted data without totally revealing the data owner's secret key to the query user. However, their scheme requires data owners to be involved in the encryption of query points. Hu et al. (2011) proposed a secure traversal framework that can be used, together with privacy homomorphism, to achieve a secure k-NN query processing protocol. Cheng et al. (2015) proposed a privacy-preserving protocol that employs an encrypted hierarchical index tree to perform k-NN queries over spatial data outsourced to the cloud in encrypted form. All three protocols (Hu et al., 2011; Zhu, Huang & Takagi, 2016; Cheng et al., 2015) leak the data access pattern to the cloud. On the other hand, Kesarwani et al. (2018) proposed a secure k-NN query processing protocol over encrypted data by utilizing a leveled fully homomorphic encryption scheme. Wu et al. (2019) introduced a privacy-preserving k-NN classification scheme over an encrypted cloud database that is secure against known-plaintext attacks. Besides, Lei et al. (2020) shed light on the connection between a secure k-NN query processing scheme and a secure range query scheme. Based on this connection, they utilize a secure range query scheme together with a data structure named random Bloom filter to build a secure k-NN query processing scheme. All three protocols (Kesarwani et al., 2018; Wu et al., 2019; Lei et al., 2020) hide the data access pattern while preserving data privacy and query privacy. However, they require the decryption keys to be given to the query users, whereas in our model the decryption keys are not shared with the query users.
On the other hand, Elmehdwi, Samanthula & Jiang (2014) tackled the same problem using a homomorphic encryption method. In addition to ensuring the confidentiality of data owners and clients, the protocol proposed in Elmehdwi, Samanthula & Jiang (2014) also hides data access patterns from the clouds. Moreover, Xu et al. (2017) proposed an efficient secure k-NN protocol which achieves sublinear computational complexity. Similar to Elmehdwi, Samanthula & Jiang (2014), their protocol also hides data access patterns, using garbled circuits to simulate Oblivious RAM. Furthermore, Guo & Sun (2020) adopted the R-tree data structure to build an efficient k-NN scheme that requires only two rounds of interaction between the client and cloud servers to generate the result. They also utilized Merkle hash tree techniques to obtain a k-NN scheme that is secure even against malicious cloud servers. There are also some studies that address location-based query processing over encrypted geospatial data (Lei et al., 2019; Lian et al., 2020). Lian et al. (2020) proposed an efficient k-NN scheme that employs Moore curves together with the AES encryption scheme to ensure spatial data and location privacy.
The aforementioned studies use k-NN methods for either classification or query search applications. Unlike previous solutions, Kalideen, Osmanoglu & Tugrul (2019) proposed an efficient solution for the problem of computing the interpolation of the k-NN of a given point in cloud computing settings. However, their solution reveals to the cloud servers which data records were used to produce the interpolation, and leaking such information might not be acceptable in applications that require data security. Unlike the protocol presented in Kalideen, Osmanoglu & Tugrul (2019), our protocol assures the desired security features, i.e. it hides the data access pattern.

PROBLEM FORMULATION
In this section, we will give a more precise definition of the problem and its security requirements.
Secure two-party k-NN interpolation problem
In our system there are two data owners $DO_1$ and $DO_2$ holding two different spatial databases $D_1$ and $D_2$, respectively. Each database $D_u$ consists of $n$ records $d_i^{(u)}$, $1 \le i \le n$, where $u = 1, 2$. There are also two cloud pairs $(CSP_1^{(u)}, CSP_2^{(u)})$, each associated with a public key-secret key pair $(pk_u, sk_u)$ of a public key encryption scheme that is semantically secure (Goldwasser & Micali, 1982). As in most of the studies in this field, we consider each pair of cloud service providers $(CSP_1^{(u)}, CSP_2^{(u)})$ as two non-colluding cloud servers, i.e. $CSP_1^{(u)}$ stores the database and performs most of the homomorphic operations; on the other hand, $CSP_2^{(u)}$ keeps the secret key and helps $CSP_1^{(u)}$ to perform the complex operations over the ciphertexts. In our problem, we assume that each data owner $DO_u$ initially encrypts his database $D_u$ as $E_{pk_u}(D_u)$, where $E_{pk_u}(D_u)$ consists of the attribute-wise encryptions $E_{pk_u}(d_{i,j}^{(u)})$ for $1 \le i \le n$ and $1 \le j \le m$. Each $DO_u$ then outsources $E_{pk_u}(D_u)$ together with the query processing service to $CSP_1^{(u)}$. Note that the underlying public key encryption scheme should enable cloud servers to perform homomorphic operations over ciphertexts.
There is also an authorized query owner QO who wants to retrieve the interpolation of the k-nearest neighbors of a query point $Q$ from both databases $D_1$ and $D_2$ stored in $CSP_1^{(1)}$ and $CSP_1^{(2)}$, respectively. After QO requests the interpolation, the cloud service providers generate the result by performing the required operations over the encrypted databases. This process should output the interpolation of the k-nearest neighbors only to the query owner. The query owner should not learn any information other than the interpolation during this process. We call such a process a secure two-party k-nearest neighbors (STPkNN) protocol. We remark that the STPkNN protocol should preserve the confidentiality of the records in the databases $D_1$ and $D_2$, and protect the privacy of the query point. Moreover, the protocol should hide data access patterns, i.e. it should not reveal to any data owner or any cloud service provider which data records were used to produce the interpolation of the k-nearest neighbors.
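As a point of reference, the functionality that STPkNN is meant to compute can be sketched in the clear, with no encryption and no privacy guarantees. The sketch below assumes that the interpolation is the attribute-wise arithmetic mean of the k nearest records and that closeness is measured by squared Euclidean distance; the function name `knn_interpolate` and the toy records are hypothetical and only serve as illustration.

```python
import heapq

def knn_interpolate(db1, db2, query, k):
    """Attribute-wise mean of the k records closest to `query` across both databases."""
    def sq_dist(record):
        return sum((a - b) ** 2 for a, b in zip(record, query))
    # Pool both databases: the k neighbors may come from either one.
    nearest = heapq.nsmallest(k, list(db1) + list(db2), key=sq_dist)
    return [sum(rec[j] for rec in nearest) / k for j in range(len(query))]

# Hypothetical toy records (not taken from any dataset in the paper).
db1 = [[0, 0], [10, 10]]
db2 = [[1, 1], [2, 2]]
result_k2 = knn_interpolate(db1, db2, [0, 0], k=2)
result_k3 = knn_interpolate(db1, db2, [0, 0], k=3)
```

The secure protocol has to produce the same value while operating only on ciphertexts and without revealing which records were selected.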

Example
In 2016, the European Union adopted a new regulation on the protection of personal data, Regulation (EU) 2016/679 of the European Parliament and of the Council (European-Parliament, 2016). The regulation states that 'the protection of natural persons in relation to the processing of personal data is a fundamental right'. All personal health records that reveal information relating to the past, current or future physical or mental health status of the data subject are considered sensitive personal data in the regulation. Therefore, personal health records should be protected against unauthorized parties, i.e. only those approved by the owner should be able to access the data. On the other hand, the processing of health data may be significant for advancing research or healthcare practices. Consider a doctor who tries to determine whether a person has a particular heart disease by analyzing the medical records of the person. In addition, the doctor may wish to compare the patient's medical records with those of other patients presenting similar properties in order to improve diagnostic accuracy. In fact, this comparison enables the doctor to evaluate the validity of some tests, especially when the scores do not match the expected values. Consequently, the doctor can make an accurate diagnosis if he is allowed to reach the data of other patients in the same region or across the country. Moreover, if the personal health records are stored in the cloud in encrypted form so as not to violate the fundamental right of the owner of the records, it will be possible to perform reliable analysis on large datasets.
Let us clarify this with an example. Consider the subset of the heart disease data set from the UCI Machine Learning Repository depicted in Table 1. There are 10 different instances shown in the table, and each instance is associated with five attributes: ID (patient's identity), trestbps (resting blood pressure in mm Hg), chol (serum cholesterol in mg/dl), thalach (maximum heart rate achieved), and oldpeak (ST depression induced by exercise relative to rest). Assume the data owner, which can be viewed as a hospital in this context, encrypts these attributes, and outsources the encrypted database $E_{pk}(D)$ together with the future query processing to the cloud. Also, assume there is a doctor who wants to determine whether a specific patient carries a risk of a particular heart disease. Let the medical record of the patient be $Q = \langle 150, 250, 145, 3 \rangle$. The doctor, who is the query owner in our context, requests the interpolation of the k-nearest neighbors of $Q$ from the cloud by providing the encryption $E_{pk}(Q)$ to the cloud. Then, the cloud determines the interpolation of the k-nearest neighbors by searching the encrypted database $E_{pk}(D)$. For simplicity, let k be 3 for this example. As we observe here, the instances having IDs 1, 7, and 9 are the 3 nearest neighbors of $Q$. So, the cloud returns the interpolation $T = \langle 141.6, 251.6, 152.3, 2.4 \rangle$ to the doctor, who can use $T$ to make an accurate diagnosis. Consequently, the necessary analysis can be carried out without revealing any sensitive information about either his patient or the other patients.
Aguilar et al. (2005) stated that if interpolation models are developed with an insufficient amount of data, they will be less accurate and reliable. Namely, the collaboration between participants affects the accuracy of interpolation models. We here conduct a series of experiments to assess the impact of collaboration between participants on the accuracy of prediction in the interpolation methods. 
In our experiments, we employ two publicly available datasets from U.S. National Geochemical Survey Database that present sodium (Na) content of the soil in two states: Colorado and Wisconsin. Summary statistics of both data sets are presented in Table 2.

Effect of collaboration on interpolation accuracy
There are various performance evaluation metrics for interpolation methods. We here employ Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), which are often chosen as evaluation metrics for numerical prediction. Smaller values of both MAE and RMSE indicate more accurate predictions. MAE and RMSE values are calculated as follows:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| p(\{x_i, y_i\}) - z(\{x_i, y_i\}) \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( p(\{x_i, y_i\}) - z(\{x_i, y_i\}) \right)^2},$$
where $n$ is the total number of data points in the dataset, and $p(\{x_i, y_i\})$ and $z(\{x_i, y_i\})$ are the predicted and actual values at location $(\{x_i, y_i\})$, respectively. The effects of varying k values on MAE and RMSE values are shown in Fig. 1. We assume that two data holders share all data points in the data set. Both data sets are randomly divided into two parts using a sampling without replacement strategy, assuming each party has one of the pieces. In some situations, data holders may not have data in equal proportions, so we also consider different sharing ratios. We specify the β value as the distribution ratio, which means that if one party holds a β portion of the data, the other party holds the remaining portion (1 − β). After several trials, the MAE and RMSE values obtained for various numbers of nearest neighbors are shown in Tables 3 and 4, respectively.
As seen from the results, the smallest MAE and RMSE values are observed when the 10-nearest neighbors are used for all points in each data set. The smallest MAE and RMSE are underlined in the tables. As seen in Table 3, if only half of the data is available for creating a prediction model, MAE deteriorates by 4.31% for the Wisconsin data set and by 6.60% for the Colorado data set. Similar behavior can be observed in Table 4 for each split ratio. As observed from the results, the data holder who has less data always produces less accurate predictions. On the contrary, if there is a sufficient amount of data, the predictions generated by the model are more accurate and reliable.
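The two metrics can be computed with a few lines of Python. This is a minimal sketch using the standard definitions of MAE and RMSE; the function names and the sample values are illustrative only and are not taken from the experiments.

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error over paired predicted and actual values."""
    return sum(abs(p - z) for p, z in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root Mean Squared Error over paired predicted and actual values."""
    return math.sqrt(sum((p - z) ** 2 for p, z in zip(predicted, actual)) / len(actual))

# Illustrative values: absolute errors are 0, 1 and 2.
predicted = [1.0, 2.0, 4.0]
actual = [1.0, 3.0, 2.0]
mae_value = mae(predicted, actual)
rmse_value = rmse(predicted, actual)
```

Because RMSE squares each error before averaging, it penalizes large individual errors more strongly than MAE does.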

PRELIMINARIES
In this section, we will present the notations and the definitions of some primitives that will be used in our proposed protocols.

Notation
We here give the notations used in this paper.
$(pk_u, sk_u)$: the public key-secret key pair assigned to the $u$-th cloud pair.
$(CSP_1^{(u)}, CSP_2^{(u)})$: the $u$-th cloud pair, i.e. the former holds the encryption of the database $E_{pk_u}(D_u)$ and the latter holds the corresponding secret key $sk_u$.

Homomorphic encryption
Homomorphic encryption is an encryption scheme that allows users to perform some mathematical operations, such as addition and multiplication, directly on ciphertexts. This property protects the confidentiality of the data during computation, and makes the encryption scheme a very practical and useful tool in cloud computing, especially for sensitive data. For that reason, homomorphic encryption schemes have been gaining a lot of attention in recent years, and many such schemes have been proposed (Goldwasser & Micali, 1982; Elgamal, 1984; Boneh, Goh & Nissim, 2005). In this study we use a well-known homomorphic encryption system, the Paillier scheme, to construct our protocols. Let $E_{pk}(\cdot)$ be the encryption function with the public key $pk$ and $D_{sk}(\cdot)$ be the decryption function with the secret key $sk$, and let $N$ denote the Paillier modulus. For any given two plaintexts $a$ and $b$, the Paillier scheme satisfies the following properties:
Homomorphic addition: $D_{sk}(E_{pk}(a) * E_{pk}(b) \bmod N^2) = a + b \bmod N$.
Homomorphic multiplication: $D_{sk}(E_{pk}(a)^{b} \bmod N^2) = a \cdot b \bmod N$.
Note that the Paillier encryption scheme is semantically secure (Paillier, 1999).
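The homomorphic behavior of the Paillier scheme can be checked with a toy implementation. The sketch below is only an illustration under simplifying assumptions: it uses the common generator choice g = N + 1, tiny hard-coded primes that offer no security, and helper names (`keygen`, `encrypt`, `decrypt`) that are hypothetical rather than taken from any library. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier keys from two primes (illustrative sizes, not secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    return n, (n, phi, pow(phi, -1, n))  # pk = N, sk = (N, phi, phi^-1 mod N)

def encrypt(n, m):
    """E_pk(m) = (1 + N)^m * r^N mod N^2 for a fresh random unit r."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)

def decrypt(sk, c):
    """Recover m from c using L(c^phi mod N^2) * phi^-1 mod N."""
    n, phi, mu = sk
    return (pow(c, phi, n * n) - 1) // n * mu % n

pk, sk = keygen(1009, 1013)
n2 = pk * pk
a, b = 1234, 56
ca, cb = encrypt(pk, a), encrypt(pk, b)

sum_ct = ca * cb % n2                    # E(a) * E(b)     -> a + b
prod_ct = pow(ca, b, n2)                 # E(a)^b          -> a * b
diff_ct = ca * pow(cb, pk - 1, n2) % n2  # E(a) * E(b)^(N-1) -> a - b

sum_pt, prod_pt, diff_pt = (decrypt(sk, c) for c in (sum_ct, prod_ct, diff_ct))
```

The third ciphertext shows the subtraction trick used repeatedly in the protocols: raising a ciphertext to the power N − 1 negates the plaintext modulo N.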

Basic security primitives
Here, we briefly explain a set of basic security protocols. In these protocols, it is assumed that there exist two semi-honest parties $P_1$ and $P_2$ joining the protocols, and that the Paillier secret key is known only to one of them. We will also introduce two new security primitives in "Construction" that will be employed together with the basic primitives given here as building blocks in forming our construction.

Secure multiplication (SM) protocol
Consider two parties $P_1$ and $P_2$ such that the former holds $(E_{pk}(x), E_{pk}(y))$ and the latter holds the secret key $sk$, where $x$ and $y$ are not known to either party. The protocol outputs $E_{pk}(x \cdot y)$ to $P_1$. Note that the output $E_{pk}(x \cdot y)$ is only known to $P_1$, and no information about $x$ and $y$ is revealed to any party during the protocol.
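The internals of SM are not spelled out in this text. A minimal sketch of one common way to realize it is given below, assuming the additive masking approach: P1 blinds both ciphertexts with random values, P2 decrypts and multiplies the blinded plaintexts, and P1 homomorphically removes the cross terms. Both roles are simulated in a single function, the toy Paillier helpers use insecure key sizes, and all names are hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier keys from two primes (illustrative sizes, not secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    return n, (n, phi, pow(phi, -1, n))

def encrypt(n, m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)

def decrypt(sk, c):
    n, phi, mu = sk
    return (pow(c, phi, n * n) - 1) // n * mu % n

def secure_multiply(pk, sk, cx, cy):
    """Return E(x*y) given E(x), E(y); P1 never sees x, y and P2 only sees masked values."""
    n, n2 = pk, pk * pk
    # P1: blind both inputs with fresh random masks ra, rb.
    ra, rb = secrets.randbelow(n), secrets.randbelow(n)
    masked_x = cx * encrypt(pk, ra) % n2          # E(x + ra)
    masked_y = cy * encrypt(pk, rb) % n2          # E(y + rb)
    # P2 (holds sk): decrypt the blinded values, multiply, re-encrypt.
    h = decrypt(sk, masked_x) * decrypt(sk, masked_y) % n
    ch = encrypt(pk, h)                           # E((x + ra)(y + rb))
    # P1: subtract x*rb, y*ra and ra*rb homomorphically.
    out = ch * pow(cx, n - rb, n2) % n2
    out = out * pow(cy, n - ra, n2) % n2
    out = out * pow(encrypt(pk, ra * rb), n - 1, n2) % n2
    return out

pk, sk = keygen(1009, 1013)
product = decrypt(sk, secure_multiply(pk, sk, encrypt(pk, 12), encrypt(pk, 34)))
```

The final decryption is only there to check the sketch; in the protocol itself the result stays encrypted at P1.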

Secure squared Euclidean distance (SSED) protocol
The protocol considers two parties $P_1$ and $P_2$ with the inputs $(E_{pk}(X), E_{pk}(Y))$ and the secret key $sk$, respectively, and outputs $E_{pk}(|X - Y|^2)$ to $P_1$, where $X$ and $Y$ are $m$-dimensional vectors. In the protocol, the encryption of the squared Euclidean distance $E_{pk}(|X - Y|^2)$ is only known to $P_1$.
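One way such a distance can be assembled from the homomorphic properties and the SM protocol is sketched below. This is an assumption about how SSED may be realized, written for vectors $X = (x_1, \ldots, x_m)$ and $Y = (y_1, \ldots, y_m)$ with Paillier modulus $N$; the text here does not give the internal steps.

```latex
\begin{align*}
E_{pk}(x_i - y_i) &= E_{pk}(x_i) * E_{pk}(y_i)^{N-1} \bmod N^2, \\
E_{pk}\big((x_i - y_i)^2\big) &= \mathrm{SM}\big(E_{pk}(x_i - y_i),\, E_{pk}(x_i - y_i)\big), \\
E_{pk}\big(|X - Y|^2\big) &= \prod_{i=1}^{m} E_{pk}\big((x_i - y_i)^2\big) \bmod N^2 .
\end{align*}
```

The first and third lines are local operations of $P_1$, so only the $m$ squaring steps would need interaction with $P_2$.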

Secure bit-decomposition (SBD) protocol
The protocol considers $P_1$ with the input $E_{pk}(x)$ and $P_2$ with the secret key $sk$, and outputs the encryptions of the bit-decomposition of $x$ as $[x] = \langle E_{pk}(x_1), \ldots, E_{pk}(x_\ell) \rangle$, where $0 \le x < 2^\ell$. Note that the encrypted bit-decomposition $[x]$ is known only to $P_1$.

Secure minimum (SMIN) protocol
In the protocol, $P_1$ with the inputs $([x], [y])$ and $P_2$ with the secret key $sk$ securely compute the encryption of the individual bits of the minimum of $x$ and $y$ as $[\min(x, y)]$. Note that the output $[\min(x, y)]$ is only known to $P_1$, and no information about $x$ and $y$ is revealed to any party during the protocol.

Secure minimum out of n numbers ($\mathrm{SMIN}_n$) protocol
In the protocol, $P_1$ with the inputs $([x_1], \ldots, [x_n])$ and $P_2$ with the secret key $sk$ securely compute $[\min(x_1, \ldots, x_n)]$, where $[\min(x_1, \ldots, x_n)]$ is the encryption of the individual bits of $\min(x_1, \ldots, x_n)$. Note that the output $[\min(x_1, \ldots, x_n)]$ is only known to $P_1$, and no information about $x_i$ for any $i$ is revealed to any party during the protocol.

Secure Bit-OR (SBOR) protocol
Consider two parties $P_1$ and $P_2$ such that the former holds $(E_{pk}(a), E_{pk}(b))$ and the latter holds the secret key $sk$, where $a$ and $b$ are two bits. The protocol outputs $E_{pk}(a \lor b)$ to $P_1$. The output $E_{pk}(a \lor b)$ is only known to $P_1$, and no information about $a$ and $b$ is revealed to any party during the protocol.
Since we do not aim to study the existing protocols given above, we simply adopt their most efficient implementations, which were presented in Elmehdwi, Samanthula & Jiang (2014) and Samanthula, Hu & Jiang (2013). However, the implementation of the $\mathrm{SMIN}_n$ protocol given in Elmehdwi, Samanthula & Jiang (2014) fails for some inputs, i.e. it generates an incorrect output if the size of the input is $n = 8k + 1$ for some $k \in \mathbb{Z}$. We illustrate this with an example: assume the protocol takes nine inputs $([x_1], \ldots, [x_9])$. At the last step, the protocol applies the SMIN protocol to the intermediate values $[x'_1]$ and $[x'_7]$ (the encryptions of the local minimums), and outputs the encryption of 0 since $[x'_7]$ was set to the encryption of zero at some previous step. Therefore, independent of the inputs, the protocol always outputs the encryption of zero as the final output when the size of the input is $n = 8k + 1$ for some $k \in \mathbb{Z}$. Thus, we develop a new implementation of the $\mathrm{SMIN}_n$ protocol that simply works as follows: $P_1$ and $P_2$ first run the SMIN protocol on $[x_1]$ and $[x_2]$ to get $[R_1]$, and then iteratively run the SMIN protocol on $[R_{i-1}]$ and $[x_{i+1}]$ to get $[R_i]$ for $2 \le i \le n - 1$. Note that the final output of the iterative steps will be $[R_{n-1}] = [\min(x_1, \ldots, x_n)]$, which is the encryption of the individual bits of $\min(x_1, \ldots, x_n)$.
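The control flow of an iteration that ends in $[R_{n-1}]$ can be sketched as a sequential fold. In the sketch below the encrypted SMIN call is replaced by a plaintext stand-in, so it shows only the order of the calls, not the cryptography; reading the iteration as $R_1 = \mathrm{SMIN}(x_1, x_2)$, $R_i = \mathrm{SMIN}(R_{i-1}, x_{i+1})$ is an assumption drawn from the final index $n - 1$, and the names and sample values are hypothetical.

```python
from functools import reduce

def smin(x, y):
    """Plaintext stand-in for one SMIN call on two encrypted bit vectors."""
    return min(x, y)

def smin_n(values):
    """Sequential fold: R_1 = SMIN(x_1, x_2), then R_i = SMIN(R_{i-1}, x_{i+1})."""
    calls = 0

    def step(acc, x):
        nonlocal calls
        calls += 1
        return smin(acc, x)

    result = reduce(step, values)
    return result, calls

# Nine inputs, i.e. n = 8k + 1 with k = 1, the case reported as failing above.
values = [42, 17, 88, 23, 9, 61, 35, 70, 14]
minimum, calls = smin_n(values)
```

A fold of this kind needs exactly n − 1 pairwise calls and has no padding or pairing step, so it behaves the same for every input size.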

CONSTRUCTION
In this section, we first introduce two new security primitives: the Secure bit-AND-OR (SBAOR) protocol and the Secure Transformation (ST) protocol. We then give the security analysis of these protocols. By utilizing the SBAOR and ST protocols together with the basic security primitives given in "Preliminaries", we construct our main protocol. Furthermore, we also give the security analysis of the main protocol and discuss the computation complexity at the end of this section.

Secure bit-AND-OR (SBAOR) protocol
The SBAOR protocol allows the servers to securely compute the negation of the logical disjunction of all bitwise multiplications $x_i \cdot y_i$ in encrypted form without revealing the bit vectors to any party. In the main protocol, it will help the servers to separate the index of the current closest record to the query point from all other records of both databases by assigning the encryption of 1 to that particular index and the encryption of 0 to all other indices. In this way, the servers will be able to calculate the current closest record, and remove the corresponding index from further calculations.
The protocol considers two parties $P_1$ and $P_2$ such that the former holds $([x], [y])$ and the latter holds the secret key $sk$, where $[x]$ and $[y]$ are the encryptions of the individual bits of $x$ and $y$. The protocol enables the parties $P_1$ and $P_2$ to securely compute the encryption $E_{pk}(\overline{A})$, where $\overline{A} = 1 - A$ and $A = (x_1 \cdot y_1) \lor (x_2 \cdot y_2) \lor \cdots \lor (x_\ell \cdot y_\ell)$. The output $E_{pk}(\overline{A})$ is only known to $P_1$, and no information about $x$ and $y$ is revealed to any party during the protocol.
In the protocol, $P_1$ and $P_2$ first run the SM protocol on the inputs $E_{pk}(x_i)$ and $E_{pk}(y_i)$ to calculate $E_{pk}(x_i \cdot y_i)$ for $i \in [\ell]$, where $x_i$ and $y_i$ are the $i$-th bits of $x$ and $y$, respectively. Note that each $E_{pk}(x_i \cdot y_i)$ is only revealed to $P_1$. The server $P_1$ then calculates $E_{pk}(x_1 \cdot y_1 \lor \cdots \lor x_\ell \cdot y_\ell)$ as follows: it initially executes the SBOR protocol together with $P_2$ on $E_{pk}(x_1 \cdot y_1)$ and $E_{pk}(x_2 \cdot y_2)$ to get $E_{pk}(R_1) = E_{pk}(x_1 \cdot y_1 \lor x_2 \cdot y_2)$; it then iteratively runs the SBOR protocol together with $P_2$ on $E_{pk}(R_{i-1})$ and $E_{pk}(x_{i+1} \cdot y_{i+1})$ to get $E_{pk}(R_i)$ for $2 \le i \le \ell - 1$. Note that the final output of the iterative steps will be $E_{pk}(R_{\ell-1}) = E_{pk}(x_1 \cdot y_1 \lor \cdots \lor x_\ell \cdot y_\ell)$. Finally, $P_1$ applies the equation $E_{pk}(\overline{R_{\ell-1}}) = E_{pk}(1) * E_{pk}(R_{\ell-1})^{N-1}$ to compute the final output.
Security Analysis of SBAOR: At the beginning of the protocol, the servers $P_1$ and $P_2$ execute the Secure Multiplication protocol. As emphasized in "Preliminaries", the output of the protocol is only revealed to the server $P_1$, and no information about the plaintexts $x_i$ and $y_i$ is revealed to any party during this protocol. Later, the servers run the Secure Bit-OR (SBOR) protocol on the inputs $E_{pk}(R_{i-1})$ and $E_{pk}(x_{i+1} \cdot y_{i+1})$. The SBOR protocol outputs the new $E_{pk}(R_i)$ only to the server $P_1$, and no information about the plaintexts is revealed to any party during the protocol. Finally, the server $P_1$ only applies some homomorphic operations on the encryption $E_{pk}(R_{\ell-1})$ computed at the previous step.
Therefore, the SBAOR protocol protects the confidentiality of the data, i.e. no information about the contents of the encryptions is revealed to any party during the protocol.
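The flow of SBAOR (bitwise SM, an SBOR fold, then a homomorphic negation) can be traced with a toy simulation. The sketch below makes several assumptions that are not stated in the text: SM is realized by additive masking, SBOR is realized through the identity a ∨ b = a + b − a·b for bits, both parties are simulated in one process, and the Paillier helpers use insecure toy primes. All function names are hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier keys from two primes (illustrative sizes, not secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    return n, (n, phi, pow(phi, -1, n))

def encrypt(n, m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)

def decrypt(sk, c):
    n, phi, mu = sk
    return (pow(c, phi, n * n) - 1) // n * mu % n

def secure_multiply(pk, sk, cx, cy):
    """SM via additive masking (assumed realization): returns E(x*y)."""
    n, n2 = pk, pk * pk
    ra, rb = secrets.randbelow(n), secrets.randbelow(n)
    h = decrypt(sk, cx * encrypt(pk, ra) % n2) * decrypt(sk, cy * encrypt(pk, rb) % n2) % n
    out = encrypt(pk, h) * pow(cx, n - rb, n2) % n2
    out = out * pow(cy, n - ra, n2) % n2
    return out * pow(encrypt(pk, ra * rb), n - 1, n2) % n2

def sbor(pk, sk, ca, cb):
    """SBOR for bits, assuming a OR b = a + b - a*b: returns E(a OR b)."""
    n2 = pk * pk
    cab = secure_multiply(pk, sk, ca, cb)
    return ca * cb % n2 * pow(cab, pk - 1, n2) % n2

def sbaor(pk, sk, enc_x_bits, enc_y_bits):
    """Return E(1 - A) with A = OR_i (x_i * y_i)."""
    n2 = pk * pk
    products = [secure_multiply(pk, sk, cx, cy) for cx, cy in zip(enc_x_bits, enc_y_bits)]
    acc = products[0]
    for c in products[1:]:            # R_i = SBOR(R_{i-1}, x_{i+1} * y_{i+1})
        acc = sbor(pk, sk, acc, c)
    return encrypt(pk, 1) * pow(acc, pk - 1, n2) % n2   # E(1) * E(R)^(N-1)

def encrypt_bits(pk, x, nbits):
    return [encrypt(pk, (x >> i) & 1) for i in range(nbits)]

pk, sk = keygen(1009, 1013)
disjoint = decrypt(sk, sbaor(pk, sk, encrypt_bits(pk, 0b1010, 4), encrypt_bits(pk, 0b0101, 4)))
overlap = decrypt(sk, sbaor(pk, sk, encrypt_bits(pk, 0b1010, 4), encrypt_bits(pk, 0b0010, 4)))
```

The output is 1 exactly when the two bit vectors share no position where both bits are set, and 0 otherwise.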

Secure Transformation (ST) protocol
The ST Protocol enables the servers to transform the encryption of a record under a public key to the encryption of same record under another public key. In the main protocol, the servers employ ST Protocol to collect the encryptions of local minimums, that indicate the indexes of local closest records of both databases, under the same public key so that they can decide the minimum among them. The protocol considers three parties P ðu 1 Þ 1 with the input E pk u 1 ðtÞ, P ðu 1 Þ 2 with the secret key sk u 1 , and P ðu 2 Þ 1 . The protocol simply aims to transform the encryption of a record t under the public key pk u 1 to the encryption of t under the public key pk u 2 . Note that no information about t is revealed to any party during the protocol and the output E pk u 2 ðtÞ is only known to the party P ðu 2 Þ 1 . Briefly, P ðu 1 Þ 1 first masks E pk u 1 ðtÞ with the randomly chosen vector r 0 2 Z m N as l ¼ E pk u 1 ðtÞ Ã E pk u 1 ðr 0 Þ, and sends μ to P ðu 1 Þ 2 and E pk u 2 ðr 0 Þ to P ðu 2 Þ 1 . After getting μ, P ðu 1 Þ 2 first decrypts it as l 0 ¼ D sk u 1 ðlÞ, then encrypts μ′ with the public key pk u 2 as E pk u 2 ðl 0 Þ, and finally sends the encryption to P ðu 2 Þ 1 . After receiving the encryption, the party P ðu 2 Þ 1 first removes the randomness r′ from the encryption E pk u 2 ðl 0 Þ and gets the encryption E pk u 2 ðtÞ as E pk u 2 ðtÞ ¼ E pk u 2 ðl 0 À r 0 Þ. From the homomorphic property of the underlying encryption scheme, E pk u 2 ðl 0 À r 0 Þ can easily be calculated as E pk u 2 ðl 0 Þ Ã E pk u 2 ðr 0 Þ NÀ1 .
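The three message flows of ST can be traced with a toy simulation using two independent Paillier key pairs. The sketch rests on assumptions that go beyond the text: the two keys have different moduli, so the mask is sampled below $N_1 - 2^\ell$ to keep $t + r'$ from wrapping modulo the first modulus (a simplification that differs from sampling uniformly in $\mathbb{Z}_N$), the bound on record entries is passed in explicitly, all three parties run in one process, and the helper names and toy primes are hypothetical and insecure.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier keys from two primes (illustrative sizes, not secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    return n, (n, phi, pow(phi, -1, n))

def encrypt(n, m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n * n) % (n * n)

def decrypt(sk, c):
    n, phi, mu = sk
    return (pow(c, phi, n * n) - 1) // n * mu % n

def secure_transform(pk1, sk1, pk2, enc_record, bound):
    """Turn attribute-wise encryptions under pk1 into encryptions under pk2."""
    out = []
    for ct in enc_record:
        # P1^(u1): mask the attribute; r < N1 - bound avoids wrap-around (sketch assumption).
        r = secrets.randbelow(pk1 - bound)
        mu = ct * encrypt(pk1, r) % (pk1 * pk1)     # E_pk1(t + r), sent to P2^(u1)
        enc_r_pk2 = encrypt(pk2, r)                 # E_pk2(r), sent to P1^(u2)
        # P2^(u1): decrypt the masked value and re-encrypt it under pk2.
        mu_plain = decrypt(sk1, mu)
        enc_mu_pk2 = encrypt(pk2, mu_plain)
        # P1^(u2): strip the mask, E_pk2(mu') * E_pk2(r)^(N2 - 1).
        out.append(enc_mu_pk2 * pow(enc_r_pk2, pk2 - 1, pk2 * pk2) % (pk2 * pk2))
    return out

pk1, sk1 = keygen(1009, 1013)
pk2, sk2 = keygen(1019, 1021)
record = [150, 250, 145, 3]                         # the example query record Q
enc_under_pk1 = [encrypt(pk1, v) for v in record]
enc_under_pk2 = secure_transform(pk1, sk1, pk2, enc_under_pk1, bound=2 ** 10)
recovered = [decrypt(sk2, c) for c in enc_under_pk2]
```

The final decryption with the second secret key only checks the sketch; in the protocol the transformed ciphertexts remain encrypted at $P_1^{(u_2)}$.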
Security Analysis of ST: At the beginning of the protocol, the server $P_1^{(u_1)}$ randomizes the encryption $E_{pk_{u_1}}(t)$ with $r' \in \mathbb{Z}_N^m$ before sending it to the server $P_2^{(u_1)}$. Hence the plaintext obtained by $P_2^{(u_1)}$ upon decryption is uniformly random in $\mathbb{Z}_N^m$. In addition, $P_1^{(u_2)}$ locally subtracts the encryption of the randomness $r'$ under the public key $pk_{u_2}$ from the encryption sent by $P_2^{(u_1)}$ using only homomorphic operations. Thus, the protocol does not reveal any information about the record $t$ to any party.
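The three message flows of the ST protocol can be illustrated with a minimal sketch over a toy Paillier implementation. The function names (`keygen`, `encrypt`, `decrypt`, `secure_transform`), the tiny demo primes, and the bit-length parameter `ell` are assumptions made for illustration only and are not taken from the paper's Java implementation. The text writes a single $N$ for both keys; here the two key pairs have different moduli, so the sketch additionally assumes that the mask is drawn such that $t + r'$ does not wrap modulo the first modulus.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key pair from two small primes (demo sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E_pk(m) = (1 + n)^m * r^n mod n^2 with fresh randomness r."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(n + 1, m % n, n2) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def secure_transform(enc_t, pk1, sk1, pk2, ell):
    """Sketch of ST: re-encrypt the vector t from pk1 to pk2."""
    # P_1^(u1): mask each component. Assumption of this sketch: r' is
    # drawn below pk1 - 2^ell so that t + r' never wraps modulo pk1.
    r = [secrets.randbelow(pk1 - 2 ** ell) for _ in enc_t]
    mu = [c * encrypt(pk1, ri) % (pk1 * pk1) for c, ri in zip(enc_t, r)]
    enc_r_pk2 = [encrypt(pk2, ri) for ri in r]      # sent to P_1^(u2)
    # P_2^(u1): decrypt the masked vector and re-encrypt it under pk2.
    mu_plain = [decrypt(pk1, sk1, c) for c in mu]
    enc_mu_pk2 = [encrypt(pk2, v) for v in mu_plain]
    # P_1^(u2): remove the mask, E(mu') * E(r')^(N-1) = E(mu' - r').
    n2 = pk2 * pk2
    return [a * pow(b, pk2 - 1, n2) % n2
            for a, b in zip(enc_mu_pk2, enc_r_pk2)]

# Demo with hypothetical toy keys and a 3-dimensional record.
pk1, sk1 = keygen(1009, 1013)
pk2, sk2 = keygen(1019, 1021)
t = [3, 14, 15]
enc_t = [encrypt(pk1, v) for v in t]
enc_t_pk2 = secure_transform(enc_t, pk1, sk1, pk2, ell=8)
print([decrypt(pk2, sk2, c) for c in enc_t_pk2])
```

In this sketch the party holding `sk1` only ever decrypts the masked values, which mirrors the argument of the security analysis above; real deployments would use moduli of cryptographic size.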

Main protocol
In this section, we give the construction of our main protocol, which enables a query owner to extract the interpolation of the k-nearest neighbors of a query point of his choice, as shown in Fig. 2. As we stated in the "Introduction", our construction can be viewed as an extension of the protocol presented in Elmehdwi, Samanthula & Jiang (2014), which proposes an efficient solution to the k-nearest neighbor query problem over an encrypted database outsourced to a single cloud.
Figure 2: (1) each $DO_u$ uploads its data to the server $CSP_1^{(u)}$; (2) each $DO_u$ gives its secret key to $CSP_2^{(u)}$; (3) QO sends its query point $Q$ in encrypted form to the servers $CSP_1^{(u)}$; (4) $CSP_1^{(u)}$ and $CSP_2^{(u)}$ find the local nearest neighbours; (5) $CSP_1^{(1)}$ and $CSP_1^{(2)}$ decide on the global nearest neighbour among the local nearest neighbours (4 and 5 are repeated k times in the protocol); (6) ...
We assume that each data owner $DO_u$ has a database $D_u$ consisting of $n$ records, each of which is an $m$-dimensional vector whose entries lie in $[0, 2^{\ell})$. We also assume that there exist two non-colluding semi-honest cloud service providers, $CSP_1^{(u)}$ and $CSP_2^{(u)}$, for each data owner $DO_u$. Initially, each $DO_u$ encrypts his database $D_u$ as $E_{pk_u}(d_{i,j}^{(u)})$, where $1 \le i \le n$ and $1 \le j \le m$. Each $DO_u$ then outsources the encrypted database, together with the future query service, to the clouds, i.e. $DO_u$ gives $E_{pk_u}(d_{i,j}^{(u)})$ to $CSP_1^{(u)}$ and his secret key $sk_u$ to $CSP_2^{(u)}$. When the query owner (QO) wants to retrieve the interpolation of the k-nearest neighbors of a query point $Q$, he produces two encryptions of $Q$, namely $E_{pk_1}(Q) = \langle E_{pk_1}(q_1), \ldots, E_{pk_1}(q_m) \rangle$ and $E_{pk_2}(Q) = \langle E_{pk_2}(q_1), \ldots, E_{pk_2}(q_m) \rangle$, using the public keys of the data owners $DO_1$ and $DO_2$, respectively, and gives each encryption $E_{pk_u}(Q)$ to the corresponding cloud service provider $CSP_1^{(u)}$.
After receiving the encryption $E_{pk_u}(Q)$, each $CSP_1^{(u)}$ runs the SSED protocol together with the corresponding server $CSP_2^{(u)}$ on the input $\langle E_{pk_u}(Q), E_{pk_u}(d_i^{(u)}) \rangle$ [...] the servers $CSP_1^{(1)}$ with the input $E_{pk_1}(e_{\min})$, $CSP_2^{(1)}$ with the secret key $sk_1$, and $CSP_1^{(2)}$ securely run the ST protocol to compute the encryption of $e_{\min}$ under the public key $pk_2$. Note that $E_{pk_2}(e_{\min})$ is only known to $CSP_1^{(2)}$ [...] for each $i$ as $E_{pk_u}(\lambda_i^{(u)}) = E_{pk_u}(e_{\min} - e_i^{(u)}) = E_{pk_u}(e_{\min}) * E_{pk_u}(e_i^{(u)})^{N-1}$.
Each $CSP_1^{(u)}$ then randomizes $E_{pk_u}(\lambda_i^{(u)})$ as $E_{pk_u}(a_i^{(u)})$ [...] is a random number in $\mathbb{Z}_N$. Among all $2n$ encryptions $E_{pk_u}(a_i^{(u)})$, where $i = 1, \ldots, n$ and $u = 1, 2$, exactly one is an encryption of zero and all others are encryptions of random numbers. Each $CSP_1^{(u)}$ securely runs the SBD protocol with the server $CSP_2^{(u)}$ on the inputs $E_{pk_u}$ [...] as $E_{pk_u}(d_{i,j})$. As we stated before, since only one of the encryptions among all $E_{pk_u}(b_i^{(u)})$ is $E_{pk_u}(1)$ and the remaining ones are $E_{pk_u}(0)$, one of the encryptions $E_{pk_u}(d'^{(u)}_1)$ will be an encryption of zero and the other will be an encryption of a nonzero value, namely the first closest record. $CSP$ [...] under the public key $pk_1$. Note that [...] is only known to $CSP$ [...] $= \langle E_{pk_u}(1), \ldots, E_{pk_u}(1) \rangle$. On the other hand, if $E_{pk_u}(b_i^{(u)}) = E_{pk_u}(0)$, the SBOR protocol will have no effect on $e_i^{(u)}$.
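The subtraction $E_{pk_u}(e_{\min}) * E_{pk_u}(e_i^{(u)})^{N-1}$ and the subsequent randomization can be illustrated with a small sketch. The text does not spell out how the randomization is performed, so raising the ciphertext to a random unit $r$ (which yields an encryption of $r \cdot \lambda_i$) is an assumption of this sketch; the toy Paillier helpers, the function name `mark_minimum`, and the demo distances are likewise hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key pair from two small primes (demo sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))

def random_unit(n):
    """Random element of Z_n that is invertible modulo n."""
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return r

def encrypt(n, m):
    n2 = n * n
    return pow(n + 1, m % n, n2) * pow(random_unit(n), n, n2) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def mark_minimum(n, enc_e_min, enc_dists):
    """Return encryptions that decrypt to 0 exactly where e_i == e_min."""
    n2 = n * n
    marked = []
    for c in enc_dists:
        # E(e_min) * E(e_i)^(N-1) = E(e_min - e_i mod N)
        enc_lam = enc_e_min * pow(c, n - 1, n2) % n2
        # Assumed randomization: E(lam)^r = E(r * lam); zero stays zero,
        # every nonzero difference becomes a random nonzero value.
        marked.append(pow(enc_lam, random_unit(n), n2))
    return marked

# Demo: in the protocol e_min is produced by the secure minimum step;
# here it is computed in the clear only to build a test input.
n, sk = keygen(1009, 1013)
dists = [9, 4, 7, 12]
enc_dists = [encrypt(n, e) for e in dists]
marked = mark_minimum(n, encrypt(n, min(dists)), enc_dists)
print([decrypt(n, sk, c) == 0 for c in marked])  # -> [False, True, False, False]
```

Choosing $r$ coprime to $N$ guarantees that a nonzero difference never collapses to zero, so the position of the single zero is preserved while the other values carry no information about the distances.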
Because our protocol outputs the interpolation of the k-nearest neighbors of the query point $Q$, the server $CSP_1^{(1)}$ does not need to keep all the nearest records separately. Instead, it builds the interpolation incrementally: after each iteration, $CSP_1^{(1)}$ adds the current closest record $E_{pk_1}(d^{\min}_p)$ to the previous sum $E_{pk_1}(S_{p-1}) = E_{pk_1}(d^{\min}_1 + \ldots + d^{\min}_{p-1})$ as $E_{pk_1}(S_{p-1}) * E_{pk_1}(d^{\min}_p)$, and obtains the current sum $E_{pk_1}(S_p) = E_{pk_1}(d^{\min}_1 + \ldots + d^{\min}_p)$.
After $k$ iterations, $CSP_1^{(1)}$ holds $E_{pk_1}(S_k) = E_{pk_1}(d^{\min}_1 + \ldots + d^{\min}_k)$, the encryption of the sum of the k-nearest neighbors of the query point $Q$. $CSP_1^{(1)}$ then randomizes each component as $\gamma_j = E_{pk_1}(S_{k,j}) * E_{pk_1}(r_j)$, where the $r_j$ are random numbers in $\mathbb{Z}_N$ and $1 \le j \le m$. $CSP$ [...]
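The incremental accumulation and the final masking can be sketched as follows. Step 4 of Algorithm 3 states that $\gamma_j$ is sent to $CSP_2^{(1)}$ and $r_j$ to QO; the remaining steps used here (decryption of $\gamma_j$ by $CSP_2^{(1)}$, removal of the mask and division by $k$ by QO) are assumptions of this sketch, since the corresponding text is not available. The toy Paillier helpers and the name `knn_interpolation` are hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key pair from two small primes (demo sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))

def encrypt(n, m):
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(n + 1, m % n, n2) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def knn_interpolation(n, sk, enc_neighbors):
    """Average k encrypted m-dimensional records component-wise."""
    n2 = n * n
    k, m = len(enc_neighbors), len(enc_neighbors[0])
    # CSP_1^(1): running sum, E(S_p) = E(S_{p-1}) * E(d_p^min).
    enc_sum = [encrypt(n, 0) for _ in range(m)]
    for record in enc_neighbors:
        enc_sum = [s * c % n2 for s, c in zip(enc_sum, record)]
    # CSP_1^(1): mask, gamma_j = E(S_{k,j}) * E(r_j), r_j random in Z_N.
    r = [secrets.randbelow(n) for _ in range(m)]
    gamma = [s * encrypt(n, rj) % n2 for s, rj in zip(enc_sum, r)]
    # Assumed: CSP_2^(1) decrypts the masked sums and forwards them.
    masked = [decrypt(n, sk, g) for g in gamma]
    # Assumed: QO removes r_j and divides by k to obtain the average.
    sums = [(v - rj) % n for v, rj in zip(masked, r)]
    return [s / k for s in sums]

# Demo with three hypothetical nearest neighbors in two dimensions.
n, sk = keygen(1009, 1013)
neighbors = [[1, 2], [3, 4], [5, 9]]
enc_neighbors = [[encrypt(n, v) for v in rec] for rec in neighbors]
print(knn_interpolation(n, sk, enc_neighbors))  # -> [3.0, 5.0]
```

Because only the running sum is kept, the storage at $CSP_1^{(1)}$ stays at $m$ ciphertexts regardless of $k$.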

Security analysis
In this section, we give the security analysis of the protocol shown in Algorithm 3. As emphasized above, the data owners encrypt their data before outsourcing them to the cloud. Since they use the Paillier encryption scheme, which is semantically secure, the data is not leaked to any cloud service provider. At the first step of Algorithm 3, the query point $Q$ is encrypted before being given to the corresponding cloud service providers. Again, since the underlying Paillier cryptosystem is semantically secure, the query point $Q$ is not revealed to any data owner or any cloud service provider. At the second step of Algorithm 3, the servers $CSP_1^{(u)}$ and $CSP_2^{(u)}$ execute the SSED and SBD protocols. As stated in Elmehdwi, Samanthula & Jiang (2014), the outputs of these protocols are in encrypted form and are revealed only to the servers $CSP_1^{(u)}$.
Besides, no information about the plaintexts is revealed to any party during these protocols. At step 3(a) of each iteration in Algorithm 3, the output of the $SMIN_n$ protocol is revealed only to the servers $CSP_1^{(u)}$. Moreover, the $SMIN_n$ protocol guarantees that the servers involved do not learn which records from the two databases correspond to the current minimum distances. Similarly, the output of the SMIN protocol executed at step 3(b) of Algorithm 3 is revealed only to the server $CSP_1^{(1)}$, and the protocol does not reveal which record corresponds to the current global minimum.
The servers also run the ST protocol at steps 3(b) and 3(e) of Algorithm 3 to transform the encryption of the current minimum distance under the public key $pk_{u_1}$ into its encryption under the public key $pk_{u_2}$. As explained in the security analysis of ST, the ST protocol protects the content of the encryption from all parties involved in the protocol. Furthermore, at step 3(c), each server $CSP_1^{(u)}$ runs the SBAOR protocol with $CSP_2^{(u)}$, which outputs the encryption of 1 only for the index corresponding to the current global minimum and the encryption of 0 for all other indexes. Note that the SBAOR protocol uses the SM and SBOR protocols as subprocedures, and it does not leak the index that corresponds to the current global minimum. Thus, data access patterns are protected from all the involved servers throughout the protocol, i.e. the servers do not learn which data records are used in producing the interpolation of the k-nearest neighbors.
In conclusion, the STPkNN protocol preserves the confidentiality of the data, secures the privacy of the user's query point, and hides data access patterns.
On the other hand, at the third step of our protocol, the servers perform the following operations $O(k)$ times: a single instantiation of the $SMIN_n$ protocol, a single instantiation of the SMIN protocol, 2 instantiations of the ST protocol, $n$ instantiations of the SBD and SBAOR protocols, $n \cdot m$ instantiations of the SM protocol, and $n \cdot \ell$ instantiations of the SBOR protocol. The computation complexity of the $SMIN_n$ protocol presented in this paper is bounded by $O(\ell \cdot n)$ multiplications and $O(\ell \cdot n)$ exponentiations, and that of the SMIN protocol presented in Elmehdwi, Samanthula & Jiang (2014) is bounded by $O(\ell)$ multiplications and $O(\ell)$ exponentiations. The ST protocol proposed in this paper and the SM and SBOR protocols presented in Elmehdwi, Samanthula & Jiang (2014) each require only a constant number of multiplications and a constant number of exponentiations. Also, as emphasized above, the computation complexity of the SBD protocol is bounded by $O(\ell)$ multiplications and $O(\ell)$ exponentiations (Samanthula, Hu & Jiang, 2013). Moreover, since the SBAOR protocol proposed in this paper deploys $\ell$ instantiations of the SM protocol and $\ell - 1$ instantiations of the SBOR protocol as subprocedures, its computation complexity is bounded by $O(\ell)$ multiplications and $O(\ell)$ exponentiations. Thus, the computation complexity of the third step is bounded by $O(k \cdot n \cdot (m + \ell))$ multiplications and exponentiations in total.
In addition, the servers perform only $O(m)$ operations in the remaining steps of the protocol. Thus, the total computation complexity of our protocol is bounded by $O(k \cdot n \cdot (m + \ell))$ multiplications and exponentiations.
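The bound can be checked by summing the per-iteration costs listed above, using the stated instantiation counts and per-protocol bounds. The grouping below is only a worked restatement of those figures, not an additional result:

$$
k \cdot \Big[ \underbrace{O(\ell \cdot n)}_{SMIN_n} + \underbrace{O(\ell)}_{SMIN} + \underbrace{2 \cdot O(1)}_{ST} + \underbrace{n \cdot O(\ell)}_{SBD} + \underbrace{n \cdot O(\ell)}_{SBAOR} + \underbrace{n \cdot m \cdot O(1)}_{SM} + \underbrace{n \cdot \ell \cdot O(1)}_{SBOR} \Big] = O(k \cdot n \cdot (m + \ell))
$$

Adding the $O(m)$ operations of the remaining steps does not change the order, since $O(k \cdot n \cdot (m + \ell)) + O(m) = O(k \cdot n \cdot (m + \ell))$.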

PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed protocol STPkNN by carrying out a number of experiments under different parameter settings. We deployed the Paillier cryptosystem (Paillier, 1999) for encryption and implemented the proposed protocols in Java. All the experiments were performed on a virtual Linux machine with an Intel® Xeon® Two-Core™ CPU 2.20 GHz processor and 4 GB RAM running Ubuntu 16.04 LTS. For the experiments, we utilized two real data sets from the UCI machine learning repository (Dua & Graff, 2017): Heart Disease, which consists of 600 data records, each containing 14 attributes concerning heart disease diagnosis, and Bank Marketing, which contains 800 data records, each including 15 attributes that help to predict whether a new client will pay a term deposit. We first processed these data sets so that they contain only non-negative integer values. We then split each data set into two equal parts so that each part is operated by a single cloud pair. Note that, for all the measurements, the experiment was repeated for multiple query points and the average time taken to execute a query is reported.
Algorithm 3 (continued): 4. $CSP_1^{(1)}$: for $j = 1$ to $m$ do: $\gamma_j \leftarrow E_{pk_1}(S_{k,j}) \times E_{pk_1}(r_j)$, where $r_j \in_R \mathbb{Z}_N$; sends $\gamma_j$ to $CSP_2^{(1)}$ and $r_j$ to QO. 5. $CSP$ [...]
We first evaluated the computation cost of STPkNN, in minutes, on the finance (Bank Marketing) data set for varying numbers of nearest neighbors (k) and attributes (m). As shown in Fig. 3A, if we fix the number of attributes at m = 6, the running time of our protocol increases from 74.08 to 226.16 min when k is changed from 5 to 15. For m = 12, the running time increases from 78.85 to 239.21 min when k is changed from 5 to 15. So the running time of our protocol grows linearly with k. We also observe that the computation cost of our protocol increases by a factor of nearly 1.06 when m is doubled.
Similarly, we evaluated the computation cost of STPkNN, in minutes, on the Heart Disease data set for varying numbers of nearest neighbors (k) and attributes (m). As shown in Fig. 3B, if we set the number of attributes to m = 6, the running time of our protocol increases from 55.89 to 168.26 min when k is changed from 5 to 15. For m = 12, the running time increases from 59.56 to 178.63 min when k is changed from 5 to 15. Thus, our protocol again scales linearly with k.
On the other hand, the running time of our protocol increases by a factor of almost 1.34 when the number of data records (n) is changed from 300 to 400. Thus, the running time of our protocol grows linearly with n.
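The scaling factors quoted above can be recomputed from the reported running times with a few lines of arithmetic. The sketch below uses only the eight timings given in the text; the dictionary layout and function names are hypothetical, and the comparison across n assumes that the 300-record and 400-record settings correspond to the Heart Disease and finance data sets after splitting, respectively.

```python
# Running times in minutes as reported in the text: (k = 5, k = 15).
reported = {
    ("finance", 6): (74.08, 226.16),
    ("finance", 12): (78.85, 239.21),
    ("heart", 6): (55.89, 168.26),
    ("heart", 12): (59.56, 178.63),
}

def k_ratios(times):
    """Growth in running time when k is tripled from 5 to 15."""
    return {key: t15 / t5 for key, (t5, t15) in times.items()}

def m_ratios(times):
    """Growth in running time when m is doubled from 6 to 12."""
    out = {}
    for name in ("finance", "heart"):
        for idx, k in enumerate((5, 15)):
            out[(name, k)] = times[(name, 12)][idx] / times[(name, 6)][idx]
    return out

def n_ratios(times):
    """Growth from n = 300 (heart) to n = 400 (finance); assumed mapping."""
    out = {}
    for m in (6, 12):
        for idx, k in enumerate((5, 15)):
            out[(m, k)] = times[("finance", m)][idx] / times[("heart", m)][idx]
    return out

for label, ratios in (
    ("k: 5 -> 15", k_ratios(reported)),
    ("m: 6 -> 12", m_ratios(reported)),
    ("n: 300 -> 400", n_ratios(reported)),
):
    print(label, {key: round(value, 3) for key, value in ratios.items()})
```

Tripling k multiplies the running time by roughly 3 in every setting, and the ratio for n stays close to 400/300, which is consistent with the linear growth stated above.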

CONCLUSIONS
In this study, we proposed a secure k-NN method that produces an interpolation of the k-nearest neighbors of a query point over encrypted databases. We claimed that employing two different databases in the protocol, instead of one, yields a more accurate and reliable interpolation value. We validated this claim by conducting experiments on publicly available real data sets. We also showed that our protocol preserves the confidentiality of the data, assures the privacy of the user's query point, and hides data access patterns. Finally, we analyzed the performance of the proposed protocol through a number of experiments under different parameter settings. As future work, we will examine and expand our work to apply other interpolation methods to encrypted data in a distributed architecture. We will also extend our protocol, which considers two encrypted databases stored in two different clouds, to multi-cloud settings.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The authors received no funding for this work.