Private Web Search Using Proxy-Query Based Query Obfuscation Scheme

People use web search engines to retrieve information from the World Wide Web. Search engines maintain query logs to refine retrieved information for personalized web search. A query log contains both sensitive and non-sensitive information about users. If used against an individual, the query log can reveal a great deal of information about them and therefore poses privacy concerns. In recent years, many private web search (PWS) schemes have been proposed to realize privacy-preserving web search. Although each PWS scheme claims unique features for attaining web search privacy (WSP), no study explains which characteristics should be considered when building and using a PWS scheme. This article has two objectives. First, we present a novel PWS scheme that uses a proxy-query-based query obfuscation approach. Proxy-query-based query obfuscation is a new topic in PWS research. It provides an information retrieval (IR) facility for retrieving information from web search engines via proxy queries. One clear advantage of the proposed scheme is that users do not issue true queries to search engines. The second objective is to define the characteristics of PWS and to analyze the proposed and modern PWS schemes against them. The analysis shows that only the proposed PWS scheme achieves all characteristics. Existing PWS schemes are vulnerable to WSP attacks because they do not meet all PWS characteristics.


I. INTRODUCTION
Personalized web search, or personalized information retrieval (PIR), is an important feature of modern web search engines. It delivers the most relevant information according to different contexts [1], [2], [3]. Web search engines extract users' personal information from query logs and use PIR techniques to infer what the users are searching for [4], [5], [6], [7]. According to current studies, PIR improves retrieval efficacy; on the negative side, however, it poses major online search privacy concerns [8], [9], [10]. According to an American Pew web search survey,¹ around 75% of users are dissatisfied with search engines that track and log their search queries to achieve personalized web search.
The associate editor coordinating the review of this manuscript and approving it for publication was Fu Lee Wang.
To the best of our knowledge, current techniques for protecting users' privacy in online settings mostly target the identifiability aspect of privacy. Their main aim is to provide secure communication [11], [12], encrypted data storage [13], [14], and data release [15], [16]. However, another critical factor, linkability, is largely overlooked [17]. Linkability is an important characteristic of web search privacy. Through linkability, web search engines infer detailed information about users' personal interests by linking multiple queries to users. This leaves users little control over their web search privacy (WSP). The authors of [18] studied linkability and showed experimentally that a simple supervised classifier can link a set of related queries to a set of users with high accuracy.
To further understand the concerns related to WSP, consider a scenario in which a user submits several queries to a web search engine to retrieve information related to depression, HIV, or pregnancy. The search engine may sell this information to other firms for targeted advertising, or the query log may be compromised [19], [20]. Other organizations or attackers can exploit the user's personal information in the query log to reveal sensitive information about their health condition [6], [18], [21], [22], [23], [24]. A WSP breach of this type happened in 2006, when the New York Times reported successfully deducing private information about numerous users from an AOL pseudonymized query log.² One of the victims was a 62-year-old woman who had submitted multiple queries to obtain information related to hand tremors, dry mouth, and the effects of nicotine on the body.
In this article, we propose a PWS scheme that uses proxy-query-based query obfuscation to achieve web search privacy. Proxy-query-based query obfuscation is an emerging research domain in PWS research [25]. It provides an information retrieval (IR) facility to retrieve information from search engines through proxy queries. One clear advantage of the proposed technique is that users do not issue true queries to retrieve information. Users issue proxy queries, and the IR system³ automatically generates the cover and true queries from the proxy queries; it cannot tell whether the user wants the results of a cover query or of the true query. Figure 1 shows the architecture of PWS using the proxy-query-based query obfuscation scheme. Given a proxy dictionary (Table 1), suppose a user wants to issue the true query Depression Treatment. The client machine replaces the true query with the proxy query Posh Men Sunglasses and submits it to the IR system. The IR system generates all cover queries of the proxy query, processes them, and returns their rank lists to the client machine. The client machine displays only the results of the true query Depression Treatment and ignores the others. The key challenge in this research is to ensure the plausibility of the generated cover queries. The proposed scheme achieves plausibility with respect not only to the user's current query but also to the sequence of their previous queries [26], [27].
The previous scheme in this line of research, ProxyTermPWS [25], achieves the above objective by mapping the terms of topics containing sensitive information to the terms of proxy topics. Its key weakness is that it builds the proxy-term mapping between individual terms of proxy and cover topics, which is less effective when queries contain more than one term. This is due to the computational difficulty of achieving an optimum proxy-term mapping for every combination of valid query terms. To address this problem, the proposed scheme builds the proxy mapping using queries of topics rather than individual terms. This is useful not only for the current query but also for a sequence of related queries. Another limitation of ProxyTermPWS is that it recovers the true query from the proxy query by exhaustively generating all possible permutations of proxy and cover terms, which creates an exhaustive set of cover queries for the IR system to process. For example, for a proxy query of three terms and a proxy group of 30 terms, the algorithm generates 30^3 = 27,000 cover queries, which are challenging to process and retrieve. Furthermore, ProxyTermPWS obfuscates queries less effectively when a user issues a series of consecutive queries linked to a similar topic.
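The 27,000 figure can be checked with a short enumeration. The sketch below is only an illustration of term-level expansion under assumed data: the function name and the placeholder terms are hypothetical and are not taken from ProxyTermPWS.

```python
from itertools import product

def term_level_cover_queries(proxy_groups):
    """Enumerate one term per query position (term-level proxy mapping)."""
    return [" ".join(terms) for terms in product(*proxy_groups)]

# Hypothetical data: a three-term proxy query whose terms each belong to a
# proxy group of 30 interchangeable terms.
groups = [[f"t{pos}_{i}" for i in range(30)] for pos in range(3)]
cover = term_level_cover_queries(groups)
print(len(cover))  # 30 ** 3 = 27000
```

With a query-level mapping, by contrast, the number of queries to process equals the size of one proxy group.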
This article proposes a novel PWS scheme that provides query obfuscation through proxy queries. The proposed approach differs from the ProxyTermPWS scheme [25] in that it builds the proxy mapping from the queries of topics rather than from individual terms. This makes query obfuscation highly effective when a user issues a sequence of consecutive queries. Furthermore, because each proxy group contains a valid set of proxy and cover queries, the proposed technique does not generate an exhaustive set of cover queries to process in order to recover the true query.
Given a list of topics and a set of documents comprising sensitive and non-sensitive information, the challenges we address in this article include identifying the computing cost of searching for an ideal proxy-query mapping and proposing a heuristic that searches for an optimal mapping. Once the proposed scheme has discovered the best mapping, it makes the mapping available to all users in the form of a proxy-query dictionary.
To retrieve information from the mapped topics privately, users replace their true queries with proxy queries and issue them to the IR system. When a proxy query is received, the IR system retrieves the results of the proxy query and all of its cover queries and delivers them to the user. In this manner, the IR system cannot identify the real query. The user's machine displays only the results of the true query and ignores all other results. The suggested approach is well suited to personalized IR settings, which have been studied extensively and are frequently employed in commercial search engines [4], [28], [29], [30]. In personalized IR, search engines store users' queries in query logs to improve the search effectiveness of future queries.
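The retrieval flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary layout, the function names, and the cover query placed in the group are hypothetical, and `fake_search` stands in for a real search backend.

```python
# Hypothetical proxy dictionary: proxy query -> all queries of its proxy group.
# The first entry is the proxy query; the remaining entries are cover queries,
# one of which is the user's true query.
PROXY_DICTIONARY = {
    "Posh Men Sunglasses": [
        "Posh Men Sunglasses",
        "Synthetic Motor Oil",    # assumed cover query, for illustration only
        "Depression Treatment",
    ],
}

def to_proxy(true_query, dictionary):
    """Client side: find the proxy query whose group contains the true query."""
    for proxy, group in dictionary.items():
        if true_query in group:
            return proxy
    raise KeyError(true_query)

def ir_system(proxy_query, dictionary, search):
    """Server side: process every query of the group; none is marked as real."""
    return {query: search(query) for query in dictionary[proxy_query]}

def client_search(true_query, dictionary, search):
    """Client side: issue the proxy query, keep only the true query's results."""
    proxy = to_proxy(true_query, dictionary)
    rank_lists = ir_system(proxy, dictionary, search)
    return rank_lists[true_query]

def fake_search(query):
    """Stand-in for a search engine: returns two dummy result titles."""
    return [f"{query.lower()} result {i}" for i in (1, 2)]

print(client_search("Depression Treatment", PROXY_DICTIONARY, fake_search))
```

The IR system only ever sees the proxy query; the selection of the true query's rank list happens on the client.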

II. RELATED WORK
Recently, many schemes and protocols have been proposed for protecting web search privacy. The following subsections present a detailed review of modern PWS schemes.

A. PRIVATE INFORMATION RETRIEVAL SCHEMES
Private information retrieval schemes have been proposed to retrieve information from databases. The techniques proposed in [31], [32], and [33] privately retrieve information by first partitioning a database and then distributing the partitions across multiple servers, under the assumption that the servers do not communicate with each other during retrieval. Unfortunately, this assumption does not hold in the context of web search, because a search engine prepares a single rank list from all relevant documents in response to a search query.

B. PROXY-BASED PWS SCHEMES
The PWS schemes of this category privately retrieve information from search engines through anonymous web browsing using proxy servers or dynamic IP addressing [34], [35], [36]. RAC [36] and Tor [34] are well-known examples of this class. Although PWS schemes of this category hide users' real IP addresses from the web search engine (WSE), anonymizing users' identities through proxy servers or dynamic IP addressing shifts the web search privacy (WSP) threat from the WSE to the proxy servers, since proxy servers can begin logging users' queries.

C. COLLABORATIVE PEER-TO-PEER PWS
The PWS schemes of this category achieve WSP through users' collaboration. Search queries are issued to the WSE, and information is retrieved from it, through collaborating users [37], [38], [39]. The collaboration is achieved through peer-to-peer protocols. With these schemes, users do not submit their search queries directly to the WSE. They randomly pick other users from the peer-to-peer network and submit their queries through them. In this way, a WSE cannot determine the true identities of users, since queries are always issued through different users.
Previous studies show that PWS schemes of this category theoretically achieve reasonable WSP; however, peer-to-peer PWS schemes are not realistic in practice because these protocols require the same set of peers to be available when users issue a series of related queries from a single or multiple search sessions [17]. Another disadvantage of this category is the slow query response time, since several peers participate from various networks and locations. Furthermore, peer-to-peer PWS schemes require complex implementations of cryptographic protocols on client machines and servers.

D. QUERY SCRAMBLING BASED PWS TECHNIQUE
The PWS schemes of this category achieve WSP through scrambled queries. Given a true search query, the schemes use a scrambler that transforms the true query into a set of scrambled queries, from which information is retrieved [40], [41]. The objective of scrambled queries is to hide users' true search intents by retrieving information from WSEs with queries that are more general than specific. The WSE retrieves and returns the information of the scrambled queries to the client machine. The client machine collects the information of all scrambled queries and prepares a single ranking. The main disadvantage of PWS via query scrambling is that it reduces retrieval effectiveness, since users obtain information from only partially matched documents.
VOLUME 11, 2023

E. QUERY OBFUSCATION BASED PWS TECHNIQUES
Query obfuscation-based private web search (OB-PWS) is another class of schemes recently addressed in the context of web search privacy [17], [33], [42], [43]. OB-PWS schemes achieve WSP by obfuscating the true query with the help of cover queries. The cover queries are generated automatically. Cover queries hide search intents of users through fake queries [44], [45]. The major challenge is how to generate non-noisy cover queries that look similar to normal web search queries.
Two techniques have been widely employed in previous studies to generate non-noisy queries. The first is a query log-based approach. This technique presupposes that a client computer has access to a portion of the real query log, from which cover queries are created. The second technique is based on pseudo-cover queries. This technique crawls the internet for a collection of documents linked to various topics. The collection is then used to construct pseudo-cover queries based on query quality predictors [46], [47]. Noisy queries that real web users do not typically issue are pruned by the query quality predictors.
TrackMeNot is a well-known example of an OB-PWS scheme [44]. TrackMeNot submits k random queries in response to a true search query to hide the true query. GooPIR [45] is another OB-PWS scheme. It uses a term dictionary to identify the terms of cover queries. The fundamental disadvantage of both schemes is that neither captures query relatedness while creating cover queries. Both schemes provide low PWS effectiveness if a user issues a sequence of related search queries from the current session or from multiple sessions.

F. QUERY OBFUSCATION BASED PWS TECHNIQUE BY CONSIDERING RELATEDNESS WITH PAST QUERIES
In typical web search, users usually continue their current search tasks with multiple queries from single or multiple search sessions. As a result, if a PWS scheme does not consider the relatedness within a sequence of related queries while producing cover queries, a WSE may readily distinguish real queries from cover queries by separating related consecutive queries from unrelated consecutive queries [17]. Traditional OB-PWS systems, as mentioned above, give reduced PWS effectiveness: although they create non-noisy cover queries, they do not consider the relatedness of cover queries to past queries.
Reference [17] proposed an OB-PWS scheme that accomplishes this goal by capturing relatedness across cover queries via a collection of hierarchically arranged topics. That approach creates cover queries from the topic hierarchy rather than just random non-noisy cover queries. As a result, if a series of true queries are related to one another, the topic hierarchy ensures that the cover queries are related as well. One disadvantage of the scheme in [17] is that it indexes a large collection on client computers to maintain topic hierarchies. This is not appropriate for machines with limited space. Furthermore, because the topic hierarchies on client machines are not shared, the scheme cannot capture relatedness if a user issues related queries from different devices.

III. PROXY-TERM BASED QUERY OBFUSCATION
Reference [25] proposed a proxy-term-based query obfuscation technique (ProxyTermPWS) to privately retrieve information from IR systems. The technique achieves web search privacy by retrieving information through proxy queries. For each search query, the user replaces the terms of the query with the proxy terms stored in the proxy dictionary. The transformed (proxy) query is issued to the IR system. The proxy query obfuscates the true query: because the proxy query maps to many cover queries stored in the proxy dictionary, the IR system cannot determine the true query from the proxy query. The IR system generates cover queries from the proxy query. It processes all cover queries and returns the results to the user. The client machine displays only the results of the true query and ignores the other results.

A. PROXY-QUERY BASED QUERY OBFUSCATION
The standard approach we discussed above generates a proxy-term mapping between individual terms of proxy and cover topics. This provides low effectiveness if the queries contain more than one term. This is due to the computational difficulty of achieving optimum proxy-term mapping for every combination of valid query terms. To address this restriction, the proposed scheme builds proxy mapping using queries of topics rather than individual terms. This is useful not only for single queries but also for a sequence of related queries.
Given a set of topics and an exhaustive set of queries for retrieving the documents of those topics, the proposed approach creates a proxy mapping between queries. The goal of the mapping is to offer the best possible query obfuscation so that users may retrieve information via proxy queries. Let Q be the exhaustive set of queries. The system partitions the queries in Q into P proxy-query sets, each containing M queries. The system ensures that each set contains at least one query from a topic containing sensitive information. Searching for an optimal proxy mapping is difficult because the system aims to obfuscate not just current queries but also all related queries that users might issue in sequence. The problem is computationally hard: given |Q| queries and a proxy-query set of size M, the system needs to find an optimal mapping among O(|Q|^M) combinations.
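To give a feel for how quickly this search space grows, the sketch below counts the candidate proxy-query sets of size M that can be drawn from |Q| queries. The function name and the query counts are assumptions chosen for illustration; the binomial coefficient is bounded above by |Q|^M, in line with the bound stated above.

```python
from math import comb

def candidate_proxy_sets(num_queries, set_size):
    """Number of distinct query sets of the given size (order ignored)."""
    return comb(num_queries, set_size)

# Hypothetical sizes for |Q|; M = 6 matches the proxy group size used later
# in the experiments.
for num_queries in (100, 1000, 10000):
    count = candidate_proxy_sets(num_queries, 6)
    print(num_queries, count, count <= num_queries ** 6)
```

Even for a single set the count is far too large to enumerate, which motivates a heuristic search.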

IV. GENERATING PROXY-QUERY MAPPING
Suppose there is a collection of documents comprising sensitive and non-sensitive topics, and assume the collection contains T topics. For each query q ∈ Q, the algorithm ranks the topics by tfidf score, treating all documents of a topic as a single document. The query q is assigned to the topic with the greatest tfidf score. A single topic cannot contain more than T queries. Thus, if the best topic for a new query already has T queries, the new query is assigned to the next best topic.
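The assignment step can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a smoothed idf (log((1 + N) / (1 + df)) + 1); the paper does not state its exact tfidf variant, and the function name, the `capacity` parameter, and the toy topics are hypothetical.

```python
import math
from collections import Counter

def assign_queries_to_topics(queries, topic_docs, capacity):
    """Assign each query to its best-scoring topic that still has room.

    topic_docs maps a topic to the concatenated text of all its documents,
    so each topic is treated as a single document.
    """
    tf = {topic: Counter(text.lower().split()) for topic, text in topic_docs.items()}
    n_topics = len(topic_docs)
    df = Counter(term for counts in tf.values() for term in counts)

    def idf(term):
        # Smoothed idf; an assumed variant, not necessarily the paper's.
        return math.log((1 + n_topics) / (1 + df[term])) + 1.0

    def score(query, topic):
        return sum(tf[topic][term] * idf(term) for term in query.lower().split())

    assignment = {topic: [] for topic in topic_docs}
    for query in queries:
        ranked = sorted(topic_docs, key=lambda t: score(query, t), reverse=True)
        for topic in ranked:  # fall through to the next best topic when full
            if len(assignment[topic]) < capacity:
                assignment[topic].append(query)
                break
    return assignment

topic_docs = {
    "Depression Treatment": "depression symptoms therapy depression treatment mood",
    "Motor Oil": "motor oil engine viscosity synthetic oil",
}
queries = ["depression symptoms", "depression treatment", "engine oil"]
print(assign_queries_to_topics(queries, topic_docs, capacity=2))
```

In this sketch a query is silently dropped when every topic is full; a real pipeline would need to decide how to handle that case.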
Let Q̂_q be the set of queries related to q ∈ Q. Queries in Q̂_q are ranked according to their similarity to q. We computed query similarity as the average cosine similarity between the top k documents of the queries. We retrieved the top k documents by ranking documents using tfidf scores.
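One possible reading of this similarity measure is sketched below: retrieve the top k documents of each query and average the cosine similarity over all document pairs. The pairing strategy is an assumption, and raw term-frequency vectors stand in for tfidf ranking to keep the example short; all names and documents are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_docs(query, docs, k):
    """Rank documents against the query (term-frequency stand-in for tfidf)."""
    qv = vectorize(query)
    return sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)[:k]

def query_similarity(q1, q2, docs, k=2):
    """Average cosine similarity over all pairs of the two top-k document sets."""
    d1 = [vectorize(d) for d in top_k_docs(q1, docs, k)]
    d2 = [vectorize(d) for d in top_k_docs(q2, docs, k)]
    pairs = list(product(d1, d2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

docs = [
    "depression symptoms include low mood",
    "depression treatment includes therapy",
    "motor oil protects the engine",
    "synthetic motor oil viscosity",
]
related = query_similarity("depression symptoms", "depression treatment", docs)
unrelated = query_similarity("depression symptoms", "motor oil", docs)
print(related > unrelated)
```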
Given Q queries and P proxy groups, the algorithm partitions the queries into P proxy groups. For each proxy group, the algorithm selects queries from the T topics, with no more than one query from each topic. We call this partitioning of queries into P proxy groups the mapping of queries to proxy queries. Queries in a proxy group are sorted by topic identifier. The queries of topics that contain sensitive information are placed at the end of the proxy group. The algorithm selects the first query of each proxy group as the proxy query. The other queries in each proxy group are called cover queries.
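A minimal sketch of this initial grouping is given below. It assumes the simplest case, in which every proxy group takes exactly one query from every topic and topic names serve as identifiers; the function name and data layout are hypothetical. The queries come from the example discussed in the next subsection.

```python
def build_proxy_groups(topic_queries, sensitive_topics, num_groups):
    """Build proxy groups with one query per topic.

    topic_queries maps a topic to its list of queries (at least num_groups
    each). Topics are ordered by identifier, with sensitive topics last.
    """
    ordered = sorted(topic_queries, key=lambda t: (t in sensitive_topics, t))
    groups = []
    for i in range(num_groups):
        group = [topic_queries[topic][i] for topic in ordered]
        # The first query is the proxy query; the rest are cover queries.
        groups.append({"proxy": group[0], "cover": group[1:]})
    return groups

topic_queries = {
    "Depression Treatment": ["Depression Symptoms", "Depression Symptom Diagnose"],
    "Chicken Roast Recipe": ["Chicken Recipe", "Quick Vegetable Recipe"],
    "Motor Oil": ["Motor Oil", "Fuel Viscosity"],
}
groups = build_proxy_groups(topic_queries, {"Depression Treatment"}, num_groups=2)
print(groups[0])
```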
Let q be a proxy query that a user issues to an IR system. The mapping of queries to proxy groups is optimal if, for each related proxy query q̂ ∈ Q̂_q of q (that a user issues in sequence), the cover queries of q̂ are also related to the cover queries of q. Mathematically, the fitness of a proxy-query mapping can be calculated as follows.
C_q̂ is the set of all cover queries of q̂; c_q and c_q̂ denote cover queries of q and q̂, respectively. We want to achieve a large score for fitness(Q). A large score indicates that the proxy-query mapping achieves optimal query obfuscation for the current query and the sequence of related queries.
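The following sketch shows one way such a fitness score could be computed. It is an assumption built from the verbal definition above and is not a reproduction of Equation 1: it sums, over every proxy query q and every related proxy query q̂, the similarity between cover queries at the same position. Word-level Jaccard overlap is a stand-in similarity, and all names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the document-based similarity."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y)

def fitness(groups, related, similarity=jaccard):
    """Assumed fitness: similarity between cover queries of related proxies.

    groups maps a proxy query to its cover queries (aligned by topic);
    related maps a proxy query q to its related proxy queries.
    """
    total = 0.0
    for q, covers in groups.items():
        for q_hat in related.get(q, []):
            for c_q, c_q_hat in zip(covers, groups[q_hat]):
                total += similarity(c_q, c_q_hat)
    return total

related = {"Chicken Recipe": ["Chicken Roast Recipe"]}
initial = {
    "Chicken Recipe": ["Motor Oil", "Depression Symptoms"],
    "Chicken Roast Recipe": ["Fuel Viscosity", "Depression Symptom Diagnose"],
}
improved = {
    "Chicken Recipe": ["Motor Oil", "Depression Symptoms"],
    "Chicken Roast Recipe": ["Motor Engine Oil", "Depression Symptom Diagnose"],
}
print(fitness(initial, related), fitness(improved, related))
```

Under this assumed score, a mapping whose related proxy queries carry related cover queries scores higher than one that pairs them with unrelated cover queries.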

A. GREEDY SEARCH HILL CLIMBING HEURISTIC
The effectiveness of the initial proxy groups can be poor if the system does not search for an optimal mapping. The proposed system improves effectiveness by exploring the search space with a greedy hill-climbing heuristic. Given an arbitrary solution, the hill-climbing heuristic iteratively improves the current solution by making random changes to it and evaluating the resulting neighbor solutions. If a neighbor solution improves the fitness, the heuristic replaces the current solution with the new one [48], [49]. The heuristic continues exploring the search space and ends when no improvement is obtained after many changes.
The objective of the heuristic is to improve the fitness of the initial proxy-query groups, as measured by Equation 1. To achieve this, the heuristic randomly selects k related proxy queries and retrieves their cover queries from the proxy groups. The cover queries of the k proxy queries are placed in a set C. The heuristic then randomly selects a cover query c from C and retrieves the related queries (Ĉ) of c from Q. It tests whether c has k related cover queries in C to obfuscate the k proxy queries. If c does not have related cover queries, the heuristic selects k unrelated cover queries from C and switches them with queries from Ĉ. The heuristic then calculates the fitness of the mapping. If the switch improves the fitness, the algorithm keeps the switched proxy-query mapping; otherwise, it switches back to the old mapping. The algorithm keeps switching cover queries in this way and stops when no improvement is gained after many changes. Once near-optimal proxy groups have been obtained, the system indexes the discovered mapping in the proxy dictionary. Algorithm 1 shows the pseudo-code of the heuristic.
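The accept-or-revert loop can be sketched as below. This is a simplified illustration, not the authors' Algorithm 1: it swaps cover queries between two randomly chosen proxy groups instead of targeting unrelated cover queries, and it uses a toy word-overlap fitness. The names, the `patience` stopping rule, and the data layout are assumptions.

```python
import random

def jaccard(a, b):
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y)

def make_fitness(related):
    """Toy fitness: similarity between cover queries of related proxy queries."""
    def fitness(groups):
        return sum(
            jaccard(c1, c2)
            for q, q_hats in related.items()
            for q_hat in q_hats
            for c1, c2 in zip(groups[q], groups[q_hat])
        )
    return fitness

def hill_climb(groups, fitness, patience=200, seed=0):
    """Swap cover queries between groups; keep a swap only if fitness improves."""
    rng = random.Random(seed)
    current = {proxy: list(covers) for proxy, covers in groups.items()}
    best = fitness(current)
    proxies = list(current)
    stale = 0  # consecutive swaps without improvement
    while stale < patience:
        a, b = rng.sample(proxies, 2)
        pos = rng.randrange(len(current[a]))
        current[a][pos], current[b][pos] = current[b][pos], current[a][pos]
        score = fitness(current)
        if score > best:
            best, stale = score, 0
        else:
            # Switch back to the old mapping.
            current[a][pos], current[b][pos] = current[b][pos], current[a][pos]
            stale += 1
    return current, best

initial = {
    "Chicken Recipe": ["Motor Oil"],
    "Chicken Roast Recipe": ["Fuel Viscosity"],
    "Quick Vegetable Recipe": ["Motor Engine Oil"],
}
fit = make_fitness({"Chicken Recipe": ["Chicken Roast Recipe"]})
mapping, best = hill_climb(initial, fit)
print(fit(initial), best)
```

Because only improving swaps are kept, the search can stall in a local optimum; the termination rule here is a fixed number of consecutive unsuccessful swaps.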
Example: Suppose a collection has three topics from which to generate a proxy dictionary (see Figure 2), and let Depression Treatment be the topic containing sensitive information. Table 3 shows six high-quality queries for each topic. Using M = 3 and P = 6, the algorithm creates the initial mapping shown in Table 4. The queries of the topic Chicken Roast Recipe are used as proxy queries. The fitness of the initial mapping is poor: if a user issues the sequence of related queries Depression Symptoms and Depression Symptom Diagnose, then the proxy queries Chicken Recipe and Quick Vegetable Recipe, and the cover queries Motor Oil and Fuel Viscosity, are not related to each other. Given these proxy and cover query sequences and the relationships between queries, an IR system can easily determine the true queries. The true queries (Depression Symptoms and Depression Symptom Diagnose) are strongly related to each other compared with the fake queries (Chicken Recipe and Quick Vegetable Recipe) and (Motor Oil and Fuel Viscosity).
To make the linkability task difficult, the system applies the heuristic to improve the fitness of the proxy-query mapping. Assume that, after applying many changes to the initial mapping, the algorithm discovers the better mapping shown in Table 5. The Table 5 mapping has higher fitness than the Table 4 mapping because, for the queries (Depression Symptoms and Depression Symptom Diagnose), the proxy and cover queries are (Chicken Recipe and Chicken Roast Recipe) and (Motor Oil and Motor Engine Oil), which are as strongly related to each other as the true queries are. Given these queries, it is difficult for an IR system to determine the true queries, as the queries in every sequence are strongly similar to one another.

Algorithm 1 Greedy Search Hill Climbing Heuristic
Data: Q, P, T, M
i ← 0;
while i ≠ TerminationCriteria do
    randomly select k related proxy-queries from Q;
    retrieve cover queries of k proxy-queries from P and store in C;
    randomly select a cover query c from C;
    retrieve related queries (Ĉ) of c;
    if (Ĉ ∈ C) == false then
        oldFitness ← fitness(Q) using Equation 1;
        select k queries from Ĉ and switch them with C;
        if fitness(Q) < oldFitness then
            switch back Ĉ queries with C;
        end
    end
    i ← i + 1;
end

Finally, the proposed PWS facility can be deployed to users through IR system-based, middleware-based, or client machine-based architectures. With the IR system-based architecture (Figure 3), an IR system directly provides the proposed PWS facility. It generates the proxy-query mapping and provides the mapping to all users. The client machines issue proxy queries, and the IR system returns the rank lists of cover queries to the client machines. The middleware-based architecture is suitable when the IR system does not directly provide the proposed PWS facility (see Figure 4). The middleware generates the proxy-query mapping and distributes it to client machines. The middleware receives proxy queries from the client machines, generates cover queries from the proxy queries, and issues them to the IR system. After the results are available from the IR system, the middleware returns the rank lists of cover queries to the client machines. The client machine-based architecture works similarly to [17]. The client machine generates a proxy-query dictionary and issues several cover queries to the IR system to obfuscate true queries (see Figure 5). For more details on the three architectures, we refer readers to [25].

V. EXPERIMENTS
This section evaluates the effectiveness of the proposed and related PWS schemes. We defined the characteristics of private web search and evaluated the PWS schemes against them. We analyzed eight modern PWS schemes. Tor is a proxy-based PWS scheme. TrackMeNot, GooPIR, and Qu-OB-PWS are query obfuscation-based PWS schemes. P2P is a collaborative peer-to-peer PWS scheme. QS is a query scrambling-based PWS scheme. ProxyTermPWS is a proxy-term-based query obfuscation scheme, and ProxyQueryPWS is the proposed scheme. We outlined eight characteristics of PWS. The characteristics range from retrieval effectiveness and the performance of the PWS scheme to hiding the identities of PWS users. We believe a modern PWS scheme should have all PWS characteristics to achieve optimum WSP; otherwise, it is vulnerable to WSP attacks. The characteristics provide a basis for an evaluation framework that should be considered when developing a new PWS scheme or evaluating existing PWS schemes. The analysis revealed that only the proposed PWS scheme and ProxyTermPWS achieve all of the desired characteristics. Existing PWS schemes were shown to be vulnerable to WSP attacks since they did not meet all characteristics.
To analyze the effectiveness of PWS schemes, we need a set of test queries containing sensitive information. We obtained test queries by manually creating a set of topics containing sensitive and non-sensitive information. We created a set of 180 topics, with 30 topics containing sensitive information and 150 topics containing non-sensitive information. Table 6 provides the titles of a few topics we used for the experiments. After defining the topics, we used all terms of the topics as seed queries and retrieved the relevant documents of the topics using focused crawling [50].
We downloaded the top 500 documents of each topic and collected the top W_t terms of each topic, that is, the terms with the highest tfidf scores in the downloaded documents. We exhaustively combined the W_t terms of each topic into two-term and three-term test queries and obtained a large collection of queries. Because it is virtually impossible to measure effectiveness using all possible combinations of queries, we analyzed effectiveness using just the top Q_t queries of each topic with high query quality scores. We estimated the quality of queries using MaxVAR (maximum term weight variability) [46], [47]. Table 7 shows a sample list of queries for the topics Depression Treatment, Chicken Roast Recipe and Motor Oil.
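The query generation step can be sketched as follows. The combination step follows the description above; the quality score is a simplified reading of MaxVAR, assumed here to be the largest standard deviation of a query term's weights across the documents that contain it. The function names, term weights, and terms are hypothetical.

```python
from itertools import combinations
from statistics import pstdev

def candidate_queries(top_terms):
    """All two-term and three-term combinations of a topic's top terms."""
    return [" ".join(c) for r in (2, 3) for c in combinations(top_terms, r)]

def max_var(query, term_weights):
    """Simplified MaxVAR-style score: maximum per-term weight variability."""
    return max(pstdev(term_weights[term]) for term in query.split())

def top_queries(top_terms, term_weights, q_t):
    """Keep the q_t candidate queries with the highest quality score."""
    ranked = sorted(
        candidate_queries(top_terms),
        key=lambda q: max_var(q, term_weights),
        reverse=True,
    )
    return ranked[:q_t]

# Hypothetical term weights in the documents that contain each term.
term_weights = {
    "depression": [0.9, 0.1],
    "treatment": [0.5, 0.5],
    "therapy": [0.4, 0.6],
}
terms = list(term_weights)
print(len(candidate_queries(terms + ["symptoms"])))  # C(4,2) + C(4,3) = 10
print(top_queries(terms, term_weights, q_t=3))
```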
After obtaining topic queries, the system generated the proxy-query mappings using the approach described in Section IV-A. The proxy-query mapping was generated with a proxy group size of 6 and at least one query from a topic containing sensitive information. Figure 6 shows the fitness of the proxy-query mapping over 250k iterations. The findings reveal that the initial mapping produces low fitness and that the heuristic increases fitness quickly during the first few iterations. However, when the mapping gets close to optimal, the fitness improves slowly across many iterations. The experiments were carried out on an Intel Core i7 7th Gen CPU with a clock speed of 2.11 GHz and 16 GB of primary memory. A single iteration takes four seconds on average to compute the fitness of the current mapping. The experiments took around 167 hours to discover an optimal proxy-query mapping. Table 8 shows an initial mapping (before training) for the topics Depression Treatment, Sunglasses, Motor Oil and Electric Vehicle. The initial mapping provides a poor result for a sequence of related queries. For example, for the queries (Depression Symptoms, Depression Symptom Diagnose and Depression Treatment), the system generates the proxy queries (High Energy Light, Classical Shades, Surgical Procedures) for the proxy topic Sunglasses and the cover queries (Motor Oil, Combustion Engines, Society Automotive Engineers) for the cover topic Motor Oil. The queries within these proxy and cover sequences have low similarity to one another.

FIGURE 7. Initial proxy-query mapping using heat-map for topics Depression Treatment and Sunglasses, Motor Oil and Electric Vehicle.

FIGURE 8. Optimal proxy-query mapping after applying heuristic using heat-map for topics Depression Treatment and Sunglasses, Motor Oil and Electric Vehicle.

Figures 7 and 8 show the effectiveness of the proxy-query mapping heuristic on a pair of proxy and cover topics using heat maps. We placed 80 randomly selected queries from the proxy topic on the x-axis. On the y-axis, we ranked the related queries of each proxy query. The heat map's cell colors indicate the difference between the rankings of related proxy queries and related cover queries. Dark colors close to blue indicate that the proxy dictionary properly maps the related query of a proxy query to the related query of the cover query. Light colors close to yellow indicate that the proxy dictionary does not successfully map the related query of the cover query to the related query of the proxy query. Figure 7 shows the proxy-query mappings before applying the heuristic for the topics Depression Treatment, Sunglasses, Motor Oil and Electric Vehicle. The heat maps have many cells with yellow or green colors, which reflects the low effectiveness of the proxy dictionary. Figure 8 shows the proxy-query mappings after applying the heuristic. The heat maps show many blue cells, reflecting the proxy dictionary's high effectiveness. Table 9, Table 10 and Table 11 show a sample of true, proxy and cover queries (for the crawled collection) generated by ProxyQueryPWS for the topics Depression Treatment, Email Hacking and Weapon Disassembling. As the tables show, the true queries of all topics are highly related.
ProxyQueryPWS also generates related proxy and cover queries because of the optimal proxy-query mapping available in the proxy dictionary. For example, for the sequence of related queries (Depression Symptoms, Depression Symptom Diagnose and Depression Treatment), ProxyQueryPWS automatically generates three cover queries (Motor Oil, Motor Oil Quality and Synthetic Motor Oil) (see Table 9), which are highly related to each other.

VI. CHARACTERISTICS OF PRIVATE WEB SEARCH SCHEMES
We outline PWS characteristics in this section by studying numerous elements of existing PWS schemes. Each characteristic has a benefit, and failing to accomplish one or more impacts performance, effectiveness, and WSP trust or leaves a PWS scheme vulnerable to WSP attacks. Table 12 summarizes the advantages of PWS characteristics. We show how each characteristic is achieved via a certain PWS scheme. Table 21 provides a summary of evaluating PWS schemes on the PWS characteristics.

A. HIDE PWS USERS
Information retrieval through a PWS scheme should ensure that a WSE cannot identify which users are issuing queries through a PWS scheme. This is important because a WSP attack must first identify PWS users before it can detect their true queries. Once PWS users are identified, the attacker can launch further attacks to determine the true queries. This characteristic analyzes how difficult it is to identify PWS users when a given PWS scheme is used.
A simple method that a WSE can use to identify PWS users is to count the queries each user issues. If a user submits many queries pertaining to different topics within a short time interval, this is a clear sign that the user is retrieving information through a PWS scheme. This attack is effective against all query obfuscation-based PWS schemes but gives poor accuracy against PWS schemes that do not issue cover queries. To the best of our knowledge, there is currently no method for classifying PWS users based on single queries. We therefore evaluated PWS schemes on this characteristic using the number of cover queries they issue. If a PWS scheme issues several cover queries to the WSE in response to a user's query, the PWS scheme lacks this characteristic. All query obfuscation-based PWS schemes fail to achieve this characteristic because they issue several cover queries. Other schemes only issue true queries and thus achieve this characteristic. The proposed scheme and ProxyTermPWS also achieve this characteristic, as both schemes generate cover queries virtually through the middleware or IR system. Table 13 shows an evaluation summary of PWS schemes on this characteristic.
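The counting attack described above can be illustrated with a minimal sketch. It is not taken from any evaluated scheme: the function name `flag_pws_users`, the 60-second window, the threshold of three topics, and the assumption that each logged query already carries a topic label are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def flag_pws_users(log, window_seconds=60, max_topics=3):
    """Return the set of user ids whose queries span more than `max_topics`
    distinct topics within any window of `window_seconds`.

    `log` is an iterable of (user_id, timestamp_seconds, topic) tuples.
    The topic label per query, the window length, and the threshold are
    illustrative assumptions, not values from the paper.
    """
    by_user = defaultdict(list)
    for user_id, ts, topic in log:
        by_user[user_id].append((ts, topic))

    flagged = set()
    for user_id, events in by_user.items():
        events.sort()  # order each user's queries by time
        start = 0
        for end in range(len(events)):
            # Advance the window start until the window fits window_seconds.
            while events[end][0] - events[start][0] > window_seconds:
                start += 1
            topics = {topic for _, topic in events[start:end + 1]}
            if len(topics) > max_topics:
                flagged.add(user_id)
                break
    return flagged


example_log = [
    ("u1", 0, "health"), ("u1", 5, "cars"), ("u1", 10, "sports"), ("u1", 15, "travel"),
    ("u2", 0, "health"), ("u2", 30, "health"), ("u2", 500, "cars"),
]
suspected = flag_pws_users(example_log)
```

In this toy log, user `u1` issues queries on four topics within 15 seconds and is flagged, while `u2` is not. A scheme that sends only one query per user request gives such a detector nothing to count.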

B. RETRIEVAL EFFECTIVENESS
This characteristic states that if a user obtains information from a WSE using a PWS scheme, the PWS scheme should deliver the same retrieval effectiveness that the user receives without a PWS scheme. This characteristic aims to achieve web search privacy while maintaining retrieval effectiveness.
To analyze this characteristic, we must determine whether a PWS scheme transforms the keywords of true queries before retrieving information from an IR system. If it does, retrieval effectiveness suffers, since the transformed keywords retrieve only part of the relevant results. The query scrambling scheme replaces the original keywords with scrambled keywords, and the scrambled queries return only a subset of the relevant results. The other PWS schemes do not transform keywords, so users get the same results as they would without the PWS scheme, and these schemes achieve this characteristic. Table 14 shows an evaluation summary of PWS schemes on this characteristic.

C. WEB SEARCH PRIVACY TRUST OF A PWS SCHEME
People trust a PWS scheme because of its web search privacy approach, not because of its claims or privacy disclaimer declarations. Information retrieval through a PWS scheme should be designed so that users have strong confidence that no component of the PWS scheme can log their queries. When a PWS scheme has a component that requires sending true queries to other machines, this characteristic is weakened. If such a component is included in a PWS scheme, the WSP threat moves from the WSE to the PWS scheme, since the PWS scheme can log user queries.
This characteristic may be assessed by determining whether a PWS scheme has a component that involves delivering true queries to machines other than the WSE. P2P and Tor do not fulfill this characteristic since they send true queries to external machines. Other PWS schemes achieve this characteristic. For example, query obfuscation-based PWS schemes construct cover queries locally and then issue true and cover queries to the WSE directly. The query scrambling-based scheme creates scrambled queries locally and issues them directly to the WSE. The proposed approach and ProxyTermPWS also achieve this characteristic, as both schemes retrieve information through proxy queries. Table 15 shows an evaluation summary of PWS schemes on this characteristic.

D. NON-NOISY COVER QUERIES
The PWS schemes most recently explored in the web search domain are query obfuscation-based private web search (OB-PWS) schemes [17], [44], [45], [51]. OB-PWS schemes achieve WSP by inserting noise into the user profiles maintained by a WSE. For each search query, they generate dummy (cover) queries related to different topics. These cover queries, along with the true query, are issued to the WSE to hide the true search intent of the user. As a result, the WSE cannot determine whether a user is attempting to receive information from a true query or a cover query.
For this characteristic, we must determine how much noise exists in the cover queries generated by a PWS scheme. Noisy queries are low-performance queries that real web users do not generate. Query frequency in the query log or query quality predictors can be used to classify noisy queries [46], [47]. There is no need to measure this characteristic for P2P, Tor, and query scrambling-based PWS schemes, as these schemes do not generate cover queries. For the PWS schemes that rely on cover queries, the objective of this characteristic is that a PWS scheme should generate non-noisy cover queries similar to those that real web users normally generate. This is important because if a PWS scheme generates noisy cover queries, a WSE can easily identify true queries in a query log by separating the non-noisy queries from the noisy cover queries.
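As a rough illustration of the query-frequency criterion mentioned above, the following sketch splits candidate cover queries into plausible and noisy ones according to how often they occur in a reference query log. The function names, the normalization, and the `min_count` threshold are hypothetical; the query quality predictor used in the experiments is a different technique and is not reproduced here.

```python
from collections import Counter


def normalize(query):
    """Lowercase a query and collapse whitespace so trivial variants match."""
    return " ".join(query.lower().split())


def split_noisy(candidates, query_log, min_count=2):
    """Split candidate cover queries into (plausible, noisy).

    A candidate counts as plausible when its normalized form occurs at least
    `min_count` times in `query_log`. The threshold is an illustrative
    assumption, not a value from the paper.
    """
    counts = Counter(normalize(q) for q in query_log)
    plausible, noisy = [], []
    for query in candidates:
        if counts[normalize(query)] >= min_count:
            plausible.append(query)
        else:
            noisy.append(query)
    return plausible, noisy


reference_log = [
    "motor oil", "Motor  Oil", "synthetic motor oil",
    "synthetic motor oil", "oil quartz lamp",
]
cover_candidates = ["Motor Oil", "oil quartz lamp", "zebra carburetor poem"]
plausible, noisy = split_noisy(cover_candidates, reference_log)
```

A WSE running a filter of this kind would discard the rare or unseen queries as likely cover queries, which is why a scheme that draws cover queries from queries real users issue is harder to filter.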
We randomly generated 200 queries from topics containing sensitive information to assess the effectiveness of query obfuscation-based PWS schemes. A query quality predictor was employed to assess the effectiveness. Query obfuscation-based PWS schemes achieve this characteristic by generating cover queries from a public open dictionary or from hierarchically organized language models using a topic ontology [17]. Table 16 reports the scores of the query quality predictor. According to the results, all query obfuscation-based PWS schemes effectively achieve this characteristic.

E. WEB SEARCH PRIVACY FOR RELATED QUERIES
In a typical web search, users usually continue their current search task with multiple queries over one or more search sessions. As a result, if a PWS scheme does not consider the relatedness between consecutive queries while generating cover queries, a WSE may readily distinguish true queries from cover queries by separating related consecutive queries from unrelated consecutive queries [17]. The goal of this characteristic is to achieve WSP by taking into account the relatedness of the current query to previous queries.
We measure this characteristic using two factors. First, we must determine whether a PWS scheme can generate cover queries based on their similarity to previous cover queries. Second, we must analyze when related cover queries need to be generated and to what extent the current cover queries are related to previous cover queries. Qu-OB-PWS [17] achieves this characteristic by employing hierarchically organized language models based on the topic ontology to ensure semantic relatedness with the related past cover queries. P2P, Tor, and query scrambling-based PWS schemes achieve this characteristic because they do not generate cover queries. For these schemes, a web search privacy attack cannot single out true queries by filtering out unrelated cover queries. TrackMeNot [44] and GooPIR [45] do not apply any technique for considering relatedness with past cover queries and thus do not achieve this characteristic.
We randomly selected 100 queries from topics containing sensitive information to analyze PWS schemes on this characteristic. For each test query, we further obtained four related queries. This results in 400 test queries. The test queries were then divided into 80 sets. The queries in the sets are arranged in the following manner. In the first 20% of sets, we placed only queries that are unrelated to each other. In the next 20% of sets, we placed four unrelated queries in each set and one related query of the first query of each set. In the next 20% of sets, we placed three unrelated queries in each set and two related queries of the first query of each set. In the next 20% of sets, we placed two unrelated queries and three related queries of the first query of each set. The last 20% of sets contained queries that are all related to each other.
We analyzed the effectiveness using the cosine similarity measure [52]. Cosine similarity represents the inputs as vectors in a multi-dimensional space and measures the angle between them. A small angle between two vectors, that is, a cosine value close to +1, indicates that the vectors are similar; a value close to 0 indicates that they are unrelated, and a value close to −1 indicates that they point in opposite directions.
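For reference, the cosine similarity of two vectors u and v is cos(u, v) = (u · v) / (||u|| ||v||). The sketch below computes it for sparse term-weight vectors stored as dictionaries. The function name and the dictionary representation are illustrative choices rather than details of the experimental setup. With non-negative weights such as tf-idf scores, the result lies between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}.

    Returns 0.0 when either vector has zero length, which is a convention
    chosen here to avoid division by zero.
    """
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy term-weight vectors; the weights are made up for illustration.
related = cosine_similarity(
    {"motor": 0.8, "oil": 0.6},
    {"motor": 0.6, "oil": 0.8},
)
unrelated = cosine_similarity(
    {"motor": 0.8, "oil": 0.6},
    {"sunglasses": 1.0},
)
```

Here `related` is high because the two vectors share both terms, while `unrelated` is zero because the vectors share no terms.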
We measured the cosine similarity between queries using the top 100 terms of each query. We obtained the top 100 terms from the top 20 documents with high tf-idf scores. In the experiments, we expect high cosine similarity for the query sets that contain related queries and low cosine similarity for the sets that contain unrelated queries. If a PWS scheme produces such results, the scheme is highly effective at producing successive, related cover queries when the true queries of a session are related to each other. Table 17 shows the effectiveness of query obfuscation-based PWS schemes for this characteristic. The results show that Qu-OB-PWS, ProxyTermPWS, and the proposed approach achieve high effectiveness for generating consecutive cover queries. As GooPIR does not consider the relatedness of current queries with past queries, it shows low effectiveness for generating consecutive related cover queries.

F. WEB SEARCH PRIVACY ON MULTIPLE DEVICES
It is common for web users to continue their current search tasks across multiple devices (smartwatches, smartphones, tablets, laptops, desktop computers, etc.). If a PWS scheme does not synchronize multiple devices when generating consecutive cover queries, it cannot generate related cover queries when a sequence of true queries issued from different devices are related to each other. Given this limitation, a WSE can determine the true queries by retrieving the consecutive related queries issued from multiple devices. This characteristic states that an optimal PWS scheme should achieve web search privacy for the current query by considering its relatedness with past queries even if the queries are issued from different devices. As with the previous characteristic, P2P, Tor, and query scrambling-based PWS schemes achieve this characteristic because they do not generate cover queries.
The effectiveness of query obfuscation-based PWS schemes was analyzed on a random sample of 400 queries. To generate the sample, we randomly obtained 100 queries from Q. Then, we further retrieved four related queries from Q for each query. This results in 100 sets with five queries in each set. We used five machines to issue the queries of each set. Each machine constructed its own collection for generating cover queries. We used cosine similarity to measure effectiveness. We computed the cosine similarity between the queries of each query set using the top 100 terms. The top 100 terms were obtained from the top 20 documents with high tf-idf scores. Table 18 shows the results. A high cosine score shows that the PWS scheme issues related cover queries for the sequence of true queries. All query obfuscation-based PWS schemes showed poor effectiveness, as these schemes do not provide a mechanism for synchronizing multiple devices and each device constructed its own collection for generating cover queries. The proposed approach and ProxyTermPWS showed high effectiveness for this characteristic. Both schemes synchronized multiple devices using a common proxy dictionary obtained from the middleware or IR system.

G. STORAGE AND COMPUTATION
A PWS scheme should not download or index a large number of documents at client machines to generate cover queries. Similarly, a PWS scheme should not take a long processing time to generate cover queries. Because smart devices have limited storage and computing power, indexing a large collection and long processing times are not appropriate for them. This characteristic can be measured by analyzing whether a PWS scheme requires indexing a large collection of queries or documents to achieve web search privacy. A PWS scheme that requires indexing a large collection at client machines is not efficient for smart devices. Table 19 summarizes the storage and processing requirements of PWS schemes for preserving the WSP of queries for the 180 topics we used in the experiments. P2P and Tor are efficient in storage and computation, as these schemes do not index any collection to achieve WSP. TrackMeNot [44] and GooPIR [45] require only small collections to be indexed. For the 180 topics, we found that TrackMeNot requires 2.16MB to store 1000 × 180 random queries of all topics. GooPIR requires 0.29MB to store the term dictionary. In terms of computation, both schemes only need to generate k random cover queries from the indexed collections. Based on these observations, both schemes achieve this characteristic. Qu-OB-PWS [17] and query scrambling-based schemes do not achieve this characteristic, as these schemes require large collections to be indexed at users' machines. For the 180 topics with 500 documents per topic, we found that Qu-OB-PWS and the query scrambling-based scheme require around 5GB to index all documents for maintaining the ontology or generating scrambled queries from the collection.
In terms of computation requirement, both schemes require high processing at client machines to generate cover queries or scrambled queries from the indexed collection while ensuring that the generated cover queries achieve plausibility with the previously issued cover queries. The proposed approach is efficient, as it requires little space to index only the proxy dictionary and limited processing to transform true queries into proxy queries. From the experiments, we found that the proposed approach requires only 5.8MB to index a proxy-query dictionary of 180 topics with the top 2,000 queries of each topic. ProxyTermPWS is efficient in terms of storage: it requires only 0.3MB to store a term-proxy dictionary of 180 topics. However, ProxyTermPWS takes significant processing time to issue cover queries. For a proxy group of size six and a query of three terms, ProxyTermPWS needs to exhaustively generate and issue 6^3 = 216 cover queries to the IR system.

Note: [34] is a proxy-based PWS scheme. TrackMeNot [44], GooPIR [45], and Qu-OB-PWS [17] are query obfuscation-based PWS schemes. P2P [38] is a collaborative peer-to-peer PWS scheme, and QS [41] is a query scrambling-based PWS scheme.
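The exhaustive generation cost discussed in this subsection for a term-level proxy scheme can be made concrete with a small sketch. It assumes, purely for illustration, that each of the three query terms has its own group of six candidate terms and that every combination is issued as one query. The function name and the placeholder terms are hypothetical, and the actual grouping used by ProxyTermPWS is not described in this section.

```python
from itertools import product


def enumerate_term_combinations(term_groups):
    """Return every query formed by picking one term from each group.

    `term_groups` holds one list of candidate terms per query position.
    The number of results is the product of the group sizes.
    """
    return [" ".join(combo) for combo in product(*term_groups)]


# Three query positions, six placeholder candidate terms each.
groups = [[f"term{pos}_{idx}" for idx in range(6)] for pos in range(3)]
queries = enumerate_term_combinations(groups)
query_count = len(queries)  # 6 ** 3 combinations
```

The count grows exponentially with the number of query terms, whereas a query-level proxy dictionary needs one lookup per true query.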

H. WEB SEARCH WITHOUT TRUE QUERY
The objective of this characteristic is that users should be able to retrieve information from a WSE through a PWS scheme without sending true queries. If a PWS scheme achieves this characteristic, true queries are not visible to the WSE, and the WSE therefore cannot identify the true search intents of users.
The query scrambling-based PWS scheme generates scrambled queries to retrieve information without submitting true queries, and so it achieves this characteristic. P2P and Tor do not achieve this characteristic, as these schemes share true queries with proxy servers or peers, and these external machines can log the true queries. All query obfuscation-based PWS schemes issue true queries along with cover queries and thus do not achieve this characteristic. The proposed scheme and ProxyTermPWS retrieve information through proxy queries, thus achieving this characteristic. Table 20 shows an evaluation summary of PWS schemes on this characteristic.

VII. CONCLUSION
Web search privacy is an important area to consider. Web users must preserve their web search privacy when retrieving information from a search engine. In the first part of the article, we propose a PWS scheme that uses proxy-query-based query obfuscation to achieve web search privacy. Proxy-query-based query obfuscation is an emerging research domain in PWS. It provides an IR facility to retrieve information from search engines through proxy queries. One clear advantage of the proposed technique is that users do not issue true queries to search engines for retrieving information. The existing proxy-term-based query obfuscation approach generates a proxy-term mapping between individual terms of proxy and cover topics. This provides low effectiveness if the queries contain more than one term, because of the computational difficulty of achieving an optimal proxy-term mapping for every acceptable combination of query terms. To address this restriction, the proposed scheme generates the mapping using queries of topics rather than individual terms. This provides high effectiveness for single queries and for a sequence of related queries.
In the second part of the article, we define the characteristics of private web search that we believe need to be achieved when using an existing private web search scheme or developing a new one. To the best of our knowledge, the characteristics defined in the article provide a first practical evaluation platform for private web search. It allows users and researchers to evaluate existing or newly developed PWS schemes fairly. Using the PWS characteristics, we analyzed eight modern PWS schemes. We found that none of the existing PWS schemes has all the characteristics, which makes them vulnerable to web search privacy attacks and reduces web search privacy trust.