Geo-Social Top-k and Skyline Keyword Queries on Road Networks

The rapid growth of GPS-enabled mobile devices has popularized many location-based applications. Spatial keyword search which finds objects of interest by considering both spatial locations and textual descriptions has become very useful in these applications. The recent integration of social data with spatial keyword search opens a new service horizon for users. Few previous studies have proposed methods to combine spatial keyword queries with social data in Euclidean space. However, most real-world applications constrain the distance between query location and data objects by a road network, where distance between two points is defined by the shortest connecting path. This paper proposes geo-social top-k keyword queries and geo-social skyline keyword queries on road networks. Both queries enrich traditional spatial keyword query semantics by incorporating social relevance component. We formalize the proposed query types and appropriate indexing frameworks and algorithms to efficiently process them. The effectiveness and efficiency of the proposed approaches are evaluated using real datasets.


Introduction
Smartphones and social networks are significant innovations from the past decade, and the combination of these technologies has engendered geo-social network (GSN) applications, such as Facebook, Instagram, and Foursquare. Geo-tagged data (e.g., photos, videos, check-ins, and likes) allow these applications to provide many useful services to users based on social and location relevance. Furthermore, the easy availability of textual descriptions for desired facilities (e.g., restaurants, departmental stores, and travel destinations) has promoted many decision support systems and recommendation services.
Traditional top-k spatial keyword queries [1][2][3] rank facilities based on spatial proximity to the query location and textual relevance to query keywords. Many existing studies have proposed spatial keyword query systems in Euclidean space [3][4][5] and road networks [6,7]. However, query results can be improved by including social data, since users tend to consult other users and specially their friends, besides their own preferences, for recommendations on movies, restaurants, and places to visit. Therefore, this paper investigates geo-social keyword queries that not only exploit spatial and textual information, but also social information to offer many interesting services. Each data object has a set of fans who exhibit positive behavior towards it, and social relevance is obtained from the number of fans and the relationship between these fans and the query user.
Geo-social keyword queries can be used for a wide range of GSN applications and services, such as a tourist visiting Seoul and searching for a French restaurant. Geo-social keyword queries uses spatial, textual, and social information parameters to retrieve the query results. Spatial information relates to the distance between the user and the facility, textual relevance relates to how well the facility description matches query keywords, e.g., French restaurant; and social information relates to a tourist's friends and other users.
These query types are also important in various monitoring systems, such as disease monitoring and crime prevention (e.g., users searching about drugs and subsequently joining Facebook pages to discuss drug related activities). Law enforcement agencies can identify crime locations by monitoring commonly visited places for users of those pages. Similarly, Dengue fever patients can be connected using social networks, and the monitoring of frequently visited places could help medical teams identify key disease spreading locations. The queries can also be used for decision support systems such as determining the best location to open a new business or store.
Geo-social queries [8][9][10] have recently attracted significant research attention due to their real-world relevance. Sohail et al. [9] proposed a method to monitor socio-spatial top-k famous queries and skyline queries in Euclidean space. However, they did not consider the textual relevance to the query keyword. Wu et al. [10] investigated top-k social aware keyword queries in Euclidean space. However, users typically follow the road network to reach their desired location. Moreover, processing spatial queries on road networks is significantly more complicated than in Euclidean space because, it requires computing several shortest paths. Therefore, Euclidean space algorithms cannot be applied to road networks. Therefore, we propose geo-social top-k keyword (GSTK) queries to retrieve the k best data objects based on spatial, textual and social relevance. However, the results depend on the scoring function defined by the user and choosing a suitable scoring function may be challenging due to different attribute distributions or inadequate user knowledge. Hence, we also introduce geo-social skyline keyword (GSSK) queries, which do not require scoring. GSSK queries return every object for which there does not exist any other object that has a higher spatial, textual, and social score. In this study, we formalize the concept of processing these queries and provide methodology to process and rank objects considering spatial, textual, and social relevance. We provide a formal definition of GSTK and GSSK in Sections 4.1 and 5.1, respectively.
The main contributions of this study are summarized as follows: • We introduce geo-social top-k keyword (GSTK) queries on road networks that ranks data objects based on spatial, textual and social relevance. • We extend our work to propose geo-social skyline keyword (GSSK) queries that return data objects that are not dominated by any other data object. • We present an indexing technique to retrieve data objects. Furthermore, we present efficient algorithms that exploit the indexing technique to process these queries. • Finally, we conduct extensive experiments on real road network datasets to demonstrate the efficiency of the proposed techniques.

Related Work
In this section, we discuss previous studies related to our work. Section 2.1 briefly reviews top-k keyword queries. Section 2.2 discusses skyline queries. Finally, Section 2.3 presents a survey on geo-social queries.

Top-k Keyword Queries
Several approaches have been proposed to rank spatial data objects based on keyword relevance. Initially, Zhou et al. [11] proposed hybrid indexing methods that combine inverted indexes [12] for text processing and R*-tree [13] for spatial processing. Cong et al. [4] and Li et al. [5] introduced top-k spatial keyword queries where each object is ranked based on its combined textual and spatial relevance to query keywords and location. Both studies generated IR-trees by integrating spatial indexing and text indexing. In contrast to Zhou et al. [11] who applied text indexes to filter web documents and then used spatial indexes to process location, the IR-tree [4,5] combines indexes to prune the search space. Rocha et al. [14] proposed an S21 indexing structure to map each keyword to a block or aggregated R-tree for frequent terms.
Recent studies have investigated several spatial queries, such as nearest neighbor, reverse nearest neighbor, range, and various top-k queries for road networks [15][16][17][18][19]. Rocha et al. [7] considered top-k spatial keyword queries for road networks, and proposed an efficient indexing technique and an overlay network to group objects in regions with similar textual description, thereby enabling the computation of upper-bound scores for all objects in the region. Gao et al. [20] presented filter-and-refinement based algorithms to process reverse top-k boolean spatial keyword queries on road networks. Guo et al. [6] investigated the problem of continuous top-k spatial keyword queries on road networks and proposed two algorithms to monitor continuous top-k keyword queries incrementally that minimize the expansion of the network edges. Attique et al. [21] recently expanded the problem and proposed a safe region based approach to monitor moving top-k keyword queries in directed and dynamic road networks, where each network edge is directed and its traveling cost depends on the traffic conditions.

Skyline Queries
Borzsony et al. [22] studied skyline queries and proposed two approaches: block nested loop (BNL) and divide and conquer (D&C). Subsequently, Chomicki et al. proposed a sort filter skyline SFS [23] to reduce skyline evaluation cost by sorting the objects before applying BNL. The sorting technique improves performance because objects can only dominate the subsequent objects. Sharifzadeh et al. [24] introduced spatial skyline queries, which are useful for many location-based applications, decision-support systems and recommendation systems. Deng et al. [25] proposed an algorithm to process multi-source skyline queries in road networks [26], defined a keyword matched skyline query to obtain the set of objects whose textual description contain all query keywords. However, they did not consider a distance function to obtain skyline objects.
Regaldo et al. [27] and Shi et al. [28] recently investigated location-based textual skyline (LTS) and spatio-textual skyline (STS) queries, respectively. Both queries retrieve objects of interest based on Euclidean distance to query location and textual relevance to a set of query keywords. Two algorithms were proposed in [27], but main limitation of their work was that they only considered a single query location. Shi et al. [28] proposed three models to integrate textual relevance into the spatial skyline, with spatio-textual dominance (STD) being the most efficient, because it integrates spatial distance and textual relevance to effectively prune irrelevant objects. Both studies considered textual and Euclidean distance functions to find skyline objects. However, this study considers geo-social keyword skyline queries that return the dominant objects based on their aggregated score of social relevance to the query, textual similarity to the query keywords, and network distance to the query location through the shortest path. Therefore, their objectives and problem formulations are entirely different and cannot be applied to process GSSK queries for road networks.

Geo-Social Queries
Geo-Social query processing is an emerging field that has recently garnered considerable attention from the research community [29][30][31][32][33]. Huang et al. [34] proposed geo-social network services that organize users in social networks based on geographic features, retrieving the set of nearby users that share common interests. Ye et al. [35], designed a location recommendation system based on a user's social networks. Sarwat et al. [36] also proposed a location-aware recommendation system that associates a user's social network(s) and locations with ratings. Liu et al. [37] proposed a circle of friends query that returns a group of friends in the user's geo-social network who are geographically and socially close to each other (e.g., community services, friend gathering, and combined sports activities). Shim et al. recently studied the k-nearest l-close friends query, which reports the k nearest data objects to a query location based on l-hop friends in the social network. Zhao et al. [31] proposed a reverse top-k keyword query on road networks that finds potential customers for businesses based on spatial, textual, and social information of users.
Emrich et al. [8] introduced geo-social skyline queries that report the set of persons close to a given location, P and closely connected to user U. Sohail et al. [9] recently introduced top-k famous places and socio-spatial skyline queries that consider social and spatial relevance to the query. They provided three approaches: social first, spatial first, and hybrid. The main difference between the GSSK query proposed in the current paper, and the problem studied by Emrich et al. [8] and Sohail et al. [9] is that the previous studies did not consider keyword relevance and their approaches were applied for Euclidean space. Wu et al. [10] investigated a problem similar to the proposed GSTK query. They considered social-aware top-k spatial keyword queries, which retrieve objects based on social, spatial, and textual relevance to the query, extended the IR-tree approach, and integrated the social aspect to propose social network aware IR-trees (SNIR-trees). Their approach is also applicable to Euclidean space and relied on R-trees to determine the minimum distance from query location and objects. Algorithms designed for Euclidean space, are not suitable for processing GSTK queries in road networks because they are designed to reduce the number of data objects to be accessed, without considering the underlying spatial network. However, road network based methods should be optimized to minimize the number of network edges to be explored and the cost of computing the distance between query location and objects [25]. To the best of our knowledge, this is the first study to introduce GSTK and GSSK queries in road networks.

Preliminaries
Road Network: We represent a road network by an undirected graph G = (N, E, W), where N is the set of nodes, E is the set of edges, and W : E → R + is associated with each edge, i.e., a positive real number representing the network weight, such as the distance or travel time. Each edge is represented by a starting and ending node (n s , n e ) commonly referred to as boundary nodes n B ∈ {n s , n e }. Figure 1 presents a road network example with eleven nodes, n 1 to n 11 . The query point is represented by a triangle and data objects with their textual description are represented by rectangles. The number on each edge denotes the weight of an edge such as distance in kilometers or travel time. The distance function, dist(a 1 , a 2 ) indicates the shortest network distance from a 1 to a 2 . For example in Figure 1, the shortest distance from q to d 1 is dist(q, d 1 ) = 7 and the shortest path is q → n 8 → d 1 . The set of data objects in an edge (n s , n e ) and (n e , n s ) are the same, and the dist(n s , d i ) is equal to dist(n s , n e ) − dist(n e , d i ). Hence, dist(n e , d i ) can be easily obtained from dist(n s , d i ). Therefore, the starting node (n s ) is used to compute the distance to the objects.
Geo-Social Networks: GSNs consist of entities (such as users, groups, and places) and relationships between them. The relationship between two entities u and v can have several types, such as friends-with, lives-in, born-in and works-at. Consider Bob, a Facebook user born in France, who lives in Seoul, works at Sejong University, and is friends with another French user, Alice. Facebook stores this information by connecting Bob and Sejong University with works-at, Bob and Alice with friends-with and Bob and France with born-in relationships. Figure 2 illustrates the resulting social relationship diagram.

Points of interest:
In this study, points of interest (POIs) are all data objects d ∈ D such as restaurants, hotels, theme parks, and museums. Each data object lies on the edge E of the road network G. Each data object has a spatial location d.l, textual description d.t, and set of fans F d , where a fan is a user u ∈ U who exhibits positive behavior towards object d (e.g., check-in, like, share, etc.). Table 1 describes the notations used in this study.

Notation Definition ψ(d)
Score of data object d µ(q.t, d.t) Textual relevance of data object d with query keywords τ(q.s, d.s) Social relevance of data object d with query user λ(q.l, d.l) Spatial relevance of data object d with query location (α, β, γ) Preference parameters that represent the importance of textual, social and spatial relevance, respectively δ Preference parameter that controls the importance between query user friends and other users F d Set of fans of data object d N q One-hop neighbors of q in social network Ω t Highest significance of a given term t among the description of the data objects lying on edge e id Ω f Highest significance of fans of any data object lying on edge e id with term t σ(d) Aggregated social and textual score used in GSSK queries

Problem Formulation
A geo-social top-k keyword (GSTK) query in road network G is defined as Q G = (q.l, q.t, q.s, k), where q.l is the query location, q.t are query keywords, q.s is the query user's social network, and k is the number of desired data objects. Given a set of data objects D on G; Q G returns k data objects from D in descending order of score ψ(d), which is defined as: is the textual similarity between q.t and d.t, τ(q.s, d.s) is the social relevance of d with respect to q, and λ(q.l, d.l) is the network distance between q.l and d.l. We also defined preference parameters (α, β, γ) ∈ [0, 1] to represent the importance of textual, social and spatial relevance, respectively, with 0 representing the lowest and 1 the highest preference. Textual relevance (µ) can be measured using any information retrieval model. This study used cosine similarity to compute the relevance between q.t and d.t, which is defined as: where the significance Ω t(n) = w t(n) ∑ t∈n (w t(n) ) 2 is the normalized weight of the term t in the document by considering document length [38,39].
Social relevance (τ) can be expressed as: where F d represents the set of fans with positive attitude towards data object d ∈ D (i.e., visited, liked, recommended or shared) and N q represents the adjacent neighbors of user q, e.g., if the relationship is works-at and the query entity is the Facebook page "Sejong University," then N q is a set of all users working at Sejong University. Although any type of relationship can be supported by the presented work, the remainder of this paper only considers friendship relationships for simplicity. In this context, N q comprises only the set of query user's friends. The expression |F d ∩U| |U| represents the portion of users F d ∈ U who are fans of the data object d and |N q ∩F d | |N q | represents the portion of q's friends who are also fans of d. In real life, users commonly consider feedback from friends and other users in social media. Consider a tourist traveling to a foreign country. Typically the farther from home, the more difficult to obtain meaningful opinions from family and friends. Hence, the tourist must rely more on recommendations from other social network users. The proposed social relevance score considers these situations and aggregates the scores We provide a preference parameter δ to control the importance of one measure over the other, e.g., δ = 0 if only friends feedback is considered, δ > 0.5 increases the weight of other users feedback over friends feedback, and δ = 0.5 meaning equal weight to both sources.
Spatial relevance (λ) is defined as λ(q.l, d.l) = dist(q.l, d.l) which represents the shortest distance between data objects d and q. Thus, a data object closer to the query has a higher spatial relevance score.
Consider the road network example of Figure 1 and assume that we have a set of users U = {u 1 , u 2 , ..., u 100 }.
User u 1 is the query user q and issues a query with keywords "French restaurant," requesting one result (k = 1). User u 1 and has ten friends, N q = {u 4 , u 16 , u 23 , u 39 , u 48 , u 55 , u 67 , u 71 , u 80 , u 94 }. Equal preference is assigned to every attribute, i.e., α = β = γ = 1 and δ = 0.5. If we only consider spatial relevance, the top-1 result is the nearest restaurant d 3 . If we consider spatial and keyword relevance the top-1 result is d 6 . However, d 6 does not have a suitable social score. Therefore, d 2 is the top-1 result considering spatial, textual and social relevance. Notice that although d 2 is slightly far from d 6 (dist(q, d 2 ) > dist(q, d 6 )) and has less textual relevance than d 6 (µ(q.t, d 2 .t) < µ(q.t, d 6 .t)). Table 2 summarizes d 2 , d 3 , and d 6 scores. To simplify the presentation, we assume the textual score is the number of occurrences of the query keywords in the object description divided by the total number of keywords in the object description. For example the textual score of d 2 is 2 3 = 0.66, the social score is 0.5 × 20 100 + 0.5 × 5 10 = 0.35, and dist(q, d 2 ) = 7. The overall score of d 2 is computed as ψ(d 2 ) = 0.66×0.5 7 = 0.033. Similarly, ψ(d 3 ) = 0.02 and ψ(d 6 ) = 0.024. Consequently, the GSTK query returned d 2 as the top-1 result.  Figure 3 shows the proposed indexing framework structure. The spatial component is used to retrieve adjacent nodes for a given node to allow efficient traversing of the road network. The binding component binds the road network and keywords to the objects using edge id (e id ) and term id (t id ). The social component stores the social relationships of users. Finally, the inverted file component stores data objects along with their descriptions and set of fans. Spatial Component: This component integrates the road network and spatial information as proposed by Papadias et al. [40]. Each road segment is represented by an edge comprising a detailed polyline and is stored in the network's R-tree. The B-tree is used to retrieve the adjacent nodes of a given node n i from adjacency file, to ensure efficient traversing of the network from one node to the other. The adjacency file stores edge e id (i.e., (n i , n j )) along with its weight W (i.e., dist(n i , n j )). Data objects that lie on a particular edge can be easily retrieved using e id .

Indexing Framework
Social Component: The social component employs a B-tree to index each user u i ∈ U along with their social relationships N u i . The B-tree points to the block in the users file where the social relationships of user are stored. This component enables efficient retrieval of the query user's relationships to compute the social score.
Binding Component: The binding component uses a B-tree that binds a key composed of the pair e id and t id to the inverted lists that contain the data objects located on the edge with term t in the data object description. The binding component is also used to efficiently retrieve candidate objects that are relevant to the query based on spatial, textual and social relevance. To achieve this, for each edge the binding component stores the highest significance of a given term t (Ω t ) among the description of objects lying on the edge and the highest significance of fans (Ω f ) of any object lying on the edge with term t. The highest significance of a term t is an upper-bound textual relevance and the highest significance of fans is the upper-bound social relevance of any object on the edge with t in its description. The upper-bound score for edge e id is derived from (Ω t ), (Ω f ) and the minimum distance from query point to edge. The closest boundary node n B is considered to compute the minimum distance from q to e id (i.e., dist(q, n B )). Therefore, a term t's inverted list on edge e id is accessed only if the upper-bound score is greater than the score of the kth object found so far.
Consider the example road network presented in Figure 4. We calculate the upper-bound score as follows. Let the set of users be U = {u 1 , u 2 , ..., u 100 }, where user u 1 is the query user who issued a query with the keyword "cafe" and query parameters α = β = γ = 1 and δ = 0.5. The number of fans of d 1 , d 2 , d 3 and d 4 are 15, 30, 40 and 0, respectively. On edge (n 2 , n 3 ), To compute the highest significance of social relevance (Ω f ), we set the highest score for the portion of friends of q who are also fans of d i.e., Similarly, the upper-bound score for edge (n 2 , n 4 ) is 0 because no data object lie on the edge that contains terms relevant to the query keywords. Finally, the upper-bound score of edge (n 2 , n 5 ) is 0 because data object d 4 has no fans and therefore Inverted File Component: This component consists of vocabulary and inverted lists. The vocabulary stores the frequency of each term which assists in computing the textual score of the data objects. The inverted list stores the data objects lying on an edge e i with term t in their description. Thus an inverted list stores the distance between the data object and starting node n s (dist(n s , d i )) for edge e id , the significance of the term t i in the description of data object (Ω t i ,d i ), and the fans of data object (F d i ) for each data object d i . The social component and fans of data objects stored in the inverted lists are used to compute the aggregated social relevance score. F d i returns all fans of data object d i , and the friend list of query N q allows all friends in F d i to be identified. Inverted lists are identified using a (e id , t id ) key and a separate inverted lists are created for each term t in the object description. The inverted file for edge e i is the set of inverted lists containing objects lying on the edge, and there is an inverted file associated with every edge that contains at least one data object.
The proposed indexing framework has several features that significantly enhance GSTK query processing performance. First, the data objects located on edge e id are stored in inverted files and objects relevant to keyword queries can be easily retrieved using (e id , t id ). Second, the distance between the starting node and data object is also stored in an inverted file, hence data objects can be accessed directly. Third, the set of fans for each data object F D is stored in the inverted file, enabling faster computation of the social relevance score. Fourth, upper-bound scores are computed considering the combined textual, social and spatial relevance to the query. Fifth, the binding component uses the upper-bound score for each edge to prune edges that do not contain data objects that could be in the top-k results.

Methodology
The overview of the proposed query processing methodology is shown in Figure 5. A user initially generates a GSTK query (Step 1 in Figure 5). The query processing module receives a query that has information about user location, query keywords, user social networks, and the number of requested data objects (k) (Step 2 in Figure 5). The query processing module then starts searching the top-k data objects (Step 3 in Figure 5); it utilizes the indexing framework to process the query. First, it accesses the spatial component to start traversing the road network and searches the candidate data objects from the edge where user is located (Step 4 in Figure 5). Next, the binding component is accessed to compute the upper-bound score of edge by using the pre-computed Ω t and Ω f (Step 5 in Figure 5). If the upper-bound score is less than the current kth data object, then the edge is pruned, and the proposed system continues traversing the adjacent edges (Step 6 in Figure 5). If the upper-bound score is greater than the current kth data object, then the query processing module accesses the inverted file and social components to compute the score of each data object that lies on the edge (Step 7 in Figure 5). Next, the result set is updated by inserting the data objects with a score greater than the current kth data object (Step 8 in Figure 5). The query processing module continues road network expansion to retrieve more candidate data objects that can be in the top-k result set (Step 9 in Figure 5). Finally, the system is terminated when all edges are explored or there is no edge left with an upper-bound score greater than the kth data object, and the result set is returned to the user (Step 10 in Figure 5).

Query Processing Algorithm
Algorithm 1 presents the main steps for the proposed Geo-Social Top-k Keyword query processing algorithm, GSTK-A. The algorithm takes input query Q G and returns result set R k containing the best k data objects in descending order by score ψ(d). GSTK-A expands adjacent edges of query objects in increasing order of distance from q.l similar to Dijkstra's algorithm [41]. A min-heap H is implemented to arrange encountered edges which is initially empty. Each entry in H is represented as (p r , e id ) where p r corresponds to the reference point in edge e id i.e., the starting point of expansion. An edge that contains query object q is represented as active edge e active . Query object q becomes the reference point for an active edge and either adjacent node n s or n e becomes the reference point (or reference node) for other edges. Variable m k stores the score of the current kth data object in R k .
The algorithm is initiated by visiting the active edge where query object q lies. Then, the candsearch((e id , t id ), m k ) function finds candidate data objects R c lying on e active with ψ(d) > m k , and updates R k and m k using the data objects in R c . The traversal of adjacent edges continues until the min-heap H is exhausted or the shortest distance to any remaining data object produces an upper-bound score smaller or equal to m k . The upper-bound score of a node n is computed using dist(n, q), the maximum textual and social relevance (which can be 1). Therefore, if the upper-bound score ≤ m k , then even if there is an unexplored data object d with maximum textual and social relevance, its score cannot be higher than the kth data object d in R k because dist(d, q.l) ≥ dist(n, q.l) since the algorithm strictly expands and selects the reference node with minimum distance to the query location.
Algorithm 2 presents the candsearch((e id , t id ), m k ) procedure to identify candidate data objects. First, the algorithm accesses the social component to retrieve the friends of q and compute the social relevance score. Then, it computes the upper-bound score of the edges using the binding component with Ω t , Ω f , and the minimum distance from the query location to the edge. The inverted lists of term t are fetched only if their upper-bound scores are greater than m k . The scores of data objects are computed using the formula in Equation (1), and data objects with ψ(d) > m k are returned.
Next, we discuss the running example of the proposed algorithm presented in Figure 1. For simplicity, we assume the example settings as discussed in Section 4.1, i.e., we have a set of users U = {u 1 , u 2 , ..., u 100 }. User u 1 is the query user q who issued a query with keywords "French restaurant", requested one result (k = 1), and has a set of ten friends F q = {u 4 , u 16 , u 23 , u 39 , u 48 , u 55 , u 67 , u 71 , u 80 , u 94 }. Equal preference is assigned to every attribute, i.e., α = β = γ = 1 and δ = 0.5. Recall that the textual relevance is the number of occurrences of the query words in the object description divided by the total number of keywords in the object description. Table 2 presents the details and score computation of d 2 , d 3 and d 6 .

Algorithm 1: GSTK-A
1 Input: Geo-Social Top-k keyword query Q G = (q.l, q.t, q.s, k) 2 Output: Top-k data objects with highest score R k 3 max-heap R k ← ∅ /*current Top-k set 4 m k ← 0 /*k-th score in D k 5 min-heap H ← ∅ 6 visited ← ∅ 7 (e active ) ← (n iq , n jq ) /*edge where q is located 8 insert (q.l, e active ) in H 9 visited ← visited ∪ (q.l, e active ) 10 R c ← candsearch((e id , t id ), m k ) 11 update R k and m k with d ∈ R c 12 (p r , e id ) ← H.pop() 13 while H = ∅ and ( 1 1+γ.λ(q.l,d.l) < m k ) do 14 for each non-visited adjacent edge of (p r , e id ) do 15 visited ← visited ∪ (p r , e id ) 16 R c ← candsearch((e id , t id ), m k ) 17 update R k and m k with d ∈ R c 18 end 19 (p a , e id ) ← H.pop() 20 end 21 return R k Algorithm 2: Candidate searching method 1 Input: Query q, Edge ID: e id , Term ID: t id , score of k-th object m k 2 Output: candidate list R c 3 N q ← friends of q 4 if Ω t > 0 and Ω f > 0 then 5 compute upper.score(e id ) 6 /*upper-bound score derived from Ω t , Ω f and dist(q, n B ) 7 end 8 if upper.score(e id ) > m k then 9 For each data object in e id compute ψ(d) /*by using equation 1 if ψ(d) > m k then 10 R c ← R c ∪ d 11 end 12 end 13 return R c The algorithm starts network expansion from active edge (n 3 , n 8 ) with q as the reference point. Edges (q, n 3 ) and (q, n 8 ) are inserted in min-heap H. First, (q, n 3 ) and (q, n 8 ) are explored and no data object is found in both edges. Then, n 3 becomes the reference point and edges (n 3 , n 2 ), (n 3 , n 4 ), and (n 3 , n 7 ) are inserted in min-heap H. Next, the candsearch function searches the candidate data objects on edges (n 3 , n 2 ), (n 3 , n 4 ), and (n 3 , n 7 ) having scores better than m k . The upper-bound score of edge (n 3 , n 7 ) is 0.5×0.61 2 = 0.15 (Ω t = 0.5, Ω f = 0.5 × 23 100 + 0.5 × 1 = 0.61 and dist(q, n 3 ) = 2). Therefore, the inverted list of edge is accessed because the upper-bound score of edge (n 3 , n 7 ) is greater than the current m k score (m k = 0); moreover d 3 is retrieved with ψ(d 3 ) = 0.5×0.165 4 = 0.02. Data object d 3 is inserted in the R k set, and m k is set to 0.02. The upper-bound score of edges (n 3 , n 2 ) and (n 3 , n 4 ) is zero, since there is only one data object d 4 found on (n 3 , n 4 ) with a description ("cafe and bar") that does not match with the query keywords. The algorithm continues expanding the edges whose upper-bound score is greater than m k . Next, n 8 becomes the reference point and the edges (n 8 , n 1 ), and (n 8 , n 9 ) are explored, but neither edge contains objects relevant to the query keywords. Node n 2 becomes the next reference point and edges (n 2 , n 1 ), and (n 2 , n 6 ) are visited. There is no data object in (n 2 , n 1 ) but the inverted list of (n 2 , n 6 ) is accessed by candsearch function because the upper-bound score of (n 2 , n 6 ) is 0.66×0.6 4 = 0.09 which is greater than the current m k = 0.02. Therefore, any available candidate object could be the top-1 result. The inverted list of (n 2 , n 11 ) is accessed and d 2 is retrieved with ψ(d 2 ) = 0.66×0.35 7 = 0.033 which is greater than the current m k = 0.02. R k is updated with d 2 and set m k = 0.033. Next, n 4 becomes the reference point and edges (n 4 , n 5 ), and (n 4 , n 10 ) are explored. Edge (n 4 , n 5 ) does not contain any data object, but the inverted list of (n 4 , n 10 ) is accessed by candsearch function because upper-bound score of (n 4 , n 10 ) is 1×0.545 4 = 0.136 which greater than m k = 0.033. Data object d 6 is found with ψ(d 6 ) = 1×0.145 6 = 0.024, which is less than s k = 0.033. Hence R k and m k are not updated. The algorithm continues expanding the network until min-heap H is exhausted or the minimum network distance to any remaining data object produces an upper-bound score smaller or equal to m k . The upper-bound score of remaining edges is 0 because they do not contain any data objects relevant to the query keywords. Consequently, the algorithm terminates and reports d 2 as the top-1 result.

Index Maintenance
In general, the updates in the textual description and check-in information of data objects are more recurrent than the updates in the spatial information of data objects or road networks. Therefore, we first discuss index maintenance when the textual description of data object is modified. Without loss of generality, let us assume that given a data object d ∈ D with textual description d.t is to be modified with new textual description d.t + . The update procedure begins by finding the edge e id in which d is located using the network's R-tree. Then by using e id , the vector is obtained that contains all terms e id .t in that edge. Next, for terms t i that are in d.t but not in d.t + data object d is removed from the inverted list of t i . The significance of term t i in e id .t is recomputed using the inverted list of t i , when Ω t i ,e id = Ω t i ,d and d is deleted from the inverted list of t i . For terms t j that are in d.t + but not in d.t, d is inserted into the inverted list of t j . The significance of term t j in e id .t is recomputed using the inverted list of t j , when Ω t j ,d > Ω t i ,e id , and d is inserted into the inverted list of t j . For terms t k that are in both d.t and d.t + , the significance of t k (Ω t k ,d ) in the inverted list is updated to (Ω t + k ,d ). The significance of term t k in e id .t is also updated to a significance of (Ω t + k ,d ) when Ω t + k ,d > Ω t k ,d . Now we discuss index maintenance for an update in the check-in information of data object d. Consider a user u i checked-in to data object d i ∈ D. Similarly, the update procedure starts with retrieving edge e id where d i lies and then vector e id .t is accessed. Next, the inverted lists of all terms are accessed that are in d i .t using (e id , t id ). If u i is already a fan of d i i.e., u i ∈ F d i , no further action is required. If user u i / ∈ F d i , it is inserted in F d i . Next, the highest significance of fans Ω f on e id with terms Next, we discuss the updates in the social relationships which are relatively straight forward. Consider users u j and u k becomes friends, u j is then added in N u k and vice versa. Similarly, if they unfriend each other, u j is deleted from N u k and vice versa. If u j deletes his or her account, then he or she is removed from N u of all their friends and also removed from fan lists of all such data objects d where u j ∈ F d .
Finally, we discuss the updates in the spatial information of the data object which is a relatively infrequent operation. Consider a data object d added with description d added .t located on edge e id . As mentioned previously, initially edge e id is located and vector e id .t is accessed. Next for the term t i that is present in both d added .t and e id .t, the inverted list of t i is accessed and data object d added is inserted along with its dist(n s , d added ), Ω t i ,d added , and F d added . Next, the significance of term For term t j that are in d added .t but not in e id .t, a new inverted list (e id , t j ) is created, and a data object d added is inserted into it. Next, Ω t and Ω f are computed for term t j on edge e id . Assume data object d deleted with d deleted .t to be removed is located on e id . All inverted lists with d deleted .t are accessed and d deleted is removed from them. Next, the significance of term t ∈ d deleted .t in e id .t is updated if Ω t,d deleted > Ω t,e id , and Ω f on e id with term t is updated if Ω f deleted > Ω f where Ω f deleted indicates the Ω f of d deleted . Finally, updating spatial location of a data object d is handled as a deletion followed by an insertion.

Problem Formulation
Skyline queries are useful for extracting desired data objects from a multi-dimensional datasets. A data object is desired if it is not dominated by any other data object i.e., it is not worse than any other data object in all dimensions. Geo-Social Top-k keywords queries retrieve the k best data objects based on spatial, textual and social relevance to query q. GSTK uses a scoring function and results depends on the values of query parameter (α, β, γ) defined by the user. However, it is important that the user has adequate knowledge to select appropriate values for these parameters. It is somewhat challenging to define a suitable scoring function due to different attribute distributions or user inability to choose an appropriate values [42]. Therefore, to supplement GSTK queries, we extend our work to study Geo-Social Skyline Keyword (GSSK) queries. GSSK queries return every object d within range r (i.e., dist(q, d) < r) which is not dominated by any other object in terms of distance to the query location and aggregated score of social and keyword relevance. The range parameter is used in GSSK queries to accommodate the case where a user is not interested in distant places but wants to find all possible objects within the chosen range.
The aggregated social and keyword and keyword score σ(d) is defined as: and at least one of the following holds: σ(d ) > σ(d) and dist(q, d ) < dist(q, d). We denote the dominance relationship as d ≺ d which implies that data object d is dominated by data object d . Geo-Social Skyline Keyword Queries: Given a query q, a set of keywords q.t and range r, GSSK queries return all data objects that are not dominated by any other data object.
Mapping to Distance-Score Space: Each data object d is mapped to a point in the distance-score space denoted by M, defined by axes dist(q, d) and σ(d).
To describe the problem definition, consider we have a set of data objects d = {d 1 , d 2 , ..., d 10 } inside range r = 5, with dist(q, d) and σ(d), as presented in Table 3. Figure 6 illustrates the mapping of data objects presented in Table 3 to the distance-score space M. Horizontal and vertical axes represent distance dist(q, d) and score σ(d), respectively. Lower values are preferred for the distance, whereas higher values are preferred for the score. Figure 6 illustrates that d 2 , d 7 , and d 9 are dominant data objects, and all other data objects are dominated by them. For example, d 2 dominates d 4 because d 4 ). Similarly, d 7 dominates d 5 because d 7 has a superior distance and score. Therefore, d 2 , d 7 , and d 9 are skyline objects s i , belonging to skyline set SKY i.e., {d 2 , d 7 , d 9 } ∈ SKY.  6. Distance-score mapping.

Query Processing Algorithm
We use the same indexing framework as described in Section 4.2, except the upper-bound score of the edge is composed from Ω t and Ω f . The inverted list of a term t on edge e id is only accessed if the edge is not dominated by a data object s i ∈ SKY. An edge is dominated by s i if the upper-bound score of the edge is less than σ(s i ) and dist(q, n B ) is greater than or equal to dist(q, s i ). We then use Lemma 1 to prune non-skyline objects. In addition, the query processing methodology of GSSK query is similar to GSTK query as presented in Section 4.3.

Lemma 1.
If edge e id is dominated by s i ∈ SKY, then edge e id does not contain any skyline objects.
Proof. Data object s i represents the object in the skyline set, so if edge e id is dominated by s i then the upper-bound score of edge e id is less than the score of s i and dist(q, n B ) > dist(q, s i ). Thus, all objects that lie on the edge have scores lower than the upper-bound score, and hence they cannot be skyline objects.
We now present an algorithm GSSK-A for processing Geo-Social Skyline Keyword queries that returns the set of skyline objects SKY within range r. Algorithm 3 is similar to Algorithm 1; it traverses the road network in a similar fashion and exploration begins from the active edge e active where q is located. The same min-heap H is implemented to arrange the encountered edges and it is initially empty. For each unexplored adjacent edge whose minimum distance is less than range r (i.e., dist(q, n B ) ≤ r), we compute the upper-bound score based on Ω t and Ω f . The inverted list of a term t on edge e id is only accessed if the edge is not dominated by any s i ∈ SKY. Next, for each data object d that lies on edge e id , σ(d) is computed if it satisfies the range constraint r (i.e., dist(q, d) ≤ r). If data object d is not dominated by any skyline object s i ∈ SKY, it is added to skyline set SKY. Finally, the algorithm terminates when the heap is exhausted or there is no edge whose minimum distance is within range r.

Performance Evaluation
In this section, we evaluated the performance of our proposed algorithm through simulation experiments. Section 6.1 describes the experiment settings, and Sections 6.2 and 6.3 present the experimental results for GSTK and GSSK queries, respectively. Section 6.4 compares GSTK and GSSK query responses.

Experimental Settings
A real road network dataset [43] was used that comprised the main roads of North America, with 175,812 nodes and 179,178 edges. The real dataset of Gowalla [44] and a synthetic dataset were used in these experiments. The characteristics of these datasets are presented in Table 4. Gowalla was a geo-social networking website that was subsequently acquired by Facebook. It included 196,591 users, 950,327 friendships, 6,442,890 check-ins and 1,280,956 checked-in places (data objects). We randomly generated data objects on each edge, producing seven data objects on average for each edge. We extracted data object descriptions from Twitter messages [45], and one tweet per data object was assigned. A user who checked-in at the location was considered a fan. For each experiment, we randomly selected 100 query objects from users and report the average cost of 100 queries. Query keywords were random terms generated from the dataset vocabulary. Table 5 summarizes the experiment parameters. In each experiment, we varied a single parameter within the given range, holding all other parameters at the constant default value highlighted in bold text unless specified otherwise. To the best of our knowledge, the processing of GSTK and GSSK queries has not been studied before for road networks. Therefore, we compared our algorithms with social-textual index (SNIR-tree) proposed by Wu et al. [10] which was applicable for Euclidean space. For a fair comparison, we modified their work and designed a competitive technique (INE-SNIR) comprising the road network framework (INE) proposed by Papadia et al. [40] and social-textual index (SNIR-tree) [10]. The road network framework [40] enables locating a query point and traversing the network. The SNIR-tree [10] is based on the IR-tree [5] and is used to index and store user information and relationships; it also stores the social, textual and spatial information of data objects. The SNIR-tree index also facilitates finding the data objects inside the minimum bounding region (MBR) of edges socially and textually relevant to the query. The query processing of INE-SNIR works as follows. The MBR of edge MBR(e id ) is used to perform a query in the index to obtain the data objects inside MBR(e id ) that are socially and textually relevant to the query, and then utilizes the spatial framework to calculate the distance between the query location and data object.
We implemented all algorithms in Java, and experiments were conducted on a PC machine with a 3.60-GHz Intel Core i7 process and 16 GB RAM. Indexes were disk-resident and the page size of the network's R-Tree and SNIR-tree were 8KB. We evaluated the algorithms performance using the following measures: (1) runtime, which indicates the total query execution time and (2) I/O cost, which represents the number of disk page accessed for query processing. Figure 7 shows the performance of GSTK-A and INE-SNIR with respect to k, the number of requested data objects with the highest scores. Experimental results reveal that the runtime and I/O cost of both the algorithms increased with increasing k, which is unsurprising since more data objects are explored and processed when increasing k. However, GSTK-A significantly outperformed INE-SNIR due to the proposed indexing framework. First, fewer inverted lists were processed due to the upper-bound score and, second, score calculation was highly efficient because the inverted list stores dist(n s , d i ), Ω t i ,d i and F d i .  with an increasing number of keywords in the query. With fewer query keywords, fewer data objects are relevant to the query, hence fewer edges and data objects are explored and processed. In contrast, with more query keywords, more data objects are relevant and consequently more edges and data objects are explored and verified. Notice that runtime and I/O cost for INE-SNIR increased more rapidly than for GSTK-A because the INE-SNIR candidate searching process is quite expensive. First it requires searching an index to retrieve data objects inside the MBR of edge that are textually and socially relevant to the query, and then computes the network distance between the query and data objects. We randomly generated 200 K, 400 K, 600 K, 900 K, 1200 K, and 1500 K data objects on the synthetic dataset to produce various sized datasets to evaluate the scalability of GSTK-A and INE-SNIR algorithms, as depicted in Figure 9. Both algorithms exhibited relatively poor performance when the number of data objects was small or large. Performance degraded for small number of data objects mainly because data object density is low and the algorithms expand more edges to retrieve k data objects. On the other hand, when the number of data objects is large, data objects relevant to query keywords also increase, with consequential increases runtime and I/O cost. The algorithms performed best for 400-600 K data objects.  Figure 10 illustrates the impacts from query preference parameters α, β and γ on GSTK-A and INE-SNIR runtime. For each case, we varied the value of one parameter while maintaining the other two parameters at 0.5. Query preference parameters do not have significant impact on running time, although the performance of both approaches slightly improve when γ (spatial relevance) has a high preference, because fewer edges are required to be expanded; the performance slightly degrades when increasing α and β (textual and social relevance, respectively) due to the large number of data objects relevant to the query.

Experimental Results of Geo-Social Skyline Keyword Queries
In this section we analyzed the performance of the algorithms (GSSK-A) for geo-social skyline keyword queries. We implemented the SKY-SNIR algorithm which uses the same indexing structure composed of INE [40] and SNIR-tree [10]. The SKY-SNIR algorithm first uses an SNIR-tree index to compute the aggregated score of data objects within range r based on social and textual relevance, and then calculates the network distance between the query and data objects. Skyline objects are then identified as those objects that are not dominated by any other data object using the aggregated score and network distance. Figure 11 illustrates the range effects on GSSK-A and SKY-SNIR performance. Runtime and I/O cost increased with increasing r for both algorithms because the search space increased as r increasesd; consequently, the algorithms needed to expand more edges and process more data objects. However, GSSK-A scaled much better than SKY-SNIR because it only expanded edges that were not dominated by skyline data objects.  Figure 12 compares GSSK-A and SKY-SNIR performance with respect to the number of query keywords. These experiment results indicate similar trends as in Figure 8. The proposed algorithm GSSK-A not only outperformed SKY-SNIR but also scaled more effectively because SKY-SNIR is expensive for searching an index to retrieve data objects.  Figure 13 shows data object cardinality effects on GSSK-A and SKY-SNIR performance using the synthetic dataset. In contrast to the experimental results of the GSTK query (Figure 9), the results of the GSSK query reveals that the runtime and I/O cost gradually increased as the cardinality of data objects increased. This is primarily because GSTK queries retrieve the k best data objects, and the algorithms must expand more edges when the search space is less dense, which increases runtime and I/O cost. In contrast, GSSK queries retrieve all data objects in r that are not dominated by any other data object, hence more data objects must be explored and verified as cardinality increases.

Comparison of Results Returned by GSTK and GSSK Queries
This section compares and analyzes the results returned by GSTK and GSSK queries. The main advantage of the GSTK query is that user can control the number of data objects to be returned. However, the user must able to define the scoring function correctly, which can be challenging. GSSK queries solve that problem and do not need a scoring function, and also only provide data objects within the user specified range. However, the number of data objects in a result set cannot be controlled by the user. In the next experiment, we executed 100 queries for each setting, reporting the average number of data objects retrieved by GSTK, GSSK and the average number of common data objects that were retrieved by both queries. Figure 14a, illustrates the GSTK and GSSK result set outcomes with respect to k for r = 100. It is obvious that GSSK returned the same number of skyline data objects within range regardless of k, whereas GSTK returned exactly k data objects. Figure 14b, depicts the GSTK and GSSK result set outcomes with respect to r for k = 25. As with fixed r, GSTK always returned the same k number of data objects, whereas the number of data objects in the result set of GSSK increased as r increased. The experimental results demonstrate that the result set of GSTK and GSSK shared many data objects but also included data objects that the other query failed to return. Thus, the proposed GSSK and GSTK queries complemented each other.

Conclusions
This paper introduced geo-social top-k keyword (GSTK) queries for road networks for the first time, integrating social relevance into traditional spatial keyword search and returning the k best data objects based on spatial, textual, and social relevance to the query. We also extended the model to propose geo-social skyline keyword queries (GSSK) on road networks, which returns all data objects that are not dominated by any other data object. We developed an efficient indexing structure that effectively prunes the search space by retrieving data objects relevant to the query, and proposed efficient algorithms to process GSTK and GSSK queries in road networks. Experimental results demonstrates that our proposed approaches significantly outperforms INE-SNIR and SKY-INIR algorithms in terms of query runtime and I/O cost.
The proposed study retrieves data objects based on the single search keywords such as "French restaurant." In the future, we plan to extend our work to study geo-social top-k collective keyword queries, which will retrieve a group of k data objects based on the set of keywords, query location, and query social information. Geo-social top-k collective keyword query has many real-world applications as the user often needs to find a group of data objects such as "tourist attractions," "shopping malls," and "cafes." The processing of geo-social top-k collective keyword query is more challenging as it has to consider all sets of query keywords, data objects in the result set should be close to the query location, and should have minimum inter-object distance.